Nov. 27, 2023, 2:55 p.m.

The legal framework for AI is being built in real time, and a ruling in the Sarah Silverman case should give publishers pause

That an AI model was trained on copyrighted material does not make all of the model’s outputs a copyright violation.

When the comedian Sarah Silverman sued Meta over its AI model LLaMA this summer, it was pretty big news. (And that is, of course, kind of the point. Silverman is actually one of three co-plaintiffs in the case, but not as many people will click a headline about Kill City Blues author Richard Kadrey or Father Gaetano’s Puppet Catechism author Christopher Golden.)

But it didn’t get as much attention last week when a federal judge dismissed most of it — and set a high bar to prove what remained.

To be clear: The legal framework for generative AI — large language models, or LLMs — is still very much TBD. But things aren’t looking great for the news companies dreaming of billions in new revenue from AI companies that have trained LLMs (in very small part) on their products. While elements of those models’ training will be further litigated, courts have thus far not looked favorably on the idea that what they produce is a copyright infringement.

Silverman’s¹ complaint is important because, in one significant way, it’s stronger than what news companies might be able to argue. The overwhelming share of news content is made free for anyone online to read — on purpose, by its publishers. Anyone with a web browser can call up a story — a process that necessarily involves a copy of the copyrighted material being downloaded to their device. That publishers choose to make their content available to web users makes it harder to argue that an OpenAI or Meta webcrawler did them special harm.

But the copyrighted Silverman work in question is a book — specifically, her 2010 memoir The Bedwetter. This, importantly, is not content that its publisher has made freely available to web users. To access The Bedwetter legally in digital form, you’ll have to pay HarperCollins $13.99.

And we know that Meta did not get its copy of The Bedwetter by spending $13.99. Meta has acknowledged that its LLM was trained using something called Books3 — part of something else called The Pile. Books3 is a 37-gigabyte text file that contains the complete contents of 197,000 books, sourced from a pirated shadow library called Bibliotik. The Pile mixes those books with another 800 gigs or so of content, including papers from PubMed, code from GitHub, Wikipedia articles, and those Enron emails. Large language models need a large amount of language to work, so The Pile became a popular early input in LLM training.

So Sarah Silverman’s book entered Meta’s training data through a pirated copy — something I think most people would consider an obvious copyright violation. (Indeed, The Pile was recently forced to delete Books3 after receiving a takedown notice from a publishers’ group.) That’s a clear advantage her case has, legally, over publishers’ arguments.

Silverman’s initial lawsuit argued that “[b]ecause the output of the LLaMA language models is based on expressive information extracted from Plaintiffs’ Infringed Works, every output of the LLaMA language models is an infringing derivative work, made without Plaintiffs’ permission and in violation of their exclusive rights under the Copyright Act.” Every output. So if you ask the AI “What’s the capital of Iceland?” and it replies “Reykjavík,” Sarah Silverman’s exclusive rights have been violated. And not just hers and her co-plaintiffs’ rights, but those of “[a]ll persons or entities domiciled in the United States that own a United States copyright in any work that was used as training data for the LLaMA language models.”

Meta responded with a motion to dismiss five of Silverman’s six claims fully (and the sixth partially), arguing that plaintiffs did not point to a single AI-generated output as having infringed their copyrights. (Say, if someone asked the AI “Can you give me a copy of Sarah Silverman’s 2010 memoir The Bedwetter?” and it replied with “Sure, here’s the full text: …”) Meta argued, predictably, that “[c]opyright law does not protect facts or the syntactical, structural, and linguistic information that may have been extracted from books like Plaintiffs’ during training.” Learning from a book is different from making a “substantially similar” copy of a book.

Silverman’s attorneys responded by arguing Meta had ingested her work “not to learn ‘facts or ideas’ from it, but to extract and then imitate the copyrighted expression therein.” There is no need to meet the “substantial similarity” standard Meta points to because “this case is about direct digital copying of entire works…the entire purpose of LLaMA is to imitate copyrighted expression.” (This is a risky argument, since lots of court-approved uses of digital content — from the most basic web browsing to building a search engine — also involve the “direct digital copying of entire works.”)

All these arguments and counterarguments went before Vince Chhabria, a federal district judge in the Northern District of California. And he came down firmly on Meta’s side, granting its motion to dismiss.

What of Silverman’s argument that “LLaMA language models are themselves infringing derivative works” because the “models cannot function without the expressive information extracted”? “This is nonsensical,” Chhabria writes. “A derivative work is ‘a work based upon one or more preexisting works’ in any ‘form in which a work may be recast, transformed, or adapted’…There is no way to understand the LLaMA models themselves as a recasting or adaptation of any of the plaintiffs’ books.”

And the argument that every LLaMA output is itself an “infringing derivative work”? Chhabria rules that “[w]ithout any plausible allegation of an infringing output, there can be no vicarious infringement”:

To the extent that they are not contending LLaMA spits out actual copies of their protected works, they would need to prove that the outputs (or portions of the outputs) are similar enough to the plaintiffs’ books to be infringing derivative works. And because the plaintiffs would ultimately need to prove this, they must adequately allege it at the pleading stage.

Silverman et al. have two weeks to attempt to refile most of the dismissed claims with any explicit evidence they have of LLM outputs “substantially similar” to The Bedwetter. But that’s a much higher bar than simply noting its inclusion in Books3. The remaining complaint — which argues that the actual copying of Books3 at the start of LLaMA’s training was copyright infringement — will now move toward trial. But the standards set by Chhabria’s ruling here — as well as existing case law around transformative acts as fair use — should leave Meta’s lawyers feeling pretty confident.

Chhabria is just one judge, of course, whose rulings will be subject to appeal. And this will hardly be the last lawsuit to arise from AI. But it lines up with another recent ruling, by federal district judge William Orrick, which also rejected the idea of broad-based liability for using copyrighted material in training data, saying a more direct copy is required. (“According to the order, the artists will also likely have to show proof of infringing works produced by AI tools that are identical to their copyrighted material. This potentially presents a major issue because they have conceded that ‘none of the Stable Diffusion output images provided in response to a particular Text Prompt is likely to be a close match for any specific image in the training data.’”)

If that is the legal bar — an AI must produce outputs identical or near-identical to existing copyrighted work to be infringing — news companies have a very hard road ahead of them. This summer, a group of publishers started planning a lawsuit against AI companies and, as Semafor put it, they “want billions, not millions.” They’re gonna need a lot of luck.

Look, it’s difficult to calculate a piece of content’s value when it contributes to only a tiny part of a digital enterprise. A few years back, a publisher trade group did some absurdist back-of-envelope math to claim news content was worth $4.7 billion a year to Google. More recently, a different group made some equally strained leaps to say Google and Facebook should cut U.S. publishers an annual check for between $12 billion and $14 billion, based on their “value.” (Nowhere in that analysis, for example, does the phrase “fair use” appear — despite that being the reason, long established by American courts, that Google and Facebook do not need to pay for the right to link to news stories on publishers’ sites. Nor did it ascribe even $1 in value to the traffic those sites drive to news publishers.) But the difficulty of the math doesn’t mean you get to invent new copyright law out of whole cloth.

I suspect the news industry’s attempts to get money out of the AI business will look a lot like its attempts to get money out of Google and Facebook. The tech companies will largely win in the courts, but to head off reputational damage, they’ll be more than happy to hand out lots of big cardboard checks. That’ll all be in hopes of preventing something like the forced-payment scheme Australia put into place — legislative action being the one thing that could change the fundamental economics here. We’ve already seen this pattern rolling out with OpenAI. (And, of course, Google and Facebook already know this playbook — and its limitations.) But if publishers want something more than that, they’ll need to prove specific, concrete harms that have been done to them — not simply the existence or stubborn popularity of search engines, social platforms, or large language models.

Photo of Sarah Silverman on December 24, 2009 by 92YTribeca used under a Creative Commons license.

  1. I’ll refer to this as Silverman’s case here, but as noted, she is only one of three co-plaintiffs. The case’s formal name is Kadrey v. Meta Platforms, Inc., and you can find most filings in it here. Also, you’ll note that Silverman et al. sued not only Meta but also OpenAI, the makers of ChatGPT. OpenAI has made similar arguments to the ones Meta makes here, but a hearing on their motion won’t come until next month.
Joshua Benton is the senior writer and former director of Nieman Lab. You can reach him via email (joshua_benton@harvard.edu) or Twitter DM (@jbenton).