Nieman Foundation at Harvard
Jan. 7, 2025, 11:17 a.m.

Thousands of documentaries are fueling AI models built by Apple, Meta, and Nvidia

Subtitles for documentaries by Alex Gibney, Ava DuVernay and Ken Burns, and episodes of PBS’ Frontline and BBC’s Panorama, were used to train LLMs.

In November, The Atlantic published a groundbreaking investigation that showed more than 138,000 films and TV episodes had been used as training data for AI models. Until then it had been widely speculated, but not confirmed, that the output of Hollywood studios had been input into many of the large language models (LLMs) built by AI companies.

Alex Reisner’s reporting explained how a data set called OpenSubtitles had stripped the dialogue from these films and television shows. Those subtitles were then used as training data to build LLMs developed by Apple, Meta, Nvidia, Salesforce, Bloomberg, and Anthropic, the company behind the chatbot Claude. This training happened without permission from or compensation for copyright holders.

The Atlantic’s reporting sent shockwaves through the entertainment industry. In the days after the story broke, studio execs, showrunners, and directors said they were “outraged” by the news and accused AI companies of “organized crime.” Still, reactions to The Atlantic’s story have since largely zeroed in on the unauthorized use of Oscar-winning works, studio tentpole releases, sitcoms like “Seinfeld,” and film classics like The Godfather.

The OpenSubtitles data set, though, also includes the work of many broadcast journalists, investigative reporters and documentarians. Using a searchable database published by The Atlantic, I found hundreds of documentary films and over a thousand episodes of news programs and docuseries in the same training data.

The documentaries in OpenSubtitles include films by Ken Burns, Ava DuVernay, Michael Moore, Asif Kapadia, Werner Herzog, Jehane Noujaim, and Errol Morris. The films also include documentaries produced in-house by news broadcasters, including a dozen features produced and distributed by CNN Films. OpenSubtitles also includes flagship newsmagazines and investigative news programs, including episodes of “Frontline” from PBS, “60 Minutes” from CBS, Vice Media’s former HBO series, the short-lived Amazon Prime show “The New Yorker Presents,” and the long-running BBC program “Panorama.” The documentaries and news programs I identified date back to 1964 and run all the way up until 2018.

Like the films and shows in OpenSubtitles previously reported on, captions and subtitles for these documentaries and news programs were used to build models released by Apple, Meta, Nvidia, Salesforce, Anthropic, EleutherAI, Cerebras, and dozens of other AI companies.

While most AI developers that used OpenSubtitles to build LLMs are Silicon Valley companies, that list also includes Bloomberg, itself a news publisher, TV news broadcaster, and employer of journalists. Bloomberg used OpenSubtitles to develop its own large language model, BloombergGPT, which it documented in a preprint released after the model’s launch in 2023. Bloomberg declined my request for comment.

“It does feel invasive,” said Alex Gibney, the Oscar-winning documentary director. Over a dozen of Gibney’s features were included in the data set, including Enron: The Smartest Guys in the Room, Taxi to the Dark Side, and Going Clear: Scientology & the Prison of Belief. “I had thought about it as something that was coming, but it is intriguing and interesting and disturbing to me to find out that it is already way down the track.”

Gibney says all of his films in the data set have registered copyrights, either owned by his own production companies or others that supported his films. After I reached out, Gibney said he notified his lawyers and they are determining whether there are grounds for legal action.

“It does feel in some way, shape or form, that my intellectual property is being used to create other works,” said Gibney, though he acknowledged that U.S. courts have yet to set firm precedent on this copyright gray area and whether or not it constitutes fair use. “It raises larger questions that go way beyond my films, about how this material gets monetized and how the creators share or don’t share in the monetizing of it.”

Documentary filmmakers and news broadcasters now face a challenge to their copyright, not unlike the ongoing challenges to the copyright of their journalist peers in digital news and book publishing. Original, reported, structured, narrativized, fact-checked, informational text has once again proven to be a vital, though not valued, resource in building major AI models.

“This is horrifying,” said Nancy Kates, founder of the documentary production company Question Why Films, whose 2013 documentary Regarding Susan Sontag is in the OpenSubtitles data set. “I know larger entities, such as The New York Times, have been taking action. I didn’t realize this wild abuse of copyright included violating the rights of tiny producing entities like me, or my little company.”

Many filmmakers and journalists simply don’t have the resources to fight copyright court battles against big AI companies themselves. “I would be interested in joining a class action, not pursuing it on my own,” said Kates, who confirmed her documentary, which was released by HBO Films, is registered with the U.S. Copyright Office.

Already screenwriters have started putting pressure on studios and networks to take up the mantle, and defend their intellectual property in court. In December, the Writers Guild of America (WGA) published an open letter addressed to major Hollywood executives calling for them to sue AI developers. “Tech companies have looted the studios’ intellectual property — a vast reserve of works created by generations of union labor — to train their artificial intelligence systems,” wrote WGA East and West representatives in the joint letter, which cited The Atlantic’s reporting. “After this industry has spent decades fighting piracy, it cannot stand idly by while tech companies steal full libraries of content for their own financial gain.”

A pile of documentaries

Launched in 2005, OpenSubtitles.org is a website that houses subtitles from official DVD and home video releases, as well as streaming services. Users can download subtitles of films and TV shows they want to watch for free from the site. In recent years, this repository has taken on a secondary purpose, after AI developers began scraping the millions of subtitle files, in over 60 languages, to use as training data.

Most notably, OpenSubtitles files were folded into a popular training data set released in 2020, known in the AI industry simply as “The Pile.” Alongside OpenSubtitles, The Pile includes 21 other major data sets, including Wikipedia articles, U.S. patent and trademark filings, Reddit posts, European Parliament transcripts, and a collection of pirated ebooks called Books3. The Pile has regularly been cited in journal articles and dataset cards as a core piece of training data for building commercial LLMs.

Using The Atlantic’s custom-built search tool, I found a large swath of documentaries included in The Pile, through the OpenSubtitles data set. Among them are programs from the world’s largest news broadcasters and most prolific documentary filmmakers working today.

Eight of Ken Burns’ PBS documentary miniseries are included in the dataset, including episodes of three Emmy-winning series: “The Civil War,” “Baseball,” and “The National Parks: America’s Best Idea.” In a statement, a spokesperson for Ken Burns and Florentine Films, his production company, said they were not previously aware of this unlicensed usage and want to speak with the AI companies involved. “We understand that any system based on learning relies, to some degree, on materials that already exist,” said the spokesperson. “That said, care should always be taken to respect the intellectual property rights of the individuals whose work makes the pre-existing materials possible.”

Nineteen episodes of “Frontline,” the PBS investigative documentary series, were also used as training data. I found episodes dating back to 1986, including a seminal season three program called “Memory of the Camps.” The episode was directed in part by Alfred Hitchcock and aired footage shot by Allied troops entering Nazi concentration camps for the first time. Frontline producers had salvaged the lost footage from a museum archive.

Seventy-five episodes of the Emmy-winning series Vice, hosted by the company’s executive chairman Shane Smith, are also included. The episodes span from seasons one to five, when the show was still airing on HBO. A Vice spokesperson confirmed that Vice Media had not given AI companies permission to use the series for training purposes.

The data set also includes 111 episodes of “Panorama,” the BBC’s flagship news documentary series and the world’s longest-running newsmagazine show. Episodes include early seasons from 1964 all the way up to season 65, which featured a 2017 documentary on Donald Trump’s first-term deportation policies. Subtitles for David Attenborough’s narration of “Planet Earth” (2009) and “Planet Earth II” (2016), the BBC’s Emmy-winning nature documentary miniseries, were in the data set, as were ten Louis Theroux-hosted BBC Two specials, which included reporting on private security firms in Johannesburg, the Westboro Baptist Church, and elder care facilities in Arizona.

“Using BBC content without our permission to train AI models is against our terms of service and not in the public interest,” said a BBC spokesperson in a statement.

Last fall, the U.K. parliament proposed a copyright law reform that would allow AI companies to use protected works as training data, unless copyright holders opt out. The BBC has publicly opposed the law, alongside a coalition of news publishers, including the Guardian, Financial Times, Telegraph, and Getty Images.

“In addition to steps we’ve taken to prevent web crawlers like those from OpenAI and Common Crawl from accessing BBC websites, we want to [find] a more structured and sustainable approach with technology companies for the use of data,” said the spokesperson.

Can you steal subtitles?

One thing that sets OpenSubtitles apart from the unauthorized use of digital news articles or published nonfiction books to train AI models is that text is not the original medium of these works. OpenSubtitles data, in a sense, strips dialogue from films and episodes, leaving the video and audio elements behind.

Gibney, for one, still sees a clear value in the written dialogue used by AI developers, whether that text takes the form of an official script, subtitles, or closed captioning. “Clearly a film is more than a script. A film is more than that, but it also is that,” he said. “I sit down at a computer and I write my narration. That’s my intellectual property, and I also collate the words of other people to figure out how to structure a story.”

Unlike some documentary filmmakers, Gibney narrates much of his own work. He added that it’s “creepy” to think his own speech has been fuel for models built by Apple, Meta, and Anthropic. But even without a director’s narration, OpenSubtitles data includes the words of broadcast news anchors, documentary subjects, talking heads, and dialogue pulled from licensed archival footage. “Without opining on the legality of it all, to me, there is value in this material,” he said.

The U.S. Copyright Office, and a body of existing copyright case law in the U.S., have regularly ruled that subtitles, closed captioning, and similar forms of transcription for commercial purposes are “derivative works.” This means that these subtitles would likely be covered by any existing copyright registrations. Even so, the path for litigation, and for documentary filmmakers and news broadcasters to receive compensation for this mass unauthorized usage of their work, is murky at best.

News publishers and book authors are currently pressing forward with copyright suits against AI developers, including Microsoft, Perplexity AI, and Stability AI. (Wired recently published a helpful visualization of all the ongoing AI copyright lawsuits in the U.S.) A string of class action lawsuits and potentially precedent-setting cases filed by The New York Times and The Intercept against OpenAI are in the discovery phase, for example. But a possible “fair use” defense from AI companies still hangs over these cases as they wind through the courts.

“Many lawyers believe that what tech companies are doing when they ‘scrape’ the internet and use it to create generative AI will be considered fair use,” said Rachel Antell, in a statement on behalf of the directors of the Archival Producers Alliance (APA). A coalition of over 300 archival producers, the APA has spent the last year raising alarm bells about the ethics of using generative AI in documentaries. Part of the APA’s work has involved surveying experts in AI-related fields, including copyright lawyers.

The APA cautions filmmakers that if their work is posted online there is “a good chance” it has already been used to train LLMs. Antell, for example, has worked as an archival producer on over 20 documentaries, including the Oscar-nominated film Crip Camp. I found at least one of her own projects included in the OpenSubtitles data set.

“Unfortunately, at this point, the only way to guarantee the protection of your intellectual property from scraping is to keep it, and its scripts, offline,” she said. While processes for opting out of training data sets are expanding, there are still no guarantees. “We don’t want to see people forced to such extremes,” Antell said of the decision to remove works from social media or video hosting sites. “Limiting access to documentaries in this way could be a great cultural loss [and] puts an undue burden on many artists, who rely on platforms like Instagram and YouTube to publicize their work.”

With the OpenSubtitles data set, documentary filmmaking and broadcast journalism are just the latest professions to reckon with mass scraping of their copyright-protected works by AI companies. For nearly two years now, news publishers and book authors have been wrestling with the same knotty questions of ownership, licensing, and fair use. Perhaps more than anything, the data set signals just how enmeshed the future of intellectual property is for these industries.

Photo by Happyphotons on Adobe Stock.

Andrew Deck is a generative AI staff writer at Nieman Lab. Have tips about how AI is being used in your newsroom? You can reach Andrew via email (andrew_deck@harvard.edu), Twitter (@decka227), or Signal (+1 203-841-6241).