What if all the interactives a news organization ever made could be stored somewhere, accessible in the same form forever, even as the technologies people might use to access them change?
That’s the dream, and it’s what a small team led by Katherine Boss, the librarian for journalism, media, culture and communication at New York University, and Meredith Broussard, assistant professor at the Arthur L. Carter Journalism Institute at NYU, is trying to get the news industry closer to.
It’s a question that many people in the libraries world and a smaller set of people in the news industry have been worrying about and working on for some time. Boss and Broussard’s team will be building software that can zip up the entirety of a news app, using ProPublica’s Dollars for Docs database (which tracks payments pharmaceutical and medical device companies made to doctors) as a test case. For the prototype, they’re adapting an open source tool called ReproZip, created initially to let researchers rerun scientific experiments without also having to recreate everything around them, such as the additional software installed or the operating system on which the original experiment ran. (The tests are currently funded through an Institute of Museum and Library Services grant. The team is now Boss, Broussard, and reproducibility librarian Victoria Steeves, and it’ll add a third programmer to join Fernando Chirigati and Rémi Rampin.)
“The software tool we’re trying to build is an emulation-based web archiving tool to archive the…well, internet. In particular, dynamic projects like news apps that can’t currently be fully archived by anyone,” Boss said. “This tool is the first step in that process. If we don’t have a way to capture and compress these things through emulation, we can’t begin to think about any other aspects of the process.”
As Boss and Broussard had explained to Nieman Lab when they first embarked on this project:
[T]here’s “migration,” and then there’s “emulation.” “Migration” is the traditional stuff we might associate with libraries: digitizing print materials, digitizing microfilms, moving VHS to DVDs and then DVDs to Blu-ray and then Blu-ray to streaming media. That process doesn’t make sense for digital “objects” like news apps that are dependent on many different types of software, and therefore have too many moving parts to migrate. What if, a hundred years out, we’re not even browsing the internet on computers, or at least not the computers we’re familiar with now?
As part of a Reynolds Journalism Institute fellowship this year, Broussard is also working on the online holding place that will allow people to access, through a web browser, these archived news apps just as they were first presented, without broken links or wonky graphics or dead-end interactions. Kind of like the way, say, someone might be able to go to a library and look at digitized versions of a notable writer’s collection of letters. Or the way the Internet Archive lets you play a 1986 Sega game online.
“Once we’ve packaged these apps, we need a place to put them. You can’t just package them up and put it anywhere on the internet, because as we know, stuff on the internet sometimes just disappears,” Broussard said. “A physical library has shelves you can put books on. A digital repository needs to have the digital equivalent.” Lots of similar repositories exist that hold other types of content; none yet specifically hold news apps. In their test case, the ReproZip-based software successfully preserved the backend of Dollars for Docs, but not the frontend, so the tool needed to be further adapted to account for that.
They’re aiming to launch the repository in the fall, starting with packaging and storing some of Broussard’s own recent (but already broken) news apps, like a 2016 campaign finance data project, or a 2014 database on textbooks in Philadelphia schools. Several other news organizations are now interested in archiving their news apps with them, Boss and Broussard said.
Few news organizations are consciously considering the problem of archiving news apps, let alone putting in place real archival strategies. The Tow Center is also researching this question. There’s the Save My News plugin, which lets users easily save their articles to a place like the Internet Archive. The New York Times put a serious team behind archiving all its story pages the way they originally looked when they were published. For the most part, though, news staff can’t spare the time or resources to archive projects systematically, and for posterity.
Before they started building the archiving software, Boss and Broussard surveyed developers and journalists on the tech used to make and store their news apps, to get a sense of which programming languages, for instance, to prioritize in building out their archiving software.
“We discovered, for example, that nobody is using Haskell or Julia. But people are using Python and R. They’re using JavaScript, and frameworks like Django and Flask,” Broussard said. “We don’t want to build a tool that works for a language that nobody’s using.”
They also asked people about their organizations’ archiving practices, or lack thereof. According to data that will be published in a forthcoming special issue of Digital Journalism (look for Volume 6, Issue 9, edited by Henrik Bodker):
“The issue is again that no organization has a way to compress these objects through emulation and send them to libraries. Libraries actually have the support and mandate to save dynamic digital objects like this, for 50, 100 years — we’re thinking way far into the future,” Boss said. “Our tool would make possible for newsrooms to package and send their stuff to libraries. That pipeline doesn’t exist right now, but it’s important to establish that.”