Why criminal courts are still a black box for data journalists

Testify’s groundbreaking investigations in Cleveland show the power of computational methods in courthouse reporting. Why, then, are its stories so hard to replicate?

By Andrew Deck Oct. 31, 2024, 12:31 p.m.

Testify, The Marshall Project’s investigation into Cuyahoga County criminal courts, started with a tip. Back in 2021, a data set pulled from the county’s records system landed on the desk of reporters at The Washington Post. The source had collected years’ worth of criminal dockets for Cleveland, Ohio.

“It was a little too local for them to work with, but they looked at it and said there’s something here,” said David Eads, the lead data editor for Testify. When the Post shared the database with Eads’ team, The Marshall Project was already laying the groundwork to launch its first local chapter in Cleveland, as well as a partnership with the nonprofit outlet Signal Cleveland.

To The Marshall Project, the tip was a proof of concept. Quickly, Eads realized they had a rare opportunity in criminal justice reporting — the chance to pull together a near-complete database of dockets from a single court. The leaked dataset had holes, but it showed Cuyahoga County’s records system was penetrable and could be scraped at scale. “We realized that we were going to need to acquire the data ourselves, to validate it, to understand it better,” said Eads.

Now, three years since Testify’s launch, The Marshall Project has collected tens of thousands of court dockets from Cuyahoga County that span from early 2016 to the present. The team has used the database to publish a series of investigations documenting systemic racial disparities in northern Ohio’s criminal justice system.

The first Testify investigation, published back in 2022, looked at the question of who was electing county judges in Cleveland. In 2020, nearly 30% of voters in Cuyahoga County left the judicial portion of their ballot blank. The Marshall Project pulled defendant addresses from its docket database and geocoded them to assign latitude and longitude values. Separately, they pulled county board of elections data to find out where the people who did vote for judges lived. Overwhelmingly, they found, the voters who were most likely to vote for judges — largely white residents living in affluent suburbs of Cleveland — were among the least likely to face judges in a Cuyahoga County court.
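Once addresses are geocoded and assigned to an area, the comparison at the heart of that story comes down to two sets of shares. Below is a minimal sketch of that last step only. It assumes each record has already been reduced to an area label, and the toy data is invented for illustration. It does not show the geocoding itself and is not The Marshall Project’s actual code.

```python
from collections import Counter

# Invented toy records: one area label per defendant and per judicial voter.
# In a real analysis these labels would come from geocoded addresses.
defendant_areas = ["A", "A", "A", "B", "C"]
judicial_voter_areas = ["B", "B", "C", "C", "C", "A"]


def shares(areas):
    """Return each area's fraction of the total records."""
    counts = Counter(areas)
    total = sum(counts.values())
    return {area: n / total for area, n in counts.items()}


defendant_share = shares(defendant_areas)
voter_share = shares(judicial_voter_areas)

# Positive gap: the area supplies more judicial voters than defendants.
gap = {
    area: voter_share.get(area, 0.0) - defendant_share.get(area, 0.0)
    for area in sorted(set(defendant_share) | set(voter_share))
}
```

Areas with a large positive gap are those that help choose judges far more often than their residents appear before them.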

Last year, the Testify team won an Edward R. Murrow Award for excellence in innovation. One of its investigations was also cited in The Marshall Project’s 2023 Online Journalism Award for general excellence.

Testify’s reporting demonstrates the potential for computational methods to open up a new world of analysis to criminal justice reporters. To take full advantage of these methods, though, journalists first have to build a large database of court documents.

“There’s nothing stopping me or you or anyone else from looking up a single case,” said Eads. “[But] it is extremely prohibitive, just in a simple mechanical sense, if you want to ask, ‘How does this compare to all cases?’” Other good reporting questions — How does one case compare to other cases that a judge oversaw? How does it compare to the prosecutor’s record? Did the demographics of the defendants play a role? — are also tricky to answer. “All those sorts of analytical questions that contextualize a given decision in a bigger world are extremely difficult to access,” said Eads, “when what you can access is unstructured data, one case at a time.”

The challenge of collecting vast swaths of court records is part of the reason it’s so difficult to conduct data-driven investigations into courthouses around the country. No doubt, the demand for that reporting is there. “A lot of newsrooms have come to us saying, ‘We want to replicate [Testify] in our county. How do we do it?’” said Ilica Mahajan, a computational journalist for The Marshall Project, who has been working on Testify since 2021. “[When] I go to their county website for court records, it’s hardly ever possible.”

Across the U.S., criminal court documents sit inside archaic public access portals that can only pull files one by one. Meanwhile, federal criminal cases are held in government-run databases like PACER and for-profit databases like Westlaw, which carry prohibitive fees or offer minimal usability at scale. In some places, court records still aren’t digitized at all and are instead collecting dust in file rooms. As a result, a goldmine of data on our criminal justice system remains frozen in outdated systems.

When I set out to survey the state of data journalism about criminal courts in the U.S., I found that projects like Testify show the promise of computational methods on this beat — but ultimately, Cuyahoga remains the exception, not the rule. For these techniques to spread widely, there first needs to be an accompanying change in how counties handle public records access. In the meantime, data journalists are left to hope for the rare fulfillment of bulk records requests by courts or to search for cracks in their online systems where their scrapers might be able to squeeze through.

“There’s places where, five years ago, to get access to court records, you were going through a green screen terminal. It was a system that was really hot when it was designed in 1982,” said Eads. He recalled, years ago, building a scraper for Cook County, Illinois’ web portal that automated a cursor to click 30 characters to the left and then 10 characters down to locate the search entry field. “You wouldn’t be surprised to walk up to a big city system, or any system, and find out that it’s ancient,” he said.

Today, county portals still leave data journalists cursing at their computer screens. As simple as it may sound, one of the barriers to basic data journalism about local courts is how records are numbered. Across many counties in the U.S., court docket IDs are entirely randomized, so two cases seen back-to-back by a judge, for example, may end up labeled out of order.

Cuyahoga County was different. After the county declined to share its records in bulk directly, the Testify team committed to building a scraper. From the original leaked data set, they confirmed that not only were dockets labeled sequentially, but the public records website was also searchable by those numbers. “That made it very easy to get a complete set of records for a particular time period,” said Mahajan, adding that the portal did not sit behind user logins or registrations, which would have been another complicating factor for a scraper.

At the outset of its investigation, the Testify team had 12 custom-built bots running simultaneously, scraping dockets from the county’s web portal. Some scrapers ran for nearly two months straight to feed the growing data set. Ultimately, the team was able to pull dockets from early 2016 all the way up to the present. Mahajan still routinely runs scrapers to keep the files up to date.
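Sequential numbering is what makes a complete crawl possible, because every gap in the sequence is visible. Below is a minimal sketch of the idea. It assumes a hypothetical case-number format and a caller-supplied `fetch` function, with an in-memory dictionary standing in for a real portal. None of these names come from Testify’s scrapers or the county’s website.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple


def docket_ids(prefix: str, start: int, stop: int, width: int = 6) -> Iterator[str]:
    """Yield sequential docket IDs such as 'CR-16-000001' (hypothetical format)."""
    for n in range(start, stop + 1):
        yield f"{prefix}-{n:0{width}d}"


def crawl(
    ids: Iterable[str], fetch: Callable[[str], Optional[str]]
) -> Tuple[Dict[str, str], List[str]]:
    """Fetch each docket once and return (found, missing).

    `fetch` stands in for whatever retrieves one docket page; it returns
    the page text, or None when no such case exists.
    """
    found: Dict[str, str] = {}
    missing: List[str] = []
    for docket_id in ids:
        page = fetch(docket_id)
        if page is None:
            missing.append(docket_id)
        else:
            found[docket_id] = page
    return found, missing


# Fake in-memory "portal" used only to demonstrate the loop.
FAKE_PORTAL = {"CR-16-000001": "docket text A", "CR-16-000003": "docket text C"}

found, missing = crawl(docket_ids("CR-16", 1, 3), FAKE_PORTAL.get)
```

The `missing` list is the payoff: with sequential IDs a reporter knows exactly which dockets to retry or ask about, while randomized IDs give no way to tell what is absent. A real scraper would also need rate limiting and retries, which this sketch leaves out.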

There are other states in the U.S. that stand out for the relative ease of their criminal court records access — Eads named Oklahoma and Washington, in particular. Relative to other government agencies, though, most courts retain a high level of discretion over how they share records. Some are even able to maintain compliance with public records laws by requiring that all documents be accessed in person.

“They can say, ‘We maintain public access — you just have to come to the courthouse and look it up on a terminal and pay $0.25 a sheet to print something out,’” said Eads. His reporters have sometimes been told to appear in person to look up files and download them onto thumb drives. “Courts often have this escape hatch.”

Barriers to data-driven criminal court reporting tend to be highest at the county level. But even at the federal level, where there are more opportunities to access records outside these out-of-date county systems, infrastructure is a challenge.

Take another recent standout investigation into U.S. courts. In 2019, Reuters published a series called “Hidden Injustice,” exposing the impact of sealed documents in civil court. The team collected records for 3.2 million civil suits filed in federal court between 2006 and 2015. Using machine learning technology, they were able to review 90 million court actions and pull out the ones in which material was sealed. They used their findings to report several accountability stories, including one about how judges sealed damning information about OxyContin’s addictive health effects in the build-up to the national opioid crisis.

Before any automation was used, four Reuters reporters first had to comb through thousands of documents to label motions to seal, teaching their tool how to spot similar motions in other dockets. “We had to manually code several thousand records to train the classifier. So this is a useful tool, but not an easy one,” said Benjamin Lesser, a deputy data editor at Reuters, who worked as a reporter on the investigation. “It allowed us to answer questions we weren’t able to answer otherwise, but it still took a long time.”
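The label-then-classify workflow Lesser describes can be illustrated with a toy example. The sketch below is a small multinomial Naive Bayes classifier in plain Python. It is not Reuters’ tool, and the article does not say which model the team used. The docket entries and labels are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    def fit(self, texts, labels):
        """Count words per label from hand-labeled examples."""
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        """Return the label with the highest log-probability."""
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                # Add-one (Laplace) smoothing handles unseen words.
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented, hand-labeled docket entries (the manual coding step).
labeled = [
    ("MOTION to seal exhibit by defendant", "seal"),
    ("ORDER granting motion to file under seal", "seal"),
    ("Sealed document filed", "seal"),
    ("NOTICE of appearance by attorney", "other"),
    ("MOTION for extension of time to answer", "other"),
    ("ORDER setting scheduling conference", "other"),
]
model = NaiveBayes().fit(*zip(*labeled))
prediction = model.predict("Unopposed MOTION to seal settlement agreement")
```

Six examples are enough to show the mechanics, but not to build a reliable tool. As Lesser notes, the real effort went into coding several thousand records by hand.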

The investigation focused on federal civil court but, as with Testify, Reuters’ database of dockets is difficult to replicate. Beyond employing dedicated investigative data journalists and spending untold hours training its classifier, the newsroom also had a backdoor into a for-profit legal research platform and proprietary database of federal court records: Westlaw. The platform, like Reuters, is owned by the Thomson Reuters Corporation.

Lesser and his team approached their sister company and negotiated access to Westlaw’s underlying data. This bulk structured data (millions of court dockets sorted into tables) was the fuel for Reuters’ machine-learning tool. It could never be accessed in the same way by an average, front-door Westlaw user. “Westlaw can still be a very useful tool to do lots of things. It’s just, you’re not going to likely get the bulk data,” explained Lesser, who isn’t aware of other news outlets that have entered data-sharing arrangements with the company. A spokesperson for Westlaw said they were not able to disclose customer information.

Even in other use cases, Westlaw and similar platforms like LexisNexis are costly for smaller newsrooms, local publications, and nonprofit outlets. Westlaw’s lowest-tier subscription cost is $132.80 per month, but that’s only for one user and only for access to a single state’s court records. Those prices can quickly balloon to over $500 per month with additional users or when you add access to federal court records or other states.

Similarly, PACER (Public Access to Court Electronic Records), despite being a prong of the federal government, still charges $0.10 per page to access. In a data-driven investigation, these costs can quickly turn prohibitive. In 2019 alone, the federal judiciary earned roughly $145 million in PACER fees.
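A quick back-of-the-envelope calculation shows how per-page fees add up. The sketch below uses the $0.10 rate cited above. The document count and average page length are hypothetical, and the function ignores any caps or waivers in PACER’s fee schedule, so treat the result as a rough upper bound.

```python
PACER_FEE_CENTS_PER_PAGE = 10  # $0.10 per page, as cited in the article


def estimate_cost_dollars(
    num_documents: int,
    avg_pages: int,
    fee_cents: int = PACER_FEE_CENTS_PER_PAGE,
) -> float:
    """Estimate the total fee in dollars, working in whole cents."""
    total_cents = num_documents * avg_pages * fee_cents
    return total_cents / 100


# Hypothetical project: 50,000 documents averaging 8 pages each.
cost = estimate_cost_dollars(50_000, 8)
```

Under those assumed inputs the estimate comes to $40,000, far beyond the $1,000 that Sanders suggests even a large newsroom might put behind a single story.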

“While organizations like The New York Times may not have a problem shelling out $1,000 to put behind a story, small local news operations don’t have wiggle room in their budget to be paying for these public records. That’s the first challenge,” said Amy Kristin Sanders, an associate professor of law and journalism at the University of Texas, Austin.

Some burgeoning initiatives target journalists and offer new solutions to both the infrastructure and cost barriers to accessing criminal court records.

SCALES is a project by Sanders and more than a dozen other legal academics and data scientists to create a free database of hundreds of thousands of federal court documents. The open-source machine learning tool allows journalists to acquire and analyze thousands of court dockets at a time through natural language processing (NLP) queries. The NLP element means journalists can prompt the database in layman’s terms, asking questions about the records without knowing sophisticated legal jargon or a programming language.

Even a newsroom that can afford PACER fees or a Westlaw subscription may still end up with a pile of raw unstructured data. For instance, PACER provides documents in PDF form. That makes it hard to process using even basic computational methods without first conducting full-scale optical character recognition (OCR), translating the copy into machine-readable text.

“Not every newsroom has the budget or the resources to have a trained data journalist in their newsroom. PACER only retrieves results, it does no analysis for you,” said Sanders. “What SCALES does is take that next step to do the analysis, to be able to make sense of those records without having to pay a data scientist.”

SCALES launched last year and is currently available to any journalist to use for free. To date, the researchers have collected more than 850,000 federal court records. All dockets have been legally obtained either through front-door partnerships with individual districts or through grant funding to pay for access to records in bulk, including $5 million from the National Science Foundation.

SCALES is built by academic legal researchers first and foremost, so there’s been a notable lag in newsroom adoption. (Sanders has been on the journalism conference circuit to build awareness, including recent appearances at Media Party Chicago and Reuters’ Momentum AI event this summer.) And SCALES doesn’t have documents available past 2022, so it’s often best used to show historical trends in the courts or to underline contemporary reporting with supplemental data.

Perhaps the most important shortcoming of both PACER and SCALES, though, is that neither offers access to county-level court records. County courts are where the majority of local criminal justice reporting is centered, and where projects like Testify have grounded their investigations. When asked if any journalism had been produced with SCALES to date, Sanders admitted that, to her knowledge, none had.

The SCALES team is beginning to address this issue through a new statewide initiative in Georgia, led by one of the team’s legal researchers, Adam Pah, a clinical associate professor of criminal justice at Georgia State University. Called the Integrated Justice Platform, the project is so far being piloted in Atlanta, with the goal of eventually consolidating documents across police departments, county courts, and corrections offices in the state.

“The platform actually stitches together those records and merges them, so we can start to ask and answer questions about how individuals and charges transit through the system,” Pah said. The team hopes to launch the platform publicly in January.

In the convoluted web of county records systems that are spread across the U.S., initiatives like SCALES are pushing the dial toward transparency in our courts. It remains a slow turn away from the green screen portals of yesterday.

“It’s a very difficult space, and the decreased funding and positions in newsrooms for this work, on the reporting side, make it even worse,” said Pah. “In a world where the tools exist but without people with time to effectively use them, that is still a great disadvantage to society.”

Every journalist and researcher I spoke to made clear that computational methods aren’t a replacement for shoe leather reporting on our criminal justice system. When done right, though, projects like Testify and “Hidden Injustice” show the power of bringing computational methods to coverage of the courts.

“For people who are experiencing the justice system, none of the things that we are going to report is a surprise to them,” said Mahajan. She points to some of the early findings of Testify, including that Black children are overrepresented in the Cuyahoga County adult court system and that the voters most likely to elect county judges in Cleveland are among the least likely to have faced a judge in court.

It’s one thing for criminal justice reporters to hear something anecdotally from sources every day. It is another for them to validate those experiences quantitatively and unequivocally with data. These findings may not be groundbreaking to incarcerated people or their families, said Mahajan, “but it helps a lot of people who experience the court system to not feel gaslit. [We can] say, not only do we believe you, but here’s exactly what is happening in the numbers.”

Andrew Deck is a staff writer covering AI at Nieman Lab. Have tips about how AI is being used in your newsroom? You can reach Andrew via email, Bluesky, or Signal (+1 203-841-6241).

PART OF A SERIES Crime News Now
