Nov. 8, 2011, 10 a.m.

Ethan Zuckerman wants you to eat your (news) vegetables — or at least have better information

Zuckerman wants to create nutritional labels for news, showing how much marshmallow fluff you mix in with your meat and potatoes. But both the tech and politics of categorizing journalism have a long way to go.


When you walk into Ethan Zuckerman’s new office at the MIT Media Lab, past the all-seeing Mood Meter and a ping pong table that resembles a koi pond, there’s an old newspaper box on display. Inside, a glowing screen depicts the New York Times front page. A dial invites you to scroll through two decades of data.

Red and blue bars on the screen represent the share of domestic versus foreign news on Times front pages over the years. Except for spikes in the early ’90s and mid-2000s, foreign coverage has steadily dwindled since the late ’80s.

Zuckerman, who created Global Voices in 2004, is worried about the disappearance of international news in the United States. As he (and Eli Pariser and Alisa Miller) has argued, the Internet has not opened up a world of information from far-away places. It has given us more people and news sources that look like us. Too much.

U.C. San Diego scientists in 2009 estimated the “average” American consumes 34 gigabytes of media a day. (And I suspect most of the people reading Nieman Lab are above average.) Are we aware of all that we’re taking in? How varied, how nutritious is our media diet?

Now that he’s in charge of the Center for Civic Media (see my June interview), Zuckerman is putting resources behind an idea he has talked about for years: nutritional labels for news. Imagine being able to see a summary of your media diet. (Clay Johnson once took a stab at designing such a label.) Educating people about their intake might empower them to make healthier choices, Zuckerman argues. He explained the idea in 2008:

The holy grail in this model, as far as I’m concerned, would be a Firefox plugin that would passively watch your websurfing behavior and characterize your personal information consumption. Over the course of a week, it might let you know that you hadn’t encountered any news about Latin America, or remind you that a full 40% of the pages you read had to do with Sarah Palin. It wouldn’t necessarily prescribe changes in your behavior, simply help you monitor your own consumption in the hopes that you might make changes.
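As a rough illustration of the kind of self-audit that quote describes, here is a minimal Python sketch that tallies a week of pre-categorized browsing and flags topics that never appeared. The log format, file name, and category names are all hypothetical, not taken from any actual MediaRDI prototype.

```python
# A minimal sketch of a "media diet report," assuming browsing has already
# been logged to a CSV of (timestamp, url, category) rows. The log format
# and the category names are invented for illustration.
import csv
from collections import Counter

WATCHLIST = {"Latin America", "Africa", "Science"}  # topics to flag if absent

def diet_report(log_path: str) -> None:
    counts = Counter()
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):  # expects timestamp,url,category columns
            counts[row["category"]] += 1
    total = sum(counts.values()) or 1
    for category, n in counts.most_common():
        print(f"{category:20s} {n:5d} pages  ({100 * n / total:.0f}%)")
    for topic in WATCHLIST - set(counts):
        print(f"Note: no pages about {topic} this week.")

diet_report("week_of_browsing.csv")  # hypothetical log file
```

Note that this is purely descriptive, in Zuckerman’s terms: it reports what you read and what was missing, without telling you what to read instead.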

I met with Zuckerman and graduate student Matt Stempeck to understand how it might work. The project’s working title is MediaRDI (Recommended Daily Intake), but to be clear: An end product, an actual piece of consumer software, is a ways away. Once Zuckerman began peeling back the onion, I realized how complicated this is.

First problem: As soon as you stick a label on something, it’s political.

Clay Johnson's proposed nutritional label for news

“Anything tracked on a label is the subject of enormous lobbying,” he told me. “If you decide that you’re going to start tracking saturated versus unsaturated fat, there’s enormous controversy over that, and whether or not we should be subdividing out sodium and potassium…. Everyone has a vested interest.”

Labels have two functions: descriptive and normative. He explained: “The descriptive component tells you what’s in a product, and the normative component tells you what you should be eating or drinking. So I’m looking at a Diet Coke can here. It tells me that it has 40 milligrams of sodium; that’s descriptive. And then it offers me that it’s 2 percent of my daily value; that’s normative.

“I’m not interested in going after the normative question. Telling people that they should be reading 30 percent local news, 40 percent international, 30 percent national — that seems like a poor path to go down. But being able to do some comparison to me seems like a very worthwhile thing to do.”

Tracking someone’s browsing habits and spitting out a descriptive list of URLs is not very helpful, though, Zuckerman said. “You need to be able to categorize it in some sort of hierarchy. I think it would be helpful to give people a geographical picture and a topical picture of what they’re consuming.”

How do you design a hierarchy that is comprehensive but precise? Future-proof? At what point in the Tunisian revolution does the story become a category, and at what point does that category become a subcategory of Arab Spring? Is a story about a medical study labeled Science, Health, Health & Science, or Medicine? Moreover, a hierarchy that works for one publication may not work for another. This is how even descriptive labeling gets political.
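A toy example makes the ambiguity concrete: the same story can land in a different place depending on whose taxonomy you adopt. Both trees below are invented for illustration; no real publication’s taxonomy is shown.

```python
# Two hypothetical taxonomies filing the same (invented) story differently.
TIMES_STYLE = {"Health": {"Medicine": ["new-statin-study"]}}
WIRE_STYLE = {"Science": {"Health & Science": ["new-statin-study"]}}

def path_to(story, tree, trail=()):
    """Depth-first search for a story's category path in a nested dict."""
    for label, child in tree.items():
        if isinstance(child, list):
            if story in child:
                return trail + (label,)
        else:
            found = path_to(story, child, trail + (label,))
            if found:
                return found
    return None

print(path_to("new-statin-study", TIMES_STYLE))  # ('Health', 'Medicine')
print(path_to("new-statin-study", WIRE_STYLE))   # ('Science', 'Health & Science')
```

Any diet report built on top of such trees inherits their disagreements, which is exactly why the descriptive label is political before a single normative judgment is made.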

There are existing attempts to categorize the world’s information; this is what Tim Berners-Lee and the semantic web crowd have been trying to do for years.

The Project for Excellence in Journalism hand-samples media every week to determine the portion sizes of topics in the news.

Fabrice Florin’s NewsTrust uses a human-powered, pro-am approach to distinguish quotes from factual statements and inferences.

Hal Roberts at Harvard’s Berkman Center uses a computer-based approach called clustering, which parses text from the top 1,000 U.S. blogs to sort them into categories. Computers can distinguish knitters from crafters, Zuckerman says, and Google bloggers from Apple bloggers, because those groups use distinct language. But the results are not always helpful. A computer can tell you that Daily Kos and RedState are “politics,” but it is bad at telling you that one is a liberal blog and the other conservative, he says. “Obama,” “economy,” and “election” are uttered in equal amounts on both sides of the aisle.
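Here is a minimal sketch of vocabulary-based clustering in that spirit, using scikit-learn (this stands in for, and is not, Roberts’ actual pipeline): TF-IDF vectors plus k-means will pull apart groups that use distinct language, but the clusters come back unlabeled.

```python
# Cluster a few (invented) blog posts by vocabulary. The clusters separate
# cleanly because the word choices differ, but nothing here can name a
# cluster "liberal" or "conservative" -- the labels are just 0 and 1.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

posts = [
    "cast on knit purl yarn gauge swatch",        # knitting blog
    "yarn needles knit stockinette bind off",     # knitting blog
    "iphone ios cupertino app store release",     # Apple blog
    "iphone ipad apple keynote app update",       # Apple blog
]
X = TfidfVectorizer().fit_transform(posts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0 0 1 1]: two coherent clusters, no names attached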

OpenCalais takes a different approach with entity extraction. The software fishes out meaningful words and phrases, sort of like tags. A story about the German economy might return the entities “Germany,” “economy,” “Angela Merkel,” and “Euro,” to name a few. A computer can easily pluck those entities out of text, but there are two problems: first, a human has to tell the computer which entities are important; and second, entities are not hierarchical. As far as a computer is concerned, “Angela Merkel” is not a subcategory of “German politicians” or “world leaders” — it’s just a flat piece of metadata.
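For a sense of what that flat output looks like, here is a sketch using the open-source spaCy library rather than OpenCalais itself (the calls shown are spaCy’s, not OpenCalais’s API):

```python
# Entity extraction in the flat-tag sense described above. The result is a
# list of strings with type labels; no hierarchy places "Angela Merkel"
# under "German politicians" or "world leaders".
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model, installed separately
doc = nlp("Angela Merkel said the German economy would support the euro.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Angela Merkel PERSON", "German NORP"
```

Turning those flat tags into the geographical and topical hierarchy Zuckerman wants still requires a human-designed mapping on top.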

Mike Dunn, the chief technology officer at Hearst Corporation, has talked a lot about getting newspapers to adopt a common taxonomy. Dunn is a proponent of the rNews specification, which the IPTC, an international standards body, finalized just this month. The spec is backed by The New York Times, the AP, and Getty Images. It’s a start, but a standard is only as good as the news organizations that adopt it.

“We’re moving into a realm where it’s easier and easier for new people to create media,” Zuckerman said. “And at that point, do you really expect that average citizens are going to use a tagging system to make the content semantically available?”

That’s why he is taking the categorization into his own hands — or maybe 100,000 other people’s hands. Zuckerman wants to enlist an army of workers from Amazon’s Mechanical Turk, who earn pennies apiece for small, monotonous tasks.


“Take 10,000 or so stories from, let’s say, the L.A. Times, where we don’t have this data, and ask people to evaluate, ‘Is this local to the Los Angeles area? Is this national to the U.S.? Is this international? Or is this soft?’ Which is to say, entertainment, sports, something along those lines. And those might be four arbitrary categories,” he said.

Zuckerman’s goal is not 100 percent accuracy — more like 80 or 90 percent, he said.

“Get 50,000 judgments, 100,000 judgments, at that point we have fairly big datasets. We have all the words in those stories that we might then use to train a machine-learning algorithm. And then the idea would be to say, ‘Does this machine-learning algorithm let us classify the L.A. Times going forward?'”
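Sketched in code, the pipeline he describes has two steps: aggregate the redundant Turk judgments by majority vote, then train a text classifier on the labeled stories. Everything below (the headlines, the four categories, the choice of model) is invented for illustration; none of it comes from an actual MediaRDI design.

```python
# Majority-vote the Turk judgments, then train a classifier on story text.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# story text -> list of Turk judgments (hypothetical examples)
judgments = {
    "City council votes on downtown parking rates": ["local", "local", "soft"],
    "Senate passes federal budget resolution": ["national"] * 3,
    "Earthquake strikes off the coast of Japan": ["international"] * 3,
    "Lakers trade rumors swirl before deadline": ["soft", "soft", "local"],
}
texts = list(judgments)
labels = [Counter(v).most_common(1)[0][0] for v in judgments.values()]

vec = TfidfVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(texts), labels)

# Classify an unseen story "going forward."
print(clf.predict(vec.transform(["Mayor unveils new transit plan"]))[0])
```

The redundancy of the judgments is what makes his 80-to-90-percent accuracy target plausible: individual Turkers can be wrong as long as the majority usually isn’t.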

So where does he get all this data? When you’re talking about all of the news outlets in America, that’s a challenge. Zuckerman’s students were able to build the New York Times contraption because they had access to publicly available, well-documented APIs (and were willing to use the Times’ existing taxonomy). Only a handful of major news organizations are so accessible.

This is where Zuckerman is one big step ahead: In 2008 he helped create a gigantic computer program called Media Cloud, which has indexed every article from the top 25 mainstream U.S. news outlets and every post from the top 1,000 blogs from the past four years. That’s hundreds of terabytes’ worth of text, over 43 million articles, all searchable and visualizable.

The data is publicly available in daily chunks, but save for Bruce Etling’s work with Russian-language media, it’s a rich database waiting to be mined.

Zuckerman said he is inspired by the quantified self movement, the people who obsessively record every last bit of data about heart rate, sleep cycle, diet, exercise habits. He pulls out a Fitbit, a sort of high-tech pedometer, which shows he hasn’t taken as many steps today as he had thought.

“It says to me, perhaps I want to walk a little bit further later today, and I will walk to the restaurant where I’m meeting folks for dinner, rather than driving,” he said. That’s descriptive information acting prescriptively. Zuckerman is making a choice he might not have made without the information.

You might think your media diet is well regulated enough that you don’t need someone to tell you what you’re consuming. You’re probably wrong — or at least I was. Here’s something you can try right now: Install a free application called RescueTime. It runs in the background and tracks how you spend your time online and offline. My detailed productivity report says I spend an inordinate, embarrassing amount of time in TweetDeck and a decent chunk of time in Facebook (even though I’m “never on it anymore”). I can also see how my focus compares to fellow users: As of this writing I’m more productive than 42 percent of others.

Just thinking about it pressures me to be more productive. It’s social guilt. Zuckerman thinks he walks more now just because he knows he’ll share the Fitbit data later. (He won’t even share his RescueTime data with his wife: “I find my tendency to read every possible story on the Green Bay Packers so humiliating.”) He wants an eventual MediaRDI app to be social, encouraging people to share and compete with each other.

But does everyone want to see a digest of his own news nutrition, let alone look at someone else’s? I bet a lot of us are afraid of what it might say. Twenty-one years ago President Bush signed the Nutrition Labeling and Education Act. Many local laws now mandate posting calorie counts in chain restaurants. And America is still pretty fat.
