Oct. 18, 2019, 9:55 a.m.
Audience & Social

Here’s how researchers got inside 1,400 private WhatsApp groups

They…joined them! Plus: YouTube beats Indian news organizations 65-to-1, and machines can make fake news pretty well, but they can’t detect it.

The growing stream of reporting on and data about fake news, misinformation, partisan content, and news literacy is hard to keep up with. This weekly roundup offers the highlights of what you might have missed.

How the Tow Center monitored closed WhatsApp groups during the Indian election. Since private messages on WhatsApp are encrypted, most research on the platform has focused on public groups — leaving massive amounts of information unexplored. But researchers from Columbia’s Tow Center for Digital Journalism found a way around that: They joined private groups related to the 2019 Indian elections and monitored the flow of information there, something that other researchers could do, too.

“With no APIs, tools, or best practices in place to help outsiders tap into ongoing activity inside closed groups, we devised a strategy to monitor a subset of the political conversation, over a period of three and a half months. The study’s resulting data set — which grew to over a terabyte in size — contains 1.09 million messages, retrieved by joining 1,400 chat groups related to politics in the country.”

Here’s how they did it, writes Tow Center senior research fellow Priyanjana Bengani, who conducted the research with Tow fellow Ishaan Jhaveri.

There are two ways in which one can become part of a group on WhatsApp: get added to a group by an administrator or join via an invite link. We started by crawling the open web for public invite links, an approach adopted by researchers who have conducted similar research in Brazil. We augmented that by looking for invite links shared on Twitter, Facebook, and WhatsApp groups of which we were already a part. Activists, party affiliates, and campaign workers publish the details of these open groups, encouraging their audiences to join and see how they can help their party.

Simply searching for “WhatsApp invite links” or “WhatsApp groups” on a search engine led us to websites whose sole purpose was to aggregate WhatsApp invite links. Instead of blindly joining every single group, we joined only those relevant to us (groups whose names indicated the group was politically charged). Those included: “Mission 2019,” “Modi: The Game Changer,” “Youth Congress,” and “INDIAN NATIONAL CONGRESS.”

We started this process with a single iPhone and a recently acquired US phone number. To be fully transparent, we identified ourselves as “Tow Center” with the description “a research group based out of Columbia University in New York.” We did not engage in any conversation. To respect WhatsApp’s fundamentally secure design, we joined groups with at least sixty participants. We focused on those with messages in Hindi or English. (We needed to be able to understand the content if we were to analyze it, and our lead researcher speaks both languages.) If the number of participants was fewer than sixty but the group name was clearly political, the invite link was added to a “Tentative” list that we maintained. If and when the number of members hit sixty, we joined the group. We reviewed our lists almost daily.
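
The Tow write-up describes this workflow in prose rather than code, but the discovery-and-triage logic is easy to sketch. The Python below is a hypothetical reconstruction, not the researchers’ tooling: the invite-link pattern, the keyword list, and the previewed_groups.csv file of group names and member counts (which would have to be recorded by hand, since, as the researchers note, WhatsApp offers no API for this) are all assumptions.

```python
# Hypothetical reconstruction of the discovery-and-triage steps; not the Tow code.
import csv
import re

# Public invite links follow the chat.whatsapp.com pattern.
INVITE_RE = re.compile(r"https://chat\.whatsapp\.com/[A-Za-z0-9]+")

# Illustrative keyword list, echoing group names cited in the study.
POLITICAL_KEYWORDS = ["2019", "modi", "congress", "bjp", "election", "mission"]

def extract_invite_links(html: str) -> set:
    """Pull invite links out of a crawled page, tweet, or link-aggregator site."""
    return set(INVITE_RE.findall(html))

def triage(records):
    """Apply the join rules described above: a political name and >= 60 members
    means join now; a political name with fewer members goes on the tentative list."""
    join, tentative = [], []
    for rec in records:  # rec = {"link": ..., "name": ..., "members": int}
        name = rec["name"].lower()
        if not any(kw in name for kw in POLITICAL_KEYWORDS):
            continue
        (join if rec["members"] >= 60 else tentative).append(rec)
    return join, tentative

if __name__ == "__main__":
    # previewed_groups.csv is a hypothetical, hand-recorded file of
    # link, name, members -- there is no WhatsApp API for this metadata.
    with open("previewed_groups.csv", newline="", encoding="utf-8") as f:
        records = [dict(row, members=int(row["members"])) for row in csv.DictReader(f)]
    to_join, to_watch = triage(records)
    print(f"{len(to_join)} groups to join, {len(to_watch)} on the tentative list")
```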

Within a few days, however, the NYC area code phone number that Tow was using for WhatsApp was banned, perhaps because it had been used to join more than 600 groups or because group administrators flagged it as suspicious. At that point, the researchers “activated multiple phones with fresh phone numbers (six in all), and kept the number of groups on each device below three hundred.”
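
The article doesn’t say how the groups were divided among the six numbers. A simple cap-respecting split might look like the sketch below, where the round-robin assignment is an assumption and only the roughly-300-groups-per-device ceiling comes from the reporting.

```python
# Hypothetical split of invite links across devices; the ~300-group cap is
# taken from the article, the round-robin assignment is an assumption.
def assign_to_devices(invite_links, device_ids, cap=300):
    """Distribute groups across phone numbers without exceeding the per-device cap."""
    if len(invite_links) > cap * len(device_ids):
        raise ValueError("not enough devices for this many groups")
    assignment = {device: [] for device in device_ids}
    for i, link in enumerate(invite_links):
        assignment[device_ids[i % len(device_ids)]].append(link)
    return assignment
```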

Ultimately, Tow joined 1,400 private Indian political WhatsApp groups, was kicked out of about 200, and voluntarily left about 200 that turned out to be irrelevant. When the team backed up the content of the groups, it had a sample of 500,000 text messages, 300,000 images, 144,000 links, 118,000 videos, 12,000 audio files, 4,000 PDFs, and 500 contact cards.

One of the things the researchers found was that “35 percent of the media items in our dataset had been forwarded.” WhatsApp has identified forwarded messages as a key vector of the spread of misinformation and has sought to limit them. They also found that the 10 most-shared items in their dataset had no overlap with their 10 most-forwarded items. And they found that “links to news organizations are strikingly few. The Hindi version of NDTV, an Indian television media company, received the most links — just over 700. None of India’s other national news outlets had more than 300 links.” That means less than 1 percent of all links shared went to Indian news organizations. By comparison, 65 percent of links were to YouTube.
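
Once the links are pulled out of the backup, those percentages are simple aggregations. The sketch below assumes a hypothetical pandas DataFrame layout (url and forwarded columns) and an illustrative list of news-outlet domains; it is not the Tow team’s analysis code.

```python
# Sketch of the forward/link analysis over a hypothetical DataFrame of
# messages; column names and the news-domain list are assumptions.
from urllib.parse import urlparse

import pandas as pd

NEWS_DOMAINS = {"ndtv.com", "khabar.ndtv.com", "indiatimes.com",
                "hindustantimes.com", "thehindu.com"}  # illustrative only

def domain(url: str) -> str:
    """Normalize a URL to its bare host."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def forwarded_share(media: pd.DataFrame) -> float:
    """Fraction of media items flagged as forwarded (the study reports ~35%)."""
    return media["forwarded"].mean()

def link_shares(links: pd.DataFrame) -> dict:
    """Share of links pointing to YouTube vs. the (hypothetical) news-outlet list."""
    hosts = links["url"].map(domain)
    return {
        "youtube": hosts.isin({"youtube.com", "youtu.be"}).mean(),
        "news_outlets": hosts.isin(NEWS_DOMAINS).mean(),
        "top_domains": hosts.value_counts().head(10).to_dict(),
    }
```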

Fake news and the limits of AI. Tools powered by artificial intelligence can create fake news articles, at least in a research setting, The Wall Street Journal’s Asa Fitch reported. “Large-scale synthesized disinformation is not only possible but is cheap and credible,” said Cornell professor Sarah Kreps, who coauthored research into whether “synthetic disinformation could generate convincing news stories about complex foreign policy issues” and found that it could: “The GPT-2 system worked so well that in an August survey of 500 people, a majority found its synthetic articles credible. In one group of participants, 72% found a GPT-2 article credible, compared with 83% who found a genuine article credible.”
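
For a sense of how low the barrier to generation is, the publicly released GPT-2 weights can be driven with a few lines of the Hugging Face transformers library. This is not the researchers’ setup, just a minimal illustration of the kind of text the study asked survey participants to judge.

```python
# Minimal GPT-2 generation via Hugging Face transformers; an illustration
# of how cheaply such text can be produced, not the researchers' pipeline.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Hypothetical prompt in the register of a foreign-policy news story.
prompt = "In a statement released on Monday, the foreign ministry said that"
result = generator(prompt, max_length=200, num_return_sequences=1)
print(result[0]["generated_text"])
```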

But it’s less clear whether machine learning can be used to detect fake news — new research out of MIT shows we’re not there yet, Axios reports: “While machines are great at detecting machine-generated text, they can’t identify whether stories are true or false.”

In one study, Schuster and team showed that machine learning-taught fact-checking systems struggled to handle negative statements (“Greg never said his car wasn’t blue”) even when they would know the positive statement was true (“Greg says his car is blue”).

The problem, say the researchers, is that the database is filled with human bias. The people who created FEVER tended to write their false entries as negative statements and their true statements as positive statements — so the computers learned to rate sentences with negative statements as false.
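
A toy example makes that failure mode concrete: if negated claims in the training data are overwhelmingly labeled false, even a trivial classifier learns to treat negation words themselves as evidence of falsehood. The handful of claims and labels below are invented for illustration and bear no resemblance to FEVER’s scale or to the MIT systems.

```python
# Toy demonstration of the bias: negated training claims are labeled false,
# so the classifier learns the negation cue itself, not the facts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_claims = [
    "Greg says his car is blue",             # positively phrased, labeled true
    "The film premiered in Mumbai in 2015",  # positively phrased, labeled true
    "Greg never bought a motorcycle",        # negated, labeled false
    "The film was not released in 2015",     # negated, labeled false
]
train_labels = ["SUPPORTS", "SUPPORTS", "REFUTES", "REFUTES"]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_claims, train_labels)

# A negated claim about unseen facts gets rejected on the negation cue alone.
print(model.predict(["The senator never voted for the bill"]))  # likely ['REFUTES']
```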

That means the systems were solving a much easier problem than detecting fake news. “If you create for yourself an easy target, you can win at that target,” said MIT professor Regina Barzilay. “But it still doesn’t bring you any closer to separating fake news from real news.”

Illustration from L.M. Glackens’ The Yellow Press (1910) via The Public Domain Review.

Laura Hazard Owen is the editor of Nieman Lab. You can reach her via email (laura@niemanlab.org) or Bluesky DM.