May 13, 2024, 11 a.m.

This journalism professor made a NYC chatbot in minutes. It actually worked.

“The step that we need to make as a society is moving from, ‘This came from a machine, it must be correct,’ to, if I’m talking to a friend of mine who says something crazy, ‘I need to double check that, I need to cross reference it to make sure that it is accurate.'”

Back in March, The Markup broke some news about a small-business advice chatbot being piloted by the New York City government. The bot — available at chat.nyc.gov — was meant to be an AI-powered, one-stop shop for entrepreneurs looking to set up a business. The city said the bot could tell visitors, in plain English, what permits they might need to start a company, for example, or how to avoid fines and violations once they did.

It was heralded by the mayor’s office as the future of public services, but we found that the chatbot regularly lied to constituents, and its advice, if followed, would have caused people to break the law.

When asked, the NYC bot told The Markup that bosses could take workers’ tips (they can’t), that landlords could discriminate against tenants who wanted to use housing vouchers (wrong again), and that business owners could refuse to take cash (businesses must take cash). Again and again, we found the bot messing up on city laws and regulations.

The story broke through nationally, becoming a cautionary tale about government agencies rushing to adopt AI without the proper safeguards in place. Mayor Eric Adams soon acknowledged at a press conference that the bot was providing false information, but defended the tool, saying it would improve over time. (The bot now seems to be willing to respond to fewer questions in general.)

One person who saw that story was Jonathan Soma, a professor of data journalism at Columbia University. Soma recently gave a series of video talks on how journalists can use AI to power their reporting.

One topic he touched on in the series: chatbots. Using the NYC chatbot as a starting point, he demonstrates in a video how to build a similar AI-powered chatbot that could scan and respond to questions based on uploaded documents.

In the video, it works — he specifically asks his own bot some of the same questions we asked the city’s bot, and gets accurate responses when asking about rules for landlords, for example. He also shows how a bot could respond based on just about anything else a person uploads, like a long text of Hungarian folktales.

Soma says chatbots may not be great for high-stakes legal advice, although he’s optimistic about AI’s role in expanding data journalism: “When these bots hit the scene, everyone was terrified that it was going to replace the act of writing for journalists. But it turns out that even though it seems to be the best at writing, that is actually the task that AI is worst at: assembling information, turning it into something coherent.”

We caught up with Soma to talk more about how he built his own bot on the fly, and what responsible use of AI — especially for journalists — might look like. This interview has been edited for brevity and clarity.

Colin Lecher: When you first saw our chatbot story, what was your reaction?

Jonathan Soma: My reaction was, “Of course, this is what happens.” One-hundred percent of the time, when you have a chatbot, things are gonna go wrong. But when you have a lot of layers of bureaucracy in New York City and all the stuff that’s happening here at the moment, I think no one’s going to double check things. It was obvious that something like this was going to happen at some point, and it was just a matter of when and where.

But terrifying. You have a tool that is pitched as a useful tool to help people live their lives in a way that is most effective to obeying the law. Then it’s pitching them lies under the banner of, “This is sourced, this is true. You can trust this.” That’s the big problem with chatbots — they will cite sources, they will say facts that are not facts. Then maybe there’s a little asterisk down at the bottom that says, “Sorry, some things here might not be true, double check all of them.” Which is just absurd.

Lecher: Did you immediately think, “I could make a version of this,” or “I want to reverse engineer how this works”? What were you thinking on the technical end when you saw it?

Soma: On the technical end, right now, chatbots that generate responses to questions — so retrieval-augmented generation — are the hottest thing. It is what everyone wants to do, mostly because they want to replace employees who are doing customer service work.

It is very easy to get a small proof of concept going. No matter what tool they used to generate this, I know exactly how it worked, there’s no mystery there. The mystery in these chatbots is how you can get them to be accurate. And it’s still a big problem that all of these chatbots have, and clearly this one as well.

Lecher: Have you used similar tools in the past?

Soma: Yes. So I have been using these tools for quite a while. One of my favorite projects that I’ve made was a chatbot that would answer questions about Hungarian folktales, based on a book that was in Hungarian. I was using it to teach investigative journalists about how you can do investigations across different languages, because these chatbots can read different languages and interact with material in languages that maybe you can’t speak.

It was very fun, again, as a proof of concept. But if someone was going to sit me down and make me defend a PhD thesis based on a Hungarian folktales chatbot, I don’t think I would pass.

Lecher: Can you talk through how you made your own version of the chatbot on the fly?

Soma: What happens is you take a list of documents, and you feed them into a machine. Every paragraph or two of each document, let’s say, is assigned a series of numbers that describes the meaning or the topic of that paragraph. When someone asks a question, the machine takes that question, converts it into the same sort of numbers, then finds maybe five or 10 sets of paragraphs whose numbers are closest to your question’s. So in theory, it’s all the text that is similar to your question, and probably relevant to generating an answer. And then it sends the question, along with all of those documents, to the chatbot or to GPT, to the AI, and says, “Please answer this question based on these five or 10 sets of paragraphs that we think are relevant to the question.”

I did that by running some code that assigned all the numbers and cut all the paragraphs and sent everything out to GPT to get that answer. There are probably hundreds of tools that do this. It’s very, very common. There are a lot of slight variations on the approach, but generally speaking, you’re finding relevant documents and sending them to the AI along with the questions.
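For readers who want to see roughly what that looks like, here is a minimal sketch of the retrieval-augmented generation workflow Soma describes, written against the OpenAI Python SDK. The file name "nyc_rules.txt", the model choices, the five-chunk cutoff, and the paragraph-based chunking are illustrative assumptions for this example, not details of Soma’s demo or the city’s bot.

```python
# A minimal retrieval-augmented generation sketch (illustrative only):
# chunk a document, embed the chunks, retrieve the ones closest to a
# question, and ask the model to answer using only those excerpts.
# Assumes the OpenAI Python SDK (v1.x) and an OPENAI_API_KEY in the environment.
import math
from openai import OpenAI

client = OpenAI()

# 1. Split the source document into paragraph-sized chunks.
with open("nyc_rules.txt", encoding="utf-8") as f:  # placeholder file name
    chunks = [p.strip() for p in f.read().split("\n\n") if p.strip()]

# 2. Assign every chunk its "series of numbers" (an embedding vector).
emb = client.embeddings.create(model="text-embedding-3-small", input=chunks)
chunk_vectors = [item.embedding for item in emb.data]

def cosine(a, b):
    """Similarity between two embedding vectors (higher means closer in meaning)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# 3. Embed the question the same way and keep the five closest chunks.
question = "Can I take a portion of my workers' tips?"
q_vec = client.embeddings.create(
    model="text-embedding-3-small", input=[question]
).data[0].embedding
ranked = sorted(zip(chunks, chunk_vectors), key=lambda cv: cosine(cv[1], q_vec), reverse=True)
context = "\n\n".join(chunk for chunk, _ in ranked[:5])

# 4. Send the question plus the retrieved excerpts to the chat model.
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer using only the provided excerpts. "
                                      "If they do not contain the answer, say you don't know."},
        {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
    ],
)
print(reply.choices[0].message.content)
```

Production systems typically swap the brute-force similarity loop for a vector database and add guardrails, but the basic shape — embed, retrieve, then generate — is the same one Soma outlines above.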

Lecher: And you used some of the examples we highlighted, questions the New York City bot got wrong when we asked them, but your version answered those questions correctly, right?

Soma: Yes. So I went online, and I just found a few different sources that were on the various New York City nyc.gov websites that might have relevant information about this — the fact sheets that you all cited about the true answers to these things. And yes, it seemed to work fine whenever I asked those questions.

Lecher: Do you have any idea why your version worked, when the city’s seemed to fail so badly?

Soma: If I were being gracious, I would say that there is always the ability of the AI to make things up and hallucinate. Additionally, if you have a very large set of documents, it might be difficult to find the ones that are actually relevant. Because that’s an automated process, and maybe when I say, “Can I take a portion of my workers’ tips?” it finds documents that might not be relevant. And then the chatbot isn’t able to answer because the documents aren’t necessarily relevant.

But there were so many basic, basic questions that were answered incorrectly, I can’t imagine the documents were actually loaded. There must just have been gross malfeasance in terms of setting this chatbot up. Because it’s not hard to have it be right most of the time. The issue was getting it to be right 100 percent of the time. And from what I can tell, this city-based chatbot was almost never right, which is hard to do. It’s hard to be that wrong.

Lecher: On that topic of mostly right versus 100 percent right: Do you feel like it’s inevitable that there will be errors when you release chatbots like this?

Soma: Yeah, there are so many different opportunities for things to go wrong when you’re trying to use a chatbot that’s related to a set of documents. The AI can make things up. It can be referencing out-of-date documents, it can be referencing irrelevant documents. Maybe the way that you ask the question isn’t the way that the AI interprets the question, and so it’s answering something that is similar, but not exactly the same.

It is 100% guaranteed that, at some point, there’s going to be some sort of mistake in that chain, and there’s going to be some sort of error introduced, and you’re going to get a wrong answer. There’s almost no way to fact-check it. There are ways to make it slightly better and slightly better and slightly better and slightly better. But getting to 100%, I would say, is almost impossible.

Lecher: So you’re not recommending people run out and make their own chatbots.

Soma: Chatbots are great for low-stakes things. They are great when something is fun, they are great for a task where you do not need 100 percent accuracy, when you just want a little bit of guidance. “What restaurant should I go to?” Things like that. But relying on it for something like legal advice, based on what you can do in the city, is pretty irresponsible.

(After publication of The Markup’s story, several disclaimers were added to the bot, including one cautioning users to “not use its responses as legal or professional advice.”)

Lecher: You say in the video there are practical reasons, for journalists especially, to use similar AI organizing tools. Can you tell me a little bit about what those reasons might be?

Soma: With AI-related tasks, there’s always going to be a margin of error. You have to use it for tasks where it’s OK if there is an error. For example, if there is something where I might have an intern do a task, I might say, “OK, I’m an expert at this, I could do it. But it’s probably a better use of the intern’s time, and maybe sometimes they’re going to make a mistake. I’m going to build into my process the idea that I need to double-check this work. I need to make sure that there are no errors, because, you know, sometimes something might go wrong.”

When you’re using AI tools, you just need to test all of the time. It’s very easy to get into a conversation with AI and think, “Wow, this is amazing. I’m getting an incredible poem about my cats. I’m getting a recipe about this thing or the other thing. I’m getting feedback or edits on something I have written,” and it’s very easy for you to say, “Yes, I will accept this. No, I won’t accept that edit. Yes, this poem is adorable, or no, write another one about how the cat is mean instead of the cat is nice,” or whatever.

So you just need to make sure that sort of capacity for errors is built in. Either you’re reviewing everything manually, or it’s a task where instead of doing one review of a piece of legislation, you’re looking at 100,000 pieces of legislation, or a million pieces of legislation, something that you as a human being couldn’t do. And you say, “OK, AI, you go do it, I understand that maybe 3 percent of the time, you’re gonna give me some incorrect information. But because I’m doing something at scale, maybe it’s okay that there’s errors here and there. All I want is the general gist to be true, or the general gist to be accurate.”

But you can’t do that when you’re building a chatbot and every single conversation has to be meaningful, and has to be accurate for the person who is using it.

Lecher: Are you optimistic journalists in particular are going to find creative ways to use these tools to do journalism that they couldn’t have before?

Soma: I think that AI in general is absolutely useful for journalism, and I’ve been teaching machine learning and AI to journalists since long before ChatGPT hit the scene. I think it is explicitly chatbots that are probably the most problematic part, because they are so confident in everything that they say.

But journalists as a whole are very used to talking to people who are very confident, making statements that need to be fact-checked, that need to be cross-referenced. The step that we need to make as a society is moving from, “This came from a machine, it must be correct,” to, if I’m talking to a friend of mine who says something crazy, “I need to double check that, I need to cross reference it to make sure that it is accurate.”
