Fact-Checking AI: ChatGPT Falls Short in Accuracy Compared to Bard, Claude, and Copilot

Generative artificial intelligence is prone to making factual errors. So how do you verify 150 presumed facts generated by ChatGPT without spending an entire weekend doing it yourself? For me, the answer was to turn to other AIs. In this article, I'll discuss the project, evaluate how each AI performed in a fact-checking face-off, and offer some final thoughts and cautions if you want to venture into this labyrinth yourself.

Last week, we published a project in which ChatGPT, running DALL-E 3, generated a picturesque image for each of the 50 US states, along with three interesting facts about each one. The results were strange, with ChatGPT placing the Golden Gate Bridge in Canada, putting Lady Liberty in the Midwest, and producing two Empire State Buildings. The individual facts seemed mostly accurate, but we hadn't independently verified them. That seemed like an ideal job for an AI.

The facts came from ChatGPT running the GPT-4 language model, so I fed them to the large language models behind other AI chatbots: Google's Bard, Anthropic's Claude, and Microsoft's Copilot, to see which would do the best job of evaluating them.

Claude found the fact list mostly accurate, but suggested clarifications for three items. Copilot, Microsoft's Bing Chat AI, couldn't accept the full text of the 50-state fact list, and its responses were odd. Bard, on the other hand, outdid the rest in the sheer effort it put into fact-checking, but it often missed the point and got things as wrong as any other AI. ChatGPT, in turn, fact-checked Bard's fact-checking: it disputed Bard's claim that Texas is the biggest state, and it got into a tizzy over whether Ohio or Kansas counts as the birthplace of aviation, a genuinely contested point.

Given all that, it appears that using AI to verify facts generated by AI isn't as straightforward as it might seem.
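Still, if you want to try a similar cross-check yourself without pasting hundreds of facts into chat windows by hand, the same idea can be scripted against a chatbot's API. Below is a minimal sketch using Anthropic's Python SDK to ask Claude to flag questionable items; the model name, prompt wording, and sample facts are my own assumptions for illustration, not the exact setup used in this project.

```python
# Hypothetical sketch: ask Claude to fact-check a short list of state facts.
# Assumes the `anthropic` package is installed and ANTHROPIC_API_KEY is set.
import anthropic

client = anthropic.Anthropic()

facts = [
    "California: The Golden Gate Bridge opened in 1937.",
    "Ohio: The Wright brothers, pioneers of aviation, were from Dayton.",
    "Texas: Texas is the largest US state by area.",  # actually Alaska is larger
]

prompt = (
    "You are a careful fact-checker. For each numbered fact below, reply with "
    "'OK' if it is accurate, or a one-sentence correction if it is not.\n\n"
    + "\n".join(f"{i + 1}. {fact}" for i, fact in enumerate(facts))
)

message = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder; use whatever model is current
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)

# The response is a list of content blocks; print the text of the first one.
print(message.content[0].text)
```

Even with a script like this, you'd still want a human pass over the output. As the results above show, the fact-checker can be just as wrong as the AI it's checking.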