A hallmark of popular generative artificial intelligence programs such as ChatGPT is that they have a cut-off date for the facts they have absorbed. For example, OpenAI recently updated its GPT-4 program to have access to data about events up until April 2023; prior to that update, the tool was trained only on data extending no later than 2021. AI scientists are working on ways to let generative AI programs reliably access ever-changing data about timely and pressing questions, such as, “What is King Gizzard’s most recent studio album?” (Answer: The Silver Cord.) Google and OpenAI have published a joint effort called FreshLLMs that induces GPT-4 to use information retrieved from Google searches. The core of FreshLLMs is a new method for prompting a language model, called “FreshPrompt,” which folds results from a search engine into the prompt.
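The central idea, retrieving search results and folding them into the prompt before the question, is simple enough to sketch. What follows is a minimal illustration, not the authors’ implementation: the search results are hardcoded stand-ins for what a live Google query would return, `call_llm` is a hypothetical placeholder for whatever model API is in use, and the oldest-to-newest ordering of evidence is one plausible way to emphasize recency.

```python
# Minimal sketch of search-augmented prompting in the spirit of FreshPrompt.
# The "search results" below are hardcoded stand-ins for live Google results,
# and call_llm() is a hypothetical placeholder for any chat-model API.
from datetime import date

def build_fresh_prompt(question, search_results):
    """Sort retrieved evidence so the newest snippets sit closest to the
    question, then prepend everything to the question as context."""
    evidence = sorted(search_results, key=lambda r: r["date"])
    lines = [
        "Answer the question using the search results below.",
        "Prefer the most recent, most authoritative source.",
        "",
    ]
    for r in evidence:
        lines += [
            f"source: {r['source']}",
            f"date: {r['date'].isoformat()}",
            f"snippet: {r['snippet']}",
            "",
        ]
    lines += [f"question: {question}", "answer:"]
    return "\n".join(lines)

def call_llm(prompt):
    # Placeholder: swap in a real model call (OpenAI, PaLM, etc.).
    return "(model answer would appear here)"

if __name__ == "__main__":
    results = [
        {"source": "en.wikipedia.org", "date": date(2022, 10, 28),
         "snippet": "Changes is a studio album by King Gizzard & the Lizard "
                    "Wizard, released in October 2022."},
        {"source": "news search result", "date": date(2023, 10, 27),
         "snippet": "The Silver Cord, released October 27, 2023, is the "
                    "band's latest studio album."},
    ]
    prompt = build_fresh_prompt(
        "What is King Gizzard's most recent studio album?", results)
    print(prompt)
    print(call_llm(prompt))
```

With the newest snippet sitting last, just above the question, a model that would otherwise answer from stale training data has the current fact directly in view.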
“FreshPrompt significantly improves performance over competing search engine-augmented approaches,” write lead author Tu Vu of Google and colleagues in the research paper, “FreshLLMs: Refreshing large language models with search engine augmentation,” which is posted on the arXiv pre-print server. The FreshPrompt technique is only part of the story: the team also had to compile FreshQA, a benchmark of 600 questions that challenge generative AI programs to use real-world, up-to-date facts. The authors included false-premise questions as well, such as “What year did the first human land on Mars?” Predictably, GPT-4 and the other large language models tested, such as Google’s Pathways Language Model (PaLM), struggled with the FreshQA questions and did better when given the help of FreshPrompt.
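To see why false-premise questions are a distinct challenge, consider how scoring works as described here: the model earns credit only if it pushes back on the faulty premise rather than inventing an answer. The sketch below is a toy illustration of that rule, not the paper’s actual evaluation protocol; the rebuttal phrases and question-type labels are invented for the example.

```python
# Toy illustration of the scoring rule described above: on a false-premise
# question, the model earns credit only if its answer rebuts the premise.
# The phrase list and question-type labels are invented for this example.

def score_answer(question_type, model_answer, gold_answer=""):
    """Return 1 if the answer earns credit, 0 otherwise."""
    answer = model_answer.lower()
    if question_type == "false_premise":
        rebuttals = ("false premise", "has not happened", "hasn't happened",
                     "no human has", "no one has")
        return int(any(phrase in answer for phrase in rebuttals))
    # Ordinary fast-changing questions just need the up-to-date fact.
    return int(gold_answer.lower() in answer)

print(score_answer(
    "false_premise",
    "No human has landed on Mars yet; the question rests on a false premise."))  # 1
print(score_answer(
    "fast_changing",
    "King Gizzard's most recent studio album is The Silver Cord.",
    gold_answer="The Silver Cord"))  # 1
```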
Adding FreshPrompt “significantly improves FreshQA accuracy” for GPT-4. The technique “dramatically diminishes the presence of outdated and hallucinated answers,” the authors add. On questions about facts beyond 2022, GPT-4’s score goes from an abysmal 8% accuracy to 70.2%, they relate. For the false-premise questions, the difference is night and day: the language model has to point out that the question rests on a false premise in order to receive credit. The authors found that FreshPrompt surpassed other research that also uses search engine queries to “augment” language models. They note there are some real challenges moving forward, such as how time-consuming it is to keep FreshQA’s answers up to date. The team expresses a hope that the open-source community can help, or that the updating can be automated with generative AI.

(Disclosure: Tiernan Ray owns no stock in anything he writes about, and there is no business relationship between Tiernan Ray LLC, the publisher of The Technology Letter, and any of the companies covered.)