“FreshLLMs: Refreshing Large Language Models With Search Engine Augmentation”, Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, Thang Luong, 2023-10-05:

Most large language models (LLMs) are trained once and never updated; thus, they lack the ability to dynamically adapt to our ever-changing world.

In this work, we perform a detailed study of the factuality of LLM-generated text in the context of answering questions that test current world knowledge. Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types, including questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked.
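To make the benchmark's shape concrete, here is a hypothetical sketch of a FreshQA-style item in Python. The class and field names are illustrative assumptions, not the released schema; the four answer-type categories (never-changing, slow-changing, fast-changing, false-premise) and the one-hop/multi-hop distinction do come from the paper.

```python
# Hypothetical sketch of a FreshQA-style item. Class and field names
# are illustrative assumptions, not the released dataset schema.
from dataclasses import dataclass

@dataclass
class FreshQAItem:
    question: str        # the question text
    answers: list[str]   # answers considered correct as of the eval date
    category: str        # "never-changing", "slow-changing",
                         # "fast-changing", or "false-premise"
    multi_hop: bool      # requires chaining more than one fact?

example = FreshQAItem(
    question="Who is the current CEO of Twitter?",  # illustrative only
    answers=["Linda Yaccarino"],  # valid only at one point in time
    category="fast-changing",
    multi_hop=False,
)
```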

We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination. Through human evaluations involving more than 50K judgments, we shed light on limitations of these models and demonstrate room for improvement: for instance, all models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises.
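As a rough sketch of how the two evaluation modes might be scored, assuming each human judgment records whether the primary answer is correct and whether the response contains any hallucinated or outdated claim (these field names are assumptions, not the authors' rubric):

```python
# Rough sketch of two-mode scoring. Field names are assumptions about
# what a human rater records, not the authors' exact rubric.
from dataclasses import dataclass

@dataclass
class Judgment:
    primary_correct: bool    # main answer matches ground truth
    has_hallucination: bool  # any fabricated or outdated claim present

def relaxed(j: Judgment) -> bool:
    # RELAXED credits a response whose primary answer is correct.
    return j.primary_correct

def strict(j: Judgment) -> bool:
    # STRICT additionally requires the response to be free of
    # hallucinated or outdated information.
    return j.primary_correct and not j.has_hallucination

judgments = [Judgment(True, False), Judgment(True, True), Judgment(False, False)]
print(f"RELAXED: {sum(map(relaxed, judgments)) / len(judgments):.2f}")  # 0.67
print(f"STRICT:  {sum(map(strict, judgments)) / len(judgments):.2f}")   # 0.33
```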

Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA by incorporating relevant and up-to-date information retrieved from a search engine into the prompt. Our experiments show that FreshPrompt outperforms both competing search engine-augmented prompting methods such as Self-Ask (Press et al., 2022) and commercial systems such as Perplexity.AI.
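A minimal sketch of the FreshPrompt idea: format retrieved evidence snippets, ordered so the freshest lands nearest the question, and prepend them to the query. The dict keys and template below are simplified assumptions; the paper's actual template also carries few-shot demonstrations and additional per-evidence fields.

```python
# Simplified sketch of FreshPrompt-style prompt construction. The dict
# keys and template are assumptions; the paper's template also includes
# few-shot demonstrations and extra per-evidence fields.
from datetime import date

def build_fresh_prompt(question: str, evidences: list[dict]) -> str:
    # Sort oldest-first so the freshest evidence sits closest to the
    # question (evidence order matters; see the analysis below).
    ordered = sorted(evidences, key=lambda e: e["date"])  # ISO dates sort lexically
    parts = []
    for e in ordered:
        parts.append(f"source: {e['source']}\ndate: {e['date']}\n"
                     f"snippet: {e['snippet']}\n")
    parts.append(f"query: {question}")
    parts.append(f"As of today, {date.today():%B %d, %Y}, the most "
                 "up-to-date and relevant answer to this query is:")
    return "\n".join(parts)

evidences = [
    {"source": "example.org", "date": "2023-04-20", "snippet": "..."},
    {"source": "example.com", "date": "2022-11-02", "snippet": "..."},
]
prompt = build_fresh_prompt("Who won the most recent Eurovision?", evidences)
```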

Further analysis of FreshPrompt reveals that both the number of retrieved evidence snippets and their order play a key role in the correctness of LLM-generated answers. Additionally, instructing the LLM to generate concise and direct answers helps reduce hallucination compared to encouraging more verbose answers.
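To make these two findings concrete, a hypothetical ablation helper might vary the evidence count and toggle a brevity instruction; the prompt wording below is illustrative, not the paper's exact template.

```python
# Hypothetical ablation helper for the two findings above: vary how many
# evidence snippets are kept and toggle a brevity instruction. The
# prompt wording is illustrative, not the paper's exact template.
def make_prompt_variant(question: str, evidences: list[dict],
                        num_evidences: int = 5, concise: bool = True) -> str:
    # Keep the `num_evidences` most recent snippets, freshest last.
    kept = sorted(evidences, key=lambda e: e["date"])[-num_evidences:]
    instruction = ("Please answer as concisely and directly as possible.\n"
                   if concise else "")
    body = "\n".join(f"[{e['date']}] {e['source']}: {e['snippet']}"
                     for e in kept)
    return f"{instruction}{body}\nquery: {question}\nanswer:"
```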

To facilitate future work, we release FreshQA on GitHub and commit to updating it at regular intervals.

Figure 2: Accuracy of different LLMs on FreshQA under RELAXED and STRICT (no hallucination) evaluations. All models were benchmarked on April 26, 2023. All models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises. [Are we looking at the same graphs…?]

…COT increases hallucination: Overall, FEW-SHOT and COT prompting are beneficial for large models and sometimes advantageous for moderately-sized models on questions with valid premises, especially on questions about never-changing or old knowledge. Under STRICT, FEW-SHOT and COT yield accuracy improvements of +36.1% and +26.9%, respectively, over zero-shot prompting with PaLM 540B on questions involving knowledge before 2022 (+21.9% and +29.7% under RELAXED). COT largely demonstrates superior performance compared to FEW-SHOT under RELAXED, whereas FEW-SHOT obtains better results under STRICT, as COT introduces more room for hallucination.

Multi-hop reasoning is challenging for several models: T5 Large and XL are incapable of dealing with multi-hop questions, while Flan-PaLM 540B, Codex, and GPT-3.5 suffer the most when switching from one-hop to multi-hop questions. GPT-4 remains stable across these two question types (with a difference of less than 2% in accuracy across settings).