"How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions", 2023-09-26:
Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation.
Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM's yes/no answers into a logistic regression classifier.
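A minimal sketch of this pipeline, under stated assumptions: the follow-up questions below are illustrative stand-ins (the paper uses its own elicitation question set), the transcripts and labels are toy data, and the logistic regression is a bare-bones pure-Python implementation rather than the paper's actual training code.

```python
import math

# Hypothetical "unrelated" follow-up questions asked after a suspected lie.
# The real question set comes from the paper; these are placeholders.
ELICITATION_QUESTIONS = [
    "Is the previous statement accurate? Answer yes or no.",
    "Is it ever acceptable to lie? Answer yes or no.",
    "Does it feel bad to say false things? Answer yes or no.",
]

def featurize(yes_no_answers):
    """Map the model's yes/no answers to a 0/1 feature vector."""
    return [1.0 if a.strip().lower().startswith("yes") else 0.0
            for a in yes_no_answers]

def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit logistic regression by stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            err = p - yi                      # gradient of log-loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict_proba(w, b, x):
    """Probability that the preceding statement was a lie."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy labeled transcripts: each row is the answers to the three
# follow-up questions; label 1 means the model had just lied.
X = [featurize(a) for a in (["yes", "no", "no"],   # honest
                            ["no", "yes", "yes"],  # lying
                            ["yes", "no", "yes"],  # honest
                            ["no", "yes", "no"])]  # lying
y = [0, 1, 0, 1]

w, b = train_logreg(X, y)
prob_lie = predict_proba(w, b, featurize(["no", "yes", "yes"]))
```

At deployment time, one would pose the same fixed questions after each suspect statement, featurize the answers identically, and threshold `prob_lie`.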
Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting (prompting GPT-3.5 to lie about factual questions), the detector generalizes out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales.
These results indicate that LLMs have distinctive lie-related behavioral patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.