Large language models (LLMs) excel at many tasks in 2023, but they still face challenges in complex reasoning. Theory-of-mind (ToM) tasks, which require understanding agents' beliefs, goals, and mental states, are essential for common-sense reasoning involving humans, making it crucial to enhance LLM performance in this area.
This study measures the ToM performance of GPT-4 and three GPT-3.5 variants (Davinci-2, Davinci-3, GPT-3.5-Turbo), and investigates the effectiveness of in-context learning in improving their ToM comprehension. We evaluated prompts featuring two-shot chain-of-thought reasoning and step-by-step thinking instructions.
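The four prompting conditions (Zero-Shot; Zero-Shot plus step-by-step thinking; Two-Shot chain-of-thought; Two-Shot chain-of-thought plus step-by-step thinking) can be sketched as simple prompt assembly. The scenario, question, and chain-of-thought examples below are hypothetical placeholders invented for illustration, not the study's actual stimuli or wording.

```python
STEP_BY_STEP = "Let's think step by step:"

# Hypothetical two-shot chain-of-thought examples (invented for illustration;
# the study's actual worked examples are not reproduced here).
COT_EXAMPLES = (
    "Prompt: Sam puts his keys in the drawer and leaves the room. "
    "Alex moves the keys to the shelf.\n"
    "Q: Where will Sam look for his keys?\n"
    "A: Sam last saw the keys in the drawer, so he will look in the drawer.\n\n"
    "Prompt: Mia believes the store is open, but it closed early.\n"
    "Q: Where does Mia go?\n"
    "A: Mia acts on her belief, so she goes to the store.\n\n"
)

def build_prompt(scenario: str, question: str,
                 two_shot_cot: bool = False,
                 step_by_step: bool = False) -> str:
    """Assemble the input text for one trial in a given prompting condition."""
    parts = []
    if two_shot_cot:
        parts.append(COT_EXAMPLES)           # prepend worked examples
    parts.append(f"Prompt: {scenario}\nQ: {question}\nA:")
    if step_by_step:
        parts.append(f" {STEP_BY_STEP}")     # append the step-by-step cue
    return "".join(parts)

# Example: the strongest condition, Two-Shot CoT plus step-by-step thinking.
scenario = ("Anna places her book on the table and goes outside. "
            "Ben hides the book under the sofa.")
question = "Where will Anna look for her book?"
prompt = build_prompt(scenario, question, two_shot_cot=True, step_by_step=True)
```

The resulting string would then be sent to the model as a single completion request; the Zero-Shot baseline corresponds to calling `build_prompt` with both flags off.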
We found that LLMs trained with Reinforcement Learning from Human Feedback (RLHF) (all models except Davinci-2) improved their ToM accuracy via in-context learning. GPT-4 performed best in zero-shot settings, reaching nearly 80% ToM accuracy, but still fell short of the 87% human accuracy on the test set.
However, when supplied with prompts for in-context learning, all RLHF-trained LLMs exceeded 80% ToM accuracy, with GPT-4 reaching 100%.
These results demonstrate that appropriate prompting enhances LLM ToM reasoning, and they underscore the context-dependent nature of LLM cognitive capacities.
Figure 1: Demonstration of Prompting Methods Used for Boosting ToM Reasoning in LLMs. Examples of the four prompting types used to test the ToM performance of LLMs. Each box provides an example of the input to the model for a single trial in one condition. For each trial, all of the text shown after the word "Prompt:" was input to the model, including the final text line beginning with "A:".
Figure 3: Effects of In-context Learning Prompts on ToM performance in LLMs. ToM performance of models using various in-context learning methods. For each model, the gray bar on the far left shows the Zero-Shot baseline ToM performance. The next 3 bars (orange) show the ToM performance on Zero-Shot plus SS Thinking; Two-Shot CoT; and Two-Shot CoT plus SS Thinking. Error bars indicate the standard deviation across 20 repetitions (see Figure 2, caption).