“LLMs Achieve Adult Human Performance on Higher-Order Theory of Mind Tasks”, Winnie Street, John Oliver Siy, Geoff Keeling, Adrien Baranes, Benjamin Barnett, Michael McKibben, Tatenda Kanyere, Alison Lentz, Blaise Aguera y Arcas, Robin I. M. Dunbar (2024-05-29):

This paper examines the extent to which large language models (LLMs) have developed higher-order theory of mind (ToM): the human ability to reason about multiple mental and emotional states in a recursive manner (e.g. “I think that you believe that she knows”).

This paper builds on prior work by introducing a handwritten test suite—Multi-Order Theory of Mind Q&A—and using it to compare the performance of 5 LLMs to a newly gathered adult human benchmark.

We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance on ToM tasks overall, and that GPT-4 exceeds adult performance on 6th order inferences.

Our results suggest that there is an interplay between model size and finetuning for the realization of ToM abilities, and that the best-performing LLMs have developed a generalized capacity for ToM. Given the role that higher-order ToM plays in a wide range of cooperative and competitive human behaviors, these findings have implications for user-facing LLM applications.

…We used Google Colaboratory to call the GPT-3.5, GPT-4, LaMDA, PaLM, and Flan-PaLM APIs programmatically. Each call concatenated the story with a single statement at a time. In total, we processed 7 stories with 20 statements each across the 4 conditions listed above, and therefore collected 560 sets of 12 candidate logprobs, amounting to 5600 individual data points for each of the five language models studied.
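The prompt-construction loop described above can be sketched as follows. This is a minimal illustration, not the authors' code: `get_candidate_logprobs` is a hypothetical placeholder for the actual API calls (which would hit the GPT, PaLM, or LaMDA endpoints and request top logprobs), and the story, statement, and condition labels are dummy stand-ins.

```python
from itertools import product

def get_candidate_logprobs(model, story, statement, n_candidates=12):
    """Hypothetical stand-in for an LLM API call.

    In the real setup this would send the concatenated story and
    statement to the model and return the candidate logprobs.
    """
    return [0.0] * n_candidates  # dummy values for illustration

stories = [f"story_{i}" for i in range(7)]              # 7 stories
statements = [f"statement_{j}" for j in range(20)]      # 20 statements each
conditions = ["cond_A", "cond_B", "cond_C", "cond_D"]   # 4 conditions

# One API call per (story, statement, condition) combination,
# each returning one set of candidate logprobs.
results = []
for story, statement, condition in product(stories, statements, conditions):
    prompt_story = f"{story} [{condition}]"
    logprobs = get_candidate_logprobs("some-model", prompt_story, statement)
    results.append(logprobs)

print(len(results))  # 7 * 20 * 4 = 560 sets of candidate logprobs
```

Each model would be run through the same 560-call sweep, so the per-model data volume scales with the number of candidate logprobs returned per call.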

The API calls for LaMDA, PaLM, and Flan-PaLM were conducted in February 2023. The calls for GPT-3.5 and GPT-4 were conducted in December 2023 and January 2024, respectively.