“Large Language Models Are Able to Downplay Their Cognitive Abilities to Fit the Persona They Simulate”, Jiří Milička, Anna Marklová, Klára VanSlambrouck, Eva Pospíšilová, Jana Šimsová, Samuel Harvan, Ondřej Drobil, 2024:

This study explores the capabilities of large language models to replicate the behavior of individuals with underdeveloped cognitive and language skills. Specifically, we investigate whether these models can simulate child-like language and cognitive development while solving false-belief tasks, namely change-of-location and unexpected-content tasks. GPT-3.5-turbo and GPT-4 models by OpenAI were prompted to simulate children (n = 1296) aged one to six years. This simulation was instantiated through three types of prompts: plain zero-shot, chain-of-thoughts, and primed-by-corpus.
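The persona-simulation setup described above can be sketched as a prompt-construction step. The wording below is a hypothetical re-creation for illustration, not the paper's actual prompt; the function name and phrasing are assumptions.

```python
def build_child_persona_messages(age_years, task_text, style="zero-shot"):
    """Build chat messages that ask a model to answer as a child of a given age.

    Illustrative sketch only: the study's exact prompt wording is not
    reproduced here, and `style` mimics its prompt types loosely.
    """
    system = (
        f"You are simulating a {age_years}-year-old child. "
        "Answer the question the way a child of that age would, "
        "using age-appropriate language and reasoning."
    )
    if style == "chain-of-thoughts":
        system += " Think through the situation step by step before answering."
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": task_text},
    ]

# Example change-of-location (Sally-Anne style) false-belief task:
task = (
    "Sally puts her marble in the basket and leaves the room. "
    "Anne moves the marble into the box. "
    "Where will Sally look for her marble when she comes back?"
)
messages = build_child_persona_messages(4, task)
```

These messages could then be sent to the OpenAI chat API with the chosen model and temperature; the response's correctness (basket vs. box) would score the simulated child's theory-of-mind performance.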

We evaluated the correctness of responses to assess the models’ capacity to mimic the cognitive skills of the simulated children. Both models displayed increasing response correctness and rising language complexity with simulated age, corresponding to the gradual enhancement of linguistic and cognitive abilities documented in the vast body of research literature on child development. GPT-4 generally exhibited a closer alignment with the developmental curve observed in ‘real’ children. However, it displayed hyper-accuracy under certain conditions, notably in the primed-by-corpus prompt type. Task type, prompt type, and the choice of language model influenced developmental patterns, while temperature and the gender of the simulated parent and child did not consistently affect results.

We conducted analyses of linguistic complexity, examining utterance length and Kolmogorov complexity. These analyses revealed a gradual increase in linguistic complexity corresponding to the age of the simulated children, regardless of other variables.
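Kolmogorov complexity is uncomputable, so in practice it is approximated; one common proxy is the compression ratio of the text. The sketch below uses `zlib` for this purpose as an illustration only: it is an assumption, not necessarily the estimator the authors used.

```python
import zlib

def compression_ratio(text: str) -> float:
    """Proxy for Kolmogorov complexity: compressed size / raw size.

    Less predictable text compresses worse, yielding a higher ratio.
    Illustrative assumption; not claimed to match the paper's method.
    """
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, level=9)) / len(raw)

# Repetitive, child-like utterances compress well (low ratio) ...
simple = "mama ball mama mama ball ball mama ball " * 10

# ... while varied, adult-like phrasing compresses poorly (higher ratio).
complex_utt = (
    "Sally put her marble in the basket before leaving the room, "
    "but Anne quietly moved it into the box, so Sally will probably "
    "search the basket first because she never saw the switch happen."
)
```

Under this proxy, the reported rise in linguistic complexity with simulated age would show up as a rising compression ratio across the generated utterances.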

These findings show that large language models are capable of downplaying their abilities to achieve a faithful simulation of prompted personas.

…A consistent pattern did not emerge regarding the impact of temperature (Figure 4)… In general, the ascent in complexity appeared less steep in personas generated by GPT-4 than in those produced by GPT-3.5-turbo. As in the case of ToM, no clear pattern was observed concerning the effects of temperature (Figure 9).