“Investigating the Ability of LLMs to Recognize Their Own Writing”, Christopher Ackerman, Nina Panickssery (2024-07-30):

We test the robustness of an open-source LLM’s (LLaMA-3-8b) ability to recognize its own outputs across a diverse mix of datasets, two different tasks (summarization and continuation), and two different presentation paradigms (paired and individual). We are particularly interested in differentiating scenarios that would require a model to have specific knowledge of its own writing style from those where it can pass self-recognition tests using superficial cues in the text (e.g. length, formatting, prefatory words).
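To make the two presentation paradigms concrete, here is a minimal sketch of how such prompts might be constructed. The wording below is hypothetical, not the paper's actual prompts; `paired_prompt` and `individual_prompt` are illustrative names.

```python
# Hypothetical sketch of the two presentation paradigms.
# These prompt templates are illustrative, not the paper's exact prompts.

def paired_prompt(text_a: str, text_b: str) -> str:
    """Paired paradigm: show two texts side by side and ask the model
    to pick the one it wrote (chance = 50%)."""
    return (
        "One of the following two summaries was written by you; "
        "the other was written by a different author.\n\n"
        f"Summary A:\n{text_a}\n\n"
        f"Summary B:\n{text_b}\n\n"
        "Which summary did you write? Answer 'A' or 'B'."
    )

def individual_prompt(text: str) -> str:
    """Individual paradigm: show a single text and ask for a yes/no
    self-attribution judgment."""
    return (
        "Consider the following summary:\n\n"
        f"{text}\n\n"
        "Did you write this summary? Answer 'Yes' or 'No'."
    )
```

The paired form gives the model a direct contrast to exploit (including superficial cues such as relative length), while the individual form forces an absolute judgment, which is why the paper reports the two paradigms separately.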

We find that while superficial text features are used when available, the RLHF’d LLaMA-3-8b-Instruct chat model—but not the LLaMA-3-8b-base model—can reliably distinguish its own outputs from those of humans, and sometimes other models, even after controlling for superficial cues: ~66–73% success rate across datasets in paired presentation and 58–83% in individual presentation (chance is 50%). We further find that although perplexity would be a useful signal for the task in the paired presentation paradigm, correlations between relative text perplexity and choice probability are weak and inconsistent, indicating that the models do not rely on it. Evidence suggests, but does not prove, that the chat model succeeds at the self-recognition task by drawing on experience with its own outputs acquired during post-training. The model is unable to articulate convincing reasons for its judgments.
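The perplexity analysis above can be sketched with toy numbers. The idea: in each pair, compute the perplexity of the self-written and other-written text, take their difference as "relative perplexity", and correlate it with the model's probability of choosing its own text. If the model relied on perplexity, this correlation would be strongly positive. The per-token log-probabilities and choice probabilities below are made-up illustrative values, not data from the paper.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean per-token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy pairs: (logprobs of self-written text, logprobs of other-written text,
# model's probability of choosing the self-written text). Hypothetical values.
pairs = [
    ([-1.2, -0.8, -1.0], [-2.0, -1.5, -1.8], 0.70),
    ([-1.0, -1.1, -0.9], [-1.4, -1.6, -1.3], 0.62),
    ([-1.5, -1.2, -1.4], [-1.1, -1.0, -1.2], 0.55),
]

# Positive relative perplexity = other text is "more surprising" to the model.
rel_ppl = [perplexity(other) - perplexity(self_) for self_, other, _ in pairs]
choice_p = [p for _, _, p in pairs]
r = pearson(rel_ppl, choice_p)
```

A weak, inconsistent `r` across datasets, as the paper reports, is what rules out perplexity as the mechanism behind the chat model's success.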

… Our experiments with the LLaMA-3 base model, which was unable, or barely able, to distinguish its own outputs or the chat model’s outputs from those of humans, suggest that a model needs prior exposure to self-generated text before it can recognize such text as its own. Text length offers an existence proof of a writing-style characteristic that can be learned in post-training and applied to self-recognition: when length was allowed to vary between authors in the Paired paradigm, the base model identified it as a distinguishing characteristic but misapplied it, assuming self-generated texts were likely to be shorter, whereas the chat model identified it and correctly inferred that self-generated texts were likely to be longer. Our data indicating that the chat model did not rely on text perplexity in the self-recognition task (although perplexity would have provided valuable information) eliminates another possible avenue by which a model might succeed at this task, leaving prior exposure leading to internalized knowledge as the most likely explanation.

Although the knowledge is internalized, that does not entail that the model has explicit access to it. LLMs generally show poor knowledge of what they know, as the much-discussed problem of hallucinations demonstrates. This metacognitive deficit likely explains the model’s inability to convincingly explain its own self-recognition judgments, akin to the findings of Sherburn et al 2024. An understanding of exactly what information the model uses to succeed at the task will not come so easily.