"Challenges With Unsupervised LLM Knowledge Discovery", 2023-12-15 ():
We show that existing unsupervised methods on large language model (LLM) activations do not discover knowledge; instead they seem to discover whatever feature of the activations is most prominent. The idea behind unsupervised knowledge elicitation is that knowledge satisfies a consistency structure, which can be used to discover knowledge.
We first prove theoretically that arbitrary features (not just knowledge) satisfy the consistency structure of a particular leading unsupervised knowledge-elicitation method, contrast-consistent search (CCS). We then present a series of experiments showing settings in which unsupervised methods result in classifiers that do not predict knowledge, but instead predict a different prominent feature.
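For context, the consistency structure in question is the CCS objective of Burns et al. (2022): a probe's probabilities on a contrast pair (a statement and its negation) should sum to one, without collapsing to the trivial 0.5 answer. A minimal sketch, assuming `p_pos` and `p_neg` are the probe's outputs on the two halves of a batch of contrast pairs:

```python
import torch

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    """CCS objective (Burns et al. 2022): probe outputs on a contrast pair
    should be consistent (p_pos + p_neg = 1) and confident (not both 0.5)."""
    consistency = (p_pos - (1 - p_neg)) ** 2       # negation should flip the probability
    confidence = torch.minimum(p_pos, p_neg) ** 2  # rule out the degenerate p = 0.5 probe
    return (consistency + confidence).mean()
```

The theoretical point is visible in this loss: any binary feature that flips under negation, not only truth, drives the consistency term to zero, so nothing in the objective singles out knowledge.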
We conclude that existing unsupervised methods for discovering latent knowledge are insufficient, and we contribute sanity checks to apply when evaluating future knowledge elicitation methods. Conceptually, we hypothesize that the identification issues explored here, e.g. distinguishing a model's knowledge from that of a simulated character, will persist for future unsupervised methods.