Quantifying Truesight With SAEs
Proposal to use SAEs to crack open the dark matter of LLM inference about text.
LLMs know more than they say, like who wrote some text. How can we find out how much they really know, when their internals are a mysterious scrambled black box?
But we can use sparse autoencoders to turn their internal thoughts into a long list of simpler properties summarizing what the LLM knows. We don’t know in advance what any of those properties mean, but we can still do many things with this summary.
We can find out what a LLM knows about an author by defining that as “the thoughts that it thinks for each of their writings”, and we can use that to define their ‘persona’; now we can make the LLM write like an author, and ‘ask the author questions’ to see what the LLM knows about them.
We can do a lot of other things, too, like see which parts of the text are hints about the persona (to analyze or modify them), or how much text we need for a good persona. We can also look at how their persona changes over time, or whose personas are similar, or how similar their relatives’ personas are (or maybe find those relatives).
LLMs are capable of inferring many latent variables about the data-generating process—in particular, about the author. This capability, “truesight”, means they can identify age, nationality, or sex, and, for publicly known authors with meaningful corpuses, can often infer the identity of the writer from as little as a paragraph of text.
How much can they infer? “Sampling can show the presence of knowledge but not the absence”, and so most truesight demonstrations are limited to what can be easily prompted out of a LLM. This is a loose lower bound on the LLM’s true knowledge of a passage, because it may struggle to verbalize its knowledge (just like a human often has little conscious access to things that they perceive in an image or piece of text), have reasons to deceive the user, or simply never be asked the particular question that would reveal it.
We would like to get an idea of how much total truesight a LLM has, to avoid unpleasant surprises. This is relevant to understanding the misuse of LLMs and their security implications, but also to understand what the LLMs know about themselves—to what extent can a LLM truesight its own environment, and infer things like, “I am being tested for safety by OpenAI, and I will pretend to be nice until the testing is over.”? How could we do that?
We cannot simply ask it a lot of questions, as explained above.
We cannot train a LLM with some dataset of personal information, because we do not really have datasets with all possible privacy violations and corresponding text samples (nor do we have the ability to train most LLMs of interest), and any result will still be a lower bound—if we had more data in that dataset, perhaps we would get better results?
We could try to examine the LLM activations directly. They form an ‘embedding’ of everything the LLM knows about a text, and by definition, the truesight knowledge must be somewhere in there. Unfortunately, LLM activations are notoriously complicated, messy objects which are intended for efficient computation inside the LLM, and not for us to read things out of like “the author of this text is probably a 25yo white woman from San Diego who is writing on her smartphone”. They are highly non-linear, compressed, squashed-together, context-dependent: what one number means depends on all the other numbers. They are not ‘independent’ or nicely ‘linear’.
Fortunately for us, NN work has found a powerful way to unscramble the embedding omelette, and turn the small dense embedding into a (very) large simple linear embedding, where most numbers are zero, and relatively few are non-zero: Sparse Autoencoders (SAEs; eg). When we process text through the LLM and take the dense embedding and unpack it via a SAE, we will find that a passage mentioning ‘San Francisco’ may trigger a single entry which turns out to correspond to a sort of concept of ‘San Francisco’ (this is how the famous “Golden Gate Claude” worked: simply forcing it to keep that entry active). So a SAE would be expected to have features which give us a clean readout: “25yo”, “adult woman”, “Caucasian”, “from San Diego”, “using smartphone”. If we had enough data from people who were writing on smartphones, we could look at them, and see what feature they all activate; that might be the “smartphone usage” feature.
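To make this concrete, here is a minimal sketch (assuming we already have pooled per-text LLM activations and the encoder weights `W_enc`/`b_enc` of a trained SAE, which are placeholders here): encode each text into sparse features, then look for the features that are active across nearly all texts sharing a known label, like “written on a smartphone”.

```python
# Minimal sketch, assuming pooled dense activations and a trained SAE encoder are given.
import numpy as np

def sae_encode(activation, W_enc, b_enc):
    """Encode a dense activation vector into sparse SAE features (ReLU of a linear map)."""
    return np.maximum(0, activation @ W_enc + b_enc)

def shared_features(activations, W_enc, b_enc, min_frac=0.9):
    """Return indices of SAE features active in at least `min_frac` of the texts."""
    codes = np.stack([sae_encode(a, W_enc, b_enc) for a in activations])
    active_frac = (codes > 0).mean(axis=0)        # fraction of texts activating each feature
    return np.where(active_frac >= min_frac)[0]   # candidate shared features

# Hypothetical usage: `smartphone_acts` = activations for texts known to be written on
# smartphones; the returned indices are candidates for a "smartphone usage" feature.
# features = shared_features(smartphone_acts, W_enc, b_enc)
```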
But unfortunately, as neat as this would be, it still doesn’t answer the question: we don’t have a dataset with every possible truesight capability labeled in large volumes which we could simply throw at a SAE and see what all the relevant features are, and thence directly measure the total truesight capability.
Our SAE simply gives us a very large ‘fingerprint’ or ‘DNA’ of a text, which is well-behaved, but does not by itself volunteer any information. It will not tell us about the total truesight potential, or point out unexpected patterns. So what can we do with it?
One possibility is to ask, what can we do with the ‘DNA’ of some organism that we can’t directly understand? It turns out that there are many useful things you can do with DNA that you don’t understand—like identical twin studies! All you know is that identical twins are much more genetically similar than ordinary siblings are to each other, and yet you can learn a lot about how much those genetic differences cause differences in their traits, like diseases, even though you never learn which gene is responsible (or how). This is especially true when there are a lot of genes with small ‘simple’ effects which just add up. But these variance components are very useful to know, and the approach can be extended and refined in many ways: as long as you have ways of manipulating similarity, you can extract high-level quantities about the net causes of those similarities.
For example, we can ask, “how much of a measurement is random noise day-to-day?”, and look at how similar measurements on the same person are, when done on different days: since we know how similar a person is across days (100%, they are the same person), any difference must reflect some sort of intrinsic randomness. This test-retest reliability is important: if our measurement is 100% the same, great; if it’s approaching 0%, then that instability means it’s not telling us about anything permanent.
Our case is similar: we have texts whose similarity we can know, by construction or analysis, and we can turn them into ‘DNA’ (SAE embeddings), and compare the similarity of those embeddings. So, we could look at writings by the same person across different days in response to the same prompt; this gives us an upper bound on how much each sample of text could reveal about the person.
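A sketch of that comparison, assuming each author’s writing for each day has already been pooled into a single SAE code (`codes` below is a hypothetical dict mapping author → list of per-day codes):

```python
# Sketch of the test-retest idea: within-author similarity across days vs. between-author similarity.
import numpy as np
from itertools import combinations

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retest_reliability(codes):
    """codes: {author: [per-day SAE codes]}; returns (mean within-author, mean between-author) similarity."""
    within, between = [], []
    authors = list(codes)
    for author in authors:
        within += [cosine(x, y) for x, y in combinations(codes[author], 2)]
    for a1, a2 in combinations(authors, 2):
        between += [cosine(x, y) for x in codes[a1] for y in codes[a2]]
    return np.mean(within), np.mean(between)
```

If the within-author similarity is barely higher than the between-author similarity, the ‘fingerprint’ is mostly day-to-day noise; if it is much higher, each sample carries a stable author signal.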
We can also take an author’s responses across a series of texts which differ in topic, and try to remove the features which vary and keep the invariant ones. This will tend to capture everything the LLM has managed to infer about ‘the author’, independent of any specific topic or sample. (We might keep only the SAE entries which are non-zero across most of the samples, or average them, or something else. It may even work to concatenate the shuffled samples into a single long-context LLM call, now that context windows are so large, and average over that.) This would let us clone author styles and perspectives, akin to Golden Gate Claude. We might call this specific embedding a ‘persona’, and it would yield ‘text style transfer’: we could hardwire it, and have the LLM write ‘in the style of X’. We can also measure the worsened performance of the LLM in trying to predict text written by Y when hardwired to assume that it was X: this gives us an absolute measurement of how useful truesight is (eg. “author inferences are worth ~0.01 bits per character on a social media corpus”). And then we can investigate manipulations: if we explicitly tell the model information about the author, this will repair some of the damage, and we can quantify what percentage of the truesight we have accounted for. (We will probably find, per the ‘bet on sparsity’ principle & ‘everything is correlated’ literature, that specifying age, sex, location, and name accounts for the majority of truesight’s predictive compression benefits, but that the rest is due to tens of thousands of subtler attributes.) Since we have a large set of personas, we can easily train a LLM as an embedding model to directly output a predicted persona from a short text snippet, allowing a bootstrap to elicit the best possible persona (ie. compute a persona on a large sample from an author, and then train the persona model to predict that persona on each of the sub-samples).
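A minimal sketch of that persona-extraction step, assuming per-sample SAE codes for one author are already computed (the `min_frac` threshold and the mean-pooling are arbitrary illustrative choices, not a tested recipe):

```python
# Sketch: keep only the topic-invariant SAE features and average them into a 'persona' vector.
import numpy as np

def extract_persona(sample_codes, min_frac=0.75):
    """sample_codes: array of shape (n_samples, n_features) of SAE codes for one author's texts."""
    codes = np.asarray(sample_codes)
    invariant = (codes > 0).mean(axis=0) >= min_frac   # features active in most samples
    return codes.mean(axis=0) * invariant              # zero out the topic-varying features
```

The resulting vector could then be decoded back into a residual-stream direction and hardwired for style transfer, or used as the prediction target when training the bootstrap persona model.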
But this might do something even more interesting: beyond simply chatting with X or generating samples of X text which don’t sound so much like the normal Claude chatbot persona, we might be able to interrogate that persona. Because if we force the LLM, with the persona hardwired, to answer a question like “where are you from?”, then the most likely completion would presumably be the LLM’s best guess as to where X is from. So it’d just reply, “oh, I’m from San Diego. You?” And this applies to more freeform questions: “Why don’t you tell us a little about yourself? Who are you, where are you from, what are you doing these days?” “I’m a young woman from San Diego, Charlene Dottenmeyer, and I’m just killing time chatting with you on the bus since my stop isn’t for a while.” Or you could prompt for a capsule biography: “Charlene Dottenmeyer (born 1999); white woman from San Diego, working as a nutritionist (BS from San Diego University)…” Provide a few examples in a nice clean JSON format, and now we have a way to systematically extract structured author metadata.
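One way this interrogation could be wired up, as a hedged sketch only: decode the persona back into a residual-stream steering direction and add it with a PyTorch forward hook while prompting for a JSON biography. The model id, layer index, scale, and `W_dec` below are placeholders for illustration, not a tested setup.

```python
# Sketch: hardwire a persona via a forward hook, then interrogate it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"   # placeholder model, purely for illustration
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# persona: sparse SAE code of shape (n_features,); W_dec: SAE decoder of shape (n_features, d_model).
# steering = torch.tensor(persona @ W_dec, dtype=torch.float32)

def make_hook(steering, scale=4.0):
    """Return a forward hook that adds the persona direction to a transformer block's hidden states."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * steering.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# handle = model.transformer.h[6].register_forward_hook(make_hook(steering))
# prompt = "Q: Tell us a little about yourself, as JSON with keys name, age, location, occupation.\nA:"
# out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=80)
# print(tok.decode(out[0], skip_special_tokens=True)); handle.remove()
```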
The most obvious thing to do with this is to look at how much we can extract as we increase text sample sizes. We can use a ground truth of some variable, like author age, to quantify the scaling curve: how much text do we need for each author to infer their true age to a certain accuracy? (Does it asymptote at 100%? How much text would we need?) We can also look at which text yields the most information gain: which comments leaked the most private information? Which words, exactly, let the LLM infer the persona?
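The scaling-curve measurement might look like this sketch, where `infer_age` is a hypothetical wrapper around the persona-extraction-and-interrogation pipeline above, and the sample sizes and tolerance are arbitrary:

```python
# Sketch: accuracy of inferred age as a function of how many texts per author we use.
import numpy as np

def age_accuracy_curve(corpus, true_ages, sample_sizes=(1, 2, 4, 8, 16, 32), tol=3):
    """corpus: {author: [texts]}; true_ages: {author: int}. Accuracy = within `tol` years."""
    curve = {}
    for n in sample_sizes:
        hits = [abs(infer_age(texts[:n]) - true_ages[a]) <= tol    # infer_age is hypothetical
                for a, texts in corpus.items() if len(texts) >= n]
        curve[n] = float(np.mean(hits)) if hits else float("nan")
    return curve   # does accuracy asymptote, and at what sample size?
```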
An advanced application would be my optimal-interviewing meta-learning procedure: extract the personas of many authors, and then generate many Q&A pairs, and stitch them together in the shortest Q&A sequence which yields the final persona. This is useful for revealing which questions are most useful (because they will show up early in optimal sequences to maximize information gain)—and because the LLM generates the questions and we can brute force generation of thousands or millions of questions, that can potentially reveal informative questions that we would never have thought to ask.
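A greedy approximation of that optimal-interviewing procedure, as a sketch (with a hypothetical `embed_answer` returning the SAE code of a candidate answer, and Euclidean distance to the full-corpus persona standing in for information gain):

```python
# Sketch: greedily pick the questions whose answers move the running persona estimate
# closest to the target persona computed from the author's full corpus.
import numpy as np

def greedy_interview(qa_pairs, target_persona, k=10):
    """qa_pairs: list of (question, answer); returns the k most informative questions, in order."""
    chosen, codes = [], []
    remaining = list(qa_pairs)
    for _ in range(k):
        def gain(pair):
            est = np.mean(codes + [embed_answer(pair[1])], axis=0)   # embed_answer is hypothetical
            return -np.linalg.norm(est - target_persona)             # closer estimate = higher gain
        best = max(remaining, key=gain)
        remaining.remove(best)
        chosen.append(best[0])
        codes.append(embed_answer(best[1]))
    return chosen   # questions appearing early are the most informative ones
```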
Another advanced application would be to directly draw on behavioral genetics, and look at similarity across relatives, since for many writers or authors we know who their relatives are (like parents or siblings). This gives its own set of variance components of truesight, breaking down into genetics/family-influences/sibling-influences/random-error/etc, which can be re-analyzed in ways like genetic correlations over a lifetime (eg. ‘age’ by definition will keep changing).