In writing this paper, there were countless features we thought might be bugs. After careful inspection, nearly all of them revealed surprising and subtle model properties. To me, this capacity for surprise is the true test of a new technique. This thread is about my favorite finding.
The fact that most individual neurons are uninterpretable presents a serious roadblock to a mechanistic understanding of language models. We demonstrate a method for decomposing groups of neurons into interpretable features with the potential to move past that roadblock.
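If you're curious what "decomposing into features" looks like mechanically, here's a minimal sketch of a sparse-autoencoder-style decomposition of MLP activations in PyTorch. The actual architecture, loss terms, and training setup in the paper differ; this is just the shape of the idea.

```python
# Minimal sparse-autoencoder sketch (assumed method; details differ
# from the paper's setup).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_mlp: int, n_features: int):
        super().__init__()
        self.enc = nn.Linear(d_mlp, n_features)
        self.dec = nn.Linear(n_features, d_mlp)

    def forward(self, acts: torch.Tensor):
        # Non-negative (and, after training, sparse) feature activations.
        feats = torch.relu(self.enc(acts))
        # Reconstruct the original MLP activations from the features.
        recon = self.dec(feats)
        return recon, feats

def sae_loss(acts, recon, feats, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes each feature
    # to fire on a narrow, hopefully interpretable, set of inputs.
    return ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
```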
You're familiar with base64 text on the internet, even if you don't know it. It's the alphanumeric strings you see in URLs like "dQw4w9WgXcQ" in piped.video/watch?v=dQw4w9Wg…. Our little 1-layer model could recognize those random strings and try to produce them.
When we split the MLP into a small number of features, we indeed get one feature that is active on all tokens of all base64 strings (orange text below). When it's on, it makes the model predict short random-looking tokens like "zc", "Ct", "ZY", etc.
When we split the MLP into more (4000) features, we get three base64 features. One fires on all base64 strings *except the single digits*. One fires on the digits. And one fires kind of randomly on some strings and not others?
It turns out the letters-vs-digits split was driven by *properties of the tokenizer*. Even on random alphanumeric strings, the tokenizer leaks information about future tokens! If you see a "7", the next token can't be an "8", because the pair would have been tokenized as "78".
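You can check this kind of leakage yourself. A sketch using the GPT-2 BPE via tiktoken as a stand-in (our model's tokenizer differs, so the exact merges will too):

```python
# Probe which digit pairs the tokenizer merges. If "78" is a single
# token in this vocabulary, a standalone "7" token can never be
# immediately followed by a standalone "8" token in tokenized text.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

for s in ["7", "8", "78", "a7", "7a"]:
    ids = enc.encode(s)
    print(f"{s!r} -> {ids} ({len(ids)} token(s))")
```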

When we compare the logit weights of those two base64 features, we see they are highly correlated in their predictions, with exactly 10 exceptions: the feature that fires on digits suppresses the digit tokens.
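If you want to reproduce that comparison, something like this sketch works; the weight vectors and vocab here are placeholders, not artifacts from the paper:

```python
# Compare two features' logit weight vectors over the vocabulary and
# surface the tokens where they disagree most (hypothetical inputs).
import numpy as np

def compare_logit_weights(w_letters: np.ndarray, w_digits: np.ndarray,
                          vocab: list[str], k: int = 10):
    # Overall correlation between the two features' predictions.
    r = np.corrcoef(w_letters, w_digits)[0, 1]
    # Tokens the digit feature suppresses relative to the letter feature.
    gap = w_letters - w_digits
    outliers = np.argsort(gap)[-k:]
    return r, [vocab[i] for i in outliers]
```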
The third one was harder to crack. But Brian Chen noticed that it seemed to fire hard on "ICAg", which is what you get when you base64-encode three spaces.
Inspired by this, he ran all the examples where the mystery feature fired through a base64 decoder. Every one decoded cleanly into plaintext, usually code snippets.
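This is easy to replicate in a couple of lines; "ICAg" really does round-trip to three spaces:

```python
# "ICAg" is the base64 encoding of three spaces, and any string the
# mystery feature fires on can be run through a decoder the same way.
import base64

print(base64.b64encode(b"   "))   # b'ICAg'
print(base64.b64decode("ICAg"))   # b'   '

def try_decode(s: str):
    # Hypothetical helper: return the plaintext if s is valid base64,
    # else None (validate=True rejects non-base64 characters).
    try:
        return base64.b64decode(s, validate=True)
    except ValueError:
        return None
```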
It was pretty striking to see the features teach us something about the transformer's understanding of the data. If you want to explore the 4000 features we found and discover something cool, check out our interactive vis! transformer-circuits.pub/202…
Replying to @thebasepoint
Not even sure how feasible this would be, but has there been any work on having an RL system or LLM try to design a better tokenizer?