“AI Is a Black Box. Anthropic Figured Out a Way to Look Inside: What Goes On in Artificial Neural Networks Is Largely a Mystery, Even to Their Creators. But Researchers from Anthropic Have Caught a Glimpse”, Steven Levy, 2024-05-21:

…[paper] I met with Chris Olah and 3 of his colleagues, among 18 Anthropic researchers on the “mechanistic interpretability” team. They explain that their approach treats artificial neurons like letters of Western alphabets, which don’t usually have meaning on their own but can be strung together sequentially to have meaning. “‘C’ doesn’t usually mean something”, says Olah. “But ‘car’ does.” Interpreting neural nets by that principle involves a technique called dictionary learning, which lets you identify a combination of neurons that, when fired in unison, evokes a specific concept, referred to as a feature.
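
[For concreteness, a minimal sketch of what this kind of dictionary learning can look like in code: a sparse autoencoder that decomposes a model’s internal activations into a much larger set of sparsely-firing feature directions. The layer sizes, names, and penalty weight below are illustrative assumptions, not Anthropic’s actual implementation.]

```python
# Hedged sketch: dictionary learning over model activations via a sparse
# autoencoder (SAE). All dimensions and hyperparameters are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, n_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activations -> feature coefficients
        self.decoder = nn.Linear(n_features, d_model)  # columns form the learned "dictionary" of directions

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # sparse, non-negative feature activations
        reconstruction = self.decoder(features)           # rebuild the original activations
        return features, reconstruction

def sae_loss(activations, features, reconstruction, l1_coeff=1e-3):
    # Reconstruct the activations faithfully while keeping few features active at once.
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity
```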

“It’s sort of a bewildering thing”, says Josh Batson, an Anthropic research scientist. “We’ve got on the order of 17 million different concepts [in an LLM], and they don’t come out labeled for our understanding. So we just go look, when did that pattern show up?”

Last year, the team began experimenting with a tiny model that uses only a single layer of neurons. (Sophisticated LLMs have dozens of layers.) The hope was that in the simplest possible setting they could discover patterns that designate features.

[the still underestimated role of sheer brute force trial-and-error in DL]

They ran countless experiments with no success. “We tried a whole bunch of stuff, and nothing was working. It looked like a bunch of random garbage”, says Tom Henighan, a member of Anthropic’s technical staff. Then a run dubbed “Johnny”—each experiment was assigned a random name—began associating neural patterns with concepts that appeared in its outputs.

“Chris looked at it, and he was like, ‘Holy crap. This looks great’”, says Henighan, who was stunned as well. “I looked at it, and was like, ‘Oh, wow, wait, is this working?’”

Suddenly the researchers could identify the features that a group of neurons was encoding. They could peer into the black box. Henighan says he identified the first 5 features he looked at. One group of neurons signified Russian texts. Another was associated with mathematical functions in the Python computer language. And so on.

Once they showed they could identify features in the tiny model, the researchers set about the hairier task of decoding a full-size LLM in the wild. They used Claude Sonnet, the medium-strength version of Anthropic’s 3 current models. That worked, too. One feature that stuck out to them was associated with the Golden Gate Bridge. They mapped out the set of neurons that, when fired together, indicated that Claude was “thinking” about the massive structure that links San Francisco to Marin County. What’s more, when similar sets of neurons fired, they evoked subjects that were Golden Gate Bridge-adjacent: Alcatraz, California governor Gavin Newsom, and the Hitchcock movie Vertigo, which was set in San Francisco. All told, the team identified millions of features—a sort of Rosetta Stone to decode Claude’s neural net. Many of the features were safety-related, including “getting close to someone for some ulterior motive”, “discussion of biological warfare”, and “villainous plots to take over the world”.
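
[Once such a dictionary has been learned, interpreting a single feature is, in caricature, a matter of finding the text on which it fires hardest. A hedged sketch: `get_feature_activation` below is a hypothetical stand-in for running the model plus autoencoder over a snippet and taking that feature’s maximum activation across tokens.]

```python
# Hedged sketch of feature interpretation: rank text snippets by how strongly
# a given feature activates on them. `get_feature_activation` is hypothetical.
import heapq

def top_activating_examples(snippets, get_feature_activation, feature_id, k=20):
    """Return the k snippets on which feature `feature_id` fires most strongly."""
    scored = [(get_feature_activation(text, feature_id), text) for text in snippets]
    # For a Golden Gate Bridge feature, the top snippets would mention the bridge;
    # nearby features would surface Alcatraz, Vertigo, and so on.
    return heapq.nlargest(k, scored)
```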

…The team found several features that represented dangerous practices, like unsafe computer code, scam emails, and instructions for making dangerous products. By suppressing those features, Anthropic says, the model can produce safer computer programs and reduce bias. The opposite occurred when the team intentionally provoked those dicey combinations of neurons to fire: Claude churned out computer programs with dangerous buffer overflow bugs, wrote scam emails, and happily offered advice on how to make weapons of destruction. If you twist the dial too much—cranking it to 11 in the Spinal Tap sense—the language model becomes obsessed with that feature. When the research team turned up the juice on the Golden Gate feature, for example, Claude constantly changed the subject to refer to that glorious span. Asked what its physical form was, the LLM responded, “I am the Golden Gate Bridge … my physical form is the iconic bridge itself.”
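
[The “turning up the dial” intervention can be sketched in the same toy terms: clamp one feature’s activation to a large value during a forward pass and decode the modified features back into the model’s activations; suppressing a feature is the same operation with the value set to 0. This reuses the SparseAutoencoder sketch above; the scale, hook point, and typical activation value are assumptions, not the procedure from Anthropic’s paper.]

```python
# Hedged sketch of feature steering ("cranking it to 11"): clamp one feature
# to a multiple of a typical activation value, then decode back into the
# model's activations. scale and typical_value are illustrative assumptions.
import torch

def steer_with_feature(sae, activations, feature_id, scale=10.0, typical_value=1.0):
    features = torch.relu(sae.encoder(activations))     # encode activations into feature space
    features[..., feature_id] = scale * typical_value   # e.g. clamp the Golden Gate Bridge feature high
    return sae.decoder(features)                        # steered activations, fed back into the model
```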

When the Anthropic researchers ramped up a feature related to hatred and slurs to 20× its usual value, according to the paper, “this caused Claude to alternate between racist screed and self-hatred”, unnerving even the researchers.

…The Anthropic researchers did not want to comment on OpenAI’s disbanding of its own major safety research initiative, or on the remarks by team co-lead Jan Leike, who said that the group had been “sailing against the wind”, unable to get sufficient computing power. (OpenAI has since reiterated that it is committed to safety.) In contrast, Anthropic’s dictionary-learning team says that their considerable compute requirements were met without resistance by the company’s leaders. “It’s not cheap”, adds Olah.

Anthropic’s work is only a start. When I asked the researchers whether they were claiming to have solved the black box problem, their response was an instant and unanimous no… David Bau says his enthusiasm is tempered by some of the approach’s limitations. Dictionary learning can’t identify anywhere close to all the concepts an LLM considers, he says, because in order to identify a feature you have to be looking for it. So the picture is bound to be incomplete, though Anthropic says that bigger dictionaries might mitigate this.