“Zoom In: An Introduction to Circuits—By Studying the Connections between Neurons, We Can Find Meaningful Algorithms in the Weights of Neural Networks”, Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, Shan Carter2020-03-10 (, ; similar)⁠:

Most work on interpretability aims to give simple explanations of an entire neural network’s behavior. But what if we instead take an approach inspired by neuroscience or cellular biology — an approach of zooming in? What if we treated individual neurons, even individual weights, as being worthy of serious investigation? What if we were willing spend thousands of hours tracing through every neuron and its connections? What kind of picture of neural networks would emerge?

In contrast to the typical picture of neural networks as a black box, we’ve been surprised how approachable the network is on this scale. Not only do neurons seem understandable (even ones that initially seemed inscrutable), but the “circuits” of connections between them seem to be meaningful algorithms corresponding to facts about the world. You can watch a circle detector be assembled from curves. You can see a dog head be assembled from eyes, snout, fur and tongue. You can observe how a car is composed from wheels and windows. You can even find circuits implementing simple logic: cases where the network implements AND, OR, or XOR over high-level visual features.

Three Speculative Claims about Neural Networks

  1. Features: Features are the fundamental unit of neural networks. They correspond to directions. By “direction” we mean a linear combination of neurons in a layer. You can think of this as a direction vector in the vector space of activations of neurons in a given layer. Often, we find it most helpful to talk about individual neurons, but we’ll see that there are some cases where other combinations are a more useful way to analyze networks — especially when neurons are “polysemantic.” (See the for a detailed definition.) These features can be rigorously studied and understood.
  2. Circuits: Features are connected by weights, forming circuits. A “circuit” is a computational subgraph of a neural network. It consists of a set of features, and the weighted edges that go between them in the original network. Often, we study quite small circuits — say with less than a dozen features — but they can also be much larger. (See the for a detailed definition.) These circuits can also be rigorously studied and understood.
  3. Universality: Analogous features and circuits form across models and tasks.