“Schema-Learning and Rebinding As Mechanisms of In-Context Learning and Emergence”, Sivaramakrishnan Swaminathan, Antoine Dedieu, Rajkumar Vasudeva Raju, Murray Shanahan, Miguel Lazaro-Gredilla, Dileep George (2023-06-16):

[cf. Xie et al 2021, Guntupalli et al 2023] In-context learning (ICL) is one of the most powerful and most unexpected capabilities to emerge in recent transformer-based large language models (LLMs). Yet the mechanisms that underlie it are poorly understood.

In this paper, we demonstrate that comparable ICL capabilities can be acquired by an alternative sequence prediction learning method using clone-structured causal graphs (CSCGs).
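[A CSCG is, roughly, a hidden Markov model in which each observed token owns a fixed set of hidden “clone” states with deterministic emissions, so that only the transition matrix is learned. The minimal sketch below illustrates that structure with a forward pass restricted to the clones of the observed tokens; the names (`n_clones`, `forward_loglik`), array sizes, and the random transition matrix are illustrative assumptions, not the paper’s code.]

```python
import numpy as np

# Minimal sketch of a clone-structured causal graph (CSCG), assuming the
# standard formulation: an HMM whose hidden states are partitioned into
# "clones", each clone emitting exactly one token deterministically.
# All names and sizes here are illustrative, not taken from the paper's code.

rng = np.random.default_rng(0)

n_tokens = 5          # vocabulary size
n_clones = 3          # clones per token; the overparameterization knob
n_states = n_tokens * n_clones

# Random row-stochastic transition matrix over all clone states.
T = rng.random((n_states, n_states))
T /= T.sum(axis=1, keepdims=True)

def clone_slice(tok):
    """Indices of the hidden states (clones) that emit token `tok`."""
    return slice(tok * n_clones, (tok + 1) * n_clones)

def forward_loglik(seq):
    """Log-likelihood of a token sequence under the CSCG.

    Because emissions are deterministic, the forward recursion only ever
    visits the clones of the observed token at each step.
    """
    alpha = np.full(n_clones, 1.0 / n_clones)   # uniform start over clones of seq[0]
    loglik = 0.0
    for prev, cur in zip(seq[:-1], seq[1:]):
        # Restrict the transition matrix to clones(prev) -> clones(cur).
        block = T[clone_slice(prev), clone_slice(cur)]
        alpha = alpha @ block
        norm = alpha.sum()
        loglik += np.log(norm)
        alpha /= norm
    return loglik

print(forward_loglik([0, 2, 1, 2, 4]))
```

[In this reading, increasing `n_clones` is the CSCG analogue of the overparameterization discussed below: more clones per token allow more (and more complex) template circuits to be carved out of the transition matrix.]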

Moreover, a key property of CSCGs is that, unlike transformer-based LLMs, they are interpretable, which considerably simplifies the task of explaining how ICL works. Specifically, we show that ICL uses a combination of (1) learning template (schema) circuits for pattern completion, (2) retrieving relevant templates in a context-sensitive manner, and (3) rebinding novel tokens to appropriate slots in the templates.
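[To illustrate the rebinding step (3), the sketch below keeps a (here random) transition matrix fixed and asks which existing token slot best explains a novel token’s occurrences in context. The paper re-estimates emissions with the transitions (the schema) frozen; the hard argmax over candidate slots here is a deliberate simplification, and every name in the code is made up for illustration.]

```python
import numpy as np

# Hedged sketch of "rebinding": given a schema (transition matrix) learned over
# known tokens, a novel token is bound to an existing clone slot rather than
# relearning the schema. This uses the same clone-structured scaffolding as the
# sketch above, with a hard slot assignment standing in for emission re-estimation.

rng = np.random.default_rng(1)

n_tokens, n_clones = 5, 3
n_states = n_tokens * n_clones
T = rng.random((n_states, n_states))
T /= T.sum(axis=1, keepdims=True)

def slot(tok):
    """Indices of the clones belonging to token-slot `tok`."""
    return slice(tok * n_clones, (tok + 1) * n_clones)

def loglik(seq, binding):
    """Log-likelihood of a sequence where `binding` maps each token id to the
    token-slot whose clones it emits from (identity for known tokens)."""
    alpha = np.full(n_clones, 1.0 / n_clones)
    total = 0.0
    for prev, cur in zip(seq[:-1], seq[1:]):
        block = T[slot(binding[prev]), slot(binding[cur])]
        alpha = alpha @ block
        norm = alpha.sum()
        total += np.log(norm)
        alpha /= norm
    return total

# A novel token (id 5) shows up in positions a known token used to occupy.
NOVEL = 5
context = [0, NOVEL, 1, NOVEL, 4]

# Rebind: pick the known-token slot that best explains the novel token's usage,
# leaving the transition matrix (the learned schema) untouched.
identity = {t: t for t in range(n_tokens)}
scores = {k: loglik(context, {**identity, NOVEL: k}) for k in range(n_tokens)}
best = max(scores, key=scores.get)
print("novel token rebound to the slot of token", best)
```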

We go on to marshal evidence for the hypothesis that similar mechanisms underlie ICL in LLMs. For example, we find that, with CSCGs as with LLMs, different capabilities emerge at different levels of overparameterization, suggesting that overparameterization helps in learning more complex template (schema) circuits.

By showing how ICL can be achieved with small models and datasets, we open up a path to novel architectures, and take a vital step towards a more general understanding of the mechanics behind this important capability.

[Experimental “Up-goer-five”-style summary by GPT-4: “In this writing, we talk about a very strong and very surprising thing that some big talking computer programs can do. This thing is called ‘in-context learning’ or ICL. But we don’t really know how it works. We made a study where we show that we can make another program that can do similar things as ICL using a different way of learning. This different way uses special graphs called ‘clone-structured causal graphs’ or CSCGs. The good thing about CSCGs is that they are easy to understand, unlike the big talking computer programs. This makes it easier to explain how ICL works. We found out that ICL does 3 main things: (1) it learns shapes, (2) it finds the right shapes depending on what is happening, and (3) it puts new words in the right places in the shapes. We also found strong support that the big talking computer programs work the same way as CSCGs. When we make the CSCGs or the big talking computer programs bigger, they can do more hard things. This means that having more stuff in these programs helps them learn more challenging shapes. By showing that we can do ICL with small programs and small sets of words, we open up new ways to make different kinds of programs. This is an important step toward getting a better idea of how ICL works.”]