“What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”, Shivam Garg, Dimitris Tsipras, Percy Liang, Gregory Valiant (2022-08-01):

[cf. GPT-3, Hollmann et al 2022] In-context learning refers to the ability of a model to condition on a prompt sequence consisting of in-context examples (input-output pairs corresponding to some task) along with a new query input, and generate the corresponding output. Crucially, in-context learning happens only at inference time without any parameter updates to the model. While large language models such as GPT-3 exhibit some ability to perform in-context learning, it is unclear what the relationship is between tasks on which this succeeds and what is present in the training data.
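The prompt structure described above can be made concrete with a minimal sketch (the function and helper names here are illustrative, not from the paper's code): a prompt interleaves in-context (input, output) pairs and ends with a query input, and the model must produce the query's output with no parameter updates.

```python
# Minimal sketch of an in-context learning prompt for a function-learning
# task: k worked examples of an unknown function f, then a query input.
def build_prompt(f, xs, x_query):
    """Interleave in-context examples with a final query input."""
    prompt = []
    for x in xs:
        prompt.append(("input", x))
        prompt.append(("output", f(x)))
    prompt.append(("input", x_query))   # model must predict f(x_query)
    return prompt

f = lambda x: 3.0 * x + 1.0             # some task from a function class
xs = [0.0, 1.0, 2.0]                    # in-context examples
prompt = build_prompt(f, xs, 5.0)
target = f(5.0)                         # the output the model should produce
```

The key point is that `f` is never given to the model explicitly; it must be inferred from the example pairs in the prompt alone.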

To make progress towards understanding in-context learning, we consider the well-defined problem of training a model to in-context learn a function class (e.g., linear functions): that is, given data derived from some functions in the class, can we train a model to in-context learn “most” functions from this class?
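For the linear-function class, training data generation can be sketched as follows, assuming (as is typical in such setups) Gaussian weights and inputs; the exact dimensions and distributions here are illustrative rather than taken from the paper's code.

```python
import numpy as np

def sample_linear_prompt(d=20, k=40, rng=None):
    """Sample one training prompt: k examples of an unseen linear function."""
    rng = rng or np.random.default_rng(0)
    w = rng.standard_normal(d)        # the task: f(x) = w . x
    X = rng.standard_normal((k, d))   # in-context inputs
    y = X @ w                         # corresponding outputs
    return X, y, w

X, y, w = sample_linear_prompt()
# A training sequence for the Transformer is (x_1, y_1, ..., x_k, y_k);
# the loss is on predicting each y_i from the preceding pairs.
```

Because `w` is resampled for every prompt, the model cannot memorize any single function; it is pushed toward learning a procedure that maps example pairs to predictions.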

We show empirically that standard Transformers can be trained from scratch to perform in-context learning of linear functions—that is, the trained model is able to learn unseen linear functions from in-context examples with performance comparable to the optimal least squares estimator. In fact, in-context learning is possible even under two forms of distribution shift: (1) between the training data of the model and inference-time prompts, and (2) between the in-context examples and the query input during inference.
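The least-squares baseline that the trained Transformer is compared against can be sketched directly (a minimal version, assuming noiseless prompts): fit the minimum-norm least-squares solution to the in-context pairs and use it to predict the query output.

```python
import numpy as np

def least_squares_predict(X, y, x_query):
    """Optimal least-squares baseline: fit w_hat on the in-context
    examples, then predict the query's output."""
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return x_query @ w_hat

rng = np.random.default_rng(1)
d = 20
w = rng.standard_normal(d)
X = rng.standard_normal((d, d))   # k = d examples suffice when noiseless
y = X @ w
x_q = rng.standard_normal(d)
pred = least_squares_predict(X, y, x_q)
# with k >= d noiseless examples, this recovers w exactly,
# so pred matches the true output x_q . w
```

With fewer than `d` examples the problem is underdetermined, which is where the distinction between this estimator and what the Transformer learns becomes interesting.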

We also show that we can train Transformers to in-context learn more complex function classes—namely sparse linear functions, two-layer ReLU neural networks, and decision trees—with performance that matches or exceeds task-specific learning algorithms.
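Two of the harder function classes can be sampled with a short sketch (sizes and distributions here are illustrative, chosen for the example rather than copied from the paper): sparse linear functions with a few nonzero weights, and two-layer ReLU networks.

```python
import numpy as np

rng = np.random.default_rng(2)
d, s, hidden = 20, 3, 100

# Sparse linear function: only s of the d weight coordinates are nonzero.
w = np.zeros(d)
support = rng.choice(d, size=s, replace=False)
w[support] = rng.standard_normal(s)
f_sparse = lambda x: x @ w

# Two-layer ReLU network: f(x) = a . relu(W x).
W = rng.standard_normal((hidden, d))
a = rng.standard_normal(hidden)
f_relu = lambda x: a @ np.maximum(W @ x, 0.0)
```

For each class, a new function is drawn per prompt, and the Transformer's in-context error is compared against a class-specific baseline (e.g., Lasso for sparse linear functions).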

Our code and models are available at GitHub.

…Curriculum: …Notably, when training Transformers without curriculum, there is an initial—relatively long—period in training where the loss does not decrease, followed by a period of sharp decrease. The length of this period varies with training randomness and seems to increase on average with problem dimension. Understanding the model just before and after this transition moment is a promising future direction, which can give insights into the emergence of in-context learning. Interestingly, Olsson et al 2022 observe a similar jump in the in-context learning ability of a language model which they attribute to the formation of “induction heads”.

Figure 5: Training a Transformer to in-context learn more complex function classes. (b) A Transformer trained on prompts generated using random decision trees can in-context learn this class, with much better performance than greedy tree learning or tree boosting.

…Decision trees: Next, we consider the class of depth-4 decision trees with 20-dimensional inputs…In Figure 5b, we show that Transformers can be trained to in-context learn this class, with performance much better than greedy tree learning and boosting (via XGBoost [Chen & Guestrin 2016]). With k = 100 in-context examples, the Transformer achieves an error of 0.12, whereas greedy learning achieves an error of 1.03 (worse than the zero estimator) and XGBoost achieves an error of 0.73.
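A random depth-4 tree of this kind can be sketched as follows (the split rule and leaf distribution here are illustrative assumptions, not necessarily the paper's exact construction): each internal node thresholds a randomly chosen coordinate at zero, and each leaf holds a random value.

```python
import numpy as np

rng = np.random.default_rng(3)
d, depth = 20, 4

# Heap-style complete binary tree: 2^depth - 1 internal nodes, 2^depth leaves.
features = rng.integers(0, d, size=2**depth - 1)  # split coordinate per node
leaves = rng.standard_normal(2**depth)            # value per leaf

def tree_eval(x):
    """Route a 20-dim input down the tree and return its leaf value."""
    node = 0
    for _ in range(depth):
        go_right = x[features[node]] > 0.0
        node = 2 * node + 1 + int(go_right)       # heap child index
    return leaves[node - (2**depth - 1)]          # map node id to leaf slot

x = rng.standard_normal(d)
y = tree_eval(x)
```

Each prompt would pair a freshly sampled tree with k (input, output) examples; the Transformer must implicitly infer the tree's routing from those examples alone, which is what makes its advantage over greedy learning and XGBoost at k = 100 notable.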

Note that, in general, we do not have a good understanding of the space of efficient algorithms for learning decision trees, and the conditions under which known heuristics work [Blanc et al 2021, Brutzkus et al 2020]. At the same time, we found that Transformers can be trained to directly discover such an algorithm for the prompt distribution we considered. This suggests an intriguing possibility where we might be able to reverse engineer the algorithm encoded by a Transformer to obtain new sample efficient algorithms for existing learning problems.