“Revisiting the Equivalence of In-Context Learning and Gradient Descent: The Impact of Data Distribution”, 2024-04-14 ():
Transformers exhibit in-context learning (ICL), adapting to new tasks via prompts without computationally intensive fine-tuning. This paper reexamines the claim that ICL with linear attention implements one step of gradient descent on simple linear regression tasks, showing that this result relies on strong assumptions such as feature independence.
Relaxing these assumptions, we prove that ICL with linear attention instead resembles preconditioned gradient descent, with a preconditioner that depends on the data covariance. Our experiments support this finding.
We also empirically explore softmax-attention and find that increasing the number of attention heads better approximates gradient descent.
Our work offers a nuanced perspective on the connection between ICL and gradient descent, emphasizing data assumptions.
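To make the distinction concrete, here is a minimal NumPy sketch (not the paper's construction) of one step of plain versus preconditioned gradient descent on noiseless linear regression with correlated features. Preconditioning by the inverse empirical data covariance turns the step into a Newton step for this quadratic loss, recovering the target weights exactly, while the plain step is distorted by the covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5

# Correlated features: mix i.i.d. Gaussians through a random matrix,
# so the empirical covariance is far from the identity.
A = rng.normal(size=(d, d))
X = rng.normal(size=(n, d)) @ A
w_true = rng.normal(size=d)
y = X @ w_true  # noiseless targets

w0 = np.zeros(d)
grad = X.T @ (X @ w0 - y) / n   # gradient of 0.5 * mean squared error at w0
Sigma = X.T @ X / n             # empirical data covariance

w_gd = w0 - grad                           # one plain gradient step (lr = 1)
w_pgd = w0 - np.linalg.solve(Sigma, grad)  # one covariance-preconditioned step

# The preconditioned step lands on w_true exactly (Newton step on a
# quadratic); the plain step generally does not when Sigma != I.
err_pgd = np.linalg.norm(w_pgd - w_true)
err_gd = np.linalg.norm(w_gd - w_true)
print(err_pgd, err_gd)
```

Only when the features are isotropic (covariance proportional to the identity) do the two steps coincide, which is why the one-step-of-GD claim hinges on feature independence.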
See Also:
What learning algorithm is in-context learning? Investigations with linear models
Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers
An Explanation of In-context Learning as Implicit Bayesian Inference
What Can Transformers Learn In-Context? A Case Study of Simple Function Classes
Schema-learning and rebinding as mechanisms of in-context learning and emergence