“Revisiting the Equivalence of In-Context Learning and Gradient Descent: The Impact of Data Distribution”, Sadegh Mahdavi, Renjie Liao, Christos Thrampoulidis, 2024-04-14:
Transformers exhibit in-context learning (ICL), enabling adaptation to various tasks via prompts without computationally intensive fine-tuning. This paper reevaluates the claim that ICL with linear attention implements one step of gradient descent for simple linear regression tasks, revealing that this equivalence relies on strong assumptions such as feature independence.

Relaxing these assumptions, we prove that ICL with linear attention resembles preconditioned gradient descent, with a preconditioner that depends on the data covariance. Our experiments support this finding.
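To illustrate why a covariance-dependent preconditioner matters, here is a minimal NumPy sketch (not the paper's construction) comparing one step of plain gradient descent against one preconditioned step on a noiseless in-context linear-regression prompt. The choice of preconditioner P as the inverse empirical feature covariance is an illustrative assumption; with correlated features, the preconditioned step recovers the target weights while the plain step does not.

```python
import numpy as np

# Illustrative sketch: one-step GD vs. preconditioned GD on a linear-regression
# prompt with correlated (non-identity covariance) features.
rng = np.random.default_rng(0)
n, d = 200, 5

A = rng.normal(size=(d, d))
Sigma = A @ A.T / d + 0.5 * np.eye(d)   # non-identity feature covariance
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
w_star = rng.normal(size=d)
y = X @ w_star                          # noiseless labels

grad0 = -(X.T @ y) / n                  # gradient of (1/2n)||y - Xw||^2 at w = 0

# Plain one-step gradient descent from w = 0
eta = 0.5
w_gd = -eta * grad0

# Preconditioned one-step GD; P = inverse empirical covariance (assumed choice)
P = np.linalg.inv(X.T @ X / n)
w_pgd = -P @ grad0                      # equals the least-squares solution here

print(np.linalg.norm(w_gd - w_star))    # plain GD: nonzero error after one step
print(np.linalg.norm(w_pgd - w_star))   # preconditioned: ~0, recovers w_star
```

In the noiseless full-rank case the preconditioned step reduces to the ordinary least-squares solution, which is one way to see how the right data-dependent preconditioner closes the gap in a single step.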

We also empirically explore softmax attention and find that increasing the number of attention heads yields a better approximation of gradient descent.

Our work offers a nuanced perspective on the connection between ICL and gradient descent, emphasizing data assumptions.