‘attention ≈ SGD’ directory
- See Also
- Links
  - “Learning without Training: The Implicit Dynamics of In-Context Learning”, Dherin et al 2025
  - “Where Does In-Context Learning Happen in Large Language Models?”, Sia et al 2025
  - “Transformers Represent Belief State Geometry in Their Residual Stream”, Shai 2024
  - “How Well Can Transformers Emulate In-Context Newton’s Method?”, Giannou et al 2024
  - “Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study With Linear Models”, Fu et al 2023
  - “CausalLM Is Not Optimal for In-Context Learning”, Ding et al 2023
  - “One Step of Gradient Descent Is Provably the Optimal In-Context Learner With One Layer of Linear Self-Attention”, Mahankali et al 2023
  - “Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent As Meta-Optimizers”, Dai et al 2022
  - “What Learning Algorithm Is In-Context Learning? Investigations With Linear Models”, Akyürek et al 2022
  - “What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”, Garg et al 2022
  - “An Explanation of In-Context Learning As Implicit Bayesian Inference”, Xie et al 2021
  - “Reverse Citations of ‘Transformers Learn In-Context by Gradient Descent’ (Google Scholar)”
- Miscellaneous
- Bibliography
 
See Also
Links
“Learning without Training: The Implicit Dynamics of In-Context Learning”, Dherin et al 2025
“Where Does In-Context Learning Happen in Large Language Models?”, Sia et al 2025
“Transformers Represent Belief State Geometry in Their Residual Stream”, Shai 2024
“How Well Can Transformers Emulate In-Context Newton’s Method?”, Giannou et al 2024
“Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study With Linear Models”, Fu et al 2023
“CausalLM Is Not Optimal for In-Context Learning”, Ding et al 2023
“One Step of Gradient Descent Is Provably the Optimal In-Context Learner With One Layer of Linear Self-Attention”, Mahankali et al 2023
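To make the directory’s titular equivalence concrete, here is a toy numerical sketch (not code from Mahankali et al or any other linked paper; it follows the standard construction from the ‘Transformers Learn In-Context by Gradient Descent’ literature) in which a single *linear* self-attention layer with hand-chosen weights reproduces the prediction of one gradient-descent step, from a zero initialization, on an in-context linear-regression prompt. The dimensions, learning rate, and weight matrices below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lr = 4, 16, 0.1           # toy sizes and step size (illustrative assumptions)

# In-context regression prompt: n example tokens (x_i, y_i) from a random linear
# teacher, plus a query token (x_q, 0) whose y-slot the layer must fill in.
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true
x_q = rng.normal(size=d)
tokens = np.vstack([np.hstack([X, y[:, None]]),
                    np.hstack([x_q, 0.0])])          # shape (n+1, d+1)

# Hand-constructed linear self-attention (no softmax): queries/keys read only the
# x-part of each token, so q.k = x_q.x_i; values read only the y-part, scaled by lr.
W_QK = np.zeros((d + 1, d + 1)); W_QK[:d, :d] = np.eye(d)
W_V  = np.zeros((d + 1, d + 1)); W_V[d, d] = lr
Q = tokens @ W_QK.T
K = tokens @ W_QK.T
V = tokens @ W_V.T
attn_out = (Q @ K.T) @ V                             # linear attention output
y_attn = attn_out[-1, d]                             # y-slot of the query token

# One explicit gradient step on L(w) = 0.5 * sum_i (w.x_i - y_i)^2, starting at w = 0.
grad_at_zero = X.T @ (X @ np.zeros(d) - y)           # = -X^T y
w_one_step = np.zeros(d) - lr * grad_at_zero
y_gd = x_q @ w_one_step

assert np.allclose(y_attn, y_gd)
print(y_attn, y_gd)                                  # identical up to float error
```

The query token’s y-slot after the attention layer equals x_q·w after one gradient step, which is the sense in which “attention ≈ SGD” holds exactly in this linear toy model; the linked papers ask how far this correspondence extends to trained Transformers and richer function classes.
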
“Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent As Meta-Optimizers”, Dai et al 2022
“What Learning Algorithm Is In-Context Learning? Investigations With Linear Models”, Akyürek et al 2022
“What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”, Garg et al 2022
“An Explanation of In-Context Learning As Implicit Bayesian Inference”, Xie et al 2021
“Reverse Citations of ‘Transformers Learn In-Context by Gradient Descent’ (Google Scholar)”
Miscellaneous
Bibliography
- https://arxiv.org/abs/2211.15661#google: “What Learning Algorithm Is In-Context Learning? Investigations With Linear Models”, Akyürek et al 2022
- https://arxiv.org/abs/2208.01066: “What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”, Garg et al 2022