‘attention ≈ SGD’ directory
- See Also
- Links
  - “Learning without Training: The Implicit Dynamics of In-Context Learning”, Dherin et al 2025
  - “Where Does In-Context Learning Happen in Large Language Models?”, Sia et al 2025
  - “Transformers Represent Belief State Geometry in Their Residual Stream”, Shai 2024
  - “How Well Can Transformers Emulate In-Context Newton’s Method?”, Giannou et al 2024
  - “Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study With Linear Models”, Fu et al 2023
  - “CausalLM Is Not Optimal for In-Context Learning”, Ding et al 2023
  - “One Step of Gradient Descent Is Provably the Optimal In-Context Learner With One Layer of Linear Self-Attention”, Mahankali et al 2023
  - “Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent As Meta-Optimizers”, Dai et al 2022
  - “What Learning Algorithm Is In-Context Learning? Investigations With Linear Models”, Akyürek et al 2022
  - “What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”, Garg et al 2022
  - “An Explanation of In-Context Learning As Implicit Bayesian Inference”, Xie et al 2021
  - “Reverse Citations of ‘Transformers Learn In-Context by Gradient Descent’ (Google Scholar)”
- Miscellaneous
- Bibliography
 
See Also
Links
“Learning without Training: The Implicit Dynamics of In-Context Learning”, Dherin et al 2025
“Where Does In-Context Learning Happen in Large Language Models?”, Sia et al 2025
“Transformers Represent Belief State Geometry in Their Residual Stream”, Shai 2024
“How Well Can Transformers Emulate In-Context Newton’s Method?”, Giannou et al 2024
“Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study With Linear Models”, Fu et al 2023
“CausalLM Is Not Optimal for In-Context Learning”, Ding et al 2023
“One Step of Gradient Descent Is Provably the Optimal In-Context Learner With One Layer of Linear Self-Attention”, Mahankali et al 2023
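To make the directory’s titular equivalence concrete, here is a toy numerical sketch (not code from Mahankali et al or any other linked paper; it follows the standard construction from the ‘Transformers Learn In-Context by Gradient Descent’ literature) in which a single *linear* self-attention layer with hand-chosen weights reproduces the prediction of one gradient-descent step, from a zero initialization, on an in-context linear-regression prompt. The dimensions, learning rate, and weight matrices below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lr = 4, 16, 0.1           # toy sizes and step size (illustrative assumptions)

# In-context regression prompt: n example tokens (x_i, y_i) from a random linear
# teacher, plus a query token (x_q, 0) whose y-slot the layer must fill in.
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true
x_q = rng.normal(size=d)
tokens = np.vstack([np.hstack([X, y[:, None]]),
                    np.hstack([x_q, 0.0])])          # shape (n+1, d+1)

# Hand-constructed linear self-attention (no softmax): queries/keys read only the
# x-part of each token, so q.k = x_q.x_i; values read only the y-part, scaled by lr.
W_QK = np.zeros((d + 1, d + 1)); W_QK[:d, :d] = np.eye(d)
W_V  = np.zeros((d + 1, d + 1)); W_V[d, d] = lr
Q = tokens @ W_QK.T
K = tokens @ W_QK.T
V = tokens @ W_V.T
attn_out = (Q @ K.T) @ V                             # linear attention output
y_attn = attn_out[-1, d]                             # y-slot of the query token

# One explicit gradient step on L(w) = 0.5 * sum_i (w.x_i - y_i)^2, starting at w = 0.
grad_at_zero = X.T @ (X @ np.zeros(d) - y)           # = -X^T y
w_one_step = np.zeros(d) - lr * grad_at_zero
y_gd = x_q @ w_one_step

assert np.allclose(y_attn, y_gd)
print(y_attn, y_gd)                                  # identical up to float error
```

The query token’s y-slot after the attention layer equals x_q·w after one gradient step, which is the sense in which “attention ≈ SGD” holds exactly in this linear toy model; the linked papers ask how far this correspondence extends to trained Transformers and richer function classes.
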
“Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent As Meta-Optimizers”, Dai et al 2022
“What Learning Algorithm Is In-Context Learning? Investigations With Linear Models”, Akyürek et al 2022
“What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”, Garg et al 2022
“An Explanation of In-Context Learning As Implicit Bayesian Inference”, Xie et al 2021
“Reverse Citations of ‘Transformers Learn In-Context by Gradient Descent’ (Google Scholar)”
Miscellaneous
Bibliography
- https://arxiv.org/abs/2211.15661#google: “What Learning Algorithm Is In-Context Learning? Investigations With Linear Models”, Akyürek et al 2022
- https://arxiv.org/abs/2208.01066: “What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”, Garg et al 2022