‘attention ≈ SGD’ directory
- See Also
- Links
- “Detailed Balance in Large Language Model-Driven Agents”, Song et al 2025
- “Insights into Claude-4.5-Opus from Pokémon Red”, Bradshaw 2025
- “Learning without Training: The Implicit Dynamics of In-Context Learning”, Dherin et al 2025
- “Where Does In-Context Learning Happen in Large Language Models?”, Sia et al 2025
- “Transformers Represent Belief State Geometry in Their Residual Stream”, Shai 2024
- “How Well Can Transformers Emulate In-Context Newton’s Method?”, Giannou et al 2024
- “Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study With Linear Models”, Fu et al 2023
- “CausalLM Is Not Optimal for In-Context Learning”, Ding et al 2023
- “One Step of Gradient Descent Is Provably the Optimal In-Context Learner With One Layer of Linear Self-Attention”, Mahankali et al 2023
- “Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent As Meta-Optimizers”, Dai et al 2022
- “What Learning Algorithm Is In-Context Learning? Investigations With Linear Models”, Akyürek et al 2022
- “What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”, Garg et al 2022
- “An Explanation of In-Context Learning As Implicit Bayesian Inference”, Xie et al 2021
- “Reverse Citations of ‘Transformers Learn In-Context by Gradient Descent’ (Google Scholar)”
- Miscellaneous
- Bibliography
See Also
Links
“Detailed Balance in Large Language Model-Driven Agents”, Song et al 2025
“Insights into Claude-4.5-Opus from Pokémon Red”, Bradshaw 2025
“Learning without Training: The Implicit Dynamics of In-Context Learning”, Dherin et al 2025
“Where Does In-Context Learning Happen in Large Language Models?”, Sia et al 2025
“Transformers Represent Belief State Geometry in Their Residual Stream”, Shai 2024
“How Well Can Transformers Emulate In-Context Newton’s Method?”, Giannou et al 2024
“Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study With Linear Models”, Fu et al 2023
“CausalLM Is Not Optimal for In-Context Learning”, Ding et al 2023
“One Step of Gradient Descent Is Provably the Optimal In-Context Learner With One Layer of Linear Self-Attention”, Mahankali et al 2023
“Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent As Meta-Optimizers”, Dai et al 2022
“What Learning Algorithm Is In-Context Learning? Investigations With Linear Models”, Akyürek et al 2022
“What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”, Garg et al 2022
“An Explanation of In-Context Learning As Implicit Bayesian Inference”, Xie et al 2021
“Reverse Citations of ‘Transformers Learn In-Context by Gradient Descent’ (Google Scholar)”
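The core claim indexed here, that one layer of linear self-attention can reproduce a single step of gradient descent on an in-context regression loss (eg. Mahankali et al 2023, Dai et al 2022), can be checked numerically. The following is a minimal sketch, assuming scalar targets, a gradient step from zero initialization, and hand-set key/query/value projections that read out the x-part and y-part of each token; it illustrates the general construction rather than any particular paper's parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lr = 5, 20, 0.1                 # input dim, # of in-context examples, GD step size

# In-context linear-regression data: scalar targets y_i = <w*, x_i>, plus a query x_q.
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_star
x_q = rng.normal(size=d)

# (1) One explicit gradient-descent step from w = 0 on
#     L(w) = 1/(2n) * sum_i (<w, x_i> - y_i)^2, then predict on the query.
grad_at_zero = -(X.T @ y) / n
w_one_step = -lr * grad_at_zero       # = (lr/n) * X^T y
pred_gd = w_one_step @ x_q

# (2) One layer of *linear* (no softmax) self-attention over tokens e_i = [x_i; y_i],
#     with query token e_q = [x_q; 0]; keys/queries read the x-part, values the y-part.
E = np.hstack([X, y[:, None]])        # context tokens, shape (n, d+1)
e_q = np.concatenate([x_q, [0.0]])
P_x = np.hstack([np.eye(d), np.zeros((d, 1))])   # key/query projection: selects x-part
p_y = np.concatenate([np.zeros(d), [1.0]])       # value projection: selects y-part
scores = (E @ P_x.T) @ (P_x @ e_q)    # unnormalized attention scores <x_i, x_q>
pred_attn = (lr / n) * np.sum((E @ p_y) * scores)

assert np.isclose(pred_gd, pred_attn) # the two predictions coincide up to float error
print(pred_gd, pred_attn)
```

In the papers' constructions the explicit lr/n factor is typically absorbed into the learned value/projection matrices; it is kept separate here for readability.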
Miscellaneous
Bibliography
https://arxiv.org/abs/2211.15661#google: “What Learning Algorithm Is In-Context Learning? Investigations With Linear Models”, Akyürek et al 2022
https://arxiv.org/abs/2208.01066: “What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”, Garg et al 2022