‘attention ≈ SGD’ directory
See Also
Links
“Detailed Balance in Large Language Model-Driven Agents”, Song et al 2025
“Insights into Claude-4.5-Opus from Pokémon Red”, Bradshaw 2025
“Scaled-Dot-Product Attention As One-Sided Entropic Optimal Transport”, Litman 2025
“Learning without Training: The Implicit Dynamics of In-Context Learning”, Dherin et al 2025
“MesaNet: Sequence Modeling by Locally Optimal Test-Time Training”, Oswald et al 2025
“Where Does In-Context Learning Happen in Large Language Models?”, Sia et al 2025
“Transformers Represent Belief State Geometry in Their Residual Stream”, Shai 2024
“How Well Can Transformers Emulate In-Context Newton’s Method?”, Giannou et al 2024
“Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study With Linear Models”, Fu et al 2023
“Uncovering Mesa-Optimization Algorithms in Transformers”, Oswald et al 2023
“CausalLM Is Not Optimal for In-Context Learning”, Ding et al 2023
“One Step of Gradient Descent Is Provably the Optimal In-Context Learner With One Layer of Linear Self-Attention”, Mahankali et al 2023
“Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent As Meta-Optimizers”, Dai et al 2022
“What Learning Algorithm Is In-Context Learning? Investigations With Linear Models”, Akyürek et al 2022
“What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”, Garg et al 2022
“An Explanation of In-Context Learning As Implicit Bayesian Inference”, Xie et al 2021
“Reverse Citations of ‘Transformers Learn In-Context by Gradient Descent’ (Google Scholar)”
Sort By Magic
Annotations are sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, the sort uses the embedding of each annotation to build a list of nearest-neighbor annotations, creating a progression of topics. For more details, see the link.
language-models
incontext-study
in-context-learning
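The nearest-neighbor ordering described above can be illustrated with a minimal sketch. The page does not state the similarity metric, the embedding model, or how sections are clustered and labeled, so everything here is an assumption for illustration only: cosine similarity as the metric, a greedy chain that always steps to the closest unvisited item, and the hypothetical names `cosine_similarity` and `sort_by_similarity`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sort_by_similarity(embeddings, start=0):
    """Greedy nearest-neighbor chain over item indices.

    Assumption: index `start` is the newest annotation. Each step appends
    the unvisited item most similar to the last item added, so adjacent
    items in the result tend to share a topic.
    """
    if not embeddings:
        return []
    order = [start]
    remaining = set(range(len(embeddings))) - {start}
    while remaining:
        current = embeddings[order[-1]]
        # Iterate in index order so ties resolve to the lowest index.
        nearest = max(
            sorted(remaining),
            key=lambda i: cosine_similarity(current, embeddings[i]),
        )
        order.append(nearest)
        remaining.remove(nearest)
    return order


# Toy 2-D "embeddings" (made up): items 0 and 2 point one way, 1 and 3 another.
toy = [
    [1.0, 0.0],  # 0: treated as the newest annotation
    [0.0, 1.0],  # 1
    [0.9, 0.1],  # 2
    [0.1, 0.9],  # 3
]
order = sort_by_similarity(toy)
print(order)
```

On the toy data the chain visits item 2 right after item 0, then moves on to items 3 and 1, which is the "progression of topics" effect: similar items end up adjacent. Cutting the chain into labeled sections, as the page describes, would be a separate clustering step not shown here.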
Miscellaneous
Bibliography
https://arxiv.org/abs/2211.15661#google: “What Learning Algorithm Is In-Context Learning? Investigations With Linear Models”
https://arxiv.org/abs/2208.01066: “What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”