‘self-attention’ directory
Discussion of removing a major architectural limitation of Transformer neural networks: the length of the input they can look at. Beyond a few thousand inputs, the resource requirements explode quadratically, rendering it infeasible to encode raw text at the character level, much less use entire books, images, or many other kinds of data which could be useful. Even for text, this inability forces limitations like the use of BPE text encoding (responsible for sabotaging GPT-3’s rhyming, among other things), forgetfulness, limits to prompt programming, and an inability to write coherent long texts.
A bibliography of possibilities for fixing this is organized hierarchically below (a short code sketch of the baseline quadratic cost follows the list):
1. adding state, through recurrence (a memory) or creating a compressed history/state as an explicit summary
2. tinkering with matrix algebra to remove the quadratic explosion while still keeping more or less the same self-attention mechanism
3. approximating self-attention: using attention on only a small subset of tokens at any time (dodging the quadratic limit), or using a mix of local and global attention (local attentions to do most of the work, and global attention on top of the local attentions, each one avoiding the quadratic by considering only a few inputs at a time)
4. miscellaneous tricks: removing parts, using only randomized untrainable components (with no need to compute gradients over), etc.
One of the most frustrating limitations of GPT-3 (as awesome as it is) is the context window: 2048 text tokens (BPEs) is adequate for many text-related tasks, and even GPT-3’s performance on that window is far from perfect, indicating it has a long way to go in truly understanding text. But 2048 BPEs runs out fast when you start prompt programming something hard, hacks like BPEs have nasty & subtle side-effects, and (as iGPT/ViT show in their own ways) images and other modalities need far larger context windows still.
How do we get future Transformers with reasonable context windows and/or memory?
Below I compile & categorize research on breaking the dense attention quadratic bottleneck (overviews: Lilian Weng, Madison May; review: Tay et al 2020; benchmark suite: Long Range Arena):

Table 1: Summary of Efficient Transformer Models presented in chronological order of their first public disclosure (Tay et al 2020)
The summary as of mid-2023: dense Transformers remain surprisingly competitive, and the many proposed variants all have their own drawbacks; none have superseded standard GPT or T5-style Transformers in more than a few niches. To paraphrase Chekhov: “If many remedies are prescribed for an illness, you can be sure it has no cure.”
Efficient Attention
State
Recurrency
“Universal Transformers”, Dehghani et al 2018 (?); “Deep Equilibrium Models”, Bai et al 2019
“Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context”, Dai et al 2019 (blog)
“XLNet: Generalized Autoregressive Pretraining for Language Understanding”, Yang et al 2019
“Untangling tradeoffs between recurrence and self-attention in neural networks”, Kerg et al 2020
“Feedback Transformer: Addressing Some Limitations of Transformers with Feedback Memory”, Fan et al 2020
“Shortformer: Better Language Modeling using Shorter Inputs”, Press et al 2020
“SRU++: When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute”, Lei 2021
“SwishRNN: Simple Recurrence Improves Masked Language Models”, et al 2022
“Block-Recurrent Transformers”, Hutchins et al 2022
RNNs:
Compressed History/State
“Compressive Transformers for Long-Range Sequence Modeling”, Rae et al 2019; “Expire-Span: Not All Memories are Created Equal: Learning to Forget by Expiring”, Sukhbaatar et al 2021
“Memory Transformer”, Burtsev et al 2020
“Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks”, Lee et al 2018; “Perceiver: General Perception with Iterative Attention”, Jaegle et al 2021a; “Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Jaegle et al 2021b
“Mem2Mem: Learning to Summarize Long Texts with Memory Compression and Transfer”, et al 2020
“∞-former: Infinite Memory Transformer”, Martins et al 2021
“Memorizing Transformers”, Wu et al 2021
“ABC: Attention with Bounded-memory Control”, Peng et al 2021
“Recursively Summarizing Books with Human Feedback”, Wu et al 2021
“MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition”, Wu et al 2022
“Token Turing Machines”, Ryoo et al 2022
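A toy sketch of the idea shared by the recurrence and compressed-history papers above, in the spirit of a Transformer-XL-style cache (the function and cache size are made up for illustration): each new segment attends over a bounded cache of past keys/values, so per-segment cost stays constant however long the history grows. Real implementations cache per-layer hidden states, stop gradients through the cache, and (as in the Compressive Transformer or Expire-Span) compress or expire old entries rather than simply truncating them.

```python
import numpy as np

def softmax_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def attend_with_memory(q, k, v, mem_k, mem_v, mem_len=512):
    """Attend over [cached memory; current segment] and return the updated, truncated cache.
    (Causal masking within the segment omitted for brevity.)"""
    K = np.concatenate([mem_k, k], axis=0)
    V = np.concatenate([mem_v, v], axis=0)
    out = softmax_attention(q, K, V)
    return out, K[-mem_len:], V[-mem_len:]           # FIFO truncation stands in for compression/expiry

d, seg = 64, 128
mem_k, mem_v = np.zeros((0, d)), np.zeros((0, d))
for _ in range(10):                                  # stream 10 segments; cost per segment stays bounded
    q, k, v = (np.random.randn(seg, d) for _ in range(3))
    out, mem_k, mem_v = attend_with_memory(q, k, v, mem_k, mem_v)
```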
Matrix Algebra Optimizations
Tricks like rewriting the softmax/dot-product attention so that the full n × n attention matrix never has to be explicitly computed or stored (a linearized-attention sketch follows the list below):
“Efficient Attention: Attention with Linear Complexities”, Shen et al 2018 (blog)
“Linformer: Self-Attention with Linear Complexity”, Wang et al 2020; “Luna: Linear Unified Nested Attention”, Ma et al 2021 (hierarchical?); “Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks”, Guo et al 2021
“AFT: An Attention Free Transformer”, Zhai et al 2021
“LambdaNetworks: Modeling long-range Interactions without Attention”, Bello 2020
“cosFormer: Rethinking Softmax in Attention”, Qin et al 2022
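Most of the papers in this subsection exploit some version of the following regrouping trick, shown here with an illustrative elu+1 feature map rather than any specific paper's kernel: if the softmax is replaced or approximated by a kernel φ, then (φ(Q)φ(K)ᵀ)V can be computed as φ(Q)(φ(K)ᵀV), which costs 𝒪(n·d²) rather than 𝒪(n²·d) and never materializes an n × n matrix. This sketch is bidirectional; causal variants keep the φ(K)ᵀV statistics as running prefix sums.

```python
import numpy as np

def feature_map(x):
    # Illustrative positive feature map (elu(x)+1); Performer-style methods instead use
    # random features chosen to approximate the softmax kernel.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    Qf, Kf = feature_map(Q), feature_map(K)      # (n, d)
    KV = Kf.T @ V                                # (d, d): summary of all keys/values
    Z = Qf @ Kf.sum(axis=0)[:, None]             # (n, 1): per-query normalizer
    return (Qf @ KV) / (Z + 1e-6)                # (n, d), with no (n, n) matrix anywhere
```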
Approximations
Sparsity
“Image Transformer”, Parmar et al 2018
Sparse Transformer: “Generating Long Sequences with Sparse Transformers”, Child et al 2019 (blog)
“Adaptive Attention Span in Transformers”, Sukhbaatar et al 2019
“Reformer: The Efficient Transformer”, Kitaev et al 2019 (blog: 1, 2); “SMYRF: Efficient Attention using Asymmetric Clustering”, Daras et al 2020; “Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding”, et al 2020; “You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling”, et al 2021
“Star-Transformer”, Guo et al 2019
“Efficient Content-Based Sparse Attention with Routing Transformers”, Roy et al 2020
“Sparse Sinkhorn Attention”, Tay et al 2020 (blog)
“BigBird: Transformers for Longer Sequences”, Zaheer et al 2020 (blog; see also ETC)
Axial attention: “Axial Attention in Multidimensional Transformers”, Ho et al 2019; et al 2018; et al 2020b; et al 2020
“Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting”, Zhou et al 2020
“OmniNet: Omnidirectional Representations from Transformers”, Tay et al 2021
“Combiner: Full Attention Transformer with Sparse Computation Cost”, Ren et al 2021
“Scatterbrain: Unifying Sparse and Low-rank Attention Approximation”, Chen et al 2021
“Sparse Is Enough in Scaling Transformers”, Jaszczur et al 2021
Note: Several implementations are available in DeepSpeed
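The fixed patterns used by several of the sparse models above can be pictured as a boolean mask over the attention matrix. Below is a toy causal “local block + strided summary columns” pattern, loosely in the style of the Sparse Transformer (block and stride sizes are arbitrary); a practical kernel gathers only the permitted positions rather than building the dense n × n mask as done here for clarity.

```python
import numpy as np

def sparse_causal_mask(n, block=64, stride=64):
    """True where attention is allowed: each query sees its recent `block` positions
    plus every `stride`-th earlier position, so rows have O(block + n/stride) entries."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal  = j <= i
    local   = (i - j) < block
    strided = (j % stride) == stride - 1
    return causal & (local | strided)

m = sparse_causal_mask(4096)
print(m.mean())   # fraction of the dense n^2 matrix actually attended to
```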
Global ↔︎ Local Attention
“LSRA: Lite Transformer with Long-Short Range Attention”, Wu et al 2020a
“BlockBERT: Blockwise self-attention for long document understanding”, Qiu et al 2019
“BP-Transformer: Modeling Long-Range Context via Binary Partitioning”, Ye et al 2019
“Longformer: The Long-Document Transformer”, Beltagy et al 2020; “CD-LM: Cross-Document Language Modeling”, Caciularu et al 2021; “Simple Local Attentions Remain Competitive for Long-Context Tasks”, Xiong et al 2021
“ETC: Encoding Long and Structured Data in Transformers”, Ainslie et al 2020; “LongT5: Efficient Text-To-Text Transformer for Long Sequences”, Guo et al 2021
“Conformer: Convolution-augmented Transformer for Speech Recognition”, Gulati et al 2020 (et al 2020)
“Multi-scale Transformer Language Models”, Subramanian et al 2020
“Hierarchical Transformers for Multi-Document Summarization”, Liu & Lapata 2019; “Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling”, Wu et al 2021
“Transformer-QL: A Step Towards Making Transformer Network Quadratically Large”, 2020
“Coordination Among Neural Modules Through a Shared Global Workspace”, Goyal et al 2021
“Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”, Liu et al 2021a; “Swin Transformer V2: Scaling Up Capacity and Resolution”, Liu et al 2021b
“Hierarchical Transformers Are More Efficient Language Models”, Nawrot et al 2021
“Long-Short Transformer (Transformer-LS): Efficient Transformers for Language and Vision”, Zhu et al 2021
“AdaMRA: Adaptive Multi-Resolution Attention with Linear Complexity”, et al 2021
“FLASH: Transformer Quality in Linear Time”, Hua et al 2022 (see also MLP-Mixer)
“NAT: Neighborhood Attention Transformer”, Hassani et al 2022; “DiNAT: Dilated Neighborhood Attention Transformer”, Hassani & Shi 2022
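Most of the local ↔ global designs above combine a cheap sliding window, which does the bulk of the work, with a handful of tokens that attend globally to pass information between windows. A toy mask along those lines (the window size and number of global tokens are invented for illustration; real implementations never build the dense mask):

```python
import numpy as np

def local_global_mask(n, window=128, n_global=4):
    """True where attention is allowed: every token sees +/-window neighbors, while the first
    n_global tokens (summary/[CLS]-like tokens) see everything and are seen by everything."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    local = np.abs(i - j) <= window
    return local | (i < n_global) | (j < n_global)
```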
Miscellaneous
Dropping components, using non-trainable/randomized components (with no need to compute gradients over), and other miscellaneous tricks:
“Generating Wikipedia by Summarizing Long Sequences”, Liu et al 2018 (memory compressed)
“Pay Less Attention with Lightweight and Dynamic Convolutions”, Wu et al 2019b
“Music Transformer”, Huang et al 2018
“Synthesizer: Rethinking Self-Attention in Transformer Models”, Tay et al 2020
“Performer (FAVOR): Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers”, Choromanski et al 2020a (on turning Transformers into RNNs); “FAVOR+: Rethinking Attention with Performers”, Choromanski et al 2020b (blog; DRL use; can be trained in constant memory); “RFA: Random Feature Attention”, Peng et al 2020; “DPFP: Linear Transformers Are Secretly Fast Weight Memory Systems”, Schlag et al 2021; “DAFT: A Dot Product Attention Free Transformer”, et al 2021
“Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention”, Xiong et al 2021; “Skyformer: Remodel Self-Attention with Gaussian Kernel and Nyström Method”, et al 2021
“Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing”, Dai et al 2020
“RASP: Thinking Like Transformers”, Weiss et al 2021 (examining limitations of efficient Transformers: in terms of algorithms, what does going from n² to n cost? What “programs” do Transformers encode?)
“Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding”, Luo et al 2021
“On Learning the Transformer Kernel”, et al 2021
Structured State Models (SSMs): “Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers”, Gu et al 2021a; “S4: Efficiently Modeling Long Sequences with Structured State Spaces”, Gu et al 2021b; “HiPPO: Recurrent Memory with Optimal Polynomial Projections”, Gu et al 2021c
“Self-attention Does Not Need 𝒪(n²) Memory”, Rabe & Staats 2021 (does still cost 𝒪(n²) compute)
MLPs (for removing attention entirely)
Retrieval
“REALM: Retrieval-Augmented Language Model Pre-Training”, Guu et al 2020
“MARGE: Pre-training via Paraphrasing”, Lewis et al 2020a
“RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, Lewis et al 2020b
“Current Limitations of Language Models: What You Need is Retrieval”, Komatsuzaki 2020
“Memorizing Transformers”, Wu et al 2022
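A toy sketch of the retrieval idea, in the spirit of Memorizing Transformers or kNN-LM-style lookups (the function and store here are hypothetical): keep an external store of past (key, value) pairs that can be far larger than the attention window, retrieve only the top-k nearest keys for each query, and attend over just those. Real systems use approximate nearest-neighbor indexes (e.g. FAISS or ScaNN) and combine the retrieved values with ordinary local attention.

```python
import numpy as np

def knn_memory_read(q, mem_keys, mem_values, k=32):
    """q: (d,) query; mem_keys/mem_values: (N, d) external store with N >> context length."""
    scores = mem_keys @ q                         # (N,) dot-product similarity
    top = np.argpartition(scores, -k)[-k:]        # indices of the k closest stored keys
    s = scores[top] / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max()); w /= w.sum()         # softmax over just the retrieved entries
    return w @ mem_values[top]                    # (d,) value read back from memory
```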
Links
- “Dynamic Tanh: Transformers without Normalization ”, Zhu et al 2025
- “(How) Do Language Models Track State? ”, Li et al 2025
- “Thinking Slow, Fast: Scaling Inference Compute With Distilled Reasoners ”, Paliotta et al 2025
- “Leveraging the True Depth of LLMs ”, González et al 2025
- “Language Models Use Trigonometry to Do Addition ”, Kantamneni 2025
- “How Has DeepSeek Improved the Transformer Architecture? ”, Erdil 2025
- “Where Does In-Context Learning Happen in Large Language Models? ”, Sia et al 2025
- “MiniMax-01: Scaling Foundation Models With Lightning Attention ”, MiniMax et al 2025
- “Emergent Effects of Scaling on the Functional Hierarchies within Large Language Models ”, Foop 2025
- “ICLR: In-Context Learning of Representations ”, Park et al 2024
- “Hymba: A Hybrid-Head Architecture for Small Language Models ”, Dong et al 2024
- “Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models ”, Ruis et al 2024
- “Long Context RAG Performance of Large Language Models ”, Leng et al 2024
- “Ask, and It Shall Be Given: Turing Completeness of Prompting ”, Qiu et al 2024
- “ALTA: Compiler-Based Analysis of Transformers ”, Shaw et al 2024
- “Tackling the Abstraction and Reasoning Corpus With Vision Transformers: the Importance of 2D Representation, Positions, and Objects ”, Li et al 2024
- “Differential Transformer ”, Ye et al 2024
- “Were RNNs All We Needed? ”, Feng et al 2024
- “NGPT: Normalized Transformer With Representation Learning on the Hypersphere ”, Loshchilov et al 2024
- “Masked Mixers for Language Generation and Retrieval ”, Badger 2024
- “The Mamba in the Llama: Distilling and Accelerating Hybrid Models ”, Wang et al 2024
- “When Can Transformers Count to n? ”, Yehudai et al 2024
- “What Matters in Transformers? Not All Attention Is Needed ”, He et al 2024
- “Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? ”, Lee et al 2024
- “An Empirical Study of Mamba-Based Language Models ”, Waleffe et al 2024
- “Attention As a Hypernetwork ”, Schug et al 2024
- “Scalable Matmul-Free Language Modeling ”, Zhu et al 2024
- “A Theoretical Understanding of Self-Correction through In-Context Alignment ”, Wang et al 2024
- “Attention As an RNN ”, Feng et al 2024
- “Your Transformer Is Secretly Linear ”, Razzhigaev et al 2024
- “Retrieval Head Mechanistically Explains Long-Context Factuality ”, Wu et al 2024
- “Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models ”, Pfau et al 2024
- “Towards Smaller, Faster Decoder-Only Transformers: Architectural Variants and Their Implications ”, Suresh & P 2024
- “RULER: What’s the Real Context Size of Your Long-Context Language Models? ”, Hsieh et al 2024
- “ReFT: Representation Finetuning for Language Models ”, Wu et al 2024
- “Do Language Models Plan Ahead for Future Tokens? ”, Wu et al 2024
- “Streamlining Redundant Layers to Compress Large Language Models ”, Chen et al 2024
- “Long-Form Factuality in Large Language Models ”, Wei et al 2024
- “Mechanistic Design and Scaling of Hybrid Architectures ”, Poli et al 2024
- “8 Google Employees Invented Modern AI. Here’s the Inside Story: They Met by Chance, Got Hooked on an Idea, and Wrote the Transformers Paper—The Most Consequential Tech Breakthrough in Recent History ”, Levy 2024
- “How Well Can Transformers Emulate In-Context Newton’s Method? ”, Giannou et al 2024
- “RNNs Are Not Transformers (Yet): The Key Bottleneck on In-Context Retrieval ”, Wen et al 2024
- “A Phase Transition between Positional and Semantic Learning in a Solvable Model of Dot-Product Attention ”, Cui et al 2024
- “Rethinking Patch Dependence for Masked Autoencoders ”, Fu et al 2024
- “Attention versus Contrastive Learning of Tabular Data—A Data-Centric Benchmarking ”, Rabbani et al 2024
- “Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet ”
- “SwitchHead: Accelerating Transformers With Mixture-Of-Experts Attention ”, Csordás et al 2023
- “Look Before You Leap: A Universal Emergent Decomposition of Retrieval Tasks in Language Models ”, Variengien & Winsor 2023
- “Can a Transformer Represent a Kalman Filter? ”, Goel & Bartlett 2023
- “Efficient Transformer Knowledge Distillation: A Performance Review ”, Brown et al 2023
- “Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks As an Alternative to Attention Layers in Transformers ”, Bozic et al 2023
- “In-Context Vectors: Making In Context Learning More Effective and Controllable Through Latent Space Steering ”, Liu et al 2023
- “On Prefrontal Working Memory and Hippocampal Episodic Memory: Unifying Memories Stored in Weights and Activation Slots ”, Whittington et al 2023
- “LSS Transformer: Ultra-Long Sequence Distributed Transformer ”, Wang et al 2023
- “Simplifying Transformer Blocks ”, He & Hofmann 2023
- “GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling ”, Katsch 2023
- “Not All Layers Are Equally As Important: Every Layer Counts BERT ”, Charpentier & Samuel 2023
- “Implicit Chain-Of-Thought Reasoning via Knowledge Distillation ”, Deng et al 2023
- “Training Dynamics of Contextual N-Grams in Language Models ”, Quirke et al 2023
- “The Impact of Depth and Width on Transformer Language Model Generalization ”, Petty et al 2023
- “Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study With Linear Models ”, Fu et al 2023
- “Characterizing Mechanisms for Factual Recall in Language Models ”, Yu et al 2023
- “Linear Representations of Sentiment in Large Language Models ”, Tigges et al 2023
- “Masked Hard-Attention Transformers and Boolean RASP Recognize Exactly the Star-Free Languages ”, Angluin et al 2023
- “How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? ”, Wu et al 2023
- “Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors ”, Amos et al 2023
- “Vision Transformers Need Registers ”, Darcet et al 2023
- “Interpret Vision Transformers As ConvNets With Dynamic Convolutions ”, Zhou et al 2023
- “Replacing Softmax With ReLU in Vision Transformers ”, Wortsman et al 2023
- “One Wide Feedforward Is All You Need ”, Pires et al 2023
- “Activation Addition: Steering Language Models Without Optimization ”, Turner et al 2023
- “Linearity of Relation Decoding in Transformer Language Models ”, Hernandez et al 2023
- “The Hydra Effect: Emergent Self-Repair in Language Model Computations ”, McGrath et al 2023
- “Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla ”, Lieberum et al 2023
- “FlashAttention-2: Faster Attention With Better Parallelism and Work Partitioning ”, Dao 2023
- “One Step of Gradient Descent Is Provably the Optimal In-Context Learner With One Layer of Linear Self-Attention ”, Mahankali et al 2023
- “Lost in the Middle: How Language Models Use Long Contexts ”, Liu et al 2023
- “Trainable Transformer in Transformer ”, Panigrahi et al 2023
- “Transformers Learn to Implement Preconditioned Gradient Descent for In-Context Learning ”, Ahn et al 2023
- “White-Box Transformers via Sparse Rate Reduction ”, Yu et al 2023
- “Blockwise Parallel Transformer for Long Context Large Models ”, Liu & Abbeel 2023
- “TTT-NN: Test-Time Training on Nearest Neighbors for Large Language Models ”, Hardt & Sun 2023
- “Brainformers: Trading Simplicity for Efficiency ”, Zhou et al 2023
- “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints ”, Ainslie et al 2023
- “Mimetic Initialization of Self-Attention Layers ”, Trockman & Kolter 2023
- “Toeplitz Neural Network for Sequence Modeling ”, Qin et al 2023
- “Finding Neurons in a Haystack: Case Studies With Sparse Probing ”, Gurnee et al 2023
- “How Does GPT-2 Compute Greater-Than?: Interpreting Mathematical Abilities in a Pre-Trained Language Model ”, Hanna et al 2023
- “Coinductive Guide to Inductive Transformer Heads ”, Nemecek 2023
- “Tighter Bounds on the Expressivity of Transformer Encoders ”, Chiang et al 2023
- “Tracr: Compiled Transformers As a Laboratory for Interpretability ”, Lindner et al 2023
- “Skip-Attention: Improving Vision Transformers by Paying Less Attention ”, Venkataramanan et al 2023
- “Hungry Hungry Hippos: Towards Language Modeling With State Space Models ”, Fu et al 2022
- “Scalable Adaptive Computation for Iterative Generation ”, Jabri et al 2022
- “Pretraining Without Attention ”, Wang et al 2022
- “Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent As Meta-Optimizers ”, Dai et al 2022
- “Transformers Learn In-Context by Gradient Descent ”, Oswald et al 2022
- “What Learning Algorithm Is In-Context Learning? Investigations With Linear Models ”, Akyürek et al 2022
- “Efficiently Scaling Transformer Inference ”, Pope et al 2022
- “Transformers Learn Shortcuts to Automata ”, Liu et al 2022
- “Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling ”, Chang et al 2022
- “Transformers Implement First-Order Logic With Majority Quantifiers ”, Merrill & Sabharwal 2022
- “The Lie Derivative for Measuring Learned Equivariance ”, Gruver et al 2022
- “Relaxed Attention for Transformer Models ”, Lohrenz et al 2022
- “What Can Transformers Learn In-Context? A Case Study of Simple Function Classes ”, Garg et al 2022
- “Multitrack Music Transformer: Learning Long-Term Dependencies in Music With Diverse Instruments ”, Dong et al 2022
- “N-Grammer: Augmenting Transformers With Latent n-Grams ”, Roy et al 2022
- “Log-Precision Transformers Are Constant-Depth Uniform Threshold Circuits ”, Merrill & Sabharwal 2022
- “Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules ”, Irie et al 2022
- “FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness ”, Dao et al 2022
- “TATS: Long Video Generation With Time-Agnostic VQGAN and Time-Sensitive Transformer ”, Ge et al 2022
- “Overcoming a Theoretical Limitation of Self-Attention ”, Chiang & Cholak 2022
- “It’s Raw! Audio Generation With State-Space Models ”, Goel et al 2022
- “General-Purpose, Long-Context Autoregressive Modeling With Perceiver AR ”, Hawthorne et al 2022
- “Transformer Memory As a Differentiable Search Index ”, Tay et al 2022
- “The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention ”, Irie et al 2022
- “Attention Approximates Sparse Distributed Memory ”, Bricken & Pehlevan 2021
- “An Explanation of In-Context Learning As Implicit Bayesian Inference ”, Xie et al 2021
- “Long-Range Transformers for Dynamic Spatiotemporal Forecasting ”, Grigsby et al 2021
- “Train Short, Test Long: Attention With Linear Biases (ALiBi) Enables Input Length Extrapolation ”, Press et al 2021
- “Do Vision Transformers See Like Convolutional Neural Networks? ”, Raghu et al 2021
- “Stable, Fast and Accurate: Kernelized Attention With Relative Positional Encoding ”, Luo et al 2021
- “RASP: Thinking Like Transformers ”, Weiss et al 2021
- “On the Distribution, Sparsity, and Inference-Time Quantization of Attention Values in Transformers ”, Ji et al 2021
- “SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training ”, Somepalli et al 2021
- “Not All Images Are Worth 16×16 Words: Dynamic Transformers for Efficient Image Recognition ”, Wang et al 2021
- “Less Is More: Pay Less Attention in Vision Transformers ”, Pan et al 2021
- “FNet: Mixing Tokens With Fourier Transforms ”, Lee-Thorp et al 2021
- “Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet ”, Melas-Kyriazi 2021
- “RoFormer: Enhanced Transformer With Rotary Position Embedding ”, Su et al 2021
- “ALD: Efficient Transformers in Reinforcement Learning Using Actor-Learner Distillation ”, Parisotto & Salakhutdinov 2021
- “Attention Is Not All You Need: Pure Attention Loses Rank Doubly Exponentially With Depth ”, Dong et al 2021
- “Do Transformer Modifications Transfer Across Implementations and Applications? ”, Narang et al 2021
- “Linear Transformers Are Secretly Fast Weight Programmers ”, Schlag et al 2021
- “Unlocking Pixels for Reinforcement Learning via Implicit Attention ”, Choromanski et al 2021
- “Transformer Feed-Forward Layers Are Key-Value Memories ”, Geva et al 2020
- “AdnFM: An Attentive DenseNet Based Factorization Machine for CTR Prediction ”, Wang et al 2020
- “Inductive Biases for Deep Learning of Higher-Level Cognition ”, Goyal & Bengio 2020
- “Long Range Arena (LRA): A Benchmark for Efficient Transformers ”, Tay et al 2020
- “Current Limitations of Language Models: What You Need Is Retrieval ”, Komatsuzaki 2020
- “Efficient Transformers: A Survey ”, Tay et al 2020
- “HiPPO: Recurrent Memory With Optimal Polynomial Projections ”, Gu et al 2020
- “Pre-Training via Paraphrasing ”, Lewis et al 2020
- “Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers ”, Choromanski et al 2020
- “GPT-3: Language Models Are Few-Shot Learners ”, Brown et al 2020
- “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks ”, Lewis et al 2020
- “Synthesizer: Rethinking Self-Attention in Transformer Models ”, Tay et al 2020
- “PowerNorm: Rethinking Batch Normalization in Transformers ”, Shen et al 2020
- “On Layer Normalization in the Transformer Architecture ”, Xiong et al 2020
- “REALM: Retrieval-Augmented Language Model Pre-Training ”, Guu et al 2020
- “BERT’s Output Layer Recognizes All Hidden Layers? Some Intriguing Phenomena and a Simple Way to Boost BERT ”, Kao et al 2020
- “Rethinking Attention With Performers ”, Choromanski & Colwell 2020
- “Dynamic Convolution: Attention over Convolution Kernels ”, Chen et al 2019
- “Generalization through Memorization: Nearest Neighbor Language Models ”, Khandelwal et al 2019
- “Multiplicative Interactions and Where to Find Them ”, Jayakumar et al 2019
- “The Bottom-Up Evolution of Representations in the Transformer: A Study With Machine Translation and Language Modeling Objectives ”, Voita et al 2019
- “Large Memory Layers With Product Keys ”, Lample et al 2019
- “What Does BERT Look At? An Analysis of BERT’s Attention ”, Clark et al 2019
- “Are 16 Heads Really Better Than One? ”, Michel et al 2019
- “Pay Less Attention With Lightweight and Dynamic Convolutions ”, Wu et al 2019
- “On the Turing Completeness of Modern Neural Network Architectures ”, Pérez et al 2019
- “Music Transformer ”, Huang et al 2018
- “Character-Level Language Modeling With Deeper Self-Attention ”, Al-Rfou et al 2018
- “Attention Is All You Need ”, Vaswani et al 2017
- “A Deep Reinforced Model for Abstractive Summarization ”, Paulus et al 2017
- “Get To The Point: Summarization With Pointer-Generator Networks ”, See et al 2017
- “RAM: Dynamic Computational Time for Visual Attention ”, Li et al 2017
- “Hybrid Computing Using a Neural Network With Dynamic External Memory ”, Graves et al 2016
- “Scaling Memory-Augmented Neural Networks With Sparse Reads and Writes ”, Rae et al 2016
- “Modeling Human Reading With Neural Attention ”, Hahn & Keller 2016
- “Iterative Alternating Neural Attention for Machine Reading ”, Sordoni et al 2016
- “Adaptive Computation Time for Recurrent Neural Networks ”, Graves 2016
- “Foveation-Based Mechanisms Alleviate Adversarial Examples ”, Luo et al 2015
- “Generating Images from Captions With Attention ”, Mansimov et al 2015
- “DRAW: A Recurrent Neural Network For Image Generation ”, Gregor et al 2015
- “Neural Turing Machines ”, Graves et al 2014
- “Neural Machine Translation by Jointly Learning to Align and Translate ”, Bahdanau et al 2014
- “On Learning Where To Look ”, Ranzato 2014
- “Generating Sequences With Recurrent Neural Networks ”, Graves 2013
- “Efficient Transformers: A Survey § Table 1 ”
- “Attention and Augmented Recurrent Neural Networks ”
- “Hierarchical Object Detection With Deep Reinforcement Learning ”
- “The Transformer Family: Attention and Self-Attention · Multi-Head Self-Attention · Transformer · Adaptive Computation Time (ACT) · Improved Attention Span: (Longer Attention Span (Transformer-XL) / Adaptive Attention Span / Localized Attention Span (Image Transformer)) · Less Time and Memory Cost: (Sparse Attention Matrix Factorization (Sparse Transformers) / Locality-Sensitive Hashing (Reformer)) · Make It Recurrent (Universal Transformer) · Stabilization for RL (GTrXL)”
- “100M Token Context Windows”
- “Learning to Combine Foveal Glimpses With a Third-Order Boltzmann Machine ”
- “Show, Attend and Tell: Neural Image Caption Generation With Visual Attention ”
- “Recurrent Models of Visual Attention ”
- “Can Active Memory Replace Attention? ”
- “Dzmitry Bahdanau ”
- “Scaling Automatic Neuron Description ”
- “Monitor: An AI-Driven Observability Interface ”
- “Interpreting GPT: the Logit Lens ”
- “A Sober Look at Steering Vectors for LLMs ”
- “A Survey of Long-Term Context in Transformers: Sparse Transformers · Adaptive Span Transformers · Transformer-XL · Compressive Transformers · Reformer · Routing Transformer · Sinkhorn Transformer · Linformer · Efficient Attention: Attention With Linear Complexities · Transformers Are RNNs · ETC · Longformer ”
- “FlashAttention-3: Fast and Accurate Attention With Asynchrony and Low-Precision ”
Gwern
“Absolute Unit NNs: Regression-Based MLPs for Everything ”, Gwern 2023
“Research Ideas ”, Gwern 2017
“GPT-3 Creative Fiction ”, Gwern 2020
Miscellaneous
- /doc/ai/nn/transformer/attention/2023-trockman-figure7-gpt2attentionmatrixpatterns.png
- /doc/ai/nn/transformer/attention/2022-tay-figure4-scalingofmodelbydepth.jpg
- /doc/ai/nn/transformer/attention/2022-tay-figure5-scalingofmodelbymlpfeedforwardparameters.jpg
- /doc/ai/nn/transformer/attention/2020-longrangearena-figure3-performancefrontier.jpg
- /doc/ai/nn/transformer/attention/2020-tay-figure2-efficientattentiontaxonomy.png
- /doc/ai/nn/transformer/attention/2020-tay-table1-efficienttransformermodels.png
- https://bclarkson-code.github.io/posts/llm-from-scratch-scalar-autograd/post.html
- https://e2eml.school/transformers.html
- https://github.com/haizelabs/thorn-in-haizestack
- https://lilianweng.github.io/posts/2018-06-24-attention/
- https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention
- https://mehta-rohan.com/writings/blog_posts/attention.html
- https://nostalgebraist.tumblr.com/post/740164510909890560/information-flow-in-transformers
- https://shyam.blog/posts/beyond-self-attention/
- https://vgel.me/posts/handmade-transformer/
- https://vgel.me/posts/representation-engineering/
- https://www.anthropic.com/index/100k-context-windows
- https://www.beren.io/2024-03-03-Linear-Attention-as-Iterated-Hopfield-Networks/
- https://www.dipkumar.dev/becoming-the-unbeatable/posts/gpt-kvcache/
- https://www.lesswrong.com/posts/euam65XjigaCJQkcN/an-analogy-for-understanding-transformers
- https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction
- https://www.lesswrong.com/posts/thePw6qdyabD8XR4y/interpreting-openai-s-whisper
- https://www.perfectlynormal.co.uk/blog-induction-heads-illustrated
Bibliography
- “Dynamic Tanh: Transformers without Normalization”: https://arxiv.org/abs/2503.10622#facebook
- “Thinking Slow, Fast: Scaling Inference Compute With Distilled Reasoners”: https://arxiv.org/abs/2502.20339
- “MiniMax-01: Scaling Foundation Models With Lightning Attention”: https://arxiv.org/abs/2501.08313#minimax
- “ALTA: Compiler-Based Analysis of Transformers”: https://arxiv.org/abs/2410.18077#deepmind
- “Tackling the Abstraction and Reasoning Corpus With Vision Transformers: the Importance of 2D Representation, Positions, and Objects”: https://arxiv.org/abs/2410.06405
- “Were RNNs All We Needed?”: https://arxiv.org/abs/2410.01201
- “The Mamba in the Llama: Distilling and Accelerating Hybrid Models”: https://arxiv.org/abs/2408.15237
- “What Matters in Transformers? Not All Attention Is Needed”: https://arxiv.org/abs/2406.15786
- “Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?”: https://arxiv.org/abs/2406.13121#google
- “An Empirical Study of Mamba-Based Language Models”: https://arxiv.org/abs/2406.07887
- “Retrieval Head Mechanistically Explains Long-Context Factuality”: https://arxiv.org/abs/2404.15574
- “Let’s Think Dot by Dot: Hidden Computation in Transformer Language Models”: https://arxiv.org/abs/2404.15758
- “Long-Form Factuality in Large Language Models”: https://arxiv.org/abs/2403.18802#deepmind
- “Mechanistic Design and Scaling of Hybrid Architectures”: https://arxiv.org/abs/2403.17844
- “8 Google Employees Invented Modern AI. Here’s the Inside Story: They Met by Chance, Got Hooked on an Idea, and Wrote the Transformers Paper—The Most Consequential Tech Breakthrough in Recent History”: https://www.wired.com/story/eight-google-employees-invented-modern-ai-transformers-paper/
- “Rethinking Patch Dependence for Masked Autoencoders”: https://arxiv.org/abs/2401.14391
- “Efficient Transformer Knowledge Distillation: A Performance Review”: https://arxiv.org/abs/2311.13657
- “Not All Layers Are Equally As Important: Every Layer Counts BERT”: https://arxiv.org/abs/2311.02265
- “Linear Representations of Sentiment in Large Language Models”: https://arxiv.org/abs/2310.15154
- “Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors”: https://arxiv.org/abs/2310.02980
- “Interpret Vision Transformers As ConvNets With Dynamic Convolutions”: https://arxiv.org/abs/2309.10713
- “Replacing Softmax With ReLU in Vision Transformers”: https://arxiv.org/abs/2309.08586
- “Activation Addition: Steering Language Models Without Optimization”: https://arxiv.org/abs/2308.10248
- “TTT-NN: Test-Time Training on Nearest Neighbors for Large Language Models”: https://arxiv.org/abs/2305.18466
- “Brainformers: Trading Simplicity for Efficiency”: https://arxiv.org/abs/2306.00008#google
- “Mimetic Initialization of Self-Attention Layers”: https://arxiv.org/abs/2305.09828
- “Skip-Attention: Improving Vision Transformers by Paying Less Attention”: https://arxiv.org/abs/2301.02240
- “Hungry Hungry Hippos: Towards Language Modeling With State Space Models”: https://arxiv.org/abs/2212.14052
- “Pretraining Without Attention”: https://arxiv.org/abs/2212.10544
- “Transformers Learn In-Context by Gradient Descent”: https://arxiv.org/abs/2212.07677#google
- “What Learning Algorithm Is In-Context Learning? Investigations With Linear Models”: https://arxiv.org/abs/2211.15661#google
- “Efficiently Scaling Transformer Inference”: https://arxiv.org/abs/2211.05102#google
- “Multi-CLS BERT: An Efficient Alternative to Traditional Ensembling”: https://arxiv.org/abs/2210.05043
- “What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”: https://arxiv.org/abs/2208.01066
- “Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules”: https://arxiv.org/abs/2206.01649#schmidhuber
- “FlashAttention: Fast and Memory-Efficient Exact Attention With IO-Awareness”: https://arxiv.org/abs/2205.14135
- “TATS: Long Video Generation With Time-Agnostic VQGAN and Time-Sensitive Transformer”: https://arxiv.org/abs/2204.03638#facebook
- “It’s Raw! Audio Generation With State-Space Models”: https://arxiv.org/abs/2202.09729
- “General-Purpose, Long-Context Autoregressive Modeling With Perceiver AR”: https://arxiv.org/abs/2202.07765#deepmind
- “Train Short, Test Long: Attention With Linear Biases (ALiBi) Enables Input Length Extrapolation”: https://arxiv.org/abs/2108.12409#facebook
- “Do Vision Transformers See Like Convolutional Neural Networks?”: https://arxiv.org/abs/2108.08810#google
- “RASP: Thinking Like Transformers”: https://arxiv.org/abs/2106.06981
- “Not All Images Are Worth 16×16 Words: Dynamic Transformers for Efficient Image Recognition”: https://arxiv.org/abs/2105.15075
- “Less Is More: Pay Less Attention in Vision Transformers”: https://arxiv.org/abs/2105.14217
- “FNet: Mixing Tokens With Fourier Transforms”: https://arxiv.org/abs/2105.03824#google
- “Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet”: https://arxiv.org/abs/2105.02723
- “Long Range Arena (LRA): A Benchmark for Efficient Transformers”: https://openreview.net/forum?id=qVyeW-grC2k#google
- “Efficient Transformers: A Survey”: https://arxiv.org/abs/2009.06732#google
- “HiPPO: Recurrent Memory With Optimal Polynomial Projections”: https://arxiv.org/abs/2008.07669
- “Synthesizer: Rethinking Self-Attention in Transformer Models”: https://arxiv.org/abs/2005.00743#google
- “PowerNorm: Rethinking Batch Normalization in Transformers”: https://arxiv.org/abs/2003.07845
- “BERT’s Output Layer Recognizes All Hidden Layers? Some Intriguing Phenomena and a Simple Way to Boost BERT”: https://arxiv.org/abs/2001.09309
- “Dynamic Convolution: Attention over Convolution Kernels”: https://arxiv.org/abs/1912.03458#microsoft