---
title: "Efficient Attention: Breaking The Quadratic Transformer Bottleneck"
created: 2020-07-25
modified: 2020-07-25
status: finished
confidence: highly likely
importance: 5
cssExtension: dropcaps-yinit
...
> Discussion of removing a major architectural limitation in Transformer neural networks: the length of the input they can look at. Beyond a few thousand inputs, the resource requirements explode quadratically, rendering it infeasible to encode raw text at the character level, much less use entire books, images, or many other kinds of data which could be useful. Even for text, this inability also forces limitations like the use of BPE text encoding (responsible for sabotaging [GPT-3's](https://arxiv.org/abs/2005.14165#openai "'GPT-3: Language Models are Few-Shot Learners', Brown et al 2020") rhyming, among other things), forgetfulness, limits to prompt programming, and inability to write coherent long texts.
>
> A bibliography of possibilities for fixing this is [organized hierarchically below](#efficient-attention):
>
> 1. adding **state**, through recurrence (a memory) or creating a compressed history/state as an explicit summary
> 2. tinkering with **matrix algebra** to remove the quadratic explosion while still keeping more or less the same self-attention mechanism
> 3. **approximating self-attention**: using attention on only a small subset of tokens at any time (dodging the quadratic limit), or using a mix of local and global attention (local attentions to do most of the work, and global attention on top of the local attentions, each one avoiding the quadratic by considering only a few inputs at a time)
> 4. **miscellaneous** tricks: removing parts, using only randomized untrainable components (with no need to compute gradients over), etc.
One of the most frustrating limitations of GPT-3 (as [awesome as it is](/gpt-3 "'GPT-3 Creative Fiction', Branwen 2020")) is the context window: 2048 text tokens (BPEs) is adequate for many text-related tasks, and even GPT-3's performance on that window is far from perfect, indicating it has a long way to go in truly understanding text. But 2048 BPEs runs out fast when you start prompt programming something hard, hacks like [BPEs](/gpt-3#bpes "'GPT-3 Creative Fiction § BPEs', Branwen 2020") have nasty & subtle side-effects, and (as iGPT/ViT indicate in their own ways) it is totally inadequate for other modalities like images---a single small 256px image is already equivalent to a sequence of _l_ = 65,536, never mind video or raw audio! How do we get future Transformers with reasonable context windows and/or memory, which we can use for research papers, books, structured text, images, video, audio, point clouds, genomics, and so on, where we need to handle sequences with lengths in the millions? (Such improvements would permit not just doing things GPT-3 struggles to do, like write coherent novels, but also many better architectures, like multimodal Transformers which can learn jointly from images & text, accessing image-based datasets like PDFs, and learning far more accurate human-like representations & tacit knowledge with less data & smaller models, providing large models useful for almost all conceivable tasks---especially robotics.) Below I compile & categorize research on breaking the dense attention quadratic bottleneck (overviews: [Lilian Weng](https://lilianweng.github.io/lil-log/2020/04/07/the-transformer-family.html#openai "The Transformer Family: Attention and Self-Attention · Multi-Head Self-Attention · Transformer · Adaptive Computation Time (ACT) · Improved Attention Span: (Longer Attention Span (Transformer-XL) / Adaptive Attention Span / Localized Attention Span (Image Transformer)) · Less Time and Memory Cost: (Sparse Attention Matrix Factorization (Sparse Transformers) / Locality-Sensitive Hashing (Reformer)) · Make it Recurrent (Universal Transformer) · Stabilization for RL (GTrXL)"), [Madison May](https://www.pragmatic.ml/a-survey-of-methods-for-incorporating-long-term-context/ "A Survey of Long-Term Context in Transformers: Sparse Transformers · Adaptive Span Transformers · Transformer-XL · Compressive Transformers · Reformer · Routing Transformer · Sinkhorn Transformer · Linformer · Efficient Attention: Attention with Linear Complexities · Transformers are RNNs · ETC · Longformer"); review: [Tay et al 2020](https://arxiv.org/abs/2009.06732#google "Efficient Transformers: A Survey"); benchmark suite: [Long Range Arena](https://openreview.net/forum?id=qVyeW-grC2k#google "'Long Range Arena (LRA): A Benchmark for Efficient Transformers', Tay et al 2020")^[While not directly examining efficient attention mechanisms, ["Do Transformer Modifications Transfer Across Implementations and Applications?", Narang et al 2021](https://arxiv.org/abs/2102.11972#google "'Do Transformer Modifications Transfer Across Implementations and Applications?', Narang et al 2021"), which benchmarks Transformer activations/normalizations/depths/embeddings/weight-tying/architectures, finds that (as often in ML) the gains are smaller than reported & may reflect methodological issues like intensity of hyperparameter tuning & no-free-lunches, and the vanilla Transformer can be heavily hardware-optimized to allow much larger context lengths (eg.
[FlashAttention](https://arxiv.org/abs/2205.14135 "‘FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness’, Dao et al 2022")). See also ["Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?"](https://arxiv.org/abs/2207.10551#google), Tay et al 2022.]):

![**Table 1**: Summary of Efficient Transformer Models presented in chronological order of their first public disclosure ([Tay et al 2020](https://arxiv.org/pdf/2009.06732.pdf#org=google&page=6 "Efficient Transformers: A Survey: Table 1"))](/doc/ai/nn/transformer/attention/2020-tay-table1-efficienttransformermodels.png "Table 1: Summary of Efficient Transformer Models presented in chronological order of their first public disclosure. Some papers presented sequentially may first appear at the same time, eg. as an ICLR submission. Papers annotated with a superscript '†' are peer-reviewed papers. Class abbreviations include: _FP_ = Fixed Patterns or Combinations of Fixed Patterns, _M_ = Memory, _LP_ = Learnable Pattern, _LR_ = Low Rank, _KR_ = Kernel and _RC_ = Recurrence. Furthermore, _n_ generally refers to the sequence length and _b_ is the local window (or block) size. We use subscript _g_ on _n_ to denote global memory length and _n~c~_ to denote convolutionally compressed sequence lengths."){.invert}

The summary as of mid-2023: dense Transformers remain surprisingly competitive, and the many proposed variants all have their own drawbacks; none have superseded standard GPT or T5-style Transformers in more than a few niches. To paraphrase Chekhov: "If many remedies are prescribed for an illness, you can be sure it has no cure."
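To make the quadratic bottleneck concrete before diving into the bibliography, here is a back-of-the-envelope sketch (Python; the 12-head/fp16 figures are illustrative assumptions, and optimized kernels such as FlashAttention avoid materializing the full matrix, so read this as the cost of the naive implementation rather than of any particular system):

```python
# Naive dense self-attention materializes an n x n score matrix per head,
# so memory (and FLOPs) grow quadratically with sequence length n.
def attention_matrices_gib(n_tokens: int, n_heads: int = 12, bytes_per_elt: int = 2) -> float:
    """GiB for one layer's raw fp16 attention matrices (illustrative assumptions)."""
    return n_tokens**2 * n_heads * bytes_per_elt / 2**30

for n in (2_048,        # GPT-3's BPE context window
          65_536,       # a 256x256 image at one token per pixel: 256**2
          1_000_000):   # book/video/genomics-scale sequences
    print(f"n = {n:>9,}: ~{attention_matrices_gib(n):12,.1f} GiB of attention scores per layer")
```

Going from a 2,048-token text window to a 65,536-token image is a 32× longer sequence but a ~1,024× larger attention matrix, which is why the approaches below all try to knock the _n_^2^ term down to something closer to _n_.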
- ["Untangling tradeoffs between recurrence and self-attention in neural networks"](https://arxiv.org/abs/2006.09471), Kerg et al 2020 - ["Feedback Transformer: Addressing Some Limitations of Transformers with Feedback Memory"](https://arxiv.org/abs/2002.09402#facebook "'Addressing Some Limitations of Transformers with Feedback Memory', Fan et al 2020"), Fan et al 2020 - ["Shortformer: Better Language Modeling using Shorter Inputs"](https://arxiv.org/abs/2012.15832), Press et al 2020 - ["SRU++: When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute"](https://arxiv.org/abs/2102.12459 "'When Attention Meets Fast Recurrence: Training Language Models with Reduced Compute', Lei 2021"), Lei 2021 - ["SwishRNN: Simple Recurrence Improves Masked Language Models"](https://arxiv.org/abs/2205.11588#google "‘Simple Recurrence Improves Masked Language Models’, Lei et al 2022"), Lei et al 2022 - ["Block-Recurrent Transformers"](https://arxiv.org/abs/2203.07852), Hutchins et al 2022 - **RNNs**: - *Transformer ↔ RNN* relationship: see [Transformer-XL](#dai-et-al-2019), [XLNet](#yang-et-al-2019), [Katharopoulos et al 2020](#katharopoulos-et-al-2020), [Yoshida et al 2020](#yoshida-et-al-2020), [AFT](#zhai-et-al-2020), [Lei 2021](#lei-2021), [Kasai et al 2021](https://arxiv.org/abs/2103.13076 "Finetuning Pretrained Transformers into RNNs"), [Parisotto & Salakhutdinov 2021](https://arxiv.org/abs/2104.01655 "Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation"), [Perceiver](https://arxiv.org/abs/2202.07765#deepmind "‘General-purpose, long-context autoregressive modeling with Perceiver AR’, Hawthorne et al 2022"), [SwishRNN](#lei-et-al-2022), [RWKV](https://arxiv.org/abs/2305.13048 "‘RWKV: Reinventing RNNs for the Transformer Era’, Peng et al 2023") - [*dynamic evaluation*](https://arxiv.org/abs/1308.0850 "‘Generating Sequences With Recurrent Neural Networks’, Graves 2013") - [*neural cache*](https://arxiv.org/abs/1612.04426#facebook "‘Improving Neural Language Models with a Continuous Cache’, Grave et al 2016") ### Compressed History/State - ["Compressive Transformers for Long-Range Sequence Modelling"](https://arxiv.org/abs/1911.05507#deepmind), Rae et al 2019; ["Expire-Span: Not All Memories are Created Equal: Learning to Forget by Expiring"](https://arxiv.org/abs/2105.06548#facebook "'Not All Memories are Created Equal: Learning to Forget by Expiring', Sukhbaatar et al 2021"), Sukhbaatar et al 2021 - ["Memory Transformer"](https://arxiv.org/abs/2006.11527), Burtsev & Sapunov 2020 - ["Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks"](https://arxiv.org/abs/1810.00825), Lee et al 2018; ["Perceiver: General Perception with Iterative Attention"](https://arxiv.org/abs/2103.03206#deepmind){#perceiver}, Jaegle et al 2021a/["Perceiver IO: A General Architecture for Structured Inputs & Outputs"](https://arxiv.org/abs/2107.14795#deepmind "'Perceiver IO: A General Architecture for Structured Inputs & Outputs', Jaegle et al 2021"){#perceiver-io}, Jaegle et al 2021b - ["Mem2Mem: Learning to Summarize Long Texts with Memory Compression and Transfer"](https://arxiv.org/abs/2010.11322#elementai "'Learning to Summarize Long Texts with Memory Compression and Transfer', Park et al 2020"), Park et al 2020 - ["∞-former: Infinite Memory Transformer"](https://arxiv.org/abs/2109.00301), Martins et al 2021 - ["Memorizing Transformers"](https://arxiv.org/abs/2203.08913#google), Wu et al 2021 - ["ABC: Attention with Bounded-memory 
Control"](https://arxiv.org/abs/2110.02488#allen), Peng et al 2021 - ["Recursively Summarizing Books with Human Feedback"](https://arxiv.org/abs/2109.10862#openai), Wu et al 2021 - ["MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition"](https://arxiv.org/abs/2201.08383#facebook), Wu et al 2022 - ["Token Turing Machines"](https://arxiv.org/abs/2211.09119#google), Ryoo et al 2022 ## Matrix Algebra Optimizations Tricks like rewriting the softmax/dot-product to be linear: - ["Efficient Attention: Attention with Linear Complexities"](https://arxiv.org/abs/1812.01243#sensetime), Shen et al 2018 ([blog](https://medium.com/@cmsflash/efficient-attention-attention-with-linear-complexities-b3c00c4348e3 "Efficient Attention: Attention with Linear Complexities [blog]")) - ["Linformer: Self-Attention with Linear Complexity"](https://arxiv.org/abs/2006.04768#facebook), Wang et al 2020; ["Luna: Linear Unified Nested Attention"](https://arxiv.org/abs/2106.01540), Ma et al 2021 (hierarchical?); ["Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks"](https://arxiv.org/abs/2105.02358 "‘Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks (EAMLP)’, Guo et al 2021"), Guo et al 2021 - ["Transformers are RNNs (Linear Transformers): Fast Autoregressive Transformers with Linear Attention"](https://arxiv.org/abs/2006.16236 "'Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention', Katharopoulos et al 2020"), Katharopoulos et al 2020 - ["AFT: An Attention Free Transformer"](https://openreview.net/forum?id=pW--cu2FCHY#apple "'An Attention Free Transformer', Anonymous 2020"), Zhai et al 2020 - ["LambdaNetworks: Modeling long-range Interactions without Attention"](https://openreview.net/forum?id=xTJEN-ggl1b), Bello 2020 - ["cosFormer: Rethinking Softmax in Attention"](https://arxiv.org/abs/2202.08791#sensetime), Qin et al 2022 ## Approximations ### Sparsity - ["Image Transformer"](https://arxiv.org/abs/1802.05751#google), Parmar et al 2018 - [Sparse Transformer: "Generating Long Sequences with Sparse Transformers"](https://arxiv.org/abs/1904.10509#openai "'Generating Long Sequences with Sparse Transformers', Child et al 2019"), Child et al 2019 ([blog](https://openai.com/research/sparse-transformer "Generative Modeling with Sparse Transformers: We've developed the Sparse Transformer, a deep neural network which sets new records at predicting what comes next in a sequence --- whether text, images, or sound. 
## Approximations

### Sparsity

- ["Image Transformer"](https://arxiv.org/abs/1802.05751#google), Parmar et al 2018
- [Sparse Transformer: "Generating Long Sequences with Sparse Transformers"](https://arxiv.org/abs/1904.10509#openai "'Generating Long Sequences with Sparse Transformers', Child et al 2019"), Child et al 2019 ([blog](https://openai.com/research/sparse-transformer "Generative Modeling with Sparse Transformers: We've developed the Sparse Transformer, a deep neural network which sets new records at predicting what comes next in a sequence --- whether text, images, or sound. It uses an algorithmic improvement of the *attention* mechanism to extract patterns from sequences 30× longer than possible previously."))
- ["Adaptive Attention Span in Transformers"](https://arxiv.org/abs/1905.07799#facebook), Sukhbaatar et al 2019
- ["Reformer: The Efficient Transformer"](https://arxiv.org/abs/2001.04451#google), Kitaev et al 2019 (blog: [1](https://www.pragmatic.ml/reformer-deep-dive/ "'A Deep Dive into the Reformer', Madison May"), [2](https://huggingface.co/blog/reformer "'The Reformer - Pushing the limits of language modeling', Patrick von Platen 2020")); ["SMYRF: Efficient Attention using Asymmetric Clustering"](https://arxiv.org/abs/2010.05315), Daras et al 2020; ["Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding"](https://arxiv.org/abs/2009.06097#microsoft), Wang et al 2020; ["You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling"](https://arxiv.org/abs/2111.09714), Zeng et al 2021
- ["Star-Transformer"](https://arxiv.org/abs/1902.09113), Guo et al 2019
- ["Efficient Content-Based Sparse Attention with Routing Transformers"](https://arxiv.org/abs/2003.05997#google), Roy et al 2020
- ["Sparse Sinkhorn Attention"](https://arxiv.org/abs/2002.11296#google), Tay et al 2020 ([blog](https://www.pragmatic.ml/sparse-sinkhorn-attention/ "'Optimal Transport and the Sinkhorn Transformer', Madison May"))
- ["BigBird: Transformers for Longer Sequences"](https://arxiv.org/abs/2007.14062#google), Zaheer et al 2020 ([blog](https://blog.research.google/2021/03/constructing-transformers-for-longer.html "Constructing Transformers For Longer Sequences with Sparse Attention Methods"); see also [ETC](#etc))
- **Axial attention**: ["Axial Attention in Multidimensional Transformers"](https://arxiv.org/abs/1912.12180#google), Ho et al 2019; [Huang et al 2018](https://arxiv.org/abs/1811.11721 "CCNet: Criss-Cross Attention for Semantic Segmentation"); [Wang et al 2020b](https://arxiv.org/abs/2003.07853#google "Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation"); [Weissenborn et al 2020](https://arxiv.org/abs/1906.02634#google "Scaling Autoregressive Video Models")^[Speculative inclusion---there may be some way to use the factorization of axial attention, generally intended for multidimensional data like 2D images which can split the full attention into small linear-complexity Height × Width components, on 1D sequences like natural language.]
- ["Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting"](https://arxiv.org/abs/2012.07436), Zhou et al 2020 - ["LogSparse Transformer: Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting"](https://arxiv.org/abs/1907.00235 "'Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting', Li et al 2019"), Li et al 2019 - ["OmniNet: Omnidirectional Representations from Transformers"](https://arxiv.org/abs/2103.01075#google), Tay et al 2021 - ["Combiner: Full Attention Transformer with Sparse Computation Cost"](https://arxiv.org/abs/2107.05768#google), Ren et al 2021 - ["Scatterbrain: Unifying Sparse and Low-rank Attention Approximation"](https://arxiv.org/abs/2110.15343#facebook), Chen et al 2021 - ["Sparse Is Enough in Scaling Transformers"](https://arxiv.org/abs/2111.12763#google), Jaszczur et al 2021 - Note: Several implementations are available in [DeepSpeed](https://www.deepspeed.ai/2020/09/08/sparse-attention-news.html "DeepSpeed Sparse Attention") ### Global ↔ Local Attention - ["LSRA: Lite Transformer with Long-Short Range Attention"](https://arxiv.org/abs/2004.11886 "'Lite Transformer with Long-Short Range Attention', Wu et al 2020"), Wu et al 2020a - ["BlockBERT: Blockwise self-attention for long document understanding"](https://arxiv.org/abs/1911.02972#facebook "'Blockwise Self-Attention for Long Document Understanding', Qiu et al 2019"), Qiu et al 2019 - ["BP-Transformer: Modelling Long-Range Context via Binary Partitioning"](https://arxiv.org/abs/1911.04070), Ye et al 2019 - ["Longformer: The Long-Document Transformer"](https://arxiv.org/abs/2004.05150), Beltagy et al 2020; ["CD-LM: Cross-Document Language Modeling"](https://arxiv.org/abs/2101.00406 "'Cross-Document Language Modeling', Caciularu et al 2021"), Caciularu et al 2021; "Simple Local Attentions Remain Competitive for Long-Context Tasks", Xiong et al 2021 - ["ETC: Encoding Long and Structured Data in Transformers"](https://arxiv.org/abs/2004.08483 "'ETC: Encoding Long and Structured Inputs in Transformers', Ainslie et al 2020"), Ainslie et al 2020; ["LongT5: Efficient Text-To-Text Transformer for Long Sequences"](https://arxiv.org/abs/2112.07916#google), Guo et al 2021^[One question I have about methods which reuse part of the context window for memory: can we do curriculum training, and efficiently train a Transformer normally with a fixed window for most of the training, and then switch over to overloading part of the context as the new memory ([Yoshida et al 2020](https://arxiv.org/abs/2008.07027 "Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size"))? That would hypothetically save much of the compute, although one might wonder if the learned algorithms & representations will be inferior compared to a Transformer which was always trained with memory.] 
- ["Conformer: Convolution-augmented Transformer for Speech Recognition"](https://arxiv.org/abs/2005.08100#google), Gulatti et al 2020 ([Zhang et al 2020](https://arxiv.org/abs/2010.10504#google "Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition")) - ["SMITH: Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Document Matching"](https://arxiv.org/abs/2004.12297#google "'Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching', Yang et al 2020"), Yang et al 2020 - ["Multi-scale Transformer Language Models"](https://arxiv.org/abs/2005.00581#facebook), Subramanian et al 2020 - ["Hierarchical Transformers for Multi-Document Summarization"](https://arxiv.org/abs/1905.13164), Liu & Lapata 2019; ["Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling"](https://arxiv.org/abs/2106.01040), Wu et al 2021 - ["Transformer-QL: A Step Towards Making Transformer Network Quadratically Large"](https://openreview.net/forum?id=WlT94P_zuHF), Hajra 2020 - ["Coordination Among Neural Modules Through a Shared Global Workspace"](https://arxiv.org/abs/2103.01197), Goyal et al 2021 - ["GANSformer: Generative Adversarial Transformers"](https://arxiv.org/abs/2103.01209 "'Generative Adversarial Transformers', Hudson & Zitnick 2021"), Hudson & Zitnick 2021 - ["Swin Transformer: Hierarchical Vision Transformer using Shifted Windows"](https://arxiv.org/abs/2103.14030){#swin-1}, Liu et al 2021a; ["Swin Transformer V2: Scaling Up Capacity and Resolution"](https://arxiv.org/abs/2111.09883){#swin-2}, Liu et al 2021b - ["Hierarchical Transformers Are More Efficient Language Models"](https://arxiv.org/abs/2110.13711#nvidia "‘Hourglass: Hierarchical Transformers Are More Efficient Language Models’, Nawrot et al 2021"), Nawrot et al 2021 - ["Long-Short Transformer (Transformer-LS): Efficient Transformers for Language and Vision"](https://arxiv.org/abs/2107.02192#nvidia "'Long-Short Transformer: Efficient Transformers for Language and Vision', Zhu et al 2021"), Zhu et al 2021 - ["AdaMRA: Adaptive Multi-Resolution Attention with Linear Complexity"](https://arxiv.org/abs/2108.04962 "'Adaptive Multi-Resolution Attention with Linear Complexity', Zhang et al 2021"), Zhang et al 2021 - ["Fastformer: Additive Attention is All You Need"](https://arxiv.org/abs/2108.09084 "‘Fastformer: Additive Attention Can Be All You Need’, Wu et al 2021"), Wu et al 2021 - ["FLASH: Transformer Quality in Linear Time"](https://arxiv.org/abs/2202.10447#google "'Transformer Quality in Linear Time', Hua et al 2022"), Hua et al 2022 (see also [MLP-Mixer](/note/fully-connected#mlp-mixer)) - ["NAT: Neighborhood Attention Transformer"](https://arxiv.org/abs/2204.07143 "‘Neighborhood Attention Transformer’, Hassani et al 2022"), Hassani et al 2022; ["DiNAT: Dilated Neighborhood Attention Transformer"](https://arxiv.org/abs/2209.15001 "‘Dilated Neighborhood Attention Transformer’, Hassani & Shi 2022"), Hassani & Shi 2022 ## Miscellaneous Dropping components, non-trainable/randomized parts, etc: - ["Generating Wikipedia by Summarizing Long Sequences"](https://arxiv.org/abs/1801.10198#google), Liu et al 2018 (memory compressed) - ["Pay Less Attention with Lightweight and Dynamic Convolutions"](https://arxiv.org/abs/1901.10430#facebook), Wu et al 2019b - ["Music Transformer"](https://arxiv.org/abs/1809.04281#google), Huang et al 2020 - ["Synthesizer: Rethinking Self-Attention in Transformer 
Models"](https://arxiv.org/abs/2005.00743#google), Tay et al 2020 - ["Performer (FAVOR): Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers"](https://arxiv.org/abs/2006.03555#google "'Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers', Choromanski et al 2020"), Choromanski et al 2020a (on turning [Transformers into RNNs](#transformer-rnn)); ["FAVOR+: Rethinking Attention with Performers"](https://arxiv.org/abs/2009.14794#google), Choromanski et al 2020b ([blog](https://blog.research.google/2020/10/rethinking-attention-with-performers.html "'Rethinking Attention with Performers', Choromanski & Colwell 2020"); [DRL use](https://arxiv.org/abs/2102.04353 "'Unlocking Pixels for Reinforcement Learning via Implicit Attention', Choromanski et al 2021"); can be trained in [constant memory](https://arxiv.org/abs/2012.11346 "Sub-Linear Memory: How to Make Performers SLiM")); ["RFA: Random Feature Attention"](https://openreview.net/forum?id=QtTKTdVrFBB "'Random Feature Attention', Peng et al 2021"), Peng et al 2020; ["DPFP: Linear Transformers Are Secretly Fast Weight Memory Systems"](https://arxiv.org/abs/2102.11174#schmidhuber "'Linear Transformers Are Secretly Fast Weight Programmers', Schlag et al 2021"), Schlag et al 2021; ["DAFT: A Dot Product Attention Free Transformer"](https://openreview.net/forum?id=JVR4JswsEM "'A Dot Product Attention Free Transformer', Zhai et al 2021"), Zhai et al 2020 - ["Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention"](https://arxiv.org/abs/2102.03902), Xiong et al 2021; ["Skyformer: Remodel Self-Attention with Gaussian Kernel and Nyström Method"](https://arxiv.org/abs/2111.00035), Chen et al 2021 - ["Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing"](https://arxiv.org/abs/2006.03236), Dai et al 2020 - ["LazyFormer: Self Attention with Lazy Update"](https://arxiv.org/abs/2102.12702#microsoft), Ying et al 2021 - ["RASP: Thinking Like Transformers"](https://arxiv.org/abs/2106.06981), Weiss et al 2021 (examining limitations of efficient Transformers: in terms of algorithms, what does going from _n_^2^ to _n_ cost? What "programs" do Transformers encode?) - ["Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding"](https://arxiv.org/abs/2106.12566), Luo et al 2021 - ["On Learning the Transformer Kernel"](https://arxiv.org/abs/2110.08323), Chowdhury et al 2021 - **Structured State Models** (SSMs): ["Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers"](https://arxiv.org/abs/2110.13985 "'LSSL: Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers', Gu et al 2021"){#gu-et-al-2021-ssm-a}, Gu et al 2021a; ["S4: Efficiently Modeling Long Sequences with Structured State Spaces"](https://arxiv.org/abs/2111.00396){#gu-et-al-2021-ssm-b}, Gu et al 2021b; ["HiPPO: Recurrent Memory with Optimal Polynomial Projections"](https://arxiv.org/abs/2008.07669){#gu-et-al-2021-hippo}, Gu et al 2021c - ["Self-attention Does Not Need 𝒪(_n_^2^) Memory"](https://arxiv.org/abs/2112.05682#google "'Self-attention Does Not Need 𝒪(n2) Memory', Rabe & Staats 2021"), Rabe & Staats 2021 (does still cost 𝒪(_n_^2^) compute) - ["How Much Does Attention Actually Attend? 
### Retrieval

[Retrieval approaches](/doc/ai/nn/retrieval/index):

- ["REALM: Retrieval-Augmented Language Model Pre-Training"](https://arxiv.org/abs/2002.08909#google), Guu et al 2020
- ["MARGE: Pre-training via Paraphrasing"](https://arxiv.org/abs/2006.15020#facebook "'Pre-training via Paraphrasing', Lewis et al 2020"), Lewis et al 2020a
- ["RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"](https://arxiv.org/abs/2005.11401#facebook "'Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks', Lewis et al 2020"), Lewis et al 2020b
- ["Current Limitations of Language Models: What You Need is Retrieval"](https://arxiv.org/abs/2009.06857), Komatsuzaki 2020
- ["Memorizing Transformers"](https://arxiv.org/abs/2203.08913#google), Wu et al 2022