Bibliography (135):

  1. GPT-3: Language Models are Few-Shot Learners

  2. GPT-3 Creative Fiction

  3. GPT-3 Creative Fiction § BPEs

  4. The Transformer Family

  5. A Survey of Long-Term Context in Transformers

  6. Efficient Transformers: A Survey

  7. Long Range Arena (LRA): A Benchmark for Efficient Transformers

  8. Do Transformer Modifications Transfer Across Implementations and Applications?

  9. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

  10. Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?

  11. Efficient Transformers: A Survey § Table 1

  12. Universal Transformers

  13. DEQ: Deep Equilibrium Models

  14. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

  15. Transformer-XL—Combining Transformers and RNNs Into a State-Of-The-Art Language Model

  16. XLNet: Generalized Autoregressive Pretraining for Language Understanding

  17. So I Tried out GPT-3’s Trick of Conditioning on Training Data With XLNet. While It Doesn’t Do as well as the 175B GPT-3, It Does Much Better Than the Version Which Is the Same Size As XLNet (0.4B). The Visual below Is from Their Paper on WinoGrande—I Added the Squares for XLNet.

  18. Untangling tradeoffs between recurrence and self-attention in neural networks

  19. Addressing Some Limitations of Transformers with Feedback Memory

  20. Shortformer: Better Language Modeling using Shorter Inputs

  21. When Attention Meets Fast Recurrence: Training SRU++ Language Models with Reduced Compute

  22. Simple Recurrence Improves Masked Language Models

  23. Block-Recurrent Transformers

  24. Finetuning Pretrained Transformers into RNNs

  25. ALD: Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation

  26. General-purpose, long-context autoregressive modeling with Perceiver AR

  27. RWKV: Reinventing RNNs for the Transformer Era

  28. Generating Sequences With Recurrent Neural Networks

  29. Improving Neural Language Models with a Continuous Cache

  30. Compressive Transformers for Long-Range Sequence Modeling

  31. Not All Memories are Created Equal: Learning to Forget by Expiring

  32. Memory Transformer

  33. Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks

  34. Perceiver: General Perception with Iterative Attention

  35. Perceiver IO: A General Architecture for Structured Inputs & Outputs

  36. Learning to Summarize Long Texts with Memory Compression and Transfer

  37. ∞-former: Infinite Memory Transformer

  38. Memorizing Transformers

  39. ABC: Attention with Bounded-memory Control

  40. Recursively Summarizing Books with Human Feedback

  41. MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

  42. Token Turing Machines

  43. Efficient Attention: Attention with Linear Complexities

  44. Efficient Attention: Attention with Linear Complexities [Blog]

  45. Linformer: Self-Attention with Linear Complexity

  46. Luna: Linear Unified Nested Attention

  47. Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks (EAMLP)

  48. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

  49. AFT: An Attention Free Transformer

  50. LambdaNetworks: Modeling long-range Interactions without Attention

  51. cosFormer: Rethinking Softmax in Attention

  52. Image Transformer

  53. Generating Long Sequences with Sparse Transformers

  54. Generative Modeling with Sparse Transformers: We’ve developed the Sparse Transformer, a deep neural network which sets new records at predicting what comes next in a sequence—whether text, images, or sound. It uses an algorithmic improvement of the attention mechanism to extract patterns from sequences 30× longer than possible previously

  55. Adaptive Attention Span in Transformers

  56. Reformer: The Efficient Transformer

  57. A Deep Dive into the Reformer

  58. The Reformer—Pushing the Limits of Language Modeling

  59. SMYRF: Efficient Attention using Asymmetric Clustering

  60. Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding

  61. You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling

  62. Star-Transformer

  63. Efficient Content-Based Sparse Attention with Routing Transformers

  64. Sparse Sinkhorn Attention

  65. Optimal Transport and the Sinkhorn Transformer

  66. BigBird: Transformers for Longer Sequences

  67. Constructing Transformers For Longer Sequences With Sparse Attention Methods

  68. Axial Attention in Multidimensional Transformers

  69. CCNet: Criss-Cross Attention for Semantic Segmentation

  70. Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation

  71. Scaling Autoregressive Video Models

  72. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

  73. Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting

  74. OmniNet: Omnidirectional Representations from Transformers

  75. Combiner: Full Attention Transformer with Sparse Computation Cost

  76. Scatterbrain: Unifying Sparse and Low-rank Attention Approximation

  77. Sparse is Enough in Scaling Transformers

  78. DeepSpeed Sparse Attention

  79. Lite Transformer with Long-Short Range Attention

  80. Blockwise Self-Attention for Long Document Understanding

  81. BP-Transformer: Modeling Long-Range Context via Binary Partitioning

  82. Longformer: The Long-Document Transformer

  83. CDLM: Cross-Document Language Modeling

  84. ETC: Encoding Long and Structured Inputs in Transformers

  85. LongT5: Efficient Text-To-Text Transformer for Long Sequences

  86. Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size

  87. Conformer: Convolution-augmented Transformer for Speech Recognition

  88. Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition

  89. Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching

  90. Multi-scale Transformer Language Models

  91. Hierarchical Transformers for Multi-Document Summarization

  92. Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling

  93. Transformer-QL: A Step Towards Making Transformer Network Quadratically Large

  94. Coordination Among Neural Modules Through a Shared Global Workspace

  95. Generative Adversarial Transformers

  96. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

  97. Swin Transformer V2: Scaling Up Capacity and Resolution

  98. Hourglass: Hierarchical Transformers Are More Efficient Language Models

  99. Long-Short Transformer (Transformer-LS): Efficient Transformers for Language and Vision

  100. AdaMRA: Adaptive Multi-Resolution Attention with Linear Complexity

  101. Fastformer: Additive Attention Can Be All You Need

  102. Transformer Quality in Linear Time

  103. index#mlp-mixer

  104. NAT: Neighborhood Attention Transformer

  105. DiNAT: Dilated Neighborhood Attention Transformer

  106. Generating Wikipedia by Summarizing Long Sequences

  107. Pay Less Attention with Lightweight and Dynamic Convolutions

  108. Music Transformer

  109. Synthesizer: Rethinking Self-Attention in Transformer Models

  110. Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers

  111. FAVOR+: Rethinking Attention with Performers

  112. Rethinking Attention With Performers

  113. Unlocking Pixels for Reinforcement Learning via Implicit Attention

  114. Sub-Linear Memory: How to Make Performers SLiM

  115. Random Feature Attention

  116. Linear Transformers Are Secretly Fast Weight Programmers

  117. A Dot Product Attention Free Transformer

  118. Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention

  119. Skyformer: Remodel Self-Attention with Gaussian Kernel and Nyström Method

  120. Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

  121. LazyFormer: Self Attention with Lazy Update

  122. RASP: Thinking Like Transformers

  123. Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding

  124. On Learning the Transformer Kernel

  125. LSSL: Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers

  126. S4: Efficiently Modeling Long Sequences with Structured State Spaces

  127. HiPPO: Recurrent Memory with Optimal Polynomial Projections

  128. Self-attention Does Not Need 𝒪(n²) Memory

  129. How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers

  130. ‘MLP NN’ directory

  131. ‘retrieval AI’ directory

  132. REALM: Retrieval-Augmented Language Model Pre-Training

  133. Pre-training via Paraphrasing

  134. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

  135. Current Limitations of Language Models: What You Need is Retrieval