Bibliography (135):

  1. GPT-3: Language Models are Few-Shot Learners

  2. GPT-3 Creative Fiction

  3. GPT-3 Creative Fiction § BPEs

  4. The Transformer Family

  5. A Survey of Long-Term Context in Transformers

  6. Efficient Transformers: A Survey

  7. Long Range Arena (LRA): A Benchmark for Efficient Transformers

  8. Do Transformer Modifications Transfer Across Implementations and Applications?

  9. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

  10. Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?

  11. Efficient Transformers: A Survey § Table 1

  12. Universal Transformers

  13. DEQ: Deep Equilibrium Models

  14. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

  15. Transformer-XL—Combining Transformers and RNNs Into a State-Of-The-Art Language Model

  16. XLNet: Generalized Autoregressive Pretraining for Language Understanding

  17. So I Tried out GPT-3’s Trick of Conditioning on Training Data With XLNet. While It Doesn’t Do as well as the 175B GPT-3, It Does Much Better Than the Version Which Is the Same Size As XLNet (0.4B). The Visual below Is from Their Paper on WinoGrande—I Added the Squares for XLNet.

  18. Untangling tradeoffs between recurrence and self-attention in neural networks

  19. Addressing Some Limitations of Transformers with Feedback Memory

  20. Shortformer: Better Language Modeling using Shorter Inputs

  21. When Attention Meets Fast Recurrence: Training SRU++ Language Models with Reduced Compute

  22. Simple Recurrence Improves Masked Language Models

  23. Block-Recurrent Transformers

  24. Finetuning Pretrained Transformers into RNNs

  25. ALD: Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation

  26. General-purpose, long-context autoregressive modeling with Perceiver AR

  27. RWKV: Reinventing RNNs for the Transformer Era

  28. Generating Sequences With Recurrent Neural Networks

  29. Improving Neural Language Models with a Continuous Cache

  30. Compressive Transformers for Long-Range Sequence Modeling

  31. Not All Memories are Created Equal: Learning to Forget by Expiring

  32. Memory Transformer

  33. Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks

  34. Perceiver: General Perception with Iterative Attention

  35. Perceiver IO: A General Architecture for Structured Inputs & Outputs

  36. Learning to Summarize Long Texts with Memory Compression and Transfer

  37. ∞-former: Infinite Memory Transformer

  38. Memorizing Transformers

  39. ABC: Attention with Bounded-memory Control

  40. Recursively Summarizing Books with Human Feedback

  41. MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

  42. Token Turing Machines

  43. Efficient Attention: Attention with Linear Complexities

  44. Efficient Attention: Attention with Linear Complexities [Blog]

  45. Linformer: Self-Attention with Linear Complexity

  46. Luna: Linear Unified Nested Attention

  47. Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks (EAMLP)

  48. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

  49. AFT: An Attention Free Transformer

  50. LambdaNetworks: Modeling long-range Interactions without Attention

  51. cosFormer: Rethinking Softmax in Attention

  52. Image Transformer

  53. Generating Long Sequences with Sparse Transformers

  54. Generative Modeling with Sparse Transformers: We’ve developed the Sparse Transformer, a deep neural network which sets new records at predicting what comes next in a sequence—whether text, images, or sound. It uses an algorithmic improvement of the attention mechanism to extract patterns from sequences 30× longer than possible previously

  55. Adaptive Attention Span in Transformers

  56. Reformer: The Efficient Transformer

  57. A Deep Dive into the Reformer

  58. The Reformer—Pushing the Limits of Language Modeling

  59. SMYRF: Efficient Attention using Asymmetric Clustering

  60. Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding

  61. You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling

  62. Star-Transformer

  63. Efficient Content-Based Sparse Attention with Routing Transformers

  64. Sparse Sinkhorn Attention

  65. Optimal Transport and the Sinkhorn Transformer

  66. BigBird: Transformers for Longer Sequences

  67. Constructing Transformers For Longer Sequences With Sparse Attention Methods

  68. Axial Attention in Multidimensional Transformers

  69. CCNet: Criss-Cross Attention for Semantic Segmentation

  70. Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation

  71. Scaling Autoregressive Video Models

  72. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

  73. Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting

  74. OmniNet: Omnidirectional Representations from Transformers

  75. Combiner: Full Attention Transformer with Sparse Computation Cost

  76. Scatterbrain: Unifying Sparse and Low-rank Attention Approximation

  77. Sparse is Enough in Scaling Transformers

  78. DeepSpeed Sparse Attention

  79. Lite Transformer with Long-Short Range Attention

  80. Blockwise Self-Attention for Long Document Understanding

  81. BP-Transformer: Modeling Long-Range Context via Binary Partitioning

  82. Longformer: The Long-Document Transformer

  83. CDLM: Cross-Document Language Modeling

  84. ETC: Encoding Long and Structured Inputs in Transformers

  85. LongT5: Efficient Text-To-Text Transformer for Long Sequences

  86. Adding Recurrence to Pretrained Transformers for Improved Efficiency and Context Size

  87. Conformer: Convolution-augmented Transformer for Speech Recognition

  88. Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition

  89. Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching

  90. Multi-scale Transformer Language Models

  91. Hierarchical Transformers for Multi-Document Summarization

  92. Hi-Transformer: Hierarchical Interactive Transformer for Efficient and Effective Long Document Modeling

  93. Transformer-QL: A Step Towards Making Transformer Network Quadratically Large

  94. Coordination Among Neural Modules Through a Shared Global Workspace

  95. Generative Adversarial Transformers

  96. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

  97. Swin Transformer V2: Scaling Up Capacity and Resolution

  98. Hourglass: Hierarchical Transformers Are More Efficient Language Models

  99. Long-Short Transformer (Transformer-LS): Efficient Transformers for Language and Vision

  100. AdaMRA: Adaptive Multi-Resolution Attention with Linear Complexity

  101. Fastformer: Additive Attention Can Be All You Need

  102. Transformer Quality in Linear Time

  103. index#mlp-mixer

  104. NAT: Neighborhood Attention Transformer

  105. DiNAT: Dilated Neighborhood Attention Transformer

  106. Generating Wikipedia by Summarizing Long Sequences

  107. Pay Less Attention with Lightweight and Dynamic Convolutions

  108. Music Transformer

  109. Synthesizer: Rethinking Self-Attention in Transformer Models

  110. Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers

  111. FAVOR+: Rethinking Attention with Performers

  112. Rethinking Attention With Performers

  113. Unlocking Pixels for Reinforcement Learning via Implicit Attention

  114. Sub-Linear Memory: How to Make Performers SLiM

  115. Random Feature Attention

  116. Linear Transformers Are Secretly Fast Weight Programmers

  117. A Dot Product Attention Free Transformer

  118. Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention

  119. Skyformer: Remodel Self-Attention with Gaussian Kernel and Nyström Method

  120. Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

  121. LazyFormer: Self Attention with Lazy Update

  122. RASP: Thinking Like Transformers

  123. Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding

  124. On Learning the Transformer Kernel

  125. LSSL: Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers

  126. S4: Efficiently Modeling Long Sequences with Structured State Spaces

  127. HiPPO: Recurrent Memory with Optimal Polynomial Projections

  128. Self-attention Does Not Need 𝒪(n²) Memory

  129. How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers

  130. ‘MLP NN’ directory

  131. ‘retrieval AI’ directory

  132. REALM: Retrieval-Augmented Language Model Pre-Training

  133. Pre-training via Paraphrasing

  134. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

  135. Current Limitations of Language Models: What You Need is Retrieval