
Efficient Attention: Breaking The Quadratic Transformer Bottleneck

Discussion of removing a major architectural limitation in Transformer neural networks: the length of the input they can look at. Beyond a few thousand input tokens, the resource requirements explode quadratically, rendering it infeasible to encode raw text at the character level, much less use entire books, images, or many other kinds of data which could be useful. Even for text, this inability also forces limitations like the use of BPE text encoding (responsible for sabotaging GPT-3’s rhyming, among other things), forgetfulness, limits to prompt programming, and inability to write coherent long texts.

A bibliography of possibilities for fixing this is organized hierarchically below:

  1. adding state, through recurrence (a memory) or creating a compressed history/state as an explicit summary (see the sketch after this list)

  2. tinkering with matrix algebra to remove the quadratic explosion while still keeping more or less the same self-attention mechanism

  3. approximating self-attention: using attention on only a small subset of tokens at any time (dodging the quadratic limit), or using a mix of local and global attention (local attentions to do most of the work, and global attention on top of the local attentions, each one avoiding the quadratic by considering only a few inputs at a time)

  4. miscellaneous tricks: removing parts, using only randomized untrainable components (with no need to compute gradients over them), etc.

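To make option 1 concrete, here is a minimal NumPy sketch of segment-level recurrence, loosely in the spirit of Transformer-XL (the projections are untrained and all names are hypothetical): the sequence is processed in fixed-size segments, each segment attends over its own tokens plus a cached memory of the most recent states, and so per-segment cost stays constant no matter how long the full sequence grows.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Standard scaled dot-product attention."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)           # (n_q, n_k)
    return softmax(scores, axis=-1) @ v     # (n_q, d)

def recurrent_segments(x, seg_len=128, mem_len=128, d=64, seed=0):
    """Process a long sequence segment-by-segment, carrying a fixed-size memory.

    Each segment's queries attend over [memory ; current segment], so per-segment
    cost is O(seg_len * (seg_len + mem_len)) instead of O(n^2) over the whole sequence."""
    rng = np.random.default_rng(seed)
    # Hypothetical (untrained) projections, only to make the sketch runnable.
    Wq, Wk, Wv = (rng.standard_normal((x.shape[-1], d)) / np.sqrt(d) for _ in range(3))
    memory = np.zeros((0, x.shape[-1]))
    outputs = []
    for start in range(0, len(x), seg_len):
        seg = x[start:start + seg_len]
        ctx = np.concatenate([memory, seg], axis=0)   # memory + current tokens
        outputs.append(attend(seg @ Wq, ctx @ Wk, ctx @ Wv))
        memory = ctx[-mem_len:]                       # keep only the most recent states
    return np.concatenate(outputs, axis=0)

# Usage: a 4,096-step "sequence" of 32-dim embeddings, processed with constant memory per segment.
x = np.random.default_rng(1).standard_normal((4096, 32))
print(recurrent_segments(x).shape)   # (4096, 64)
```
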
One of the most frustrating limitations of GPT-3 (as awesome as it is) is the context window: 2048 text tokens (BPEs) is adequate for many text-related tasks, and even GPT-3’s performance on that window is far from perfect, indicating it has a long way to go in truly understanding text. But 2048 BPEs runs out fast when you start prompt programming something hard, hacks like BPEs have nasty & subtle side-effects, and (as iGPT/ViT indicate in their own ways) such a window is totally inadequate for other modalities like images—a single small 256px (256×256) image is already equivalent to a sequence of length l = 65,536, never mind video or raw audio!
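
A back-of-the-envelope calculation shows why: storing just one n×n attention score matrix (per head, per layer, at 2 bytes per entry in fp16; the numbers are illustrative, not measurements of any particular implementation) goes from trivial to impossible as sequences lengthen.

```python
# Memory for a single n-by-n attention score matrix at 2 bytes/entry (fp16), per head per layer.
for n in (2_048, 65_536, 1_000_000):
    gib = n * n * 2 / 2**30
    print(f"n = {n:>9,}: {gib:10,.3f} GiB")
# n =     2,048:      0.008 GiB   (GPT-3's text context window)
# n =    65,536:      8.000 GiB   (one small 256px image as raw pixels)
# n = 1,000,000:  1,862.645 GiB   (book- or genome-scale sequences)
```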

How do we get future Transformers with reasonable context windows and/or memory, which we can use for research papers, books, structured text, images, video, audio, point clouds, genomics, and so on, where we need to handle sequences with lengths in the millions? (Such improvements would permit not just doing things GPT-3 struggles to do, like write coherent novels, but many better architectures, like multimodal Transformers which can learn jointly from images & text, accessing image-based datasets like PDFs, and learning far more accurate human-like representations & tacit knowledge with less data & smaller models, providing large models useful for almost all conceivable tasks—especially robotics.)

Below I compile & categorize research on breaking the dense attention quadratic bottleneck (overviews: Lilian Weng, Madison May; review: Tay et al 2020; benchmark suite: Long Range Arena):

Table 1: Summary of Efficient Transformer Models presented in chronological order of their first public disclosure (Tay et al 2020)

The summary as of mid-2023: dense Transformers remain surprisingly competitive, and the many proposed variants all have their own drawbacks; none have superseded standard GPT or T5-style Transformers in more than a few niches. To paraphrase Chekhov: “If many remedies are prescribed for an illness, you can be sure it has no cure.”

Efficient Attention

State

Recurrency

Compressed History/State

Matrix Algebra Optimizations

Tricks like rewriting the softmax/dot-product to be linear:
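
For example, one family of such rewrites replaces softmax(QKᵀ)V with φ(Q)(φ(K)ᵀV) for some positive feature map φ, so the small d×d product φ(K)ᵀV is computed once and the n×n score matrix is never materialized, giving O(n·d²) instead of O(n²·d) cost. Below is a minimal non-causal sketch using the elu(x)+1 feature map of Katharopoulos et al 2020 (“Transformers are RNNs”); the shapes, scaling constants, and random data are illustrative only.

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1 > 0: a positive feature map standing in for exp(.) in the softmax.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    """O(n * d^2) attention: softmax(Q K^T) V is replaced by phi(Q) (phi(K)^T V) / normalizer."""
    Qf, Kf = elu_feature_map(Q), elu_feature_map(K)       # (n, d) each
    KV = Kf.T @ V                                         # (d, d) -- computed once, no n x n matrix
    Z = Qf @ Kf.sum(axis=0, keepdims=True).T + eps        # (n, 1) row-wise normalizer
    return (Qf @ KV) / Z

# Usage: 65,536 "tokens" of width 64 -- a length whose dense n x n score matrix
# would already need ~8 GiB in fp16 (see the arithmetic above).
rng = np.random.default_rng(0)
n, d = 65_536, 64
Q, K, V = rng.standard_normal((3, n, d)) * 0.1
print(linear_attention(Q, K, V).shape)   # (65536, 64)
```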

Approximations

Sparsity

Global ↔︎ Local Attention
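
A minimal sketch of the local/global mix described in item 3 above: each token attends to a sliding window of its neighbors plus a handful of designated “global” tokens, roughly in the spirit of Longformer/BigBird (the window size, number of global tokens, and the naive per-token Python loop are purely illustrative).

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def local_global_attention(Q, K, V, window=64, n_global=4):
    """Each token attends to +/- `window` neighbors plus the first `n_global` tokens,
    so work per token is O(window + n_global) rather than O(n)."""
    n, d = Q.shape
    out = np.empty_like(V)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        idx = np.unique(np.concatenate([np.arange(n_global), np.arange(lo, hi)]))
        scores = Q[i] @ K[idx].T / np.sqrt(d)     # (|idx|,) -- never an n x n matrix
        out[i] = softmax(scores) @ V[idx]
    return out

# Usage
rng = np.random.default_rng(0)
n, d = 4_096, 64
Q, K, V = rng.standard_normal((3, n, d)) * 0.1
print(local_global_attention(Q, K, V).shape)   # (4096, 64)
```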

Miscellaneous

Dropping components, non-trainable/randomized parts, etc.:

Retrieval

Retrieval approaches:
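
A minimal sketch of the retrieval idea, loosely RETRO/kNN-LM-flavored (the random “embeddings” and brute-force dot-product search below are stand-ins for a real embedding model and an approximate-nearest-neighbor index): rather than attending over everything ever seen, embed the current query, fetch the top-k most similar chunks from an external store, and feed only those into the fixed-size context window, so the usable “memory” scales with the corpus while attention cost stays constant.

```python
import numpy as np

def retrieve_top_k(query_vec, chunk_vecs, k=4):
    """Indices of the k chunks most similar (by dot product) to the query vector.
    Stands in for an approximate-nearest-neighbor index in a real system."""
    scores = chunk_vecs @ query_vec
    return np.argsort(scores)[-k:][::-1]

# Hypothetical setup: a corpus pre-split into 100,000 chunks, each embedded as a unit vector.
rng = np.random.default_rng(0)
n_chunks, d = 100_000, 128
chunk_vecs = rng.standard_normal((n_chunks, d))
chunk_vecs /= np.linalg.norm(chunk_vecs, axis=1, keepdims=True)

query_vec = rng.standard_normal(d)
query_vec /= np.linalg.norm(query_vec)

# Only these few retrieved chunks (not the whole corpus) get placed into the Transformer's
# context window alongside the query.
print(retrieve_top_k(query_vec, chunk_vecs, k=4))
```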
