“Efficient Attention: Breaking The Quadratic Transformer Bottleneck”, Gwern, 2020-07-25:

Discussion of removing a major architectural limitation in Transformer neural networks: the length of the input they can look at. Beyond a few thousand inputs, the resource requirements grow quadratically with input length, rendering it infeasible to encode raw text at the character level, much less use entire books, images, or many other kinds of data which could be useful. Even for text, this limit forces compromises like the use of BPE text encoding (responsible for sabotaging GPT-3’s rhyming, among other things), and causes forgetfulness, limits to prompt programming, and an inability to write coherent long texts.
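To make the quadratic growth concrete, here is a minimal back-of-the-envelope sketch. It assumes standard dense self-attention, where each of `n` tokens scores every other token, and it counts only the score matrix for a single head in a single layer. The function names and the 4-bytes-per-score figure (32-bit floats) are illustrative assumptions, not taken from the article.

```python
def attention_matrix_entries(seq_len: int) -> int:
    """Pairwise attention scores for one head in one layer: n * n."""
    return seq_len * seq_len


def attention_matrix_bytes(seq_len: int, bytes_per_score: int = 4) -> int:
    """Memory for that score matrix, assuming 32-bit floats by default."""
    return attention_matrix_entries(seq_len) * bytes_per_score


if __name__ == "__main__":
    # Doubling the input length quadruples the number of scores.
    for n in (1_024, 2_048, 4_096, 100_000):
        mib = attention_matrix_bytes(n) / 2**20
        print(f"{n:>7} tokens: {attention_matrix_entries(n):>14,} scores, {mib:,.0f} MiB")
```

Under these assumptions a 1,024-token input needs 4 MiB per head per layer for the scores alone, while a 100,000-token input (roughly a short book at the character level) needs tens of gigabytes, before multiplying by the number of heads and layers.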

A bibliography of possibilities for fixing this is organized hierarchically below:

  1. adding state, through recurrence (a memory) or creating a compressed history/state as an explicit summary

  2. tinkering with matrix algebra to remove the quadratic explosion while still keeping more or less the same self-attention mechanism

  3. approximating self-attention: using attention on only a small subset of tokens at any time (dodging the quadratic limit), or using a mix of local and global attention (local attentions to do most of the work, and global attention on top of the local attentions, each one avoiding the quadratic by considering only a few inputs at a time)

  4. miscellaneous tricks: removing parts, using only randomized untrainable components (which need no gradient computation), etc.
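As a rough illustration of the third category, the sketch below counts how many token pairs are scored under full attention, under a purely local sliding window, and under a two-level local-plus-global scheme. The window size, block size, and the idea of one summary token per block are hypothetical choices made for this example; they do not reproduce any specific paper from the bibliography.

```python
def full_attention_pairs(n: int) -> int:
    """Every token attends to every token: n * n pairs."""
    return n * n


def local_attention_pairs(n: int, window: int) -> int:
    """Each token attends only to tokens within `window` positions of itself."""
    total = 0
    for i in range(n):
        lo = max(0, i - window)      # clip the window at the start
        hi = min(n - 1, i + window)  # clip the window at the end
        total += hi - lo + 1
    return total


def local_global_pairs(n: int, block: int) -> int:
    """Full attention inside each block, plus full attention among
    one (hypothetical) summary token per block."""
    n_blocks = -(-n // block)  # ceiling division
    local = 0
    for b in range(n_blocks):
        size = min(block, n - b * block)  # last block may be shorter
        local += size * size
    return local + n_blocks * n_blocks


if __name__ == "__main__":
    n = 4_096
    print("full:        ", full_attention_pairs(n))
    print("local (w=64):", local_attention_pairs(n, 64))
    print("local+global:", local_global_pairs(n, 64))
```

With a fixed window the local count grows linearly in `n`, and the two-level scheme stays well below `n * n` as long as both the block size and the number of blocks are much smaller than `n`. What such schemes give up is direct interaction between distant tokens, which is the trade-off the approximation approaches have to manage.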