
‘self-attention’ directory

Discussion of removing a major architectural limitation in Transformer neural networks: the length of the input it can look at. Beyond a few thousand inputs, the resource requirements explode quadratically, rendering it infeasible to encode raw text at the character level, much less use entire books, images, or many other kinds of data which could be useful. Even for text, this inability also forces limitations like the use of BPE text encoding (responsible for sabotaging GPT-3’s rhyming, among other things), forgetfulness, limits to prompt programming, and inability to write coherent long texts.
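To make the bottleneck concrete, here is a minimal NumPy sketch of ordinary dense single-head self-attention (illustrative only; the function and variable names are mine, not any particular library’s implementation). The n × n score matrix is what explodes quadratically with sequence length n:

```python
import numpy as np

def dense_self_attention(X, Wq, Wk, Wv):
    """Vanilla single-head self-attention; the n x n `scores` matrix
    is the quadratic bottleneck (n = sequence length, d = head dim)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv               # each (n, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # (n, n): O(n^2) time & memory
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax
    return weights @ V                             # (n, d)

n, d = 2_048, 64
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = dense_self_attention(X, Wq, Wk, Wv)   # fine at n=2,048; at n=65,536 the
                                            # scores matrix alone is ~16 GiB in fp32
```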

A bibliography of possibilities for fixing this is organized hierarchically below:

  1. adding state, through recurrence (a memory) or by creating a compressed history/state as an explicit summary

  2. tinkering with matrix algebra to remove the quadratic explosion while still keeping more or less the same self-attention mechanism

  3. approximating self-attention: using attention on only a small subset of tokens at any time (dodging the quadratic limit), or using a mix of local and global attention (local attentions to do most of the work, and global attention on top of the local attentions, each one avoiding the quadratic by considering only a few inputs at a time); see the sketch after this list

  4. miscellaneous tricks: removing parts, using only randomized untrainable components (which require no gradient computation), etc.
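As an illustration of approach #3, here is a rough NumPy sketch of combined local + global attention (loosely the pattern used by models like Longformer or BigBird; the function and parameter names are mine, and the per-token Python loop is only to make the cost structure obvious). Each query attends to a fixed-size window plus a few designated global tokens, so cost grows as O(n · (window + global)) rather than O(n²):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def local_global_attention(Q, K, V, window=128, n_global=4):
    """Each query attends to a local window of +/- `window` tokens plus the
    first `n_global` tokens (a crude stand-in for learned global tokens),
    so cost is O(n * (window + n_global)) instead of O(n^2)."""
    n, d = Q.shape
    out = np.empty_like(V)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        idx = np.unique(np.concatenate([np.arange(n_global), np.arange(lo, hi)]))
        scores = Q[i] @ K[idx].T / np.sqrt(d)
        out[i] = softmax(scores) @ V[idx]
    return out

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((4_096, 64))
out = local_global_attention(Q, K, V)   # (4096, 64) without any 4096 x 4096 matrix
```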

One of the most frustrating limitations of GPT-3 (as awesome as it is) is the context window: 2048 text tokens (BPEs) is adequate for many text-related tasks, and even GPT-3’s performance on that window is far from perfect, indicating it has a long way to go in truly understanding text. But 2048 BPEs runs out fast when you start prompt programming something hard, hacks like BPEs have nasty & subtle side-effects, and (as iGPT/ViT indicate in their own ways) such a window is totally inadequate for other modalities like images—a single small 256px image is already equivalent to a sequence of length l = 256 × 256 = 65,536, never mind video or raw audio!
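A quick back-of-the-envelope calculation shows why those sequence lengths break dense attention (fp32, a single attention matrix, ignoring batch size and multiple heads/layers):

```python
# Memory for one dense n x n attention matrix in fp32 (4 bytes per entry):
# 2,048 = GPT-3 text window; 65,536 = one 256px image at pixel level;
# 1,000,000 ~= one minute of 16 kHz raw audio.
for n in (2_048, 65_536, 1_000_000):
    gib = n * n * 4 / 2**30
    print(f"n = {n:>9,}: {gib:>10,.3f} GiB")
# n =     2,048:      0.016 GiB
# n =    65,536:     16.000 GiB
# n = 1,000,000:  3,725.290 GiB
```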

How do we get future Transformers with reasonable context windows and/or memory, which we can use for research papers, books, structured text, images, video, audio, point clouds, genomics, and so on, where we need to handle sequences with lengths in the millions? (Such improvements would permit not just doing things GPT-3 struggles to do, like writing coherent novels, but also many better architectures, like multimodal Transformers which can learn jointly from images & text, access image-based datasets like PDFs, and learn far more accurate human-like representations & tacit knowledge with less data & smaller models, providing large models useful for almost all conceivable tasks—especially robotics.)

Below I compile & categorize research on breaking the dense attention quadratic bottleneck (overviews: Lilian Weng, Madison May; review: Tay et al 2020; benchmark suite: Long Range Arena):

Table 1: Summary of Efficient Transformer Models presented in chronological order of their first public disclosure (from Tay et al 2020). Some papers presented sequentially may first appear at the same time, e.g. as an ICLR submission. Papers annotated with a superscript ‘†’ are peer-reviewed. Class abbreviations: _FP_ = Fixed Patterns or Combinations of Fixed Patterns, _M_ = Memory, _LP_ = Learnable Pattern, _LR_ = Low Rank, _KR_ = Kernel, and _RC_ = Recurrence. Furthermore, _n_ generally refers to the sequence length and _b_ is the local window (or block) size; subscript _g_ on _n_ denotes global memory length and _n~c~_ denotes convolutionally compressed sequence length.


The summary as of mid-2023: dense Transformers remain surprisingly competitive, and the many proposed variants all have their own drawbacks; none have superseded standard GPT or T5-style Transformers in more than a few niches. To paraphrase Chekhov: “If many remedies are prescribed for an illness, you can be sure it has no cure.”

Efficient Attention

State

Recurrence

Compressed History/State

Matrix Algebra Optimizations

Tricks like rewriting the softmax/dot-product to be linear:
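For example, the “linear attention” family rewrites softmax(QKᵀ)V using a kernel feature map φ so that the matrix products can be reassociated as φ(Q)(φ(K)ᵀV) and the n × n matrix never materializes; the reassociation itself is exact for any feature map, and the approximation lies only in replacing the softmax kernel by φ. Below is a hedged NumPy sketch of the non-causal version (the elu(x)+1 feature map follows the general recipe of papers like Katharopoulos et al 2020, but this is an illustration under my own naming, not a faithful reimplementation; the causal version additionally needs a running prefix-sum):

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1, a positive feature map used by some "linear attention"
    # papers; any positive map works for this sketch.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Approximate softmax(Q K^T) V as phi(Q) (phi(K)^T V), reordering the matmuls
    so no n x n matrix is ever formed: O(n * d^2) instead of O(n^2 * d)."""
    Qp, Kp = elu_feature_map(Q), elu_feature_map(K)   # (n, d) each
    KV = Kp.T @ V                                     # (d, d), independent of n
    Z = Qp @ Kp.sum(axis=0, keepdims=True).T          # (n, 1) normalizer
    return (Qp @ KV) / Z                              # (n, d)
```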

Approximations

Sparsity

Global ↔ Local Attention

Miscellaneous

Dropping components, non-trainable/randomized parts, etc:
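One concrete instance of the “randomized, untrainable component” idea, sketched in NumPy under my own naming: compress the length-n keys & values down to a short length k with a fixed random projection, so attention operates on an n × k matrix instead of n × n. (This is roughly the Linformer low-rank trick, except that Linformer learns its projections, whereas here the projection is a frozen Johnson–Lindenstrauss-style matrix, so nothing about it needs gradients.)

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def random_projection_attention(Q, K, V, k=256, seed=0):
    """Compress the length-n keys/values to length k with a *fixed* random
    projection (never trained, so no gradients flow through it); the attention
    matrix is then (n x k) rather than (n x n)."""
    n, d = K.shape
    E = np.random.default_rng(seed).standard_normal((k, n)) / np.sqrt(k)
    K_small, V_small = E @ K, E @ V                   # (k, d) each
    scores = Q @ K_small.T / np.sqrt(d)               # (n, k)
    return softmax(scores) @ V_small                  # (n, d)
```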

Retrieval

Retrieval approaches:

See Also

Gwern

“Absolute Unit NNs: Regression-Based MLPs for Everything”, Gwern 2023


“Research Ideas”, Gwern 2017


“GPT-3 Creative Fiction”, Gwern 2020

