“Fine-Tuning Pre-Trained Transformers into Decaying Fast Weights”, Huanru Henry Mao (2022-10-09):

Autoregressive Transformers are strong language models but incur 𝒊(T) complexity during per-token generation due to the self-attention mechanism. Recent work proposes kernel-based methods that approximate causal self-attention by replacing it with recurrent formulations, using various update rules and feature maps, to achieve 𝒊(1) time and memory complexity per token.
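The recurrent reformulation above can be sketched as follows. This is a generic kernelized-attention step, not the paper's specific method; the `elu1` feature map and all variable names are illustrative assumptions:

```python
import numpy as np

def elu1(x):
    """ELU(x) + 1: a common positive feature map in linear attention (illustrative choice)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_step(state, norm, q, k, v, feature_map=elu1):
    """One O(1) recurrent step of kernelized causal attention.

    state: (d_k, d_v) running sum of phi(k) v^T over past tokens
    norm:  (d_k,)     running sum of phi(k), for the attention normalizer
    """
    phi_k = feature_map(k)
    state = state + np.outer(phi_k, v)      # accumulate key-value outer product
    norm = norm + phi_k                     # accumulate normalizer
    phi_q = feature_map(q)
    out = (phi_q @ state) / (phi_q @ norm + 1e-6)  # normalized attention readout
    return state, norm, out
```

Each step touches only a fixed-size state, so generation cost per token does not grow with sequence length T.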

We explore these approaches and find that they are unnecessarily complex, and propose a simple alternative, decaying fast weights, that runs fast on GPU.
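A decaying fast weight can be read as the same recurrent state with a per-step decay applied before each write. The sketch below is our minimal reading under that assumption; the paper's exact parameterization of the decay gates may differ:

```python
import numpy as np

def decaying_fast_weight_step(state, q, k, v, decay):
    """One step of a decaying fast-weight recurrence (illustrative sketch).

    state: (d_k, d_v) fast-weight memory matrix
    decay: (d_k,) per-dimension decay gates in (0, 1), here assumed fixed
    """
    state = decay[:, None] * state + np.outer(k, v)  # forget old memory, write new
    out = q @ state                                  # read out with the query
    return state, out
```

With decay fixed at 1 this reduces to the un-decayed linear-attention accumulator; values below 1 let older tokens fade, which keeps the state well-conditioned over long sequences.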

It outperforms prior methods and retains 99% of attention’s performance for GPT-2. We also show competitive performance on WikiText-103 against more complex attention substitutes.