"Fine-Tuning Pre-Trained Transformers into Decaying Fast Weights", 2022-10-09:
Autoregressive Transformers are strong language models but incur 𝒪(T) complexity per generated token due to the self-attention mechanism. Recent work proposes kernel-based methods that approximate causal self-attention with recurrent formulations, using various update rules and feature maps, to achieve 𝒪(1) time and memory complexity.
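For intuition, here is a minimal NumPy sketch of the kernel trick these methods share: with a feature map φ, causal attention collapses into a fixed-size recurrent state, so each generated token costs 𝒪(1) rather than 𝒪(T). The ELU+1 feature map and the epsilon in the normalizer are common illustrative choices, not specifics from this paper.

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = ELU(x) + 1, a standard positive feature map for linear attention
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_step(state, normalizer, q, k, v):
    """One autoregressive step of kernelized (linear) attention.

    state:      (d_feat, d_v) running sum of outer(phi(k_t), v_t)
    normalizer: (d_feat,)     running sum of phi(k_t)
    Returns the output for this token plus the updated fixed-size state.
    """
    phi_q, phi_k = elu_feature_map(q), elu_feature_map(k)
    state = state + np.outer(phi_k, v)              # rank-1 "fast weight" write
    normalizer = normalizer + phi_k
    out = phi_q @ state / (phi_q @ normalizer + 1e-6)
    return out, state, normalizer
```

The state never grows with sequence length, which is exactly what removes the 𝒪(T) per-token cost of storing and re-reading the key/value cache.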
We explore these approaches and find that they are unnecessarily complex, and propose a simple alternative, decaying fast weights, that runs fast on GPU.
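As a rough sketch of the proposed alternative: key/value outer products accumulate into a fixed-size matrix that decays at each step, dispensing with the feature-map and normalizer machinery above. The per-dimension decay parameterization here is an assumption for illustration, not necessarily the paper's exact formulation.

```python
import numpy as np

def decaying_fast_weight_step(state, decay, q, k, v):
    """One autoregressive step of a decaying fast-weight recurrence (illustrative).

    state: (d_k, d_v) fast-weight matrix carried across time steps
    decay: (d_k,)     per-dimension decay rates in (0, 1)
                      (hypothetical parameterization, e.g. sigmoid of a learned vector)
    """
    state = decay[:, None] * state + np.outer(k, v)  # forget old associations, write the new one
    out = q @ state                                   # read out with the query
    return out, state
```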
It outperforms prior methods and retains 99% of attention's performance for GPT-2. We also show competitive performance on WikiText-103 against more complex attention substitutes.