Dynamic evaluation of language models (LMs) adapts model parameters at test time using gradient information from previous tokens and substantially improves LM performance. However, it requires over 3× more compute than standard inference.
We present Fast Weight Layers (FWLs), a neural component that provides the benefits of dynamic evaluation much more efficiently by expressing gradient updates as linear attention. A key improvement over dynamic evaluation is that FWLs can also be applied at training time so the model learns to make good use of gradient updates.
FWLs can easily be added on top of existing transformer models, require relatively little extra compute or memory to run, and improve language modeling perplexity.
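To make the "gradient updates as linear attention" view concrete, here is a minimal sketch (our own illustration; the toy dimensions and random data are assumptions, not the paper's exact formulation). A "fast weight" matrix built from outer-product updates of (value, key) vectors produces the same output as unnormalized linear attention over the past tokens:

```python
import numpy as np

# Illustrative sketch: outer-product fast-weight updates vs. linear attention.
rng = np.random.default_rng(0)
d = 4
keys = rng.normal(size=(5, d))
values = rng.normal(size=(5, d))
query = rng.normal(size=d)

# View 1: accumulate one outer-product "fast weight" update per token.
W = np.zeros((d, d))
for k, v in zip(keys, values):
    W += np.outer(v, k)          # gradient-style rank-1 update
out_fast = W @ query

# View 2: unnormalized linear attention -- values weighted by key-query dots.
out_attn = sum((k @ query) * v for k, v in zip(keys, values))

assert np.allclose(out_fast, out_attn)
```

The two views compute the identical quantity, which is why a component expressing its token-by-token updates this way can run as an attention-like layer rather than requiring explicit backward passes at test time.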
Figure 2: Per-token negative-log-likelihood improvements over the baseline.
FWLs most improve LMs on long documents, rare tokens, and tokens repeated multiple times in the text. As the WikiText-103 dev set is small, we use a sparse transformer trained on 2/3 of the train set and evaluated on the other 1/3 to produce more robust results.
Fast Weight Layers provide the benefits of dynamic evaluation at a fraction of the compute cost and memory usage. They can easily be added to existing language models and yield strong results on language modeling benchmarks. Applying FWLs to few-shot learning tasks is one interesting future direction: doing one (or perhaps a small number) of gradient updates on few-shot examples might offer a middle ground between in-context learning, where the model parameters are fixed, and full fine-tuning. Indeed, Yoshida & Gimpel (2021) show that "hidden state optimization," a method closely related to dynamic evaluation, can improve few-shot LM performance.
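As a toy illustration of that middle ground (our own sketch under simplifying assumptions, not an experiment from this work): adapt a model with a single gradient step on the few-shot support examples before predicting, here using a linear model with squared loss in place of an LM:

```python
import numpy as np

# Hypothetical sketch: one gradient update on few-shot support examples,
# between "parameters fixed" (in-context) and full fine-tuning.
x_support = np.array([[1., 0., 0.],
                      [0., 1., 0.],
                      [0., 0., 1.],
                      [1., 1., 1.]])
y_support = x_support @ np.array([1., 2., 3.])   # toy targets

w = np.zeros(3)                                  # "pretrained" parameters
n = len(x_support)
grad = 2 * x_support.T @ (x_support @ w - y_support) / n
w_adapted = w - 0.1 * grad                       # a single gradient step

loss_before = np.mean((x_support @ w - y_support) ** 2)
loss_after = np.mean((x_support @ w_adapted - y_support) ** 2)
# one step on the support set already reduces its loss
```

A trained-in component like an FWL would amortize this adaptation inside the forward pass instead of running an explicit optimizer step per task.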
5. Limitations: FWLs can be viewed as an inductive bias encouraging the model to adapt to previous tokens. As an inductive bias, their value may be limited for larger models trained on larger datasets. While our experiments show FWLs improve models with hundreds of millions of parameters, initial experiments with bigger models suggest that their benefit decreases as models get larger, and we think it is unlikely that an add-on like an FWL will substantially improve models of the scale of GPT-3 (Brown et al., 2020). Furthermore, we have shown that using FWLs at training time makes them more effective, but this has a disadvantage as well. FWLs cannot be directly applied to already-trained transformer language models the way dynamic evaluation can: some fine-tuning with the fast weight layer added is required. Lastly, while we have shown FWLs improve LM perplexity, we have not evaluated FWLs on other text generation tasks, which we leave for future work.