“Memorizing Transformers”, Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, Christian Szegedy (2021-10-05):

We propose to use an external memory module to allow instant usage of newly acquired knowledge.

Language models typically need to be trained or finetuned in order to acquire new knowledge, which involves updating their weights. We instead envision language models that can simply read and memorize new data at inference time, thus acquiring new knowledge immediately. In this work, we extend language models with the ability to memorize the internal representations of past inputs.

Although our implementation of memory is not differentiable, we demonstrate that an approximate k-NN lookup into the memory improves language modeling across various benchmarks and tasks, including generic webtext (C4), math papers (arXiv), books (PG-19), code (GitHub), and formal theorems (Isabelle). We show that performance steadily improves as we increase the size of the memory, up to 131k tokens. We also find that the model is capable of making use of newly defined functions and theorems at test time.

…We demonstrate that a simple, effective, and scalable way to increase the size of the attention context is to use approximate k-nearest-neighbor (kNN) search into a large external memory of (key, value) pairs. There are efficient implementations of approximate kNN lookup on TPU, GPU, and CPU, including distributed implementations (Guo et al. 2020), which opens the door to extremely large external memories.
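The core retrieval step can be sketched in a few lines: for each query, find the top-k most similar keys in memory and compute an attention-weighted readout of their values. This minimal exact-search version (numpy, single head; the paper uses approximate kNN for scale, and the function name and shapes here are illustrative) shows the mechanism:

```python
import numpy as np

def knn_memory_attention(query, mem_keys, mem_values, k=32):
    """Attend over the top-k most similar (key, value) pairs in memory.

    query:      (d,)     query vector for one attention head
    mem_keys:   (M, d)   keys stored in the external memory
    mem_values: (M, d)   values stored alongside the keys
    Returns a (d,) weighted readout of the retrieved values.
    """
    scores = mem_keys @ query                    # (M,) dot-product similarity
    topk = np.argpartition(scores, -k)[-k:]      # indices of the k best matches
    w = np.exp(scores[topk] - scores[topk].max())
    w /= w.sum()                                 # softmax over retrieved scores only
    return w @ mem_values[topk]
```

With k equal to the memory size this reduces to ordinary dense attention over the whole memory; the point of the approximate top-k search is that the memory can grow far beyond what dense attention could afford.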

In contrast to most other work on sparse, or long-range attention (c.f. §2), we treat the external memory as a large “cache”, and gradients are not back-propagated into the cache. This is a potential limitation, because it means that the network can only learn to query the external memory, and cannot directly learn what keys and values to put into it. However, we demonstrate empirically that (key, value) pairs which are useful for local (and thus fully differentiable) self-attention are also useful for long-range attention.

Using a non-differentiable cache is critical to scalability. The keys and values are a function of model parameters, so attempting to backpropagate gradients into the external memory would necessarily involve recomputing all of the keys and values with the current model parameters on every training step. If the external memory is not differentiable, we can instead reuse keys and values from prior training steps. With our technique, we are easily able to scale the external memory up to sequence lengths of 131k or 262k tokens on a single device, while maintaining a reasonable step time.
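The "reuse without recomputation" idea amounts to a fixed-size buffer that stores plain array copies of each step's keys and values, with no computation graph attached. A minimal sketch (the class name and ring-buffer eviction policy are illustrative assumptions; in JAX the detaching would be `jax.lax.stop_gradient`):

```python
import numpy as np

class KVCache:
    """Fixed-size external memory of (key, value) pairs, used as a ring buffer.

    No gradients flow into stored entries: each step's keys and values are
    kept as plain array copies, so stale entries never need to be recomputed
    when the model parameters change.
    """
    def __init__(self, capacity, dim):
        self.keys = np.zeros((capacity, dim))
        self.values = np.zeros((capacity, dim))
        self.capacity = capacity
        self.ptr = 0    # next slot to write
        self.size = 0   # number of valid entries

    def append(self, new_keys, new_values):
        # Copy into the buffer, overwriting the oldest entries once full.
        for k, v in zip(np.asarray(new_keys), np.asarray(new_values)):
            self.keys[self.ptr] = k
            self.values[self.ptr] = v
            self.ptr = (self.ptr + 1) % self.capacity
            self.size = min(self.size + 1, self.capacity)
```

Because the stored arrays are detached from any gradient computation, growing the capacity to hundreds of thousands of tokens costs only memory, not extra backward-pass compute.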

[cf. “dynamic evaluation”/“neural cache”]