“The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns via Spotlights of Attention”, Kazuki Irie, Róbert Csordás, Jürgen Schmidhuber (2022-02-11):

Linear layers in neural networks (NNs) trained by gradient descent can be expressed as a key-value memory system that stores all training datapoints and the initial weights, and produces outputs using unnormalized dot-product attention over the entire training experience. While this has been technically known since the ’60s, no prior work has effectively studied the operations of NNs in such a form, presumably due to prohibitive time and space complexities and impractical model sizes, all of which grow linearly with the number of training patterns, which may become very large.
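A minimal NumPy sketch (not the authors' code) of the dual form described above: each SGD step on a linear layer adds an outer product of the error signal and the input to the weights, so the trained layer's prediction equals the initial layer's prediction plus unnormalized dot-product attention of the test input over all stored training inputs. The learning rate, dimensions, and squared-loss error signal here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_steps, lr = 5, 3, 20, 0.1  # illustrative sizes

W0 = rng.normal(size=(d_out, d_in))  # initial weights
W = W0.copy()
keys, values = [], []  # stored training inputs and scaled error signals

for _ in range(n_steps):
    x = rng.normal(size=d_in)
    target = rng.normal(size=d_out)
    e = W @ x - target           # error signal for squared loss
    W -= lr * np.outer(e, x)     # primal form: explicit weight update
    keys.append(x)               # key: the training input
    values.append(-lr * e)       # value: the scaled error signal

x_test = rng.normal(size=d_in)

# Primal prediction with the trained weights:
y_primal = W @ x_test
# Dual prediction: initial weights plus unnormalized dot-product
# attention over all stored training patterns.
attn = np.array([k @ x_test for k in keys])
y_dual = W0 @ x_test + np.array(values).T @ attn

print(np.allclose(y_primal, y_dual))  # -> True: the two forms agree
```

Note how the memory and compute of the dual form grow linearly with `n_steps`, which is the scalability obstacle the abstract mentions.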

We conduct experiments on small-scale supervised image classification tasks in single-task, multi-task, and continual learning settings, as well as on language modeling, and discuss the potential and limits of this view for better understanding and interpreting how NNs exploit training patterns.

Our code is public.