“Generative Modeling With Sparse Transformers: We’ve Developed the Sparse Transformer, a Deep Neural Network Which Sets New Records at Predicting What Comes Next in a Sequence—Whether Text, Images, or Sound. It Uses an Algorithmic Improvement of the Attention Mechanism to Extract Patterns from Sequences 30× Longer Than Possible Previously”, 2019-04-23:
One existing challenge in AI research is modeling long-range, subtle interdependencies in complex data like images, videos, or sounds. The Sparse Transformer incorporates an 𝒪(N ⋅ √N) reformulation of the 𝒪(N²) Transformer self-attention mechanism, along with several other improvements, to apply it directly to these rich data types. Previously, models used on these data were either specifically crafted for one domain or difficult to scale to sequences more than a few thousand elements long. In contrast, our model can handle sequences with tens of thousands of elements using hundreds of layers, achieving state-of-the-art performance across multiple domains. At OpenAI, we’re using it to help us build AI systems that possess a greater ability to understand the world…Even computing a single attention matrix, however, can become impractical for very large inputs. We instead use sparse attention patterns, where each output position only computes weightings from a subset of input positions.
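To make the “subset of input positions” idea concrete, here is a minimal NumPy sketch of one strided sparse-attention pattern. This is an illustrative simplification, not OpenAI’s released implementation: the function names (`build_strided_mask`, `sparse_attention`) and the exact choice of strided columns are assumptions. Each position attends to a local window of the previous `stride` positions plus every `stride`-th earlier position, so with stride ≈ √N each row of the attention matrix has 𝒪(√N) nonzeros and the total cost is 𝒪(N ⋅ √N) rather than 𝒪(N²).

```python
import numpy as np

def build_strided_mask(n, stride):
    """Boolean (n, n) causal mask for a strided sparse-attention pattern.

    Illustrative sketch only: position i may attend to
      (a) the previous `stride` positions (a local window), and
      (b) every `stride`-th earlier position (strided columns),
    so each row has roughly 2*stride nonzeros; with stride ~ sqrt(n),
    total attention work is O(n * sqrt(n)).
    """
    j = np.arange(n)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        mask[i, max(0, i - stride + 1):i + 1] = True           # local window
        mask[i, (j <= i) & (j % stride == stride - 1)] = True  # strided columns
    return mask

def sparse_attention(q, k, v, mask):
    """Softmax attention in which disallowed pairs are set to -inf."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v

# Tiny demo: 64 positions, stride = sqrt(64) = 8.
n, d, stride = 64, 16, 8
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out = sparse_attention(q, k, v, build_strided_mask(n, stride))
```

In a real model the masked positions are never materialized at all (the point of the sparse kernels), whereas this sketch still builds the dense N×N matrix purely to show which positions each output attends to.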
Future work and limitations:
The sparse attention patterns we introduced are only preliminary steps in the direction of efficient modeling of long sequences. We think exploring different patterns and combinations of sparsity is useful, and that learning sparse patterns is a particularly promising avenue of research for the next generation of neural network architectures.
Even with the improvements we described above, autoregressive sequence generation still seems impractical for very high-resolution images or video. The optimized attention operations we have introduced, however, may be useful primitives to combine with other approaches to modeling high-dimensional data, like multi-scale approaches.