“A Dot Product Attention Free Transformer”, Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang Zhang, Joshua M. Susskind (2021-10-05)⁠:

We introduce the Dot Product Attention Free Transformer (DAFT), an efficient variant of the Transformer that eliminates the query-key dot product in self-attention. The core idea is to construct a decomposable attention map for each dimension of the query, key, and value. This compositionality enables an implementation in which the attention tensor never needs to be computed or stored explicitly. A DAFT layer has memory complexity linear in both the context size and the feature dimension, making it compatible with large inputs and large model sizes.
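The linear-memory claim can be illustrated with a minimal NumPy sketch of the simplest such factorization: a softmax over keys applied independently per feature dimension, with the query entering element-wise through a sigmoid gate. This is a non-causal sketch without position biases, shown only to make the complexity argument concrete; the function name `daft_simple` is illustrative, not from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def daft_simple(Q, K, V):
    """Per-dimension factorized attention, no query-key dot product.

    Q, K, V: arrays of shape (T, d). No T x T attention matrix is ever
    formed; peak memory is O(T * d), linear in both context length T
    and feature dimension d.
    """
    # Softmax over the time axis, computed independently for each of
    # the d dimensions (subtract the max for numerical stability).
    w = np.exp(K - K.max(axis=0, keepdims=True))        # (T, d)
    ctx = (w * V).sum(axis=0, keepdims=True) \
          / w.sum(axis=0, keepdims=True)                # (1, d) context
    # The query modulates the shared context element-wise.
    return sigmoid(Q) * ctx                             # (T, d)
```

Because the context summary `ctx` is a single `(1, d)` vector shared across all positions, the cost of attending over the whole sequence stays linear in `T`, in contrast to the quadratic `(T, T)` map of standard dot-product attention.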

We also introduce DAFT-conv, a model variant that takes advantage of locality and spatial weight sharing while maintaining global connectivity.

We conduct experiments on ImageNet-1K classification, as well as on CIFAR-10 and enwik8, two autoregressive modeling tasks. DAFT achieves competitive performance on all benchmarks while providing excellent efficiency.