“A Dot Product Attention Free Transformer”, 2021-10-05:
We introduce the Dot Product Attention Free Transformer (DAFT), an efficient variant of Transformers that eliminates the query-key dot product in self-attention. The core idea is to construct a decomposable attention map for each dimension of the query, key, and value. This compositionality enables an implementation in which the attention tensor does not need to be computed or stored explicitly. A DAFT layer has memory complexity linear in both the context size and the feature dimension, making it compatible with both large inputs and large model sizes. We also introduce DAFT-conv, a model variant that takes advantage of locality and spatial weight sharing while maintaining global connectivity.
We conduct experiments on ImageNet-1K classification, as well as CIFAR-10 and enwik8, two autoregressive modeling tasks. We show that DAFT achieves competitive performance on all benchmarks while providing excellent efficiency.
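The abstract does not spell out the exact formulation, but the idea of a decomposable, dot-product-free attention map with linear memory can be illustrated with a minimal sketch: keys are normalized by a softmax over the sequence, values are pooled with those per-position weights, and the query only gates the pooled result elementwise, so no T×T attention tensor is ever formed. The function name and the simple (non-causal, bias-free) form below are illustrative assumptions, not the paper's definitive layer:

```python
import numpy as np

def daft_simple(Q, K, V):
    """Illustrative sketch of dot-product-free attention.

    Q, K, V: arrays of shape (T, d). Instead of a T x T query-key
    attention matrix, keys become per-position weights via a softmax
    over the sequence axis, values are pooled with those weights, and
    the query gates the pooled result elementwise. Peak memory is
    O(T * d): linear in both context size T and feature dimension d.
    """
    # Numerically stable softmax over the sequence axis, per feature dim.
    w = np.exp(K - K.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)           # (T, d)
    pooled = (w * V).sum(axis=0, keepdims=True)    # (1, d); no T x T tensor
    gate = 1.0 / (1.0 + np.exp(-Q))                # sigmoid gating by query
    return gate * pooled                           # (T, d), broadcast over T

T, d = 8, 4
rng = np.random.default_rng(0)
Y = daft_simple(rng.normal(size=(T, d)),
                rng.normal(size=(T, d)),
                rng.normal(size=(T, d)))
print(Y.shape)  # (8, 4)
```

Note that every intermediate here is at most (T, d), which is what makes the layer compatible with long contexts where a standard T×T attention map would dominate memory.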