“AFT: An Attention Free Transformer”, 2020-09-28:
We propose an efficient Transformer that eliminates attention.
[cf. RWKV] We introduce Attention Free Transformer (AFT), an efficient variant of Transformers that eliminates the need for spatial attention. AFT is much simpler than the standard Transformer: the multi-head attention operation is replaced with a composition of element-wise multiplications/divisions and global/local pooling. We provide several variants of AFT along with simple yet efficient implementations that are supported by mainstream deep learning libraries.
We show that, surprisingly, we are able to train AFT effectively on challenging benchmarks, and also to match or surpass the standard Transformer counterparts.
[Keywords: Transformers, attention, efficient]
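[Illustration: the abstract describes replacing multi-head attention with element-wise multiplications/divisions plus global pooling, but gives no formula. The NumPy sketch below is one plausible reading, not the authors' implementation. It assumes keys are normalized with a softmax over positions (the exponentiate-and-divide step), values are pooled globally with those weights, and the result is gated element-wise by a sigmoid of the queries. The function name `aft_global_pool`, the sigmoid gate, and the `(T, d)` input layout are all assumptions made for illustration.]

```python
import numpy as np


def aft_global_pool(q, k, v):
    """Hypothetical attention-free mixing step (illustrative only).

    q, k, v: arrays of shape (T, d), i.e. T positions and d features.
    Returns an array of shape (T, d).
    """
    # Element-wise exponentiation and division: a softmax over positions
    # (axis 0), computed independently for each feature. Subtracting the
    # per-feature max keeps the exponentials numerically stable.
    weights = np.exp(k - k.max(axis=0, keepdims=True))
    weights = weights / weights.sum(axis=0, keepdims=True)

    # Global pooling: one weighted sum of the values over all positions,
    # giving a single context vector of shape (1, d).
    context = (weights * v).sum(axis=0, keepdims=True)

    # Element-wise gate on the queries (assumed here to be a sigmoid),
    # broadcast against the shared context vector.
    gate = 1.0 / (1.0 + np.exp(-q))
    return gate * context


# Tiny example: with all-zero keys the pooling reduces to a plain mean,
# and with all-zero queries the gate is 0.5 everywhere.
q = np.zeros((3, 2))
k = np.zeros((3, 2))
v = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
out = aft_global_pool(q, k, v)
```

[In this sketch no T × T matrix of pairwise scores is ever formed: every intermediate has shape `(T, d)` or `(1, d)`, so the work grows linearly with the number of positions. A "local pooling" variant would restrict the sum to a window around each position; that is not shown here.]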