“Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better Than Dot-Product Self-Attention”, Tong Yu, Ruslan Khalitov, Lei Cheng, Zhirong Yang (2022-04-22):

[code; cf. ChordMixer] Self-attention is a widely used building block in neural modeling to mix long-range data elements. Most self-attention neural networks employ pairwise dot-products to specify the attention coefficients. However, these methods require 𝒊(NÂČ) computing cost for sequence length N.
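
[For reference, a minimal single-head sketch of pairwise dot-product attention in generic PyTorch (not the paper’s code); the explicit (N, N) score matrix is where the 𝒊(NÂČ) cost comes from:]

```python
import torch

def dot_product_attention(x, Wq, Wk, Wv):
    """x: (N, d) sequence; Wq/Wk/Wv: (d, d) projection matrices."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / k.shape[-1] ** 0.5   # (N, N): every pairwise dot-product
    return torch.softmax(scores, dim=-1) @ v  # O(N^2) time and memory in N
```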

Although several approximation methods have been introduced to mitigate the quadratic cost, the performance of the dot-product approach is still bottlenecked by the low-rank constraint in the attention-matrix factorization.
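
[A quick illustration of that bottleneck, using an assumed toy setup rather than anything from the paper: the pre-softmax score matrix QKᔀ is a product of two N×d factors, so its rank is at most d ≪ N, and methods that approximate the full attention matrix with such factors inherit a similar cap:]

```python
import torch

N, d = 1024, 64
Q, K = torch.randn(N, d), torch.randn(N, d)
scores = Q @ K.T                         # (N, N), but built from two rank-<=d factors
print(torch.linalg.matrix_rank(scores))  # 64, far below the full rank N = 1024
```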

In this paper, we propose a novel, scalable, and effective mixing building block called Paramixer [inspired by Chord P2P]. Our method factorizes the interaction matrix into several sparse matrices, where we parameterize the non-zero entries by MLPs with the data elements as input. The overall computing cost of the new building block is as low as 𝒊(N log N). Moreover, all factorizing matrices in Paramixer are full-rank, so it does not suffer from the low-rank bottleneck.
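
[A minimal sketch of one such sparse factor, under stated assumptions: chord-style offsets 1, 2, 4, ..., and an MLP emitting one link weight per offset for each element; the class name `SparseMixFactor` is hypothetical and the released code may differ:]

```python
import math
import torch
import torch.nn as nn

class SparseMixFactor(nn.Module):
    """One sparse factor W(m): each position mixes itself with log2(N) chord
    neighbours, with the link weights predicted by an MLP from the element."""
    def __init__(self, n, d):
        super().__init__()
        self.offsets = [2 ** k for k in range(int(math.log2(n)))]
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.GELU(),
                                 nn.Linear(d, len(self.offsets) + 1))

    def forward(self, x):                       # x: (N, d)
        w = self.mlp(x)                         # (N, log2(N)+1) non-zero entries of W(m)
        out = w[:, :1] * x                      # self link
        for j, off in enumerate(self.offsets):  # each chord link costs O(N)
            out = out + w[:, j + 1:j + 2] * torch.roll(x, -off, dims=0)
        return out                              # one factor: O(N log N) work
```

[Stacking log₂ N such full-rank factors lets every element influence every other while keeping the total cost at 𝒊(N log N).]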

We have tested the new method on synthetic data and on various real-world long sequential datasets, and compared it with several state-of-the-art attention networks.

The experimental results show that Paramixer has better performance in most learning tasks. [OK-ish LRA results.]

Figure 1: Illustration of the CHORD (a & b) and CDIL (c & d) protocols for n = 16. Each node in the circular graph represents a sequence element. The links between nodes correspond to the non-zero entries in W(m) (here m = 1) output from f(m). Note that the sparse structure of all factors is the same in CHORD, while it varies with m in CDIL.
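
[The CHORD link pattern in the figure is the Chord DHT finger table; a few lines, assumed from the protocol description, enumerate it for n = 16, where each node additionally keeps a self link:]

```python
# Each node i links to (i + 2^k) mod n for k = 0..log2(n)-1, plus itself.
n = 16
for i in range(4):
    print(i, [(i + 2 ** k) % n for k in range(n.bit_length() - 1)])
# 0 [1, 2, 4, 8]
# 1 [2, 3, 5, 9]
# 2 [3, 4, 6, 10]
# 3 [4, 5, 7, 11]
```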