“Generating Structured Music through Self-Attention”, Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Andrew Dai, Matt Hoffman, Curtis Hawthorne, Douglas Eck (2018):

Music relies heavily on self-reference to build structure and meaning. We explore the TRANSFORMER architecture (Vaswani et al 2017) as a generative model for music, as self-attention has shown compelling results on tasks that require long-term structure such as Wikipedia summary generation (Liu et al 2018). However, timing information is critical for polyphonic music, and TRANSFORMER does not explicitly represent absolute or relative timing in its structure.

To address this challenge, Shaw et al 2018 introduced relative position representations to self-attention to improve machine translation. However, their formulation's memory cost made it impractical for the longer sequences needed for music.

We propose an improved formulation which reduces its memory requirements from 𝒪(l²d) to 𝒪(ld), making it possible to train on much longer sequences and achieve faster convergence.
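The memory saving comes from never materializing the 𝒪(l²d) tensor of per-pair relative embeddings: one can instead multiply the queries against the 𝒪(ld) embedding table directly and then rearrange ("skew") the resulting l×l matrix so that entry (i, j) holds the logit for relative distance j − i. A minimal NumPy sketch of this skewing trick (covering the causal case, where only distances −(l−1)…0 are needed and the upper triangle is masked away; function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def relative_logits(q, e):
    """Relative-position attention logits via the skewing trick.

    q: (l, d) query vectors.
    e: (l, d) relative-position embeddings, where e[r] corresponds to
       relative distance r - (l - 1), i.e. distances -(l-1) ... 0.
    Returns an (l, l) matrix whose lower triangle (j <= i) holds
    q[i] . e[j - i + l - 1]; entries above the diagonal are garbage
    and are assumed to be removed by the causal mask.
    """
    l = q.shape[0]
    rel = q @ e.T                              # (l, l), only O(l*d) memory for e
    padded = np.pad(rel, ((0, 0), (1, 0)))     # prepend one dummy column -> (l, l+1)
    skewed = padded.reshape(l + 1, l)[1:]      # reshape and drop first row -> (l, l)
    return skewed
```

The pad–reshape–slice sequence shifts each row i left by (l − 1 − i) positions, which is exactly the per-row gather that would otherwise require the 𝒪(l²d) intermediate.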

In experiments with symbolic music generation, we find that relative self-attention substantially improves sample quality. When primed, the model generates continuations that develop the prime in a coherent fashion and exhibit long-term structure.