[Visualization of Transformer attention pattern over the input history]
Generating long pieces of music is a challenging problem, as music contains structure at multiple timescales, from millisecond timings to motifs to phrases to repetition of entire sections. We present Music Transformer, an attention-based neural network that can generate music with improved long-term coherence. Here are three piano performances generated by the model:
Similar to Performance RNN, we use an event-based representation that lets us generate expressive performances directly (i.e., without first generating a score). In contrast to an LSTM-based model like Performance RNN, which compresses earlier events into a fixed-size hidden state, we use a Transformer-based model that has direct access to all earlier events.
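To make the event-based representation concrete, here is a minimal sketch of how a single note can be encoded as a sequence of event tokens. It assumes the four event families described for Performance RNN (NOTE_ON, NOTE_OFF, TIME_SHIFT in 10 ms steps, and binned VELOCITY); the function and constant names are illustrative, not taken from the actual codebase.

```python
# Sketch of an event-based performance vocabulary: each note becomes
# a short sequence of tokens instead of a cell in a quantized score.
NUM_PITCHES = 128        # MIDI pitches 0-127
NUM_TIME_SHIFTS = 100    # 10 ms steps, up to 1 second per event
NUM_VELOCITIES = 32      # coarse velocity bins

# Offsets of each event family in a single flat vocabulary.
NOTE_ON = 0                               # + pitch
NOTE_OFF = NOTE_ON + NUM_PITCHES          # + pitch
TIME_SHIFT = NOTE_OFF + NUM_PITCHES       # + (steps - 1)
VELOCITY = TIME_SHIFT + NUM_TIME_SHIFTS   # + velocity bin

def encode_note(pitch, velocity_bin, duration_steps):
    """Encode one note as event tokens: set velocity, start the
    note, advance time, then end the note."""
    return [
        VELOCITY + velocity_bin,
        NOTE_ON + pitch,
        TIME_SHIFT + duration_steps - 1,
        NOTE_OFF + pitch,
    ]

# Middle C (pitch 60) held for 500 ms at velocity bin 20:
events = encode_note(60, 20, 50)
```

Because timing is expressed as explicit TIME_SHIFT events rather than a fixed grid, the same vocabulary can capture rubato and other expressive timing directly from performance data.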