Bibliography (4):

  1. Vaswani et al., "Attention Is All You Need," NeurIPS 2017.

  2. Sukhbaatar et al., "Adaptive Attention Span in Transformers," ACL 2019.

  3. Roy et al., "Efficient Content-Based Sparse Attention with Routing Transformers," TACL 2021.

  4. Rae et al., "Compressive Transformers for Long-Range Sequence Modeling," ICLR 2020.