Bibliography (22):

  1. Do Transformer Modifications Transfer Across Implementations and Applications?

  2. Efficient Transformers: A Survey, p. 29: https://arxiv.org/pdf/2009.06732.pdf#page=29

  3. Attention Is All You Need

  4. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

  5. The Evolved Transformer

  6. Universal Transformers

  7. Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing

  8. Pay Less Attention with Lightweight and Dynamic Convolutions

  9. Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers

  10. MLP-Mixer: An all-MLP Architecture for Vision

  11. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

  12. Breaking the Softmax Bottleneck: A High-Rank RNN Language Model

  13. Language Modeling with Gated Convolutional Networks

  14. Scaling Laws for Autoregressive Generative Modeling

  15. Scaling Laws for Transfer

  16. Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers

  17. Efficient Transformers: A Survey

  18. Long Range Arena (LRA): A Benchmark for Efficient Transformers

  19. Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers