Bibliography (22):

Do Transformer Modifications Transfer Across Implementations and Applications?
https://arxiv.org/pdf/2009.06732.pdf#page=29
Attention Is All You Need
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
The Evolved Transformer
Universal Transformers
Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing
Pay Less Attention with Lightweight and Dynamic Convolutions
Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers
MLP-Mixer: An all-MLP Architecture for Vision
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model
Language Modeling with Gated Convolutional Networks
Scaling Laws for Autoregressive Generative Modeling
Scaling Laws for Transfer
Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
Efficient Transformers: A Survey
Long Range Arena (LRA): A Benchmark for Efficient Transformers
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Wikipedia Bibliography:
1. Neural scaling law :
  
  https://en.wikipedia.org/wiki/Neural_scaling_law
2. Cross-entropy
3. Autoregressive model