Do Transformer Modifications Transfer Across Implementations and Applications?
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing
Pay Less Attention with Lightweight and Dynamic Convolutions
Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Breaking the Softmax Bottleneck: A High-Rank RNN Language Model
Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers
Long Range Arena (LRA): A Benchmark for Efficient Transformers
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers
Wikipedia Bibliography: