https://x.com/tri_dao/status/1531437619791290369
Attention Is All You Need
"end-to-end" directory
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Language Models are Unsupervised Multitask Learners
Long Range Arena (LRA): A Benchmark for Efficient Transformers
Fitting Larger Networks into Memory: TLDR; We Release the Python/TensorFlow Package openai/gradient-checkpointing, That Lets You Fit 10× Larger Neural Nets into Memory at the Cost of an Additional 20% Computation Time
Training Deep Nets with Sublinear Memory Cost
Self-attention Does Not Need O(n²) Memory
https://arxiv.org/pdf/2205.14135.pdf#page=18