- https://x.com/tri_dao/status/1531437619791290369
- Attention Is All You Need
- The "end-to-end" directory
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Language Models are Unsupervised Multitask Learners
- Long Range Arena (LRA): A Benchmark for Efficient Transformers
- Fitting Larger Networks into Memory: TL;DR; we release the Python/TensorFlow package openai/gradient-checkpointing, that lets you fit 10× larger neural nets into memory at the cost of an additional 20% computation time
- Training Deep Nets with Sublinear Memory Cost
- Self-attention Does Not Need O(n²) Memory
- https://arxiv.org/pdf/2205.14135.pdf#page=18