Bibliography (13):

  1. https://x.com/tri_dao/status/1531437619791290369

  2. Attention Is All You Need

  3. 'end-to-end' directory

  4. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  5. Language Models are Unsupervised Multitask Learners

  6. Long Range Arena (LRA): A Benchmark for Efficient Transformers

  7. Fitting Larger Networks into Memory: TLDR; We Release the Python/TensorFlow Package openai/gradient-checkpointing, That Lets You Fit 10× Larger Neural Nets into Memory at the Cost of an Additional 20% Computation Time

  8. Training Deep Nets with Sublinear Memory Cost

  9. Self-attention Does Not Need 𝒪(n²) Memory

  10. https://arxiv.org/pdf/2205.14135.pdf#page=18