https://x.com/tri_dao/status/1531437619791290369
Attention Is All You Need
"end-to-end" directory
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Language Models are Unsupervised Multitask Learners
Long Range Arena (LRA): A Benchmark for Efficient Transformers
Fitting Larger Networks into Memory: TLDR; We Release the Python/TensorFlow Package openai/gradient-checkpointing, That Lets You Fit 10× Larger Neural Nets into Memory at the Cost of an Additional 20% Computation Time
Training Deep Nets with Sublinear Memory Cost
Self-attention Does Not Need O(n²) Memory
https://arxiv.org/pdf/2205.14135.pdf#page=18