Bibliography (16):

  1. Attention Is All You Need

  2. Scaling Laws for Neural Language Models

  3. T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

  4. scaling_transformers repository (google-research/google-research)

  5. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  6. Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers, p. 6 (https://arxiv.org/pdf/2109.10686#page=6&org=google)

  7. Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers, p. 7 (https://arxiv.org/pdf/2109.10686#page=7&org=google)

  8. Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers, p. 8 (https://arxiv.org/pdf/2109.10686#page=8&org=google)

  9. Vision Transformer: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale

  10. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

  11. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

  12. SQuAD: 100,000+ Questions for Machine Comprehension of Text

  13. Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers, p. 9 (https://arxiv.org/pdf/2109.10686#page=9&org=google)

  14. scaling_transformers repository: https://github.com/google-research/google-research/tree/master/scaling_transformers

  15. Scaling Laws for Autoregressive Generative Modeling

  16. ByT5: Towards a token-free future with pre-trained byte-to-byte models