Bibliography (16):

  1. Attention Is All You Need

  2. Scaling Laws for Neural Language Models

  3. T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

  4. scaling_transformers repository (google-research/google-research)

  5. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  6. Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers, p. 6 (https://arxiv.org/pdf/2109.10686#page=6&org=google)

  7. Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers, p. 7 (https://arxiv.org/pdf/2109.10686#page=7&org=google)

  8. Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers, p. 8 (https://arxiv.org/pdf/2109.10686#page=8&org=google)

  9. Vision Transformer: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale

  10. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

  11. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

  12. SQuAD: 100,000+ Questions for Machine Comprehension of Text

  13. Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers, p. 9 (https://arxiv.org/pdf/2109.10686#page=9&org=google)

  14. scaling_transformers repository: https://github.com/google-research/google-research/tree/master/scaling_transformers

  15. Scaling Laws for Autoregressive Generative Modeling

  16. ByT5: Towards a token-free future with pre-trained byte-to-byte models