T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
ViT: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
Code: https://github.com/google-research/google-research/scaling-transformers
ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models