https://arxiv.org/abs/2010.11929
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
GLUE Benchmark
Vision Transformer: An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
Attention Is All You Need
https://pile.eleuther.ai/