https://research.google/blog/mixture-of-experts-with-expert-choice-routing/
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer