Bibliography (7):

  1. https://research.google/blog/mixture-of-experts-with-expert-choice-routing/

  2. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

  3. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

  4. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

  5. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

  6. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

  7. T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer