Bibliography (4):

  1. Pay Attention to MLPs

  2. Attention Is All You Need

  3. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

  4. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity