Bibliography:

  1. Mixture of A Million Experts

  2. Large Memory Layers with Product Keys

  3. Machine Learning Scaling