Bibliography:
Mixture of A Million Experts
Large Memory Layers with Product Keys
Machine Learning Scaling