Bibliography (10):

  1. GPT-3: Language Models are Few-Shot Learners. https://arxiv.org/abs/2005.14165

  2. OpenAI's plans according to Sam Altman (Humanloop blog, archived). https://web.archive.org/web/20230531203946/https://humanloop.com/blog/openai-plans

  3. GPT-4 research (OpenAI). https://openai.com/index/gpt-4-research/

  4. A question on determinism (OpenAI Developer Forum). https://community.openai.com/t/a-question-on-determinism/8185/2

  5. From Sparse to Soft Mixtures of Experts. https://arxiv.org/abs/2308.00951

  6. From Sparse to Soft Mixtures of Experts, p. 4. https://arxiv.org/pdf/2308.00951.pdf#page=4&org=deepmind

  7. Mixture-of-Experts with Expert Choice Routing. https://arxiv.org/abs/2202.09368

  8. From Sparse to Soft Mixtures of Experts, p. 10. https://arxiv.org/pdf/2308.00951.pdf#page=10&org=deepmind

  9. GPT-4 Architecture, Infrastructure, Training Dataset, Costs, Vision, MoE (SemiAnalysis). https://www.semianalysis.com/p/gpt-4-architecture-infrastructure

  10. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. https://arxiv.org/abs/2101.03961