Bibliography (16):

  1. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model

  2. https://github.com/microsoft/DeepSpeed

  3. MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism

  4. The LAMBADA dataset: Word prediction requiring a broad discourse context

  5. Scaling Language Models: Methods, Analysis & Insights from Training Gopher

  6. https://arxiv.org/pdf/2201.11990.pdf#page=38

  7. https://arxiv.org/pdf/2201.11990.pdf#page=39

  8. https://arxiv.org/pdf/2201.11990.pdf#page=40

  9. https://arxiv.org/pdf/2201.11990.pdf#page=41

  10. GPT-3: Language Models are Few-Shot Learners

  11. https://arxiv.org/pdf/2201.11990.pdf#page=42

  12. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

  13. Turing-NLG: A 17-billion-parameter language model by Microsoft

  14. Scaling Laws for Neural Language Models

  15. Wikipedia Bibliography:

    1. Cross-entropy

    2. Jeopardy!