Bibliography (16):

  1. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model

  2. https://github.com/microsoft/DeepSpeed

  3. MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism

  4. The LAMBADA dataset: Word prediction requiring a broad discourse context

  5. Scaling Language Models: Methods, Analysis & Insights from Training Gopher

  6. https://arxiv.org/pdf/2201.11990.pdf#page=38

  7. https://arxiv.org/pdf/2201.11990.pdf#page=39

  8. https://arxiv.org/pdf/2201.11990.pdf#page=40

  9. https://arxiv.org/pdf/2201.11990.pdf#page=41

  10. GPT-3: Language Models are Few-Shot Learners

  11. https://arxiv.org/pdf/2201.11990.pdf#page=42

  12. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

  13. Turing-NLG: A 17-billion-parameter language model by Microsoft

  14. Scaling Laws for Neural Language Models

  15. Wikipedia Bibliography:

    1. Cross-entropy

    2. Jeopardy!