MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism
The LAMBADA dataset: Word prediction requiring a broad discourse context
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, a Large-Scale Generative Language Model (https://arxiv.org/pdf/2201.11990.pdf)
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Turing-NLG: A 17-billion-parameter language model by Microsoft
Wikipedia Bibliography: