Bibliography:

  1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  2. Language Models are Unsupervised Multitask Learners

  3. Training Deep Nets with Sublinear Memory Cost

  4. Automatic differentiation in PyTorch

  5. NVIDIA/Megatron-LM: Ongoing Research Training Transformer Models at Scale

  6. NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl

  7. NVIDIA Megatron-LM repository. https://github.com/NVIDIA/Megatron-LM

  8. NVIDIA Developer Blog, DGX SuperPOD. https://developer.nvidia.com/blog/dgx-superpod-world-record-supercomputing-enterprise/