Bibliography:

  1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  2. Language Models are Unsupervised Multitask Learners

  3. Training Deep Nets with Sublinear Memory Cost

  4. Automatic differentiation in PyTorch

  5. NVIDIA/Megatron-LM: Ongoing Research Training Transformer Models at Scale

  6. NVIDIA Collective Communications Library (NCCL). https://developer.nvidia.com/nccl

  7. NVIDIA Megatron-LM repository. https://github.com/NVIDIA/Megatron-LM

  8. NVIDIA Developer Blog, DGX SuperPOD. https://developer.nvidia.com/blog/dgx-superpod-world-record-supercomputing-enterprise/