-
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
-
Language Models are Unsupervised Multitask Learners
-
Training Deep Nets with Sublinear Memory Cost
-
Automatic differentiation in PyTorch
-
NVIDIA/Megatron-LM: Ongoing Research Training Transformer Models at Scale
-
https://developer.nvidia.com/nccl
-
https://github.com/NVIDIA/Megatron-LM
-
https://developer.nvidia.com/blog/dgx-superpod-world-record-supercomputing-enterprise/
-