“MegatronLM: Training Billion+ Parameter Language Models Using GPU Model Parallelism”, NVIDIA ADLR, 2019-08-13:

Larger language models are dramatically more useful for NLP tasks such as article completion, question answering, and dialog systems. Training the largest neural language model has recently been the best way to advance the state-of-the-art in NLP applications. Two recent papers, BERT and GPT-2, demonstrate the benefits of large scale language modeling. Both papers leverage advances in compute and available text corpora to substantially surpass state-of-the-art performance in natural language understanding, modeling, and generation. Training these models requires hundreds of exaflops of compute and clever memory management to trade recomputation for a reduced memory footprint. However, for very large models beyond a billion parameters, the memory on a single GPU is not enough to fit the model along with the parameters needed for training, requiring model parallelism to split the parameters across multiple GPUs. Several approaches to model parallelism exist, but they are difficult to use, either because they rely on custom compilers, or because they scale poorly or require changes to the optimizer.
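The recomputation trade-off mentioned above can be illustrated with PyTorch's stock activation checkpointing. This is a minimal sketch, not the Megatron-LM code itself: the block and tensor sizes are made up, and it uses the generic torch.utils.checkpoint utility to show how intermediate activations can be discarded in the forward pass and recomputed in the backward pass.

```python
import torch
from torch.utils.checkpoint import checkpoint

# A hypothetical transformer sub-block; only its output is kept during the
# forward pass, and its internal activations are recomputed during backward.
block = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

x = torch.randn(8, 1024, requires_grad=True)
# checkpoint() trades extra FLOPs (one additional forward through `block`)
# for not storing the intermediate activations, reducing the memory footprint.
y = checkpoint(block, x)
y.sum().backward()
```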

In this work, we implement a simple and efficient model parallel approach by making only a few targeted modifications to existing PyTorch transformer implementations. Our code is written in native Python, leverages mixed precision training, and uses the NCCL library for communication between GPUs. We showcase this approach by training an 8.3 billion parameter transformer language model with 8-way model parallelism and 64-way data parallelism on 512 GPUs, making it the largest transformer-based language model ever trained, at 24× the size of BERT and 5.6× the size of GPT-2. We have published the code that implements this approach at our GitHub repository.
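The kind of targeted modification involved can be sketched for the transformer MLP block: the first weight matrix is partitioned by columns across the model-parallel GPUs and the second by rows, so the GeLU can be applied locally and a single all-reduce recombines the partial results. The class names, sizes, and structure below are illustrative rather than the repository's actual API, and the sketch assumes torch.distributed has been initialized with the NCCL backend and that the input is replicated on every model-parallel rank.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F


class _AllReduce(torch.autograd.Function):
    """Sum partial results across model-parallel ranks; backward is the
    identity, since the upstream gradient is already replicated on every rank."""

    @staticmethod
    def forward(ctx, x):
        dist.all_reduce(x, op=dist.ReduceOp.SUM)
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output


class ParallelMLP(torch.nn.Module):
    """Transformer MLP block split across `world_size` GPUs."""

    def __init__(self, hidden_size: int, ffn_size: int, world_size: int):
        super().__init__()
        assert ffn_size % world_size == 0
        shard = ffn_size // world_size
        # Column-parallel: each rank computes a slice of the intermediate layer.
        self.fc1 = torch.nn.Linear(hidden_size, shard)
        # Row-parallel: each rank produces a partial output to be summed.
        self.fc2 = torch.nn.Linear(shard, hidden_size, bias=False)

    def forward(self, x):
        # One NCCL all-reduce is the only communication in the forward pass.
        return _AllReduce.apply(self.fc2(F.gelu(self.fc1(x))))
```

Partitioning the first matrix by columns is what lets the nonlinearity run locally on each GPU without any synchronization, which is why a single all-reduce per block suffices.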

Our experiments are conducted on NVIDIA’s DGX SuperPOD. Without model parallelism, we can fit a baseline model of 1.2 billion parameters on a single V100 32GB GPU and sustain 39 TeraFLOPS during the overall training process, which is 30% of the theoretical peak FLOPS for a single GPU in a DGX-2H server. Scaling the model to 8.3 billion parameters on 512 GPUs with 8-way model parallelism, we achieved up to 15.1 PetaFLOPS of sustained performance over the entire application and reached 76% scaling efficiency compared to the single-GPU case.
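As a quick sanity check, the scaling-efficiency figure is consistent with the other numbers if efficiency is taken as sustained per-GPU throughput relative to the 39 TeraFLOPS single-GPU baseline (an assumption about the definition, not stated explicitly above):

$$\frac{15.1\ \text{PFLOPS} / 512\ \text{GPUs}}{39\ \text{TFLOPS}} \approx \frac{29.5\ \text{TFLOPS}}{39\ \text{TFLOPS}} \approx 0.76$$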