“ZeRO-2 & DeepSpeed: Shattering Barriers of Deep Learning Speed & Scale”, 2020-05-19 (; backlinks; similar):
Today, we are happy to share our new findings and results as we introduce the improved ZeRO-2 and further developments with DeepSpeed:
An order-of-magnitude larger and faster training with ZeRO-2: ZeRO-2 expands the scope of memory optimizations in the original ZeRO by tackling the full spectrum of memory consumption during training. More specifically, ZeRO-2 introduces new technology to reduce the memory footprint of gradients, activation memory, and fragmented memory, in addition to optimizer state memory optimization in the original ZeRO. Altogether, the memory savings empower DeepSpeed to improve the scale and speed of deep learning training by an order of magnitude. More concretely, ZeRO-2 allows training models as large as 170 billion parameters up to 10× faster compared to state-of-the-art.
- Fastest BERT training: While ZeRO-2 optimizes large models during distributed training, we also introduce new technology to accelerate single GPU performance via kernel optimizations. These optimizations not only create a strong foundation for scaling out large models, but also improve the single GPU performance of highly tuned and moderately sized models like BERT by more than 30%, reaching a staggering performance of 64 teraflops per V100 GPU, which is over 50% of the hardware peak. Using these optimizations as the building block, DeepSpeed achieves the fastest BERT training record: 44 minutes on 1,024 NVIDIA V100 GPUs, compared with the best published result of 67 minutes on the same number and generation of GPUs.
…ZeRO-2: Training models with 100 billion parameters up to 10× faster:
- Model scale: State-of-the-art large models (trained without using ZeRO) such as OpenAI GPT-2, NVIDIA MegatronLM, and Google T5 have sizes of 1.5B, 8.3B, and 11b parameters respectively. ZeRO-2 provides system capability to efficiently run models of 170 billion parameters, an order-of-magnitude bigger than these largest models (Figure 2, top left). The tests were conducted using 400 NVIDIA V100 GPUs; with more devices (such as 1,000 GPUs), ZeRO-2 allows us to scale toward 200 billion parameters.
- Speed: Improved memory efficiency powers higher throughput and faster training. Figure 2 (bottom left) shows system throughput of ZeRO-2, ZeRO-1, and baseline model parallelism. Here we use a state-of-the-art model parallelism approach, NVIDIA Megatron-LM, as baseline-MP, while ZeRO-2 and ZeRO-1 both combine ZeRO-powered data parallelism with Megatron-LM model parallelism. ZeRO-2 runs 100-billion-parameter models with over 38 teraflops per GPU, 30% of hardware peak, and aggregated performance over 15 petaflops on the cluster with 400 NVIDIA V100 GPUs. For models of the same size, ZeRO-2 is up to 10× faster in training speed when compared to the baseline because model parallelism requires high communication bandwidth to be efficient, and models of these sizes require model parallelism across nodes where the communication bandwidth is limited. The memory savings of ZeRO-2 allows us to reduce model parallelism degree and fit the model without requiring inter-node model parallelism, drastically reducing communication cost. ZeRO-2 is also up to 5× faster than ZeRO-1 because its additional memory savings help reduce communication further and support even larger batch sizes.
- Scalability: We observe superlinear speedup (Figure 2, top right), where the performance more than doubles when the number of NVIDIA GPUs are doubled. ZeRO-2 reduces the memory footprint of the model states as we increase the data parallelism degree, allowing us to fit larger batch sizes per GPU and resulting in better performance.
- Democratizing large model training: ZeRO-2 empowers model scientists to train models up to 13 billion parameters efficiently without any model parallelism that typically requires model refactoring (Figure 2, bottom right). 13 billion parameters is larger than most of the largest state-of-the-art models (such as Google T5, with 11 billion parameters). With respect to throughput, we observe an average throughput of 37 teraflops (30% hardware peak) per V100 GPU for model sizes ranging from 2 billion to 13 billion parameters. Model scientists can therefore experiment freely with large models without worrying about model parallelism. In comparison, the implementations of classic data parallelism approaches (such as PyTorch Distributed Data Parallel) run out of memory with 1.4-billion-parameter models, while ZeRO-1 supports up to 6 billion parameters.
For more details about ZeRO-2, please see the DeepSpeed GitHub repository and the updated ZeRO paper.
View HTML (17MB):
ZeRO-2 & DeepSpeed: Shattering Barriers of Deep Learning Speed & Scale