“DeepSpeed: Extreme-Scale Model Training for Everyone”, 2020-09-10:
Today, we are happy to share our new advancements that not only push deep learning training to the extreme, but also democratize it for more people—from data scientists training on massive supercomputers to those training on low-end clusters or even on a single GPU.
More specifically, DeepSpeed adds four new system technologies that further the AI at Scale initiative to innovate across Microsoft’s AI products and platforms. These offer extreme compute, memory, and communication efficiency, and they power model training with billions to trillions of parameters. The technologies also allow for extremely long input sequences and enable training on hardware ranging from a single GPU, to high-end clusters with thousands of GPUs, to low-end clusters with very slow Ethernet networks.
Trillion parameter model training with 3D parallelism: DeepSpeed enables a flexible combination of three parallelism approaches (ZeRO-powered data parallelism, pipeline parallelism, and tensor-slicing model parallelism). 3D parallelism adapts to the varying needs of different workloads to power extremely large models with over a trillion parameters while achieving near-perfect memory-scaling and throughput-scaling efficiency. In addition, its improved communication efficiency allows users to train multi-billion-parameter models 2–7× faster on regular clusters with limited network bandwidth.
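To make the composition concrete, here is a minimal sketch of how the three axes are wired together through DeepSpeed's API: pipeline stages via PipelineModule, ZeRO data parallelism via the config, and tensor-slicing via the mpu hook (omitted here). The layer sizes, stage counts, and config values are illustrative assumptions, not the exact setup behind the reported results, and the script is assumed to run under the deepspeed launcher.

```python
# Minimal sketch of combining pipeline parallelism with ZeRO data parallelism.
# Launch with the deepspeed launcher, e.g.: deepspeed train.py
# Tensor-slicing (Megatron-style) parallelism would additionally be plugged in
# through the `mpu` argument of deepspeed.initialize, omitted for brevity.
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

layers = [nn.Linear(1024, 1024) for _ in range(24)]  # stand-in for transformer blocks
model = PipelineModule(
    layers=layers,
    num_stages=4,          # 4-way pipeline parallelism (world size must be divisible by 4)
    loss_fn=nn.MSELoss(),  # placeholder loss
)

ds_config = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 1,  # gradient accumulation feeds the pipeline micro-batches
    "zero_optimization": {"stage": 1},    # ZeRO-powered data parallelism across pipeline replicas
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
# Training then proceeds one pipeline schedule at a time:
# loss = engine.train_batch(data_iter)
```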
10× bigger model training on a single GPU with ZeRO-Offload: We extend ZeRO-2 to leverage both CPU and GPU memory for training large models. Using a machine with a single NVIDIA V100 GPU, our users can run models of up to 13 billion parameters without running out of memory, 10× bigger than with existing approaches, while obtaining competitive throughput. This feature democratizes multi-billion-parameter model training and opens the door for many deep learning practitioners to explore bigger and better models.
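As a rough illustration, enabling ZeRO-Offload is a config change on top of ZeRO-2. The sketch below assumes the flag names from the 2020-era release (newer releases moved cpu_offload into an offload_optimizer sub-section) and a DeepSpeed version whose initialize accepts a config dict; the model and batch sizes are placeholders.

```python
# Minimal sketch of enabling ZeRO-Offload on a single GPU.
import torch.nn as nn
import deepspeed

# Placeholder model; in practice this would be a multi-billion-parameter network.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 2,           # ZeRO-2: partition optimizer states and gradients
        "cpu_offload": True,  # offload optimizer states and updates to CPU memory
    },
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```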
Powering 10× longer sequences and 6× faster execution through DeepSpeed Sparse Attention: DeepSpeed offers sparse attention kernels, an instrumental technology for supporting long sequences of model inputs, whether text, image, or sound. Compared with classic dense Transformers, they power input sequences an order of magnitude longer and obtain up to 6× faster execution with comparable accuracy. They also outperform state-of-the-art sparse implementations with 1.5–3× faster execution. Furthermore, our sparse kernels support efficient execution of flexible sparse formats and empower users to innovate on their custom sparse structures.
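A minimal usage sketch of the sparse attention module follows. It assumes a CUDA GPU with the triton dependency installed, fp16 inputs in [batch, heads, seq_len, head_dim] layout with seq_len a multiple of the block size, and one of the built-in sparsity configs; the shapes and pattern parameters are illustrative.

```python
# Minimal sketch of DeepSpeed Sparse Attention with a fixed block-sparse layout.
# FixedSparsityConfig is one of several built-in layouts; custom layouts can be
# supplied the same way via a SparsityConfig subclass.
import torch
from deepspeed.ops.sparse_attention import SparseSelfAttention, FixedSparsityConfig

sparsity = FixedSparsityConfig(num_heads=8, block=16)  # 16x16 attention blocks
attn = SparseSelfAttention(sparsity_config=sparsity)

# fp16 tensors on GPU, shape [batch, heads, seq_len, head_dim]
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.half)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = attn(q, k, v)  # computes only the nonzero blocks of the attention matrix
```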
1-bit Adam with up to 5× communication volume reduction: Adam is an effective optimizer, and probably the most widely used, for training many large-scale deep learning models. However, Adam is generally incompatible with communication-efficient optimization algorithms, so communication cost can become a bottleneck when scaling across distributed devices. We introduce a new algorithm, 1-bit Adam, with an efficient implementation that reduces communication volume by up to 5× while achieving convergence efficiency similar to Adam's. We observe up to 3.5× faster distributed training in communication-constrained scenarios, allowing for scaling to different types of GPU clusters and networks.
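Switching an existing DeepSpeed run to 1-bit Adam is, roughly, a config change. The sketch below shows assumed, untuned hyperparameter values: freeze_step is the warmup phase that runs vanilla, uncompressed Adam before 1-bit compressed communication begins, and the job is assumed to be launched with an MPI- or NCCL-based communication backend.

```python
# Minimal sketch of a DeepSpeed config selecting the 1-bit Adam optimizer.
# Values are illustrative, not tuned; pass this dict to deepspeed.initialize
# as in the earlier sketches.
ds_config = {
    "train_batch_size": 4096,
    "optimizer": {
        "type": "OneBitAdam",
        "params": {
            "lr": 4e-4,
            "freeze_step": 23000,  # warmup steps with full-precision Adam communication
            "cuda_aware": False,   # set True only with a CUDA-aware MPI installation
        },
    },
}
```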