Since the DeepSpeed optimization library was introduced last year, it has rolled out numerous novel optimizations for training large AI models—improving scale, speed, cost, and usability. As large models have quickly evolved over the last year, so too has DeepSpeed. Whether enabling researchers to create the 17-billion-parameter Microsoft Turing Natural Language Generation (Turing-NLG) with state-of-the-art accuracy, achieving the fastest BERT training record, or supporting 10x larger model training using a single GPU, DeepSpeed continues to tackle challenges in AI at Scale with the latest advancements for large-scale model training. Now, the novel memory optimization technology ZeRO (Zero Redundancy Optimizer), included in DeepSpeed, is undergoing a further transformation of its own. The improved ZeRO-Infinity offers the system capability to go beyond the GPU memory wall and train models with tens of trillions of parameters, an order of magnitude bigger than state-of-the-art systems can support. It also offers a promising path toward training 100-trillion-parameter models.
ZeRO-Infinity at a glance: ZeRO-Infinity is a novel deep learning (DL) training technology for scaling model training, from a single GPU to massive supercomputers with thousands of GPUs. It powers unprecedented model sizes by leveraging the full memory capacity of a system, concurrently exploiting all heterogeneous memory (GPU, CPU, and Non-Volatile Memory express or NVMe for short). Learn more in our paper, “ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning.” The highlights of ZeRO-Infinity include:
- Offering the system capability to train a model with over 30 trillion parameters on 512 NVIDIA V100 Tensor Core GPUs, 50x larger than state of the art.
- Delivering excellent training efficiency and superlinear throughput scaling through novel data partitioning and mapping that can exploit the aggregate CPU/NVMe memory bandwidths and CPU compute, offering over 25 petaflops of sustained throughput on 512 NVIDIA V100 GPUs.
- Furthering the mission of the DeepSpeed team to democratize large model training by allowing data scientists with a single GPU to fine-tune models larger than OpenAI GPT-3 (175 billion parameters).
- Eliminating the barrier to entry for large model training by making it simpler and easier—ZeRO-Infinity scales beyond a trillion parameters without the complexity of combining several parallelism techniques and without requiring changes to user codes. To the best of our knowledge, it’s the only parallel technology to do this.
We are also pleased to announce DeepSpeed’s integration with Azure Machine Learning and open-source solutions. The DeepSpeed curated environment in Azure Machine Learning makes it easier for users to get started on Azure. DeepSpeed is now integrated in Hugging Face v4.2 and PyTorch Lightning v1.2. Hugging Face and PyTorch Lightning users can easily accelerate their models with DeepSpeed through a simple “deepspeed” flag!
Addressing the needs of large model training now and into the future with ZeRO-Infinity
In the last three years, the largest trained dense model has grown over 1,000x, from a hundred million parameters in the pre-BERT era to over a hundred billion parameters now. However, in the same duration, single GPU memory has only increased by 5x (16 GB to 80 GB). Therefore, the growth in model size has been made possible mainly through advances in system technology for training large DL models, with parallel technologies such as model parallelism, pipeline parallelism, and ZeRO allowing large models to fit in aggregate GPU memory, creating a path to training larger and more powerful models.
The state-of-the-art in large model training technology is 3D parallelism. It combines model parallelism (tensor slicing) and pipeline parallelism with data parallelism in complex ways to efficiently scale models by fully leveraging the aggregate GPU memory and compute of a cluster. 3D parallelism has been used in DeepSpeed and NVIDIA Megatron-LM, among other frameworks.
Despite the incredible capabilities of 3D parallelism for large model training, we are now arriving at the GPU memory wall. The aggregate GPU memory is simply not large enough to support the growth in model size. Even with the newest NVIDIA A100 GPUs, which have 80 GB of memory, 3D parallelism requires 320 GPUs just to fit a trillion-parameter model for training. Furthermore, 3D parallelism requires significant code refactoring from data scientists, creating a large barrier to entry. Three questions arise:
- Looking ahead, how do we support the next 1,000x growth in model size, going from models like GPT-3 with 175 billion parameters to models with hundreds of trillions of parameters?
- Focusing on the present, how can we make the large models of today accessible to more data scientists who may not have access to the hundreds of GPUs currently required to fit these models?
- Can we make large model training easier by eliminating this need for model refactoring?
Today, we take a leap forward from 3D parallelism by introducing ZeRO-Infinity, a novel system capable of addressing all the above-mentioned challenges of large model training. ZeRO-Infinity extends the ZeRO family of technology with new innovations in data mapping and high-performance heterogeneous memory access, which allows ZeRO-Infinity to support massive model sizes on limited GPU resources by exploiting CPU and NVMe memory simultaneously, unencumbered by their limited bandwidth.
ZeRO-Infinity can also train these models without the need to combine multiple forms of parallelism in 3D parallelism. It does so via a novel memory-centric computation-tiling approach aimed at reducing GPU memory requirements of large individual layers that would otherwise require model parallelism (tensor slicing) to fit the model in GPU memory. In addition, ZeRO-Infinity makes large model training easy by identifying and automating all the communication required for training any arbitrary model architecture, virtually eliminating the need for any model refactoring even when scaling to trillions of parameters. Last but not least, ZeRO-Infinity offers a powerful compute-and-communication-overlap engine designed to push training efficiency to the limits by hiding as much communication latency as possible.
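To make the memory-centric tiling idea more concrete, here is a minimal PyTorch sketch of the concept only: a very large linear layer is split into smaller tiles that are applied one at a time, so only one tile's parameters need to be resident in GPU memory at once. This is an illustration rather than DeepSpeed's actual implementation, and the class name and tile sizes are invented for the example.

```python
# Conceptual sketch of memory-centric operator tiling (illustration only, not
# DeepSpeed's implementation): a huge linear layer is decomposed into tiles
# applied sequentially, so only one tile's parameters must be materialized in
# GPU memory at a time. A ZeRO-style runtime could keep the inactive tiles
# partitioned or offloaded and fetch them on demand.
import torch
import torch.nn as nn


class TiledLinearSketch(nn.Module):
    def __init__(self, in_features: int, out_features: int, num_tiles: int):
        super().__init__()
        assert out_features % num_tiles == 0
        tile_out = out_features // num_tiles
        self.tiles = nn.ModuleList(
            nn.Linear(in_features, tile_out) for _ in range(num_tiles)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Apply tiles one at a time and stitch the outputs back together.
        return torch.cat([tile(x) for tile in self.tiles], dim=-1)


# Example: a 4096 -> 16384 projection split into 8 tiles of 4096 -> 2048 each.
layer = TiledLinearSketch(4096, 16384, num_tiles=8)
print(layer(torch.randn(2, 4096)).shape)  # torch.Size([2, 16384])
```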
With all these innovations, ZeRO-Infinity redefines the capabilities of a DL system, offering unprecedented model scale that is accessible and easy to use while achieving excellent training efficiency.
Unprecedented model scale: Train 30-trillion-parameter models on 512 GPUs
ZeRO-Infinity offers a leap of orders of magnitude in DL training system technology, opening a path to supporting the next 1,000x increase in model scale by efficiently exploiting the heterogeneous memory systems on current and future generations of hardware. It runs a model with over a trillion parameters on a single NVIDIA DGX-2 node and over 30 trillion parameters on 32 nodes (512 GPUs). With a hundred DGX-2 nodes in a cluster, we project ZeRO-Infinity can train models with over a hundred trillion parameters (see Figure 1 for details).
To enable model training at this scale, ZeRO-Infinity extends the ZeRO family of technology with distinct innovations targeting different memory bottlenecks (a configuration sketch showing how they surface to users follows the list below).
1. Stage 3 of ZeRO (ZeRO-3) allows for removing all memory redundancies in data-parallel training by partitioning model states across data-parallel processes.
2. Infinity Offload Engine, a novel data offloading library, allows for fully exploiting modern heterogeneous memory architectures by offloading partitioned model states to CPU or NVMe device memory, which are much bigger than GPU memory.
3. Activation checkpointing with CPU offload allows for reducing the activation memory footprint, which can become the memory bottleneck on the GPU after the memory required by the model states is addressed by ZeRO-3 and the Infinity Offload Engine.
4. Memory-centric operator tiling, a novel computation rescheduling technique that works together with the ZeRO data access and communication schedule, allows for reducing the memory footprint of incredibly massive individual layers that can be too large to fit in GPU memory even one layer at a time.
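To show how these pieces surface to users, here is a hedged sketch of a DeepSpeed-style configuration, written as a Python dict, that enables ZeRO stage 3 with parameter and optimizer offload to NVMe and CPU offload of activation checkpoints. The key names follow the public DeepSpeed configuration documentation, but exact options and defaults vary across versions, and "/local_nvme" is a placeholder path, so treat this as illustrative rather than a verified recipe.

```python
# Illustrative DeepSpeed-style configuration (key names per the public docs;
# exact options vary by version, and /local_nvme is a placeholder NVMe mount).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                     # ZeRO-3: partition all model states across ranks
        "offload_optimizer": {          # Infinity Offload Engine: optimizer states to NVMe
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": True,
        },
        "offload_param": {              # Infinity Offload Engine: parameters to NVMe
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": True,
        },
    },
    "aio": {                            # async NVMe read/write settings (DeepNVMe)
        "block_size": 1048576,
        "queue_depth": 8,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True,
    },
    "activation_checkpointing": {       # used with DeepSpeed's checkpointing API
        "partition_activations": True,
        "cpu_checkpointing": True,      # offload activation checkpoints to CPU memory
    },
}
```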
Broader access to fine-tuning extremely large models: GPT-3 or even larger models on a single GPU
While pretraining is the first important step in creating a massive model, fine-tuning for specific tasks is essential to leveraging the full potential of the model for different scenarios. Making fine-tuning of massive models easily accessible to data scientists could allow the creation of many derived models to meet the needs of various application scenarios. These tasks might range from grammar correction to writing assistance, from image captioning to code generation—any task possible with large AI models.
Unlike pretraining, which can require millions of GPU compute hours, fine-tuning a model with hundreds of billions of parameters is much cheaper, requiring significantly fewer GPU compute hours, and can be done on a single compute node with a handful of GPUs. While such compute resources are accessible to many businesses and users, they are unfortunately restricted by the memory available on these compute nodes, which in turn limits the size of the model that can be fine-tuned. This memory limitation makes large model fine-tuning inaccessible to most businesses that do not have access to massive GPU clusters.
ZeRO-Infinity completely changes this landscape by enabling data scientists with access to a single node, such as the NVIDIA DGX-2, to fine-tune models with over a trillion parameters (Figure 4). In fact, it can run models with over a trillion parameters even on a single GPU of such a node since it has enough CPU and NVMe memory. This is nearly 100x larger than state of the art for single GPU training. With ZeRO-Infinity, the memory bottleneck is no longer the GPU memory or even the CPU memory. Instead, we can now leverage them together with the much larger and cheaper NVMe memory.
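As a rough sketch of what this looks like in user code, the snippet below wires a stand-in model into the DeepSpeed engine using the illustrative ds_config from the earlier sketch. The model is a placeholder rather than a real large model, real runs are normally launched with the deepspeed launcher, and the exact deepspeed.initialize arguments can differ slightly across versions.

```python
# Rough single-GPU fine-tuning sketch using the illustrative ds_config above.
# The tiny model and random data are placeholders; real runs are typically
# launched with the `deepspeed` launcher, and initialize() arguments may vary
# slightly by DeepSpeed version.
import torch
import deepspeed

model = torch.nn.Sequential(            # stand-in for a real transformer model
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
)

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,                   # ZeRO-3 + NVMe offload config sketched earlier
)

for step in range(10):                  # toy training loop on random data
    x = torch.randn(1, 1024, device=engine.device, dtype=torch.half)
    loss = engine(x).float().pow(2).mean()
    engine.backward(loss)               # DeepSpeed handles loss scaling and partitioned grads
    engine.step()
```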
Through ZeRO-Infinity, we take another step toward democratization of AI by enabling users and businesses with limited resources to leverage the power of massive models for their business-specific applications.
Train massive models without any code refactoring
Scaling models to hundreds of billions and trillions of parameters is challenging. Data parallelism cannot scale a model's size much beyond a billion parameters. Model parallelism with tensor slicing is difficult to scale efficiently beyond a single node due to communication overheads. Finally, pipeline parallelism cannot scale beyond the number of layers available in a model, which limits both the model size and the number of GPUs it can scale to.
The only existing parallel technology available that can scale to over a trillion parameters on massively parallel GPU clusters is 3D parallelism, which combines data, model, and pipeline parallelism in complex ways. While such a system can be very efficient, it requires data scientists to do major model code refactoring, splitting the model into load-balanced pipeline stages. This also makes 3D parallelism inflexible in the type of models that it can support since models with complex dependency graphs cannot be easily converted into a load-balanced pipeline.
ZeRO-Infinity addresses these challenges in two ways. First, with groundbreaking model scaling, ZeRO-Infinity is the only DL parallel technology that can efficiently scale to trillions of parameters without requiring a hybrid parallelism strategy, greatly simplifying the system stack for DL training. Second, ZeRO-Infinity requires virtually no model refactoring from data scientists, liberating data scientists to scale up complex models from hundreds of billions to hundreds of trillions of parameters, as the compute becomes available.
Excellent training efficiency and superlinear scalability
ZeRO-Infinity can offload model states and activations to NVMe and CPU, which have orders-of-magnitude lower bandwidth (10–25 GB/sec) than GPU memory (about 900 GB/sec). Furthermore, it incurs the 50 percent additional GPU-to-GPU communication overhead that ZeRO-3 adds over standard data-parallel training. Despite these limitations, ZeRO-Infinity achieves excellent training efficiency that is comparable to state-of-the-art GPU-only solutions like 3D parallelism, and it is significantly better than standard data-parallel training with PyTorch.
As a concrete example, ZeRO-Infinity achieves a sustained throughput of 37–50 teraflops/GPU for model sizes ranging from 400 billion to 20 trillion parameters running on 512 NVIDIA V100 GPUs (see Figure 5). In comparison, 3D parallelism achieves very similar throughput (48 teraflops/GPU) for a 650-billion-parameter model, the largest model that can be trained on the same number of GPUs. Standard data-parallel training with PyTorch achieves only 30 teraflops per GPU for a 1.3-billion-parameter model, the largest model that can be trained using data parallelism alone.
There are three key innovations behind the excellent training efficiency of ZeRO-Infinity:
1. Bandwidth-centric partitioning enables parallel memory access, resulting in virtually unlimited heterogeneous memory bandwidth. With ZeRO-Infinity, the effective NVMe and CPU memory bandwidth grows linearly with the number of available devices. For instance, the NVMe bandwidth is about 25 GB/sec per DGX-2 node, but on a cluster with 64 such nodes, this increases to 1.6 TB/sec, even faster than the 0.9 TB/sec HBM2 memory bandwidth of the NVIDIA V100 GPU.
2. Communication-overlap-centric design and implementation allows ZeRO-Infinity to hide nearly all communication volume at a reasonable batch size. ZeRO-Infinity can effectively overlap NVMe read/write, CPU-GPU data transfers, GPU-GPU communication, and GPU computation all at once.
3. DeepNVMe module, created by the DeepSpeed team, allows for asynchronously reading and writing tensors to NVMe storage at near-peak NVMe bandwidth in PyTorch.
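To make the bandwidth-centric partitioning argument in the first item concrete, the small script below reproduces the back-of-the-envelope arithmetic using the per-node and per-GPU figures quoted above.

```python
# Back-of-the-envelope arithmetic behind bandwidth-centric partitioning, using
# the figures quoted above: because each data-parallel rank reads only its own
# partition of the model states, NVMe bandwidth aggregates across nodes instead
# of bottlenecking on a single device.
nvme_bw_per_node_gbs = 25    # ~25 GB/sec of NVMe bandwidth per DGX-2 node
hbm2_bw_per_gpu_gbs = 900    # ~0.9 TB/sec HBM2 bandwidth on a single V100 GPU

for nodes in (1, 16, 64):
    aggregate_tbs = nodes * nvme_bw_per_node_gbs / 1000
    print(f"{nodes:3d} nodes -> ~{aggregate_tbs:.2f} TB/sec aggregate NVMe bandwidth")

# 64 nodes -> ~1.60 TB/sec, which already exceeds the ~0.90 TB/sec HBM2
# bandwidth of a single V100, matching the comparison made above.
```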
In addition to achieving high training efficiency, ZeRO-Infinity preserves superlinear scalability (see Figure 6) that we have demonstrated with all our previous ZeRO technologies (ZeRO-1, ZeRO-2, and ZeRO-Offload). This is possible because of the memory-and-compute access pattern of ZeRO-Infinity—it reduces the NVMe/CPU communication time as well as the optimizer update time linearly with the increasing number of GPUs and nodes, respectively.
ZeRO-Infinity redefines the large model training landscape
It was less than a year ago that 3D parallelism enabled training of a model at a scale of a trillion parameters with 800 NVIDIA V100 GPUs. Now, with ZeRO-Infinity, the same scale can be achieved on a single DGX-2 node (16 V100 GPUs) with virtually no model refactoring. Massive model training is no longer just a possibility for companies with access to massive supercomputers and heavy system expertise. Instead, it’s now easily accessible to many data scientists with access to only a single GPU or a few GPUs.
In addition, ZeRO-Infinity offers a paradigm shift in how we think about memory for large model training. It is no longer necessary to fit DL training on ultra-fast yet expensive memory with limited size, like HBM2. ZeRO-Infinity demonstrates that it is possible to transcend the GPU memory wall by leveraging cheap and slow, but massive, CPU or NVMe memory in parallel across multiple devices to achieve the aggregate bandwidth necessary for efficient training.
With memory no longer a limitation on model scale or efficiency, it is now critical that we focus on innovations in compute performance and GPU-to-GPU bandwidth. While it is now possible to fit a 30-trillion-parameter model for training on 512 NVIDIA V100 GPUs with ZeRO-Infinity, it is very challenging to complete the end-to-end pretraining in a reasonable time. This could demand 100x improvements in compute performance and interconnect bandwidth between GPUs compared to what is available on current NVIDIA DGX V100 clusters. The state-of-the-art NVIDIA A100 GPUs and DGX A100 nodes are good steps in that direction, offering 3x to 6x higher compute performance and 2x higher interconnect bandwidth per GPU than the NVIDIA DGX V100 nodes. We welcome such improvements and are excited that the NVIDIA A100 GPU will soon be available through Azure ND A100 v4 VMs.
Finally, we hope that with memory no longer a limitation, ZeRO-Infinity will further inspire an acceleration in the compute- and network-bandwidth-focused design of future ultra-powerful devices and supercomputing clusters necessary for the next 1,000x growth in model scale and the quality improvements that such models will offer.
Please read our ZeRO-Infinity paper for more details, and visit the DeepSpeed website and GitHub repository for the code, tutorials, and documentation about these new technologies!
About DeepSpeed’s integration with Azure Machine Learning and open-source solutions
- Azure Machine Learning: The DeepSpeed and Azure Machine Learning teams have made it simple for users to train DeepSpeed-powered models on Azure Machine Learning. Specifically, the DeepSpeed curated environment makes it simple for users to get started with DeepSpeed on Azure. Example DeepSpeed models are actively being added to the official Azure Machine Learning examples repo. Get started with our OpenAI GPT-2 and cifar examples. Azure Machine Learning provides powerful GPU support to accelerate model development.
- Hugging Face: Hugging Face recently announced its integration with DeepSpeed, which allows users to easily accelerate their models through a simple "--deepspeed" flag and config file (see the sketch after this list). Through this integration, DeepSpeed brings a 3x speedup in multi-GPU training compared with the original solution. DeepSpeed also allows fitting a significantly larger model for users who own just a single GPU (or a few GPUs), with much higher compute efficiency than alternatives.
- PyTorch Lightning: We are happy to announce that PyTorch Lightning integrates DeepSpeed as a plugin for DL training optimizations: Accessing Multi-Billion Parameter Model Training with Pytorch Lightning + DeepSpeed. To enable DeepSpeed in Lightning 1.2, it is as simple as passing plugins='deepspeed' to the Lightning trainer (docs).
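For reference, here are hedged Python sketches of the two integrations described above. The argument names track the library versions mentioned in this post (Hugging Face Transformers v4.2+ and PyTorch Lightning v1.2), and "ds_config.json" is a placeholder path to a DeepSpeed configuration file.

```python
# Hedged sketches of the Hugging Face and PyTorch Lightning integrations.
# Argument names follow the versions referenced in this post; ds_config.json
# is a placeholder DeepSpeed configuration file.
from transformers import TrainingArguments
from pytorch_lightning import Trainer

# Hugging Face Trainer: point the training arguments at a DeepSpeed config
# (the CLI equivalent is the --deepspeed flag on the example scripts).
training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    fp16=True,
    deepspeed="ds_config.json",
)

# PyTorch Lightning 1.2: enable DeepSpeed via the trainer plugin.
trainer = Trainer(gpus=1, precision=16, plugins="deepspeed")
```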
About the DeepSpeed Team:
We are a group of system researchers and engineers—Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Shaden Smith, Elton Zheng, Reza Yazdani Aminabadi, Arash Ashari, Ammar Ahmad Awan, Cheng Li, Conglong Li, Niranjan Uma Naresh, Minjia Zhang, Jeffrey Zhu, Yuxiong He (team lead)—who are enthusiastic about performance optimization of large-scale systems. We have recently focused on deep learning systems, optimizing deep learning’s speed to train, speed to convergence, and speed to develop!
If this type of work interests you, the DeepSpeed team is hiring both researchers and engineers! Please visit our careers page.