“Introducing the AI Research SuperCluster—Meta’s Cutting-Edge AI Supercomputer for AI Research”, Kevin Lee, Shubho Sengupta, 2022-01-24:

Developing the next generation of advanced AI will require powerful new computers capable of quintillions of operations per second. Today, Meta is announcing that we’ve designed and built the AI Research SuperCluster (RSC)—which we believe is among the fastest AI supercomputers running today and will be the fastest AI supercomputer in the world when it’s fully built out in mid-2022. Our researchers have already started using RSC to train large models in natural language processing (NLP) and computer vision for research, with the aim of one day training [dense?] models with trillions of parameters. RSC will help Meta’s AI researchers build new and better AI models that can learn from trillions of examples; work across hundreds of different languages; seamlessly analyze text, images, and video together; develop new augmented reality tools; and much more. Our researchers will be able to train the largest models needed to develop advanced AI for computer vision, NLP, speech recognition, and more. We hope RSC will help us build entirely new AI systems that can, for example, power real-time voice translations to large groups of people.

…The first generation of this infrastructure, designed in 2017, has 22,000 NVIDIA V100 Tensor Core GPUs in a single cluster that performs 35,000 training jobs a day.

…RSC today comprises a total of 760 NVIDIA DGX A100 systems as its compute nodes, for a total of 6,080 GPUs—with each A100 GPU being more powerful than the V100 used in our previous system. The GPUs communicate via an NVIDIA Quantum 200 Gb/s InfiniBand 2-level Clos fabric that has no oversubscription. RSC’s storage tier has 175 petabytes of Pure Storage FlashArray [NVM], 46 petabytes of cache storage in Penguin Computing Altus [AMD Epyc] systems, and 10 petabytes of Pure Storage FlashBlade.
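
The headline numbers are internally consistent. A minimal sanity-check sketch in Python, assuming the standard DGX A100 configuration (8 A100 GPUs and 8 HDR 200 Gb/s InfiniBand compute NICs per node) and NVIDIA’s published ~312 TFLOPS dense mixed-precision peak per A100; none of these per-unit specs are stated in the post:

```python
# Back-of-envelope check of the RSC phase-1 figures quoted above.
# Assumptions (published NVIDIA specs, not from the post):
#   - each DGX A100 node has 8 A100 GPUs and 8 HDR 200 Gb/s compute NICs
#   - A100 dense mixed-precision (BF16/FP16) peak is ~312 TFLOPS

NODES = 760
GPUS_PER_NODE = 8            # DGX A100 spec (assumption; matches the 6,080 total)
A100_PEAK_TFLOPS = 312       # dense BF16/FP16 peak per A100 (assumption)
NIC_GBPS_PER_GPU = 200       # one HDR InfiniBand link per GPU (assumption)

gpus = NODES * GPUS_PER_NODE
peak_exaflops = gpus * A100_PEAK_TFLOPS * 1e12 / 1e18
injection_tb_s = gpus * NIC_GBPS_PER_GPU / 8 / 1000   # Gb/s -> TB/s

print(f"GPUs: {gpus}")                                          # 6080, as quoted
print(f"Peak mixed precision: ~{peak_exaflops:.1f} EFLOPS")     # ~1.9
print(f"Aggregate injection bandwidth: ~{injection_tb_s:.0f} TB/s")  # ~152
```

The ~1.9 exaflops phase-1 peak implied here also lines up with the “nearly 5 exaflops” quoted below for the full 16,000-GPU build-out.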

…Once we complete phase 2 of building out RSC, we believe it will be the fastest AI supercomputer in the world [past NVIDIA’s Selene used for Megatron-Turing NLG 530B, currently on par with Perlmutter], performing at nearly 5 exaflops of mixed precision compute. Through 2022, we’ll work to increase the number of GPUs from 6,080 to 16,000, which will increase AI training performance by more than 2.5×. [bringing it past Summit] The InfiniBand fabric will expand to support 16,000 ports in a 2-layer topology with no oversubscription. The storage system will have a target delivery bandwidth of 16 TB/s and exabyte-scale capacity to meet increased demand.
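
The phase-2 claims check out under the same assumed ~312 TFLOPS dense mixed-precision peak per A100 (a sketch, not from the post):

```python
# Phase-2 scale-out arithmetic, same A100 peak assumption as above.
PHASE1_GPUS = 6_080
PHASE2_GPUS = 16_000
A100_PEAK_TFLOPS = 312   # dense BF16/FP16 peak per A100 (assumption)

speedup = PHASE2_GPUS / PHASE1_GPUS
peak_exaflops = PHASE2_GPUS * A100_PEAK_TFLOPS * 1e12 / 1e18

print(f"Scale-out speedup: ~{speedup:.2f}x")                 # ~2.63x, i.e. "more than 2.5x"
print(f"Peak mixed precision: ~{peak_exaflops:.1f} EFLOPS")  # ~5.0, "nearly 5 exaflops"
```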