China Has Already Reached Exascale – On Two Separate Systems

More analysis: “Why Did China Keep Its Exascale Supercomputers Quiet?”

Native CPU and accelerator architectures that have been in play on China’s previous large systems have been stepped up to make China first to exascale on two fronts. 

The National Supercomputing Center in Wuxi is set to unveil some striking news based on quantum simulation results on a forthcoming homegrown Sunway supercomputer.

The news is notable not just for the calculations, but for the possible architecture and sheer scale of the new machine. And of course, all of this is notable because the United States and China are in a global semiconductor arms race, and that changes the nature of how we traditionally compare global supercomputing might. We have been contemplating China’s long road to datacenter compute independence, of which HPC is but one workload, and these are some big steps.

The supercomputing community has long been used to public results on the Top 500 list of the world’s most powerful systems, with countries actively vying for supremacy. However, with tensions at a peak and the entity list haunting the spirit of international competition, we can expect China to remain mum about some dramatic system leaps, including the fact that the country already broke the (true/LINPACK) exascale barrier in 2021, and on more than one machine.

We have it on outstanding authority (under condition of anonymity) that LINPACK was run in March 2021 on the Sunway “Oceanlite” system, which is the follow-on to the #4-ranked Sunway TaihuLight machine. The results yielded 1.3 exaflops peak performance and 1.05 exaflops sustained performance in the ideal 35 megawatt power sweet spot.
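As a rough sanity check on those figures, the implied LINPACK efficiency and energy efficiency can be worked out directly. The sketch below uses only the numbers reported above; the variable names are ours, and it assumes the 35 megawatt figure applies to the sustained run.

```python
# Back-of-the-envelope check using the figures reported above.
# Assumption: the 35 MW power figure corresponds to the sustained LINPACK run.
PEAK_FLOPS = 1.3e18        # 1.3 exaflops peak
SUSTAINED_FLOPS = 1.05e18  # 1.05 exaflops sustained
POWER_WATTS = 35e6         # 35 megawatts

# Fraction of theoretical peak actually delivered on LINPACK.
linpack_efficiency = SUSTAINED_FLOPS / PEAK_FLOPS

# Sustained floating point operations per second, per watt, in gigaflops.
gigaflops_per_watt = SUSTAINED_FLOPS / POWER_WATTS / 1e9

print(f"LINPACK efficiency: {linpack_efficiency:.1%}")
print(f"Energy efficiency: {gigaflops_per_watt:.0f} gigaflops per watt")
```

Under those assumptions the machine would be sustaining roughly 81 percent of peak at about 30 gigaflops per watt.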

We’ve already published what little we knew about the Sunway Oceanlite architecture. Earlier this year, and still now in the absence of verified system information, our conjecture was that this new machine was a die shrink allowing 2X the elements and 2X the performance per socket, and that with a doubling of sockets (and other engineering, of course) Wuxi could create an exascale system. Clearly, Wuxi has.

Wuxi is using 42 million of those cores for sustained exascale supercomputing in full-scale quantum simulation production, which we learned today via a preview ahead of the annual Supercomputing Conference (SC21). The TaihuLight follow-on is capable of running a quantum simulation that can be parallelized across the entire machine. This simulation also bodes well for AI/ML training and inference workloads, as it highlights extensive use of mixed-precision math, including 16-bit floating point performance of a reported 4.4 exaflops.

Without delving into all the quantum details, the Wuxi team, along with collaborators at Tsinghua University and the Shanghai Research Center for Quantum Sciences, has developed a tensor-based simulator for random quantum circuits that is optimized for compute density and can “reduce the simulation sampling time of Google Sycamore to 304 seconds from the previously claimed 10,000 years.” This is just a preview abstract and there aren’t a lot of details on this result, but it’s worth mentioning to tee up what we find out in mid-November when a paper is released detailing the simulation.
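To put the two quoted sampling times on the same scale, here is a quick conversion. It is only the ratio of the two numbers in the abstract, not a measure of hardware performance, and it assumes a year of 365.25 days; the variable names are ours.

```python
# Ratio between the two sampling times quoted in the preview abstract.
# Assumption: one year is taken as 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

previous_claim_seconds = 10_000 * SECONDS_PER_YEAR  # "10,000 years"
new_claim_seconds = 304                             # "304 seconds"

speedup = previous_claim_seconds / new_claim_seconds
print(f"Implied reduction in sampling time: about {speedup:.2e}x")
```

On that assumption, the claimed reduction is on the order of a billion-fold.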

But let’s get back to fully benchmarked (LINPACK) exascale systems in China. The same authority confirmed that a second exascale run in China, this time on the Tianhe-3 system, which we previewed back in May 2019, reached almost identical performance, with 1.3 exaflops peak and enough sustained to be functional exascale. We do not have a power figure for this, but we were able to confirm this machine is based on the FeiTeng line of processors from Phytium, which is Arm-based with a matrix accelerator. (For clarity, FeiTeng is kind of like “Xeon”: it’s a brand of CPUs from Phytium.)

This is not a new architecture. Here’s the analysis from 2015 when we first got wind of Phytium’s HPC ambitions, and here is a follow-on deep dive into the “Mars” 64-core FT-2000/64 architecture. The “Mars” processor was always intended for use in China’s supercomputers but, of course, has had to evolve with the times. The matrix engine that adds the real “oomph” to these devices is still based on an updated variant of the Matrix 2000 DSP accelerator we saw in Tianhe-2A (another top supercomputer of its day), which is called the Matrix-2000+ accelerator. The whole software stack for Tianhe-2A took major footwork to tune to the DSP. It was never likely that the National University of Defense Technology would throw away all of that effort on an architecture that performed quite well, especially on LINPACK.

Recall that this Phytium emergence and the emergence of the Matrix 2000 DSP accelerators for the Tianhe-2A system came about because China couldn’t use Intel Xeon Phi many-core processors as planned due to trade restrictions at the time.

From what we can tell, on these two exascale systems there are modest changes to architectures: a doubling of chip elements and sockets. That is not to minimize the effort, but we do not suspect new architectures are emerging that can fit another coming bit of news, a so-called Futures program that aims to deliver a 20 exaflops supercomputer by 2025, according to our same source, who is based in the United States but in the know about happenings in China.

But here’s something to keep in mind as we go forward in this frigid international climate: perhaps we can no longer expect to have a clear, Top 500 supercomputer list view into national competitiveness in quite the same way. If China, always a contender with the United States, is running LINPACK but not making the results public, what happens to the validity and international importance of that list, which has been a symbol of HPC progress for decades? What does China have to lose? Would it not be in the national interest to show off not one, but two validated exascale systems, for both peak and sustained results?

Here is something subtle to consider: the forthcoming “Frontier” supercomputer at Oak Ridge National Lab in the U.S. is expected to debut with 1.5 peak exaflops and an expected sustained figure around 1.3 exaflops. Perhaps China has decided to quietly leak that it is first to true exascale without having to publish benchmark results that might show a slightly better performance figure for a US-based machine. Just something to think about.

And here’s another subtle detail. Our source confirms these LINPACK results for both of China’s exascale systems—the first in the world—were achieved in March 2021. When did the entity list appear citing Phytium and Sunway and the centers that host their showboat systems? In April 2021.

The politics at play are strange and muddled. But our source, as close as can be to issues at hand, confirms China was first to exascale and with two separate machines based on two different (but fully Chinese native) architectures.

With US chips and accelerators unavailable, it is clear that the trade restrictions will, in the near term, satisfy concerns that China is using US technology to boost development of its nuclear programs. In the long term, however, they are a major impetus for China to kickstart chip development and fab building, and to gun all the engines needed for the semiconductor wars that will continue to simmer, if not yet boil over.


9 Comments

      • I see some info from http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html

        Should I run the single and double precision of the benchmarks?
        The results reported in the benchmark report reflect performance for 64 bit floating point arithmetic. On some machines this may be DOUBLE PRECISION, such as computers that have IEEE floating point arithmetic and on other computers this may be single precision, (declared REAL in Fortran), such as Cray’s vector computers.

  1. We will be unable to make software capable of designing hardware superior to human engineers until computing power exceeds the computational ability of the engineers building the fastest computer, unless multiple are connected to each other.

  2. They allowed Western tech companies to build in China and then pillaged IP. CCP did not create anything on their own. Every piece of tech has been stolen from ARM, Intel, Nvidia, etc with the help of AWS, Google, Facebook and greedy politicians. Everything is then built under duress by CCP citizens or slaves.

  3. To the editor (sp.): The “Mars” processor then was always intended for us in China’s supercomputers => The “Mars” processor then was always intended for use in China’s supercomputers

  4. China keeping mum? Given what a tight grip they keep on information about themselves it’s hard to believe you would have this knowledge if they didn’t intentionally leak it.
