“AI and Compute”, Dario Amodei, Danny Hernandez, Girish Sastry, Jack Clark, Greg Brockman, Ilya Sutskever, 2018-05-26:

[Further reading: “Parameter Counts In Machine Learning” (2021-06-19), Akronomicon leaderboard.] We’re releasing an analysis showing that since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time (by comparison, Moore’s Law had a 2-year doubling period). Since 2012, this metric has grown by more than 300,000× (a 2-year doubling period would yield only a 7× increase). Improvements in compute have been a key component of AI progress, so as long as this trend continues, it’s worth preparing for the implications of systems far outside today’s capabilities.

Three factors drive the advance of AI: algorithmic innovation, data (which can be either supervised data or interactive environments), and the amount of compute available for training. Algorithmic innovation and data are difficult to track, but compute is unusually quantifiable, providing an opportunity to measure one input to AI progress. Of course, the use of massive compute sometimes just exposes the shortcomings of our current algorithms. But at least within many current domains, more compute seems to lead predictably to better performance, and is often complementary to algorithmic advances…The trend represents an increase by roughly a factor of 10 each year. It’s been partly driven by custom hardware that allows more operations to be performed per second for a given price (GPUs and TPUs), but it’s been primarily propelled by researchers repeatedly finding ways to use more chips in parallel and being willing to pay the economic cost of doing so.
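The doubling-time figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' own calculation: the function names are hypothetical, and the elapsed time is inferred from the 300,000× figure rather than taken from the underlying data.

```python
import math

def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Multiplicative growth after elapsed_months at a fixed doubling time."""
    return 2.0 ** (elapsed_months / doubling_months)

def months_to_reach(factor: float, doubling_months: float) -> float:
    """Months needed to grow by `factor` at a fixed doubling time."""
    return math.log2(factor) * doubling_months

# Yearly growth implied by a 3.4-month doubling time (roughly 11.5x,
# consistent with "roughly a factor of 10 each year").
annual_fast = growth_factor(12, 3.4)

# Time span implied by a 300,000x increase at that rate (roughly 62 months).
span_months = months_to_reach(300_000, 3.4)

# Growth over the same span with a 2-year (24-month) doubling time.
moore_over_span = growth_factor(span_months, 24)

print(f"annual growth at 3.4-month doubling: {annual_fast:.1f}x")
print(f"implied span for 300,000x: {span_months:.1f} months")
print(f"2-year doubling over that span: {moore_over_span:.1f}x")
```

Under these assumptions the 2-year doubling case comes out at about 6×, in the same range as the quoted 7×. The exact figure depends on the start and end dates chosen, which this sketch does not know.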

AlexNet to AlphaGo Zero: A 300,000× Increase in Compute. (The total amount of compute, in petaflop/s-days, used to train selected results that are relatively well known, used a lot of compute for their time, and gave enough information to estimate the compute used.)

Eras: Looking at the graph we can roughly see 4 distinct eras:

  1. Before 2012: It was uncommon to use GPUs for ML, making any of the results in the graph difficult to achieve.
  2. 2012–2014: Infrastructure to train on many GPUs was uncommon, so most results used 1–8 GPUs rated at 1–2 TFLOPS for a total of 0.001–0.1 pfs-days.
  3. 2014–2016: Large-scale results used 10–100 GPUs rated at 5–10 TFLOPS, resulting in 0.1–10 pfs-days. Diminishing returns on data parallelism meant that larger training runs had limited value.
  4. 2016–2017: Approaches that allow greater algorithmic parallelism such as huge batch sizes, architecture search, and expert iteration, along with specialized hardware such as TPUs and faster interconnects, have greatly increased these limits, at least for some applications.
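To make the pfs-day figures in the list above concrete: one petaflop/s-day is 10^15 operations per second sustained for one day. The sketch below converts a GPU count, a per-GPU rating, and a training duration into pfs-days. The function names, the training durations, and the default utilization of 1.0 (running at the rated peak) are illustrative assumptions, not values from the source; real runs achieve only a fraction of peak.

```python
SECONDS_PER_DAY = 86_400

def pfs_days(num_gpus: int, tflops_per_gpu: float, days: float,
             utilization: float = 1.0) -> float:
    """Compute in petaflop/s-days: sustained PFLOP/s multiplied by days."""
    sustained_petaflops = num_gpus * tflops_per_gpu / 1000.0 * utilization
    return sustained_petaflops * days

def total_operations(pfs_day_count: float) -> float:
    """Convert pfs-days into a raw operation count."""
    return pfs_day_count * 1e15 * SECONDS_PER_DAY

# Hypothetical runs chosen to land on the ends of the quoted ranges.
small_2012 = pfs_days(num_gpus=1, tflops_per_gpu=1, days=1)     # 0.001 pfs-days
large_2014 = pfs_days(num_gpus=8, tflops_per_gpu=2, days=6)     # about 0.1 pfs-days
large_2016 = pfs_days(num_gpus=100, tflops_per_gpu=10, days=10) # 10 pfs-days

print(small_2012, large_2014, large_2016)
print(f"1 pfs-day = {total_operations(1.0):.3g} operations")
```

Read this way, the quoted ranges correspond to training runs of roughly one to ten days at the stated GPU counts and ratings; with lower utilization the same pfs-day totals would imply proportionally longer runs.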

AlphaGoZero/AlphaZero is the most visible public example of massive algorithmic parallelism, but many other applications at this scale are now algorithmically possible, and may already be happening in a production context.

Addendum: Compute used in older headline results (2019-11-07)

We’ve updated our analysis with data that span 1959–2012. Looking at the data as a whole, we clearly see two distinct eras of training AI systems in terms of compute-usage: (a) a first era, 1959–2012, which is defined by results that roughly track Moore’s law, and (b) the modern era, from 2012 to now, of results using computational power that substantially outpaces macro trends. The history of investment in AI broadly is usually told as a story of booms and busts, but we don’t see that reflected in the historical trend of compute used by learning systems. It seems that AI winters and periods of excitement had a small effect on compute used to train models over the last half-century.

Two Distinct Eras of Compute Usage in Training AI Systems

Starting from the perceptron in 1959, we see a ~2-year doubling time for the compute used in these historical results, followed by a 3.4-month doubling time starting in ~2012. It’s difficult to draw a strong conclusion from this data alone, but we believe that this trend is probably due to a combination of the limits on the amount of compute that was possible to use for those results and the willingness to spend on scaling up experiments. [For one vivid account of the history of computing in AI in this period, see the “False Start” section in Hans Moravec’s 1998 article.]