“Carbon Emissions and Large Neural Network Training”, 2021-04-21:
The computational demand for machine learning (ML) has grown rapidly in recent years, and that growth comes with a number of costs. Estimating the energy cost helps measure its environmental impact and find greener strategies, yet it is challenging without detailed information.
We calculate the energy use and carbon footprint of several recent large models—T5, Meena, GShard, Switch Transformer, and GPT-3—and refine earlier estimates for the neural architecture search that found Evolved Transformer.
We highlight the following opportunities to improve energy efficiency and CO2 equivalent emissions (CO2e):
Large but sparsely activated mixture-of-experts (MoE) DNNs can consume <1⁄10th the energy of large, dense DNNs without sacrificing accuracy, despite using as many or even more parameters.
Geographic location matters for ML workload scheduling, since the fraction of carbon-free energy and the resulting CO2e vary 5×–10×, even within the same country and the same organization. We are now optimizing where and when large models are trained.
Specific datacenter infrastructure matters, as Cloud datacenters can be 1.4–2× more energy efficient than typical datacenters, and the ML-oriented accelerators inside them can be 2–5× more effective than off-the-shelf systems.
Remarkably, the choice of DNN, datacenter, and processor can reduce the carbon footprint by up to 100–1000×.
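To see how the individual factors listed above could compound into a figure of that size, here is a minimal sketch. It assumes the factors are independent and simply multiply, and it reads "<1⁄10th the energy" as exactly 10×; both are illustrative assumptions, not the paper's stated method, and the names below are hypothetical.

```python
from math import prod

# (low, high) reduction factors quoted in the list above.
# Assumption: factors are independent and multiplicative.
FACTORS = {
    "sparse_model": (10.0, 10.0),  # "<1/10th the energy", taken as exactly 10x
    "location": (5.0, 10.0),       # carbon-free energy fraction varies 5x-10x
    "datacenter": (1.4, 2.0),      # Cloud vs. typical datacenter
    "processor": (2.0, 5.0),       # ML accelerator vs. off-the-shelf system
}


def combined_reduction(factors):
    """Multiply the low ends together and the high ends together."""
    low = prod(lo for lo, _ in factors.values())
    high = prod(hi for _, hi in factors.values())
    return low, high


low, high = combined_reduction(FACTORS)
print(f"combined reduction: about {low:.0f}x to {high:.0f}x")
```

Under these assumptions the product spans roughly 140× to 1000×, which falls inside the quoted 100–1000× range.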
These large factors also make retroactive estimates of energy cost difficult. To avoid miscalculations, we believe ML papers requiring large computational resources should make energy consumption and CO2e explicit when practical. We are working to be more transparent about energy use and CO2e in our future research. To help reduce the carbon footprint of ML, we believe energy usage and CO2e should be a key metric in evaluating models, and we are collaborating with MLPerf developers to include energy usage during training and inference in this industry standard benchmark.
…Most companies spend more energy on serving a DNN model (performing inference) than on training it. For example, NVIDIA estimated that 80–90% of the ML workload is inference processing [Leo19]. Similarly, Amazon Web Services claimed that 90% of the ML demand in the cloud is for inference [Bar19]. Given its substantial role in the ML model lifecycle, Alibaba [e.g., Hanguang 800], Amazon, Google, and NVIDIA designed ML accelerators solely for inference. If total ML energy is split 10% on training and 90% on serving, then a model that doubled the energy cost of training could still reduce total carbon emissions if it also cut serving energy by 20%.
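The arithmetic behind that last claim can be checked with a short sketch. The function names are hypothetical, and the 10%/90% split is the illustrative one from the text rather than a measured figure.

```python
def relative_total_energy(train_share, train_multiplier, serving_cut):
    """Total energy of a new model relative to a baseline normalized to 1.0.

    train_share: fraction of baseline energy spent on training (e.g. 0.10)
    train_multiplier: how much more training energy the new model needs (e.g. 2.0)
    serving_cut: fractional reduction in serving energy (e.g. 0.20)
    """
    serve_share = 1.0 - train_share
    return train_share * train_multiplier + serve_share * (1.0 - serving_cut)


def break_even_serving_cut(train_share, train_multiplier):
    """Smallest serving cut that offsets the extra training energy."""
    return train_share * (train_multiplier - 1.0) / (1.0 - train_share)


total = relative_total_energy(0.10, 2.0, 0.20)
needed = break_even_serving_cut(0.10, 2.0)
print(f"new total relative to baseline: {total:.2f}")
print(f"break-even serving cut: {needed:.1%}")
```

With the numbers from the text, training rises from 0.10 to 0.20 of the baseline and serving falls from 0.90 to 0.72, so the total drops to 0.92. Any serving cut above roughly 11% would offset the doubled training cost under this split.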
…As Table 4 shows, the actual cost of Evolved Transformer NAS is nearly 2 orders of magnitude smaller than previously estimated [Str19]. Why the discrepancy? The answer is that, in addition to the efficiency of Google datacenters, there was confusion in estimating the energy cost of NAS. In Evolved Transformer NAS, researchers used a small proxy task to search for the best models to save time and money, and then scaled up the found models to full size. The use of small proxies may not be obvious, which made it hard to estimate the CO2e correctly in retrospect from the NAS paper [So19]. Because the use of proxy tasks in NAS was misunderstood, it was assumed the search was done with full-size tasks. Because of this assumption, despite considerable effort on their part, Strubell et al.’s energy estimate for NAS ended up 18.7× too high for the average organization (see Appendix C) and 88× off in emissions for energy-efficient organizations like Google (see Appendix D).
…In terms of cost-benefit tradeoff, NAS can also lead to improved energy efficiency in training of downstream applications, and the benefit can dramatically outweigh the cost. Figure 4 shows that the Evolved Transformer, found by NAS [So19], has 37% fewer parameters and converges to the same accuracy with 25% less energy expenditure (see Table 1) than the vanilla Transformer (Big) model on WMT English to German translation. The use of Evolved Transformer instead of a regular Transformer architecture saved 48.5 tCO2e during the training of the Meena DNN (see Tables 1 & 4). The savings from this single reuse in Meena are ~15× larger than the energy cost of running the search to discover it. The results of the Evolved Transformer neural architecture search have been open-sourced. It can readily be used by anyone training ML models for NLP problems, similar to how a Transformer-style model can be used for NLP problems [Evo19].
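As a rough check on the cost-benefit claim, the sketch below backs out the search cost implied by the two figures in the text. It assumes the "~15×" ratio applies directly to CO2e, although the text phrases it in terms of energy, and the constant and function names are hypothetical.

```python
MEENA_SAVINGS_TCO2E = 48.5      # savings from using Evolved Transformer in Meena
SAVINGS_TO_SEARCH_RATIO = 15.0  # "~15x", approximate; assumed to apply to CO2e


def implied_search_cost(savings, ratio):
    """Search cost implied by a savings figure and a savings-to-cost ratio."""
    return savings / ratio


def net_savings(savings, ratio, reuses=1):
    """Savings from `reuses` equivalent reuses, minus the one-time search cost."""
    return reuses * savings - implied_search_cost(savings, ratio)


search_cost = implied_search_cost(MEENA_SAVINGS_TCO2E, SAVINGS_TO_SEARCH_RATIO)
net = net_savings(MEENA_SAVINGS_TCO2E, SAVINGS_TO_SEARCH_RATIO)
print(f"implied search cost: about {search_cost:.1f} tCO2e")
print(f"net savings after one reuse: about {net:.1f} tCO2e")
```

Because the ratio is approximate, the implied search cost of about 3.2 tCO2e is an estimate; the point is that the search cost is paid once, while each further reuse adds savings.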
…Finally, Google publishes its total energy consumption, and for 2019 it was 12.2 terawatt-hours (TWh) [Goo20]. Row 18 of Table 4 shows the training energy of each NLP model as a percentage of that total. Even if we assume all 4 of Google’s large NLP models in Table 4 were trained in 2019, the total represents less than 0.005%. The training of those 4 large NLP models is not a substantial fraction of Google’s energy consumption.
…For example, our large-scale translation models (M4) have already been used to translate billions of queries annually for each mid-to-low resource language, with 2B speakers globally for these languages. Figure 7, from the GShard paper [Lep20], shows substantial improvements for translation of 100 different languages to English. The blue line at the top left represents the 600B-parameter multilingual translation MoE model of GShard. The dashed black line near the bottom is for a traditional dense DNN that is fully activated for every token. The dense DNN requires ~10× more computational resources to train than the 600B sparse MoE model, despite substantially lower translation quality. Figure 7 shows that the larger the MoE model, the larger the BLEU score gains were across all languages; the lines rarely cross. The 600B MoE model improves average quality by +13.5 BLEU, a gain 7.4 BLEU higher than that of the 2.3B dense model.
GShard-600B’s emissions (Table 4) are 4.3 tCO2e (equivalent to 3.5 passenger SF-NY round trips) from consuming 24 MWh to train a model that could have 2B users; the amortized per-user CO2e impact of model training would be less than the CO2e impact of sending one text message.
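The amortization can be worked through with the figures quoted in this excerpt. The sketch below is illustrative only: it assumes all 2B potential users are counted, uses Google's published 2019 total of 12.2 TWh for the energy share, and does not attempt to estimate the footprint of a text message. All names are hypothetical.

```python
GSHARD_TCO2E = 4.3       # training emissions, tonnes CO2e
GSHARD_MWH = 24.0        # training energy, MWh
POTENTIAL_USERS = 2e9    # "could have 2B users"
GOOGLE_2019_TWH = 12.2   # Google's total 2019 energy consumption


def per_user_milligrams(tco2e, users):
    """Amortized training emissions per user, in milligrams of CO2e."""
    return tco2e * 1e9 / users  # 1 tonne = 1e9 mg


def share_of_total(mwh, total_twh):
    """Fraction of a TWh-scale total represented by an MWh-scale amount."""
    return mwh / (total_twh * 1e6)  # 1 TWh = 1e6 MWh


mg_per_user = per_user_milligrams(GSHARD_TCO2E, POTENTIAL_USERS)
share = share_of_total(GSHARD_MWH, GOOGLE_2019_TWH)
print(f"amortized emissions: about {mg_per_user:.2f} mg CO2e per user")
print(f"share of Google's 2019 energy: about {share:.6%}")
```

Under these assumptions the training run amortizes to about 2 mg CO2e per user and accounts for roughly 0.0002% of Google's 2019 energy use, well under the 0.005% bound quoted for all 4 models together.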