Google has deployed several TPU generations since 2015, teaching us lessons that changed our views: semiconductor technology advances unequally; compiler compatibility trumps binary compatibility, especially for VLIW domain-specific architectures (DSAs); target total cost of ownership vs. initial cost; support multi-tenancy; deep neural networks (DNNs) grow ~1.5× annually; DNN advances evolve workloads; some inference tasks require floating point; inference DSAs need air-cooling; apps limit latency, not batch size; and backwards ML compatibility helps deploy DNNs quickly. These lessons molded TPUv4i, an inference DSA deployed since 2020.
Table 1: Key characteristics of DSAs.
The underlines show changes over the prior TPU generation, from left to right. System TDP includes power for the DSA memory system plus its share of the server host power, e.g., add host TDP/8 for 8 DSAs per host.
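As a concrete reading of that convention (the wattages below are hypothetical, chosen only to illustrate the arithmetic):

\[
\text{System TDP} \;=\; \text{DSA TDP (chip + memory)} \;+\; \frac{\text{host TDP}}{\text{DSAs per host}},
\qquad \text{e.g., } 170\,\mathrm{W} + \frac{400\,\mathrm{W}}{8} = 220\,\mathrm{W}.
\]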
…Document the unequal improvement in logic, wires, SRAM, and DRAM from 45 nm to 7 nm (including an update of Horowitz's operation energy table [16] from 45 nm to 7 nm), and show how these changes led to four systolic floating point matrix units for TPUv4i in 2020 versus one systolic integer matrix unit for TPUv1 in 2015.
Explain the difference between designing for performance per TCO vs. per CapEx (sketched after this list), leading to HBM and a low TDP for TPUv4i, and show how TPUv1's headroom led to application scale-up after the 2017 paper [21].
Explain backwards ML compatibility, including why inference can need floating point and how it spurred the TPUv4i and TPUv4 designs (§3). Backwards ML compatible training also tailors DNNs to TPUv4i (§2).
Measure production inference applications to show that DSAs normally run multiple DNNs concurrently, requiring Google inference DSAs to support multi-tenancy.
Discuss how DNN advances change the production inference workload. The 2020 workload keeps MLP and CNN from 2017 but adds BERT, and RNN succeeds LSTM.
Document the growth of production DNNs in memory size and computation by ~1.5× annually since 2016 (compounded in a sketch after this list), which encourages designing DSAs with headroom.
Show that Google’s TCO and TDP for DNN DSAs are strongly correlated (r = 0.99), likely due to the end of Dennard scaling. TDP offers a good proxy for DSA TCO (see the code sketch after this list).
Document that the SLO limit is P99 time for inference applications, list typical batch sizes, and show how large on-chip SRAM helps P99 performance (a P99 check is sketched after this list).
Explain why TPUv4i architects chose compiler compatibility over binary compatibility for its VLIW ISA.
Describe Google’s latest inference accelerator in production since March 2020 and evaluate its performance/TDP vs. TPUv3 and NVIDIA’s T4 inference GPU using production apps and MLPerf Inference benchmarks 0.5–0.7.
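For the TCO-vs-CapEx contribution above, a minimal sketch of the two cost metrics (the cost breakdown here is generic, not Google's actual accounting):

\[
\frac{\text{Perf}}{\text{CapEx}} = \frac{\text{Perf}}{\text{purchase cost}}, \qquad
\frac{\text{Perf}}{\text{TCO}} = \frac{\text{Perf}}{\text{CapEx} + \text{OpEx over the deployment lifetime}}.
\]

A design that spends more CapEx (e.g., on HBM) but cuts TDP, and thus operating cost, can therefore win on performance per TCO even while losing on performance per CapEx.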
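For the ~1.5× annual growth contribution, the compounding is simple arithmetic: over the four years from 2016 to 2020,

\[
1.5^{4} \approx 5\times
\]

growth in DNN memory size and computation, which is why a DSA sized only for the launch-day workload soon runs out of headroom.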
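For the TDP-as-TCO-proxy contribution, a minimal code sketch of how such a proxy might be applied when comparing candidate designs; the candidate TDPs, TCOs, and performance numbers below are hypothetical, and only the r = 0.99 correlation is from the paper:

```python
import numpy as np

# Hypothetical candidate DNN DSAs: TDP in watts, normalized TCO, peak TOPS.
tdp  = np.array([ 75.0, 175.0, 280.0, 450.0])
tco  = np.array([  1.0,   2.2,   3.4,   5.6])   # illustrative only
perf = np.array([ 40.0, 140.0, 230.0, 380.0])   # illustrative only

# A near-linear TDP-TCO relationship (the paper reports r = 0.99 for Google's
# DSAs) is what justifies ranking designs by performance/TDP when full TCO
# numbers are unavailable at design time.
r = np.corrcoef(tdp, tco)[0, 1]
best = int(np.argmax(perf / tdp))
print(f"TDP-TCO correlation r = {r:.2f}; best perf/TDP: candidate {best}")
```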
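And for the P99 SLO contribution, a small sketch of the latency check it implies, assuming a set of measured per-request latencies; the function name, SLO threshold, and simulated latency distribution are all hypothetical:

```python
import numpy as np

def meets_slo(latencies_ms: np.ndarray, slo_p99_ms: float) -> bool:
    """True if the 99th-percentile (tail) latency is within the SLO limit."""
    return np.percentile(latencies_ms, 99) <= slo_p99_ms

# Illustrative only: 10,000 simulated inference latencies and a 10 ms P99 SLO.
rng = np.random.default_rng(0)
samples = rng.gamma(shape=4.0, scale=1.0, size=10_000)  # milliseconds
print(meets_slo(samples, slo_p99_ms=10.0))
```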
…TPUv1 supported only integer arithmetic, so it required quantization, which proved a problem for some datacenter applications. Early in TPUv1 development, application developers said a 1% drop in quality was acceptable, but they changed their minds by the time the hardware arrived, perhaps because overall DNN quality had improved, so that 1% added to a 40% error was relatively small but 1% added to a 12% error was relatively large.
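The relative-error arithmetic behind that shift:

\[
\frac{1\%}{40\%} = 2.5\% \quad\text{versus}\quad \frac{1\%}{12\%} \approx 8.3\%,
\]

so the same 1% absolute quality loss from quantization is roughly three times more costly, relatively, once baseline error drops from 40% to 12%.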
…Alas, DNN DSA designers often ignore multi-tenancy. Indeed, multi-tenancy is not mentioned in the TPUv1 paper [21]. (It was lucky that the smallest available DDR3 DRAM held 8 GB, allowing TPUv1 software to add multi-tenancy.)
…BERT appeared in 2018, yet it’s already 28% of the workload.