“Understanding Sources of Inefficiency in General-Purpose Chips”, Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, Mark Horowitz, 2010-06-01:

[re AI: have GPUs/TPUs already eaten the low-hanging fruit here?] Due to their high volume, general-purpose processors, and now chip multiprocessors (CMPs), are much more cost effective than ASICs, but lag substantially in terms of performance and energy efficiency.

This paper explores the sources of these performance and energy overheads in general-purpose processing systems by quantifying the overheads of a 720p HD H.264 encoder running on a general-purpose CMP system. It then explores methods to eliminate these overheads by transforming the CPU into a specialized system for H.264 encoding. We evaluate the gains from customizations useful to broad classes of algorithms, such as SIMD units, as well as those specific to particular computation, such as customized storage and functional units.

The ASIC is 500× more energy efficient than our original 4-processor CMP. Broadly applicable optimizations improve performance by 10× and energy by 7×. However, the very low energy costs of actual core ops (100s fJ in 90 nm) mean that over 90% of the energy used in these solutions is still “overhead”. Achieving ASIC-like performance and efficiency requires algorithm-specific optimizations. For each sub-algorithm of H.264, we create a large, specialized functional unit that is capable of executing 100s of operations per instruction. This improves performance and energy by an additional 25× and the final customized CMP matches an ASIC solution’s performance within 3× of its energy and within comparable area.

[Keywords: ASIC, H.264, chip multiprocessor, high-performance, energy efficiency, customization, Tensilica]

…We evaluate these strategies by transforming a general-purpose, Tensilica-based, extensible CMP system into a highly efficient 720p HD H.264 encoder. We choose H.264 because it demonstrates the large energy advantage of ASIC solutions (500×) and because there exist commercial ASICs that can serve as a benchmark. Moreover, H.264 contains a variety of computational motifs, from highly data parallel algorithms (motion estimation) to control intensive ones (CABAC).

The results are striking. Starting from a 500× energy penalty, adding relatively wide (16×) SIMD execution units improves performance by 10× and energy efficiency by 7×. Since SIMD units are often augmented with special fused instructions to accelerate important applications, we introduce our own custom fused instructions to improve both performance and energy efficiency by an additional 1.4×. Despite these customizations, which collectively improve energy efficiency by 10×, the resulting solution is still 50× less energy efficient than an ASIC.
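The arithmetic behind these figures can be checked in a few lines. This is only a sketch using the rounded multipliers quoted in the paragraph above; the variable names are illustrative and do not come from the paper.

```python
# Rounded figures quoted in the excerpt above (not exact measurements).
initial_gap = 500.0  # energy penalty of the original 4-processor CMP vs. the ASIC
simd_gain = 7.0      # energy-efficiency gain from 16-wide SIMD units
fused_gain = 1.4     # additional gain from custom fused instructions

# Gains compose multiplicatively, so the combined gain is about 10x.
combined_gain = simd_gain * fused_gain

# Dividing the initial gap by the combined gain gives the gap that remains.
remaining_gap = initial_gap / combined_gain

print(f"combined gain: {combined_gain:.1f}x")
print(f"remaining gap vs. ASIC: {remaining_gap:.0f}x")
```

With these rounded inputs the combined gain comes out just under 10× and the remaining gap just over 50×, which agrees with the figures in the text.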

An examination of the energy breakdown clearly demonstrates why. Since the SIMD unit customizes datapath widths of 8–12 bits, functional unit energy comprises less than 10% of the total even when performing more than 10 operations per cycle. Thus, to create a truly efficient processor, one needs to construct instructions that aggregate enough computation to offset the energy overheads of flexible instruction and data fetch. Creating such “magic” instructions improves energy efficiency by another 18× and yields a solution within 3× of a full ASIC design.
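The amortization argument can be illustrated with a toy model in which every instruction pays a fixed overhead for instruction and data fetch, plus a small cost per useful operation. The numbers below are assumptions chosen only to be consistent with the text: 0.2 pJ per operation (the abstract says "100s fJ") and a hypothetical 20 pJ of per-instruction overhead. They are not measurements from the paper, and the function name is illustrative.

```python
def functional_unit_fraction(ops_per_instruction, op_energy_pj, overhead_pj):
    """Fraction of an instruction's energy spent on useful operations.

    Toy model: total energy = fixed per-instruction overhead
    + ops_per_instruction * energy per operation.
    """
    useful = ops_per_instruction * op_energy_pj
    return useful / (useful + overhead_pj)


OP_ENERGY_PJ = 0.2   # assumed; the abstract only says "100s fJ" per core op
OVERHEAD_PJ = 20.0   # hypothetical fetch/decode/register-file overhead

# A SIMD-style instruction doing about 10 operations is mostly overhead.
simd_fraction = functional_unit_fraction(10, OP_ENERGY_PJ, OVERHEAD_PJ)

# A "magic" instruction doing 100s of operations amortizes the same overhead.
magic_fraction = functional_unit_fraction(200, OP_ENERGY_PJ, OVERHEAD_PJ)

# Remaining gap vs. the ASIC after the further 18x gain quoted in the text.
gap_after_magic = 50.0 / 18.0

print(f"useful-energy fraction, 10 ops:  {simd_fraction:.1%}")
print(f"useful-energy fraction, 200 ops: {magic_fraction:.1%}")
print(f"gap after magic instructions: {gap_after_magic:.1f}x")
```

Under these assumed values the 10-operation instruction spends under 10% of its energy on useful work, while the 200-operation instruction spends well over half, which is the qualitative effect the paragraph describes.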

While identifying the right customizations for a given application takes substantial effort, it is hard to achieve ASIC-like efficiencies without them. The inescapable conclusion is that truly efficient designs will require application-specialized hardware. If energy efficiency is going to drive future computing design, then we need frameworks that allow application experts to easily (and at low cost) create customized solutions. The fact that, for our application, we can achieve good efficiency using processor instruction extensions is an encouraging sign.