“A Full-Stack Accelerator Search Technique for Vision Applications”, 2021-05-26:
The rapidly changing ML model landscape presents a unique opportunity for building hardware accelerators optimized for specific datacenter-scale workloads. We propose the Full-stack Accelerator Search Technique (FAST), a hardware accelerator search framework that defines a broad optimization environment covering key design decisions within the hardware-software stack, including hardware datapath, software scheduling, and compiler passes such as operation fusion and tensor padding.
Although FAST can be used on any number and type of deep learning workload, in this paper we focus on optimizing for a single vision model or a small set of vision models, resulting in faster and more power-efficient designs relative to a general-purpose ML accelerator.
When evaluated on inference performance for EfficientNet, ResNet-50v2, and OCR relative to a TPU-v3, designs generated by FAST and optimized for single workloads can improve Perf/TDP (performance per watt of peak power) by over 6× in the best case and 4× on average. On a limited workload subset, FAST improves Perf/TDP by 2.85× on average, with a reduction to 2.35× for a single design optimized over the set of workloads. In addition, we demonstrate a potential 1.8× speedup opportunity for TPU-v3 with improved scheduling.
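To make the Perf/TDP comparison concrete, here is a minimal sketch of how a relative Perf/TDP gain over a baseline accelerator could be computed. The `Design` class, the `perf_per_tdp_gain` helper, the workload names, and every number below are hypothetical placeholders, not values from the paper. The abstract does not say how its averages are formed, so the geometric mean used here is an assumption.

```python
from dataclasses import dataclass
from statistics import geometric_mean


@dataclass(frozen=True)
class Design:
    """Hypothetical accelerator summary: throughput and peak power (TDP)."""
    throughput: float  # e.g. inferences per second (placeholder units)
    tdp_watts: float   # peak power in watts

    @property
    def perf_per_tdp(self) -> float:
        return self.throughput / self.tdp_watts


def perf_per_tdp_gain(candidate: Design, baseline: Design) -> float:
    """Ratio of the candidate's Perf/TDP to the baseline's Perf/TDP."""
    return candidate.perf_per_tdp / baseline.perf_per_tdp


# Placeholder numbers for illustration only; none come from the paper.
baseline = Design(throughput=1000.0, tdp_watts=400.0)
candidates = {
    "workload_a": Design(throughput=1500.0, tdp_watts=200.0),
    "workload_b": Design(throughput=1200.0, tdp_watts=240.0),
}

gains = {name: perf_per_tdp_gain(d, baseline) for name, d in candidates.items()}

# Assumption: ratios are averaged with a geometric mean.
mean_gain = geometric_mean(list(gains.values()))

for name, gain in gains.items():
    print(f"{name}: {gain:.2f}x Perf/TDP vs. baseline")
print(f"mean gain: {mean_gain:.2f}x")
```

Because the metric is a ratio of ratios, a design can improve Perf/TDP by raising throughput, by lowering peak power, or both.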