“Warehouse-Scale Video Acceleration (Argos): Co-Design and Deployment in the Wild”, Parthasarathy Ranganathan, Daniel Stodolsky, Jeff Calow, Jeremy Dorfman, Marisabel Guevara, Clinton Wills Smullen IV, Aki Kuusela, Raghu Balasubramanian, Sandeep Bhatia, Prakash Chauhan, Anna Cheung, In Suk Chong, Niranjani Dasharathi, Jia Feng, Brian Fosco, Samuel Foss, Ben Gelb, Sara J. Gwin, Yoshiaki Hase, Da-ke He, C. Richard Ho, Roy W. Huffman Junior, Elisha Indupalli, Indira Jayaram, Poonacha Kongetira, Cho Mon Kyaw, Aaron Laursen, Yuan Li, Fong Lou, Kyle A. Lucke, J. P. Maaninen, Ramon Macias, Maire Mahony, David Alexander Munday, Srikanth Muroor, Narayana Penukonda, Eric Perkins-Argueta, Devin Persaud, Alex Ramirez, Ville-Mikko Rautio, Yolanda Ripley, Amir Salek, Sathish Sekar, Sergey N. Sokolov, Rob Springer, Don Stark, Mercedes Tan, Mark S. Wachsler, Andrew C. Walton, David A. Wickeraad, Alvin Wijaya, Hon Kwan Wu. 2021-02-27 (AI hardware, computer hardware):
[media] Video sharing (e.g., YouTube, Vimeo, Facebook, TikTok) accounts for the majority of internet traffic, and video processing is also foundational to several other key workloads (video conferencing, virtual/augmented reality, cloud gaming, video in Internet-of-Things devices, etc.). The importance of these workloads motivates larger video processing infrastructures and—with the slowing of Moore’s law—specialized hardware accelerators to deliver more computing at higher efficiencies.
This paper describes the design and deployment, at scale, of a new accelerator targeted at warehouse-scale video transcoding. We present our hardware design, including a new accelerator building block—the video coding unit (VCU)—and discuss key design trade-offs for balanced systems at data center scale and for co-designing accelerators with large-scale distributed software systems. We evaluate these accelerators “in the wild” serving live data center jobs, demonstrating 20–33× improved efficiency over our prior well-tuned non-accelerated baseline. Our design also enables effective adaptation to changing bottlenecks, improved failure management, and new workload capabilities not possible with prior systems.
To the best of our knowledge, this is the first work to discuss video acceleration at scale in large warehouse-scale environments.
[Keywords: video transcoding, warehouse-scale computing, domain-specific accelerators, hardware-software co-design]
Table 1: Offline 2-pass single output (SOT) throughput in VCU vs. CPU and GPU systems. · Encoding Throughput: Table 1 shows throughput and perf/TCO (performance per total cost of ownership) for the 4 systems, normalized to the perf/TCO of the CPU system. The performance is shown for offline 2-pass SOT encoding for H.264 and VP9. For H.264, the GPU has 3.5× higher throughput, and the 8×VCU and 20×VCU systems provide 8.4× and 20.9× more throughput, respectively. For VP9, the 20×VCU system has 99.4× the throughput of the CPU baseline. This two-orders-of-magnitude increase in performance clearly demonstrates the benefits of our VCU system.
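The normalization used in Table 1 can be sketched in a few lines. The throughput multipliers below are the H.264 figures quoted in the text; the per-system TCO values are illustrative placeholders (the paper's actual TCO figures are not reproduced here), so the resulting perf/TCO numbers are hypothetical:

```python
# Hedged sketch of the Table 1 normalization: perf/TCO per system,
# normalized so the CPU baseline equals 1.0.

# Offline 2-pass single-output (SOT) H.264 throughput, relative to CPU
# (multipliers from the text; absolute units are not given).
throughput_h264 = {
    "CPU": 1.0,
    "GPU": 3.5,
    "8xVCU": 8.4,
    "20xVCU": 20.9,
}

# Placeholder relative TCO per system -- hypothetical values, NOT Google's.
tco = {"CPU": 1.0, "GPU": 1.2, "8xVCU": 1.1, "20xVCU": 1.5}

def normalized_perf_per_tco(throughput, tco, baseline="CPU"):
    """Return perf/TCO for each system, normalized so baseline == 1.0."""
    base = throughput[baseline] / tco[baseline]
    return {k: (throughput[k] / tco[k]) / base for k in throughput}

print(normalized_perf_per_tco(throughput_h264, tco))
```

With real TCO inputs, the same computation yields the normalized perf/TCO column that Table 1 reports alongside raw throughput.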
The VCU package is a full-length PCI-E card and looks a lot like a graphics card. A board has 2 Argos ASIC chips buried under a gigantic, passively cooled aluminum heat sink. There’s even what looks like an 8-pin power connector on the end, because the PCI-E slot alone can’t supply enough power.
Google provided a lovely chip diagram that lists 10 “encoder cores” on each chip, with Google’s white paper adding that “all other elements are off-the-shelf IP blocks.” Google says that “each encoder core can encode 2160p in realtime, up to 60 FPS (frames per second) using 3 reference frames.”
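A quick back-of-the-envelope check shows what that quoted per-core spec implies in aggregate. This sketch assumes 2160p means 3840×2160 and uses the stated figures of 10 encoder cores per chip and 2 chips per card:

```python
# Pixel throughput implied by the quoted spec: each encoder core handles
# 2160p (assumed 3840x2160) at up to 60 fps; 10 cores per Argos chip,
# 2 chips per VCU card.
width, height, fps = 3840, 2160, 60
cores_per_chip, chips_per_card = 10, 2

pixels_per_core = width * height * fps                      # pixels/s per core
pixels_per_card = pixels_per_core * cores_per_chip * chips_per_card

print(f"{pixels_per_core / 1e6:.0f} Mpixel/s per core")     # ~498 Mpixel/s
print(f"{pixels_per_card / 1e9:.2f} Gpixel/s per card")     # ~9.95 Gpixel/s
```

In other words, a single card can in principle keep twenty simultaneous 4K60 encodes in flight, which is why a rack of these cards changes the economics of bulk transcoding.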
The cards are specifically designed to slot into Google’s warehouse-scale computing system. Each compute cluster in YouTube’s system will house a section of dedicated “VCU machines” loaded with the new cards, saving Google from having to crack open every server and load it with a new card. Google says the cards resemble GPUs because they are what fit in its existing accelerator trays. CNET reports that “thousands of the chips are running in Google data centers right now”, and thanks to the cards, individual video workloads like 4K video “can be available to watch in hours instead of the days it previously took.”
Even factoring in the research and development spent on the chips, Google says this VCU plan will save the company a ton of money, as shown by the benchmark comparing the TCO (total cost of ownership) of the setup against running its algorithm on Intel Skylake chips and Nvidia T4 Tensor Core GPUs.