“HOBFLOPS CNNs: Hardware Optimized Bitslice-Parallel Floating-Point Operations for Convolutional Neural Networks”, 2020-07-11:
Convolutional neural networks (CNNs) are typically trained using 16-bit or 32-bit floating-point (FP), and research shows that low-precision FP can be highly effective for inference. Low-precision FP can be implemented in field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) accelerators, but existing processors do not generally support custom-precision FP.
We propose hardware-optimized bitslice-parallel floating-point operators (HOBFLOPS), a method for generating efficient custom-precision emulated bitslice-parallel software FP arithmetic.
We generate custom-precision FP routines optimized using a hardware synthesis design flow to create circuits. We provide standard cell libraries matching the bitwise operations of the target microprocessor architecture, and a code generator to translate the hardware circuits into bitslice software equivalents. We exploit bitslice parallelism to create very wide (32–512 element) vectorized CNN convolutions.
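To make the bitslice idea concrete, the following is a minimal sketch (not HOBFLOPS generator output, and using a toy 4-bit integer adder rather than a custom-precision FP operator): each machine word holds one bit position from many independent operands, so ordinary bitwise AND/XOR/OR act as a word-width SIMD unit, and the same circuit widens to 128-, 256-, or 512-lane registers on Neon or AVX. The NBITS width and helper names here are illustrative assumptions.

/* Minimal sketch (not the HOBFLOPS code generator's output): bitslice-parallel
 * addition of 64 independent 4-bit unsigned values using only bitwise ops.
 * Each uint64_t word holds bit i of all 64 lanes, so AND/XOR/OR behave like a
 * 64-wide SIMD unit; the identical circuit runs 512 lanes wide on AVX-512. */
#include <stdint.h>
#include <stdio.h>

#define NBITS 4  /* toy custom precision; HOBFLOPS targets FP operators */

/* a[i], b[i], sum[i] each hold bit i of 64 independent operands. */
static void bitslice_add(const uint64_t a[NBITS], const uint64_t b[NBITS],
                         uint64_t sum[NBITS]) {
    uint64_t carry = 0;
    for (int i = 0; i < NBITS; i++) {          /* ripple-carry full adders */
        uint64_t axb = a[i] ^ b[i];
        sum[i] = axb ^ carry;
        carry  = (a[i] & b[i]) | (axb & carry);
    }
}

/* Pack scalar value v into lane `lane` of the bitsliced representation. */
static void set_lane(uint64_t s[NBITS], int lane, unsigned v) {
    for (int i = 0; i < NBITS; i++)
        s[i] |= (uint64_t)((v >> i) & 1u) << lane;
}

/* Unpack lane `lane` back into a scalar value. */
static unsigned get_lane(const uint64_t s[NBITS], int lane) {
    unsigned v = 0;
    for (int i = 0; i < NBITS; i++)
        v |= (unsigned)((s[i] >> lane) & 1u) << i;
    return v;
}

int main(void) {
    uint64_t a[NBITS] = {0}, b[NBITS] = {0}, sum[NBITS] = {0};
    set_lane(a, 0, 5);  set_lane(b, 0, 6);   /* lane 0: 5 + 6  */
    set_lane(a, 1, 3);  set_lane(b, 1, 9);   /* lane 1: 3 + 9  */
    bitslice_add(a, b, sum);
    printf("lane0 = %u, lane1 = %u\n", get_lane(sum, 0), get_lane(sum, 1));
    return 0;  /* prints 11 and 12 (mod 16, since NBITS = 4) */
}

The same translation applies to any synthesized circuit: each gate in the netlist becomes one bitwise instruction, so circuit area maps directly to instruction count and the register width sets the number of parallel lanes.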
HOBFLOPS multiply-accumulate (MAC) performance in CNN convolution on ARM and Intel processors is compared to Berkeley's SoftFP16 equivalent MAC. HOBFLOPS16 outperforms SoftFP16 by 8× on Intel AVX512. HOBFLOPS offers arbitrary-precision FP with custom range and precision, e.g., HOBFLOPS9 achieves 6× the performance of HOBFLOPS16 on ARM Neon.
HOBFLOPS allows researchers to prototype different levels of custom FP precision in the arithmetic of software CNN accelerators. Furthermore, the fast custom-precision FP CNNs that HOBFLOPS enables may be valuable in cases where memory bandwidth is limited.