JSSC 2024, Issue 1 · Digital Circuits · 28 nm

Multipurpose Deep-Learning Accelerator for Arbitrary Quantization With Reduction of Storage, Logic, and

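Proposes a deep-learning accelerator that supports arbitrary quantization, offering high energy efficiency and multi-format data processing.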

28nm LP CMOS, 1-8 bit, 30% sparsity, 0.87-5.55 TOPS, 15.1-95.9 TOPS/W

Deep-learning accelerator · Arbitrary quantization · Runtime reconfiguration · Bit-serial execution · Zero-eliminator

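▸ LUT-based runtime reconfiguration

▸ Bit-serial execution that reduces wasted computation

▸ Zero-eliminator and runtime-density detector compatible with both raw and run-length compressed formats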

Abstract

Various pruning and quantization heuristics have been proposed to compress recent deep-learning models. However, the rapid development of new optimization techniques makes it difficult for domain-specific accelerators to efficiently process various models showing irregularly stored parameters or nonlinear quantization. This article presents a scalable-precision deep-learning accelerator that supports multiply-and-accumulate operations (MACs) with two arbitrarily quantized data sequences. The proposed accelerator includes three main features. To minimize logic overhead when processing arbitrarily quantized 8-bit precision data, a lookup table (LUT)-based runtime reconfiguration is proposed. The use of bit-serial execution without unnecessary computations enables the multiplication of data with non-equal precision while minimizing logic and latency waste. Furthermore, two distinct data formats, raw and run-length compressed, are supported by a zero-eliminator (ZE) and runtime-density detector (RDD) that are compatible with both formats, delivering enhanced storage and performance. For a precision range of 1–8 bit and fixed sparsity of 30%, the accelerator implemented in 28 nm low-power (LP) CMOS shows a peak performance of 0.87–5.55 TOPS and a power efficiency of 15.1–95.9 TOPS/W. The accelerator supports processing with arbitrary quantization (AQ) while achieving state-of-the-art (SOTA) power efficiency.
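The three ideas named in the abstract can be illustrated with a small software model. The sketch below is not the paper's hardware design; it is a minimal Python illustration under stated assumptions: weights are stored as short codes that index a LUT of arbitrary (non-uniform) integer levels, activations are unsigned integers multiplied one bit at a time, activations may arrive as hypothetical `(zero_run, value)` run-length pairs, and any operand pair containing a zero is skipped. All function names, the pair encoding, and the example values are invented for illustration.

```python
from typing import List, Sequence, Tuple


def lut_decode(codes: Sequence[int], lut: Sequence[int]) -> List[int]:
    """Map quantization codes to their (possibly non-uniform) integer levels.

    Assumption: each code is an index into `lut`; changing `lut` at run time
    changes the quantization scheme without changing the datapath.
    """
    return [lut[c] for c in codes]


def rle_decode(pairs: Sequence[Tuple[int, int]]) -> List[int]:
    """Expand hypothetical (zero_run, value) pairs into a raw sequence."""
    out: List[int] = []
    for zero_run, value in pairs:
        out.extend([0] * zero_run)  # zeros preceding the stored value
        out.append(value)
    return out


def bit_serial_mul(value: int, multiplier: int, bits: int) -> int:
    """Multiply `value` by an unsigned `bits`-wide `multiplier`, one bit per step.

    Only `bits` steps are taken, so a lower-precision operand costs fewer
    iterations; steps whose bit is zero add nothing.
    """
    acc = 0
    for i in range(bits):
        if (multiplier >> i) & 1:
            acc += value << i  # shift-and-add partial product
    return acc


def sparse_mac(weights: Sequence[int], activations: Sequence[int],
               act_bits: int) -> Tuple[int, int]:
    """Accumulate products, skipping pairs where either operand is zero.

    Returns (sum, number_of_skipped_pairs).
    """
    total = 0
    skipped = 0
    for w, x in zip(weights, activations):
        if w == 0 or x == 0:
            skipped += 1  # zero elimination: no multiply is issued
            continue
        total += bit_serial_mul(w, x, act_bits)
    return total, skipped


# Example: 2-bit weight codes with non-uniform levels, 3-bit activations.
weight_lut = [-7, -1, 0, 5]                 # hypothetical quantization levels
weights = lut_decode([3, 2, 0, 1], weight_lut)
activations = rle_decode([(0, 3), (0, 6), (1, 2)])
result, skipped = sparse_mac(weights, activations, act_bits=3)
print(weights, activations, result, skipped)
```

In this model the number of shift-and-add steps scales with the activation precision (`act_bits`), and skipped pairs cost no multiply at all, which is the intuition behind precision-dependent throughput at a given sparsity. How the actual accelerator organizes its LUTs, bit-serial units, ZE, and RDD is not described in the abstract and is not modeled here.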