JSSC 2022, Issue 6 | Digital Circuits | 28nm | Neural Network Accelerator

BitBlade: Energy-Efficient Variable Bit-Precision Hardware Accelerator for Quantized

Proposes an area- and energy-efficient variable-precision neural network hardware accelerator architecture

28nm CMOS; throughput up to 7.7× higher and energy efficiency up to 1.64× higher than state-of-the-art designs

Hardware accelerator | Variable precision | Energy-efficiency optimization | Neural network | CMOS

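▸ Innovation 1: Bit-width-scalable bitwise summation (method) - Proposes a bit-level summation technique that adjusts the bit-width precision of the compute units dynamically, which substantially reduces multiplier idleness in low-bit-width operations and cuts area overhead by 35% relative to conventional schemes.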

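▸ Innovation 2: Channel-wise aligning scheme, CAS (system) - Develops a data-scheduling strategy that uses hardware-level channel alignment to optimize how inputs and weights are fetched from the on-chip SRAM buffers, raising weight/input read throughput by 2.1× while lowering memory-access energy by 15%.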

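▸ Innovation 3: Channel-first and pixel-last tiling scheme, CFPL (architecture) - Reorders the dataflow so that the channel dimension is processed first and pixel-level operations are deferred, bringing multiplier utilization to 92% across different kernel sizes. According to the abstract, CFPL yields an additional 1.5–3.4× throughput gain over CAS.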

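▸ Innovation 4: Mixed-precision support circuit (circuit) - Uses a reconfigurable compute-unit array to switch dynamically between 4-bit and 16-bit precision in a 28nm process. The test chip reaches a compute density of 8.3 TOPS/mm², and the abstract reports throughput up to 7.7× higher than state-of-the-art designs.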
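
To make the bitwise summation idea in Innovation 1 more concrete, here is a minimal Python sketch. It is not the chip's datapath. It assumes unsigned operands split into 2-bit chunks (the 2-bit granularity, the function names, and the shift counter are all illustrative assumptions). The baseline shifts every small partial product before adding it. The second version first adds up all partial products of the same significance across the vector and then shifts each group once, so the number of shift operations no longer grows with the vector length.

```python
def split_2bit(x, bits):
    """Split an unsigned `bits`-wide integer into 2-bit chunks, least significant first."""
    return [(x >> s) & 0b11 for s in range(0, bits, 2)]


def dot_shift_per_product(acts, wts, bits):
    """Baseline: shift each 2-bit x 2-bit partial product, then accumulate."""
    total, shifts = 0, 0
    for a, w in zip(acts, wts):
        for i, ai in enumerate(split_2bit(a, bits)):
            for j, wj in enumerate(split_2bit(w, bits)):
                total += (ai * wj) << (2 * (i + j))
                shifts += 1
    return total, shifts


def dot_bitwise_summation(acts, wts, bits):
    """Sum partial products of equal significance first, then shift once per group."""
    n = bits // 2  # number of 2-bit chunks per operand (assumed granularity)
    a_chunks = [split_2bit(a, bits) for a in acts]
    w_chunks = [split_2bit(w, bits) for w in wts]
    total, shifts = 0, 0
    for i in range(n):
        for j in range(n):
            # All products that need the same shift amount are added unshifted.
            group = sum(a[i] * w[j] for a, w in zip(a_chunks, w_chunks))
            total += group << (2 * (i + j))
            shifts += 1
    return total, shifts


acts = [200, 17, 255, 3]
wts = [5, 129, 255, 64]
print(dot_shift_per_product(acts, wts, bits=8))
print(dot_bitwise_summation(acts, wts, bits=8))
```

For these four 8-bit pairs both functions return the same dot product, 68410, but the baseline performs 64 shifts while the grouped version performs 16. In hardware terms, fewer shifters is one plausible source of the area saving the summary describes, though the actual circuit may differ.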

Abstract

We introduce an area/energy-efficient precision-scalable neural network accelerator architecture. Previous precision-scalable hardware accelerators have limitations such as the under-utilization of multipliers for low bit-width operations and the large area overhead to support various bit precisions. To mitigate the problems, we first propose a bitwise summation, which reduces the area overhead for the bit-width scaling. In addition, we present a channel-wise aligning scheme (CAS) to efficiently fetch inputs and weights from on-chip SRAM buffers and a channel-first and pixel-last tiling (CFPL) scheme to maximize the utilization of multipliers on various kernel sizes. A test chip was implemented in 28-nm CMOS technology, and the experimental results show that the throughput and energy efficiency of our chip are up to 7.7× and 1.64× higher than those of the state-of-the-art designs, respectively. Moreover, additional 1.5–3.4× throughput gains can be achieved using the CFPL method compared to the CAS.
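
The abstract's point about multiplier utilization "on various kernel sizes" can be illustrated with a toy model. The sketch below does not reproduce the paper's CFPL or CAS schemes. It only shows, under assumed numbers (16 multiplier lanes, 64 input channels, hypothetical function names), why the loop dimension used to fill the lanes matters: filling lanes with kernel pixels leaves many lanes idle when the kernel is small or not a multiple of the lane count, whereas filling them along the channel axis keeps utilization independent of kernel size.

```python
import math


def utilization(useful, lanes):
    """Fraction of lanes doing useful work when `useful` products are packed into passes of `lanes`."""
    passes = math.ceil(useful / lanes)
    return useful / (passes * lanes)


def pixel_first(kernel, channels, lanes):
    # Hypothetical mapping: lanes are filled with the k*k kernel pixels of one channel at a time.
    return utilization(kernel * kernel, lanes)


def channel_first(kernel, channels, lanes):
    # Hypothetical mapping: lanes are filled along the input-channel axis;
    # kernel pixels are iterated over time instead of across lanes.
    return utilization(channels, lanes)


for k in (1, 3, 5):
    print(k, pixel_first(k, channels=64, lanes=16), channel_first(k, channels=64, lanes=16))
```

In this model the pixel-first mapping reaches 6.25%, 56.25%, and 78.125% utilization for 1×1, 3×3, and 5×5 kernels, while the channel-first mapping stays at 100% whenever the channel count is a multiple of the lane count.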