JSSC 2021, Issue 9 | Digital Circuits | 28nm | Neural Network Accelerator

HNPU: An Adaptive DNN Training Processor Utilizing Stochastic Dynamic Fixed-Point and Active Bit-Precision Searching

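HNPU is an energy-efficient DNN training processor built on algorithm-hardware co-design; it supports stochastic dynamic fixed-point representation and adaptive precision searching.
28nm CMOS; at least 5.9× higher energy efficiency and 2.5× higher area efficiency than previous state-of-the-art on-chip learning processors
Keywords: DNN training processor, algorithm-hardware co-design, stochastic dynamic fixed-point, adaptive precision, energy-efficiency optimization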
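Innovation 1: Stochastic dynamic fixed-point representation (method innovation) - Proposes a new low-bit-precision training method that dynamically adjusts the bit width and the randomness of fixed-point numbers to balance training accuracy against compute efficiency; experiments show 30% lower energy consumption than conventional fixed-point representation at the same accuracy.
Innovation 2: Layer-wise adaptive precision searching unit (system innovation) - A hardware-configurable, per-layer precision optimization module that analyzes each layer's gradient sensitivity online and automatically assigns the best bit width. It supports dynamic switching between 1 and 8 bits and delivers 2.3× higher energy efficiency than fixed-bit-width training.
Innovation 3: Slice-level reconfigurability and sparsity exploitation (circuit innovation) - A reconfigurable compute-array architecture that allocates compute resources dynamically through fine-grained slice partitioning. Combined with structured sparsity compression, it reaches 91% compute-unit utilization when training ResNet-18.
Innovation 4: Adaptive-bandwidth reconfigurable accumulation network (architecture innovation) - An on-chip datapath that adjusts its transfer width to the required bit width, sustaining more than 85% memory-bandwidth utilization in mixed-precision training and cutting memory-access energy by 32%.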
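To make the idea behind Innovation 1 concrete, here is a minimal software sketch of stochastic rounding onto a dynamic fixed-point grid. It is an illustration under stated assumptions, not HNPU's actual hardware scheme: the function names (`stochastic_round`, `choose_frac_bits`, `quantize_sdfxp`, `dequantize`), the one-sign-bit layout, and the rule of picking the fractional length from the largest magnitude in the tensor are all hypothetical choices made for this example.

```python
import math
import random


def stochastic_round(x, rng):
    """Round x to floor(x) or floor(x) + 1; P(round up) is the fractional part."""
    floor = math.floor(x)
    return floor + (1 if rng.random() < (x - floor) else 0)


def choose_frac_bits(values, total_bits):
    """Dynamic part (assumed rule): pick the fractional length from the largest magnitude."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return total_bits - 1
    int_bits = math.floor(math.log2(max_abs)) + 1  # bits needed left of the binary point
    return total_bits - 1 - int_bits               # one bit is reserved for the sign


def quantize_sdfxp(values, total_bits, rng):
    """Quantize a list of floats to signed integer codes sharing one fractional length."""
    frac_bits = choose_frac_bits(values, total_bits)
    scale = 2.0 ** frac_bits
    qmin = -(2 ** (total_bits - 1))
    qmax = 2 ** (total_bits - 1) - 1
    # Clamp after rounding so a round-up at the top of the range cannot overflow.
    codes = [min(qmax, max(qmin, stochastic_round(v * scale, rng))) for v in values]
    return codes, frac_bits


def dequantize(codes, frac_bits):
    """Map integer codes back to real values."""
    return [c / 2.0 ** frac_bits for c in codes]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [0.75, -0.3, 0.01, 0.5]
    codes, frac_bits = quantize_sdfxp(grads, total_bits=8, rng=rng)
    print(codes, frac_bits, dequantize(codes, frac_bits))
```

Stochastic rounding is unbiased in expectation, so small gradient values are not systematically rounded away at low bit widths. That property is the usual motivation for using it in low-precision training.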
Abstract
This article presents HNPU, which is an energy-efficient deep neural network (DNN) training processor by adopting algorithm-hardware co-design. The HNPU supports stochastic dynamic fixed-point representation and layer-wise adaptive precision searching unit for low-bit-precision training. It additionally utilizes slice-level reconfigurability and sparsity to maximize its efficiency both in DNN inference and training. Adaptive bandwidth reconfigurable accumulation network enables reconfigurable DNN allocation and maintains its high core utilization even in various bit-precision conditions. Fabricated in a 28-nm process, the HNPU accomplished at least 5.9× higher energy efficiency and 2.5× higher area efficiency in actual DNN training compared with the previous state-of-the-art on-chip learning processors.