JSSC 2020, Issue 10 | Digital Circuits | 65nm

An Energy-Efficient Deep Convolutional Neural Network Training Accelerator for In Situ Personalization on

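Proposes an energy-efficient deep-learning accelerator that supports CNN training, using three processor cores to optimize different types of computation.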

65 nm CMOS, 0.63-1.0 V, 50 MHz, 40.7 mW, 47.4 µJ/epoch

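Deep-learning accelerator, convolutional neural network, training process, energy-efficiency optimization, fixed-point computation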

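▸ A masking scheme in the propagation core reduces intermediate activation data storage

▸ Weight gradient computation uses a different dataflow architecture to raise PE utilization

▸ A modified weight update system supports 8-bit fixed-point computation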

Abstract

A scalable deep-learning accelerator supporting the training process is implemented for device personalization of deep convolutional neural networks (CNNs). It consists of three processor cores, each operating with a distinct energy-efficient dataflow for a different type of computation in CNN training. Unlike previous works, which apply design techniques that exploit the same characteristics as inference, we analyze the major issues that arise when training in a resource-constrained system and resolve the resulting bottlenecks. A masking scheme in the propagation core greatly reduces the amount of intermediate activation data that must be stored. It eliminates the frequent off-chip memory accesses otherwise needed to hold the generated activation data until the backward path. A disparate dataflow architecture is implemented for the weight gradient computation to enhance PE utilization while maximally reusing the input data. Furthermore, the modified weight update system enables an 8-bit fixed-point computing datapath. The processor is implemented in 65-nm CMOS technology and occupies 10.24 mm² of core area. It operates with a supply voltage from 0.63 to 1.0 V, and the computing engine runs at a near-threshold voltage of 0.5 V. The chip consumes 40.7 mW at 50 MHz at its highest efficiency and achieves a training efficiency of 47.4 µJ/epoch for the customized CNN model.
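The abstract does not describe how the masking scheme works internally. As a rough illustration of why a mask can stand in for stored activation data, the sketch below assumes a ReLU activation, whose backward pass needs only one bit per element (whether the input was positive). The function names, the ReLU assumption, and the 8-bit activation width used in the storage comparison are all hypothetical choices for this example, not details confirmed by the paper.

```python
# Hypothetical sketch: a 1-bit mask replacing stored activations for ReLU.
# This illustrates the general idea only; it is not the paper's hardware scheme.

def relu_forward(x):
    """Return the ReLU outputs and a boolean mask (True where x > 0)."""
    y = [v if v > 0 else 0 for v in x]
    mask = [v > 0 for v in x]
    return y, mask


def relu_backward(grad_out, mask):
    """Pass gradients through ReLU using only the mask, not the activations."""
    return [g if m else 0 for g, m in zip(grad_out, mask)]


def storage_bits(n_elements, bits_per_activation=8):
    """Return (bits for full activations, bits for a 1-bit mask)."""
    return n_elements * bits_per_activation, n_elements


x = [-3, 5, 0, 7]
y, mask = relu_forward(x)
grad_in = relu_backward([10, 20, 30, 40], mask)
full_bits, mask_bits = storage_bits(1024)
```

Under these assumptions the mask takes one eighth of the storage of 8-bit activations for the ReLU step. Other parts of the backward path, such as the weight gradient computation, still need their own input data.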