JSSC 2023, Issue 7 · Memory · 28 nm

A 28-nm 8-bit Floating-Point Tensor Core-Based Programmable CNN Training Processor With Dynamic

A 28-nm 8-bit floating-point tensor-core CNN training processor that improves energy efficiency and speed

28nm CMOS, 16.4 TFLOPS/W, 7.3× FLOPs reduction, 4.7× training latency speedup

8-bit floating point, tensor core, CNN training, energy-efficiency optimization, dynamic sparsity

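▸ Highly parallel tensor cores maintain high utilization

▸ Hardware-efficient channel gating enables dynamic output activation sparsity

▸ Group Lasso-based dynamic weight sparsity and gradient skipping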

Abstract

Training deep/convolutional neural networks (DNNs/CNNs) requires a large amount of memory and iterative computation, which necessitates speedup and energy reduction, especially for edge devices with resource/energy constraints. In this work, we present an 8-bit floating-point (FP8) training processor that implements: 1) highly parallel tensor cores (fused multiply–add trees) that maintain high utilization throughout the forward propagation (FP), backward propagation (BP), and weight update (WU) phases of the training process; 2) hardware-efficient channel gating for dynamic output activation sparsity; 3) dynamic weight sparsity (WS) based on group Lasso; and 4) gradient skipping based on the FP prediction error. We develop a custom instruction set architecture (ISA) to flexibly support different CNN topologies and training parameters. The 28-nm prototype chip demonstrates large improvements in floating-point operations (FLOPs) reduction (7.3×), energy efficiency (16.4 TFLOPS/W), and overall training latency speedup (4.7×), for both supervised and self-supervised training tasks.
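
As an illustration only, the two software-visible ideas named in the abstract, group Lasso-based weight sparsity and gradient skipping driven by the forward-pass prediction error, can be sketched in a few lines of plain Python. The abstract does not state how groups are formed, which thresholds are used, or what the exact skipping criterion is. Grouping by output channel, the threshold parameters, the "skip when the error is small" rule, and all function names below are therefore assumptions made for this sketch, not the processor's actual method.

```python
import math


def group_lasso_penalty(weights, lam):
    """Group Lasso regularizer: lam times the sum of per-group L2 norms.

    `weights` is a list of groups, each a flat list of floats.
    Assumption: one group = all weights of one output channel.
    """
    return lam * sum(math.sqrt(sum(w * w for w in group)) for group in weights)


def prune_groups(weights, threshold):
    """Zero every group whose L2 norm is below `threshold` (hypothetical rule).

    Returns the pruned weights and a keep-mask (True = group kept), so
    later computation on the zeroed groups could be skipped entirely.
    """
    pruned, mask = [], []
    for group in weights:
        norm = math.sqrt(sum(w * w for w in group))
        keep = norm >= threshold
        mask.append(keep)
        pruned.append(list(group) if keep else [0.0] * len(group))
    return pruned, mask


def should_skip_backward(prediction_error, skip_threshold):
    """Assumed criterion: skip BP and WU when the FP error is already small."""
    return prediction_error < skip_threshold


if __name__ == "__main__":
    channels = [[3.0, 4.0], [0.01, 0.02], [1.0, 0.0]]
    print("penalty:", group_lasso_penalty(channels, lam=0.1))
    print("pruned: ", prune_groups(channels, threshold=0.1))
    print("skip BP:", should_skip_backward(0.01, skip_threshold=0.05))
```

Because the penalty acts on whole groups rather than on individual weights, it tends to drive entire channels toward zero. That structured form of sparsity is what lets hardware skip work in regular blocks rather than tracking scattered zero weights.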