JSSC 2023, Issue 7 | Memory | 28nm
A 28-nm 8-bit Floating-Point Tensor Core-Based Programmable CNN Training Processor
A 28-nm, 8-bit floating-point tensor-core CNN training processor that improves energy efficiency and speed
28nm CMOS, 16.4 TFLOPS/W, 7.3× FLOPs reduction, 4.7× training latency speedup
8-bit floating point | Tensor core | CNN training | Energy-efficiency optimization | Dynamic sparsity
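▸ Highly parallel tensor cores maintain high utilization
▸ Hardware-efficient channel gating enables dynamic output-activation sparsity
▸ Group Lasso-based dynamic weight sparsity with gradient skipping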
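As a rough illustration of the third highlight, the sketch below shows the standard group Lasso penalty (a scaled sum of per-group L2 norms) and one possible way to turn small-norm groups into a sparsity mask that also zeroes their gradients. The grouping (one group per output channel), the `threshold` value, and the helper names `group_lasso_penalty`, `active_group_mask`, and `skip_pruned_gradients` are assumptions made for illustration; the source does not describe how the processor forms groups or implements gradient skipping.

```python
import math


def group_norms(groups):
    """L2 norm of each weight group (each group is a flat list of floats)."""
    return [math.sqrt(sum(w * w for w in g)) for g in groups]


def group_lasso_penalty(groups, lam):
    """Standard group Lasso regularizer: lam * sum of per-group L2 norms."""
    return lam * sum(group_norms(groups))


def active_group_mask(groups, threshold):
    """Assumed pruning rule: a group stays active if its norm exceeds threshold."""
    return [n > threshold for n in group_norms(groups)]


def skip_pruned_gradients(grads, mask):
    """Assumed gradient skipping: gradients of inactive groups are zeroed
    (in hardware, they would simply not be computed)."""
    return [g if keep else [0.0] * len(g) for g, keep in zip(grads, mask)]


# Hypothetical example: three groups, e.g. three output channels.
weights = [[3.0, 4.0], [0.0, 0.0], [0.01, 0.0]]
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

mask = active_group_mask(weights, threshold=0.05)
sparse_grads = skip_pruned_gradients(grads, mask)
penalty = group_lasso_penalty(weights, lam=0.1)
```

Because the penalty acts on whole groups, it tends to drive entire groups toward zero together, which is what makes the resulting sparsity structured and easier for hardware to skip.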
Abstract
Training deep/convolutional neural networks
(DNNs/CNNs) requires a large amount of memory and iterative
computation, which necessitates speedup and energy reduction,
especially for edge devices with resource/energy constraints.
In this work, we present an 8-bit floating-point (FP8) training
processor that implements: 1) highly parallel tensor
cores (fused multiply–add trees) that maintain high utilization
throughout forward propagation (FP), backward propagation
(BP), and weight update (WU)