JSSC 2023, Issue 7 | Memory | 28 nm

A 28-nm 8-bit Floating-Point Tensor Core-Based Programmable CNN Training Processor

A 28 nm, 8-bit floating-point, tensor-core-based CNN training processor that improves energy efficiency and speed
28nm CMOS, 16.4 TFLOPS/W, 7.3× FLOPs reduction, 4.7× training latency speedup
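8-bit floating point | Tensor core | CNN training | Energy-efficiency optimization | Dynamic sparsity
Highly parallel tensor cores maintain high utilization
Hardware-efficient channel gating enables dynamic output-activation sparsity
Group-Lasso-based dynamic weight sparsity with gradient skipping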
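As a rough illustration of the group-Lasso idea named in the last highlight, the sketch below computes the standard group-Lasso penalty (the sum of per-group L2 norms) and applies the usual group soft-thresholding step, which sets whole groups to exactly zero. It is not the processor's actual algorithm: treating each group as one output channel's weights, and the function names, regularization strength, and threshold values, are all assumptions made for this example.

```python
import math


def group_norm(group):
    """L2 norm of one weight group (assumed here: one output channel)."""
    return math.sqrt(sum(w * w for w in group))


def group_lasso_penalty(groups, lam):
    """Standard group-Lasso penalty: lam * sum of per-group L2 norms."""
    return lam * sum(group_norm(g) for g in groups)


def group_soft_threshold(groups, threshold):
    """Proximal step for group Lasso.

    Each group's norm is shrunk by `threshold`; groups whose norm is
    at or below `threshold` become exactly zero (structured sparsity).
    """
    result = []
    for g in groups:
        norm = group_norm(g)
        if norm <= threshold:
            result.append([0.0] * len(g))
        else:
            scale = 1.0 - threshold / norm
            result.append([w * scale for w in g])
    return result


def active_groups(groups):
    """Indices of groups that still hold a nonzero weight."""
    return [i for i, g in enumerate(groups) if any(w != 0.0 for w in g)]


# Hypothetical toy weights: three groups of two weights each.
toy_groups = [[3.0, 4.0], [0.1, 0.2], [0.0, 1.0]]
pruned = group_soft_threshold(toy_groups, threshold=0.5)
print(active_groups(pruned))
```

In this toy case the middle group has a norm below the threshold and is zeroed as a whole, so only groups 0 and 2 remain active. A group that is entirely zero contributes nothing to its output, which is the kind of structure a sparsity-aware datapath could skip, though how the chip exploits it is not described in this excerpt.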
Abstract
Training deep/convolutional neural networks (DNNs/CNNs) requires a large amount of memory and iterative computation, which necessitates speedup and energy reduction, especially for edge devices with resource/energy constraints. In this work, we present an 8-bit floating-point (FP8) training processor that implements: 1) highly parallel tensor cores (fused multiply–add trees) that maintain high utilization throughout forward propagation (FP), backward propagation (BP), and weight update (WU)