JSSC 2020, Issue 4 | Digital Circuits | 16 nm

A 0.32–128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Inference Accelerator With Ground-Referenced Signaling in 16 nm

A scalable DNN accelerator based on a multi-chip module, built for energy-efficient inference

16 nm process, 1.29 TOPS/mm² area efficiency, 0.11 pJ/op energy efficiency, 127.8 TOPS peak performance

Deep neural networks | Multi-chip module | Energy-efficiency optimization | Scalable architecture | Inference acceleration

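▸Key contribution 1: Multi-chip-module (MCM) mesh network. A 36-chip mesh network on a single package allows flexible scaling and supports DNN inference workloads ranging from mobile devices to data centers, improving the system's scalability and adaptability.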

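▸Key contribution 2: Ground-referenced signaling (GRS). GRS is used for chip-to-chip communication, lowering signaling power and noise and improving overall communication efficiency and reliability.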

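▸Key contribution 3: Hierarchical network-on-chip and network-on-package. Combining distributed on-chip weight storage with a hierarchical network design minimizes communication energy and improves overall energy efficiency and performance.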

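▸Key contribution 4: Measured performance. In a 16 nm process the design achieves 1.29 TOPS/mm² area efficiency and 0.11 pJ/op energy efficiency; the 36-chip system reaches 127.8 TOPS peak performance and 1903 images/s on ResNet-50 inference.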
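To give a feel for what a 36-chip mesh implies for communication distance, the sketch below counts links and hop distances in a 2-D mesh. It assumes the 36 chips are laid out as a 6 × 6 grid and that a packet travels the Manhattan distance between chips; the text above only states "36 chips in a mesh", so the grid shape, the routing model, and the helper names `mesh_hops` and `mesh_stats` are illustrative assumptions, not details taken from the paper.

```python
from itertools import product


def mesh_hops(src, dst):
    """Hop count between two (row, col) chips, assuming Manhattan-distance routing."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def mesh_stats(rows, cols):
    """Summarize a rows x cols 2-D mesh: chip count, link count, hop distances."""
    nodes = list(product(range(rows), range(cols)))
    # Every ordered pair of distinct chips.
    hops = [mesh_hops(a, b) for a in nodes for b in nodes if a != b]
    return {
        "chips": len(nodes),
        # Horizontal links plus vertical links between neighbouring chips.
        "links": rows * (cols - 1) + cols * (rows - 1),
        "max_hops": max(hops),
        "avg_hops": sum(hops) / len(hops),
    }


# Assumed 6 x 6 arrangement of the 36 chips (hypothetical layout).
stats = mesh_stats(6, 6)
print(stats)
```

Under these assumptions the worst-case path is corner to corner, which is one reason keeping weights distributed on-chip and limiting cross-package traffic matters for communication energy.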

Abstract

Custom accelerators improve the energy efficiency, area efficiency, and performance of deep neural network (DNN) inference. This article presents a scalable DNN accelerator consisting of 36 chips connected in a mesh network on a multi-chip-module (MCM) using ground-referenced signaling (GRS). While previous accelerators fabricated on a single monolithic chip are optimal for specific network sizes, the proposed architecture enables flexible scaling for efficient inference on a wide range of DNNs, from mobile to data center domains. Communication energy is minimized with large on-chip distributed weight storage and a hierarchical network-on-chip and network-on-package, and inference energy is minimized through extensive data reuse. The 16-nm prototype achieves 1.29-TOPS/mm² area efficiency, 0.11 pJ/op (9.5 TOPS/W) energy efficiency, 4.01-TOPS peak performance for a one-chip system, and 127.8 peak TOPS and 1903 images/s ResNet-50 batch-1 inference for a 36-chip system.
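The headline figures in the abstract are related by simple unit arithmetic, which the following sketch works through. The function names are hypothetical. The per-image time assumes images are processed strictly one after another, which the abstract does not state.

```python
def tops_per_watt_to_pj_per_op(tops_per_watt):
    """Convert TOPS/W to pJ/op.

    1 TOPS/W = 1e12 ops per joule, i.e. 1e-12 J = 1 pJ per op,
    so the two quantities are reciprocals of each other.
    """
    return 1.0 / tops_per_watt


# 9.5 TOPS/W corresponds to about 0.105 pJ/op, quoted as 0.11 pJ/op.
energy_pj = tops_per_watt_to_pj_per_op(9.5)

# Average per-chip share of the 36-chip peak throughput.
per_chip_tops = 127.8 / 36

# Time per image implied by 1903 images/s at batch size 1
# (assumes no overlap between consecutive images).
time_per_image_ms = 1000.0 / 1903

print(f"{energy_pj:.3f} pJ/op, {per_chip_tops:.2f} TOPS/chip, "
      f"{time_per_image_ms:.3f} ms/image")
```

Note that 36 × 4.01 TOPS would be about 144 TOPS, so the 127.8 TOPS system peak corresponds to roughly 3.55 TOPS per chip rather than the 4.01 TOPS one-chip peak; the abstract does not give the operating conditions behind each figure.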