JSSC 2019, Issue 1 · Wireline I/O · 40 nm · SRAM · DRAM

QUEST: Multi-Purpose Log-Quantized DNN Inference Engine Stacked on 96-MB 3-D SRAM Using Inductive Coupling Technology in 40-nm CMOS

QUEST is a multi-purpose, log-quantized DNN inference engine built with 3-D stacking technology, offering high bandwidth and low latency.

7.49 TOPS (binary precision), 1.96 TOPS (4-bit precision), 300-MHz clock frequency

DNN inference engine · 3-D stacking · Log quantization · ThruChip interface · SRAM

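▸Uses 3-D stacking to provide 96 MB of SRAM

▸Supports programmable log-quantized bit precision for higher-accuracy DNN computation

▸Uses the ThruChip interface (TCI) for low-latency communication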

Abstract

QUEST is a programmable multiple instruction, multiple data (MIMD) parallel accelerator for general-purpose state-of-the-art deep neural networks (DNNs). It features die-to-die stacking with three-cycle latency, 28.8 GB/s, 96 MB, and eight SRAMs using an inductive coupling technology called the ThruChip interface (TCI). By stacking the SRAMs instead of DRAMs, lower memory access latency and simpler hardware are expected. This facilitates in balancing the memory capacity, latency, and bandwidth, all of which are in demand by cutting-edge DNNs at a high level. QUEST also introduces log-quantized programmable bit-precision processing for achieving faster (larger) DNN computation (size) in a 3-D module. It can sustain higher recognition accuracy at a lower bitwidth region compared to linear quantization. The prototype QUEST chip is integrated in the 40-nm CMOS technology, and it achieves 7.49 tera operations per second (TOPS) peak performance in binary precision, and 1.96 TOPS in 4-bit precision at 300-MHz clock.
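The abstract's contrast between log and linear quantization at low bit widths can be illustrated with a minimal Python sketch. Both quantizers below are generic textbook forms and are not taken from the QUEST design: the function names, the `max_abs` scale, the one-sign-bit layout, and the choice to reserve one code for zero are all assumptions made for illustration.

```python
import math


def linear_quantize(x, bits, max_abs=1.0):
    """Symmetric uniform quantizer (illustrative, not QUEST's format).

    Levels are evenly spaced between -max_abs and +max_abs.
    """
    levels = 2 ** (bits - 1) - 1          # positive levels, e.g. 7 for 4 bits
    step = max_abs / levels
    q = round(x / step)
    q = max(-levels, min(levels, q))      # clip to the representable range
    return q * step


def log_quantize(x, bits, max_abs=1.0):
    """Sign plus power-of-two magnitude (illustrative, not QUEST's format).

    Representable magnitudes are max_abs * 2**(-k) for k = 0 .. max_k.
    One magnitude code is assumed to be reserved for exact zero.
    """
    if x == 0.0:
        return 0.0
    max_k = 2 ** (bits - 1) - 2           # e.g. k = 0..6 for 4 bits
    k = round(-math.log2(abs(x) / max_abs))
    k = max(0, k)                         # saturate values above max_abs
    if k > max_k:
        return 0.0                        # too small to represent: flush to zero
    return math.copysign(max_abs * 2.0 ** -k, x)
```

Under these assumptions, at 4 bits `log_quantize(0.03, 4)` returns 0.03125, while `linear_quantize(0.03, 4)` rounds to 0.0 because the uniform step is 1/7. Small-magnitude values, which tend to dominate trained weights, keep a nonzero representation under the log scheme, which is one intuition for why accuracy may hold up better at low bit widths.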