JSSC 2023, Issue 4 | Digital Circuits | 5 nm

A 95.6-TOPS/W Deep Learning Inference Accelerator With Per-Vector Scaled 4-bit Quantization in 5 nm

Proposes a DNN accelerator for efficient transformer execution that uses per-vector scaled quantization to achieve energy-efficient inference.

5-nm process; 95.6 TOPS/W at 0.46 V; 1734 inferences/s/W (BERT-Base) and 4714 inferences/s/W (ResNet-50) at 0.67 V

Deep learning accelerator, Transformer, per-vector scaled quantization, energy-efficiency optimization, quantization-aware fine-tuning

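▸Innovation 1: Per-vector scaled quantization (VSQ) is a method-level innovation. Assigning an independent scale factor to each 64-element vector enables efficient 4-bit arithmetic, significantly reducing energy overhead while keeping accuracy loss low (<1%).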

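▸Innovation 2: The multilevel dataflow is a system-level innovation. By maximizing data reuse, it raises compute efficiency and lets the 5-nm prototype reach 95.6 TOPS/W at 0.46 V on a 4-bit benchmarking layer with VSQ.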

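▸Innovation 3: Quantization-aware fine-tuning is a method-level innovation. Targeted training compensates for quantization error, limiting accuracy loss to 0.7% on BERT-Base and 0.15% on ResNet-50 and addressing the accuracy collapse that conventional quantization causes on transformer models.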

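▸Innovation 4: Low-voltage operation (0.46 V to 0.67 V) is a circuit-level innovation. With near-threshold-voltage design, the prototype sustains an energy efficiency of 38.7 TOPS/W on BERT-Base at the nominal 0.67 V and pushes inference efficiency to the leading edge of comparable work (4714 inferences/s/W on ResNet-50).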
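The per-vector scaling idea listed above can be illustrated with a minimal Python sketch. It is not the accelerator's implementation: the symmetric signed range [-7, 7], the max-abs choice of scale, and every function name are assumptions made for illustration, and the scale factors are kept as plain floats because this summary does not say how the hardware stores them. Only the 64-element vector size and the 4-bit width come from the paper.

```python
from typing import List, Tuple

VECTOR_SIZE = 64  # elements sharing one scale factor (stated in the paper)
QMAX = 7          # assumption: symmetric signed 4-bit codes in [-7, 7]


def quantize_per_vector(values: List[float]) -> Tuple[List[List[int]], List[float]]:
    """Split `values` into 64-element vectors; quantize each with its own scale."""
    if len(values) % VECTOR_SIZE != 0:
        raise ValueError("length must be a multiple of VECTOR_SIZE")
    codes: List[List[int]] = []
    scales: List[float] = []
    for start in range(0, len(values), VECTOR_SIZE):
        vec = values[start:start + VECTOR_SIZE]
        max_abs = max(abs(v) for v in vec)
        # Assumption: max-abs calibration; an all-zero vector gets scale 1.0.
        scale = max_abs / QMAX if max_abs > 0 else 1.0
        codes.append([max(-QMAX, min(QMAX, round(v / scale))) for v in vec])
        scales.append(scale)
    return codes, scales


def dequantize(codes: List[List[int]], scales: List[float]) -> List[float]:
    """Map integer codes back to real values using each vector's scale."""
    return [q * s for vec, s in zip(codes, scales) for q in vec]


def vsq_dot(a_codes: List[List[int]], a_scales: List[float],
            w_codes: List[List[int]], w_scales: List[float]) -> float:
    """Dot product with integer MACs per vector and one rescale per vector pair."""
    total = 0.0
    for qa, sa, qw, sw in zip(a_codes, a_scales, w_codes, w_scales):
        int_acc = sum(x * y for x, y in zip(qa, qw))  # 4-bit x 4-bit products
        total += int_acc * sa * sw                    # apply both scale factors once
    return total


if __name__ == "__main__":
    # One small-magnitude vector followed by one large-magnitude vector.
    data = [0.001 * (i % 7 - 3) for i in range(64)] + [1.5 * (i % 9 - 4) for i in range(64)]
    codes, scales = quantize_per_vector(data)
    recon = dequantize(codes, scales)
    print("per-vector scales:", scales)
    print("max abs error:", max(abs(a - b) for a, b in zip(data, recon)))
```

Because each vector carries its own scale, the small-magnitude vector in the demo keeps its resolution instead of being rounded toward zero by a scale chosen for the large one, which is the intuition behind using 4-bit arithmetic with little accuracy loss.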

Abstract

The energy efficiency of deep neural network (DNN) inference can be improved with custom accelerators. DNN inference accelerators often employ specialized hardware techniques to improve energy efficiency, but many of these techniques result in catastrophic accuracy loss on transformer-based DNNs, which have become ubiquitous for natural language processing (NLP) tasks. This article presents a DNN accelerator designed for efficient execution of transformers. The proposed accelerator implements per-vector scaled quantization (VSQ), which employs an independent scale factor for each 64-element vector to enable the use of 4-bit arithmetic with little accuracy loss and low energy overhead. Using a multilevel dataflow to maximize reuse, the 5-nm prototype achieves 95.6 tera-operations per second per Watt (TOPS/W) at 0.46 V on a 4-bit benchmarking layer with VSQ. At a nominal voltage of 0.67 V, the accelerator achieves 1734 inferences/s/W (38.7 TOPS/W) with only 0.7% accuracy loss on BERT-Base and 4714 inferences/s/W (38.6 TOPS/W) with 0.15% accuracy loss on ResNet-50 by using quantization-aware fine-tuning to recover accuracy, demonstrating a practical accelerator design for energy-efficient DNN inference.
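The efficiency figures in the abstract can be related to each other with some unit arithmetic, sketched below in Python. The sketch assumes that each TOPS/W value and its paired inferences/s/W value describe the same measurement, so their ratio gives an implied operation count per inference; how the paper counts operations is not stated in this summary, and the function names are illustrative only.

```python
# Figures quoted in the abstract.
PEAK_TOPS_PER_W = 95.6  # 0.46 V, 4-bit benchmarking layer with VSQ
TOPS_PER_W = {"BERT-Base": 38.7, "ResNet-50": 38.6}              # at 0.67 V
INFERENCES_PER_S_PER_W = {"BERT-Base": 1734, "ResNet-50": 4714}  # at 0.67 V


def energy_per_op_fj(tops_per_w: float) -> float:
    """1 TOPS/W is 1e12 operations per joule, so invert and convert J to fJ."""
    return 1e15 / (tops_per_w * 1e12)


def energy_per_inference_mj(inf_per_s_per_w: float) -> float:
    """inferences/s/W equals inferences per joule, so invert and convert J to mJ."""
    return 1e3 / inf_per_s_per_w


def implied_gops_per_inference(tops_per_w: float, inf_per_s_per_w: float) -> float:
    """(ops per joule) / (inferences per joule), expressed in giga-operations."""
    return tops_per_w * 1e12 / inf_per_s_per_w / 1e9


if __name__ == "__main__":
    print(f"peak: {energy_per_op_fj(PEAK_TOPS_PER_W):.2f} fJ/op")
    for model in TOPS_PER_W:
        gops = implied_gops_per_inference(TOPS_PER_W[model], INFERENCES_PER_S_PER_W[model])
        mj = energy_per_inference_mj(INFERENCES_PER_S_PER_W[model])
        print(f"{model}: {gops:.1f} GOPs/inference, {mj:.3f} mJ/inference")
```

Under these assumptions the peak figure corresponds to roughly 10.5 fJ per operation, and the 0.67 V results imply about 22.3 giga-operations per BERT-Base inference and about 8.2 per ResNet-50 inference.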