JSSC 2023, Issue 4 · Digital Circuits · 5 nm
A 95.6-TOPS/W Deep Learning Inference Accelerator With Per-Vector Scaled 4-bit Qu
Proposes a DNN accelerator for efficient execution of transformers, using per-vector scaled quantization to achieve energy-efficient inference.
5 nm process; 0.46 V: 95.6 TOPS/W; 0.67 V: 1734 inferences/s/W (BERT-Base), 4714 inferences/s/W (ResNet-50)
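Deep learning accelerator · Transformer · Per-vector scaled quantization · Energy-efficiency optimization · Quantization-aware fine-tuning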
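▸Innovation 1: Per-vector scaled quantization (VSQ) is a method innovation. Assigning an independent scale factor to each 64-element vector lets the accelerator run 4-bit arithmetic efficiently, which significantly reduces energy cost while keeping accuracy loss low (<1%).
▸Innovation 2: The multi-level dataflow design is a system innovation. Optimized data reuse significantly improves compute efficiency, enabling the prototype to reach 95.6 TOPS/W in a 5 nm process.
▸Innovation 3: Quantization-aware fine-tuning is a method innovation. Targeted training compensates for quantization error, giving accuracy losses of only 0.7% on BERT-Base and 0.15% on ResNet-50, and addressing the accuracy collapse that conventional quantization causes in transformer models.
▸Innovation 4: Low-voltage operation (0.46 V to 0.67 V) is a circuit innovation. Near-threshold-voltage design sustains 38.7 TOPS/W while raising energy efficiency to the state of the art among comparable work (4714 inferences/s/W).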
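The per-vector scaling idea in Innovation 1 can be illustrated with a minimal software sketch. This is not the accelerator's implementation: the function names (`quantize_per_vector`, `dequantize_per_vector`, `demo`), the symmetric max-abs scaling rule, and the use of Python floats for the scale factors are all assumptions made for illustration. Only the 64-element vector size and the 4-bit width come from the summary above.

```python
import random


def quantize_per_vector(values, vector_size=64, bits=4):
    """Symmetric quantization with one scale factor per `vector_size` elements.

    Assumption: each scale is chosen so the largest magnitude in the vector
    maps to the largest signed code (7 for 4 bits).
    """
    qmax = 2 ** (bits - 1) - 1
    codes, scales = [], []
    for start in range(0, len(values), vector_size):
        vec = values[start:start + vector_size]
        max_abs = max(abs(v) for v in vec)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer code and clamp to the signed range.
        codes.extend(max(-qmax, min(qmax, round(v / scale))) for v in vec)
    return codes, scales


def dequantize_per_vector(codes, scales, vector_size=64):
    """Map integer codes back to real values using each vector's own scale."""
    return [c * scales[i // vector_size] for i, c in enumerate(codes)]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def demo(seed=0):
    """Compare per-vector scaling with a single scale for the whole tensor."""
    rng = random.Random(seed)
    # Two 64-element vectors with very different dynamic ranges.
    small = [rng.uniform(-0.1, 0.1) for _ in range(64)]
    large = [rng.uniform(-10.0, 10.0) for _ in range(64)]
    data = small + large

    codes, scales = quantize_per_vector(data, vector_size=64)
    err_vector = mean_abs_error(data, dequantize_per_vector(codes, scales, 64))

    # A single scale is the same routine with one vector spanning all the data.
    n = len(data)
    codes_t, scales_t = quantize_per_vector(data, vector_size=n)
    err_tensor = mean_abs_error(data, dequantize_per_vector(codes_t, scales_t, n))
    return err_vector, err_tensor


if __name__ == "__main__":
    err_vector, err_tensor = demo()
    print("per-vector error:", err_vector)
    print("single-scale error:", err_tensor)
```

With a single scale, the small-magnitude vector collapses to zero codes because the scale is set by the large vector. With one scale per 64 elements, each vector uses the full 4-bit range, which is the intuition behind the low accuracy loss reported for VSQ.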
Abstract
The energy efficiency of deep neural network (DNN) inference can be improved with custom accelerators. DNN inference accelerators often employ specialized hardware techniques to improve energy efficiency, but many of these techniques result in catastrophic accuracy loss on transformer-based DNNs, which have become ubiquitous for natural language processing (NLP) tasks. This article presents a DNN accelerator designed for efficient execution of transformers. The proposed accelerator
implements per