JSSC 2023, Issue 1, Digital Circuits

An Energy-Efficient Transformer Processor Exploiting Dynamic Weak Relevances in Global Attention

Yang Wang, Yubin Qin, Dazheng Deng, Jingchuan Wei, Yang Zhou, Yuanqi Fan, Tianbao Chen, Hao Sun

Proposes an energy-efficient Transformer processor that lowers computation energy by dynamically handling weak relevances.

Not explicitly stated (consult the full paper for specific metrics).

Transformer processor, energy-efficiency optimization, dynamic weak relevance, approximate computing, hardware acceleration

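▸ A big-exact-small-approximate processing element (PE) adaptively computes weakly related tokens

▸ A bidirectional asymptotical speculation unit eliminates redundant zero-attention computations

▸ A dedicated hardware architecture optimized for the global attention mechanism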
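The big-exact-small-approximate idea in the first highlight can be illustrated with a small software sketch. This is not the paper's circuit, which this summary does not describe. The function names, the magnitude threshold, and the choice of power-of-two rounding as the "approximate" path are all assumptions made for illustration only.

```python
def round_to_pow2(x: int) -> int:
    """Round a non-negative integer to the nearest power of two (0 stays 0)."""
    if x == 0:
        return 0
    low = 1 << (x.bit_length() - 1)  # largest power of two <= x
    high = low << 1                  # smallest power of two > x
    return low if (x - low) < (high - x) else high


def big_exact_small_approx_mac(weights, activations, threshold=8):
    """Hypothetical MAC: exact products for large activations, cheap ones for small.

    ASSUMPTION: "small" means |activation| < threshold, and the approximate
    path replaces the activation magnitude with its nearest power of two, so
    the product would reduce to a shift in hardware. Neither choice is taken
    from the paper.
    Returns (accumulated sum, number of products taken on the approximate path).
    """
    acc = 0
    approx_count = 0
    for w, a in zip(weights, activations):
        if abs(a) < threshold:
            # Small value: approximate path.
            mag = round_to_pow2(abs(a))
            prod = w * mag if a >= 0 else -(w * mag)
            approx_count += 1
        else:
            # Large value: exact path.
            prod = w * a
        acc += prod
    return acc, approx_count


if __name__ == "__main__":
    w = [3, -2, 5, 7]
    a = [100, 3, -6, 0]
    print(big_exact_small_approx_mac(w, a))               # mixed exact/approximate
    print(big_exact_small_approx_mac(w, a, threshold=0))  # threshold 0: all exact
```

Setting `threshold=0` sends every product down the exact path, which gives a reference value for measuring the error introduced by the approximate path on a given input.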

Abstract

Transformer-based models achieve tremendous success in many artificial intelligence (AI) tasks, outperforming conventional convolutional neural networks (CNNs) from natural language processing (NLP) to computer vision (CV). Their success relies on the self-attention mechanism, which provides a global receptive field rather than the local one of CNNs. Despite its superiority, global-level self-attention consumes ∼100× more operations than CNNs and cannot be handled effectively by existing CNN processors because its operations are distinct. This creates an urgent need for a dedicated Transformer processor. However, global self-attention involves a massive number of naturally occurring weakly related tokens (WR-Tokens), owing to the redundant content in human languages or images. These WR-Tokens generate zero and near-zero attention results that introduce an energy consumption bottleneck, redundant computations, and hardware under-utilization, making it challenging to achieve energy-efficient self-attention computing. This article proposes a Transformer processor that handles the WR-Tokens effectively to solve these challenges. First, a big-exact-small-approximate processing element (PE) reduces multiply-and-accumulate (MAC) energy for WR-Tokens by adaptively computing the small values approximately while computing the large values exactly. Second, a bidirectional asymptotical speculation unit captures and removes redundant computations of zero attention outputs by exploiting the loca