JSSC 2023, Issue 6 · Memory · Not specified · CIM

TranCIM: Full-Digital Bitline-Transpose CIM-Based Sparse Transformer Accelerator With Pipeline/Parallel Reconfigurable Modes

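Proposes TranCIM, a Transformer accelerator based on digital computing-in-memory that supports dynamic sparse attention computation and significantly reduces energy consumption.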

15.59 µJ/Token (BERT-base model); energy efficiency improved by 12.08×–36.82× over existing designs

Computing-in-memory · Transformer accelerator · Attention mechanism · Digital circuits · Sparse computation

▸ Uses a bitline-transpose computing-in-memory architecture to support dynamic matrix multiplication

▸ Proposes reconfigurable pipeline/parallel modes to suit different computation requirements

▸ Designs a sparse attention scheduler to reduce redundant computation

Abstract

Transformer models achieve excellent results in fields such as natural language processing, computer vision, and bioinformatics. Their large numbers of matrix multiplications (MMs) lead to substantial data movement and computation. Although computing-in-memory (CIM) has proven to be an efficient architecture for MM computation, the transformer's attention mechanism raises new challenges in memory access and computation: the dynamic MMs in attention layers cause redundant off-chip memory access, and attention layers dominate the transformer's computation and require high precision. Thus, we design TranCIM, a bitline-transpose CIM-based transformer accelerator with pipeline/parallel reconfigurable modes. The pipeline mode alleviates off-chip access for attention layers. The parallel mode is used by fully-connected (FC) layers for high parallelism. The full-digital CIM supports INT16 for attention layers and INT8 for FC layers, without analog CIM's nonideal issues. Moreover, a sparse attention scheduler (SAS) is proposed to reduce attention computation. The fabricated TranCIM chip consumes only 15.59 µJ/Token for the bidirectional encoder representations from transformers (BERT)-base model, achieving 12.08×–36.82× lower energy than prior CIM-based accelerators.
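To make the term "dynamic MM" concrete, the following pure-Python sketch separates the static multiplications of an attention layer (inputs times fixed weights) from the dynamic ones (Q·Kᵀ and the product with V, where both operands are produced at runtime and K must be transposed). It is a minimal illustration of standard single-head attention, not the TranCIM datapath. The function names and the `keep` parameter are hypothetical, and the top-k pruning shown here is a generic form of sparse attention assumed for illustration; the abstract does not describe how the SAS selects entries.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one row; -inf entries map to 0."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, w_q, w_k, w_v, keep=None):
    """Single-head attention. `keep` (hypothetical) enables top-k pruning per row."""
    # Static MMs: the weight matrices are fixed before inference starts.
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)

    # Dynamic MM 1: Q times K^T. Both operands were just computed,
    # and K has to be transposed before it can be used.
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]

    if keep is not None:
        # Assumed pruning rule (not from the paper): keep scores that reach the
        # k-th largest value in each row. Ties can keep more than `keep` entries.
        pruned = []
        for row in scores:
            threshold = sorted(row, reverse=True)[keep - 1]
            pruned.append([s if s >= threshold else -math.inf for s in row])
        scores = pruned

    probs = [softmax(row) for row in scores]

    # Dynamic MM 2: attention probabilities times V, again runtime data only.
    return matmul(probs, v), probs


# Usage: three tokens with two features each, identity projection weights.
tokens = [[1, 0], [0, 2], [3, 1]]
identity = [[1, 0], [0, 1]]
dense_output, dense_probs = attention(tokens, identity, identity, identity)
sparse_output, sparse_probs = attention(tokens, identity, identity, identity, keep=1)
```

In the sparse call, every pruned score contributes a zero probability, so the corresponding multiply-accumulate operations in the second dynamic MM could be skipped. That is the kind of saving a sparse attention scheduler targets, under the assumptions stated above.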