JSSC 2025, Issue 3: Memory, 28nm, SRAM, CIM

TT@CIM: A Tensor-Train In-Memory-Computing Processor Using Bit-Level-Sparsity Optimization and Variable Precision Quantization
Ruiqi Guo, Zhiheng Yue, Xin Si, Hao Li, Te Hu, Limei Tang, Yabing Wang, Hao Sun

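Proposes the TT@CIM processor, which uses tensor decomposition and bit-level-sparsity optimization to improve the energy efficiency of computing-in-memory.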

28nm CMOS, peak energy efficiency 5.99-691.13 TOPS/W

Computing-in-memory, tensor decomposition, bit-level sparsity, energy-efficiency optimization, quantization

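▸ Innovation 1: TTD-CIM-matched dataflow (system-level innovation) - A dataflow designed specifically to match tensor-train decomposition (TTD) maximizes utilization of the CIM memory and reduces additional MAC operations, improving computing efficiency and supporting efficient processing of 4/8-bit decomposed DNNs.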

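▸ Innovation 2: Bit-level-sparsity-encoded CIM macro (circuit-level innovation) - A CIM macro with a high bit-level-sparsity encoding scheme lowers the power consumed by a single MAC operation by cutting redundant computation, contributing to a peak energy efficiency of 5.99-691.13 TOPS/W.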

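▸ Innovation 3: Variable precision quantization (method-level innovation) - A lookup table (LUT)-based quantization unit, combined with dynamically adjusted quantization precision, improves the performance and energy efficiency of QuantOps and addresses the quantization bottleneck introduced by TTD.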

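▸ Innovation 4: Tensor-decomposition compression (method-level innovation) - The TTD method compresses the entire DNN model to fit within the CIM-SRAM capacity, eliminating the off-chip communication bottleneck, achieving full-model on-chip storage for the first time, and overcoming the storage limit of conventional CIM.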
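To make the storage argument in Innovation 4 concrete, the sketch below counts the weights of one dense layer and of the same layer in tensor-train (TT-matrix) form. It uses the standard TT-matrix layout, in which core k has shape (r_k, m_k, n_k, r_{k+1}) with boundary ranks equal to 1. The function names, the mode factorization, and the ranks are all illustrative assumptions and are not taken from the paper.

```python
from math import prod


def dense_param_count(in_modes, out_modes):
    """Weights in the original dense layer: prod(in_modes) x prod(out_modes)."""
    return prod(in_modes) * prod(out_modes)


def tt_param_count(in_modes, out_modes, ranks):
    """Weights stored across all TT cores.

    Core k has shape (ranks[k], in_modes[k], out_modes[k], ranks[k + 1]);
    the boundary ranks ranks[0] and ranks[-1] must both be 1.
    """
    if len(in_modes) != len(out_modes) or len(ranks) != len(in_modes) + 1:
        raise ValueError("need one (in, out) mode pair per core and len(modes) + 1 ranks")
    if ranks[0] != 1 or ranks[-1] != 1:
        raise ValueError("boundary TT ranks must be 1")
    return sum(
        ranks[k] * in_modes[k] * out_modes[k] * ranks[k + 1]
        for k in range(len(in_modes))
    )


# Hypothetical 1024 x 1024 layer, factorized as 4*8*8*4 on both sides,
# with all internal TT ranks set to 8 (assumed values, not from the paper).
in_modes = (4, 8, 8, 4)
out_modes = (4, 8, 8, 4)
ranks = (1, 8, 8, 8, 1)

dense = dense_param_count(in_modes, out_modes)
tt = tt_param_count(in_modes, out_modes, ranks)
print(f"dense weights: {dense}")               # 1048576
print(f"TT weights:    {tt}")                  # 8448
print(f"compression:   {dense / tt:.1f}x")     # 124.1x
```

Under these assumed modes and ranks the layer shrinks by roughly two orders of magnitude, which is the kind of reduction that can let a whole model fit in on-chip SRAM. The price is that a single large matrix multiplication becomes a chain of small core-by-core multiplications, which is the inefficiency the other three innovations target.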

Abstract

Computing-in-memory (CIM) is an attractive approach for energy-efficient deep neural network (DNN) processing, especially for low-power edge devices. However, today’s typical DNNs usually exceed CIM-static random access memory (SRAM) capacity. The introduced off-chip communication covers up the benefits of CIM technique, meaning that CIM processors still encounter the memory bottleneck. To eliminate this bottleneck, we propose a CIM processor, called TT@CIM, which applies the tensor-train decomposition (TTD) method to compress the entire DNN to fit within CIM-SRAM. However, the cost of storage reduction by TTD is to introduce multiple serial small-size matrix multiplications, resulting in massive inefficient multiply-and-accumulate (MAC) and quantization operations (QuantOps). To achieve high energy efficiency, three optimization techniques are proposed in TT@CIM. First, TTD-CIM-matched dataflow is proposed to maximize CIM utilization and minimize additional MAC operations. Second, a bit-level-sparsity-optimized CIM macro with high bit-level-sparsity encoding scheme is designed to reduce the power consumption of one MAC operation. Third, a variable precision quantization method and a lookup table-based quantization unit are presented to improve the performance and energy efficiency of QuantOp. Fabricated in 28-nm CMOS and tested on 4/8-bit decomposed DNNs, TT@CIM achieves 5.99-to-691.13-TOPS/W peak energy efficiency depending on the operating voltage.

Manuscript received 15 Jan