JSSC 2025, Issue 3 · Memory · CIM

T-PIM: An Energy-Efficient Processing-in-Memory Accelerator for End-to-End On-Device Training

T-PIM is an energy-efficient processing-in-memory accelerator that supports end-to-end on-device training.

None

Processing-in-memory · On-device training · Energy efficiency · Neural network accelerator

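▸ Innovation 1: End-to-end on-device training (system innovation). T-PIM is the first PIM architecture to implement the complete training flow, covering forward propagation, backward propagation, and weight update. This addresses the inability of earlier PIM designs to handle the computational complexity of training, and the chip runs full training of the VGG16 model.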

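▸ Innovation 2: Full-custom 8T-SRAM PIM macro for efficient AND operations (circuit innovation). 8T-SRAM cells perform the logical AND in place inside the cell, and a bit-serial compute architecture supports variable input precision from 1 to 8 bits, reaching an energy efficiency of up to 161.08 TOPS/W.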

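▸ Innovation 3: Configurable arithmetic units for faster inference (method innovation). Dynamically configured arithmetic units support power-of-two (2^n-bit) precision for weight data, optimizing the compute path during inference and bringing inference energy efficiency on the CIFAR10 dataset to 7.59 TOPS/W.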

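▸ Innovation 4: Sparsity-aware computation (system innovation). Zero skipping and compute-unit gating exploit sparsity in both the input data and the weights, cutting the power spent on ineffectual computation and improving overall energy efficiency by more than 30%.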
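The bit-serial AND scheme and the zero skipping described in the points above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the T-PIM circuit: the function name `bit_serial_dot`, the treatment of inputs as unsigned integers, and the choice to skip at the granularity of a whole all-zero input bit-plane are all hypothetical choices made for illustration.

```python
def bit_serial_dot(inputs, weights, input_bits=8):
    """Dot product computed one input bit-plane at a time.

    Illustrative model only (assumed, not taken from the paper):
    - inputs are unsigned integers below 2**input_bits,
    - weights are plain Python integers,
    - a bit-plane whose bits are all zero is skipped entirely.
    Returns (dot_product, number_of_skipped_bit_planes).
    """
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    if not 1 <= input_bits <= 8:
        raise ValueError("input_bits must be between 1 and 8")
    if any(x < 0 or x >= (1 << input_bits) for x in inputs):
        raise ValueError("each input must fit in input_bits unsigned bits")

    acc = 0
    skipped = 0
    for b in range(input_bits):
        # Extract bit b of every input: one serial step of the computation.
        plane = [(x >> b) & 1 for x in inputs]
        if not any(plane):
            # Zero skipping: nothing to accumulate for this bit-plane.
            skipped += 1
            continue
        # AND each weight word with the 1-bit input. -bit is 0 or -1
        # (all ones in two's complement), so w & -bit yields 0 or w.
        partial = sum(w & -bit for bit, w in zip(plane, weights))
        # Shift-and-add: weight the partial sum by the bit position.
        acc += partial << b
    return acc, skipped


result, skipped = bit_serial_dot([3, 0, 5, 2], [4, 7, 1, 6], input_bits=8)
print(result, skipped)  # prints "29 5"
```

With 8-bit inputs whose values all fit in 3 bits, the five upper bit-planes are all zero and are skipped; calling the same function with `input_bits=3` gives the same dot product with no skipped planes, which shows how a variable input precision shortens the serial loop.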

Abstract

Recently, on-device training has become crucial for the success of edge intelligence. However, frequent data movement between computing units and memory during training has been a major problem for battery-powered edge devices. Processing-in-memory (PIM) is a novel computing paradigm that merges computing logic into memory, which can address the data movement problem with excellent power efficiency. However, previous PIM accelerators cannot support the entire training process on chip due to its computing complexity. This article presents a PIM accelerator for end-to-end on-device training (T-PIM), the first PIM realization that enables end-to-end on-device training as well as high-speed inference. Its full-custom PIM macro contains 8T-SRAM cells to perform the energy-efficient in-cell AND operation and the bit-serial-based computation logic enables fully variable bit-precision for input data. The macro supports various data mapping methods and computational paths for both fully connected and convolutional layers, in order to handle the complex training process. An efficient tiling scheme is also proposed to enable T-PIM to compute any size of deep neural network with the implemented hardware. In addition, configurable arithmetic units in a forward propagation path make T-PIM handle power-of-two bit-precision for weight data, enabling a significant performance boost during inference. Finally, T-PIM efficiently handles sparsity in both operands by skipping the computation of zeros i