JSSC 2023, Issue 1 | Memory | CIM

ReDCIM: Reconfigurable Digital Computing-In-Memory Processor With Unified FP/INT Pipeline for Cloud AI Acceleration

Proposes the ReDCIM architecture for efficient, high-accuracy, and highly flexible cloud AI acceleration.

29.2 TFLOPS/W at BF16, 36.5 TOPS/W at INT8

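Cloud AI acceleration, computing-in-memory, floating-point/integer multiply-accumulate, reconfigurable architecture, Booth multiplication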

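▸ Innovation 1: Unified FP/INT pipeline architecture (system-level innovation). A unified FP/INT compute path lets hardware resources be reused dynamically and supports mixed-precision computation such as BF16 and INT8. This addresses the inability of conventional CIM processors to support both high-precision floating-point and efficient integer computation. Measured energy efficiency reaches 29.2 TFLOPS/W (BF16) and 36.5 TOPS/W (INT8).

▸ Innovation 2: Exponent pre-alignment and reconfigurable in-memory accumulation (circuit-level innovation). Dynamic exponent pre-alignment removes redundant shift operations in floating-point computation, and a reconfigurable in-memory accumulator switches between FP and INT modes with zero overhead, reducing MAC latency by more than 40%.

▸ Innovation 3: Bitwise in-memory Booth multiplication (method-level innovation). An improved Booth encoding algorithm is integrated into the CIM units and decomposes multiplication into bit-level parallel computation. This cuts memory access overhead by more than 50% and supports dynamic sign-bit extension.

▸ Innovation 4: Mixed-precision in-memory computing architecture (system-level innovation). Configurable memory banks and a data rearrangement network allow FP and INT data to be stored together and processed in parallel within the same memory array, improving memory bandwidth utilization by 2.1×.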
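To make the Booth multiplication in Innovation 3 concrete, here is a minimal software sketch of textbook radix-4 Booth recoding for INT8 operands. It only illustrates why Booth encoding reduces work (an 8-bit multiplier yields 4 partial products instead of 8). The radix, the function names `booth_radix4_digits` and `booth_multiply`, and the sequential accumulation are assumptions made for illustration; the notes above do not specify how ReDCIM maps the encoding onto its memory array.

```python
def booth_radix4_digits(multiplier, bits=8):
    """Recode a two's-complement multiplier into radix-4 Booth digits.

    Returns digits in {-2, -1, 0, 1, 2}, least significant digit first,
    such that multiplier == sum(d * 4**k for k, d in enumerate(digits)).
    """
    assert bits % 2 == 0, "radix-4 recoding needs an even bit width"
    assert -(1 << (bits - 1)) <= multiplier < (1 << (bits - 1))
    pattern = multiplier & ((1 << bits) - 1)  # two's-complement bit pattern
    digits = []
    prev = 0  # implicit bit to the right of the LSB
    for i in range(0, bits, 2):
        b0 = (pattern >> i) & 1
        b1 = (pattern >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def booth_multiply(multiplicand, multiplier, bits=8):
    """Multiply by summing one shifted partial product per Booth digit."""
    acc = 0
    for k, digit in enumerate(booth_radix4_digits(multiplier, bits)):
        # Each digit selects 0, +/-multiplicand, or +/-2*multiplicand.
        acc += (digit * multiplicand) << (2 * k)
    return acc
```

For example, `booth_radix4_digits(-3)` gives `[1, -1, 0, 0]`, that is 1 - 4 = -3, so two of the four partial products are zero and can be skipped.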

Abstract

Cloud AI acceleration has drawn great attention in recent years, as big models are becoming a popular trend in deep learning. Cloud AI runs high-efficiency inference, high-accuracy inference and training, in demand of flexible floating-point (FP)/integer (INT) multiply–accumulation (MAC) support. Many computing-in-memory (CIM) processors have been proposed for efficient AI acceleration. They usually rely on analog CIM techniques that are only suitable for high-efficiency neural network (NN) inference with low-precision INT MAC support. Since cloud AI demands high efficiency, high accuracy, and high flexibility simultaneously, we propose an innovative architecture, reconfigurable digital CIM (ReDCIM), that meets all three requirements. We design the first CIM-based cloud AI processor, ReDCIM, which constructs a unified FP/INT pipeline architecture based on exponent pre-alignment and reconfigurable in-memory accumulation. Bitwise in-memory Booth multiplication is proposed to reduce computation on CIM. The fabricated ReDCIM chip achieves a state-of-the-art energy efficiency of 29.2 TFLOPS/W at BF16 and 36.5 TOPS/W at INT8.
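The exponent pre-alignment mentioned in the abstract can be illustrated with a small software model. The general idea is to shift every mantissa in a group to the group's largest exponent once, up front, so that the multiply-accumulate itself runs on plain integers and can share an INT datapath. The sketch below is a generic model of that idea, not the chip's circuit: the 8-bit mantissa width (matching BF16's 8 significant bits), truncation toward zero on alignment, and the helper names `decompose`, `prealign`, and `prealigned_dot` are all assumptions.

```python
import math

MANT_BITS = 8  # assumption: 8 significant bits, as in BF16 (hidden 1 + 7 stored)


def decompose(x, mant_bits=MANT_BITS):
    """Return (mantissa, exponent) with x ~= mantissa * 2**exponent."""
    if x == 0.0:
        return 0, 0
    frac, exp = math.frexp(x)  # x == frac * 2**exp, 0.5 <= |frac| < 1
    return int(round(frac * (1 << mant_bits))), exp - mant_bits


def prealign(values, mant_bits=MANT_BITS):
    """Shift all mantissas to the largest exponent in the group."""
    parts = [decompose(v, mant_bits) for v in values]
    exps = [e for m, e in parts if m != 0]
    if not exps:
        return [0] * len(parts), 0
    shared = max(exps)
    aligned = []
    for m, e in parts:
        if m == 0:
            aligned.append(0)
            continue
        mag = abs(m) >> (shared - e)  # truncate toward zero (assumption)
        aligned.append(mag if m > 0 else -mag)
    return aligned, shared


def prealigned_dot(weights, inputs, mant_bits=MANT_BITS):
    """Dot product using integer MAC on pre-aligned mantissas."""
    w_m, w_e = prealign(weights, mant_bits)
    x_m, x_e = prealign(inputs, mant_bits)
    acc = sum(w * x for w, x in zip(w_m, x_m))  # pure integer accumulation
    return math.ldexp(float(acc), w_e + x_e)    # apply shared exponents once
```

With `weights = [1.0, 0.5, -0.25]` and `inputs = [2.0, 4.0, 8.0]`, `prealign(weights)` returns `([128, 64, -32], -7)` and `prealigned_dot(weights, inputs)` returns `2.0`, matching the exact result. When exponents within a group differ widely, the right shifts discard low-order mantissa bits, which is the accuracy cost this kind of scheme trades for an integer-only accumulation path.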