JSSC 2023, Issue 1 · Memory · CIM
ReDCIM: Reconfigurable Digital Computing-In-Memory Processor With Unified FP/INT Pi
Proposes the ReDCIM architecture for efficient, high-accuracy, and highly flexible cloud AI acceleration.
29.2 TFLOPS/W at BF16, 36.5 TOPS/W at INT8
Cloud AI acceleration · Computing-in-memory · FP/INT multiply-accumulate · Reconfigurable architecture · Booth multiplication
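▸ Innovation 1: Unified FP/INT pipeline architecture (system innovation). A single FP/INT compute path lets hardware resources be reused dynamically and supports mixed-precision computation such as BF16 and INT8. This addresses the inability of conventional CIM processors to support both high-precision floating-point and efficient integer computation. Measured energy efficiency reaches 29.2 TFLOPS/W (BF16) and 36.5 TOPS/W (INT8).
▸ Innovation 2: Exponent pre-alignment and reconfigurable in-memory accumulation (circuit innovation). Dynamic exponent pre-alignment removes redundant shift operations from floating-point computation. Combined with a reconfigurable in-memory accumulator, it allows zero-overhead switching between FP and INT modes and reduces MAC latency by more than 40%.
▸ Innovation 3: Bitwise in-memory Booth multiplication (method innovation). An improved Booth encoding scheme is integrated into the CIM cells, decomposing each multiplication into bit-level parallel operations. This cuts memory-access overhead by more than 50% and supports dynamic sign-bit extension.
▸ Innovation 4: Mixed-precision in-memory computing architecture (system innovation). A configurable bank structure and a data-rearrangement network allow FP and INT data to be stored together and processed in parallel within the same memory array, improving memory bandwidth utilization by 2.1×.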
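To make the pre-alignment and Booth ideas concrete, here is a minimal behavioral sketch in Python. It is not the paper's circuit: the function names (`prealign`, `booth_radix4_digits`, `booth_multiply`, `int_dot`, `fp_dot`), the radix-4 recoding, the 8-bit mantissa width, and the choice to align each operand vector to its own maximum exponent are all assumptions made for illustration. What it shows is how floating-point operands, once aligned to a shared exponent, can reuse the same integer Booth MAC path that serves INT8 data.

```python
import math

def prealign(values, mant_bits=8):
    """Align floats to their shared maximum exponent (illustrative model).

    Returns (mants, shared_exp) with values[i] ~= mants[i] * 2**(shared_exp - mant_bits).
    """
    # math.frexp(v) returns (m, e) with v = m * 2**e and 0.5 <= |m| < 1.
    exps = [math.frexp(v)[1] for v in values if v != 0.0]
    shared_exp = max(exps) if exps else 0
    scale = 2.0 ** (mant_bits - shared_exp)
    # Smaller operands lose low-order bits here; that is the cost of alignment.
    mants = [int(round(v * scale)) for v in values]
    return mants, shared_exp

def booth_radix4_digits(y, bits=8):
    """Recode a two's-complement integer into radix-4 Booth digits in {-2..2}."""
    assert bits % 2 == 0 and -(1 << (bits - 1)) <= y < (1 << (bits - 1))
    u = y & ((1 << bits) - 1)   # two's-complement bit pattern of y
    prev = 0                    # implicit bit below the LSB
    digits = []
    for i in range(0, bits, 2):
        b0 = (u >> i) & 1
        b1 = (u >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)  # digit for weight 4**(i // 2)
        prev = b1
    return digits

def booth_multiply(x, y, bits=8):
    """Sum the Booth partial products: digit * x, shifted by two bits per digit."""
    return sum((d * x) << (2 * k)
               for k, d in enumerate(booth_radix4_digits(y, bits)))

def int_dot(xs, ws, bits=8):
    """Integer MAC path, shared by the INT and FP modes in this sketch."""
    return sum(booth_multiply(x, w, bits) for x, w in zip(xs, ws))

def fp_dot(xs, ws, mant_bits=8):
    """FP dot product: pre-align both vectors, then reuse the integer MAC."""
    mx, ex = prealign(xs, mant_bits)
    mw, ew = prealign(ws, mant_bits)
    # Aligned mantissas can reach +/- 2**mant_bits, so widen to an even bit count.
    bits = mant_bits + 2 + (mant_bits & 1)
    acc = int_dot(mx, mw, bits=bits)
    # One exponent fix-up at the end replaces per-product shifting.
    return math.ldexp(float(acc), ex + ew - 2 * mant_bits)

if __name__ == "__main__":
    print(int_dot([1, 2, 3], [4, -5, 6]))                # INT mode
    print(fp_dot([1.5, -0.25, 3.0], [2.0, 4.0, 0.5]))    # FP mode
```

In this model the only FP-specific work is the alignment before the MAC and a single exponent adjustment after it, which is one plausible reading of how a unified FP/INT path avoids per-product shifts. The precision lost when small operands are aligned to a large shared exponent depends on the mantissa width, which is a free parameter of the sketch.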
Abstract
Cloud AI acceleration has drawn great attention in recent years, as big models are becoming a popular trend in deep learning. Cloud AI runs high-efficiency inference, high-accuracy inference and training, in demand of flexible floating-point (FP)/integer (INT) multiply–accumulation (MAC) support. Many computing-in-memory (CIM) processors have been proposed for efficient AI acceleration. They usually rely on analog CIM techniques that are only suitable for high-efficiency neural network (NN) infer