JSSC 2024, Issue 9 · Memory · 28 nm · SRAM · CIM

A 28-nm 64-kb 31.6-TFLOPS/W Digital-Domain Floating-Point-Computing-Unit and Double-Bit 6T-SRAM Computing-in-Memory Macro for Floating-Point CNNs

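Proposes a floating-point computing-in-memory macro based on double-bit 6T SRAM that supports energy-efficient floating-point multiply-accumulate operations.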

Energy efficiency of 31.6 TFLOPS/W and area efficiency of 2.05 TFLOPS/mm²; supports the BF16 format.

Computing-in-memory · Floating-point arithmetic · SRAM · AI acceleration · Energy-efficiency optimization

▸ Uses double-bitcells (DBcells) and floating-point computing units (FCUs) to improve throughput

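▸ Designs a high-bit full-precision multiply cell (HFMC) and a low-bit approximate-calculation multiply cell (LAMC) to reduce internal bandwidth and area cost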

▸ A new CIM macro architecture that supports both floating-point (FP-MAC) and integer (INT-MAC) multiply-accumulate operations

▸ Proposes the ShareFloatv2 data type to map floating-point numbers onto the CIM array

▸ A LUT-based TensorFlow training method to improve inference accuracy

Abstract

With the rapid advancement of artificial intelligence (AI), computing-in-memory (CIM) structure is proposed to improve energy efficiency (EF). However, previous CIMs often rely on INT8 data types, which pose challenges when addressing more complex networks, larger datasets, and increasingly intricate tasks. This work presents a double-bit 6T static random-access memory (SRAM)-based floating-point CIM macro using: 1) a cell array with double-bitcells (DBcells) and floating-point computing units (FCUs) to improve throughput without the sacrifice of inference accuracy; 2) an FCU with high-bit full-precision multiply cell (HFMC) and low-bit approximate-calculation multiply cell (LAMC) to reduce internal bandwidth and area cost; 3) a CIM macro architecture with FP processing circuits to support both floating-point MAC (FP-MAC) and integer (INT)-multiplication and accumulation (MAC); 4) a new ShareFloatv2 data type to map floating point in CIM array; and 5) a lookup table (LUT)-based TensorFlow training method to improve inference accuracy. A fabricated 28-nm 64-kb digital-domain SRAM-CIM macro achieved the best EF (31.6 TFLOPS/W) and the highest area efficiency (2.05 TFLOPS/mm²) for FP-MAC with Brain Float16 (BF16) IN/W/OUT on three AI tasks: classification@CIFAR100, detection@COCO, and segmentation@VOC2012.
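
The abstract reports results for BF16 inputs, weights, and outputs. As background, the sketch below shows the standard BF16 layout (1 sign bit, 8 exponent bits, 7 mantissa bits, that is, the upper half of an IEEE 754 float32). It only illustrates the number format; it is not the macro's datapath and says nothing about ShareFloatv2, which the abstract does not define. The function names are hypothetical, and inputs are assumed to fit in float32.

```python
import struct


def float32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern for x (round-to-nearest-even)."""
    # Reinterpret x as its 32-bit IEEE 754 single-precision pattern.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # NaN: keep the top 16 bits and force a mantissa bit so it stays NaN.
    if (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0:
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Round to nearest even on the 16 bits that are dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_fields(b: int) -> tuple[int, int, int]:
    """Split a BF16 pattern into (sign, biased exponent, mantissa)."""
    return b >> 15, (b >> 7) & 0xFF, b & 0x7F


def bf16_bits_to_float(b: int) -> float:
    """Widen a BF16 pattern back to a Python float (exact, no rounding)."""
    (x,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return x


if __name__ == "__main__":
    for value in (1.0, -2.5, 3.14159):
        b = float32_to_bf16_bits(value)
        print(f"{value!r}: bits=0x{b:04X} fields={bf16_fields(b)} "
              f"back={bf16_bits_to_float(b)!r}")
```

Because BF16 keeps the full 8-bit float32 exponent and shortens only the mantissa, it preserves float32's dynamic range at reduced precision, which is one reason it is a common choice for floating-point CNN workloads.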