JSSC 2015, Issue 1, Other, 0.13 µm

A 1 TOPS/W Analog Deep Machine-Learning Engine With Floating-Gate Storage in 0.13 µm CMOS

Junjie Lu, Student Member, IEEE, Steven Young, Student Member, IEEE, Itamar Arel, Senior Member, IEEE, and

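An analog deep-learning engine with 1 TOPS/W energy efficiency, implemented in 0.13 µm CMOS, supporting online unsupervised learning and non-volatile storage.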

0.13 µm CMOS, 3 V, 8300 vectors/s, 11.4 µW, 1×10^12 ops/s/W

Analog computing, deep learning, floating-gate storage, energy-efficiency optimization, feature extraction

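▸ Online unsupervised learning: The design implements online unsupervised learning in hardware. Weights stored on floating gates are adjusted in real time, so feature extraction needs no external intervention. Measurements show a processing speed of 8300 input vectors per second, with training done on chip rather than in a conventional offline pass.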

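▸ Non-volatile floating-gate analog storage: The circuit stores weights in an array of floating-gate transistors, combining analog computing precision with the non-volatility of flash memory. It occupies a compact 0.36 mm² of active area in a 0.13 µm process and avoids the volatility of conventional SRAM/DRAM storage.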

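▸ Algorithm-level feedback for robustness: The system uses algorithm-level feedback to compensate for circuit non-idealities such as process variation and noise. While maintaining 1 TOPS/W energy efficiency (11.4 µW at 3 V), feature extraction with 8-fold dimension reduction reaches an accuracy comparable to the floating-point software baseline, which improves the reliability of analog computation.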

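▸ Parallel current-mode architecture: A reconfigurable current-mode computing array performs massively parallel operations, reaching a peak energy efficiency of 1×10¹² ops/s/W, two orders of magnitude higher than digital accelerators, and pointing to a new approach for low-power edge AI.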
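To make the idea of online unsupervised learning more concrete, here is a minimal software sketch of one possible update rule: an online, k-means-style step that moves the nearest stored centroid toward each incoming vector. This is an illustration only. The source text here does not name the chip's learning algorithm, so the update rule, the function name `online_kmeans_step`, the learning rate `lr`, and the synthetic two-cluster data are all assumptions. Treating the floating-gate weights as cluster centroids is likewise an assumption of this sketch.

```python
import random


def online_kmeans_step(centroids, x, lr=0.1):
    """Move the centroid nearest to x a small step toward x; return its index.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Squared Euclidean distance from x to every stored centroid.
    dists = [sum((c - xi) ** 2 for c, xi in zip(cen, x)) for cen in centroids]
    winner = dists.index(min(dists))
    # Only the winning centroid is updated, one input vector at a time.
    centroids[winner] = [c + lr * (xi - c) for c, xi in zip(centroids[winner], x)]
    return winner


random.seed(0)
# Assumed starting centroids; in hardware these would be stored analog values.
centroids = [[0.2, 0.2], [0.8, 0.8]]

# Stream synthetic 2-D inputs drawn around two cluster centers.
for _ in range(500):
    cx, cy = random.choice([(0.0, 0.0), (1.0, 1.0)])
    x = [cx + random.gauss(0, 0.05), cy + random.gauss(0, 0.05)]
    online_kmeans_step(centroids, x)

print(centroids)
```

Each input is seen once and then discarded, which is what makes the scheme "online": no training set is stored and no separate offline pass is needed.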

Abstract

An analog implementation of a deep machine-learning system for efficient feature extraction is presented in this work. It features online unsupervised trainability and non-volatile floating-gate analog storage. It utilizes a massively parallel reconfigurable current-mode analog architecture to realize efficient computation, and leverages algorithm-level feedback to provide robustness to circuit imperfections in analog signal processing. A 3-layer, 7-node analog deep machine-learning engine was fabricated in a 0.13 µm standard CMOS process, occupying 0.36 mm² active area. At a processing speed of 8300 input vectors per second, it consumes 11.4 µW from the 3 V supply, achieving 1×10^12 operations per second per watt of peak energy efficiency. Measurements demonstrate real-time cluster analysis, and feature extraction for pattern recognition with 8-fold dimension reduction with an accuracy comparable to the floating-point software simulation baseline.
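The figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below derives the supply current, the energy per input vector, and the implied operation count per vector. It assumes that the peak efficiency of 1×10^12 ops/s/W applies at the stated operating point (11.4 µW, 8300 vectors/s). The abstract suggests this but does not state it outright, so the operations-per-vector figure is an estimate. All variable names are illustrative.

```python
# Figures taken from the abstract.
POWER_W = 11.4e-6                   # power drawn from the supply
SUPPLY_V = 3.0                      # supply voltage
VECTORS_PER_S = 8300                # processing speed in input vectors per second
EFFICIENCY_OPS_PER_S_PER_W = 1e12   # peak energy efficiency (1 TOPS/W)

# Average supply current: I = P / V.
supply_current_a = POWER_W / SUPPLY_V

# Energy spent per input vector: E = P / rate.
energy_per_vector_j = POWER_W / VECTORS_PER_S

# Implied throughput, ASSUMING the peak efficiency holds at this operating point.
ops_per_s = EFFICIENCY_OPS_PER_S_PER_W * POWER_W
ops_per_vector = ops_per_s / VECTORS_PER_S

print(f"supply current   : {supply_current_a * 1e6:.2f} uA")
print(f"energy per vector: {energy_per_vector_j * 1e9:.2f} nJ")
print(f"ops per second   : {ops_per_s:.3e}")
print(f"ops per vector   : {ops_per_vector:.0f}")
```

Under that assumption, the chip draws a few microamperes and spends on the order of a nanojoule per input vector.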