JSSC 2024, Issue 10 | Digital Circuits | 40nm

A 73.8k-Inference/mJ SVM Learning Accelerator for Brain Pattern Recognition

Tzu-Wei Tong, Tai-Jung Chen, Yi-Yen Hsieh, and Chia-Hsiang Yang

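A low-power SVM learning accelerator for brain pattern recognition that combines the CP-SVM algorithm with hardware optimizations to substantially improve energy efficiency.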

40nm CMOS, 0.85V, 40MHz, 9.68mW, 73.8k inference/mJ
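The headline figures above can be cross-checked with a little arithmetic. The sketch below derives the energy per inference directly from the quoted efficiency. The throughput line additionally assumes that the chip draws the full 9.68 mW while running inference, which the source does not state, so treat that number as illustrative only.

```python
# Figures quoted in the specification line above.
POWER_MW = 9.68               # chip power at 40 MHz from a 0.85 V supply
INFERENCES_PER_MJ = 73.8e3    # reported energy efficiency

# Energy per inference follows directly: 1 mJ = 1e6 nJ.
energy_per_inference_nj = 1e6 / INFERENCES_PER_MJ

# ASSUMPTION (not stated in the source): the 9.68 mW figure applies
# during inference. Since 1 mW = 1 mJ/s, throughput is then
# (mJ per second) * (inferences per mJ).
inferences_per_s = POWER_MW * INFERENCES_PER_MJ

print(f"{energy_per_inference_nj:.2f} nJ per inference")
print(f"{inferences_per_s:,.0f} inferences/s under the stated assumption")
```

Only the first result depends solely on reported numbers. The second changes if the power during inference differs from the quoted operating point.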

SVM accelerator | brain pattern recognition | energy-efficiency optimization | hardware acceleration | CMOS

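▸ Innovation 1: CP-SVM algorithm (method innovation). A cluster-partitioning strategy splits large-scale data into multiple subproblems that are processed in parallel, cutting training and inference latency by up to 99% and 91%, respectively, and addressing the computational bottleneck of conventional SVMs on embedded devices.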

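▸ Innovation 2: Kernel transformation (algorithm innovation). A mathematical reformulation converts high-dimensional kernel operations into low-dimensional linear operations, reducing the hardware complexity of the PE array by 42% while preserving computational accuracy and improving hardware resource utilization.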

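▸ Innovation 3: Sparsity-aware skipping (system innovation). The design dynamically detects sparsity in the input data and skips computations that involve zero values, eliminating redundant operations. Combined with a data scheduling strategy that raises PE utilization, this reduces the overall PE array processing latency by 96%.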

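▸ Innovation 4: Hardware architecture optimization (circuit innovation). A chained interconnect reduces the data exchanger area by 93%, and integrating multiple sorters into one cross-cluster sorter saves 52% of the sorter area. The resulting chip achieves an energy efficiency of 73.8k inference/mJ and an area efficiency of 510k inference/s/mm², both more than 3.4× better than prior art.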
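The summary does not spell out how CP-SVM routes a sample at inference time, so the following is only a sketch of the general cluster-partitioning idea under stated assumptions: each cluster keeps its own small RBF-kernel SVM, and a query is scored only by the model whose centroid is nearest. The names `ClusterModel` and `cp_svm_predict`, the nearest-centroid routing rule, the RBF kernel choice, and all toy parameters are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterModel:
    """Hypothetical per-cluster SVM: a centroid plus a local model."""
    centroid: list
    support_vectors: list
    dual_coefs: list      # alpha_i * y_i for each support vector
    bias: float


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rbf(a, b, gamma):
    """RBF kernel (an assumed kernel choice for this sketch)."""
    return math.exp(-gamma * sq_dist(a, b))


def cp_svm_predict(x, models, gamma=0.5):
    """Return (label, kernel_evaluations) for one sample.

    ASSUMPTION: the sample is routed to the nearest centroid, so only
    that cluster's support vectors are evaluated.
    """
    model = min(models, key=lambda m: sq_dist(x, m.centroid))
    score = model.bias + sum(
        c * rbf(sv, x, gamma)
        for sv, c in zip(model.support_vectors, model.dual_coefs)
    )
    label = 1 if score >= 0 else -1
    return label, len(model.support_vectors)


# Toy models with made-up parameters, two support vectors per cluster.
models = [
    ClusterModel([0.0, 0.0], [[-0.5, 0.0], [0.5, 0.0]], [-1.0, 1.0], 0.0),
    ClusterModel([5.0, 5.0], [[5.0, 4.5], [5.0, 5.5]], [-1.0, 1.0], 0.0),
]

label, evals = cp_svm_predict([0.4, 0.1], models)
total_svs = sum(len(m.support_vectors) for m in models)
print(f"label={label}, kernel evaluations={evals} of {total_svs}")
```

In this toy setup the query touches only the support vectors of one cluster. That is the intuition behind the inference latency reduction, although the actual savings depend on how the real algorithm partitions and routes data.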
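Sparsity-aware skipping can be illustrated in software as a dot product that issues a multiply-accumulate (MAC) only for nonzero inputs. This is a behavioral sketch rather than the chip's datapath. The function `sparse_dot` and the sample vectors are made up for illustration.

```python
def sparse_dot(x, w):
    """Dot product that skips zero-valued inputs.

    Returns (result, macs_issued) so the saved work is visible.
    """
    acc = 0.0
    macs = 0
    for xi, wi in zip(x, w):
        if xi == 0:
            continue          # zero input contributes nothing: skip the MAC
        acc += xi * wi
        macs += 1
    return acc, macs


# Illustrative sparse input (5 of 8 entries are zero) and weights.
x = [0.0, 1.5, 0.0, 0.0, -2.0, 0.0, 0.0, 1.0]
w = [0.3, 0.2, 0.9, 0.1, 0.4, 0.7, 0.6, 1.0]

result, macs = sparse_dot(x, w)
dense = sum(xi * wi for xi, wi in zip(x, w))   # reference without skipping

print(f"result={result:.3f}, MACs issued={macs} of {len(x)}")
```

The skipped version returns the same value as the dense reference while issuing fewer MACs. In hardware the benefit also depends on how well the remaining work is scheduled across PEs, which this sketch does not model.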

Abstract

Machine learning (ML) has been widely adopted in neural signal processing, and the support vector machine (SVM) stands out for its efficacy given limited training data. The constrained battery capacity of implanted devices necessitates a dedicated accelerator with high energy efficiency. This work presents an energy-efficient SVM learning accelerator for brain pattern recognition. By employing the cluster-partitioning SVM (CP-SVM) algorithm, this work achieves up to 99% and 91% latency reductions for training and inference, respectively, compared to conventional SVMs. Efficient hardware mapping is achieved through algorithm and architecture co-optimizations. Kernel transformation reduces the processing element (PE) array's hardware complexity by 42%. Sparsity-aware skipping eliminates redundant computations, leading to latency reduction. The design space of the PE array is explored to minimize the hardware cost. Data scheduling is applied to improve the PE utilization. Overall, the processing latency for the PE array is reduced by 96%. For PE array implementation, the area of the data exchanger is reduced by 93% by utilizing the chained interconnect. By integrating multiple sorters into one cross-cluster sorter, the sorter area is reduced by 52%. Fabricated in a 40-nm CMOS technology, the proposed SVM learning processor dissipates 9.68 mW at 40 MHz from a supply voltage of 0.85 V. The chip achieves an energy efficiency of 73.8k inference/mJ and 811 training/mJ, excelling p