JSSC 2024, Issue 3, Power Management, 65 nm

FreFlex: A High-Performance Processor for Convolution and Attention Computations via Sparsity-Adaptive Dynamic Frequency Boosting Sadhana

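The FreFlex processor improves the energy efficiency and performance of convolution and attention computations through sparsity-adaptive dynamic frequency modulation and a 2-D systolic array.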

160 GOPS/s/mm², 1.1 GHz, 0.6–1.0 TOPS/W

AI accelerator, sparsity, dynamic frequency modulation, convolution computation, attention computation

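▸ Innovation 1: Sparsity-adaptive dynamic frequency modulation (SA-DFM) is a system-level innovation that adjusts the clock frequency to match the sparsity of the input data, raising performance while staying within the power budget. It incurs less than 7% power overhead even when there is no sparsity, and delivers up to 1.8× higher performance when sparsity is available to exploit.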

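▸ Innovation 2: The processing element (PE) of the 2-D systolic array is a circuit-level innovation that optimizes the dataflow of convolution and attention computations, improving computational efficiency through parallel processing and data reuse. In 65-nm CMOS, the design reaches a maximum frequency of 1.1 GHz and a performance density of 160 GOPS/s/mm².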

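▸ Innovation 3: Exploiting sparsity to boost performance is a methodological innovation. The zero elements in a layer's output are counted while that layer is being computed; because this output becomes the next layer's input, the count gives the sparsity used to adjust the clock for the next layer. The resulting energy efficiency is 0.6–1.0 TOPS/W.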

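▸ Innovation 4: A silicon prototype demonstrates the practical feasibility of the design, achieving high energy efficiency (0.6–1.0 TOPS/W) and high performance (160 GOPS/s/mm²) in a 65-nm CMOS node, and offering a scalable solution for sparse-computation hardware.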
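
The control loop described in the points above (count zeros in a layer's output, then set the clock for the next layer) can be sketched in software as follows. This is only an illustrative model, not the chip's actual controller: the function names, the sparsity thresholds, and every frequency other than the 1.1 GHz ceiling are hypothetical, and the first layer is assumed to run at the base clock because no previous output exists yet.

```python
from bisect import bisect_right

# Hypothetical operating points: (minimum sparsity, clock in MHz).
# Only the 1100 MHz ceiling comes from the source; the rest is illustrative.
FREQ_TABLE = [(0.0, 600), (0.2, 750), (0.4, 900), (0.6, 1100)]


def measure_sparsity(outputs):
    """Return the fraction of zero elements in a flat list of layer outputs."""
    if not outputs:
        return 0.0
    zeros = sum(1 for v in outputs if v == 0)
    return zeros / len(outputs)


def select_frequency(sparsity, table=FREQ_TABLE):
    """Pick the clock of the highest table entry whose threshold is <= sparsity."""
    thresholds = [t for t, _ in table]
    idx = bisect_right(thresholds, sparsity) - 1
    return table[max(idx, 0)][1]


def run_layers(layers, first_input, table=FREQ_TABLE):
    """Run layers in order and return the clock (MHz) chosen for each one.

    The clock for layer k is derived from the sparsity of layer k-1's output.
    Assumption: the first layer uses the base (lowest) clock.
    """
    x = first_input
    freq = table[0][1]
    schedule = []
    for layer in layers:
        schedule.append(freq)
        x = layer(x)  # in hardware, zeros are counted while this is computed
        freq = select_frequency(measure_sparsity(x), table)
    return schedule


def relu(xs):
    """Toy layer: ReLU, which is a common source of zeros in activations."""
    return [max(v, 0) for v in xs]
```

For example, `run_layers([relu, relu], [-1, -2, -3, -4, 5])` runs the first layer at the base clock, observes that four of its five outputs are zero, and selects the highest clock in the table for the second layer.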

Abstract

A high degree of sparsity in machine learning (ML) models has been highlighted as a significant opportunity to improve energy and delay efficiencies by skipping the computation of zero elements in operands. Despite the potential, its unstructured positions of zeros and a wide range of sparsity make it challenging to exploit this nature in hardware implementations that are often built on regular structures. To address these challenges, this article presents a low-power and high-performance AI accelerator, the so-called FreFlex, via sparsity-adaptive dynamic frequency modulation (SA-DFM) conjointly with the proposed processing element (PE) in a 2-D systolic array. The sparsity of each layer is determined by counting zero elements from the output while the layer is being computed. Then, the clock frequency is optimally modulated based on the sparsity level obtained from the previous layer’s output, which becomes an input of the next layer. The unused power slack due to the sparsity is exploited to boost performance while fully using the power budget. The proposed technique achieves up to 1.8× performance improvement by exploiting the sparsity while incurring less than 7% power overhead, even when there is no sparsity. The silicon prototype, fabricated in a 65-nm CMOS node, demonstrates 0.6–1.0-TOPS/W efficiency for convolution and attention computations, with a performance of 160 GOPS/s/mm² with a maximum frequency of 1.1 GHz.
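
One way to see where the "unused power slack" could come from is a first-order dynamic-power model. The model below is an assumption made for illustration and is not taken from the paper: the supply voltage V is held fixed, C is the switched capacitance, s is the input sparsity, and β is a hypothetical fraction of dynamic power (clocking, control) that zero-skipping does not remove.

```latex
% Assumed first-order model (not from the source): fixed V, sparsity s in [0, 1],
% and a non-skippable power fraction \beta in [0, 1].
\begin{align}
P(f, s) &\approx \bigl[\beta + (1-\beta)(1-s)\bigr]\, C V^{2} f \\
P(f_{\text{boost}}, s) = P(f_{0}, 0)
  &\;\Longrightarrow\;
  \frac{f_{\text{boost}}}{f_{0}} = \frac{1}{\beta + (1-\beta)(1-s)} \\
f_{\text{clk}} &= \min\bigl(f_{\text{boost}},\, f_{\max}\bigr)
\end{align}
```

Under these assumptions the ratio is 1 at s = 0 (no boost is possible) and grows as s increases, until the clock reaches the ceiling f_max, which the abstract gives as 1.1 GHz for this prototype.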