JSSC 2020, Issue 7 · Memory · 65nm · Neural Network Accelerator

An 8.93 TOPS/W LSTM Recurrent Neural Network Accelerator Featuring Hierarchical Coarse-Grain Sparsity for On-Device

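Proposes an LSTM neural network accelerator based on hierarchical coarse-grain sparsity that achieves energy-efficient speech recognition.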

65-nm LP CMOS, 8.93 TOPS/W

LSTM neural network accelerator · hierarchical coarse-grain sparsity · speech recognition · energy-efficiency optimization

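▸ Innovation 1: Algorithm-hardware co-optimization of hierarchical coarse-grain sparsity (HCGS) (method innovation). Co-designing the algorithm and the hardware yields efficient weight compression that cuts storage and computation requirements, with a compression ratio of up to 16× while keeping the error rate low.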

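▸ Innovation 2: Blockwise recursive weight compression (method innovation). Compressing the weights recursively at the block level substantially reduces the index memory overhead of weight storage, addressing the inefficiency of conventional elementwise sparsity schemes.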

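▸ Innovation 3: Energy-efficient LSTM network implementation (system innovation). Fabricated in 65-nm LP CMOS, the design achieves up to 8.93 TOPS/W, is suited to real-time speech recognition, and is validated with low error rates on the TIMIT, TED-LIUM, and LibriSpeech data sets.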

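▸ Innovation 4: Hardware accelerator design (circuit innovation). Optimized memory access and compute units raise the parallel processing capability of the LSTM network and further lower energy consumption.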
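To make the idea of nested block sparsity concrete, here is a minimal Python sketch of a two-level coarse-grain mask. It is an illustration under assumptions, not the paper's implementation: the function name `hcgs_mask`, the block sizes, the random block selection, and the 4× reduction at each level are all hypothetical choices made so that the two levels multiply to the 16× figure quoted above.

```python
import random

def hcgs_mask(rows, cols, big, small, keep_big, keep_small, seed=0):
    """Build a rows x cols 0/1 mask from two nested levels of block sparsity.

    Assumes rows and cols are multiples of `big`, and `big` is a multiple
    of `small`. All parameters are illustrative, not taken from the paper.
    """
    rng = random.Random(seed)
    mask = [[0] * cols for _ in range(rows)]
    big_per_row = cols // big      # big blocks in one row of big blocks
    small_per_row = big // small   # small blocks across one big block
    for br in range(rows // big):
        # Level 1: keep a fixed number of big blocks in each block row.
        for bc in rng.sample(range(big_per_row), keep_big):
            for sr in range(big // small):
                # Level 2: inside a kept big block, keep a fixed number of
                # small blocks in each row of small blocks.
                for sc in rng.sample(range(small_per_row), keep_small):
                    r0 = br * big + sr * small
                    c0 = bc * big + sc * small
                    for r in range(r0, r0 + small):
                        for c in range(c0, c0 + small):
                            mask[r][c] = 1
    return mask

# Toy configuration: 4x reduction at level 1 and 4x at level 2.
mask = hcgs_mask(rows=64, cols=64, big=16, small=4, keep_big=1, keep_small=1)
kept = sum(map(sum, mask))
print(f"kept {kept} of {64 * 64} weights, compression = {64 * 64 // kept}x")
```

Because a fixed number of blocks is kept per block row at both levels, every row of the matrix ends up with the same number of surviving weights, which is the kind of regularity that lets hardware schedule work evenly.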

Abstract

Long short-term memory (LSTM) is a type of recurrent neural network (RNN) that is widely used for time-series data and speech applications due to its high accuracy on such tasks. However, LSTMs pose difficulties for efficient hardware implementation because they require a large amount of weight storage and exhibit high computational complexity. Prior works have proposed compression techniques to alleviate the storage/computation requirements of LSTMs, but elementwise sparsity schemes incur sizable index memory overhead and structured compression techniques report limited compression ratios. In this article, we present an energy-efficient LSTM RNN accelerator featuring an algorithm-hardware co-optimized memory compression technique called hierarchical coarse-grain sparsity (HCGS). Aided by the HCGS-based blockwise recursive weight compression, we demonstrate LSTM networks with up to 16× fewer weights while achieving minimal error rate degradation. The prototype chip fabricated in 65-nm LP CMOS achieves up to 8.93 TOPS/W for real-time speech recognition using compressed LSTMs based on HCGS. HCGS-based LSTMs have demonstrated energy-efficient speech recognition with low error rates for TIMIT, TED-LIUM, and LibriSpeech data sets.
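The abstract's point about index memory overhead can be illustrated with a rough count of index bits. The sketch below uses a deliberately simplified cost model, which is an assumption and not the paper's accounting: an elementwise scheme stores one column index per surviving weight, while a two-level blockwise scheme stores one selector per kept block at each level. The function names, the 512×512 matrix, and the block sizes are hypothetical.

```python
def bits_for(choices):
    """Bits needed to select one of `choices` positions."""
    return max(1, (choices - 1).bit_length())

def elementwise_index_bits(rows, cols, compression):
    """Simplified model: one column index per surviving weight."""
    kept = rows * cols // compression
    return kept * bits_for(cols)

def blockwise_index_bits(rows, cols, big, small, keep_big, keep_small):
    """Simplified model: one selector per kept block at each of two levels."""
    big_rows = rows // big
    small_rows = big // small
    level1 = big_rows * keep_big * bits_for(cols // big)
    level2 = (big_rows * keep_big * small_rows * keep_small
              * bits_for(big // small))
    return level1 + level2

# Both configurations keep 1/16 of a 512 x 512 weight matrix.
ew = elementwise_index_bits(512, 512, compression=16)
bw = blockwise_index_bits(512, 512, big=64, small=16, keep_big=2, keep_small=1)
print(f"elementwise: {ew} index bits, two-level blockwise: {bw} index bits")
```

Under this toy model the elementwise scheme needs 147,456 index bits against 176 for the blockwise one. The exact numbers depend entirely on the assumed encoding, but the gap shows why per-block selectors are so much cheaper than per-weight indices.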