JSSC 2022 Issue 4, Memory, 28nm

OmniDRL: An Energy-Efficient Deep Reinforcement Learning Processor With Dual-Mode Weight Compression and Sparse

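OmniDRL is an energy-efficient deep reinforcement learning processor for edge devices that reduces memory access through data compression and sparse training.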

28 nm CMOS, 3.6 × 3.6 mm², 4.18 TFLOPS peak performance, 29.3 TFLOPS/W peak energy efficiency

Deep reinforcement learning, edge computing, energy-efficiency optimization, data compression, sparse training

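▸ Group-sparse training (GST) raises the weight compression ratio

▸ Exponent-mean-delta encoding further compresses weights and feature maps

▸ On-chip sparse weight transposer avoids off-chip transposition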
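
To make the exponent-mean-delta idea in the second point concrete, here is a minimal Python sketch of one plausible reading of the name: store a single mean exponent per block of values, plus a short signed delta for each value. Everything specific here is an assumption made for illustration and is not taken from the paper: the float32 storage format, the block being the whole input list, rounding the mean to an integer, and the helper names `exponent_bits`, `encode_exponents`, and `decode_exponents`. The chip's actual bit layout may differ.

```python
import struct


def exponent_bits(x: float) -> int:
    """Return the 8-bit biased exponent of x when stored as IEEE 754 float32.

    Assumption: float32 is used only because it is easy to inspect here.
    """
    (raw,) = struct.unpack("<I", struct.pack("<f", x))
    return (raw >> 23) & 0xFF


def encode_exponents(values):
    """Encode a block of exponents as (mean, delta_bits, deltas).

    mean       -- rounded mean of the biased exponents in the block
    delta_bits -- smallest two's-complement width that holds every delta
    deltas     -- per-value signed difference from the mean
    """
    if not values:
        raise ValueError("block must contain at least one value")
    exps = [exponent_bits(v) for v in values]
    mean = round(sum(exps) / len(exps))
    deltas = [e - mean for e in exps]
    lo, hi = min(deltas), max(deltas)
    bits = 1
    # Grow the width until [lo, hi] fits in the signed range of `bits` bits.
    while not (-(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return mean, bits, deltas


def decode_exponents(mean, deltas):
    """Recover the original biased exponents (lossless in this sketch)."""
    return [mean + d for d in deltas]


if __name__ == "__main__":
    block = [0.5, 0.75, 1.0, 1.5, 0.25]
    mean, bits, deltas = encode_exponents(block)
    raw_cost = 8 * len(block)            # 8 exponent bits per value
    packed_cost = 8 + bits * len(block)  # one mean plus short deltas
    print(mean, bits, deltas, raw_cost, packed_cost)
```

In this sketch the saving comes entirely from the deltas needing fewer bits than a full exponent, which holds only when the values in a block have similar magnitudes. Mantissa and sign bits are left untouched.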

Abstract

In this article, we present an energy-efficient deep reinforcement learning (DRL) processor, OmniDRL, for DRL training on edge devices. Recently, the need for DRL training is growing due to the DRL’s distinct characteristics that can be adapted to each user. However, a massive amount of external and internal memory access limits the implementation of DRL training on resource-constrained platforms. OmniDRL proposes four key features that can reduce external memory access by compressing as much data as possible and can reduce internal memory access by directly processing compressed data. A group-sparse training (GST) enables a high weight compression ratio (CR) for every DRL iteration by selective utilization of weight grouping and weight pruning. A group-sparse training core is proposed to fully take advantage of compressed weight from GST by skipping redundant operations and reusing duplicated data. An exponent-mean-delta encoding additionally compresses the exponent of both weight and feature map for higher CR and low memory power consumption. A world-first on-chip sparse weight transposer enables the DRL training process of compressed weight without off-chip transposer. As a result, OmniDRL is fabricated in a 28-nm CMOS technology and occupies a 3.6 × 3.6 mm² die area. It shows a state-of-the-art peak performance of 4.18 TFLOPS and a peak energy efficiency of 29.3 TFLOPS/W. It achieved 7.42-TFLOPS/W energy efficiency for training robot agent (Mujoco Halfcheetah, TD3)