JSSC 2021, Issue 9 · Digital Circuits · 65 nm · Neural Network Accelerator

GANPU: An Energy-Efficient Multi-DNN Training Processor for GANs With Speculative Dual-Sparsity Exploitation

GANPU is an energy-efficient multi-DNN training processor designed for GANs and targeted at mobile devices.

75.68 TFLOPS/W for 16-bit floating-point computation

Generative adversarial networks · Energy efficiency · Multi-DNN training · Mobile devices · Privacy protection

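▸Innovation 1: Adaptive spatiotemporal workload multiplexing (system innovation) - Dynamically adjusts how compute resources are allocated in time and space to optimize the parallel execution of the multiple DNNs within a single GAN model. This significantly raises hardware utilization and addresses the large differences in operational characteristics between the networks and layers of a GAN.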

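▸Innovation 2: Dual-sparsity exploitation architecture (circuit innovation) - A new processing architecture that exploits the ReLU sparsity of both input and output features at the same time, skipping redundant computations on zero values to improve compute efficiency. It applies to both inference and training.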

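▸Innovation 3: Exponent-only ReLU speculation algorithm (method innovation) - A lightweight PE architecture and algorithm that predicts the locations of output-feature zeros from the exponent bits alone, reducing hardware overhead. The processor achieves an energy efficiency of 75.68 TFLOPS/W, 4.85× higher than the state of the art.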
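The interplay of Innovations 2 and 3 can be illustrated with a small software sketch. It is not the GANPU datapath: the summary only says that output zeros are predicted from exponents alone, so the sketch assumes one plausible reading, in which each operand is reduced to its sign and power-of-two exponent, the approximate products are accumulated, and an output is predicted to be zero when that approximate sum is non-positive. The function names `sign_exp`, `speculate_zero`, and `dual_sparse_layer` are hypothetical, and ordinary Python floats stand in for 16-bit floating-point values.

```python
import math


def sign_exp(x):
    """Return (sign, exponent) of x with the mantissa discarded; zero maps to (0, 0)."""
    if x == 0.0:
        return 0, 0
    _, e = math.frexp(x)  # x = m * 2**e with 0.5 <= |m| < 1
    return (1 if x > 0 else -1), e


def speculate_zero(inputs, weights):
    """Guess whether ReLU(dot(inputs, weights)) is zero using signs and exponents only.

    Assumption of this sketch: each product is approximated as
    sign * 2**(exp_a + exp_w), and the output is predicted zero
    when the approximate sum is <= 0.
    """
    approx = 0.0
    for a, w in zip(inputs, weights):
        sa, ea = sign_exp(a)
        sw, ew = sign_exp(w)
        if sa == 0 or sw == 0:
            continue  # a zero operand contributes nothing
        approx += sa * sw * math.ldexp(1.0, ea + ew)
    return approx <= 0.0


def dual_sparse_layer(inputs, weight_rows):
    """Compute ReLU(W @ x), skipping zero inputs and outputs speculated to be zero.

    Returns the output list and the number of full-precision
    multiply-accumulates that were actually performed.
    """
    outputs, macs = [], 0
    for row in weight_rows:
        if speculate_zero(inputs, row):
            outputs.append(0.0)  # output sparsity: skip the whole dot product
            continue
        acc = 0.0
        for a, w in zip(inputs, row):
            if a == 0.0:
                continue  # input sparsity: skip zero activations
            acc += a * w
            macs += 1
        outputs.append(max(acc, 0.0))
    return outputs, macs


if __name__ == "__main__":
    x = [1.0, 0.0, 2.0, 0.0]
    w = [[0.5, 1.0, 0.25, 1.0], [-1.0, 1.0, -0.5, 1.0]]
    print(dual_sparse_layer(x, w))
```

In this toy case a dense evaluation would need 8 multiply-accumulates, while the sketch performs 2: half of the inputs are zero, and the second output is skipped by speculation. Because the mantissa is ignored, the prediction can be wrong when the true sum is close to zero; this sketch makes no attempt to detect or correct such mispredictions.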

Abstract

This article presents generative adversarial network processing unit (GANPU), an energy-efficient multiple deep neural network (DNN) training processor for GANs. It enables on-device training of GANs on performance- and battery-limited mobile devices, without sending user-specific data to servers, fully evading privacy concerns. Training GANs requires a massive amount of computation, and therefore, it is difficult to accelerate in a resource-constrained platform. Besides, networks and layers in GANs show dramatically changing operational characteristics, making it difficult to optimize the processor’s core and bandwidth allocation. For higher throughput and energy efficiency, this article proposes three key features. An adaptive spatiotemporal workload multiplexing is proposed to maintain high utilization in accelerating multiple DNNs in a single GAN model. To take advantage of ReLU sparsity during both inference and training, a dual-sparsity exploitation architecture is proposed to skip redundant computations due to input and output feature zeros. Moreover, an exponent-only ReLU speculation (EORS) algorithm is proposed along with its lightweight processing element (PE) architecture, to estimate the location of output feature zeros during the inference with minimal hardware overhead. Fabricated in a 65-nm process, the GANPU achieved an energy efficiency of 75.68 TFLOPS/W for 16-bit floating-point computation, which is 4.85× higher than the state of the art. As a result, GANPU enabl