JSSC 2021, Issue 10 · Memory · 65nm · DRAM

An Energy-Efficient GAN Accelerator With On-Chip Training for Domain-Specific Optimization

Proposes an energy-efficient GAN training accelerator that supports retraining with the user's local data.

65 nm CMOS, 0.38 TFLOPS/W energy efficiency, 100 training epochs on 256×256 images in <30 s, 274 mW power consumption
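As a rough check on these figures, the sketch below derives an upper bound on the energy of one retraining run. It assumes that 274 mW is the average power over the whole run and treats 30 s as the worst case, since the source only says "under 30 s"; the constant and function names are illustrative and do not come from the paper.

```python
# Reported figures (from the summary line above).
POWER_W = 0.274        # 274 mW; assumed to be the average power during retraining
TRAIN_TIME_S = 30.0    # upper bound, since the run finishes in under 30 s
EPOCHS = 100


def energy_upper_bound_j(power_w: float, time_s: float) -> float:
    """Energy (joules) = power (watts) x time (seconds)."""
    return power_w * time_s


total_j = energy_upper_bound_j(POWER_W, TRAIN_TIME_S)
per_epoch_mj = total_j / EPOCHS * 1000.0

print(f"Energy per retraining run: under {total_j:.2f} J")
print(f"Energy per epoch: under {per_epoch_mj:.1f} mJ")
```

Under these assumptions a full 100-epoch retraining run costs less than about 8.22 J, or less than about 82.2 mJ per epoch.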

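Generative adversarial networks · energy-efficiency optimization · on-chip training · instance normalization · domain-specific optimization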

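▸ Selective layer retraining (SELRET) reduces computation by 69%

▸ Reordering layers for instance normalization (ROLIN) reduces EMA

▸ Stage-wise optimization of EMA for FP and EP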
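To make the link between instance normalization and EMA concrete, here is a minimal pure-Python sketch of the standard IN definition, with the learnable scale and shift omitted. It is written as two passes, statistics then normalization, which is one plausible reading of how IN can be "split". The function names and the `[channel][row][col]` layout are assumptions for illustration and do not describe the chip's actual dataflow.

```python
import math


def channel_stats(channel):
    """Pass 1: mean and variance over all spatial positions of one channel."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def normalize_channel(channel, mean, var, eps=1e-5):
    """Pass 2: normalize every element with the statistics from pass 1."""
    scale = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * scale for v in row] for row in channel]


def instance_norm(feature_map, eps=1e-5):
    """Instance normalization of one sample laid out as [channel][row][col]."""
    out = []
    for channel in feature_map:
        # The statistics need the entire channel before any output can be produced.
        mean, var = channel_stats(channel)
        out.append(normalize_channel(channel, mean, var, eps))
    return out


# Tiny example: two 2x2 channels of a single sample.
sample = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[10.0, 10.0], [10.0, 14.0]],
]
normalized = instance_norm(sample)
```

Because pass 2 cannot begin until pass 1 has seen the whole channel, a feature map that does not fit in on-chip memory would have to be written out and read back between the two passes. That kind of intermediate traffic is what ROLIN targets.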

Abstract

Generative adversarial networks (GANs) consist of multiple deep neural networks cooperating and competing with each other. Due to their complex architectures and large feature map sizes, training GANs requires a huge amount of computation. Moreover, instance normalization (IN) layers in GANs dramatically increase the external memory access (EMA). Nevertheless, retraining GANs with user-specific data is critical on mobile devices because the pre-trained model outputs distorted images under user-specific conditions. This article proposes a GAN training accelerator to enable energy-efficient domain-specific optimization of GANs with the user’s local data. Selective layer retraining (SELRET) picks out the layers that are effective in enhancing the quality of the retrained model. Without image quality degradation, SELRET reduces the required computation by 69%. Moreover, reordering layers for instance normalization (ROLIN) is proposed to reduce the EMA of intermediate data. Through the implementation of the proposed architecture, which splits and reorders the IN layers, overall EMA is reduced by 38.7% in the forward propagation (FP) stage and by 32.2% in the error propagation (EP) stage. The proposed processor is fabricated in a 65-nm CMOS process and shows an energy efficiency of 0.38 TFLOPS/W. The chip can retrain a face modification GAN with a custom dataset of 256 × 256 images over 100 epochs in under 30 s while consuming only 274 mW. Compared to the previous FPGA implementa-