JSSC 2023, Issue 1 | Digital Circuits | 4 nm | Neural Network Accelerator

A Multi-Mode 8k-MAC HW-Utilization-Aware Neural Processing Unit With a Unified Multi-Precision Datapath in 4-nm Flagship Mobile SoC

A multi-mode 8k-MAC neural processing unit in a 4-nm process that supports multi-precision computation and optimizes hardware utilization.

4.26 TFLOPS/W (FP16), 11.59 TOPS/W (INT8), 1.72 TFLOPS/mm², 3.45 TOPS/mm²

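Neural processing unit | Multi-precision computation | Hardware utilization | Dynamic operation mode | Energy-efficiency optimization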

▸ Unified multi-precision MACs support INT4/8/16 and FP16 data

▸ Dynamically reconfigures the computational flow to raise hardware utilization

▸ Supports a dynamic operation mode ranging from extremely low power to low latency

Abstract

This article presents an 8k-multiply-accumulate (MAC) neural processing unit (NPU) in a 4-nm mobile system-on-chip (SoC). The unified multi-precision MACs support data formats from integer (INT)4/8/16 to floating point (FP)16 with high area and energy efficiency. When the NPU encounters layers with low hardware (HW) utilization, such as depthwise convolution or shallow layers with only a few input channels, it reconfigures the computational flow to raise the utilization by up to four times, using basic tensor information obtained from the compiler, such as operation types and shapes. The NPU supports a dynamic operation mode to cover requirements ranging from extremely low power to low latency. The NPU achieves 4.26 tera FP operations per second (TFLOPS)/W and 11.59 tera operations per second (TOPS)/W for DeepLabV3 (FP16) and MobileNet-EdgeTPU (INT8), respectively, as well as high area efficiency (1.72 TFLOPS/mm² and 3.45 TOPS/mm²).
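To illustrate why layers with few input channels leave MACs idle, and how a reconfigured flow could recover up to a fourfold gain, the following is a minimal toy model. It is not the datapath described in the article: the 32-lane group width, the assumption that lanes are filled along the input-channel dimension, and the `max_fold` parameter (how many independent work items can share otherwise idle lanes) are all hypothetical choices made for illustration.

```python
from math import ceil


def utilization(channels: int, lanes: int, max_fold: int = 1) -> float:
    """Fraction of MAC lanes doing useful work in a toy model.

    Assumptions (hypothetical, not taken from the article):
    - a MAC group has `lanes` multipliers filled along the input-channel axis;
    - when channels < lanes, a reconfigured flow may pack up to `max_fold`
      independent work items (e.g. extra spatial positions) into idle lanes.
    """
    if channels >= lanes:
        # Wide layers: only the last pass over the channels may be partly empty.
        passes = ceil(channels / lanes)
        return channels / (passes * lanes)
    # Narrow layers: folding is limited by both max_fold and the free lanes.
    fold = min(max_fold, lanes // channels)
    return channels * fold / lanes


LANES = 32  # hypothetical group width

# Shallow first layer with 3 input channels (e.g. RGB).
baseline = utilization(3, LANES)                 # 3 of 32 lanes busy
reconfigured = utilization(3, LANES, max_fold=4)  # 12 of 32 lanes busy
print(baseline, reconfigured, reconfigured / baseline)
```

In this model the gain saturates once the lanes are full: a 16-channel layer can only fold twice before reaching full utilization, which is one way to read "up to four times" as an upper bound rather than a constant.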