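This paper presents a scalable deep learning/inference processor for big-data applications that uses a tetra-parallel MIMD architecture and achieves an energy efficiency of 1.93TOPS/W. To address the computation and bandwidth bottlenecks caused by massively iterative weight updates in deep learning training, the processor improves performance through its parallel architecture and memory optimizations.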
Seongwook Park, Kyeongryeol Bong, Dongjoo Shin, Jinmook Lee, Sungpill Choi, Hoi-Jun Yoo
KAIST, Daejeon, Korea

Recently, deep learning (DL) has become a popular approach for big-data analysis in image retrieval with high accuracy [1]. As Fig. 4.6.1 shows, various applications, such as text, 2D image, and motion recognition, use DL due to its best-in-class recognition accuracy. There are two types of DL: supervised DL with labeled data and unsupervised DL with unlabeled data. With unsupervised DL, most of the learning time is spent in massively iterative weight updates for a restricted Boltzmann machine [2]. For a ~100MB training dataset, >100 TOP computational capability and ~40GB/s IO and SRAM data bandwidth are required. Consequently, a 3.4GHz CPU needs >10 hours of learning time with a ~100K input-vector dataset and takes ~1 second for recognition, which is far from real-time processing. Thus, DL is typically done using cloud servers or high-performance GPU environments with learning-on-server capability. However, the wide use of