JSSC 2008, Issue 1, Digital Circuits, 65nm

An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS

An 80-core processor in 65nm CMOS delivering 1 TFLOPS
65nm CMOS, 1.07V, 4.27GHz, 97W, 1.0 TFLOPS
Multicore processor, Network-on-chip, Floating-point arithmetic, Energy-efficiency optimization, High-frequency design
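Innovation 1: 2-D mesh network - An on-chip network organized as an 8 × 10 2-D array provides 2 Terabits/s of bisection bandwidth, improving communication efficiency between cores; this is a system-level innovation. The design attributes its high-throughput data transfer to an optimized routing algorithm and low-latency links.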
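Innovation 2: Single-cycle floating-point multiply-accumulator (FPMAC) - Each tile integrates two pipelined single-precision FPMACs with a single-cycle accumulation loop, designed to operate at 4 GHz. This circuit-level innovation streamlines the instruction path and optimizes data flow, bringing measured compute throughput to over 1.0 TFLOPS.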
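Innovation 3: Dynamic sleep transistors - Combined with fine-grained clock gating and body-bias techniques, dynamic sleep transistors hold power to 97 W in the 65nm process. This power-management innovation uses body biasing to adjust transistor threshold voltage, containing power while the chip runs at 4.27 GHz. The 15-FO4 figure quoted for the design describes its per-cycle logic depth in fanout-of-4 inverter delays; it is not an energy-efficiency ratio.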
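Innovation 4: Mesochronous clocking - A mesochronous clocking scheme, in which every tile runs at the same frequency while phase differences between tiles are tolerated, addresses clock distribution across the large array; this is a methodological innovation. By decoupling local clock domains and applying adaptive calibration, it keeps the 80 tiles correctly timed across the 275 mm² die.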
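The headline numbers in the points above can be cross-checked with simple arithmetic. The sketch below is only a back-of-envelope check, not the authors' own accounting. It rests on two assumptions that this page does not state: each FPMAC is credited with 2 FLOPs per cycle (one multiply plus one add), and each mesh link carries 32 data bits per cycle in each direction. The function names are illustrative.

```python
# Back-of-envelope check of the throughput and bandwidth figures.
# Assumptions (not stated on this page): 2 FLOPs per FPMAC per cycle,
# and 32 data bits per mesh link per cycle in each direction.

TILES = 80
FPMACS_PER_TILE = 2
FLOPS_PER_FPMAC_PER_CYCLE = 2  # assumption: multiply + add


def peak_tflops(freq_ghz: float) -> float:
    """Peak throughput in TFLOPS if every FPMAC issues on every cycle."""
    flops_per_cycle = TILES * FPMACS_PER_TILE * FLOPS_PER_FPMAC_PER_CYCLE
    return flops_per_cycle * freq_ghz / 1000.0


def bisection_tbps(rows: int, cols: int, link_bits: int, freq_ghz: float) -> float:
    """Bisection bandwidth in Tb/s for a 2-D mesh cut across its longer side.

    Cutting the longer dimension in half severs min(rows, cols) links,
    and each link is counted once per direction.
    """
    crossing_links = min(rows, cols) * 2
    return crossing_links * link_bits * freq_ghz / 1000.0


if __name__ == "__main__":
    print(f"peak at 4.00 GHz: {peak_tflops(4.0):.2f} TFLOPS")
    print(f"peak at 4.27 GHz: {peak_tflops(4.27):.2f} TFLOPS")
    print(f"bisection at 4 GHz: {bisection_tbps(8, 10, 32, 4.0):.2f} Tb/s")
```

Under these assumptions the peak comes out near 1.28 TFLOPS at 4 GHz and near 1.37 TFLOPS at 4.27 GHz, so a measured result of just over 1.0 TFLOPS would sit below peak, as expected for real benchmarks. The bisection estimate comes out close to the quoted 2 Terabits/s.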
Abstract
This paper describes an integrated network-on-chip architecture containing 80 tiles arranged as an 8 × 10 2-D array of floating-point cores and packet-switched routers, both designed to operate at 4 GHz. Each tile has two pipelined single-precision floating-point multiply accumulators (FPMAC) which feature a single-cycle accumulation loop for high throughput. The on-chip 2-D mesh network provides a bisection bandwidth of 2 Terabits/s. The 15-FO4 design employs mesochronous clocking, fine-grained clock gating, dynamic sleep transistors, and body-bias techniques. In a 65-nm eight-metal CMOS process, the 275 mm² custom design contains 100 M transistors. The fully functional first silicon achieves over 1.0 TFLOPS of performance on a range of benchmarks while dissipating 97 W at 4.27 GHz and 1.07 V supply.
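Two derived quantities follow from the abstract's figures: energy efficiency at the reported operating point, and the clock period implied by the 4 GHz target. The sketch below is illustrative only and its helper names are hypothetical. It reads "15-FO4" as a cycle of roughly 15 fanout-of-4 inverter delays. The abstract does not say at which voltage that figure applies, so the implied FO4 delay is a rough estimate.

```python
# Derived figures from the abstract's reported operating point.
# "15-FO4" is read here as a cycle of about 15 fanout-of-4 inverter delays;
# the voltage at which that holds is not stated, so treat the FO4 result
# as a rough estimate rather than a measured value.


def gflops_per_watt(tflops: float, watts: float) -> float:
    """Energy efficiency in GFLOPS/W."""
    return tflops * 1000.0 / watts


def cycle_time_ps(freq_ghz: float) -> float:
    """Clock period in picoseconds for a frequency given in GHz."""
    return 1000.0 / freq_ghz


def implied_fo4_ps(freq_ghz: float, fo4_per_cycle: float = 15.0) -> float:
    """FO4 delay implied by fitting fo4_per_cycle FO4 delays into one cycle."""
    return cycle_time_ps(freq_ghz) / fo4_per_cycle


if __name__ == "__main__":
    # 1.0 TFLOPS is a lower bound ("over 1.0 TFLOPS"), so this is a floor.
    print(f"efficiency: at least {gflops_per_watt(1.0, 97.0):.1f} GFLOPS/W")
    print(f"cycle time at 4 GHz: {cycle_time_ps(4.0):.0f} ps")
    print(f"implied FO4 delay: about {implied_fo4_ps(4.0):.1f} ps")
```

Because the abstract reports "over 1.0 TFLOPS" at 97 W, the efficiency computed here, a little above 10 GFLOPS/W, is a floor for that operating point rather than an exact value.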