Cite this article: | JIN Zhou, DUAN Yiru, YI Enxin, et al. Accelerating parallel reduction and scan primitives on ReRAM-based architectures[J]. Journal of National University of Defense Technology, 2022, 44(5): 80-91. |
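Accelerating parallel reduction and scan primitives on ReRAM-based architectures |
JIN Zhou, DUAN Yiru, YI Enxin, JI Haonan, LIU Weifeng |
(College of Information Science and Engineering, China University of Petroleum, Beijing 102249, China)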
|
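Abstract: |
Reduction and scan are core primitives in parallel computing, so accelerating them in parallel is essential. Under the von Neumann architecture, however, unavoidable data movement exposes them to performance and power bottlenecks such as the “memory wall”. Recently, processing-in-memory architectures built on non-volatile memories such as ReRAM have supported in-situ computing that completes a matrix-vector multiplication in a single step, and they have shown great potential in applications such as machine learning and graph computing. This paper proposes parallel acceleration methods for reduction and scan on memristor-based processing-in-memory architectures. It focuses on the computation flow expressed as matrix-vector multiplications and on the mapping onto the memristor architecture, realizing a software-hardware co-design that lowers power consumption and improves performance. Compared with a GPU, the proposed reduction and scan primitives achieve speedups of up to two orders of magnitude, and the average speedup also reaches two orders of magnitude. Segmented reduction and scan achieve speedups of up to five orders of magnitude (four on average) and reduce power consumption by 79%. |
Keywords: reduction; scan; ReRAM; processing-in-memory architecture; parallel computing |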
DOI:10.11887/j.cn.202205009 |
Submission date: 2021-12-27 |
Funding: National Natural Science Foundation of China (61972415); Open Project of the State Key Laboratory of Computer Architecture (CARCHA202115) |
|
Accelerating parallel reduction and scan primitives on ReRAM-based architectures |
JIN Zhou, DUAN Yiru, YI Enxin, JI Haonan, LIU Weifeng |
(College of Information Science and Engineering, China University of Petroleum, Beijing 102249, China)
|
Abstract: |
Reduction and scan are two critical primitives in parallel computing, so accelerating them is of great importance. However, the von Neumann architecture suffers from performance and energy bottlenecks, known as the “memory wall”, because of unavoidable data movement. Recently, NVM (non-volatile memory) such as ReRAM (resistive random access memory) has enabled in-situ computing without data movement, and its crossbar architecture can naturally perform a parallel GEMV (general matrix-vector multiplication) operation in one step. ReRAM-based architectures have demonstrated great success in many areas, for example in accelerating machine learning and graph computing applications. This paper proposes parallel acceleration methods for the reduction and scan primitives on a ReRAM-based PIM (processing in memory) architecture. It focuses on the computing process expressed in terms of GEMV and on the mapping method onto the ReRAM crossbar, and realizes a software-hardware co-design that reduces power consumption and improves performance. Compared with a GPU, the proposed reduction and scan algorithms achieve speedups of up to two orders of magnitude, and the average speedup also reaches two orders of magnitude. Segmented reduction and scan achieve speedups of up to five (four on average) orders of magnitude. Meanwhile, power consumption decreases by 79%. |
Keywords: reduction; scan; ReRAM; processing in memory; parallel computing |
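
The abstract's central idea, expressing reduction and scan as matrix-vector products, can be illustrated with a small software sketch. This is not the paper's crossbar mapping, which the abstract does not detail. It only shows the underlying algebra under simple assumptions: addition as the operator, dense 0/1 matrices, and segments given by their start indices with the first start at 0. All function names (`matvec`, `reduce_via_gemv`, and so on) are hypothetical and chosen for illustration.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product; stands in for one GEMV step."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def reduce_via_gemv(x):
    """Sum reduction: a 1 x n all-ones matrix times x."""
    ones_row = [[1] * len(x)]
    return matvec(ones_row, x)[0]


def scan_via_gemv(x):
    """Inclusive prefix sum: a lower-triangular all-ones matrix times x."""
    n = len(x)
    lower = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
    return matvec(lower, x)


def segmented_reduce_via_gemv(x, seg_starts):
    """One sum per segment: row k selects the elements of segment k.

    Assumes seg_starts is sorted and begins with 0.
    """
    n = len(x)
    bounds = list(seg_starts) + [n]
    select = [
        [1 if bounds[k] <= j < bounds[k + 1] else 0 for j in range(n)]
        for k in range(len(seg_starts))
    ]
    return matvec(select, x)


def segmented_scan_via_gemv(x, seg_starts):
    """Inclusive prefix sum restarted at each segment start.

    The matrix is block-diagonal, each block lower-triangular all-ones.
    Assumes seg_starts is sorted and begins with 0.
    """
    n = len(x)
    starts = set(seg_starts)
    seg_id, current = [], -1
    for i in range(n):
        if i in starts:
            current += 1
        seg_id.append(current)
    block = [
        [1 if j <= i and seg_id[i] == seg_id[j] else 0 for j in range(n)]
        for i in range(n)
    ]
    return matvec(block, x)


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9]
    print(reduce_via_gemv(data))
    print(scan_via_gemv(data))
    print(segmented_reduce_via_gemv(data, [0, 2, 5]))
    print(segmented_scan_via_gemv(data, [0, 2, 5]))
```

In this formulation each primitive is a single product with a fixed 0/1 matrix, which is the property that makes a one-step in-memory GEMV attractive. In software the dense matrix costs O(n²) work, so the sketch is a model of the algebra rather than an efficient CPU implementation.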
|
|
|
|
|