Accelerating parallel reduction and scan primitives on ReRAM-based architectures

Affiliation:

(College of Information Science and Engineering, China University of Petroleum, Beijing 102249, China)

About the authors:

JIN Zhou (born 1990), female, from Yancheng, Jiangsu, lecturer, Ph.D., E-mail: jinzhou@cup.edu.cn; LIU Weifeng (corresponding author), male, professor, Ph.D., doctoral supervisor, E-mail: weifeng.liu@cup.edu.cn

CLC number:

TN95

Fund Project:

National Natural Science Foundation of China (61972415); Open Project of the State Key Laboratory of Computer Architecture (CARCHA202115)

    Abstract:

    Reduction and scan are two core primitives in parallel computing, so accelerating them is of great importance. However, the von Neumann architecture suffers from performance and energy bottlenecks, known as the “memory wall”, because of unavoidable data movement. Recently, non-volatile memory (NVM) such as resistive random access memory (ReRAM) has enabled in-situ computing without data movement, and its crossbar architecture can naturally perform a matrix-vector multiplication (GEMV) in one step. ReRAM-based architectures have shown great potential in applications such as machine learning and graph computing. This paper proposes parallel acceleration methods for reduction and scan primitives on ReRAM-based processing-in-memory (PIM) architectures. It focuses on the computing process expressed in terms of GEMV and on its mapping onto the ReRAM crossbar, realizing a software-hardware co-design that reduces power consumption and improves performance. Compared with a GPU, the proposed reduction and scan primitives achieve speedups of up to two orders of magnitude, and the average speedup also reaches two orders of magnitude. Segmented reduction and scan achieve speedups of up to five (four on average) orders of magnitude. Power consumption is also reduced by 79%.
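The abstract describes reduction and scan as computations built on matrix-vector multiplication, but this page does not give the matrices the paper uses or how they are laid out on the crossbar. The sketch below is therefore only an illustration under a common textbook assumption: a sum-reduction is an all-ones row times the input, an inclusive scan is a lower-triangular all-ones matrix times the input, and the segmented variants use block-diagonal versions of those matrices. All function names (`matvec`, `reduce_as_gemv`, and so on) are hypothetical, and `matvec` is a plain software stand-in for the single GEMV step a crossbar would perform.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product; stands in for one crossbar GEMV step."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def reduce_as_gemv(x):
    """Sum-reduction: a single all-ones row multiplied by x."""
    ones_row = [[1] * len(x)]
    return matvec(ones_row, x)[0]


def scan_as_gemv(x):
    """Inclusive prefix sum: lower-triangular all-ones matrix times x."""
    n = len(x)
    lower = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
    return matvec(lower, x)


def _segment_ids(seg_lengths):
    """Map each element position to the index of the segment it belongs to."""
    return [s for s, length in enumerate(seg_lengths) for _ in range(length)]


def segmented_reduce_as_gemv(x, seg_lengths):
    """One row per segment, with ones only over that segment's columns."""
    assert sum(seg_lengths) == len(x)
    seg_id = _segment_ids(seg_lengths)
    rows = [[1 if seg_id[j] == s else 0 for j in range(len(x))]
            for s in range(len(seg_lengths))]
    return matvec(rows, x)


def segmented_scan_as_gemv(x, seg_lengths):
    """Block-diagonal lower-triangular matrix: the scan restarts per segment."""
    assert sum(seg_lengths) == len(x)
    n = len(x)
    seg_id = _segment_ids(seg_lengths)
    block = [[1 if j <= i and seg_id[i] == seg_id[j] else 0 for j in range(n)]
             for i in range(n)]
    return matvec(block, x)


x = [3, 1, 4, 1, 5, 9]
total = reduce_as_gemv(x)                        # 23
prefix = scan_as_gemv(x)                         # [3, 4, 8, 9, 14, 23]
seg_totals = segmented_reduce_as_gemv(x, [2, 4])  # [4, 19]
seg_prefix = segmented_scan_as_gemv(x, [2, 4])    # [3, 4, 4, 5, 10, 19]
```

In this formulation every output element comes from the same multiply-accumulate pattern, which is why a device that evaluates a whole GEMV in one step is a natural fit. The dense matrices here grow quadratically with the input length, so a practical mapping would need to tile the input across crossbars; how the paper does that is not stated on this page.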

Cite this article

JIN Zhou, DUAN Yiru, YI Enxin, et al. Accelerating parallel reduction and scan primitives on ReRAM-based architectures[J]. Journal of National University of Defense Technology, 2022, 44(5): 80-91.

History
  • Received: 2021-12-27
  • Published online: 2022-09-28
  • Publication date: 2022-10-28