Cite this article: | ZHANG Xu, CHANG Yisong, ZHANG Ke, et al. Design and implementation of a novel off-chip memory access path for graph computing[J]. Journal of National University of Defense Technology, 2020, 42(2): 13-22. |
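Optimized design and implementation of a processor memory access path for graph computing applications |
ZHANG Xu1,2, CHANG Yisong1,2,3, ZHANG Ke1,2,3, CHEN Mingyu1,2,3 |
(1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; 2. University of Chinese Academy of Sciences, Beijing 100049; 3. Peng Cheng Laboratory, Shenzhen, Guangdong 518000)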
|
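Abstract: |
Targeting the memory access characteristics of graph computing applications, this paper proposes and implements a high-concurrency memory access module, the High Concurrency and high Performance Fetcher (HCPF), which supports highly concurrent, out-of-order, and asynchronous memory access. Through hardware-software co-design, HCPF can handle 192 memory access requests of 8 types simultaneously, with user-defined access granularity, meeting the demand of graph computing applications for massive, low-latency, fine-grained data access. HCPF also extends a custom memory-semantic interconnect across computing nodes, supporting fine-grained direct access to remote memory and providing a technical basis for future distributed graph computing frameworks. Combining these two core contributions, a RISC-V System-on-Chip (SoC) architecture supporting HCPF was designed and implemented on a pipelined RISC-V processor core, an FPGA-based prototype verification platform was built, and a preliminary performance evaluation of HCPF was carried out with in-house test programs. Experimental results show that, compared with the original memory access path, HCPF raises the performance of the two kinds of random memory access, array-based and random-address-based, to up to 3.5 times and 2.7 times, respectively. The latency of directly accessing 4 bytes of data in remote memory is only 1.63 μs. |
Keywords: memory-level parallelism; memory access path; graph computing applications |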
DOI:10.11887/j.cn.202002002 |
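Received: 2019-09-15 |
Funding: National Key Research and Development Program of China (2017YFB1001602); National Natural Science Foundation of China (61702485); Youth Innovation Promotion Association of the Chinese Academy of Sciences (2017143) |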
|
Design and implementation of a novel off-chip memory access path for graph computing |
ZHANG Xu1,2, CHANG Yisong1,2,3, ZHANG Ke1,2,3, CHEN Mingyu1,2,3 |
(1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;2. University of Chinese Academy of Sciences, Beijing 100049, China;3. Peng Cheng Laboratory, Shenzhen 518000, China)
|
Abstract: |
A novel asynchronous memory access path that supports highly concurrent, out-of-order off-chip memory requests is proposed. To satisfy the requirements of graph applications, the proposed path implements a software-defined interface, built with a hardware-software co-design methodology, that handles 192 concurrent off-chip memory requests of 8 types with user-defined granularity. A custom memory-semantic interconnect is designed for fine-grained remote memory access among computing nodes, for use in future distributed graph processing scenarios. Finally, the proposed memory access path is integrated into a RISC-V-based system-on-chip (SoC) architecture, and an FPGA prototype is implemented. Based on custom random-access microbenchmarks, preliminary evaluation results show that the proposed asynchronous memory access path improves the performance of array-based and random-address-based off-chip memory access to up to 3.5x and 2.7x that of the original access path, respectively, and that accessing 4 bytes of data in remote memory takes only 1.63 μs. |
Keywords: memory-level parallelism; memory access path; graph computing |
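The abstract describes a path that keeps many requests in flight at once and lets responses return out of order. The Python sketch below is only a software illustration of that general idea, not the HCPF hardware design: the `RequestTable` class, its methods, the tag-recycling scheme, and the toy `memory` dictionary are all hypothetical names introduced here. Only the capacity of 192 outstanding requests and the 4-byte access size come from the abstract.

```python
from collections import deque

MAX_OUTSTANDING = 192  # from the abstract: 192 requests handled simultaneously


class RequestTable:
    """Hypothetical tag-indexed table of in-flight memory requests."""

    def __init__(self, capacity=MAX_OUTSTANDING):
        self.free_tags = deque(range(capacity))  # tags not currently in use
        self.pending = {}   # tag -> (address, size in bytes), still in flight
        self.results = {}   # tag -> data that has already returned

    def issue(self, address, size):
        """Issue a request without waiting; return its tag, or None if full."""
        if not self.free_tags:
            return None
        tag = self.free_tags.popleft()
        self.pending[tag] = (address, size)
        return tag

    def complete(self, tag, data):
        """Record a response; responses may arrive in any order."""
        del self.pending[tag]
        self.results[tag] = data

    def collect(self, tag):
        """Return the data of a finished request and recycle its tag."""
        data = self.results.pop(tag)
        self.free_tags.append(tag)
        return data


def demo():
    # Toy backing store (assumption): each 4-byte word holds twice its address.
    memory = {addr: addr * 2 for addr in range(0, 1024, 4)}
    table = RequestTable()
    tags = [table.issue(addr, 4) for addr in (40, 8, 512)]
    # Complete in reverse issue order to mimic out-of-order responses.
    for tag in reversed(tags):
        address, _size = table.pending[tag]
        table.complete(tag, memory[address])
    # The requester matches data to requests by tag, not by arrival order.
    return [table.collect(tag) for tag in tags]
```

Calling `demo()` issues three 4-byte requests, completes them in reverse order, and still returns the data in issue order because each response is matched by its tag. A real design would bound the table in hardware and stall or retry when `issue` finds no free tag.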