Optimizing operator computation of MiniGo on high-performance heterogeneous accelerator

doi:10.11887/j.cn.202401014

Affiliations: (1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China; 2. National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China)
About the author: QIAO Peng (born 1988), male, from Baoding, Hebei Province; assistant research fellow, Ph.D. E-mail: pengqiao@nudt.edu.cn
CLC number: TP391
Funding: National Key Laboratory stable-support funding program (WDZC20205500104)


Abstract:

An efficient parallel computing method is proposed, based on the characteristics of a high-performance heterogeneous accelerator and the training mode of MiniGo. On-chip computing resources are planned so that work on the heterogeneous devices is pipelined in parallel. Because the heterogeneous devices share a storage segment, a shared-memory coding mode is designed to reduce data-transfer overhead. Because each digital signal processing (DSP) cluster contains multiple computing resources, different parallel optimization strategies are designed for different operators according to their compute and memory-access characteristics. In addition, an easy-to-use high-performance operator library is implemented for TensorFlow. Experimental results show that the method achieves multi-core parallel computation of typical operators: the convolution operator reaches a speedup of 24.69 over a single core. Compared with a trimmed-down 8-core version of the FT2000+ CPU, the method achieves speedups of 3.83 for training and 1.5 for self-play.
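The abstract does not spell out how operators are split across cores, so the following is only a minimal sketch of one common decomposition, assuming a convolution whose output channels are independent and can each be handed to a separate worker. The function names (`conv2d_channel`, `conv2d_serial`, `conv2d_parallel`) and the toy data are hypothetical and are not taken from the paper or from its TensorFlow library. Python threads stand in for DSP cores purely to show the partitioning; they do not reproduce the reported speedups.

```python
from concurrent.futures import ThreadPoolExecutor


def conv2d_channel(image, kernel):
    """Valid-mode 2D cross-correlation of one image with one kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def conv2d_serial(image, kernels):
    """Reference path: every output channel is computed by one worker."""
    return [conv2d_channel(image, k) for k in kernels]


def conv2d_parallel(image, kernels, num_workers=4):
    """Assumed strategy: one output channel per task, spread over workers."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() preserves input order, so result[c] corresponds to kernels[c].
        return list(pool.map(lambda k: conv2d_channel(image, k), kernels))


# Toy input: a 5x5 image and four 3x3 kernels filled with 1.0 .. 4.0.
image = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
kernels = [[[float(n)] * 3 for _ in range(3)] for n in range(1, 5)]

serial_out = conv2d_serial(image, kernels)
parallel_out = conv2d_parallel(image, kernels)
```

Comparing `parallel_out` with `serial_out` checks that the partitioned computation matches the single-worker reference. Whether output channels, rows, or batches are the best axis to split on a real accelerator depends on the compute and memory-access profile of each operator, which is the trade-off the abstract refers to.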


Cite this article

QIAO Peng, HE Zhouyu, LI Rongchun, et al. Optimizing operator computation of MiniGo on high-performance heterogeneous accelerator[J]. Journal of National University of Defense Technology, 2024, 46(1): 131-140.


History

Received: 2022-12-15
Published online: 2024-01-28
Publication date: 2024-02-28
