Highly efficient training method of MiniGo on a large-scale heterogeneous computing platform
2024, 46(5): 209-218
LI Rongchun
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China; rongchunli@nudt.edu.cn, he535040@163.com
HE Zhouyu
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China; rongchunli@nudt.edu.cn, he535040@163.com
QIAO Peng
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
JIANG Jingfei
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
DOU Yong
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
LI Dongsheng
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
Abstract:
An efficient multi-level parallel training method for training MiniGo agents on large-scale heterogeneous computing platforms is proposed, comprising inter-node task-level parallelism, central processing unit-digital signal processor (CPU-DSP) heterogeneous parallelism, and intra-DSP-core parallelism. An efficient input/output deployment is implemented that eliminates the network communication bottleneck. A heterogeneous-computing memory management scheme oriented to the CPU-DSP shared-memory architecture is proposed to reduce data movement between heterogeneous devices. Shared-memory programming is optimized, and the dense convolution operators are accelerated on the DSP. Results show that, compared with 16-core CPU computation, the maximum speedup of single-core DSP operator acceleration reaches 16.44. With this method, the number of computing nodes scales from 1 067 to 4 139, the time required to reach the given termination condition decreases from 43.02 h to 16.05 h, and the scaling efficiency is 69.1%. The evaluation shows that the proposed method enables efficient parallel training of MiniGo on large-scale heterogeneous computing platforms.
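As a point of reference, the reported 69.1% figure is consistent with the usual definition of scaling efficiency as the achieved speedup divided by the increase in node count; the abstract does not state the exact definition used, so the following is only a minimal arithmetic check under that assumption, using the numbers quoted above.

% Scaling-efficiency check (assumes efficiency = achieved speedup / node-count ratio)
\[
E \;=\; \frac{T_{1067}/T_{4139}}{N_{4139}/N_{1067}}
  \;=\; \frac{43.02\ \text{h} / 16.05\ \text{h}}{4139 / 1067}
  \;\approx\; \frac{2.681}{3.879}
  \;\approx\; 0.691 \;=\; 69.1\%
\]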
Funding:
National Natural Science Foundation of China (61902415)
Received:
2022-06-27