Cite this article:
李荣春, 贺周雨, 乔鹏, 等. 面向大规模异构计算平台的MiniGo高效训练方法[J]. 国防科技大学学报, 2024, 46(5): 209-218.
LI Rongchun, HE Zhouyu, QIAO Peng, et al. High efficient training method of MiniGo on large-scale heterogeneous computing platform[J]. Journal of National University of Defense Technology, 2024, 46(5): 209-218.
DOI: 10.11887/j.cn.202405022
Received: 2022-06-27
Funding: National Natural Science Foundation of China (61902415)

High efficient training method of MiniGo on large-scale heterogeneous computing platform
LI Rongchun, HE Zhouyu, QIAO Peng, JIANG Jingfei, DOU Yong, LI Dongsheng
(National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China)

Abstract:
An efficient multi-level parallel training method for MiniGo agents on large-scale heterogeneous computing platforms was proposed, comprising task-level parallelism between nodes, CPU-DSP (central processing unit-digital signal processor) heterogeneous parallelism, and intra-core parallelism on the DSP. An efficient input/output deployment was implemented to eliminate the network communication bottleneck. A heterogeneous-computing memory management scheme oriented to the CPU-DSP shared-memory structure was proposed to reduce data movement between the heterogeneous devices. Shared-memory programming optimization was implemented, and the dense convolution operators were accelerated on the DSP. Results show that, compared with computation on 16 CPU cores, the maximum speedup of the single-core DSP operator acceleration reaches 16.44. With this method, the number of computing nodes is scaled from 1 067 to 4 139, the time required to reach the given termination condition is reduced from 43.02 h to 16.05 h, and the scaling efficiency is 69.1%. The evaluation shows that the proposed method enables efficient parallel training of MiniGo on large-scale heterogeneous computing platforms.
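To make the task-level parallelism between nodes more concrete, the following minimal C/MPI sketch shows one common way such a split can be organized: one rank runs the training task while the remaining ranks generate self-play games. This is only an illustration under assumptions, not code from the paper; run_selfplay and run_training are hypothetical placeholders, and the paper does not state that MPI is the mechanism used.

    /* Minimal illustrative sketch (assumptions, not the paper's code):
     * task-level parallelism between nodes, with rank 0 training and all
     * other ranks producing self-play games. */
    #include <mpi.h>
    #include <stdio.h>

    /* Hypothetical placeholders for the real self-play and training routines. */
    static void run_selfplay(int rank)        { printf("rank %d: self-play worker\n", rank); }
    static void run_training(int num_workers) { printf("rank 0: training, %d self-play workers\n", num_workers); }

    int main(int argc, char **argv)
    {
        int rank = 0, size = 1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0)
            run_training(size - 1);   /* one training task */
        else
            run_selfplay(rank);       /* remaining ranks generate games in parallel */

        MPI_Finalize();
        return 0;
    }

How self-play samples are transported back to the training task, and the ratio of self-play to training ranks, are where the input/output deployment described in the abstract would matter; the sketch shows only the task split.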
Keywords: MiniGo; large-scale heterogeneous computing platform; DSP
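The CPU-DSP shared-memory management mentioned in the abstract can likewise be pictured with a minimal C sketch. This is an assumption-laden illustration rather than the paper's API: shared_alloc and dsp_conv2d stand in for the platform's real heterogeneous runtime calls. The point illustrated is that, on a chip where the CPU and DSP share physical memory, tensors can be allocated once and handed to the DSP kernel by pointer, so no host-to-device copies are needed.

    /* Minimal illustrative sketch (hypothetical API, not the paper's code). */
    #include <stdlib.h>
    #include <string.h>

    /* Stand-ins: shared_alloc() would return memory visible to both CPU and DSP;
     * dsp_conv2d() would launch the dense-convolution kernel on the DSP cores. */
    static void *shared_alloc(size_t bytes) { return malloc(bytes); }
    static void dsp_conv2d(const float *in, const float *w, float *out, size_t n)
    {
        (void)in; (void)w;
        memset(out, 0, n * sizeof(float));   /* placeholder for the DSP kernel */
    }

    int main(void)
    {
        const size_t n = 1u << 20;

        /* Allocate once in the CPU-DSP shared region; both sides use the same buffers. */
        float *input   = shared_alloc(n * sizeof(float));
        float *weights = shared_alloc(n * sizeof(float));
        float *output  = shared_alloc(n * sizeof(float));

        memset(input,   0, n * sizeof(float));   /* CPU prepares data in place */
        memset(weights, 0, n * sizeof(float));

        dsp_conv2d(input, weights, output, n);   /* DSP reads/writes the same memory */

        free(input);
        free(weights);
        free(output);
        return 0;
    }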