Highly efficient training method of MiniGo on a large-scale heterogeneous computing platform
2024, 46(5): 209-218
LI Rongchun
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China; rongchunli@nudt.edu.cn, he535040@163.com
HE Zhouyu
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China; rongchunli@nudt.edu.cn, he535040@163.com
QIAO Peng
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
JIANG Jingfei
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
DOU Yong
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
LI Dongsheng
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
Abstract:
An efficient multi-level parallel training method for training MiniGo agents on large-scale heterogeneous computing platforms is proposed, comprising inter-node task-level parallelism, central processing unit-digital signal processor (CPU-DSP) heterogeneous parallelism, and intra-DSP-core parallelism. An efficient input/output deployment is implemented that eliminates the network communication bottleneck. A heterogeneous-computing memory management scheme oriented to the CPU-DSP shared-memory architecture is proposed to reduce data movement between heterogeneous devices. Shared-memory programming is optimized, and the dense convolution operators are accelerated on the DSP. Results show that, compared with 16-core CPU computation, the maximum speedup of single-core DSP operator acceleration reaches 16.44. With this method, the number of computing nodes scales from 1 067 to 4 139, the time required to reach the given termination condition decreases from 43.02 h to 16.05 h, and the scaling efficiency is 69.1%. The evaluation shows that the proposed method enables efficient parallel training of MiniGo on large-scale heterogeneous computing platforms.
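As a point of reference, the reported 69.1% figure is consistent with the usual definition of scaling efficiency as the achieved speedup divided by the increase in node count; the abstract does not state the exact definition used, so the following is only a minimal arithmetic check under that assumption, using the numbers quoted above.

% Scaling-efficiency check (assumes efficiency = achieved speedup / node-count ratio)
\[
E \;=\; \frac{T_{1067}/T_{4139}}{N_{4139}/N_{1067}}
  \;=\; \frac{43.02\ \text{h} / 16.05\ \text{h}}{4139 / 1067}
  \;\approx\; \frac{2.681}{3.879}
  \;\approx\; 0.691 \;=\; 69.1\%
\]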
Funding:
National Natural Science Foundation of China (61902415)
Received:
2022-06-27