Parallel optimization of convolution algorithms on multi-core digital signal processors

2024,46(1):103-112
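XU Jinwei
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, Hunan, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, Hunan, China, xujinwei13@nudt.edu.cn
WANG Qinglin
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, Hunan, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, Hunan, China
LI Yalin
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, Hunan, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, Hunan, China
JIANG Jingfei
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, Hunan, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, Hunan, China
GAO Lei
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, Hunan, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, Hunan, China
LI Rongchun
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, Hunan, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, Hunan, China
LI Dongsheng
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, Hunan, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, Hunan, China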
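Abstract (translated from the Chinese):
Based on the features of the heterogeneous multi-core digital signal processing (DSP) chip independently developed by the National University of Defense Technology and on the properties of the convolution algorithm itself, a high-performance multi-core parallel convolution implementation for the multi-core DSP architecture is proposed. For 1×1 convolution, a feature-map-level multi-core parallel scheme is proposed; for convolutions with a kernel size larger than 1, a window-level multi-core parallel optimization design is proposed, together with an intra-core parallel implementation based on element-wise vectorized computation. Experimental results show that the proposed method achieves a single-core computing efficiency of up to 64.95%; under limited bandwidth, the multi-core parallel scaling efficiency reaches 48.36% to 88.52%; and on the typical network ResNet50, it runs 5.39 times faster than an E5-2640 CPU.
Funding:
National Natural Science Foundation of China (61732018)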

Parallel optimization of convolution algorithm on multi-core DSP

XU Jinwei
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China, xujinwei13@nudt.edu.cn
WANG Qinglin
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
LI Yalin
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
JIANG Jingfei
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
GAO Lei
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
LI Rongchun
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
LI Dongsheng
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
Abstract:
Based on the characteristics of the heterogeneous multi-core DSP (digital signal processing) chip independently developed by the National University of Defense Technology and on the properties of the convolution algorithm, a high-performance multi-core parallel convolution implementation scheme for the multi-core DSP architecture is proposed. A feature-map-level multi-core parallel scheme is proposed for 1×1 convolution. For convolutions with a kernel size larger than 1, a window-level multi-core parallel optimization design is proposed, along with an intra-core parallel implementation based on element-wise vectorization. The experimental results show that the proposed parallel optimization method reaches a maximum single-core computing efficiency of 64.95%. When bandwidth is limited, the multi-core parallel scaling efficiency still reaches 48.36% to 88.52%. Compared with an E5-2640 CPU, the implementation achieves a 5.39× speedup on the typical network ResNet50.
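As a rough illustration of the 1×1 case described in the abstract, the following minimal Python sketch shows one way a 1×1 convolution can be split across workers by partitioning the feature map. The abstract does not specify how the feature map is divided, so the row-wise partition, the function names (conv1x1_block, conv1x1_parallel), and the num_cores parameter are all assumptions made for illustration, not the paper's implementation. Python threads here only demonstrate the partitioning and do not model DSP cores or their performance.

```python
from concurrent.futures import ThreadPoolExecutor


def conv1x1_block(x, w, h0, h1):
    """1x1 convolution restricted to feature-map rows h0..h1-1.

    x is indexed x[c_in][row][col]; w is indexed w[c_out][c_in].
    A 1x1 kernel mixes channels at each pixel and never touches
    neighbouring pixels, so row slices are fully independent.
    """
    c_in, width = len(x), len(x[0][0])
    return [
        [
            [sum(w[co][ci] * x[ci][h][col] for ci in range(c_in))
             for col in range(width)]
            for h in range(h0, h1)
        ]
        for co in range(len(w))
    ]


def conv1x1_parallel(x, w, num_cores=4):
    """Split the feature map by rows and give each worker one slice."""
    height = len(x[0])
    # Hypothetical static partition: contiguous row ranges, one per worker.
    step = max(1, -(-height // num_cores))  # ceiling division
    ranges = [(h0, min(h0 + step, height)) for h0 in range(0, height, step)]
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        blocks = list(pool.map(lambda r: conv1x1_block(x, w, *r), ranges))
    # Stitch the per-worker row slices back together, channel by channel.
    return [[row for blk in blocks for row in blk[co]]
            for co in range(len(w))]
```

Because each output pixel of a 1×1 convolution depends only on the input channels at the same pixel, the slices need no halo or overlap, which is what makes this kind of partition simple compared with larger kernels.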
Received:
2022-09-20