Cite this article: XU Jinwei, WANG Qinglin, LI Yalin, et al. Parallel optimization of convolution algorithm on multi-core DSP[J]. Journal of National University of Defense Technology, 2024, 46(1): 103-112.
Parallel optimization of convolution algorithm on multi-core DSP
XU Jinwei1,2, WANG Qinglin1,2, LI Yalin1,2, JIANG Jingfei1,2, GAO Lei1,2, LI Rongchun1,2, LI Dongsheng1,2
(1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China; 2. National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China)

DOI: 10.11887/j.cn.202401011
Received: 2022-09-20
Funding: National Natural Science Foundation of China (61732018)

Abstract:
Based on the characteristics of the heterogeneous multi-core digital signal processing (DSP) chip independently developed by the National University of Defense Technology and of the convolution algorithm itself, a high-performance multi-core parallel convolution implementation for the multi-core DSP architecture is proposed. For 1×1 convolution, a feature-map-level multi-core parallel scheme is proposed. For convolutions with kernels larger than 1×1, a window-level multi-core parallel optimization design is proposed, together with an intra-core parallel optimization based on element-wise vectorized computation. Experimental results show that the proposed method reaches a single-core computing efficiency of up to 64.95%. Under limited bandwidth, the multi-core parallel scaling efficiency reaches 48.36% to 88.52%. On the typical network ResNet50, it achieves a 5.39× speedup over an E5-2640 CPU.
Keywords: multi-core DSP; convolutional neural networks (CNNs); convolution algorithms; parallel optimization
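
The abstract names a feature-map-level multi-core scheme for 1×1 convolution but gives no implementation detail. The sketch below is only an illustration of the general idea, not the authors' implementation. It treats a 1×1 convolution as a matrix product over flattened pixels and gives each core a contiguous slice of the feature map. The function names, the contiguous partitioning, and the definition of scaling efficiency (single-core time divided by multi-core time times core count) are all assumptions introduced here.

```python
def conv1x1(weights, fmap):
    """1x1 convolution as a matrix product.

    weights: [C_out][C_in]; fmap: [C_in][P], where P = H*W flattened pixels.
    """
    c_in = len(fmap)
    pixels = len(fmap[0])
    return [[sum(w_row[c] * fmap[c][p] for c in range(c_in))
             for p in range(pixels)]
            for w_row in weights]


def split_pixels(pixels, cores):
    """Near-equal contiguous pixel ranges, one per core (hypothetical partitioning)."""
    base, extra = divmod(pixels, cores)
    ranges, start = [], 0
    for core in range(cores):
        size = base + (1 if core < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def conv1x1_partitioned(weights, fmap, cores):
    """Each simulated core computes all output channels for its own pixel slice."""
    pixels = len(fmap[0])
    out = [[0] * pixels for _ in weights]
    # On real hardware these iterations would run concurrently, one per core.
    for lo, hi in split_pixels(pixels, cores):
        tile = [row[lo:hi] for row in fmap]
        part = conv1x1(weights, tile)
        for o, row in enumerate(part):
            out[o][lo:hi] = row
    return out


def scaling_efficiency(t_single, t_multi, cores):
    """Assumed definition: speedup over one core divided by the core count."""
    return t_single / (t_multi * cores)
```

Because every output pixel of a 1×1 convolution depends only on the input pixel at the same position, the slices need no halo exchange, which is one plausible reason to partition along the feature map for this case. The partitioned result matches the single-core result exactly.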