Parallel optimization of convolution algorithm on multi-core DSP

doi:10.11887/j.cn.202401011

Affiliations: (1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China; 2. National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China)
Author biography: XU Jinwei (born 1990), male, from Huaiyang, Henan; assistant research fellow, Ph.D.; E-mail: xujinwei13@nudt.edu.cn
CLC number: TP391
Funding: National Natural Science Foundation of China (61732018)


Abstract:

To exploit both the features of the heterogeneous multi-core digital signal processing (DSP) chip independently developed by the National University of Defense Technology and the characteristics of the convolution algorithm, a high-performance multi-core parallel convolution implementation for multi-core DSP architectures is proposed. A feature-map-level multi-core parallel scheme is proposed for 1×1 convolution. For convolutions with kernels larger than 1×1, a window-level multi-core parallel optimization is proposed, together with an intra-core parallel implementation based on element-wise vectorized computation. Experimental results show that the proposed method reaches a single-core computing efficiency of up to 64.95%. Under limited bandwidth, the multi-core parallel scaling efficiency reaches 48.36% to 88.52%. On the typical network ResNet50, the implementation achieves a 5.39× speedup over an E5-2640 CPU.
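The abstract names two ways of dividing convolution work across cores but does not spell out the partitioning. The sketch below is only an illustration under stated assumptions, not the authors' implementation: it assumes that feature-map-level parallelism means giving each core a contiguous block of output feature maps of a 1×1 convolution, and that window-level parallelism means giving each core a contiguous block of output rows (sliding-window positions) of a K×K convolution. Python threads stand in for DSP cores, and all function names (`split_range`, `conv1x1_maps`, `conv_kxk_rows`, `run_on_cores`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(n, parts):
    """Split range(n) into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for p in range(parts):
        stop = start + base + (1 if p < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def conv1x1_maps(x, w, m0, m1):
    """1x1 convolution restricted to output feature maps m0..m1-1.

    x[c][i] is input channel c at flattened pixel i; w[m][c] is the weight
    linking input channel c to output map m (assumed layout).
    """
    channels, pixels = len(x), len(x[0])
    return [[sum(w[m][c] * x[c][i] for c in range(channels)) for i in range(pixels)]
            for m in range(m0, m1)]


def conv_kxk_rows(x, w, r0, r1):
    """KxK convolution (stride 1, no padding, one channel) for output rows r0..r1-1.

    Each kernel element scales a shifted input row and is accumulated into the
    output row, mimicking an element-wise vector multiply-accumulate.
    """
    k = len(w)
    out_w = len(x[0]) - k + 1
    rows = []
    for r in range(r0, r1):
        acc = [0] * out_w
        for i in range(k):
            for j in range(k):
                wij, src = w[i][j], x[r + i]
                for col in range(out_w):  # stands in for one vector lane per output
                    acc[col] += wij * src[col + j]
        rows.append(acc)
    return rows


def run_on_cores(kernel, x, w, n_items, cores):
    """Run `kernel` on contiguous chunks of n_items, one chunk per simulated core."""
    with ThreadPoolExecutor(max_workers=cores) as pool:
        parts = list(pool.map(lambda b: kernel(x, w, b[0], b[1]),
                              split_range(n_items, cores)))
    return [row for part in parts for row in part]
```

Because each chunk writes a disjoint slice of the output, the simulated cores need no synchronization beyond the final concatenation. How the real chip moves data between cores and memory, which the abstract identifies as the bandwidth limit on scaling, is not modeled here.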


Cite this article

XU Jinwei, WANG Qinglin, LI Yalin, et al. Parallel optimization of convolution algorithm on multi-core DSP[J]. Journal of National University of Defense Technology, 2024, 46(1): 103-112.


History

Received: 2022-09-20
Published online: 2024-01-28
Publication date: 2024-02-28
