Cite this article: WANG Qinglin, PEI Xiangdong, LIAO Linyu, et al. Evaluating matrix multiplication-based convolution algorithm on multi-core digital signal processors[J]. Journal of National University of Defense Technology, 2023, 45(1): 86-94.
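Performance evaluation of the matrix multiplication-based convolution algorithm on multi-core digital signal processors
WANG Qinglin1,2, PEI Xiangdong1, LIAO Linyu1,2, WANG Haoxu1,2, LI Rongchun1,2, MEI Songzhu1,2, LI Dongsheng1,2
(1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, Hunan, China; 2. Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, Hunan, China)

Abstract:
The matrix multiplication-based convolution algorithm provides a high-performance baseline implementation for a wide range of convolution configurations, which makes it the first choice when optimizing convolution performance for a given chip. Based on the characteristics of the Phytium heterogeneous multi-core digital signal processor (DSP) chip developed by National University of Defense Technology and on the properties of the matrix multiplication-based convolution algorithm itself, a high-performance parallel implementation of the algorithm for multi-core DSP architectures, called ftmEConv, is proposed. The algorithm consists of four parallelized parts, all running on the general-purpose multi-core DSP: input feature map transformation, filter transformation, matrix multiplication, and output feature map transformation. The performance of each part is improved by effectively exploiting the potential of the functional units in the general-purpose DSP cores. Experimental results show that ftmEConv achieves a computational efficiency of up to 42.90% and a speedup of up to 7.79 times over other matrix multiplication-based convolution implementations on the chip.
Keywords: multi-core digital signal processors; convolutional neural networks; convolution algorithms; algorithm optimization
DOI: 10.11887/j.cn.202301009
Received: 2022-09-13
Funding: National Natural Science Foundation of China (62002365)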
|
Evaluating matrix multiplication-based convolution algorithm on multi-core digital signal processors
WANG Qinglin1,2, PEI Xiangdong1, LIAO Linyu1,2, WANG Haoxu1,2, LI Rongchun1,2, MEI Songzhu1,2, LI Dongsheng1,2
(1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China; 2. Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China)

Abstract:
The matrix multiplication-based convolution algorithm, which can efficiently implement convolutions with different parameters, is the first choice for convolution performance optimization on a given chip. Based on the architecture of the Phytium heterogeneous multi-core digital signal processors (DSPs) developed by National University of Defense Technology and the characteristics of the matrix multiplication-based convolution algorithm, a parallel implementation of the algorithm for different convolutions on multi-core DSPs, called ftmEConv, was proposed. ftmEConv consists of four parallelized parts (input feature map transformation, filter transformation, matrix multiplication, and output feature map transformation), all of which were optimized for multi-core DSPs. The performance of each part was improved by effectively exploiting the potential of all functional units in the DSP cores. The experimental results demonstrate that ftmEConv achieves a computational efficiency of up to 42.90%. Compared with other implementations of the matrix multiplication-based convolution algorithm on heterogeneous chips, ftmEConv achieves a speedup of up to 7.79 times.
Keywords: multi-core digital signal processors; convolutional neural networks; convolution algorithms; algorithm optimization
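
For readers unfamiliar with the four-part structure named in the abstract, the following is a minimal NumPy sketch of a generic matrix multiplication-based (im2col plus GEMM) convolution. It is not the ftmEConv implementation, whose DSP-specific parallelization and data layout are not described on this page. The function names `im2col` and `conv2d_gemm`, the (C, H, W) layout, and the `stride` and `pad` parameters are assumptions chosen for illustration only.

```python
import numpy as np

def im2col(x, kh, kw, stride=1, pad=0):
    """Input feature map transformation (illustrative):
    unfold a (C, H, W) input into a (C*kh*kw, OH*OW) matrix."""
    c, h, w = x.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    oh = (h + 2 * pad - kh) // stride + 1
    ow = (w + 2 * pad - kw) // stride + 1
    cols = np.empty((c * kh * kw, oh * ow), dtype=x.dtype)
    row = 0
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                # All input pixels that meet kernel tap (ci, i, j),
                # one per output position.
                patch = xp[ci, i:i + stride * oh:stride, j:j + stride * ow:stride]
                cols[row] = patch.reshape(-1)
                row += 1
    return cols, oh, ow

def conv2d_gemm(x, weight, stride=1, pad=0):
    """Convolution as one matrix multiplication.
    x: (C, H, W); weight: (K, C, kh, kw); returns (K, OH, OW)."""
    k, c, kh, kw = weight.shape
    cols, oh, ow = im2col(x, kh, kw, stride, pad)   # input transformation
    w_mat = weight.reshape(k, c * kh * kw)          # filter transformation
    out_mat = w_mat @ cols                          # matrix multiplication
    return out_mat.reshape(k, oh, ow)               # output transformation

def conv2d_direct(x, weight, stride=1, pad=0):
    """Straightforward reference convolution, used only to check the sketch."""
    k, c, kh, kw = weight.shape
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    oh = (x.shape[1] + 2 * pad - kh) // stride + 1
    ow = (x.shape[2] + 2 * pad - kw) // stride + 1
    out = np.zeros((k, oh, ow), dtype=x.dtype)
    for ko in range(k):
        for i in range(oh):
            for j in range(ow):
                window = xp[:, i * stride:i * stride + kh, j * stride:j * stride + kw]
                out[ko, i, j] = np.sum(window * weight[ko])
    return out
```

In this sketch the matrix multiplication dominates the arithmetic, which is why a single well-tuned GEMM kernel can serve many convolution configurations; the transformations around it are pure data movement.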
|
|
|
|
|