Supported by the National Natural Science Foundation of China (62002365)
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China,wangqinglin_thu@163.com
PEI Xiangdong
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China,18903588277@163.com
LIAO Linyu
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China
WANG Haoxu
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China
LI Rongchun
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China
MEI Songzhu
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China
LI Dongsheng
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China
The matrix multiplication-based convolution algorithm can efficiently implement convolutions with a wide range of parameters, which makes it the first choice when optimizing convolution performance for a given chip. Based on the architecture of the Phytium heterogeneous multi-core DSPs (digital signal processors) developed by the National University of Defense Technology and on the characteristics of the matrix multiplication-based convolution algorithm, a high-performance parallel implementation of this algorithm for different convolutions on multi-core DSPs, called ftmEConv, is proposed. ftmEConv consists of four parallelized parts (input feature map transformation, filter transformation, matrix multiplication, and output feature map transformation), all of which are optimized for multi-core DSPs, and the performance of each part is improved by effectively exploiting the potential of the functional units in the DSP cores. Experimental results show that ftmEConv achieves a computational efficiency of up to 42.90%. Compared with other implementations of the matrix multiplication-based convolution algorithm on heterogeneous chips, ftmEConv achieves a speedup of up to 7.79 times.
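The four parts named in the abstract can be illustrated with a minimal, single-threaded sketch in plain Python. This is not the ftmEConv implementation: it shows only the generic matrix multiplication-based (im2col-style) convolution, without the DSP-specific parallelization or functional-unit optimizations. The function names, the nested-list data layout, and the no-padding assumption are all hypothetical choices made for illustration.

```python
def im2col(x, kh, kw, stride=1):
    """Input feature map transformation (illustrative).

    x is a nested list indexed [C][H][W]. Returns a matrix with
    C*kh*kw rows and Ho*Wo columns, plus the output height and width.
    No padding is applied (an assumption of this sketch).
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    ho = (h - kh) // stride + 1
    wo = (w - kw) // stride + 1
    cols = []
    for ci in range(c):
        for i in range(kh):
            for j in range(kw):
                # One row per (channel, kernel row, kernel column) triple.
                cols.append([x[ci][oi * stride + i][oj * stride + j]
                             for oi in range(ho) for oj in range(wo)])
    return cols, ho, wo


def flatten_filters(filters):
    """Filter transformation: [K][C][kh][kw] -> K rows of length C*kh*kw."""
    return [[v for ch in f for row in ch for v in row] for f in filters]


def matmul(a, b):
    """Plain matrix multiplication of a (M x P) by b (P x N)."""
    p, n = len(b), len(b[0])
    return [[sum(a_row[k] * b[k][q] for k in range(p)) for q in range(n)]
            for a_row in a]


def to_feature_maps(mat, ho, wo):
    """Output feature map transformation: K x (Ho*Wo) -> [K][Ho][Wo]."""
    return [[row[oi * wo:(oi + 1) * wo] for oi in range(ho)] for row in mat]


def conv2d_matmul(x, filters, stride=1):
    """Convolution (cross-correlation, as in CNNs) via the four stages."""
    kh, kw = len(filters[0][0]), len(filters[0][0][0])
    cols, ho, wo = im2col(x, kh, kw, stride)
    wmat = flatten_filters(filters)
    return to_feature_maps(matmul(wmat, cols), ho, wo)


if __name__ == "__main__":
    x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # one 3x3 input channel
    filters = [[[[1, 0], [0, 1]]]]            # one 2x2 filter
    print(conv2d_matmul(x, filters))
```

The row order produced by `im2col` (channel, then kernel row, then kernel column) matches the order in which `flatten_filters` lays out each filter, which is what lets a single matrix product compute every output element.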