Optimizing parallel matrix transpose algorithm on multi-core digital signal processors

doi:10.11887/j.cn.202301006

Authors: PEI Xiangdong, WANG Qinglin, LIAO Linyu, et al.
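Affiliations: (1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China; 2. Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China)
About the authors: PEI Xiangdong (born 1985), male, from Changzhi, Shanxi, Ph.D. candidate, E-mail: 18903588277@163.com; WANG Qinglin (corresponding author), male, from Sinan, Guizhou, associate research fellow, Ph.D., master's supervisor, E-mail: wangqinglin_thu@163.com
CLC number: TP391
Funding: National Natural Science Foundation of China (62002365)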

Abstract:

Matrix transpose is a basic matrix operation that is widely used in signal processing, scientific computing, and deep learning. As the Phytium heterogeneous multi-core digital signal processors (DSPs) developed by the National University of Defense Technology are adopted in more fields, there is strong demand for a high-performance matrix transpose implementation on them. Based on the architecture of Phytium heterogeneous multi-core DSPs and the characteristics of the matrix transpose operation, a parallel matrix transpose algorithm called ftmMT is proposed for matrices with different element widths (8 B, 4 B, and 2 B). ftmMT is vectorized using the Load/Store units of the DSP's vector processing unit, parallelized across multiple DSP cores through matrix blocking, and uses an implicit ping-pong design to overlap on-chip vectorized transposition with off-chip memory access, which substantially improves memory access performance. Experimental results show that ftmMT significantly accelerates matrix transpose, achieving a speedup of up to 8.99 times over the open-source transpose library HPTT running on a CPU.
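
To make the blocking and ping-pong ideas in the abstract more concrete, the following is a minimal sketch in plain Python. It is not the ftmMT implementation and uses no DSP intrinsics. The function name `transpose_blocked`, the parameters `block` and `num_cores`, and the round-robin assignment of tiles to cores are all assumptions made for illustration. The code runs sequentially, so it shows only how tiles are partitioned and how two staging buffers alternate; on real hardware the load of the next tile would overlap with the transposition of the current one.

```python
def transpose_blocked(src, rows, cols, block=4, num_cores=2):
    """Transpose a row-major rows x cols matrix stored in a flat list.

    Illustrative sketch only: tiles are dealt round-robin to hypothetical
    cores, and each core alternates between two staging buffers.
    """
    dst = [None] * (rows * cols)

    # Top-left corner of every tile in the source matrix.
    tiles = [(r0, c0)
             for r0 in range(0, rows, block)
             for c0 in range(0, cols, block)]

    for core in range(num_cores):
        # Two staging buffers per core (ping and pong). This is an assumed
        # stand-in for on-chip memory; here they are ordinary lists.
        buffers = [[], []]
        my_tiles = tiles[core::num_cores]  # round-robin tile assignment

        for i, (r0, c0) in enumerate(my_tiles):
            buf = buffers[i % 2]  # alternate buffers from tile to tile
            r1 = min(r0 + block, rows)
            c1 = min(c0 + block, cols)
            h, w = r1 - r0, c1 - c0

            # "Load": copy the tile from the source into the staging buffer.
            buf[:] = [src[r * cols + c]
                      for r in range(r0, r1)
                      for c in range(c0, c1)]

            # "Transpose and store": tile element (r, c) goes to
            # row c0 + c, column r0 + r of the cols x rows output.
            for r in range(h):
                for c in range(w):
                    dst[(c0 + c) * rows + (r0 + r)] = buf[r * w + c]

    return dst
```

For example, `transpose_blocked(list(range(6)), 2, 3, block=2)` treats the input as the 2 x 3 matrix `[[0, 1, 2], [3, 4, 5]]` and returns `[0, 3, 1, 4, 2, 5]`, the row-major form of its 3 x 2 transpose. Because every tile writes to a disjoint region of the output, the per-core loops are independent, which is what makes this kind of blocking suitable for core-level parallelism.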

Cite this article

PEI Xiangdong, WANG Qinglin, LIAO Linyu, et al. Optimizing parallel matrix transpose algorithm on multi-core digital signal processors[J]. Journal of National University of Defense Technology, 2023, 45(1): 57-66.

History

Received: 2022-07-09
Published online: 2023-01-16
Publication date: 2023-02-28
