Efficient RNN inference engine on very long vector processor

doi:10.11887/j.cn.202401013


Authors: SU Huayou, CHEN Kangkang, YANG Qianming
Affiliations: (1. College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China; 2. National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China)
About the author: SU Huayou (born 1985), male, from Guilin, Guangxi; associate research fellow, Ph.D., master's supervisor. E-mail: shyou@nudt.edu.cn
CLC number: TP391
Funding: National Natural Science Foundation of China (61872377); Xiangjiang Laboratory Foundation (22XJ01012)

Abstract:

The growing depth of models and the varying lengths of input sequences pose great challenges for optimizing the performance of recurrent neural networks (RNNs) on different processors. An efficient RNN acceleration engine was implemented for the self-developed long vector processor FT-M7032. The engine uses a row-first matrix-vector multiplication algorithm and a data-aware multi-core parallel method to improve the computational efficiency of matrix-vector multiplication. It uses a two-level kernel fusion optimization method to reduce the overhead of transferring temporary data. Handwritten assembly implementations of multiple operators further exploit the performance potential of the long vector processor. Experiments show that the RNN inference engine achieves high performance on the long vector processor: compared with a multi-core ARM CPU and an Intel Golden CPU, the long short-term memory (LSTM) network, an RNN-like model, achieves speedups of up to 62.68 times and 3.12 times, respectively.
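The abstract names a row-first matrix-vector multiplication but does not spell out the algorithm. The sketch below shows one common reading of the term, offered as an assumption and not as the authors' implementation: the product y = xW is accumulated one row of W at a time, so each step is a scalar times a contiguous row, which suits wide vector units because no cross-lane reduction is needed. The function names `matvec_row_first` and `matvec_column_dot` are hypothetical, and plain Python lists stand in for vector registers.

```python
def matvec_row_first(x, W):
    """Compute y = x @ W by walking W one row at a time (axpy form).

    Assumption: W is a list of rows (row-major), len(x) == len(W).
    """
    n_cols = len(W[0])
    y = [0.0] * n_cols
    for x_k, row in zip(x, W):
        # One scalar times one contiguous row. On a vector machine this
        # would be a broadcast followed by a vector multiply-accumulate.
        for j in range(n_cols):
            y[j] += x_k * row[j]
    return y


def matvec_column_dot(x, W):
    """Reference form: one dot product per output element.

    Each output walks down a column of W, i.e. strided access in
    row-major storage, and ends in a reduction.
    """
    n_rows, n_cols = len(W), len(W[0])
    return [sum(x[k] * W[k][j] for k in range(n_rows)) for j in range(n_cols)]


x = [1.0, 2.0, 3.0]
W = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 2.0]]

y_row = matvec_row_first(x, W)
y_dot = matvec_column_dot(x, W)
print(y_row, y_dot)
```

Both functions compute the same product; they differ only in traversal order, which is the property a row-first scheme would exploit.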

Cite this article

SU Huayou, CHEN Kangkang, YANG Qianming. Efficient RNN inference engine on very long vector processor[J]. Journal of National University of Defense Technology, 2024, 46(1): 121-130.

History

Received: 2022-11-07
Published online: 2024-01-28
Publication date: 2024-02-28
