Matrix operator optimization method for AMX unit

2026, Vol. 48, No. 3, pp. 357-367
DOI: 10.11887/j.issn.1001-2486.2509005

Authors: YANG Weiling, FANG Jianbin, DONG Dezun
Affiliation: College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China
Author biography: YANG Weiling (born 1996), male, from Jingzhou, Hubei; doctoral student. E-mail: w.yang@nudt.edu.cn
CLC number: TP301.6; TP393
Funding: Joint Funds of the National Natural Science Foundation of China (U24B20151)


Abstract:

During inference with mixture-of-experts (MoE) models, matrix operators are the main performance bottleneck, and those in the attention module and in expert computation are the most time-consuming. Existing work has heavily optimized matrix operators on GPUs, but GPUs and CPUs differ substantially in memory architecture and compute units, so these optimizations are difficult to transfer directly to CPU platforms. To address this, FlashMatrix is proposed: a matrix-operator optimization scheme that targets the Advanced Matrix Extensions (AMX) unit of CPUs. FlashMatrix uses an efficient data layout transformation strategy that avoids the extra memory-access overhead that layout conversion would otherwise incur. For matrix multiplication, it uses a micro-kernel designed for an optimal compute-to-memory-access ratio, so that registers are reused efficiently. Experiments show that, compared with oneDNN, the state-of-the-art matrix computation library on CPU platforms, FlashMatrix achieves an average speedup of 2.5×. For end-to-end inference, FlashMatrix achieves a speedup of approximately 1.2×.
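The abstract does not give the micro-kernel's details, so the following is only an illustrative sketch of what an "optimal compute-to-memory ratio" can mean on AMX, not the authors' design. It assumes a simple register-blocking model: AMX provides eight tile registers (tmm0 to tmm7), and a micro-kernel that keeps a × b accumulator tiles resident together with a tiles of A and b tiles of B needs a·b + a + b registers, performing a·b tile multiplies for every a + b tile loads. The function names and the cost model are hypothetical.

```python
from fractions import Fraction

NUM_TILE_REGS = 8  # AMX exposes eight tile registers, tmm0..tmm7


def candidate_blockings(num_regs=NUM_TILE_REGS):
    """Yield (a, b, regs, ratio) for every a x b accumulator blocking that fits.

    Assumed model (not taken from the paper): a*b accumulator tiles,
    a tiles of A and b tiles of B are all resident at once, and each
    step of the K loop issues a*b tile multiplies and a+b tile loads.
    """
    for a in range(1, num_regs + 1):
        for b in range(a, num_regs + 1):  # b >= a skips mirrored duplicates
            regs = a * b + a + b
            if regs <= num_regs:
                yield a, b, regs, Fraction(a * b, a + b)


def best_blocking(num_regs=NUM_TILE_REGS):
    """Return the blocking with the most tile multiplies per tile load."""
    return max(candidate_blockings(num_regs), key=lambda c: c[3])


if __name__ == "__main__":
    for a, b, regs, ratio in candidate_blockings():
        print(f"{a}x{b}: {regs} tile registers, {ratio} multiplies per load")
    print("best:", best_blocking()[:2])
```

Under this model the 2 × 2 blocking uses all eight tile registers and reaches one tile multiply per tile load, which is why register reuse and the compute-to-memory ratio are tied together; a real kernel would also have to account for cache behavior and data layout.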


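Cite this article:

YANG Weiling, FANG Jianbin, DONG Dezun. Matrix operator optimization method for AMX unit[J]. Journal of National University of Defense Technology, 2026, 48(3): 357-367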


History:

Received: 2025-09-30
Published online: 2026-06-04
