Abstract: Efficient inference deployment of large language models faces severe challenges in resource-constrained scenarios. Although current mainstream inference optimization techniques improve model inference efficiency to some extent, they still suffer from issues such as coarse-grained deployment and poor inference accuracy. Based on the observation that different operators exhibit varying degrees of GPU affinity, an operator-aware tensor offloading (OATO) approach was proposed. OATO extracted operators' semantic knowledge and used it to design an intelligent scheduling algorithm, which in turn yielded a globally optimal model-deployment plan. The OATO approach was then integrated into the latest large-model inference framework Llama.cpp to implement an operator-aware tensor-offloading-enhanced inference engine, referred to as OALlama.cpp. Experimental results show that, compared with the state-of-the-art inference engines Llama.cpp and FlexGen, OALlama.cpp achieves the best inference performance on three large models. Notably, when 75% of the LLaMA3-8B model weights are loaded on the GPU, the first-token generation speed of OALlama.cpp is nearly double that of FlexGen and Llama.cpp.