Operator-Aware Tensor Offloading Technology for Large Language Model Inference in Resource-Constrained Scenarios
DOI:
Author:
Affiliation:

College of Computer Science and Technology, National University of Defense Technology

Author biography:

Corresponding author:

CLC number:

TP181

Fund projects:

National Natural Science Foundation of China Innovative Research Group Project (62421002); Independent Scientific Research Fund of National University of Defense Technology (24-ZZCX-JDZ-07)



Abstract:

As a core technological breakthrough in the current field of artificial intelligence, large language models (LLMs) have demonstrated excellent performance in many key domains. However, efficient inference deployment of LLMs faces severe challenges in resource-constrained scenarios. Although mainstream inference optimization techniques such as quantization, sparsification, and layer-wise hybrid inference improve inference efficiency to some extent, they still suffer from coarse-grained deployment and degraded inference accuracy. Based on the observation that different operators exhibit different degrees of GPU affinity, an operator-aware tensor offloading approach (OATO) for LLM inference is proposed. The approach extracts the semantic knowledge of operators and, building on it, designs an intelligent operator scheduling algorithm that ranks operators by their GPU affinity and generates a globally optimal model deployment scheme. The OATO approach is further integrated into the latest large model inference framework Llama.cpp, yielding an operator-aware tensor-offloading-enhanced inference engine, OALlama.cpp. Experimental results show that, compared with the state-of-the-art inference engines Llama.cpp and FlexGen, OALlama.cpp achieves the best inference performance on three large models. In particular, when 75% of the weights of the LLaMA3-8B model are loaded on the GPU, the first-token generation speed of OALlama.cpp is nearly double that of FlexGen and Llama.cpp.
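The abstract describes OATO only at a high level: score operators by their GPU affinity and schedule them onto the GPU within a memory budget to obtain a deployment plan. The Python sketch below shows one possible greedy reading of such affinity-ranked operator placement; the `OperatorSpec` structure, the `plan_placement` function, and all operator names, sizes, and affinity values are illustrative assumptions, not the paper's actual OATO algorithm or Llama.cpp's API.

```python
# Illustrative sketch of affinity-ranked operator placement (not the paper's
# actual OATO implementation): operators offering the largest assumed GPU
# speedup per byte of weight memory are placed on the GPU first, until the
# VRAM budget is exhausted; everything else stays on the CPU.
from dataclasses import dataclass

@dataclass
class OperatorSpec:
    name: str            # operator identifier, e.g. "layer0.attn.qkv_proj"
    weight_bytes: int    # size of the operator's weight tensor
    gpu_affinity: float  # assumed profiling-based GPU-over-CPU speedup

def plan_placement(ops: list[OperatorSpec], vram_budget: int) -> dict[str, str]:
    """Assign each operator to 'gpu' or 'cpu' under a VRAM budget.

    Operators are ranked by affinity gain per byte, a simple greedy proxy
    for the globally optimal schedule the abstract refers to.
    """
    placement = {op.name: "cpu" for op in ops}
    used = 0
    for op in sorted(ops, key=lambda o: o.gpu_affinity / o.weight_bytes, reverse=True):
        if used + op.weight_bytes <= vram_budget:
            placement[op.name] = "gpu"
            used += op.weight_bytes
    return placement

if __name__ == "__main__":
    # Hypothetical operators of one transformer layer with made-up sizes/affinities.
    ops = [
        OperatorSpec("layer0.attn.qkv_proj", 96 << 20, 4.0),
        OperatorSpec("layer0.attn.out_proj", 32 << 20, 3.5),
        OperatorSpec("layer0.ffn.up_proj", 128 << 20, 2.0),
        OperatorSpec("layer0.ffn.down_proj", 128 << 20, 1.2),
    ]
    print(plan_placement(ops, vram_budget=256 << 20))
```

A real scheduler would additionally have to account for host-device transfer costs and inter-operator dependencies, which the greedy rule above deliberately ignores.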

History
  • Received: 2025-05-24
  • Revised: 2025-08-05
  • Accepted: 2025-08-07
  • Published online:
  • Publication date: