Operator-aware tensor offloading approach for large language model inference in resource-constrained scenarios

Author:

ZHANG Jianfeng (b. 1984), male, from Baoji, Shaanxi; associate research fellow, Ph.D.; E-mail: jfzhang@nudt.edu.cn

Affiliation:

College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China

CLC Number:

TP181

Fund Project:

Innovation Research Group Project of the National Natural Science Foundation of China (62421002); Independent Research Fund Project of National University of Defense Technology (24-ZZCX-JDZ-07)

Abstract:

Efficient inference deployment of large language models faces severe challenges in resource-constrained scenarios. Although current mainstream inference optimization techniques improve inference efficiency to some extent, they still suffer from coarse deployment granularity and degraded inference accuracy. Based on the finding that different operators exhibit different degrees of GPU affinity, an operator-aware tensor offloading (OATO) approach is proposed. OATO extracts the semantic knowledge of operators and, on this basis, employs an intelligent operator scheduling algorithm to generate a globally optimal model deployment plan. The OATO approach is further integrated into the state-of-the-art large language model inference framework Llama.cpp, yielding an operator-aware tensor-offloading-enhanced inference engine, OALlama.cpp. Experimental results show that, compared with the state-of-the-art inference engines Llama.cpp and FlexGen, OALlama.cpp achieves the best inference performance on all three evaluated large models; notably, when 75% of the LLaMA3-8B weights are loaded on the GPU, the first-token generation speed of OALlama.cpp is nearly double that of FlexGen and Llama.cpp.
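To make the idea concrete, the sketch below illustrates one way an operator-aware offloading planner could work. It is only a minimal illustration under stated assumptions, not the paper's algorithm: the tensor names, the `gpu_gain` affinity score, and the greedy knapsack-style heuristic are all hypothetical, and the paper's scheduler produces a globally optimal plan, whereas a greedy pass is merely approximate.

```python
from dataclasses import dataclass

@dataclass
class OpTensor:
    name: str        # hypothetical tensor name, e.g. "blk.0.attn_q.weight"
    nbytes: int      # weight size in bytes
    gpu_gain: float  # assumed affinity score: estimated latency saved per token
                     # when the operator consuming this tensor runs on the GPU

def plan_offloading(tensors: list[OpTensor], vram_budget: int) -> dict[str, str]:
    """Greedy sketch: keep the tensors with the highest GPU benefit per byte
    in VRAM and offload the remainder to host memory."""
    plan: dict[str, str] = {}
    used = 0
    # Rank tensors by affinity density (benefit per byte), highest first.
    for t in sorted(tensors, key=lambda t: t.gpu_gain / t.nbytes, reverse=True):
        if used + t.nbytes <= vram_budget:
            plan[t.name] = "GPU"
            used += t.nbytes
        else:
            plan[t.name] = "CPU"
    return plan

if __name__ == "__main__":
    # Toy example: the attention projection saves more latency per byte on
    # the GPU than the feed-forward weights, so it is placed first.
    tensors = [
        OpTensor("blk.0.attn_q.weight", 32 << 20, gpu_gain=4.0),
        OpTensor("blk.0.ffn_up.weight", 64 << 20, gpu_gain=3.0),
        OpTensor("blk.0.ffn_down.weight", 64 << 20, gpu_gain=2.5),
    ]
    print(plan_offloading(tensors, vram_budget=100 << 20))
```

Replacing the greedy pass with a 0/1-knapsack dynamic program (or an ILP) over the same inputs would recover a truly optimal placement, which is closer in spirit to the globally optimal scheduling the abstract describes.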

Cite this article:

ZHANG Jianfeng, XIE Dong, JIAN Songlei, et al. Operator-aware tensor offloading approach for large language model inference in resource-constrained scenarios[J]. Journal of National University of Defense Technology, 2025, 47(6): 60-70.

History:
  • Received: 2025-05-24
  • Published online: 2025-12-02