Operator-aware tensor offloading approach for large language model inference in resource-constrained scenarios
Affiliation:

College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China

CLC Number:

TP181

    Abstract:

    Efficient inference deployment of large language models faces severe challenges in resource-constrained scenarios. Although current mainstream inference-optimization techniques have improved model inference efficiency to some extent, they still suffer from issues such as coarse-grained deployment and poor inference accuracy. Based on the discovery that different operators exhibit varying degrees of GPU affinity, an OATO (operator-aware tensor offloading) approach was proposed. OATO extracted operators′ semantic knowledge and used it to design an intelligent scheduling algorithm, which further yielded a globally optimal model-deployment plan. Meanwhile, the OATO approach was integrated into the latest large-model inference framework Llama.cpp to implement an operator-aware tensor-offloading enhanced inference engine, referred to as OALlama.cpp. Experimental results show that, compared with the state-of-the-art inference engines Llama.cpp and FlexGen, OALlama.cpp achieves the best inference performance on three large models. Notably, in the scenario where 75% of the LLaMA3-8B model weights are loaded on the GPU, the first-token generation speed of OALlama.cpp is nearly double that of FlexGen and Llama.cpp.
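    The abstract's core idea, placing each operator's weight tensors on GPU or CPU according to how much that operator benefits from GPU execution, can be illustrated with a minimal sketch. This is not the paper's actual scheduling algorithm (which the abstract describes only as "intelligent" and globally optimal); it is a hypothetical greedy baseline, with assumed tensor names and affinity scores, showing how a per-operator GPU-affinity metric could drive an offloading plan under a GPU memory budget:

```python
# Illustrative sketch only -- not the OATO algorithm from the paper.
# Tensor names, sizes, and affinity scores below are invented for the example.
from dataclasses import dataclass

@dataclass
class OpTensor:
    name: str        # operator the weight tensor belongs to
    size_mb: int     # weight size in MB
    affinity: float  # assumed profiled speedup from GPU placement

def plan_offloading(tensors, gpu_budget_mb):
    """Greedy plan: place tensors with the highest affinity per MB
    on the GPU until the memory budget is exhausted; the rest stay on CPU."""
    plan, remaining = {}, gpu_budget_mb
    for t in sorted(tensors, key=lambda t: t.affinity / t.size_mb, reverse=True):
        if t.size_mb <= remaining:
            plan[t.name] = "gpu"
            remaining -= t.size_mb
        else:
            plan[t.name] = "cpu"
    return plan

tensors = [
    OpTensor("attn_qkv", 512, 3.0),   # attention benefits strongly from GPU
    OpTensor("ffn_up",   1024, 1.2),
    OpTensor("ffn_down", 1024, 1.1),
    OpTensor("embed",    256, 0.4),   # lookup-style op, low GPU affinity
]
print(plan_offloading(tensors, gpu_budget_mb=1500))
# → {'attn_qkv': 'gpu', 'embed': 'gpu', 'ffn_up': 'cpu', 'ffn_down': 'cpu'}
```

    A real scheduler would replace the greedy density heuristic with a global optimization over the whole model (the selection problem is knapsack-like), but the sketch conveys why operator-level affinity, rather than uniform layer-by-layer offloading, changes which tensors deserve scarce GPU memory.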

Get Citation

ZHANG Jianfeng, XIE Dong, JIAN Songlei, et al. Operator-aware tensor offloading approach for large language model inference in resource-constrained scenarios[J]. Journal of National University of Defense Technology, 2025, 47(6): 60-70.

History
  • Received: May 24, 2025
  • Online: December 02, 2025