Abstract: As a core technological breakthrough in the field of artificial intelligence, large language models (LLMs) have demonstrated excellent performance across multiple key domains. However, efficient inference deployment of LLMs faces severe challenges in resource-constrained scenarios. Although mainstream inference optimization techniques such as quantization, sparsification, and hierarchical hybrid inference have improved inference efficiency to some extent, they still suffer from coarse-grained deployment and degraded inference accuracy. Based on the observation that different operators exhibit varying degrees of GPU affinity, an operator-aware tensor offloading approach (OATO) is proposed. The approach extracts semantic knowledge of operators and incorporates an intelligent operator scheduling algorithm that ranks operators by their GPU affinity, thereby generating a globally optimal model deployment scheme. The OATO approach is integrated into the latest large-model inference framework Llama.cpp to implement an operator-aware tensor-offloading enhanced inference engine, referred to as OALlama.cpp. Experimental results show that, compared with the state-of-the-art inference engines Llama.cpp and FlexGen, OALlama.cpp achieves the best inference performance on three large models. Notably, when 75% of the weights of the LLaMA3-8B model are loaded on the GPU, the first-token generation speed of OALlama.cpp is nearly double that of FlexGen and Llama.cpp.
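To make the affinity-ranking idea in the abstract concrete, the following is a minimal illustrative sketch, not the paper's actual OATO scheduler: it assumes each operator carries a hypothetical `gpu_affinity` score and a weight size, sorts operators by that score, and greedily keeps the most GPU-affine operators on the GPU until a VRAM budget is exhausted. The operator names, scores, and the greedy strategy are assumptions for illustration only; the paper's algorithm is described as producing a globally optimal deployment scheme.

```python
from dataclasses import dataclass

@dataclass
class Operator:
    name: str            # e.g. "attention_qkv" (hypothetical operator name)
    weight_bytes: int    # size of this operator's weights
    gpu_affinity: float  # assumed score: relative benefit of running on the GPU

def plan_offloading(operators, vram_budget_bytes):
    """Greedy placement sketch: keep the most GPU-affine operators on the GPU
    until the VRAM budget is exhausted; the rest stay on the CPU."""
    plan = {}
    remaining = vram_budget_bytes
    # Sort so operators that benefit most from the GPU are placed first.
    for op in sorted(operators, key=lambda o: o.gpu_affinity, reverse=True):
        if op.weight_bytes <= remaining:
            plan[op.name] = "gpu"
            remaining -= op.weight_bytes
        else:
            plan[op.name] = "cpu"
    return plan

if __name__ == "__main__":
    # Toy operators and sizes; affinity values are illustrative assumptions.
    ops = [
        Operator("attention_qkv", 256 << 20, gpu_affinity=0.9),
        Operator("ffn_up",        512 << 20, gpu_affinity=0.6),
        Operator("ffn_down",      512 << 20, gpu_affinity=0.5),
        Operator("lm_head",       128 << 20, gpu_affinity=0.3),
    ]
    print(plan_offloading(ops, vram_budget_bytes=800 << 20))
```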