Abstract: As a core technological breakthrough in the field of artificial intelligence, large language models (LLMs) have demonstrated excellent performance across multiple key domains. However, efficient inference deployment of LLMs faces severe challenges in resource-constrained scenarios. Although mainstream inference optimization techniques such as quantization, sparsification, and hierarchical hybrid inference have improved inference efficiency to some extent, they still suffer from coarse-grained deployment and degraded inference accuracy. Based on the observation that different operators exhibit varying degrees of GPU affinity, an operator-aware tensor offloading approach (OATO) is proposed. The approach extracts semantic knowledge of operators and incorporates an intelligent operator scheduling algorithm that ranks operators by their GPU affinity, thereby generating a globally optimal model deployment scheme. The OATO approach is integrated into the latest large-model inference framework Llama.cpp to implement an operator-aware tensor-offloading enhanced inference engine, referred to as OALlama.cpp. Experimental results show that, compared with the state-of-the-art inference engines Llama.cpp and FlexGen, OALlama.cpp achieves the best inference performance on three large models. Notably, when 75% of the weights of the LLaMA3-8B model are loaded on the GPU, the first-token generation speed of OALlama.cpp is nearly double that of FlexGen and Llama.cpp.
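To make the affinity-ranking idea in the abstract concrete, the following is a minimal illustrative sketch, not the paper's actual OATO scheduler: it assumes each operator carries a hypothetical `gpu_affinity` score and a weight size, sorts operators by that score, and greedily keeps the most GPU-affine operators on the GPU until a VRAM budget is exhausted. The operator names, scores, and the greedy strategy are assumptions for illustration only; the paper's algorithm is described as producing a globally optimal deployment scheme.

```python
from dataclasses import dataclass

@dataclass
class Operator:
    name: str            # e.g. "attention_qkv" (hypothetical operator name)
    weight_bytes: int    # size of this operator's weights
    gpu_affinity: float  # assumed score: relative benefit of running on the GPU

def plan_offloading(operators, vram_budget_bytes):
    """Greedy placement sketch: keep the most GPU-affine operators on the GPU
    until the VRAM budget is exhausted; the rest stay on the CPU."""
    plan = {}
    remaining = vram_budget_bytes
    # Sort so operators that benefit most from the GPU are placed first.
    for op in sorted(operators, key=lambda o: o.gpu_affinity, reverse=True):
        if op.weight_bytes <= remaining:
            plan[op.name] = "gpu"
            remaining -= op.weight_bytes
        else:
            plan[op.name] = "cpu"
    return plan

if __name__ == "__main__":
    # Toy operators and sizes; affinity values are illustrative assumptions.
    ops = [
        Operator("attention_qkv", 256 << 20, gpu_affinity=0.9),
        Operator("ffn_up",        512 << 20, gpu_affinity=0.6),
        Operator("ffn_down",      512 << 20, gpu_affinity=0.5),
        Operator("lm_head",       128 << 20, gpu_affinity=0.3),
    ]
    print(plan_offloading(ops, vram_budget_bytes=800 << 20))
```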