Abstract: In the inference of mixture-of-experts (MoE) models, matrix operators constitute the primary performance bottleneck, with those in the attention module and in expert computation being particularly time-consuming. Although existing approaches have extensively optimized matrix operators on GPUs, the substantial architectural differences between GPUs and CPUs in memory hierarchy and compute units make these optimizations difficult to transfer directly to CPU platforms. To address this limitation, FlashMatrix is introduced as a matrix-operator optimization scheme tailored for CPUs equipped with Intel Advanced Matrix Extensions (AMX). FlashMatrix incorporates an efficient data-layout transformation strategy that avoids the additional memory-access overhead of layout conversion, and employs a carefully designed matrix-multiplication micro-kernel that achieves an optimal compute-to-memory ratio through effective register reuse. Experimental results show that, compared with oneDNN, the state-of-the-art CPU matrix-computation library, FlashMatrix delivers an average 2.5× speedup on matrix operators, and improves end-to-end inference performance by approximately 1.2×.
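As background for the compute-to-memory claim, the following is the standard arithmetic-intensity analysis for a register-blocked micro-kernel (a generic derivation, not taken from the paper; the tile sizes $m$ and $n$ are illustrative symbols):

\[
I(m, n) \;=\; \frac{\text{flops}}{\text{loads}} \;=\; \frac{2\,m\,n\,k}{(m + n)\,k} \;=\; \frac{2\,m\,n}{m + n}
\]

Here an $m \times n$ output tile is held in registers while accumulating over the shared dimension $k$: each accumulation step loads $m + n$ input elements and performs $2\,m\,n$ multiply-add operations. Since $I(m,n)$ grows with the tile dimensions, a micro-kernel that maximizes the accumulator tile kept in registers (subject to register-file or AMX tile capacity) maximizes compute per byte moved, which is the sense in which register reuse drives the compute-to-memory ratio toward its optimum.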