Abstract: In the current landscape of large-scale model training, the contradiction between the exponential growth of model parameters and the slow growth of GPU memory capacity has become increasingly prominent. Among memory optimization techniques, recomputation and computation offloading reduce GPU memory overhead by trading time for space. This article first analyzes the development trends of recomputation and computation offloading, then examines the hardware bandwidth bottlenecks and software ecosystem adaptation challenges facing memory optimization, with a focus on the heterogeneous architecture characteristics of domestic artificial intelligence platforms. It further discusses memory optimization techniques for large model training on domestic platforms such as MT-3000, with the aim of providing technical references for large model training on such platforms.
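
As a concrete illustration of the time-for-space trade-off referred to above, the following minimal sketch uses PyTorch's torch.utils.checkpoint.checkpoint to discard a block's intermediate activations during the forward pass and recompute them during the backward pass. The model, layer sizes, and batch size are illustrative assumptions, not a configuration taken from the article.

```python
# Minimal sketch of recomputation (activation checkpointing) in PyTorch.
# The MLP below and its sizes are hypothetical, used only to illustrate
# trading extra forward computation for reduced activation memory.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedMLP(nn.Module):
    def __init__(self, dim=4096, depth=8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)
        )

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are not kept; they are recomputed
            # when gradients for this block are needed in the backward pass.
            x = checkpoint(block, x, use_reentrant=False)
        return x


if __name__ == "__main__":
    model = CheckpointedMLP()
    x = torch.randn(8, 4096, requires_grad=True)
    loss = model(x).sum()
    loss.backward()  # re-runs each block's forward before its backward
```

Computation offloading follows the same trade-off but moves tensors (e.g., optimizer states or activations) to host memory and transfers them back over the CPU-GPU interconnect when needed, which is why the article's discussion of hardware bandwidth bottlenecks is central to its effectiveness.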