Abstract: An offload-mode heterogeneous parallel algorithm based on inner-outer subdomain partitioning was proposed for the computational fluid dynamics (CFD) program CNS. Drawing on the characteristics of finite-difference computation and the fourth-order Runge-Kutta method on structured meshes, a ghost region was introduced, and a Ghost-Region-Shrinking computing scheme was designed on top of it. The scheme significantly reduces the overhead of data movement between heterogeneous computing resources and, under balanced load, lets the computation and MPI communication on the CPU fully overlap with the computation on the accelerator, yielding better heterogeneous cooperative parallelism. The ghost-region parameter required for computational correctness was given, and load-balance tuning was demonstrated. On a server with two 12-core Intel Haswell Xeon E5-2670 CPUs and two Xeon Phi 7120A MIC coprocessors, an average performance improvement of 5.9× was obtained over an algorithm that offloads whole task blocks to the accelerator. Compared with an MPI/OpenMP two-level parallel algorithm running on 24 Intel Haswell CPU cores, the proposed method achieved a speedup of 1.27× with one MIC and 1.45× with two MICs. Finally, the bottlenecks and limitations of the method were discussed.
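The ghost-region-shrinking idea can be illustrated with a minimal one-dimensional sketch. It is not the CNS implementation: the model equation (linear advection), the stencil radius of 1, and all names (`STENCIL_RADIUS`, `ghost_width`, `rk4_shrinking`, `rk4_periodic`) are assumptions made for illustration. The sketch assumes each Runge-Kutta stage applies one finite-difference stencil, so the valid region shrinks by one stencil radius per stage, and a ghost region of width 4 × radius lets an inner subdomain complete a full fourth-order step without exchanging data between stages.

```python
import math

STENCIL_RADIUS = 1   # assumed: second-order central difference
RK_STAGES = 4        # classical fourth-order Runge-Kutta


def ghost_width(radius=STENCIL_RADIUS, stages=RK_STAGES):
    """Ghost layers needed on each side so no exchange is required mid-step."""
    return radius * stages


def rhs(v):
    """-du/dx with dx = 1 on interior points only; result is 2 points shorter."""
    return [-(v[i + 1] - v[i - 1]) * 0.5 for i in range(1, len(v) - 1)]


def trim(v, k):
    """Drop k points from each end of v."""
    return v[k:len(v) - k]


def rk4_shrinking(u, dt):
    """One RK4 step on a subdomain padded with ghost_width() ghost points.

    Every stage loses STENCIL_RADIUS points per side, so the returned list
    covers exactly the subdomain without its ghost points.
    """
    k1 = rhs(u)                                                  # valid on u trimmed by 1
    u1 = [a + 0.5 * dt * b for a, b in zip(trim(u, 1), k1)]
    k2 = rhs(u1)                                                 # trimmed by 2
    u2 = [a + 0.5 * dt * b for a, b in zip(trim(u, 2), k2)]
    k3 = rhs(u2)                                                 # trimmed by 3
    u3 = [a + dt * b for a, b in zip(trim(u, 3), k3)]
    k4 = rhs(u3)                                                 # trimmed by 4
    return [a + dt / 6.0 * (b + 2 * c + 2 * d + e)
            for a, b, c, d, e in zip(trim(u, 4), trim(k1, 3),
                                     trim(k2, 2), trim(k3, 1), k4)]


def rhs_periodic(v):
    """Same stencil applied to the whole periodic domain (reference)."""
    n = len(v)
    return [-(v[(i + 1) % n] - v[i - 1]) * 0.5 for i in range(n)]


def rk4_periodic(u, dt):
    """Reference RK4 step on the full periodic domain."""
    k1 = rhs_periodic(u)
    k2 = rhs_periodic([a + 0.5 * dt * b for a, b in zip(u, k1)])
    k3 = rhs_periodic([a + 0.5 * dt * b for a, b in zip(u, k2)])
    k4 = rhs_periodic([a + dt * b for a, b in zip(u, k3)])
    return [a + dt / 6.0 * (b + 2 * c + 2 * d + e)
            for a, b, c, d, e in zip(u, k1, k2, k3, k4)]


if __name__ == "__main__":
    n, dt = 64, 0.1
    u = [math.sin(2 * math.pi * i / n) for i in range(n)]
    g = ghost_width()
    lo, hi = 10, 30                       # hypothetical inner subdomain
    inner = rk4_shrinking(u[lo - g:hi + g], dt)
    ref = rk4_periodic(u, dt)[lo:hi]
    print(max(abs(x - y) for x, y in zip(inner, ref)))
```

In this sketch the padded inner subdomain is advanced independently and reproduces the full-domain result on its interior, which is the property that would allow an accelerator to work on the inner region while the CPU handles the outer region and the MPI exchange.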