Abstract: With the widespread application of multi-core digital signal processors (DSPs) in high-performance computing and artificial intelligence, achieving efficient and portable parallel programming on these heterogeneous architectures has become a significant challenge. This paper presents the design and implementation of MOCL4, an efficient OpenCL-based heterogeneous parallel programming system for the domestically developed heterogeneous multi-core DSP platform FT-M7032. MOCL4 combines runtime and compiler optimizations to map OpenCL's SPMD execution model efficiently onto the DSP's SIMD vector units, and it supports efficient DMA-based data transfers across the memory hierarchy. Experimental results show that MOCL4 preserves OpenCL semantics while significantly improving kernel execution performance: the average speedup on the PolyBench benchmark suite is 10.12x, and performance on typical compute-intensive tasks (e.g., GEMM) is close to that of manually optimized code. MOCL4 thus provides a parallel programming solution for multi-core DSPs that balances high performance with programmability.
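As a rough illustration of what mapping an SPMD execution model onto SIMD vector units means, the following C sketch contrasts a per-work-item kernel body with a loop that processes a fixed number of consecutive work-items per step. It is not MOCL4's actual transformation or API: the function names, the SAXPY kernel, and the lane count `VLEN` are all hypothetical, and the inner fixed-trip loop merely stands in for a single vector instruction.

```c
#include <stddef.h>

/* Hypothetical SIMD width in lanes, chosen only for illustration. */
#define VLEN 4

/* SPMD view: the body that a single OpenCL work-item would execute. */
static void saxpy_work_item(size_t gid, float a, const float *x, float *y)
{
    y[gid] = a * x[gid] + y[gid];
}

/* Reference execution: run the work-items one at a time. */
static void saxpy_spmd_serial(size_t n, float a, const float *x, float *y)
{
    for (size_t gid = 0; gid < n; ++gid)
        saxpy_work_item(gid, a, x, y);
}

/* SIMD-style execution: VLEN consecutive work-items are grouped into one
 * step. The inner loop has a fixed trip count and stands in for one
 * vector instruction; leftover work-items run in scalar form. */
static void saxpy_simd_style(size_t n, float a, const float *x, float *y)
{
    size_t base = 0;
    for (; base + VLEN <= n; base += VLEN) {
        for (size_t lane = 0; lane < VLEN; ++lane)
            y[base + lane] = a * x[base + lane] + y[base + lane];
    }
    for (; base < n; ++base)
        saxpy_work_item(base, a, x, y);
}
```

Both functions compute the same result for any `n`. The second form shows only the loop shape that a compiler can lower to vector loads, a vector multiply-add, and vector stores.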