Abstract: Mapping algorithms efficiently onto vector processors is a critical issue. This paper presents an efficient vectorization method for triangular matrix multiplication that supports in-place calculation. L1D is configured as SRAM, and a ping-pong scheme with double buffering is designed to smooth data transfers across the multilevel storage structure, so that kernel computation fully overlaps DMA data transfer and the kernel runs at peak speed throughout, yielding optimal computation efficiency. The irregular triangular matrix multiplication workload is distributed evenly across all vector processing elements to fully exploit the multiple levels of parallelism of the vector processor. The result matrix is stored in the multiplier matrix, which achieves in-place calculation and saves memory space. Experimental results show that triangular matrix multiplication with the presented vectorization method attains 1053.7 GFLOPS, with an efficiency of 91.47%.
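
The in-place idea can be illustrated with a minimal scalar sketch. It is not the paper's vectorized kernel and does not model the DMA double buffering. It assumes a lower triangular matrix A of size n x n and a dense matrix B of size n x m, and overwrites B with the product A·B. The function name `trmm_lower_inplace` and the nested-list data layout are hypothetical, chosen only for illustration. Row i of the product depends only on rows 0..i of the original B, so processing rows from the bottom up never reads a row that has already been overwritten.

```python
def trmm_lower_inplace(a, b):
    """Overwrite b with a @ b, where a is n x n lower triangular and b is n x m.

    Hypothetical helper for illustration only; a and b are lists of row lists.
    """
    n = len(a)
    m = len(b[0]) if n else 0
    # Walk rows from bottom to top: result row i needs original rows 0..i only,
    # and rows above i are still untouched when row i is computed.
    for i in range(n - 1, -1, -1):
        row = b[i]
        diag = a[i][i]
        # Start with the diagonal contribution a[i][i] * b[i][:].
        for j in range(m):
            row[j] *= diag
        # Accumulate contributions from the rows above; entries of a to the
        # right of the diagonal are zero and are never touched.
        for k in range(i):
            aik = a[i][k]
            src = b[k]
            # This inner loop over j is the one a vector unit would process
            # in wide chunks (assumption about the mapping, not from the paper).
            for j in range(m):
                row[j] += aik * src[j]
    return b


if __name__ == "__main__":
    A = [[1.0, 0.0], [2.0, 3.0]]
    B = [[1.0, 2.0], [3.0, 4.0]]
    trmm_lower_inplace(A, B)
    print(B)
```

Because no separate output matrix is allocated, the only storage used is that of A and B themselves, which is the memory saving the abstract refers to.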