Abstract: This paper presents a programmable shuffle unit with an efficient shuffle pattern memory for vector DSPs. Shuffle instructions execute efficiently without occupying key system resources such as the general registers or the memory bandwidth. We compress the switch matrix by differentiating the shuffle granularity and indexing the elements. The memory efficiency of our scheme is higher than that of state-of-the-art methods. Programmers can design the shuffle patterns ahead of time and load them into the shuffle pattern memory via DMA or other means. Experimental results show that our scheme reduces execution cycles by 7.4% to 17.4% for applications that require shuffle instructions, at the cost of 0.6% additional chip area.
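
To make the compression idea concrete, the following is a minimal sketch of why an indexed pattern can be smaller than a full switch matrix, and why a coarser shuffle granularity shrinks it further. It is an illustration under stated assumptions, not the encoding used in this paper: the function names (`shuffle`, `switch_matrix_bits`, `indexed_pattern_bits`), the one-bit-per-crosspoint cost model, and the group-based granularity model are all hypothetical.

```python
import math


def shuffle(src, pattern):
    """Apply an index-based shuffle: output element i takes src[pattern[i]]."""
    return [src[j] for j in pattern]


def switch_matrix_bits(n):
    """Storage for an uncompressed n x n switch matrix.

    Assumption: one control bit per crosspoint.
    """
    return n * n


def indexed_pattern_bits(n, granularity=1):
    """Storage for an index-encoded pattern.

    Assumption: elements move in fixed groups of `granularity`, and each
    output group stores the index of its source group.
    """
    groups = n // granularity
    bits_per_index = max(1, math.ceil(math.log2(groups)))
    return groups * bits_per_index


# Interleave the two halves of an 8-element vector.
src = list(range(8))
pattern = [0, 4, 1, 5, 2, 6, 3, 7]
out = shuffle(src, pattern)

# Compare storage for a 16-element vector under the assumed cost models.
full_bits = switch_matrix_bits(16)
fine_bits = indexed_pattern_bits(16)
coarse_bits = indexed_pattern_bits(16, granularity=4)
```

Under these assumptions, a 16-element pattern needs 256 bits as a full matrix, 64 bits as per-element indices, and 8 bits when elements move in groups of four, which is the kind of trade-off that differentiating the granularity exploits.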