Abstract: As standards and algorithms evolve and grow more complex, high-performance embedded applications demand both high performance and high energy efficiency. The challenge, however, is how to turn VLSI capability into actual computing performance. This work proposes an energy-efficient processor architecture named ET (Embedded Tera-scale Computing), which is composed of many lightweight VLIW processor cores, also called small cores. Each core executes one thread and provides mechanisms for explicitly managing data and instructions. ET uses hierarchical data registers to reduce the cost of delivering data, and asymmetric, distributed instruction registers to deliver instructions. To further reduce energy, ET employs a shallow pipeline and simple control flow, and optimizes the execution of application loop bodies. Preliminary results show that ET can achieve 1 TOPS of performance and an efficiency of 100 GOPS/W when scaled to 40 nm.
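
As a rough consistency check on the two headline figures, and assuming (the abstract does not state this) that both are reached at the same operating point, the implied power budget P and energy per operation E_op would be approximately:

\[
P \approx \frac{1\ \mathrm{TOPS}}{100\ \mathrm{GOPS/W}}
  = \frac{10^{12}\ \mathrm{op/s}}{10^{11}\ \mathrm{op/s/W}}
  = 10\ \mathrm{W},
\qquad
E_{\mathrm{op}} \approx \frac{1}{100\ \mathrm{GOPS/W}}
  = 10^{-11}\ \mathrm{J/op}
  = 10\ \mathrm{pJ/op}.
\]

The symbols P and E_op are introduced here for illustration only and are not part of the original notation.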