

Quantization and pruning optimization method for attention mechanism

HE Yuanhong
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China, heyuanhongcs@nudt.edu.cn
JIANG Jingfei
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
XU Jinwei
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
To address the high computation and memory overhead of attention-based models, model compression techniques, such as the collaborative optimization of quantization and pruning, were studied. A symmetric linear fixed-point quantization method was proposed for the four activation matrices in the attention mechanism: query, key, value, and probability. In addition, a probability matrix pruning method and a progressive pruning strategy were proposed to reduce the accuracy loss caused by pruning. Experimental results on different datasets show that, for the typical attention-based model BERT, the proposed method achieves 4-bit or 8-bit fixed-point quantization and a sparsity of 0.93 to 0.98 with little or no accuracy loss, which greatly reduces the computation of the model and lays a foundation for accelerating the inference of quantized sparse models.
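
As an illustration only, the two ideas named in the abstract can be sketched in plain Python. The sketch is not the authors' implementation and rests on several assumptions the abstract does not confirm: the quantization scale is taken from the maximum absolute value of the matrix, pruning zeroes every probability below a fixed threshold, and the progressive strategy is modeled as a linear ramp of that threshold. All function names (`quantize_symmetric`, `prune_probabilities`, `progressive_thresholds`, and the helpers) are hypothetical.

```python
import math


def quantize_symmetric(matrix, bits):
    """Symmetric linear fixed-point quantization of a 2-D list of floats.

    Assumption: one scale per matrix, derived from the maximum absolute value.
    Returns (integer codes, scale) with codes in [-qmax, qmax].
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(x) for row in matrix for x in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [
        [max(-qmax, min(qmax, round(x / scale))) for x in row]
        for row in matrix
    ]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [[c * scale for c in row] for row in codes]


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def prune_probabilities(prob, threshold):
    """Zero every probability below `threshold` (assumed pruning criterion)."""
    return [[p if p >= threshold else 0.0 for p in row] for row in prob]


def sparsity(matrix):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for x in row if x == 0)
    return zeros / total


def progressive_thresholds(final_threshold, steps):
    """Assumed schedule: raise the threshold linearly to its final value."""
    return [final_threshold * (i + 1) / steps for i in range(steps)]
```

In this sketch, a fine-tuning loop would call `prune_probabilities` with successive values from `progressive_thresholds`, so that the model adapts to a gradually increasing sparsity rather than to the final sparsity at once. Whether the pruned rows are renormalized afterwards is left open here, since the abstract does not say.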