Quantization and pruning optimization method for attention mechanism

2024, 46(1): 113-120

HE Yuanhong
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China, heyuanhongcs@nudt.edu.cn
JIANG Jingfei
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
XU Jinwei
College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China;
National Key Laboratory of Parallel and Distributed Computing, National University of Defense Technology, Changsha 410073, China
Abstract:
To address the heavy computation and memory-access overhead of attention-based models, model compression techniques that jointly optimize quantization and pruning were studied, and a symmetric linear fixed-point quantization method was proposed for the four activation matrices in the attention mechanism: query, key, value, and probability. In addition, a probability-matrix pruning method and a progressive pruning strategy were proposed to effectively reduce the accuracy loss caused by pruning. Experimental results on different datasets show that, for BERT, a typical attention-based model, the proposed method achieves 4-bit or 8-bit fixed-point quantization and a sparsity of 0.93 to 0.98 with little or no accuracy loss, which greatly reduces the computation of the model and lays a solid foundation for accelerating the inference of quantized sparse models.
Funding:
Key Project of the Stable Support Program for the Key Laboratory (WDZC20215250103)
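The abstract names three concrete techniques: symmetric linear fixed-point quantization of the query, key, value, and probability activations, pruning of the probability (softmax) matrix, and a progressive pruning schedule. As an illustration only, the NumPy sketch below shows one plausible form these operations could take; the per-matrix max-absolute-value scaling, the per-row top-k pruning rule, the cubic sparsity schedule, and all function names are assumptions made for this sketch, not the procedure defined in the paper.

import numpy as np

def symmetric_quantize(x, num_bits=8):
    # Symmetric linear fixed-point quantization: one scale per matrix,
    # chosen so the largest magnitude maps to the edge of the signed range.
    qmax = 2 ** (num_bits - 1) - 1               # 127 for 8-bit, 7 for 4-bit
    scale = np.max(np.abs(x)) / qmax + 1e-12     # avoid division by zero
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
    return q, scale                              # dequantize as q * scale

def prune_probability(p, sparsity):
    # Keep only the largest entries in each row of the probability matrix,
    # zeroing the rest until the requested fraction of zeros is reached.
    keep = max(int(round(p.shape[-1] * (1.0 - sparsity))), 1)
    thresh = np.sort(p, axis=-1)[..., -keep][..., None]
    return np.where(p >= thresh, p, 0.0)

def progressive_sparsity(step, total_steps, target=0.95, start=0.0):
    # Cubic ramp from `start` to `target`, so sparsity is introduced
    # gradually during fine-tuning instead of all at once.
    t = min(step / total_steps, 1.0)
    return target + (start - target) * (1.0 - t) ** 3

def quantized_pruned_attention(Q, K, V, num_bits=8, sparsity=0.95):
    # Toy single-head attention combining the pieces above.
    d = Q.shape[-1]
    qQ, sQ = symmetric_quantize(Q, num_bits)
    qK, sK = symmetric_quantize(K, num_bits)
    qV, sV = symmetric_quantize(V, num_bits)
    scores = (qQ @ qK.T) * (sQ * sK) / np.sqrt(d)      # integer matmul, rescaled
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    probs = prune_probability(probs, sparsity)         # sparse probability matrix
    qP, sP = symmetric_quantize(probs, num_bits)       # quantize probabilities too
    return (qP @ qV) * (sP * sV)

In this sketch the probability matrix is pruned after the softmax and then quantized like the other activations, so both the score and output matrix multiplications operate on low-bit integer data; the paper's actual pruning criterion, schedule, and quantization granularity are given in the full text.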

Received:
2022-10-17