Cite this article: CHE Lei. Text feature extraction based on sparse balanced variational autoencoder[J]. Journal of National University of Defense Technology, 2022, 44(1): 169-178.
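DOI: 10.11887/j.cn.202201023
Received: 2020-07-07
Funding: General Project of the Social Science Program of the Beijing Municipal Education Commission (SM201911232003); Key Teaching Reform Project of Beijing Information Science & Technology University (2020JGZD03); Humanities and Social Sciences Planning Fund of the Ministry of Education (20YJAZH129)
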
Text feature extraction based on sparse balanced variational autoencoder
CHE Lei
(School of Information Management, Beijing Information Science & Technology University, Beijing 100192, China)

Abstract:
Text feature extraction faces three problems: features of high-dimensional data are poorly discriminated, rule-based feature learning has weak self-learning ability, and variational autoencoders suffer from excessive pruning. To address these problems, a text feature extraction model based on a sparse balanced variational autoencoder (SBVAE) is proposed. To suppress noise interference and improve the robustness of the model, a bidirectional noise reduction mechanism is applied at the input layer. A sparse balance treatment is proposed and combined with a simulated annealing algorithm for the weight of the Kullback-Leibler (KL) term, which alleviates the excessive pruning caused by the KL divergence and forces the decoder to make fuller use of the latent variables. The model improves the discriminability of high-dimensional data features. Experiments compare text feature extraction models and examine sparsity performance and the influence of sparse balance on the variational lower bound in the hidden space; the results show that the proposed model performs well. On the Fudan and Reuters datasets, the highest accuracy of the proposed model exceeds that of principal component analysis (PCA) by 12.36% and 8.06%, respectively.
Keywords: variational autoencoder; noise reduction; sparse balance; excessive pruning
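
The abstract mentions weighting the KL term so that the decoder is pushed to use the latent variables. The sketch below illustrates that general idea only. It assumes a diagonal Gaussian posterior with a standard normal prior, and it uses a plain linear warm-up of the KL weight, which is a common substitute and not the simulated annealing scheme of the paper (the abstract does not specify that scheme). The names `kl_weight`, `gaussian_kl`, `weighted_negative_elbo`, and `warmup_steps` are hypothetical.

```python
import numpy as np

def kl_weight(step, warmup_steps=1000):
    """Hypothetical linear schedule: the weight rises from 0 to 1 over warmup_steps."""
    return min(1.0, step / warmup_steps)

def gaussian_kl(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over the latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def weighted_negative_elbo(recon_loss, mu, log_var, step, warmup_steps=1000):
    """Per-example loss: reconstruction term plus the KL term scaled by the current weight."""
    beta = kl_weight(step, warmup_steps)
    return recon_loss + beta * gaussian_kl(mu, log_var)
```

With a small weight early in training, the KL penalty contributes little, so latent dimensions are less likely to collapse to the prior before the decoder has learned to use them; the weight then grows until the loss matches the usual negative variational lower bound.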