Abstract: To address the low discriminability of high-dimensional features in text feature extraction, the poor self-learning ability of rule-based representation learning, and the excessive pruning of latent units in variational autoencoders, a text feature extraction model based on SBVAE (sparse balanced variational autoencoder) is proposed. To suppress noise interference and improve the robustness of the model, a bidirectional denoising mechanism is designed for the input layer of the variational autoencoder. A sparse balance method, combined with a simulated annealing schedule for the weight of the KL (Kullback-Leibler) term, is proposed to alleviate the excessive pruning caused by the KL divergence and to force the decoder to make full use of the latent variables, which improves the discriminability of high-dimensional features. The experiments cover a comparative analysis of text feature extraction models, sparsity performance, and the influence of sparse balance on the variational lower bound in the latent space. The results show that, compared with PCA (principal component analysis), the highest accuracy of the proposed model is 12.36% and 8.06% higher on the Fudan and Reuters datasets, respectively.
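The idea of annealing the weight of the KL term can be illustrated with a minimal sketch. It is not the SBVAE implementation: the sigmoid schedule, the function names (`kl_weight`, `gaussian_kl`, `annealed_loss`), and the `steepness` parameter are assumptions made for illustration, and the sparse balance term and the bidirectional denoising mechanism are omitted.

```python
import math


def kl_weight(step, total_steps, steepness=10.0):
    """Hypothetical sigmoid schedule: the KL weight rises from near 0 to near 1.

    A small weight early in training lets the decoder learn to use the
    latent variables before the KL penalty can prune them.
    """
    x = steepness * (step / total_steps - 0.5)
    return 1.0 / (1.0 + math.exp(-x))


def gaussian_kl(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def annealed_loss(recon_error, mu, log_var, step, total_steps):
    """Negative variational lower bound with an annealed KL weight."""
    beta = kl_weight(step, total_steps)
    return recon_error + beta * gaussian_kl(mu, log_var)


if __name__ == "__main__":
    # Toy values only; a real model would compute these per mini-batch.
    for step in (0, 50, 100):
        loss = annealed_loss(1.0, [1.0], [0.0], step, 100)
        print(step, round(kl_weight(step, 100), 4), round(loss, 4))
```

With the weight near zero at the start, the loss is dominated by the reconstruction error; the KL penalty then reaches full strength only gradually, which is the general intuition behind using annealing against excessive pruning.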