Particle swarm optimization imputation method for mixed-feature data

2024,46(6):107-112
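刘艺 (LIU Yi)
Academy of Military Sciences, Beijing 100091, China, albertliu20th@163.com, zhengqibin1990@163.com
秦伟 (QIN Wei)
Academy of Military Sciences, Beijing 100091, China
李庚松 (LI Gengsong)
Academy of Military Sciences, Beijing 100091, China
刘坤 (LIU Kun)
Academy of Military Sciences, Beijing 100091, China
王强 (WANG Qiang)
Academy of Military Sciences, Beijing 100091, China
郑奇斌 (ZHENG Qibin)
Academy of Military Sciences, Beijing 100091, China, albertliu20th@163.com, zhengqibin1990@163.com
任小广 (REN Xiaoguang)
Academy of Military Sciences, Beijing 100091, China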
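Abstract:
Traditional data imputation methods struggle to make effective use of label information and of the random information in missing data. To address this, a particle swarm optimization imputation algorithm for mixed-type features is proposed. The values of each continuous feature are modeled as a Gaussian distribution, with the mean and standard deviation as the parameters to be optimized. For discrete features, the probabilities of the feature values are optimized as parameters. Classification accuracy is used as the optimization objective, so that label information and the random information in missing data are fully exploited. Four statistics-based methods and two evolutionary-algorithm-based imputation methods are used as baselines in experiments on six typical classification datasets. The results show that the proposed method significantly outperforms the baseline algorithms in classification accuracy while keeping a favorable time cost, and that it can effectively handle missing data with mixed features.
Funding:
National Natural Science Foundation of China (91948303); Young Scientists Fund of the National Natural Science Foundation of China (61802426)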

Particle swarm optimization based data imputation method for mixed features

LIU Yi
Academy of Military Sciences, Beijing 100091, China, albertliu20th@163.com, zhengqibin1990@163.com
QIN Wei
Academy of Military Sciences, Beijing 100091, China
LI Gengsong
Academy of Military Sciences, Beijing 100091, China
LIU Kun
Academy of Military Sciences, Beijing 100091, China
WANG Qiang
Academy of Military Sciences, Beijing 100091, China
ZHENG Qibin
Academy of Military Sciences, Beijing 100091, China, albertliu20th@163.com, zhengqibin1990@163.com
REN Xiaoguang
Academy of Military Sciences, Beijing 100091, China
Abstract:
Traditional data imputation methods make poor use of label information and of the random characteristics of missing data. To address this deficiency, a particle swarm optimization based imputation method for mixed features was proposed. The values of each continuous feature were modeled as a Gaussian distribution, with the mean and standard deviation used as optimization parameters. For categorical features, the probability of each value was optimized as a parameter. Classification accuracy was used as the optimization objective, so as to make full use of label information and of the random information in missing data. Four statistical methods and two evolutionary algorithm based imputation methods were used for comparison on six typical classification datasets. The results show that the proposed method significantly outperforms the comparison algorithms in classification accuracy while also achieving a favorable time overhead, and that it can effectively handle missing data with mixed features.
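The parameterization described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation; everything beyond what the abstract states is an assumption made for illustration: a single Gaussian per continuous feature, a leave-one-out 1-nearest-neighbour classifier standing in for the unspecified classifier, textbook PSO coefficients (`w`, `c1`, `c2`), features scaled to roughly [0, 1], and the hypothetical names `impute`, `fitness`, and `pso_impute`. Each particle encodes `[mu, sigma, w_0, ..., w_{k-1}]`, where the `w_j` are unnormalized category weights.

```python
import random

def impute(rows, params, n_cat, rng):
    """Fill None entries: column 0 from N(mu, sigma), column 1 from category weights."""
    mu, sigma = params[0], abs(params[1]) + 1e-9
    # Weights are made positive; rng.choices normalizes them into probabilities.
    weights = [abs(w) + 1e-9 for w in params[2:2 + n_cat]]
    filled = []
    for x_cont, x_cat in rows:
        if x_cont is None:
            x_cont = rng.gauss(mu, sigma)
        if x_cat is None:
            x_cat = rng.choices(range(n_cat), weights=weights)[0]
        filled.append((x_cont, x_cat))
    return filled

def loo_1nn_accuracy(filled, labels):
    """Leave-one-out accuracy of a 1-NN classifier (assumed stand-in classifier)."""
    correct = 0
    for i, (ci, ki) in enumerate(filled):
        best_j, best_d = None, float("inf")
        for j, (cj, kj) in enumerate(filled):
            if i == j:
                continue
            d = abs(ci - cj) + (ki != kj)  # mixed distance: numeric gap + category mismatch
            if d < best_d:
                best_j, best_d = j, d
        correct += labels[best_j] == labels[i]
    return correct / len(filled)

def fitness(rows, labels, params, n_cat, seed=0):
    # A fixed sampling seed makes the score of a given particle reproducible.
    filled = impute(rows, params, n_cat, random.Random(seed))
    return loo_1nn_accuracy(filled, labels)

def pso_impute(rows, labels, n_cat, n_particles=10, n_iter=20, seed=0):
    rng = random.Random(seed)
    dim = 2 + n_cat
    pos = [[rng.uniform(0, 1) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(rows, labels, p, n_cat) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    w, c1, c2 = 0.7, 1.5, 1.5  # assumed inertia and acceleration coefficients
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            f = fitness(rows, labels, pos[i], n_cat)
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest, gbest_fit

# Toy data (made up): (continuous value, category index), None marks a missing entry.
rows = [(0.1, 0), (0.2, 0), (None, 0), (0.15, None),
        (0.9, 1), (1.0, 1), (None, 1), (0.95, None)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
best_params, best_acc = pso_impute(rows, labels, n_cat=2)
completed = impute(rows, best_params, 2, random.Random(0))
```

Here `best_acc` is the classification accuracy reached by the best particle, and `completed` is the dataset imputed with that particle's parameters. Using accuracy as the fitness is what lets the labels steer the imputation, which is the point the abstract emphasizes.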
Received:
2022-07-15