Satellite domain corpus construction and named entity recognition

2024,46(4):175-183
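XU Cong
Key Laboratory of Electronics and Information Technology for Space Systems, National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China;
University of Chinese Academy of Sciences, Beijing 100049, China, xucong19@mails.ucas.edu.cn
SHI Huipeng
The State Radio_monitoring_center Testing Center, Beijing 100041, China
CHEN Zhimin
Key Laboratory of Electronics and Information Technology for Space Systems, National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China
ZHANG Xinyu
Key Laboratory of Electronics and Information Technology for Space Systems, National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China;
University of Chinese Academy of Sciences, Beijing 100049, China
WANG Jing
Key Laboratory of Electronics and Information Technology for Space Systems, National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China
YANG Jiasen
Key Laboratory of Electronics and Information Technology for Space Systems, National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China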
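Abstract:
To address the scarcity of named entity corpora in the satellite domain and the low recognition performance of existing algorithms, this paper proposes a satellite-domain entity annotation method that accounts for fuzzy boundaries and builds a corpus covering 8 common types of satellite-domain entities. Compared with existing corpora in this field, the corpus is finer-grained and has broader coverage. On this basis, a satellite-domain entity recognition algorithm based on transfer learning and multi-network fusion is proposed. The algorithm uses a pretrained bidirectional encoder to smoothly transfer corpus semantics and obtain subword-level features, a bi-directional long-short term memory (BiLSTM) neural network to capture contextual information and determine entity boundaries, and a conditional random field as the decoder for label prediction. Experimental results show that the algorithm outperforms traditional models such as BiLSTM: its F1 score exceeds 92% on each of the 8 entity types, and its micro-averaged F1 score reaches 96.10%.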
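The conditional random field decoding step named in the abstract is commonly implemented with the Viterbi algorithm, which picks the label sequence with the highest total emission plus transition score. The following is a minimal sketch of that general technique, not the authors' implementation: the label set (`O`, `B-SAT`, `I-SAT`), the score values, and the function name `viterbi_decode` are all illustrative assumptions, and in a full model the emission scores would come from the BiLSTM output layer.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
    labels: Sequence[str],
) -> List[str]:
    """Return the highest-scoring label sequence.

    emissions[t][j]   : score of label j at token position t
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(labels)
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_labels):
            # Best previous label for reaching label j at position t.
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    best = max(range(n_labels), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return [labels[i] for i in path]


# Hypothetical label set and scores, chosen only for illustration.
LABELS = ["O", "B-SAT", "I-SAT"]
EMISSIONS = [[2.0, 1.5, 0.0], [0.0, 0.5, 2.0]]
# Penalize the invalid transition O -> I-SAT; all other transitions score 0.
TRANSITIONS = [[0.0, 0.0, -100.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

decoded = viterbi_decode(EMISSIONS, TRANSITIONS, LABELS)
```

Taken position by position, the emission scores alone would favor `O` followed by `I-SAT`, which is not a valid BIO sequence. The transition penalty steers the decoder to `B-SAT` followed by `I-SAT` instead, which is the kind of label-consistency constraint a CRF layer contributes on top of per-token scores.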
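Funding:
Merit-based Fund Project of the Key Laboratory of Electronics and Information Technology for Space Systems, Chinese Academy of Sciences (Y42613A32S)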

Satellite domain corpus construction and named entity recognition

XU Cong
Key Laboratory of Electronics and Information Technology for Space Systems, National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China;
University of Chinese Academy of Sciences, Beijing 100049, China, xucong19@mails.ucas.edu.cn
SHI Huipeng
The State Radio_monitoring_center Testing Center, Beijing 100041, China
CHEN Zhimin
Key Laboratory of Electronics and Information Technology for Space Systems, National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China
ZHANG Xinyu
Key Laboratory of Electronics and Information Technology for Space Systems, National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China;
University of Chinese Academy of Sciences, Beijing 100049, China
WANG Jing
Key Laboratory of Electronics and Information Technology for Space Systems, National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China
YANG Jiasen
Key Laboratory of Electronics and Information Technology for Space Systems, National Space Science Center, Chinese Academy of Sciences, Beijing 100190, China
Abstract:
To address the lack of named entity corpora in the satellite domain and the low recognition performance of existing algorithms, a satellite-domain entity labeling method that considers fuzzy boundaries is proposed, and a corpus containing 8 common types of satellite-domain entities is constructed. Compared with existing corpora in this field, the corpus has finer granularity and wider coverage. On this basis, a satellite-domain entity recognition algorithm based on transfer learning and multi-network fusion is proposed. The algorithm uses pretrained bidirectional encoder representations from transformers (BERT) to smoothly transfer the semantics of the corpus and obtain subword-level features, a bi-directional long-short term memory (BiLSTM) network to capture contextual information and determine entity boundaries, and a conditional random field as the decoder for label prediction. Experimental results show that the proposed algorithm achieves better recognition performance than traditional models such as BiLSTM: the F1-score exceeds 92% on each of the 8 entity types, and the micro-average F1-score reaches 96.10%.
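The micro-average F1-score reported above pools true positives, predicted entities, and gold entities over all entity types before computing precision and recall. The following is a minimal sketch of that computation at the entity level for BIO-tagged sequences. It is an illustration under stated assumptions, not the authors' evaluation script: the entity type names (`SAT`, `ORB`), the exact-match criterion, and the helper names `extract_spans` and `micro_f1` are hypothetical.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def micro_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1: counts are pooled over all sentences and types."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = extract_spans(gold_tags), extract_spans(pred_tags)
        tp += len(g & p)  # exact match on type and both boundaries
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: 3 gold entities, 3 predicted entities, 2 exact matches.
GOLD = [["B-SAT", "I-SAT", "O", "B-ORB"], ["O", "B-SAT"]]
PRED = [["B-SAT", "I-SAT", "O", "O"], ["B-SAT", "B-SAT"]]

score = micro_f1(GOLD, PRED)
```

In this toy example precision and recall are both 2/3, so the micro-average F1 is also 2/3. Because counts are pooled, frequent entity types weigh more heavily in the micro average than rare ones, which is why per-type F1 values are usually reported alongside it.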
Received:
2022-04-15