Cite this article: | XU Jiaqing, HU Xiaotao, YANG Hanzhi, et al. Prediction method of port blocking failure in high performance interconnection networks[J]. Journal of National University of Defense Technology, 2022, 44(5): 1-12. |
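Prediction method of port blocking failure in high performance interconnection networks |
XU Jiaqing, HU Xiaotao, YANG Hanzhi, WANG Qiang, ZHANG Lei, TANG Fuqiao |
(College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, Hunan, China)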
|
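Abstract: |
As system scale, chip power consumption, and link rates increase, the overall failure rate of high-performance interconnection networks keeps rising. Traditional operation and maintenance approaches will become unsustainable, which poses a major challenge to the overall reliability and availability of high-performance computing systems. For port blocking, a severe class of network failure, a prediction model based on an unsupervised algorithm is proposed. The model mines symptomatic patterns from historical information to form new feature vectors, and applies the K-means clustering algorithm to learn and categorize them. At prediction time, it combines the current port state with a double exponential smoothing forecast of the future state, and applies K-means to the resulting feature vector to judge in advance whether a blocking failure will occur. Using topology information, separate prediction sub-models are built for leaf switches and root switches, which improves the precision of prediction. Results show that the model reaches an accuracy of 65.2% while maintaining a recall of 88.2%, and can provide effective assistance to operation and maintenance staff. |
Keywords: interconnection network; failure prediction; machine learning |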
DOI:10.11887/j.cn.202205001 |
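Received: 2020-11-08 |
Funding: National Key Research and Development Program of China (2018YFB0204300); Foundation of the National Defense Science and Technology Key Laboratory of Parallel and Distributed Processing (6142110180101) |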
|
Prediction method of port blocking failure in high performance interconnection networks |
XU Jiaqing, HU Xiaotao, YANG Hanzhi, WANG Qiang, ZHANG Lei, TANG Fuqiao |
(College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China)
|
Abstract: |
As system scale, chip power consumption, and link rates increase, the overall failure rate of high-performance interconnection networks continues to rise, and traditional operation and maintenance methods become difficult to sustain. This poses a major challenge to the overall reliability and availability of high performance computing (HPC) systems. An unsupervised prediction model is proposed for severe network failures such as port blocking. The model extracts symptomatic patterns from the history of the switch port status registers and forms new feature vectors, which the K-means clustering algorithm learns and classifies. At prediction time, the double exponential smoothing (DES) algorithm forecasts the future port state from the current state, and K-means is applied to the resulting feature vector to predict whether a port blocking failure will occur. Topology information is used to build independent prediction sub-models for ToR switch ports and Spine switch ports, which further improves prediction precision. Experimental results show that the model reaches an accuracy of 65.2% while maintaining a recall of 88.2%, and that it can provide effective early warning and guidance for operation and maintenance personnel in a real system. |
Keywords: interconnection network; failure prediction; machine learning |
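
The prediction step summarized in the abstract (forecast the port state with double exponential smoothing, then assign the forecast feature vector to a K-means cluster) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The abstract does not say which smoothing variant, which features, or which cluster-labeling rule the paper uses, so the Brown one-parameter form, the function names `des_forecast`, `kmeans`, and `nearest`, and the toy two-counter feature vectors are all assumptions made here.

```python
def des_forecast(series, alpha=0.5, horizon=1):
    """Forecast `horizon` steps ahead with Brown's double exponential
    smoothing (assumed variant; the abstract only says "DES")."""
    s1 = s2 = float(series[0])          # initialise both smoothers at x_0
    for x in series[1:]:
        s1 = alpha * x + (1 - alpha) * s1    # first smoothing pass
        s2 = alpha * s1 + (1 - alpha) * s2   # second pass over s1
    level = 2 * s1 - s2
    trend = alpha / (1 - alpha) * (s1 - s2)
    return level + horizon * trend


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean)."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def kmeans(points, k, iters=100):
    """Plain Lloyd's K-means; the first k points seed the centroids
    (a deterministic choice made for this sketch only)."""
    centroids = [[float(c) for c in p] for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        new = [
            [sum(col) / len(members) for col in zip(*members)]
            if members else centroids[i]     # keep centroid of an empty cluster
            for i, members in enumerate(clusters)
        ]
        if new == centroids:                 # assignments stopped changing
            break
        centroids = new
    return centroids


if __name__ == "__main__":
    # Hypothetical history: two counters per port sample (made-up values).
    history = [(0, 0), (0, 1), (10, 10), (10, 11)]
    centroids = kmeans(history, k=2)

    # Forecast each counter one step ahead, then classify the forecast vector.
    counter_a = [1, 2, 4, 7]
    counter_b = [0, 1, 3, 6]
    forecast = (des_forecast(counter_a), des_forecast(counter_b))
    print("forecast vector:", forecast)
    print("assigned cluster:", nearest(forecast, centroids))
```

In such a setup, one cluster would have to be marked as blocking-prone (for example by inspecting which cluster historical blocking events fall into), and, following the abstract, separate centroid sets would be fitted for ToR and Spine switch ports.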