Cite this article: HU Wei, JIANG Yanhuang, LIU Guangming, et al. Data collection for failure prediction toward exascale supercomputers[J]. Journal of National University of Defense Technology, 2016, 38(1): 93-100.
Data collection for failure prediction toward exascale supercomputers
HU Wei1,2, JIANG Yanhuang1, LIU Guangming1,2, DONG Wenrui1,2, CUI Xinwu3
(1. College of Computer, National University of Defense Technology, Changsha 410073, China; 2. National Supercomputer Centre in Tianjin, Tianjin 300457, China; 3. The PLA Unit 95942, Wuhan 430313, China)

Abstract:
A failure prediction data collection framework (FPDC) is proposed for future exascale supercomputers. It comprehensively collects the state data related to compute node failures. An adaptive multi-layer grouped data aggregation method addresses the problem that the overhead of data aggregation grows too large as the system scales. Implementation and testing on the TH-1A supercomputer show that FPDC has low overhead and good scalability, and that it can meet the data collection needs of failure prediction in future large-scale systems.
Keywords: supercomputer; failure prediction; data collection method; data aggregation
DOI: 10.11887/j.cn.201601016
Received: 2015-04-09
Funding: National Natural Science Foundation of China (61272141, 61120106005); National 863 Program of China (2012AA01A301)