Cite this article: LIU Yi, GAO Yulin, ZHANG Guozhen. Identifying causes of execution failure for parallel programs[J]. Journal of National University of Defense Technology, 2022, 44(5): 45-52.
Identifying causes of execution failure for parallel programs
LIU Yi, GAO Yulin, ZHANG Guozhen
(School of Computer Science and Engineering, Beihang University, Beijing 100191, China)

Abstract:
As high-performance computing systems grow in scale and complexity, their mean time between failures keeps shrinking, so parallel programs are increasingly likely to fail at run time because of hardware or software faults in the system. Programming errors (bugs) in the parallel program itself can also cause execution failures. Because these two kinds of cause call for very different remedies, users need to know which kind is responsible when a run fails. To address this problem, a system for identifying the causes of execution failures of parallel programs was designed and implemented on top of the job management system Slurm. Slurm is extended to monitor job status and to resubmit and rerun jobs, and the category of the failure cause is determined from the results of the job runs. Experiments using fault injection show that the system achieves high identification accuracy.
Keywords: high-performance computing system; Slurm; execution failure; fault detection
DOI: 10.11887/j.cn.202205005
Received: 2020-11-12
Funding: Research project on overall technology, evaluation technology and systems (2016YFB0200100)
|