Identifying causes of execution failure for parallel programs
Authors: LIU Yi, GAO Yulin, ZHANG Guozhen
Affiliation:

(School of Computer Science and Engineering, Beihang University, Beijing 100191, China)

Author biography:

LIU Yi (born 1968), male, from Anxin, Hebei, professor, Ph.D., doctoral supervisor, E-mail: yi.liu@buaa.edu.cn

CLC number:

TP302

Fund project:

Research project on overall technology, evaluation technology, and systems (2016YFB0200100)

    Abstract:

    As the scale and complexity of high-performance computing systems grow, the mean time between failures becomes shorter, so parallel programs are increasingly likely to fail at run time because of system hardware or software faults. In addition, programming errors (i.e., bugs) in the parallel programs themselves can also cause execution failures. Because the two kinds of cause call for entirely different remedies, the user needs to know, when an execution failure occurs, which kind of cause is responsible. To address this problem, a system for identifying the causes of execution failures of parallel programs was designed and implemented on top of the job management system Slurm. The system extends Slurm to monitor job status and to resubmit and rerun jobs, and it distinguishes the category of failure cause according to the results of the job runs. Experiments conducted with fault injection show that the system achieves high identification accuracy.
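The abstract does not spell out the decision rule, so the following is only a minimal sketch of one plausible rerun-based classification, not the authors' implementation. It assumes that a failure which disappears on rerun points to a system fault, while a failure that recurs on every rerun points to a programming error. The names `Cause`, `classify_failure`, `run_job`, and `max_reruns` are hypothetical; in practice `run_job` might wrap something like `sbatch --wait`, which blocks until the job ends and returns its exit code.

```python
from enum import Enum
from typing import Callable


class Cause(Enum):
    NONE = "no failure"
    SYSTEM_FAULT = "system hardware/software fault"
    PROGRAM_BUG = "programming error"


def classify_failure(run_job: Callable[[], int], max_reruns: int = 2) -> Cause:
    """Classify a job failure by rerunning the job (assumed rule, not the paper's).

    run_job is a hypothetical callable that submits the job, waits for it
    to finish, and returns its exit code (0 means success).
    """
    if run_job() == 0:
        return Cause.NONE
    # The first run failed: rerun the job up to max_reruns times.
    for _ in range(max_reruns):
        if run_job() == 0:
            # A rerun succeeded, so the failure was not reproducible;
            # under the stated assumption this suggests a system fault.
            return Cause.SYSTEM_FAULT
    # Every rerun failed as well: assume a deterministic programming error.
    return Cause.PROGRAM_BUG


if __name__ == "__main__":
    # Simulated job that fails once and then succeeds on rerun.
    exit_codes = iter([1, 0])
    print(classify_failure(lambda: next(exit_codes)))
    # Simulated job that fails on every run.
    print(classify_failure(lambda: 1))
```

A real system would likely also rerun on different nodes and inspect node health, since a persistent hardware fault on one node could otherwise be mistaken for a bug.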

Cite this article

LIU Yi, GAO Yulin, ZHANG Guozhen. Identifying causes of execution failure for parallel programs[J]. Journal of National University of Defense Technology, 2022, 44(5): 45-52.

History
  • Received: 2020-11-12
  • Published online: 2022-09-28
  • Publication date: 2022-10-28