Fault tolerance in HPC scientific workflow application

Affiliation:

(1. Institute of Computer Application, Chinese Academy of Engineering Physics, Mianyang 621900, China; 2. Institute of Applied Physics and Computational Mathematics, Beijing 100094, China)

About the authors:

LI Yufeng (born 1982), male, from Guangshan, Henan, Ph.D. candidate, E-mail: liyf@caep.cn; MO Zeyao (corresponding author), male, research professor, Ph.D., doctoral supervisor, E-mail: zeyao_mo@iapcm.ac.cn

CLC number:

TP391

Fund project:

National Key R&D Program of China (2018YFB0703903)

    Abstract:

    Scientific workflow technology is widely used in scientific research and engineering simulation in HPC environments. Applications such as numerical simulation of complex multi-physics processes and multi-stage data processing often require several software packages to cooperate, and they are composed into automatically executed workflows to improve efficiency. However, workflow execution in HPC environments is exposed to exceptions such as resource failures and task configuration errors, which interrupt the workflow and seriously reduce completion efficiency. Fault tolerance is therefore important for the stable and continuous operation of HPC workflow applications. A taxonomy of fault tolerance design in scientific workflows is presented, and the fault tolerance designs of typical workflow systems are analyzed and reviewed. A decision-tree-based event-condition-action fault tolerance model is proposed, and a non-intrusive, extensible fault tolerance architecture is designed. Runtime-configurable fault tolerance strategies are implemented for HSWAP, an independently developed scientific workflow application platform deployed in HPC environments. In real engineering simulation tasks, the fault tolerance mechanism built on the proposed model and architecture played an important role in improving workflow execution efficiency.
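    The abstract names a decision-tree-based event-condition-action model without giving details, so the following is only a minimal sketch of how such a model might be organized, not the HSWAP implementation. All names here (`Event`, `Decision`, `Action`, `build_policy`, the event kinds, and the action names) are hypothetical and chosen for illustration. A fault event enters at the root, each inner node tests one condition, and the leaf that is reached names the recovery action. The `max_retries` parameter stands in for a strategy that can be configured at run time.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Event:
    """A fault event reported by a workflow engine (hypothetical fields)."""
    task_id: str
    kind: str      # e.g. "resource_failure" or "config_error" (assumed names)
    attempts: int  # how many times the task has already been run


@dataclass(frozen=True)
class Action:
    """Leaf of the decision tree: the name of the recovery action to take."""
    name: str


@dataclass(frozen=True)
class Decision:
    """Inner node: test one condition on the event, then follow a branch."""
    condition: Callable[[Event], bool]
    if_true: "Node"
    if_false: "Node"


Node = Union[Action, Decision]


def decide(node: Node, event: Event) -> Action:
    """Walk the tree from the root until an Action leaf is reached."""
    while isinstance(node, Decision):
        node = node.if_true if node.condition(event) else node.if_false
    return node


def build_policy(max_retries: int = 3) -> Node:
    """Build an illustrative policy; max_retries is the configurable part."""
    return Decision(
        condition=lambda e: e.kind == "resource_failure",
        # Resource failures are retried until the retry budget is spent.
        if_true=Decision(
            condition=lambda e: e.attempts < max_retries,
            if_true=Action("resubmit"),
            if_false=Action("abort_workflow"),
        ),
        # Configuration errors need a person; anything else is ignored here.
        if_false=Decision(
            condition=lambda e: e.kind == "config_error",
            if_true=Action("pause_and_notify"),
            if_false=Action("ignore"),
        ),
    )


if __name__ == "__main__":
    policy = build_policy(max_retries=3)
    event = Event(task_id="mesh_gen", kind="resource_failure", attempts=1)
    print(decide(policy, event).name)
```

    Keeping conditions and actions as data in a tree, rather than as branches inside the task code, is one way to keep the fault handling separate from the workflow tasks themselves, which is in the spirit of the non-intrusive design the abstract describes.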

Cite this article

LI Yufeng, MO Zeyao, XIAO Yonghao, et al. Fault tolerance in HPC scientific workflow application[J]. Journal of National University of Defense Technology, 2020, 42(6): 82-89.

History
  • Received: 2019-09-21
  • Published online: 2020-12-02
  • Publication date: 2020-12-28