Cite this article: LI Yufeng, MO Zeyao, XIAO Yonghao, et al. Fault tolerance in HPC scientific workflow application[J]. Journal of National University of Defense Technology, 2020, 42(6): 82-89.
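Fault tolerance for scientific workflow applications in supercomputing environments
LI Yufeng1,2, MO Zeyao2, XIAO Yonghao1, ZHAO Shicao1, DUAN Bowen1
(1. Institute of Computer Application, Chinese Academy of Engineering Physics, Mianyang 621900, Sichuan, China; 2. Institute of Applied Physics and Computational Mathematics, Beijing 100094, China)

Abstract:
Scientific workflow technology is widely used in supercomputing environments for scientific research and engineering simulation. Applications such as numerical simulation of complex multi-physics processes and multi-stage data processing often require several application software packages to cooperate, with business processes built to execute automatically in order to improve efficiency. However, scientific workflow applications running in a supercomputing environment face exceptions such as resource failures and task configuration errors, which interrupt workflow execution and seriously reduce completion efficiency. Fault tolerance is therefore important for the stable, continuous operation of supercomputing workflow applications. This paper introduces a classification of fault tolerance designs for scientific workflows, and analyzes and reviews the fault tolerance designs of typical workflow systems. It proposes a decision-tree-based event-condition-action fault tolerance model, designs a non-intrusive, extensible fault tolerance architecture, and implements runtime-configurable fault tolerance strategies for HSWAP, a self-developed scientific workflow application platform deployed in supercomputing environments. In real engineering simulation tasks, the fault tolerance mechanism built on the proposed model and architecture played an important role in improving workflow execution efficiency.
Keywords: fault tolerance; scientific workflow; decision tree model; workflow engine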
DOI: 10.11887/j.cn.202006010
Received: 2019-09-21
Funding: National Key Research and Development Program of China (2018YFB0703903)
|
Fault tolerance in HPC scientific workflow application
LI Yufeng1,2, MO Zeyao2, XIAO Yonghao1, ZHAO Shicao1, DUAN Bowen1
(1. Institute of Computer Application, Chinese Academy of Engineering Physics, Mianyang 621900, China; 2. Institute of Applied Physics and Computational Mathematics, Beijing 100094, China)

Abstract:
Scientific workflow technologies are extensively applied in HPC environments for scientific research and engineering simulation. Applications such as numerical simulation of complex multi-physics problems and multi-stage data processing need several software packages to be composed into an automatically executable workflow to increase efficiency. Exceptions such as resource failures and task configuration errors may interrupt workflow execution, so fault tolerance is important for the robust and continuous execution of workflow applications. A taxonomy of fault tolerance in workflows is presented, and the fault tolerance techniques of typical workflow systems are reviewed. A decision-tree-based event-condition-action fault tolerance model is proposed, and a non-intrusive, extensible framework is designed and implemented in the authors' HPC scientific workflow system, HSWAP. Runtime-configurable error recovery strategies are also implemented in the fault tolerance software module. To validate the model and framework, the fault tolerance functions were tested in a real engineering simulation project. The results show that fault tolerance plays an important role in increasing workflow execution efficiency.
Keywords: fault tolerance; scientific workflow; decision tree model; workflow engine
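
The abstract names a decision-tree-based event-condition-action model but does not spell out its structure. The Python sketch below is one possible reading of that idea, not the HSWAP implementation: an event selects a decision tree, the internal nodes test conditions on the event, and the leaves are recovery actions. All names here (Event, Condition, Action, RULES, handle, the "retry"/"migrate"/"abort" actions, and the retry limit of 3) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Union

# Illustrative sketch only: every name below is an assumption,
# not part of HSWAP or of the paper's actual design.

@dataclass
class Event:
    kind: str          # event type, e.g. "task_failed"
    retries: int       # how many times the task has already been retried
    node_alive: bool   # whether the compute node still responds

@dataclass
class Action:
    name: str          # leaf of the tree: the recovery action to take

@dataclass
class Condition:
    test: Callable[[Event], bool]  # predicate evaluated on the event
    if_true: "Node"                # subtree followed when the test holds
    if_false: "Node"               # subtree followed otherwise

Node = Union[Condition, Action]

def decide(node: Node, event: Event) -> Action:
    """Walk the tree from the root until an Action leaf is reached."""
    while isinstance(node, Condition):
        node = node.if_true if node.test(event) else node.if_false
    return node

# One decision tree per event kind. Because this is plain data,
# it could be replaced while the engine is running.
RULES = {
    "task_failed": Condition(
        test=lambda e: e.node_alive,
        if_true=Condition(
            test=lambda e: e.retries < 3,   # assumed retry limit
            if_true=Action("retry"),
            if_false=Action("abort"),
        ),
        if_false=Action("migrate"),
    ),
}

def handle(event: Event) -> Action:
    """Event -> conditions -> action; unknown events are ignored."""
    tree = RULES.get(event.kind)
    return decide(tree, event) if tree is not None else Action("ignore")
```

For example, `handle(Event("task_failed", retries=0, node_alive=True))` walks two conditions and returns the "retry" action. Keeping the rules as data separate from the engine code is one way such a scheme could stay non-intrusive and configurable at runtime, though the paper's actual mechanism may differ.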