Fault tolerance in HPC scientific workflow application
Author: LI Yufeng, MO Zeyao, XIAO Yonghao, ZHAO Shicao, DUAN Bowen
Affiliation:

(1. Institute of Computer Application, Chinese Academy of Engineering Physics, Mianyang 621900, China; 2. Institute of Applied Physics and Computational Mathematics, Beijing 100094, China)

Clc Number:

TP391

    Abstract:

    Scientific workflow technologies for HPC are widely applied in scientific research and engineering simulation. Applications such as numerical simulation of complex multi-physics problems and multi-stage data processing need software that composes automatically executable workflows to increase efficiency. Many exceptions, such as resource failures and task configuration errors, can halt workflow execution, so robust and continuous execution is important for workflow applications. A taxonomy of fault tolerance in workflows is presented, and fault tolerance techniques in typical workflow systems are reviewed. A decision-tree-based event-condition-action fault tolerance model is proposed, and a non-intrusive, extensible framework is designed and implemented in our HPC scientific workflow system, HSWAP. Runtime-configurable error recovery strategies are also implemented in our fault tolerance software module. To validate the new model and framework, the fault tolerance functions were tested in a real engineering simulation project. The results show that fault tolerance plays an important role in increasing workflow execution efficiency.
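
    The abstract does not give implementation details, so the following is only a minimal sketch of how a decision-tree-based event-condition-action rule might look. Every name in it (FaultEvent, Node, build_tree, the fault kinds, and the recovery action strings) is a hypothetical illustration and is not taken from HSWAP. The max_retries parameter stands in for a recovery strategy that could be configured at runtime.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FaultEvent:
    """Hypothetical fault event raised by a workflow task (the 'event')."""
    kind: str          # assumed categories, e.g. "resource_failure", "config_error"
    task: str          # name of the task that failed
    attempts: int = 0  # how many times the task has already been retried


@dataclass
class Node:
    """Decision-tree node: a leaf holds an action, an inner node holds a condition."""
    action: Optional[str] = None
    condition: Optional[Callable[[FaultEvent], bool]] = None
    if_true: Optional["Node"] = None
    if_false: Optional["Node"] = None

    def decide(self, event: FaultEvent) -> str:
        # Leaf: return the recovery action (the 'action').
        if self.action is not None:
            return self.action
        # Inner node: evaluate the condition and descend into one branch.
        branch = self.if_true if self.condition(event) else self.if_false
        return branch.decide(event)


def build_tree(max_retries: int) -> Node:
    """Build an illustrative tree; max_retries is an assumed runtime setting."""
    return Node(
        condition=lambda e: e.kind == "resource_failure",
        if_true=Node(
            condition=lambda e: e.attempts < max_retries,
            if_true=Node(action="retry"),
            if_false=Node(action="migrate"),
        ),
        if_false=Node(
            condition=lambda e: e.kind == "config_error",
            if_true=Node(action="pause_for_user"),
            if_false=Node(action="abort"),
        ),
    )


tree = build_tree(max_retries=2)
decisions = [
    tree.decide(FaultEvent("resource_failure", "mesh_gen", attempts=0)),
    tree.decide(FaultEvent("resource_failure", "mesh_gen", attempts=2)),
    tree.decide(FaultEvent("config_error", "solver")),
    tree.decide(FaultEvent("unknown", "post")),
]
```

    In this sketch the workflow tasks themselves are untouched: the tree only inspects events and names an action, which is one way a fault tolerance layer could stay non-intrusive. Rebuilding the tree with a different max_retries changes the recovery behavior without modifying any task.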

Citation

LI Yufeng, MO Zeyao, XIAO Yonghao, ZHAO Shicao, DUAN Bowen. Fault tolerance in HPC scientific workflow application[J]. Journal of National University of Defense Technology,2020,42(6):82-89.

History
  • Received: September 21, 2019
  • Online: December 2, 2020
  • Published: December 28, 2020