Abstract:Aiming at the demand of the distributed simulation system, this paper has built a common grid-based fault tolerance system. The system consists of three parts: simulation resource monitoring module, data saving module, and error recovery module. The implementation of monitoring module is built on top of grid's MDS, while data saving module, including the saving of the process space and the iterative relationship between processes, and fault recovery are realized based on checkpoint mechanism in the user space. In addition, we analyze the relationship between these three modules and the existing function modules in simulation system. In the end, we design and implement a fault tolerance broker in Client/Sever mode to automate the fault tolerance.