SCHEDULE: NOV 12-18, 2011

FTI: high performance Fault Tolerance Interface for hybrid systems

SESSION: Checkpointing Optimization


TIME: 10:30AM - 11:00AM

AUTHOR(S): Leonardo Arturo Bautista Gomez, Dimitri Komatitsch, Naoya Maruyama, Seiji Tsuboi, Franck Cappello, Satoshi Matsuoka, Takeshi Nakamura


Large scientific applications deployed on current petascale systems spend a significant amount of their execution time dumping checkpoint files to remote storage. New fault-tolerant techniques will be critical to efficiently exploit post-petascale systems. In this work, we propose a low-overhead, high-frequency multi-level checkpoint technique in which we integrate a highly reliable, topology-aware Reed-Solomon encoding into a three-level checkpoint scheme. We efficiently hide the encoding time by using one dedicated fault-tolerance thread per node. We implement our technique in the Fault Tolerance Interface (FTI). We evaluate the correctness of our performance model and conduct a study of the reliability of our library. To demonstrate the performance of FTI, we present a case study of the Mw 9.0 Tohoku, Japan earthquake simulation with SPECFEM3D on TSUBAME2.0. We demonstrate a checkpoint overhead as low as 8% on sustained 0.1 Petaflops runs (1152 GPUs) while checkpointing at high frequency.
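
The abstract does not describe the encoding in detail. As a rough illustration of why encoding checkpoints across nodes helps, the sketch below rebuilds one lost node-local checkpoint from the surviving ones. It deliberately uses plain XOR parity as a simplified stand-in for Reed-Solomon: XOR parity tolerates only one lost block per group, whereas Reed-Solomon codes can be configured to tolerate several. The function names, the group size, and the data are hypothetical and are not part of FTI.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized checkpoint blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_lost(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity."""
    return xor_parity(list(surviving_blocks) + [parity])


# Hypothetical group of four nodes, each holding an 8-byte local checkpoint.
checkpoints = [bytes([value] * 8) for value in (3, 7, 11, 42)]
parity = xor_parity(checkpoints)

# Simulate losing node 2 and rebuilding its checkpoint from the rest.
lost = 2
survivors = [c for i, c in enumerate(checkpoints) if i != lost]
rebuilt = recover_lost(survivors, parity)
assert rebuilt == checkpoints[lost]
```

In a scheme like this, the parity block would be stored on a different node from the data it protects, so that a single node failure never removes both a checkpoint and the information needed to rebuild it.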

Chair/Author Details:

Leonardo Arturo Bautista Gomez - Tokyo Institute of Technology

Dimitri Komatitsch - University of Toulouse

Naoya Maruyama - Tokyo Institute of Technology

Seiji Tsuboi - JAMSTEC

Franck Cappello - INRIA

Satoshi Matsuoka - Tokyo Institute of Technology

Takeshi Nakamura - JAMSTEC

