Checkpointing strategies for parallel jobs

SESSION: Checkpointing Optimization


TIME: 11:00AM - 11:30AM

AUTHOR(S): Marin Bougeret, Henri Casanova, Mikael Rabie, Yves Robert, Frédéric Vivien


This work provides a rigorous analysis of checkpointing strategies for minimizing expected job execution times on failure-prone platforms. We give the optimal solution for exponentially distributed failure inter-arrival times, for both sequential and parallel jobs. For non-exponentially distributed failures, we develop a dynamic programming algorithm to maximize the amount of work completed before the next failure, which provides a good heuristic for minimizing the expected execution time. We consider various models of job parallelism and of parallel checkpointing overhead. We present results from extensive simulation experiments assuming that failures follow Exponential or Weibull distributions, the latter being more representative of real-world systems. Our simulation results corroborate theoretical results, and show that our dynamic programming algorithm vastly outperforms previous solutions for Weibull failures. We also conduct simulation experiments based on failure logs of production clusters, which confirm the superiority of our approach for real-world clusters.
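As a rough illustration of the exponential case described above, the sketch below evaluates a commonly used closed form for the expected execution time when a job is split into equal chunks, each followed by a checkpoint, and then searches numerically for the chunk count that minimizes it. This is not taken from the paper itself: the formula assumes failures can strike during computation, checkpointing, and recovery but not during downtime, and every name and number here (expected_time, best_chunk_count, the 100-hour job, the 24-hour MTBF, the 6-minute checkpoint) is a hypothetical choice made for the example.

```python
import math


def expected_time(work, chunks, lam, ckpt, downtime, recovery):
    """Expected makespan for `work` split into `chunks` equal pieces.

    Assumes exponentially distributed failures with rate `lam`, a checkpoint
    of cost `ckpt` after every chunk, and `downtime` plus `recovery` paid
    after each failure (hypothetical model parameters, all in the same unit).
    """
    per_chunk = work / chunks + ckpt
    # expm1(x) = e^x - 1, computed accurately for small x.
    return (chunks * math.exp(lam * recovery)
            * (1.0 / lam + downtime) * math.expm1(lam * per_chunk))


def best_chunk_count(work, lam, ckpt, downtime, recovery, max_chunks=10_000):
    """Brute-force search for the chunk count with the lowest expected makespan."""
    best_k = min(
        range(1, max_chunks + 1),
        key=lambda k: expected_time(work, k, lam, ckpt, downtime, recovery),
    )
    return best_k, expected_time(work, best_k, lam, ckpt, downtime, recovery)


# Illustrative values only (hours): 100 h of work, MTBF of 24 h,
# 6 min checkpoint, 1 min downtime, 6 min recovery.
WORK, LAM, CKPT, DOWN, REC = 100.0, 1.0 / 24.0, 0.1, 1.0 / 60.0, 0.1

best_k, best_e = best_chunk_count(WORK, LAM, CKPT, DOWN, REC)
no_ckpt_e = expected_time(WORK, 1, LAM, CKPT, DOWN, REC)

print(f"best chunk count: {best_k}")
print(f"expected time with {best_k} chunks: {best_e:.1f} h")
print(f"expected time with a single chunk: {no_ckpt_e:.1f} h")
```

The brute-force search is used only to keep the sketch short and easy to check. For non-exponential failures such as Weibull, no closed form of this kind applies, which is where the dynamic programming approach mentioned in the abstract comes in.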

Chair/Author Details:

Marin Bougeret - ENS Lyon

Henri Casanova - University of Hawaii at Manoa

Mikael Rabie - ENS Lyon

Yves Robert - ENS Lyon

Frédéric Vivien - INRIA

