BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook MIMEDIR//EN
VERSION:1.0
BEGIN:VEVENT
DTSTART:20111116T190000Z
DTEND:20111116T193000Z
LOCATION:TCC 304
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:ABSTRACT: This work provides a rigorous analysis of checkpointing strategies=0A=
for minimizing expected job execution times on failure-prone platforms.=0A=
We give the optimal solution for exponentially=0A=
distributed failure inter-arrival times, for both sequential and=0A=
parallel jobs. For non-exponentially distributed failures, we=0A=
develop a dynamic programming algorithm to maximize the amount of=0A=
work completed before the next failure, which provides a good=0A=
heuristic for minimizing the expected execution time. We=0A=
consider various models of job parallelism and of parallel=0A=
checkpointing overhead. We present results from extensive simulation experiments=0A=
assuming that failures follow Exponential or Weibull distributions,=0A=
the latter being more representative of real-world systems. Our=0A=
simulation results corroborate theoretical results, and show=0A=
that our dynamic programming algorithm vastly outperforms=0A=
previous solutions for Weibull failures. We=0A=
also conduct simulation experiments based on failure logs=0A=
of production clusters, which confirm the superiority of our=0A=
approach for real-world clusters.
SUMMARY:Checkpointing strategies for parallel jobs
PRIORITY:3
END:VEVENT
END:VCALENDAR