SC is the International Conference for High Performance Computing, Networking, Storage and Analysis



SCHEDULE: NOV 12-18, 2011

FTI: high performance Fault Tolerance Interface for hybrid systems

SESSION: Checkpointing Optimization

EVENT TYPE: Paper

TIME: 10:30AM - 11:00AM

AUTHOR(S): Leonardo Arturo Bautista Gomez, Dimitri Komatitsch, Naoya Maruyama, Seiji Tsuboi, Franck Cappello, Satoshi Matsuoka, Takeshi Nakamura

ROOM: TCC 304

ABSTRACT:
Large scientific applications deployed on current petascale systems spend a significant share of their execution time dumping checkpoint files to remote storage. New fault-tolerance techniques will be critical to exploit post-petascale systems efficiently. In this work, we propose a low-overhead, high-frequency multi-level checkpoint technique in which we integrate a highly reliable, topology-aware Reed-Solomon encoding into a three-level checkpoint scheme. We hide the encoding time efficiently by using one dedicated fault-tolerance thread per node. We implement our technique in the Fault Tolerance Interface (FTI). We evaluate the correctness of our performance model and conduct a study of the reliability of our library. To demonstrate the performance of FTI, we present a case study of the Mw 9.0 Tohoku, Japan earthquake simulation with SPECFEM3D on TSUBAME2.0. We demonstrate a checkpoint overhead as low as 8% on sustained 0.1 Petaflops runs (1152 GPUs) while checkpointing at high frequency.
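
The idea of encoding checkpoints across a group of nodes, so that a lost checkpoint can be rebuilt without going to remote storage, can be illustrated with a minimal sketch. This is not FTI's code or API: it swaps the Reed-Solomon encoding named in the abstract for plain XOR parity, which tolerates only a single lost checkpoint per group, and the function names (xor_blocks, encode_group, recover) are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode_group(checkpoints):
    """Return one parity block for a group of per-node checkpoints.

    Shorter checkpoints are zero-padded to the longest one.
    XOR parity is a simplified stand-in for Reed-Solomon here.
    """
    size = max(len(c) for c in checkpoints)
    padded = [c.ljust(size, b"\0") for c in checkpoints]
    return xor_blocks(padded)


def recover(checkpoints, parity, lost_index, lost_length):
    """Rebuild the checkpoint at lost_index from the survivors and the parity."""
    size = len(parity)
    survivors = [
        c.ljust(size, b"\0")
        for i, c in enumerate(checkpoints)
        if i != lost_index
    ]
    # XOR of all survivors and the parity cancels everything but the lost block.
    return xor_blocks(survivors + [parity])[:lost_length]


# Three hypothetical node checkpoints in one encoding group.
group = [b"node0-state", b"node1", b"node2-data!"]
parity = encode_group(group)

# Node 1 fails; its checkpoint is rebuilt from the other two and the parity.
damaged = [group[0], None, group[2]]
restored = recover(damaged, parity, lost_index=1, lost_length=len(group[1]))
```

A Reed-Solomon code generalizes this by storing several parity blocks per group, so that more than one simultaneous loss can be tolerated, at the cost of a more expensive encoding step.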

Chair/Author Details:

Leonardo Arturo Bautista Gomez - Tokyo Institute of Technology

Dimitri Komatitsch - University of Toulouse

Naoya Maruyama - Tokyo Institute of Technology

Seiji Tsuboi - JAMSTEC

Franck Cappello - INRIA

Satoshi Matsuoka - Tokyo Institute of Technology

Takeshi Nakamura - JAMSTEC

The full paper can be found in the ACM Digital Library

Sponsors: ACM, IEEE