SC is the International Conference for
High Performance Computing, Networking,
Storage and Analysis

SCHEDULE: NOV 12-18, 2011

Resilient Software for ExaScale Computing

EVENT TYPE: Birds of a Feather

TIME: 5:30PM - 7:00PM

SESSION LEADER(S): Marie-Christine Sawley, Roel Wuyts


ExaScale computing systems will likely consist of millions of cores executing applications with billions of threads, based on CMOS technology at 14 nm or below, according to the ITRS roadmap. Processing elements built on this technology, coupled with dynamic power management, will exhibit high variability in performance, both between cores and across different runs. Even worse, preliminary figures indicate that, on average, something in the system will break at least every couple of minutes. Traditional checkpointing strategies are unlikely to work, given the time it will take to save the huge quantities of data and the fact that this data will need to be restored frequently. This BoF will investigate resilient software: software that is able to survive failing hardware and continue to run with minimal performance impact. We may also discuss the tradeoffs between rerunning the application and the cost of the instrumentation needed to provide resilience.
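The checkpointing concern can be made concrete with a rough calculation. The sketch below is not part of the session material; it uses Young's classic first-order approximation for the checkpoint interval, and the checkpoint write times are hypothetical values chosen only for illustration. The 120-second mean time between failures is one reading of "every couple of minutes". The model ignores restart time and is only indicative when the checkpoint time is not small compared with the failure interval.

```python
import math


def young_interval(checkpoint_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the best time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_s * mtbf_s)


def overhead_fraction(checkpoint_s: float, mtbf_s: float, interval_s: float) -> float:
    """First-order estimate of the fraction of time lost.

    First term: time spent writing checkpoints.
    Second term: expected recomputation after a failure (half an interval on average).
    """
    return checkpoint_s / interval_s + interval_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    mtbf_s = 120.0  # assumption: one failure every two minutes
    for checkpoint_s in (1.0, 10.0, 60.0):  # hypothetical checkpoint write times
        interval_s = young_interval(checkpoint_s, mtbf_s)
        lost = overhead_fraction(checkpoint_s, mtbf_s, interval_s)
        print(f"checkpoint {checkpoint_s:5.1f} s -> interval {interval_s:6.1f} s, "
              f"estimated overhead {lost:.0%}")
```

Under these assumptions, a checkpoint that takes 60 seconds to write pushes the estimated overhead to 100%, meaning the application would make essentially no forward progress.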

