BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook MIMEDIR//EN
VERSION:1.0
BEGIN:VEVENT
DTSTART:20111116T001500Z
DTEND:20111116T003000Z
LOCATION:TCC LL1
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:ABSTRACT: With the increasing scale and complexity of high performance=0A=
computing (HPC) systems, reliability is becoming critical for these=0A=
systems. System logs are the primary source of information to=0A=
understand and analyze system problems. Nevertheless, manual log=0A=
processing is time-consuming, error-prone, and not scalable.=0A=
To date, little work has been done on automated log analysis for=0A=
practical use in HPC systems. In this study, we present a log=0A=
analysis infrastructure that exploits data mining and statistical learning technologies. Our work can be broadly divided into four parts: log pre-processing, online failure prediction, automatic root cause diagnosis, and reliability modeling. We evaluate our preliminary results using system logs collected from production HPC systems. The work can greatly improve our=0A=
understanding of faults and failures arising from hardware/software=0A=
components and their interactions in HPC systems. It can further=0A=
facilitate resilience research for HPC systems.
SUMMARY:Log Analysis for Fault Management in Large-scale Systems
PRIORITY:3
END:VEVENT
END:VCALENDAR