BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook MIMEDIR//EN
VERSION:1.0
BEGIN:VEVENT
DTSTART:20111117T190000Z
DTEND:20111117T193000Z
LOCATION:TCC 303
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:ABSTRACT: We are witnessing a rapid evolution of HPC node architectures and on-chip parallelism as power and cooling constraints limit increases in microprocessor clock speeds. We demonstrate a hierarchical approach to extracting performance on a variety of emerging multicore-based supercomputing platforms. The application we examine is a structured-grid Lattice Boltzmann computation that simulates homogeneous isotropic turbulence in magnetohydrodynamics. We first examine auto-tuning techniques, including loop transformations, virtual vectorization, and use of ISA-specific intrinsics, together with programming model exploration (flat MPI, MPI-OpenMP, and MPI-Pthreads) and data and thread decomposition strategies designed to mitigate communication bottlenecks. We then evaluate the impact of these hierarchical tuning techniques across a range of problem sizes via large-scale simulations on state-of-the-art Cray XT4, Cray XE6, and IBM BlueGene/P platforms. Results show that our approach improves performance and energy by up to 3.4x using 49,152 cores, while providing a portable optimization methodology for a variety of numerical methods.
SUMMARY:Extracting Ultra-Scale Lattice Boltzmann Performance via Hierarchical and Distributed Auto-Tuning
PRIORITY:3
END:VEVENT
END:VCALENDAR