BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook MIMEDIR//EN
VERSION:1.0
BEGIN:VEVENT
DTSTART:20111116T190000Z
DTEND:20111116T193000Z
LOCATION:TCC 303
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:ABSTRACT: We reconsider the implementation of the fast multipole method (FMM) on a heterogeneous computing node with one or more multicore CPUs and one or more GPUs, as well as on an interconnected cluster of such nodes. The FMM is a divide-and-conquer algorithm that performs an N-body sum using a spatial decomposition. Using the observation that the local-summation and the analysis-based translation parts of the FMM are independent, we map them to the GPUs and CPUs, respectively. We analyze the FMM carefully to distribute work optimally between the CPUs and GPUs. We first develop a single-node version in which the CPU part is parallelized with OpenMP and the GPU part with CUDA. New parallel algorithms for creating FMM data structures are presented, together with load-balancing strategies for the single-node and distributed versions. Our 8-GPU performance is comparable to the 256-GPU results of the 2009 Gordon Bell Prize winner (Hamada et al., 2009).
SUMMARY:Scalable Fast Multipole Methods on Distributed Heterogeneous Clusters
PRIORITY:3
END:VEVENT
END:VCALENDAR