BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook MIMEDIR//EN
VERSION:1.0
BEGIN:VEVENT
DTSTART:20111115T183000Z
DTEND:20111115T190000Z
LOCATION:TCC 303
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:ABSTRACT: As the computational power of GPUs continues to scale with Moore's=0ALaw, an increasing number of applications are becoming limited by=0Amemory bandwidth. We propose an approach for programming GPUs with=0Atightly-coupled specialized DMA warps for performing memory transfers=0Abetween on-chip and off-chip memories. Separate DMA warps improve=0Amemory bandwidth utilization by better exploiting available=0Amemory-level parallelism and by leveraging efficient inter-warp=0Aproducer-consumer synchronization mechanisms. DMA warps also improve=0Aprogrammer productivity by decoupling the need for thread array shapes=0Ato match data layout. To illustrate the benefits of this approach, we=0Apresent an extensible API, CudaDMA, that encapsulates synchronization=0Aand common sequential and strided data transfer patterns. Using=0ACudaDMA, we demonstrate speedup of up to 1.37x on representative=0Asynthetic micro-benchmarks, and 1.15x-3.2x on several kernels from=0Ascientific applications written in CUDA running on NVIDIA Fermi GPUs.
SUMMARY:CudaDMA: Optimizing GPU Memory Bandwidth via Warp Specialization
PRIORITY:3
END:VEVENT
END:VCALENDAR