BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook MIMEDIR//EN
VERSION:1.0
BEGIN:VEVENT
DTSTART:20111116T183000Z
DTEND:20111116T190000Z
LOCATION:TCC 303
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:ABSTRACT: GPUs offer more than an order of magnitude higher peak floating-point performance than conventional processors. In this paper we describe in detail our experience tuning double-precision matrix-matrix multiplication (DGEMM) on the Fermi GPU architecture. We choose an optimal algorithm that blocks in both shared memory and registers to satisfy the constraints of the Fermi memory hierarchy. Our optimization strategy is further guided by a performance model based on microarchitecture benchmarks. Our optimizations include software pipelining, use of vector memory operations, and instruction scheduling. Our best CUDA algorithm achieves performance comparable to the latest vendor-supplied library, CUBLAS 3.2. We further improve upon this with an implementation in the native machine language, leading to a 20% increase in performance over CUBLAS. That is, the achieved peak performance (efficiency) improves from 302 Gflop/s (58%) to 362 Gflop/s (70%).
SUMMARY:Fast Implementation of DGEMM on Fermi GPU
PRIORITY:3
END:VEVENT
END:VCALENDAR