SC is the International Conference for High Performance Computing, Networking, Storage and Analysis



SCHEDULE: NOV 12-18, 2011


Fast Implementation of DGEMM on Fermi GPU

SESSION: GPU Applications

EVENT TYPE: Paper

TIME: 10:30AM - 11:00AM

AUTHOR(S): Guangming Tan, Linchuan Li, Sean Triechler, Everett Phillips, Yungang Bao, Ninghui Sun

ROOM: TCC 303

ABSTRACT:
GPUs offer more than an order-of-magnitude higher peak floating-point throughput than conventional processors. In this paper we describe our experience tuning double-precision matrix-matrix multiplication (DGEMM) on the Fermi GPU architecture. We choose an optimal algorithm with blocking in both shared memory and registers to satisfy the constraints of the Fermi memory hierarchy. Our optimization strategy is further guided by a performance model based on micro-architecture benchmarks. Our optimizations include software pipelining, use of vector memory operations, and instruction scheduling. Our best CUDA implementation achieves performance comparable to the latest vendor-supplied library, CUBLAS 3.2. We further improve upon this with an implementation in the native machine language, leading to a 20% increase in performance over CUBLAS. That is, the achieved peak performance (efficiency) improves from 302 Gflop/s (58%) to 362 Gflop/s (70%).
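The two-level blocking idea mentioned in the abstract can be illustrated with a minimal CPU sketch in C. This is not the authors' kernel: the tile sizes `BLOCK` and `REG`, the function names, and the row-major layout are assumptions made for illustration, and the abstract does not give the paper's actual blocking factors. The outer tile stands in for a block staged in shared memory, and the small accumulator array stands in for a block held in registers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tile sizes for illustration only; the paper's actual
   blocking factors are not stated in the abstract. */
enum {
    BLOCK = 32, /* outer tile: stands in for a shared-memory block */
    REG = 4     /* inner micro-tile: stands in for a register block */
};

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Reference version: C = alpha*A*B + beta*C.
   Row-major storage; A is m x k, B is k x n, C is m x n. */
void dgemm_naive(size_t m, size_t n, size_t k, double alpha,
                 const double *A, const double *B, double beta, double *C)
{
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t p = 0; p < k; ++p)
                sum += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * sum + beta * C[i * n + j];
        }
}

/* Two-level blocked version computing the same result. */
void dgemm_blocked(size_t m, size_t n, size_t k, double alpha,
                   const double *A, const double *B, double beta, double *C)
{
    assert(A && B && C);

    /* Apply beta once up front so the tiles can simply accumulate. */
    for (size_t i = 0; i < m * n; ++i)
        C[i] *= beta;

    /* Level 1: walk BLOCK x BLOCK tiles of the iteration space. */
    for (size_t i0 = 0; i0 < m; i0 += BLOCK)
        for (size_t j0 = 0; j0 < n; j0 += BLOCK)
            for (size_t p0 = 0; p0 < k; p0 += BLOCK) {
                size_t i1 = min_sz(i0 + BLOCK, m);
                size_t j1 = min_sz(j0 + BLOCK, n);
                size_t p1 = min_sz(p0 + BLOCK, k);

                /* Level 2: REG x REG micro-tiles kept in local accumulators. */
                for (size_t i = i0; i < i1; i += REG)
                    for (size_t j = j0; j < j1; j += REG) {
                        size_t ri = min_sz(REG, i1 - i);
                        size_t rj = min_sz(REG, j1 - j);
                        double acc[REG][REG] = {{0.0}};

                        for (size_t p = p0; p < p1; ++p)
                            for (size_t a = 0; a < ri; ++a) {
                                /* Each loaded A value is reused rj times. */
                                double a_val = A[(i + a) * k + p];
                                for (size_t b = 0; b < rj; ++b)
                                    acc[a][b] += a_val * B[p * n + (j + b)];
                            }

                        /* Write the micro-tile back once per k-block. */
                        for (size_t a = 0; a < ri; ++a)
                            for (size_t b = 0; b < rj; ++b)
                                C[(i + a) * n + (j + b)] += alpha * acc[a][b];
                    }
            }
}
```

Calling `dgemm_blocked(m, n, k, alpha, A, B, beta, C)` should give the same result as `dgemm_naive` up to floating-point summation order. The point of the structure is data reuse: each value loaded into the inner tile is used several times before it is discarded, which is the property a GPU kernel would exploit with shared memory and registers. The software pipelining, vector memory operations, and instruction scheduling described in the abstract are not modeled here.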

Chair/Author Details:

Guangming Tan - Institute of Computing Technology

Linchuan Li - Institute of Computing Technology

Sean Triechler - NVIDIA

Everett Phillips - NVIDIA

Yungang Bao - Institute of Computing Technology

Ninghui Sun - Institute of Computing Technology


The full paper can be found in the ACM Digital Library

Sponsors: ACM, IEEE