BEGIN:VCALENDAR
PRODID:-//Microsoft Corporation//Outlook MIMEDIR//EN
VERSION:1.0
BEGIN:VEVENT
DTSTART:20111117T213000Z
DTEND:20111117T220000Z
LOCATION:TCC 303
DESCRIPTION;ENCODING=QUOTED-PRINTABLE:ABSTRACT: Hadoop has become the de facto platform for large-scale analysis in commercial applications, and increasingly so in scientific applications. However, applying Hadoop's byte-stream data model causes inefficiencies when used for scientific data that is stored in highly-structured, binary file formats. This limits the scalability of Hadoop applications in science. We introduce SciHadoop, a Hadoop plugin allowing scientists to specify logical queries over array-based data models. SciHadoop executes these queries as map/reduce programs defined over the logical data model. We describe the implementation of a SciHadoop prototype for netCDF data sets, and quantify the performance of three effective optimizations: the first optimization minimizes network traffic by intelligently partitioning the input space of mappers at the logical level; the second optimization avoids full scans by pruning partitions using knowledge of query data dependencies; the third optimization minimizes data transfers by processing holistic aggregation functions (e.g. median) at mappers instead of reducers whenever possible.
SUMMARY:SciHadoop: Array-based Query Processing in Hadoop
PRIORITY:3
END:VEVENT
END:VCALENDAR