ISTC-CC Abstract
Rethinking Data-Intensive Science Using Scalable Analytics Systems
Proceedings of the 34th ACM SIGMOD International Conference on Management of Data (SIGMOD’15), May-June 2015.
Frank Austin Nothaft*, Matt Massie*, Timothy Danford*‡, Zhao Zhang*,
Uri Laserson^, Carl Yeksigian‡, Jey Kottalam*, Arun Ahuja†,
Jeff Hammerbacher†^, Michael Linderman†, Michael J. Franklin*,
Anthony D. Joseph*, David A. Patterson*
*AMPLab, University of California, Berkeley
^Cloudera, San Francisco, CA
†Carl Icahn School of Medicine, Mount Sinai, New York, NY
‡Genomebridge, Cambridge, MA
"Next generation" data acquisition technologies are allowing scientists to collect exponentially more data at a lower cost. These trends are broadly impacting many scientific fields, including genomics, astronomy, and neuroscience. We can attack the problem caused by exponential data growth by applying horizontally scalable techniques from current analytics systems to accelerate scientific processing pipelines.
In this paper, we describe ADAM, an example genomics pipeline that leverages the open-source Apache Spark and Parquet systems to achieve a 28X speedup over current genomics pipelines, while reducing cost by 63%. From building this system, we were able to distill a set of techniques for implementing scientific analyses efficiently using commodity "big data" systems. To demonstrate the generality of our architecture, we then implement a scalable astronomy image processing system which achieves a 2.8-8.9X improvement over the state-of-the-art MPI-based system.