INTEL SCIENCE & TECHNOLOGY CENTER

CLOUD COMPUTING

ISTC-CC Abstract

ADAM: Genomics Formats and Processing Patterns for Cloud Scale Computing

Berkeley Technical Report No. UCB/EECS-2013-207, December 2013.

Matt Massie, Frank Nothaft, Christopher Hartl, Christos Kozanitis, Andre Schumacher, Anthony D. Joseph, David A. Patterson

University of California at Berkeley

Current genomics data formats and processing pipelines are not designed to scale well to large datasets. The current Sequence/Binary Alignment/Map (SAM/BAM) formats were intended for single-node processing [18]. There have been attempts to adapt BAM to distributed computing environments, but they see limited scalability past eight nodes [22]. Additionally, because these formats lack an explicit data schema, there are well-known incompatibilities between libraries that implement SAM/BAM/Variant Call Format (VCF) data access.

To address these problems, we introduce ADAM, a set of formats, APIs, and processing stage implementations for genomic data. ADAM is fully open source under the Apache 2 license, and is implemented on top of Avro and Parquet [5, 26] for data storage. Our reference pipeline is implemented on top of Spark, a high-performance in-memory map-reduce system [32]. This combination provides the following advantages: 1) Avro provides explicit data schema access in C/C++/C#, Java/Scala, Python, PHP, and Ruby; 2) Parquet allows access by database systems like Impala and Shark; and 3) Spark improves performance through in-memory caching and reduced disk I/O.
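To make the idea of an explicit data schema concrete, the following is a minimal sketch in Python using only the standard library. The schema is written in Avro's JSON schema notation, but the record name `AlignedRead`, its fields, and the `conforms` helper are hypothetical and chosen for illustration only; they are not ADAM's actual schema or API. The checker covers just a few primitive types and unions, which is far less than a real Avro library does.

```python
import json

# Hypothetical, simplified Avro-style record schema. The record name and
# field names are illustrative assumptions, NOT ADAM's real schema.
READ_SCHEMA_JSON = """
{
  "type": "record",
  "name": "AlignedRead",
  "fields": [
    {"name": "contig",   "type": "string"},
    {"name": "start",    "type": "long"},
    {"name": "sequence", "type": "string"},
    {"name": "mapq",     "type": ["null", "int"], "default": null}
  ]
}
"""

# Map the few Avro primitive type names used above to Python types.
PRIMITIVES = {"string": str, "int": int, "long": int, "null": type(None)}


def conforms(record, schema):
    """Return True if `record` (a dict) matches the record schema.

    Illustrative helper: handles only primitive types and unions of
    primitives, and rejects fields the schema does not declare.
    """
    declared = {field["name"] for field in schema["fields"]}
    if set(record) - declared:
        return False  # undeclared field present
    for field in schema["fields"]:
        allowed = field["type"]
        if not isinstance(allowed, list):
            allowed = [allowed]  # treat a single type as a one-element union
        if field["name"] not in record:
            if "default" in field:
                continue  # a missing field is fine when a default exists
            return False
        value = record[field["name"]]
        # Exact type match, so that bool is not accepted as an int.
        if not any(type(value) is PRIMITIVES[t] for t in allowed):
            return False
    return True


SCHEMA = json.loads(READ_SCHEMA_JSON)

if __name__ == "__main__":
    good = {"contig": "chr1", "start": 100, "sequence": "ACGT"}
    bad = {"contig": "chr1", "start": "100", "sequence": "ACGT"}
    print(conforms(good, SCHEMA), conforms(bad, SCHEMA))
```

Because the schema is data rather than a convention buried in each library, any reader or writer in any language can check records against the same definition, which is the kind of cross-library agreement that the schema-less SAM/BAM/VCF formats lack.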

FULL PAPER: pdf