Report on Management of Big Data Panel
JULY 17, 2013
ISTC-CC faculty Karsten Schwan organized a panel on Management of Big Data Systems at the International Conference on Autonomic Computing in San Jose in June 2013, featuring speakers from many of the key 'big data' companies in the U.S. The well-attended panel's charge was as follows: "New challenges for managing 'big data' applications arise from new usage models now being envisioned and/or pursued by researchers and practitioners. Such usage models include 'spot market' pricing for virtual machines at Amazon, 'fast data' processing for online data queries, and real-time analytics for the expected exabyte-level outputs of future high performance machines. Coupled with such use cases are new opportunities derived from the potentially immeasurably large collective data volumes captured by end devices like smartphones, wearables, and others. The purpose of this panel is to identify and discuss the 'management' issues that arise for new cloud usage models and big data applications, and to describe new problems the community should investigate. A desired outcome is to find issues common to these multiple usages and environments, and to discover and investigate cross-cutting solutions."

The panelists were:
- Lucy Cherkasova, HP Labs
- Gregory Eitzmann, Google
- Sameh Elnikety, Microsoft Research
- Krishna Gade, Twitter
- Nagapramod Mandagere, IBM Research
- Sambavi Muthukrishnan, Facebook
- Rajeev Gandhi, Carnegie Mellon University and Yinzcam
- Dilma da Silva, Qualcomm
Below are a number of interesting comments made and challenges articulated by panelists, recorded by the panel moderators, Karsten Schwan and Vanish Talwar.
Lucy Cherkasova stressed the importance of better understanding the big data applications and systems now rapidly being developed. What are the effects of different data access patterns? How do we deal with hot vs. warm or cold data? What happens when we scale out, or when data velocity increases? The result is a need for modeling and performance understanding, along with realistic benchmarks and workloads driving those models.
Gregory Eitzmann stressed the importance of going beyond thinking about how to capture and process big data, to also consider data stewardship. How to deal with data deletion and change? Will we face situations in which much of our datacenter capacity is taken by deleting and managing data (e.g., to respond to requests for account removal) vs. using capacity to monetize it? How to cope with the stark difference between data processing that scales by trading off data correctness/accuracy or completeness vs. the precision -- 'banking precision' -- needed when data must be removed, curated, or cleansed? What data can be kept vs. discarded? Where can or should data be kept, not just within storage hierarchies, but across the many datacenters used by a company, to, say, ensure uninterrupted operation under outages (e.g., when a cable is cut)?
Sameh Elnikety used the example of Bing queries and its distributed graph processing engine to point out new ways to process data, avoiding data movement and consequent potential networking bottlenecks by shipping code to where data resides rather than moving data. Such processing differs substantially from that done by MapReduce jobs. He also pointed out the overwhelmingly important role of software in big data systems, with an ever-accelerating rate of development of new data processing infrastructures in the open source domain.
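The "ship code to data" idea Sameh described can be sketched in a few lines. This toy version is purely illustrative -- the class and function names below are assumptions, not the actual API of Bing's graph engine:

```python
# Minimal sketch of "ship code to data" vs. "ship data to code".
# All names here are hypothetical illustrations.

class Partition:
    """A worker-local shard of a large graph/dataset."""
    def __init__(self, records):
        self.records = records

    def run(self, fn):
        # The function executes where the data lives; only the
        # (small) result crosses the network, not the records.
        return fn(self.records)

def count_matching(records):
    """Example query: count even-valued records."""
    return sum(1 for r in records if r % 2 == 0)

# Two "workers", each holding a shard.
shards = [Partition(range(0, 1000)), Partition(range(1000, 2000))]

# Ship the code: send count_matching to each shard, combine small results.
total = sum(p.run(count_matching) for p in shards)
print(total)  # 1000 even numbers across both shards
```

The network cost is one small integer per shard instead of thousands of records, which is the bottleneck-avoidance point being made.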
Sambavi Muthukrishnan from Facebook, after reviewing Facebook's data warehousing infrastructure and data processing/storage requirements, focused on the hot vs. warm vs. cold nature of big data, and the consequent need for different data representations, access, and storage methods. In later discussions, the question was raised whether and to what extent data cleansing or deletion might end up jeopardizing the efficient operation of the carefully constructed and maintained data processing and storage hierarchies now used by companies like Facebook. As a possible approach to this problem, Sambavi offered software and hardware methods that present a single flexible and unified framework for hot-to-cold data storage and access, making it possible to manipulate data efficiently whenever and wherever needed. This calls for much work on data management runtimes, on methods and policies for data migration, on automated changes in data representation, and more, to offload from developers the burden of understanding storage hierarchies in detail and to facilitate tasks like those requiring access to 'cold' data.
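One way to picture such a unified hot-to-cold framework is a facade that hides tier placement behind a single get/put interface. The sketch below is an assumption-laden toy, in the spirit of the idea rather than any Facebook system; the tier names, time windows, and classification policy are all invented for illustration:

```python
# Hypothetical sketch of a unified hot/warm/cold storage facade.
# Tier thresholds and policy are illustrative assumptions.
import time

class TieredStore:
    HOT_WINDOW = 60 * 60 * 24        # accessed within a day -> hot (assumed)
    WARM_WINDOW = 60 * 60 * 24 * 30  # within a month -> warm (assumed)

    def __init__(self):
        self._data = {}          # key -> value; backend choice hidden from caller
        self._last_access = {}   # key -> last access timestamp

    def put(self, key, value):
        self._data[key] = value
        self._last_access[key] = time.time()

    def get(self, key):
        # One call site regardless of tier: the runtime, not the
        # developer, worries about where the bytes actually live.
        self._last_access[key] = time.time()
        return self._data[key]

    def tier(self, key, now=None):
        """Classify a key by access recency (the migration policy input)."""
        age = (now or time.time()) - self._last_access[key]
        if age < self.HOT_WINDOW:
            return "hot"
        if age < self.WARM_WINDOW:
            return "warm"
        return "cold"

store = TieredStore()
store.put("profile:42", {"name": "example"})
print(store.tier("profile:42"))  # "hot" immediately after access
```

The point of the facade is exactly what the paragraph describes: callers never branch on tier, so migration and representation changes can happen underneath without touching application code.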
Nagapramod Mandagere from IBM Research offered a set of use cases already stressing big data infrastructure and likely to drive future development. When companies outsource their IT systems, e.g., to IBM, these widely dispersed datacenter systems must be managed to minimize downtime, to carry out timely updates, and other such management or troubleshooting activities. How to run analytics across the distributed monitoring/performance data sets captured for these datacenter systems? How to run queries in these settings? How to maintain and store performance and other management data? Needed are coordinated or orchestrated methods to manage distributed datacenter systems and their analytics.
Dilma da Silva from Qualcomm focused on the copious sensor and environmental data volumes 'at the edge'. Given limited cloud upload bandwidth, scalability is doubtful for the simple model in which all edge data is brought together in a few centralized datacenters. Yet centralized analysis is critical for obtaining insights about distributed activities, behaviors, and events, and we have already made great strides in efficiently performing such centralized big data processing. Questions asked include the following. To what extent can the centralized model be maintained? Are there ways to benefit from centralized processing without having to bring all data to a few central locations? Commonalities with the topics of distributed IT management (IBM) and distributed query processing (Microsoft) were noted by the audience.
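One answer to the "benefit from centralized processing without shipping all the data" question is edge pre-aggregation: each edge node reduces its raw stream to a small summary, and only summaries travel to the datacenter. The sketch below is a toy illustration of that pattern; the summary shape and function names are assumptions, not anything from the talk:

```python
# Toy sketch of edge pre-aggregation: instead of uploading every raw
# sensor reading, each edge node ships a constant-size summary.

def summarize(readings):
    """Edge-side reduction: raw stream -> small mergeable summary."""
    return {"count": len(readings), "total": sum(readings), "max": max(readings)}

def merge(summaries):
    """Datacenter-side combine: summaries are cheap to merge centrally."""
    return {
        "count": sum(s["count"] for s in summaries),
        "total": sum(s["total"] for s in summaries),
        "max": max(s["max"] for s in summaries),
    }

edge_a = summarize([1.0, 2.0, 3.0])   # 3 readings -> 1 small record
edge_b = summarize([10.0, 20.0])
combined = merge([edge_a, edge_b])
print(combined["count"], combined["max"])  # 5 20.0
```

This only works for queries whose answers decompose into mergeable summaries (counts, sums, extrema, sketches), which is precisely why the open question of how far the centralized model can stretch remains interesting.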
Krishna Gade from Twitter focused his comments on stream processing. How to scale distributed stream processing systems in the face of workload changes (e.g., imbalance)? Specific issues include queue backups and delays, failure handling, and others. The Storm system addresses some of these issues, but Krishna raised as an interesting future challenge the question of how to combine such stream or online processing with the batch processing so efficiently done by MapReduce-based software. A first step taken by Twitter in that direction is to use a single front-end specification of the required data processing, from which either stream or batch realizations are then produced, so as not to burden programmers with making that distinction when data analytics are defined. Next steps may concern efficiency methods when 'compiling' such specifications, algorithms to combine current streaming data with existing data previously captured and/or processed, and others.
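The "single specification, two realizations" idea can be sketched concretely. The toy below defines a word-count analytic once and runs it both as a batch job and as an incremental stream fold; the function names and structure are illustrative assumptions, not Twitter's actual API:

```python
# Hedged sketch of "write the analytic once, run it as stream or batch".
# This toy version is illustrative, not Twitter's actual system.
from collections import Counter

def wordcount_spec(event):
    """Single front-end specification: event -> (key, increment) pairs."""
    return [(w, 1) for w in event.split()]

def run_batch(spec, events):
    # Batch realization: process the whole corpus at once (MapReduce-style).
    counts = Counter()
    for e in events:
        for k, v in spec(e):
            counts[k] += v
    return counts

def run_stream(spec, event, state):
    # Streaming realization: fold one event at a time into running state.
    for k, v in spec(event):
        state[k] += v
    return state

events = ["big data", "big graphs"]
batch = run_batch(wordcount_spec, events)

state = Counter()
for e in events:
    run_stream(wordcount_spec, e, state)

print(batch == state)  # True: both realizations agree
```

Because the spec emits associative increments, the two execution strategies provably converge on the same answer -- which is what lets the programmer stop caring about the stream/batch distinction.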
Rajeev Gandhi from CMU and Yinzcam presented Yinzcam's innovative sports entertainment applications, which provide fans in sports venues with live replays, player stats, etc. For this application, he pointed out the unique new problems created when data processing is done 'in the cloud', e.g., using Amazon's EC2 virtual machines, vs. on dedicated private computing infrastructure. A Yinzcam paper in the Big Data Track of ICAC presented novel methods for dealing with application elasticity when operating in the cloud, but in this talk, Rajeev pointed out the need for new cloud-level functionality to make it easier to run big data codes in clouds. Failure diagnosis support from the cloud is needed, for instance: data storage in Amazon's RDS is limited to 1TB, and it is currently up to applications to handle that limit, without the benefit of cloud-level events to deal with data thresholds or overflow.
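Absent cloud-level threshold events, the workaround Rajeev alluded to is application-level polling: the app watches its own storage usage and fires its own alert. A minimal sketch of that pattern follows; the 1TB figure comes from the talk, while everything else (names, callback shape, 80% threshold) is an assumption for illustration:

```python
# Sketch of application-level threshold monitoring that, per the talk,
# apps must build themselves today instead of receiving cloud events.
TB = 1024 ** 4
RDS_LIMIT_BYTES = 1 * TB            # per-instance cap cited in the talk

def check_storage(used_bytes, on_threshold, threshold=0.8):
    """Poll used storage and fire a callback as the cap approaches --
    the kind of event a cloud-level service could instead deliver."""
    frac = used_bytes / RDS_LIMIT_BYTES
    if frac >= threshold:
        on_threshold(frac)
        return True
    return False

alerts = []
check_storage(int(0.9 * TB), lambda f: alerts.append(f))
print(len(alerts))  # 1: the 90%-full instance triggered the alert
```

In production such a check would run on a timer and trigger sharding or archival; the point here is only that the detection burden currently sits with the application rather than the cloud.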
Extensive discussion and interaction after the initial presentations touched upon the many topics raised above. Is distributed (vs. currently centralized) query processing inevitable/desirable? Can such processing be made more efficient with new methods for deciding what data to keep, what to process, and what not to touch? Can we draw sufficiently good conclusions from such partial data processing? How to evaluate the resulting data precision/accuracy vs. performance tradeoffs? An interesting specific case is the question of which team's queries to run when a single datacenter must run thousands of such queries (e.g., MapReduce jobs). How to prioritize, perhaps in ways favoring monetization or importance or value? How to capture or describe such value? A specific example given was Twitter data redundancy: there will be many tweets on the same topic, so even trend detection need not process all such data.
Data hotness was another topic discussed, as hot data is stored for rapid processing, which is expensive, whereas cold data storage is almost unlimited in capacity, but its processing is painful. A related topic -- what data is important vs. not -- what can be filtered and how -- was discussed for multi-datacenter systems, along with the question about what data it is possible not to keep. Another question was whether and to what extent 'data deletion' breaks or disturbs our carefully constructed big data processing models. Seamless data accessibility is the property desired by panelists.
There was consensus that the fast-moving nature of this field is both enabled by and almost mandates the use of open source software in these environments. In fact, the current rich ecosystem of open source solutions is acting as an accelerator for development in big data systems and applications, benefiting both large companies like Facebook and small companies operating in the 'data ecosystem'. Editor's note: in related discussions at Intel's University Symposium on the day prior to this panel, the opinion was voiced that government, university, and industry interaction will be critical to build and maintain the data ecosystem needed to encourage, accelerate, and maintain rapid innovation. Another reason for such cooperative action is the effect of big data applications on future computer architecture, mandating 'codesign' processes on infrastructures starting with sensor and handheld devices and ending in multi-datacenter systems.