Ling Liu's SC13 paper "Large Graph Processing Without the Overhead" featured by HPCwire.
Another list highlighting Open Source Software Releases.
Second GraphLab workshop should be even bigger than the first! GraphLab is a new programming framework for graph-style data analytics.
Multi-tenancy on GPGPU-based Servers
7th International Workshop on Virtualization Technologies in Distributed Computing (VTDC'13), June 2013.
Dipanjan Sengupta, Raghavendra Belapure, Karsten Schwan
Georgia Institute of Technology
While GPUs have become prominent both in high performance computing and in online or cloud services, they still appear as explicitly selected 'devices' rather than as first class schedulable entities that can be efficiently shared by diverse server applications. To combat the consequent likely under-utilization of GPUs when used in modern server or cloud settings, we propose 'Rain', a system level abstraction for GPU 'hyperthreading' that makes it possible to efficiently utilize GPUs without compromising fairness among multiple tenant applications. Rain uses a multi-level GPU scheduler that decomposes the scheduling problem into a combination of load balancing and per-device scheduling. Implemented by overriding applications' standard GPU selection calls, Rain operates without the need for application modification, making possible GPU scheduling methods that include prioritizing certain jobs, guaranteeing fair shares of GPU resources, and/or favoring jobs with least attained GPU services. GPU multi-tenancy via Rain is evaluated with server workloads using a wide variety of CUDA SDK and Rodinia suite benchmarks, on a multi-GPU, multi-core machine typifying future high end server machines. Averaged over ten applications, GPU multi-tenancy on a smaller scale server platform results in application speedups of up to 1.73x compared to their traditional implementation with NVIDIA's CUDA runtime. Averaged over 25 pairs of short and long running applications, on an emulated larger scale server machine, multitenancy results in system throughput improvements of up to 6.71x, and in 43% and 29.3% improvements in fairness compared to using the CUDA runtime and a naïve fair-share scheduler.
FULL PAPER: pdf