INTEL SCIENCE & TECHNOLOGY CENTER

CLOUD COMPUTING


ISTC-CC Abstract

TLB Improvements for Chip Multiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs

ACM Transactions on Architecture and Compiler Optimization (TACO'13), January, 2013.

Dan Lustig, Abhishek Bhattacharjee*, Margaret Martonosi

Princeton University
* Rutgers University

Translation Lookaside Buffers (TLBs) are critical to overall system performance. Much past research has addressed uniprocessor TLBs, lowering access times and miss rates. However, as chip multiprocessors (CMPs) become ubiquitous, TLB design and performance must be re-evaluated. Our paper begins by performing a thorough TLB performance evaluation of sequential and parallel benchmarks running on a real-world, modern CMP system using hardware performance counters. This analysis demonstrates the need for further improvement of TLB hit rates for both classes of application, and it also points out that the data TLB has a significantly higher miss rate than the instruction TLB in both cases.

In response to the characterization data, we propose and evaluate both Inter-Core Cooperative (ICC) TLB prefetchers and Shared Last-Level (SLL) TLBs as alternatives to the commercial norm of private, per-core L2 TLBs. ICC prefetchers eliminate 19% to 90% of data TLB (D-TLB) misses across parallel workloads while requiring only modest changes in hardware. SLL TLBs eliminate 7% to 79% of D-TLB misses for parallel workloads and 35% to 95% of D-TLB misses for multiprogrammed sequential workloads. This corresponds to 27% and 21% increases in hit rates, respectively, as compared to private, per-core L2 TLBs, and is achieved with even more modest hardware requirements.
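The intuition behind a shared last-level TLB can be illustrated with a toy model. The sketch below is not the authors' simulator or methodology: the `LruTlb` class, the `make_trace` and `count_misses` helpers, the fully associative LRU organization, the entry counts, and the synthetic access trace are all assumptions chosen for illustration. It compares total misses for private per-core TLBs against a single shared TLB of the same total capacity, on a trace in which every core touches the same pages.

```python
from collections import OrderedDict


class LruTlb:
    """Toy fully associative TLB with LRU replacement (an assumed organization)."""

    def __init__(self, entries):
        self.entries = entries
        self.cache = OrderedDict()  # virtual page number -> placeholder
        self.misses = 0

    def access(self, vpn):
        """Look up a virtual page number; return True on a hit."""
        if vpn in self.cache:
            self.cache.move_to_end(vpn)  # mark as most recently used
            return True
        self.misses += 1
        self.cache[vpn] = True
        if len(self.cache) > self.entries:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return False


def make_trace(num_cores, num_pages, rounds):
    """Synthetic trace: in each round, every core touches every page in turn."""
    return [(core, page)
            for _ in range(rounds)
            for page in range(num_pages)
            for core in range(num_cores)]


def count_misses(trace, num_cores, entries_per_core, shared):
    """Replay (core, vpn) pairs and return the total number of TLB misses."""
    if shared:
        # One structure with the combined capacity, visible to all cores.
        shared_tlb = LruTlb(num_cores * entries_per_core)
        per_core = [shared_tlb] * num_cores
        distinct = [shared_tlb]
    else:
        per_core = [LruTlb(entries_per_core) for _ in range(num_cores)]
        distinct = per_core
    for core, vpn in trace:
        per_core[core].access(vpn)
    return sum(tlb.misses for tlb in distinct)


if __name__ == "__main__":
    trace = make_trace(num_cores=4, num_pages=8, rounds=3)
    print("private misses:", count_misses(trace, 4, 4, shared=False))
    print("shared misses: ", count_misses(trace, 4, 4, shared=True))
```

On this contrived trace the private configuration misses on every access, because each core cycles through more pages than its own TLB holds, while the shared configuration misses only on the first touch of each page. Real workloads share pages far less uniformly, so the gap here says nothing about the magnitudes reported above. It only shows the mechanism by which a translation fetched by one core can serve the others.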

Because of their benefits for parallel applications, their applicability to sequential workloads, and their readily implementable hardware, SLL TLBs and ICC TLB prefetchers hold great promise for CMPs.

KEYWORDS: Design, Experimentation, Measurement, Performance, Translation Lookaside Buffer, Shared Last-Level TLB, TLB Prefetching, Simulation, Performance Evaluation

FULL PAPER: pdf