Holistic Management of the GPGPU Memory Hierarchy to Manage Warp-level Latency Tolerance

04/30/2018
by   Rachata Ausavarungnirun, et al.

In a modern GPU architecture, all threads within a warp execute the same instruction in lockstep. For a memory instruction, this can lead to memory divergence: the memory requests for some threads are serviced early, while the remaining requests incur long latencies. This divergence stalls the warp, as it cannot execute the next instruction until all requests from the current instruction complete. In this work, we make three new observations. First, GPGPU warps exhibit heterogeneous memory divergence behavior at the shared cache: some warps have most of their requests hit in the cache, while other warps see most of their requests miss. Second, a warp retains the same divergence behavior for long periods of execution. Third, requests going to the shared cache can incur queuing delays as large as hundreds of cycles, exacerbating the effects of memory divergence. We propose a set of techniques, collectively called Memory Divergence Correction (MeDiC), that reduce the negative performance impact of memory divergence and cache queuing. MeDiC delivers an average speedup of 21.8% over a state-of-the-art GPU cache management mechanism across 15 different GPGPU applications.
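
To make the stall behavior concrete, the following is a minimal sketch of why a single miss can dominate a warp's latency. It is not the paper's mechanism or simulator: the latency constants, the 0.7 threshold, and the names `warp_stall_cycles`, `hit_ratio`, and `classify_warp` are all assumptions chosen for illustration.

```python
from typing import List

# Assumed latencies in cycles, for illustration only (not taken from the paper).
HIT_LATENCY = 20     # request that hits in the shared cache
MISS_LATENCY = 400   # request that misses and goes to main memory


def warp_stall_cycles(hits: List[bool]) -> int:
    """Cycles before the warp can issue its next instruction.

    The warp waits for its slowest request, so the stall equals the
    maximum per-thread latency rather than the average.
    """
    return max(HIT_LATENCY if hit else MISS_LATENCY for hit in hits)


def hit_ratio(hits: List[bool]) -> float:
    """Fraction of the warp's requests that hit in the shared cache."""
    return sum(hits) / len(hits)


def classify_warp(hits: List[bool], threshold: float = 0.7) -> str:
    """Label a warp by its hit ratio; the threshold is a hypothetical choice."""
    ratio = hit_ratio(hits)
    if ratio >= threshold:
        return "mostly-hit"
    if ratio <= 1.0 - threshold:
        return "mostly-miss"
    return "balanced"


if __name__ == "__main__":
    all_hit = [True] * 32
    one_miss = [True] * 31 + [False]
    all_miss = [False] * 32
    for name, warp in [("all hit", all_hit), ("one miss", one_miss), ("all miss", all_miss)]:
        print(name, classify_warp(warp), warp_stall_cycles(warp))
```

Under these assumed latencies, a warp with 31 hits and one miss stalls for as long as a warp whose requests all miss, which is the divergence effect the abstract describes: the early hits bring no benefit until the last request completes.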


research
05/20/2018

CIAO: Cache Interference-Aware Throughput-Oriented Architecture and Scheduling for GPUs

A modern GPU aims to simultaneously execute more warps for higher Thread...
research
07/04/2018

Cimple: Instruction and Memory Level Parallelism

Modern out-of-order processors have increased capacity to exploit instru...
research
09/01/2022

Hermes: Accelerating Long-Latency Load Requests via Perceptron-Based Off-Chip Load Prediction

Long-latency load requests continue to limit the performance of high-per...
research
06/20/2016

Criticality Aware Multiprocessors

Typically, a memory request from a processor may need to go through many...
research
07/12/2021

DARM: Control-Flow Melding for SIMT Thread Divergence Reduction – Extended Version

GPGPUs use the Single-Instruction-Multiple-Thread (SIMT) execution model...
research
06/20/2017

Index Search Algorithms for Databases and Modern CPUs

Over the years, many different indexing techniques and search algorithms...
research
12/04/2017

Data Cache Prefetching with Perceptron Learning

Cache prefetcher greatly eliminates compulsory cache misses, by fetching...
