CIAO: Cache Interference-Aware Throughput-Oriented Architecture and Scheduling for GPUs

05/20/2018
by Jie Zhang, et al.

A modern GPU aims to execute more warps simultaneously for higher Thread-Level Parallelism (TLP) and performance. When generating many memory requests, however, warps contend for limited cache space and thrash the cache, which in turn severely degrades performance. To reduce such cache thrashing, we may adopt cache locality-aware warp scheduling, which gives higher execution priority to warps with higher potential for data locality. However, we observe that warps with high potential for data locality often incur far more cache thrashing or interference than warps with low potential for data locality. Consequently, cache locality-aware warp scheduling may undesirably increase cache interference and/or unnecessarily decrease TLP. In this paper, we propose the Cache Interference-Aware throughput-Oriented (CIAO) on-chip memory architecture and warp scheduling, which exploit unused shared memory space and build on an insight opposite to that of cache locality-aware warp scheduling. Specifically, the CIAO on-chip memory architecture can adaptively redirect memory requests of severely interfering warps to unused shared memory space, isolating the memory requests of these interfering warps from those of interfered warps. If these interfering warps still incur severe cache interference, CIAO warp scheduling then begins to selectively throttle their execution. Our experiment shows that CIAO can offer 54% higher performance than prior cache locality-aware scheduling at a small chip cost.
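
The two-stage idea in the abstract (isolate interfering warps first, throttle them only if interference persists) can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the paper's hardware design: the names `WarpState`, `Action`, and `step`, the per-warp `interference` counter, the single `threshold`, and the choice to throttle directly when no unused shared memory is available are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NORMAL = "normal"      # requests keep going to the L1 data cache
    REDIRECT = "redirect"  # requests are served from unused shared memory space
    THROTTLE = "throttle"  # warp execution is temporarily stalled


@dataclass
class WarpState:
    warp_id: int
    interference: int = 0    # hypothetical counter: interference this warp caused to others
    redirected: bool = False  # True once the warp's requests are isolated


def step(warp: WarpState, threshold: int, shared_mem_free: bool) -> Action:
    """Pick an action for one warp: isolate first, throttle as a last resort."""
    severe = warp.interference > threshold  # assumed definition of "severely interfering"
    if not severe:
        # Nothing to escalate; an already isolated warp stays isolated.
        return Action.REDIRECT if warp.redirected else Action.NORMAL
    if not warp.redirected and shared_mem_free:
        # Stage 1: isolate the warp's requests in unused shared memory.
        warp.redirected = True
        warp.interference = 0  # start a fresh observation window
        return Action.REDIRECT
    # Stage 2: still interfering after isolation, or (assumption) nowhere to isolate.
    return Action.THROTTLE


if __name__ == "__main__":
    w = WarpState(warp_id=0, interference=20)
    print(step(w, threshold=10, shared_mem_free=True))  # first escalation
    w.interference = 20
    print(step(w, threshold=10, shared_mem_free=True))  # second escalation
```

In this model a warp is throttled only after redirection has already been tried, which reflects the abstract's ordering: redirection preserves TLP, so throttling is kept as the fallback.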

research
04/30/2018

Holistic Management of the GPGPU Memory Hierarchy to Manage Warp-level Latency Tolerance

In a modern GPU architecture, all threads within a warp execute the same...
research
05/12/2017

Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disks

Resource utilization is one of the emerging problems in many-chip SSDs. ...
research
02/28/2020

Bringing Inter-Thread Cache Benefits to Federated Scheduling – Extended Results Technical Report

Multiprocessor scheduling of hard real-time tasks modeled by directed ac...
research
02/21/2019

Locality

The performance of modern computation is characterized by locality of re...
research
04/28/2017

Mixed-criticality Scheduling with Dynamic Redistribution of Shared Cache

The design of mixed-criticality systems often involves painful tradeoffs ...
research
02/01/2018

PCOT: Cache Oblivious Tiling of Polyhedral Programs

This paper studies two variants of tiling: iteration space tiling (or lo...
research
03/05/2019

FUSE: Fusing STT-MRAM into GPUs to Alleviate Off-Chip Memory Access Overheads

In this work, we propose FUSE, a novel GPU cache system that integrates ...
