Scratchpad Sharing in GPUs

07/12/2016
by Vishwesh Jatala, et al.

GPGPU applications exploit on-chip scratchpad memory available in Graphics Processing Units (GPUs) to improve performance. The amount of thread-level parallelism present in a GPU is limited by the number of resident threads, which in turn depends on the availability of scratchpad memory in its streaming multiprocessor (SM). Since scratchpad memory is allocated at thread block granularity, part of the memory may remain unutilized. In this paper, we propose architectural and compiler optimizations to improve scratchpad utilization. Our approach, Scratchpad Sharing, addresses scratchpad under-utilization by launching additional thread blocks in each SM. These thread blocks use the unutilized scratchpad and also share scratchpad with other resident blocks. To improve the performance of scratchpad sharing, we propose Owner Warp First (OWF) scheduling, which schedules warps from the additional thread blocks effectively. The performance of this approach, however, is limited by the availability of the shared part of the scratchpad. We propose compiler optimizations to improve the availability of shared scratchpad. We describe a scratchpad allocation scheme that helps allocate scratchpad variables such that shared scratchpad is accessed for a short duration. We introduce a new instruction, relssp, that, when executed, releases the shared scratchpad. Finally, we describe an analysis for optimal placement of relssp instructions such that shared scratchpad is released as early as possible. We implemented the hardware changes using the GPGPU-Sim simulator and the compiler optimizations in the Ocelot framework. We evaluated the effectiveness of our approach on 19 kernels from 3 benchmark suites: CUDA-SDK, GPGPU-Sim, and Rodinia. The kernels that underutilize scratchpad memory show an average improvement of 19% over the baseline approach.
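The under-utilization the abstract describes can be illustrated with a small back-of-the-envelope sketch. The sizes below are hypothetical (not taken from the paper): with a 48 KB scratchpad per SM and 20 KB requested per thread block, only two blocks are resident under conventional block-granularity allocation, leaving 8 KB unused; a sharing scheme could launch one extra block that owns this leftover and shares the remaining portion with a resident block.

```python
def resident_blocks(sm_scratchpad: int, per_block: int) -> int:
    """Thread blocks resident per SM under conventional allocation,
    where scratchpad is handed out at whole-block granularity."""
    return sm_scratchpad // per_block

def unutilized(sm_scratchpad: int, per_block: int) -> int:
    """Scratchpad left over after whole-block allocation."""
    return sm_scratchpad % per_block

# Hypothetical sizes (not from the paper): 48 KB scratchpad, 20 KB per block.
sm_size = 48 * 1024
blk_req = 20 * 1024

n_resident = resident_blocks(sm_size, blk_req)   # 2 blocks fit
leftover = unutilized(sm_size, blk_req)          # 8 KB sits idle

# A scratchpad-sharing scheme (in the spirit of the paper's approach) could
# launch one additional block: it uses the 8 KB leftover exclusively and
# must share the remaining 12 KB of its request with a resident "owner" block.
shared_portion = blk_req - leftover
```

Once the shared portion is released early (the role the paper assigns to the relssp instruction), the extra block can proceed independently, which is why early release of shared scratchpad drives the reported gains.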
