A Compiler Framework for Optimizing Dynamic Parallelism on GPUs

01/08/2022
by Mhd Ghaith Olabi, et al.

Dynamic parallelism on GPUs allows GPU threads to launch other GPU threads dynamically. It is useful in applications with nested parallelism, particularly when the amount of nested parallelism is irregular and cannot be predicted beforehand. However, prior work has shown that dynamic parallelism can impose a high performance penalty when a large number of small grids are launched: the many launches cause high launch latency due to congestion, and the small grid sizes leave the hardware underutilized. To address this issue, we propose a compiler framework for optimizing the use of dynamic parallelism in applications with nested parallelism. The framework features three key optimizations: thresholding, coarsening, and aggregation. Thresholding launches a grid dynamically only if the number of child threads exceeds some threshold, and serializes the child threads in the parent thread otherwise. Coarsening executes the work of multiple thread blocks with a single coarsened block to amortize the work that is common across them. Aggregation combines multiple child grids into a single aggregated grid. Our evaluation shows that the framework achieves a geometric mean speedup of 43.0x over versions of the applications that use dynamic parallelism, 8.7x over versions that do not use dynamic parallelism, and 3.6x over versions that use dynamic parallelism with aggregation alone, as proposed in prior work.
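To make the thresholding idea concrete, here is a minimal CUDA sketch. It is not code from the paper: the kernel names, the THRESHOLD constant, and the 256-thread block size are illustrative assumptions. Each parent thread launches a child grid only when its nested work exceeds the threshold, and otherwise serializes the child iterations itself:

```cuda
// Minimal sketch of the thresholding transformation (illustrative names and
// constants). Device-side launches require compilation with relocatable
// device code (nvcc -rdc=true) and a GPU of compute capability 3.5+.

#define THRESHOLD 128  // assumed cutoff below which a launch is not worthwhile

// Hypothetical child kernel: one thread per unit of nested work.
__global__ void childKernel(int parent, int numChildren) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numChildren) {
        // ... per-child work for element (parent, i) ...
    }
}

__global__ void parentKernel(const int *childCounts, int numParents) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numParents) return;

    int n = childCounts[p];
    if (n > THRESHOLD) {
        // Enough nested work: launch a child grid dynamically.
        childKernel<<<(n + 255) / 256, 256>>>(p, n);
    } else {
        // Too little work to justify a launch: serialize the child
        // iterations in the parent thread instead.
        for (int i = 0; i < n; ++i) {
            // ... same per-child work for element (p, i), inlined ...
        }
    }
}
```

The payoff of this transformation is that the many tiny child grids, which would otherwise congest the launch path and underutilize the hardware, are folded back into their parent threads, while large, irregular bursts of nested work still get a full dynamically launched grid.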
