Modeling GPU Dynamic Parallelism for Self Similar Density Workloads

06/05/2022
by Felipe A. Quezada, et al.

Dynamic Parallelism (DP) is a runtime feature of the GPU programming model that allows GPU threads to launch additional GPU kernels, recursively. Apart from making the programming of parallel hierarchical patterns easier, DP can also speed up problems that exhibit a heterogeneous data layout by focusing, through a subdivision process, the finite GPU resources on the sub-regions that exhibit more parallelism. However, performing an optimal subdivision is not trivial, as several parameters play an important role in the final performance of DP. Moreover, the current programming abstraction for DP also introduces an overhead that can penalize the final performance. In this work we present a subdivision cost model for problems that exhibit self-similar density (SSD) workloads (such as fractals), in order to understand which parameters provide the fastest subdivision approach. We also introduce a new subdivision implementation, named Adaptive Serial Kernels (ASK), as a lower-overhead alternative to CUDA's Dynamic Parallelism. Using the cost model on the Mandelbrot set as a case study shows that the optimal scheme is to start with an initial subdivision of g ∈ [2,16], then keep subdividing into regions of r = 2 or 4, and stop when regions reach a size of B ∼ 32. The experimental results agree with the theoretical parameters, confirming the usability of the cost model. In terms of performance, the proposed ASK approach runs up to ∼60% faster than Dynamic Parallelism on the Mandelbrot set, and up to 12× faster than a basic exhaustive implementation, whereas DP reaches up to 7.5×.
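
As a rough illustration of the mechanism the abstract describes (a minimal sketch, not the authors' ASK or DP implementation), the CUDA code below computes the Mandelbrot set by recursive subdivision with Dynamic Parallelism: the host launches an initial g × g grid of regions, any region still larger than B is split from device code into r × r child regions, and small regions are computed exhaustively. The pixel-to-plane mapping, thread-block sizes, and the purely geometric stopping rule are illustrative assumptions; parameter names g, r, B follow the abstract.

```cuda
// Compile with separable compilation for device-side launches, e.g.:
//   nvcc -arch=sm_60 -rdc=true dp_mandelbrot.cu -lcudadevrt
#include <cuda_runtime.h>

#define MAX_DWELL 256

// Escape-time test for one pixel (illustrative mapping of the image to the complex plane).
__device__ int pixel_dwell(int w, int h, int x, int y) {
    float cx = -2.0f + 3.0f * x / w;
    float cy = -1.5f + 3.0f * y / h;
    float zx = 0.0f, zy = 0.0f;
    int d = 0;
    while (d < MAX_DWELL && zx * zx + zy * zy < 4.0f) {
        float t = zx * zx - zy * zy + cx;
        zy = 2.0f * zx * zy + cy;
        zx = t;
        ++d;
    }
    return d;
}

// Exhaustive leaf kernel: one thread per pixel of a size x size region.
__global__ void leaf_kernel(int *dwells, int w, int h, int x0, int y0, int size) {
    int x = x0 + blockIdx.x * blockDim.x + threadIdx.x;
    int y = y0 + blockIdx.y * blockDim.y + threadIdx.y;
    if (x < x0 + size && y < y0 + size && x < w && y < h)
        dwells[y * w + x] = pixel_dwell(w, h, x, y);
}

// Subdivision kernel: each single-thread block handles one sub-region of the
// region starting at (x0, y0) with side 'size'. Large sub-regions are split
// again into r x r children via a device-side launch; small ones go to the leaf.
__global__ void subdiv_kernel(int *dwells, int w, int h, int x0, int y0,
                              int size, int r, int B) {
    int sub = size / gridDim.x;             // side of this block's sub-region
    int sx = x0 + blockIdx.x * sub;
    int sy = y0 + blockIdx.y * sub;
    if (sub > B) {
        // Still too large: recurse with Dynamic Parallelism.
        subdiv_kernel<<<dim3(r, r), 1>>>(dwells, w, h, sx, sy, sub, r, B);
    } else {
        // Small enough: compute the sub-region exhaustively.
        dim3 bs(16, 16), gs((sub + 15) / 16, (sub + 15) / 16);
        leaf_kernel<<<gs, bs>>>(dwells, w, h, sx, sy, sub);
    }
}

int main() {
    const int w = 1024, h = 1024;
    const int g = 4, r = 4, B = 32;          // initial grid, branching factor, stop size
    int *dwells;
    cudaMalloc((void**)&dwells, w * h * sizeof(int));
    // Initial subdivision into g x g regions, launched from the host.
    subdiv_kernel<<<dim3(g, g), 1>>>(dwells, w, h, 0, 0, w, r, B);
    cudaDeviceSynchronize();
    cudaFree(dwells);
    return 0;
}
```

In this sketch the recursion depth is bounded by how many times the image side can be divided by g and then r before reaching B, which stays well within CUDA's device-launch nesting limit for typical image sizes.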

