Improving Scalability with GPU-Aware Asynchronous Tasks

02/23/2022
by Jaemin Choi, et al.

Asynchronous tasks, when created with over-decomposition, enable automatic computation-communication overlap, which can substantially improve performance and scalability. This applies not only to traditional CPU-based systems but also to modern GPU-accelerated platforms. While hiding communication behind computation can be highly effective in weak scaling scenarios, performance begins to suffer with smaller problem sizes or in strong scaling due to fine-grained overheads and reduced room for overlap. In this work, we integrate GPU-aware communication into asynchronous tasks in addition to computation-communication overlap, with the goal of reducing time spent in communication and further increasing GPU utilization. We demonstrate the performance impact of our approach using Jacobi3D, a proxy application that performs the Jacobi iterative method. In addition to optimizations that minimize synchronization between the host and GPU devices and increase the concurrency of GPU operations, we explore techniques such as kernel fusion and CUDA Graphs to mitigate fine-grained overheads at scale.
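
To make the structure of a Jacobi3D-style computation concrete, the following is a minimal pure-Python sketch of a 7-point Jacobi update on a 3D grid that is over-decomposed into several blocks along one axis, with a halo exchange between neighboring blocks after each sweep. It is an illustration only, not the authors' implementation: all function names (`jacobi_sweep`, `split_x`, `exchange_halos`, `merge_x`) are hypothetical, the grid sizes are arbitrary, and the halo exchange is a synchronous in-memory copy standing in for the messages that would carry GPU buffers in a real run. No asynchronous execution or overlap is modeled here.

```python
def make_grid(nx, ny, nz, fill=0.0):
    """Allocate an nx x ny x nz grid as nested lists."""
    return [[[fill for _ in range(nz)] for _ in range(ny)] for _ in range(nx)]


def jacobi_sweep(u):
    """One 7-point Jacobi update of the interior cells.

    Cells on the outer faces (global boundary or ghost layers) are
    copied through unchanged.
    """
    nx, ny, nz = len(u), len(u[0]), len(u[0][0])
    new = [[row[:] for row in plane] for plane in u]
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                new[i][j][k] = (u[i - 1][j][k] + u[i + 1][j][k]
                                + u[i][j - 1][k] + u[i][j + 1][k]
                                + u[i][j][k - 1] + u[i][j][k + 1]) / 6.0
    return new


def split_x(u, nblocks):
    """Over-decompose along x: each block holds w interior planes
    plus one ghost (or global boundary) plane on each side."""
    interior = len(u) - 2
    assert interior % nblocks == 0, "interior planes must divide evenly"
    w = interior // nblocks
    return [[[row[:] for row in plane] for plane in u[b * w:b * w + w + 2]]
            for b in range(nblocks)]


def exchange_halos(blocks):
    """Copy each block's edge interior plane into its neighbor's ghost plane.

    Assumption: this plain copy stands in for inter-task communication.
    """
    for b in range(len(blocks) - 1):
        left, right = blocks[b], blocks[b + 1]
        right[0] = [row[:] for row in left[-2]]
        left[-1] = [row[:] for row in right[1]]


def merge_x(blocks):
    """Reassemble the global grid from the blocks' interior planes."""
    out = [blocks[0][0]]
    for blk in blocks:
        out.extend(blk[1:-1])
    out.append(blocks[-1][-1])
    return out


# Arbitrary small problem: 6 x 4 x 4 grid with a "hot" face at x = 0.
initial = make_grid(6, 4, 4)
for j in range(4):
    for k in range(4):
        initial[0][j][k] = 1.0

ITERATIONS = 3

# Reference: the whole grid updated as a single block.
reference = initial
for _ in range(ITERATIONS):
    reference = jacobi_sweep(reference)

# Over-decomposed: two blocks, halo exchange after every sweep.
blocks = split_x(initial, 2)
for _ in range(ITERATIONS):
    blocks = [jacobi_sweep(blk) for blk in blocks]
    exchange_halos(blocks)
merged = merge_x(blocks)

print("decomposed result matches single-block result:", merged == reference)
```

In this sketch every block must finish its sweep before any halo is exchanged. The overlap described in the abstract comes from relaxing that ordering: with more blocks than processing elements, one block's halo transfer can proceed while another block computes.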


research
09/26/2022

From Task-Based GPU Work Aggregation to Stellar Mergers: Turning Fine-Grained CPU Tasks into Portable GPU Kernels

Meeting both scalability and performance portability requirements is a c...
research
05/11/2023

GPU-initiated Fine-grained Overlap of Collective Communication with Computation

In order to satisfy their ever increasing capacity and compute requireme...
research
04/21/2023

Measuring Thread Timing to Assess the Feasibility of Early-bird Message Delivery

Early-bird communication is a communication/computation overlap techniqu...
research
08/13/2020

A Fine-Grained Hybrid CPU-GPU Algorithm for Betweenness Centrality Computations

Betweenness centrality (BC) is an important graph analytical application...
research
02/25/2022

HipBone: A performance-portable GPU-accelerated C++ version of the NekBone benchmark

We present hipBone, an open source performance-portable proxy applicatio...
research
10/26/2018

Integration of CUDA Processing within the C++ library for parallelism and concurrency (HPX)

Experience shows that on today's high performance systems the utilizatio...
research
11/08/2019

AMOEBA: A Coarse Grained Reconfigurable Architecture for Dynamic GPU Scaling

Different GPU applications exhibit varying scalability patterns with net...
