From Task-Based GPU Work Aggregation to Stellar Mergers: Turning Fine-Grained CPU Tasks into Portable GPU Kernels

09/26/2022
by Gregor Daiß, et al.

Meeting both scalability and performance portability requirements is a challenge for any HPC application, especially for adaptively refined ones. In Octo-Tiger, an astrophysics application for the simulation of stellar mergers, we address this with existing solutions: we use HPX to obtain fine-grained tasks that make it easy to distribute work and to overlap communication with computation. For the computations themselves, we use Kokkos to turn these tasks into compute kernels that can run on hardware ranging from a few CPU cores to powerful accelerators. There is a missing link, however: while the fine-grained parallelism exposed by HPX benefits scalability, it can hinder GPU performance when individual tasks become too small to saturate the device, which leads to low resource utilization. To bridge this gap, we investigate multiple GPU work aggregation strategies within Octo-Tiger, add one new strategy, and evaluate the node-level performance impact on recent AMD and NVIDIA GPUs, where we achieve noticeable speedups.
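The general idea of work aggregation can be illustrated with a minimal sketch in plain standard C++. This is not Octo-Tiger's implementation and does not use the HPX or Kokkos APIs; the names `SubGridTask`, `WorkAggregator`, `scale_batch`, and `run_demo` are hypothetical and exist only for illustration. The sketch assumes that each fine-grained task owns a small block of cells and that a single fused "kernel" (here an ordinary CPU function) can process a whole batch of such blocks in one launch.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for one fine-grained task: a small block of cells.
struct SubGridTask {
    std::vector<double> cells;
};

// Hypothetical aggregator: buffers small tasks and runs one fused "kernel"
// per batch instead of one launch per task.
class WorkAggregator {
public:
    using Kernel = std::function<void(std::vector<SubGridTask*>&)>;

    WorkAggregator(std::size_t max_batch, Kernel kernel)
        : max_batch_(max_batch), kernel_(std::move(kernel)) {}

    // Queue a task; launch the fused kernel once the batch is full.
    void submit(SubGridTask& task) {
        pending_.push_back(&task);
        if (pending_.size() >= max_batch_) {
            flush();
        }
    }

    // Launch whatever is pending, even if the batch is only partly filled.
    void flush() {
        if (pending_.empty()) {
            return;
        }
        kernel_(pending_);
        ++launches_;
        pending_.clear();
    }

    std::size_t launches() const { return launches_; }

private:
    std::size_t max_batch_;
    Kernel kernel_;
    std::vector<SubGridTask*> pending_;
    std::size_t launches_ = 0;
};

// Example fused kernel (assumption): doubles every cell of every task in the
// batch. On a GPU this loop nest would be a single kernel launch.
inline void scale_batch(std::vector<SubGridTask*>& batch) {
    for (SubGridTask* task : batch) {
        for (double& cell : task->cells) {
            cell *= 2.0;
        }
    }
}

// Submits num_tasks small tasks and returns how many launches were needed.
inline std::size_t run_demo(std::size_t num_tasks, std::size_t max_batch) {
    std::vector<SubGridTask> tasks(num_tasks,
                                   SubGridTask{std::vector<double>(8, 1.0)});
    WorkAggregator aggregator(max_batch, scale_batch);
    for (SubGridTask& task : tasks) {
        aggregator.submit(task);
    }
    aggregator.flush();  // handle the final, partially filled batch
    return aggregator.launches();
}
```

With these assumptions, ten tasks and a batch size of four need three launches instead of ten. The trade-off in such a scheme is between larger batches, which use the device better, and the time tasks spend waiting for a batch to fill.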

