Scalable communication for high-order stencil computations using CUDA-aware MPI

03/02/2021
by Johannes Pekkilä, et al.

Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, because arithmetic performance has been observed to increase faster than memory and network bandwidths, optimizing data movement has become critical for achieving strong scaling in many communication-heavy applications. This performance gap has been further accentuated by the introduction of graphics processing units, which can provide several times higher throughput in data-parallel tasks than central processing units. In this work, we explore the computational aspects of iterative stencil loops and implement a generic communication scheme using CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations based on high-order finite differences and third-order Runge-Kutta integration. We put particular focus on improving intra-node locality of workloads. In comparison to a theoretical performance model, our implementation exhibits strong scaling from one to 64 devices at 50%–87% efficiency in sixth-order stencil computations when the problem domain consists of 256^3–1024^3 cells.
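To make the link between stencil order and communication volume concrete, the following is a minimal one-dimensional sketch, not the authors' implementation. It assumes a central sixth-order first-derivative stencil, which reaches three cells to each side, so every subdomain needs a halo (ghost zone) of width three before a stencil sweep. The names `COEFFS`, `RADIUS`, `exchange_halo_periodic`, `ddx`, and `max_error` are hypothetical, and the periodic copy stands in for the neighbor-to-neighbor transfer that a CUDA-aware MPI code would perform on device buffers.

```python
import math

# Standard sixth-order central-difference weights for the first derivative,
# for offsets -3..3. The stencil radius (3) sets the required halo width.
COEFFS = (-1 / 60, 3 / 20, -3 / 4, 0.0, 3 / 4, -3 / 20, 1 / 60)
RADIUS = 3


def exchange_halo_periodic(interior):
    """Pad the interior with RADIUS ghost cells per side (periodic wrap).

    In a distributed code these ghost cells would be received from the
    neighboring subdomains; here a local copy plays that role.
    """
    return interior[-RADIUS:] + interior + interior[:RADIUS]


def ddx(padded, h):
    """Apply the stencil to every interior cell of a halo-padded array."""
    n = len(padded) - 2 * RADIUS
    return [
        sum(c * padded[i + k] for k, c in enumerate(COEFFS)) / h
        for i in range(n)
    ]


def max_error(n):
    """Maximum error of d/dx sin(x) on a periodic grid with n cells."""
    h = 2 * math.pi / n
    field = [math.sin(i * h) for i in range(n)]
    deriv = ddx(exchange_halo_periodic(field), h)
    return max(abs(d - math.cos(i * h)) for i, d in enumerate(deriv))


if __name__ == "__main__":
    for n in (32, 64):
        print(n, max_error(n))
```

For a sixth-order scheme the error is expected to shrink by roughly a factor of 2^6 = 64 each time the resolution doubles. The halo width stays fixed at three cells per face regardless of subdomain size, which is one reason the ratio of communication to computation grows as a fixed problem is split across more devices.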


research
05/15/2019

Improving strong scaling of the Conjugate Gradient method for solving large linear systems using global reduction pipelining

This paper presents performance results comparing MPI-based implementati...
research
02/14/2020

High-Performance High-Order Stencil Computation on FPGAs Using OpenCL

In this paper we evaluate the performance of FPGAs for high-order stenci...
research
05/01/2020

How I Learned to Stop Worrying About User-Visible Endpoints and Love MPI

MPI+threads is gaining prominence as an alternative to the traditional M...
research
09/12/2019

PittPack: An Open-Source Poisson's Equation Solver for Extreme-Scale Computing with Accelerators

We present a parallel implementation of a direct solver for the Poisson'...
research
06/28/2019

Parallel Performance of Molecular Dynamics Trajectory Analysis

The performance of biomolecular molecular dynamics (MD) simulations has ...
research
10/17/2022

PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks

In this paper, we present PARTIME, a software library written in Python ...
research
09/07/2007

Computational performance of a parallelized high-order spectral and mortar element toolbox

In this paper, a comprehensive performance review of a MPI-based high-or...
