TEMPI: An Interposed MPI Library with a Canonical Representation of CUDA-aware Datatypes

12/28/2020
by Carl Pearson et al.

MPI derived datatypes are an abstraction that simplifies handling of non-contiguous data in MPI applications. These datatypes are recursively constructed at runtime from primitive Named Types defined in the MPI standard. More recently, the development and deployment of CUDA-aware MPI implementations have encouraged the transition of distributed high-performance MPI codes to use GPUs. Such implementations allow MPI functions to operate directly on GPU buffers, easing integration of GPU compute into MPI codes. Despite substantial attention to CUDA-aware MPI implementations, they continue to offer cripplingly poor performance when manipulating derived datatypes on GPUs. This work presents a new MPI library, TEMPI, to address this issue. TEMPI first introduces a common datatype to represent equivalent MPI derived datatypes. TEMPI can be used as an interposed library on existing MPI deployments without system or application changes. Furthermore, this work presents a performance model of GPU derived datatype handling, demonstrating that previously preferred "one-shot" methods are not always fastest. Ultimately, the interposed-library model of this work demonstrates MPI_Pack speedup of up to 242,000x and MPI_Send speedup of up to 59,000x compared to the MPI implementation deployed on a leadership-class supercomputer. This yields a speedup of more than 1000x in a 3D halo exchange at 192 ranks.
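
To make the abstraction concrete, the sketch below (not taken from the paper; the grid dimensions and variable names are invented for illustration) builds a derived datatype for one face of a 3D grid from the primitive named type MPI_DOUBLE and packs it directly from a GPU buffer, relying on a CUDA-aware MPI implementation to accept the device pointer.

```c
// Illustrative only: describe one face of an nx x ny x nz grid of doubles
// as an MPI derived datatype, then MPI_Pack it straight from GPU memory.
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    const int nx = 512, ny = 512, nz = 512;   // made-up problem size
    double *grid;
    cudaMalloc((void **)&grid, (size_t)nx * ny * nz * sizeof(double));

    // One x-z face: nz blocks of nx contiguous doubles, each block strided
    // by a full x-y plane. Built from the primitive named type MPI_DOUBLE.
    MPI_Datatype face;
    MPI_Type_vector(nz, nx, nx * ny, MPI_DOUBLE, &face);
    MPI_Type_commit(&face);

    // With a CUDA-aware MPI, the device pointer can be handed to MPI_Pack
    // (or MPI_Send) directly; no explicit cudaMemcpy staging is required.
    int packed_bytes;
    MPI_Pack_size(1, face, MPI_COMM_WORLD, &packed_bytes);
    void *packed;
    cudaMalloc(&packed, packed_bytes);

    int position = 0;
    MPI_Pack(grid, 1, face, packed, packed_bytes, &position, MPI_COMM_WORLD);

    MPI_Type_free(&face);
    cudaFree(packed);
    cudaFree(grid);
    MPI_Finalize();
    return 0;
}
```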
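
Interposition of the kind described above is conventionally possible through MPI's standard profiling (PMPI) interface: a library that defines an MPI function shadows the implementation's version and forwards to the PMPI_-prefixed entry point. The minimal sketch below is hypothetical and is not TEMPI's actual code; it only shows where an interposer could translate a derived datatype and pack GPU data before calling the underlying MPI.

```c
// Minimal PMPI interposition sketch: this shared library's MPI_Send
// shadows the MPI implementation's symbol and forwards to PMPI_Send.
#include <mpi.h>
#include <stdio.h>

int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest,
             int tag, MPI_Comm comm) {
    // An interposer could inspect `datatype` here, translate it to a
    // canonical representation, and pack GPU data with its own kernels
    // before handing contiguous bytes to the underlying MPI.
    fprintf(stderr, "intercepted MPI_Send of %d element(s)\n", count);

    // Fall through to the real implementation via the PMPI layer.
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}
```

Such a wrapper is typically compiled into a shared object and placed ahead of the MPI library at link time or via LD_PRELOAD, which is what allows this style of library to be deployed without system or application changes.
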

Related research

MPIX Stream: An Explicit Solution to Hybrid MPI+X Programming (08/29/2022)
The hybrid MPI+X programming paradigm, where X refers to threads or GPUs...

From MPI to MPI+OpenACC: Conversion of a legacy FORTRAN PCG solver for the spherical Laplace equation (09/04/2017)
A real-world example of adding OpenACC to a legacy MPI FORTRAN Precondit...

GPU-Accelerated Discontinuous Galerkin Methods: 30x Speedup on 345 Billion Unknowns (06/28/2020)
A discontinuous Galerkin method for the discretization of the compressib...

Performance of MPI sends of non-contiguous data (09/27/2018)
We present an experimental investigation of the performance of MPI deriv...

Implementing Efficient Message Logging Protocols as MPI Application Extensions (05/08/2019)
Message logging protocols are enablers of local rollback, a more efficie...

Network-Accelerated Non-Contiguous Memory Transfers (08/22/2019)
Applications often communicate data that is non-contiguous in the send- ...

Acceleration of a production Solar MHD code with Fortran standard parallelism: From OpenACC to `do concurrent' (03/05/2023)
There is growing interest in using standard language constructs for acce...
