TEMPI: An Interposed MPI Library with a Canonical Representation of CUDA-aware Datatypes

by Carl Pearson, et al.

MPI derived datatypes are an abstraction that simplifies handling of non-contiguous data in MPI applications. These datatypes are recursively constructed at runtime from primitive Named Types defined in the MPI standard. More recently, the development and deployment of CUDA-aware MPI implementations has encouraged the transition of distributed high-performance MPI codes to use GPUs. Such implementations allow MPI functions to operate directly on GPU buffers, easing integration of GPU compute into MPI codes. Despite substantial attention to CUDA-aware MPI implementations, they continue to perform cripplingly poorly when manipulating derived datatypes on GPUs. This work presents a new MPI library, TEMPI, to address this issue. TEMPI first introduces a common datatype to represent equivalent MPI derived datatypes. TEMPI can be used as an interposed library on existing MPI deployments without system or application changes. Furthermore, this work presents a performance model of GPU derived datatype handling, demonstrating that previously preferred "one-shot" methods are not always fastest. Ultimately, the interposed-library model of this work demonstrates MPI_Pack speedup of up to 242,000x and MPI_Send speedup of up to 59,000x compared to the MPI implementation deployed on a leadership-class supercomputer. This yields speedup of more than 1000x in a 3D halo exchange at 192 ranks.
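
To make the idea of "equivalent" derived datatypes concrete, the following is a minimal sketch in plain Python, not TEMPI's actual representation or algorithm. It assumes a toy model in which a datatype is just a list of byte runs plus an extent; the helper names `named`, `contiguous`, `vector`, and `canonical` are hypothetical and only loosely mirror `MPI_Type_contiguous` and `MPI_Type_vector`. It shows that two differently constructed types can describe the same bytes, and that merging adjacent runs gives one common form for both.

```python
# Toy model (an assumption for illustration, not TEMPI's data structure):
# a datatype is a dict with ordered byte runs [(offset, length), ...] and an extent.

def named(size):
    """A primitive type: one contiguous run of `size` bytes."""
    return {"runs": [(0, size)], "extent": size}

def vector(count, blocklength, stride, old):
    """Loosely like MPI_Type_vector: `count` blocks of `blocklength` copies
    of `old`, with block starts `stride` old-type extents apart."""
    ext = old["extent"]
    runs = []
    for b in range(count):
        for i in range(blocklength):
            base = (b * stride + i) * ext
            runs.extend((base + off, length) for off, length in old["runs"])
    extent = ((count - 1) * stride + blocklength) * ext
    return {"runs": runs, "extent": extent}

def contiguous(count, old):
    """Loosely like MPI_Type_contiguous: `count` back-to-back copies of `old`."""
    return vector(count, 1, 1, old)

def canonical(dtype):
    """Merge runs that touch, preserving order, so that equivalent
    constructions reduce to the same tuple of (offset, length) pairs."""
    merged = []
    for off, length in dtype["runs"]:
        if merged and merged[-1][0] + merged[-1][1] == off:
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((off, length))
    return tuple(merged)

float32 = named(4)

# Construction A: 4 blocks of 2 floats, block starts 8 floats apart.
type_a = vector(4, 2, 8, float32)

# Construction B: 4 blocks of one "pair of floats", block starts 4 pairs apart.
type_b = vector(4, 1, 4, contiguous(2, float32))

print(canonical(type_a) == canonical(type_b))
```

In this toy model both constructions reduce to four 8-byte runs spaced 32 bytes apart, so a single packing routine could serve either one. How TEMPI itself encodes and compares datatypes is described in the paper, not here.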






