Efficient MPI-based Communication for GPU-Accelerated Dask Applications

01/21/2021
by   Aamir Shafi, et al.
0

Dask is a popular parallel and distributed computing framework, which rivals Apache Spark to enable task-based scalable processing of big data. The Dask Distributed library forms the basis of this computing engine and provides support for adding new communication devices. It currently has two communication devices: one for TCP and the other for high-speed networks using UCX-Py – a Cython wrapper to UCX. This paper presents the design and implementation of a new communication backend for Dask – called MPI4Dask – that is targeted for modern HPC clusters built with GPUs. MPI4Dask exploits mpi4py over MVAPICH2-GDR, which is a GPU-aware implementation of the Message Passing Interface (MPI) standard. MPI4Dask provides point-to-point asynchronous I/O communication coroutines, which are non-blocking concurrent operations defined using the async/await keywords from the Python's asyncio framework. Our latency and throughput comparisons suggest that MPI4Dask outperforms UCX by 6x for 1 Byte message and 4x for large messages (2 MBytes and beyond) respectively. We also conduct comparative performance evaluation of MPI4Dask with UCX using two benchmark applications: 1) sum of cuPy array with its transpose, and 2) cuDF merge. MPI4Dask speeds up the overall execution time of the two applications by an average of 3.47x and 3.11x respectively on an in-house cluster built with NVIDIA Tesla V100 GPUs for 1-6 Dask workers. We also perform scalability analysis of MPI4Dask against UCX for these applications on TACC's Frontera (GPU) system with upto 32 Dask workers on 32 NVIDIA Quadro RTX 5000 GPUs and 256 CPU cores. MPI4Dask speeds up the execution time for cuPy and cuDF applications by an average of 1.71x and 2.91x respectively for 1-32 Dask workers on the Frontera (GPU) system.

READ FULL TEXT

page 1

page 8

research
06/27/2023

Exploring Fully Offloaded GPU Stream-Aware Message Passing

Modern heterogeneous supercomputing systems are comprised of CPUs, GPUs,...
research
08/09/2022

Exploring GPU Stream-Aware Message Passing using Triggered Operations

Modern heterogeneous supercomputing systems are comprised of compute bla...
research
10/25/2018

Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation

TensorFlow has been the most widely adopted Machine/Deep Learning framew...
research
04/30/2021

Memory Reduction using a Ring Abstraction over GPU RDMA for Distributed Quantum Monte Carlo Solver

Scientific applications that run on leadership computing facilities ofte...
research
12/16/2017

An MPI-Based Python Framework for Distributed Training with Keras

We present a lightweight Python framework for distributed training of ne...
research
01/12/2022

Distributed Processing for Encoding and Decoding of Binary LDPC codes using MPI

Low Density Parity Check (LDPC) codes are linear error correcting codes ...
research
12/05/2020

An Improved Framework of GPU Computing for CFD Applications on Structured Grids using OpenACC

This paper is focused on improving multi-GPU performance of a research C...

Please sign up or login with your details

Forgot password? Click here to reset