Efficient Embedding of MPI Collectives in MXNet DAGs for Scaling Deep Learning

02/20/2018
by Amith R. Mamidala, et al.

The availability of high-performance computing infrastructure, such as clusters of GPUs and CPUs, has fueled the growth of distributed learning systems. Deep learning frameworks express neural networks as DAGs and execute these DAGs on computation resources such as GPUs. In this paper, we propose efficient designs for embedding MPI collective operations into data-parallel DAGs; incorrect designs can easily lead to deadlocks or program crashes. In particular, we demonstrate three designs for using MPI collectives with DAGs: Funneled, Concurrent communication, and Dependency chaining. These designs automatically enable overlap of computation with communication by allowing collectives to execute concurrently with other tasks. We implement these designs directly in the KVStore API of MXNet, which lets us leverage the rest of the framework's infrastructure. Using the ImageNet and CIFAR datasets, we show the potential of our designs: they scale to 256 GPUs with epoch times as low as 50 seconds on the ImageNet-1K dataset.
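To illustrate the kind of design the abstract describes, the sketch below mimics the Funneled approach: every (simulated) MPI collective is funneled through one dedicated communication thread, so an MPI library initialized at the `MPI_THREAD_FUNNELED` level never sees concurrent calls from the DAG's worker threads, while computation tasks remain free to run in parallel. All names (`FunneledComm`, `allreduce`) are illustrative, not the paper's MXNet/KVStore API, and the allreduce itself is simulated in-process rather than calling a real MPI library.

```python
# Sketch of the "Funneled" design (assumed structure, not the paper's code):
# DAG tasks enqueue gradient reductions; a single communication thread is
# the only one that would ever touch MPI, executing collectives one at a
# time in enqueue order. Real code would call MPI_Allreduce here.
import threading
import queue

class FunneledComm:
    def __init__(self):
        self.tasks = queue.Queue()   # items: (key, grads, done_event, out)
        self.thread = threading.Thread(target=self._loop, daemon=True)
        self.thread.start()

    def _loop(self):
        # The sole "MPI" thread: serializes all collective calls, which is
        # what makes MPI_THREAD_FUNNELED sufficient and avoids deadlock.
        while True:
            item = self.tasks.get()
            if item is None:
                break
            key, grads, done, out = item
            # Simulated MPI_Allreduce(SUM) across per-worker gradients.
            out[key] = [sum(vals) for vals in zip(*grads)]
            done.set()

    def allreduce(self, key, grads, out):
        # Non-blocking from the caller's view: the DAG engine can keep
        # scheduling computation while the reduction is in flight.
        done = threading.Event()
        self.tasks.put((key, grads, done, out))
        return done

    def shutdown(self):
        self.tasks.put(None)
        self.thread.join()

comm = FunneledComm()
out = {}
# Gradients for two layers from three simulated workers.
e1 = comm.allreduce("fc1", [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]], out)
e2 = comm.allreduce("fc2", [[0.5], [0.5], [0.5]], out)
e1.wait()
e2.wait()
comm.shutdown()
print(out)   # {'fc1': [3.0, 6.0], 'fc2': [1.5]}
```

The Concurrent and Dependency-chained designs differ in who issues the collectives and how ordering is enforced (multiple threads with a thread-multiple MPI, or explicit dependency edges in the DAG, respectively); the funneled variant trades some concurrency for safety with non-thread-safe MPI builds.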

