Optimizing Multi-GPU Parallelization Strategies for Deep Learning Training

07/30/2019
by Saptadeep Pal et al.

Deploying deep learning (DL) models across multiple compute devices to train large and complex models continues to grow in importance because of the demand for faster and more frequent training. Data parallelism (DP) is the most widely used parallelization strategy, but as the number of devices in data-parallel training grows, so does the communication overhead between devices. Additionally, a larger aggregate batch size per step leads to a loss of statistical efficiency, i.e., more epochs are required to converge to a desired accuracy. These factors affect overall training time, and beyond a certain number of devices the speedup from leveraging DP begins to scale poorly. In addition to DP, each training step can be accelerated by exploiting model parallelism (MP). This work explores hybrid parallelization, where each data-parallel worker comprises more than one device, across which the model dataflow graph (DFG) is split using MP. We show that at scale, hybrid training will be more effective at minimizing end-to-end training time than exploiting DP alone. We project that for Inception-V3, GNMT, and BigLSTM, the hybrid strategy provides an end-to-end training speedup of at least 26.5 and 22...
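To make the hybrid strategy concrete, the sketch below shows one common way to combine the two forms of parallelism in PyTorch: each data-parallel worker owns two GPUs, the model is split across those two GPUs (model parallelism), and DistributedDataParallel all-reduces gradients across workers (data parallelism). This is only an illustrative sketch, not the authors' implementation; the two-stage model, layer sizes, two-GPUs-per-worker layout, and training loop are assumptions introduced here.

    # Minimal sketch of hybrid data/model parallelism (assumptions noted above).
    # Launch with one process per data-parallel worker, e.g. via torchrun.
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP


    class TwoStageModel(nn.Module):
        """One data-parallel worker: stage 1 on one GPU, stage 2 on another."""

        def __init__(self, dev0, dev1):
            super().__init__()
            self.dev0, self.dev1 = dev0, dev1
            self.stage1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to(dev0)
            self.stage2 = nn.Linear(4096, 10).to(dev1)

        def forward(self, x):
            x = self.stage1(x.to(self.dev0))
            # Activations cross the GPU boundary inside the worker (model parallelism).
            return self.stage2(x.to(self.dev1))


    def main():
        # One process per worker; assumes the node exposes 2 GPUs per process.
        dist.init_process_group("nccl")
        rank = dist.get_rank()
        dev0, dev1 = 2 * rank, 2 * rank + 1

        # device_ids is left at its default (None) because the module spans
        # multiple GPUs; DDP still all-reduces gradients across workers.
        model = DDP(TwoStageModel(dev0, dev1))
        opt = torch.optim.SGD(model.parameters(), lr=0.01)

        inputs = torch.randn(32, 1024)
        labels = torch.randint(0, 10, (32,), device=dev1)

        for _ in range(3):
            opt.zero_grad()
            loss = nn.functional.cross_entropy(model(inputs), labels)
            loss.backward()
            opt.step()

        dist.destroy_process_group()


    if __name__ == "__main__":
        main()

Scaling the number of worker processes in such a setup increases the aggregate batch size per step, which is exactly the statistical-efficiency trade-off the abstract describes; splitting the model within each worker accelerates the per-step computation instead.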

Related research

05/10/2018
Unifying Data, Model and Hybrid Parallelism in Deep Learning via Tensor Tiling
Deep learning systems have become vital tools across many fields, but th...

11/18/2020
Whale: A Unified Distributed Training Framework
Data parallelism (DP) has been a common practice to speed up the trainin...

10/18/2020
Fast Distributed Training of Deep Neural Networks: Dynamic Communication Thresholding for Model and Data Parallelism
Data Parallelism (DP) and Model Parallelism (MP) are two common paradigm...

07/02/2020
DAPPLE: A Pipelined Data Parallel Approach for Training Large Models
It is a challenging task to train large DNN models on sophisticated GPU ...

11/12/2019
HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow
The enormous amount of data and computation required to train DNNs have ...

03/07/2022
Parallel Training of GRU Networks with a Multi-Grid Solver for Long Sequences
Parallelizing Gated Recurrent Unit (GRU) networks is a challenging task,...

06/05/2022
Modeling GPU Dynamic Parallelism for Self Similar Density Workloads
Dynamic Parallelism (DP) is a runtime feature of the GPU programming mod...
