An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks

04/19/2021
by Albert Njoroge Kahira, et al.

Deep Neural Network (DNN) frameworks use distributed training to reach convergence faster and to alleviate memory capacity limits when training large models and/or high-dimensional inputs. With the steady growth of datasets and model sizes, model/hybrid parallelism is deemed to have an important role in the future of distributed training of DNNs. We analyze the compute, communication, and memory requirements of Convolutional Neural Networks (CNNs) to understand the trade-offs between different parallelism approaches in terms of performance and scalability. We use this model-driven analysis as the basis for an oracle utility that can help detect the limitations and bottlenecks of different parallelism approaches at scale. We evaluate the oracle on six parallelization strategies, with four CNN models and multiple datasets (2D and 3D), on up to 1024 GPUs. The results demonstrate that the oracle achieves an average accuracy of about 86.74% relative to empirical results, and as high as 97.57%.
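To make the kind of model-driven analysis described above concrete, the following is a minimal back-of-the-envelope sketch comparing per-step communication volume under data parallelism (all-reducing weight gradients) against spatial model parallelism (halo exchanges of activation rows) for a single convolutional layer. All function names, formulas, and constants here are illustrative assumptions for a generic cost model, not the paper's actual oracle.

# Illustrative communication estimates for one conv layer.
# All names and formulas are assumptions for illustration,
# not the paper's oracle.

def dp_comm_bytes(c_in, c_out, k, dtype_bytes=4):
    """Data parallelism: each step all-reduces the weight gradients.
    A ring all-reduce moves roughly 2x the gradient size per worker."""
    n_params = c_out * c_in * k * k + c_out  # weights + biases
    return 2 * n_params * dtype_bytes

def spatial_mp_comm_bytes(w, c_in, k, dtype_bytes=4):
    """Spatial model parallelism: the feature map is split along the
    height dimension, so an interior worker exchanges a halo of
    (k // 2) rows with each of its two neighbors, in both the
    forward and backward passes (four halo transfers per step)."""
    halo_rows = k // 2
    return 4 * halo_rows * w * c_in * dtype_bytes

if __name__ == "__main__":
    # Example: a 3x3 conv, 256 -> 256 channels, 128-wide feature map
    dp = dp_comm_bytes(c_in=256, c_out=256, k=3)
    mp = spatial_mp_comm_bytes(w=128, c_in=256, k=3)
    print(f"data parallel : {dp / 1e6:.2f} MB per step")
    print(f"spatial model : {mp / 1e6:.2f} MB per step")

Even this toy model shows the trade-off the oracle navigates: data-parallel cost scales with parameter count, while spatial model-parallel cost scales with activation size, so the cheaper strategy depends on the layer shape and input dimensions.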

Related research

02/14/2018
Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks
The past few years have witnessed growth in the size and computational r...

07/25/2020
The Case for Strong Scaling in Deep Learning: Training Large 3D CNNs with Hybrid Parallelism
We present scalable hybrid-parallel algorithms for training large-scale ...

04/11/2021
A Hybrid Parallelization Approach for Distributed and Scalable Deep Learning
Recently, Deep Neural Networks (DNNs) have recorded great success in han...

10/18/2020
Fast Distributed Training of Deep Neural Networks: Dynamic Communication Thresholding for Model and Data Parallelism
Data Parallelism (DP) and Model Parallelism (MP) are two common paradigm...

10/17/2018
A Bi-layered Parallel Training Architecture for Large-scale Convolutional Neural Networks
Benefitting from large-scale training datasets and the complex training ...

12/07/2017
Distributed learning of CNNs on heterogeneous CPU/GPU architectures
Convolutional Neural Networks (CNNs) have shown to be powerful classific...

11/09/2021
DistIR: An Intermediate Representation and Simulator for Efficient Neural Network Distribution
The rapidly growing size of deep neural network (DNN) models and dataset...
