SplitBrain: Hybrid Data and Model Parallel Deep Learning

12/31/2021
by Farley Lai, et al.

The recent success of deep learning applications coincides with the broad availability of powerful computational resources for training sophisticated machine learning models on huge datasets. Nonetheless, training large models such as convolutional neural networks with model parallelism (as opposed to data parallelism) is challenging: the complex communication between model shards makes it difficult to partition the computation efficiently across multiple machines at an acceptable trade-off. This paper presents SplitBrain, a high-performance distributed deep learning framework that supports hybrid data and model parallelism. Specifically, SplitBrain provides layer-specific partitioning that co-locates compute-intensive convolutional layers while sharding memory-demanding layers. A novel scalable group communication scheme is proposed to further improve training throughput with reduced communication overhead. The results show that SplitBrain achieves nearly linear speedup while saving up to 67% of memory consumption for data- and model-parallel VGG over CIFAR-10.
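The abstract does not come with code, so the sketch below is a minimal, hypothetical PyTorch illustration of the layer-specific partitioning idea it describes: compute-intensive convolutional layers are co-located (replicated) on every worker and trained data-parallel, while a memory-demanding fully connected layer is sharded column-wise across workers and its activations are reassembled with a differentiable all-gather. All class and helper names here (ShardedLinear, HybridVGG, allreduce_replicated_grads) are assumptions for illustration, not SplitBrain's actual interfaces.

```python
# Hypothetical sketch of SplitBrain-style hybrid data/model parallelism.
# Names are illustrative assumptions, not the paper's actual API.
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.distributed.nn  # differentiable collectives (all_gather)

class ShardedLinear(nn.Module):
    """Model-parallel linear layer: each rank stores out_features/world_size
    output columns; a differentiable all-gather reassembles the activation."""
    def __init__(self, in_features, out_features):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        self.local = nn.Linear(in_features, out_features // world_size)

    def forward(self, x):
        y_local = self.local(x)                        # (batch, out/world)
        shards = torch.distributed.nn.all_gather(y_local)
        return torch.cat(shards, dim=-1)               # (batch, out)

class HybridVGG(nn.Module):
    """VGG-like net for CIFAR-10 (32x32 inputs) with layer-specific parallelism."""
    def __init__(self, num_classes=10):
        super().__init__()
        # Compute-intensive conv stack: co-located (replicated) on every
        # worker and trained data-parallel.
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Memory-demanding fully connected layer: sharded across workers,
        # so each rank stores only a slice of the 4096-wide weight matrix.
        self.fc = ShardedLinear(128 * 8 * 8, 4096)
        self.head = nn.Linear(4096, num_classes)       # small, replicated

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.head(torch.relu(self.fc(x)))

def allreduce_replicated_grads(model):
    """Data-parallel gradient sync for the replicated layers only; the
    sharded fc keeps its gradient slice local to each rank."""
    for module in (model.features, model.head):
        for p in module.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad)
                p.grad /= dist.get_world_size()
```

After each backward pass, only the replicated layers' gradients need an all-reduce; the sharded layer trains on its local weight slice, which is where the memory saving in this partitioning scheme comes from.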
