Whale: A Unified Distributed Training Framework

11/18/2020
by Ang Wang, et al.

Data parallelism (DP) has long been a common practice for speeding up training workloads. However, as data and model sizes grow, DP has become less effective for many distributed training workloads, and it cannot handle models whose parameters do not fit into a single GPU's device memory. To enable and further improve industrial-scale giant model training, we present Whale, a unified distributed training framework. It provides comprehensive parallel strategies, including data parallelism, model parallelism, operator sharding, pipelining, hybrid strategies, and automatic parallelization. To express complex training strategies effectively and efficiently within one framework, Whale IR is designed as the basic unit for exploring and implementing different distributed strategies. Moreover, Whale enables automatic parallelism by using a meta-driven cost model. Whale is compatible with TensorFlow and can distribute training tasks by adding a few lines of code without changing the user's model code. To the best of our knowledge, Whale is the first work that supports various hybrid distributed strategies within one framework. In our experiments on the BERT Large model, Whale's pipeline strategy is 2.32 times faster than Horovod data parallelism (HDP) on 64 GPUs. In a large-scale image classification task with 100,000 classes, Whale's hybrid strategy, which combines operator sharding and DP, is 14.8 times faster than HDP on 64 GPUs.
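The claim that Whale can distribute a TensorFlow training task by adding only a few lines around unchanged model code suggests an annotation-style API. Below is a minimal sketch of what such annotations could look like; the whale module, the init/replica/stage scopes, and the num_micro_batch key are hypothetical names used for illustration only, not the framework's documented interface.

# A minimal sketch, assuming a hypothetical `whale` module with annotation-style
# parallelism scopes. None of these names are taken from the framework's real API.
import tensorflow as tf
import whale as wh  # hypothetical module name

wh.init(config={"num_micro_batch": 8})  # hypothetical: configure pipelined execution

# "A few added lines": wrap otherwise unchanged TensorFlow model code in scopes.
with wh.replica():                      # hypothetical: data-parallel replication
    with wh.stage():                    # hypothetical: first pipeline stage
        inputs = tf.keras.Input(shape=(4096,))
        hidden = tf.keras.layers.Dense(1024, activation="relu")(inputs)
    with wh.stage():                    # hypothetical: second pipeline stage
        logits = tf.keras.layers.Dense(100000)(hidden)  # e.g. a 100,000-class head

model = tf.keras.Model(inputs, logits)
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
# Training then proceeds as usual, e.g. model.fit(dataset); device placement,
# micro-batch scheduling, and gradient synchronization are left to the framework.

In this style, a hybrid strategy falls out of nesting: each data-parallel replica internally runs a two-stage pipeline, and the user's model code itself is left untouched.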


