PyTorch Distributed: Experiences on Accelerating Data Parallel Training

06/28/2020
by Shen Li, et al.

This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel training, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.
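The three techniques named in the abstract surface as knobs in PyTorch's public DistributedDataParallel API. The sketch below is not taken from the paper; it illustrates gradient bucketing via the `bucket_cap_mb` argument, computation/communication overlap that DDP performs automatically through its backward hooks, and skipped gradient synchronization via the `no_sync()` context manager. The model, bucket size, and accumulation interval are illustrative placeholders, and the script assumes it is launched with `torchrun` so the process-group environment variables are already set.

```python
# Minimal sketch (assumed setup, not from the paper) of the DDP features the
# abstract describes: bucketing, overlap, and skipping gradient synchronization.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # One process per GPU; rank and world size come from the launcher (e.g. torchrun).
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 1024).cuda()
    # bucket_cap_mb sets the gradient bucket size; DDP all-reduces each bucket as
    # soon as all of its gradients are ready, overlapping that communication with
    # the remainder of the backward pass.
    ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=25)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    accumulation_steps = 4  # illustrative value
    for step in range(16):
        x = torch.randn(32, 1024, device="cuda")
        y = torch.randn(32, 1024, device="cuda")
        if (step + 1) % accumulation_steps != 0:
            # Skip gradient synchronization on intermediate steps; gradients
            # accumulate locally and are all-reduced on the final backward only.
            with ddp_model.no_sync():
                loss_fn(ddp_model(x), y).backward()
        else:
            loss_fn(ddp_model(x), y).backward()
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Skipping synchronization on intermediate accumulation steps trades a small amount of gradient staleness for fewer all-reduce calls, which is the communication-saving behavior the paper evaluates.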


