OneFlow: Redesign the Distributed Deep Learning Framework from Scratch

10/28/2021
by Jinhui Yuan, et al.

Deep learning frameworks such as TensorFlow and PyTorch provide a productive interface for expressing and training a deep neural network (DNN) model on a single device or using data parallelism. Still, they may not be flexible or efficient enough in training emerging large models on distributed devices, which require more sophisticated parallelism beyond data parallelism. Plugins or wrappers have been developed to strengthen these frameworks for model or pipeline parallelism, but they complicate the usage and implementation of distributed deep learning. Aiming at a simple, neat redesign of distributed deep learning frameworks for various parallelism paradigms, we present OneFlow, a novel distributed training framework based on an SBP (split, broadcast and partial-value) abstraction and the actor model. SBP enables much easier programming of data parallelism and model parallelism than existing frameworks, and the actor model provides a succinct runtime mechanism to manage the complex dependencies imposed by resource constraints, data movement and computation in distributed deep learning. We demonstrate the general applicability and efficiency of OneFlow for training various large DNN models with case studies and extensive experiments. The results show that OneFlow outperforms many well-known customized libraries built on top of the state-of-the-art frameworks. The code of OneFlow is available at: https://github.com/Oneflow-Inc/oneflow.
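To make the SBP abstraction concrete, the sketch below expresses both data and model parallelism for a single matmul using OneFlow's documented global-tensor API (flow.placement, to_global, flow.sbp.split, flow.sbp.broadcast). The two-GPU placement and the launch command are assumptions about a local setup, not details from the paper.

    # Minimal SBP sketch; run across two GPUs with, e.g.:
    #   python3 -m oneflow.distributed.launch --nproc_per_node 2 sbp_demo.py
    import oneflow as flow

    # Both devices participate in every global tensor below.
    placement = flow.placement(type="cuda", ranks=[0, 1])

    # Data parallelism: split the input batch along axis 0,
    # broadcast (replicate) the weight to every device.
    x = flow.randn(8, 4).to_global(placement=placement, sbp=flow.sbp.split(0))
    w = flow.randn(4, 3).to_global(placement=placement, sbp=flow.sbp.broadcast)
    y = flow.matmul(x, w)  # inferred SBP: split(0); each device holds half the rows

    # Model parallelism: replicate the input and instead split the
    # weight along its output-feature axis.
    x2 = flow.randn(8, 4).to_global(placement=placement, sbp=flow.sbp.broadcast)
    w2 = flow.randn(4, 3).to_global(placement=placement, sbp=flow.sbp.split(1))
    y2 = flow.matmul(x2, w2)  # inferred SBP: split(1); each device holds half the columns

Switching between the two paradigms only changes the SBP signatures on the operands; the framework infers the output's SBP and inserts any required communication, which is the simpler programming model the abstract claims.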


