Supporting Very Large Models using Automatic Dataflow Graph Partitioning

07/24/2018
by   Minjie Wang, et al.

There is a trend towards using very large deep neural networks (DNNs) to improve the accuracy of complex machine learning tasks. However, the size of the DNN models that can be explored today is limited by the amount of GPU device memory. This paper presents Tofu, a system for partitioning very large DNN models across multiple GPU devices. Tofu is designed for tensor-based dataflow systems: for each operator in the dataflow graph, it partitions the operator's input and output tensors and parallelizes its execution across workers. Tofu automatically discovers how each operator can be partitioned by analyzing its semantics, expressed in a simple specification language. It then uses a search algorithm based on dynamic programming to determine the best partition strategy for each operator in the entire dataflow graph. Our experiments on an 8-GPU machine show that Tofu enables the training of very large CNN and RNN models. It also achieves better performance than alternative approaches for training very large models on multiple GPUs.
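To make the idea of partitioning a single operator concrete, here is a minimal sketch in plain Python. It is not Tofu's code or API: the operator (matrix multiplication), the function names, and the sequential simulation of "workers" are all assumptions chosen for illustration. It shows two ways to split one operator's tensors so that each worker holds only part of the data: by rows of the first input, where outputs are concatenated, and along the shared reduction dimension, where partial outputs are summed.

```python
# Illustrative sketch only: none of these names come from Tofu.
# "Workers" are simulated by a sequential loop over tensor blocks.

def matmul(a, b):
    """Reference matrix multiply on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def split(seq, parts):
    """Split a sequence into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(seq), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(seq[start:end])
        start = end
    return chunks

def matmul_row_partitioned(a, b, workers):
    """Each worker gets a block of rows of `a` plus all of `b`.
    The per-worker outputs are concatenated."""
    out = []
    for a_block in split(a, workers):
        out.extend(matmul(a_block, b))
    return out

def matmul_reduction_partitioned(a, b, workers):
    """Each worker gets a block of columns of `a` and the matching rows
    of `b`. The per-worker partial outputs are summed elementwise."""
    total = [[0] * len(b[0]) for _ in a]
    for cols in split(list(range(len(b))), workers):
        if not cols:
            continue  # more workers than reduction indices
        a_block = [[row[j] for j in cols] for row in a]
        b_block = [b[j] for j in cols]
        partial = matmul(a_block, b_block)
        total = [[t + p for t, p in zip(t_row, p_row)]
                 for t_row, p_row in zip(total, partial)]
    return total

a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
b = [[1, 0], [0, 1], [2, 3], [4, 5]]
expected = matmul(a, b)
by_rows = matmul_row_partitioned(a, b, workers=2)
by_reduction = matmul_reduction_partitioned(a, b, workers=2)
```

Both strategies produce the same result as the unpartitioned operator, but they differ in which tensors each worker must hold and in how the outputs are combined. Choosing among such alternatives for every operator in a graph is the kind of decision the abstract assigns to the search algorithm.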

