TensorOpt: Exploring the Tradeoffs in Distributed DNN Training with Auto-Parallelism

04/16/2020
by Zhenkun Cai, et al.

A good parallelization strategy can significantly improve the efficiency, or reduce the cost, of distributed training for deep neural networks (DNNs). Recently, several methods have been proposed to find efficient parallelization strategies, but they all optimize a single objective (e.g., execution time or memory consumption) and produce only one strategy. We propose FT, an efficient algorithm that searches for an optimal set of parallelization strategies to allow trade-offs among different objectives. FT can adapt to different scenarios: it minimizes memory consumption when the number of devices is limited and fully utilizes additional resources to reduce execution time. For popular DNN models (e.g., vision and language models), we conduct an in-depth analysis to understand the trade-offs among different objectives and their influence on the parallelization strategies. We also develop a user-friendly system, called TensorOpt, which allows users to run their distributed DNN training jobs without caring about the details of parallelization strategies. Experimental results show that FT runs efficiently and provides accurate estimates of runtime costs, and that TensorOpt adapts to resource availability more flexibly than existing frameworks.
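To make the time/memory trade-off concrete, here is a minimal sketch of the core idea: enumerate candidate parallelization strategies with estimated per-iteration time and per-device memory, keep the Pareto-optimal ones, and pick the fastest strategy that fits a given memory budget. This is not the paper's actual FT algorithm or TensorOpt API; all names (Strategy, pareto_frontier, pick_strategy) and numbers are illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Strategy:
    name: str      # hypothetical label for how operators are partitioned across devices
    time: float    # estimated per-iteration execution time (seconds)
    memory: float  # estimated peak per-device memory (GB)

def pareto_frontier(candidates: List[Strategy]) -> List[Strategy]:
    """Keep strategies that are not dominated in both time and memory."""
    frontier: List[Strategy] = []
    for s in sorted(candidates, key=lambda c: (c.memory, c.time)):
        # Candidates arrive in order of increasing memory, so a candidate
        # belongs on the frontier iff it strictly improves on the best time so far.
        if not frontier or s.time < frontier[-1].time:
            frontier.append(s)
    return frontier

def pick_strategy(frontier: List[Strategy], memory_budget: float) -> Optional[Strategy]:
    """Fastest Pareto-optimal strategy that fits the per-device memory budget."""
    feasible = [s for s in frontier if s.memory <= memory_budget]
    return min(feasible, key=lambda s: s.time) if feasible else None

if __name__ == "__main__":
    candidates = [
        Strategy("data-parallel",  time=1.0, memory=14.0),
        Strategy("model-parallel", time=1.8, memory=6.0),
        Strategy("hybrid",         time=1.2, memory=9.0),
    ]
    frontier = pareto_frontier(candidates)
    print(pick_strategy(frontier, memory_budget=10.0))  # -> hybrid
```

The actual FT algorithm searches over the full space of per-operator partitioning choices rather than a handful of named strategies, but the selection step above illustrates the trade-off the abstract describes: when memory is tight, a slower but leaner strategy wins; with more resources, faster strategies become feasible.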


