Auto-MAP: A DQN Framework for Exploring Distributed Execution Plans for DNN Workloads

07/08/2020
by   Siyu Wang, et al.
34

The last decade has witnessed growth in the computational requirements for training deep neural networks. Current approaches (e.g., data/model parallelism, pipeline parallelism) parallelize training tasks onto multiple devices. However, these approaches always rely on specific deep learning frameworks and requires elaborate manual design, which make it difficult to maintain and share between different type of models. In this paper, we propose Auto-MAP, a framework for exploring distributed execution plans for DNN workloads, which can automatically discovering fast parallelization strategies through reinforcement learning on IR level of deep learning models. Efficient exploration remains a major challenge for reinforcement learning. We leverage DQN with task-specific pruning strategies to help efficiently explore the search space including optimized strategies. Our evaluation shows that Auto-MAP can find the optimal solution in two hours, while achieving better throughput on several NLP and convolution models.

READ FULL TEXT
research
07/14/2018

Beyond Data and Model Parallelism for Deep Neural Networks

The computational requirements for training deep neural networks (DNNs) ...
research
02/14/2018

Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks

The past few years have witnessed growth in the size and computational r...
research
11/25/2022

Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism

Transformer models have achieved state-of-the-art performance on various...
research
04/16/2020

TensorOpt: Exploring the Tradeoffs in Distributed DNN Training with Auto-Parallelism

A good parallelization strategy can significantly improve the efficiency...
research
01/21/2023

SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction

With the growing model size, deep neural networks (DNN) are increasingly...
research
07/05/2023

Improving Automatic Parallel Training via Balanced Memory Workload Optimization

Transformer models have emerged as the leading approach for achieving st...
research
12/06/2021

Automap: Towards Ergonomic Automated Parallelism for ML Models

The rapid rise in demand for training large neural network architectures...

Please sign up or login with your details

Forgot password? Click here to reset