Proteus: Simulating the Performance of Distributed DNN Training

06/04/2023
by Jiangfei Duan, et al.

DNN models are becoming increasingly large to achieve unprecedented accuracy, and the accompanying growth in computation and memory requirements necessitates training on massive clusters with elaborate parallelization strategies. To optimize performance and analyze cost, it is indispensable to model the training throughput of distributed DNN training. However, complex parallelization strategies and the resulting complex runtime behaviors make it challenging to construct an accurate performance model. In this paper, we present Proteus, the first standalone simulator to model the performance of complex parallelization strategies through simulated execution. Proteus first models complex parallelization strategies with a unified representation named Strategy Tree. It then compiles the strategy tree into a distributed execution graph and simulates the complex runtime behaviors, computation-communication overlap and bandwidth sharing, with a Hierarchical Topo-Aware Executor (HTAE). We evaluate Proteus across a wide variety of DNNs on three hardware configurations. Experimental results show that Proteus achieves a 3.0% average prediction error and correctly preserves the throughput ordering of different parallelization strategies. Compared to state-of-the-art approaches, Proteus reduces prediction error by up to 133.8%.
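To make the abstract's pipeline concrete (strategy tree, compiled execution graph, simulated execution with overlap and bandwidth sharing), here is a minimal Python sketch. It is a hypothetical illustration, not Proteus's actual implementation or API: `StrategyNode`, `Task`, `compile_tree`, and `simulate` are invented names, and the simulator models only the two runtime behaviors the abstract highlights, computation-communication overlap (separate compute and communication streams) and bandwidth sharing (concurrent transfers split the link evenly).

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class StrategyNode:
    """One node of a hypothetical strategy tree: an operator annotated
    with the parallel dimension it is split along and the degree."""
    op: str
    parallel_dim: str            # e.g. "data", "tensor", "pipeline"
    degree: int
    compute_flops: float         # total FLOPs for this operator
    comm_bytes: float            # gradient bytes to all-reduce
    children: List["StrategyNode"] = field(default_factory=list)

@dataclass(eq=False)             # eq=False keeps Task instances hashable
class Task:
    name: str
    kind: str                    # "compute" or "comm"
    work: float                  # FLOPs (compute) or bytes (comm)
    deps: List["Task"] = field(default_factory=list)
    done: float = 0.0

def compile_tree(root: StrategyNode) -> List[Task]:
    """Flatten the strategy tree into a per-device execution graph.
    Each node yields a compute task (chained after the previous node's
    compute) plus an all-reduce task that depends only on its own
    compute, so communication can overlap later computation."""
    tasks: List[Task] = []
    prev_compute = None
    stack = [root]
    while stack:
        node = stack.pop()
        c = Task(f"{node.op}/compute", "compute",
                 node.compute_flops / node.degree,
                 deps=[prev_compute] if prev_compute else [])
        m = Task(f"{node.op}/allreduce", "comm", node.comm_bytes, deps=[c])
        tasks += [c, m]
        prev_compute = c
        stack.extend(reversed(node.children))
    return tasks

def simulate(tasks: List[Task], flops_per_s=1e12, link_bw=1e10) -> float:
    """Progressive-filling simulation of one representative device.
    Compute and communication run on separate streams (overlap), and
    all in-flight transfers split the shared link bandwidth evenly."""
    remaining = set(tasks)
    t = 0.0
    while remaining:
        ready = [x for x in remaining
                 if all(d not in remaining for d in x.deps)]
        compute = [x for x in ready if x.kind == "compute"][:1]  # one stream
        comms = [x for x in ready if x.kind == "comm"]
        if not compute and not comms:
            raise RuntimeError("dependency cycle in execution graph")
        rate = {x: flops_per_s for x in compute}
        if comms:
            share = link_bw / len(comms)       # bandwidth sharing
            rate.update({x: share for x in comms})
        dt = min((x.work - x.done) / r for x, r in rate.items())
        t += dt
        for x, r in rate.items():
            x.done += r * dt
        remaining -= {x for x in rate if x.done >= x.work - 1e-9}
    return t

if __name__ == "__main__":
    # Two data-parallel layers on 4 devices: layer0's all-reduce
    # overlaps with layer1's compute, so the total is less than the
    # sum of compute and communication times.
    tree = StrategyNode("layer0", "data", 4, 8e12, 2e9, children=[
        StrategyNode("layer1", "data", 4, 8e12, 2e9)])
    print(f"simulated iteration time: {simulate(compile_tree(tree)):.2f} s")
```

The key design choice mirrored from the abstract is separating the execution graph (what runs, in what dependency order) from the executor (how long each task takes given contention), which is what lets a single graph be replayed against different topology assumptions.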

Related research

07/14/2018 · Beyond Data and Model Parallelism for Deep Neural Networks
The computational requirements for training deep neural networks (DNNs) ...

04/16/2020 · TensorOpt: Exploring the Tradeoffs in Distributed DNN Training with Auto-Parallelism
A good parallelization strategy can significantly improve the efficiency...

03/24/2023 · ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale
As deep learning models and input data are scaling at an unprecedented r...

01/21/2023 · SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction
With the growing model size, deep neural networks (DNN) are increasingly...

02/17/2020 · Simulating Performance of ML Systems with Offline Profiling
We advocate that simulation based on offline profiling is a promising ap...

11/05/2018 · Workload-aware Automatic Parallelization for Multi-GPU DNN Training
Deep neural networks (DNNs) have emerged as successful solutions for var...

11/09/2021 · DistIR: An Intermediate Representation and Simulator for Efficient Neural Network Distribution
The rapidly growing size of deep neural network (DNN) models and dataset...
