DistIR: An Intermediate Representation and Simulator for Efficient Neural Network Distribution

11/09/2021
by Keshav Santhanam, et al.

The rapidly growing size of deep neural network (DNN) models and datasets has given rise to a variety of distribution strategies such as data parallelism, tensor-model parallelism, pipeline parallelism, and hybrid combinations thereof. Each strategy offers its own trade-offs and performs best on different models and hardware topologies. Selecting the best set of strategies for a given setup is challenging because the search space grows combinatorially, and debugging and testing on clusters is expensive. In this work we propose DistIR, an expressive intermediate representation for distributed DNN computation that is tailored for efficient analyses, such as simulation. This enables automatically identifying the top-performing strategies without executing on physical hardware. Unlike prior work, DistIR can naturally express many distribution strategies, including pipeline parallelism with arbitrary schedules. Our evaluation on MLP training and GPT-2 inference models demonstrates how DistIR and its simulator enable fast grid searches over complex distribution spaces spanning up to 1000+ configurations, reducing optimization time by an order of magnitude for certain regimes.
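
At a high level, the workflow the abstract describes is: express each candidate distribution strategy as a graph of ops annotated with device placements, estimate its runtime with a simulator driven by a cost model, and grid-search over candidate configurations entirely offline. The sketch below illustrates that idea in plain Python; it is not the DistIR API, and the op names, cost-model constants, and the grid of (data-parallel degree, pipeline stages, microbatches) configurations are hypothetical placeholders chosen only to make the example self-contained.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical per-device op in a distributed computation graph.
@dataclass
class Op:
    name: str          # e.g. "stage_fwd_bwd", "allreduce_grads"
    device: int        # device the op is placed on
    flops: float       # compute cost in floating-point operations
    comm_bytes: float  # bytes moved over the interconnect (0 for pure compute)

# Toy cost model: assumed hardware constants, not measured values.
DEVICE_FLOPS = 100e12    # 100 TFLOP/s per device
LINK_BANDWIDTH = 25e9    # 25 GB/s interconnect

def op_latency(op: Op) -> float:
    """Estimated seconds for one op under the toy cost model."""
    return op.flops / DEVICE_FLOPS + op.comm_bytes / LINK_BANDWIDTH

def simulate(ops: list[Op]) -> float:
    """Very rough simulator: ops on different devices overlap, ops on the
    same device serialize. A real simulator also models dependencies,
    schedules, and overlap of compute with communication."""
    per_device: dict[int, float] = {}
    for op in ops:
        per_device[op.device] = per_device.get(op.device, 0.0) + op_latency(op)
    return max(per_device.values())

def build_strategy(dp: int, pp: int, microbatches: int,
                   layer_flops: float = 4e12,
                   layers: int = 8,
                   activation_bytes: float = 64e6) -> list[Op]:
    """Build a hypothetical op graph for an MLP-like model distributed with
    dp-way data parallelism and pp-way pipeline parallelism, with the batch
    split into the given number of microbatches."""
    ops = []
    layers_per_stage = max(1, layers // pp)
    for replica in range(dp):
        for stage in range(pp):
            device = replica * pp + stage
            for _ in range(microbatches):
                # Compute for this stage's share of layers and of the batch.
                ops.append(Op("stage_fwd_bwd", device,
                              flops=layer_flops * layers_per_stage / (dp * microbatches),
                              comm_bytes=0.0))
                # Activation transfer to the next pipeline stage.
                if stage < pp - 1:
                    ops.append(Op("send_activations", device, flops=0.0,
                                  comm_bytes=activation_bytes / (dp * microbatches)))
        # Gradient all-reduce across data-parallel replicas.
        if dp > 1:
            ops.append(Op("allreduce_grads", replica * pp, flops=0.0,
                          comm_bytes=2 * activation_bytes))
    return ops

if __name__ == "__main__":
    # Offline grid search: score every configuration with the simulator.
    grid = product([1, 2, 4], [1, 2, 4], [1, 2, 4, 8])  # (dp, pp, microbatches)
    best = min(grid, key=lambda cfg: simulate(build_strategy(*cfg)))
    print("best (dp, pp, microbatches):", best)
```

Because every configuration is scored by the cost model rather than by a run on a cluster, a search over hundreds or thousands of configurations like the one above finishes in seconds; that is the property the paper's simulation-based grid searches rely on.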

Related research

12/23/2020  BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training
The size of deep neural networks (DNNs) grows rapidly as the complexity ...

07/14/2018  Beyond Data and Model Parallelism for Deep Neural Networks
The computational requirements for training deep neural networks (DNNs) ...

02/01/2023  TAP: Accelerating Large-Scale DNN Training Through Tensor Automatic Parallelisation
Model parallelism has become necessary to train large neural networks. H...

04/19/2021  An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks
Deep Neural Network (DNN) frameworks use distributed training to enable ...

06/04/2023  Proteus: Simulating the Performance of Distributed DNN Training
DNN models are becoming increasingly larger to achieve unprecedented acc...

12/06/2021  Automap: Towards Ergonomic Automated Parallelism for ML Models
The rapid rise in demand for training large neural network architectures...

03/01/2022  Learning Intermediate Representations using Graph Neural Networks for NUMA and Prefetchers Optimization
There is a large space of NUMA and hardware prefetcher configurations th...
