DistIR: An Intermediate Representation and Simulator for Efficient Neural Network Distribution

by   Keshav Santhanam, et al.

The rapidly growing size of deep neural network (DNN) models and datasets has given rise to a variety of distribution strategies such as data, tensor-model, pipeline parallelism, and hybrid combinations thereof. Each of these strategies offers its own trade-offs and exhibits optimal performance across different models and hardware topologies. Selecting the best set of strategies for a given setup is challenging because the search space grows combinatorially, and debugging and testing on clusters is expensive. In this work we propose DistIR, an expressive intermediate representation for distributed DNN computation that is tailored for efficient analyses, such as simulation. This enables automatically identifying the top-performing strategies without having to execute on physical hardware. Unlike prior work, DistIR can naturally express many distribution strategies including pipeline parallelism with arbitrary schedules. Our evaluation on MLP training and GPT-2 inference models demonstrates how DistIR and its simulator enable fast grid searches over complex distribution spaces spanning up to 1000+ configurations, reducing optimization time by an order of magnitude for certain regimes.


page 1

page 2

page 3

page 4


BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training

The size of deep neural networks (DNNs) grows rapidly as the complexity ...

Beyond Data and Model Parallelism for Deep Neural Networks

The computational requirements for training deep neural networks (DNNs) ...

TAP: Accelerating Large-Scale DNN Training Through Tensor Automatic Parallelisation

Model parallelism has become necessary to train large neural networks. H...

An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks

Deep Neural Network (DNN) frameworks use distributed training to enable ...

Proteus: Simulating the Performance of Distributed DNN Training

DNN models are becoming increasingly larger to achieve unprecedented acc...

Automap: Towards Ergonomic Automated Parallelism for ML Models

The rapid rise in demand for training large neural network architectures...

Learning Intermediate Representations using Graph Neural Networks for NUMA and Prefetchers Optimization

There is a large space of NUMA and hardware prefetcher configurations th...

Please sign up or login with your details

Forgot password? Click here to reset