Expediting Distributed DNN Training with Device Topology-Aware Graph Deployment

02/13/2023
by Shiwei Zhang, et al.

This paper presents TAG, an automatic system that derives an optimized DNN training graph and its deployment onto any device topology, for expedited training in device- and topology-heterogeneous ML clusters. In a novel approach, we feed both the DNN computation graph and the device topology graph as input to a graph neural network (GNN), and join the GNN with a search-based method to quickly identify optimized distributed training strategies. To reduce communication in a heterogeneous cluster, we further explore a lossless gradient compression technique and solve a combinatorial optimization problem to automatically decide where to apply it so as to minimize training time. We evaluate TAG with various representative DNN models and device topologies, showing that it achieves up to 4.56x training speed-up compared to existing schemes. TAG can produce efficient deployment strategies for both unseen DNN models and unseen device topologies, without heavy fine-tuning.
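
The idea of giving a GNN both the computation graph and the device topology graph can be pictured as building a single graph with two node kinds. The following is only a minimal illustrative sketch, not TAG's actual encoding: the function name `build_joint_graph`, the feature names, and the choice to connect every op to every device with a "place" edge are all assumptions made for this example.

```python
# Illustrative sketch only: one plausible way to merge a DNN computation
# graph and a device topology graph into a single typed graph.
# All names and the edge scheme are hypothetical, not taken from the paper.

def build_joint_graph(op_flops, op_edges, device_flops, links):
    """Return (nodes, edges) for a joint op/device graph.

    op_flops:     {op_name: estimated FLOPs}
    op_edges:     [(src_op, dst_op, tensor_bytes)]
    device_flops: {device_name: peak FLOP/s}
    links:        [(dev_a, dev_b, bandwidth_bytes_per_s)], undirected
    """
    nodes = {}
    edges = []

    # Op nodes carry compute cost; device nodes carry compute capacity.
    for op, flops in op_flops.items():
        nodes[("op", op)] = {"kind": "op", "flops": flops}
    for dev, rate in device_flops.items():
        nodes[("dev", dev)] = {"kind": "dev", "flops_per_s": rate}

    # Data-dependency edges between ops, weighted by tensor size.
    for src, dst, nbytes in op_edges:
        edges.append((("op", src), ("op", dst), "tensor", nbytes))

    # Network links between devices, stored in both directions.
    for a, b, bw in links:
        edges.append((("dev", a), ("dev", b), "link", bw))
        edges.append((("dev", b), ("dev", a), "link", bw))

    # Assumed candidate-placement edges: each op may run on any device.
    for op in op_flops:
        for dev in device_flops:
            edges.append((("op", op), ("dev", dev), "place", 0.0))

    return nodes, edges


# Toy example: three ops, two heterogeneous GPUs joined by one link.
nodes, edges = build_joint_graph(
    op_flops={"conv1": 2e9, "relu1": 1e6, "fc1": 5e8},
    op_edges=[("conv1", "relu1", 4096), ("relu1", "fc1", 4096)],
    device_flops={"gpu0": 15e12, "gpu1": 8e12},
    links=[("gpu0", "gpu1", 12e9)],
)
```

A graph in this form could then be converted to whatever heterogeneous-graph structure a GNN library expects; how TAG itself encodes features and edges is described in the full paper, not here.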
