TopoOpt: Optimizing the Network Topology for Distributed DNN Training

02/01/2022
by Weiyang Wang, et al.

We explore a novel approach to building DNN training clusters using commodity optical devices. Our proposal, called TopoOpt, co-optimizes the distributed training process across three dimensions: computation, communication, and network topology. TopoOpt uses a novel alternating optimization technique and a group theory-inspired algorithm to find the best network topology and routing plan, together with the parallelization strategy, for distributed DNN training. To motivate our proposal, we measure the communication patterns of distributed DNN workloads at a large online service provider. Experiments with a 12-node prototype demonstrate the feasibility of TopoOpt. Simulations on real distributed training models show that, compared to similar-cost FatTree interconnects, TopoOpt reduces DNN training time by up to 3x.
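The abstract only names the alternating optimization at the heart of TopoOpt, so as a rough illustration of the general pattern, the Python sketch below alternates between fixing the topology to pick a parallelization strategy and fixing the strategy to pick a topology. All names here (training_time, candidate_strategies, candidate_topologies) are hypothetical placeholders rather than the paper's API, and the sketch deliberately omits the group theory-inspired topology search described above.

```python
# Minimal sketch of a generic alternating-optimization loop in the spirit of
# TopoOpt's co-optimization. Every name below is a hypothetical placeholder;
# this is NOT the paper's actual algorithm, which uses a group theory-inspired
# search over topologies and routing plans.

def alternating_optimize(candidate_strategies, candidate_topologies,
                         training_time, max_rounds=10):
    """Alternately fix one dimension and optimize the other until stable.

    training_time(strategy, topology) is assumed to return an estimated
    iteration time for a given (parallelization strategy, topology) pair.
    """
    strategy = candidate_strategies[0]
    topology = candidate_topologies[0]
    best_cost = training_time(strategy, topology)
    for _ in range(max_rounds):
        # Step 1: with the topology fixed, pick the fastest parallelization.
        strategy = min(candidate_strategies,
                       key=lambda s: training_time(s, topology))
        # Step 2: with the strategy fixed, pick the fastest topology.
        topology = min(candidate_topologies,
                       key=lambda t: training_time(strategy, t))
        cost = training_time(strategy, topology)
        if cost >= best_cost:
            break  # no improvement this round: treat as converged
        best_cost = cost
    return strategy, topology, best_cost
```

This coordinate-descent style loop only conveys the flavor of co-optimizing across dimensions; the real system's inner searches over topologies and routing are far more structured.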

Related research:

02/13/2023 · Expediting Distributed DNN Training with Device Topology-Aware Graph Deployment
This paper presents TAG, an automatic system to derive optimized DNN tra...

03/06/2020 · Communication Optimization Strategies for Distributed Deep Learning: A Survey
Recent trends in high-performance computing and deep learning lead to a ...

02/19/2019 · Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes
It is important to scale out deep neural network (DNN) training for redu...

07/22/2022 · WRHT: Efficient All-reduce for Distributed DNN Training in Optical Interconnect System
Communication efficiency plays an important role in accelerating the dis...

09/24/2021 · Exploring Multi-dimensional Hierarchical Network Topologies for Efficient Distributed Training of Trillion Parameter DL Models
Deep Neural Networks have gained significant attraction due to their wid...

09/26/2022 · Optimizing DNN Compilation for Distributed Training with Joint OP and Tensor Fusion
This paper proposes DisCo, an automatic deep learning compilation module...

08/10/2023 · Isolated Scheduling for Distributed Training Tasks in GPU Clusters
Distributed machine learning (DML) technology makes it possible to train...
