WRHT: Efficient All-reduce for Distributed DNN Training in Optical Interconnect System

07/22/2022
by   Fei Dai, et al.

Communication efficiency plays an important role in accelerating the distributed training of Deep Neural Networks (DNNs). All-reduce is the key communication primitive used to reduce model parameters in distributed DNN training. Most existing all-reduce algorithms are designed for traditional electrical interconnect systems, which cannot meet the communication requirements of distributed training for large DNNs. A promising alternative to electrical interconnects is optical interconnects, which can provide high bandwidth, low transmission delay, and low power cost. We propose an efficient scheme called WRHT (Wavelength Reused Hierarchical Tree) for implementing the all-reduce operation in optical interconnect systems; it exploits WDM (Wavelength Division Multiplexing) to reduce the communication time of distributed data-parallel DNN training. We further derive the minimum number of communication steps and the corresponding communication time required to realize all-reduce using WRHT. Simulation results show that WRHT reduces the communication time of all-reduce by 75.59% compared with traditional all-reduce algorithms simulated in an optical interconnect system, and by 86.69% compared with the same algorithms in an electrical interconnect system.
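Since this page carries only the abstract, the details of WRHT itself are not reproduced here. As a rough illustration of the general idea behind hierarchical all-reduce schemes, the sketch below implements a generic two-level all-reduce (intra-group reduce, inter-group exchange, intra-group broadcast) over in-memory data. The grouping, function names, and the reading that each group's traffic could map onto its own wavelength are illustrative assumptions, not the paper's WRHT algorithm.

```python
# Illustrative sketch only: a generic two-level hierarchical all-reduce.
# It is NOT the WRHT algorithm from the paper; group size, function names,
# and the "one wavelength per group" interpretation are assumptions.

import numpy as np


def hierarchical_allreduce(node_data, group_size):
    """Sum-all-reduce a list of per-node numpy arrays in two levels:
    1) reduce inside each group (e.g., nodes that could share a wavelength),
    2) combine the group-level partial sums across group leaders,
    3) broadcast the global sum back to every node in each group.
    """
    n = len(node_data)
    assert n % group_size == 0, "assume equal-sized groups for simplicity"
    groups = [range(g, g + group_size) for g in range(0, n, group_size)]

    # Step 1: intra-group reduction (each group produces one partial sum).
    leader_sums = [sum(node_data[i] for i in grp) for grp in groups]

    # Step 2: inter-group all-reduce among the group leaders.
    global_sum = sum(leader_sums)

    # Step 3: intra-group broadcast so every node holds the same result.
    return [global_sum.copy() for _ in range(n)]


if __name__ == "__main__":
    # 8 nodes, each holding a gradient vector of length 4.
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(4) for _ in range(8)]
    reduced = hierarchical_allreduce(grads, group_size=4)
    assert all(np.allclose(reduced[0], r) for r in reduced)
    print(reduced[0])  # identical summed gradient on every node
```

The appeal of such a hierarchy is that intra-group steps can proceed concurrently across groups, which is the kind of parallelism that WDM wavelengths could carry in an optical interconnect; the exact step count and scheduling in WRHT are given in the paper, not in this sketch.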

Related research:

- OpTree: An Efficient Algorithm for All-gather Operation in Optical Interconnect Systems (11/28/2022)
- Accelerating Distributed ML Training via Selective Synchronization (07/16/2023)
- TopoOpt: Optimizing the Network Topology for Distributed DNN Training (02/01/2022)
- Priority-based Parameter Propagation for Distributed DNN Training (05/10/2019)
- Convert, compress, correct: Three steps toward communication-efficient DNN training (03/17/2022)
- Homomorphic Parameter Compression for Distributed Deep Learning Training (11/28/2017)
- Role of Digital Twin in Optical Communication: Fault Management, Hardware Configuration, and Transmission Simulation (11/10/2020)
