Optimal Direct-Connect Topologies for Collective Communications

02/07/2022
by   Liangyu Zhao, et al.

We consider the problem of distilling optimal network topologies for collective communications. We provide an algorithmic framework for constructing direct-connect topologies optimized for the latency-bandwidth tradeoff given a collective communication workload. Our algorithmic framework allows us to start from small base topologies and associated communication schedules and use a set of techniques that can be iteratively applied to derive much larger topologies and associated schedules. Our approach allows us to synthesize many different topologies and schedules for a given cluster size and degree constraint, and then identify the optimal topology for a given workload. We provide an analytical-model-based evaluation of the derived topologies and results on a small-scale optical testbed that uses patch panels for configuring a topology for the duration of an application's execution. We show that the derived topologies and schedules provide significant performance benefits over existing collective communications implementations.
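The selection step described above, picking the best topology and schedule for a given workload, can be illustrated with a minimal sketch. It assumes the standard alpha-beta cost model (time = steps × per-step latency + data volume ÷ link bandwidth), which is a common analytical model and not necessarily the one used in the paper. The `Candidate` type, the `low_latency_hypothetical` entry, and all numeric values are made up for illustration; only the ring allreduce step and volume counts are the well-known textbook figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A (topology, schedule) pair summarized by two cost coefficients."""
    name: str
    steps: int        # number of communication steps (latency term)
    bw_factor: float  # message-size multiples sent over the bottleneck link


def cost(c: Candidate, msg_bytes: float, alpha: float, bandwidth: float) -> float:
    """Alpha-beta estimate of collective completion time in seconds."""
    return c.steps * alpha + c.bw_factor * msg_bytes / bandwidth


def ring_allreduce(n: int) -> Candidate:
    """Textbook ring allreduce: 2(n-1) steps, each moving 1/n of the message."""
    return Candidate("ring", 2 * (n - 1), 2 * (n - 1) / n)


def best(candidates, msg_bytes: float, alpha: float, bandwidth: float) -> Candidate:
    """Return the candidate with the lowest estimated cost for this workload."""
    return min(candidates, key=lambda c: cost(c, msg_bytes, alpha, bandwidth))


# Hypothetical setup: 16 nodes, 10 us per-step latency, 12.5 GB/s links.
N = 16
ALPHA = 10e-6
BANDWIDTH = 12.5e9

candidates = [
    ring_allreduce(N),
    # Made-up candidate: fewer steps, but more data over the bottleneck link.
    Candidate("low_latency_hypothetical", steps=6, bw_factor=4.0),
]

small = best(candidates, 1e3, ALPHA, BANDWIDTH)  # 1 KB message
large = best(candidates, 1e9, ALPHA, BANDWIDTH)  # 1 GB message
print(small.name, large.name)
```

Under these assumed numbers, the low-step candidate is chosen for the small message, where latency dominates, and the ring is chosen for the large message, where bandwidth dominates. This is the latency-bandwidth tradeoff the abstract refers to.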


