Rethinking Machine Learning Collective Communication as a Multi-Commodity Flow Problem

05/22/2023
by Behnaz Arzani, et al.

We show that recent work on communication schedulers for ML collectives does not scale to the increasing problem sizes that arise from training larger models. These works also often produce suboptimal schedules. We draw a connection to similar problems in traffic engineering and propose a new method, TECCL, that finds better-quality schedules (e.g., finishing collectives faster and/or sending fewer bytes) and does so more quickly on larger topologies. We present results on many different GPU topologies that show substantial improvement over the state-of-the-art.
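To make the multi-commodity flow framing concrete, here is a minimal, hypothetical sketch (not the paper's actual formulation): in an Allgather, each (source, destination) chunk transfer is one commodity routed over topology links, and with unit link capacity per time step, the most-loaded link lower-bounds how many steps any schedule needs. The 4-GPU ring topology and shortest-path routing below are illustrative assumptions, not TECCL's solver.

```python
from collections import deque

# Hypothetical 4-GPU bidirectional ring (directed links, unit capacity per step).
links = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 0), (2, 1), (3, 2), (0, 3)]
adj = {}
for u, v in links:
    adj.setdefault(u, []).append(v)

def shortest_path(src, dst):
    """BFS shortest path from src to dst over the directed topology."""
    prev = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            break
        for v in adj.get(u, []):
            if v not in prev:
                prev[v] = u
                q.append(v)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return list(reversed(path))

# Allgather demands: every GPU's chunk must reach every other GPU.
gpus = range(4)
demands = [(s, d) for s in gpus for d in gpus if s != d]

# Route each commodity on a shortest path and tally per-link load.
load = {l: 0 for l in links}
for s, d in demands:
    path = shortest_path(s, d)
    for u, v in zip(path, path[1:]):
        load[(u, v)] += 1

# With unit capacity per step, the busiest link gives a schedule-length lower bound.
print(max(load.values()))
```

A real multi-commodity flow formulation would instead let the solver split commodities across paths and time steps to minimize finish time; this sketch only shows how collective demands map onto link loads.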

