Synthesizing Optimal Collective Algorithms

08/19/2020
by Zixian Cai, et al.

Collective communication algorithms are an important component of distributed computation. Indeed, in the case of deep learning, collective communication is the Amdahl's Law bottleneck of data-parallel training. This paper introduces SCCL (for Synthesized Collective Communication Library), a systematic approach to synthesizing collective communication algorithms that are explicitly tailored to a particular hardware topology. SCCL synthesizes algorithms along the Pareto frontier spanning from latency-optimal to bandwidth-optimal implementations of a collective. The paper demonstrates how to encode SCCL's synthesis as a quantifier-free SMT formula that can be discharged to a theorem prover. We further demonstrate how to scale the synthesis by exploiting symmetries in topologies and collectives. We synthesize novel latency- and bandwidth-optimal algorithms not seen in the literature for two popular hardware topologies. We also show how SCCL efficiently lowers algorithms to implementations on two hardware architectures (NVIDIA and AMD) and demonstrate performance competitive with hand-optimized collective communication libraries.
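To illustrate the flavor of such an encoding, the sketch below uses the Z3 SMT solver to check whether an Allgather schedule exists on a small unidirectional ring within a given number of steps and per-link bandwidth. This is a hedged, simplified illustration rather than SCCL's actual encoding or API; all names (has, snd, N, STEPS, BW) and the ring topology are assumptions made for the example.

```python
# Minimal sketch, assuming a 4-node unidirectional ring and an Allgather goal.
# Not SCCL's actual encoding; it only checks feasibility of a schedule length.
from z3 import Bool, Solver, Or, Not, Implies, AtMost, sat

N, STEPS, BW = 4, 3, 1                          # ring size, steps, chunks per link per step
links = [(i, (i + 1) % N) for i in range(N)]    # unidirectional ring edges

# has[c][n][t]: node n holds chunk c (which originates at node c) after step t
has = [[[Bool(f"has_{c}_{n}_{t}") for t in range(STEPS + 1)]
        for n in range(N)] for c in range(N)]
# snd[c][l][t]: chunk c is sent over link l during step t
snd = [[[Bool(f"snd_{c}_{l}_{t}") for t in range(STEPS)]
        for l in range(len(links))] for c in range(N)]

s = Solver()
for c in range(N):
    for n in range(N):
        # Initially each node holds exactly its own chunk.
        s.add(has[c][n][0] if c == n else Not(has[c][n][0]))
        # Goal: after the last step every node holds every chunk (Allgather).
        s.add(has[c][n][STEPS])

for t in range(STEPS):
    for l, (src, dst) in enumerate(links):
        # Bandwidth: at most BW chunks cross a link in one step.
        s.add(AtMost(*[snd[c][l][t] for c in range(N)], BW))
        for c in range(N):
            # A chunk can only be sent if the source already holds it.
            s.add(Implies(snd[c][l][t], has[c][src][t]))
    for c in range(N):
        for n in range(N):
            # A node holds a chunk after step t+1 iff it already held it
            # or the chunk was sent to it over some incoming link.
            incoming = [snd[c][l][t] for l, (_, dst) in enumerate(links) if dst == n]
            s.add(has[c][n][t + 1] == Or(has[c][n][t], *incoming))

print("schedule found" if s.check() == sat else "no schedule within STEPS steps")
```

SCCL's actual formulation is richer: it also models when sends occur within a step to account for latency and bandwidth cost along the Pareto frontier, and, as noted in the abstract, it exploits symmetries in the topology and collective to keep the search tractable at larger scales.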
