Synthesizing Collective Communication Algorithms for Heterogeneous Networks with TACCL

11/08/2021
by Aashaka Shah, et al.

Large ML models and datasets have necessitated the use of multi-GPU systems for distributed model training. To harness the power offered by multi-GPU systems, it is critical to eliminate bottlenecks in inter-GPU communication, a problem made challenging by the heterogeneous nature of interconnects. In this work, we present TACCL, a synthesizer of collective communication primitives for large-scale multi-GPU systems. TACCL encodes a profiled topology and input size into a synthesis problem to generate optimized communication algorithms. TACCL is built on top of the standard NVIDIA Collective Communication Library (NCCL), allowing it to serve as a drop-in replacement for GPU communication in frameworks like PyTorch with minimal changes. TACCL generates algorithms for communication primitives such as Allgather, Alltoall, and Allreduce that are up to 3× faster than NCCL. Using TACCL's algorithms speeds up end-to-end training of an internal mixture-of-experts model by 17%. By decomposing the optimization problem into parts and leveraging the symmetry of multi-GPU topologies, TACCL synthesizes collectives for up to 80 GPUs in less than three minutes, at least two orders of magnitude faster than other state-of-the-art synthesis-based collective communication libraries.
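
Because TACCL exposes the standard NCCL interface, the abstract suggests a PyTorch job that already uses the "nccl" backend needs only minimal changes to pick up a synthesized algorithm. The sketch below is a minimal, generic example of the three collectives named above issued through torch.distributed; how a particular TACCL-synthesized algorithm is selected (for example, via configuration or environment variables) is not specified here and is assumed to happen beneath this API.

```python
# Minimal sketch, assuming TACCL acts as a drop-in NCCL replacement underneath
# the standard "nccl" backend. All calls below are ordinary torch.distributed
# APIs; nothing TACCL-specific is invoked in user code.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank: int, world_size: int) -> None:
    # Single-node rendezvous defaults (assumed for the sketch).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    # Standard NCCL-backed process group; TACCL would sit behind this layer.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Allreduce: sum a tensor across all GPUs.
    x = torch.ones(1 << 20, device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    # Allgather: collect one shard from every rank.
    shard = torch.full((1 << 18,), float(rank), device="cuda")
    gathered = [torch.empty_like(shard) for _ in range(world_size)]
    dist.all_gather(gathered, shard)

    # Alltoall: exchange equal-sized chunks with every rank.
    inp = torch.arange(world_size * 4, device="cuda", dtype=torch.float32)
    out = torch.empty_like(inp)
    dist.all_to_all_single(out, inp)

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```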
