
Synthesizing Collective Communication Algorithms for Heterogeneous Networks with TACCL

by   Aashaka Shah, et al.
The University of Texas at Austin

Large ML models and datasets have necessitated the use of multi-GPU systems for distributed model training. To harness the power offered by multi-GPU systems, it is critical to eliminate bottlenecks in inter-GPU communication, a problem made challenging by the heterogeneous nature of interconnects. In this work, we present TACCL, a synthesizer of collective communication primitives for large-scale multi-GPU systems. TACCL encodes a profiled topology and input size into a synthesis problem to generate optimized communication algorithms. TACCL is built on top of the standard NVIDIA Collective Communication Library (NCCL), allowing it to be a drop-in replacement for GPU communication in frameworks like PyTorch with minimal changes. TACCL generates algorithms for communication primitives like Allgather, Alltoall, and Allreduce that are up to 3× faster than NCCL. Using TACCL's algorithms speeds up the end-to-end training of an internal mixture-of-experts model by 17%. By decomposing the optimization problem into parts and leveraging the symmetry in multi-GPU topologies, TACCL synthesizes collectives for up to 80 GPUs in less than 3 minutes, at least two orders of magnitude faster than other synthesis-based state-of-the-art collective communication libraries.
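For readers unfamiliar with the three primitives the abstract names, the sketch below simulates their semantics over plain Python lists in a single process. It shows only *what* Allgather, Alltoall, and Allreduce compute (the input/output contract TACCL must preserve), not how TACCL or NCCL schedule the transfers; the function names and list-of-lists representation are our own illustrative choices, not TACCL's API.

```python
# Single-process simulation of collective semantics. per_rank[i] is the
# input buffer held by rank i; each function returns the list of output
# buffers, one per rank.

def allgather(per_rank):
    # Every rank ends up with the concatenation of all ranks' buffers.
    gathered = [x for buf in per_rank for x in buf]
    return [list(gathered) for _ in per_rank]

def alltoall(per_rank):
    # Rank i sends its j-th chunk to rank j: a distributed transpose.
    n = len(per_rank)
    return [[per_rank[i][j] for i in range(n)] for j in range(n)]

def allreduce(per_rank):
    # Every rank ends up with the elementwise sum across all ranks.
    length = len(per_rank[0])
    summed = [sum(buf[k] for buf in per_rank) for k in range(length)]
    return [list(summed) for _ in per_rank]
```

Any synthesized algorithm, whatever routing and scheduling it uses on the actual interconnect, must reproduce exactly these outputs; TACCL's search is over the order and paths of the underlying sends, not over the contract itself.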

