OCCL: a Deadlock-free Library for GPU Collective Communication

03/11/2023
by Lichen Pan, et al.

Various distributed deep neural network (DNN) training technologies lead to increasingly complicated use of collective communications on GPU. The deadlock-prone collectives on GPU force researchers to guarantee that collectives are enqueued in a consistent order on each GPU to prevent deadlocks. In complex distributed DNN training scenarios, manual hardcoding is the only practical way to prevent such deadlocks, which poses significant challenges to the development of artificial intelligence. This paper presents OCCL, which is, to the best of our knowledge, the first deadlock-free collective communication library for GPU supporting dynamic decentralized preemption and gang-scheduling for collectives. Leveraging the preemption opportunity of collectives on GPU, OCCL dynamically preempts collectives in a decentralized way via its deadlock-free collective execution framework and enables dynamic decentralized gang-scheduling via its stickiness adjustment scheme. With the help of OCCL, researchers no longer have to struggle to get all GPUs to launch collectives in a consistent order to prevent deadlocks. We implement OCCL with several optimizations and integrate it with the distributed deep learning framework OneFlow. Experimental results demonstrate that OCCL achieves comparable or better latency and bandwidth for collectives than NCCL, the state of the art. When used in distributed DNN training, OCCL improves the peak training throughput by up to 78% compared to sequenced NCCL, while introducing overheads of less than 6.5% across distributed DNN training approaches.
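
To make the ordering hazard described above concrete, the following sketch (hypothetical; the two-communicator setup, buffer names, and the helper enqueue_collectives are illustrative assumptions, not code from the paper or from OCCL) shows how two ranks that launch the same pair of NCCL all-reduces on two communicators in opposite orders can deadlock, which is exactly the situation that currently has to be prevented by hardcoding a consistent launch order on every GPU:

/* Hypothetical sketch of the ordering hazard described in the abstract.
 * Communicator setup (ncclGetUniqueId / ncclCommInitRank) is omitted;
 * the helper name and buffers are illustrative, not OCCL or paper code. */
#include <cuda_runtime.h>
#include <nccl.h>

void enqueue_collectives(int rank, float *bufA, float *bufB, size_t count,
                         ncclComm_t commA, ncclComm_t commB,
                         cudaStream_t stream) {
  if (rank == 0) {
    /* Rank 0 enqueues the all-reduce on commA first, then on commB. */
    ncclAllReduce(bufA, bufA, count, ncclFloat, ncclSum, commA, stream);
    ncclAllReduce(bufB, bufB, count, ncclFloat, ncclSum, commB, stream);
  } else {
    /* Rank 1 enqueues the same collectives in the opposite order.
     * Each rank's first collective now waits for a peer collective that
     * is queued behind it on the other rank's stream, so neither can
     * ever start: a deadlock that only a consistent launch order (or
     * preemption, as OCCL proposes) can avoid. */
    ncclAllReduce(bufB, bufB, count, ncclFloat, ncclSum, commB, stream);
    ncclAllReduce(bufA, bufA, count, ncclFloat, ncclSum, commA, stream);
  }
}

According to the abstract, OCCL removes the need for this kind of hardcoded ordering: a blocked collective can be preempted dynamically and in a decentralized way, and collectives are gang-scheduled via the stickiness adjustment scheme, so launch order no longer has to match across GPUs.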

Related research

04/29/2020  Caramel: Accelerating Decentralized Distributed Deep Learning with Computation Scheduling
The method of choice for parameter aggregation in Deep Neural Network (D...

02/19/2019  Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes
It is important to scale out deep neural network (DNN) training for redu...

12/19/2015  Poseidon: A System Architecture for Efficient GPU-based Deep Learning on Multiple Machines
Deep learning (DL) has achieved notable successes in many machine learni...

11/08/2021  Synthesizing Collective Communication Algorithms for Heterogeneous Networks with TACCL
Large ML models and datasets have necessitated the use of multi-GPU syst...

10/09/2021  Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models
The continuous growth in both size and training data for modern Deep Neu...

08/09/2023  gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters
GPU-aware collective communication has become a major bottleneck for mod...

08/30/2022  Analysis of Distributed Deep Learning in the Cloud
We aim to resolve this problem by introducing a comprehensive distribute...
