gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters

08/09/2023
by   Jiajun Huang, et al.

GPU-aware collective communication has become a major bottleneck on modern computing platforms as GPU computing power rapidly rises. Prior approaches address this by integrating lossy compression directly into GPU-aware collectives, yet they still suffer from serious issues such as underutilized GPU devices and uncontrolled data distortion. In this paper, we propose gZCCL, a general framework for designing and optimizing GPU-aware, compression-enabled collectives with an accuracy-aware design that controls error propagation. To validate our framework, we evaluate its performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our gZCCL-accelerated collectives, covering both collective computation (e.g., Allreduce) and collective data movement (e.g., Scatter), outperform NCCL and Cray MPI by up to 4.5X and 28.7X, respectively. Furthermore, an accuracy evaluation with an image-stacking application confirms the high quality of the data reconstructed under our accuracy-aware framework.
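To illustrate the general pattern behind compression-enabled collectives, the sketch below shows a recursive-doubling Allreduce in plain MPI in which each rank compresses its partial result before sending, decompresses what it receives, and reduces locally. This is only a minimal host-side sketch, not gZCCL's actual algorithm: the names compressed_allreduce_sum, zx_compress, and zx_decompress are hypothetical, the "compressor" is a byte-copy stub standing in for a real error-bounded GPU compressor, and a power-of-two rank count is assumed. gZCCL itself operates on device buffers and overlaps compression with communication.

/* Minimal, runnable sketch of a compression-enabled Allreduce (sum)
 * using recursive doubling. Compile with mpicc; assumes a power-of-two
 * number of ranks. The zx_* compressor API is hypothetical. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Placeholder "compressor": a real deployment would call an
 * error-bounded GPU compressor here. This stub just copies bytes so
 * the communication skeleton compiles and runs end to end. */
static size_t zx_compress(const float *in, size_t n, double err_bound,
                          void *out)
{
    (void)err_bound;                      /* unused by the stub */
    memcpy(out, in, n * sizeof(float));
    return n * sizeof(float);             /* "compressed" byte count */
}

static void zx_decompress(const void *in, size_t nbytes, size_t n,
                          float *out)
{
    (void)n;
    memcpy(out, in, nbytes);
}

static void compressed_allreduce_sum(float *data, size_t n,
                                     double err_bound, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Buffers sized for the uncompressed payload; a real compressor
     * would need an explicit worst-case output bound. */
    void  *sendbuf = malloc(n * sizeof(float));
    void  *recvbuf = malloc(n * sizeof(float));
    float *peerval = malloc(n * sizeof(float));

    /* Recursive doubling: at step s, exchange with the rank whose ID
     * differs in bit s, then reduce the peer's decompressed values. */
    for (int mask = 1; mask < size; mask <<= 1) {
        int peer = rank ^ mask;
        size_t sbytes = zx_compress(data, n, err_bound, sendbuf);

        MPI_Status st;
        MPI_Sendrecv(sendbuf, (int)sbytes, MPI_BYTE, peer, 0,
                     recvbuf, (int)(n * sizeof(float)), MPI_BYTE, peer, 0,
                     comm, &st);

        int rbytes;
        MPI_Get_count(&st, MPI_BYTE, &rbytes);
        zx_decompress(recvbuf, (size_t)rbytes, n, peerval);

        for (size_t i = 0; i < n; i++)    /* local reduction (sum) */
            data[i] += peerval[i];
    }

    free(sendbuf); free(recvbuf); free(peerval);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    float v[4] = {1.0f * rank, 2.0f, 3.0f, 4.0f};
    compressed_allreduce_sum(v, 4, 1e-3, MPI_COMM_WORLD);
    if (rank == 0)
        printf("v[0] after allreduce = %f\n", v[0]);
    MPI_Finalize();
    return 0;
}

Note that every step re-compresses already-reduced data, so with a lossy compressor the error compounds across log2(P) steps; this is exactly the uncontrolled error propagation that the accuracy-aware design in gZCCL is meant to keep in check.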

Related research

04/08/2023 · C-Coll: Introducing Error-bounded Lossy Compression into MPI Collectives
With the ever-increasing computing power of supercomputers and the growi...

02/24/2021 · GPU-aware Communication with UCX in Parallel Programming Models: Charm++, MPI, and Python
As an increasing number of leadership-class systems embrace GPU accelera...

09/06/2019 · iFDK: A Scalable Framework for Instant High-resolution Image Reconstruction
Computed Tomography (CT) is a widely used technology that requires compu...

01/27/2022 · GC3: An Optimizing Compiler for GPU Collective Communication
Machine learning models made up of millions or billions of parameters ar...

12/14/2018 · An Empirical Evaluation of Allgatherv on Multi-GPU Systems
Applications for deep learning and big data analytics have compute and m...

10/20/2021 · Monitoring Collective Communication Among GPUs
Communication among devices in multi-GPU systems plays an important role...

03/11/2023 · OCCL: a Deadlock-free Library for GPU Collective Communication
Various distributed deep neural network (DNN) training technologies lead...
