An Empirical Evaluation of Allgatherv on Multi-GPU Systems

12/14/2018
by Thomas B. Rolinger et al.

Applications for deep learning and big data analytics have compute and memory requirements that exceed the limits of a single GPU. However, effectively scaling an application out to multiple GPUs is challenging due to the complexities of communication between the GPUs, particularly for collective communication with irregular message sizes. In this work, we present a performance evaluation of the Allgatherv routine on multi-GPU systems, focusing on the GPU network topology and the communication library used. We report results from the OSU micro-benchmark and conduct a case study of sparse tensor factorization, an application that uses Allgatherv with highly irregular message sizes. We extend our existing tensor factorization tool to run on systems with different node counts and varying numbers of GPUs per node. We then evaluate the communication performance of our tool when using traditional MPI, CUDA-aware MVAPICH, and NCCL across a suite of real-world data sets on three different systems: a 16-node cluster with one GPU per node, NVIDIA's DGX-1 with 8 GPUs, and Cray's CS-Storm with 16 GPUs. Our results show that the irregularity of the tensor data sets produces trends that contradict those observed in the OSU micro-benchmark, as well as trends that are absent from the benchmark altogether.
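
As a concrete illustration of the communication pattern studied here, the sketch below shows an MPI_Allgatherv call in which each rank contributes a different number of elements. This is a minimal host-memory example written for this summary, not code from the paper; with a CUDA-aware MPI such as MVAPICH2, the same call can typically be issued directly on GPU device pointers, whereas NCCL's ncclAllGather assumes equal counts per rank, so irregular gathers are usually built from other primitives.

```c
/* Illustrative sketch: Allgatherv with irregular per-rank message sizes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank contributes a different number of elements (irregular sizes). */
    int sendcount = rank + 1;
    double *sendbuf = malloc(sendcount * sizeof(double));
    for (int i = 0; i < sendcount; i++) sendbuf[i] = (double)rank;

    /* Exchange the per-rank counts, then build the displacement array. */
    int *recvcounts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    MPI_Allgather(&sendcount, 1, MPI_INT, recvcounts, 1, MPI_INT, MPI_COMM_WORLD);
    int total = 0;
    for (int r = 0; r < size; r++) { displs[r] = total; total += recvcounts[r]; }

    /* Every rank receives every other rank's variable-sized contribution. */
    double *recvbuf = malloc(total * sizeof(double));
    MPI_Allgatherv(sendbuf, sendcount, MPI_DOUBLE,
                   recvbuf, recvcounts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

    if (rank == 0) printf("gathered %d elements in total\n", total);

    free(sendbuf); free(recvbuf); free(recvcounts); free(displs);
    MPI_Finalize();
    return 0;
}
```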

Related research:

- 10/20/2021: Monitoring Collective Communication Among GPUs
  Communication among devices in multi-GPU systems plays an important role...
- 03/11/2019: Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect
  High performance multi-GPU computing becomes an inevitable trend due to ...
- 08/24/2019: Demystifying the MLPerf Benchmark Suite
  MLPerf, an emerging machine learning benchmark suite strives to cover a ...
- 08/09/2022: Exploring GPU Stream-Aware Message Passing using Triggered Operations
  Modern heterogeneous supercomputing systems are comprised of compute bla...
- 05/19/2022: Comparing single-node and multi-node performance of an important fusion HPC code benchmark
  Fusion simulations have traditionally required the use of leadership sca...
- 02/24/2021: GPU-aware Communication with UCX in Parallel Programming Models: Charm++, MPI, and Python
  As an increasing number of leadership-class systems embrace GPU accelera...
- 08/09/2023: gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters
  GPU-aware collective communication has become a major bottleneck for mod...
