GPU-initiated Fine-grained Overlap of Collective Communication with Computation

by Kishore Punniyamurthy, et al.

To satisfy their ever-increasing capacity and compute requirements, many machine learning models are distributed across multiple nodes using space-efficient parallelism strategies. As a result, collective communications are often on the critical path, and hiding their latency by overlapping kernel-granular communication and computation is difficult due to the absence of independent computation. In this work, we propose fusing computation with communication using GPU-initiated networking, and leverage GPUs' massive parallelism to enable fine-grained overlap of the fused operations. We have developed a single, self-contained GPU kernel where workgroups (WGs) immediately communicate their results to remote GPUs when they complete their computation. Meanwhile, other WGs within the same kernel perform overlapping computation, maintaining high ALU utilization. Furthermore, we propose zero-copy optimizations for peer-to-peer GPU communication where the data computed by one GPU is written directly to the destination buffers within the peer GPUs, eliminating intermediate stores and extra buffering. Our approach leverages the emerging multi-node GPU system trend where GPUs are physically close to the network, with direct GPU-NIC interconnects. We demonstrate our approach by creating an embedding + All-to-All fused kernel which overlaps embedding operations and the dependent All-to-All collective in DLRM models. We evaluate our approach using both simulation and real hardware. Our evaluations show that our approach can effectively overlap All-to-All communication with embedding computations, reducing their combined execution time by 31% [...] 58% [...] Scale-out simulations indicate that our approach reduces DLRM execution time by 10% [...]
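To make the benefit of fine-grained overlap concrete, the following is a toy timing model in Python, not the fused GPU kernel described in the abstract. It assumes, purely for illustration, that workgroups finish one after another on a single compute lane, that there is a single network lane, and that every workgroup has the same compute and communication cost. The function names `kernel_granular_time` and `fused_overlap_time` are hypothetical.

```python
def kernel_granular_time(num_wgs, compute, comm):
    """Kernel-granular baseline: every workgroup computes first,
    and communication starts only after the whole kernel is done."""
    return num_wgs * compute + num_wgs * comm


def fused_overlap_time(num_wgs, compute, comm):
    """Fused model: each workgroup's result is sent as soon as that
    workgroup finishes, while later workgroups keep computing.

    Assumption (not from the paper): one compute lane, one network lane.
    """
    compute_done = 0.0   # time at which the current workgroup finishes
    network_free = 0.0   # time at which the network lane is next idle
    for _ in range(num_wgs):
        compute_done += compute
        # A send can start once its data exists and the network is idle.
        send_start = max(compute_done, network_free)
        network_free = send_start + comm
    return network_free


if __name__ == "__main__":
    baseline = kernel_granular_time(4, 2.0, 1.0)
    fused = fused_overlap_time(4, 2.0, 1.0)
    print(baseline, fused)  # prints "12.0 9.0"
```

In this model only the last workgroup's send remains exposed when communication is cheaper than computation. When communication dominates, the network lane becomes the bottleneck and only the first workgroup's compute time is exposed. Real GPUs run many workgroups concurrently, so the numbers here are illustrative only.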




