A Framework for Fine-Grained Synchronization of Dependent GPU Kernels

05/22/2023
by Abhinav Jangda, et al.

Machine Learning (ML) models contain highly parallel computations, such as matrix multiplications, convolutions, and dropout. These computations are commonly executed on Graphics Processing Units (GPUs) by dividing them into independent processing blocks, known as tiles. Since the number of tiles is usually higher than the number of execution units of a GPU, tiles are executed across all execution units in waves. However, the tiles in the last wave can under-utilize the execution units because the number of tiles is not always a multiple of the number of execution units. This under-utilization can be reduced by executing multiple independent kernels concurrently on a GPU, but this is not currently possible for dependent kernels. In this paper, we present cuSync, a framework for writing custom fine-grained synchronization policies for dependent kernels to improve GPU utilization. cuSync synchronizes tiles instead of kernels, which allows tiles of multiple dependent kernels to execute concurrently. Using cuSync, we expressed several synchronization policies in a few lines of code and reduced the inference times of GPT-3 and ResNet-38 by up to 1.19x and 1.16x, respectively.
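The core idea of tile-level synchronization can be illustrated with a small host-side sketch. This is not the cuSync API (which targets CUDA kernels); it is a hypothetical Python simulation using threads for the two kernels and a per-tile flag for synchronization: each consumer tile waits only for the specific producer tile it depends on, so consumer tiles can start while later producer tiles are still running, instead of waiting for the entire producer kernel to finish.

```python
# Hypothetical sketch (not the cuSync API): per-tile synchronization
# between a producer "kernel" and a dependent consumer "kernel".
import threading

NUM_TILES = 8
tile_done = [threading.Event() for _ in range(NUM_TILES)]  # one flag per producer tile
order = []          # records tile completion order, to check dependencies
lock = threading.Lock()

def producer():
    for t in range(NUM_TILES):
        # ... compute producer tile t here ...
        with lock:
            order.append(("produce", t))
        tile_done[t].set()  # signal: producer tile t is ready

def consumer():
    for t in range(NUM_TILES):
        tile_done[t].wait()  # wait only for the one tile this tile needs
        # ... compute consumer tile t here ...
        with lock:
            order.append(("consume", t))

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()

# Every consumer tile ran after its matching producer tile, but consumer
# tiles were free to interleave with later producer tiles.
assert all(order.index(("produce", t)) < order.index(("consume", t))
           for t in range(NUM_TILES))
```

On a GPU, the per-tile flags would live in device memory and be signaled with atomics and memory fences rather than `threading.Event`, but the synchronization policy, which consumer tile waits on which producer tiles, has the same shape.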


