DeepAI AI Chat
Log In Sign Up

Specifying and Testing GPU Workgroup Progress Models

by   Tyler Sorensen, et al.

As GPU availability has increased and programming support has matured, a wider variety of applications are being ported to these platforms. Many parallel applications contain fine-grained synchronization idioms; as such, their correct execution depends on a degree of relative forward progress between threads (or thread groups). Unfortunately, many GPU programming specifications say almost nothing about relative forward progress guarantees between workgroups. Although prior work has proposed a spectrum of plausible progress models for GPUs, cross-vendor specifications have yet to commit to any model. This work is a collection of tools experimental data to aid specification designers when considering forward progress guarantees in programming frameworks. As a foundation, we formalize a small parallel programming language that captures the essence of fine-grained synchronization. We then provide a means of formally specifying a progress model, and develop a termination oracle that decides whether a given program is guaranteed to eventually terminate with respect to a given progress model. Next, we formalize a constraint for concurrent programs that require relative forward progress to terminate. Using this constraint, we synthesize a large set of 483 progress litmus tests. Combined with the termination oracle, this allows us to determine the expected status of each litmus test – i.e. whether it is guaranteed eventual termination – under various progress models. We present a large experimental campaign running the litmus tests across 8 GPUs from 5 different vendors. Our results highlight that GPUs have significantly different termination behaviors under our test suite. Most notably, we find that Apple and ARM GPUs do not support the linear occupancy-bound model, an intuitive progress model defined by prior work and hypothesized to describe the workgroup schedulers of existing GPUs.


page 1

page 2

page 3

page 4


A Variant of Concurrent Constraint Programming on GPU

The number of cores on graphical computing units (GPUs) is reaching thou...

TaDA Live: Compositional Reasoning for Termination of Fine-grained Concurrent Programs

We introduce TaDA Live, a separation logic for reasoning compositionally...

Term Rewriting on GPUs

We present a way to implement term rewriting on a GPU. We do this by let...

A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs

GPUs are playing an increasingly important role in general-purpose compu...

Measurement and Analysis of GPU-accelerated Applications with HPCToolkit

To address the challenge of performance analysis on the US DOE's forthco...

G-TADOC: Enabling Efficient GPU-Based Text Analytics without Decompression

Text analytics directly on compression (TADOC) has proven to be a promis...

Synthesizing Fine-Grained Synchronization Protocols for Implicit Monitors (Extended Version)

A monitor is a widely-used concurrent programming abstraction that encap...