Analytical Performance Estimation during Code Generation on Modern GPUs

04/29/2022
by Dominik Ernst, et al.

Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations, tuning parameters, and parallelization strategies. We propose an alternative to time-intensive autotuning, scenario-specific performance models, or black-box machine learning for selecting the best-performing configuration. This paper identifies the relevant performance-defining mechanisms of memory-intensive GPU applications through a performance model coupled with an analytic hardware metric estimator. This combination enables a quick and accurate exploration of large configuration spaces to identify highly efficient code candidates. We examine the changes in the A100 GPU architecture compared to its predecessor, the V100, and address the challenge of modeling data transfer volumes through the new memory hierarchy. We show how our method can be coupled to the pystencils stencil code generator, which we use to generate kernels for a range-four 3D-25pt stencil and for a complex two-phase fluid solver based on the Lattice Boltzmann Method. For both, the method delivers a ranking from which the best-performing candidate can be selected. The method is not limited to stencil kernels and can be integrated into any code generator that can generate the required address expressions.
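To make the selection step concrete, the sketch below assembles a range-four 3D-25pt star stencil with pystencils and ranks it with a simple roofline-style memory estimate. This is a minimal illustration only: the exact target-selection keyword differs between pystencils versions, and the bandwidth and traffic figures are assumptions for illustration, not values from the paper, whose analytic estimator models cache and memory volumes in far more detail.

```python
import pystencils as ps

src, dst = ps.fields("src, dst: double[3D]")

# Range-four 3D star stencil: the center cell plus four neighbors in
# each of the six axis directions, i.e. 1 + 6*4 = 25 points.
radius = 4
offsets = [tuple(s * k if i == axis else 0 for i in range(3))
           for axis in range(3) for s in (-1, 1)
           for k in range(1, radius + 1)]
rhs = (src[0, 0, 0] + sum(src[o] for o in offsets)) / 25
update = ps.Assignment(dst[0, 0, 0], rhs)

# Generate a GPU kernel AST; the target keyword may be spelled
# differently in other pystencils versions.
kernel_ast = ps.create_kernel(update, target=ps.Target.GPU)

# Roofline-style estimate: for a memory-bound kernel, predicted runtime
# is estimated data volume divided by memory bandwidth. With perfect
# caching, one update reads one cell and writes one cell (16 B in double
# precision); capacity misses would raise this traffic estimate.
A100_BW = 1.5e12          # assumed HBM2 bandwidth in bytes/s (approximate)
bytes_per_update = 16.0   # assumed best case: one 8 B read + one 8 B write
cells = 512 ** 3
predicted_seconds = cells * bytes_per_update / A100_BW
print(f"predicted runtime for one sweep: {predicted_seconds * 1e3:.2f} ms")
```

In the setting the abstract describes, such a predicted runtime would be computed for every generated candidate, so the full configuration space can be ranked without executing a single kernel.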
