Persistent Kernels for Iterative Memory-bound GPU Applications

04/05/2022
by Lingqi Zhang, et al.

Iterative memory-bound solvers commonly occur in HPC codes. Typical GPU implementations have a loop on the host side that invokes the GPU kernel once per time/algorithm step. The termination of each kernel implicitly acts as the barrier required after advancing the solution by one time step. We propose a scheme for running memory-bound iterative GPU kernels: PERsistent KernelS (PERKS). In this scheme the time loop is moved inside a persistent kernel, and device-wide barriers are used for synchronization. We then reduce the traffic to device memory by caching a subset of the output of each time step in registers and shared memory, to be used as input for the following time step. PERKS can be generalized to any iterative solver because the scheme is largely independent of the solver's implementation. We explain the design principle of PERKS and demonstrate its effectiveness for a wide range of iterative 2D/3D stencil benchmarks (geometric mean speedup of 2.29x in small domains and 1.53x in large domains) and for a Krylov subspace solver, conjugate gradient (geometric mean speedup of 4.67x on smaller SpMV datasets from SuiteSparse and 1.39x on larger SpMV datasets).
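
The contrast between the host-side loop and the persistent scheme can be illustrated with a small CPU-side analogy. The sketch below is not the authors' implementation and contains no GPU code: it assumes a 1D 3-point averaging stencil with fixed end points, uses Python threads to stand in for thread blocks, a `threading.Barrier` to stand in for the device-wide barrier, and plain lists to stand in for device memory. The names `host_loop`, `persistent_loop`, and `step_cell` are hypothetical.

```python
import threading


def step_cell(left, centre, right):
    # Hypothetical 3-point averaging stencil, used only for illustration.
    return (left + centre + right) / 3.0


def host_loop(u, steps):
    """Baseline: one 'kernel launch' per time step over the whole array."""
    cur = list(u)
    for _ in range(steps):
        nxt = cur[:]  # end points stay fixed
        for i in range(1, len(cur) - 1):
            nxt[i] = step_cell(cur[i - 1], cur[i], cur[i + 1])
        cur = nxt  # the end of the launch acts as the implicit barrier
    return cur


def persistent_loop(u, steps, workers=4):
    """Persistent variant: the time loop runs inside each worker."""
    n = len(u)
    assert n >= workers, "each worker needs at least one cell"
    bufs = [list(u), list(u)]  # double-buffered stand-in for device memory
    barrier = threading.Barrier(workers)  # stand-in for a device-wide barrier
    bounds = [(w * n // workers, (w + 1) * n // workers) for w in range(workers)]
    results = [None] * workers

    def worker(w):
        lo, hi = bounds[w]
        local = bufs[0][lo:hi]  # chunk kept "on chip" across time steps
        for t in range(steps):
            src, dst = bufs[t % 2], bufs[(t + 1) % 2]
            new = local[:]
            for k in range(hi - lo):
                i = lo + k
                if i == 0 or i == n - 1:
                    continue  # fixed end points
                # Interior neighbours come from the cached chunk; only the
                # neighbours owned by another worker are read from memory.
                left = local[k - 1] if k > 0 else src[i - 1]
                right = local[k + 1] if k < hi - lo - 1 else src[i + 1]
                new[k] = step_cell(left, local[k], right)
            local = new
            # Publish only the edge cells that neighbouring workers will read.
            dst[lo], dst[hi - 1] = local[0], local[-1]
            barrier.wait()  # every worker finishes step t before step t + 1
        results[w] = local  # single write-back after the last step

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return [x for chunk in results for x in chunk]
```

Calling `persistent_loop(u, steps)` on the same input as `host_loop(u, steps)` should give the same result, while each worker touches the shared buffers only for its two edge cells per step. Double buffering is a design choice of this sketch: it lets one barrier per step suffice, because writes for step t + 1 never land in the buffer that other workers are still reading for step t.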


research
05/12/2023

Revisiting Temporal Blocking Stencil Optimizations

Iterative stencils are used widely across the spectrum of High Performan...
research
04/13/2022

Explicit caching HYB: a new high-performance SpMV framework on GPGPU

Sparse Matrix-Vector Multiplication (SpMV) is a critical operation for t...
research
08/08/2018

Accelerating wave-propagation algorithms with adaptive mesh refinement using the Graphics Processing Unit (GPU)

Clawpack is a library for solving nonlinear hyperbolic partial different...
research
06/30/2020

SParSH-AMG: A library for hybrid CPU-GPU algebraic multigrid and preconditioned iterative methods

Hybrid CPU-GPU algorithms for Algebraic Multigrid methods (AMG) to effic...
research
07/14/2019

A Versatile Software Systolic Execution Model for GPU Memory-Bound Kernels

This paper proposes a versatile high-performance execution model, inspir...
research
02/25/2021

Checkpointing with cp: the POSIX Shared Memory System

We present the checkpointing scheme of Abacus, an N-body simulation code...
research
06/06/2023

Exploiting Scratchpad Memory for Deep Temporal Blocking: A case study for 2D Jacobian 5-point iterative stencil kernel (j2d5pt)

General Purpose Graphics Processing Units (GPGPU) are used in most of th...
