GPUs are becoming increasingly prevalent in HPC systems. More than half of the systems on the Top500 list include discrete GPUs, and seven of the systems in the top ten are GPU-accelerated (November 2021 list). As a result, extensive efforts went into optimizing iterative methods for GPUs, for instance: iterative stencils [43, 60, 39, 10], used widely in numerical solvers for PDEs; iterative stationary methods for solving systems of linear equations (e.g., Jacobi [1, 34] and Gauss–Seidel [11, 19, 34]); and iterative Krylov subspace methods for solving systems of linear equations (e.g., conjugate gradient [53, 4], BiCG [4, 3], and GMRES [4, 12]).
Although the device memory bandwidth of GPUs has been increasing from generation to generation, the gap between compute and memory is widening. Given that iterative stencils and implicit solvers typically have low arithmetic intensity, significant efforts went into optimizing them for data locality. These include moving the bottleneck from global memory to on-chip scratchpad memory [66, 38], or pushing the bottleneck further to the register files [10, 69]. Those efforts become increasingly effective since the aggregate volume of register files and scratchpad memory is increasing with newer generations of GPUs. In iterative solvers, due to spatial dependencies, a barrier is typically required at the end of each time step (or every several time steps when doing temporal blocking). That is to assure that advancing the solution in time step t+1 only starts after all threads finish advancing the solution in time step t. Invoking the kernel from the host side in each time step acts as an implicit barrier: the kernel invocation for time step t+1 happens only after all threads of the kernel invocation for time step t have finished execution. In-between kernel invocations, data stored in registers and scratchpad memory is wiped out, and the next kernel invocation starts by reading its input from device memory.
One opportunity to improve data locality is to extend the lifetime of the kernel across the time steps and take advantage of the large volume of register files and scratchpad memory to reduce traffic to device memory. In this paper, we propose a generic scheme for running iterative solvers on GPUs to improve data locality. PERsistent KernelS (PERKS)111In this paper, we use PERKS, interchangeably, to refer to our proposed scheme and as an abbreviation of persistent kernels. have the time loop inside them, instead of on the host, and use the recently supported device-wide barriers (in CUDA) for synchronization. Next, we identify the cacheable data in the solver: data that is the output of time step t and input to time step t+1, as well as repeatedly loaded constant data. Finally, we use either the scratchpad memory or registers (or both) to cache that data and reduce the traffic to device memory. PERKS can be an effective technique not just for iterative memory-bound solvers; PERKS can also improve the locality of iterative compute-bound solvers (when storing an iteration’s output cannot be effectively overlapped with compute). That is because compute-bound solvers would still benefit from reducing memory traffic in-between iterations, even if each iteration runs compute-bound code.
While the basic idea of PERKS is seemingly simple, there are several challenges to effectively improving performance with PERKS. First, it is crucial to understand and adapt to the pressure on resources (particularly registers and shared memory) by reducing the occupancy (number of active threads) while maintaining high enough concurrency to saturate the device. Increased pressure on resources from a persistent kernel reduces the number of registers available for caching and, even worse, can lead to performance degradation if register spilling occurs. Second, we need to analyze and pick the ideal scheme to traverse the domain of the input array(s) at each step to improve the locality and increase the volume of cached data when transitioning between time steps. Third, for ideal locality, PERKS solvers that involve more than one array (e.g., conjugate gradient) need a model of which arrays to cache based on their sizes and how they are accessed. Finally, identifying PERKS implementation quality and latent bottlenecks necessitates a robust performance model.
It is important to note that PERKS is orthogonal to temporal blocking optimizations. Temporal blocking relies on combining multiple consecutive iterations of the time loop to reduce the memory transactions between them. The dependency along the time dimension is resolved by either: a) redundantly loading and computing cells from adjacent blocks, which limits the effectiveness of temporal blocking to low degrees of temporal blocking [24, 41, 55], or b) using tiling methods of complex geometry (e.g., trapezoidal and hexagonal tiling) along the time dimension, which restrict the parallelism due to the dependency between neighboring blocks [8, 20, 46]. In contrast, the execution scheme of PERKS does not necessitate resolving the dependency along the time dimension, since PERKS include an explicit barrier after each time step. This means the PERKS model can be generalized to any iterative solver and can be used on top of any version of the solver. In other words, iterative kernels written as PERKS do not compete with optimized versions of those iterative kernels. For instance, a stencil PERKS does not compete with kernels applying aggressive locality optimizations; the performance gain from PERKS is added to the performance gain from whatever stencil optimizations are used in the kernel. As a matter of fact, the more optimized the kernel before it is ported to the PERKS execution scheme, the higher the speedup that would be gained by PERKS. That is because optimizations to the kernel proportionally increase the relative overhead of storing and loading data in-between iterations, which PERKS aims to reduce.
The contributions in this paper are as follows:
We introduce the design principles of PERKS, provide an analysis of the different caching schemes w.r.t. how and where the domain is cached, and show how to effectively port iterative solvers to PERKS.
We implement a wide range of iterative 2D/3D stencil benchmarks and a conjugate gradient solver as PERKS in CUDA. We include an elaborate discussion of the implementation details and performance-limiting factors such as domain sizes, concurrency, and resource contention.
Using highly optimized baselines comparable to state-of-the-art 2D/3D stencil implementations on A100 and V100, our PERKS-based implementation achieves geometric mean speedups of x and x for small and large domains, respectively. PERKS-based conjugate gradient achieves a geometric mean speedup of x in comparison to the highly GPU-optimized production library Ginkgo for SpMV datasets in SuiteSparse. For smaller datasets, the speedup goes up to x. The source code of all PERKS-based implementations in this paper is available at the following anonymized link: https://tinyurl.com/2p9atawr.
II Background and Motivation
CUDA’s programming model includes: threads, the basic execution unit (32 threads are executed together as a warp); thread blocks (TBs), usually composed of hundreds of threads; and grids, usually composed of tens of thread blocks.
On-chip memory in a streaming multiprocessor (SMX) includes shared memory (scratchpad memory), the L1 cache, and the register file (RF). Off-chip memory includes global memory and the L2 cache. Data in global memory can reside for the entirety of the program, while data in on-chip memory has the lifetime of a kernel. Shared memory is shared among all threads inside a thread block.
II-A Trends of Nvidia GPUs
The trend in Nvidia GPUs is that the register file size and cache capacity, in aggregate, increase. In codes that are bound by memory bandwidth, increasing locality by effectively using the registers and cache to reduce traffic to the device memory is a widely used approach (e.g. fusing kernels [63, 23, 22, 18, 54]). The challenge, however, of improving locality is the need to resolve temporal and spatial dependencies.
|                  | P100     | V100     | A100     |
| Register File    | 14 MB    | 20 MB    | 27 MB    |
| Shared Memory222Shared memory is a configurable portion of the L1 cache that can be used as a user-managed scratchpad memory. | 3.5 MB | 7.5 MB | 17.29 MB |
| L2 Cache         | 4 MB     | 6 MB     | 40 MB    |
| Memory Bandwidth | 720 GB/s | 900 GB/s | 1555 GB/s |
II-B GPU Device-wide Synchronization
Synchronization in GPUs was historically limited to groups of threads (thread blocks in CUDA or work-groups in OpenCL). Starting from CUDA 9.0, Nvidia introduced the cooperative groups APIs, which include an API for device-wide synchronization. Before grid-level synchronization was introduced, the typical way to achieve device-wide synchronization was to launch a sequence of kernels in a single CUDA stream. Zhang et al. conducted a comprehensive study comparing the performance of both methods. The results show that the latency difference between explicit device-wide synchronization and implicit synchronization (via repetitive launching of kernels) is negligible in most kernels.
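As a minimal illustration (our own sketch, not code from the paper), a kernel can synchronize the whole grid with the cooperative groups API, provided it is launched cooperatively and the whole grid is resident on the device at once:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// A simple iterative kernel using a device-wide barrier instead of
// relaunching the kernel once per time step.
__global__ void step_kernel(float *a, float *b, int n, int steps) {
  cg::grid_group grid = cg::this_grid();
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  for (int t = 0; t < steps; ++t) {
    if (i < n) b[i] = 0.5f * (a[i] + a[(i + 1) % n]);
    grid.sync();                      // all threads in the grid reach here
    float *tmp = a; a = b; b = tmp;   // swap the per-thread pointer copies
  }
}

// Host side: cooperative kernels must be launched with
// cudaLaunchCooperativeKernel and may not over-subscribe the device.
void launch(float *d_a, float *d_b, int n, int steps) {
  int dev = 0, numSM = 0, tbPerSM = 0, threads = 256;
  cudaDeviceGetAttribute(&numSM, cudaDevAttrMultiProcessorCount, dev);
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&tbPerSM, step_kernel,
                                                threads, 0);
  dim3 grid(numSM * tbPerSM), block(threads);
  void *args[] = {&d_a, &d_b, &n, &steps};
  cudaLaunchCooperativeKernel((void *)step_kernel, grid, block, args, 0, 0);
}
```

The occupancy query guarantees the grid fits on the device, which is a hard requirement of cooperative launches.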
II-C Iterative Algorithms
In iterative algorithms, the output of time step t is the input of time step t+1. Iterative methods can be expressed as:

x^(t+1) = F(x^(t))

where F advances the solution x by one time step.
When the domain is mapped onto fine-grained GPU cores, there are two points to consider:
Spatial dependency necessitates synchronization between time steps, or else advancing the solution in the following time step might use data that has not yet been updated in the previous time step.
In time step t+1, each thread or thread block needs input from its own output in time step t (i.e., temporal dependency). This gives the opportunity for caching data between steps to reduce device memory traffic.
In the following sections, we briefly introduce iterative stencils and Krylov subspace methods. Throughout the paper, we use them as motivating examples and to report the effectiveness of our proposed methods, given their importance in High Performance Computing (HPC) scientific and industrial codes.
II-C1 Iterative Stencils
Iterative stencils are widely used in HPC. Take the 2D Jacobi 5-point stencil (2d5pt) as an example:

x^(t+1)_{i,j} = c · (x^(t)_{i,j} + x^(t)_{i-1,j} + x^(t)_{i+1,j} + x^(t)_{i,j-1} + x^(t)_{i,j+1})
Computation of each point at time step t+1 requires the values of the point itself and its four neighboring points at time step t.
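A plain (global-memory) CUDA implementation of this update might look as follows; the uniform weight 0.2f assumes a simple five-point average and is purely illustrative:

```cuda
// Naive 2d5pt Jacobi step: one thread per interior cell, all accesses
// go to global memory (no spatial blocking).
__global__ void jacobi2d5pt(const float *in, float *out, int nx, int ny) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x > 0 && x < nx - 1 && y > 0 && y < ny - 1) {
    out[y * nx + x] = 0.2f * (in[y * nx + x] +
                              in[y * nx + x - 1] + in[y * nx + x + 1] +
                              in[(y - 1) * nx + x] + in[(y + 1) * nx + x]);
  }
}
```

Each output cell reads five inputs, which is the source of the data reuse that spatial blocking exploits.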
In spatial blocking, we split the domain into sub-domains, where each thread block loads its sub-domain into shared memory to improve data reuse. At the same time, redundant data accesses are required at the boundary of the thread block to data designated for adjacent thread blocks.
In iterative stencils, each time step depends on the result of the previous time step. One could advance the solution by combining several time steps. The temporal dependency, in this case, is resolved by using a number of halo layers that matches the number of combined steps. The amount of redundantly loaded and computed data depends on the stencil radius (r) and the number of combined time steps (s); in overlapped tiling [25, 35, 42], this region grows proportionally to r·s. Methods based on this kind of blocking are called overlapped temporal blocking schemes.
II-C2 Krylov Subspace Methods
Krylov methods are popular solvers for large sparse (and dense) linear systems of equations Ax = b. Krylov subspace methods can be described as searching for the solution in the Krylov subspace

K_m(A, r) = span{r, Ar, A²r, …, A^(m-1)r}

where r = b − Ax_0 is the initial residual. Assuming that A is an invertible matrix, it is possible to compute x = A⁻¹b (i.e., solve Ax = b) by searching the Krylov subspace without directly computing A⁻¹. Searching the Krylov subspace is a sequence of matrix-vector multiplications, where at each step the approximation of the solution vector x is updated proportionally to the residual error (vector r) from the previous step.
Conjugate gradient (CG) is a main solver in the family of Krylov subspace methods. It is mainly used to solve systems of linear equations with symmetric positive-definite matrices.
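For reference, the textbook CG iteration can be sketched as follows (host-side CUDA C++; `Csr`, `spmv`, `dot`, and `axpy` are assumed helper types/routines of ours, e.g., thin wrappers over device kernels, not code from any specific library):

```cuda
#include <vector>

// Textbook conjugate gradient for symmetric positive-definite A.
void cg_solve(const Csr &A, double *x, const double *b, int n, int iters) {
  std::vector<double> r(n), p(n), Ap(n);
  spmv(A, x, Ap.data());                         // Ap = A x (initial)
  for (int i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
  p = r;                                         // first search direction
  double rs = dot(r.data(), r.data(), n);        // rs = r · r
  for (int k = 0; k < iters; ++k) {
    spmv(A, p.data(), Ap.data());                // Ap = A p (the SpMV hotspot)
    double alpha = rs / dot(p.data(), Ap.data(), n);
    axpy( alpha, p.data(),  x,        n);        // x += alpha * p
    axpy(-alpha, Ap.data(), r.data(), n);        // r -= alpha * Ap
    double rs_new = dot(r.data(), r.data(), n);
    for (int i = 0; i < n; ++i)                  // p = r + beta * p
      p[i] = r[i] + (rs_new / rs) * p[i];
    rs = rs_new;
  }
}
```

Note how the vector r is read and updated every iteration, while the matrix A is only read; this access asymmetry matters for the caching policy discussed later.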
II-D Motivational Example
We use a double precision 2D 9-point Jacobi stencil as a motivational example for implementing iterative solvers as PERKS. Why PERKS: latency across all operations/instructions in newer generations of GPUs has been dropping significantly. As a result, fewer warps are often enough for the CUDA runtime to hide the latency effectively and hence maintain high performance at low occupancy. In Fig 1, we vary the number of thread blocks per streaming multiprocessor (TB/SMX) and plot the performance (left Y-axis). For each TB/SMX configuration, we plot on the right Y-axis the unused resources (shared memory and registers). As the figure shows, when TB/SMX decreases, the performance (in GCells/s333GCells/s denotes giga-cells updated per second.) drops while the freed shared memory and registers gradually increase. Even at its peak performance, a sizable volume of shared memory and register files is not in use. By reducing TB/SMX to its minimum while maintaining enough concurrency to sustain the performance level, the projection of the performance gain from caching a subset of the results in the unused resources indicates an improvement of more than x. The prospect of PERKS: as Figure 2 shows, the amount of time required for moving data from/to device memory in-between time steps of a stencil kernel remains constant, while the compute part shrinks the more optimized the stencil implementation is. The prospect of PERKS is to reduce/eliminate this data movement time, which dominates the runtime in highly optimized stencil implementations. Finally, while temporal blocking does also reduce data movement to some extent, resolving the temporal and spatial dependencies adds compute overhead and can also lead to increased register pressure that limits the degree of temporal blocking on GPUs.
III PERKS: Persistent Kernels to Improve Data Locality
PERsistent Kernels (PERKS) is a generic scheme for running iterative solvers on GPUs to improve data locality by taking advantage of the large capacity of register files and shared memory. As Figure 3 illustrates, in PERKS, we move the time stepping loop from the host to the device, and use CUDA’s grid synchronization API as a device-wide barrier at each time step. We then use the register files and shared memory to reduce traffic to the device memory by caching the domain (or a subset from it) in-between time steps.
III-A Assumptions and Limitations
The techniques discussed in this paper are based on the following assumptions about the applications.
Target Applications: Compute-bound iterative kernels could benefit from becoming PERKS if the kernel generates memory traffic in-between iterations that the CUDA runtime cannot effectively overlap with computation. In this paper, however, we limit our experiments to GPU iterative kernels that are bound by memory bandwidth. While execution in a PERKS fashion makes no assumptions about the underlying implementation, optimal PERKS performance can sometimes require minor adaptations to the kernel. The changes can be as simple as changing the thread block and grid sizes to reduce over-subscription, or more elaborate, as in adopting a specific SpMV method in the case of the conjugate gradient (more details in Section V-C).
Impact on Optimized Kernels: It is very important to note that PERKS is orthogonal to the optimization level applied to the compute part of the kernel. As a matter of fact, the more optimized the baseline kernel, the more performance improvement we expect from PERKS. That is because optimizations reduce the time per iteration, i.e., a single kernel invocation, while the time to store/load to global memory in-between iterations remains the same. To summarize, PERKS reduces what amounts to the serial portion of the solver in Amdahl's law terms; hence the more optimized a code, i.e., the faster its parallel portion, the higher the speedup attributed to PERKS.
PERKS in Distributed Computing: In distributed applications that require halo regions (e.g., stencils), PERKS can be used on top of communication/computation overlapping schemes [17, 52]. In overlapping schemes, the boundary points that are computed in a separate kernel would not be cached, while the kernel of the interior points would run as a PERKS and cache the data of the interior points. PERKS could also be used with communication-avoiding algorithms (e.g., communication-avoiding Krylov methods).
Use of Registers: PERKS uses registers and shared memory for caching data in-between time steps. It should be noted that there is no guarantee that the compiler releases all registers after the compute portion of each iteration finishes (with Nvidia's nvcc compiler we did not observe such inefficiency). If it existed, imperfect register reuse by the compiler would result in fewer registers being available for caching, in the worst case leaving only shared memory to be used as cache.
Iterative Solvers as PERKS: While this paper's focus is to demonstrate the PERKS model for iterative stencils and Krylov subspace methods (conjugate gradient), the discussion in this section (and the paper in general) applies to a high degree to other types of iterative solvers. That is because PERKS is not much concerned with the implementation of the solver; it only loads/stores the domain (or a subset of it) before/after the solver part in the kernel, under resource constraints. Iterative solvers that use the flow expressed in Figure 3 can in principle be ported to PERKS with relative ease. Generally speaking, the porting process is as follows: move the time stepping loop from the host into the kernel, add grid synchronization to enforce the dependency, and store/load a portion of the input or output to cache: either shared memory and/or registers (using register arrays).
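Schematically, the porting steps above yield a kernel of the following shape (a sketch of ours; `solver_step`, `load_to_cache`, `store_from_cache`, and `CACHE_PER_THREAD` are placeholders, not the paper's API):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

#define CACHE_PER_THREAD 32   // assumed register-cache size per thread

// Generic PERKS skeleton: the time loop lives in the kernel, a grid-wide
// barrier replaces the implicit barrier of repeated kernel launches, and
// a portion of the domain stays cached across time steps.
__global__ void perks_kernel(float *in, float *out, int n, int steps) {
  cg::grid_group grid = cg::this_grid();
  extern __shared__ float sm_cache[];     // shared-memory cache
  float reg_cache[CACHE_PER_THREAD];      // register cache (register array)

  // Prologue: load this thread's cached portion from global memory once.
  load_to_cache(in, sm_cache, reg_cache);

  for (int t = 0; t < steps; ++t) {
    // Compute from the cache; only the uncached portion and the
    // block-boundary data touch global memory.
    solver_step(sm_cache, reg_cache, in, out, n);
    grid.sync();                  // device-wide barrier between time steps
    float *tmp = in; in = out; out = tmp;   // swap uncached buffers
  }

  // Epilogue: write the cached portion back to global memory once.
  store_from_cache(out, sm_cache, reg_cache);
}
```

The prologue/epilogue pair is what converts per-step global memory traffic into one-time traffic for the cached portion.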
III-B Caching Policy
A caching policy is required to determine which portion of the domain (or data) to cache. When the entire domain (or data) can fit in register files and shared memory used for caching, the entire algorithm can run from the cache (this is particularly useful in the cases of strong scaling where per node domain size becomes smaller as the number of nodes grows). When only a fraction of the domain can be cached, a policy is required to select the data to prioritize for caching. In the following sections, we elaborate on the caching policy.
III-B1 Considerations for Dependency Between Thread Blocks
Algorithms that do not require dependency between the thread blocks can use the cache space most efficiently because all the load and store transactions to global memory can be eliminated. The dependencies within the thread block are resolved by using either the shared memory or shuffle operations.
In iterative solvers, there is often a neighbor dependency (e.g., a stencil kernel where a cell update relies on values computed in neighboring threads). In such a case, caching the results of the threads at the interior of the CUDA thread blocks eliminates their stores and loads from global memory. The threads at the boundary of thread blocks would, however, continue to store to and load from global memory (since the shared memory scope is the thread block). When the capacity of the cache is large enough w.r.t. the domain size to be cached, this performance drawback is negligible.
III-B2 Identifying Which Data Gets Priority to Be Cached
In many cases, the capacity of register files and shared memory is limited, i.e., it is impossible to cache the entire domain/input. In cases where all the domain/input array elements are accessed at the same frequency, one could assume it is not necessary to use a cache policy that prioritizes specific parts of the domain/input. However, this is not always true.
Take iterative stencils as an example. The data managed by the threads at the boundary of the thread block is stored to main memory to be accessed by the neighboring thread blocks in the following iterations; caching those boundary elements saves one load operation. On the other hand, data at the interior of the thread block is not involved in inter-thread-block dependency; caching it saves one load and one store operation. Finally, data in the halo region is updated at each time step; there is no benefit in caching the layers in the halo region. To conclude, the caching priority yielding the highest reuse would be: interior > boundary > halo, i.e., the priority is to cache the data of the interior threads of the thread block, followed by the data of the threads at the boundary of the thread block, with no caching for the halo region.
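The priority rule can be expressed as a store-path sketch (`is_interior` and `is_boundary` are hypothetical predicates over the thread block's cells, not functions from the paper):

```cuda
// Hedged sketch of the caching priority for stencil results.
__device__ void store_result(float val, float *cache, float *gmem, int idx) {
  if (is_interior(idx)) {
    cache[idx] = val;        // interior: cache only (saves a load and a store)
  } else if (is_boundary(idx)) {
    cache[idx] = val;        // boundary: cached for local reuse...
    gmem[idx]  = val;        // ...but must also reach neighboring blocks
  } else {
    gmem[idx] = val;         // uncached remainder goes through global memory
  }
}
// Halo layers are re-read from global memory each step and are never cached.
```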
For other iterative solvers, such as conjugate gradient, there are different data arrays that could potentially be cached, unlike the single domain array in stencils. The cacheable variables and their usage per array element in the conjugate gradient solver are as follows: a) one load and no stores for the matrix A, and b) three loads and one store for the residual vector r. So, assuming that each operation accesses data in a coalesced access pattern, it is more effective to cache vector r. As a result, the ideal cache priority is r > A.
To summarize, while PERKS does not touch on the compute part of the original kernel, attention should be given to identify the ideal caching policy for each solver implemented as PERKS. That being said, one could assume this step can be automated by using a dedicated profile-guided utility (or even sampling from the profiler directly) to aid the user in swiftly identifying an ideal caching policy, based on the access patterns and frequency of access of data arrays in the solver.
IV Performance Analysis
In this section, we propose a performance model that serves the following purposes. First, we propose a projection of achievable performance that we compare with measured results to detect abnormal behavior or implementation shortcomings. We relied on this projection in the analysis of our PERKS implementation quality (Section IV-B). Second, we identify the bounds on reducing concurrency before dropping performance and use concurrency to explain potential optimizations for further performance improvement (Section IV-D).
This performance model relies on three performance attributes: a) the measured performance of our PERKS implementation (P_measured), b) the projected peak performance achievable on a given GPU (P_peak), and c) the efficiency function E describing the efficiency of the given kernel running on the device. More specifically, E is a function of the concurrency exposed by the software (c_code) and the concurrency required by the hardware (c_hw). The relation of measured performance to projected peak performance becomes:

P_measured = E(c_code, c_hw) × P_peak
We discuss projected peak performance in the following section. A detailed discussion of the efficiency and concurrency functions is in Section IV-C.
IV-B Projecting Peak Achievable Performance
We rely on the FOM (Figure Of Merit) as the performance metric in this analysis. For stencils, we use giga-cells updated per second (GCells/s) [39, 10]. Given the memory-bound nature of the conjugate gradient solver, we directly use sustained memory bandwidth as its metric, following other works on conjugate gradient. Due to space limitations, this section mainly focuses on stencils to explain the performance analysis. Without loss of generality, the analysis is applicable to other cases (e.g., conjugate gradient) by adjusting the performance metric and code concurrency accordingly.
We use a simple performance model inspired by the roofline model [51, 33]. The model’s utility is to project the upper bound on performance based on the reduction of global memory traffic. This model, in turn, helps us in this paper to identify performance gaps in our PERKS implementation and later inspect the reasons for those gaps (e.g., register spilling).
In a kernel implemented as PERKS, the bottleneck could be either the global memory bandwidth or the shared memory bandwidth (if the PERKS caching scheme moves the bottleneck to the shared memory). We do not assume the registers to be a bottleneck: as long as we ensure that no register spilling occurs, we avoid register pressure.
We initially do not account for the memory accesses to global memory originating from the boundary threads in each thread block, e.g., the halo region in stencils, and confine our analysis to the dependencies in the cached region. We then expand the analysis to cover those unavoidable global memory accesses (Equation 9).
We assume a total domain of size D bytes, the cached portion to be D_cache bytes, and the uncached portion to be D_uncache = D − D_cache bytes. The cached portion of the domain is divided between registers and shared memory (since we cache in both): D_cache = D_reg + D_sm. For T time steps, assuming the number of bytes stored to global memory in each time step is B_st and the number of bytes loaded is B_ld, the total global memory bytes accessed becomes:

B_gm = T × (B_st + B_ld)
In the case when the kernel is bound by global memory bandwidth, i.e., the volume of cached data does not move the bottleneck from global memory to shared memory, for a global memory bandwidth of BW_gm the time for accessing global memory becomes:

T_gm = B_gm / BW_gm
In the case when the kernel is bound by shared memory bandwidth, i.e., the volume of data cached in shared memory moves the bottleneck to the shared memory bandwidth, the total shared memory bytes accessed (the loads and stores of the cached portion D_sm over the T time steps, on top of the kernel's original shared memory traffic) becomes:

B_sm = T × (B_sm,orig + 2 × D_sm)
Assuming B_sm,orig to be the shared memory traffic originally generated by the kernel per time step, e.g., the shared memory used in the baseline implementation of a stencil kernel to improve locality, and BW_sm to be the shared memory bandwidth, the time for accessing the shared memory becomes:

T_sm = B_sm / BW_sm
Next, we consider the unavoidable global memory accesses necessary to resolve the neighborhood dependencies in the cached region, e.g., the halo region of stencils. Assuming the global memory accesses, in bytes, for the halo region of the data computed by the boundary threads of thread blocks in the cached portion of the domain to be B_halo per time step, the time for accessing the halo region would be:

T_halo = T × B_halo / BW_gm
The projected best-case total time required for the PERKS kernel may be written as:

T_perks = T_mem + T_halo

where T_mem is T_gm when the kernel remains bound by global memory bandwidth, and T_sm when caching moves the bottleneck to shared memory.
Accordingly, the projected peak performance (P_peak in Equation 4) for the T time steps can be expressed as:

P_peak = (T × cells updated per time step) / T_perks
To further illustrate how this performance analysis works, we give two examples of computing T time steps of a single precision 2D 5-point Jacobi stencil on A100.
As an example of a large domain size, plugging the cacheable region into the equations above yields T_gm (in µs); adding the bytes for the halo accesses yields T_halo, and thus P_peak in GCells/s. The measured performance reaches a large fraction of P_peak (in the following sections we elaborate on the analysis of concurrency and its impact on the gap between measured and projected peak).
As an example of a small domain size, i.e., when the whole domain can be cached, our baseline implementation already uses some shared memory, and the remainder holds the shared memory cached portion D_sm. We get T_perks (in ms) and the corresponding P_peak in GCells/s; the measured performance is 61.77% of P_peak.
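With the symbol choices above, the projection reduces to a few lines of host-side arithmetic (a sketch under the stated assumptions, with our own function and parameter names, not the paper's exact formulation):

```cuda
// T_perks = T_mem + T_halo, where T_mem is the global- or shared-memory
// term, whichever is the bottleneck. Inputs in bytes and bytes/second.
double projected_time(double b_gm_per_step,    // uncached global loads+stores per step
                      double b_halo_per_step,  // halo traffic per step
                      double b_sm_per_step,    // shared-memory traffic per step
                      double bw_gm, double bw_sm,
                      int steps, bool shared_memory_bound) {
  double t_gm   = steps * b_gm_per_step   / bw_gm;
  double t_sm   = steps * b_sm_per_step   / bw_sm;
  double t_halo = steps * b_halo_per_step / bw_gm;
  return (shared_memory_bound ? t_sm : t_gm) + t_halo;
}

// Projected peak FOM for a stencil, in GCells/s.
double projected_gcells(double cells_per_step, int steps, double t_perks) {
  return cells_per_step * steps / t_perks / 1e9;
}
```

Comparing this projection against measured GCells/s is how we flag implementation shortcomings such as register spilling.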
IV-C Concurrency and Micro-benchmarks
Reducing device occupancy increases the availability of resources to be used for caching in PERKS (as illustrated earlier in Figure 1). On the other hand, reducing occupancy can lead to lower device utilization. To effectively implement PERKS, one has to reduce the occupancy as much as possible without sacrificing performance. Inspired by the findings of Volkov, we assume that the efficiency function reaches its peak when the code exposes enough concurrency to saturate the device (irrespective of the occupancy):

E(c_code, c_hw) = 1 when c_code ≥ c_hw
where c_code is the minimum number of concurrently executable instructions of the operation exposed by the launched kernel, and c_hw is the maximum number of instructions of the operation that the device is capable of handling concurrently. Because this paper mainly focuses on memory-bound applications, the operations referred to in this paper are limited to data access operations, i.e., global memory loads/stores, shared memory loads/stores, and L2 cache loads/stores.
c_SMX, the kernel concurrency at the streaming multiprocessor (SMX) level, can be computed from the concurrency exposed by the threads of a thread block (c_TB) and the number of concurrently running thread blocks per streaming multiprocessor (N_TB): c_SMX = c_TB × N_TB.
IV-D Concurrency Analysis
In this section, we briefly describe how we analyze the concurrency to reduce the occupancy of the original kernel in order to release resources for caching while sustaining performance. We conduct a static analysis to extract the data movement operations in the kernel. Note that we account for any barriers in the original kernel that could impact the concurrency of operations, i.e., we do not combine operations/instructions from before and after a barrier when counting them. Finally, we apply a simple model to identify the lowest occupancy we can drop to before the concurrency starts to drop. As an example, we conducted a static concurrency analysis of a single precision 2D 5-point Jacobi stencil kernel. It is worth mentioning that this baseline kernel is fairly complex, since it is highly optimized to use shared memory to reduce traffic to global memory to its minimum. The results, summarized in Table II, show that for this kernel we could reduce the original occupancy while maintaining performance.
To understand the gap between the projected upper bound on achievable performance and the measured performance, we inspect the efficiency function E. The number of concurrent global memory accesses and shared memory accesses in the 2D 5-point Jacobi stencil kernel is enough to saturate A100 at TB/SMX=1. Accordingly, we get E ≈ 1, which would indicate that the observed gap in performance is not due to a drop in concurrency we did not model. While this confirms the effectiveness of the concurrency analysis (i.e., the concurrency analysis resonates with the empirical measurements in Table II), it does not explain the gap between measured and projected peak. Further investigative profiling revealed that on A100 specifically, it is the concurrency for accesses to the L2 cache, not global memory, that is impacted by reducing occupancy to a level that affects performance notably. More particularly, accesses to global memory for the halo region garner a high L2 cache hit rate, which effectively means that higher concurrency is necessary to saturate the L2 cache when hit rates are high. To confirm, we manually doubled the concurrency: the performance increased to 123.94 GCells/s with TB/SMX=1.
| Used Reg. | Unused Reg. | GM Load | GM Store | Measured |
IV-E Register Pressure in PERKS
One concern with PERKS is that kernels might run into register pressure if the compiler does not optimally reuse registers across time steps, potentially affecting concurrency and penalizing performance. To illustrate this issue, a high register-pressure 2D 25-point double precision Jacobi stencil uses a certain number of registers per thread, yet its PERKS version uses more444We gathered the number of registers used by finding the maximum number of registers available as cache before spilling, using the "__launch_bounds__" qualifier. Register spilling can be detected with the '-Xptxas "-v -dlcm=cg"' flags.. Similar behavior is also observed in other stencil benchmarks. Reducing the occupancy while maintaining the concurrency, as mentioned in the previous section, reduces the impact of this compiler inefficiency in register reuse in all the benchmarks we report in the results section. In the above example, at worst, a few registers among the maximum available registers per thread could not be used for caching data; this neither harms concurrency nor triggers register spilling.
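Capping occupancy while reserving registers for the cache can be sketched with `__launch_bounds__` as follows (the numeric values are tuning parameters of ours, not fixed recommendations):

```cuda
#define MAX_THREADS_PER_TB 256   // assumed thread block size
#define MIN_TB_PER_SMX 1         // lowest occupancy that sustains concurrency
#define CACHE_PER_THREAD 32      // assumed register-cache size per thread

// __launch_bounds__ tells the compiler the launch configuration, so it can
// budget registers; the remaining registers hold the per-thread cache array.
__global__ void __launch_bounds__(MAX_THREADS_PER_TB, MIN_TB_PER_SMX)
perks_stencil(const double *in, double *out, int n) {
  // Indexed with compile-time-resolvable indices so it stays in registers
  // rather than spilling to local memory.
  double reg_cache[CACHE_PER_THREAD];
  // ... stencil compute and caching logic omitted ...
}
```

Compiling with `nvcc -Xptxas -v` reports registers per thread and any spill loads/stores, which is how we sized the cache array before spilling occurs.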
This section elaborates on how we implemented memory-bound iterative methods (namely 2D/3D stencils and a conjugate gradient solver) as PERKS.
V-A Restrictions of Synchronization APIs
PERKS relies heavily on the cooperative groups APIs (supported since CUDA 9.0 ). Currently, the API does not allow over-subscription, i.e., one needs to explicitly assign the workload to blocks and threads to expose enough parallelism to the device. It is worth mentioning, however, that this does not limit flexibility: different kernels can still run concurrently on a single GPU, as long as they do not, as a whole, exceed the hardware limits.
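A cooperative launch that respects this no-over-subscription restriction can be sketched as below; `perks_kernel` and its argument list are placeholders, and the grid is sized from the occupancy query so all blocks can be co-resident:

```cuda
#include <cooperative_groups.h>

__global__ void perks_kernel(double *in, double *out, int steps);

void launch_perks(void **args, cudaStream_t stream) {
  int dev = 0, num_sms = 0, blocks_per_sm = 0;
  cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, dev);
  // Cooperative kernels may not over-subscribe: every block of the grid
  // must be resident simultaneously for grid.sync() to be valid.
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(
      &blocks_per_sm, perks_kernel, /*blockSize=*/256, /*dynSmemBytes=*/0);
  dim3 grid(num_sms * blocks_per_sm), block(256);
  cudaLaunchCooperativeKernel((void *)perks_kernel, grid, block,
                              args, /*sharedMem=*/0, stream);
}
```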
V-B Transforming Stencil Kernels to PERKS
Our 3D stencil implementation (using the same optimizations as SM-OPT in Fig. 2) follows the standard shared memory scheme, where 2D planes (1D planes in 2D stencils) are loaded one after the other into shared memory, and each thread computes the cells in the vertical direction [60, 45]. In our PERKS implementation, planes that already have their data cached from the previous time step are not loaded from global memory before the compute starts. We do not interfere with the compute; only after the compute finishes do we store the results in the registers/shared memory. After adjusting the handling of the input and output of the computational part of the kernel, transforming an existing kernel to PERKS is straightforward. To ensure coalesced memory accesses in the halo region, we transpose the vertical edges of the halo region in global memory. Finally, if the original kernel uses shared memory [60, 45] or registers  to optimize stencils, we use the version of the output residing in shared memory or registers at the end of each time step as an already-cached output. This way, we avoid an unnecessary copy to the shared memory and registers we use as cache.
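The overall transformation can be sketched as a kernel skeleton; `load_tile`, `compute_tile`, `exchange_halo`, and `store_tile` are hypothetical helpers standing in for the shared-memory stencil pipeline:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Skeleton only: the time loop moves inside the kernel, the cache
// survives across time steps, and a device-wide barrier replaces the
// per-step kernel relaunch.
__global__ void perks_jacobi2d(double *in, double *out, int nx, int ny,
                               int timesteps) {
  cg::grid_group grid = cg::this_grid();
  extern __shared__ double cache[];      // persists across time steps
  bool cached = false;
  for (int t = 0; t < timesteps; ++t) {
    if (!cached)                          // first step: read from global
      load_tile(cache, in, nx, ny);
    compute_tile(cache, nx, ny);          // output stays in the cache
    cached = true;                        // later steps reuse the cache
    exchange_halo(cache, out, nx, ny);    // only halos touch global memory
    grid.sync();                          // device-wide barrier per step
  }
  store_tile(out, cache, nx, ny);         // final result to global memory
}
```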
V-C Transforming the Conjugate Gradient Solver to PERKS
For its simplicity and accessibility, we use the Conjugate Gradient (CG) solver implementation that is part of the CUDA SDK samples (conjugateGradientMultiBlockCG ). Since the SpMV implementation in the CG sample is relatively naive, we replace it with the highly optimized merge-based SpMV  that is part of the CUB C++ library in the CUDA Toolkit , as it fits naturally with the caching scheme in PERKS. We do not discuss the details of merge-based SpMV due to the space limit; we refer the reader to the original merge-based SpMV paper for details.
We made several slight modifications to the merge-based SpMV implementation. First, merge-based SpMV is composed of two steps: a search step and a compute step. The search step is done twice: we first find the workload of each thread block, and then find the workload of each thread inside a thread block. We save the search result for the thread-block workloads in global memory, since the matrix is static throughout the iterations. The second (thread-level) search is conducted in shared memory. These two steps repeatedly generate intermediate data, so two additional caching policies arise: caching the TB-level search result and caching the thread-level search result. We implemented both policies and combined them with the two direct caching policies when conducting the conjugate gradient solver evaluations.
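The reuse of the TB-level search can be sketched as follows; `merge_path_search` and all names are hypothetical, and the search and merge logic themselves are elided:

```cuda
// Sketch of caching the two-level merge-path search. The TB-level
// partition depends only on the static matrix, so it is computed once,
// stored to global memory, and reused in every subsequent CG iteration;
// the thread-level search is redone each iteration in shared memory.
__global__ void spmv_perks(const int *row_offsets, const int *cols,
                           const double *vals, const double *x, double *y,
                           int nnz, int num_rows,
                           int2 *tb_partitions,  // cached TB-level results
                           bool first_iteration) {
  __shared__ int2 thread_partitions[256];  // per-iteration scratch
  int2 tb_part;
  if (first_iteration) {
    tb_part = merge_path_search(row_offsets, num_rows, nnz, blockIdx.x);
    if (threadIdx.x == 0) tb_partitions[blockIdx.x] = tb_part;
  } else {
    tb_part = tb_partitions[blockIdx.x];   // skip the expensive search
  }
  // ... thread-level search into thread_partitions, then merge/compute ...
}
```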
Second, merge-based SpMV originally uses small thread blocks, i.e., threads per TB. This results in a large number of concurrently running thread blocks per streaming multiprocessor. We increase the TB size to and slightly change the memory access order to accommodate the larger TB size.
V-D New Features in Nvidia Ampere
The Nvidia Ampere generation of GPUs introduced two new features that have the potential to improve the performance of PERKS, namely asynchronous copy to shared memory and L2 cache residency control . When testing asynchronous copy to the cache in PERKS, we did not observe a noticeable performance difference. For L2 cache residency control, we experimented with setting the input and the halo region to be persistent; we observed a slowdown and no change in performance, respectively. Accordingly, we do not use these new features in our PERKS implementations.
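For reference, the residency experiment used the stream access-policy-window API along these lines; `halo_buffer`, `halo_bytes`, and `stream` are placeholders:

```cuda
// Sketch of the L2 residency experiment (Ampere, CUDA 11+). A portion of
// L2 is set aside for persisting accesses, then loads/stores falling in
// the window are marked persisting.
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, halo_bytes);

cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr  = halo_buffer;
attr.accessPolicyWindow.num_bytes = halo_bytes;
attr.accessPolicyWindow.hitRatio  = 1.0f;  // try to keep the whole window
attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
```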
(Table III: stencil benchmarks, with stencil order and FLOPs per cell.)
VI-A Hardware and Software Setup
The experimental results presented here were gathered on the two latest generations of Nvidia GPUs: Volta V100 and Ampere A100, with CUDA 11.5 and driver version 495.29.05.
VI-B Benchmarks and Datasets
VI-B1 Stencil Benchmarks
To evaluate the performance of PERKS-based stencils, we conducted a wide set of experiments on various 2D/3D stencil benchmarks (listed in Table III). The baseline implementation uses state-of-the-art optimizations such as shared memory and heavy unrolling (to counter the reduction in over-subscription). We report the performance (GCells/s) of the baseline implementation for all benchmarks in Figures 4 and 5. The baseline performance is on par with (and often exceeds) that of state-of-the-art GPU-optimized stencil codes reporting the highest performance across different stencil benchmarks, namely SSAM , register-optimized stencils [57, 58], StencilGen , and the temporal blocking framework AN5D .
We use the test data provided by StencilGen . We tested three PERKS implementations: PERKS (sm), which only uses shared memory to cache data; PERKS (reg), which only uses registers to cache data; and PERKS (mix), which uses both shared memory and registers. Due to space limitations, we report only the peak performance among the three PERKS variants.
VI-B2 Conjugate Gradient Datasets
The conjugate gradient solver datasets come from the SuiteSparse Matrix Collection . We selected symmetric positive definite matrices that can converge in a CG solver. The details of the selected datasets are listed in Table V.
We compare the performance of the PERKS CG solver to Ginkgo , a widely used library heavily optimized for GPUs (including A100). We run 10,000 time steps in our performance evaluation (similar to Ginkgo's basic setting ). We report the speedup per time step and the sustained bandwidth measured for Ginkgo.
For PERKS, we ran different variants that implement: a) the two caching policies discussed in Section III-B (i.e., caching the vectors or the matrix), and b) the additional policies of caching the TB-level and thread-level search results mentioned in Section V-C. We report only the top-performing variant for each dataset due to space limitations.
For all iterative stencil and conjugate gradient experiments, we run each evaluation five times and report the run with the highest performance.
VI-C Sizes of Domains and Problems
PERKS intuitively favors small domain/problem sizes. However, for a fair evaluation of PERKS, we cannot choose arbitrarily small domain sizes; we need domain/input sizes that fully utilize the compute capability of the device. We conducted an elaborate set of experiments for every individual stencil benchmark to identify the minimum domain size that fully utilizes the device. Note that domain/problem sizes beyond that point are effectively serialized by the device, since they exceed the peak concurrency the device can sustain. Table IV summarizes the domain sizes for the stencil benchmarks that provide a basis for a fair comparison.
For the conjugate gradient experiments, we include datasets from SuiteSparse that cover a wide range of problem sizes: from small strong-scaling dataset sizes that fit in the L2 cache up to the large dataset sizes typically reported by libraries for a single GPU of the generations we use (Ginkgo [4, 6] and MAGMA ).
VI-D Iterative 2D/3D Stencils
Figure 4 shows the PERKS speedups for the large domain sizes listed in Table IV. The geometric mean speedup for 2D stencils is x on A100 and x on V100; for 3D stencils, it is x on A100 and x on V100.
It is important to note three points: a) the benchmarks we use include both low-order and high-order stencils, b) the speedups are particularly high for low-order stencils, which are more commonly used in practice (ex: an average 26% reduction in runtime for 3D stencils of order up to 2 on V100 and A100, in double precision), and c) the speedups we report are not limited to the (very optimized) implementation we use as a baseline; other stencil implementations, regardless of their internals, can also benefit from being transformed to PERKS.
Since small domains occur in strong scaling, we also report, in Figure 5, results for small domains that can be fully cached. PERKS for small-domain 2D stencils is x faster on A100 and x faster on V100; for small-domain 3D stencils, it is x faster on A100 and x faster on V100.
VI-E Conjugate Gradient With Merge-based SpMV
Figure 6 compares PERKS to Ginkgo. When the input is smaller than the L2 capacity, PERKS running on A100 achieves geometric mean speedups of x and x for single and double precision, respectively; on V100, PERKS achieves x and x speedups. When the input matrix exceeds the L2 cache capacity, PERKS running on A100 achieves a geometric mean speedup of x for single precision and x for double precision; on V100, it achieves geometric mean speedups of x (single) and x (double). It is important to remember that the Ginkgo library we use as a baseline is among the top-performing CG solver libraries, with an emphasis on GPU optimizations .
Note that regardless of whether we stay within the L2 cache capacity or exceed it, we still cache the domain using one of the caching policies described earlier. Yet, upon analyzing the results, we observed a clear drop in speedup once we exceed the L2 capacity.
VI-F Performance Analysis
The roofline-like model we propose suffers from the same limitations as the roofline model, e.g., it implicitly assumes a perfect overlap of compute and memory accesses. Additionally, inaccuracies can arise from the L2 concurrency issues we discussed in Section IV-C. The projected peak performance is effective for identifying unanticipated performance gaps and for guiding our occupancy reduction. That being said, it is important to note that the projected peak performance is taken as an indicator of the bound on performance improvement, not as an accurate performance prediction. In the large-domain stencils, we observed () of the projected peak; in the small-domain stencils, we observed () of the projected peak.
VII Related Work
The concept of persistent threads and persistent kernels dates back to the introduction of CUDA. The main motivation for persistence at the time was load imbalance issues with the runtime warp scheduler [2, 9]. Later research focused on using persistent kernels to overcome the kernel invocation overhead (which was high at the time). GPUrdma  and GPU-Ether  expanded on the concept of persistent kernels to reduce the latency of network communication.
As on-chip memory sizes increased, researchers began to capitalize on data reuse in persistent kernels. GPUrdma  proposed a matrix-vector product persistent kernel holding a constant matrix in shared memory. Khorasani et al.  use persistent threads to keep parameters in cache. Zhu et al.  proposed a sparse persistent implementation of recurrent neural networks.
It is worth mentioning that Kshitij et al.  were, a decade ago, the first to summarize the "persistent thread" programming style. Yet the four scenarios the authors listed as targets for persistent threads are mostly out of date. The only scenario that still gives an advantage in the current GPU programming environment is bypassing the hardware scheduler for manual load balancing, as a recent study shows .
To our knowledge, this work is the first to propose a methodological and generic blueprint for accelerating memory-bound iterative GPU applications using persistent kernels.
VIII Conclusion
We propose a persistent kernel execution scheme for iterative GPU applications. We enhance performance by moving the time loop into the kernel and caching the intermediate output of each time step. We show notable performance improvements for iterative 2D/3D stencils and a conjugate gradient solver over highly optimized baselines, on both V100 and A100. We further report notably high speedups for small domain/problem sizes, which is beneficial in strong-scaling cases.
-  (2017) Efficient implementation of jacobi iterative method for large sparse linear systems on graphic processing units. The Journal of Supercomputing 73 (8), pp. 3411–3432. Cited by: §I.
-  (2009) Understanding the efficiency of ray traversal on gpus. In Proceedings of the conference on high performance graphics 2009, pp. 145–149. Cited by: §VII.
-  (2015) Systematic fusion of cuda kernels for iterative sparse linear system solvers. In European Conference on Parallel Processing, pp. 675–686. Cited by: §I.
-  (2020) Ginkgo: a modern linear operator algebra framework for high performance computing. External Links: Cited by: 3rd item, §I, §IV-B, Fig. 6, §VI-B2, §VI-C, §VI-E.
-  (2017) Preconditioned krylov solvers on gpus. Parallel Computing 68, pp. 32–44. Cited by: §II-C2.
-  (2020) Evaluating the performance of nvidia’s a100 ampere gpu for sparse and batched computations. pp. 26–38. External Links: Cited by: §VI-C.
-  (2020-11) High-order finite element method using standard and device-level batch gemm on gpus. Los Alamitos, CA, USA, pp. 53–60. External Links: Cited by: §VI-C.
-  (2017) Diamond tiling: tiling techniques to maximize parallelism for stencil computations. IEEE Transactions on Parallel and Distributed Systems 28 (5), pp. 1285–1298. External Links: Cited by: §I.
-  (2010) Dynamic load balancing on single-and multi-gpu systems. In 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pp. 1–12. Cited by: §VII.
-  (2019) A versatile software systolic execution model for gpu memory-bound kernels. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–81. Cited by: §I, §I, Fig. 2, §IV-B, §V-B, §VI-B1.
-  (2009) Parallel dense gauss-seidel algorithm on many-core processors. In 2009 11th IEEE International Conference on High Performance Computing and Communications, pp. 139–147. Cited by: §I.
-  (2012) Sparse systems solving on gpus with gmres. The journal of Supercomputing 59 (3), pp. 1504–1516. Cited by: §I.
NVIDIA a100 tensor core gpu architecture. Nvidia. Note: [Online; accessed 20-July-2021] External Links: Cited by: §IV-C2.
-  (2020) CUDA toolkit documentation. NVIDIA Developer Zone. http://docs.nvidia.com/cuda/index.html. Cited by: §V-C.
-  (2016) GPUrdma: gpu-side library for high performance networking from gpu kernels. In Proceedings of the 6th international Workshop on Runtime and Operating Systems for Supercomputers, pp. 1–8. Cited by: §VII, §VII.
-  (2011) The university of florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 38 (1), pp. 1–25. Cited by: §VI-B2, TABLE V.
-  (2015) Realizing extremely large-scale stencil applications on gpu supercomputers. pp. 625–632. External Links: Cited by: §III-A.
-  (2015-10) Optimizing cuda code by kernel fusion: application on blas. J. Supercomput. 71 (10), pp. 3934–3957. External Links: Cited by: §II-A.
-  (2016) Vivace: a practical gauss-seidel method for stable soft body dynamics. ACM Transactions on Graphics (TOG) 35 (6), pp. 1–9. Cited by: §I.
-  (2013) Split tiling for gpus: automatic parallelization using trapezoidal tiles. New York, NY, USA, pp. 24–31. External Links: Cited by: §I.
-  (2012) A study of persistent threads style gpu programming for gpgpu workloads. IEEE. Cited by: §VII.
-  (2015) MODESTO: data-centric analytic optimization of complex stencil programs on heterogeneous architectures. pp. 177–186. External Links: Cited by: §II-A.
-  (2019) Absinthe: learning an analytical performance model to fuse and tile stencil codes in one shot. pp. 370–382. External Links: Cited by: §II-A.
-  (2012) High-performance code generation for stencil computations on gpu architectures. In Proceedings of the 26th ACM International Conference on Supercomputing, ICS ’12, New York, NY, USA, pp. 311–320. External Links: Cited by: §I.
-  (2012) High-performance code generation for stencil computations on gpu architectures. New York, NY, USA, pp. 311–320. External Links: Cited by: §II-C1.
-  (2020) Acceleration of fusion plasma turbulence simulations using the mixed-precision communication-avoiding krylov method. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9-19, 2020, pp. 93. External Links: Cited by: §III-A.
-  (1988) Supernode partitioning. New York, NY, USA, pp. 319–329. External Links: Cited by: §II-C1.
-  (2019) Dissecting the nvidia turing t4 gpu via microbenchmarking. arXiv preprint arXiv:1903.07486. Cited by: §I.
-  (2018) Dissecting the nvidia volta gpu architecture via microbenchmarking. arXiv preprint arXiv:1804.06826. Cited by: TABLE I.
-  (2018) Dissecting the NVIDIA volta GPU architecture via microbenchmarking. CoRR abs/1804.06826. External Links: Cited by: §II-D.
-  (2021) GPU-ether: gpu-native packet i/o for gpu applications on commodity ethernet. In IEEE INFOCOM 2021-IEEE Conference on Computer Communications, pp. 1–10. Cited by: §VII.
-  (2018) In-register parameter caching for dynamic neural nets with virtual persistent processor specialization. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Vol. , pp. 377–389. External Links: Cited by: §VII.
-  (2011) Performance analysis and optimization of three-dimensional fdtd on gpu using roofline model. Computer Physics Communications 182 (6), pp. 1201–1207. Cited by: §IV-B.
-  (2015) GPU implementation of jacobi method and gauss-seidel method for data arrays that exceed gpu-dedicated memory size. Journal of Mathematical Modelling and Algorithms in Operations Research 14 (4), pp. 393–405. Cited by: §I.
-  (2007) Effective automatic parallelization of stencil computations. New York, NY, USA, pp. 235–244. External Links: Cited by: §II-C1.
-  (1961) A proof for the queuing formula: L = λW. Operations Research 9, pp. 383–387. Cited by: §IV-C2.
Rammer: enabling holistic deep learning compiler optimizations with rtasks. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pp. 881–897. Cited by: §VII.
-  (2014-01) Optimizing Stencil Computations for NVIDIA Kepler GPUs. In Proceedings of the 1st International Workshop on High-Performance Stencil Computations, A. Größlinger and H. Köstler (Eds.), Vienna, Austria, pp. 89–95. Cited by: §I, Fig. 2, §IV-D, Fig. 4.
-  (2020) AN5D: automated stencil framework for high-degree temporal blocking on gpus. pp. 199–211. External Links: Cited by: §I, §I, Fig. 2, §II-C1, §II-D, §IV-B, §VI-B1.
-  (2014) Benchmarking the memory hierarchy of modern gpus. In IFIP International Conference on Network and Parallel Computing, pp. 144–156. Cited by: §IV-C2.
-  (2009) Performance modeling and automatic ghost zone optimization for iterative stencil loops on gpus. In Proceedings of the 23rd International Conference on Supercomputing, ICS '09, New York, NY, USA. External Links: Cited by: §I.
-  (2009) Performance modeling and automatic ghost zone optimization for iterative stencil loops on gpus. New York, NY, USA, pp. 256–265. External Links: Cited by: §II-C1.
-  (2011) A performance study for iterative stencil loops on gpus with ghost zone optimizations. International Journal of Parallel Programming 39 (1), pp. 115–142. Cited by: §I.
-  (2016) Merge-based parallel sparse matrix-vector multiplication. In SC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 678–689. Cited by: §V-C.
-  (2009) 3D finite difference computation on gpus using cuda. New York, NY, USA, pp. 79–84. External Links: Cited by: §V-B.
-  (2015) Optimal temporal blocking for stencil computation. Procedia Computer Science 51, pp. 1303–1312. Note: International Conference On Computational Science, ICCS 2015 External Links: Cited by: §I.
-  (2021)(Website) External Links: Cited by: TABLE I, §V-D.
-  (2021) NVIDIA cuda runtime api. External Links: Cited by: §II-B, §V-A.
-  (2021) NVIDIA cuda sample. External Links: Cited by: §V-C.
-  (2022) Programming guide. External Links: Cited by: §IV-C2.
-  (2014) Applying the roofline model. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 76–85. Cited by: §IV-B.
-  (2020) Node-aware stencil communication for heterogeneous supercomputers. pp. 796–805. External Links: Cited by: §III-A.
-  (2014) A cuda implementation of the high performance conjugate gradient benchmark. In International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, pp. 68–84. Cited by: §I.
-  (2019) From loop fusion to kernel fusion: a domain-specific approach to locality optimization. pp. 242–253. External Links: Cited by: §II-A.
-  (2015) SDSLc: a multi-target domain-specific compiler for stencil computations. New York, NY, USA. External Links: Cited by: §I.
-  (2016) Effective resource management for enhancing performance of 2d and 3d stencils on gpus. In Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit, pp. 92–102. Cited by: §II-C1, TABLE III.
-  (2018-02) Register optimizations for stencils on gpus. SIGPLAN Not. 53 (1), pp. 168–182. External Links: Cited by: §VI-B1.
-  (2018) Associative instruction reordering to alleviate register pressure. Piscataway, NJ, USA, pp. 46:1–46:13. External Links: Cited by: §VI-B1.
-  (2018) Domain-specific optimization and generation of high-performance gpu code for stencil computations. Proceedings of the IEEE 106 (11), pp. 1902–1920. Cited by: Fig. 2, §VI-B1, §VI-B1.
-  (2019) On optimizing complex stencils on gpus. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Vol. , pp. 641–652. External Links: Cited by: §I, §V-B, Fig. 4.
-  (2021) TOP500. Note: [Online; accessed 27-Mar-2021] External Links: Cited by: §I.
-  (2010) Better performance at lower occupancy. In Proceedings of the GPU technology conference, GTC, Vol. 10, pp. 16. Cited by: §II-D, §IV-C2, §IV-C.
-  (2014) Scalable kernel fusion for memory-bound GPU applications. pp. 191–202. External Links: Cited by: §II-A.
-  (1989) More iteration space tiling. New York, NY, USA, pp. 655–664. External Links: Cited by: §II-C1.
-  (2010) Demystifying gpu microarchitecture through microbenchmarking. In Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on, pp. 235–246. Cited by: §IV-C2.
-  (2017) Tessellating stencils. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’17, New York, NY, USA. External Links: Cited by: §I.
-  (2020) A study of single and multi-device synchronization methods in nvidia gpus. In 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 483–493. Cited by: §II-B.
-  (2017) Understanding the gpu microarchitecture to achieve bare-metal performance tuning. In ACM SIGPLAN Notices, Vol. 52, pp. 31–43. Cited by: §IV-C2.
-  (2019) Exploiting reuse and vectorization in blocked stencil computations on cpus and gpus. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–44. Cited by: §I, TABLE III.
-  (2018) Sparse persistent rnns: squeezing large recurrent networks on-chip. arXiv preprint arXiv:1804.10223. Cited by: §VII.