Accelerating solutions of one-dimensional unsteady PDEs with GPU-based swept time-space decomposition

by   Daniel J Magee, et al.
Oregon State University

The expedient design of precision components in aerospace and other high-tech industries requires simulations of physical phenomena often described by partial differential equations (PDEs) without exact solutions. Modern design problems require simulations with a level of resolution difficult to achieve in reasonable amounts of time---even in effectively parallelized solvers. Though the scale of the problem relative to available computing power is the greatest impediment to accelerating these applications, significant performance gains can be achieved through careful attention to the details of memory communication and access. The swept time-space decomposition rule reduces communication between sub-domains by exhausting the domain of influence before communicating boundary values. Here we present a GPU implementation of the swept rule, which modifies the algorithm for improved performance on this processing architecture by prioritizing use of private (shared) memory, avoiding interblock communication, and overwriting unnecessary values. It shows significant improvement in the execution time of finite-difference solvers for one-dimensional unsteady PDEs, producing speedups of 2--9× for a range of problem sizes, respectively, compared with simple GPU versions and 7--300× compared with parallel CPU versions. However, for a more sophisticated one-dimensional system of equations discretized with a second-order finite-volume scheme, the swept rule performs 1.2--1.9× worse than a standard implementation for all problem sizes.


An initial investigation of the performance of GPU-based swept time-space decomposition

Simulations of physical phenomena are essential to the expedient design ...

Applying the swept rule for solving explicit partial differential equations on heterogeneous computing systems

Applications that exploit the architectural details of high-performance ...

Applying the swept rule for explicit partial differential equation solutions on heterogeneous computing systems

Applications that exploit the architectural details of high performance ...

GPU Methodologies for Numerical Partial Differential Equations

In this thesis we develop techniques to efficiently solve numerical Part...

Extreme-scale Multigrid Components within PETSc

Elliptic partial differential equations (PDEs) frequently arise in conti...

Multigrid Solvers in Reconfigurable Hardware

The problem of finding the solution of Partial Differential Equations (P...

Massively-Parallel Break Detection for Satellite Data

The field of remote sensing is nowadays faced with huge amounts of data....

1 Introduction

High-fidelity computational fluid dynamics (CFD) simulations are essential for developing aerospace technologies such as rocket launch vehicles and jet engines. This project aims to accelerate such simulations to approach real-time execution—simulation at the speed of nature—in accordance with the high-performance computing development goals set out in the CFD Vision 2030 report slotnick:2014 . Classic approaches to domain decomposition for parallelized, explicit, time-stepping partial differential equation (PDE) solutions incur substantial computational performance costs from the communication between nodes required every timestep. This communication cost consists of two parts: latency and bandwidth, where latency is the fixed cost of each communication event and bandwidth is the variable cost that depends on the amount of data transferred. Latency in inter-node communication is a fundamental barrier to this goal, and advancements to network latency have historically been slower than improvements in other computing performance barriers such as bandwidth and computational power PattersonLatency . Performance may be improved by avoiding external node communication until exhausting the domain of dependence, allowing the calculation to advance multiple timesteps while requiring a smaller number of communication events. This idea is the basis of swept time-space decomposition alhubail:16jcp , Alhubail:2016arxiv .

Extreme-scale computing clusters have recently been used to solve the compressible Navier–Stokes equations on over 1.97 million CPU cores BM_BigSolution . The monetary cost, power consumption, and size of such a cluster impedes the realization of widespread peta- and exa-scale computing required for real-time, high-fidelity, CFD simulations. While these are significant challenges, they also provide an opportunity to develop new tools that increase the use of the available hardware resources. As the authors of CFD Vision 2030 note, “High Performance Computing (HPC) hardware is progressing rapidly and is on the cusp of a paradigm shift in technology that may require a rethinking of current CFD algorithms and software” slotnick:2014 . Using graphics processing unit (GPU) hardware as the primary computation device or accelerator in a heterogeneous system is a viable, expedient option for high-performance cluster design that helps mitigate these problems. For this reason, GPUs and other emerging coprocessor architectures are increasingly used to accelerate CFD simulations Niemeyer:2014hn .

GPU technology has improved rapidly in recent years; in particular, NVIDIA GPUs have progressed from Kepler to Pascal architecture in four years. This development doubled and tripled peak single- and double-precision performance, respectively PascalWP:2016 . In addition, the presence of GPUs in clusters, such as ONRL’s Titan supercomputer, has become increasingly common in the last decade. These advances have driven the development of software capable of efficiently using and unifying the disparate architectures Witherden20143028 . Although the ultimate motivation of our work is accelerating the solution of PDEs—particularly relevant to fluid flow—on distributed-memory systems, in this work we focused on a single GPU-accelerated node/workstation in the design of the algorithms and associated software. By investigating the effectiveness of the swept rule on a workstation, we provide results that can be applied to simulations on a single machine as well as an initial framework for understanding the performance of the swept rule on heterogeneous computing systems.

The swept rule operates on a simple principle: do the most work possible on the values closest to the processor before communicating. In practice, this directive results in an algorithm that advances the solution in time at all spatial points using locally accessible stencil values at the previous timestep. Because the data closest to the processor is also the least widely accessible, the strict application of this principle does not always provide the best performance, but it is a useful heuristic for implementing the procedure and analyzing its performance.

This study presents an investigation of the performance characteristics of three swept rule implementations for a single GPU in a workstation. These procedures are tested on three one-dimensional PDEs with numerical schemes of varying complexity and compared with the performance of parallel CPU algorithms and unsophisticated GPU versions. The (next) Section 2 describes recent work on partitioning schemes for PDEs and communication-avoiding algorithms, especially as applied to GPUs. Section 3 gives a brief overview of the GPU architecture, particularly the thread and memory hierarchies. Section 4 discusses the swept rule, and our adjustments to the original algorithm in response to the details of GPU architecture. Section 5 describes the swept rule implementation in detail. In Section 6 we present the results of the tests and, lastly, draw further conclusions in Section 7.

2 Related work

Alhubail and Wang introduced the swept rule for explicit, time-stepping, numerical schemes applied to PDEs MaithamRepo , alhubail:16jcp , Alhubail:2016arxiv , and our work takes their results and ideas as its starting point. The swept rule is closely related to cache optimization techniques, in particular those that use geometry to organize stencil update computation such as parallelograms Strzodka and diamonds MalasHager . The diamond tiling method presented by Malas et al. MalasHager is similar to the swept rule but uses the data dependency of the grid to improve cache usage rather than avoid communication. Concepts such as stencil optimization using domain decomposition on various architectures that are fundamental to this study are explored by Datta et al. VolkovDatta2008 . Their work explores comparisons between parallel GPU and CPU architectures and tunes the stencil algorithm with nested domain decomposition. The swept rule also has elements in common with parallel-in-time and communication-avoiding algorithms.

Parallel-in-time methods Gander2015 , such as multigrid-reduction-in-time (MGRIT) algorithms falgout2014parallel , accelerate PDE solutions with time integrators that overcome the interdependence of solutions in the time domain, allowing parallelization of the entire space-time grid. These methods calculate the solution over the space-time domain using a coarse grid and iterate over successively finer grids to achieve the desired accuracy. The use of coarse grids in parallel-in-time methods reduces efficiency and accuracy when applied to nonlinear systems alhubail:16jcp . This shortcoming is intuitive: since chaotic, nonlinear systems may suddenly change in time, and coarse grids are prone to aliasing, the required grid granularity diminishes gains in performance. The swept rule arises from the same motivation, but does not seek to parallelize the computation in time or vary dimensions during the process.

The swept rule does not alter the numerical scheme; it decomposes the domain and organizes computation. That is, compared to a classic domain decomposition, the swept rule performs the same operations in a different order and location. In this way communication-avoiding algorithms share many implementation details with the swept rule. Recent developments in communication-avoiding algorithms for GPUs have generally focused on applications involving matrices such as QR and LU factorization. The LU factorization algorithm presented by Baboulin et al. BABOULIN201217 is motivated by the increasing use of GPU accelerators in large-scale, heterogeneous clusters. This method splits tasks between the GPU and CPU, minimizing communication between devices. This allows the communication and computation performed on each device to overlap, so all data transfer occurs asynchronously with computation. We explore this approach—overlapping data transfer with hybrid computation—in this article. The motivation and structure of our study is comparable to the work of Anderson et al Anderson . They developed methods for arranging and tuning computation for single general-purpose GPU (GPGPU) in a desktop workstation, without altering the basic QR factorization algorithm. Similarly, this study focuses on adapting the swept rule to a single GPU. The swept rule is a strategy for arranging the computational path of explicit numerical methods, and this work seeks to design the data structures and operations used in that path to achieve the best performance on GPUs.

3 GPU architecture and memory

The impressive parallel processing capabilities of modern GPUs resulted from architecture originally designed to improve visual output from a computer. GPU manufacturers and third-party authors Owens:2008ku , cudaProgGuide , Brodtkorb:2013hn , EngineerCuda have described the features of this architecture in great detail, while others discussed general best practices for developing efficient GPU-based algorithms Cruz:2011gc , Niemeyer:2014hn . Particular aspects of this architecture, such as the unique and accessible memory hierarchy, are at the core of this work, so some explanation of its relevant elements is necessary before describing the details of the implementation.

Programs that run on the GPU can be implemented using several software packages, the most common of which are the OpenCL and OpenACC frameworks, and the CUDA parallel computing platform. These packages use different nomenclatures and are compatible with different hardware types. In this project all programs use CUDA, which is exclusively compatible with NVIDIA GPUs; therefore, all descriptions of GPU hardware presented here use the CUDA nomenclature. CUDA programs consist of functions, referred to as kernels, launched from a C/C++ host program. The CPU executes the host code and specifies the size and number of blocks when calling a kernel, the stream (queue) in which the kernel will be launched, and the amount of shared memory to be allocated per block at runtime. All threads in a warp—a group of threads that execute as a single-instruction multiple thread (SIMT) unit—must be in the same block, so for good practice blocks should launch with some multiple of 32 threads.

The information presented here is valid for all NVIDIA GPUs with compute capability or higher (i.e., Kepler architecture or later). The device used in this study is a Tesla K40c GPGPU, compute capability 3.5. This device contains 15 streaming multiprocessors, each capable of processing 64 warps of 32 threads, or 2048 total threads, at once PascalWP:2016 . A maximum of blocks may concurrently reside on a streaming multiprocessor; blocks may not be split between streaming multiprocessors. While each streaming multiprocessor can support 2048 resident threads, in practice their capacity is often lower because each thread or block makes demands on limited memory resources—most notably the shared memory and registers. Each streaming multiprocessor on the Tesla K40c has of shared memory and 65536 registers available. Registers offer the fastest access, but are the most limited memory type and are private to each thread, but can be accessed by other threads in the same warp using shuffle operations available on devices with compute capability 3.5 or higher. Shared memory is a controllable portion of the L1 cache accessible only to threads within a block. As a result, for a thread to read a value stored in shared memory in a different block, a thread with access to that value must write the value to global memory where the reader thread has access. Global memory is the slowest and most plentiful memory type, and where data copied from the host program resides. Global memory stores all variables passed to a kernel and large arrays declared therein.

Other memory types in the CUDA memory hierarchy include constant, texture, and surface; of these, the work presented here only uses constant memory. Constant memory is read-only, available to all kernels for the lifetime of an application, and quick to access when all threads access the same location. This makes it a convenient and performance conscious choice for storing constant values of the governing equations calculated at runtime cudaProgGuide .

4 Methodology

4.1 Experimental method

The primary goal of this study is to compare the performance of the swept rule to a simple domain decomposition scheme, referred to as Classic, on a GPU. A domain decomposition scheme is a way of splitting up a large problem so tasks can be shared by discrete workers working on distinct parts of the problem. In this case the decomposition scheme divides the space-time domain of the PDE solution. We will compare the performance of these methods by the time cost of each timestep, including the time required to transfer data to/from the GPU. While encoding the Classic is relatively straightforward, finding the best approach for the swept rule on the GPU presents a more subtle problem.

In the original swept rule approach alhubail:16jcp , the spatial domain is partitioned into independent pieces called “nodes” that correspond to compute nodes on a distributed system with private memory spaces. A major concern in adapting this nodal analogy to a single GPU is the type of memory allocated for the working array, the container for the values that are the solution to and the basis for each timestep. Several available approaches exist to map the original analogy for a node to a single GPU; here, we will explore three of them: Shared, Hybrid, and Register (which we will describe in detail in Section 5).

In all approaches we map one thread to one spatial point. Handling more than one spatial point per thread would allow for larger nodes but would require more resources and complicate the procedure without reducing thread idleness. Additional swept rule properties could be adjusted for potential implementation variants such as the data structure to hold the working values, and the method to globally synchronize threads. However, for the purposes of this study, we did not vary these attributes. In all cases in this study the working array is a standard, one-dimensional C array with two flattened rows; and kernel calls are implicitly synchronized by returning control to the queue in the host program.

4.2 First-order domain of dependence

(a) The first step of the swept rule. Values at are split between nodes and , which compute solutions in their domain of dependence, a triangle in the space-time plane. The edge values are collected in global arrays / and /.
(b) The second step in the swept rule. The nodes pass their right edge to the neighboring node. The passed values become the initial left edge, and the left edge from the previous stage becomes the right edge. Each node advances through their domain of dependence, a diamond in space-time.
Figure 1: The first two steps of the swept rule for a numerical scheme with a first-order domain of dependence. refer to left/right arrays of node , which collect edge values shown with thick bordered dots FigsShared .

A domain of dependence is a region on the space-time grid that can be completed by a numerical scheme with some set of initial values. For the purposes of this project, numerical schemes consist of stencil operations, where a value at a grid point is updated based on the weighted contribution of values at grid points in the vicinity. A three-point stencil uses values at neighboring grid points only, and the timestep of any numerical scheme can be decomposed into a series of sub-timesteps that only require a three-point stencil WangDecomp . The numerical scheme defines the order of the domain of dependence: it increases by one for every two sub-timesteps required per timestep. Therefore, a domain of dependence is first-order if all sub-timesteps in the numerical scheme use a three-point stencil, and intermediate values are required no more than two steps after they are calculated. The initial incarnation of the swept rule presented by Alhubail and Wang alhubail:16jcp , MaithamRepo decomposes multi-step timesteps and large stencils into sub-timesteps with a three-point stencil. This regularizes the procedure and ensures that all equations and schemes can be evaluated using a swept decomposition with a first-order domain of dependence, but requires more memory to store the intermediate values that result from each sub-timestep. In order to conserve limited private memory, for the GPU-based swept rule a first-order domain of dependence is applicable to schemes that require two or fewer three-point stencil sub-timesteps per timestep. Figure 1 shows the first two stages of the swept rule using nodes with spatial points.

At the start of the first step, shown in Figure 0(a), the initial conditions are passed to the kernel and each node evaluates the solution at as many points as possible in the space-time grid. The initial domain of dependence forms a triangle. The working array stores the solutions as each node steps through time and is maintained in a fast memory space private to node member threads. When each node cannot advance any further it’s necessary for each to pass one edge to a neighboring node and retain the values of the other edge. Figure 0(b) shows how the second step proceeds from the first using the edge values.

(a) At the end of a kernel (completion of an inverted triangle), the working array is stored and passed from private memory to global memory. The location shows the index of the working array in (private) shared memory. The number in the shape refers to the offset global memory index where the working value is passed.
(b) Edges are reinserted to seed the inverted triangle of a new swept cycle. The location shows the index of the working array in (private) shared memory receiving the global memory value denoted by the number.
Figure 2: The procedure for edge passing shown in Figure 0(b). The global arrays and are represented by circles and squares, respectively, and the numbers in those shapes represent the position of the value in the global array. The axes describe the location of the value in the working array FigsShared .

For the nodes to communicate as Figure 1 illustrates, the values on the edges must be passed to global arrays available to all nodes since any sufficiently fast memory location is private to a subset of threads terminated on kernel exit. Consequently, any data necessary to continue the computation must be shared between nodes using an intermediary container. The information needed to begin a new nodal cycle, including the values retained by each node, must be read in from, and out to, global memory at the beginning and end of each kernel, respectively. Figure 1(a) shows how the working array values are stored and passed to the global arrays.

Figure 0(b) illustrates the reason the two edges are stored individually as and : only one of the edges is passed between nodes. After the first step, the right edge is passed to its right neighbor node; the left edge is stationary. At the beginning of the subsequent kernel, and are swapped and reinserted into the working array to seed the next progression with one data transfer event from global to private memory as shown in Figure 1(b). When the right edge is passed between nodes, node is split across the spatial boundary and must apply the boundary conditions at its center.

The swept rule allows the computation to advance by timesteps with two, rather than , global memory accesses, where is the number of threads per block (or the number of spatial points per node). The procedure advances by passing values in the alternating directions and zig-zagging the location of the nodes in this fashion until the simulation is complete. Since the diamonds shown in Figure 0(b) do not store all the values at a single timestep, the simulation can only output values when a complementary triangle is computed and the final -length local tier is returned. This kernel can only be called after the values are passed to the left, so the results can only be read out every th timestep.

Although we already described the working array and showed in Figure 2 how the relevant values are communicated between the nodes, it is instructive to outline the performance concerns that motivate this arrangement. As Section 3 describes, the number of resident threads on a streaming multiprocessor depends on the GPU architecture and resources requested at kernel launch. For instance, storing every double-precision value in the triangle shown in Figure 0(a) in shared memory in a kernel with threads per block would require of shared memory—this is over 100 times greater than the limit for NVIDIA GPUs with Kepler architecture. The maximum number of threads per streaming multiprocessor would be limited to 128, which would negatively impact program performance.

Figure 0(a)

shows that the interior of the triangle is only needed to progress to the next timestep, and that edges on even and odd tiers do not overlap in the spatial domain. Thus, the triangle may be stored as a matrix with two rows, where the first and second rows contain the even and odd sub-timesteps results, respectively. The interior values are overwritten once they are used, and only the edge values remain. Figure 

1(a) shows the result of a local computation with this method. The last two values, the tips of each triangle, are copied into both arrays.

At the start of the next kernel, the left and right arrays are inserted into the working array to seed successive calculations as Figure 1(b) shows. The edges of the previous cycle’s first row are moved to the center, and the center of the current top row is now left open for the first tier of the inverted triangle. Each row requires space for values because two edge values are required on either side to complete the stencil for the longest row where values are computed. The computation proceeds by filling the empty two indices on the top row, overwriting the bottom row’s middle four indices, and so on. In contrast to the memory demands of storing the entire nodal computation, this method uses only values. For a block with n = 512 threads, this requires . In general, a stencil with width would require storing values in the working array. By reducing the amount of shared memory to only , the kernel is less limited and achieves higher occupancy, the number of threads each streaming multiprocessor is capable of handling simultaneously.

4.3 Higher-order domain of dependence

The first-order domain of dependence suffices for relatively simple problems. However, more complicated problems with elements such as nonlinear equations, discontinuities, or higher-order derivatives require more sophisticated procedures. The original swept rule program calculates and stores intermediate values at sub-timesteps for higher-order schemes to avoid using larger stencils alhubail:16jcp . Breaking a timestep into a series of sub-timesteps allows any numerical scheme for any equation of the same dimension to be decomposed in the same way since all stages in the computation depend on the minimum stencil, that is, only the neighbors of the current spatial point WangDecomp . For example, a second-order in time midpoint method applied to a fourth-order differential equation would require four sub-timesteps per timestep. The first sub-timestep would find the second derivative of the dependent variable so that the second sub-timestep, the midpoint solution, would only require a three point stencil: the second derivative and initial values at the neighboring spatial points. The third sub-timestep would find the second derivative using the midpoint solution, and the final sub-timestep would complete the timestep using the results of the second and third sub-timesteps on the three-point stencil and the previous timestep solution at the current spatial point. This approach presents a storage and data transfer problem on the GPU because values in the interior of the working array are overwritten two sub-timesteps after they are calculated. These forgotten but required values are marked with an “” in Figure 2(b).

(a) First step of swept rule for discretization with four, three-point stencil sub-timesteps per timestep.
(b) Using four sub-timesteps per timestep will overwrite values marked with an “” before they are needed.
Figure 3: Conflicts in the domain of dependence for discretizations requiring more than two sub-timesteps per timestep FigsShared .

Saving four values per tier would fix the problem, but requires a larger matrix in shared memory, which would diminish occupancy and require more unnecessary values to be passed between nodes. Figure 4 shows our solution to this problem: a five-point stencil that requires two sub-timesteps per timestep—a predictor and a final step. This flattens the triangle or diamond in the time domain and requires four values per sub-timestep, but the two-row matrix may be used as described in Section 4.2 and illustrated by Figure 2 with minor adjustments. The same number of values are transferred between nodes in each communication, but inevitably more communication events are required to advance the solution. Conveniently, the predictor-corrector method ensures that all odd tiers, the second matrix row, will contain predictor values, and the bottom row will hold final values.

(a) First step in swept rule for second-order domain of dependence. The working array is able to be folded and passed as shown in Figure 2
(b) Second step in the swept rule second-order domain of dependence.
Figure 4: The first two steps of the swept rule for a numerical scheme with a second-order domain of dependence FigsShared .

The problem that motivates our adjustment to the swept rule is the result of both the midpoint method and the five-point stencil discretization. Either of these circumstances alone would accommodate the method described in the previous section. Generally, the decomposition must be adjusted in this fashion when there are more than two sub-timesteps per timestep. It would need to be adjusted further for problems with more than four sub-timesteps per timestep.

5 Implementation

5.1 Swept rule variants

While the order of the swept domain of dependence is a natural consequence of the equations and numerical scheme, the computational hierarchy available on GPUs allows this structure to be implemented in different ways. To more thoroughly investigate the effects of the swept rule, we consider three versions: Shared, Hybrid, and Register, and compare them with a basic decomposition algorithm, Classic. In all of these programs, one GPU thread is assigned to one spatial point.

__global__ void upTriangle(const REAL *IC, REAL *right, REAL *left) { //tid = threadIdx.x (bottom row idx), //tidT = tid+blockDim.x (top row idx); //tids and tidTs: stencil for each tidT and tid respectively. //shareT = working array in shared memory //Read initial data in from IC to shareT. if (tid > 1 && tid <(blockDim.x-2)) {   shareT[tidT]=predictorStep(shareT, tids);   } __syncthreads(); //4 is stencil length - 1 for (int k = 4; k<(blockDim.x/2); k+=4)   {   if (tid < (blockDim.x-k) && tid >= k) {     shareT[tid]+=finalStep(shareT, tidTs);     }   k += 2;   __syncthreads();   if(tid < (blockDim.x-k) && tid >= k) {     shareT[tidT]=predictorStep(shareT,tids);     }   __syncthreads();   } //Read out the edges to left and right arrays. }
Figure 5: Main loop of the starting kernel for the swept rule as illustrated by Figure 3(a).
__global__ void classicKS(const REAL *ks_in, REAL *ks_out, bool final) { //Global Thread ID int gid = blockDim.x * blockIdx.x + threadIdx.x; //number of spatial points - 1 int lastidx = ((blockDim.x*gridDim.x)-1); //Stencil indices. int gids[5]; //True for all spatial points from periodic BCs. #pragma unroll for (int k = -2; k<3; k++) {   gids[k+2] = (gid + k) & lastidx;   } //Final is false for predictor step, true otherwise. if (final) {   ks_out[gid] += finalStep(ks_in, gids);   } else {   ks_out[gid] = predictorStep(ks_in, gids);   } }
Figure 6: Classic kernel for Kuramoto-Sivashinsky solver. Final and predictor step functions can be written in C with __device__ keyword.

The Classic algorithm is a naive GPU implementation of the numerical solution and the baseline against which the efficacy of the swept rule is measured. It advances one sub-timestep per kernel call and uses global memory to store the working array. Figure 6 shows the structure of the procedure, and it is similar for all problems and discretizations.

We consider the Shared strategy for implementing the swept rule the most natural way to map the analogy of CPU nodes to GPU architecture. It is applied to every test case and is considered the “default” swept rule GPU version for comparing the performance of the GPU programs with their MPI-based CPU counterparts MaithamRepo . The Shared version treats a block as a node and uses shared memory for the working array. Each block has exclusive access to a shared memory space and may contain up to threads, so it has access to fast memory and the capacity for various node sizes. There are some drawbacks to this version: a high number of idle threads for sub-timesteps where the domain of dependence contains relatively few spatial points, poor utilization of CPU resources, and the fact that shared memory is not the fastest memory type harris:2014 .

The Hybrid strategy uses the same GPU procedure as Shared, uses the CPU to compute the node that is split across the boundary, as seen in Figure 0(b), Transfers between the host and device are costly operations, but the devices can execute instructions concurrently—so if the CPU can complete the boundary node before the GPU finishes the other nodes, no penalty arises. In this study, we apply this strategy to problems with non-periodic boundary conditions as a way to mitigate the underutilization of the CPU and the thread divergence that results from applying boundary conditions in a GPU kernel.

The Register approach is applied to problems with periodic boundary conditions because of the difficulty involved in applying boundary conditions using warp shuffle functions. This implementation limits the number of points in a node to the size of a warp, threads (which has been constant over several iterations of NVIDIA GPUs). In this version the values are initially read from global memory to shared memory to the registers. Passing the values to the intermediate shared memory is necessary because the shuffle operations that trade registers between threads only operate on active threads in the warp being called; if some threads are masked, they will be unable to supply the necessary stencil values. Thus, data must be moved between memory levels at each tier rather than once at the start and end of the kernel. This seriously limits the Register approach, but it still warrants exploration since registers are the fastest memory type.

5.2 Test cases

We present three test cases to demonstrate the performance and functionality of the GPU-based swept rule in one spatial dimension: the heat equation, Kuramoto–Sivashinsky (KS) equation, and Euler equations for compressible flow. Appendices Appendix BAppendix D contain the full derivations of the procedures for the numerical solutions. First, we chose the heat equation for its simplicity and familiarity. Here it is discretized with a first-order scheme using forward differencing in time and central in space. Next, we selected the KS equation to demonstrate the swept rule for higher-order, nonlinear PDEs. We discretized the KS equation with second-order, central differencing in space, which requires a five-point stencil, and a second-order Runge–Kutta method in time. Lastly we chose to solve the Euler equations, a system of quasilinear, hyperbolic equations for describing compressible, inviscid flow. The conservative form of these equations are applied to the Sod shock tube problem to demonstrate the application of the swept rule to a canonical CFD problem involving discontinuities and several dependent variables. These equations are discretized with a second-order, finite-volume scheme in space and a second-order method in time. The heat equation requires a first-order domain of dependence, while the KS and Euler equations require second-order domains of dependence. These problems also provide examples of various types of boundary conditions. The heat and Euler equations are solved with reflective and Dirichlet boundary conditions, respectively; therefore, the boundary conditions must be imposed with control flow. The KS equation uses periodic boundary conditions, which is a convenient formulation for the swept rule that splits a node across the boundary.

6 Results and discussion

All tests presented here were performed on a single workstation with a Tesla K40c GPU and an Intel Xeon 2630-E5 CPU with eight cores and 16 potential threads. The GPU-based swept rule algorithms and test cases were implemented 1DSweptCUDA v2 MyRepo . For the results we present here, each program was executed in double precision with threads per block and spatial points. Each run advanced timesteps and recorded the average time per timestep; the initial GPU memory allocations and data transfer between the host and device are included in the overall time measurement. Then, we collected the best time per timestep for each number of spatial points. We repeated this procedure five times and took the average to obtain the results presented here. There were no significant differences in the results between the tests for the same configuration.

Figures 6(a) and 7(a) show the execution time per timestep for all algorithms applied to the heat and KS equations. When applied to these problems, the swept rule algorithm outperforms the Classic procedure, and the GPU-only shared memory version, Shared, is faster than the alternate swept rule versions. Figures 6(b) and 7(b) show the speedup of the swept rule programs, or the ratio of time costs with the Classic version. Both cases exhibit similar performance patterns: Shared generally provides a larger speedup for small spatial domains ( spatial points), but only a speedup for large ones ( spatial points). Figures 6(a) and 7(a) show the performance trends of the algorithms with respect to the spatial domain size. Their time costs are insensitive to the spatial domain at smaller domain sizes and grow linearly with increasing domain size after about spatial points. The similarity in the performance trends of the heat and KS programs is intuitive; although the KS equation is nonlinear, fourth-order, and discretized with a higher-order scheme, both are continuous, scalar equations and use finite difference schemes.

(a) Time cost of GPU classic and swept domain decomposition algorithms.
(b) Speedup of swept rule programs with respect to the Classic version.
Figure 7: Performance comparison of the GPU heat equation programs FigsShared .
(a) Time cost of GPU classic and swept domain decomposition algorithms.
(b) Speedup of swept rule programs with respect to the Classic version.
Figure 8: Performance comparison of the GPU Kuramoto–Sivashinsky equation programs FigsShared .

The speedup of the swept rule declines as the number of spatial points in the domain increases, caused primarily by occupancy and the fixed cost of kernel launch. The occupancy of a kernel is the number of threads that can reside concurrently on a streaming multiprocessor at launch. This quantity is determined by the block size, since blocks of threads are indivisible, and the resources requested by the kernel. The swept rule requests more resources, specifically shared memory and registers, than Classic to store the working array and carry out its procedure, which involves more steps. If more resources are requested than are available to the streaming multiprocessors on the device, the GPU launches waves of blocks. If the occupancy is limited by resource allocation, the kernel must begin launching waves of blocks on domains with fewer spatial points. Each wave must wait until the previous one completes before beginning; theoretically this implies two waves will cost about twice as much as one. As the number of spatial points in the domain grows, more blocks are launched, and the difference in the number of waves per kernel call increases more quickly for the swept rule program. This conclusion arises from the observation that the time cost of the swept rule versions begins growing linearly at a smaller spatial domain size, about points, than Classic, about points, as shown in Figures 6(a) and 7(a).

By timing the launch of many empty kernels and taking their average, we measured the fixed cost of kernel launch on the testing workstation to be about . We conclude that this cost dominates the performance of Classic at small problem sizes because it is quite close to the cost of each timestep for spatial domains with less than spatial points for both the heat and KS equations. This fixed cost accounts for less time cost at larger spatial domain sizes where all kernels must launch several waves of blocks, so the portion of the swept rule speedup from avoiding kernel launches becomes negligible, and the overall speedup of the swept rule is reduced.

Each of the swept rule variants experiences these trends, but in all cases the Shared version performs better. Figure 7 shows the heat equation test, where the Hybrid version performs as well as Shared for large spatial domain sizes and 2–3 times worse for small ones. The Hybrid version uses the same GPU kernels as Shared, but uses the CPU to calculate the first node when it is split across the boundary. At smaller domain sizes, the data transfer between the host and device dominates the cost of the Shared scheme. At larger domain sizes, this cost drops in comparison with the overall costs, but at the same time the thread divergence caused by handling boundary conditions also drops in importance. Thus, the Shared and Hybrid methods perform similarly at larger domain sizes (i.e., above spatial points).

Figure 8 shows the KS equation test, where, similar to the heat equation, the Shared version performs better in all cases. While registers are the fastest memory type, because of memory access rules the Register version must use a warp as a node, which limits the domain of dependence to 32 spatial points. A Register program with must launch four times as many kernels to advance the same number of timesteps as Shared with (most often the best block size). Despite these differences, the Register version exhibits a similar performance trend to Shared, which shows about the speedup of Register at all spatial domain sizes. The limit on node size results in a constant four timesteps per cycle for the KS equation, which only reduces the number of communication events by compared with Classic.

(a) Time cost of GPU classic and swept domain decomposition algorithms. Hybrid and Shared lines overlap.
(b) Speedup of swept rule programs with respect to the Classic version.
Figure 9: Performance comparison of the GPU Euler equation programs FigsShared .

In contrast to the previous test cases, Figure 9 shows that the classic decomposition outperforms the swept rule at all problem sizes when applied to the Euler equations. This result differs from the findings of Alhubail and Wang alhubail:16jcp for the swept rule implemented only with MPI on CPUs, which produces roughly equivalent speedups for both the KS and Euler equations. The GPU implementation of the swept rule for the Euler problem involves much greater arithmetic intensity than the other problems, causing greater low-level memory usage. This limits the performance of the swept scheme on a GPU more than a CPU, reducing the benefits of the scheme over the classic approach within a single GPU compared with the other problems. Those cases involve higher ratios of floating-point operations to memory accesses (both read and write). This will particularly degrade performance as the intra-domain timesteps proceed and nodes become inactive.

With regard to the potential performance of the swept rule, this result is not encouraging for more complex problems in more dimensions. However, on a larger scale in a cluster, where global memory is used for the working array and the communication to be avoided is through a physical network between processors, we expect that the swept rule will provide a benefit for these types of problems.

(a) Time cost of CPU and GPU programs for KS equation.
(b) Performance improvement of GPU program compared to the same algorithm on the CPU.
Figure 10: Performance comparison of CPU (MPI) and GPU (CUDA) programs for the KS equation FigsShared .

Figure 10 compares the CPU-based KS equation programs with the GPU algorithms, Classic and Shared. This CPU version is parallelized with MPI, similar to that originally presented by Alhubail and Wang MaithamRepo . Figure 10 refers to the Shared program as SweptGPU because it serves as the “default” swept rule version on the GPU for comparison with the CPU-based swept rule. The original CPU program was designed for a distributed-memory cluster, but the Alhubail and Wang’s original performance study used only two processes on two CPUs alhubail:16jcp . The parallel CPU program was tested similarly to the GPU programs: it was run on the same workstation with the same spatial domain sizes and threads. Here we compare the best results for each number of spatial points, which usually occurred with 16 threads. This did not degrade the performance compared with the original study; in fact, both CPU versions of the program performed significantly better, and for most spatial domain sizes the CPU-based swept rule improved by about ×. Since the test used up to eight times the number of threads, this result supports the validity of repurposing this code and comparing the result with the GPU versions.

Figure 10 illustrates the performance benefits of the GPU architecture more than the swept rule itself. On small spatial domains (e.g., less than points) the GPU can assign one thread to each spatial point and process all in a single wave. There it improves performance over the CPU version by less than ×, because the work does not fully utilize the GPU. On larger spatial domains (e.g., more than points), where the work completely utilizes the resources of the GPU, the performance increases continue growing with problem size. The increase of the GPU programs’ improvement with respect to the number of points in the spatial domain inverts the trend of the swept rule’s improvement with respect to the Classic kernel. Both GPU algorithm types show a speedup of about × for the smallest spatial domain. At the other end, the speedup grows to about × for the swept and × for the classic domain decomposition.

7 Conclusions

In this study we compared the time per timestep of three swept rule time-space decomposition implementations to a simple domain decomposition scheme, Classic, for GPU accelerators. These programs were evaluated on a range of spatial domain sizes. The Classic and Shared programs were also compared with their CPU-based counterparts parallelized with MPI.

Generally, the swept rule performs better relative to Classic when the kernel launch cost is a substantial element of the program cost and the spatial grid has fewer points than concurrently launchable threads on the GPU. Once this is exceeded, the launch cost penalty becomes negligible and greater resource usage penalizes the swept rule in the form of reduced occupancy and reduced availability of the L1 cache, which is reserved for shared memory. But, like the initial performance penalty for the Classic program, the resource usage penalty does not scale and all programs see their time cost rise nearly linearly with respect to spatial domain size after about  spatial points.

Both alternate swept-rule procedures, Hybrid and Register, performed slightly worse than the Shared version. We attribute the failure of the Hybrid routine to improve performance to improvements made in the handling of the boundary conditions in Shared rather than the cost of host-to-device communication, which is handled with asynchronous streams. Though the Hybrid version does not improve performance, it does not substantially degrade it either, especially with large spatial domain sizes. Ultimately, Hybrid computation seeks to solve a problem more efficient to solve within the GPU paradigm. The Register computation shows more promise but falls short because of the limited node size.

This study also shows that implementing a Classic decomposition on the GPU for an explicit numerical scheme is simple and may result in noticeable performance improvements. This may be enough for many applications, but in performance-critical cases the swept rule may further reduce execution time. The performance of the Classic decomposition programs may be improved by altering the synchronization method. Implicit synchronization incurs the cost of kernel launch at each sub-timestep, and this kernel executes so quickly at small problem sizes that synchronization cost dominates the performance of the program. Using off-the-shelf, GPU-based synchronization GlobalLock could also provide performance benefits for the swept rule, and in particular Register, which is also limited by kernel launch cost for small spatial domains.

The specific results presented here depend on the hardware used. A commercial GeForce GPU with Kepler architecture would perform worse than the Tesla K40c, which is designed for computation as opposed to actual graphics. Despite the significant performance increase shown here, the Tesla K40c is three-year-old technology. The current state-of-the-art GPGPU with Pascal architecture, the Tesla P100, offers twice the K40c base clock speed, nearly four times the number of streaming multiprocessors, and an extra of shared memory independent of the L1 cache. Additional dedicated shared memory could dramatically impact the swept rule’s performance for the Euler equations. We predict that this device could at least halve the execution times shown here and maintain insensitivity to problem size up to over spatial points, potentially resulting in speedups on the order of over CPU parallel versions. Future GPU accelerators could further improve performance, as well as devices with similar SIMT architectures like Intel MIC/Xeon Phi.

Our future work will focus on further developing the swept rule for use on distributed memory systems with heterogeneous node architectures and implementing the swept rule in higher dimensions on the GPU. We expect the performance benefits to increase from the swept rule when applied to a distributed memory system comprised of nodes with CPUs and GPUs, where network communications incur more latency than intra-GPU memory access. However, we anticipate the performance benefits of the swept rule will diminish for higher dimensions; in two dimensions the swept rule requires three stages of inter-node communication to advance one cycle, analogous to a single diamond in one dimension. This, along with the associated increase in required memory allocation, arithmetic intensity, and kernel launch events could limit the performance of the swept rule as observed by Alhubail et al. Alhubail:2016arxiv . In spite of these challenges, recent developments such as grid-wide synchronization and increased shared memory capacity could aid the performance of the swept rule and permit the exploration of additional design options.

Appendix Appendix A Availability of materials

The results for this article were obtained using 1DSweptCUDA v2 MyRepo . All figures, and the data and plotting scripts necessary to reproduce them, are available openly under the CC-BY license FigsShared .

Appendix Appendix B Heat equation

The unsteady heat conduction equation without volumetric heat flux is


where is temperature, is time, and is thermal diffusivity. In one dimension, this is reduced to


Discretizing Eq. (2) with forward differencing in time and central in space yields


where is the spatial node index and is the time index corresponding to time . This is a first-order, explicit, finite-difference approximation. To step forward in time, define the Fourier number, , where is the timestep size and is the spatial grid size, and solve for temperature at the next timestep:


In this study the approximation is evaluated with insulated boundary conditions at both ends and spatial points:


Appendix Appendix C Kuramoto–Sivashinsky equation

The Kuramoto–Sivashinsky equation is a nonlinear, fourth-order, one-dimensional unsteady PDE:


where is the dependent chaotic variable (e.g., fluid velocity). It is discretized similarly to the heat equation, as shown in Eq. (7), with central differencing in space. In this case, decomposing the fourth spatial derivative requires a five-point stencil:


The chaotic nature of the problem necessitates a higher-order scheme overall; therefore, an explicit, second-order Runge–Kutta scheme, also known as the midpoint method, is applied to the time domain.

Let the right-hand side of Eq (7) be , then the predictor solution is found at :


Then may be evaluated and added to to obtain :


The problem demonstrated here uses periodic initial and boundary conditions. That is, the stencil at point includes points and and the initial condition is periodic and continuous at the spatial boundaries.

Appendix Appendix D Euler equations (Sod shock tube)

The Sod shock tube problem is a one-dimensional unsteady compressible flow problem based on the nonlinear, quasi-hyperbolic Euler equations:


where is the Jacobian matrix,


is density, is internal energy, is velocity, is pressure given by


and is the heat capacity ratio of air.

The initial boundary conditions, given in Eq. (14), are constant values for the state variables on either side of a diaphragm separating two parcels of the same fluid. The spatial boundary conditions are these values at their respective ends of the tube:


The equation is discretized using a second-order, finite-volume scheme with cell-centered values. The first step in the solution is evaluating the pressure ratio at the current timestep over a five-point stencil


This value is used with a minmod limiter to compute reconstructed values on both sides, and , of the current cell boundaries at :


where subscript refers to the reconstructed, or new, values on the edge of the interface and refers to the original values. For example, at the original values on the left side of the interface are at

. These reconstructed boundaries do not represent solutions for any grid cell; they are temporary values that interpolate the solutions.

Once we have the reconstructed values on either side of the interface, we can calculate the flux at that cell boundary with


where is given by Eq. (12) and

is the spectral radius, the largest eigenvalue of the Jacobian matrix


The spectral radius can be found with the Roe average at the interface


and with using Eq. (13).

The spectral radius is given by


Repeating this process at both interfaces yields all required values to solve for a timestep


The results presented here for the Euler equations use a second-order Runge–Kutta scheme in time, which can be obtained with the same procedure shown in Eqs. (9) and (8).


This material is based upon work supported by NASA under award No. NNX15AU66A under the technical monitoring of Drs. Eric Nielsen and Mujeeb Malik. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40c GPU used for this research.