1 Introduction
High-fidelity computational fluid dynamics (CFD) simulations are essential for developing aerospace technologies such as rocket launch vehicles and jet engines. This project aims to accelerate such simulations to approach real-time execution—simulation at the speed of nature—in accordance with the high-performance computing development goals set out in the CFD Vision 2030 report [slotnick:2014]. Classic approaches to domain decomposition for parallelized, explicit, time-stepping solutions of partial differential equations (PDEs) incur substantial performance costs from the communication between nodes required every timestep. This communication cost consists of two parts: latency and bandwidth, where latency is the fixed cost of each communication event and bandwidth is the variable cost that depends on the amount of data transferred. Latency in internode communication is a fundamental barrier to this goal, and advancements in network latency have historically lagged improvements in other performance measures such as bandwidth and computational power [PattersonLatency]. Performance may be improved by avoiding external node communication until the domain of dependence is exhausted, allowing the calculation to advance multiple timesteps with fewer communication events. This idea is the basis of swept time-space decomposition [alhubail:16jcp, Alhubail:2016arxiv].
Extreme-scale computing clusters have recently been used to solve the compressible Navier–Stokes equations on over 1.97 million CPU cores [BM_BigSolution]. The monetary cost, power consumption, and size of such clusters impede the widespread peta- and exascale computing required for real-time, high-fidelity CFD simulations. While these are significant challenges, they also provide an opportunity to develop new tools that make better use of available hardware resources. As the authors of CFD Vision 2030 note, “High Performance Computing (HPC) hardware is progressing rapidly and is on the cusp of a paradigm shift in technology that may require a rethinking of current CFD algorithms and software” [slotnick:2014]. Using graphics processing unit (GPU) hardware as the primary computation device or accelerator in a heterogeneous system is a viable, expedient option for high-performance cluster design that helps mitigate these problems. For this reason, GPUs and other emerging coprocessor architectures are increasingly used to accelerate CFD simulations [Niemeyer:2014hn].
GPU technology has improved rapidly in recent years; in particular, NVIDIA GPUs progressed from the Kepler to the Pascal architecture in four years, doubling peak single-precision and tripling peak double-precision performance [PascalWP:2016]. In addition, GPUs have become increasingly common in clusters, such as ORNL's Titan supercomputer, over the last decade. These advances have driven the development of software capable of efficiently using and unifying the disparate architectures [Witherden20143028]. Although the ultimate motivation of our work is accelerating the solution of PDEs—particularly those relevant to fluid flow—on distributed-memory systems, here we focus on a single GPU-accelerated node/workstation in the design of the algorithms and associated software. By investigating the effectiveness of the swept rule on a workstation, we provide results applicable to simulations on a single machine as well as an initial framework for understanding the performance of the swept rule on heterogeneous computing systems.
The swept rule operates on a simple principle: do the most work possible on the values closest to the processor before communicating. In practice, this directive results in an algorithm that advances the solution in time at all spatial points using locally accessible stencil values at the previous timestep. Because the data closest to the processor is also the least widely accessible, the strict application of this principle does not always provide the best performance, but it is a useful heuristic for implementing the procedure and analyzing its performance.
This study investigates the performance characteristics of three swept rule implementations for a single GPU in a workstation. These procedures are tested on three one-dimensional PDEs with numerical schemes of varying complexity and compared with parallel CPU algorithms and unsophisticated GPU versions. Section 2 describes recent work on partitioning schemes for PDEs and communication-avoiding algorithms, especially as applied to GPUs. Section 3 gives a brief overview of GPU architecture, particularly the thread and memory hierarchies. Section 4 discusses the swept rule and our adjustments to the original algorithm in response to the details of GPU architecture. Section 5 describes the swept rule implementations in detail. In Section 6 we present the results of the tests and, lastly, draw conclusions in Section 7.
2 Related work
Alhubail and Wang introduced the swept rule for explicit, time-stepping numerical schemes applied to PDEs [MaithamRepo, alhubail:16jcp, Alhubail:2016arxiv], and our work takes their results and ideas as its starting point. The swept rule is closely related to cache optimization techniques, in particular those that use geometry to organize stencil update computation, such as parallelograms [Strzodka] and diamonds [MalasHager]. The diamond tiling method presented by Malas et al. [MalasHager] is similar to the swept rule but uses the data dependency of the grid to improve cache usage rather than to avoid communication. Concepts fundamental to this study, such as stencil optimization using domain decomposition on various architectures, are explored by Datta et al. [VolkovDatta2008]. Their work compares parallel GPU and CPU architectures and tunes the stencil algorithm with nested domain decomposition. The swept rule also has elements in common with parallel-in-time and communication-avoiding algorithms.
Parallel-in-time methods [Gander2015], such as multigrid-reduction-in-time (MGRIT) algorithms [falgout2014parallel], accelerate PDE solutions with time integrators that overcome the interdependence of solutions in the time domain, allowing parallelization over the entire space-time grid. These methods calculate the solution over the space-time domain using a coarse grid and iterate over successively finer grids to achieve the desired accuracy. The use of coarse grids in parallel-in-time methods reduces efficiency and accuracy when applied to nonlinear systems [alhubail:16jcp]. This shortcoming is intuitive: since chaotic, nonlinear systems may change suddenly in time, and coarse grids are prone to aliasing, the required grid granularity diminishes gains in performance. The swept rule arises from the same motivation, but does not seek to parallelize the computation in time or coarsen the grid during the process.
The swept rule does not alter the numerical scheme; it decomposes the domain and organizes the computation. That is, compared with a classic domain decomposition, the swept rule performs the same operations in a different order and location. In this way, communication-avoiding algorithms share many implementation details with the swept rule. Recent developments in communication-avoiding algorithms for GPUs have generally focused on matrix applications such as QR and LU factorization. The LU factorization algorithm presented by Baboulin et al. [BABOULIN201217] is motivated by the increasing use of GPU accelerators in large-scale, heterogeneous clusters. This method splits tasks between the GPU and CPU, minimizing communication between devices; the communication and computation performed on each device overlap, so all data transfer occurs asynchronously with computation. We explore this approach—overlapping data transfer with hybrid computation—in this article. The motivation and structure of our study are comparable to the work of Anderson et al. [Anderson], who developed methods for arranging and tuning computation for a single general-purpose GPU (GPGPU) in a desktop workstation without altering the basic QR factorization algorithm. Similarly, this study focuses on adapting the swept rule to a single GPU. The swept rule is a strategy for arranging the computational path of explicit numerical methods, and this work seeks to design the data structures and operations used in that path to achieve the best performance on GPUs.
3 GPU architecture and memory
The impressive parallel processing capability of modern GPUs results from an architecture originally designed to improve a computer's visual output. GPU manufacturers and third-party authors [Owens:2008ku, cudaProgGuide, Brodtkorb:2013hn, EngineerCuda] have described the features of this architecture in great detail, while others have discussed general best practices for developing efficient GPU-based algorithms [Cruz:2011gc, Niemeyer:2014hn]. Particular aspects of this architecture, such as the unique and accessible memory hierarchy, are at the core of this work, so some explanation of its relevant elements is necessary before describing the implementation details.
Programs that run on the GPU can be implemented using several software packages, the most common of which are the OpenCL and OpenACC frameworks and the CUDA parallel computing platform. These packages use different nomenclatures and are compatible with different hardware types. In this project all programs use CUDA, which is compatible only with NVIDIA GPUs; therefore, all descriptions of GPU hardware presented here use the CUDA nomenclature. CUDA programs consist of functions, referred to as kernels, launched from a C/C++ host program. The CPU executes the host code and, at runtime, specifies the size and number of blocks when calling a kernel, the stream (queue) in which the kernel will be launched, and the amount of shared memory to be allocated per block. All threads in a warp—a group of threads that execute as a single-instruction, multiple-thread (SIMT) unit—must be in the same block, so as good practice blocks should launch with some multiple of 32 threads.
The information presented here is valid for all NVIDIA GPUs with compute capability 3.0 or higher (i.e., Kepler architecture or later). The device used in this study is a Tesla K40c GPGPU with compute capability 3.5. This device contains 15 streaming multiprocessors, each capable of processing 64 warps of 32 threads, or 2048 total threads, at once [PascalWP:2016]. A maximum of 16 blocks may concurrently reside on a streaming multiprocessor, and blocks may not be split between streaming multiprocessors. While each streaming multiprocessor can support 2048 resident threads, in practice its capacity is often lower because each thread or block makes demands on limited memory resources—most notably shared memory and registers. Each streaming multiprocessor on the Tesla K40c has 48 KB of shared memory and 65536 registers available. Registers offer the fastest access but are the most limited memory type; they are private to each thread, although threads in the same warp can exchange register values using the shuffle operations available on devices with compute capability 3.5 or higher. Shared memory is a controllable portion of the L1 cache accessible only to threads within a block. As a result, for a thread to read a value stored in the shared memory of a different block, a thread with access to that value must write it to global memory where the reading thread has access. Global memory is the slowest and most plentiful memory type, and it is where data copied from the host program resides. Global memory stores all variables passed to a kernel and large arrays declared therein.
Other memory types in the CUDA memory hierarchy include constant, texture, and surface memory; of these, the work presented here uses only constant memory. Constant memory is read-only, available to all kernels for the lifetime of an application, and quick to access when all threads access the same location. This makes it a convenient and performance-conscious choice for storing constant values of the governing equations calculated at runtime [cudaProgGuide].
4 Methodology
4.1 Experimental method
The primary goal of this study is to compare the performance of the swept rule with that of a simple domain decomposition scheme, referred to as Classic, on a GPU. A domain decomposition scheme is a way of splitting up a large problem so tasks can be shared by discrete workers operating on distinct parts of the problem; in this case the decomposition divides the space-time domain of the PDE solution. We compare the performance of these methods by the time cost of each timestep, including the time required to transfer data to and from the GPU. While implementing Classic is relatively straightforward, finding the best approach for the swept rule on the GPU presents a more subtle problem.
In the original swept rule approach [alhubail:16jcp], the spatial domain is partitioned into independent pieces called “nodes” that correspond to compute nodes on a distributed system with private memory spaces. A major concern in adapting this nodal analogy to a single GPU is the type of memory allocated for the working array: the container for the values that are both the solution at and the basis for each timestep. Several approaches exist to map the original analogy for a node to a single GPU; here we explore three of them—Shared, Hybrid, and Register—which we describe in detail in Section 5.
In all approaches we map one thread to one spatial point. Handling more than one spatial point per thread would allow larger nodes but would require more resources and complicate the procedure without reducing thread idleness. Additional swept rule properties could be varied in potential implementation variants, such as the data structure holding the working values and the method used to globally synchronize threads; however, we did not vary these attributes in this study. In all cases the working array is a standard one-dimensional C array with two flattened rows, and kernel calls are implicitly synchronized by returning control to the queue in the host program.
4.2 Firstorder domain of dependence
A domain of dependence is a region of the space-time grid that can be completed by a numerical scheme from some set of initial values. For the purposes of this project, numerical schemes consist of stencil operations, in which a value at a grid point is updated based on the weighted contributions of values at grid points in its vicinity. A three-point stencil uses values at neighboring grid points only, and the timestep of any numerical scheme can be decomposed into a series of sub-timesteps that require only a three-point stencil [WangDecomp]. The numerical scheme defines the order of the domain of dependence: it increases by one for every two sub-timesteps required per timestep. Therefore, a domain of dependence is first-order if all sub-timesteps in the numerical scheme use a three-point stencil and intermediate values are required no more than two steps after they are calculated. The initial incarnation of the swept rule presented by Alhubail and Wang [alhubail:16jcp, MaithamRepo] decomposes multi-step timesteps and large stencils into sub-timesteps with a three-point stencil. This regularizes the procedure and ensures that all equations and schemes can be evaluated using a swept decomposition with a first-order domain of dependence, but requires more memory to store the intermediate values that result from each sub-timestep. To conserve limited private memory, for the GPU-based swept rule a first-order domain of dependence is applied only to schemes that require two or fewer three-point-stencil sub-timesteps per timestep. Figure 1 shows the first two stages of the swept rule using nodes with n spatial points.
At the start of the first step, shown in Figure 1a, the initial conditions are passed to the kernel and each node evaluates the solution at as many points as possible in the space-time grid; the initial domain of dependence forms a triangle. The working array stores the solutions as each node steps through time and is maintained in a fast memory space private to the node's member threads. When a node cannot advance any further, it must pass one edge to a neighboring node and retain the values of the other edge. Figure 1b shows how the second step proceeds from the first using the edge values.
For the nodes to communicate as Figure 1 illustrates, the values on the edges must be passed to global arrays available to all nodes, since any sufficiently fast memory location is private to a subset of threads and is released on kernel exit. Consequently, any data necessary to continue the computation must be shared between nodes using an intermediary container. The information needed to begin a new nodal cycle, including the values retained by each node, must be read in from, and written out to, global memory at the beginning and end of each kernel, respectively. Figure 2a shows how the working array values are stored and passed to the global arrays.
Figure 1b illustrates the reason the two edges are stored individually as left and right arrays: only one of the edges is passed between nodes. After the first step, the right edge is passed to the right neighboring node; the left edge is stationary. At the beginning of the subsequent kernel, the left and right arrays are swapped and reinserted into the working array to seed the next progression with one data transfer event from global to private memory, as shown in Figure 2b. When the right edge is passed between nodes, one node is split across the spatial boundary and must apply the boundary conditions at its center.
The swept rule allows the computation to advance by n/2 timesteps with two, rather than n/2, global memory accesses, where n is the number of threads per block (i.e., the number of spatial points per node). The procedure advances by passing values in alternating directions, zigzagging the location of the nodes in this fashion until the simulation is complete. Since the diamonds shown in Figure 1b do not store all the values at a single timestep, the simulation can only output values when a complementary triangle is computed and the final full-width local tier is returned. This kernel can only be called after the values are passed to the left, so the results can only be read out every (n/2)th timestep.
Although we have already described the working array and shown in Figure 2 how the relevant values are communicated between nodes, it is instructive to outline the performance concerns that motivate this arrangement. As Section 3 describes, the number of resident threads on a streaming multiprocessor depends on the GPU architecture and the resources requested at kernel launch. For instance, storing every double-precision value in the triangle shown in Figure 1a in shared memory would require an amount of shared memory over 100 times greater than the limit for NVIDIA GPUs with Kepler architecture; the maximum number of threads per streaming multiprocessor would be limited to 128, which would negatively impact program performance.
Figure 1a shows that the interior of the triangle is needed only to progress to the next timestep, and that edges on even and odd tiers do not overlap in the spatial domain. Thus, the triangle may be stored as a matrix with two rows, where the first and second rows contain the even and odd sub-timestep results, respectively. The interior values are overwritten once they are used, and only the edge values remain. Figure 2a shows the result of a local computation with this method. The last two values, the tips of each triangle, are copied into both arrays. At the start of the next kernel, the left and right arrays are inserted into the working array to seed successive calculations, as Figure 2b shows. The edges of the previous cycle's first row are moved to the center, and the center of the current top row is left open for the first tier of the inverted triangle. Each row requires space for n + 4 values because two edge values are required on either side to complete the stencil for the longest row, where n values are computed. The computation proceeds by filling the empty two indices in the top row, overwriting the bottom row's middle four indices, and so on. In contrast to the memory demands of storing the entire nodal computation, this method uses only two rows of n + 4 values; for a block with n = 512 threads, this requires roughly 8 KB in double precision. In general, a stencil of width w would require storing 2(n + w - 1) values in the working array. By reducing the amount of shared memory used, the kernel is less limited and achieves higher occupancy—the number of threads each streaming multiprocessor can handle simultaneously.
4.3 Higherorder domain of dependence
The first-order domain of dependence suffices for relatively simple problems. However, more complicated problems involving nonlinear equations, discontinuities, or higher-order derivatives require more sophisticated procedures. The original swept rule program calculates and stores intermediate values at sub-timesteps for higher-order schemes to avoid using larger stencils [alhubail:16jcp]. Breaking a timestep into a series of sub-timesteps allows any numerical scheme for any equation of the same dimension to be decomposed in the same way, since all stages in the computation depend on the minimum stencil—only the neighbors of the current spatial point [WangDecomp]. For example, a midpoint method, second-order in time, applied to a fourth-order differential equation would require four sub-timesteps per timestep. The first sub-timestep finds the second derivative of the dependent variable so that the second sub-timestep, the midpoint solution, requires only a three-point stencil: the second derivative and initial values at the neighboring spatial points. The third sub-timestep finds the second derivative using the midpoint solution, and the final sub-timestep completes the timestep using the results of the second and third sub-timesteps on the three-point stencil and the previous timestep's solution at the current spatial point. This approach presents a storage and data transfer problem on the GPU because values in the interior of the working array are overwritten two sub-timesteps after they are calculated. These forgotten but required values are marked in Figure 3b.
Saving four values per tier would fix the problem, but would require a larger matrix in shared memory, which would diminish occupancy and require more unnecessary values to be passed between nodes. Figure 4 shows our solution to this problem: a five-point stencil that requires two sub-timesteps per timestep—a predictor and a final step. This flattens the triangle or diamond in the time domain and requires four values per sub-timestep, but the two-row matrix may be used as described in Section 4.2 and illustrated in Figure 2 with minor adjustments. The same number of values is transferred between nodes in each communication, but inevitably more communication events are required to advance the solution. Conveniently, the predictor-corrector method ensures that all odd tiers, the second matrix row, contain predictor values, while the bottom row holds final values.
The problem that motivates our adjustment to the swept rule results from the combination of the midpoint method and the five-point stencil discretization; either circumstance alone could be accommodated by the method described in the previous section. Generally, the decomposition must be adjusted in this fashion when there are more than two sub-timesteps per timestep, and it would need to be adjusted further for problems with more than four sub-timesteps per timestep.
5 Implementation
5.1 Swept rule variants
While the order of the swept domain of dependence is a natural consequence of the equations and numerical scheme, the computational hierarchy available on GPUs allows this structure to be implemented in different ways. To investigate the effects of the swept rule more thoroughly, we consider three versions—Shared, Hybrid, and Register—and compare them with a basic decomposition algorithm, Classic. In all of these programs, one GPU thread is assigned to one spatial point.
The Classic algorithm is a naive GPU implementation of the numerical solution and the baseline against which the efficacy of the swept rule is measured. It advances one sub-timestep per kernel call and uses global memory to store the working array. Figure 6 shows the structure of the procedure, which is similar for all problems and discretizations.
We consider the Shared strategy the most natural way to map the analogy of CPU nodes to GPU architecture. It is applied to every test case and is considered the “default” swept rule GPU version for comparing the performance of the GPU programs with their MPI-based CPU counterparts [MaithamRepo]. The Shared version treats a block as a node and uses shared memory for the working array. Each block has exclusive access to a shared memory space and may contain up to 1024 threads, so it has access to fast memory and the capacity for various node sizes. This version has some drawbacks: a high number of idle threads for sub-timesteps where the domain of dependence contains relatively few spatial points, poor utilization of CPU resources, and the fact that shared memory is not the fastest memory type [harris:2014].
The Hybrid strategy uses the same GPU procedure as Shared, but uses the CPU to compute the node that is split across the boundary, as seen in Figure 1b. Transfers between the host and device are costly operations, but the devices can execute instructions concurrently, so if the CPU can complete the boundary node before the GPU finishes the other nodes, no penalty arises. In this study, we apply this strategy to problems with non-periodic boundary conditions as a way to mitigate the underutilization of the CPU and the thread divergence that results from applying boundary conditions in a GPU kernel.
The Register approach is applied to problems with periodic boundary conditions because of the difficulty of applying boundary conditions using warp shuffle functions. This implementation limits the number of points in a node to the size of a warp, 32 threads (a value that has remained constant over several generations of NVIDIA GPUs). In this version the values are read from global memory to shared memory and then to the registers. Passing the values through intermediate shared memory is necessary because the shuffle operations that trade registers between threads operate only on the active threads in the warp; if some threads are masked, they are unable to supply the necessary stencil values. Thus, data must be moved between memory levels at each tier rather than once at the start and end of the kernel. This seriously limits the Register approach, but it still warrants exploration since registers are the fastest memory type.
5.2 Test cases
We present three test cases to demonstrate the performance and functionality of the GPU-based swept rule in one spatial dimension: the heat equation, the Kuramoto–Sivashinsky (KS) equation, and the Euler equations for compressible flow. Appendices B–D contain the full derivations of the numerical solution procedures. First, we chose the heat equation for its simplicity and familiarity; here it is discretized with a first-order scheme using forward differencing in time and central differencing in space. Next, we selected the KS equation to demonstrate the swept rule for higher-order, nonlinear PDEs. We discretized the KS equation with second-order central differencing in space, which requires a five-point stencil, and a second-order Runge–Kutta method in time. Lastly, we chose the Euler equations, a system of quasilinear hyperbolic equations describing compressible, inviscid flow. The conservative form of these equations is applied to the Sod shock tube problem to demonstrate the application of the swept rule to a canonical CFD problem involving discontinuities and several dependent variables. These equations are discretized with a second-order finite-volume scheme in space and a second-order method in time. The heat equation requires a first-order domain of dependence, while the KS and Euler equations require second-order domains of dependence. These problems also provide examples of various types of boundary conditions. The heat and Euler equations are solved with reflective and Dirichlet boundary conditions, respectively; therefore, the boundary conditions must be imposed with control flow. The KS equation uses periodic boundary conditions, a convenient formulation for the swept rule since a node is simply split across the boundary.
6 Results and discussion
All tests presented here were performed on a single workstation with a Tesla K40c GPU and an Intel Xeon E5-2630 CPU with eight cores and 16 potential threads. The GPU-based swept rule algorithms and test cases were implemented in 1DSweptCUDA v2 [MyRepo]. For the results presented here, each program was executed in double precision over a range of threads per block and numbers of spatial points. Each run advanced a fixed number of timesteps and recorded the average time per timestep; the initial GPU memory allocations and data transfers between the host and device are included in the overall time measurement. We then collected the best time per timestep for each number of spatial points. We repeated this procedure five times and averaged the results to obtain those presented here; there were no significant differences between repeated tests of the same configuration.
Figures 7a and 8a show the execution time per timestep for all algorithms applied to the heat and KS equations. When applied to these problems, the swept rule algorithm outperforms the Classic procedure, and the GPU-only shared memory version, Shared, is faster than the alternate swept rule versions. Figures 7b and 8b show the speedup of the swept rule programs, i.e., the ratio of time costs relative to the Classic version. Both cases exhibit similar performance patterns: Shared generally provides a larger speedup for small spatial domains and a smaller one for large domains. Figures 7a and 8a show the performance trends of the algorithms with respect to spatial domain size: the time costs are insensitive to the domain size at smaller sizes and grow linearly once the domain exceeds a threshold size. The similarity in the performance trends of the heat and KS programs is intuitive; although the KS equation is nonlinear, fourth-order, and discretized with a higher-order scheme, both are continuous scalar equations solved with finite difference schemes.
The speedup of the swept rule declines as the number of spatial points in the domain increases, caused primarily by occupancy and the fixed cost of kernel launch. The occupancy of a kernel is the number of threads that can reside concurrently on a streaming multiprocessor at launch. This quantity is determined by the block size, since blocks of threads are indivisible, and by the resources requested by the kernel. The swept rule requests more resources than Classic—specifically shared memory and registers—to store the working array and carry out its procedure, which involves more steps. If more resources are requested than are available on the device's streaming multiprocessors, the GPU launches waves of blocks. If the occupancy is limited by resource allocation, the kernel must begin launching waves of blocks at smaller spatial domain sizes. Each wave must wait until the previous one completes before beginning; in theory, two waves cost about twice as much as one. As the number of spatial points in the domain grows, more blocks are launched, and the number of waves per kernel call increases more quickly for the swept rule programs. This conclusion arises from the observation in Figures 7a and 8a that the time cost of the swept rule versions begins growing linearly at a smaller spatial domain size than that of Classic.
By timing the launch of many empty kernels and averaging, we measured the fixed cost of kernel launch on the testing workstation to be about . We conclude that this cost dominates the performance of Classic at small problem sizes because it is quite close to the cost of each timestep for spatial domains with fewer than spatial points for both the heat and KS equations. This fixed cost accounts for a smaller share of the time cost at larger spatial domain sizes, where all kernels must launch several waves of blocks, so the portion of the swept rule speedup that comes from avoiding kernel launches becomes negligible and the overall speedup of the swept rule is reduced.
Each of the swept rule variants experiences these trends, but in all cases the Shared version performs better. Figure 7 shows the heat equation test, where the Hybrid version performs as well as Shared at large spatial domain sizes and 2–3 times worse at small ones. The Hybrid version uses the same GPU kernels as Shared but uses the CPU to compute the first node when it is split across the boundary. At smaller domain sizes, the data transfer between the host and device dominates the cost of the Hybrid scheme. At larger domain sizes, this cost drops in comparison with the overall cost, but at the same time the thread divergence caused by handling boundary conditions also drops in importance. Thus, the Shared and Hybrid methods perform similarly at larger domain sizes (i.e., above spatial points).
Figure 8 shows the KS equation test where, as with the heat equation, the Shared version performs best in all cases. While registers are the fastest memory type, memory access rules force the Register version to use a warp as a node, which limits the domain of dependence to 32 spatial points. A Register program must therefore launch four times as many kernels to advance the same number of timesteps as Shared at its most common best block size. Despite these differences, the Register version exhibits a performance trend similar to Shared, which outperforms it at all spatial domain sizes. The limit on node size results in a constant four timesteps per cycle for the KS equation, which limits the reduction in the number of communication events relative to Classic.
In contrast to the previous test cases, Figure 9 shows that the classic decomposition outperforms the swept rule at all problem sizes when applied to the Euler equations. This result differs from the findings of Alhubail and Wang alhubail:16jcp for the swept rule implemented only with MPI on CPUs, which produces roughly equivalent speedups for both the KS and Euler equations. The GPU implementation of the swept rule for the Euler problem involves much greater arithmetic intensity than the other problems, causing greater low-level memory usage. This limits the performance of the swept scheme more on a GPU than on a CPU, reducing the benefit of the scheme over the classic approach within a single GPU compared with the other problems, which involve higher ratios of floating-point operations to memory accesses (both read and write). This penalty particularly degrades performance as the intra-domain timesteps proceed and nodes become inactive.
With regard to the potential performance of the swept rule, this result is not encouraging for more complex problems in higher dimensions. However, at the larger scale of a cluster, where global memory holds the working array and the communication to be avoided traverses a physical network between processors, we expect the swept rule to benefit these types of problems.
Figure 10 compares the CPU-based KS equation programs with the GPU algorithms, Classic and Shared. The CPU version is parallelized with MPI, similar to that originally presented by Alhubail and Wang MaithamRepo . Figure 10 refers to the Shared program as SweptGPU because it serves as the “default” swept rule version on the GPU for comparison with the CPU-based swept rule. The original CPU program was designed for a distributed-memory cluster, but Alhubail and Wang's original performance study used only two processes on two CPUs alhubail:16jcp . The parallel CPU program was tested similarly to the GPU programs: it was run on the same workstation with the same spatial domain sizes and – threads. Here we compare the best results for each number of spatial points, which usually occurred with 16 threads. This did not degrade performance compared with the original study; in fact, both CPU versions of the program performed significantly better, and for most spatial domain sizes the CPU-based swept rule improved by about ×. Since the test used up to eight times the number of threads, this result supports the validity of repurposing this code and comparing the results with the GPU versions.
Figure 10 illustrates the performance benefits of the GPU architecture more than of the swept rule itself. On small spatial domains (e.g., fewer than points), the GPU can assign one thread to each spatial point and process all of them in a single wave; there it improves performance over the CPU version by less than ×, because the work does not fully utilize the GPU. On larger spatial domains (e.g., more than points), where the work fully utilizes the GPU's resources, the performance gains continue growing with problem size. The growth of the GPU programs' improvement with the number of points in the spatial domain inverts the trend of the swept rule's improvement relative to the Classic kernel. Both GPU algorithm types show a speedup of about × for the smallest spatial domain; at the other end, the speedup grows to about × for the swept and × for the classic domain decomposition.
7 Conclusions
In this study we compared the time per timestep of three swept rule time-space decomposition implementations against a simple domain decomposition scheme, Classic, on GPU accelerators. These programs were evaluated over a range of spatial domain sizes. The Classic and Shared programs were also compared with their CPU-based counterparts parallelized with MPI.
Generally, the swept rule performs better relative to Classic when the kernel launch cost is a substantial element of the program cost and the spatial grid has fewer points than the number of threads the GPU can launch concurrently. Once this threshold is exceeded, the launch cost penalty becomes negligible, and greater resource usage penalizes the swept rule in the form of reduced occupancy and reduced availability of the L1 cache, which is reserved for shared memory. But, like the initial performance penalty for the Classic program, the resource usage penalty does not scale, and all programs see their time cost rise nearly linearly with spatial domain size after about spatial points.
Both alternate swept-rule procedures, Hybrid and Register, performed slightly worse than the Shared version. We attribute the failure of the Hybrid routine to improve performance to improvements made in the handling of the boundary conditions in Shared, rather than to the cost of host-to-device communication, which is handled with asynchronous streams. Though the Hybrid version does not improve performance, it does not substantially degrade it either, especially at large spatial domain sizes. Ultimately, the Hybrid approach seeks to solve a problem that is more efficiently solved within the GPU paradigm. The Register computation shows more promise but falls short because of its limited node size.
This study also shows that implementing a Classic decomposition on the GPU for an explicit numerical scheme is simple and may yield noticeable performance improvements. This may suffice for many applications, but in performance-critical cases the swept rule may further reduce execution time. The performance of the Classic decomposition programs may be improved by altering the synchronization method: implicit synchronization incurs the cost of a kernel launch at each sub-timestep, and at small problem sizes this kernel executes so quickly that synchronization cost dominates the program's performance. Using off-the-shelf, GPU-based synchronization GlobalLock could also provide performance benefits for the swept rule, in particular for Register, which is also limited by kernel launch cost at small spatial domains.
The specific results presented here depend on the hardware used. A commercial GeForce GPU with the Kepler architecture would perform worse than the Tesla K40c, which is designed for computation rather than graphics. Despite the significant performance increase shown here, the Tesla K40c is three-year-old technology. The current state-of-the-art GPGPU with the Pascal architecture, the Tesla P100, offers twice the K40c's base clock speed, nearly four times the number of streaming multiprocessors, and additional shared memory independent of the L1 cache. Additional dedicated shared memory could dramatically improve the swept rule's performance for the Euler equations. We predict that this device could at least halve the execution times shown here and maintain insensitivity to problem size up to over spatial points, potentially resulting in speedups on the order of over CPU parallel versions. Future GPU accelerators could further improve performance, as could devices with similar SIMT architectures such as the Intel MIC/Xeon Phi.
Our future work will focus on further developing the swept rule for distributed-memory systems with heterogeneous node architectures and on implementing the swept rule in higher dimensions on the GPU. We expect the performance benefits of the swept rule to increase when it is applied to a distributed-memory system composed of nodes with CPUs and GPUs, where network communications incur more latency than intra-GPU memory access. However, we anticipate that the performance benefits of the swept rule will diminish in higher dimensions; in two dimensions, the swept rule requires three stages of inter-node communication to advance one cycle, analogous to a single diamond in one dimension. This, along with the associated increase in required memory allocation, arithmetic intensity, and kernel launch events, could limit the performance of the swept rule, as observed by Alhubail et al. Alhubail:2016arxiv . In spite of these challenges, recent developments such as grid-wide synchronization and increased shared memory capacity could aid the performance of the swept rule and permit the exploration of additional design options.
Appendix A Availability of materials
The results for this article were obtained using 1DSweptCUDA v2 MyRepo . All figures, and the data and plotting scripts necessary to reproduce them, are available openly under the CC-BY license FigsShared .
Appendix B Heat equation
The unsteady heat conduction equation without volumetric heat flux is
(1) \frac{\partial T}{\partial t} = \alpha \nabla^2 T
where $T$ is temperature, $t$ is time, and $\alpha$ is thermal diffusivity. In one dimension, this reduces to
(2) \frac{\partial T}{\partial t} = \alpha \frac{\partial^2 T}{\partial x^2}
Discretizing Eq. (2) with forward differencing in time and central in space yields
(3) \frac{T_i^{m+1} - T_i^m}{\Delta t} = \alpha \, \frac{T_{i-1}^m - 2 T_i^m + T_{i+1}^m}{\Delta x^2}
where $i$ is the spatial node index and $m$ is the time index corresponding to time $t = m \Delta t$. This is a first-order, explicit, finite-difference approximation. To step forward in time, define the Fourier number, $\mathrm{Fo} = \alpha \Delta t / \Delta x^2$, where $\Delta t$ is the timestep size and $\Delta x$ is the spatial grid size, and solve for temperature at the next timestep:
(4) T_i^{m+1} = T_i^m + \mathrm{Fo} \left( T_{i-1}^m - 2 T_i^m + T_{i+1}^m \right)
In this study the approximation is evaluated with insulated boundary conditions at both ends and $n$ spatial points:
(5) T_0^{m+1} = T_0^m + 2 \, \mathrm{Fo} \left( T_1^m - T_0^m \right), \qquad T_{n-1}^{m+1} = T_{n-1}^m + 2 \, \mathrm{Fo} \left( T_{n-2}^m - T_{n-1}^m \right)
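The update of Eqs. (4)–(5) can be sketched serially in Python with NumPy. This is only an illustration, not the paper's CUDA implementation; the mirrored-neighbor treatment of the insulated ends is one common choice and an assumption of this sketch.

```python
import numpy as np

def step_heat(T, fo):
    """One forward-time, central-space update (Eq. (4));
    fo is the Fourier number, alpha*dt/dx**2."""
    Tn = np.empty_like(T)
    Tn[1:-1] = T[1:-1] + fo * (T[:-2] - 2.0 * T[1:-1] + T[2:])
    # Insulated ends (Eq. (5)): the ghost point mirrors the interior neighbor.
    Tn[0] = T[0] + 2.0 * fo * (T[1] - T[0])
    Tn[-1] = T[-1] + 2.0 * fo * (T[-2] - T[-1])
    return Tn

# Example: a centered spike diffuses toward a uniform steady state.
T = np.zeros(64)
T[32] = 1.0
for _ in range(1000):
    T = step_heat(T, 0.25)  # explicit scheme is stable for fo <= 1/2
```

A uniform temperature field is a fixed point of this update, and any initial profile flattens over time, as expected for insulated ends.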
Appendix C Kuramoto–Sivashinsky equation
The Kuramoto–Sivashinsky equation is a nonlinear, fourthorder, onedimensional unsteady PDE:
(6) \frac{\partial u}{\partial t} = -u \frac{\partial u}{\partial x} - \frac{\partial^2 u}{\partial x^2} - \frac{\partial^4 u}{\partial x^4}
where $u$ is the dependent chaotic variable (e.g., fluid velocity). It is discretized similarly to the heat equation, as shown in Eq. (7), with central differencing in space. In this case, decomposing the fourth spatial derivative requires a five-point stencil:
(7) \frac{\partial u_i}{\partial t} = -u_i \frac{u_{i+1} - u_{i-1}}{2 \Delta x} - \frac{u_{i-1} - 2 u_i + u_{i+1}}{\Delta x^2} - \frac{u_{i-2} - 4 u_{i-1} + 6 u_i - 4 u_{i+1} + u_{i+2}}{\Delta x^4}
The chaotic nature of the problem necessitates a higherorder scheme overall; therefore, an explicit, secondorder Runge–Kutta scheme, also known as the midpoint method, is applied to the time domain.
Let the right-hand side of Eq. (7) be $f(u)$; the predictor solution is then found at $t + \Delta t / 2$:
(8) u_i^{m+1/2} = u_i^m + \frac{\Delta t}{2} f(u^m)_i
Then $f(u^{m+1/2})$ may be evaluated and added to $u^m$ to obtain $u^{m+1}$:
(9) u_i^{m+1} = u_i^m + \Delta t \, f(u^{m+1/2})_i
The problem demonstrated here uses periodic initial and boundary conditions. That is, the stencil at point $0$ includes points $n-1$ and $n-2$, and the initial condition is periodic and continuous at the spatial boundaries.
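A minimal serial sketch of this scheme, with periodic wrap-around implemented via `np.roll`, follows; the grid and timestep values are illustrative assumptions (the explicit fourth-derivative term imposes a severe timestep restriction), and this is not the paper's CUDA implementation.

```python
import numpy as np

def ks_rhs(u, dx):
    """Right-hand side of Eq. (7): central differences with
    periodic wrap-around for the five-point stencil."""
    um1, up1 = np.roll(u, 1), np.roll(u, -1)
    um2, up2 = np.roll(u, 2), np.roll(u, -2)
    conv = -u * (up1 - um1) / (2.0 * dx)
    diff2 = -(um1 - 2.0 * u + up1) / dx**2
    diff4 = -(um2 - 4.0 * um1 + 6.0 * u - 4.0 * up1 + up2) / dx**4
    return conv + diff2 + diff4

def step_ks(u, dt, dx):
    """Second-order Runge-Kutta (midpoint): predictor at t + dt/2
    (Eq. (8)), then a full step using the midpoint slope (Eq. (9))."""
    u_half = u + 0.5 * dt * ks_rhs(u, dx)
    return u + dt * ks_rhs(u_half, dx)

# Example: a small perturbation on a periodic domain of length 32.
n = 128
x = np.linspace(0.0, 32.0, n, endpoint=False)
u = 0.1 * np.cos(2.0 * np.pi * x / 32.0)
dx, dt = x[1] - x[0], 2e-4  # dt limited by the explicit 4th-derivative term
for _ in range(100):
    u = step_ks(u, dt, dx)
```

Note that a spatially constant state has zero right-hand side, since every difference in Eq. (7) vanishes.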
Appendix D Euler equations (Sod shock tube)
The Sod shock tube problem is a one-dimensional, unsteady, compressible flow problem based on the nonlinear, hyperbolic Euler equations:
(10) \frac{\partial Q}{\partial t} + \frac{\partial F}{\partial x} = 0
where $Q$ is the vector of conserved variables,
(11) Q = \begin{pmatrix} \rho \\ \rho u \\ e \end{pmatrix}
and $F$ is the flux vector,
(12) F = \begin{pmatrix} \rho u \\ \rho u^2 + p \\ u (e + p) \end{pmatrix}
$\rho$ is density, $e$ is energy, $u$ is velocity, $p$ is pressure given by
(13) p = (\gamma - 1) \left( e - \frac{\rho u^2}{2} \right)
and $\gamma = 1.4$ is the heat capacity ratio of air.
The initial conditions, given in Eq. (14), are constant values for the state variables on either side of a diaphragm separating two parcels of the same fluid. The spatial boundary conditions hold these values fixed at their respective ends of the tube:
(14) \left( \rho, u, p \right) = \begin{cases} (1, 0, 1) & 0 \leq x \leq 0.5 \\ (0.125, 0, 0.1) & 0.5 < x \leq 1 \end{cases}
The equations are discretized using a second-order, finite-volume scheme with cell-centered values. The first step in the solution is evaluating the pressure ratio at the current timestep over a five-point stencil:
(15) P_i = \frac{p_{i+1} - p_i}{p_i - p_{i-1}}
This value is used with a minmod limiter, $\phi(P) = \max\left(0, \min(P, 1)\right)$, to compute reconstructed values on both sides, $Q_L$ and $Q_R$, of the current cell boundaries at $x_{i \pm 1/2}$:
(16) Q_{n,L} = Q_o + \frac{1}{2} \, \phi(P_o) \left( Q_o - Q_{o-1} \right)
(17) Q_{n,R} = Q_o - \frac{1}{2} \, \phi(P_o) \left( Q_{o+1} - Q_o \right)
where subscript $n$ refers to the reconstructed, or new, values on the edge of the interface and subscript $o$ refers to the original values. For example, at $x_{i+1/2}$ the original values on the left side of the interface are at $x_i$. These reconstructed boundaries do not represent solutions for any grid cell; they are temporary values that interpolate the solutions.
Once we have the reconstructed values on either side of the interface, we can calculate the flux at that cell boundary with
(18) F_{i+1/2} = \frac{1}{2} \left( F(Q_{n,L}) + F(Q_{n,R}) \right) + \frac{r_{sp}}{2} \left( Q_{n,L} - Q_{n,R} \right)
where $F$ is given by Eq. (12) and $r_{sp}$ is the spectral radius, the largest eigenvalue of the Jacobian matrix $\partial F / \partial Q$. The spectral radius can be found with the Roe average at the interface,
(19) \tilde{Q} = \frac{\sqrt{\rho_L} \, Q_{n,L} + \sqrt{\rho_R} \, Q_{n,R}}{\sqrt{\rho_L} + \sqrt{\rho_R}}
and with $\tilde{p}$ computed from $\tilde{Q}$ using Eq. (13). The spectral radius is then given by
(20) r_{sp} = |\tilde{u}| + \sqrt{\gamma \tilde{p} / \tilde{\rho}}
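The flux evaluation of Eqs. (18)–(20), with pressure from Eq. (13), can be sketched serially in Python. This is a hedged illustration, not the paper's CUDA implementation: the reconstruction step is omitted and the interface states are supplied directly, here using the standard Sod left and right states.

```python
import numpy as np

GAMMA = 1.4  # heat capacity ratio of air

def pressure(q):
    """Eq. (13): p = (gamma - 1)(e - rho*u^2/2) for q = [rho, rho*u, e]."""
    rho, mom, e = q
    return (GAMMA - 1.0) * (e - 0.5 * mom**2 / rho)

def flux(q):
    """Eq. (12): physical flux F(Q)."""
    rho, mom, e = q
    u, p = mom / rho, pressure(q)
    return np.array([mom, mom * u + p, u * (e + p)])

def spectral_radius(qL, qR):
    """Eqs. (19)-(20): |u| + a evaluated at the Roe average of the states."""
    wL, wR = np.sqrt(qL[0]), np.sqrt(qR[0])
    q_avg = (wL * qL + wR * qR) / (wL + wR)
    a = np.sqrt(GAMMA * pressure(q_avg) / q_avg[0])  # speed of sound
    return abs(q_avg[1] / q_avg[0]) + a

def interface_flux(qL, qR):
    """Eq. (18): averaged flux plus spectral-radius dissipation."""
    return 0.5 * (flux(qL) + flux(qR)) + 0.5 * spectral_radius(qL, qR) * (qL - qR)

# Sod states: (rho, u, p) = (1, 0, 1) left and (0.125, 0, 0.1) right,
# converted to conserved variables with e = p/(gamma - 1) + rho*u^2/2.
left = np.array([1.0, 0.0, 1.0 / (GAMMA - 1.0)])
right = np.array([0.125, 0.0, 0.1 / (GAMMA - 1.0)])
f = interface_flux(left, right)
```

With identical states on both sides, the dissipation term vanishes and the interface flux reduces to the physical flux, which is a useful sanity check on the form of Eq. (18).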
Acknowledgments
This material is based upon work supported by NASA under award No. NNX15AU66A under the technical monitoring of Drs. Eric Nielsen and Mujeeb Malik. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40c GPU used for this research.
References
 [1] J. P. Slotnick, A. Khodadoust, J. J. Alonso, D. L. Darmofal, W. D. Gropp, E. A. Lurie, D. J. Mavriplis, CFD vision 2030 study: A path to revolutionary computational aerosciences, NASA Technical Report, NASA/CR-2014-218178, NF1676L-18332 (Mar. 2014).
 [2] D. A. Patterson, Latency lags bandwidth, Commun. ACM 47 (10) (2004) 71–75. doi:10.1145/1022594.1022596.
 [3] M. Alhubail, Q. Wang, The swept rule for breaking the latency barrier in time advancing PDEs, Journal of Computational Physics 307 (2016) 110–121. doi:10.1016/j.jcp.2015.11.026.
 [4] M. M. Alhubail, Q. Wang, J. Williams, The swept rule for breaking the latency barrier in time advancing two-dimensional PDEs, arXiv:1602.07558 [cs.NA] (2016).
 [5] I. Bermejo-Moreno, J. Bodart, J. Larsson, B. M. Barney, J. W. Nichols, S. Jones, Solving the compressible Navier–Stokes equations on up to 1.97 million cores and 4.1 trillion grid points, in: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’13, ACM, New York, NY, USA, 2013, pp. 62:1–62:10. doi:10.1145/2503210.2503265.
 [6] K. E. Niemeyer, C. J. Sung, Recent progress and challenges in exploiting graphics processors in computational fluid dynamics, Journal of Supercomputing 67 (2) (2014) 528–564. doi:10.1007/s11227-013-1015-7.
 [7] NVIDIA Corporation, Whitepaper: NVIDIA Tesla P100, http://www.nvidia.com/object/pascal-architecture-whitepaper.html (2016).
 [8] F. Witherden, A. Farrington, P. Vincent, PyFR: An open source framework for solving advection–diffusion type problems on streaming architectures using the flux reconstruction approach, Computer Physics Communications 185 (11) (2014) 3028–3040. doi:10.1016/j.cpc.2014.07.011.
 [9] M. Alhubail, Q. Wang, KS_1D_Swept, git commit e575d73, https://github.com/hubailmm/KS_1D_Swept (2015).
 [10] R. Strzodka, M. Shaheen, D. Pajak, H.-P. Seidel, Cache oblivious parallelograms in iterative stencil computations, in: Proceedings of the 24th ACM International Conference on Supercomputing, ICS ’10, ACM, New York, NY, USA, 2010, pp. 49–59. doi:10.1145/1810085.1810096.
 [11] T. Malas, G. Hager, H. Ltaief, H. Stengel, G. Wellein, D. Keyes, Multicore-optimized wavefront diamond blocking for optimizing stencil updates, SIAM Journal on Scientific Computing 37 (4) (2015) C439–C464. doi:10.1137/140991133.

 [12] K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter, L. Oliker, D. Patterson, J. Shalf, K. Yelick, Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures, in: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC ’08, IEEE Press, Piscataway, NJ, USA, 2008, pp. 4:1–4:12. URL http://dl.acm.org/citation.cfm?id=1413370.1413375
 [13] M. J. Gander, 50 years of time parallel time integration, in: T. Carraro, M. Geiger, S. Körkel, R. Rannacher (Eds.), Multiple Shooting and Time Domain Decomposition Methods, Vol. 9 of Contributions in Mathematical and Computational Sciences, Springer, Cham, 2015, pp. 69–113. doi:10.1007/978-3-319-23321-5_3.
 [14] R. Falgout, S. Friedhoff, T. Kolev, S. MacLachlan, J. B. Schroder, Parallel time integration with multigrid, PAMM 14 (2014) 951–952. doi:10.1002/pamm.201410456.
 [15] M. Baboulin, S. Donfack, J. Dongarra, L. Grigori, A. Rémy, S. Tomov, A class of communication-avoiding algorithms for solving general dense linear systems on CPU/GPU parallel machines, Procedia Computer Science 9 (2012) 17–26. doi:10.1016/j.procs.2012.04.003.

 [16] M. Anderson, G. Ballard, J. Demmel, K. Keutzer, Communication-avoiding QR decomposition for GPUs, in: 2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2011, pp. 48–58. doi:10.1109/IPDPS.2011.15.
 [17] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, J. C. Phillips, GPU computing, Proc. IEEE 96 (5) (2008) 879–899. doi:10.1109/JPROC.2008.917757.

 [18] NVIDIA Corporation, CUDA C programming guide, version 8.0 (2016). URL http://docs.nvidia.com/cuda/cuda-c-programming-guide
 [19] A. R. Brodtkorb, T. R. Hagen, M. L. Sætra, Graphics processing unit (GPU) programming strategies and trends in GPU computing, J. Parallel Distrib. Comput. 73 (1) (2013) 4–13. doi:10.1016/j.jpdc.2012.04.003.
 [20] D. Storti, M. Yurtoglu, CUDA for Engineers: An Introduction to High-Performance Parallel Computing, Addison-Wesley, 2015.
 [21] F. A. Cruz, S. K. Layton, L. A. Barba, How to obtain efficient GPU kernels: An illustration using FMM & FGT algorithms, Comput. Phys. Comm. 182 (10) (2011) 2084–2098. doi:10.1016/j.cpc.2011.05.002.
 [22] D. J. Magee, K. E. Niemeyer, Data, plotting scripts, and figures for “Accelerating solutions of PDEs with GPU-based swept time-space decomposition” (Nov. 2017). doi:10.6084/m9.figshare.4968050.v4.
 [23] Q. Wang, Decomposition of stencil update formula into atomic stages, arXiv:1606.00721 [math.NA] (2017).
 [24] M. Harris, CUDA pro tip: Do the Kepler shuffle, https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-kepler-shuffle/, accessed: 3 June 2016 (Feb. 2014).
 [25] D. J. Magee, K. E. Niemeyer, NiemeyerResearchGroup/1DSweptCUDA v2 (May 2017). doi:10.5281/zenodo.570984.
 [26] S. Xiao, W. C. Feng, Inter-block GPU communication via fast barrier synchronization, in: 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2010, pp. 1–12. doi:10.1109/IPDPS.2010.5470477.