1 Introduction
Graphics processing units (GPUs) have emerged as a popular hardware architecture for scientific computing. While originally introduced for realtime graphics rendering in the gaming industry, GPUs were found to offer the computational horsepower and high memory bandwidth necessary to accelerate large scale computations which have traditionally been performed using CPUs. GPU computing has permeated many fields such as finance [Finance2010], computational chemistry [Chemistry2013] and fluid dynamics [GPUGems2004].
Simulation work often requires approximating the solution to a partial differential equation (PDE) governing some physical phenomenon in an efficient manner. This usually involves applying some numerical method to reduce the PDE into a set of linear equations. Solving the resulting linear system is often the bottleneck of many simulations. Improving the speed of linear solvers is essential to improving the value of simulation work. For this reason, there has been a tremendous effort to develop efficient GPU implementations of linear solver algorithms. As some examples, conjugate gradient solvers for sparse unstructured matrices and a multigrid solver for regular grids using GPU textures were developed by Bolz et al. [Bolz2003]
. To improve the performance of preconditioned conjugate gradient and GMRES, Li and Saad developed an efficient sparse matrix vector product on the GPU which exhibited up to eight times speedup relative to a parallel CPU implementation on several matrices from the SuiteSparse matrix collection
[Li2013].Jacobi iteration has been a popular iterative method of study on the GPU. While slow to converge compared to Krylov methods, Jacobi iteration is completely parallel making it quite amenable to GPU implementation [Golub2007]. Furthermore, Jacobi iteration and other similar stationary iterative methods form the backbone of highly effective geometric and algebraic multigrid methods [Briggs2000]. A number of studies have been conducted to assess the performance of Jacobi iteration on the GPU. Zhang et al. study the performance of a CPU and CUDA based GPU implementation of Jacobi iteration on a dense matrix. They observe a speedup of 19 times in double precision (59 times in floating point precision) for a matrix size of 7680 (when all GPU threads are operating) compared to the CPU performance [Zhang2009]. Wang et al. perform studies of Jacobi iteration on sparse matrices on the GPU. Their implementation uses shared memory to store the solution at each step, and they observe a speedup factor of up to 100 times relative to their CPU implementation [Wang2009]. Amorim et al. apply a weighted version of Jacobi iteration to a structured matrix resulting from a finite element discretization of a set of PDEs from cardiac modeling. They compare the performance of an OpenGL implementation to several CUDA implementations with global coalesced memory accesses, shared memory and texture memory, and find that a CUDA implementation with texture memory yields the best performance [Amorim2009]. Jacobi iteration on a 2D Laplace equation is studied by Cecilia et al. They measure the time to perform a set number of iterations in CUDA using several code optimizations which involve the use of shared memory, and observe a three to four times speedup for various domain sizes compared to their unoptimized CUDA implementation (and even higher speedups relative to their CPU implementation) [Cecilia2012]. Several multiCPU, multiGPU and hybrid CPU/GPU implementations of Jacobi iteration on the 2D Poisson equation in a structured domain are developed by Venkatasubramanian and Vuduc. Their best implementation is able to achieve GPU bandwidth. To further improve performance, they develop asynchronous implementations which remove the need for complete synchronization each step, leading to a 1.2 to 2.5 times speedup over their previously optimal implementation [Vuduc2009]. Ahamed and Magoulés develop a Krylovlike version of Jacobi iteration and apply it to several three dimensional problems (Laplace equation, gravitational potential equation, heat equation) on the GPU, observing a 23 times speedup using this new version compared to the serial CPUbased algorithm [Ahamed2016].
The studies listed above detail the application of Jacobi iteration on the GPU to a variety of problems, such as 2D structured problems and full 3D unstructured problems. In all cases, the studies highlight the need for efficient memory access for optimal performance. Memory bandwidth, not computational complexity, has emerged as the limiting factor in scientific simulations. Therefore, it is essential that applications utilizing the GPU respect its memory hierarchy in an effort to achieve the best performance.
In this work, we develop an algorithm to perform the Jacobi iterative method efficiently using onchip GPU shared memory, which is quicker to access than GPU global memory. We focus specifically on structured matrices. Shared memory is accessible on the GPU as a cachelike memory which has much higher bandwidth compared to global memory which is traditionally used on the GPU [Hwu2016]. However, shared memory is typically very small (16 kB  48 kB based on settings and device compute capability [Harris2013]) and only exists per GPU block. This inspires a domain decomposition style algorithm where the domain is separated into subdomains which are allocated to the shared memory of different GPU blocks. Within the blocks, we perform many iterations of Jacobi, avoiding communication between GPU blocks at every step. This results in a hierarchical algorithm in which every cycle, multiple Jacobi iterations are performed independently within the shared memory of each GPU block (termed subiterations) and copied back to global memory (which holds the most up to date solution at the end of the cycle). Previous studies have generally focused on the performance of the standard Jacobi iteration using shared memory. However, this hierarchical approach results in a variation of Jacobi iteration that demonstrates better performance in parallel settings where a memory hierarchy exists (such as on the GPU). We report comparisons between a classical GPU implementation of Jacobi iteration as well as this shared memory approach.
The rest of our paper is structured as follows. Section 2 provides background for this work; a review of the Jacobi iterative method for solving linear systems of equations and an overview of the CUDA framework for GPU computing. Section 3 introduces various approaches for performing Jacobi iteration, including our new shared memory approach. and provides performance comparisons between this approach and a traditional GPU approach with global memory for the 1D Poisson equation. Section 4 presents a similar algorithm and comparison results in the context of 2D problems. Lastly, Section 5 provides concluding remarks on this work.
2 Background
2.1 Jacobi Iterative Method
Jacobi iteration is an iterative method used to solve a linear system where and , . The Jacobi iterative method is an example of a stationary iterative method which involves a matrix splitting of the form [ParallelIterativeAlgorithms2007]. Then the iterative update is given by:
(1) 
For Jacobi iteration, where is a diagonal matrix with diagonal entries equal to those of the matrix . Meanwhile, the matrix . Elemental Jacobi iteration can be written as
(2) 
In order for Jacobi iteration to converge, the spectral radius of the iteration matrix must be less than one
(3) 
Jacobi iteration can be used to solve linear systems which involve matrices that are diagonally dominant. Many matrices arising from discretizations of PDEs are diagonally dominant and can be solved using Jacobi iteration.
2.2 CUDA Framework for GPUs
A schematic of the CUDA model is shown in Figure 1.
A GPU program developed using CUDA (the parallel computing platform developed by NVIDIA for general purpose GPU programming) typically begins with initialization on the CPU (termed host) and transfers to the GPU (termed device) where parallel calculations will be performed. The most fundamental unit of parallel execution is the thread. A large number of threads can run simultaneously. This gives GPUs high computing power and the ability to perform mathematical operations much more quickly than CPUs. Groups of threads are organized into blocks (up to 1024 threads per block are permitted on most modern machines). Every thread is assigned a thread ID that is unique to it in the thread block, given by threadIdx.x. For 2D and 3D problems, a threadIdx.y and threadIdx.z are also assigned (these are set to 0 by default for the 1D case). Similarly, every block has its own unique block ID given by blockIdx.x (a similar blockIdx.y and blockIdx.z are assigned in 2D and 3D). Each block contains its own shared memory storage which is accessible to the threads comprising the block.
When launching a kernel (calculation to be performed on the device), the user specifies the number of threads per block (denoted by blockDim.x; blockDim.y and blockDim.z are also defined for 2D and 3D problems) as well as the total number of blocks to be invoked. The blocks are then distributed between streaming multiprocessors (SMs), and the calculations are launched by warps (groups of 32 threads belonging to the same block). It is preferable to select a thread per block value which is a multiple of 32 so that all threads in a warp are utilized. The warps are scheduled in a roundrobin fashion so that latency is hidden as calculations are being performed. They continue to operate until all threads have completed their calculations.
When a block is allocated to an SM, the block receives a portion of the total shared memory on the SM. The amount of shared memory in a block is usually limited (modern GPUs typically have 48 kB of shared memory per block) and accessible only to threads within a block. However, shared memory is more quickly accessible (approximately 10 times higher bandwidth [Wilt2013] and 100 times lower latency [Harris2013]) than the GPU global memory. Therefore, efficient use of intrablock shared memory can reduce runtime significantly.
3 Approaches for Jacobi Iteration in 1D Problems
We introduce a Jacobi iterative solver which utilizes GPU shared memory for one dimensional problems. In general, linear systems arising from one dimensional partial differential equations (PDEs) can be solved efficiently using direct methods (e.g. Thomas algorithm for tridiagonal systems), with iterative methods being desirable when dealing with linear systems arising from discretizations of higher dimensional PDEs. However, applying our shared memory Jacobi iterative solver to the 1D case can provide insights which translate to higher dimensions.
We consider a linear system which arises from a 1D PDE (e.g. diffusion equation) on a grid as shown as in Figure 2
. The interior points correspond to the different degrees of freedom (DOFs) of a linear system. We assume Dirichlet BCs, represented by the square endpoints.
Onedimensional PDEs discretized with standard discretizations (finite difference, finite elements) usually result in a tridiagonal system. Consider a general tridiagonal matrix with the values , , and in the th subdiagonal, diagonal, and superdiagonal entries, respectively (as shown below).
Applying the Jacobi iterative method (Equation (2)) to advance the th DOF (denoted by ) in our tridiagonal system from iteration to iteration results in the following update equation (where represents the right hand side value at the th grid point)
(4) 
For a particular DOF, Equation (4) shows that the solution at the next iteration depends only on the values at adjacent DOFs at the current iteration. This is illustrated in Figure 3. If the neighbor values are available, we can perform the update on the DOF. This idea is used to develop our various approaches for solving the 1D problem on the CPU and GPU.
3.1 CPU Approach
Jacobi iteration can be implemented on the CPU as follows. Each degree of freedom in the grid can be updated in a sequence, with the subsequent values at step relying only on the current values at step . This requires two storage containers, one to store the solution at step and the other to store the solution at step . Once all DOFs are updated in a given step, the two arrays can be swapped and the process can be repeated. A sample C++ style code to illustrate the CPU approach is shown below. The jacobiUpdate function corresponds to the update equation in Equation (4).
3.2 GPU Approach with Global Memory
The GPU approach for implementing Jacobi iteration with global memory is quite similar to the CPU approach. Rather than use a for loop, each GPU thread can independently update each DOF. We must ensure that we request enough threads to perform the updates on all DOFs. We can do this by setting any value for the number of threads per block (preferably a multiple of 32 as threads operate in warps containing 32 threads), and setting the number of blocks equal to the ceiling of the total number of DOFs divided by the number of threads per block specified earlier. A sample CUDA C++ style code to illustrate the GPU approach is shown below.
3.3 GPU Approach with Shared Memory
Shared memory on the GPU has a higher bandwidth and lower latency compared to global memory [Wilt2013, Harris2013], so adapting GPU applications to use shared memory has become a popular optimization technique for improving performance. Performing Jacobi updates within shared memory can improve performance because reading current values to and writing updated values from shared memory will be much faster compared to performing these accesses in global memory. However, using shared memory can be challenging because it is allocated on a per block basis. Threads in one GPU block cannot access the shared memory of another GPU block. This is a byproduct of ondie shared memory being very limited. Only a set number of blocks can be active at a given time and receive a shared memory allocation. In general, blocks must wait in the queue to operate, as well as receive a shared memory allocation and perform updates. Although the limited availability of shared memory makes it challenging to use, it can provide speedup in many applications which are memory bound, such as iterative solvers.
One way to utilize shared memory in this application is to partition the solution data between the available shared memory and operate on these partitions independently. For the 1D problem, the DOFs in the domain shown in Figure 2 can be partitioned into several subdomains. The DOFs in each subdomain may be updated by a distinct GPU block. Figure 4 shows an example of a 12 point grid, divided into 4 point subdomains which will be operated on by 3 GPU blocks.
The first step in our approach is to copy the DOF values in each subdomain from global memory (where the initial solution resides) to the shared memory of the GPU block which will be responsible for its update. Updating the edge values in each subdomain requires information from the adjacent DOFs which will not be accessible to the block. We address this by augmenting the subdomains by an extra point to the left and right (beyond the points that will be updated). These augmented subdomains (shown in Figure 5 for our example) can be copied to the shared memory of each GPU block. The shared memory simply holds a temporary copy of the solution which is to be operated on, while the final solution is always stored in global memory.
After the copy to shared memory is complete, we can perform Jacobi iteration on the interior points of each subdomain using the stencil update given in Equation (4). To maximize parallelism, each DOF update should be handled by a distinct thread. For our example, the number of threads per block should be set to 4 so that 4 threads can be used to update the 4 interior points within each subdomain. In general, the threads per block specification is exactly equal to the number of interior points in a subdomain (i.e. the subdomain size is directly tied to the user specified thread per block choice). Given a thread per block specification of blockDim.x, the subdomain size will be blockDim.x + 2 with blockDim.x interior points to be updated. To execute the previous global to shared memory transfer efficiently, each of the blockDim.x threads should copy one value from global to shared memory and the remaining two values can be copied by threads 0 and 1. One can alter the subdomain size by changing the thread per block specification. For efficiency, the threads per block specification should be a multiple of 32 because threads are launched in groups of 32 threads (known as warps). This ensures all threads in a warp are active and avoids thread divergence. As a consequence, subdomains with a multiple of 32 interior DOFs are preferable.
Since data movement tends to be the bottleneck in iterative linear solvers (rather than computation), it is beneficial to perform many updates within shared memory before executing expensive data transfers back to global memory. Following this logic, we recommend performing many Jacobi iterations within shared memory. We refer to the Jacobi iterations performed within shared memory as subiterations, and denote the number of subiterations performed by . Upon performing the subiterations, the updated values can be copied from the shared memory of every GPU block to their global memory position. Only the interior blockDim.x values of each subdomain must be copied back from shared to global memory since the two edgemost points are not updated. Each thread should copy the same DOF it updated from shared to global memory. Shared memory is terminated at the end of this step and the updated solution now exists in global memory.
This completes one cycle of our shared memory algorithm. The algorithm is implemented in a single GPU kernel (since shared memory is only active as long as a GPU kernel is running). A sample CUDA C++ style code to illustrate the implementation is provided in Appendix A. The main parameters in the algorithm are the number of threads per block (blockDim.x) and the number of subiterations (). The number of threads per block controls the size of the subdomains in shared memory, while the number of subiterations controls the number of Jacobi iterations performed within shared memory before communication back to global memory. Our algorithm is depicted in Figure 6. In summary, each cycle of the algorithm involves the following three steps:

Partition the gridpoints between the different GPU blocks and copy values from global memory to shared memory. Each subdomain size has size .

Update the interior blockDim.x points in each GPU block by performing Jacobi iterations (we refer to these as subiterations).

Update the global memory solution array by copying the interior blockDim.x values from shared memory to their appropriate positions in global memory (shared memory is terminated at this point).
An application that uses GPU shared memory requires the user to specify the number of bytes of shared memory needed. At the very least, there must be enough space for twice the subdomain data (which has size blockDim.x + 2) because Jacobi iteration requires two storage containers (one containing the current solution values and one containing the updated values). During even subiterations, the first blockDim.x + 2 entries within shared memory should hold the current solution values while the updated values are placed in the second blockDim.x + 2
entries. During odd subiterations, this should be reversed with the second
blockDim.x + 2 entries holding the current solution values while the updated values are placed in the first blockDim.x + 2 entries. The blockDim.x right hand side values corresponding to the interior subdomain DOFs can also be copied to shared memory (as the right hand side value appears in the Jacobi update function in Equation (4)). Therefore, the total amount of shared memory required for the algorithm is 2 times the subdomain size (blockDim.x + 2) for the solution array plus an array of size blockDim.x for the right hand side values corresponding to the interior DOFs. This should be multiplied by the size of the representation (4 bytes for floats, or 8 bytes for doubles).One practical question to consider when applying the algorithm is what is a good number of subiterations to perform within shared memory. Using a large value of is beneficial as we can perform many Jacobi updates without requiring expensive global memory accesses. The drawback of setting a large is the lack of frequent update of the boundary values of each subdomain. These boundary points are updated by the adjacent blocks, and the most up to date values are available to a subdomain only after each cycle when all values are transferred back to global memory after the subiterations are performed and reallocated back to shared memory. Therefore, subiterations used to update interior points within a subdomain rely on old boundary data, which can negatively impact convergence. Setting ensures that subdomains have the most up to date boundary values at every step, and is exactly equivalent to performing Jacobi iteration. However, setting such a low value requires many expensive global memory accesses which will inhibit algorithm performance. There is a tradeoff between receiving up to date subdomain values more frequently to improve convergence (by setting a small ), and performing fewer global memory accesses to improve performance (by setting a small ). We explore good values for in our numerical tests.
3.4 Effect of Subiterations
The classic GPU and shared memory approaches for Jacobi iteration presented in Sections 3.23.3 are applied to the triadiagonal system arising from a finite difference discretization of the Poisson equation on a one dimensional domain with Dirichlet boundary conditions (shown in Equation (5)):
(5) 
The triadiagonal system is given by
(6) 
where
The right hand side vector has all entries set to one. We consider a problem size of interior DOFs (1026 points in total) and measure the time required to reduce the residual norm of the initial solution (set to a vector of ones) by a factor of 1e4 using Jacobi iteration. The elemental Jacobi update (Equation (2)) applied to the 1D Poisson equation results in the update equation (7).
(7) 
Because this problem is too small to fully utilize the streaming multiprocessors of the GPU, copies of the problem are operated on simultaneously. This ensures that the GPU is fully utilized and that our algorithm can scale to higher dimensional problems (i.e. an analogous 2D problem on a 1024 by 1024 grid). The time required for the classic GPU and the shared memory approaches to decrease the residual for the 1024 copies is measured. Our main goal is to understand the speedup we can achieve using shared memory over a classical GPU implementation of Jacobi iteration. All numerical tests were run using a TITAN V GPU with double precision.
The thread per block specification is varied between the values of [32, 64, 128, 256, 512] for the classic GPU approach, and the best time is recorded. For the shared memory implementation, the thread per block value of 32 is used. This corresponds to a required shared memory allocation of 800 bytes per block (much lower than the shared memory limit). Although larger thread per block values could be used in 1D, we wish to extend our findings to higher dimensions in which case the available threads would have to be split between different dimensions. The number of blocks is always set such that there are enough to accommodate all of the DOFs in the problem (i.e. ceiling of the total number of DOFs from all 1024 copies of the problem divided by the thread per block value). The following 6 subiteration values are explored in this study: .
Figure 7 shows the performance of the shared memory approach for the various subiteration values specified above compared to the best classic GPU performance. In each case, the number of iterations/cycles required for convergence was determined ahead of time, and the GPU and shared memory implementations were run under the prescribed number of iterations/cycles to avoid calculating the residual during these time trials. The performance results include the time required for host to device and device to host transfers, and were recorded using the CUDA Event API. For the thread per block values of [32, 64, 128, 256, 512], the classic GPU times measured were [7248.11, 5690.35, 5675.69, 5677.27, 5680.42] ms. The optimal time measured is 5675.69 ms when the thread per block value is 128, although the time is essentially constant for thread per block values of 64 and above. In these cases, the GPU is being fully utilized so the performance cannot be further improved. The shared memory approach outperforms the classic GPU approach in all cases except when the number of subiterations is set to the highest value of 128. The best performance is obtained when . Past , the time to converge increases linearly with the number of subiterations, suggesting that performing too many subiterations within shared memory can hurt performance.
Table 1 shows the speedup achieved going from the best classic GPU performance to the shared memory performance for each subiteration value specified. The best speedup achieved is nearly four times over the classic GPU approach when (setting results in a nearly similar speedup). The results demonstrate that the time for convergence for Jacobi iteration can be reduced using shared memory.
Number of Subiterations  4  8  16  32  64  128 

Best GPU to Shared Speedup  2.45  3.82  3.79  2.47  1.46  0.81 
3.5 Effect of Overlapping Subdomains in 1D
One key feature of our shared memory algorithm is that many Jacobi updates are performed each cycle without transfer to global memory. While this minimizes global memory accesses to improve performance, convergence is hampered by the lack of up to date values used for the updates. In particular, boundary data in each subdomain is only updated every cycle (after subiterations), so many cycles may be required for convergence. DOFs close to the subdomain edges are especially affected and contribute a larger residual value to the overall residual norm. Improving the edge values can greatly improve the convergence of the algorithm and reduce the number of cycles necessary.
We attempt to alleviate this problem by using overlapping subdomains. The key idea of this approach is that DOFs which are further away from subdomain edges (closer to the subdomain center) have less error than those closer to the edges. With overlapping subdomains, we permit sets of points close to the subdomain edges to be allocated to two subdomains (copied to the shared memory of two GPU blocks). A DOF which exists in two subdomains will have a different local position in each subdomain. The subdomain in which the DOF is further from the edge is the one which will contribute the final value to global memory.
To illustrate this approach more clearly, an example is shown in Figure 6 for the 12 point grid. In this example, each subdomain has two interior DOFs in common with its neighbor. These overlapping DOFs are updated in the shared memory of two GPU blocks. Although either block could copy its DOF value back to global memory at the end of the cycle, we choose to have the left subdomain provide the final left point value and the right subdomain provide the final right point value in each set of overlapping points. This is because the local position of the left point within the left subdomain is further from the edge while compared to its local position in the right subdomain. Meanwhile, the local position of the right point is further from the edge in the right subdomain than in the left subdomain. Figure 8 shows how the GPU blocks contribute to the overall global memory solution in this approach. In general, any even number of overlapping points is permitted (as long as it is less than the number of interior points in the subdomain) and in this case the left subdomain should contribute the left half of points while the right subdomain should contribute the right half.
Overlapping points between different subdomains and selectively copying DOF values as described will reduce the number of cycles required for convergence. This improvement in convergence is similar to behavior seen in overlapping domain decomposition methods, where the convergence rate is improved as more overlap is introduced between subdomains [Dolean2015]. A drawback of overlapping subdomains is that multiple GPU blocks must perform updates on many of the points throughout the domain. This requires utilization of more GPU blocks, as the overlap effectively reduces the number of distinct points each subdomain handles. The number of overlapping points is another algorithm parameter which must be chosen judiciously.
Numerical tests are performed to study the effect of overlap on the performance of the shared memory algorithm. In these tests, a thread per block value of 32 is used as before. Each subdomain can have an even number of interior overlapping points with its neighboring subdomain (as long as the number of overlapping points is less than the number of interior points in a subdomain). For our thread per block specification of 32, we permit up to 30 interior overlapping points between subdomains. The left subdomain will contribute the left half of the overlapping points while the right subdomain contributes the right half. Figures 8(a) and 8(b) show the time and number of cycles required for convergence as a function of overlap for subiteration values of .
For all subiteration cases, allowing just two points to overlap causes a drop in the number of cycles required for convergence. This is a result of improving poor DOF values adjacent to subdomain edges. The reduction in cycles leads to an initial reduction in time, particularly for larger values of . As the amount of overlap increases further the number of cycles required for convergence continues to decrease slowly and eventually saturates. The time; however, grows dramatically as overlap is increased further. This is because the amount of computation required per cycle grows with overlap. Given a problem size with interior DOFs, threads per block and overlap , the number of operational blocks required is:
(8) 
The logic for this is as follows. There are initially points in the grid (including boundary points). The first block has a subdomain size of points, and every subsequent block added handles unique points in our grid. Therefore, subtracting the number of points handled by the initial block from and dividing by the number of points handled by subsequent blocks gives the number of blocks required. This can also be written as:
(9) 
The corresponding number of threads required is the number of blocks times the threads per block value .
(10) 
Equation (10) gives a sense of the amount of work required per cycle, which grows as overlap increases. Therefore, it is only beneficial to perform overlap as long as the number of cycles decreases. Once the number of cycles stagnates, it is no longer worth increasing amount of overlap between subdomains.
For each value of tested, Table 2 shows the optimal overlap for minimum time and corresponding speedup of the shared memory algorithm relative to the classic GPU implementation. The results illustrate that the optimal number of overlapping points increases as the number of subiterations is increased. This is expected because increasing causes more DOFs near the subdomain edges to have poor values due to lack of up to date boundary values (so more overlap is necessary to improve these DOFs). The best performance is achieved when setting with 4 interior points overlapping between subdomains. This corresponds to an eight times speedup compared to the classic GPU performance. The best performance was previously achieved using , but now higher values yield the best speedup because poor edge values (associated with high ) can be improved with overlap.
The numerical test results demonstrate that the convergence of Jacobi iteration for 1D problems can be accelerated using shared memory. Although the initial results suggested a four times speedup for this problem (resulting from tuning the number of subiterations), introducing overlap to improve edge point values results in an eightfold speedup. The optimal setting for the number of subiterations is about half the number of interior points in a subdomain. Furthermore, a small amount of overlap is enough to decrease the number of cycles dramatically. This shared memory approach can be extended to two dimensional problems for similar acceleration.
Number of Subiterations  4  8  16  32  64  128 
Optimal Overlap  2  4  4  8  10  12 
Optimal Speedup  3.09  5.74  8.00  7.50  5.84  3.75 
4 Approaches for Jacobi Iteration in 2D Problems
We extend the shared memory algorithm presented in Section 3.3 to the twodimensional case. Consider a PDE discretized in a structured 2D domain. A model 2D PDE is one with second order derivatives, which discretized using a 2nd order central difference scheme results in a linear system where the matrix has pentadiagonal structure as shown below ( correspond to different constants).
Applying the Jacobi iterative method to the th degree of freedom in the grid ( and correspond to indices in the spatial , domain) results in an update equation of the form in Equation (11). The updated value of degree of freedom with indices , depends only on its neighbor values at the current iteration. This is depicted in Figure 10.
(11) 
4.1 GPU Approach with Shared Memory
The approaches to solving a 2D PDE using Jacobi iteration on the CPU and the GPU are analogous to the 1D case. For the shared memory approach, we must utilize a domain decomposition strategy. A structured 2D domain can be subdivided into smaller subdomains which can be allocated to the shared memory of many GPU blocks and updated independently. Figure 10(a) illustrates a 2D domain with 12 by 12 interior points partitioned into smaller 4 by 4 subdomains which would be updated by 9 GPU blocks (each color represents the points handled by a distinct GPU block). The square points correspond to Dirichlet boundary conditions.
To prevent threads from being idle, we ensure that every thread updates one interior point in the subdomain. To achieve this, the number of interior points in a subdomain is set to the number of threads per block in the direction (denoted by blockDim.x) by the number of threads per block in the direction (denoted by blockDim.y). Updating the points on the edges of each subdomain requires information from the edges of neighboring subdomains. To prevent the need for communication of neighbors at every step, we simply augment each subdomain by one point in all dimensions as shown in Figure 10(b). Therefore, the subdomains to be copied to shared memory have size by where the inner blockDim.x by blockDim.y points will be updated using the Jacobi update equation.
The first step of the 2D shared memory algorithm is to copy the subdomain data from global memory to the shared memory of the GPU block which will perform the updates. The blockDim.x by blockDim.y threads in each block can be used to copy the blockDim.x + 2 to blockDim.y+2 subdomains to shared memory (each thread must transfer at least one point). After the copy to shared memory, the interior points of the subdomain can be advanced using the stencil update in Equation 11. Rather than perform just one update, we can perform multiple subiterations similar to the 1D case. Each thread is responsible for updating one DOF. After the updates are complete, the interior points of the subdomains can be copied back from shared memory to their appropriate positions in global memory. At this point, the most up to date solution lies in global memory and the shared memory can be terminated.
This completes one cycle of the shared memory algorithm in 2D. The main parameters of this algorithm are the number of threads per block in the direction (blockDim.x) and direction (blockDim.y), which control the dimensions of the subdomains, as well as the number of subiterations performed within shared memory (denoted by as before). The algorithm should be implemented as a single GPU kernel so that shared memory is not terminated until the end. In summary, each cycle involves the following three steps (similar to the 1D case):

Partition the gridpoints between the different GPU blocks and copy values from global memory to shared memory. Each subdomain has size
by . 
Update the interior blockDim.x by blockDim.y points in each GPU block by performing Jacobi iterations.

Update the global memory solution array by copying the interior blockDim.x by blockDim.y values from shared memory to their appropriate positions in global memory.
We can compute the amount of shared memory required for the 2D algorithm. Since Jacobi iteration requires two storage containers, we must allocate enough data for twice the subdomain size of
(blockDim.x + 2) * (blockDim.y + 2). Furthermore, the right hand side values corresponding to the interior subdomain DOFs can also be copied to shared memory to improve performance (as the right hand side appears in the update). This requires another array. Therefore, the total amount of shared memory required is then . This must be multiplied by the size of the representation (4 bytes for floats, 8 bytes for doubles).
One issue that can arise in shared memory applications is the problem of bank conflicts, which occur when threads in a given warp access noncontiguous data in shared memory. In this case, some fraction of the shared memory accesses are serialized which can significantly reduce memory bandwidth. To avoid performance loss, all warps should access contiguous data. In the 2D problem, each row of interior points in a subdomain lies in contiguous memory, but each subsequent row of interior points is separated from the previous row by two boundary points (due to the actual layout in memory as a 1D array). Therefore, we should ensure that each warp updates a single row. This is done by setting the threads per block value in the direction to 32, so that each row has 32 interior points (contiguous in memory) which will be updated by the 32 threads in a given warp.
4.2 Effect of Subiterations
We study the performance of the shared memory algorithm in solving the pentadiagonal system arising from discretization of the 2D Poisson equation in the domain , with Dirichlet boundary conditions
(12) 
The pentadiagonal matrix in our linear system is given by
where
The elemental Jacobi iterative update for the 2D Poisson problem is:
(13)  
(14) 
We consider a domain containing by interior grid points in the and directions and measure the time to reduce the residual norm of the initial solution by a factor of 1e4. This problem is analogous to the previous 1D problem in terms of computational complexity, and requires full utilization of the GPU. The entries of the initial solution and the right hand side vector are set to one. A TITAN V GPU is used for the numerical tests, which are performed using double precision.
The number of threads per block in the direction is set to 32 to avoid shared memory bank conflicts. For the classic GPU approach, the threads per block value in the direction is varied between and the best time is recorded. For the shared memory approach, the threads per block value in the direction is set to 32 so that the problem domain is split into subdomains with 32 by 32 interior points. The amount of shared memory required using these specifications is 26.688 kB per block (less than the maximum allowable value of 48 kB, although doubling either dimension of the subdomain would cause us to exceed this value). The number of subiterations is set to the following values (the same values as in the 1D problem): .
Figure 12 shows a comparison of the time required for the classic GPU and shared memory approaches to decrease the initial residual by a factor of 1e4. The time required for all data transfers between the host and device are also included in the results. The best classic GPU time measured was ms for the various thread per block specifications we experimented with. The shared memory approach outperforms the best classic GPU performance for all cases. The best performance is achieved when . The time grows linearly as the number of subiterations is increased past this point.
Table 3 shows the speedup obtained when comparing the best classic GPU performance to the shared performance for each of the subiteration values explored. A nearly four times speedup is achieved using the shared memory implementation (similar to the initial 1D speedups) when , demonstrating that shared memory can improve the convergence of Jacobi iteration for structured 2D problems.
Number of Subiterations  4  8  16  32  64  128 

Best GPU to Shared Speedup  2.17  3.67  3.98  2.96  1.86  1.09 
4.3 Effect of Overlapping Subdomains in 2D
The shared memory approach improves the performance of classic Jacobi iteration on the GPU by allowing many updates to be performed without requiring global memory accesses every step. However, this comes at the cost of lacking up to date boundary values every subiteration. In the 2D problem, the left, right, top and bottom boundary edges are only up to date at the beginning of a cycle but not for all subiterations. This causes DOFs close to these subdomain edges to rely on old boundary data and have poor solution values which contribute greatly to the residual norm (especially for larger values of ). Improving the interior values near subdomain edges (especially interior points at the corners) can greatly improve the convergence of our algorithm. This can be done by extending the overlapping approach in Section 3.5 to 2D problems. In the 2D case, sets of points can overlap between subdomains in both the and directions. Therefore, a DOF can be allocated to and updated within multiple subdomains. However, the subdomain in which the local position of the DOF is further from a subdomain edge in both the and directions should contribute the final value at the end of a cycle. This is a natural extension of the 1D overlapping approach to the 2D case.
We explore the performance of the 2D shared memory algorithm with overlap. As before, the thread per block values are set to 32 in both directions. In this case, a subdomain can have up to 30 interior points overlap with its neighboring subdomains in both directions. While it is permissible to have a different amount of overlap in the and directions, the symmetry of the domain and subdomain sizes suggests that the optimal overlap amount would be the same in both directions. Figures 12(a) and 12(b) show the time and number of cycles required for the shared memory approach to converge as the overlap in both directions is increased. The same subiteration values are used as before. Figure 12(b) shows that an initial amount of overlap can significantly drop the number of cycles required for convergence, especially when the number of subiterations is large. The corresponding time also decreases, except in the case when , as the decrease in the number of cycles is not enough to outweigh the extra work incurred due to overlap. As the number of overlapping points increases further, the number of cycles saturates. However, the time continues to increase because larger overlapping values require more GPU resources and computational time per cycle.
We conduct a similar analysis to the 1D case to determine the amount of work required per cycle as a function of the overlap. The 1D analysis suggested that the number of operational blocks was given by:
(15) 
For the 2D case, assume a domain with interior dimensions by , with subdomains with interior dimensions of by . If the overlap in the and direction are given by and , the total number of operational blocks required is then
(16) 
The number of operational threads is then the number of operational blocks times the threads per block in the and directions
(17) 
Assuming that the parameters in the and directions are equivalent (, , as is true in our problem) Equation (17) simplifies to
(18) 
Equation 18 shows the amount of work or computational resources required per cycle based on the domain size, threads per block and overlap. Increasing the overlap increases the amount of work required per cycle, so overlap is useful if it decreases the number of cycles needed for convergence enough to outweigh the extra work it requires. Once the number of cycles saturates with overlap, it is no longer useful to perform further overlap.
Table 4 shows the optimal overlap value and corresponding speedup of the shared memory algorithm relative to the classic GPU implementation for the 2D problem, for each of the subiteration values used. As before, the optimal overlap increases with the number of subiterations. A nearly six times speedup is achieved using with 4 overlapping points in the and directions.
Number of Subiterations  4  8  16  32  64  128 
Optimal Overlap in and  0  2  2  4  6  8 
Optimal Speedup  2.18  4.22  5.58  5.84  4.88  3.50 
The numerical test results demonstrate that the convergence of Jacobi iteration for 2D problems can also be accelerated using shared memory. Tuning the number of subiterations initially resulted in a four times speedup over the classical approach. However, introducing overlap to improve edge values augmented the speedup further, resulting in an almost six times speedup. The optimal number of subiterations was equal to the number of interior points in a subdomain along a single dimension. Furthermore, introducing a small amount of overlap in both directions was enough to decrease the number cycles dramatically.
5 Conclusion
In this work, we have developed a hierarchical Jacobi solver for 1D and 2D structured problems that utilizes shared memory to improve performance. The approach involves partitioning the domain into multiple subdomains which are handled by GPU blocks. Relative to a classical GPU implementation of the Jacobi iterative method, the shared memory approach resulted in an eight times speedup in convergence for the 1D case and a nearly six times speedup for the 2D case. This speedup was achieved by exploring optimal parameters for the number of subiterations performed in shared memory and investigating the effect of overlapping subdomains (which accelerated convergence by improving DOF values near subdomain edges). For both the 1D and 2D test problems, setting the number of subiterations on the order of the subdomain dimension along with a small amount of overlap resulted in the optimal performance of the shared memory algorithm. This hierarchical Jacobi solver is very suitable for parallel architectures, and has the potential to accelerate multigrid solvers which utilize this approach as a smoother on high performance clusters. Overall, this work demonstrates the need to adopt classical algorithms to emerging hardware in a way that achieves both high computational throughput as well as efficient memory access.
Appendix A Sample Pseudocode for 1D Shared Memory Algorithm
We provide a sample CUDA C++ style code to illustrate the components of the shared memory algorithm in 1D from an implementation standpoint. The code resembles the implementation of our GPU kernel which corresponds to one cycle of the shared memory algorithm. In this first step, the subdomain DOF data is copied from global memory to 2 containers in shared memory (corresponding to containers for the current and updated values in Jacobi iteration). The right hand side values for DOFs corresponding to our subdomain are also copied to a third section of shared memory. In the second step, Jacobi subiterations are performed in shared memory. We have two storage containers for Jacobi iteration, with x0 referring to the current solution and x1 referring to the updated solution in every step. In step 3, the updated values of all interior points in shared memory are copied back to their respective global memory positions.
Comments
There are no comments yet.