## 1 Introduction

Despite the large body of research on finite element methods for accelerators, widely used, freely available general finite element codes and libraries do not employ them, although exceptions exist [13, 2]. This trend is troubling as a growing number of high performance machines rely on accelerator technologies for the majority of their performance [16]. Most research is limited to high-order methods [9, 11] or one-off specialized solvers. In particular, the first and third authors' previous work focused on a specialized version of the finite element integration routines [10]. Providing a general, low-order integration method for the GPU based on quadrature will be necessary for the wider use of these architectures by finite element codes.

Our goal is to provide a general interface for efficient evaluation of finite element integrals on the limited-memory, highly concurrent architectures of graphics processing units (GPUs). To support a wide set of application codes, we are interested in weak forms incorporating complex coefficients, even those which cannot be accurately represented in the finite element basis. These forms necessitate the use of quadrature for integration, and are thus more widely applicable than methods involving exact integration. Moreover, quadrature may even be applicable in cases where coefficients can be represented in the finite element basis, since the size of the tensors produced by exact integration grows quickly [14].

Even though low-order finite element methods are the most used in application codes, no efficient quadrature-based mapping to accelerator technologies exists. A popular choice for concurrent integration of finite element weak forms is to assign each cell to a separate thread [5, 4, 15, 17, 8]. This strategy, however, uses a large amount of local memory per thread. An alternative strategy [7] vectorizes the computation over basis functions, taking each quadrature point in turn. This requires very little local memory, but pulls all finite element coefficients from global memory for each quadrature point, rather than a single time, which is suboptimal for bandwidth-limited computations.

The method developed in this paper uses a thread transposition operation to map between evaluation at quadrature points for integration and on basis elements. This method increases the size of local memory usage slightly but avoids synchronization points. With the removal of the synchronization points and increased concurrency, the implementation achieves higher performance. In addition, it is able to hide much of the latency of moving data to local memory by keeping many kernels in flight on each processing element. In order to evaluate our method, we developed a model of both the memory traffic and computation, which allows us to predict performance based upon problem parameters.

In this paper, we formulate a simple interface that uses quadrature, supports low-order elements efficiently, and achieves 90% of memory bandwidth peak, up to 300 GFlops, on a variety of accelerators. This verifies our comprehensive performance model which is able to predict the performance within 10%. This novel execution strategy gives the integration method a considerable boost in performance, but more importantly the concision and simplicity of the implementation greatly reduces the effort of porting to different architectures, as well as the effort for inclusion in an existing code base. This contribution provides a road forward for widely used finite element methods to use accelerators effectively without large refactoring and reformulation efforts.

## 2 Formulation

If we restrict our attention to weak forms dependent only on problem fields and their gradients, we can formulate a generic scalar weak form as [3]

$$ \int_\Omega \phi \, f_0(u, \nabla u) + \nabla\phi \cdot f_1(u, \nabla u) \, dV = 0 \quad (1) $$

where $\phi$ is a test function.

We extend this to vector forms merely by making $f_0$ a vector and $f_1$ a tensor. Breaking the integral into element integrals, and using quadrature for the element integrals, we have

$$ \sum_e \mathcal{E}_e^T \left[ B^T W \, f_0(u^q, \nabla u^q) + D^T W \, f_1(u^q, \nabla u^q) \right] = 0 \quad (2) $$

where $u^q = B \mathcal{E}_e u$ is the vector of field evaluations over the set of quadrature points, $W$ is a diagonal matrix of quadrature weights, and the $B^T$ and $D^T$ matrices implement the reduction over quadrature points to produce basis function coefficients, which are assembled into the global vector by $\mathcal{E}_e^T$. The functions $f_0$ and $f_1$ encapsulate the physics of the problem and need only be evaluated pointwise, at quadrature points. Note that these functions will contain any variable coefficients of the equations, as well as the basic operators. For example, a standard Laplacian problem uses $f_0 = 0$ and $f_1 = \nabla u$, or $f_1 = k(x) \nabla u$ for a variable-coefficient problem.
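The pointwise physics functions are the only user-supplied pieces. A minimal C sketch of $f_1$ for the constant- and variable-coefficient Laplacian follows; the function names and signatures are illustrative, not the PETSc API:

```c
#include <assert.h>

#define DIM 2

/* f_1 for the constant-coefficient Laplacian: f_1 = grad u.
   Evaluated pointwise at a single quadrature point. */
static void f1_laplacian(const double gradU[DIM], double f1[DIM]) {
  for (int d = 0; d < DIM; ++d) f1[d] = gradU[d];
}

/* f_1 for a variable-coefficient problem: f_1 = k(x) grad u,
   where k has already been evaluated at the quadrature point. */
static void f1_variable(double k, const double gradU[DIM], double f1[DIM]) {
  for (int d = 0; d < DIM; ++d) f1[d] = k * gradU[d];
}
```

Because such functions touch only pointwise values, they are independent of the element, the quadrature rule, and the mesh, which is what makes the generic interface possible.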

A popular strategy on multi-core processors, as well as on uniprocessors with vector instructions, is to vectorize over the operations used in integration so that several threads are used for each element [3]. One can vectorize over quadrature points, which is natural since users typically specify a pointwise “physics” function which operates at quadrature points, as shown above. However, a reduction over quadrature points is necessary in order to calculate the coefficient for each basis function. This reduction introduces synchronization, which destroys most of the gains that vectorization brings on a GPU. It is also possible to vectorize over basis functions, but this strategy introduces both redundant computation and redundant data movement.

To mitigate the synchronization penalty, we would like the reduction to be calculated by a single thread, instead of reducing across threads in a block. For this to work, we need enough reductions for all our threads, so we must use multiple elements per thread block. In order to satisfy Little’s Law [12], this means we also need to use multiple elements in our quadrature point vectorization.

In fact, the number of elements can be taken as the least common multiple of the number of quadrature points $N_q$ and the number of (scalar) basis functions $N_b$ per element, $\operatorname{lcm}(N_q, N_b)$. This is also the number of threads in each thread block that will be completely occupied. This situation is illustrated in Fig. 1, which is detailed below. In the first phase, we vectorize over quadrature points, and in the second phase, we vectorize over basis functions. This strategy means that we must keep the intermediate results in shared memory.
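The thread-block sizing reduces to a least-common-multiple computation, as in this small C sketch:

```c
#include <assert.h>

/* Block size is the least common multiple of the number of quadrature
   points and the number of scalar basis functions, so that both the
   quadrature phase and the basis phase keep every thread busy. */
static int gcd(int a, int b) {
  while (b) { int t = a % b; a = b; b = t; }
  return a;
}

static int lcm(int a, int b) { return a / gcd(a, b) * b; }
```

For example, a basis with three functions and a two-point quadrature rule (hypothetical sizes consistent with the two-cell/three-cell split described for Fig. 1) gives a fully occupied block of lcm(2, 3) = 6 threads.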

## 3 Thread Transposition

During residual evaluation using quadrature, we would like to spread the evaluation of the residual contribution at each quadrature point over separate threads, but we also would like a separate thread to evaluate each basis coefficient of the residual. These two strategies are at odds when we have different numbers of quadrature points, $N_q$, and basis functions, $N_b$, per element. We reconcile them using a pattern we call thread transposition, in which we leave data in shared memory and change the target for each thread in different stages of the algorithm.

If we take the size of the thread block as the least common multiple of the quadrature and basis sizes, $N_t = \operatorname{lcm}(N_q, N_b N_{comp})$, where $N_{comp}$ denotes the number of components for a possibly vector-valued basis, we can use different size cell blocks to make the sizes commensurate. The quadrature computation will use groups of $N_q$ threads, each group operating on a series of cells. Likewise, the basis coefficient computation will use groups of $N_b N_{comp}$ threads, each group operating on a series of cells. This arrangement is shown in Fig. 1 for a 2D Lagrange element, where we choose the number of basis functions $N_b = 3$, the number of quadrature points $N_q = 2$, and consequently $N_t = 6$ as the least common multiple for the thread block size. Each thread, shown as a red rectangle, computes values for a group of cells, shown as blue rounded rectangles, in series. So each thread computes basis function evaluations and the weak form functions at quadrature points for two cells, and then computes a basis coefficient for three cells. This organization necessitates that the computed products be placed in shared memory so that, after transposition, different threads have access to the information. The increased local memory usage from transposition is balanced by the increased concurrency, and thus overall lower per-thread memory requirements. Note also that for multi-component fields, such as vectors, we instead use the total number of basis functions over all components, $N_{bt} = N_b N_{comp}$.

At the top level, we divide the mesh cells into chunks which are processed serially by the CPU, or in parallel on the GPU. Each chunk, of size $N_{cb} N_{bc}$, is assigned to a thread block, and thus the number of OpenCL workgroups (or the size of the CUDA grid) equals the number of chunks. The total number of cells is $N_{ch} N_{cb} N_{bc} + N_r$, where $N_r$ is the number of remainder cells, which are always processed on the CPU. Since $N_r$ is less than the chunk size $N_{cb} N_{bc}$, it is vanishingly small compared to the total number of elements. Each chunk is further divided into batches such that there are $N_{cb}$ batches, each consisting of $N_{bc}$ cells. These batches are executed in sequence by a thread block. To summarize, chunks are assigned to thread blocks and divided into batches; each batch, in turn, is executed by the entire thread block.

Our basic unit of execution, the cell block, consists of $N_{bs}$ cells, which must divide the batch size $N_{bc}$. We execute all $N_{bl} = N_{bc} / N_{bs}$ blocks in the batch concurrently, but note that all cells in a block are not processed concurrently. Referring to Fig. 1, the number of threads in a thread block equals the block size multiplied by the number of field components and the number of concurrent blocks, $N_t = N_{bs} N_{comp} N_{bl}$. Each thread first processes $N_{sqc}$ cells sequentially in the quadrature phase described in Section 3.1, which employs $N_q N_{comp}$ threads per cell. After transposition, each thread processes $N_{sbc}$ cells sequentially in the basis phase described in Section 3.2, which uses $N_{bt}$ threads per cell.
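The batch and block bookkeeping can be made concrete by reproducing the size computations from the Appendix pseudocode:

```c
#include <assert.h>

/* Derived thread/cell counts, following the variable names used in
   the Appendix pseudocode. */
typedef struct {
  int N_t;   /* threads per thread block            */
  int N_bc;  /* cells per batch                     */
  int N_sqc; /* cells processed serially, quad phase  */
  int N_sbc; /* cells processed serially, basis phase */
} Sizes;

static Sizes computeSizes(int N_b, int N_comp, int N_q, int N_bl) {
  Sizes s;
  int N_bt  = N_b * N_comp;          /* total scalar basis functions */
  int N_bst = N_bt * N_q;            /* block size in threads        */
  s.N_t   = N_bst * N_bl;
  s.N_bc  = s.N_t / N_comp;
  s.N_sqc = N_bst / N_bt;            /* = N_q */
  s.N_sbc = N_bst / (N_q * N_comp);  /* = N_b */
  return s;
}
```

With the sizes discussed for Fig. 1 (three basis functions, two quadrature points, one component, one concurrent block), this yields a 6-thread block processing 2 cells serially in the quadrature phase and 3 cells serially in the basis phase.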

Before loading cell information, we store the quadrature point and weight associated with each thread into thread-private memory, and load the tabulation of basis functions and derivatives at quadrature points into shared memory. Note that some forms do not require quadrature points or part of the basis information. We then loop over the batches of cells sent to the given thread block, first loading geometric information (Jacobian inverse and determinant) and input basis coefficients, and then executing the two stages discussed below. Note that executing multiple batches in sequence allows computation in a thread block to overlap the latency for coefficient access in another. The full pseudo-code for the integration routine is given in the Appendix.

### 3.1 Map coefficients to values at quadrature points

Each thread maps the basis coefficients of $N_{sqc}$ cells to evaluations of the input field and its gradient at the quadrature points. Each thread sums the contribution from all basis functions at its assigned quadrature point, using the shared Jacobian inverse to transform the field derivatives. These values are then passed to the $f_0$ and $f_1$ functions from Eq. 1, and the results, multiplied by the quadrature weight and the Jacobian determinant, are stored in shared memory. After this write, we must synchronize the thread block in order to make these values available to the other threads, which should take about 20 cycles [18].
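A serial C reference of this phase for one quadrature point of one cell is sketched below; the sizes are hardwired to an illustrative linear triangle, and whether the Jacobian inverse is applied transposed is a storage-convention assumption:

```c
#include <assert.h>

#define DIM 2
#define N_B 3  /* illustrative: P1 triangle, 3 basis functions */

/* Evaluate grad u at one quadrature point from the cell's basis
   coefficients, applying the (assumed transposed) inverse Jacobian to
   the reference-space basis derivatives. */
static void gradAtQuadPoint(const double u_i[N_B],
                            const double phiDer[N_B][DIM], /* ref space */
                            const double invJ[DIM][DIM],
                            double gradU[DIM]) {
  for (int d = 0; d < DIM; ++d) gradU[d] = 0.0;
  for (int b = 0; b < N_B; ++b) {
    for (int d = 0; d < DIM; ++d) {
      /* real-space derivative of basis b in direction d */
      double der = 0.0;
      for (int e = 0; e < DIM; ++e) der += invJ[e][d] * phiDer[b][e];
      gradU[d] += u_i[b] * der;
    }
  }
}
```

In the GPU kernel the same arithmetic runs with one thread per quadrature point, reading `u_i` and `invJ` from shared memory.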

### 3.2 Map values at quadrature points to coefficients

After the evaluations at quadrature points have been placed in shared memory and the threads have been synchronized, each thread maps the values at quadrature points of $N_{sbc}$ cells to basis coefficients. A thread forms the product of the values with the test function and its gradient at each quadrature point and accumulates the result in a local variable. Note that this is the action of the $B^T$ and $D^T$ matrices from Eq. 2. These results are written to global memory by the thread block for each of the $N_{sbc}$ cells in turn, which means that $N_t / N_{bt}$ cells are written concurrently.
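A serial C reference for one basis coefficient follows; the quadrature weights and Jacobian determinant are assumed already folded into the stored f_0 and f_1 values, as in the Appendix pseudocode:

```c
#include <assert.h>

#define DIM 2
#define N_Q 2  /* illustrative quadrature size */

/* One basis coefficient is the reduction over quadrature points of the
   test function against f_0 plus its gradient against f_1 -- the action
   of one row of B^T and D^T from Eq. 2. */
static double basisCoefficient(const double phi[N_Q],
                               const double phiDer[N_Q][DIM],
                               const double f0[N_Q],
                               const double f1[N_Q][DIM]) {
  double e = 0.0;
  for (int q = 0; q < N_Q; ++q) {
    e += phi[q] * f0[q];
    for (int d = 0; d < DIM; ++d) e += phiDer[q][d] * f1[q][d];
  }
  return e;
}
```

On the GPU, one thread runs this reduction alone, which is exactly how the cross-thread synchronization of a parallel reduction is avoided.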

### 3.3 Memory Traffic and Computation

The concurrency for our algorithm is at worst $N_t = \operatorname{lcm}(N_q, N_{bt})$ threads per thread block. In shared memory, we must store geometric information (Jacobian inverses and determinants), tabulated basis functions and derivatives, basis function coefficients, and the values at quadrature points, so the maximum shared memory used in bytes is given by

$$ S \le S_r \left[ N_{bc} \left( d^2 + 1 \right) + N_q N_{bt} \left( d + 1 \right) + N_{bc} N_{bt} + N_{bc} N_q N_{comp} \left( d + 1 \right) \right] \quad (3) $$

where $d$ is the spatial dimension and $S_r$ the size of a scalar in bytes, so that per cell we have

$$ \frac{S}{N_{bc}} \le S_r \left[ d^2 + 1 + N_{bt} \left( d + 2 \right) + N_q N_{comp} \left( d + 1 \right) \right] \quad (4) $$

where we get equality if $N_{bc} = N_q$. Note that some of these values can be located in thread-local memory, but this is an upper bound for shared memory. In our examples, we need only

$$ S = S_r \left[ N_{bc} \left( d^2 + 1 \right) + N_q N_{bt} \, d + N_{bc} N_{bt} + N_{bc} N_q N_{comp} \left( d + 1 \right) \right] \quad (5) $$

since we do not need the tabulation of basis function values.
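The shared-memory inventory can be tallied mechanically. The sketch below counts the four item groups named above (geometry, tabulation, coefficients, and quadrature-point values) as an upper bound that assumes everything resides in shared memory:

```c
#include <assert.h>
#include <stddef.h>

/* Upper bound on shared memory per thread block, in bytes.
   real  : bytes per scalar
   d     : spatial dimension
   N_bc  : cells per batch
   N_bt  : total scalar basis functions
   N_q   : quadrature points
   N_comp: field components */
static size_t sharedBytes(size_t real, int d, int N_bc, int N_bt,
                          int N_q, int N_comp) {
  size_t geometry     = (size_t)N_bc * (d*d + 1);            /* invJ + detJ */
  size_t tabulation   = (size_t)N_q * N_bt * (d + 1);        /* phi + grads */
  size_t coefficients = (size_t)N_bc * N_bt;                 /* input u_i   */
  size_t values       = (size_t)N_bc * N_q * N_comp * (d+1); /* f_0 + f_1   */
  return real * (geometry + tabulation + coefficients + values);
}
```

This makes it easy to check, for a candidate batch size, how many workgroups fit in a multiprocessor's shared memory.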

since we do not need function tabulation of values. For our Poisson example in Section 6, with blocks of 32 cells, the memory per thread block is KB. This allows 9 concurrent workgroups to be resident on a single multiprocessor, allowing overlap of memory access latency by computation from another workgroup. This has been shown to be quite important, along with the requirement to keep a large number of threads in flight, for optimal performance on the GPU [6].

The memory loaded from global memory per batch is

$$ M = S_r \, N_{bc} \left[ d^2 + 1 + 2 N_{bt} \right] \quad (6) $$

counting the geometry, the input coefficients, and the output element vector, since the tabulation is only loaded once. The computation per cell batch is approximately

$$ C \approx N_{bc} N_q \left[ 2 N_{bt} \left( 2 d^2 + d + 3 \right) + N_{comp} \left( d + 1 \right) F \right] \quad (7) $$

where $F$ is the number of flops performed by the $f_0$ and $f_1$ functions per output component. Thus we have for the algorithmic balance, expressed in floating point operations per byte,

$$ B = \frac{C}{M} \approx \frac{N_q \left[ 2 N_{bt} \left( 2 d^2 + d + 3 \right) + N_{comp} \left( d + 1 \right) F \right]}{S_r \left( d^2 + 1 + 2 N_{bt} \right)} \quad (8) $$

For our example Poisson problem, $d = 2$, $N_{comp} = 1$, $N_{bt} = 3$, and $N_q = 2$, so that with single precision scalars

$$ B \approx \frac{2 \left[ 2 \cdot 3 \cdot 13 + 3 F \right]}{4 \cdot 11} \approx 4 \ \text{flops per byte}, \quad (9) $$

hence our computations are limited by memory bandwidth on typical GPUs, whose machine balance is several times larger.
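Whether a kernel is bandwidth-limited can be checked against a device's machine balance, as in this sketch; the GTX 580 figures used in the test (roughly 1581 GFLOP/sec single precision and 192 GB/sec) are vendor peak numbers quoted purely for illustration:

```c
#include <assert.h>

/* A computation is memory-bandwidth limited when its arithmetic
   intensity (flops per byte) falls below the machine balance,
   i.e. peak flop rate divided by peak memory bandwidth. */
static int bandwidthLimited(double kernelFlopsPerByte,
                            double peakGFlops, double peakGBperSec) {
  return kernelFlopsPerByte < peakGFlops / peakGBperSec;
}
```

This is the standard roofline-style argument: an algorithmic balance of a few flops per byte sits well below the roughly 8 flops per byte machine balance of such a GPU.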

## 4 Variable Coefficients

In order to make our proposed algorithm applicable to many applications in computational science and engineering, it must support variable auxiliary coefficients. These coefficients often come from input data, but can also arise from multiphysics coupling. We express these coefficients as vectors in a linear space, not necessarily the same finite element space as the solution.

When considering variable coefficients, additional data must be retrieved from global memory and stored in shared memory, which reduces the number of concurrent batches possible on a single processor. Moreover, if too few floating point operations are executed with the auxiliary field, it can lower $B$, our algorithmic balance.

The auxiliary fields can be accommodated with a small extension to our interface. Our basic equation now becomes

$$ \int_\Omega \phi \, f_0(u, \nabla u, a, \nabla a) + \nabla\phi \cdot f_1(u, \nabla u, a, \nabla a) \, dV = 0 \quad (10) $$

where $a$ represents the auxiliary fields. On the CPU, our code allows an arbitrary representation for $a$, including fields supported directly on quadrature points. On the GPU, however, we must match the blocking of $a$ with that of the solution fields $u$. Our current implementation allows $a$ to come from the same space as $u$, or from the space of constants over a cell.
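The two supported GPU layouts for the auxiliary field can be sketched with a single accessor; the names and the flat storage layout here are illustrative, not the PETSc API:

```c
#include <assert.h>
#include <stddef.h>

/* Evaluate the auxiliary field for cell c at a quadrature point.
   Passing phi == NULL selects the cellwise-constant layout (one scalar
   per cell); otherwise a carries N_bt coefficients per cell in the same
   space as the solution, tabulated against phi at this point. */
static double auxAtQuadPoint(const double *a, int c, int N_bt,
                             const double *phi) {
  if (phi == NULL) return a[c];        /* constant over the cell */
  double val = 0.0;
  for (int b = 0; b < N_bt; ++b) val += a[c * N_bt + b] * phi[b];
  return val;                          /* same space as u */
}
```

Matching the blocking of the auxiliary field to that of the solution means both layouts can be staged into shared memory with the same batch structure as the solution coefficients.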

## 5 GPU Implementation

We developed a first implementation of our algorithm using CUDA. However, since no CUDA-specific routines such as warp shuffles are used, we reimplemented the algorithm using the free, open standard OpenCL and observed the same performance. Even though OpenCL requires some additional boilerplate code, it provides several advantages when integrated into a library such as PETSc: First, a broader range of hardware from different vendors can be targeted. Second, the OpenCL implementation only requires linking against the shared library distributed with the respective OpenCL SDK, hence it is far less intrusive for the build process than CUDA. Third, the kernels are compiled at runtime by passing the respective kernel source string to the OpenCL just-in-time compiler, which allows for target-specific optimizations at runtime. Thus, any file-based code generation during the build stage can be avoided; instead, the source string can be built entirely from runtime parameters during execution.
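The runtime source assembly might look like the following sketch, where sizes known only at run time are baked into the string handed to the OpenCL just-in-time compiler; the kernel body is a stand-in, not the real integration kernel:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assemble OpenCL kernel source at run time: problem sizes become
   compile-time constants for the JIT, and the user's pointwise physics
   source (f1_body) is spliced in verbatim. */
static int buildKernelSource(char *buf, size_t len, int N_t,
                             const char *f1_body) {
  return snprintf(buf, len,
                  "#define N_t %d\n"
                  "%s\n"
                  "__kernel void integrate(__global float *out)\n"
                  "{ /* integration kernel body elided */ }\n",
                  N_t, f1_body);
}
```

Because the sizes are literal constants in the generated source, the JIT compiler can fully unroll the inner loops, an optimization unavailable to a kernel compiled ahead of time for unknown sizes.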

We provide default implementations for the Poisson equation and the Lamé equation of linear elasticity. Library users can also provide their own functions for $f_0$ and $f_1$ as defined in (1), in which case the source code representing $f_0$ and $f_1$ needs to be provided as a string. Because it is more convenient to provide function pointers than string representations of functions, this can be considered a disadvantage of using OpenCL.

## 6 Benchmarks

We calculate the residual for the Poisson equation and for a linear elastic problem on unstructured simplex meshes of the unit square in two dimensions and of the unit cube in three dimensions. The specific benchmark code is distributed with PETSc as SNES example 12, available at http://www.mcs.anl.gov/petsc/petsc-dev/src/snes/examples/tutorials/ex12.c.html. All runs can be reproduced using the latest PETSc release and the instructions in Section 6.1. We note that the OpenCL backend can now be used for all PETSc FEM examples which provide the $f_0$ and $f_1$ functions as text. For the Poisson problem, $f_0$ in (1) is trivially zero, while for $f_1$ we provide the following function (passed to PETSc as a string) acting pointwise on quadrature points:
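The exact string shipped with PETSc is not reproduced here; since $f_1$ for the Poisson problem simply copies the field gradient, an illustrative version of such a source string is:

```c
#include <assert.h>
#include <string.h>

/* Illustrative pointwise f_1 for the Poisson problem in the string form
   expected by the OpenCL backend: f_1 = grad u. Not the exact text from
   the PETSc distribution. */
static const char *f1_poisson_src =
    "void f1(float u[], float gradU[], float f1[])\n"
    "{\n"
    "  for (int d = 0; d < dim; ++d) f1[d] = gradU[d];\n"
    "}\n";
```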

For 2D linear elasticity, the $f_1$ function is only slightly more complicated,
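An illustrative C version follows (again, not the exact PETSc string): for linear elasticity the flux is the symmetric gradient, and whether material constants are folded in here is an assumption:

```c
#include <assert.h>

#define DIM 2

/* Illustrative pointwise f_1 for 2D linear elasticity:
   f_1[c][d] = (gradU[c][d] + gradU[d][c]) / 2, the strain tensor. */
static void f1_elasticity(const double gradU[DIM][DIM],
                          double f1[DIM][DIM]) {
  for (int c = 0; c < DIM; ++c)
    for (int d = 0; d < DIM; ++d)
      f1[c][d] = 0.5 * (gradU[c][d] + gradU[d][c]);
}
```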

and note that in both cases the $f_0$ function is null since only field gradients are involved in the residual. These functions are directly inlined into the OpenCL program source during the source generation step for the OpenCL just-in-time compiler.

The hardware used for the benchmarks is listed in Table 1. NVIDIA GPUs have been selected to represent the Fermi (GTX 580), Kepler (K20m) and Maxwell (GTX 750 Ti) architecture generations. We note that the GTX 750 Ti is limited by firmware to a lower double precision performance of merely 41 GFLOP/sec. Also, memory error correction is enabled for the Tesla K20m GPU in order to reflect the typical setup in computing centers, while all other GPUs are used without memory error correction. The AMD GPUs are an integrated GPU (A10-5800K) and a discrete high-end workstation GPU (FirePro W9100). All benchmarks were run on Linux-based systems, using OpenCL implementations included in the CUDA 6.5 and CUDA 7.0 releases, respectively.

In Table 2 we consider the performance obtained for different numbers of blocks per batch and different numbers of batches per chunk. Generally, the number of blocks per batch needs to be balanced such that the number of threads per workgroup is large enough, but at the same time shared memory consumption is kept low enough to obtain higher hardware occupancy. NVIDIA GPUs provide an upper bound of 1024 threads per workgroup on recent models, while AMD GPUs require workgroup sizes no larger than 256, which can be cast into upper bounds for the number of blocks per batch for a given basis and quadrature. As the results show, the choice of cell block and batch sizes has only a mild influence: the difference between the best and the worst performance is only a few percent, hence good out-of-the-box performance can be provided with suitable default settings. Also, the performance remains mostly unchanged as the number of batches is changed, which is expected because batches are processed one after another and their number does not influence the amount of shared memory required.
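The workgroup-size limit translates directly into a bound on the number of concurrent blocks per batch; a sketch using the per-block thread count N_bt * N_q from the Appendix:

```c
#include <assert.h>

/* Maximum number of concurrent blocks per batch: the thread count
   N_bt * N_q * N_bl must not exceed the device workgroup-size limit. */
static int maxBlocksPerBatch(int maxWorkgroupSize, int N_bt, int N_q) {
  return maxWorkgroupSize / (N_bt * N_q);
}
```

For instance, a 3D element with 4 scalar basis functions and a 4-point rule admits at most 64 concurrent blocks under a 1024-thread limit, but only 16 under a 256-thread limit.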

Benchmark results obtained for the Poisson problem in Fig. 4 show that on the GTX 580 we obtained up to GFLOP/sec in single precision for the three-dimensional case. The GFLOP/sec in the two-dimensional case amount to an effective memory bandwidth of GB/sec based on the algorithmic balance from (9). This amounts to percent of peak memory bandwidth, which is the same as the practical maximum obtained with the STREAM benchmark. Similarly, the practical peak memory bandwidth is reached for the GTX 750 Ti in single precision, and the firmware limit on double precision performance is reached. A priori, one expects the Tesla K20m to reach performance similar to the GTX 580, but this is not reflected in the benchmark results. On closer inspection, however, we could explain the performance difference by the smaller memory bandwidth obtained with the STREAM benchmark, part of which is caused by the memory error correction.

The performance obtained on AMD devices is, in absolute terms, substantially lower than that obtained on the NVIDIA devices. On closer inspection, however, the integrated A10-5800K GPU achieves a memory bandwidth of GB/sec, which is the practical maximum for a dual-channel DDR3 configuration. Only the discrete FirePro W9100 could not reach high bandwidth. Our investigations showed that this was caused by the low effective memory bandwidth achieved when loading the cell coefficients, which are stored in an unstructured way in global memory.

Fig. 5 depicts the benchmark results obtained for solving the linear elasticity equations. The vector-valued basis reduces the algorithmic balance by roughly two-fold compared to the Poisson equation. This is also reflected in the observed performance on a GTX 580: In two dimensions, GFLOP/sec in single precision ( GFLOP/sec for Poisson) and GFLOP/sec in double precision ( GFLOP/sec for Poisson). In three dimensions, we obtained GFLOP/sec in single precision ( GFLOP/sec for Poisson) and GFLOP/sec in double precision ( GFLOP/sec for Poisson).

### 6.1 Reproducing the Results

All the foregoing examples can be run using the PETSc libraries. You will need to configure with OpenCL support, for example

The data for the Poisson problem was collected using multiple runs of the form

where -petscfe_num_blocks specifies the number of blocks per batch and -petscfe_num_batches selects the number of batches per chunk. The option -refinement_limit is passed to the mesh generator (Triangle for triangular meshes, TetGen for tetrahedral meshes) and provides control over the problem size.

## 7 Conclusions

We have produced a high performance implementation of FEM quadrature for low-order elements, amenable to high throughput architectures such as GPUs. The performance gain is due to flexible vectorization without a reduction over threads, made possible by the thread transposition construct we introduced. The OpenCL implementation has been integrated into PETSc, showing the ease of implementation and providing an open testbed for further work. In future work, we will integrate this work into existing applications, such as the PyLith code for crustal deformation [1].

Below we give the overall structure of the integration code, eliminating extraneous detail such as variable declarations and precise indices. The full code can be found in the OpenCL implementation in the PETSc distribution.

```
// N_cb  Number of serial cell batches
// N_bl  Number of concurrent blocks
void integrateElementQuadrature(int N_cb, realType *coefficients,
                                realType *jacobianInverses,
                                realType *jacobianDeterminants,
                                realType *elemVec)
{
  int dim    = spatialDim;          // Spatial dimensions
  int N_b    = numBasisFunctions;   // Basis functions
  int N_comp = numBasisComponents;  // Basis function components
  int N_q    = numQuadPoints;       // Quadrature points
  int N_bt   = N_b*N_comp;          // Total scalar basis funcs
  int N_bst  = N_bt*N_q;            // Block size
  int N_t    = N_bst*N_bl;          // Threads
  int N_bc   = N_t/N_comp;          // Cells/batch
  int N_c    = N_cb * N_bc;         // Total cells
  int N_sbc  = N_bst/(N_q*N_comp);  // Serial basis cells
  int N_sqc  = N_bst/N_bt;          // Serial quad cells

  /* Load quadrature weights */
  w = weights_0[q];
  /* Load basis tabulation phi_i for this cell */
  phi_i[q,b]    = Basis_0[q,b];
  phiDer_i[q,b] = BasisDerivatives_0[q,b];

  for (int batch = 0; batch < N_cb; ++batch) {
    /* Load geometry */
    detJ[c] = jacobianDeterminants[c];
    invJ[c] = jacobianInverses[c];
    /* Load coefficients u_i for this cell */
    u_i[c] = coefficients[c];

    /* Map coefficients to values at quadrature points */
    for (int c = 0; c < N_sqc; ++c) {
      u     = 0.0;  // u(x_q), value of the field at x_q
      gradU = 0.0;  // du/dx(x_q), value of the gradient at x_q
      /* Get field and derivatives at this quadrature point */
      for (int i = 0; i < N_b; ++i) {
        for (int comp = 0; comp < N_comp; ++comp) {
          u[comp]     += u_i[c,b]*phi_i[q,b];
          gradU[comp] += u_i[c,b]*(invJ[c] * phiDer_i[q,b]);
        }
      }
      /* Process values at quadrature points */
      f_0[c,q,comp] = f0_func(u, gradU, c)*detJ[c]*w;
      f_1[c,q,comp] = f1_func(u, gradU, c)*detJ[c]*w;
    }

    /* ==== TRANSPOSE THREADS ==== */
    synchronize_threads();

    /* Map values at quadrature points to coefficients */
    for (int c = 0; c < N_sbc; ++c) {
      e_i = 0.0;
      for (int q = 0; q < N_q; ++q) {
        e_i += phi_i[q,b]*f_0[c,q,comp];
        e_i += invJ * phiDer_i[q,b] * f_1[c,q,comp];
      }
      /* Write element vector for N_sbc cells at a time */
      elemVec[c,b] = e_i;
    }
  }
}
```

## Thanks

We would like to thank Hans Petter Langtangen for organizing a visit to Simula Research which was the genesis of this paper.

## References

- [1] Brad T. Aagaard, Matthew G. Knepley, and Charles A. Williams, A domain decomposition approach to implementing fault slip in finite-element models of quasi-static and dynamic crustal deformation, Journal of Geophysical Research: Solid Earth, 118 (2013), pp. 3059–3079. http://dx.doi.org/10.1002/jgrb.50217.
- [2] ANSYS, Abaqus reference manual, 2015. Software Release Version 6.11.2.
- [3] Jed Brown, Efficient nonlinear solvers for nodal high-order finite elements in 3D, Journal of Scientific Computing, 45 (2010), pp. 48–63.
- [4] Cris Cecka, AJ Lew, and Eric Darve, Assembly of finite element methods on graphics processors, International Journal for Numerical Methods in Engineering, 85 (2011), pp. 640–669.
- [5] Andrew Corrigan, Fernando F Camelli, Rainald Löhner, and John Wallin, Running unstructured grid-based cfd solvers on modern graphics hardware, International Journal for Numerical Methods in Fluids, 66 (2011), pp. 221–229.
- [6] Felipe A. Cruz, Simon K. Layton, and L.A. Barba, How to obtain efficient GPU kernels: An illustration using FMM & FGT algorithms, Computer Physics Communications, 182 (2011), pp. 2084–2098.
- [7] M. Dabrowski, M. Krotkiewski, and D. W. Schmid, MILAMIN: MATLAB-based finite element method solver for large problems, Geochem. Geophys. Geosyst., 9 (2008).
- [8] Adam Dziekonski, Piotr Sypek, Adam Lamecki, and Michal Mrozowski, Finite element matrix generation on a gpu, Progress In Electromagnetics Research, 128 (2012), pp. 249–265.
- [9] Andreas Klöckner, Tim Warburton, Jeff Bridge, and Jan S Hesthaven, Nodal discontinuous galerkin methods on graphics processors, Journal of Computational Physics, 228 (2009), pp. 7863–7882.
- [10] Matthew G. Knepley and Andy R. Terrel, Finite element integration on GPUs, ACM Transactions on Mathematical Software, 39 (2013). http://arxiv.org/abs/1103.0066.
- [11] Dimitri Komatitsch, Gordon Erlebacher, Dominik Göddeke, and David Michéa, High-order finite-element seismic wave propagation modeling with mpi on a large gpu cluster, Journal of Computational Physics, 229 (2010), pp. 7692–7714.
- [12] John D. C. Little, A proof for the queuing formula: $L = \lambda W$, Operations Research, 9 (1961), pp. 383–387. http://www.jstor.org/stable/167570.
- [13] Victor Minden, Barry Smith, and Matthew G Knepley, Preliminary implementation of petsc using gpus, in GPU Solutions to Multi-scale Problems in Science and Engineering, Springer, 2013, pp. 131–140.
- [14] Kristian B. Ølgaard and Garth N. Wells, Optimizations for quadrature representations of finite element tensors through automated code generation, ACM Transactions on Mathematical Software, 37 (2010), pp. 1–23.
- [15] Z.A. Taylor, M. Cheng, and S. Ourselin, High-speed nonlinear finite element analysis for surgical simulation using graphics processing units, Medical Imaging, IEEE Transactions on, 27 (2008), pp. 650–663.
- [16] TOP500 Supercomputer Sites, 2015.
- [17] Alan B. Williams, Optimizing implicit finite element applications for GPUs, in Abstracts of SIAM PP12, SIAM, 2012.
- [18] Henry Wong, M.M. Papadopoulou, M. Sadooghi-Alvandi, and Andreas Moshovos, Demystifying GPU microarchitecture through microbenchmarking, in Performance Analysis of Systems & Software (ISPASS), 2010 IEEE International Symposium on, IEEE, 2010, pp. 235–246.