Acceleration of tensor-product operations for high-order finite element methods

This paper is devoted to GPU kernel optimization and performance analysis of three tensor-product operators arising in finite element methods. We provide the mathematical background for these operations and implementation details. Achieving close-to-peak performance for these operators requires extensive optimization because of the operators' properties: low arithmetic intensity, a tiered structure, and the need to store intermediate results inside the kernel. We give a guided overview of optimization strategies and present a performance model that allows us to compare the efficacy of these optimizations against an empirically calibrated roofline.





1 Introduction

State-of-the-art high-order finite element method (FEM) based simulation codes now execute calculations on parallel machines with up to a million CPU cores. For example, the Nek5000 incompressible fluid flow simulator has successfully scaled up to a million CPU cores using an MPI distributed computing model (Fischer et al., 2015). The current shift towards greater on-chip parallelism makes it imperative to focus on finer grain parallel implementation of high-order operators on next generation accelerators, such as Graphics Processing Units (GPUs). Efficient high-order finite element implementations rely heavily on tensor-product contractions (Deville et al., 2002). These elemental tensor-product operations are memory bound and require repeated sweeps through the data for each finite element operation. In this work we demonstrate that despite their complexity, it is possible to achieve nearly maximum throughput on a current generation GPU.

In this paper we focus in particular on GPU implementations of three commonplace matrix-vector multiplications (abbreviated below as matvecs) arising in finite element codes: the mass matvec (BP1.0), the stiffness matvec (BP3.5), and the stiffness matvec with de-aliasing integration (BP3.0). These problems were formulated as part of the CEED co-design effort (CEED, 2017). BP1.0, BP3.5, and BP3.0 appear well-suited for fine-grain parallelism, but they are challenging to fully optimize. These benchmarks are characterized by low arithmetic intensity: the ratio of floating point operations to data transfer is low. The benchmarks also require non-unit-stride memory access patterns, since all threads processing a single element need data computed by the other threads processing that element. Finally, the operations consist of several concatenated tensor-product contractions, and all threads process the contractions simultaneously in a prescribed order.

We detail several strategies that are collectively applied to maximize the computational throughput of these operations. In a sequence of computational experiments we progressively optimize the operations by varying the amount of global memory accesses, shared memory usage, register variable usage, data padding, and loop unrolling.

To guide the optimization process we construct a performance model that serves as a tool to evaluate the efficiency of the kernels. Because of the nature of the finite element operations, the performance of our kernels is limited by the bandwidth of data streaming from both global device memory and shared memory, as well as by the finite register file and shared memory capacity. While a theoretical performance roofline, as presented by (Lo et al., 2014), gives an upper bound on the capability of the hardware under ideal circumstances, we instead rely on an empirically calibrated roofline model that provides a realistic estimate of the best achievable performance.

1.1 Overview of published literature

Early optimization studies of FEM operations date back to 2005, when (Göddeke et al., 2005) and (Göddeke et al., 2007) solved elliptic PDEs on two-dimensional domains using a cluster equipped with GPU accelerators. In later work, (Göddeke et al., 2009) focused mostly on accelerating a Navier-Stokes solver for lid-driven cavity flow and laminar flow over a cylinder. The study presented by (Fu et al., 2014) targeted the entire FEM pipeline, using an elliptic Helmholtz problem to show how FEM codes can be ported to the GPU. In addition, (Fu et al., 2014) discussed strategies for accelerating conjugate gradient and algebraic multigrid solvers.

Other authors have chosen to improve the performance of one part of the FEM pipeline. For example, (Cecka et al., 2011) and (Markall et al., 2013) focused on the global assembly phase of FEM and showed how to optimize this phase for GPU execution. Markall and co-authors emphasize that the choice of the most efficient algorithm strongly depends on the available hardware and the selected programming model; they considered both CPU and GPU execution with OpenCL and CUDA parallel implementations. Moreover, (Markall et al., 2010, 2013) argue that making code portable between different threading systems (such as many-core and multi-core architectures) requires a high-level programming language.

In all of the above work the greatest concern was efficient GPU execution of FEM with low-order discretizations. In contrast, (Remacle et al., 2016) present algorithms for solving elliptic equations using high-order continuous hexahedral finite elements on GPUs. The issues associated with efficiently handling the greater complexity of high-order FEM operations highlighted in that work are amplified in the current paper and refined with the use of a new roofline model.

The evaluation of the matvec typically accounts for the largest cost in an elliptic FEM solver (Remacle et al., 2016, Table 2, Table 3). Optimization of matvec performance on GPUs appears in the literature mostly in the context of general applications, yet a few papers have focused directly on the low-order FEM matvec. For example, (Dziekonski et al., 2017) used a conjugate gradient solver and optimized the matvec product as its key component. (Dehnavi et al., 2010) and (Grigoraş et al., 2016) presented similar findings and provided optimization strategies. In our case, however, the action of the operator is local to the element: we never assemble the global matrix and instead perform the matvec element-wise. In the literature this is known as the matrix-free approach.

The matvecs used in this paper can be expressed as a concatenation of tensor contractions. Related work on efficient GPU implementation of tensor contractions consists of several papers. (Nelson et al., 2015) applied a high-level language to formulate tensor contractions and used GPU code auto-tuning to enhance the performance of the code. The findings were tested using several benchmark problems, including a problem derived from Nek5000 (see (Paul F. Fischer and Kerkemeier, 2008)). The optimized code reached 120 GFLOP/s on a Maxwell-class GPU. (Abdelfattah et al., 2016) published similar work in which, in addition to specifying a high-level language to formulate tensor contractions, the authors introduced a performance model based on both the number of floating point operations and the number of bytes of global memory that need to be read and written. The performance results were compared to cuBLAS and to CPU code executed on 16 CPU cores. The most efficient version of the code reaches 180 GFLOP/s on an NVIDIA Tesla K40 GPU. While the two previously mentioned papers focused on small tensor contractions, (Liu et al., 2017) optimized large tensor contractions, also using auto-tuning for performance optimization. That paper explained the details of the implementation and reported the global memory usage.

The performance model in this paper is rooted in earlier work by several authors, originating from the roofline model of (Lo et al., 2014). (Stratton et al., 2012) and (Zhang and Owens, 2011) modeled GPU performance using benchmarks; these efforts focused on semi-automatic identification of performance bottlenecks. Because complete GPU hardware documentation is often unavailable, designing benchmarks specifically to reveal the underlying hardware properties appears frequently in the literature; at the same time, hardware organization strongly impacts performance ((Wong et al., 2010), (Hong and Kim, 2009), and (Lee et al., 2010)). (Volkov and Demmel, 2008) and (Volkov, 2010) employed a different approach: they optimized matrix-matrix multiplication on the GPU and aimed to explain the code's performance while improving the implementation. In his 2010 paper, Volkov pointed out that, contrary to common belief, high GPU occupancy is not necessary to obtain performance close to the peak declared by the GPU manufacturer. This observation is important for the current paper: our most efficient kernels make heavy use of registers and shared memory, so their achieved occupancy is often quite low even though their performance is close to the empirical roofline. Volkov's observation that the aggregate number of shared memory accesses per thread should be reduced to avoid bottlenecks motivates a final optimization for the high-order FEM operations considered in this work. We exploit the intrinsic structure of the interpolation and differentiation tensors used in constructing the action of the mass and stiffness matrices to improve throughput by reducing the aggregate number of shared memory accesses.

The remainder of this paper is organized as follows. We first present the mathematical description of the three benchmark problems mentioned earlier. We then introduce the performance model we use to evaluate our kernels. We continue by explaining the implementation and optimization choices for each of the benchmark problems. Finally, we give concluding remarks and future goals.

2 Notation

Code       Meaning
N          Degree of the polynomial used in the interpolation
Nq         Number of Gauss-Lobatto-Legendre (GLL) quadrature nodes: Nq = N + 1
Np         Number of GLL nodes in a hexahedral element: Np = Nq^3
glNq       Number of Gauss-Legendre (GL) quadrature nodes
glNp       Number of GL nodes in a hexahedral element: glNp = glNq^3
I          Interpolation matrix from GLL nodes to GL nodes
D          Differentiation matrix used in BP3.5, see (9)
glD        Differentiation matrix used in BP3.0
NElements  Number of elements in the mesh
Table 1: The notation used in the paper; the symbols shown are those used in the pseudocode listings.

The notation used throughout this paper is shown in Table 1.
The floating point throughput rates for the compute kernels in this paper are reported in TFLOP/s. We use double precision arithmetic in all tests.

3 Hardware and software

All computational studies in this paper were performed on a single NVIDIA Tesla P100 (Pascal class) PCI-E GPU with 12 GB of RAM and a maximum theoretical bandwidth of 549 GB/s. The GPU is hosted by a server node equipped with a 14-core Intel Xeon E5-2680 v4 processor running at 2.40 GHz. The code was compiled using the gcc 5.2.0 compiler and the nvcc CUDA compiler release 8.0 (V8.0.61), managed by the OCCA library (see (Medina et al., 2014)).

The reference times (needed for the copy bandwidth in the roofline plots) are measured using CUDA events; the kernels themselves are timed using MPI_Wtime.

In our experiments, we test the code on two hexahedral cube-shaped meshes. The small mesh consists of 512 elements and the large mesh consists of 4096 elements.

4 Problem description

The three benchmark problems we consider below were formulated by the Center for Efficient Exascale Discretizations (see (CEED, 2017)). These problems are motivated by considering the numerical approximation of the screened Poisson equation

$$-\nabla^2 u + \lambda u = f, \qquad (1)$$

by a high-order finite element method on hexahedral elements. In the equation, $f$ is a given forcing function and $\lambda$ is a constant. To begin, we consider an unstructured mesh of a domain $\Omega$ into hexahedral elements $E_e$, where $e = 1, \ldots, K$, such that

$$\Omega = \bigcup_{e=1}^{K} E_e.$$

We assume that each hexahedral element $E_e$ is the image of the reference element $\hat{E}$ under a tri-linear map. We take the reference element to be the bi-unit cube, i.e.

$$\hat{E} = [-1,1]^3 = \{(r,s,t) : -1 \le r, s, t \le 1\}.$$

On each element we consider the variational form of the screened Poisson problem (1) by requiring that the solution $u$ satisfies

$$\int_{E_e} \nabla u \cdot \nabla v + \lambda \int_{E_e} u v = \int_{E_e} f v,$$

for all test functions $v$. We map the integrals to the reference cube in order to write the variational form as

$$\int_{\hat{E}} (\nabla u)^T G \, \nabla v + \lambda \int_{\hat{E}} u v J = \int_{\hat{E}} f v J,$$

where $J$ is the determinant of the Jacobian of the mapping from element $E_e$ to the reference cube and $G$ is the scaled elemental metric tensor. The gradient operators in these integrals are now understood to be with respect to the reference coordinates.

On the reference cube we construct a high-order finite element approximation of the function $u$, denoted $u^h$, which is a degree $N$ polynomial in each dimension. As a polynomial basis we choose the tensor product of one-dimensional Lagrange interpolation polynomials based on the Gauss-Lobatto-Legendre (GLL) nodes on the interval $[-1,1]$. We denote the GLL nodes in the $r$ dimension by $r_i$, for $i = 0, \ldots, N$, and use an analogous notation for the GLL nodes in the $s$ and $t$ dimensions. Using a multi-index, we define the multi-dimensional basis polynomials as the tensor product of the one-dimensional Lagrange interpolating polynomials, i.e.

$$\phi_{ijk}(r,s,t) = l_i(r)\, l_j(s)\, l_k(t),$$

for all $0 \le i,j,k \le N$. Hence, on $\hat{E}$ we can express the polynomial $u^h$ as

$$u^h(r,s,t) = \sum_{i,j,k=0}^{N} q_{ijk}\, \phi_{ijk}(r,s,t).$$

Consequently, the coefficients $q_{ijk}$ are the nodal values of the polynomial $u^h$ at the GLL interpolation points, i.e. $q_{ijk} = u^h(r_i, s_j, t_k)$.

Taking the test functions to be each of the basis Lagrange interpolating polynomials $\phi_{ijk}$, and taking the vector $q$ to be the vector of the polynomial coefficients of $u^h$, we can write the variational formulation as the following linear system

$$(S^e + \lambda M^e)\, q^e = f^e,$$

where the local stiffness matrix $S^e$, local mass matrix $M^e$, and local load vector $f^e$ (with $J$ and $G$ the Jacobian determinant and scaled metric tensor of the element mapping) are defined as

$$S^e_{ijk,lmn} = \int_{\hat{E}} (\nabla \phi_{lmn})^T G \, \nabla \phi_{ijk}, \qquad (3)$$

$$M^e_{ijk,lmn} = \int_{\hat{E}} \phi_{lmn}\, \phi_{ijk}\, J, \qquad (4)$$

$$f^e_{ijk} = \int_{\hat{E}} f\, \phi_{ijk}\, J.$$
We concatenate the local stiffness and mass matrix operators as well as the local load vectors to form global unassembled versions, which we denote $S$, $M$, and $f$, respectively. Doing so, we form the following block diagonal system, which operates on the global vector of solution coefficients $q$:

$$(S + \lambda M)\, q = f.$$
Due to their block diagonal structure, these global stiffness and mass matrix operators can be applied in a matrix-free (element-wise) way, and no communication is required between elements.

The benchmarks presented below consider the efficient action of either just the mass matrix or the complete screened Poisson operator on each element. Benchmark problems 1.0 and 3.0 consider the case where the integrals (3)-(4) are evaluated using a full Gauss-Legendre (GL) quadrature, and benchmark problem 3.5 considers the case where the integrals are approximated using a collocation quadrature at the GLL interpolation points.

4.1 Benchmark Problem 1: Mass Matrix Multiplication

The first benchmark we present involves the matrix-vector product of the high-order finite element mass matrix with a corresponding vector. This operation is a component of finite element elliptic operators and of some preconditioning strategies, and thus serves as a useful initial performance benchmark. The operation requires relatively little data transfer compared with the more demanding differential operators, which necessitate loading the elemental metric tensor for each node of the element, as discussed in the later benchmarks.

Beginning from the integral definition of the entries of the local mass matrix in (4) we use the fact that is the reference cube to evaluate the integral in each dimension separately via an -node Gauss-Legendre (GL) quadrature. We denote the quadrature weights and nodes in the dimension as and , respectively, and use an analogous notation for quadrature nodes in the and dimensions. Thus we can write


We can write this expression in matrix form by defining the interpolation operator,

for all valid indices, as well as the diagonal matrix of weights and geometric data, which has entries

for all in order to write the local mass matrix (6) compactly as

Note that since the basis interpolation polynomials are tensor products of one-dimensional polynomials, we can define the one-dimensional interpolation matrix as


for all and , in order to express the interpolation operator as a tensor product of the one-dimensional operators, i.e.

Thus, the interpolation from the GLL interpolation nodes to the GL quadrature nodes can be applied using three tensor contractions, while the projection back to the GLL nodes via the transpose interpolation operator comprises three additional tensor contractions. Since the remaining operation is simply multiplication by the diagonal matrix, no further tensor contractions are required. Since we make use of the interpolation and projection operations again in BP3.0 below, we detail their pseudocode in Algorithms 1 and 2. We then detail the full matrix-free action of the mass matrix in the pseudocode in Algorithm 3.

Data: (1) field q, size Np; (2) interpolation matrix I, size glNq × Nq
Output: interpolated field, size glNp
for … do
    Interpolate in the r direction
end for
for … do
    Interpolate in the s direction
end for
for … do
    Interpolate in the t direction, save the result
end for
Algorithm 1 Interpolation from GLL to GL nodes
Data: (1) field, size glNp; (2) interpolation matrix I, size glNq × Nq
Output: projected field, size Np
for … do
    Project in the r direction
end for
for … do
    Project in the s direction
end for
for … do
    Project in the t direction, save the result
end for
Algorithm 2 Projection from GL to GLL nodes
Data: (1) field q, size Np · NElements; (2) interpolation matrix I, size glNq × Nq; (3) scaled Jacobians, size glNp · NElements
Output: result field, size Np · NElements
for each element do
    Interpolate to GL nodes (Algorithm 1)
    for … do
        Scale by the Jacobian and integration weights
    end for
    Project to GLL nodes (Algorithm 2)
end for
Algorithm 3 BP1.0: mass matrix multiplication
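To make the tensor-contraction structure of Algorithms 1-3 concrete, the following NumPy sketch applies the matrix-free mass matvec to a single element. The function name, the use of `einsum`, and the random operators are illustrative assumptions, not the paper's OCCA/CUDA implementation; the point is that interpolation, diagonal scaling, and projection amount to seven small contraction/pointwise steps.

```python
import numpy as np

def mass_matvec_element(q, I, JW):
    """Matrix-free mass matvec for one element (BP1.0 sketch).
    q  : field at the Nq x Nq x Nq GLL nodes
    I  : glNq x Nq one-dimensional interpolation matrix
    JW : Jacobian-times-quadrature-weight factors at the GL nodes"""
    # Algorithm 1: interpolate from GLL to GL nodes, one direction at a time.
    t = np.einsum('ai,ijk->ajk', I, q)
    t = np.einsum('bj,ajk->abk', I, t)
    t = np.einsum('ck,abk->abc', I, t)
    # Scale by the diagonal matrix of weights and geometric factors.
    t = JW * t
    # Algorithm 2: project back to GLL nodes with the transposed contractions.
    t = np.einsum('ck,abc->abk', I, t)
    t = np.einsum('bj,abk->ajk', I, t)
    t = np.einsum('ai,ajk->ijk', I, t)
    return t
```

The result agrees with assembling the element mass matrix explicitly as (I⊗I⊗I)ᵀ diag(JW) (I⊗I⊗I), but the contraction form only ever touches the small one-dimensional matrix I.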

4.2 Benchmark Problem 3.5: Stiffness Matrix with Collocation Differentiation

For our second benchmark, we consider the matrix-vector product of the full high-order finite element screened Poisson operator and a corresponding vector. In this benchmark, the stiffness and mass operators are evaluated using a collocation GLL quadrature rather than the more accurate GL quadrature used in the other two benchmark problems. This operation is central to many elliptic finite element codes and is usually part of a discrete operator we wish to invert. For example, incompressible flow solvers such as Nek5000 (see (Paul F. Fischer and Kerkemeier, 2008)) require solving a Poisson potential problem at each time step. Consequently, this matvec may be evaluated many times in each time step of a flow simulation, making its optimization a significant factor for good performance.

To describe the application of the full screened Poisson operator we begin by describing the local stiffness matrix defined in (3). We evaluate the integral in (3) in each dimension separately, this time using the GLL interpolation nodes as the quadrature. We denote the GLL quadrature weights and nodes in the dimension as and , respectively, and use an analogous notation for the GLL quadrature nodes in the and dimensions. Thus we can write


To write this expression in a more compact matrix form we begin by defining the differentiation operators,


for . We then define the gradient operator to be the vector of these three derivative operators, i.e. . Next, since is the scaled metric tensor on element defined by , we denote the entries of as

We define a matrix of operators where each entry of is a diagonal matrix of weights and geometric data from . That is, we define


where each entry, say , is defined as a diagonal matrix which has entries

for . We use analogous definitions for the remaining entries of . Using these matrix operators we can write the local stiffness matrix (8) compactly as follows

Data: (1) input vector q, size Np · NElements; (2) differentiation matrix D, size Nq × Nq; (3) geometric factors, size 7 · Np · NElements; (4) parameter λ
Output: result vector, size Np · NElements
for each element do
    for each node do
        Load the geometric factors: …
        Multiply by D: …
        Apply the chain rule: …
    end for
    for each node do
        …
    end for
end for
Algorithm 4 BP3.5: collocation differentiation for a 3D hexahedral mesh

To simplify the action of this local stiffness matrix we again use the fact that the basis interpolation polynomials are tensor products of one-dimensional polynomials. We define the one-dimensional differentiation matrix as

for all valid indices. Using this one-dimensional derivative operator, and the fact that the GLL quadrature nodes collocate with the interpolation nodes that define the Lagrange basis polynomials, we write the partial derivative matrices as tensor products of

and the identity matrix

as follows

Thus, differentiation along each dimension can be computed using a single tensor contraction.

Finally, to write the full action of the local screened Poisson operator, we note that since we have used the collocation GLL quadrature in the evaluation of the integrals (3)-(4), we can follow the description of the mass matrix operator above to find that no interpolation is required and the mass matrix is simply a diagonal matrix, with the matrix of geometric factors now defined using the GLL weights and quadrature points, i.e.

for . Thus we write the local screened Poisson operator as

This operator can be applied using only six tensor contractions. First, we differentiate along each dimension using three tensor contractions. We then multiply by the necessary geometric factors and apply the transpose operator with three more tensor contractions. Finally, we add the mass matrix contribution, which requires no tensor contractions. We detail the full matrix-free action of the screened Poisson operator in the pseudocode in Algorithm 4.
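The claim that each partial derivative is a single tensor contraction can be checked in a few lines of NumPy: contracting the one-dimensional matrix D against one index of the nodal array reproduces the action of the Kronecker products D⊗I⊗I, I⊗D⊗I, and I⊗I⊗D on the flattened nodal vector. The random D here is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
Nq = 5
D = rng.standard_normal((Nq, Nq))    # 1D differentiation matrix (illustrative)
u = rng.standard_normal((Nq, Nq, Nq))

# One tensor contraction per partial derivative.
ur = np.einsum('il,ljk->ijk', D, u)  # derivative in the first direction
us = np.einsum('jl,ilk->ijk', D, u)  # derivative in the second direction
ut = np.einsum('kl,ijl->ijk', D, u)  # derivative in the third direction

# Equivalent Kronecker-product forms acting on the flattened nodal vector.
Id = np.eye(Nq)
assert np.allclose(ur.ravel(), np.kron(D, np.kron(Id, Id)) @ u.ravel())
assert np.allclose(us.ravel(), np.kron(Id, np.kron(D, Id)) @ u.ravel())
assert np.allclose(ut.ravel(), np.kron(Id, np.kron(Id, D)) @ u.ravel())
```

The contraction form costs O(Nq⁴) operations per derivative instead of the O(Nq⁶) of the assembled Kronecker matrix, which is the source of the efficiency of the tensor-product formulation.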

Data: (1) input vector q, size Np · NElements; (2) differentiation matrix glD, size glNq × glNq; (3) interpolation matrix I, size glNq × Nq; (4) geometric factors, size 7 · glNp · NElements; (5) parameter λ
Output: result vector, size Np · NElements
for each element do
    Interpolate to GL nodes (Algorithm 1)
    for each node do
        Load the geometric factors: …
        Multiply by glD: …
        Apply the chain rule: …
    end for
    for each node do
        …
    end for
    Project to GLL nodes (Algorithm 2)
end for
Algorithm 5 BP3.0: differentiation for 3D hexahedral elements

4.3 BP3.0: Stiffness matrix evaluated with quadrature

The final benchmark we consider is the same matrix-vector product of the high-order screened Poisson operator as in BP3.5. This time, however, we use the full GL quadrature to approximate the integrals in (3) and (4). This benchmark combines computational elements from BP1.0 and BP3.5, which makes for a more arithmetically intense kernel, and maximizing its performance on GPUs is challenging.

We again begin by describing the local stiffness matrix defined in (3). We evaluate the integral in (3) in each dimension separately, this time using the full GL quadrature. Using the notation introduced in BP1.0 above, we write


Note here that the quadrature requires the gradients of the basis polynomials to be evaluated at the GL quadrature nodes. Were we simply to compose the interpolation and differentiation operators defined in BP1.0 and BP3.5 above, we would require nine tensor contractions to evaluate this quantity. Indeed, for each of the three partial derivatives, we would require an operation that combines differentiation and interpolation to the GL quadrature along one dimension with interpolation to the GL quadrature along the remaining two dimensions.

We instead reduce the number of required tensor contractions by considering the Lagrange interpolating polynomials for which interpolate the GL quadrature nodes. We define the tensor product basis polynomials as done for the GLL interpolating basis as

for . We then define the derivative operators, as done above for the GLL interpolating Lagrange basis functions, on this set of polynomials as follows

for all valid indices. We can then construct the gradient operation on this set of basis functions. Thus, if we view the interpolation operators defined in (7) as transforming a polynomial from the basis of Lagrange interpolating polynomials on the GLL nodes to the basis of Lagrange interpolating polynomials on the GL quadrature nodes, then we can use this operator to evaluate the derivatives of the polynomial in the GL node basis.

With these derivative operators, we continue as in BP3.5 by defining the matrix of weights and geometric data in (10) where this time

for . We use analogous definitions for the remaining entries of . Using these matrix operators we can write the local stiffness matrix (8) compactly as follows

Note that, as with the GLL interpolation basis polynomials, we can define a one-dimensional differentiation operator such that

for all valid indices, so that each differentiation operation can be written as a tensor product of the one-dimensional operator and the identity matrix. Therefore, differentiation of the GL Lagrange interpolation basis polynomials along each dimension can be applied using a single tensor contraction.

Finally, to express the full action of the local screened Poisson operator we combine the discussion in BP1.0 regarding using the GL quadrature to evaluate the local mass matrix in order to write the local screened Poisson operator as

where the mass term is defined as in BP1.0 above. This operator can be applied using a total of twelve tensor contractions. We first interpolate to the GL quadrature nodes by applying the interpolation operator using three tensor contractions. We then differentiate along each dimension using three more tensor contractions. Next, we multiply by the necessary geometric factors, apply the transpose derivative operator using three additional tensor contractions, and add the mass matrix contribution. Finally, we use three tensor contractions to apply the transpose interpolation operator in each dimension, projecting the result back to the GLL interpolation nodes. We detail the full matrix-free action of the stiffness matrix evaluated with numerical quadrature in the pseudocode in Algorithm 5.
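The twelve-contraction sequence can be sketched end to end in NumPy. Everything below is an illustrative stand-in (random I, glD, metric factors G, and mass factors JW), not the paper's kernel; the sketch only demonstrates that the contraction sequence reproduces the assembled operator I₃ᵀ (Σₐᵦ Dgₐᵀ Gₐᵦ Dgᵦ + λ diag(JW)) I₃ for one element.

```python
import numpy as np

def screened_poisson_element(q, I, glD, G, lam, JW):
    """BP3.0 sketch: twelve tensor contractions for one element.
    q: field at GLL nodes; I: glNq x Nq interpolation matrix;
    glD: glNq x glNq differentiation on the GL grid; G: 3 x 3 array of
    metric/weight factors at GL nodes; JW: mass factors; lam: parameter."""
    # Interpolate to GL nodes (3 contractions).
    t = np.einsum('ai,ijk->ajk', I, q)
    t = np.einsum('bj,ajk->abk', I, t)
    t = np.einsum('ck,abk->abc', I, t)
    # Differentiate on the GL grid (3 contractions).
    dr = np.einsum('al,lbc->abc', glD, t)
    ds = np.einsum('bl,alc->abc', glD, t)
    dt = np.einsum('cl,abl->abc', glD, t)
    # Apply metric factors and the lambda-scaled mass term (pointwise).
    r = lam * JW * t
    wr = G[0][0]*dr + G[0][1]*ds + G[0][2]*dt
    ws = G[1][0]*dr + G[1][1]*ds + G[1][2]*dt
    wt = G[2][0]*dr + G[2][1]*ds + G[2][2]*dt
    # Transpose differentiation (3 contractions).
    r = r + np.einsum('al,abc->lbc', glD, wr)
    r = r + np.einsum('bl,abc->alc', glD, ws)
    r = r + np.einsum('cl,abc->abl', glD, wt)
    # Project back to GLL nodes (3 contractions).
    r = np.einsum('ai,abc->ibc', I, r)
    r = np.einsum('bj,ibc->ijc', I, r)
    r = np.einsum('ck,ijc->ijk', I, r)
    return r
```

The GPU kernels in the paper implement exactly this sequence of contractions, with the intermediate arrays held in shared memory and registers rather than materialized one at a time.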

5 Empirical performance model

The performance for all three benchmark problems is limited by global memory bandwidth. In BP3.5 and BP3.0 we load seven geometric factors for every node in the FEM element in addition to loading and storing the field vector itself. Moreover, as we emphasize in the introduction, for all three operations the data-to-FLOP ratio is high which limits our ability to hide memory latency behind computation.

(Konstantinidis and Cotronis, 2015) proposed a GPU performance model that accounts for various data-to-FLOP ratios. However, the kernels used to calibrate that model achieve high occupancy, while for our kernels extensive use of the register file and shared memory results in achieved occupancy rarely exceeding 25%. Hence, we cannot apply this model to our kernels in a meaningful way.

A different idea was presented in (Abdelfattah et al., 2016), where the authors used the size of global memory transfers as a means of comparison. The advantage of such an approach is that it is independent of the implementation. In this work we use a similar model; however, we transfer comparatively more data due to the geometric factors. Our model is based on the assumption that the bandwidth of global device memory transfers is the limiting factor. Thus, we compare the performance of our kernels to the performance of copying data from global GPU memory to global GPU memory, where the size of the copied data equals the size of the data moved to and from global memory in the particular benchmark problem; see Table 2.

For example, for every element in the FEM mesh the BP1.0 code must read the input field and the geometric factors and write the output field; the per-element counts are listed in Table 2. Since a double-precision variable consists of eight bytes and the memory bus is bi-directional, we compare the performance of our BP1.0 code to a device-to-device copy that transfers the equivalent number of bytes. Each data transfer is executed ten times using the standard cudaMemcpy function, and the bandwidth in GB/s is measured by averaging the ten measurements.
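The comparison described above amounts to a short calculation: the empirically calibrated bound is F · B / (R + W), where B is the measured copy bandwidth for R + W bytes. The sketch below implements this with illustrative numbers only; the per-element byte and FLOP accounting for BP1.0 (field in and out plus one Jacobian-weight factor per GL node, two FLOPs per multiply-add) and the 450 GB/s copy bandwidth are assumptions for the example, not figures from the paper.

```python
def roofline_gflops(flops, bytes_read, bytes_written, bandwidth_gbs):
    """Best-case GFLOP/s for a kernel limited by streaming
    bytes_read + bytes_written bytes at the measured copy bandwidth."""
    seconds = (bytes_read + bytes_written) / (bandwidth_gbs * 1e9)
    return flops / seconds / 1e9

# Illustrative BP1.0-style accounting (assumed sizes, double precision).
N, n_elements = 7, 4096
Nq, glNq = N + 1, N + 2            # assumed GLL and GL node counts
Np, glNp = Nq**3, glNq**3
R = 8 * (Np + glNp) * n_elements   # bytes read: field + Jacobian factors
W = 8 * Np * n_elements            # bytes written: result field
# Approximate FLOPs: three interpolation and three projection
# contractions (2 FLOPs per multiply-add) plus the pointwise scaling.
F_interp = 2 * Nq * (glNq * Nq**2 + glNq**2 * Nq + glNq**3)
F_proj = 2 * glNq * (glNq**2 * Nq + glNq * Nq**2 + Nq**3)
F = (F_interp + glNp + F_proj) * n_elements

print(round(roofline_gflops(F, R, W, 450.0), 1))
```

Plugging in the measured copy bandwidth for the exact transfer size, rather than the theoretical peak, is what turns the theoretical roofline into the empirical one used throughout the paper.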

                 BP1.0   BP3.5   BP3.0
Read (R)           …       …       …
Write (W)          …       …       …
Total bytes (T)    …       …       …
Table 2: The minimum number of doubles read from and written to global GPU memory per element for the three problems considered in this paper.

The efficiency of data transfer depends on the size of the data, with throughput maximized when sufficiently large amounts of data are transferred. We therefore expect higher bandwidth for a mesh with more elements and for higher degree polynomial approximations. We note that for the GPU used in this paper, the NVIDIA Tesla P100 12 GB PCI-E, the mesh containing 512 elements is too small to effectively hide data transfer latencies. To see this, note that our approach parallelizes the problem by assigning each element to a thread block. The NVIDIA Tesla P100 has 56 SMs, and two thread blocks can be processed simultaneously on every SM. The GPU is therefore capable of processing 112 thread blocks concurrently, so for the mesh of 512 elements only about 5 thread blocks are executed per SM. As a result of this small workload per SM, the overhead cost of kernel launch is not offset by the execution time of the kernel itself, and the empirical bandwidths we observe in these data transfers are noisy due to overhead costs.

Let R denote the number of bytes read from global GPU memory and W the number of bytes written to global GPU memory by a GPU kernel. Let B(R + W) denote the measured global memory bandwidth for copying R + W bytes of data, and let F denote the number of floating point operations that must be executed in this kernel. In our model, the maximal achievable GFLOP/s are estimated using the formula

GFLOP/s_max = F · B(R + W) / (R + W).
For BP1.0, BP3.5, and BP3.0 the roofline is evaluated as a function of the polynomial degree N. Figure 1 shows the resulting maximum TFLOP/s for BP1.0, BP3.5, and BP3.0, respectively.

Figure 1: Performance roofline bounds for the BP1.0 (top left), BP3.5 (top right), and BP3.0 (bottom) benchmarks. In each chart the upper plot (line with diamond-shaped ticks) shows a theoretical bound obtained using the theoretical peak bandwidth of 549 GB/s for the NVIDIA P100 PCI-E 12 GB GPU. The lower plots show the empirical peak bandwidth bound obtained using the measured bandwidth attained when performing a device memory to device memory copy (upper line: hexahedral mesh with 4096 elements; lower line: hexahedral mesh with 512 elements).

Inspired by (Volkov and Demmel, 2008), for several kernels tested for BP1.0 and BP3.0 we devised a supplementary theoretical roofline based on shared memory bandwidth. We observe that in addition to copying data to/from global memory, we also copy data to/from shared memory. (Unlike the copy-based empirical streaming roofline, the shared memory performance bound depends on the kernel, not on the problem size.) Since the bandwidth of shared memory is lower than the register bandwidth, shared memory transactions can limit performance. We denote the shared memory bandwidth by $B_{\mathrm{shared}}$ and estimate it with

\[ B_{\mathrm{shared}} = f \cdot N_{\mathrm{SM}} \cdot N_{\mathrm{banks}} \cdot w_{\mathrm{bank}}, \]

where $f$ is the clock frequency, $N_{\mathrm{SM}}$ is the number of SMs, $N_{\mathrm{banks}}$ is the number of shared memory banks per SM, and $w_{\mathrm{bank}}$ is the width of a bank in bytes. With this ansatz we estimate that the shared memory bandwidth of the NVIDIA Tesla P100 12 GB PCI-E GPU is limited to approximately 9.5 TB/s. The shared memory roofline model is then given by the equation

\[ \mathrm{GFLOPS/s} \;\le\; \frac{W_{\mathrm{shared}}}{N_{\mathrm{shared}}^{r} + N_{\mathrm{shared}}^{w}}\, B_{\mathrm{shared}}, \]

where $W_{\mathrm{shared}}$ denotes the number of floating point operations performed (per thread block), $N_{\mathrm{shared}}^{r}$ is the number of bytes read from shared memory (per thread block), and $N_{\mathrm{shared}}^{w}$ is the number of bytes written to shared memory (per thread block).
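The Volkov-style bandwidth estimate and the resulting bound can be sketched as follows (a sketch under our assumptions: 32 four-byte banks per SM and the P100's 1.328 GHz base clock; the paper's exact figures may differ):

```python
def shared_mem_bandwidth_tbs(clock_ghz, n_sms, n_banks=32, bank_width_bytes=4):
    """Estimated aggregate shared memory bandwidth in TB/s:
    f * N_SM * N_banks * w_bank."""
    return clock_ghz * n_sms * n_banks * bank_width_bytes / 1e3

def shared_roofline_gflops(W, n_read, n_write, b_shared_tbs):
    """Shared-memory analogue of the streaming roofline (per thread block)."""
    return W / (n_read + n_write) * b_shared_tbs * 1e3  # GFLOPS/s

# P100: 56 SMs, each with 32 four-byte-wide banks, ~1.328 GHz base clock
b = shared_mem_bandwidth_tbs(1.328, 56)
print(round(b, 1))  # ~9.5 TB/s
```

Because this bound uses the per-kernel shared memory traffic, two kernels computing the same result can have very different shared memory rooflines, which is exploited in Kernel 8 below.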

6 Optimizing Benchmark Problem 1.0

In this section we describe the sequence of optimization strategies that were applied to the implementation of the BP1 operation.

Kernel design. Recall that the tensor contraction operations for each element in the FEM mesh can be performed independently of the other elements. Thus, to parallelize the FEM operations we assign each element to a block of threads on the GPU. In previous work, (Remacle et al., 2016) associated a single node of an element with a single thread. This is, however, impossible for high-order interpolation: if $(N+1)^3 > 1024$, i.e., for $N > 9$, we exceed the maximum number of threads per block (currently limited by CUDA to 1024). Hence, for higher-order approximations we need to assign multiple nodes to each thread. We can subdivide the nodes in each element using either a 3D thread structure or a 2D thread structure. Figure 2 shows the two approaches. For BP1.0 we use a 2D thread structure, since we found it more effective; we also investigate a 3D thread structure for BP3.5 and BP3.0.
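The threads-per-block constraint can be checked directly (a small helper of our own; the one-node-per-thread mapping is the one from (Remacle et al., 2016)):

```python
def nodes_per_element(N):
    """Number of nodes in a degree-N hexahedral element."""
    return (N + 1) ** 3

def fits_in_one_block(N, max_threads=1024):
    """Can every node get its own thread? CUDA caps a block at 1024 threads."""
    return nodes_per_element(N) <= max_threads

print(fits_in_one_block(9))   # True:  10^3 = 1000 threads
print(fits_in_one_block(10))  # False: 11^3 = 1331 threads
```

For $N > 9$ the 2D structure launches only $(N+1)^2$ threads and lets each thread loop over the $N+1$ nodes of its column.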

Figure 2: 3D vs 2D thread structure. On the left: 3D approach – each thread processes a “slice” of nodes. On the right: 2D approach – each thread processes a vertical “column” of nodes.

In BP1.0 the action of the mass matrix on each element can be applied using six tensor contraction operations. Hence, Algorithm 3 consists of six contractions wherein we cycle through the entire element. Block-wise synchronization is needed between the contractions to ensure that the previous operation has completed; at minimum, we need to enforce this block synchronization five times. Using a 2D thread structure requires more thread synchronizations because we process only a slice of the nodes at a time.
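The structure of these six contractions can be illustrated on the CPU (a plain NumPy sketch, assuming `I` interpolates from the $N+1$ nodes to the quadrature points in one dimension and `D` holds the pointwise factor of quadrature weights times Jacobian; the names and the einsum formulation are ours, not the paper's kernels):

```python
import numpy as np

def apply_mass(I, D, q):
    """Apply the local mass matrix: three interpolation contractions,
    a pointwise scaling, then three projection contractions with I^T."""
    # Interpolate nodal values to quadrature points, one dimension at a time.
    t = np.einsum('ai,ijk->ajk', I, q)
    t = np.einsum('bj,ajk->abk', I, t)
    t = np.einsum('ck,abk->abc', I, t)
    t = t * D                           # scale by weights times Jacobian
    # Project back to the nodes with the transposed operator.
    t = np.einsum('ai,abc->ibc', I, t)
    t = np.einsum('bj,ibc->ijc', I, t)
    t = np.einsum('ck,ijc->ijk', I, t)
    return t
```

On the GPU each of the six einsum lines becomes a contraction loop, and the barrier between consecutive lines is exactly the block-wise synchronization discussed above.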

BP1.0 thread memory optimization. Because the one-dimensional interpolation matrix $I$ is used by all the threads in the block, we load $I$ into shared memory. For the field variable $q$, we can either load it to shared memory, fetch it to registers, or fetch it piece-by-piece from global memory when needed. Note that we also need a placeholder array to store the partial results between the loops. There are several options to choose from; however, only a single such array can be stored in shared memory, since for large $N$ storing two of them would exceed the 48 KB of shared memory available per thread block.
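The shared memory budget is easy to check (an illustrative helper; the threshold of 15 points per dimension is our example, the 48 KB per-block limit is the P100's):

```python
def shared_array_bytes(n, bytes_per_entry=8):
    """Bytes needed for one n x n x n placeholder array of doubles."""
    return n ** 3 * bytes_per_entry

# With 15 points per dimension, one double-precision array still fits in
# the 48 KB per-block shared memory, but two such arrays do not:
one = shared_array_bytes(15)          # 27000 bytes
print(one <= 48 * 1024)               # True
print(2 * one <= 48 * 1024)           # False
```

This is why the remaining intermediate results must live in registers or be re-fetched from global memory.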

BP1.0 kernel optimization. We show the performance results of eight GPU kernels in Figure 3. The eight kernels were constructed in sequence, beginning from a direct implementation of the pseudo-code in Algorithm 3 and applying successive optimizations. The results shown in Figure 3 demonstrate the effect each optimization has on the overall performance of BP1.0. We list below the details of each kernel as well as the optimization made in it. Unless otherwise noted, each kernel contains all the optimizations introduced in the preceding kernels.

Figure 3: BP1.0: achieved floating point performance. Left: results obtained using cube-shaped mesh with 512 elements. Right: results obtained using cube-shaped mesh with 4096 elements on a NVIDIA P100 PCI-E 12GB GPU.

Kernel 1: This kernel serves as a reference implementation. It uses a 2D thread structure in which the threads form a horizontal slice of nodes and each thread loops over its vertical column (see the right side of Figure 2). We declare two additional global memory variables for storing intermediate results. Kernel 1 uses shared memory only for the interpolation matrix; in all the loops it reads from and writes to global memory. As a result, the reference kernel is highly inefficient, reaching only 80 GFLOPS/s even for the larger mesh.
Kernel 2: In this kernel the global auxiliary arrays are replaced by two shared memory arrays. While processing $q$, we read directly from the input arrays, without caching to shared memory or registers. Due to the reduced number of global memory fetches, the performance improves by approximately a factor of two.
Kernel 3: In this kernel each thread allocates a small register array and copies its portion of $q$ to the array at the beginning of the kernel. This kernel reaches 1 TFLOP/s in the best case.
Kernel 4: In this kernel all the input variables, except the variable to which the output is saved, are declared as const. This yields only a marginal improvement in performance.
Kernel 5: For this kernel, at the polynomial degrees for which the shared memory arrays would otherwise cause bank conflicts, we pad the arrays used for storing $q$ and the partial results. A noticeable improvement appears only for a few degrees on each mesh.
Kernel 6: In this kernel all loops, including the main loop in which we process the slices, are unrolled. Unrolling is important for performance since it gives the scheduler more freedom and more opportunity for instruction-level parallelism. At this point the performance exceeds 1 TFLOP/s for the small mesh and reaches 1.4 TFLOPS/s for the large mesh.
Kernel 7: All kernels presented thus far have required a number of thread synchronizations per block that grows with the polynomial degree, since the 2D thread structure processes one slice at a time. In this kernel the number of thread synchronizations is reduced to five by allocating more shared memory: we load the entire array $q$ into shared memory rather than loading it slice-by-slice. Figure 4 illustrates the idea behind reducing synchronizations. Kernel 7 brings the performance up to 2.25 TFLOPS/s.
Kernel 8: (Volkov, 2010) emphasized that accessing shared memory is much slower than using registers. Hence, reducing the total number of read and write requests to/from shared memory and replacing them with register reads and writes can significantly improve performance. Kernel 8 exploits the structure of the interpolation matrix to reduce the number of shared memory transactions: since in BP1.0 we need to multiply by both $I$ and $I^T$, each entry of $I$ is needed twice. Exploiting this lets us halve the number of shared memory loads of $I$: the kernel loads the entire matrix into shared memory, and inside each loop the appropriate entry is copied from shared memory to a register once and used twice.
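The bank-conflict padding used in Kernel 5 can be illustrated with a toy model (our own sketch, assuming 32 four-byte banks as on the P100; the real kernels apply the same idea to the $q$ and scratch arrays):

```python
def banks_hit(row_len, column=0, warp=32, n_banks=32):
    """Distinct shared memory banks touched when a warp of threads reads one
    column of a row-major 2D array with the given row length (in 4-byte words).
    Thread t reads element [t][column], i.e. word index t * row_len + column."""
    return len({(t * row_len + column) % n_banks for t in range(warp)})

print(banks_hit(32))  # 1  -> all 32 threads hit one bank: 32-way conflict
print(banks_hit(33))  # 32 -> padding the row by one word removes the conflict
```

When the row length is a multiple of the bank count, every thread in the warp maps to the same bank and the accesses serialize; padding each row by one word restores a conflict-free mapping, at the cost of a slightly larger shared memory footprint.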