Second order elliptic PDEs of the convection-diffusion-reaction form
with spatially varying coefficients play an important role in many areas of science and engineering. The Discontinuous Galerkin (DG) method [1, 2, 3, 4, 5, 6, 7, 8] is a popular discretisation scheme for various reasons: it allows local mass conservation and (when used with an appropriate basis) leads to diagonal mass matrices which avoid global solves in explicit time integrators. It also has several computational advantages as will be discussed below. Problems of the form in (1) arise for example in the simulation of subsurface flow phenomena [9, 10, 11, 12]. Another important application area is fluid dynamics. When the incompressible Navier Stokes equations are solved with Chorin’s projection method [13, 14], a Poisson equation has to be solved for the pressure correction and the momentum equation needs to be solved for the tentative velocity, see  for a DG-based version. Implicit DG solvers for the fully compressible atmospheric equations of motion are described in [16, 17].
The fast numerical solution of equation (1) requires algorithmically optimal algorithms, such as multigrid methods (see  for an overview), which have a numerical complexity that typically grows linearly with the number of unknowns. Equally important is the efficient implementation of those methods on modern computer hardware. While low order discretisation methods such as finite volume approximations and linear conforming finite element methods are very popular, from a computational point of view one of their main drawbacks is the fact that their performance is limited by main memory bandwidth. Modern computer architectures such as the Intel Haswell/Broadwell CPUs, Xeon Phi co-processors and graphics processing units (GPUs) may only achieve their peak performance floating point performance if data loaded into the CPU is reused, i.e. the arithmetic intensity of the algorithm (ratio of floating point operations executed per byte loaded, in short: flop-to-byte-ratio) is sufficiently high. For example the 16 core Haswell chip used for the computations in this work has a theoretical peak performance of fused multiply-add (FMA) operations per second and a main memory bandwidth (as measured with the STREAM benchmark ) of
. Hence about 40 FMA operations can be carried out in the time it takes to read one double precision number from memory. Low order discretisation methods require frequent sparse matrix-vector products. If the matrix is stored in the standard compressed sparse row storage (CSR) format, this multiplication has a very low arithmetic intensity of about 1 FMA operation per matrix entry loaded from memory (assuming perfect caching of the vectors). Hence traditional solvers which first assemble the system matrix and then invert it with sparse matrix-vector operations will only use a fraction of the system’s peak performance.
Matrix-free methods, on the other hand, recompute each matrix-element on-the-fly instead of storing an assembled matrix. In addition to eliminating setup costs for the matrix assembly, this raises the computational intensity and allows utilisation of the full performance of modern manycore chip architectures. An additional advantage are the reduced storage requirements: since the matrix does not have to be kept in memory, this allows the solution of much larger problems. While high arithmetic intensity is a desirable property of a good implementation, the quantity which really needs to be minimised is the total computation time. A naive matrix-free implementation can result in an increase in the runtime since a large number of floating point operations (FLOPs) is executed for each unknown. Careful matrix-free implementations of higher-order finite element discretisations avoid this problem. If is the polynomial degree of the basis function in each -dimensional grid cell, a naive implementation requires
FLOPs per cell. Sum factorisation, which exploits the tensor-product structure of the local basis functions, reduces this to[20, 21, 22, 23], making the matrix-free implementation particularly competitive if the order is sufficiently high. Numerical results indicate that sum factorisation reduces the runtime even for low polynomial degrees [23, 24]. Another advantage of the DG method is that all data associated with one grid cell is stored contiguously in memory. This avoids additional gather/scatter operations which are necessary if data is also held on other topological entities such as vertices and edges as in conforming high order finite elements.
The efficient matrix-free implementation of higher-order DG methods was described by some of us in a previous paper  which concentrates on the efficient application of the DG operator. Here we extend this approach to iterative solvers for elliptic PDEs. While matrix-vector products are one key ingredient in Krylov-subspace solvers, efficient preconditioners are equally important and often form the computational bottleneck. Many preconditioners such as AMG [25, 26] or incomplete factorisations (ILU)  require the assembly of the system matrix, which can offset the advantage of the matrix-free sparse matrix-vector product.
In this work we describe the matrix-free implementation of block-smoothers,
which are the key ingredient of multigrid methods such as those
described in [28, 12] . The idea behind the multigrid
algorithm in  is that the block-smoother eliminates
high-frequency errors within one grid cell whereas an AMG solver in a
lower order subspace of the full DG space reduces long range error
components. Due to the much smaller size of the coarse level system, the cost
for one AMG V-cycle on the low order subspace (which uses a traditional
matrix-based implementation) is negligible in comparison to the block smoother
on the high order system. In the simplest case the smoother is a block-Jacobi
iteration applied to the system which arises from high-order DG discretisation
of (1). In each cell of
the grid this requires the following operations:
Here is the solution on cell and the matrix describes the coupling between cells and which share at least one common face. In addition to dense matrix-vector products to calculate the local defect , this requires the solution of a small dense system
with unknowns in the second step. Matrix based methods use an LU- or Cholesky- factorisation of the matrix to solve (2) in each grid cell. The cost of the back-substitution step, which has to be carried out for every block-Jacobi iteration, is . There is an additional setup cost of for the initial factorisation of , which may be relevant if the number of preconditioner applications is small. In contrast, the matrix-free method which we present here uses an iterative method to solve this equation and if the implementation uses sum-factorisation, this reduces the complexity to . An important factor in this expression is , the number of iterations which are required to solve the system in (2) to accuracy . Solving (2) exactly, i.e. iterating until the error is reduced to machine precision, will result in a relatively large and might not lead to a reduction of the total solution time for the full system. However, since the inverse is only required in the preconditioner, an approximate inversion with a small number of iterations might be sufficient and can increase the computational efficiency of the overall method. This is one of the key optimisations which will be explored below. The overall efficiency of the implementation also depends on how computationally expensive the evaluation of the functions , and in (1) is. At first sight this might put a matrix-free preconditioner at a disadvantage, since those functions have to be re-computed at every quadrature point in each iteration. However, it is often sufficient to use a simpler functional dependency in the preconditioner, for example by approximating , and by their average value on each cell. To demonstrate the efficiency of this approach we solve a set of linear problems of increasing complexity. This includes advection dominated problems which require more advanced SOR-type smoothers and the stationary SPE10 benchmark. The latter is a hard problem since the permeability field varies over several magnitudes and has high-contrast jumps between grid cells.
A similar fully matrix-free approach for the solution of elliptic problems is described in . The results there are obtained with the deal.II  implementation discussed in . As for our code, the matrix-free operator application in  is highly optimised and utilises a significant fraction the peak floating point performance. In contrast to our implementation, the authors of 
use a (matrix-free) Chebyshev-iteration as a smoother on the DG space (a Chebyshev smoother is also used for the AMG solver on the lower order subspace). One drawback of this approach is that it requires estimates of the largest eigenvalues, which will result in additional setup costs. While the authors of solve a Laplace equation on complex geometries, it is not clear how the same Chebyshev smoother can be applied to non-trivial convection-dominated problems or problems with large, high contrast jumps of the coefficients.
Another approach for preconditioning higher order finite element discretisations was originally developed for spectral methods in  and used for example in [31, 32]. In , a preconditioner is constructed by approximating the higher-order finite element operator with a low-order discretisation on the finer grid which is obtained by subdividing each grid cell into elements defined by the nodal points. The resulting sparse system is then solved directly with an AMG preconditioner. A similar approach is used in , where a two-level Schwarz preconditioner is constructed for a spectral element discretisation of the pressure correction equation in the Navier Stokes system. Instead of using an AMG method for the sparse low-order system, the author employs a two-level Schwarz preconditioner. As in our method, the coarse level is obtained from the linear finite element discretisation on the original grid and this is combined with an iterative solution of the low order system on the finer nodal grid. Finally, instead of re-discretising the equation on a finer grid, the authors of  construct a sparse approximation with algebraic methods, which requires additional setup costs. Compared to our approach, the drawback of all those methods is that the fine grid problem is obtained from a lowest-order discretisation (or an algebraic representation leading to a sparse matrix), and will therefore result in a memory-bound implementation which can not use the full capacity of modern manycore processors.
This paper is organised as follows: in Section 2 we describe the higher order DG discretisation and our hybrid multigrid solver algorithm which is used throughout this work. We introduce the principles of the matrix-free smoother implementation and analyse its computational complexity as a function of the order of the DG discretisation. The implementation of those algorithms based on the DUNE software framework [34, 35] is described in Section 3, where we also briefly review the principles of sum-factorised operator application. Numerical results and comparisons between matrix-based and matrix-free solvers are presented in Section 4 and we conclude in Section 5. Some more technical details are relegated to the appendices. In A we discuss the re-discretisation of the operator in the low-order subspace which is used in the coarse level correction of the multigrid algorithm. A detailed breakdown of setup costs for the multigrid algorithm can be found in B.
We consider linear second order elliptic problems of the form
with boundary conditions
in the -dimensional domain . Here the unit outer normal vector on the boundary is denoted by . is the diffusion tensor, is the advection vector and is a scalar reaction coefficient. All those terms can depend on the position ; the fields , and are not necessarily smooth and can have large, high contrast jumps. A typical example for a problem with discontinuous coefficients is the SPE10 benchmark . The problem in (3) could also appear in the linearisation of a non-linear problem such as the Navier Stokes equations . In this case it needs to be solved at every iteration of a Newton- or Picard- iteration. The reaction term often arises from the implicit time discretisation of an instationary problem.
2.1 Higher-order DG discretisation
To discretise (3) and (4) we use the weighted symmetric interior penalty (WSIPG) method derived in  based on [1, 2, 6, 37, 38, 39] together with an upwinding flux for the advective part. The full method is described in  and the most important properties are recalled here. We work on a grid consisting of a set of cuboid cells with smallest size . Each cell is the image of a diffeomorphic map where is the reference cube in dimensions. Given , the Discontinuous Galerkin space is the subspace of piecewise polynomials of degree
Here is the set of polynomials of at most degree . It is worth stressing that, in contract to conforming discretisations, functions in are not necessarily continuous between cells. We also define the inner product of two (possibly vector-valued) functions on any domain as
with the bilinear form
which has been split into a volume (v), interface (if) and boundary (b) term defined by
The right hand side functional is
In those expressions the set of interior faces is denoted as . The sets of faces on the Neumann- and Dirichlet boundary are and respectively. With each face we associate an oriented unit normal vector . For any point on an internal face we define the jump
and the weighted average
where is the value of the field on one side of the facet. In the expressions and we implicitly assume that and are evaluated on the interior side of the boundary facets. To account for strongly varying diffusion fields, we choose the weights introduced in :
In this expression is a free parameter chosen to be in all the computations reported below. To discretise the advection terms, we use the upwind flux on the face which is given by
While the formulation in (8), (10) is valid on any grid, here we only consider grids based on hexahedral elements. To use sum-factorisation techniques, we also assume that the basis functions on the reference element as well as the quadrature rules have tensor-product structure. Similar techniques can be used on simplicial elements [40, 41]. A basis is chosen for , i.e. every function can be written as
is the total number of degrees of freedom andis the vector of unknowns. With this basis the weak formulation in (7) is equivalent to a matrix equation
It is this equation which we will solve in the following. Since we consider a discontinuous function space, the basis can be chosen such that the support of any basis function is restricted to a single element and this implies that the matrix is block-structured. This property is important for the algorithms developed in the following. Throughout this work we use a basis which is constructed from the tensor-product of one-dimensional Gauss-Lobatto basis functions on each grid cell. We also assume that the basis is nodal, i.e. there is a set of points , such that .
Block notation of linear systems
To efficiently deal with block-structured matrices we introduce the following notation. For any finite index set we define the vector space to be isomorphic to with components indexed by . Thus any vector of unknowns can be interpreted as a mapping and . In the same way, for any two finite index sets we write with the interpretation and . Finally, for any subset we define the restriction matrix as . Now, let denote the subset of indices which are associated with basis functions that are nonzero on element . Then
where the last property follows from the fact that we consider a discontinuous function space and therefore forms a disjoint partitioning of the index set . Using this partitioning and imposing an ordering on the mesh elements, the linear system in (16) can be written in block form
where , and . For each cell and function we also define the function as
Our choice of Gauss-Lobatto basis functions with non-overlapping support leads to the so called minimal stencil; the matrices are only nonzero if or if the elements and share a face, and hence the matrix in (18) is block-sparse. In general and for the basis considered in this paper, the matrix is dense.
This partitioning of and the minimal stencil property form are key for the matrix-free preconditioner implementation discussed below.
2.2 Hybrid multigrid preconditioner
To solve the linear system in (16) we use the hybrid multigrid preconditioner described in . However, in contrast to , the computationally most expensive components, namely the operator- and preconditioner application in the DG space , are implemented in a matrix free way. As the numerical results in Section 4 confirm, this leads to significant speedups at higher polynomial orders and has benefits even for relatively modest orders of .
The key idea in  is to combine a block-smoother in the high-order DG space with an AMG correction in a low-order subspace . In this work we will consider low order subspaces spanned by the piecewise constant and conforming piecewise linear elements . One V-cycle of the multigrid algorithm is shown in Algorithm 1. At high polynomial orders the computationally most expensive steps are the operator application in the defect (or residual) calculation and the pre-/post-smoothing on the DG level. Usually the multigrid algorithm is used as a preconditioner of a Krylov subspace method, which requires additional operator applications.
To solve the problem in the lower-order subspace we use the aggregation based AMG solver described in [42, 43]. In the following we discuss the efficient implementation of the individual components in Algorithm 1:
Matrix-free operator application in the DG space to calculate the residual (line 2)
Matrix-free pre-/post-smoothing in the DG space (lines 1 and 6)
Transfer of grid functions between between DG space and low order subspace (lines 3 and 5)
Assembly of the low order subspace matrix which is used in the AMG V-cycle (line 4)
The matrix-free operator application based on sum factorisation techniques has been previously described in  (and is briefly reviewed in Section 3.1 below). Here we will concentrate on the other components.
2.2.1 Matrix-free smoothers
In the context of multigrid methods, iterative improvement of the solution with a stationary method is referred to as smoothing since a small number of iterations with a simple method reduces high-frequency error components. For a linear equation , this process can be summarised as follows: given a matrix , calculate the defect for some guess and solve for such that
This correction can then be used to obtain an improved solution
Starting from some initial guess , the iterates form an increasingly better approximation to the exact solution of for a convergent method. Choosing a block-smoother guarantees that any high-frequency oscillations inside one cell are treated exactly, which is particularly important for higher polynomial orders. Lower-frequency errors with variations between the grid cells are reduced by the AMG preconditioner in the low-order subspace (line 4 in Alg. 1). Following the block partitioning in (18), let be decomposed into the strictly lower block triangular part , the block diagonal and the strictly upper block triangular part with
In this work we consider the following choices for the block preconditioner : (1) block-Jacobi , (2) successive block over-relaxation (block-SOR) and (3) symmetric block-SOR (block-SSOR) . The over-relaxation parameter can be tuned to accelerate convergence. The explicit form of the block-Jacobi and block-SOR algorithms are written down in Algorithms 2 and 3. Note that for block-SOR, the field can be obtained by transforming in-place, which reduces storage requirements. The corresponding backward-SOR sweep is obtained by reversing the direction of the grid traversal. It can then be shown that one application of the block-SSOR smoother is equivalent to a forward-block-SOR sweep following by a backward-block-SOR iteration.
Note that in all cases the computationally most expensive operations are the multiplication with the dense matrices and the solution of the dense system in each cell. As will be described in Section 3, the defect can be calculated in a matrix-free way by using the techniques from . The standard approach to calculating the correction is to assemble the local matrices , factorise them with an LU-/Cholesky-approach and then use this to solve the linear system. However, the costs for the factorisation- and back-substitution steps grow like and respectively, and hence this method becomes prohibitively expensive for high polynomial degrees . To avoid this, we solve the block-system
with a Krylov-subspace method such as the Conjugate Gradient (CG)  or GMRES  iteration. This only requires operator applications which can be implemented in a matrix-free way. If sum-factorisation techniques are used, the overall cost of this inversion is where is the number of iterations required to solve the block-system. An important observation, which drastically effects the practical usage of the matrix-free smoothers, is that it suffices to compute an inexact solution of (24) with the Krylov subspace method. For this we iterate until the two-norm of the residual has been reduced by a fixed factor . The resulting reduction in and preconditioner cost has to be balanced against the potential degradation in the efficiency of the preconditioner. This dependence on on the overall solution time will be studied below.
Using block smoothers, the method proposed here is not strictly robust with respect to the polynomial degree . This can be remedied by using overlapping block smoothers [42, 28] where the blocks originate from an overlapping partitioning of the mesh. However, this also increases the computational cost substantially. Since in practical applications we consider moderate polynomial degree anyway, we concentrate on simple non-overlapping block smoothers here where the blocks originate from a single element.
2.2.2 Intergrid operators
Since we use nested function spaces with , injection is the natural choice for the prolongation . Given a function , we define
Expanding in basis functions on element and using (19) this leads to
Here is the set of indices for which the coarse basis function is non-zero on element . For a general basis, calculation of requires solution of a small system involving the local mass matrix in each cell ; this is expensive for higher polynomial degrees . However, here we only consider the special case that is a subspace of and the basis is nodal. In this case the unknowns in the higher order space can be found much more efficiently by evaluating the low order basis functions at the nodal points of the higher order space,
In other words, the -th column of the local prolongation matrix can be calculated by evaluating at all nodal points,
Since the DG basis functions form a partition of unity, for the subspace this simplifies even further to
For the restriction operation we choose the transpose of the prolongation, i.e. . In the absence of advection () this choice preserves the symmetry of the system and the WSIPG discretisation.
2.2.3 Low order subspace matrix
The low order subspace matrix is needed to solve the “coarse” grid correction with the AMG preconditioner. In general this matrix is given by the Galerkin product
However, calculating according to this formula is expensive for several reasons: first of all it requires two matrix-matrix multiplications of large matrices with different sparsity patterns. In addition, the matrix in the DG space has to be assembled, even though in a matrix-free implementation it is never used in the DG smoother. Furthermore, for our choice of coarse function spaces, the matrix has lots of entries which are formally zero. However, if it is calculated according to (30) in an actual implementation those entries will not vanish exactly due to rounding errors. This leads to drastically increased storage requirements and unnecessary calculations whenever is used in the AMG preconditioner. All those issues can be avoided by directly assembling the coarse grid matrix as follows:
For the subspace it is shown in A.1 that on an equidistant grid the matrix entries can be calculated directly by discretising (3) with a modified finite volume scheme in which (i) all flux terms are multiplied by a factor and (ii) the boundary flux is scaled by an additional factor 2. For the conforming subspace, all contributions to which arise from jump terms in vanish. In fact, up to an additional Dirichlet boundary term , the matrix is identical to the matrix obtained from a conforming discretisation of (3), and it is this matrix that we use in our implementation. In both cases the sparsity pattern of is the same as that obtained by a discretisation in the low-order subspace. In summary this re-discretisation in the coarse level subspace instead of using the Galerkin product in (30) results in a significantly reduced sparsity pattern of the matrix and avoids explicit assembly of the large DG matrix .
3 Implementation in EXADUNE
The crucial ingredient for the development of the matrix-free block smoothers described in Section 2.2.1 is the efficient on-the-fly operator application of the full operator and multiplication with the block-diagonal for each element . While the operator application is required in the calculation of the local defect, the frequent multiplication with is necessary in the iterative inversion of the diagonal blocks. Our code is implemented in the EXADUNE  fork of the DUNE framework [46, 34]. EXADUNE contains several optimisations for adapting DUNE to modern computer architectures.
3.1 Matrix-free operator application
The matrix-free application of the operator itself has been implemented and heavily optimised in , and here we only summarise the most important ideas. Recall that the weak form in (8) is split into volume-, interface- and boundary terms defined in (9). The local contributions to the system matrix are evaluated in a LocalOperator class in DUNE. More specifically, let , be a pair of adjacent cells which share a in internal face and let be the face of a cell adjacent to the boundary of the domain. For each cell , internal interface and boundary face the LocalOperator class provides methods for evaluating the local operator application
For the interface term we calculate four contributions, one for each of the combinations , , and . By iterating over all elements and faces of the grid and calling the corresponding local operator evaluations, the full operator can be applied to any field . This iteration over the grid is carried out by an instance of a GridOperator class, which takes the LocalOperator as template parameter.
In the following we briefly recapitulate the key ideas used to optimise the matrix-free operator application with sum-factorisation techniques and concentrate on application of the operator . In each cell this proceeds in three stages (see Algorithm 1 in  for details)
Given the coefficient vector , calculate the values of the field and its gradient at the quadrature points on the reference element .
At each quadrature point, evaluate the coefficients , and as well a geometric factors from the mapping between and . Using this and the function values from Stage 1, evaluate the weak form at each quadrature point.
Multiply by test functions (and their derivatives) to recover the entries of the coefficient vector .
We now discuss the computational complexity of the individual stages.
We assume that the basis functions in the reference element have a tensor-product structure
Those basis functions are enumerated by -tuples , . For each cell we have an index map which assigns a global index to each -tuple in the cell; this implies that in a particular cell any function can be written as
We use a tensor-product quadrature rule with points
enumerated by -tuples , . In the following we only consider and , in this case that there are quadrature points and basis functions per cell. Following Eq. (13) in , the evaluation of the function at the quadrature point can be written as
The entries of the matrices are given by the one-dimensional shape functions evaluated at the quadrature points, , , . (35) describes the multiplication of a vector of length by a matrix with . Assuming that the ratio is constant, the naive cost of this operation is FLOPs. A significant reduction in computational complexity is achieved by applying the small matrices recursively in stages as
Each step in (36) requires the multiplication with an matrix for vectors. Multiplication by an matrix requires operations and hence the total cost is reduced to . Since grows with the polynomial degree (typically ), for high polynomial degrees this sum-factorised approach is much faster than the naive algorithm with complexity . Gradients of can be evaluated in the same way with slightly different matrices .
The face integrals in the bilinear form require the evaluation on points of a quadrature rule of lower dimension . Although this is complicated by the different embeddings of a face into the neighbouring cells, the same sum factorisation techniques can be applied. While for those terms the cost is only , for low polynomial degrees the absolute cost can still be larger than the cost of the volume terms (see Fig. 1 in ). We conclude that - using sum factorisation - the cost of the first stage in the matrix-free operator application is .
Since the operations can be carried out independently at the quadrature points, the cost of this stage is .
Given the values at the quadrature points from Stage 2, the entries of the coefficient vector can be calculated as
Here “” stands for terms of a similar structure which arise due to multiplication with the derivative of a test function in the bilinear form. For simplicity those terms are not written down here; an example can be found in Eq. (10) of . Note that (37) is the same as (35), but we now multiply by the transpose of the matrices . Sum factorisation can be applied in exactly the same way, resulting in a computational complexity of .
Combining the total cost of all stages we find that the overall complexity of the sum-factorised operator application is . Finally, observe that the operator application requires reading the local dof-vector which has entries. A vector of the same size has to be written back at the end of the calculation. Overall, the total amount of memory moved is . The resulting arithmetic intensity is and the operation is clearly FLOP-bound for large .
In EXADUNE  several optimisations are applied to speed up the operator application and exploit modern FLOP-bound architectures. The C++ vector class library  is used for the multiplication with the small dense matrices . The code is vectorised by grouping the value of the function and its three spatial derivatives and evaluating the four values simultaneously in (35). Efficient transfer of data from memory is important. In the standard DUNE implementation, a copy of the local dof-vector is created on each element. While this approach is necessary for conforming finite element methods, for the DG method used here the local dofs of neighbouring cells do not overlap and it is therefore possible to directly operate on the global dof-vector, thus avoiding expensive and unnecessary memory copies. As reported in , the operator application runs at close to of the peak floating point performance of an Intel Haswell CPU in some cases.
3.2 Matrix-free application and inversion of block-diagonal
Since in our smoothers the block system (24) is solved iteratively, we need to be able to apply the block-diagonal part of the operator in each cell in an efficient way. For this, note that