A stencil scaling approach for accelerating matrix-free finite element implementations

by   Simon Bauer, et al.

We present a novel approach to fast on-the-fly low order finite element assembly for scalar elliptic partial differential equations of Darcy type with variable coefficients optimized for matrix-free implementations. Our approach introduces a new operator that is obtained by appropriately scaling the reference stiffness matrix from the constant coefficient case. Assuming sufficient regularity, an a priori analysis shows that solutions obtained by this approach are unique and have asymptotically optimal order convergence in the H^1- and the L^2-norm on hierarchical hybrid grids. For the pre-asymptotic regime, we present a local modification that guarantees uniform ellipticity of the operator. Cost considerations show that our novel approach requires roughly one third of the floating-point operations compared to a classical finite element assembly scheme employing nodal integration. Our theoretical considerations are illustrated by numerical tests that confirm the expectations with respect to accuracy and run-time. A large scale application with more than a hundred billion (1.6·10^11) degrees of freedom executed on 14,310 compute cores demonstrates the efficiency of the new scaling approach.


Stencil scaling for vector-valued PDEs on hybrid grids with applications to generalized Newtonian fluids

Matrix-free finite element implementations for large applications provid...

Stencil scaling for vector-valued PDEs with applications to generalized Newtonian fluids

Matrix-free finite element implementations for large applications provid...

Enhancing data locality of the conjugate gradient method for high-order matrix-free finite-element implementations

This work investigates a variant of the conjugate gradient (CG) method a...

Optimal finite elements for ergodic stochastic two-scale elliptic equations

We develop an essentially optimal finite element approach for solving er...

A fractional model for anomalous diffusion with increased variability. Analysis, algorithms and applications to interface problems

Fractional equations have become the model of choice in several applicat...

Finite Element Integration with Quadrature on the GPU

We present a novel, quadrature-based finite element integration method f...

Analysis of heterogeneous computing approaches to simulating heat transfer in heterogeneous material

The simulation of heat flow through heterogeneous material is important ...

1 Introduction

Traditional finite element implementations are based on computing local element stiffness matrices, followed by a local-to-global assembly step, resulting in a sparse matrix. However, the cost of storing the global stiffness matrix is significant. Even for scalar equations and low order 3D tetrahedral elements, the stiffness matrix has, on average, fifteen entries per row, and thus a standard sparse matrix format will require thirty times as much storage for the matrix as for the solution vector. This limits the size of the problems that can be tackled and becomes the dominating cost factor since the sparse matrix must be re-read from memory repeatedly when iterative solvers are applied. On all current and future computing systems memory throughput and memory access latency can determine the run-time more critically than the floating-point operations executed. Furthermore, energy consumption has been identified as one of the fundamental roadblocks in exa-scale computing. In this cost metric, memory access is again more expensive than computation. Against the backdrop of this technological development, it has become mandatory to develop numerical techniques that reduce memory traffic. In the context of partial differential equations this is leading to a re-newed interest in so-called

matrix-free techniques and – in some sense – to a revival of techniques that are well-known in the context of finite difference methods.

Matrix-free techniques are motivated from the observation that many iterative solvers, e.g., Richardson iteration or Krylov subspace methods, require only the result of multiplying the global system matrix with a vector, but not the matrix itself. The former can be computed by local operations in each element, avoiding to set up, store, and load the global stiffness matrix. One of the first papers in this direction is [10], which describes the so called element-by-element approach (EBE), in which the global matrix-vector-product (MVP) is assembled from the contributions of MVPs of element matrices with local vectors of unknowns. The element matrices are either pre-computed and stored or recomputed on-the-fly. Note that storing all element matrices, even for low-order elements, has a higher memory footprint than the global matrix111However, it requires less memory than storage schemes for sparse direct solvers which reserve space for fill-in, the original competing scenario in [10]. itself.

Consequently, the traditional EBE has not found wide application in unstructured mesh approaches. However, it has been successfully applied in cases where the discretization is based on undeformed hexahedral elements with tri-linear trial functions, see e.g. [1, 7, 15, 37]. In such a setting, the element matrix is the same for all elements, which significantly reduces the storage requirements, and variable material parameters can be introduced by weighting local matrix entries in the same on-the-fly fashion as will be developed in this paper.

Matrix-free approaches for higher-order elements, as described in e.g., [9, 25, 28, 29, 32]

, differ from the classic EBE approach in that they do not setup the element matrix and consecutively multiply it with the local vector. Instead, they fuse the two steps by going back to numerical integration of the weak form itself. The process is accelerated by pre-computing and storing certain quantities, such as e.g., derivatives of basis functions at quadrature points within a reference element. These techniques in principle work for arbitrarily shaped elements and orders, although a significant reduction of complexity can be achieved for tensor-product elements.

However, these matrix-free approaches have also shortcomings. While the low-order settings [1, 7, 15, 37] require structured hexahedral meshes, modern techniques for unstructured meshes only pay off for elements with tensor-product spaces with polynomial orders of at least two; see [25, 28].

In this paper we will present a novel matrix-free approach for low-order finite elements designed for the hierarchical hybrid grids framework (HHG); see e.g. [3, 4, 5]. HHG offers significantly more geometric flexibility than undeformed hexahedral meshes. It is based on two interleaved ideas. The first one is a special discretization of the problem domain. In HHG, the computational grid is created by way of a uniform refinement following the rules of [6], starting from a possibly unstructured simplicial macro mesh. The resulting nested hierarchy of meshes allows for the implementation of powerful geometric multigrid solvers. The elements of the macro mesh are called macro-elements and the resulting sub-triangulation reflects a uniform structure within these macro-elements. The second aspect is based on the fact that each row of the global finite element stiffness matrix can be considered as a difference stencil. This notion and point of view is classical on structured grids and recently has found re-newed interest in the context of finite elements too; see e.g., [14]. In combination with the HHG grid construction this implies that for linear simplicial elements one obtains stencils with identical structure for each inner node of a macro primitive. We define macro primitives as the geometrical entities of the macro mesh of different dimensions, i.e., vertex, edge, face, and tetrahedrons. If additionally the coefficients of the PDE are constant per macro-element, then also the stencil entries are exactly the same. Consequently, only a few different stencils (one per macro primitive) can occur and need to be stored. This leads to extremely efficient matrix-free techniques, as has been demonstrated e.g., in [3, 18].

Let us now consider the setting of an elliptic PDE with piecewise smooth variable coefficients, assuming that the macro mesh resolves jumps in the coefficients. In this case, a standard finite element formulation is based on quadrature formulas and introduces a variational crime. According to [11, 34], there is flexibility how the integrals are approximated without degenerating the order of convergence. This has recently been exploited in [2] with a method that approximates these integral values on-the-fly using suitable surrogate polynomials with respect to the macro mesh. The resulting two-scale method is able to preserve the convergence order if the coarse and the fine scale are related properly. Here we propose an alternative which is based on the fine scale.

For this article, we restrict ourselves to the lowest order case of conforming finite elements on simplicial meshes. Then the most popular quadrature formula is the one point Gauss rule which in the simplest case of as PDE operator just weights the element based reference stiffness matrix of the Laplacian by the factor of where is the barycenter of the element . Alternatively, one can select a purely vertex-based quadrature formula. Here, the weighting of the element matrix is given by , where is the space dimension and are the vertices of element . Using a vertex-based quadrature formula saves function evaluations and is, thus, attractive whenever the evaluation of the coefficient function is expensive and it pays off to reuse once computed values in several element stiffness matrices. Note that reusing barycentric data on general unstructured meshes will require nontrivial storage schemes.

In the case of variable coefficient functions, stencil entries can vary from one mesh node to another. The number of possibly different stencils within each macro-element becomes , where is the number of uniform refinement steps for HHG. Now we can resort to two options: Either these stencils are computed once and then saved, effectively creating a sparse matrix data structure, or they are computed on-the-fly each time when they are needed. Neither of these techniques is ideal for extreme scale computations. While for the first option extra memory is consumed and extensive memory traffic occurs, the second option requires re-computation of local contributions.

The efficiency of a numerical PDE solver can be analyzed following the textbook paradigm [8] that defines a work unit (WU) to be the cost of one application of the discrete operator for a given problem. With this definition, the analysis of iterative solvers can be conducted in terms of WU. Classical multigrid textbook efficiency is achieved when the solution is obtained in less than 10 WU. For devising an efficient method it is, however, equally critical to design algorithms that reduce the cost of a WU without sacrificing accuracy. Clearly, the real life cost of a WU depends on the computer hardware and the efficiency of the implementation, as e.g., analyzed for parallel supercomputers in [18]. On the other side, matrix-free techniques, as the one proposed in this article, seek opportunities to reduce the cost of a WU by a clever rearrangement of the algorithms or by exploiting approximations where this is possible; see e.g., also [2].

These preliminary considerations motivate our novel approach to reduce the cost of a WU by recomputing the surrogate stencil entries for a matrix-free solver more efficiently. We find that these values can be assembled from a reference stencil of the constant coefficient case which is scaled appropriately using nodal values of the coefficient function. We will show that under suitable conditions, this technique does not sacrifice accuracy. However, we also demonstrate that the new method can reduce the cost of a WU considerably and in consequence helps to reduce the time-to-solution.

The rest of this paper is structured as follows: In section 2, we define our new scaling approach. The variational crime is analyzed in Section 3 where optimal order a priori results for the - and -norm are obtained. In section 4, we consider modifications in the pre-asymptotic regime to guarantee uniform ellipticity. Section 5 is devoted to the reproduction property and the primitive concept which allows for a fast on-the-fly reassembling in a matrix free software framework. In section 6, we discuss the cost compared to a standard nodal based element-wise assembling. Finally, in section 7 we perform numerically an accuracy study and a run-time comparison to illustrate the performance gain of the new scaling approach.

2 Problem setting and definition of the scaling approach

We consider a scalar elliptic partial differential equation of Darcy type, i.e.,

where tr stands for the boundary trace operator and . Here , , is a bounded polygonal/polyhedral domain, and denotes a uniformly positive and symmetric tensor with coefficients specified through a number of functions , , where in 2D and in 3D due to symmetry.

For the Darcy operator with a scalar uniform positive permeability, i.e., , we can set and . The above setting also covers blending finite elements approaches [19]. Here is related to the Jacobian of the blending function. For example, if the standard Laplacian model problem is considered on the physical domain but the actual assembly is carried out on a reference domain , we have


where is the Jacobian of the mapping , and , .

2.1 Definition of our scaling approach

The weak form associated with the partial differential equation is defined in terms of the bilinear form , and the weak solution satisfies:

This bilinear form can be affinely decomposed as


where is a first order partial differential operator and stands for some suitable inner product. In the case of a scalar permeability we find and stands for the scalar product in . While for (1) in 2D one can, as one alternative, e.g. define

where and the same scalar product as above. Note that this decomposition reduces to the one in case of a scalar permeability, i.e. for .

Let , fixed, be a possibly unstructured simplicial triangulation resolving . We call also macro-triangulation and denote its elements by . Using uniform mesh refinement, we obtain from by decomposing each element into sub-elements, ; see [6] for the 3D case. The elements of are denoted by . The macro-triangulation is then decomposed into the following geometrical primitives: elements, faces, edges, and vertices. Each of these geometric primitives acts as a container for a subset of unknowns associated with the refined triangulations. These sets of unknowns can be stored in array-like data structures, resulting in a contiguous memory layout that conforms inherently to the refinement hierarchy; see [5, 18]. In particular, the unknowns can be accessed without indirect addressing such that the overhead is reduced significantly when compared to conventional sparse matrix data structures. Associated with is the space of piecewise linear finite elements. In , we do not include the homogeneous boundary conditions. We denote by the nodal basis functions associated to the -th mesh node. Node is located at the vertex . For and , we define our scaled discrete bilinear forms and by equationparentequation


This definition is motivated by the fact that can be written as


Here we have exploited symmetry and the row sum property. It is obvious that if is a constant restricted to , we do obtain . In general however, the definition of introduces a variational crime and it does not even correspond to an element-wise local assembling based on a quadrature formula. We note that each node on is redundantly existent in the data structure and that we can easily account for jumps in the coefficient function when resolved by the macro-mesh elements .

Similar scaling techniques have been used in [39] for a generalized Stokes problem from geodynamics with coupled velocity components. However, for vectorial equations such a simple scaling does asymptotically not result in a physically correct solution. For the computation of integrals on triangles containing derivatives in the form of (2), cubature formulas of the form (3b) in combination with Euler-MacLaurin type asymptotic expansions have been applied [31, Table 1].

Remark 1

At first glance the Definition (3b) might not be more attractive than (4) regarding the computational cost. In a matrix free approach, however, where we have to reassemble the entries in each matrix call, (3b) turns out to be much more favorable. In order see this, we have to recall that we work with hybrid hierarchical meshes. This means that for each inner node in , we find the same entries in the sense that

Here we have identified the index notation with the vertex notation, and the vertex is obtained from the vertex by a shift of , i.e., . Consequently, the values of do not have to be re-computed but can be efficiently stored.

For simplicity of notation, we shall restrict ourselves in the following to the case of the Darcy equation with a scalar uniformly positive definite permeability; i.e., and drop the index . However, the proofs in Sec. 3 can be generalized to conceptually the same type of results for . In Subsection 7.3, we also show numerical results for in 3D.

2.2 Stencil structure

We exploit the hierarchical grid structure to save a significant amount of memory compared to classical sparse matrix formats. Any direct neighbor can be described through a direction vector such that . The regularity of the grid in the interior of a macro-element implies that these vectors remain the same, when we move from one node to another node. Additionally, for each neighbor there is a mirrored neighbor of reachable by ; see Fig. 1.

Let denote the stencil size at mesh node . We define the stencil associated to the i-th mesh node restricted on as

The symmetry of the bilinear form yields

We recall that for each mesh node we have in 2D and in 3D. Out of these entries only 3 in 2D and 7 in 3D have to be computed since the remaining ones follow from symmetry arguments and the observation that .

Figure 1: From left to right: Exemplary local indices and their corresponding direction vectors of a 15 point stencil in 3D; Six elements attached to one edge; Four elements attached to one edge

Due to the hierarchical hybrid grid structure, two stencils and are exactly the same if and are two nodes belonging to the same primitive; i.e., we find only different kinds of stencils per macro-element in 3D, one for each of its 15 primitives (4 vertices, 6 edges, 4 faces, 1 volume), and in 2D. This observation allows for an extremely fast and memory-efficient on-the-fly (re)assembly of the entries of the stiffness matrix in stencil form. For each node in the data structure, we save the nodal values of the coefficient function . With these considerations in mind, the bilinear form (3b) can be evaluated very efficiently and requires only a suitable scaling of the reference entries; see Sec. 6 for detailed cost considerations.

3 Variational crime framework and a priori analysis

In order to obtain order and a priori estimates of the modified finite element approximation in the - and -norm, respectively, we analyze the discrete bilinear form. From now on, we assume that for each . Moreover, we denote by the -norm on and defines a broken -norm. We recall that the coefficient function is only assumed to be element-wise smooth with respect to the macro triangulation. Existence and uniqueness of a finite element solution of

is given provided that the following assumption (A1) holds true:

  • is uniformly coercive on

  • ,

Here and in the following, the notation is used as abbreviation for , where is independent of the mesh-size . The assumption (A2), if combined with Strang’s first lemma, yields that the finite element solution results in a priori estimates with respect to the -norm; see, e.g., [11, 34]. We note that for small enough, the uniform coercivity (A1) follows from the consistency assumption (A2), since for

Remark 2

As it is commonly done in the finite element analysis in unweighted Sobolev norms, we allow the generic constant to be dependent on the global contrast of k defined by . Numerical results, however, show that the resulting bounds may be overly pessimistic for coefficients with large global variations. In [33] and the references therein, methods to improve the bounds in this case are presented. The examples show that the bounds may be improved significantly for coefficients with a global contrast in the magnitude of about . We are mainly interested in showing alternative assembly techniques to the standard finite element method and in comparing them to the well-established approaches in standard norms. Moreover, in our modification only the local variation of the coefficient is important, therefore we shall not work out these subtleties here.

3.1 Abstract framework for -norm estimates

Since (A2) does not automatically guarantee optimal order -estimates, we employ duality arguments. To get a better feeling on the required accuracy of , we briefly recall the basic steps occurring in the proof of the upper bound. As it is standard, we assume -regularity of the primal and the dual problem. Restricting ourselves to the case of homogeneous Dirichlet boundaries, the dual PDE and boundary operators coincide with the primal ones. Let us denote by the standard Galerkin approximation of , i.e., the finite element solution obtained as the solution of a discrete problem using the bilinear form . It is well-known that under the given assumptions . Now, to obtain an -estimate for , we consider the dual problem with on the right-hand side. Let be the solution of for . Due to the standard Galerkin orthogonality, we obtain


This straightforward consideration shows us that compared to (A2), we need to make stronger assumptions on the mesh-dependent bilinear form . We define (A3) by

where denotes the Hessian of and with ; see Fig. 2 for a 2D illustration. The semi-norm stands for the -norm restricted to .

Figure 2: Elements in (left) and (right) for
Lemma 1

Let the problem under consideration be -regular, be sufficiently small and (A3) be satisfied. Then we obtain a unique solution and optimal order convergence in the - and the -norm, i.e.,



Given that is small enough, (A1) follows from (A3). In terms of (5) and (A3), we get


The stability of the standard conforming Galerkin formulation yields . By Definition (3), we find that the discrete bilinear form is uniformly continuous for a coefficient function in , and thus . To bound the two terms involving , we use the 1D Sobolev embedding [36]. More precisely, for an element in , , we have see [27]. Here we use and in terms of the -regularity assumption, we obtain

Using the above estimates, (7) reduces to

Applying the triangle inequality and using the approximation properties of the Galerkin finite element solution results in the extra term in the upper bound for . The bound for follows by a standard inverse estimate for elements in and the best approximation property of . Since for small enough it holds where is a suitably fixed positive constant, the upper bound (6) follows.

3.2 Verification of the assumptions

It is well-known [11] that assumptions (A1)-(A3) are satisfied for the bilinear form


Here denotes the vertices of the d-dimensional simplex , i.e., we approximate the integral by a nodal quadrature rule and . Thus, to verify the assumptions also for , it is sufficient to consider in more detail with given by (3). Let

be the local stiffness matrix associated with the nodal basis functions , i.e., and the local coefficient function with , . The diagonal entries of are defined differently as


We introduce the component-wise Hadamard product between two matrices as and define the rank one matrix by .

Due to the symmetry of and the fact that the row sum of is equal to zero, we can rewrite the discrete bilinear forms. With , we associate locally elements with , . We recall that if and is a boundary node, then .

Lemma 2

The bilinear forms given by (3) and (8) have the algebraic form



We note that (9) yields that the row sum of is equal to zero. Moreover, is by construction symmetric. Introducing for the set of all elements sharing the global nodes and , we identify by the local index of the node associated with the element and by the local index of the global node . Then the standard local to global assembling process yields that the right-hand side in (10) reads

Comparing this result with the definition (3b), we find equality since the coefficient function is assumed to be smooth within each . The proof of (11) follows with exactly the same arguments as the one for (10).

Although the proof of the previous lemma is straightforward, the implication of it for large scale simulations cannot be underestimated. In 2D, the number of different edge types per macro-element is three while in 3D it is seven assuming uniform refinement. All edges in 3D in the interior of a macro-element share only four or six elements; see Fig. 1. We have three edge types that have four elements attached to them and four edge types with six adjacent elements.

The algebraic formulations (11) and (10) allow us to estimate the effects of the variational crime introduced by the stencil scaling approach.

Lemma 3

Assumptions (A2) and (A3) hold true.


The required bound for (A2) is straightforward and also holds true for any unstructured mesh refinement strategy. Recalling that if the coefficient function restricted to is a constant, (11) and (10) yield


denotes the maximal eigenvalue of its argument.

To show (A3), we have to exploit the structure of the mesh . Let be the set of all edges in and the subset of elements which share the edge having the two global nodes and as endpoints. As before, we identify the local indices of these endpoints by and . We note that the two sets and , are not necessarily disjoint. Observing that each element is exactly contained in elements of , we find

and thus it is sufficient to focus on the contributions resulting from . We consider two cases separately: First, we consider the case that the edge is part of for at least one , then we directly find the upper bound

Second, we consider the case that is in the interior of one . Then for each element there exists exactly one such that is obtained by point reflection at the midpoint of the edge ; see Fig. 3. In the following, we exploit that the midpoint of the edge is the barycenter of . Here the local indices and are associated with the global nodes and , respectively. Without loss of generality, we assume a local renumbering such that , and that is the point reflected vertex of ; see also Fig. 3.







(a) Local numbering along an inner edge in 2D
(b) Local numbering along an inner edge in 3D
Figure 3: Local numbering in element and its point reflected element in 2D and 3D

Let us next focus on

Exploiting the fact that , we can bound by the local -seminorms

A Taylor expansion of in guarantees that the terms of zeroth and first order cancel out and only second order derivatives of scaled with remain, i.e.,

Then the summation over all macro elements, all edges, and all elements in the subsets in combination with a finite covering argument [16] yields the upper bound of (A3).

4 Guaranteed uniform coercivity

While for our hierarchical hybrid mesh framework the assumptions (A2) and (A3) are satisfied and thus asymptotically, i.e., for sufficiently small, also (A1) is satisfied, (A1) is not necessarily guaranteed for any given mesh .

Lemma 4

If the matrix representation of the discrete Laplace operator is an M-matrix, then the scaled bilinear form is positive semi-definite on for globally smooth.


Let be the matrix representation of the discrete Laplace operator, then the M-matrix property guarantees that for . Taking into account that , definition (3) for the special case and globally smooth yields

Remark 3

In 2D it is well-known [12] that if all elements of the macro mesh have no obtuse angle, then is an M-matrix and we are in the setting of Lemma 4.

4.1 Pre-asymptotic modification in 2D based on (A2)

Here we work out the technical details of a modification in 2D that guarantees uniform ellipticity assuming that at least one macro-element has an obtuse angle. Our modification yields a linear condition on the local mesh-size depending on the discrete gradient of . It only applies to selected stencil directions. In 2D our -point stencil associated with an interior fine grid node has exactly two positive off-center entries if the associated macro-element has an obtuse angle. We call the edges associated with a positive reference stencil entry to be of gray type. With each macro-element , we associate the reference stiffness matrix . Without loss of generality, we assume that the local enumeration is done in such a way that the largest interior angle of the macro-element is located at the local node , i.e., if has an obtuse angle then , , and and otherwise , . By we denote the smallest non-degenerated eigenvalue of the generalized eigenvalue problem

We note that both matrices in the eigenvalue problem are symmetric, positive semi-definite and have the same one dimensional kernel and thus .

Let be a gray type edge. For each such edge