1 Introduction
Traditional finite element implementations are based on computing local element stiffness matrices, followed by a local-to-global assembly step that results in a sparse matrix. However, the cost of storing the global stiffness matrix is significant. Even for scalar equations and low-order 3D tetrahedral elements, the stiffness matrix has, on average, fifteen entries per row, and thus a standard sparse matrix format requires thirty times as much storage for the matrix as for the solution vector. This limits the size of the problems that can be tackled and becomes the dominating cost factor, since the sparse matrix must be reread from memory repeatedly when iterative solvers are applied. On all current and future computing systems, memory throughput and memory access latency can determine the runtime more critically than the floating-point operations executed. Furthermore, energy consumption has been identified as one of the fundamental roadblocks on the way to exascale computing. In this cost metric, memory access is again more expensive than computation. Against the backdrop of this technological development, it has become mandatory to develop numerical techniques that reduce memory traffic. In the context of partial differential equations, this is leading to a renewed interest in so-called
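A back-of-the-envelope calculation makes the thirty-times figure concrete (our sketch; it assumes a CSR-like format storing one 8-byte value plus one 8-byte column index per nonzero):

```python
# Storage of a sparse stiffness matrix relative to one solution vector.
# Assumption (ours): each nonzero costs one 8-byte floating-point value
# plus one 8-byte column index; each vector entry costs 8 bytes.
def matrix_to_vector_ratio(nnz_per_row, bytes_value=8, bytes_index=8):
    bytes_per_matrix_row = nnz_per_row * (bytes_value + bytes_index)
    bytes_per_vector_entry = bytes_value
    return bytes_per_matrix_row / bytes_per_vector_entry

# Low-order 3D tetrahedral elements: on average ~15 entries per row.
print(matrix_to_vector_ratio(15))  # 30.0
```

With 4-byte indices the factor drops, but the matrix still dominates the memory footprint by more than an order of magnitude.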
matrix-free techniques and, in some sense, to a revival of techniques that are well-known in the context of finite difference methods. Matrix-free techniques are motivated by the observation that many iterative solvers, e.g., Richardson iteration or Krylov subspace methods, require only the result of multiplying the global system matrix with a vector, but not the matrix itself. This product can be computed by local operations in each element, avoiding the need to set up, store, and load the global stiffness matrix. One of the first papers in this direction is [10], which describes the so-called element-by-element (EBE) approach, in which the global matrix-vector product (MVP) is assembled from the contributions of MVPs of element matrices with local vectors of unknowns. The element matrices are either precomputed and stored or recomputed on-the-fly. Note that storing all element matrices, even for low-order elements, has a higher memory footprint than the global matrix itself (although it requires less memory than storage schemes for sparse direct solvers, which reserve space for fill-in; this was the original competing scenario in [10]).
Consequently, the traditional EBE approach has not found wide application for unstructured meshes. However, it has been applied successfully in cases where the discretization is based on undeformed hexahedral elements with trilinear trial functions; see, e.g., [1, 7, 15, 37]. In such a setting, the element matrix is the same for all elements, which significantly reduces the storage requirements, and variable material parameters can be introduced by weighting local matrix entries in the same on-the-fly fashion as will be developed in this paper.
Matrix-free approaches for higher-order elements, as described, e.g., in [9, 25, 28, 29, 32], differ from the classic EBE approach in that they do not set up the element matrix and subsequently multiply it with the local vector. Instead, they fuse the two steps by going back to the numerical integration of the weak form itself. The process is accelerated by precomputing and storing certain quantities, such as derivatives of the basis functions at the quadrature points of a reference element. These techniques in principle work for arbitrarily shaped elements and orders, although a significant reduction in complexity can be achieved for tensor-product elements.
However, these matrix-free approaches also have shortcomings. While the low-order settings [1, 7, 15, 37] require structured hexahedral meshes, modern techniques for unstructured meshes only pay off for elements with tensor-product spaces of polynomial order at least two; see [25, 28].
In this paper we present a novel matrix-free approach for low-order finite elements designed for the hierarchical hybrid grids (HHG) framework; see, e.g., [3, 4, 5]. HHG offers significantly more geometric flexibility than undeformed hexahedral meshes. It is based on two interleaved ideas. The first is a special discretization of the problem domain. In HHG, the computational grid is created by uniform refinement following the rules of [6], starting from a possibly unstructured simplicial macro mesh. The resulting nested hierarchy of meshes allows for the implementation of powerful geometric multigrid solvers. The elements of the macro mesh are called macro-elements, and the resulting subtriangulation exhibits a uniform structure within these macro-elements. The second idea is based on the fact that each row of the global finite element stiffness matrix can be considered as a difference stencil. This point of view is classical on structured grids and has recently found renewed interest in the context of finite elements, too; see, e.g., [14]. In combination with the HHG grid construction, this implies that for linear simplicial elements one obtains stencils with identical structure for each inner node of a macro primitive. We define macro primitives as the geometrical entities of the macro mesh of different dimensions, i.e., vertices, edges, faces, and tetrahedra. If, additionally, the coefficients of the PDE are constant per macro-element, then the stencil entries are also exactly the same. Consequently, only a few different stencils (one per macro primitive) can occur and need to be stored. This leads to extremely efficient matrix-free techniques, as has been demonstrated, e.g., in [3, 18].
Let us now consider the setting of an elliptic PDE with piecewise smooth variable coefficients, assuming that the macro mesh resolves the jumps in the coefficients. In this case, a standard finite element formulation is based on quadrature formulas and introduces a variational crime. According to [11, 34], there is flexibility in how the integrals are approximated without degrading the order of convergence. This has recently been exploited in [2] with a method that approximates these integral values on-the-fly using suitable surrogate polynomials with respect to the macro mesh. The resulting two-scale method preserves the convergence order if the coarse and the fine scale are related properly. Here we propose an alternative which is based on the fine scale.
For this article, we restrict ourselves to the lowest order case of conforming finite elements on simplicial meshes. Then the most popular quadrature formula is the one-point Gauss rule, which in the simplest case of $-\operatorname{div}(k \nabla u)$ as PDE operator just weights the element-based reference stiffness matrix of the Laplacian by the factor $k(x_T)$, where $x_T$ is the barycenter of the element $T$. Alternatively, one can select a purely vertex-based quadrature formula. Here, the weighting of the element matrix is given by $\frac{1}{d+1} \sum_{i=1}^{d+1} k(x_i)$, where $d$ is the space dimension and the $x_i$ are the vertices of element $T$. Using a vertex-based quadrature formula saves function evaluations and is thus attractive whenever the evaluation of the coefficient function is expensive and it pays off to reuse once-computed values in several element stiffness matrices. Note that reusing barycentric data on general unstructured meshes would require nontrivial storage schemes.
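The two weightings can be sketched as follows (illustrative code in our notation; `element_matrix` and the P1 gradient construction are our illustration, not taken from the paper):

```python
import math
import numpy as np

# Weighting of the reference (constant-coefficient) P1 stiffness matrix
# of the Laplacian on one simplex: by k at the barycenter (one-point
# Gauss rule) or by the vertex average (vertex-based rule).
def element_matrix(verts, k, rule="barycenter"):
    verts = np.asarray(verts, dtype=float)
    d = verts.shape[1]
    T = (verts[1:] - verts[0]).T                  # d x d edge matrix
    Tinv = np.linalg.inv(T)
    grads = np.vstack([-Tinv.sum(axis=0), Tinv])  # P1 basis gradients
    vol = abs(np.linalg.det(T)) / math.factorial(d)
    A_ref = vol * grads @ grads.T                 # reference stiffness
    if rule == "barycenter":
        w = k(verts.mean(axis=0))                 # one evaluation per element
    else:                                         # vertex-based quadrature
        w = sum(k(v) for v in verts) / (d + 1)    # reuses nodal values of k
    return w * A_ref
```

For a constant coefficient both rules coincide; for a variable coefficient the vertex-based rule lets nodal values of $k$ be shared by all elements touching a vertex.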
In the case of variable coefficient functions, stencil entries can vary from one mesh node to another. The number of possibly different stencils within each macro-element then equals the number of its interior nodes, which grows exponentially with the number $\ell$ of uniform refinement steps in HHG. Now we can resort to two options: either these stencils are computed once and then saved, effectively creating a sparse matrix data structure, or they are recomputed on-the-fly each time they are needed. Neither of these techniques is ideal for extreme-scale computations: the first option consumes extra memory and causes extensive memory traffic, while the second requires repeated recomputation of local contributions.
The efficiency of a numerical PDE solver can be analyzed following the textbook paradigm [8] that defines a work unit (WU) as the cost of one application of the discrete operator for a given problem. With this definition, the analysis of iterative solvers can be conducted in terms of WU. Classical textbook multigrid efficiency is achieved when the solution is obtained in less than 10 WU. For devising an efficient method it is, however, equally critical to design algorithms that reduce the cost of a WU without sacrificing accuracy. Clearly, the real-life cost of a WU depends on the computer hardware and the efficiency of the implementation, as analyzed, e.g., for parallel supercomputers in [18]. On the other hand, matrix-free techniques, such as the one proposed in this article, seek opportunities to reduce the cost of a WU by a clever rearrangement of the algorithms or by exploiting approximations where possible; see, e.g., also [2].
These preliminary considerations motivate our novel approach of reducing the cost of a WU by recomputing the surrogate stencil entries for a matrix-free solver more efficiently. We find that these values can be assembled from a reference stencil of the constant-coefficient case which is scaled appropriately using nodal values of the coefficient function. We will show that, under suitable conditions, this technique does not sacrifice accuracy. Moreover, we demonstrate that the new method reduces the cost of a WU considerably and consequently helps to reduce the time-to-solution.
The rest of this paper is structured as follows: In Section 2, we define our new scaling approach. The variational crime is analyzed in Section 3, where optimal order a priori results in the $H^1$- and $L^2$-norm are obtained. In Section 4, we consider modifications in the pre-asymptotic regime to guarantee uniform ellipticity. Section 5 is devoted to the reproduction property and the primitive concept which allows for a fast on-the-fly reassembly in a matrix-free software framework. In Section 6, we discuss the cost compared to a standard nodal-based element-wise assembly. Finally, in Section 7 we perform a numerical accuracy study and a runtime comparison to illustrate the performance gain of the new scaling approach.
2 Problem setting and definition of the scaling approach
We consider a scalar elliptic partial differential equation of Darcy type, i.e.,

$-\operatorname{div}(K \nabla u) = f \ \text{ in } \Omega, \qquad \operatorname{tr} u = 0 \ \text{ on } \partial\Omega,$

where tr stands for the boundary trace operator and $f \in L^2(\Omega)$. Here $\Omega \subset \mathbb{R}^d$, $d = 2, 3$, is a bounded polygonal/polyhedral domain, and $K$ denotes a uniformly positive definite and symmetric tensor with coefficients specified through a number of functions $k_m$, $m = 1, \ldots, M$, where $M = 3$ in 2D and $M = 6$ in 3D due to symmetry.
For the Darcy operator with a scalar uniformly positive permeability $k$, i.e., $K = k I$, we can set $M = 1$ and $k_1 = k$. The above setting also covers blending finite element approaches [19]. Here $K$ is related to the Jacobian of the blending function. For example, if the standard Laplacian model problem is considered on the physical domain but the actual assembly is carried out on a reference domain $\Omega$, we have
(1)  $K = |\det(D\Phi)|\, (D\Phi)^{-1} (D\Phi)^{-\top},$

where $D\Phi$ is the Jacobian of the mapping $\Phi$ from the reference domain to the physical domain.
2.1 Definition of our scaling approach
The weak form associated with the partial differential equation is defined in terms of the bilinear form $a(u, v) := \int_\Omega (K \nabla u) \cdot \nabla v \,\mathrm{d}x$, and the weak solution $u \in V := H^1_0(\Omega)$ satisfies: $a(u, v) = (f, v)_{L^2(\Omega)}$ for all $v \in V$.
This bilinear form can be affinely decomposed as
(2)  $a(v, w) = \sum_{m=1}^{M} \langle k_m\, \mathcal{D}_m v,\, \mathcal{D}_m w \rangle_m,$

where $\mathcal{D}_m$ is a first order partial differential operator and $\langle \cdot, \cdot \rangle_m$ stands for some suitable inner product. In the case of a scalar permeability, we find $M = 1$, $\mathcal{D}_1 = \nabla$, and $\langle \cdot, \cdot \rangle_1$ is the scalar product in $(L^2(\Omega))^d$. For (1) in 2D one can, as one alternative, e.g., define
where $d = 2$ and the same type of scalar product as above is used. Note that this decomposition reduces to the one above in the case of a scalar permeability.
Let a fixed, possibly unstructured simplicial triangulation resolving $\Omega$ be given. We also call it the macro-triangulation. Using uniform mesh refinement, we obtain a hierarchy of triangulations by decomposing each element into subelements; see [6] for the 3D case. The macro-triangulation is then decomposed into the following geometrical primitives: elements, faces, edges, and vertices. Each of these geometric primitives acts as a container for a subset of the unknowns associated with the refined triangulations. These sets of unknowns can be stored in array-like data structures, resulting in a contiguous memory layout that conforms inherently to the refinement hierarchy; see [5, 18]. In particular, the unknowns can be accessed without indirect addressing, so the overhead is reduced significantly when compared to conventional sparse matrix data structures. Associated with the refined triangulation is the space of piecewise linear finite elements, in which we do not include the homogeneous boundary conditions. We denote by $\phi_i$ the nodal basis function associated with the $i$-th mesh node, located at the vertex $x_i$. Using the nodal values of the coefficient functions, we define our scaled discrete bilinear forms by
(3a)  $a_h(v_h, w_h) := \sum_{m=1}^{M} a_h^m(v_h, w_h),$

(3b)  $a_h^m(v_h, w_h) := -\frac{1}{2} \sum_i \sum_{j \neq i} \frac{k_m(x_i) + k_m(x_j)}{2}\, \hat{a}^m(\phi_j, \phi_i)\,(v_i - v_j)(w_i - w_j),$

where $v_h = \sum_i v_i \phi_i$, $w_h = \sum_i w_i \phi_i$, and $\hat{a}^m$ denotes the reference form obtained from $\langle \mathcal{D}_m \cdot, \mathcal{D}_m \cdot \rangle_m$ with coefficient one.
This definition is motivated by the fact that the bilinear form can be written as
(4)  $a(v_h, w_h) = -\frac{1}{2} \sum_i \sum_{j \neq i} a(\phi_j, \phi_i)\,(v_i - v_j)(w_i - w_j).$
Here we have exploited the symmetry and the row sum property. It is obvious that if the coefficient function is constant when restricted to a macro-element, we do obtain the exact bilinear form. In general, however, the definition of the scaled form introduces a variational crime, and it does not even correspond to an element-wise local assembly based on a quadrature formula. We note that each node on the boundary of a macro-element exists redundantly in the data structure, and that we can easily account for jumps in the coefficient function when they are resolved by the macro-mesh elements.
Similar scaling techniques have been used in [39] for a generalized Stokes problem from geodynamics with coupled velocity components. However, for vectorial equations such a simple scaling does not asymptotically result in a physically correct solution. For the computation of integrals on triangles containing derivatives in the form of (2), cubature formulas of the form (3b) in combination with Euler–Maclaurin type asymptotic expansions have been applied [31, Table 1].
Remark 1
At first glance, the definition (3b) might not seem more attractive than (4) regarding the computational cost. In a matrix-free approach, however, where we have to reassemble the entries in each matrix call, (3b) turns out to be much more favorable. In order to see this, we recall that we work with hybrid hierarchical meshes. This means that for each inner node of a macro-element, we find the same entries in the sense that
Here we have identified the index notation with the vertex notation; the neighboring vertex is obtained from the current one by a fixed shift. Consequently, the reference values do not have to be recomputed but can be stored efficiently.
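This kind of reuse presupposes the contiguous, indirection-free storage of the unknowns within a macro primitive; a hypothetical sketch of such a layout for a refined triangular face (our illustration, not the actual HHG code):

```python
# Linear index of node (i, j) inside a refined triangular macro-element
# whose rows j = 0..n hold n + 1 - j nodes each (flat, contiguous layout).
def tri_index(i, j, n):
    # Nodes in the rows below row j:  sum_{r=0}^{j-1} (n + 1 - r)
    return j * (n + 1) - j * (j - 1) // 2 + i

# Enumerating all nodes of a macro-element with n = 2 visits the flat
# array exactly once, in order, without any index lookups.
order = [tri_index(i, j, 2) for j in range(3) for i in range(3 - j)]
print(order)  # [0, 1, 2, 3, 4, 5]
```

Moving to a neighboring node is then pure index arithmetic with a constant offset per direction, which is exactly what a stencil-based kernel needs.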
For simplicity of notation, we restrict ourselves in the following to the case of the Darcy equation with a scalar uniformly positive permeability, i.e., $M = 1$, and drop the index $m$. However, the proofs in Sec. 3 can be generalized to conceptually the same type of results for the general case. In Subsection 7.3, we also show numerical results for the tensor case in 3D.
2.2 Stencil structure
We exploit the hierarchical grid structure to save a significant amount of memory compared to classical sparse matrix formats. Any direct neighbor of a node can be described through a direction vector pointing from the node to that neighbor. The regularity of the grid in the interior of a macro-element implies that these direction vectors remain the same when we move from one interior node to another. Additionally, for each neighbor there is a mirrored neighbor reachable by the negated direction vector; see Fig. 1.
Let $n_i$ denote the stencil size at mesh node $i$. We define the stencil associated with the $i$-th mesh node restricted to a macro-element as
The symmetry of the bilinear form yields
We recall that for each interior mesh node we have 7 stencil entries in 2D and 15 in 3D. Out of these entries, only 3 in 2D and 7 in 3D have to be computed, since the remaining ones follow from symmetry arguments and the observation that the central entry is the negative sum of the off-center entries.
Due to the hierarchical hybrid grid structure, two stencils are exactly the same if their nodes belong to the same primitive; i.e., we find only 15 different kinds of stencils per macro-element in 3D, one for each of its 15 primitives (4 vertices, 6 edges, 4 faces, 1 volume), and 7 in 2D. This observation allows for an extremely fast and memory-efficient on-the-fly (re)assembly of the entries of the stiffness matrix in stencil form. For each node in the data structure, we store the nodal values of the coefficient function. With these considerations in mind, the bilinear form (3b) can be evaluated very efficiently and requires only a suitable scaling of the reference entries; see Sec. 6 for detailed cost considerations.
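Our reading of the resulting kernel, as a 2D sketch (the direction vectors, reference values, and averaging rule below are illustrative assumptions): each off-center reference entry, identical for a mirrored direction pair, is weighted by the mean of the nodal coefficient at the two endpoints, and the central entry restores the zero row sum.

```python
# On-the-fly scaling of a 2D seven-point reference stencil.
# Assumptions (ours): mirrored directions share one reference entry, and
# each scaled entry is  (k_i + k_j)/2  times the reference entry.
def scaled_stencil(ref, k, i, j):
    """ref: dict mapping one direction of each mirrored pair to its
    reference entry; k: 2D array of nodal coefficients; (i, j): node."""
    st = {}
    for (dx, dy), a_ref in ref.items():
        for d in ((dx, dy), (-dx, -dy)):           # mirrored neighbors
            k_avg = 0.5 * (k[i][j] + k[i + d[0]][j + d[1]])
            st[d] = k_avg * a_ref
    st[(0, 0)] = -sum(st.values())                 # zero row sum
    return st

ref = {(1, 0): -1.0, (0, 1): -1.0, (-1, 1): -1.0}  # illustrative values
k = [[1.0] * 3 for _ in range(3)]                  # constant coefficient
print(scaled_stencil(ref, k, 1, 1)[(0, 0)])        # 6.0
```

Only the three reference values per stencil class and the nodal array of the coefficient are read; no matrix entries are stored.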
3 Variational crime framework and a priori analysis
In order to obtain optimal order a priori estimates for the modified finite element approximation in the $H^1$- and the $L^2$-norm, respectively, we analyze the discrete bilinear form. From now on, we assume that the coefficient function is smooth on each macro-element; we recall that it is only assumed to be elementwise smooth with respect to the macro-triangulation. Broken norms are understood elementwise with respect to the macro-triangulation. Existence and uniqueness of a finite element solution of
is given provided that the following assumptions (A1) and (A2) hold true:

(A1) the discrete bilinear form is uniformly coercive on the finite element space,

(A2) the discrete bilinear form is first-order consistent with the exact bilinear form,
Here and in the following, the notation $\lesssim$ is used as an abbreviation for $\leq C$, where $C$ is independent of the mesh-size $h$. Assumption (A2), combined with Strang's first lemma, yields that the finite element solution satisfies a priori estimates with respect to the $H^1$-norm; see, e.g., [11, 34]. We note that for $h$ small enough, the uniform coercivity (A1) follows from the consistency assumption (A2), since for
Remark 2
As is commonly done in finite element analysis in unweighted Sobolev norms, we allow the generic constant to depend on the global contrast of $k$, defined as the ratio of its global maximum and minimum. Numerical results, however, show that the resulting bounds may be overly pessimistic for coefficients with large global variations. In [33] and the references therein, methods to improve the bounds in this case are presented. The examples there show that the bounds may be improved significantly for coefficients with a large global contrast. We are mainly interested in presenting alternative assembly techniques to the standard finite element method and in comparing them to well-established approaches in standard norms. Moreover, in our modification only the local variation of the coefficient matters; therefore we shall not work out these subtleties here.
3.1 Abstract framework for norm estimates
Since (A2) does not automatically guarantee optimal order $L^2$ estimates, we employ duality arguments. To get a better feeling for the required accuracy of the discrete bilinear form, we briefly recall the basic steps in the proof of the upper bound. As is standard, we assume $H^2$-regularity of the primal and the dual problem. Restricting ourselves to the case of homogeneous Dirichlet boundaries, the dual PDE and boundary operators coincide with the primal ones. Let us denote by $\tilde{u}_h$ the standard Galerkin approximation, i.e., the finite element solution of the discrete problem using the bilinear form $a(\cdot, \cdot)$. It is well-known that under the given assumptions it converges with optimal order in the $L^2$-norm. Now, to obtain an estimate for the difference between the scaled and the standard Galerkin approximation, we consider the dual problem with this difference as right-hand side, and denote its solution by $w$. Due to the standard Galerkin orthogonality, we obtain
(5) 
This straightforward consideration shows that, compared to (A2), we need to make stronger assumptions on the mesh-dependent bilinear form. We define (A3) by
where the Hessian of the coefficient function enters; see Fig. 2 for a 2D illustration. The seminorm stands for the corresponding norm restricted to the indicated subdomain.
Lemma 1
Let the problem under consideration be $H^2$-regular, let the mesh-size be sufficiently small, and let (A3) be satisfied. Then we obtain a unique solution and optimal order convergence in the $H^1$- and the $L^2$-norm, i.e.,
(6) 
Proof
Given that the mesh-size is small enough, (A1) follows from (A3). In terms of (5) and (A3), we get
(7) 
The stability of the standard conforming Galerkin formulation yields a uniform bound for the Galerkin solution. By definition (3), the discrete bilinear form is uniformly continuous for a bounded coefficient function. To bound the two terms involving the Hessian, we use the 1D Sobolev embedding [36]; see [27]. Using this together with the regularity assumption, we obtain
Using the above estimates, (7) reduces to
Applying the triangle inequality and using the approximation properties of the Galerkin finite element solution results in the extra term in the upper bound for the $H^1$-error. The bound for the $L^2$-error follows by a standard inverse estimate for discrete functions and the best approximation property of the Galerkin solution. Since for $h$ small enough the additional terms can be absorbed with a suitably fixed positive constant, the upper bound (6) follows.
3.2 Verification of the assumptions
It is well-known [11] that assumptions (A1)-(A3) are satisfied for the bilinear form
(8) 
Here the quadrature points are the vertices of the $d$-dimensional simplex; i.e., we approximate the integral by a nodal quadrature rule. Thus, to verify the assumptions also for our scaled form, it is sufficient to consider in more detail its difference from the form given by (3). Let
be the local stiffness matrix associated with the nodal basis functions, and let the local coefficient function be represented by its nodal values. The diagonal entries of the scaled local matrix are defined differently as
(9) 
We introduce the componentwise Hadamard product between two matrices and define the associated rank-one scaling matrix.
Due to the symmetry of the local matrices and the fact that the row sum of the reference stiffness matrix is equal to zero, we can rewrite the discrete bilinear forms. With each element, we locally associate the nodal values of the discrete functions. We recall that if a node is a boundary node, then the corresponding coefficient vanishes.
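A minimal numerical sketch of this construction (our notation; `W` plays the role of the nodal-average scaling matrix, and the diagonal is reset according to the zero row sum rule, cf. (9)):

```python
import numpy as np

# Scaled local matrix: Hadamard product of the constant-coefficient
# local stiffness matrix A_T with the averages W_ij = (k_i + k_j)/2,
# followed by resetting the diagonal so that all row sums are zero.
def scaled_local_matrix(A_T, k_nodal):
    k = np.asarray(k_nodal, dtype=float)
    W = 0.5 * (k[:, None] + k[None, :])
    Ak = A_T * W                           # componentwise (Hadamard) product
    np.fill_diagonal(Ak, 0.0)
    np.fill_diagonal(Ak, -Ak.sum(axis=1))  # zero row sums, cf. (9)
    return Ak
```

For a constant coefficient this reproduces the weighted constant-coefficient matrix exactly; for a variable coefficient the result stays symmetric with zero row sums.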
Proof
We note that (9) yields that the row sum of the scaled local matrix is equal to zero. Moreover, it is symmetric by construction. Introducing the set of all elements sharing two given global nodes, we identify for each such element the local indices of these two nodes. Then the standard local-to-global assembly process yields that the right-hand side in (10) reads
Comparing this result with the definition (3b), we find equality, since the coefficient function is assumed to be smooth within each macro-element. The proof of (11) follows with exactly the same arguments as the one for (10).
Although the proof of the previous lemma is straightforward, its implications for large-scale simulations should not be underestimated. In 2D, the number of different edge types per macro-element is three, while in 3D it is seven, assuming uniform refinement. In 3D, all edges in the interior of a macro-element share only four or six elements; see Fig. 1. Three edge types have four elements attached to them, and four edge types have six adjacent elements.
The algebraic formulations (10) and (11) allow us to estimate the effect of the variational crime introduced by the stencil scaling approach.
Lemma 3
Assumptions (A2) and (A3) hold true.
Proof
The required bound for (A2) is straightforward and also holds true for any unstructured mesh refinement strategy. Recalling that the difference vanishes if the coefficient function restricted to a macro-element is constant, (10) and (11) yield
where $\lambda_{\max}(\cdot)$ denotes the maximal eigenvalue of its argument.
To show (A3), we have to exploit the structure of the mesh. Consider the set of all edges and, for each edge, the subset of elements which share this edge, with the two global nodes at its endpoints. As before, we identify the local indices of these endpoints. We note that the subsets associated with different edges are not necessarily disjoint. Observing that each element is contained in exactly as many subsets as it has edges, we find
and thus it is sufficient to focus on the contributions resulting from a single edge. We consider two cases separately: First, we consider the case that the edge is part of the boundary of a macro-element; then we directly find the upper bound
Second, we consider the case that the edge lies in the interior of a macro-element. Then for each element sharing the edge there exists exactly one partner element obtained by point reflection at the midpoint of the edge; see Fig. 3. In the following, we exploit that the midpoint of the edge is the barycenter of the union of such a pair of elements. Here the local indices are associated with the global endpoint nodes of the edge. Without loss of generality, we assume a local renumbering such that corresponding vertices of the two elements are point-reflected images of each other; see also Fig. 3.
Let us next focus on
Exploiting this structure, we can bound the relevant term by the local seminorms
A Taylor expansion of the coefficient around the edge midpoint guarantees that the terms of zeroth and first order cancel out and only second order derivatives, scaled with the squared mesh-size, remain, i.e.,
Then the summation over all macro-elements, all edges, and all elements in the subsets, in combination with a finite covering argument [16], yields the upper bound of (A3).
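In generic notation, the cancellation just used can be sketched as follows (our formula, expanding a smooth $k$ about the edge midpoint $x_m$ with endpoints $x_i = x_m - \delta$ and $x_j = x_m + \delta$):

```latex
\frac{k(x_i) + k(x_j)}{2}
  = \frac{k(x_m - \delta) + k(x_m + \delta)}{2}
  = k(x_m) + \frac{1}{2}\,\delta^\top D^2 k(\xi)\,\delta ,
\qquad |\delta| \le C h,
```

so the zeroth order terms match, the first order terms cancel by the point symmetry, and the remainder is bounded by $h^2$ times the $W^{2,\infty}$-seminorm of $k$.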
4 Guaranteed uniform coercivity
While for our hierarchical hybrid mesh framework the assumptions (A2) and (A3) are satisfied, and thus (A1) also holds asymptotically, i.e., for sufficiently small mesh-size, (A1) is not necessarily guaranteed on any given mesh.
Lemma 4
If the matrix representation of the discrete Laplace operator is an M-matrix, then the scaled bilinear form is positive semidefinite for a globally smooth positive coefficient.
Proof
Let the matrix representation of the discrete Laplace operator be given; the M-matrix property then guarantees that its off-diagonal entries are nonpositive. Taking into account the zero row sums, definition (3) for the special case of a globally smooth coefficient yields
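An illustrative numerical check of the lemma's mechanism (our example: a 1D Neumann Laplacian as reference operator with the M-matrix sign pattern; the scaling keeps the off-diagonal entries nonpositive and the row sums zero, so the scaled symmetric matrix stays weakly diagonally dominant and positive semidefinite):

```python
import numpy as np

# Reference operator: 1D Laplacian with zero row sums (Neumann ends).
n = 8
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A[0, 0] = A[-1, -1] = 1.0

k = 1.0 + np.arange(n) / n                 # positive, smoothly varying
W = 0.5 * (k[:, None] + k[None, :])        # nodal averages (k_i + k_j)/2
Ak = A * W                                 # scale the entries
np.fill_diagonal(Ak, 0.0)
np.fill_diagonal(Ak, -Ak.sum(axis=1))      # restore zero row sums

# Off-diagonals stay nonpositive and the matrix stays PSD.
print(np.linalg.eigvalsh(Ak).min() >= -1e-12)  # True
```

The check exercises exactly the sign and row-sum structure the proof relies on; it is not a substitute for the proof itself.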
Remark 3
4.1 Preasymptotic modification in 2D based on (A2)
Here we work out the technical details of a modification in 2D that guarantees uniform ellipticity, assuming that at least one macro-element has an obtuse angle. Our modification yields a linear condition on the local mesh-size depending on the discrete gradient of the coefficient, and it only applies to selected stencil directions. In 2D, the seven-point stencil associated with an interior fine grid node has exactly two positive off-center entries if the associated macro-element has an obtuse angle. We call the edges associated with a positive reference stencil entry edges of gray type. With each macro-element, we associate its reference stiffness matrix. Without loss of generality, we assume that the local enumeration is done in such a way that the largest interior angle of the macro-element is located at a fixed local node. We denote the smallest non-degenerate eigenvalue of the generalized eigenvalue problem
We note that both matrices in the eigenvalue problem are symmetric and positive semidefinite and have the same one-dimensional kernel, and thus the smallest non-degenerate eigenvalue is positive.
Let be a gray type edge. For each such edge