This article shows that powerful multigrid smoothers based on domain decomposition, with cells and vertex patches as subdomains, can be implemented very efficiently using fast diagonalization. In particular, we show that, now that matrix-free application of operators associated with finite element bilinear forms is state of the art, powerful smoothers can be implemented with the same asymptotic complexity with respect to the polynomial degree. We demonstrate the technique for the Laplacian, but it applies to any separable operator.
Multigrid and domain decomposition methods are the two classes of solvers or preconditioners which allow the solution of discretized elliptic partial differential equations with linear complexity in the number of degrees of freedom. This is true at least under mesh refinement. When the polynomial degree is increased, point smoothers on the fine mesh must be replaced by more complex methods with, in general, superlinear complexity. For instance, nonoverlapping subdomain smoothers using the inversion of cell matrices of the interior penalty method (IP) have yielded very effective multigrid methods for higher order discretizations of the Laplacian [Kanschat08smoother], reaction-diffusion systems [LuceroKanschat18], or radiation transport [Lucero18]. For divergence constrained problems, they are not sufficient and we have to resort to overlapping patches of hypercube cells around a single vertex. Then, we obtain effective multigrid methods for divergence-dominated problems [ArnoldFalkWinther97Hdiv], the Stokes problem [KanschatMao15], or a Darcy-Stokes-Brinkman system [KanschatLazarovMao17]. While these methods are very effective in the sense of few iteration steps, computation time can become infeasible in a standard implementation, if large cell matrices are inverted with an algorithm of cubic complexity.
Modern hardware favors algorithms performing complex operations on small data sets, since memory access is by far more expensive in terms of time and energy than computation. Thus, it was observed for instance in [KronbichlerKormann12, kronbichler2019fast, MuethingPiatkowskiBastian17] that implementations based on stored sparse matrices, which have a computational intensity of one FLOP per matrix entry read from memory, are not competitive. On the other hand, once the computational intensity is high enough that computation dominates memory access, it is not only worthwhile but mandatory to optimize the computational part of algorithms. This has been achieved for the application of finite element operators, where most codes now prefer integration of bilinear forms over mesh cells computed on the fly to stored matrices. While infeasible with standard quadrature, whose complexity in the polynomial degree $k$ is $\mathcal{O}(k^{2d})$, that is $\mathcal{O}(k^6)$ in three dimensions, application of a local matrix to a local vector can be performed at low arithmetic cost with complexity of order $\mathcal{O}(k^{d+1})$ using the technique of sum factorization. This technique was introduced in the context of spectral methods in [orszag80].
For effective multigrid smoothers, we need the solution of local problems in addition to operator application. Hence, we turn to low rank tensor representations of the local matrices and their inverses to yield a similar reduction of complexity, following the idea in [lynch64]. There, a low rank tensor representation of the inverse of a separable (discrete) operator on a tensor product mesh is presented. We apply it as a local solver on overlapping subdomains on Cartesian meshes and as an approximate local solver on more general geometries.
Kronecker decompositions of separable operators have been used as preconditioners in [Beuchler03, lottes05, stiller2017, stiller2016]. In [Beuchler03], the one-dimensional local problems are preconditioned by a wavelet method and then used in a block preconditioner of the global system, splitting the bubble degrees of freedom on edges and in the interior of cells, respectively, from those in vertices for continuous finite elements. Methods closer to ours were introduced in the context of continuous finite elements in [lottes05, stiller2017]. There, a cell based smoother is introduced which augments each cell by a few layers of support points with associated basis functions from neighboring cells. In [stiller2016], a similar smoother for discontinuous Galerkin methods is presented and compared to an augmented patch of two cells sharing a face. From the point of view of data structures, these approaches are more complicated than ours, since we use cell-wise data instead of augmenting by neighboring shape functions. The block preconditioners in [pazner18] based on Kronecker decompositions are not restricted to separable operators. However, the successive Kronecker singular value decompositions [vanloan93, vanloan00] of the local solvers on cells require more than the $\mathcal{O}(k^{d+1})$ arithmetic operations per element of sum factorization, losing optimality in three dimensions. An alternative approach to efficient local solvers based on localized matrix-free methods is the iterative solution of the local problems up to a fixed accuracy, see [BastianMuellerMuethingPiatkowski18].
Algorithms in this article have been developed with vectorizing multicore architectures and their high cost for memory access compared to computation in mind. Here, we study their computational complexity only, showing that even with purely sequential arithmetic, we obtain good multigrid performance. The reason are low iteration counts combined with an implementation of the smoothers with low overhead. Parallel implementation, where the balance between computation and memory access becomes important, is deferred to a forthcoming study.
This article is organized as follows: in the following section, we introduce the model problem, its discretization by the interior penalty method, and the multilevel Schwarz methods we use for preconditioning together with some results on convergence speed. In Section 3, we present the efficient implementation of these smoothers for separable operators and results for their computational effort. Section 4 discusses the application of inexact local solvers on more general meshes and their impact on performance. Our findings are summarized in Section 5.
2 Multilevel interior penalty methods
In this work, we discuss a method for the model problem of Poisson’s equation
\begin{align}
-\Delta u &= f && \text{in } \Omega, \\
u &= g && \text{on } \partial\Omega,
\end{align}
where $\Omega$ is a polygonal domain in $\mathbb{R}^d$ with $d = 2,3$; $f$ and $g$ are given functions on $\Omega$ and $\partial\Omega$, respectively. We point out that we use the Laplacian as a simple example, but that it can be replaced by any separable operator. In the case of nonsymmetric operators, the eigenvalues below must be replaced by singular values.
The model problem is discretized by means of the symmetric interior penalty method (SIPG) following [ArnoldBrezziCockburnMarini02, Arnold82]. To this end, we subdivide the domain into a sequence of meshes, where the finest level is the actual discretization level on which we want to solve and the intermediate levels form the hierarchy for the geometric multigrid method. Each mesh consists of a collection of quadrilateral/hexahedral cells, which are obtained by a mapping from the reference cell. The relation of these meshes is defined by induction as follows: starting from a coarse mesh consisting of a few cells at most, we generate a hierarchy of meshes by recursively splitting each cell with respect to its midpoint into $2^d$ children on the next level. These meshes are nested in the sense that every cell of a mesh is equal to the union of its children on the next level, as well as conforming in the sense that every edge/face of a cell is either at the domain’s boundary or a complete edge/face of an adjacent cell.
The shape function space on the reference cell consists of the discontinuous, tensor product polynomials . Its basis
consists of tensor products of Lagrangian interpolation polynomials of degree $k$ with respect to the Gauss-Lobatto points on the reference interval. More details will be provided in Section 3. The shape function spaces on an actual grid cell are obtained by composition with the cell mapping. The finite element space on each level is then defined by
The indexing of the basis follows the structure as a direct sum. This basis defines by duality the coefficient space of the same dimension as equipped with the Euclidean inner product. In computations, this is the inner product used to compute norms, such that we will identify with the coefficient space and do not distinguish them in notation.
Let be the set of all interior interfaces between two cells . Then, we refer to traces of functions on taken from cell as , respectively. For such a function, we define the “averaging operator”
On a face at the boundary, denoted by , there is only a trace from the interior and thus we define
Using as the outward pointing normal of the cell at face , we introduce the interior penalty bilinear form
Here, the integrals over sets of cells (faces) are to be understood as the sum of the individual integrals over cells (faces). From left to right we refer to the four integrals on the right hand side of (5) as the bulk, penalty, consistency and adjoint consistency term. We still have to define the edge-wise penalty parameter , which penalizes the jumps of the solution and yields stability of the bilinear form [Kanschat03habil, §2.2.8]. It is of the form
where is the (average) length of cell orthogonal to the common edge . On boundary edges we let , where is the length of the corresponding cell orthogonal to . The factor is chosen equal to one on Cartesian elements and has to be increased on non-Cartesian elements to preserve stability of the discretization. Finally, we can state the interior penalty discretization of the model problem (1): find such that
2.2 Geometric Multigrid
We follow [GopalakrishnanKanschat03] in the definition of the geometric multigrid algorithm for the interior penalty method. More precisely, we state the V-cycle algorithm used for preconditioning in Algorithm 1.
The operators used there are as follows: refers to the level matrix associated with the interior penalty bilinear form in (5). The operator is the prolongation operator. Since under our assumptions the spaces are nested, this is simply the embedding from into . The restriction operator is the adjoint of the prolongation operator with respect to the Euclidean inner product in the coefficient spaces. This definition corresponds to the transpose matrix. The operators and are the smoothers on level described in detail in the next subsection.
2.3 Schwarz Smoothers
We use the terms domain decomposition smoothers or Schwarz smoothers in the context of multigrid methods and many, very small subdomains, on which we solve the differential equation exactly. Examples from the literature are the and smoothers in [ArnoldFalkWinther97Hdiv, ArnoldFalkWinther00] or cell based smoothers for the interior penalty method in [Kanschat08smoother]. The first group has been generalized successfully to Stokes [KanschatMao15] and Darcy-Stokes-Brinkman [KanschatLazarovMao17] problems. The second class has been generalized to singularly perturbed reaction-diffusion problems in [LuceroKanschat18], where we also generalized the convergence analysis in [DryjaKrzyzanowski16] to quadrilateral and hexahedral meshes. Thus, we consider two classes of domain decomposition smoothers with local solvers on cells and vertex patches, respectively.
cell based smoothers: each subdomain of the spatial decomposition on a level consists of a single cell of the mesh, as depicted in Figure 0(a). After enumerating the cells of the mesh, each subspace consists of the functions with support in the corresponding cell. As the spatial decomposition is nonoverlapping and we use discontinuous finite elements, the finite element space is the direct sum of these subspaces.
vertex patch smoothers: each subdomain consists of all cells sharing one interior vertex of the mesh, as shown in Figure 0(b). The corresponding subspaces consist of the functions with support in the patch. As typically $2^d$ cells share a vertex and each cell has $2^d$ vertices, the spatial decomposition is overlapping and the sum of the subspaces is not direct.
In both cases, we define the local solvers by
We refer to the operator associated with the bilinear form restricted to as . From now on we suppress the level index in expressions like and . We define the additive Schwarz smoother on level as
where is a relaxation parameter. is the restriction operator and its transpose the embedding. The form on the right highlights the structure as a product of the system matrix and the additive Schwarz preconditioner .
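To make this structure concrete, the additive operator can be sketched in a few lines of code. The following is a dense toy implementation, not the matrix-free production version discussed later; the function name and the list-of-index-sets interface are ours, for exposition only.

```python
import numpy as np

def additive_schwarz(A, subdomains, r, omega=1.0):
    """Apply the additive Schwarz preconditioner
    omega * sum_j R_j^T A_j^{-1} R_j to a residual r.
    `subdomains` is a list of index arrays; R_j selects those indices
    and A_j = R_j A R_j^T is the corresponding local matrix."""
    z = np.zeros_like(r)
    for idx in subdomains:
        Aj = A[np.ix_(idx, idx)]               # restrict: local matrix
        z[idx] += np.linalg.solve(Aj, r[idx])  # local solve, prolongated back
    return omega * z
```

For a nonoverlapping partition and a block-diagonal matrix, one application with omega equal to one already yields the exact solution, which makes a convenient sanity check.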
The multiplicative Schwarz operator, in its standard form, is defined by
While we do not evaluate parallel performance in this article, but rather focus on the numerical efficiency of the smoothing methods, namely the number of arithmetic operations needed, we nevertheless have parallel execution by vectorization, multi-threading, and MPI-parallelization on distributed systems in mind. Neither the standard form of the multiplicative smoothers, nor the additive smoother with vertex patches are suited for such parallelism. While the additive vertex patch smoother suffers from race conditions, both the cell and the vertex patch multiplicative smoother are inherently sequential. Therefore, we use “coloring” of the mesh cells in order to recover potential parallelism.
Coloring refers to splitting the index set of subdomains into disjoint subsets, the colors, such that the operations within each subset can be performed in parallel without causing conflicts.
Race conditions are conflicts due to simultaneous reading and writing. They appear in the additive vertex patch smoother, if two local solvers executed in parallel are writing into the data of the same cell. Therefore, the two colored patches in Figure 0(c) may not be processed in parallel. Coloring for this algorithm is designed such that two patches of the same color do not share a common cell. For regular meshes, this can be achieved by parquetings of the domain with possible omission of strips at the boundary.
For the multiplicative algorithm, the goal of coloring is not just avoiding race conditions, it is recovering parallelism at all. To this end, we note that
The term in parentheses vanishes if and only if the two subspaces are orthogonal with respect to the level bilinear form. Since operator application involves face terms, this orthogonality is violated if two subdomains share a common face as in Figure 0(d). Such conflicts can be avoided on regular meshes by red-black coloring for the cell based smoother as in Figure 1(a), yielding 2 colors in any dimension.
The multiplicative vertex patch smoother combines the conflicts of the additive vertex patch and the multiplicative cell based smoother. A coloring for this situation is shown in Figure 1(b), where we combine the parqueting for the vertex patch with red-black coloring.
These simple coloring algorithms based on checkerboards and parqueting reach their limits on unstructured meshes. The finite element library deal.II provides graph-based coloring based on the DSATUR algorithm in [brelaz79] in a parallel version. We do not discuss its details here and refer the reader to [turcksin16]. Compared to the parqueting algorithms above, it generates more colors with smaller subsets within each color. Examples are provided below in Table 2.
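For illustration, a first-fit greedy coloring in descending degree order (a simplification of DSATUR, which additionally updates saturation degrees dynamically) can be sketched as follows; the conflict-graph interface is a hypothetical stand-in for the deal.II data structures.

```python
def greedy_coloring(conflicts):
    """Color subdomains so that conflicting ones differ in color.
    `conflicts` maps each subdomain index to the set of subdomains it must
    not share a color with (e.g. vertex patches sharing a cell).
    Returns a list of colors, each a sorted list of subdomain indices."""
    color_of = {}
    for j in sorted(conflicts, key=lambda j: -len(conflicts[j])):  # high degree first
        used = {color_of[n] for n in conflicts[j] if n in color_of}
        c = 0
        while c in used:  # smallest color not taken by a conflicting neighbor
            c += 1
        color_of[j] = c
    ncolors = max(color_of.values()) + 1
    return [sorted(j for j, c in color_of.items() if c == k) for k in range(ncolors)]
```

On a structured mesh this reproduces small color counts, but in general it produces more colors than the tailored parqueting schemes, consistent with the observation above.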
The colored versions of the Schwarz operators read
Due to the orthogonality of the subspaces within the same color, each factor of the multiplicative algorithm contains the sum of the local solvers of one color. Note that the colored additive algorithm is mathematically equivalent to (9), while the multiplicative version may differ due to the reordering of factors.
Since the additive Schwarz operator factors as , the smoothing step can be encoded as shown in Algorithm 2.
Here, we use a “for” loop for sequential operations, while “$\sum$” indicates a possibly parallel summation. Using the form in (12), the multiplicative smoother can be implemented in a very similar way, with the single change that the residual update happens inside the loop over all colors, as in Algorithm 3.
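A compact sketch of the colored smoothing step, covering both variants with a flag, might look as follows; again a dense toy version with our own naming, in which the inner loop over one color is the part that may run in parallel.

```python
import numpy as np

def colored_smoothing_step(A, colors, x, b, omega=1.0, multiplicative=False):
    """One Schwarz smoothing step. `colors` is a list of colors, each a
    list of index arrays (subdomains). Additive (Algorithm 2): the residual
    is computed once and corrections are summed over all colors.
    Multiplicative (Algorithm 3): the residual is updated after each color."""
    r = b - A @ x
    for color in colors:
        dx = np.zeros_like(x)
        for idx in color:  # local solves within one color are independent
            Aj = A[np.ix_(idx, idx)]
            dx[idx] += np.linalg.solve(Aj, r[idx])
        x = x + omega * dx
        if multiplicative:
            r = b - A @ x  # residual update per color
    return x
```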
Thus, both methods are implemented as a “short” product (in the sense of a sequence of operations) over all colors with parallel, additive smoothers for each color.
Since parallelization is only viable within each color, a small number of colors with many subdomains per color is desirable. This holds in particular for the multiplicative algorithm, where an operator application (residual update) is applied for each color. Therefore, whenever the meshes are regular, the optimal coloring by parqueting described above should be used. If the meshes are not regular, we fall back to the DSATUR algorithm mentioned above.
2.4 Efficiency of the smoothers
Here, we provide numerical evidence that the smoothers discussed above yield very efficient multigrid methods in terms of iteration counts. Summarizing our findings, we show that iteration counts are independent of the mesh size of the finest level. Furthermore, our results indicate a slow deterioration of convergence steps for the nonoverlapping smoothers with increasing polynomial degree. The overlapping vertex patch smoother even seems to profit from higher degrees.
We present results on Cartesian meshes in two and three dimensions and refer to Section 4 for the more general case. The coarse mesh is the decomposition of the square or cube into $2^d$ congruent cells; consequently, it consists of a single vertex patch. Each subsequent level is obtained by the refinement algorithm outlined in Section 2.1. We use the V-cycle with a single pre- and post-smoothing step as preconditioner in the conjugate gradient solver (CG) and in the generalized minimal residual method (GMRES) for the additive and multiplicative versions, respectively. The stopping criterion of the Krylov subspace methods is a fixed relative reduction of the residual. On the coarse mesh, we solve to a fixed relative accuracy using a Chebyshev solver (see [varga09]) with the additive cell based smoother as preconditioner.
Due to their efficiency, multiplicative vertex patch smoothers require three or fewer iterations, such that we consider fractional iteration counts for a more accurate assessment of their performance. If the prescribed reduction is achieved after $n$ iterations, we compute a fractional count from the reduction of the energy norm of the error for CG and of the Euclidean norm of the residual for GMRES, respectively. The right-hand side of our model problem (1) is manufactured such that the exact solution is a superposition of “normalized” multivariate Gaussian bell curves with fixed widths and source points. In two dimensions, the source points are projected onto the $xy$-plane.
Our results are summarized in Tables 1 and 2, where for each polynomial degree the convergence steps for discretizations on a hierarchy of mesh levels are shown, both in two and three dimensions. First, we observe that all step counts are independent of the mesh level. Thus, we confirm uniform convergence with respect to the mesh size. The additive cell based smoother (ACS) requires a relaxation parameter below one. Table 1 shows a slight growth of the number of iteration steps with the polynomial degree. It takes about twice as many steps as the multiplicative version (MCS). Given that MCS with red-black coloring in (12) needs two applications of the operator in each step, the two cell based smoothers compare at similar levels.
Next, we consider the vertex patch smoother. It is well known that the relaxation parameter for additive methods with overlap has to be chosen fairly small, which slows down convergence considerably. Therefore, we only consider the multiplicative version (MVS) here. We use the regular coloring in Figure 1(b) to process as many patches in parallel as possible. We also compare to the graph-based coloring from [turcksin16]. As pointed out in Section 2.3, the reordering of local solvers in the multiplicative method, in Table 2 attributed to different coloring algorithms, affects the smoothing. Both coloring schemes yield iteration counts close to two, with a slight advantage for the algorithm with fewer colors. This is almost a direct solver. Thus, we conclude that in particular the multiplicative vertex patch smoother (MVS) is a mathematically very well suited algorithm if we manage to implement it efficiently. However, the computational effort of one smoothing step is quite high compared to other smoothers, such that we must compare the total effort to decide on the optimal version.
3 Tensor Product Elements
In recent years, the structure of tensor product polynomials on quadrilateral and hexahedral cells and their evaluation and integration by sum factorization has been exploited in the development of highly efficient codes for modern hardware, see for instance [KronbichlerKormann12, MuethingPiatkowskiBastian17, vos10].
Let $\mathbb{P}_k$ be the space of polynomials in one variable of degree up to $k$. We use the Lagrange basis in Gauss-Lobatto support points for stability. Then, we define the tensor product polynomial space
with its basis
where $i_m \in \{0,\dots,k\}$ for each $m$. Here, we have adopted a multi-index notation, such that each basis function is characterized by $d$ indices. These can be unrolled into a linear index, for instance by the lexicographic mapping $i = \sum_{m=1}^{d} i_m\,(k+1)^{m-1}$.
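As an aside, the lexicographic unrolling and its inverse are one-liners; the convention below (first index running fastest) is one possible choice.

```python
def lexicographic_index(multi_index, k):
    """Unroll (i_1, ..., i_d), each entry in {0, ..., k}, into the
    linear index i = sum_m i_m * (k+1)^(m-1)."""
    i = 0
    for m, im in enumerate(multi_index):
        i += im * (k + 1) ** m
    return i

def multi_index(i, k, d):
    """Invert the lexicographic unrolling."""
    idx = []
    for _ in range(d):
        idx.append(i % (k + 1))
        i //= k + 1
    return tuple(idx)
```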
This defines a polynomial shape function space on the reference cell. The polynomial shape function spaces and their bases on a mesh cell are obtained by composition with the cell mapping. Similarly, we define a $d$-dimensional quadrature rule on the cell as the $d$-fold tensor product of a one-dimensional quadrature rule on the unit interval; its abscissas and weights are the corresponding tensor products of the one-dimensional abscissas and weights.
3.1 Operator application by sum factorization
A matrix-free finite element implementation with sum factorization is easiest discussed using the mass matrix as example. Instead of assembling the mass matrix on the mesh on level , integration and application to a vector in coefficient space are folded into one operation. Let denote the transfer from global degrees of freedom on to local degrees of freedom on the cell . Then,
Here, denotes the coefficient vector of the finite element function with respect to the chosen basis. The restriction of the finite element function to the cell is determined by real-valued coefficients , reshaped as order- tensor , where here and below we suppress the cell index .
We assume the number of univariate quadrature points to be of the same order as the polynomial degree $k$. Understanding the local finite element interpolation as a pullback by the cell mapping, the evaluation of the standard form (20) in all quadrature points requires $\mathcal{O}(k^{2d})$ arithmetic operations.
Exploiting the tensor product form of basis functions and quadrature formula, we can reduce the complexity to $\mathcal{O}(k^{d+1})$ by means of sum factorization. This technique was first introduced in the spectral element community in [orszag80] and later extended to DG methods, see for instance [vos10]. By factorizing common indices in (20) along each dimension we obtain one-dimensional interpolations
The order-$d$ result tensor is obtained by computing the sum factors successively from right to left. In other words, each sum factor results in an intermediate tensor, with degree of freedom and quadrature indices mixed, obtained by the contraction of the previous order-$d$ tensor with the matrix of one-dimensional shape function values. A change of variables with the cell mapping and application of the quadrature formula yields
where we use that is chosen with positive determinant. Switching to the multi-indices of , we introduce the order- mapping tensor with components
with $\odot$ being the entrywise product (also known as the Hadamard product). In total, the local matrix-vector multiplication is performed at the cost of $\mathcal{O}(k^{d+1})$ arithmetic operations. In this context, we refer to the resulting operation as a matrix-free finite element operator.
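The following sketch demonstrates the technique for the local mass matrix on a Cartesian cell with equal one-dimensional data in all directions; the names are ours, and the constant Jacobian determinant is an assumption matching the Cartesian case.

```python
import numpy as np

def sumfac_apply(T, B):
    """Contract each of the d tensor directions of T with the matrix B,
    i.e. apply the Kronecker product B (x) ... (x) B by d one-dimensional
    contractions instead of forming the large Kronecker matrix.
    Cycling over the last axis preserves the axis order after d steps."""
    for _ in range(T.ndim):
        T = np.tensordot(B, T, axes=(1, T.ndim - 1))
    return T

def apply_mass(u, B, w, detJ=1.0):
    """Matrix-free local mass matrix: B holds the 1D shape function values
    at the 1D quadrature points, w the 1D quadrature weights, detJ the
    constant Jacobian determinant of the Cartesian cell mapping."""
    vals = sumfac_apply(u, B)   # interpolate coefficients to quadrature points
    W = np.asarray(detJ)
    for _ in range(u.ndim):     # separable weight tensor w (x) ... (x) w
        W = np.multiply.outer(W, w)
    return sumfac_apply(W * vals, B.T)  # contract with test functions
```

The result agrees with applying the Kronecker product of the one-dimensional mass matrices to the unrolled coefficient vector, which makes a useful unit test.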
A similar expression can be derived for the Laplacian or a general second order elliptic operator on arbitrary quadrilaterals and hexahedra. The formula becomes more complicated then since it involves a matrix-valued mapping tensor and a vector-valued interpolation tensor due to the gradients of ansatz and test functions.
3.2 Fast diagonalization
In Section 2.4, we have presented robust Schwarz smoothers with low iteration counts. However, the naive computation of local inverses requires $\mathcal{O}(k^{3d})$ arithmetic operations, while matrix-free operator application costs only $\mathcal{O}(k^{d+1})$. Therefore, explicit inversion should be avoided, as should multiplication with a stored inverse at $\mathcal{O}(k^{2d})$ operations.
The fast diagonalization method introduced in [lynch64] is an efficient inversion algorithm for matrices with a rank-$d$ Kronecker decomposition of the form
Assuming the matrices are symmetric, positive definite and the matrices are symmetric, the generalized eigenvalue problems
are well-defined. Here
is the identity matrix of appropriate size. Using the mixed-product property of the Kronecker product we see that the Kronecker product of the generalized eigenvectors, namely, are the eigenvectors of
such that the inverse of is
The rank-$d$ Kronecker decomposition plays a key role in obtaining a fast inversion algorithm. First, the assembly and inversion of the local matrix boil down to the assembly and subsequent generalized eigendecomposition of one-dimensional problems, respectively. Second, we only store the one-dimensional eigenvector matrices and eigenvalues. Then, the matrix-vector multiplication with the inverse profits from its Kronecker decomposition in terms of sum factorization
where the order-$d$ tensor is the multi-index reshaping of the coefficient vector.
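For the two-dimensional case, a compact sketch of fast diagonalization reads as follows; the helper for the generalized eigenproblem uses a Cholesky factorization, and all names are ours.

```python
import numpy as np

def generalized_eigh(A, M):
    """Solve A z = lam M z for symmetric A and s.p.d. M, returning
    eigenvalues and M-orthonormal eigenvectors (Z^T M Z = I)."""
    L = np.linalg.cholesky(M)
    Linv = np.linalg.inv(L)
    lam, Y = np.linalg.eigh(Linv @ A @ Linv.T)  # equivalent standard problem
    return lam, Linv.T @ Y

def fast_diagonalization_solver(A1, M1, A0, M0):
    """Return a solver for (A1 (x) M0 + M1 (x) A0) x = b, following
    [lynch64]: two univariate eigendecompositions in the setup, then only
    one-dimensional transforms and a pointwise scaling per solve."""
    lam0, Z0 = generalized_eigh(A0, M0)
    lam1, Z1 = generalized_eigh(A1, M1)
    eigs = lam1[:, None] + lam0[None, :]  # eigenvalues of the 2D operator

    def solve(b):
        X = Z1.T @ b.reshape(len(lam1), len(lam0)) @ Z0  # to the eigenbasis
        return (Z1 @ (X / eigs) @ Z0.T).ravel()          # scale and map back
    return solve
```

In three dimensions, the matrix products become sum-factorized tensor contractions and the scaling tensor gains a third eigenvalue direction, but the structure is identical.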
The Laplacian is a separable differential operator on rectangles and bricks, but it remains to argue that the discontinuous Galerkin formulation on a cell or a vertex patch is as well. Indeed, this is true on Cartesian meshes only. Let the cell be of the form with intervals of length .
Then, the bulk integral in (5) on cell is the sum of products (30) alternating with the dimension. Each product is factorized as a one-dimensional bulk integral and remaining -inner products of one-dimensional shape functions.
In general the shape function gradients
are determined by means of the chain rule. The Jacobian of the Cartesian mapping is the constant, diagonal matrix such that the univariate mass and interior stiffness matrices and are
respectively, where the quadrature rule is defined on the reference interval . The two faces of associated to dimension are as well a Cartesian product
such that the face normals are constant and aligned with coordinate direction. We omit the subscript of the face when it is clear from the context. We obtain a similar splitting for the consistency, adjoint consistency and penalty integrals in (5). The univariate consistency and point mass matrices and for the faces orthogonal to coordinate direction are obtained by
where on any face at the physical boundary and otherwise. The Nitsche contributions are summed up to obtain
Hence, the local solvers on cells admit a Kronecker decomposition of the form (25), built from the univariate mass matrices and the summed univariate SIPG matrices.
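Written out in three dimensions, the decomposition takes the following shape; the symbols $M_i$ for the univariate mass matrices and $L_i$ for the univariate SIPG matrices (interior stiffness plus summed Nitsche contributions) in direction $i$ are our shorthand, not fixed notation:

```latex
A_{\mathrm{cell}}
  = L_3 \otimes M_2 \otimes M_1
  + M_3 \otimes L_2 \otimes M_1
  + M_3 \otimes M_2 \otimes L_1
```

Fast diagonalization then applies verbatim, with the $L_i$ and $M_i$ entering the univariate generalized eigenproblems.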
Cartesian vertex patches are determined by the Cartesian product of intervals . Each interval is defined by the disjoint union with subintervals of length and . Besides the interior contributions (35) on both subintervals, denoted as and , the contributions from the interface between and have to be considered. The univariate consistency and point mass matrices and , respectively, are given by
and the Nitsche terms at the interface are summed up as
with the corresponding subinterval contributions substituted. Therefore, the local solvers on vertex patches admit a Kronecker decomposition of the form (25) as well, with the univariate matrices assembled from the contributions of both subintervals and the interface terms.
3.3 Computational effort
We compare the computational effort of the fast tensor product smoothers with the other components of the multigrid scheme. The experimental setup is the same as in Section 2.4. Consequently, the operation counts presented here are consistent with the iteration counts there. We compute on a three-dimensional mesh with cells obtained from a single vertex patch (coarse grid) by three consecutive global refinements. We use implementations based on deal.II [dealII91] and in particular its MatrixFree framework.
First, we confirm the asymptotic complexity of the fast tensor product smoothers with respect to polynomial degree. The number of floating point operations is determined by means of the performance monitoring tool likwid-perfctr [treibig10]. In Table 3, we report the factors
which are obtained from normalizing FLOP counts by the number of subdomains (cells or vertex patches) and the expected complexity in the polynomial degree.
In Table 3, we confirm the asymptotic behavior (last column) of the additive Schwarz smoothers in Algorithm 2 over a wide range of polynomial degrees. The arithmetic effort for the smoother consists of two parts, namely the one-time setup cost and the smoothing operation in each step. The setup consists mainly of one-dimensional eigenvalue problems, including integration and solving (LAPACK routine DSYGV), with $\mathcal{O}(k^3)$ operations each. The integration cost decreases for higher order polynomials since the contribution of lower order face terms becomes less prominent. The effort of one additive smoothing step is determined by the cost of applying all local solvers and updating the residual, that is, a single operator application. The local solvers scale strictly with $\mathcal{O}(k^{d+1})$, while the normalized numerical effort for residuals decreases with increasing polynomial degree as the face integrals of the DG discretization lose weight. Thus, compared to an operator application, the additive smoothing step becomes cheaper with increasing degree. Considering now the vertex patch, the degrees of freedom in each dimension are doubled. Therefore, applying the local solvers of AVS is computationally $2^{d+1}$ times more expensive than in ACS. Both the matrix-free operator application and the smoothing step crucially benefit from the tensor structure.
In Table 4, we compare the number of arithmetic operations for the constituents of the additive smoothing and, in particular, we compare to the non-tensorized cell based smoother (sACS). Reading columns two to five, the benefits of the fast tensor product smoothers are exposed. The number of operations to set up the smoother sACS is 3000 times higher than a single operator application for tricubic shape functions, and even 95000 times higher for the highest degree shown. Clearly, non-tensorized Schwarz smoothers are infeasible since they obliterate the advantages of matrix-free methods. As columns four and five show, the setup cost of the fast tensor product cell based smoother (ACS) is already less than a single matrix application and therefore almost negligible. We see that one smoothing step needs only a small fraction of the operations of a matrix application for tricubic shape functions, and this fraction decreases further with increasing polynomial degree.