Hybrid multigrid methods for high-order discontinuous Galerkin discretizations

10/04/2019 ∙ by Niklas Fehn, et al. ∙ Technische Universität München

The present work develops hybrid multigrid methods for high-order discontinuous Galerkin discretizations of elliptic problems. Fast matrix-free operator evaluation on tensor-product elements is used to devise a computationally efficient PDE solver. The multigrid hierarchy exploits all possibilities of geometric, polynomial, and algebraic coarsening, targeting engineering applications on complex geometries. Additionally, a transfer from discontinuous to continuous function spaces is performed within the multigrid hierarchy. This not only further reduces the size of the coarse-grid problem, but also leads to a discretization most suitable for state-of-the-art algebraic multigrid methods applied as a coarse-grid solver. The relevant design choices regarding the selection of optimal multigrid coarsening strategies among the various possibilities are discussed, with computational costs as the driving metric for algorithmic selections. We find that a transfer to a continuous function space at the highest polynomial degree (or on the finest mesh), followed by polynomial and geometric coarsening, shows the best overall performance. The success of this particular multigrid strategy is due to a significant reduction in iteration counts as compared to a transfer from discontinuous to continuous function spaces at the lowest polynomial degree (or on the coarsest mesh). The coarsening strategy with transfer to a continuous function space on the finest level leads to a multigrid algorithm that is robust with respect to the penalty parameter of the SIPG method. Detailed numerical investigations are conducted for a series of examples ranging from academic test cases to more complex, practically relevant geometries. Performance comparisons to state-of-the-art methods from the literature demonstrate the versatility and computational efficiency of the proposed multigrid algorithms.




1 Motivation

Progress in computer hardware towards high Flop-to-Byte ratios renders data movement the deciding factor for efficient PDE solvers, especially for multigrid algorithms, which form the main algorithmic component of such solvers in the case of elliptic operators. A consequence of this development is a stronger coupling of the numerical linear algebra and computer science parts in solver development, since black-box matrix-based solvers that are severely memory-bound are no longer competitive. Therefore, optimal multigrid solvers cannot be designed with a view on convergence rates and iteration counts only, but need to take the implementation technique into account from the very beginning. With the goal of optimizing time-to-solution in mind, we believe that addressing these interdisciplinary challenges deserves special attention.

Figure 1: Throughput of matrix-free versus matrix-based evaluation of the scalar Laplace operator discretized on a hexahedral mesh (3D). The throughput is measured in degrees of freedom (unknowns) processed per second per CPU core for one forward application of the discretized operator (matrix-vector product). The matrix-free version shows the measured performance for an interior penalty DG discretization with a nodal Gauss–Lobatto basis, where the experiments have been conducted on an Intel Skylake architecture (see Table 3 for details) for both a Cartesian mesh and a curvilinear mesh. The throughput measurements are done on one fully loaded compute node with the problem size being large enough to saturate caches. For the matrix-based implementation (considering only the matrix-vector product and neglecting assembly costs), theoretical lower and upper bounds are shown for the same Intel Skylake architecture, both assuming an optimal implementation. Depending on the chosen DG basis and the stencil width (sparsity of the block matrices of neighboring elements), the matrix-based version will perform closer to the upper bound (only the block-diagonal taken into account; block matrices of neighboring elements are assumed sparse and are neglected) or the lower bound (block matrices of neighboring elements dense). The implementation of the matrix-based variant is assumed optimal, i.e., the code is assumed to run with full memory throughput at the STREAM bandwidth of the hardware, where only the nonzeros of the matrix as well as the input and output vectors have to be read from/written to memory. Furthermore, optimal spatial and temporal data locality is assumed, i.e., perfect utilization of cache lines and caches.

1.1 Matrix-free implementations and recent trends in computer hardware

It is well known from the spectral element community that computationally efficient implementations of high-order finite element discretizations on tensor-product elements rely on matrix-free algorithms using the sum-factorization technique, see for example Orszag1980 ; Kopriva2009 ; Deville2002 ; Karniadakis2013 . In the emerging field of high-order discontinuous Galerkin discretizations, recent implementations targeting current cache-based multicore architectures with high Flop-to-Byte ratios due to wide SIMD (single-instruction-multiple-data) units have been developed in Kronbichler2012 ; Kronbichler2019fast ; Muething2017arxiv . These matrix-free evaluation routines using sum-factorization have a complexity of O(k^{d+1}) operations and O(k^d) data transfer from memory per element for polynomial degree k in d space dimensions. Traditional implementation strategies assemble sparse matrices in the finite element discretization stage of the numerical algorithm and hand them over to the linear algebra part of the solver, which can then be applied in a black-box fashion to the discrete problem. However, due to the increased complexity in terms of arithmetic operations and data transfer from main memory, O(k^{2d}) per element for the matrix-vector product (and worse for the assembly part), it is clear that these sparse matrix methods cannot be competitive for high polynomial degrees. Initial studies Vos2010 ; Cantwell2011 identified a break-even point in the polynomial degree beyond which a matrix-free approach with sum-factorization is faster. However, on modern computer hardware characterized by Flop-to-Byte ratios significantly larger than one, matrix-free algorithms with sum-factorization on hexahedral elements outperform sparse matrices already at the lowest polynomial degrees, see Kronbichler2012 ; Kronbichler2019 .
This is also shown in Figure 1, which compares the throughput measured in unknowns processed per second for the evaluation of the scalar Laplace operator in three space dimensions, suggesting a break-even point already at the lowest polynomial degrees. Due to the paradigm shift in computer hardware, matrix-free algorithms nowadays tend to become memory-bound on recent hardware Kronbichler2019fast if implemented with a minimum of floating point operations and if vectorization is used. As a consequence of this trend, solution techniques such as the hybridizable discontinuous Galerkin (HDG) method, see for example Kirby2012 ; Yakovlev2016 , which reduces the global problem to the degrees of freedom on the mesh skeleton by elimination of inner degrees of freedom, also cannot keep up with fast matrix-free operator evaluation for quadrilateral/hexahedral elements on current computer hardware, as investigated in detail in the recent article Kronbichler2018 . In terms of Figure 1, the gap would open at a slower pace for the HDG case, but would still be more than one order of magnitude.
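The cost advantage of sum-factorization can be illustrated with a small, self-contained sketch (our own illustration, not the implementation of the cited libraries): applying a 3D tensor-product operator S ⊗ S ⊗ S as three one-dimensional contractions costs O(k^{d+1}) operations per element instead of the O(k^{2d}) of a dense element matrix.

```python
import numpy as np

def apply_kron3(S, u):
    """Apply the 3D tensor-product operator (S x S x S) to element values u
    via sum-factorization: three 1D contractions instead of one dense
    (k+1)^3-by-(k+1)^3 matrix-vector product."""
    n = S.shape[1]                        # 1D points per direction
    v = u.reshape(n, n, n)
    v = np.einsum('ai,ijk->ajk', S, v)    # contract x-direction
    v = np.einsum('bj,ajk->abk', S, v)    # contract y-direction
    v = np.einsum('ck,abk->abc', S, v)    # contract z-direction
    return v.ravel()

# verify against the naive Kronecker-product matrix
rng = np.random.default_rng(0)
S = rng.standard_normal((4, 4))
u = rng.standard_normal(4 ** 3)
K = np.kron(np.kron(S, S), S)
assert np.allclose(apply_kron3(S, u), K @ u)
```

For degree k in d dimensions, the dense product touches (k+1)^{2d} entries, whereas the factorized form performs d contractions of (k+1)^{d+1} multiply-adds each.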

1.2 Multigrid for high-order discretizations: State-of-the-art

When it comes to the solution of (non-)linear systems of equations by means of iterative solution techniques, the evaluation of nonlinear residuals as well as the application of linear(ized) operators in Krylov solvers are readily available in a matrix-free implementation environment Brown2010 . More importantly, however, also preconditioners should rely on matrix-free algorithms with optimal complexity, since the whole solver will otherwise run into similar bottlenecks with overwhelming memory transfer. Optimal complexity of all solver components is crucial in order to render high-order discretization methods more efficient in under-resolved application scenarios Fehn2018b . For elliptic PDEs, multigrid methods Trottenberg2001 are among the most efficient solution techniques Gholami2016 , especially because of their applicability to problems on complex geometries. In the context of high-order finite element discretizations, multigrid methods can be categorized into -multigrid, -multigrid, and algebraic multigrid (AMG) techniques.

Geometric or -multigrid methods rely on a hierarchy of meshes. Robust -multigrid techniques for high-order DG discretizations have been developed and analyzed in Gopalakrishnan2003 ; Hemker2003 ; Brenner2005 ; Brenner2009 for uniformly refined meshes and in Kanschat2004 ; Kanschat2008 for adaptively refined meshes. Recent improvements of these algorithms towards high-performance, matrix-free implementations have been developed in Kronbichler2018 , where a performance comparison of high-order continuous and discontinuous Galerkin discretizations as well as hybridizable discontinuous Galerkin methods is carried out. A GPU variant for continuous finite elements has been proposed in Kronbichler2019 . The parallel efficiency for adaptively refined meshes is discussed in Clevenger2019 . Large-scale applications of these -multigrid methods can be found in the field of computational fluid dynamics (CFD) and the incompressible Navier–Stokes equations Fehn2018b ; Krank2017 .

For spectral element discretizations, -multigrid, or synonymously spectral element multigrid, is frequently used, where the multigrid hierarchy is obtained by reducing the polynomial degree of the shape functions. Coarsening and multigrid transfer is particularly simple since the function spaces of different multigrid levels are nested also for element geometries deformed via high-order mappings and all operations are element-local. This approach has first been proposed and theoretically analyzed in Ronquist1987 ; Maday1988 , and later investigated, for example, in Helenbrook2003 ; Helenbrook2008 ; Mascarenhas2010 ; Lottes2005 ; Stiller2017 ; Stiller2016 ; Stiller2017b ; Huismann2019 . A related hierarchical scale separation solver is proposed in Aizinger2015 . Polynomial multigrid techniques are also frequently used to solve the compressible Euler equations Rasetarinera2001 ; Bassi2003 ; Fidkowski2004 ; Nastase2006JCP ; Luo2006 ; Hillewaert2006 ; Mascarenhas2009 ; Bassi2009 ; Helenbrook2016 and compressible Navier–Stokes equations Fidkowski2005 ; Persson2008 ; Shahbazi2009 ; Diosady2009 ; Bassi2011 ; Luo2012 ; Ghidoni2014 .

Algebraic multigrid techniques extract all information from the assembled system matrix to generate coarser levels and are attractive as they can be applied in a black-box fashion. AMG is considered in Heys2005 for high-order continuous Galerkin discretizations and in Lasser2001 ; Prill2009 ; Olson2011 for (high-order) discontinuous Galerkin discretizations. AMG applied directly to high-order DG discretizations faces several challenges, among them the construction of robust smoothers for matrices that lose diagonal dominance, but most importantly the computational complexity of matrix-based algorithms (especially for three-dimensional problems) compared to their matrix-free counterparts, see Figure 1. These limitations are also exemplified by the fact that AMG methods for high-order discretizations have only been applied to small two-dimensional problems in the works mentioned above. For reasons of computational complexity (see Section 1.1), it appears to be inevitable to combine algebraic multigrid techniques with some form of geometric coarsening to achieve a computationally efficient approach for practical applications Prill2009 ; Bastian2012 ; Siefert2014 .

Multigrid transfer operators are typically negligible in terms of computational costs when implemented in a matrix-free way with optimal complexity Kronbichler2019 ; Bastian2019 . Therefore, multigrid smoothers and coarse-grid solvers remain as the main performance-relevant multigrid components. Adhering to the matrix-free paradigm poses two major challenges that need to be addressed and further improved:

Figure 2: Illustration of combined geometric–polynomial–algebraic multigrid algorithms for nodal high-order discontinuous Galerkin discretizations.
  • Matrix-free smoothers: To fully exploit the advantages of matrix-free algorithms with sum-factorization for operator evaluation, multigrid smoothers should make use of these algorithms as well, but this problem has so far only been solved for certain types of smoothers (and mainly for elliptic problems). Polynomial smoothers based on the Chebyshev iteration Adams2003 ; Sundar2015 or explicit Runge–Kutta methods Helenbrook2003 ; Luo2006 ; Hillewaert2006 ; Luo2012 ; RuedaRamirez2019 can be realized immediately in a matrix-free way, are inherently parallel, and are therefore widely used in a high-order context. More challenging are smoothers based on elementwise inversion of operators such as block Jacobi, block Gauss–Seidel, block ILU, or (overlapping) Schwarz smoothers. Many implementations rely on matrix-based algorithms for smoothers Rasetarinera2001 ; Fidkowski2004 ; Fidkowski2005 ; Nastase2006JCP ; Bassi2009 ; Persson2008 ; Shahbazi2009 ; Diosady2009 ; Mascarenhas2009 ; Bassi2011 ; Ghidoni2014 ; Helenbrook2016 , limiting applicability mainly to two-dimensional problems, while three-dimensional problems become prohibitively expensive for higher polynomial degrees due to the complexity of O(k^{2d}) per element for the assembly, O(k^{3d}) for factorizations, and O(k^{2d}) for matrix-vector products. On Cartesian meshes and tensor-product elements, elementwise inversion of operators with optimal complexity is possible via the fast diagonalization method Lynch1964 , which has first been applied in Couzy1994 ; Couzy1995 in the context of spectral element discretizations and analyzed in more detail in Fischer2000 ; Lottes2005 ; Fischer2005 ; Stiller2017 ; Stiller2016 ; Stiller2017b in the context of overlapping Schwarz preconditioners and multigrid smoothers.
Other tensor-product based preconditioners for high-order DG discretizations that exploit fast matrix-free operator evaluation with sum-factorization and that are applicable to more complex operators (convection–diffusion, Navier–Stokes) and non-Cartesian meshes have been proposed recently in Bastian2019 ; Pazner2018 , suggesting that the complexity can be reduced to O(k^{d+1}) in a general setting.

  • Hybrid multigrid algorithms: Due to the improved efficiency of matrix-free evaluation routines and better parallel scalability Sundar2012 as compared to sparse matrix-vector products, the generation of coarser multigrid levels should rely on non-algebraic coarsening, i.e., mesh (or geometric) coarsening and reducing the polynomial degree of the function space for higher-order methods, leading to the idea of hybrid h- and p-multigrid methods as illustrated in Figure 2. It is beneficial to stay as long as possible on the matrix-free curve in Figure 1 and to exploit all possibilities of geometric and polynomial coarsening in the multigrid hierarchy. For complex engineering applications, the number of geometric mesh levels is low and coarse meshes might consist of millions of elements, so that simple iterative solution techniques like a conjugate gradient iteration with Jacobi preconditioner used as coarse-grid solver would become too expensive and dominate the overall computational costs of the multigrid solver. Applying algebraic multigrid techniques to the coarse-grid problem discretized with linear finite elements is a good compromise between the effectiveness of coarse-grid preconditioning and computational efficiency, since in this regime sparse matrix-vector products are competitive with matrix-free evaluation routines, see Figure 1. At the same time, smoothers for algebraic multigrid methods work best at low polynomial degrees due to the better diagonal dominance of the matrix as opposed to high-order shape functions Kronbichler2018 . In Table 1, we summarize important contributions in the field of hybrid multigrid solvers. In Helenbrook2003 ; Helenbrook2001 , hybrid multigrid solvers combining p-MG with h-MG have been presented for high-order discretizations. In Nastase2006JCP ; Shahbazi2009 , p-multigrid is used along with algebraic multigrid for the coarse problem.
In terms of p-multigrid, the works Dobrev2006 ; Bastian2012 ; Siefert2014 can be categorized as two-level algorithms with a transfer from the high-order fine level directly to continuous linear shape functions on the coarse level, which is solved by an h-multigrid approach in Dobrev2006 and an algebraic multigrid approach in Bastian2012 ; Siefert2014 . These works have the limitation that the underlying implementation is not matrix-free and, therefore, suffers from the complexity of matrix-based approaches. The main limitation of the approach in Sundar2012 for hexahedral meshes based on the octree concept is that it only supports linear continuous finite elements (a similar approach for tetrahedral elements is presented in Lu2014 ). An extension towards higher-order elements has been made in Rudi2015 , but results are limited to linear and quadratic shape functions. The approach in OMalley2017 combines p-multigrid with AMG but uses expensive matrix-based implementations, which could explain why results are only shown for quadratic elements. In more recent works, hybrid multigrid algorithms for high-order methods with a completely matrix-free implementation are discussed in Bastian2019 , extending a previous work Bastian2012 towards a matrix-free implementation developed in Muething2017arxiv ; Kempf2018arxiv and available in the EXADUNE project Exadune2014 . That work does not exploit geometric coarsening (h-multigrid), and the high-order discretization with degree k is immediately reduced to a linear space within the multigrid algorithm (and is therefore categorized as a two-level algorithm rather than p-multigrid). Algebraic multigrid is employed for the coarse problem. An elaborate matrix-free implementation in the context of h-multigrid solvers is presented in Kronbichler2018 , based on the matrix-free implementation developed in Kronbichler2012 ; Kronbichler2019fast and available in the deal.II finite element library dealII90 . That work clearly improves on the performance numbers of the sophisticated geometric multigrid solvers reported in Gholami2016 . One drawback of this pure h-multigrid approach is that its applicability is limited to problems where the coarse grid is comparably simple. Hybrid multigrid techniques in the context of HDG discretizations are considered, e.g., in Fabien2019 .
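The coarsening possibilities discussed above can be summarized in a small sketch enumerating a hybrid multigrid level sequence. The helper `hybrid_levels` and the degree-halving p-coarsening rule are our own illustrative assumptions, not an API of the works cited here.

```python
def hybrid_levels(k_fine, h_levels, strategy="cph"):
    """Sketch of a hybrid multigrid level sequence (hypothetical helper).
    Each level is a tuple (space, degree, mesh level):
      'c' = DG-to-FE transfer on the current level,
      'p' = repeatedly halve the polynomial degree down to 1,
      'h' = repeatedly coarsen the mesh down to level 0,
    applied in the order given by `strategy` (e.g. 'cph')."""
    levels = [("DG", k_fine, h_levels - 1)]
    space, k, h = levels[0]
    for step in strategy:
        if step == "c":
            levels.append(("FE", k, h))
            space = "FE"
        elif step == "p":
            while k > 1:
                k = max(1, k // 2)
                levels.append((space, k, h))
        elif step == "h":
            while h > 0:
                h -= 1
                levels.append((space, k, h))
    return levels

# 'cph': DG-to-FE transfer at high order, then p-coarsening, then h-coarsening
print(hybrid_levels(4, 3))
```

For `hybrid_levels(4, 3)` the sequence starts at `("DG", 4, 2)`, transfers to `("FE", 4, 2)`, reduces the degree via `("FE", 2, 2)` to `("FE", 1, 2)`, and finally coarsens the mesh to `("FE", 1, 0)`, where an AMG coarse solver would be invoked.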

Study | high-order | matrix-free | h-MG | p-MG | AMG
Helenbrook et al. Helenbrook2003
Nastase et al. Nastase2006JCP , Shahbazi et al. Shahbazi2009
Dobrev et al. Dobrev2006 ()
Bastian et al. Bastian2012 , Siefert et al. Siefert2014 ()
Sundar et al. Sundar2012
Lu et al. Lu2014
Rudi et al. Rudi2015
O’Malley et al. OMalley2017
Bastian et al. Bastian2019 ()
Kronbichler and Wall Kronbichler2018
present work
Table 1: Hybrid multigrid algorithms: relevant publications from the literature addressing combined h-, p-, and algebraic multigrid methods are categorized in terms of high-order discretizations, matrix-free implementations, h-MG, p-MG, and AMG. Legend: fulfilled, () partly fulfilled, not fulfilled. The category p-MG is partly fulfilled, (), if a two-level algorithm is considered with transfer from the high-order space of degree k directly to the linear space with k = 1.

There exist other techniques as well that aim to overcome the complexity of matrix-based methods for high polynomial degrees. Applying preconditioners and multigrid methods to a low-order re-discretization of the operator on a mesh with vertices located at the nodes of the high-order discretization is a well-known technique originating from Deville1985 ; Deville1990 , and has been analyzed, for example, in Fischer1997 ; Lottes2005 ; Heys2005 ; Brown2010 ; Sundar2015 ; Pazner2019arxiv . Such an approach is not considered here.

1.3 Contributions of the present work

The present work extends our previous work in Kronbichler2018 towards hybrid multigrid techniques combining geometric (), polynomial (), and algebraic coarsening. Our goal is to devise a multigrid solver applicable to engineering problems with complex geometry characterized by coarse grids with many elements. As can be seen from Table 1, the individual components relevant for efficient hybrid multigrid methods are covered by different works. However, none of these works fulfills all properties and it is the aim of the present work to fill this gap.

As a model problem, the constant-coefficient Poisson equation in three space dimensions is studied in this work. With respect to the choice of multigrid smoothers, this study makes use of Chebyshev-accelerated Jacobi smoothers, whose convergence rates are independent of the mesh size h and only mildly dependent on the polynomial degree k, see Kronbichler2018 ; Sundar2015 . Chebyshev smoothing is particularly attractive since it only requires the application of the matrix-vector product and the inverse diagonal of the system matrix, i.e., the smoother benefits from fast matrix-free evaluation routines and is efficient in a parallel setting. Although more aggressive smoothers based on overlapping Schwarz methods exist that result in lower iteration counts, it should be noted that Chebyshev smoothing is nonetheless highly efficient, and comparative studies would need to be carried out to answer which approach is more efficient, see the initial investigation in Kronbichler2019arxiv . These aspects are, however, beyond the scope of the present study.

In case of discontinuous Galerkin discretizations, a transfer from discontinuous to continuous function spaces (denoted as DG-to-FE transfer) should be considered in addition to h- and p-coarsening in order to further reduce the size of the coarse-grid problem. For example, the problem size is reduced by a factor of 2^d = 8 in three dimensions for linear elements with k = 1. Moreover, this approach is also beneficial in order to reduce iteration counts for the coarse-grid problem, due to the fact that existing AMG implementations and smoothers are often most effective on continuous function spaces. However, it is unclear whether to perform the DG-to-FE transfer on the high-order polynomial space of degree k or for the coarse problem at k = 1, or likewise on the finest mesh or on the coarsest mesh. It is a main finding of the present work that a DG-to-FE transfer at the fine level is beneficial, both in terms of iteration counts and overall computational costs. Furthermore, we demonstrate that this approach results in a multigrid algorithm whose convergence rates are independent of the interior penalty factor. This leads to multigrid coarsening strategies denoted as ch- or cp-multigrid, with a transfer to continuous (c) function spaces on the finest level followed by geometric (h) and polynomial (p) coarsening before the coarse-grid solver (e.g., AMG) is invoked.
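A one-dimensional sketch of the DG-to-FE transfer may clarify the construction (our own illustration for a nodal basis; `dg_to_fe_maps` is a hypothetical helper): the prolongation P injects continuous nodal values into the duplicated DG nodes, and the restriction is its transpose, which sums the contributions of nodes shared between elements.

```python
import numpy as np

def dg_to_fe_maps(n_elem, k):
    """1D illustration of the DG-to-FE transfer for a nodal basis with
    element-boundary nodes (e.g. Gauss-Lobatto). Returns the prolongation
    P (FE -> DG, shape n_dg x n_fe); the restriction is P^T."""
    n_dg = n_elem * (k + 1)      # each element stores its own copy of all nodes
    n_fe = n_elem * k + 1        # shared nodes counted once
    P = np.zeros((n_dg, n_fe))
    for e in range(n_elem):
        for i in range(k + 1):
            P[e * (k + 1) + i, e * k + i] = 1.0
    return P

P = dg_to_fe_maps(2, 1)          # two linear elements: 4 DG dofs, 3 FE dofs
u_fe = np.array([1.0, 2.0, 3.0])
assert np.allclose(P @ u_fe, [1.0, 2.0, 2.0, 3.0])   # shared node duplicated
```

In 3D with k = 1, each vertex is shared by up to eight hexahedra, which is the source of the factor-8 reduction in problem size mentioned above.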

In summary, the present work discusses the relevant design choices in the context of hybrid multigrid algorithms, i.e., combined geometric–polynomial–algebraic multigrid techniques, with an emphasis on computational costs as the driving force for algorithmic selections. The performance of these methods is detailed using a state-of-the-art matrix-free implementation, considering a series of increasingly complex problems.

1.4 Outline

The model problem studied in this work and the discontinuous Galerkin discretization are introduced in Section 2. Section 3 discusses the hybrid multigrid algorithm including the main multigrid components such as smoothers, coarsening strategies and transfer operators, as well as the coarse-level solver. The matrix-free implementation is summarized in Section 4 which is the key to an efficient hybrid multigrid solver. Numerical results are shown in Section 5, and we conclude in Section 6 with a summary of our results and an outlook on future work.

2 High-order discontinuous Galerkin discretization of the Poisson equation

As a model problem, we consider the Poisson equation discretized by discontinuous Galerkin methods with a focus on high-order polynomial spaces. Let us briefly motivate the use of discontinuous Galerkin discretizations for the Poisson problem. While continuous finite element discretizations might be regarded as a suitable discretization scheme as well due to a reduced number of unknowns, DG discretizations can have an advantage over continuous discretizations for non-smooth problems in terms of accuracy versus computational costs due to a better approximation in proximity to a singularity Kronbichler2018 . Furthermore, DG discretizations of the Poisson equation arise naturally from certain model problems such as the incompressible Navier–Stokes equations discretized with discontinuous Galerkin methods. For problems of this type, efficient multigrid methods for Poisson problems are a key ingredient determining overall efficiency. Large-scale applications in the context of incompressible turbulent flows can be found in Krank2017 ; Fehn2018b , for earth mantle convection problems (with variable coefficients) in Rudi2015 , or for porous media flow in Bastian2014 . The constant-coefficient Poisson equation reads
−∇²u = f in Ω.
On the domain boundary Γ = ∂Ω, Dirichlet boundary conditions, u = g_D on Γ_D, and Neumann boundary conditions, ∇u · n = g_N on Γ_N, are prescribed, with Γ_D ∪ Γ_N = Γ and Γ_D ∩ Γ_N = ∅.

We consider meshes composed of hexahedral elements that may be arbitrarily deformed via a high-order polynomial mapping from the reference element to the physical element. The space of test and trial functions is given as


Here, Q_k denotes the space of polynomials of tensor degree k defined on the reference element, i.e.,


The multidimensional shape functions are formed by a tensor product of one-dimensional shape functions. Lagrange polynomials of degree k based on the Legendre–Gauss–Lobatto nodes are a common choice for a nodal basis due to their conditioning and are therefore used in the present work. As usual, volume and face integrals in the weak formulation are computed by means of Gaussian quadrature with k + 1 points per coordinate direction, ensuring exact integration on affine geometries. Note that the tensor-product structure of both the shape functions and the quadrature rule is important in order to use fast matrix-free evaluation routines exploiting sum-factorization.

The weak formulation of the Poisson problem written in primal formulation reads: Find  such that




We use the notation  and  for volume and face integrals, respectively, with inner products symbolized by . As an important representative of the class of DG discretization methods we consider the symmetric interior penalty method Arnold1982 ; Arnold2002 with numerical fluxes


where {{·}} denotes the average operator and [[·]] the jump operator, evaluated from the two neighboring elements e⁻ and e⁺. Boundary conditions are imposed weakly via the mirror principle Hesthaven2007 , setting


Inserting equation (7) into equation (4), the weak form can be split into homogeneous and inhomogeneous contributions. The definition of the penalty parameter according to Hillewaert2013 for quadrilateral/hexahedral elements is used


with the element volume  and the surface area . The penalty parameter  in equation (6) is given as  on interior faces , and  on boundary faces .

For the multigrid algorithm detailed below, coarser discretizations of the Laplace operator are required, which is realized by evaluating the operator for modified discretization parameters (including the interior penalty parameter), i.e., on a coarser mesh or for a lower polynomial degree. In the literature, this approach is sometimes denoted as re-discretization, as opposed to a Galerkin product. Further, a transfer from discontinuous to continuous function spaces is considered in our hybrid multigrid algorithm. The conforming finite element (FE) space is given as


and the weak form of the (negative) Laplace operator simplifies to


Dirichlet boundary conditions are imposed strongly for the FE discretization, but only the homogeneous operator is required inside the multigrid algorithm. When assembling the coarse-level matrix for the AMG coarse solver, constrained degrees of freedom are kept in the system with diagonal entries set to 1 to ensure that the matrix is invertible.

In matrix-vector notation, the discrete problem can be written as the linear system of equations


where the problem size (number of unknowns or degrees of freedom) is denoted by N. Contributions from inhomogeneous boundary conditions are included in the right-hand side vector b, and the matrix A only accounts for the homogeneous part. The matrix A is not assembled explicitly (apart from the algebraic multigrid coarse solver), since iterative solvers and multigrid smoothers only require the action of A applied to a vector, which is realized by means of fast matrix-free evaluation (Section 4). (A notation like A(x) rather than A x would be more consistent in the matrix-free context, but we adhere to the matrix-vector notation as this is the common notation in linear algebra.)
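As a minimal, self-contained illustration of this principle (a 1D finite-difference Laplacian, not the DG operator of this work), the action y = A u can be computed directly from the stencil without ever storing A:

```python
import numpy as np

def apply_laplace_1d(u, h=1.0):
    """Matrix-free action of the 1D Dirichlet Laplacian: y = A u computed
    directly from the three-point stencil, without assembling A."""
    y = 2.0 * u.copy()
    y[1:] -= u[:-1]      # subdiagonal contribution
    y[:-1] -= u[1:]      # superdiagonal contribution
    return y / h ** 2

# agrees with the assembled tridiagonal matrix
n = 8
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
u = np.arange(1.0, n + 1)
assert np.allclose(apply_laplace_1d(u), A @ u)
```

The memory footprint is that of the vectors alone, which is the property that keeps matrix-free operators on the favorable side of the throughput comparison in Figure 1.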

3 Hybrid multigrid solver

1:function SolverCG(x, b)
2:     r ← b − A x  initial residual
3:     compute ‖r₀‖
4:     z ← preconditioner applied to r  e.g., MultigridVCycle()
5:     p ← z
6:     δ ← (r, z)
7:     while ‖r‖ > ε ‖r₀‖ and n < nmax do
8:         v ← A p  matrix-vector product
9:         α ← δ / (p, v)
10:        x ← x + α p
11:        r ← r − α v
12:        update ‖r‖
13:        z ← preconditioner applied to r  e.g., MultigridVCycle()
14:        δnew ← (r, z)
15:        β ← δnew / δ, δ ← δnew
16:        p ← z + β p
17:        n ← n + 1
18:     end while
19:end function
Algorithm 1 Preconditioned conjugate gradient algorithm (solves A x = b to a given tolerance)
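Algorithm 1 can be sketched in a matrix-free style, with the operator and the preconditioner passed as callables (a generic textbook PCG formulation, not the production implementation):

```python
import numpy as np

def solver_cg(apply_A, b, precondition, tol=1e-10, max_iter=1000):
    """Preconditioned conjugate gradient in the spirit of Algorithm 1:
    the operator and the preconditioner (e.g. one multigrid V-cycle)
    enter only through their action on a vector."""
    x = np.zeros_like(b)
    r = b - apply_A(x)
    z = precondition(r)
    p = z.copy()
    delta = r @ z
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        v = apply_A(p)                  # matrix-vector product
        alpha = delta / (p @ v)
        x += alpha * p
        r -= alpha * v
        z = precondition(r)             # preconditioner application
        delta_new = r @ z
        p = z + (delta_new / delta) * p
        delta = delta_new
    return x

# small SPD test with a Jacobi preconditioner
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = solver_cg(lambda u: A @ u, b, lambda r: r / np.diag(A))
assert np.allclose(A @ x, b)
```

Replacing the Jacobi preconditioner by a V-cycle callback turns this into the Krylov-accelerated multigrid solver described in the text.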
1:function MultigridVCycle(u_l, b_l, l)
2:     if l = 0 then
3:         u_0 ← CoarseLevelSolver(b_0)  coarse-level solver, e.g., AMG
4:         return u_0
5:     else
6:         u_l ← Smooth(u_l, b_l)  pre-smoothing
7:         r_l ← b_l − A_l u_l  calculate residual
8:         r_{l−1} ← restrict r_l to level l−1  restriction
9:         e_{l−1} ← MultigridVCycle(0, r_{l−1}, l−1)  coarse-level correction
10:        u_l ← u_l + prolongation of e_{l−1}  prolongation
11:        u_l ← Smooth(u_l, b_l)  post-smoothing
12:        return u_l
13:     end if
14:end function
Algorithm 2 Multigrid V-cycle (solves A u = b approximately)
1:function ChebyshevSmoother(u, b)
2:     for k = 1, …, m do
3:         u ← Chebyshev update of u based on D⁻¹(b − A u) and the parameters θ, δ
4:     end for
5:     return u
6:end function
Algorithm 3 Chebyshev-accelerated Jacobi smoother (solves A u = b approximately)
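A runnable sketch of the Chebyshev-accelerated Jacobi smoother, based on the standard three-term Chebyshev recurrence applied to the Jacobi-preconditioned operator D⁻¹A (our own formulation following the textbook Chebyshev iteration; the signature is hypothetical):

```python
import numpy as np

def chebyshev_smoother(apply_A, diag_A, b, u, m, lam_max, lam_min_frac=0.25):
    """Chebyshev-accelerated Jacobi smoother: m steps targeting eigenvalues
    of D^{-1}A in [lam_min_frac * lam_max, lam_max]. Uses the classical
    three-term recurrence with center theta and half-width delta."""
    a, bnd = lam_min_frac * lam_max, lam_max
    theta, delta = 0.5 * (bnd + a), 0.5 * (bnd - a)
    sigma = theta / delta
    rho_old = 1.0 / sigma
    r = (b - apply_A(u)) / diag_A          # Jacobi-preconditioned residual
    d = r / theta
    for _ in range(m - 1):
        u = u + d
        r = (b - apply_A(u)) / diag_A
        rho = 1.0 / (2.0 * sigma - rho_old)
        d = rho * rho_old * d + (2.0 * rho / delta) * r
        rho_old = rho
    return u + d

# verification: with the lower bound below the smallest eigenvalue, the same
# recurrence converges as a stand-alone solver for the 1D Laplacian
n = 7
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.arange(1.0, n + 1)
u = chebyshev_smoother(lambda v: A @ v, np.full(n, 2.0), b,
                       np.zeros(n), m=60, lam_max=2.0, lam_min_frac=0.03)
assert np.linalg.norm(b - A @ u) < 1e-6 * np.linalg.norm(b)
```

As a smoother one instead chooses the lower bound as a fixed fraction of the largest eigenvalue (e.g. `lam_min_frac = 0.25`) and only a few steps m, leaving the low-frequency error to the coarse-grid correction.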

The basic multigrid idea is to tackle oscillatory errors by smoothing and to tackle smooth errors by coarse-grid corrections, which applies to all types of multigrid coarsening (geometric, polynomial, algebraic) discussed here. We use multigrid as a preconditioner inside a Krylov solver instead of using multigrid as a solver. This approach is sometimes also denoted as a Krylov-accelerated multigrid method. The combination of a multigrid cycle with a Krylov method can be expected to be more robust and to result in lower iteration counts in general as compared to pure multigrid solvers, see Lottes2005 ; Stiller2017 ; Stiller2016 ; Stiller2017b ; Shahbazi2009 ; Sundar2015 , especially for anisotropic problems. Since this appears to be the most frequent use case in practice, we follow this strategy in the present work. Some performance numbers could be improved by alternative multigrid flavors, e.g., by considering full multigrid cycles Kronbichler2019 . The performance considerations and convergence rates in this work also apply to full multigrid, where only a single or only a few iterations on the finest level are needed. Due to the symmetric positive definite nature of the model problem considered here, we use the conjugate gradient (CG) algorithm Hestenes1952 ; Saad2003 as Krylov solver, which is detailed in Algorithm 1. The algorithmic components of main interest are the application of the discretized operator in line 8 and the application of the preconditioner in line 13. Other components are vector update operations and inner products (involving global communication), but these parts of the algorithm are of subordinate importance overall, since the computational costs are mainly determined by the operator evaluation and the multigrid cycle called in the preconditioning step.
However, it should be pointed out that the costs of all parts of the algorithm are explicitly taken into account by the experimental cost measures used in the present work, in the spirit of parallel textbook multigrid efficiency Gmeiner2015b , see Section 5.1.
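To make the interplay between operator application and preconditioning step concrete, the following minimal Python sketch implements a preconditioned CG iteration of the kind outlined in Algorithm 1. The names `apply_A` and `apply_P` are illustrative placeholders of our choosing for the matrix-free operator evaluation and the multigrid V-cycle, respectively.

```python
def pcg(apply_A, apply_P, b, tol=1e-10, max_iter=100):
    """Preconditioned conjugate gradient sketch. apply_A stands for the
    (matrix-free) operator evaluation, apply_P for one multigrid V-cycle;
    both names are illustrative placeholders, not the actual implementation."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                                   # r = b - A x for x = 0
    z = apply_P(r)                                # preconditioning step
    p = list(z)
    rz = sum(ri * zi for ri, zi in zip(r, z))
    r0_norm = sum(ri * ri for ri in r) ** 0.5
    for _ in range(max_iter):
        Ap = apply_A(p)                           # operator evaluation
        alpha = rz / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if sum(ri * ri for ri in r) ** 0.5 <= tol * r0_norm:
            break                                 # relative residual reduction reached
        z = apply_P(r)
        rz_next = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_next / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_next
    return x
```

In the solver described here, `apply_P` would perform the V-cycle of Algorithm 2; the sketch works with any symmetric positive definite preconditioner, e.g., a simple Jacobi diagonal scaling.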

In the preconditioning step of the conjugate gradient solver (preconditioner ), the operator  is inverted approximately by performing one multigrid V-cycle according to Algorithm 2 with zero initial solution, where  denotes the finest level. Pre- and postsmoothing are done in lines 6 and 11, respectively, and the residual evaluation in line 7. The same number of smoothing steps  is used for both pre- and postsmoothing and on all multigrid levels. These steps typically form the most expensive part of the multigrid algorithm as long as the workload in degrees of freedom per core is sufficiently large, i.e., away from the strong-scaling limit where latency effects become dominant. The coarse-level correction is called in line 9, recursively invoking the multigrid V-cycle for the next coarser level until the coarsest level is reached, on which the coarse-level solver is called (line 3). Restriction and prolongation are performed in lines 8 and 10, respectively.

3.1 Chebyshev-accelerated Jacobi smoother

In the context of the matrix-free methods analyzed here, an attractive multigrid smoother is a Chebyshev-accelerated Jacobi smoother Adams2003 , which requires the diagonal  of the operator  as well as the application of the matrix-vector product . Therefore, any fast implementation for the evaluation of the discretized operator can be applied inside the smoother, and parallel scalability is directly inherited from the operator evaluation. Algorithm 3 details the Chebyshev iteration with iteration index  and  smoothing steps, where the two additional scalar parameters are calculated according to the theory of Chebyshev polynomials and require an estimate of the maximum eigenvalue of the operator. The parameters are determined such that the Chebyshev smoother tackles eigenvalues in the upper part of the spectrum on the current level, while smaller eigenvalues are damped by the coarse-grid correction. Since the maximum eigenvalue is only estimated, a safety factor is included to ensure robustness of the smoother. Note that the precise value used for the lower bound of the smoothing range is not critical in terms of robustness and iteration counts. A Chebyshev iteration with  pre- and postsmoothing steps is denoted as Chebyshev() in the following; typical values for which the smoother is most efficient are , see for example Kronbichler2019 . As a default parameter, we use  and show additional results in the form of a parameter study in Section 5.
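A pure-Python sketch of this smoother follows, using the standard three-term Chebyshev recurrence (e.g., Saad's formulation) applied to the Jacobi-preconditioned operator; the parameter names and the choice of smoothing range are ours, not taken from the paper's implementation.

```python
def chebyshev_jacobi_smoother(A, b, x, m, lambda_max, smoothing_range=4.0):
    """m steps of Chebyshev-accelerated Jacobi smoothing for A x = b,
    damping eigenvalues of diag(A)^{-1} A in the interval
    [lambda_max / smoothing_range, lambda_max] (three-term recurrence)."""
    n = len(b)
    diag = [A[i][i] for i in range(n)]
    def residual(x):
        Ax = [sum(a * v for a, v in zip(row, x)) for row in A]
        return [bi - axi for bi, axi in zip(b, Ax)]
    lam_min = lambda_max / smoothing_range
    theta = 0.5 * (lambda_max + lam_min)   # center of the smoothing interval
    delta = 0.5 * (lambda_max - lam_min)   # half-width of the interval
    sigma = theta / delta
    rho = 1.0 / sigma
    r = residual(x)
    d = [ri / (theta * di) for ri, di in zip(r, diag)]   # Jacobi-preconditioned
    x = [xi + dk for xi, dk in zip(x, d)]
    for _ in range(m - 1):
        r = residual(x)
        rho_next = 1.0 / (2.0 * sigma - rho)
        d = [rho_next * rho * dk + (2.0 * rho_next / delta) * ri / di
             for dk, ri, di in zip(d, r, diag)]
        rho = rho_next
        x = [xi + dk for xi, dk in zip(x, d)]
    return x
```

For a 1D Laplacian, a few steps with an estimated maximum eigenvalue (times a safety factor) damp the oscillatory error components by orders of magnitude while leaving the smooth components to the coarse-grid correction, which is exactly the division of labor described above.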

The diagonal required by the Chebyshev smoother is calculated in the setup phase. The maximum eigenvalue needed by the Chebyshev iteration is estimated by performing  conjugate gradient iterations. Compared to a single solution of the linear system, this cost is not negligible. However, for many large-scale time-dependent problems where  time steps have to be solved, setup costs are amortized after a few time steps, which is why we do not explicitly consider these costs in the present work. For details on setup costs see, e.g., Kronbichler2019 . This is further justified by the fact that the costs are proportional to the costs of a fine-level matrix-vector product, and therefore increase similarly under mesh refinement as the solution of the linear system of equations itself. While the present work is restricted to the constant-coefficient Poisson case, it should be mentioned that Chebyshev smoothing has been reported to work well also for variable-coefficient problems with smoothly varying coefficient Kronbichler2018 ; Sundar2015 .

3.2 Coarsening strategies and multigrid transfer operations

The multigrid level  introduced in Algorithm 2 is uniquely defined by the grid size , the polynomial degree , and the continuity parameter indicating whether the function space is continuous or discontinuous.

From one multigrid level to the next, only one of the three parameters may change for the hybrid multigrid methods discussed in this work. For example, a transfer from DG space to FE space leads to two multigrid levels that coincide with respect to grid size and polynomial degree, i.e., a combined coarsening from a high-order discontinuous space to a low-order continuous space is not considered here. The approach is denoted as h-multigrid or p-multigrid if only geometric or polynomial coarsening is employed. Combined geometric-polynomial multigrid is denoted as hp- or ph-multigrid, depending on which coarsening is applied first (we only consider sequential coarsening in h and p as opposed to, e.g., Antonietti2015 , where the term hp-multigrid is used for simultaneous coarsening in both h and p from one level to the next), as illustrated in Figure 2. Following this notation and depending on whether the DG-to-FE transfer is performed at high degree or at low degree, we denote this approach as cp-multigrid or pc-multigrid, respectively, or as ch-multigrid or hc-multigrid if geometric coarsening is involved. Applying all three possibilities for coarsening would for example result in a cph-multigrid strategy, with the c-coarsening performed first, followed by p-coarsening and finally h-coarsening. The different types of coarsening are illustrated in Figure 3. In all cases, algebraic multigrid may be applied as a coarse-grid solver.

Figure 3: Illustration of elementary coarsening strategies for nodal high-order discontinuous Galerkin discretizations.
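The constraint that exactly one of the three parameters changes per level can be illustrated by constructing one possible level hierarchy (a c-transfer on the finest level, p-coarsening by bisection, then h-coarsening). The function and the tuple encoding are illustrative sketches of our own, not part of the actual implementation.

```python
def hybrid_levels(h_refinements, p_fine):
    """One possible hybrid hierarchy (fine to coarse): DG-to-FE transfer on the
    finest level, then p-coarsening by bisection, then h-coarsening. Each level
    is (refinement level, polynomial degree, function space), and exactly one
    entry changes from one level to the next."""
    levels = [(h_refinements, p_fine, "DG")]
    levels.append((h_refinements, p_fine, "FE"))   # c-transfer: continuity only
    p = p_fine
    while p > 1:                                   # p-coarsening (bisection)
        p = max(p // 2, 1)
        levels.append((h_refinements, p, "FE"))
    h = h_refinements
    while h > 0:                                   # h-coarsening on the mesh hierarchy
        h -= 1
        levels.append((h, 1, "FE"))
    return levels
```

Other strategies discussed in the text correspond to reordering these blocks, e.g., performing the c-transfer at the lowest degree instead of the highest.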

We write the prolongation of the coarse-level correction from coarse to fine levels generically (for all types of transfers ) as


where the global prolongation operator is expanded into the sum over all elements on the fine level with elementwise prolongation operator . The gather operator  extracts the local degrees of freedom of a coarse-level element from the global DoF vector. The scatter operator  coordinates the write of local degrees of freedom into the global fine-level DoF vector and additionally performs a weighting of degrees of freedom according to the multiplicity of a shared node in the case of continuous function spaces. The elementwise prolongation operator is realized as an L2-orthogonal projection


where  denotes the mass matrix and  the embedding from the coarse space into the fine space. Note that the integral is performed in reference space over the fine-level element. Therefore, the operation is the same for all elements and is done only once in the setup phase, where the 1D prolongation matrices are constructed. Prolongation in multiple dimensions is constructed as the tensor product of 1D operations, exploiting fast matrix-free evaluation techniques. The 1D prolongation matrices represent the interpolation of coarse-level basis functions into the nodes of the fine-level basis functions. In the case of h-coarsening and for general mappings from reference to physical space, however, the coarse-level space is no longer a subset of the fine-level space. Therefore, the chosen multigrid transfer operations implicitly introduce the approximation of nested function spaces, as also mentioned, e.g., in Lu2014 . In the case of p- and c-transfer, the function spaces are “strictly” nested. Restriction of the residual  onto coarser levels is defined as the transpose of prolongation,


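For p-coarsening between nested nodal spaces, these transfer operators can be sketched as follows: the 1D prolongation matrix interpolates the coarse Lagrange basis at the fine-level nodes (which, for nested spaces, reproduces coarse functions exactly), and restriction applies the transpose. Equidistant nodes are an assumption made here for brevity.

```python
def prolongation_1d(p_coarse, p_fine):
    """1D prolongation matrix between nested nodal bases: evaluate the coarse
    Lagrange basis at the fine-level nodes (equidistant nodes assumed)."""
    xc = [i / p_coarse for i in range(p_coarse + 1)]
    xf = [i / p_fine for i in range(p_fine + 1)]
    def lagrange(j, x):
        v = 1.0
        for k, xk in enumerate(xc):
            if k != j:
                v *= (x - xk) / (xc[j] - xk)
        return v
    return [[lagrange(j, x) for j in range(p_coarse + 1)] for x in xf]

def prolongate(P, u_coarse):
    # matrix-vector product with the 1D prolongation matrix
    return [sum(pij * uj for pij, uj in zip(row, u_coarse)) for row in P]

def restrict(P, r_fine):
    # restriction of the residual is the transpose of prolongation
    return [sum(P[i][j] * r_fine[i] for i in range(len(P)))
            for j in range(len(P[0]))]
```

In multiple dimensions, these 1D matrices would be applied dimension by dimension via the tensor product structure, as described in the text.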
3.2.1 h-coarsening

A hierarchy of -levels is constructed based on the octree concept, see for example Sundar2012 ; Burstedde2011 for details on aspects of the chosen mesh topology. Therefore, coarser meshes in the multigrid context are not obtained by coarsening a fine mesh, but rather by starting from a coarse mesh that is refined uniformly several times to obtain the fine mesh. This coarse mesh also forms the coarse-grid problem in the multigrid algorithm. From this perspective, it is clear that pure -multigrid based on the octree approach works well for cube-like domains of moderate geometrical complexity, but reaches limitations for complex geometries where only one or two refinement levels applied to the coarse mesh might be affordable in practice. In these cases, it is essential to further coarsen the problem by the use of -multigrid and algebraic multigrid techniques described in more detail below. Here, we restrict ourselves to meshes without hanging nodes and each octree has the same number of mesh refinement levels. Multigrid methods for high-order discretizations on adaptively refined meshes are discussed in Kanschat2004 ; Janssen2011 ; Kronbichler2018 ; Kronbichler2019 ; Sundar2012 in a pure -multigrid context.

3.2.2 p-coarsening

As opposed to h-multigrid, p-multigrid offers the possibility for various p-coarsening strategies. Reducing the polynomial degree by one is frequently applied in the literature Bassi2003 ; Fidkowski2004 ; Fidkowski2005 ; Nastase2006JCP ; Shahbazi2009 ; Diosady2009 ; Bassi2009 ; Bassi2011 ; Ghidoni2014 ; Antonietti2015 ; Fabien2019 . An alternative is to reduce the polynomial degree by a factor of two (with appropriate rounding), which has been used in Helenbrook2003 ; Helenbrook2008 ; Mascarenhas2009 ; Mascarenhas2010 ; Helenbrook2016 . This coarsening strategy has a close analogy to h-coarsening since the number of degrees of freedom is reduced by a factor of two in each coordinate direction from one level to the next. It is also not uncommon to immediately reduce the polynomial degree to its minimum (two-level algorithm), see for example Rasetarinera2001 ; Persson2008 ; Bastian2019 . Elementwise constant shape functions are not considered in this work. On the one hand, the present DG discretization is not consistent for elementwise constant polynomials. On the other hand, as argued in Helenbrook2008 , the small-wave-number modes that remain after smoothing are essentially continuous for diffusive problems and are, therefore, not well represented by a piecewise constant coarse space. For the neutron diffusion problems studied in OMalley2017 , a continuous coarse space has been found to be advantageous over a piecewise constant space. It has been observed in Persson2008 ; Shahbazi2009 by the example of the compressible Navier–Stokes equations involving diffusive terms that a linear coarse space performs better than a piecewise constant one. A piecewise constant space is typically used in the convection-dominated limit and for the compressible Euler equations Luo2006 ; Nastase2006JCP . In Bastian2019 , a piecewise constant space is also used for a Poisson problem with variable coefficients. These previous studies indicate that the optimal coarse space depends on the model problem under investigation. Since the present work is restricted to the constant-coefficient Poisson problem, we also restrict ourselves to a specific choice for the coarse space. Discussions and comparisons of different p-sequences can be found in Helenbrook2008 in the context of iteration counts and in Nastase2006JCP in terms of iteration counts and computational costs; the latter work, however, investigates only a single polynomial degree. Here, we pursue a more rigorous investigation of the following p-coarsening strategies

  • decrease by one: the polynomial degree is reduced by one from one level to the next,

  • bisection: the polynomial degree is reduced by a factor of two (with appropriate rounding),

  • two-level: the polynomial degree is immediately reduced to its minimum,

considering a wide range of polynomial degrees  and studying the impact on both iteration counts and computational costs. All p-levels are exploited in our multigrid algorithm until the lowest degree is reached.
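The three p-coarsening sequences can be written down compactly; the function below is an illustrative sketch and the strategy names are ours.

```python
def p_sequence(p_fine, strategy):
    """p-coarsening sequences compared in the text (minimum degree is 1)."""
    p, seq = p_fine, [p_fine]
    while p > 1:
        if strategy == "decrease_by_one":
            p -= 1                    # one degree per level
        elif strategy == "bisection":
            p = max(p // 2, 1)        # halve the degree, rounding down
        elif strategy == "two_level":
            p = 1                     # drop to the minimum degree at once
        else:
            raise ValueError(strategy)
        seq.append(p)
    return seq
```

Bisection reduces the number of levels from linear to logarithmic in the fine-level degree, while the two-level variant trades level count for a harder coarse problem.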

3.2.3 c-coarsening (transfer from discontinuous to continuous space)

A transfer from the discontinuous space to a continuous space at the coarse degree  is used in Helenbrook2008 ; OMalley2017 , an idea that has already been described in Lasser2001 in the context of two-level overlapping preconditioners. A transfer at the highest degree  is suggested in Rudi2015 without justification and with results shown only for the lowest degree . This approach might be counter-intuitive at first sight since an additional multigrid level at high polynomial degree (and therefore with expensive smoothers) is introduced and the problem size is only marginally reduced by a DG-to-FE transfer at high degree. It is interesting to note that a similar idea called conforming aggregation is used in Olson2011 in the context of smoothed aggregation algebraic multigrid techniques, where degrees of freedom at the same spatial location are aggregated on the finest level. For the two-level scheme proposed in Dobrev2006 ; Bastian2012 ; Siefert2014 ; Bastian2019 , the high-order DG space is directly reduced to a linear conforming space. According to our experiments, this could be the reason for the strong increase in iteration counts observed in Siefert2014 ; Bastian2019 for increasing polynomial degree (and for a similar two-level preconditioner used in Remacle2016 ). As mentioned previously, we introduce an additional multigrid level for the DG-to-FE transfer as in OMalley2017 , i.e., the transfer to the continuous FE space is done at constant degree and mesh size, which we found to be important for optimal multigrid convergence rates. We investigate two variants of the DG-to-FE transfer in this work, namely performing this transfer at the highest degree or at the lowest degree (and similarly on the finest mesh or coarsest mesh). Performing the transfer to continuous elements on the finest level has very attractive properties: it reduces the iteration counts considerably and yields a multigrid solver for SIPG discretizations of the Poisson equation that is robust with respect to the penalty parameter of the SIPG method. Theoretical background for this behavior is provided in Antonietti2017 , where this approach is motivated from the perspective of space splitting and auxiliary space methods. The important difference is that we integrate this splitting into the multigrid hierarchy, with the same smoother used on all levels.

The elementwise prolongation matrix is an identity matrix in the case of a DG-to-FE transfer since the continuous and discontinuous function spaces are the same from an elementwise perspective. Accordingly, the degrees of freedom shared by neighboring elements in the continuous case are simply injected into the degrees of freedom duplicated in the discontinuous case. With restriction being the transposed operation, the residual contributions of degrees of freedom of duplicated nodes in the discontinuous case are summed into the uniquely defined degree of freedom in the continuous case.
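A minimal sketch of this c-transfer for 1D linear elements (a toy DoF layout of our own choosing, not the actual implementation): prolongation injects the unique continuous values into the duplicated DG DoFs, and restriction, as the transpose, sums the duplicated residual contributions.

```python
def prolongate_fe_to_dg(u_cg, n_el):
    """Prolongation FE -> DG for 1D linear elements: the elementwise matrix is
    the identity, so shared continuous DoFs are injected into the duplicated
    element-boundary DoFs. DG DoFs are ordered element by element."""
    return [u_cg[e + j] for e in range(n_el) for j in (0, 1)]

def restrict_dg_to_fe(r_dg, n_el):
    """Restriction DG -> FE (transpose of prolongation): residual contributions
    of duplicated DoFs are summed into the unique continuous DoF."""
    r_cg = [0.0] * (n_el + 1)
    for e in range(n_el):
        r_cg[e] += r_dg[2 * e]
        r_cg[e + 1] += r_dg[2 * e + 1]
    return r_cg
```

For two elements, the continuous vector [1, 2, 3] is prolongated to [1, 2, 2, 3], and restriction of a DG residual sums the two contributions at the shared interior node.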

  • The two-level approaches in Lasser2001 ; Dobrev2006 ; Antonietti2017 are also known as, or interpreted as, auxiliary space preconditioning. We refrain from this nomenclature in the present work and rather categorize these approaches as one type of multigrid coarsening in the generalized framework of hybrid multigrid algorithms. The multigrid methods in Bastian2012 ; Siefert2014 are introduced as algebraic multigrid methods that are “not fully algebraic”. In the present work, we foster a fine-level point of view and categorize these approaches as c-multigrid (potentially with additional p-coarsening) with algebraic multigrid applied as coarse-grid solver. This is for good reason: the fine levels are those on which the numerical method spends its time (assuming that the method is applied away from the strong-scaling limit), and they determine the computational efficiency of the approach.

3.3 Coarse-grid solver

The success of multigrid methods originates from the fact that the coarse-grid correction ensures mesh-independent convergence rates as well as low iteration counts and – at the same time – causes only low computational overhead compared to the operations on the fine level. It is therefore important that the coarse-grid correction does not deteriorate the multigrid convergence rate, which should only be limited by the smoother on the fine level. This is particularly important for the AMG coarse-grid solver, which does not necessarily converge at the same rate as the smoothers on the geometric levels of the multigrid hierarchy would allow. For this reason, it is reasonable to solve the coarse-level problem by an inner Krylov solver preconditioned by an AMG V-cycle to a specified tolerance, instead of applying only a single AMG V-cycle as coarse-grid solver. Note that using a Krylov solver within the preconditioner no longer guarantees that the preconditioner is a stationary operation, which might require the use of flexible outer solvers in general. Since we did not observe convergence problems in the present work, a basic CG iteration is used throughout as outer Krylov solver. Extending AMG solvers designed for continuous discretizations to the discontinuous case is not trivial without further measures, as shown in Olson2011 . Since we want to apply the AMG coarse-grid solver in a black-box fashion in its optimal regime, we mainly show performance numbers for AMG applied to a continuous discretization of the coarse problem with lowest degree, see also Bastian2012 ; Siefert2014 . The present work makes use of the AMG implementation provided by the Trilinos ML project TrilionsML2006 , using one V-cycle with one smoothing step of an ILU smoother without fill-in and no overlap (i.e., in a block-Jacobi fashion over the MPI ranks) and an Amesos-KLU coarse solver unless specified otherwise. A comparative study of different AMG solver frameworks is beyond the scope of this study; such comparisons are shown, for example, in OMalley2017 for the neutron diffusion equation and in Offermans2019 in the context of computational fluid dynamics.

4 Matrix-free operator evaluation

The overall performance of the multigrid solver crucially depends on the speed at which the matrix-vector product  can be performed. The outer Krylov solver, the multigrid V-cycle, and also the multigrid smoothers only require the action of the linear operator  applied to a vector. Since multigrid transfer operators can also be realized in a matrix-free way using sum-factorization, all components of the algorithm outlined in Section 3 (apart from the AMG coarse-grid solver) are amenable to fast matrix-free operator evaluation. The present work builds on matrix-free evaluation routines using the implementation developed in Kronbichler2012 ; Kronbichler2019fast and available in the deal.II finite element library dealII90 . The global matrix-vector product is written as a loop over all elements and faces with the local weak form evaluated by numerical quadrature


For volume integrals over , the gather operation  extracts the local degrees of freedom associated to element . Similarly, for the integral over a face ,   extracts the relevant degrees of freedom of the two elements  required for the computation of the face integral, . Then, the computation of volume and face integrals is a 3-step process described by  and , respectively. This 3-step process forms the core of the matrix-free operator evaluation and is explained in more detail below. Finally, the scatter operations  and  add contributions of volume and face integrals into the global residual vector according to the mapping of local-to-global degrees of freedom. Independently of the specific discretization technique, matrix-free techniques are the state-of-the-art implementation for high-performance realizations of PDE solvers, see for example Rudi2015 ; Ichimura2015 ; Gmeiner2015 .

Next, we detail the procedure for the matrix-free operator evaluation for the volume integral over 

The integral over the physical domain is first transformed to the reference element, giving rise to geometry terms such as the Jacobian . Integration is then performed by Gaussian quadrature, introducing the quadrature weight  and replacing the integral by a sum over all quadrature points. The last row shows how the elementwise computation of integrals can be interpreted in terms of the more abstract notation introduced in equation (15). The interpolation operator  computes the gradient (in reference coordinates) of the solution at all quadrature points by interpolation of the basis functions according to the polynomial expansion introduced in Section 2. The differential operator  applies the PDE operator at all quadrature points and depends on data associated to the current element  for non-Cartesian element geometries. The integration operator  multiplies by the gradient of the test function and sums over all quadrature points (= integration). It can easily be seen from the above equation that the integration step is the transpose of the interpolation step. Interpolation and integration are done in reference coordinates and do not depend on the current element . To obtain optimal computational complexity, it is essential to exploit the tensor product structure of the shape functions in the interpolation and integration steps. This optimization technique is called sum-factorization and replaces the sum over all nodes by  sums over the one-dimensional nodes, leading to a complexity of  operations. Applying  one-dimensional interpolation kernels for  gradients gives rise to  kernels. However, the operations can be reduced to  kernels by first interpolating into a collocation basis ( kernels) and subsequently evaluating the gradient in the collocation basis (another  kernels) Kronbichler2019fast . Another optimization technique reducing the number of operations for the one-dimensional kernels exploits the symmetry of the one-dimensional shape functions and is called even-odd decomposition Kopriva2009 . An illustration of the matrix-free evaluation process is provided in Figure 4. The computation of face integrals follows the same principles and we refer the interested reader to Kronbichler2012 ; Kronbichler2019fast for more details. For the special case of affine element geometries, a single Jacobian  is used at all quadrature points of a cell. For deformed elements, a separate Jacobian is precomputed for each quadrature point and stored, which is then accessed during the operator evaluation and represents the main memory traffic. On faces, the quantity  is pre-computed at each quadrature point. We refer to Kronbichler2019fast for possible alternatives regarding the evaluation of the geometry. Apart from the operator evaluation and smoothing on all multigrid levels, also the multigrid transfer operators discussed in Section 3.2 are implemented with optimal-complexity matrix-free algorithms.

Figure 4: Illustration of matrix-free operator evaluation for the computation of cell integrals for a discontinuous, nodal basis with degree  and  interpolation and quadrature points per coordinate direction. Note that this illustration shows the non-vectorized case with the volume integral performed for a single element only.
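The core idea of sum-factorization, replacing a d-dimensional interpolation matrix by sweeps of one-dimensional kernels, can be sketched in 2D as follows (a didactic Python version of our own; the actual implementation uses vectorized C++ kernels).

```python
def kron_apply_2d(S, u):
    """Apply the tensor product operator S (x) S to a vector u of length n*n
    by two sweeps of 1D kernels (sum-factorization): O(n^3) operations
    instead of the O(n^4) cost of applying the full (n^2 x n^2) matrix."""
    n = len(S)
    # sweep 1: contract the x-index for every 1D line of the tensor
    tmp = [[sum(S[a][i] * u[j * n + i] for i in range(n)) for a in range(n)]
           for j in range(n)]
    # sweep 2: contract the y-index
    return [sum(S[b][j] * tmp[j][a] for j in range(n))
            for b in range(n) for a in range(n)]
```

In 3D, a third sweep is added, giving the O(p^{d+1}) per-element complexity quoted in the text; the even-odd decomposition would additionally halve the work inside each 1D kernel by exploiting the symmetry of the shape functions.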
Vectorization over elements and faces

The matrix-free operator evaluation performs the same operations for all elements; the only difference is that integrals over different elements operate on different parts of the solution vector  and that the geometry information  has to be stored and loaded separately for each element in case of deformed element geometries. In order to exploit the single-instruction-multiple-data (SIMD) vectorization capabilities of modern hardware with wide SIMD units, the present implementation groups together several elements or faces and performs the integrals in the weak form concurrently for this batch of elements or faces. This technique was first proposed in Kronbichler2012 . The basic data type for the operations in the matrix-free evaluation process is therefore VectorizedArray<Number>, with Number being a template for a C++ data type such as double or float. For the hardware used in the present work with support for AVX512 (see Table 3), vectorization is done over 8 elements/faces in double precision and 16 in single precision. For meshes where the number of elements/faces is not a multiple of the vectorization width, parts of the vectorized array remain empty in these corner cases.

Complexity and throughput

The theoretical complexity of the matrix-free evaluation is  operations and  data, resulting in a linear complexity, , or constant complexity, , per degree of freedom, depending on whether arithmetic operations or memory transfer form the main bottleneck. On modern hardware with high Flop-to-Byte ratios, the present matrix-free implementation tends to be memory-bound when implemented with a minimum of arithmetic operations Kronbichler2019fast . Figure 1 shows the throughput of the present implementation measured for the evaluation of the scalar Laplace operator on a 3D cube geometry with periodic boundary conditions for both Cartesian and curved elements. In practice, the throughput measured in degrees of freedom per second depends only mildly on the polynomial degree and suggests an almost constant complexity up to moderately high polynomial degrees. The fact that the observed complexity is significantly better than the theoretical complexity of volume integrals can be explained by the fact that face integrals as well as data access (with constant complexity per unknown) are performance-relevant for moderately high polynomial degrees.

Mixed-precision multigrid

The matrix-free algorithm outlined above is perfectly suited for mixed-precision computations in the multigrid preconditioned Krylov solver, following the idea of Gropp2000 . This is due to the fact that the amount of data transferred from main memory is reduced by a factor of two in case of single precision (implying twice the throughput in terms of elements processed per time), and the vectorization strategy with explicit vectorization over elements/faces also allows twice the throughput in terms of arithmetic operations. The throughput of the matrix-vector product shown in Figure 1 is therefore raised by a factor of approximately two when reducing the accuracy from double precision to single precision. In order not to spoil the accuracy of the numerical approximation of the solution and to ensure convergence of the outer Krylov solver, single precision is only used in the multigrid V-cycle; the outer Krylov solver operates in double precision. Larger round-off errors in the multigrid cycle can be tolerated since the high-frequency errors introduced by single-precision round-off are tackled by the multigrid smoothers Kronbichler2019 and since multigrid is only a preconditioner applied to the residual of the outer Krylov solver, see Algorithm 1. Since the Trilinos ML solver used here only supports double precision, the AMG coarse-grid preconditioner operates in double precision. The performance of mixed-precision computations is compared to pure double-precision computations in Figure 6 below and discussed in Section 5.

5 Results

We introduce relevant performance metrics used to evaluate the efficiency of the present hybrid multigrid methods in Section 5.1. Information on the hardware under consideration is given in Section 5.2. The considered test cases are briefly summarized in Section 5.3, before numerical results are shown for each problem in the subsequent sections.

5.1 Performance metrics

Frequently used metrics are the average multigrid convergence rate  and the number of iterations  needed to reduce the residual by ten orders of magnitude ()

where  denotes the residual after  iterations. These quantities are well suited to demonstrate mesh-independent convergence rates, or to quantitatively investigate robustness of the multigrid method, i.e., the influence of certain parameters such as mesh anisotropies, variable coefficients, or the polynomial degree on the convergence behavior of the multigrid algorithm. However, they are not suited to quantify the effectiveness of smoothers in terms of computational efficiency. To measure computational costs, theoretical measures such as operation counts required for the matrix-vector product or matrix nonzeros are often considered Lottes2005 ; Sundar2015 . These quantities should be considered with some care because they inherently contain assumptions on the bottleneck (for example that the algorithm is compute-bound so that floating point operations are really a cost measure). However, this depends on many aspects such as the hardware under consideration (Flop-to-Byte ratio), the implementation strategy (matrix-based vs. matrix-free), and the optimization level of the implementation. For example, it is important to implement the matrix-free algorithms discussed here with a minimum of operations and to exploit SIMD capabilities of modern hardware. Due to these uncertainties and model assumptions of theoretical cost measures, we prefer experimental cost measures determined from the actual performance of the multigrid solver, in the spirit of Kronbichler2018 ; Bastian2019 . An effective number of fine-level matrix-vector products, denoted as  in the following, is helpful to incorporate computational costs for the smoother and to compare different smoothers in the metric of computational costs instead of global iteration counts. It is unclear whether more aggressive matrix-based smoothers resulting in lower iteration counts are also superior in the practically relevant time-to-solution metric. 
The quantity  is defined as the ratio of the wall time for one application of the linear solver with tolerance  over the wall time for one operator evaluation. Since absolute wall times depend on the problem size, it is useful to express  as a function of two normalized quantities. The first one is the efficiency  of the matrix-free operator evaluation measured as the number of degrees of freedom  processed per second per core (also denoted as throughput)


The second one is the time  required by the multigrid solver to solve one degree of freedom per core with a residual reduction of 


or equivalently the throughput  of the linear solver in degrees of freedom solved per second per core. Then, the effective number of fine-level matrix-vector products is determined experimentally as


The aim of  is to obtain a measure for the algorithmic complexity of the whole multigrid solver, but as independent of hardware and absolute performance numbers as possible. The definition of  has similarities with the parallel textbook efficiency factor defined in Gmeiner2015b , with the important difference that we define one fine-level matrix-vector product as work unit instead of one fine-level smoothing step. Since the overall goal is optimizing time-to-solution and since the operator evaluation  is the only algorithmic component re-occurring for practically all iterative solvers and preconditioners, it is important to use  as work unit so that the algorithmic complexity of different smoothers is reflected in the values achieved for . Furthermore, the performance advantage achieved by the use of mixed-precision multigrid is naturally included in our definition of . Table 2 summarizes our performance metrics.

Quantity description
number of iterations to reduce the residual by ten orders of magnitude ()
wall time in seconds to solve one unknown per core to reach 
throughput of solver in unknowns solved per second per core ()
throughput of matrix-free operator evaluation in unknowns processed per second per core
effective number of fine-level mat-vec products () to reach 
Table 2: Performance metrics used to evaluate the computational efficiency of multigrid solvers.
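The relation between the metrics in Table 2 can be sketched as follows; the measured wall times and counts in the usage example are illustrative inputs, and the variable names are ours.

```python
def solver_metrics(n_dofs, n_cores, t_solve, t_matvec):
    """Experimental cost metrics from measured wall times: operator-evaluation
    throughput, normalized solve time, and the effective number of fine-level
    matrix-vector products (variable names are illustrative)."""
    E_matvec = n_dofs / (t_matvec * n_cores)   # DoFs processed per second per core
    t_dof = t_solve * n_cores / n_dofs         # seconds per DoF per core to solve
    n_mv = E_matvec * t_dof                    # equals t_solve / t_matvec
    return E_matvec, t_dof, n_mv
```

Note that the effective matrix-vector product count reduces to the ratio of solver wall time to operator-evaluation wall time, so it is independent of problem size and core count and isolates the algorithmic complexity of the multigrid solver from absolute hardware performance.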

5.2 Hardware

The numerical experiments shown in this work are performed on an Intel Skylake architecture with AVX512 vectorization. Table 3 lists the specifications of the SuperMUC-NG supercomputer in Garching, Germany. The GNU compiler g++ version 7.3 with optimization flags -O3 -funroll-loops -std=c++17 -march=skylake-avx512 is used. All computations are done on thin nodes unless specified otherwise. The present analysis focuses mainly on the node-level performance because multigrid solvers are well-known to be scalable even to the largest supercomputers Kronbichler2018 ; Gholami2016 . The multigrid communication is between nearest neighbors, both horizontally within the matrix-vector products and vertically between the multigrid levels with one round-trip per V-cycle through the coarse solver, assuming the latter is sufficiently cheap. This is backed up by performance projections to exascale machines where multigrid is expected to be primarily memory-limited within the nodes Ibeid2018 . Good parallel scalability up to high core counts on large supercomputers when using AMG coarse-grid solvers is shown in Offermans2019 ; Offermans2016 .

Processor:
  Processor type: Intel Skylake Xeon Platinum 8174
  Frequency: 2.7 GHz
  Cores per node: 48
  SIMD width: 512 bit (AVX512)
Memory and caches:
  Memory per node (thin/fat): 96/768 GByte
  Theoretical memory bandwidth: 256 GByte/s
  STREAM memory bandwidth: 205 GByte/s
  Cache size (L2 + L3) per node:  MByte
Table 3: Performance specifications for hardware system of SuperMUC-NG at LRZ in Garching, Germany.

5.3 Test cases

The proposed hybrid multigrid methods are investigated for a series of test cases with increasing complexity regarding the geometry and the number of elements on the coarse grid, as well as the maximum aspect ratio, defined as the largest ratio of the maximum to the minimum singular value of the Jacobian matrix of the element mapping, evaluated at all quadrature points of the element. A visualization of the geometries and the meshes of the different test cases is shown in Figure 5. We consider the following problems:
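This aspect-ratio definition can be evaluated directly from the Jacobian of the element mapping. The sketch below is a literal transcription of the definition (the exact convention in the paper, e.g. how the maximum over elements is taken, may differ):

```python
import numpy as np

def element_aspect_ratio(jacobians):
    """Aspect ratio of a single element: the maximum over its quadrature
    points of sigma_max / sigma_min, where sigma_max and sigma_min are the
    largest and smallest singular values of the Jacobian of the mapping
    from reference to physical coordinates.

    jacobians -- array of shape (n_q_points, dim, dim)
    """
    ratio = 0.0
    for J in jacobians:
        sigma = np.linalg.svd(J, compute_uv=False)  # sorted descending
        ratio = max(ratio, sigma[0] / sigma[-1])
    return ratio
```

For an undeformed Cartesian element with edge lengths 2, 1, and 0.5 the Jacobian is constant and diagonal, giving an aspect ratio of 4; for curved elements the singular values additionally vary over the quadrature points.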

  • Cube: the geometry is a unit cube with  elements on the coarse grid. Although this geometry could be discretized with a single element on the coarse grid, we consider configurations with  elements to exercise all multigrid components and to ensure that the coarse-grid problem is non-trivial (but very small). This test case is well suited to test the different multigrid ingredients, identify optimal multigrid coarsening strategies, perform parameter studies, study the impact of Cartesian and curved elements on iteration counts and throughput, and compare the present implementation to the state of the art (since literature data is mainly available for simple cube-like geometries).

  • Nozzle: the geometry of this test case is the nozzle of the FDA benchmark, which has been designed to assess CFD solvers for the simulation of flow through medical devices Malinauskas2017 . The geometry is a tube with gradual or sudden contractions/expansions of the cross-sectional area, inducing flow separation and involving laminar, transitional, and turbulent flow regimes. We use a coarse-grid mesh with  elements in the present work. The tube and cone geometries are known analytically and are used for a high-order representation of the geometry via manifolds (using a cubic mapping). The blood flow through this device can be modeled as an incompressible flow, and the present work investigates the pressure Poisson component of the related incompressible Navier–Stokes solver. The mesh contains high-aspect-ratio elements with moderate distortion, especially in the outflow part of the domain on the right.

  • Lung: the most complex test case studied in this work is the geometry of the upper airway generations of the human lung, using a patient-specific geometry of a preterm infant, for which gas exchange mechanisms have recently been investigated in the literature Roth2018 . The geometry is discretized with a purely hexahedral mesh, and the coarse-grid problem consists of  elements for 8 airway generations. Simulating the flow of air through the human lung by numerical solution of the incompressible Navier–Stokes equations again involves the solution of a pressure Poisson equation, which is studied in this work.

(a) Cube: Cartesian mesh (left) and section of curvilinear mesh (right) with  elements () and aspect ratios of  and , respectively.
(b) Nozzle: coarse mesh  of FDA nozzle geometry consisting of  elements ().
(c) Lung: coarse mesh  of a patient-specific geometry of the human lung of a preterm infant for 6, 7, and 8 airway generations (from left to right) with , and  elements, where the mesh with 8 generations has an aspect ratio of approximately .
Figure 5: Visualization of geometries and meshes investigated in the present work. The size of the coarse-grid problem ranges from  to  elements.
-like MG -like MG
Table 4: Summary of possible multigrid coarsening strategies. Regarding the nomenclature, the letters in each abbreviation are ordered according to the sequence in which the coarsening is performed from the fine level to the coarse level.
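To illustrate how such coarsening strategies compose, the following sketch builds a level hierarchy from an abbreviation string read from fine to coarse. The degree-halving rule, the letter names, and the data layout are illustrative assumptions, not the paper's implementation.

```python
def coarsening_sequence(strategy, p_fine, n_h_levels):
    """Build a multigrid level hierarchy for a coarsening strategy string.

    Strategy letters are applied in order from fine to coarse:
      'h' -- geometric coarsening through all mesh levels,
      'p' -- polynomial coarsening (here: degree halving, p -> max(1, p//2)),
      'c' -- transfer from discontinuous (DG) to continuous (CG) space.
    Each level is an illustrative (space, degree, mesh_level) tuple.
    """
    levels = [("DG", p_fine, n_h_levels - 1)]  # finest level
    for letter in strategy:
        space, p, h = levels[-1]
        if letter == "c":
            levels.append(("CG", p, h))
        elif letter == "p":
            while p > 1:
                p = max(1, p // 2)
                levels.append((space, p, h))
        elif letter == "h":
            while h > 0:
                h -= 1
                levels.append((space, p, h))
    return levels
```

For instance, a strategy transferring to the continuous space on the finest level, then coarsening in the polynomial degree and finally in the mesh size, yields a DG fine level followed by CG levels of decreasing degree and decreasing mesh level.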

5.4 Cube

We consider a simple analytical test case with a known smooth solution on a cube geometry in 3D. Dirichlet boundary conditions are prescribed on the domain boundary using the known analytical solution. The right-hand side is determined according to the method of manufactured solutions. We analyze both a Cartesian mesh and a curvilinear mesh with a deformation applied in each coordinate direction. An amplitude of  is used, leading to elements that are deformed significantly, see Figure 5. For the curvilinear mesh with deformation manifold, element mappings of polynomial degree  are used throughout this work, independently of the polynomial degree of the shape functions.
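The consistency of a manufactured right-hand side can be checked numerically. The concrete solution formula of the paper is not reproduced here; the sketch below uses a product of sines, a typical choice on the unit cube, purely as a stand-in, and verifies -Laplace(u) = f with finite differences.

```python
import math

def u(x, y, z):
    # stand-in manufactured solution (assumption: product of sines)
    return math.sin(math.pi * x) * math.sin(math.pi * y) * math.sin(math.pi * z)

def f(x, y, z):
    # right-hand side of -Laplace(u) = f: each coordinate direction
    # contributes a factor pi^2, hence f = 3 * pi^2 * u
    return 3.0 * math.pi ** 2 * u(x, y, z)

def laplacian_fd(g, x, y, z, h=1e-4):
    """Second-order central finite-difference Laplacian, used to check
    that f was derived consistently from u."""
    c = g(x, y, z)
    return sum(
        (g(x + dx, y + dy, z + dz) - 2.0 * c + g(x - dx, y - dy, z - dz)) / h ** 2
        for dx, dy, dz in ((h, 0, 0), (0, h, 0), (0, 0, h))
    )
```

At any interior point, -laplacian_fd(u, ...) agrees with f(...) up to the O(h^2) discretization error of the finite-difference stencil.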

5.4.1 Robustness with respect to -refinement

Polynomial degree 
MG type 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
14.8 12.5 12.4 12.2 14.4 14.8 17.2 17.5 19.5 19.1 22.3 22.4 24.3 24.5 26.2
14.8 12.5 12.4 12.2 14.4 14.9 17.2 17.5 19.6 19.1 22.3 22.4 24.2 24.6 26.1
14.8 12.5 12.4 12.2 14.4 14.9 17.2 17.5 19.6 19.1 22.3 22.4 24.2 24.6 26.1
7.5 5.5 5.2 5.1 5.2 5.1 5.5 5.6 6.6 6.6 7.8 7.8 8.7 8.8 9.8
7.5 5.5 5.2 5.1 5.2 5.1 5.5 5.6 6.6 6.7 7.8 7.8 8.8 8.9 9.8
(a) -multigrid (with  if -transfer is involved)
Polynomial degree 
MG type 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
3.3 12.4 11.2 11.2 11.3 10.5 11.0 11.4 11.8 11.5 12.5 12.6 12.8 13.3 13.5
14.8 12.5 11.3 11.3 11.3 10.7 10.9 11.5 11.8 11.6 12.5 12.8 13.3 13.6 13.8
14.8 12.5 11.3 11.3 11.3 10.7 10.9 11.5 11.8 11.6 12.5 12.8 13.3 13.6 13.8
7.4 5.5 5.1 4.9 4.8 5.0 4.7 4.7 4.6 4.7 4.6 4.7 4.6 4.8 4.9
7.5 5.5 5.1 4.9 4.8 5.0 4.7 4.7 4.6 4.7 4.6 4.7 4.6 4.8 4.9
(b) -multigrid ()
Polynomial degree 
MG type 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
3.3 12.4 13.0 11.9 14.2 14.1 15.9 15.4 17.9 16.9 20.1 19.4 21.3 22.3 24.3
14.8 12.5 13.9 12.0 14.3 14.2 16.0 15.3 17.8 16.9 20.2 19.8 21.6 21.9 23.9
14.8 12.5 13.9 12.0 14.3 14.2 16.0 15.3 17.8 16.9 20.2 19.8 21.6 21.9 23.9
7.4 5.5 5.1 4.9 5.1 4.8 5.0 5.1 5.6 5.4 6.3 6.3 7.1 7.0 7.8
7.5 5.5 5.1 4.9 5.1 4.8 5.0 5.1 5.6 5.4 6.3 6.3 7.1 7.0 7.8
(c) -multigrid ()
Polynomial degree 
MG type 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
3.3 12.4 13.0 16.6 20.3 24.6 29.4 35.6 41.6 44.9 55.8 64.6 74.0 86.9 96.8
14.8 12.5 13.9 16.8 20.7 24.7 29.8 35.6 41.6 44.8 54.8 65.2 75.1 88.4 97.3
14.8 12.5 13.9 16.8 20.7 24.7 29.8 35.6 41.6 44.8 54.7 65.2 75.2 88.6 97.0
7.4 5.5 5.1 5.2 6.6 8.7 10.7 12.9 15.5 17.4 19.9 22.6 24.7 26.9 29.8
7.5 5.5 5.1 5.2 6.6 8.7 10.7 12.9 15.5 17.4 19.9 22.5 24.6 26.9 29.7
(d) -multigrid ()
Table 5: Iteration count  as a function of polynomial degree  for various multigrid coarsening strategies for 3D Cartesian mesh with  elements. The smoother used for all experiments is Chebyshev(5,5) and the coarse-grid problem is solved iteratively to a relative tolerance of  by the conjugate gradient method with AMG V-cycle as preconditioner.
Polynomial degree 
MG type 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
17.7 13.8 14.0 15.0 17.8 19.4 22.6 22.6 25.4 26.4 28.9 29.8 32.3 33.3 35.8
17.7 13.8 14.0 15.0 17.8 19.4 22.6 23.1 25.3 26.3 28.8 29.7 32.3 32.9 35.5
17.7 13.8 14.0 15.0 17.8 19.4 22.6 23.1 25.3 26.3 28.8 29.7 32.3 32.9 35.5
8.5 5.9 5.5 5.5 5.9