1 Introduction
The accurate numerical simulation of incompressible viscous flow continues to remain a challenging task, in particular, if three space dimensions are considered due to the sake of physical realism. Higher order methods offer the potential to achieve accurate results on computationally feasible grids with a minimum of numerical costs. However, constructing higher order numerical methods maintaining stability and inheriting most of the rich structure of the continuous problem becomes an important prerequisite. For this, we refer, e.g., to [johnHigherorderFiniteElement2001, johnHigherOrderFinite2002] for stationary problems and to [hussainHigherOrderGalerkin2013, hussainEfficientStableFinite2013, ahmedNumericalStudiesHigher2017] for the nonstationary case. The efficient solution of the arising algebraic systems of equations with a huge number of unknowns is no less difficult. Often, the linear solver represents the limiting factor for the level of mesh resolution and its number of degrees of freedom. If higher order time discretizations are used, this even puts an additional facet of complexity on the structure of the resulting linear systems and their numerical solution. Here, a geometric multigrid (GMG) method that is used as a preconditioner for generalized minimal residual (GMRES) iterations is proposed and analyzed computationally for higher order spacetime finite element discretizations of the Navier–Stokes equations in two and three space dimensions. Its parallel implementation in the deal.II library [arndtDealIILibrary2020] is addressed.
Discretizing the Navier–Stokes system by infsup stable pairs of finite elements that are used in this work and applying Newton’s method for the linearization leads to linear systems of equations with saddle point structure. Higher order variational time discretizations, that we implement as time marching schemes by the choice of a discontinuous in time test basis, generate linear block systems within which each block is a saddle point problem (cf. (3.15)). Solving indefinite saddle point problems has been studied intensively in the literature, cf. [benziNumericalSolutionSaddle2005, elmanFiniteElementsFast2014]. Using a direct solver is a suitable approach for problems of small dimensions. Due to the increase in computational costs and memory, twodimensional problems that are of interest in practice or even threedimensional simulations are not feasible for direct linear solvers, even if parallelism is used. In such cases, Krylov subspace methods [saadGMRESGeneralizedMinimal1986] or multigrid schemes [turekEfficientSolversIncompressible1999] are typically applied. A classical choice for a Krylov method is the (flexible) generalized minimal residual (GMRES) method. One drawback of the GMRES solver is that an additional amount of memory is allocated in each iteration. As a remedy, a restart that typically leads to a lower rate of convergence can be used; cf. [wathenPreconditioning2015]. To improve the convergence of the GMRES method, a preconditioner is typically applied within the GMRES iterations. If the density and viscosity of the flow are constant, the ”pressure convectiondiffusion” (PCD) block preconditioner results in a mesh independent convergence behavior; cf. [kayPreconditionerSteadyStateNavier2002, elmanFiniteElementsFast2014, cyrStabilizationScalableBlock2012]). This holds at least for flow problems with low to medium Reynolds numbers; [olshanskiiPressureSchurComplement2007]. As an alternative, algebraic multigrid (AMG) methods can be used for preconditioning the GMRES method. For saddle point problems, the AMG preconditioner can not be applied in a natural way. In [notayNewAlgebraicMultigrid2016], its application becomes feasible by an appropriate transformation of the underlying saddle point problem.
In challenging benchmarks, e.g., for flow around a cylinder [schaferBenchmarkComputationsLaminar1996]), geometric multigrid (GMG) methods have proven to belong to the most efficient solvers, that are currently available, cf. [johnNumericalPerformanceSmoothers2000]. Numerical studies showed, that the performance, robustness and efficiency of the GMG methods can be improved further, if they are applied as preconditioners for Krylov iterations; cf. [johnHigherOrderFinite2002]. This combination has shown to work robustly even in challenging threedimensional simulations [johnEfficiencyLinearizationSchemes2006]. GMG methods have also been applied successfully in two space dimensions along with higher order spacetime finite element discretizations of convectiondiffusion equations [vandervegtHpMultigridSmootherAlgorithm2012, vandervegtHpMultigridSmootherAlgorithm2012a] or the Navier–Stokes equations [hussainEfficientNewtonmultigridSolution2014, hussainNoteAccurateEfficient2012, hussainEfficientStableFinite2013]. Applying the GMG method involves several complexities. Firstly, one needs to store the problem structure on various mesh levels and transfer information from finer mesh levels to coarser mesh levels by prolongation operators and vice versa by restriction operators. Secondly, parallel assembly routines add a further layer of complexity to the previous ones. Finally, the key ingredient of the GMG method, that essentially influences its performance, is the choice of the smoother, that damps high frequency errors on successively coarser mesh levels. The classical GaussSeidel smoother is not applicable to the Navier–Stokes equations, due to the saddle point structure of the discrete system. Two popular choices for this kind of problems are the Vanka type smoothers [vankaBlockimplicitMultigridCalculation1986] and the Braess–Sarazin type smoothers [braessEfficientSmootherStokes1997]. In numerical studies, Vanka type smoothers have shown to outperform the BraessSarazin ones [johnNumericalPerformanceSmoothers2000].
In this work we propose a GMG approach based on a local (cellbased) Vanka smoother for higher order discontinuous Galerkin approximations in time. An infsup stable pair of finite element spaces with discontinuous pressure approximation is used for the discretization in space. This GMG method is built in the state of the art, multipurpose finite element toolbox deal.II (cf. [arndtDealIILibrary2020] and [clevengerFlexibleParallelAdaptive2021, kronbichlerFastMassivelyParallel2018]) along with the linear algebra package Trilinos [thetrilinosprojectteamTrilinosProjectWebsite2020]. Efficient data structures are provided. The deal.II library is enhanced in such a way that a parallel assembly and application of a cellbased Vanka smoother become feasible. We note that some geometric multigrid support, that was used in [kanschatMultigridMethodsTextdiv2015] to implement a GMG based preconditioner using conforming elements for the Stokes problem, was already provided in deal.II. The robustness and efficiency of our GMG method in terms of a gridindependent convergence of the preconditioned GMRES iterations is studied computationally for the popular benchmark problems of flow around a cylinder in two and three space dimensions. For this, we explicitely note that parallel multigrid iterations for challenging threedimensional flow problems do by far not meet a standard nowadays. This is underlined by the fact that the threedimensional benchmark problem of flow around a cylinder [schaferBenchmarkComputationsLaminar1996] continues to be an open one. Confirmed numbers for the goal quantities of the simulation are not available yet.
This work is organized as follows. In Sec. 2, the prototype model problem and the notation are introduced. In Sec. 3, our spacetime finite element approach for simulating the Navier–Stokes system, as well as the structure of the resulting underlying system matrix, are presented. In Sec. 4, we briefly recall the geometric multigrid algorithm as well as the local Vanka smoother, used in this work. We address practical aspects of the parallel implementation of the algorithm by using the deal.II library and the linear algebra package Trilinos. In Sec. 5, we present numerical results and measure the performance properties of our proposed algorithms for the 2d and 3d DFG benchmark of flow around a cylinder [schaferBenchmarkComputationsLaminar1996].
2 Mathematical problem, notation and discretization
2.1 Model problem
Without loss of generality, we consider the prototype model problem of incompressible viscous flow around a cylinder in a rectangular two or threedimensional domain. The twodimensional problem configuration along with the notation of the geometrical setting is sketched in Fig. 2.1. We evaluate the performance properties of the proposed GMG solver for this benchmark problem (cf.[schaferBenchmarkComputationsLaminar1996]) that is interesting enough. Further, the notation overhead is reduced by the restriction to this setting. We consider solving the Navier–Stokes equations equationparentequation
(2.1a)  
(2.1b)  
(2.1c)  
(2.1d)  
(2.1e) 
In (2.1), , with or , is the open domain filled with fluid. We put for some final time . The velocity field and the pressure are the unknown variables. In (2.1a), the parameter denotes the fluid’s viscosity and the righthand side function is a given external force. The union of the Dirichlet boundary segments is denoted by , such that . On we prescribe the fluid velocity by a function , that prescribes an inflow profile on and a no slip condition on . represents an outflow boundary that is modeled by the donothing boundary condition (2.1d); cf. [heywoodArtificialBoundariesFlux1996]. In (2.1d), the field
is the outer unit normal vector. In (
2.1e), the function denotes the prescribed initial velocity. In our numerical experiments presented in Sec. 5, the (timeindependent) rigid domain is a sphere in two spacedimensions or a cylinder in three spacedimensions.We assume that a sufficiently regular solution of the system (2.1) exists such that higher order approaches become feasible. For the existence, uniqueness and regularity of solutions to the Navier–Stokes system, including the regularity for under realistic assumptions about the data, we refer to the broad literature in this field; cf. [johnFiniteElementMethods2016, heywoodFiniteElementApproximation1982] and the references therein.
2.2 Notation
Here, we introduce the function spaces that are used in this work to present our spacetime approach for the Navier–Stokes system (2.1). By we denote the function space of square integrable functions on the fluid domain while is the usual Sobolev space of functions in which have first order weak derivatives in . Further, is the standard inner product of . We define the subspace of with mean zero and the subspace of of functions with zero boundary values (in the sense of traces) on the portion of the boundary of as . Its dual space is denoted by . Finally, by we denote the space of all traces on of functions in . For vectorvalued functions we write those spaces bold.
For the weak problem formulation we introduce the semilinear form by
(2.2) 
for and . For given we introduce the linear form by
(2.3) 
for .
Further, we define the spaces
(2.4) 
and
(2.5) 
Finally, we let
2.3 Space discretization
Let be a family of shaperegular decompositions of the (cf. Fig. 2.1) into (open) quadrilaterals with maximum cell size . For , let denote the space of polynomials of degree at most in each variable and the space of at most degree . We put
with the reference mapping from the reference cell to element . In our computations presented in Sec. 5, only affine linear mappings are used. We define the finite element spaces
For the construction and effective application of the cellbased Vanka smoother, a local velocitypressure coupling is required. To establish such a coupling, a discontinuous in space pressure approximation is applied. For the spatial approximation of the velocity and pressure variable we use the conforming, infsup stable finite element pair that is given by
(2.6) 
for some natural number , cf. [johnFiniteElementMethods2016, matthiesInfsupConditionMapped2002]. All the numerical experiments that are presented in Sec. 5 were done by the choice (2.6) of the discrete function spaces.
The space of weakly divergence free functions is denoted by
Finally, we define the spaces
2.4 Time discretization
For the time discretization, we decompose the time interval into subintervals , , where such that and for . We put with . Further, the set of time intervals is called the time mesh. For a Banach space of functions defined on the timeindependent domain and any , we let
(2.7) 
For an integer , we put
(2.8) 
3 Spacetime finite element discretization
Here we introduce our approximation of the Navier–Stokes system (2.1) by spacefinite element methods. Firstly, we briefly recall the spacetime variational formulation of the Navier–Stokes system (2.1). Then, we present its discretization in space. For this, Nitsche’s method [beckerMeshAdaptationDirichlet2002] is applied to incorporate Dirichlet boundary conditions in a weak sense. Our motivation for using Nitsche’s method comes through our overall goal to develop a GMG solver for the simulation of the Navier–Stokes equations on evolving domains by using cut finite element techniques on fixed, timeindependent background meshes. Such a discretization is developed and analyzed computationally in [anselmannCutFEMGhostStabilization2021]. Finally, the discretization in time by the discontinuous Galerkin method is done. The fully discrete problem is presented as a time marching scheme. The block structure of the Newton linearized algebraic system for each time slab is illustrated. To solve the linear systems of the Newton iteration, we use a GMG preconditioner for outer flexible GMRES iterations. Its presentation in Sec. 4 is the key contribution of this work.
3.1 Variational formulation of the continuous problem
A sufficiently regular solution of the Navier–Stokes system (2.1) satisfies the following variational problem that provides the basis for spacetime finite element discretizations.
Problem 3.1 (Continuous variational spacetime problem).
Let and be given. Let denote a prolongation of the boundary values on to , such that on . Find , with , such that and
(3.1) 
for all .
3.2 Semidiscretization in space with weak enforcement of boundary conditions by Nitsche’s method
To incorporate Dirichlet boundary conditions we use Nitsche’s method [beckerMeshAdaptationDirichlet2002]. Here, the Dirichlet boundary conditions are enforced in a weak form by adding face integrals to the variational problem. We also refer to [anselmannHigherOrderGalerkin2020]. Degrees of freedom assigned to Dirichlet portions of the boundary are then treated as unknowns of the variational problem. They are no longer enforced by the definition of the underlying function space and their implementation into the algebraic system by its condensation, as it is usually done. In particular, Nitsche’s methods allows the simulation of the Navier–Stokes equations (2.1) on evolving domains by using cut finite element techniques on fixed background meshes; cf., e.g., [anselmannCutFEMGhostStabilization2021].
Let . For the treatment of Dirichlet boundary conditions by Nitsche’s method (cf. [beckerMeshAdaptationDirichlet2002]) we introduce the bilinearform by
(3.2)  
for and , where and are numerical (tuning) parameters for the penalization. In [agerNitschebasedCutFinite2019, winterNitscheCutFinite2018], their choice in the range of is recommended. In our simulations presented in Sec. 5, we put . The discrete semilinear form is then given by
(3.3) 
for and . The linear form is defined by
(3.4) 
for .
Remark 3.1.
We comment on the different boundary terms in the forms (3.2) and (3.3). The second term on the righthand side of (3.3) reflects the natural boundary condition, making the weak imposition of the boundary conditions consistent. The first term on the righthand side of (3.2) is introduced to preserve the symmetry properties of the continuous system. The last two terms are penalizations, that ensure the stability of the discrete system. In the inviscid limit , the last term amounts to a ”nopenetration” condition. Thus, the semilinear form (3.2) provides a natural weighting between boundary terms corresponding to viscous effects , convective behavior and inviscid behavior .
We are now in a position to define a semidiscrete approximation of Problem 3.1.
Problem 3.2.
Let and be given. Find , such that and
(3.5) 
for all .
3.3 Fully discrete problem
For the discretization in time we use the discontinuous Galerkin method with piecewise polynomials in time of order . Using a discontinuous method in the time domain has the advantage, that no initial value for the pressure is needed. Continuous in time methods require a discrete initial pressure value for the unique definition of its full trajectory. Such an initial value, that guarantees the optimal order of convergence of the velocity and pressure variables for all , is not available. A remedy is the application of extrapolation techniques; cf. [hussainNoteAccurateEfficient2012]. Futher, discontinuous Galerkin methods offer stronger stability properties since they are known to be strongly stable. Applying the discontinuous Galerkin scheme to Problem 3.2 and using a discontinuous local test basis, yields the following time marching scheme; cf., e.g., [anselmannHigherOrderGalerkin2020, anselmannCutFEMGhostStabilization2021].
Problem 3.3.
Let and be given. For , and given for and for , find , such that
(3.6) 
for all .
For the construction of the GMG method in Subsec. 4.2 we discuss the algebraic counterpart of Eq. (3.6) more thoroughly. Firstly, we represent the unknown discrete functions in a temporal basis of by means of
(3.7) 
with coefficient functions and , where for . For the basis
we choose the Lagrange interpolants with respect to the
Gauss–Radau quadrature nodes of . Appreciable of the Gauss–Radau quadrature formula is that the end point of the subinterval is a quadrature node, which simplifies the evaluation of the second term on the righthand side of (3.6). Letting(3.8) 
the coefficient functions and of (3.7) admit the representation
with the vectors of unknown coefficients
(3.9) 
for all degrees of freedom in time in with . Clearly, the vectors denote the coefficients of the velocity component functions , with , with respect to the spacial basis . The coupling of the coefficient functions, defined by (3.7), with the test basis, given by (3.8), in the algebraic formulation of Eq. (3.6) is illustrated in Table 3.1.
Defining now the vector of unknown coefficients for the solution of (3.6) in the subinterval by
(3.10) 
we recover the variational equation (3.6) in an algebraic form as
(3.11) 
for a suitably defined nonlinear function . We refer to [anselmannHigherOrderGalerkin2020] for the explicit derivation of the algebraic system of a related spacetime finite element approximation of the Navier–Stokes system. To solve the nonlinear problem (3.11), we use Newton’s method such that the linear system
(3.12) 
with the Jacobian matrix and righthand side vector
(3.13) 
has to be solved in each Newton iteration for the new iterate
(3.14) 
The block structure of the Jacobian matrix can be deduced from the couplings illustrated in Table 3.1. For brevity, an explicit form of the Jacobian matrix is not given here. For the sake of clearness, we restrict ourselves to presenting the block structure of for the polynomial order in time only. This corresponds to the dG(1) method for the time discretization. For this, we get that
(3.15) 
In (3.15), the partioning of the vector of unkwnons of the corresponding system (3.12) is then given by
where and , with , denote the components of related to velocity and pressure unknowns, respectively. For a more detailed derivation of the the system matrix of (3.15) and the definition of the submatrices and we refer to [anselmannHigherOrderGalerkin2020] again.
More precisely, to enhance the range of convergence of Newton’s method, a damped version using an additional linesearch technqiue is applied for solving (3.11). Alternatively, a ”dogleg approach” (cf., e.g. [pawlowskiInexactNewtonDogleg2008]), that belongs to the class of trustregion methods and offers the advantage that also the search direction, not just its length, can be adapted to the nonlinear solution process, was implemented and tested. Both schemes require the computation of the Jacobian matrix of the algebraic counterpart of Eq. (3.6). In the dogleg method multiple matrixvector products with the Jacobian matrix have to be computed. Since the Jacobian matrix is stored as a sparse matrix, the products can be computed at low computational costs. From the point of view of convergence, both methods yield a superlinear convergence behavior. In our numerical examples of Sec. 5, both modifications of Newton’s method lead to comparable results. In our computational studies, we did not observe any convergence problems for these nonlinear solvers. To solve to linear systems (3.12) of the Newton iteration, we use a flexible GMRES Krylov subspace method [saadGMRESGeneralizedMinimal1986] with a GMG preconditioner based on a local Vanka smoother [turekEfficientSolversIncompressible1999]. The GMG approach is presented in the next section.
4 A parallel geometric multigrid preconditioner
During the last decades numerous methods for solving the algebraic linear systems resulting from the discretization of the Navier–Stokes equations have been developed and studied. GMG methods seem to be among the best classes of solvers that are currently available; cf., e.g., [johnNumericalPerformanceSmoothers2000]. Spacetime finite element methods have recently attracted researchers’ interest strongly. Their application puts an additional complexity to the solution of the linear systems due to their more complex block structrue; cf. Table 3.1 and, e.g., [hussainEfficientNewtonmultigridSolution2014, anselmannHigherOrderGalerkin2020, bastingPreconditionersDiscontinuousGalerkin2017]
. Here, we use the GMG method as a preconditioner for Krylov subspace iterations, which is a standard concept for the efficient solution of highdimensional linear systems arising from the discretization of partial differential equations. The core of GMG methods is the smoother. We propose a cellbased Vanka smoother that is adapted to the spacetime finite element approach.
Even though the basic concepts of GMG methods have become standard in the meantime, their efficient implementation continues to be a challenging task. In particular, this holds if the computational power of modern parallel computer architectures has to be fully exploited. In this case, the definition of data structures and the memory usage become of utmost importance. Moreover, trends like adaptive spacetime finite element methods (cf. [kocherEfficientScalableData2019]) further complicate their implementation. These issues induce an ongoing research about GMG methods; cf., e.g., [clevengerFlexibleParallelAdaptive2021]. For our simulations we use the deal.II finite element toolbox [arndtDealIILibrary2020]. Details of our implementation of the GMG approach in this platform are addressed in the sequel as well. The concepts are flexible enough and can be transferred to similar software tools.
4.1 Key idea of the geometric multigrid method
To sketch briefly the basic principles of GMG iterations and fix our notation, we consider the linear system (3.12), that is rewritten in the simpler, indexfree form
(4.1) 
with a the righthand side vector and the Jacobian matrix . The key idea of the GMG method, that is sketched in Fig. 4.1, is to construct a hierarchical sequence of finite element spaces , with , that are embedded into each other, such that , and correspond to different grid levels with mesh sizes , for , of decompositions of the domain . Instead of solving the linear system (4.1) on the finest grid level entirely, the idea is to smooth only high frequency errors of an initial guess to the solution of (4.1) on the finest grid level with . Clearly, on level , the righthand side vector corresponds to the righthand side vector of the Newton system (3.12). Now, smoothing is done by the application of the local Vanka operator . Then, the resulting residual of (4.1) for the computed approximation of is restricted to next coarser mesh level , with , which yields the righthand side vector . On level , the high frequency errors of an initial guess (given by the null vector ) to the solution’s correction on , with , is smoothened by the application of the local Vanka operator again. These operations of restricting the residual to the next coarser grid and smoothing on this level the error in the solution of the defect equation is recursively repeated until the coarsest mesh level , with , is reached. On this level, typically a direct solver is used to compute the correspondig defect correction . Afterwards the computed defect correction of the coarsest level is prolongated to the next finer grid level , with , and used to update the defect correction . On this level, the defect correction is then smoothed again and, finally, prolongated to the next coarser mesh level , with . These operations of prolongating successively the coarse grid correction and smoothing the modified defect correction are continued until the finest grid level , with , is reached, where after the final smoothing an updated solution is obtained. This GMG approach is summarized in Algorithm 1.
For our implementation of the GMG approach and the simulations presented in Sec. 5, we use the deal.II finite element toolbox [arndtDealIILibrary2020] along with the direct, parallel SuperLU_Dist solver [liSuperLUDISTScalable2003]. Our code is based on the contributions of [clevengerFlexibleParallelAdaptive2021]
to this open source framework and expands their work by a parallel, cellbased Vanka smoother. For the restriction and prolongation steps in parallel computations, the deal.II classes
MultiGrid and MGTransferPrebuilt are used. The latter implements the prolongation between grids by interpolation and applies the transpose operator for the restriction. The core of our GMG approach is the smoother. This operator has to be efficient in smoothing high frequency errors. Further, since the smoother is applied frequently (cf. Fig. 4.1), this demands for its performant and scalable implementation in the multi processor such that the hardware’s potential is fully exploited. Our implementation of the smoother is presented more in detail in the sequel.4.2 A parallel, cellbased Vanka smoother
The Newton linearized system (4.1) of the fully discrete problem (3.6) has a generalized saddlepoint structure; cf. eq. 3.15. The generalization comes through the application of the higher order discontinuous Galerkin time discretization with temporal degrees of freedom (cf. (3.7) and (3.10)) in time for the velocity and pressure variable within each subinterval . Thereby, blocks of saddle point subsystems arise; cf. eq. 3.15. Standard smoothers, like the GaussSeidel or Jacobi method, that are often used in GMG methods, are not applicable to such systems; cf. [benziNumericalSolutionSaddle2005a]. Vanka smoothers, that can be traced back to [vankaBlockimplicitMultigridSolution1986], offer the potential to to damp high frequency errors in the approximation of solutions to saddle point problems. In [johnNumericalPerformanceSmoothers2000, molenaarTwogridAnalysisCombination1991], Vankatype smoothers have demonstrated excellent performance properties for systems with weak velocitypressure couplings. In this work, we adapt the principle of a cellbased, full Vanka smoother of [johnNumericalPerformanceSmoothers2000, p. 460] and extend the definition of the Vanka smoother to our higherorder spacetime finite element approximation of the Navier–Stokes system. Since the numerical results, reported for instance in [turekNumericalStudiesVankaType2009], show that a strong velocitypressure coupling leads to a local violation of the continuity constraint, (2.1b) we only use a discontinuous finite element space for the pressure variable, defined in Subsec. 2.3. This results in the ability to use local test functions, that are defined on a single cell. Therefore, the mass conservation is fulfilled locally, cf. [richterFluidstructureInteractionsModels2017, matthiesMassConservationFinite2007]. Fig. 4.2 illustrates the position of the underlying degrees of freedom for a pair of continuous/discontinuous finite elements for the velocity/pressure variables, corresponding to the case in the definitions of (2.6).
Remark 4.1.
In the deal.II finite element library, we use the FE_DGP⟨ ⟩ class to form a basis of . This basis is constructed using a set of polynomials of complete degree that form a Legendre basis on the unit square, i. e. they are orthogonal and normalized on the reference cell. Noteworthy, this element is not a Lagrangian one, so it is not defined by finding shape functions within the given function space that interpolate a particular set of points. Therefore, in fig. 4.2 the pressure DoF positions symbolically just stand for the number of basis functions on each element and not for the corresponding position of nodal interpolation points.
Now, we define the cellbased Vanka smoother for the proposed spacetime finite element approximation. This part generalizes previous work on the local Vanka smoother to the higher order time discretization. On the mesh level (cf. Fig. 4.1) and in a single iteration step, the cellbased Vanka smoother is applied to all coefficient subvectors of the spatial degrees of freedom corresponding to the respective mesh cell . On grid level , with , we let the solution vector of (4.1) be subdivided in terms of velocity and pressure subvectors according to the structure of the solution vector , defined in (3.10), of the nonlinear system (3.11), such that
(4.2) 
Here, and denote the number of (global) degrees of freedom for the velocity and pressure variable on grid level , where is defined for the finest mesh level . The subvectors for , correspond to the velocity values (or their corrections, respectively) and the subvectors for , to the pressure values (or their corrections, respectively) on the grid level . With the amount of the local pressure degrees of freedom on each element, , we then denote by the subvector of that is built from the degrees of freedom in that are associated with the element , such that
(4.3) 
Here, the subvectors for , correspond to the velocity values on the element and the subvectors for , to the pressure values. Further, for righthand side vector of (4.1) on grid level we let
(4.4) 
with the partition of into subvectors induced by (4.2). On a single cell , the local Jacobian matrix is defined as
where, in contrast to (3.13), we skipped the index of the time interval and the Newton step for brevity. On such a cell , the smoothing operator is then defined by
(4.5) 
In (4.5), the vector denotes the local subvector of , that is obtained by condensing to its components corresponding to the mesh cell , similarly to (4.3). The full application of the smoother then comes through running over all cells of the corresponding mesh level and applying the local smoother to each of the elements.
The appreciable advantage of the Vanka smoother is that the system, that has to be solved on each cell, or the inverse of the local Jacobian matrix , respectively, is small compared to the global system with system matrix . This will be addressed further in the next subsection. The efficiency of the application of the Vanka smoother in complex simulations with a high number of mesh cells depends on two ingredients:

The efficient application of : How are the local systems defined by (4.5) solved?

The efficient data exchange in the parallel environment: How are the data for computing or , respectively, assembled?
These two issues are discussed in the following.
4.3 Efficient application of
The implementation of the operator is an important ingredient for the efficiency of the GMG approach in computations, since the Vanka smoother is applied . We recall that the GMG method is used as a preconditioner in GMRES iterations for solving the Newton linearized system of each subinterval . We also refer to Fig. 4.1 illustrating the usage of smoothing steps on the grid levels of a GMG Vcycle. In our implementation of the GMG method, inverses of the elementwise Jacobian matrices , for all with , are precomputed after each update of the Jacobian matrix . For this, we use LAPACK routines to precompute the matrices and store them in a hashed unordered_map. If has to be applied on a cell according to (4.5), the costs for looking up the corresponding inverse is an operation with a complexity of order . The costs in terms of memory for storing each inverse is, for instance, or for a threedimensional problem for the spatial approximation by the – pair of finite elements (corresponding to in (2.6)) and the time discretization (corresponding to in (3.7)) on a 64bit machine, plus some (negligible) additional overhead to store, for instance, the hashes. For a time discretization (corresponding to in (3.7)), the local Jacobian is a matrix and the needed amount of memory is or . Since the code is parallelized, every process has to store only the information, data and inverses of the cells that it owns. Therefore, the additional amount of memory, that is needed in each process, can be decreased by increasing the number of involved processors.
4.4 Efficient data exchange in parallel environments
For precomputing the inverses of the local Jacobian matrices , the entries of in the element have to be computed. If the code is executed in parallel by multiple processes, the data access problem that is sketched in Fig. 4.3 occurs. When the local matrix on is assembled by the process 1, all the needed matrix entries of the global Jacobian matrix are available, and can be copied to the local Jacobian , since process 1 owns all of the involved degrees of freedom. In contrast, process 2 doesn’t own the degrees of freedom on the face separating and , since in a parallel environment every process has only read access to the entries it owns. For computing , the entries in the global matrix of process 1, corresponding to the degrees of freedom on the common interface, are required.
To provide and exchange the needed data efficiently, the following data structure, called map_proc_row_column_value, is generated on each of the involved processes. It involves the following, nested containers (from top to bottom).

dealii::MGLevelObject⟨ ⟩: The top level object, that contains the next elements for each mesh level . The MGLevelObject⟨ ⟩ is basically a container like std::vector⟨ ⟩, but with the option to shift indexing. So if the coarsest mesh starts e. g. on level one can access the elements inside this object with the operator and an index, starting from 2 onward.

std::map⟨unsigned int, std::unordered_map⟨⟩ ⟩: A map, whose keys are the process numbers (an unsigned int) of the neighboring processes, that own certain degrees of freedom. These are exactly those degrees of freedom, that the owning process needs to access during the assembly of the local Jacobian . The process numbers are obtained by Algorithm 2. The value of the map is a (hashed) unordered_map, which leads to the next container inside the structure:

std::unordered_map⟨ std::pair⟨unsigned int, unsigned int⟩, double⟩: For each neighboring process a hashed, unordered_map is stored, that contains the global row and column number (both unsigned int) of the needed matrix entries and assigns them to the corresponding value, which is stored as a double. Internally the std::pair of global row and column numbers is stored as hashed value. Since the C++ standard library doesn’t include a default hashing algorithm for std::pair, this is done with the operation shown in fig. 4.3. In line 4.3 a std::hash object is created and the operator is called with the first unsigned int variable of the std::pair. The result is then stored in a variable called ”hash” and in line 4.3 this hash is combined with the second unsigned integer of the std::pair using a boost operation. Basically each time an inverse is inserted or accessed into the std::unordered_map, this hashing algorithm is executed and access operations have an average complexity of .
After generation of the mesh hierarchy, every process executes the Algorithm 2.
[h!t] [linenos=true,escapeinside=!!]c++ struct hash_pair template ¡class T1, class T2¿ size_t operator()(const std::pair¡T1, T2¿ &p) const auto hash = std::hash¡T1¿(p.first); !! boost::hash_combine(hash, std::hash¡T2¿(p.second)); !! return hash; ;
Remark 4.2.
If the underlying mesh or the distribution of the degrees of freedom to the involved processes is fixed, which is especially the case if no remeshing is necessary between timesteps, then Algorithm 2 needs to be executed only once in the simulation. For instance, this holds in the numerical examples of Sec. 5. In the simulation of flow problems on evolving domains by CutFEM approaches, that are currently focused strongly (cf. [vonwahlUnfittedEulerianFinite2020, burmanEulerianTimesteppingSchemes2020]), this applies similarly and results in negligible computational costs.
Afterwards the simulation is continued until all contributions of the global system matrix on all levels are assembled. For building the Vanka smoother by assembling and storing the local matrices , the respective sparse matrix entries have to be exchanged such that every process can access the entries of each local Jacobian matrix . This is done by Algorithm 3.
Remark 4.3.
Algorithm 3 has to be executed whenever the global Jacobian in (4.1) is updated. To block as less resources as necessary, just the processes that need to exchange information, communicate with each other. The exchange of data is implemented in the object map_proc_row_column_value. The data transfer is done entirely in one single step (see Algorithm 3), instead of querying the data, that is needed for a single cell from foreign processes, in the assembly routine. This exchange of information is then cached in the memory by the object map_proc_row_column_value. By this approach, we reduce the communication between processes to an absolute necessary minimum.
Remark 4.4.
To call MPI functions that transfer data between processes, these data have to be serializable. The C++17 version of std::unordered_map included in the standard library is by default not serializable. In the code of this work, the boost C++ library is used that provides serialization capabilities for std::unordered_map via the headers serialization.hpp and unordered_map.hpp.
5 Numerical examples
In the following we analyze computationally the performance properties of the proposed GMG approach. This is down for the wellknown benchmark problems of low around a cylinder; cf. [schaferBenchmarkComputationsLaminar1996]. Our computations were done on a Linux cluster with 96 nodes, each of them with 2 CPUs and 14 cores per CPU. The CPUs are Intel Xeon E52680 v4 with a base frequency of , a maximum turbo frequency of and a level 3 cache of . Each node has of main memory. In this work, scaling experiments on up to the user limit of 32 nodes were performed.
5.1 Flow around a cylinder in two space dimensions
In the first numerical experiment we consider the wellknown 2d DFG benchmark setting of flow around a cylinder, defined in [schaferBenchmarkComputationsLaminar1996]. The problem setting is illustrated in Fig. 5.1. Quantities of interest and comparison in the simulations are the drag and lift coefficient of the flow on the circular cross section (cf. [schaferBenchmarkComputationsLaminar1996]). With the drag and lift forces and on the rigid circle given by
(5.1) 
where is the normal vector on and is the tangential velocity , the drag and lift coefficient are defined by means of