High-order matrix-free incompressible flow solvers with GPU acceleration and low-order refined preconditioners

by   Michael Franco, et al.

We present a matrix-free flow solver for high-order finite element discretizations of the incompressible Navier-Stokes and Stokes equations with GPU acceleration. For high polynomial degrees, assembling the matrix for the linear systems resulting from the finite element discretization can be prohibitively expensive, both in terms of computational complexity and memory. For this reason, it is necessary to develop matrix-free operators and preconditioners, which can be used to efficiently solve these linear systems without access to the matrix entries themselves. The matrix-free operator evaluations utilize GPU-accelerated sum-factorization techniques to minimize memory movement and maximize throughput. The preconditioners developed in this work are based on a low-order refined methodology with parallel subspace corrections. The saddle-point Stokes system is solved using block-preconditioning techniques, which are robust in mesh size, polynomial degree, time step, and viscosity. For the incompressible Navier-Stokes equations, we make use of projection (fractional step) methods, which require Helmholtz and Poisson solves at each time step. The performance of our flow solvers is assessed on several benchmark problems in two and three spatial dimensions.



There are no comments yet.


page 17


Low-order preconditioning of the Stokes equations

Low-order finite-element discretizations are well-known to provide effec...

Efficient low-order refined preconditioners for high-order matrix-free continuous and discontinuous Galerkin methods

In this paper, we design preconditioners for the matrix-free solution of...

Textbook efficiency: massively parallel matrix-free multigrid for the Stokes system

We employ textbook multigrid efficiency (TME), as introduced by Achi Bra...

Initial Guesses for Sequences of Linear Systems in a GPU-Accelerated Incompressible Flow Solver

We consider several methods for generating initial guesses when iterativ...

A quantitative performance analysis for Stokes solvers at the extreme scale

This article presents a systematic quantitative performance analysis for...

Anderson acceleration for contractive and noncontractive operators

A one-step analysis of Anderson acceleration with general algorithmic de...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1. Introduction

High-order finite element methods have the potential to attain higher accuracy per degree of freedom than low-order alternatives 

[21, 54]. Due to these potential computational savings, many high-order methods have been developed to solve a diverse range of computational fluid dynamics (CFD) problems [20, 4, 40]. Moreover, the high arithmetic intensity of these algorithms makes them a prime target for use on graphics processing units (GPUs) and GPU-accelerated architectures of current and future supercomputers [31, 53, 13].

However, the benefits of high-order methods do not immediately imply that running increasingly higher order simulations for a fixed problem size will result in more efficient simulations. In general, Galerkin finite element methods couple all degrees of freedom (DoFs) within each mesh element. Therefore, the memory required to store the resulting system matrices grows quadratically with the number of DoFs per element (i.e.  in spatial dimensions, where is the polynomial degree). Furthermore, the naive assembly of the system matrix requires operations, although sum-factorization techniques can lower this to  [35]

. Thus, traditional matrix-based approaches are impractical for use with high polynomial degrees, both in terms of computational cost and memory requirements. Instead, matrix-vector products can be replaced with on-the-fly evaluations of the discretized differential operators in a matrix-free manner. Combined with sum-factorization techniques on tensor-product elements originally developed in the spectral element community 

[36, 38], evaluation of these matrix-free operators can be performed in operations and memory [41]. Matrix-free evaluations with sum factorization have been shown to outperform sparse matrix-vector products with due to bandwidth bounds on modern architectures [34]. Furthermore, optimized implementations of these operators achieve near peak performance on modern GPUs [13, 49], making matrix-free high-order operators a desirable choice for performant implementations.

For compressible CFD problems where explicit time stepping can be employed, matrix-free operators implemented on the GPU have been used to great effect [32, 53]. However, most incompressible flow solvers require the solution of large, sparse linear systems [21], thus motivating the development of matrix-free solvers. Krylov subspace methods are a natural choice for matrix-free solvers, but they require effective preconditioners in order to obtain good performance [6]. Therefore, in this work we develop matrix-free preconditioners to solve the linear systems arising from high-order tensor-product finite element discretizations of the steady Stokes, unsteady Stokes, and unsteady incompressible Navier-Stokes equations. Particular emphasis is placed on solver robustness with respect to discretization and mesh parameters. In recent years, there has been much work on the topic of matrix-free preconditioning for high-order discretizations. Matrix-free multigrid methods using point Jacobi and Chebyshev smoothing were considered in [45] and [35]. Matrix-free tensor-product approximations to block Jacobi preconditioners for discontinuous Galerkin discretizations were constructed in [41] and [42]. In this work, we extend sparse, low-order refined preconditioners [36, 22, 16] with parallel subspace corrections, originally described for diffusion problems in [39]. The resulting preconditioners are robust in both the mesh size and polynomial degree.

The structure of this paper is as follows. In Section 2, we will briefly describe the equations governing incompressible fluid flow. Section 3 will describe the high-order spatial discretization of these equations, with particular emphasis on the implementation of the matrix-free operators. Section 4 will discuss the temporal discretization, using both Runge-Kutta methods and projection methods. Matrix-free solvers and preconditioners will be developed in Section 5. The performance of our GPU implementation of the matrix-free operators will be analyzed in Section 6. Finally, we will present numerical results verifying the - and -robustness of our solvers in 2D and 3D in Section 7.

2. Governing equations of incompressible flow

Of interest in this work are the equations governing incompressible fluid flow. In particular, we consider both the incompressible Stokes and Navier-Stokes equations in spatial dimensions (). The spatial domain is denoted . The unknowns are velocity and pressure . The unsteady incompressible Navier-Stokes equations with Dirichlet boundary conditions are


where we assume a uniform density. Here, is the kinematic viscosity, is a known forcing term, and is the specified Dirichlet boundary condition. In the limit of low Reynolds numbers, we are also interested in the linear Stokes equations, which can be obtained by neglecting the nonlinear convection term from (1), resulting in


The steady versions of (1) and (2) will also be considered.

3. Spatial discretization via high-order finite elements

The governing equations are discretized using a high-order continuous finite element method. First, the spatial domain is discretized using an unstructured mesh consisting of tensor-product elements (mapped quadrilaterals in two dimensions and mapped hexahedra in three dimensions). Note that these element mappings allow for high-order or curved elements. We define the following finite element function spaces on :


We will use nodal, Gauss-Lobatto tensor-product bases on each of the spaces and . We write the finite element formulation for (2) as: find a velocity-pressure pair such that


Expanding out , , , and in terms of the bases for their respective spaces, we obtain the semi-discrete Stokes problem:


where and are now reused to represent the vectors of coefficients of the high-order polynomial basis functions approximating their continuous counterparts. is the vector mass matrix, is the vector stiffness matrix, is the gradient operator, and is the divergence operator with the following definitions:


The bases and span the velocity space and pressure space , respectively. In the steady Stokes problem we take , so (5) reduces to the linear system


A similar treatment of the incompressible Navier-Stokes equations (1) yields the semi-discrete problem:


where is the discretized nonlinear vector-convection term defined by


Equation (12) uses the notation to represent the tensor of all , so that can be viewed as a matrix of size where each entry is the vector


When solving the steady (10) and unsteady Stokes (5) problems, we use the Taylor-Hood finite element space, , which achieves optimal convergence rates and is stable for orders  [10]. When solving the incompressible Navier-Stokes problem (1), we use the so-called space, [26].

3.1. Matrix-free operators

Figure 1. Runtime comparison of standard matrix-based and sum-factorized matrix-free operator setup and evaluation for scalar Laplacian in 2D and 3D. Standard (naive) matrix assembly scales like and matrix-based operator evaluation scales like , while matrix-free setup scales like and matrix-free operator evaluation scales like .

Explicitly storing and assembling the matrices described by (6)–(9) typically requires memory and operations. When running simulations at high polynomial degrees, especially on the GPU, this cost is impractical. Therefore, we turn to sum-factorization techniques to replace our matrix-vector products with on-the-fly evaluations of the integrals [49, 41]. These integrals are efficiently implemented by summing the contributions from each element of the mesh. For the purposes of illustration, let us consider a matrix-free implementation of the local scalar stiffness matrix on a hexahedral () element :


We assume there is an isoparametric mapping from the reference cube to element with Jacobian . Mapping to the reference cube, we have


where now the operators and basis functions are understood to be in the reference space, parameterized by

. Choosing our bases to be the tensor product of 1D Lagrange interpolating polynomials defined on the

Gauss-Lobatto points, we have


for all . Choosing tensor product quadrature points to evaluate the integrals in (15), we can also denote the quadrature nodes and weights using this same multi-index notation. That is, the quadrature weights and points are and . Thus, (15) becomes


As an aside, the solution evaluated at a quadrature point is


We immediately notice that the number of operations required to evaluate a function at each of the quadrature points is , or in general.

Kronecker products allow us to simplify notation. First, we define the one-dimensional Gauss point evaluation matrix as the Vandermonde-type matrix obtained by evaluating each of the one-dimensional basis functions at all of the quadrature points:


With this notation, is equivalent to the computation of the Kronecker product


Likewise, we define the 1D Gauss point differentiation matrix as


Returning to (17), we are now ready to recast the operator as a Kronecker product of local 1D matrices. First, we precompute the tensor , which is diagonal in its first two dimensions and is defined by


Next, we recognize that evaluating at all the quadrature points means contracting with the tensor defined by


Therefore, the local operator (17) is equivalent to


where the transpose is taken only over the first two dimensions of . A major benefit of reformulating the local operator into this form is that it is no longer necessary to form a global matrix. Instead, the action of this operator can be recreated using the 1D matrices and and the precomputed tensor . Precomputing requires operations and memory. Applying the operator and its transpose requires operations and memory.

Using these sum-factorization techniques, we can also derive matrix-free implementations of , , , and from their definitions. Each operator has comparable computational cost and memory requirements as in the case of this scalar diffusion operator.

Figure 1 shows run times for sum-factorized matrix-free operator setup and evaluation compared with standard matrix assembly and matrix-based operator evaluation for polynomial degrees between 2 and 16. We notice that matrix assembly is the most expensive operation for all the cases tested, usually taking about two orders of magnitude more time than operator evaluation. In 2D, our implementation of matrix-free operator evaluation is more efficient than matrix-based operator evaluation for polynomial degrees greater than 5. In 3D, the matrix-free sum-factorized operator evaluation is more efficient for polynomial degrees greater than 2. In the matrix-free context, the setup and precomputations typically represent a negligible portion of the overall cost of using these operators.

4. Temporal discretization

For the time dependent problems (1) and (2

), we discretize in time using several time integration methods. Broadly speaking, these are classified as split (i.e. projection or fractional step) methods, or unsplit methods. For the unsplit methods, we use the method of lines to first discretize in space and then temporally discretize the resulting system of ordinary differential equations (

11) and (5). Here, we use diagonally implicit Runge-Kutta (DIRK) schemes as our time-integration method [2]. On the other hand, split methods such as projection-type methods can be developed in order to decouple the solution of the velocity and pressure components. As a result, these methods can be computationally efficient, at the cost of incurring splitting and other approximation errors. Each of these methods will be discussed in greater detail in the following sections.

4.1. DIRK methods

Consider the system of ordinary differential equations


obtained from a finite element spatial discretization. Here, represents a vector of degrees of freedom, the finite element mass matrix, and the potentially nonlinear finite element residual vector. Let denote the known solution at time . A general -stage Runge-Kutta method to approximate the solution at time can be written as


where the coefficients and can be expressed compactly in the form of the Butcher tableau,


A Runge-Kutta scheme is diagonally implicit if its Butcher matrix is lower triangular, allowing for the solution of the system of equations (27) through a forward-substitution procedure. Of particular interest are the so-called singly diagonally implicit Runge-Kutta schemes, which use identical diagonal coefficients . These schemes allow for reuse of preconditioners, since each stage requires the solution of a backward Euler system with the same time step .

The application of DIRK methods to the systems (1) and (2) is complicated by the fact that there is no temporal evolution equation corresponding to the pressure. Thus, after spatial discretization we obtain (11) and (5), which are systems of differential-algebraic equations (DAEs) [29, 44]. To avoid this difficulty, we reformulate the DIRK method in a manner suitable for solving DAEs [27, 9]. We first require that the DIRK scheme be stiffly accurate, i.e. that and . Then, consider the general differential-algebraic system


We define the approximate solution to the differential variable at the th stage by


Analogously to (27), the stage derivatives are given by


Multiplying (32) by the mass matrix and inserting (33), we obtain


which, when augmented with the constraint


results in a system of equations for the th stage approximations for the differential and algebraic variables and . Because the DIRK schemes under consideration are stiffly accurate, the values at the next time step are given by the final stage approximations


Applying this DIRK method to the semi-discrete Stokes problem (5) requires solving the linear system


every Runge-Kutta stage, with right-hand side given by


This fully discrete linear system (37) and its steady state counterpart (10) are saddle point problems [6]. We present robust solvers for such saddle point systems in Section 5.4.

4.2. Projection methods

Projection methods, first introduced by Chorin in 1967, are a class of split methods for the temporal integration of the incompressible Navier-Stokes equations [18, 19]. These methods have the attractive feature that they only require the solution to uncoupled, positive-definite problems, instead of the coupled, saddle-point type problems that are required by the DIRK methods described in Section 4.1. For this reason, projection and fractional-step methods have become extensively used for incompressible flow problems [21, 43]. The original method of Chorin has since been modified and extended to a wide range of variants [5, 12, 30, 51, 37]. See e.g. [26] for a review and analysis of a selection of these variants.

Following the method presented in [51], we use equal order polynomial degrees for velocity and pressure, often known as a formulation. This method uses an implicit-explicit time-integration scheme for the viscous and convective terms respectively, thereby avoiding the need to solve a nonlinear system of equations at every time step. We use a BDF method of order for the implicit terms and an extrapolation method of order for the explicit terms with corresponding coefficients and  [51, 37]. First, we introduce the linear term and nonlinear term as well as their time-extrapolated versions,


Directly applying a BDF method to (1) yields


where is assumed to be known a priori. Introducing to represent all known terms at a given time step,


we can simplify (41) to


Unfortunately, despite using a order time-integration scheme, this method yields at most first-order convergence in time for velocity, as shown in [30] and later proved by [26]. This is caused by splitting errors and large divergence errors on the boundary of the domain. Therefore we use the velocity-correction formulation presented in [30], where the linear term is instead expressed as


using well-known vector calculus identities. This alternative form of the linear term imposes the incompressibility constraint from (1) weakly, by setting the first term in (44) equal to zero.

In order to solve for pressure, we first rearrange (43) and take the divergence of both sides in order to get


Notice how the first right hand side term of (45) vanishes due to the incompressibility constraint. Equation 46 is closed by the boundary condition,


where is the outward pointing normal vector. We use the known Dirichlet boundary condition to evaluate (47). In the case of a pure Neumann boundary condition, we close the system with a mean-zero condition on pressure:


Therefore, this projection method computes in three steps. First, the extrapolated contributions from the nonlinear and forcing terms are combined to compute via (42). Second, we solve for in the pressure-Poisson problem (46), closed with (47) and (48). Finally, we return to (43) and solve for in the following Helmholtz problem:


This projection method is th order in time for velocity (up to [26]. As previously mentioned, a major benefit of this method is its computational efficiency. Each time step requires only one new nonlinear evaluation, one Poisson solve, and one Helmholtz solve. We will discuss the matrix-free solvers that we use for these sub-problems in Section 5.3.

5. Solvers and matrix-free preconditioners

The numerical methods described above require solving large, sparse linear systems. The fully discrete, steady Stokes equation requires solving the saddle-point linear system (10). The time discretization of the unsteady Stokes equation (2) by a DIRK method results in a sequence of saddle point problems (37). The velocity-correction schemes require the solution of a Poisson problem (46) for the pressure and a Helmholtz equation (49) for the velocity. Additionally, the nonlinear extrapolation requires the inversion of the velocity mass matrix.

In order to solve these systems, we make use of preconditioned Krylov subspace methods, such as the conjugate gradient (CG) and generalized minimal residual (GMRES) methods [47]. These iterative solvers are a natural choice for matrix-free solvers, since they only require the action of the operator, which we compute using the matrix-free algorithms described in Section 3.1, and the evaluation of a preconditioner. The main challenge associated with the matrix-free solution of high-order flow problems is constructing efficient preconditioners that result in iteration counts that are independent of the discretization parameters , , and . In this section, we describe the construction of robust preconditioners that do not require the assembly of the high-order system matrices. We begin by describing our preconditioning strategy for the relatively simpler sub-problems, which can then be combined to create effective preconditioners for the more challenging, coupled problems.

5.1. Collocated mass preconditioning

In order to precondition the mass matrix, we make use of a diagonal preconditioner based on collocated quadrature [25]. Because we use a nodal Gauss-Lobatto basis for the finite element spaces, integrating at the same set of quadrature points results in a diagonal matrix, which is constructed and inverted in constant time and memory per degree of freedom. This matrix is spectrally equivalent to the fully-integrated mass matrix, with constants of equivalence independent of and , and so the number of solver iterations remains uniformly bounded with respect to the mesh size and polynomial degree [50].

5.2. Low-order refined preconditioners for Poisson and Helmholtz problems

Matrix-free preconditioners for the symmetric positive definite Poisson and Helmholtz problems form the fundamental building blocks for our robust fluid solvers. These preconditioners are described in detail in [39], and are based on the spectral equivalence between the high-order finite element discretization, and a low-order () finite element discretization on a Gauss-Lobatto refined mesh. This equivalence is often refereed to as the finite element method–spectral element method (FEM-SEM) equivalence [17]. The low-order finite element discretization results in a sparse matrix with nonzeros per row independent of , the polynomial degree of the original high-order discretization. Therefore, the memory requirements and computational cost to assemble the low-order matrix are both optimal, scaling linearly in the number of degrees of freedom.

Consider a scalar Poisson or Helmholtz problem


for , where is a non-negative (but possibly zero) constant. We begin by constructing a low-order refined (LOR) operator . Each of the LOR operators is obtained by a standard low-order finite element discretization on a refined mesh . This mesh is obtained by subdividing each element into the parallelepipeds defined by the Cartesian product of the one-dimensional Gauss-Lobatto points. Figure 2 illustrates one such low-order refined mesh. The identity operator can be used to map DoFs from the high-order finite element space using a Gauss-Lobatto basis to the low-order refined space. It can be shown that the low-order refined mass matrices and stiffness matrices and are spectrally equivalent to their high-order counterparts, and [14, 17, 15]. Therefore, is spectrally-equivalent to , and a robust preconditioner for is, in turn, a robust preconditioner for the original high-order matrix . The advantage of the matrix over is its greatly increased sparsity, requiring only nonzeros per row. As a consequence, this matrix can be explicitly assembled and stored, allowing for the construction of sophisticated preconditioners.

Figure 2. Illustration of the low-order refined methodology with , showing high aspect ratio elements near the coarse element interfaces. Left: original high-order mesh . Right: Gauss-Lobatto refined mesh .

The main challenge associated with constructing effective preconditioners for is the high aspect ratio associated with the low-order refined mesh . Because the Gauss-Lobatto points are clustered near the endpoints of the interval, the resulting Cartesian product mesh consists of parallelepipeds with aspect ratios that scale like [11]. As a result, the mesh is not shape-regular with respect to , and standard multigrid-type methods will not result in uniform convergence under -refinement. In order to address this issue, we make use of a structured geometric multigrid V-cycle with ordered ILU smoothing to treat the anisotropy of the problem. This ordering ensures the number of multigrid iterations is independent of . Finally, this method is parallelized using an overlapping patch-based additive Schwarz method, that is also shown to be robust in and  [39].

5.3. Application to the projection method

The projection method described in Section 4.2 requires solvers for the vector mass matrix, the pressure Poisson problem (46) and the Helmholtz problem (49). Each of these problems is symmetric positive-definite, and so we can use a preconditioned conjugate gradient solver with the preconditioners described in Section 5.1 and Section 5.2. Because the pressure Poisson problem (46) with pure Neumann conditions has a null space consisting of constant functions, care must be taken to ensure that the right-hand side is orthogonal to the null space. Therefore, each application of the preconditioner is augmented with an orthogonalization step to ensure convergence.

5.4. Block preconditioners for Stokes

In order to solve the coupled Stokes problems (10) and (37), we make use of block-preconditioning techniques, allowing us to reuse the preconditioners for the Poisson and Helmholtz problems. Consider the representative saddle point system,


We define the block triangular preconditioner


where is the Schur complement. Because the preconditioner is not symmetric, we use the flexible GMRES method [46], instead of CG, to solve the preconditioned system. It can be shown that the spectrum of the preconditioned system

consists only of a single eigenvalue, and hence GMRES will converge in at most two iterations 

[6]. Applying this preconditioner requires solving linear systems with the matrices and . However, the Schur complement is in general dense, and hence is impractical to form or invert. Therefore, in this work, we provide a matrix-free operator that approximates the action of .

Additionally, we replace the action of with the application of several uniformly-preconditioned conjugate gradient iterations on . For the steady Stokes system, , so we can use the matrix-free low-order refined preconditioner defined in Section 5.2. Likewise, for the unsteady Stokes system, , so we can use the previously-defined Helmholtz preconditioner. While these are vector linear systems, both operators decouple the dimensions. Therefore, applying the scalar preconditioners times (once in each dimension) directly provides a preconditioner for these vector linear systems. Since the CG iterations do not correspond to a fixed linear operator, it is important that we make use of flexible GMRES as an outer iteration. The sub-problems need not be solved exactly, and empirically two or three CG iterations are sufficient to provide an effective preconditioner.

For the steady Stokes system, . Before creating an approximate solver for , we notice that . To construct the approximate solver , we make the standard commutativity approximation . Then, we see that , and so


That is, the mass matrix provides an approximation to . Note that this approximation is, in fact, exact when the operators and commute, such as in the case of periodic boundary conditions. The action of the approximate solver is given by the diagonal mass preconditioner described in Section 5.1. In practice, approximating the action of by the action of doubles to triples the iterations of the iterative solver. However, inverting is infeasible, whereas applying is efficiently performed in time.

Likewise, for the unsteady Stokes system, the Schur complement is given by


Using the same commutativity approximation as in the steady case, we obtain


and so


Therefore, we take


From (58), we can apply the action of matrix-free by once again reusing the Poisson solver from Section 5.2 and the diagonal from Section 5.1.

6. GPU implementation

We have implemented the numerical algorithms described above in the framework of the MFEM finite element library [3, 1]. These algorithms take the form of single-source compute kernels can target several different backends, including OpenMP on traditional CPUs as well as CUDA for use on the GPU. In this section, we will describe important practical details that were required to obtain performant implementations for these algorithms.

Figure 3. Throughput for nonlinear vector convection evaluation in 3D for several polynomial degrees on the GPU. Left: Initial implementation. Right: Elementwise shared-memory implementation. Efficiently reusing shared memory increases throughput by a factor of between 4 and 8 and also allows for the solution of larger problems.
Figure 4. Throughput for linear operator evaluation kernels , , , in 3D for orders to on the GPU. Maximum throughput is achieved for higher orders () and larger problems (more than DoFs).
Figure 3. Throughput for nonlinear vector convection evaluation in 3D for several polynomial degrees on the GPU. Left: Initial implementation. Right: Elementwise shared-memory implementation. Efficiently reusing shared memory increases throughput by a factor of between 4 and 8 and also allows for the solution of larger problems.

Firstly, it is important to minimize memory movement between the CPU and GPU. Modern GPU architectures have limited memory and cache sizes compared to their CPU counterparts, but can reach higher peak performance in terms of floating point operations. This combination means that performance is only achievable if algorithms reach higher arithmetic intensities [35], thus motivating high-order matrix-free operators and solvers that have this potential.

We have found that using loop bounds known at compile time drastically improves performance. In -dependent compute kernels, these compile-time constants permit shared memory access within an element, thus reducing memory allocations and movement. Figure 4 shows the benefits of utilizing shared memory for the intra-element operations. We see that the shared-memory implementation outperforms the naive implementation by a factor of -. In practice, just-in-time compilation or explicit template instantiation can be used to allow for arbitrary-order simulations.

Figure 4 shows the throughput plots of our optimized implementations of the matrix-free linear operator evaluations in 3D. These results were performed on a single Nvidia V100 GPU, showing the throughput achieved for various problem sizes and polynomial degrees. Similar benchmark problems were used to assess kernel and backend performance in the context of the CEED project [23].

We believe that our implementation can be improved by better taking advantage of the GPU’s shared memory. First of all, symmetries exist in the 1D interpolation and differentiation matrices (20) and (22) since the Gauss-Lobatto nodes are symmetric about the origin. Taking advantage of these symmetries would approximately halve shared memory usage and simultaneously increase the arithmetic intensity of all operator evaluations. In some sense, this improvement can be viewed as an extension of the original sum-factorization techniques, which use inherent operator symmetries to avoid extra memory storage. Moreover, we see from Figure 4 that our implementation achieves lower throughput at lower orders. It may be possible to address this issue by combining several low-order elements per thread block to yield higher throughput. Unfortunately, the shared-memory requirements of our approach increase with order and dimension , so exhausting the shared memory is inevitable with very high order simulations. If necessary, one could compensate by storing and in the GPU’s global memory to allow for even higher order simulations.

7. Numerical results and performance analysis

7.1. Sub-problem solver performance

We first assess the performance of the matrix-free sub-problem preconditioners by measuring the number of Krylov iterations required to converge to a fixed tolerance under - and -refinement. We solve the linear system , where is either the mass matrix, Laplacian operator, or positive-definite Helmholtz operator. We write the Helmholtz operator as , and choose two representative time steps: and . For each of the problems considered, we use a preconditioned conjugate gradient iteration to solve the problem, with a relative residual tolerance of as a stopping criterion. The right hand side is taken to be a random vector. For the mass matrix, the preconditioner is the collocated diagonal preconditioner described above, and for the Helmholtz and Poisson solvers, we use the low-order refined parallel subspace correction procedure.

Figure 5. Iteration counts for sub-problem solvers under - and -refinement. For the case of -refinement, we use a fixed polynomial degree of .

To perform the -refinement study, we use a fixed Cartesian grid with 64 elements in two dimensions, and polynomial degrees from to . The number of iterations required to converge to tolerance is shown in Figure 5. We note that the number of iterations remains bounded for all problems and for all polynomial degrees. We observe a slight pre-asymptotic increase in the number of iterations for the Poisson and Helmholtz problems, but the iteration counts remain below 20 for all cases. The number of iterations required for the mass solve decreases with increasing polynomial degree. This is corroborated by an eigenvalue analysis, which shows decreasing condition number of the preconditioned system with increasing polynomial degree.

For the case of -refinement, we fix the polynomial degree to be , and perform a sequence of uniform refinements. The initial mesh is a Cartesian grid with 841 DoFs. We perform five refinements, so that the finest mesh is with 804,609 DoFs. The number of iterations required to converge to tolerance is shown in Figure 5. Here, we observe approximately constant iterations, independent of the mesh refinement. These examples verify the robustness of the sub-problem preconditioners with respect to the mesh size and polynomial degree.

7.2. Steady-state Stokes flow

Figure 6. velocity error showing high-order spatial convergence for steady-state Stokes. Polynomial degrees and are used for the velocity finite element space.

To verify the high-order accuracy of the spatial discretization, and to test the convergence properties of the solver, we solve the steady Stokes equations with a smooth solution in two spatial dimensions. Setting , we choose the right-hand side to be

The exact solution is then given by

The exact solution is imposed as a Dirichlet boundary condition for velocity on all domain boundaries. We run this case with polynomial degrees from to on a sequence of uniformly refined Cartesian meshes. We solve the resulting linear system using the FGMRES method with the matrix-free block triangular preconditioners described in Section 5.4. The stopping criterion for the iterative solver is a relative residual norm of . The spatial convergence is shown in Figure 6. The expected order of accuracy was observed in all cases, verifying high-order spatial accuracy. In Table 1, we show the error for velocity and pressure for cases considered, together with the number of FMGRES iterations required to converge the solution. The number of iterations shows a slight pre-asymptotic increase, but remains bounded with respect to both and .

Rate Rate Its.
1 31
4 6.99 5.60 41
16 7.04 5.46 43
64 8.21 8.30 46

Rate Rate Its.
1 35
4 11.32 13.66 42
16 11.08 9.01 47
64 12.23 11.89 49

Rate Rate Its.
1 39
4 15.48 26.10 45
16 13.57 8.11 52
64 -0.44 -0.49 53

Rate Rate Its.
1 41
4 17.82 34.39 47
16 1.25 -1.21 52
64 -0.33 -0.22 55
Table 1. Error and convergence results for steady Stokes equation, showing error norms for velocity and pressure, and number of FGMRES iterations required to reduce the residual by a factor of .

7.3. Unsteady Stokes flow

Figure 7. Performance of unsteady Stokes operator in 3D. Our implementation achieves expected rates of and for matrix-free setup and evaluation of the block operator, respectively. Shared-memory GPU implementation of operator evaluation outperforms 20-core CPU implementation by a factor of 6 at and 11 at .

We evaluate the performance of our GPU implementation of the coupled unsteady Stokes operator defined by (37). Figure 7 records the 3D operator setup and evaluation times under -refinement for a fixed mesh with 32,768 hexahedral elements. The GPU results were computed using one Nvidia V100 GPU, while the CPU results used 20 POWER8 cores on one node of Lawrence Livermore National Laboratory’s Ray supercomputer. Our implementation achieves the expected rates of and for matrix-free setup and evaluation, respectively. Moreover, we can see the computational benefits of high-order simulations on the GPU, as our optimized GPU kernels outperform the 20-core CPU implementation by a factor of at and at .

Figure 7 also shows the performance under -refinement, fixing the polynomial degree at . We see that for fixed , the wall-clock time scales linearly with DoFs. Once the device is saturated, the operator evaluation is accelerated by a factor of about 11 relative to the 20-core CPU evaluation.

7.4. Incompressible Navier-Stokes: Kovasznay flow

Figure 8. Two-dimensional Kovasznay flow, showing contours of velocity magnitude computed using a coarse mesh with degree 11 polynomials.
Figure 9. Two-dimensional Kovasznay flow. velocity error using polynomial degrees .

An analytical solution to the stationary Navier-Stokes equations in two spatial dimensions due to Kovasznay can be found in [33]. This solution may be used to represent the wake behind a periodic array of cylinders in the -direction. The solution is given by

where is the Reynolds number for the flow. In our example we define the problem in the rectangular domain with . Velocity magnitude contours of the solution are shown in Figure 9. We use pseudo time integration in order to apply the projection method described in Section 4.2 to a steady-state problem. The pseudo time step is chosen to be on the coarsest mesh, and is reduced by a factor of two with each refinement. The exact solution is enforced as a Dirichlet boundary condition on all boundaries of the domain. The equations are integrated until a final time of to allow for the errors to propagate out of the domain.

To investigate spatial convergence we compute the solution with increasing polynomial degree and uniform spatial refinement. The errors for the velocity are shown in Figure 9. We observe the desired order of accuracy for all of the cases considered.

7.5. Incompressible Navier-Stokes: Taylor-Green vortex

We next consider the incompressible Taylor-Green vortex, which is a standard benchmark case often used to assess the accuracy of high-order methods [8, 7]. This problem represents a simple model for the development of turbulence and resulting cascade of energy from large to small scales [48]. We use the problem configuration as defined in the first international workshop on high-order CFD methods [54]. The domain is taken to be the fully periodic cube and the initial state is set to

The Reynolds number is chosen to be . The equations are integrated with BDF-3 until a final time of using a time step of . We consider the following three configurations:

  • , grid, 439,276 DoFs per component.

  • , grid, 650,701 DoFs per component.

  • , grid, 346,969 DoFs per component.

In Figure 11, we display the time evolution of -criterion isosurfaces, colored by velocity magnitude. The quantity is defined by and is commonly used for vortex identification [28, 24]. These isosurfaces clearly display the evolution from smooth, large-scale structures to small-scale turbulent structures.

We compare the results obtained using the present high-order finite element flow solver with reference data obtained using a de-aliased pseudo-spectral method [52]. The quantities of interest for this comparison are the total kinetic energy


and the kinetic energy dissipation rate


The time evolution of these quantities is shown in Figure 10. All of the configurations considered result in good agreement with the reference data. The lowest order case (polynomial degree ) is the most dissipative, slightly under-predicting the total kinetic energy after about . The highest order case we considered (polynomial degree ) gives results of comparable accuracy to the case, with roughly half as many degrees of freedom.

Figure 10. Time evolution of total kinetic energy and kinetic energy dissipation rate for the incompressible Taylor-Green vortex. Comparison with reference data from a fully-resolved pseudo-spectral method with degrees of freedom per component.
Figure 11. Time evolution of the incompressible Taylor-Green vortex, showing isosurfaces colored by velocity magnitude.

8. Conclusions

In this work, we have described a high-order finite element method for solving incompressible flow problems with a matrix-free implementation that efficiently utilizes the high performance and memory bandwidth of modern graphics processing units. The resulting finite element operators are applied using sum-factorization algorithms and optimized GPU kernels, implemented in the MFEM library [1], and obtain throughput of over a billion degrees of freedom per second on a single Nvidia V100 GPU. A suite of matrix-free linear solvers was developed for the resulting mass, Poisson, Helmholtz, and Stokes linear systems. These preconditioners do not require the costly assembly of high-order system matrices. Their memory usage is optimal, requiring only constant memory per degree of freedom. Furthermore, the number of operations required to apply the preconditioners scales linearly with the number of degrees of freedom, which is the same as the operator evaluation. The robustness of preconditioners with respect to the mesh size, polynomial degree, and time step was demonstrated on a range of test problems. The flow solver was applied to a variety of 2D and 3D problems, verifying the high-order accuracy and efficiency of the method.


This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1752814. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-JRNL-791975).

This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.


  • [1] MFEM: Modular finite element methods library. https://mfem.org.
  • [2] R. Alexander. Diagonally implicit Runge-Kutta methods for stiff O.D.E.’s. SIAM Journal on Numerical Analysis, 14(6):1006–1021, Dec. 1977.
  • [3] R. Anderson, J. Andrej, A. Barker, J. Bramwell, J.-S. Camier, J. Cerveny, V. Dobrev, Y. Dudouit, A. Fisher, T. Kolev, W. Pazner, M. Stowell, V. Tomov, J. Dahm, D. Medina, and S. Zampini. MFEM: a modular finite element library. 2019.
  • [4] F. Bassi and S. Rebay. A high-order accurate discontinuous finite element method for the numerical solution of the compressible Navier-Stokes equations. Journal of Computational Physics, 131(2):267–279, 1997.
  • [5] J. B. Bell, P. Colella, and H. M. Glaz. A second-order projection method for the incompressible Navier-Stokes equations. Journal of Computational Physics, 85(2):257–283, Dec. 1989.
  • [6] M. Benzi, G. H. Golub, and J. Liesen. Numerical solution of saddle point problems. Acta Numerica, 14:1–137, Apr. 2005.
  • [7] M. E. Brachet, D. Meiron, S. Orszag, B. Nickel, R. Morf, and U. Frisch. The Taylor-Green vortex and fully developed turbulence. Journal of Statistical Physics, 34(5-6):1049–1063, Mar. 1984.
  • [8] M. E. Brachet, D. I. Meiron, S. A. Orszag, B. G. Nickel, R. H. Morf, and U. Frisch. Small-scale structure of the Taylor-Green vortex. Journal of Fluid Mechanics, 130(-1):411, May 1983.
  • [9] K. E. Brenan, S. L. Campbell, and L. R. Petzold. Numerical solution of initial-value problems in differential-algebraic equations. Society for Industrial and Applied Mathematics, Jan. 1995.
  • [10] F. Brezzi and R. S. Falk. Stability of higher-order Hood–Taylor methods. SIAM Journal on Numerical Analysis, 28(3):581–590, 1991.
  • [11] K. Brix, C. Canuto, and W. Dahmen. Nested dyadic grids associated with Legendre-Gauss-Lobatto grids. Numerische Mathematik, 131(2):205–239, Dec. 2014.
  • [12] D. L. Brown, R. Cortez, and M. L. Minion. Accurate projection methods for the incompressible Navier-Stokes equations. Journal of Computational Physics, 168(2):464–499, Apr. 2001.
  • [13] J. Brown, A. Abdelfata, J.-S. Camier, V. Dobrev, J. Dongarra, P. Fischer, A. Fisher, Y. Dudouit, A. Haidar, K. Kamran, T. Kalev, M. Min, T. Ratnayaka, M. Shephard, C. Smith, S. Tomov, V. Tomov, and T. Warburton. CEED ECP milestone report: Public release of CEED 1.0. Technical Report WBS, CEED-MS13, U.S. Department of Energy, Mar. 2018.
  • [14] C. Canuto. Stabilization of spectral methods by finite element bubble functions. Computer Methods in Applied Mechanics and Engineering, 116(1-4):13–26, Jan. 1994.
  • [15] C. Canuto, P. Gervasio, and A. Quarteroni. Finite-element preconditioning of G-NI spectral methods. SIAM Journal on Scientific Computing, 31(6):4422–4451, Jan. 2010.
  • [16] C. Canuto and A. Quarteroni. Preconditioned minimal residual methods for Chebyshev spectral calculations. Journal of Computational Physics, 60(2):315–337, Sept. 1985.
  • [17] C. Canuto, A. Quarteroni, M. Y. Hussaini, and T. A. Zang. Spectral methods. Springer Berlin Heidelberg, 2007.
  • [18] A. J. Chorin. A numerical method for solving incompressible viscous flow problems. Journal of Computational Physics, 2(1):12–26, Aug. 1967.
  • [19] A. J. Chorin. Numerical solution of the Navier-Stokes equations. Mathematics of Computation, 22(104):745–745, 1968.
  • [20] B. Cockburn and C.-W. Shu. TVB Runge-Kutta local projection discontinuous Galerkin finite element method for conservation laws. II. General framework. Mathematics of Computation, 52(186):411–435, 1989.
  • [21] M. O. Deville, P. F. Fischer, E. Mund, et al. High-order methods for incompressible fluid flow, volume 9. Cambridge University Press, 2002.
  • [22] M. O. Deville and E. H. Mund. Finite-element preconditioning for pseudospectral solutions of elliptic problems. SIAM Journal on Scientific and Statistical Computing, 11(2):311–342, Mar. 1990.
  • [23] V. Dobrev, J. Dongarra, J. Brown, P. Fischer, A. Haidar, I. Karlin, T. Kolev, M. Min, T. Moon, T. Ratnayaka, S. Tomov, and V. Tomov. CEED ECP milestone report: Identify initial kernels, bake-off problems (benchmarks) and miniapps. Technical Report WBS, CEED-MS6, U.S. Department of Energy, 2017.
  • [24] Y. Dubief and F. Delcayre. On coherent-vortex identification in turbulence. Journal of Turbulence, 1:N11, jan 2000.
  • [25] R. E. Ewing, R. D. Lazarov, P. Lu, and P. S. Vassilevski. Preconditioning indefinite systems arising from mixed finite element discretization of second-order elliptic problems. In Preconditioned Conjugate Gradient Methods, pages 28–43, Berlin, Heidelberg, 1990. Springer Berlin Heidelberg.
  • [26] J.-L. Guermond, P. D. Minev, and J. Shen. An overview of projection methods for incompressible flows. Computer Methods in Applied Mechanics and Engineering, 195:6011–6045, 2006.
  • [27] E. Hairer and G. Wanner. Solving ordinary differential equations II. Springer Berlin Heidelberg, 1996.
  • [28] J. C. R. Hunt, A. A. Wray, and P. Moin. Eddies, streams, and convergence zones in turbulent flows. Technical Report CTR-S88, Center for Turbulence Research, 1988.
  • [29] V. John, G. Matthies, and J. Rang. A comparison of time-discretization/linearization approaches for the incompressible Navier-Stokes equations. Computer Methods in Applied Mechanics and Engineering, 195(44-47):5995–6010, Sept. 2006.
  • [30] G. E. Karniadakis, M. Israeli, and S. A. Orszag. High-order splitting methods for the incompressible Navier-Stokes equations. Journal of Computational Physics, 97(2):414–443, Dec. 1991.
  • [31] A. Klöckner, T. Warburton, J. Bridge, and J. Hesthaven. Nodal discontinuous Galerkin methods on graphics processors. Journal of Computational Physics, 228(21):7863–7882, Nov. 2009.
  • [32] A. Klöckner, T. Warburton, and J. S. Hesthaven. Viscous shock capturing in a time-explicit discontinuous Galerkin method. Mathematical Modelling of Natural Phenomena, 6(3):57–83, 2011.
  • [33] L. I. G. Kovasznay. Laminar flow behind a two-dimensional grid. Mathematical Proceedings of the Cambridge Philosophical Society, 44(1):58–62, 1948.
  • [34] M. Kronbichler and K. Kormann. A generic interface for parallel cell-based finite element operator application. Computers & Fluids, 63:135 – 147, 2012.
  • [35] M. Kronbichler and K. Ljungkvist. Multigrid for matrix-free high-order finite element computations on graphics processors. ACM Transactions on Parallel Computing, 6(1):1–32, May 2019.
  • [36] S. A. Orszag. Spectral methods for problems in complex geometries. Journal of Computational Physics, 37(1):70 – 92, 1980.
  • [37] S. A. Orszag, M. Israeli, and M. O. Deville. Boundary conditions for incompressible flows. Journal of Scientific Computing, 1(1):75–111, 1986.
  • [38] A. T. Patera. A spectral element method for fluid dynamics: Laminar flow in a channel expansion. Journal of Computational Physics, 54(3):468 – 488, 1984.
  • [39] W. Pazner. Efficient low-order refined preconditioners for high-order matrix-free continuous and discontinuous Galerkin methods, Aug. 2019. arXiv preprint 1908.07071 (submitted for publication).
  • [40] W. Pazner, M. Franco, and P.-O. Persson. High-order wall-resolved large eddy simulation of transonic buffet on the OAT15A airfoil. In AIAA Scitech 2019 Forum, page 1152, 2019.
  • [41] W. Pazner and P.-O. Persson. Approximate tensor-product preconditioners for very high order discontinuous Galerkin methods. Journal of Computational Physics, 354:344–369, 2018.
  • [42] W. Pazner and P.-O. Persson. Interior penalty tensor-product preconditioners for high-order discontinuous Galerkin discretizations. In 2018 AIAA Aerospace Sciences Meeting. American Institute of Aeronautics and Astronautics, Jan. 2018.
  • [43] A. Prohl. Projection and quasi-compressibility methods for solving the incompressible Navier-Stokes equations. Vieweg+Teubner Verlag, 1997.
  • [44] J. Rang. Design of DIRK schemes for solving the Navier-Stokes equations. Informatik-Berichte der Technischen Universität Braunschweig, 2007-02, 2007.
  • [45] J. Rudi, A. C. I. Malossi, T. Isaac, G. Stadler, M. Gurnis, P. W. J. Staar, Y. Ineichen, C. Bekas, A. Curioni, and O. Ghattas. An extreme-scale implicit solver for complex PDEs: Highly heterogeneous flow in Earth’s mantle. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’15, pages 5:1–5:12, New York, NY, USA, 2015. ACM.
  • [46] Y. Saad. A flexible inner-outer preconditioned GMRES algorithm. SIAM Journal on Scientific Computing, 14(2):461–469, Mar. 1993.
  • [47] Y. Saad. Iterative methods for sparse linear systems, volume 82. SIAM, 2003.
  • [48] C.-W. Shu, W.-S. Don, D. Gottlieb, O. Schilling, and L. Jameson. Numerical convergence study of nearly incompressible, inviscid Taylor-Green vortex flow. Journal of Scientific Computing, 24(1):1–27, July 2005.
  • [49] K. Świrydowicz, N. Chalmers, A. Karakus, and T. Warburton. Acceleration of tensor-product operations for high-order finite element methods. The International Journal of High Performance Computing Applications, 33(4):735–757, 2019.
  • [50] S. A. Teukolsky. Short note on the mass matrix for Gauss-Lobatto grid points. Journal of Computational Physics, 283:408–413, Feb. 2015.
  • [51] A. G. Tomboulides, J. C. Y. Lee, and S. A. Orszag. Numerical simulation of low Mach number reactive flows. Journal of Scientific Computing, 12(2):139–167, Jun 1997.
  • [52] W. M. van Rees, A. Leonard, D. Pullin, and P. Koumoutsakos. A comparison of vortex and pseudo-spectral methods for the simulation of periodic vortical flows at high Reynolds numbers. Journal of Computational Physics, 230(8):2794–2805, Apr. 2011.
  • [53] B. C. Vermeire, F. D. Witherden, and P. E. Vincent. On the utility of GPU accelerated high-order methods for unsteady flow simulations: A comparison with industry-standard tools. Journal of Computational Physics, 334:497–521, 2017.
  • [54] Z. J. Wang, K. Fidkowski, R. Abgrall, F. Bassi, D. Caraeni, A. Cary, H. Deconinck, R. Hartmann, K. Hillewaert, H. T. Huynh, et al. High-order CFD methods: current status and perspective. International Journal for Numerical Methods in Fluids, 72(8):811–845, 2013.