1. Introduction
The development of simulation software is an increasingly important aspect of modern scientific computing. Such software requires a vast range of knowledge spanning several disciplines, ranging from abstract mathematics to highperformance computing and lowlevel code optimization. Software projects developing automatic code generation systems have become quite popular in recent years, as such systems help create a separation of concerns which focuses on a particular complexity independent from the rest. Examples of such projects include FreeFEM++ (Hecht, 2012), Sundance (Long et al., 2010), the FEniCS Project (Logg et al., 2012), Feel++ (Prud’Homme et al., 2012), and Firedrake (Rathgeber et al., 2016).
The finite element method (FEM) is a mathematically robust framework for computing solutions of partial differential equations (PDEs), with a formulation that is highly amenable to codegeneration techniques. A description of the weak formulation of the PDEs, together with appropriate discrete function spaces, is enough to characterize the finite element problem. Both the FEniCS and Firedrake projects employ the Unified Form Language (UFL) (Alnæs et al., 2014) to specify the finite element integral forms and discrete spaces necessary to properly define the finite element problem. UFL is a highly expressive domainspecific language (DSL) embedded in Python, which provides the necessary abstractions for code generation systems.
There are classes of finite element discretizations resulting in discrete systems that can be solved more efficiently by directly manipulating local tensors. For example, the static condensation technique for the reduction of global finite element systems (Guyan, 1965; Irons, 1965)
produces smaller globallycoupled linear systems by eliminating interior unknowns to arrive at an equation for the facet degrees of freedom only. Alternatively, hybridized finite element methods
(Arnold and Brezzi, 1985; Brezzi and Fortin, 2012; Cockburn et al., 2009a) introduce Lagrange multipliers enforcing certain continuity constraints on finite element functions. Static condensation can then be applied to the augmented system to produce a reduced equation for the multipliers. Methods of this type are often accompanied by local postprocessing techniques that produce superconvergent approximations (Bramble and Xu, 1989; Cockburn et al., 2010a, 2009b). These procedures often require invasive manual intervention during the equation assembly process in intricate numerical code.In this paper, we provide a simple yet effective highlevel abstraction for localized dense linear algebra on systems derived from finite element problems. Using embedded DSL technology, we provide a means to enable the rapid development of hybridization and static condensation techniques within an automatic codegeneration framework. In other words, the main contribution of this paper is in solving the problem of automatically translating from the mathematics of static condensation and hybridization to compiled code. Our work is implemented in the Firedrake finite element framework (Rathgeber et al., 2016) and the PETSc (Balay et al., 1997, 2016) solver library, accessed via petsc4py (Dalcin et al., 2011).
The rest of the paper is organized as follows. We introduce common notation used throughout the paper in Section 1.1. The embedded DSL, Slate, is introduced in Section 2, which allows concise expression of localized linear algebra operations on finite element tensors. We provide some contextual examples for static condensation and hybridization in Section 3, including a discussion on postprocessing. We then outline in Section 4 how, by interpreting static condensation techniques as a preconditioner, we can use Slate to automate solution techniques employing hybridization and static condensation. We demonstrate our implementations for manufactured test problems in Section 5. Section 5.3 illustrates the composability of our implementation of a hybridized mixed method as a preconditioner for the resulting linearized system of a semiimplicit discretization of the rotating shallow water equations. Conclusions follow in Section 6.
1.1. Notation
Let denote a tessellation of , the computational domain, consisting of polygonal elements associated with a mesh size parameter , and the set of facets of . The set of facets interior to the domain is denoted by . Similarly, we denote the set of exterior facets as simply . For brevity, we denote the finite element integral forms over and any facet set by
(1)  
(2) 
where
should be interpreted as standard multiplication for scalar functions or a dot product for vector functions.
For any doublevalued vector field on a facet , we define the jump of its normal component across by
(3) 
where and denotes the positive and negative sides of the facet respectively. Here, and are the unit normal vectors with respect to the positive and negative sides of the facet . Whenever the facet domain is clear by the context, we omit the subscripts for brevity and simply write .
2. A system for localized algebra on finite element tensors
We present an expressive language for dense linear algebra on the elemental matrix systems arising from finite element problems. The language, which we call Slate, inherits typical mathematical operations performed on matrices and vectors, hence the input syntax is comparable to highlevel linear algebra software such as MATLAB. The Slate language provides basic abstract building blocks which can be used by a specialized compiler for linear algebra to generate lowlevel code implementations.
Slate is heavily influenced by the Unified Form Language (UFL) (Alnæs et al., 2014; Logg et al., 2012), a DSL embedded in Python which provides symbolic representations of finite element forms. The expressions can be compiled by a form compiler, which translates UFL into low level code for the local assembly of a form over the cells and facets of a mesh. In a similar manner, Slate expressions are compiled to low level code that performs the requested linear algebra elementwise on a mesh.
2.1. An overview of Slate
To clarify conventions and the scope of Slate, we start by considering a general form. Suppose we have a finite element form:
(4) 
where and denote appropriate integration measures. The integral form in (4) is uniquely determined by its lists (possibly of 0length) of arbitrary coefficient functions in the associated finite element spaces, arguments describing any test or trial functions, and its integrand expressions for each integral type: , , . The form describes a finite element form globally over the entire problem domain. The contribution of (4) in each cell of the mesh is simply
(5) 
We call (5) the celllocal contribution of . Equation (5) produces an element tensor which is mapped into a global data structure. However, one may want to produce a new local tensor by algebraically manipulating different element tensors. This is precisely the job of Slate.
The Slate language consists of two primary abstractions for linear algebra: (1) terminal element tensors corresponding to multilinear integral forms, or assembled data; and (2) expressions consisting of operations on terminal tensors or existing Slate expressions.
Terminal tensors:
In Slate, one associates a tensor with data on an element either by using a form, or assembled coefficient data:

Tensor
()
associates a form, expressed in UFL, with its local element tensor:(6) The number of arguments determine the rank of
Tensor
, i.e. scalars, vectors, and matrices are produced from 0forms, 1forms, and 2forms^{1}^{1}1As with UFL, Slate is capable of abstractly representing arbitrary rank tensors. However, only rank tensors are typically used in most finite element applications and therefore we currently only generate code for those ranks. respectively. 
AssembledVector
()
where is some finite element function. The result associates a function with its local coefficient vectors.
Symbolic linear algebra:
Slate supports typical linear algebra operations with a highlevel syntax close to mathematics:

A + B
, the addition of two equal shaped tensors. 
A * B
, a contraction over the last index ofA
and the first index ofB
. 
A
, the additive inverse (negative) of a tensor. 
A.T
, the transpose of a tensor. 
A.inv
, the inverse of a square tensor. 
A.solve(B, decomposition="...")
, the result of solving for , optionally specifying a factorization strategy. 
A.blocks[indices]
, where A is a tensor from a mixed finite element space, allows extraction of subblocks of the tensor indexed by field (slices are allowed). For example, if a matrix corresponds to the bilinear form , where and are product spaces consisting of finite element spaces , , then the celllocal tensors have the form:(7) The associated submatrix of (7) with indices , , , is
(8) where , .
These building blocks may be arbitrarily composed, giving us the necessary algebraic framework for a large class of problems, some of which we present in this paper.
Slate expressions are handled by a linear algebra compiler, which employs TSFC to compile kernels for the assembly of terminal tensors and generates a dense linear algebra kernel to be iterated cellwise. Our compiler generates C++ code, using the templated library Eigen (Guennebaud et al., 2015) for dense linear algebra. During execution, the local computations in each cell are mapped into global data objects via appropriate indirection mappings. Figure 1 provides an illustration of the complete toolchain.
3. Examples
We now present a few examples and discuss solution methods which require elementwise manipulations of finite element systems and their specification in Slate. We stress here that Slate is not limited to these model problems; rather these examples were chosen for clarity and to demonstrate key features of the Slate language. In Sections 4 and 5, we discuss more intricate ways the Slate DSL is used in custom preconditioners for linear systems.
3.1. Hybridized discontinuous Galerkin methods
For our model problem, consider the secondorder elliptic PDE:
(9)  
(10)  
(11) 
where and , are positivevalued coefficients. Rewriting as a first order system, we have the following mixed problem:
(12)  
(13)  
(14)  
(15) 
where and .
The hybridized discontinuous Galerkin (HDG) method is a natural extension of discontinuous Galerkin (DG) discretizations. Here, we consider a specific HDG discretization, LDGH (Cockburn et al., 2010a). Other forms of HDG that involve local lifting operators can also be implemented in this software framework by the introduction of additional local (i.e., discontinuous) variables in the definition of the local solver.
To construct the LDGH discretization, we define the DG numerical fluxes and to be functions of the trial unknowns and a new independent unknown in the trace space:
(16) 
where denotes a polynomial space defined on each facet . The numerical fluxes are:
(17)  
(18) 
where is a function approximating the trace of on and is a positive function that may vary on each facet . The full LDGH formulation reads as follows. Find such that
(19)  
(20)  
(21)  
(22) 
Equation (21) enforces continuity of the numerical flux on , which in turn produces a flux that is singlevalued on the facets. Note that the choice of has a significant influence on the expected convergence rates of the computed solutions. We use this subtle dependency to verify our implementation in Section 5.1.
The matrix system arising from (19)–(22) has the general form:
(23) 
The LDGH system has more degrees of freedom than its DG counterpart, but its matrixform exhibits some immediate advantages.

By our choice of function spaces, the operator coupling and is blocksparse. We can therefore simultaneously eliminate the coupled unknowns by performing elementwise static condensation to (23), producing a significantly smaller problem for only:
(24) where and are given by
(25) (26) 
Once is computed, both and can be recovered locally in each element. This can be accomplished in two stages. First we compute by solving:
(27) followed by:
(28) 
The solution approximations can further be improved via local postprocessing. We highlight a few procedures in Section 3.3.
Listing 1 displays the corresponding Slate code for assembling the trace system, solving (24), and recovering the eliminated unknowns.
3.2. Hybridized mixed methods
To motivate the hybridization of mixed methods, we start by recalling the mixed method for the first order system (12)–(15). Standard methods seek a solution in the finite dimensional spaces:
(29)  
(30) 
respectively. The space consists of conforming piecewise vector polynomials, where choices of typically include the RaviartThomas (RT), BrezziDouglasMarini (BDM), or BrezziDouglasFortinMarini (BDFM) elements (Brezzi et al., 1987, 1985; Nédélec, 1980; Raviart and Thomas, 1977).
The mixed finite element formulation of (12)–(15) reads as follows: find satisfying
(31)  
(32) 
where is the space of functions in whose normal components vanish on . With and denoting the vector of degrees of freedom for and respectively, computing the solution of (31)–(32) requires solving the saddle point system:
(33) 
Methods to efficiently invert such systems include multigrid (Arnold et al., 2000) (requiring complex overlappingSchwarz smoothers), global Schur complement factorizations (which require an approximation to the inverse of the elliptic Schur complement operator), or auxiliary space multigrid (Hiptmair and Xu, 2007). Here, we focus on a solution approach using a hybridized mixed method (Arnold and Brezzi, 1985; Brezzi and Fortin, 2012).
For hybridized mixed methods, the discrete solution spaces are for and for , where the superscript denotes a discontinuous approximation to in the space
(34) 
The vector finite element space is a subspace of consisting of local functions, but normal components are no longer continuous on . Lagrange multipliers enforcing normal continuity are introduced as an auxiliary variable in the trace space .
Deriving the hybridized mixed system is accomplished through integration by parts over each element . Testing with and integrating (12) over produces:
(35) 
The trace function is introduced in surface integrals approximating on elemental boundaries. An additional constraint equation is added to close the system. The global hybridmixed formulation reads: find such that
(36)  
(37)  
(38) 
where denotes the space of traces vanishing on .
Arising from the hybridizedmixed formulation, we have the augmented system of linear equations with the general form:
(39) 
where is a mixed vector of the form , and , are the vectors of degrees of freedom for the flux, scalar, and trace unknowns. Equation (39) defines a 33 block system which is blocksparse by our choices of , , and . If the space of Lagrange multipliers is chosen appropriately, then the flux , albeit sought a priori in a discontinuous space, will coincide with its conforming counterpart . The constraint in (38) enforces both continuity of the normal components of across elemental boundaries, as well as the Neumann condition on . As a result, the formulations in (36)–(37) and (31)–(32) are solving equivalent problems (Arnold and Brezzi, 1985).
The system in (39) defines a threefield problem similar to that of the HDG method in (23), and hence possesses the same properties previously discussed. The solution approach is nearly identical, including the Slate code. A few simplifications are present in the expressions for building the trace system and local solvers (). The resulting matrix for the system also defines an elliptic operator. For the interested reader, a unifying framework for the hybridization of mixed and DG methods is detailed by Cockburn et al. (2009a).
3.3. Local postprocessing
Local postprocessing techniques for the construction of superconvergent approximations were discussed within the context of mixed methods (Arnold and Brezzi, 1985; Bramble and Xu, 1989; Stenberg, 1991), and discontinuous Galerkin methods (Cockburn et al., 2010a, 2009b). Here, we present two postprocessing techniques for producing local solutions of higher approximation order. The Slate code follows naturally from previous discussions in Sections 3.1 and 3.2, using the standard set of operations on local tensors.
3.3.1. Postprocessing of the scalar solution
There are number of postprocessing techniques for enhancing the accuracy of the scalar approximation of the hybridized mixed method and its DG variant. We present a modified version of the procedure presented by Stenberg (1991), and highlighted within the context of hybridizing eigenproblems by Cockburn et al. (2010b).
Let denote a polynomial space of degree on an element . Then for a given pair of computed solutions of the hybridized methods, we define the postprocessed scalar as the unique solution of the local problem:
(40)  
(41) 
where . Here, the space denotes the orthogonal complement of . This postprocessing method directly uses the definition of the flux to construct the local problem above. In practice, the space may be constructed using an orthogonal hierarchical basis, and solving (40)–(41) amounts to inverting a local symmetric positive definite system.
At the time of this work, Firedrake does not support the construction of such a finite element basis. However, we can introduce Lagrange multipliers to enforce the orthogonality constraint. The resulting local problem then becomes the following mixed system: find such that
(42)  
(43) 
where . The local problems (42)–(43) and (40)–(41) are equivalent, with the Lagrange multiplier enforcing orthogonality of test functions in with functions in .
This postprocessing method produces a new approximation which superconverges at a rate of for hybridized mixed methods (Arnold and Brezzi, 1985; Cockburn et al., 2010b; Stenberg, 1991). For the LDGH method, superconvergence is achieved when and , but only convergence is achieved when (Cockburn et al., 2010a, 2009b). We use this dependency in the convergence rates to verify our software implementation in Section 5.1.
3.3.2. Postprocessing of the flux
To postprocess the flux in the LDGH method, we use the numerical trace from (17). The technique we outline here follows that of Cockburn et al. (2009b), and produces a new flux with better conservation properties.
Let be a mesh consisting of simplices. On each element , we define a new function to be the unique element of the local RaviartThomas space satisfying
(44)  
(45) 
This local problem produces a new flux with the following properties:

. That is,
(46) 
Furthermore, the divergence of convergences at a rate of .
4. Static condensation as a preconditioner
Slate enables static condensation approaches to be expressed very concisely. Nonetheless, application of a particular approach to different variational problems using Slate still produces a certain amount of code repetition. By formulating each form of static condensation as a preconditioner, code can be written once and then applied to any mathematically suitable problem. Rather than writing the static condensation by hand, in many cases, it is sufficient to just select the appropriate, Slatebased, preconditioner.
To understand how we can view static condensation as a preconditioner, it is helpful to frame the problem in the particular context of the solver library. Firedrake uses PETSc to provide linear solvers, and we implement our preconditioners as PETSc PC objects. These are defined to act on the problem residual, and return a correction to the solution. We can think of (left) preconditioning the matrix equation in residual form:
(47) 
by an operator (which may not necessarily be linear) as a transformation into an equivalent system of the form
(48) 
Given a current iterate the residual at the th iteration is simply , and acts on the residual to produce an approximation to the error . If is an application of an exact inverse, the residual is converted into an exact (up to numerical roundoff) error.
We will denote the application of particular Krylov method for the linear system (47) as . Upon preconditioning the system via as in (48), we write
(49) 
If (49) is solved directly via the application of , then . So, we have that produces the exact solution of (47) in a single iteration of .
4.1. Interfacing with PETSc via custom preconditioners
The implementation of preconditioners for these systems requires manipulation not of assembled matrices, but rather their symbolic representation. To do this, we use the preconditioning infrastructure developed by Kirby and Mitchell (2018), which gives preconditioners written in Python access to the symbolic problem description. We manipulate this appropriately and provide operators assembled from Slate expressions to PETSc for further algebraic preconditioning. Using this technique, we have developed a static condensation interface for the hybridization of mixed problems, and a generic interface for hybridized systems (both hybridized mixed and HDG methods). The advantage of writing even the latter as a preconditioner is the ability to switch out the solution scheme for the system, even when nested inside a larger set of coupled equations.
4.1.1. A static condensation interface for hybridization
One of the main advantages of using a hybridized variant of a DG or mixed method is that such systems permit the use of elementwise condensation and recovery. To facilitate this, we provide a PETSc PC static condensation interface, HybridSCPC. This preconditioner takes the discretized hybridized system as in (23), and performs the local elimination and recovery procedures.
More precisely, the incoming system has the form:
(50) 
where . Note that in many applications, , however our implementation does not assume symmetry of the offdiagonal operators. Equation (50) defines a 33 block system of equations. In exact arithmetic, the HybridSCPC preconditioner applies the inverse of the Schur complement factorization:
(51) 
where is the Schur complement operator for the system. The only globally coupled system that needs an iterative solver is the reduced problem for the Lagrange multipliers:
(52) 
where is the trace righthand side, and is another possible choice of preconditioner. Once is computed, the flux and scalar fields are reconstructed elementwise via inverting the local systems in (27) and (28).
4.1.2. Preconditioning mixed methods via hybridization
The preconditioner HybridizationPC expands on the previous one, this time taking an system and automatically forming the hybridized problem. This is accomplished through manipulating the UFL objects representing the discretized PDE. This includes replacing argument spaces with their discontinuous counterparts, introducing test functions on an appropriate trace space, and providing operators assembled from Slate expressions.
More precisely, let , where , , and and are the flux and scalar unknowns respectively, be the mixed saddle point problem. Then this preconditioner replaces with the augmented system:
(53) 
where , , are the righthand sides for the flux and scalar equations respectively, and indicates modified matrices and covectors with discontinuous functions. Here, are the new discontinuous unknowns.
The preconditioning operator for the hybridmixed system (53) has the form:
(54) 
where is the Schur complement matrix . As before, a single globally coupled system for is required. The recovery of and happens in the same manner as the HybridSCPC.
Since the flux is constructed in a discontinuous space , we must project the computed solution into . This can be done cheaply via local facet averaging. The resulting solution is then updated via , where is the projection mapping. This ensures the residual for the original mixed problem is properly evaluated to test for convergence. With as in (54), the preconditioning operator for the original system is:
(55) 
We note here that assembly of the righthand for the system requires special attention. The situation we are given is that we have for , but require for . For consistency, we also require for any that
(56) 
We can construct such a satisfying (56) using the local definition:
(57) 
where is the number of cells that the degree of freedom corresponding to the basis function touches. For elements, for facet nodes, and 1 otherwise. By construction of the space , we have for :
(58) 
where , and are functions corresponding to the positive and negative restrictions associated with the th facet node^{2}^{2}2 These are the two “broken” parts of on a particular facet connecting two elements. That is, for two adjacent cells, a basis function in for a particular facet node can be decomposed into two basis functions in defined on their respective sides of the facet.. Using (57) and (58), we can verify that our construction of satisfies (56).
5. Numerical results
We now present results utilizing the Slate DSL and our static condensation interfaces for a set of test problems. All parallel results were obtained on a single fullyloaded compute node of dualsocket Intel E52630v4 (Xeon) processors with cores (2 threads per core) running at 2.2GHz. In order to avoid potential memory effects due to the operating system migrating processes between sockets, we pin MPI processes to cores.
5.1. Hybridized methods and effects of postprocessing
To verify our computed results, we now perform a simple convergence study for a model Dirichlet problem. We seek a solution to the Poisson equation as a firstorder system:
(59) 
where and are chosen so that the analytic solution is the sinusoid and its negative gradient. We solve this problem by hybridizing the mixed formulation of (59), and employ our static condensation preconditioner described in Section 4.1.1. All results were obtained in serial, with MUMPS providing the LU factorization algorithms for the condensed trace system (Amestoy et al., 2000).
Each mesh in our convergence study is obtained by generating a quadrilateral mesh with cells in each spatial direction, and dividing each quadrilateral cell into two equal simplicial elements. Once the solutions are obtained, we compute a postprocessed scalar solution using the method described in (42)–(43) via Slategenerated kernels. Figure 2 displays the results for the hybridized RT method. Our computations are in full agreement with the theory.
We repeat this experiment for the LDGH method with varying choices of in order to verify how influences the convergence rates, comparing with the expected rates for the LDGH method given a particular order of (see Table 1 for a summary). In all our experiments, we use the postprocessing methods described in Section 3.3 to produce approximations and . Error convergence plots from our tests are shown in Figure 3 that confirm the expected rates. This rather sensitive test verifies that our software framework is generating correct code.
parameter  expected rates of convergence ()  

5.2. Relative performance of the HDG method
Having verified our generated code for the HDG method in the previous section, we now analyze the quality of said code for a model threedimensional elliptic system:
(60) 
where and are chosen such that the analytic solution is . We use a regular mesh consisting tetrahedral elements (). At the time of this work, Firedrake does not support sumfactorized algorithms on simplices. We therefore restrict ourselves to “low” polynomial order, only considering approximation degrees for the HDG method. Additionally, we compute a postprocessed scalar approximation of the HDG solution. In all numerical studies here, we set the HDG parameter . All results were computed in parallel, utilizing a single compute node (described previously).
A continuous Galerkin (CG) discretization serves as a reference for this experiment. Due to the superconvergence in the postprocessed solution for the HDG method, we use CG discretizations of polynomial order 2, 3, and 4. This takes into account the enhanced accuracy of the HDG solution, despite being initially computed as a lowerorder approximation. We therefore expect both methods to produce equally accurate solutions to the model problem.
Our aim here is not to compare the performance of HDG and CG, which has been investigated elsewhere (for example, see Kirby et al. (2012); Yakovlev et al. (2016)). Instead, we provide a reference that the reader might be more familiar with in order to evaluate whether our software framework produces a sufficiently performant HDG implementation relative to what might be expected.
For the CG discretization, we use a matrixexplicit iterative solver consisting of the conjugate gradient method preconditioned with hypre’s boomerAMG implementation of algebraic multigrid (AMG) (Falgout et al., 2006). We use the preconditioner described in Section 4.1.1 to statically condense the LDGH system, using the same iterative solver as the CG method for the Lagrange multipliers. This is sensible, as the global trace operator defines a symmetric positivedefinite operator. To avoid oversolving, we iterate to a relative tolerance such that the discretization error is minimal for a given mesh.
5.2.1. Error versus execution time
The total execution time is recorded for the CG and HDG solvers, which includes the setup time for the AMG preconditioner, matrixassembly, and the timetosolution for the Krylov method. In the HDG case, we include the time spent building the Schurcomplement for the traces, local recovery of the scalar and flux approximations, and postprocessing. The error against execution time is summarized in Figure 4.
The HDG method of order () with postprocessing, as expected, produces a solution which is as accurate as the CG method of order (). While the full HDG system is never explicitly assembled, the larger execution time is a result of several factors. The total number of trace unknowns for the , , and discretizations is roughly four, three, and two times larger (resp.) than the corresponding number of CG unknowns. Therefore, each iteration is more expensive. Moreover, we also observe that the trace system requires more Krylov iterations to reach discretization error. The gap in total number of iterations starts to close as the approximation degree increases (see Figure 5). The extra cost of HDG due to the larger degreeoffreedom count and the need to perform local tensor inversion is offset by the local conservation and stabilization properties which are useful for fluid dynamics applications, for example.
5.2.2. Break down of solver time
The HDG method introduces far more additional degrees of freedom than CG or native DG methods. This is largely due to the fact that the HDG method simultaneously approximates the primal solution and its flux. The global matrix for the traces is larger than the one for the CG system at low polynomial order. The execution time for HDG is then compounded by a more expensive global solve. We remind the reader that our goal here is not to compare CG and HDG but to verify that the solve time for HDG is as might be expected given the problem size.
Stage  

(s)  %  (s)  %  (s)  %  
Matrix assembly (static cond.)  1.05  7.47 %  6.97  11.17 %  31.53  10.51 % 
Forward elimination  0.82  5.88 %  6.31  10.11 %  31.74  10.58 % 
Trace solve  10.74  76.68 %  39.83  63.84 %  185.18  61.73 % 
Back substitution  1.13  8.08 %  8.33  13.35 %  44.86  14.96 % 
Post processing  0.26  1.89 %  0.96  1.54 %  6.68  2.23 % 
HDG Total  14.00  62.39  299.98  
(s)  %  (s)  %  (s)  %  
Matrix assembly (monolithic)  0.50  11.82 %  1.89  8.08 %  10.77  11.93 % 
Solve  3.73  88.18 %  21.56  91.92 %  79.50  88.07 % 
CG Total  4.23  23.46  90.27 
Figure 6 displays the execution times on a simplicial mesh consisting of million elements. The execution times have been normalized by the CG total time, showing that the HDG method is roughly 3 times the execution time of the CG method. This is expected given the larger degreeoffreedom count. The raw numerical breakdown of the HDG and CG solvers are shown in Table 2. We isolate each component of the HDG method contributing to the total execution time. Local operations include static condensation (trace operator assembly), forward elimination (righthand side assembly for the trace system), backwards substitution to recover the scalar and flux unknowns, and local postprocessing of the primal solution. For all , our HDG implementation is solverdominated, indicating that our software framework is producing a relatively performant implementation.
Both operator and righthand side assembly are dominated by the costs of inverting a local square mixed matrix coupling the primal and dual variables, which is performed directly via an LU factorization. They should therefore be of the same magnitude in time spent. We observe that this is the case across all degrees (ranging between approximately 6—11% of total execution time for both processes). We observe that the cost of HDG assembly takes approximately the same percentage of time as the CG method.
Backsubstitution takes slightly more time (between 8—15% of execution time across all ). This is a result of splitting the reconstruction stage into two local solves: one pass to reconstruct , followed by an additional pass to reconstruct . A further performance optimization could be to simultaneously solve for both and in each element. However, at the time of this work, the PyOP2 layer of Firedrake does not support direct insertion into mixed data structures. Finally, the additional cost of postprocessing accrues negligible time (roughly 2% of execution time across all degrees).
We note that caching of local tensors does not occur. Each pass to perform the local eliminations and backwards reconstructions rebuilds the local element tensors. It is not clear at this time whether the performance gained from avoiding rebuilding the local operators will offset the memory costs of storing the local matrices. Moreover, in timedependent problems where the operators contain statedependent variables, rebuilding local matrices will be necessary in each timestep regardless.
5.3. A hybridized finite element solver for the shallow water equations
A primary motivator for our interest in hybridized methods revolves around developing efficient solvers for problems in geophysical flows. In this last section, we present some results integrating the shallow water equations on the sphere using test case 5 (flow past a mountain) from (Williamson et al., 1992). We use the framework of compatible finite elements (Cotter and Shipton, 2012; Cotter and Thuburn, 2014).
5.3.1. Semiimplicit solver
We start with the vectorinvariant rotating nonlinear shallow water system defined on a twodimensional spherical surface embedded in :
(61)  
(62) 
where is the fluid velocity, is the depth field, is the Coriolis parameter, is the acceleration due to gravity, is the bottom topography, and , with being the unit normal to the surface .
After discretising in space and time using a semiimplicit scheme and Picard linearization, following Natale and Cotter (2017), we must solve an indefinite saddle point system at each step:
(63) 
In staggered finite difference models, the standard approach for solving (63) is to neglect the Coriolis term and eliminate the velocity unknowns to obtain a discrete elliptic problem for which smoothers like Richardson or relaxation methods are convergent. This is more problematic in the compatible finite element framework, since is not diagonal. Instead, we use the preconditioner described in Section 4.1.2 to form the hybridized problem and eliminate both and locally.
5.3.2. Atmospheric flow over a mountain
As a test problem, we solve test case 5 of Williamson et al. (1992), on the surface of an Earthsized sphere. We refer the reader to Cotter and Shipton (2012); Shipton and Cotter (2017) for a more comprehensive study on mixed finite elements for shallow water systems of this type.
We use the mixed finite element pairs (lowestorder RT method) and (nexttolowest order BDM method) for the velocity and depth spaces. The sphere mesh is generated from 8 refinements of an octahedron, resulting in a triangulation consisting of 524,288 elements. The grid information, including timestep size, and degree of freedom count is summarized in Table 3.
Grid information  
Refinements  Number of  (s)  
cells  (km)  (km)  
8  524,288  28.396  33.052  28.125 
Discretization properties  
Mixed method  Velocity  Depth  Total  
unknowns  unknowns  
786,432  524,288  1,310,720  
3,932,160  1,572,864  5,505,024 
We run for a total of 100 timesteps, with a fixed number of 4 Picard iterations in each timestep. We compare the overall simulation time using two different solver configurations for the implicit linear system. First, we use an approximate Schur complement preconditioner for GMRES of the form:
(64) 
where , and is a diagonal approximation to the velocity mass matrix. The approximate Schur complement is inverted using conjugate gradients preconditioned by PETSc’s smoothed aggregation multigrid (GAMG). The Krylov method is set to terminate once the preconditioned residual norm is reduced by a factor of . is computed approximately using a single application of incomplete LU (zero fillin).
Next, we use a preonly application of our hybridization preconditioner, which replaces the original linearized mixed system with its hybridmixed equivalent. After hybridization, we have the problem: find
Comments
There are no comments yet.