There is an increasing demand for large-scale thermomechanical simulation of turbomachinery problems. This is driven by the need for ever tighter tolerances on deformations under thermal and mechanical loading, and the highly integrated nature of modern designs. The integrated nature of advanced systems requires a move from component-level simulation to system level simulation. A barrier to progress in this area, however, is the computational cost of finite element simulation of thermomechanical problems using conventional technology.
For a step change in capability, the important advance is towards solvers with (near) linear complexity (time cost and memory) in problem size, and then with good parallel scaling. A common mistake in industrial settings, in our experience, is to focus too heavily on parallel scaling performance and to overlook complexity. To tractably perform very large-scale simulations the first step is the application of methods with cost complexity that is close to linear in problem size. Sparse direct linear solvers for three-dimensional finite element problems have at best time complexity, where is the numbers of degrees-of-freedom [17, 26]. This is prohibitive for problems with large . The high time cost complexity cannot be conquered by parallel implementations of direct solvers.
The differential equations used to model thermomechanical systems are typically elliptic or parabolic, or combinations of the two. As a consequence, practical numerical methods must be implicit, which in turn implies the solution of linear systems of equations. This is in contrast with many computational fluid dynamics problems for which explicit methods can be applied. In finite element analysis, direct sparse linear solvers that are variants on LU decomposition, are dominant. Direct solvers are robust, but have complexity in both time and memory that is far from linear in problem size. In practice, with current computer performance, advances in implementation of LU solvers, and dimensionality and complexity effects, two-dimensional simulations are generally tractable. However, very large three-dimensional simulations with order degrees-of-freedom are made intractable by cost and will remain so. We explain through complexity analysis the cost differences between two- and three-dimensional cases to elucidate why implementation improvements alone will not make large-scale three-dimensional analysis viable.
Extreme scale finite element simulation is advanced in some fields, such as computational fluid dynamics, e.g. , and geophysics, e.g. . However, other areas are trailing, such as thermomechanical analysis of turbomachinery. We present computational examples of realistic thermomechanical turbomachinery problems with up to 3.3 billion degrees-of-freedom. Our purpose is to show that, with appropriate preconditioning, iterative linear solvers can be effective for thermomechanical problems with complicated geometries at extreme scale and provide a route towards whole system/engine level analysis. Aspects of the model construction that are critical for good and robust performance are discussed. While the cost of the linear solver phase may be dominant in typical finite element libraries for thermomechanical analysis, all stages of a simulation must be considered to compute with or more degrees-of-freedom. We summarise the key components of an implementation that supports extreme scale computation, and in particular the implementation used to generate the performance data in this work.
The remainder of this work is structured as follows. The complexity of sparse direct solvers is discussed, and this is followed by the conditions under which iterative solvers can be applied with linear complexity. Other performance-critical elements of large-scale simulations are discussed briefly in the context of parallel computation. The description of the solver technology is followed by a description of the physical model used in the examples, and its numerical implementation. Performance studies are presented, and followed by conclusions. Much of what we discuss will be familiar to researchers in numerical linear algebra and high performance computing. Our aim is to reach researchers and analysts working in the field of turbomachinery to show the viability and potential of mathematically sound methods for extreme scale computation of thermomechanical problems using finite element methods. The computational examples that we present are produced with freely available open-source libraries, and in particular tools from the FEniCS Project [2, 21, 22, 4] (https://fenicsproject.org) and PETSc [6, 7] (https://www.mcs.anl.gov/petsc/).
2 Background: scalable approaches for large-scale problems
Implicit finite element solvers require the solution of
where is a matrix,
is the solution vector andis the right-hand side vector. Many finite element libraries solve the above problem using a sparse direct solver, and for large this constitutes the most expensive part of an analysis. A major barrier to large-scale analysis is that sparse direct solvers have high work complexity, especially in three-dimensions, which makes large analyses () very slow and very large analyses () intractable. To enable large analyses, methods with cost close to in work and storage are necessary.
In this section we describe the key elements needed to build parallel finite element solvers for thermomechanical modelling with close to cost. First, however we summarise the performance of direct solvers to highlight why improved implementations and parallelisaton of direct solvers, while helpful, is a ultimately a futile endeavour in terms of enabling very large-scale thermomechanical simulations.
2.1 Linear solvers
2.1.1 Direct solvers: dimensionality and complexity
Quantities of interest for characterising the cost of solving eq. 1 are the representative length of element cells, , and the number of degrees-of-freedom in a model, . For a problem with fixed geometric size, the number of degrees-of-freedom is proportional to in two dimensions and proportional to in three dimensions. Discretisation errors are usually characterised in terms of , and solver cost in terms of .
From a priori
error estimates, the solution error is typically proportional to, where . For a given factor reduction in error, a three-dimensional model therefore requires a greater relative increase in the number of degrees-of-freedom compared to a two-dimensional problem, (by an extra power). Direct sparse solvers have work complexity that is a power of , with the exponent depending on the spatial dimension. Solution on structured meshes using optimal ordering requires work and storage in two dimensions, and three dimensions work and storage . Consequently, when moving from two-dimensional to three-dimensional analysis the cost increase is compounded two-fold; the greater increase in the number of degrees-of-freedom for a given error reduction and the increase in linear solver complexity from to .
To make the effects of dimensionality and work complexity concrete, consider the computation of the displacement field for a linear elastic problem using linear Lagrange finite elements. To reduce the error in the displacement (measured in the -norm) by a factor of 10, the cell size must be reduced by a factor of ( accuracy). For a two-dimensional problem this implies increasing the number of degrees-of-freedom by a factor of , and when accounting for the solver complexity of the total time cost is a factor of greater. In three-dimensions, the factor increase in is 100, and after accounting for the solver cost the factor increase in total computational time is !
The elementary cost analysis shows that improved implementations and parallelisaton cannot make direct solvers viable for very large problems in three dimensions. Improved implementations can reduce the work proportionality constant, but this will be dramatically outstripped by the quadratic cost for large . Moreover, direct solvers are challenging to implement in parallel. Viable solvers for large-scale simulations must be , or close, in both time and memory cost.
2.1.2 Iterative solvers
The alternative to a direct sparse solver is an iterative method. The natural candidates are Krylov subspace methods, e.g. the conjugate gradient method (CG) for symmetric positive-definite operators and the generalised minimum residual method (GMRES) for non-symmetric operators. However, an iterative solver alone cannot noticeably improve on the work complexity of a direct method. To illustrate, we consider as a prototype the CG method. The core algorithmic operations are sparse matrix–vector products and vector inner products within each iteration. Since the matrix is sparse, the work cost of each iteration is and storage is . For a CG solver to have overall work cost of , it must converge in a number of iterations that is independent of . Error analysis for the CG method shows that the number of iterations to reach a specified error tolerance is , where is the condition number of in the -norm [34, Lecture 38]
(this estimate is for the case when the eigenvalues ofare spread, as is the case in finite element methods, rather than clustered). However, the condition number of for a steady elastic problem or for the diffusion part of thermal problem scales according to 
and as a result the number of CG iterations increases with mesh refinement and the solver cost is greater than .
Introducing a preconditioner , a CG solver for the problem
(the above is left preconditioning) will terminate in a number of iterations that is independent of if the condition number of is bounded independently of . If the preconditioner can be applied in time per iteration, the preconditioned solver will have cost . The key is to select a preconditioner that bounds the condition number independently of (and by extension ).
We are aware of perceptions in the turbomachinery community that iterative solvers for thermomechanical problems are not effective. In our experience this comes, at least in part, from the use of algebraic preconditioners that do not bound the condition number. For example, incomplete Cholesky factorisation preconditioners do not, in the general case, change the asymptotic scaling of the condition number (see, e.g. [9, 14]), which despite preconditioning, remains . The preconditioner must account for the differential operator that generates the matrix . Common ‘black-box’ preconditioners, e.g. incomplete LU, do not bound the condition number and the number of iterations will grow with problem size. Candidate preconditioners for the elliptic operators arising in thermomechanical analysis are multigrid and domain decomposition methods.
2.1.3 Multigrid preconditioned iterative solvers
Multigrid methods [12, 35] are a natural choice for for thermomechanical problems. While multigrid can be used as standalone method for solving the linear system, it is more effective for complicated engineering problems as a preconditioner for a Krylov subspace method. Domain decomposition solvers [10, 33, 23, 16] may also be suitable, but are generally less flexible in application than multigrid.
Multigrid methods can be divided into two types – geometric and algebraic. Geometric multigrid requires a hierarchy of finite element meshes, preferably with a nested structure. For complicated engineering geometries it may not be reasonable, or even possible, to produce a hierarchy of meshes. Algebraic multigrid methods (AMG) do not require a hierarchy of meshes, but construct a hierarchy of problems from the ‘fine grid’ input matrix. Common methods include classical Ruben–Stüben AMG  and smoothed aggregation . Classical AMG tends to be suited to scalar-valued equations, e.g. thermal problems, and smoothed aggregation tends to be suited to vector-valued equations, e.g. elasticity.
A common misconception is that AMG is a blackbox method requiring no input or guidance from the user other than the matrix to be solved. This is not the case, and we are aware of many cases of AMG being mistakenly used as blackbox, and as a consequence leading to the conclusion that it is not suitable for turbomachinery applications. For example, the proper use of smoothed aggregation AMG requires the ‘near-nullspace’ of the operator to be set, which in the case of 3D elasticity is the six rigid body modes in absence of any displacement boundary conditions. Failure to set the full near-nullspace leads to an increase in solver iteration count with mesh refinement. To complicate matters, this outcome is often not observed in simple tension tests that do not induce rotation, but the problem becomes acute when moving to realistic analyses. This has lead to erroneous conclusions that a solver is suitable for simple problems but not for realistic engineering systems. It has been our experience that analysts rarely configure AMG properly, and in particular smoothed aggregation AMG. This has contributed to the pessimistic view of iterative solvers in the turbomachinery community.
2.2 Cell quality
Orthodoxy in solid mechanics finite element analysis is to manage down the number of degrees-of-freedom to reach a tractable cost. For complicated geometries, this will typically compromise on cell quality in parts of a domain. While the computational cost of direct solvers does not depend on cell quality, the performance of preconditioned iterative solvers is highly dependent on cell quality. A small number of poor quality cells can dramatically slow, or prevent, convergence of an iterative method. We refer to the discussion in  for an overview of finite element cell quality measures. Additionally the work in  which highlights the challenges in generating meshes composed of good quality cells.
The successful application of iterative solvers requires high quality meshes, and analysts generating meshes need to refocus their efforts away from managing the cell count and towards the generation of high quality meshes. A contributing factor in poor experiences with iterative solvers is the use of meshes with poor quality cells. We have observed this to be particularly the case when attempting to benchmark against established codes with direct solvers, where, by necessity, the cell count is managed down to permit execution of the direct solver.
An important practical question is ‘how good’ must a mesh be to be acceptable. Such guidance is necessary for analysts generating meshes for use with iterative solvers, and is a topic of ongoing investigation.
2.3 Other library components
While the solution of a linear system may be the dominant cost when using a direct solver, the solution of very large-scale problems requires careful consideration and design of all stages in the solution pipeline. Addressing the linear solver in isolation is not sufficient. We summarise briefly other performance critical phases in a simulation as implemented in the open-source library DOLFIN [4, 2], which is used for the example simulations.
Input/output (IO) is often overlooked when considering extreme scale simulations. It must be possible to read and write files in parallel, and in a way that is memory scalable. IO must not take place on a single process or compute node. Some commonly used input formats for finite element analysis are fundamentally unsuited to parallel processing. ASCII formats do not lend themselves to efficient IO. The examples we present use XDMF  with HDF5  storage. The HDF5 files are read and written in parallel, using MPI-IO (transparently to the user) and exploiting parallel file systems .
2.3.2 Mesh data structures
Efficient data structures for storing the unstructured meshes associated with turbomachinery problems are essential for scalable solution. Examples of their implementation are shown in [19, 20]. Additionally, it is critical that these mesh data structures are fully distributed in parallel. Storing a mesh on one process, or a copy of the mesh on each process, prohibits the solution of very large problems. The examples we present use a fully distributed mesh [28, 4], with the mesh partitioning across processes computed using the parallel graph partitioner PT-SCOTCH .
Scalable matrix assembly builds on a fully distributed mesh data structure. It is essential that the matrix/vector assembly process is fully distributed with each process responsible for assembly over its portion of the mesh. A number of libraries exist that provide distributed matrix and vector data structures, and these typically also support distributed construction in which each process adds its contribution to the distributed matrix/vector, with a synchronisation phase to communicate any nonlocal contributions. The examples we present use PETSc  for distributed matrices and vectors.
3 Thermomechanical model and numerical formulation
We consider a thermoelastic problem on a body with boundary and outward unit normal vector on the boundary. The boundary is partitioned into and such that and . The time interval of interest is .
For the thermomehcanical model, in the absence of inertia effects, the mechanical part of the solution is governed by:
is the stress tensor,is a prescribed body force, is a prescribed boundary pressure, is the displacement field and is a prescribed displacement. The stress is given by
where is the elastic stiffness tensor, is the symmetric gradient of the displacement, and is the thermal strain. The thermal strain is given by
where is the thermal expansion coefficient, is the temperature, is a fixed reference temperature and is the identity tensor.
The temperature field is governed by
where is the mass density, is the specific heat, is the thermal conductivity, is a heat transfer coefficient, is the prescribed exterior temperature and is the initial temperature field.
We will consider problems where , , and are temperature dependent. The temperature dependencies lead to a nonlinear problem.
3.1 Fully-discrete formulation
Applying the -method for time stepping, the finite element formulation reads: given and at time , and , and at time , find and such that
where for a field , , and is a finite element space. In all examples we use conforming Lagrange finite element spaces on meshes with tetrahedral cells.
3.2 Solution strategy
The coupled system in eqs. 13 and 12 is nonlinear due to the dependence of various coefficients on the temperature. The coupling between the mechanical and thermal equations is one-way; the mechanical equation, eq. 12, depends on temperature but the thermal equation, eq. 13, does not depend on the displacement field. Therefore, at each time step we first solve the thermal problem () using Newton’s method, followed by solution of the linear mechanical problem using the most recently computed temperature field. The system solver strategy can be considered a block nonlinear Gauss–Seidel iterative process.
A full linearisation of the thermal problem in eq. 13 leads to a non-symmetric matrix operator. The degree to which symmetry is broken depends on the magnitude of . Despite the loss of symmetry in the nonlinear thermal problem, we still observe that the CG method performs well for the problems we consider. There may be cases where it is necessary to switch to a Krylov solver that is not restricted to symmetric operators, such as BiCGSTAB or GMRES.
3.3 Adaptive time step selection
The time step is selected to limit the maximum absolute temperature change anywhere in the domain over a time step. The next time step to be used is given by
where is the maximum permitted change in temperature between time steps and is a parameter that sets the target maximum change in temperature . If the temperature exceeds the maximum allowed the step is repeated, with the time step halved.
4 Model problems
Two model problems are considered; a turbocharger (shown in fig. 1) and a steam turbine casing (shown in fig. 2). Both problems are composed of multiple materials with temperature-dependent thermal and elastic parameters. For unsteady cases the boundary conditions and the applied body force are time-dependent.
The meshes of the two problems are composed of tetrahedral cells. Two ‘reference’ meshes are considered for the turbocharger:
- Turbocharger mesh ‘A’
cells and vertices
- Turbocharger mesh ‘B’
cells and vertices
and one reference mesh for the steam turbine:
- Steam turbine mesh
cells and vertices
Meshes with higher cell counts are constructed by applying parallel uniform mesh refinement to the reference meshes . Each level of uniform refinement increases the cell count by a factor of eight. Views of a series of meshes generated in this manner are shown in fig. 3. The algorithm used for uniform refinement is based on the work of [24, 25]. The algorithm cost is linear with the number of cells to be refined. The difference in tetrahedron cell quality (shape measures) between refinement levels is small, as shown in fig. 3.
Cell quality is important for performance and robustness of the iterative solvers. Histograms of cell quality, measured by the dihedral angle, are shown in fig. 4 for the reference meshes. We consider turbocharger mesh ‘A’ to be ‘high’ quality (see fig. 3(a)). Turbocharger mesh ‘B’ and the steam turbine mesh are of lower quality in this metric. We observe in our numerical experiments that best results in terms of iteration count are obtained using mesh ‘A’. The steam turbine geometry is characterised by a number of thin regions, and thin regions make the avoidance of poorly shaped cells more difficult. The quantitative understanding of the relationship between cell quality metrics and solver performance for practical applications is an area of ongoing research.
4.2 Material parameters and boundary conditions
The test cases are representative of realistic problems in terms of the number of materials and the number of boundary conditions. The turbocharger is composed of different materials (as illustrated in fig. 5) and has 125 boundary regions (as illustrated in fig. 6). The steam turbine problem has different materials and boundary regions (not shown). On each boundary region, different time-dependent boundary conditions are prescribed. Given the complexity of the problems, it is not possible to fully describe all material properties and boundary conditions. We therefore provide illustrative examples of typical normalised material data and boundary conditions that are used. Thermal and elastic parameters for the different materials are temperature dependent, and normalised sample material data is shown in fig. 7 for three different materials. Figure 8 shows time-dependent boundary condition data for two boundary regions.
5 Performance studies
We present performance data for steady and unsteady cases. We are primarily interested in total runtime, but also present timing breakdowns for performance-critical operations, namely: data I/O, matrix assembly and Newton/linear solver. We focus mainly on steady problems as the time cost for a steady solution provides an upper bound on the per time step cost of an unsteady implicit simulation. Simulations were performed on the UK national supercomputer, ARCHER. ARCHER is a Cray XC30 system and its key specifications are summarised in table 1.
|Processors (per node)||2 Intel E5-2697 v2|
|Cores per node|
|Memory per node|
5.1 Solver configuration
All tests use the conjugate gradient method preconditioned with algebraic multigrid (AMG). We stress that AMG is not a black-box method; our experience is that simulation time is often poor and can fail with default settings, especially for the mechanical problem. For the thermal solve, we precondition using BoomerAMG  from the HYPRE library, which is a classical AMG implementation. For the elastic solve, we precondition using GAMG , the native PETSc smoothed aggregation AMG implementation. Classical AMG tends to be best suited to scalar-valued equations, and smoothed aggregation is suited to vector-valued equations.
With BoomerAMG, we use a hybrid Gauss–Seidel smoother from the HYPRE library 
. With GAMG, we use a Chebyshev smoother. It is important for Chebyshev smoothing that the maximum eigenvalues are adequately approximated. If they are not, the preconditioned system may lose positive-definiteness and the CG methods will therefore fail. We have observed for complicated geometries and for higher-order elements that more Krylov iterations are typically required (more than the GAMG default) to adequately estimate the highest eigenvalues compared to simple geometries. It can also be important to control the rate of coarsening, particularly for linear tetrahedral elements, when using smoothed aggregation. For good performance, it is recommended to ensure that the multigrid preconditioner coarsens sufficiently quickly. For complicated geometries where robustness is an issue (typically on lower quality meshes), we have observed heuristically that increasing the size of the ‘coarse grid’ used by the multigrid preconditioner (the level at which the preconditioner ceases to further coarsen the algebraic problem) for the mechanical problem can dramatically enhance robustness, especially on lower quality meshes. The scalar thermal solve is typically more robust and requires fewer iterations than the vector-valued mechanical problem.
5.2 Steady thermomechanical simulations
For steady simulations, the nonlinear thermal problem is first solved, followed by the temperature-dependent mechanical problem. The thermal model parameters are temperature dependent, so an initial guess for the temperature field is required. For all examples, the initial guess of the temperature field is and the Newton solver is terminated once a relative residual of is reached. The iterative solver for the mechanical problem is terminated once a preconditioned relative residual norm of is reached.
Figure 9 presents strong scaling results for the turbocharger using linear () and quadratic elements (). Both cases have the same number of degrees-of-freedom – over 67 M for the temperature field and over 202 M for the displacement field. Also shown are breakdowns of the time cost for keys steps: reading and partitioning the mesh, and solving the thermal and elastic problems. We see that the scaling trend is good, the wall-clock time is very low in view of the problem sizes, and that the elastic solve is the dominant cost. We present the timings using a linear wall-clock time scale to make clear the potential impact on real design processes. Interpreting the timing between two process counts should be done with some caution because with unstructured grids the distribution of the mesh and the aggregation created by the multigrid implementations will differ when changing the number of processes.
Weak scaling results are presented in fig. 10 for linear elements, with approximately displacement degrees-of-freedom per process. The thermal problem size ranges from to degrees-of-freedom, and the elastic problem from to degrees-of-freedom. We observe satisfactory weak scaling.
To demonstrate the potential for solving extreme scale problems, the turbocharger problem has been solved with over isplacement degrees-of-freedom using quadratic () elements. The mesh was generated by refining turbocharger mesh ‘A’ recursively three times. The resulting mesh has cells and vertices. The simulation was run using MPI processes, and the time-to-solution was under . A breakdown of the timings is presented in fig. 11. The computation time is dominated by the elastic solve, taking of the total runtime, with of the time spent on the thermal solve. The example shows that there is potential to move well beyond current limits on problem size for thermomechanical simulation.
5.2.2 Steam turbine
A distinguishing feature of the steam turbine problem compared to the turbocharger is the presence of more fine (slender) geometric features and lower mesh quality (see fig. 4). Figure 12 presents strong scaling results for the steam turbine with over 36 M thermal and over 108 M displacement degrees-of-freedom (linear and quadratic elements). The runtimes are good, and the scaling satisfactory, as observed for the turbocharger. With MPI processes this gives an average of degrees-of-freedom per process. We observe that as the degree-of-freedom per process count reduces below , the solution time suffers and does not scale well, which is likely due to the dominance of inter-process communication.
Two data points for weak scaling are shown in fig. 13, where the number of displacement degrees-of-freedom per process is kept close to . The coarse problem has thermal and displacement degrees-of-freedom, and the fine problem has thermal and displacement degrees-of-freedom. As with the preceding results, we see that the elastic solver cost is dominant, and the most challenging to scale.
5.3 Transient thermomechanical simulations
We consider an unsteady simulation of the turbocharger. The steady performance results provide a good upper bound on the per time step cost of the implicit unsteady simulations. The per time step cost of an unsteady simulation will generally be lower than for the steady case because preconditioners can be re-used, and in many cases the solution from the previous step can be used as an initial guess for the iterative solver.
The transient examples correspond to a test cycle in the case of the turbocharger, and a start-up procedure to operating temperature in the case of the steam turbine casing. For both problems the initial temperature is set to . The transient response is then driven by time-dependent temperature and pressure boundary conditions. The time step is adjusted adaptively to limit the maximum temperature change at any point in the domain to using eq. 14. Transient simulations use the backward Euler method ().
A test cycle for the turbocharger problem using mesh ‘B’ with thermal and elastic degrees-of-freedom ( for temperature and displacement fields) was computed. The simulation required time steps to complete a cycle with the limit on the maximum temperature change per time step. The runtime for the simulation was using MPI processes. The cost per time step for different components of the simulation are shown in fig. 14, as well as the per step and the maximum change in temperature per step. The cost for the elastic solve is highest for the first step, as the preconditioner for the elastic part of the problem is re-used for subsequent time steps. Spikes in the solve time for the thermal problem correspond to steps at which the thermal preconditioner is rebuilt, and correspond to times at which rapid temperature changes occur.
Strong scaling for the transient turbocharger problems over 10 time steps is presented in fig. 15. We again see good scaling behaviour, consistent with the steady analysis of the turbocharger (cf. fig. 8(b)).
We have demonstrated that it is possible to solve large-scale thermomechanical turbomachinery problems scalably and efficiently using iterative solvers. This is contrary to widely held views in the turbomachinery community. Critical to the successful application of iterative solvers are: (i) the selection of preconditioners that are mathematically suitable for elliptic equations; (ii) proper configuration of preconditioners using properties of the underlying physical system, e.g. setting the near-nullspace for smoothed aggregation AMG; and (iii) the use of high quality meshes. Iterative solvers are less robust than direct solvers, and successful application does require greater expertise and experience on the part of the analyst, but they do offer the only avenue towards extreme scale thermomechanical simulation.
The presented examples are representative of practical turbomachinery simulations in terms of the materials and number of boundary conditions. The results do show some areas where there is scope for performance improvements, particularly for the solution of the mechanical problem in terms of runtime and parallel scaling. It would be interesting to investigate methods at higher process counts, and to explore methods that have lower set-up cost and memory usage than smoothed aggregation AMG. The presented examples are also simple, compared to the full range of physical processes that are typically modelled in turbomachinery analysis, such as geometrically nonlinear effects and contact. The results in this work provide a platform that motivates research into solver technology for a wider range of physical processes in the context of thermomechanical simulation. The methods described and the tools used are all available in open-source libraries. The implementations can be freely inspected and used.
We thank Dr Mark Adams of Lawrence Berkeley National Laboratory for his advice on the use of smoothed aggregation AMG. The support of Mitsubishi Heavy Industries is gratefully acknowledged. CNR is supported by EPSRC Grant EP/N018877/1.
- XDM  The eXtensible Data Model and Format, 2017. URL http://www.xdmf.org/.
- fen  FEniCS Project. https://fenicsproject.org, 2018.
- Adams et al.  M. F. Adams, T. Isaac, and G. N. Wells. GAMG: the native algebraic multigrid framework in PETSc. In preparation, 2018.
- Alnæs et al.  M. S. Alnæs, J. Blechta, J. Hake, A. Johansson, B. Kehlet, A. Logg, C. Richardson, J. Ring, M. E. Rognes, and G. N. Wells. The FEniCS Project Version 1.5. Archive of Numerical Software, 3(100), 2015.
- Baker et al.  A. H. Baker, R. D. Falgout, T. V. Kolev, and U. M. Yang. Multigrid smoothers for ultraparallel computing. SIAM Journal on Scientific Computing, 33(5):2864–2887, 2011.
- Balay et al.  S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, K. Rupp, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang. PETSc users manual. Technical Report ANL-95/11 - Revision 3.7, Argonne National Laboratory, 2016.
- Balay et al.  S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, D. A. May, L. C. McInnes, R. T. Mills, T. Munson, K. Rupp, P. Sanan, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang. PETSc Web page, 2017. URL http://www.mcs.anl.gov/petsc.
- Bank and Scott  R. E. Bank and L. R. Scott. On the conditioning of finite element equations with highly refined meshes. SIAM Journal on Numerical Analysis, 26(6):1383–1394, 1989.
- Benzi  M. Benzi. Preconditioning techniques for large linear systems: A survey. Journal of Computational Physics, 182(2):418–477, 2002.
- Brandt and Livne  A. Brandt and O. Livne. Multigrid Techniques: 1984 Guide with Applications to Fluid Dynamics, Revised Edition. SIAM, 2011.
- Brenner and Scott  S. C. Brenner and L. R. Scott. The Mathematical Theory of Finite Element Methods. Number 15 in Texts in Applied Mathematics. Springer, third edition, 2010.
- Briggs et al.  W. L. Briggs, V. E. Henson, and S. F. McCormick. A Multigrid Tutorial. Society for Industrial and Applied Mathematics, Philadelphia, second edition, 2000.
- Chevalier and Pellegrini  C. Chevalier and F. Pellegrini. PT-Scotch: A tool for efficient parallel graph ordering. Parallel Comput., 34(6-8):318–331, July 2008.
- Eleman et al.  H. C. Eleman, D. J. Silvester, and A. J. Wathen. Finite Elements and Fast Iterative Solvers. Oxford University Press, 2nd edition, 2014.
- Emden Henson and Meier Yang  V. Emden Henson and U. Meier Yang. BoomerAMG: A parallel algebraic multigrid solver and preconditioner. Applied Numerical Mathematics, 41(1):155–177, 2002.
- Farhat and Roux  C. Farhat and F.-X. Roux. A method of finite element tearing and interconnecting and its parallel solution algorithm. International Journal for Numerical Methods in Engineering, 32(6):1205–1227, 1991.
- George  A. George. Nested dissection of a regular finite element mesh. SIAM Journal on Numerical Analysis, 10(2):345–363, 1973.
- Klingner and Shewchuk  B. M. Klingner and J. R. Shewchuk. Agressive tetrahedral mesh improvement. In Proceedings of the 16th International Meshing Roundtable, pages 3–23, Seattle, Washington, Oct. 2007. URL http://graphics.cs.berkeley.edu/papers/Klingner-ATM-2007-10/.
- Knepley and Karpeev  M. G. Knepley and D. A. Karpeev. Mesh algorithms for PDE with Sieve I: Mesh distribution. Scientific Programming, 17(3):215–230, 2009.
- Logg  A. Logg. Efficient representation of computational meshes. International Journal of Computational Science and Engineering, 4(4):283–295, 2009.
- Logg and Wells  A. Logg and G. N. Wells. DOLFIN: Automated finite element computing. ACM Trans Math Software, 37(2):20:1–20:28, 2010.
- Logg et al.  A. Logg, K.-A. Mardal, and G. N. Wells, editors. Automated Solution of Differential Equations by the Finite Element Method, volume 84 of Lecture Notes in Computational Science and Engineering. Springer, 2012.
- Mandel and Dohrmann  J. Mandel and C. R. Dohrmann. Convergence of a balancing domain decomposition by constraints and energy minimization. Numerical Linear Algebra with Applications, 10(7):639–659, 2003.
- Plaza and Carey  Á. Plaza and G. F. Carey. Local refinement of simplicial grids based on the skeleton. Applied Numerical Mathematics, 32(2):195–218, 2000.
- Plaza and Rivara  A. Plaza and M.-C. Rivara. Mesh refinement based on the 8-tetrahedra longest-edge partition. In 12th International Meshing Roundtable, Sandia National Laboratories, pages 67–78, 2003.
- Poulson  J. L. Poulson. Fast parallel solution of heterogeneous 3D time-harmonic wave equations. PhD thesis, The University of Texas at Austin, 2012. URL http://hdl.handle.net/2152/ETD-UT-2012-12-6622.
- Rasquin et al.  M. Rasquin, C. Smith, K. Chitale, E. S. Seol, B. A. Matthews, J. L. Martin, O. Sahni, R. M. Loy, M. S. Shephard, and K. E. Jansen. Scalable implicit flow solver for realistic wing simulations with flow control. Computing in Science & Engineering, 16(6):13–21, 2014.
- Richardson and Wells  C. N. Richardson and G. N. Wells. Expressive and scalable finite element simulation beyond 1000 cores. Distributed Computational Science and Engineering (dCSE) Project Report, 2013. URL https://www.repository.cam.ac.uk/handle/1810/245070.
- Rudi et al.  J. Rudi, O. Ghattas, A. C. I. Malossi, T. Isaac, G. Stadler, M. Gurnis, P. W. J. Staar, Y. Ineichen, C. Bekas, and A. Curioni. An extreme-scale implicit solver for complex PDEs: Highly heterogeneous flow in earth’s mantle. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 5:1–5:12. ACM Press, 2015.
- Ruge and Stüben  J. W. Ruge and K. Stüben. Algebraic multigrid (AMG). In S. F. McCormick, editor, Multigrid Methods, volume 5 of Frontiers in Applied Mathematics. Society for Industrial and Applied Mathematics, Philadelphia, 1987.
What is a good linear finite element? interpolation, conditioning, anisotropy, and quality measures (preprint).University of California at Berkeley, 73:137, 2002.
- The HDF Group [2000–2017] The HDF Group. Hierarchical Data Format, version 5, 2000–2017. URL http://www.hdfgroup.org/HDF5.
- Toselli and Widlund  A. Toselli and O. B. Widlund. Domain Decomposition Methods: Algorithms and Theory, volume 34. Springer, 2005.
- Trefethen and Bau  L. N. Trefethen and D. Bau, III. Numerical Linear Algebra. Society for Industrial and Applied Mathematics, Philadelphia, 1997.
- Trottenberg et al.  U. Trottenberg, C. W. Oosterlee, and A. Schuller. Multigrid. Academic Press, London, 2000.
- Vaněk et al.  P. Vaněk, J. Mandel, and M. Brezina. Algebraic multigrid by smoothed aggregation for second and fourth order elliptic problems. Computing, 56(3):179–196, 1996.