Current developments in computer architecture are driven by modern multi-processors with an ever increasing parallelism. This includes process-level parallelism between an increasing number of compute cores, data-level parallelism in terms of vector processing (single instruction multiple data, SIMD), and instruction-level parallelism. Parallelism is even more essential in the area of scientific computing where increasingly complex physical and technical problems can be solved by numerical simulations on high performance computers. With the advent of extreme-scale computing, scientific software must be capable of exploiting the massive parallelism provided by modern supercomputers.
In this article, the design and implementation of a new finite element software framework HyTeG (Hybrid Tetrahedral Grids) will be presented. With HyTeG, we aim to develop a flexible, extensible, and sustainable framework for massively parallel finite element computations. To meet the requirements of a scalable and modular software structure, a carefully designed software architecture is essential. Our work builds on experience with designing complex and scalable software systems in scientific computing, such as Hierarchical Hybrid Grids (HHG) [1, 2, 3] and waLBerla 111http://walberla.net [4, 5, 6, 7]. Low-level hardware-aware implementation techniques are realised in a modular software architecture for extreme-scale efficiency on current and future supercomputing platforms. In the scope of this article, the software concepts are presented for two-dimensional triangular finite elements. All features, however, are implemented to be reusable also for three-dimensional simulations.
HyTeG is a generic framework for finite-element based computations building on some of the design principles of the preceding HHG software , but is designed to increase its functionality, flexibility, and sustainability.
Both frameworks build on a hierarchical discretisation approach that employs unstructured coarse meshes combined with structured and uniform refinement [1, 9]. This structure is depicted in Fig. 1 for a two-dimensional triangular mesh.
The resulting nested hierarchy of grids is used to implement geometric multigrid methods. When progressing to extreme scale, multigrid methods are essential since they exhibit asymptotically optimal complexity. This algorithmic scalability makes them far superior to most alternative solver algorithms. Additionally, the regularity of the refined elements is exploited by representing the system matrix in terms of stencils whose entries correspond to the non-zero elements of a row in the stiffness matrix. The matrix-free approach of HHG combined with a memory access structure that avoids indirections is favourable for modern hardware, as it minimises both memory consumption and memory access operations.
Based on a similar approach, HHG was designed as a multigrid library with excellent efficiency for scalar elliptic problems [10, 2] and for Stokes flow [11, 12, 3]. The largest published results reach up to degrees of freedom (DoFs) , exceeding the capability of alternative approaches by several orders of magnitude. We emphasise that solving systems of such size would not be possible on any current supercomputer with matrix-based methods where the sparse matrix must be stored explicitly. Techniques like on-the-fly stencil assembly [13, 14] can keep the memory requirements low also for more complex problems.
HHG, however, is restricted to conforming linear finite elements only. Therefore, e. g. transport processes have to be modelled employing a duality between nodal finite element meshes and polyhedral control volumes for a finite volume discretisation, as in . The execution model is strictly bulk-synchronous and there is no support for dynamic adaptivity and load balancing, significantly limiting the applicability. The new HyTeG framework is therefore a completely new design built on similar principles but generalizing the concept to remove these limitations.
1.2 Design Goals and Contribution
A core feature of the new HyTeG code are multi-scale tetrahedral higher-order finite elements with a uniformly structured refinement. This leads to excellent computational performance combined with high geometric flexibility and improved spatial accuracy for problems with sufficient regularity. The design as described here is restricted to simplices. However, the software structure is kept flexible to also support hexahedral or prismatic elements in future versions.
The data layout of HyTeG supports DoFs on the nodes, edges, faces, and volumes of a finite element mesh to support most grid-based discretisations, including e. g. discontinuous Galerkin methods. Edge DoFs facilitate stable discretisations of flow problems as in the Taylor-Hood discretisation .
To achieve superior scalability, an improved domain partitioning concept with abstract data handling is presented in Sec. 2
. This software architecture supports static and dynamic load balancing techniques, permits asynchronous execution, and it forms the foundation for data migration, advanced resilience techniques, and adaptive mesh refinement. A newly introduced approach to classify and separately store the mesh data based on their topological location in the finite element mesh is described in Sec.3. These mesh structures can be implemented with index-based memory access, avoiding the usual indirection of sparse matrix structures with their inherent performance penalty. This direct access facilitates implementing the most time-consuming compute kernels with superior efficiency, since these data structures are specifically designed to better exploit instruction level parallelism and vectorisation. An array-access abstraction is presented that allows to adapt the underlying memory layout of the unknowns to the access patterns of the employed computational kernels. Various solvers, but in particular geometric multigrid methods, can be implemented easily. To this end, HyTeG
supports the possibility to combine different finite element spaces and the corresponding operators. The algorithmic building blocks for coupled systems of partial differential equations (PDEs) are described in Sec.4.
Currently simulating Earth mantle convection is the primary application target as a classical extreme-scale science problem. Mantle convection is the driving force for plate tectonics, causing Earth quakes and mountain formation. This application motivates the example computations presented in Sec. 5. However, the HyTeG framework is also well suited for many other physical applications that can make use of very large-scale finite element models.
1.3 Related Work
Many high performance simulation frameworks can be classified into two main categories based on the meshes they use for representing the simulation domain. Frameworks that support finite elements on unstructured meshes include DUNE [17, 18] with its module DUNE-FEM , libMesh , UG4 , and NEKTAR++ . Another class of frameworks uses structured meshes, including deal.II [23, 24], Nek5000 , and waLBerla .
The frameworks DUNE, libMesh, UG4, and deal.II support higher-order conforming and non-conforming finite elements as well as discontinuous Galerkin methods, combined with h-,p-,hp-refinement and adaptivity. Additionally, DUNE and deal.II both provide element-based matrix-free methods on polyhedral elements and structured hexahedral grids, respectively. NEKTAR++ is a tensor-product based high-order finite element package for tetrahedral, hexahedral, and prismatic elements that employs tensor-product approximations to significantly reduce computational costs. Nek5000 is a highly scalable spectral element code based on hexahedral elements that currently supports h-adaptive mesh refinement .
Unstructured meshes have the advantage of geometric flexibility, however, the implementation is often less efficient and can at this time not reach simulation sizes as e. g. demonstrated in [3, 11, 12]. Structured meshes support matrix-free methods that can be used to save memory and memory access bandwidth. They are also better suited to exploit the microarchitecture of modern processors, in particular accelerators, such as GPUs. The HyTeG framework presents a compromise between structured and unstructured meshes trying to preserve the geometric flexibility of unstructured meshes while still providing the efficiency that can be achieved by hardware-aware implementations on structured meshes.
2 Domain Partitioning and Subdomain Mesh Topology
An efficient and fully distributed data structure is a vital component for scalable parallel simulation software. In this section, a scheme and the corresponding data structures in HyTeG are presented to partition an unstructured mesh as it is used for hybrid discretisations as introduced in Sec. 1.
2.1 Domain Partitioning
Parallel implementations of stencil-based update rules require a domain partitioning strategy that allows for read-access to the neighbourhood of an unknown. The data dependencies across process boundaries are typically resolved by extending the local subdomains by ghost layers, also referred to as halos, that represent read-only copies of unknowns beyond the boundary of the subdomains. Data points on the ghost layers are communicated across the subdomain boundaries between two subsequent iterations. The hierarchical partitioning scheme of the unstructured coarse grid presented in this section is motivated by the special properties of finite element discretisations on meshes that result from the structured refinement of unstructured meshes . This scheme separates subdomains with different local stencils and results in a unique assignment of unknowns to processes.
HyTeG introduces the concept of macro-primitives. A macro-primitive describes a geometrical object, its position, and orientation. There are different types of macro-primitives that correspond to their respective topological relations: macro-faces for two-dimensional objects (e. g. triangles), macro-edges for one-dimensional lines, and macro-vertices for points.
The unstructured mesh is conceptually converted into a graph of these macro-primitives as illustrated in Fig. 2. The graph’s vertices are instances of macro-primitives. To construct the graph, one graph-vertex that represents one macro-primitive per mesh-vertex, mesh-edge, and mesh-face of the unstructured mesh is inserted into the graph.
The graph-edges connect the graph-vertices to reflect the topology of the unstructured mesh. Macro-primitives are connected with their neighbours of the next lower and higher dimension, i. e., all graph-vertices of type macro-face are only connected to the adjacent macro-edges but not to neighbouring macro-faces.
2.2 Simulation Data
The macro-primitives also serve as containers for arbitrary data structures. With this abstract data-handling, any kind of data can be attached to a macro-primitive, including instances of custom classes or standard C++ data structures.
To construct the hierarchy of refined subdomain meshes, data structures that represent fields (i. e. arrays) of unknowns are allocated and attached to the macro-primitives. The size and shape of the fields on each macro-primitive depend on the primitive’s geometry, the neighbouring macro-primitives, and the refinement level. The fields are extended by ghost layers to facilitate data exchange with the neighbouring macro-primitives when these are stored on different processors in a distributed memory system. Fig. 3 illustrates the structure of such fields of unknowns that arise from a P1 finite-element discretisation on a mesh that is refined by two levels, and is generated based on two coarse-grid mesh elements, corresponding to the mesh in Fig. 2.
Each unknown is assigned to exactly one macro-primitive. The ownership is illustrated by black points in Fig. 3. A read-only copy may reside in the ghost layers of other macro-primitives. The ghost layers are illustrated by white points. The orange squares indicate the same unknown that is only owned by one macro-primitive but resides as a read-only copy in the ghost layers of neighbouring macro-primitives.
The container approach facilitates a flexible extension of the framework by allowing to attach arbitrary fields to the macro-primitives–especially fields that represent unknowns of different discretisations, e. g. higher-order finite element or finite volume discretisations. The field data-structure is discussed in more detail in Sec. 3.
Moreover, the macro-primitives can be attributed with arbitrary data and are not restricted to field data-structures. Such data structures could also for example store logistic data as required for handling special boundary conditions or metadata that is needed for load balancing. Similar approaches to decouple the simulation data from the coarse-grained topology have been presented in [7, 27] and have been shown to provide an elegant interface for runtime load balancing , resilience, and data migration . The macro-primitive graph data structure is therefore not restricted to finite-element based simulation techniques.
To realise runtime load balancing or checkpointing techniques, data serialisation and migration to other processes or to permanent storage are required. Therefore, an interface is provided for callback functions that (de-)serialise the attached data structures. The framework is then able to (de-)serialise and store or migrate the attached data (e. g. fields of unknowns) without knowledge of its actual internal structure, which allows a decoupled design of the migration process.
2.3 Load Balancing
The graph of macro-primitives is the basic data structure to distribute the domain among different processes. During the partitioning of the unstructured mesh to a graph of macro-primitives, each macro-primitive is attributed with a globally unique identifier and is assigned to one process. Each process, however, may own multiple macro-primitives. The connectivity is stored in a distributed fashion as each macro-primitive stores the identifiers of its neighbouring macro-primitives according to the graph’s structure. As a result, no global information about the domain is stored on any process and the mesh is completely distributed. Thus, arbitrary sized meshes can be processed given a sufficient number of processes—this is essential for the design of simulation software that can reach extreme scale.
Simulation on a large partitioned domain often requires a flexible and dynamic load balancing concept. Approaches to create and balance partitions often rely on a graph representation of the targeted domain. The macro-primitive graph shown in Fig. 2 represents such a structure. Each node of the graph represents one macro-primitive of the domain while the graph-edges reflect the communication paths.
Therefore, general graph partitioning algorithms can be applied to this structure without knowledge of the underlying application. Well-designed load balancing libraries already exist [30, 31, 32] and can be used to create partitions of the macro-primitive graph. Figure 4 shows a two-dimensional mesh containing some circular obstacles that was balanced among eight processes using ParMetis .
Such partitions can be created statically during a setup phase or dynamically at runtime. Dynamic load balancing requires runtime data migration and can be realised in a two step process. First, the current partitioning of the global graph is adapted. This could by done by an external library such as ParMetis . Then the simulation data is migrated to the respective processes. In HyTeG, the migration process is realised using the (de-)serialisation routines as described in Sec. 2.2. A scalable dynamic load balancing concept applied to simulations on adaptive meshes is presented in .
The partitioning algorithms can be augmented by weights that are assigned to the individual graph-nodes and -edges. In a finite-element context, a node weight could represent the number of unknowns located in the respective primitive and therefore serves as indicator of the corresponding work load.
Due to the connectivity of the macro-primitive graph, macro-edges are likely to end up on the same processor as their neighbouring macro-faces and vice versa. Since most communication during the simulations takes place along the edges of the macro-primitive graph, this structure naturally yields suitable distributions if edge-cut minimizing partitioning algorithms are employed.
Synchronisation of the read-only copies of unknowns on the ghost layers is employed through communication between neighbouring macro-primitives. The macro-primitive graph structure illustrated in Fig. 2 reflects the direct neighbourhood of the primitives, i. e., communication is only performed along the graph-edges. For example, the ghost layers of macro-faces are updated by communication with the neighbouring macro-edges.
The communication pattern is highly dependent on the update pattern performed on the unknowns and must be adjusted individually. Explicit update rules like matrix-vector products or Jacobi-type smoothing iterations can be combined with more efficient communication patterns than implicit update rules like for example Gauss-Seidel-type smoothing iterations. In general, the latter require a certain global update succession. More details are discussed in Sec. 4.
Corresponding to the abstract data-handling concept to attach arbitrary data structures to the macro-primitives, the framework supports a similarly flexible communication abstraction. A three-layer abstraction is employed to decouple the individual components of the communication process from another.
The buffer layer provides abstraction from the actual (MPI-)send and receive calls and the internal data buffer structure. C++ operator overloading for STL data structures and basic data types allows for a convenient (de-)serialisation of the transferred data.
The packing layer is responsible for the (de-)serialisation of the attached data of a macro-primitive from and to data buffers. It provides a (de-)serialisation interface that is implemented for each data structure that must be communicated. For example, there are implementations for (de-)serialisation of unknowns that shall be sent to the ghost layers of neighbouring primitives. Upon invocation, the serialisation routines pack the respective unknowns into a buffer on one primitive and the deserialisation routines unpack them on the ghost layers of the corresponding neighbour. However, the packing layer does not employ any communication but is only called to (un-)pack data from and to buffers.
The control layer manages the communication directions along the graph-edges of the macro-primitive graph. Before sending and after receiving buffers, it calls the respective (de-)serialisation routines of the packing layer without knowledge of the actual data structures. The control layer employs optimisations like non-blocking communication and calls process-local communication routines if graph-edges do not cross process boundaries.
This decoupled design allows for intensive code reuse and enhances the extensibility. The control layer is not affected by the introduction of new data structures since only the interface of the packing layer must be implemented to allow for communication of the respective data. A well-tested buffer layer abstraction is for example already implemented in the core of the waLBerla framework  which serves also as a basis for this implementation.
3 Structured Refinement of Mesh Data
In this section, the structured field data-structures for the DoFs inside the macro-primitives are presented that result from the structured refinement of an unstructured topology. Since the largest part of the unknowns is located on the macro-faces the descriptions in the following are focused on these primitives. The presented concepts can be adapted for macro-edges and macro-vertices as well.
3.1 Structured Subdomains
As illustrated in Fig. 1, starting from a triangulation of the domain, a recursive refinement is performed by connecting the three midpoints of the triangle edges. This results in four new triangles of the same congruence class as e. g. presented in .
Unknowns are placed at certain positions in the structured refined element mesh depending on the corresponding finite element or finite volume discretisation. They are classified as vertex-unknowns, edge-unknowns and face-unknowns reflecting their topological position on the mesh, cf. Fig. 5.
In the case of first-order finite elements, the unknowns are placed on the vertices as shown in Fig. 5(a). Second-order finite elements additionally require DoFs on the edges as displayed in Fig. 5(b). Other discretisations, such as e. g. finite volumes can be realised by placing the unknowns on the interior of the triangles as illustrated in Fig. 5(c). More details are given in Sec. 4.1.
The edge-unknowns are further separated into subgroups depending on whether they are placed on a horizontal, vertical or diagonal edge of an element. The different groups are shown in different colours in Fig. 5(b). A similar grouping for rectangular elements has been presented in .
Similar to the edge-unknowns, the face-unknowns are also separated into subgroups depending on the orientation of the element they reside in. As illustrated in Fig. 5(c), there are two subgroups: upward- upward- and downward-facing elements.
One subgroup on its own (e. g. all horizontal edge unknowns or all face-unknowns in upward-facing elements) leads again to a triangular pattern similar to that of the vertex unknowns. Because of the similar patterns, a single array-access calculation routine for variable sized triangular field layouts is capable of calculating the array accesses for vertex-unknowns as well as for each subgroup of edge- and face-unknowns. Therefore, the separation into subgroups avoids code duplication in the sense that the array-access routines can be reused for all types of unknowns.
Instead of modularizing access patterns and memory layouts by element type, the required features are implemented per type of unknown, i. e., modules are implemented for vertex-unknowns, edge-unknowns, and face-unknowns. By combining these modules, arbitrary element types can be created, for example finite elements of different order. This approach reduces complexity and code duplication since routines that process vertex-unknowns can be reused for all discretisations that require vertex-unknowns. Possible examples include numerical routines that update the unknowns or routines that serialise data for communication or IO. Additionally, advanced applications that employ advanced or uncommon element types can be prototyped and implemented rapidly by combining different types of unknowns.
One challenge when comparing these triangular fields with their equivalents on quadratic or rectangular meshes (as e. g. used in ) is the translation of the topological index to the actual memory layout. While in the rectangular case the number of unknowns is constant in each row and column, this is not the case for triangles, cf. Fig. 6. In case of a simple linear data arrangement as shown in Fig. 6(b), this means that the offset from one row to another decreases constantly.
Furthermore, it is desirable that the memory layout can be changed to fit the requirements of a compute kernel and to improve the performance for the specific processor microarchitecture. Possible scenarios could be colouring schemes e. g. used in a multi-colour Gauss-Seidel smoother or the possibility to use SIMD operations more efficiently. For both of these reasons a layer of abstraction is introduced to separate the indexing from the actual memory layout. A fixed topological enumeration is introduced and flexible methods are implemented to access the unknowns using indices.
An example for DoFs located on the vertices on a macro-face can be seen in Fig. 6. The fixed topological indexing is shown in Fig. 6(a). These coordinates are translated via an indexing function into one possible memory layout as shown in Fig. 6(b). The memory layout can be exchanged according to the specific needs of the application simply by exchanging the indexing function. The topological enumeration is defined for each type of unknown, i. e. vertex-, edge- or face-unknowns.
By introducing the abstract indexing, the array access calculations are decoupled from the algorithms that access the unknowns. Since the translation to absolute array indices is hidden behind the indexing functions, array layouts can be switched by simply exchanging the indexing function. Consequently, different memory layouts can be added during later phases of the development and can be compared to each other without rewriting existing kernels that are not performance critical while time consuming compute kernels can be tuned for the specific layout. This provides the possibility to employ efficient memory access patterns to improve e. g. cache locality by using techniques as described in  and .
3.3 Inter-Primitive Data Exchange
Due to the underlying unstructured mesh, it is not possible to implicitly calculate the neighbourhood relations of a certain macro-primitive. The number of neighbours may vary for each primitive type. While macro-faces always have three neighbouring macro-edges and macro-vertices, a macro-edge can have one or two neighbouring macro-faces. A macro-vertex can have an arbitrary number of neighbouring primitives depending on the mesh. Since the data structures are fully distributed there is no global knowledge about the neighbourhood relations. Therefore, local neighbourhood information must be explicitly stored for each primitive.
As explained in Sec. 2.4 a layered approach is used where the control layer takes care that the correct primitives communicate with each other. Before sending the data, however, serialisation is needed as well as packing the data into a buffer that is sent to a receiving primitive belonging to a process located on a remote node. This means that each macro-face, macro-edge, and macro-vertex needs to copy the data that all of their neighbours require into one or more buffers. If two adjacent macro-primitives are located on the same process, the data can be copied directly from one array to the other without (de-)serialisation to or from a buffer.
It is important to point out that in case of a macro-face the memory access pattern is quite different for each of the three neighbouring macro-edges. The desired data might be located consecutively in memory but there could also be varying strides between the entries. Once more, the indexing abstraction discussed in Sec.3.2 can be used to simplify the corresponding serialisation kernels.
Another complication is the ordering of unknowns in the topological layout. Since the input mesh is fully unstructured it is not guaranteed that the orientation of one side of a macro-face is equivalent to the orientation of the neighbouring macro-edge. This means that the particular order of unknowns might have to be adjusted during the communication. The approach to solving this problem is that the primitive of higher dimension takes care of possible adjustments. For example, a macro-face would write the data in reverse order into the send buffer such that the corresponding macro-edge can simply read the data out of the buffer consecutively.
The steps that have to be performed when communicating one side of a macro-face to the adjacent macro-edge are as follows:
Determine which of the three sides of the macro-face the macro-edge is adjacent to and how the topological layout is oriented.
Determine the process associated with the adjacent macro-edge.
If the edge is located on the same process: copy the macro-face data directly into the ghost layers of the macro-edge considering the orientation.
If the macro-edge is located on a different process:
copy the macro-face data into the send buffer considering the orientation,
transfer the buffer to the process the macro-edge is located on, and
copy the data from the receive buffer into the ghost layers of the macro-edge.
In Fig. 7 an example communication from one macro-face to one of the three neighbouring macro-edges is depicted. In this case, the unknowns are written to the buffer in reversed order by the macro-face since the orientation in the macro-edge is opposite.
4 Numerical Methods and Linear Algebra
The discretisation of PDEs on the mesh structures of HyTeG typically leads to large systems of equations. For the implementation of iterative solvers, standard operations like addition, scalar products, and matrix-vector multiplications have to be performed. In the following, the realisation of these concepts within HyTeG is described.
The data structures described in Sec. 3 support storing unknowns at different positions in the topology of the underlying mesh. By grouping different unknowns together, finite elements or other discretisations can be constructed. Fig. 8 illustrates some examples of typical finite elements. The first row from Fig. 8(a) to Fig. 8(c) presents the usual conforming finite elements  from first to third order. Additionally, also non-conforming finite elements that are discontinuous across the edges can be realised, cf. Fig. 8(d) to Fig. 8(f). The elements in Figs. 8(e) and 8(f) are inherently equivalent but differ only in the location on the element the unknowns are associated with.
A global variable on the mesh discretised by a specific element may be represented by the union of all unknowns. This union represents a vector in a finite-dimensional linear space. On these vectors, the HyTeG framework implements a set of functions similar to the BLAS level 1 routines . For example, the routine assign shown in Alg. 1 computes a linear combination of two vectors.
Similarly, it is possible to implement reductions on such vectors. As an example, the implementation of the equivalent of the xDOT BLAS routine is presented in Alg. 2.
Note that the computation performed within a primitive is independent of others, thus each primitive may be processed in parallel.
4.2 Matrix-Free Approach
When solving very large systems of equations obtained from the discretisation of PDEs, the memory consumption of matrix-based implementations may be impracticably high, even when using appropriate sparse matrix formats. HyTeG therefore employs matrix-free methods based on stencils (cf. Fig. 9). Stencil operations additionally allow for efficient parallel kernels to carry out matrix-vector multiplications or point-wise smoothers.
The stencil approach is illustrated in Fig. 9 for second-order triangular finite elements. In this approach, the regularity of the grid is exploited by storing the non-zero elements of a matrix row as stencil entries for the corresponding unknown and its neighbours. Fig. 9(a) shows the stencil associated with a micro-vertex DoF by highlighting the DoFs within the compact support of this vertex. Figs. 9(b)–9(d) display the stencils associated with diagonal, horizontal, and vertical micro-edge DoFs, respectively.
In HyTeG a matrix is never fully stored in memory, except possibly on coarse levels of a multigrid hierarchy. On finer refinement levels, only the result of a matrix-vector multiplication is computable and may be stored in another vector. The implementation of the matrix-vector multiplication is up to the developer. In conventional finite-element solvers this is usually done through quadrature loops which integrate the bilinear forms of the weak form on-the-fly or by using pre-computed local element matrices. In the special case of linear PDEs with constant coefficients, only one constant stencil has to be stored for each primitive. An example for the matrix-vector multiplication implementation in this case is given in Alg. 3.
Note again that the computations in the interior of a primitive are independent of other primitives. A communication step is required only before primitives of the next dimension are processed. Thus the work in each primitive may be distributed across processes.
4.3 Iterative Solvers
The basic linear algebra routines as discussed in Sec. 4.1 and Sec. 4.2 allow the implementation of iterative solvers for linear systems. On coarse grids we may employ Krylov or direct solvers. These solvers can be implemented directly or via an interface to external libraries like PETSc . The preconditioned conjugate gradient method implemented in HyTeG can e. g. be used for symmetric and positive definite problems. For indefinite symmetric problems, the preconditioned minimal residual method is typically applied.
For large problems, geometric multigrid solvers are employed. Multigrid methods are essential, since they exhibit asymptotically optimal complexity and thus become superior when progressing to extreme scale 
. Interpolation and restriction operators between levels are required for multigrid and may be realised by matrix-vector multiplications with non-square matrices. Note that the most efficient smoothers for multigrid often require pointwise updates and thus access to the diagonal entries of the stiffness matrix. This access is provided by the stencil paradigm, but it might not be easy to accomplish in alternative matrix-free methods when only the matrix-vector-multiplication is realised. InHyTeG, multigrid can be used as a solver or as a preconditioner in a Krylov solver.
Two major categories of refinement in multigrid methods are h- and p-refinement . h-refinement describes a refinement process in that the same finite element discretisation is employed on two meshes with different geometric refinement levels. The combination of different orders of finite element discretisations, e. g. P1 and P2 elements, allows to employ p-refinement. Here the interpolation and restriction is performed between different discretisations on the same grid, i. e., on the same geometric refinement level. Combining both approaches allows the implementation of geometric multigrid solvers with hp-refinement. For the solution of large saddle-point problems, an all-at-once multigrid solver using pointwise inexact Uzawa smoothers is available .
5 Example Applications
The following applications are chosen as a proof of concept for the software architecture and design aspects.
5.1 Stokes Flow
In the first example an isoviscous and incompressible Stokes flow through a porous structure is considered (cf. Fig. 10) as modelled by
A parabolic inflow profile is prescribed at the left, no-slip conditions at the top and bottom, and Neumann outflow boundary conditions at the right boundary of the domain. As a stable discretisation, the P2-P1 pairing (Taylor-Hood ) is used, i. e., the velocity is discretised by second- and the pressure by first-order finite elements. The linear system resulting from a discretisation with refinement level 4 is directly solved by an decomposition using the PETSc interface of HyTeG. Fig. 10 shows the velocity magnitude and pressure field as well as the coarse grid structure of the underlying mesh.
5.2 Energy Transport
In the second example, the temperature driven convection from the inner to the outer boundary in an annulus is simulated as a simplified model of Earth mantle convection (cf. Fig. 11).
The process is described by the isoviscous, incompressible Stokes equations for velocity and pressure, coupled with a convection-diffusion equation for the temperature. The governing coupled system of equations for the velocity , pressure , and temperature is given by
Here represents the dimensionless Rayleigh number, the Péclet number, and an outward pointing radial vector. The inverse of the Péclet number is very small, i. e., , such that the model is dominated by numerical diffusion. This diffusion becomes smaller with decreasing mesh size. At the boundaries, homogeneous Dirichlet conditions are employed for the velocity . For the dimensionless temperature a value of is prescribed at the inner boundary and of at the outer boundary.
Here the Stokes system is discretised by equal order P1-P1 finite elements together with a Petrov-Galerkin pressure (PSPG) stabilisation  for the pressure. The convection is realised by a finite volume discretisation. To advance the system in time, the Stokes system and the convection-diffusion equation are treated with different time steps. The Stokes system is solved only after every three time steps of updating the convection-diffusion equation. The pressure and velocity are simultaneously solved in a monolithic multigrid algorithm, similar to . This has been proposed in the classical multigrid literature  as the most efficient approach to solving the Stokes system. The temperature is advanced via an explicit upwind scheme using the previously computed velocity field.
The partitioned annulus domain contains macro-faces and unknowns per macro-face on the finest refinement level 6, i. e., in total more than half a million DoFs.
6 Conclusion and Outlook
In this article, the design principles of the generic new finite-element framework HyTeG are presented. By design, HyTeG supports higher-order finite elements on a hierarchically discretised domain. It provides scalable, parallel data structures, sophisticated load balancing, an abstract memory layout, and a layered communication concept. The basic design will also support adaptivity and asynchronous execution as needed when progressing to extreme-scale computing. The approach to achieve high single-node performance and good scalability within HyTeG is based on the combination of an unstructured topology with structured data inside the macro-primitives. The data structures can be used to implement efficient matrix-free methods and the basic numerical building blocks for iterative solvers. The stencil approach allows for the scalable and memory-aware implementation of point-wise smoothers that are essential to employ fast geometric multigrid solvers.
The concept is designed towards a seamless transition to three-dimensional simulations building on tetrahedral macro-primitives and is currently extended. In future work the scalability and performance of HyTeG will be examined in more detail. A major focus lies on the implementation of massively parallel geometric multigrid solvers for the Stokes system and their application to large-scale geophysical simulations. Beyond these first examples, the framework is also well suited for other applications that require large-scale simulations with finite elements, such as in electromagnetics or fluid dynamics.
This work was partly supported by the German Research Foundation through the Priority Programme 1648 ”Software for Exascale Computing” (SPPEXA) and by grant WO671/11-1.
-  B. Bergen and F. Hülsemann, Hierarchical hybrid grids: A framework for efficient multigrid on high performance architectures, Technical Report, Lehrstuhl für Systemsimulation, Universität Erlangen 5 (2003).
-  B. Bergen, F. Hülsemann, and U. Rüde, Is Unknowns the Largest Finite Element System that Can Be Solved Today?, in Proc. 2005 ACM/IEEE Conf. on Supercomputing, SC ’05, ACM, Seattle, Washington, USA, 2005, pp. 5–5.
-  B. Gmeiner, M. Huber, L. John, U. Rüde, and B. Wohlmuth, A quantitative performance study for Stokes solvers at the extreme scale, J. Comput. Sci. 17 (2016), pp. 509–521.
-  C. Feichtinger, S. Donath, H. Köstler, J. Götz, and U. Rüde, WaLBerla: HPC software design for computational engineering simulations, J. Comput. Sci. 2 (2011), pp. 105–112, Simulation Software for Supercomputers.
-  H. Köstler and U. Rüde, The CSE Software Challenge–Covering the Complete Stack, IT-Inform. Technology 55 (2013), pp. 91–96.
-  C. Godenschwager, F. Schornbaum, M. Bauer, H. Köstler, and U. Rüde, A Framework for Hybrid Parallel Flow Simulations with a Trillion Cells in Complex Geometries, in Proc. Int. Conf. on High Performance Computing, Networking, Storage and Analysis, SC ’13, Denver, Colorado, ACM, 2013, pp. 35:1–35:12.
-  F. Schornbaum and U. Rüde, Massively Parallel Algorithms for the Lattice Boltzmann Method on NonUniform Grids, SIAM J. Sci. Comput. 38 (2016), pp. C96–C126.
-  B. Bergen and F. Hülsemann, Hierarchical hybrid grids: data structures and core algorithms for multigrid, Numer. Linear Algebra Appl. 11 (2004), pp. 279–291.
-  B.K. Bergen, Hierarchical Hybrid Grids: Data structures and core algorithms for efficient finite element simulations on supercomputers, SCS Publishing House, 2006.
-  B. Bergen, T. Gradl, U. Rüde, and F. Hülsemann, A Massively Parallel Multigrid Method for Finite Elements, Comput. Sci. Eng. 8 (2006), pp. 56–62.
-  B. Gmeiner, U. Rüde, H. Stengel, C. Waluga, and B. Wohlmuth, Performance and Scalability of Hierarchical Hybrid Multigrid Solvers for Stokes Systems, SIAM J. Sci. Comput. 37 (2015), pp. C143–C168.
-  B. Gmeiner, U. Rüde, H. Stengel, C. Waluga, and B. Wohlmuth, Towards textbook efficiency for parallel multigrid, Numer. Math. Theory Methods Appl. 8 (2015), p. 22–46.
-  S. Bauer, M. Mohr, U. Rüde, J. Weismüller, M. Wittmann, and B. Wohlmuth, A two-scale approach for efficient on-the-fly operator assembly in massively parallel high performance multigrid codes, Appl. Numer. Math. 122 (2017), pp. 14–38.
-  S. Bauer, D. Drzisga, M. Mohr, U. Ruede, C. Waluga, and B. Wohlmuth, A stencil scaling approach for accelerating matrix-free finite element implementations, CoRR abs/1709.06793 (2017), Submitted to SIAM J. Sci. Comput.
-  C. Waluga, B. Wohlmuth, and U. Rüde, Mass-corrections for the conservative coupling of flow and transport on collocated meshes, J. Comput. Phys. 305 (2016), pp. 319–332.
-  H.C. Elman, D.J. Silvester, and A.J. Wathen, Finite Elements and Fast Iterative Solvers: with Applications in Incompressible Fluid Dynamics, OUP Oxford, 2014.
-  P. Bastian, M. Blatt, A. Dedner, C. Engwer, R. Klöfkorn, M. Ohlberger, and O. Sander, A generic grid interface for parallel and adaptive scientific computing. Part I: abstract framework, Computing 82 (2008), pp. 103–119.
-  P. Bastian, M. Blatt, A. Dedner, C. Engwer, R. Klöfkorn, M. Ohlberger, and O. Sander, A generic grid interface for parallel and adaptive scientific computing. Part II: implementation and tests in DUNE, Computing 82 (2008), pp. 121–138.
-  A. Dedner, R. Klöfkorn, M. Nolte, and M. Ohlberger, A generic interface for parallel and adaptive discretization schemes: abstraction principles and the DUNE-FEM module, Computing 90 (2010), pp. 165–196.
-  B.S. Kirk, J.W. Peterson, R.H. Stogner, and G.F. Carey, libMesh: a C++ library for parallel adaptive mesh refinement/coarsening simulations, Eng. Comput. 22 (2006), pp. 237–254.
-  A. Vogel, S. Reiter, M. Rupp, A. Nägel, and G. Wittum, UG 4: a novel flexible software system for simulating PDE based models on high performance computers, Comp. Vis. Sci. 16 (2013), pp. 165–179.
-  C. Cantwell, D. Moxey, A. Comerford, A. Bolis, G. Rocco, G. Mengaldo, D.D. Grazia, S. Yakovlev, J.E. Lombard, D. Ekelschot, B. Jordi, H. Xu, Y. Mohamied, C. Eskilsson, B. Nelson, P. Vos, C. Biotto, R. Kirby, and S. Sherwin, Nektar++: An open-source spectral/hp element framework, Comput. Phys. Commun. 192 (2015), pp. 205–219.
-  D. Arndt, W. Bangerth, D. Davydov, T. Heister, L. Heltai, M. Kronbichler, M. Maier, J.P. Pelteret, B. Turcksin, and D. Wells, The deal.II library, version 8.5, J. Numer. Math. 25 (2017), pp. 137–146.
-  W. Bangerth, R. Hartmann, and G. Kanschat, deal.II – a general purpose object oriented finite element library, ACM Trans. Math. Softw. 33 (2007), pp. 24/1–24/27.
-  J.W. Lottes, P.F. Fischer, and S.G. Kerkemeier, Nek5000 web page (2008), Http://nek5000.mcs.anl.gov.
-  A. Peplinski, P.F. Fischer, and P. Schlatter, Parallel Performance of H-type Adaptive Mesh Refinement for Nek5000, in Proc. 2016 Exascale Appl. Softw. Conf., EASC ’16, Stockholm, Sweden, ACM, 2016, pp. 4:1–4:9.
-  B. Acun, A. Gupta, N. Jain, A. Langer, H. Menon, E. Mikida, X. Ni, M. Robson, Y. Sun, E. Totoni, L. Wesolowski, and L. Kale, Parallel Programming with Migratable Objects: Charm++ in Practice, in Proc. Int. Conf. on High Performance Computing, Networking, Storage and Analysis, SC ’14, New Orleans, Louisana, IEEE Press, Piscataway, NJ, USA, 2014, pp. 647–658.
-  F. Schornbaum and U. Rüde, Extreme-Scale Block-Structured Adaptive Mesh Refinement, CoRR abs/1704.06829 (2017), Accepted for publication in SIAM J. Sci. Comput.
-  N. Kohl, J. Hötzer, F. Schornbaum, M. Bauer, C. Godenschwager, H. Köstler, B. Nestler, and U. Rüde, A Scalable and Extensible Checkpointing Scheme for Massively Parallel Simulations, Int. J. High Perform. Comput. Appl. 0 (2018), pp. 1–19.
-  G. Karypis and V. Kumar, A parallel algorithm for multilevel graph partitioning and sparse matrix ordering, J. Parallel Distrib. Comput. 48 (1998), pp. 71–95.
-  C. Chevalier and F. Pellegrini, PT-Scotch: A Tool for Efficient Parallel Graph Ordering, Parallel Comput. 34 (2008), pp. 318–331.
-  E.G. Boman, U.V. Çatalyürek, C. Chevalier, and K.D. Devine, The Zoltan and Isorropia Parallel Toolkits for Combinatorial Scientific Computing: Partitioning, Ordering and Coloring, Sci. Program. 20 (2012), pp. 129–150.
-  R.E. Bank and A.H. Sherman, A refinement algorithm and dynamic data structure for finite element meshes, Computer Science Department, UT Austin, 1980.
-  S. Kuckuk and H. Köstler, Automatic Generation of Massively Parallel Codes from ExaSlang, Computation 4 (2016), p. 27.
-  M. Kowarschik, U. Rüde, C. Weiss, and W. Karl, Cache-aware multigrid methods for solving Poisson’s equation in two dimensions, Computing 64 (2000), pp. 381–399.
-  M. Kowarschik, U. Rüde, N. Thürey, and C. Weiß, Performance optimization of 3D multigrid on hierarchical memory architectures, in Int. Workshop Appl. Parallel Comput., Springer, 2002, pp. 307–316.
-  C.L. Lawson, R.J. Hanson, D.R. Kincaid, and F.T. Krogh, Basic Linear Algebra Subprograms for Fortran Usage, ACM Trans. Math. Softw. 5 (1979), pp. 308–323.
-  S. Balay, W.D. Gropp, L.C. McInnes, and B.F. Smith, Efficient Management of Parallelism in Object Oriented Numerical Software Libraries, in Modern Software Tools in Scientific Computing, Birkhäuser Press, 1997, pp. 163–202.
-  L. John, U. Rüde, B. Wohlmuth, and W. Zulehner, On the analysis of block smoothers for saddle point problems, arXiv preprint arXiv:1612.01333 (2016).
-  F. Brezzi and J. Douglas, Stabilized mixed methods for the Stokes problem, Numer. Math. 53 (1988), pp. 225–235.
-  A. Brandt and O.E. Livne, Multigrid techniques: 1984 guide with applications to fluid dynamics, Vol. 67, SIAM, 2011.