1 Introduction
Current developments in computer architecture are driven by modern multiprocessors with ever-increasing parallelism. This includes process-level parallelism across a growing number of compute cores, data-level parallelism in terms of vector processing (single instruction, multiple data; SIMD), and instruction-level parallelism. Parallelism is even more essential in scientific computing, where increasingly complex physical and technical problems can be solved by numerical simulations on high-performance computers. With the advent of extreme-scale computing, scientific software must be capable of exploiting the massive parallelism provided by modern supercomputers.
In this article, the design and implementation of a new finite element software framework HyTeG (Hybrid Tetrahedral Grids) will be presented. With HyTeG, we aim to develop a flexible, extensible, and sustainable framework for massively parallel finite element computations. To meet the requirements of a scalable and modular software structure, a carefully designed software architecture is essential. Our work builds on experience with designing complex and scalable software systems in scientific computing, such as Hierarchical Hybrid Grids (HHG) [1, 2, 3] and waLBerla (http://walberla.net) [4, 5, 6, 7]. Low-level, hardware-aware implementation techniques are combined with a modular software architecture to achieve extreme-scale efficiency on current and future supercomputing platforms. In the scope of this article, the software concepts are presented for two-dimensional triangular finite elements. All features, however, are implemented to be reusable also for three-dimensional simulations.
1.1 Motivation
HyTeG is a generic framework for finite-element computations that builds on some of the design principles of the preceding HHG software [8], but is designed to exceed it in functionality, flexibility, and sustainability.
Both frameworks build on a hierarchical discretisation approach that employs unstructured coarse meshes combined with structured and uniform refinement [1, 9]. This structure is depicted in Fig. 1 for a two-dimensional triangular mesh.
The resulting nested hierarchy of grids is used to implement geometric multigrid methods. When progressing to extreme scale, multigrid methods are essential since they exhibit asymptotically optimal complexity. This algorithmic scalability makes them far superior to most alternative solver algorithms. Additionally, the regularity of the refined elements is exploited by representing the system matrix in terms of stencils whose entries correspond to the nonzero elements of a row in the stiffness matrix. The matrix-free approach of HHG combined with a memory access structure that avoids indirections is favourable for modern hardware, as it minimises both memory consumption and memory access operations.
Based on a similar approach, HHG was designed as a multigrid library with excellent efficiency for scalar elliptic problems [10, 2] and for Stokes flow [11, 12, 3]. The largest published results [3] exceed the capability of alternative approaches by several orders of magnitude in the number of degrees of freedom (DoFs). We emphasise that solving systems of such size would not be possible on any current supercomputer with matrix-based methods, where the sparse matrix must be stored explicitly. Techniques like on-the-fly stencil assembly [13, 14] can keep the memory requirements low also for more complex problems.
HHG, however, is restricted to conforming linear finite elements. Therefore, transport processes, for example, have to be modelled employing a duality between nodal finite element meshes and polyhedral control volumes for a finite volume discretisation, as in [15]. The execution model is strictly bulk-synchronous and there is no support for dynamic adaptivity and load balancing, which significantly limits the applicability. The new HyTeG framework is therefore a completely new design, built on similar principles but generalising the concept to remove these limitations.
1.2 Design Goals and Contribution
A core feature of the new HyTeG code is multiscale tetrahedral higher-order finite elements with uniformly structured refinement. This leads to excellent computational performance combined with high geometric flexibility and improved spatial accuracy for problems with sufficient regularity. The design as described here is restricted to simplices. However, the software structure is kept flexible enough to also support hexahedral or prismatic elements in future versions.
The data layout of HyTeG supports DoFs on the nodes, edges, faces, and volumes of a finite element mesh in order to support most grid-based discretisations, including e. g. discontinuous Galerkin methods. Edge DoFs facilitate stable discretisations of flow problems, as in the Taylor-Hood discretisation [16].
To achieve superior scalability, an improved domain partitioning concept with abstract data handling is presented in Sec. 2. This software architecture supports static and dynamic load balancing techniques, permits asynchronous execution, and forms the foundation for data migration, advanced resilience techniques, and adaptive mesh refinement. A newly introduced approach to classify and separately store the mesh data based on their topological location in the finite element mesh is described in Sec. 3. These mesh structures can be implemented with index-based memory access, avoiding the usual indirection of sparse matrix structures with its inherent performance penalty. This direct access facilitates implementing the most time-consuming compute kernels with superior efficiency, since the data structures are specifically designed to exploit instruction-level parallelism and vectorisation. An array-access abstraction is presented that allows the underlying memory layout of the unknowns to be adapted to the access patterns of the compute kernels. Various solvers, in particular geometric multigrid methods, can be implemented easily. To this end, HyTeG supports combining different finite element spaces and the corresponding operators. The algorithmic building blocks for coupled systems of partial differential equations (PDEs) are described in Sec. 4. Currently, simulating Earth mantle convection is the primary application target, a classical extreme-scale science problem. Mantle convection is the driving force of plate tectonics, causing earthquakes and mountain formation. This application motivates the example computations presented in Sec. 5. However, the HyTeG framework is also well suited for many other physical applications that require very large-scale finite element models.
1.3 Related Work
Many high-performance simulation frameworks can be classified into two main categories based on the meshes they use for representing the simulation domain. Frameworks that support finite elements on unstructured meshes include DUNE [17, 18] with its module DUNE-FEM [19], libMesh [20], UG4 [21], and NEKTAR++ [22]. Another class of frameworks uses structured meshes, including deal.II [23, 24], Nek5000 [25], and waLBerla [4].
The frameworks DUNE, libMesh, UG4, and deal.II support higher-order conforming and non-conforming finite elements as well as discontinuous Galerkin methods, combined with h-, p-, and hp-refinement and adaptivity. Additionally, DUNE and deal.II provide element-based matrix-free methods on polyhedral elements and structured hexahedral grids, respectively. NEKTAR++ is a high-order finite element package for tetrahedral, hexahedral, and prismatic elements that employs tensor-product approximations to significantly reduce computational costs [22]. Nek5000 is a highly scalable spectral element code based on hexahedral elements that currently supports h-adaptive mesh refinement [26]. Unstructured meshes have the advantage of geometric flexibility; however, their implementations are often less efficient and currently cannot reach simulation sizes as demonstrated e. g. in [3, 11, 12]. Structured meshes support matrix-free methods that save memory and memory bandwidth. They are also better suited to exploit the microarchitecture of modern processors, in particular accelerators such as GPUs. The HyTeG framework presents a compromise between structured and unstructured meshes, aiming to preserve the geometric flexibility of unstructured meshes while still providing the efficiency that hardware-aware implementations achieve on structured meshes.
2 Domain Partitioning and Subdomain Mesh Topology
An efficient and fully distributed data structure is a vital component for scalable parallel simulation software. In this section, a scheme and the corresponding data structures in HyTeG are presented to partition an unstructured mesh as it is used for hybrid discretisations as introduced in Sec. 1.
2.1 Domain Partitioning
Parallel implementations of stencil-based update rules require a domain partitioning strategy that allows for read access to the neighbourhood of an unknown. The data dependencies across process boundaries are typically resolved by extending the local subdomains by ghost layers, also referred to as halos, that represent read-only copies of unknowns beyond the boundary of the subdomains. Data points on the ghost layers are communicated across the subdomain boundaries between two subsequent iterations. The hierarchical partitioning scheme of the unstructured coarse grid presented in this section is motivated by the special properties of finite element discretisations on meshes that result from the structured refinement of unstructured meshes [8]. This scheme separates subdomains with different local stencils and results in a unique assignment of unknowns to processes.
HyTeG introduces the concept of macroprimitives. A macroprimitive describes a geometrical object, its position, and its orientation. There are different types of macroprimitives corresponding to their respective topological relations: macrofaces for two-dimensional objects (e. g. triangles), macroedges for one-dimensional lines, and macrovertices for points.
The unstructured mesh is conceptually converted into a graph of these macroprimitives, as illustrated in Fig. 2. The graph’s vertices are instances of macroprimitives. To construct the graph, one graphvertex representing the corresponding macroprimitive is inserted for each meshvertex, meshedge, and meshface of the unstructured mesh.
The graphedges connect the graphvertices to reflect the topology of the unstructured mesh. Macroprimitives are connected with their neighbours of the next lower and higher dimension, i. e., all graphvertices of type macroface are only connected to the adjacent macroedges but not to neighbouring macrofaces.
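This construction can be illustrated with a small sketch. The following Python snippet (illustrative only; HyTeG itself is written in C++, and all names here are hypothetical) builds the macroprimitive graph for an unstructured mesh of two triangles sharing one edge, connecting macrofaces only to macroedges and macroedges only to macrovertices:

```python
# Illustrative sketch (not HyTeG's actual API): build the macroprimitive
# graph for a small unstructured mesh. Faces connect only to their edges,
# edges only to their vertices; faces are never connected to faces.

def build_primitive_graph(vertices, faces):
    """Return (primitives, adjacency) for the macroprimitive graph."""
    edges = set()
    for f in faces:
        for a, b in ((f[0], f[1]), (f[1], f[2]), (f[0], f[2])):
            edges.add(tuple(sorted((a, b))))
    edges = sorted(edges)

    primitives = (
        [("vertex", v) for v in vertices]
        + [("edge", e) for e in edges]
        + [("face", f) for f in faces]
    )
    adjacency = set()
    # macroedges <-> macrovertices
    for e in edges:
        for v in e:
            adjacency.add((("edge", e), ("vertex", v)))
    # macrofaces <-> macroedges (next lower dimension only)
    for f in faces:
        for a, b in ((f[0], f[1]), (f[1], f[2]), (f[0], f[2])):
            adjacency.add((("face", f), ("edge", tuple(sorted((a, b))))))
    return primitives, adjacency

# two triangles sharing the edge (1, 2)
verts = [0, 1, 2, 3]
faces = [(0, 1, 2), (1, 2, 3)]
prims, adj = build_primitive_graph(verts, faces)
```

For this mesh the graph has 4 + 5 + 2 = 11 graphvertices, and no graphedge connects two macrofaces directly.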
2.2 Simulation Data
The macroprimitives also serve as containers for arbitrary data structures. With this abstract data handling, any kind of data can be attached to a macroprimitive, including instances of custom classes or standard C++ data structures.
To construct the hierarchy of refined subdomain meshes, data structures that represent fields (i. e. arrays) of unknowns are allocated and attached to the macroprimitives. The size and shape of the fields on each macroprimitive depend on the primitive’s geometry, the neighbouring macroprimitives, and the refinement level. The fields are extended by ghost layers to facilitate data exchange with neighbouring macroprimitives when these are stored on different processes in a distributed-memory system. Fig. 3 illustrates the structure of such fields of unknowns for a P1 finite-element discretisation on a mesh that is refined by two levels and generated from two coarse-grid mesh elements, corresponding to the mesh in Fig. 2.
Each unknown is assigned to exactly one macroprimitive. The ownership is illustrated by black points in Fig. 3. A readonly copy may reside in the ghost layers of other macroprimitives. The ghost layers are illustrated by white points. The orange squares indicate the same unknown that is only owned by one macroprimitive but resides as a readonly copy in the ghost layers of neighbouring macroprimitives.
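The field sizes implied by this ownership scheme are easy to compute. The sketch below (an assumption for illustration: refinement level L produces 2^L + 1 grid points along each side of a macroface, consistent with repeated edge-midpoint refinement) counts the vertex unknowns of a P1 field on one macroface and the subset owned by the macroface itself, i. e. the interior points:

```python
# Sketch: size of a P1 vertex-unknown field on one macroface.
# Assumption: level L means L rounds of uniform refinement, so each
# side of the macroface carries 2**L + 1 grid points.

def side_points(level):
    return 2 ** level + 1

def total_vertex_unknowns(level):
    n = side_points(level)
    return n * (n + 1) // 2              # triangular number

def interior_vertex_unknowns(level):
    # boundary points are owned by neighbouring macroedges/macrovertices
    n = side_points(level)
    return total_vertex_unknowns(level) - 3 * (n - 1)
```

At level 2 a macroface holds 15 vertex unknowns in total but owns only the 3 interior ones; the remaining 12 lie on its boundary and are owned by macroedges and macrovertices.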
The container approach facilitates a flexible extension of the framework by allowing arbitrary fields to be attached to the macroprimitives, especially fields that represent unknowns of different discretisations, e. g. higher-order finite element or finite volume discretisations. The field data structure is discussed in more detail in Sec. 3.
Moreover, the macroprimitives can be attributed with arbitrary data and are not restricted to field data structures. Such data structures could, for example, also store logistical data required for handling special boundary conditions or metadata needed for load balancing. Similar approaches to decouple the simulation data from the coarse-grained topology have been presented in [7, 27] and have been shown to provide an elegant interface for runtime load balancing [28], resilience, and data migration [29]. The macroprimitive graph data structure is therefore not restricted to finite-element based simulation techniques.
To realise runtime load balancing or checkpointing techniques, data serialisation and migration to other processes or to permanent storage are required. Therefore, an interface is provided for callback functions that (de)serialise the attached data structures. The framework is then able to (de)serialise and store or migrate the attached data (e. g. fields of unknowns) without knowledge of its actual internal structure, which allows a decoupled design of the migration process.
2.3 Load Balancing
The graph of macroprimitives is the basic data structure used to distribute the domain among different processes. During the partitioning of the unstructured mesh into a graph of macroprimitives, each macroprimitive is attributed with a globally unique identifier and is assigned to one process. Each process, however, may own multiple macroprimitives. The connectivity is stored in a distributed fashion, as each macroprimitive stores the identifiers of its neighbouring macroprimitives according to the graph’s structure. As a result, no global information about the domain is stored on any process and the mesh is completely distributed. Thus, arbitrarily sized meshes can be processed given a sufficient number of processes, which is essential for the design of simulation software that can reach extreme scale.
Simulations on large partitioned domains often require a flexible and dynamic load balancing concept. Approaches to create and balance partitions often rely on a graph representation of the targeted domain. The macroprimitive graph shown in Fig. 2 represents such a structure. Each node of the graph represents one macroprimitive of the domain, while the graphedges reflect the communication paths.
Therefore, general graph partitioning algorithms can be applied to this structure without knowledge of the underlying application. Well-designed load balancing libraries already exist [30, 31, 32] and can be used to create partitions of the macroprimitive graph. Figure 4 shows a two-dimensional mesh containing some circular obstacles that was balanced among eight processes using ParMetis [30].
Such partitions can be created statically during a setup phase or dynamically at runtime. Dynamic load balancing requires runtime data migration and can be realised in a two-step process. First, the current partitioning of the global graph is adapted; this could be done by an external library such as ParMetis [30]. Then the simulation data is migrated to the respective processes. In HyTeG, the migration process is realised using the (de)serialisation routines described in Sec. 2.2. A scalable dynamic load balancing concept applied to simulations on adaptive meshes is presented in [28].
The partitioning algorithms can be augmented by weights that are assigned to the individual graphnodes and graphedges. In a finite-element context, a node weight could represent the number of unknowns located in the respective primitive and therefore serve as an indicator of the corresponding work load.
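A sketch of such a weighting (again in illustrative Python; a real setup would hand these weights to a partitioner like ParMetis, for which the greedy assignment below is only a stand-in) could count the unknowns each primitive owns at a given refinement level:

```python
# Sketch: derive graphnode weights from the number of P1 unknowns each
# macroprimitive owns at refinement level L (2**L + 1 points per side).
# The greedy balancing is a placeholder for a real graph partitioner.

def owned_unknowns(kind, level):
    n = 2 ** level + 1
    if kind == "vertex":
        return 1
    if kind == "edge":
        return n - 2                      # interior points of the macroedge
    if kind == "face":
        return (n - 2) * (n - 3) // 2     # interior points of the macroface
    raise ValueError(kind)

def greedy_partition(primitives, level, num_processes):
    """Assign heavy primitives first, always to the least loaded process."""
    load = [0] * num_processes
    assignment = {}
    for prim in sorted(primitives, key=lambda p: -owned_unknowns(p[0], level)):
        target = min(range(num_processes), key=lambda r: load[r])
        assignment[prim] = target
        load[target] += owned_unknowns(prim[0], level)
    return assignment, load

# two-triangle mesh: 4 macrovertices, 5 macroedges, 2 macrofaces
primitives = ([("vertex", v) for v in range(4)]
              + [("edge", e) for e in range(5)]
              + [("face", f) for f in range(2)])
assignment, load = greedy_partition(primitives, 3, 2)
```

At level 3 a macroface owns 21 unknowns and a macroedge 7, so the face weights dominate the balance, as expected for fine refinement levels.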
Due to the connectivity of the macroprimitive graph, macroedges are likely to end up on the same processor as their neighbouring macrofaces and vice versa. Since most communication during the simulations takes place along the edges of the macroprimitive graph, this structure naturally yields suitable distributions if edge-cut minimising partitioning algorithms are employed.
2.4 Communication
Synchronisation of the read-only copies of unknowns on the ghost layers is achieved through communication between neighbouring macroprimitives. The macroprimitive graph structure illustrated in Fig. 2 reflects the direct neighbourhood of the primitives, i. e., communication is only performed along the graphedges. For example, the ghost layers of macrofaces are updated by communication with the neighbouring macroedges.
The communication pattern depends strongly on the update pattern performed on the unknowns and must be adjusted individually. Explicit update rules like matrix-vector products or Jacobi-type smoothing iterations can be combined with more efficient communication patterns than implicit update rules such as Gauss-Seidel-type smoothing iterations, which in general require a certain global update order. More details are discussed in Sec. 4.
Corresponding to the abstract data-handling concept of attaching arbitrary data structures to the macroprimitives, the framework supports a similarly flexible communication abstraction. A three-layer design is employed to decouple the individual components of the communication process from one another.
The buffer layer provides abstraction from the actual (MPI) send and receive calls and from the internal data buffer structure. C++ operator overloading for STL data structures and basic data types allows for convenient (de)serialisation of the transferred data.
The packing layer is responsible for the (de)serialisation of the attached data of a macroprimitive from and to data buffers. It provides a (de)serialisation interface that is implemented for each data structure that must be communicated. For example, there are implementations for (de)serialisation of unknowns that shall be sent to the ghost layers of neighbouring primitives. Upon invocation, the serialisation routines pack the respective unknowns into a buffer on one primitive and the deserialisation routines unpack them on the ghost layers of the corresponding neighbour. However, the packing layer does not perform any communication itself but is only called to (un)pack data from and to buffers.
The control layer manages the communication directions along the graphedges of the macroprimitive graph. Before sending and after receiving buffers, it calls the respective (de)serialisation routines of the packing layer without knowledge of the actual data structures. The control layer employs optimisations like non-blocking communication and calls process-local communication routines if graphedges do not cross process boundaries.
This decoupled design allows for intensive code reuse and enhances the extensibility. The control layer is not affected by the introduction of new data structures, since only the interface of the packing layer must be implemented to allow for communication of the respective data. A well-tested buffer layer abstraction is, for example, already implemented in the core of the waLBerla framework [4], which also serves as a basis for this implementation.
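The separation of the packing and control layers can be sketched as follows (illustrative Python; the class and method names are hypothetical, loosely modelled on the pack-info pattern described above):

```python
# Sketch of the layer separation: the packing layer only (un)packs data,
# the control layer decides between a buffered send and a direct
# process-local copy along one graphedge.

class PackInfo:                      # packing-layer interface
    def pack(self, sender, buffer): ...
    def unpack(self, receiver, buffer): ...
    def communicate_local(self, sender, receiver): ...

class GhostLayerPackInfo(PackInfo):
    def pack(self, sender, buffer):
        buffer.extend(sender["boundary"])            # serialise unknowns
    def unpack(self, receiver, buffer):
        receiver["ghost"] = list(buffer)             # write ghost layer
    def communicate_local(self, sender, receiver):
        receiver["ghost"] = list(sender["boundary"]) # skip buffers entirely

def communicate(sender, receiver, pack_info, same_process):
    # control layer: it never looks inside the data structures
    if same_process:
        pack_info.communicate_local(sender, receiver)
    else:
        buffer = []                    # stands in for an MPI send buffer
        pack_info.pack(sender, buffer)
        # ... buffer would be transferred via MPI here ...
        pack_info.unpack(receiver, buffer)

sender = {"boundary": [1, 2, 3]}
local_receiver, remote_receiver = {}, {}
communicate(sender, local_receiver, GhostLayerPackInfo(), same_process=True)
communicate(sender, remote_receiver, GhostLayerPackInfo(), same_process=False)
```

Both paths must produce identical ghost layers; only the transport mechanism differs, which is exactly what makes the control layer reusable for new data types.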
3 Structured Refinement of Mesh Data
In this section, the structured field data structures for the DoFs inside the macroprimitives, which result from the structured refinement of the unstructured topology, are presented. Since the largest part of the unknowns is located on the macrofaces, the following descriptions focus on these primitives. The presented concepts can be adapted for macroedges and macrovertices as well.
3.1 Structured Subdomains
As illustrated in Fig. 1, starting from a triangulation of the domain, a recursive refinement is performed by connecting the three midpoints of the triangle edges. This results in four new triangles of the same congruence class as e. g. presented in [33].
Unknowns are placed at certain positions in the structured, refined element mesh depending on the corresponding finite element or finite volume discretisation. They are classified as vertexunknowns, edgeunknowns, and faceunknowns, reflecting their topological position in the mesh, cf. Fig. 5.
In the case of first-order finite elements, the unknowns are placed on the vertices as shown in Fig. 5(a). Second-order finite elements additionally require DoFs on the edges as displayed in Fig. 5(b). Other discretisations, such as finite volumes, can be realised by placing the unknowns in the interior of the triangles as illustrated in Fig. 5(c). More details are given in Sec. 4.1.
The edgeunknowns are further separated into subgroups depending on whether they are placed on a horizontal, vertical or diagonal edge of an element. The different groups are shown in different colours in Fig. 5(b). A similar grouping for rectangular elements has been presented in [34].
Similar to the edgeunknowns, the faceunknowns are also separated into subgroups depending on the orientation of the element they reside in. As illustrated in Fig. 5(c), there are two subgroups: upward-facing and downward-facing elements.
One subgroup on its own (e. g. all horizontal edgeunknowns or all faceunknowns in upward-facing elements) again leads to a triangular pattern similar to that of the vertexunknowns. Because of these similar patterns, a single array-access calculation routine for variable-sized triangular field layouts is capable of calculating the array accesses for vertexunknowns as well as for each subgroup of edge- and faceunknowns. Therefore, the separation into subgroups avoids code duplication in the sense that the array-access routines can be reused for all types of unknowns.
Instead of modularising access patterns and memory layouts by element type, the required features are implemented per type of unknown, i. e., modules are implemented for vertexunknowns, edgeunknowns, and faceunknowns. By combining these modules, arbitrary element types can be created, for example finite elements of different order. This approach reduces complexity and code duplication, since routines that process vertexunknowns can be reused for all discretisations that require them. Possible examples include numerical routines that update the unknowns or routines that serialise data for communication or IO. Additionally, applications that employ uncommon element types can be prototyped and implemented rapidly by combining different types of unknowns.
3.2 Indexing
One challenge when comparing these triangular fields with their equivalents on square or rectangular meshes (as e. g. used in [23]) is the translation of the topological index to the actual memory layout. While in the rectangular case the number of unknowns is constant in each row and column, this does not hold for triangles, cf. Fig. 6. In the case of a simple linear data arrangement as shown in Fig. 6(b), this means that the offset from one row to the next decreases steadily.
Furthermore, it is desirable that the memory layout can be changed to fit the requirements of a compute kernel and to improve performance on a specific processor microarchitecture. Possible scenarios include colouring schemes, e. g. as used in a multi-colour Gauss-Seidel smoother, or layouts that allow SIMD operations to be used more efficiently. For both of these reasons, a layer of abstraction is introduced that separates the indexing from the actual memory layout: a fixed topological enumeration is combined with flexible methods to access the unknowns via indices.
An example for DoFs located on the vertices on a macroface can be seen in Fig. 6. The fixed topological indexing is shown in Fig. 6(a). These coordinates are translated via an indexing function into one possible memory layout as shown in Fig. 6(b). The memory layout can be exchanged according to the specific needs of the application simply by exchanging the indexing function. The topological enumeration is defined for each type of unknown, i. e. vertex, edge or faceunknowns.
By introducing the abstract indexing, the array-access calculations are decoupled from the algorithms that access the unknowns. Since the translation to absolute array indices is hidden behind the indexing functions, array layouts can be switched by simply exchanging the indexing function. Consequently, different memory layouts can be added in later phases of development and compared against each other without rewriting existing kernels; code that is not performance-critical remains unchanged, while time-consuming compute kernels can be tuned for a specific layout. This makes it possible to employ efficient memory access patterns that improve e. g. cache locality, using techniques as described in [35] and [36].
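A minimal sketch of one such indexing function (illustrative Python, assuming a row-major linear layout as in Fig. 6(b); the function name is hypothetical) maps the topological coordinate (col, row) of a vertexunknown on a macroface with n points per side to a linear array position:

```python
# Sketch of an exchangeable indexing function for a triangular
# vertex-unknown field. Row y of the triangle holds n - y entries,
# so row y starts after n + (n-1) + ... + (n-y+1) array slots.

def linear_index(col, row, n):
    """Row-major layout for a triangular field with n points per side."""
    assert 0 <= row < n and 0 <= col < n - row
    return row * n - row * (row - 1) // 2 + col
```

Kernels only ever call such a function; switching to a different layout (e. g. a coloured or SIMD-friendly ordering) means swapping the function, not rewriting the kernels. For n = 5 the mapping is a bijection onto the 15 array slots of the field.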
3.3 Inter-Primitive Data Exchange
To update the ghost layers of the fields of unknowns, as introduced in Sec. 2.1 and illustrated in Fig. 3, communication between neighbouring primitives must be performed.
Due to the underlying unstructured mesh, it is not possible to implicitly calculate the neighbourhood relations of a certain macroprimitive. The number of neighbours may vary for each primitive type. While macrofaces always have three neighbouring macroedges and macrovertices, a macroedge can have one or two neighbouring macrofaces. A macrovertex can have an arbitrary number of neighbouring primitives depending on the mesh. Since the data structures are fully distributed there is no global knowledge about the neighbourhood relations. Therefore, local neighbourhood information must be explicitly stored for each primitive.
As explained in Sec. 2.4 a layered approach is used where the control layer takes care that the correct primitives communicate with each other. Before sending the data, however, serialisation is needed as well as packing the data into a buffer that is sent to a receiving primitive belonging to a process located on a remote node. This means that each macroface, macroedge, and macrovertex needs to copy the data that all of their neighbours require into one or more buffers. If two adjacent macroprimitives are located on the same process, the data can be copied directly from one array to the other without (de)serialisation to or from a buffer.
It is important to point out that, in the case of a macroface, the memory access pattern differs for each of the three neighbouring macroedges. The desired data might be located consecutively in memory, but there could also be varying strides between the entries. Once more, the indexing abstraction discussed in Sec. 3.2 can be used to simplify the corresponding serialisation kernels. Another complication is the ordering of unknowns in the topological layout. Since the input mesh is fully unstructured, it is not guaranteed that the orientation of one side of a macroface matches the orientation of the neighbouring macroedge. This means that the order of unknowns might have to be adjusted during communication. Our approach to this problem is that the primitive of higher dimension takes care of possible adjustments. For example, a macroface writes the data in reverse order into the send buffer such that the corresponding macroedge can simply read the data out of the buffer consecutively.
The steps that have to be performed when communicating one side of a macroface to the adjacent macroedge are as follows:

1. Determine which of the three sides of the macroface the macroedge is adjacent to and how the topological layout is oriented.

2. Determine the process associated with the adjacent macroedge.

3. If the macroedge is located on the same process: copy the macroface data directly into the ghost layers of the macroedge, considering the orientation.

4. If the macroedge is located on a different process:

   (a) copy the macroface data into the send buffer, considering the orientation,

   (b) transfer the buffer to the process the macroedge is located on, and

   (c) copy the data from the receive buffer into the ghost layers of the macroedge.
In Fig. 7 an example communication from one macroface to one of the three neighbouring macroedges is depicted. In this case, the unknowns are written to the buffer in reversed order by the macroface since the orientation in the macroedge is opposite.
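The orientation handling in this exchange can be sketched compactly (illustrative Python; function names are hypothetical): the macroface, as the higher-dimensional primitive, writes its boundary row into the buffer in the order the macroedge expects, so the macroedge always reads consecutively.

```python
# Sketch of orientation-aware packing: the macroface reverses its
# boundary row when the adjacent macroedge is oriented oppositely,
# so the macroedge can unpack the buffer without index juggling.

def pack_face_boundary(face_row, same_orientation):
    """Macroface side: write boundary unknowns into the send buffer."""
    return list(face_row) if same_orientation else list(reversed(face_row))

def unpack_into_edge_ghost(edge, buffer):
    """Macroedge side: a plain consecutive read into the ghost layer."""
    edge["ghost"] = list(buffer)

edge = {}
buf = pack_face_boundary([10, 11, 12], same_orientation=False)
unpack_into_edge_ghost(edge, buf)
```

Keeping the adjustment on the sender side means the receiving macroedge never needs to know how its neighbours are oriented.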
4 Numerical Methods and Linear Algebra
The discretisation of PDEs on the mesh structures of HyTeG typically leads to large systems of equations. For the implementation of iterative solvers, standard operations like addition, scalar products, and matrix-vector multiplications have to be performed. In the following, the realisation of these concepts within HyTeG is described.
4.1 Discretisations
The data structures described in Sec. 3 support storing unknowns at different positions in the topology of the underlying mesh. By grouping different unknowns together, finite elements or other discretisations can be constructed. Fig. 8 illustrates some examples of typical finite elements. The first row, Fig. 8(a) to Fig. 8(c), presents the usual conforming finite elements [16] from first to third order. Additionally, non-conforming finite elements that are discontinuous across edges can be realised, cf. Fig. 8(d) to Fig. 8(f). The elements in Figs. 8(e) and 8(f) are mathematically equivalent and differ only in the locations on the element with which the unknowns are associated.
A global variable on the mesh discretised by a specific element may be represented by the union of all unknowns. This union represents a vector in a finitedimensional linear space. On these vectors, the HyTeG framework implements a set of functions similar to the BLAS level 1 routines [37]. For example, the routine assign shown in Alg. 1 computes a linear combination of two vectors.
Similarly, it is possible to implement reductions on such vectors. As an example, the implementation of the equivalent of the xDOT BLAS routine is presented in Alg. 2.
Note that the computation performed within a primitive is independent of others, thus each primitive may be processed in parallel.
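The per-primitive structure of these routines can be sketched as follows (illustrative Python; HyTeG's Alg. 1 and Alg. 2 are the C++ counterparts, and in practice the per-primitive loops run in parallel with a global reduction, e. g. MPI_Allreduce, for the dot product):

```python
# Sketch of BLAS-level-1 style routines over a vector whose unknowns are
# distributed across macroprimitives: each primitive owns one chunk.

def assign(dst, a, x, b, y):
    """dst <- a*x + b*y, processed primitive by primitive."""
    for prim in dst:
        dst[prim] = [a * xi + b * yi for xi, yi in zip(x[prim], y[prim])]

def dot(x, y):
    """Global inner product: local dot per primitive, then a reduction."""
    return sum(sum(xi * yi for xi, yi in zip(x[prim], y[prim]))
               for prim in x)

# a tiny "vector" split over one macroface and one macroedge
x = {"face0": [1.0, 2.0], "edge0": [3.0]}
y = {"face0": [4.0, 5.0], "edge0": [6.0]}
dst = {"face0": None, "edge0": None}
assign(dst, 2.0, x, 1.0, y)
```

Since no primitive's chunk depends on another's, both loops parallelise trivially; only the final sum in `dot` requires communication.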
4.2 Matrix-Free Approach
When solving very large systems of equations obtained from the discretisation of PDEs, the memory consumption of matrix-based implementations may be impracticably high, even when using appropriate sparse matrix formats. HyTeG therefore employs matrix-free methods based on stencils (cf. Fig. 9). Stencil operations additionally allow for efficient parallel kernels that carry out matrix-vector multiplications or pointwise smoothers.
The stencil approach is illustrated in Fig. 9 for second-order triangular finite elements. In this approach, the regularity of the grid is exploited by storing the nonzero elements of a matrix row as stencil entries for the corresponding unknown and its neighbours. Fig. 9(a) shows the stencil associated with a microvertex DoF by highlighting the DoFs within the compact support of this vertex. Figs. 9(b)–9(d) display the stencils associated with diagonal, horizontal, and vertical microedge DoFs, respectively.
In HyTeG a matrix is never fully stored in memory, except possibly on coarse levels of a multigrid hierarchy. On finer refinement levels, only the result of a matrix-vector multiplication is computable and may be stored in another vector. The implementation of the matrix-vector multiplication is up to the developer. In conventional finite-element solvers this is usually done through quadrature loops which integrate the bilinear forms of the weak form on-the-fly or by using precomputed local element matrices. In the special case of linear PDEs with constant coefficients, only one constant stencil has to be stored for each primitive. An example of the matrix-vector multiplication in this case is given in Alg. 3.
Note again that the computations in the interior of a primitive are independent of other primitives. A communication step is required only before primitives of the next dimension are processed. Thus the work in each primitive may be distributed across processes.
4.3 Iterative Solvers
The basic linear algebra routines discussed in Sec. 4.1 and Sec. 4.2 allow the implementation of iterative solvers for linear systems. On coarse grids we may employ Krylov or direct solvers. These solvers can be implemented directly or via an interface to external libraries such as PETSc [38]. The preconditioned conjugate gradient method implemented in HyTeG can, e.g., be used for symmetric positive definite problems. For symmetric indefinite problems, the preconditioned minimal residual method is typically applied.
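To illustrate how such a solver rests only on BLAS-1 style kernels plus a matrix-free matrix-vector product, here is a minimal conjugate gradient sketch (unpreconditioned, i.e. with an identity preconditioner; plain Python flat vectors, not the HyTeG implementation):

```python
# Minimal CG sketch: the operator is passed as a callable `matvec`,
# matching the matrix-free philosophy; only dot products and axpy-style
# vector updates are needed otherwise. Illustrative, not HyTeG code.

def cg(matvec, b, x, tol=1e-10, max_iter=100):
    r = [bi - axi for bi, axi in zip(b, matvec(x))]   # initial residual
    p = r[:]
    rr = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rr < tol * tol:
            break
        ap = matvec(p)
        alpha = rr / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rr_new = sum(ri * ri for ri in r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

# tiny SPD test system: A = [[4,1],[1,3]], b = [1,2]
A = [[4.0, 1.0], [1.0, 3.0]]
matvec = lambda v: [sum(aij * vj for aij, vj in zip(row, v)) for row in A]
x = cg(matvec, [1.0, 2.0], [0.0, 0.0])   # exact solution (1/11, 7/11)
```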
For large problems, geometric multigrid solvers are employed. Multigrid methods are essential, since they exhibit asymptotically optimal complexity and thus become superior when progressing to extreme scale [3]. Interpolation and restriction operators between levels are required for multigrid and may be realised by matrix-vector multiplications with non-square matrices. Note that the most efficient smoothers for multigrid often require pointwise updates and thus access to the diagonal entries of the stiffness matrix. This access is provided by the stencil paradigm, but it might not be easy to accomplish in alternative matrix-free methods where only the matrix-vector multiplication is realised. In HyTeG, multigrid can be used as a solver or as a preconditioner in a Krylov solver.
Two major categories of refinement in multigrid methods are h- and p-refinement [16]. h-refinement describes a refinement process in which the same finite element discretisation is employed on two meshes with different geometric refinement levels. The combination of different orders of finite element discretisations, e.g. P1 and P2 elements, allows one to employ p-refinement. Here the interpolation and restriction are performed between different discretisations on the same grid, i.e., on the same geometric refinement level. Combining both approaches allows the implementation of geometric multigrid solvers with hp-refinement. For the solution of large saddle-point problems, an all-at-once multigrid solver using pointwise inexact Uzawa smoothers is available [39].
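A minimal one-dimensional sketch of h-refinement grid transfer (linear interpolation as prolongation, its scaled transpose as restriction; HyTeG realises these as non-square matrix-vector products and the actual 2D operators differ in detail):

```python
# Illustrative 1D grid-transfer operators for P1 elements: fine nodes
# coincide with coarse nodes plus midpoints. Not HyTeG code.

def prolongate(coarse):
    """Coarse nodal values -> next finer grid; midpoints are averaged."""
    fine = []
    for i in range(len(coarse) - 1):
        fine.append(coarse[i])
        fine.append(0.5 * (coarse[i] + coarse[i + 1]))
    fine.append(coarse[-1])
    return fine

def restrict(fine):
    """Transpose of prolongation (full weighting, up to scaling)."""
    coarse = []
    for i in range(0, len(fine), 2):
        left = fine[i - 1] if i > 0 else 0.0
        right = fine[i + 1] if i + 1 < len(fine) else 0.0
        coarse.append(0.5 * left + fine[i] + 0.5 * right)
    return coarse

fine = prolongate([0.0, 2.0, 4.0])   # linear data is reproduced exactly
```

That prolongation reproduces linear functions exactly is the property that makes it a suitable transfer operator for P1 discretisations.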
5 Example Applications
The following applications are chosen as a proof of concept for the software architecture and design aspects.
5.1 Stokes Flow
In the first example, an isoviscous, incompressible Stokes flow through a porous structure is considered (cf. Fig. 10), modelled by the stationary Stokes equations.
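In standard form, and as a plausible reconstruction only (the article's exact non-dimensionalisation may differ), the isoviscous, incompressible Stokes system for the velocity $\mathbf{u}$ and the pressure $p$ reads:

```latex
-\Delta \mathbf{u} + \nabla p = \mathbf{f} \quad \text{in } \Omega,
\qquad
\nabla \cdot \mathbf{u} = 0 \quad \text{in } \Omega,
```

with $\mathbf{f}$ a body force and boundary conditions as described in the following.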
A parabolic inflow profile is prescribed at the left, no-slip conditions at the top and bottom, and Neumann outflow boundary conditions at the right boundary of the domain. As a stable discretisation, the P2-P1 pairing (Taylor-Hood [16]) is used, i.e., the velocity is discretised by second-order and the pressure by first-order finite elements. The linear system resulting from a discretisation with refinement level 4 is solved directly by an LU decomposition using the PETSc interface of HyTeG. Fig. 10 shows the velocity magnitude and pressure field as well as the coarse grid structure of the underlying mesh.
5.2 Energy Transport
In the second example, the temperature-driven convection from the inner to the outer boundary of an annulus is simulated as a simplified model of Earth mantle convection (cf. Fig. 11).
The process is described by the isoviscous, incompressible Stokes equations for velocity and pressure, coupled with a convection-diffusion equation for the temperature. The governing coupled system of equations is formulated for the velocity, the pressure, and the temperature.
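A standard non-dimensional form of such a Boussinesq-type coupling (a plausible reconstruction only; the Rayleigh number Ra, the Péclet number Pe, and the outward-pointing radial unit vector $\mathbf{e}_r$ are explained in the surrounding text, and the article's exact formulation may differ) is:

```latex
-\Delta \mathbf{u} + \nabla p = \mathrm{Ra}\, T\, \mathbf{e}_r,
\qquad
\nabla \cdot \mathbf{u} = 0,
\qquad
\partial_t T + \mathbf{u} \cdot \nabla T - \mathrm{Pe}^{-1} \Delta T = 0.
```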
Here Ra denotes the dimensionless Rayleigh number and Pe the Péclet number; the buoyancy force acts along an outward-pointing radial vector. The inverse of the Péclet number is very small, such that the model is dominated by numerical diffusion, which becomes smaller with decreasing mesh size. At the boundaries, homogeneous Dirichlet conditions are employed for the velocity. For the dimensionless temperature, a fixed value is prescribed at the inner boundary and another at the outer boundary.
Here the Stokes system is discretised by equal-order P1-P1 finite elements together with a pressure-stabilised Petrov-Galerkin (PSPG) method [40]. The convection is realised by a finite volume discretisation. To advance the system in time, the Stokes system and the convection-diffusion equation are treated with different time steps: the Stokes system is solved only after every three time steps of updating the convection-diffusion equation. The pressure and velocity are solved for simultaneously in a monolithic multigrid algorithm, similar to [39]. This has been proposed in the classical multigrid literature [41] as the most efficient approach to solving the Stokes system. The temperature is advanced via an explicit upwind scheme using the previously computed velocity field.
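The described subcycling (the Stokes system re-solved only after every third transport step, with the velocity frozen in between) can be sketched as follows; solve_stokes and advect are toy stand-ins, not HyTeG functions:

```python
# Sketch of the coupled time loop: transport runs every step, the Stokes
# solve is triggered only every `substeps`-th step. Illustrative only.

def run(T, steps, solve_stokes, advect, substeps=3):
    u = solve_stokes(T)                 # initial velocity field
    for k in range(steps):
        T = advect(T, u)                # transport with frozen velocity
        if (k + 1) % substeps == 0:     # refresh velocity every 3rd step
            u = solve_stokes(T)
    return T, u

# toy stand-ins that count the number of Stokes solves
calls = {"stokes": 0}
def solve_stokes(T):
    calls["stokes"] += 1
    return 1.0                          # constant dummy "velocity"
def advect(T, u):
    return T + 0.1 * u                  # dummy transport update

T, u = run(0.0, 6, solve_stokes, advect)
```

Over 6 transport steps this triggers the initial Stokes solve plus two refreshes, mirroring the 3:1 ratio described in the text.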
On the finest refinement level 6, the partitioned annulus domain carries in total more than half a million DoFs.
6 Conclusion and Outlook
In this article, the design principles of the new generic finite-element framework HyTeG are presented. By design, HyTeG supports higher-order finite elements on a hierarchically discretised domain. It provides scalable, parallel data structures, sophisticated load balancing, an abstract memory layout, and a layered communication concept. The basic design will also support adaptivity and asynchronous execution as needed when progressing to extreme-scale computing. The approach to achieving high single-node performance and good scalability within HyTeG is based on the combination of an unstructured topology with structured data inside the macro-primitives. The data structures can be used to implement efficient matrix-free methods and the basic numerical building blocks for iterative solvers. The stencil approach allows for a scalable and memory-aware implementation of the pointwise smoothers that are essential for fast geometric multigrid solvers.
The concept is designed towards a seamless transition to three-dimensional simulations building on tetrahedral macro-primitives and is currently being extended in this direction. In future work, the scalability and performance of HyTeG will be examined in more detail. A major focus lies on the implementation of massively parallel geometric multigrid solvers for the Stokes system and their application to large-scale geophysical simulations. Beyond these first examples, the framework is also well suited for other applications that require large-scale finite element simulations, such as in electromagnetics or fluid dynamics.
Acknowledgements
This work was partly supported by the German Research Foundation through the Priority Programme 1648 "Software for Exascale Computing" (SPPEXA) and by grant WO671/111.
References
 [1] B. Bergen and F. Hülsemann, Hierarchical hybrid grids: A framework for efficient multigrid on high performance architectures, Technical Report, Lehrstuhl für Systemsimulation, Universität Erlangen 5 (2003).
 [2] B. Bergen, F. Hülsemann, and U. Rüde, Is 1.7 × 10^10 Unknowns the Largest Finite Element System that Can Be Solved Today?, in Proc. 2005 ACM/IEEE Conf. on Supercomputing, SC ’05, ACM, Seattle, Washington, USA, 2005, pp. 5–5.
 [3] B. Gmeiner, M. Huber, L. John, U. Rüde, and B. Wohlmuth, A quantitative performance study for Stokes solvers at the extreme scale, J. Comput. Sci. 17 (2016), pp. 509–521.
 [4] C. Feichtinger, S. Donath, H. Köstler, J. Götz, and U. Rüde, WaLBerla: HPC software design for computational engineering simulations, J. Comput. Sci. 2 (2011), pp. 105–112, Simulation Software for Supercomputers.
 [5] H. Köstler and U. Rüde, The CSE Software Challenge–Covering the Complete Stack, ITInform. Technology 55 (2013), pp. 91–96.
 [6] C. Godenschwager, F. Schornbaum, M. Bauer, H. Köstler, and U. Rüde, A Framework for Hybrid Parallel Flow Simulations with a Trillion Cells in Complex Geometries, in Proc. Int. Conf. on High Performance Computing, Networking, Storage and Analysis, SC ’13, Denver, Colorado, ACM, 2013, pp. 35:1–35:12.
 [7] F. Schornbaum and U. Rüde, Massively Parallel Algorithms for the Lattice Boltzmann Method on NonUniform Grids, SIAM J. Sci. Comput. 38 (2016), pp. C96–C126.
 [8] B. Bergen and F. Hülsemann, Hierarchical hybrid grids: data structures and core algorithms for multigrid, Numer. Linear Algebra Appl. 11 (2004), pp. 279–291.
 [9] B.K. Bergen, Hierarchical Hybrid Grids: Data structures and core algorithms for efficient finite element simulations on supercomputers, SCS Publishing House, 2006.
 [10] B. Bergen, T. Gradl, U. Rüde, and F. Hülsemann, A Massively Parallel Multigrid Method for Finite Elements, Comput. Sci. Eng. 8 (2006), pp. 56–62.
 [11] B. Gmeiner, U. Rüde, H. Stengel, C. Waluga, and B. Wohlmuth, Performance and Scalability of Hierarchical Hybrid Multigrid Solvers for Stokes Systems, SIAM J. Sci. Comput. 37 (2015), pp. C143–C168.
 [12] B. Gmeiner, U. Rüde, H. Stengel, C. Waluga, and B. Wohlmuth, Towards textbook efficiency for parallel multigrid, Numer. Math. Theory Methods Appl. 8 (2015), pp. 22–46.
 [13] S. Bauer, M. Mohr, U. Rüde, J. Weismüller, M. Wittmann, and B. Wohlmuth, A twoscale approach for efficient onthefly operator assembly in massively parallel high performance multigrid codes, Appl. Numer. Math. 122 (2017), pp. 14–38.
 [14] S. Bauer, D. Drzisga, M. Mohr, U. Rüde, C. Waluga, and B. Wohlmuth, A stencil scaling approach for accelerating matrixfree finite element implementations, CoRR abs/1709.06793 (2017), Submitted to SIAM J. Sci. Comput.
 [15] C. Waluga, B. Wohlmuth, and U. Rüde, Masscorrections for the conservative coupling of flow and transport on collocated meshes, J. Comput. Phys. 305 (2016), pp. 319–332.
 [16] H.C. Elman, D.J. Silvester, and A.J. Wathen, Finite Elements and Fast Iterative Solvers: with Applications in Incompressible Fluid Dynamics, OUP Oxford, 2014.
 [17] P. Bastian, M. Blatt, A. Dedner, C. Engwer, R. Klöfkorn, M. Ohlberger, and O. Sander, A generic grid interface for parallel and adaptive scientific computing. Part I: abstract framework, Computing 82 (2008), pp. 103–119.
 [18] P. Bastian, M. Blatt, A. Dedner, C. Engwer, R. Klöfkorn, M. Ohlberger, and O. Sander, A generic grid interface for parallel and adaptive scientific computing. Part II: implementation and tests in DUNE, Computing 82 (2008), pp. 121–138.
 [19] A. Dedner, R. Klöfkorn, M. Nolte, and M. Ohlberger, A generic interface for parallel and adaptive discretization schemes: abstraction principles and the DUNEFEM module, Computing 90 (2010), pp. 165–196.
 [20] B.S. Kirk, J.W. Peterson, R.H. Stogner, and G.F. Carey, libMesh: a C++ library for parallel adaptive mesh refinement/coarsening simulations, Eng. Comput. 22 (2006), pp. 237–254.
 [21] A. Vogel, S. Reiter, M. Rupp, A. Nägel, and G. Wittum, UG 4: a novel flexible software system for simulating PDE based models on high performance computers, Comp. Vis. Sci. 16 (2013), pp. 165–179.
 [22] C. Cantwell, D. Moxey, A. Comerford, A. Bolis, G. Rocco, G. Mengaldo, D.D. Grazia, S. Yakovlev, J.E. Lombard, D. Ekelschot, B. Jordi, H. Xu, Y. Mohamied, C. Eskilsson, B. Nelson, P. Vos, C. Biotto, R. Kirby, and S. Sherwin, Nektar++: An opensource spectral/hp element framework, Comput. Phys. Commun. 192 (2015), pp. 205–219.
 [23] D. Arndt, W. Bangerth, D. Davydov, T. Heister, L. Heltai, M. Kronbichler, M. Maier, J.P. Pelteret, B. Turcksin, and D. Wells, The deal.II library, version 8.5, J. Numer. Math. 25 (2017), pp. 137–146.
 [24] W. Bangerth, R. Hartmann, and G. Kanschat, deal.II – a general purpose object oriented finite element library, ACM Trans. Math. Softw. 33 (2007), pp. 24/1–24/27.
 [25] J.W. Lottes, P.F. Fischer, and S.G. Kerkemeier, Nek5000 web page (2008), http://nek5000.mcs.anl.gov.
 [26] A. Peplinski, P.F. Fischer, and P. Schlatter, Parallel Performance of Htype Adaptive Mesh Refinement for Nek5000, in Proc. 2016 Exascale Appl. Softw. Conf., EASC ’16, Stockholm, Sweden, ACM, 2016, pp. 4:1–4:9.
 [27] B. Acun, A. Gupta, N. Jain, A. Langer, H. Menon, E. Mikida, X. Ni, M. Robson, Y. Sun, E. Totoni, L. Wesolowski, and L. Kale, Parallel Programming with Migratable Objects: Charm++ in Practice, in Proc. Int. Conf. on High Performance Computing, Networking, Storage and Analysis, SC ’14, New Orleans, Louisiana, IEEE Press, Piscataway, NJ, USA, 2014, pp. 647–658.
 [28] F. Schornbaum and U. Rüde, ExtremeScale BlockStructured Adaptive Mesh Refinement, CoRR abs/1704.06829 (2017), Accepted for publication in SIAM J. Sci. Comput.
 [29] N. Kohl, J. Hötzer, F. Schornbaum, M. Bauer, C. Godenschwager, H. Köstler, B. Nestler, and U. Rüde, A Scalable and Extensible Checkpointing Scheme for Massively Parallel Simulations, Int. J. High Perform. Comput. Appl. 0 (2018), pp. 1–19.
 [30] G. Karypis and V. Kumar, A parallel algorithm for multilevel graph partitioning and sparse matrix ordering, J. Parallel Distrib. Comput. 48 (1998), pp. 71–95.
 [31] C. Chevalier and F. Pellegrini, PTScotch: A Tool for Efficient Parallel Graph Ordering, Parallel Comput. 34 (2008), pp. 318–331.
 [32] E.G. Boman, U.V. Çatalyürek, C. Chevalier, and K.D. Devine, The Zoltan and Isorropia Parallel Toolkits for Combinatorial Scientific Computing: Partitioning, Ordering and Coloring, Sci. Program. 20 (2012), pp. 129–150.
 [33] R.E. Bank and A.H. Sherman, A refinement algorithm and dynamic data structure for finite element meshes, Computer Science Department, UT Austin, 1980.
 [34] S. Kuckuk and H. Köstler, Automatic Generation of Massively Parallel Codes from ExaSlang, Computation 4 (2016), p. 27.
 [35] M. Kowarschik, U. Rüde, C. Weiss, and W. Karl, Cacheaware multigrid methods for solving Poisson’s equation in two dimensions, Computing 64 (2000), pp. 381–399.
 [36] M. Kowarschik, U. Rüde, N. Thürey, and C. Weiß, Performance optimization of 3D multigrid on hierarchical memory architectures, in Int. Workshop Appl. Parallel Comput., Springer, 2002, pp. 307–316.
 [37] C.L. Lawson, R.J. Hanson, D.R. Kincaid, and F.T. Krogh, Basic Linear Algebra Subprograms for Fortran Usage, ACM Trans. Math. Softw. 5 (1979), pp. 308–323.
 [38] S. Balay, W.D. Gropp, L.C. McInnes, and B.F. Smith, Efficient Management of Parallelism in Object Oriented Numerical Software Libraries, in Modern Software Tools in Scientific Computing, Birkhäuser Press, 1997, pp. 163–202.
 [39] L. John, U. Rüde, B. Wohlmuth, and W. Zulehner, On the analysis of block smoothers for saddle point problems, arXiv preprint arXiv:1612.01333 (2016).
 [40] F. Brezzi and J. Douglas, Stabilized mixed methods for the Stokes problem, Numer. Math. 53 (1988), pp. 225–235.
 [41] A. Brandt and O.E. Livne, Multigrid techniques: 1984 guide with applications to fluid dynamics, Vol. 67, SIAM, 2011.