The multi-level -refinement scheme is a powerful extension of the finite element method that allows local mesh adaptation without the trouble of constraining hanging nodes. This is achieved through hierarchical high-order overlay meshes, a -scheme based on spatial refinement by superposition. An efficient parallelization of this method using standard domain decomposition approaches in combination with ghost elements faces the challenge of a large basis function support resulting from the overlay structure and is in many cases not feasible. In this contribution, a parallelization strategy for the multi-level -scheme is presented that is adapted to the scheme’s simple hierarchical structure. By distributing the computational domain among processes on the granularity of the active leaf elements and utilizing shared mesh data structures, good parallel performance is achieved, as redundant computations on ghost elements are avoided. We show the scheme’s parallel scalability for problems with a few hundred elements per process. Furthermore, the scheme is used in conjunction with the finite cell method to perform numerical simulations on domains of complex shape.
Keywords: high-order FEM, automatic -adaptivity, arbitrary hanging nodes, finite cell method, parallel computation, high performance computing
The recently proposed multi-level -refinement  is a novel approach for performing adaptive mesh refinement in the context of high-order finite elements. Contrary to conventional -adaptive methods, which perform spatial refinement by the replacement of existing elements with a subset of smaller elements [2, 3, 4], this method employs hierarchical, high-order overlay meshes to improve the quality of the approximation in areas of interest. The main advantages of this approach over conventional ones are its ability to deal with an arbitrary number of hanging nodes and its intuitive refinement and coarsening procedures that make it easy to implement and suitable for performing dynamic mesh refinements. Moreover, this -scheme maintains the same exponential convergence rates characteristic of classical -formulations even in the presence of singularities and is easily extensible to three dimensions .
As shown in , multi-level
-adaptivity can be combined with the finite cell method to solve partial differential equations on complex domains. The finite cell method, introduced by Parvizianet al. , is a high-order fictitious domain approach that circumvents the tedious process of mesh generation by embedding the physical domain in a computational domain that can be trivially meshed. Computations on complex domains are hereby simplified as the geometrical information needs only to be resolved on integration level. The combination of the finite cell method and multi-level -adaptivity, coined the multi-level -adaptive finite cell method, is currently being applied in the calculation of engineering problems in the fields of biomechanics and additive manufacturing.
Parallel computations on modern computing systems using finite element techniques have made the simulation of large engineering problems possible. A plethora of distributed memory parallelization strategies have been developed for different finite element discretization schemes such as [7, 8, 9]. These strategies not only aim at reducing computational time but also at efficiently utilizing the available hardware resources. The major challenge in developing such strategies is the choice of algorithms and data structures that best expose the parallelism in the discretization scheme under consideration. As a consequence, different studies have been undertaken such as in [3, 10], which seek to provide guidelines for the parallelization of finite element schemes.
One central algorithm in every parallelization scheme is the distribution of the computational domain among participating processes. A common approach motivated by non-overlapping domain decomposition techniques entails subdividing the computational domain into sub-domains of fairly equal computational cost and their subsequent distribution among the processes. This distribution, in the context of parallel -adaptive schemes, can be first performed on the granularity of the initial elements , followed by a re-balancing step of the refined grid or directly on the refined grid. Each process stores its local elements as well as so called ghost elements that constitute the boundary to neighboring sub-domains . This approach is advantageous in maintaining mesh compatibility after parallel mesh refinement 
and more importantly when setting up the global linear system to be solved. This system can be assembled without communication since each process can compute the global contributions of the degrees of freedom it owns by integrating the corresponding local elements and ghost elements[11, 10]. Different variants of the described domain decomposition approach have been successfully applied to -adaptive schemes, as in [3, 7, 12], and can be combined with distributed mesh data structures to reduce the amount of memory needed by a single process.
Although ghost elements eliminate the need for communication during distributed assembly, redundant computations are performed on ghost elements. Consequently, the ratio of local to ghost elements on a process has to be high, in order to hide redundant computations and allow scalability . Usage of the ghost element approach in the parallelization of the multi-level -scheme would, however, result in a significantly larger number of ghost elements than in conventional -schemes, thus inhibiting scalability. This is due to the large support of the basis functions, as basis functions are coupled over the hierarchy of the overlay meshes. Moreover, the number of basis functions and consequently the computational work vary highly from element to element in multi-level -refinement, further complicating load balancing.
This contribution presents a parallelization strategy for the multi-level -refinement that addresses the mentioned challenges. Our scheme avoids the use of ghost elements by utilizing a shared mesh data structure, a concept first applied to a conventional -scheme in . This method is adopted to take advantage of the simplicity of multi-level -refinement. The absence of ghost elements results in the need for data communication during the assembly of the global linear system. This, however, does not significantly affect parallel performance as shown in the numerical examples considered. Moreover, since no redundant computations are performed during integration, good parallel performance for problems with a few hundred elements per process is achieved. The present work also shows the combination of the parallel multi-level -scheme with the finite cell method for the computation of engineering problems involving complex geometries.
This paper is organized as follows: The main ideas behind multi-level -refinement and the finite cell method will be outlined in Section 2. This will be followed by a description of the central aspects of the parallel implementation in Section 3. Numerical examples in Section 4 show the parallel performance of the proposed method for different problem classes. Finally, Section 5 will conclude the contribution with a brief summary and an outlook into future aspects of research.
2 The multi-level -adaptive finite cell method
A significant improvement of the approximation quality in finite element simulations can be achieved by the use of -adaptive discretizations which perform local mesh refinement in areas of interest. This section provides a description of such a method, the multi-level -refinement, focusing on the characteristics that enable it to leverage the benefits of -adaptive methods while notably reducing the implementational effort. The basic idea behind the finite cell method is presented and the combined use of both methods in the simulation of complex problems is also highlighted.
2.1 Multi-level -refinement
Although -adaptive formulations posses superior approximation qualities, their implementational complexity inhibits the widespread use of these methods in the field of computational mechanics. This complexity stems mainly from the algorithms and data structures needed to deal with mesh irregularities that arise during refinement and coarsening procedures. Classical -formulations commonly perform mesh refinement by replacing elements having a high discretization error with a set of smaller elements cf. [2, 7]. This process introduces mesh irregularities, usually referred to as hanging nodes, as the basis functions of the new elements lack a corresponding counterpart in their unrefined neighboring elements . Special algorithms are therefore needed to constrain these hanging nodes and restore inter-element continuity. Many -formulations, as a consequence, only allow one level of mesh irregularity between elements e.g. [4, 7], while others implement sophisticated constraining algorithms that can deal with arbitrary levels of hanging nodes such as [14, 15].
Zander et al. propose a novel -adaptive approach in  based on refinement by superposition that circumvents the difficulties introduced by hanging nodes. By utilizing simple rule-sets to ensure compatibility and linear independence of the shape functions, intuitive refinement and coarsening procedures are developed. The need for complex data structures and algorithms to consolidate and constrain hanging nodes is alleviated as hanging nodes are avoided by construction. The approach is shown to achieve exponential convergence even in the presence of singularities and was first introduced for the two-dimensional case, extended in  to three dimensions and applied to cohesive fracture modeling in .
2.1.1 Basic idea
The core idea of the multi-level -approach is to locally superpose coarse base elements with finer overlay elements that better capture the solution characteristics in an area of interest. The final solution is hence the sum of the large scale solution on the high-order coarse base mesh and the fine-scale solution from the finer overlay mesh: . This superposition approach is attributed to the work of Mote in 1971  and has been adapted in different approaches e.g. [18, 19, 20].
Multi-level -refinement extends the original superposition approach through recursive superposition resulting in multiple levels of hierarchical overlay meshes. Furthermore, integrated Legendre shape functions are used to span the approximation space. The direct association of these shape functions with the element topology (nodes, edges, faces and solids) allows the creation of simple mesh refinement and coarsening techniques as individual basis functions can be easily eliminated from the ansatz space through deactivation of the corresponding topological entities. This important property enables the maintenance of mesh compatibility between adjacent elements and linear independence across the overlay meshes. Inter-element continuity is ensured by construction through the application of homogeneous Dirichlet boundary conditions on the overlay mesh. Consequently, arbitrary levels of hanging nodes can be used. -continuity is thus ensured within each element and -continuity across element boundaries. Linear independence, on the other hand, is guaranteed by the deactivation of topological components in the overlay meshes with active sub-components. Figure 1 illustrates multi-level -refinement for the one, two and three-dimensional case. A multi-level -mesh consists of three different kinds of elements, firstly elements on the lowest level, so called base elements, secondly refined elements on intermediate refinement levels and finally elements on the highest refinement level with no sub-elements, so called leaf elements. Elements with no sub-elements are referred to as active leaf elements or in short active elements, such elements are either base or leaf elements. Linear independence and compatibility of all basis functions over this element-hierarchy is achieved through the deactivation of topological components as shown in Figure 1. The interested reader is referred to [1, 5] for a comprehensive description of the refinement and coarsening strategies.
2.1.2 Numerical integration in multi-level -refinement
An important aspect of multi-level -refinement is the correct numerical integration of the element matrices. Applying conventional Gaussian quadrature at base element level would yield wrong results since the shape functions are only -continuous within the base elements due to the hierarchical superposition. This problem is solved in  by evaluating integrals separately on integration domains in which the basis functions are -continuous. These integration domains are obtained by a projection of the leaf elements through the mesh hierarchy onto the base elements as illustrated in Figure 2 for the one dimensional case. Numerical integration can then be performed independently on each integration domain as shown. Figure 2 also shows the coupling of the basis functions over the different mesh levels. This property of the multi-level -scheme does not affect performance, as non-zero shape functions within an integration domain can be efficiently computed by concatenating the active degrees of freedom over the different mesh levels.
2.2 The finite cell method
The aforementioned finite cell method aims to circumvent the tedious process of mesh generation for complex geometries by combining a fictitious domain approach with high-order finite elements. The physical domain is extended by a fictitious domain , yielding a computational domain from the union , which can be easily meshed using high-order finite elements on a structured grid as illustrated in Figure 3. The elements are referred to as finite cells to differentiate them from standard boundary conforming finite elements . The original geometry is resolved at integration level by means of an indicator function
which associates a given point with the physical or fictitious domain. This results in a modified weak form of the governing differential equations, which is now scaled by the scalar field . To illustrate this, we consider the weak form of the equilibrium equation in linear elasticity given as . The terms and represent the displacement field and test functions respectively, while represents the bilinear form which is calculated from the linear strain operator and material matrix . The term is the linear functional representing the contributions of the volume loads and tractions acting on the Neumann boundary .
The indicator function introduced in FCM leads to a modified bilinear form given as
As shown in , the numerical approximation resulting from the modified bilinear form and the linear functional in Equation 3 converges to the solution of the original weak form when epsilon tends to zero. The extended domain in FCM typically has a very simple geometric structure rendering it easy to mesh. In the simplest case, a uniform grid of rectangular hexahedral grids is used, yet the hierarchical overlay meshes described in the Section 2.1.1 are also applicable.
The scalar field introduces a discontinuity within elements that are intersected by the boundary of the physical domain. Different approaches have been developed to improve the integration of these cut elements by approximating the original domain boundary such as a low order approximation technique utilizing composed Gaussian quadrature in combination with recursive spacetree-subdivision 
, as well as high-order techniques like moment fitting and the recently proposed blended partitioning algorithm [24, 25].
Another important aspect of the finite cell method is the imposition of boundary conditions. While the application of Neumann boundary conditions remains unaffected by a nonconforming discretization, essential boundary conditions are. The latter are usually imposed weakly using variational techniques like the penalty method [27, 28], Nitsche’s Method [29, 28] and the Lagrange multiplier method .
The finite cell method maintains the excellent approximation qualities of standard finite elements, while significantly simplifying the mesh generation process. It has thus been successfully applied to different application fields which including linear elasticity [6, 21], large deformation analysis , image-based simulation of bones , contact mechanics  and isogeometric analysis  among others. The interested reader can refer to [6, 21, 34] for an in depth presentation of FCM or an open-source Matlab implementation  for a simple introduction into this research field. A combination of multi-level -refinement and FCM yields a highly flexible discretization method which can be used to perform static and transient simulations on complex geometries. In multi-level -FCM, the composed integration techniques are applied on the integration domains described in Section 2.1.2. Ongoing research aims to exploit the benefits of both methods in biomechanical simulation of bone-structures and in the simulation of the additive manufacturing process.
3 Parallel Implementation
Large numerical simulations running on distributed memory systems greatly benefit from the ability of -formulations to simultaneously resolve small and large geometrical scales with a reasonable number of unknowns. Such methods have been successfully applied to multi-scale problems e.g. in simulating seismic waves  or mantel convection , simulations in which application of uniform global -refinement would not be feasible. The parallelization of -formulations has caught the interest of many research groups and lead to the emergence of different strategies [8, 7] and various open source parallel -adaptive frameworks like deal.II , Hermes , libMesh , Nektar++  and MFEM  with a wide user base in the scientific community. The majority of the mentioned parallel -codes use classical -formulations that perform refinement by replacement. Moreover, several software projects like Trilinos  and PETSc  have emerged, which provide different functionalities in the field of computer science, e.g parallel linear algebra packages.
3.1 Parallelization strategy
The domain decomposition approach with ghost elements mentioned in the introduction is commonly used when distributing the computational domain among processes [10, 11, 12]. Although this approach has different variants, the main idea is that each process stores elements on its portion of the original domain and ghost elements that constitute the boundary to neighboring sub-domains. In this way, communication during the assembly of the linear system is avoided as illustrated in Figure 4a. The degrees of freedom on each process consist of entries local to the process and shared entries due to the ghost elements. Each process only computes the full contribution of its local degrees of freedom by integrating the associated local elements and ghost elements represented by the green and dashed elements in Figure 4a, respectively.
A parallel implementation utilizing the domain decomposition approach with ghost elements is not well suited for multi-level -refinement. This is due to the large support of the basis functions as stated in the introduction. To illustrate this point, an L-shaped domain comprising of three base elements is refined in two steps towards the reentrant corner. The mesh resulting from multi-level -refinement is compared to a mesh obtained from a conventional -scheme in Figure 5. In order to compute the entries in the linear system associated with basis function at node , a total of three elements need to be integrated in the conventional -scheme. This is, however, not the case in the multi-level -scheme. The basis function is supported on all 21 integration domains, represented by the dashed lines in the base elements in Figure 5b. These domains have to be integrated in order to compute the entries in the linear system associated with without communication as in Figure 4a. We can safely conclude that no speed up can be attained in this particular example. This example illustrates the shortcomings of the conventional ghost element parallelization strategy when applied to multi-level -refinement, and further motivates the development of a parallelization scheme that is better adapted to multi-level -refinement.
We propose a scheme based on the distribution of the integration domains described in Figure 2 in order to guarantee scalability even for rather coarse meshes, such as in Figure 5b. A shared mesh data structure, similar to , is employed due to its low implementational effort while at the same time allowing for an easy activation and deactivation of elements during refinement. Mesh concurrency can thus be easily maintained among processes during a simulation. The global linear system is assembled with communication as illustrated in Figure 4b. The individual components of the scheme shall now be described in the following sections.
3.2 Algorithms and data structures
Our parallelization strategy is implemented within an object-oriented framework and performs partitioning of the computational domain on the granularity of the integration domains, an approach specially crafted for multi-level -refinement. This can also be interpreted as a distribution of the active elements, due to the direct association between integration domains and active elements, see Figures 1 and 2. The strategy consists of a serial part that involves mesh initialization and refinement followed by a parallel part that constitutes mesh partitioning, integration of the linear system and its distributed assembly, solving of the linear system and post processing. A summary of the simulation pipeline is given in Algorithm 3.
3.2.1 Mesh initialization and refinement
The initial computational domain is generated by every process at simulation start. Adaptive mesh refinement using the multi-level -scheme is performed redundantly by each process on the whole domain, resulting in the same discretization on all processes. This approach is adopted as multi-level -refinement is currently driven by a priori information. Mesh refinement, therefore, does not need to be divided among the processes, but can efficiently performed by each process, since the elements to be refined are known beforehand. Furthermore, a priori refinement allows mesh partitioning to be directly performed on the refined grid. Multi-level
-refinement can, however, be extended to perform refinement driven by error indicators and estimators, as shown in, and is a topic of ongoing research . Such automatic -refinement would, however, require more involved fully parallel refinement strategies and data structures such as those described in [7, 3], which perform a distribution of the initial elements, followed by parallel refinement on a subset of elements and a subsequent re-balancing of the refined grid on the granularity of either initial elements or refined elements in order to guarantee scalability.
The aforementioned simplicity of the shared mesh data structure coupled with a priori -refinement, yields a fast, easy to implement -scheme as shown by the numerical examples in Section 4. The data structures used are implemented in terms of pointers as described in [1, 16], where the mesh stores elements as pointers. Mesh elements in turn, hold pointers to their sub-elements/children, allowing for easy navigation through the mesh hierarchy. This structure is portrayed in a simple UML-diagram in Figure 6. It is important to note that the refinement yields an implicit tree-structure, which is defined through the different pointer relations as shown in Figure 7.
3.2.2 Load balancing: Distribution of the integration domains
Upon mesh refinement, a computational domain consisting of elements with different refinement levels is obtained. The leaf elements can now be partitioned among processes, a task performed by the LeafElementPartitioner illustrated in Figure 6. This class is aggregated by the mesh and performs load balancing after every mesh refinement step by first creating a trivial intial partitioning, where each process is assigned a set of contiguous leaf elements. The LeafElementPartitioner also provides an interface to the geometric and graph-based partitioners in Zoltan . The initial partitioning created in the first partitioning step, alongside auxiliary information such as leaf element weights, are forwarded to the LB_Partition function in Zoltan. This function improves the initial leaf element distribution in a second step, yielding a set of unique active leaf elements on every process, which constitutes the final leaf element distribution. The described load balancing scheme is summarized in Algorithm 1. This scheme greatly benefits from the use of a shared data mesh structure, as all operations are based solely on the indices’s of the leaf elements and auxiliary geometrical information computed by each processor on a unique subset of elements (initial partitioning). Furthermore, no elements need to be packed and sent to other processors. Communication occurs only within the LB_Partition function in Zoltan, which returns a set of unique indices on each processor. The LeafElementPartitioner registers these indices to the mesh on each individual process, marking the respective leaf elements as active leaf elements that can be processed in the subsequent integration.
As previously mentioned, the LeafElementPartitioner is responsible for determining the weight, computational cost, of individual leaf elements. This is of particular importance in the context of multi-level -refinement, since the number of basis functions supported within a leaf element varies greatly depending on the number of overlay meshes, as shown in Figure 2. Moreover, the number of integration points within a leaf element differ when composed integration in the finite cell method is used. This difficulty is overcome by considering the time complexity of the matrix-matrix multiplications needed to compute the stiffness matrix of a single leaf element, since this operation is the main bottleneck during computations. The integration time and consequently weight of a leaf element is directly proportional to the number of quadrature points and the third power of the number of active shape functions denoted by N. The cubic time complexity in terms of can be illustrated by the fact that multiplications have to be performed to compute the bilinear term in Equation 3. The weight of a leaf element is thus approximated as in lines 6 and 10 of Algorithm 1, as and are known before integration. In concrete simulations is normalized with the weight of an unrefined element yielding a modified weight . This modification aims to factor in influences not taken into account during the derivation of .
3.2.3 Integration of the local linear system
The integration of the local linear system is performed in a loop over the local set of active elements . A second level of parallelism is made available through an OpenMP loop over the integration points within a leaf element. Each process assembles the contributions of its leaf elements into an intermediate linear system. The intermediate stiffness matrix for example, can be efficiently assembled using a local sparsity pattern based solely on the degrees of freedom present in . Boundary conditions are then applied independently on the intermediate linear system by each process.
3.2.4 Distribution of degrees of freedom among processes
Before the global linear system can be assembled, the question of degree of freedom ownership has to be addressed. In this step, the degrees of freedom have to be uniquely distributed among the processes. To this end, two distribution algorithms are proposed. Degrees of freedom can either be distributed contiguously, where each process is assigned a set of consecutive unknowns, or be partitioned using a graph-based approach, which aims to minimize the amount of data communicated during distributed assembly. The first algorithm is easy to implement but could result in increased communication costs during distributed assembly, as the current distribution of leaf elements is not taken into account. The results presented in Section 4, however, demonstrate that this additional overhead has no major relevance. This may be attributed to the use of a rather moderate number of MPI processes (). The graph based approach takes the current distribution of leaf elements into account in order to minimize the amount of information transfered during distributed assembly. This process is summarized in Algorithm 2 where a graph containing the relations between degrees of freedom and leaf elements is used to assign a degree of freedom to the process that contains the highest number of leaf elements associated with that particular degree of freedom. This in simple terms translates into the following principle: The process that did the most work for a certain degree of freedom, gets to keep it. In the case that two or more processes have an equal number of integration domains corresponding to a certain degree of freedom, the process with a lower MPI-rank is assigned the degree of freedom.
3.2.5 Assembly of the global linear system
As previously mentioned, the proposed parallelization strategy avoids redundant computations associated with ghost elements. This results in non-local entries in a process’ intermediate linear system, which need to be communicated to other processes, cf. Figure 4b. In order to perform a distributed assembly in an efficient way, the algorithms and data structures available in the distributed linear algebra packages in Trilinos  or Hypre 
are used. These packages allow non-local matrix and vector entries to be communicated in an efficient way and are widely used in different-codes [3, 12].
3.2.6 Solving the global linear system
Once assembled, the distributed linear system can be solved using different parallel solvers. The implementation provides an interface to parallel direct solvers available in Trilinos and parallel iterative solvers and preconditioners in Hypre and Trilinos. The choice of a suitable solver is dependent on different factors such as the problem type e.g. linear elasticity or Poisson, spatial dimension, size and the conditioning of the stiffness matrix.
3.2.7 Distributed post-processing
Post-processing is carried out independently on the local partition of each process. We use the parallel hierarchical data format (PHDF5)  and MPI-I/O to perform distributed I/O so as to simplify the management of the large datasets. We adopt a strategy where all processes write their result data into a single HDF5-file while a single process is responsible for writing an XML-file with the corresponding meta-data. This approach reduces the pressure on the filesystem and allows simple visualization as huge datasets in separate files do not have to be combined for visualization.
4 Numerical Results
This section highlights various aspects of the proposed parallel scheme. Starting with a simple example in two dimensions, we investigate the performance of the presented algorithms focusing on their scalability and execution time for different problem sizes. Next, the performance of the scheme is shown in a complex three dimensional transient example involving complicated refinement patterns. Further, an example involving FCM on a domain of complex shape is considered, to show the scheme’s applicability to problems of engineering relevance. Although our implementation can be run within a hybrid framework as portrayed in Algorithm 3, we restrict our numerical examples to the MPI-flat performance of the code.
The examples are computed on the CoolMAC cluster at Technical University of Munich. Each node has a dual socket Intel Sandy Bridge-EP Xeon E5-2670 architecture with a total of 16 processors and 128GB memory per node. QDR infiniband connects the nodes to each other. Moreover, the following library and compiler versions are used: Trilinos 12.6.1, Hypre 2.11.1, Zoltan v3.83, Boost 1.56 and IntelMPI compiler version 5.0 alongside the gcc compiler version 4.9 with the compiler flags -O3 and -funroll-loops. Logging of the execution time is performed using the cpu_timer implementation in the Boost library , while the memory consumption during runtime is obtained directly from the operating system via a call to the process status. Although our scheme offers an interface to different parallel direct and iterative solvers, only one solver is used for all numerical examples for the sake of brevity. All examples are solved using the parallel conjugate gradient solver with a multi-grid preconditioner available in Hypre .
4.1 2D singular benchmark
The L-shaped domain benchmark considered by Szabó and Babuška in  is taken as the first example. It aims to solve the Laplace problem on the domain depicted in Figure 8, subjected to the boundary conditions as shown. An analytical solution can be derived for this problem with the origin at the re-entrant corner, see e.g , and is also given in Figure 8.
4.1.1 Scalability of the simulation pipeline
To investigate the scalability of the parallel implementation for different problem sizes, computations on two fixed discretizations of the L-shaped domain are considered. Firstly, each of the three quadrants of the L-shaped domain is discretized uniformly with 16 by 16 elements resulting in a coarse base mesh comprising of 768 elements. This mesh is refined recursively in 5 steps towards the domain’s re-entrant corner and studied. This test case is termed as run A and consists of 813 leaf elements. A polynomial order of is chosen, ensuring that the measured times are in the range of seconds. The second test case, run B, consists of a finer mesh with 49 152 base elements, obtained by discretizing each quadrant with 128 by 128 elements. This mesh is also refined in 5 steps yielding a total of 49 197 leaf elements. A polynomial degree is chosen in this test case. The computational domains are partitioned among different numbers of processes and the execution time for various components in the simulation pipeline is monitored.
Figure 11 shows the strong scaling of the parallel implementation for the test cases run A and run B. The most time consuming components in the simulation pipeline, integration of the stiffness matrices and solving of the global system, together with the total execution time are considered. In Figure (a)a the speed up in run A for the mentioned components is shown, with each component scaling almost linearly or better up to four processes. In this setup with four processes, each process integrates around 200 leaf elements. The speed up of the integration time is still relatively good at a process count of 8, with as little as 100 leaf elements per process, but decreases gradually as the number of processes and consequently the number of integration domains per process decreases. The linear solver scales well up to 8 processes, but suffers a reduction in parallel efficiency due to limited computational work and added communication costs when the number of processes is increased. The scalability of the whole simulation is largely influenced by the scalability of the solver, as shown by the strong correlation of the two curves. These curves are, however, not identical due to other components in the simulation pipeline such as the serial parts of the code. It should be noted that the assembly of the distributed linear system is not included in the curves in Figure 11.
Next, we consider the performance of the parallel implementation for large problems in run B. The setup comprises of 5 million degrees of freedom with the process count varying from 1 to 128. The increased number of integration domains per process has a positive effect on the integration algorithm, as near to perfect scalability is achieved up to a process count of 128, see Figure (b)b. This number is still significantly lower than that of some conventional approaches which need more elements per process, up to , to hide the overhead introduced by redundant computations . The Hypre solver scales well to 8 processes, but suffers a loss in efficiency with 16 processes. Its parallel performance however improves when 32 processes are used. This is attributed to cache effects as the problem is now computed on two nodes. The overall scalability of the problem is closely coupled with the scalability of the linear solver similar to the results from run A.
A more detailed analysis of the simulation pipeline is performed using run B as a basis. Figure (a)a shows the execution times of different components plotted against the number of processes used, while Figure (b)b shows their contribution to the total execution time. These diagrams show that the communication of the non-local entries of the stiffness matrices to other processes during the assembly of the distributed system, an integral part of the parallelization scheme (see Section 3.2.5), does not significantly influence the simulation’s overall scalability and constitutes at most 5% of the execution time. Its contribution to the overall time is, however, expected to increase when the number of processes increases above 128. This would carry the same computational cost as the integration of the stiffness matrices. Furthermore, the serial part of the code would dominate the total computation time. The problem size is in this case too small to effectively utilize the available resources.
4.1.2 Influence of the polynomial order on the integration algorithm
The results from run A and run B show almost perfect scalability in the integration of the stiffness matrices. Both cases, however, utilize a rather high polynomial order so as to increase the amount of computational work per process. The behavior of the algorithm for low and moderate values of , however, remains to be shown. We investigate this by computing the example at hand with different polynomial orders where . By keeping the number of processes constant at 8 and increasing the number of mesh elements, we are able to monitor the development of the parallel efficiency of the integration algorithm for different polynomial orders and elements numbers.
The results of the study are presented in Figure 15. The parallel efficiency of the integration algorithm increases when is raised, due to the increase in computational work, and can be further improved by increasing the number of elements per process. This integration scheme is therefore also suitable for problems with low or moderate polynomial orders.
4.1.3 Memory requirements
Another important aspect of any parallel scheme is the amount of random-access memory (RAM) needed during a computation. Using run B from Section 4.1.1 as a basis, we analyze the peak total memory needed during a simulation as well as the average and maximum memory usage of a single process. These values are not constant but only represent the highest memory requirements at a single juncture in the simulation. Peak memory consumption is in general problem dependent, and is reached for the problem at hand during the solution of the linear system.
The memory needed by a single process decreases with an increase in the number of processes as shown in Figure (a)a, this reduction rate is, however, sub-linear and leads to an overall increase in the total memory needed. The minimal difference between the average and maximum process memory usage reveals that the proposed scheme is also able to achieve good balance in terms of memory consumption. Nevertheless, the memory needed by a single process starts to level off at a process count of 64. We attribute this behavior to the shared mesh, which is duplicated on every process and thus a natural lower bound. This leveling off in turn, would lead to a linear increase in the total memory at a process count beyond 128.
Following the above observation, the total memory required for simulations involving 128 to 4096 processes is approximated under the assumption of a limit shared mesh size of 3GB . The total memory usage in Figure (a)a is extrapolated and compared in Figure (b)b to the amount of memory available on two different cluster architectures, cluster 1 with nodes comprising 16 Intel SandyBridge processors and 128GB of memory and cluster 2 with nodes made up of 64 AMD Bulldozer Opteron processors and 256GB of memory. Figure (b)b shows that run B could be successfully run on both clusters with between 128-4096 processes without exceeding the available memory resources. This, however, does not hold when the setup size is tripled, this corresponds to a shared mesh of 9 GB and a total of 17 million degrees of freedom if the polynomial degree is assumed constant.
4.2 3D benchmark
The applicability of the parallel scheme to three dimensional problems is studied in this example. To this end, the shock problem considered in [50, 5] is chosen due to the complex mesh refinement patterns involved. It entails solving the Poisson equation on a unit cube subjected to Dirichlet and Neumann boundary conditions derived from the manufactured solution
is specially chosen to represent a shock-like function as illustrated in Figure 25. The term is a radial coordinate about a shifted origin given by the equation,
while is a factor determining the sharpness of the shock. The interested reader is referred to  for a detailed derivation of the applied boundary conditions and multi-level -convergence properties. In this numerical example, the original shock problem is modified to allow for a time dependent initial radius in Equation 4. The performance of our scheme in a transient setup can thus be investigated, for a sharpness and different values of as illustrated in Figure 25a-c.
The example at hand beings in the first time step with an initial mesh of base elements and a uniform polynomial degree . Moreover, the trunk space following  is used. This computational mesh is refined in every time step towards the shock with a refinement depth of three as illustrated in Figure 25d-f and a total of five time steps are computed. A graded polynomial order is applied to the elements, cf. , as illustrated in Figure 25g for the final time step, where the polynomial order is decreased when increasing the refinement level. A graded mesh posses a challenge for any scheme in terms of load balancing, as the number of basis functions within leaf elements vary greatly. The number of leaf elements and unknowns within each time step are summarized in Table 1. The performance of the simulation pipeline is monitored in every time step as in the previous example.
|No. degrees of freedom||158437||191329||240028||266506||214576|
|No. leaf elements||6474||19095||38604||48390||27201|
The results of the present study are shown in Figure 28. The total time spent in various routines was monitored over the 5 time steps and used as a basis for computing the overall scalability and the contribution of individual components to the total execution time. A similar pattern in the scalability and simulation pipeline breakdown is seen as in the two dimensional case. Although the mesh in each time step consists of elements with greatly varying polynomial orders, our partitioning algorithm is able to achieve good balance and allows scalability of the integration algorithm up to about 128 processes. The linear solver does not scale well in this simulation. This is due to a rather small problem size of about 200 000 degrees of freedom. The influence of the serial components of the code also increases with an increase in the number of MPI processes as shown in Figure (b)b.
4.3 Parallelization of the multi-level -Fcm
We conclude this section by considering an example in linear elasticity that combines the finite cell method and multi-level -refinement to simulate the loading of a bone-implant system. The setup consists of a spinal vertebra with embedded pedicle screws that is supported on its bottom surface and loaded on its upper flank as shown in Figure (a)a. The major benefit of using FCM in this simulation is its ability to deal with multiple geometric models in an easy fashion without the need of generating a boundary conforming mesh. We are thus able to simultaneously compute on the bone material, which is based on a HR-pQCT scan with a voxel size of 146.5 , and the screws, whose geometry is described by a CAD model. A computational mesh shown in Figure (b)b for example, can be easily generated by embedding both the voxel model and screw in a bounding box and using the density values of the CT-scan together with the surface of the screws to filter out elements lying completely in the fictitious domain. The resulting computational domain is refined towards the interface of the bone material and the screws as illustrate in Figure (c)c, so as to better capture the solution in this area. Furthermore, the composed integration technique illustrated in Figure 2 is applied on the voxel level, so as to make use of the available fine grain information of the CT-scan. This high resolution integration is depicted in Figure (d)d, where the black colored cells represent the leaf elements while the blue colored cells the depict the voxels. This results in a very high computational cost, rendering the integration of the linear system the main bottleneck in this simulation.
This example presents a good test for the parallel implementation. The major challenge here lies in the efficient distribution of the computational domain so as to guarantee good performance. The leaf elements not only differ in the number of degrees of freedom due to refinement, but also in the number of voxels. Moreover, an inside-outside test has to be performed for each integration point in order to find out if the point lies within the bone material, screw or fictitious domain. We perform the analysis on an initial mesh comprising 1 341 base elements that are refined in two steps towards the interface of the bone and the screws. A polynomial degree of is chosen resulting in 7 571 leaf elements upon refinement and 147 889 degrees of freedom.
Figure 36 shows the parallel performance of the different algorithms for the examples at hand. We are able to achieve relatively good scaling results in the most time consuming routines: the integration of the stiffness matrices and the post processing. More effort, however, needs to be invested in load balancing, i.e. in the distribution of the integration domains among processes, in order to improve the algorithms parallel efficiency.
5 Conclusion and future work
The article at hand presented a parallelization scheme for the multi-level -scheme that is adapted to its hierarchical structure. A shared mesh data structure was used in combination with a distributed assembly of the global system, so as to avoid redundant computations of ghost elements. This decision was motivated by the large basis function support of the multi-level -scheme, which would lead to an unproportionally high number of ghost elements compared to conventional -schemes. Moreover, the simplicity of a shared mesh data structure allowed for the easy maintenance of mesh consistency among the participating processes.
Three numerical examples are discussed in this work. They reveal that multi-level -refinement can be efficiently parallelized and be combined with the finite cell method for the simulation of large engineering problems. Good scalability in the total simulation time for different problem sizes in two and three dimensions is shown. Although each process has to communicate non-local entries of the linear system to other processes during assembly of the global system, this routine does not significantly affect parallel performance and only makes up a small fraction of the total computation time. This influence, however, increases as the number of processes increases due to reduced computational work. The problem size can in this case be increased to counteract this behavior.
A runtime analysis of the simulation pipeline revealed that the integration of the linear system greatly benefits from the proposed parallel scheme. Perfect scalability is achieved in this algorithm with as little as a few hundred integration domains per process, proving the suitability of the load balancing scheme presented in Algorithm 1. This results show the scheme’s suitability for problems in which numerical integration dominates. Furthermore, the integration scheme shows excellent performance for problems involving both moderate and high polynomial orders.
The potential of the parallel multi-level -scheme is demonstrated in the numerical examples considered. Further numerical studies are, however, necessary to extend the schemes applicability and improve its performance. Although the memory requirements per process significantly decrease with an increase in the number of processes, this behavior does not hold when high numbers of MPI processes are used, but levels off when the size of the computational mesh begins to dominate. The maximum size of the computation mesh that can be replicated on all processes is, therefore, limited. The onset of this leveling-off can be delayed by optimizing the memory footprint for the code, thus increasing the maximum size of the duplicated mesh. A more feasible solution to this problem would be the use of a distributed mesh structure, as suggested in . Moreover, utilizing the scheme in a hybrid framework would help reduce the amount of memory needed on a single node. The scalability of the parallel scheme is greatly governed by the linear solver as shown in the numerical examples. A more in depth analysis has to be conducted to find the best solver configurations for the problem types considered. The final numerical example showed the advantage of using the multi-level -scheme in conjunction with the finite cell method. The efficiency of the load balancing algorithm has to be improved in this case and adapted to the voxel based integration scheme. Yet another research direction would be the extension of the parallel scheme’s application field. One possibility would be the use of this scheme in nonlinear problems which require complex three dimensional refinement patterns.
The first and the last author gratefully acknowledge the financial support of the German Research Foundation (DFG) under Grant RA 624/27-1.
-  N. Zander, T. Bog, S. Kollmannsberger, D. Schillinger, and E. Rank, “Multi-level hp-adaptivity: High-order mesh adaptivity without the difficulties of constraining hanging nodes,” Computational Mechanics, vol. 55, no. 3, pp. 499–517, 2015.
-  P. Šolín, Higher-Order Finite Element Methods. Studies in advanced mathematics, Boca Raton: Chapman & Hall/CRC, 2004.
-  W. Bangerth, R. Hartmann, and G. Kanschat, “deal.II – a general purpose object oriented finite element library,” ACM Transactions on Mathematical Software, vol. 33, no. 4, pp. 24/1–24/27, 2007.
-  L. Demkowicz, Computing with Hp-Adaptive Finite Elements, Vol. 1: One and Two Dimensional Elliptic and Maxwell Problems. Applied mathematics and nonlinear science series, Boca Raton: Chapman & Hall/CRC, 2007.
-  N. Zander, T. Bog, M. Elhaddad, F. Frischmann, S. Kollmannsberger, and E. Rank, “The multi-level hp-method for three-dimensional problems: Dynamically changing high-order mesh refinement with arbitrary hanging nodes,” Computer Methods in Applied Mechanics and Engineering, vol. 310, pp. 252–277, 2016.
-  J. Parvizian, A. Düster, and E. Rank, “Finite cell method,” Computational Mechanics, vol. 41, no. 1, pp. 121–133, 2007.
-  M. Paszyński and L. Demkowicz, “Parallel, fully automatic hp-adaptive 3D finite element package,” Engineering with Computers, vol. 22, no. 3-4, pp. 255–276, 2006.
-  M. Paszyński and D. Pardo, “Parallel self-adaptive finite emelent method with shared data structure,” Computer Methods in Material Science, vol. 11, no. 2, pp. 399–405, 2011.
-  W. Bangerth, C. Burstedde, T. Heister, and M. Kornbichler, “Algorithms and data structures for massively parallel generic adaptive finite element codes,” ACM Transactions on Mathematical Software, vol. 38, no. 2, pp. 14:1–14:28, 2011.
-  R. Popescu, Parallel algorithms and efficient implementation techniques for finite element approximations. PhD thesis, SB, Lausanne, 2013.
-  R. S. Sampath and G. Biros, “A parallel geometric multigrid method for finite elements on octree meshes,” SIAM Journal on Scientific Computing, vol. 32, no. 3, pp. 1361–1392, 2010.
-  “MFEM: Modular finite element methods.” mfem.org.
-  T. Hughes, The Finite Element Method: Linear Static and Dynamic Finite Element Analysis. Dover Civil and Mechanical Engineering, Dover Publications, 2000.
-  P. Šolín, J. Červený, and I. Doležel, “Arbitrary-level hanging nodes and automatic adaptivity in the hp-FEM,” Mathematics and Computers in Simulation, vol. 77, no. 1, pp. 117–132, 2008.
-  P. Šolín, L. Dubcova, and I. Doležel, “Adaptive hp-FEM with arbitrary-level hanging nodes for Maxwell’s equations,” Advances in Applied Mathematics and Mechanics, vol. 2, no. 4, pp. 518–532, 2010.
-  N. Zander, M. Ruess, T. Bog, S. Kollmannsberger, and E. Rank, “Multi-level hp-adaptivity for cohesive fracture modeling,” International Journal for Numerical Methods in Engineering, 2016.
-  C. D. Mote, “Global-local finite element,” International Journal for Numerical Methods in Engineering, vol. 3, no. 4, pp. 565–574, 1971.
-  T. Belytschko, J. Fish, and A. Bayliss, “The spectral overlay on finite elements for problems with high gradients,” Computer Methods in Applied Mechanics and Engineering, vol. 81, no. 1, pp. 71–89, 1990.
-  E. Rank, “Adaptive remeshing and h-p domain decomposition,” Computer Methods in Applied Mechanics and Engineering, vol. 101, no. 1–3, pp. 299–313, 1992.
-  P. K. Moore and J. E. Flaherty, “Adaptive local overlapping grid methods for parabolic systems in two space dimensions,” Journal of Computational Physics, vol. 98, no. 1, pp. 54–63, 1992.
-  A. Düster, J. Parvizian, Z. Yang, and E. Rank, “The finite cell method for three-dimensional problems of solid mechanics,” Computer Methods in Applied Mechanics and Engineering, vol. 197, no. 45–48, pp. 3768–3782, 2008.
-  M. Dauge, A. Düster, and E. Rank, “Theoretical and Numerical Investigation of the Finite Cell Method,” Journal of Scientific Computing, vol. 65, no. 3, pp. 1039–1064, 2015.
-  M. Joulaian, S. Hubrich, and A. Düster, “Numerical integration of discontinuities on arbitrary domains based on moment fitting,” Computational Mechanics, pp. 1–21, 2016.
-  L. Kudela, N. Zander, T. Bog, S. Kollmannsberger, and E. Rank, “Efficient and accurate numerical quadrature for immersed boundary methods,” Advanced Modeling and Simulation in Engineering Sciences, vol. 2, no. 1, pp. 1–22, 2015.
-  L. Kudela, N. Zander, S. Kollmannsberger, and E. Rank, “Smart octrees: Accurately integrating discontinuous functions in 3D,” Computer Methods in Applied Mechanics and Engineering, vol. 306, pp. 406–426, 2016.
-  D. Schillinger, M. Ruess, N. Zander, Y. Bazilevs, A. Düster, and E. Rank, “Small and large deformation analysis with the p- and b-spline versions of the finite cell method,” Computational Mechanics, vol. 50, no. 4, pp. 445–478, 2012.
-  I. Babuška, “The Finite Element Method with Penalty,” Mathematics of Computation, vol. 27, no. 122, p. 221, 1973.
-  S. Fernández-Méndez and A. Huerta, “Imposing essential boundary conditions in mesh-free methods,” Computer Methods in Applied Mechanics and Engineering, vol. 193, no. 12-14, pp. 1257–1275, 2004.
-  J. Nitsche, “Über ein Variationsprinzip zur Lösung von Dirichlet-Problemen bei Verwendung von Teilräumen, die keinen Randbedingungen unterworfen sind,” Abhandlungen aus dem Mathematischen Seminar der Universität Hamburg, vol. 36, no. 1, pp. 9–15, 1971.
-  M. Ruess, D. Schillinger, Y. Bazilevs, V. Varduhn, and E. Rank, “Weakly enforced essential boundary conditions for NURBS-embedded and trimmed NURBS geometries on the basis of the finite cell method,” International Journal for Numerical Methods in Engineering, vol. 95, no. 10, pp. 811–846, 2013.
-  C. Verhoosel, G. van Zwieten, B. van Rietbergen, and R. de Borst, “Image-based goal-oriented adaptive isogeometric analysis with application to the micro-mechanical modeling of trabecular bone,” Computer Methods in Applied Mechanics and Engineering, vol. 284, pp. 138–164, 2015.
-  T. Bog, N. Zander, S. Kollmannsberger, and E. Rank, “Normal contact with high order finite elements and a fictitious contact material,” Computers & Mathematics with Applications, vol. 70, no. 7, pp. 1370–1390, 2015.
-  D. Schillinger, L. Dedè, M. A. Scott, J. A. Evans, M. J. Borden, E. Rank, and T. J. Hughes, “An isogeometric design-through-analysis methodology based on adaptive hierarchical refinement of NURBS, immersed boundary methods, and t-spline CAD surfaces,” Computer Methods in Applied Mechanics and Engineering, vol. 249-252, pp. 116–150, 2012.
-  D. Schillinger and M. Ruess, “The Finite Cell Method: A Review in the Context of Higher-Order Structural Analysis of CAD and Image-Based Geometric Models,” Archives of Computational Methods in Engineering, vol. 22, no. 3, pp. 391–455, 2014.
-  N. Zander, T. Bog, M. Elhaddad, R. Espinoza, H. Hu, A. Joly, C. Wu, P. Zerbe, A. Düster, S. Kollmannsberger, J. Parvizian, M. Ruess, D. Schillinger, and E. Rank, “FCMLab: A finite cell research toolbox for MATLAB,” Advances in Engineering Software, vol. 74, pp. 49–63, 2014.
-  A. Breuer, A. Heinecke, S. Rettenberger, M. Bader, A.-A. Gabriel, and C. Pelties, “Sustained petascale performance of seismic simulations with seissol on supermuc,” in Supercomputing: 29th International Conference, ISC 2014, Leipzig, Germany, June 22-26, 2014. Proceedings (J. M. Kunkel, T. Ludwig, and H. W. Meuer, eds.), pp. 1–18, Cham: Springer International Publishing, 2014.
-  J. Weismüller, B. Gmeiner, S. Ghelichkhan, M. Huber, L. John, B. Wohlmuth, U. Rüde, and H.-P. Bunge, “Fast asthenosphere motion in high-resolution global mantle flow models,” Geophysical Research Letters, vol. 42, no. 18, pp. 7429–7435, 2015.
-  P. Šolín et al., Hermes - Higher-Order Modular Finite Element System (User’s Guide).
-  B. S. Kirk, J. W. Peterson, R. H. Stogner, and G. F. Carey, “libMesh: A C++ Library for Parallel Adaptive Mesh Refinement/Coarsening Simulations,” Engineering with Computers, vol. 22, no. 3–4, pp. 237–254, 2006.
-  C. Cantwell, D. Moxey, A. Comerford, A. Bolis, G. Rocco, G. Mengaldo, D. D. Grazia, S. Yakovlev, J.-E. Lombard, D. Ekelschot, B. Jordi, H. Xu, Y. Mohamied, C. Eskilsson, B. Nelson, P. Vos, C. Biotto, R. Kirby, and S. Sherwin, “Nektar++: An open-source spectral/ element framework,” Computer Physics Communications, vol. 192, pp. 205 – 219, 2015.
-  M. A. Heroux, R. A. Bartlett, V. E. Howle, R. J. Hoekstra, J. J. Hu, T. G. Kolda, R. B. Lehoucq, K. R. Long, R. P. Pawlowski, E. T. Phipps, A. G. Salinger, H. K. Thornquist, R. S. Tuminaro, J. M. Willenbring, A. Williams, and K. S. Stanley, “An overview of the trilinos project,” ACM Transactions on Mathematical Software, vol. 31, no. 3, pp. 397–423, 2005.
-  S. Balay, W. D. Gropp, L. C. McInnes, and B. F. Smith, “Efficient management of parallelism in object oriented numerical software libraries,” in Modern Software Tools in Scientific Computing (E. Arge, A. M. Bruaset, and H. P. Langtangen, eds.), pp. 163–202, Birkhäuser Press, 1997.
-  P. Di Stolfo, A. Schröder, N. Zander, and S. Kollmannsberger, “An easy treatment of hanging nodes in hp-finite elements,” Finite Elements in Analysis and Design, vol. 121, pp. 101–117, 2016.
-  D. D’Angella, N. Zander, S. Kollmannsberger, F. Frischmann, E. Rank, A. Schröder, and A. Reali, “Multi-level hp-adaptivity and explicit error estimation,” Advanced Modeling and Simulation in Engineering Sciences, vol. 3, no. 1, p. 33, 2016.
-  K. Devine, E. Boman, L. Riesen, U. Catalyurek, and C. Chevalier, “Getting started with zoltan: A short tutorial,” in Proceedings of 2009 Dagstuhl Seminar on Combinatorial Scientific Computing, 2009.
-  R. D. Falgout and U. M. Yang, “hypre: a library of high performance preconditioners,” in Preconditioners,” Lecture Notes in Computer Science, pp. 632–641, 2002.
-  The HDF Group, “Hierarchical Data Format, version 5.” http://www.hdfgroup.org/HDF5/, 1997.
-  B. Schling, The Boost C++ Libraries. XML Press, 2011.
-  B. A. Szabó and I. Babuška, Finite Element Analysis. New York: John Wiley & Sons, 1991.
-  W. Rachowicz, D. Pardo, and L. Demkowicz, “Fully automatic hp-adaptivity in three dimensions,” Computer Methods in Applied Mechanics and Engineering, vol. 195, no. 37–40, pp. 4816–4842, 2006.