I Introduction and Motivation
Modern supercomputers tend to be massive parallel, i. e. they consist of several hundreds of thousands of cores, thus making efficient code design inevitable in order to exploit the underlying performance and to keep up with the so-called exascale challenge. While lot of research is currently happening into this direction, still plenty of codes are not prepared yet for scalability runs on more than 32,000 cores. In many cases, communication, i. e. data exchange between single processes, and load balancing – considering adaptive mesh refinement, e. g. – are the major problems, preventing such codes from high-fidelity computations on petascale systems such as SuperMUC, installed at the Leibniz Supercomputing Centre in Garching, with its more than 241,000 cores and a combined peak performance of 6.8 Petaflops or JuQueen, installed at Jülich Supercomputing Centre, with its more than 458,000 cores and an overall peak performance of 5.9 Petaflops.
On the other hand, the so-called emerging sciences such as medicine, sociology, biology, virology, chemistry, climate or geo-sciences demand for more and more computing power in order to solve multi-scale, multi-physics, and frequently also multi-domain problems. Those problems are not only complex by their nature, they often address further aspects such as big data or real time (in-situ) computing – the latter one enabling decision makers to decide quickly where late results would be useless – thus putting a lot of pressure on the parallel code design. What all these codes have in common is the necessity for efficient data structures, fast communication schemes, state-of-the-art numerical algorithms, and advanced load balancing techniques – all in all key factors for the successful deployment of scalable massive parallel applications on modern petascale systems.
In this paper, we present a computational fluid dynamics (CFD) code for various applications such as thermal comfort assessment or porous media flows. Basic principle of this code is a hierarchical data structure in combination with an efficient multi-grid-like solver that perfectly scales both in the spatial (geometry) and computational (cores) domain. This structure allows to adaptively refine the computational mesh during runtime as well as to migrate blocks of the grid between processes in order to achieve an optimal load situation concerning data locality and minimal communication. A special process – called neighbourhood server – keeps track of all other (worker) processes, the data assigned to them, and the communication pattern among all nodes. Due to this neighbourhood server an efficient orchestration of the nodes becomes possible, thus we are able to obtain very good scalability and speed-up values when running the code on up to 140,000 processes on different supercomputers.
Main focus of this paper is a detailed scalability study of our CFD code on the two aforementioned petascale systems, i. e. SuperMUC and JuQueen, based on an indoor thermal comfort simulation. Therefore, we will profile typical communication properties along with special code characteristics, compare those results obtained on the two systems, and finally discuss identified bottlenecks and weak aspects of our parallelisation concept. These results do not only allow us to get a valuable insight into the code, but also to reveal those points which need further tuning in order to be ready for the next step towards the exascale challenge. The remainder of this paper is structured as follows: In the next section we will present the mathematical model used in our code, followed by a brief description of the data structure and the communication concept. Section 3 will highlight a scalability study performed on two different petascale systems together with a sound analysis of the measured results, while Section 4 addresses sample scenarios computed on several thousand cores using our CFD code. Section 5 will then close the paper with a short summary and outlook.
Ii MPFluid – Massive Parallel CFD Code
Ii-a Mathematical Modelling
The mathematical background of the code is described in detail in Frisch et al. . This section aims at giving a concise introduction into the mathematical modelling and into the data structure in order to bring the reader up to speed and to motivate the usage of HPC methods.
The mathematical modelling of the implementation is based on the Navier–Stokes equations derived from the conservation of mass, momentum, and energy principles. A complete and extensive derivation of the equations can be found in standard literature [2, 3, 4]. The governing equations for an incompressible Newtonian fluid flow comprise three sets of equations. The first equation set – given in differential form – is called the continuity equation
where describes the velocity of the fluid field using as velocity components in the three spatial directions , respectively. This equation has to be satisfied at every time step in the complete domain.
The second set, called the momentum equations, can be written for every direction as
where represents the time, the density of the fluid assumed constant over the complete domain and the dynamic viscosity. represents the pressure, and interior body forces in direction .
represents the unit vector in direction.
A temperature convection-diffusion equation can be applied additionally if thermal effects such as buoyancy should be modelled and describes the energy conservation and models the heat transport. The Boussinesq approximation couples the convection-diffusion equation to the momentum equations as described in Lienhard and Lienhard  if some assumptions hold.
There is no independent equation for the pressure in Equations (1) and (2). The momentum equations include the pressure gradient , whereas the incompressible continuity equation does not contain at all. By applying pressure correction methods as proposed by Harlow and Welch  in 1965 or the fractional step method (or projection method) introduced by Chorin  in 1967, incompressible flows can be solved nevertheless. The methods are based on an iteration between velocity and pressure fields, where the pressure field acts in every step as a correction in order to fulfil the continuity equation for the velocity field. The pressure term in the incompressible equations is often referred to as ‘working pressure’ as Equation (2) only contains the pressure gradient and not the absolute value itself.
By applying the fractional step method and choosing an explicit Euler time discretisation for the temporal derivative , the momentum conservation equations in the direction for an intermediate time step can be written
by neglecting the pressure gradient. denotes the time step size, the superscript denotes the current time step , and the superscript an intermediate time step between and . The pressure term is now treated in a second step
The summation of the two Equations (3) and (4) results in the original Equation (2) with an explicit treatment of the velocity term (i. e. at time step ) and an implicit treatment of the pressure term (i. e. at time step ).
Using the divergence operator on Equation (4) leads to
As the pressure term must lead to a divergence-free velocity field at time step (fulfilment of the continuity Equation (1)), the equation for determining the pressure at time step can be written as
representing a Poisson equation for the pressure, which has to be solved in every time step.
The temporal discretisation can be enhanced using a second order explicit multi-level Adams–Bashforth method, introduced in detail by Schwarz , for example. A very detailed analysis of other approaches for time derivatives can be found in Ferziger . Kim and Moin  introduced in 1985 a mixture of semi-implicit and explicit methods, where the convective terms in Equation (3) are discretised using a second order explicit Adams–Bashforth method. The viscous terms, however, are discretised using a semi-implicit Crank–Nicolson method, which eliminates some numerical stability problems. Choi and Moin  used in 1994 a similar approach, but integrate a predictor-corrector method.
Similar to the temporal discretisation, a spatial discretisation has to be introduced in order to describe and prepare the domain for a numerical simulation. Popular, well-established methods, such as the finite difference method (FDM), the finite volume method (FVM) or the finite element method (FEM), could be used. This work focuses on the FVM and FDM approach for discretising the spatial domain and a good description of these methods can be found in Ferziger and Perić  or Hirsch , for instance.
Numerical stability issues have to be taken into account as an explicit time discretisation method was chosen here. Courant and Friedrichs  analysed this problem in 1928 and proposed using an upwind-difference for the convective term and central differences for the diffusive term, and Peyret and Taylor  give a good overview on a detailed stability analysis for explicit FDM.
Ii-B Data Structure
The data structure is based on block-structured, non-overlapping, orthogonal, regular, hierarchical grids. A topological grid management (called ‘l-grids’) is responsible for all hierarchic information, such as parents and children relations, while data grids (called ‘d-grids’) contain the actual data arrays, such as velocity and pressure fields, for example.
Figure 1 gives an overview of both data structure parts. In the top part, the nested construction of the non-overlapping grids can be seen. A root l-grid, located per definition at depth zero, is refined by in the respective directions. The new resulting child l-grids can be refined again by until the desired depth is reached. Furthermore, while creating the newly refined child l-grids, the coarse l-grid will be defined as their parent l-grid. Hence, each l-grid can only have one parent l-grid.
The second major part of the data structure consists of d-grids (i. e. the data grids). Every d-grid stores only the necessary variables, such as velocities, pressure, or temperature values in a matrix of cells. Thus, a cell is comparable to one control volume. The d-grid is surrounded by a halo of ghost cells necessary for the proper functioning of the structure. One of these d-grids can be seen in the middle part of Figure 1 directly above the text line and consists of the size of , not counting the halo cells. Furthermore, each l-grid contains exactly one link to one d-grid.
Thus, complex, adaptive domain scenarios can be generated by combining regular blocks due to the flexible structure of the logical grid management. Each d-grid has an equidistant orthogonal spacing, and it can be shown that the finite volume approach using the mid-point rule and linear interpolation degenerates into a finite difference approach. Thus, on one d-grid, a simple finite difference scheme, such as a six-point stencil in 3D, can be used. This allows a strict separation into two phases: a computation phase and a communication phase. In the computation phase, the finite difference approach in the form of a stencil is evaluated on every d-grid. In the communication phase, a halo update is performed filling the ghost cells with neighbouring values in order to apply a Schwarz decomposition method. In this phase the flux continuity has to be guaranteed on grid boundaries by the communication routines.
Ii-C Data Exchange and Neighbourhood Server
As the data grids iterate solely over their local data, all other information, such as neighbouring and parental relations, must be provided by the logical grid management structure. Hence, the logical grid management takes care of flux conservation across d-grid boundaries, especially in the case of adaptive refinements, as depicted in Figure 1. Therefore, complex data exchange mechanisms introduced in  are provided by the logical grid management framework to guarantee data integrity and consistency.
The distribution of grids to different processes is managed by a special dedicated process called ‘neighbourhood server’. It is described in detail in Frisch . A dedicated server keeps track of the logical structure of all grids without knowing values of the data content itself and organises the distribution to different processes according to a Lebesgue space-filling curve (also called Z-order curve, see Bader ) in order to preserve neighbouring relations. Performance measurements for exchange times of halo values show very good results in terms of grid-to-grid communication for the given distribution on high-performance computers and are presented in Section III.
Ii-D Multi-Grid-Like Pressure Poisson Solver
As indicated in Equation (6), the fractional step method leads to a Poisson equation for the pressure term which has to be solved at every time step. Frisch  shows that more than 90 % of the time is spend in the solution of the Poisson equation. Hence, the efficient solution of the pressure Poisson equation is of utmost importance.
In the 1970s Brandt 
introduced geometric multi-grid solvers as multi-level methods which represent an adequate technique for solving partial differential equations of elliptic type, such as the Poisson equation or the Laplacian equation. By comparing the data flows of restriction and prolongation operators to the already implemented data exchanges, many similarities can be seen and a cell-centred, multi-grid-like solver was applied on the already implemented communication structures.
In the following section, scaling results for different machines are compared for a Laplacian equation, as this resembles the pressure Poisson equation. The Laplacian problem
is defined on a 1 1 1 m cubic domain, where fixed Dirichlet boundary conditions are applied on the east and west side according to
whereas all other sides are set to . This equation describes a diffusion process and can be used as well for modelling a stationary temperature diffusion problem without any convective influences or internal loads.
Iii Scaling Results
In the following section, performance measurements are shown for tests on different machines for several depths of the hierarchical data structure. The measurements were performed on three supercomputing systems: Shaheen, a 16-rack, 65,536 cores IBM Blue Gene/P supercomputer installed at King Abdullah University of Science and Technology (KAUST) in the Kingdom of Saudi Arabia, JuQueen, a 28-rack, 458,752 cores IBM Blue Gene/Q supercomputer installed at Jülich Supercomputing Centre (JSC) in Jülich, Germany, and SuperMUC, a 155,656 core (before the hardware upgrade in 2015) large IBM System iDataPlex supercomputer installed at the Leibniz Supercomputing Centre (LRZ) in Garching near Munich, Germany.
All following measurements were done solely for the pressure Poisson Equation (6) which represents the most computational complex part of the CFD code. This can be done without restriction of any kind, as the overhead for a full time step is according to  only slightly larger than for just solving the pressure Poisson equation.
First of all, we measured the time for one total ghost layer exchange based on a fully refined 3D domain using a refinement level of (2,2,2) for the l-grids and a d-grid size of (16,16,16), thus resulting in computational blocks of 4,096 d-grid cells each. This ensemble of l-grid and d-grid sizes has proven to be optimal for SuperMUC, hence we were using this setup throughout all experiments on all systems. The measurements were then performed for different depths of the logical grids, ranging from 5 to 8, leading to approximately 78.5 billion d-grid cells with 9 independent variables per cell, i. e. more than 707 billion variables to be exchanged across all processes on depth 8.
In Figure 2 (a), the total exchange times in [s] are depicted for the three different systems. All curves show a clear tendency to decrease for an increasing number of processes. On SuperMUC, for instance, the total exchange time between more than 140,000 processes on depth 8 with over 707 billion variables is approximately 0.1 s and, thus, practically negligible during the computation. Interestingly, the curves on SuperMUC and the two IBM Blue Gene systems not only look very similar, but also exhibit the same slope – except the fact that the exchange times on SuperMUC are faster than on the Blue Gene systems due to SuperMUC’s higher clock frequency – which is already a first indication of the data structure’s well scalability characteristics and its suitability for massive parallel computations.
In order to relativise the influence of the different clock frequencies, in Figure 2 (b) the total exchange times in [s] are plotted against the number of l-grids per process for the three different systems. It is also observable that a hardware upgrade from Blue Gene/P to Blue Gene/Q had an impact on the communication times. The slopes are still similar but show an offset between both machines. One can clearly observe now the perfect accordance of all measurements for corresponding depth (where available) which further emphasises the sound scalability of our CFD code for different problem sizes, different amount of processes, and different architectures.
Next experiments addressed a full time step update, i. e. an iterative solution of the pressure Poisson equation until convergence was reached. The setup was exactly the same as in the previous experiment, namely a fully refined 3D domain using l-grid refinement levels of (2,2,2) and d-grid sizes of (16,16,16) for different depths (5–8), leading to around 20 million l-grids with more than 78.5 billion d-grid cells and over 707 billion variables on depth 8. It is worth notifying that this setup also requires 28 TByte of combined main memory for storing all relevant data. In order to face this problem on SuperMUC, for instance, a total of at least 20,000 processes is necessary due to SuperMUC’s memory availability of 1.5 GByte per core. On JuQueen, however, at least 32,000 processes were necessary to accommodate all relevant data due to a smaller memory per core ratio. In Figure 3, the times to solution (i. e. a full time step) in [s] are plotted against the total number of processes (a) and the total number of l-grids per process (b) for the three different systems. Again, a clear decreasing tendency of all curves for an increasing number of processes is to be observed. Furthermore, the curves of SuperMUC and JuQueen still show the same slope (upper plot of Figure 3) and are more or less shifted in vertical direction only due to the two systems’ different clock frequencies. The lower plot of Figure 3 reveals a similar good accordance of all corresponding experiments on the different systems as in the previous case of a pure ghost layer exchange, negelecting hardware specific influences such as processor speed. Hence, the supposed scalability of the code could be approved by the second measurements, too.
Another interesting aspect is the maximum throughput for a given (fixed) problem size and different amounts of processes – also known as strong speed-up. Figures 4 and 5 highlight those strong speed-up measurements for the three different systems with up to a total of 8,192, 16,384, 32,768, 65,536, and 139,008 processes. The experimental setup is the same as in the previous cases. For better visibility, all of the above five cases are plotted separately against the ideal speed-up with a slope of 1.
In Figure 4
, one general observation is that all curves start to level-off as soon as the underlying problem becomes too small and communication time (between cores and nodes) dominates computation time. In a direct comparison between SuperMUC and Shaheen, i. e. Intel’s Sandy-Bridge architecture vs. IBM’s PowerPC 450 generation, both CPUs have similar last-level (L3) cache sizes while the Sandy-Bridge architecture has twice as much cores. Here, we can observe better (up to a factor of 1.5) performance values for the Intel architecture based on the same problem size, indicating that currently the code is not memory-bound, i. e. limited by the memory interface. In comparison to the Blue Gene/Q system (JuQueen) with its four times larger, 16 way set-associative last-level (L2) cache and its four-times more cores the obtained results look very identical to those of SuperMUC. Even one would expect much better results for JuQueen, the two systems achieve a very similar performance which – again – underlines that the code does not suffer from a memory-bound limitation at the moment. Hence, any performance loss must come from the computational kernel that does not exploit intrinsic features such as SMT, SIMD, or vectorisation (pipelining) yet. A detailed analysis according to the roofline model is inevitable in order to perform code optimisations for a higher node-level performance.
Finally, Figure 4 shows the obtained strong speed-up values for SuperMUC and JuQueen on 65,536 and 139,008 processes. The different curves of the two systems for different problem sizes once more perfectly coincide and lead to a parallel efficiency of 64 % on depth 8. While the levelling-off of both systems was to be expected (being a well-known feature of such analyses), the identical performance of both systems gives rise to further considerations. On the one hand, we observed a good scalability of our code w. r. t. to the communication, i. e. to the ghost layer exchange, on the other hand, any performance benefit of the Blue Gene/Q cannot be utilised by the computational kernel in order to outperform the Sandy-Bridge architecture. In consequence the data structure has proven to suffice very well for massive parallel computations, whereas the computational kernel turns out to be currently a bottleneck hindering the exploitation of the underlying performance. Further experiments with adaptive grids – that do not follow a uniform refinement of the computational domain – revealed very similar results and, thus, led to the same conclusions. More information on those measurements can be found in .
Iv Application Examples
In order to show the versatility of the proposed approach, a few application examples are listed below.
Figure 6 shows a typical benchmark scenario for validating CFD codes. The DFG priority research program ‘Flow Simulation on High-Performance Computers’  introduces a collection of benchmarks. The basic setup consists of a regular rectangular channel in 2D or 3D with a cylindrical or rectangular obstacle in the inflow region of the channel. Here, the channel was computed using an adaptive grid setup. This example, based on the 2D-2 setup with , shows vortex shedding and was used in Frisch  as one of the validation test cases.
Figure 7 shows a flow through porous media. In Perović et al. , the presented code was used for computing the micro-scale of fluid flow through porous media while coupling it on a macro-scale to a Darcy-based flow solver. Hence, this shows that the code can work with complex geometries without a time consuming manual meshing process, as the geometry generation is done fully-automatically by a voxel-based approach, thus generating a representation directly usable by the code’s data structure.
Last but not least, the code is also able to compute thermally coupled scenarios by applying the Boussinesq approximation as mentioned in Section II. In this case, the flow is subjected to pure natural convection boundary conditions, meaning that no inflow or outflow conditions are set in the room and the flow is only driven by thermal buoyancy effects. Figure 8 shows a classroom setup, where the human occupants are coupled to a thermoregulation model imposing thermal boundary conditions onto the surfaces, and acting as driving forces for the natural convection scenario. Further information and result evaluation can be found in Frisch et al. .
Thus, the code is able to handle quite different physical scenarios on different scales while running on more than 100,000 cores.
In this paper, we have presented a massive parallel CFD code based on hierarchical data structures for the usage on modern petascale supercomputing systems. As this code should serve for various complex engineering-based application scenarios demanding for extensive computing power, scalability was an important design issue from the very beginning. Within several measurements performend on two of Germany’s national supercomputers (both among the first 20 places in the current top 500 list111http://www.top500.org as of June 2015) very good scalability characteristics up to 140,000 processes could be observed. Moreover, within those experiments it could be shown that any performance limitation of the computational kernel is not memory-bound and, thus, further performance improvement due to node-level optimisation is possible. Such optimisations comprise simultaneous multithreading (SMT) or streaming computations as well as the exploitation of intrinsic characteristics like the Advanced Vector Extensions (AVX) of Intel’s Sandy-Bridge architecture in order to increase throughput capabilities.
The authors gratefully acknowledge the computing time granted by the JARA-HPC Vergabegremium and provided on the JARA-HPC Partition part of the supercomputer JUQUEEN  at Forschungszentrum Jülich.
Furthermore, the authors would like to cordially thank Leibniz Supercomputing Centre (LRZ) in Garching for the computing time granted during the ‘Extreme Scaling Workshop’ in June 2014.
-  J. Frisch, R.-P. Mundani, E. Rank, and C. van Treeck, “Engineering-Based Thermal CFD Simulations on Massive Parallel Systems,” Computation, vol. 3, no. 2, pp. 235–261, 2015.
-  G. K. Batchelor, An Introduction to Fluid Dynamics, Cambridge University Press, 2000.
-  J. H. Ferziger and M. Perić, Computational Methods for Fluid Dynamics, Springer, 3rd rev. edition, 2002.
-  C. Hirsch, Numerical Computation of Internal and External Flows, Volume 1, Butterworth–Heinemann, 2nd edition, 2007.
-  J. H. Lienhard IV and J. H. Lienhard V, A Heat Transfer Textbook, Dover Civil and Mechanical Engineering Series. Dover Publications, 4th edition, 2011.
-  F. H. Harlow and J. E. Welch, “Numerical calculation of time-dependent viscous incompressible flow of fluid with free surface,” Physics of Fluids, vol. 8, no. 12, pp. 2182–2189, 1965.
-  A. J. Chorin, “Numerical solution of the Navier-Stokes equations,” Mathematics of Computation, vol. 22, no. 104, pp. 745–762, 1968.
-  H. R. Schwarz and N. Köckler, Numerische Mathematik, Vieweg + Teubner, 8th rev. edition, 2011.
-  J. H. Ferziger, Numerical Methods for Engineering Applications, Wiley-Interscience Publication. Wiley, 1998.
-  J. Kim and P. Moin, “Application of a fractional-step method to incompressible Navier-Stokes equations,” Journal of Computational Physics, vol. 59, no. 2, pp. 308–323, 1985.
-  H. Choi and P. Moin, “Effects of the computational time step on numerical solutions of turbulent flow,” Journal of Computational Physics, vol. 113, no. 1, pp. 1–4, July 1994.
-  R. Courant, K. Friedrichs, and H. Lewy, “Über die partiellen Differenzengleichungen der mathematischen Physik,” Mathematische Annalen, vol. 100, no. 1, pp. 32–74, 1928.
-  R. Peyret and T. D. Taylor, Computational Methods for Fluid Flow, Springer Series in Computational Physics. Springer-Verlag, 1983.
-  J. Frisch, Towards Massive Parallel Fluid Flow Simulations in Computational Engineering, Ph.D. thesis, Technische Universität München, Oct. 2014.
-  H. A. Schwarz, “Ueber einen Grenzübergang durch alternirendes Verfahren,” Vierteljahrsschrift der Naturforschenden Gesellschaft in Zürich, vol. 15, pp. 272–286, 1870.
-  J. Frisch, R.-P. Mundani, and E. Rank, “Communication schemes of a parallel fluid solver for multi-scale environmental simulations,” in Proc. of the 13th Int. Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC). 2011, pp. 391–397, IEEE Computer Society.
-  M. Bader, Space-Filling Curves – An Introduction with Applications in Scientific Computing, Texts in Computational Science and Engineering. Springer-Verlag, 2013.
-  A. Brandt, “Multi-Level Adaptive Solutions to Boundary-Value Problems,” Mathematics of Computation, vol. 31, no. 138, pp. 333–390, Apr. 1977.
-  S. Williams, A. Waterman, and D. Patterson, “Roofline: an insightful visual performance model for multicore architectures,” Communications of the ACM, vol. 52, no. 4, pp. 65–76, 2009.
-  M. Schäfer and S. Turek, “Benchmark computations of laminar flow around a cylinder,” in Flow Simulation with High-Performance Computers II. Jan. 1996, vol. 52 of Notes on Numerical Fluid Mechanics, pp. 547–566, Vieweg.
-  N. Perović, J. Frisch, R.-P. Mundani, and E. Rank, “Interactive Data Exploration for High-Performance Fluid Flow Computations Through Porous Media,” in Proc. of the 16th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), Timişoara, Romania, Sept. 22-25, 2014, pp. 463–470.
-  M. Stephan and J. Docter, “JUQUEEN: IBM Blue Gene/Q c⃝ Supercomputer System at the Jülich Supercomputing Centre,” Journal of large-scale research facilities JLSRF, vol. 1, 2015.