One of the main design issues related to launch vehicles lies on noise emission originated by the complex interaction between the high-temperature/high-velocity exhaustion gases and the atmospheric air. These emissions, which have high noise levels, can damage the launching structure or even be reflected upon the vehicle structure itself and the equipment onboard at the top of the vehicles. The resulting pressure fluctuations can damage the solid structure of different parts of the launcher or of the onboard scientific equipment by vibrational acoustic stress. Therefore, it is strongly recommended to consider the load resulted from acoustic sources over large launching vehicles during take-off and also during the transonic flight.
The authors are interested in studying unsteady property fields of compressible jet flow configurations in order to eventually understand the acoustic phenomena, which are important design constraints for rocket applications. Experimental techniques used to evaluate such flow configuration are complex and require considerably expensive apparatus. Therefore, the authors have developed a numerical tool, JAZzY (Junior16), based on the large eddy simulation (LES) formulation (Garnier09) in order to perform time-dependent simulations of compressible jet flows. JAZzY is a compressible LES parallel code for the calculation of supersonic jet flow configurations. The large eddy simulation approach has been successfully used by the scientific community and can provide high fidelity numerical data for aeroacoustic applications (Bodony05i8; Wolf_doc; lo2012; wolf2012; Junior18-abcm). The numerical tool is written in the Fortran 90 standards coupled with Message Passing Interface (MPI) features (Dongarra95). The HDF5 (folk11; folk99) and CGNS libraries (cgns2012; legensky02; Poirier00; Poirier98) are included into the numerical solver in order to implement a hierarchical data format (HDF) and to perform Input/Output operations efficiently.
Large eddy simulations require a significant amount of computational resources to provide high fidelity results. The scientific community has been using up to hundreds of million degrees of freedom on simulations of turbulent flow configurations cohen18; gloerfelt19; Junior18-abcm; sciacovelli17. Researchers and engineers need to be certain that calculations are run with maximum parallel efficiency when allocating computational resources because the access to supercomputers is often restricted and limited. Therefore, it is of major importance to perform scalability studies, regarding an optimal choice of computational load and resources, before running simulations on supercomputers.
The present work addresses the computational performance evaluation of the code when using an Hewlett Packward Enterprise (HPE) cluster from a computational center cepid. The high-performance computing (HPC) solution provides a maximum theoretical peak performance of 127.4 TFLOPS using CPUs, Nvidia GPUs and Xeon Phi accelerators. Simulations of a pre-defined perfectly expanded jet flow condition are performed using different loads and resources in order to study the strong scalability of the solver. More specifically, nine mesh configurations are investigated running on up to 400 processors in parallel. The number of degrees of freedom starts with 5.8 million points and scales to 1.0 billion points. The speedup and computational efficiency curves are measured for each grid configuration.
The supercomputer is described after the introduction section. Then, numerical formulations and the implementation aspects of the tool are discussed. In the sequence, the strong scalability results are presented to the reader followed by the discussion on the validation of the solver. In the end, the one can find the concluding remarks and acknowledgements.
2 Computational Resources
The current work is included in the CEPID-CeMEAI cepid project from the Applied Mathematics department of the University of São Paulo. This project is addressed to four main research subjects: optimization and operational research; computational intelligence and software engineering; computational fluid mechanics; and risk evaluation. The CEPID-CeMEAI provides access to a high-performance computer server located at the University of São Paulo, named Euler. The system presents a maximum theoretical peak performance of 127.4 TFLOPS using a hybrid parallel processing architecture which has 144 CPU nodes, 4 fat-nodes, 6 GPU nodes and 1 Xeon Phi node with a total of 3428 computational cores. The detailed description of each node configuration is presented in Tab. 1. Two storage systems are available with 175Tb each one, the network file system (NFS) and the Lustre® file system lustre. The network communication is performed using Infiniband and Gigabit Ethernet. Red Hat Enterprise Linux redhat is the operating system of the cluster and Altair PBS Pro pbs is the job scheduler available.
|104||2 x CPU Intel® Xeon® E5-2680v2||128GB DDR3||20|
|40||2 x CPU Intel® Xeon® E5-2680v4||128GB DDR3||28|
|4||2 x CPU Intel® Xeon® E5-2667v4||512GB DDR3||16|
|6||2 x CPU Intel® Xeon® E5-2650v4||128GB DDR3||24|
|+ 1 x GPU Nvidia® Tesla® P-100||16GB DDR3||3584|
|1||2 x CPU Intel® Xeon® E5-2680v2||128GB DDR3||20|
|+ 1 x Intel® Xeon Phi™ C2108-RP2||8GB DDR3||60|
3 Large Eddy Simulation Formulation
The numerical simulations of supersonic jet flow configurations are performed based on the large eddy simulation formulation Garnier09. This set of equations is based on the principle of scale separation over the governing equations used to represent the fluid dynamics, the Navier-Stokes formulation. A filtering procedure can be used to describe the scale separation in a mathematical formalism. The idea is to model the small turbulent structures and to calculate the bigger ones. Subgrid scale (SGS) closures are added to the filtered equations in order to model the effects of small scale turbulent structures. The Navier-Stokes equations are written in the current work using the filtering procedure of Vreman Vreman1995 as
in which and
are independent variables representing time and spatial coordinates of a Cartesian coordinate system,x
, respectively. The components of the velocity vector,u, are written as and . Density, pressure and total energy per unit volume are written as , , and , respectively. The and operators are used in order to represent filtered and Favre averaged properties, respectively.
It is important to remark that the System I filtering procedure Vreman1995 neglects the double correlation term, , which is present into the total energy per unit volume equation in order to write
The heat flux, , is given by
where and stand for static temperature and the thermal conductivity coefficient, respectively. The last can be expressed as
in which is the specific heat at constant pressure, is the dynamic viscosity coefficient and is the Prandtl number, which is equal to for air. The SGS thermal conductivity coefficient, , is written as
where is the SGS Prandtl number, which is equal to for static SGS closures and is the eddy viscosity coefficient that is calculated by the SGS model. In the present work, the dynamic viscosity coefficient is calculated using the Sutherland Law which is given by
One can use an equation of state to correlate density, static pressure and static temperature,
in which is the gas constant, written as
is the specific heat at constant volume. The shear-stress tensor,, is written as,
where the components of the rate-of-strain tensor, , are given by
The SGS stress tensor components can be written using the eddy viscosity coefficient Sagaut05,
In the present article, the eddy viscosity coefficient, , and the components of the isotropic part of the SGS stress tensor, , are not considered for the calculations. Previous validation results performed by the authors Junior16; Junior18-abcm indicate that the characteristics of numerical discretization of JAZzY can overcome the effects of subgrid scale models. The same conclusion can be found in the work of Li and Wang Li15. Therefore, one can write the large eddy simulation set of equations can be written in a more compact form as
where Q stands for the convervative properties vector, given by
and RHS, which stands for the right-hand side of equation, represents the contribution of inviscid and viscous fluxes terms from Eq. 1. The components of the right-hand side vector are written as
in which and , are the components of inviscid and viscous flux vectors, respectively given by
Spatial derivatives are calculated in a structured finite difference context and the formulation is re-written for the general curvilinear coordinate system Junior16. The numerical flux is computed through a central difference scheme with the explicit addition of anisotropic scalar artificial dissipation model of Turkel and Vatsa Turkel_Vatsa_1994. The time marching method is an explicit 5-stage Runge-Kutta scheme developed by Jameson et. al. Jameson81.
Boundary conditions for the LES formulation are imposed in order to represent a supersonic jet flow into a 3-D computational domain with cylindrical shape. Figure 1 presents a lateral view and a frontal view of the computational domain used in the present work and the positioning of the entrance, exit, centerline, far field, and periodic boundary conditions.
A flat-hat velocity profile is implemented at the entrance boundary condition through the use of the 1-D characteristic relations for the 3-D Euler equations in order to create a jet-like flow configuration. The set of properties, then, determined is computed from within and from outside the computational domain. Riemann invariants Long91 are used in order to calculate properties at the far field surfaces where the normal-to-face components of the velocity are computed by a zero-order extrapolation from inside the computational domain. The angle of flow at the far field boundary is assumed fixed. The remaining properties are obtained as a function of the jet Mach number, which is a known variable.
The flow configuration is assumed to be subsonic at the exit plane of the domain. Therefore, the pressure is obtained from the outside, i.e., it is assumed given, the internal energy and components of the velocity are calculated by a zero-order extrapolation from the interior of the domain. Then, density, , and total energy per unit volume, , are computed at the exit boundary using the extrapolated properties and the imposed pressure at the output plane. The first and last points in the azimuthal direction are superposed in order to close the 3-D computation domain and create a periodicity boundary condition. An adequate treatment of the centerline boundary is necessary since it is a singularity of the coordinate transformation. The conserved properties are extrapolated from the adjacent longitudinal plane and averaged in the azimuthal direction in order to define the updated properties at the centerline of the jet. Furthermore, the fourth-difference terms of the artificial dissipation scheme of Turkel and Vatsa Turkel_Vatsa_1994 are carefully treated in order to avoid five-point difference stencils at the centerline singularity.
The reader can find further details about the spatial discretization, time marching scheme and implementation of boundary conditions in the work of Junqueira-Junior Junior16 and Junqueira-Junior et. al. Junior18-abcm which present the validation of the large eddy simulation solver.
4 Parallel Implementation Aspects
The solver is developed to calculate the LES set of equations, Eq. 12, for supersonic jet flow configurations using the Fortran 90 standard. The spatial discretization of the formulation is based on a centered finite-difference approach with the explicit addition of anisotropic scalar artificial dissipation model of Turkel and Vatsa Turkel_Vatsa_1994 and the time integration is performed using an explicit 2nd-order 5-stage Runge-Kutta scheme Jameson81.
Parallelism is achieved using the single program multiple data, SPMD, approach Darema01 and the exchange of messages provided by MPI protocols Dongarra95. The algorithm of the LES solver is structured in two main steps. Firstly, a pre-processing routine reads a mesh file and performs a balanced partitioning of the domain procedure. Then, in the processing routine, each MPI rank reads its correspondent grid file and starts the calculations.
The pre-processing routine is run separately from the processing step. It reads an input file with the partitioning configuration and a 2-D grid file. Next, the pre-processing code calculates the number of points in the axial and azimuthal directions in order to perform the partitioning and the extrusion in the 3rd direction for each sub-domain. The segmentation of the grid points is illustrated in Figure 2(a). A matrix index notation is used in order to represent positions of each partition where NPX and NPZ denote the number of partitions in the axial and azimuthal directions, respectively.
In the case of a non-exact domain division, the remaining points are spread among the partitions in order to have a well-balanced task distribution. Algorithm 1 presents the details of this division and Fig. 2(b) illustrates the balancing procedure in one direction, where LocNbPt, TotNbPt, NbPart, and PartIndex stand for local number of points in one direction, total number of points in one direction, number of partitions in one direction, and the index of the partitions in one direction. The same algorithm is used to perform the partitioning procedure in both axial and azimuthal directions. This preprocessing part of code is executed sequentially.
The mesh files, for each domain, are written after the optimized partitioning procedure using the CFD General Notation System (CGNS) standard (legensky02; Poirier98; Poirier00; cgns2012). This standard is based on the HDF5 (folk99; folk11) libraries which provide tools for hierarchical data format (HDF) and can perform Input/Output operations efficiently. The authors have chosen to write one CGNS grid file for each partition in order to have each MPI rank performing I/O operations independently, i.e. in parallel, during the processing step of the calculations. Moreover, each MPI rank can also write its own time-dependent solution to a local CGNS file. Such an approach avoids synchronizations during check-points, which can be a significant drawback in HPC applications.
After the pre-processing routine, the solver can start the simulation. A brief overview of the computing part of the LES code is presented in Alg. 2.
Primarily, every MPI process reads the same ASCII file with input data such as flow configurations and simulation settings, as indicated in line 2 of Alg. 2. In the sequence, lines 3 and 4 of the same algorithm, each rank reads a local-domain CGNS file, calculates Jacobian and metric terms, which are used for the general curvilinear coordinates transformation. Ghost points are added to the boundaries of local mesh at the axial and azimuthal directions in order to carry information of neighbor partition points, line 5 in Alg. 2. The artificial dissipation scheme of Turkel and Vatsa Turkel_Vatsa_1994 implemented in the code jameson_mavriplis_86 uses a five points stencil which requires information of the two neighbors of a given mesh point. Hence, two-layer ghost points are created at the beginning and at the end of each partition. Figure 3 presents the layer of ghost points used in the present code. The yellow and black layers represent the axial and azimuthal ghost points respectively. The green region represents the local partition grid points.
The initial conditions of the flow configurations are imposed following the sequence of tasks of the processing algorithm. They are calculated using data from a previous checkpoint or, from the input data depending on if it is a restart simulation or not, as presented in lines 5 to 9 of Alg. 2. An asynchronous MPI communication of metric terms, Jacobian terms, and conservative properties, set as initial conditions, is performed between partitions which share neighbor surfaces in order to update all ghost points before starting the time integration.
The core of the code consists of the while loop indicated in line 11 of Alg. 2. This specific part of the code is evaluated in the scalability study presented in Sec. 5, Computational Performance Study. This loop performs the Runge-Kutta time integration iteratively, for all grid points of the computational domain, until reaching the requested number of iterations. A succession of computing subroutines is performed at each call for the time marching scheme. It starts with synchronization of MPI ranks in order to avoid race conditions followed by the calculation of the inviscid flux vectors, artificial dissipation terms, viscous flux vector and right-hand side vector, chronologically. The boundary condition and the dynamic viscosity coefficient are updated at the end of each time-marching step followed by a nonblocking communication of the conservative properties at the partition boundaries. If necessary, a time-dependent solution is incremented to the CGNS output le at the end of the main loop iteration.
5 Computational Performance Study
The numerical discretization approach used in the present article requires very refined grids in order to reproduce high fidelity results for the simulation of supersonic jet flows configurations. Therefore, parallel computing with efficient inter-partition data exchanges is mandatory. The parallel efficiency of the code is measured using different computational loads and the results are presented in the present section. The calculations performed in the current article are run using the 104 nodes of the Euler supercomputer with the Intel® Xeon® E5-2680v2 architecture and 128GB DDR3 rapid access memory.
The Intel® Composer XE compiler, version 126.96.36.199, is used in the present work. A set of compiling flags which have been tested in previous work Junior16 are used in the present paper:
|–O3 –xHost –ipo –no-prec-div –assume-buffered –override-limits|
where O3 enables aggressive optimization such as global code scheduling, software pipelining, predication and speculation, prefetching, scalar replacement and loop transformations; xHost tells the compiler to generate instructions for the highest instruction set available on the compilation host processor; ipo uses automatic, multi-step process that allows the compiler to analyze the code and determine where you can benefit from specific optimizations; no-prec-div enables optimizations that give slightly less precise results than full division; assume-buffered_io tells the compiler to accumulate records in a buffer; and override-limits deals with very large, complex functions and loops.
5.1 Scalability setup
Simulations of an unheated perfectly expanded jet flow are performed using different grid sizes and number of processors in order to study the strong scalability of JAZzY. The jet entrance Mach number is . The pressure ratio, , and the temperature ratio, , between the jet entrance and the ambient freestream conditions, are equal to one, i.e., and where the subscript identifies the properties at the jet entrance and the subscript stands for properties at the farfield region. The Reynolds number of the jet is , based on the jet entrance diameter, D. The time increment, , used for the validation study is dimensionless time units. A stagnated flow is used as the initial condition for the simulations.
The same geometry is used for the computational evaluation, where, the 2-D surface of this computational domain, as presented in Fig. 1, is 30 dimensionless length and 10 dimensionless height. Figure 4 illustrates a 2-D cut of the geometry coloured by velocity contours.
The present work uses nine different mesh configurations whose total number of points doubles every time. Table 2 presents the details of each grid design where the first column presents the name of the mesh while the second and third columns present the number of points in the axial and radial directions, respectively. The last column indicates the total number of points of the mesh. The grid point distribution in the azimuthal direction is fixed at 361. The smallest grid, named as Mesh A, has approximately 5.9 million points while the biggest grid presents approximately 1.0 billion points.
|Mesh||No. Pt. Axial Dir.||No. Pt. Radial Dir.||No. Pt.|
The solver is able to have different partitioning configurations for a fixed number of sub-domains since the division of the mesh is performed in the axial and azimuthal directions. Therefore, different partitioning configurations are evaluated for a given number of processors. Each simulation performs 1000 iterations or 24 hours of computation and the average of the CPU time per iteration of the while loop indicated in Alg. 2 is measured in order to calculate the speedups of the solver at the Euler supercomputer. The partitioning configurations which provides the fastest calculation is used to evaluate the performance of the solver. Table 3 presents the number of partitions in the azimuthal direction for each number of processors used to study the scalability of the solver. The first column stands for the number of computational cores while the second column represents the number of zones in the azimuthal direction used to evaluate the effects of the partitioning on the computation. Calculations performed in the present study are run using one single core up to 400 MPI processes.
|Nb. Proc.||Nb. of Part. in the Azimuthal Dir.|
A sequential computation is the ideal starting point, , for a strong scalability study. However, frequently, the computational problem cannot be allocated in one single cluster node due to hardware limitations of the system. Therefore, it is necessary to shift the starting point to a minimum number of resources in which the code can be run. The LES solver has shown to be capable to allocate the five smallest grids, from mesh A to mesh E, in one single node of the Euler computer. Aforementioned indicates that it is possible to run a simulation with 94.6 million grid points using 128GB of RAM. The starting points of mesh F and mesh G are 40 cores, allocated in two nodes, and 80 cores, allocated in four nodes, respectively. Meshes H and I start the strong scalability study using 200 cores in 10 nodes of the Euler computer. Table 4 presents the minimum number of computational cores used by each mesh configuration for the strong scaling study.
5.2 Strong scalability study
The speedup, , is used in the present work to measure the strong scaling of the solver and compare it with the ideal theoretical case. There are different approaches are used by the scientific community to calculate the speedup (xian10; gustafson88) which is written in the current article as
in which stands for the time spent by mesh to perform one thousand iterations, represents the number of computational cores and is the starting point of the scalability study. The strong scaling efficiency of a given mesh configuration, as a function of the number of processors, is written considering the law of Amdahl amdahl67 as
More than 300 calculations are performed when considering all the partitioning configurations and different meshes evaluated in the present paper. The averaged time per iteration is calculated for all numerical simulations in order to study the scalability of the solver. The evolution of speedup and efficiency, as a function of the number of processors, for the nine grids used in the current work are presented in Figs. 5 and 6. The investigation indicates a good scalability of the code. Meshes with more than 50 million points present efficiency bigger than 75% when running with 400 computing cores in parallel. Moreover, mesh E, which has 95 million degrees of freedom, presented an efficiency which equivalent to the ideal case, 100%, when using the maximum number of resources evaluated. One can notice a superlinear scalability for the cases evaluated in the present article. This behavior can be explained by the fact that cache memory can be accessed more efficiently when increasing the total number of processors for a given grid configuration since more computational resources for the same load means less cache miss (Ristov16; Gusev14).
Increasing the size of a computational problem can generate a better scalability study. The time spent with computation becomes more significant when compared to the time spent with communication with the growth of a problem. One can notice such effect for meshes A, B, C, D and E. The speedup and the efficiency increase with the growth of the mesh size. However, such scalability improvement does not happen from mesh E to meshes F, G, H and I. This behavior is originated because the reference used to calculate speedup and efficiency is not the same for all grid configurations. The studies performed using meshes F, G, H and I does not use the serial computation as a reference, which is not the case of calculations performed using mesh A, B, C, D and E.
6 Compressible Jet Flow Simulation
This section presents a compilation of results achieved from the simulation of a supersonic jet flow configuration. This calculation was performed in order to validate the LES code, and it is included here simply to demonstrate that the numerical tool is indeed capable of presenting physically sound results for the problem of interest. Results are compared to numerical (Mendez10; Mendez12) and to experimental data (bridges2008turbulence). The details of this particular simulation are published in the work of Junqueira et. al.Junior18-abcm.
A geometry is created using a divergent shape whose axis length is 40 times the jet entrance diameter, . Figure 7 illustrates a 2-D cut of the geometry and the grid point distribution used on the validation of the solver. The mesh presents approximately 50 million points. The calculation is performed using 500 computational cores.
An unheated perfectly expanded jet flow is studied for the present validation. The jet entrance Mach number is . The pressure ratio, , and the temperature ratio, , between the jet entrance and the ambient freestream conditions, are equal to one, i.e., and . The subscript identifies the properties at the jet entrance and the subscript stands for properties at the farfield region. The Reynolds number of the jet is , based on the jet entrance diameter, D. The time increment, , used for the validation study is dimensionless time units.
The boundary conditions previously presented in the Large Eddy Simulation Formulation section, are applied in the current simulation. The stagnation state of the flow is set as an initial condition of the computation. The calculation runs a predetermined period of time until reaching the statistically steady flow condition. This first pre-simulation is important to assure that the jet flow is fully developed and turbulent. Computations are restarted and run for another period after achieving the statistically stationary state flow. Hence, data are extracted and recorded in a predetermined frequency. Figure 8 indicates the positioning of the two surfaces, (A) and (B), where data are extracted and averaged through time. Cuts (A) and (B) are radial profiles at and units downstream of the jet entrance. Flow quantities are also averaged in the azimuthal direction when the radial profiles are calculated.
Figures 8(a) and 8(b) present distributions of time averaged axial component of velocity and root mean square values of the fluctuating part of the axial component of velocity, which are represented in the present work as and , respectively. The solid black line indicated in Fig. 8(a) represents the potential core of the jet, which is defined as the region where the time averaged axial velocity component is at least of the velocity of the jet at the inlet.
Dimensionless profiles of and at the cuts along the mainstream direction of the computational domain are compared with numerical and experimental results in Figs. 9 and 10, respectively. The solid line stands for results achieved using the JAZzY code while square and triangular symbols represent numerical (Mendez10; Mendez12) and experimental (bridges2008turbulence) data, respectively.
The averaged profiles obtained in the present work correlate well with the reference data at the two positions compared here. It is important to remark that the LES tool can provide good predictions of supersonic jet flow configurations when using a sufficiently fine grid point distribution. Therefore, efficient massive parallel computing is mandatory in order to achieve good results.
Figures 11 and 12 present a lateral view of an instantaneous visualization of the pressure contours, in greyscale, superimposed by 3-D velocity magnitude countours and vorticity magnitude contours respectively, in color, calculated by the LES tool discussed in the present paper. A detailed visualization of the region indicated in yellow, at the jet entrance, is shown in Fig. 12. The resolution of flow features obtained from the jet simulation is more evident in this detailed plot of the jet entrance. One can clearly notice the compression waves generated at the shear layer, and their reflections at the jet axis. Such resolution is important to observe details and behavior of such flow configuration in order to understand the acoustic phenomena which is presnt in supersonic jet flow configurations.
7 Concluding Remarks
The current work is concerned with the performance of a computational fluid dynamics tool for aeroacoustics applications when using a national supercomputer. The HPC system, Euler, from the University of São Paulo presents more than 3000 computational cores and a maximum theoretical peak of 127.4 TFLOPS. The numerical solver is developed by the authors to study supersonic jet flows. Simulations of such flow configurations are expensive and need efficient parallel computing. Therefore, strong scalability studies of the solver are performed on the Euler supercomputer in order to evaluate if the numerical tool is capable of efficiently using computational resources in parallel.
The computational fluid dynamics solver is developed using the large eddy simulation formulation for perfectly expanded supersonic jet flow. The equations are written using a finite-difference centered spatial discretization with the addition of artificial dissipation. The time integration is performed using a five-steps Runge-Kutta scheme. Parallel computing is achieved through non-blocking message passing interface protocols and inter-partition data are allocated using ghost points. Each MPI partition reads and writes its own portion of the mesh that is created on pre-processing routine.
A geometry and a flow condition are defined for the scalability study performed in the present work. Nine point distributions and different partitioning configurations are used in order to evaluate the parallel code under different workloads. The size of the grid configurations start with 5.9 million points and rise up to approximately 1.0 billion points. Calculations perform 1000 iterations or 24 hours of computation using up to 400 cores in parallel. The CPU time per iteration is averaged when the simulation is finished in order to calculate the speedup and scaling efficiency. More than 300 simulations are performed for the scalability study when considering different workloads and partitioning configurations.
The code presented a good scalability for the calculations run in the current paper. The averaged CPU time per iteration decays with the increase in the number of processors in parallel for all computation performed by the large eddy simulation solver evaluated in the present work. Meshes with more than 50 million points indicated an efficiency greater than 75%. The problem with approximately 100 million points presented speedup of 400 and efficiency of 100% when running on 400 computational cores in parallel. Such performance is equivalent to theoretical behavior in parallel. It is important to remark the ability of the parallel solver to treat very dense meshes as the one tested in the present paper with approximately 1.0 billion points. Large eddy simulation demands very refined grids in order to have a good representation of the physical problem of interest. Therefore, it is important to perform simulations of such configuration with a good computation efficiency and the present scalability study article can all be seen as a guide for future simulations using the same numerical tool on the Euler supercomputer.
The authors gratefully acknowledge the partial support for this research provided by Conselho Nacional de Desenvolvimento Científico e Tecnológico, CNPq, under the Research Grant nos. 309985/2013-7, 400844/2014-1 and 443839/2014-0. The authors are also indebted to the partial financial support received from Fundação de Amparo à Pesquisa do Estado de São Paulo, FAPESP, under the Research Grant nos. 2013/07375-0 and 2013/21535-0.
Conflict of Interest
The authors declare that they have no conflict of interest.