Performance assessment of CUDA and OpenACC in large scale combustion simulations

GPUs have climbed up to the top of supercomputer systems making life harder to many legacy scientific codes. Nowadays, many recipes are being used in such code's portability, without any clarity of which is the best option. We present a comparative analysis of the two most common approaches, CUDA and OpenACC, into the multi-physics CFD code Alya. Our focus is the combustion problems which are one of the most computing demanding CFD simulations. The most computing-intensive parts of the code were analyzed in detail. New data structures for the matrix assembly step have been created to facilitate a SIMD execution that benefits vectorization in the CPU and stream processing in the GPU. As a result, the CPU code has improved its performance by up to 25 GPU execution, CUDA has proven to be up to 2 times faster than OpenACC for the assembly of the matrix. On the contrary, similar performance has been obtained in the kernels related to vector operations used in the linear solver, where there is minimal memory reuse.



page 1

page 2

page 3

page 4


Simple and efficient GPU parallelization of existing H-Matrix accelerated BEM code

In this paper, we demonstrate how GPU-accelerated BEM routines can be us...

Heterogeneous CPU/GPU co-execution of CFD simulations on the POWER9 architecture: Application to airplane aerodynamics

High fidelity Computational Fluid Dynamics simulations are generally ass...

Metrics and Design of an Instruction Roofline Model for AMD GPUs

Due to the recent announcement of the Frontier supercomputer, many scien...

Performance portable ice-sheet modeling with MALI

High resolution simulations of polar ice-sheets play a crucial role in t...

P-Cloth: Interactive Complex Cloth Simulation on Multi-GPU Systems using Dynamic Matrix Assembly and Pipelined Implicit Integrators

We present a novel parallel algorithm for cloth simulation that exploits...

GPU-FV: Realtime Fisher Vector and Its Applications in Video Monitoring

Fisher vector has been widely used in many multimedia retrieval and visu...

Refactoring the MPS/University of Chicago Radiative MHD(MURaM) Model for GPU/CPU Performance Portability Using OpenACC Directives

The MURaM (Max Planck University of Chicago Radiative MHD) code is a sol...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

A consolidated trend in the pre-exascale systems is the incorporation of heterogeneous CPU/GPU supercomputers. One of the main challenges of the scientific community is to adapt their legacy codes to unlock the potential of the modern HPC systems [1]. Different programming tools with different levels of complexity have been developed to adopt the GPU computing paradigm. The GPUs require exposing massive parallelism in the algorithms for attaining their maximum performance. This task is far from trivial and requires data restructuring at different levels, according to the computing paradigm utilized. Nowadays, the most common manner to use GPUs is through CUDA or OpenACC. CUDA [2] is a programming language specifically designed to utilize the NVIDIA GPUs. Its utilization requires more detailed knowledge of the device architecture and a more extensive code refactoring. On the other hand, OpenACC [3] is a directive-based parallel programming model developed to adapt the codes to GPU computing with minimum intervention. The latter has been widely used during the last years in the portability of real scale scientific applications due to its easiness and straightforward approach. One of the early attempts of using heterogeneous systems for simple CFD simulations was focused on 3D Finite Differences using CUDA for structures meshes and high order schemes [4]. Other examples [6, 5], of Navier-Stokes simulations on accelerators can be found in the literature. However, they were focused on structured meshes in single GPU implementations. Within the class of CFD algorithms on unstructured meshes, some successful examples[7, 8, 9], based the implementation in an MPI+CUDA execution model, obtaining up to 8 times of acceleration. In the context of reactive flows, the developments have been oriented in the portability of chemical kinetics simulations on GPUs [10, 11], using high order methods. Recently, efforts of porting airplane aerodynamics [12] and coastal flow simulations [13] using MPI+OpenACC have validated this alternative of computation for large-scale problems. Nevertheless, in each of these examples, the implementation aimed at porting only one physical model without comparing the computational paradigms for using GPUs. Our application context is the simulation of real-life combustion systems due to their enormous computing demands. A framework capable of combining advanced turbulence, spray, and combustion models are needed in such simulations. Furthermore, the use of complex geometries and unstructured meshes is essential to work in realistic conditions. Still, it increases substantially the complexity of the discretization strategies and the computational resources to be utilized. This work attempts to show a performance analysis using CUDA and OpenACC tuned versions of a multi-physics CFD code, Alya. Moreover, general guidelines and common drawbacks when working with legacy codes are going to be presented. Alya is one of the twelve simulation codes of the Unified European Applications Benchmark Suite (UEABS) and thus complies with the highest standards in HPC. The main kernels of the CFD simulation are studied in detail. The numerical tests have been performed in the CTE-Power9 [17] cluster at the Barcelona Supercomputing Center. The rest of the paper is organized as follows. In the next section, we present the CFD simulations’ application context that permits the identification of the most time-consuming parts of the code. In Section 3 we present an overview of the hardware architecture and software components of nodes utilized in this work. Section 4 describes the general considerations for porting our CFD code to GPU execution. Afterward, in Section 5, the impact of the vectorization and the performance of both GPU models are presented. Finally, we summarize our contributions in Section 6.

2 Application Context: combustion simulations

Figure 1: Left. Experimental configuration. Right. Computational domain.

The spatially filtered Navier-Stokes equations are considered to model incompressible turbulent flows. In short, the problem statement is: find the filtered fluid velocity and modified pressure in a domain and during a given time interval such that: [l] ρ∂t + ρ[ 2u(u) +(∇⋅u)u -12 ∇—u2 ]
            - ∇⋅[ 2 μ() ] + ∇p + ∇⋅= ,
      ∇⋅= 0, Where and are the density and viscosity of the fluid, respectively.

A finite element formulation is used to discretize the computing domain. Note that the mesh used for such a purpose is unstructured and composed of three different kinds of elements (tetrahedra, prisms, and pyramids). We use a fractional step method and a Runge-Kutta scheme of third-order as a time discretization scheme as shown in Algorithm 1. Details on the numerical and time integration schemes are provided in  [12].

1:Element assembly the Laplacian matrix.
2:for Time steps do
3:     for Runge-Kutta steps do
4:         Element assembly: assembly of momentum equations.
5:         Boundary assembly: assembly of wall model.
6:         Momentum update.
7:         Velocity correction.
8:     end for
9:     Algebraic solver: solution of pressure equation with the preconditioned CG.
10:end for
Algorithm 1 Main steps of the fractional step method.

The combustion process is assumed to take place in the flamelet regime. The description of the reacting flow in this regime is based on the assumption that chemical time scales are much faster than turbulent time scales, and hence, the thermochemical properties of the flame are described from a precomputed flamelet database. The details on the numerical models used for the Flamelet model are detailed in[14]. In short, two extra sets of equations, the heat transfer and the concentration of species, need to be solved each time integration step. Such equations are solved using a Runge-Kutta scheme of third-order as in Algorithm 1, but without needing an algebraic solver. The primary operations within the full algorithm are the matrix assembly, boundary assembly, and the algebraic solver. The application context corresponds to a swirl-stabilized technically premixed burner depicted in Fig 1. Further information of the experiment is given in [15, 16]. In this kind of problem, the most time-consuming operations are the matrix assembly and the algebraic solver. Table 1 shows the relative weight of the solver’s primary operations for a burner simulation. The chemical model considers two concentration variables, and the algebraic solver uses an iterative preconditioned conjugate gradient. The two main operations sum up to of the total execution time and, therefore, become our study’s focus.

Numerical Equations
Operation Navier-Stokes Heat Chemics Total
Matrix Assembly
Boundary Assembly
Algebraic Solver
Table 1: Relative weight of the main operations within the time integration step

3 CTE-Power9 system

Figure 2: Power 9 cluster configuration.

Following the current trend in HPC node design, the POWER9 nodes are high throughput nodes (see Fig. 2. Each compute node has 2 x POWER9 8335 @3.0 GHz with 20 physical cores each. With 4 SMT threads per core, each node can run up to 160 CPU threads. The 512 GB of main memory is distributed in 16 DIMMS of 32 GB, each operating at 2666MHz. The 4 Volta V100 accelerators have a 7.8 TFLOPS double-precision peak performance giving each one a total of 30.8 TFLOPS. The POWER9 nodes include high throughput NVLINK 2.0 connectors. Each GPU has 6 NVLINK connectors, which are attached to the neighboring GPU as well as CPU, giving an aggregate bandwidth of 150 GB/s for GPU to GPU as well as GPU to CPU communication. Details on the node architecture can be found in  [17].

4 Code Portability

4.1 CUDA vs OpenACC

CUDA is a parallel computing programming model developed by NVIDIA to utilize its GPUs [2]. It provides a set of new instructions used to control the GPU execution and the memory transfers between CPU and GPU. This new instruction set provides more flexibility to the programmer to write GPU code. However, this extra control means more programming complexity. For this reason, the CUDA programmers require to fully understand each detail of hardware and software.

OpenACC offers a different approach to manage GPU computing [3]. Its programming model is based onon compiler directives’ utilization to accelerate sections of the code where repetition constructs, such as loops, are found. Only a few compiler directives need to be added to the code, and ideally, without introducing major changes in the data structures. This simplicity comes at expenses on the detailed control flow of the code.

4.2 Domain discretization

A 3D unstructured mesh is utilized in the discretization of the combustion problems. Commonly, this mesh is composed of different geometrical elements (tetrahedrons, pyramids, and hexahedrons) in which the equations are discretized. An example of a 2D mesh with different kinds of elements is shown in Fig. 3.

Figure 3: Example of an 2D unstructured mesh with two different geometries.

The matrix assembly and algebraic solver operations are written as for loops that sweep this computational domain. In the matrix assembly, the outer loop moves through the domain’s cells (elements), while the inner loop performs the integration on the element. The inner loop is shown in Listing 1. The element assembly depends on two variables the number of gauss points (ngauss) and the number of integration nodes (nnodes). The elemental matrix (Ae) is calculated using the shape functions (N) and the element Jacobian matrix (Jac).

do ig = 1,ngauss
  do jn = 1,nnodes
    do in = 1,nnnodes
      Ae(in,jn) = Ae(in,jn) + Jac(ig) * N(in,ig) * N(jn,ig)
    end do
  end do
end do
Listing 1: Matrix element assembly

In the algebraic solver, the vector operations only perform the outer loop through the domain’s cells. The inner loop only exists in the sparse matrix-vector product (SpMV) sparsity pattern depending on the cell’s faces. The sparse matrix is stored in the traditional CSR format. Note that all the linear solver operations can be linked with highly optimized libraries like cuBLAS and cuSPARSE.

Different types of elements provoke irregular data paths due to the inner loop’s different lengths in the matrix assembly and SpMV.

4.3 Vectorization

Nowadays, all CPUs have integrated vector registers that allow performing arithmetic operations in a Single Instruction Multiple Data (SIMD) execution-model. On the other hand, GPUs utilize stream multiprocessors (SM) that execute threads concurrently in an execution model similar to SIMD. Context switching can maintain thousands of threads active and hide the memory latency. Keeping the threads active is known as occupancy, and it is one of the main factors to attain maximum performance in GPU computing. For this purpose, the kernel workload has to be divided into threads with low register and shared memory requirements. Additionally, The creation of new data structures requires special attention. The CPUs’ SIMD model and the stream multiprocessors need coalesced and aligned memory accesses for optimal performance. Random memory accesses serialize the execution of the memory fetch and decrease the overall performance. A data structure that benefits CPU and GPU execution has been developed. First, a renumbering is introduced with the objective of grouping the elements of the same geometrical type. Within each subgroup, the elements are packed in sets of VECTOR_SIZE elements. If the number of elements in the subgroup is not multiple of VECTOR_SIZE

, then zeros are padded at the end of the pack to maintain regularity. Figure 

4 shows the element organization for a mesh with 25 elements TRI03, 18 elements QUAD04 and a VECTOR_SIZE=4.

Figure 4: New data structure for a 2D mesh

The element matrices Ae and Jac are transformed in tensors by adding an extra dimension to the arrays. This new dimension represents the position of the element within the pack. The vectorization consists of performing the matrix assembly for the pack’s elements at once, as shown in Listing 

2. Fortran follows a column-major order for storing the arrays. Therefore, the element matrices data is already in a proper order to perform SIMD or SIMT operations.

do ig = 1,ngauss
  do jn = 1,nnodes
    do in = 1,nnnodes
      Ae(1:4,in,jn) = Ae(1:4,in,jn) + &
      Jac(1:4,ig) * N(in,ig) * N(jn,ig)
    end do
  end do
end do
Listing 2: Matrix element assembly VECTOR_SIZE=4

5 Numerical Results

The numerical results have been performed in one node of the Power9 facilities detailed in section 3. The two most time-consuming parts of the Algorithm 1, the solution of the algebraic solver and the matrix assembly, have been studied using the mesh for the simulation of the burner.

5.1 Vectorization

The benefits of vectorization are evident in the matrix assembly. This kernel sums up to of the execution time as shown in Table 1. Our burner mesh consists of 64 million cells with three different elements: tetrahedrons, pyramids, and hexahedrons. In this case, the vectorization improves the matrix assembly’s performance in a average. The algebraic solver does not benefit from the new data structure since its algebraic operations were already vectorizing correctly.

5.2 CUDA vs OpenACC

Our results show two different behavior in both sections of the code. First, the algebraic solver is composed of vector-based operations and the sparse matrix-vector multiplication (SpMV). The vector operations of the algorithm (AXPY and dot product) do not have memory reuse, and therefore its performance can be easily mimicked by OpenACC. The SpMV, with its indirect memory accesses, is a more complex kernel to replicate by OpenACC, thus behaving slightly worse than CUDA( of slowdown). The results can be seen in Figure 5.

Figure 5: Time in microseconds of the main operations in the algebraic solver.

On the other hand, the matrix assembly can be seen as a more complex version of the SpMV, and therefore, the OpenACC performance decreases, reaching up to times slower than the CUDA version as seen in Figure 6. Note that the performance is related to the type of element used in the discretization. The number of nodes used in the discretization determines the memory pattern and footprint. For the same version of the CUDA kernel, the performance decreases as the number of nodes are higher. More complex CUDA kernels need to be developed specifically for the type of element, losing in the generality and reusability of the code to maintain the performance.

Figure 6: Time in microseconds of the matrix assembly for different type of elements.

6 Conclusions

This work has shown the benefits of using new data structures that exploit the SIMD model execution in CPUs and GPUs. In the CPUs, the code can improve its performance on average. In GPUs, the new data structures expose the data parallelism and facilitate OpenACC or CUDA kernels’ implementation. Specific CUDA kernels can outperform up to times the OpenACC implementation in the matrix assembly operation. However, this improvement comes at the expense of many programming hours. For the solver’s algebraic operations, OpenACC seems like the best option since a similar performance can be achieved but without increasing the complexity of the code.


This work is partially supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement number: 846139 (Exa-FireFlows). It is also partially supported by the BSC-IBM Deep Learning Research Agreement, under JSA “Application porting, analysis and optimization for POWER and POWER AI”. It has also received funding by the EXCELLERAT project funded by the European Commission’s ICT activity of the H2020 Programme under grant agreement number: 823691. This paper expresses the opinions of the authors and not necessarily those of the European Commission. The European Commission is not liable for any use that may be made of the information contained in this paper.


  • [1] J. Dongarra, P. Beckman, T. Moore, The international exascale software project roadmap, Int. J. High. Perform. Comput. Appl. 25 (2011)
  • [2] Kirk, David & Hwu, Wen-mei. (2017). Programming Massively Parallel Processors: A Hands-On Approach. 3er edition.
  • [3] Rob Farber. 2016. Parallel Programming with OpenACC (1st. ed.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
  • [4] Micikevicius P., 3D Finite Difference Computation on GPUs using CUDA, in: Proceeding of 2nd Workshop on General Purpose Processing on Graphics Processing Units,2009.
  • [5] Alfonsi A, Ciliberti S., Mancini M, Primavera L. Performances of Navier-Stokes Solver on a Hybrid CPU/GPU Computing System. Parallel Computing Technologies. Lecture Notes in Computer Science Volume 6873, 2011, pp 404-416 (3D).
  • [6] Elsen, E., LeGresley, P., Darve, E. Large calculation of the ow over a hypersonic vehicle using a GPU (2008) Journal of Computational Physics, 227 (24), pp. 10148-10161 (2D ).
  • [7] Dana A. Jacobsen, Inanc Senocak, Multi-level parallelism for incompressible low computations on GPU clusters, Parallel Computing, Volume 39, Issue 1, 2013, Pages 1-20.
  • [8] G. Oyarzun, R. Borrell, A. Gorobets, O. Lehmkuhl and A. Oliva. “Direct Numerical Simulation of Incompressible flows on unstructured meshes using hybrid CPU/GPU supercomputers”. In Procedia Engineering, Volume 61, Pages 87-93, 2013.
  • [9] Alvarez Farre, Xavier & Gorobets, Andrey & Trias, F.Xavier & Borrell, R. & Oyarzun, Guillermo. (2018). —A fully-portable, algebra-based framework for heterogeneous computing. Application to CFD. Computers & Fluids. 173. 10.1016/j.compfluid.2018.01.034.
  • [10] E Niemeyer, Kyle & Sung, Chih-Jen. (2014). Accelerating moderately stiff chemical kinetics in reactive-flow simulations using GPUs. Journal of Computational Physics. 256. 854-871.
  • [11]

    Yu Shi, William H. Green, Hsi-Wu Wong, Oluwayemisi O. Oluwole, Redesigning combustion modeling algorithms for the Graphics Processing Unit (GPU): Chemical kinetic rate evaluation and ordinary differential equation integration, Combustion and Flame, Volume 158, Issue 5, 2011, Pages 836-847.
  • [12] R. Borrell, D. Dosimont, M. Garcia-Gasulla, G. Houzeaux, O. Lehmkuhl, V. Mehta, H. Owen, M. Vázquez, G. Oyarzun, Heterogeneous CPU/GPU co-execution of CFD simulations on the POWER9 architecture: Application to airplane aerodynamics, Future Generation Computer Systems, Volume 107, 2020, Pages 31-48.
  • [13] G.Oyarzun, I.A. Chalmoukis, G.A. Leftheriotis, A.A. Dimas, A GPU-based algorithm for efficient LES of high Reynolds number flows in heterogeneous CPU/GPU supercomputers, Applied Mathematical Modelling, Volume 85, 2020, Pages 141-156, ISSN 0307-904X,
  • [14] Mira, D., Lehmkuhl, O., Both, A. et al. Numerical Characterization of a Premixed Hydrogen Flame Under Conditions Close to Flashback. Flow Turbulence Combust 104, 479–507 (2020).
  • [15] Tanneberger, T., Reichel, T. G., Krüger, O., Terhaar, S., Paschereit, C.O.: Numerical investigation of the flow field and mixing in a swirl-stabilized burner with a non-swirling axial jet. In: ASME Turbo Expo 2015, pp. GT2015-43382, (2015)
  • [16] Reichel, T.G., Paschereit, C.O.: Interaction mechanisms of fuel momentum with flashback limits in lean premixed combustion of hydrogen. Int. J. Hydrog. Eng. 42, 4518–4529 (2017)
  • [17] R. Nohria and G. Santos, IBM Power System AC922 Technical Overview and Introduction, July, 2018.