A transformation in computing architectures has emerged over the past decade in the quest to keep Moore's Law alive, leading to significant advancements in Artificial Intelligence and Machine Learning techniques [1, 2]. The success of Machine Learning has caused a role reversal: it is now a primary driver in the development of the massively-parallel, thread-based hardware indicative of future exascale-era architectures [3]. Traditional Computational Science applications such as Computational Fluid Dynamics (CFD) have eschewed the use of Graphics Processing Units (GPUs) due to the programming complexity required to achieve high performance. In this ever-changing computing landscape, however, algorithms suitable for these platforms are now at the forefront of advancement.
Low-order discretizations such as the finite volume method [4, 5] have been the industrial standard in CFD for the last 40 years. However, these algorithms suffer from low arithmetic intensity (the ratio of floating-point operations performed to the amount of memory moved in the calculation), making them ill-suited for GPUs. To leverage these new architectures and enable higher-fidelity simulations, a surge of algorithmic development for high-order methods, such as the discontinuous Galerkin method [6] and the flux-reconstruction method [7], has taken place within the Computational Science community. These high-order methods have higher arithmetic intensity, reflecting the increased compute-to-communication ratios needed for GPUs.
This work develops a discontinuous Galerkin method to discretize the governing equations of fluid dynamics for aerospace and atmospheric applications, suitable for GPUs in a distributed computing environment. Within this work, we make simplifications, such as conforming to Cartesian meshes, to establish a performance ceiling for future algorithmic developments. We employ an abstracted thread-programming model known as OCCA [8], an open-source library enabling the use of GPUs, CPUs, and FPGAs, to achieve performance portability. The performance of the method is analyzed, and comparisons of network communication via CUDA-Aware MPI are performed. Lastly, we demonstrate the parallel scalability of the implementation on up to 6,144 NVIDIA V100 GPUs.
II Discontinuous Galerkin Method
II-A Governing Equations
The governing equations utilized in this work are the three-dimensional compressible inviscid Euler equations, which are written in conservative form:

$$\frac{\partial Q}{\partial t} + \nabla \cdot \vec{F}(Q) = 0, \quad (1)$$

representing the conservation of mass, momentum, and energy for a fluid. The solution vector $Q$ represents the conservative flow variables and the matrix $\vec{F}$ represents the Cartesian flux components. $Q$ and $\vec{F}$ are defined as follows:

$$Q = \begin{bmatrix} \rho \\ \rho u \\ \rho v \\ \rho w \\ e \end{bmatrix}, \qquad \vec{F} = \begin{bmatrix} \rho u & \rho v & \rho w \\ \rho u^2 + p & \rho v u & \rho w u \\ \rho u v & \rho v^2 + p & \rho w v \\ \rho u w & \rho v w & \rho w^2 + p \\ \rho u h & \rho v h & \rho w h \end{bmatrix}, \quad (2)$$

where $\rho$ is density, $u$, $v$, and $w$ are the velocity components in each spatial coordinate direction, $p$ is pressure, $e$ is total internal energy, and $h$ is total enthalpy. These equations are closed using the ideal gas equation of state:

$$p = (\gamma - 1)\left(e - \frac{1}{2}\rho\left(u^2 + v^2 + w^2\right)\right), \quad (3)$$

where $\gamma$ is the ratio of specific heats.
This work constrains the computational domain to Cartesian meshes. Under these assumptions, arithmetic simplifications are performed to increase computational performance. Within this setting, the governing equations are discretized via the discontinuous Galerkin (DG) method [9] assuming a weighted residual formulation. A weak formulation is established by multiplying each governing equation by a set of test functions, $\phi_i$, and integrating over the Cartesian element volume $\Omega_e$:

$$\int_{\Omega_e} \phi_i \frac{\partial Q}{\partial t} \, dV + \int_{\Omega_e} \phi_i \, \nabla \cdot \vec{F}(Q) \, dV = 0. \quad (4)$$
Integrating (4) by parts, the weak-form residual is defined as:

$$\int_{\Omega_e} \phi_i \frac{\partial Q}{\partial t} \, dV - \int_{\Omega_e} \nabla \phi_i \cdot \vec{F}(Q) \, dV + \oint_{\partial \Omega_e} \phi_i \, \vec{F}^{*}(Q) \cdot \hat{n} \, dS = 0, \quad (5)$$

where $\vec{F}^{*}$ is the numerical flux evaluated on element boundaries and $\hat{n}$ is the outward unit normal.
The discrete residual contains integrals over mesh element boundaries where special treatment is needed for the fluxes. The advective fluxes are calculated using an approximate Riemann solver, namely, the Lax-Friedrichs method [10]. To complete the spatial discretization, the solution is approximated as follows:

$$Q(\vec{x}, t) \approx \sum_{j} Q_j(t) \, \phi_j(\vec{x}), \quad (6)$$
where the basis functions $\phi_j$ are chosen to be the same as the test functions. A collocation approach is chosen for computational efficiency by selecting the basis and test functions to be Lagrange interpolation polynomials of degree $p$, where $p$ is the user-chosen polynomial degree, and numerical integration is performed via Gauss-Legendre quadrature. Lastly, we implement a low-storage explicit Runge-Kutta method to discretize the temporal component of (4).
III CPU Implementation
This work serves within the larger computational framework known as the Wyoming Wind and Aerospace Applications Kompution Environment (WAKE3D) [11, 12, 13]. WAKE3D has been demonstrated on various aerodynamics problems [14] and wind energy applications [15, 16, 17, 18]. The framework is composed of multiple software components, namely, NSU3D [19], DG4EST [16, 20], and TIOGA [21, 22]. Figure 1 illustrates the simulation of a Siemens SWT-2.3-93 wind turbine using WAKE3D.
One of these components, CartDG, is the primary focus of this work. CartDG is a discontinuous Galerkin solver designed for computational efficiency on Cartesian meshes utilizing tensor-product collocation-based basis functions. The CPU implementation has achieved over 10% sustained theoretical peak compute performance using the viscous formulation, as shown in Figure 2. CartDG utilizes the Message Passing Interface (MPI) for distributed-memory computation and enables computation-communication overlap to hide communication latency. It has been demonstrated to scale to over one million MPI ranks on the ALCF Mira supercomputer.
IV GPU Implementation
Discontinuous Galerkin methods exhibit multiple levels of parallelism: coarse-grain parallelism via mesh decomposition and fine-grain parallelism via single instruction, multiple data within a mesh element. In this work, the MPI programming model is employed for coarse-grain parallelism and OCCA [8] is utilized for fine-grain parallelism. We select the CUDA backend within OCCA to execute the kernels on NVIDIA V100 GPUs. OCCA provides a kernel language known as OKL, a simple extension of the C language, enabling explicit, architecture-independent, fine-grain parallelized code. The sample code shown below illustrates the spatial residual volume kernel (second term) from (4) using OKL.
The discontinuous Galerkin method discretizes mesh elements into multiple solution modes per variable, which depend on the solution polynomial degree established a priori by the user. For example, in three-dimensional space, a solution of polynomial degree $p$ has $(p+1)^3$ modes per conservative variable within an element. To map the discretization to the hardware, we assign one hardware thread per mode, which is seen in the code sample indicated by the cubeThreads definition. The @inner attribute assigns a thread to each mode index.
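The paper's OKL listing did not survive extraction; the following is an illustrative sketch of a collocation volume kernel in OKL. The array names, the p_NQ size macro, and the loop bodies are assumptions made here; only the cubeThreads definition and the @inner thread mapping are taken from the text above.

```c
// Illustrative OKL sketch (not the paper's verbatim listing).
// One hardware thread is assigned per solution mode via @inner loops.
#define p_NQ 6   /* (p+1) solution points per direction (hypothetical) */
#define p_NVAR 5 /* conservative variables */

#define cubeThreads                              \
  for (int k = 0; k < p_NQ; ++k; @inner(2))      \
    for (int j = 0; j < p_NQ; ++j; @inner(1))    \
      for (int i = 0; i < p_NQ; ++i; @inner(0))

@kernel void volumeKernel(const int nElements,
                          @restrict const double *Q,   // conservative modes
                          @restrict const double *D,   // 1D derivative matrix
                          @restrict double *rhs) {     // residual accumulator
  for (int e = 0; e < nElements; ++e; @outer(0)) {
    @shared double s_F[p_NVAR][p_NQ][p_NQ][p_NQ];

    cubeThreads {
      // evaluate the Cartesian fluxes at this mode and stage them in
      // shared memory (body omitted for brevity)
    }

    @barrier("local");

    cubeThreads {
      // contract the 1D derivative matrix D against s_F in each
      // direction and accumulate into rhs: the second (volume) term of
      // the weak form
    }
  }
}
```

On the CUDA backend, the @outer loop maps to thread blocks and the three nested @inner loops map to a three-dimensional thread block, giving the one-thread-per-mode layout described above.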
Table I: MIT Satori, CUDA-Aware MPI. Problem size: 452,984,832 DOF.
V Performance Results: MIT Satori
The GPU implementation of CartDG is tested on the MIT Satori supercomputer. Satori is composed of 64 IBM Power9 nodes, each containing four NVIDIA V100 GPUs and two 22-core POWER9 processors. The communication interconnect contains NVIDIA GPUDirect, which enables CUDA-Aware MPI communication protocols over the 50 GB/s NVLink fabric linking the GPUs within a node. The problem chosen for performance testing consists of a periodic domain containing 128x128x128 mesh elements and a solution polynomial degree of $p = 5$, totaling 452,984,832 degrees-of-freedom (DOF) (2,264,924,160 unknowns).
Figure 3 demonstrates the parallel strong-scaling performance of the implementation, comparing GPUDirect communication to non-CUDA-Aware MPI communication. As seen in the figure, the solver scales well to 16 nodes. Table I demonstrates that GPUDirect CUDA-Aware MPI provides a 29% speedup on a single node and a 24% speedup across multiple nodes.
VI Performance Results: ORNL Summit
We test the same implementation on ORNL Summit. Summit has nearly the same node architecture as MIT Satori but contains six NVIDIA V100 GPUs and two 22-core IBM POWER9 processors per node. Summit contains 4,608 nodes, totaling 27,648 GPUs and 202,752 CPU cores, and has achieved 148.6 PetaFLOPS on the LINPACK benchmark [27, 28].
VI-A Strong Scalability
We perform a second parallel strong-scalability test on ORNL Summit using a mesh composed of seventh-order accurate elements, totaling 69,055,021,056 degrees-of-freedom (345,275,105,280 unknowns). This problem size is chosen to give approximately 11 million degrees-of-freedom per GPU when using all GPUs on 1,024 nodes. The strong-scaling test measures the wall-clock time to complete 100 Low-Storage Explicit Runge-Kutta 5-stage 4th-order (LSERK45) time steps. The benchmark used 128, 256, 512, 920, and 1,024 nodes. Figure 4 displays the scaling results. We note that these results do not utilize CUDA-Aware MPI; the information required by neighboring ranks is first transferred to the host CPU, exchanged via MPI, then transferred back to the GPU. Table IV tabulates the wall-clock times at various node counts for the CPU and GPU implementations.
Table II: ORNL Summit, GPU performance. Columns: kernel, time (sec), PetaFLOPS, and percent of theoretical peak achieved.
Table III: ORNL Summit, CUDA-Aware MPI. Problem size: 21,337,344 DOF.
Table IV: ORNL Summit, parallel strong scalability. Problem size: 69,055,021,056 degrees-of-freedom. Each node contains 2 IBM POWER9 processors (44 cores per node) and 6 NVIDIA V100 GPUs.
VI-B GPU Performance
As demonstrated in the strong-scaling figure, the performance improvement is approximately 30x when using all six GPUs per node compared to using both CPUs only (44 cores per node). Table II presents the computational kernel breakdown of the GPU simulation on 512 nodes. The volume kernel, corresponding to the code sample, achieved nearly 5 PetaFLOPS of performance on 512 nodes. However, the communication overhead is large in comparison to the computation time. This is due to not utilizing CUDA-Aware MPI on Summit, which would allow faster GPU-GPU communication within a node and overlap of computation and communication.
A communication benchmark is conducted using a single node with six GPUs interconnected with NVLink for a problem containing 21,337,344 DOF. As shown in Table III, CartDG achieves an 8x speedup in communication time using GPUDirect CUDA-Aware MPI on one node, corresponding to a 40% overall speedup in time to solution. This result corroborates the results achieved on MIT Satori and illustrates a significant communication improvement when using GPUDirect CUDA-Aware MPI. This contrasts with the result found in [29], which tested a similar high-order solver on ORNL Summit and observed a slowdown in performance when using CUDA-Aware MPI.
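The difference between the two exchange paths can be sketched as a minimal halo-exchange fragment; the function signatures, buffer names, and neighbor handling are illustrative, not CartDG's actual communication layer, and error checking is omitted:

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Halo exchange with a neighboring rank nbr; d_send/d_recv are device
 * buffers and h_send/h_recv are host staging buffers (illustrative). */

/* Non-CUDA-Aware path: device -> host, MPI on host buffers, host -> device. */
void exchange_staged(double *d_send, double *d_recv,
                     double *h_send, double *h_recv, int n, int nbr) {
  cudaMemcpy(h_send, d_send, n * sizeof(double), cudaMemcpyDeviceToHost);
  MPI_Sendrecv(h_send, n, MPI_DOUBLE, nbr, 0,
               h_recv, n, MPI_DOUBLE, nbr, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  cudaMemcpy(d_recv, h_recv, n * sizeof(double), cudaMemcpyHostToDevice);
}

/* GPUDirect path: device pointers handed straight to a CUDA-Aware MPI,
 * which moves the data over NVLink/RDMA without host staging. */
void exchange_cuda_aware(double *d_send, double *d_recv, int n, int nbr) {
  MPI_Sendrecv(d_send, n, MPI_DOUBLE, nbr, 0,
               d_recv, n, MPI_DOUBLE, nbr, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```

The staged path pays two extra PCIe/NVLink copies and host-side latency per exchange, which is consistent with the 8x communication-time difference reported in Table III.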
VII Conclusion
This work developed a high-order discontinuous Galerkin method on Cartesian meshes for use on heterogeneous architectures, including GPUs. Performance benchmarks on MIT Satori and ORNL Summit demonstrated significant speedups, including a 30x improvement in time to solution over the CPU-only implementation. Further, the benefits of utilizing GPUDirect were shown: overall performance improved by 24% across multiple nodes, and communication time improved by 8x within one node. If the CUDA-Aware MPI performance improvement holds at scale, as indicated by the results obtained on MIT Satori, a GPU-to-CPU performance improvement of 40-50x is expected.
The authors acknowledge the MIT Research Computing Project and Oak Ridge National Laboratory for providing HPC resources that have contributed to the research results reported in this paper.
[1] T. Ben-Nun and T. Hoefler, "Demystifying parallel and distributed deep learning: An in-depth concurrency analysis," ACM Computing Surveys (CSUR) 52, no. 4 (2019): 1-43.
[2] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, "Survey and benchmarking of machine learning accelerators," arXiv preprint arXiv:1908.11348 (2019).
[3] P. Messina, "The exascale computing project," Computing in Science & Engineering 19, no. 3 (2017): 63-67.
[4] A. Jameson and D. J. Mavriplis, "Finite volume solution of the two-dimensional Euler equations on a regular triangular mesh," AIAA Journal 24, no. 4 (1986): 611-618.
[5] D. J. Mavriplis, "Multigrid solution of the two-dimensional Euler equations on unstructured triangular meshes," AIAA Journal 26, no. 7 (1988): 824-831.
[6] J. S. Hesthaven and T. Warburton, Nodal Discontinuous Galerkin Methods: Algorithms, Analysis, and Applications. Springer Science & Business Media, 2007.
[7] F. D. Witherden, A. M. Farrington, and P. E. Vincent, "PyFR: An open source framework for solving advection-diffusion type problems on streaming architectures using the flux reconstruction approach," Computer Physics Communications 185, no. 11 (2014): 3028-3040.
[8] D. S. Medina, A. St-Cyr, and T. Warburton, "OCCA: A unified approach to multi-threading languages," arXiv preprint arXiv:1403.0968 (2014).
[9] B. Cockburn, G. E. Karniadakis, and C. Shu, Discontinuous Galerkin Methods: Theory, Computation and Applications. Vol. 11. Springer Science & Business Media, 2012.
[10] A. Harten, P. D. Lax, and B. van Leer, "On upstream differencing and Godunov-type schemes for hyperbolic conservation laws," SIAM Review 25, no. 1 (1983): 35-61.
[11] M. J. Brazell, A. C. Kirby, J. Sitaraman, and D. J. Mavriplis, "A multi-solver overset mesh approach for 3D mixed element variable order discretizations," AIAA Paper 2016-0053, 54th AIAA Aerospace Sciences Meeting, San Diego, CA, January 2016.
[12] A. C. Kirby and D. J. Mavriplis, "High fidelity blade-resolved wind plant modeling," SC17, 2017.
[13] A. C. Kirby, Z. Yang, D. J. Mavriplis, E. P. Duque, and B. J. Whitlock, "Visualization and data analytics challenges of large-scale high-fidelity numerical simulations of wind energy applications," AIAA Paper 2018-1171, 56th AIAA Aerospace Sciences Meeting, Kissimmee, FL, January 2018.
[14] K. Kara, A. C. Kirby, and D. J. Mavriplis, "Hover predictions of S76 rotor using a high order discontinuous Galerkin off-body discretization," AIAA Paper 2020-0771, AIAA SciTech 2020 Forum, Orlando, FL, January 2020.
[15] A. C. Kirby, A. Hassanzadeh, D. J. Mavriplis, and J. W. Naughton, "Wind turbine wake dynamics analysis using a high-fidelity simulation framework with blade-resolved turbine models," AIAA Paper 2018-0256, Wind Energy Symposium, Kissimmee, FL, January 2018.
[16] A. C. Kirby, M. J. Brazell, Z. Yang, R. Roy, B. R. Ahrabi, M. K. Stoellinger, J. Sitaraman, and D. J. Mavriplis, "Wind farm simulations using an overset hp-adaptive approach with blade-resolved turbine models," The International Journal of High Performance Computing Applications 33, no. 5 (2019): 897-923.
[17] A. Kirby, M. Brazell, J. Sitaraman, and D. Mavriplis, "An overset adaptive high-order approach for blade-resolved wind energy applications," AHS Forum 72, West Palm Beach, FL, May 2016.
[18] A. P. Edmonds, A. Hassanzadeh, A. C. Kirby, D. J. Mavriplis, and J. W. Naughton, "Effects of blade load distributions on wind turbine wake evolution using blade-resolved computational fluid dynamics simulations," AIAA Paper 2019-2081, 57th AIAA Aerospace Sciences Meeting, San Diego, CA, January 2019.
[19] D. Mavriplis, M. Long, T. Lake, and M. Langlois, "NSU3D results for the second AIAA high-lift prediction workshop," Journal of Aircraft 52, no. 4 (2015): 1063-1081.
[20] M. K. Stoellinger, A. P. Edmonds, A. C. Kirby, D. J. Mavriplis, and S. Heinz, "Dynamic SGS modeling in LES using DG with kinetic energy preserving flux schemes," AIAA Paper 2019-1648, 57th AIAA Aerospace Sciences Meeting, San Diego, CA, January 2019.
[21] M. J. Brazell, J. Sitaraman, and D. J. Mavriplis, "An overset mesh approach for 3D mixed element high-order discretizations," Journal of Computational Physics 322 (2016): 33-51.
[22] J. Sitaraman, D. Jude, and M. J. Brazell, "Enhancements to overset methods for improved accuracy and solution convergence," AIAA Paper 2020-1528, AIAA SciTech Forum, 2020.
[23] A. C. Kirby, D. J. Mavriplis, and A. M. Wissink, "An adaptive explicit 3D discontinuous Galerkin solver for unsteady problems," AIAA Paper 2015-3046, 22nd AIAA Computational Fluid Dynamics Conference, Dallas, TX, June 2015.
[24] Ma. J. Brazell, M. J. Brazell, M. K. Stoellinger, D. J. Mavriplis, and A. C. Kirby, "Using LES in a discontinuous Galerkin method with constant and dynamic SGS models," AIAA Paper 2015-0060, 53rd AIAA Aerospace Sciences Meeting, Kissimmee, FL, January 2015.
[25] A. C. Kirby, "Enabling high-order methods for extreme-scale simulations," University of Wyoming, 2018.
[26] M. J. Brazell, A. C. Kirby, and D. J. Mavriplis, "A high-order discontinuous-Galerkin octree-based AMR solver for overset simulations," AIAA Paper 2017-3944, 23rd AIAA Computational Fluid Dynamics Conference, Denver, CO, June 2017.
[27] J. J. Dongarra, P. Luszczek, and A. Petitet, "The LINPACK benchmark: Past, present and future," Concurrency and Computation: Practice and Experience 15, no. 9 (2003): 803-820.
[28] J. A. Kahle, J. Moreno, and D. Dreps, "2.1 Summit and Sierra: Designing AI/HPC supercomputers," 2019 IEEE International Solid-State Circuits Conference (ISSCC), pp. 42-43. IEEE, 2019.
[29] J. Romero, J. Crabill, J. E. Watkins, F. D. Witherden, and A. Jameson, "ZEFR: A GPU-accelerated high-order solver for compressible viscous flows using the flux reconstruction method," Computer Physics Communications (2020): 107169.