GPU-Accelerated Discontinuous Galerkin Methods: 30x Speedup on 345 Billion Unknowns

06/28/2020
by   Andrew C. Kirby, et al.

A discontinuous Galerkin method for the discretization of the compressible Euler equations, the governing equations of inviscid fluid dynamics, on Cartesian meshes is developed for use on Graphics Processing Units (GPUs) via OCCA, a unified approach to performance portability on multi-threaded hardware architectures. A 30x time-to-solution speedup over CPU-only implementations using non-CUDA-Aware MPI communications is demonstrated up to 1,536 NVIDIA V100 GPUs, and parallel strong scalability is shown up to 6,144 NVIDIA V100 GPUs for a problem containing 345 billion unknowns. A comparison of CUDA-Aware MPI communication to non-GPUDirect communication is performed, demonstrating an additional 24% speedup across multiple nodes.


I. Introduction

A transformation in computing architectures has emerged over the past decade in the quest to keep Moore's Law alive, leading to significant advancements in Artificial Intelligence and Machine Learning techniques [1, 2]. The success of Machine Learning has caused a role reversal: it is now a primary driver in the development of the massively-parallel, thread-based hardware indicative of future exascale-era architectures [3]. Traditional Computational Science applications such as Computational Fluid Dynamics (CFD) have historically eschewed Graphics Processing Units (GPUs) due to the programming complexity required to achieve high performance. However, in this ever-changing computing landscape, algorithms suitable for these platforms are now at the forefront of advancement in the field.

Low-order discretizations such as the finite volume method [4, 5] have been the industrial standard in CFD for the last 40 years. However, these algorithms suffer from low arithmetic intensity¹, making them ill-suited for GPUs. To leverage these new architectures and enable higher-fidelity simulations, a surge of algorithmic development for high-order methods, such as the discontinuous Galerkin method [6] and the flux-reconstruction method [7], has taken place within the Computational Science community. These high-order methods have higher arithmetic intensity, reflecting the increased compute-to-communication ratios needed for GPUs.

¹Arithmetic intensity is the ratio of floating-point operations performed to the amount of memory used in the calculation.
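As a rough illustration (an estimate, not a measured figure), consider the DG volume kernel shown later in Listing 1, which performs nelements*(38*N^3 + 30*N^4) floating-point operations, where N = p+1 is the number of quadrature points per direction, while its dominant global-memory traffic is reading the solution and writing the residual, roughly 2 x 5N^3 doubles per element:

\mathrm{AI} \approx \frac{38N^{3} + 30N^{4}}{2 \cdot 5 N^{3} \cdot 8\ \text{bytes}} = \frac{38 + 30N}{80}\ \text{flops/byte},

which grows linearly with N; for N = 7 (seventh-order elements) this is roughly 3 flops/byte, whereas typical low-order finite-volume updates sit well below 1 flop/byte.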

This work develops a discontinuous Galerkin method to discretize the governing equations of fluid dynamics relevant to aerospace and atmospheric applications, suitable for GPUs in a distributed computing environment. Within this work, we make simplifications such as conforming to Cartesian meshes to establish a performance ceiling for future algorithmic developments. We employ an abstracted thread-programming model known as OCCA [8], an open-source library enabling the use of GPUs, CPUs, and FPGAs, to achieve performance portability. The performance of the method is analyzed and comparisons of network communications via CUDA-Aware MPI are performed. Lastly, we demonstrate the parallel scalability of the implementation up to 6,144 NVIDIA V100 GPUs.

II. Discontinuous Galerkin Method

II-A. Governing Equations

The governing equations utilized in this work are the three-dimensional compressible inviscid Euler equations, written in conservative form:

\frac{\partial Q}{\partial t} + \nabla \cdot \mathbf{F}(Q) = 0,    (1)

representing the conservation of mass, momentum, and energy for a fluid. The solution vector Q contains the conservative flow variables and the matrix \mathbf{F} contains the Cartesian flux components. Q and \mathbf{F} are defined as follows:

Q = \begin{bmatrix} \rho \\ \rho u \\ \rho v \\ \rho w \\ \rho E \end{bmatrix},
\qquad
\mathbf{F} = \begin{bmatrix}
\rho u & \rho v & \rho w \\
\rho u^{2} + p & \rho u v & \rho u w \\
\rho u v & \rho v^{2} + p & \rho v w \\
\rho u w & \rho v w & \rho w^{2} + p \\
\rho u H & \rho v H & \rho w H
\end{bmatrix},    (2)

where $\rho$ is density, $(u, v, w)$ are the velocity components in each spatial coordinate direction, $p$ is pressure, $E$ is the total energy per unit mass, and $H = E + p/\rho$ is the total enthalpy. These equations are closed using the ideal gas equation of state:

p = (\gamma - 1)\left(\rho E - \tfrac{1}{2}\rho\left(u^{2} + v^{2} + w^{2}\right)\right),    (3)

where $\gamma$ is the ratio of specific heats.
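For illustration, a small C++ helper (a sketch with hypothetical names, not the solver's data structures) that recovers the primitive quantities appearing in (2) from a conservative state using the closure (3):

#include <array>

// Recover primitive quantities from a conservative state
// q = {rho, rho*u, rho*v, rho*w, rho*E} using the ideal-gas closure (3).
struct Primitive { double rho, u, v, w, p, H; };

Primitive primitives(const std::array<double, 5>& q, double gamma) {
    const double rho = q[0];
    const double u = q[1]/rho, v = q[2]/rho, w = q[3]/rho;
    const double p = (gamma - 1.0) * (q[4] - 0.5*rho*(u*u + v*v + w*w)); // eq. (3)
    const double H = (q[4] + p) / rho;   // total enthalpy: H = E + p/rho
    return {rho, u, v, w, p, H};
}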

Fig. 1: Turbulent wake structure of a wind turbine containing 1.6 billion degrees-of-freedom (8 billion unknowns). This simulation used the CPU implementation of this work on 20,000+ CPU cores.

II-B. Discretization

This work constrains the computational domain to Cartesian meshes. Under this assumption, arithmetic simplifications are performed to increase computational performance. Within this setting, the governing equations are discretized via the discontinuous Galerkin (DG) method [9] using a weighted residual formulation. A weak formulation is established by multiplying each governing equation by a set of test functions $\phi_i$ and integrating over the Cartesian element volume $K$:

\int_{K} \frac{\partial Q}{\partial t}\, \phi_i \, dV + \int_{K} \left(\nabla \cdot \mathbf{F}(Q)\right) \phi_i \, dV = 0.    (4)

Integrating (4) by parts, the weak-form residual is defined as:

R_i = \int_{K} \frac{\partial Q}{\partial t}\, \phi_i \, dV - \int_{K} \mathbf{F}(Q) \cdot \nabla \phi_i \, dV + \oint_{\partial K} \left(\mathbf{F}^{*} \cdot \hat{n}\right) \phi_i \, dS = 0.    (5)

The discrete residual contains integrals over mesh element boundaries where special treatment of the fluxes is needed. The advective fluxes are calculated using an approximate Riemann solver, namely the Lax-Friedrichs method [10]. To complete the spatial discretization, the solution is approximated as follows:

Q(\mathbf{x}, t) \approx \sum_{j} \hat{Q}_j(t)\, \phi_j(\mathbf{x}),    (6)

where the basis functions $\phi_j$ are chosen to be the same as the test functions. A collocation approach is chosen for computational efficiency by choosing the basis and test functions to be Lagrange interpolation polynomials,

\phi_j(x) = \prod_{\substack{m=0 \\ m \neq j}}^{p} \frac{x - x_m}{x_j - x_m},    (7)

where $p$ is the user-chosen polynomial degree and the $x_m$ are the interpolation nodes; numerical integration is performed via Gauss-Legendre quadrature. Lastly, we implement a low-storage explicit Runge-Kutta method to discretize the temporal component of (4).
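For reference, the interface flux $\mathbf{F}^{*}$ in the surface term of (5) can be realized with the Lax-Friedrichs approach cited above. A minimal, self-contained C++ sketch of the local (Rusanov) form in the x-normal direction, with hypothetical names, follows; the solver's actual flux routine may differ.

#include <algorithm>
#include <array>
#include <cmath>

using State = std::array<double, 5>;  // {rho, rho*u, rho*v, rho*w, rho*E}

// Local Lax-Friedrichs (Rusanov) flux in the x-normal direction:
//   F*(qL, qR) = 1/2 (F(qL) + F(qR)) - 1/2 lambda (qR - qL),
// with lambda the larger acoustic wave speed |u| + c of the two states.
State lax_friedrichs_x(const State& qL, const State& qR, double gamma) {
    auto flux_and_speed = [gamma](const State& q, State& f) {
        const double rho = q[0], u = q[1]/rho, v = q[2]/rho, w = q[3]/rho;
        const double p = (gamma - 1.0) * (q[4] - 0.5*rho*(u*u + v*v + w*w));
        const double c = std::sqrt(gamma * p / rho);          // speed of sound
        f = { rho*u, rho*u*u + p, rho*u*v, rho*u*w, u*(q[4] + p) };
        return std::abs(u) + c;                               // max wave speed
    };
    State fL, fR;
    const double lambda = std::max(flux_and_speed(qL, fL), flux_and_speed(qR, fR));
    State fstar;
    for (int n = 0; n < 5; ++n)
        fstar[n] = 0.5*(fL[n] + fR[n]) - 0.5*lambda*(qR[n] - qL[n]);
    return fstar;
}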

III. CPU Implementation

This work sits within the larger computational framework known as the Wyoming Wind and Aerospace Applications Komputation Environment (WAKE3D) [11, 12, 13]. WAKE3D has been demonstrated on various aerodynamics problems [14] and wind energy applications [16, 18, 15, 17]. The framework is composed of multiple software components, namely NSU3D [19], DG4EST [16, 20], and TIOGA [21, 22]. Figure 1 illustrates a simulation of a Siemens SWT-2.3-93 wind turbine using WAKE3D.

The computational kernel within the DG4EST component, known as CartDG [23, 24, 25], is the primary focus of this work. CartDG is a discontinuous Galerkin solver designed for computational efficiency on Cartesian meshes utilizing tensor-product collocation-based basis functions. The CPU implementation has achieved over 10% of sustained theoretical peak compute performance [26] using the viscous formulation, as shown in Figure 2. CartDG utilizes the Message Passing Interface (MPI) for distributed-memory computation and enables computation-communication overlap to hide communication latency. It has been demonstrated to scale to over one million MPI ranks [11] on the ALCF Mira supercomputer.
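A minimal sketch of this computation-communication overlap pattern (hypothetical function and buffer names; not CartDG's actual API) is shown below: ghost-face exchanges are posted with non-blocking MPI calls, the volume residual, which needs no neighbor data, is computed while messages are in flight, and the surface residual is evaluated after completion.

#include <mpi.h>
#include <vector>

// Hypothetical overlap pattern: pack_ghost_faces, compute_volume_residual, and
// compute_surface_residual are placeholders, not CartDG routines.
void residual_with_overlap(int nNeighbors, const int* neighborRank,
                           std::vector<std::vector<double>>& sendBuf,
                           std::vector<std::vector<double>>& recvBuf,
                           MPI_Comm comm) {
    std::vector<MPI_Request> reqs(2 * nNeighbors);

    // 1. Post non-blocking receives and sends for ghost-face data.
    for (int n = 0; n < nNeighbors; ++n) {
        MPI_Irecv(recvBuf[n].data(), (int)recvBuf[n].size(), MPI_DOUBLE,
                  neighborRank[n], 0, comm, &reqs[2*n]);
        // pack_ghost_faces(n, sendBuf[n]);   // fill from local elements
        MPI_Isend(sendBuf[n].data(), (int)sendBuf[n].size(), MPI_DOUBLE,
                  neighborRank[n], 0, comm, &reqs[2*n + 1]);
    }

    // 2. Overlap: the volume residual needs no neighbor data, so compute it now.
    // compute_volume_residual();

    // 3. Complete the exchange, then evaluate interface (surface) fluxes.
    MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);
    // compute_surface_residual(recvBuf);
}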

Fig. 2: Sustained peak performance of CartDG on Intel Xeon E5-2697V2 processors for various solution orders.

IV. GPU Implementation

Discontinuous Galerkin methods exhibit multiple levels of parallelism: coarse-grain parallelism via mesh decomposition and fine-grain parallelism via single instruction, multiple data within a mesh element. In this work, the MPI programming model is employed for coarse-grain parallelism and OCCA [8] is utilized for fine-grain parallelism. We select the CUDA backend within OCCA to execute the kernels on NVIDIA V100 GPUs. OCCA provides a kernel language known as OKL, a simple extension of the C language, enabling explicit, architecture-independent, fine-grain parallel code. The sample code shown below illustrates the volume contribution to the spatial residual (the second term of (4)) written in OKL.

/* p_NF:  number of fields = 5                                    */
/* p_TM:  number of modes per direction = p+1                     */
/* p_NQP: number of quadrature points per direction (= p_TM here) */
#define cubeThreads                            \
  for (int k = 0; k < p_TM; ++k; @inner(2))    \
    for (int j = 0; j < p_TM; ++j; @inner(1))  \
      for (int i = 0; i < p_TM; ++i; @inner(0))

/* Inviscid Flux Volume Kernel - Total Operations: nelements*(38*N^3 + 30*N^4) */
@kernel void residual_inviscid_volume_occa(
                            /* solution */ @restrict const double *Q,
                            /* residual */ @restrict double *R,
                     /* basis functions */ @restrict const double *dphiw,
                  /* mesh element count */ const int nelements,
   /* ratio of specific heats minus one */ const double gm1,
                     /* geometry factor */ const double vgeo)
{
    for (int e = 0; e < nelements; ++e; @outer) {
        const int tm_nf     = p_TM*p_NF;
        const int tm_tm_nf  = p_TM*tm_nf;
        const int nf_tm_ALL = p_TM*tm_tm_nf;
        const int elem_idx  = e*nf_tm_ALL;

        @exclusive double r_Q[p_NF];
        @exclusive double r_R[p_NF];

        @shared double s_fluxX[p_NQP][p_NQP][p_NQP][p_NF];
        @shared double s_fluxY[p_NQP][p_NQP][p_NQP][p_NF];
        @shared double s_fluxZ[p_NQP][p_NQP][p_NQP][p_NF];
        @shared double s_dphiw[p_TM][p_TM];

        // load solution and reset residual per-thread memory
        cubeThreads {
            const int idx1 = i*p_NF + j*tm_nf + k*tm_tm_nf + elem_idx;
            #pragma unroll p_NF
            for (int n = 0; n < p_NF; ++n) {
                r_Q[n] = Q[idx1+n];
                r_R[n] = 0.0;
            }
            // fetch transpose(dphiw) to shared memory
            if (k == 0) s_dphiw[i][j] = dphiw[j*p_TM+i];
        } @barrier("localMemFence");

        // volume flux: ops = 38*N^3
        cubeThreads {
            const double oneOrho = 1.0/r_Q[0];     // inverse density
            const double u = r_Q[1]*oneOrho*vgeo;  // dimensional x-velocity
            const double v = r_Q[2]*oneOrho*vgeo;  // dimensional y-velocity
            const double w = r_Q[3]*oneOrho*vgeo;  // dimensional z-velocity
            const double p = gm1*(r_Q[4] - 0.5*r_Q[0]*(u*u + v*v + w*w));
            const double Ep = r_Q[4] + p;

            // convective flux: x-component
            s_fluxX[k][j][i][0] = r_Q[0]*(u);
            s_fluxX[k][j][i][1] = r_Q[1]*(u) + p*vgeo;
            s_fluxX[k][j][i][2] = r_Q[2]*(u);
            s_fluxX[k][j][i][3] = r_Q[3]*(u);
            s_fluxX[k][j][i][4] =     Ep*(u);
            // convective flux: y-component (analogous, into s_fluxY; omitted for brevity)
            // convective flux: z-component (analogous, into s_fluxZ; omitted for brevity)
        } @barrier("localMemFence");

        // accumulate residual per thread and update global residual memory: ops = 30*N^4
        cubeThreads {
            const int idx1 = i*p_NF + j*tm_nf + k*tm_tm_nf + elem_idx;
            #pragma unroll p_NQP
            for (int qp = 0; qp < p_NQP; ++qp) {
                #pragma unroll p_NF
                for (int n = 0; n < p_NF; ++n) {
                    r_R[n] += s_dphiw[i][qp]*s_fluxX[k][j][qp][n];
                    r_R[n] += s_dphiw[j][qp]*s_fluxY[k][qp][i][n];
                    r_R[n] += s_dphiw[k][qp]*s_fluxZ[qp][j][i][n];
                }
            }

            #pragma unroll p_NF
            for (int n = 0; n < p_NF; ++n) R[idx1+n] = r_R[n];
        }
    }
}
Listing 1: OCCA Volume Flux Kernel Example

The discontinuous Galerkin method discretizes each mesh element into multiple solution modes per variable, the number of which depends on the solution polynomial degree established a priori by the user. For example, in three-dimensional space, for polynomial degree $p$ the number of modes per conservative variable within an element is $(p+1)^3$. To map the discretization to the hardware, we assign one hardware thread per mode, as seen in the code sample in the cubeThreads definition. The @inner attribute assigns a thread to each mode index.
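On the host side, OCCA compiles and launches such a kernel at run time for the selected backend. The following is a minimal, hypothetical C++ sketch (illustrative file, kernel, and size names; not the authors' driver code) of how Listing 1 might be driven; depending on the OCCA version, occa::json may be used in place of occa::properties.

#include <occa.hpp>
#include <vector>

int main() {
  // Select the CUDA backend; "Serial" or "OpenMP" could be used for CPUs.
  occa::device device("{mode: 'CUDA', device_id: 0}");

  const int nelements = 1024;        // illustrative element count
  const int p_TM = 6, p_NF = 5;      // p+1 modes per direction, 5 fields
  const int ndof = nelements * p_TM * p_TM * p_TM * p_NF;

  // Allocate device memory and upload the solution and basis derivatives.
  std::vector<double> Q(ndof, 1.0), dphiw(p_TM * p_TM, 0.0);
  occa::memory o_Q     = device.malloc(ndof * sizeof(double), Q.data());
  occa::memory o_R     = device.malloc(ndof * sizeof(double));
  occa::memory o_dphiw = device.malloc(p_TM * p_TM * sizeof(double), dphiw.data());

  // JIT-compile the OKL kernel with its compile-time sizes defined.
  occa::properties props;
  props["defines/p_TM"]  = p_TM;
  props["defines/p_NF"]  = p_NF;
  props["defines/p_NQP"] = p_TM;     // collocation: quadrature points = modes
  occa::kernel volumeKernel =
      device.buildKernel("residual_inviscid_volume.okl",
                         "residual_inviscid_volume_occa", props);

  // Launch: @outer maps to the CUDA grid, @inner to the thread block.
  const double gm1 = 0.4, vgeo = 1.0;
  volumeKernel(o_Q, o_R, o_dphiw, nelements, gm1, vgeo);
  device.finish();

  // Copy the residual back to the host if needed.
  std::vector<double> R(ndof);
  o_R.copyTo(R.data());
  return 0;
}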

Fig. 3: Parallel strong scalability of CartDG on MIT Satori using up to 64 GPUs. A 29% speedup is achieved at 4 GPUs contained in a single node and a 24% speedup is achieved at 32 GPUs when using GPUDirect-enabled CUDA-Aware MPI compared to standard MPI.

MIT Satori: CUDA-Aware MPI
Problem Size: 452,984,832 DOF. Solution time in seconds.

GPUs  CUDA-Aware ON  CUDA-Aware OFF  Speedup
4     17.94          23.22           1.29x
8     10.60          13.22           1.25x
16    6.34           7.87            1.24x
32    3.73           4.62            1.24x

TABLE I: CUDA-Aware MPI communication compared to standard MPI communication using NVIDIA V100 GPUs.

V. Performance Results: MIT Satori

The GPU implementation of CartDG is tested on the MIT Satori supercomputer. Satori is composed of 64 IBM POWER9 (AC922) nodes, each containing four NVIDIA V100 GPUs and two 22-core POWER9 processors. The communication interconnect supports NVIDIA GPUDirect, which enables CUDA-Aware MPI communication protocols over the 50 GB/s NVLink fabric linking the GPUs within a node. The problem chosen for performance testing consists of a periodic domain containing 128x128x128 mesh elements with a solution polynomial degree of $p = 5$, totaling 452,984,832 degrees-of-freedom (DOF) (2,264,924,160 unknowns).
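For reference, the stated counts are consistent with this configuration:

128^{3} \times (p+1)^{3} = 2{,}097{,}152 \times 216 = 452{,}984{,}832\ \text{DOF}, \qquad 5 \times 452{,}984{,}832 = 2{,}264{,}924{,}160\ \text{unknowns},

where the factor of five counts the conservative variables per mode.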

Figure 3 demonstrates the parallel strong scaling performance of the implementation comparing GPUDirect communication to non-CUDA-Aware MPI communication. As seen in the figure, the solver scales well to 16 nodes. Table I demonstrates that the GPUDirect CUDA-Aware MPI provides a 29% speedup using a single node and a 24% speedup across multiple nodes.

VI. Performance Results: ORNL Summit

We test the same implementation on ORNL Summit. Summit has nearly the same node architecture as MIT Satori but contains six NVIDIA V100 GPUs and two 22-core IBM POWER9 processors per node (IBM AC922). Summit contains 4,608 nodes, totaling 27,648 GPUs and 202,752 CPU cores, and has achieved 148.6 PetaFLOPS on the LINPACK benchmark [27, 28].

VI-A. Strong Scalability

We perform a second parallel strong-scalability test on ORNL Summit using a mesh composed of 201,326,592 seventh-order accurate elements, totaling 69,055,021,056 degrees-of-freedom (345,275,105,280 unknowns). This problem size is chosen to give approximately 11 million degrees-of-freedom per GPU when using all GPUs on 1,024 nodes. The strong-scaling test measures the wall-clock time to complete 100 Low-Storage Explicit Runge-Kutta 5-stage 4th-order (LSERK45) time steps. The benchmark used 128, 256, 512, 920, and 1,024 nodes. Figure 4 displays the scaling results. We note that these results do not utilize CUDA-Aware MPI; the information required by neighboring ranks is first transferred to the host CPU, exchanged via MPI, then transferred back to the GPU. Table IV tabulates the wall-clock times at various node counts for the CPU and GPU implementations.
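For reference, with $p = 6$ (seventh-order accuracy) each element carries $(p+1)^{3} = 343$ modes per variable, so

69{,}055{,}021{,}056 / 343 = 201{,}326{,}592\ \text{elements}, \qquad 69{,}055{,}021{,}056 / 6{,}144 \approx 11.2\ \text{million DOF per GPU}

at 1,024 nodes (6,144 GPUs).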

ORNL Summit: GPU Performance

Kernel          Time (sec)  PetaFLOPS Achieved  % of Theoretical Peak
Volume          1.77        4.84                22.5%
Surface         2.15        2.45                11.4%
Update-Project  6.00        0.49                2.3%
Communication   10.18       --                  --
Overall         20.45       0.82                3.8%

TABLE II: GPU kernel performance for 512 nodes (non-CUDA-Aware communication). Problem size: 69,055,021,056 degrees-of-freedom. (Theoretical peak assumed to be 21.5 PetaFLOPS for 512 nodes.)
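The quoted percentages follow from the assumed peak of 21.5 PetaFLOPS for 512 nodes, i.e. roughly $21.5/(512 \times 6) \approx 7$ TFLOPS of double-precision peak per V100:

4.84 / 21.5 \approx 22.5\%, \qquad 2.45 / 21.5 \approx 11.4\%, \qquad 0.49 / 21.5 \approx 2.3\%, \qquad 0.82 / 21.5 \approx 3.8\%.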
ORNL Summit: CUDA-Aware MPI (Single Node)
Problem Size: 21,337,344 DOF. Time in seconds.

Kernel          CUDA-Aware ON  CUDA-Aware OFF
Volume          1.67           1.68
Surface         2.04           2.04
Update-Project  5.70           5.70
Communication   0.65           5.13
Overall         10.37          14.55

TABLE III: CUDA-Aware MPI compared to regular MPI on one node using six NVIDIA V100 GPUs, demonstrating a 40% speedup.
Fig. 4: Parallel strong scalability of CartDG on ORNL Summit using up to 1,024 nodes (6,144 GPUs / 45,056 CPU cores).

ORNL Summit: Parallel Strong Scalability
Problem Size: 69,055,021,056 degrees-of-freedom.
Each node contains 2 IBM POWER9 processors (44 cores per node) and 6 NVIDIA V100 GPUs.

Nodes   CPU Time (sec)  CPU Scalability %  GPU Time (sec)  GPU Scalability %  Speedup (CPU/GPU)
128     2011.2          100.0              59.37           100.0              33.9
256     1021.9          98.43              34.75           85.42              29.4
512     --              --                 20.45           72.58              --
920     --              --                 14.26           57.94              --
1,024   --              --                 13.67           54.29              --

TABLE IV: ORNL Summit parallel strong scalability comparison of CPU vs. GPU (non-CUDA-Aware communication).

VI-B. GPU Performance

As demonstrated in the strong-scaling figure, the performance improvement is approximately 30x when using all six GPUs per node compared to using only the two CPUs per node (44 cores per node). Table II shows the computational kernel breakdown of the GPU simulation on 512 nodes. The volume kernel corresponding to the code sample achieved nearly 5 PetaFLOPS of performance on 512 nodes. Additionally, the communication overhead is large in comparison to the computation time. This is because CUDA-Aware MPI was not used on Summit; CUDA-Aware MPI allows faster GPU-to-GPU communication within a node and overlap of computation and communication.
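The two communication paths can be sketched as follows (a simplified, hypothetical routine using OCCA memory objects and MPI; not the solver's actual code): without CUDA-Aware MPI the packed face data is staged through host buffers, whereas with GPUDirect CUDA-Aware MPI the device pointers are handed to MPI directly.

#include <mpi.h>
#include <occa.hpp>
#include <vector>

// Exchange one packed face buffer of 'count' doubles with a neighbor rank.
// o_sendBuf and o_recvBuf are device-resident occa::memory objects sized 'count'.
void exchange_faces(occa::memory& o_sendBuf, occa::memory& o_recvBuf,
                    int count, int neighbor, MPI_Comm comm, bool cudaAwareMPI) {
    if (!cudaAwareMPI) {
        // Non-CUDA-Aware path: stage through the host (as in Table IV on Summit).
        std::vector<double> h_send(count), h_recv(count);
        o_sendBuf.copyTo(h_send.data());                       // device -> host
        MPI_Sendrecv(h_send.data(), count, MPI_DOUBLE, neighbor, 0,
                     h_recv.data(), count, MPI_DOUBLE, neighbor, 0,
                     comm, MPI_STATUS_IGNORE);
        o_recvBuf.copyFrom(h_recv.data());                     // host -> device
    } else {
        // CUDA-Aware / GPUDirect path: pass device pointers to MPI directly.
        MPI_Sendrecv(o_sendBuf.ptr(), count, MPI_DOUBLE, neighbor, 0,
                     o_recvBuf.ptr(), count, MPI_DOUBLE, neighbor, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}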

A communication benchmark is conducted using a single node with six GPUs interconnected with NVLink for a problem containing 21,337,344 DOF. As shown in Table III, CartDG achieves an 8x speedup in communication time using GPUDirect CUDA-Aware MPI on one node, corresponding to a 40% overall speedup in time to solution. This result corroborates the results achieved on MIT Satori and illustrates a significant communication improvement when using GPUDirect CUDA-Aware MPI. This contrasts with the result reported in [29], which tested a similar high-order solver on ORNL Summit and found a slowdown in performance when using CUDA-Aware MPI.
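(These factors follow directly from Table III: $5.13 / 0.65 \approx 7.9\times$ for the communication time and $14.55 / 10.37 \approx 1.40$, i.e. an approximately 40% speedup in overall time to solution.)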

VII. Conclusion

This work developed a high-order discontinuous Galerkin method on Cartesian meshes for use on heterogeneous architectures including GPUs. Performance benchmarks on MIT Satori and ORNL Summit demonstrated significant speedups, including a 30x reduction in time to solution over the CPU-only implementation. Further, the benefits of utilizing GPUDirect were shown: a 24% improvement in overall performance across multiple nodes and an 8x improvement in communication time within one node. If the CUDA-Aware MPI performance improvement holds at scale, as indicated by the results obtained on MIT Satori, a GPU-to-CPU performance improvement of 40-50x is expected.

Acknowledgment

The authors acknowledge the MIT Research Computing Project and Oak Ridge National Laboratory for providing HPC resources that have contributed to the research results reported in this paper.

References

  • [1] T. Ben-Nun and T. Hoefler, “Demystifying parallel and distributed deep learning: An in-depth concurrency analysis,” ACM Computing Surveys (CSUR) 52, no. 4 (2019): 1-43.

  • [2] A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, and J. Kepner, “Survey and benchmarking of machine learning accelerators,” arXiv preprint arXiv:1908.11348 (2019).
  • [3] P. Messina, “The exascale computing project,” Computing in Science & Engineering 19, no. 3 (2017): 63-67.
  • [4] A. Jameson, and D. J. Mavriplis, “Finite volume solution of the two-dimensional Euler equations on a regular triangular mesh,” AIAA Journal 24, no. 4 (1986): 611-618.
  • [5] D. J. Mavriplis, “Multigrid solution of the two-dimensional Euler equations on unstructured triangular meshes,” AIAA Journal 26, no. 7 (1988): 824-831.
  • [6] J. S. Hesthaven, and T. Warburton, Nodal discontinuous Galerkin methods: algorithms, analysis, and applications. Springer Science & Business Media, 2007.
  • [7] F. D. Witherden, A. M. Farrington, and P. E. Vincent, “PyFR: An open source framework for solving advection–diffusion type problems on streaming architectures using the flux reconstruction approach,” Computer Physics Communications 185, no. 11 (2014): 3028-3040.
  • [8] D. S. Medina, A St-Cyr, and T. Warburton, “OCCA: A unified approach to multi-threading languages,” arXiv preprint arXiv:1403.0968 (2014).
  • [9] B. Cockburn, G. E. Karniadakis, and C. Shu, Discontinuous Galerkin methods: theory, computation and applications. Vol. 11. Springer Science & Business Media, 2012.
  • [10] A. Harten, P. D. Lax, and B. van Leer, “On upstream differencing and Godunov-type schemes for hyperbolic conservation laws,” SIAM review 25, no. 1 (1983): 35-61.
  • [11] M. J. Brazell, A. C. Kirby, J. Sitaraman, and D. J. Mavriplis, “A multi-solver overset mesh Approach for 3D mixed element variable order discretizations,” AIAA Paper 2016-0053, 54th AIAA Aerospace Sciences Meeting, San Diego CA, January 2016.
  • [12] A. C. Kirby, and D. J. Mavriplis, “High Fidelity Blade-Resolved Wind Plant Modeling.” SC17, 2017.
  • [13] A. C. Kirby, Z. Yang, D. J. Mavriplis, E. P. Duque, and B. J. Whitlock, “Visualization and data analytics challenges of large-scale high-fidelity numerical simulations of wind energy applications,” AIAA Paper 2018-1171, 56th AIAA Aerospace Sciences Meeting, Kissimmee FL, January 2018.
  • [14] K. Kara, A. C. Kirby and D. J. Mavriplis, “Hover Predictions of S76 Rotor Using a High Order Discontinuous Galerkin Off-Body Discretization,” AIAA Paper 2020-0771 Presented at the AIAA Scitech 2020 Conference, Orlando FL, January 2020.
  • [15] A. C. Kirby, A. Hassanzadeh, D. J. Mavriplis, and J. W. Naughton, “Wind Turbine Wake Dynamics Analysis Using a High-Fidelity Simulation Framework with Blade-Resolved Turbine Models,” AIAA Paper 2018-0256, Wind Energy Symposium, Kissimmee FL, January 2018.
  • [16] A. C. Kirby, M. J. Brazell, Z. Yang, R. Roy, B. R. Ahrabi, M. K. Stoellinger, J. Sitaraman, and D. J. Mavriplis, “Wind farm simulations using an overset hp-adaptive approach with blade-resolved turbine models,” The International Journal of High Performance Computing Applications 33, no. 5 (2019): 897-923.
  • [17] A. Kirby, M. Brazell, J. Sitaraman, D. Mavriplis, “An Overset Adaptive High-Order Approach for Blade-Resolved Wind Energy Applications,” AHS Forum 72, West Palm Beach FL, May 2016.
  • [18] A. P. Edmonds, A. Hassanzadeh, A. C. Kirby, D. J. Mavriplis, and J. W. Naughton, “Effects of Blade Load Distributions on Wind Turbine Wake Evolution Using Blade-Resolved Computational Fluid Dynamics Simulations,” AIAA Paper 2019-2081, 57th AIAA Aerospace Sciences Meeting, San Diego CA, January 2019.
  • [19] D. Mavriplis, M. Long, T. Lake and M. Langlois, “NSU3D results for the second AIAA high-lift prediction workshop,” Journal of Aircraft 52, no. 4 (2015): 1063-1081.
  • [20] M. K. Stoellinger, A. P. Edmonds, A. C. Kirby, D. J. Mavriplis and S. Heinz, “Dynamic SGS modeling in LES using DG with kinetic energy preserving flux schemes,” AIAA Paper 2019-1648, 57th AIAA Aerospace Sciences Meeting, San Diego CA, January 2019.
  • [21] M. J. Brazell, J. Sitaraman, and D. J. Mavriplis, “An overset mesh approach for 3D mixed element high-order discretizations,” Journal of Computational Physics 322 (2016): 33-51.
  • [22] J. Sitaraman, D. Jude, and M. J. Brazell, “Enhancements to Overset Methods for Improved Accuracy and Solution Convergence,” AIAA Paper 2020-1528, AIAA Scitech Forum, 2020.
  • [23] A. C. Kirby, D. J. Mavriplis, and A. M. Wissink, “An Adaptive Explicit 3D Discontinuous Galerkin Solver For Unsteady Problems,” AIAA Paper 2015-3046, 22nd AIAA Computational Fluid Dynamics Conference, Dallas TX, June 2015.
  • [24] Ma. J. Brazell, M. J. Brazell, M. K. Stoellinger, D. J. Mavriplis, and A. C. Kirby, “Using LES in a Discontinuous Galerkin method with constant and dynamic SGS models,” AIAA Paper 2015-0060, 53rd AIAA Aerospace Sciences Meeting, Kissimmee FL, January 2015.
  • [25] A. C. Kirby, “Enabling High-order Methods for Extreme-scale Simulations,” University of Wyoming, 2018.
  • [26] M. J. Brazell, A. C. Kirby, and D. J. Mavriplis, “A High-order Discontinuous-Galerkin Octree-Based AMR Solver for Overset Simulations,” AIAA Paper 2017-3944, 23rd AIAA Computational Fluid Dynamics Conference, Denver CO, June 2017.
  • [27] J. J. Dongarra, P. Luszczek, and A. Petitet, “The LINPACK benchmark: past, present and future,” Concurrency and Computation: practice and experience 15, no. 9 (2003): 803-820.
  • [28] J. A. Kahle, J. Moreno, and D. Dreps, “2.1 Summit and Sierra: Designing AI/HPC Supercomputers,” In 2019 IEEE International Solid-State Circuits Conference-(ISSCC), pp. 42-43. IEEE, 2019.
  • [29] J. Romero, J. Crabill, J. E. Watkins, F. D. Witherden, and A. Jameson, “ZEFR: A GPU-accelerated high-order solver for compressible viscous flows using the flux reconstruction method,” Computer Physics Communications (2020): 107169.