Understanding fundamental kinetic processes is important in many physical problems, from the astrophysics of self-gravitating systems, to plasma physics and gas dynamics. Several recent satellite missions observe the detailed structure of these systems: for example, the GAIA mission aims to collect position, velocity, and other data on billions of stars in our galaxy, and the Parker Solar Probe mission is studying the detailed structure of the hot solar wind plasma that permeates the solar system. Each of these missions aims to measure the phase-space structure of the “particles,” e.g., stars in the case of GAIA and electrons and ions in the case of Parker Solar Probe. The quality of the data is unprecedented and promises to greatly enrich our understanding. Clearly, large-scale simulation capability is needed to interpret and understand the detailed physics revealed by these measurements.
A near first-principles approach is to use the statistical description via the Boltzmann equation coupled to appropriate field equations: the Poisson equation for self-gravitating systems and Maxwell’s equations for plasmas. The challenge in solving such systems is the inherent nonlinearity due to the coupling of the particles and fields, and the fact that the particle dynamics evolves in 6D phase-space (position-velocity), requiring a very careful treatment of all field-particle interaction terms.
The fundamental object in the Boltzmann description is the particle distribution function $f(\mathbf{x},\mathbf{v},t)$ that evolves in phase-space. The particle distribution function is defined such that $f(\mathbf{x},\mathbf{v},t)\,d^3x\,d^3v$ is the number of particles in the phase-space volume $d^3x\,d^3v$ at position-velocity location $(\mathbf{x},\mathbf{v})$. The motion of particles comes about from free-streaming and particle acceleration and is described by the Boltzmann equation

$$\frac{\partial f}{\partial t} + \nabla_{\mathbf{x}}\cdot(\mathbf{v} f) + \nabla_{\mathbf{v}}\cdot(\mathbf{a} f) = C[f],$$

where $\nabla_{\mathbf{x}}$ and $\nabla_{\mathbf{v}}$ are gradient operators in configuration and velocity space respectively, and $\mathbf{a}$ is the acceleration. To treat the phase-space as a whole we will often use $\mathbf{z} = (\mathbf{x},\mathbf{v})$ and denote the phase-space flux as $\boldsymbol{\alpha} = (\mathbf{v},\mathbf{a})$. The right-hand side $C[f]$ represents collision terms that redistribute the particles in velocity space, but in a manner that conserves density, momentum, and energy. Even though the streaming of particles, $\mathbf{v}\cdot\nabla_{\mathbf{x}} f$, in the Boltzmann equation is linear, the collisions and the coupling to the fields via the acceleration, determined by velocity moments of the distribution function, make the complete particle-field equations a highly nonlinear, integro-differential, 6D system.
The high dimensionality means that for most problems, especially in 6D, one needs the largest computational resources one can muster. In this paper we present a novel version of the discontinuous Galerkin (DG) algorithm to solve such kinetic equations. Unlike traditional and widely-used Monte Carlo methods, such as the particle-in-cell (PIC) method for plasmas, we use a continuum scheme in which we directly discretize the 6D phase-space using discontinuous basis functions. A continuum scheme has the advantage that the counting noise inherent in PIC methods is eliminated, albeit at the cost of higher computational complexity. Once the basis set and a numerical flux function are determined, we compute all volume and surface terms in the DG algorithm exactly, eliminating all aliasing errors that would otherwise contaminate the delicate field-particle interactions. This is a critical aspect of capturing the physics, both in the linear and nonlinear regimes.
We use modal basis functions (of the Serendipity family) with reduced degrees of freedom (DOF) to improve efficiency while retaining a high formal order of convergence. Further, use of a computer algebra system (CAS) allows us to compute all integrals analytically, and orthonormalization of the basis leads to very sparse update kernels, minimizing FLOPs and eliminating all tensor-tensor products and explicit quadratures.
We extend previous work, where the authors presented a nodal DG algorithm to solve the Boltzmann equation in the context of plasma physics. In the plasma physics context, the Boltzmann equation, coupled to Maxwell’s equations, forms the Vlasov-Maxwell system of equations, in which charged particles evolve in self-consistent electromagnetic fields. For the Vlasov-Maxwell system of equations, the acceleration vector is given by $\mathbf{a} = (q/m)(\mathbf{E} + \mathbf{v}\times\mathbf{B})$, where $q$ and $m$ are the particle charge and mass, and $\mathbf{E}$ and $\mathbf{B}$ are the electric and magnetic fields, determined from Maxwell’s equations. The particle contribution to the fields is via the plasma current that appears in Ampere’s law. That work showed that a DG scheme can conserve the mass exactly and, when using central fluxes for Maxwell’s equations, the total (particle plus field) energy exactly. Importantly though, unlike the case of fluid problems (Euler, Navier-Stokes, or magnetohydrodynamics equations), there is no explicit energy equation that is evolved. In fact, the energy (as discussed in Section II) depends on moments of the distribution function as well as the $L_2$-norm of the electromagnetic field. Hence, ensuring both the accuracy of the evolution of the energy, and that the energy is conserved, is not trivial, and care is needed to maintain energy conservation. The modal DG scheme presented here does not change the properties proved in that earlier work, but it does greatly improve the efficiency and scalability of the DG algorithm, while maintaining all of the scheme’s favorable properties.
Our algorithms are implemented in the open-source Gkeyll [16, 15] code, which incorporates a number of software innovations: use of a JIT-compiled top-level language, CAS-generated computational kernels, and a sophisticated shared-memory MPI implementation to handle velocity-space parallelization. We have obtained sub-quadratic scaling of the computational kernels with DOFs per cell and also good parallel weak-scaling of the code on the Theta supercomputer.
The modal, alias-free, matrix-free, and quadrature-free DG algorithm presented here has also been applied to the discretization of Fokker-Planck equations. We note, though, that to our knowledge this paper describes the first instance of the application of a modal, alias-free, matrix-free, and quadrature-free DG algorithm to kinetic equations, especially nonlinear kinetic equations. In the rest of the paper we describe some aspects of our schemes and the innovations we have made to bring high-dimensional problems within reach, at least on large supercomputers.
II Modal discontinuous Galerkin algorithm
As context for the fundamental algorithmic advancement of this paper, we briefly review the ingredients of a discontinuous Galerkin scheme. To construct a DG discretization of a partial differential equation (PDE) such as the kinetic equation, we discretize our phase space domain into grid cells, $K_j$, multiply the kinetic equation by test functions $\psi$, and integrate the phase space gradient by parts to construct the discrete-weak form,

$$\int_{K_j} \frac{\partial f_h}{\partial t}\,\psi\,d\mathbf{z} = \int_{K_j} \boldsymbol{\alpha}_h f_h \cdot \nabla_{\mathbf{z}}\psi\,d\mathbf{z} - \oint_{\partial K_j} \psi\,\hat{F}\cdot d\mathbf{s}.$$
The discrete-weak form is then evaluated in each grid cell and for every test function in a chosen basis expansion, with the discrete representation of the particle distribution defined as

$$f_h(\mathbf{z}, t) = \sum_k f_k(t)\,\psi_k(\mathbf{z}),$$

for the test functions $\psi_k$ which define our basis. We likewise have a discrete representation for the phase space flux, $\boldsymbol{\alpha}_h$, which, for example, looks like

$$\boldsymbol{\alpha}_h = \left(\mathbf{v},\, \frac{q}{m}\left(\mathbf{E}_h + \mathbf{v}\times\mathbf{B}_h\right)\right)$$

for the particular Boltzmann equation for the evolution of a collisionless plasma, i.e., the Vlasov equation. The numerical flux function, $\hat{F}$, is some suitably appropriate prescription for the interface fluxes, such as upwind fluxes, in analogy with traditional finite volume methods. In contrast to finite volume methods, though, the numerical flux function has its own basis expansion, e.g., for central fluxes,

$$\hat{F} = \frac{1}{2}\left(\boldsymbol{\alpha}_h^+ f_h^+ + \boldsymbol{\alpha}_h^- f_h^-\right),$$

where the superscripts $-(+)$ denote the basis expansions of $\boldsymbol{\alpha}_h$ and $f_h$ evaluated just inside (outside) the cell interface.
While the discrete-weak form is a mathematically complete formulation of the DG algorithm, to translate the discrete-weak form into code, a suitable choice of basis functions for $f_h$ must be made to evaluate the integrals in the discrete-weak form. Restricting ourselves to polynomial bases, a conventional approach in the application of DG methods to hyperbolic PDEs is a nodal basis, wherein the basis set is defined by a set of polynomials whose values are known at nodes. An example nodal basis in 1D is

$$f_h(x) = \sum_m f_h(x_m)\,\ell_m(x),$$

where the $\ell_m(x)$ are the Lagrange interpolating polynomials,

$$\ell_m(x) = \prod_{n \ne m} \frac{x - x_n}{x_m - x_n},$$

and the $x_m$ are the nodes by which the polynomials are defined. Because the polynomials are defined to take a value of one at one node and zero at all other nodes, the coefficients $f_h(x_m)$ are thus known at the specific set of nodes.
Nodal bases are common in the DG literature because of the computational advantages they provide for many applications, most especially the simplification of many of the integrals if one substitutes products and other nonlinear combinations of basis functions as

$$f_h(x)\,g_h(x) \approx \sum_m f_h(x_m)\,g_h(x_m)\,\ell_m(x).$$
Such a simplification reduces the number of operations required to evaluate the discrete-weak form and numerically integrate the PDE of interest, but at a cost: aliasing errors are introduced into the solution, since nonlinear combinations of the nodal basis set are not contained in the basis [20, 21]. These aliasing errors have been studied in the context of fluid equations, such as the Euler equations, the Navier-Stokes equations, or the equations of magnetohydrodynamics, where it is found that they can have a destabilizing effect. However, the computational gains from the simplification of the integrals are large enough that significant effort has been spent on mitigating these errors with filtering and artificial dissipation [7, 10, 8, 32] or split-form formulations [13, 14, 11, 12, 9, 38]. (In the split-form formulation, conservative and non-conservative forms of the equation at the continuous level are averaged to produce a different, but mathematically equivalent, and ultimately more computationally favorable, equation to discretize.) Because fluid equations involve explicit conservation relations and the aliasing errors manifest in the smallest scales and highest wavenumbers, there is far less concern that mitigation techniques such as filtering or artificial dissipation will destroy the quality of the solution, at least at scales above the resolution of the simulation. The ability to control aliasing errors while maintaining the favorable computational complexity of a nodal scheme is of tremendous utility for the simulation of large-scale problems in computational fluid dynamics.
Unfortunately, these aliasing errors are intolerable for kinetic equations. The catastrophic nature of aliasing errors for kinetic equations arises from the physics content of the DG discretization itself when employing polynomial bases. To take the example of the Vlasov equation for the evolution of a collisionless plasma: when employing at least piecewise quadratic polynomials, the DG discretization involves the evolution of the $|\mathbf{v}|^2$ moment of the particle distribution function. But the $|\mathbf{v}|^2$ velocity moment is the particle energy, whose evolution is given by

$$\frac{d}{dt}\sum_j \int_{K_j} \frac{1}{2}|\mathbf{v}|^2 f_h\,d\mathbf{z} = \int \mathbf{E}_h \cdot \mathbf{J}_h\,d\mathbf{x},$$

where we have summed over all cells to eliminate the surface term, and substituted for the volume term the discrete exchange of energy between the particles and the electromagnetic fields, $\int \mathbf{E}_h \cdot \mathbf{J}_h\,d\mathbf{x}$.
In order for this substitution to be valid, the integrations of the surface and volume terms must be performed exactly, or at least to a high precision, lest the aforementioned aliasing errors manifest themselves as the “energy content” of the velocity moments being transported in uncontrolled and undesirable ways. It would be nigh impossible to correct the rearrangement of the “energy content” of the basis expansion in a physically reasonable way, much less a stable way, because these errors enter at all scales, in both fields and particles, destroying a fundamental property of the equation system: that the exchange of energy between the plasma and electromagnetic fields is given by $\int \mathbf{E}_h \cdot \mathbf{J}_h\,d\mathbf{x}$. If we cannot safely apply standard techniques such as filtering to mitigate aliasing errors, we must eliminate these errors in their entirety.
Eliminating aliasing errors with a nodal basis comes at a high cost, though. The use of numerical quadrature, even anisotropic quadrature, leads to a computational complexity proportional to the number of quadrature points, $N_q$, required to exactly integrate the nonlinear term(s) in the kinetic equation. The number of quadrature points increases exponentially with dimensionality, leading to an incredibly expensive numerical method for five and six dimensional problems.
We can gain insight into how to manage this cost, while respecting our requirement of the complete elimination of aliasing errors, by considering the fundamental operation of our DG method. Substitution of the full expansions for the phase space flux, $\boldsymbol{\alpha}_h$, and distribution function, $f_h$, into the volume term in the discrete-weak form gives us

$$\int_{K_j} \boldsymbol{\alpha}_h f_h \cdot \nabla_{\mathbf{z}}\psi_k\,d\mathbf{z} = \sum_m \sum_n T_{kmn}\,\alpha_m f_n,$$

where we have encompassed the spatial discretization in the evaluation and convolution of the entries of the tensor, $T_{kmn}$. If this tensor is dense, the convolution of $T_{kmn}$ with $\alpha_m f_n$ to evaluate the volume integral in the discrete-weak form will have a computational complexity cubic in the number of basis functions per cell, which would suffer the same curse of dimensionality as the use of numerical quadrature.
However, if $T_{kmn}$ could be made sparse, this would correspond to a systematic reduction in the number of operations required to evaluate the volume integral, and thus reduce the number of operations needed to numerically integrate the kinetic equation with our DG method. We can indeed sparsify $T_{kmn}$ with the use of a modal, orthonormal polynomial basis set, as many entries of the tensor will be zero if the basis functions are orthonormal. In addition to the reduction in the number of operations required to evaluate the volume integral, the use of an orthonormal basis to sparsify $T_{kmn}$ allows for a complete redesign of the algorithm to maximize performance on modern architectures.
We now describe the principal algorithmic advancement of this paper: an alias-free, matrix-free, and quadrature-free DG algorithm for kinetic equations. By choosing a modal, orthonormal polynomial basis, we can symbolically integrate the individual terms in the tensor and explicitly evaluate the sums which form the core of the update formulae. We construct computational kernels using the Maxima CAS to evaluate sums such as

$$\sum_m \sum_n T_{kmn}\,\alpha_m f_n,$$

with similar computational kernels for the surface integrals. We show an example computational kernel for the volume integral of the Vlasov equation in Figure 1 for the piecewise linear tensor product basis in one spatial and two velocity dimensions (1X2V).
Figure 1 shows a C++ computational kernel that can be called for every cell of a structured, Cartesian grid in phase space, as we are passing all the information required to the kernel to determine where we are physically in phase space, i.e., the local cell center coordinate and grid cell size. The output of this computational kernel, the out array, forms a component of a system of ordinary differential equations,
where the operation encodes the evaluation of the surface integrals on each surface of the cell and can also be pre-generated using a CAS. (In the construction of this ordinary differential equation system a mass matrix formally appears, but due to the choice of a modal, orthonormal basis this matrix is the identity matrix and thus requires no additional operations to invert.) Given the computation of the surface and volume integrals in every cell, this system of ordinary differential equations can be discretized with an appropriate ODE integrator, such as a strong-stability-preserving Runge-Kutta (SSP-RK) method, as is done in Gkeyll. We note that we will likewise have computational kernels for Maxwell’s equations, or another set of field equations such as Poisson’s equation for self-gravitating systems, which must be evaluated at each stage of an SSP-RK method to complete the field-particle coupling.
Notably, the computational kernel in Figure 1 has no matrix data structure, much less any requirement to perform quadrature, since we have already analytically evaluated the integrals which make up the entries of the tensor with a CAS and written out the results to double precision. Further, we unroll all loops, eliminate common expressions, and collect terms to ensure that the update uses as few FLOPs as possible. (The problem of ensuring the least FLOP count is difficult, and we apply the most reasonably straightforward tricks we can think of; a further reduction is likely possible and could be explored with more sophisticated optimization tools.) Using the local cell center and grid spacing, we construct the expansion of the phase space flux, $\boldsymbol{\alpha}_h$, for each dimension, and then compute the convolution of the tensor summed over each component of the phase space flux. Thus, not only is the method alias-free because the integrals which form our spatial discretization have been evaluated to machine precision, the method is also quadrature-free and matrix-free. Such quadrature-free methods using orthogonal polynomials were studied in the early days of the DG method [2, 28] and are still applied to a variety of linear hyperbolic equations, such as the acoustic wave equation for studies of seismic activity, the level set equation, and Maxwell’s equations [25, 30, 27, 24]. Even for alternative formulations of DG which do not seek to eliminate aliasing errors by exactly integrating the components of the discrete-weak form, matrix-free implementations are desirable to reduce the memory footprint of the scheme.
To our knowledge, the construction of the alias-free, matrix-free, and quadrature-free algorithm shown here for kinetic equations, especially nonlinear kinetic equations such as the Vlasov equation for collisionless plasma dynamics, is the first instance of such an algorithm design in the literature. This particular algorithm design has numerous advantages. The sparseness of our alias-free, matrix-free, and quadrature-free DG algorithm leads to a reduction in the number of operations required to evaluate the volume and surface integrals; e.g., the computational kernel in Figure 1 requires far fewer multiplications than the corresponding update for numerical quadrature applied to an alias-free nodal basis. In addition, the reduced memory footprint from requiring no matrix data structure and the unrolling of the tensor-tensor convolutions leads to additional performance improvements, e.g., from compiler optimizations such as common expression elimination. To determine quantitatively the computational complexity and more precisely evaluate the performance of the alias-free, quadrature-free, and matrix-free DG algorithm, we perform a numerical experiment in the next section.
III Computational complexity
Although we have evidence from the computational kernel presented in Figure 1 that the number of operations is indeed reduced compared to the use of numerical quadrature, we would like to determine generally how sparse the tensors required to update the discrete kinetic equation are. We again take the example of the Vlasov equation, and in Figure 2 perform a numerical experiment using the computational kernels generated for a variety of basis expansion and dimensionality combinations.
We show the time to evaluate the computational kernels in a phase space cell for just the streaming term in the left plot of Figure 2, and for the evaluation of the full phase space update, streaming plus acceleration, in the right plot. From the scaling of the cost to evaluate these computational kernels we can determine the computational complexity of the algorithm with respect to the number of degrees of freedom per cell, i.e., the number of basis functions in our expansion, $N_p$.
It is immediately apparent that, even with the steepening of the scaling as the number of degrees of freedom increases, there is at least some gain over the use of direct quadrature to evaluate the integrals in the discrete-weak form because, at worst, the total, streaming plus acceleration, update scales no worse than quadratically with the number of degrees of freedom per cell. In fact, this worst-case scaling is exactly the scaling obtained by under-integrating the nonlinear term in a nodal basis [20, 21]. But critically, we have obtained this same (or better) computational complexity while eliminating aliasing errors from our scheme, as we require for stability and accuracy.
However, the improvement in the scaling is actually better than it first appears. The scaling shown in Fig. 2 is the cost scaling of the full update to perform a forward Euler step in a phase space cell; i.e., in six dimensions, three spatial and three velocity, the total update time in the right plot of Fig. 2 is the time to compute the six-dimensional volume integral plus the twelve required five-dimensional surface integrals. (We mention that the choice of an orthonormal basis and analytically computing all integrals leads to the rather surprising result that the 6D volume integral is actually much, much cheaper than the surface integrals; in fact, the total cost of our algorithm is driven entirely by the surface integration costs.) This means the scaling we are quoting is irrespective of the dimensionality of the problem, unlike in the case of the nodal basis, where the quadrature must be performed for every integral and there is a hidden dimensionality factor in the scaling. In other words, in six dimensions, what at first may seem like only a modest improvement moving from a nodal to an orthonormal, modal representation is in fact a substantially larger improvement in the scaling once one includes the dimensionality factor, up to the constant of proportionality of the scaling. Of course, one must also compare the size of the constants of proportionality multiplying both scalings to accurately compare the reduction in the number of operations and the improvement in overall performance, since these constants can make the picture either much rosier, with the improvement in performance larger than expected, or much more dire, with the improvement in the scaling offset by a larger constant of proportionality.
To determine the constant of proportionality, we perform a more thorough numerical experiment and compare the cost of the alias-free nodal scheme and the alias-free modal scheme for a complete collisionless Vlasov-Maxwell simulation. We consider the following test: a 2X3V computation done with both the nodal and the modal algorithms, with a detailed timing breakdown of the most important step of the algorithm, the Vlasov time step. The reader is referred to Table I for a summary of the following two paragraphs if they wish to skip the details of the computer architecture and optimizations employed. Both computations are performed in serial on a MacBook Pro with an Intel Core i7-4850HQ (“Crystal Well”) chip, the same architecture on which the scaling analysis in Figure 2 was performed. The only optimization in the compilation of both algorithms is “O3”, and both versions of the code are compiled with the Clang 9.1 C++ compiler.
Specific details of the computations are as follows: a 2X3V grid with polynomial order two and the Serendipity basis, giving 112 degrees of freedom per cell. The two simulations were run for a number of time-steps to allow us to more accurately compute the time per step of just the Vlasov solver, as well as the time per step of the complete simulation. The time-stepper of choice for this numerical experiment is the three-stage, third-order SSP-RK method [34, 5]. To make the simulations as realistic as possible in terms of memory movement, we also evolve both a “proton” and an “electron” distribution function, i.e., we evolve the Vlasov-Maxwell system of equations for two plasma species.
To make the comparison as favorable as possible for the nodal algorithm, we also employ the highly tuned Eigen linear algebra library, Eigen 3.3.4, to perform the dense matrix-vector multiplies required to evaluate the higher-order quadrature needed to eliminate aliasing errors in the nodal DG discretization. We note that the nodal algorithm is optimized to use anisotropic quadrature (just high enough order to eliminate aliasing) and uses only the surface basis functions in the surface integral evaluations, so we are doing as much as possible to reduce the cost of the alias-free nodal scheme.
The results are as follows: for the nodal basis, the computation required 1079.63 seconds per time step, of which 1033.89 seconds were spent solving the Vlasov equation. The remaining time is split between the computation of Maxwell’s equations, the computation of the current from the first velocity moment of the distribution function to couple the particles and the fields, and the accumulation of each Runge-Kutta stage from our three stage Runge-Kutta method. For the modal basis, the computation required 67.43 seconds per time step, of which 60.34 seconds were spent solving the Vlasov equation.
In the nodal case, we emphasize that we achieve a reasonable CPU efficiency, and the nodal timings are not a matter of poor implementation. We estimate the number of multiplications in the alias-free nodal algorithm required to perform a full time-step to be approximately three trillion ($3\times10^{12}$), once one considers the fact that we are evolving two distribution functions with a three-stage Runge-Kutta method. One thousand seconds to perform three trillion multiplications corresponds to an efficiency of roughly $3\times10^9$ floating point operations per second (3 GFlops/s). This estimate is within 50 percent of the measured efficiencies of Eigen’s matrix-vector multiplication routines for Eigen 3.3.4 on a similar CPU architecture to the one employed for this test, so we argue that the cost of the alias-free nodal algorithm is due to the number of operations required and not an inefficient implementation of the algorithm.
It is then worth discussing how this improvement in the timings using the modal algorithm compares with our expectations. Given the scaling of the modal basis, we would anticipate the gain in efficiency in five dimensions to be around a factor of twenty: a factor of four from the reduction in the scaling, and a factor of five from the modal scaling containing all of the five-dimensional volume integrals and the ten four-dimensional surface integrals. From the timings quoted above, the gain in just the Vlasov solver is roughly a factor of seventeen, while the gain in the overall time per step is roughly a factor of sixteen, not quite as much as we would naively expect, but still a sizable increase in the speed of the Vlasov solver. The smaller reduction in the overall time is due to the fact that, while the time to solve Maxwell’s equations and compute the currents to couple the Vlasov equation and Maxwell’s equations is reduced, these other two costs, in addition to the cost of accumulating each Runge-Kutta stage, are not reduced as dramatically as the time to solve the Vlasov equation.
|MacBook Pro (High Sierra OS)|Intel Core i7-4850HQ (“Crystal Well”)|Clang 9.1 C++|
|Optimization flags: O3|Grid: 2X3V|Polynomial order: 2|
|Eigen 3.3.4 for nodal|112 degrees of freedom per cell| |
|Nodal total time: 1079.63 s/step|Modal total time: 67.43 s/step|Total time reduction: ~16x|
|Nodal Vlasov time: 1033.89 s/step|Modal Vlasov time: 60.34 s/step|Vlasov time reduction: ~17x|
Again, the details of this comparison are summarized in Table I.
IV Gkeyll implementation and parallel scaling
The modal kinetic solvers are implemented in Gkeyll, a modern computational framework designed to solve a broad variety of plasma problems. Though the focus here is full kinetics (Boltzmann equations coupled to field equations), Gkeyll also contains solvers for the gyrokinetic equations [19, 29] as well as for multi-moment multifluid equations [37, 4].
Gkeyll uses a number of software innovations which we describe briefly here for completeness. In the context of this paper, the key features of Gkeyll are, first, a low-level infrastructure to build new solvers and, second, a high-level “App” system that allows putting solvers together to perform a particular class of simulation. The low-level computational kernels that update a single cell (via volume and surface DG updates), and compute moments and other quantities needed in the update sequences, are in C++ and auto-generated using the Maxima CAS. As discussed in the previous section, the use of a CAS allows us to compute most of the integrals needed in the update analytically, eliminating all quadrature and unrolling all inner loops to eliminate matrices.
The high-level App system is written in a JIT-compiled language, LuaJIT. Lua is a small, lightweight language that is compiled into the framework. Despite its simplicity, LuaJIT is a subtle and powerful language, with a prototype-based object system and coroutines that provide great flexibility in composing complex simulations. Further, the LuaJIT compiler produces extremely optimized code, often performing at the level of, or better than, hand-written C, giving the best of both worlds: the flexibility of a high-level language and the speed of a compiled language. We note that Gkeyll is less than 8% (about 36K LOC) hand-written LuaJIT; the rest is auto-generated C++ produced via the Maxima CAS. This structure greatly reduces maintenance issues, as one only needs to ensure that the CAS code is bug-free, rather than coding all loops, tensor-tensor products, and quadratures by hand, especially for complex functionality such as the full coupling between the Boltzmann equation, Maxwell’s equations, and a collision operator.
The Gkeyll App system greatly simplifies user interaction with the code. The flexibility of the scripting layer allows the user great control over the simulation cycle. In fact, every aspect of the simulation can be controlled by the user without writing any compiled code, or even needing a compiler suite. Users can, however, compile compute-intensive code by hand and load it into Gkeyll using the LuaJIT FFI. In addition, the App system streamlines not just the running of a simulation, but also the manipulation of the data. While post-processing can be done through a suite of tools called the postgkyl package (see the Gkeyll website [16, 15] for details), computationally intensive analysis can also be run through the App system.
An additional software innovation is the two layers of parallel decomposition used by Gkeyll. This multi-layer decomposition is necessary because there are three grids involved in a kinetic simulation: the phase-space grid, the configuration-space grid, and the velocity-space grid. The field solvers work on the configuration-space grid, while the Boltzmann equation evolves on the phase-space grid. The coupling via moments comes from velocity integrals of the distribution function that lives on the phase-space grid. These grids, and the various communication patterns needed to move data between them, lead to a complex use of MPI.
The first level of parallel domain decomposition is in configuration space. Since the DG algorithm only requires one layer of ghost cells to compute the surface integrals along each direction, communication is minimized during the update of the Boltzmann-Maxwell system of equations. The second level of parallel domain decomposition comes from a shared-memory decomposition of the velocity grid. For this we use MPI shared-memory primitives to divide the work of updating a region of velocity space owned by a subset of the total number of cores. A further subset of cores on each of these subsets takes part in the IO and parallel communication. We use MPI_Datatype objects extensively to avoid unnecessary copying of data into and out of buffers.
The advantage of this two-level decomposition is that there is no need to all-reduce the moment data in velocity space. Further, the use of MPI shared-memory primitives eliminates the thread latency common to thread-based parallelism models such as OpenMP. In other words, our parallelism model removes the cost of creating and destroying threads, while still improving the algorithm’s scaling on a single node and reducing the amount of memory consumed per node by eliminating the need for ghost layers amongst the intra-node work. In fact, due to the high dimensionality of the problem (say 5D/6D), even a single layer of ghost cells needs significant memory (as the ghost layers are themselves 4D/5D) and hence communication time. The use of shared memory on a node significantly reduces memory consumption, sometimes by a large factor.
In Figure 3 we show the weak and strong scaling of our pure-MPI domain decomposition for a six-dimensional problem on the Knights Landing (KNL) architecture of the Theta supercomputer. Even though we are not using a thread-based model for on-node parallel programming, we can still take advantage of all 256 "threads" on the KNL chip by specifying at runtime that we are using 256 shared MPI processes. This provides significant benefit for our DG algorithm: using all 256 "threads" not only divides the work amongst a larger number of processes, but also exposes a much greater degree of instruction-level parallelism, reducing the cycles per instruction and leading to greater floating-point efficiency. Instruction-level parallelism is particularly useful for our application because our unrolled sparse tensor-tensor products, such as the computational kernel shown in Figure 1, while sparse relative to the nodal algorithm, still generate long streams of instructions; without it, clock cycles are wasted as further instructions are fetched to complete the product.
In fact, instruction-level parallelism is the reason our weak scaling is more favorable than our strong scaling: the weak-scaling runs maintain enough work per node to extract high efficiency from 256 shared MPI processes, whereas the fixed problem size does not provide enough work per node when the decomposition spans the full machine. Performance therefore degrades within the node in the strong-scaling case, even though communication is minimized by our DG algorithm. Due to the memory requirements of solving a six-dimensional partial differential equation, we are limited in the base problem size we can choose for strong scaling. Nevertheless, the alias-free, matrix-free, and quadrature-free DG algorithm, using a large suite of MPI-3 functionality including the shared-memory primitives, scales well up to the machine size (4096 KNL nodes and 1 million MPI processes). Details of the scaling study can be found in the supplementary material. We have also performed similar scaling studies on the Stampede 2 supercomputer and other local clusters.
V Example Simulations
We now briefly show the results of an example simulation run with the alias-free, matrix-free, and quadrature-free DG algorithm presented in this paper. We repeat the calculation of previous publications that used Gkeyll [35, 23]. This particular simulation demonstrates the utility of a continuum kinetic approach, as the high-fidelity representation of the particle distribution function provides critical insights into the dynamics of this kinetic system, in this case a collisionless plasma.
The setup is an electron-proton plasma in two spatial dimensions and two velocity dimensions (2X2V), with the electron population initially divided amongst two counter-streaming beams. These counter-streaming beams serve as a source of free energy for a zoo of plasma instabilities, including two-stream, filamentation, and hybrid two-stream-filamentation modes. In the limits explored in , the authors found that as the beams became more nonrelativistic and colder, so that the beams' initial energy was dominantly kinetic, a large spectrum of hybrid two-stream-filamentation, or oblique, modes all had comparable growth rates. With multiple unstable modes growing and vying for dominance, the nonlinear saturation of these instabilities led to a highly dynamic phase space. This had a significant impact on the late-time evolution of the plasma: collisionless damping of the saturated modes depleted the electromagnetic energy generated by the unstable modes, leading to overall energy conversion from kinetic to electromagnetic to thermal.
We show in Figure 4 the electron distribution function at three different times: the initial condition, the time of nonlinear saturation when the electromagnetic energy peaks, and the end of the simulation, with two different slices of phase space, (top) and (bottom).
These distribution function slices demonstrate the phase-space structure that can be represented with a continuum kinetic method such as the alias-free, matrix-free, and quadrature-free DG algorithm presented here. We emphasize that this phase-space structure is an important component of the dynamics: the authors of , in comparing the results of  to a particle-based method, found that the noise inherent to the PIC algorithm can pollute the nonlinear evolution of these instabilities, an important caveat on the use of PIC algorithms and on the care needed in interpreting their output in some situations. Further details for this simulation can be found in the supplementary material.
VI Conclusion

In this paper we have presented, to our knowledge, the first alias-free, matrix-free and quadrature-free scheme for continuum simulation of kinetic problems. Kinetic problems are characterised by delicate field-particle energy exchange that requires great care to ensure that aliasing errors do not modify the physics contained in the system. Further, as kinetic systems evolve in high-dimensional phase space (5D/6D), it is important to minimize the computational cost while retaining accuracy and convergence order. Our modal DG scheme achieves this by computing all needed volume and surface integrals analytically using a computer algebra system and generating the computational kernels automatically. These kernels leverage the sparsity of the tensor-tensor convolutions with a modal, orthonormal basis, unroll all loops, eliminate the need for matrices, and consolidate common subexpressions with common factors "pulled out". This leads to a dramatic reduction in FLOPs and data movement, significantly speeding up computation compared to a nodal DG code even when the latter uses highly optimized linear algebra libraries. Critically, despite the analytical elimination of aliasing, we still obtain sub-quadratic scaling of cost with degrees of freedom per cell.
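The flavor of such a generated kernel can be seen in a 1D toy example: projecting the product of two p=1 modal expansions onto the orthonormal Legendre basis phi0 = 1/sqrt(2), phi1 = sqrt(3/2)*x on [-1, 1]. The triple-product integrals were evaluated analytically (by hand here, by a CAS in Gkeyll), so the kernel collapses to a handful of multiply-adds with the common factor 1/sqrt(2) pulled out; this is a sketch in Python, not one of the actual generated C++ kernels.

```python
import math

INV_SQRT2 = 1.0 / math.sqrt(2.0)

def product_kernel(a0, a1, b0, b1):
    """Unrolled kernel: modal coefficients of the L2 projection of f*g,
    where f = a0*phi0 + a1*phi1 and g = b0*phi0 + b1*phi1. The only
    nonzero triple-product integrals all equal 1/sqrt(2)."""
    c0 = INV_SQRT2 * (a0 * b0 + a1 * b1)
    c1 = INV_SQRT2 * (a0 * b1 + a1 * b0)
    return c0, c1

# Cross-check against 3-point Gauss-Legendre quadrature, which is exact
# for the cubic integrands f*g*phi_k appearing here.
nodes = [-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0)]
weights = [5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0]

def quad_project(a0, a1, b0, b1):
    c = [0.0, 0.0]
    for x, w in zip(nodes, weights):
        p0, p1 = INV_SQRT2, math.sqrt(1.5) * x
        fg = (a0 * p0 + a1 * p1) * (b0 * p0 + b1 * p1)
        c[0] += w * fg * p0
        c[1] += w * fg * p1
    return tuple(c)
```

In the real 5D/6D kernels the same collapse happens for thousands of terms at once, which is where the FLOP and data-movement savings over nodal quadrature come from.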
Our scheme is implemented in a flexible, open-source computational plasma physics framework, Gkeyll. This framework allows flexible construction of simulations using a powerful "App" system. Gkeyll is mostly written in LuaJIT, a JIT-compiled language, with key computational kernels auto-generated in C++. A hybrid MPI/shared-memory-MPI domain decomposition allows us to reduce communication within nodes and ensures almost linear scaling on a single node, while retaining excellent scaling properties across nodes. We have demonstrated this on the Theta supercomputer all the way up to the full machine (1 million MPI processes).
Our present algorithmic work is focused on two areas: adding a multi-moment model coupled to the kinetics, which will lead to a unique hybrid moment-kinetic simulation capability (most hybrid PIC codes assume massless, isothermal electrons), and a novel recovery-based DG scheme that will further increase accuracy, reducing resolution requirements. The recovery-based approach is very promising, as it may allow higher-order convergence from a given set of DG basis functions than traditional DG schemes obtain. Such an increase in accuracy can permit coarser meshes, further dramatically reducing the computational cost of 5D/6D problems. The recovery approach is complex, though, and benefits greatly from the advances in CAS-generated code reported here. Combined with the flexibility of the Gkeyll code, these innovations will enable larger problems of interest in a broad array of fields.
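The idea behind recovery can be seen in its simplest setting: from two piecewise-constant (p = 0) cell values, one recovers the unique linear polynomial whose cell averages match both, giving an interface representation one order higher than either cell holds. This is a 1D toy of the concept, not the scheme itself.

```python
# Recovery across an interface at x = 0 between cells [-1, 0] and [0, 1].
# Given p = 0 data (the two cell averages fL, fR), recover the unique
# linear polynomial r(x) = a + b*x whose average over each cell matches:
#   integral of r over [-1, 0] = fL  =>  a - b/2 = fL
#   integral of r over [ 0, 1] = fR  =>  a + b/2 = fR
def recover(fL, fR):
    a = 0.5 * (fL + fR)   # recovered value at the interface x = 0
    b = fR - fL           # recovered slope across the interface
    return a, b

# For data sampled from a linear profile f(x) = 2 + 3x the recovery is
# exact: the cell averages are 2 + 3*(-1/2) and 2 + 3*(1/2).
fL, fR = 2.0 + 3.0 * (-0.5), 2.0 + 3.0 * 0.5
a, b = recover(fL, fR)
```

The recovery scheme generalizes this construction to higher p and to surface terms of the DG update, which is where the CAS-generated kernels become essential.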
We thank the Gkeyll team for contributions to various parts of the code and the work we have presented here. In particular, we thank Mana Francisquez for help in implementing the collision operators, moment computations and other core code; Noah Mandell for help with the high-level App system (and authoring the gyrokinetic solvers); and Petr Cagas for implementing the post-processing tools (and BGK collision operators and boundary conditions). We also thank Jason TenBarge, Greg Hammett and Bill Dorland for extensive discussion on various aspects of the physics of the Vlasov-Maxwell system.
- (2011) The Serendipity Family of Finite Elements. Found. Comput. Math. 11 (3), pp. 337–344.
- (1998) Quadrature-free implementation of discontinuous Galerkin method for hyperbolic equations. AIAA Journal 36 (5), pp. 775–782.
- (2009) Weibel, two-stream, filamentation, oblique, Bell, Buneman… which one grows faster?. Astrophys. J. 699 (2), pp. 990–1003.
- (2019) Global Ten-Moment Multifluid Simulations of the Solar Wind Interaction with Mercury: From the Planetary Conducting Core to the Dynamic Magnetosphere. Geophys. Res. Lett. 124, pp. 2019GL083180–13.
- (2010) Numerical methods for fluid dynamics: with applications to geophysics. Vol. 32, Springer Science & Business Media.
- (2019) A matrix-free high-order discontinuous Galerkin compressible Navier-Stokes solver: a performance comparison of compressible and incompressible formulations for turbulent incompressible flows. Int. J. Numer. Methods Fluids 89 (3), pp. 71–102.
- (2001) Filter-based stabilization of spectral element methods. C. R. Acad. Sci. Paris, Ser. I 332 (3), pp. 265–270.
- (2016) Simulation of underresolved turbulent flows by adaptive filtering using the high order discontinuous Galerkin spectral element method. J. Comput. Phys. 313, pp. 1–12.
- (2017) On the use of kinetic energy preserving DG-schemes for large eddy simulation. J. Comput. Phys. 350, pp. 782–795.
- (2013) On the accuracy of high-order discretizations for underresolved turbulence simulations. Theor. Comput. Fluid Dyn. 27 (3), pp. 221–237.
- (2016) A well balanced and entropy conservative discontinuous Galerkin spectral element method for the shallow water equations. Appl. Math. Comput. 272, pp. 291–308.
- (2016) Split form nodal discontinuous Galerkin schemes with summation-by-parts property for the compressible Euler equations. J. Comput. Phys. 327, pp. 39–66.
- A skew-symmetric discontinuous Galerkin spectral element discretization and its relation to SBP-SAT finite difference methods. SIAM J. Sci. Comput. 35 (3), pp. A1233–A1253.
- (2014) A kinetic energy preserving nodal discontinuous Galerkin spectral element method. Int. J. Numer. Methods Fluids 76 (1), pp. 28–50.
- (2020) Gkeyll Code Documentation. Note: http://gkeyll.rtfd.io
- (2020) Gkeyll Github Source Repository. Note: https://github.com/ammarhakim/gkyl
- (2010) Eigen v3. Note: http://eigen.tuxfamily.org
- (2019) Conservative Discontinuous Galerkin Schemes for Nonlinear Fokker-Planck Collision Operators.
- (2020) Continuum electromagnetic gyrokinetic simulations of turbulence in the tokamak scrape-off layer and laboratory devices. Phys. Plasmas, pp. 1–12.
- (2007) Nodal discontinuous Galerkin methods: algorithms, analysis, and applications. Springer Science & Business Media.
- (2012) Explicit discontinuous Galerkin methods for unsteady problems. Computers & Fluids 61, pp. 86–93.
- (2018) Discontinuous Galerkin algorithms for fully kinetic plasmas. J. Comput. Phys. 353, pp. 110–147.
- (2020) Noise-induced magnetic field saturation in kinetic simulations.
- (2020) A matrix-free Discontinuous Galerkin method for the time dependent Maxwell equations in unbounded domains. arXiv preprint.
- (2006) An arbitrary high-order discontinuous Galerkin method for elastic waves on unstructured meshes — I. The two-dimensional isotropic case with external source terms. Geophys. J. Int. 166 (2), pp. 855–877.
- (2003) De-aliasing on non-uniform grids: algorithms and applications. J. Comput. Phys. 191 (1), pp. 249–264.
- (2012) Computer Algebra Meets Finite Elements: An Efficient Implementation for Maxwell's Equations. In Numerical and Symbolic Scientific Computing: Progress and Prospects, pp. 105–121.
- (1999) Efficient implementations of the quadrature-free discontinuous Galerkin method. In 14th Computational Fluid Dynamics Conference.
- (2020) Electromagnetic full- gyrokinetics in the tokamak edge with discontinuous Galerkin methods. J. Plasma Phys. 86 (1), pp. 149–39.
- (2006) A quadrature-free discontinuous Galerkin method for the level set equation. J. Comput. Phys. 212 (1), pp. 338–357.
- (2019) Website: http://maxima.sourceforge.net/
- (2017) On the eddy-resolving capability of high-order discontinuous Galerkin approaches to implicit LES / under-resolved DNS of Euler turbulence. J. Comput. Phys. 330, pp. 615–623.
- (2020) Parker Solar Probe. Note: https://www.nasa.gov/content/goddard/parker-solar-probe
- (2002) A survey of strong stability preserving high order time discretizations. Collected lectures on the preservation of stability under discretization 109, pp. 51–65.
- (2019) Temperature-dependent saturation of Weibel-type instabilities in counter-streaming plasmas. Astrophys. J. Lett. 872 (2), pp. L28.
- (2020) The Gaia Mission. Note: sci.esa.int/web/gaia
- (2019) Exact and Locally Implicit Source Term Solvers for Multifluid-Maxwell Systems. arXiv.org.
- (2018) A comparative study on polynomial dealiasing and split form discontinuous Galerkin schemes for under-resolved turbulence computations. J. Comput. Phys. 372, pp. 1–21.