I Introduction
Understanding fundamental kinetic processes is important in many physical problems, from the astrophysics of self-gravitating systems to plasma physics and gas dynamics. Several recent satellite missions observe the detailed structure of these systems: for example, the GAIA [36] mission aims to collect position, velocity, and other data on billions of stars in our galaxy, while the Parker Solar Probe [33] mission is studying the detailed structure of the hot solar-wind plasma that permeates the solar system. Each of these missions aims to measure the phase space of the “particles,” e.g., stars in the case of GAIA, and electrons and ions in the case of the Parker Solar Probe. The quality of the data is unprecedented and promises to greatly enrich our understanding. Clearly, large-scale simulation capability is needed to interpret and understand the detailed physics revealed by these measurements.
A near first-principles approach is to adopt a statistical description via the Boltzmann equation coupled to appropriate field equations: the Poisson equation for self-gravitating systems and Maxwell’s equations for plasmas. The challenge in solving such systems is the inherent nonlinearity due to the coupling of the particles and fields, and the fact that the particle dynamics evolves in 6D phase space (position and velocity), requiring a very careful treatment of all field-particle interaction terms.
The fundamental object in the Boltzmann description is the particle distribution function $f(\mathbf{x},\mathbf{v},t)$ that evolves in phase space. The particle distribution function is defined such that $f(\mathbf{x},\mathbf{v},t)\,\mathrm{d}^3x\,\mathrm{d}^3v$ is the number of particles in the phase-space volume $\mathrm{d}^3x\,\mathrm{d}^3v$ at the position-velocity location $(\mathbf{x},\mathbf{v})$. The motion of the particles comes about from free-streaming and particle acceleration and is described by the Boltzmann equation

$$\frac{\partial f}{\partial t} + \nabla_{\mathbf{x}}\cdot(\mathbf{v} f) + \nabla_{\mathbf{v}}\cdot(\mathbf{a} f) = C[f],$$

where $\nabla_{\mathbf{x}}$ and $\nabla_{\mathbf{v}}$ are the gradient operators in configuration and velocity space, respectively, and $\mathbf{a}$ is the acceleration. To treat the phase space as a whole we will often use the combined coordinate $\mathbf{z} = (\mathbf{x},\mathbf{v})$ and denote the phase-space flux as $\boldsymbol{\alpha} = (\mathbf{v},\mathbf{a})$. The right-hand side $C[f]$ represents collision terms that redistribute the particles in velocity space, but in a manner that conserves density, momentum, and energy. Even though the streaming of particles, $\nabla_{\mathbf{x}}\cdot(\mathbf{v} f)$, in the Boltzmann equation is linear, the collisions and the coupling to the fields via the acceleration, determined by velocity moments of the distribution function, make the complete particle-field equations a highly nonlinear, integro-differential, 6D system. The high dimensionality means that for most problems, especially in 6D, one needs the largest computational resources one can muster. In this paper we present a novel version of the discontinuous Galerkin (DG) algorithm to solve such kinetic equations. Unlike traditional and widely used Monte Carlo methods, such as the particle-in-cell (PIC) method for plasmas, we use a continuum scheme in which we directly discretize the 6D phase space using discontinuous basis functions. A continuum scheme has the advantage that the counting noise inherent in PIC methods is eliminated, albeit at higher computational complexity. Once the basis set and a numerical flux function are determined, we compute all volume and surface terms in the DG algorithm exactly, eliminating all aliasing errors that would otherwise contaminate the delicate field-particle interactions. This is a critical aspect of capturing the physics, in both the linear and nonlinear regimes.
We use modal basis functions (of the Serendipity family [1]) with reduced degrees of freedom (DOF) to improve efficiency while retaining a high formal order of convergence. Further, the use of a computer algebra system (CAS) allows us to compute all integrals analytically, and orthonormalization of the basis leads to very sparse update kernels, minimizing FLOPs and eliminating all tensor-tensor products and explicit quadratures.
We extend previous work [22], in which the authors presented a nodal DG algorithm to solve the Boltzmann equation in the context of plasma physics. In that context the Boltzmann equation, coupled to Maxwell’s equations, forms the Vlasov–Maxwell system of equations, in which charged particles evolve in self-consistent electromagnetic fields. For the Vlasov–Maxwell system of equations, the acceleration vector is given by $\mathbf{a} = (q/m)\left(\mathbf{E} + \mathbf{v}\times\mathbf{B}\right)$, where $q$ and $m$ are the particle charge and mass, and $\mathbf{E}$ and $\mathbf{B}$ are the electric and magnetic fields, determined from Maxwell’s equations. The particles contribute to the fields via the plasma current that appears in Ampère’s law. The work of [22] showed that a DG scheme can conserve mass exactly and, when using central fluxes for Maxwell’s equations, can conserve the total (particle plus field) energy exactly. Importantly, though, unlike the case of fluid problems (the Euler, Navier–Stokes, or magnetohydrodynamics equations), there is no explicit energy equation that is evolved. In fact, the energy (as discussed in Section II) depends on moments of the distribution function as well as the norm of the electromagnetic field. Hence, ensuring both the accuracy of the evolution of the energy and that the energy is conserved is not trivial, and care is needed to maintain energy conservation. The modal DG scheme presented here does not change the properties proved in [22], but it greatly improves the efficiency and scalability of the DG algorithm while maintaining all of the scheme’s favorable properties. Our algorithms are implemented in the open-source
Gkeyll [16, 15] code, which incorporates a number of software innovations: the use of a JIT-compiled top-level language, CAS-generated computational kernels, and a sophisticated shared-memory MPI implementation to handle velocity-space parallelization. We have obtained sub-quadratic scaling of the computational kernels with the number of DOFs per cell, as well as good parallel weak scaling of the code on the Theta supercomputer. The modal, alias-free, matrix-free, and quadrature-free DG algorithm presented here has also been applied to the discretization of Fokker–Planck equations [18]. We note, though, that to our knowledge this paper describes the first application of a modal, alias-free, matrix-free, and quadrature-free DG algorithm to kinetic equations, especially nonlinear kinetic equations. In the rest of the paper we describe some aspects of our schemes and the innovations we have made to bring high-dimensional problems within reach, at least on large supercomputers.
II Modal discontinuous Galerkin algorithm
As context for the fundamental algorithmic advancement of this paper, we briefly review the ingredients of a discontinuous Galerkin scheme. To construct a DG discretization of a partial differential equation (PDE) such as the kinetic equation, we discretize our phase-space domain into grid cells $\mathcal{K}_j$, multiply the kinetic equation by test functions $w$, and integrate the phase-space gradient by parts to construct the discrete weak form,

$$\int_{\mathcal{K}_j} w \frac{\partial f_h}{\partial t}\,\mathrm{d}\mathbf{z} + \oint_{\partial\mathcal{K}_j} w^{-}\,\hat{F}\cdot\mathrm{d}\mathbf{s} - \int_{\mathcal{K}_j} \nabla_{\mathbf{z}} w\cdot\boldsymbol{\alpha}_h f_h\,\mathrm{d}\mathbf{z} = \int_{\mathcal{K}_j} w\, C[f_h]\,\mathrm{d}\mathbf{z}.$$

The discrete weak form is then evaluated in each grid cell and for every test function in a chosen basis expansion, with the discrete representation of the particle distribution function defined as

$$f_h(\mathbf{z},t) = \sum_k f_k(t)\,\phi_k(\mathbf{z})$$
for the test functions $\phi_k$ which define our basis. We likewise have a discrete representation for the phase-space flux, $\boldsymbol{\alpha}_h$, which, for example, looks like

$$\boldsymbol{\alpha}_h = \left(\mathbf{v},\, \frac{q}{m}\left(\mathbf{E}_h + \mathbf{v}\times\mathbf{B}_h\right)\right)$$
for the particular Boltzmann equation governing the evolution of a collisionless plasma, i.e., the Vlasov equation. The numerical flux function, $\hat{F}$, is some suitably appropriate prescription for the interface fluxes, such as upwind fluxes, in analogy with traditional finite-volume methods. In contrast to finite-volume methods, though, the numerical flux function has its own basis expansion, e.g., for central fluxes,

$$\hat{F} = \frac{1}{2}\left(\boldsymbol{\alpha}_h^{+} f_h^{+} + \boldsymbol{\alpha}_h^{-} f_h^{-}\right),$$
where the superscripts $-$ and $+$ denote the basis expansions of $\boldsymbol{\alpha}_h$ and $f_h$ evaluated just inside and just outside the cell interface, respectively.
While the discrete weak form is a mathematically complete formulation of the DG algorithm, to translate it into code a suitable choice of basis functions must be made to evaluate the integrals in the discrete weak form. Restricting ourselves to polynomial bases, a conventional approach in the application of DG methods to hyperbolic PDEs is a nodal basis, wherein the basis set is defined by a set of polynomials whose values are known at nodes. An example nodal basis in 1D is

$$f_h(x) = \sum_{j=1}^{N} f_j\,\ell_j(x),$$

where

$$\ell_j(x) = \prod_{\substack{m=1 \\ m\neq j}}^{N} \frac{x - x_m}{x_j - x_m}$$

are the Lagrange interpolating polynomials, and $x_j$ are the nodes by which the polynomials are defined. Because each polynomial is defined to take the value one at one node and zero at all other nodes, the coefficients $f_j$ are thus the values of the solution at the specific set of nodes.
Nodal bases are common in the DG literature because of the computational advantages they provide for many applications, most especially the simplification of many of the integrals if one substitutes products and other nonlinear combinations of basis functions as

$$(fg)_h(x) \approx \sum_{j=1}^{N} f_j\, g_j\,\ell_j(x).$$
Such a simplification reduces the number of operations required to evaluate the discrete weak form and numerically integrate the PDE of interest, but at a cost: aliasing errors are introduced into the solution, since nonlinear combinations of the nodal basis set are not contained in the basis [20, 21]. These aliasing errors have been studied in the context of fluid equations, such as the Euler equations, the Navier–Stokes equations, and the equations of magnetohydrodynamics, where it is found that they can have a destabilizing effect [26]. However, the computational gains from the simplification of the integrals are large enough that significant effort has been spent on mitigating these errors with filtering and artificial dissipation [7, 10, 8, 32] or split-form formulations [13, 14, 11, 12, 9, 38]. (In the split-form formulation, conservative and non-conservative forms of the equation at the continuous level are averaged to produce a different, but mathematically equivalent, and ultimately more computationally favorable, equation to discretize.) Because fluid equations involve explicit conservation relations, and the aliasing errors manifest at the smallest scales and highest wavenumbers, there is far less concern that mitigation techniques such as filtering or artificial dissipation will destroy the quality of the solution, at least at scales above the resolution of the simulation. The ability to control aliasing errors while maintaining the favorable computational complexity of a nodal scheme is of tremendous utility for the simulation of large-scale problems in computational fluid dynamics.
Unfortunately, these aliasing errors are intolerable for kinetic equations. The catastrophic nature of aliasing errors for kinetic equations arises from the physics content of the DG discretization itself when employing polynomial bases. To take the example of the Vlasov equation for the evolution of a collisionless plasma, when employing at least piecewise quadratic polynomials, the DG discretization involves the evolution of the $|\mathbf{v}|^2$ moment of the particle distribution function. But the $|\mathbf{v}|^2$ velocity moment is the particle energy, whose evolution is given by

$$\frac{\mathrm{d}}{\mathrm{d}t}\sum_j \int_{\mathcal{K}_j} \frac{1}{2} m |\mathbf{v}|^2 f_h \,\mathrm{d}\mathbf{z} = \int \mathbf{J}_h\cdot\mathbf{E}_h \,\mathrm{d}^3x,$$

where we have summed over all cells to eliminate the surface term, as in [22], and substituted for the volume term the discrete exchange of energy between the particles and the electromagnetic fields, $\int \mathbf{J}_h\cdot\mathbf{E}_h\,\mathrm{d}^3x$.
In order for this substitution to be valid, the integrations of the surface and volume terms must be performed exactly, or at least to a high precision, lest the aforementioned aliasing errors manifest themselves as the “energy content” of the velocity moments being transported in uncontrolled and undesirable ways. It would be nigh impossible to correct the rearrangement of the “energy content” of the basis expansion in a physically reasonable way, much less a stable way, because these errors enter at all scales and in both fields and particles, destroying a fundamental property of the equation system: the exchange of energy between the plasma and the electromagnetic fields is given by $\int \mathbf{J}\cdot\mathbf{E}\,\mathrm{d}^3x$. If we cannot safely apply standard techniques such as filtering to mitigate aliasing errors, we must eliminate these errors in their entirety.
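At the continuous level, the energy bookkeeping that the discrete scheme must reproduce reads, for a single species with current $\mathbf{J} = q\int \mathbf{v}\, f\,\mathrm{d}^3v$ (a standard textbook derivation, included here for orientation):

```latex
\begin{align*}
\frac{\mathrm{d}}{\mathrm{d}t}\int \tfrac{1}{2} m |\mathbf{v}|^2 f \,\mathrm{d}^3x\,\mathrm{d}^3v
  &= q\int \mathbf{v}\cdot\mathbf{E}\, f \,\mathrm{d}^3x\,\mathrm{d}^3v
   = \int \mathbf{J}\cdot\mathbf{E}\,\mathrm{d}^3x, \\
\frac{\mathrm{d}}{\mathrm{d}t}\int \left(\frac{\epsilon_0 |\mathbf{E}|^2}{2}
  + \frac{|\mathbf{B}|^2}{2\mu_0}\right)\mathrm{d}^3x
  &= -\int \mathbf{J}\cdot\mathbf{E}\,\mathrm{d}^3x,
\end{align*}
```

so the total particle-plus-field energy is constant; a discretization that mis-integrates either side loses this exact cancellation.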
Eliminating aliasing errors with a nodal basis comes at a high cost, though. The use of numerical quadrature, even anisotropic quadrature as in [22], leads to a computational complexity of $O(N_q N_p)$, where $N_q$ is the number of quadrature points required to exactly integrate the nonlinear term(s) in the kinetic equation and $N_p$ is the number of basis functions per cell. The number of quadrature points increases exponentially with dimensionality, leading to an incredibly expensive numerical method for five- and six-dimensional problems.
We can gain insight into how to manage this cost, while respecting our requirement of the complete elimination of aliasing errors, by considering the fundamental operation of our DG method. Substituting the full expansions of the phase-space flux, $\boldsymbol{\alpha}_h$, and the distribution function, $f_h$, into the volume term of the discrete weak form gives

$$\int_{\mathcal{K}_j} \nabla_{\mathbf{z}} w \cdot \boldsymbol{\alpha}_h f_h \,\mathrm{d}\mathbf{z} \;\longrightarrow\; \sum_{m}\sum_{n} \mathbf{C}_{\ell m n}\cdot\boldsymbol{\alpha}_m f_n, \qquad \mathbf{C}_{\ell m n} = \int_{\mathcal{K}_j} \nabla_{\mathbf{z}}\phi_\ell\,\phi_m \phi_n \,\mathrm{d}\mathbf{z},$$

where we have encompassed the spatial discretization in the evaluation and convolution of the entries of the tensor $\mathbf{C}_{\ell m n}$. If this tensor is dense, the convolution required to evaluate the volume integral in the discrete weak form suffers the same curse of dimensionality as the use of numerical quadrature.
However, if $\mathbf{C}_{\ell m n}$ could be made sparse, this would correspond to a systematic reduction in the number of operations required to evaluate the volume integral, and thus in the number of operations needed to numerically integrate the kinetic equation with our DG method. We can indeed sparsify $\mathbf{C}_{\ell m n}$ with the use of a modal, orthonormal polynomial basis set, as many entries of the tensor are zero when the basis functions are orthonormal. In addition to reducing the number of operations required to evaluate the volume integral, the use of an orthonormal basis to sparsify $\mathbf{C}_{\ell m n}$ allows for a complete redesign of the algorithm to maximize performance on modern architectures.
We now describe the principal algorithmic advancement of this paper: an alias-free, matrix-free, and quadrature-free DG algorithm for kinetic equations. By choosing a modal, orthonormal polynomial basis, we can symbolically integrate the individual terms of the tensor $\mathbf{C}_{\ell m n}$ and explicitly evaluate the sums which form the core of the update formulae. We construct computational kernels using the Maxima [31] CAS to evaluate sums such as

$$\sum_{m}\sum_{n} \mathbf{C}_{\ell m n}\cdot\boldsymbol{\alpha}_m f_n,$$

with similar computational kernels for the surface integrals. We show an example computational kernel for the volume integral of the Vlasov equation in Figure 1, for the piecewise-linear tensor-product basis in one spatial and two velocity dimensions (1X2V).
Figure 1 shows a C++ computational kernel that can be called for every cell of a structured, Cartesian grid in phase space; we pass to the kernel all the information required to determine where we are physically in phase space, i.e., the local cell-center coordinate and the grid cell size. The output of this computational kernel, the out array, forms a component of a system of ordinary differential equations for the expansion coefficients, in which the volume term computed by the kernel is combined with an operation that encodes the evaluation of the surface integrals on each surface of the cell and can likewise be pre-generated using a CAS. (In the construction of this ordinary differential equation system, the mass matrix $M_{\ell k} = \int_{\mathcal{K}_j}\phi_\ell \phi_k\,\mathrm{d}\mathbf{z}$ must formally be inverted, but due to the choice of a modal, orthonormal basis this matrix is the identity matrix and thus requires no additional operations to invert.) Given the computation of the surface and volume integrals in every cell, this system of ordinary differential equations can be discretized with an appropriate ODE integrator, such as a strong-stability-preserving Runge–Kutta (SSP-RK) method, as is done in Gkeyll. We note that we likewise have computational kernels for Maxwell’s equations, or another set of field equations such as the Poisson equation for self-gravitating systems, which must be evaluated at each stage of an SSP-RK method to complete the field-particle coupling. Notably, the computational kernel in Figure 1 has no matrix data structure, much less any requirement to perform quadrature, since we have already analytically evaluated the integrals which make up the entries of $\mathbf{C}_{\ell m n}$ with a CAS and written out the results in double precision. Further, we unroll all loops, eliminate common subexpressions, and collect terms to ensure that the update uses few FLOPs. (Ensuring the least possible FLOP count is a difficult problem, and we apply the most reasonably straightforward tricks we can think of; a further reduction is likely possible and could be explored with more sophisticated optimization tools.) Using the local cell center and grid spacing, we construct the expansion of the phase-space flux, $\boldsymbol{\alpha}_h$, for each dimension, and then compute the convolution of the tensor summed over each component of the phase-space flux. Thus, not only is the method alias-free, because the integrals which form our spatial discretization have been evaluated to machine precision; it is also quadrature-free and matrix-free. Such quadrature-free methods using orthogonal polynomials were studied in the early days of the DG method [2, 28] and are still applied to a variety of linear hyperbolic equations, such as the acoustic wave equation for studies of seismic activity, the level-set equation, and Maxwell’s equations [25, 30, 27, 24].
Even for alternative formulations of DG which do not seek to eliminate aliasing errors by exactly integrating the components of the discrete weak form, matrix-free implementations are desirable to reduce the memory footprint of the scheme [6].
To our knowledge, the alias-free, matrix-free, and quadrature-free algorithm constructed here for kinetic equations, especially nonlinear kinetic equations such as the Vlasov equation for collisionless plasma dynamics, is the first instance of such an algorithm design in the literature. This particular algorithm design has numerous advantages. The sparseness of our alias-free, matrix-free, and quadrature-free DG algorithm leads to a reduction in the number of operations required to evaluate the volume and surface integrals; e.g., the computational kernel in Figure 1 requires far fewer multiplications than the corresponding update for numerical quadrature applied to an alias-free nodal basis. In addition, the reduced memory footprint from requiring no matrix data structure, together with the unrolling of the tensor-tensor convolutions, leads to additional performance improvements, e.g., from compiler optimizations such as common-subexpression elimination. To determine quantitatively the computational complexity, and to evaluate more precisely the performance of the alias-free, quadrature-free, and matrix-free DG algorithm, we perform a numerical experiment in the next section.
III Computational complexity
Although we have evidence from the computational kernel presented in Figure 1 that the number of operations is indeed reduced compared to the use of numerical quadrature, we would like to determine generally how sparse the tensors required to update the discrete kinetic equation are. We again take the example of the Vlasov equation and, in Figure 2, perform a numerical experiment using the computational kernels generated for a variety of combinations of basis expansion and dimensionality.
We show the time to evaluate the computational kernels in a phase-space cell for just the streaming term in the left plot of Figure 2, and for the full phase-space update, streaming plus acceleration, in the right plot. From the scaling of the cost to evaluate these computational kernels, we can determine the computational complexity of the algorithm with respect to the number of degrees of freedom per cell, i.e., the number of basis functions in our expansion, $N_p$.
It is immediately apparent that, even with the steepening of the scaling as the number of degrees of freedom increases, there is at least some gain over the use of direct quadrature to evaluate the integrals in the discrete weak form because, at worst, the total (streaming plus acceleration) update scales sub-quadratically in $N_p$. In fact, this worst-case scaling is exactly the scaling obtained by under-integrating the nonlinear term in a nodal basis [20, 21]. But critically, we have obtained this same (or better) computational complexity while eliminating aliasing errors from our scheme, as we require for stability and accuracy.
However, the improvement in the scaling is actually better than it first appears. The scaling shown in Fig. 2 is the cost scaling of the full update needed to perform a forward Euler step in a phase-space cell; i.e., in six dimensions, three spatial and three velocity, the total update time in the right plot of Fig. 2 is the time to compute the six-dimensional volume integral plus the twelve required five-dimensional surface integrals. (We mention that the choice of an orthonormal basis and analytically computing all integrals leads to the rather surprising result that the 6D volume integral is actually much, much cheaper than the surface integrals; in fact, the total cost of our algorithm is driven almost entirely by the surface-integration costs.) This means the scaling we are quoting is irrespective of the dimensionality of the problem, unlike in the case of the nodal basis, where the quadrature must be performed for every integral and there is a hidden dimensionality factor in the scaling. In other words, in six dimensions, what at first may seem a modest improvement in moving from a nodal to an orthonormal, modal representation is in fact a substantially larger improvement in the scaling once one includes the dimensionality factor, up to the constant of proportionality of the scaling. Of course, one must also compare the size of the constants of proportionality multiplying both scalings to accurately assess the reduction in the number of operations and the improvement in the overall performance, since the constant of proportionality can make the picture either rosier, with the improvement in performance larger than expected, or more dire, with the improvement in the scaling offset by a larger constant.
To determine the constant of proportionality, we perform a more thorough numerical experiment and compare the cost of the alias-free nodal scheme and the alias-free modal scheme for a complete collisionless Vlasov–Maxwell simulation. We consider the following test: a 2X3V computation done with both the nodal and the modal algorithms, with a detailed timing breakdown of the most important step of the algorithm, the Vlasov time step. The reader is referred to Table I for a summary of the following two paragraphs if they wish to skip the details of the computer architecture and the optimizations employed. Both computations are performed in serial on a MacBook Pro with an Intel Core i7-4850HQ (“Crystal Well”) chip, the same architecture on which the scaling analysis in Figure 2 was performed. The only optimization flag used in compiling both algorithms is “-O3,” and both versions of the code are compiled with the Clang 9.1 C++ compiler.
Specific details of the computations are as follows: the same grid for both runs, with polynomial order two and the Serendipity basis, i.e., 112 degrees of freedom per cell. The two simulations were run for a number of time steps to allow us to accurately compute the time per step of just the Vlasov solver, as well as the time per step of the complete simulation. The time-stepper of choice for this numerical experiment is the three-stage, third-order SSP-RK method [34, 5]. To make the simulations as realistic as possible in terms of memory movement, we also evolve both a “proton” and an “electron” distribution function, i.e., we evolve the Vlasov–Maxwell system of equations for two plasma species.
To make the comparison as favorable as possible for the nodal algorithm, we employ the highly tuned Eigen linear algebra library, Eigen 3.3.4 [17], to perform the dense matrix-vector multiplies required to evaluate the higher-order quadrature needed to eliminate aliasing errors in the nodal DG discretization. We note further that the nodal algorithm is optimized to use anisotropic quadrature (of just high enough order to eliminate aliasing) and uses only the surface basis functions in the surface-integral evaluations, so we are doing as much as possible to reduce the cost of the alias-free nodal scheme.
The results are as follows: with the nodal basis, the computation required 1079.63 seconds per time step, of which 1033.89 seconds were spent solving the Vlasov equation. The remaining time is split between the solution of Maxwell’s equations, the computation of the current from the first velocity moment of the distribution function to couple the particles and the fields, and the accumulation of each stage of our three-stage Runge–Kutta method. With the modal basis, the computation required 67.43 seconds per time step, of which 60.34 seconds were spent solving the Vlasov equation.
In the nodal case, we emphasize that we achieve a reasonable CPU efficiency, and the nodal timings are not a matter of poor implementation. We estimate the number of multiplications in the alias-free nodal algorithm required to perform a full time step to be approximately $3\times 10^{12}$, three trillion, once one considers the fact that we are evolving two distribution functions with a three-stage Runge–Kutta method. One thousand seconds to perform three trillion multiplications corresponds to an efficiency of $3\times 10^{9}$ flops per second (3 GFlops/s). This estimate is within 50 percent of the measured efficiencies of Eigen’s matrix-vector multiplication routines for Eigen 3.3.4 on a CPU architecture similar to the one employed for this test [17], so we argue that the cost of the alias-free nodal algorithm is due to the number of operations required and not an inefficient implementation of the algorithm. It is then worth discussing how this improvement in the timings with the modal algorithm compares with our expectations. Given the scaling of the modal basis, we would anticipate the gain in efficiency in five dimensions to be around a factor of twenty: a factor of four from the reduction in the operation-count scaling, and a factor of five from the latter scaling containing all of the five-dimensional volume integrals and the ten four-dimensional surface integrals. We can see that the gain in just the Vlasov solver is roughly $17\times$, while the gain in the overall time per step is roughly $16\times$: not quite as much as we would naively expect, but still a sizable increase in the speed of the Vlasov solver. The overall time is reduced less than the Vlasov solve time because, while the time to solve Maxwell’s equations and to compute the currents coupling the Vlasov equation and Maxwell’s equations is reduced, these other two costs, in addition to the cost of accumulating each Runge–Kutta stage, are not reduced as dramatically as the time to solve the Vlasov equation.
Table I: Summary of the nodal-modal timing comparison.

  Computer:            MacBook Pro (High Sierra OS)
  Architecture:        Intel Core i7-4850HQ (“Crystal Well”)
  Compiler:            Clang 9.1 C++
  Optimization flags:  “-O3”; Eigen 3.3.4 for the nodal algorithm
  Grid size:
  Polynomial order:    Serendipity, quadratic; 112 degrees of freedom per cell

  Nodal total time:    1079.63 s    Modal total time:   67.43 s    (reduction ~16.0x)
  Nodal Vlasov time:   1033.89 s    Modal Vlasov time:  60.34 s    (reduction ~17.1x)
Again, the details of this comparison are summarized in Table I.
IV Gkeyll implementation and parallel scaling
The modal kinetic solvers are implemented in Gkeyll, modern computational software designed to solve a broad variety of plasma problems. Though the focus here is full kinetics (Boltzmann equations coupled to field equations), Gkeyll also contains solvers for the gyrokinetic equations [19, 29] as well as for multi-moment multi-fluid equations [37, 4].
Gkeyll uses a number of software innovations, which we describe briefly here for completeness. In the context of this paper, the key features of Gkeyll are, first, a low-level infrastructure to build new solvers and, second, a high-level “App” system that allows putting solvers together to perform a particular class of simulation. The low-level computational kernels that update a single cell (via volume and surface DG updates), and that compute moments and other quantities needed in the update sequences, are written in C++ and auto-generated using the Maxima [31] CAS. As discussed in the previous section, the use of a CAS allows us to compute most of the integrals needed in the update analytically, eliminating all quadrature and unrolling all inner loops to eliminate matrices.
The high-level App system is written in a JIT-compiled language, LuaJIT. Lua is a small, lightweight language that is compiled into the framework. Despite its simplicity, it is a subtle and powerful language, with a prototype-based object system and coroutines that provide great flexibility in composing complex simulations. Further, the LuaJIT compiler produces extremely optimized code, often performing at the level of, or better than, handwritten C, giving the best of both worlds: the flexibility of a high-level language and the speed of a compiled language. We note that less than 8% of Gkeyll (about 36K LOC) is handwritten LuaJIT; the rest is C++ auto-generated via the Maxima CAS. This structure greatly reduces maintenance issues, as one only needs to ensure that the CAS code is bug-free, rather than coding all loops, tensor-tensor products, and quadratures by hand, especially for complex functionality such as the full coupling between the Boltzmann equation, Maxwell’s equations, and a collision operator.
The Gkeyll App system greatly simplifies user interaction with the code. The flexibility of the scripting layer gives the user great control over the simulation cycle; in fact, every aspect of the simulation can be controlled by the user without writing any compiled code, or even needing a compiler suite. Users can, however, compile compute-intensive code by hand and load it into Gkeyll using the LuaJIT FFI. In addition, the App system streamlines not just the running of a simulation but also the manipulation of the data. While post-processing can be done through a suite of tools called the postgkyl package (see the Gkeyll website [16, 15] for details), computationally intensive analysis can also be run through Gkeyll via the App system.
An additional software innovation is the two layers of parallel decomposition used by Gkeyll. This multi-layer decomposition is necessary because there are three grids involved in a kinetic simulation: the phase-space grid, the configuration-space grid, and the velocity-space grid. The field solvers work on the configuration-space grid, while the Boltzmann equation evolves on the phase-space grid. The coupling via moments comes from velocity integrals of the distribution function, which lives on the phase-space grid. These grids, and the various communication patterns needed to move data between them, lead to a complex use of MPI.
The first level of parallel domain decomposition is in configuration space. Since the DG algorithm requires only one layer of ghost cells to compute the surface integrals along each direction, communication is minimized during the update of the Boltzmann–Maxwell system of equations. The second level of parallel domain decomposition comes from a shared-memory decomposition of the velocity grid. For this we use MPI shared-memory primitives to divide the work of updating a region of velocity space owned by a subset of the total number of cores. A further subset of the cores in each of these groups takes part in I/O and parallel communication. We use MPI_Datatype objects extensively to avoid unnecessary copying of data into and out of buffers.
The advantage of this two-level decomposition is that there is no need to all-reduce the moment data in velocity space. Further, the use of MPI shared-memory primitives eliminates the thread latency common to thread-based parallelism models such as OpenMP. In other words, our parallelism model removes the cost of creating and destroying threads while still improving the algorithm’s scaling on a single node, and it reduces the amount of memory consumed per node by eliminating the need for ghost layers in the intra-node work. In fact, due to the high dimensionality of the problem (5D or 6D), even a single layer of ghost cells needs significant memory (as the ghost layers are themselves 4D/5D objects) and hence communication time. The use of shared memory on a node significantly reduces memory consumption, sometimes severalfold.
In Figure 3 we show the weak and strong scaling of our pure-MPI domain decomposition for a six-dimensional problem on the Knights Landing (KNL) architecture of the Theta supercomputer. Even though we are not using a thread-based programming model on a node, we can still take advantage of all 256 hardware threads on the KNL chip by specifying at runtime that we are using 256 shared MPI processes. These options provide significant benefit for our DG algorithm: using all 256 threads not only divides the work amongst a larger number of processes, but running multiple processes per core also exposes a much greater degree of instruction-level parallelism, reducing the cycles per instruction and leading to greater floating-point efficiency. Instruction-level parallelism is particularly useful for our application because our unrolled sparse tensor-tensor products, such as the computational kernel shown in Figure 1, while sparse relative to the nodal algorithm, still compile to large instruction streams. Without sufficient parallelism, clock cycles are wasted as further instructions are fetched to complete the sparse tensor-tensor product.
In fact, instruction-level parallelism is the reason our weak scaling is more favorable than our strong scaling: the weak-scaling runs maintain enough work per node to extract high efficiency from 256 shared MPI processes, whereas the fixed problem size does not provide enough work per node on a decomposition using the full machine. Thus, performance degrades within the node in the strong-scaling case, even though communication is minimized by our DG algorithm. Due to the memory requirements of solving a six-dimensional partial differential equation, we are limited in the base problem size we can choose for strong scaling. Nevertheless, the alias-free, matrix-free, and quadrature-free DG algorithm, using a large suite of MPI-3 functionality including the shared-memory primitives, scales well up to the full machine size (4096 KNL nodes and 1 million MPI processes). Details of the scaling study can be found in the supplementary material. We have also performed similar scaling studies on the Stampede2 supercomputer and other local clusters.
V Example Simulations
We now briefly show the results of an example simulation run with the alias-free, matrix-free, and quadrature-free DG algorithm presented in this paper. We repeat a calculation from previous publications that used Gkeyll [35, 23]. This particular simulation demonstrates the utility of a continuum kinetic approach: the high-fidelity representation of the particle distribution function provides critical insights into the dynamics of this kinetic system, in this case a collisionless plasma.
The setup is an electron-proton plasma in two spatial dimensions and two velocity dimensions (2X2V), with the electron population initially divided between two counter-streaming beams. These counter-streaming beams serve as a source of free energy for a zoo of plasma instabilities, including two-stream, filamentation, and hybrid two-stream-filamentation modes [3]. In the limits explored in [35], the authors found that as the beams became more non-relativistic and colder, such that the beams' initial energy was dominantly kinetic, a large spectrum of hybrid two-stream-filamentation, or oblique, modes all had comparable growth rates. With multiple unstable modes growing and vying for dominance, the nonlinear saturation of these instabilities led to a highly dynamic phase space. This highly dynamic phase space had a significant impact on the late-time evolution of the plasma: collisionless damping of the saturated modes depleted the electromagnetic energy generated by the unstable modes, leading to overall energy conversion from kinetic to electromagnetic to thermal.
We show in Figure 4 the electron distribution function at three different times (the initial condition, the time of nonlinear saturation when the electromagnetic energy peaks, and the end of the simulation), in two different slices of phase space (top and bottom rows).
These distribution function slices demonstrate the phase-space structure that can be represented with a continuum kinetic method such as the alias-free, matrix-free, and quadrature-free DG algorithm presented here. We emphasize that this phase-space structure is an important component of the dynamics: the authors of [23], comparing the results of [35] to a particle-based particle-in-cell (PIC) method, found that the noise inherent to the PIC algorithm can pollute the nonlinear evolution of these instabilities, an important caveat on the use of PIC algorithms and the caution needed in interpreting their output in some situations. Further details for this simulation can be found in the supplementary material.
VI Conclusion
In this paper we have presented, to our knowledge, the first alias-free, matrix-free, and quadrature-free scheme for continuum simulation of kinetic problems. Kinetic problems are characterised by delicate field-particle energy exchange that requires great care to ensure that aliasing errors do not modify the physics contained in the system. Further, as kinetic systems evolve in high-dimensional phase space (5D/6D), it is important to minimize the computational cost while retaining accuracy and convergence order. Our modal DG scheme achieves this by computing all needed volume and surface integrals analytically using a computer algebra system and generating the computational kernels automatically. These kernels leverage the sparsity of the tensor-tensor convolutions with a modal, orthonormal basis; unroll all loops, eliminating the need for matrices; and consolidate common subexpressions with common factors pulled out. This leads to a dramatic reduction in FLOPs and data movement, significantly reducing computing time compared to a nodal DG code even when the latter uses highly optimized linear algebra libraries. Critically, despite the analytical elimination of aliasing, we still obtain subquadratic scaling of cost with degrees of freedom per cell.
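The flavor of these generated kernels can be conveyed with a toy sketch (the matrix and its entries are invented for illustration and are not Gkeyll's actual CAS output): instead of multiplying modal coefficients by a stored matrix that is mostly structural zeros, the generated kernel hard-codes only the nonzero terms, with common factors pulled out.

```python
import numpy as np

# Dense form: a small "mass-like" operator acting on modal coefficients f.
# For an orthonormal modal basis most entries are structural zeros.
A = np.array([[2.0, 0.0, 0.5],
              [0.0, 2.0, 0.0],
              [0.5, 0.0, 2.0]])

def apply_dense(f):
    """Generic matrix-vector product: touches all 9 entries, zeros included."""
    return A @ f

def apply_unrolled(f):
    """Auto-generated style: loops unrolled, structural zeros skipped,
    and the common coupling factor 0.5 written once per term."""
    return np.array([2.0 * f[0] + 0.5 * f[2],
                     2.0 * f[1],
                     0.5 * f[0] + 2.0 * f[2]])
```

At this size the savings are trivial, but for the high-order, high-dimensional tensor-tensor convolutions in a 5D/6D update the skipped zeros and consolidated factors account for the FLOP and data-movement reductions described above.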
Our scheme is implemented in a flexible, open-source computational plasma physics framework, Gkeyll. This framework allows flexible construction of simulations using a powerful “App” system. Gkeyll is mostly written in LuaJIT, a JIT-compiled language, with key computational kernels auto-generated in C++. A hybrid MPI/shared-memory-MPI domain decomposition allows us to reduce communication within nodes and ensures almost linear scaling on a single node, while retaining excellent scaling properties across nodes. We have demonstrated this on the Theta supercomputer all the way up to the full machine (1 million MPI processes).
Our present algorithmic work is focused on two areas: adding a multi-moment model coupled to the kinetics, which will lead to a unique hybrid moment-kinetic simulation capability (most hybrid PIC codes assume massless, isothermal electrons), and a novel recovery-based DG scheme that will further increase accuracy, reducing resolution requirements. The recovery-based approach is very promising, as it may allow achieving higher-order convergence with a given set of DG basis functions than traditional DG schemes obtain with a basis of the same order. Such an increase in accuracy can allow the use of coarser meshes, further dramatically reducing the computational cost for 5D/6D problems. The recovery approach is complex, though, and benefits greatly from the advances in CAS-generated code reported here. Combined with the flexibility of the Gkeyll code, these innovations will enable larger problems of interest in a broad array of fields.
Acknowledgment
We thank the Gkeyll team for contributions to various parts of the code and the work we have presented here. In particular, we thank Mana Francisquez for help in implementing the collision operators, moment computations, and other core code; Noah Mandell for help with the high-level App system (and for authoring the gyrokinetic solvers); and Petr Cagas for implementing the post-processing tools (and the BGK collision operators and boundary conditions). We also thank Jason TenBarge, Greg Hammett, and Bill Dorland for extensive discussions on various aspects of the physics of the Vlasov-Maxwell system.
References
 [1] (2011) The Serendipity Family of Finite Elements. Found. Comput. Math. 11 (3), pp. 337–344.
 [2] (1998) Quadrature-free implementation of the discontinuous Galerkin method for hyperbolic equations. AIAA Journal 36 (5), pp. 775–782.
 [3] (2009) Weibel, two-stream, filamentation, oblique, Bell, Buneman… which one grows faster? Astrophys. J. 699 (2), pp. 990–1003.
 [4] (2019) Global Ten-Moment Multifluid Simulations of the Solar Wind Interaction with Mercury: From the Planetary Conducting Core to the Dynamic Magnetosphere. Geophysical Research Letters 124, pp. 2019GL083180–13.
 [5] (2010) Numerical Methods for Fluid Dynamics: With Applications to Geophysics. Vol. 32, Springer Science & Business Media.
 [6] (2019) A matrix-free high-order discontinuous Galerkin compressible Navier-Stokes solver: a performance comparison of compressible and incompressible formulations for turbulent incompressible flows. Int. J. Numer. Methods Fluids 89 (3), pp. 71–102.
 [7] (2001) Filter-based stabilization of spectral element methods. Comptes Rendus de l'Académie des Sciences, Series I, Mathematics 332 (3), pp. 265–270.
 [8] (2016) Simulation of underresolved turbulent flows by adaptive filtering using the high order discontinuous Galerkin spectral element method. J. Comput. Phys. 313, pp. 1–12.
 [9] (2017) On the use of kinetic energy preserving DG-schemes for large eddy simulation. J. Comput. Phys. 350, pp. 782–795.
 [10] (2013) On the accuracy of high-order discretizations for underresolved turbulence simulations. Theor. Comput. Fluid Dyn. 27 (3), pp. 221–237.
 [11] (2016) A well balanced and entropy conservative discontinuous Galerkin spectral element method for the shallow water equations. Appl. Math. Comput. 272, pp. 291–308.
 [12] (2016) Split form nodal discontinuous Galerkin schemes with summation-by-parts property for the compressible Euler equations. J. Comput. Phys. 327, pp. 39–66.
 [13] (2013) A skew-symmetric discontinuous Galerkin spectral element discretization and its relation to SBP-SAT finite difference methods. SIAM J. Sci. Comput. 35 (3), pp. A1233–A1253.
 [14] (2014) A kinetic energy preserving nodal discontinuous Galerkin spectral element method. Int. J. Numer. Methods Fluids 76 (1), pp. 28–50.
 [15] (2020) Gkeyll Code Documentation. http://gkeyll.rtfd.io
 [16] (2020) Gkeyll Github Source Repository. https://github.com/ammarhakim/gkyl
 [17] (2010) Eigen v3. http://eigen.tuxfamily.org
 [18] (2019) Conservative Discontinuous Galerkin Schemes for Nonlinear Fokker-Planck Collision Operators. arXiv:1903.08062.
 [19] (2020) Continuum electromagnetic gyrokinetic simulations of turbulence in the tokamak scrape-off layer and laboratory devices. Physics of Plasmas, pp. 1–12.
 [20] (2007) Nodal Discontinuous Galerkin Methods: Algorithms, Analysis, and Applications. Springer Science & Business Media.
 [21] (2012) Explicit discontinuous Galerkin methods for unsteady problems. Computers & Fluids 61, pp. 86–93.
 [22] (2018) Discontinuous Galerkin algorithms for fully kinetic plasmas. J. Comput. Phys. 353, pp. 110–147.
 [23] (2020) Noise-induced magnetic field saturation in kinetic simulations. arXiv:2004.07255.
 [24] (2020) A matrix-free Discontinuous Galerkin method for the time dependent Maxwell equations in unbounded domains. arXiv:2002.08733.
 [25] (2006) An arbitrary high-order discontinuous Galerkin method for elastic waves on unstructured meshes – I. The two-dimensional isotropic case with external source terms. Geophys. J. Int. 166 (2), pp. 855–877.
 [26] (2003) Dealiasing on non-uniform grids: algorithms and applications. J. Comput. Phys. 191 (1), pp. 249–264.
 [27] (2012) Computer Algebra Meets Finite Elements: An Efficient Implementation for Maxwell's Equations. In Numerical and Symbolic Scientific Computing: Progress and Prospects, pp. 105–121.
 [28] (1999) Efficient implementations of the quadrature-free discontinuous Galerkin method. In 14th Computational Fluid Dynamics Conference. doi:10.2514/6.1999-3309.
 [29] (2020) Electromagnetic full gyrokinetics in the tokamak edge with discontinuous Galerkin methods. Journal of Plasma Physics 86 (1).
 [30] (2006) A quadrature-free discontinuous Galerkin method for the level set equation. J. Comput. Phys. 212 (1), pp. 338–357.
 [31] (2019) Maxima (Website). http://maxima.sourceforge.net/
 [32] (2017) On the eddy-resolving capability of high-order discontinuous Galerkin approaches to implicit LES / underresolved DNS of Euler turbulence. J. Comput. Phys. 330, pp. 615–623.
 [33] (2020) Parker Solar Probe. https://www.nasa.gov/content/goddard/parkersolarprobe
 [34] (2002) A survey of strong stability preserving high order time discretizations. Collected Lectures on the Preservation of Stability under Discretization 109, pp. 51–65.
 [35] (2019) Temperature-dependent saturation of Weibel-type instabilities in counter-streaming plasmas. Astrophys. J. Lett. 872 (2), pp. L28.
 [36] (2020) The Gaia Mission. sci.esa.int/web/gaia
 [37] (2019) Exact and Locally Implicit Source Term Solvers for Multifluid-Maxwell Systems. arXiv:1909.04125v2.
 [38] (2018) A comparative study on polynomial dealiasing and split form discontinuous Galerkin schemes for underresolved turbulence computations. J. Comput. Phys. 372, pp. 1–21.