I Introduction
Studies in computational plasma physics aim to explain experimental results and confirm theory. Plasma theory generally takes either the kinetic or fluid approach to modeling plasma particles.
The firstprinciples description of a plasma is kinetic. Kinetic theory describes the plasma as a six dimensional phase space probability distribution for each particle species. Kinetic theory makes no assumptions regarding thermal equilibrium and thus may result in multimodal arbitrary distributions.
In the fluid approach, it is assumed that the details of the distribution functions can be neglected and a given fluid parcel can be described by just its density, momentum and temperature. Fluid models are generally derived by marginalizing out the velocity dependence of the fully kinetic description.
Certain regimes of space and laboratory plasmas must be simulated using kinetic models, which capture all the relevant physics but are computationally more expensive. Two numerical approaches are typically used: Continuum (Vlasov) solvers and ParticleinCell (PIC) methods. PIC codes are an example of a fully kinetic (six dimensional) solver. PIC codes discretize the distribution function according to the vlasov equation and then subsample to represent regions of plasma as macroparticles. These macroparticles are advanced by fields defined by the electromagnetic Maxwell equations. This method is computationally more tractable than a direct continuum solver, however it unfortunately introduces two sources of intrinsic noise.
The first is systemic noise inherent to the discrete plasma representation and the mapping between a discrete mesh and continuous particle positions [12]
. While the particle’s position is represented by continuous 3D space, it must be mapped to a 3D discrete mesh where the fields live in order to interpolate the field values and advance the particle’s momentum and position.
The second source of noise is introduced in recovering the particle distribution functions from the output of the simulation. This is known as likelihoodfree inference or simulation based inference. The samples, or particles in this case, are data generated by advancing the simulation through some number of timesteps. The simulation with parameters , represents some implicit and unknown likelihood function . Traditionally this likelihood was recovered by binning the particles into histograms. Because the simulation must make compromises on the number of particles and other numerical constants encapsulated by for tractability reasons, the resulting likelihoods tend to contain noise.
For the remainder of this work we will use the terms particle distribution function and likelihood interchangeably.
A possible method of denoising the likelihood lies in generative modeling. Generative modeling has shown great success in denoising and super resolution tasks
[21, 1, 2, 3, 4]. Generating an accurate denoised distribution function from PIC codes which encapsulates the underlying physics and matches the results predicted by continuum codes, would introduce a reliable method for crosscode validation as well as cut costs by allowing for inference on commodity hardware.Ia Contribution
In this work we aim to motivate the use of robust generative modeling techniques as a novel solution to the noise inherent to the distribution functions produced by PIC methods. We will apply techniques from generative modeling to denoise our nongaussian data, performing likelihoodfree inference without violating the physical constraints of the fully kinetic model. We will then demonstrate that this technique may be expanded to encapsulate temporal dynamics. These experiments will be used as motivation for future coreedge coupling studies mapping distributions generated from PIC codes to distributions solved by continuum codes.
Ii Background
Iia Particle Distribution Function
The baseline particle distribution function (PDF) is seven dimensional, three spatial and three velocity components plus time per ion species,
Normally, for analysis we look at a subdomain region of the simulation to study plasma evolution. This amounts to marginalizing the distribution function over space and taking specific time slices resulting in a multivariate gaussian where the plasma bulk flow parameterizes the means and the temperature parameterizes the covariance. This is known as a Maxwellian distribution
(1) 
It is important to note that this only holds true for an idealized plasma in thermalequilibrium. As the domain evolves through the course of a simulation various processes will cause a departure from the Maxwellian form. The resulting PDF will be of arbitrary form and temporally dynamic, making the task of modeling density/datadriven likelihoods particularly difficult.
IiB Generative Modeling
A generative model’s aim is to represent a probability distribution in a tractable fashion such that it is capable of generating new samples. Concretely, given a datapoint
, can we learn an approximation to the true distribution such that we may generate new samples. The likelihood of the generated samples should closely match the likelihood of the data used to train the model. We refer to likelihoods learned from data as datadriven likelihoods (DDL). Our data was produced by a simulation with predefined parameters which represents the implicit likelihood, so we can say we want to find the DDL which approximatesRecent advances in machine learning have produced a wide variety of generative techniques. Chief among these are variational autoencoders (VAE), generative adversarial networks (GAN), and expectation maximization (EM) algorithms.
The VAE is a maximum likelihood estimator that approximates the evidence by maximizing the evidence lower bound
[14]. The core problem with this approach lies in the approximation of the posterior. In achieving a closed form solution, one must know apriori the posterior’s functional form. The standard approach assumes a gaussian, as such it performs poorly on multimodal or nongaussian data. Alternative posteriors have been proposed in the literature [6], but these methods still require apriori knowledge of the posterior’s functional form.The GAN on the otherhand, doesn’t actually model the likelihood of the data. Its goal is to trick a discriminator into believing the generated samples have been drawn from the true distribution [13]. So while samples generated from the GAN may appear to be reflective of the simulation data, the possibility exists that we are not modeling the true likelihood. Relying on believable but arbitrary samples leaves no guarantee that our inference would respect the physical constraints of the domain in question.
EM algorithms performed on gaussian mixture models do well at modeling multimodal distributions, however it requires prior knowledge of the modality of the data. As we are looking to model our particle distribution functions at an arbitrary time during the evolution of the simulation, the modality is assumed to be dynamic.
IiC Normalizing Flows
A normalizing flow describes the transformation of a probability density through a sequence of invertible mappings [7]. Given data , a tractable prior , and a learnable bijective transformation we can apply the following change of variable formula to define a distribution on .
(2) 
Furthermore, defining to be a composite of a sequence of N bijective mappings, allows us to say
(3) 
where and . Optimizing on the negative log loss gives us a maximum likelihood model that allows for efficient sampling and density estimation. What remains to be specified is the class of bijective transfomation being used. To make this tractable, we would ideally pick a class which is easily invertible, flexible, and results in a Jacobian with a tractable determinant. For this work we use the Masked Autoregressive Flow (MAF).
The MAF offers a robust procedure for modeling our DDL. As an autoregressive model it aims to construct a conditional probability distribution for each feature, where the distribution is conditioned on all previous features. Assuming normal priors allows us to concisely say:
(4) 
(5) 
Where ,
are arbitrary functions parameterized by neural networks. We may generate new data as follows
(6) 
To ensure robust predictions we include a permutation of the features before each layer of the flow. This class of transformation, being autoregressive, results in a lower triangular Jacobian. It also easily extends to conditional probabilities. For further details on MAF please see [18].
We can see that the normalizing flow is convenient not only because it allows us to generate samples in an interpretible manner, but gives direct access to the density, allowing us to solve the likelihoodfree inference problem for the particle distribution function. For further details on normalizing flows we refer the reader to [8, 15, 19]
Iii Experiments
The following experiments were performed with data produced by the Particle Simulation Code (PSC) [12]. Multimodal and nongaussian behavior manifests itself in our data due to excitation processes. Particle excitation occurs through the acquisition of energy from an outside source, usually due to magnetic reconnection or collisionless shocks. In this case, our simulation parameters are very nearly described by [16].
Shown in Fig 1 is the temporal evolution of the data’s
marginalized distribution function (not normalized). We see that from T4 to T15 an energization process occurs which drives the multimodal behavior. Overlayed with the PDF is the normal distribution parameterized by our data’s mean and variance.
Iiia NonGaussianity
To motivate our use of the MAF we first demonstrate that our data is nongaussian. There are several methods which may be used to demonstrate this, including the
statistic of skewness and kurtosis, pairwise nongaussianity of datapoints, and the KullbackLeibler (KL) divergence. This suite of tests outlined by Diaz Rivero
[20] gives us an established holistic evaluation procedure.For brevity we focus only on the KullbackLeibler divergence test. Taking the null hypothesis to be that the our data is gaussian, we generate a normal distribution parameterized by the mean and variance of the data. We draw two separate sample batches from the normal distribution and calculate the KL divergence between the two in order to calculate the null hypothesis. It is well established that the KL divergence between two sample sets drawn from the same distribution will be variable on both the number of samples drawn and number of bins. Taking both numbers to be very large we are able to minimize this variability and achieve the expected minimal distance for the baseline. We then calculate the KL divergence between the data and a sample batch drawn from the normal distribution for comparison. Results in Fig
2 show that from T4 to T15 the KL divergence of the data is an order of magnitude greater than if the data was normally distributed, disproving the null hypothesis. This tells us that the data is nongaussian (nonMaxwellian) and that there are excitation processes occurring.IiiB Data Driven Likelihood
Having shown the nongaussianity of the data we can confidently state that the VAE, GAN, and EM algorithm will yield poor DDL. With this in mind we select the MAF as our generative model. Nflows, built and maintained by [11]
, is a standardized python library built on pytorch which provides a probabilistic machine learning framework. We constructed the MAF using Nflows and trained using negative log likelihood for 1000 epochs. The specific architecture consisted of an 8 layer flow, each layer of which contained a reverse permutation transformation and a masked affine autoregressive transformation. The affine transformations themselves consist of a scale and shift parameter, each of which is represented by a single hidden layer neural network containing 32 nodes. We take our base distribution to be a multivariate normal.
Training the flow using the negative log likelihood allows us to use the Adam optimizer to iteratively update the parameters of our model in an unsupervised manner. The flow is fed simulation data which is transformed and mapped to the base distribution. Each iterative update modifies the flow’s parameters so that the likelihood of the simulation data under the base distribution after transformation is maximized.
MAF Hyperparameters 


Layers  Permutation  Transformation  hidden nodes  Base Distribution 
8  Reverse  Masked Affine Autoregressive  32  Multivariate Normal 
Results may be seen in Fig 3 which show the binned distribution function of both the true data and samples generated by the model, with the smooth learned likelihood. Here we see the power of using a normalizing flow as a generative model. By gaining direct access to the density function we are able to work with a smooth approximation to what would otherwise be a noisy distribution. If we were to use this data to analyze kinetic processes we would traditionally use the PDF represented in frame A of Fig 3. This clearly contains noise at a level which could skew interpretation of the underlying physics. In frame C we see the DDL learned by the model, demonstrating a dramatic noise reduction in comparison to frame A.
IiiC Temporal Evolution
We can leverage the versatility of the normalizing flow by taking our base distribution to be a conditional normal where we condition on simulation time. This allows us to capture the underlying particle information at different times throughout the simulation and encapsulate that in our model. This is powerful in that we no longer need to store terabytes of particle data, we can compress that information into the parameters of our model and perform inference from commodity hardware.
Here we use a single layer neural network with 8 nodes and a ReLU activation to map the simulation time to the conditional parameters of our base distribution. We repeat the same training procedure as the previous section with the exception that we now use the data produced by the simulation at each interval of 1000 timesteps.
In the framework of kinetic theory, we can use the distribution function to directly calculate the conserved physical quantities of our system. The zeroth order moment is the number density, which may be scaled to the mass or charge density. The first order moment gives us momentum and the second order moment kinetic energy.
In Fig 4 we present the absolute percentage error of the zeroth, first, and second moment calculations directly between the raw data and the predictions of our model. As shown, the maximum error for the first 21,000 timesteps is always well below 1%, demonstrating that we have compressed the temporal evolution of our simulation into our generative model without violating the physical constraints of the system.
Iv Discussion and Conclusion
We have shown our data to be nongaussian and shown that we must be selective in which techniques we use to model it. We have shown that generative modeling with normalizing flows is flexible enough to learn our PDF. By applying the MAF to high dimensional particle data produced in PIC simulations we have successfully learned the DDL of the particles, resulting in a smooth tractable estimate of . The MAF is easily extendable to conditional distributions, which allowed us to encapsulate temporal dynamics into our model and which opens up room for further studies on adaptable subdomains. Most importantly in modeling our data we made no assumptions as to the physical process taking place within the simulation. Our predictions align with the simulation’s results implying that we have not violated physical constraints in generating new samples.
This presents exciting opportunities for the eXascale Computing Project Whole Device Modeling Application (WDMApp). WDMApp aims to model plasma within the interior of a magnetic confinement fusion device known as a tokomak. Due to the high computational cost, simulations of this nature have historically been restricted to limited volumes of the domain. WDMApp will use the continuum code GENE to model the dense core plasma and a separate, possibly PIC code, to model the less dense edge regions [5]. Domain coherence requires frequent communication of the electromagnetic fields and the particle distribution function between the two codes. The efficient transfer of this information is known as coreedge coupling and involves mapping information between the two code’s disparate representations. Coupling the codes to allow information exchange in a meaningful way is an active area of research [9, 10, 17]. We propose using these results as motivation for further studies incorporating generative modeling into coreedge coupling schema.
References
 [1] (2013) Adaptive multicolumn deep neural networks with application to robust image denoising. In Proceedings of the 26th International Conference on Neural Information Processing Systems  Volume 1, NIPS’13, Red Hook, NY, USA, pp. 1493–1501. Cited by: §I.
 [2] (2020) Learning generative models using denoising density estimators. External Links: 2001.02728 Cited by: §I.
 [3] (2020) Generative modeling with denoising autoencoders and langevin sampling. External Links: 2002.00107 Cited by: §I.

[4]
(2013)
Simple sparsification improves sparse denoising autoencoders in denoising highly noisy images
. pp. 1469–1477 (English (US)). Note: 30th International Conference on Machine Learning, ICML 2013 ; Conference date: 16062013 Through 21062013 Cited by: §I.  [5] (2018) Coupling exascale multiphysics applications: methods and lessons learned. pp. 442–452. External Links: Document Cited by: §IV.

[6]
(2017)
Deep unsupervised clustering with gaussian mixture variational autoencoders
. External Links: 1611.02648 Cited by: §IIB.  [7] (2015) NICE: nonlinear independent components estimation. External Links: 1410.8516 Cited by: §IIC.
 [8] (2017) Density estimation using real nvp. External Links: 1605.08803 Cited by: §IIC.
 [9] (202102) Spatial coupling of gyrokinetic simulations, a generalized scheme based on firstprinciples. Physics of Plasmas 28 (2). External Links: Document Cited by: §IV.
 [10] (201807) A tightcoupling scheme sharing minimum information across a spatial interface between gyrokinetic turbulence codes. Physics of Plasmas 25 (7), pp. 072308. External Links: ISSN 10897674, Link, Document Cited by: §IV.
 [11] nflows: normalizing flows in PyTorch External Links: Document, Link Cited by: §IIIB.
 [12] (2016) The plasma simulation code: a modern particleincell code with patchbased loadbalancing. Journal of Computational Physics 318, pp. 305–326. External Links: ISSN 00219991, Document, Link Cited by: §I, §III.
 [13] (2014) Generative adversarial networks. External Links: 1406.2661 Cited by: §IIB.
 [14] (2014) Autoencoding variational bayes. External Links: 1312.6114 Cited by: §IIB.
 [15] (2018) Glow: generative flow with invertible 1x1 convolutions. External Links: 1807.03039 Cited by: §IIC.
 [16] (202102) Kinetic simulations of electron preenergization by magnetized collisionless shocks in expanding laboratory plasmas. The Astrophysical Journal 908 (2), pp. L52. External Links: Document, Link Cited by: §III.
 [17] (2021) First coupled gene–xgc microturbulence simulations. Physics of Plasmas 28 (1), pp. 012303. External Links: Document, Link, https://doi.org/10.1063/5.0026661 Cited by: §IV.
 [18] (2018) Masked autoregressive flow for density estimation. External Links: 1705.07057 Cited by: §IIC.
 [19] (2016) Variational inference with normalizing flows. External Links: 1505.05770 Cited by: §IIC.
 [20] (202011) Flowbased likelihoods for nongaussian inference. Physical Review D 102 (10). External Links: ISSN 24700029, Link, Document Cited by: §IIIA.
 [21] (2012) Image denoising and inpainting with deep neural networks. In Advances in Neural Information Processing Systems, Vol. 25. External Links: Link Cited by: §I.