DCTRGAN: Improving the Precision of Generative Models with Reweighting

09/03/2020 · Sascha Diefenbacher et al. · Rutgers University, University of Hamburg, DESY, Berkeley Lab

Significant advances in deep learning have led to more widely used and precise neural network-based generative models such as Generative Adversarial Networks (GANs). We introduce a post-hoc correction to deep generative models to further improve their fidelity, based on the Deep neural networks using Classification for Tuning and Reweighting (DCTR) protocol. The correction takes the form of a reweighting function that can be applied to generated examples when making predictions from the simulation. We illustrate this approach using GANs trained on standard multimodal probability densities as well as calorimeter simulations from high energy physics. We show that the weighted GAN examples significantly improve the accuracy of the generated samples without a large loss in statistical power. This approach could be applied to any generative model and is a promising refinement method for high energy physics applications and beyond.




I Introduction

Generative models are a critical component of inference in many areas of science and industry. As a result of recent advances in deep learning, neural network-based generative models are rapidly being deployed to augment slow simulations, act as simulator surrogates, or model data densities ab initio for direct inference or uncertainty quantification. Well-studied methods include Generative Adversarial Networks (GANs) Goodfellow et al. (2014a); Creswell et al. (2018), Variational Autoencoders Kingma and Welling; Kingma and Welling (2019), and variations of Normalizing Flows Rezende and Mohamed (2015); Kobyzev et al. (2020).

In many industrial applications of generative modeling, the aim is primarily to achieve “realistic” images on a per-example basis. By contrast, in high energy physics applications, the main goal is usually to improve the quality of an ensemble of examples. In other words, it is of paramount importance that the generator accurately reproduce the underlying probability distribution of the training data.

One challenge faced by current generative models is that even though they are able to qualitatively reproduce features of the data probability density, they are often unable to reproduce fine details. While a variety of advanced methods are being proposed to enhance the precision of generative models, we make the observation that a relatively simple operation can be performed on the output of a generative model to improve its fidelity. This operation uses a neural network classifier to learn a weighting function that is applied post-hoc to tweak the learned probability density. The result is a set of weighted examples that can be combined to more accurately model the statistics of interest. This procedure does not improve the fidelity of any particular example, but instead improves the modeling of the probability density.

Estimating probability density ratios with classification has a long history (see e.g., Ref. Hastie et al. (2001); Sugiyama et al. (2012)) and has been used for a wide variety of applications in machine learning research related to generative models Nowozin et al. (2016); Mohamed and Lakshminarayanan (2016); Grover and Ermon (2017); Gretton et al. (2007); Bowman et al. (2015); Lopez-Paz and Oquab (2016); Danihelka et al. (2017); Rosca et al. (2017); Im et al. (2018); Gulrajani et al. (2020); Grover et al. (2017, 2019a); Azadi et al. (2018); Turner et al. (2018); Tao et al. (2018). The most closely related application to this paper is Ref. Grover et al. (2019b), which used learned weights during the training of a generative model to improve fairness; in this work, the weights are derived after the generator is trained. In high energy physics, there are many studies using deep generative models Paganini et al. (2018a, b); Vallecorsa et al. (2019); Chekalina et al. (2018); ATLAS Collaboration (2018); Carminati et al. (2018); Vallecorsa (2018); Musella and Pandolfi (2018); Erdmann et al. (2018, 2019a); de Oliveira et al. (2017a, 2018); Hooberman et al. (2017); Belayneh et al. (2019); Buhmann et al. (2020); de Oliveira et al. (2017b); Butter et al. (2019a); Arjona Martinez et al. (2019); Bellagente et al. (2019); Ahdida et al. (2019); Carrazza and Dreyer (2019); Butter et al. (2019b); Lin et al. (2019); Di Sipio et al. (2019); Hashemi et al. (2019); Zhou et al. (2019); Datta et al. (2018); Deja et al. (2019); Derkach et al. (2019); Erbin and Krippendorf (2018); Urban and Pawlowski (2018); Farrell et al. (2019) and using likelihood ratio estimates based on classifiers Andreassen and Nachman (2019); Badiali et al. (2020); Stoye et al. (2018); Hollingsworth and Whiteson (2020); Brehmer et al. (2018a, b, 2020, c); Cranmer et al. (2015); Andreassen et al. (2020a, b); Erdmann et al. (2019b). This paper combines both concepts to construct DctrGan, a tool with broad applicability in high energy physics and beyond.

This paper is organized as follows. Section II introduces the statistical methods of reweighting and how they can be applied to generative models using deep learning. Numerical results are presented in Sec. III using standard multimodal probability densities as well as calorimeter simulations from high energy physics. The paper ends with conclusions and outlook in Sec. IV.

II Methods

II.1 Generative Models

A generative model is a function $G$ that maps a latent space $Z$ (noise) to a target feature space $X$. The goal of the generator is for the learned probability density $p_G(x)$ to match the density of a target process, $p_G(x) \approx p_\text{data}(x)$. There are a variety of approaches for constructing $G$ using flexible parameterizations such as deep neural networks. Some of these methods produce an explicit density estimation for $p_G$ and others only allow generation without a corresponding explicit density.

While the method presented in this paper will work for any generative model $G$, our examples will consider the case when $G$ is a Generative Adversarial Network Goodfellow et al. (2014b). A Gan is trained using two neural networks:

$$ \min_G \max_D \; \mathbb{E}_{x \sim p_\text{data}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_Z}\left[\log\left(1 - D(G(z))\right)\right], \qquad (1) $$

where $D$ is a discriminator/classifier network that distinguishes real examples from those drawn from the generator $G$. The discriminator network provides feedback to the generator through gradient updates, and when it is unable to classify examples as drawn from the generator or from the true probability density, then $p_G$ will be accurate. Gans provide efficient models for generating examples from $p_G$, but do not generally provide an explicit estimate of the probability density.

This paper describes a method for refining the precision of generative models using a post-processing step that is also based on deep learning, using reweighting. A weighting function is a map $w: X \to [0, \infty)$ designed so that $p_G(x)\, w(x) \approx p_\text{data}(x)$. Augmenting a generator with a reweighting function will not change the fidelity of any particular example, but it will improve the relative density of examples. Therefore, this method is most useful for applications that use generators for estimating the ensemble properties of a generative process and not the fidelity of any single example. For example, generators trained to replace or augment slow physics-based simulators may benefit from the addition of a reweighting function. Such generators have been proposed for a variety of tasks in high energy physics and cosmology Paganini et al. (2018a, b); Vallecorsa et al. (2019); Chekalina et al. (2018); ATLAS Collaboration (2018); Carminati et al. (2018); Vallecorsa (2018); Musella and Pandolfi (2018); Erdmann et al. (2018, 2019a); de Oliveira et al. (2017a, 2018); Hooberman et al. (2017); Belayneh et al. (2019); Buhmann et al. (2020); de Oliveira et al. (2017b); Butter et al. (2019a); Arjona Martinez et al. (2019); Bellagente et al. (2019); Ahdida et al. (2019); Carrazza and Dreyer (2019); Butter et al. (2019b); Lin et al. (2019); Di Sipio et al. (2019); Hashemi et al. (2019); Zhou et al. (2019); Datta et al. (2018); Deja et al. (2019); Derkach et al. (2019); Erbin and Krippendorf (2018); Urban and Pawlowski (2018); Farrell et al. (2019).

II.2 Statistical Properties of Weighted Examples

Let $G$ be a trained generative model that is designed to mimic the target process $p_\text{data}$. Furthermore, let $W$ be a random variable that corresponds to weights for each value of $x$. If $g$ is a function of the phase space $X$, one can compute the weighted expectation value

$$ \langle g \rangle_{p,w} = \int dx\, p(x)\, w(x)\, g(x). \qquad (2) $$

Nearly every use of Monte Carlo event generators in high energy physics can be cast in the form of Eq. 2. A common use-case is the estimation of the bin counts of a histogram, in which case $g$ is an indicator function that is one if $x$ is in the bin and zero otherwise. For the true expectation value, $p = p_\text{data}$ and $w \equiv 1$ (footnote: for simplicity we assume initial weights to be unity; this procedure can trivially be extended to more complex weight distributions, e.g. from higher-order simulations), while for the nominal generator approximation, $p = p_G$ with $w$ still unity. The goal of DctrGan is to learn a weighting function $w$ that reduces the difference between $\langle g \rangle_{p_\text{data},1}$ and $\langle g \rangle_{p_G,w}$ for all functions $g$. The ideal function that achieves this goal is

$$ w(x) = \frac{p_\text{data}(x)}{p_G(x)}. \qquad (3) $$
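As a concrete illustration of Eq. 2 and the ideal weight of Eq. 3, the following numpy sketch (the toy Gaussian densities and all variable names are our own, not from the paper) reweights samples from an imperfect generator so that a histogram-bin expectation matches the truth density:

```python
import numpy as np

rng = np.random.default_rng(0)
x_gen = rng.normal(0.3, 1.0, 200_000)   # draws from an imperfect generator p_G = N(0.3, 1)
# Ideal weight for a truth density p_data = N(0, 1):
#   w(x) = p_data(x) / p_G(x) = exp(-0.3 x + 0.045)
w = np.exp(-0.3 * x_gen + 0.045)

# Histogram-bin observable: Eq. 2 with g an indicator function on [0.5, 1.0)
in_bin = ((x_gen >= 0.5) & (x_gen < 1.0)).astype(float)
print(np.mean(in_bin))        # generator estimate of the bin content, biased high (~0.18)
print(np.mean(w * in_bin))    # weighted estimate, ~0.15, the true N(0,1) bin probability
```

In a real application the analytic weight above is not available and must be learned from a classifier, as described in Sec. II.3.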
Before discussing strategies for learning $w(x)$, it is important to examine the sampling properties of weighted examples. To simplify the discussion in the remainder of this subsection, we will assume that the ideal reweighting function has been learned exactly, so that

$$ \langle g \rangle_{p_G,w} = \langle g \rangle_{p_\text{data},1} \qquad (4) $$

for every observable $g$. In practice, small deviations from the ideal reweighting function will result in subleading contributions to the statistical uncertainty (footnote: a related question is the statistical power of the examples generated from $p_G$; see Ref. Matchev and Shyamsundar (2020) and A. Butter and Plehn (2020) for discussions, and the latter paper for an empirical demonstration of this topic).

Now suppose we generate $N'$ events $x_i \sim p_G$, $i = 1, \dots, N'$, and there are $N$ truth events $x_j \sim p_\text{data}$, $j = 1, \dots, N$. We are interested in how large $N'$ must be in order to achieve the same statistical precision as the truth sample. The key observation we will make is that the required $N'$ depends jointly on the observable $g$ and the weights $w$.

Since we will be interested in variances, it makes sense to consider the mean-subtracted observable

$$ \tilde{g}(x) = g(x) - \langle g \rangle. \qquad (5) $$

The sampling estimates for $\langle g \rangle_{p_\text{data},1}$ and $\langle g \rangle_{p_G,w}$ are

$$ \widehat{\langle g \rangle}_{p_\text{data},1} = \frac{1}{N} \sum_{j=1}^{N} g(x_j), \qquad \widehat{\langle g \rangle}_{p_G,w} = \frac{1}{N'} \sum_{i=1}^{N'} w(x_i)\, g(x_i), \qquad (6) $$

where hats denote sampling estimates of expectation values. The variances of the sampling estimates are given by (footnote: we are neglecting the contribution to the variance from $\widehat{\langle g \rangle}$, which should also be estimated from the sample; it is suppressed by $1/N$ compared to the variance of the sampling estimate of $\langle \tilde{g}^2 \rangle$)

$$ \sigma^2_{p_\text{data},1} = \frac{1}{N} \int dx\, p_\text{data}(x)\, \tilde{g}(x)^2, \qquad (7) $$

$$ \sigma^2_{p_G,w} = \frac{1}{N'} \int dx\, p_G(x)\, w(x)^2\, \tilde{g}(x)^2. \qquad (8) $$
Equating Eq. 7 and 8, we see that to achieve the same statistical precision as the $N$ truth events for a given observable $g$, we need to generate

$$ N' = N\, \frac{\int dx\, p_G(x)\, w(x)^2\, \tilde{g}(x)^2}{\int dx\, p_\text{data}(x)\, \tilde{g}(x)^2} = N\, \frac{\int dx\, p_\text{data}(x)\, w(x)\, \tilde{g}(x)^2}{\int dx\, p_\text{data}(x)\, \tilde{g}(x)^2} \qquad (9) $$

events. How many events we need depends on the observable $g$ and the weights $w$. If $w \equiv 1$ everywhere (the generator exactly learns $p_\text{data}$), then $N' = N$ for every observable, as expected. Otherwise, if the weights are small (large) in the parts of phase space preferred by $\tilde{g}^2$, then we will need fewer (more) events than $N$. Smaller weights are better for statistical power; they correspond to regions of phase space which are over-represented by the generator.
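Eq. 9 can be estimated directly from the weighted sample, since multiplying the integrand by $w$ converts averages over $p_G$ into averages over $p_\text{data}$. A short sketch (the function, variable names, and toy densities are our own):

```python
import numpy as np

def events_needed(n_truth, x_gen, w, g):
    """Sampling estimate of Eq. 9: the number of weighted generator events
    needed to match the precision of n_truth unweighted truth events on g."""
    g_vals = g(x_gen)
    g_tilde = g_vals - np.average(g_vals, weights=w)   # mean-subtracted observable (Eq. 5)
    # E_G[w^2 g~^2] estimates  int dx p_data(x) w(x) g~(x)^2   (numerator of Eq. 9)
    # E_G[w   g~^2] estimates  int dx p_data(x)      g~(x)^2   (denominator of Eq. 9)
    return n_truth * np.mean(w**2 * g_tilde**2) / np.mean(w * g_tilde**2)

rng = np.random.default_rng(0)
x = rng.normal(0.3, 1.0, 100_000)
print(events_needed(1000, x, np.ones_like(x), lambda t: t))   # 1000.0 exactly when w = 1
w = np.exp(-0.3 * x + 0.045)   # ideal weights for a truth density N(0, 1)
print(events_needed(1000, x, w, lambda t: t))                 # > 1000: dilution for g(x) = x
```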

One cannot simply reduce the error everywhere by making all of the weights small. If the weights are small somewhere in phase space, they must be large somewhere else. To see this, observe that with the ideal reweighting function, in the large-$N'$ limit, the weights are automatically normalized across the entire sample such that they sum to $N'$:

$$ \sum_{i=1}^{N'} w(x_i) \;\approx\; N' \int dx\, p_G(x)\, w(x) = N' \int dx\, p_\text{data}(x) = N', \qquad (10) $$

where in the last equation we have substituted Eq. 3. So indeed the weights cannot be small everywhere in phase space.

Evidently, if we want the generator to have good statistical precision across a wide range of observables, we want the weights to be close to unity. Otherwise, there will always be some observables for which we need to generate many more events than $N$.

As a special case of Eq. 9, consider that $g$ is a histogram bin function, specifically that it is one in a neighborhood of $x_0$ and zero otherwise. For a sufficiently small neighborhood, we can ignore the mean (it is proportional to the volume of the neighborhood), and Eq. 9 reduces to

$$ N' = N\, w(x_0). \qquad (11) $$

In other words, the value of the weight in the histogram bin directly determines how many events we need to generate to achieve the same statistical precision as the truth sample.

Finally, while the integral formulas above are formally correct and lead to various insights, they are of limited practical usefulness since we generally do not know $p_\text{data}$ and $p_G$ and cannot evaluate the integrals over $x$. For actually estimating the uncertainties on expected values, one should replace the integrals in Eq. 7 and Eq. 8 with their sampling estimates.

II.3 Learning the Weighting Function

The weighting function $w$ is constructed using the Deep neural networks using Classification for Tuning and Reweighting (Dctr) methodology Andreassen and Nachman (2019) (see also Ref. Badiali et al. (2020); Stoye et al. (2018); Hollingsworth and Whiteson (2020); Brehmer et al. (2018a, b, 2020, c); Cranmer et al. (2015); Andreassen et al. (2020a, b)). Let $\{x_i\}$ be a set of examples drawn from $p_\text{data}$ and $\{z_i\}$ be a set of examples drawn from the latent density and then mapped through $G$, henceforth called $\{G(z_i)\}$. A neural network classifier $f$ is trained to distinguish $\{x_i\}$ from $\{G(z_i)\}$ using the binary cross entropy loss function, so that $f(x) \to 1$ corresponds to examples drawn from $p_\text{data}$ (footnote: a variety of other loss functions can also be applied, such as the mean squared error; the dependence of $w$ on $f$ may need to be modified if other loss functions are used). The weighting function is then constructed as $w(x) = f(x)/(1 - f(x))$. With sufficient training data, classifier expressiveness, and training flexibility, one can show that $w$ approaches the likelihood ratio $p_\text{data}(x)/p_G(x)$. If the weights are exactly the likelihood ratio, then the expectation values computed in Eq. 2 will be unbiased.
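The procedure can be demonstrated end-to-end on a one-dimensional toy (the Gaussian densities, the tiny linear-logit classifier, and all names below are our own illustrative choices, not the paper's setup): train a classifier with binary cross entropy, then form $w = f/(1-f)$.

```python
import numpy as np

rng = np.random.default_rng(1)
x_data = rng.normal(0.0, 1.0, 25_000)   # "truth" sample from p_data
x_gen  = rng.normal(0.5, 1.0, 25_000)   # sample from an imperfect generator p_G

x = np.concatenate([x_data, x_gen])
y = np.concatenate([np.ones_like(x_data), np.zeros_like(x_gen)])

# Classifier f(x) = sigmoid(a*x + b), trained with binary cross entropy to tag
# truth as 1.  For two unit-width Gaussians the exact log-likelihood ratio is
# linear in x, so this minimal model is well specified.
a, b, lr = 0.0, 0.0, 0.2
for _ in range(3000):
    f = 1.0 / (1.0 + np.exp(-(a * x + b)))
    grad = f - y                        # d(BCE)/d(logit) per example
    a -= lr * np.mean(grad * x)
    b -= lr * np.mean(grad)

# DCTR weight: w(x) = f(x) / (1 - f(x)) = exp(a*x + b)
w = np.exp(a * x_gen + b)
print(np.mean(x_gen))                   # ~0.5: the generator mean, biased
print(np.average(x_gen, weights=w))     # ~0.0: the weighted mean matches the truth
```

A deep network replaces the linear logit when the likelihood ratio is not linear in the features, but the weight construction is unchanged.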

Training a classifier is often easier and more reliable than training a full generative model. Therefore, a natural question is why learn the generator at all? The answer is that the further $w$ is from the constant unity function, the more statistical dilution will arise from the generated examples (when a sufficiently broad range of observables is considered), as described in Sec. II.2. In particular, this method does not work at all when there are weights that are infinite, corresponding to regions of phase space where $p_\text{data} > 0$ but $p_G = 0$. Therefore, the Dctr approach should be viewed as a refinement: the better $p_G$ is to begin with, the more effective $w$ will be in refining the density. Our method differs from the refinement proposed in Ref. Erdmann et al. (2018) because Dctr-ing leaves the data-points intact and only changes their statistical weight, while Erdmann et al. (2018) modifies the data-points themselves. A combined framework that simultaneously changes both data and weights might be an interesting avenue for future studies.

III Results

The DctrGan approach is illustrated with two sets of examples: one where the true density $p_\text{data}$ is known analytically (Sec. III.1) and one where $p_\text{data}$ is not known, but samples can be drawn from it (Sec. III.2).

III.1 Multimodal Gaussian Distributions

As a first set of examples, Gaussian mixture models are simulated in one and two dimensions. Multimodal probability densities are traditionally challenging for standard Gans to model precisely and are a useful benchmark for the DctrGan methodology. In particular, the following random variables are simulated:

  1. 1D bimodal: a mixture of two one-dimensional Gaussians, where $\varphi(x; \mu, \sigma)$ is the probability density of a normal random variable with mean $\mu$ and standard deviation $\sigma$ evaluated at $x$.

  2. 2D circular: a mixture of two-dimensional Gaussians with means arranged on a circle.

  3. 2D grid: a mixture of two-dimensional Gaussians with means arranged on a regular grid.
All models were implemented in Keras Chollet (2017) with the Tensorflow backend Abadi et al. (2016) and optimized with Adam Kingma and Ba (2014). The discriminator networks have two hidden layers with 25 nodes each and use the rectified linear unit (ReLU) activation function. The sigmoid function is used after the last layer. The generator networks also have two hidden layers, with 15 units each. A latent space of dimension 5, 8, and 25 is used for the 1D bimodal, 2D circular, and 2D grid, respectively. Each model was trained with a batch size of 128 for 10,000 epochs (passes through the batches, not the entire dataset). For the reweighting model, three hidden layers were used for all three cases, with ReLU activations on the intermediate layers and a sigmoid for the last layer. For the 2D models, the numbers of hidden units were 64, 128, and 256, while the 1D example used 20 hidden nodes on each intermediate layer. The binary cross entropy was used for the loss function and the models were trained for 100 epochs (passes through the entire dataset).

The resulting probability densities are presented in Fig. 1. In all cases, the fidelity of the Gan density is significantly improved using reweighting.

Figure 1: Three multi-modal Gaussian examples. The true probability density is shown in the first column, the Gan density is presented in the second column, and the density from DctrGan is drawn in the third column. Examples inspired by Ref. Fisher et al. (2018).

III.2 Calorimeter Simulation Examples

Data analysis in high energy physics makes extensive use of simulations for inferring fundamental properties of nature. These complex simulations encode physics spanning a vast energy range. The most computationally demanding part of the simulation stack is the modeling of particles stopping in the dense material of the calorimeters that are part of most detectors. Gans have been investigated as a surrogate model for accelerating these slow calorimeter simulations Paganini et al. (2018a, b); Vallecorsa et al. (2019); Chekalina et al. (2018); ATLAS Collaboration (2018); Carminati et al. (2018); Vallecorsa (2018); Musella and Pandolfi (2018); Erdmann et al. (2018, 2019a); de Oliveira et al. (2017a, 2018); Hooberman et al. (2017); Belayneh et al. (2019); Buhmann et al. (2020).

The Gan model studied here is a modified version of the Bounded-Information-Bottleneck autoencoder (BIB-AE) Voloshynovskiy et al. (2019) shown in Ref. Buhmann et al. (2020). The BIB-AE setup is based on the encoder-decoder structure of a VAE, but its training is enhanced through the use of Wasserstein-Gan Arjovsky et al. (2017)-like critics. The theoretical basis of the model is discussed in Ref. Voloshynovskiy et al. (2019), while the explicit architecture and training process are described in the appendix of Ref. Buhmann et al. (2020). The main modification to Ref. Buhmann et al. (2020) is introduced to reduce mode collapse: regions of phase space that are significantly under- or over-sampled by the Gan. If extreme enough, such regions can cause Dctr weights that lead to infinities in the loss function. Our modified version maintains the encoder and decoder architecture, but each critic is replaced by a set of two identical critic networks. These two critics are trained in parallel; however, one of the two has its weights reset after every epoch. Based on its training history, the continuously trained critic may be blind to certain artifacts in the generated shower images that lead to mode collapse. The reset critic, however, is able to notice and reduce such artifacts. Additionally, we change the input noise to be uniform instead of Gaussian and skip the neural-network-based post-processing described in Ref. Buhmann et al. (2020), as most of its effects can be replicated through the Dctr approach.

The real examples are based on detailed detector simulations using Geant4 10.4 Agostinelli et al. (2003) through a DD4hep 1.11 Frank et al. (2014) implementation of the International Large Detector (ILD) Abramowicz et al. (2020) proposed for the International Linear Collider. The calorimeter is composed of 30 active silicon layers with tungsten absorber. The energy recorded in each layer is projected onto a regular grid of cells. To simulate the ILD Minimal Ionizing Particle (MIP) cutoff, we remove hits with energies below the cutoff for both the Geant4 and Gan showers. All showers are generated for an incident photon with an energy of 50 GeV. More details on the simulation are given in Ref. Buhmann et al. (2020).

Two different Dctr models are trained: one using the original data dimensionality (low-level or LL) and one that uses slightly processed inputs with lower dimensionality. The latter network is built from 33 features: the number of non-zero cells, the longitudinal centroid (the energy-weighted layer index of the shower, $\sum_i i\, E_i / \sum_i E_i$), the total energy, and the energy in each of the thirty layers. A fully connected network processes these observables, with two hidden layers of 128 nodes each. The ReLU activation is used for the intermediate layers, the sigmoid is used for the final layer, and the model is trained with the binary cross entropy loss function. The network was implemented using Keras with the Tensorflow backend and optimized with Adam with a batch size of 128 for 50 epochs.
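The 33 high-level inputs can be assembled in a few lines. This sketch assumes showers stored as arrays of shape (events, layers, x, y); the layout and function name are our own:

```python
import numpy as np

def high_level_features(showers, threshold=0.0):
    """Build the 33 observables described above from shower arrays of shape
    (N, layers, x, y): hit count, longitudinal centroid, total energy, and
    the per-layer energies."""
    n_hit = np.count_nonzero(showers > threshold, axis=(1, 2, 3))
    e_layer = showers.sum(axis=(2, 3))                   # (N, layers)
    e_tot = e_layer.sum(axis=1)
    layers = np.arange(showers.shape[1])
    centroid = (e_layer * layers).sum(axis=1) / e_tot    # energy-weighted layer index
    return np.column_stack([n_hit, centroid, e_tot, e_layer])

showers = np.zeros((2, 30, 10, 10))
showers[:, 5, 4, 4] = 2.0     # toy showers: all energy in a single cell of layer 5
feats = high_level_features(showers)
print(feats.shape)            # (2, 33)
print(feats[0, 1])            # centroid = 5.0
```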

The low-level classifier was trained directly on the shower images. One major problem in this approach is artificial features in the Gan showers caused by mode collapse. These give the classifier a way to distinguish between Geant4 and Gan without learning the actual physics. This in turn means that the showers with a high classification score are not necessarily the most realistic ones, which reduces the effectiveness of the reweighting. Therefore, a large part of the low-level classifier setup is designed to reduce the impact of such artifacts. The feature processing for the high-level network acts as an effective regularization that mitigates these effects.

The low-level classifier was built using PyTorch Paszke et al. (2019). The initial input image first has a 50% chance of being flipped along each of the two transverse directions, which are perpendicular to the longitudinal direction of the incoming photon. The input is largely symmetric under these transformations, but they make it harder for the classifier to pick up on the above-mentioned artifacts. The image is then passed through 3 layers of 3D convolutions, all with 128 output filters. The first two convolutions have a stride of 2 with a zero padding of 1, while the third has a stride of 1 and no padding. Between the first and second and the second and third convolutions, layer-norm steps are performed. The output of the final convolution is flattened and passed to a set of fully connected layers with (64, 256, 256, 1) output nodes, respectively. Each layer uses a LeakyReLU activation function Maas et al. (2013) with a slope of 0.01, except for the final output layer, which uses a sigmoid activation. The network is trained with a binary cross entropy loss function using the Adam optimizer. The training set consists of 500k Gan showers and 177k Geant4 showers; the Geant4 set is tiled so that it matches the size of the Gan set. This is equivalent to increasing the training weights of the Geant4 set, but was found to converge better than using weights. The network is trained for 1 epoch, as longer training makes the classifier more sensitive to the artificial structures in the Gan showers. After training, the classifier is calibrated using temperature scaling Guo et al. (2017). Finally, we clip the individual per-shower weights to be no larger than 5.

Figure 2 shows histograms of the learned Dctr weights for both the low-level and high-level models. The most probable weight is near unity, with a long and asymmetric tail towards low weights. The right plot of Fig. 2 shows that there is a positive correlation between the weights of the two models.

Figure 2: The weights of the low-level and high-level Dctr models. The top plot presents histograms of the weights and the bottom plot presents a scatter plot demonstrating the correlation between the weights of the two models. The Pearson correlation is indicated in the plot.

Figures 3-6 show histograms of various one-dimensional observables from the full high-dimensional space. In each plot, two metrics are computed in order to quantify the performance of the generative models. First, the distance measure

$$ \Delta(h, k) = \frac{1}{2} \sum_i \frac{(h_i - k_i)^2}{h_i + k_i}, \qquad (12) $$

for normalized histograms $h$ and $k$, is used to quantify how similar the Gan histograms are to the Geant4 ones. This measure is known in the information theory literature as triangular discrimination Topsoe (2000) (and is an $f$-divergence similar to the $\chi^2$ divergence Nachman (2016)). In the high-energy physics literature, it is often called the separation power Hocker et al. (2007). It has the property that it is 0 if the two histograms are the same and 1 when they are non-overlapping in every bin. As described in Sec. II.2, the tradeoff for improved performance is statistical dilution. To quantify this dilution, the uncertainty on the mean of the observable using the Gan is computed and divided by the uncertainty computed using Geant4, denoted $\sigma_\text{GAN}/\sigma_\text{Geant4}$ (see Section II.2 for a detailed discussion of these uncertainties). Since the Gan and Geant4 datasets have the same number of events, this ratio is unity without Dctr and deviates from one for the weighted Gan models. The effective number of events contained in a weighted sample is observable-dependent, but is given approximately by the ratio of the effective number of events in the Geant4 dataset to that of the Gan one. These values are typically between 1-10, while the Gan is many ten-thousand times faster than Geant4 Buhmann et al. (2020).
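The separation power defined above is straightforward to compute from two binned distributions; a short sketch (the function name is our own):

```python
import numpy as np

def separation_power(h, k):
    """Triangular discrimination between two histograms:
    0.5 * sum_i (h_i - k_i)^2 / (h_i + k_i) after normalizing each to unit sum.
    Equals 0 for identical histograms and 1 for fully disjoint ones."""
    h = np.asarray(h, dtype=float) / np.sum(h)
    k = np.asarray(k, dtype=float) / np.sum(k)
    s = h + k
    nonzero = s > 0                    # empty bins contribute nothing
    return 0.5 * np.sum((h[nonzero] - k[nonzero]) ** 2 / s[nonzero])

print(separation_power([1, 2, 3], [1, 2, 3]))   # 0.0
print(separation_power([1, 0], [0, 1]))         # 1.0
```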

Figure 3: Histograms of various observables from simulated calorimeter showers of 50 GeV photons in a 30-layer calorimeter. A cell is activated if a non-zero energy is registered in that cell. The panels below each histogram show the ratio between the Gan or the DctrGan and the physics-based simulator Geant4. The legend includes the separation power between the (weighted) Gan model and the Geant4 model. Additionally, the ratio of the uncertainty in the mean of the observable between the Gan and Geant4 is also presented.

Three composite observables are presented in Fig. 3. The total number of activated cells is more peaked around 780 in Geant4 than in the Gan, and both the low-level and high-level models are able to significantly improve the agreement with Geant4. The separation power is about 20 times smaller than for the unweighted Gan with the high-level Dctr model and about 5 times smaller with the low-level model. The statistical dilution is modest for the low-level model, while the uncertainty ratio is 3.6 for the high-level model. The modeling of the total energy is also improved through the reweighting, where both the low-level and high-level models shift the energy towards lower values. The longitudinal centroid is already relatively well-modeled by the Gan, but is further improved by the high-level Dctr model, reducing the separation power by more than a factor of two.

Histograms of the energy in representative layers are shown in Fig. 4. Generally, the Geant4 showers penetrate deeper into the calorimeter than the Gan showers, so the energy in the early layers is too high for the Gan and the energy in the later layers is too low. The Dctr models are able to correct these trends, with a systematically superior fidelity, as measured by the separation power, for the high-level model.

The modeling of correlations between layers is probed in Fig. 5 by examining histograms of the difference between energies in different layers. Layers that are closer together should be more correlated. This manifests as histograms with a smaller spread for layers that are physically closer together. For layers that intercept the shower before its maximum, the difference in energy between a layer and the next layer is negative. The shower maximum is typically just beyond the tenth layer. The fidelity improvements from the weighting in the difference histograms are comparable to those in the per-layer energy histograms from Fig. 4. Interestingly, in cases where the separation power for the Gan is already good without weighting (e.g. the energy in layer 15 minus layer 16), the modeling still improves with the Dctr approach, and there can still be significant statistical dilution.

Another way to visualize correlations in the data is to compute the linear correlation coefficient between observables. This is reported for a representative set of observables in Fig. 7. Generally, the differences in correlations between the Gan and Geant4 are reduced after applying the Dctr reweighting, with many of the residual differences shrinking by factors of about 2-10.

Figures 3-5 show features that were directly used by the high-level model. Figure 6 presents histograms for a collection of observables that were not accessible to the high-level model during training. In particular, the energy-weighted second moment along one transverse direction is computed for each layer; the results for the other transverse direction are nearly the same. Despite not being present during training, the high-level network is still able to improve the performance over the unweighted Gan in every case, with only a modest reduction in statistical power. Weights in the Dctr models are per-example, so one can compute any observable, even ones that are not explicitly part of the reweighting model evaluation.

A summary of the overall performance appears in Fig. 8. The most probable ratio of the separation power computed with DctrGan to the one computed with the unweighted Gan is between 0.4 and 0.5, and most of the observables show improvement after weighting. As mentioned earlier, the Gan and Geant4 datasets have the same number of events, so the uncertainty ratio is unity without Dctr. For the low-level model, the ratio has a narrow distribution peaked at about 1.5. In contrast, the high-level model is peaked past two with a tail out to around 5. This difference in statistical dilution and in overall performance between the low- and high-level models stems in part from the extensive regularization of the low-level model during training. The high-level model is also highly regularized by the dimensionality reduction, but otherwise has a sufficiently complex classifier that is not constrained during training.

Figure 4: Histograms of total energies in selected layers of the calorimeter. The panels below each histogram show the ratio between the Gan or the DctrGan and the physics-based simulator Geant4. The legend includes the separation power between the (weighted) Gan model and the Geant4 model. Additionally, the ratio of the uncertainty in the mean of the observable between the Gan and Geant4 is also presented. Underflow and overflow are not included in the leftmost or rightmost bins.
Figure 5: Selected histograms of differences in total energy between layer 5 (left), 10 (second), 15 (third), and 20 (right) and the layer that is one later (top) or four later (bottom). The panels below each histogram show the ratio between the Gan or the DctrGan and the physics-based simulator Geant4. The legend includes the separation power between the (weighted) Gan model and the Geant4 model. Additionally, the ratio of the uncertainty in the mean of the observable between the Gan and Geant4 is also presented. Underflow and overflow are not included in the leftmost or rightmost bins.
Figure 6: Histograms of the energy-weighted second moment per layer. The panels below each histogram show the ratio between the Gan or the DctrGan and the physics-based simulator Geant4. The legend includes the separation power between the (weighted) Gan model and the Geant4 model. Additionally, the ratio of the uncertainty in the mean of the observable between the Gan and Geant4 is also presented. Underflow and overflow are not included in the leftmost or rightmost bins.
Figure 7: Differences in pairwise linear correlations between the stated observables between Geant4 and the nominal Gan (left), the high-level DctrGan (middle), and the low-level DctrGan (right). Darker colors indicate stronger correlation.
Figure 8: Histograms that summarize the performance of the reweighting models. The first column presents histograms of for all of the observables shown in the previous figures and the second column shows the ratio of for the Gan compared with the DctrGan models. The third column quantifies the statistical dilution using . The top row includes all of the 33 input features to the high-level network, the second row includes all 30 observables, and the third row includes all differences between layers.

IV Conclusions and Outlook

This paper has introduced a post-processing method for generative models to improve their accuracy. While the focus lies on deep generative models, it could similarly be used to enhance the precision of other fast simulation tools (such as Delphes de Favereau et al. (2014)). The approach is based on reweighting, and the result is a combined generator and weighting function that produces weighted examples. These weighted examples can be used to perform inference using the same tools as for unweighted events. The potential of deep generative models continues to expand as their fidelity improves, and tools like DctrGan may be able to help achieve the target precision for a variety of applications.

Code and Data

The code for this paper can be found at https://github.com/bnachman/DCTRGAN. Examples and instructions to reproduce the calorimeter GAN dataset are available at https://github.com/FLC-QU-hep/getting_high.

Acknowledgments
We thank A. Andreassen, P. Komiske, E. Metodiev, and J. Thaler for many helpful discussions about reweighting with NNs and E. Buhmann, F. Gaede, and K. Krüger for stimulating discussions on improving the fidelity of calorimeter simulation. We also thank J. Thaler for feedback on the manuscript. BPN and DS are supported by the U.S. Department of Energy, Office of Science under contract numbers DE-AC02-05CH11231 and DE-SC0010008, respectively. BPN would also like to thank NVIDIA for providing Volta GPUs for neural network training. GK and SD acknowledge support by the DFG under Germany’s Excellence Strategy – EXC 2121 Quantum Universe – 390833306. EE is funded through the Helmholtz Innovation Pool project AMALEA that provided a stimulating scientific environment for parts of the research done here. DS is grateful to LBNL, BCTP and BCCP for their generous support and hospitality during his sabbatical year.