1 Introduction
Learning generative models of complex environments from high-dimensional observations is a long-standing challenge in machine learning. Once learned, these models are used to draw inferences and plan future actions. For example, in data augmentation, samples from a learned model are used to enrich a dataset for supervised learning [1]. In model-based off-policy policy evaluation (henceforth MBOPE), a learned dynamics model is used to simulate and evaluate a target policy without real-world deployment [2], which is especially valuable for risk-sensitive applications [3]. In spite of the recent successes of deep generative models, existing theoretical results show that learning distributions in an unbiased manner is either impossible or has prohibitive sample complexity [4, 5]. Consequently, the models used in practice are inherently biased (we call a generative model biased if it produces biased statistics relative to the true data distribution) and can lead to misleading downstream inferences.

In order to address this issue, our work starts from the observation that many typical uses of generative models involve computing expectations under the model. For instance, in MBOPE, we seek the expected return of a policy under a trajectory distribution defined by this policy and a learned dynamics model. A classical recipe for correcting the bias in expectations, when samples are available from a distribution other than the ground truth, is to importance weight the samples according to the likelihood ratio [6]. If the importance weights were exact, the resulting estimates would be unbiased. In practice, however, the likelihood ratio is unknown and needs to be estimated: the true data distribution is unknown, and even the model likelihood is intractable or ill-defined for many deep generative models, e.g., variational autoencoders [7] and generative adversarial networks [8].

Our proposed solution is to estimate the importance weights by training a calibrated, probabilistic classifier to distinguish samples from the data distribution and the generative model. As shown in prior work, the output of such classifiers can be used to extract density ratios [9]. Appealingly, this estimation procedure is likelihood-free, since it only requires samples from the two distributions. While density ratios have been used previously to expand the class of learning objectives for deep generative modeling [10, 11, 12], we use them to reduce the bias of a pretrained generative model for downstream Monte Carlo evaluation.
Empirically, we evaluate our bias reduction framework on three main sets of experiments. First, we consider goodness-of-fit metrics for evaluating the sample quality of a likelihood-based and a likelihood-free state-of-the-art (SOTA) model on the CIFAR-10 dataset. All these metrics are defined as Monte Carlo estimates from the generated samples. By importance weighting samples, we observe a bias reduction of 23.35% and 13.48%, averaged across commonly used sample quality metrics, for the PixelCNN++ [13] and SNGAN [14] models respectively.
Next, we demonstrate the utility of our approach on the task of data augmentation for multi-class classification on the Omniglot dataset [15]. We show that while naively augmenting the dataset with samples from a data augmentation generative adversarial network [1] is not very effective for multi-class classification, we can improve classification accuracy from 66.03% to 68.18% by importance weighting the contributions of each augmented data point.
Finally, we demonstrate bias reduction for MBOPE [16]. A typical MBOPE approach is to first estimate a generative model of the dynamics using off-policy data and then evaluate the policy via Monte Carlo [2, 17]. Again, we observe that correcting the bias of the estimated dynamics model via importance weighting leads to substantially more accurate policy evaluations on MuJoCo environments [18].
2 Debiasing Generative Models for Monte Carlo Evaluation
2.1 Preliminaries
Notation.
Unless explicitly stated otherwise, we assume that probability distributions admit absolutely continuous densities with respect to a suitable reference measure. We use uppercase notation (e.g., $X$) to denote random variables and lowercase notation (e.g., $x$) to denote specific values in the corresponding sample spaces. We use boldface for multivariate random variables and their vector values.
Background. Consider a finite dataset $D_{\rm real}$ of instances $x$ drawn i.i.d. from a fixed (unknown) distribution $p_{\rm data}$. Given $D_{\rm real}$, the goal of generative modeling is to learn a distribution $p_\theta$ to approximate $p_{\rm data}$. Here, $\theta$ denotes the model parameters, e.g., the weights of a neural network for deep generative models. The parameters can be learned via maximum likelihood estimation (MLE), as in the case of autoregressive models [19], normalizing flows [20], and variational autoencoders [7, 21], or via adversarial training, e.g., using generative adversarial networks [8, 11] and variants.

2.2 Monte Carlo Evaluation
We are interested in use cases where the goal is to evaluate or optimize expectations of functions under some distribution $p$ (either equal or close to the data distribution $p_{\rm data}$). Assuming access to samples from $p$ as well as some generative model $p_\theta$, one extreme is to evaluate the sample average using the samples from $p$ alone. However, this ignores the availability of $p_\theta$, through which we have virtually unlimited access to generated samples (ignoring computational constraints) and hence could improve the accuracy of our estimates when $p_\theta$ is close to $p$. We begin by presenting a motivating use case: data augmentation with generative models for training classifiers that generalize better.
Example Use Case: Sufficient labeled training data for learning classification and regression systems is often expensive to obtain or susceptible to noise. Data augmentation seeks to overcome this shortcoming by artificially injecting new datapoints into the training set. These new datapoints are derived from an existing labeled dataset, either via manual transformations (e.g., rotations and flips for images) or via a learned generative model [22, 1].
Consider a supervised learning task over a labeled dataset $D_{\rm real} = \{(x_i, y_i)\}$ of pairs of features and labels, assumed to be sampled independently from an underlying data distribution $p_{\rm data}(x, y)$ defined over $\mathcal{X} \times \mathcal{Y}$. In order to learn a classifier $f_\psi : \mathcal{X} \to \mathcal{Y}$ with parameters $\psi$, we are interested in minimizing the expectation of a loss $\ell$ over the training dataset:

$\min_\psi \; \mathbb{E}_{(x,y)\sim p_{\rm data}}\left[\ell(y, f_\psi(x))\right] \quad (1)$

E.g., $\ell$ could be the cross-entropy loss. A generative model for the task of data augmentation learns a joint distribution $p_\theta(x, y)$. Several algorithmic variants exist for learning the model's joint distribution, and we defer the specifics to the experiments section. Once the generative model is learned, it can be used to optimize the expected classification loss in Eq. (1) under a mixture of the empirical data distribution and the generative model distribution:

$\min_\psi \; \mathbb{E}_{(x,y)\sim \lambda\,\hat p_{\rm data} + (1-\lambda)\,p_\theta}\left[\ell(y, f_\psi(x))\right] \quad (2)$
for a suitable choice of the mixture weight $\lambda \in [0, 1]$. Notice that while the eventual task here is optimization, reliably evaluating the expected loss of a candidate parameter $\psi$ is an important ingredient, and we focus on this basic question first before leveraging the solution for data augmentation. Further, even if evaluating the expectation once is easy, optimization requires repeated evaluation (for different values of $\psi$), which is significantly more challenging. Also observe that the distribution under which we seek expectations is the same as $p_{\rm data}$ here; we rely on the generalization of $p_\theta$ to generate transformations of an instance in the dataset which are not explicitly present, but plausibly observed in other, similar instances [23].
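As a concrete sketch, the mixture objective of Eq. (2) can be evaluated batch-wise as a convex combination of a real-data loss term and a generated-data loss term, with optional per-sample weights on the generated batch. The names below (`mixture_loss`, `lam`, the toy squared loss) are illustrative, not the paper's implementation:

```python
# Sketch of the mixture objective in Eq. (2): a convex combination of the
# empirical loss on real data and the loss on generated data.  Passing
# per-sample weights for the generated batch anticipates the LFIW
# correction introduced in Section 2.3.
def mixture_loss(loss_fn, real_batch, gen_batch, lam, gen_weights=None):
    if gen_weights is None:
        gen_weights = [1.0] * len(gen_batch)  # naive (uniform) augmentation
    real_term = sum(loss_fn(x, y) for x, y in real_batch) / len(real_batch)
    gen_term = sum(w * loss_fn(x, y)
                   for w, (x, y) in zip(gen_weights, gen_batch)) / len(gen_batch)
    # lam weights the real-data term, matching the mixture in Eq. (2)
    return lam * real_term + (1 - lam) * gen_term

# Toy example with a squared loss standing in for cross-entropy
squared = lambda prediction, target: (prediction - target) ** 2
value = mixture_loss(squared, [(1.0, 0.0), (2.0, 0.0)], [(3.0, 0.0)], lam=0.5)
```

In Section 2.3, the uniform `gen_weights` are replaced by classifier-derived importance weights.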
2.3 Likelihood-Free Importance Weighting
Whenever the distribution $p$ under which we seek expectations differs from $p_\theta$, model-based estimates exhibit bias. In this section, we first formalize bias for Monte Carlo expectations and subsequently propose a bias reduction strategy based on likelihood-free importance weighting (LFIW). We are interested in evaluating expectations of a class of functions of interest w.r.t. the distribution $p$: for any given $f$, the target quantity is $\mathbb{E}_{x\sim p}[f(x)]$.
Given access to samples from a generative model $p_\theta$, if we knew the densities of both $p$ and $p_\theta$, then a classical scheme for evaluating expectations under $p$ using samples from $p_\theta$ is importance sampling [6]: we reweight each sample from $p_\theta$ according to its likelihood ratio under $p$ and $p_\theta$ and compute a weighted average of the function over these samples:

$\mathbb{E}_{x\sim p}[f(x)] = \mathbb{E}_{x\sim p_\theta}\left[\frac{p(x)}{p_\theta(x)}\, f(x)\right] \approx \frac{1}{T}\sum_{i=1}^{T} w(x_i)\, f(x_i), \qquad x_i \sim p_\theta \quad (3)$

where $w(x_i) := p(x_i)/p_\theta(x_i)$ is the importance weight for $x_i$. The validity of this procedure requires a proposal $p_\theta$ such that for all $x$ where $f(x)\,p(x) \neq 0$, we also have $p_\theta(x) > 0$. (A stronger sufficient, but not necessary, condition that is independent of $f$: the proposal is valid if its support contains that of $p$, i.e., for all $x$, $p(x) > 0$ implies $p_\theta(x) > 0$.)
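For intuition, Eq. (3) can be checked numerically when both densities are known in closed form. The sketch below estimates $\mathbb{E}_p[X^2] = 1$ for a standard normal target using samples from a broader Gaussian proposal; all names are illustrative:

```python
import math
import random

random.seed(0)

def p_pdf(x):  # target density: N(0, 1)
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def q_pdf(x):  # proposal density: N(1, 2^2), with support covering the target
    return math.exp(-0.5 * ((x - 1.0) / 2.0) ** 2) / (2.0 * math.sqrt(2 * math.pi))

f = lambda x: x * x  # E_p[f(X)] = 1 for the standard normal

# Eq. (3): sample from the proposal, reweight by w(x) = p(x) / q(x)
n = 200_000
samples = [random.gauss(1.0, 2.0) for _ in range(n)]
estimate = sum(p_pdf(x) / q_pdf(x) * f(x) for x in samples) / n
```

With exact densities, `estimate` converges to 1 as `n` grows; the rest of this section addresses the case where the density ratio must itself be estimated.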
To apply this technique to reduce the bias of a generative sampler $p_\theta$ w.r.t. $p$, we require knowledge of the importance weights $w(x)$ for any $x \sim p_\theta$. However, we typically only have sampling access to $p$ via finite datasets. For instance, in the data augmentation example above, $p = p_{\rm data}$ is the unknown distribution used to learn $p_\theta$. Hence, we need a scheme to learn the weights $w(x)$ using samples from $p$ and $p_\theta$, which is the problem we tackle next.
Consider two sets of samples drawn from the distributions $p$ and $p_\theta$, respectively. Without loss of generality, assign the positive label $y = 1$ to samples from $p$ and the negative label $y = 0$ to samples from $p_\theta$. A probabilistic, binary classifier $c$ assigns to each point $x$ a probability $c(x)$ that the sample belongs to the positive class. As shown in prior work, such a classifier can be used to extract density ratios [24]. We restate the result in the proposition below.
Proposition 1.
If a probabilistic classifier $c$ trained to classify data from $p$ (label $y = 1$) and $p_\theta$ (label $y = 0$) is Bayes optimal, then the ratio of densities assigned to any point $x$ is given as:

$\frac{p(x)}{p_\theta(x)} = \frac{P(y=0)}{P(y=1)} \cdot \frac{c(x)}{1 - c(x)} \quad (4)$

For the rest of this work, we assume for brevity that a data point is a priori equally likely to come from the positive or negative class, so that $P(y=1) = P(y=0) = 1/2$ and the ratio reduces to $c(x)/(1 - c(x))$. This can be enforced empirically by training the classifier on minibatches with an equal number of positive and negative examples.
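The identity in Eq. (4) with balanced classes can be verified directly when both densities are available in closed form; the Gaussians below are stand-ins chosen purely for illustration:

```python
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_classifier(x):
    # With balanced classes, the Bayes-optimal classifier outputs
    # c(x) = p(x) / (p(x) + p_theta(x))
    p = gauss_pdf(x, 0.0, 1.0)        # stand-in for the data density p
    p_theta = gauss_pdf(x, 0.5, 1.5)  # stand-in for the model density p_theta
    return p / (p + p_theta)

x = 0.3
c = bayes_classifier(x)
ratio_from_classifier = c / (1.0 - c)  # w(x) = c(x) / (1 - c(x))
ratio_direct = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 0.5, 1.5)
```

The two ratios agree exactly at Bayes optimality; the practical estimators below account for the fact that a trained classifier is only an approximation.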
Practical LFIW Estimators. In practice, we do not have access to a Bayes optimal classifier, and hence the estimated importance weights will not be exact. Consequently, we can hope to reduce the bias as opposed to eliminating it entirely. Our proposed LFIW estimator is given as:

$\mathbb{E}_{x\sim p}[f(x)] \approx \frac{1}{T}\sum_{i=1}^{T} \hat w(x_i)\, f(x_i), \qquad x_i \sim p_\theta \quad (5)$

where $\hat w(x_i) = \frac{c_\phi(x_i)}{1 - c_\phi(x_i)}$ is the importance weight for $x_i$ estimated via a probabilistic binary classifier $c_\phi$. Besides imperfections in the classifier, the quality of the generative model also dictates the efficacy of importance weighting. For example, images generated by deep generative models often possess distinct artifacts which the classifier can exploit to make highly confident predictions [25, 26]. This leads to vanishingly small importance weights for some generated images, and consequently greater variance across a Monte Carlo batch. To offset this challenge, we propose the following alternate LFIW estimators.

Self-normalization: The self-normalized LFIW estimator for Monte Carlo evaluation normalizes the importance weights across a sampled batch:

$\mathbb{E}_{x\sim p}[f(x)] \approx \sum_{i=1}^{T} \frac{\hat w(x_i)}{\sum_{j=1}^{T} \hat w(x_j)}\, f(x_i), \qquad x_i \sim p_\theta \quad (6)$
Flattening: The flattened LFIW estimator interpolates between uniform importance weights and the default LFIW weights via a power scaling parameter $\alpha \geq 0$:

$\mathbb{E}_{x\sim p}[f(x)] \approx \frac{1}{T}\sum_{i=1}^{T} \hat w(x_i)^{\alpha}\, f(x_i), \qquad x_i \sim p_\theta \quad (7)$

For $\alpha = 0$, there is no bias correction; $\alpha = 1$ returns the default estimator in Eq. (5). For intermediate values of $\alpha$, we can trade off bias reduction against any undesirable variance introduced.

Clipping: The clipped LFIW estimator specifies a lower bound $\beta > 0$ on the importance weights:

$\mathbb{E}_{x\sim p}[f(x)] \approx \frac{1}{T}\sum_{i=1}^{T} \max(\hat w(x_i), \beta)\, f(x_i), \qquad x_i \sim p_\theta \quad (8)$

When $\beta = 0$, we recover the default LFIW estimator in Eq. (5). Finally, we note that these estimators are not mutually exclusive and can be combined, e.g., flattened or clipped weights can be self-normalized.
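The three variants amount to simple transformations of the raw classifier weights $\hat w(x_i) = c_\phi(x_i)/(1 - c_\phi(x_i))$. A minimal sketch, with `alpha` and `beta` as the flattening and clipping parameters:

```python
def self_normalized(weights):
    # Eq. (6): normalize so the weights sum to one across the batch
    total = sum(weights)
    return [w / total for w in weights]

def flattened(weights, alpha):
    # Eq. (7): alpha = 0 gives uniform weights, alpha = 1 the default LFIW
    return [w ** alpha for w in weights]

def clipped(weights, beta):
    # Eq. (8): lower-bound each weight by beta; beta = 0 recovers the default
    return [max(w, beta) for w in weights]

raw = [0.1, 2.0, 0.5, 4.0]  # toy classifier-derived weights
```

As noted above, the transformations compose, e.g., `self_normalized(clipped(raw, 0.5))` first clips and then normalizes.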
Confidence intervals.
Since the real and generated data come from a finite dataset and a parametric model respectively, we propose a combination of empirical and parametric bootstraps to derive confidence intervals around the estimated importance weights. See Appendix A for details.
Synthetic experiment.
We visually illustrate our importance weighting approach in a toy experiment (Figure 1a). We are given a finite set of samples drawn from a mixture of two Gaussians (red). The model family is a unimodal Gaussian, illustrating mismatch due to a restricted parametric model. The mean and variance of the model are estimated by the empirical mean and variance of the observed data. Using the estimated model parameters, we then draw samples from the model (blue).
In Figure 1b, we show the probability assigned by a binary classifier to a point being from the true data distribution. Here, the classifier is a multilayer perceptron with a single hidden layer. The classifier is not Bayes optimal, as can be seen from the gaps between the optimal probability curve (black) and the estimated class-probability curve (green). However, as we increase the number of examples used to train the generative model and the classifier (Figures 1c-d), the classifier approaches optimality and its uncertainty shrinks, as expected. In summary, this experiment demonstrates how a binary classifier can mitigate the bias due to a mismatched generative model.
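A numerical analogue of this experiment, using the closed-form Bayes-optimal classifier, shows the bias correction directly. The mixture components and the target functional below are illustrative choices, not the exact setup of Figure 1:

```python
import math
import random

random.seed(0)

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Data distribution: equal mixture of N(-2, 1) and N(2, 1)
p_data = lambda x: 0.5 * gauss_pdf(x, -2.0, 1.0) + 0.5 * gauss_pdf(x, 2.0, 1.0)
# Moment-matched unimodal Gaussian model: mean 0, variance 0.5*(1+4) + 0.5*(1+4) = 5
p_model = lambda x: gauss_pdf(x, 0.0, math.sqrt(5.0))

f = lambda x: float(x > 1.0)  # estimate P_data(X > 1), roughly 0.421

xs = [random.gauss(0.0, math.sqrt(5.0)) for _ in range(100_000)]  # x ~ p_model
naive = sum(f(x) for x in xs) / len(xs)  # biased: recovers P_model(X > 1), roughly 0.327
# Bayes-optimal weights w(x) = c(x) / (1 - c(x)) = p_data(x) / p_model(x)
weights = [p_data(x) / p_model(x) for x in xs]
lfiw = sum(w * f(x) for w, x in zip(weights, xs)) / len(xs)
```

The importance-weighted estimate `lfiw` is substantially closer to the true tail probability than the naive model-based estimate.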
Table 1: Goodness-of-fit evaluation on CIFAR-10 for PixelCNN++ and SNGAN. Standard errors computed over 10 runs. Higher IS is better; lower FID and KID scores are better.

Model | Evaluation | IS (↑) | FID (↓) | KID (↓)
– | Reference | 11.09 ± 0.1263 | 5.20 ± 0.0533 | 0.008 ± 0.0004
PixelCNN++ | Default (no debiasing) | 5.16 ± 0.0117 | 58.70 ± 0.0506 | 0.196 ± 0.0001
PixelCNN++ | LFIW | 6.68 ± 0.0773 | 55.83 ± 0.9695 | 0.126 ± 0.0009
SNGAN | Default (no debiasing) | 8.33 ± 0.0280 | 20.40 ± 0.0747 | 0.094 ± 0.0002
SNGAN | LFIW | 8.57 ± 0.0325 | 17.29 ± 0.0698 | 0.073 ± 0.0004

3 Application Use Cases
Our goal is to demonstrate improved Monte Carlo inference using a pretrained deep generative model. In our experiments, the binary classifier for estimating the importance weights was a calibrated deep neural network trained to minimize the cross-entropy loss. The self-normalized LFIW estimator in Eq. (6) worked best. Additional analysis of the estimators and experiment details are in Appendices B and C.
Table 2: Multi-class classification accuracy on Omniglot for different training dataset configurations (see Section 3.2).

Dataset | Real | Generated | Generated (LFIW) | Real + Generated | Real + Generated (LFIW)
Accuracy | 0.6603 ± 0.0012 | 0.4431 ± 0.0054 | 0.4481 ± 0.0056 | 0.6600 ± 0.0040 | 0.6818 ± 0.0022
3.1 Goodness-of-Fit Testing
In the first set of experiments, we highlight the benefits of importance weighting for debiased evaluation of three popularly used sample quality metrics, viz., Inception Score (IS) [27], Fréchet Inception Distance (FID) [28], and Kernel Inception Distance (KID) [29]. All these scores can be formally expressed as empirical expectations with respect to the model. For each metric, we can simulate the population-level unbiased case as a "reference score", wherein we artificially set both the real and generated sets of samples used for evaluation to finite, disjoint sets derived from the real data.
We evaluate the three metrics for two representative state-of-the-art models trained on the CIFAR-10 dataset, viz., an autoregressive model, PixelCNN++ [13], learned via maximum likelihood estimation, and a latent variable model, SNGAN [14], learned via adversarial training. For evaluating each metric, we draw 10,000 samples from the model. In Table 1, we report the metrics with and without the LFIW bias correction. The consistently debiased evaluation of these metrics via LFIW suggests improved Monte Carlo evaluation for other downstream tasks as well, such as the ones we discuss next.
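As an illustration of how such metrics combine with LFIW, here is a sketch of a self-normalized importance-weighted Inception Score. IS is $\exp(\mathbb{E}_x[\mathrm{KL}(p(y\mid x)\,\|\,p(y))])$; the precise way the weights enter each metric in our evaluation is not reproduced here, so treat the weighting scheme and names below as an assumption for exposition:

```python
import math

def weighted_inception_score(probs, weights):
    """probs: per-sample label distributions p(y|x_i) from a pretrained
    classifier; weights: raw LFIW weights for the generated samples.
    (Illustrative sketch; assumed form of the weighted metric.)"""
    total = sum(weights)
    weights = [w / total for w in weights]  # self-normalize, as in Eq. (6)
    k = len(probs[0])
    # importance-weighted marginal label distribution p(y)
    p_y = [sum(w * p[j] for w, p in zip(weights, probs)) for j in range(k)]
    # importance-weighted average of KL(p(y|x) || p(y))
    kl = sum(w * sum(p[j] * math.log(p[j] / p_y[j]) for j in range(k) if p[j] > 0)
             for w, p in zip(weights, probs))
    return math.exp(kl)
```

With uniform weights this reduces to the standard Inception Score; reweighting shifts the marginal p(y) and the per-sample KL terms toward the data distribution.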
3.2 Data Augmentation for Multi-Class Classification
We consider data augmentation via Data Augmentation Generative Adversarial Networks (DAGAN) [1]. While DAGAN was motivated by and evaluated for the task of meta-learning, it can also be applied in multi-class classification scenarios, which is the setting we consider here. We trained a DAGAN on the Omniglot dataset of handwritten characters [15]; the training procedure is described in the Appendix. The dataset is particularly relevant because it contains more than 1,600 classes but only 20 examples per class, and hence could potentially benefit from augmented data. For each class, we consider a 15/5/5 split of the 20 examples for training, validation, and testing.
Once the model has been trained, it can be used for data augmentation in many ways. In particular, we consider ablation baselines that use various combinations of the real training data and generated data for training a downstream classifier. When the generated data is used, we can either weight all training points uniformly or importance weight (LFIW) the contributions of the individual training points to the overall loss. The results are shown in Table 2. As expected, generated data alone cannot match the performance obtained with real data on this task; moreover, the bias the generated data introduces into evaluation and subsequent optimization means that naive augmentation performs no better than real data alone. In contrast, we obtain significant improvements by importance weighting the generated points.
Qualitatively, we can observe the effect of importance weighting in Figure 2. Here, we show true and generated samples for randomly chosen classes (a-f) in the Omniglot dataset. The generated samples are further ranked in decreasing order of their importance weights. There is no way to formally test the validity of such rankings, and since we are looking at ratios, this criterion can also prefer points which have high density under the data distribution but are unlikely under the model. Visual inspection suggests that the classifier is able to appropriately downweight poorer samples, as shown in Figure 2 (a, b, c, d, bottom right). There are also failure modes, such as the lowest-ranked generated images in Figure 2 (e, f, bottom right), where the classifier weights reasonable generated samples poorly relative to others. This could be due to particular artifacts, such as the tiny disconnected blurry speck in Figure 2 (e, bottom right), which are potentially more revealing to a classifier distinguishing real and generated data.
3.3 Model-Based Off-Policy Policy Evaluation
So far, we have seen the benefits of our debiasing framework in cases where the generative model was trained on data from the same distribution we wish to use for Monte Carlo evaluation. We can extend the same principle to more involved settings when the generative model is a building block for specifying the full data generation process, e.g., trajectory data generated via a probabilistic dynamics model along with an agent policy.
In particular, we consider the setting of off-policy policy evaluation (OPE), where the goal is to evaluate policies using experiences collected from a different policy. Formally, let $M = (\mathcal{S}, \mathcal{A}, r, P, \eta, H)$ denote an (undiscounted) Markov decision process with state space $\mathcal{S}$, action space $\mathcal{A}$, reward function $r$, transition dynamics $P$, initial state distribution $\eta$, and horizon $H$. Assume $\pi$ is a known policy that we wish to evaluate. The probability of generating a trajectory $\tau = (s_1, a_1, \ldots, s_H, a_H)$ of length $H$ with policy $\pi$ and transitions $P$ is given as:

$p(\tau) = \eta(s_1) \prod_{t=1}^{H} \pi(a_t \mid s_t) \prod_{t=1}^{H-1} P(s_{t+1} \mid s_t, a_t) \quad (9)$
The return of a trajectory $\tau$ is the sum of rewards over the state-action pairs in $\tau$: $R(\tau) = \sum_{t=1}^{H} r(s_t, a_t)$, where we assume the reward function $r$ is known. We are interested in the value of the policy, defined as $v(\pi) = \mathbb{E}_{\tau \sim p}[R(\tau)]$.
Evaluating $v(\pi)$ requires the (unknown) transition dynamics $P$. The dynamics model is a conditional generative model $\hat P_\theta(s_{t+1} \mid s_t, a_t)$ of the next state conditioned on the previous state-action pair. If we have access to historical logged data of trajectories from some behavioral policy $\pi_b$, then we can use this off-policy data to train the dynamics model. The policy can then be evaluated under the learned dynamics model as $\hat v(\pi) = \mathbb{E}_{\tau \sim \hat p}[R(\tau)]$, where $\hat p$ uses $\hat P_\theta$ instead of the true dynamics $P$ in Eq. (9).
However, the trajectories sampled with $\hat P_\theta$ could deviate significantly from samples from $P$ due to compounding errors [30]. In order to correct for this bias, we can use likelihood-free importance weighting on entire trajectories of data. The binary classifier for estimating the importance weights in this case distinguishes between triples of true and generated transitions. For any true triple $(s_t, a_t, s_{t+1})$ extracted from the off-policy data, the corresponding generated triple $(s_t, a_t, \hat s_{t+1})$ differs only in the final transition state. Such a classifier allows us to obtain an importance weight $\hat w(s_t, a_t, s_{t+1})$ for every predicted state transition. The importance weight for a trajectory can be derived from the importance weights of its individual transitions as:

$\hat w(\tau) = \prod_{t=1}^{H-1} \hat w(s_t, a_t, s_{t+1}) \quad (10)$

Our final LFIW estimator is given as:

$\hat v(\pi) \approx \frac{1}{T} \sum_{i=1}^{T} \hat w(\tau_i)\, R(\tau_i), \qquad \tau_i \sim \hat p \quad (11)$
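A minimal sketch of Eqs. (10) and (11): per-transition classifier weights are multiplied into a trajectory weight, optionally truncated to the first few transitions (the partial weighting used in the experiments), and then combined with the self-normalized estimator of Eq. (6). All names are illustrative:

```python
def trajectory_weight(step_weights, T=None):
    # Eq. (10): multiply the per-transition weights w(s_t, a_t, s_{t+1});
    # truncating to the first T transitions uses uniform weights afterwards
    if T is None:
        T = len(step_weights)
    w = 1.0
    for wt in step_weights[:T]:
        w *= wt
    return w

def lfiw_policy_value(returns, step_weights_per_traj, T=None):
    # Eq. (11) combined with the self-normalized estimator of Eq. (6)
    ws = [trajectory_weight(sw, T) for sw in step_weights_per_traj]
    total = sum(ws)
    return sum(w * r for w, r in zip(ws, returns)) / total
```

The multiplicative composition is what drives the variance discussed below: a few large or small per-transition weights dominate the trajectory-level weights.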
Table 3: Off-policy policy evaluation on Swimmer, HalfCheetah, and HumanoidStandup: ground-truth returns compared against estimates under the learned dynamics model, with and without LFIW.
We consider three continuous control tasks in the MuJoCo simulator [18] from OpenAI Gym [31], in increasing order of state dimensionality: Swimmer, HalfCheetah, and HumanoidStandup. High-dimensional state spaces (e.g., 376 dimensions for HumanoidStandup) make it challenging to learn a reliable dynamics model in these environments. We train behavioral and evaluation policies using Proximal Policy Optimization [32] with different hyperparameters for the two policies. The dataset collected via trajectories from the behavioral policy is used to train an ensemble neural network dynamics model. We then use the trained dynamics model to compute the estimate $\hat v(\pi)$ and its importance-weighted version, and compare them with the ground-truth returns. Each estimate is averaged over a set of 100 trajectories. For the importance-weighted estimate, we additionally average over 5 classifier instances trained with different random seeds. We further consider performing importance weighting over only the first $T$ steps of each trajectory, using uniform weights for the remainder; this allows us to interpolate between the default and fully importance-weighted estimates. Finally, as in the other experiments, we use the self-normalized variant (Eq. (6)) of the importance-weighted estimator in Eq. (11).

We compare the policy evaluations under the different environments in Table 3. These results show that the returns estimated with the trained dynamics model differ from the ground truth by a large margin. By importance weighting the trajectories, we obtain much more accurate policy evaluations. As expected, while LFIW leads to higher returns on average, the imbalance in trajectory importance weights due to the multiplicative composition of per-transition weights can lead to higher variance in the importance-weighted returns. In Figure 3, we show that policy evaluation becomes more accurate as more timesteps are used for the LFIW evaluation, up to a point, which empirically validates the benefits of importance weighting using a classifier. Given that our estimates have large variance but generally include the true policy value within their uncertainty intervals, it would be worthwhile to compose our approach with variance reduction techniques such as (weighted) doubly robust estimation in future work [33], as well as incorporate these estimates within a framework such as MAGIC to further blend with model-free OPE [17].
Overall. Across all our experiments, we observe that importance weighting the generated samples leads to uniformly better results, whether in terms of evaluating the quality of samples or their utility in downstream tasks. Since the technique is a black-box wrapper around any generative model, we expect it to benefit a diverse set of tasks in follow-up works.
However, there is also some caution to be exercised with these techniques, as evident from the results of Table 1. Note that in this table, the confidence intervals (computed using the reported standard errors) around the model scores after importance weighting still do not contain the reference scores obtained from the true model. This would not be the case if our debiased estimator were completely unbiased, and this observation reiterates our earlier claim that LFIW reduces bias, as opposed to eliminating it entirely. Indeed, when such a mismatch is observed, a good diagnostic is to either learn more powerful classifiers that better approximate the Bayes optimum, or collect additional real data in case the generative model violates the full-support assumption.
4 Related Work & Discussion
Density ratios enjoy widespread use across machine learning, e.g., for handling covariate shift, class imbalance, etc. [9, 34]. In generative modeling, estimating these ratios via binary classifiers is frequently used for defining learning objectives [11, 35]. In particular, such classifiers have been used to define learning frameworks such as generative adversarial networks [8, 10], likelihood-free Approximate Bayesian Computation (ABC) [36], and earlier work in unsupervised-as-supervised learning [24] and noise contrastive estimation [36], among others. Recently, [37] used importance weighting to reweight datapoints based on differences between training and test data distributions, i.e., dataset bias. The key difference is that these works are explicitly interested in learning the parameters of a generative model; in contrast, we use the binary classifier to estimate importance weights that correct for the model bias of any fixed generative model.

Classifiers have also been used for defining two-sample tests [38, 39, 40, 41, 35, 42, 43]. These tests are not restricted to probabilistic classifiers, and the goal there is to evaluate sample quality via goodness-of-fit tests, e.g., FID, KID, etc. Our framework applies to arbitrary functions beyond goodness-of-fit testing, such as a classification loss or the value function of a policy. Finally, as shown in the experiments, our approach can also be used for a bias-sensitive evaluation of the above metrics.
Closely related to the above use case are recent concurrent works [44, 45, 46] that use MCMC and resampling techniques such as rejection sampling to explicitly transform or reject the generated samples. Unlike the proposed importance weighting strategy, these methods require extra computation beyond training a classifier, in rejecting samples or running Markov chains to convergence. The novel application use cases we present in this work do not require this extra computation. Moreover, principled rejection sampling requires an upper bound on the density ratio that holds for all data points, which is typically infeasible to obtain.
5 Conclusion & Future Work
In this work, we identified bias with respect to a target data distribution as a fundamental challenge restricting the use of deep generative models as proposal distributions for Monte Carlo evaluation. We proposed a bias correction framework based on importance sampling, where the importance weights are learned in a likelihood-free fashion via a binary classifier. Empirically, we find the bias correction to be useful across a surprising variety of tasks, including goodness-of-fit sample quality tests and the motivating use cases of data augmentation and model-based off-policy policy evaluation.
The ability to characterize the bias of a deep generative model is an important step towards using these models in risk-sensitive applications with high uncertainty [47, 48], such as robust anomaly detection [49, 50]. However, as noted, importance reweighting is valid only when the support of the model distribution contains that of the true data distribution. This can be a particular problem for models which incur significant mode dropping, and simple heuristics such as Gaussian perturbations of the generated samples to increase the support can only partially alleviate the problem, motivating the exploration of other debiasing techniques in future work.
Acknowledgements
This project was initiated when AG was an intern at Microsoft Research. We are thankful to Daniel Levy, Rui Shu, Yang Song, and members of the Reinforcement Learning, Deep Learning, and Adaptive Systems and Interaction groups at Microsoft Research for helpful discussions and comments on early drafts.
References
 Antoniou et al. [2017] Antreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340, 2017.
 Mannor et al. [2007] Shie Mannor, Duncan Simester, Peng Sun, and John N Tsitsiklis. Bias and variance approximation in value function estimates. Management Science, 53(2):308–322, 2007.
 Thomas [2015] Philip S Thomas. Safe reinforcement learning. PhD thesis, University of Massachusetts Libraries, 2015.
 Rosenblatt [1956] Murray Rosenblatt. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, pages 832–837, 1956.
 Arora et al. [2018] Sanjeev Arora, Andrej Risteski, and Yi Zhang. Do gans learn the distribution? some theory and empirics. In International Conference on Learning Representations, 2018.
 Horvitz and Thompson [1952] Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association, 1952.
 Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 Sugiyama et al. [2012] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
 Nowozin et al. [2016] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pages 271–279, 2016.
 Mohamed and Lakshminarayanan [2016] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

 Grover and Ermon [2018] Aditya Grover and Stefano Ermon. Boosted generative models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 Salimans et al. [2017] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
 Miyato et al. [2018] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
 Lake et al. [2015] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
 Precup et al. [2000] Doina Precup, Richard S. Sutton, and Satinder P. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.
 Thomas and Brunskill [2016] Philip Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pages 2139–2148, 2016.
 Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.
 Uria et al. [2016] Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle. Neural autoregressive distribution estimation. The Journal of Machine Learning Research, 17(1):7184–7220, 2016.
 Dinh et al. [2014] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
 Rezende et al. [2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 Ratner et al. [2017] Alexander J Ratner, Henry Ehrenberg, Zeshan Hussain, Jared Dunnmon, and Christopher Ré. Learning to compose domain-specific transformations for data augmentation. In Advances in Neural Information Processing Systems, pages 3236–3246, 2017.
 Zhao et al. [2018] Shengjia Zhao, Hongyu Ren, Arianna Yuan, Jiaming Song, Noah Goodman, and Stefano Ermon. Bias and generalization in deep generative models: An empirical study. In Advances in Neural Information Processing Systems, pages 10815–10824, 2018.
 Friedman et al. [2001] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning, volume 1. Springer series in statistics New York, NY, USA:, 2001.
 Odena et al. [2016] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 2016. doi: 10.23915/distill.00003. URL http://distill.pub/2016/deconvcheckerboard.
 Odena [2019] Augustus Odena. Open questions about generative adversarial networks. Distill, 4(4):e18, 2019.
 Salimans et al. [2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
 Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two timescale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
 Bińkowski et al. [2018] Mikołaj Bińkowski, Dougal J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.

Ross and Bagnell [2010] Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 661–668, 2010.
Brockman et al. [2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
 Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Farajtabar et al. [2018] Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh. More robust doubly robust off-policy evaluation. In International Conference on Machine Learning, 2018.
 Byrd and Lipton [2018] Jonathon Byrd and Zachary C Lipton. What is the effect of importance weighting in deep learning? arXiv preprint arXiv:1812.03372, 2018.
Rosca et al. [2017] Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, and Shakir Mohamed. Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987, 2017.
Gutmann and Hyvärinen [2012] Michael U. Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research, 13(Feb):307–361, 2012.
 Diesendruck et al. [2018] Maurice Diesendruck, Ethan R Elenberg, Rajat Sen, Guy W Cole, Sanjay Shakkottai, and Sinead A Williamson. Importance weighted generative networks. arXiv preprint arXiv:1806.02512, 2018.
Gretton et al. [2007] Arthur Gretton, Karsten M. Borgwardt, Malte Rasch, Bernhard Schölkopf, and Alex J. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems, pages 513–520, 2007.
 Bowman et al. [2015] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
Lopez-Paz and Oquab [2016] David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545, 2016.
Danihelka et al. [2017] Ivo Danihelka, Balaji Lakshminarayanan, Benigno Uria, Daan Wierstra, and Peter Dayan. Comparison of maximum likelihood and GAN-based training of Real NVPs. arXiv preprint arXiv:1705.05263, 2017.
Im et al. [2018] Daniel Jiwoong Im, He Ma, Graham Taylor, and Kristin Branson. Quantitatively evaluating GANs with divergences proposed for training. arXiv preprint arXiv:1803.01045, 2018.
Gulrajani et al. [2019] Ishaan Gulrajani, Colin Raffel, and Luke Metz. Towards GAN benchmarks which require generalization. In International Conference on Learning Representations, 2019.
Turner et al. [2018] Ryan Turner, Jane Hung, Yunus Saatci, and Jason Yosinski. Metropolis-Hastings generative adversarial networks. arXiv preprint arXiv:1811.11357, 2018.
 Azadi et al. [2018] Samaneh Azadi, Catherine Olsson, Trevor Darrell, Ian Goodfellow, and Augustus Odena. Discriminator rejection sampling. arXiv preprint arXiv:1810.06758, 2018.
Tao et al. [2018] Chenyang Tao, Liqun Chen, Ricardo Henao, Jianfeng Feng, and Lawrence Carin Duke. Chi-square generative adversarial network. In International Conference on Machine Learning, pages 4894–4903, 2018.
Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
 Lakshminarayanan et al. [2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
 Nalisnick et al. [2018] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don’t know? arXiv preprint arXiv:1810.09136, 2018.
 Choi and Jang [2018] Hyunsun Choi and Eric Jang. Generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392, 2018.
 Efron and Tibshirani [1994] Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap. CRC press, 1994.
Abadi et al. [2016] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
Paszke et al. [2017] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
Vinyals et al. [2016] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
Dhariwal et al. [2017] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, and Yuhuai Wu. OpenAI Baselines. GitHub repository, 2017.
Ramachandran et al. [2017] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
Appendices
Appendix A Confidence Intervals via Bootstrap
The bootstrap is a widely-used tool in statistics for deriving confidence intervals by fitting ensembles of models on resampled data points. If only a finite dataset is available, the bootstrapped dataset is obtained via random sampling with replacement, and confidence intervals are estimated via the empirical bootstrap. If a parametric model of the data-generating process is available, a fresh bootstrapped dataset is instead resampled from the model, and confidence intervals are estimated via the parametric bootstrap. See [51] for a detailed review. When training a binary classifier, we can estimate confidence intervals by retraining the classifier on a fresh sample of points from the generative model and a resampling of the training dataset (with replacement). Repeating this process over multiple runs and then taking a suitable quantile gives us the corresponding confidence intervals.
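As an illustration, the empirical bootstrap described above can be sketched in a few lines; the helper name `bootstrap_ci` and the synthetic sample are our own and not part of the actual pipeline.

```python
import numpy as np

def bootstrap_ci(data, statistic, n_boot=1000, alpha=0.05, seed=0):
    """Empirical bootstrap: resample the data with replacement, recompute
    the statistic, and take quantiles of the resulting distribution."""
    rng = np.random.default_rng(seed)
    n = len(data)
    stats = np.array([statistic(data[rng.integers(0, n, size=n)])
                      for _ in range(n_boot)])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# 95% confidence interval for the mean of a synthetic sample.
sample = np.random.default_rng(1).normal(size=500)
lo, hi = bootstrap_ci(sample, np.mean)
```

The same recipe applies to any scalar statistic of the classifier, e.g. a single importance-weighted estimate, by replacing `np.mean` with the corresponding function.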
Appendix B Bias-Variance of Different LFIW Estimators
As discussed in Section 2.3, bias reduction using LFIW can suffer when the importance weights are too small due to highly confident predictions of the binary classifier. Across a batch of Monte Carlo samples, this can increase the corresponding variance. Inspired by the importance sampling literature, we proposed additional mechanisms in Eqs. (6–8) to mitigate this extra variance at the cost of reduced debiasing. We now examine the empirical bias-variance tradeoff of these different estimators via a simple experiment.
Our setup follows the goodness-of-fit testing experiments in Section 3. The statistics we choose to estimate are the 2048 activations of the pre-final layer of the Inception Network, averaged across the test set of CIFAR-10.
That is, the true statistics are given by:
$$ s_i = \frac{1}{|\mathcal{D}_{\text{test}}|} \sum_{x \in \mathcal{D}_{\text{test}}} \phi_i(x) \quad \text{for } i = 1, 2, \ldots, 2048 \tag{12} $$
where $\phi_i(x)$ is the $i$-th pre-final layer activation of the Inception Network for input $x$. Note that this set of statistics is fixed (computed once on the test set).
To estimate these statistics, we will use different estimators. For example, the default estimator involving no reweighting is given as:
$$ \hat{s}_i = \frac{1}{T} \sum_{j=1}^{T} \phi_i(x_j) \tag{13} $$
where $x_j \sim p_\theta$.
Note that $\hat{s}_i$ is a random variable since it depends on the samples drawn from $p_\theta$. Similar to Eq. (13), the other variants of the LFIW estimators proposed in Section 2.3 can be derived using Eqs. (6–8). For any LFIW estimate $\hat{s}_i$, we can use the standard decomposition of the expected mean-squared error into terms corresponding to the (squared) bias and variance, as shown below.
$$ \mathbb{E}\left[(\hat{s}_i - s_i)^2\right] = \mathbb{E}\left[\hat{s}_i^2\right] - 2 s_i \, \mathbb{E}\left[\hat{s}_i\right] + s_i^2 \tag{14} $$
$$ = \left(\mathbb{E}\left[\hat{s}_i\right] - s_i\right)^2 + \mathbb{E}\left[\hat{s}_i^2\right] - \mathbb{E}\left[\hat{s}_i\right]^2 \tag{15} $$
$$ = \underbrace{\left(\mathbb{E}\left[\hat{s}_i\right] - s_i\right)^2}_{\text{Bias}^2} + \underbrace{\mathrm{Var}\left[\hat{s}_i\right]}_{\text{Variance}} \tag{16} $$
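To make the tradeoff concrete, the toy sketch below compares the unweighted estimator of Eq. (13) against the self-normalized importance-weighted estimator on a 1-D Gaussian example. For illustration we use the exact density ratio in place of the classifier-based estimate, and all names (`run_estimators`, the chosen means, sample counts) are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_p, mu_q = 0.0, 0.5       # "data" distribution p and "model" q, both N(mu, 1)
true_stat = mu_p            # the statistic being estimated: E_p[x]

def log_ratio(x):
    # exact log p(x)/q(x) for two unit-variance Gaussians
    return 0.5 * ((x - mu_q) ** 2 - (x - mu_p) ** 2)

def run_estimators(n=1000, runs=500):
    plain, snis = [], []
    for _ in range(runs):
        x = rng.normal(mu_q, 1.0, size=n)       # Monte Carlo samples from q
        w = np.exp(log_ratio(x))
        plain.append(x.mean())                   # no reweighting (Eq. 13)
        snis.append((w * x).sum() / w.sum())     # self-normalized LFIW
    return np.array(plain), np.array(snis)

plain, snis = run_estimators()

def bias2_and_var(est):
    return (est.mean() - true_stat) ** 2, est.var()

b_plain, v_plain = bias2_and_var(plain)
b_snis, v_snis = bias2_and_var(snis)
```

On this example the reweighted estimator removes almost all of the bias of the plain Monte Carlo average while paying a modest variance penalty, mirroring the pattern in the tables below.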
In Table 4, we report the bias and variance terms of the estimators, averaged over 10 independent draws of samples and further averaged over all 2048 statistics. We observe that self-normalization performs consistently well and is the best or second best in terms of bias and MSE in all cases. The flattened estimator with no debiasing (corresponding to a flattening exponent of zero) has lower bias and higher variance than the self-normalized estimator. Amongst the flattening estimators, lower values of the flattening exponent seem to provide the best bias-variance tradeoff. The clipped estimators do not perform well in this setting, with lower clipping thresholds slightly preferable over larger ones. We repeat the same experiment with fewer samples and report the results in Table 5. While the variance increases as expected (by almost an order of magnitude for some estimators), the estimator bias remains roughly the same.
Table 4: Bias, variance, and MSE of the LFIW estimators (mean ± standard error).

Model  Evaluation  Bias  Variance  MSE

PixelCNN++  Self-norm  0.0240 ± 0.0014  0.0002935 ± 7.22e-06  0.0046 ± 0.00031
Flattening ()  0.0330 ± 0.0023  9.1e-06 ± 2.6e-07  0.0116 ± 0.00093
Flattening ()  0.1042 ± 0.0018  5.1e-06 ± 1.5e-07  0.0175 ± 0.00138
Flattening ()  0.1545 ± 0.0022  8.4e-06 ± 3.7e-07  0.0335 ± 0.00246
Flattening ()  0.1626 ± 0.0022  3.19e-05 ± 2e-06  0.0364 ± 0.00259
Flattening ()  0.1359 ± 0.0018  0.0002344 ± 1.619e-05  0.0257 ± 0.00175
Clipping ()  0.1359 ± 0.0018  0.0002344 ± 1.619e-05  0.0257 ± 0.00175
Clipping ()  0.1357 ± 0.0018  0.0002343 ± 1.618e-05  0.0256 ± 0.00175
Clipping ()  0.1233 ± 0.0017  0.000234 ± 1.611e-05  0.0215 ± 0.00149
Clipping ()  0.1255 ± 0.0030  0.0002429 ± 1.606e-05  0.0340 ± 0.00230
SNGAN  Self-norm  0.0178 ± 0.0008  1.98e-05 ± 5.9e-07  0.0016 ± 0.00023
Flattening ()  0.0257 ± 0.0010  9.1e-06 ± 2.3e-07  0.0026 ± 0.00027
Flattening ()  0.0096 ± 0.0007  8.4e-06 ± 3.1e-07  0.0011 ± 8e-05
Flattening ()  0.0295 ± 0.0006  1.15e-05 ± 6.4e-07  0.0017 ± 0.00011
Flattening ()  0.0361 ± 0.0006  1.93e-05 ± 1.39e-06  0.002 ± 0.00012
Flattening ()  0.0297 ± 0.0005  3.76e-05 ± 3.08e-06  0.0015 ± 7e-05
Clipping ()  0.0297 ± 0.0005  3.76e-05 ± 3.08e-06  0.0015 ± 7e-05
Clipping ()  0.0297 ± 0.0005  3.76e-05 ± 3.08e-06  0.0015 ± 7e-05
Clipping ()  0.0296 ± 0.0005  3.76e-05 ± 3.08e-06  0.0015 ± 7e-05
Clipping ()  0.1002 ± 0.0018  3.03e-05 ± 2.18e-06  0.0170 ± 0.00171
Table 5: Bias, variance, and MSE of the LFIW estimators for the repeated experiment (mean ± standard error).

Model  Evaluation  Bias  Variance  MSE

PixelCNN++  Self-norm  0.023 ± 0.0014  0.0005086 ± 1.317e-05  0.0049 ± 0.00033
Flattening ()  0.0330 ± 0.0023  1.65e-05 ± 4.6e-07  0.0116 ± 0.00093
Flattening ()  0.1038 ± 0.0018  9.5e-06 ± 3e-07  0.0174 ± 0.00137
Flattening ()  0.1539 ± 0.0022  1.74e-05 ± 8e-07  0.0332 ± 0.00244
Flattening ()  0.1620 ± 0.0022  6.24e-05 ± 3.83e-06  0.0362 ± 0.00256
Flattening ()  0.1360 ± 0.0018  0.0003856 ± 2.615e-05  0.0258 ± 0.00174
Clipping ()  0.1360 ± 0.0018  0.0003856 ± 2.615e-05  0.0258 ± 0.00174
Clipping ()  0.1358 ± 0.0018  0.0003856 ± 2.615e-05  0.0257 ± 0.00173
Clipping ()  0.1234 ± 0.0017  0.0003851 ± 2.599e-05  0.0217 ± 0.00148
Clipping ()  0.1250 ± 0.0030  0.0003821 ± 2.376e-05  0.0341 ± 0.00232
SNGAN  Self-norm  0.0176 ± 0.0008  3.88e-05 ± 9.6e-07  0.0016 ± 0.00022
Flattening ()  0.0256 ± 0.0010  1.71e-05 ± 4.3e-07  0.0027 ± 0.00027
Flattening ()  0.0099 ± 0.0007  1.44e-05 ± 3.7e-07  0.0011 ± 8e-05
Flattening ()  0.0298 ± 0.0006  1.62e-05 ± 5.3e-07  0.0017 ± 0.00012
Flattening ()  0.0366 ± 0.0006  2.38e-05 ± 1.11e-06  0.0021 ± 0.00012
Flattening ()  0.0302 ± 0.0005  4.56e-05 ± 2.8e-06  0.0015 ± 7e-05
Clipping ()  0.0302 ± 0.0005  4.56e-05 ± 2.8e-06  0.0015 ± 7e-05
Clipping ()  0.0302 ± 0.0005  4.56e-05 ± 2.8e-06  0.0015 ± 7e-05
Clipping ()  0.0302 ± 0.0005  4.56e-05 ± 2.81e-06  0.0015 ± 7e-05
Clipping ()  0.1001 ± 0.0018  5.19e-05 ± 2.81e-06  0.0170 ± 0.0017
Appendix C Additional Experimental Details
C.1 Calibration
We found that in all our cases the binary classifiers used for estimating the importance weights were highly calibrated by default, and did not require any further recalibration. We performed this analysis on a held-out set of real and generated samples, binning the predicted probabilities to compute calibration statistics.
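As a sketch of the kind of check we ran, the snippet below computes a binned expected calibration error; the function name and the bin count of 10 are illustrative, not the exact settings used.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: |accuracy - confidence| per bin, weighted by bin mass."""
    probs, labels = np.asarray(probs), np.asarray(labels)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs > lo) & (probs <= hi)
        if mask.any():
            conf = probs[mask].mean()   # average predicted probability
            acc = labels[mask].mean()   # empirical frequency of the label
            ece += mask.mean() * abs(acc - conf)
    return ece

# A perfectly calibrated predictor: labels drawn Bernoulli(p) for each p.
rng = np.random.default_rng(0)
p = rng.uniform(size=20000)
y = rng.binomial(1, p)
ece = expected_calibration_error(p, y)
```

A well-calibrated classifier yields an ECE close to zero; large values would warrant recalibration (e.g. temperature scaling) before converting probabilities to importance weights.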
C.2 Synthetic experiment
The classifier used in this case is a multi-layer perceptron with a single hidden layer of 100 units, trained to minimize the cross-entropy loss using first-order optimization methods. The dataset used for training the classifier consists of an equal number of samples (the count denoted in Figure 1) drawn from the generative model and the data distribution.
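The classifier-to-density-ratio conversion that this experiment relies on can be sketched as follows; here we substitute the Bayes-optimal classifier for two 1-D Gaussians in place of a trained MLP so that the recovered ratio can be checked against the exact one (all names are ours).

```python
import numpy as np

def normal_pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

def importance_weight(c, eps=1e-7):
    """Density ratio p_data/p_model from a calibrated classifier output
    c = P(real | x), assuming balanced real/generated training data."""
    c = np.clip(c, eps, 1 - eps)   # guard against overconfident predictions
    return c / (1 - c)

# Sanity check with the Bayes-optimal classifier for two 1-D Gaussians:
# p_data = N(0, 1), p_model = N(1, 1), so c(x) = p/(p + q) and c/(1 - c) = p/q.
x = np.linspace(-3.0, 3.0, 101)
p = normal_pdf(x, 0.0)   # stand-in for the data density
q = normal_pdf(x, 1.0)   # stand-in for the model density
w = importance_weight(p / (p + q))
```

In practice `c` comes from the trained MLP rather than the Bayes-optimal rule, which is why calibration (Appendix C.1) matters for the quality of the recovered weights.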
C.3 Goodness-of-fit testing
We used the TensorFlow implementation of the Inception Network [52] to ensure the sample quality metrics are comparable with prior work. All our experiments were performed on a single NVIDIA Titan X GPU.
For a semantic evaluation of the difference in sample quality, this test is performed in the feature space of a pretrained classifier, such as the pre-final activations of the Inception Network [53]. For example, the Inception score for a generative model $p_\theta$ given a classifier $d$ can be expressed as:
$$ \text{IS}(p_\theta) = \exp\left( \mathbb{E}_{x \sim p_\theta} \left[ \mathrm{KL}\left( d(y \mid x) \,\|\, d(y) \right) \right] \right) $$
where $d(y \mid x)$ is the classifier's label distribution for input $x$ and $d(y)$ is its marginal over samples from $p_\theta$.
The FID score is another metric which, unlike the Inception score, also takes into account real data from $p_{\text{data}}$. Mathematically, the FID between sets $S_1$ and $S_2$ sampled from distributions $p_1$ and $p_2$ respectively is defined as:
$$ \text{FID}(S_1, S_2) = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}\left( \Sigma_1 + \Sigma_2 - 2 \left(\Sigma_1 \Sigma_2\right)^{1/2} \right) $$
where $(\mu_1, \Sigma_1)$ and $(\mu_2, \Sigma_2)$ are the empirical means and covariances computed based on $S_1$ and $S_2$ respectively. In a similar vein, KID compares statistics between samples in a feature space defined via a combination of kernels and a pretrained classifier; the standard kernel used is a radial-basis function kernel with a fixed bandwidth. As desired, each score is optimized when the data and model distributions match.
We used the open-sourced model implementations of PixelCNN++ [27] and SNGAN [14]. Following the observation by [40], we found that training the binary classifier on top of the feature space of a pretrained image classifier, rather than on raw pixels, was useful for preventing it from relying on low-level artifacts in the generated images when classifying an image as real or fake. We hence learned a multi-layer perceptron (with a single hidden layer) on top of the 2048-dimensional pre-final feature space of the Inception Network. Learning was done using the Adam optimizer with the default hyperparameters. We observed relatively fast convergence for training the binary classifier on both PixelCNN++ and SNGAN generated data, and the best validation set accuracy across the early epochs was used for final model selection.
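For concreteness, FID as defined above can be computed from two feature matrices as in the sketch below. Standard implementations use a dedicated matrix square-root routine; here we instead take the trace of the square root from the eigenvalues of $\Sigma_1 \Sigma_2$ (real and non-negative for PSD covariances). The function name and synthetic features are illustrative.

```python
import numpy as np

def fid(feats1, feats2):
    """Frechet distance between Gaussians fit to two sets of features:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    mu1, mu2 = feats1.mean(axis=0), feats2.mean(axis=0)
    s1 = np.cov(feats1, rowvar=False)
    s2 = np.cov(feats2, rowvar=False)
    # Tr((S1 S2)^{1/2}) equals the sum of square roots of the eigenvalues
    # of S1 @ S2, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(s1 @ s2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1) + np.trace(s2) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
feats_real = rng.normal(0.0, 1.0, size=(2000, 8))  # stand-in Inception features
feats_gen = rng.normal(0.5, 1.0, size=(2000, 8))
```

Identical feature sets give a score of zero, and the score grows as the two Gaussian fits diverge, matching the definition above.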
C.4 Data Augmentation
Our codebase was implemented using the PyTorch library [54]. We built on top of the open-source implementation of DAGAN [1], available at https://github.com/AntreasAntoniou/DAGAN.git.
A DAGAN learns to augment data by training a conditional generative model on a training dataset $\mathcal{D}$. The generative model $G$ is learned via a minimax game with a critic. For any conditioning datapoint $x$ and noise vector $z$, the critic learns to distinguish the generated pair $(x, G(x, z))$ from a pair $(x, x')$, where the point $x'$ is chosen such that $x$ and $x'$ have the same label in $\mathcal{D}$. Hence, the critic learns to classify pairs of (real, real) and (real, generated) points while encouraging the generated points to be of the same class as the point being conditioned on. For the generated data, the label is assumed to be the same as the class of the point that was used for generating it. We refer the reader to [1] for further details.
Given a DAGAN model, we additionally require training a binary classifier for estimating importance weights and a multi-class classifier for subsequent classification. The architecture for both use cases follows prior work on meta-learning for Omniglot [55]. Except for the final output layer, the architecture consists of 4 blocks of 3×3 convolutions with 64 filters, each followed by batch normalization [53], a ReLU non-linearity, and 2×2 max pooling. Learning was done for 100 epochs using the Adam optimizer with default parameters, a learning rate of 0.001, and a batch size of 32.
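A quick sanity check on the spatial dimensions of this architecture, assuming 'same' 3×3 convolutions with padding 1 (our assumption; the padding is not stated above) and 28×28 Omniglot inputs:

```python
def output_side(side, blocks=4, kernel=3, padding=1, pool=2):
    """Spatial side length after repeated [conv -> 2x2 max pool] blocks."""
    for _ in range(blocks):
        side = side + 2 * padding - kernel + 1  # stride-1 convolution
        side //= pool                           # max pooling (floor division)
    return side

# 28 -> 14 -> 7 -> 3 -> 1, so the final feature map is 64 x 1 x 1 = 64-dim.
side = output_side(28)
```

Under these assumptions the four blocks collapse each image to a 64-dimensional feature vector feeding the final output layer.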
C.5 Model-based Off-policy Policy Evaluation
For the model-based off-policy policy evaluation experiments, we used TensorFlow [52] and OpenAI Baselines [56], available at https://github.com/openai/baselines.git. We evaluate on three environments: HalfCheetah, Swimmer, and HumanoidStandup (Figure 4). Both HalfCheetah and Swimmer reward the agent for gaining higher horizontal velocity, whereas HumanoidStandup rewards the agent for gaining height by standing up. In all three environments, the initial state distributions are obtained by adding small random perturbations around a certain state. The dimensions of the state and action spaces are shown in Table 6.
Table 6: State and action space dimensions.

Environment  # State dim.  # Action dim.

HalfCheetah  17  6
HumanoidStandup  376  17
Swimmer  8  2
Our policy network has two fully-connected layers with 64 neurons and tanh activations per layer, whereas our transition model / classifier has three hidden layers of 500 neurons with swish activations [57]. We obtain our evaluation policy by training with PPO for 1M timesteps, and our behavior policy by training with PPO for 500k timesteps. We then train the dynamics model for 100k iterations with a batch size of 128. Our classifier is trained for 10k iterations with a batch size of 250, where we concatenate the state, action, and next state into a single input vector. We also experimented with other hyperparameters in reasonable ranges, and the results do not vary significantly.
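The debiased return estimate used in these experiments can be sketched as follows: each simulated trajectory's return is reweighted by the product of per-step importance weights from the transition classifier. The helper below is a simplified, hypothetical version (undiscounted returns, precomputed log-weights passed in directly).

```python
import numpy as np

def lfiw_policy_value(rewards, logw, self_normalize=True):
    """Importance-weighted Monte Carlo estimate of a policy's expected return.

    rewards: (n_traj, horizon) rewards from trajectories simulated in the
             learned dynamics model.
    logw:    (n_traj, horizon) per-step log importance weights, e.g.
             classifier-based estimates of log p_data/p_model for each
             (s, a, s') transition.
    """
    returns = rewards.sum(axis=1)    # undiscounted per-trajectory returns
    traj_logw = logw.sum(axis=1)     # per-step weights multiply along a trajectory
    if self_normalize:
        w = np.exp(traj_logw - traj_logw.max())   # stable: the shift cancels
        return float((w * returns).sum() / w.sum())
    return float((np.exp(traj_logw) * returns).mean())

# With an exact model all weights are 1 and the estimate is the plain average.
rng = np.random.default_rng(0)
rewards = rng.normal(1.0, 0.1, size=(100, 5))
value = lfiw_policy_value(rewards, np.zeros((100, 5)))
```

Summing log-weights before exponentiating, and self-normalizing, keeps the estimate stable over long horizons where the raw product of per-step ratios can under- or overflow.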