1 Introduction
A variety of deep generative models have shown promising results on tasks spanning computer vision, speech recognition, natural language processing, and imitation learning
[Poon and Domingos2011, Oord, Kalchbrenner, and Kavukcuoglu2016, Kingma and Welling2014, Goodfellow et al.2014, Zhao, Song, and Ermon2017, Li, Song, and Ermon2017]. These parametric models differ from each other in their ability to perform various forms of tractable inference, learning algorithms, and objectives. Despite significant progress, existing generative models cannot fit complex distributions with a sufficiently high degree of accuracy, limiting their applicability and leaving room for improvement.
In this paper, we propose a technique for ensembling (imperfect) generative models to improve their overall performance. Our meta-algorithm is inspired by boosting, a technique used in supervised learning to combine weak classifiers (e.g., decision stumps or trees), which individually might not perform well on a given classification task, into a more powerful ensemble. The boosting algorithm attempts to learn a classifier that corrects for the mistakes made so far by reweighting the original dataset, and repeats this procedure recursively. Under some conditions on the weak classifiers’ effectiveness, this procedure can drive the (training) error to zero [Freund, Schapire, and Abe1999]. Boosting can also be thought of as a feature learning algorithm, where at each round a new feature is learned by training a classifier on a reweighted version of the original dataset. In practice, algorithms based on boosting perform extremely well in machine learning competitions [Caruana and NiculescuMizil2006].
We show that a similar procedure can be applied to generative models. Given an initial generative model that provides an imperfect fit to the data distribution, we construct a second model to correct for the error, and repeat recursively. The second model is also a generative one, which is trained on a reweighted version of the original training set. Our meta-algorithm is general and can construct ensembles of any existing generative model that permits (approximate) likelihood evaluation, such as fully-observed belief networks, sum-product networks, and variational autoencoders. Interestingly, our method can also leverage powerful discriminative models. Specifically, we train a binary classifier to distinguish true data samples from “fake” ones generated by the current model and provide a principled way to include this discriminator in the ensemble.
A prior attempt at boosting density estimation proposed a sum-of-experts formulation [Rosset and Segal2002]. This approach is similar to supervised boosting, where at every round of boosting we derive a reweighted additive estimate of the boosted model density. In contrast, our proposed framework uses multiplicative boosting, which multiplies the ensemble model densities and can be interpreted as a product-of-experts formulation. We provide a holistic theoretical and algorithmic framework for multiplicative boosting, contrasting it with competing additive approaches. Unlike prior use cases of product-of-experts formulations, our approach is black-box, and we empirically test the proposed algorithms on several generative models, from simple ones such as mixture models to expressive parametric models such as sum-product networks and variational autoencoders.
Overall, this paper makes the following contributions:

We provide theoretical conditions for additive and multiplicative boosting under which incorporating a new model is guaranteed to improve the ensemble fit.

We design and analyze a flexible meta-algorithmic boosting framework for including both generative and discriminative models in the ensemble.

We demonstrate the empirical effectiveness of our algorithms for density estimation, generative classification, and sample generation on several benchmark datasets.
2 Unsupervised boosting
Supervised boosting provides an algorithmic formalization of the hypothesis that a sequence of weak learners can create a single strong learner [Schapire and Freund2012]. Here, we propose a framework that extends boosting to unsupervised settings for learning generative models. For ease of presentation, all distributions are with respect to any arbitrary $\mathbf{x} \in \mathbb{R}^d$, unless otherwise specified. We use uppercase symbols to denote probability distributions and assume they all admit absolutely continuous densities (denoted by the corresponding lowercase notation) on a reference measure $\mathrm{d}\mathbf{x}$. Our analysis naturally extends to discrete distributions, which we skip for brevity.
Formally, we consider the following maximum likelihood estimation (MLE) setting. Given some data points $X = \{\mathbf{x}_i \in \mathbb{R}^d\}_{i=1}^{m}$ sampled i.i.d. from an unknown distribution $P$, we provide a model class $\mathcal{Q}$ parameterizing the distributions that can be represented by the generative model and minimize the Kullback-Leibler (KL) divergence with respect to the true distribution:
$$\min_{Q \in \mathcal{Q}} D_{KL}(P \,\|\, Q) \tag{1}$$
In practice, we only observe samples from $P$ and hence, maximize the log-likelihood of the observed data $X$. Selecting the model class for maximum likelihood learning is nontrivial; MLE w.r.t. a small class can be far from $P$, whereas a large class poses the risk of overfitting in the absence of sufficient data, or even underfitting due to difficulty in optimizing nonconvex objectives that frequently arise due to the use of latent variable models, neural networks, etc.
The boosting intuition is to greedily increase model capacity by learning a sequence of weak intermediate models $\{h_t \in \mathcal{H}_t\}_{t=0}^{T}$ that can correct for mistakes made by previous models in the ensemble. Here, $\mathcal{H}_t$ is a predefined model class (such as $\mathcal{Q}$) for $h_t$. We defer the algorithms pertaining to the learning of such intermediate models to the next section, and first discuss two mechanisms for deriving the final estimate $q_T$ from the individual density estimates at each round, $\{h_t\}_{t=0}^{T}$.
2.1 Additive boosting
In additive boosting, the final density estimate is an arithmetic average of the intermediate models:
$$q_T = \sum_{t=0}^{T} \alpha_t \cdot h_t$$
where $0 \le \alpha_t \le 1$ denote the weights assigned to the intermediate models. The weights are renormalized at every round to sum to 1, which gives us a valid probability density estimate. Starting with a base model $h_0$, we can express the density estimate after a round of boosting recursively as:
$$q_t = (1 - \hat{\alpha}_t) \cdot q_{t-1} + \hat{\alpha}_t \cdot h_t$$
where $\hat{\alpha}_t$ denotes the normalized weight for $h_t$ at round $t$. We now derive conditions on the intermediate models that guarantee “progress” in every round of boosting.
Theorem 1.
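Let $\delta^t_{KL}(h_t, \hat{\alpha}_t) = D_{KL}(P \,\|\, Q_{t-1}) - D_{KL}(P \,\|\, Q_t)$ denote the reduction in KL-divergence at the $t$-th round of additive boosting. The following conditions hold:

Sufficient: If $\mathbb{E}_P\left[\log \frac{h_t}{q_{t-1}}\right] \ge 0$, then $\delta^t_{KL}(h_t, \hat{\alpha}_t) \ge 0$ for all $\hat{\alpha}_t \in [0, 1]$.

Necessary: If $\exists\, \hat{\alpha}_t \in (0, 1]$ such that $\delta^t_{KL}(h_t, \hat{\alpha}_t) \ge 0$, then $\mathbb{E}_P\left[\frac{h_t}{q_{t-1}}\right] \ge 1$.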
Proof.
In Appendix A.1. ∎
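The sufficient condition of Theorem 1 can be checked numerically on a small discrete example. The following is a minimal sketch, not the authors' implementation: the three-outcome distributions `p`, `q_prev`, and `h` are made-up values, the function names are hypothetical, and the condition is assumed to take the form $\mathbb{E}_P[\log(h_t / q_{t-1})] \ge 0$ with the update being a convex combination using a normalized weight `alpha_hat`.

```python
import math


def kl(p, q):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def additive_step(q_prev, h, alpha_hat):
    """One additive round: q_t = (1 - alpha_hat) * q_prev + alpha_hat * h."""
    return [(1 - alpha_hat) * qi + alpha_hat * hi for qi, hi in zip(q_prev, h)]


# Toy distributions over three outcomes (illustrative values only).
p = [0.5, 0.3, 0.2]       # true distribution P
q_prev = [0.2, 0.3, 0.5]  # ensemble density q_{t-1}
h = [0.6, 0.3, 0.1]       # candidate intermediate model h_t

# Left-hand side of the assumed sufficient condition: E_P[log(h_t / q_{t-1})].
sufficient = sum(pi * math.log(hi / qi) for pi, hi, qi in zip(p, h, q_prev))

# Reduction in KL divergence for a grid of normalized weights.
weights = [0.0, 0.25, 0.5, 0.75, 1.0]
reductions = [kl(p, q_prev) - kl(p, additive_step(q_prev, h, a)) for a in weights]

print(f"E_P[log(h/q_prev)] = {sufficient:.4f}")
for a, r in zip(weights, reductions):
    print(f"alpha_hat = {a:.2f}: KL reduction = {r:.4f}")
```

Because the condition holds for this choice of `h`, every weight on the grid yields a non-negative reduction, with zero reduction at `alpha_hat = 0`.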
The sufficient and necessary conditions require that the expected log-likelihood and likelihood, respectively, of the current intermediate model are better than or equal to those of the combined previous model under the true distribution, when compared using density ratios. Next, we consider an alternative formulation of multiplicative boosting for improving the model fit to an arbitrary data distribution.
2.2 Multiplicative boosting
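In multiplicative boosting, we factorize the final density estimate as a geometric average of intermediate models $\{h_t\}_{t=0}^{T}$, each assigned an exponentiated weight $\alpha_t$:
$$q_T = \frac{\prod_{t=0}^{T} h_t^{\alpha_t}}{Z_T}$$
where the partition function $Z_T = \int \prod_{t=0}^{T} h_t^{\alpha_t} \, \mathrm{d}\mathbf{x}$. Recursively, we can specify the density estimate as:
$$\tilde{q}_t = h_t^{\alpha_t} \cdot \tilde{q}_{t-1} \tag{2}$$
where $\tilde{q}_t$ is the unnormalized estimate at round $t$. The base model $h_0$ is learned using MLE. The conditions on the intermediate models for reducing KL-divergence at every round are stated below.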
Theorem 2.
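Let $\delta^t_{KL}(h_t, \alpha_t) = D_{KL}(P \,\|\, Q_{t-1}) - D_{KL}(P \,\|\, Q_t)$ denote the reduction in KL-divergence at the $t$-th round of multiplicative boosting. The following conditions hold:

Sufficient: If $\mathbb{E}_P[\log h_t] \ge \log \mathbb{E}_{Q_{t-1}}[h_t]$, then $\delta^t_{KL}(h_t, \alpha_t) \ge 0$ for all $\alpha_t \in [0, 1]$.

Necessary: If $\exists\, \alpha_t \in (0, 1]$ such that $\delta^t_{KL}(h_t, \alpha_t) \ge 0$, then $\mathbb{E}_P[\log h_t] \ge \mathbb{E}_{Q_{t-1}}[\log h_t]$.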
Proof.
In Appendix A.2. ∎
In contrast to additive boosting, the conditions above compare expectations under the true distribution with expectations under the model distribution in the previous round, $Q_{t-1}$. The equality in the conditions holds for $\alpha_t = 0$, which corresponds to the trivial case where the current intermediate model is ignored in Eq. (2). For other valid $\alpha_t$, the non-degenerate version of the sufficient inequality guarantees progress towards the true data distribution. Note that the intermediate models increase the overall capacity of the ensemble at every round. As we shall demonstrate later, models fit using multiplicative boosting outperform their additive counterparts empirically, suggesting that the conditions in Theorem 2 are easier to fulfill in practice.
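To make the multiplicative update concrete, here is a minimal sketch on a three-outcome domain, where the partition function is a finite sum. It is an illustration under stated assumptions rather than the authors' code: the values of `p`, `q_prev`, and `h` are made up, the function names are hypothetical, and the sufficient condition is assumed to read $\mathbb{E}_P[\log h_t] \ge \log \mathbb{E}_{Q_{t-1}}[h_t]$.

```python
import math


def kl(p, q):
    """KL divergence D_KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def multiplicative_step(q_prev, h, alpha):
    """One multiplicative round: q_t proportional to h**alpha * q_prev."""
    unnorm = [(hi ** alpha) * qi for hi, qi in zip(h, q_prev)]
    z = sum(unnorm)  # partition function on a finite domain
    return [u / z for u in unnorm]


# Toy values over three outcomes (illustrative only).
p = [0.5, 0.3, 0.2]       # true distribution P
q_prev = [0.2, 0.3, 0.5]  # ensemble density q_{t-1}
h = [2.0, 1.0, 0.5]       # intermediate model h_t; it need not be normalized

# Both sides of the assumed sufficient condition.
lhs = sum(pi * math.log(hi) for pi, hi in zip(p, h))       # E_P[log h_t]
rhs = math.log(sum(qi * hi for qi, hi in zip(q_prev, h)))  # log E_Q[h_t]

# Reduction in KL divergence for a grid of weights alpha_t in [0, 1].
weights = [0.0, 0.25, 0.5, 0.75, 1.0]
reductions = [kl(p, q_prev) - kl(p, multiplicative_step(q_prev, h, a)) for a in weights]

# With the exact density ratio p / q_prev and alpha = 1, the update recovers p.
h_star = [pi / qi for pi, qi in zip(p, q_prev)]
q_star = multiplicative_step(q_prev, h_star, 1.0)
```

The last two lines illustrate why density ratios are a natural target for intermediate models: multiplying `q_prev` by the exact ratio returns the true distribution in a single round.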
From the necessary condition, we see that a “good” intermediate model assigns a better-or-equal log-likelihood under the true distribution as opposed to the model distribution, $Q_{t-1}$. This condition suggests two learning algorithms for intermediate models, which we discuss next.
3 Boosted generative models
In this section, we design and analyze meta-algorithms for multiplicative boosting of generative models. Given any base model which permits (approximate) likelihood evaluation, we provide a mechanism for boosting this model using an ensemble of generative and/or discriminative models.
3.1 Generative boosting
Supervised boosting algorithms such as AdaBoost typically involve a reweighting procedure for training weak learners [Freund and Schapire1995]. We can similarly train an ensemble of generative models for unsupervised boosting, where every subsequent model performs MLE w.r.t. a reweighted data distribution $D_{t-1}$:
$$\max_{h_t} \; \mathbb{E}_{\mathbf{x} \sim D_{t-1}}[\log h_t(\mathbf{x})] \tag{3}$$
$$\text{where } d_{t-1} \propto \left(\frac{p}{q_{t-1}}\right)^{\beta_t} \tag{4}$$
and $\beta_t \in [0, 1]$ is the reweighting coefficient at round $t$. Note that these coefficients are in general different from the model weights $\alpha_t$ that appear in Eq. (2).
Proposition 1.
If we can maximize the objective in Eq. (3) optimally, then $\delta^t_{KL}(h_t, \alpha_t) \ge 0$ for any $\beta_t \in [0, 1]$, with the equality holding for $\beta_t = 0$.
Proof.
In Appendix A.3. ∎
While the objective in Eq. (3) can be hard to optimize in practice, the target distribution becomes easier to approximate as we reduce the reweighting coefficient. For the extreme case of $\beta_t = 0$, the reweighted data distribution is simply uniform. There is no free lunch, however, since a low $\beta_t$ results in a slower reduction in KL-divergence, leading to a computational-statistical trade-off.
The pseudocode for the corresponding boosting meta-algorithm, referred to as GenBGM, is given in Algorithm 1. In practice, we only observe samples from the true data distribution, and hence approximate $p$ based on the empirical data distribution, which is defined to be uniform over the dataset $X$. At every subsequent round, GenBGM learns an intermediate model that maximizes the log-likelihood of data sampled from a reweighted data distribution.
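The reweighting step can be sketched as follows. This is a hedged illustration, not Algorithm 1 itself: it assumes the reweighted distribution is proportional to the ratio of the data density to the previous ensemble density raised to a coefficient `beta`, and that the empirical data density is uniform over the dataset, so the weight of each training point reduces to $(1 / q_{t-1}(\mathbf{x}_i))^{\beta}$ up to normalization. The function name `genbgm_weights` and the example log-densities are hypothetical.

```python
import math


def genbgm_weights(log_q_prev, beta):
    """Normalized weights proportional to (1 / q_prev(x_i)) ** beta.

    log_q_prev: log-density of each training point under the previous ensemble
                (an unnormalized log-density works too, since constants cancel).
    beta:       reweighting coefficient in [0, 1].
    """
    scores = [-beta * lq for lq in log_q_prev]
    m = max(scores)  # subtract the max before exponentiating, for stability
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    return [wi / total for wi in w]


# Three training points; the last two are under-represented by the ensemble.
log_q = [math.log(0.5), math.log(0.25), math.log(0.25)]

w_full = genbgm_weights(log_q, beta=1.0)  # full correction
w_flat = genbgm_weights(log_q, beta=0.0)  # no correction: uniform weights
```

The next intermediate model would then be fit by weighted maximum likelihood, or by MLE on points resampled with these weights. Points that the current ensemble explains poorly receive more weight, mirroring the reweighting in supervised boosting.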
3.2 Discriminative boosting
A base generative model can be boosted using a discriminative approach as well. Here, the intermediate model is specified as the density ratio obtained from a binary classifier. Consider the following setup: we observe an equal number of samples drawn i.i.d. from the true data distribution (w.l.o.g. assigned the label $y = 1$) and the model distribution in the previous round $Q_{t-1}$ (label $y = 0$).
Definition 1.
Let $f: \mathbb{R}^{+} \to \mathbb{R}$ be any convex, lower semicontinuous function satisfying $f(1) = 0$. The $f$-divergence between $P$ and $Q$ is defined as $D_f(P \,\|\, Q) = \int q(\mathbf{x}) \, f\!\left(\frac{p(\mathbf{x})}{q(\mathbf{x})}\right) \mathrm{d}\mathbf{x}$.
Notable examples include the Kullback-Leibler (KL) divergence, Hellinger distance, and the Jensen-Shannon (JS) divergence among many others. The binary classifier in discriminative boosting maximizes a variational lower bound on any $f$-divergence at round $t$:
$$D_f(P \,\|\, Q_{t-1}) \ge \sup_{r_t \in \mathcal{R}_t} \left( \mathbb{E}_{\mathbf{x} \sim P}[r_t(\mathbf{x})] - \mathbb{E}_{\mathbf{x} \sim Q_{t-1}}[f^\star(r_t(\mathbf{x}))] \right) \tag{5}$$
where $f^\star$ denotes the Fenchel conjugate of $f$ and $r_t: \mathbb{R}^d \to \mathrm{dom}_{f^\star}$ parameterizes the classifier. Under mild conditions on $f$ [Nguyen, Wainwright, and Jordan2010], the lower bound in Eq. (5) is tight if $r_t^\star = f'\!\left(\frac{p}{q_{t-1}}\right)$.
Hence, a solution to Eq. (5) can be used to estimate density ratios. The density ratios naturally fit into the multiplicative boosting framework and provide a justification for the use of objectives of the form Eq. (5) for learning intermediate models, as formalized in the proposition below.
Proposition 2.
For any given $f$-divergence, let $r_t^\star$ denote the optimal solution to Eq. (5) in the $t$-th round of boosting. Then, the model density at the end of the boosting round matches the true density if we set $\alpha_t = 1$ and $h_t = [f']^{-1}(r_t^\star)$, where $[f']^{-1}$ denotes the inverse of the derivative of $f$.
Proof.
In Appendix A.4. ∎
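As a worked instance of Proposition 2, consider the KL divergence. The derivation below is a sketch under assumed notation ($p$ for the true density, $q_{t-1}$ for the previous ensemble density, $r_t^\star$ for the optimal classifier output, $h_t$ for the intermediate model); it is meant as a sanity check and does not replace the proof in the appendix.

```latex
% KL divergence as an f-divergence: f(u) = u \log u, so f(1) = 0.
\begin{align*}
f(u) &= u \log u, &
f'(u) &= 1 + \log u, &
f^\star(r) &= \sup_{u > 0}\,(u r - u \log u) = e^{\,r - 1}. \\
\intertext{The optimal classifier output and the induced intermediate model are}
r_t^\star &= f'\!\left(\frac{p}{q_{t-1}}\right) = 1 + \log \frac{p}{q_{t-1}}, &
h_t &= [f']^{-1}(r_t^\star) = e^{\,r_t^\star - 1} = \frac{p}{q_{t-1}}. \\
\intertext{With unit weight, the multiplicative update recovers the true density:}
\tilde{q}_t &= h_t \cdot q_{t-1} = p. \\
\intertext{The variational bound is tight at $r_t^\star$:}
\mathbb{E}_{P}[r_t^\star] - \mathbb{E}_{Q_{t-1}}[f^\star(r_t^\star)]
  &= 1 + D_{KL}(P \,\|\, Q_{t-1}) - \mathbb{E}_{Q_{t-1}}\!\left[\frac{p}{q_{t-1}}\right]
   = D_{KL}(P \,\|\, Q_{t-1}).
\end{align*}
```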
The pseudocode for the corresponding meta-algorithm, DiscBGM, is given in Algorithm 2. At every round, we train a binary classifier to optimize the objective in Eq. (5) for a chosen $f$-divergence. As a special case, the negative of the cross-entropy loss commonly used for binary classification is also a lower bound on an $f$-divergence. While Algorithm 2 is applicable for any $f$-divergence, we will focus on cross-entropy henceforth to streamline the discussion.
Corollary 1.
Consider the (negative) cross-entropy objective maximized by a binary classifier:
$$\sup_{c_t \in \mathcal{C}_t} \; \mathbb{E}_{\mathbf{x} \sim P}[\log c_t(\mathbf{x})] + \mathbb{E}_{\mathbf{x} \sim Q_{t-1}}[\log(1 - c_t(\mathbf{x}))] \tag{6}$$
If a binary classifier $c_t$ trained to optimize Eq. (6) is Bayes optimal, then the model density after round $t$ matches the true density if we set $\alpha_t = 1$ and $h_t = \frac{c_t}{1 - c_t}$.
Proof.
In Appendix A.5. ∎
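The corollary can be illustrated with a case where the Bayes optimal classifier is available in closed form. The sketch below assumes two one-dimensional Gaussians standing in for the true density and the previous ensemble density, equal class priors, and an intermediate model given by the classifier odds `c / (1 - c)`; all names and numeric values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def bayes_optimal_classifier(x, p_pdf, q_pdf):
    """P(y = 1 | x) when real and model samples are equally likely a priori."""
    return p_pdf(x) / (p_pdf(x) + q_pdf(x))


def density_ratio(c):
    """Intermediate model implied by a classifier output c in (0, 1)."""
    return c / (1.0 - c)


def p_pdf(x):
    return gaussian_pdf(x, mean=1.0, std=1.0)  # stands in for the true density


def q_pdf(x):
    return gaussian_pdf(x, mean=0.0, std=1.0)  # stands in for q_{t-1}


x = 0.3
c = bayes_optimal_classifier(x, p_pdf, q_pdf)
boosted = q_pdf(x) * density_ratio(c)  # multiplicative update with unit weight
```

With a Bayes optimal classifier the boosted density equals the true density at every point, so no renormalization is needed. An uninformative classifier (`c = 0.5` everywhere) gives a ratio of 1 and leaves the ensemble unchanged.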
In practice, a classifier with limited capacity trained on a finite dataset will not generally be Bayes optimal. The above corollary, however, suggests that a good classifier can provide a ‘direction of improvement’, in a similar spirit to gradient boosting for supervised learning [Freund and Schapire1995]. Additionally, if the intermediate model distribution obtained using the above corollary satisfies the conditions in Theorem 2, it is guaranteed to improve the fit.
The weights $\alpha_t \in [0, 1]$ can be interpreted as our confidence in the classification estimates, akin to the step size used in gradient descent. While in practice we heuristically assign weights to the intermediate models, the greedy optimum value of these weights at every round is a critical point for $\delta^t_{KL}$ (defined in Theorem 2). For example, in the extreme case where $c_t$ is uninformative, i.e., $c_t \equiv 0.5$, then $\delta^t_{KL}(h_t, \alpha_t) = 0$ for all $\alpha_t \in [0, 1]$. If $c_t$ is Bayes optimal, then $\delta^t_{KL}$ attains a maximum when $\alpha_t = 1$ (Corollary 1).
3.3 Hybrid boosting
Intermediate models need not be exclusively generators or discriminators; we can design a boosting ensemble with any combination of generators and discriminators. If an intermediate model is chosen to be a generator, we learn a generative model using MLE after appropriately reweighting the data points. If a discriminator is used to implicitly specify an intermediate model, we set up a binary classification problem.
3.4 Regularization
In practice, we want boosted generative models (BGM) to generalize to data outside the training set $X$. Regularization in BGMs is imposed primarily in two ways. First, every intermediate model can be independently regularized by incorporating explicit terms in the learning objective, early stopping based on validation error, heuristics such as dropout, etc. Second, restricting the number of rounds of boosting is another effective mechanism for regularizing BGMs. Fewer rounds of boosting are required if the intermediate models are sufficiently expressive.
4 Empirical evaluation
Our experiments are designed to demonstrate the superiority of the proposed boosting meta-algorithms on a wide variety of generative models and tasks. A reference implementation of the boosting meta-algorithms is available at https://github.com/ermongroup/bgm. Additional implementation details for the experiments below are given in Appendix B.
Model  NLL (in nats, with std. error) 

Base model  
Add model  
GenBGM  
DiscBGM-NCE  
DiscBGM-HD 
4.1 Multiplicative vs. additive boosting
A common pitfall with learning parametric generative models is model misspecification with respect to the true underlying data distribution. For a quantitative and qualitative understanding of the behavior of additive and multiplicative boosting, we begin by considering a synthetic setting for density estimation on a mixture of Gaussians.
Density estimation on synthetic dataset.
The true data distribution is an equi-weighted mixture of four Gaussians centered symmetrically around the origin, each having an identity covariance matrix. The contours of the underlying density are shown in Figure 0(a). We observe training samples drawn independently from the data distribution (shown as black dots in Figure 2), and the task is to learn this distribution. The test set contains samples from the same distribution. We repeat the process multiple times for statistical significance.
As a base (misspecified) model, we fit a mixture of two Gaussians to the data; the contours for an example instance are shown in Figure 0(b). We compare multiplicative and additive boosting, each run for the same number of rounds. For additive boosting (Add), we extend the algorithm proposed by Rosset and Segal (2002), setting the weight of the base model to unity and doing a line search over the weights of the subsequent models. For Add and GenBGM, the intermediate models are mixtures of two Gaussians as well. The classifiers for DiscBGM are multilayer perceptrons with two hidden layers of 100 units each and ReLU activations, trained to maximize $f$-divergences corresponding to the negative cross-entropy (NCE) and Hellinger distance (HD) using the Adam optimizer [Kingma and Ba2015].
The test negative log-likelihood (NLL) estimates are listed in Table 1. Qualitatively, the contour plots for the estimated densities after every boosting round on a sample instance are shown in Figure 2. Multiplicative boosting algorithms outperform additive boosting in correcting for model misspecification. GenBGM initially leans towards maximizing coverage, whereas both versions of DiscBGM are relatively more conservative in assigning high densities to data points away from the modes.
Heuristic model weighting strategies.
The multiplicative boosting algorithms require as hyperparameters the number of rounds of boosting and the weights assigned to the intermediate models. For any practical setting, these hyperparameters are specific to the dataset and task under consideration and should be set based on cross-validation. While automatically setting model weights is an important direction for future work, we propose some heuristic weighting strategies. Specifically, the unity heuristic assigns a weight of 1 to every model in the ensemble, the uniform heuristic assigns the same weight to every model, and the decay heuristic assigns a weight to the $t$-th model in the ensemble that decreases with $t$.
In Figure 3, we observe that the performance of the algorithms is sensitive to the weighting strategies. In particular, DiscBGM produces worse estimates as $T$ increases for the “uniform” (red) strategy. The performance of GenBGM also degrades slightly with increasing $T$ for the “unity” (green) strategy. Notably, the “decay” (cyan) strategy achieves stable performance for both algorithms. Intuitively, this heuristic follows the rationale of reducing the step size in gradient-based stochastic optimization algorithms, and we expect this strategy to work better even in other settings. However, this strategy could potentially result in slower convergence as opposed to the unity strategy.
Dataset  vars  MoB Base  MoB Add  MoB GenBGM  MoB DiscBGM  SPN Base  SPN Add  SPN GenBGM  SPN DiscBGM 

Accidents  34.51  31.08  29.92  29.55  28.09  
Retail  11.27  12.24  11.20  10.91  14.94  11.27  11.21  10.88  
Pumsbstar  55.67  55.91  50.66  34.93  26.70  25.00  25.00  23.69  
DNA  99.42  100.37  99.23  98.45  92.60  86.93  87.79  86.63  
Kosarek  11.72  12.57  12.41  11.13  12.71  10.97  10.73  10.67  
Ad  63.13  63.73  63.19  54.79  19.19  18.12  18.14  17.82 
Dataset  test  MoB Base  MoB Add  MoB GenBGM  MoB DiscBGM  SPN Base  SPN Add  SPN GenBGM  SPN DiscBGM 

Accidents  283,161  0.8395  0.8393  0.8473  0.9043  0.9258  0.9266  0.9298  0.9416 
Retail  595,080  0.9776  0.9776  0.9776  0.9792  0.9780  0.9790  0.9789  0.9791 
Pumsbstar  399,676  0.8461  0.8501  0.8819  0.9267  0.9599  0.9610  0.9611  0.9636 
DNA  213,480  0.7517  0.7515  0.7531  0.7526  0.7799  0.7817  0.7828  0.7811 
Kosarek  1,268,250  0.9817  0.9816  0.9818  0.9831  0.9824  0.9838  0.9838  0.9838 
Ad  763,996  0.9922  0.9923  0.9818  0.9927  0.9982  0.9981  0.9982  0.9982 
Density estimation on benchmark datasets.
We now evaluate the performance of additive and multiplicative boosting for density estimation on real-world benchmark datasets [Van Haaren and Davis2012]. We consider two generative model families: mixture of Bernoullis (MoB) and sum-product networks [Poon and Domingos2011]. While our results for multiplicative boosting with sum-product networks (SPN) are competitive with the state-of-the-art, the goal of these experiments is to perform a robust comparison of boosting algorithms as well as demonstrate their applicability to various model families.
We fix the number of rounds $T$ for additive boosting and GenBGM. Since DiscBGM requires samples from the model density at every round, we restrict it to a single round ($T = 1$) to ensure computational fairness, such that the samples can be obtained efficiently from the base model, sidestepping running expensive Markov chains. Model weights are chosen based on cross-validation. The results on density estimation are reported in Table 2. Since multiplicative boosting estimates are unnormalized, we use importance sampling to estimate the partition function.
When the base model is MoB, the Add model underperforms and is often worse than even the baseline model for the best performing validated non-zero model weights. GenBGM consistently outperforms Add and improves over the baseline model in most cases (4/6 datasets). DiscBGM performs the best and convincingly outperforms the baseline, Add, and GenBGM on all datasets. For results on SPNs, the boosted models all outperform the baseline. GenBGM again edges out Add models (4/6 datasets), whereas DiscBGM models outperform all other models on all datasets. These results demonstrate the usefulness of boosted expressive model families, especially the DiscBGM approach, which performs the best, while GenBGM is preferable to Add.
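The importance sampling estimate of the partition function mentioned above can be sketched on a toy discrete example where the exact value is available for comparison. This is an assumption-laden illustration, not the experimental code: it takes a normalized base model `h0` as the proposal, a single intermediate model `h1` with weight `alpha`, and estimates the partition function as the average of `h1(x) ** alpha` over samples from the base model. All names and values are hypothetical.

```python
import math
import random


def estimate_log_partition(samples, log_ratio):
    """Log of the importance-sampling estimate of the partition function.

    samples:   draws from the (normalized) base model.
    log_ratio: function returning log(q_tilde(x) / h0(x)) for a sample x.
    """
    logs = [log_ratio(x) for x in samples]
    m = max(logs)  # log-sum-exp trick for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in logs) / len(logs))


# Toy ensemble over three outcomes: q_tilde(x) = h0(x) * h1(x) ** alpha.
h0 = [0.2, 0.3, 0.5]  # normalized base model, also used as the proposal
h1 = [2.0, 1.0, 0.5]  # unnormalized intermediate model
alpha = 0.5

rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = rng.choices(range(3), weights=h0, k=20000)

log_z_hat = estimate_log_partition(samples, lambda x: alpha * math.log(h1[x]))
log_z_exact = math.log(sum(h0[x] * h1[x] ** alpha for x in range(3)))
```

Using the base model as the proposal means the importance weights only involve the intermediate models, which is convenient when the base model supports efficient sampling. The variance of the estimate grows as the ensemble moves away from the base model.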
4.2 Applications of generative models
Classification.
Here, we evaluate the performance of boosting algorithms for classification. Since the datasets above do not have any explicit labels, we choose one of the dimensions to be the label (say $y$). Letting $\bar{\mathbf{x}}$ denote the remaining dimensions, we can obtain a prediction for $y$ as,
$$p(y \mid \bar{\mathbf{x}}) = \frac{p(y, \bar{\mathbf{x}})}{p(\bar{\mathbf{x}})}$$
which is efficient to compute even for unnormalized models. We repeat the above procedure for all the variables, predicting one variable at a time using the values assigned to the remaining variables. The results are reported in Table 3. When the base model is a MoB, we observe that the Add approach could often be worse than the base model, whereas GenBGM performs slightly better than the baseline (4/6 datasets). The DiscBGM approach consistently performs well, and is only outperformed by GenBGM for two datasets for MoB. When SPNs are used instead, both Add and GenBGM improve upon the baseline model, while DiscBGM again is the best performing model on all but one dataset.
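The reason this prediction is cheap for unnormalized models is that the partition function cancels in the ratio. A minimal sketch follows, assuming binary variables (so the marginal over the label is a sum of two terms) and a hypothetical lookup table `toy_table` standing in for an unnormalized boosted density; the function name is also an assumption.

```python
def predict_label_prob(unnorm_density, x_rest, label_index):
    """Return p(y = 1 | x_rest) for a binary label at position label_index.

    unnorm_density: maps a full binary tuple to an unnormalized density value.
    x_rest:         values of all variables except the label, in order.
    """
    def with_label(y):
        x = list(x_rest)
        x.insert(label_index, y)
        return tuple(x)

    s1 = unnorm_density(with_label(1))
    s0 = unnorm_density(with_label(0))
    return s1 / (s0 + s1)  # any normalization constant cancels here


# Unnormalized scores over two binary variables (illustrative values).
toy_table = {(0, 0): 2.0, (0, 1): 6.0, (1, 0): 1.0, (1, 1): 1.0}

# Predict the second variable given that the first one is 0.
prob_second = predict_label_prob(toy_table.get, x_rest=(0,), label_index=1)

# Predict the first variable given that the second one is 1.
prob_first = predict_label_prob(toy_table.get, x_rest=(1,), label_index=0)
```

Only two evaluations of the unnormalized density are needed per prediction, so the procedure can be repeated for every variable without estimating the partition function.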
The boosted model (d) demonstrates how ensembles of weak learners can generate sharper samples, compared to naively increasing model capacity (a-c). Note that we show samples of binarized digits and not mean values for the pixels. VAE hidden layer architecture given in parentheses.
Sample generation.
We compare boosting algorithms based on their ability to generate image samples for the binarized MNIST dataset of handwritten digits [LeCun, Cortes, and Burges2010]. We use variational autoencoders (VAE) as the base model [Kingma and Welling2014]. While any sufficiently expressive VAE can generate impressive examples, we design the experiment to evaluate the model complexity approximated as the number of learnable parameters.
Ancestral samples obtained by the baseline VAE model are shown in Figure 3(a). We use the evidence lower bound (ELBO) as a proxy for approximately evaluating the marginal log-likelihood during learning. The conventional approach to improving the performance of a latent variable model is to increase its representational capacity by adding hidden layers (Base + depth) or increasing the number of hidden units in the existing layers (Base + width). These lead to a marginal improvement in sample quality as seen in Figure 3(b) and Figure 3(c).
In contrast, boosting makes steady improvements in sample quality. We start with a VAE with far fewer parameters and generate samples using a hybrid boosting GenDiscBGM sequence VAE→CNN→VAE (Figure 3(d)). The discriminator used is a convolutional neural network (CNN) [LeCun and Bengio1995] trained to maximize the negative cross-entropy. We then generate samples using independent Markov chain Monte Carlo (MCMC) runs. The boosted sequences generate sharper samples than all baselines in spite of having similar model capacity.
5 Discussion and related work
In this work, we revisited boosting, a class of meta-algorithms developed in response to a seminal question: Can a set of weak learners create a single strong learner? Boosting has offered interesting theoretical insights into the fundamental limits of supervised learning and led to the development of algorithms that work well in practice [Schapire1990, Freund, Schapire, and Abe1999, Friedman2002, Caruana and NiculescuMizil2006]. Our work provides a foundational framework for unsupervised boosting with connections to prior work discussed below.
Sum-of-experts.
Rosset and Segal (2002) proposed an algorithm for density estimation using Bayesian networks similar to gradient boosting. These models are normalized and easy to sample, but are generally outperformed by multiplicative formulations for correcting for model misspecification, as we show in this work. Similar additive approaches have been used for improving approximate posteriors for specific algorithms for variational inference [Guo et al.2016, Miller, Foti, and Adams2017] and generative adversarial networks [Tolstikhin et al.2017]. For a survey on variations of additive ensembling for unsupervised settings, refer to Bourel and Ghattas (2012).
Product-of-experts.
Our multiplicative boosting formulation can be interpreted as a product-of-experts approach, which was initially proposed for feature learning in energy-based models such as Boltzmann machines. For example, the hidden units in a restricted Boltzmann machine can be interpreted as weak learners performing MLE. If the number of weak learners is fixed, they can be efficiently updated in parallel, but there is a risk of learning redundant features [Hinton1999, Hinton2002]. Weak learners can also be added incrementally based on the learner’s ability to distinguish observed data and model-generated data [Welling, Zemel, and Hinton2002]. Tu (2007) generalized the latter to boost arbitrary probabilistic models; their algorithm is a special case of DiscBGM with all $\alpha$’s set to 1 and the discriminator itself a boosted classifier. DiscBGM additionally accounts for imperfections in learning classifiers through flexible model weights. Further, it can include any classifier trained to maximize any $f$-divergence.
Related techniques such as noise-contrastive estimation, ratio matching, and score matching methods can be cast as minimization of Bregman divergences, akin to DiscBGM with unit model weights [Gutmann and Hirayama2011]. A non-parametric algorithm similar to GenBGM was proposed by Di Marzio and Taylor (2004), where an ensemble of weighted kernel density estimates is learned to approximate the data distribution. In contrast, our framework allows for both parametric and non-parametric learners and uses a different scheme for reweighting data points than proposed in the above work.
Unsupervised-as-supervised learning.
The use of density ratios learned by a binary classifier for estimation was first proposed by Friedman, Hastie, and Tibshirani (2001) and has been subsequently applied elsewhere, notably for parameter estimation using noise-contrastive estimation [Gutmann and Hyvärinen2010] and sample generation in generative adversarial networks (GAN) [Goodfellow et al.2014]. While GANs consist of a discriminator distinguishing real data from model-generated data, similar to DiscBGM for a suitable $f$-divergence, they differ in the learning objective for the generator [Nowozin, Cseke, and Tomioka2016]. The generator of a GAN performs an adversarial minimization of the same objective the discriminator maximizes, whereas DiscBGM uses the likelihood estimate of the base generator (learned using MLE) and the density ratios derived from the discriminator(s) to estimate the model density for the ensemble.
Limitations and future work.
In the multiplicative boosting framework, the model density needs to be specified only up to a normalization constant at any given round of boosting. Additionally, while many applications of generative modeling such as feature learning and classification can sidestep computing the partition function, if needed it can be estimated using techniques such as Annealed Importance Sampling [Neal2001]. Similarly, Markov chain Monte Carlo methods can be used to generate samples. The lack of implicit normalization can however be limiting for applications requiring fast log-likelihood evaluation and sampling.
In order to sidestep this issue, a promising direction for future work is to consider boosting of normalizing flow models [Dinh, Krueger, and Bengio2014, Dinh, SohlDickstein, and Bengio2017, Grover, Dhar, and Ermon2018]. These models specify an invertible multiplicative transformation from one distribution to another using the change-of-variables formula such that the resulting distribution is self-normalized and efficient ancestral sampling is possible. The GenBGM algorithm can be adapted to normalizing flow models whereby every transformation is interpreted as a weak learner. The parameters for every transformation can be trained greedily after suitable reweighting, resulting in a self-normalized boosted generative model.
6 Conclusion
We presented a general-purpose framework for boosting generative models by explicit factorization of the model likelihood as a product of simpler intermediate model densities. These intermediate models are learned greedily using discriminative or generative approaches, gradually increasing the overall model’s capacity. We demonstrated the effectiveness of these models over baseline models and additive boosting for the tasks of density estimation, classification, and sample generation. Extensions to semi-supervised learning [Kingma et al.2014] and structured prediction [Sohn, Lee, and Yan2015] are exciting directions for future work.
Acknowledgements
We are thankful to Neal Jean, Daniel Levy, and Russell Stewart for helpful critique. This research was supported by a Microsoft Research PhD fellowship in machine learning for the first author, NSF grants , , , a Future of Life Institute grant, and Intel.
References
 [Bourel and Ghattas2012] Bourel, M., and Ghattas, B. 2012. Aggregating density estimators: An empirical study. arXiv preprint arXiv:1207.4959.
 [Caruana and NiculescuMizil2006] Caruana, R., and Niculescu-Mizil, A. 2006. An empirical comparison of supervised learning algorithms. In International Conference on Machine Learning.
 [Di Marzio and Taylor2004] Di Marzio, M., and Taylor, C. C. 2004. Boosting kernel density estimates: A bias reduction technique? Biometrika 91(1):226–233.
 [Dinh, Krueger, and Bengio2014] Dinh, L.; Krueger, D.; and Bengio, Y. 2014. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
 [Dinh, SohlDickstein, and Bengio2017] Dinh, L.; Sohl-Dickstein, J.; and Bengio, S. 2017. Density estimation using Real NVP. In International Conference on Learning Representations.

[Freund and
Schapire1995]
Freund, Y., and Schapire, R. E.
1995.
A desiciontheoretic generalization of online learning and an
application to boosting.
In
European conference on computational learning theory
. 
[Freund, Schapire, and
Abe1999]
Freund, Y.; Schapire, R.; and Abe, N.
1999.
A short introduction to boosting.
JournalJapanese Society For Artificial Intelligence
14(771780):1612.  [Friedman, Hastie, and Tibshirani2001] Friedman, J.; Hastie, T.; and Tibshirani, R. 2001. The elements of statistical learning, volume 1. Springer series in statistics Springer, Berlin.
 [Friedman2002] Friedman, J. H. 2002. Stochastic gradient boosting. Computational Statistics & Data Analysis 38(4):367–378.
 [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems.
 [Grover, Dhar, and Ermon2018] Grover, A.; Dhar, M.; and Ermon, S. 2018. Flow-GAN: Combining maximum likelihood and adversarial learning in generative models. In AAAI Conference on Artificial Intelligence.
 [Guo et al.2016] Guo, F.; Wang, X.; Fan, K.; Broderick, T.; and Dunson, D. B. 2016. Boosting variational inference. arXiv preprint arXiv:1611.05559.
 [Gutmann and Hirayama2011] Gutmann, M., and Hirayama, J.-i. 2011. Bregman divergence as general framework to estimate unnormalized statistical models. In Uncertainty in Artificial Intelligence.
 [Gutmann and Hyvärinen2010] Gutmann, M., and Hyvärinen, A. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Artificial Intelligence and Statistics.
 [Hinton1999] Hinton, G. E. 1999. Products of experts. In International Conference on Artificial Neural Networks.

 [Hinton2002] Hinton, G. E. 2002. Training products of experts by minimizing contrastive divergence. Neural computation 14(8):1771–1800.
 [Kingma and Ba2015] Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
 [Kingma and Welling2014] Kingma, D. P., and Welling, M. 2014. Auto-encoding variational Bayes. In International Conference on Learning Representations.
 [Kingma et al.2014] Kingma, D. P.; Mohamed, S.; Rezende, D. J.; and Welling, M. 2014. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems.
 [LeCun and Bengio1995] LeCun, Y., and Bengio, Y. 1995. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361(10):1995.
 [LeCun, Cortes, and Burges2010] LeCun, Y.; Cortes, C.; and Burges, C. J. 2010. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist.
 [Li, Song, and Ermon2017] Li, Y.; Song, J.; and Ermon, S. 2017. InfoGAIL: Interpretable imitation learning from visual demonstrations. In Advances in Neural Information Processing Systems.
 [Miller, Foti, and Adams2017] Miller, A. C.; Foti, N.; and Adams, R. P. 2017. Variational boosting: Iteratively refining posterior approximations. In International Conference on Machine Learning.
 [Neal2001] Neal, R. M. 2001. Annealed importance sampling. Statistics and Computing 11(2):125–139.
 [Nguyen, Wainwright, and Jordan2010] Nguyen, X.; Wainwright, M. J.; and Jordan, M. I. 2010. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory 56(11):5847–5861.
 [Nowozin, Cseke, and Tomioka2016] Nowozin, S.; Cseke, B.; and Tomioka, R. 2016. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems.

 [Oord, Kalchbrenner, and Kavukcuoglu2016] Oord, A. v. d.; Kalchbrenner, N.; and Kavukcuoglu, K. 2016. Pixel recurrent neural networks. In International Conference on Machine Learning.
 [Poon and Domingos2011] Poon, H., and Domingos, P. 2011. Sum-product networks: A new deep architecture. In Uncertainty in Artificial Intelligence.
 [Rosset and Segal2002] Rosset, S., and Segal, E. 2002. Boosting density estimation. In Advances in Neural Information Processing Systems.
 [Schapire and Freund2012] Schapire, R. E., and Freund, Y. 2012. Boosting: Foundations and algorithms. MIT press.
 [Schapire1990] Schapire, R. E. 1990. The strength of weak learnability. Machine learning 5(2):197–227.
 [Sohn, Lee, and Yan2015] Sohn, K.; Lee, H.; and Yan, X. 2015. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems.
 [Tolstikhin et al.2017] Tolstikhin, I.; Gelly, S.; Bousquet, O.; Simon-Gabriel, C.-J.; and Schölkopf, B. 2017. AdaGAN: Boosting generative models. In Advances in Neural Information Processing Systems.

 [Tu2007] Tu, Z. 2007. Learning generative models via discriminative approaches. In Computer Vision and Pattern Recognition.
 [Van Haaren and Davis2012] Van Haaren, J., and Davis, J. 2012. Markov network structure learning: A randomized feature generation approach. In AAAI Conference on Artificial Intelligence.
 [Welling, Zemel, and Hinton2002] Welling, M.; Zemel, R. S.; and Hinton, G. E. 2002. Self supervised boosting. In Advances in Neural Information Processing Systems.
 [Zhao, Song, and Ermon2017] Zhao, S.; Song, J.; and Ermon, S. 2017. Learning hierarchical features from deep generative models. In International Conference on Machine Learning.
Appendices
Appendix A Proofs of theoretical results
A.1 Theorem 1
Proof.
The reduction in KL-divergence can be simplified as:
We first derive the sufficient condition by lower bounding .
(Linearity of expectation) 
If the lower bound is non-negative, then so is . Hence:
which is the stated sufficient condition.
For the necessary condition to hold, we know that:
(Jensen’s inequality)  
(Linearity of expectation) 
Taking exponential on both sides, we get:
which is the stated necessary condition. ∎
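As a hedged sketch of the two bounds used in this proof, the chain of inequalities can be written out under assumed notation: additive boosting is taken to update the model as $q_t = (1-\hat{\alpha}_t)\, q_{t-1} + \hat{\alpha}_t\, h_t$ with $\hat{\alpha}_t \in [0, 1]$, and $\delta$ is taken to denote the reduction in KL-divergence $D_{KL}(P \,\|\, Q_{t-1}) - D_{KL}(P \,\|\, Q_t)$. The symbols $\delta$, $\hat{\alpha}_t$, $h_t$, and $q_t$ are assumptions made for this illustration.

```latex
\begin{align*}
\delta &= \mathbb{E}_P\left[\log \frac{q_t}{q_{t-1}}\right]
        = \mathbb{E}_P\left[\log\left(1 - \hat{\alpha}_t
          + \hat{\alpha}_t \frac{h_t}{q_{t-1}}\right)\right] \\
       &\geq \mathbb{E}_P\left[(1 - \hat{\alpha}_t)\log 1
          + \hat{\alpha}_t \log \frac{h_t}{q_{t-1}}\right]
        && \text{(concavity of } \log\text{)} \\
       &= \hat{\alpha}_t\, \mathbb{E}_P\left[\log \frac{h_t}{q_{t-1}}\right]
        && \text{(linearity of expectation)} \\[4pt]
\delta &\leq \log \mathbb{E}_P\left[1 - \hat{\alpha}_t
          + \hat{\alpha}_t \frac{h_t}{q_{t-1}}\right]
        && \text{(Jensen's inequality)} \\
       &= \log\left(1 - \hat{\alpha}_t
          + \hat{\alpha}_t\, \mathbb{E}_P\left[\frac{h_t}{q_{t-1}}\right]\right)
        && \text{(linearity of expectation)}
\end{align*}
```

Under these assumptions, the lower bound is non-negative whenever $\mathbb{E}_P[\log(h_t / q_{t-1})] \geq 0$, and exponentiating the upper bound shows that $\delta \geq 0$ with $\hat{\alpha}_t > 0$ requires $\mathbb{E}_P[h_t / q_{t-1}] \geq 1$.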
A.2 Theorem 2
Proof.
We first derive the sufficient condition.
(using Eq. (2))  
(7)  
(Jensen’s inequality)  
(by assumption) 
Note that if , the sufficient condition is also necessary.
For the necessary condition to hold, we know that:
(Jensen’s inequality)  
(Linearity of expectation)  
∎
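The argument for the multiplicative case can be checked numerically on a small discrete domain. The sketch below assumes a multiplicative update of the form q_t(x) proportional to h(x)^alpha * q_prev(x); the distributions `p`, `q_prev`, `h` and the weight `alpha` are toy values chosen for illustration, not taken from the experiments. It compares the directly computed reduction in KL-divergence with the closed form alpha * E_P[log h] - log E_Q[h^alpha], and with the lower bound obtained from Jensen's inequality.

```python
import math


def normalize(weights):
    """Rescale non-negative weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def kl(p, q):
    """KL-divergence D_KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def multiplicative_round(q_prev, h, alpha):
    """Assumed multiplicative update: q_t(x) proportional to h(x)**alpha * q_prev(x)."""
    return normalize([(hx ** alpha) * qx for hx, qx in zip(h, q_prev)])


# Toy values (hypothetical): data distribution, current model, intermediate model.
p = [0.5, 0.3, 0.2]
q_prev = [0.2, 0.3, 0.5]
h = [0.6, 0.3, 0.1]
alpha = 0.7

q_t = multiplicative_round(q_prev, h, alpha)

# Reduction in KL-divergence computed directly from its definition.
delta_direct = kl(p, q_prev) - kl(p, q_t)

# Closed form: alpha * E_P[log h] - log E_Q[h**alpha].
e_p_log_h = sum(px * math.log(hx) for px, hx in zip(p, h))
e_q_h_alpha = sum(qx * hx ** alpha for qx, hx in zip(q_prev, h))
delta_closed = alpha * e_p_log_h - math.log(e_q_h_alpha)

# Jensen lower bound: alpha * (E_P[log h] - log E_Q[h]).
e_q_h = sum(qx * hx for qx, hx in zip(q_prev, h))
jensen_lower = alpha * (e_p_log_h - math.log(e_q_h))

print(delta_direct, delta_closed, jensen_lower)
```

For these toy values the sufficient condition E_P[log h] >= log E_Q[h] holds, so the lower bound, and hence the reduction in KL-divergence, is non-negative.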
A.3 Proposition 1
Proof.
By assumption, we can optimize Eq. (3) to get:
Substituting for in the multiplicative boosting formulation in Eq. (2), we get:
where the partition function .
In order to prove the inequality, we first obtain a lower bound on the log-partition function, . For any given point, we have:
(Arithmetic mean ≥ geometric mean) 
Integrating over all points in the domain, we get:
(8) 
where we have used the fact that and are normalized densities.
A.4 Proposition 2
Proof.
A.5 Corollary 1
Proof.
Let denote the joint distribution over at round . We will prove a slightly more general result where we have positive training examples sampled from and the negative training examples sampled from . (Footnote 1: In the statement of Corollary 1, the classes are assumed to be balanced for simplicity, i.e., .) Hence, we can express the conditional and prior densities as:
(9)
(10)
(11)
(12)
The Bayes optimal density can be expressed as:
(13) 
Similarly, we have:
(14) 
From Eqs. (9)–(12) and (13)–(14), we have:
where .
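The role of the Bayes optimal classifier can be illustrated on a discrete domain. The sketch below assumes that positives are drawn from the data distribution `p` with class prior `prior_pos`, negatives from the current model `q_prev`, and that the intermediate model multiplies the current model by `gamma * c / (1 - c)` with unit weight, where `gamma` is the ratio of the negative to the positive class prior. The function names, the name `gamma`, and the toy distributions are hypothetical and used only for illustration.

```python
def bayes_optimal_classifier(p, q_prev, prior_pos):
    """P(y = 1 | x) when positives come from p and negatives from q_prev."""
    return [
        prior_pos * px / (prior_pos * px + (1.0 - prior_pos) * qx)
        for px, qx in zip(p, q_prev)
    ]


def boost_once(q_prev, c, prior_pos):
    """Multiply q_prev by gamma * c / (1 - c), with gamma = P(y=0) / P(y=1)."""
    gamma = (1.0 - prior_pos) / prior_pos
    return [qx * gamma * cx / (1.0 - cx) for qx, cx in zip(q_prev, c)]


p = [0.5, 0.3, 0.2]        # toy data distribution
q_prev = [0.2, 0.3, 0.5]   # toy current model

# Balanced classes (prior 0.5) and an unbalanced case (prior 0.25).
for prior_pos in (0.5, 0.25):
    c = bayes_optimal_classifier(p, q_prev, prior_pos)
    q_t = boost_once(q_prev, c, prior_pos)
    print(prior_pos, q_t)
```

In both cases the odds ratio of the classifier, corrected by the class prior ratio, equals p(x) / q_prev(x), so a single round recovers the data distribution and no renormalization is needed in this toy setting.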
In Corollary 2 below, we present an additional theoretical result that derives the optimal model weight, for an adversarial Bayes optimal classifier.
A.6 Corollary 2
Corollary 2.
[to Corollary 1] Define an adversarial Bayes optimal classifier as one that assigns the density where is the Bayes optimal classifier. For an adversarial Bayes optimal classifier , attains a maximum of zero when .
Appendix B Additional implementation details
B.1 Density estimation on synthetic dataset
Model weights.
For DiscBGM, all model weights, ’s, are set to unity. The model weights for GenBGM, ’s, are set uniformly to , and the reweighting coefficients, ’s, are set to unity.
B.2 Density estimation on benchmark datasets
Generator learning procedure details.
We use the default open source implementations of mixture of Bernoullis (MoB) and sum-product networks (SPN) as given in https://github.com/AmazaspShumik/sklearnbayes and https://github.com/KalraA/Tachyon respectively for baseline models.
Discriminator learning procedure details.
The discriminator considered for these experiments is a multilayer perceptron with two hidden layers consisting of units each and ReLU activations, learned using the Adam optimizer [Kingma and Ba2015] with a learning rate of . The training is for epochs with a minibatch size of , and finally the model checkpoint with the best validation error during training is selected to specify the intermediate model to be added to the ensemble.
Model weights.
Model weights for multiplicative boosting algorithms, GenBGM and DiscBGM, are set based on best validation set performance of the heuristic weighting strategies. Partition function is estimated using importance sampling with the baseline model (MoB or SPN) as a proposal and a sample size of .
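The importance sampling estimate of the partition function can be sketched on a toy discrete domain. The example assumes an unnormalized ensemble density of the form baseline(x) * h(x)^alpha, with the baseline model used as the proposal; the function `estimate_partition`, the four-state `baseline`, the values of `h`, and `alpha` are hypothetical stand-ins rather than the models used in the experiments.

```python
import random


def estimate_partition(unnormalized, proposal, n_samples, seed=0):
    """Importance-sampling estimate of Z = sum_x unnormalized[x].

    Draws x from the proposal and averages the importance weights
    unnormalized[x] / proposal[x].
    """
    rng = random.Random(seed)
    states = range(len(proposal))
    draws = rng.choices(states, weights=proposal, k=n_samples)
    return sum(unnormalized[x] / proposal[x] for x in draws) / n_samples


# Toy stand-in for a baseline model (the proposal) over four discrete states.
baseline = [0.1, 0.2, 0.3, 0.4]
# Hypothetical intermediate model values h(x) and model weight alpha.
h = [2.0, 1.0, 0.5, 0.25]
alpha = 1.0

# Unnormalized product density: baseline(x) * h(x) ** alpha.
unnormalized = [b * hx ** alpha for b, hx in zip(baseline, h)]

z_exact = sum(unnormalized)  # tractable here only because the domain is tiny
z_hat = estimate_partition(unnormalized, baseline, n_samples=20000)
print(z_exact, z_hat)
```

Because the proposal is the baseline model itself, each importance weight reduces to h(x)^alpha, so the estimate is simply the sample mean of the intermediate model evaluated on baseline samples.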
B.3 Sample generation
VAE architecture and learning procedure details.
Only the last layer in every VAE is stochastic; the rest are deterministic. The inference network specifying the posterior contains the same architecture for the hidden layer as the generative network. The prior over the latent variables is standard Gaussian, the hidden layer activations are ReLU, and learning is done using Adam [Kingma and Ba2015] with a learning rate of and minibatches of size .
CNN architecture and learning procedure details.
The CNN contains two convolutional layers and a single fully connected layer with units. Convolution layers have kernel size , and and output channels, respectively. We apply ReLUs and max pooling after each convolution. The net is randomly initialized prior to training, and learning is done using the Adam [Kingma and Ba2015] optimizer with a learning rate of and minibatches of size .
Sampling procedure for BGM sequences.
Samples from the GenDiscBGM are drawn from a Markov chain run using the Metropolis-Hastings algorithm with a discrete, uniformly random proposal and the BGM distribution as the stationary distribution for the chain. Every sample in Figure 4(d) is drawn from an independent Markov chain with a burn-in period of samples and a different start seed state.
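A minimal sketch of this sampling scheme on a toy discrete domain is given below. It assumes an unnormalized target density stored as a list, a uniform proposal over all states, and one independent chain per sample with its own seed; the function `mh_sample`, the four-state target, the burn-in length, and the number of chains are hypothetical choices for illustration only.

```python
import random


def mh_sample(unnormalized, burn_in, seed):
    """Return the final state of one Metropolis-Hastings chain.

    The proposal is uniform over all states, so it is symmetric and the
    acceptance probability is min(1, target(proposal) / target(current)).
    """
    rng = random.Random(seed)
    n_states = len(unnormalized)
    state = rng.randrange(n_states)  # random start state for this chain
    for _ in range(burn_in):
        proposal = rng.randrange(n_states)
        current_density = unnormalized[state]
        if current_density == 0.0:
            accept_prob = 1.0  # always leave a zero-density state
        else:
            accept_prob = min(1.0, unnormalized[proposal] / current_density)
        if rng.random() < accept_prob:
            state = proposal
    return state


# Toy unnormalized target; the normalized version is [0.1, 0.2, 0.3, 0.4].
target = [1.0, 2.0, 3.0, 4.0]

# One independent chain per sample, each with a different seed.
n_chains = 4000
samples = [mh_sample(target, burn_in=50, seed=s) for s in range(n_chains)]
freqs = [samples.count(x) / n_chains for x in range(len(target))]
print(freqs)
```

The target only needs to be known up to its normalizing constant, since the acceptance ratio depends on the ratio of densities alone; this is what makes the scheme usable when the partition function of the ensemble is unknown.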