Boosting Generative Models by Leveraging Cascaded Meta-Models

05/11/2019
by   Fan Bao, et al.

Deep generative models are effective methods of modeling data. However, it is not easy for a single generative model to faithfully capture the distribution of complex data such as images. In this paper, we propose an approach for boosting generative models, which cascades meta-models to produce a stronger model. Any hidden-variable meta-model (e.g., an RBM or a VAE) that supports likelihood evaluation can be leveraged. We derive a decomposable variational lower bound of the boosted model, which allows each meta-model to be trained separately and greedily. Moreover, our framework can be extended to semi-supervised boosting, where the boosted model learns a joint distribution of data and labels. Finally, we combine our boosting framework with the multiplicative boosting framework, which further improves the learning power of generative models.


1 Introduction

The past decade has witnessed tremendous success in the field of deep generative models (DGMs) in both unsupervised learning [Goodfellow et al.2014, Kingma and Welling2013, Radford et al.2015] and semi-supervised learning [Abbasnejad et al.2017, Kingma et al.2014, Li et al.2018] paradigms. DGMs learn the data distribution by combining the scalability of deep learning with the generality of probabilistic reasoning. However, it is not easy for a single parametric model to learn a complex distribution, since the upper limit of a model's capacity is determined by its fixed structure, and a model with low capacity is likely to perform poorly. Straightforwardly increasing the model capacity (e.g., adding more layers or more neurons) is likely to cause serious optimization challenges, such as the vanishing gradient problem [Hochreiter et al.2001] and the exploding gradient problem [Grosse2017].

An alternative approach is to integrate multiple weak models to achieve a strong one. Early successes were made with mixture models [Dempster et al.1977, Figueiredo and Jain2002, Xu and Jordan1996] and products of experts [Hinton1999, Hinton2002]. However, the weak models in such work are typically shallow models with very limited capacity. Recent progress has been made on boosting generative models, where a set of meta-models (i.e., weak learners) is combined to construct a stronger model. In particular, Grover and Ermon [Grover and Ermon2018] propose a method of multiplicative boosting, which takes the geometric average of the meta-models' distributions, with each assigned an exponentiated weight. This boosting method improves performance on density estimation and sample generation compared to a single meta-model. However, the boosted model has an explicit partition function, whose evaluation requires importance sampling [Rubinstein and Kroese2016], and sampling from the boosted model is in general conducted with Markov chain Monte Carlo (MCMC) methods [Hastings1970]. As a result, both likelihood evaluation and sample generation incur a high time complexity. Rosset and Segal [Rosset and Segal2003] propose another method of additive boosting, which takes the weighted arithmetic mean of the meta-models' distributions. This method can sample quickly, but its improvement on density estimation is not comparable to that of multiplicative boosting, since additive boosting requires that the expected log-likelihood and likelihood of the current meta-model be no worse than those of the previous boosted model, which is difficult to satisfy [Grover and Ermon2018]. In summary, neither of the previous boosting methods balances well between improving the learning power and keeping density estimation and sampling efficient.

To address the aforementioned issues, we propose a novel boosting framework where meta-models are connected in cascade. Our meta-algorithmic framework is inspired by the greedy layer-wise training algorithm of DBNs (Deep Belief Networks) [Bengio et al.2007, Hinton et al.2006], in which an ensemble of RBMs (Restricted Boltzmann Machines) [Smolensky1986] is converted into a stronger model. We propose a decomposable variational lower bound, which reveals the principle behind the greedy layer-wise training algorithm. The decomposable lower bound allows us to incorporate any hidden-variable meta-model (e.g., an RBM or a VAE (Variational Autoencoder) [Kingma and Welling2013]), as long as it supports likelihood evaluation, and to train these meta-models separately and greedily, yielding a deep boosted model. Moreover, our boosting framework can be extended to semi-supervised boosting, where the boosted model learns a joint distribution of data and labels. Finally, we demonstrate that our boosting framework can be integrated with the multiplicative boosting framework [Grover and Ermon2018], yielding a hybrid boosting with an improved learning power of generative models. In summary, we make the following contributions:

  • We propose a meta-algorithmic framework to boost generative models by cascading hidden-variable meta-models, which can also be extended to semi-supervised boosting.

  • We give a decomposable variational lower bound of the boosted model, which reveals the principle behind the greedy layer-wise training algorithm.

  • We demonstrate that our boosting framework can be integrated with multiplicative boosting to form a hybrid model, which further improves the learning power of generative models.

2 Approach

In subsection 2.1, we review multiplicative boosting [Grover and Ermon2018]. Then, we present our boosting framework: we first describe how to connect meta-models, and then propose our meta-algorithmic framework with its theoretical analysis. Afterwards, we discuss the convergence and the extension to semi-supervised boosting.

2.1 Boosting Generative Models

Grover and Ermon [Grover and Ermon2018] introduced multiplicative boosting, which takes the geometric average of the meta-models' distributions, with each assigned an exponentiated weight:

P_n(x) = \frac{1}{Z_n} \prod_{i=0}^{n} m_i(x)^{\alpha_i},    (1)

where the $m_i$ ($0 \le i \le n$) are meta-models, which are required to support likelihood evaluation, $P_n$ is the boosted model, and $Z_n$ is the partition function. The first meta-model $m_0$ is trained on the empirical data distribution $p_{\mathcal{D}}$, which is defined to be uniform over the dataset $\mathcal{D}$. The other meta-models $m_i$ ($i \ge 1$) are trained on a reweighted data distribution:

D_i(x) \propto \left( \frac{p_{\mathcal{D}}(x)}{P_{i-1}(x)} \right)^{\beta_i},    (2)

where $P_{i-1}$ is the boosted model after incorporating $m_0, \dots, m_{i-1}$, with $\beta_i \in [0, 1]$ being the hyperparameter. These meta-models are therefore connected in parallel, as shown in Figure 1, since their distributions are combined by multiplication.

Grover and Ermon [Grover and Ermon2018] show that the expected log-likelihood of the boosted model over the dataset will not decrease (i.e., $\mathbb{E}_{p_{\mathcal{D}}}[\log P_i(x)] \ge \mathbb{E}_{p_{\mathcal{D}}}[\log P_{i-1}(x)]$) if Equation 2 is maximized. Multiplicative boosting succeeds in improving the learning power of generative models: compared to an individual meta-model, it has better performance on density estimation and generates samples of higher quality. However, importance sampling and MCMC are required to evaluate the partition function and to generate samples respectively, which limits its application in settings requiring fast density estimation and sampling. To overcome these shortcomings, we propose our boosting framework, where meta-models are cascaded.
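To make the partition-function issue concrete, the following is a minimal sketch (not from the paper) of how the boosted density in Equation 1 could be evaluated, with $\log Z_n$ estimated by importance sampling; the `log_prob`/`sample` methods of the meta-models and of the proposal distribution are an assumed interface for illustration.

```python
import numpy as np

def unnormalized_log_density(x, meta_models, alphas):
    """sum_i alpha_i * log m_i(x): the log of the numerator of Equation 1."""
    return sum(a * m.log_prob(x) for m, a in zip(meta_models, alphas))

def estimate_log_partition(meta_models, alphas, proposal, n_samples=10000):
    """Importance-sampling estimate of log Z_n with a tractable proposal:
    Z_n = E_{x ~ proposal}[ prod_i m_i(x)^alpha_i / proposal(x) ]."""
    xs = [proposal.sample() for _ in range(n_samples)]
    log_w = np.array([unnormalized_log_density(x, meta_models, alphas)
                      - proposal.log_prob(x) for x in xs])
    # log-mean-exp for numerical stability
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))
```

Even with a good proposal, such an estimate must be recomputed whenever a meta-model changes, which is what makes fast density estimation difficult in the parallel setting.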

2.2 How to Connect Meta-Models

In multiplicative boosting, meta-models are connected in parallel, leading to the troublesome partition function. To overcome this problem, we connect meta-models in cascade, where the output of a previous model is passed to the input of the next model, as shown in Figure 1.

Figure 1: Left: the parallel connection of meta-models in the multiplicative boosting framework. Right: the cascade connection of meta-models in our boosting framework.

Now let us formulate our cascade connection for generative models. Suppose we have a group of meta-models $m_i(x_{i-1}, x_i)$, $i = 1, \dots, n$, where $x_{i-1}$ is the visible variable and $x_i$ is the hidden variable. These models can belong to different families (e.g., RBMs and VAEs), as long as they have hidden variables and support likelihood evaluation. To ensure that the output of the previous model is passed to the input of the next model, we replace the marginal $m_i(x_i)$ with $m_{i+1}(x_i)$ and connect the meta-models in a directed chain:

p_n(x_0, x_1, \dots, x_n) = m_n(x_{n-1}, x_n) \prod_{i=1}^{n-1} m_i(x_{i-1} \mid x_i),    (3)

where $x_0$ is the data variable and $m_i(x_{i-1} \mid x_i) = m_i(x_{i-1}, x_i) / m_i(x_i)$. The visible variable of a previous model serves as the hidden variable of the next model, so that we can sample from the boosted model quickly in a top-down style. It is worth noting that, when all meta-models are RBMs, the boosted model is a DBN, and when all meta-models are VAEs, the boosted model is a DLGM (Deep Latent Gaussian Model) [Burda et al.2015, Rezende et al.2014].

The boosted model thus allows us to generate samples. Next, we build the approximation of the posterior distribution, which allows us to do inference. We also connect these meta-models in a directed chain [Burda et al.2015], so that we can do inference in a bottom-up style:

q(x_1, \dots, x_n \mid x_0) = \prod_{i=1}^{n} m_i(x_i \mid x_{i-1}),    (4)

where $x_0$ is the data variable and $m_i(x_i \mid x_{i-1}) = m_i(x_{i-1}, x_i) / m_i(x_{i-1})$. This approximation of the posterior distribution has the advantage that we don't need to re-infer the whole boosted model after incorporating a new meta-model $m_k$: we only need to infer $x_k$ from $m_k$, conditioned on the $x_{k-1}$ inferred previously.
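As an illustration of the two directed chains, the sketch below shows top-down sampling (Equation 3) and bottom-up inference (Equation 4); the meta-model methods `sample_joint`, `sample_visible_given_hidden` and `sample_hidden_given_visible` are a hypothetical interface assumed for this sketch.

```python
def sample_top_down(meta_models):
    """Ancestral sampling from the cascade p_n (Equation 3): draw a visible
    sample from the top meta-model, then pass it down through each
    conditional m_i(x_{i-1} | x_i)."""
    x_visible, _ = meta_models[-1].sample_joint()      # (x_{n-1}, x_n) from m_n
    for m in reversed(meta_models[:-1]):
        x_visible = m.sample_visible_given_hidden(x_visible)
    return x_visible                                   # a sample of x_0

def infer_bottom_up(meta_models, x0):
    """Inference with the chain q (Equation 4): push x_0 up through each
    posterior m_i(x_i | x_{i-1}) and collect the latent samples."""
    latents, x = [], x0
    for m in meta_models:
        x = m.sample_hidden_given_visible(x)
        latents.append(x)
    return latents
```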

2.3 Decomposable Variational Lower Bound

Let $\mathcal{D}$ be our training dataset. We give a lower bound to the marginal likelihood of the boosted model $p_n$. The lower bound decomposes into $n$ terms, which reveals the principle behind the greedy layer-wise training algorithm, as stated in Theorem 1.

Theorem 1

Let $m_i$ ($1 \le i \le n$) be meta-models, let $p_k$ be the model boosted from the first $k$ meta-models ($1 \le k \le n$) as in Equation 3, and let $q$ be the approximate posterior in Equation 4. Then we have:

\mathbb{E}_{p_{\mathcal{D}}(x_0)}\left[\log p_n(x_0)\right] \ge \mathcal{L}_n = \sum_{k=1}^{n} r_k,    (5)

where $r_1 = \mathbb{E}_{p_{\mathcal{D}}(x_0)}\left[\log m_1(x_0)\right]$ and

r_k = \mathbb{E}_{p_{\mathcal{D}}(x_0)\, q(x_{k-1} \mid x_0)}\left[\log m_k(x_{k-1}) - \log m_{k-1}(x_{k-1})\right], \quad 2 \le k \le n.    (6)

Proof: see Appendix A (the appendix is in the supplemental material at https://github.com/nspggg/BMG_Cascade/blob/master/SupplementalMaterial.pdf).

The lower bound decomposes into $n$ terms $r_k$ ($1 \le k \le n$). Specifically, $r_1$ is the expected marginal likelihood of the first meta-model $m_1$, and $r_k$ ($2 \le k \le n$) is the difference between the expected log marginal likelihood of $x_{k-1}$ as the observable variable of $m_k$ and as the hidden variable of $m_{k-1}$.

When $n = 1$, there is only one meta-model and the lower bound is exactly equal to the marginal likelihood of the boosted model, so the lower bound is tight for $n = 1$. Starting from this initially tight lower bound, we can further improve it by optimizing the decomposed terms sequentially, yielding the greedy layer-wise training algorithm discussed in subsection 2.4.
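The decomposition can be estimated term by term with Monte Carlo samples from the inference chain. The sketch below assumes each trained meta-model exposes hypothetical `log_marginal_visible`, `log_marginal_hidden` and `sample_hidden_given_visible` methods; summing the returned terms estimates $\mathcal{L}_n$.

```python
import numpy as np

def lower_bound_terms(meta_models, x0_batch):
    """Monte Carlo estimate of the terms r_k in Equations 5-6 on a data batch."""
    terms = []
    x = x0_batch
    # r_1: expected marginal likelihood of the first meta-model on the data
    terms.append(np.mean(meta_models[0].log_marginal_visible(x)))
    for k in range(1, len(meta_models)):
        # one more step of the inference chain: x_{k-1} ~ q(x_{k-1} | x_0)
        x = meta_models[k - 1].sample_hidden_given_visible(x)
        r_k = np.mean(meta_models[k].log_marginal_visible(x)
                      - meta_models[k - 1].log_marginal_hidden(x))
        terms.append(r_k)
    return terms   # sum(terms) approximates the lower bound L_n
```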

2.4 The Greedy Layer-Wise Training Algorithm

The difference between the lower bound of $p_k$ and that of $p_{k-1}$ is:

\mathcal{L}_k - \mathcal{L}_{k-1} = r_k = \mathbb{E}_{p_{\mathcal{D}}(x_0)\, q(x_{k-1} \mid x_0)}\left[\log m_k(x_{k-1}) - \log m_{k-1}(x_{k-1})\right].    (7)

To ensure that the lower bound grows as $m_k$ is incorporated, we only need to ensure that $r_k$ is positive. When we train the meta-model $m_k$, we fix the remaining meta-models $m_i$ ($i < k$) and maximize $\mathbb{E}_{p_{\mathcal{D}}(x_0)\, q(x_{k-1} \mid x_0)}[\log m_k(x_{k-1})]$ by tuning only the parameters of $m_k$, until it exceeds $\mathbb{E}_{p_{\mathcal{D}}(x_0)\, q(x_{k-1} \mid x_0)}[\log m_{k-1}(x_{k-1})]$. As a result, we can train each meta-model separately and greedily, as outlined in Alg. 1.

1:  Input: dataset $\mathcal{D}$;  number of meta-models $n$
2:  Let $p_{\mathcal{D}}$ be the empirical distribution of $\mathcal{D}$
3:  Sample $x_0$ from $p_{\mathcal{D}}$ and train $m_1$
4:  $k \leftarrow 2$
5:  while $k \le n$ do
6:     Fix the trained meta-models $m_1, \dots, m_{k-1}$
7:     Sample $x_{k-1}$ from $q(x_{k-1} \mid x_0)$ with $x_0 \sim p_{\mathcal{D}}$
8:     Use these samples to train $m_k$
9:     $k \leftarrow k + 1$
10:  end while
11:  Build the boosted model $p_n$ from $m_1, \dots, m_n$ as in Equation 3
12:  return $p_n$
Algorithm 1 Boosting Generative Models
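In code, the greedy loop of Alg. 1 amounts to repeatedly training a meta-model on the representations produced by the previously trained ones. A minimal sketch, assuming each meta-model exposes hypothetical `fit` and `sample_hidden_given_visible` methods:

```python
def boost_generative_models(data, model_constructors):
    """Greedy layer-wise boosting in the spirit of Alg. 1: the k-th meta-model
    is trained on samples of x_{k-1} drawn from the inference chain of the
    previously trained meta-models."""
    trained = []
    x = data                                      # samples of x_0 from p_D
    for build in model_constructors:
        m = build()
        m.fit(x)                                  # train m_k on samples of x_{k-1}
        trained.append(m)
        x = m.sample_hidden_given_visible(x)      # representations for m_{k+1}
    return trained                                # the cascade defining p_n (Equation 3)
```

The returned list of trained meta-models can then be used with the sampling and inference chains of subsection 2.2.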

Generally, after training $m_k$, Equation 7 will not be negative and the lower bound in Equation 5 will not decrease: if $m_k$ is an arbitrarily powerful learner, it can model the aggregated posterior $q(x_{k-1})$ at least as well as $m_{k-1}$ does, so that

\max_{m_k} \; \mathbb{E}_{p_{\mathcal{D}}(x_0)\, q(x_{k-1} \mid x_0)}\left[\log m_k(x_{k-1})\right] \ge \mathbb{E}_{p_{\mathcal{D}}(x_0)\, q(x_{k-1} \mid x_0)}\left[\log m_{k-1}(x_{k-1})\right].    (8)

In practice, Equation 7 is likely to be negative in the following three cases:

  • $m_k$ is not well trained. In this case, Equation 7 is very negative, which indicates that we should tune the hyperparameters of $m_k$ and retrain this meta-model.

  • $m_{k-1}(x_{k-1})$ is already close to the marginal distribution of $q(x_{k-1})$. In this case, Equation 7 will be close to zero, and we can either keep training by incorporating more powerful meta-models or simply stop training.

  • The lower bound converges. In this case, the lower bound will stop growing even if we keep incorporating more meta-models. The convergence is further discussed in subsection 2.5.

For models in which $m_k(x_{k-1})$ is initialized from $m_{k-1}(x_{k-1})$, such as DBNs [Hinton et al.2006], we can make sure that Equation 7 will never be negative.

2.5 Convergence

It is impossible for the decomposable lower bound to grow infinitely. After training $m_k$, if it reaches the best global optimum (i.e., $m_k(x_{k-1})$ is exactly the marginal distribution of $q(x_{k-1})$), the lower bound will stop growing even if we keep incorporating more meta-models. We formally describe the convergence in Theorem 2.

Theorem 2

If $m_k$ reaches the best global optimum, then $\mathcal{L}_{k+j} \le \mathcal{L}_k$ for any $j \ge 1$.

Proof: see Appendix B.

This indicates that it is unnecessary to incorporate as many meta-models as possible, since the boosted model converges once some $m_k$ reaches the best global optimum. We give another necessary condition of the convergence in Theorem 3.

Theorem 3

If $m_k$ reaches the best global optimum, then the aggregated posterior $q(x_k)$ equals the hidden marginal $m_k(x_k)$.

Proof: see Appendix B.

For meta-models such as VAEs, $m_k(x_k)$ is the standard normal distribution, and the KL divergence between the Gaussian posterior $m_k(x_k \mid x_{k-1})$ and this prior is analytically solvable, which helps us judge whether the boosted model has converged in practice.
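For example, under the assumption of a diagonal Gaussian posterior and a standard normal prior, the analytic quantity mentioned above is the familiar closed-form Gaussian KL; a minimal sketch:

```python
import numpy as np

def gaussian_prior_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).
    Averaging this over the data gives a cheap check of how far the
    posterior is from the standard normal prior of a VAE meta-model."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)
```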

2.6 Semi-Supervised Extension

Our approach can be extended to semi-supervised learning, where the boosted model defines a joint distribution over the data $x_0$, the latent chain $x_1, \dots, x_n$ and the label $y$, and the approximate posterior follows the inference chain in Equation 4, with the latent class variable $y$ appearing only in the last meta-model $m_n$. The model is also trained greedily and layer-wise: the first $n-1$ meta-models are trained on unlabelled data using Alg. 1, yielding latent representations of the data, and the top-most meta-model $m_n$ is trained on both unlabelled and labelled data, where the non-label parts are first converted to their latent representations. The formal training algorithm, its detailed derivation and the corresponding experiments are given in Appendix C. The experiments show that our semi-supervised extension can achieve good performance on the classification task without the help of any classification loss.
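A minimal sketch of this two-stage recipe, assuming the cascade trained by Alg. 1 exposes a hypothetical `infer_top` function mapping data to the top-most latent representation, and the last meta-model exposes a `fit` method accepting latent codes together with optional labels:

```python
def train_semi_supervised(cascade, top_model, x_unlabelled, x_labelled, y_labelled):
    """Semi-supervised extension sketch: the lower meta-models are already
    trained on unlabelled data (Alg. 1); the top-most meta-model then learns
    a joint distribution over the latent representation and the label."""
    z_unlabelled = cascade.infer_top(x_unlabelled)   # latents of unlabelled data
    z_labelled = cascade.infer_top(x_labelled)       # latents of labelled data
    top_model.fit(z_unlabelled, z_labelled, y_labelled)
    return top_model
```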

3 Hybrid Boosting

Since meta-models can be connected either in cascade, as in our boosting method, or in parallel, as in multiplicative boosting, we can further consider hybrid boosting, where a boosted model contains both cascade and parallel connections. In fact, it is not difficult to implement: we can treat the boosted model produced by our method as a meta-model for multiplicative boosting.

An open problem for hybrid boosting is to determine what kind of meta-models to use and how meta-models are connected, which is closely related to the specific dataset and task. Here we introduce some strategies for this problem.

For the cascade connection, if the dataset can be divided into several categories, it is appropriate to use a GMM (Gaussian Mixture Model) [Reynolds2015] as the top-most meta-model, with the other meta-models selected as VAEs [Kingma and Welling2013] or their variants [Burda et al.2015, Sønderby et al.2016]. There are three reasons for this strategy: (1) the posterior of a VAE is much simpler than the dataset distribution, so a GMM is enough to learn the posterior; (2) the posterior of a VAE is likely to consist of several components, each corresponding to one category, which makes a GMM, itself a mixture of several components, a natural fit; (3) since $m_{k-1}(x_{k-1})$ is a standard Gaussian distribution when $m_{k-1}$ is a VAE, according to Equation 7, if $m_k$ is a GMM, which covers the standard Gaussian distribution as a special case, we can make sure that Equation 7 will not be negative after training $m_k$. A concrete sketch of this strategy is given below.
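A minimal sketch of the cascade strategy, assuming a trained VAE whose hypothetical `encode` function maps a data batch to latent codes; the GMM is fit with scikit-learn and then serves as the top-most meta-model:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_top_gmm(encode, data, n_components=10):
    """Train a GMM on the latent codes of a trained VAE (the aggregated
    posterior), so that it can replace the standard normal prior as the
    top-most meta-model of the cascade."""
    z = encode(data)                    # latent codes, shape (n_samples, latent_dim)
    gmm = GaussianMixture(n_components=n_components, covariance_type='full')
    gmm.fit(z)
    return gmm                          # supports score_samples(...) and sample(...)
```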

For parallel connection, each meta-model should have enough learning power for the dataset, since each meta-model is required to learn the distribution of the dataset or the reweighted dataset. If any meta-model fails to learn the distribution, the performance of the boosted model will be harmed. In subsection 4.4, we give a negative example, where a VAE and a GMM are connected in parallel and the overall performance is extremely bad.

4 Experiments

We now present experiments to verify the effectiveness of our method. We first give results of boosting a set of RBMs and VAEs to validate that the performance of the boosted model indeed improves as more meta-models are incorporated. Then, we give results of boosting some advanced models to show that our method can be used as a technique to further promote the performance of state-of-the-art models. Finally, we compare different generative boosting methods.

4.1 Setup

We conduct experiments on binarized MNIST [LeCun and Cortes2010], which contains 60000 training and 10000 test images, as well as the more complex CelebA dataset [Liu et al.2015], which contains 202599 face images, each first resized to a fixed resolution. The meta-models we use are RBMs [Smolensky1986], GMMs [Reynolds2015], VAEs [Kingma and Welling2013], ConvVAEs (i.e., VAEs with convolutional layers), IWAEs [Burda et al.2015], and LVAEs [Sønderby et al.2016], with their architectures given in Appendix D. All experiments are conducted on one 2.60GHz CPU and one GeForce GTX GPU.

4.2 Boosting RBMs and VAEs

Using RBMs and VAEs as meta-models, we first evaluate the lower bound of boosted models, and then generate samples from them. Finally, we compare our boosting method with the method of naively increasing model capacity.

Evaluation of lower bound.

First, we evaluate the lower bound for 4 combinations of RBMs and VAEs on MNIST, where a higher lower bound corresponds to better performance. Since the stochastic variables in RBMs are discrete and those in VAEs are continuous, we put RBMs at the bottom and VAEs at the top. For each combination, we evaluate the lower bound in Equation 5 at different $n$ on both the MNIST training and test sets. The results are shown in Figure 2, where the triangular and circular markers correspond to RBMs and VAEs respectively. If we keep adding models of the same kind, the lower bound first increases and then reaches a stable state, as shown in Chart (1) and Chart (4). But this does not mean that the lower bound has converged to the best state and cannot increase further: as shown in Chart (3), the boosted model reaches a stable state after incorporating two RBMs, but we can further improve the lower bound by incorporating VAEs.

Figure 2: The lower bound (Equation 5) on different combinations of meta-models. The triangular and circular markers correspond to RBMs and VAEs respectively. (1): All meta-models are VAEs. After incorporating two VAEs, the lower bound enters a stable state. (2): The first two meta-models are RBMs and the rest are VAEs. The second RBM greatly promotes the lower bound and the first VAE helps slightly. (3): The first four meta-models are RBMs and the rest are VAEs. The lower bound grows as the first two RBMs are incorporated, while the incorporation of next two RBMs doesn’t help promote the lower bound. We further improve the lower bound by adding two VAEs. (4): All meta-models are RBMs. After incorporating two RBMs, the lower bound enters a stable state.

Sample generation.

We sample randomly from boosted models consisting of one and two VAEs respectively. As shown in Figure 3, incorporating an extra VAE increases the quality of the generated images for both MNIST and CelebA, which is consistent with the density-estimation performance over the two datasets shown in Table 1.

Number of VAEs (k)    MNIST      CelebA
k=1                   -102.51    -6274.86
k=2                   -100.44    -6268.43
Table 1: Density estimation over the MNIST test set and CelebA.
Figure 3: Samples generated from boosted models consisting of different number of VAEs. k is the number of VAEs.

Comparison with naively increasing model capacity.

We compare our method with the method of naively increasing model capacity. The conventional way of increasing model capacity is either to add more deterministic hidden layers or to increase the dimension of the deterministic hidden layers, so we compare our boosted model (Boosted VAEs) with a deeper model (Deeper VAE) and a wider model (Wider VAE). The Deeper VAE has ten 500-dimensional deterministic hidden layers; the Wider VAE has two 2500-dimensional deterministic hidden layers; the Boosted VAEs is composed of 5 base VAEs, each of which has two 500-dimensional deterministic hidden layers. As a result, all three models have 5000 deterministic hidden units.

Figure 4 shows the results. The Wider VAE has the highest lower bound, but the digits it generates are often indistinguishable. Meanwhile, the Deeper VAE is able to generate distinguishable digits, but some digits are rather blurred and its lower bound is the lowest. Only the digits generated by the Boosted VAEs are both distinguishable and sharp.

Since straightforwardly increasing the model capacity is likely to cause serious challenges, such as the vanishing gradient problem [Hochreiter et al.2001] and the exploding gradient problem [Grosse2017], it often fails to achieve the desired improvement in learning power. Our boosting method avoids these challenges and achieves better results than naively increasing the model capacity.

Figure 4: Comparison between our boosting method and the method of naively increasing model capacity. Deeper VAE, Wider VAE and Boosted VAEs have the same number of deterministic hidden units. The Wider VAE has the highest lower bound, but most digits it generates are indistinguishable. Meanwhile, the Deeper VAE is able to generate distinguishable digits, but some digits are rather blurred and its lower bound is the lowest. Only the digits generated by the Boosted VAEs are both distinguishable and sharp.

4.3 Boosting Advanced Models

We choose ConvVAE (i.e., a VAE with convolutional layers), LVAE [Sønderby et al.2016] and IWAE [Burda et al.2015] as advanced models, which represent current state-of-the-art methods. We use one advanced model and one GMM [Reynolds2015] to construct a boosted model, with the advanced model at the bottom and the GMM at the top. The results are given in Table 2. We see that the performance of each advanced model is further improved by incorporating a GMM, at the cost of a few seconds.

The performance improvement from incorporating a GMM is theoretically guaranteed: since $m_1(x_1)$ is a standard Gaussian distribution when $m_1$ is a VAE or one of the above three advanced variants, according to Equation 7, if $m_2$ is a GMM, which covers the standard Gaussian distribution as a special case, we can make sure that Equation 7 will not be negative after training $m_2$. Moreover, the dimension of the hidden variable is much smaller than the dimension of the observable variable for VAEs and their variants, so training $m_2$ requires very little time. As a result, our boosting framework can be used as a technique to further promote the performance of state-of-the-art models at the cost of very little extra time.

Model                  test performance   extra time (s)
ConvVAE                -88.41
ConvVAE + GMM          -87.42             +7.85
LVAE, 2-layer          -95.73
LVAE, 2-layer + GMM    -95.50             +11.76
IWAE, k=5              -81.58
IWAE, k=5 + GMM        -80.38             +9.41
IWAE, k=10             -80.56
IWAE, k=10 + GMM       -79.20             +8.39
Table 2: Test set performance on MNIST. The LVAE has 2 stochastic hidden layers. The number of importance weighted samples (k) for IWAE is 5 and 10. The number of components in the GMM is set to 10. The extra time is the time cost for incorporating an extra GMM.
Connection   Model                   test performance   training time (s)   density estimation time (s)   sampling time (s)
cascade      VAE+VAE                 -99.53             223.85              0.42                          0.13
cascade      VAE+GMM                 -98.13             116.33              0.14                          0.12
parallel     VAE×VAE                 -95.72             225.21              50.91                         543.82
parallel     VAE×GMM                 -506.76            2471.60             130.65                        480.95
hybrid       (VAE+GMM)×VAE           -94.28             225.23              125.20                        1681.77
hybrid       (VAE+GMM)×(VAE+GMM)     -93.94             226.86              147.82                        2612.59
Table 3: Comparison between different boosting methods on MNIST. The '+' represents the cascade connection and the '×' represents the parallel connection. The density is estimated on the test set and the sampling time is the time cost for sampling 10000 samples.

4.4 Comparison between Different Generative Boosting Methods

We compare our boosting framework, where meta-models are connected in cascade, with multiplicative boosting, where meta-models are connected in parallel, and with hybrid boosting, where meta-models are connected in both cascade and parallel. The results are given in Table 3. Hybrid boosting produces the strongest model, but its time cost for density estimation and sampling is high, since the structure includes parallel connections. The cascade connection allows quick density estimation and sampling, but the boosted model is not as strong as the hybrid one. It is also worth noting that the parallel connection of one VAE and one GMM produces a bad model, since the learning power of a GMM is too weak for the MNIST dataset and the training time of a GMM is long for high-dimensional data.

5 Related Work

Deep Belief Networks.

Our work is inspired by DBNs [Hinton et al.2006]. A DBN has a multi-layer structure whose basic components are RBMs [Smolensky1986]. During training, each RBM is learned separately and stacked on top of the current structure. It is a classical example of a boosted generative model, since a group of RBMs is connected to produce a stronger model. Our decomposable variational lower bound reveals the principle behind the training algorithm of DBNs: since for DBNs $m_k(x_{k-1})$ is initialized from $m_{k-1}(x_{k-1})$, we can make sure that Equation 7 will never be negative and the lower bound will never decrease.

Deep Latent Gaussian Models.

DLGMs are deep directed graphical models with multiple layers of hidden variables [Burda et al.2015, Rezende et al.2014]. The distribution of the hidden variables in one layer conditioned on the hidden variables in the layer above is Gaussian. Rezende et al. [Rezende et al.2014] introduce an approximate posterior distribution which factorises across layers, while Burda et al. [Burda et al.2015] introduce an approximate posterior distribution which is a directed chain. When we restrict our meta-models to VAEs, we derive the same variational lower bound as Burda et al. [Burda et al.2015]. The difference is that Burda et al. optimize the lower bound as a whole, whereas our work optimizes the lower bound greedily and layer-wise.

Other methods of boosting generative models.

Methods of boosting generative models have been explored. Previous work can be divided into two categories: sum-of-experts [Figueiredo and Jain2002, Rosset and Segal2003, Tolstikhin et al.2017], which takes the arithmetic average of meta-models’ distributions, and product-of-experts [Hinton2002, Grover and Ermon2018], which takes the geometric average of meta-models’ distributions.

6 Conclusion

We propose a meta-algorithmic framework for boosting generative models by connecting meta-models in cascade. Any hidden-variable meta-model can be incorporated, as long as it supports likelihood evaluation. The decomposable lower bound allows us to train these meta-models separately and greedily. Our framework can be extended to semi-supervised learning and can be integrated with multiplicative boosting. In our experiments, we first validate the effectiveness of our boosting method via density estimation and evaluation of the generated samples, and then further promote the performance of some advanced models, which represent state-of-the-art methods. Finally, we compare different generative boosting methods, validating the ability of hybrid boosting to further improve the learning power of generative models.

References

  • [Abbasnejad et al.2017] M Ehsan Abbasnejad, Anthony Dick, and Anton van den Hengel. Infinite variational autoencoder for semi-supervised learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 781–790. IEEE, 2017.
  • [Bengio et al.2007] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. Greedy layer-wise training of deep networks. In Advances in neural information processing systems, pages 153–160, 2007.
  • [Burda et al.2015] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
  • [Dempster et al.1977] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
  • [Figueiredo and Jain2002] Mario A. T. Figueiredo and Anil K. Jain. Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis & Machine Intelligence, (3):381–396, 2002.
  • [Goodfellow et al.2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [Grosse2017] Roger Grosse. Lecture 15: Exploding and vanishing gradients. University of Toronto Computer Science, 2017.
  • [Grover and Ermon2018] Aditya Grover and Stefano Ermon. Boosted generative models. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [Hastings1970] W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
  • [Hinton et al.2006] Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
  • [Hinton1999] Geoffrey E Hinton. Products of experts. In 1999 Ninth International Conference on Artificial Neural Networks ICANN 99 (Conf. Publ. No. 470), volume 1, pages 1–6. IET, 1999.
  • [Hinton2002] Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002.
  • [Hochreiter et al.2001] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.
  • [Kingma and Welling2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2013.
  • [Kingma et al.2014] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pages 3581–3589, 2014.
  • [LeCun and Cortes2010] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
  • [Li et al.2018] Chongxuan Li, Jun Zhu, and Bo Zhang. Max-margin deep generative models for (semi-) supervised learning. IEEE transactions on pattern analysis and machine intelligence, 40(11):2762–2775, 2018.
  • [Liu et al.2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
  • [Radford et al.2015] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [Reynolds2015] Douglas Reynolds. Gaussian mixture models. Encyclopedia of biometrics, pages 827–832, 2015.
  • [Rezende et al.2014] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
  • [Rosset and Segal2003] Saharon Rosset and Eran Segal. Boosting density estimation. In Advances in neural information processing systems, pages 657–664, 2003.
  • [Rubinstein and Kroese2016] Reuven Y Rubinstein and Dirk P Kroese. Simulation and the Monte Carlo method, volume 10. John Wiley & Sons, 2016.
  • [Smolensky1986] Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, University of Colorado at Boulder, Department of Computer Science, 1986.
  • [Sønderby et al.2016] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in neural information processing systems, pages 3738–3746, 2016.
  • [Tolstikhin et al.2017] Ilya O Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard Schölkopf. Adagan: Boosting generative models. In Advances in neural information processing systems, pages 5424–5433, 2017.
  • [Xu and Jordan1996] Lei Xu and Michael I Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neural computation, 8(1):129–151, 1996.