Generative model learning is the task of learning a model that generates artificial samples following the underlying probability density function of a given dataset. When the dataset consists of scalars or low-dimensional (2-3 dimensional) vectors and follows a unimodal distribution, one can use a simple density model such as the multivariate Gaussian and fit it to the data using maximum likelihood. Unfortunately, such simple densities do not have sufficient expressive power to learn the distribution of more complicated data such as natural images or audio, because such data are high dimensional and multi-modal.
There exist several generative model learning methods in the machine learning literature. One way of approaching the problem is to use a linear latent variable model (LVM) such as a mixture model, a latent factor model such as probabilistic PCA, a hidden Markov model (HMM), or a linear dynamical system [4, 20]. These models can successfully capture the multi-modality or low-rank nature of datasets; however, they rely on linear and tractable forward mappings, and therefore lack the expressive power of modern neural network models.
More recently, the mainstream approaches for learning generative models of complicated datasets have centered on models that combine latent variable modeling with non-linear neural network mappings. One prominent example is the Variational Autoencoder (VAE). VAEs consider a latent variable model where the latent representation is mapped to the observation space via a complicated neural network. They are trained with a variational expectation maximization algorithm, which maximizes a variational lower bound on the maximum likelihood objective. The prior distribution is typically chosen as a simple distribution so that the KL-divergence term in the lower bound is tractable. In this paper we argue that using a simple prior distribution is detrimental to the overall quality of the learned generative model.
Another very popular method that also uses a restricted latent representation is the Generative Adversarial Network (GAN). The main conceptual difference between GANs and typical latent variable models (including VAEs) is that GANs are an implicit generative model learning methodology, where the model distribution is defined without specifying an output density. More importantly, unlike LVMs, GANs do not maximize the standard maximum likelihood objective. Instead, GANs approximate the underlying dataset density via an additional discriminator network. Although this is an appealing idea, GANs are hard to train (as evidenced by the large number of papers on GAN training in the last few years) and suffer from the predictable mode collapse problem (we discuss this further in the main text).
In this paper, we propose an implicit generative model learning method that is trained by maximum likelihood. Unlike GANs, the method does not rely on auxiliary networks such as discriminator or critic networks. For training, we propose a simple two-stage method that optimizes a maximum likelihood objective and therefore does not suffer from the mode collapse problem that GANs are notorious for.
2 Generative Model Learning
The purpose of this section is to set the notation and the required concepts before we formally introduce our algorithm. As we discussed in the introduction, the goal in generative model learning is to approximate the underlying data density $p_{\text{data}}(x)$ with the density that our model implies, which we denote by $p_\theta(x)$, where $\theta$ denotes the model parameters. Maximum likelihood training minimizes the Kullback-Leibler (KL) divergence between the data density and the model density:
$$\arg\min_\theta \mathrm{KL}\left(p_{\text{data}}(x) \,\|\, p_\theta(x)\right) = \arg\max_\theta \int p_{\text{data}}(x) \log p_\theta(x)\, dx \approx \arg\max_\theta \frac{1}{N} \sum_{n=1}^{N} \log p_\theta(x_n) \quad (1)$$
where the last step is a Monte Carlo approximation to the integral, and we recognize Equation (1) as the maximum likelihood objective. Note that $x$ denotes a variable in the observation space, and we use the subscripted version $x_n$ to denote the data item with index $n$.
It is usually not tractable to compute the likelihood function unless we work with very simple models. In LVMs, Jensen's inequality is used to obtain a lower bound on the maximum likelihood objective:
$$\log p_\theta(x) = \log \int p_\theta(x, h)\, dh \ge \mathbb{E}_{q(h)}\left[\log p_\theta(x, h) - \log q(h)\right] \quad (2)$$
where Equation (2) is known as the variational lower bound, or ELBO, $h$ denotes the latent variable, and $q(h)$ denotes the variational distribution over the latent variable. In linear LVMs with tree-structured latent variables (e.g. mixture models, HMMs), we can use the posterior $p_\theta(h \mid x)$ as the variational distribution, because the posterior makes this bound tight [15, 4].
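The tightness of the bound under the exact posterior can be checked numerically. The following is a minimal NumPy sketch on a two-component univariate Gaussian mixture; the mixture parameters, the observation, and the helper names `log_gauss` and `elbo` are assumptions chosen for illustration only.

```python
import numpy as np

def log_gauss(x, mu, var):
    """Log density of a univariate Gaussian, evaluated elementwise."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def elbo(x, q, weights, mus, variances):
    """Variational lower bound: sum_k q_k [log p(x, h=k) - log q_k]."""
    log_joint = np.log(weights) + log_gauss(x, mus, variances)
    return float(np.sum(q * (log_joint - np.log(q))))

# Hypothetical two-component mixture and a single observation.
weights = np.array([0.3, 0.7])
mus = np.array([-2.0, 1.0])
variances = np.array([0.5, 1.5])
x = 0.4

log_joint = np.log(weights) + log_gauss(x, mus, variances)
log_lik = float(np.logaddexp.reduce(log_joint))  # exact log p(x)
posterior = np.exp(log_joint - log_lik)          # p(h = k | x)

# With the posterior as q the bound is tight; any other q gives a lower value.
elbo_posterior = elbo(x, posterior, weights, mus, variances)
elbo_uniform = elbo(x, np.array([0.5, 0.5]), weights, mus, variances)
print(log_lik, elbo_posterior, elbo_uniform)
```

The gap between `log_lik` and `elbo_uniform` is the KL divergence between the chosen variational distribution and the posterior, which vanishes only when the two coincide.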
In the general situation where the forward mapping is non-linear, such that $x \sim p_{\text{out}}(x \mid f_\theta(h))$, where $f_\theta(\cdot)$ is the nonlinear deterministic mapping and $p_{\text{out}}$ is the employed noise model, computing the posterior distribution is not analytically tractable in general. VAEs therefore use a neural network mapping for the variational distribution, $q(h \mid x) = \mathcal{N}\left(h; \mu(x), \mathrm{diag}(\sigma^2(x))\right)$, where $\mathcal{N}$ denotes the Normal distribution and the neural network mappings $\mu(\cdot)$ and $\sigma(\cdot)$ parametrize the variational distribution.
Although the likelihood computation in VAEs is intractable and requires the variational EM algorithm described above, we argue in this paper that the main failure mode of VAEs is caused by the simplistic choice of the prior $p(h)$, as we demonstrate in the experiments section.
Another popular way to learn generative models is via GANs. GANs are implicit generative models, and therefore they do not employ an output distribution $p_{\text{out}}$. Namely, the data generation mechanism is defined as follows:
$$h \sim p(h), \qquad x = f_\theta(h) \quad (3)$$
where we call $p(h)$ the base distribution, typically chosen as a simplistic distribution such as an isotropic Gaussian, and $f_\theta(\cdot)$ is a deterministic forward mapping similar to the one we have denoted for VAEs above. GANs therefore do not employ an output distribution, but rather define $p_\theta(x)$ via a deterministic transformation of the base distribution $p(h)$.
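In this paper, we also argue that one of the reasons why GANs might underperform is the simplistic choice of base distribution. In addition, GANs complicate the model parameter optimization by introducing a discriminator network. GANs in their original formulation approximate the ratio between the data density $p_{\text{data}}(x)$ and the model density $p_\theta(x)$ through a discriminator $D_\xi$:
$$\max_\xi \; \frac{1}{N} \sum_{n} \log D_\xi(x_n) + \frac{1}{N} \sum_{n} \log\left(1 - D_\xi(\hat{x}_n)\right)$$
$$= \frac{1}{N} \sum_{n} \log \frac{p_{\text{data}}(x_n)}{p_{\text{data}}(x_n) + p_\theta(x_n)} + \frac{1}{N} \sum_{n} \log \frac{p_\theta(\hat{x}_n)}{p_{\text{data}}(\hat{x}_n) + p_\theta(\hat{x}_n)} \quad (4)$$
where $x_n$ denotes the training instances, and $\hat{x}_n$ denotes samples generated from the model. The convergence to the second line (which can be recognized as the Monte Carlo estimate of the Jensen-Shannon divergence, up to constants) can be easily seen by maximizing the objective with respect to the discriminator parameters $\xi$. The big conceptual problem with GANs is that the optimization step for the generator parameters causes mode collapse. This can be easily seen by examining the corresponding loss function. The original paper suggests the maximization of the following objective:
$$\max_\theta \; \frac{1}{N} \sum_{n} \log D_\xi(\hat{x}_n) = \max_\theta \; \frac{1}{N} \sum_{n} \log \frac{p_{\text{data}}(\hat{x}_n)}{p_{\text{data}}(\hat{x}_n) + p_\theta(\hat{x}_n)}$$
where we assumed that the discriminator is trained until convergence. We can see that the objective in the last equation has a mode-seeking (zero-forcing) behavior, similar to $\mathrm{KL}\left(p_\theta(x) \,\|\, p_{\text{data}}(x)\right)$. In practice, therefore, the discriminator is not trained until convergence, and various heuristics are used to deal with mode collapse.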
There exist several other variants of GANs that use other divergences, or that are based on approximate optimal transport metrics [3, 23]. Some approaches use a GAN ensemble to approximate the whole density.
In this paper, we propose a much simpler approach, which optimizes a maximum likelihood objective using an implicit density model. The optimization does not involve an additional discriminator, and the approach does not suffer from mode collapse since it maximizes a maximum likelihood objective.
We would also like to point out that there is recent work on generative model learning that performs maximum likelihood for implicit models with certain types of invertible mappings such as convolutions. However, that work does not consider general mappings as we do. In addition, in this paper we advocate using multi-modal distributions in the latent space.
3 Learning in Implicit Generative Models
We know from probability theory that in an implicit generative model as defined in Equation (3), the output probability density is related to the base distribution via the cumulative distribution function:
$$p_{\theta,\psi}(x) = \frac{\partial}{\partial x_1} \cdots \frac{\partial}{\partial x_D} \int_{\{h \,:\, f_\theta(h) \le x\}} p_\psi(h)\, dh \quad (5)$$
where note that the base distribution is parametrized by $\psi$. The integral in Equation (5) is not tractable in general; however, if we have an invertible mapping $f_\theta$, we can obtain an analytical expression for the density function of the model using the following formula:
$$p_{\theta,\psi}(x) = p_\psi\left(f_\theta^{-1}(x)\right) \left|\det J_{f_\theta^{-1}}(x)\right| \quad (6)$$
where $J_{f_\theta^{-1}}(x)$ is the Jacobian of the inverse mapping, whose determinant measures the volume change due to the transformation. It is possible to construct exactly invertible mappings using typical neural network operations such as matrix multiplications and convolutions. Constraining the forward mapping to be exactly invertible is restrictive, however, mainly because invertibility only holds for transformations that do not change the dimensionality. In Section 3.2 we describe an algorithm that maximizes the model likelihood for general mappings for which we also have an approximate inverse.
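As a sanity check of the change-of-variables density for an invertible mapping, the following NumPy sketch uses a linear map $x = Wh + b$ with a standard normal base distribution, for which the resulting density is also known in closed form as a Gaussian with mean $b$ and covariance $WW^\top$. The matrix, offset, evaluation point and function names are assumptions for illustration.

```python
import numpy as np

def log_std_normal(h):
    """Log density of a standard normal base distribution at h."""
    return -0.5 * (h.size * np.log(2 * np.pi) + h @ h)

def log_density_change_of_variables(x, W, b):
    """log p(x) for x = W h + b, h ~ N(0, I), via the inverse map and volume term."""
    h = np.linalg.solve(W, x - b)           # inverse mapping f^{-1}(x)
    _, logabsdet = np.linalg.slogdet(W)     # log |det W|
    return log_std_normal(h) - logabsdet    # base log-density minus log-volume change

def log_gaussian(x, mean, cov):
    """Reference: log density of N(mean, cov) at x."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (x.size * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

# Hypothetical invertible 2x2 mapping and evaluation point.
W = np.array([[2.0, 0.5], [-1.0, 1.5]])
b = np.array([0.3, -0.2])
x = np.array([1.0, 0.7])

lp_change_of_variables = log_density_change_of_variables(x, W, b)
lp_reference = log_gaussian(x, b, W @ W.T)
print(lp_change_of_variables, lp_reference)
```

Both values agree up to floating point error, which illustrates that the volume term is what keeps the transformed density normalized.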
3.1 Maximum Likelihood for Implicit Generative Models
If we work with invertible forward mappings, the optimization problem for maximum likelihood in an implicit generative model is the following:
$$\max_{\theta, \psi} \; \sum_{n=1}^{N} \left[ \log p_\psi\left(f_\theta^{-1}(x_n)\right) + \log \left|\det J_{f_\theta^{-1}}(x_n)\right| \right] \quad (8)$$
where the first term can be interpreted as maximizing the likelihood of the mappings in the base distribution space, and the volume term ensures that the distribution is properly normalized. From a sampling perspective, in order to generate plausible samples, the maximum likelihood objective tries to match the samples from the base distribution with the observations mapped to the base distribution space, $f_\theta^{-1}(x_n)$.
Note that in GANs, only the forward mapping parameters are optimized, and the base distribution is fixed to be a simple unimodal distribution. Optimizing both the forward mapping parameters and a multi-modal base distribution constitutes the main idea of our paper. We argue that mapping a multimodal dataset onto a unimodal base distribution is harder to achieve than fitting a multimodal distribution on the mapped observations. We demonstrate this in Figure 1. Using an invertible linear forward mapping, we show on a two-dimensional mixture of Gaussians example that, if we do maximum likelihood on the objective in Equation (8), we fail to map the observations to the samples drawn from a fixed isotropic base distribution. However, as shown in Figure 1(b), if we set the base distribution to be a flexible distribution such as a mixture of Gaussians and learn its parameters, we are able to learn a much more accurate distribution. We also show that if we train the same mapping using the standard GAN formulation, we get the mode collapse behavior, where only one of the Gaussians is captured in the learned distribution.
We acknowledge that in cases where the forward mapping has the same dimensionality in its domain and range (such as the example in Figure 1), learning an implicit generative model by maximizing Equation (8) is pointless, because we could just as well have fitted a mixture model directly on the data. For this reason, in the next section we propose the two-stage learning algorithm, which allows the use of forward mappings that change the dimensionality.
3.2 The Two Stage Algorithm
In practice, we would typically like the base distribution to be defined on a space of lower dimensionality than the observation space. In that case, it is impossible to have an exactly invertible mapping. It is, however, possible to have an approximately invertible forward mapping. This suggests a very simple two-stage maximum likelihood algorithm: we first fit an autoencoder such that the reconstruction error is minimized. Once we are done optimizing the autoencoder, we simply fit a base distribution on the embeddings of the training data. The formal algorithm is specified in Algorithm 1.
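The two stages can be sketched in a few lines of NumPy. This is only an illustration under simplifying assumptions: PCA stands in for the autoencoder (it is the linear autoencoder with minimal squared reconstruction error), the base distribution is a diagonal-covariance Gaussian mixture fitted with a hand-written EM loop, and the synthetic data, component count, and function names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear_autoencoder(X, k):
    """Stage 1 stand-in: PCA as a linear encoder/decoder pair."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    V = Vt[:k].T                                  # (D, k) principal directions
    encode = lambda X: (X - mean) @ V
    decode = lambda H: H @ V.T + mean
    return encode, decode

def fit_gmm(H, n_components, n_iter=100):
    """Stage 2: EM for a diagonal-covariance Gaussian mixture on the embeddings."""
    n, _ = H.shape
    mus = H[rng.choice(n, n_components, replace=False)]
    variances = np.tile(H.var(axis=0), (n_components, 1))
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: responsibilities computed in the log domain for stability.
        sq = (H[:, None, :] - mus) ** 2
        log_p = (np.log(weights)
                 - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                 - 0.5 * np.sum(sq / variances, axis=2))
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)
        # M-step: re-estimate weights, means and variances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / nk.sum()
        mus = (resp.T @ H) / nk[:, None]
        sq = (H[:, None, :] - mus) ** 2
        variances = np.einsum("nc,nck->ck", resp, sq) / nk[:, None] + 1e-6
    return weights, mus, variances

def sample(weights, mus, variances, decode, n):
    """Draw latents from the fitted base distribution and decode them."""
    comps = rng.choice(len(weights), size=n, p=weights)
    H = mus[comps] + np.sqrt(variances[comps]) * rng.standard_normal((n, mus.shape[1]))
    return decode(H)

# Synthetic 10-dimensional data lying close to a 2-dimensional subspace.
centers = np.array([[-4.0, 0.0], [4.0, 0.0], [0.0, 5.0]])
H_true = centers[rng.integers(0, 3, 600)] + 0.3 * rng.standard_normal((600, 2))
X = H_true @ rng.standard_normal((2, 10)) + 0.01 * rng.standard_normal((600, 10))

encode, decode = fit_linear_autoencoder(X, k=2)                 # stage 1
weights, mus, variances = fit_gmm(encode(X), n_components=3)    # stage 2
X_new = sample(weights, mus, variances, decode, n=5)
recon_error = float(np.mean((decode(encode(X)) - X) ** 2))
print(X_new.shape, recon_error)
```

Because the two stages are decoupled, the mixture in stage 2 could be swapped for any other density estimator without touching the encoder or decoder.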
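To see that this is a maximum likelihood algorithm, let us reconsider the likelihood function of the implicit generative model with the autoencoder:
$$\log p_{\theta,\psi}(x_n) \approx \log p_\psi\left(f_\theta^{-1}(x_n)\right) + \log \left|\det J_{f_\theta^{-1}}(x_n)\right|$$
where $f_\theta^{-1}$ now denotes the encoder (the approximate inverse of the forward mapping), and we easily see that the volume term does not depend on the base distribution parameters $\psi$. Assuming that the autoencoder learns a mapping close to the identity, we conclude that maximizing $\sum_n \log p_\psi\left(f_\theta^{-1}(x_n)\right)$ with respect to the base distribution parameters maximizes the model likelihood.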
Note that since the optimization of the forward mapping parameters and of the base distribution is decoupled, it is easy to fit a multi-modal base distribution on the embeddings. One natural choice is a mixture distribution. We demonstrate this on handwritten zero and one digits from the MNIST dataset in Figure 2. We choose the dimensionality of the latent space so as to be able to visualize the base distribution space. We use a three-component Gaussian mixture model for this example.
3.3 Learning Generative Models for Sequential Data
The framework we propose also offers the flexibility to learn distributions over sequences by simply learning a sequential distribution such as an HMM on the latent representations. The likelihood of a sequence is expressed as follows:
$$p_{\theta,\psi}(x_{1:T}) = p_\psi(h_{1:T}) \prod_{t=1}^{T} \left|\det J_{f_\theta^{-1}}(x_t)\right|$$
where a sequence is denoted as $x_{1:T} = \{x_1, \dots, x_T\}$, and thus $h_{1:T} = \{f_\theta^{-1}(x_1), \dots, f_\theta^{-1}(x_T)\}$. According to this density model, the observations are mapped to the latent space independently of each other. This suggests that we can closely follow the two-stage algorithm defined in Algorithm 1: as before, we first fit the autoencoder and obtain the latent representations. In the second stage, instead of fitting an exchangeable model such as a mixture model, we fit a base distribution that models the temporal structure of the latent space. Potential options for such a distribution include hidden Markov models (HMMs), RNNs, and convolutional models. In our audio experiments, we used HMMs with Gaussian emissions.
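To make the sequential base distribution concrete, here is a minimal NumPy sketch of an HMM with diagonal-covariance Gaussian emissions over latent vectors: ancestral sampling of a latent sequence, and the forward algorithm for the log-likelihood of a latent sequence. The number of states, latent dimensionality and all parameter values are hypothetical; in the full pipeline the sampled latents would additionally be passed through the decoder.

```python
import numpy as np

def log_emission(H, means, variances):
    """Log N(h_t; mean_s, diag(var_s)) for every time step t and state s -> (T, S)."""
    diff = H[:, None, :] - means
    return -0.5 * np.sum(np.log(2 * np.pi * variances) + diff ** 2 / variances, axis=2)

def hmm_log_likelihood(H, pi, A, means, variances):
    """Forward algorithm in the log domain: log p(h_1, ..., h_T)."""
    log_b = log_emission(H, means, variances)
    log_alpha = np.log(pi) + log_b[0]
    for t in range(1, len(H)):
        # Sum over previous states, then weight by the emission at time t.
        log_alpha = np.logaddexp.reduce(log_alpha[:, None] + np.log(A), axis=0) + log_b[t]
    return float(np.logaddexp.reduce(log_alpha))

def sample_hmm(T, pi, A, means, variances, rng):
    """Ancestral sampling of a state path and the corresponding latent sequence."""
    S, K = means.shape
    states = np.empty(T, dtype=int)
    states[0] = rng.choice(S, p=pi)
    for t in range(1, T):
        states[t] = rng.choice(S, p=A[states[t - 1]])
    H = means[states] + np.sqrt(variances[states]) * rng.standard_normal((T, K))
    return states, H

# Hypothetical 2-state HMM over 3-dimensional latent vectors.
rng = np.random.default_rng(0)
pi = np.array([0.6, 0.4])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
means = np.array([[0.0, 0.0, 0.0], [2.0, -1.0, 0.5]])
variances = np.array([[1.0, 0.5, 0.2], [0.3, 0.3, 0.3]])

states, H_seq = sample_hmm(50, pi, A, means, variances, rng)
log_lik_short = hmm_log_likelihood(H_seq[:3], pi, A, means, variances)
print(states[:10], log_lik_short)
```

The forward recursion costs a number of operations linear in the sequence length, which is what makes fitting a tree-structured base distribution on the embeddings practical.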
We experiment with the MNIST (handwritten digits) and CELEB-A (celebrity faces) datasets. We compare our algorithm (which we abbreviate as IML, for Implicit Maximum Likelihood) with VAE, standard GAN, and Wasserstein GAN. As the main quality metric, we compare likelihoods computed on a test set using a kernel density estimator (KDE).
For the MNIST dataset, we use an invertible perceptron in our approach to demonstrate that we can also compute model likelihoods on the test set using the implicit generative model density function in Equation (6). (Note that in general our framework allows non-invertible mappings: we use a general convolutional autoencoder for the CELEB-A dataset.) The invertible perceptron we use for the MNIST dataset is defined as follows:
$$x = \mathrm{isigmoid}\left(\mathrm{Linear}_{600,784}\left(\mathrm{itanh}\left(\mathrm{Linear}_{K,600}(h)\right)\right)\right)$$
where $h$ denotes the $K$-dimensional latent representation, and $\mathrm{Linear}_{\text{in},\text{out}}$ represents a linear layer (we follow the PyTorch API convention to denote the input and output dimensionalities). The invertible non-linearities are denoted by $\mathrm{itanh}$ and $\mathrm{isigmoid}$, which respectively stand for the invertible hyperbolic tangent and invertible sigmoid functions. We basically use the original non-linearity in the invertible regime, and a linear function in the saturation regimes. Namely, for the hyperbolic tangent we have the following function:
$$\mathrm{itanh}(x) = \begin{cases} \tanh(x), & |x| \le t \\ \alpha x + \mathrm{sign}(x)\, b, & |x| > t \end{cases}$$
We fix the slope $\alpha$ of the linear part, and choose the bias term $b$ and the threshold $t$ so that the function is continuous and smooth (has a continuous first derivative). The invertible sigmoid function is defined analogously, with the sigmoid in the invertible regime and linear functions in the saturation regimes.
Note that it is straightforward to derive the inverse functions once the parameters of the non-linearities are set. The inverse network is therefore obtained by applying the inverse non-linearities and the inverse linear layers in reverse order. Note that the weight and bias parameters are shared between a given forward linear layer and its inverse. To obtain the volume term, we note that the volume change due to the rectangular linear transformation in a linear layer with weight matrix $W$ is given by $\sqrt{\det(W^\top W)}$. The correction term therefore involves dividing the original pdf by this volume change (we note that the implicit model likelihood holds because the mapping is approximately invertible due to the first step of the algorithm).
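The invertible hyperbolic tangent described in this section (the original non-linearity in the invertible regime, a linear function in the saturation regimes, joined so that the function and its first derivative are continuous) can be sketched as follows. The slope value 0.1 and the function names are assumptions for illustration, not the settings used in the experiments.

```python
import numpy as np

def make_invertible_tanh(alpha=0.1):
    """Build tanh with linear tails of slope alpha, plus its exact inverse."""
    # Threshold where the derivative of tanh, 1 - tanh(t)^2, equals alpha.
    t = np.arctanh(np.sqrt(1.0 - alpha))
    y_t = np.tanh(t)
    # Bias chosen so the linear tail meets tanh at the threshold.
    bias = y_t - alpha * t

    def forward(x):
        x = np.asarray(x, dtype=float)
        linear = alpha * x + np.sign(x) * bias
        return np.where(np.abs(x) <= t, np.tanh(x), linear)

    def inverse(y):
        y = np.asarray(y, dtype=float)
        # Clip before arctanh so the unused branch never sees |y| >= 1.
        inner = np.arctanh(np.clip(y, -y_t, y_t))
        outer = (y - np.sign(y) * bias) / alpha
        return np.where(np.abs(y) <= y_t, inner, outer)

    return forward, inverse, t

forward, inverse, threshold = make_invertible_tanh(alpha=0.1)
x_grid = np.linspace(-10.0, 10.0, 2001)
round_trip_error = float(np.max(np.abs(inverse(forward(x_grid)) - x_grid)))
jump_at_threshold = float(abs(forward(threshold + 1e-9) - forward(threshold - 1e-9)))
print(threshold, round_trip_error, jump_at_threshold)
```

Unlike the plain hyperbolic tangent, this function is unbounded, so every real output has a well-defined pre-image and the inverse network never has to invert a saturated activation.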
To do objective comparisons between models, we compute kernel density estimates (KDE) on the test set: for each batch, we sample 1000 points from the trained models, and represent the learned density as the sum of kernel functions centered at these samples. We then compute the average score over the whole test set. We use Gaussian kernels with bandwidth 0.01. Namely, the KDE scores we compute for the models are defined as follows:
$$\text{KDE score} = \frac{1}{N_{\text{test}}} \sum_{n=1}^{N_{\text{test}}} \log \frac{1}{M} \sum_{m=1}^{M} \mathcal{N}\left(x_n; \hat{x}_m, \sigma^2 I\right)$$
where $x_n$ are the test instances, $\hat{x}_m$ are the $M$ samples drawn from the model, and $\sigma$ is the kernel bandwidth. Notice that for small kernel bandwidth, the above objective is tantamount to computing the nearest neighbor distance for all test instances. To get high scores from this estimator, the generated samples need to capture the diversity of the test instances. Also note that this estimator computes an estimate of $\mathbb{E}_{p_{\text{data}}(x)}\left[\log p_\theta(x)\right]$, so this metric penalizes mode collapse.
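The KDE score can be computed stably in the log domain. The sketch below is a minimal NumPy version on synthetic two-mode data; it treats the bandwidth as the standard deviation of an isotropic Gaussian kernel, which is an assumption about the convention, and the data, bandwidth value and function names are hypothetical. It also illustrates why the score penalizes mode collapse: samples that cover only one mode leave half of the test points far from every kernel center.

```python
import numpy as np

def kde_score(test_points, model_samples, bandwidth):
    """Average log-density of the test points under a Gaussian KDE built on model samples."""
    d = test_points.shape[1]
    sq_dists = np.sum((test_points[:, None, :] - model_samples[None, :, :]) ** 2, axis=2)
    log_kernels = -0.5 * sq_dists / bandwidth ** 2 - 0.5 * d * np.log(2 * np.pi * bandwidth ** 2)
    # log of the mean kernel value, computed with log-sum-exp for stability.
    log_density = np.logaddexp.reduce(log_kernels, axis=1) - np.log(model_samples.shape[0])
    return float(np.mean(log_density))

rng = np.random.default_rng(0)
modes = np.array([[-3.0, 0.0], [3.0, 0.0]])

def draw(n, mode_ids):
    """Draw points around the selected modes."""
    return modes[mode_ids] + 0.3 * rng.standard_normal((n, 2))

test_points = draw(200, rng.integers(0, 2, 200))            # both modes
samples_full = draw(1000, rng.integers(0, 2, 1000))         # model covering both modes
samples_collapsed = draw(1000, np.ones(1000, dtype=int))    # model covering one mode only

score_full = kde_score(test_points, samples_full, bandwidth=0.5)
score_collapsed = kde_score(test_points, samples_collapsed, bandwidth=0.5)
print(score_full, score_collapsed)
```

With a very small bandwidth the log-sum-exp is dominated by the closest sample, which is the nearest-neighbor behavior noted above.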
In the left panel of Figure 3, we compare the KDE scores of our two-stage algorithm, GAN, Wasserstein GAN, and VAE on the MNIST dataset. We use the standard training-test split defined in the PyTorch data utilities (60000 training instances and 10000 test instances). We try 7 different latent dimensionalities for all algorithms, ranging from 20 to 140 in increments of 20. In our algorithm, we use a GMM with 30 full-covariance components for all latent dimensionalities. We see that performance drops with increasing latent dimensionality; however, we manage to stay better than VAEs and GANs. The performance drop is expected, because the density estimation problem in the latent space gets more difficult with increasing latent dimensionality. We would like to note that it is possible to use a more complicated base distribution to compensate.
In the right panel of Figure 3, we compare the model likelihood computed with the implicit likelihood equation in (6) with the base distribution likelihood (the complete likelihood minus the Jacobian term). The purpose of this is to examine if there is a correlation between these quantities. As we pointed out before, our algorithm does not require an exactly invertible mapping, and as can be seen from the figure the base distribution likelihood is somewhat correlated with the overall model likelihood, and therefore can potentially be used as a proxy for the complete likelihood for mappings for which we don’t know how to compute the Jacobian term.
In Figure 4, we show the random nearest neighbor samples for randomly selected test instances for all four algorithms in the top panel. We see that the IML method is able to capture the diversity of the test instances well. On top of that, we see much more definition in the generated images thanks to the multi-modal base distribution that we are using. As we illustrated earlier in Figure 1, using a simplistic base distribution causes a mismatch between the mappings to the latent space and the draws from the base distribution. Due to the simplistic distributions used in VAEs and GANs, we see that these approaches tend to generate more samples that do not resemble handwritten digits. We also observe that the quality of the samples (and nearest neighbor samples) is correlated with the KDE metric.
In Figure 5, we do the same nearest neighbor sample measurement on the CELEB-A dataset. We set the latent dimensionality to 100 for all algorithms. We cropped the images using a face detector, and resized them to a fixed size in RGB space. We used 146209 such images for training, and 10000 images for testing. We see that the proposed IML algorithm has more accurate nearest neighbor samples. We see that although the VAE is able to generate less distorted samples than GAN and WGAN, its generated images contain much more distortion than those of IML, potentially because of the simplistic latent representation. The generated samples from IML contain much less distortion than those from GANs.
For all algorithms we used the Adam optimizer. As mentioned before, in the MNIST experiment we used for IML the invertible network introduced in this section. For GANs and VAE we used a standard one-hidden-layer perceptron with exactly the same sizes. Namely, the decoder maps the latent dimensions to 600 dimensions, which then get mapped to 784 dimensions (MNIST images are of size $28 \times 28$). We use the mirror-image encoder for the VAE, that is, we map 784 dimensions to 600, which then get mapped to two vectors of the latent dimensionality for the mean and variance of the posterior. For the CELEB-A dataset, we used 5-layer convolutional encoders and decoders (we used the basic DC-GAN generator architecture for all algorithms, with exactly the same parameter setting, with the only exception that for the VAE the latent representations are obtained without passing through a ReLU, in order to allow negative values, since we use an isotropic Gaussian as the prior). For W-GAN, we used the code published by the authors with the default parameter setup. For GAN and VAE, our code is based on the code provided in the PyTorch examples.
To show that our algorithm can be used to learn a generative model for sequential data, we experiment with generating speech and music in the waveform domain. In all datasets, we work with audio at an 8 kHz sampling rate. We dissect the audio into 100 ms long chunks, where consecutive chunks overlap by 50 ms, and each chunk is multiplied by a Hann window. The autoencoder learns an 80-dimensional latent representation for each chunk, which is 800 samples long. We use three-layer convolutional networks in both the encoder and the decoder, with filters of length 200 samples.
We fit an HMM to the extracted latent representations. We use 300 HMM states, where each state has a diagonal-covariance Gaussian emission model. The random samples are obtained by sampling from the fitted HMM and passing the sampled latent representations through the decoder. To reconstruct the generated chunks as an audio waveform, we follow the overlap-add procedure: we overlap consecutive generated chunks by 50 percent and add them.
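The chunking and overlap-add steps can be illustrated with a short NumPy sketch using the settings stated above (8 kHz audio, 100 ms chunks, 50 ms overlap, Hann window). The test tone, the use of the periodic form of the Hann window, and the function names are assumptions for illustration; the autoencoder and HMM are left out, so the example only shows that Hann-windowed chunks at 50 percent overlap add back to the original waveform.

```python
import numpy as np

def frame_signal(x, frame_len, hop, window):
    """Cut x into overlapping chunks and apply the window to each chunk."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] * window for i in range(n_frames)])

def overlap_add(frames, hop):
    """Place each chunk at its hop offset and sum the overlapping regions."""
    n_frames, frame_len = frames.shape
    out = np.zeros(hop * (n_frames - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + frame_len] += frame
    return out

sr = 8000                  # sampling rate in Hz
frame_len = sr // 10       # 100 ms -> 800 samples
hop = frame_len // 2       # 50 ms hop, i.e. 50 percent overlap

# Periodic Hann window: shifted copies at 50 percent overlap sum exactly to one.
window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)

t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)            # one second of a hypothetical test tone

frames = frame_signal(x, frame_len, hop, window)
y = overlap_add(frames, hop)
max_interior_error = float(np.max(np.abs(y[hop:-hop] - x[hop:-hop])))
print(frames.shape, max_interior_error)
```

Only the first and last half-chunks are attenuated, because they are covered by a single window; everywhere else the overlapping windows sum to one.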
As a speech experiment, we learn a generative model over digit utterances. We work with the free spoken digit dataset. As the training sequence, we give the model a concatenated set of digit utterances. We consider the case where the training data contains only one digit type, and the case where the training data contains all digits. In Figure 6, we show the spectrograms of generated digit utterances (this example contained all 10 digit types, and we used 1000 utterances for training) along with spectrograms of the training digit utterances. Note that the digit utterances are generated in sequences (we generate one long sequence that contains multiple digits). In the appendix, in Figure 8, we show three cases for the one-digit-only training task. We see that we are able to learn a generative model over one digit with some variety.
As the music experiment, we train a model on a 2 minute long violin piece. We downloaded the audio file for the violin etude at https://www.youtube.com/watch?v=OuSI6t54KWY. We show the spectrogram of the first 10 seconds of the piece and of our generated sequence in Figure 7. We see from the spectrogram that the model is able to learn some musical structure, although there are additional background artifacts. The generated samples for the spoken digit utterances and the generated music sequence can be downloaded and listened to at the following anonymous link: https://www.dropbox.com/sh/6mvzf9ca1wl3uej/AAAkBTdNBumU61_mnMu7epDla?dl=0 (we suggest copying and pasting the link while watching for spaces, and opening the files with VLC player if your native player does not work).
The algorithm we propose in this paper is very simple and effective. It is also principled in the sense that it performs maximum likelihood. We would like to emphasize that, compared to GANs, the performance is much less sensitive to network design choices and to training parameters such as the learning rate. In the authors' experience, GANs are extremely sensitive to such training parameters. We have observed that decoupling the training of the base distribution from the neural network mapping makes the training much easier: in our approach it suffices to pick a small enough learning rate so that the encoder converges and successfully embeds the data in a lower dimensional space.
In our experience, VAEs seem to be easier to train (much less susceptible to hyperparameter choices). However, as we have seen in the results and figures, the simplistic choice for the base distribution results in distorted outputs. In our experiments we have used relatively standard models for the latent distribution, but it is possible to use more complex methods such as Dirichlet process mixture models to obtain complicated base distributions.
- [1] Free spoken digit dataset. https://github.com/Jakobovski/free-spoken-digit-dataset. Accessed: 2018-03-09.
- [2] Volume and jacobian determinant. https://en.wikipedia.org/wiki/Determinant. Accessed: 2018-03-09.
- [3] Martín Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. CoRR, abs/1701.07875, 2017.
- [4] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
- [5] Luc Devroye. Non-Uniform Random Variate Generation. 1986.
- [6] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. CoRR, abs/1605.08803, 2016.
- [7] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
- [8] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. J. Mach. Learn. Res., 14(1):1303–1347, 2013.
- [9] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
- [10] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.
- [11] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
- [12] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
- [13] Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, UAI ’01, pages 362–369, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
- [14] S. Mohamed and B. Lakshminarayanan. Learning in Implicit Generative Models. ArXiv e-prints, October 2016.
- [15] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.
- [16] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 271–279. Curran Associates, Inc., 2016.
- [17] Alan V. Oppenheim and Ronald W. Schafer. Discrete-Time Signal Processing. Prentice Hall Press, Upper Saddle River, NJ, USA, 3rd edition, 2009.
- [18] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, pages 257–286, 1989.
- [19] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
- [20] Sam Roweis and Zoubin Ghahramani. A unifying review of linear Gaussian models. Neural Comput., 11(2):305–345, February 1999.
- [21] Yunus Saatci and Andrew G. Wilson. Bayesian GAN. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3622–3631. Curran Associates, Inc., 2017.
- [22] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2234–2242. Curran Associates, Inc., 2016.
- [23] Tim Salimans, Han Zhang, Alec Radford, and Dimitris Metaxas. Improving GANs using optimal transport. In International Conference on Learning Representations, 2018.
- [24] Michael E. Tipping and Chris M. Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society, Series B, 61:611–622, 1999.
6.1 Spectrograms of Individual Digits Utterances
Spectrograms for the single-digit-type utterances are shown in Figure 8.