Joint Training of Variational Auto-Encoder and Latent Energy-Based Model

06/10/2020
by   Tian Han, et al.
Stevens Institute of Technology

This paper proposes a joint training method to learn both the variational auto-encoder (VAE) and the latent energy-based model (EBM). The joint training of VAE and latent EBM is based on an objective function that consists of three Kullback-Leibler divergences between three joint distributions on the latent vector and the image, and the objective function is of an elegant symmetric and anti-symmetric form of divergence triangle that seamlessly integrates variational and adversarial learning. In this joint training scheme, the latent EBM serves as a critic of the generator model, while the generator model and the inference model in VAE serve as the approximate synthesis sampler and inference sampler of the latent EBM. Our experiments show that the joint training greatly improves the synthesis quality of the VAE. It also enables learning of an energy function that is capable of detecting out-of-sample examples for anomaly detection.


1 Introduction

The variational auto-encoder (VAE) [22, 35] is a powerful method for generative modeling and unsupervised learning. It consists of a generator model that transforms a noise vector into a signal such as an image via a top-down convolutional network (also called a deconvolutional network due to its top-down nature). It also consists of an inference model that infers the latent vector from the image via a bottom-up network. The VAE has seen many applications in image and video synthesis [12, 3] and in unsupervised and semi-supervised learning [37, 23].

Despite its success, the VAE suffers from relatively weak synthesis quality compared to methods such as generative adversarial networks (GANs) [11, 34] that are based on adversarial learning. While combining the VAE objective function with the GAN objective function can improve the synthesis quality, such a combination is rather ad hoc. In this paper, we shall pursue a more systematic integration of variational learning and adversarial learning. Specifically, instead of employing a discriminator as in GANs, we recruit a latent energy-based model (EBM) that meshes with the VAE seamlessly in a joint training scheme.

The generator model in VAE is a directed model, with a known prior distribution on the latent vector, such as Gaussian white noise distribution, and a conditional distribution of the image given the latent vector. The advantage of such a model is that it can generate synthesized examples by direct ancestral sampling. The generator model defines a joint probability density of the latent vector and the image in a top-down scheme. We may call this joint density the generator density.

VAE also has an inference model, which defines the conditional distribution of the latent vector given the image. Together with the data distribution that generates the observed images, they define a joint probability density of the latent vector and the image in a bottom-up scheme. We may call this joint density the joint data density, where the latent vector may be considered as the missing data in the language of the EM algorithm [6].

As we shall explain later, the VAE amounts to joint minimization of the Kullback-Leibler divergence from the data density to the generator density, where the joint minimization is over the parameters of both the generator model and the inference model. In this minimization, the generator density seeks to cover the modes of the data density, and as a result, the generator density can be overly dispersed. This may partially explain VAE's lack of synthesis quality.

Unlike the generator network, the latent EBM is an undirected model. It defines an unnormalized joint density on the latent vector and the image via a joint energy function. Such an undirected form enables the latent EBM to better approximate the data density than the generator network. However, the maximum likelihood learning of the latent EBM requires (1) inference sampling: sampling from the conditional density of the latent vector given the observed example, and (2) synthesis sampling: sampling from the joint density of the latent vector and the image. Both inference sampling and synthesis sampling require time-consuming Markov chain Monte Carlo (MCMC).

In this paper, we propose to jointly train the VAE and the latent EBM so that these two models can borrow strength from each other. The objective function of the joint training method consists of the Kullback-Leibler divergences between three joint densities of the latent vector and the image, namely the data density, the generator density, and the latent EBM density. The three Kullback-Leibler divergences form an elegant symmetric and anti-symmetric form of divergence triangle that integrates the variational learning and the adversarial learning seamlessly.

The joint training is beneficial to both the VAE and the latent EBM. The latent EBM has a more flexible form and can approximate the data density better than the generator model. It serves as a critic of the generator model by judging it against the data density. To the generator model, the latent EBM serves as a surrogate of the data density and a target density for the generator model to approximate. The generator model and the associated inference model, in return, serve as approximate synthesis sampler and inference sampler of the latent EBM, thus relieving the latent EBM of the burden of MCMC sampling.

Our experiments show that the joint training method can learn the generator model with strong synthesis ability. It can also learn an energy function that is capable of anomaly detection.

2 Contributions and related work

The following are contributions of our work. (1) We propose a joint training method to learn both VAE and latent EBM. The objective function is of a symmetric and anti-symmetric form of divergence triangle. (2) The proposed method integrates variational and adversarial learning. (3) The proposed method integrates the research themes initiated by the Boltzmann machine and Helmholtz machine.

The following are the themes that are related to our work.

(1) Variational and adversarial learning. Over the past few years, there has been an explosion in research on variational learning and adversarial learning, inspired by VAE [22, 35, 37, 12] and GAN [11, 34, 2, 41] respectively. One aim of our work is to find a natural integration of variational and adversarial learning. We also compare with some prominent methods along this theme in our experiments. Notably, adversarially learned inference (ALI) [9, 7] combines the learning of the generator model and inference model in an adversarial framework. It can be improved by adding conditional entropy regularization as in the more recent methods ALICE [26] and SVAE [4]. Though these methods are trained using a joint discriminator on the image and the latent vector, such a discriminator is not a probability density; thus it is not a latent EBM.

(2) Helmholtz machine and Boltzmann machine. Before VAE and GAN took over, the Boltzmann machine [1, 17, 36] and the Helmholtz machine [16] were two prominent classes of models for generative modeling and unsupervised learning. The Boltzmann machine is the most prominent example of a latent EBM. Its learning consists of two phases. The positive phase samples from the conditional distribution of the latent variables given the observed example. The negative phase samples from the joint distribution of the latent variables and the image. The parameters are updated based on the differences between the statistical properties of the positive and negative phases. The Helmholtz machine can be considered a precursor of the VAE. It consists of a top-down generation model and a bottom-up recognition model. Its learning also consists of two phases. The wake phase infers the latent variables based on the recognition model and updates the parameters of the generation model. The sleep phase generates synthesized data from the generation model and updates the parameters of the recognition model. Our work seeks to integrate the two classes of models.

(3) Divergence triangle for joint training of generator network and energy-based model. The generator and energy-based model can be trained separately using maximum likelihood criteria as in [13, 33], and they can also be trained jointly as recently explored by [20, 38, 14, 24]. In particular, [14] proposes a divergence triangle criterion for joint training. Our training criterion is also in the form of a divergence triangle. However, the EBM in these papers is defined only on the image, and there is no latent vector in the EBM. In our work, we employ a latent EBM that defines a joint density on the latent vector and the image; this undirected joint density is a more natural match to the generator density and the data density, both of which are directed joint densities of the latent vector and the image.

3 Models and learning

3.1 Generator network

Let $z$ be the $d$-dimensional latent vector. Let $x$ be the $D$-dimensional signal, such as an image. VAE consists of a generator model, which defines a joint probability density

$p_\theta(z, x) = p(z) \, p_\theta(x \mid z)$   (1)

in a top-down direction, where $p(z)$ is the known prior distribution of the latent vector $z$, such as a uniform distribution or Gaussian white noise, i.e., $p(z) = {\rm N}(0, I_d)$, where $I_d$ is the $d$-dimensional identity matrix. $p_\theta(x \mid z)$ is the conditional distribution of $x$ given $z$. A typical form of $p_\theta(x \mid z)$ is such that $x = g_\theta(z) + \epsilon$, where $g_\theta$ is parametrized by a top-down convolutional network (also called a deconvolutional network due to the top-down direction), with $\theta$ collecting all the weight and bias terms of the network. $\epsilon$ is the residual noise, and it is usually assumed that $\epsilon \sim {\rm N}(0, \sigma^2 I_D)$.

The generator network is a directed model. We call $p_\theta(z, x)$ the generator density. $p_\theta(z, x)$ can be sampled directly by first sampling $z \sim p(z)$ and then sampling $x \sim p_\theta(x \mid z)$ given $z$. This is sometimes called ancestral sampling in the literature [30].

The marginal distribution is $p_\theta(x) = \int p_\theta(z, x) \, dz$. It is not in closed form. Thus the generator network is sometimes called an implicit generative model in the literature.

The inference of $z$ can be based on the posterior distribution of $z$ given $x$, i.e., $p_\theta(z \mid x) = p_\theta(z, x) / p_\theta(x)$, which is not in closed form either.
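As a concrete illustration of the top-down network $g_\theta$, the following is a minimal PyTorch sketch of a DCGAN-style generator for 32 × 32 color images; the latent dimension $d = 128$, the layer widths, and the class name `Generator` are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Illustrative top-down (deconvolutional) network g_theta: z in R^d -> x in R^{3x32x32}."""
    def __init__(self, d=128, ngf=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(d, ngf * 4, 4, 1, 0), nn.BatchNorm2d(ngf * 4), nn.ReLU(True),        # 4x4
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1), nn.BatchNorm2d(ngf * 2), nn.ReLU(True),  # 8x8
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1), nn.BatchNorm2d(ngf), nn.ReLU(True),          # 16x16
            nn.ConvTranspose2d(ngf, 3, 4, 2, 1), nn.Tanh(),                                         # 32x32
        )

    def forward(self, z):
        # Deterministic part of ancestral sampling: x = g_theta(z) (Gaussian noise may be added).
        return self.net(z.view(z.size(0), -1, 1, 1))
```

With such a module, ancestral sampling is simply `x = Generator()(torch.randn(n, 128))`.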

3.2 Inference model

VAE assumes an inference model $q_\phi(z \mid x)$ with a separate set of parameters $\phi$. One example of $q_\phi(z \mid x)$ is ${\rm N}(\mu_\phi(x), V_\phi(x))$, where $\mu_\phi(x)$ is the $d$-dimensional mean vector and $V_\phi(x)$ is the $d$-dimensional diagonal variance-covariance matrix. Both $\mu_\phi(x)$ and $V_\phi(x)$ can be parametrized by bottom-up convolutional networks, whose parameters are denoted by $\phi$. The inference model $q_\phi(z \mid x)$ is a closed-form approximation to the true posterior $p_\theta(z \mid x)$.
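A matching bottom-up sketch of the inference model, again illustrative (the names `Encoder` and `sample_posterior`, the latent dimension, and the layer sizes are assumptions), outputting $\mu_\phi(x)$ and the log of the diagonal of $V_\phi(x)$:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Illustrative bottom-up network for q_phi(z|x) = N(mu_phi(x), diag(V_phi(x)))."""
    def __init__(self, d=128, ndf=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, ndf, 4, 2, 1), nn.LeakyReLU(0.2, True),            # 16x16
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1), nn.LeakyReLU(0.2, True),      # 8x8
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1), nn.LeakyReLU(0.2, True),  # 4x4
        )
        self.mu = nn.Linear(ndf * 4 * 4 * 4, d)
        self.logvar = nn.Linear(ndf * 4 * 4 * 4, d)  # log of the diagonal variance V_phi(x)

    def forward(self, x):
        h = self.features(x).flatten(1)
        return self.mu(h), self.logvar(h)

def sample_posterior(mu, logvar):
    """Reparametrized sample z = mu + V^{1/2} e, e ~ N(0, I), so gradients reach phi."""
    return mu + (0.5 * logvar).exp() * torch.randn_like(mu)
```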

3.3 Data density

Let $q_{\rm data}(x)$ be the distribution that generates the observed images. In practice, the expectation with respect to $q_{\rm data}(x)$ can be approximated by the average over the observed training examples.

The reason we use the notation $q$ for the data distribution is that $q_{\rm data}(x)$ can be naturally combined with the inference model $q_\phi(z \mid x)$, so that we have the joint density

$q_\phi(z, x) = q_{\rm data}(x) \, q_\phi(z \mid x).$   (2)

The above $q_\phi(z, x)$ is also a directed density in that it is factorized in a bottom-up direction. We may call this joint density the data density, where in the terminology of the EM algorithm, we may consider $z$ as the missing data and $q_\phi(z \mid x)$ as the imputation model of the missing data.

3.4 VAE

The top-down generator density $p_\theta(z, x)$ and the bottom-up data density $q_\phi(z, x)$ form a natural pair. As noted by [14], VAE can be viewed as the following joint minimization:

$\min_\theta \min_\phi {\rm KL}(q_\phi(z, x) \,\|\, p_\theta(z, x)),$   (3)

where for two densities $q$ and $p$ in general, ${\rm KL}(q \,\|\, p) = {\rm E}_q[\log(q/p)]$ is the Kullback-Leibler divergence between $q$ and $p$.

To connect the above joint minimization to the usual form of VAE,

${\rm KL}(q_\phi(z, x) \,\|\, p_\theta(z, x)) = {\rm KL}(q_{\rm data}(x) \,\|\, p_\theta(x)) + {\rm KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x))$   (4)
$= {\rm E}_{q_{\rm data}(x)}[\log q_{\rm data}(x)] - {\rm E}_{q_{\rm data}(x)}\left[{\rm E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - {\rm KL}(q_\phi(z \mid x) \,\|\, p(z))\right],$   (5)

where for two joint densities $q(z, x)$ and $p(z, x)$, we define ${\rm KL}(q(z \mid x) \,\|\, p(z \mid x)) = {\rm E}_{q(z, x)}[\log(q(z \mid x)/p(z \mid x))]$, i.e., the conditional KL-divergence averaged over the marginal $q(x)$.

Since ${\rm E}_{q_{\rm data}(x)}[\log q_{\rm data}(x)]$ does not depend on $(\theta, \phi)$, the joint minimization problem is equivalent to the joint maximization of

$L(\theta, \phi) = {\rm E}_{q_{\rm data}(x)}\left[{\rm E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - {\rm KL}(q_\phi(z \mid x) \,\|\, p(z))\right],$   (6)

which is the lower bound of the log-likelihood used in VAE [22].
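For reference, a minimal sketch of the negative lower bound in Eqn. 6 for the Gaussian case above; the function name `negative_elbo` and the fixed noise level `sigma` are illustrative assumptions, and `x_recon` stands for $g_\theta(z)$ evaluated at a reparametrized posterior sample.

```python
import torch

def negative_elbo(x, x_recon, mu, logvar, sigma=1.0):
    """Negative of Eqn. 6, averaged over a batch, for Gaussian p_theta(x|z) and
    q_phi(z|x) = N(mu, diag(exp(logvar))); terms constant in (theta, phi) are dropped."""
    recon = ((x - x_recon) ** 2).sum(dim=(1, 2, 3)) / (2 * sigma ** 2)   # -log p_theta(x|z) + const
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=1)      # KL(q_phi(z|x) || N(0, I))
    return (recon + kl).mean()
```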

It is worth noting that the wake-sleep algorithm [16] for training the Helmholtz machine consists of (1) a wake phase: $\min_\theta {\rm KL}(q_\phi(z, x) \,\|\, p_\theta(z, x))$, and (2) a sleep phase: $\min_\phi {\rm KL}(p_\theta(z, x) \,\|\, q_\phi(z, x))$. The sleep phase reverses the order of the KL-divergence.

3.5 Latent EBM

Unlike the directed joint densities $p_\theta(z, x)$ of the generator network and $q_\phi(z, x)$ of the data density, the latent EBM defines an undirected joint density, albeit an unnormalized one:

$\pi_\alpha(z, x) = \frac{1}{Z(\alpha)} \exp\left[-U_\alpha(z, x)\right],$   (7)

where $U_\alpha(z, x)$ is the energy function (a term originating from statistical physics) defined on the image $x$ and the latent vector $z$, and $Z(\alpha)$ is the normalizing constant. $Z(\alpha)$ is usually intractable, so $\exp[-U_\alpha(z, x)]$ is an unnormalized density. The most prominent example of latent EBM is the Boltzmann machine [1, 17], where $U_\alpha$ consists of pairwise potentials. In our work, we first encode $x$ into a vector, concatenate this vector with the vector $z$, and then obtain $U_\alpha(z, x)$ by a network defined on the concatenated vector.
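The joint energy just described (encode $x$, concatenate with $z$, map to a scalar) could look like the following sketch; the class name `JointEnergy`, the channel widths, and the 32 × 32 input size are assumptions for illustration.

```python
import torch
import torch.nn as nn

class JointEnergy(nn.Module):
    """Illustrative joint energy U_alpha(z, x): conv-encode x, concatenate with z, output a scalar."""
    def __init__(self, d=128, nef=64):
        super().__init__()
        self.encode_x = nn.Sequential(
            nn.Conv2d(3, nef, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(nef, nef * 2, 4, 2, 1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(nef * 2, nef * 4, 4, 2, 1), nn.LeakyReLU(0.2, True),
        )
        self.head = nn.Sequential(
            nn.Linear(nef * 4 * 4 * 4 + d, 256), nn.LeakyReLU(0.2, True),
            nn.Linear(256, 1),
        )

    def forward(self, z, x):
        h = self.encode_x(x).flatten(1)                     # image features
        return self.head(torch.cat([h, z], 1)).squeeze(1)   # one energy value per example
```

The unnormalized joint density is then proportional to exp(-U(z, x)).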

3.6 Inference and synthesis sampling

Let $\pi_\alpha(x) = \int \pi_\alpha(z, x) \, dz$ be the marginal density of the latent EBM. The maximum likelihood learning of $\alpha$ is based on $\min_\alpha {\rm KL}(q_{\rm data}(x) \,\|\, \pi_\alpha(x))$, because minimizing this KL-divergence is equivalent to maximizing the log-likelihood ${\rm E}_{q_{\rm data}(x)}[\log \pi_\alpha(x)]$. The learning gradient is

$-\frac{\partial}{\partial \alpha} {\rm KL}(q_{\rm data}(x) \,\|\, \pi_\alpha(x)) = -{\rm E}_{q_{\rm data}(x) \pi_\alpha(z \mid x)}\left[\frac{\partial}{\partial \alpha} U_\alpha(z, x)\right] + {\rm E}_{\pi_\alpha(z, x)}\left[\frac{\partial}{\partial \alpha} U_\alpha(z, x)\right].$   (8)

This is a well-known result in latent EBM [1, 25]. On the right-hand side of the above equation, the two expectations can be approximated by Monte Carlo sampling. For each observed image, sampling from $\pi_\alpha(z \mid x)$ is to infer $z$ from $x$. We call it the inference sampling. In the literature, it is called the positive phase [1, 17]. It is also called clamped sampling, where $x$ is an observed image and is fixed. Sampling from $\pi_\alpha(z, x)$ is to generate synthesized examples from the model. We call it the synthesis sampling. In the literature, it is called the negative phase. It is also called unclamped sampling, where $x$ is also generated from the model.
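To make the MCMC burden concrete, here is a minimal sketch of what the two sampling phases would require if the latent EBM were trained by maximum likelihood alone: short-run Langevin dynamics on the joint energy, clamped (inference sampling, update $z$ only) or unclamped (synthesis sampling, update both $z$ and $x$). The step size and number of steps are illustrative assumptions; the joint training in Section 4 is designed to avoid this loop.

```python
import torch

def langevin(U, z, x, step=0.1, n_steps=30, update_x=False):
    """Langevin sampling from pi_alpha proportional to exp(-U_alpha(z, x)).
    update_x=False: clamped / inference sampling of pi_alpha(z|x).
    update_x=True:  unclamped / synthesis sampling of pi_alpha(z, x)."""
    z = z.clone().detach().requires_grad_(True)
    x = x.clone().detach().requires_grad_(update_x)
    for _ in range(n_steps):
        energy = U(z, x).sum()
        grads = torch.autograd.grad(energy, [z, x] if update_x else [z])
        z.data += -0.5 * step ** 2 * grads[0] + step * torch.randn_like(z)
        if update_x:
            x.data += -0.5 * step ** 2 * grads[1] + step * torch.randn_like(x)
    return z.detach(), x.detach()
```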

4 Joint training

4.1 Objective function of joint training

We have the following three joint densities.

(1) The generator density $p_\theta(z, x) = p(z) \, p_\theta(x \mid z)$.

(2) The data density $q_\phi(z, x) = q_{\rm data}(x) \, q_\phi(z \mid x)$.

(3) The latent EBM density $\pi_\alpha(z, x)$.

We propose to learn the generator model parametrized by $\theta$, the inference model parametrized by $\phi$, and the latent EBM parametrized by $\alpha$ by the following divergence triangle:

$\min_\theta \min_\phi \max_\alpha D(\theta, \phi, \alpha),$   (9)
$D = {\rm KL}(q_\phi \,\|\, p_\theta) + {\rm KL}(p_\theta \,\|\, \pi_\alpha) - {\rm KL}(q_\phi \,\|\, \pi_\alpha),$   (10)

where all the densities $q_\phi$, $p_\theta$ and $\pi_\alpha$ are joint densities of $(z, x)$.

The above objective function is in a symmetric and anti-symmetric form. The anti-symmetry is caused by the negative sign in front of ${\rm KL}(q_\phi \,\|\, \pi_\alpha)$ and the maximization over $\alpha$.

4.2 Learning of latent EBM

For learning the latent EBM, $\max_\alpha D$ is equivalent to minimizing

${\rm KL}(q_\phi \,\|\, \pi_\alpha) - {\rm KL}(p_\theta \,\|\, \pi_\alpha)$   (11)

over $\alpha$. In the above minimization, $\pi_\alpha$ seeks to get close to the data density $q_\phi$ and to get away from $p_\theta$. Thus $\pi_\alpha$ serves as a critic of $p_\theta$ by comparing it against $q_\phi$. Because of the undirected form of $\pi_\alpha$, it can be more flexible than the directed $p_\theta$ in approximating $q_\phi$.

The gradient for the above minimization is

$-\frac{\partial}{\partial \alpha}\left[{\rm KL}(q_\phi \,\|\, \pi_\alpha) - {\rm KL}(p_\theta \,\|\, \pi_\alpha)\right] = -{\rm E}_{q_\phi(z, x)}\left[\frac{\partial}{\partial \alpha} U_\alpha(z, x)\right] + {\rm E}_{p_\theta(z, x)}\left[\frac{\partial}{\partial \alpha} U_\alpha(z, x)\right].$   (12)

Comparing Eqn. 12 to Eqn. 8, we replace $\pi_\alpha(z \mid x)$ by $q_\phi(z \mid x)$ in the inference sampling of the positive phase, and we replace $\pi_\alpha(z, x)$ by $p_\theta(z, x)$ in the synthesis sampling of the negative phase. Both $q_\phi(z \mid x)$ and $p_\theta(z, x)$ can be sampled directly. Thus the joint training enables the latent EBM to avoid MCMC in both inference sampling and synthesis sampling. In other words, the inference model serves as an approximate inference sampler for the latent EBM, and the generator network serves as an approximate synthesis sampler for the latent EBM.
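In code, the update of $\alpha$ amounts to a contrastive loss between inference samples and synthesis samples. The following sketch assumes an energy module with signature `U(z, x)` (e.g., the `JointEnergy` sketch above); minimizing it by gradient descent follows the descent direction in Eqn. 12, up to terms constant in $\alpha$.

```python
def ebm_loss(U, x_data, z_inf, x_syn, z_syn):
    """KL(q_phi || pi_alpha) - KL(p_theta || pi_alpha) up to constants in alpha:
    mean energy of inference samples (positive phase) minus mean energy of synthesis
    samples (negative phase). Samples are detached so only alpha receives gradients."""
    pos = U(z_inf.detach(), x_data).mean()            # inference sampling replaces pi_alpha(z|x)
    neg = U(z_syn.detach(), x_syn.detach()).mean()    # synthesis sampling replaces pi_alpha(z, x)
    return pos - neg
```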

4.3 Learning of generator network

For learning the generator network, $\min_\theta D$ is equivalent to minimizing

${\rm KL}(q_\phi \,\|\, p_\theta) + {\rm KL}(p_\theta \,\|\, \pi_\alpha)$   (13)

over $\theta$, where the gradient can be computed as

$\frac{\partial}{\partial \theta}\left[{\rm KL}(q_\phi \,\|\, p_\theta) + {\rm KL}(p_\theta \,\|\, \pi_\alpha)\right] = -{\rm E}_{q_\phi(z, x)}\left[\frac{\partial}{\partial \theta} \log p_\theta(x \mid z)\right] + \frac{\partial}{\partial \theta} {\rm E}_{p_\theta(z, x)}\left[\log \frac{p_\theta(z, x)}{\pi_\alpha(z, x)}\right],$   (14)

where the second term can be computed by reparametrizing $x = g_\theta(z) + \epsilon$, $z \sim p(z)$.

In ${\rm KL}(q_\phi \,\|\, p_\theta)$, $p_\theta$ appears on the right-hand side of the KL-divergence. Minimizing this KL-divergence with respect to $\theta$ requires $p_\theta$ to cover all the major modes of $q_\phi$. If $p_\theta$ is not flexible enough, it will strain itself to cover all the major modes, and as a result, it will become over-dispersed relative to $q_\phi$. This may be the reason that VAE tends to suffer in synthesis quality.

However, in the second term, ${\rm KL}(p_\theta \,\|\, \pi_\alpha)$, $p_\theta$ appears on the left-hand side of the KL-divergence, and $\pi_\alpha$, which seeks to get close to $q_\phi$ and away from $p_\theta$ in its dynamics, serves as a surrogate for the data density $q_\phi$ and a target for $p_\theta$. Because $p_\theta$ appears on the left-hand side of the KL-divergence, it has mode-chasing behavior, i.e., it may chase some major modes of $\pi_\alpha$ (the surrogate of $q_\phi$), while it does not need to cover all the modes. Also note that in ${\rm KL}(p_\theta \,\|\, \pi_\alpha)$, we do not need to know the normalizing constant $Z(\alpha)$, because it is a constant as far as $\theta$ is concerned.

Combining the above two KL-divergences, we approximately minimize a symmetrized version of the KL-divergence (assuming $\pi_\alpha$ is close to $q_\phi$). This will correct the over-dispersion of VAE and improve its synthesis quality.

We refer the reader to the textbooks [10, 31] on the difference between ${\rm KL}(q \,\|\, p)$ and ${\rm KL}(p \,\|\, q)$. In the literature, they are also referred to as the inclusive and exclusive KL, or the KL and reverse KL.
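A corresponding sketch of the generator objective, dropping terms that are constant in $\theta$: the first KL term becomes the usual Gaussian reconstruction loss on inference samples, and the second term becomes the energy of the synthesized examples (cf. Eqn. 14 and Eqn. 21 below). The function name and the fixed `sigma` are illustrative assumptions; `g` and `U` follow the earlier sketches.

```python
def generator_loss(g, U, x_data, z_inf, z_prior, sigma=1.0):
    """KL(q_phi || p_theta) + KL(p_theta || pi_alpha) up to additive constants in theta."""
    recon = ((x_data - g(z_inf.detach())) ** 2).sum(dim=(1, 2, 3)).mean() / (2 * sigma ** 2)
    x_syn = g(z_prior)                 # synthesis sample, a function of theta
    energy = U(z_prior, x_syn).mean()  # pulls p_theta toward the critic pi_alpha
    return recon + energy
```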

4.4 An adversarial chasing game

The dynamics of $\pi_\alpha$ is that it seeks to get close to the data density $q_\phi$ and to get away from $p_\theta$. The dynamics of $p_\theta$ is that it seeks to get close to $\pi_\alpha$ (and at the same time also get close to the data density $q_\phi$). This defines an adversarial chasing game: $\pi_\alpha$ runs toward $q_\phi$ and away from $p_\theta$, while $p_\theta$ chases $\pi_\alpha$. As a result, $\pi_\alpha$ leads $p_\theta$ toward $q_\phi$. $p_\theta$ and $\pi_\alpha$ form an actor-critic pair.

4.5 Learning of inference model

The learning of the inference model $q_\phi(z \mid x)$ can be based on $\min_\phi D$, which is equivalent to minimizing

${\rm KL}(q_\phi \,\|\, p_\theta) - {\rm KL}(q_\phi \,\|\, \pi_\alpha)$   (15)

over $\phi$. $q_\phi$ seeks to be close to $p_\theta$ relative to $\pi_\alpha$. That is, $q_\phi$ seeks to be the inference model for $p_\theta$. Meanwhile, $p_\theta$ seeks to be close to $\pi_\alpha$. This is also a chasing game: $p_\theta$ leads $q_\phi$ to be close to $\pi_\alpha$.

The gradient of Eqn. 15 can be readily computed as

$\frac{\partial}{\partial \phi}\left[{\rm KL}(q_\phi \,\|\, p_\theta) - {\rm KL}(q_\phi \,\|\, \pi_\alpha)\right] = \frac{\partial}{\partial \phi} {\rm E}_{q_{\rm data}(x) q_\phi(z \mid x)}\left[\log \frac{\pi_\alpha(z, x)}{p_\theta(z, x)}\right],$   (16)

where the entropy terms of $q_\phi$ cancel, the normalizing constant $Z(\alpha)$ does not depend on $\phi$, and the expectation over $q_\phi(z \mid x)$ can be computed via the reparametrization trick.

We may also learn $\phi$ by minimizing

${\rm KL}(q_\phi \,\|\, p_\theta) + {\rm KL}(q_\phi \,\|\, \pi_\alpha),$   (17)

where we let $q_\phi$ be close to both $p_\theta$ and $\pi_\alpha$ in variational approximation.
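A sketch of the inference-model objective in Eqn. 15/16, using the cancellation noted above: the loss is the sample average of $\log \pi_\alpha - \log p_\theta$ over reparametrized inference samples. The function name and the fixed `sigma` are assumptions; `g` and `U` follow the earlier sketches.

```python
import torch

def inference_loss(g, U, x_data, mu, logvar, sigma=1.0):
    """KL(q_phi || p_theta) - KL(q_phi || pi_alpha) = E_{q_phi}[log pi_alpha - log p_theta],
    up to log Z(alpha), which is constant in phi. Gradients reach phi through (mu, logvar)."""
    z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)   # reparametrized inference sample
    log_pi = -U(z, x_data)                                 # log pi_alpha(z, x) + log Z(alpha)
    log_p = (-0.5 * z.pow(2).sum(dim=1)                    # log p(z) + const
             - ((x_data - g(z)) ** 2).sum(dim=(1, 2, 3)) / (2 * sigma ** 2))  # log p_theta(x|z) + const
    return (log_pi - log_p).mean()
```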

4.6 Algorithm

Require: training images $\{x_i\}$; number of learning iterations $T$; initialized network parameters $(\theta_0, \phi_0, \alpha_0)$.
Ensure: estimated parameters $(\theta, \phi, \alpha)$; generated samples $\{\hat{x}_i\}$.
1:  Let $t = 0$.
2:  repeat
3:     Synthesis sampling for $(\hat{z}_i, \hat{x}_i)$ using Eqn. 18.
4:     Inference sampling for $(z_i, x_i)$ using Eqn. 18.
5:     Learn latent EBM: given $\{(z_i, x_i)\}$ and $\{(\hat{z}_i, \hat{x}_i)\}$, update $\alpha$ using Eqn. 19 with learning rate $\eta_\alpha$.
6:     Learn inference model: given $\{(z_i, x_i)\}$, update $\phi$ with learning rate $\eta_\phi$ using Eqn. 20.
7:     Learn generator network: given $\{(z_i, x_i)\}$ and $\{(\hat{z}_i, \hat{x}_i)\}$, update $\theta$ with learning rate $\eta_\theta$ using Eqn. 21.
8:     Let $t = t + 1$.
9:  until $t = T$
Algorithm 1 Joint Training of VAE and latent EBM

The latent EBM, generator and inference model can be jointly trained using stochastic gradient descent based on Eqn. 12, Eqn. 14 and Eqn. 16. In practice, we use sample averages to approximate the expectations.

Synthesis and inference sampling. The expectations for the gradient computations are with respect to the generator density $p_\theta(z, x)$ and the data density $q_\phi(z, x)$. To approximate the expectation under the generator density $p_\theta(z, x)$, we perform synthesis sampling through $\hat{z}_i \sim p(z)$, $\hat{x}_i \sim p_\theta(x \mid \hat{z}_i)$ to get samples $\{(\hat{z}_i, \hat{x}_i)\}$. To approximate the expectation under the data density $q_\phi(z, x)$, we perform inference sampling through $x_i \sim q_{\rm data}(x)$, $z_i \sim q_\phi(z \mid x_i)$ to get samples $\{(z_i, x_i)\}$. Both $p_\theta(x \mid z)$ and $q_\phi(z \mid x)$ are assumed to be Gaussian, therefore we have:

$\hat{z}_i \sim {\rm N}(0, I_d), \quad \hat{x}_i = g_\theta(\hat{z}_i); \qquad x_i \sim q_{\rm data}(x), \quad z_i = \mu_\phi(x_i) + V_\phi(x_i)^{1/2} e_i, \quad e_i \sim {\rm N}(0, I_d),$   (18)

where $g_\theta$ is the top-down deconvolutional network for the generator model (see Sec. 3.1), and $\mu_\phi$ and $V_\phi$ are the bottom-up convolutional networks for the mean vector and the diagonal variance-covariance matrix of the inference model (see Sec. 3.2). We follow the common practice [11] of obtaining $\hat{x}_i$ directly from the generator network, i.e., $\hat{x}_i = g_\theta(\hat{z}_i)$. Note that the synthesis sample $\hat{x}_i$ and the inference sample $z_i$ are functions of the generator parameters $\theta$ and the inference parameters $\phi$ respectively, which ensures gradient back-propagation.

Model learning. The obtained synthesis samples $\{(\hat{z}_i, \hat{x}_i)\}$ and inference samples $\{(z_i, x_i)\}$ can be used to approximate the expectations in model learning. Specifically, for latent EBM learning, the gradient in Eqn. 12 can be approximated by:

$-\frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial \alpha} U_\alpha(z_i, x_i) + \frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial \alpha} U_\alpha(\hat{z}_i, \hat{x}_i).$   (19)

For the inference model, the gradient in Eqn. 16 can be approximated by:

$\frac{\partial}{\partial \phi} \frac{1}{n} \sum_{i=1}^n \left[\log \pi_\alpha(z_i(\phi), x_i) - \log p_\theta(z_i(\phi), x_i)\right],$   (20)

where $z_i(\phi)$ is the reparametrized inference sample in Eqn. 18, and the normalizing constant $Z(\alpha)$ can be dropped because it does not depend on $\phi$.

For the generator model, the gradient in Eqn. 14 can be approximated by:

$-\frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial \theta} \log p_\theta(x_i \mid z_i) + \frac{1}{n} \sum_{i=1}^n \frac{\partial}{\partial \theta} U_\alpha(\hat{z}_i, \hat{x}_i(\theta)),$   (21)

where $\hat{x}_i(\theta) = g_\theta(\hat{z}_i)$ is the reparametrized synthesis sample in Eqn. 18.

Notice that the gradients in Eqn. 20 and Eqn. 21 on the synthesis samples and inference samples can be easily back-propagated using Eqn. 18. The detailed training procedure is presented in Algorithm 1.
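Putting the pieces together, one iteration of Algorithm 1 could be sketched as follows, reusing the illustrative `Generator`, `Encoder` and `JointEnergy` modules and the `ebm_loss`, `inference_loss` and `generator_loss` functions sketched earlier; the three separate optimizers and the latent dimension are assumptions.

```python
import torch

def training_step(x, g, enc, U, opt_alpha, opt_phi, opt_theta, d=128):
    """One iteration of Algorithm 1 (steps 3-7), as a sketch."""
    # Steps 3-4: synthesis and inference sampling (Eqn. 18).
    z_syn = torch.randn(x.size(0), d, device=x.device)
    x_syn = g(z_syn)
    mu, logvar = enc(x)
    z_inf = mu + (0.5 * logvar).exp() * torch.randn_like(mu)

    # Step 5: learn the latent EBM (Eqn. 19).
    opt_alpha.zero_grad()
    ebm_loss(U, x, z_inf, x_syn, z_syn).backward()
    opt_alpha.step()

    # Step 6: learn the inference model (Eqn. 20).
    opt_phi.zero_grad()
    inference_loss(g, U, x, mu, logvar).backward()
    opt_phi.step()

    # Step 7: learn the generator network (Eqn. 21).
    opt_theta.zero_grad()
    generator_loss(g, U, x, z_inf, z_syn).backward()
    opt_theta.step()
```

Here `opt_alpha`, `opt_phi` and `opt_theta` are assumed to be optimizers over the parameters of `U`, `enc` and `g` respectively, so each backward pass only updates the intended model.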

5 Experiments

In this section, we evaluate the proposed model on four tasks: image generation, test image reconstruction, out-of-distribution generalization, and anomaly detection. The learning of the inference model is based on Eqn. 15; we also test the alternative way of training the inference model using Eqn. 17 for generation and reconstruction. We mainly consider four datasets: CIFAR-10, CelebA [27], the Large-scale Scene Understanding (LSUN) dataset [39], and MNIST. We describe the datasets in more detail in the relevant subsections below. All training images are resized and scaled, with no further pre-processing. All network parameters are initialized with a zero-mean Gaussian with standard deviation 0.02 and optimized using Adam [21]. We adopt a deconvolutional network structure similar to [34] for the generator model and the "mirror" convolutional structure for the inference model. Both structures involve batch normalization [19]. For the joint energy model $U_\alpha(z, x)$, we use multiple convolutional layers to transform the observation $x$, then concatenate the result with the latent vector $z$ at a higher layer, similar to [26]. Spectral normalization is used as suggested in [29]. We refer to our implementation (https://hthth0801.github.io/jointLearning/) for details.

5.1 Image generation

In this experiment, we evaluate the visual quality of the generated samples. A well-learned generator network should generate samples that are realistic and share visual similarities with the training data. We mainly consider three commonly used datasets, CIFAR-10, CelebA [27] and LSUN [39], for generation and reconstruction evaluation. CIFAR-10 contains 60,000 color images of size 32 × 32, of which 50,000 are for training and 10,000 are for testing. For the CelebA dataset, we resize the images and randomly select 10,000 of them, of which 9,000 are for training and 1,000 are for testing. For the LSUN dataset, we select the bedroom category, which contains roughly 3 million images, and resize them. We separate 10,000 images for testing and use the rest for training. The qualitative results are shown in Figure 1.

Figure 1: Generated samples. Left: CIFAR-10 generation. Middle: CelebA generation. Right: LSUN bedroom generation.

We further evaluate our model quantitatively using the Fréchet Inception Distance (FID) [28] in Table 1. We compare with baseline models including VAE [22], DCGAN [34], WGAN [2], CoopNet [38], ALICE [26], SVAE [4] and SNGAN [29]. The FID scores are from the relevant papers; for missing evaluations, we re-evaluate the models using their released code, or re-implement them using similar structures and the optimal parameters indicated in their papers. From Table 1, our model achieves competitive generation performance compared to the listed baseline models. Further, compared to [14], which reports an Inception Score (IS) of 7.23 on CIFAR-10 and an FID of 31.9 on CelebA, our model obtains an IS of 7.17 and an FID of 24.7 respectively. This shows that the joint training can greatly improve the synthesis quality compared to VAE alone. Note that SNGAN [29] obtains better generation on CIFAR-10, which has a relatively small resolution, while on the other datasets, which have higher resolution and more diverse patterns, our model obtains more favorable results and more stable training.

Model VAE DCGAN WGAN CoopNet ALICE SVAE SNGAN Ours(+) Ours
CIFAR-10 109.5 37.7 40.2 33.6 48.6 43.5 29.3 33.3 30.1
CelebA 99.09 38.4 36.4 56.6 46.1 40.7 50.4 29.5 24.7
LSUN 175.2 70.4 67.7 35.4 72 - 67.8 31.4 27.3
Table 1: Sample quality evaluation using FID scores on various datasets. Ours(+) denotes our proposed method with inference model trained using Eqn.17.

5.2 Testing image reconstruction

In this experiment, we evaluate the accuracy of the learned inference model by reconstructing test images. A well-trained inference model should not only help to learn the latent EBM but also learn to match the true posterior of the generator model. Therefore, in practice, a well-learned inference model should render both realistic generation, as shown in the previous section, and faithful reconstruction of testing images.

We evaluate the model on hold-out testing sets of CIFAR-10, CelebA and LSUN-bedroom. Specifically, we use the 10,000 testing images of CIFAR-10, and the 1,000 and 10,000 hold-out testing images of CelebA and LSUN-bedroom respectively. The testing images and the corresponding reconstructions are shown in Figure 2. We also quantitatively compare with baseline models (ALI [9], ALICE [26], SVAE [4]) using the Root Mean Square Error (RMSE). Note that for this experiment, we only compare with the relevant baseline models that contain a joint discriminator on $(z, x)$ and achieve decent generation quality. We do not consider GANs and their variants because they involve no inference model and are therefore infeasible for image reconstruction. Table 2 shows the results. VAE is naturally integrated into our probabilistic model for joint learning; however, using VAE alone can be extremely ineffective on complex datasets. Our model instead achieves both high generation quality and accurate reconstruction.

Figure 2: Test image reconstruction. Top: CIFAR-10. Bottom: CelebA. Left: test images. Right: reconstructed images.
Model CIFAR-10 CelebA LSUN-bedroom
VAE 0.192 0.197 0.164
ALI 0.558 0.720 -
ALICE 0.185 0.214 0.181
SVAE 0.258 0.209 -
Ours(+) 0.184 0.208 0.169
Ours 0.177 0.190 0.169
Table 2: Testing image reconstruction evaluation using RMSE. Ours(+) denotes our proposed method with inference model trained using Eqn.17.
Figure 3: Histogram of log likelihood (unnormalized) for various datasets. We provide the histogram comparison between CIFAR-10 test set and CelebA, Uniform Random, SVHN, Texture and CIFAR-10 train set respectively.

5.3 Out-of-distribution generalization

In this experiment, we evaluate out-of-distribution (OOD) detection using the learned latent EBM $\pi_\alpha(z, x)$. If the energy model is well learned, then a training image, together with its inferred latent factor, should form a local energy minimum. Unseen images from distributions other than the training one should be assigned relatively higher energies. This is closely related to the model of associative memory as observed by Hopfield [18].

We learn the proposed model on CIFAR-10 training images, then utilize the learned energy model to separate the CIFAR-10 testing images from other OOD images using the energy value (i.e., the unnormalized negative log-likelihood). We use the area under the ROC curve (AUROC) as our OOD metric following [15], and we use Textures [5], uniform noise, SVHN [32] and CelebA images as OOD distributions (Figure 4 provides CIFAR-10 test images and examples of OOD images). We compare with ALICE [26], SVAE [4] and the recent EBM [8] as baseline models. The CIFAR-10 training for ALICE and SVAE follows their optimal networks and hyperparameters, and the scores for EBM are taken directly from [8]. Table 3 shows the AUROC scores. We also provide histograms of relative likelihoods for the OOD distributions in Figure 3, which further verify that images from OOD distributions are assigned relatively low log-likelihood (i.e., high energy) compared to the training distribution. Our latent EBM learns to assign low energy to the training distribution and high energy to data from OOD distributions.
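The OOD (and, later, anomaly) score can be computed directly from the learned networks: infer the latent vector with the inference model and evaluate the joint energy. A minimal sketch, assuming the illustrative `Encoder` and `JointEnergy` modules above and that the posterior mean is used as the inferred latent (an assumption; one could also average over posterior samples):

```python
import torch

def energy_score(x, enc, U):
    """Energy-based OOD / anomaly score: higher energy = lower unnormalized
    log-likelihood = more likely to be out of distribution."""
    with torch.no_grad():
        mu, _ = enc(x)      # inferred latent (posterior mean)
        return U(mu, x)     # one score per example
```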

Figure 4: Illustration of images from CIFAR-10 test, SVHN, Uniform Random, Texture and CelebA. The last four are considered to be OOD distributions.
Model SVHN Uniform Texture CelebA
EBM 0.63 1.0 0.48 -
ALICE 0.29 0.0 0.40 0.48
SVAE 0.42 0.29 0.5 0.52
Ours 0.68 1.0 0.56 0.56
Table 3: AUROC scores of OOD classification on various images datasets. All models are learned on CIFAR-10 train set.

5.4 Anomaly detection

In this experiment, we take a closer and more general view of the learned latent EBM through its application to anomaly detection. Unsupervised anomaly detection is one of the most important problems in machine learning and offers great potential in many areas, including cyber-security, medical analysis and surveillance. It is similar to the out-of-distribution detection discussed above, but can be more challenging in practice because the anomalous data may come from a distribution that is similar to, rather than entirely apart from, the training distribution. We evaluate our model on the MNIST benchmark dataset.

MNIST. The dataset contains 60,000 gray-scale images of size 28 × 28 depicting handwritten digits. Following the same experimental setting as [24, 40], we make each digit class the anomaly in turn and consider the remaining 9 digits as normal examples. Our model is trained with only normal data and tested with both normal and anomalous data. We use the energy function $U_\alpha(z, x)$ as our decision function and compare with the BiGAN-based anomaly detection model [40], the recent MEG [24] and the VAE model, using the area under the precision-recall curve (AUPRC) as in [40]. Table 4 shows the results.

Holdout VAE MEG BiGAN-σ Ours
1 0.063 0.281 ± 0.035 0.287 ± 0.023 0.297 ± 0.033
4 0.337 0.401 ± 0.061 0.443 ± 0.029 0.723 ± 0.042
5 0.325 0.402 ± 0.062 0.514 ± 0.029 0.676 ± 0.041
7 0.148 0.290 ± 0.040 0.347 ± 0.017 0.490 ± 0.041
9 0.104 0.342 ± 0.034 0.307 ± 0.028 0.383 ± 0.025
Table 4: AUPRC scores for unsupervised anomaly detection on MNIST. Numbers are taken from [24], and results for our model are averaged over the last 10 epochs to account for variance.

6 Conclusion

This paper proposes a joint training method to learn both the VAE and the latent EBM simultaneously, where the VAE serves as an actor and the latent EBM serves as its critic. The objective function is of a simple and compact form of divergence triangle that involves three KL-divergences between three joint densities on the latent vector and the image. This objective function integrates both variational learning and adversarial learning. Our experiments show that the joint training improves the synthesis quality of VAE, and it learns a reasonable energy function that is capable of anomaly detection.

Learning a well-formed energy landscape remains a challenging problem, and our experience suggests that the learned energy function can be sensitive to the setting of hyper-parameters within the training algorithm. In future work, we shall further improve the learning of the energy function. We shall also explore joint training of models with multiple layers of latent variables in the style of the Helmholtz machine and the Boltzmann machine.

Acknowledgment

The work is supported by DARPA XAI project N66001-17-2-4029, ARO project W911NF1810296, ONR MURI project N00014-16-1-2007, and Extreme Science and Engineering Discovery Environment (XSEDE) grant ASC170063.

References

  • [1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski (1985) A learning algorithm for boltzmann machines. Cognitive science 9 (1), pp. 147–169. Cited by: §2, §3.5, §3.6.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223. Cited by: §2, §5.1.
  • [3] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine (2017) Stochastic variational video prediction. arXiv preprint arXiv:1710.11252. Cited by: §1.
  • [4] L. Chen, S. Dai, Y. Pu, E. Zhou, C. Li, Q. Su, C. Chen, and L. Carin (2018)

    Symmetric variational autoencoder and connections to adversarial learning

    .
    In

    International Conference on Artificial Intelligence and Statistics

    ,
    pp. 661–669. Cited by: §2, §5.1, §5.2, §5.3.
  • [5] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi (2014) Describing textures in the wild. In

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    ,
    pp. 3606–3613. Cited by: §5.3.
  • [6] A. P. Dempster, N. M. Laird, and D. B. Rubin (1977) Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society. Series B (methodological), pp. 1–38. Cited by: §1.
  • [7] J. Donahue, P. Krähenbühl, and T. Darrell (2016) Adversarial feature learning. arXiv preprint arXiv:1605.09782. Cited by: §2.
  • [8] Y. Du and I. Mordatch (2019) Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689. Cited by: §5.3.
  • [9] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville (2016) Adversarially learned inference. arXiv preprint arXiv:1606.00704. Cited by: §2, §5.2.
  • [10] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT press. Cited by: §4.3.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1, §2, §4.6.
  • [12] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra (2015)

    Draw: a recurrent neural network for image generation

    .
    arXiv preprint arXiv:1502.04623. Cited by: §1, §2.
  • [13] T. Han, Y. Lu, S. Zhu, and Y. N. Wu (2017) Alternating back-propagation for generator network.. In AAAI, Vol. 3, pp. 13. Cited by: §2.
  • [14] T. Han, E. Nijkamp, X. Fang, M. Hill, S. Zhu, and Y. N. Wu (2019) Divergence triangle for joint training of generator model, energy-based model, and inferential model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8670–8679. Cited by: §2, §3.4, §5.1.
  • [15] D. Hendrycks and K. Gimpel (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136. Cited by: §5.3.
  • [16] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal (1995) The "wake-sleep" algorithm for unsupervised neural networks. Science 268 (5214), pp. 1158–1161. Cited by: §2, §3.4.
  • [17] G. E. Hinton, S. Osindero, and Y. Teh (2006) A fast learning algorithm for deep belief nets. Neural computation 18 (7), pp. 1527–1554. Cited by: §2, §3.5, §3.6.
  • [18] J. J. Hopfield (1982) Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79 (8), pp. 2554–2558. Cited by: §5.3.
  • [19] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §5.
  • [20] T. Kim and Y. Bengio (2016) Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439. Cited by: §2.
  • [21] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §5.
  • [22] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §2, §3.4, §5.1.
  • [23] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling (2014) Semi-supervised learning with deep generative models. In Advances in neural information processing systems, pp. 3581–3589. Cited by: §1.
  • [24] R. Kumar, A. Goyal, A. Courville, and Y. Bengio (2019) Maximum entropy generators for energy-based models. arXiv preprint arXiv:1901.08508. Cited by: §2, §5.4, Table 4.
  • [25] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang (2006) A tutorial on energy-based learning. Predicting structured data 1 (0). Cited by: §3.6.
  • [26] C. Li, H. Liu, C. Chen, Y. Pu, L. Chen, R. Henao, and L. Carin (2017) Alice: towards understanding adversarial learning for joint distribution matching. In Advances in Neural Information Processing Systems, pp. 5495–5503. Cited by: §2, §5.1, §5.2, §5.3, §5.
  • [27] Z. Liu, P. Luo, X. Wang, and X. Tang (2015-12) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), Cited by: §5.1, §5.
  • [28] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet (2017) Are gans created equal? a large-scale study. arXiv preprint arXiv:1711.10337. Cited by: §5.1.
  • [29] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: §5.1, §5.
  • [30] S. Mohamed and B. Lakshminarayanan (2016) Learning in implicit generative models. arXiv preprint arXiv:1610.03483. Cited by: §3.1.
  • [31] K. P. Murphy (2012) Machine learning: a probabilistic perspective. MIT press. Cited by: §4.3.
  • [32] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §5.3.
  • [33] E. Nijkamp, M. Hill, T. Han, S. Zhu, and Y. N. Wu (2019) On the anatomy of mcmc-based maximum likelihood learning of energy-based models. arXiv preprint arXiv:1903.12370. Cited by: §2.
  • [34] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §1, §2, §5.1, §5.
  • [35] D. J. Rezende, S. Mohamed, and D. Wierstra (2014)

    Stochastic backpropagation and approximate inference in deep generative models

    .
    arXiv preprint arXiv:1401.4082. Cited by: §1, §2.
  • [36] R. Salakhutdinov and H. Larochelle (2010) Efficient learning of deep boltzmann machines. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 693–700. Cited by: §2.
  • [37] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther (2016) Ladder variational autoencoders. In Advances in neural information processing systems, pp. 3738–3746. Cited by: §1, §2.
  • [38] J. Xie, Y. Lu, R. Gao, S. Zhu, and Y. N. Wu (2016) Cooperative training of descriptor and generator networks. arXiv preprint arXiv:1609.09408. Cited by: §2, §5.1.
  • [39] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao (2015) Lsun: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365. Cited by: §5.1, §5.
  • [40] H. Zenati, C. S. Foo, B. Lecouat, G. Manek, and V. R. Chandrasekhar (2018) Efficient gan-based anomaly detection. arXiv preprint arXiv:1802.06222. Cited by: §5.4.
  • [41] J. Zhao, M. Mathieu, and Y. LeCun (2016) Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126. Cited by: §2.