1 Introduction
1.1 Integrating three models
Deep probabilistic generative models are a powerful framework for representing complex data distributions. They have been widely used in unsupervised learning problems to learn from unlabeled data. The goal of generative learning is to build rich and flexible models that fit complex, multimodal data distributions and generate samples with high realism. The family of generative models may be roughly divided into two classes: the first class is the energy-based model (a.k.a. undirected graphical model), and the second class is the latent variable model (a.k.a. directed graphical model), which usually includes a generator model for generation and an inference model for inference or reconstruction.
These models have their advantages and limitations. An energy-based model defines an explicit likelihood of the observed data up to a normalizing constant. However, sampling from such a model usually requires expensive Markov chain Monte Carlo (MCMC). A generator model supports direct sampling of the data. However, it does not have an explicit likelihood, and the inference of the latent variables requires MCMC sampling from the posterior distribution. The inference model defines an explicit approximation to the posterior distribution of the latent variables.
Combining the energy-based model, the generator model, and the inference model to get the best of each model is an attractive goal. On the other hand, challenges may accumulate when the models are trained together, since different models need to effectively compete or cooperate to achieve their best performance. In this work, we propose the divergence triangle for joint training of the energy-based model, the generator model, and the inference model. The learning of the three models can then be seamlessly integrated in a principled probabilistic framework. The energy-based model is learned based on the samples supplied by the generator model. With the help of the inference model, the generator model is trained by both the observed data and the energy-based model. The inference model is learned from both the real data fitted by the generator model and the synthesized data generated by the generator model.
Our experiments demonstrate that the divergence triangle is capable of learning an energy-based model with a well-behaved energy landscape, a generator model with highly realistic samples, and an inference model with faithful reconstruction ability.
1.2 Prior art
The divergence triangle jointly learns an energy-based model, a generator model, and an inference model. The following are previous methods for learning such models.
The maximum likelihood learning of the energy-based model requires expectation with respect to the current model, while the maximum likelihood learning of the generator model requires expectation with respect to the posterior distribution of the latent variables. Both expectations can be approximated by MCMC, such as Gibbs sampling [1], Langevin dynamics, or Hamiltonian Monte Carlo (HMC) [2]. [3, 4] used Langevin dynamics for learning the energy-based models, and [5] used Langevin dynamics for learning the generator model. In both cases, MCMC sampling introduces an inner loop in the training procedure, posing a computational expense.
An early version of the energy-based model is the FRAME (Filters, Random field, And Maximum Entropy) model [6, 7]. [8] used gradient-based methods such as Langevin dynamics to sample from the model. [9] referred to the energy-based models as descriptive models. [3, 4] generalized the model to deep versions.
For learning the energy-based model [10], to reduce the computational cost of MCMC sampling, contrastive divergence (CD) [11] initializes a finite-step MCMC from the observed data. The resulting learning algorithm follows the gradient of the difference between two Kullback-Leibler divergences, hence the name contrastive divergence. In this paper, we shall use the term "contrastive divergence" in a more general sense than [11]. Persistent contrastive divergence [12] initializes MCMC sampling from the samples of the previous learning iteration. Generalizing [13], [14] developed an introspective learning method where the energy function is discriminatively learned, and the energy-based model is both a generative model and a discriminative model.
For learning the generator model, the variational autoencoder (VAE) [15, 16, 17] approximates the posterior distribution of the latent variables by an explicit inference model. In VAE, the inference model is learned jointly with the generator model from the observed data. A precursor of VAE is the wake-sleep algorithm [18], where the inference model is learned from the dream data generated by the generator model in the sleep phase.
The generator model can also be learned jointly with a discriminator model, as in generative adversarial networks (GAN) [19] and its variants such as deep convolutional GAN (DCGAN) [20], energy-based GAN (EBGAN) [21], and Wasserstein GAN (WGAN) [22]. GAN does not involve an inference model.
The generator model can also be learned jointly with an energy-based model [23, 24]. We can interpret this learning scheme as an adversarial version of contrastive divergence. While in GAN the discriminator model eventually becomes a confused one, in the joint learning of the generator model and the energy-based model, the learned energy-based model becomes a well-defined probability distribution on the observed data. The joint learning bears some similarity to WGAN, but unlike WGAN, the joint learning involves two complementary probability distributions.
To bridge the gap between the generator model and the energy-based model, the cooperative learning method of [25] introduces finite-step MCMC sampling of the energy-based model, with the MCMC initialized from the samples generated by the generator model. Such finite-step MCMC produces synthesized examples closer to the energy-based model, and the generator model can learn from how the finite-step MCMC revises its initial samples.
Adversarially learned inference (ALI) [26, 27] combines the learning of the generator model and the inference model in an adversarial framework. ALI can be improved by adding conditional entropy regularization, resulting in the ALICE [28] model. The recently proposed method [29] shares the same spirit. These methods lack an energy-based model on the observed data.
1.3 Our contributions
Our proposed formulation, which we call the divergence triangle, reinterprets and integrates the following elements in unsupervised generative learning: (1) maximum likelihood learning, (2) variational learning, (3) adversarial learning, (4) contrastive divergence, and (5) the wake-sleep algorithm. The learning is seamlessly integrated into a probabilistic framework based on KL divergence.
We conduct extensive experiments to analyze the learned models. Energy landscape mapping is used to verify that our learned energy-based model is well-behaved. Further, we evaluate the learning of the generator model via synthesis, by generating samples with competitive fidelity, and evaluate the accuracy of the inference model both qualitatively and quantitatively via reconstruction. Our proposed model can also learn directly from incomplete images with various blocking patterns.
2 Learning deep probabilistic models
In this section, we shall review the two probabilistic models, namely the generator model and the energy-based model, both of which are parametrized by convolutional neural networks [30, 31]. Then, we shall present the maximum likelihood learning algorithms for training these two models, respectively. Our presentation of the two maximum likelihood learning algorithms is unconventional. We seek to derive both algorithms based on the Kullback-Leibler divergence using the same scheme. This will set the stage for the divergence triangle.

2.1 Generator model and energy-based model
The generator model [19, 20, 15, 16, 17] is a generalization of the factor analysis model [32],
(1) $z \sim \mathrm{N}(0, I_d), \quad x = g_\theta(z) + \epsilon, \quad \epsilon \sim \mathrm{N}(0, \sigma^2 I_D),$

where $g_\theta$ is a top-down mapping parametrized by a deep network with parameters $\theta$. It maps the $d$-dimensional latent vector $z$ to the $D$-dimensional signal $x$, and the noise $\epsilon$ is independent of $z$. In general, the model is defined by the prior distribution $p(z)$ and the conditional distribution $p_\theta(x|z)$. The complete-data model is $p_\theta(x, z) = p(z) p_\theta(x|z)$. The observed-data model is $p_\theta(x) = \int p_\theta(x, z) dz$. The posterior distribution is $p_\theta(z|x) = p_\theta(x, z)/p_\theta(x)$. See the diagram (a) below.

A complementary model is the energy-based model [33, 34, 3, 4], where $-f_\alpha(x)$ defines the energy of $x$, and a low energy is assigned a high probability. Specifically, we have the following probability model
(2) $\pi_\alpha(x) = \frac{1}{Z(\alpha)} \exp[f_\alpha(x)],$

where $f_\alpha(x)$ is parametrized by a bottom-up deep network with parameters $\alpha$, and $Z(\alpha) = \int \exp[f_\alpha(x)] dx$ is the normalizing constant. If $f_\alpha(x)$ is linear in $\alpha$, the model becomes the familiar exponential family model in statistics or the Gibbs distribution in statistical physics. We may consider $f_\alpha$ an evaluator, where $f_\alpha(x)$ assigns a value to $x$, and $\pi_\alpha$ evaluates $x$ by a normalized probability distribution. See the diagram (b) above.
The energy-based model defines an explicit log-likelihood via $\log \pi_\alpha(x) = f_\alpha(x) - \log Z(\alpha)$, even though $Z(\alpha)$ is intractable. However, it is difficult to sample from $\pi_\alpha$. The generator model can generate $x$ directly by first generating $z \sim p(z)$ and then transforming $z$ to $x$ by $g_\theta$. But it does not define an explicit log-likelihood of $x$.
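As a concrete, purely hypothetical illustration of the two models, the following minimal Python sketch performs ancestral sampling from a toy generator and evaluates the unnormalized log-density of a toy energy-based model. Here `g_theta` and `f_alpha` are simple closed-form stand-ins for the deep networks, not the networks used in the paper.

```python
import math
import random

def g_theta(z, theta):
    """Toy top-down mapping: a fixed nonlinearity of the latent vector.
    Stands in for the deep network g_theta in Eq. (1)."""
    return [math.tanh(theta * zi) for zi in z]

def sample_generator(theta, d=2, sigma=0.1, rng=None):
    """Ancestral sampling from the generator model of Eq. (1):
    z ~ N(0, I_d), x = g_theta(z) + eps, eps ~ N(0, sigma^2 I)."""
    rng = rng or random.Random(0)
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    x = [gi + rng.gauss(0.0, sigma) for gi in g_theta(z, theta)]
    return z, x

def unnormalized_log_density(x, alpha):
    """Energy-based model of Eq. (2): log pi_alpha(x) = f_alpha(x) - log Z(alpha).
    Only f_alpha is needed for learning; here f_alpha(x) = -alpha * ||x||^2
    is an assumed toy energy, so log Z(alpha) is simply omitted."""
    return -alpha * sum(xi * xi for xi in x)
```

The asymmetry discussed above is visible here: `sample_generator` is a single forward pass, while sampling from `unnormalized_log_density` would require MCMC.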
In the context of inverse reinforcement learning [35, 36] or inverse optimal control, $x$ is the action, and $-f_\alpha(x)$ defines the cost function, or $f_\alpha(x)$ defines the value function or the objective function.

2.2 Maximum likelihood learning
Let $p_{\text{data}}(x)$ be the true distribution that generates the training data. Both the generator model and the energy-based model can be learned by maximum likelihood. For large sample, maximum likelihood amounts to minimizing the Kullback-Leibler divergence $\mathrm{KL}(p_{\text{data}} \| p_\theta)$ over $\theta$, and minimizing $\mathrm{KL}(p_{\text{data}} \| \pi_\alpha)$ over $\alpha$, respectively. The expectation $\mathrm{E}_{p_{\text{data}}}$ can be approximated by the sample average.
2.2.1 EM-type learning of generator model
To learn the generator model $p_\theta$, we seek to minimize $\mathrm{KL}(p_{\text{data}}(x) \| p_\theta(x))$ over $\theta$. Suppose in an iterative algorithm, the current $\theta$ is $\theta_t$. We can fix $\theta_t$ at any place we want, and vary $\theta$ around $\theta_t$.
We can write

(3) $\mathrm{KL}(p_{\text{data}}(x) p_{\theta_t}(z|x) \| p_\theta(x, z)) = \mathrm{KL}(p_{\text{data}}(x) \| p_\theta(x)) + \mathrm{E}_{p_{\text{data}}}[\mathrm{KL}(p_{\theta_t}(z|x) \| p_\theta(z|x))].$

In the EM algorithm [37], the left-hand side is the surrogate objective function. This surrogate function is more tractable than the true objective function, because $p_{\text{data}}(x) p_{\theta_t}(z|x)$ is a distribution of the complete data $(x, z)$, and $p_\theta(x, z)$ is the complete-data model.
We can write (3) as

(4) $S_t(\theta) = K(\theta) + \mathrm{E}_{p_{\text{data}}}[\mathrm{KL}(p_{\theta_t}(z|x) \| p_\theta(z|x))],$

where $S_t(\theta)$ denotes the surrogate objective on the left-hand side of (3), and $K(\theta) = \mathrm{KL}(p_{\text{data}}(x) \| p_\theta(x))$ denotes the true objective.

The geometric picture is that the surrogate objective function $S_t(\theta)$ is above the true objective function $K(\theta)$, i.e., $S_t$ majorizes (upper bounds) $K$, and they touch each other at $\theta_t$, so that $S_t(\theta_t) = K(\theta_t)$ and $S_t'(\theta_t) = K'(\theta_t)$. The reason is that $\mathrm{KL}(p_{\theta_t}(z|x) \| p_\theta(z|x)) \geq 0$ and $\mathrm{KL}(p_{\theta_t}(z|x) \| p_{\theta_t}(z|x)) = 0$. See Figure 1.
$p_{\theta_t}(z|x)$ imputes $z$ to give us the complete data $(x, z)$. Each step of EM fits the complete-data model by minimizing the surrogate,

(5) $\theta_{t+1} = \arg\min_\theta \mathrm{KL}(p_{\text{data}}(x) p_{\theta_t}(z|x) \| p_\theta(x, z)),$

which amounts to maximizing the complete-data log-likelihood. By minimizing the surrogate, we reduce it relative to its value at $\theta_t$, and we reduce the true objective $\mathrm{KL}(p_{\text{data}} \| p_\theta)$ even more, relative to its value at $\theta_t$, because of the majorization picture.
We can also use gradient descent to update $\theta$. Because the surrogate and the true objective have equal derivatives at $\theta_t$, and we can place $\theta_t$ anywhere, we have

(6) $-\frac{\partial}{\partial\theta} \mathrm{KL}(p_{\text{data}}(x) \| p_\theta(x)) = \mathrm{E}_{p_{\text{data}}(x)}\left[\mathrm{E}_{p_\theta(z|x)}\left[\frac{\partial}{\partial\theta} \log p_\theta(x, z)\right]\right].$
To implement the above updates, we need to compute the expectation with respect to the posterior distribution $p_\theta(z|x)$. It can be approximated by MCMC such as Langevin dynamics or HMC [2]. Both require gradient computations that can be efficiently accomplished by backpropagation. The generator model has been learned with such a method in [5].
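For concreteness, here is a minimal sketch (not the paper's implementation) of Langevin dynamics targeting the posterior $p_\theta(z|x) \propto p(z)p_\theta(x|z)$. The linear-Gaussian model used to build the gradient is a hypothetical toy for which the log-joint gradient has a closed form.

```python
import math
import random

def langevin_posterior_sample(x, log_joint_grad, z0, step=0.01, n_steps=100, rng=None):
    """One chain of Langevin dynamics targeting p(z | x) ∝ p(z) p(x | z):
    z <- z + (step/2) * grad_z log p(x, z) + sqrt(step) * N(0, I).
    log_joint_grad(x, z) must return grad_z [log p(z) + log p(x|z)]."""
    rng = rng or random.Random(0)
    z = list(z0)
    for _ in range(n_steps):
        grad = log_joint_grad(x, z)
        z = [zi + 0.5 * step * gi + math.sqrt(step) * rng.gauss(0.0, 1.0)
             for zi, gi in zip(z, grad)]
    return z

def make_grad(a=1.0, s=0.5):
    """Toy linear-Gaussian model x = a*z + eps, z ~ N(0,1), eps ~ N(0, s^2)."""
    def grad(x, z):
        # grad_z log p(z) = -z ; grad_z log p(x|z) = a*(x - a*z)/s^2 (coordinate-wise)
        return [-zi + a * (xi - a * zi) / (s * s) for xi, zi in zip(x, z)]
    return grad
```

The inner loop over `n_steps` is exactly the computational expense that the divergence triangle later removes by introducing an inference model.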
2.2.2 Self-critic learning of energy-based model
To learn the energy-based model $\pi_\alpha$, we seek to minimize $\mathrm{KL}(p_{\text{data}}(x) \| \pi_\alpha(x))$ over $\alpha$. Suppose in an iterative algorithm, the current $\alpha$ is $\alpha_t$. We can fix $\alpha_t$ at any place we want, and vary $\alpha$ around $\alpha_t$.
Consider the following contrastive divergence

(7) $\mathrm{KL}(p_{\text{data}} \| \pi_\alpha) - \mathrm{KL}(\pi_{\alpha_t} \| \pi_\alpha).$
We can use the above as a surrogate function, which is more tractable than the true objective function, since the term $\log Z(\alpha)$ is canceled out. Specifically, we can write (7) as

(8) $\mathrm{KL}(p_{\text{data}} \| \pi_\alpha) - \mathrm{KL}(\pi_{\alpha_t} \| \pi_\alpha)$
(9) $= \mathrm{E}_{\pi_{\alpha_t}}[f_\alpha(x)] - \mathrm{E}_{p_{\text{data}}}[f_\alpha(x)] + \text{const},$

where const collects the terms that do not depend on $\alpha$.
The geometric picture is that the surrogate function (7) is below the true objective function $\mathrm{KL}(p_{\text{data}} \| \pi_\alpha)$, i.e., it minorizes (lower bounds) the true objective, and they touch each other at $\alpha_t$ with equal values and equal derivatives. The reason is that $\mathrm{KL}(\pi_{\alpha_t} \| \pi_\alpha) \geq 0$ and $\mathrm{KL}(\pi_{\alpha_t} \| \pi_{\alpha_t}) = 0$. See Figure 2.
Because the surrogate minorizes the true objective, we do not have an EM-like update. However, we can still use gradient descent to update $\alpha$, where the derivative of (7) is

(10) $-\frac{\partial}{\partial\alpha}\left[\mathrm{KL}(p_{\text{data}} \| \pi_\alpha) - \mathrm{KL}(\pi_{\alpha_t} \| \pi_\alpha)\right] = \mathrm{E}_{p_{\text{data}}}[f'_\alpha(x)] - \mathrm{E}_{\pi_{\alpha_t}}[f'_\alpha(x)],$

where

(11) $f'_\alpha(x) = \frac{\partial}{\partial\alpha} f_\alpha(x).$

Since we can place $\alpha_t$ anywhere, we have

(12) $-\frac{\partial}{\partial\alpha} \mathrm{KL}(p_{\text{data}} \| \pi_\alpha) = \mathrm{E}_{p_{\text{data}}}[f'_\alpha(x)] - \mathrm{E}_{\pi_\alpha}[f'_\alpha(x)].$
To implement the above update, we need to compute the expectation with respect to the current model $\pi_{\alpha_t}$. It can be approximated by MCMC such as Langevin dynamics or HMC that samples from $\pi_{\alpha_t}$, which can be efficiently implemented by gradient computation via backpropagation. The energy-based model has been trained with such a method in [3, 4].
The above learning algorithm has an adversarial interpretation. Updating $\alpha_t$ to $\alpha_{t+1}$ by following the gradient in (10), we seek to decrease the first KL-divergence while increasing the second KL-divergence; that is, we seek to shift the value function $f_\alpha$ toward the observed data and away from the synthesized data generated from the current model. In other words, the model criticizes its current version $\pi_{\alpha_t}$, i.e., the model is its own adversary or its own critic.
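The gradient in (12) is typically estimated by Monte Carlo averages over data samples and model samples. A minimal sketch, where `grad_f` is a hypothetical per-sample derivative $\partial f_\alpha(x)/\partial\alpha$ (scalar here, for simplicity):

```python
def contrastive_gradient(data_samples, model_samples, grad_f):
    """Monte Carlo estimate of Eq. (12):
    -d/d(alpha) KL(p_data || pi_alpha)
      ≈ (mean of grad_f over data) - (mean of grad_f over model samples).
    grad_f(x) returns the scalar derivative of f_alpha(x) w.r.t. alpha."""
    pos = sum(grad_f(x) for x in data_samples) / len(data_samples)
    neg = sum(grad_f(x) for x in model_samples) / len(model_samples)
    return pos - neg
```

The self-critic interpretation is visible in the two terms: the positive term pushes $f_\alpha$ up on observed data, the negative term pushes it down on the model's own samples.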
2.2.3 Similarity and difference
In both models, the surrogate and the true objective touch each other at the current parameter ($\theta_t$ or $\alpha_t$) with equal values and equal derivatives, because the extra KL-divergence term is non-negative and vanishes at the current parameter.

The difference is that in the generator model, the surrogate majorizes (upper bounds) the true objective, whereas in the energy-based model, the surrogate minorizes (lower bounds) the true objective.
In the generator model, if we replace the intractable posterior $p_\theta(z|x)$ by the inference model $q_\phi(z|x)$, we get VAE.

In the energy-based model, if we replace the intractable $\pi_{\alpha_t}$ by the generator $p_\theta$, we get adversarial contrastive divergence (ACD). The negative sign in front of the second KL-divergence is the root of the adversarial learning.
3 Divergence triangle: integrating adversarial and variational learning
In this section, we shall first present the divergence triangle, emphasizing its compact symmetric and antisymmetric form. Then, we shall show that it is a reinterpretation and integration of existing methods, in particular, VAE [15, 16, 17] and ACD [23, 24].
3.1 Loss function
Suppose we observe training examples $\{x_i, i = 1, \ldots, n\}$ drawn from the unknown data distribution $p_{\text{data}}(x)$. $\pi_\alpha(x)$ with energy function $-f_\alpha(x)$ denotes the energy-based model with parameters $\alpha$. The generator model has parameters $\theta$ and latent vector $z$. It is trivial to sample from the latent prior distribution $p(z)$, and the generative process is defined as $z \sim p(z)$, $x = g_\theta(z) + \epsilon$.
The maximum likelihood learning algorithms for both the generator model and the energy-based model require MCMC sampling. We modify the maximum likelihood KL-divergences by proposing a divergence triangle criterion, so that the two models can be learned jointly without MCMC. In addition to the generator model $p_\theta$ and the energy-based model $\pi_\alpha$, we also include an inference model in the learning scheme. Such an inference model is a key component in the variational autoencoder [15, 16, 17]. The inference model $q_\phi(z|x)$ with parameters $\phi$ maps from the data space to the latent space. In the context of EM, $q_\phi(z|x)$ can be considered an imputor that imputes the missing data $z$ to get the complete data $(x, z)$.

The three models above define joint distributions over $x$ and $z$ from different perspectives. The two marginals, i.e., the empirical data distribution $p_{\text{data}}(x)$ and the latent prior distribution $p(z)$, are known to us. The goal is to harmonize the three joint distributions so that the competition and cooperation between different loss terms improves learning. The divergence triangle involves the following three joint distributions on $(x, z)$:
$Q$ distribution: $Q(x, z) = p_{\text{data}}(x) q_\phi(z|x)$.

$P$ distribution: $P(x, z) = p(z) p_\theta(x|z)$.

$\Pi$ distribution: $\Pi(x, z) = \pi_\alpha(x) q_\phi(z|x)$.
See Figure 3 for illustration. The divergence triangle is based on the three KL-divergences between the three joint distributions on $(x, z)$:

(13) $\max_\alpha \min_\theta \min_\phi \Delta(\alpha, \theta, \phi), \quad \Delta = \mathrm{KL}(Q \| P) + \mathrm{KL}(P \| \Pi) - \mathrm{KL}(Q \| \Pi).$

It has a symmetric and antisymmetric form, where the antisymmetry is due to the negative sign in front of the last KL-divergence and the maximization over $\alpha$. The divergence triangle leads to the following dynamics between the three models: (1) $Q$ and $P$ seek to get close to each other. (2) $P$ seeks to get close to $\Pi$. (3) $\Pi$ seeks to get close to $Q$, but it seeks to get away from $P$, as indicated by the red arrow. Note that $\mathrm{KL}(P \| \Pi) - \mathrm{KL}(Q \| \Pi)$ is tractable, because $\log Z(\alpha)$ is canceled out. The effect of (2) and (3) is that $\pi_\alpha$ gets close to $p_{\text{data}}$, while inducing $p_\theta$ to get close to $p_{\text{data}}$ as well, or in other words, $p_\theta$ chases $\pi_\alpha$ toward $p_{\text{data}}$.
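The cancellation of the intractable normalizing constant can be checked directly. Writing out the two KL-divergences that involve $\Pi$, the $\log Z(\alpha)$ terms appear once with each sign and cancel (here $c$ and $c'$ collect terms that do not depend on $\alpha$):

```latex
\begin{aligned}
\mathrm{KL}(P\|\Pi)-\mathrm{KL}(Q\|\Pi)
&= -\,\mathbb{E}_{P}\!\left[\log \Pi(x,z)\right]
   +\mathbb{E}_{Q}\!\left[\log \Pi(x,z)\right] + c\\
&= -\,\mathbb{E}_{P}\!\left[f_\alpha(x)\right]+\log Z(\alpha)
   +\mathbb{E}_{Q}\!\left[f_\alpha(x)\right]-\log Z(\alpha) + c'\\
&= \mathbb{E}_{Q}\!\left[f_\alpha(x)\right]
   -\mathbb{E}_{P}\!\left[f_\alpha(x)\right] + c'.
\end{aligned}
```

Thus the maximization over $\alpha$ only requires evaluating $f_\alpha$ on data samples and generator samples, never $Z(\alpha)$.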
3.2 Unpacking the loss function
The divergence triangle integrates variational and adversarial learning methods, which are modifications of maximum likelihood.
3.2.1 Variational learning
First, $\min_\theta \min_\phi \mathrm{KL}(Q \| P)$ captures the variational autoencoder (VAE).

(14) $\mathrm{KL}(Q \| P) = \mathrm{KL}(p_{\text{data}}(x) \| p_\theta(x)) + \mathrm{E}_{p_{\text{data}}}[\mathrm{KL}(q_\phi(z|x) \| p_\theta(z|x))].$

Recall (4); if we replace the intractable $p_{\theta_t}(z|x)$ in (4) by the explicit $q_\phi(z|x)$, we get (14), so that we avoid MCMC for sampling $p_\theta(z|x)$.
We may interpret VAE as alternating projection between $Q$ and $P$. See Figure 4 for illustration. If $q_\phi(z|x) = p_\theta(z|x)$, the algorithm reduces to the EM algorithm. The wake-sleep algorithm [18] is similar to VAE, except that it updates $\phi$ by $\min_\phi \mathrm{KL}(P \| Q)$ instead of $\min_\phi \mathrm{KL}(Q \| P)$, so that the wake-sleep algorithm does not have a single objective function.
The VAE defines a cooperative game, with the dynamics that $P$ and $Q$ run toward each other.
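As a toy illustration of the VAE objective in this notation, the following sketch computes a negative ELBO for a one-dimensional generator with a Gaussian inference model. The closed-form KL term is standard; evaluating the reconstruction term at the posterior mean is an assumed deterministic simplification of the reparametrization trick, not the paper's implementation.

```python
import math

def neg_elbo_gaussian(x, mu_q, log_var_q, g, sigma):
    """Negative ELBO for one datum under a toy generator x = g(z) + N(0, sigma^2)
    with prior z ~ N(0, 1) and inference model q(z|x) = N(mu_q, exp(log_var_q)).
    Returns KL(q(z|x) || p(z)) plus a reconstruction term evaluated at the
    posterior mean (deterministic simplification; constants dropped)."""
    var_q = math.exp(log_var_q)
    # Closed-form KL(N(mu, var) || N(0, 1)).
    kl = 0.5 * (var_q + mu_q * mu_q - 1.0 - log_var_q)
    # Gaussian reconstruction error at the posterior mean.
    recon = 0.5 * ((x - g(mu_q)) ** 2) / (sigma ** 2)
    return kl + recon
```

Minimizing this over the inference parameters (`mu_q`, `log_var_q`) and the generator `g` is the cooperative game described above: both terms pull $Q$ and $P$ together.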
3.2.2 Adversarial learning
Next, consider the learning of the energy-based model [23, 24]. Recall (8); if we replace the intractable $\pi_{\alpha_t}$ in (8) by $p_\theta$, we get

(15) $\min_\alpha \max_\theta \left[\mathrm{KL}(p_{\text{data}} \| \pi_\alpha) - \mathrm{KL}(p_\theta \| \pi_\alpha)\right],$

or equivalently

(16) $\max_\alpha \min_\theta \left[\mathrm{KL}(p_\theta \| \pi_\alpha) - \mathrm{KL}(p_{\text{data}} \| \pi_\alpha)\right],$

so that we avoid MCMC for sampling $\pi_\alpha$, and the gradient for updating $\theta$ becomes

(17) $\frac{\partial}{\partial\theta} \mathrm{KL}(p_\theta \| \pi_\alpha) = -\frac{\partial}{\partial\theta} \mathrm{E}_{p_\theta}[f_\alpha(x)] - \frac{\partial}{\partial\theta} \mathcal{H}(p_\theta),$

where $\mathcal{H}(p_\theta)$ denotes the entropy of $p_\theta$.
Because of the negative sign in front of the second KL-divergence in (15), we need $\max_\theta$ in (15), or $\min_\theta$ in (16), so that the learning becomes adversarial. See Figure 5 for illustration. Inspired by [38], we call (15) the adversarial contrastive divergence (ACD). It underlies [23, 24].
The adversarial form (15) or (16) defines a chasing game with the following dynamics: the generator model chases the energy-based model in $\min_\theta \mathrm{KL}(p_\theta \| \pi_\alpha)$, while the energy-based model seeks to get close to $p_{\text{data}}$ and get away from $p_\theta$. The red arrow in Figure 5 illustrates this chasing game. The result is that $\pi_\alpha$ lures $p_\theta$ toward $p_{\text{data}}$. In the idealized case, $p_\theta$ always catches up with $\pi_\alpha$; then $\pi_\alpha$ will converge to the maximum likelihood estimate minimizing $\mathrm{KL}(p_{\text{data}} \| \pi_\alpha)$, and $p_\theta$ converges to $\pi_\alpha$.

The above chasing game is different from VAE $\min_\theta \min_\phi \mathrm{KL}(Q \| P)$, which defines a cooperative game where $P$ and $Q$ run toward each other.
Even though the above chasing game is adversarial, both models are running toward the data distribution. While the generator model runs after the energy-based model, the energy-based model runs toward the data distribution. As a consequence, the energy-based model guides or leads the generator model toward the data distribution. This is different from GAN [19]. In GAN, the discriminator eventually becomes a confused one, because the generated data become similar to the real data. In the above chasing game, the energy-based model becomes close to the data distribution.
The updating of $\theta$ by (17) bears similarity to Wasserstein GAN (WGAN) [22], but unlike WGAN, $f_\alpha$ defines a probability distribution $\pi_\alpha$, and the learning of $\theta$ is based on $\min_\theta \mathrm{KL}(p_\theta \| \pi_\alpha)$, which is a variational approximation to $\pi_\alpha$. This variational approximation only requires knowing $f_\alpha(x)$, without knowing $Z(\alpha)$. However, $p_\theta(x)$ is still intractable; in particular, its entropy does not have a closed form. Thus, we can again use variational approximation, by changing the problem to $\min_\theta \min_\phi \mathrm{KL}(P \| \Pi)$, i.e., $\mathrm{KL}(p(z) p_\theta(x|z) \| \pi_\alpha(x) q_\phi(z|x))$, which is analytically tractable and which underlies [24]. In fact,

(18) $\mathrm{KL}(P \| \Pi) = \mathrm{KL}(p_\theta(x) \| \pi_\alpha(x)) + \mathrm{E}_{p_\theta}[\mathrm{KL}(p_\theta(z|x) \| q_\phi(z|x))].$

Thus, we can modify (16) into $\max_\alpha \min_\theta \min_\phi \left[\mathrm{KL}(P \| \Pi) - \mathrm{KL}(p_{\text{data}} \| \pi_\alpha)\right]$, because again $\mathrm{KL}(P \| \Pi) \geq \mathrm{KL}(p_\theta \| \pi_\alpha)$.
Putting the above together, we have the divergence triangle (13), which has a compact symmetric and antisymmetric form.
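Schematically, the max-min objective in (13) is optimized by alternating gradient updates: ascent on $\alpha$, descent on $\theta$ and $\phi$. Below is a minimal runnable sketch with scalar parameters and finite-difference gradients; it is purely illustrative of the update structure, since the actual models are deep networks trained by backpropagation on Monte Carlo estimates of the loss.

```python
def divergence_triangle_step(params, delta, lr=0.1, eps=1e-4):
    """One schematic update for max_alpha min_theta min_phi Delta (Eq. 13):
    gradient ascent on alpha, gradient descent on theta and phi.
    params is a dict of scalar parameters; delta(params) returns the scalar
    objective. Gradients use central finite differences (illustration only)."""
    def num_grad(name):
        hi = dict(params); hi[name] += eps
        lo = dict(params); lo[name] -= eps
        return (delta(hi) - delta(lo)) / (2 * eps)
    new = dict(params)
    new['alpha'] = params['alpha'] + lr * num_grad('alpha')  # ascent (max)
    new['theta'] = params['theta'] - lr * num_grad('theta')  # descent (min)
    new['phi'] = params['phi'] - lr * num_grad('phi')        # descent (min)
    return new
```

The asymmetry of the update directions is exactly the antisymmetry of the loss: $\alpha$ plays against $\theta$ and $\phi$ on the shared objective.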
3.3 Gap between two models
We can write the objective function as

$\Delta = \mathrm{KL}(Q \| P) + \mathrm{KL}(P \| \Pi) - \mathrm{KL}(Q \| \Pi) \geq \mathrm{KL}(p_{\text{data}} \| p_\theta) - \mathrm{KL}(p_{\text{data}} \| \pi_\alpha) = \mathrm{E}_{p_{\text{data}}}[\log \pi_\alpha(x) - \log p_\theta(x)],$

where we use $\mathrm{KL}(Q \| \Pi) = \mathrm{KL}(p_{\text{data}} \| \pi_\alpha)$ (the $q_\phi$ factors cancel), $\mathrm{KL}(Q \| P) \geq \mathrm{KL}(p_{\text{data}} \| p_\theta)$, and $\mathrm{KL}(P \| \Pi) \geq 0$. Thus $\Delta$ is an upper bound of the difference between the log-likelihood of the energy-based model and the log-likelihood of the generator model.
3.4 Two sides of KL-divergences
In the divergence triangle, the generator model appears on the right side of $\mathrm{KL}(Q \| P)$, and it also appears on the left side of $\mathrm{KL}(P \| \Pi)$. The former tends to interpolate or smooth the modes of $p_{\text{data}}$, while the latter tends to seek after the major modes of $\pi_\alpha$ while ignoring the minor modes. As a result, the learned generator model tends to generate sharper images. As to the inference model $q_\phi$, it appears on the left side of $\mathrm{KL}(Q \| P)$, and it also appears on the right side of $\mathrm{KL}(P \| \Pi)$. The former is variational learning from the real data, while the latter corresponds to the sleep phase of wake-sleep learning, which learns from the dream data generated by $p_\theta(x, z)$. The inference model thus can infer $z$ from both observed and generated $x$.

In fact, if we define
(19) $\Delta_0 = \mathrm{KL}(p_{\text{data}} \| p_\theta) + \mathrm{KL}(p_\theta \| \pi_\alpha) - \mathrm{KL}(p_{\text{data}} \| \pi_\alpha),$

we have

(20) $\Delta = \Delta_0 + \mathrm{E}_{p_{\text{data}}}[\mathrm{KL}(q_\phi(z|x) \| p_\theta(z|x))] + \mathrm{E}_{p_\theta}[\mathrm{KL}(p_\theta(z|x) \| q_\phi(z|x))].$

(19) is the divergence triangle between the three marginal distributions on $x$, where $p_\theta$ appears on both sides of KL-divergences. (20) is the variational scheme that turns the marginal distributions into the joint distributions, which are more tractable. In (20), the two KL-divergences between $q_\phi(z|x)$ and $p_\theta(z|x)$ have reverse orders.
3.5 Training algorithm
The three models are each parameterized by convolutional neural networks. The joint learning under the divergence triangle can be implemented by stochastic gradient descent, where the expectations are replaced by the sample averages. Algorithm 1 describes the procedure, which is illustrated in Figure 6.

4 Experiments
Model  VAE [15]  DCGAN [20]  WGAN [22]  CoopNet [25]  CEGAN [24]  ALI [26]  ALICE [28]  Ours 
CIFAR10 (IS)  4.08  6.16  5.76  6.55  7.07  5.93  6.02  7.23 
CelebA (FID)  99.09  38.39  36.36  56.57  41.89  60.29  46.14  31.92 
In this section, we demonstrate not only that the divergence triangle is capable of successfully learning an energy-based model with a well-behaved energy landscape, a generator model with highly realistic samples, and an inference model with faithful reconstruction ability, but we also show competitive performance on four tasks: image generation, test image reconstruction, energy landscape mapping, and learning from incomplete images. For image generation, we consider spatially stationary texture images, temporally stationary dynamic textures, and general object categories. We also test our model on large-scale datasets and high-resolution images.
The images are resized and scaled to a fixed intensity range; no further preprocessing is needed. The network parameters are initialized with a zero-mean Gaussian and optimized using Adam [39]. Network weights are decayed, and batch normalization [40] is used. We refer to the Appendix for the model specifications.

4.1 Image generation
In this experiment, we evaluate the visual quality of generator samples from our divergence triangle model. If the generator model is well-trained, then the obtained samples should be realistic and match the visual features and contents of the training images.
4.1.1 Object generation
For object categories, we test our model on two commonlyused datasets of natural images: CIFAR10 and CelebA [41]. For CelebA face dataset, we randomly select 9,000 images for training and another 1,000 images for testing in reconstruction task. The face images are resized to and CIFAR10 images remain . The qualitative results of generated samples for objects are shown in Figure 7. We further evaluate our model using quantitative evaluations which are based on the Inception Score (IS) [42] for CIFAR10 and Frechet Inception Distance (FID) [43] for CelebA faces. We generate 50,000 random samples for the computation of the inception score and 10,000 random samples for the computation of the FID score. Table I shows the IS and FID scores of our model compared with VAE [15], DCGAN [20], WGAN [22], CoopNet [25], CEGAN [24], ALI [26], ALICE [28].
Note that for the Inception Score on CIFAR10, we borrow the scores from the relevant papers, and for the FID score on 9,000 CelebA faces, we re-implement or use the available code with a network structure similar to our model. It can be seen that our model achieves competitive performance compared to recent baseline models.
4.1.2 Large-scale datasets
We also train our model on large-scale datasets, including a downsampled version of ImageNet [44, 45] (roughly 1 million images) and the Large-scale Scene Understanding (LSUN) dataset [46]. For the LSUN dataset, we consider the bedroom, tower and church outdoor categories, which contain roughly 3 million, 0.7 million and 0.1 million images, respectively, and were resized to a common resolution. The network structures are similar to the ones used in object generation with twice the number of channels, and batch normalization is used in all three models. Generated samples are shown in Figure 8.

4.1.3 High-resolution synthesis
In this section, we employ a layer-wise training scheme to learn models on CelebA-HQ [47] at high resolutions. Layer-wise training dates back to initializing deep neural networks by Restricted Boltzmann Machines to overcome optimization hurdles [48, 49] and has been resurrected in progressive GANs [47], albeit with the order of layer transitions reversed such that top layers are trained first. This resembles a Laplacian pyramid [50] in which images are generated in a coarse-to-fine fashion. As in [47], the training starts with downsampled images at a low spatial resolution, while progressively increasing the size of the images and the number of layers. All three models are grown in synchrony, where convolutions project between RGB and feature space. In contrast to [47], we require neither minibatch discrimination to increase the variation of $p_\theta$ nor a gradient penalty to preserve the Lipschitz continuity of $f_\alpha$.
Figure 9 depicts high-fidelity synthesis at high resolution, sampled from the generator model on CelebA-HQ. Figure 10 illustrates linear interpolation in latent space (i.e., between samples of $z$), which indicates diversity in the samples.
Therefore, the joint learning in the triangle formulation is not only able to train the three models with stable optimization, but it also achieves synthesis with high fidelity.
4.1.4 Texture synthesis
We consider texture images, which are spatially stationary and contain repetitive patterns. The texture images are resized to a fixed resolution, and a separate model is trained on each image. We start from a latent factor map and use five convolutional-transpose layers for the generator network, with a decreasing number of filters per layer and ReLU nonlinearity between layers. The inference model has the inverse or "mirror" structure of the generator model, except that we use convolutional layers and leaky ReLU. The energy-based model has three convolutional layers. Representative examples are shown in Figure 11. Three texture synthesis results are obtained by sampling different latent factors from the prior distribution $p(z)$. Notice that although we only have one texture image for training, the proposed divergence triangle model can effectively utilize the repetitive patterns, thus generating realistic texture images with different configurations.
4.1.5 Dynamic texture synthesis
Our model can also be used for dynamic patterns, which exhibit stationary regularity in the temporal domain. The training video clips are selected from the Dyntex database [51] and resized to a fixed number of pixels and frames. Inspired by recent work [52, 53], we adopt spatial-temporal models for dynamic patterns that are stationary in the temporal domain but non-stationary in the spatial domain. Specifically, we start from spatial-temporal latent factors for each video clip, and we adopt the same spatial-temporal convolutional-transpose generator network as in [53] except for the kernel size of the second layer. The inference model uses spatial-temporal convolutional layers, where the last layer is fully connected in the spatial domain but convolutional in the temporal domain, yielding reparametrized means and variances that have the same size as the latent factors. The energy-based model uses three spatial-temporal convolutional layers, where the last layer is again fully connected in the spatial domain but convolutional in the temporal domain. Some of the synthesis results are shown in Figure 12. Note that we subsample the frames of the training and generated video clips, and we only show the first batch for illustration.
4.2 Test image reconstruction
Model  WS [18]  VAE [15]  ALI [26]  ALICE [28]  Ours 

CIFAR10  0.058  0.037  0.311  0.034  0.028 
CelebA  0.152  0.039  0.519  0.046  0.030 
In this experiment, we evaluate the reconstruction ability of our model on a holdout testing image dataset. This is a strong indicator of the accuracy of our inference model. Specifically, if our divergence triangle model is well learned, then the inference model should match the true posterior of the generator model, i.e., $q_\phi(z|x) \approx p_\theta(z|x)$. Therefore, given a test signal $x$, its reconstruction $g_\theta(z)$ with $z \sim q_\phi(z|x)$ should be close to $x$. Figure 13 shows the testing images and their reconstructions on CIFAR10 and CelebA.
For CIFAR10, we use its own 10,000 test images, while for CelebA, we use the holdout 1,000 test images as stated above. The reconstruction quality is further measured by per-pixel mean square error (MSE). Table II shows the per-pixel MSE of our model compared to WS [18], VAE [15], ALI [26], and ALICE [28].
Note that we do not consider methods without inference models on the training data, including variants of GANs and cooperative training, since it is infeasible to test such models using image reconstruction.
4.3 Energy landscape mapping
In the following, we evaluate the learned energy-based model by mapping the macroscopic structure of the energy landscape. When following an MLE regime by minimizing $\mathrm{KL}(p_{\text{data}} \| \pi_\alpha)$, we expect the energy function $-f_\alpha$ to encode the training data as local energy minima. Moreover, $-f_\alpha$ should form minima for unseen images and a macroscopic landscape structure in which basins of minima are distinctly separated by energy barriers. Hopfield observed that such a landscape is a model of associative memory [54].
In order to learn a well-formed energy function, in Algorithm 1, we perform multiple sampling steps such that the samples are sufficiently "close" to the local minima of $-f_\alpha$. This avoids the formation of energy minima not resembling the data. The variational approximation of the entropy of the marginal generator distribution preserves diversity in the samples, avoiding mode collapse.
To verify that (i) the local minima of the energy function resemble the training data and (ii) the minima are separated by significant energy barriers, we shall follow the approach used in [55]. When clustering with respect to energetic barriers, the landscape is partitioned into Hopfield basins of attraction, whereby each point on the landscape is mapped onto a local minimum by a steepest-descent path. The similarity measure used for hierarchical clustering is the barrier energy that separates any two regions. Given a pair of local minima $x_1, x_2$, we estimate the barrier as the highest energy along a linear interpolation between $x_1$ and $x_2$. If the barrier is below some energy threshold, then the two minima belong to the same basin. The clustering is repeated recursively until all minima are clustered together. Such graphs have come to be referred to as disconnectivity graphs (DG) [56].

We conduct energy landscape mapping experiments on the MNIST [57] and FashionMNIST [58] datasets, each containing grayscale images of size 28 x 28 pixels depicting handwritten digits and fashion products from 10 categories, respectively. The energy landscape mapping is not without limitations, because it is practically impossible to locate all local modes. Based on the local modes located by our algorithm (see Figure 14 for the MNIST dataset), the learned energy function appears well-formed: it not only encodes meaningful images as minima, but also forms a meaningful macroscopic structure. Moreover, within basins the local minima have a high degree of purity (i.e., digits within a basin belong to the same class), and the energy barriers between basins seem informative (e.g., basins of ones and sixes form pure super-basins). Figure 15 depicts the energy landscape mapping on FashionMNIST.
Potential applications include unsupervised classification, in which energy barriers act as a geodesic similarity measure that captures perceptual distance (as opposed to, e.g., pixel-wise distance), weakly supervised classification with one label per basin, or reconstruction of incomplete data (i.e., Hopfield content-addressable memory or image inpainting).
4.4 Learning from incomplete images
The divergence triangle can be used to learn from occluded images. This task is challenging [5], because only parts of the images are observed; thus the model needs to learn sufficient information to recover the occluded parts. Generative models with an inferential mechanism can be used for this task. Notably, [5] proposed to recover incomplete images using alternating backpropagation (ABP), which has an MCMC-based inference step to refine the latent factors and performs reconstruction iteratively. VAEs [59, 15] build the inference model on occluded images and can also be adapted for this task. The procedure fills the missing parts with the average pixel intensity in the beginning, then iteratively re-updates the missing parts using the reconstructed values. Unlike VAEs, which only consider the unoccluded parts of the training data, our model utilizes the generated samples, which become gradually recovered during training, resulting in improved recovery accuracy and sharp generation. Note that learning from incomplete data can be difficult for variants of GANs [19, 24, 20, 22] and cooperative training [25], since inference cannot be performed directly on the occluded images.
We evaluate our model on 10,000 images randomly chosen from the CelebA dataset; the selected images are further center-cropped as in [5]. Similar to VAEs, we zero-fill the occluded parts in the beginning, then iteratively update the missing values using reconstructed images obtained from the generator model. Three types of occlusion are used: (1) salt-and-pepper noise, which randomly covers a fraction 0.5 (P.5) or 0.7 (P.7) of the image; (2) multiple-block occlusion, with 10 random blocks per image (MB10); (3) single-block occlusion, where we randomly place one large block on each image, denoted B20 and B30 respectively. Table III shows the recovery errors using VAE [15], ABP [5], and our triangle model, where the error is defined as the per-pixel absolute difference (relative to the range of pixel values) between the recovered image and the ground-truth image on the occluded pixels.
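The iterative refill scheme above can be sketched as follows. This is a minimal sketch on flat pixel lists; `reconstruct` stands in for the encode-then-decode pass through the inference and generator models and is a hypothetical callable, not the paper's actual networks.

```python
def recover(observed, mask, reconstruct, n_iters=100):
    """Iteratively recover occluded pixels.
    observed: pixel values (occluded entries are unreliable)
    mask:     True where the pixel is occluded
    """
    # Zero-fill the occluded pixels at the start, as in the experiments.
    x = [0.0 if m else v for v, m in zip(observed, mask)]
    for _ in range(n_iters):
        x_hat = reconstruct(x)  # reconstruction via inference + generator
        # Keep observed pixels fixed; refill only the occluded ones.
        x = [xh if m else v for v, xh, m in zip(observed, x_hat, mask)]
    return x
```

The observed pixels are never overwritten; only the masked entries are progressively replaced by reconstructed values, so repeated passes let the model refine its guess for the occluded region.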
EXP  P.5  P.7  MB10  B20  B30 
VAE [15]  0.0446  0.0498  0.1169  0.0666  0.0800 
ABP [5]  0.0379  0.0428  0.1070  0.0633  0.0757 
Ours  0.0380  0.0430  0.1060  0.0621  0.0733 
Our model consistently outperforms the VAE model across occlusion patterns. For structured occlusions (i.e., multiple and single blocks), the unoccluded parts contain more meaningful configurations, which improves learning of the generator through the energy-based model; the generator in turn produces more meaningful samples to refine our inference model. This is reflected in our superior results compared to ABP [5] on these patterns. For unstructured occlusions (i.e., salt-and-pepper noise), ABP achieves slightly better recovery; a possible reason is that the unoccluded parts contain fewer meaningful patterns, which offers limited help for learning the generator and inference models. Our model also synthesizes sharper and more realistic images from the generator on occluded images; see Figure 17, in which images are occluded with random blocks.
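The recovery error reported in Table III can be computed as follows (a sketch on flat pixel lists; the function name and arguments are illustrative assumptions):

```python
def recovery_error(recovered, ground_truth, mask, pixel_range=255.0):
    """Per-pixel absolute difference between the recovered image and
    the ground truth, averaged over the occluded pixels only and
    normalized by the range of pixel values."""
    diffs = [abs(r - g) for r, g, m in zip(recovered, ground_truth, mask) if m]
    return sum(diffs) / (len(diffs) * pixel_range)
```

Restricting the average to the masked pixels matters: the unoccluded pixels are copied through unchanged, so including them would dilute the error toward zero.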
5 Conclusion
We proposed a probabilistic framework, the divergence triangle, for joint learning of the energy-based model, the generator model, and the inference model. The divergence triangle forms a compact learning functional for the three models and naturally unifies aspects of maximum likelihood estimation [5, 25], variational auto-encoders [15, 16, 17], adversarial learning [23, 24], contrastive divergence [11], and the wake-sleep algorithm [18].
An extensive set of experiments demonstrated learning of a well-behaved energy-based model, a realistic generator model, and an accurate inference model. Moreover, the experiments showed that the proposed divergence triangle framework can be effective in learning directly from incomplete data.
In future work, we aim to extend the formulation to learn interpretable generator and energy-based models with multiple layers of sparse or semantically meaningful latent variables or features [60, 61]. Further, it would be desirable to unify the generator, energy-based, and inference models into a single model [62, 63] by allowing them to share parameters and nodes instead of maintaining separate sets.
Acknowledgments
The work is supported by DARPA XAI project N660011724029; ARO project W911NF1810296; ONR MURI project N000141612007; and Extreme Science and Engineering Discovery Environment (XSEDE) grant ASC170063. We thank Dr. Tianfu Wu, Shuai Zhu, and Bo Pang for helpful discussions.
Model Architecture
We describe the basic network structures, in particular for object generation. We use the following notation:

- conv(n): convolutional operation with n output feature maps.
- convT(n): transposed convolutional operation with n output feature maps.
- LReLU: Leaky ReLU nonlinearity with default leaky factor 0.2.
- BN: batch normalization.

The structures for CelebA (where 9,000 random images are chosen) are shown in Table IV. The structures for CIFAR-10 and MNIST/Fashion-MNIST are shown in Table V and Table VI, respectively.
Table IV (CelebA):
Generator Model  
Layers  InOut Size  Stride  BN 
Input: Z  1x1x100  
4x4 convT(512), ReLU  4x4x512  1  Yes 
4x4 convT(512), ReLU  8x8x512  2  Yes 
4x4 convT(256), ReLU  16x16x256  2  Yes 
4x4 convT(128), ReLU  32x32x128  2  Yes 
4x4 convT(3), ReLU  64x64x3  2  No 
Inference model  
Input: X  64x64x3  
4x4 conv(64), LReLU  32x32x64  2  Yes 
4x4 conv(128), LReLU  16x16x128  2  Yes 
4x4 conv(256), LReLU  8x8x256  2  Yes 
4x4 conv(512), LReLU  4x4x512  2  Yes 
4x4 conv(100), LReLU  1x1x100  1  No 
Energybased Model  
Input: X  64x64x3  
4x4 conv(64), LReLU  32x32x64  2  Yes 
4x4 conv(128), LReLU  16x16x128  2  Yes 
4x4 conv(256), LReLU  8x8x256  2  Yes 
4x4 conv(256), LReLU  4x4x256  2  Yes 
4x4 conv(1), LReLU  1x1x1  1  No 
Table V (CIFAR-10):
Generator Model  
Layers  InOut Size  Stride  BN 
Input: Z  1x1x100  
4x4 convT(512), ReLU  4x4x512  1  Yes 
4x4 convT(512), ReLU  8x8x512  2  Yes 
4x4 convT(256), ReLU  16x16x256  2  Yes 
4x4 convT(128), ReLU  32x32x128  2  Yes 
3x3 convT(3), Tanh  32x32x3  1  No 
Inference model  
Input: X  32x32x3  
3x3 conv(64), LReLU  32x32x64  1  No 
4x4 conv(128), LReLU  16x16x128  2  No 
4x4 conv(256), LReLU  8x8x256  2  No 
4x4 conv(512), LReLU  4x4x512  2  No 
4x4 conv(100)  1x1x100  1  No 
Energybased Model  
Input: X  32x32x3  
3x3 conv(64), LReLU  32x32x64  1  No 
4x4 conv(128), LReLU  16x16x128  2  No 
4x4 conv(256), LReLU  8x8x256  2  No 
4x4 conv(256), LReLU  4x4x256  2  No 
4x4 conv(1)  1x1x1  1  No 
Table VI (MNIST/Fashion-MNIST):
Generator Model  
Layers  InOut Size  Stride  BN 
Input: Z  1x1x100  
3x3 convT(1024), ReLU  3x3x1024  1  Yes 
4x4 convT(512), ReLU  7x7x512  2  Yes 
4x4 convT(256), ReLU  14x14x256  2  Yes 
4x4 convT(1), Tanh  28x28x1  2  No 
Inference model  
Input: X  28x28x1  
4x4 conv(128), LReLU  14x14x128  2  No 
4x4 conv(256), LReLU  7x7x256  2  No 
4x4 conv(512), LReLU  3x3x512  2  No 
4x4 conv(100)  1x1x100  1  No 
Energybased Model  
Input: X  28x28x1  
4x4 conv(128), LReLU  14x14x128  2  No 
4x4 conv(256), LReLU  7x7x256  2  No 
4x4 conv(512), LReLU  3x3x512  2  No 
4x4 conv(1)  1x1x1  1  No 
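The spatial sizes in the generator tables follow the standard transposed-convolution output-size formula. As a quick sanity check for the 64x64 (CelebA) generator, assuming zero padding on the first layer and padding 1 on the stride-2 layers (the padding values are our assumption, since they are not listed in the tables):

```python
def convT_out(h_in, kernel, stride, padding, output_padding=0):
    # Standard transposed-convolution output size.
    return (h_in - 1) * stride - 2 * padding + kernel + output_padding

# (kernel, stride, padding) for each convT layer in the CelebA generator.
layers = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 2, 1)]
sizes = [1]  # the latent code Z is spatially 1x1
for k, s, p in layers:
    sizes.append(convT_out(sizes[-1], k, s, p))
print(sizes)  # [1, 4, 8, 16, 32, 64]
```

Each stride-2 layer exactly doubles the spatial size under these padding choices, matching the In-Out Size column of Table IV.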
References
 [1] S. Geman and D. Geman, “Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 6, pp. 721–741, 1984.
 [2] R. M. Neal, “MCMC using Hamiltonian dynamics,” Handbook of Markov Chain Monte Carlo, vol. 2, 2011.

 [3] Y. Lu, S.-C. Zhu, and Y. N. Wu, “Learning FRAME models using CNN filters,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 [4] J. Xie, Y. Lu, S.-C. Zhu, and Y. N. Wu, “A theory of generative ConvNet,” in International Conference on Machine Learning, 2016, pp. 2635–2644.
 [5] T. Han, Y. Lu, S.-C. Zhu, and Y. N. Wu, “Alternating back-propagation for generator network,” in AAAI, vol. 3, 2017, p. 13.
 [6] S.C. Zhu, Y. N. Wu, and D. Mumford, “Minimax entropy principle and its application to texture modeling,” Neural Computation, vol. 9, no. 8, pp. 1627–1660, 1997.

 [7] Y. N. Wu, S.-C. Zhu, and X. Liu, “Equivalence of Julesz ensembles and FRAME models,” International Journal of Computer Vision, vol. 38, no. 3, pp. 247–265, 2000.
 [8] S.-C. Zhu and D. Mumford, “GRADE: Gibbs reaction and diffusion equations,” in International Conference on Computer Vision, 1998, pp. 847–854.
 [9] S.C. Zhu, “Statistical modeling and conceptualization of visual patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 6, pp. 691–712, 2003.
 [10] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang, “A tutorial on energy-based learning,” Predicting Structured Data, vol. 1, no. 0, 2006.
 [11] G. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural Computation, pp. 1771–1800, 2002.
 [12] T. Tieleman, “Training restricted boltzmann machines using approximations to the likelihood gradient,” ICML, pp. 1064–1071, 2008.

 [13] Z. Tu, “Learning generative models via discriminative approaches,” in 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
 [14] L. Jin, J. Lazarow, and Z. Tu, “Introspective learning for discriminative classification,” in Advances in Neural Information Processing Systems, 2017.
 [15] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv preprint arXiv:1312.6114, 2013.

 [16] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” in International Conference on Machine Learning, 2014, pp. 1278–1286.
 [17] A. Mnih and K. Gregor, “Neural variational inference and learning in belief networks,” in International Conference on Machine Learning, 2014, pp. 1791–1799.
 [18] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal, “The “wake-sleep” algorithm for unsupervised neural networks,” Science, vol. 268, no. 5214, pp. 1158–1161, 1995.
 [19] I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
 [20] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
 [21] J. Zhao, M. Mathieu, and Y. LeCun, “Energy-based generative adversarial network,” arXiv preprint arXiv:1609.03126, 2016.
 [22] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in International Conference on Machine Learning, 2017, pp. 214–223.
 [23] T. Kim and Y. Bengio, “Deep directed generative models with energy-based probability estimation,” arXiv preprint arXiv:1606.03439, 2016.
 [24] Z. Dai, A. Almahairi, P. Bachman, E. Hovy, and A. Courville, “Calibrating energy-based generative adversarial networks,” arXiv preprint arXiv:1702.01691, 2017.
 [25] J. Xie, Y. Lu, R. Gao, S.-C. Zhu, and Y. N. Wu, “Cooperative training of descriptor and generator networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2018.
 [26] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville, “Adversarially learned inference,” arXiv preprint arXiv:1606.00704, 2016.
 [27] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” arXiv preprint arXiv:1605.09782, 2016.
 [28] C. Li, H. Liu, C. Chen, Y. Pu, L. Chen, R. Henao, and L. Carin, “Alice: Towards understanding adversarial learning for joint distribution matching,” in Advances in Neural Information Processing Systems, 2017, pp. 5495–5503.

 [29] L. Chen, S. Dai, Y. Pu, E. Zhou, C. Li, Q. Su, C. Chen, and L. Carin, “Symmetric variational autoencoder and connections to adversarial learning,” in International Conference on Artificial Intelligence and Statistics, 2018, pp. 661–669.
 [30] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [31] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012, pp. 1097–1105.
 [32] D. B. Rubin and D. T. Thayer, “EM algorithms for ML factor analysis,” Psychometrika, vol. 47, no. 1, pp. 69–76, 1982.
 [33] J. Ngiam, Z. Chen, P. W. Koh, and A. Y. Ng, “Learning deep energy models,” in International Conference on Machine Learning, 2011, pp. 1105–1112.
 [34] J. Dai, Y. Lu, and Y.N. Wu, “Generative modeling of convolutional neural networks,” arXiv preprint arXiv:1412.6296, 2014.
 [35] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum entropy inverse reinforcement learning,” in AAAI, 2008.
 [36] P. Abbeel and A. Y. Ng, “Apprenticeship learning via inverse reinforcement learning,” in Proceedings of the twentyfirst international conference on Machine learning, 2004, p. 1.
 [37] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.
 [38] G. E. Hinton, “Training products of experts by minimizing contrastive divergence.” Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
 [39] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [40] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.

 [41] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proceedings of International Conference on Computer Vision (ICCV), 2015.
 [42] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
 [43] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet, “Are GANs created equal? A large-scale study,” arXiv preprint arXiv:1711.10337, 2017.
 [44] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” arXiv preprint arXiv:1601.06759, 2016.
 [45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
 [46] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao, “LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop,” arXiv preprint arXiv:1506.03365, 2015.
 [47] T. Karras, T. Aila, S. Laine, and J. Lehtinen, “Progressive growing of gans for improved quality, stability, and variation,” arXiv preprint arXiv:1710.10196, 2017.
 [48] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
 [49] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, “Greedy layerwise training of deep networks,” in Advances in neural information processing systems, 2007, pp. 153–160.
 [50] E. L. Denton, S. Chintala, R. Fergus et al., “Deep generative image models using a Laplacian pyramid of adversarial networks,” in NIPS, 2015, pp. 1486–1494.
 [51] R. Péteri, S. Fazekas, and M. J. Huiskes, “Dyntex: A comprehensive database of dynamic textures,” Pattern Recognition Letters, vol. 31, no. 12, pp. 1627–1632, 2010.
 [52] C. Vondrick, H. Pirsiavash, and A. Torralba, “Generating videos with scene dynamics,” in Advances In Neural Information Processing Systems, 2016, pp. 613–621.
 [53] T. Han, Y. Lu, J. Wu, X. Xing, and Y. N. Wu, “Learning generator networks for dynamic patterns,” in WACV, 2019.
 [54] J. J. Hopfield, “Neural networks and physical systems with emergent collective computational abilities,” Proceedings of the National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558, 1982.
 [55] M. Hill, E. Nijkamp, and S.C. Zhu, “Building a telescope to look into highdimensional image spaces,” arXiv preprint arXiv:1803.01043, 2018.
 [56] D. J. Wales, M. A. Miller, and T. R. Walsh, “Archetypal energy landscapes,” Nature, vol. 394, no. 6695, p. 758, 1998.
 [57] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/
 [58] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
 [59] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” arXiv preprint arXiv:1401.4082, 2014.
 [60] R. Salakhutdinov and G. E. Hinton, “Deep boltzmann machines,” in International Conference on Artificial Intelligence and Statistics, 2009.

 [61] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in International Conference on Machine Learning, 2009, pp. 609–616.
 [62] L. Dinh, D. Krueger, and Y. Bengio, “NICE: Non-linear independent components estimation,” arXiv preprint arXiv:1410.8516, 2014.
 [63] L. Dinh, J. SohlDickstein, and S. Bengio, “Density estimation using real nvp,” arXiv preprint arXiv:1605.08803, 2016.