 # Divergence Triangle for Joint Training of Generator Model, Energy-based Model, and Inference Model

This paper proposes the divergence triangle as a framework for joint training of generator model, energy-based model and inference model. The divergence triangle is a compact and symmetric (anti-symmetric) objective function that seamlessly integrates variational learning, adversarial learning, wake-sleep algorithm, and contrastive divergence in a unified probabilistic formulation. This unification makes the processes of sampling, inference, energy evaluation readily available without the need for costly Markov chain Monte Carlo methods. Our experiments demonstrate that the divergence triangle is capable of learning (1) an energy-based model with well-formed energy landscape, (2) direct sampling in the form of a generator network, and (3) feed-forward inference that faithfully reconstructs observed as well as synthesized data. The divergence triangle is a robust training method that can learn from incomplete data.


## 1 Introduction

### 1.1 Integrating three models

Deep probabilistic generative models are a powerful framework for representing complex data distributions. They have been widely used in unsupervised learning problems to learn from unlabeled data. The goal of generative learning is to build rich and flexible models that fit complex, multi-modal data distributions and generate samples with high realism. The family of generative models may be roughly divided into two classes. The first class is the energy-based model (a.k.a. undirected graphical model). The second class is the latent variable model (a.k.a. directed graphical model), which usually includes a generator model for generation and an inference model for inference or reconstruction.

These models have their advantages and limitations. An energy-based model defines an explicit likelihood of the observed data up to a normalizing constant. However, sampling from such a model usually requires expensive Markov chain Monte Carlo (MCMC). A generator model defines direct sampling of the data. However, it does not have an explicit likelihood. The inference of the latent variables also requires MCMC sampling from the posterior distribution. The inference model defines an explicit approximation to the posterior distribution of the latent variables.

Combining the energy-based model, the generator model, and the inference model to get the best of each model is an attractive goal. On the other hand, challenges may accumulate when the models are trained together, since the different models need to compete or cooperate effectively for each to achieve its best performance. In this work, we propose the divergence triangle for joint training of the energy-based model, the generator model, and the inference model. The learning of the three models can then be seamlessly integrated in a principled probabilistic framework. The energy-based model is learned from the samples supplied by the generator model. With the help of the inference model, the generator model is trained by both the observed data and the energy-based model. The inference model is learned from both the real data fitted by the generator model and the synthesized data generated by the generator model.

Our experiments demonstrate that the divergence triangle is capable of learning an energy-based model with a well-behaved energy landscape, a generator model with highly realistic samples, and an inference model with faithful reconstruction ability.

### 1.2 Prior art

The divergence triangle jointly learns an energy-based model, a generator model, and an inference model. The following are previous methods for learning such models.

The maximum likelihood learning of the energy-based model requires expectation with respect to the current model, while the maximum likelihood learning of the generator model requires expectation with respect to the posterior distribution of the latent variables. Both expectations can be approximated by MCMC, such as Gibbs sampling, Langevin dynamics, or Hamiltonian Monte Carlo (HMC). [3, 4] used Langevin dynamics for learning the energy-based models, and Langevin dynamics has likewise been used for learning the generator model. In both cases, MCMC sampling introduces an inner loop in the training procedure, which is computationally expensive.

An early version of the energy-based model is the FRAME (Filters, Random field, And Maximum Entropy) model [6, 7]. Gradient-based methods such as Langevin dynamics have been used to sample from the model, and energy-based models have also been called descriptive models. [3, 4] generalized the model to deep versions.

For learning the energy-based model, to reduce the computational cost of MCMC sampling, contrastive divergence (CD) initializes a finite-step MCMC from the observed data. The resulting learning algorithm follows the gradient of the difference between two Kullback-Leibler divergences, thus the name contrastive divergence. In this paper, we shall use the term “contrastive divergence” in a more general sense than the original CD. Persistent contrastive divergence initializes MCMC sampling from the samples of the previous learning iteration.

An introspective learning method has also been developed, in which the energy function is discriminatively learned, so that the energy-based model is both a generative model and a discriminative model.

For learning the generator model, the variational auto-encoder (VAE) [15, 16, 17] approximates the posterior distribution of the latent variables by an explicit inference model. In VAE, the inference model is learned jointly with the generator model from the observed data. A precursor of VAE is the wake-sleep algorithm, where the inference model is learned from the dream data generated by the generator model in the sleep phase.

The generator model can also be learned jointly with a discriminator model, as in generative adversarial networks (GAN), as well as deep convolutional GAN (DCGAN), energy-based GAN (EB-GAN), and Wasserstein GAN (WGAN). GAN does not involve an inference model.

The generator model can also be learned jointly with an energy-based model [23, 24]. We can interpret the learning scheme as an adversarial version of contrastive divergence. While in GAN the discriminator model eventually becomes a confused one, in the joint learning of the generator model and the energy-based model, the learned energy-based model becomes a well-defined probability distribution on the observed data. The joint learning bears some similarity to WGAN, but unlike WGAN, the joint learning involves two complementary probability distributions.

To bridge the gap between the generator model and the energy-based model, the cooperative learning method introduces finite-step MCMC sampling of the energy-based model, with the MCMC initialized from the samples generated by the generator model. Such finite-step MCMC produces synthesized examples closer to the energy-based model, and the generator model can learn from how the finite-step MCMC revises its initial samples.

Adversarially learned inference (ALI) [26, 27] combines the learning of the generator model and the inference model in an adversarial framework. ALI can be improved by adding conditional entropy regularization, resulting in the ALICE model. A recently proposed method shares the same spirit. These methods lack an energy-based model on observed data.

### 1.3 Our contributions

Our proposed formulation, which we call the divergence triangle, re-interprets and integrates the following elements in unsupervised generative learning: (1) maximum likelihood learning, (2) variational learning, (3) adversarial learning, (4) contrastive divergence, (5) wake-sleep algorithm. The learning is seamlessly integrated into a probabilistic framework based on KL divergence.

We conduct extensive experiments to analyze the learned models. Energy landscape mapping is used to verify that our learned energy-based model is well-behaved. Further, we evaluate the generator model via synthesis, generating samples with competitive fidelity, and evaluate the accuracy of the inference model both qualitatively and quantitatively via reconstruction. Our proposed model can also learn directly from incomplete images with various blocking patterns.

## 2 Learning deep probabilistic models

In this section, we shall review the two probabilistic models, namely the generator model and the energy-based model, both of which are parametrized by convolutional neural networks [30, 31]. Then, we shall present the maximum likelihood learning algorithms for training these two models, respectively. Our presentation of the two maximum likelihood learning algorithms is unconventional. We seek to derive both algorithms based on the Kullback-Leibler divergence using the same scheme. This will set the stage for the divergence triangle.

### 2.1 Generator model and energy-based model

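The generator model [19, 20, 15, 16, 17] is a generalization of the factor analysis model,

$$z \sim N(0, I_d), \quad x = g_\theta(z) + \epsilon, \tag{1}$$

where $g_\theta$ is a top-down mapping parametrized by a deep network with parameters $\theta$. It maps the $d$-dimensional latent vector $z$ to the $D$-dimensional signal $x$. $\epsilon \sim N(0, \sigma^2 I_D)$ and is independent of $z$. In general, the model is defined by the prior distribution $p(z)$ and the conditional distribution $p_\theta(x|z)$. The complete-data model is $p_\theta(z, x) = p(z) p_\theta(x|z)$. The observed-data model is $p_\theta(x) = \int p_\theta(z, x) \, dz$. The posterior distribution is $p_\theta(z|x) = p_\theta(z, x) / p_\theta(x)$. See the diagram (a) below.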

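$$\begin{array}{cc}
\text{Top-down mapping} & \text{Bottom-up mapping} \\
\text{hidden vector } z & \text{energy } -f_\alpha(x) \\
\Downarrow & \Uparrow \\
\text{signal } x \approx g_\theta(z) & \text{signal } x \\
\text{(a) Generator model} & \text{(b) Energy-based model}
\end{array}$$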

A complementary model is the energy-based model [33, 34, 3, 4], where $-f_\alpha(x)$ defines the energy of $x$, and a low-energy $x$ is assigned a high probability. Specifically, we have the following probability model

$$\pi_\alpha(x) = \frac{1}{Z(\alpha)} \exp[f_\alpha(x)], \tag{2}$$

where $f_\alpha$ is parametrized by a bottom-up deep network with parameters $\alpha$, and $Z(\alpha)$ is the normalizing constant. If $f_\alpha(x)$ is linear in $\alpha$, the model becomes the familiar exponential family model in statistics or the Gibbs distribution in statistical physics. We may consider $\pi_\alpha$ an evaluator, where $f_\alpha$ assigns the value to $x$, and $\pi_\alpha$ evaluates $x$ by a normalized probability distribution. See the diagram (b) above.

The energy-based model $\pi_\alpha$ defines an explicit log-likelihood via $f_\alpha(x)$, even though $Z(\alpha)$ is intractable. However, it is difficult to sample from $\pi_\alpha$. The generator model can generate $x$ directly by first generating $z \sim p(z)$, and then transforming $z$ to $x$ by $g_\theta(z)$. But it does not define an explicit log-likelihood of $x$.
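The contrast between the two models can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the one-layer `g_theta`, the linear-quadratic `f_alpha`, the dimensions, and the noise level are hypothetical stand-ins for the deep networks used in the paper. It shows that the generator is easy to sample from, while the energy-based model is easy to evaluate up to the constant $\log Z(\alpha)$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 2, 5, 0.1            # latent dim, signal dim, noise level (illustrative)
W = rng.standard_normal((D, d))    # hypothetical weights of a one-layer g_theta
a = rng.standard_normal(D)         # hypothetical weights of a simple f_alpha

def g_theta(z):
    """Top-down mapping: latent vectors (n, d) -> signals (n, D)."""
    return np.tanh(z @ W.T)

def sample_generator(n):
    """Ancestral sampling of Eq. (1): z ~ N(0, I_d), x = g_theta(z) + eps."""
    z = rng.standard_normal((n, d))
    x = g_theta(z) + sigma * rng.standard_normal((n, D))
    return z, x

def f_alpha(x):
    """Bottom-up value function; -f_alpha(x) is the energy of x."""
    return x @ a - 0.5 * np.sum(x ** 2, axis=-1)

z, x = sample_generator(4)
# f_alpha(x) equals log pi_alpha(x) + log Z(alpha): explicit up to a constant.
unnormalized_log_density = f_alpha(x)
```

Differences such as `f_alpha(x[0]) - f_alpha(x[1])` are exact log-probability ratios under $\pi_\alpha$, because the unknown $\log Z(\alpha)$ cancels.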

In the context of inverse reinforcement learning [35, 36] or inverse optimal control, $x$ is the action, and $-f_\alpha(x)$ defines the cost function, or $f_\alpha(x)$ defines the value function or the objective function.

### 2.2 Maximum likelihood learning

Let $q_{\mathrm{data}}(x)$ be the true distribution that generates the training data. Both the generator and the energy-based model can be learned by maximum likelihood. For a large sample, maximum likelihood amounts to minimizing the Kullback-Leibler divergence $\mathrm{KL}(q_{\mathrm{data}} \| p_\theta)$ over $\theta$, and minimizing $\mathrm{KL}(q_{\mathrm{data}} \| \pi_\alpha)$ over $\alpha$, respectively. The expectation $E_{q_{\mathrm{data}}}$ can be approximated by the sample average.
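The link between maximum likelihood and the KL divergence can be spelled out in one line. The following is a standard identity, written with $q_{\mathrm{data}}$ for the data distribution, $p_\theta$ for the generator, and $x_i$, $i = 1, \dots, n$, for the training examples (the indexing is an assumption made for illustration):

```latex
\mathrm{KL}(q_{\mathrm{data}} \,\|\, p_\theta)
  = E_{q_{\mathrm{data}}}[\log q_{\mathrm{data}}(x)]
    - E_{q_{\mathrm{data}}}[\log p_\theta(x)],
\qquad
E_{q_{\mathrm{data}}}[\log p_\theta(x)]
  \approx \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(x_i).
```

The first term does not depend on $\theta$, so minimizing the divergence over $\theta$ is the same as maximizing the average log-likelihood. The same argument applies to $\pi_\alpha$ with $\alpha$ in place of $\theta$.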

#### 2.2.1 EM-type learning of generator model

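To learn the generator model $p_\theta$, we seek to minimize $\mathrm{KL}(q_{\mathrm{data}}(x) \| p_\theta(x))$ over $\theta$. Suppose in an iterative algorithm, the current $\theta$ is $\theta_t$. We can fix $\theta_t$ at any place we want, and vary $\theta$ around $\theta_t$.

We can write

$$\mathrm{KL}(q_{\mathrm{data}}(x) p_{\theta_t}(z|x) \| p_\theta(z, x)) = \mathrm{KL}(q_{\mathrm{data}}(x) \| p_\theta(x)) + \mathrm{KL}(p_{\theta_t}(z|x) \| p_\theta(z|x)). \tag{3}$$

In the EM algorithm, the left-hand side is the surrogate objective function. This surrogate function is more tractable than the true objective function because $q_{\mathrm{data}}(x) p_{\theta_t}(z|x)$ is a distribution of the complete data, and $p_\theta(z, x)$ is the complete-data model.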
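For readers who want the intermediate step behind (3), the decomposition follows from factoring the complete-data model as $p_\theta(z, x) = p_\theta(x) p_\theta(z|x)$. The sketch below uses the same symbols as (3); the second term is the conditional KL divergence averaged over $q_{\mathrm{data}}(x)$, which is how the shorthand in (3) is read here:

```latex
\begin{aligned}
\mathrm{KL}\big(q_{\mathrm{data}}(x)\, p_{\theta_t}(z|x) \,\|\, p_\theta(z, x)\big)
 &= E_{q_{\mathrm{data}}(x)} E_{p_{\theta_t}(z|x)}
    \left[ \log \frac{q_{\mathrm{data}}(x)\, p_{\theta_t}(z|x)}
                     {p_\theta(x)\, p_\theta(z|x)} \right] \\
 &= E_{q_{\mathrm{data}}(x)}
    \left[ \log \frac{q_{\mathrm{data}}(x)}{p_\theta(x)} \right]
  + E_{q_{\mathrm{data}}(x)} E_{p_{\theta_t}(z|x)}
    \left[ \log \frac{p_{\theta_t}(z|x)}{p_\theta(z|x)} \right].
\end{aligned}
```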

We can write (3) as

$$S(\theta) = K(\theta) + \tilde{K}(\theta). \tag{4}$$

The geometric picture is that the surrogate objective function $S(\theta)$ is above the true objective function $K(\theta)$, i.e., $S$ majorizes (upper bounds) $K$, and they touch each other at $\theta_t$, so that $S(\theta_t) = K(\theta_t)$ and $S'(\theta_t) = K'(\theta_t)$. The reason is that $\tilde{K}(\theta_t) = 0$ and $\tilde{K}'(\theta_t) = 0$. See Figure 1.

Fig. 1: The surrogate $S$ majorizes (upper bounds) $K$, and they touch each other at $\theta_t$ with the same tangent.

$q_{\mathrm{data}}(x) p_{\theta_t}(z|x)$ gives us the complete data. Each step of EM fits the complete-data model $p_\theta(z, x)$ by minimizing the surrogate $S(\theta)$,

$$\theta_{t+1} = \arg\min_\theta \mathrm{KL}(q_{\mathrm{data}}(x) p_{\theta_t}(z|x) \| p_\theta(z, x)), \tag{5}$$

which amounts to maximizing the complete-data log-likelihood. By minimizing $S$, we will reduce $S(\theta)$ relative to $S(\theta_t)$, and we will reduce $K(\theta)$ even more, relative to $K(\theta_t)$, because of the majorization picture.

We can also use gradient descent to update $\theta$. Because $S'(\theta_t) = K'(\theta_t)$, and we can place $\theta_t$ anywhere, we have

$$-\frac{\partial}{\partial \theta} \mathrm{KL}(q_{\mathrm{data}}(x) \| p_\theta(x)) = E_{q_{\mathrm{data}}(x) p_\theta(z|x)} \left[ \frac{\partial}{\partial \theta} \log p_\theta(z, x) \right]. \tag{6}$$

To implement the above updates, we need to compute the expectation with respect to the posterior distribution $p_\theta(z|x)$. It can be approximated by MCMC such as Langevin dynamics or HMC. Both require gradient computations that can be efficiently accomplished by back-propagation. We have learned the generator using this learning method.
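To illustrate what posterior sampling by Langevin dynamics involves, here is a minimal sketch on a toy model, not the authors' implementation. It assumes a scalar linear generator $x = wz + \epsilon$ with $z \sim N(0, 1)$ and $\epsilon \sim N(0, \sigma^2)$, chosen because its posterior is Gaussian with a known mean and variance, so the sampler can be checked. The function name, step size, and chain counts are illustrative choices.

```python
import numpy as np

def langevin_posterior(x, w, sigma, n_chains=5000, n_steps=200, step=0.1, seed=0):
    """Unadjusted Langevin sampling of p(z | x) for x = w * z + eps."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n_chains)            # start every chain from the prior
    for _ in range(n_steps):
        # Gradient of log p(z, x) with respect to z; a deep generator
        # would obtain this by back-propagation instead.
        grad = -z + w * (x - w * z) / sigma ** 2
        z = z + 0.5 * step ** 2 * grad + step * rng.standard_normal(n_chains)
    return z

x, w, sigma = 1.0, 2.0, 0.5
z = langevin_posterior(x, w, sigma)

# Closed-form posterior of the toy model, for comparison.
post_var = 1.0 / (1.0 + w ** 2 / sigma ** 2)
post_mean = post_var * w * x / sigma ** 2
```

The empirical mean and variance of `z` should land close to `post_mean` and `post_var`, with a small bias in the variance because the discretized chain omits a Metropolis correction. The inner loop over `n_steps` is the computational cost that the divergence triangle aims to avoid.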

#### 2.2.2 Self-critic learning of energy-based model

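To learn the energy-based model $\pi_\alpha$, we seek to minimize $\mathrm{KL}(q_{\mathrm{data}}(x) \| \pi_\alpha(x))$ over $\alpha$. Suppose in an iterative algorithm, the current $\alpha$ is $\alpha_t$. We can fix $\alpha_t$ at any place we want, and vary $\alpha$ around $\alpha_t$.

Consider the following contrastive divergence

$$\mathrm{KL}(q_{\mathrm{data}}(x) \| \pi_\alpha(x)) - \mathrm{KL}(\pi_{\alpha_t}(x) \| \pi_\alpha(x)). \tag{7}$$

We can use the above as the surrogate function, which is more tractable than the true objective function, since the $\log Z(\alpha)$ term is canceled out. Specifically, we can write (7) as

$$\begin{aligned}
S(\alpha) &= K(\alpha) - \tilde{K}(\alpha) && (8) \\
&= -\left( E_{q_{\mathrm{data}}}[f_\alpha(x)] - E_{\pi_{\alpha_t}}[f_\alpha(x)] \right) + \text{const}. && (9)
\end{aligned}$$

The geometric picture is that the surrogate function $S(\alpha)$ is below the true objective function $K(\alpha)$, i.e., $S$ minorizes (lower bounds) $K$, and they touch each other at $\alpha_t$, so that $S(\alpha_t) = K(\alpha_t)$ and $S'(\alpha_t) = K'(\alpha_t)$. The reason is that $\tilde{K}(\alpha_t) = 0$ and $\tilde{K}'(\alpha_t) = 0$. See Figure 2.

Fig. 2: The surrogate $S$ minorizes (lower bounds) $K$, and they touch each other at $\alpha_t$ with the same tangent.

Because $S$ minorizes $K$, we do not have an EM-like update. However, we can still use gradient descent to update $\alpha$, where the derivative is

$$K'(\alpha_t) = S'(\alpha_t) = -\left( E_{q_{\mathrm{data}}}[f'_{\alpha_t}(x)] - E_{\pi_{\alpha_t}}[f'_{\alpha_t}(x)] \right), \tag{10}$$

where

$$f'_{\alpha_t}(x) = \frac{\partial}{\partial \alpha} f_\alpha(x) \Big|_{\alpha_t}. \tag{11}$$

Since we can place $\alpha_t$ anywhere, we have

$$-\frac{\partial}{\partial \alpha} \mathrm{KL}(q_{\mathrm{data}}(x) \| \pi_\alpha(x)) = E_{q_{\mathrm{data}}} \left[ \frac{\partial}{\partial \alpha} f_\alpha(x) \right] - E_{\pi_\alpha} \left[ \frac{\partial}{\partial \alpha} f_\alpha(x) \right]. \tag{12}$$

To implement the above update, we need to compute the expectation with respect to the current model $\pi_\alpha$. It can be approximated by MCMC such as Langevin dynamics or HMC that samples from $\pi_\alpha$. It can be efficiently implemented by gradient computation via back-propagation. We have trained the energy-based model using this learning method [3, 4].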
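The update in (12) can be tried on a one-parameter toy model. The sketch below is an assumption-laden illustration: it takes $f_\alpha(x) = \alpha x - x^2/2$, so that $\pi_\alpha$ is the Gaussian $N(\alpha, 1)$ and $\partial f_\alpha / \partial \alpha = x$. Exact Gaussian sampling stands in for the MCMC that a deep $f_\alpha$ would require, and the function name and hyperparameters are hypothetical.

```python
import numpy as np

def fit_ebm_mean(data, n_steps=300, lr=0.1, n_synth=2000, seed=0):
    """Stochastic gradient ascent on the log-likelihood of pi_alpha = N(alpha, 1)."""
    rng = np.random.default_rng(seed)
    alpha = 0.0
    for _ in range(n_steps):
        # Synthesized examples from the current model (exact sampling here,
        # in place of Langevin dynamics or HMC).
        synth = alpha + rng.standard_normal(n_synth)
        # Eq. (12) with d f_alpha / d alpha = x:
        # observed statistics minus synthesized statistics.
        grad = data.mean() - synth.mean()
        alpha += lr * grad
    return alpha

rng = np.random.default_rng(1)
data = 3.0 + rng.standard_normal(2000)   # toy "observed" data
alpha_hat = fit_ebm_mean(data)
```

The estimate `alpha_hat` settles near `data.mean()`, the maximum likelihood estimate for this family, because the gradient vanishes exactly when the synthesized statistics match the observed ones.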

The above learning algorithm has an adversarial interpretation. Updating $\alpha_t$ to $\alpha_{t+1}$ by following the gradient of $S(\alpha) = \mathrm{KL}(q_{\mathrm{data}}(x) \| \pi_\alpha(x)) - \mathrm{KL}(\pi_{\alpha_t}(x) \| \pi_\alpha(x))$, we seek to decrease the first KL-divergence while increasing the second KL-divergence. Equivalently, we seek to shift the value function $f_\alpha(x)$ toward the observed data and away from the synthesized data generated from the current model. That is, the model $\pi_\alpha$ criticizes its current version $\pi_{\alpha_t}$, i.e., the model is its own adversary or its own critic.

#### 2.2.3 Similarity and difference

In both models, at $\theta_t$ or $\alpha_t$, we have $S = K$ and $S' = K'$, because $\tilde{K} = 0$ and $\tilde{K}' = 0$.

The difference is that in the generator model, $S = K + \tilde{K}$, whereas in the energy-based model, $S = K - \tilde{K}$.

In the generator model, if we replace the intractable $p_{\theta_t}(z|x)$ by the inference model $q_\phi(z|x)$, we get VAE.

In the energy-based model, if we replace the intractable $\pi_{\alpha_t}(x)$ by the generator $p_\theta(x)$, we get adversarial contrastive divergence (ACD). The negative sign in front of $\tilde{K}$ is the root of the adversarial learning.

## 3 Divergence triangle: integrating adversarial and variational learning

In this section, we shall first present the divergence triangle, emphasizing its compact symmetric and anti-symmetric form. Then, we shall show that it is a re-interpretation and integration of existing methods, in particular, VAE [15, 16, 17] and ACD [23, 24].

### 3.1 Loss function

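Suppose we observe training examples $\{x_{(i)} \sim q_{\mathrm{data}}(x)\}_{i=1}^{n}$, where $q_{\mathrm{data}}(x)$ is the unknown data distribution. $\pi_\alpha(x) \propto \exp[f_\alpha(x)]$ with energy function $-f_\alpha$ denotes the energy-based model with parameters $\alpha$. The generator model $p(z) p_\theta(x|z)$ has parameters $\theta$ and latent vector $z$. It is trivial to sample the latent distribution $p(z)$, and the generative process is defined as $z \sim p(z)$, $x \sim p_\theta(x|z)$.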

The maximum likelihood learning algorithms for both the generator and the energy-based model require MCMC sampling. We modify the maximum likelihood KL-divergences by proposing a divergence triangle criterion, so that the two models can be learned jointly without MCMC. In addition to the generator $p_\theta$ and the energy-based model $\pi_\alpha$, we also include an inference model $q_\phi(z|x)$ in the learning scheme. Such an inference model is a key component in the variational auto-encoder [15, 16, 17]. The inference model $q_\phi(z|x)$ with parameters $\phi$ maps from the data space to the latent space. In the context of EM, $q_\phi(z|x)$ can be considered an imputor that imputes the missing data $z$ to get the complete data $(z, x)$.

The three models above define joint distributions over $z$ and $x$ from different perspectives. The two marginals, i.e., the empirical data distribution $q_{\mathrm{data}}(x)$ and the latent prior distribution $p(z)$, are known to us. The goal is to harmonize the three joint distributions so that the competition and cooperation between different loss terms improves learning.

Fig. 3: Divergence triangle is based on the Kullback-Leibler divergences between three joint distributions of $(z, x)$. The blue arrow indicates the “running toward” behavior and the red arrow indicates the “running away” behavior.

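The divergence triangle involves the following three joint distributions on $(z, x)$:

1. $Q$-distribution: $Q(z, x) = q_{\mathrm{data}}(x) q_\phi(z|x)$.

2. $P$-distribution: $P(z, x) = p(z) p_\theta(x|z)$.

3. $\Pi$-distribution: $\Pi(z, x) = \pi_\alpha(x) q_\phi(z|x)$.

We propose to learn the three models $p_\theta$, $\pi_\alpha$, $q_\phi$ by the following divergence triangle loss functional

$$\max_\alpha \min_\theta \min_\phi D(\alpha, \theta, \phi), \quad D = \mathrm{KL}(Q \| P) + \mathrm{KL}(P \| \Pi) - \mathrm{KL}(Q \| \Pi). \tag{13}$$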

See Figure 3 for illustration. The divergence triangle is based on the three KL-divergences between the three joint distributions on $(z, x)$. It has a symmetric and anti-symmetric form, where the anti-symmetry is due to the negative sign in front of the last KL-divergence and the maximization over $\alpha$. The divergence triangle leads to the following dynamics between the three models: (1) $Q$ and $P$ seek to get close to each other. (2) $P$ seeks to get close to $\Pi$. (3) $\pi_\alpha$ seeks to get close to $q_{\mathrm{data}}$, but it seeks to get away from $P$, as indicated by the red arrow. Note that $\mathrm{KL}(Q \| \Pi) = \mathrm{KL}(q_{\mathrm{data}} \| \pi_\alpha)$, because $q_\phi(z|x)$ is canceled out. The effect of (2) and (3) is that $\pi_\alpha$ gets close to $q_{\mathrm{data}}$, while inducing $P$ to get close to $q_{\mathrm{data}}$ as well, or in other words, $P$ chases $\pi_\alpha$ toward $q_{\mathrm{data}}$.
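The cancellation in the last term of the triangle can be written out explicitly. Taking $Q(z, x) = q_{\mathrm{data}}(x) q_\phi(z|x)$ and $\Pi(z, x) = \pi_\alpha(x) q_\phi(z|x)$, with $\pi_\alpha(x) = \exp[f_\alpha(x)] / Z(\alpha)$ as in (2), a short worked step is:

```latex
\begin{aligned}
\mathrm{KL}(Q \,\|\, \Pi)
 &= E_{q_{\mathrm{data}}(x)\, q_\phi(z|x)}
    \left[ \log \frac{q_{\mathrm{data}}(x)\, q_\phi(z|x)}
                     {\pi_\alpha(x)\, q_\phi(z|x)} \right]
  = \mathrm{KL}(q_{\mathrm{data}} \,\|\, \pi_\alpha) \\
 &= E_{q_{\mathrm{data}}}[\log q_{\mathrm{data}}(x)]
    - E_{q_{\mathrm{data}}}[f_\alpha(x)] + \log Z(\alpha).
\end{aligned}
```

The inference model shares the same factor in $Q$ and $\Pi$, so this term depends only on $\alpha$ and carries the intractable $\log Z(\alpha)$, which is later offset by the matching $\log Z(\alpha)$ inside $\mathrm{KL}(P \| \Pi)$.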

### 3.2 Unpacking the loss function

The divergence triangle integrates variational and adversarial learning methods, which are modifications of maximum likelihood.

#### 3.2.1 Variational learning

Fig. 4: Variational auto-encoder (VAE) as joint minimization by alternating projection. Left: Interaction between the models. Right: Alternating projection. The two models run toward each other.

First, $\min_\theta \min_\phi \mathrm{KL}(Q \| P)$ captures the variational auto-encoder (VAE).

$$\mathrm{KL}(Q \| P) = \mathrm{KL}(q_{\mathrm{data}}(x) \| p_\theta(x)) + \mathrm{KL}(q_\phi(z|x) \| p_\theta(z|x)). \tag{14}$$

Recall in (4), if we replace the intractable $p_{\theta_t}(z|x)$ in (4) by the explicit $q_\phi(z|x)$, we get (14), so that we avoid MCMC for sampling $p_{\theta_t}(z|x)$.

We may interpret VAE as alternating projection between $Q$ and $P$. See Figure 4 for illustration. If $q_\phi(z|x) = p_\theta(z|x)$, the algorithm reduces to the EM algorithm. The wake-sleep algorithm is similar to VAE, except that it updates $\phi$ by $\min_\phi \mathrm{KL}(P \| Q)$ instead of $\min_\phi \mathrm{KL}(Q \| P)$, so that the wake-sleep algorithm does not have a single objective function.

The VAE defines a cooperative game, with the dynamics that $q_\phi$ and $p_\theta$ run toward each other.

Fig. 5: Adversarial contrastive divergence (ACD). Left: Interaction between the models. Red arrow indicates a chasing game, where the generator model chases the energy-based model, which runs toward the data distribution. Right: Contrastive divergence.

Next, consider the learning of the energy-based model [23, 24]. Recall in (8), if we replace the intractable $\pi_{\alpha_t}(x)$ in (8) by $p_\theta(x)$, we get

$$\min_\alpha \max_\theta \left[ \mathrm{KL}(q_{\mathrm{data}}(x) \| \pi_\alpha(x)) - \mathrm{KL}(p_\theta(x) \| \pi_\alpha(x)) \right], \tag{15}$$

or equivalently

$$\max_\alpha \min_\theta \left[ \mathrm{KL}(p_\theta(x) \| \pi_\alpha(x)) - \mathrm{KL}(q_{\mathrm{data}}(x) \| \pi_\alpha(x)) \right], \tag{16}$$

so that we avoid MCMC for sampling $\pi_{\alpha_t}(x)$, and the gradient for updating $\alpha$ becomes

$$\frac{\partial}{\partial \alpha} \left[ E_{q_{\mathrm{data}}}(f_\alpha(x)) - E_{p_\theta}(f_\alpha(x)) \right]. \tag{17}$$

Because of the negative sign in front of the second KL-divergence in (15), we need $\max_\theta$ in (15) or $\min_\theta$ in (16), so that the learning becomes adversarial. See Figure 5 for illustration. We call (15) the adversarial contrastive divergence (ACD). It underlies [23, 24].

The adversarial form (15) or (16) defines a chasing game with the following dynamics: the generator $p_\theta$ chases the energy-based model $\pi_\alpha$ in $\min_\theta \mathrm{KL}(p_\theta \| \pi_\alpha)$, while the energy-based model $\pi_\alpha$ seeks to get closer to $q_{\mathrm{data}}$ and get away from $p_\theta$. The red arrow in Figure 5 illustrates this chasing game. The result is that $\pi_\alpha$ lures $p_\theta$ toward $q_{\mathrm{data}}$. In the idealized case where $p_\theta$ always catches up with $\pi_\alpha$, $\pi_\alpha$ will converge to the maximum likelihood estimate $\min_\alpha \mathrm{KL}(q_{\mathrm{data}} \| \pi_\alpha)$, and $p_\theta$ converges to $\pi_\alpha$.

The above chasing game is different from VAE $\min_\theta \min_\phi \mathrm{KL}(Q \| P)$, which defines a cooperative game where $q_\phi$ and $p_\theta$ run toward each other.

Even though the above chasing game is adversarial, both models are running toward the data distribution. While the generator model runs after the energy-based model, the energy-based model runs toward the data distribution. As a consequence, the energy-based model guides or leads the generator model toward the data distribution. It is different from GAN. In GAN, the discriminator eventually becomes a confused one because the generated data become similar to the real data. In the above chasing game, the energy-based model becomes close to the data distribution.

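The updating of $\alpha$ by (17) bears similarity to Wasserstein GAN (WGAN), but unlike WGAN, $f_\alpha$ defines a probability distribution $\pi_\alpha$, and the learning of $\theta$ is based on $\min_\theta \mathrm{KL}(p_\theta(x) \| \pi_\alpha(x))$, which is a variational approximation to $\pi_\alpha$. This variational approximation only requires knowing $f_\alpha(x)$, without knowing $Z(\alpha)$. However, unlike $q_\phi(z|x)$, $p_\theta(x)$ is still intractable; in particular, its entropy does not have a closed form. Thus, we can again use variational approximation, by changing the problem $\min_\theta \mathrm{KL}(p_\theta(x) \| \pi_\alpha(x))$ to $\min_\theta \min_\phi \mathrm{KL}(p(z) p_\theta(x|z) \| \pi_\alpha(x) q_\phi(z|x))$, i.e., $\min_\theta \min_\phi \mathrm{KL}(P \| \Pi)$, which is analytically tractable. In fact,

$$\mathrm{KL}(P \| \Pi) = \mathrm{KL}(p_\theta(x) \| \pi_\alpha(x)) + \mathrm{KL}(p_\theta(z|x) \| q_\phi(z|x)). \tag{18}$$

Thus, we can modify (16) into $\max_\alpha \min_\theta \min_\phi [\mathrm{KL}(P \| \Pi) - \mathrm{KL}(Q \| \Pi)]$, because again $\mathrm{KL}(Q \| \Pi) = \mathrm{KL}(q_{\mathrm{data}} \| \pi_\alpha)$.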

Fitting the above together, we have the divergence triangle (13), which has a compact symmetric and anti-symmetric form.

### 3.3 Gap between two models

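We can write the objective function $D$ as

$$\begin{aligned}
D &= \left[ \mathrm{KL}(q_{\mathrm{data}}(x) \| p_\theta(x)) + \mathrm{KL}(q_\phi(z|x) \| p_\theta(z|x)) \right] \\
&\quad - \left[ \mathrm{KL}(q_{\mathrm{data}}(x) \| \pi_\alpha(x)) - \mathrm{KL}(p(z) p_\theta(x|z) \| \pi_\alpha(x) q_\phi(z|x)) \right] \\
&= \left[ \mathrm{KL}(q_{\mathrm{data}}(x) \| p_\theta(x)) - \mathrm{KL}(q_{\mathrm{data}}(x) \| \pi_\alpha(x)) \right] \\
&\quad + \mathrm{KL}(q_\phi(z|x) \| p_\theta(z|x)) + \mathrm{KL}(p(z) p_\theta(x|z) \| \pi_\alpha(x) q_\phi(z|x)).
\end{aligned}$$

Thus $D$ is an upper bound of the difference between the log-likelihood of the energy-based model and the log-likelihood of the generator model.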
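The step from the decomposition of $D$ to the bound can be made explicit. The two trailing KL-divergences are non-negative, and the $E_{q_{\mathrm{data}}}[\log q_{\mathrm{data}}(x)]$ terms cancel in the bracketed difference, which gives the following worked inequality in the notation used above:

```latex
\begin{aligned}
\mathrm{KL}(q_{\mathrm{data}} \,\|\, p_\theta)
  - \mathrm{KL}(q_{\mathrm{data}} \,\|\, \pi_\alpha)
 &= E_{q_{\mathrm{data}}}[\log \pi_\alpha(x)]
  - E_{q_{\mathrm{data}}}[\log p_\theta(x)], \\
D &\ge E_{q_{\mathrm{data}}}[\log \pi_\alpha(x)]
     - E_{q_{\mathrm{data}}}[\log p_\theta(x)].
\end{aligned}
```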

### 3.4 Two sides of KL-divergences

In the divergence triangle, the generator model $p_\theta$ appears on the right side of $\mathrm{KL}(Q \| P)$, and it also appears on the left side of $\mathrm{KL}(P \| \Pi)$. The former tends to interpolate or smooth the modes of $q_{\mathrm{data}}$, while the latter tends to seek after major modes of $q_{\mathrm{data}}$ while ignoring minor modes. As a result, the learned generator model tends to generate sharper images. As to the inference model $q_\phi$, it appears on the left side of $\mathrm{KL}(Q \| P)$, and it also appears on the right side of $\mathrm{KL}(P \| \Pi)$. The former is variational learning of the real data, while the latter corresponds to the sleep phase of wake-sleep learning, which learns from the dream data generated by $p_\theta$. The inference model thus can infer $z$ from both observed $x$ and generated $x$.

In fact, if we define

$$D_0 = \mathrm{KL}(q_{\mathrm{data}} \| p_\theta) + \mathrm{KL}(p_\theta \| \pi_\alpha) - \mathrm{KL}(q_{\mathrm{data}} \| \pi_\alpha), \tag{19}$$

we have

$$D = D_0 + \mathrm{KL}(q_\phi(z|x) \| p_\theta(z|x)) + \mathrm{KL}(p_\theta(z|x) \| q_\phi(z|x)). \tag{20}$$

(19) is the divergence triangle between the three marginal distributions on $x$, where $p_\theta$ appears on both sides of KL-divergences. (20) is the variational scheme to make the marginal distributions into the joint distributions, which are more tractable. In (20), the two KL-divergences have reverse orders.
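The decomposition in (20) can be checked numerically on a small discrete example. The sketch below is illustrative only: it replaces the continuous $z$ and $x$ with finite spaces and draws all distributions at random. The two conditional KL-divergences in (20) are read as averages over $q_{\mathrm{data}}(x)$ and $p_\theta(x)$, respectively, which is the reading under which the identity holds exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
nz, nx = 3, 4   # sizes of the toy latent and data spaces

def normalize(a, axis=None):
    return a / a.sum(axis=axis, keepdims=True)

def kl(p, q):
    """KL divergence between two discrete distributions of the same shape."""
    return float(np.sum(p * np.log(p / q)))

# All two-dimensional arrays are indexed [z, x].
q_data = normalize(rng.random(nx))                      # q_data(x)
pi = normalize(rng.random(nx))                          # pi_alpha(x)
q_phi = normalize(rng.random((nz, nx)), axis=0)         # q_phi(z|x), columns sum to 1
p_z = normalize(rng.random(nz))                         # p(z)
p_x_given_z = normalize(rng.random((nz, nx)), axis=1)   # p_theta(x|z), rows sum to 1

Q = q_phi * q_data                 # Q(z, x) = q_data(x) q_phi(z|x)
P = p_z[:, None] * p_x_given_z     # P(z, x) = p(z) p_theta(x|z)
Pi = q_phi * pi                    # Pi(z, x) = pi_alpha(x) q_phi(z|x)

p_x = P.sum(axis=0)                # marginal p_theta(x)
p_post = P / p_x                   # posterior p_theta(z|x)

D = kl(Q, P) + kl(P, Pi) - kl(Q, Pi)
D0 = kl(q_data, p_x) + kl(p_x, pi) - kl(q_data, pi)
kl_q_post = float(np.sum(Q * np.log(q_phi / p_post)))   # averaged over q_data(x)
kl_post_q = float(np.sum(P * np.log(p_post / q_phi)))   # averaged over p_theta(x)
```

Up to floating-point error, `D` equals `D0 + kl_q_post + kl_post_q`, and `kl(Q, Pi)` equals `kl(q_data, pi)`, matching the cancellation of $q_\phi(z|x)$ noted for the third side of the triangle.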

### 3.5 Training algorithm

The three models are each parameterized by convolutional neural networks. The joint learning under the divergence triangle can be implemented by stochastic gradient descent, where the expectations are replaced by the sample averages. Algorithm 1 describes the procedure, which is illustrated in Figure 6.

Fig. 6: Joint learning of three models. The shaded circles $z$ and $x$ represent variables that can be sampled from the true distributions, i.e., $N(0, I_d)$ and the empirical data distribution, respectively. $\tilde{x}$ and $\tilde{z}$ are generated samples using the generator model and the inference model, respectively. The solid line with arrow represents the conditional mapping and the dashed line indicates that the matching loss is involved.
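Algorithm 1 is not reproduced in this text, so the following is only a sketch of how the sample-average objectives could be formed from (13), (14), (17), and (18), not the authors' procedure. It assumes a Gaussian inference model with diagonal covariance, a fixed noise level `sigma`, and hypothetical callables `g`, `f`, and `q` standing in for the three networks. Terms that are constant in the parameters being updated, including $\log Z(\alpha)$, which cancels between $\mathrm{KL}(P \| \Pi)$ and $\mathrm{KL}(Q \| \Pi)$, are dropped. Gradients would come from an automatic differentiation library, which is omitted here.

```python
import numpy as np

def gaussian_logpdf(v, mean, log_var):
    """Log-density of a diagonal Gaussian, summed over the last axis."""
    return -0.5 * np.sum(
        np.log(2.0 * np.pi) + log_var + (v - mean) ** 2 / np.exp(log_var), axis=-1
    )

def triangle_objectives(x, z, g, f, q, sigma, rng):
    """Mini-batch estimates of D, up to constants.

    x: observed batch (n, D); z: prior samples (n, d).
    g(z) -> generator mean; f(x) -> value per example; q(x) -> (mu, log_var).
    Returns (loss for theta and phi to minimize, objective for alpha to maximize).
    """
    # KL(Q||P): negative evidence lower bound on observed data (the VAE term).
    mu, log_var = q(x)
    z_inf = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)
    kl_to_prior = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)
    recon = np.sum((x - g(z_inf)) ** 2, axis=-1) / (2.0 * sigma ** 2)
    kl_qp = np.mean(kl_to_prior + recon)

    # KL(P||Pi) - KL(Q||Pi): uses synthesized pairs (z, x_gen).
    mean_gen = g(z)
    x_gen = mean_gen + sigma * rng.standard_normal(mean_gen.shape)
    mu_gen, log_var_gen = q(x_gen)
    sleep = -np.mean(gaussian_logpdf(z, mu_gen, log_var_gen))  # -E_P[log q_phi(z|x)]
    critic = np.mean(f(x)) - np.mean(f(x_gen))                 # bracket in Eq. (17)

    loss_theta_phi = kl_qp - np.mean(f(x_gen)) + sleep
    return loss_theta_phi, critic
```

In this reading, one iteration would take a gradient ascent step on `critic` with respect to $\alpha$ and a gradient descent step on `loss_theta_phi` with respect to $\theta$ and $\phi$. The `sleep` term is where the inference model learns from synthesized data, and the `- np.mean(f(x_gen))` term is where the generator chases the energy-based model.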

## 4 Experiments

In this section, we demonstrate that the divergence triangle is capable of learning an energy-based model with a well-behaved energy landscape, a generator model with highly realistic samples, and an inference model with faithful reconstruction ability. We also show competitive performance on four tasks: image generation, test image reconstruction, energy landscape mapping, and learning from incomplete images. For image generation, we consider spatial stationary texture images, temporal stationary dynamic textures, and general object categories. We also test our model on large-scale datasets and high-resolution images.

The images are resized and scaled to

, no further pre-processing is needed. The network parameters are initialized with zero-mean Gaussian with standard deviation

and optimized using Adam . Network weights are decayed with rate

, and batch normalization

 is used. We refer to the Appendix for the model specifications.

### 4.1 Image generation

In this experiment, we evaluate the visual quality of generator samples from our divergence triangle model. If the generator model is well-trained, then the obtained samples should be realistic and match the visual features and contents of training images.

#### 4.1.1 Object generation

For object categories, we test our model on two commonly-used datasets of natural images: CIFAR-10 and CelebA. For the CelebA face dataset, we randomly select 9,000 images for training and another 1,000 images for testing in the reconstruction task. The face images are resized, while CIFAR-10 images remain at their native $32 \times 32$ resolution. The qualitative results of generated samples for objects are shown in Figure 7. We further evaluate our model quantitatively using the Inception Score (IS) for CIFAR-10 and the Frechet Inception Distance (FID) for CelebA faces. We generate 50,000 random samples for the computation of the Inception Score and 10,000 random samples for the computation of the FID score. Table I shows the IS and FID scores of our model compared with VAE, DCGAN, WGAN, CoopNet, CEGAN, ALI, and ALICE.

Fig. 7: Generated samples. Left: generated samples on CIFAR-10 dataset. Right: generated samples on CelebA dataset.

Note that for the Inception Score on CIFAR-10, we borrowed the scores from the relevant papers, and for the FID score on 9,000 CelebA faces, we re-implemented the baselines or used the available code with a network structure similar to that of our model. It can be seen that our model achieves competitive performance compared to recent baseline models.

#### 4.1.2 Large-scale dataset

We also train our model on large-scale datasets, including a down-sampled $32 \times 32$ version of ImageNet [44, 45] (roughly 1 million images) and the Large-scale Scene Understanding (LSUN) dataset. For the LSUN dataset, we consider the bedroom, tower, and church outdoor categories, which contain roughly 3 million, 0.7 million, and 0.1 million images, respectively, and were resized to $64 \times 64$. The network structures are similar to the ones used in object generation, with twice the number of channels, and batch normalization is used in all three models. Generated samples are shown in Figure 8.

Fig. 8: Generated samples. Left: $32 \times 32$ ImageNet. Right: $64 \times 64$ LSUN (bedroom).

#### 4.1.3 High-resolution synthesis

Fig. 9: Generated samples with $1{,}024 \times 1{,}024$ resolution drawn from $g_\theta(z)$ with 512-dimensional latent vector $z \sim N(0, I_d)$ for CelebA-HQ.

Fig. 10: High-resolution synthesis from the generator model $g_\theta(z)$ with linear interpolation in latent space (i.e., $(1 - \alpha) \cdot z_0 + \alpha \cdot z_1$) for CelebA-HQ.

In this section, we adopt a layer-wise training scheme to learn models on CelebA-HQ with resolutions of up to $1{,}024 \times 1{,}024$ pixels. Layer-wise training dates back to initializing deep neural networks by Restricted Boltzmann Machines to overcome optimization hurdles [48, 49] and has been resurrected in progressive GANs, although the order of layer transitions is reversed such that top layers are trained first. This resembles a Laplacian pyramid, in which images are generated in a coarse-to-fine fashion.

As in , the training starts with down-sampled images with a spatial resolution of while progressively increasing the size of the images and number of layers. All three models are grown in synchrony where convolutions project between RGB and feature. In contrast to , we do not require mini-batch discrimination to increase variation of nor gradient penalty to preserve -Lipschitz continuity of .

Figure 9 depicts high-fidelity synthesis at a resolution of $1{,}024 \times 1{,}024$ pixels sampled from the generator model $g_\theta(z)$ on CelebA-HQ. Figure 10 illustrates linear interpolation in latent space (i.e., $(1 - \alpha) \cdot z_0 + \alpha \cdot z_1$), which indicates diversity in the samples.
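The latent interpolation described in the Fig. 10 caption is simple to express in code. The sketch below is illustrative: the function name and the number of steps are assumptions, and the interpolation weight is named `a` to avoid confusion with the energy-based model parameters $\alpha$. Each row of the result would be passed through the generator $g_\theta$ to obtain one image.

```python
import numpy as np

def interpolate_latents(z0, z1, n_steps):
    """Latent vectors on the segment (1 - a) * z0 + a * z1 for a in [0, 1]."""
    a = np.linspace(0.0, 1.0, n_steps)[:, None]   # column of interpolation weights
    return (1.0 - a) * z0 + a * z1

rng = np.random.default_rng(0)
# Two 512-dimensional latent vectors drawn from the prior N(0, I_d).
z0, z1 = rng.standard_normal(512), rng.standard_normal(512)
path = interpolate_latents(z0, z1, 8)   # shape (8, 512), one latent per row
```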

Therefore, the joint learning in the triangle formulation is not only able to train the three models with stable optimization, but it also achieves synthesis with high fidelity.

#### 4.1.4 Texture synthesis

We consider texture images, which are spatial stationary and contain repetitive patterns. The texture images are resized to . Separate models are trained on each image. We start from the latent factor of size and use five convolutional-transpose layers with kernel size and up-sampling factor for the generator network. The layers have , , , and

filters, respectively, and ReLU non-linearity between each layer is used. The inference model has the inverse or “mirror” structure of generator model except that we use convolutional layers and ReLU with leak factor

. The energy-based model has three convolutional layers. The first two layers have kernel size

with stride

for and filters respectively, and the last layer has filters with kernel size and stride .

Representative examples are shown in Figure 11. The three texture synthesis results in each row are obtained by sampling different latent factors from the prior distribution $p(z)$. Notice that although we only have one texture image for training, the proposed divergence triangle model can effectively utilize the repetitive patterns, thus generating realistic texture images with different configurations.

Fig. 11: Generated texture patterns. For each row, the left one is the training texture, and the remaining images are 3 textures generated by the divergence triangle.

Fig. 12: Generated dynamic texture patterns. The top row shows the frames from the training video, and the bottom row shows the frames from the generated video.

#### 4.1.5 Dynamic texture synthesis

Our model can also be used for dynamic patterns which exhibit stationary regularity in the temporal domain. The training video clips are selected from Dyntex database  and resized to pixels pixels frames. Inspired by recent work [52, 53], we adopt spatial-temporal models for dynamic patterns that are stationary in the temporal domain but non-stationary in the spatial domain. Specifically, we start from latent factors of size for each video clip and we adopt the same spatial-temporal convolutional transpose generator network as in  except we use kernel size for the second layer. For the inference model, we use spatial-temporal convolutional layers. The first layers have kernel size with upsampling factor and the last layer is fully-connected in spatial domain but convolutional in the temporal domain, yielding re-parametrized and which have the same size the as latent factors. For the energy-based model, we use three spatial-temporal convolutional layers. The first two layers have kernel size with up-sample factor in all directions, but the last layer is fully-connected in the spatial domain but convolutional with kernel size and upsample by in the temporal domain. Each layer has , and filters, respectively. Some of the synthesis results are shown in Figure 12. Note, we sub-sampled frames of the training and generated video clips and we only show them in the first batch for illustration.

### 4.2 Test image reconstruction

Fig. 13: Test image reconstruction. Top: CIFAR-10. Bottom: CelebA. Left: test images. Right: reconstructed images.

In this experiment, we evaluate the reconstruction ability of our model on a held-out test image dataset. This is a strong indicator of the accuracy of our inference model. Specifically, if our divergence triangle model is well-learned, then the inference model should match the true posterior of the generator model, i.e., . Therefore, given test signal , its reconstruction should be close to , i.e., . Figure 13 shows the test images and their reconstructions on CIFAR-10 and CelebA.

For CIFAR-10, we use its own 10,000 test images, while for CelebA we use the held-out 1,000 test images as stated above. The reconstruction quality is further measured by per-pixel mean squared error (MSE). Table II shows the per-pixel MSE of our model compared to WS , VAE , ALI , ALICE .
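As a minimal sketch of how such a per-pixel MSE could be computed, the snippet below averages the squared difference over all images, pixels, and channels. The helper names `per_pixel_mse` and `reconstruct`, and the callables `encode` and `decode` standing in for the inference and generator models, are hypothetical; the pixel scaling used in the paper's tables is not stated here, so the absolute numbers depend on that assumption.

```python
import numpy as np


def per_pixel_mse(images, reconstructions):
    """Squared error averaged over every image, pixel, and channel."""
    images = np.asarray(images, dtype=np.float64)
    reconstructions = np.asarray(reconstructions, dtype=np.float64)
    if images.shape != reconstructions.shape:
        raise ValueError("images and reconstructions must have the same shape")
    return float(np.mean((images - reconstructions) ** 2))


def reconstruct(x, encode, decode):
    """Hypothetical helper: infer latent factors with `encode`,
    then map them back to image space with `decode`."""
    return decode(encode(x))
```

With trained models, `per_pixel_mse(x_test, reconstruct(x_test, encode, decode))` would give one number per test set under these assumptions.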

Note, we do not consider methods without inference models on training data, including variants of GANs and cooperative training, since it is infeasible to test such models using image reconstruction.

### 4.3 Energy landscape mapping

In the following, we evaluate the learned energy-based model by mapping the macroscopic structure of the energy landscape. When following an MLE regime by minimizing , we expect the energy-function to encode as local energy minima. Moreover, should form minima for unseen images and a macroscopic landscape structure in which basins of minima are distinctly separated by energy barriers. Hopfield observed that such a landscape is a model of associative memory .

In order to learn a well-formed energy-function, in Algorithm 1, we perform multiple -steps such that the samples are sufficiently “close” to the local minima of . This avoids the formation of energy minima not resembling the data. The variational approximation of entropy of the marginal generator distribution preserves diversity in the samples, avoiding mode-collapse.

Fig. 14: Illustration of the disconnectivity-graph depicting the basin structure of the learned energy-function fα(x) for the MNIST dataset. Each column represents the set of at most 12 basin members ordered by energy where circles indicate the total number of basin members. Vertical lines encode minima depth in terms of energy and horizontal lines depict the lowest known barrier at which two basins merge in the landscape. Basins with fewer than 4 members were omitted for clarity.

Fig. 15: Illustration of the disconnectivity-graph depicting the basin structure of the learned energy-function for the Fashion-MNIST dataset. Each column represents the set of at most 12 basin members ordered by energy where circles indicate the total number of basin members. Vertical lines encode minima depth in terms of energy and horizontal lines depict the lowest known barrier at which two basins merge in the landscape. Basins with fewer than 4 members were omitted for clarity.

Fig. 16: Learning from incomplete data from the CelebA dataset. The 9 columns belong to experiments P.5, P.7, MB10, MB10, B20, B20, B30, B30, B30 respectively. Row 1: original images, not observed in learning stage. Row 2: training images. Row 3: recovered images using VAE . Row 4: recovered images using ABP . Row 5: recovered images using our method.

Fig. 17: Image generation from different models learned from training images of the CelebA dataset with 30×30 occlusions. Left: images generated from VAE model . Middle: images generated from ABP model . Right: images generated from our proposed divergence triangle model.

To verify that (i) local minima of resemble and (ii) minima are separated by significant energy barriers, we follow the approach used in . When clustering with respect to energetic barriers, the landscape is partitioned into Hopfield basins of attraction whereby each point on the landscape is mapped onto a local minimum by a steepest-descent path. The similarity measure used for hierarchical clustering is the barrier energy that separates any two regions. Given a pair of local minima, we estimate the barrier as the highest energy along a linear interpolation . If for some energy threshold , then belong to the same basin. The clustering is repeated recursively until all minima are clustered together. Such graphs have come to be referred to as disconnectivity graphs (DG) .
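The barrier estimate and one level of the clustering can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `energy` is any callable returning a scalar, the interpolation is sampled at a hypothetical `n_points` evenly spaced positions, and two minima are assumed to share a basin when their estimated barrier falls below the threshold. Building a full disconnectivity graph would repeat the merge at increasing thresholds until a single cluster remains.

```python
import numpy as np


def barrier_energy(energy, x_a, x_b, n_points=50):
    """Highest energy found on the straight line between x_a and x_b."""
    ts = np.linspace(0.0, 1.0, n_points)
    return max(energy((1.0 - t) * x_a + t * x_b) for t in ts)


def merge_basins(energy, minima, threshold, n_points=50):
    """Assign a cluster id to each minimum; minima whose estimated
    barrier is below `threshold` end up in the same cluster."""
    parent = list(range(len(minima)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(minima)):
        for j in range(i + 1, len(minima)):
            barrier = barrier_energy(energy, minima[i], minima[j], n_points)
            if barrier < threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(len(minima))]
```

A linear path generally overestimates the true barrier, since a lower-energy curved path may exist; this matches the "lowest known barrier" wording in the figure captions.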

We conduct energy landscape mapping experiments on the MNIST  and Fashion-MNIST  datasets, each containing grayscale images of size pixels depicting handwritten digits and fashion products from categories, respectively. The energy landscape mapping is not without limitations, because it is practically impossible to locate all local modes. The local modes located by our algorithm (see Figure 14 for the MNIST dataset) suggest that the learned energy function is well-formed: it not only encodes meaningful images as minima, but also forms meaningful macroscopic structure. Moreover, within basins the local minima have a high degree of purity (i.e. digits within a basin belong to the same class), and the energy barriers between basins seem informative (i.e. basins of ones and sixes form pure super-basins). Figure 15 depicts the energy landscape mapping on Fashion-MNIST.
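The text describes basin purity qualitatively. One hypothetical way to quantify it, assuming each located minimum has a basin id and a class label, is the fraction of basin members that share the majority class. The function name `basin_purity` and this particular definition are assumptions for illustration only.

```python
from collections import Counter


def basin_purity(basin_ids, class_labels):
    """Map each basin id to the share of its members in the majority class."""
    members = {}
    for basin, label in zip(basin_ids, class_labels):
        members.setdefault(basin, []).append(label)
    return {
        basin: Counter(labels).most_common(1)[0][1] / len(labels)
        for basin, labels in members.items()
    }
```

A basin containing only one digit class would score 1.0 under this definition.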

Potential applications include unsupervised classification in which energy barriers act as a geodesic similarity measure which captures perceptual distance (as opposed to e.g. distance), weakly-supervised classification with one label per basin, or reconstruction of incomplete data (i.e. Hopfield content-addressable memory or image inpainting).

### 4.4 Learning from incomplete images

The divergence triangle can be used to learn from occluded images. This task is challenging , because only parts of the images are observed, so the model needs to learn sufficient information to recover the occluded parts. Generative models with an inferential mechanism can be used for this task. Notably,  proposed to recover incomplete images using alternating back-propagation (ABP), which has an MCMC-based inference step to refine the latent factors and perform reconstruction iteratively. VAEs [59, 15] build the inference model on occluded images, and can also be adapted for this task. The adaptation proceeds by filling the missing parts with the average pixel intensity in the beginning, then iteratively re-updating the missing parts using reconstructed values. Unlike VAEs, which only consider the un-occluded parts of training data, our model utilizes the generated samples, which become gradually recovered during training, resulting in improved recovery accuracy and sharp generation. Note that learning from incomplete data can be difficult for variants of GANs [19, 24, 20, 22] and cooperative training , since inference cannot be performed directly on the occluded images.

We evaluate our model on 10,000 images randomly chosen from the CelebA dataset. The selected images are further center cropped as in . Similar to VAEs, we zero-fill the occluded parts in the beginning, then iteratively update the missing values using reconstructed images obtained from the generator model. Three types of occlusions are used: (1) Salt and pepper noise which randomly covers (P.5) and (P.7) of the image. (2) Multiple block occlusion which has 10 random blocks of size (MB10). (3) Single block occlusion where we randomly place a large and block on each image, denoted by B20 and B30 respectively. Table III shows the recovery errors using VAE , ABP  and our triangle model, where the error is defined as the per-pixel absolute difference (relative to the range of pixel values) between the recovered image on the occluded pixels and the ground truth image.
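The occlusion patterns, the refill step, and the recovery error described above can be sketched as follows. All function names are hypothetical, images are assumed to be single-channel 2D arrays, and the mask convention (True marks an occluded pixel) is an assumption. The occluded fraction, block size, and image size are left as parameters rather than taken from the paper.

```python
import numpy as np


def salt_pepper_mask(shape, fraction, rng):
    """Occlude each pixel independently with probability `fraction`."""
    return rng.random(shape) < fraction


def block_mask(shape, block_size, n_blocks, rng):
    """Occlude `n_blocks` square blocks placed uniformly at random."""
    height, width = shape
    mask = np.zeros(shape, dtype=bool)
    for _ in range(n_blocks):
        top = rng.integers(0, height - block_size + 1)
        left = rng.integers(0, width - block_size + 1)
        mask[top:top + block_size, left:left + block_size] = True
    return mask


def refill(observed, reconstruction, mask):
    """Keep observed pixels; overwrite occluded ones with the reconstruction."""
    return np.where(mask, reconstruction, observed)


def recovery_error(recovered, truth, mask, pixel_range=1.0):
    """Mean absolute difference on occluded pixels, relative to pixel_range."""
    diff = np.abs(np.asarray(recovered, dtype=np.float64)
                  - np.asarray(truth, dtype=np.float64))
    return float(diff[mask].mean() / pixel_range)
```

In a training loop one would start from `observed = np.where(mask, 0.0, image)` (zero-fill) and call `refill` after each reconstruction, so that only the occluded pixels are ever overwritten.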

It can be seen that our model consistently out-performs the VAE model for different occlusion patterns. For structured occlusions (i.e., multiple and single blocks), the un-occluded parts contain more meaningful configurations that improve learning of the generator through the energy-based model, which in turn generates more meaningful samples to refine our inference model. This is consistent with the superior results compared to ABP . For unstructured occlusions (i.e., salt and pepper noise), however, ABP achieves better recovery; a possible reason is that the un-occluded parts contain less meaningful patterns, which offer limited help for learning the generator and inference model. Our model synthesizes sharper and more realistic images from the generator learned on occluded images. See Figure 17, in which images are occluded with random blocks.

## 5 Conclusion

We proposed a probabilistic framework, namely the divergence triangle, for joint learning of the energy-based model, the generator model, and the inference model. The divergence triangle forms a compact learning functional for the three models and naturally unifies aspects of maximum likelihood estimation [5, 25], variational auto-encoder [15, 16, 17], adversarial learning [23, 24], contrastive divergence , and the wake-sleep algorithm .

An extensive set of experiments demonstrated learning of a well-behaved energy-based model, realistic generator model as well as an accurate inference model. Moreover, experiments showed that the proposed divergence framework can be effective in learning directly from incomplete data.

In future work, we aim to extend the formulation to learn interpretable generator and energy-based models with multiple layers of sparse or semantically meaningful latent variables or features [60, 61]. Further, it would be desirable to unify the generator, energy-based and inference models into a single model [62, 63] by allowing them to share parameters and nodes instead of having separate sets of parameters and nodes.

## Acknowledgments

The work is supported by DARPA XAI project N66001-17-2-4029; ARO project W911NF1810296; and ONR MURI project N00014-16-1-2007; and Extreme Science and Engineering Discovery Environment (XSEDE) grant ASC170063. We thank Dr. Tianfu Wu, Shuai Zhu and Bo Pang for helpful discussions.

## Model Architecture

We describe the basic network structures, in particular for object generation. We use the following notation:

• conv(n): convolutional operation with n output feature maps.

• convT(n): convolutional transpose operation with n output feature maps.

• LReLU: Leaky-ReLU nonlinearity with default leaky factor 0.2.

• BN: Batch normalization.

The structures for CelebA (where 9,000 random images are chosen) are shown in Table IV. The structures for CIFAR-10 and MNIST/Fashion-MNIST are shown in Table V and Table VI, respectively.