Multimodal Conditional Learning with Fast Thinking Policy-like Model and Slow Thinking Planner-like Model

02/07/2019, by Jianwen Xie, et al.

This paper studies the supervised learning of the conditional distribution of a high-dimensional output given an input, where the output and input belong to two different modalities, e.g., the output is an image and the input is a sketch. We solve this problem by learning two models that bear similarities to those in reinforcement learning and optimal control. One model is policy-like. It generates the output directly by a non-linear transformation of the input and a noise vector. This amounts to fast thinking because the conditional generation is accomplished by direct sampling. The other model is planner-like. It learns an objective function in the form of a conditional energy function, so that the output can be generated by optimizing the objective function, or more rigorously by sampling from the conditional energy-based model. This amounts to slow thinking because the sampling process is accomplished by an iterative algorithm such as Langevin dynamics. We propose to learn the two models jointly, where the fast thinking policy-like model serves to initialize the sampling of the slow thinking planner-like model, and the planner-like model refines the initial output by an iterative algorithm. The planner-like model learns from the difference between the refined output and the observed output, while the policy-like model learns from how the planner-like model refines its initial output. We demonstrate the effectiveness of the proposed method on various image generation tasks.


1 Introduction

When we learn to solve a problem, we can learn a policy that directly maps the problem to the solution. This amounts to fast thinking, which underlies reflexive or impulsive behavior, or muscle memory, and it can happen when one is emotional or under time pressure. We may also learn an objective function or value function that assigns values to candidate solutions, and we optimize the objective function by an iterative algorithm to find the most valuable solution. This amounts to slow thinking, which underlies planning, searching, or optimal control, and it can happen when one is calm or has time to think things through.

While the fast thinking policy and the slow thinking planner are commonly studied for solving sequential decision problems such as reinforcement learning [25] and optimal control [4], they can also be used to solve non-sequential decision problems. For instance, a reviewer's decision on whether to accept a paper can be based on gut feeling (fast thinking) or careful deliberation (slow thinking), and this is a one-shot non-sequential decision problem. We shall study such a problem in this paper.

Specifically, we shall study the supervised learning of the conditional distribution of a high-dimensional output given an input, where the output and input belong to two different modalities. For instance, the output may be an image, while the input may be a class label, a text description, or a sketch. The input defines the problem, and the output is the solution. We also refer to the input as the source or condition, and the output as the target.

We solve this problem by learning two models cooperatively. One model is policy-like. It generates the output directly by a non-linear transformation of the input and a noise vector, where the noise vector is to account for randomness or uncertainty in the output. This amounts to fast thinking because the conditional generation is accomplished by direct sampling. The other model is planner-like. It learns an objective function in the form of a conditional energy function, so that the output can be generated by optimizing the objective function, or more rigorously by sampling from the conditional energy-based model, where the sampling is to account for randomness and uncertainty. This amounts to slow thinking because the sampling is accomplished by an iterative algorithm such as Langevin dynamics, which is an example of Markov chain Monte Carlo (MCMC). We propose to learn the two models jointly, where the policy-like model serves to initialize the sampling of the planner-like model, and the planner-like model refines the initial solution by an iterative algorithm. The planner-like model learns from the difference between the refined solution and the observed solution, while the policy-like model learns from the difference between the initial solution and the refined solution.

Figure 1: The policy-like model initializes the planner-like model, which refines the initial solution. The policy learns from the planner's refinement, while the planner learns by comparing to the observed solution.

Figure 1 conveys the basic idea. The algorithm iterates two steps, a solving step and a learning step. The solving step consists of two stages: Solve-fast: The policy-like model generates the initial solution. Solve-slow: The planner-like model refines the initial solution. The learning step also consists of two parts: Learn-policy: The policy-like model learns from how the planner-like model refines its initial solution. Learn-planner: The planner-like model updates its objective function by shifting its high value region from the refined solution to the observed solution.


(a) Learn-policy by mapping shift.

(b) Learn-planner by value shift.

Figure 2: (a) Learn-policy by mapping shift: the policy shifts its mapping toward the refined solution. (b) Learn-planner by value shift: the planner shifts the high value region or mode of its objective function toward the observed solution.

Figure 2(a) illustrates the Learn-policy step. In the Solve-fast step, the policy-like model generates the latent noise vector, which, together with the input condition, is mapped to the initial solution. In the Learn-policy step, the policy-like model updates its parameters so that it maps the latent vector to the refined solution, in order to absorb the refinement made by the planner-like model. Because the latent vector is known, it does not need to be inferred, and the learning is easy.

Figure 2(b) illustrates the Learn-planner step. In the Solve-slow step, the planner-like model finds the refined solution in a high value region around a mode of the objective function. In the Learn-planner step, the planner-like model updates its parameters so that the objective function shifts its high value region around the mode toward the observed solution, so that in the next iteration the refined solution will be closer to the observed solution.

The planner-like model shifts its mode toward the observed solution, while inducing the policy-like model to map the latent vector to its mode.

Learning a policy-like model is like mimicking “how”, while learning a planner-like model is like trying to understand “why” in terms of goal or value underlying the action.

Why planner? The reason we need a planner-like model in addition to a policy-like model is that it is often easier to learn the objective function than to learn to generate the solution directly, since it is always easier to demand or desire something than to actually produce it. Because of its relative simplicity, the learned objective function can be more generalizable than the learned policy. For instance, in an unfamiliar situation, we tend to be tentative, relying on slow thinking planning rather than fast thinking habits.

Efficiency. Even though we use the wording "slow thinking", it is only relative to "fast thinking". In fact, the slow thinking planning is usually fast enough, especially if it is jumpstarted by the fast thinking policy, and there is no problem scaling up our method to big datasets. Therefore the time efficiency of the slow thinking method is not a concern.

Student-teacher vs actor-critic. We may consider the policy-like model as a student model, and the planner-like model as a teacher model. The teacher refines the initial solution of the student by a refinement process, and distills the refinement process into the student. This is different from the actor-critic relationship in (inverse) reinforcement learning [1, 37, 9] because the critic does not refine the actor’s solution by a slow thinking process.

Associative memory. The two models may also be considered as associative memory [10]. While the policy-like model is like sudden recall, the planner-like model is like rumination, filling in and playing out details.

We apply our learning method to various conditional image generation tasks. Our experiments show that the proposed method is effective compared to other methods, such as those based on GAN [6].

2 Contributions and related work

This paper proposes a novel method for supervised learning of high-dimensional conditional distributions by learning a fast thinking policy-like model and a slow thinking planner-like model. We show the effectiveness of our method on conditional image generation and recovery tasks.

Perhaps more importantly, we propose a method for conditional learning that differs from GAN-based methods. Unlike GAN methods, our method has a learned objective function to guide a slow thinking process for sampling or optimization. The proposed strategy may be applied to a broad range of problems in AI. The interaction between the fast thinking policy and the slow thinking planner may also be of interest to cognitive science.

The following are related themes of research.

Inverse reinforcement learning. Although we adopt terminologies from inverse reinforcement learning and inverse optimal control [1, 37] to explain our method, we are concerned with supervised learning instead of reinforcement learning. Unlike the action space in reinforcement learning, the output in our work is of a much higher dimension, a fact that also distinguishes our work from common supervised learning problems such as classification. As a result, the policy-like model needs to transform a latent noise vector to generate the initial solution, and this is different from the policy in reinforcement learning, where the policy is defined by the conditional distribution of action given state, without resorting to a latent vector.

Conditional random field. The objective function and the conditional energy-based model can also be considered a form of conditional random field [14]. Unlike traditional conditional random field, our conditional energy function is defined by a deep network, and its sampling process is jumpstarted by a policy-like model.

Multimodal generative learning. Learning the joint probability distribution of signals of different modalities enables us to recover or generate one modality based on the others. For example, [33] learns a dual-wing harmonium model for image and text data, [20] learns a stacked multimodal auto-encoder on video and audio data, and [24] learns a multimodal deep Boltzmann machine for joint image and text modeling. Our work focuses on the conditional distribution of one modality given another, and our method involves the cooperation between two types of models.

Conditional adversarial learning. A popular method for multimodal learning is the conditional GAN, where both the generator and discriminator networks are conditioned on the source signal, such as discrete class labels, text, etc. For example, [18, 5] use conditional GANs for image synthesis based on class labels, and [21, 35] study text-conditioned image synthesis. Other examples include multimodal image-to-image mapping [28], image-to-image translation [11, 36, 17], and super-resolution [16]. Our work studies similar problems. The difference is that our method is based on a conditional energy function and an iterative algorithm guided by this objective function. Existing adversarial learning methods, including those in inverse reinforcement learning [9], do not involve this slow thinking planning process.

Cooperative learning. Just as the conditional GAN is inspired by the original GAN [6], our learning method is inspired by the recent work of [30], where the models are unconditioned. While unconditioned generation is interesting, conditional generation and recovery are much more useful in applications. The conditional distributions are also much better behaved than the unconditioned distributions, because the former tend to have far fewer modes, so that conditional sampling and learning are easier and more stable.

3 Conditional learning

Let Y be the D-dimensional signal of the target modality, and let C be the signal of the source modality, where "C" stands for "condition". C defines the problem, and Y is the solution. Our goal is to learn the conditional distribution p(Y | C) of the target signal Y (solution) given the source signal C (problem) as the condition. We shall learn p(Y | C) from the training dataset of pairs {(Y_i, C_i), i = 1, ..., n} with the fast thinking policy-like model and the slow thinking planner-like model.

3.1 Slow thinking planner-like model

The planner-like model is based on an objective function or value function f_θ(Y, C) defined on (Y, C). f_θ(Y, C) can be defined by a bottom-up convolutional network (ConvNet), where θ collects all the weight and bias parameters. f_θ(Y, C) defines a joint energy-based model [31]:

p(Y, C; θ) = (1 / Z(θ)) exp[ f_θ(Y, C) ],    (1)

where Z(θ) is the normalizing constant.

Fixing the source signal C, f_θ(Y, C) defines the value of the solution Y for the problem defined by C, and −f_θ(Y, C) defines the conditional energy function. The conditional probability p(Y | C; θ) is given by

p(Y | C; θ) = p(Y, C; θ) / p(C; θ) = (1 / Z(C; θ)) exp[ f_θ(Y, C) ],    (2)

where Z(C; θ) = ∫ exp[ f_θ(Y, C) ] dY. The learning of this model seeks to maximize the conditional log-likelihood function

L(θ) = (1/n) Σ_{i=1}^{n} log p(Y_i | C_i; θ),    (3)

whose gradient is

L'(θ) = (1/n) Σ_{i=1}^{n} [ ∂/∂θ f_θ(Y_i, C_i) − E_{p(Y | C_i; θ)}( ∂/∂θ f_θ(Y, C_i) ) ],    (4)

where E_{p(Y | C; θ)} denotes the expectation with respect to p(Y | C; θ). The identity underlying (4) is ∂/∂θ log Z(C; θ) = E_{p(Y | C; θ)}[ ∂/∂θ f_θ(Y, C) ].

The expectation in (4) is analytically intractable and can be approximated by drawing samples from p(Y | C; θ) and then computing the Monte Carlo average. This can be solved by an iterative algorithm, which is a slow thinking process. One solver is the Langevin dynamics for sampling Y ~ p(Y | C; θ). It iterates the following step:

Y_{τ+1} = Y_τ + (s² / 2) ∂/∂Y f_θ(Y_τ, C) + s U_τ,  U_τ ~ N(0, I_D),    (5)

where τ indexes the time steps of the Langevin dynamics, s is the step size, and U_τ ~ N(0, I_D) is Gaussian white noise. D is the dimensionality of Y. A Metropolis-Hastings acceptance-rejection step can be added to correct for the finite step size s. The Langevin dynamics is gradient descent on the energy function −f_θ(Y, C), plus a noise term for diffusion so that it samples the distribution instead of being trapped in local modes.
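To make the slow thinking step concrete, here is a minimal PyTorch sketch of the Langevin refinement in (5). It assumes a hypothetical value_net(y, c) module that returns one scalar value f_θ(Y, C) per example; the step size and number of steps are illustrative, not the paper's settings.

```python
import torch

def langevin_refine(value_net, y_init, c, n_steps=30, step_size=0.002):
    """Refine an initial solution by Langevin dynamics on the conditional
    value f_theta(Y, C), following equation (5). `value_net(y, c)` is assumed
    to return one scalar value per example in the batch."""
    y = y_init.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        value = value_net(y, c).sum()              # sum over the mini-batch
        grad = torch.autograd.grad(value, y)[0]    # gradient of f_theta w.r.t. Y
        noise = torch.randn_like(y)                # diffusion term
        y = (y + 0.5 * step_size ** 2 * grad
             + step_size * noise).detach().requires_grad_(True)
    return y.detach()
```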

For each observed condition C_i, we run the Langevin dynamics according to (5) to obtain the corresponding synthesized example Ỹ_i as a sample from p(Y | C_i; θ). The Monte Carlo approximation to L'(θ) is

L'(θ) ≈ (1/n) Σ_{i=1}^{n} [ ∂/∂θ f_θ(Y_i, C_i) − ∂/∂θ f_θ(Ỹ_i, C_i) ].    (6)

We can then update θ by stochastic gradient ascent based on (6).

Value shift: The above gradient ascent algorithm increases the average value of the observed solutions relative to that of the refined solutions, i.e., on average, it shifts the high value region or mode of f_θ(Y, C) from the generated solution Ỹ_i toward the observed solution Y_i.
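The value shift can then be written as one stochastic gradient step on the value network. A minimal sketch, assuming value_net and its optimizer are set up elsewhere; gradient ascent on the value gap in (6) is implemented as descent on its negation.

```python
def planner_update(value_net, optimizer, y_obs, y_refined, c):
    """One value-shift step based on equation (6): raise the average value of
    the observed solutions relative to the refined ones."""
    optimizer.zero_grad()
    loss = value_net(y_refined.detach(), c).mean() - value_net(y_obs, c).mean()
    loss.backward()
    optimizer.step()
```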

The convergence of such a stochastic gradient ascent algorithm has been studied by [34].

3.2 Fast thinking policy-like model

The policy-like model is of the following form:

Y = g_α(X, C) + ε,  X ~ N(0, I_d),  ε ~ N(0, σ² I_D),    (7)

where X is the d-dimensional latent noise vector, ε is the observation noise, and g_α is a top-down ConvNet defined by the parameters α. The ConvNet g_α maps the latent noise vector X and the observed condition C to the signal Y directly. If the source signal C is of high dimensionality, we can parametrize g_α by an encoder-decoder structure: we first encode C into a latent vector, and then we map X together with this encoding to Y by a decoder. Given C, we can generate Y from the conditional generator model by direct sampling, i.e., first sampling X from its prior distribution N(0, I_d), and then mapping (X, C) into Y directly. This is fast thinking without iteration.

We can learn the policy-like model from the training pairs {(Y_i, C_i)} by maximizing the conditional log-likelihood Σ_i log p(Y_i | C_i; α), where p(Y | C; α) = ∫ p(X) p(Y | X, C; α) dX. The learning algorithm iterates the following two steps: (1) sample X_i from the posterior p(X | Y_i, C_i; α) by Langevin dynamics; (2) update α by gradient descent on Σ_i ||Y_i − g_α(X_i, C_i)||². See [7] for details.

3.3 Cooperative training

The policy-like model and the planner-like model cooperate with each other as follows.

(1) The policy-like model supplies initial samples for the MCMC of the planner-like model. For each observed condition input C_i, we first generate X_i ~ N(0, I_d), and then generate the initial solution Ŷ_i = g_α(X_i, C_i). If the current policy-like model is close to the current planner-like model, then the generated Ŷ_i should be a good initialization for sampling from the planner-like model p(Y | C_i; θ), i.e., starting from the initial solutions Ŷ_i, we run the Langevin dynamics for l steps to get the refined solutions Ỹ_i. These serve as the synthesized examples from the planner-like model and are used to update θ in the same way as we learn the planner-like model in equation (6) for value shifting.

(2) The policy-like model then learns from the MCMC. Specifically, the policy-like model treats the Ỹ_i produced by the MCMC as the training data. The key is that these Ỹ_i are obtained by the Langevin dynamics initialized from the Ŷ_i, which are generated by the policy-like model with known latent noise vectors X_i. Given {(X_i, C_i, Ỹ_i)}, we can learn α by minimizing Σ_i ||Ỹ_i − g_α(X_i, C_i)||², which is a nonlinear regression of Ỹ_i on (X_i, C_i). This can be accomplished by gradient descent

Δα ∝ − ∂/∂α Σ_{i=1}^{n} || Ỹ_i − g_α(X_i, C_i) ||².    (8)

Mapping shift: Initially, g_α maps (X_i, C_i) to the initial solution Ŷ_i. After updating α, g_α maps (X_i, C_i) to the refined solution Ỹ_i. Thus the updating of α absorbs the MCMC transitions that change Ŷ_i to Ỹ_i. In other words, we distill the MCMC transitions of the refinement process into g_α.
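Correspondingly, the mapping shift in (8) is a nonlinear least-squares regression of the refined solutions on the known latent vectors. A minimal sketch, assuming a hypothetical generator(x, c) module for g_α and a standard optimizer:

```python
def policy_update(generator, optimizer, x, c, y_refined):
    """One mapping-shift step based on equation (8): regress the refined
    solutions on the known latent vectors, i.e. minimize
    ||Y_refined - g_alpha(X, C)||^2 by gradient descent."""
    optimizer.zero_grad()
    loss = ((y_refined.detach() - generator(x, c)) ** 2).mean()
    loss.backward()
    optimizer.step()
```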

Algorithm 1 presents a description of the conditional learning with two models. See Figures 1 and 2 for illustrations.

1: Input:
2: (1) training examples {(Y_i, C_i), i = 1, ..., n}
3: (2) number of Langevin steps l
4: (3) number of learning iterations T.
5: Output:
6: (1) learned parameters θ and α,
7: (2) generated examples {Ỹ_i, i = 1, ..., n}.
8:
9: t ← 0, initialize θ and α.
10: repeat
11:     Solve-fast by mapping: For i = 1, ..., n, generate X_i ~ N(0, I_d), and generate the initial solution Ŷ_i = g_α(X_i, C_i).
12:     Solve-slow based on value: For i = 1, ..., n, starting from Ŷ_i, run l steps of Langevin dynamics to obtain the refined solution Ỹ_i, each step following equation (5).
13:     Learn-planner by value shift: Update θ_{t+1} = θ_t + γ_t L'(θ_t), where γ_t is the learning rate and L'(θ_t) is computed according to (6).
14:     Learn-policy by mapping shift: Update α_{t+1} = α_t + γ_t Δα_t, where Δα_t is computed according to (8).
15:     Let t ← t + 1.
16: until t = T
Algorithm 1 Conditional Learning
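Putting the pieces together, one iteration of Algorithm 1 might look like the following sketch, reusing the hypothetical langevin_refine, planner_update, and policy_update helpers from the earlier snippets; the batching and the latent dimension are assumptions.

```python
import torch

def cooperative_train_step(generator, value_net, opt_policy, opt_planner,
                           y_obs, c, latent_dim=100):
    """One iteration of Algorithm 1 on a mini-batch (y_obs, c)."""
    x = torch.randn(y_obs.size(0), latent_dim, device=y_obs.device)
    y_init = generator(x, c)                                      # Solve-fast
    y_refined = langevin_refine(value_net, y_init, c)             # Solve-slow
    planner_update(value_net, opt_planner, y_obs, y_refined, c)   # Learn-planner
    policy_update(generator, opt_policy, x, c, y_refined)         # Learn-policy
```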

See the supplementary materials for a theoretical understanding of our learning method.

4 Experiments

We test the proposed framework for multimodal conditional learning on a variety of tasks.

4.1 Experiment 1: Category → Image

4.1.1 Conditional image generation

We start from learning the conditional distribution of an image given a category or class label. We learn the two models jointly on 30,000 MNIST handwritten digit images conditioned on their class labels, which are encoded as one-hot vectors.

In the policy-like model, we concatenate the 10-dimensional one-hot vector C with the 100-dimensional latent noise vector X sampled from N(0, I_100) as the input of the top-down ConvNet to build a conditional generator g_α(X, C). The generator maps the 110-dimensional input (X, C) into the digit image Y by 4 layers of deconvolutions with up-sampling from top to bottom. Batch normalization and ReLU layers are used between the deconvolution layers, and a tanh non-linearity is added at the bottom layer.

To build the planner-like model, we first use a decoder parametrized by θ to decode the one-hot vector C into a "template image" and perform channel concatenation with the target image Y. The value function f_θ(Y, C) is defined by a bottom-up ConvNet that maps the class decoding and the target image to the value. The decoder has the same structure as the generator in the policy-like model, except that its input is the 10-dimensional one-hot vector alone. We parametrize the bottom-up ConvNet by 3 layers of convolutions with down-sampling from bottom to top, followed by a fully-connected layer. The numbers of channels at different layers are 64, 128, 256, and 100 from bottom to top.
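For concreteness, a rough PyTorch sketch of the two conditional networks is given below. The kernel sizes, layer counts, and the 28 × 28 output resolution are illustrative assumptions rather than the paper's exact settings (the generator here is shallower, and the template decoder is simplified); only the overall structure follows the description above: the generator maps (X, C) to an image, and the value network decodes C into a template image, channel-concatenates it with Y, and maps the result to a scalar value.

```python
import torch
import torch.nn as nn

class CondGenerator(nn.Module):
    """Policy-like model g_alpha(X, C): maps a latent vector and a one-hot
    label to a digit image (assumed 28 x 28 here)."""
    def __init__(self, latent_dim=100, n_classes=10):
        super().__init__()
        self.fc = nn.Linear(latent_dim + n_classes, 256 * 7 * 7)
        self.deconv = nn.Sequential(
            nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),  # 7 -> 14
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 1, 4, stride=2, padding=1),    # 14 -> 28
            nn.Tanh(),
        )

    def forward(self, x, c):
        h = self.fc(torch.cat([x, c], dim=1)).view(-1, 256, 7, 7)
        return self.deconv(h)

class CondValueNet(nn.Module):
    """Planner-like value function f_theta(Y, C): decode the label into a
    'template image', channel-concatenate it with Y, and map the result to a
    scalar value with a bottom-up ConvNet."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.decoder = nn.Sequential(nn.Linear(n_classes, 28 * 28), nn.Tanh())
        self.conv = nn.Sequential(
            nn.Conv2d(2, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),    # 28 -> 14
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),  # 14 -> 7
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2), # 7 -> 3
        )
        self.fc = nn.Linear(256 * 3 * 3, 1)  # one value per example

    def forward(self, y, c):
        template = self.decoder(c).view(-1, 1, 28, 28)
        h = self.conv(torch.cat([y, template], dim=1))
        return self.fc(h.flatten(1)).squeeze(1)
```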

We use Adam [12] for optimization. The joint models are trained with mini-batches of size 100. Figure 3 shows some of the generated samples conditioned on the class labels after training. Each row is conditioned on one label and each column is a different generated sample.

Figure 3: Generated MNIST handwritten digits. Each row is conditioned on one class label

To evaluate the learned conditional distribution, Table 1 shows Gaussian Parzen window log-likelihood estimates on the MNIST [15] test set. We sample 10,000 examples from the learned conditional distribution by first sampling the class label C from the uniform prior distribution and X from N(0, I_d); the policy-like model and the planner-like model then cooperatively generate the synthesized example from the sampled C and X. A Gaussian Parzen window is fitted to these synthesized examples, and the log-likelihood of the test set under the Parzen window distribution is estimated. The standard deviation of the Gaussians is obtained by cross-validation. We follow the same procedure as [6] for computing the log-likelihood estimates for fair comparison.
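A minimal sketch of this Parzen window evaluation, using scikit-learn's kernel density estimator; the bandwidth grid and the number of cross-validation folds are assumptions rather than the exact protocol of [6].

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

def parzen_log_likelihood(samples, test_set, bandwidths=np.logspace(-1.0, 0.0, 10)):
    """Fit a Gaussian Parzen window (KDE) to generated samples, choose the
    bandwidth by cross-validation, and return the mean log-likelihood of the
    test set under the fitted window."""
    samples = samples.reshape(len(samples), -1)      # flatten images to vectors
    test_set = test_set.reshape(len(test_set), -1)
    search = GridSearchCV(KernelDensity(kernel="gaussian"),
                          {"bandwidth": bandwidths}, cv=3)
    search.fit(samples)
    return search.best_estimator_.score_samples(test_set).mean()
```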

Model log-likelihood
DBN [3] 138 ± 2.0
Stacked CAE [3] 121 ± 1.6
Deep GSN [2] 214 ± 1.1
GAN [6] 225 ± 2.0
Conditional GAN [18] 132 ± 1.8
Ours 226 ± 2.1
Table 1: Parzen window-based log-likelihood estimates for MNIST.

We also test the proposed framework on the Cifar-10 [13] object dataset, which contains 60k images of 32 × 32 pixels, with the same architecture mentioned above. Figure 4 shows the generated object patterns. Each row is conditioned on one category. The first two columns display some typical training examples, while the remaining columns show images generated conditional on the labels. We evaluate the learned conditional distribution by computing the Inception scores of the generated examples. Table 2 compares our framework against some baselines for conditional learning. It can be seen that in the proposed cooperative framework, the solution provided by the policy-like model can be further refined by the planner-like model.

Figure 4: Generated Cifar-10 object images. Each row is conditioned on one category label. The first two columns are training images, and the remaining columns display generated images conditioned on their labels.
Model Inception score
Conditional GAN [23] 6.58
Conditional SteinGAN [27] 6.35
policy-like model 6.63
planner-like model 7.30
Table 2: A comparison of Inception scores on Cifar-10 dataset.

4.1.2 Disentangling style and content

A realistic conditional generative model can be useful for exploring the underlying structure of the data by manipulating the latent variables and the condition variable. In this section, we investigate the disentanglement of content and style. The one-hot vector C in the policy-like model mainly accounts for content information, such as the label, but it does not account for style, e.g., shape, rotation, size, etc. Therefore, in order to generate realistic and diverse images, the policy-like model must learn to use the noise vector X (i.e., the latent variables) to capture style variations.

In this experiment, we train a planner-like model jointly with a policy-like model that has a two-dimensional latent noise vector on the MNIST dataset. With the learned models, we first use the policy-like model to generate images by fixing the category label C while varying the latent vector X = (x_1, x_2) over a range, where we discretize both x_1 and x_2 into 10 equally spaced values, and then use the planner-like model to refine each generated example with the corresponding category label C. Figure 5 displays two examples of visualization of handwriting styles, with category labels set to digit 4 and digit 9 respectively. In both examples, nearby regions of the latent space correspond to similar handwriting styles, which are independent of the category labels.
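A short sketch of this latent sweep, reusing the hypothetical generator and langevin_refine from the earlier snippets and assuming a generator trained with a 2-D latent vector; the [-2, 2] range and the (1, n_classes)-shaped one-hot label are assumptions, since the exact interval is not stated here.

```python
import torch

def latent_style_grid(generator, value_net, label_onehot, lo=-2.0, hi=2.0, k=10):
    """Sweep a 2-D latent space on a k x k grid with the class label fixed,
    then refine each proposal with the planner. `label_onehot` is expected to
    have shape (1, n_classes)."""
    axis = torch.linspace(lo, hi, k)
    grid = torch.stack(torch.meshgrid(axis, axis, indexing="ij"), dim=-1)
    xs = grid.reshape(-1, 2)                       # (k*k, 2) latent vectors
    c = label_onehot.expand(xs.size(0), -1)        # same label for every cell
    y_init = generator(xs, c)                      # fast thinking proposals
    return langevin_refine(value_net, y_init, c)   # slow thinking refinement
```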

Figure 5: Visualization of handwriting styles learned by the conditional model with a 2D latent space. The generated handwritten digits in each sub-figure are obtained by fixing the category label C and varying the 2-dimensional latent vector X. The category labels used for the left and right panels are digit 4 and digit 9, respectively.
Figure 6: Style transfer. The first column shows testing images. The other columns show style transfer by the model, where the style latent vector of each row is set to the value inferred from the testing image in the first column by Langevin inference. Each column corresponds to a different category label C.

4.1.3 Style transfer

We demonstrate that the learned model can perform style transfer from an unseen testing image onto other categories. The models are trained on the SVHN [19] dataset, which contains 10 classes of digits collected from street view house numbers. With the learned policy-like model, we first infer the latent variables corresponding to a testing image. We then fix the inferred latent vector, change the category label C, and generate images of the different categories with the same style as the testing image using the learned model. Given a testing image Y with known category label C, the inference of the latent vector X can be performed by directly sampling from the posterior distribution p(X | Y, C; α) via Langevin dynamics, which iterates

X_{τ+1} = X_τ + (s² / 2) ∂/∂X log p(X_τ | Y, C; α) + s U_τ,  U_τ ~ N(0, I_d),    (9)

where log p(X | Y, C; α) = −||Y − g_α(X, C)||² / (2σ²) − ||X||² / 2 + constant.

If the category label of the testing image is unknown, we need to infer both X and C from Y. Since C is a one-hot vector, in order to adopt a gradient-based method to infer C, we adopt a continuous approximation by reparametrizing C using a softmax transformation on auxiliary continuous variables A = (a_k, k = 1, ..., K), where K is the number of categories. Specifically, we reparametrize C = (c_k, k = 1, ..., K), where c_k = exp(a_k) / Σ_{k'} exp(a_{k'}) for k = 1, ..., K, and assume the prior for A to be N(0, I_K). Then the Langevin dynamics for sampling (X, A) iterates

(X_{τ+1}, A_{τ+1}) = (X_τ, A_τ) + (s² / 2) ∂/∂(X, A) log p(X_τ, A_τ | Y; α) + s U_τ.    (10)

Figure 6 shows 8 results of style transfer. For each testing image Y, we infer X and A by sampling from the posterior p(X, A | Y; α), which alternates (1) updating X and (2) updating A, each by a Langevin step as in (10), with randomly initialized X and A. We then fix the inferred latent vector X, change the category label C, and generate images from the combination of X and C by the learned models. This again demonstrates the disentanglement of style from category.
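For the known-label case, the posterior Langevin inference in (9) can be sketched as follows; generator is the hypothetical g_α module from the earlier snippets, and σ, the step size, and the number of steps are illustrative values. The unknown-label case would additionally carry the softmax-reparametrized label variables A through the same loop.

```python
import torch

def infer_latent(generator, y, c, n_steps=100, step_size=0.05,
                 sigma=0.3, latent_dim=100):
    """Sample X ~ p(X | Y, C; alpha) by Langevin dynamics as in equation (9),
    using log p(X | Y, C) = -||Y - g_alpha(X, C)||^2 / (2 sigma^2)
    - ||X||^2 / 2 + const."""
    x = torch.randn(y.size(0), latent_dim, device=y.device, requires_grad=True)
    for _ in range(n_steps):
        recon = generator(x, c)
        log_post = (-((y - recon) ** 2).sum() / (2 * sigma ** 2)
                    - 0.5 * (x ** 2).sum())
        grad = torch.autograd.grad(log_post, x)[0]
        noise = torch.randn_like(x)
        x = (x + 0.5 * step_size ** 2 * grad
             + step_size * noise).detach().requires_grad_(True)
    return x.detach()
```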

4.2 Experiment 2: Image → Image

4.2.1 Semantic labels → Scene images

We study learning the conditional distribution for image-to-image synthesis with our framework. The experiments are conducted on the CMP Facades dataset [26], where each building facade image is associated with an image of architectural labels. In the policy-like model, we first sample X from the Gaussian noise prior N(0, I_d), and we encode the conditional image C via an encoder. The image embedding is then concatenated with the latent noise vector X. After this, we generate the target image Y by a generator. We design the policy-like model by following the general shape of a "U-Net" [22] in this experiment. In the planner-like model, we first perform channel concatenation on the target image Y and the conditional image C, where both are images of the same size. The value function f_θ(Y, C) is then defined by a 4-layer bottom-up ConvNet, which maps the 6-channel "image" to a value score by 3 convolutional layers with down-sampling at different layers (from bottom to top), and one fully connected layer with 100 filters. Leaky ReLU layers are used between the convolutional layers.
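A minimal sketch of such an image-conditioned value function: the target and condition images are channel-concatenated into a 6-channel input and mapped to a scalar value. The channel widths, kernel sizes, 256 × 256 input size, and the final scalar head are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ImageCondValueNet(nn.Module):
    """Planner-like value function for image-to-image tasks: channel-concatenate
    the target image Y and the condition image C (3 + 3 = 6 channels) and map
    the result to a scalar value."""
    def __init__(self, img_size=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        )
        self.fc = nn.Linear(256 * (img_size // 8) ** 2, 100)
        self.out = nn.Linear(100, 1)  # reduce the 100 features to one value

    def forward(self, y, c):
        h = self.conv(torch.cat([y, c], dim=1)).flatten(1)
        return self.out(torch.relu(self.fc(h))).squeeze(1)
```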

Figure 7 shows some qualitative results of generating building facade images from the semantic labels. The first row displays 4 semantic label images that are unseen in the training data. The second row displays the corresponding ground truth images for reference. The results by a baseline method [11] are shown in the third row for comparison. The fourth and fifth rows show the generated results conditioned on the images shown in the first row by the learned policy-like model and planner-like model respectively. Please see the supplementary materials for more high-resolution results.

Figure 7: Generating images conditioned on architectural labels

4.2.2 Sketch images (or edge images) → Photo images

We next test the model on the CUHK Face Sketch database (CUFS) [29], where for each face there is a sketch drawn by an artist based on a photo of the face. We learn to recover the color face images from the sketch images by the proposed framework. Figure 8(a) displays the face image synthesis results conditioned on the sketch images. The 1st, 2nd, and 3rd columns show some sketch images, while the 4th, 5th, and 6th columns show the corresponding recovered images obtained by sampling from the conditional distribution.

Figure 8(b) demonstrates the learned sketch (condition) manifold by showing 3 examples of interpolation. For each row, the sketch images at the two ends are first encoded into embeddings by the encoder, and then each face image in the middle is obtained by first interpolating the sketch embedding, then generating the image using the policy-like model with fixed noise, and eventually refining the result with the planner-like model. Even though there are no ground-truth sketch images for the intervening points, the generated faces appear plausible. Since the noise is fixed, the only changing factor is the sketch embedding. We observe smooth changes in the outline of the generated faces.

Figure 8: (a) sketch-to-photo face synthesis. The 1st, 2nd and 3rd columns: sketch images as conditions. The 4th, 5th, and 6th columns: face images sampled from the learned models conditioned on sketch images. (b) Generated face images by interpolating between the embedding of the sketch images at two ends, with fixed noise vector. Each row displays one example of interpolation.

We conduct another experiment on the UT Zappos50K dataset [26] for photo image recovery from edge images. The dataset contains 50k training images of shoes. Edge images are computed by the HED edge detector [32] with post-processing. We use the same model structure as in the last experiment. Figure 9 shows some qualitative results of synthesizing shoe images from edge images.

Figure 9: Example results on edges → shoes generation, compared to ground truth.

4.2.3 Occluded images → Inpainted images

We also test our method on the task of image inpainting by learning a mapping from an occluded image (256 × 256 pixels), where a mask is centrally placed, to its original version. We use the CMP Facades dataset. Figure 10 shows a qualitative comparison of our method and a baseline method [11]. Table 3 shows quantitative results, where the recovery performance is measured by PSNR and SSIM, computed between the occluded regions of the generated example and the ground truth example. The batch size is one. Our method outperforms the baseline in this recovery task. Please see the supplementary materials for more results.
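A small sketch of the masked PSNR computation used for Table 3; SSIM over the same region can be computed analogously (e.g., with scikit-image). The [0, 1] intensity range is an assumption.

```python
import numpy as np

def masked_psnr(generated, ground_truth, mask, max_val=1.0):
    """PSNR restricted to the occluded region (mask == 1). All arrays share
    the same shape; images are floats in [0, max_val]."""
    diff = (generated - ground_truth) * mask
    mse = (diff ** 2).sum() / mask.sum()
    return 10.0 * np.log10(max_val ** 2 / mse)
```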

Model PSNR SSIM
pixel2pixel [11] 19.3411 0.739
Ours 20.4678 0.767
Table 3: Comparison with the baseline method for image inpainting.
Figure 10: Example results of photo inpainting

5 Conclusion

This paper addresses the problem of high-dimensional conditional learning and proposes a learning method that couples a fast thinking policy-like model and a slow thinking planner-like model. The policy-like model initializes the iterative optimization or sampling process of the planner-like model, while the planner-like model in return teaches the policy-like model by distilling its iterative algorithm into the policy-like model. We demonstrate the proposed method on a variety of image synthesis and recovery tasks.

Compared to GAN-based methods, our method is equipped with an extra iterative sampling and optimization algorithm to refine the solution, guided by a learned objective function. This may prove to be a powerful method for solving challenging conditional learning problems. Integrating fast thinking and slow thinking may also be of interest to cognitive science.

Appendix: Theoretical understanding

Kullback-Leibler divergence

The Kullback-Leibler divergence between two distributions p(Y) and q(Y) is defined as D_KL(p || q) = E_p[ log ( p(Y) / q(Y) ) ].

The Kullback-Leibler divergence between two conditional distributions p(Y | C) and q(Y | C) is defined as

D_KL( p(Y | C) || q(Y | C) ) = E_{p(C, Y)}[ log ( p(Y | C) / q(Y | C) ) ]    (11)
= E_{p(C)}[ E_{p(Y | C)}[ log ( p(Y | C) / q(Y | C) ) ] ],    (12)

where the expectation is over the joint distribution p(C, Y) = p(C) p(Y | C).

Slow thinking planner-like model

The slow thinking planner-like model is

π_θ(Y | C) = (1 / Z(C; θ)) exp[ f_θ(Y, C) ],    (13)

where

Z(C; θ) = ∫ exp[ f_θ(Y, C) ] dY    (14)

is the normalizing constant and is analytically intractable.

Suppose the training examples {(Y_i, C_i)} are generated by the true joint distribution p_data(C, Y), whose conditional distribution is p_data(Y | C).

For large sample size n, the maximum likelihood estimation of θ is to minimize the Kullback-Leibler divergence

D_KL( p_data(Y | C) || π_θ(Y | C) ).    (15)

In practice, the expectation with respect to p_data is approximated by the sample average. The difficulty with π_θ(Y | C) is that the log Z(C; θ) term is analytically intractable, and its derivative has to be approximated by MCMC sampling from the model π_θ(Y | C).

Fast thinking policy-like model

The fast thinking policy-like model is

Y = g_α(X, C) + ε,  X ~ N(0, I_d),  ε ~ N(0, σ² I_D).    (16)

We use the notation ρ_α(Y | C) to denote the resulting conditional distribution. It is obtained by

ρ_α(Y | C) = ∫ p(X) ρ_α(Y | X, C) dX,    (17)

which is analytically intractable.

For large sample, the maximum likelihood estimation of α is to minimize the Kullback-Leibler divergence

D_KL( p_data(Y | C) || ρ_α(Y | C) ).    (18)

Again, the expectation with respect to p_data is approximated by the sample average. The difficulty with ρ_α(Y | C) is that the integral in (17) is analytically intractable, and its derivative has to be approximated by MCMC sampling of the posterior ρ_α(X | Y, C).

Value shift: modified contrastive divergence

Let M_θ(Ỹ | Ŷ, C) be the transition kernel of the finite-step MCMC that refines the initial solution Ŷ to the refined solution Ỹ. Let (M_θ ρ_α)(Ỹ | C) = ∫ M_θ(Ỹ | Ŷ, C) ρ_α(Ŷ | C) dŶ be the distribution obtained by running the finite-step MCMC from ρ_α.

Given the current policy-like model α_t, the value shift updates θ_t to θ_{t+1}, and the update approximately follows the gradient of the following modified contrastive divergence [8, 30]:

D_KL( p_data(Y | C) || π_θ(Y | C) ) − D_KL( (M_{θ_t} ρ_{α_t})(Y | C) || π_θ(Y | C) ).    (19)

Compared with the MLE of the model in (13), i.e., minimizing (15), (19) has the second divergence term to cancel the intractable log Z(C; θ) term, so that its derivative is analytically tractable. The learning is to shift π_θ, or its high value region around the mode, from the refined solution provided by M_{θ_t} ρ_{α_t} toward the observed solution given by p_data. If ρ_{α_t} is close to π_{θ_t}, then the second divergence is close to zero, and the learning is close to the MLE update.

Mapping shift: distilling MCMC

Given the current planner-like model θ_{t+1}, the mapping shift updates α_t to α_{t+1}, and the update approximately follows the gradient of

D_KL( (M_{θ_{t+1}} ρ_{α_t})(Y | C) || ρ_α(Y | C) ).    (20)

This update distills the MCMC transition M_{θ_{t+1}} into the model ρ_α. In the idealized case where the above divergence can be minimized to zero, we have ρ_{α_{t+1}} = M_{θ_{t+1}} ρ_{α_t}. The limiting distribution of the MCMC transition M_θ is π_θ, thus the cumulative effect of the above update is to lead ρ_α close to π_θ.

Comparing (20) to the MLE of the model in (16), i.e., minimizing (18), the training data distribution becomes M_{θ_{t+1}} ρ_{α_t} instead of p_data. That is, ρ_α learns from how π_θ refines it. The learning is accomplished by mapping shift, where the generated latent vector X is known and thus does not need to be inferred (or the Langevin inference algorithm can initialize from the generated X). In contrast, if we are to learn from p_data, we need to infer the unknown X by sampling from the posterior distribution.

In the limit, if the algorithm converges to a fixed point, then the resulting ρ_α minimizes D_KL( M_θ ρ_α || ρ_α ), that is, ρ_α seeks to be the stationary distribution of the MCMC transition M_θ, which is π_θ.

If the learned ρ_α is close to π_θ, then M_θ ρ_α is even closer to π_θ. Then the learned θ is close to the MLE because the second divergence term in (19) is close to zero.

References

  • [1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-first International Conference on Machine Learning (ICML), pages 1–8, 2004.
  • [2] Y. Bengio, E. Laufer, G. Alain, and J. Yosinski. Deep generative stochastic networks trainable by backprop. In International Conference on Machine Learning, pages 226–234, 2014.
  • [3] Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai. Better mixing via deep representations. In International Conference on Machine Learning, pages 552–560, 2013.
  • [4] D. P. Bertsekas. Dynamic programming and optimal control, volume 1. Athena Scientific, Belmont, MA, 2005.
  • [5] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486–1494, 2015.
  • [6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
  • [7] T. Han, Y. Lu, S.-C. Zhu, and Y. N. Wu. Alternating back-propagation for generator network. In AAAI, 2017.
  • [8] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
  • [9] J. Ho and S. Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
  • [10] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982.
  • [11] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

  • [12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [13] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • [14] J. Lafferty, A. McCallum, and F. C. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning, 2001.
  • [15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [16] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2017.
  • [17] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700–708, 2017.
  • [18] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [19] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, page 5, 2011.
  • [20] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 689–696, 2011.
  • [21] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
  • [22] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [23] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
  • [24] N. Srivastava and R. R. Salakhutdinov. Multimodal learning with deep boltzmann machines. In Advances in neural information processing systems, pages 2222–2230, 2012.
  • [25] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning, volume 135. MIT press Cambridge, 1998.
  • [26] R. Tyleček and R. Šára. Spatial pattern templates for recognition of objects with regular structure. In German Conference on Pattern Recognition, pages 364–374. Springer, 2013.
  • [27] D. Wang and Q. Liu. Learning to draw samples: With application to amortized mle for generative adversarial learning. arXiv preprint arXiv:1611.01722, 2016.
  • [28] X. Wang and A. Gupta. Generative image modeling using style and structure adversarial networks. In European Conference on Computer Vision, pages 318–335. Springer, 2016.
  • [29] X. Wang and X. Tang. Face photo-sketch synthesis and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11):1955–1967, 2009.
  • [30] J. Xie, Y. Lu, R. Gao, and Y. N. Wu. Cooperative learning of energy-based model and latent variable model via mcmc teaching. In AAAI, 2018.
  • [31] J. Xie, Y. Lu, S.-C. Zhu, and Y. N. Wu. A theory of generative convnet. In International Conference on Machine Learning, 2016.
  • [32] S. Xie and Z. Tu. Holistically-nested edge detection. In Proceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015.
  • [33] E. P. Xing, R. Yan, and A. G. Hauptmann. Mining associated text and images with dual-wing harmoniums. arXiv preprint arXiv:1207.1423, 2012.
  • [34] L. Younes. On the convergence of markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics: An International Journal of Probability and Stochastic Processes, 65(3-4):177–228, 1999.
  • [35] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In IEEE Int. Conf. Comput. Vision (ICCV), pages 5907–5915, 2017.
  • [36] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [37] B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey. Maximum entropy inverse reinforcement learning. In Twenty-Third AAAI Conference on Artificial Intelligence, pages 1433–1438, 2008.