Learning Multi-grid Generative ConvNets by Minimal Contrastive Divergence

09/26/2017 ∙ by Ruiqi Gao, et al.

This paper proposes a minimal contrastive divergence method for learning energy-based generative ConvNet models of images at multiple grids (or scales) simultaneously. For each grid, we learn an energy-based probabilistic model where the energy function is defined by a bottom-up convolutional neural network (ConvNet or CNN). Learning such a model requires generating synthesized examples from the model. Within each iteration of our learning algorithm, for each observed training image, we generate synthesized images at multiple grids by initializing the finite-step MCMC sampling from a minimal 1 x 1 version of the training image. The synthesized image at each subsequent grid is obtained by a finite-step MCMC initialized from the synthesized image generated at the previous coarser grid. After obtaining the synthesized examples, the parameters of the models at multiple grids are updated separately and simultaneously based on the differences between synthesized and observed examples. We call this learning method the multi-grid minimal contrastive divergence. We show that this method can learn realistic energy-based generative ConvNet models, and it outperforms the original contrastive divergence (CD) and persistent CD.


1 Introduction

This paper studies the problem of learning energy-based generative ConvNet models (LeCun et al., 2006; Hinton, 2002; Hinton et al., 2006; Hinton, Osindero, and Teh, 2006; Salakhutdinov and Hinton, 2009; Lee et al., 2009; Ngiam et al., 2011; Lu, Zhu, and Wu, 2016; Xie et al., 2016; Xie, Zhu, and Wu, 2017; Jin, Lazarow, and Tu, 2017) of images. The model is in the form of a Gibbs distribution where the energy function is defined by a bottom-up convolutional neural network (ConvNet or CNN). It can be derived from the commonly used discriminative ConvNet (LeCun et al., 1998; Krizhevsky, Sutskever, and Hinton, 2012) as a direct consequence of the Bayes rule (Dai, Lu, and Wu, 2015), but unlike the discriminative ConvNet, the generative ConvNet is endowed with the gift of imagination in that it can generate images by sampling from the probability distribution of the model. As a result, the generative ConvNet can be learned in an unsupervised setting without requiring class labels. The learned model can be used as a prior model for image processing. It can also be turned into a discriminative ConvNet for classification.

Figure 1: Synthesized images at multiple grids. From left to right: synthesized images at the three grids, from the coarsest to the finest. The synthesized image at each grid is obtained by 30-step Langevin sampling initialized from the synthesized image at the previous coarser grid, beginning with the minimal 1 × 1 version.

The maximum likelihood learning of the energy-based generative ConvNet model follows an “analysis by synthesis” scheme: we sample the synthesized examples from the current model, usually by Markov chain Monte Carlo (MCMC), and then update the model parameters based on the difference between the observed training examples and the synthesized examples. The probability distribution or the energy function of the learned model is likely to be multi-modal if the training data are highly varied. The MCMC may have difficulty traversing different modes and may take a long time to converge. A simple and popular modification of the maximum likelihood learning is the contrastive divergence (CD) learning (Hinton, 2002), where for each observed training example, we obtain a corresponding synthesized example by initializing a finite-step MCMC from the observed example. Such a method can be scaled up to large training datasets using mini-batch training. However, the synthesized examples may be far from fair samples of the current model, thus resulting in bias in the learned model parameters. A modification of CD is persistent CD (Tieleman, 2008), where the MCMC is still initialized from the observed example at the initial learning epoch. However, in each subsequent learning epoch, the finite-step MCMC is initialized from the synthesized example of the previous epoch. Running persistent chains may make the synthesized examples less biased by the observed examples, although the persistent chains may still have difficulty traversing different modes of the learned model.

To address the above challenges under the constraint of finite budget MCMC, we propose a minimal contrastive divergence method to learn the energy-based generative ConvNet models at multiple scales or grids. Specifically, for each training image, we obtain its multi-grid versions by repeated down-scaling. Our method learns a separate generative ConvNet model at each grid. Within each iteration of our learning algorithm, for each observed training image, we generate the corresponding synthesized images at multiple grids. Specifically, we initialize the finite-step MCMC sampling from the minimal version of the training image, and the synthesized image at each grid serves to initialize the finite-step MCMC that samples from the model of the subsequent finer grid. See Fig. 1 for an illustration, where we sample images sequentially at 3 grids, with 30 steps of Langevin dynamics at each grid. After obtaining the synthesized images at the multiple grids, the models at the multiple grids are updated separately and simultaneously based on the differences between the synthesized images and the observed training images at different grids. We call this learning method the multi-grid minimal contrastive divergence because it initializes the finite-step MCMC from the minimal version of the training image.

The advantages of the proposed method are as follows.

(1) The finite-step MCMC is initialized from the minimal 1 × 1 version of the observed image, instead of the original observed image. Thus the synthesized image is much less biased by the observed image compared to the original CD.

(2) The learned models at coarse grids are expected to be smoother than the models at fine grids. Sampling the models at increasingly finer grids sequentially is like a simulated annealing process (Kirkpatrick et al., 1983), and the finite budget MCMC is likely to produce synthesized images that are close to fair samples from the learned models.

(3) The images and their models at multiple grids correspond to the same scene at different viewing distances. Thus the models at multiple grids and the coarse-to-fine sampling processes are physically natural.

(4) Unlike original CD or persistent CD, the learned models are equipped with a fixed budget MCMC to generate new synthesized images from scratch, because we only need to initialize the MCMC by sampling from the one-dimensional histogram of the 1 × 1 versions of the training images.

We show that the proposed method can learn realistic models of images. The learned models can be used for image processing such as image inpainting. The learned feature maps can be used for subsequent tasks such as classification.

The contributions of our paper are as follows. We propose a minimal CD method for learning multi-grid energy-based generative ConvNet models. We show empirically that the proposed method outperforms the original CD, persistent CD, as well as single-grid learning. More importantly, we show that a small budget MCMC is capable of generating varied and realistic patterns. Deep energy-based models have not received the attention they deserve in the recent literature because of their reliance on MCMC sampling. It is our hope that this paper will stimulate further research on designing efficient MCMC algorithms for learning deep energy-based models.

2 Related work

Our method is a modification of CD (Hinton, 2002) for training energy-based models. In general, both the data distribution of the observed training examples and the learned model distribution can be multi-modal, and the data distribution can be even more multi-modal than the model distribution. The finite-step MCMC of CD initialized from the data distribution may only explore local modes around the training examples, thus the finite-step MCMC may not get close to the model distribution. This can also be the case with persistent CD (Tieleman, 2008). In contrast, our method initializes the finite-step MCMC from the minimal 1 × 1 version of the original image, and the sampling of the model at each grid is initialized from the image sampled from the model at the previous coarser grid. The model distribution at the coarser grid is expected to be smoother than the model distribution at the finer grid, and the coarse-to-fine MCMC is likely to generate varied samples that are close to fair samples from the learned models. As a result, the learned models obtained by our method can be closer to the maximum likelihood estimate than those obtained by the original CD.

The multi-grid Monte Carlo method originated from statistical physics (Goodman and Sokal, 1989). Our work is perhaps the first to apply the multi-grid sampling to CD learning of deep energy-based models. The motivation for multi-grid Monte Carlo is that reducing the scale or resolution leads to a smoother or less multi-modal distribution. The difference between our method and the multi-grid MCMC in statistical physics is that in the latter, the distribution of the lower resolution is obtained from the distribution of the higher resolution. In our work, the models at different grids are learned from training images at different resolutions directly and separately.

Besides the energy-based generative ConvNet model, another popular deep generative model is the generator network, which maps the latent vector that follows a simple prior distribution to the image via a top-down ConvNet. The model is usually trained together with an assisting model such as an inferential model as in the variational auto-encoder (VAE) (Kingma and Welling, 2014; Rezende, Mohamed, and Wierstra, 2014; Mnih and Gregor, 2014), or a discriminative model as in the generative adversarial networks (GAN) (Goodfellow et al., 2014; Denton et al., 2015; Radford, Metz, and Chintala, 2015). The focus of this paper is on training deep energy-based models, without resorting to a different class of models, so that we do not need to be concerned with the mismatch between two different classes of models. Unlike the generator model, the energy-based generative ConvNet model corresponds directly to the discriminative ConvNet classifier (Dai, Lu, and Wu, 2015).

Our learning method is based on maximum likelihood. Recently, (Jin, Lazarow, and Tu, 2017) proposed an introspective learning method that updates the model based on discriminative learning. It is possible to apply multi-grid learning and sampling to their method.

Subsections 4.2 and 5.2 review further related work at a more technical level.

3 Energy-based generative ConvNet

This section reviews the energy-based generative ConvNet and explains its equivalence to the commonly used discriminative ConvNet.

3.1 Exponential tilting

Let I be an image defined on a square (or rectangular) grid of pixels. We use p(I; θ) to denote the probability distribution of I with parameter θ. The energy-based generative ConvNet model is as follows (Xie et al., 2016):

p(I; θ) = (1/Z(θ)) exp[ f(I; θ) ] q(I),    (1)

where q(I) is the reference or negative distribution, such as Gaussian white noise q(I) ∝ exp( −‖I‖²/(2σ²) ) (or a uniform distribution within a bounded range). The scoring function f(I; θ) is defined by a bottom-up ConvNet whose parameters are denoted by θ. The normalizing constant Z(θ) = ∫ exp[ f(I; θ) ] q(I) dI is analytically intractable. The energy function is

E(I; θ) = −f(I; θ) + ‖I‖² / (2σ²).    (2)

The model p(I; θ) is an exponential tilting of q(I). In the trivial case of a linear scoring function f(I; θ) = ⟨θ, I⟩, the effect of exponential tilting on the Gaussian q(I) reduces to mean shifting. If f(I; θ) is a ConvNet with piecewise linear rectification, then f(I; θ) is piecewise linear in I (Montufar et al., 2014), and p(I; θ) is a piecewise mean-shifted Gaussian (Xie et al., 2016).

In general, the local energy minima (Hopfield, 1982) satisfy an auto-encoder (Xie et al., 2016)

I = σ² ∂f(I; θ)/∂I    (3)

(for a uniform q, we do not have such an auto-encoding structure). The learned model is likely to be multi-modal if the training data are highly varied.
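To make the above parameterization concrete, the following sketch (an illustrative PyTorch snippet, not the authors' MatConvNet code; the network architecture, image size and σ = 1 are arbitrary placeholder choices) computes the energy E(I; θ) = −f(I; θ) + ‖I‖²/(2σ²) for a small scoring ConvNet, together with its gradient with respect to the image, the quantity that vanishes at the local minima characterized by the auto-encoder (3).

import torch
import torch.nn as nn

class ScoringConvNet(nn.Module):
    """A small bottom-up ConvNet f(I; theta) that outputs one scalar score per image.
    The architecture is an arbitrary illustration, not the paper's networks."""
    def __init__(self, channels=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(channels, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(64, 1)   # fully-connected head producing the scalar score

    def forward(self, img):
        h = self.features(img)
        h = h.mean(dim=(2, 3))           # global average pooling
        return self.fc(h).squeeze(1)     # f(I; theta), one scalar per image

def energy(f_net, img, sigma=1.0):
    """E(I; theta) = -f(I; theta) + ||I||^2 / (2 sigma^2), per image."""
    return -f_net(img) + img.flatten(1).pow(2).sum(1) / (2 * sigma ** 2)

# Gradient of the energy w.r.t. the image; at a local energy minimum this gradient
# vanishes, i.e. I = sigma^2 * df/dI, which is the auto-encoder relation (3).
f_net = ScoringConvNet()
img = torch.randn(4, 3, 64, 64, requires_grad=True)
grad_I = torch.autograd.grad(energy(f_net, img).sum(), img)[0]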

3.2 Equivalence to discriminative ConvNet

Model (1) corresponds to a classifier in the following sense (Dai, Lu, and Wu, 2015; Xie et al., 2016; Jin, Lazarow, and Tu, 2017). Suppose there are C categories p(I; θ_c), for c = 1, ..., C, in addition to the background category q(I). The ConvNets f(I; θ_c) for the C categories may share common lower layers. Let ρ_c be the prior probability of category c, for c = 0, 1, ..., C, where c = 0 denotes the background category. Then the posterior probability for classifying an example I to a category c is a softmax multi-class classifier

p(c | I) = exp[ f(I; θ_c) + b_c ] / Σ_{c′=0}^{C} exp[ f(I; θ_{c′}) + b_{c′} ],    (4)

where f(I; θ_0) = 0 and b_0 = 0 for the background category, and for c = 1, ..., C, b_c = log(ρ_c/ρ_0) − log Z(θ_c). Conversely, if we have the softmax classifier (4), then the distribution of each category is of the form (1). Thus the energy-based generative ConvNet directly corresponds to the commonly used discriminative ConvNet.
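As an illustration of this correspondence (a hypothetical sketch in PyTorch, not part of the paper; the score and bias tensors are placeholders), the per-category scores f(I; θ_c) plus the bias terms b_c feed into a standard softmax, with the background category contributing a zero score and zero bias:

import torch
import torch.nn.functional as F

def classify_from_scores(scores, log_bias):
    """Posterior p(c | I) from per-category scores f(I; theta_c) and biases b_c.

    scores:   (batch, C) tensor of f(I; theta_c) for categories c = 1..C
    log_bias: (C,) tensor of b_c = log(rho_c / rho_0) - log Z(theta_c)
    The background category c = 0 contributes f = 0 and b = 0.
    """
    batch = scores.shape[0]
    background = torch.zeros(batch, 1)                  # f(I; theta_0) = 0, b_0 = 0
    logits = torch.cat([background, scores + log_bias], dim=1)
    return F.softmax(logits, dim=1)                     # posterior over {0, 1, ..., C}

# Example with placeholder numbers: 2 images, 3 foreground categories.
probs = classify_from_scores(torch.randn(2, 3), torch.zeros(3))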

In the case where we only observe unlabeled examples, we may model them by a single distribution p(I; θ) in (1), and treat p(I; θ) as the positive distribution and q(I) as the negative distribution. Let ρ be the prior probability that a random example comes from p(I; θ). Then the posterior probability that a random example I comes from p(I; θ) is

p(positive | I) = 1 / ( 1 + exp[ −( f(I; θ) + b ) ] ),    (5)

where b = log[ ρ/(1 − ρ) ] − log Z(θ), i.e., a logistic regression.

(Tu, 2007; Jin, Lazarow, and Tu, 2017; Gutmann and Hyvärinen, 2012) exploited this fact to estimate θ and b by logistic regression in order to exponentially tilt q(I) to p(I; θ). (Tu, 2007; Jin, Lazarow, and Tu, 2017) devised an iterative algorithm that treats the current p(I; θ) as the new q(I) to be tilted to a new p(I; θ). The logistic regression tends to focus on the examples that lie on the boundary between p(I; θ) and q(I). In our paper, we shall pursue maximum likelihood learning, which matches the average statistical properties of the model distribution and the data distribution.

The equivalence to discriminative ConvNet classifier justifies the importance and naturalness of the energy-based generative ConvNet model.

4 Maximum likelihood

While the discriminative ConvNet must be learned in a supervised setting, the generative ConvNet model in (1) can be learned from unlabeled data by maximum likelihood, and the resulting learning and sampling algorithm admits an adversarial interpretation.

4.1 Learning and sampling algorithm

Suppose we observe training examples {I_i, i = 1, ..., n} from an unknown data distribution P_data(I). The maximum likelihood learning seeks to maximize the log-likelihood function

L(θ) = (1/n) Σ_{i=1}^{n} log p(I_i; θ).    (6)

If the sample size n is large, the maximum likelihood estimator minimizes the Kullback-Leibler divergence KL(P_data ‖ p(·; θ)) from the data distribution P_data to the model distribution p(·; θ). The gradient of L(θ) is

L′(θ) = (1/n) Σ_{i=1}^{n} ∂f(I_i; θ)/∂θ − E_θ[ ∂f(I; θ)/∂θ ],    (7)

where E_θ denotes the expectation with respect to p(I; θ). The key to the above identity is that ∂ log Z(θ)/∂θ = E_θ[ ∂f(I; θ)/∂θ ].

The expectation in equation (7) is analytically intractable and has to be approximated by MCMC, such as Langevin dynamics (Zhu and Mumford, 1998; Girolami and Calderhead, 2011), which iterates the following step:

I_{τ+1} = I_τ − (δ²/2) ∂E(I_τ; θ)/∂I + δ ε_τ,    (8)

where τ indexes the time steps of the Langevin dynamics, δ is the step size, and ε_τ is Gaussian white noise. The Langevin dynamics relaxes I to a low energy region, while the noise term provides randomness and variability. The energy gradient for relaxation is in the form of the reconstruction error of the auto-encoder (3) (for a uniform q, the gradient is simply the derivative of −f(I; θ)). A Metropolis-Hastings step can be added to correct for the finite step size δ. We can also use Hamiltonian Monte Carlo for sampling the generative ConvNet (Neal, 2011; Dai, Lu, and Wu, 2015).
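A minimal sketch of the Langevin update (8) in PyTorch (illustrative only; `energy_fn` stands for any per-image energy such as the one sketched in subsection 3.1, and the step size is an arbitrary placeholder, not the paper's setting):

import torch

def langevin_step(energy_fn, img, delta=0.002):
    """One step of equation (8): I <- I - (delta^2 / 2) * dE/dI + delta * noise.

    energy_fn: callable mapping an image batch to per-image energies E(I; theta)
    delta:     Langevin step size (placeholder value)
    """
    img = img.detach().requires_grad_(True)
    grad_I = torch.autograd.grad(energy_fn(img).sum(), img)[0]   # dE/dI, relaxation direction
    noise = torch.randn_like(img)                                # Gaussian white noise term
    return (img - 0.5 * delta ** 2 * grad_I + delta * noise).detach()

def run_langevin(energy_fn, img, n_steps=30, delta=0.002):
    """Finite-step Langevin sampling, i.e. the small MCMC budget used in learning."""
    for _ in range(n_steps):
        img = langevin_step(energy_fn, img, delta)
    return img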

We can run ñ parallel chains of Langevin dynamics according to (8) to obtain the synthesized examples {Ĩ_i, i = 1, ..., ñ}. The Monte Carlo approximation to L′(θ) is

L′(θ) ≈ (1/n) Σ_{i=1}^{n} ∂f(I_i; θ)/∂θ − (1/ñ) Σ_{i=1}^{ñ} ∂f(Ĩ_i; θ)/∂θ,    (9)

which is used to update θ.

Both the learning and sampling steps involve the derivatives of f(I; θ) with respect to θ and I, respectively. The derivatives can be computed efficiently by back-propagation, and the computations of the two derivatives share most of the chain-rule steps. The learning algorithm is thus in the form of alternating back-propagation.
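The Monte Carlo update (9) can be sketched as follows (illustrative PyTorch code, not the authors' implementation; `f_net` is a scoring network producing one scalar per image, and `synthesized` is assumed to come from the finite-step Langevin sampling above):

import torch

def maximum_likelihood_step(f_net, optimizer, observed, synthesized):
    """Update theta by the Monte Carlo gradient (9): ascend on
    mean f(observed) - mean f(synthesized), i.e. descend on its negation."""
    optimizer.zero_grad()
    loss = f_net(synthesized).mean() - f_net(observed).mean()
    loss.backward()   # back-propagation w.r.t. theta; shares chain-rule steps with sampling
    optimizer.step()
    return loss.item()

# Illustrative usage:
# optimizer = torch.optim.Adam(f_net.parameters(), lr=1e-4)
# maximum_likelihood_step(f_net, optimizer, observed_batch, synthesized_batch)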

4.2 Mode shifting and adversarial interpretations

This subsection explains the intuition and adversarial interpretation of the learning and sampling algorithm. It can be skipped in the first reading.

The above learning and sampling algorithm can be interpreted as density shifting or mode shifting. In the sampling step, the Langevin dynamics settles the synthesized examples at the low energy regions or high density regions, or major modes (or basins) of E(I; θ), i.e., modes with low energies or high probabilities, so that the energies of the synthesized examples tend to be low. The learning step seeks to change the energy function by changing θ so as to lower the energies of the observed examples relative to those of the synthesized examples. This has the effect of shifting the low energy or high density regions from the synthesized examples toward the observed examples, or shifting the major modes of the energy function from the synthesized examples toward the observed examples, until the observed examples reside in the major modes of the model. If the major modes are too diffused around the observed examples, the learning step will sharpen them to focus on the observed examples. This mode shifting interpretation is related to the Hopfield network (Hopfield, 1982) and the attractor network (Seung, 1998), with the Langevin dynamics serving as the attractor dynamics. But it is important to ensure that the modes for encoding the observed examples are the major modes with high probabilities.

The energy landscape may have numerous major modes that are not occupied by the observed examples, and these modes encode imagined examples that are considered similar to the observed examples. Even though the maximum likelihood learning matches the average statistical properties between the model and the data, the bottom-up ConvNet is expressive enough to create modes to encode the highly varied patterns. We still lack an in-depth understanding of the energy landscape.

The learning and sampling algorithm also has an adversarial interpretation where the learning and sampling steps play a minimax game. Let the value function be defined as

V(θ, {Ĩ_i}) = (1/ñ) Σ_{i=1}^{ñ} E(Ĩ_i; θ) − (1/n) Σ_{i=1}^{n} E(I_i; θ).    (10)

The learning step updates θ to increase V, while the Langevin sampling step tends to relax {Ĩ_i} to decrease V. The zero temperature limit of the Langevin sampling is gradient descent that decreases V, and the resulting learning and sampling algorithm is a generalized version of herding (Welling, 2009). See also (Xie, Zhu, and Wu, 2017). This is related to Wasserstein GAN (Arjovsky, Chintala, and Bottou, 2017), but the critic and the actor are the same energy-based model, i.e., the model itself is its own generator and critic.

(Tu, 2007; Jin, Lazarow, and Tu, 2017) developed a discriminative method for updating the model by learning a classifier or logistic regression to distinguish between the observed examples {I_i} and the synthesized examples {Ĩ_i}, and tilting the current model according to the logistic regression, as discussed in subsection 3.2. It is also an “analysis by synthesis” scheme as well as an adversarial scheme, except that the analysis is performed by a classifier (as in GAN) instead of a critic.

5 Contrastive divergence

This section reviews the original CD, and presents the modified and adversarial CD.

5.1 Original CD

The MCMC sampling of p(I; θ) may take a long time to converge, especially if the learned p(I; θ) is multi-modal, which is often the case because the data distribution P_data is usually multi-modal. In order to learn from large datasets, we can only afford a small budget MCMC, i.e., within each learning iteration, we can only run MCMC for a small number of steps. To meet such a challenge, (Hinton, 2002) proposed the contrastive divergence (CD) method, where within each learning iteration, we initialize the finite-step MCMC from each observed example I_i in the current learning batch to obtain a synthesized example Ĩ_i. The parameters are then updated according to the learning gradient (9).

Let M_θ be the transition kernel of the finite-step MCMC that samples from p(I; θ). For any probability distribution P and any Markov transition kernel M, let MP denote the marginal distribution obtained after running M starting from P. The learning gradient of CD approximately follows the gradient of the difference between two KL divergences:

KL(P_data ‖ p(·; θ)) − KL(M_θ P_data ‖ p(·; θ)),    (11)

thus the name “contrastive divergence”. If M_θ P_data is close to p(·; θ), then the second divergence is small, and the CD estimate is close to the maximum likelihood estimate, which minimizes the first divergence. However, it is likely that P_data and the learned p(·; θ) are multi-modal. It is expected that p(·; θ) is smoother than P_data, i.e., P_data is “colder” than p(·; θ) in the language of simulated annealing (Kirkpatrick et al., 1983). If P_data is different from p(·; θ), it is unlikely that M_θ P_data becomes much closer to p(·; θ), due to the trapping in local modes. This may lead to bias in the CD estimate.

The CD learning is related to the score matching estimator (Hyvärinen, 2005, 2007) and the auto-encoder (Vincent, 2011; Swersky et al., 2011; Alain and Bengio, 2014). (Xie et al., 2016) showed that CD1 tends to fit the auto-encoder in (3). This is because the learning step shifts the local modes from the synthesized examples toward the observed examples, until the observed examples reside in the modes, and the modes satisfy the auto-encoder in (3). One potential weakness of CD (as well as score matching, auto-encoder, and the Hopfield attractor network) is that CD only explores the local modes around the observed examples, but these local modes may not be the major modes with high probabilities. It is as if fitting a mixture model with correct component distributions but incorrect component probabilities.

A persistent version of CD (Tieleman, 2008) is to initialize the MCMC from the observed examples in the beginning, and then in each learning epoch, the MCMC is initialized from the synthesized examples obtained in the previous epoch. The persistent CD may still face the challenge of traversing and exploring different local energy minima.

5.2 Modified and adversarial CDs

This subsection explains modifications of CD, including methods based on an additional generator network. It can be skipped in the first reading.

The original CD initializes the MCMC sampling from the data distribution P_data. We may modify it by initializing the MCMC sampling from a given distribution P_0, in the hope that M_θ P_0 is closer to p(·; θ) than M_θ P_data is. The learning gradient approximately follows the gradient of

KL(P_data ‖ p(·; θ)) − KL(M_θ P_0 ‖ p(·; θ)).    (12)

That is, we run a finite-step MCMC from a given initial distribution P_0, and use the resulting samples as synthesized examples to approximate the expectation in (7). The approximation can be made more accurate using annealed importance sampling (Neal, 2001). Following the idea of simulated annealing, P_0 should be a “smoother” distribution than p(·; θ) (the extreme case is to start from the white noise q). Unlike persistent CD, here the finite-step MCMC is non-persistent, sometimes also referred to as “cold start”, where the MCMC is initialized from the given P_0 within each learning iteration, instead of from the examples synthesized in the previous learning epoch. The cold-start version is easier to implement for mini-batch learning.

In multi-grid CD (to be introduced in the next section), at each grid, P_0 is the distribution of the up-scaled images generated by the model at the previous coarser grid. At the smallest grid, P_0 is the one-dimensional histogram of the 1 × 1 versions of the training images.

Another possibility is to recruit a generator network q_α as an approximate direct sampler (Kim and Bengio, 2016), so that θ and α can be jointly learned by the adversarial CD:

KL(P_data ‖ p(·; θ)) − KL(q_α ‖ p(·; θ)).    (13)

That is, the learning of θ is modified CD with q_α supplying the synthesized examples, and the learning of α is based on min_α KL(q_α ‖ p(·; θ)), which is a variational approximation.

(Xie et al., 2017) also studied the problem of jointly learning the energy-based model and the generator model. The learning of the energy-based model is based on the modified CD:

KL(P_data ‖ p(·; θ)) − KL(M_θ q_α ‖ p(·; θ)),    (14)

with q_α taking the role of P_0, whereas the learning of the generator is based on how M_θ modifies q_α, and is accomplished by min_α KL(M_θ q_{α_t} ‖ q_α), where α_t is the current generator parameter, i.e., q_α accumulates the MCMC transitions so as to be close to the stationary distribution of M_θ, which is p(·; θ).

In this paper, we shall not consider recruiting a generator network, so that we do not need to worry about the mismatch between the generator model and the energy-based model. In other words, instead of relying on a learned approximate direct sampler, we endeavor to develop small budget MCMC for sampling.

6 Multi-grid modeling and sampling

We propose a minimal contrastive divergence method for learning and sampling generative ConvNet models at multiple grids. For an image I, let I^(k), k = 0, 1, ..., K, be the multi-grid versions of I, with I^(0) being the minimal 1 × 1 version of I and I^(K) = I. For each k, we can divide the image grid of I^(k) into squared blocks of s × s pixels, where s is the down-scaling factor. We can reduce each block into a single pixel by averaging the intensity values of the s × s pixels. Such a down-scaling operation maps I^(k) to I^(k−1). Conversely, we can also define an up-scaling operation, by expanding each pixel of I^(k−1) into an s × s block of constant intensity to obtain an up-scaled version of I^(k−1). The up-scaled I^(k−1) is not identical to the original I^(k) because the high-resolution details are lost. The mapping from I^(k) to I^(k−1) is a linear projection onto a set of orthogonal basis vectors, each of which corresponds to a block. The up-scaling operation is a pseudo-inverse of this linear mapping. In general, s does not even need to be an integer for the existence of the linear mapping and its pseudo-inverse. The sequence {I^(k), k = 0, 1, ..., K} represents images of the same scene at different viewing distances.
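The down-scaling and up-scaling operations can be sketched with block averaging and constant block expansion (illustrative PyTorch code; an integer scaling factor s is assumed here for simplicity, although the text notes that s need not be an integer):

import torch.nn.functional as F

def down_scale(img, s):
    """Map grid k to grid k-1: average each s x s block into a single pixel."""
    return F.avg_pool2d(img, kernel_size=s)

def up_scale(img, s):
    """Pseudo-inverse of down_scale: expand each pixel into an s x s block
    of constant intensity (the lost high-resolution details are not recovered)."""
    return F.interpolate(img, scale_factor=s, mode='nearest')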

Let p^(k)(I^(k); θ^(k)) be the energy-based generative ConvNet model at grid k, for k = 1, ..., K. The distribution at the minimal grid, p^(0)(I^(0)), can be simply modeled by a one-dimensional histogram of intensities pooled from the 1 × 1 versions of the training images.
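Sampling the minimal grid from this one-dimensional histogram can be sketched as follows (illustrative code with a placeholder number of bins; for simplicity the intensities of the 1 × 1 versions are pooled into a single histogram and a single channel is sampled):

import torch

def sample_minimal_grid(train_images_1x1, n_samples, n_bins=32):
    """Draw 1 x 1 'images' from the histogram p^(0) pooled from the 1 x 1
    versions of the training images."""
    values = train_images_1x1.flatten()
    lo, hi = values.min().item(), values.max().item()
    hist = torch.histc(values, bins=n_bins, min=lo, max=hi)              # empirical histogram
    bins = torch.multinomial(hist / hist.sum(), n_samples, replacement=True)
    edges = torch.linspace(lo, hi, n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])                             # bin mid-points
    return centers[bins].view(n_samples, 1, 1, 1)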

Within each learning iteration, for each training image I_i in the current learning batch, we initialize the finite-step MCMC from its 1 × 1 version I_i^(0). For k = 1, ..., K, we sample Ĩ_i^(k) from the current p^(k)(I^(k); θ^(k)) by running l steps of Langevin dynamics initialized from the up-scaled version of Ĩ_i^(k−1) sampled at the previous coarser grid. After that, for k = 1, ..., K, we update the model parameters θ^(k) based on the difference between the synthesized {Ĩ_i^(k)} and the observed {I_i^(k)} according to equation (9).

Algorithm 1 provides the details of the multi-grid minimal CD algorithm.

Input:
(1) training examples {I_i, i = 1, ..., n},
(2) number of Langevin steps l,
(3) number of learning iterations T.

Output:
(1) estimated parameters {θ^(k), k = 1, ..., K},
(2) synthesized examples {Ĩ_i^(k), i = 1, ..., n, k = 1, ..., K}.

1:  Let t ← 0, initialize θ^(k) for k = 1, ..., K.
2:  repeat
3:     For i = 1, ..., n, initialize Ĩ_i^(0) ← I_i^(0).
4:     For k = 1, ..., K, initialize Ĩ_i^(k) as the up-scaled version of Ĩ_i^(k−1), and run l steps of Langevin dynamics to evolve Ĩ_i^(k), each step following equation (8).
5:     For k = 1, ..., K, update θ^(k) ← θ^(k) + γ_t L′(θ^(k)), with step size γ_t, where L′(θ^(k)) is computed according to equation (9).
6:     Let t ← t + 1.
7:  until t = T
Algorithm 1 Multi-grid minimal CD learning

In the above sampling scheme, I^(0) can be sampled directly because p^(0) is a one-dimensional histogram. Each p^(k−1) is expected to be smoother than p^(k). Thus the sampling scheme is similar to simulated annealing, where we run finite-step MCMC through a sequence of probability distributions that are increasingly multi-modal (or cold), in the hope of reaching and exploring the major modes of the model distributions. The learning process then shifts these major modes towards the observed examples, while sharpening these modes along the way, in order to memorize the observed examples with these major modes (instead of the spurious modes) of the model distributions.
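Putting the pieces together, one learning iteration of Algorithm 1 can be sketched as below, reusing down_scale, up_scale, run_langevin and maximum_likelihood_step from the earlier snippets (illustrative code only; the networks, scaling factor, σ = 1, step size and step count are placeholder assumptions rather than the paper's settings):

def multigrid_minimal_cd_iteration(nets, optimizers, batch, s=4, n_steps=30, delta=0.002):
    """One iteration of multi-grid minimal CD (steps 3-5 of Algorithm 1).

    nets:       K scoring networks, ordered from the coarsest grid to the finest
    optimizers: one optimizer per network
    batch:      observed training images at the finest grid
    """
    K = len(nets)
    # Multi-grid versions of the observed images: index 0 is the minimal grid.
    observed = [batch]
    for _ in range(K):
        observed.insert(0, down_scale(observed[0], s))

    # Step 3: initialize the chain at the minimal version of each training image.
    synthesized = observed[0]
    for k in range(1, K + 1):
        # Step 4: up-scale the previous grid's samples and run finite-step Langevin.
        synthesized = up_scale(synthesized, s)
        energy_fn = lambda img, net=nets[k - 1]: -net(img) + img.flatten(1).pow(2).sum(1) / 2
        synthesized = run_langevin(energy_fn, synthesized, n_steps=n_steps, delta=delta)
        # Step 5: update the model at grid k from observed vs. synthesized examples.
        maximum_likelihood_step(nets[k - 1], optimizers[k - 1], observed[k], synthesized)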

Figure 2: Synthesized images from models learned on the CelebA (top) and forest road category of MIT places205 (bottom) datasets. From left to right: observed images, images synthesized by DCGAN (Radford, Metz, and Chintala, 2015), single-grid CD and multi-grid CD. CD1 and persistent CD cannot synthesize realistic images and their results are not shown.

Let P^(k) be the data distribution of I^(k). Let p^(k) = p^(k)(I^(k); θ^(k)) be the model at grid k. Let p↑^(k−1) be the up-scaled version of the model p^(k−1). Specifically, let Ĩ^(k−1) be a random example at grid k−1 distributed according to p^(k−1), and let Ĩ^(k−1)↑ be the up-scaled version of Ĩ^(k−1); then p↑^(k−1) is the distribution of Ĩ^(k−1)↑. Let M^(k) be the Markov transition kernel of the l-step Langevin dynamics that samples p^(k). The learning gradient of the multi-grid minimal CD method at grid k approximately follows the gradient of the difference between two KL divergences:

KL(P^(k) ‖ p^(k)) − KL(M^(k) p↑^(k−1) ‖ p^(k)).    (15)

Here p↑^(k−1) is smoother than p^(k), and M^(k) p↑^(k−1) will evolve to a distribution close to p^(k) by creating details at the current resolution. If we use the original CD by initializing the MCMC from P^(k), then we are sampling a multi-modal (cold) distribution p^(k) by initializing from a presumably even more multi-modal (or colder) distribution P^(k), and we may not expect the resulting distribution M^(k) P^(k) to be close to the target p^(k).

Theoretical analysis of the properties of CD learning and its variants is generally difficult. In this paper, we study the multi-grid CD empirically.

7 Experiments

We learn the models at K = 3 grids, which we refer to as grid1, grid2 and grid3, from the coarsest to the finest.

We carry out qualitative and quantitative experiments to compare our method against several baseline methods and to evaluate the advantages of minimal CD and multi-grid sampling. The first baseline is single-grid CD: starting from the 1 × 1 version of the image, we directly up-scale it to the finest grid and sample an image using a single generative ConvNet. The other two baselines are CD1 (one-step CD) and persistent CD, which initialize the sampling from the observed images.

7.1 Implementation details

The training images are resized to the resolution of the finest grid. Since the models of the three grids act on images of different scales, we design a specific ConvNet structure per grid: grid1 and grid3 use 3-layer networks and grid2 uses a 4-layer network, each built from strided convolutional filters. A fully-connected layer with a single output channel is added on top of every grid to produce the value of the scoring function f(I^(k); θ^(k)). Batch normalization (Ioffe and Szegedy, 2015) and ReLU activations are applied after every convolution. At each iteration, we run 30 steps of Langevin dynamics for each grid. All networks are trained simultaneously with mini-batches, and the learning rate is decayed logarithmically during training.

For CD1, persistent CD and single-grid CD, we follow the same setting as multi-grid CD, except that for persistent CD and single-grid CD, we set the number of Langevin steps so as to maintain the same total MCMC budget as multi-grid CD. We use the same network structure as grid3 for these baseline methods.

7.2 Synthesis and diagnosis

Figure 3: Synthesized images from models learned by multi-grid CD on 4 categories of the MIT places205 dataset: rock, volcano, hotel room and building facade.

We learn multi-grid models from the CelebA (Liu et al., 2015) and MIT places205 (Zhou et al., 2014) datasets. In the CelebA dataset, the images are cropped at the center. We randomly sample 10,000 images for training. For the MIT places205 dataset, we learn the models from images of a single place category. Fig. 2 shows some synthesized images from the CelebA dataset and the forest road category of the MIT places205 dataset. We also show synthesized images generated by models learned by DCGAN (Radford, Metz, and Chintala, 2015) and single-grid CD. CD1 and persistent CD cannot synthesize realistic images, so we do not show their synthesis results. Compared with single-grid CD, images generated by multi-grid CD are more realistic. The results from the multi-grid models are comparable to the results from DCGAN. Fig. 3 shows synthesized images from models learned by multi-grid CD on four other categories of the MIT places205 dataset. The number of training images varies across these categories.

To monitor the convergence of multi-grid CD, we show the norm of the gradients over iterations of learning on the CelebA dataset in Fig. 4, which suggests that the learning is stable.

Figure 4: Norm of the gradients over iterations of learning on the CelebA dataset, for grid1, grid2 and grid3.

To monitor model fitting and synthesis, we calculate the values of the scoring function f(I^(k); θ^(k)) after training. Table 1 shows the results after training on the CelebA dataset. We randomly sample 10,000 CelebA images that are not included in the training dataset for testing, and use images randomly sampled from MIT places205 as negative examples. Compared with the negative images, the scores of the training and testing images are higher and close to each other. The scores of the training and synthesized images are also close, indicating that the synthesized images are close to fair samples.

Images        grid1          grid2          grid3
Training      5.33 ± 0.91    8.59 ± 1.12    2.59 ± 0.10
Testing       5.33 ± 0.89    8.27 ± 1.01    2.47 ± 0.10
Synthesized   5.15 ± 0.91    8.38 ± 1.17    2.58 ± 0.11
Negative      4.10 ± 1.00    5.42 ± 1.19    1.99 ± 0.11
Table 1: Average ± standard deviation of the score f(I^(k); θ^(k)) at each grid.

To check the diversity of the Langevin dynamics sampling, we synthesize images by initializing the Langevin dynamics from the same image. As shown in Fig. 5, after the finite-step Langevin dynamics, the images sampled from the same starting image are different from each other.

Figure 5: Synthesized images obtained by initializing the Langevin dynamics sampling from the same image. Each block of 4 images is generated from the same starting image.

7.3 Learning feature maps for classification

The SVHN dataset (Netzer et al., 2011) consists of color images of house numbers collected by Google Street View. The training set consists of 73,257 images and the testing set has 26,032 images. We use the training set to learn the models and the learned models generate images as shown in Fig. 6.

Figure 6: Synthesized images from models learned on the SVHN dataset. Panels: observed images, DCGAN, single-grid CD and multi-grid CD. CD1 and persistent CD cannot synthesize realistic images and their results are not shown.
Method                                                   1,000    2,000    4,000
DGN (Kingma et al., 2014)                                36.02    -        -
Virtual Adversarial (Miyato et al., 2015)                24.63    -        -
Auxiliary Deep Generative Model (Maaløe et al., 2016)    22.86    -        -
DCGAN + L2-SVM (Radford, Metz, and Chintala, 2015)       22.48    -        -
persistent CD                                            42.74    35.20    29.16
one-step CD                                              29.75    23.90    19.15
single-grid CD                                           21.63    17.90    15.07
Supervised CNN with the same structure                   39.04    22.26    15.24
multi-grid CD + CNN classifier                           19.73    15.86    12.71
Table 2: Test error rates of classification on SVHN with 1,000, 2,000 and 4,000 labeled training images.

To evaluate the feature maps learned by multi-grid CD, we perform a classification experiment. The procedure is similar to the one outlined in (Radford, Metz, and Chintala, 2015). That is, we use multi-grid CD as a feature extractor. We first train the models on the SVHN training set in an unsupervised way. Then we learn a classifier from labeled data based on the learned feature maps. Specifically, we extract the top-layer feature maps of the three grids and train a two-layer classification CNN on top of the feature maps. The first layer is a strided convolutional layer with 64 channels, applied separately to each of the three feature maps. The outputs from the three feature maps are then concatenated to form a 34,624-dimensional vector, and a fully-connected layer is added on top of the vector. We train this classifier using 1,000, 2,000 and 4,000 labeled examples that are randomly sampled from the training set and are uniformly distributed over the classes. As shown in Table 2, our method achieves a test error rate of 19.73% with 1,000 labeled images. Within the same setting, our method outperforms CD1, persistent CD and single-grid CD. For comparison, we train a classification network with the same structure (as used in multi-grid CD plus the two classification layers) on the same labeled training data. It has a significantly higher error rate of 39.04% with 1,000 labeled training images.
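A sketch of the classifier used on top of the learned feature maps (illustrative PyTorch code; the kernel size, stride, hidden width and number of classes are placeholders, and the three frozen top-layer feature maps are assumed to be extracted beforehand and passed in as a list):

import torch
import torch.nn as nn

class FeatureMapClassifier(nn.Module):
    """Two-layer classifier on top of the (frozen) top-layer feature maps of the
    three grid networks: a per-grid strided conv, concatenation, then a linear layer."""
    def __init__(self, feat_channels, num_classes=10, hidden=64):
        super().__init__()
        # One strided convolution per grid, applied separately to each feature map.
        self.convs = nn.ModuleList(
            nn.Conv2d(c, hidden, kernel_size=3, stride=2, padding=1) for c in feat_channels
        )
        self.fc = nn.LazyLinear(num_classes)   # fully-connected layer on the concatenation

    def forward(self, feature_maps):           # list of three feature-map tensors
        flat = [conv(f).flatten(1) for conv, f in zip(self.convs, feature_maps)]
        return self.fc(torch.cat(flat, dim=1))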

7.4 Image inpainting

We further test our method on image inpainting. In this task, we try to learn the conditional distribution p(I_M | I_{M^c}; θ) with our models, where I_M consists of the pixels to be masked and I_{M^c} consists of the pixels that are not masked. In the training stage, we randomly place the mask on each training image, but we assume I_M is observed in training. We follow the same learning and sampling algorithm as in Algorithm 1, except that in the sampling step (i.e., step 4 of Algorithm 1), in each Langevin step only the masked part of the image is updated, and the unmasked part remains fixed as observed. This is a generalization of pseudo-likelihood estimation (Besag, 1974), which corresponds to the case where I_M consists of one pixel. It can also be considered a form of associative memory (Hopfield, 1982). After learning from the fully observed training images, we then use the model to inpaint the masked testing images, where the masked parts are not observed.
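The inpainting sampler only differs from the ordinary Langevin step in that the update is applied to the masked pixels while the observed pixels stay fixed; a minimal sketch (illustrative code; `energy_fn` is any per-image energy as in the earlier snippets, and `mask` is 1 on the pixels to be inpainted and 0 elsewhere):

import torch

def masked_langevin_step(energy_fn, img, mask, delta=0.002):
    """One inpainting Langevin step: update only the pixels where mask == 1,
    keeping the unmasked (observed) pixels fixed at their observed values."""
    img = img.detach().requires_grad_(True)
    grad_I = torch.autograd.grad(energy_fn(img).sum(), img)[0]
    noise = torch.randn_like(img)
    update = -0.5 * delta ** 2 * grad_I + delta * noise
    return (img + mask * update).detach()      # observed region receives no update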

Figure 7: Inpainting examples on the CelebA dataset. In each block, from left to right: (1) the original image; (2) the masked input; (3) the inpainted image by multi-grid CD.
Metric   Mask type   PCD      CD1      SCD      CE       MCD
Error    Mask        0.056    0.081    0.066    0.045    0.042
Error    Doodle      0.055    0.078    0.055    0.050    0.045
Error    Pepper      0.069    0.084    0.054    0.060    0.036
PSNR     Mask        12.81    12.66    15.97    17.37    16.42
PSNR     Doodle      12.92    12.68    14.79    15.40    16.98
PSNR     Pepper      14.93    15.00    15.36    17.04    19.34
Table 3: Quantitative evaluations for three types of masks. Lower values of error are better. Higher values of PSNR are better. PCD, SCD and MCD indicate persistent CD, single-grid CD and multi-grid CD, respectively.

We use 10,000 face images randomly sampled from the CelebA dataset to train the model. During training, the size of the square mask is fixed but its position is randomly selected for each training image. Another 1,000 face images are randomly selected from the CelebA dataset for testing. We find that during testing, the mask does not need to be restricted to a square mask. So we test three different shapes of masks: 1) square mask, 2) doodle mask, and 3) pepper and salt mask. Fig. 7 shows some inpainting examples.

We perform quantitative evaluations using two metrics: 1) reconstruction error measured by the per-pixel difference, and 2) peak signal-to-noise ratio (PSNR). The metrics are computed on the masked pixels, between the inpainting results obtained by different methods and the original face images. We compare with persistent CD, CD1 and single-grid CD. We also compare with the ContextEncoder (CE) (Pathak et al., 2016). We retrain the CE model on the 10,000 training face images for a fair comparison. As our tested masks are not in the image center, we use the “inpaintRandom” version of the CE code and randomly place a mask in each image during training. The results are shown in Table 3. They show that the multi-grid CD learning works well for the inpainting task.

8 Conclusion

This paper proposes a minimal contrastive divergence method for learning multi-grid energy-based generative ConvNet models. We show that the method can learn realistic models of images and the learned models can be useful for tasks such as image processing and classification.

Because an energy-based generative ConvNet corresponds directly to a ConvNet classifier, it is of fundamental importance to study such models for the purpose of unsupervised learning. Our work seeks to facilitate the learning of such models by developing small budget MCMC initialized from a simple distribution for sampling from the learned models. In particular, we learn multi-stage models corresponding to multi-stage reductions of the observed examples, and let the multi-stage models guide the multi-stage MCMC sampling. The multi-stage data reductions do not have to follow a multi-grid scheme, although the latter preserves the convolutional structures of the networks.

It is our hope that this paper will stimulate further research on learning energy-based generative ConvNets with efficient MCMC sampling.

Code

Our experiments are based on MatConvNet (Vedaldi and Lenc, 2015). The code for our experiments can be downloaded from the project page: http://www.stat.ucla.edu/~ruiqigao/multigrid/main.html

Acknowledgement

We thank Zhuowen Tu for sharing with us his insights on his recent work on introspective learning. We thank Jianwen Xie for helpful discussions.

The work is supported by DARPA SIMPLEX N66001-15-C-4035, ONR MURI N00014-16-1-2007, and DARPA ARO W911NF-16-1-0579.

References

  • Alain and Bengio (2014) Alain, G., and Bengio, Y. 2014. What regularized auto-encoders learn from the data-generating distribution. The Journal of Machine Learning Research 15(1):3563–3593.
  • Arjovsky, Chintala, and Bottou (2017) Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein gan. arXiv preprint arXiv:1701.07875.
  • Besag (1974) Besag, J. 1974. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological) 192–236.
  • Dai, Lu, and Wu (2015) Dai, J.; Lu, Y.; and Wu, Y. N. 2015. Generative modeling of convolutional neural networks. In ICLR.
  • Denton et al. (2015) Denton, E. L.; Chintala, S.; Fergus, R.; et al. 2015. Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, 1486–1494.
  • Girolami and Calderhead (2011) Girolami, M., and Calderhead, B. 2011. Riemann manifold langevin and hamiltonian monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73(2):123–214.
  • Goodfellow et al. (2014) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.
  • Goodman and Sokal (1989) Goodman, J., and Sokal, A. D. 1989. Multigrid monte carlo method. conceptual foundations. Physical Review D 40(6):2035.
  • Gutmann and Hyvärinen (2012) Gutmann, M. U., and Hyvärinen, A. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research 13(Feb):307–361.
  • Hinton et al. (2006) Hinton, G. E.; Osindero, S.; Welling, M.; and Teh, Y.-W. 2006. Unsupervised discovery of nonlinear structure using contrastive backpropagation. Cognitive Science 30(4):725–731.
  • Hinton, Osindero, and Teh (2006) Hinton, G. E.; Osindero, S.; and Teh, Y.-W. 2006. A fast learning algorithm for deep belief nets. Neural Computation 18:1527–1554.
  • Hinton (2002) Hinton, G. E. 2002. Training products of experts by minimizing contrastive divergence. Neural Computation 14(8):1771–1800.
  • Hopfield (1982) Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences 79(8):2554–2558.
  • Hyvärinen (2005) Hyvärinen, A. 2005. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6:695–709.
  • Hyvärinen (2007) Hyvärinen, A. 2007. Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Neural Networks, IEEE Transactions on 18(5):1529–1531.
  • Ioffe and Szegedy (2015) Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
  • Jin, Lazarow, and Tu (2017) Jin, L.; Lazarow, J.; and Tu, Z. 2017. Introspective learning for discriminative classification. In Advances in Neural Information Processing Systems.
  • Kim and Bengio (2016) Kim, T., and Bengio, Y. 2016. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439.
  • Kingma and Welling (2014) Kingma, D. P., and Welling, M. 2014. Auto-encoding variational bayes. ICLR.
  • Kingma et al. (2014) Kingma, D. P.; Mohamed, S.; Rezende, D. J.; and Welling, M. 2014. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, 3581–3589.
  • Kirkpatrick et al. (1983) Kirkpatrick, S.; Gelatt, C. D.; Vecchi, M. P.; et al. 1983. Optimization by simulated annealing. science 220(4598):671–680.
  • Krizhevsky, Sutskever, and Hinton (2012) Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS, 1097–1105.
  • LeCun et al. (1998) LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
  • LeCun et al. (2006) LeCun, Y.; Chopra, S.; Hadsell, R.; Ranzato, M.; and Huang, F. J. 2006. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press.
  • Lee et al. (2009) Lee, H.; Grosse, R.; Ranganath, R.; and Ng, A. Y. 2009. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 609–616. ACM.
  • Liu et al. (2015) Liu, Z.; Luo, P.; Wang, X.; and Tang, X. 2015. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, 3730–3738.
  • Lu, Zhu, and Wu (2016) Lu, Y.; Zhu, S.-C.; and Wu, Y. N. 2016. Learning FRAME models using CNN filters. In Thirtieth AAAI Conference on Artificial Intelligence.
  • Maaløe et al. (2016) Maaløe, L.; Sønderby, C. K.; Sønderby, S. K.; and Winther, O. 2016. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473.
  • Miyato et al. (2015) Miyato, T.; Maeda, S.-i.; Koyama, M.; Nakae, K.; and Ishii, S. 2015. Distributional smoothing by virtual adversarial examples. stat 1050:2.
  • Mnih and Gregor (2014) Mnih, A., and Gregor, K. 2014. Neural variational inference and learning in belief networks. In ICML.
  • Montufar et al. (2014) Montufar, G. F.; Pascanu, R.; Cho, K.; and Bengio, Y. 2014. On the number of linear regions of deep neural networks. In NIPS, 2924–2932.
  • Neal (2001) Neal, R. M. 2001. Annealed importance sampling. Statistics and Computing 11.
  • Neal (2011) Neal, R. M. 2011. Mcmc using hamiltonian dynamics. Handbook of Markov Chain Monte Carlo 2.
  • Netzer et al. (2011) Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011,  5.
  • Ngiam et al. (2011) Ngiam, J.; Chen, Z.; Koh, P. W.; and Ng, A. Y. 2011. Learning deep energy models. In International Conference on Machine Learning.
  • Pathak et al. (2016) Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; and Efros, A. A. 2016. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2536–2544.
  • Radford, Metz, and Chintala (2015) Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
  • Rezende, Mohamed, and Wierstra (2014) Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. In Jebara, T., and Xing, E. P., eds., ICML, 1278–1286. JMLR Workshop and Conference Proceedings.
  • Salakhutdinov and Hinton (2009) Salakhutdinov, R., and Hinton, G. E. 2009. Deep Boltzmann machines. In AISTATS.
  • Seung (1998) Seung, H. S. 1998. Learning continuous attractors in recurrent networks. In NIPS, 654–660. MIT Press.
  • Swersky et al. (2011) Swersky, K.; Ranzato, M.; Buchman, D.; Marlin, B.; and Freitas, N. 2011. On autoencoders and score matching for energy based models. In ICML, 1201–1208. ACM.
  • Tieleman (2008) Tieleman, T. 2008. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, 1064–1071. ACM.
  • Tu (2007) Tu, Z. 2007. Learning generative models via discriminative approaches. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, 1–8. IEEE.
  • Vedaldi and Lenc (2015) Vedaldi, A., and Lenc, K. 2015. Matconvnet – convolutional neural networks for matlab. In Proceeding of the ACM Int. Conf. on Multimedia.
  • Vincent (2011) Vincent, P. 2011. A connection between score matching and denoising autoencoders. Neural Computation 23(7):1661–1674.
  • Welling (2009) Welling, M. 2009. Herding dynamical weights to learn. In ICML, 1121–1128. ACM.
  • Xie et al. (2016) Xie, J.; Lu, Y.; Zhu, S.-C.; and Wu, Y. N. 2016. A theory of generative convnet. In ICML.
  • Xie et al. (2017) Xie, J.; Lu, Y.; Gao, R.; Zhu, S.-C.; and Wu, Y. N. 2017. Cooperative training of descriptor and generator networks. arXiv preprint arXiv:1609.09408.
  • Xie, Zhu, and Wu (2017) Xie, J.; Zhu, S.-C.; and Wu, Y. N. 2017. Synthesizing dynamic patterns by spatial-temporal generative convnet. In CVPR.
  • Zhou et al. (2014) Zhou, B.; Lapedriza, A.; Xiao, J.; Torralba, A.; and Oliva, A. 2014. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, 487–495.
  • Zhu and Mumford (1998) Zhu, S.-C., and Mumford, D. 1998. Grade: Gibbs reaction and diffusion equations. In ICCV, 847–854.