1 Introduction
This paper studies the problem of learning energy-based generative ConvNet models (LeCun et al., 2006; Hinton, 2002; Hinton et al., 2006; Hinton, Osindero, and Teh, 2006; Salakhutdinov and Hinton, 2009; Lee et al., 2009; Ngiam et al., 2011; Lu, Zhu, and Wu, 2016; Xie et al., 2016; Xie, Zhu, and Wu, 2017; Jin, Lazarow, and Tu, 2017) of images. The model is in the form of a Gibbs distribution whose energy function is defined by a bottom-up convolutional neural network (ConvNet or CNN). It can be derived from the commonly used discriminative ConvNet (LeCun et al., 1998; Krizhevsky, Sutskever, and Hinton, 2012) as a direct consequence of Bayes rule (Dai, Lu, and Wu, 2015), but unlike the discriminative ConvNet, the generative ConvNet is endowed with the gift of imagination in that it can generate images by sampling from the probability distribution of the model. As a result, the generative ConvNet can be learned in an unsupervised setting without requiring class labels. The learned model can be used as a prior model for image processing. It can also be turned into a discriminative ConvNet for classification.
The maximum likelihood learning of the energy-based generative ConvNet model follows an “analysis by synthesis” scheme: we sample synthesized examples from the current model, usually by Markov chain Monte Carlo (MCMC), and then update the model parameters based on the difference between the observed training examples and the synthesized examples. The probability distribution or the energy function of the learned model is likely to be multimodal if the training data are highly varied. The MCMC may have difficulty traversing the different modes and may take a long time to converge. A simple and popular modification of maximum likelihood learning is contrastive divergence (CD) learning (Hinton, 2002), where for each observed training example, we obtain a corresponding synthesized example by initializing a finite-step MCMC from the observed example. Such a method can be scaled up to large training datasets using mini-batch training. However, the synthesized examples may be far from fair samples of the current model, thus biasing the learned model parameters. A modification of CD is persistent CD (Tieleman, 2008), where the MCMC is still initialized from the observed example at the initial learning epoch, but in each subsequent learning epoch, the finite-step MCMC is initialized from the synthesized example of the previous epoch. Running persistent chains may make the synthesized examples less biased by the observed examples, although the persistent chains may still have difficulty traversing different modes of the learned model.
To address the above challenges under the constraint of a finite-budget MCMC, we propose a minimal contrastive divergence method to learn energy-based generative ConvNet models at multiple scales or grids. Specifically, for each training image, we obtain its multi-grid versions by repeated down-scaling. Our method learns a separate generative ConvNet model at each grid. Within each iteration of our learning algorithm, for each observed training image, we generate the corresponding synthesized images at multiple grids. Specifically, we initialize the finite-step MCMC sampling from the minimal version of the training image, and the synthesized image at each grid serves to initialize the finite-step MCMC that samples from the model at the subsequent finer grid. See Fig. 1 for an illustration, where we sample images sequentially at 3 grids, with 30 steps of Langevin dynamics at each grid. After obtaining the synthesized images at the multiple grids, the models at the multiple grids are updated separately and simultaneously based on the differences between the synthesized images and the observed training images at the corresponding grids. We call this learning method the multi-grid minimal contrastive divergence because it initializes the finite-step MCMC from the minimal version of the training image.
The advantages of the proposed method are as follows.
(1) The finite-step MCMC is initialized from the minimal version of the observed image, instead of the original observed image. Thus the synthesized image is much less biased by the observed image than in the original CD.
(2) The learned models at coarse grids are expected to be smoother than the models at fine grids. Sampling the models at increasingly finer grids sequentially is like a simulated annealing process (Kirkpatrick et al., 1983), and the finite budget MCMC is likely to produce synthesized images that are close to fair samples from the learned models.
(3) The images and their models at multiple grids correspond to the same scene at different viewing distances. Thus the models at the multiple grids and the coarse-to-fine sampling processes are physically natural.
(4) Unlike the original CD or persistent CD, the learned models are equipped with a fixed-budget MCMC that can generate new synthesized images from scratch, because we only need to initialize the MCMC by sampling from the one-dimensional histogram of the minimal versions of the training images.
We show that the proposed method can learn realistic models of images. The learned models can be used for image processing such as image inpainting. The learned feature maps can be used for subsequent tasks such as classification.
The contributions of our paper are as follows. We propose a minimal CD method for learning multi-grid energy-based generative ConvNet models. We show empirically that the proposed method outperforms the original CD, persistent CD, as well as single-grid learning. More importantly, we show that a small-budget MCMC is capable of generating varied and realistic patterns. Deep energy-based models have not received the attention they deserve in the recent literature because of their reliance on MCMC sampling. It is our hope that this paper will stimulate further research on designing efficient MCMC algorithms for learning deep energy-based models.
2 Related work
Our method is a modification of CD (Hinton, 2002) for training energy-based models. In general, both the data distribution of the observed training examples and the learned model distribution can be multimodal, and the data distribution can be even more multimodal than the model distribution. The finite-step MCMC of CD initialized from the data distribution may only explore local modes around the training examples, thus the finite-step MCMC may not get close to the model distribution. This can also be the case with persistent CD (Tieleman, 2008). In contrast, our method initializes the finite-step MCMC from the minimal version of the original image, and the sampling of the model at each grid is initialized from the image sampled from the model at the previous coarser grid. The model distribution at the coarser grid is expected to be smoother than the model distribution at the finer grid, and the coarse-to-fine MCMC is likely to generate varied samples that are close to fair samples from the learned models. As a result, the learned models obtained by our method can be closer to the maximum likelihood estimate than those of the original CD.
The multi-grid Monte Carlo method originated from statistical physics (Goodman and Sokal, 1989). Our work is perhaps the first to apply multi-grid sampling to CD learning of deep energy-based models. The motivation for multi-grid Monte Carlo is that reducing the scale or resolution leads to a smoother or less multimodal distribution. The difference between our method and the multi-grid MCMC in statistical physics is that in the latter, the distribution at the lower resolution is obtained from the distribution at the higher resolution. In our work, the models at different grids are learned from training images at different resolutions directly and separately.
Besides the energy-based generative ConvNet model, another popular deep generative model is the generator network, which maps a latent vector that follows a simple prior distribution to the image via a top-down ConvNet. The model is usually trained together with an assisting model, such as an inferential model as in the variational autoencoder (VAE) (Kingma and Welling, 2014; Rezende, Mohamed, and Wierstra, 2014; Mnih and Gregor, 2014), or a discriminative model as in generative adversarial networks (GAN) (Goodfellow et al., 2014; Denton et al., 2015; Radford, Metz, and Chintala, 2015). The focus of this paper is on training deep energy-based models without resorting to a different class of models, so that we do not need to be concerned with the mismatch between two different classes of models. Unlike the generator model, the energy-based generative ConvNet model corresponds directly to the discriminative ConvNet classifier
(Dai, Lu, and Wu, 2015). Our learning method is based on maximum likelihood. Recently, (Jin, Lazarow, and Tu, 2017) proposed an introspective learning method that updates the model based on discriminative learning. It is possible to apply multi-grid learning and sampling to their method.
3 Energy-based generative ConvNet
This section reviews the energy-based generative ConvNet and explains its equivalence to the commonly used discriminative ConvNet.
3.1 Exponential tilting
Let $x$ be an image defined on a square (or rectangular) lattice. We use $p(x; \theta)$ to denote the probability distribution of $x$ with parameter $\theta$. The energy-based generative ConvNet model is as follows (Xie et al., 2016):

$p(x; \theta) = \frac{1}{Z(\theta)} \exp\{f(x; \theta)\}\, q(x)$,  (1)

where $q(x)$ is the reference or negative distribution, such as Gaussian white noise $q(x) \propto \exp\{-\|x\|^2/(2\sigma^2)\}$ (or a uniform distribution within a bounded range). The scoring function $f(x; \theta)$ is defined by a bottom-up ConvNet whose parameters are denoted by $\theta$. The normalizing constant $Z(\theta) = \int \exp\{f(x; \theta)\}\, q(x)\, dx$ is analytically intractable. For the Gaussian reference distribution, the energy function is

$E(x; \theta) = \frac{\|x\|^2}{2\sigma^2} - f(x; \theta)$,  (2)

whose gradient takes the form of a reconstruction error,

$\frac{\partial}{\partial x} E(x; \theta) = \frac{1}{\sigma^2}\left(x - \sigma^2 \frac{\partial}{\partial x} f(x; \theta)\right)$,  (3)

where $\sigma^2\, \partial f(x; \theta)/\partial x$ plays the role of an autoencoding reconstruction of $x$.
3.2 Equivalence to discriminative ConvNet
Model (1) corresponds to a classifier in the following sense (Dai, Lu, and Wu, 2015; Xie et al., 2016; Jin, Lazarow, and Tu, 2017). Suppose there are $C$ categories, with models $p(x; \theta_c) = \frac{1}{Z(\theta_c)} \exp\{f(x; \theta_c)\}\, q(x)$ for $c = 1, \dots, C$, in addition to the background category $c = 0$ modeled by $q(x)$. The ConvNets $f(x; \theta_c)$ for the $C$ categories may share common lower layers. Let $\rho_c$ be the prior probability of category $c$, $c = 0, 1, \dots, C$. Then the posterior probability for classifying an example $x$ to the category $c$ is a softmax multi-class classifier

$p(c \mid x) = \frac{\exp\{f(x; \theta_c) + b_c\}}{\sum_{c'=0}^{C} \exp\{f(x; \theta_{c'}) + b_{c'}\}}$,  (4)

where $b_c = \log(\rho_c / Z(\theta_c))$, and for $c = 0$, $f(x; \theta_0) = 0$ and $b_0 = \log \rho_0$. Conversely, if we have the softmax classifier (4), then the distribution of each category $c$ is of the form (1). Thus the energy-based generative ConvNet directly corresponds to the commonly used discriminative ConvNet.
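As a small numerical sketch of this correspondence (with assumed names: `scores[c]` stands for $f(x; \theta_c)$, which is 0 for the background category, and `biases[c]` for $b_c = \log(\rho_c / Z(\theta_c))$), the softmax posterior of (4) can be computed as:

```python
import numpy as np

def classifier_posterior(scores, biases):
    """Softmax posterior p(c | x) induced by per-category energy-based
    models: scores[c] plays the role of f(x; theta_c) (0 for the
    background category) and biases[c] of log(rho_c / Z(theta_c))."""
    logits = np.asarray(scores, dtype=float) + np.asarray(biases, dtype=float)
    e = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return e / e.sum()
```

With two categories (background plus one positive class), this reduces to the logistic regression of (5) below.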
In the case where we only observe unlabeled examples, we may model them by a single distribution $p(x; \theta)$ in (1), treated as the positive distribution, with $q(x)$ the negative distribution. Let $\rho$ be the prior probability that a random example comes from $p(x; \theta)$. Then the posterior probability that a random example $x$ comes from $p(x; \theta)$ is

$\Pr(\text{positive} \mid x) = \frac{1}{1 + \exp\{-(f(x; \theta) + b)\}}$,  (5)

where $b = \log(\rho / ((1 - \rho) Z(\theta)))$, i.e., a logistic regression.
(Tu, 2007; Jin, Lazarow, and Tu, 2017; Gutmann and Hyvärinen, 2012) exploited this fact to estimate $f(x; \theta)$ and $b$ by logistic regression, in order to exponentially tilt $q(x)$ to $p(x; \theta)$. (Tu, 2007; Jin, Lazarow, and Tu, 2017) devised an iterative algorithm that treats the current $p(x; \theta)$ as the new $q(x)$, to be tilted to a new $p(x; \theta)$. The logistic regression tends to focus on the examples that are on the boundaries between the positive and negative distributions. In our paper, we shall pursue maximum likelihood learning, which matches the average statistical properties of the model distribution and the data distribution. The equivalence to the discriminative ConvNet classifier justifies the importance and naturalness of the energy-based generative ConvNet model.
4 Maximum likelihood
While the discriminative ConvNet must be learned in a supervised setting, the generative ConvNet model in (1) can be learned from unlabeled data by maximum likelihood, and the resulting learning and sampling algorithm admits an adversarial interpretation.
4.1 Learning and sampling algorithm
Suppose we observe training examples $\{x_i, i = 1, \dots, n\}$ from an unknown data distribution $P_{\text{data}}(x)$. The maximum likelihood learning seeks to maximize the log-likelihood function

$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log p(x_i; \theta)$.  (6)

If the sample size $n$ is large, the maximum likelihood estimator minimizes the Kullback-Leibler divergence $\mathrm{KL}(P_{\text{data}} \| p_\theta)$ from the data distribution $P_{\text{data}}$ to the model distribution $p_\theta$. The gradient of $L(\theta)$ is

$L'(\theta) = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \theta} f(x_i; \theta) - \mathrm{E}_{\theta}\!\left[\frac{\partial}{\partial \theta} f(x; \theta)\right]$,  (7)

where $\mathrm{E}_{\theta}$ denotes the expectation with respect to $p(x; \theta)$. The key to the above identity is that $\partial \log Z(\theta)/\partial \theta = \mathrm{E}_{\theta}[\partial f(x; \theta)/\partial \theta]$.
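For completeness, this identity follows from differentiating $\log Z(\theta)$ under the integral sign, a standard exponential-family calculation sketched here in the notation of (1):

```latex
\frac{\partial}{\partial\theta} \log Z(\theta)
  = \frac{1}{Z(\theta)} \int \frac{\partial}{\partial\theta} \exp\{f(x;\theta)\}\, q(x)\, dx
  = \int \frac{\partial f(x;\theta)}{\partial\theta} \cdot
    \frac{\exp\{f(x;\theta)\}\, q(x)}{Z(\theta)}\, dx
  = \mathrm{E}_{\theta}\!\left[ \frac{\partial}{\partial\theta} f(x;\theta) \right].
```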
The expectation in equation (7) is analytically intractable and has to be approximated by MCMC, such as the Langevin dynamics (Zhu and Mumford, 1998; Girolami and Calderhead, 2011), which iterates the following step:

$x_{\tau+1} = x_{\tau} - \frac{\delta^2}{2} \frac{\partial}{\partial x} E(x_{\tau}; \theta) + \delta\, \epsilon_{\tau}, \quad \epsilon_{\tau} \sim \mathrm{N}(0, I)$,  (8)

where $\tau$ indexes the time steps of the Langevin dynamics, $\delta$ is the step size, and $\epsilon_{\tau}$ is Gaussian white noise. The Langevin dynamics relaxes $x_{\tau}$ to a low energy region, while the noise term provides randomness and variability. The energy gradient for relaxation is in the form of the reconstruction error of the autoencoder (3) (for a uniform $q$, the gradient is simply the derivative of $f$). A Metropolis-Hastings step can be added to correct for the finite step size $\delta$. We can also use Hamiltonian Monte Carlo for sampling the generative ConvNet (Neal, 2011; Dai, Lu, and Wu, 2015).
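The Langevin update of (8) can be sketched in a few lines of NumPy; here `grad_energy` is a hypothetical stand-in for the ConvNet's backpropagated energy gradient $\partial E/\partial x$:

```python
import numpy as np

def langevin_step(x, grad_energy, delta=0.002, rng=None):
    """One Langevin update (Eq. 8): gradient descent on the energy
    plus Gaussian noise for randomness and variability.

    x           -- current image as a numpy array
    grad_energy -- function returning dE/dx at x (placeholder for the
                   ConvNet's backpropagated energy gradient)
    delta       -- Langevin step size
    """
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.normal(size=x.shape)
    return x - (delta ** 2 / 2.0) * grad_energy(x) + delta * noise
```

For the toy quadratic energy $E(x) = \|x\|^2/2$, the chain is an Ornstein-Uhlenbeck process whose stationary distribution is approximately standard normal for small `delta`, which is a convenient sanity check.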
We can run $\tilde{n}$ parallel chains of the Langevin dynamics according to (8) to obtain the synthesized examples $\{\tilde{x}_i, i = 1, \dots, \tilde{n}\}$. The Monte Carlo approximation to $L'(\theta)$ is

$L'(\theta) \approx \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \theta} f(x_i; \theta) - \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} \frac{\partial}{\partial \theta} f(\tilde{x}_i; \theta)$,  (9)

which is used to update $\theta$.
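The Monte Carlo approximation to the gradient in (7) is just a difference of two averages of $\partial f/\partial\theta$; its structure can be sketched with a hypothetical per-example gradient function `f_grad`:

```python
import numpy as np

def ml_gradient(f_grad, observed, synthesized):
    """Monte Carlo estimate of the log-likelihood gradient of Eq. (7):
    the average of df/dtheta over observed examples minus the same
    average over synthesized examples from the current model."""
    return (np.mean([f_grad(x) for x in observed], axis=0)
            - np.mean([f_grad(x) for x in synthesized], axis=0))
```

For instance, in the toy exponential family $f(x; \theta) = \theta x$, where $\partial f/\partial\theta = x$, the estimate reduces to the difference between the sample means of the observed and synthesized examples.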
Both the learning and sampling steps involve derivatives of $f(x; \theta)$, with respect to $\theta$ and to $x$ respectively. The derivatives can be computed efficiently by backpropagation, and the computations of the two derivatives share most of the chain-rule steps. The learning algorithm is thus in the form of alternating backpropagation.
4.2 Mode shifting and adversarial interpretations
This subsection explains the intuition and adversarial interpretation of the learning and sampling algorithm. It can be skipped in the first reading.
The above learning and sampling algorithm can be interpreted as density shifting or mode shifting. In the sampling step, the Langevin dynamics settles the synthesized examples $\{\tilde{x}_i\}$ at the low energy regions or high density regions, i.e., the major modes (or basins) of the current model, so that the energy $E(\tilde{x}_i; \theta)$ tends to be low. The learning step seeks to change the energy function by changing $\theta$ so as to lower the energies of the observed examples $\{x_i\}$ relative to those of the synthesized examples. This has the effect of shifting the low energy or high density regions from the synthesized examples toward the observed examples, or shifting the major modes of the energy function from the synthesized examples toward the observed examples, until the observed examples reside in the major modes of the model. If the major modes are too diffused around the observed examples, the learning step will sharpen them to focus on the observed examples. This mode shifting interpretation is related to the Hopfield network (Hopfield, 1982) and the attractor network (Seung, 1998), with the Langevin dynamics serving as the attractor dynamics. But it is important to ensure that the modes encoding the observed examples are the major modes with high probabilities.
The energy landscape may have numerous major modes that are not occupied by the observed examples; these modes correspond to imagined examples that are similar to the observed examples. Even though maximum likelihood learning matches the average statistical properties between model and data, the bottom-up ConvNet is expressive enough to create modes to encode the highly varied patterns. We still lack an in-depth understanding of the energy landscape.
The learning and sampling algorithm also has an adversarial interpretation in which the learning and sampling steps play a minimax game. Let the value function be

$V(\theta, \{\tilde{x}_i\}) = \frac{1}{\tilde{n}} \sum_{i=1}^{\tilde{n}} E(\tilde{x}_i; \theta) - \frac{1}{n} \sum_{i=1}^{n} E(x_i; \theta)$.  (10)

The learning step updates $\theta$ to increase $V$, while the Langevin sampling step tends to relax $\{\tilde{x}_i\}$ to decrease $V$. The zero temperature limit of the Langevin sampling is gradient descent that decreases the energy, and the resulting learning and sampling algorithm is a generalized version of herding (Welling, 2009). See also (Xie, Zhu, and Wu, 2017). This is related to the Wasserstein GAN (Arjovsky, Chintala, and Bottou, 2017), but here the critic and the actor are the same energy-based model, i.e., the model itself is its own generator and critic.
(Tu, 2007; Jin, Lazarow, and Tu, 2017) developed a discriminative method that updates the model by learning a classifier or logistic regression to distinguish between the observed examples $\{x_i\}$ and the synthesized examples $\{\tilde{x}_i\}$, and tilts the current model according to the logistic regression, as discussed in subsection 3.2. It is also an “analysis by synthesis” scheme as well as an adversarial scheme, except that the analysis is performed by a classifier (as in GAN) instead of a critic.
5 Contrastive divergence
This section reviews the original CD, and presents the modified and adversarial CD.
5.1 Original CD
The MCMC sampling of $p(x; \theta)$ may take a long time to converge, especially if the learned $p(x; \theta)$ is multimodal, which is often the case because $P_{\text{data}}$ is usually multimodal. In order to learn from large datasets, we can only afford a small-budget MCMC, i.e., within each learning iteration, we can only run the MCMC for a small number of steps. To meet this challenge, (Hinton, 2002) proposed the contrastive divergence (CD) method, where within each learning iteration, we initialize the finite-step MCMC from each observed example $x_i$ in the current learning batch to obtain a synthesized example $\tilde{x}_i$. The parameters are then updated according to the learning gradient in subsection 4.1.
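The original CD loop can be sketched as follows; `mcmc_step` is a placeholder for one step of any MCMC update under the current model (for example, a Langevin step), so `k = 1` corresponds to CD1:

```python
import numpy as np

def cd_synthesize(batch, mcmc_step, k=1):
    """Original CD: initialize a k-step MCMC chain from each observed
    example in the batch to obtain the corresponding synthesized example.

    mcmc_step -- one-step MCMC update under the current model
    k         -- number of MCMC steps (k=1 gives CD1)
    """
    synthesized = []
    for x in batch:
        x_tilde = np.array(x, dtype=float)  # start the chain at the observed example
        for _ in range(k):
            x_tilde = mcmc_step(x_tilde)
        synthesized.append(x_tilde)
    return synthesized
```

The synthesized examples are then plugged into the Monte Carlo gradient of subsection 4.1 in place of fair samples from the model.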
Let $M_\theta$ be the transition kernel of the finite-step MCMC that samples from $p(x; \theta)$. For any probability distribution $P$ and any Markov transition kernel $M$, let $MP$ denote the marginal distribution obtained after running $M$ starting from $P$. The learning gradient of CD approximately follows the gradient of the difference between two KL divergences:

$\mathrm{KL}(P_{\text{data}} \| p_\theta) - \mathrm{KL}(M_\theta P_{\text{data}} \| p_\theta)$,  (11)

thus the name “contrastive divergence”. If $M_\theta P_{\text{data}}$ is close to $p_\theta$, then the second divergence is small, and the CD estimate is close to the maximum likelihood estimate, which minimizes the first divergence. However, it is likely that $P_{\text{data}}$ and the learned $p_\theta$ are multimodal. It is expected that $p_\theta$ is smoother than $P_{\text{data}}$, i.e., $P_{\text{data}}$ is “colder” than $p_\theta$ in the language of simulated annealing (Kirkpatrick et al., 1983). If $P_{\text{data}}$ is different from $p_\theta$, it is unlikely that $M_\theta P_{\text{data}}$ becomes much closer to $p_\theta$, due to the trapping of local modes. This may lead to bias in the CD estimate.
The CD learning is related to the score matching estimator (Hyvärinen, 2005, 2007) and the autoencoder (Vincent, 2011; Swersky et al., 2011; Alain and Bengio, 2014). (Xie et al., 2016) showed that CD1 tends to fit the autoencoder in (3). This is because the learning step shifts the local modes from the synthesized examples toward the observed examples, until the observed examples reside in the modes, and the modes satisfy the autoencoder in (3). One potential weakness of CD (as well as score matching, the autoencoder, and the Hopfield attractor network) is that CD only explores the local modes around the observed examples, but these local modes may not be the major modes with high probabilities. It is as if fitting a mixture model with correct component distributions but incorrect component probabilities.
A persistent version of CD (Tieleman, 2008) initializes the MCMC from the observed examples in the beginning, and then in each subsequent learning epoch, the MCMC is initialized from the synthesized examples obtained in the previous epoch. The persistent CD may still face the challenge of traversing and exploring different local energy minima.
5.2 Modified and adversarial CDs
This subsection explains modifications of CD, including methods based on an additional generator network. It can be skipped in the first reading.
The original CD initializes the MCMC sampling from the data distribution $P_{\text{data}}$. We may modify it by initializing the MCMC sampling from a given distribution $p_0$, in the hope that $M_\theta p_0$ is closer to $p_\theta$ than $M_\theta P_{\text{data}}$ is. The learning gradient approximately follows the gradient of

$\mathrm{KL}(P_{\text{data}} \| p_\theta) - \mathrm{KL}(M_\theta p_0 \| p_\theta)$.  (12)

That is, we run a finite-step MCMC from the given initial distribution $p_0$, and use the resulting samples as the synthesized examples to approximate the expectation in (7). The approximation can be made more accurate using annealed importance sampling (Neal, 2001). Following the idea of simulated annealing, $p_0$ should be a “smoother” distribution than $p_\theta$ (the extreme case is to start from the white noise distribution $q$). Unlike persistent CD, here the finite-step MCMC is non-persistent, sometimes also referred to as a “cold start”, where the MCMC is initialized from the given $p_0$ within each learning iteration, instead of from the examples synthesized in the previous learning epoch. The cold start version is easier to implement for mini-batch learning.
In the multi-grid CD (to be introduced in the next section), at each grid, $p_0$ is the distribution of the images generated at the previous coarser grid. At the smallest grid, $p_0$ is the one-dimensional histogram of the minimal versions of the training images.
Another possibility is to recruit a generator network $g_\alpha$ as an approximate direct sampler (Kim and Bengio, 2016), so that $\theta$ and $\alpha$ can be jointly learned by the adversarial CD: the learning of $\theta$ approximately follows the gradient of

$\mathrm{KL}(P_{\text{data}} \| p_\theta) - \mathrm{KL}(g_\alpha \| p_\theta)$,  (13)

while $\alpha$ is updated by $\min_\alpha \mathrm{KL}(g_\alpha \| p_\theta)$. That is, the learning of $\theta$ is a modified CD with $g_\alpha$ supplying the synthesized examples, and the learning of $\alpha$ is based on $\mathrm{KL}(g_\alpha \| p_\theta)$, which is a variational approximation.
(Xie et al., 2017) also studied the problem of jointly learning the energy-based model and the generator model. The learning of the energy-based model is based on the modified CD

$\mathrm{KL}(P_{\text{data}} \| p_\theta) - \mathrm{KL}(M_\theta g_\alpha \| p_\theta)$,  (14)

with $g_\alpha$ taking the role of $p_0$, whereas the learning of the generator is based on how $M_\theta$ modifies $g_\alpha$, and is accomplished by $\min_\alpha \mathrm{KL}(M_\theta g_{\alpha^{(t)}} \| g_\alpha)$, i.e., $g_\alpha$ accumulates the MCMC transitions so as to be close to the stationary distribution of $M_\theta$, which is $p_\theta$.
In this paper, we shall not consider recruiting a generator network, so that we do not need to worry about the mismatch between the generator model and the energy-based model. In other words, instead of relying on a learned approximate direct sampler, we endeavor to develop a small-budget MCMC for sampling.
6 Multi-grid modeling and sampling
We propose a minimal contrastive divergence method for learning and sampling generative ConvNet models at multiple grids. For an image $x$, let $\{x^{(s)}, s = 0, 1, \dots, S\}$ be the multi-grid versions of $x$, with $x^{(0)}$ being the minimal version of $x$ and $x^{(S)} = x$. For each $s$, we can divide the image grid of $x^{(s)}$ into squared blocks of $b \times b$ pixels. We can reduce each block to a single pixel by averaging the intensity values of its pixels. Such a down-scaling operation maps $x^{(s)}$ to $x^{(s-1)}$. Conversely, we can also define an up-scaling operation, by expanding each pixel of $x^{(s-1)}$ into a block of constant intensity, to obtain an up-scaled version of $x^{(s-1)}$. The up-scaled $x^{(s-1)}$ is not identical to the original $x^{(s)}$ because the high resolution details are lost. The mapping from $x^{(s)}$ to $x^{(s-1)}$ is a linear projection onto a set of orthogonal basis vectors, each of which corresponds to a block. The up-scaling operation is a pseudo-inverse of this linear mapping. In general, the scaling factor does not even need to be an integer for the existence of the linear mapping and its pseudo-inverse. The sequence $\{x^{(s)}\}$ represents the images of the same scene at different viewing distances.
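For an integer block size, the block-averaging projection and its pseudo-inverse can be sketched in NumPy (a minimal illustration; the function names are ours):

```python
import numpy as np

def downscale(image, b):
    """Down-scaling: average each b-by-b block into a single pixel
    (a linear projection onto block-wise constant basis vectors)."""
    h, w = image.shape
    return image.reshape(h // b, b, w // b, b).mean(axis=(1, 3))

def upscale(image, b):
    """Up-scaling: expand each pixel into a b-by-b block of constant
    intensity (the pseudo-inverse of the projection above)."""
    return np.kron(image, np.ones((b, b)))
```

Note that `downscale(upscale(y, b), b)` recovers `y` exactly, while `upscale(downscale(x, b), b)` loses the high resolution details of `x`, as stated above.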
Let $p^{(s)}(x^{(s)}; \theta^{(s)})$ be the energy-based generative ConvNet model at grid $s$, for $s = 1, \dots, S$. At the minimal grid, $p^{(0)}$ can be simply modeled by a one-dimensional histogram of pixel intensities pooled from the minimal versions of the training images.
Within each learning iteration, for each training image in the current learning batch, we initialize the finite-step MCMC from the minimal version $x^{(0)}$ of the training image. For $s = 1, \dots, S$, we sample from the current $p^{(s)}(x^{(s)}; \theta^{(s)})$ by running a fixed number of steps of the Langevin dynamics, starting from the up-scaled version of the image sampled at the previous coarser grid. After that, for $s = 1, \dots, S$, we update the model parameters $\theta^{(s)}$ based on the difference between the synthesized images and the observed training images at grid $s$, according to the learning gradient in subsection 4.1.
Algorithm 1 provides the details of the multi-grid minimal CD algorithm.
In the above sampling scheme, the minimal image $x^{(0)}$ can be sampled directly because its model is a one-dimensional histogram. Each $p^{(s-1)}$ is expected to be smoother than $p^{(s)}$. Thus the sampling scheme is similar to simulated annealing, where we run finite-step MCMC through a sequence of probability distributions that are increasingly multimodal (or cold), in the hope of reaching and exploring the major modes of the model distributions. The learning process then shifts these major modes towards the observed examples, while sharpening them along the way, in order to memorize the observed examples with these major modes (instead of spurious modes) of the model distributions.
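The coarse-to-fine generation scheme can be sketched as a short skeleton under assumed interfaces: `hist_sample` draws the minimal image from the histogram, `grid_steps` holds one-step Langevin updates for the learned models from coarse to fine, and `upscale` maps an image to the next finer grid:

```python
import numpy as np

def multigrid_sample(hist_sample, grid_steps, upscale, steps=30):
    """Generate synthesized images from scratch: draw the minimal image
    directly, then at each finer grid up-scale the previous sample and
    refine it with a few Langevin steps under that grid's model."""
    x = hist_sample()                # direct sample at the minimal grid
    samples = []
    for langevin_step_fn in grid_steps:
        x = upscale(x)               # initialize from the coarser grid's sample
        for _ in range(steps):       # finite-budget MCMC at this grid
            x = langevin_step_fn(x)
        samples.append(x)
    return samples
```

With 3 grids and 30 Langevin steps per grid, as in Fig. 1, the total MCMC budget per image is 90 steps.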
(Figure 2 panels: observed images; DCGAN; single-grid CD; multi-grid CD.)
Let $P^{(s)}_{\text{data}}$ be the data distribution of $x^{(s)}$. Let $p^{(s)}_\theta$ be the model at grid $s$. Let $\tilde{P}^{(s-1)}$ be the up-scaled version of the model $p^{(s-1)}_\theta$. Specifically, let $x^{(s-1)}$ be a random example at grid $s-1$ drawn from $p^{(s-1)}_\theta$, and let $\tilde{x}^{(s)}$ be the up-scaled version of $x^{(s-1)}$; then $\tilde{P}^{(s-1)}$ is the distribution of $\tilde{x}^{(s)}$. Let $M^{(s)}_\theta$ be the Markov transition kernel of the finite-step Langevin dynamics that samples $p^{(s)}_\theta$. The learning gradient of the multi-grid minimal CD method at grid $s$ approximately follows the gradient of the difference between two KL divergences:

$\mathrm{KL}(P^{(s)}_{\text{data}} \| p^{(s)}_\theta) - \mathrm{KL}(M^{(s)}_\theta \tilde{P}^{(s-1)} \| p^{(s)}_\theta)$.  (15)

$\tilde{P}^{(s-1)}$ is smoother than $p^{(s)}_\theta$, and $M^{(s)}_\theta \tilde{P}^{(s-1)}$ will evolve to a distribution close to $p^{(s)}_\theta$ by creating details at the current resolution. If we were to use the original CD by initializing the MCMC from $P^{(s)}_{\text{data}}$, then we would be sampling a multimodal (cold) distribution by initializing from a presumably even more multimodal (or colder) distribution, and we may not expect the resulting distribution to be close to the target $p^{(s)}_\theta$.
Theoretical analysis of the properties of CD learning and its variants is generally difficult. In this paper, we study the multi-grid CD empirically.
7 Experiments
We learn the models at 3 grids, which we refer to as grid1, grid2 and grid3, from the coarsest to the finest.
We carry out qualitative and quantitative experiments to evaluate our method against several baseline methods, and to evaluate the advantages of minimal CD and multi-grid sampling. The first baseline is single-grid CD: starting from the minimal version of the image, we directly up-scale it to the finest grid and sample an image using a single generative ConvNet. The other two baselines are CD1 (one-step CD) and persistent CD, which initialize the sampling from the observed images.
7.1 Implementation details
The training images are resized to the size of the finest grid. Since the models of the three grids act on images of different scales, we design a specific ConvNet structure per grid: grid1 has a 3-layer network, grid2 a 4-layer network, and grid3 a 3-layer network, each consisting of strided convolutional layers, with the numbers of channels chosen per grid. A fully-connected layer is added on top of every grid to produce the value of the scoring function. Batch normalization (Ioffe and Szegedy, 2015) and ReLU activations are applied after every convolution. At each iteration, we run the same fixed number of Langevin steps for each grid. All networks are trained simultaneously with mini-batches, and the learning rate is decayed logarithmically over the iterations. For CD1, persistent CD and single-grid CD, we follow the same setting as the multi-grid CD, except that for persistent CD and single-grid CD we increase the number of Langevin steps to maintain the same total MCMC budget as the multi-grid CD. We use the same network structure as grid3 for these baseline methods.
7.2 Synthesis and diagnosis
(Figure 3 panels: rock; volcano; hotel room; building facade.)
We learn multi-grid models from the CelebA (Liu et al., 2015) and MIT places205 (Zhou et al., 2014) datasets. In the CelebA dataset, the images are cropped at the center. We randomly sample 10,000 images for training. For the MIT places205 dataset, we learn the models from images of a single place category. Fig. 2 shows some synthesized images from the CelebA dataset and the forest road category of the MIT places205 dataset. We also show synthesized images generated by models learned by DCGAN (Radford, Metz, and Chintala, 2015) and by single-grid CD. CD1 and persistent CD cannot synthesize realistic images, so we do not show their synthesis results. Compared with single-grid CD, the images generated by multi-grid CD are more realistic. The results from the multi-grid models are comparable to the results from DCGAN. Fig. 3 shows synthesized images from models learned from other categories of the MIT places205 dataset by multi-grid CD.
To monitor the convergence of multi-grid CD, we show in Fig. 4 the norms of the gradients over the learning iterations on the CelebA dataset, which suggests that the learning is stable.
(Figure 4 panels: grid1; grid2; grid3.)
To monitor model fitting and synthesis, we calculate the values of the scoring function after training. Table 1 shows the results after training on the CelebA dataset. We randomly sample 10,000 images from CelebA that are not included in the training set for testing, and use images randomly sampled from MIT places205 as negative examples. Compared with the negative images, the scores of the training and testing images are higher and close to each other. The scores of the training and synthesized images are also close, indicating that the synthesized images are close to fair samples.
Table 1: scoring function values (mean ± standard deviation) at each grid.

Images       grid1         grid2         grid3
Training     5.33 ± 0.91   8.59 ± 1.12   2.59 ± 0.10
Testing      5.33 ± 0.89   8.27 ± 1.01   2.47 ± 0.10
Synthesized  5.15 ± 0.91   8.38 ± 1.17   2.58 ± 0.11
Negative     4.10 ± 1.00   5.42 ± 1.19   1.99 ± 0.11
To check the diversity of the Langevin dynamics sampling, we synthesize multiple images by initializing the Langevin dynamics from the same image. As shown in Fig. 5, after running the Langevin dynamics, the images sampled from the same initial image are different from each other.
7.3 Learning feature maps for classification
The SVHN dataset (Netzer et al., 2011) consists of color images of house numbers collected by Google Street View. The training set consists of 73,257 images and the testing set has 26,032 images. We use the training set to learn the models and the learned models generate images as shown in Fig. 6.
(Figure 6 panels: observed images; DCGAN; single-grid CD; multi-grid CD.)
Test error rate with # of labeled images   1,000   2,000   4,000
(baseline)                                 36.02   —       —
(baseline)                                 24.63   —       —
(baseline)                                 22.86   —       —
(baseline)                                 22.48   —       —
persistent CD                              42.74   35.20   29.16
one-step CD (CD1)                          29.75   23.90   19.15
single-grid CD                             21.63   17.90   15.07
supervised CNN with the same structure     39.04   22.26   15.24
multi-grid CD + CNN classifier             19.73   15.86   12.71
To evaluate the feature maps learned by multi-grid CD, we perform a classification experiment. The procedure is similar to the one outlined in (Radford, Metz, and Chintala, 2015): we use the multi-grid models as a feature extractor. We first train the models on the SVHN training set in an unsupervised way. Then we learn a classifier from labeled data based on the learned feature maps. Specifically, we extract the top-layer feature maps of the three grids and train a two-layer classification CNN on top of them. The first layer is a strided convolutional layer with 64 channels, operated separately on each of the three feature maps. The outputs from the three feature maps are then concatenated to form a 34,624-dimensional vector, and a fully-connected layer is added on top of this vector. We train the classifier using 1,000, 2,000 and 4,000 labeled examples that are randomly sampled from the training set with a uniform class distribution. As shown in Table 2, our method achieves a test error rate of 19.73% with 1,000 labeled images. Within the same setting, our method outperforms CD1, persistent CD and single-grid CD. For comparison, we train a classification network with the same structure (as used in multi-grid CD plus the two classification layers) on the same labeled training data. It has a significantly higher error rate of 39.04% with 1,000 labeled training images.
7.4 Image inpainting
We further test our method on image inpainting. In this task, we learn the conditional distribution $p(x_{\mathrm{M}} \mid x_{\bar{\mathrm{M}}}; \theta)$ with our models, where $x_{\mathrm{M}}$ consists of the pixels to be masked and $x_{\bar{\mathrm{M}}}$ consists of the pixels not to be masked. In the training stage, we randomly place the mask on each training image, but the full image, including $x_{\mathrm{M}}$, is observed in training. We follow the same learning and sampling algorithm as in Algorithm 1, except that in the sampling step (i.e., step 4 in Algorithm 1), each Langevin step updates only the masked part of the image, while the unmasked part remains fixed as observed. This is a generalization of the pseudo-likelihood estimation (Besag, 1974), which corresponds to the case where $x_{\mathrm{M}}$ consists of a single pixel. It can also be considered a form of associative memory (Hopfield, 1982). After learning from the fully observed training images, we then use the model to inpaint the masked testing images, where the masked parts are not observed.
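The masked Langevin update described above can be sketched as follows (a minimal NumPy illustration; `grad_energy` is a hypothetical stand-in for the model's backpropagated energy gradient):

```python
import numpy as np

def inpaint_langevin_step(x, mask, grad_energy, delta=0.002, rng=None):
    """One conditional Langevin step for inpainting: propose a full
    Langevin update, then keep the observed (unmasked) pixels fixed.
    mask is a boolean array marking the pixels to be synthesized."""
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.normal(size=x.shape)
    proposal = x - (delta ** 2 / 2.0) * grad_energy(x) + delta * noise
    return np.where(mask, proposal, x)   # unmasked pixels stay clamped
```

Clamping the observed pixels at every step is what turns the sampler into a draw from the conditional distribution of the masked pixels given the unmasked ones.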
Table 3: Reconstruction error and PSNR on the masked pixels for the three mask types (PCD: persistent CD; SCD: single-grid CD; CE: Context Encoder; MCD: multi-grid CD).

               PCD     CD1     SCD     CE      MCD
Error  Mask    0.056   0.081   0.066   0.045   0.042
       Doodle  0.055   0.078   0.055   0.050   0.045
       Pepper  0.069   0.084   0.054   0.060   0.036
PSNR   Mask    12.81   12.66   15.97   17.37   16.42
       Doodle  12.92   12.68   14.79   15.40   16.98
       Pepper  14.93   15.00   15.36   17.04   19.34
We use 10,000 face images randomly sampled from the CelebA dataset to train the model. During training, the size of the square mask is fixed, but its position is randomly selected for each training image. Another 1,000 face images are randomly selected from the CelebA dataset for testing. We find that at test time the mask need not be restricted to a square, so we test three different shapes of masks: 1) a square mask, 2) a doodle mask, and 3) a pepper-and-salt mask. Fig. 7 shows some inpainting examples.
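The random mask placement used during training can be sketched as follows. This is a minimal sketch; the image and mask sizes are placeholders, since the exact mask size used in the paper is not restated here.

```python
import numpy as np

def random_square_mask(h, w, size, rng):
    """Boolean (h, w) mask with one size-by-size square placed
    uniformly at random; True marks pixels to be masked."""
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    mask = np.zeros((h, w), dtype=bool)
    mask[top:top + size, left:left + size] = True
    return mask

rng = np.random.default_rng(0)
# A fresh mask position is drawn for each training image.
mask = random_square_mask(64, 64, 32, rng)
```

The mask size stays fixed across training images; only the position varies.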
We perform quantitative evaluations using two metrics: 1) reconstruction error, measured by the per-pixel difference, and 2) peak signal-to-noise ratio (PSNR). Both metrics are computed on the masked pixels, between the inpainting results obtained by different methods and the original face images. We compare with persistent CD, CD1, and single-grid CD. We also compare with the Context Encoder (Pathak et al., 2016) (CE); for a fair comparison, we retrain the CE model on the same 10,000 training face images. As our tested masks are not restricted to the image center, we use the "inpaintRandom" version of the CE code and randomly place a mask in each image during training. The results in Table 3 show that multi-grid CD learning works well for the inpainting task.
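The two evaluation metrics can be sketched as follows, restricted to the masked pixels. This is an illustrative sketch: the paper does not specify the exact norm for the per-pixel difference, so mean absolute difference is assumed here, with pixel values in [0, 1].

```python
import numpy as np

def masked_error(original, inpainted, mask):
    """Mean absolute per-pixel difference over the masked pixels
    (assumed metric; pixel values in [0, 1])."""
    return np.mean(np.abs(original[mask] - inpainted[mask]))

def masked_psnr(original, inpainted, mask, max_val=1.0):
    """Peak signal-to-noise ratio computed over the masked pixels."""
    mse = np.mean((original[mask] - inpainted[mask]) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
original = rng.uniform(size=(64, 64))
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True
# Toy inpainting result: a constant offset of 0.1 on the masked region.
inpainted = original + mask * 0.1

print(round(masked_error(original, inpainted, mask), 3))  # 0.1
print(round(masked_psnr(original, inpainted, mask), 1))   # 20.0
```

A constant error of 0.1 on the mask gives an MSE of 0.01 and hence a PSNR of 10 log10(1/0.01) = 20 dB, matching the printed values.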
8 Conclusion
This paper proposes a minimal contrastive divergence method for learning multi-grid energy-based generative ConvNet models. We show that the method can learn realistic models of images and that the learned models are useful for tasks such as image processing and classification.
Because an energy-based generative ConvNet corresponds directly to a ConvNet classifier, it is of fundamental importance to study such models for the purpose of unsupervised learning. Our work seeks to facilitate the learning of such models by developing a small-budget MCMC, initialized from a simple distribution, for sampling from the learned models. In particular, we learn multi-stage models corresponding to multi-stage reductions of the observed examples, and let the multi-stage models guide the multi-stage MCMC sampling. The multi-stage data reductions do not have to follow a multi-grid scheme, although the latter preserves the convolutional structure of the networks.
It is our hope that this paper will stimulate further research on learning energy-based generative ConvNets with efficient MCMC sampling.
Code
Our experiments are based on MatConvNet (Vedaldi and Lenc, 2015). The code for our experiments can be downloaded from the project page: http://www.stat.ucla.edu/~ruiqigao/multigrid/main.html
Acknowledgement
We thank Zhuowen Tu for sharing his insights on his recent work on introspective learning. We thank Jianwen Xie for helpful discussions.
The work is supported by DARPA SIMPLEX N66001-15-C-4035, ONR MURI N00014-16-1-2007, and DARPA ARO W911NF-16-1-0579.
References

Alain, G., and Bengio, Y. 2014. What regularized auto-encoders learn from the data-generating distribution. The Journal of Machine Learning Research 15(1):3563–3593.
Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875.
Besag, J. 1974. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological) 192–236.
Dai, J.; Lu, Y.; and Wu, Y. N. 2015. Generative modeling of convolutional neural networks. In ICLR.
Denton, E. L.; Chintala, S.; Fergus, R.; et al. 2015. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, 1486–1494.
Girolami, M., and Calderhead, B. 2011. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73(2):123–214.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.
Goodman, J., and Sokal, A. D. 1989. Multigrid Monte Carlo method. Conceptual foundations. Physical Review D 40(6):2035.
Gutmann, M. U., and Hyvärinen, A. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research 13(Feb):307–361.
Hinton, G. E.; Osindero, S.; Welling, M.; and Teh, Y. W. 2006. Unsupervised discovery of nonlinear structure using contrastive backpropagation. Cognitive Science 30(4):725–731.
Hinton, G. E.; Osindero, S.; and Teh, Y. W. 2006. A fast learning algorithm for deep belief nets. Neural Computation 18:1527–1554.
Hinton, G. E. 2002. Training products of experts by minimizing contrastive divergence. Neural Computation 14(8):1771–1800.
Hopfield, J. J. 1982. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences 79(8):2554–2558.
Hyvärinen, A. 2005. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6:695–709.
Hyvärinen, A. 2007. Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Neural Networks, IEEE Transactions on 18(5):1529–1531.
Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Jin, L.; Lazarow, J.; and Tu, Z. 2017. Introspective learning for discriminative classification. In Advances in Neural Information Processing Systems.
Kim, T., and Bengio, Y. 2016. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439.
Kingma, D. P., and Welling, M. 2014. Auto-encoding variational Bayes. In ICLR.
Kingma, D. P.; Mohamed, S.; Rezende, D. J.; and Welling, M. 2014. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, 3581–3589.
Kirkpatrick, S.; Gelatt, C. D.; Vecchi, M. P.; et al. 1983. Optimization by simulated annealing. Science 220(4598):671–680.
Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In NIPS, 1097–1105.
LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
LeCun, Y.; Chopra, S.; Hadsell, R.; Ranzato, M.; and Huang, F. J. 2006. A tutorial on energy-based learning. In Predicting Structured Data. MIT Press.
Lee, H.; Grosse, R.; Ranganath, R.; and Ng, A. Y. 2009. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 609–616. ACM.
Liu, Z.; Luo, P.; Wang, X.; and Tang, X. 2015. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, 3730–3738.
Lu, Y.; Zhu, S.-C.; and Wu, Y. N. 2016. Learning FRAME models using CNN filters. In Thirtieth AAAI Conference on Artificial Intelligence.
Maaløe, L.; Sønderby, C. K.; Sønderby, S. K.; and Winther, O. 2016. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473.
Miyato, T.; Maeda, S.-i.; Koyama, M.; Nakae, K.; and Ishii, S. 2015. Distributional smoothing by virtual adversarial examples. stat 1050:2.
Mnih, A., and Gregor, K. 2014. Neural variational inference and learning in belief networks. In ICML.
Montufar, G. F.; Pascanu, R.; Cho, K.; and Bengio, Y. 2014. On the number of linear regions of deep neural networks. In NIPS, 2924–2932.
Neal, R. M. 2001. Annealed importance sampling. Statistics and Computing 11.
Neal, R. M. 2011. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo 2.
Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, 5.
Ngiam, J.; Chen, Z.; Koh, P. W.; and Ng, A. Y. 2011. Learning deep energy models. In International Conference on Machine Learning.
Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; and Efros, A. A. 2016. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2536–2544.
Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
Rezende, D. J.; Mohamed, S.; and Wierstra, D. 2014. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 1278–1286. JMLR Workshop and Conference Proceedings.
Salakhutdinov, R., and Hinton, G. E. 2009. Deep Boltzmann machines. In AISTATS.
Seung, H. S. 1998. Learning continuous attractors in recurrent networks. In NIPS, 654–660. MIT Press.
Swersky, K.; Ranzato, M.; Buchman, D.; Marlin, B.; and Freitas, N. 2011. On autoencoders and score matching for energy based models. In ICML, 1201–1208. ACM.
Tieleman, T. 2008. Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th International Conference on Machine Learning, 1064–1071. ACM.
Tu, Z. 2007. Learning generative models via discriminative approaches. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, 1–8. IEEE.
Vedaldi, A., and Lenc, K. 2015. MatConvNet: Convolutional neural networks for MATLAB. In Proceedings of the ACM International Conference on Multimedia.
Vincent, P. 2011. A connection between score matching and denoising autoencoders. Neural Computation 23(7):1661–1674.
Welling, M. 2009. Herding dynamical weights to learn. In ICML, 1121–1128. ACM.
Xie, J.; Lu, Y.; Zhu, S.-C.; and Wu, Y. N. 2016. A theory of generative ConvNet. In ICML.
Xie, J.; Lu, Y.; Gao, R.; Zhu, S.-C.; and Wu, Y. N. 2017. Cooperative training of descriptor and generator networks. arXiv preprint arXiv:1609.09408.
Xie, J.; Zhu, S.-C.; and Wu, Y. N. 2017. Synthesizing dynamic patterns by spatial-temporal generative ConvNet. In CVPR.
Zhou, B.; Lapedriza, A.; Xiao, J.; Torralba, A.; and Oliva, A. 2014. Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems, 487–495.
Zhu, S.-C., and Mumford, D. 1998. GRADE: Gibbs reaction and diffusion equations. In ICCV, 847–854.