Texture Modeling with Convolutional Spike-and-Slab RBMs and Deep Extensions

11/24/2012 ∙ by Heng Luo, et al. ∙ Université de Montréal

We apply the spike-and-slab Restricted Boltzmann Machine (ssRBM) to texture modeling. The ssRBM with tiled-convolution weight sharing (TssRBM) achieves or surpasses the state-of-the-art on texture synthesis and inpainting by parametric models. We also develop a novel RBM model with a spike-and-slab visible layer and binary variables in the hidden layer. This model is designed to be stacked on top of the TssRBM. We show the resulting deep belief network (DBN) is a powerful generative model that improves on single-layer models and is capable of modeling not only single high-resolution and challenging textures but also multiple textures.




1 Introduction

Texture processing is one of the essential components of scene understanding in human vision. Natural images can be seen as a large mixture of heterogeneous textures. Thus, to a certain extent, progress in modeling natural images requires that we make progress in modeling textures. To this end, texture modeling has been an active research area of machine learning, computer vision and graphics during the past five decades. Although nonparametric approaches (Lin et al., 2006) have made significant progress in synthesizing textures from example images, capturing the statistical properties of textures via a probabilistic model remains an active area of inquiry. Such probabilistic models are important not only for modeling natural images (Heess et al., 2009) but also for understanding human vision (Zhu et al., 2000).

In this work we consider a probabilistic model of textures based on the spike-and-slab Restricted Boltzmann Machine (ssRBM) (Courville et al., 2011a, b). The ssRBM has previously demonstrated the ability to generate samples of small natural images that preserve much of their statistical structure (Courville et al., 2011b). This suggests that the ssRBM is potentially well suited to the task of texture modeling. Following the recent exploration of Boltzmann machines for textures (Kivinen and Williams, 2012), we have trained ssRBMs with tiled-convolution weight sharing (Gregor and LeCun, 2010; Le et al., 2010) on the Brodatz texture images (www.ux.uis.no/tranden/brodatz.html). Tiled convolution allows weight sharing in filters with non-overlapping receptive fields. The use of tiled convolution is a particularly appropriate choice of model architecture in the context of texture modeling. The weight sharing allows the model to synthesize texture patches of variable size, while the tiling pattern of weight sharing allows us to efficiently devote model capacity to modeling the local texture patches.

In Kivinen and Williams (2012), the authors concentrated their quantitative evaluation of the texture models on a subset of the Brodatz textures that exhibit strong spatial invariance, i.e. textures largely consisting of a regular repeating pattern. While this is an important problem in its own right, most natural textures (i.e. those associated with a natural-looking world) exhibit significant spatial non-stationarity and features with a wide spatial frequency range. One popular way to deal with images with a wide spatial frequency range is to decompose the frequencies using, for example, a Laplacian image pyramid. However, since many textures have features that interact across spatial resolutions, spatial pyramids would appear to be inappropriate. We propose that deep convolutional generative architectures are well suited to model these natural textures. In particular, by increasing the effective receptive field with depth, we can use higher layers of the model to efficiently communicate information such as phase to spatially isolated parts of the first layer model.

Deep belief networks also have another important property that we find useful in the context of texture modeling. As argued by Hinton (2012) and Le Roux and Bengio (2008), training the lower layers by contrastive divergence (CD) (Hinton et al., 2006) allows the lower layers to concentrate on modeling local features of the data. We have found the best results by training the lower layers by CD and the uppermost layer by a closer approximation to maximum likelihood, such as persistent contrastive divergence (PCD) (Tieleman, 2008), promoting a better division of labour between the layers of the DBN. On this account, the ssRBM offers an important advantage over other similar models in the literature: unlike models such as the mcRBM (Ranzato and Hinton, 2010) and mPoT (Ranzato et al., 2010a), the structure of the ssRBM makes it readily amenable to CD training.

Our contributions are, first, the exploration of a tiled-convolutionally trained ssRBM (TssRBM) texture model and its objective comparison with the other similar models in the literature. We show that the TssRBM is competitive with the state-of-the-art on texture synthesis and inpainting tasks on a selection of Brodatz textures. Second, we develop a novel RBM model with a spike-and-slab visible layer and binary variables in the hidden layer. This model is designed to be stacked on top of the TssRBM within a deep belief network configuration, with each layer trained convolutionally with a greedy layer-wise pretraining strategy. We demonstrate how the resulting two- and three-layer DBN models (the third layer is a standard RBM) are able to encode longer-range dependencies in the higher layers while simultaneously recovering more detailed structure in the CD-trained lower layer, all of which translates to superior texture model performance, particularly when the textures being modeled exhibit strong non-stationarity. Finally, we show how depth helps in learning a generative model of multiple textures. Kivinen and Williams (2012) introduce a model capable of modeling multiple textures; however, they make use of label information in the training process, alleviating the difficult learning problem of constructing multiple modes to represent each texture. In this work, we show how a deep belief network based on the ssRBM is capable of learning to model multiple textures based on purely unsupervised training.

2 Previous Work

The problem of texture synthesis has been extensively studied in the computer vision community for decades (Zhu et al., 2000). Probably the most popular texture synthesis strategies are currently example-based or nonparametric methods (Wei et al., 2009). These typically seed a target image with transformed versions of patches drawn from the target texture. While these methods are flexible, they are unlikely to be readily applicable to natural textures, where some aspects of the statistical structure (e.g. the path of the duck tracks) are global in scope.

The Gaussian RBM (Welling et al., 2005; Ranzato and Hinton, 2010) models real-valued observations by adding quadratic terms on the visible units to the standard binary-binary RBM energy function. One limitation of the Gaussian RBM is that changing its hidden unit activations only changes the conditional mean of the visible units. For modeling natural images, it has been found important to allow the hidden unit configuration to capture changes in the covariance between pixels, and this has motivated several of the models discussed below as well as the ssRBM. The product of Student’s T-distributions (PoT) model (Welling et al., 2003) is an energy-based model where the conditional distribution over the visible units given the hidden variables is a multivariate Gaussian (with non-diagonal covariance) and the complementary conditional distribution over the hidden variables given the visibles is a set of independent Gamma distributions. The PoT model has recently been generalized to the mPoT model (Ranzato et al., 2010b) to include nonzero Gaussian means by the addition of Gaussian RBM-like hidden units. In the same work, Kivinen and Williams (2012) explored the “Multi-Texture Boltzmann Machine” (Multi-Tm), training a single large Gaussian RBM (up to 256 feature maps per tiling position, as opposed to 32 maps per tiling position) on multiple textures. In modeling multiple textures, Kivinen and Williams (2012) used label information during the training process to enable the model to focus on a single texture class at a time. In section 5.3, we show how we can use a deep belief network, based on the ssRBM, to learn a model of multiple textures using no label information at all.

In addition to validating the ssRBM as the basis of an effective texture model, we also set out to study the impact of adding layers to the tiled-convolutional ssRBM model, in order to see if depth can help maintain coherence of large-scale texture features. Recent work (Ranzato et al., 2011) has shown that stacking additional RBM layers on top of an mPoT model (also trained using tiled-convolutional weight sharing) can have a dramatic impact on the ability of the model to generate globally coherent natural image samples. Findings such as these motivated our attempt to use depth to synthesize textures with increased global coherence.

3 Spike-and-Slab RBM

The ssRBM describes the interaction between three sets of random variables: the real-valued visible random vector $v \in \mathbb{R}^D$ representing the observed data of dimension $D$, the set of binary “spike” random variables $h \in \{0, 1\}^N$ and the real-valued “slab” random variables $s \in \mathbb{R}^N$. The ssRBM has the interpretation that, with $N$ hidden units, the $i$th hidden unit is associated with both an element $h_i$ of the binary vector and an element $s_i$ of the real-valued variable. In this work we will concern ourselves with the ssRBM formulation referred to as the $\mu$-ssRBM (Courville et al., 2011b) with the associated energy function:

$$E(v, s, h) = -\sum_{i=1}^{N} v^\top W_i s_i h_i + \frac{1}{2} v^\top \Big( \Lambda + \sum_{i=1}^{N} \Phi_i h_i \Big) v + \frac{1}{2} \sum_{i=1}^{N} \alpha_i s_i^2 - \sum_{i=1}^{N} \alpha_i \mu_i s_i h_i - \sum_{i=1}^{N} b_i h_i + \sum_{i=1}^{N} \alpha_i \mu_i^2 h_i,$$

where $W_i$ denotes the $i$th weight (or feature) vector, $b_i$ is a scalar bias associated with the spike variable $h_i$, $\mu_i$ and $\alpha_i$ are respectively a mean and precision parameter associated with the random slab variable $s_i$, $\Lambda$ is a diagonal precision matrix on the visibles $v$, and $\Phi_i$ is an $h_i$-gated contribution to the precision on $v$. As is standard with energy-based models, the joint probability distribution over $v$, $s$ and $h$ is specified as $p(v, s, h) = \frac{1}{Z} \exp\{-E(v, s, h)\}$, where $Z$ is the normalizing partition function.

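An interesting property of the ssRBM is that despite having higher-order interactions of variables, the model maintains the bipartite graph structure of the standard restricted Boltzmann machine, where the $i$th hidden unit consists of the product of the random variables $s_i$ and $h_i$. This property implies that, unlike the mPoT (which also models conditional variance), the ssRBM shares the simple and practical conditional independence structure of the standard restricted Boltzmann machine. This makes efficient block Gibbs sampling straightforward, as seen in the model conditionals:

$$P(h_i = 1 \mid v) = \sigma\Big( \tfrac{1}{2} \alpha_i^{-1} (v^\top W_i)^2 + v^\top W_i \mu_i - \tfrac{1}{2} v^\top \Phi_i v + b_i - \tfrac{1}{2} \alpha_i \mu_i^2 \Big),$$

$$p(s_i \mid v, h) = \mathcal{N}\Big( \big( \alpha_i^{-1} v^\top W_i + \mu_i \big) h_i, \; \alpha_i^{-1} \Big),$$

$$p(v \mid s, h) = \mathcal{N}\Big( C_{v \mid s, h} \sum_{i=1}^{N} W_i s_i h_i, \; C_{v \mid s, h} \Big),$$

where $\mathcal{N}(m, C)$ denotes a Gaussian distribution with mean $m$ and covariance $C$, $\sigma$ represents a logistic sigmoid, and $C_{v \mid s, h} = \big( \Lambda + \sum_{i=1}^{N} \Phi_i h_i \big)^{-1}$ is the diagonal conditional covariance matrix.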
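To make the block Gibbs structure concrete, the following NumPy sketch performs one sweep through the three factorial conditionals of the ssRBM (spikes given visibles, slabs given visibles and spikes, visibles given slabs and spikes). It is a minimal, fully connected illustration and not the authors' implementation. The function names, the storage of each diagonal $\Phi_i$ as a column of a `(D, N)` array, and the exact form of the spike logit (obtained here by integrating the slab out of the $\mu$-ssRBM energy) are assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, b, mu, alpha, Phi, rng):
    """Sample spikes h ~ P(h | v), with the slabs integrated out.

    v: (D,), W: (D, N), b/mu/alpha: (N,), Phi: (D, N) holding diag(Phi_i) in column i.
    """
    vW = v @ W  # v^T W_i for every hidden unit i
    logit = (0.5 * vW ** 2 / alpha + vW * mu
             - 0.5 * (v ** 2) @ Phi + b - 0.5 * alpha * mu ** 2)
    p = sigmoid(logit)
    return (rng.random(p.shape) < p).astype(float)

def sample_s_given_vh(v, h, W, mu, alpha, rng):
    """Sample slabs s_i ~ N((v^T W_i / alpha_i + mu_i) h_i, 1 / alpha_i)."""
    mean = (v @ W / alpha + mu) * h
    return mean + rng.standard_normal(mean.shape) / np.sqrt(alpha)

def sample_v_given_sh(s, h, W, Lam, Phi, rng):
    """Sample visibles v ~ N(C sum_i W_i s_i h_i, C), C = (Lam + sum_i Phi_i h_i)^-1."""
    cov = 1.0 / (Lam + Phi @ h)  # diagonal covariance, stored as a vector
    mean = cov * (W @ (s * h))
    return mean + rng.standard_normal(mean.shape) * np.sqrt(cov)

def gibbs_step(v, params, rng):
    """One block Gibbs sweep: v -> h -> s -> v."""
    W, b, mu, alpha, Lam, Phi = params
    h = sample_h_given_v(v, W, b, mu, alpha, Phi, rng)
    s = sample_s_given_vh(v, h, W, mu, alpha, rng)
    return sample_v_given_sh(s, h, W, Lam, Phi, rng), s, h
```

Because every conditional is factorial, each of the three sampling steps is a single vectorized draw, which is what makes both CD and PCD training cheap for this model.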

Training the ssRBM:

Like the standard RBM, learning and inference in the ssRBM are rooted in the ability to efficiently draw samples from the model via block Gibbs sampling. In training the ssRBM we are free to use either contrastive divergence or a better approximation to maximum likelihood such as the stochastic maximum likelihood algorithm, also known as persistent contrastive divergence (PCD) (Tieleman, 2008). CD training involves approximating the negative phase component of the likelihood gradient by a few steps (often just one) of Gibbs sampling away from the data presented in the positive phase. In PCD, one maintains a persistent Markov chain to approximate the negative phase and simulates a few Gibbs steps between each parameter update. These samples are then used to approximate the expectations over the model distribution. Details regarding PCD training of the ssRBM are available in Courville et al. (2011a).
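The difference between the two training schemes lies only in where the negative-phase Gibbs chain starts. The sketch below illustrates this with a plain binary-binary RBM standing in for the ssRBM (a deliberate simplification, so that the example stays short); only the weight matrix is updated, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, b_h, b_v, rng):
    """One v -> h -> v sweep of a binary-binary RBM (stand-in for the ssRBM)."""
    p_h = sigmoid(v @ W + b_h)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_v = sigmoid(h @ W.T + b_v)
    return (rng.random(p_v.shape) < p_v).astype(float)

def weight_gradient(v_pos, v_neg, W, b_h):
    """Positive-phase minus negative-phase statistics for the weights."""
    pos = v_pos.T @ sigmoid(v_pos @ W + b_h) / len(v_pos)
    neg = v_neg.T @ sigmoid(v_neg @ W + b_h) / len(v_neg)
    return pos - neg

def cd_update(batch, W, b_h, b_v, rng, lr=0.01, k=1):
    """CD-k: the negative chain is restarted at the data for every update."""
    v_neg = batch.copy()
    for _ in range(k):
        v_neg = gibbs_step(v_neg, W, b_h, b_v, rng)
    return W + lr * weight_gradient(batch, v_neg, W, b_h)

def pcd_update(batch, chains, W, b_h, b_v, rng, lr=0.01, k=1):
    """PCD: the negative chain persists across updates and is returned."""
    for _ in range(k):
        chains = gibbs_step(chains, W, b_h, b_v, rng)
    return W + lr * weight_gradient(batch, chains, W, b_h), chains
```

In `cd_update` the negative samples never stray far from the training data, whereas in `pcd_update` the returned `chains` are fed back into the next call and are therefore free to explore the model distribution.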

Our use of block Gibbs sampling marks an important distinction between our approach to learning and that used by Kivinen and Williams (2012), who use Hybrid Monte Carlo (HMC) (Neal, 1994) to draw samples from the model distribution. Their use of HMC is likely motivated by their need to train models such as the PoT model, where the conditional over visible vectors does not factorize and hence is not amenable to efficient block Gibbs sampling. The ability to easily and efficiently Gibbs sample from the ssRBM also makes it amenable to CD training, unlike models such as the PoT and mPoT models. As we show in the experiments with deeper models, the use of CD training is crucial to achieving our best results.

Tiled-Convolutional ssRBM:

Tiled-convolutional weight sharing, introduced by Gregor and LeCun (2010) (see also Ranzato et al., 2010a; Le et al., 2010), is similar to convolutional weight sharing (LeCun et al., 1998; Desjardins and Bengio, 2008; Lee et al., 2009) except that spatially neighboring features (with overlapping receptive fields) do not share weights. Within the tiled-convolutional structure, every specific filter tiles the input image without overlapping itself, while different filters do overlap with each other. The first property (non-overlapping copies of each filter) not only allows us to work efficiently on much bigger images than traditional convolutional models but also makes the states of hidden units less correlated, which is very helpful when we draw samples from the models by block Gibbs sampling. The second property (overlap between different filters) aims at removing the tiling artifacts introduced by the non-overlapping filters.

To make comparisons easier, our TssRBM uses the same architecture as Kivinen and Williams (2012), including the same receptive field size of $11 \times 11$ and the same diagonal tiling pattern with a stride of one pixel (neighboring receptive fields are offset by one pixel). This diagonal tiling (which considerably reduces the number of free parameters) makes for 11 sets of filters (one for each offset). We also kept the number of filters constant (32 per set) to make comparisons with the results in Kivinen and Williams (2012) simpler.
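The diagonal tiling pattern can be illustrated by listing where each set of filters is applied. The sketch below assumes that set $k$ places its receptive fields on a non-overlapping grid whose origin is shifted by $k$ pixels along the image diagonal; this reading of the layout, the function name `tile_origins`, and the image size used in the usage note are assumptions made for illustration.

```python
def tile_origins(image_size, field=11, stride=1):
    """Top-left corners of the receptive fields used by each tiling set.

    Set k is shifted by k * stride pixels along the diagonal; within a set,
    fields are placed every `field` pixels so that they do not overlap.
    """
    n_sets = field // stride
    sets = []
    for k in range(n_sets):
        offset = k * stride
        coords = range(offset, image_size - field + 1, field)
        sets.append([(r, c) for r in coords for c in coords])
    return sets
```

For a hypothetical 33-pixel-wide square image, `tile_origins(33)` yields 11 sets; the first set covers the image with a 3 by 3 grid of fields, while the second set, shifted by one pixel, fits only a 2 by 2 grid. Fields from different sets overlap heavily, which is what suppresses tiling artifacts.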

4 An ssRBM-based Deep Belief Network

In this section, we describe how we extend the TssRBM into a hierarchical generative model in the form of a deep belief network (DBN). Following the standard procedure for learning DBNs, we adopt a layer-wise training strategy. Training the bottom-layer ssRBM, whether by CD, PCD or FPCD, is straightforward and discussed above in Sec. 3. We now consider the form of the model we intend to stack on top of the ssRBM.

Following the DBN approach, we express the ssRBM model as

$$p(v, s, h) = p(v \mid s, h) \, p(s, h).$$

As discussed in the previous section, due to the factorial nature of $p(v \mid s, h)$, it is convenient to consider this the bottom layer of our DBN and focus on how to model the spike-and-slab latent state. Let $q(v)$ denote the data distribution. We introduce another higher-layer model $p_2(s, h)$ of the spike-and-slab state to model the aggregated posterior distribution, $\mathbb{E}_{q(v)}\big[ p(s, h \mid v) \big]$, of the ssRBM. If $p_2(s, h)$ models the aggregated posterior better than does $p(s, h)$ (defined by the ssRBM), then adding the second layer can improve the model of the training data (Hinton et al., 2006).

Formally, the two-layer model is

$$p_{\mathrm{DBN}}(v, s, h) = p(v \mid s, h) \, p_2(s, h).$$

From a generative perspective, the sampling procedure consists of generating a sample pair $(s, h)$ from the top (second here) layer, followed by mapping it to image space through $p(v \mid s, h)$.

We have yet to specify the form of the second-layer model. We will follow the common practice of using another model of the RBM family to model the distribution over $s$ and $h$. We introduce a variant of the RBM that models the aggregate posterior through a hidden binary random vector. We choose to use a binary hidden layer in order to transition to a more standard binary representation. When we add a third layer to the DBN, that layer will be formed by training a standard binary-binary RBM.

The energy function of the second layer model is defined as follows:


where refers to the th element of the weight matrix encoding the interactions between and spike-and-slab variables and respectively. The term controls the bias on the binary . All other parameters have the same interpretation as their first layer analogues.

Similar to the standard ssRBM, the conditionals , and are factorial and given by:


The structure of this model gives us two advantages. First, at the start of training the second layer we can make its distribution over the spike-and-slab variables close to the one defined by the first-layer ssRBM by initializing the corresponding parameters to match their first-layer analogues’ values. Second, after training the second layer, we get a new binary representation of the training data. Based on it, building an even deeper model is straightforward. In our experiments, this architecture works very well.

Figure 1: The architecture of the lowest two layers. The first layer possesses tiled-convolutional weight sharing in a diagonal arrangement (tilings are represented by different colors). Each second layer unit has a receptive field over all the feature maps in the first hidden layer. The second layer is arranged with traditional convolutional weight sharing and a stride of 1.

Training the second-layer model:

After pretraining the first layer (ssRBM), given training data $v$, we sample $s$ and $h$ from $p(s, h \mid v)$, then take $s$ and $h$ as the new training data to train the second layer. Just as we do for the bottom-layer ssRBM, we train this second-layer RBM with either PCD or CD. We typically see the best results if we train with PCD for the top-layer model and with CD for all other layers.

Sampling and inference in our two layer model:

Once the second layer has been trained with PCD it can be used to generate samples. We run Gibbs sampling in the top layer to obtain a sample of its binary hidden vector. Next, we sample the spike-and-slab variables $s$ and $h$ conditioned on that sample, then pass $s$ and $h$ to the first layer. Inference in our two-layer model is exactly the same as the process of converting training data into the new representation (spike and slab variables) discussed above. Given $v$, we sample $s$ and $h$ from $p(s, h \mid v)$, then pass them to the higher layer.

Convolutional Structure:

The second layer possesses a convolutional weight sharing structure (not tiled-convolutional). Based on our use of patches of size randomly cropped from the texture images, the tiling structure of the first layer model results in a set of feature maps of size (the receptive field size was ). Second layer hidden units are each connected to all feature maps with the same receptive field across all feature maps. Using a stride length of 1, this implies that each second layer feature is associated with a feature map of size . For our experiments with a 3-layer model, we keep the same convolutional weight sharing structure for the third layer and use receptive fields of size .

Figure 2: Examples of texture synthesis for the models under consideration (rows) for different textures (columns). The top row has original data.

5 Experiments

We evaluate our texture models on 8 texture images (D4, D6, D16, D21, D53, D68, D77 and D103) from the Brodatz texture dataset. According to Lin et al. (2006), we can roughly classify them into 4 different types: regular textures (D6, D21, D53, D77), near-regular textures (D16, D103), irregular textures (D68) and stochastic textures (D4). The regular textures are simpler: shallow models (such as mPoT, Gaussian RBM and ssRBM with tiled-convolutional weight sharing) are able to model them with high fidelity. However, the other textures (D4, D16, D68 and D103) remain challenging for shallow models. We show that deep models give better results.

Figure 3: Examples of texture inpainting for the models under consideration (rows) for different textures (columns).
Synthesis       | D6              | D21             | D53             | D77
Bi-FoE          | 0.7573 ± 0.0594 | 0.8710 ± 0.0317 | 0.8266 ± 0.0869 | 0.6464 ± 0.0215
TmPoT           | 0.9329 ± 0.0356 | 0.8961 ± 0.0696 | 0.8527 ± 0.0559 | 0.8699 ± 0.0080
TPoT            | 0.5641 ± 0.0916 | 0.7388 ± 0.1055 | 0.7583 ± 0.1082 | 0.6870 ± 0.0973
T-GaussRBM      | 0.9301 ± 0.0207 | 0.8901 ± 0.0792 | 0.8485 ± 0.0606 | 0.8663 ± 0.0084
Multi-Tm (256)  | 0.9304 ± 0.0280 | 0.9346 ± 0.0205 | 0.9231 ± 0.0103 | 0.8610 ± 0.0096
TssRBM          | 0.9365 ± 0.0468 | 0.9482 ± 0.0249 | 0.9412 ± 0.0215 | 0.8410 ± 0.0121
Our 2-layer DBN | 0.9516 ± 0.0164 | 0.9465 ± 0.0322 | 0.9499 ± 0.0264 | 0.8638 ± 0.0161

Inpainting      | D6              | D21             | D53             | D77
Efros&Leung     | 0.8524 ± 0.0318 | 0.8566 ± 0.0344 | 0.8558 ± 0.0578 | 0.6012 ± 0.0760
TmPoT           | 0.8629 ± 0.0180 | 0.8741 ± 0.0116 | 0.8602 ± 0.0234 | 0.7668 ± 0.0322
TPoT            | 0.8446 ± 0.0172 | 0.8609 ± 0.0275 | 0.8935 ± 0.0159 | 0.6379 ± 0.0373
T-GaussRBM      | 0.8578 ± 0.0160 | 0.8662 ± 0.0185 | 0.8494 ± 0.0233 | 0.7642 ± 0.0267
Multi-Tm (256)  | 0.8452 ± 0.0173 | 0.8673 ± 0.0103 | 0.8554 ± 0.0284 | 0.7328 ± 0.0615
TssRBM          | 0.8881 ± 0.0227 | 0.9119 ± 0.0139 | 0.9156 ± 0.0237 | 0.7627 ± 0.0314
Our 2-layer DBN | 0.8894 ± 0.0246 | 0.9060 ± 0.0160 | 0.9242 ± 0.0285 | 0.7738 ± 0.0232

Table 1: A comparison of the one- and two-layer TssRBM results with other models. All reported results other than the TssRBM-based results were taken from Kivinen and Williams (2012) (including their Multi-Tm: a multiple texture model trained with 256 hidden units). The synthesis results are based on the TSS criterion while the inpainting results are based on MSSIM scores. In both cases larger numbers are better.

5.1 TssRBM texture modeling

In this section, we compare the tiled-convolutional ssRBM with other related models in the literature. We base our comparison on the results reported in Kivinen and Williams (2012). To provide a fair comparison, we follow the general experimental protocol established by Heess et al. (2009) and Kivinen and Williams (2012). Specifically, we rescaled the original textures (all but D16) to either (D4, D21 and D77) or (D6, D53, D68 and D103). Each texture image was divided into a top half for training and a bottom half for testing. We then report the performance of the TssRBM and our 2-layer TssRBM-based DBN on two tasks: texture synthesis and inpainting. All models, in all experiments, are trained on sized patches randomly cropped from the preprocessed training texture images, which are normalized to have zero mean and a standard deviation of 1. We use a minibatch size of 64.
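The data pipeline described above (top/bottom split, normalization, random crops, minibatches of 64) can be sketched as follows. The function names are hypothetical, the patch size is left as a parameter, and normalizing the training half on its own rather than the full image is an assumption of this example.

```python
import numpy as np

def split_top_bottom(image):
    """Top half for training, bottom half for testing."""
    half = image.shape[0] // 2
    return image[:half], image[half:]

def normalize(image):
    """Rescale an image to zero mean and unit standard deviation."""
    image = np.asarray(image, dtype=float)
    return (image - image.mean()) / image.std()

def sample_minibatch(image, patch_size, batch_size=64, rng=None):
    """Crop `batch_size` square patches at uniformly random positions."""
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape
    rows = rng.integers(0, height - patch_size + 1, size=batch_size)
    cols = rng.integers(0, width - patch_size + 1, size=batch_size)
    return np.stack([image[r:r + patch_size, c:c + patch_size]
                     for r, c in zip(rows, cols)])
```

Because crops are drawn at arbitrary positions, the model sees each texture feature at every phase relative to the tiling grid.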

The TssRBM is trained with FPCD (Tieleman and Hinton, 2009). For deep models, we always pretrain the lower layer with one-step CD and train the top layer with PCD (we find that in the higher-layer RBMs the mixing of the negative phase Gibbs chain is relatively fast, so we use PCD). In both the PCD and FPCD training processes, the persistent chains are initialized with noise at the beginning of learning, and for some textures (especially the regular textures) restarting the Markov chains with a small probability, such as 0.01, seems advantageous to further promote mixing. After training, we apply our models to the following two tasks: texture synthesis and inpainting.
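The chain-restart heuristic can be sketched in a few lines. In this illustration each persistent chain is independently reset to fresh noise with probability `restart_prob` before the next Gibbs sweep; the use of standard Gaussian noise, the per-chain independent reset, and the function name `maybe_restart` are assumptions, since the text only specifies noise initialization and a small restart probability.

```python
import numpy as np

def maybe_restart(chains, rng, restart_prob=0.01):
    """Reset each persistent chain to noise with probability `restart_prob`.

    chains: (n_chains, D) array holding the current state of every chain.
    Returns the updated chains and a boolean mask of the chains that were reset.
    """
    restart = rng.random(len(chains)) < restart_prob
    fresh = rng.standard_normal(chains.shape)
    return np.where(restart[:, None], fresh, chains), restart
```

With a probability of 0.01 and one call per parameter update, a given chain is reset on average once every 100 updates, which leaves most negative-phase samples close to equilibrium while still dislodging chains that are stuck in a single mode.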

Figure 4: LEFT: Synthesized textures D53, D4 and D103 at full resolution. The training algorithms are shown in layer order, e.g. 3-DBN: CD-CD-PCD denotes a 3-layer DBN trained with CD for the first two layers and with PCD for the uppermost layer. Both depth and the choice of inductive bias have a significant impact on the quality of the model. RIGHT: The autocorrelation spectrum of Markov chain Monte Carlo samples of the texture D103 for our one-, two- and three-layer models. All layers are trained with CD, except the uppermost, which is trained with PCD (TssRBM trained with FPCD).

Texture Synthesis:

For this task we generate unconstrained samples from our models by the usual DBN generative procedure, with Gibbs sampling in the top-level RBM, followed (in the case of deep models) by stochastic projection (except for the visible units, as usual, and except for the slab units, where we take the expectation) into image space. Following Kivinen and Williams (2012), after a large number of “burn-in” samples, we collected 128 samples of size for both the 1-layer and 2-layer models. A quantitative measure of the quality of the samples is provided by the Texture Similarity Score (TSS) (Heess et al., 2009), comparing each generated sample with the test patches from the test region of the image. For a sample $y$ and test texture $x$, the TSS is given by the maximum normalized cross correlation (NCC):

$$\mathrm{TSS}(y, x) = \max_{1 \le i \le I} \frac{y^\top x_{(i)}}{\|y\| \, \|x_{(i)}\|},$$

where $x_{(i)}$ denotes patch $i$ within the test region of the image and $I$ is the number of possible unique patches in the test region. A patch (and sample) of size was used to compute the score. We only use TSS for the regular textures (D6, D21, D53, D77). Fig. 2 compares images of textures synthesized by some of the methods under consideration. Table 1 provides a quantitative comparison based on the TSS and shows that the TssRBM-based models are competitive with these other probabilistic models of texture.
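As a concrete reading of the score, the sketch below computes the maximum normalized cross correlation between a sample and every same-sized patch of a test image by brute force. It assumes that no mean is subtracted from the patches (the images are already normalized) and that every pixel offset counts as a distinct patch; the function name is hypothetical.

```python
import numpy as np

def texture_similarity_score(sample, test_image):
    """Maximum normalized cross correlation between `sample` and all
    same-sized patches of `test_image` (brute-force search)."""
    sample = np.asarray(sample, dtype=float)
    test_image = np.asarray(test_image, dtype=float)
    p, q = sample.shape
    sample_norm = np.linalg.norm(sample)
    best = -np.inf
    for r in range(test_image.shape[0] - p + 1):
        for c in range(test_image.shape[1] - q + 1):
            patch = test_image[r:r + p, c:c + q]
            denom = sample_norm * np.linalg.norm(patch)
            if denom > 0:  # skip all-zero patches
                best = max(best, float(np.sum(sample * patch) / denom))
    return best
```

By the Cauchy-Schwarz inequality the score is at most 1, and it reaches 1 only when the sample is a positive multiple of some test patch, so a sample cropped directly from the test region scores 1 up to rounding error.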


Texture Inpainting:

The inpainting (constrained texture synthesis) task requires the models to generate a texture which is consistent with a given boundary. Following Kivinen and Williams (2012), we randomly cut texture patches from the test texture images and set the center () to zeros. The resulting images were fed to our models as the inpainting frames. The inpainting was done by running 500 Gibbs sampling iterations in our models while the border was held fixed. The number of inpainting frames was 20 for each texture, and each inpainting was done with 5 different random seeds, for a total of 100 inpaintings for each model and each texture. The quality of the inpainting was evaluated using the mean structural similarity index (MSSIM) (Wang et al., 2004), which compares the inpainted region with the ground truth. Fig. 3 compares the inpainting results of some of the methods under consideration. Table 1 provides the quantitative MSSIM comparison against other similar models. Here again, the TssRBM-based models are fairly competitive with these other probabilistic models of texture.
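Holding the border fixed during sampling amounts to overwriting the known pixels after every Gibbs sweep. The following sketch shows this clamping loop; `gibbs_step` is a hypothetical callable that maps an image and a random generator to a new full-image sample, and is a stand-in for one sweep of whichever trained model is being evaluated.

```python
import numpy as np

def inpaint(frame, hole_mask, gibbs_step, n_steps=500, rng=None):
    """Fill the masked region by Gibbs sampling with the border clamped.

    frame:      image whose hole has been zeroed out.
    hole_mask:  boolean array, True where pixels must be generated.
    gibbs_step: callable (image, rng) -> new image sample (hypothetical).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = frame.copy()
    for _ in range(n_steps):
        proposal = gibbs_step(x, rng)
        # Keep sampled values inside the hole, restore known pixels elsewhere.
        x = np.where(hole_mask, proposal, frame)
    return x
```

The completed hole can then be compared with the ground truth using any MSSIM implementation; one option is `structural_similarity` from `skimage.metrics` in scikit-image.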

Figure 5: LEFT: Multi-texture samples generated by the TssRBM model. RIGHT: Multi-texture samples generated by our 3-layer DBN.

5.2 Experiments II: Exploring High-Resolution Textures

To further explore the generative power of the DBN models, we move to a more challenging task, specifically, modeling high-resolution textures while keeping the first layer structure unchanged: the same number of filters, the same size () of the receptive fields and the same size () of the training patches. This implies that the first layer will face a much more challenging learning task. We show that by adding more hidden layers these difficult tasks are handled very well. We add two more hidden layers on top of the first layer, which gives us three-layer DBNs. There are 128 filters with convolutional weight sharing in each of these two layers. Due to limited space, we only show the results for textures D53, D4 and D103. The other 5 textures yield a similar pattern of results. While the quantitative measures used in the previous experiments are useful to establish an objective comparison between methods, we feel that they are rather imperfect measures of the quality of the texture model, and therefore in this section we forgo these measures in favour of simply presenting texture synthesis results for visual inspection. Fig. 4 (right) illustrates the impact of both depth and the inductive bias (FPCD versus CD training) in training TssRBM-based models of texture.

Depth helps mixing.

One key aspect that might help to explain the improvements in the models is that as the model gets deeper the mixing rate of the negative phase Gibbs chain improves, as already demonstrated and argued in Bengio et al. (2012). Improved mixing of the Gibbs chain improves the performance of training methods such as PCD that rely on it for the estimation of negative phase statistics. It also helps the generation of the samples shown. To demonstrate the improvement in mixing with depth, we assess the mixing rate of three models (one-, two- and three-layer models) trained on D103 via the autocorrelation spectrum. After training, we run a Markov chain in each of the three models and plot the autocorrelation spectrum in Fig. 4 (left). As seen in the figure, mixing becomes very fast in the three-layer model, i.e., samples at some distance in the chain are less correlated with each other.
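An autocorrelation curve of this kind can be computed from any scalar statistic recorded along the chain. The sketch below is a generic estimator and not the authors' exact procedure: the choice of statistic (for instance a single pixel or a projection of the sample) is left open, and `ar1_chain` is a toy first-order autoregressive process included only to contrast a slowly mixing chain with a quickly mixing one.

```python
import numpy as np

def autocorrelation(chain, max_lag):
    """Normalized autocorrelation of a 1-D chain for lags 0..max_lag."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    var = np.dot(x, x) / n
    return np.array([np.dot(x[:n - k], x[k:]) / (n * var)
                     for k in range(max_lag + 1)])

def ar1_chain(phi, n, seed=0):
    """Toy chain x_t = phi * x_{t-1} + noise; larger phi mixes more slowly."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x
```

Comparing `autocorrelation(ar1_chain(0.9, 5000), 10)` with `autocorrelation(ar1_chain(0.1, 5000), 10)` shows the pattern described in the text: the faster-mixing chain decorrelates within a lag or two, while the slower one remains correlated over many steps.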

CD pretraining vs. PCD and FPCD pretraining.

We find that pretraining the lower layers with CD-1 results in better DBN texture models. More specifically, worse results were obtained with PCD pretraining and with FPCD pretraining, and substantially better results with CD, as can be seen, for example, in Figure 4 (right). This is consistent with the claims made in Hinton (2012) regarding the advantage of CD over PCD. It is also consistent with the results in Le Roux and Bengio (2008), which show that maximum likelihood training of the lower layers of a DBN is sub-optimal, and that, assuming a high-capacity top layer, the optimal way to train the first layer would be to minimize the KL divergence between the visible units and the stochastic one-step reconstruction, something much closer to what CD does than to what PCD does. Another hypothesis is that CD helps here because it makes sure to extract good features that preserve the input information, without the constraint that the lower-level RBMs do a good job (of avoiding spurious modes) far from the training samples. For the top-level RBM, by contrast, which is used to sample from the model, it is important to use a good approximation to maximum likelihood training.

5.3 Learning with Multiple Textures

In this section, we try to assess the power of our deep models by using not only high-resolution texture images but also multiple heterogeneous textures. We train a three-layer model on all 8 textures. The first layer of our DBN is a TssRBM with 96 filters. The second layer is our new RBM variant introduced in Sec. 4 with 256 filters and receptive fields of size . The third layer is a convolutional binary RBM with 256 filters and receptive fields of size . We compare our DBN with a one-layer model (TssRBM with 128 filters). After training, we generate samples from both models and show the results in Fig. 5. We can see that the single-layer TssRBM only models the high-frequency structure in the training data. On the other hand, the deep model seems to capture much of the 8 textures that occur in the training set. There are 7 different textures apparent in these 32 samples. We are only missing samples of D4, which is a stochastic texture and hard to capture, particularly when most of the training data are highly structured images. Kivinen and Williams (2012) also trained a Gaussian RBM with tiled-convolution weight sharing on multiple textures with labels. The labels can help their model to pick different filters for different textures and thus make the learning problem much easier.

6 Conclusions

In this paper, we apply the ssRBM with tiled-convolution weight sharing to the texture modeling task. We show that not only is the ssRBM competitive as a single-layer model of texture, but that, by being amenable to CD training, it is well suited to being incorporated into even more effective deep models of texture. Interestingly, we find that CD training of lower layers yields better models, and that mixing is better in deeper layers. Our integration of the ssRBM into a DBN necessitated the development of a novel RBM with a spike-and-slab visible layer and a binary latent layer. Finally, we show that our new ssRBM-based DBN is capable of modeling multiple high-resolution textures.


  • Bengio et al. (2012) Bengio, Y., Mesnil, G., Dauphin, Y., and Rifai, S. (2012). Better mixing via deep representations. Technical Report arXiv:1207.4404, Universite de Montreal.
  • Courville et al. (2011a) Courville, A., Bergstra, J., and Bengio, Y. (2011a). A Spike and Slab Restricted Boltzmann Machine. In AISTATS’2011.
  • Courville et al. (2011b) Courville, A., Bergstra, J., and Bengio, Y. (2011b). Unsupervised models of images by spike-and-slab RBMs. In ICML’2011.
  • Desjardins and Bengio (2008) Desjardins, G. and Bengio, Y. (2008). Empirical evaluation of convolutional RBMs for vision. Technical Report 1327, Dept. IRO, U. Montréal.
  • Gregor and LeCun (2010) Gregor, K. and LeCun, Y. (2010). Emergence of complex-like cells in a temporal product network with local receptive fields. Technical report, arXiv:1006.0448.
  • Heess et al. (2009) Heess, N., Williams, C. K. I., and Hinton, G. E. (2009). Learning generative texture models with extended fields-of-experts. In BMVC.
  • Hinton (2012) Hinton, G. E. (2012). Tutorial on deep learning. IPAM Graduate Summer School: Deep Learning, Feature Learning.
  • Hinton et al. (2006) Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554.
  • Kivinen and Williams (2012) Kivinen, J. J. and Williams, C. K. I. (2012). Multiple texture Boltzmann machines. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics (AISTATS’2012), volume 22 of JMLR: W&CP.
  • Le et al. (2010) Le, Q., Ngiam, J., Chen, Z., hao Chia, D. J., Koh, P. W., and Ng, A. (2010). Tiled convolutional neural networks. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23 (NIPS’10), pages 1279–1287.
  • Le Roux and Bengio (2008) Le Roux, N. and Bengio, Y. (2008). Representational power of restricted Boltzmann machines and deep belief networks. Neural Computation, 20(6), 1631–1649.
  • LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
  • Lee et al. (2009) Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In L. Bottou and M. Littman, editors, ICML 2009. ACM, Montreal (Qc), Canada.
  • Lin et al. (2006) Lin, W.-C., Hays, J. H., Wu, C., Kwatra, V., and Liu, Y. (2006). Quantitative evaluation on near regular texture synthesis. In Computer Vision and Pattern Recognition Conference (CVPR ’06), volume 1, pages 427–434.
  • Neal (1994) Neal, R. M. (1994). Bayesian Learning for Neural Networks. Ph.D. thesis, Dept. of Computer Science, University of Toronto.
  • Ranzato and Hinton (2010) Ranzato, M. and Hinton, G. E. (2010). Modeling pixel means and covariances using factorized third-order Boltzmann machines. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR’10), pages 2551–2558. IEEE Press.
  • Ranzato et al. (2010a) Ranzato, M., Mnih, V., and Hinton, G. (2010a). Generating more realistic images using gated MRF’s. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23 (NIPS’10), pages 2002–2010.
  • Ranzato et al. (2010b) Ranzato, M., Mnih, V., and Hinton, G. (2010b). Generating more realistic images using gated MRF’s. In NIPS’2010.
  • Ranzato et al. (2011) Ranzato, M., Susskind, J., Mnih, V., and Hinton, G. E. (2011). On deep generative models with applications to recognition. In CVPR’11, pages 2857–2864.
  • Tieleman (2008) Tieleman, T. (2008). Training restricted Boltzmann machines using approximations to the likelihood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors, ICML 2008, pages 1064–1071. ACM.
  • Tieleman and Hinton (2009) Tieleman, T. and Hinton, G. (2009). Using fast weights to improve persistent contrastive divergence. In L. Bottou and M. Littman, editors, ICML 2009, pages 1033–1040. ACM.
  • Wang et al. (2004) Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. (2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4), 600–612.
  • Wei et al. (2009) Wei, L.-Y., Lefebvre, S., Kwatra, V., and Turk, G. (2009). State of the art in example-based texture synthesis. Eurographics’09 State of the Art Reports.
  • Welling et al. (2003) Welling, M., Hinton, G. E., and Osindero, S. (2003). Learning sparse topographic representations with products of Student-t distributions. In NIPS’2002.
  • Welling et al. (2005) Welling, M., Rosen-Zvi, M., and Hinton, G. E. (2005). Exponential family harmoniums with an application to information retrieval. In NIPS’04, volume 17, Cambridge, MA. MIT Press.
  • Zhu et al. (2000) Zhu, S. C., Liu, X. W., and Wu, Y. N. (2000). Exploring texture ensembles by efficient Markov chain Monte-Carlo - towards a “trichromacy” theory of texture. IEEE Trans. PAMI, 22(6), 554–569.