1 The S3C model
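The S3C model consists of latent binary spike variables $h \in \{0,1\}^N$, latent real-valued slab variables $s \in \mathbb{R}^N$, and a real-valued $D$-dimensional visible vector $v \in \mathbb{R}^D$ generated according to this process: for all $i \in \{1, \dots, N\}$ and $d \in \{1, \dots, D\}$,

$$p(h_i = 1) = \sigma(b_i), \qquad p(s_i \mid h_i) = \mathcal{N}(s_i \mid h_i \mu_i, \alpha_{ii}^{-1}), \qquad p(v_d \mid s, h) = \mathcal{N}(v_d \mid W_{d:}(h \circ s), \beta_{dd}^{-1}) \tag{1}$$

where $\sigma$ is the logistic sigmoid function, $b$ is a set of biases on the spike variables, $\mu$ and $W$ govern the linear dependence of $s$ on $h$ and of $v$ on $s$ respectively, $\alpha$ and $\beta$ are diagonal precision matrices of their respective conditionals, and $h \circ s$ denotes the elementwise product of $h$ and $s$.

To avoid overparameterizing the distribution, we constrain the columns of $W$ to have unit norm, as in sparse coding. We restrict $\alpha$ to be a diagonal matrix and $\beta$ to be a diagonal matrix or a scalar. We refer to the variables $h_i$ and $s_i$ as jointly defining the $i$-th hidden unit, so that there are a total of $N$ rather than $2N$ hidden units. The state of a hidden unit is best understood as $h_i s_i$, that is, the spike variables gate the slab variables.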
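The generative process can be illustrated by ancestral sampling. The following is a minimal sketch, assuming the conditionals are $p(h_i = 1) = \sigma(b_i)$, $p(s_i \mid h_i) = \mathcal{N}(h_i \mu_i, \alpha_{ii}^{-1})$ and $p(v \mid s, h) = \mathcal{N}(W(h \circ s), \beta^{-1})$ with diagonal $\alpha$ and $\beta$. The function name, the dimensions and the parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def sample_s3c(W, b, mu, alpha, beta, n_samples, rng):
    """Ancestral sampling of (h, s, v) from the assumed S3C process.

    W: (D, N) dictionary with unit-norm columns.
    b, mu, alpha: (N,) spike biases, slab means, slab precisions.
    beta: (D,) diagonal of the visible precision matrix.
    """
    D, N = W.shape
    p_h = 1.0 / (1.0 + np.exp(-b))                      # p(h_i = 1) = sigmoid(b_i)
    h = (rng.random((n_samples, N)) < p_h).astype(float)
    # Slab: mean h_i * mu_i, variance 1 / alpha_ii (also when h_i = 0).
    s = h * mu + rng.standard_normal((n_samples, N)) / np.sqrt(alpha)
    # Visible units: mean W (h o s), variance 1 / beta_dd.
    v = (h * s) @ W.T + rng.standard_normal((n_samples, D)) / np.sqrt(beta)
    return h, s, v

rng = np.random.default_rng(0)
D, N = 8, 16                                            # hypothetical sizes
W = rng.standard_normal((D, N))
W /= np.linalg.norm(W, axis=0, keepdims=True)           # unit-norm columns
b = np.full(N, -2.0)                                    # negative bias: sparse spikes
mu = np.ones(N)
alpha = np.full(N, 4.0)
beta = np.full(D, 10.0)
h, s, v = sample_s3c(W, b, mu, alpha, beta, 1000, rng)
```

With a negative spike bias most entries of `h` are zero, so each sample of `v` is built from only a few columns of `W`.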
In the subsequent sections we motivate our use of S3C as a feature discovery algorithm by describing how this model occupies a middle ground between sparse coding and the ssRBM. The S3C model avoids many disadvantages that the ssRBM and sparse coding have when applied as feature discovery algorithms.
1.1 Comparison to sparse coding
Sparse coding has been widely used to discover features for classification (Raina et al., 2007). Recently, Coates and Ng (2011a) showed that this approach achieves excellent performance on the CIFAR-10 object recognition dataset.
Sparse coding (Olshausen and Field, 1997) describes a class of generative models where the observed data $v$ is normally distributed given a set of continuous latent variables $s$ and a dictionary matrix $W$: $v \sim \mathcal{N}(Ws, \sigma I)$. Sparse coding places a factorial prior on $s$ such as a Cauchy or Laplace distribution, chosen to encourage the mode of the posterior $p(s \mid v)$ to be sparse. One can derive the S3C model from sparse coding by replacing the factorial Cauchy or Laplace prior with a spike-and-slab prior.

One drawback of sparse coding is that the latent variables are not merely encouraged to be sparse; they are encouraged to remain close to 0, even when they are active. This kind of regularization is not necessarily undesirable, but in the case of simple but popular priors such as the Laplace prior (corresponding to an $L_1$ penalty on the latent variables $s$), the degree of regularization on active units is confounded with the degree of sparsity. There is little reason to believe that in realistic settings these two types of complexity control should be so tightly bound together. The S3C model avoids this issue by controlling the sparsity of units via the $b$ parameter that determines how likely each spike unit is to be active, while separately controlling the magnitude of active units via the $\mu$ and $\alpha$ parameters that govern the distribution over $s$. Sparse coding has no parameter analogous to $\mu$ and cannot control these aspects of the posterior independently.
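To make the confound concrete, consider a single scalar latent variable observed with unit-variance Gaussian noise under a Laplace prior. This is a simplification assumed here for illustration, not the full dictionary setting. The MAP estimate is then the soft-thresholding operator, in which one hypothetical penalty weight `lam` sets both the sparsity threshold and the shrinkage applied to active values:

```python
import numpy as np

def soft_threshold(x, lam):
    """MAP estimate of s for 0.5 * (x - s)**2 + lam * |s| (scalar Laplace prior)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])    # hypothetical observations
weak = soft_threshold(x, 0.5)                # fewer zeros, active values shrunk by 0.5
strong = soft_threshold(x, 1.5)              # more zeros, active values shrunk by 1.5
```

Raising `lam` to obtain more zeros necessarily pulls every surviving value closer to 0 by the same amount, which is the coupling that separate spike and slab parameters are meant to remove.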
Sparse coding is also difficult to integrate into a deep generative model of data such as natural images. While Yu et al. (2011) and Zeiler et al. (2009) have recently shown some success at learning hierarchical sparse coding, our goal is to integrate the feature extraction scheme into a proven generative model framework such as the deep Boltzmann machine (DBM) (Salakhutdinov and Hinton, 2009). Existing inference schemes known to work well in the DBM-type setting are all either sample-based or based on variational approximations to the model posteriors, while sparse coding schemes typically employ MAP inference. Our use of variational inference makes the S3C framework well suited to integrate into the known successful strategies for learning and inference in DBM models. It is not obvious how one can apply a variational inference strategy to standard sparse coding with the goal of achieving sparse feature encoding.

1.2 Comparison to Restricted Boltzmann Machines
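The S3C model also resembles another class of models commonly used for feature discovery: the RBM. An RBM (Smolensky, 1986) is an energy-based model defined through an energy function that describes the interactions between the observed data variables and a set of latent variables. It is possible to interpret the S3C as an energy-based model, by rearranging $p(v, s, h)$ to take the form $\exp\{-E(v, s, h)\}/Z$, with the following energy function, where $W_i$ denotes the $i$-th column of $W$:

$$E(v, s, h) = \frac{1}{2}\Big(v - \sum_{i} W_i s_i h_i\Big)^T \beta \Big(v - \sum_{i} W_i s_i h_i\Big) + \frac{1}{2}\sum_{i=1}^{N} \alpha_{ii} (s_i - \mu_i h_i)^2 - \sum_{i=1}^{N} b_i h_i \tag{2}$$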
The ssRBM model family is a good starting point for S3C because it has demonstrated both reasonable performance as a feature discovery scheme and remarkable performance as a generative model (Courville et al., 2011). Within the ssRBM family, S3C's closest relative is a variant of the $\mu$-ssRBM, defined by the following energy function:

$$E(v, s, h) = -\sum_{i=1}^{N} v^T \beta W_i s_i h_i + \frac{1}{2} v^T \beta v + \frac{1}{2}\sum_{i=1}^{N} \alpha_{ii} (s_i - \mu_i h_i)^2 - \sum_{i=1}^{N} b_i h_i \tag{3}$$

where the variables and parameters are defined identically to the S3C. Comparison of equations 2 and 3 reveals that the simple addition of a latent factor interaction term $\frac{1}{2}(h \circ s)^T W^T \beta W (h \circ s)$ to the ssRBM energy function turns the ssRBM into the S3C model. With the inclusion of this term, S3C moves from an undirected ssRBM model to the directed graphical model described in equation (1). This change from undirected modeling to directed modeling has three important effects, which we describe in the following paragraphs.
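The relationship between the two energy functions can be checked numerically. The sketch below assumes the S3C energy is the $\beta$-weighted squared reconstruction error plus the slab and spike-bias terms, and that the ssRBM variant has the same terms except for the quadratic term in $h \circ s$. Function names and test values are hypothetical.

```python
import numpy as np

def energy_s3c(v, s, h, W, b, mu, alpha, beta):
    """Assumed S3C energy; alpha and beta hold the diagonals of the precisions."""
    r = v - W @ (h * s)                                  # reconstruction residual
    return 0.5 * r @ (beta * r) + 0.5 * np.sum(alpha * (s - mu * h) ** 2) - b @ h

def energy_ssrbm(v, s, h, W, b, mu, alpha, beta):
    """Assumed ssRBM-variant energy: no quadratic term in (h o s)."""
    return (-(beta * v) @ (W @ (h * s)) + 0.5 * v @ (beta * v)
            + 0.5 * np.sum(alpha * (s - mu * h) ** 2) - b @ h)

def interaction_term(s, h, W, beta):
    """0.5 * (h o s)^T W^T beta W (h o s)."""
    u = W @ (h * s)
    return 0.5 * u @ (beta * u)

rng = np.random.default_rng(1)
D, N = 6, 10                                             # hypothetical sizes
W = rng.standard_normal((D, N))
v, s = rng.standard_normal(D), rng.standard_normal(N)
h = (rng.random(N) < 0.3).astype(float)
b, mu = rng.standard_normal(N), rng.standard_normal(N)
alpha, beta = rng.random(N) + 0.5, rng.random(D) + 0.5

gap = energy_s3c(v, s, h, W, b, mu, alpha, beta) - energy_ssrbm(v, s, h, W, b, mu, alpha, beta)
```

Under these assumptions `gap` equals `interaction_term(s, h, W, beta)` up to floating-point error, since expanding the squared residual produces exactly the three $v$-dependent and $(h \circ s)$-dependent terms.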
The effect on the partition function:
The most immediate consequence of the transition to directed modeling is that the partition function becomes tractable. This changes the nature of learning algorithms that can be applied to the model, since most of the difficulty in training an RBM comes from estimating the gradient of the log partition function. The partition function of S3C is also guaranteed to exist for all possible settings of the model parameters, which is not true of the ssRBM.
The effect on the posterior: RBMs have a factorial posterior, but S3C and sparse coding have a complicated posterior due to the "explaining away" effect. This means that for RBMs, features defined by similar basis functions will have similar activations, while in directed models, similar features will compete so that only the most relevant feature will remain active. As shown by Coates and Ng (2011a), the sparse Gaussian RBM is not a very good feature extractor: the set of basis functions learned by the RBM actually works better for supervised learning when these parameters are plugged into a sparse coding model than when the RBM itself is used for feature extraction. We think this is due to the factorial posterior. In the vastly overcomplete setting, being able to selectively activate a small set of features likely provides S3C a major advantage in discriminative capability.
The effect on the prior:
The addition of the interaction term causes S3C to have a factorial prior. This probably makes it a poor generative model, but this is not a problem for the purpose of feature discovery.
2 Other Related work
The notion of a spike-and-slab prior was established in statistics by Mitchell and Beauchamp (1988). Outside the context of unsupervised feature discovery for supervised, semi-supervised and self-taught learning, the basic form of the S3C model (i.e. a spike-and-slab latent factor model) has appeared a number of times in different domains (Lücke and Sheikh, 2011; Garrigues and Olshausen, 2008; Mohamed et al., 2011; Titsias and Lázaro-Gredilla, 2011). To this literature, we contribute an inference scheme that scales to the kinds of object classification tasks that we consider. We outline this inference scheme next.
3 Variational EM for S3C
Having explained why S3C is a powerful model for unsupervised feature discovery, we turn to the problem of how to perform learning and inference in this model. Because computing the exact posterior distribution is intractable, we derive an efficient and effective inference mechanism and a variational EM learning algorithm.
We turn to variational EM (Saul and Jordan, 1996) because this algorithm is well suited for models with latent variables whose posterior is intractable. It works by maximizing a variational lower bound on the log-likelihood called the energy functional (Neal and Hinton, 1999). More specifically, it is a variant of the EM algorithm (Dempster et al., 1977) with the modification that in the E-step, we compute a variational approximation to the posterior rather than the posterior itself. While our model admits a closed-form solution to the M-step, we found that online learning with small gradient steps on the M-step objective worked better in practice. We therefore focus our presentation on the E-step, given in Algorithm 1.
The goal of the variational E-step is to maximize the energy functional with respect to a distribution $Q$ over the unobserved variables. We can do this by selecting the $Q$ that minimizes the Kullback–Leibler divergence:

$$\mathcal{D}_{KL}\big(Q(h, s) \,\|\, P(h, s \mid v)\big) \tag{4}$$

where $Q(h, s)$ is drawn from a restricted family of distributions. This family can be chosen to ensure that $Q$ is tractable.
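The link between maximizing the energy functional and minimizing this KL divergence follows from a standard decomposition of the log-likelihood. As a brief worked step, writing the energy functional as $\mathcal{L}(v, Q)$ (a symbol assumed here for illustration):

```latex
\begin{aligned}
\log P(v)
  &= \mathbb{E}_{Q}\!\left[\log \frac{P(v, h, s)}{Q(h, s)}\right]
   + \mathbb{E}_{Q}\!\left[\log \frac{Q(h, s)}{P(h, s \mid v)}\right] \\
  &= \mathcal{L}(v, Q) + \mathcal{D}_{KL}\big(Q(h, s) \,\|\, P(h, s \mid v)\big)
\end{aligned}
```

Since $\log P(v)$ does not depend on $Q$, any increase in $\mathcal{L}(v, Q)$ is matched by an equal decrease in the KL divergence, and $\mathcal{L}(v, Q) \le \log P(v)$ because the KL divergence is non-negative.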
Our E-step can be seen as analogous to the encoding step of the sparse coding algorithm. The key difference is that while sparse coding approximates the true posterior with a MAP point estimate of the latent variables, we approximate the true posterior with the distribution $Q$. We use the family $Q(h, s) = \prod_i Q(h_i, s_i)$.
Observing that eq. (4) is an instance of the Euler-Lagrange equation (Gelfand, 1963), we find that the solution must take the form

$$Q(h_i = 1) = \hat{h}_i, \qquad Q(s_i \mid h_i) = \mathcal{N}\big(s_i \mid h_i \hat{s}_i, (\alpha_{ii} + h_i W_i^T \beta W_i)^{-1}\big) \tag{5}$$

where $\hat{h}_i$ and $\hat{s}_i$ must be found by an iterative process. In a typical application of variational inference, the iterative process consists of sequentially applying fixed point equations that give the optimal value of the parameters $\hat{h}_i$ and $\hat{s}_i$ for one factor $Q(h_i, s_i)$ given the values of all the other factors' parameters. This is for example the approach taken by Titsias and Lázaro-Gredilla (2011), who independently developed a variational inference procedure for the same problem. This process is only guaranteed to decrease the KL divergence if applied to each factor sequentially, i.e. first updating $\hat{h}_1$ and $\hat{s}_1$ to optimize $Q(h_1, s_1)$, then updating $\hat{h}_2$ and $\hat{s}_2$ to optimize $Q(h_2, s_2)$, and so on. In a typical application of variational inference, the optimal values for each update are simply given by the solutions to the Euler-Lagrange equations. For S3C, we make three deviations from this standard approach.
Because we apply S3C to very large-scale problems, we need an algorithm that can fully exploit the benefits of parallel hardware such as GPUs. Sequential updates across all factors require far too much runtime to be competitive in this regime.
We propose a different method that enables parallel updates to all units. First, we partially minimize the KL divergence with respect to $\hat{s}$. The terms of the KL divergence that depend on $\hat{s}$ make up a quadratic function, so this can be minimized via conjugate gradient descent. We implement conjugate gradient descent efficiently by using the R-operator to perform Hessian-vector products rather than computing the entire Hessian explicitly (Schraudolph, 2002). This step is guaranteed to improve the KL divergence on each iteration.
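The idea of minimizing a quadratic with conjugate gradient while touching the Hessian only through Hessian-vector products can be sketched as follows. This is a generic illustration, not the S3C objective itself: the operator `hvp` below is a hypothetical positive definite Hessian of the form diagonal plus $M^T \mathrm{diag}(\beta) M$, applied with explicit matrix-vector products instead of the R-operator.

```python
import numpy as np

def conjugate_gradient(hvp, c, x0, n_iters=50, tol=1e-16):
    """Minimize 0.5 * x^T A x - c^T x, with A available only through hvp(p) = A p."""
    x = x0.copy()
    r = c - hvp(x)                 # residual, equal to the negative gradient
    p = r.copy()                   # first search direction
    rs = r @ r
    for _ in range(n_iters):
        if rs < tol:               # squared residual small enough: converged
            break
        Ap = hvp(p)
        step = rs / (p @ Ap)       # exact line search along p
        x += step * p
        r -= step * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p  # next direction, conjugate to the previous ones
        rs = rs_new
    return x

rng = np.random.default_rng(2)
D, N = 8, 5                                   # hypothetical sizes
M = rng.standard_normal((D, N))
a = rng.random(N) + 1.0                       # positive diagonal part
beta = rng.random(D) + 0.5
c = rng.standard_normal(N)

def hvp(p):
    # Applies (diag(a) + M^T diag(beta) M) to p without forming the matrix.
    return a * p + M.T @ (beta * (M @ p))

x_cg = conjugate_gradient(hvp, c, np.zeros(N))
```

Each iteration costs two matrix-vector products with `M`, so the full $N \times N$ Hessian is never stored, which is the property that matters at large scale.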
We next update $\hat{h}$ in parallel, shrinking the update by a damping coefficient. This approach is not guaranteed to decrease the KL divergence on each iteration, but it is a widely applied approach that works well in practice (Koller and Friedman, 2009).
In practice we find that we can obtain a faster algorithm that reaches equally good solutions by replacing the conjugate gradient update to $\hat{s}$ with a more heuristic approach. We use a parallel damped update on $\hat{s}$ much like what we do for $\hat{h}$. In this case we make an additional heuristic modification to the update rule, which is made necessary by the unbounded nature of $\hat{s}$. We clip the update to $\hat{s}$ so that if $\hat{s}_{\text{new}}$ has the opposite sign from $\hat{s}$, its magnitude is at most $\rho |\hat{s}|$. In all of our experiments we used $\rho = 0.5$, but any value in $[0, 1]$ is sensible. This prevents a case where multiple mutually inhibitory units inhibit each other so strongly that rather than being driven to 0 they change sign and actually increase in magnitude. This case is a failure mode of the parallel updates that can result in $\hat{s}$ amplifying without bound if clipping is not used.

Figure 2: (Left) The energy functional of a batch of 5000 image patches increases during the E-step. (Right) Semi-supervised classification accuracy on CIFAR-10. In both cases the hyperparameters for the unsupervised stage were optimized for performance on the full CIFAR-10 dataset, not re-optimized for each point on the learning curve.
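The damped, sign-clipped parallel update described above can be sketched as follows. This is a minimal illustration under stated assumptions: the damping is applied as a convex combination of the old value and the fixed-point target, the clip is applied after damping, and the names `damping` and `rho` together with their default values are hypothetical. The fixed-point targets themselves are taken as given inputs.

```python
import numpy as np

def damped_clipped_update(s_old, s_target, damping=0.5, rho=0.5):
    """Move every unit part of the way to its target, limiting sign flips.

    s_old:    current variational parameters, one per unit.
    s_target: values proposed by the parallel fixed-point equations.
    """
    s_new = s_old + damping * (s_target - s_old)         # damped parallel step
    flipped = np.sign(s_new) * np.sign(s_old) < 0        # units whose sign changed
    limit = rho * np.abs(s_old)                          # largest magnitude allowed after a flip
    clipped = np.sign(s_new) * np.minimum(np.abs(s_new), limit)
    return np.where(flipped, clipped, s_new)

s_old = np.array([1.0, -2.0, 0.5])
s_target = np.array([-5.0, -1.0, 2.0])
s_next = damped_clipped_update(s_old, s_target)
```

In this example the first unit would jump from 1.0 to -2.0 under damping alone. The clip limits it to -0.5, so a unit that changes sign cannot also grow in magnitude, while the other two units take their ordinary damped steps.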
We include some visualizations that demonstrate the effect of our E-step. Figure 1 (right) shows that it produces a sparse representation. Figure 1 (left) shows that the explaining-away effect incrementally makes the representation more sparse. Figure 2 (left) shows that the E-step increases the energy functional.
4 Results
We conducted experiments to evaluate the usefulness of S3C features for supervised learning and semi-supervised learning on CIFAR-10 (Krizhevsky and Hinton, 2009), a dataset consisting of color images of animals and vehicles. It contains ten labeled classes, with 5000 train and 1000 test examples per class.
For all experiments, we used the same procedure as Coates and Ng (2011a). CIFAR-10 consists of $32 \times 32$ images. We train our feature extractor on contrast-normalized and ZCA-whitened patches from the training set. At test time, we extract features from all patches on an image, then average-pool them. The average-pooling regions are arranged on a non-overlapping grid. Finally, we train a linear SVM on the pooled features.
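The pooling step can be sketched as below. This is a minimal illustration, assuming the per-patch features have already been arranged into a map indexed by patch position. The function name, the $27 \times 27$ map size and the feature count are hypothetical values chosen for the example.

```python
import numpy as np

def average_pool(feature_map, grid):
    """Average-pool a (rows, cols, n_features) map over a non-overlapping grid x grid layout."""
    rows, cols, n_features = feature_map.shape
    r_edges = np.linspace(0, rows, grid + 1).astype(int)   # region boundaries along rows
    c_edges = np.linspace(0, cols, grid + 1).astype(int)   # region boundaries along columns
    pooled = np.empty((grid, grid, n_features))
    for i in range(grid):
        for j in range(grid):
            region = feature_map[r_edges[i]:r_edges[i + 1], c_edges[j]:c_edges[j + 1]]
            pooled[i, j] = region.mean(axis=(0, 1))        # mean over patch positions
    return pooled.reshape(-1)                              # one vector per image for the SVM

rng = np.random.default_rng(3)
feature_map = rng.random((27, 27, 16))     # hypothetical: 27 x 27 patch positions, 16 features
pooled = average_pool(feature_map, grid=3)
```

The output length is `grid * grid * n_features`, so the pooling grid directly sets the size of the vector passed to the linear classifier.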
Coates and Ng (2011a) used 1600 basis vectors in all of their sparse coding experiments. They post-processed the sparse coding feature vectors by splitting them into the positive and negative part for a total of 3200 features per average-pooling region. They average-pool on a $2 \times 2$ grid for a total of 12,800 features per image. We used $\mathbb{E}_Q[h]$ as our feature vector. This does not have a negative part, so using a $2 \times 2$ grid we would have only 6,400 features. In order to compare with similar sizes of feature vectors, we used a $3 \times 3$ pooling grid for a total of 14,400 features.
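The feature counts quoted above follow from simple arithmetic, which the short check below reproduces. The helper name is hypothetical, and the grid sizes are those implied by the totals (12,800 = 3200 × 4, 6,400 = 1600 × 4, 14,400 = 1600 × 9).

```python
def features_per_image(n_basis, grid, split_sign=False):
    """Pooled feature count: features per region times the number of grid cells."""
    per_region = n_basis * (2 if split_sign else 1)   # sign splitting doubles the features
    return per_region * grid * grid

sparse_coding_2x2 = features_per_image(1600, 2, split_sign=True)
s3c_2x2 = features_per_image(1600, 2)
s3c_3x3 = features_per_image(1600, 3)
```

With 1600 basis vectors, the $3 \times 3$ grid is the nearest square grid that brings the unsplit feature vector to a size comparable with the sign-split $2 \times 2$ one.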
4.1 CIFAR-10
On CIFAR-10, S3C achieves a test set accuracy of % with 95% confidence (or % when using a grid). Coates and Ng (2011a) do not report test set accuracy for sparse coding with "natural encoding" (i.e., extracting features in a model whose parameters are all the same as in the model used for training) but sparse coding with different parameters for feature extraction than training achieves an accuracy of (Coates and Ng, 2011a). Since we have not enhanced our performance by modifying parameters at feature extraction time, these results seem to indicate that S3C is roughly equivalent to sparse coding for this classification task. S3C also outperforms ssRBMs, which require 4,096 basis vectors per patch and a pooling grid to achieve accuracy. All of these approaches are close to the state of the art of %, which used a three-layer network (Coates and Ng, 2011b).
We also used CIFAR-10 to evaluate S3C's semi-supervised learning performance by training the SVM on small subsets of the CIFAR-10 training set, but using features that were learned on the entire CIFAR-10 training set. The results, summarized in Figure 2 (right), show that S3C is most advantageous for medium amounts of labeled data. S3C features thus include an aspect of flexible regularization: they improve generalization for smaller training sets yet do not cause underfitting on larger ones.
5 Transfer Learning Challenge
For the NIPS 2011 Workshop on Challenges in Learning Hierarchical Models (Le et al., 2011), the organizers proposed a transfer learning competition. This competition used a dataset consisting of $32 \times 32$ color images, including 100,000 unlabeled examples, 50,000 labeled examples of 100 object classes not present in the test set, and 120 labeled examples of 10 object classes present in the test set. The test set was not made public until after the competition. We chose to disregard the 50,000 labels and treat this as a semi-supervised learning task. We applied the same approach as on CIFAR-10 and won the competition, with a test set accuracy of 48.6%.
6 Conclusion
We have motivated the use of the S3C model for unsupervised feature discovery. We have described a variational approximation scheme that makes it feasible to perform learning and inference in large-scale S3C models. Finally, we have demonstrated that S3C is an effective feature discovery algorithm for supervised, semi-supervised, and self-taught learning.
Acknowledgements
This work was funded by DARPA and NSERC. The authors would like to thank Pascal Vincent for helpful discussions. The computation done for this work was conducted in part on computers of RESMIQ, Clumeq and SharcNet. We would like to thank the developers of Theano (Bergstra et al., 2010) and pylearn2 (Warde-Farley et al., 2011).

References
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference (SciPy). Oral Presentation.
Coates, A. and Ng, A. Y. (2011a). The importance of encoding versus training with sparse coding and vector quantization. In ICML 28.
Coates, A. and Ng, A. Y. (2011b). Selecting receptive fields in deep networks. In NIPS 2011.
Coates, A., Lee, H., and Ng, A. Y. (2011). An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2011).
Courville, A., Bergstra, J., and Bengio, Y. (2011). Unsupervised models of images by spike-and-slab RBMs. In Proceedings of the Twenty-eighth International Conference on Machine Learning (ICML'11).
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.
Garrigues, P. and Olshausen, B. (2008). Learning horizontal connections in a sparse coding model of natural images. In NIPS'07, pages 505–512. MIT Press, Cambridge, MA.
Gelfand, I. M. (1963). Calculus of Variations. Dover.
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
Krizhevsky, A. and Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto.
Le, Q. V., Ranzato, M., Salakhutdinov, R., Ng, A., and Tenenbaum, J. (2011). NIPS Workshop on Challenges in Learning Hierarchical Models: Transfer Learning and Optimization. https://sites.google.com/site/nips2011workshop.
Lücke, J. and Sheikh, A.-S. (2011). A closed-form EM algorithm for sparse coding.
Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. J. Amer. Statistical Assoc., 83(404), 1023–1032.
Mohamed, S., Heller, K., and Ghahramani, Z. (2011). Bayesian and L1 approaches to sparse unsupervised learning.
Neal, R. and Hinton, G. (1999). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models. MIT Press, Cambridge, MA.
Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research, 37, 3311–3325.
Raina, R., Battle, A., Lee, H., Packer, B., and Ng, A. Y. (2007). Self-taught learning: transfer learning from unlabeled data. In Z. Ghahramani, editor, ICML 2007, pages 759–766. ACM.
Salakhutdinov, R. and Hinton, G. (2009). Deep Boltzmann machines. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS 2009), volume 8.
Saul, L. K. and Jordan, M. I. (1996). Exploiting tractable substructures in intractable networks. In NIPS'95. MIT Press, Cambridge, MA.
Schraudolph, N. N. (2002). Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7), 1723–1738.
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, chapter 6, pages 194–281. MIT Press, Cambridge.
Titsias, M. K. and Lázaro-Gredilla, M. (2011). Spike and slab variational inference for multi-task and multiple kernel learning. In Advances in Neural Information Processing Systems 24.
Warde-Farley, D., Goodfellow, I. J., Lamblin, P., Desjardins, G., Bastien, F., and Bengio, Y. (2011). pylearn2. http://deeplearning.net/software/pylearn2.
Yu, K., Lin, Y., and Lafferty, J. (2011). Learning image representations from the pixel level via hierarchical sparse coding. In CVPR'11: IEEE Conference on Computer Vision and Pattern Recognition, pages 1713–1720, Colorado Springs, CO.
Zeiler, M., Taylor, G., and Fergus, R. (2009). Adaptive deconvolutional networks for mid and high level feature learning. In Proc. International Conference on Computer Vision (ICCV'09), pages 2146–2153. IEEE.