1 Introduction
Convolutional networks (Convnets) LeCun et al. (1998)
have proven to be powerful models for extracting rich features from high-dimensional images. They employ hierarchical layers of combined convolution and pooling to extract compressed features that capture the intra-class variations between images. The purpose of applying pooling over neighboring activations in the feature maps of a Convnet is to break the spatial correlation of neighboring pixels, and to improve the scale and translation invariance of the features learned by the Convnet. This also helps in learning filters for generic feature extraction of low-, mid-, and high-level concepts, such as edge detectors, geometric shapes, and object classes
Krizhevsky et al. (2012); Donahue et al. (2013); Zeiler et al. (2010); Zeiler & Fergus (2014). Several regularization techniques have been proposed to improve feature extraction in Convnets and to overcome overfitting in large deep networks with many parameters. The dropout technique of Srivastava et al. (2014)
is based on randomly dropping hidden units along with their connections during training to avoid co-adaptation or redundant filter training. This method resembles averaging over an ensemble of sub-models, where each sub-model is trained on a subset of the parameters. A maxout neuron is proposed in
Goodfellow et al. (2013b), where the maximum of the activities across feature maps is computed in Convnets. Maxout networks have been shown to improve classification performance by building a convex, unbounded activation function, which prevents the learning of dead filters. A winner-take-all method is employed in
Makhzani & Frey (2014) to reduce or eliminate redundant and delta-type filters in the pretraining of Convnets using Convolutional AutoEncoders (CAE), by taking the maximum activity inside each feature map in every training step. Sparse feature learning is a common method for compressed feature extraction in shallow encoder-decoder-based networks, i.e. in sparse coding Hoyer (2002, 2004); Olshausen et al. (1996); Olshausen & Field (1997), in Autoencoders (AE) Ng (2011), and in Restricted Boltzmann Machines (RBM) Poultney et al. (2006); Ranzato et al. (2007). Bach et al. (2012) organize sparsity in a structured form to capture interpretable features and to improve the predictive performance of the model. In this paper, we present a novel structured model of sparse feature extraction in CAE that improves the performance of feature extraction by regularizing the distribution of activities inside and across feature maps. We employ the idea of sparse filtering Ngiam et al. (2011) to regularize the activity across feature maps and to improve sparsity within and across feature maps. The proposed model uses $\ell_2$ and $\ell_1$ normalization on the feature-map activations to implement part-based feature extraction.

2 Model
In this section, the model of the Structured Sparse CAE (SSCAE) is described. A CAE consists of convolution/pooling/nonlinearity-based encoding and decoding layers, where the feature vector is represented as feature maps, i.e. the hidden outputs of the encoding layer.
In a CAE with $K$ encoding filters, the feature maps $h^k$ are computed by a convolution/pooling/nonlinearity layer, with the nonlinear function $f(\cdot)$ applied to the pooled activations of the convolution layer, as in Eq. 1,
$h^k = f\big(\mathrm{pool}(W^k \ast x + b^k)\big), \quad k = 1, \dots, K$  (1)
where $W^k$ and $b^k$ are the filter and bias of the $k$th feature map, respectively. We refer to $h^k_{i,j}$ as a single neuron activity in the $k$th feature map $h^k$, whereas $\mathbf{h}_{i,j} = \big(h^1_{i,j}, \dots, h^K_{i,j}\big)$ is defined as a feature vector across the feature maps, as illustrated in Fig. 1.
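To make the encoding computation of Eq. 1 concrete, the following is a minimal NumPy sketch of one encoding pass. The 'valid' convolution, non-overlapping 2x2 max pooling, and ReLU nonlinearity are illustrative assumptions; the text does not fix these choices.

```python
import numpy as np

def conv2d_valid(x, w):
    """2-D 'valid' convolution of input x with filter w."""
    H, W = x.shape
    fh, fw = w.shape
    out = np.zeros((H - fh + 1, W - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # flip the filter for true convolution (vs. cross-correlation)
            out[i, j] = np.sum(x[i:i + fh, j:j + fw] * w[::-1, ::-1])
    return out

def max_pool(a, p=2):
    """Non-overlapping p x p max pooling."""
    H, W = a.shape
    a = a[:H - H % p, :W - W % p]
    return a.reshape(H // p, p, W // p, p).max(axis=(1, 3))

def encode(x, filters, biases, f=lambda z: np.maximum(z, 0.0)):
    """Eq. 1: h^k = f(pool(W^k * x + b^k)) for each of the K filters."""
    return np.stack([f(max_pool(conv2d_valid(x, Wk) + bk))
                     for Wk, bk in zip(filters, biases)])
```

For an 8x8 input and 3x3 filters, each feature map is 6x6 after convolution and 3x3 after pooling.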
In SSCAE, the feature maps are regularized and sparsified to exhibit three properties: (i) sparse feature vectors $\mathbf{h}_{i,j}$; (ii) sparse neuronal activity within each $k$th feature map $h^k$; (iii) uniform distribution of the feature vectors $\mathbf{h}_{i,j}$.
In (i), sparsity is imposed on the feature vectors $\mathbf{h}_{i,j}$ to increase the diversity of the features represented by the feature maps, i.e. each $h^k$ should represent a distinguished and discriminative characteristic of the input, such as different parts, edges, etc. This property is exemplified in Fig. 1(b), with digits decomposed into parts across the feature maps. As stipulated in (ii), sparsity is imposed on each feature map $h^k$ so that it contains only a few nonzero activities $h^k_{i,j}$. This property encourages each feature map to represent a localized feature of the input. Fig. 1(b) shows property (ii) for the MNIST dataset, where each feature map is a localized feature of a digit, whereas Fig. 1(a) shows extracted digit-shape-resembling features, a much less successful and non-sparse outcome compared to Fig. 1(b). Fig. 2 also depicts the technique for numerical sparsification of each feature map. Property (iii) is imposed on the activation features to have similar statistics with uniform activity. In other words,
the feature vectors $\mathbf{h}_{i,j}$ will be of nearly equal or uniform activity level if they lie in the spatial region of the object, and non-active otherwise. Uniform activity also improves generic and part-based feature extraction, where the contributing activation features of the digits, i.e. $\mathbf{h}_{i,j}$, fall within the convolutional region of the digits and the filters show a uniform activity level, which results in generic and part-based features. To enforce the aforementioned sparsity properties in CAE models, we use a combination of $\ell_2$ and $\ell_1$ normalization on the feature maps $h^k$ of Eq. 1, as proposed in Ngiam et al. (2011), and as shown in Fig. 3. In SSCAE, a normalization layer is added on top of the encoding layer, where the normalized feature maps $\tilde{h}^k$ and feature vectors $\hat{\mathbf{h}}_{i,j}$ are computed by two normalization steps, as in Eq. 2 and Eq. 3, respectively,
$\tilde{h}^k = \dfrac{h^k}{\|h^k\|_2}$  (2)
$\hat{\mathbf{h}}_{i,j} = \dfrac{\tilde{\mathbf{h}}_{i,j}}{\|\tilde{\mathbf{h}}_{i,j}\|_2}$  (3)
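The two $\ell_2$ normalization steps of Eqs. 2 and 3 can be sketched in NumPy as follows; the small `eps` guard against division by zero is our addition, not part of the formulation.

```python
import numpy as np

def normalize_featuremaps(h, eps=1e-8):
    """Eq. 2: divide each feature map h^k by its own l2 norm.
    h has shape (K, H, W)."""
    norms = np.sqrt((h ** 2).sum(axis=(1, 2), keepdims=True))
    return h / (norms + eps)

def normalize_feature_vectors(h_tilde, eps=1e-8):
    """Eq. 3: divide each feature vector h_{i,j} (taken across
    the K maps) by its l2 norm."""
    norms = np.sqrt((h_tilde ** 2).sum(axis=0, keepdims=True))
    return h_tilde / (norms + eps)

h = np.abs(np.random.rand(4, 5, 5))  # K=4 feature maps of size 5x5
h_hat = normalize_feature_vectors(normalize_featuremaps(h))
```

After both steps, every feature vector $\hat{\mathbf{h}}_{i,j}$ has (near-)unit $\ell_2$ norm, which is what encourages the uniform activity of property (iii).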
The final normalized feature maps $\hat{h}^k$ are forwarded as inputs to the decoding layer of unpooling/deconvolution/nonlinearity to reconstruct the input $x$ as $\hat{x}$, as in Eq. 4,
$\hat{x} = f\Big(\sum_{k=1}^{K} \mathrm{unpool}(\hat{h}^k) \ast P^k + c\Big)$  (4)
where $P^k$ and $c$ are the filters and bias of the decoding layer. In order to enforce the sparsity properties (i)-(iii), a sparsity penalty is applied to $\hat{h}^k$ as in Eq. 6, where the sparsity averaged over the feature maps and the training data is minimized during the reconstruction of the input $x$, as in Eqs. 5, 6 and 7,
$\mathcal{J}_{\mathrm{rec}} = \frac{1}{N} \sum_{n=1}^{N} \big\| x_n - \hat{x}_n \big\|_2^2$  (5)
$\mathcal{J}_{\mathrm{sparsity}} = \frac{\lambda}{N} \sum_{n=1}^{N} \sum_{k=1}^{K} \big\| \hat{h}^k(x_n) \big\|_1$  (6)
$\mathcal{J}_{\mathrm{SSCAE}} = \mathcal{J}_{\mathrm{rec}} + \mathcal{J}_{\mathrm{sparsity}}$  (7)
where $\mathcal{J}_{\mathrm{rec}}$, $\mathcal{J}_{\mathrm{sparsity}}$ and $\mathcal{J}_{\mathrm{SSCAE}}$ are the reconstruction, sparsity and SSCAE loss functions, respectively.
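The decoding and loss computations of Eqs. 4-7 admit a similarly compact NumPy sketch. The nearest-neighbour unpooling, the 'full' 2-D convolution used as the deconvolution, the identity output nonlinearity, and the batch averaging are illustrative assumptions rather than choices fixed by the text.

```python
import numpy as np

def unpool(a, p=2):
    """Nearest-neighbour unpooling: repeat each activation over a p x p block."""
    return np.kron(a, np.ones((p, p)))

def conv2d_full(x, w):
    """2-D 'full' (zero-padded) convolution, the shape-inverse of 'valid'."""
    fh, fw = w.shape
    xp = np.pad(x, ((fh - 1, fh - 1), (fw - 1, fw - 1)))
    out = np.zeros((x.shape[0] + fh - 1, x.shape[1] + fw - 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(xp[i:i + fh, j:j + fw] * w[::-1, ::-1])
    return out

def decode(h_hat, dec_filters, c=0.0, f=lambda z: z):
    """Eq. 4: x_hat = f(sum_k unpool(h_hat^k) * P^k + c)."""
    return f(sum(conv2d_full(unpool(hk), Pk)
                 for hk, Pk in zip(h_hat, dec_filters)) + c)

def sscae_loss(x_batch, xhat_batch, hhat_batch, lam=0.1):
    """Eqs. 5-7: batch-averaged squared reconstruction error (Eq. 5)
    plus the l1 penalty on the normalized feature maps (Eq. 6)."""
    N = len(x_batch)
    j_rec = sum(np.sum((x - xh) ** 2) for x, xh in zip(x_batch, xhat_batch)) / N
    j_sparse = lam * sum(np.abs(h).sum() for h in hhat_batch) / N
    return j_rec + j_sparse
```

With 2x2 unpooling and 3x3 decoding filters, a 3x3 normalized feature map is reconstructed back to an 8x8 input, matching the encoder's shape arithmetic.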
$\lambda$ indicates the sparsity penalty on the normalized feature maps $\hat{h}^k$ and feature vectors $\hat{\mathbf{h}}_{i,j}$. Fig. 2 demonstrates the steps of normalization and sparsification on selected feature maps of MNIST data.

3 Experiments
We used Theano Bastien et al. (2012) and Pylearn2 Goodfellow et al. (2013a), on Amazon EC2 g2.8xlarge instances with GRID K520 GPUs, for our experiments.

3.1 Reducing dead filters
In order to evaluate how well our model minimizes dead filters by learning sparse and local filters, the filters trained on MNIST data are compared between CAE and SSCAE, with and without a pooling layer, in Fig. 4. Fig. 4(a)(c) shows that CAE, with and without a pooling layer, learns some delta filters, which simply implement an identity function. In contrast, the sparsity function used in SSCAE reduces the extraction of delta filters by regulating the activations across feature maps, as shown in Fig. 4(b)(d).
3.2 Improving learning of reconstruction
To investigate the effect of structured sparsity on the learning of filters through reconstruction, the performance of CAE and SSCAE is compared on the SVHN dataset, as shown in Fig. 6. To show the effect of structured sparsity on reconstruction, a small CAE with 8 filters is trained on the SVHN dataset. Fig. 6(a) shows the performance of CAE after training, which fails to extract edge-like filters and results in poor reconstruction. Fig. 8 also depicts the 16 learnt encoding and decoding filters on the small NORB dataset, where structured sparsity improves the extraction of localized and edge-like filters. SSCAE outperforms CAE in reconstruction owing to its learnt edge-like filters. Selected feature maps of the two models are shown in Fig. 7(a)(b). The convergence rate of the reconstruction optimization for CAE and SSCAE is also compared on the MNIST (Fig. 5(a)), SVHN (Fig. 5(b)), small NORB (Fig. 5(c)), and CIFAR10 (Fig. 5(d)) datasets, indicating faster convergence for SSCAE.
References
 Bach et al. (2012) Bach, Francis, Jenatton, Rodolphe, Mairal, Julien, Obozinski, Guillaume, et al. Structured sparsity through convex optimization. Statistical Science, 27(4):450–468, 2012.
 Bastien et al. (2012) Bastien, Frédéric, Lamblin, Pascal, Pascanu, Razvan, Bergstra, James, Goodfellow, Ian J., Bergeron, Arnaud, Bouchard, Nicolas, and Bengio, Yoshua. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
 Donahue et al. (2013) Donahue, Jeff, Jia, Yangqing, Vinyals, Oriol, Hoffman, Judy, Zhang, Ning, Tzeng, Eric, and Darrell, Trevor. Decaf: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
 Goodfellow et al. (2013a) Goodfellow, Ian J., Warde-Farley, David, Lamblin, Pascal, Dumoulin, Vincent, Mirza, Mehdi, Pascanu, Razvan, Bergstra, James, Bastien, Frédéric, and Bengio, Yoshua. Pylearn2: a machine learning research library. arXiv preprint arXiv:1308.4214, 2013a. URL http://arxiv.org/abs/1308.4214.
 Goodfellow et al. (2013b) Goodfellow, Ian J, Warde-Farley, David, Mirza, Mehdi, Courville, Aaron, and Bengio, Yoshua. Maxout networks. arXiv preprint arXiv:1302.4389, 2013b.

 Hoyer (2002) Hoyer, P. O. Non-negative sparse coding. In Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pp. 557–565. IEEE, 2002.
 Hoyer (2004) Hoyer, P. O. Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res., 5:1457–1469, 2004.
 Krizhevsky et al. (2012) Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
 LeCun et al. (1998) LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Makhzani & Frey (2014) Makhzani, Alireza and Frey, Brendan. A winner-take-all method for training sparse convolutional autoencoders. arXiv preprint arXiv:1409.2752, 2014.
 Ng (2011) Ng, A. Sparse autoencoder. In CS294A Lecture notes, URL https://web.stanford.edu/class/cs294a/sparseAutoencoder_2011new.pdf, 2011. Stanford University.
 Ngiam et al. (2011) Ngiam, Jiquan, Chen, Zhenghao, Bhaskar, Sonia A, Koh, Pang W, and Ng, Andrew Y. Sparse filtering. In Advances in Neural Information Processing Systems, pp. 1125–1133, 2011.
 Olshausen & Field (1997) Olshausen, B. A. and Field, D. J. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
 Olshausen et al. (1996) Olshausen, B. A. et al. Emergence of simplecell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607–609, 1996.

 Poultney et al. (2006) Poultney, C., Chopra, S., LeCun, Y., et al. Efficient learning of sparse representations with an energy-based model. In Advances in Neural Information Processing Systems, pp. 1137–1144, 2006.
 Ranzato et al. (2007) Ranzato, M., Boureau, Y. L., and LeCun, Y. Sparse feature learning for deep belief networks. Advances in Neural Information Processing Systems, 20:1185–1192, 2007.
 Srivastava et al. (2014) Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Zeiler & Fergus (2014) Zeiler, Matthew D and Fergus, Rob. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pp. 818–833. Springer, 2014.

 Zeiler et al. (2010) Zeiler, Matthew D, Krishnan, Dilip, Taylor, Graham W, and Fergus, Rob. Deconvolutional networks. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 2528–2535. IEEE, 2010.