Structured Sparse Convolutional Autoencoder

by   Ehsan Hosseini-Asl, et al.

This paper aims to improve feature learning in Convolutional Networks (Convnets) by capturing the structure of objects. A new sparsity function is imposed on the extracted featuremaps to capture the structure and shape of the learned object, extracting interpretable features to improve prediction performance. The proposed algorithm organizes the activations within and across featuremaps by constraining the node activities through ℓ_2 and ℓ_1 normalization in a structured form.




1 Introduction

Convolutional nets (Convnets) LeCun et al. (1998) have been shown to be powerful models for extracting rich features from high-dimensional images. They employ hierarchical layers of combined convolution and pooling to extract compressed features that capture the intra-class variations between images. Pooling over neighboring activations in the featuremaps of a Convnet serves to break the spatial correlation of neighboring pixels and to improve the scale and translation invariance of the learned features. This also helps in learning filters for generic feature extraction at low, mid and high levels of abstraction, such as edge detectors, geometric shapes, and object classes Krizhevsky et al. (2012); Donahue et al. (2013); Zeiler et al. (2010); Zeiler & Fergus (2014).

Several regularization techniques have been proposed to improve feature extraction in Convnets and to overcome overfitting in large deep networks with many parameters. The dropout technique of Srivastava et al. (2014) randomly drops hidden units together with their connections during training to avoid co-adaptation and redundant filter training. This method resembles averaging over an ensemble of sub-models, where each sub-model is trained on a subset of the parameters. A maxout neuron is proposed in Goodfellow et al. (2013b), in which the maximum activity across featuremaps is computed in Convnets. Maxout networks have been shown to improve classification performance by building a convex and unbounded activation function, which prevents learning dead filters. A winner-take-all method is employed in Makhzani & Frey (2014) to reduce or eliminate redundant and delta-type filters when pretraining a Convnet with a Convolutional AutoEncoder (CAE), by keeping only the maximum activity inside each featuremap at every training step.

Sparse feature learning is a common method for compressed feature extraction in shallow encoder-decoder-based networks, for example in sparse coding Hoyer (2002, 2004); Olshausen et al. (1996); Olshausen & Field (1997), in Autoencoders (AE) Ng (2011), and in Restricted Boltzmann Machines (RBM) Poultney et al. (2006); Ranzato et al. (2007). Bach et al. (2012) organize sparsity in a structured form to capture interpretable features and improve the prediction performance of the model. In this paper, we present a novel structured model of sparse feature extraction in CAE that improves feature extraction by regularizing the distribution of activities inside and across featuremaps. We employ the idea of sparse filtering Ngiam et al. (2011) to regularize the activity across featuremaps and to improve sparsity within and across featuremaps. The proposed model uses ℓ_2 and ℓ_1 normalization on the featuremap activations to implement part-based feature extraction.

2 Model

In this section, the model of Structured Sparse CAE (SSCAE) is described. CAE consists of convolution/pooling/nonlinearity based encoding and decoding layers, where the feature vector is represented as featuremaps, i.e. hidden output of the encoding layer.

(a) CAE (b) SSCAE
Figure 1: 16 example filters and featuremaps, with feature vectors, extracted from non-whitened MNIST with sigmoid nonlinearity and no pooling using (a) CAE, (b) SSCAE. The effect of sparse feature extraction using SSCAE is shown without a pooling layer. Digits are the input pixelmaps for this example.
Figure 2: Illustration of structured sparsity in (a) two-dimensional and (b) three-dimensional space for featuremaps of the MNIST dataset. Each example is first projected onto the unit ℓ_2-ball and then optimized for sparsity. The unit ℓ_2-ball is shown together with level sets of the ℓ_1-norm. Notice that the sparseness of the features (in the ℓ_1 sense) is maximized when the examples are on the axes Ngiam et al. (2011).
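
As a worked check of the last remark in the caption, consider any point a on the unit ℓ_2-sphere in n dimensions. The bound below is a standard norm inequality rather than a formula from the paper; the symbols a, n and e_i are notation introduced here for illustration.

```latex
\|a\|_2 = 1 \;\Longrightarrow\; 1 \le \|a\|_1 \le \sqrt{n},
\qquad
\|a\|_1 = 1 \iff a = \pm e_i \ \text{for some coordinate axis } e_i,
\qquad
\|a\|_1 = \sqrt{n} \iff |a_1| = \dots = |a_n| = \tfrac{1}{\sqrt{n}} .
```

In two dimensions, for instance, the axis point (1, 0) has ℓ_1-norm 1, while the diagonal point (1/√2, 1/√2) has ℓ_1-norm √2 ≈ 1.414. Minimizing the ℓ_1-norm of a point constrained to the unit ℓ_2-sphere therefore pushes it toward an axis, that is, toward a single active component.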

In a CAE with a set of encoding filters, the featuremaps are computed by a convolution/pooling/nonlinearity layer, with a nonlinear function applied to the pooled activation of the convolution layer, as in Eq. 1.


Here each featuremap has its own filter and bias. A single neuron activity is the value at one spatial position of a given featuremap, whereas a feature vector collects the activities at one spatial position across all featuremaps, as illustrated in Fig. 1.
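
The distinction between a featuremap, a single neuron activity and a feature vector can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' Theano implementation: the function name `encode`, the array layout `(K, H, W)`, the 5x5 filter size and the random inputs are all hypothetical, pooling is omitted (as in the Fig. 1 setting), and the "convolution" is implemented as a valid cross-correlation, as is common in deep learning libraries.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def encode(x, filters, biases):
    """Valid 2-D cross-correlation of one image with K filters, then sigmoid.

    x:       (H, W) input image
    filters: (K, fh, fw) encoding filters (sizes are assumptions)
    biases:  (K,) one bias per featuremap
    returns: (K, H - fh + 1, W - fw + 1) featuremaps
    """
    K, fh, fw = filters.shape
    H, W = x.shape
    out = np.empty((K, H - fh + 1, W - fw + 1))
    for k in range(K):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                patch = x[i:i + fh, j:j + fw]
                out[k, i, j] = np.sum(patch * filters[k]) + biases[k]
    return sigmoid(out)


rng = np.random.default_rng(0)
x = rng.random((28, 28))                           # stand-in for an MNIST digit
filters = rng.normal(scale=0.1, size=(16, 5, 5))   # 16 hypothetical 5x5 filters
biases = np.zeros(16)

h = encode(x, filters, biases)
featuremap = h[3]               # one featuremap: all positions of filter 3
neuron = h[3, 10, 12]           # single neuron activity in that featuremap
feature_vector = h[:, 10, 12]   # same position, across all 16 featuremaps
```

Indexing along the first axis selects a featuremap, while fixing the two spatial indices and slicing the first axis yields the feature vector at that position.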

In SSCAE, the featuremaps are regularized and sparsified to exhibit three properties: (i) sparse feature vectors; (ii) sparse neuronal activity within each featuremap; (iii) uniform distribution of feature vectors.

In (i), sparsity is imposed on each feature vector to increase the diversity of features represented by the featuremaps, i.e. each featuremap should represent a distinct and discriminative characteristic of the input, such as different parts, edges, etc. This property is exemplified in Fig. 1(b), with digits decomposed into parts across featuremaps. As stipulated in (ii), sparsity is imposed on each featuremap so that it contains only a few non-zero activities. This property encourages each featuremap to represent a localized feature of the input. Fig. 1(b) shows property (ii) for the MNIST dataset, where each featuremap is a localized feature of a digit, whereas Fig. 1(a) shows extracted features that merely resemble the digit shape, a much less successful and non-sparse outcome compared to Fig. 1(b). Fig. 2 also depicts the technique for numerical sparsification of each featuremap. Property (iii) is imposed so that the activation features have similar statistics with uniform activity. In other words, the feature vectors will be of nearly equal or uniform activity level if they lie in the spatial region of the object, and non-active if not. Uniform activity also improves generic and part-based feature extraction: the activation features that contribute to a digit fall within the convolutional region of digits and filters and show a uniform activity level, which results in generic and part-based features.

Figure 3: Model architecture of Structured Sparse Convolutional AutoEncoder (SSCAE)

To enforce the aforementioned sparsity properties in CAE models, we use the combination of ℓ_2 and ℓ_1 normalization on the featuremaps of Eq. 1, as proposed in Ngiam et al. (2011) and as shown in Fig. 3. In SSCAE, a normalization layer is added on top of the encoding layer, where the normalized featuremaps and feature vectors are obtained by two ℓ_2-normalization steps, as in Eq. 3 and Eq. 2, respectively.
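
The two normalization steps and the ℓ_1 penalty can be sketched in NumPy as below. This is a minimal sketch in the spirit of sparse filtering, not the paper's exact equations: the order of the two steps (within each featuremap first, then across featuremaps), the `(K, H, W)` layout, the small constant `eps` and the function names are assumptions made for illustration.

```python
import numpy as np


def structured_normalize(h, eps=1e-8):
    """Apply two l2-normalization steps to featuremaps h of shape (K, H, W)."""
    # Step 1 (assumed first): l2-normalize each featuremap over its
    # spatial positions, so every featuremap has comparable total energy.
    fm_norm = np.sqrt(np.sum(h ** 2, axis=(1, 2), keepdims=True))
    h_tilde = h / (fm_norm + eps)

    # Step 2: l2-normalize each feature vector across the K featuremaps,
    # projecting it onto the unit l2-ball as illustrated in Fig. 2.
    fv_norm = np.sqrt(np.sum(h_tilde ** 2, axis=0, keepdims=True))
    h_hat = h_tilde / (fv_norm + eps)
    return h_hat


def l1_sparsity(h_hat):
    """Average l1-norm of the normalized feature vectors."""
    return np.mean(np.sum(np.abs(h_hat), axis=0))


rng = np.random.default_rng(0)
h = rng.random((16, 24, 24))       # stand-in for sigmoid featuremaps
h_hat = structured_normalize(h)
penalty = l1_sparsity(h_hat)
```

Because every feature vector ends up with unit ℓ_2-norm, its ℓ_1-norm lies between 1 (a single active featuremap) and √K (all featuremaps equally active), so minimizing the penalty favors feature vectors that lie close to an axis.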


The final normalized featuremaps are forwarded as inputs to the decoding layer of unpooling/deconvolution/nonlinearity to reconstruct the input as in Eq. 4,


Here the decoding layer has its own filters and biases. In order to enforce the sparsity properties (i)-(iii), an ℓ_1 sparsity penalty is applied to the normalized featuremaps as in Eq. 6, where the sparsity averaged over featuremaps and training data is minimized during the reconstruction of the input, as in Eqs. 5, 6 and 7.


The quantities defined in Eqs. 5, 6 and 7 are the reconstruction, sparsity and SSCAE loss functions, respectively. A penalty coefficient sets the weight of the sparsity term. Fig. 2 demonstrates the steps of normalization and sparsification on selected featuremaps of MNIST data.
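
How the three losses fit together can be sketched as follows. This is only an illustration under assumptions: the mean-squared-error form of the reconstruction loss, the additive combination, the name `sparsity_weight` and its value 0.1 are hypothetical choices, since the exact form of Eqs. 5 to 7 is not recoverable from the text.

```python
import numpy as np


def sscae_loss(x, x_rec, h_hat, sparsity_weight=0.1):
    """Return (reconstruction, sparsity, total) losses for a batch.

    x, x_rec: (N, H, W) inputs and their reconstructions
    h_hat:    (N, K, H', W') normalized featuremaps
    sparsity_weight: hypothetical penalty coefficient (assumption)
    """
    # Reconstruction loss: mean squared error (assumed form).
    rec = np.mean((x - x_rec) ** 2)
    # Sparsity loss: l1 activity averaged over featuremaps and training data.
    sp = np.mean(np.abs(h_hat))
    # Combined objective: reconstruction plus weighted sparsity.
    return rec, sp, rec + sparsity_weight * sp


x = np.zeros((2, 8, 8))
x_rec = np.ones((2, 8, 8))              # every pixel is off by exactly 1
h_hat = np.full((2, 4, 3, 3), 0.5)      # constant activity of 0.5
rec, sp, total = sscae_loss(x, x_rec, h_hat)
```

With these toy inputs the reconstruction loss is 1.0, the sparsity loss is 0.5 and the total is 1.05, which makes the role of the penalty coefficient easy to see.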

3 Experiments

We used Theano Bastien et al. (2012) and Pylearn2 Goodfellow et al. (2013a) on Amazon EC2 g2.8xlarge instances with GRID K520 GPUs for our experiments.

3.1 Reducing Dead filters

To assess how well our model minimizes dead filters by learning sparse and local filters, the filters trained on MNIST data are compared between CAE and SSCAE, with and without a pooling layer, in Fig. 4. Fig. 4(a)(c) shows that CAE, with and without a pooling layer, learns some delta filters, which simply provide an identity function. In contrast, the sparsity function used in SSCAE reduces the extraction of delta filters by managing the activation across featuremaps, as shown in Fig. 4(b)(d).

(a) CAE w/o pooling, select delta filter and featuremap (b) SSCAE w/o pooling, select filter and sparse featuremap (c) CAE w/ max-pooling, select delta filter and featuremap (d) SSCAE w/ max-pooling, select filter and sparse featuremap
Figure 4: Comparison of 8 filters learnt from MNIST by CAE and SSCAE w/o pooling (a,b) and w/ non-overlapping max-pooling (c,d) using ReLU nonlinearity. A selected single filter and its respective featuremaps are shown on the digit.

3.2 Improving learning of reconstruction

To investigate the effect of structured sparsity on the learning of filters through reconstruction, the performance of CAE and SSCAE is compared on the SVHN dataset, as shown in Fig. 6. For this comparison, a small CAE with 8 filters is trained on SVHN. Fig. 6(a) shows that the CAE fails to extract edge-like filters after training, which results in poor reconstruction. In contrast, SSCAE outperforms CAE in reconstruction owing to its learnt edge-like filters (Fig. 6(b)). The selected featuremaps of the two models are shown in Fig. 7(a)(b). Fig. 8 also depicts the 16 learnt encoding and decoding filters on the small NORB dataset, where structured sparsity improves the extraction of localized and edge-like filters. The convergence rate of the reconstruction optimization for CAE and SSCAE is also compared on the MNIST (Fig. 5(a)), SVHN (Fig. 5(b)), small NORB (Fig. 5(c)), and CIFAR-10 (Fig. 5(d)) datasets, which indicates faster convergence in SSCAE.

(a) (b) (c) (d)
Figure 5: Convergence rate of CAE and SSCAE on (a) MNIST, (b) SVHN, (c) small NORB, and (d) CIFAR-10 datasets using 16 filters.
(a) CAE (b) SSCAE
Figure 6: SVHN data-flow visualization in (a) CAE and (b) SSCAE with 8 filters. The effect of structured sparsity is shown in the encoding and decoding filters and the reconstruction. No ZCA whitening is applied.
(a) CAE (b) SSCAE
Figure 7: Selected featuremaps of the SVHN dataset extracted by (a) CAE and (b) SSCAE with 8 filters. No ZCA whitening is applied.
(a) CAE encoding filter (b) SSCAE encoding filter (c) CAE decoding filter (d) SSCAE decoding filter
Figure 8: 16 Learnt encoding and decoding filters of (a)(c) CAE and (b)(d) SSCAE on small NORB dataset.