Image classification has long been a challenging task in the vision community, especially as the number of images and the intra-class variability continue to increase. Numerous efforts have been made to address this challenge, among which the bag-of-features (BoF) model has shown desirable performance. BoF works by extracting local features (e.g., SIFT) from the images, vector-quantizing them, and then representing each image as a histogram over the visual words. Thus, the BoF representation completely discards the spatial layout. As an extension of BoF, spatial pyramid matching (SPM) takes the spatial information of images into account and has improved classification performance on relatively small benchmarks such as Caltech101 and Caltech256. However, this model design fails to deliver comparable performance on mid-scale datasets such as STL-10 and large-scale datasets such as CIFAR-10. Parallel processing based on distributed resources seems to ease the bottleneck caused by the increasing image scale, but improving the processing algorithms should be the more fundamental solution. This question has raised considerable interest in mid-level features [2, 3] and in feature learning in general [4, 5, 6].
In recent years, deep convolutional neural networks (CNNs) have demonstrated breakthrough accuracies for large-scale image classification, which has stimulated a flurry of studies on further improving CNN architectures [8, 9, 10]. Trained on sufficiently large and diverse datasets, these improved CNNs obtain state-of-the-art performance on visual recognition tasks. The success of CNNs is mainly attributed to their ability to learn rich mid-level image representations instead of relying on hand-designed low-level features. Typically, a convolutional neural network adopts a three-stage formulation: filter bank convolution, neuron activation, and pooling. Among these three stages, the filter bank convolution plays a central role. To learn an effective filter bank at each convolution stage, a variety of methods have been proposed, such as restricted Boltzmann machines (RBM) [11, 12] and regularized auto-encoders and their variations. In general, previous CNNs optimize the filter bank by applying stochastic gradient descent (SGD) to a large number of labeled images, which relies heavily on expertise in parameter initialization and fine-tuning. In addition, such a filter learning procedure is rather time-consuming, especially on a common CPU. With the emergence of GPU computing and the fast deep learning framework Caffe, conventional CNNs still seem promising. However, such hardware-based techniques cannot remove the aforementioned restrictions at their source. Besides, traditional CNNs are based on supervised learning, which means that image labels are strictly required. As the image scale keeps increasing, however, labeled images become scarce.
Considering that the success of current CNNs involves a certain randomness due to the uncertain filter bank learning procedure, researchers proposed another, mathematically justified model named the wavelet scattering network (ScatNet) [14, 15, 16]. ScatNet is similar to a CNN except for the design of its filter bank. The filter bank in ScatNet is simply predefined as wavelet operators, which avoids the weight learning procedure altogether. Despite the simplicity of the wavelet filter bank, previous studies have verified that a multistage architecture similar to that of a CNN allows ScatNet to achieve superior performance on handwritten digit and texture recognition. However, such a prefixed filter bank fails to capture the information in diverse images, which makes it hard to generalize to arbitrary vision tasks with competitive performance.
In this paper, we propose a compact unsupervised network (CUNet) for image classification, which combines the simplicity of the filter bank in ScatNet with the generalization ability of CNNs. Specifically, we use classical K-means to learn the filter bank from randomly extracted image patches. Scarce labeled images are not necessary here; unlabeled ones are enough to train the filters. After the convolution, we retain Rectified Linear Units (ReLUs) to activate the neurons, followed by the proposed weighted pooling. Subsequent hidden layers are constructed in the same way, except that their filter banks are learned from patches of the previous layer's feature maps. In the output layer, each neuron is binary-mapped, and each group of feature maps is integrated to coarsely represent the input image. Then, histograms are computed in each non-overlapping block, followed by a max-pooling operation over adjacent blocks to reduce feature redundancy and select the most competitive features.
The contributions of the proposed CUNet can be summarized in three aspects:
(1) The filter bank learning procedure is compact and unsupervised, which dispenses with the initialization and fine-tuning of millions of parameters and relieves the bottleneck of scarce labeled images. Thus, CUNet avoids the local optima that traditional CNNs often suffer from;
(2) The proposed weighted pooling accounts for the different effects of all the activations in the pooling region, which helps improve robustness to small image distortions;
(3) Histogram computation is a rather straightforward way to extract image features. We compute histograms in multiple blocks, which helps retain spatial information to a certain extent. The max-pooling trick further improves the competitiveness of the features.
The rest of the paper is organized as follows: Section 2 reviews related work; Section 3 gives the formulation details of CUNet; Section 4 provides comprehensive experimental results to validate CUNet; finally, Section 5 concludes the paper with directions for future work.
II Related Work
Convolutional networks have recently demonstrated impressive progress in a variety of image classification and recognition tasks [13, 17, 18]. The promise of CNNs has stimulated researchers to study this kind of network further for better performance. Multiple layers of unpooled convolution have lately been used with considerable success, but such architectures must be carefully designed and sized using good intuition along with extensive trial-and-error experiments on a validation set. Another line of work proposed to transfer image representations learned with CNNs on large datasets to other visual recognition tasks with limited training data. Although it achieved some success when reusing the ImageNet representation to compute mid-level image representations for the PASCAL VOC dataset, it still needs heavy training on ImageNet before the transfer. Besides, a representation learned from large datasets may overfit when it is transferred to small datasets. The maxout activation function was proposed to avoid pitfalls such as failing to use many of a model's filters, which makes it possible to train deeper networks. Compared with conventional convolutional layers, which perform linear separation, the maxout network is more potent because it can separate concepts that lie within convex sets. However, the maxout network imposes the prior that instances of a latent concept lie within a convex set in the input space, which does not necessarily hold. The NIN network is composed of mlpconv layers, which use multilayer perceptrons to convolve the input, and a global average pooling layer that replaces the fully connected layers of a conventional CNN. While mlpconv layers model local patches better and global average pooling prevents overfitting globally, NIN still faces the difficulty of training and fine-tuning millions of parameters. Training recurrent neural networks usually incurs the vanishing and exploding gradient problems. A gradient norm clipping strategy was proposed to deal with exploding gradients, together with a regularization term that forces the error signal not to vanish as it travels back in time, which relieves the vanishing gradient problem. Though some improvements in gradient-based training have been achieved, this approach still fails to simplify the inherent complexity of current neural networks.
Our proposed CUNet does not use any image transformations or other regularization such as dropout or maxout; it only involves preprocessing image patches, learning a K-means filter bank, computing histograms, and selecting the most competitive histogram bins. Even so, our simplifications do not entail a departure from current methods in terms of performance.
III Compact Unsupervised Network
In this section, we present the detailed formulation of the proposed CUNet. A two-layer CUNet structure is illustrated in Fig.1, and the output layer is highlighted in Fig.2. In the following subsections, we elaborate on each component of the block diagram.
III-A The pre-processing of the input layer
Suppose we are given a set of input training images, each with a single channel for gray images or three channels for RGB ones. CUNet begins by extracting random patches from the training images. Each patch is represented as a vector of pixel intensity values, and the randomly extracted patches together form a patch dataset. Given this patch dataset, we apply some necessary pre-processing operations to obtain a better configuration.
It is common practice in vision tasks to perform some simple normalization steps before attempting to generate features from the input data. In this work, each patch is normalized by subtracting the mean and dividing by the standard deviation of its elements. After normalizing each input vector, we apply the whitening operation over the whole dataset; the superiority of whitened images over non-whitened ones has been discussed in the literature. This yields the pre-processed input dataset. Given the number of filters in the first layer, we run K-means on the pre-processed dataset to obtain the filter bank, where each centroid acts as a convolution filter in the subsequent convolution stage.
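The pre-processing and filter learning steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the whitening is taken to be ZCA whitening with a small regularizer `eps`, K-means is a plain Lloyd iteration, and all function names and default values are hypothetical.

```python
import numpy as np

def extract_patches(images, patch_size, n_patches, rng):
    """Sample random square patches and flatten each one into a row vector."""
    n, h, w, c = images.shape
    out = np.empty((n_patches, patch_size * patch_size * c))
    for j in range(n_patches):
        i = rng.integers(n)
        r = rng.integers(h - patch_size + 1)
        col = rng.integers(w - patch_size + 1)
        out[j] = images[i, r:r + patch_size, col:col + patch_size, :].ravel()
    return out

def normalize(patches, eps=1e-8):
    """Per-patch normalization: subtract the mean, divide by the std."""
    mean = patches.mean(axis=1, keepdims=True)
    std = patches.std(axis=1, keepdims=True)
    return (patches - mean) / (std + eps)

def zca_whiten(x, eps=1e-2):
    """ZCA whitening over the whole patch set (assumed variant of whitening)."""
    x = x - x.mean(axis=0)
    cov = x.T @ x / x.shape[0]
    eigval, eigvec = np.linalg.eigh(cov)
    w = eigvec @ np.diag(1.0 / np.sqrt(eigval + eps)) @ eigvec.T
    return x @ w

def kmeans(x, k, rng, n_iter=10):
    """Plain Lloyd iterations; returns the k centroids."""
    centroids = x[rng.choice(x.shape[0], size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance of every patch to every centroid.
        d = ((x ** 2).sum(axis=1, keepdims=True)
             - 2.0 * x @ centroids.T
             + (centroids ** 2).sum(axis=1))
        labels = d.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:          # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def learn_filter_bank(images, patch_size, n_filters, n_patches, seed=0):
    """Patches -> normalize -> whiten -> K-means; each centroid is one filter."""
    rng = np.random.default_rng(seed)
    patches = extract_patches(images, patch_size, n_patches, rng)
    patches = zca_whiten(normalize(patches))
    centroids = kmeans(patches, n_filters, rng)
    return centroids.reshape(n_filters, patch_size, patch_size, images.shape[-1])
```

Calling `learn_filter_bank(images, patch_size=5, n_filters=4, n_patches=200)` on an array of shape `(N, H, W, C)` returns a bank of shape `(4, 5, 5, C)`. No labels are involved at any step, which is the point of the unsupervised design.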
III-B The formulation of the hidden layer
We retain the typical processing stages of the traditional CNN, i.e., filter convolution, neuron activation, and pooling. Next, we elaborate on each stage together with the special design adopted in CUNet.
Filter convolution: Given the first layer's convolution filter bank, we convolve each training image with every filter in the bank. The first layer's feature map set of an image thus contains one feature map per filter, each obtained by convolving the image with the corresponding filter.
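As a rough sketch of this convolution stage, the snippet below convolves one image with every filter of a bank and keeps the spatial size by zero padding. The function names are hypothetical, and the code assumes square, odd-sized filters with the same number of channels as the image.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def convolve_same(image, filt):
    """Zero-padded 2-D convolution; the output keeps the input height and width."""
    k = filt.shape[0]                       # assumes a square, odd-sized filter
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
    # windows has shape (H, W, C, k, k): one k-by-k window per pixel and channel.
    windows = sliding_window_view(padded, (k, k), axis=(0, 1))
    flipped = filt[::-1, ::-1, :]           # flip: convolution rather than correlation
    return np.einsum("hwcij,ijc->hw", windows, flipped)

def conv_layer(image, filter_bank):
    """One feature map per filter, stacked along the first axis."""
    return np.stack([convolve_same(image, f) for f in filter_bank])
```

For an image of shape `(H, W, C)` and a bank of shape `(K, k, k, C)`, `conv_layer` returns `K` feature maps of shape `(H, W)`.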
Nonlinear activation: The neurons in the feature maps then need to be activated through a pre-defined activation function. The hyperbolic tangent (tanh) and sigmoid functions are commonly used in previous networks and have proved effective. However, in terms of training time, these saturating nonlinearities are much slower than the non-saturating nonlinearity $f(x) = \max(0, x)$. Following prior work, we call the neurons activated by this nonlinearity Rectified Linear Units (ReLUs). Deep convolutional neural networks with ReLUs have been shown to train several times faster than their equivalents with tanh units. Therefore, CUNet likewise adopts ReLUs for efficient subsequent processing. In fact, we also tried the tanh and sigmoid functions in CUNet and found that they are not competitive with ReLUs.
Weighted pooling: To build robustness to small distortions, we place a pooling layer after the activation layer, as most traditional ConvNet architectures do. Conventional pooling usually comes in two popular forms, i.e., max pooling and average pooling. Max pooling always captures the largest response value, which may lose the useful information carried by the smaller ones. Average pooling aggregates local statistics and thus prevents large response values from taking over and small ones from being discarded. However, since average pooling treats each neuron equally, the differing usefulness of the neurons' responses is blurred. Stochastic pooling replaces the conventional deterministic pooling operations with a stochastic procedure that randomly picks the activation within each pooling region according to a multinomial distribution given by the activities within that region. Obviously, the choice of the multinomial distribution has a dominant effect on the pooling performance. Inspired by these pooling strategies, we propose a new pooling method called weighted pooling, which considers each neuron's response as well as the share of that response in the responses of the whole pooling region; that is, each neuron in the pooling region has its own weight. Suppose that the pooling window is of size $s \times s$ and that the response value of each neuron in it is $a_i$ with $i = 1, \dots, s^2$. Then, the pooling result $p$ of the window can be calculated according to Eq.(2):

$$p = \sum_{i=1}^{s^2} w_i a_i, \tag{2}$$

where $w_i$ is the weight of $a_i$. In this paper, we simply take each neuron's proportion of the total response in the pooling region as its weight, i.e., $w_i = a_i / \sum_{j=1}^{s^2} a_j$. The proposed weighted pooling captures a different proportion of the local information of each neuron in the original feature map, thus leading to a better local representation. To validate the effectiveness of the proposed weighted pooling, we conduct experiments in Section 4 that compare the performance under different pooling strategies. Conventionally, pooling summarizes non-overlapping neighborhoods of adjacent units, which reduces the computational complexity but leads to coarse pooling results. To be more precise, CUNet applies an overlapping sliding window, i.e., one whose stride is smaller than the window size, to obtain more accurate pooling results.
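A minimal sketch of weighted pooling on a single feature map is given below. It follows the description above (each response is weighted by its share of the total response in the window), so the pooled value equals the sum of squared responses divided by the sum of responses. The function name, the bottom/right zero padding used to keep the map size, and the convention of returning 0 for an all-zero window are assumptions made for illustration; the input is assumed non-negative, as it is after ReLU.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def weighted_pool(fmap, size=2, stride=1):
    """Weighted pooling of a non-negative 2-D feature map."""
    pad = size - 1
    # Zero-pad bottom and right so that stride 1 preserves the map size (assumed).
    padded = np.pad(fmap, ((0, pad), (0, pad)))
    win = sliding_window_view(padded, (size, size))[::stride, ::stride]
    total = win.sum(axis=(-2, -1))            # sum of responses per window
    squares = (win ** 2).sum(axis=(-2, -1))   # sum of squared responses per window
    out = np.zeros_like(total, dtype=float)
    # sum_i w_i * a_i with w_i = a_i / total; all-zero windows stay at 0.
    np.divide(squares, total, out=out, where=total > 0)
    return out
```

For a window with responses 1, 3, 0, 0 the result is (1 + 9) / 4 = 2.5, which lies between the average (1.0) and the maximum (3.0): large responses dominate without the small ones being ignored entirely.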
Thus, one layer of CUNet is complete, comprising three main stages: convolution, non-linear rectification, and weighted pooling. Note that the three steps keep the feature maps at the size of the original input image, since the convolution and pooling operations both pad the images (or feature maps) with zeros. We tried this size-preserving model and observed that it outperforms traditional size-changing ones. The second layer follows the same formulation as the first, except that its filter bank is obtained by running K-means on patches randomly extracted from the first layer's output. We could certainly stack more layers, as previous works do, to obtain higher-level features; however, we find that going beyond two layers brings only marginal performance improvement. Thus, CUNet adopts a two-layer architecture, and a deeper model can be built in the same way where applicable.
III-C The design of the output layer
The detailed design of the output layer is shown in Fig.2. In the second layer, each of the feature maps from the first layer has its own set of outputs. First, each set of feature maps is binary-mapped: each unit is set to 1 if its value is positive and to 0 otherwise. The resulting maps consist only of ones and zeros, and we call them B-maps. Obviously, such a crude mapping inevitably loses some useful feature information. To recover complementary feature information, and inspired by previous work, we integrate the B-maps of each set into one integer-valued image by multiplying each B-map by a coefficient and summing the results. The order and weights of the B-maps do not have any relevant effect on the network performance.
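The binary mapping and integration step might look like the sketch below. The coefficients are not recoverable from the text here, so powers of two (1, 2, 4, ...) are used purely as an assumption; they make every combination of B-map values map to a distinct integer. The function names are hypothetical.

```python
import numpy as np

def binarize(fmaps):
    """B-maps: each unit becomes 1 if positive and 0 otherwise."""
    return (fmaps > 0).astype(np.int64)

def integrate_bmaps(bmaps):
    """Combine K binary maps of shape (K, H, W) into one integer-valued image."""
    k = bmaps.shape[0]
    weights = 2 ** np.arange(k)           # assumed coefficients: 1, 2, 4, ...
    return np.tensordot(weights, bmaps, axes=1)
```

With `K` B-maps and these assumed weights, the integrated image takes values in the range 0 to 2^K - 1, which fixes the number of histogram bins needed in the next step.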
Each set of B-maps thus yields one corresponding integer-valued image. Next, we simply compute histograms of these images to obtain the final image representation. Considering the robustness that geometric invariance brings to image classification and to the matching of highly variable scenes, we compute the histograms in a window-wise manner. Previous work adopts a similar strategy and achieves desirable performance. However, we argue that such histogram computation incurs feature redundancy and high dimensionality. To ease this restriction, we apply a max-pooling operation to the histogram bins of adjacent blocks. In particular, for each group of adjacent blocks, we select the maximum bins across the blocks, so that their histograms are merged into one histogram. Such max-pooling helps obtain the most competitive image features, avoids feature redundancy, and keeps the feature dimension within a reasonable range. Finally, we concatenate the histograms obtained from all groups of blocks into the image feature, which is fed to a LIBLINEAR SVM classifier to classify the images.
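The block-wise histogram and max-pooling steps can be sketched as follows. This is one possible reading of the description: the max is taken bin by bin across each group of adjacent blocks, and the group is assumed to be 2-by-2 blocks. Function names, the group size, and the requirement that pixel values lie in `[0, n_bins)` are assumptions for illustration.

```python
import numpy as np

def block_histograms(int_image, block, n_bins):
    """Histogram of every non-overlapping block; returns (rows, cols, n_bins)."""
    h, w = int_image.shape
    rows, cols = h // block, w // block
    hists = np.zeros((rows, cols, n_bins), dtype=np.int64)
    for r in range(rows):
        for c in range(cols):
            patch = int_image[r * block:(r + 1) * block, c * block:(c + 1) * block]
            # Values are assumed to be integers in [0, n_bins).
            hists[r, c] = np.bincount(patch.ravel(), minlength=n_bins)
    return hists

def max_pool_histograms(hists, group=2):
    """Merge each group-by-group set of adjacent blocks by a bin-wise maximum."""
    rows, cols, n_bins = hists.shape
    rows, cols = rows - rows % group, cols - cols % group   # drop incomplete groups
    grouped = hists[:rows, :cols].reshape(
        rows // group, group, cols // group, group, n_bins)
    return grouped.max(axis=(1, 3))

def image_feature(int_images, block, n_bins, group=2):
    """Concatenate the pooled histograms of all integer-valued images."""
    return np.concatenate([
        max_pool_histograms(block_histograms(t, block, n_bins), group).ravel()
        for t in int_images])
```

The resulting vector is what would be passed to a linear SVM. With 2-by-2 groups, max-pooling shortens the feature by a factor of four compared with concatenating every block histogram.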
IV Experimental Evaluation
We evaluate the performance of CUNet on four benchmark datasets: STL-10, Caltech101, CIFAR-10, and MNIST. The networks used for the four datasets all consist of two stacked layers, followed by a linear SVM classifier. More specific experimental settings are given in the following sections. We quote results directly from the literature for comparison, since we sometimes could not reproduce the results of previous works, largely due to subtle engineering details.
IV-A The Classification Performance
The CIFAR-10 dataset is composed of 10 classes of natural images split into 50,000 for training and 10,000 for testing. Each image is an RGB one of size 32-by-32. Images vary greatly within each class not only in object position and object size, but also in colors and textures. Besides, the background of each image shows significant variance.
In particular, we learn filters of size in the first layer and of size in the second layer. In both layers the weighted pooling window is of size 2×2, and the pooling windows overlap with a stride of one pixel. The histogram blocks are all of size , non-overlapping. After computing the block-wise histograms, we take the maximum bins over adjacent blocks to form one single histogram.
TABLE 1 presents the classification accuracy of different methods on CIFAR-10. CUNet with weighted pooling achieves competitive performance among these methods. The results also show that the pooling strategy influences the final classification performance when all other settings are kept the same. Among the three pooling strategies (i.e., the proposed weighted pooling and the prevalent max and average pooling), weighted pooling performs best, about 0.38% higher than max pooling and 0.85% higher than average pooling. Note that the filter banks learned for the weighted pooling setting are reused for max and average pooling, which rules out the subtle influence of the randomness in K-means filter learning. The same strict setting is applied to STL-10, Caltech101, and MNIST for a fair comparison.
|Method|Accuracy (%)|
|---|---|
|CUNet Weighted pooling|80.31|
|CUNet Max pooling|79.93|
|CUNet Average pooling|79.46|
|Tiled CNN|73.10|
|Improved LCC|74.50|
|K-means (Triangle, 4000 features)|79.60|
|Discriminative SPN|83.96|
|TIOMP-1/T (combined, K = 4,000)|82.20|
|2x PDL (1600 codes)|78.71|
The STL-10 dataset consists of 96-by-96 pixel color images belonging to 10 different classes. It is inspired by CIFAR-10 but provides fewer training examples (500 per class) and test examples (800 per class), which forces algorithms to rely heavily on acquired prior knowledge of image statistics. We downsampled the STL-10 images to a smaller size for a simpler configuration.
The experimental settings for STL-10 are similar to those for CIFAR-10, except for the filter settings. TABLE 2 compares different methods on STL-10. CUNet with weighted pooling performs favorably compared with these previous works. With the other settings kept the same, weighted pooling increases the classification accuracy by 0.6% over max pooling and by 0.4% over average pooling.
|Method|Accuracy (%)|
|---|---|
|CUNet Weighted pooling|63.00|
|CUNet Max pooling|62.40|
|CUNet Average pooling|62.60|
|2x PDL (1600 codes)|58.28|
|Discriminative SPN|62.30|
|sparse TIRBM (combined)|58.70|
We show some sample images from each class in Fig.3, with the classification accuracy of each class labeled next to the corresponding image row. From the results, we observe that the relatively simple classes generally achieve higher accuracy, such as airplane (81.38%), ship (81.00%), and car (80.13%). These three kinds of objects all have a relatively simple appearance. Besides, they are static objects and thus do not incur confusing problems such as variation in activity. In contrast, living objects such as monkey (53.50%), cat (43.50%), and dog (31.00%) generally give lower accuracy. From the sample images, we observe that such animals come in various kinds, usually show different actions, and are sometimes even hidden by other objects, which undoubtedly makes classification harder.
The Caltech101 dataset contains 101 object classes (including animals, vehicles, flowers, etc.) with significant variance in shape, plus a background class. The number of images per category varies from 31 to 800. For experimental convenience, we convert all the images to grayscale and resize them without keeping the aspect ratio. Following the traditional settings, we randomly select 15 and 30 training images per class (including the background class), respectively. TABLE 3 presents the classification results on Caltech101. For both 15 and 30 training images per class, we train the same number of filters. Other settings are the same as for CIFAR-10.
From TABLE 3, we observe that the proposed CUNet with weighted pooling performs favorably among current state-of-the-art methods based on raw pixels. Note that we directly resize the images without keeping the aspect ratio, while previous methods commonly adopt some strategy to maintain it. Even so, CUNet remains competitive on Caltech101. As on CIFAR-10 and STL-10, weighted pooling outperforms max pooling and average pooling to different extents.
|Method|Accuracy (%), 15 training images|Accuracy (%), 30 training images|
|---|---|---|
|CUNet Weighted pooling|58.62|66.72|
|CUNet Max pooling|58.00|66.34|
|CUNet Average pooling|58.14|66.48|
|Chen et al.|58.20|65.80|
|Zou et al.|-|66.50|
Fig.4 shows some classification results on Caltech101. Each pair of contiguous rows shows two classes with little inter-class difference. Nevertheless, the classification accuracies of the two rows in each pair differ considerably. For example, the classification accuracy of the class Faces is 76.90%, about 19.77% lower than that of the class Faces easy (96.67%). From the example images, we observe that the faces in Faces easy are exactly in the center of the images with little background included, while the faces in Faces appear at random positions (left or right, but not in the center), and all the images have a complex background, which makes the main object (the face) confusing. Therefore, Faces is more difficult to classify than Faces easy. Besides, once the images are roughly resized, the objects in Faces commonly get distorted, which makes Faces even harder to classify. The two classes Chair and Windsor chair similarly show a great difference in classification performance: the accuracy of Chair (23.40%) is 69.28% lower than that of Windsor chair (92.68%). From the example images, it is obvious that the objects in Chair vary greatly and the background is relatively complex. In contrast, the intra-class variability of Windsor chair is subtle, and its background is much simpler than that of Chair. Thus, it is not surprising that the classification performance on Windsor chair is much higher than on Chair. A similar analysis applies to Cougar body (50.82%) and Cougar face (47.54%), and to Crocodile head (16.67%) and Crocodile (2.86%). From this discussion, we argue that CUNet is competitive on classes with a simple background, little intra-class variability, and a clearly visible object. However, we have to admit that CUNet is less competitive on classes with a complex background, great intra-class variability, and confusing objects.
The basic MNIST dataset consists of 28-by-28 greyscale images of handwritten digits 0-9, with 10,000 training, 2,000 validation, and 10,000 test examples. To follow the same processing pipeline, we resize each MNIST image and keep the other settings the same as for the aforementioned three datasets, except for the number of filters.
TABLE 4 gives the classification error rates of different methods on basic MNIST. Again, the proposed weighted pooling outperforms average and max pooling. Since MNIST is a relatively simple dataset, all methods perform well and closely, so the subtle performance differences may not be statistically meaningful.
|Method|Error rate (%)|
|---|---|
|CUNet Weighted pooling|1.80|
|CUNet Max pooling|1.86|
|CUNet Average pooling|1.90|
Fig.5 presents some MNIST training examples and the corresponding error rates. From Fig.5, we observe that the simple classes 0 and 1 both show little intra-class variance, and these two classes achieve the lowest error rates. The other, more complex digits show greater intra-class variance and have higher error rates to different extents. From the sample images, we find that some of these confusing digits are hard to judge even for a human.
IV-B Impact of the number of filters
In this section, we conduct experiments to examine the impact of the filter number on CUNet performance. We fix the experimental settings as in Section 4.1 (i.e., the settings under which each dataset achieves its best performance) and change only the filter number of the first layer. In particular, since CIFAR-10 is a relatively complicated dataset, we vary the filter number from 20 to 40. For the mid-scale STL-10 and Caltech101, we vary it from 10 to 30. For the simpler MNIST, we vary it from 5 to 15.
Fig.6 illustrates the impact of the filter number on classification performance. From Fig.6, we observe that the classification accuracy improves as the filter number increases: whatever the dataset, more filters help improve the classification performance, as expected. However, this gain does not continue indefinitely. Once the filter number reaches its saturation value, the classification performance shows only marginal improvement. We attribute this phenomenon to useless duplicate filters; that is, some filters are repeated if the filter number is set larger than the saturation value. The repeated filters contribute nothing and may even drag down the final classification performance. Hence, the choice of the filter number plays an important role in CUNet.
IV-C Impact of the block size
Fig.7 illustrates the impact of the block size on CUNet performance. Here, the block size refers to the width and height of the histogram windows. For each dataset, we test three block sizes and fix the other settings as discussed in Section 4.1. From Fig.7, we observe that the classification performance decreases as the block size increases. For all the datasets, the classification accuracy is highest at the smallest block size. At the intermediate block size the performance declines, and the decline is even larger at the largest block size. However, it is worth mentioning that although the classification accuracy goes down as the block size increases, the feature dimension also decreases, which clearly reduces the computational burden on the experimental devices. In particular, the feature dimension at the intermediate block size is 4 times smaller than at the smallest block size, and the feature dimension at the largest block size is 16 times smaller than at the smallest one.
Based on the above analysis, the block size has a two-way influence on CUNet. On the one hand, increasing the block size lowers the performance. On the other hand, the feature dimension decreases as the block size increases. Hence, the choice of the block size largely depends on one's own focus. If accuracy is strictly required, the block size should be set smaller. Conversely, if the experimental devices cannot cope with the dataset scale, a larger block size should be used.
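The 4-fold and 16-fold reductions follow from simple counting: with non-overlapping square blocks, doubling the block side quarters the number of blocks and hence the length of the concatenated histograms. The sketch below illustrates this with purely hypothetical values (32-by-32 maps, block sides 4, 8 and 16, 256 bins, 10 integer-valued images), none of which are taken from the experiments.

```python
def n_blocks(height, width, block):
    """Number of non-overlapping square blocks that fit in a map."""
    return (height // block) * (width // block)

def feature_dim(height, width, block, n_bins, n_maps):
    """Length of the concatenated block-wise histograms, before max-pooling."""
    return n_blocks(height, width, block) * n_bins * n_maps

# Hypothetical setting, for illustration only.
dims = {b: feature_dim(32, 32, b, n_bins=256, n_maps=10) for b in (4, 8, 16)}
ratio_mid = dims[4] // dims[8]      # 4: doubling the block side once
ratio_large = dims[4] // dims[16]   # 16: doubling it twice
```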
The above experiments validate the effectiveness of CUNet from different aspects. Four datasets (CIFAR-10, STL-10, Caltech101, MNIST) are employed to test the performance of CUNet on different classification tasks. First, we report the image classification accuracy on these four datasets to validate the feasibility of CUNet. The accuracies of some example classes are presented to analyze the strengths and weaknesses of CUNet. From the results, we find that CUNet is more competitive on static objects (e.g., airplane, ship, car) and on classes showing little intra-class variability. Correspondingly, CUNet is less competitive on dynamic objects (e.g., dog, cat, monkey) and on classes showing great intra-class variability. Second, we test the effect of the internal settings of CUNet on classification performance: 1) whatever the dataset, more filters help improve the classification performance, but this gain does not continue indefinitely; once the filter number reaches its saturation value, the classification performance shows only marginal improvement; 2) the choice of the block size largely depends on the application. The classification accuracy goes down as the block size increases, but the feature dimension also decreases, which reduces the computational burden on the experimental devices.
We propose a compact unsupervised network called CUNet for image classification. The main purpose of CUNet is to simplify the traditional convolutional neural network. CUNet is compact: it avoids tuning millions of parameters and does not require a numerical optimization solver. Besides, learning the convolution filters in an unsupervised manner addresses the bottleneck of scarce image labels. Experimental results verify that CUNet is competitive with previous state-of-the-art works. In future work, we would like to further simplify CUNet and make it feasible for more challenging large-scale benchmarks, especially those with great intra-class variance.
This work was supported in part by the National Natural Science Foundation of China under Grant 61370149, in part by the Fundamental Research Funds for the Central Universities (No. ZYGX2013J083), and in part by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.
-  R. Ji, L. Y. Duan, H. Yao, L. X, Y. Rui, and W. Gao. Learning to distribute vocabulary indexing for scalable visual search. IEEE Transactions on Multimedia, vol. 15, no. 1, pp. 153 - 166, 2013.
-  S. Liu, S. Yan, T. Zhang, C. Xu, J. Liu, H. Lu. Weakly supervised graph propagation towards collective image parsing. IEEE Transactions on Multimedia, vol. 14, no. 2, pp. 361 - 373, 2012.
-  M. Juneja, A. Vedaldi, C. V. Jawahar, and A. Zisserman. Blocks that shout: distinctive parts for scene classification. In CVPR, 2013.
-  X. Ren and D. Ramanan. Histograms of sparse codes for object detection. In CVPR, 2013.
-  G.-J. Qi, X. S. Hua, Y. Rui, J. Tang, and H. J. Zhang. Image classification with kernelized spatial-context. IEEE Transactions on Multimedia, vol. 12, no. 4, pp. 278 - 287, 2010.
-  P. Li, M. Wang, J. Cheng, C. Xu, H. Liu. Spectral hashing with semantically consistent graph for image indexing. IEEE Transactions on Multimedia, vol. 15, no. 1, pp. 141 - 152, 2013.
-  M. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In arXiv 1311.2901, 2013.
-  I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio. Maxout networks. In ICML, 2013.
-  M. Lin, Q. Chen, and S. Yan. Network in network. In ICLR, 2014.
-  J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In arXiv:1406.3332v2, 2014.
-  Y. Bengio, A. Courville, and P. Vincent. Representation learning: a review and new perspectives. IEEE TPAMI, vol. 35, no. 8, pp. 1798-1828, 2013.
-  K. Sohn, G. Zhou, C. Lee, and H. Lee. Learning and selecting features jointly with point-wise gated Boltzmann machine. In ICML, 2013.
-  A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
-  J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE TPAMI, vol. 35, no. 8, pp. 1872-1886, 2013.
-  L. Sifre and S. Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In CVPR, 2013.
-  L. Sifre and S. Mallat. Rigid-motion scattering for texture classification. In arXiv:1403.1687v1, 2014.
-  I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv preprint arXiv:1312.6082, 2013.
-  P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: integrated recognition, localization and detection using convolutional networks. http://arxiv.org/abs/1312.6229, 2014.
-  M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, 2014.
-  R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In arXiv:1211.5063v2, 2014.
-  L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. Regularization of neural networks using DropConnect. In ICML, 2013.
-  M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In ICLR, 2013.
-  A. Hyvarinen and E. Oja. Independent component analysis: algorithms and applications. Neural Networks, vol. 13, no. 4-5, pp. 411-430, 2000.
-  V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.
-  T. H. Chan, K. Jia, S. H. Gao, J. W. Lu, Z. N. Zeng, and Y. Ma. PCANet: A simple deep learning baseline for image classification? In arXiv:1404.3606v2, 2014.
-  Q. V. Le, J. Ngiam, Z. Chen, D. Chia, P. W. Koh, and A. Y. Ng. Tiled convolutional neural networks. In NIPS, 2010.
-  K. Yu and T. Zhang. Improved local coordinate coding using local tangents. In ICML, 2010.
-  L. Bo, X. Ren, and D. Fox. Kernel descriptors for visual recognition. In NIPS, 2010.
-  A. Coates, H. Lee, and A. Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, 2011.
-  A. Krizhevsky. cuda-convnet. http://code.google.com/p/cuda-convnet/, July 18, 2014.
-  R. Gens and P. Domingos. Discriminative learning of sum-product networks. In NIPS, 2012.
-  K. Sohn and H. Lee. Learning invariant representations with local transformations. In ICML, 2012.
-  Y. Q. Jia, O. Vinyals, and T. Darrell. Pooling-invariant image feature learning. In arXiv:1302.5056v1, 2013.
-  A. Romero, P. Radeva, and C. Gatta. No more meta-parameter tuning in unsupervised sparse feature learning. In arXiv:1402.5766v1, 2014.
-  S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: explicit invariance during feature extraction. In ICML, 2011.
-  B. Chen, G. Polatkan, G. Sapiro, D. B. Dunson, and L. Carin. The hierarchical beta process for convolutional factor analysis and deep learning. In ICML, 2011.
-  W. Y. Zou, S. Zhu, A. Y. Ng, and K. Yu. Deep learning of invariant features via simulated fixations in video. In NIPS, 2012.
-  H. Lee, R. Grosse, R. Ranganath, and A. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
-  K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu, and Y. LeCun. Learning convolutional feature hierarchies for visual recognition. In NIPS, 2010.
-  M. D. Zeiler, D. Krishnan, G. Taylor, and R. Fergus. Deconvolutional networks. In CVPR, 2010.