Visual object recognition is a major topic in computer vision and machine learning. In the past decade, people have realized that the central problem of object recognition is to learn meaningful representations (features) of images and videos. Much effort has gone into constructing effective learning architectures that combine modern machine learning methods while taking into account the characteristics of image data and vision problems.
In this work, we combine the power of deep learning architectures and the bag-of-visual-words (BoV) pipeline to construct a new unsupervised feature learning architecture for learning image representations. Compared to the single-layer sparse coding (SC) framework, our method can extract feature hierarchies at different levels of abstraction. The sparse codes at the same layer keep the spatial smoothness across image patches, and different SC hierarchies also capture different spatial scopes of the representation abstraction. As a result, the method has richer representation power and hence better performance on object recognition tasks. Compared to deep learning methods, our method benefits from effective hand-crafted features, such as SIFT features, as the input. Each module of our architecture has a sound explanation and can be formulated as an explicit optimization problem with promising computational performance. The method shows superior performance over the state-of-the-art methods in multiple experiments.
In the rest of this section, we review the technical background of the new framework, including the pipeline of using bag-of-visual-words for object recognition and a low-dimensional embedding method called DRLIM.
1.1 Bag-of-visual-words pipeline for object recognition
We now review the bag-of-visual-words pipeline, which consists of hand-crafted descriptor computation, bag-of-visual-words representation learning, spatial pyramid pooling and finally a classifier.
The first step of the pipeline is to extract a set of overlapping image patches of fixed size from each image, with fixed spacing between the centers of two adjacent patches. Then a -dimensional hand-crafted feature descriptor (e.g. the 128-dimensional SIFT descriptor) is computed from each image patch. Now let denote the set of feature descriptors, which are converted from overlapped image patches extracted from the -th image (e.g. size ), i.e.,
where is the feature descriptor of the -th patch in the -th image.
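The dense sampling step can be illustrated with a minimal sketch. The 4-pixel spacing follows the experimental setup described later in the paper; the 16-pixel patch size and the function name `patch_centers` are assumptions made only for this illustration.

```python
def patch_centers(height, width, patch_size=16, spacing=4):
    """Return the (row, col) centers of all patches that fit inside the image.

    patch_size=16 is an assumed value; spacing=4 matches the 4-pixel
    spacing mentioned in the experiments.
    """
    half = patch_size // 2
    # A patch centered at r covers rows [r - half, r + half), so the last
    # valid center is height - half.
    rows = range(half, height - half + 1, spacing)
    cols = range(half, width - half + 1, spacing)
    return [(r, c) for r in rows for c in cols]


centers = patch_centers(32, 32)
```

One descriptor (for example SIFT) would then be computed for the patch around each returned center.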
Let , where , denote the set of all feature descriptors from all training images. The second step of the pipeline consists of a dictionary learning process and a bag-of-visual-words representation learning process. In the case of using sparse coding to learn the bag-of-visual-words representation, the two processes can be unified as the following problem.
where denotes the dictionary of visual-words, the columns of are the learned sparse codes, and is the parameter that controls sparsity of the code. We should note, however, that other sparse encoding methods such as vector quantization and LLC could be used to learn the sparse representations (see for review and comparisons). Moreover, the dictionary learning process of finding in (1) is often conducted in an online style [14], and then the feature descriptors of the -th image stored in are encoded as the bag-of-visual-words representations stored in in the -dimensional space (). Intuitively speaking, the components of the bag-of-visual-words representation are less correlated than the components of dense descriptors. Therefore, compared to the dense feature descriptors, the high-dimensional sparse representations are more favorable for classification tasks.
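To make the encoding step concrete, the following sketch encodes one descriptor against a fixed dictionary. It assumes the common lasso form of the sparse coding objective, 0.5 * ||y - D z||^2 + lam * ||z||_1, and solves it with ISTA (iterative soft-thresholding). Both the exact form of the objective and the choice of solver are assumptions for illustration; the names `y`, `D`, `lam` and `step` are hypothetical, and the paper may use a different solver.

```python
import math


def soft_threshold(v, t):
    """Shrink every component of v toward zero by t (proximal step for the L1 norm)."""
    return [math.copysign(max(abs(x) - t, 0.0), x) for x in v]


def ista_encode(y, D, lam, step, n_iter=200):
    """Approximately minimize 0.5 * ||y - D z||^2 + lam * ||z||_1 over z.

    D is a list of rows (d x K), one column per visual word.
    step should not exceed 1 / (largest eigenvalue of D^T D).
    """
    d, K = len(D), len(D[0])
    z = [0.0] * K
    for _ in range(n_iter):
        residual = [sum(D[i][k] * z[k] for k in range(K)) - y[i] for i in range(d)]
        grad = [sum(D[i][k] * residual[i] for i in range(d)) for k in range(K)]
        z = soft_threshold([z[k] - step * grad[k] for k in range(K)], step * lam)
    return z


# Toy example with an identity dictionary: the small component is zeroed out.
code = ista_encode(y=[1.0, 0.05], D=[[1.0, 0.0], [0.0, 1.0]], lam=0.1, step=1.0)
```

With an identity dictionary the solution is simply the soft-thresholded input, which shows how the sparsity parameter suppresses weak components.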
In the third stage of the pipeline, the sparse bag-of-visual-words representations of all image patches from each image are pooled together to obtain a single feature vector for the image based on the histogram statistics of the visual-words. To achieve this, each image is divided into three levels of pooling regions as suggested by the spatial pyramid matching (SPM) technique [13]. The first level of pooling region is the whole image. The second level consists of 4 pooling regions, which are the 4 quadrants of the whole image. The third level consists of 16 pooling regions, which are the quadrants of the second-level pooling regions. In this way, we obtain 21 overlapping pooling regions. Then for each pooling region, a max-pooling operator is applied to all the sparse codes whose associated image patch centers lie in this pooling region, and we obtain a single feature vector as the result. The max-pooling operator maps any number of vectors that have the same dimensionality to a single vector, whose components are the maximum values of the corresponding components in the mapped vectors. Formally, given the descriptors that are in the same pooling region, we calculate
where the max is taken component-wise. From the second stage of the framework, we know that the nonzero elements in a sparse code imply the appearance of the corresponding visual-words in the image patch. Therefore, the max-pooling operator is actually equivalent to calculating the histogram statistics of the visual-words in a pooling region. Finally, the pooled bag-of-visual-words representations from the 21 pooling regions are concatenated to obtain a single feature vector, which is regarded as the representation of the image, and a linear SVM is then used for training and testing on top of this representation. Since the labels of the training images are not used until the final training of the SVM, the whole pipeline is regarded as an unsupervised method. For the rest of this paper, we focus on the version of the pipeline where the feature (bag-of-visual-words representation) learning part is performed by a sparse coding step as in (1).
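The three-level pooling described above can be sketched as follows. The function names and the convention of filling an empty region with zeros are assumptions for illustration; the sketch also assumes non-negative codes, so that a zero vector is a neutral result for max-pooling.

```python
def max_pool(codes):
    """Component-wise maximum over a non-empty list of equal-length vectors."""
    return [max(column) for column in zip(*codes)]


def spatial_pyramid_pool(codes, centers, height, width, levels=(1, 2, 4)):
    """Concatenate max-pooled codes over 1 + 4 + 16 = 21 pooling regions.

    codes[i] is the sparse code of the patch centered at centers[i] = (row, col).
    Regions that contain no patch center contribute a zero vector (assumed convention).
    """
    K = len(codes[0])
    feature = []
    for cells in levels:
        for gr in range(cells):
            for gc in range(cells):
                members = [
                    z for z, (r, c) in zip(codes, centers)
                    if min(int(r * cells / height), cells - 1) == gr
                    and min(int(c * cells / width), cells - 1) == gc
                ]
                feature.extend(max_pool(members) if members else [0.0] * K)
    return feature


# Two 2-dimensional codes located in opposite corners of a 4 x 4 image.
feat = spatial_pyramid_pool([[1.0, 0.0], [0.0, 2.0]], [(0, 0), (3, 3)], 4, 4)
```

The resulting vector has 21 times the code dimension; this is the vector that would be passed to the linear SVM.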
1.2 Dimensionality reduction by learning an invariant mapping
We now review a method called dimensionality reduction by learning an invariant mapping (DRLIM, see [12]), which is the base model for our new method in Subsection 2.3. Different from traditional unsupervised dimensionality reduction methods, DRLIM relies not only on a set of training instances , but also on a set of binary labels , where is the set of index pairs such that if the label for the corresponding instance pair is available. The binary label if the pair of training instances and are similar instances, and if and are known to be dissimilar. Notice that the similarity indicated by usually comes from an external source rather than from knowledge that can be learned from the data instances directly. DRLIM learns a parametric mapping
such that the embeddings of similar instances attract each other in the low-dimensional space while the embeddings of dissimilar instances push each other away. In this spirit, the exact loss function of DRLIM is as follows:
where is the parameter for the contrastive loss term which decides the extent to which we want to push the dissimilar pairs apart. Since the parametric mapping is assumed to be determined by some parameters, DRLIM learns the mapping by minimizing the loss function in (3) with respect to the parameters of A. The mapping A could be either linear or nonlinear. For example, we can assume A is a two-layer fully connected neural network and then minimize the loss function (3) with respect to its weights. Finally, for any new data instance , its low-dimensional embedding is represented by without knowing its relationship to the training instances.
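As a concrete reading of the attract and repel behaviour, here is a minimal sketch of a pairwise contrastive loss in the form given by the original DRLIM paper (Hadsell et al.). The argument names, and the convention that `dissimilar=True` marks a dissimilar pair, are assumptions for illustration; the loss in this paper may differ in notation or scaling.

```python
import math


def contrastive_loss(a_i, a_j, dissimilar, margin):
    """Contrastive loss for one pair of embeddings a_i, a_j.

    Similar pairs are penalized by their squared distance (attraction).
    Dissimilar pairs are penalized only while they are closer than `margin`
    (repulsion up to the margin).
    """
    dist = math.sqrt(sum((p - q) ** 2 for p, q in zip(a_i, a_j)))
    if dissimilar:
        return 0.5 * max(0.0, margin - dist) ** 2
    return 0.5 * dist ** 2


loss_similar = contrastive_loss([0.0, 0.0], [3.0, 4.0], dissimilar=False, margin=1.0)
loss_far_apart = contrastive_loss([0.0, 0.0], [3.0, 4.0], dissimilar=True, margin=1.0)
```

A dissimilar pair that is already farther apart than the margin contributes nothing, which is why the margin controls how far dissimilar pairs are pushed.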
2 Deep sparse learning framework
Recent progress in deep learning [2] has shown that the multi-layer architecture of deep learning systems, such as that of deep belief networks, is helpful for learning feature hierarchies from data, where different layers of feature extractors are able to learn feature representations of different scopes. This results in more effective representations of data and benefits many downstream tasks. The rich representation power of deep learning methods motivates us to combine deep learning with the bag-of-visual-words pipeline to achieve better performance on object recognition tasks. In this section, we introduce a new learning framework, named deep sparse coding (DeepSC), which is built of multiple layers of sparse coding.
Before we introduce the details of the DeepSC framework, we first identify two difficulties in designing such a multi-layer sparse coding architecture.
First of all, to build the feature hierarchies from bottom-level features, it is important to take advantage of the spatial information of image patches such that a higher-level feature is a composition of lower-level features. However, this issue is hardly addressed by simply stacking sparse encoders.
Second, it is well-known (see [16, 10]) that sparse coding is not “smooth”, which means a small variation in the original space might lead to a huge difference in the code space. For instance, if two overlapping image patches have similar SIFT descriptors, their corresponding sparse codes can be very different. If another sparse encoder were applied to the two sparse codes, they would lose the affinity that was available at the SIFT descriptor stage. Therefore, stacking sparse encoders would only make the dimensionality of the features higher and higher without gaining new information.
Based on the two observations above, we propose the deep sparse coding (DeepSC) framework as follows. The first layer of the DeepSC framework is exactly the same as the bag-of-visual-words pipeline introduced in Subsection 1.1. Then in each of the following layers of the framework, there is a sparse-to-dense module which converts the sparse codes obtained from the previous layer to dense codes, followed by a sparse coding module. The output sparse code of the sparse coding module is the input of the next layer. Furthermore, the spatial pyramid pooling step is conducted at every layer such that the sparse codes of the current layer are converted to a single feature vector for that layer. Finally, we concatenate the feature vectors from all layers as the input to the classifier. We summarize the DeepSC framework in Figure 2. It is important to emphasize that the whole framework is unsupervised until the final classifier.
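The data flow described in this paragraph can be summarized in a short structural sketch. All callables passed in (`first_encoder`, `sparse_to_dense`, `encoder`, `pyramid_pool`) are hypothetical placeholders for the modules named in the text, and the sketch shows only how the layers are chained and concatenated, not how each module is implemented.

```python
def deepsc_features(descriptors, first_encoder, layers, pyramid_pool):
    """Chain DeepSC layers and concatenate one pooled feature vector per layer.

    layers is a list of (sparse_to_dense, encoder) pairs, one pair per
    additional layer beyond the first sparse coding layer.
    """
    codes = first_encoder(descriptors)      # layer 1: plain sparse coding
    per_layer = [pyramid_pool(codes)]       # pooled feature of layer 1
    for sparse_to_dense, encoder in layers:
        dense = sparse_to_dense(codes)      # pooling function: sparse -> dense
        codes = encoder(dense)              # sparse coding of the dense codes
        per_layer.append(pyramid_pool(codes))
    # The classifier input is the concatenation over all layers.
    return [value for feature in per_layer for value in feature]


# Toy stand-ins that only demonstrate the control flow.
toy = deepsc_features(
    [1, 2, 3],
    first_encoder=lambda d: [2 * x for x in d],
    layers=[(lambda c: c[::2], lambda d: [x + 1 for x in d])],
    pyramid_pool=lambda codes: [max(codes)],
)
```

In a real system each layer would also carry its own sampling grid, which the placeholder callables hide here.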
The sparse-to-dense module is the key innovation of the DeepSC framework, where a “pooling function” is proposed to tackle the aforementioned two concerns. The pooling function is the composition of a local spatial pooling step and a low-dimensional embedding step, which are introduced in Subsection 2.2 and Subsection 2.3 respectively. On one hand, the local spatial pooling step ensures the higher-level features are learned from a collection of nearby lower-level features and hence exhibit larger scopes. On the other hand, the low-dimensional embedding process is designed to take into account the spatial affinities between neighboring image patches such that the spatial smoothness information is not lost during the dimension reduction process. As the combination of the two steps, the pooling function fills the gaps between the sparse coding modules, such that the power of sparse coding and spatial pyramid pooling can be fully expressed in a multi-layer fashion.
2.2 Learning the pooling function
In this subsection, we introduce the details of designing the local spatial pooling step, which serves as the first part of the pooling function. First of all, we define the pooling function as a map from a set of sparse codes on a sampling grid to a set of dense codes on a new sampling grid. Assume that is the sampling grid that includes sampling points on an image, where any two adjacent sampling points have fixed spacing (number of pixels) between them. As introduced in Subsection 1.1, each sampling point corresponds to the center of an image patch. Let be the sparse codes on the sampling grid , where each is associated with a sampling point on according to its associated image patch. Mathematically, the pooling function is defined as the map:
where is the new sampling grid with sampling points and stores the -dimensional dense codes (for simplicity, we let be the same as the dimensionality of SIFT features) associated with the sampling points on the new sampling grid .
As the feature representations learned in the new layer are expected to have larger scope than those in the previous layer, we enforce each of the sampling points on the new grid to cover a larger area in the image. To achieve this, we take the center of neighboring sampling points in and let it be the new sampling point in . By taking the center of every other group of neighboring sampling points, the spacing between neighboring sampling points in is twice that in . As a result, we map to a coarser grid such that (see Figure 3).
Once the new sampling grid is determined, we finish the local spatial pooling step by applying the max-pooling operator (defined in (2)) to the subsets of sparse codes and obtain pooled sparse codes associated with the new sampling grid . More specifically, let denote the pooled sparse codes associated with the -th sampling point in , where . We have
where are the indices of the sampling points in that are closest to the -th sampling point in .
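A minimal sketch of the local spatial pooling step follows. It assumes that each new sampling point sits at the center of a 2 x 2 block of old sampling points and pools exactly those four nearest codes; the number of neighbors and the function name are assumptions for illustration, since the exact neighborhood is not recoverable from the text here.

```python
def local_max_pool(grid):
    """Map codes on a grid to max-pooled codes on a grid with doubled spacing.

    grid[r][c] is the sparse code (a list of floats) at sampling point (r, c).
    Each output point pools the four codes of one 2 x 2 block (assumed neighborhood).
    """
    rows, cols = len(grid), len(grid[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            block = [grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]]
            pooled_row.append([max(values) for values in zip(*block)])
        pooled.append(pooled_row)
    return pooled


# 4 x 4 grid of 1-dimensional codes with values 0..15.
fine = [[[float(4 * r + c)] for c in range(4)] for r in range(4)]
coarse = local_max_pool(fine)
```

Under this assumption the coarser grid has half as many sampling points along each axis as the original grid.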
2.3 Dimensionality reduction with spatial information
In this subsection, we introduce the details of combining the DRLIM method [12] with the spatial information of image patches to learn a low-dimensional embedding such that
As the feature vector is transformed by to lower-dimensional space, part of its information is discarded while some is preserved. As introduced in Subsection 1.2, DRLIM is trained on a collection of data instance pairs , each of which is associated with a binary label indicating their relationship. Therefore, it provides the option to incorporate prior knowledge in the dimensionality reduction process by determining the binary labels of training pairs based on the prior knowledge.
In the case of object recognition, the prior knowledge that we want to impose on the system is that if an image patch is shifted by a few pixels, it still contains the same object. Therefore, we construct the collection of training pairs for DRLIM as follows. We extract training pairs such that the two corresponding patches always share overlapping pixels. Let and be the pooled sparse codes corresponding to two image patches that have overlapping pixels and be the distance (in terms of pixels) between them, which is calculated based on the coordinates of the image patch centers. Given a threshold , we set
Generated this way, indicates the two image patches are mostly overlapped, while indicates that the two image patches are only partially overlapped. This process of generating training pairs ensures that the training of the transformation is focused on the most difficult pairs. Experiments show that if we instead take the pooled sparse codes of far-apart image patches as the negative pairs (), the performance of DRLIM degrades. The sensitivity of the system to the threshold parameter is demonstrated in Table 7.
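The pair construction can be sketched as below. The label convention (0 for mostly overlapped, 1 for partially overlapped), the use of a non-strict inequality at the threshold, and the names `make_pairs`, `patch_size` and `threshold` are all assumptions for illustration.

```python
import math


def make_pairs(centers, patch_size, threshold):
    """Return (i, j, label) for every pair of patches that share pixels.

    label 0: center distance <= threshold (mostly overlapped, treated as similar).
    label 1: patches still overlap but lie farther apart (treated as dissimilar).
    Pairs of patches with no shared pixels are skipped entirely.
    """
    pairs = []
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            dr = abs(centers[i][0] - centers[j][0])
            dc = abs(centers[i][1] - centers[j][1])
            if dr >= patch_size or dc >= patch_size:
                continue  # square patches of this size no longer overlap
            dist = math.hypot(dr, dc)
            pairs.append((i, j, 0 if dist <= threshold else 1))
    return pairs


pairs = make_pairs([(0, 0), (0, 4), (0, 12), (0, 40)], patch_size=16, threshold=8)
```

Because far-apart patches never form a pair, every negative example is a partially overlapping patch, which matches the stated focus on difficult pairs.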
Let the linear transformation be defined by the transformation matrix such that
and then the loss function with respect to the pair is
Let be the set of index pairs for training pairs collected from all training images. Then is obtained by minimizing the loss with respect to all training pairs, i.e., solving
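For a linear map, the pairwise loss and its gradient with respect to the transformation matrix have a simple closed form, sketched below. The sketch assumes the contrastive loss form of the original DRLIM paper and uses hypothetical names (`W`, `margin`, `dissimilar`); the optimizer actually used by the authors is not specified here, so only the per-pair loss and gradient are shown.

```python
import math


def pair_loss_and_grad(W, x_i, x_j, dissimilar, margin):
    """Contrastive loss of one pair under the linear map x -> W x, and dLoss/dW.

    W is a list of rows. The gradient has the form scale * (W delta) delta^T,
    where delta = x_i - x_j.
    """
    delta = [p - q for p, q in zip(x_i, x_j)]
    proj = [sum(w * d for w, d in zip(row, delta)) for row in W]
    dist = math.sqrt(sum(p * p for p in proj))
    if not dissimilar:
        loss, scale = 0.5 * dist ** 2, 1.0
    elif 0.0 < dist < margin:
        loss, scale = 0.5 * (margin - dist) ** 2, -(margin - dist) / dist
    else:
        # Beyond the margin (or exactly coincident): no gradient contribution.
        loss, scale = 0.5 * max(0.0, margin - dist) ** 2, 0.0
    grad = [[scale * p * d for d in delta] for p in proj]
    return loss, grad


loss, grad = pair_loss_and_grad([[1.0, 0.0]], [2.0, 0.0], [0.0, 0.0], False, 3.0)
```

Summing these per-pair gradients over all training pairs and taking gradient steps on `W` is one straightforward way to minimize the total loss.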
3 Experiments
In this section, we evaluate the performance of the DeepSC framework for image classification on three data sets: Caltech-101 [7], Caltech-256 [11] and 15-Scene. The Caltech-101 data set contains images belonging to classes, with about to images per class. Most images of Caltech-101 are of medium resolution, i.e., about . The Caltech-256 data set contains images from 256 categories. The collection has higher intra-class variability and object location variability than Caltech-101. The images are of similar size to those of Caltech-101. The 15-Scene data set, compiled by several researchers [8, 13, 15], contains a total of 4485 images falling into 15 categories, with the number of images per category ranging from 200 to 400. The categories include living room, bedroom, kitchen, highway, mountain and street, among others.
For each data set, the average per-class recognition accuracy is reported. Each reported number is the average of 10 repeated evaluations with randomly selected training and testing images. For each image, following , we sample image patches with 4-pixel spacing and use 128-dimensional SIFT features as the basic dense feature descriptors. The final step of classification is performed using a one-vs-all SVM through the LibSVM toolkit [5]. The parameters of DRLIM and the parameter that controls sparsity in the sparse coding are selected layer by layer through cross-validation. In the following, we present a comprehensive set of experimental results and discuss the influence of each of the parameters independently. In the rest of this paper, DeepSC-2 indicates the two-layer DeepSC system, DeepSC-3 the three-layer DeepSC system, and SPM-SC the one-layer baseline, i.e. the BoV pipeline with sparse coding plus spatial pyramid pooling.
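The reported metric weights every class equally, which differs from plain accuracy when classes have different sizes. A minimal sketch, with hypothetical function and variable names:

```python
def per_class_accuracy(y_true, y_pred):
    """Average of the per-class recognition rates (each class weighted equally)."""
    totals, hits = {}, {}
    for truth, guess in zip(y_true, y_pred):
        totals[truth] = totals.get(truth, 0) + 1
        hits[truth] = hits.get(truth, 0) + (truth == guess)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


def mean_over_runs(run_scores):
    """Average the metric over repeated random train/test splits."""
    return sum(run_scores) / len(run_scores)


score = per_class_accuracy(["a", "a", "a", "a", "b"], ["a", "a", "a", "b", "b"])
```

In this toy example class "a" is recognized 3 times out of 4 and class "b" once out of once, so the per-class average is higher than the plain accuracy of 4 out of 5.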
3.1 Effects of Number of DeepSC Layers
As shown in Figure 2, the DeepSC framework utilizes multiple layers of feature abstraction to get a better representation for images. Here we first check the effect of varying the number of layers utilized in our framework. Table 1 shows the average per-class recognition accuracy on the three data sets when the dictionary size is 1024 for all of them. The number of training images per class for the three data sets is set as for Caltech-101, for Caltech-256, and for 15-Scene respectively. The second row shows the results when we have only one layer of sparse coding, while the third and fourth rows show the results when we have two or three layers in DeepSC. Clearly the multi-layer DeepSC framework has superior performance on all three data sets compared to the single-layer SPM-SC system. Moreover, the classification accuracy improves as the number of layers increases.
3.2 Effects of SC Dictionary Size
We examine how the performance of the proposed DeepSC framework changes when varying the dictionary size of the sparse coding. On each of the three data sets, we consider three settings where the dimension of the sparse codes K is 1024, 2048 and 4096. The number of training images per class for these experiments is set as for Caltech-101, for Caltech-256, and for 15-Scene respectively. We report the results for the three data sets in Table 2, Table 3 and Table 4 respectively. Clearly, when increasing the dictionary size of sparse coding from 1024 to 4096, the accuracy of the system improves for all three data sets. We can observe that the performance of DeepSC is always improved with more layers, while in the case of K=4096 the performance boost in terms of accuracy is not so significant. This is probably because the parameter space in this case is already very large for the limited training data size. Another observation from Table 2, Table 3 and Table 4 is that DeepSC-2 (K=1024) always performs better than SPM-SC (K=2048), and DeepSC-2 (K=2048) always performs better than SPM-SC (K=4096). These two comparisons demonstrate that simply increasing the dimension of sparse codes does not give the same performance boost as increasing the number of layers, and therefore the DeepSC framework indeed benefits from the feature hierarchies learned from the image.
3.3 Effects of Varying Training Set Size
Furthermore, we check the performance change when varying the number of training images per class on the two Caltech data sets. Here we fix the dimension of the sparse codes as 2048. On Caltech-101, we compare two cases: randomly selecting or images per category respectively as training images and testing on the rest. On Caltech-256, we randomly select , and images per category respectively as training images and test on the rest. Table 5 and Table 6 show that with the smaller set of training images, the DeepSC framework still continues to improve the accuracy with more layers.
3.4 Effects of varying parameters of DRLIM
In Table 7, we report the performance variations when tuning the parameters for DRLIM. The parameter is the threshold for selecting positive and negative training pairs (see (6)) and the parameter in the hinge loss (see (7)) of the DRLIM model controls the penalization for negative pairs. We can see that it is important to choose a proper threshold parameter such that the transformation learned by DRLIM can differentiate mostly overlapped image pairs from partially overlapped image pairs.
3.5 Comparison with other methods
We then compare our results with other algorithms in Table 8. The most direct baselines for DeepSC to compare against are the sparse coding plus SPM framework (ScSPM) [17], LLC, and SSC. (We are also aware that some works achieve very high accuracy based on an adaptive pooling step [9] or a multiple-path system that utilizes image patches of multiple sizes [3].) Table 8 shows the comparison of our DeepSC versus ScSPM and SSC. We can see that our results are comparable to SSC, with slightly lower accuracy on the 15-Scene data (the standard deviation of SSC is much higher than ours). The LLC method proposed in [16] is reported to achieve 73.44% for Caltech-101 when using and 47.68% when using . Our DeepSC-3 has achieved 78.43% for Caltech-101 when using and 49.91% when using . Overall our system achieves state-of-the-art performance on all three data sets.
References
- [1] K. Balasubramanian, K. Yu, and G. Lebanon. Smooth sparse coding via marginal regression for learning sparse representations. In ICML, 2013.
- [2] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. arXiv preprint arXiv:1206.5538, 2012.
- [3] L. Bo, X. Ren, and D. Fox. Multipath sparse coding using hierarchical matching pursuit. In CVPR, 2013.
- [4] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In CVPR, pages 2559–2566. IEEE, 2010.
- [5] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
- [6] A. Coates and A. Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In ICML, volume 8, page 10, 2011.
- [7] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: an incremental bayesian approach tested on 101 object categories. In CVPR, Workshop on Generative-Model Based Vision, 2004.
- [8] L. Fei-Fei and P. Perona. A bayesian hierarchical model for learning natural scene categories. In CVPR, 2005.
- [9] J. Feng, B. Ni, Q. Tian, and S. Yan. Geometric ℓp-norm feature pooling for image classification. In CVPR, pages 2609–2704. IEEE, 2011.
- [10] S. Gao, I. W. Tsang, L.-T. Chia, and P. Zhao. Local features are not lonely–laplacian sparse coding for image classification. In CVPR, pages 3555–3561. IEEE, 2010.
- [11] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. 2007.
- [12] R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, volume 2, pages 1735–1742. IEEE, 2006.
- [13] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, volume 2, pages 2169–2178. IEEE, 2006.
- [14] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 689–696. ACM, 2009.
- [15] A. Oliva and A. Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 2001.
- [16] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, pages 3360–3367. IEEE, 2010.
- [17] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, pages 1794–1801. IEEE, 2009.