Image classification automatically assigns an unknown image to a category according to its visual content and has been a major research direction in computer vision. It poses two major challenges. First, each image may contain multiple objects with similar low-level features, so it is hard to categorize the image accurately using global statistical information such as color or texture histograms. Second, a medium-sized (e.g.,) grayscale image corresponds to a vector with dimensionality of , which raises a scalability issue for image classification techniques.
To address these problems, numerous approaches Mendoza2012 ; Lu2013 have been proposed in the past decade, among which one of the most popular is Bag-of-Features (BOF), also called Bag-of-Words (BOW). BOW originates from document analysis Joachims1996 ; Blei2003 . It models each document as the joint probability distribution of a collection of words. Sivic2003 ; Csurka2004 ; Fei2005 incorporated the insights of BOW into image analysis by treating each image as a collection of unordered appearance descriptors extracted from local patches. Each descriptor is quantized into a discrete ‘visual word’ according to a given codebook (i.e., dictionary), and a compact histogram representation is then computed for semantic image classification.
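The quantize-then-count step described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than any cited implementation: the function name `bow_histogram`, the row-wise data layout, and the toy codebook are assumptions made here.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword and count the assignments.

    descriptors: (n, m) array, one local descriptor per row (assumed layout).
    codebook:    (k, m) array, one visual word per row (assumed layout).
    Returns a length-k histogram normalized to sum to 1.
    """
    # Squared Euclidean distance between every descriptor and every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)  # index of the nearest visual word
    hist = np.bincount(words, minlength=codebook.shape[0]).astype(float)
    return hist / hist.sum()

# Toy example: three 2-D "visual words" and four descriptors.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
descriptors = np.array([[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [4.7, 5.2]])
hist = bow_histogram(descriptors, codebook)
```

The histogram discards where in the image each descriptor came from, which is the limitation that spatial pyramids address.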
The huge success of BOF has inspired a lot of work Grauman2005 ; Bolovinou2013 . In particular, Lazebnik et al. Lazebnik2006 proposed Spatial Pyramid Matching (SPM), which divides each image into blocks at different scales, computes the histogram of local features inside each block, and finally concatenates all the histograms to represent the image. SPM has been a major component of most state-of-the-art systems such as Yu2009 ; Wang2010 ; Yu2011 ; Gao2013TIP ; Zhou2013 , which have achieved considerably improved performance on a range of image classification benchmarks such as Columbia University Image Library-100 (COIL100) COIL100 and Caltech101 Fei2006 . However, to perform well, SPM has to pass the obtained representation to a Support Vector Machine (SVM) classifier with a nonlinear Mercer kernel, e.g., the intersection kernel. This raises a scalability issue for SPM in practical applications.
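To make the scalability concern concrete, the histogram intersection kernel can be sketched as below. The function name and the toy histograms are assumptions for illustration; the point is that a kernel SVM needs the full Gram matrix, whose size grows quadratically with the number of training images.

```python
import numpy as np

def intersection_kernel(A, B):
    """Histogram intersection kernel: K[i, j] = sum_d min(A[i, d], B[j, d]).

    A: (p, d) array of histograms, B: (q, d) array of histograms.
    Returns the (p, q) kernel matrix.
    """
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

# Two toy normalized histograms over three visual words.
H = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5]])
K = intersection_kernel(H, H)  # 2 x 2 Gram matrix
```

For n training images the Gram matrix has n × n entries, whereas a linear SVM works directly on the n feature vectors.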
To address this issue, Yang et al. Yang2009 proposed using sparse coding instead of k-means based vector quantization to encode each Scale-Invariant Feature Transform (SIFT) descriptor Lowe2004 over a codebook. Benefiting from the nonlinear structure of sparse representation, Yang’s method (namely ScSPM) with a linear SVM obtains higher classification accuracy than the traditional nonlinear SPM method, while taking less time for training and testing.
However, as Liu2013 pointed out, sparse representation encodes each data point independently and thus cannot capture the class structure. Moreover, due to the high computational complexity of sparse coding, it is a daunting task to perform ScSPM when the data size is larger than (i.e., when the number of blocks is over ). To solve these two problems, this paper proposes using Low Rank Representation (LRR) rather than sparse codes to hierarchically encode each SIFT descriptor. To the best of our knowledge, this is the first work to formulate image classification as an LRR problem under the framework of SPM.
Our method is motivated by the fact that each subject consists of multiple images and each image consists of multiple local descriptors. The number of subjects is much smaller than the number of descriptors, and therefore the representation of the descriptors is naturally low rank. In incorporating LRR into SPM, this paper focuses on the following problems: (1) The computational complexity of traditional LRR is equivalent to that of sparse coding. In other words, simply replacing sparse representation with LRR cannot solve the scalability issue of ScSPM. To address this problem, we propose a fast version of LRR based on the equivalence theory between the Nuclear norm and the Frobenius norm Favaro2011 ; Peng2012 ; Zhang2014 . Our method has a closed-form solution and thus can be computed very fast. Moreover, the method can be run in an online way, which makes it possible to handle incremental data. (2) Most recent LRR works use the inputs as the codebook (so-called self-expression), which is not suitable for the classification scenario. To address this problem, we propose a new objective function and derive the corresponding optimal solution. (3) The codebook of the original LRR technique Liu2013 probably contains various errors such as Gaussian noise. In this paper, we calculate the representation of each descriptor using a clean codebook. Extensive experimental results show that the proposed method, namely LrrSPM, achieves competitive results on several image databases and is times faster than ScSPM. Figure 1 shows a schematic comparison of the original SPM, ScSPM, and LrrSPM.
The rest of the paper is organized as follows: Section 2 provides a brief review on two classic image classification methods, i.e., Spatial Pyramid Matching (SPM) Lazebnik2006 and Sparse coding based Spatial Pyramid Matching (ScSPM) Yang2009 . Section 3 presents our method (LrrSPM) which uses multiple-scale low rank representation rather than vector quantization or sparse code to represent each image. Moreover, a fast and online low rank representation method is introduced. Section 4 carries out some experiments using seven image data sets and several popular approaches. Finally, Section 5 concludes this paper.
|the number of descriptors (features)|
|the scale or resolution of a given image|
|the dimensionality of the descriptors|
|the number of subjects|
|the size of codebook|
|the rank of a given matrix|
|a set of features|
|the representation of over|
2 Related works
Let be a collection of the descriptors and each column vector of represents a feature vector , Spatial Pyramid Matching (SPM) Lazebnik2006 applies Vector Quantization (VQ) to encode via
where denotes -norm, is the representation or called the cluster assignment of , the constraint guarantees that only one entry of is with value of one and the rest are zeroes, and denotes the codebook.
In the training phase, and are iteratively solved, and VQ is equivalent to the classic k-means clustering algorithm which aims to
where consists of cluster centers identified from .
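For reference, the encoding and training problems described in this passage can be written as follows. The symbols are assumptions chosen for this sketch, not necessarily the paper's own notation: x_i is a descriptor of dimension m, D is an m × k codebook, c_i is the k-dimensional code of x_i, and n is the number of descriptors. The one-hot constraint is written in one of several equivalent ways.

```latex
% Encoding one descriptor with a fixed codebook D (assumed notation):
\[
\min_{\mathbf{c}_i}\ \|\mathbf{x}_i - \mathbf{D}\mathbf{c}_i\|_2^2
\quad \text{s.t.} \quad \mathbf{c}_i \in \{0,1\}^k,\ \ \mathbf{1}^{\top}\mathbf{c}_i = 1 .
\]

% Training: codebook and assignments are optimized jointly (k-means):
\[
\min_{\mathbf{D},\,\mathbf{c}_1,\dots,\mathbf{c}_n}\ \sum_{i=1}^{n} \|\mathbf{x}_i - \mathbf{D}\mathbf{c}_i\|_2^2
\quad \text{s.t.} \quad \mathbf{c}_i \in \{0,1\}^k,\ \ \mathbf{1}^{\top}\mathbf{c}_i = 1,\ \ i = 1,\dots,n .
\]
```

With exactly one nonzero entry equal to one, D c_i simply selects a column of D, so the encoding step reduces to a nearest-center search.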
In the testing phase, each is actually assigned to the nearest . Since each has only one nonzero element, it discards a lot of information for (so-called, hard coding problem). Yang et al. Yang2009 proposed ScSPM which uses sparse representation to encode each via
where denotes -norm which sums the absolute values of a vector, and is the sparsity parameter.
The advantage of ScSPM is that the sparse representation has a small number of nonzero entries and can represent better with smaller reconstruction error. Extensive studies Wang2010 ; Yang2009 have shown that ScSPM with a linear SVM is superior to the original SPM with a nonlinear SVM on a range of databases. The disadvantage of ScSPM is that each data point is encoded independently, and thus the sparse representation of cannot reflect the class structure. Moreover, the computational cost of sparse coding is very high, so even a medium-sized data set raises a scalability issue for ScSPM.
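The cost of sparse coding comes from the fact that every descriptor requires its own iterative optimization. The sketch below uses ISTA (iterative soft thresholding) purely as a simple, generic solver for an l1-regularized least-squares problem; the source does not say which solver ScSPM uses, and the 0.5 scaling of the data term, the names `ista` and `lam`, and the toy data are assumptions of this example.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(x, D, lam, n_iter=200):
    """Approximately minimize 0.5 * ||x - D c||_2^2 + lam * ||c||_1 over c."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the smooth term
    c = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ c - x)   # gradient of the least-squares term
        c = soft_threshold(c - grad / L, lam / L)
    return c

# Toy example: a 16-D signal built from two atoms of a random 32-atom codebook.
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 32))
D /= np.linalg.norm(D, axis=0)     # unit-norm atoms
x = 2.0 * D[:, 3] - 1.5 * D[:, 10]
c = ista(x, D, lam=0.1)
```

Each descriptor needs many such iterations, and an image contributes a large number of descriptors, which is where the scalability problem arises.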
3 Fast Low Rank Representation Learning for Spatial Pyramid Matching
LRR seeks the low rank representation of a given data set. It can capture the relations among different subjects, thus providing better representation. LRR has been widely studied in image clustering Xiao2014 , semi-supervised classification Yang2014 , subspace learning Liu2011 , and so on.
In this work, we propose an approach, called Low Rank Representation based Spatial Pyramid Matching (LrrSPM), which uses the multiple-scale LRR of the SIFT descriptors as feature vectors to train and test the linear SVM classifier. Our method (see Figure 2 for the flow chart of the algorithm) is based on the fact that each subject consists of multiple images and each image consists of multiple local descriptors. The number of subjects is much smaller than the number of descriptors, and thus the representation of the descriptors is low rank. We aim to solve
where denotes the collection of the SIFT descriptors, denotes the representation of over the codebook , and generally consists of cluster centers.
Since the rank operator is nonconvex and discontinuous, we can use nuclear norm as a convex relaxation based on the theoretical result from Recht2010 . Moreover, since probably contains the errors (e.g., noises), we aim to solve
where , is the -th singular value of , and denotes the rank of .
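The relation between rank and its convex surrogate can be checked numerically. In this sketch the matrix `C` is a synthetic rank-2 matrix, and the helper name `nuclear_norm` and the tolerance used to count nonzero singular values are assumptions of the example.

```python
import numpy as np

def nuclear_norm(C):
    """Sum of the singular values of C."""
    return np.linalg.svd(C, compute_uv=False).sum()

rng = np.random.default_rng(0)
# A 6 x 8 matrix built as a product of thin factors, so its rank is 2.
C = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 8))

sigma = np.linalg.svd(C, compute_uv=False)
# Rank = number of singular values that are not numerically zero.
rank = int((sigma > 1e-10 * sigma[0]).sum())
nuc = nuclear_norm(C)
```

The rank counts the nonzero singular values, while the nuclear norm sums them; the latter is convex and continuous in the entries of the matrix, which is what makes it usable as a relaxation.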
The major difference between our coding method (5) and the existing LRR methods is the objective function. Liu2013 ; Liu2011 use the input as the codebook and perform encoding using a corrupted codebook, i.e., their constraint term is instead of . Different objective functions result in different optimization algorithms and results. We argue that a clean codebook would provide better representative ability.
To solve (5), the Augmented Lagrange Multiplier method (ALM) is adopted, which minimizes
where denotes Frobenius norm, is a balanced parameter, and is the Lagrange multiplier.
ALM solves (6) with respect to and in an iterative way. The optimization process involves variables, and thus it is inefficient in large-scale settings. Furthermore, (6) is an offline process: for any datum not included in , the above formulation cannot obtain its representation.
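One source of the per-iteration cost is easy to illustrate. ALM-type schemes for nuclear-norm problems commonly update the low-rank variable with singular value thresholding, which requires a full SVD at every iteration. The update rules of (6) are not reproduced here, so the operator below is a generic sketch and not the paper's solver; `svt` and `tau` are names chosen for this example.

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding: shrink every singular value of A by tau.

    This is the proximal operator of tau * (nuclear norm) evaluated at A.
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)  # singular values below tau become zero
    return (U * s_shrunk) @ Vt           # scale columns of U, then recombine

# Toy example: singular values 3.0, 1.0, 0.2 shrink to 2.5, 0.5, 0.0.
A = np.diag([3.0, 1.0, 0.2])
B = svt(A, 0.5)
```

Repeating an SVD of this kind inside an iterative loop is expensive for large matrices, which motivates a solver that avoids iterations altogether.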
To solve these two problems, we propose an approximate LRR method. The method is based on the equivalence theory between the Nuclear norm and the Frobenius norm given by Favaro2011 ; Peng2012 ; Zhang2014 . Zhang2014 proves that the Frobenius norm and the Nuclear norm based problems have the same unique solution in the error-free case (i.e., ). Peng2012 theoretically and experimentally shows that the Frobenius norm is equivalent to the truncated Nuclear norm Favaro2011 in the case of . In other words, one can obtain the lowest rank representation by solving a Frobenius norm based objective function. Hence, LrrSPM solves
The solution of (7) is given by . For the incremental datum , the corresponding code is . When , this solution is the desired LRR, which is also called Collaborative Representation (CR) Zhang2011 . In Zhang2011 ; Peng2014 ; Wei2014 , CR has been extensively investigated and has achieved considerable success in face recognition, palm recognition, and so on. In practice, however, probably contains various errors (i.e., ), which makes the solution of (7) not the lowest rank. To obtain the lowest rank representation in this case, we threshold the trivial entries of each based on the theoretical results of Favaro2011 ; Peng2012 .
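The closed-form coding step can be sketched as follows, under stated assumptions. The formula for (7) is not reproduced above, so this example assumes the standard collaborative-representation (ridge) form, minimizing ||X - DC||_F^2 + lam * ||C||_F^2, whose solution is C = (D^T D + lam I)^{-1} D^T X. The thresholding rule is also not recoverable from the text; `keep_largest`, which keeps the t largest-magnitude entries of each code, is a hypothetical stand-in. All names and toy sizes are assumptions.

```python
import numpy as np

def projection_matrix(D, lam):
    """P = (D^T D + lam * I)^{-1} D^T, computed once per codebook (assumed form)."""
    k = D.shape[1]
    return np.linalg.solve(D.T @ D + lam * np.eye(k), D.T)

def keep_largest(C, t):
    """Hypothetical thresholding: zero all but the t (t >= 1) largest-magnitude
    entries in each column of C."""
    C = C.copy()
    drop = np.argsort(np.abs(C), axis=0)[:-t, :]  # row indices of small entries
    np.put_along_axis(C, drop, 0.0, axis=0)
    return C

rng = np.random.default_rng(0)
D = rng.standard_normal((8, 5))    # toy codebook: dimension 8, 5 atoms
X = rng.standard_normal((8, 20))   # toy batch of 20 descriptors
lam = 0.7

P = projection_matrix(D, lam)
C = P @ X                          # batch coding: one matrix product
x_new = rng.standard_normal(8)
c_new = P @ x_new                  # online coding of a newly arrived descriptor
C_thr = keep_largest(C, 2)
```

Because `P` depends only on the codebook, coding a new descriptor costs a single matrix-vector product, which is what makes online and incremental use straightforward.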
denotes the scale or the level of the pyramid. For each block at each level, max pooling is performed via , where denotes the -th LRR vector belonging to the -th block, and .
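A minimal sketch of block-wise max pooling over a spatial pyramid is given below. Several details are assumptions because the pooling formula and the pyramid setting are not reproduced above: pooling is applied to absolute values of the codes, descriptor positions are normalized to the unit square, and the default grid sizes (1, 2, 4) follow the common 1×1, 2×2, 4×4 layout. The function name is hypothetical.

```python
import numpy as np

def spatial_pyramid_max_pool(codes, xy, levels=(1, 2, 4)):
    """Max-pool codes inside every block of every pyramid level.

    codes:  (n, k) array, one code per descriptor.
    xy:     (n, 2) array of descriptor positions normalized to [0, 1].
    levels: grid sizes g; level g splits the image into g x g blocks (assumed).
    Returns the concatenation of all pooled block vectors.
    """
    n, k = codes.shape
    pooled = []
    for g in levels:
        cell = np.minimum((xy * g).astype(int), g - 1)  # grid cell per descriptor
        block = cell[:, 1] * g + cell[:, 0]             # flattened block index
        for b in range(g * g):
            members = np.abs(codes[block == b])
            # An empty block contributes a zero vector.
            pooled.append(members.max(axis=0) if len(members) else np.zeros(k))
    return np.concatenate(pooled)

# Toy example: four descriptors, one in each quadrant, codes of length 3.
codes = np.array([[0.1, -0.9, 0.0],
                  [0.5, 0.2, 0.0],
                  [-0.3, 0.1, 0.7],
                  [0.0, 0.0, -0.2]])
xy = np.array([[0.1, 0.1], [0.9, 0.1], [0.1, 0.9], [0.9, 0.9]])
z = spatial_pyramid_max_pool(codes, xy, levels=(1, 2))
```

With levels (1, 2) the output has length 3 × (1 + 4) = 15; the resulting vector is what would be passed to the linear SVM.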
Algorithm 1 summarizes our algorithm. Similar to Lazebnik2006 ; Yang2009 , the codebook can be generated by the k-means clustering method or dictionary learning methods such as Gao2014TIP . For training or testing purpose, LrrSPM can get the low rank representation in an online way, which further explores the potential of LRR in online and incremental learning. Moreover, our method is very efficient since its coding process only involves a simple projection operation.
4.1 Baseline Algorithms and Databases
We implemented and evaluated four classes of SPM methods on seven image databases (the MATLAB code and the data sets used can be downloaded at http://goo.gl/sTSa6k). Besides our own implementations, we also quote some results directly from the literature.
The implemented methods include BOF Fei2005 with linear SVM (LinearBOF) and kernel SVM (KernelBOF), SPM Lazebnik2006 with linear SVM (LinearSPM) and kernel SVM (KernelSPM), Sparse Coding based SPM with linear SVM (ScSPM) Yang2009 , and Locality-constrained Linear Coding with linear SVM (LLC) Wang2010 .
The databases used include four scene image data sets, two object image data sets (i.e., COIL20 COIL20 and COIL100 COIL100 ), and one facial image database (i.e., Extended Yale B Georghiades2001 ). The scene image data sets are from Oliva and Torralba Oliva2001 , Fei-Fei and Perona Fei2005 , Lazebnik et al. Lazebnik2006 , and Fei-Fei et al. Fei2006 , which are referred to as OT, FP, LS, and Caltech101, respectively. Table 2 gives a brief overview of these data sets.
|Databases||Type||Data Size||Image Size|
|Extended Yale B Georghiades2001||face||2414||59–64||38|
4.2 Experimental setup
denotes the scale. We extract the SIFT descriptors from each block as features. To obtain the codebook, we use the k-means clustering algorithm to find 256 cluster centers for each data set and use the same codebook for all algorithms. In each test, we split the images of each subject into two parts, one for training and the other for testing. Following common benchmarking procedures, we repeat the test 5 times with different training and testing partitions and record the average per-subject recognition rate and the time cost of each run. We report the final results as the mean and standard deviation of the recognition rates and the time costs. For the LrrSPM approach, we fixed and assigned different for different databases. For the competing approaches, we followed the parameter configurations in Lazebnik2006 ; Wang2010 ; Yang2009 . Besides our own implementation, we also quote some state-of-the-art results directly from the literature.
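The evaluation protocol described above (a random per-subject split and the average of per-subject recognition rates) can be sketched as follows. The helper names, the toy labels, and the toy split sizes are assumptions for illustration; the classifier itself is omitted.

```python
import numpy as np

def split_per_class(labels, n_train, rng):
    """Randomly pick n_train samples of every class for training; rest is test."""
    train = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        train.extend(rng.choice(idx, size=n_train, replace=False))
    train = np.array(sorted(train))
    test = np.setdiff1d(np.arange(len(labels)), train)
    return train, test

def per_class_accuracy(y_true, y_pred):
    """Average of the recognition rates computed separately for each class."""
    classes = np.unique(y_true)
    return float(np.mean([(y_pred[y_true == c] == c).mean() for c in classes]))

# Toy check of the metric: class 0 is 2/3 correct, class 1 is fully correct.
y_true = np.array([0, 0, 0, 1])
y_pred = np.array([0, 0, 1, 1])
acc = per_class_accuracy(y_true, y_pred)

# Toy split: 3 classes with 10 samples each, 4 training samples per class.
labels = np.repeat(np.arange(3), 10)
train, test = split_per_class(labels, 4, np.random.default_rng(0))
```

Repeating the split with different seeds and collecting `acc` for each run gives the values from which a mean and standard deviation can be reported.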
4.3 Influence of the parameters
LrrSPM has two user-specified parameters: the regularization parameter, which is used to avoid overfitting, and the thresholding parameter, which is used to eliminate the effect of the errors. In this section, we investigate the influence of these two parameters on the OT data set. We fixed () and report the mean classification accuracy of LrrSPM for varying (). Figure 3 shows the results, from which one can see that LrrSPM is robust to the choice of the parameters. When increases from 0.2 to 2.0 with an interval of 0.1, the accuracy ranges from 83.68% to 85.63%; when increases from 0.5 to 1.0 with an interval of 0.02, the accuracy ranges from 84.07% to 86.03%.
4.4 Robustness with Respect to the Size of Codebook
In this section, we report the performance of the evaluated methods when the size of the codebook increases from to . We carried out the experiments on the Caltech101 data set by randomly selecting 30 samples per subject for training and using the rest for testing. The is set as for LrrSPM. Moreover, we directly quote some state-of-the-art results achieved in Lee2009; Kavukcuoglu2010; Zhang2014Learning. Table 3 shows the results, from which we can find that:
|Algorithms||Accuracy||Time Costs (seconds)|
|state of the art|
|Convolution DBN-1 Lee2009||-||-||60.50||-||-||-|
|Convolution DBN-2 Lee2009||-||-||65.40||-||-||-|
LrrSPM, ScSPM and LLC are superior to LinearBOF, KernelBOF, LinearSPM, and KernelSPM. LrrSPM achieves results comparable to those of ScSPM and LLC while consuming less time for coding and classification. For example, when the codebook includes bases (i.e., ), the recognition rate of LrrSPM is 28.78% higher than that of LinearBOF, 20.73% higher than that of KernelBOF, 22.36% higher than that of LinearSPM, 12.38% higher than that of KernelSPM, 0.5% higher than that of ScSPM and lower than that of LLC, whereas LrrSPM takes only about 3% (30%) of the CPU time of ScSPM (LLC).
With increasing , all evaluated methods achieve better recognition results and take more time for coding and classification. ScSPM, LLC, and LrrSPM use the SVM with a linear kernel. Therefore, they take less time to train and test the classifier than KernelBOF and KernelSPM. However, these three methods take more time to encode each SIFT descriptor than KernelBOF and KernelSPM.
We could not reproduce the results reported in the literature for some evaluated methods. A possible reason lies in subtle engineering details, e.g., Lazebnik2006 tested 50 rather than all images per subject, Wang2010 ; Yang2009 used a much larger codebook (), and the codebooks could well differ even when they have the same size.
4.5 Scene Classification
This section reports the performance of LrrSPM on three scene image databases. The codebook consists of bases identified by the k-means method. For each data set, we randomly chose 100 samples from each subject for training and used the rest for testing.
|Algorithms||the OT database||the FP database||the LS database|
|Accuracy (%)||Time (s)||Accuracy (%)||Time (s)||Accuracy (%)||Time (s)|
Table 4 shows that LrrSPM is slightly better than the other evaluated algorithms in most tests. Although LrrSPM is not the fastest method, it strikes a good balance between efficiency and classification rate. On the OT database, LrrSPM is about 5.49 and 46.07 times faster than ScSPM and LLC, respectively. On the LS database, the speedups are 5.59 and 50.26 times.
4.6 Object and Face Recognition
This section investigates the performance of LrrSPM on two object image data sets (i.e., COIL20 and COIL100) and one facial image database (i.e., Extended Yale Database B). To analyze the time costs of the examined methods, we also report the time costs of the methods for encoding the SIFT descriptors and for classifying the representation using a linear or nonlinear SVM.
|Algorithms||Training Images for Each Subject|
|Algorithms||Training Images for Each Subject|
|Algorithms||Training Images for Each Subject|
|Algorithms||COIL20||COIL100||Extended Yale B|
Tables 5–7 report the recognition rates of the tested approaches on COIL20, COIL100, and Extended Yale B, respectively. In most cases, our method achieves the best results and is followed by ScSPM and LLC. When 50 samples per subject of COIL20 and COIL100 are used for training the classifier, LrrSPM classifies all the remaining images into the correct categories. On the Extended Yale B, LrrSPM also classifies almost all the samples into the correct categories (the recognition rate is about 99.81%).
Table 8 shows the efficiency of the evaluated methods. One can find that LrrSPM, BOF, and SPM are obviously more efficient than ScSPM and LLC for the encoding and the classification. Specifically, the CPU time of LrrSPM is only about 2.35%–3.90% of that of ScSPM and about 5.99%–10.44% of that of LLC.
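The CPU-time ratios quoted above can be read as speedup factors by taking reciprocals. A quick check of the arithmetic, using only the percentages stated in the text:

```latex
\[
\text{vs. ScSPM:}\quad \frac{1}{0.0390} \approx 25.6, \qquad \frac{1}{0.0235} \approx 42.6;
\qquad
\text{vs. LLC:}\quad \frac{1}{0.1044} \approx 9.6, \qquad \frac{1}{0.0599} \approx 16.7 .
\]
```

In other words, on these three data sets LrrSPM is roughly 26 to 43 times faster than ScSPM and roughly 10 to 17 times faster than LLC.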
In this paper, we propose a spatial pyramid matching method based on the lowest rank representation (LRR) of the SIFT descriptors. The proposed method, named LrrSPM, is computationally very efficient while maintaining competitive accuracy on many data sets. LrrSPM formulates the quantization of the SIFT descriptors as a Nuclear norm optimization problem and utilizes the multiple-scale representation to characterize the statistical information of the image. The paper also introduces an approximation method to speed up the computation of LRR, which makes it possible for LRR to handle incremental and large-scale data. Experimental results on several well-known data sets show the good performance of LrrSPM.
- (1) N. Acosta-Mendoza, A. Gago-Alonso, J. E. Medina-Pagola, Frequent approximate subgraphs as features for graph-based image classification, Knowledge-Based Systems 27 (2012) 381–392.
- (2) J. Lu, G. Wang, P. Moulin, Image set classification using holistic multiple order statistics features and localized multi-kernel metric learning, in: IEEE International Conference on Computer Vision, 2013, pp. 329–336.
- (3) T. Joachims, A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization, Tech. rep., DTIC Document (1996).
- (4) D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.
- (5) J. Sivic, A. Zisserman, Video google: A text retrieval approach to object matching in videos, in: IEEE International Conference on Computer Vision, 2003, pp. 1470–1477.
- (6) G. Csurka, C. Dance, L. Fan, J. Willamowski, C. Bray, Visual categorization with bags of keypoints, in: Workshop on statistical learning in computer vision, ECCV, Vol. 1, 2004, pp. 1–2.
- (7) L. Fei-Fei, P. Perona, A Bayesian hierarchical model for learning natural scene categories, in: IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, 2005, pp. 524–531.
- (8) K. Grauman, T. Darrell, The pyramid match kernel: Discriminative classification with sets of image features, in: IEEE International Conference on Computer Vision, Vol. 2, 2005, pp. 1458–1465.
- (9) A. Bolovinou, I. Pratikakis, S. Perantonis, Bag of spatio-visual words for context inference in scene classification, Pattern Recognition 46 (3) (2013) 1039–1053.
- (10) S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, in: IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, 2006, pp. 2169–2178.
- (11) K. Yu, T. Zhang, Y. Gong, Nonlinear learning using local coordinate coding, in: Advances in neural information processing systems, 2009, pp. 2223–2231.
- (12) J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, Locality-constrained linear coding for image classification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3360–3367.
- (13) K. Yu, Y. Lin, J. Lafferty, Learning image representations from the pixel level via hierarchical sparse coding, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 1713–1720.
- (14) S. Gao, I. W.-H. Tsang, L.-T. Chia, Sparse representation with kernels, IEEE Transactions on Image Processing 22 (2) (2013) 423–434.
- (15) L. Zhou, Z. Zhou, D. Hu, Scene classification using a multi-resolution bag-of-features model, Pattern Recognition 46 (1) (2013) 424–433.
- (16) L. Fei-Fei, R. Fergus, P. Perona, One-shot learning of object categories, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (4) (2006) 594–611.
- (17) J. Yang, K. Yu, Y. Gong, T. Huang, Linear spatial pyramid matching using sparse coding for image classification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1794–1801.
- (18) D. G. Lowe, Distinctive image features from scale-invariant keypoints, International journal of computer vision 60 (2) (2004) 91–110.
- (19) G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, Y. Ma, Robust recovery of subspace structures by low-rank representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1) (2013) 171–184.
- (20) P. Favaro, R. Vidal, A. Ravichandran, A closed form solution to robust subspace estimation and clustering, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 1801–1807.
- (21) X. Peng, L. Zhang, Z. Yi, Constructing l2-graph for subspace learning and segmentation, arXiv preprint arXiv:1209.0841.
- (22) H. Zhang, Z. Yi, X. Peng, fLRR: fast low-rank representation using frobenius-norm, Electronics Letters 50 (13) (2014) 936–938.
- (23) S. Xiao, M. Tan, D. Xu, Weighted block-sparse low rank representation for face clustering in videos, in: European Conference on Computer Vision, 2014, pp. 123–138.
- (24) S. Yang, Z. Feng, Y. Ren, H. Liu, L. Jiao, Semi-supervised classification via kernel low-rank representation graph, Knowledge-Based Systems, doi:http://dx.doi.org/10.1016/j.knosys.2014.06.007.
- (25) G. Liu, S. Yan, Latent low-rank representation for subspace segmentation and feature extraction, in: IEEE International Conference on Computer Vision, 2011, pp. 1615–1622.
- (26) B. Recht, M. Fazel, P. Parrilo, Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization, SIAM Review 52 (3) (2010) 471–501.
- (27) D. Zhang, M. Yang, X. Feng, Sparse representation or collaborative representation: Which helps face recognition?, in: IEEE International Conference on Computer Vision, 2011, pp. 471–478.
- (28) X. Peng, L. Zhang, Z. Yi, K. K. Tan, Learning locality-constrained collaborative representation for robust face recognition, Pattern Recognition 47 (9) (2014) 2794–2806.
- (29) L. Wei, F. Xu, J. Yin, A. Wu, Kernel locality-constrained collaborative representation based discriminant analysis, Knowledge-Based Systems, doi:http://dx.doi.org/10.1016/j.knosys.2014.06.027.
- (30) S. Gao, I.-H. Tsang, Y. Ma, Learning category-specific dictionary and shared dictionary for fine-grained image categorization, IEEE Transactions on Image Processing 23 (2) (2014) 623–634.
- (31) S. A. Nene, S. K. Nayar, H. Murase, et al., Columbia object image library (coil-20), Tech. rep., Technical Report CUCS-005-96 (1996).
- (32) S. K. Nayar, S. A. Nene, H. Murase, Columbia object image library (coil 100), Department of Comp. Science, Columbia University, Tech. Rep. CUCS-006-96.
- (33) A. S. Georghiades, P. N. Belhumeur, D. J. Kriegman, From few to many: Illumination cone models for face recognition under variable lighting and pose, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (6) (2001) 643–660.
- (34) A. Oliva, A. Torralba, Modeling the shape of the scene: A holistic representation of the spatial envelope, International journal of computer vision 42 (3) (2001) 145–175.