1 Introduction
Image classification has attracted considerable attention in the computer vision and pattern recognition communities in recent years. It is one of the most fundamental yet challenging vision problems because images, as illustrated in Fig. 1, often suffer from significant scale, view or illumination variations (e.g., in texture classification [8] and material recognition [22]), as well as pose changes, background clutter and partial occlusion (e.g., in scene categorization [30, 31] and object recognition [17, 18, 21, 47]).
For a long time, the bag-of-features (BoF) model [40] has been the dominant approach to image classification. As shown in Fig. 2 (a), BoF-based methods generally consist of five components: extracting local features, learning a codebook from training data, coding local features with the pre-trained codebook, pooling or aggregating codes over images, and finally learning a classifier (e.g., SVM). With this processing pipeline, BoF-based methods can be seen as a hand-crafted five-layer hierarchical feed-forward network [43] with a pre-trained feature coding template (codebook) [7]. The learned codebook depicts the distribution of the feature space and makes coding of high-dimensional features possible. This architecture has achieved very promising performance in a variety of image classification tasks.
The codebook, as a reference for feature coding, serves as a bridge between local features and the global image representation. However, it is well known that the partitioning of feature space involved in building a codebook introduces quantization error [6], and much effort has been devoted to reducing this side effect (e.g., soft coding methods [39, 45] alleviate it but cannot completely eliminate it). Training a codebook, particularly a large one, is time-consuming even though it is done offline. In addition, a codebook pre-trained on one database generally cannot adapt naturally to other databases [52].
An alternative approach is to estimate statistics directly on sets of local features from input images [10, 35, 44], as illustrated in Fig. 2 (b); we call this the codebookless model (CLM) in this paper. Fig. 2 makes the major difference clear: the BoF model learns a codebook to explore the statistical distribution of local features and then codes the descriptors, whereas the CLM represents images with descriptors directly, requiring neither a pre-trained codebook nor the subsequent coding. Conceptually, the codebookless model has the potential to circumvent the aforementioned limitations of the BoF model; however, it has received little attention in the image classification community. The main reasons may be that such methods have not yet shown competitive classification performance, and that they often rely on inefficient and unscalable kernel-based classifiers.
In this paper, we propose an effective CLM scheme, and argue that the CLM can be a competitive alternative to the BoF methods for image classification. A comparison between a state-of-the-art BoF method, Fisher Vector (FV) [39], and our CLM on various image databases is shown in Fig. 1. First, we extract a set of local features (e.g., SIFT [34]) on a dense grid of the image, and model them with a single Gaussian to represent the input image. Then, we employ a two-step metric for matching Gaussian models. With this metric, Gaussian models can be fed to a linear classifier, which ensures efficient and scalable classification while respecting the Riemannian geometry of Gaussian models. Moreover, we introduce two well-motivated parameters into the metric: one balances the effect of the mean and the covariance of the Gaussian, and the other performs eigenvalue power normalization on the covariance.
Our codebookless model is usually of high dimension. By incorporating low-rank learning into the SVM, we propose a joint learning method that effectively compresses Gaussian models while respecting their Riemannian geometry. To the best of our knowledge, this is the first attempt to jointly learn a low-rank transformation and an SVM on the Gaussian manifold. Finally, to alleviate the side effect of background clutter, we propose a saliency-based partial background removal method to enhance our CLM. The experimental results show that partial background removal helps the CLM when images are heavily cluttered (e.g., CUB-200-2011 and Pascal VOC2007).
2 Related work
The codebookless model, which directly models the statistics of local features, has been studied for decades. Rubner et al. [38] introduced signatures for image representation, and proposed the Earth Mover's Distance for image matching, which is robust but computationally expensive. Tuzel et al. [44] were the first to use covariance matrices to represent regular image regions, and employed the affine-Riemannian metric, which suffers from high computational cost [36]. The Gaussian model has been used as an image descriptor for visual tracking [19], in which Gaussian models are matched based on the Riemannian metric, involving expensive operations to solve a generalized eigenvalue problem. Going beyond a single Gaussian, the Gaussian mixture model (GMM) is more informative and has been used in image retrieval [3]. However, the GMM has some limitations, such as the high computational cost of its matching methods and the lack of general criteria for model selection.
Our work is motivated by [9, 10] and [35]. Carreira et al. [9, 10] modeled the free-form regions obtained by image segmentation by estimating their second-order moments. By using the Log-Euclidean metric [2], the method in [9, 10] can be combined with a linear classifier, and has shown competitive recognition performance on images with little background clutter (e.g., Caltech101 [18]). Different from [9, 10], we employ a Gaussian model to represent the whole image. It is well known that a covariance matrix can be seen as a Gaussian model with a fixed mean vector. Compared to [9, 10], our CLM contains both first-order (mean) and second-order (covariance) information. Note that first-order statistics have proven important in image classification [25, 39]. Moreover, the manifold of Gaussian models and that of covariance matrices are quite different, and the embedding method in our CLM allows Gaussian models to be handled flexibly and conveniently.
Nakayama et al. [35] also represented an image with a global Gaussian for scene categorization. However, they matched two Gaussian models by using the Kullback-Leibler (KL) divergence, and hence kernel-based classifiers have to be used. This method is not scalable and has high computational cost. In contrast to [35], our metric is decoupled, which allows it to be combined with a linear classifier and makes our method more efficient and scalable than the KL-kernel-based one in [35]. Moreover, compared with the ad-hoc linear kernel (Euclidean baseline) in [35], our method takes advantage of the geometric structure of Gaussian models and brings a large performance improvement.
There is another line of research on codebookless methods. Grauman and Darrell [20] proposed a pyramid match kernel that maps feature sets to multi-resolution histograms, and employed the histogram intersection kernel for classification. Bo and Sminchisescu [5] presented efficient match kernels that map local features into a low-dimensional space, and adopted a linear classifier. Boiman et al. [6] developed an image-to-class distance between sets of local features, and employed a nearest-neighbor classifier. Yao et al. [50] proposed a codebook-free approach that uses a large number of randomly generated image templates for image representation, and developed a bagging-based classifier.
3 Proposed method
We first introduce the image representation by a single Gaussian model. Then, we employ an effective and efficient two-step metric for matching Gaussian models, and introduce two well-motivated parameters to improve the distance metric. Finally, we present a method for jointly learning a low-rank transformation and an SVM on the Gaussian manifold.
3.1 Gaussian model for image representation
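Given an input image, we extract a set of $k$-dimensional local features on a dense grid. By the maximum likelihood method, the image can be represented by the following Gaussian model:
$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)}{\sqrt{(2\pi)^{k}\,|\boldsymbol{\Sigma}|}},$$
where $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are the mean vector and covariance matrix, and $|\cdot|$ denotes the matrix determinant. Compared with histograms and covariance matrices, the Gaussian model is more informative. Meanwhile, unlike matching of signatures [38] or GMMs [3], matching of Gaussian models does not incur high computational cost.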
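The maximum likelihood fit described above can be sketched as follows. This is a minimal illustration, assuming the local descriptors of one image are stacked in an (N, k) NumPy array; the function name `fit_gaussian` and the small diagonal regularizer `eps` are illustrative choices and not values taken from this paper.

```python
import numpy as np

def fit_gaussian(features, eps=1e-3):
    """Maximum likelihood Gaussian for an (N, k) array of local descriptors.

    eps is a hypothetical regularizer added to the covariance diagonal
    to keep the estimate positive definite.
    """
    X = np.asarray(features, dtype=float)
    mu = X.mean(axis=0)                          # ML mean vector
    centered = X - mu
    sigma = centered.T @ centered / X.shape[0]   # ML covariance (divide by N)
    sigma = sigma + eps * np.eye(X.shape[1])     # assumed regularization
    return mu, sigma
```

The returned pair (mean, covariance) is all that is kept per image or region; no codebook is involved.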
3.2 Two-step metric between Gaussian models
To match Gaussian models, we exploit a two-step metric that was proposed to compute the ground distance between Gaussian components of GMMs [32]. The first step embeds the Gaussian manifold into the space of SPD matrices [33]; the second maps the Lie group of SPD matrices into its corresponding Lie algebra, a linear space, by using the Log-Euclidean metric [2].
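The space of $k$-dimensional Gaussian models is a Riemannian manifold. Let $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ be a Gaussian model with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. Through a continuous function $\pi$, $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is mapped to an affine matrix $\mathbf{A}$, an element of the affine group; that is,
$$\pi:\ \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \mapsto \mathbf{A} = \begin{bmatrix} \mathbf{L} & \boldsymbol{\mu} \\ \mathbf{0}^{T} & 1 \end{bmatrix}, \tag{1}$$
where $\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}^{T}$ is the Cholesky factorization of $\boldsymbol{\Sigma}$. Further, through the function $\gamma$, $\mathbf{A}$ is mapped to an SPD matrix $\mathbf{P} = \mathbf{A}\mathbf{A}^{T}$. Thus, by the successive functions $\pi$ and $\gamma$, $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is uniquely designated as a $(k+1) \times (k+1)$ SPD matrix
$$\mathbf{P} = \begin{bmatrix} \boldsymbol{\Sigma} + \boldsymbol{\mu}\boldsymbol{\mu}^{T} & \boldsymbol{\mu} \\ \boldsymbol{\mu}^{T} & 1 \end{bmatrix}. \tag{2}$$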
Please refer to [33] for details on the embedding process.
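The embedding step can be illustrated with a short sketch. It assumes the Gaussian with mean μ and covariance Σ is first mapped to the affine matrix A = [L μ; 0ᵀ 1], with Σ = LLᵀ the Cholesky factorization, and then to the SPD matrix AAᵀ; the function name `embed_gaussian` is hypothetical.

```python
import numpy as np

def embed_gaussian(mu, sigma):
    """Embed a k-dimensional Gaussian as a (k+1) x (k+1) SPD matrix.

    Assumed construction: A = [[L, mu], [0, 1]] with sigma = L @ L.T,
    and the embedding is A @ A.T.
    """
    mu = np.asarray(mu, dtype=float)
    k = mu.shape[0]
    L = np.linalg.cholesky(sigma)      # lower-triangular Cholesky factor
    A = np.zeros((k + 1, k + 1))
    A[:k, :k] = L
    A[:k, k] = mu
    A[k, k] = 1.0
    return A @ A.T                     # [[sigma + mu mu^T, mu], [mu^T, 1]]
```

Under this assumption the product has the block form [Σ + μμᵀ, μ; μᵀ, 1], so the Cholesky factor never needs to be stored.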
The space of SPD matrices is a Lie group that forms a Riemannian manifold. Two operations, namely the logarithmic multiplication and the scalar logarithmic multiplication, are defined in the Log-Euclidean metric [2]; they equip the space of SPD matrices with the structure of not only a Lie group but also a vector space. Through the matrix logarithm $\log(\cdot)$, the space of SPD matrices is mapped into its Lie algebra, the vector space of symmetric matrices. The matrix logarithm is a diffeomorphism and an isomorphism, so that operations over SPD matrices can be replaced by Euclidean operations on their counterparts in the vector space. Hence, through the matrix logarithm, an SPD matrix is mapped one-to-one to a symmetric matrix that lies in a linear space, and the geodesic distance between SPD matrices $\mathbf{P}_1$ and $\mathbf{P}_2$ is defined by $d(\mathbf{P}_1, \mathbf{P}_2) = \|\log(\mathbf{P}_1) - \log(\mathbf{P}_2)\|_F$, where $\|\cdot\|_F$ is the Frobenius norm.
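A minimal sketch of the Log-Euclidean distance between two SPD matrices follows. It computes the matrix logarithm through a symmetric eigendecomposition, which is valid for SPD inputs; the function names `spd_log` and `log_euclidean_distance` are illustrative.

```python
import numpy as np

def spd_log(P):
    """Matrix logarithm of an SPD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(P)            # P = V diag(w) V^T with w > 0
    return (V * np.log(w)) @ V.T        # V diag(log w) V^T

def log_euclidean_distance(P1, P2):
    """Frobenius norm of the difference of the matrix logarithms."""
    return np.linalg.norm(spd_log(P1) - spd_log(P2), ord="fro")
```

Because each matrix is mapped to its logarithm independently, the logarithms can be computed once per image and then compared with ordinary Euclidean operations.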
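3.3 Two well-motivated parameters
In practice, we found that it is important to balance the mean vector and the covariance matrix in the embedding matrix (2), because their dimensions and the order of magnitude of each dimension may vary considerably. Moreover, the relative importance of the mean vector and the covariance matrix may differ across tasks. With these considerations, we introduce a parameter $\beta$ into the function (1):
$$\mathbf{A}(\beta) = \begin{bmatrix} \mathbf{L} & \beta\boldsymbol{\mu} \\ \mathbf{0}^{T} & 1 \end{bmatrix}, \tag{3}$$
where $\boldsymbol{\mu}$ is the mean vector and $\mathbf{L}$ is the Cholesky factor of the covariance matrix $\boldsymbol{\Sigma}$. Accordingly, the embedding matrix has the following form:
$$\mathbf{P}(\beta) = \begin{bmatrix} \boldsymbol{\Sigma} + \beta^{2}\boldsymbol{\mu}\boldsymbol{\mu}^{T} & \beta\boldsymbol{\mu} \\ \beta\boldsymbol{\mu}^{T} & 1 \end{bmatrix}. \tag{4}$$
The embedding matrix (4) reduces to the covariance matrix when $\beta = 0$, and is equal to the original one when $\beta = 1$. Hence, the roles of the mean vector and the covariance matrix can be adjusted by $\beta$.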
The maximum likelihood estimator of the covariance matrix is susceptible to noise, especially in high-dimensional spaces [15]. Based on the observation that the maximum likelihood estimator of covariance can be improved by eigenvalue shrinkage [42], we apply power normalization to the eigenvalues of the covariance matrix (EPN). Let $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ be a Gaussian model estimated from a set of descriptors extracted from an image. The covariance matrix has the eigenvalue decomposition $\boldsymbol{\Sigma} = \mathbf{U}\,\mathrm{diag}(\lambda_i)\,\mathbf{U}^{T}$, where $\mathbf{U}$ is an orthonormal matrix whose $i$-th column is an eigenvector of $\boldsymbol{\Sigma}$, $\lambda_i$ is the corresponding eigenvalue, and $\mathrm{diag}(\cdot)$ denotes a diagonal matrix. Then, by introducing a parameter $\rho$, our normalization is defined as
$$\boldsymbol{\Sigma}^{\rho} = \mathbf{U}\,\mathrm{diag}(\lambda_i^{\rho})\,\mathbf{U}^{T}. \tag{5}$$
With EPN and the balancing parameter $\beta$ on the mean vector $\boldsymbol{\mu}$, our final embedding matrix is:
$$\mathbf{P}(\beta, \rho) = \begin{bmatrix} \boldsymbol{\Sigma}^{\rho} + \beta^{2}\boldsymbol{\mu}\boldsymbol{\mu}^{T} & \beta\boldsymbol{\mu} \\ \beta\boldsymbol{\mu}^{T} & 1 \end{bmatrix}. \tag{6}$$
It is easy to prove that the embedding matrix (6) is still an SPD matrix. Eigenvalue power normalization has been proposed to measure distances between covariance matrices [16, 24] or tensors [29], namely the Power-Euclidean metric. Different from previous work, we use eigenvalue power normalization for robust estimation of covariance matrices in the Gaussian setting with high-dimensional features, and compare Gaussians by using the Gaussian embedding and the Log-Euclidean metric.
According to the Log-Euclidean framework, the matrix $\mathbf{P}(\beta, \rho)$ can be further embedded into a linear space by the matrix logarithm:
$$\mathbf{G}(\beta, \rho) = \log\big(\mathbf{P}(\beta, \rho)\big). \tag{7}$$
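The following sketch combines eigenvalue power normalization with the mean-balancing parameter. It assumes the final embedding has the block form [Σ^ρ + β²μμᵀ, βμ; βμᵀ, 1], where Σ^ρ raises each eigenvalue of Σ to the power ρ; the names `power_normalize`, `embed_gaussian_epn`, `beta` and `rho` are chosen here for illustration.

```python
import numpy as np

def power_normalize(sigma, rho):
    """Raise the eigenvalues of an SPD matrix to the power rho (EPN)."""
    w, U = np.linalg.eigh(sigma)         # sigma = U diag(w) U^T, w > 0 assumed
    return (U * w**rho) @ U.T

def embed_gaussian_epn(mu, sigma, beta=1.0, rho=1.0):
    """Assumed final embedding with mean weight beta and EPN exponent rho."""
    mu = np.asarray(mu, dtype=float)
    k = mu.shape[0]
    P = np.empty((k + 1, k + 1))
    P[:k, :k] = power_normalize(sigma, rho) + beta**2 * np.outer(mu, mu)
    P[:k, k] = beta * mu
    P[k, :k] = beta * mu
    P[k, k] = 1.0
    return P
```

With `beta=0` the off-diagonal blocks vanish and only the (normalized) covariance remains; with `beta=1` and `rho=1` the unweighted embedding is recovered. The Schur complement of the bottom-right entry is Σ^ρ, which is why the result stays positive definite.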
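Let $\mathcal{N}_a$ and $\mathcal{N}_b$ be two Gaussian models, and let $\mathbf{G}_a$ and $\mathbf{G}_b$ be their corresponding symmetric matrices obtained by (7). The distance between the two Gaussian models is
$$d(\mathcal{N}_a, \mathcal{N}_b) = \|\mathbf{G}_a - \mathbf{G}_b\|_F. \tag{8}$$
The distance (8) is decoupled, so $\mathbf{G}_a$ and $\mathbf{G}_b$ can be computed separately and used in a linear classifier. For notational simplicity, we omit the parameters $\beta$ and $\rho$ in the distance measure (8).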
3.4 Joint low-rank learning and SVM classifier
Our CLM is usually of high dimension (). To suppress redundant and noisy information while reducing computational and storage costs, we propose a low-rank learning method to compact our CLM. The matrix in the geodesic distance (8) is a symmetric matrix which lies in Euclidean space. Due to its symmetry, we can unfold the upper triangular part of to a vector of size . We can modify the geodesic distance (8) by introducing a low-rank transformation matrix :
(9) 
where and are the unfolding vectors of two Gaussian models and , respectively.
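The unfolding and the low-rank projection can be sketched as below. Two assumptions are made that the text does not confirm: off-diagonal entries are scaled by √2 so that the Euclidean norm of the unfolded vector equals the Frobenius norm of the symmetric matrix, and the low-rank distance is taken as the Euclidean norm of the projected difference. The names `unfold_symmetric`, `low_rank_distance` and `W` are hypothetical.

```python
import numpy as np

def unfold_symmetric(G):
    """Unfold the upper triangle of a symmetric n x n matrix into a vector
    of length n(n+1)/2. Off-diagonal entries are scaled by sqrt(2)
    (an assumption) so that Euclidean and Frobenius norms agree."""
    n = G.shape[0]
    rows, cols = np.triu_indices(n)
    scale = np.where(rows == cols, 1.0, np.sqrt(2.0))
    return scale * G[rows, cols]

def low_rank_distance(g_a, g_b, W):
    """Distance after projecting with a (d, r) matrix W, r < d (assumed form)."""
    return np.linalg.norm(W.T @ (g_a - g_b))
```

With the √2 scaling, a linear classifier on the unfolded vectors operates on the same geometry as the matrix distance, while storing roughly half as many entries.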
Recent studies [26, 49] have shown that jointly optimizing dimensionality reduction and the classifier performs better than optimizing the two modules separately. Thus, given training samples , we optimize the low-rank learning jointly with a linear SVM (LRSVM):
(10)  
where are parameters of SVM, and is the label of . Dimensionality reduction for SPD matrices has been studied in [23], where dimensionality reduction and classification are performed separately; our method is quite different in that we focus on Gaussian models and jointly learn the low-rank transformation and the SVM.
In practice, we extend the objective function (10) to the multi-class problem under the spatial pyramid matching (SPM) framework [30]. Given an image , we can obtain its SPM representation , where is the number of blocks in SPM, which is fed to a one-vs-all SVM for solving the classes problem. As suggested in [26], we optimize the dual problem of the objective function (10) under the SPM framework:
(11)  
where indicates all training features, and is the diagonal label matrix of the th class with diagonal element .
The problem (3.4) is non-convex and can be optimized by a two-step alternating method. Step one: fixing , we optimize the Lagrange parameters with an off-the-shelf SVM. Step two: for fixed , we solve the following trace maximization problem:
(12)  
We optimize the problem (12) by independently solving each with a closed-form solution [26]. Because the problem (3.4) is non-convex, initialization matters both for reaching a good local optimum and for fast convergence. In this paper, we use the basis of principal component analysis (PCA) as initialization, and we find that it always achieves good performance and fast convergence.
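The PCA initialization mentioned above can be sketched as follows. This is only an illustration of computing a PCA basis from training vectors, not the alternating optimization itself; it assumes the unfolded training vectors are stacked in an (n, d) array, and `pca_basis` is a hypothetical helper name.

```python
import numpy as np

def pca_basis(X, r):
    """Return a (d, r) matrix whose orthonormal columns are the top-r
    principal directions of the rows of X."""
    Xc = X - X.mean(axis=0)                          # center the data
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:r].T
```

The returned matrix can serve as the starting point for the low-rank transformation before alternating with the SVM step.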
4 Partial background removal (PBR)
We then present a simple yet effective method for analyzing and handling the side effect of background clutter, based on unsupervised, bottom-up saliency detection. Our purpose here is to remove the interference of the background, which differs from the goal of precise foreground localization pursued in the saliency detection community. Our method consists of two steps: coarse foreground detection and partial background removal. In the first step, we localize the foreground in the image with the saliency detection method of [27], and then determine the bounding box surrounding the foreground. Next, we adaptively expand the bounding box to accommodate some background regions, based on the size and intensity variance of the area inside the bounding box. Then, the area outside the bounding box is removed before recognition. Our method is based on two considerations: accurate foreground detection is currently very difficult, and the regions neighboring an object can serve as context and may be helpful for recognition. In our experiments, we apply PBR to the two datasets with heavy background clutter: CUB-200-2011 and VOC2007. Since PBR is designed for foreground objects with separable background clutter, we do not perform PBR on images with little background clutter or on scene images, where both foreground and background are valuable for scene understanding.
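The two PBR steps can be sketched roughly as follows, given a precomputed saliency map. The sketch makes simplifying assumptions that are not the rule used in this paper: the foreground is obtained by thresholding the saliency map at a fraction of its maximum, and the bounding box is expanded by a fixed ratio rather than adaptively from size and intensity variance. The function name and the parameters `threshold` and `expand` are hypothetical.

```python
import numpy as np

def partial_background_removal(image, saliency, threshold=0.5, expand=0.1):
    """Crop `image` to an expanded bounding box around salient pixels.

    saliency: non-negative (H, W) map aligned with the image.
    threshold, expand: hypothetical parameters (fixed-ratio expansion).
    """
    mask = saliency >= threshold * saliency.max()    # coarse foreground
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    top, bottom = rows[0], rows[-1] + 1
    left, right = cols[0], cols[-1] + 1
    # Expand the box to keep some surrounding context, clipped to the image.
    pad_y = int(round(expand * (bottom - top)))
    pad_x = int(round(expand * (right - left)))
    h, w = saliency.shape
    top, bottom = max(0, top - pad_y), min(h, bottom + pad_y)
    left, right = max(0, left - pad_x), min(w, right + pad_x)
    return image[top:bottom, left:right]
```

Everything outside the expanded box is discarded, so local descriptors are subsequently extracted only from the cropped region.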
5 Implementation details
We extract multi-scale SIFT descriptors [34] (the standard pipeline in the BoF model) with cell size , , and single-scale pixel-wise covariance descriptors [44] via a dense sampling strategy with a step length of 2. The dense covariance descriptors are computed from 17-dimensional raw features including intensity and four kinds of first-order and second-order gradients from [37]. We apply the matrix logarithm to the covariance descriptors (LogCov), which are then vectorized. The SIFT features are computed with the VLFeat library [46]. Moreover, following [9, 10], we also extract additional image cues, including color, location, scale, gradient and entropy, and concatenate them with SIFT and LogCov. To ensure that there is sufficient data to estimate Gaussian models and that covariance matrices are positive definite, we limit the minimum width or height of images to be larger than 64, and add to the diagonal entries of covariance matrices, respectively. We employ the spatial pyramid strategy [30], which divides an image into regular regions (e.g., , , , ). For each region we compute a Gaussian model, and then concatenate them to represent the whole image. Each Gaussian is weighted by , where and are the number of pyramid levels and regions in the layer, respectively. We implement a one-vs-all SVM with LIBSVM [11] and set parameter to on VOC2007 and on all the other databases. All algorithms are written in Matlab, and run on a PC equipped with an i7-4770K CPU and 32 GB of RAM.
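The spatial pyramid step, which assigns each densely sampled descriptor to a regular region before a Gaussian is fitted per region, might be sketched as below. The grid sizes are left to the caller because the partitions used in this paper are not reproduced here; the function name `pyramid_cell_index` and its arguments are hypothetical.

```python
import numpy as np

def pyramid_cell_index(xy, width, height, grid):
    """Map (x, y) descriptor locations to region indices of a regular grid.

    xy: (N, 2) array of pixel coordinates; grid: (n_rows, n_cols).
    Regions are numbered row by row, starting at 0.
    """
    xy = np.asarray(xy, dtype=float)
    n_rows, n_cols = grid
    r = np.minimum((xy[:, 1] * n_rows / height).astype(int), n_rows - 1)
    c = np.minimum((xy[:, 0] * n_cols / width).astype(int), n_cols - 1)
    return r * n_cols + c
```

Descriptors sharing an index would then be modeled by one Gaussian, and the per-region representations concatenated into the image-level vector.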
6 Experimental evaluation
In this section, we evaluate the classification performance of our CLM on eight benchmark databases. First, we analyze the local features, the parameters of our method, the proposed low-rank learning method and the partial background removal method on the challenging CUB-200-2011 [47]. Then, we compare with state-of-the-art methods on Caltech101 [18], Caltech256 [21], KTH-TIPS-2b [8], Flickr Material Database (FMD) [22], Pascal VOC2007 [17], Scene15 [30] and Sports8 [31]. Finally, we analyze the computational complexity of our CLM.
6.1 Parameters analysis
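Table 1: Classification accuracy (Acc.) on CUB-200-2011 of the covariance (Cov.) and Gaussian (Gau.) models, by local descriptor (ST, eST, LC, eLC), parameters (Beta, EPN) and background removal (BR: PBR, GT).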
Local descriptors. Four kinds of local descriptors, SIFT (ST) and its enrichment (eST), and LogCov (LC) and its enrichment (eLC), are evaluated in this section. The results of our CLM with various local descriptors on CUB-200-2011 are shown in Table 1. The Gaussian model used in our method outperforms the covariance matrix by or higher with either SIFT or eSIFT, which, we believe, indicates that the first-order (mean) information is non-trivial. We use eST to evaluate the other parameters in what follows.
Two well-motivated parameters. The proposed EPN (5) is a generic method for robust estimation of covariance in high-dimensional spaces. We set parameter in EPN (5) as in all databases. From Table 1, we can see that EPN brings a performance gain over the corresponding method without EPN. The embedding parameter $\beta$ in (6) balances the effect of the mean vector and the covariance matrix. To test its effect, we determine the optimal value of $\beta$ via cross-validation. The performance of our CLM with various $\beta$ is illustrated in Fig. 3 (left). Compared to $\beta = 0$ (covariance matrix only [9, 10]) and $\beta = 1$ (the embedding in [33]), appropriate balancing at achieves and gains, respectively.
LRSVM. To evaluate the proposed LRSVM method, we compare it with unsupervised principal component analysis (PCA) and supervised partial least squares (PLS) [1] under different compression ratios. LRSVM is initialized by PCA, and the results on CUB-200-2011 are illustrated in Fig. 3 (right). LRSVM always performs better than PLS, and is superior to PCA by a large margin. Unlike PLS, which uses the least-squares loss, LRSVM uses the hinge loss. We attribute the improvement to the joint learning of dimensionality reduction and the classifier. Note that, with a larger compression ratio, LRSVM achieves a larger improvement over PCA and PLS. Meanwhile, the proposed LRSVM has insignificant performance loss (less than ) with a large compression ratio (). We can also see that LRSVM slightly improves the performance of our CLM when compression ratios are smaller (), which we attribute to LRSVM suppressing some noisy information. In general, we set the compression ratio as to balance efficiency and effectiveness.
Impact of PBR. We apply PBR to CUB-200-2011 and the results are presented in Table 1. The method using PBR achieves large gains (more than ) over the one without PBR. Note that we achieve about gain on VOC2007 by using PBR, which suggests that our PBR is a general method for handling background in the CLM. The gains achieved by using the ground-truth (GT) bounding box indicate that more advanced background removal methods could further improve the recognition performance of our CLM. Compared with the improvement on CUB-200-2011, the gains on VOC2007 are relatively small. The main reasons are that the saliency-based methods fail to precisely locate the foreground in such challenging databases, and that CUB-200-2011 contains only one object per image whereas an image in VOC2007 may contain multiple objects. PBR cannot segment an image into multiple objects, so multi-object images heavily influence the performance of the CLM.
Database  Classes  Images in total  Training/Test  Measurement  Scale  View  Illumination  Pose  Bg Clutter  Occlusion
CUB-200-2011 [47]  200  11,788  Split in [47]  Acc. of split
Caltech101 [18]  102  9,144  30/remaining per class  Acc. of 5 runs
Caltech256 [21]  256  30,607  30/remaining per class  Acc. of 5 runs
Sports8 [31]  8  1,792  70/60 per class  Acc. of 5 runs
KTH-TIPS-2b [8]  11  4,752  Split in [13]  Acc. of splits
FMD [22]  10  1,000  50/50 per class  Acc. of 5 runs
VOC2007 [17]  20  9,963  Split in [17]  mAP of split
Scene15 [30]  15  4,485  100/remaining per class  Acc. of 5 runs
6.2 Comparison with state-of-the-art methods
We compare our CLM with more than ten state-of-the-art methods on eight widely used benchmarks. The descriptions and experimental setups of these benchmarks are listed in Table 2. We report the results in Table 3, and discuss them as follows.
Comparison of various local descriptors. We combine our CLM with four kinds of local descriptors, and assess them on all databases. From Table 3 we can see that SIFT and LogCov achieve comparable results overall. For object recognition, LogCov is superior to SIFT on CUB-200-2011 and VOC2007, while SIFT outperforms LogCov on Caltech101 and Caltech256. For scene categorization, SIFT and LogCov obtain similar performance on both Sports8 and Scene15. For texture and material classification, SIFT achieves gains over LogCov on KTH-TIPS-2b, while LogCov is superior to SIFT by a large margin on FMD. eSIFT and eLogCov follow the same trends as SIFT and LogCov, respectively. The enrichment of SIFT and LogCov considerably boosts the performance of our CLM, which encourages us to use more informative descriptors for further improvement.
Comparison with counterparts. Here, we compare our CLM with its counterparts, O2P [10], Global Gaussian (GG) [35] and NBNN [6]. As shown in Tables 1 and 3, our CLM significantly outperforms O2P [10] on CUB-200-2011 and Caltech101, and is also superior by a large margin to its variant with sparse quantization (SQ-O2P) [7] on Caltech101 and VOC2007, which is mainly due to the appropriate use of mean information and EPN. Moreover, our CLM performs much better than the GG methods [35] with the ad-hoc linear kernel (ad-linear), the center tangent linear kernel (ct-linear) and the KL divergence on Sports8 and Scene15. The ad-linear kernel can be seen as a baseline in Euclidean space. Note that the methods in [35] use probabilistic discriminant analysis (PDA) as the classifier. If SVM is used, their results will drop to , and on Sports8, and , and on Scene15, respectively. We attribute the gains of our CLM over [35] to the use of the two-step metric with the proposed well-motivated parameters. We also compare our CLM with NBNN [6]. Our CLM performs much better than NBNN on Caltech101 and Caltech256. The main differences between our CLM and NBNN are that our CLM employs an effective model-to-model distance and an SVM classifier.
Comparison with FV. We make a comprehensive comparison with one state-of-the-art BoF method, FV [39], on all databases, and also apply enriched SIFT (eSIFT) to FV. On all databases except FMD, our CLM achieves better than or comparable performance to FV when SIFT or eSIFT is used. On FMD, with SIFT or eSIFT, our CLM is inferior to FV, but with LogCov or eLogCov, our CLM is much better than FV. In our experiments, we find that LogCov and eLogCov are not very suitable for FV, so the corresponding results are not reported. We also find that our CLM is more sensitive to local descriptors than FV: eSIFT brings little or no gain for FV, while our CLM greatly benefits from the enrichment of SIFT or LogCov.
Comparison with other state-of-the-art methods. Some recent results are also presented for comparison. On Caltech101, DeCAF [14] with a 6-layer CNN and the dropout strategy [41] slightly outperforms our CLM. Without dropout, the result of DeCAF drops to . On Caltech256, our CLM outperforms the deep architecture Multipath Hierarchical Matching Pursuit (MHMP) [4] by . Cimpoi et al. [13] achieved state-of-the-art results on KTH-TIPS-2b and FMD with semantic attributes, which are trained on an additional database, combined with FV [39] and DeCAF [14]. Our CLM is superior to each of the methods with attributes, FV and DeCAF individually. By combining attribute features, FV and DeCAF, Cimpoi et al. [13] obtained and accuracy on KTH-TIPS-2b and FMD. Kobayashi [28] proposed a histogram transformation method, which achieves state-of-the-art results on Sports8 and VOC2007.
Summary. In this paper, we assess our CLM on eight image benchmarks, as shown in Table 2, which contain various transformations or noisy factors. We claim that (1) the results on Caltech101 and Caltech256 show that our CLM can deal well with location and pose variations of objects; (2) the results on FMD and KTH-TIPS-2b show that our CLM is robust to scale, viewpoint, illumination and appearance variations; (3) the results on Sports8 and Scene15 indicate that our CLM can classify scene images with a certain amount of background clutter; and (4) the results on CUB-200-2011 and VOC2007 demonstrate that our CLM can also handle images with complex surroundings, such as heavy background clutter and occlusion.
6.3 Computational complexity analysis
Our CLM for classification mainly consists of three components: extracting local descriptors, computing Gaussian models using Eq. (4) followed by EPN (5) and the matrix logarithm in Eq. (8), and learning the LRSVM for classification. Most of the computational cost of the CLM lies in the eigenvalue decompositions required by EPN and the matrix logarithm. Their computational complexities are and , respectively, where is the dimension of local descriptors. During joint training of the low-rank matrix and the SVM classifier, optimizing the objective function (3.4) consists of alternating between an SVM minimization problem and a trace minimization problem, whose complexity is , where is the number of training samples of dimension , and is the number of iterations, which is less than in our experiments.
Here, we report empirical running times, taking KTH-TIPS-2b and Caltech101 as examples. Computing the image representations, which includes extracting SIFT at multiple scales and computing Gaussian models and embedding matrices, takes 30 minutes on KTH-TIPS-2b and 1.5 hours on Caltech101. Modeling one image takes about 0.4 seconds and 0.6 seconds on average on the respective databases. For each trial, training (resp. testing) of the LRSVM takes 20 s (resp. 2 s) on KTH-TIPS-2b and 7 min (resp. 40 s) on Caltech101.
7 Discussion and conclusion
The bag-of-features (BoF) model is a popular method in classification and recognition, and has demonstrated convincing performance in many computer vision tasks in the past years. It might seem that codebook training and descriptor coding are indispensable ingredients. However, the codebookless model (CLM) proposed in this work has proven to be an effective alternative to the BoF methods for image classification. Below we discuss why the CLM shows such competitive performance.
Unlike the BoF methods, our CLM leverages continuous functions for statistical modeling of local descriptors; it needs no codebook and thus introduces no quantization error. Recent research [12] showed that high dimensionality can bring impressive performance. State-of-the-art BoF methods such as SV/VLAD or FV have inherently high dimensionality, which, in our opinion, is the key to characterizing the distinctiveness and discriminativeness of individual images as well as image categories. Our CLM directly employs the first- and second-order statistics of high-dimensional local descriptors, giving rise to informative image-level models of high dimensionality as well. In this respect, it is worthwhile to study more informative or higher-dimensional CLMs. Moreover, as shown in [9, 10], the CLM is more efficient than the BoF methods for modeling images because codebook learning and coding are not necessary. In addition, the CLM may be more suitable for tasks where the datasets are regularly updated or enlarged, in which case the codebook in the BoF model would have to be regularly adjusted to fit the changing data.
The contributions of this paper are summarized as follows. (1) Our work has shown that the CLM is a very competitive alternative to the mainstream BoF model. We hope our work can raise interest in the classification (or retrieval) community and pave the way for future research. (2) Our method enables Gaussian models to be successfully combined with a linear SVM classifier, which makes our method scalable and efficient. The key is that we embed Gaussian models into a vector space, which also allows us to perform joint low-rank learning and SVM on the Gaussian manifold. Meanwhile, the two proposed well-motivated parameters further improve our CLM. (3) We performed extensive experiments, evaluating various aspects of our CLM and comparing it with its counterparts as well as state-of-the-art methods. The comprehensive experiments demonstrated the promising performance of our CLM.
References
 [1] J. Arenas-García, K. B. Petersen, and L. K. Hansen. Sparse kernel orthonormalized PLS for feature extraction in large data sets. In NIPS, 2006.
 [2] V. Arsigny, P. Fillard, X. Pennec, and N. Ayache. Fast and simple calculus on tensors in the Log-Euclidean framework. In MICCAI, 2005.
 [3] C. Beecks, A. M. Zimmer, S. Kirchhoff, and T. Seidl. Modeling image similarity by Gaussian mixture models and the signature quadratic form distance. In ICCV, 2011.
 [4] L. Bo, X. Ren, and D. Fox. Multipath sparse coding using hierarchical matching pursuit. In CVPR, 2013.
 [5] L. Bo and C. Sminchisescu. Efficient match kernel between sets of features for visual recognition. In NIPS, 2009.
 [6] O. Boiman, E. Shechtman, and M. Irani. In defense of nearest-neighbor based image classification. In CVPR, 2008.
 [7] X. Boix, G. Roig, S. Diether, and L. V. Gool. Self-adaptable templates for feature coding. In NIPS, 2014.
 [8] B. Caputo, E. Hayman, and P. Mallikarjuna. Class-specific material categorisation. In ICCV, 2005.
 [9] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Semantic Segmentation with Second-Order Pooling. In ECCV, 2012.
 [10] J. Carreira, R. Caseiro, J. Batista, and C. Sminchisescu. Free-Form Region Description with Second-Order Pooling. TPAMI, PP:1, 2014.
 [11] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM TIST, 2(3):27, 2011.
 [12] D. Chen, X. Cao, F. Wen, and J. Sun. Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification. In CVPR, 2013.
 [13] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In CVPR, 2014.
 [14] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In ICML, 2014.
 [15] D. L. Donoho, M. Gavish, and I. M. Johnstone. Optimal shrinkage of eigenvalues in the spiked covariance model. arXiv, 1311.0851, 2014.
 [16] L. Dryden, A. Koloydenko, and D. Zhou. Non-Euclidean statistics for covariance matrices, with applications to diffusion tensor imaging. Annals of Applied Statistics, 2009.
 [17] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The Pascal Visual Object Classes (VOC) Challenge. IJCV, 88(2):303–338, 2010.
 [18] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. TPAMI, 28(4):594–611, 2006.
 [19] L. Gong, T. Wang, and F. Liu. Shape of gaussians as feature descriptors. In CVPR, 2009.
 [20] K. Grauman and T. Darrell. The pyramid match kernel: Discriminative classification with sets of image features. In ICCV, 2005.
 [21] G. Griffin, A. Holub, and P. Perona. The Caltech-256. Technical report, California Institute of Technology, 2007.
 [22] L. Sharan, R. Rosenholtz, and E. H. Adelson. Material perception: What can you see in a brief glance? Jour. of Vis., 9(8):784, 2009.
 [23] M. T. Harandi, M. Salzmann, and R. Hartley. From manifold to manifold: Geometry-aware dimensionality reduction for spd matrices. In ECCV, 2014.
 [24] S. Jayasumana, R. Hartley, M. Salzmann, H. Li, and M. Harandi. Kernel methods on the riemannian manifold of symmetric positive definite matrices. In CVPR, 2013.
 [25] H. Jégou, M. Douze, C. Schmid, and P. Pérez. Aggregating local descriptors into a compact image representation. In CVPR, 2010.
 [26] S. Ji and J. Ye. Linear dimensionality reduction for multi-label classification. In IJCAI, 2009.
 [27] B. Jiang, L. Zhang, H. Lu, C. Yang, and M.-H. Yang. Saliency detection via absorbing markov chain. In ICCV, 2013.
 [28] T. Kobayashi. Dirichlet-based histogram feature transform for image classification. In CVPR, 2014.
 [29] P. Koniusz, F. Yan, P.-H. Gosselin, and K. Mikolajczyk. Higher-order Occurrence Pooling on Mid- and Low-level Features: Visual Concept Detection. Technical report, 2013.
 [30] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, 2006.
 [31] L.-J. Li and F.-F. Li. What, where and who? classifying events by scene and object recognition. In ICCV, 2007.
 [32] P. Li, Q. Wang, and L. Zhang. A novel earth mover’s distance methodology for image matching with gaussian mixture models. In ICCV, 2013.
 [33] M. Lovric, M. Min-Oo, and E. A. Ruh. Multivariate normal distributions parametrized as a riemannian symmetric space. JMVA, 74(1):36–48, 2000.
 [34] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
 [35] H. Nakayama, T. Harada, and Y. Kuniyoshi. Global gaussian approach for scene categorization using information geometry. In CVPR, 2010.
 [36] X. Pennec, P. Fillard, and N. Ayache. A riemannian framework for tensor computing. IJCV, pages 41–66, 2006.
 [37] W. K. Pratt. Digital Image Processing, 4th Edition. John Wiley & Sons, Inc., New York, NY, USA, 2007.
 [38] Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth Mover’s Distance as a metric for image retrieval. IJCV, 40(2):99–121, 2000.
 [39] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. IJCV, 105(3):222–245, 2013.
 [40] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In ICCV, 2003.
 [41] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958, 2014.
 [42] C. Stein. Lectures on the theory of estimation of many parameters. Jour. of Math. Sci., 34(1):1373–1403, 1986.
 [43] V. Sydorov, M. Sakurada, and C. H. Lampert. Deep fisher kernels - end to end learning of the Fisher kernel GMM parameters. In CVPR, 2014.
 [44] O. Tuzel, F. Porikli, and P. Meer. Region covariance: A fast descriptor for detection and classification. In ECCV, 2006.
 [45] J. van Gemert, C. J. Veenman, A. W. M. Smeulders, and J.-M. Geusebroek. Visual word ambiguity. TPAMI, 32(7):1271–1283, 2010.
 [46] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.
 [47] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical report, 2011.
 [48] J. Wang, J. Yang, K. Yu, F. Lv, T. S. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In CVPR, 2010.
 [49] J. Weston, S. Bengio, and N. Usunier. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, 2011.
 [50] B. Yao, G. Bradski, and L. Fei-Fei. A codebook-free and annotation-free approach for fine-grained image categorization. In CVPR, 2012.
 [51] N. Zhang, R. Farrell, and T. Darrell. Pose pooling kernels for subcategory recognition. In CVPR, 2012.
 [52] W. Zhou, M. Yang, H. Li, X. Wang, Y. Lin, and Q. Tian. Towards codebook-free: Scalable cascaded hashing for mobile image search. TMM, 16(3):601–611, 2014.
 [53] X. Zhou, K. Yu, T. Zhang, and T. S. Huang. Image classification using super-vector coding of local image descriptors. In ECCV, 2010.