1 Introduction
In many computer vision applications, we are often interested in transforming our initial feature representation into a new representation that is better suited to the application under consideration. For example, if we are interested in classification, we may want to transform the representation such that samples from the same class are close to each other and samples from different classes are far away from each other. If we are interested in retrieving images using a text query, we may want to transform our image representation such that it is highly correlated with the corresponding text representation.
If we plan to use a linear transformation, then we are interested in learning a transformation matrix $W$ such that the transformed representation $W^\top x$ has certain desired properties depending on the application of interest. Though various algorithms (aiming at different applications) have been proposed in the past for learning the transformation matrix $W$, many of them end up solving a ratio-trace problem [40, 37] (equation (2)), whose optimal solution can be obtained using generalized eigenvalue decomposition (GEVD) [10]. Algorithms based on ratio-trace problems have been extensively used in various computer vision applications [40, 8, 3, 32, 5, 17, 7, 19, 30]. Some of the popular algorithms formulated as a ratio-trace problem (equation (2)) are linear discriminant analysis (LDA) [10], semi-supervised discriminant analysis (SDA) [5], side-information based LDA (SILDA) [17], local discriminant embedding (LDE) [7], marginal Fisher analysis (MFA) [40], locality preserving projections (LPP) [15], neighborhood preserving embedding (NPE) [14], canonical correlation analysis (CCA) [13], and orthonormal PLS-SB [31].
All the above mentioned linear algorithms suffer from two main disadvantages: (i) They require the input data to be represented in the form of feature vectors in a Euclidean space. This may not be possible in applications where the data of interest is represented using bag-of-features [22], matrices [34] or manifold features [36]. In some applications, we may only have similarities or distances between the features instead of explicit representations. (ii) Linear transformations may be too simple to be effective in some applications, as they cannot handle the non-linearity present in the data. Both of these issues can be handled by using kernels. Kernelized versions of these linear algorithms also end up solving a ratio-trace problem (equation (3)).
Though kernel-based methods have been successfully used in many computer vision applications, the kernel function and the associated feature space are central choices that are generally made by the user. Recently, automatic selection or combination of kernels (or features) based on multiple kernel learning (MKL) approaches has been shown to produce state-of-the-art results in various applications [35, 11, 42]. Multiple kernel learning was initially proposed [21] for SVMs and has since received significant attention [33, 29, 20]. An excellent overview of various MKL algorithms can be found in [12].
Though MKL has been extensively studied in the context of SVMs, it is relatively less explored for ratio-trace problems. Motivated by MKL-SVM, Kim et al. [18] and Ye et al. [41] extended the MKL approach to LDA (which is a specific instance of the ratio-trace problem (2)), formulating it as a convex optimization problem. Arguing for non-sparse MKL, Yan et al. [39] proposed a non-sparse version of MKL-LDA, which imposes a general $\ell_p$-norm regularization on the kernel weights.
Motivated by MKL-LDA, Lin et al. [23] extended the MKL approach to the graph embedding framework [40], which covers a large number of dimensionality reduction algorithms. The MKL-DR framework of [23] is based on the trace-ratio formulation, which is different from the ratio-trace formulation used in this paper. We refer the readers to [37] for a discussion on the differences between the trace-ratio and ratio-trace formulations. The trace-ratio based MKL-DR formulation of [23] results in a non-convex optimization problem, and the authors of [23] used an iterative optimization procedure that has no convergence guarantees.
In this paper, we show that, similar to MKL-LDA [41], kernel learning can be formulated as a convex optimization problem for a large class of ratio-trace problems (equations (2) and (3)) that includes popular algorithms like LDA, SDA, SILDA, LDE, NPE, MFA, LPP, CCA and orthonormal PLS-SB. We also provide an optimization procedure that is guaranteed to converge to the global optimum of the proposed convex optimization problem.
In practice, MKL is typically used in two different ways: (i) various kernels are defined for the same feature representation, for example Gaussian kernels with different bandwidth values, and an optimal kernel is learned as a combination of these kernels [21, 41]; (ii) multiple feature descriptors are used to represent the objects of interest and a similarity kernel is generated from each feature [11, 23]. In this case, kernel learning effectively solves the feature selection problem, and the MKL coefficients can be interpreted as weights given to the corresponding features. One can also use a mixed strategy [42] of using multiple features and defining multiple kernels for each feature. In this paper, we use MKL for feature selection in the context of ratio-trace problems.
Contributions: 1) We show that MKL can be formulated as a convex optimization problem for a large class of ratio-trace problems. The proposed MKL-RT formulation is applicable to various popular algorithms like LDA, SDA, SILDA, LDE, NPE, MFA, LPP, CCA and orthonormal PLS-SB. 2) We provide an optimization procedure that is guaranteed to converge to the global optimum of the proposed optimization problem. 3) We experimentally show that the proposed MKL-RT approach can be successfully used to select features for discriminative dimensionality reduction and cross-modal retrieval. 4) We show that the proposed ratio-trace based convex MKL-RT approach performs better than the trace-ratio based non-convex MKL-DR approach of [23].
Organization: Section 2 presents the general class of ratio-trace problems for which MKL can be formulated as a convex optimization problem. Section 3 discusses some specific instances of the ratio-trace problem which will be used in our experimental evaluation. Section 4 presents the proposed convex MKL-RT formulation. We present our experimental results in section 5 and conclude the paper in section 6.
Notations: We use $\mathbb{1}[\cdot]$ to denote the indicator function. The transpose of a matrix $A$ is denoted by $A^\top$. We use $I$ to denote an identity matrix of appropriate size, $|\cdot|$ to denote the absolute value, and $\emptyset$ to denote the empty set.

2 Ratio-trace Problem
Definition 1: For any two symmetric positive semi-definite matrices $C_1$ and $C_2$, the ratio-trace problem is defined as

$$\operatorname*{argmax}_{W}\ \operatorname{trace}\left[\left(W^\top C_2 W\right)^{-1} W^\top C_1 W\right]. \qquad (1)$$
In this paper, we focus on the class of algorithms that learn the data transformation matrix $W$ by solving the following ratio-trace problem:

$$\operatorname*{argmax}_{W}\ \operatorname{trace}\left[\left(W^\top \left(X B X^\top + \mu I\right) W\right)^{-1} W^\top X A X^\top W\right], \qquad (2)$$

where $X = [x_1, \ldots, x_N]$ is the data matrix (assumed to be centered), $\mu > 0$ is a regularization parameter used to prevent overfitting, and $A$ and $B$ are (algorithm-dependent) symmetric positive semi-definite matrices.
Some of the popular algorithms that fall into this class are LDA, SDA, SILDA, LDE, MFA, LPP, NPE, CCA and orthonormal PLS-SB. All these linear algorithms can be made non-linear by using kernels. For a given kernel function $k$, the kernelized versions of these algorithms solve the following ratio-trace problem:

$$\operatorname*{argmax}_{F}\ \operatorname{trace}\left[\left(F^\top \left(K B K + \mu K\right) F\right)^{-1} F^\top K A K F\right], \qquad (3)$$

where $K$ is the kernel matrix with $K_{ij} = k(x_i, x_j)$. The optimal solution $F^*$ to (3) is given by the generalized eigenvectors [10] corresponding to the non-zero generalized eigenvalues of the matrix pair $\left(K A K,\ K B K + \mu K\right)$:

$$K A K\, f = \lambda \left(K B K + \mu K\right) f. \qquad (4)$$

Once $F^*$ is obtained, the new (non-linearly transformed) representation for a data sample $x$ can be computed using $z = F^{*\top} \kappa(x)$, where $\kappa(x) = [k(x_1, x), \ldots, k(x_N, x)]^\top$.
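In practice, (3) can be solved with an off-the-shelf symmetric generalized eigensolver. The following is a minimal Python sketch of this step, assuming precomputed $K$, $A$ and $B$; the function names and the small jitter added for numerical stability are our own choices, not part of the paper.

```python
import numpy as np
from scipy.linalg import eigh

def solve_ratio_trace(K, A, B, mu, dim):
    """Solve the kernelized ratio-trace problem (3) via the GEVD (4).

    K    : (N, N) precomputed kernel matrix
    A, B : (N, N) algorithm-dependent PSD matrices
    mu   : regularization parameter
    dim  : number of projection directions to keep
    """
    S1 = K @ A @ K                       # "numerator" matrix
    S2 = K @ B @ K + mu * K              # regularized "denominator" matrix
    S2 = S2 + 1e-8 * np.eye(K.shape[0])  # jitter keeps S2 positive definite
    # eigh solves S1 f = lambda S2 f; eigenvalues come back in
    # ascending order, so keep the last `dim` directions.
    evals, evecs = eigh(S1, S2)
    return evecs[:, ::-1][:, :dim], evals[::-1][:dim]

def transform(F, K_test):
    """Map test samples to the learned space; K_test[i, j] = k(x_i, x_test_j)."""
    return F.T @ K_test
```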
3 Instances of the Ratio-trace Problem
As mentioned earlier, various algorithms [40, 5, 17, 7, 15, 14, 13, 31] used in many computer vision applications are formulated as a ratio-trace problem. In this section, we briefly discuss three specific instances of the ratio-trace problem which will be used in the experimental evaluation of the proposed MKL-RT approach in section 5.
3.1 KFDA: Kernel Fisher Discriminant Analysis
KFDA [2] is a popularly-used non-linear discriminative dimensionality reduction algorithm. Let $\{(x_i, y_i)\}_{i=1}^{N}$ be the set of labeled training samples, where $y_i \in \{1, \ldots, C\}$ is the class label of feature $x_i$. Let $e_c \in \{0, 1\}^{N}$, with $e_c(i) = \mathbb{1}[y_i = c]$, be the membership vector corresponding to class $c$, and let $N_c$ be the number of labeled samples in class $c$. Let $k$ be a kernel function defined on the features and $K$ be the corresponding kernel matrix. Then, KFDA solves the ratio-trace problem (3) with

$$A = \sum_{c=1}^{C} \frac{1}{N_c}\, e_c e_c^\top, \qquad B = I. \qquad (5)$$

The lower dimensional representation for a data sample $x$ can be computed using $z = F^{*\top} \kappa(x)$, where $\kappa(x) = [k(x_1, x), \ldots, k(x_N, x)]^\top$.
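As a concrete illustration, the matrices in (5) can be assembled in a few lines and passed to the solve_ratio_trace sketch from section 2. This is a minimal sketch assuming the construction reconstructed in (5); kfda_graph_matrices is a hypothetical helper name.

```python
import numpy as np

def kfda_graph_matrices(labels):
    """Build the PSD matrices A and B of equation (5) for KFDA.

    A aggregates the normalized class-membership outer products;
    B is the identity (total scatter of centered data).
    """
    labels = np.asarray(labels)
    N = labels.shape[0]
    A = np.zeros((N, N))
    for c in np.unique(labels):
        e_c = (labels == c).astype(float)    # membership vector of class c
        A += np.outer(e_c, e_c) / e_c.sum()  # weighted by 1 / N_c
    return A, np.eye(N)
```

Combined with the earlier sketch, `F, _ = solve_ratio_trace(K, *kfda_graph_matrices(y), mu, C - 1)` gives the KFDA projection.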
3.2 KCCA: Kernel Canonical Correlation Analysis
KCCA [13] is a popular approach used in cross-modal retrieval applications. KCCA maps the data (non-linearly) from two different modalities to a common lower dimensional latent/concept space in which the two modalities are highly correlated. Let $\{(x_i, y_i)\}_{i=1}^{N}$ be the training data pairs, where $x_i$ and $y_i$ are samples from the first and second modalities respectively. Let $k_x$ and $k_y$ be kernel functions defined on the features $x$ and $y$ respectively, and let $K_x$ and $K_y$ be the corresponding kernel matrices. Then, KCCA solves the ratio-trace problem (3) with

$$K = \begin{bmatrix} K_x & 0 \\ 0 & K_y \end{bmatrix}, \qquad A = \begin{bmatrix} 0 & I \\ I & 0 \end{bmatrix}, \qquad B = I. \qquad (6)$$

The latent space representations for samples $x$ and $y$ from the first and second modalities respectively are given by $\Lambda F_x^{*\top} \kappa_x(x)$ and $\Lambda F_y^{*\top} \kappa_y(y)$, where

$$F^* = \begin{bmatrix} F_x^* \\ F_y^* \end{bmatrix}, \qquad \kappa_x(x) = [k_x(x_1, x), \ldots, k_x(x_N, x)]^\top, \qquad \kappa_y(y) = [k_y(y_1, y), \ldots, k_y(y_N, y)]^\top. \qquad (7)$$

Here, $\Lambda$ is the diagonal matrix of non-zero generalized eigenvalues given by (4).
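Assembling the block matrices of (6) is mechanical; the sketch below (with hypothetical function names) produces inputs that plug directly into the generic ratio-trace solver sketched in section 2.

```python
import numpy as np

def kcca_matrices(Kx, Ky):
    """Assemble the block matrices K, A, B of equation (6) for KCCA."""
    N = Kx.shape[0]
    Z = np.zeros((N, N))
    I = np.eye(N)
    K = np.block([[Kx, Z], [Z, Ky]])  # block-diagonal kernel
    A = np.block([[Z, I], [I, Z]])    # cross-modality coupling
    B = np.eye(2 * N)
    return K, A, B
```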
3.3 LKCCA: Labeled Kernel Canonical Correlation Analysis
KCCA requires paired training data samples from the two modalities to learn the transformations from the initial feature spaces to the common latent space. Suppose that, instead of pairings, we are provided with class labels for the training data in the two modalities. We cannot directly use KCCA in this case. One simple way to handle this situation is to generate data pairs using the class labels and then use KCCA with the generated pairs. We refer to this extension of KCCA as labeled KCCA (LKCCA) in this paper.
Let $\{(x_i, l_i)\}_{i=1}^{N}$ be the labeled training samples from the first modality, with $l_i \in \{1, \ldots, C\}$ being the class label of feature $x_i$. Let $\{(y_j, m_j)\}_{j=1}^{M}$ be the labeled training samples from the second modality, with $m_j \in \{1, \ldots, C\}$ being the class label of feature $y_j$. Let $N_c$ and $M_c$ denote the number of training samples from class $c$ in the first and second modalities respectively. Let $k_x$ and $k_y$ be kernel functions defined on the two modalities, and let $K_x$ and $K_y$ be the corresponding kernel matrices.
In LKCCA, we form a training pair between samples $x_i$ and $y_j$ if $l_i = m_j$. A straightforward way to implement this is to replicate each data sample as many times as the number of samples from the same class in the other modality. But this would unnecessarily increase the size of the kernel matrices $K_x$ and $K_y$. Instead, LKCCA can be efficiently implemented without actually replicating the samples. This efficient implementation of LKCCA solves the ratio-trace problem (3) with

$$K = \begin{bmatrix} K_x & 0 \\ 0 & K_y \end{bmatrix}, \qquad A = \begin{bmatrix} 0 & G \\ G^\top & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} D_x & 0 \\ 0 & D_y \end{bmatrix}, \qquad (8)$$

where $D_x$ is a diagonal matrix with $D_x(i, i) = M_{l_i}$, $D_y$ is a diagonal matrix with $D_y(j, j) = N_{m_j}$, and $G \in \{0, 1\}^{N \times M}$ with $G(i, j) = \mathbb{1}[l_i = m_j]$.
The latent space representations for samples $x$ and $y$ from the first and second modalities respectively are given by $\Lambda F_x^{*\top} \kappa_x(x)$ and $\Lambda F_y^{*\top} \kappa_y(y)$, where

$$F^* = \begin{bmatrix} F_x^* \\ F_y^* \end{bmatrix}, \qquad \kappa_x(x) = [k_x(x_1, x), \ldots, k_x(x_N, x)]^\top, \qquad \kappa_y(y) = [k_y(y_1, y), \ldots, k_y(y_M, y)]^\top. \qquad (9)$$

Here, $\Lambda$ is the diagonal matrix of non-zero generalized eigenvalues given by (4). We refer the readers to the supplementary material for further details about LKCCA.
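The replication-free construction sketched in (8) can be built directly from the label vectors. The following is a hypothetical sketch under our reading of (8), with the diagonal blocks obtained as row and column sums of the pairing matrix $G$.

```python
import numpy as np

def lkcca_matrices(Kx, Ky, labels_x, labels_y):
    """Assemble K, A, B for LKCCA as sketched in equation (8)."""
    lx = np.asarray(labels_x)
    ly = np.asarray(labels_y)
    G = (lx[:, None] == ly[None, :]).astype(float)  # G[i, j] = 1 iff l_i == m_j
    Dx = np.diag(G.sum(axis=1))  # number of cross-modal partners of each x_i
    Dy = np.diag(G.sum(axis=0))  # number of cross-modal partners of each y_j
    N, M = G.shape
    K = np.block([[Kx, np.zeros((N, M))], [np.zeros((M, N)), Ky]])
    A = np.block([[np.zeros((N, N)), G], [G.T, np.zeros((M, M))]])
    B = np.block([[Dx, np.zeros((N, M))], [np.zeros((M, N)), Dy]])
    return K, A, B
```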
4 MKL-RT: MKL for the Ratio-trace Problem
In the MKL framework, the kernel function is parametrized as a linear combination of $M$ predefined base kernel functions $k_1, \ldots, k_M$:

$$k(\cdot, \cdot) = \sum_{m=1}^{M} \eta_m\, k_m(\cdot, \cdot), \qquad \eta_m \geq 0, \qquad (10)$$

and the weights $\eta_m$ are learned from the data. Under this framework, MKL for the ratio-trace problem can be formulated as the following optimization problem:

$$\operatorname*{maximize}_{F,\ \eta}\ \operatorname{trace}\left[\left(F^\top \left(K_\eta B K_\eta + \mu K_\eta\right) F\right)^{-1} F^\top K_\eta A K_\eta F\right] \quad \text{subject to} \quad \eta_m \geq 0,\ \ \sum_{m=1}^{M} \eta_m = 1, \qquad (11)$$

where $K_\eta = \sum_{m=1}^{M} \eta_m K_m$ and $K_m$ is the kernel matrix corresponding to the base kernel function $k_m$.
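Forming the combined kernel matrix of (10) is a one-liner; this minimal sketch assumes precomputed base kernel matrices and weights satisfying the constraints of (11).

```python
import numpy as np

def combined_kernel(kernels, eta):
    """Combine base kernel matrices as in equation (10): K = sum_m eta_m K_m."""
    eta = np.asarray(eta, dtype=float)
    assert np.all(eta >= 0), "kernel weights must be non-negative"
    return sum(w * K for w, K in zip(eta, kernels))
```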
The optimization problem (11) is non-convex. Nevertheless, the optimal $\eta$ for (11) can be obtained by solving a different convex optimization problem, as stated in the following theorem.
Theorem 1: Let $C_1$ and $C_2$ be two symmetric positive semi-definite matrices with ranks $r_1$ and $r_2$ respectively, and let $\{(\lambda_i, u_i)\}_{i=1}^{r_1}$ and $\{(\sigma_j, v_j)\}_{j=1}^{r_2}$ be the non-zero eigenvalue-eigenvector pairs of $C_1$ and $C_2$ respectively. For $m = 1, \ldots, M$, let $S_m(\cdot)$ be the functions defined from these eigen-pairs as in

(12)

Let $(\eta^*, \theta^*)$ be optimal for the following convex optimization problem:

$$\operatorname*{maximize}_{\eta,\ \theta}\ \theta \quad \text{subject to} \quad \sum_{m=1}^{M} \eta_m = 1, \quad \eta_m \geq 0, \quad \sum_{m=1}^{M} \eta_m S_m(w) \geq \theta \ \ \text{for all}\ w. \qquad (13)$$

Then $\eta^*$ is optimal for the optimization problem (11).
Proof: Please refer to the supplementary material for the proof.
Note that the optimization problem (13) is a semi-infinite linear program (SILP). Following [33, 41], we use an iterative approach to solve (13). In each iteration, the optimal $\eta$ and $\theta$ are computed for a restricted subset of the constraints in (13). Constraints that are not satisfied by the current $\eta$ and $\theta$ are added successively to the restricted problem until all the constraints are satisfied. For faster convergence, in each iteration we add the constraint that maximizes the violation for the current $\eta$ and $\theta$. To find the maximum violating constraint, we solve

$$\operatorname*{minimize}_{w}\ \sum_{m=1}^{M} \eta_m S_m(w). \qquad (14)$$
Using the definition of $S_m$ from equation (12), it can be easily verified that the optimum for (14) can be obtained by individually solving for each $w_m$ using

(15)

where $w = [w_1^\top, \ldots, w_M^\top]^\top$. Note that the optimization problem (15) is an unconstrained quadratic program whose solution can be obtained by solving the following system of linear equations:

(16)
Table 1: Column generation algorithm for solving the SILP (13).

Input: Functions $S_m(\cdot)$ for $m = 1, \ldots, M$, and tolerance $\epsilon$.
Initialization: $t = 1$; $\eta_m^1 = 1/M$ for $m = 1, \ldots, M$; constraint set $\mathcal{C} = \emptyset$.
while true
    for $m = 1, \ldots, M$
        Compute $w_m^t$ by solving (16).
    end
    if the stopping criterion (17) is satisfied, break;
    else
        Add $w^t$ to the constraint set $\mathcal{C}$. Update $\eta$ and $\theta$ by solving the restricted
        version of (13) using only $\mathcal{C}$.
    end
    t = t + 1;
end
Output: $\eta^*$ and $\theta^*$.
Hence, in each iteration we solve $M$ linear systems to find the maximum violating constraint and one linear program to update $\eta$ and $\theta$. Following [33, 41], we use the stopping criterion

$$\left| 1 - \frac{\sum_{m=1}^{M} \eta_m^t\, S_m(w^t)}{\theta^t} \right| \leq \epsilon. \qquad (17)$$

Table 1 summarizes the algorithm used for solving (13). This iterative algorithm is referred to as the column generation technique and is guaranteed to converge [33, 41].
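As an illustration of the column generation loop of Table 1, here is a hypothetical Python skeleton. Since the exact form of the functions $S_m$ in (12) is instance-specific, they are abstracted behind two caller-supplied callables: argmin_S(eta), which solves (14) via the linear systems (16), and eval_S(w), which returns the vector $[S_1(w), \ldots, S_M(w)]$. The restricted master problem of (13) is solved with scipy's LP solver.

```python
import numpy as np
from scipy.optimize import linprog

def solve_restricted_lp(S_values):
    """Restricted version of (13): maximize theta over the simplex,
    subject to sum_m eta_m * S[j, m] >= theta for each stored constraint j.
    Decision variables: [eta_1, ..., eta_M, theta]."""
    S = np.asarray(S_values)                   # shape: (num_constraints, M)
    n_c, M = S.shape
    c = np.zeros(M + 1)
    c[-1] = -1.0                               # linprog minimizes, so min -theta
    A_ub = np.hstack([-S, np.ones((n_c, 1))])  # theta - eta . S_j <= 0
    b_ub = np.zeros(n_c)
    A_eq = np.zeros((1, M + 1))
    A_eq[0, :M] = 1.0                          # sum_m eta_m = 1
    bounds = [(0, None)] * M + [(None, None)]  # eta >= 0, theta free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0], bounds=bounds)
    return res.x[:M], res.x[-1]

def column_generation(argmin_S, eval_S, M, eps=1e-3, max_iter=100):
    """Sketch of the column generation loop of Table 1."""
    eta = np.full(M, 1.0 / M)                  # uniform initial weights
    theta = -np.inf
    constraints = []
    for _ in range(max_iter):
        w = argmin_S(eta)                      # maximally violated constraint (14)
        s = eval_S(w)
        if np.isfinite(theta) and abs(1.0 - eta @ s / theta) <= eps:
            break                              # stopping criterion (17)
        constraints.append(s)
        eta, theta = solve_restricted_lp(constraints)
    return eta, theta
```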
Table 2: The overall MKL-RT algorithm.

Input: Base kernel matrices $K_1, \ldots, K_M$, algorithm-specific matrices $A$ and $B$, and regularization parameter $\mu$.
Compute $(\lambda_i, u_i)$ for $i = 1, \ldots, r_1$ and $(\sigma_j, v_j)$ for $j = 1, \ldots, r_2$, where $(\lambda_i, u_i)$ and $(\sigma_j, v_j)$ are the non-zero eigenvalue-eigenvector pairs of the matrices $C_1$ and $C_2$ in Theorem 1.
Solve the SILP (13) to obtain $\eta^*$ using the algorithm summarized in Table 1.
Solve the ratio-trace problem (3) using GEVD with $K = \sum_{m=1}^{M} \eta_m^* K_m$ to get the optimal $F^*$.
The new non-linearly transformed representation for a data sample $x$ can be computed using $z = F^{*\top} \kappa(x)$, where $\kappa(x) = [k(x_1, x), \ldots, k(x_N, x)]^\top$ and $k = \sum_{m=1}^{M} \eta_m^* k_m$.

5 Experimental Evaluation
We evaluated the proposed convex MKL-RT approach using three different instances of the ratio-trace problem: KFDA, KCCA and LKCCA (explained in section 3), covering two different applications: discriminative dimensionality reduction (for classification) and cross-modal retrieval. We used the Caltech-101 [9] and Oxford flowers-17 [25] datasets for the discriminative dimensionality reduction experiments, and the Wikipedia articles [30] and Pascal VOC 2007 [16] datasets for the cross-modal retrieval experiments.
5.1 Comparative Methods
In all the experiments, we compare the proposed MKL-RT approach with the following methods:
- AK-RT (average kernel): We solve the ratio-trace problem (3) using the arithmetic mean kernel defined as $k = \frac{1}{M} \sum_{m=1}^{M} k_m$.
- PK-RT (product kernel): We solve the ratio-trace problem (3) using the geometric mean kernel defined as $k = \left(\prod_{m=1}^{M} k_m\right)^{1/M}$.
- BIK-RT (best individual kernel): We solve the ratio-trace problem (3) using the best kernel among $k_1, \ldots, k_M$.
In the case of the Caltech-101 and Oxford flowers-17 datasets, we use SVM and the nearest neighbor (NN) rule for classification after discriminative dimensionality reduction. Hence, for these datasets, we also compare the proposed MKL-RT approach with the following SVM and NN based approaches:
- Kernel SVM approaches: average kernel SVM (AK-SVM), product kernel SVM (PK-SVM), best individual kernel SVM (BIK-SVM) and MKL-SVM.
- Kernel NN approaches (without dimensionality reduction): average kernel NN (AK-NN), product kernel NN (PK-NN) and best individual kernel NN (BIK-NN). We computed the distances from the kernels using

$$d^2(x_i, x_j) = k(x_i, x_i) + k(x_j, x_j) - 2\, k(x_i, x_j). \qquad (18)$$
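The distances in (18) follow directly from the kernel matrix; a minimal sketch (function name ours):

```python
import numpy as np

def kernel_to_squared_distances(K):
    """Kernel-induced squared distances, following (18):
    d^2(x_i, x_j) = K_ii + K_jj - 2 K_ij."""
    diag = np.diag(K)
    D2 = diag[:, None] + diag[None, :] - 2.0 * K
    return np.maximum(D2, 0.0)  # clip tiny negatives from round-off
```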
5.2 KFDA for Discriminative Dimensionality Reduction
In these experiments, we first performed dimensionality reduction using KFDA and then used SVM and the NN rule for classification in the lower dimensional space. We used two different datasets, namely Caltech-101 [9] and Oxford flowers-17 [25]. The number of KFDA dimensions was chosen to be $C - 1$, where $C$ is the number of classes.
Caltech-101 [9] is a multi-class object recognition dataset with 101 object categories and 1 background category. The authors of [11] precomputed 39 different kernels for this dataset using various image features, and the kernel matrices are available online (http://files.is.tue.mpg.de/pgehler/projects/iccv09/). We used these 39 kernels in our experiments and followed the experimental setup of [11]. For brevity, we omit the details of the features and refer the readers to [11]. We report results using all 102 classes of Caltech-101. We repeated the experiment 5 times using different training and test splits and report the average results. The performance measure used is the mean prediction rate per class. We performed experiments using 5, 10, 15, 20, 25 and 30 training images per class and up to 50 test images per class. The regularization parameter $\mu$ was chosen based on cross-validation.
Table 3 shows the recognition rates of various approaches for this dataset. For AK-SVM, PK-SVM, BIK-SVM and MKL-SVM, we report the results from [11], which were obtained using the same kernel matrices and splits. We can make the following key observations from these results:
- Simple kernel NN approaches produce very poor results.
- Performing discriminative dimensionality reduction gives a huge improvement with the NN classifier (around 25-30%) and a moderate improvement with the SVM classifier (around 2%). This is expected, since the dimensionality reduction step brings samples from the same class close to each other and pushes samples from different classes far away from each other.
- Among the various KFDA approaches, the proposed convex MKL-RT approach works best with both NN and SVM classifiers.
- The non-convex MKL-DR approach of [23] performs poorly compared to the AK-RT, PK-RT and MKL-RT approaches. The standard deviation of the MKL-DR approach is very high when the number of training samples is low (around 4% for 5 training samples per class and 2.5% for 10 training samples per class). This suggests that the non-convex MKL-DR is overfitting the training data.
- The proposed MKL-RT approach performs better (around 2.5% on average) than MKL-SVM.
Table 3: Mean per-class recognition rates (in %) on the Caltech-101 dataset for 5, 10, 15, 20, 25 and 30 training images per class.

Method | 5 | 10 | 15 | 20 | 25 | 30
AK-NN | 27.3 ± 0.5 | 32.6 ± 0.6 | 36.1 ± 0.9 | 38.0 ± 1.1 | 39.9 ± 1.0 | 42.4 ± 1.5
PK-NN | 27.5 ± 0.7 | 32.8 ± 1.0 | 36.4 ± 1.1 | 38.3 ± 1.2 | 40.4 ± 1.1 | 42.9 ± 1.4
BIK-NN | 29.4 ± 0.9 | 35.5 ± 1.2 | 39.5 ± 0.7 | 41.3 ± 1.0 | 43.3 ± 0.8 | 45.7 ± 1.1
AK-RT-KFDA + NN | 46.0 ± 0.8 | 57.6 ± 0.6 | 64.2 ± 0.8 | 68.1 ± 1.2 | 70.8 ± 0.9 | 73.6 ± 1.1
PK-RT-KFDA + NN | 45.0 ± 0.6 | 56.1 ± 0.6 | 62.6 ± 0.6 | 66.7 ± 0.7 | 69.4 ± 0.9 | 72.4 ± 1.0
BIK-RT-KFDA + NN | 46.1 ± 0.6 | 54.8 ± 0.1 | 59.7 ± 0.5 | 63.0 ± 0.8 | 65.4 ± 0.7 | 67.8 ± 0.7
MKL-RT-KFDA + NN | 45.7 ± 0.6 | 58.6 ± 0.4 | 65.3 ± 0.8 | 69.5 ± 0.7 | 72.1 ± 0.4 | 74.6 ± 0.7
MKL-DR-KFDA + NN | 40.2 ± 4.0 | 53.6 ± 2.5 | 61.7 ± 1.4 | 65.5 ± 0.8 | 68.6 ± 0.9 | 72.1 ± 1.2
AK-SVM | 44.4 ± 0.6 | 55.7 ± 0.5 | 62.2 ± 1.1 | 66.1 ± 1.0 | 68.9 ± 1.0 | 71.6 ± 1.5
PK-SVM | 43.6 ± 0.7 | 54.7 ± 0.5 | 61.3 ± 0.9 | 65.4 ± 0.8 | 68.3 ± 0.7 | 71.3 ± 1.4
BIK-SVM | 46.1 ± 0.9 | 55.6 ± 0.5 | 61.0 ± 0.2 | 64.3 ± 0.9 | 66.9 ± 0.8 | 69.4 ± 0.4
MKL-SVM | 42.1 ± 1.2 | 55.1 ± 0.7 | 62.3 ± 0.8 | 67.1 ± 0.9 | 70.5 ± 0.8 | 73.7 ± 0.7
AK-RT-KFDA + SVM | 46.1 ± 0.8 | 57.6 ± 0.6 | 64.2 ± 0.8 | 68.1 ± 1.0 | 70.8 ± 0.9 | 73.7 ± 1.0
PK-RT-KFDA + SVM | 45.0 ± 0.6 | 56.2 ± 0.6 | 62.6 ± 0.7 | 66.7 ± 0.8 | 69.4 ± 0.9 | 72.5 ± 1.1
BIK-RT-KFDA + SVM | 46.1 ± 0.6 | 54.8 ± 0.2 | 59.8 ± 0.4 | 63.0 ± 0.8 | 65.5 ± 0.6 | 67.8 ± 0.7
MKL-RT-KFDA + SVM | 46.3 ± 0.9 | 58.9 ± 0.4 | 65.5 ± 0.7 | 69.7 ± 1.0 | 72.2 ± 0.4 | 74.7 ± 0.7
MKL-DR-KFDA + SVM | 40.2 ± 4.0 | 53.6 ± 2.5 | 61.7 ± 1.4 | 65.5 ± 0.8 | 68.6 ± 0.9 | 72.1 ± 1.3
Table 4 shows the number of kernels selected by MKL-RT-KFDA for the 5 splits used in our experiments. A kernel is considered to be selected if its contribution is greater than 0.1%, i.e., its coefficient is greater than 0.001. We can see that the number of kernels selected by MKL-RT-KFDA (around 7-14) is much less than the total number of kernels, which is 39 in this case. This clearly shows that the proposed approach can be successfully used to select features for discriminative dimensionality reduction. In contrast, the non-convex MKL-DR approach of [23] ended up selecting all 39 kernels (all the weights were greater than 0.001 after normalization). The main reason for this could be the lack of a sparsity-promoting norm constraint (on the weights) in the MKL-DR formulation. Figures 4(a) and 4(b) respectively show the kernel weights of the MKL-RT-KFDA and MKL-DR-KFDA approaches for the fifth random split with 30 training samples per class.
Oxford flowers-17 [25] is a multi-class dataset consisting of 17 categories of flowers with 80 images per category. This dataset comes with 3 predefined splits into training (680 images), validation (340 images) and test (340 images) sets. Moreover, the authors of [25] precomputed 7 distance matrices using various features, and the matrices are available online (http://www.robots.ox.ac.uk/~vgg/data/flowers/17/index.html). For brevity, we omit the details of the features and refer the readers to [25, 26]. We used these distance matrices and followed the same procedure as in [11] to compute 7 different kernels: $k_m(x, y) = \exp\left(-d_m(x, y)/\bar{d}_m\right)$, where $\bar{d}_m$ is the mean of the pairwise distances for the $m$-th feature. We performed experiments using the three predefined splits and report the average results. The regularization parameter $\mu$ was chosen based on cross-validation using the training and validation sets.
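A small sketch of this distance-to-kernel conversion, assuming the exponential construction described above:

```python
import numpy as np

def kernels_from_distances(distance_matrices):
    """One kernel per feature from its pairwise distance matrix:
    k = exp(-d / mean(d)), with mean(d) over the off-diagonal pairs."""
    kernels = []
    for D in distance_matrices:
        d_bar = D[np.triu_indices_from(D, k=1)].mean()  # mean pairwise distance
        kernels.append(np.exp(-D / d_bar))
    return kernels
```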
Table 5 shows the recognition rates of various approaches for this dataset. For the SVM-based approaches, we report the results from [11], which were obtained using the same kernel matrices and splits. We can see that all the observations (except the high standard deviation of MKL-DR) made in the case of Caltech-101 hold true for the Oxford flowers-17 dataset as well. Figures 5(a) and 5(b) respectively show the kernel weights of the MKL-RT-KFDA and MKL-DR-KFDA approaches for the third split.
5.3 KCCA for Cross-modal Retrieval
In these experiments, we used KCCA to map the data from two different modalities (image and text) to a common latent space, and used the cosine distance in the latent space for cross-modal retrieval. We used the Wikipedia articles [30] dataset for these experiments. We measure the retrieval performance using mean average precision (MAP) [30].
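For completeness, here is an illustrative sketch of this evaluation protocol, assuming (as in [30]) that a retrieved item is relevant if it belongs to the same category as the query; the exact MAP implementation of [30] may differ in minor details.

```python
import numpy as np

def mean_average_precision(queries, gallery, q_labels, g_labels):
    """Rank gallery items by cosine similarity in the latent space and
    average the per-query average precision."""
    q_labels = np.asarray(q_labels)
    g_labels = np.asarray(g_labels)
    Q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    G = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = Q @ G.T                        # cosine similarities
    ap = []
    for i in range(Q.shape[0]):
        order = np.argsort(-sims[i])      # best matches first
        rel = (g_labels[order] == q_labels[i]).astype(float)
        if rel.sum() == 0:
            continue
        prec_at_k = np.cumsum(rel) / (np.arange(rel.size) + 1)
        ap.append((prec_at_k * rel).sum() / rel.sum())
    return float(np.mean(ap))
```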
Wikipedia articles [30] is a dataset of image-text pairs designed for cross-modal retrieval applications. It consists of 2173 training image-text pairs and 693 test image-text pairs, which are grouped into 10 broad categories like art, history, etc. For text, we used a linear kernel generated from the 10-dimensional latent Dirichlet allocation model-based features provided by [30] (http://www.svcl.ucsd.edu/projects/crossmodal/). For images, we extracted various features and constructed 21 different kernels as described below:
PHOG shape descriptor [4]: The descriptor is a histogram of oriented or unoriented gradients computed on the output of a Canny edge detector. The oriented histogram consists of 40 bins and the unoriented histogram consists of 20 bins. We generated 4 kernels corresponding to different levels of the spatial pyramid [22] from both the oriented and unoriented histograms. Each kernel is an RBF kernel based on the $\chi^2$ distance between histograms.
SIFT appearance descriptor: We computed grayscale SIFT [24] descriptors on a regular grid over the image, with a spacing of 2 pixels and at four different patch sizes. We followed two different approaches, namely the BOW model with 1000 codewords and second-order pooling [6], to obtain region descriptors from the SIFT descriptors. In each case, we generated 3 kernels corresponding to different levels of the spatial pyramid. In the case of the BOW model, each kernel is an RBF kernel based on the $\chi^2$ distance between histograms. In the case of second-order pooling, each kernel is an RBF kernel based on the log-Euclidean distance [1] between covariance matrices.
LBP texture features [27]: We used the histograms of uniform rotation-invariant LBP features and generated 3 kernels corresponding to different levels of the spatial pyramid. Each kernel is an RBF kernel based on the $\chi^2$ distance between histograms.
Region covariance features [34]: We used the covariance of the simple per-pixel features described in [34]. We generated 3 kernels corresponding to different levels of the spatial pyramid. Each kernel is an RBF kernel based on the log-Euclidean distance [1] between covariance matrices.
GIST image descriptor [28]: We generated an RBF kernel using the 512-dimensional GIST descriptor, which records the pooled steerable filter responses within a grid of spatial cells across the image.
The number of KCCA dimensions was chosen to be 9, and the regularization parameter $\mu$ was chosen using cross-validation. Table 6 shows the MAP scores of various KCCA approaches on this dataset for text and image queries. We can clearly see that the proposed MKL-RT approach gives the best retrieval performance. Similar to the KFDA experiments, the MKL-DR approach performs poorly compared to AK-RT, PK-RT and MKL-RT. Figures 6(a) and 6(b) respectively show the kernel weights of the MKL-RT-KCCA and MKL-DR-KCCA approaches. For this dataset, the proposed convex MKL-RT-KCCA approach selected 9 kernels out of 21, whereas the non-convex MKL-DR-KCCA approach ended up selecting all the kernels. This clearly shows that the proposed approach can be successfully used for feature selection in cross-modal retrieval applications.
5.4 LKCCA for Cross-modal Retrieval
In these experiments, we used LKCCA to map the data from two different modalities (image and text) to a common latent space, and used the cosine distance in the latent space for cross-modal retrieval. We used the Wikipedia articles [30] and Pascal VOC 2007 [16] datasets for these experiments. The number of LKCCA dimensions was chosen to be $C - 1$, where $C$ is the number of classes. We use the mean average precision to measure the retrieval performance.
For the Wikipedia dataset, we used the same image and text kernels that were used in the KCCA experiments. Table 7 shows the MAP scores of various LKCCA approaches on this dataset for text and image queries. We can clearly see that the proposed MKL-RT approach gives the best retrieval performance. Similar to the KFDA and KCCA experiments, the MKL-DR approach performs poorly compared to AK-RT, PK-RT and MKL-RT. Figures 7(a) and 7(b) respectively show the kernel weights of the MKL-RT-LKCCA and MKL-DR-LKCCA approaches. For this dataset, the proposed convex MKL-RT-LKCCA approach selected 12 kernels out of 21, whereas the non-convex MKL-DR-LKCCA approach selected 18 kernels.
The Pascal VOC 2007 [16] dataset consists of 5011 training image-text pairs and 4952 test image-text pairs corresponding to 20 different object categories. Since some of the images are multi-labeled, following [32, 38], we selected only the images with a single object. This gave us 2799 training image-text pairs and 2820 test image-text pairs. For text, we used a linear kernel generated from the absolute and relative tag rank features provided by [16] (http://vision.cs.utexas.edu/projects/tag/bmvc10.html). For images, we extracted various features (the same as those used for the Wikipedia dataset) and constructed 21 different kernels.
Table 8 shows the MAP scores of various LKCCA approaches on this dataset for text and image queries. For the image query, the proposed MKL-RT approach gives the best retrieval performance and the MKL-DR approach performs very poorly. For the text query, the MKL-DR approach gives the best performance and the proposed MKL-RT approach is the second best. Considering the average retrieval performance, the proposed MKL-RT approach is the best. Figures 8(a) and 8(b) respectively show the kernel weights of the MKL-RT-LKCCA and MKL-DR-LKCCA approaches. For this dataset, the proposed convex MKL-RT-LKCCA approach selected 8 kernels out of 21, whereas the non-convex MKL-DR-LKCCA approach selected 20 kernels.
For the Pascal VOC dataset, we also performed experiments using KCCA. The results produced by all the KCCA methods were much lower than the corresponding LKCCA results, so in the interest of space we do not present them.
6 Conclusion and Future Work
In this paper, we showed that MKL can be formulated as a convex optimization problem for a large class of ratio-trace problems that includes many popular algorithms like LDA, SDA, SILDA, LDE, MFA, LPP, NPE, CCA and orthonormal PLS-SB. We also provided an optimization procedure that is guaranteed to converge to the global optimum of the proposed convex optimization problem. We performed experiments using three different instances of the ratio-trace problem and demonstrated that the proposed MKL-RT approach can be successfully used to select features for discriminative dimensionality reduction and cross-modal retrieval. We also showed that the proposed convex MKL-RT approach performs better than the non-convex MKL-DR approach of [23].
In the near future, we plan to test our approach on various other instances of the ratio-trace problem. Along the lines of non-sparse MKL-SVM and MKL-LDA [39], we also plan to extend this work to non-sparse MKL-RT.
References
 [1] Arsigny, V., Fillard, P., Pennec, X., Ayache, N.: Geometric means in a novel vector space structure on symmetric positivedefinite matrices. SIAM J. Mat. Anal. App. 29(1), 328–347 (2007)
 [2] Baudat, G., Anouar, F.: Generalized discriminant analysis using a kernel approach. Neural Computation 12(10), 2385–2404 (2000)
 [3] Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. PAMI 19(7), 711–720 (1997)
 [4] Bosch, A., Zisserman, A., Munoz, X.: Representing shape with a spatial pyramid kernel. In: CIVR (2007)
 [5] Cai, D., He, X., Han, J.: Semisupervised discriminant analysis. In: ICCV (2007)
 [6] Carreira, J., Caseiro, R., Batista, J., Sminchisescu, C.: Semantic segmentation with secondorder pooling. In: ECCV (2012)
 [7] Chen, H.T., Chang, H.W., Liu, T.L.: Local discriminant embedding and its variants. In: CVPR (2005)
 [8] Etemad, K., Chellappa, R.: Discriminant analysis for recognition of human face images. Jl. Optical Society of America 14, 1724–1733 (1997)
 [9] FeiFei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In: CVPR Workshops (2004)
 [10] Fukunaga, K.: Introduction to statistical pattern recognition. Academic Press, second edn. (1991)
 [11] Gehler, P.V., Nowozin, S.: On feature combination for multiclass object classification. In: ICCV (2009)
 [12] Gönen, M., Alpaydın, E.: Multiple kernel learning algorithms. JMLR 12, 2211–2268 (2011)
 [13] Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis: An overview with application to learning methods. Neural Computation 16(12), 2639–2664 (2004)
 [14] He, X., Cai, D., Yan, S., Zhang, H.J.: Neighborhood preserving embedding. In: ICCV (2005)
 [15] He, X., Niyogi, P.: Locality preserving projections. In: NIPS (2003)
 [16] Hwang, S.J., Grauman, K.: Reading between the lines: Object localization using implicit cues from image tags. PAMI 34(6), 1145–1158 (2012)
 [17] Kan, M., Shan, S., Xu, D., Chen, X.: Side-information based linear discriminant analysis for face recognition. In: BMVC (2011)
 [18] Kim, S.J., Magnani, A., Boyd, S.: Optimal kernel selection in kernel fisher discriminant analysis. In: ICML (2006)
 [19] Kim, T.K., Wong, S.F., Cipolla, R.: Tensor canonical correlation analysis for action classification. In: CVPR (2007)
 [20] Kloft, M., Brefeld, U., Sonnenburg, S., Zien, A.: ℓp-norm multiple kernel learning. JMLR 12, 953–997 (2011)
 [21] Lanckriet, G.R.G., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I.: Learning the kernel matrix with semidefinite programming. JMLR 5, 27–72 (2004)
 [22] Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR (2006)
 [23] Lin, Y.Y., Liu, T.L., Fuh, C.S.: Multiple kernel learning for dimensionality reduction. PAMI 33(6), 1147–1160 (2011)
 [24] Lowe, D.G.: Distinctive image features from scaleinvariant keypoints. IJCV 60(2), 91–110 (2004)
 [25] Nilsback, M.E., Zisserman, A.: A visual vocabulary for flower classification. In: CVPR (2006)
 [26] Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: ICVGIP (2008)
 [27] Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution grayscale and rotation invariant texture classification with local binary patterns. PAMI 24(7), 971–987 (2002)
 [28] Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV 42(3), 145–175 (2001)
 [29] Rakotomamonjy, A., Bach, F.R., Canu, S., Grandvalet, Y.: SimpleMKL. JMLR 9, 2491–2521 (2008)
 [30] Rasiwasia, N., Pereira, J.C., Coviello, E., Doyle, G., Lanckriet, G.R.G., Levy, R., Vasconcelos, N.: A new approach to crossmodal multimedia retrieval. In: ACM Multimedia (2010)
 [31] Rosipal, R., Krämer, N.: Overview and recent advances in partial least squares. Lecture Notes in Computer Science pp. 34–51 (2006)
 [32] Sharma, A., Kumar, A., Daume, H., Jacobs, D.W.: Generalized multiview analysis: A discriminative latent space. In: CVPR (2012)
 [33] Sonnenburg, S., Rätsch, G., Schäfer, C., Schölkopf, B.: Large scale multiple kernel learning. JMLR 7, 1531–1565 (2006)
 [34] Tuzel, O., Porikli, F., Meer, P.: Region covariance: A fast descriptor for detection and classification. In: ECCV (2006)
 [35] Vedaldi, A., Gulshan, V., Varma, M., Zisserman, A.: Multiple kernels for object detection. In: ICCV (2009)
 [36] Vemulapalli, R., Pillai, J.K., Chellappa, R.: Kernel learning for extrinsic classification of manifold features. In: CVPR (2013)
 [37] Wang, H., Yan, S., Xu, D., Tang, X., Huang, T.S.: Trace ratio vs. ratio trace for dimensionality reduction. In: CVPR (2007)
 [38] Wang, K., He, R., Wang, W., Wang, L., Tan, T.: Learning coupled feature spaces for crossmodal matching. In: ICCV (2013)
 [39] Yan, F., Kittler, J., Mikolajczyk, K., Tahir, A.: Nonsparse multiple kernel fisher discriminant analysis. JMLR 13, 607–642 (2012)
 [40] Yan, S., Xu, D., Zhang, B., Zhang, H.J., Yang, Q., Lin, S.: Graph embedding and extensions: A general framework for dimensionality reduction. PAMI 29(1), 40–51 (2007)
 [41] Ye, J., Ji, S., Chen, J.: Multiclass discriminant kernel learning via convex programming. JMLR 9, 719–758 (2008)
 [42] Yeh, Y.R., Lin, T.C., Chung, Y.Y., Wang, Y.C.F.: A novel multiple kernel learning framework for heterogeneous feature fusion and variable selection. IEEE Transactions on Multimedia 14(3), 563–574 (2012)