1 Introduction
Kernel methods are popular in machine learning as they model relations between objects in feature spaces of arbitrary, even infinite dimension
(Schölkopf and Smola, 2002). Kernels are inner product functions and provide a means to obtain rich representations, which is useful for learning in domains not associated with vector spaces, such as structured objects, strings or trees. Computing the inner product values for all pairs of data points to obtain the kernel matrix requires computation and storage that scale quadratically in the number of data instances. Kernel approximations are thus indispensable when learning on large datasets and can be classified into two groups: approximations of the
kernel function or approximations of the kernel matrix. Direct approximation of the kernel function can achieve significant performance gains. A large body of work relies on approximating the frequently used Gaussian kernel, which has a rapidly decaying eigenspectrum, as shown using Bochner's theorem and the concept of random features (Pennington and Yu, 2015; Szabo, 2015; Rahimi and Recht, 2007). Recently, the properties of matrices generated by the Gaussian kernel were further exploited to achieve sublinear complexity in the approximation rank (Yang et al., 2014; Si et al., 2014; Le et al., 2013)
. In another line of work, low-dimensional features can be derived for translation-invariant kernels based on their Fourier transform
(Vedaldi and Zisserman, 2012). These methods currently present the most space- and time-efficient approximations, but are limited to kernels of particular functional forms. Approximations of the kernel matrix are applicable to any symmetric positive-definite matrix even if the underlying kernel function is unknown. Such approximations, termed data-dependent, can be obtained using the eigenvalue decomposition, by selecting a subset of data points (the Nyström method) or by using the Cholesky decomposition, which minimizes the divergence between the original matrix and its low-rank approximation
(Rudi et al., 2015; Xu et al., 2015; Li et al., 2015; Gittens and Mahoney, 2013; Williams and Seeger, 2001; Fine and Scheinberg, 2001). However, these methods are unsupervised and disregard side information, e.g., the target variables. Predictive decompositions use the target variables to obtain a supervised low-rank approximation via Cholesky with Side Information (Bach and Jordan, 2005) or by minimizing Bregman divergence measures (Kulis et al., 2009). The effects of kernel matrix approximation have been discussed in the context of sparse Gaussian processes (Quiñonero Candela and Rasmussen, 2005), where the approximation leads to a degenerate Gaussian process. Learning the inducing points is equivalent to learning the pivots in matrix decompositions, but can be replaced by optimizing over the whole input domain (Snelson and Ghahramani, 2006; Wilson, 2015), which requires a continuous domain. Cao et al. (2015) relax this limitation with a hybrid approach to kernel function and inducing set optimization. All methods listed so far operate on single kernels. This presents a limitation, since the choice of the optimal kernel for a given learning task is often non-trivial.

Similarly to kernel (matrix) approximation, approaches that learn the optimal kernel for a given task (dependent on the data) can be classified into i) learning the kernel (covariance) function or ii) learning the kernel matrix. Learning the kernel function is possible in continuous domains, where kernel hyperparameters are optimized to match the training data
(Mohsenzadeh et al., 2015; Bishop, 2006). Alternatively, kernel functions can be learned via Fourier transforms from the corresponding power spectra (Gal and Turner, 2015; Wilson and Adams, 2013). Multiple kernel learning (MKL) methods learn the optimal weighted sum of given kernel matrices with respect to the target variables, such as class labels (Gönen and Alpaydin, 2011). Different kernels can thus be used to model the data, and their relative importance is assessed via the predictive accuracy, offering insights into the problem domain. Depending on the user-defined constraints, the resulting optimization problems are quadratic (QP) or semidefinite programs (SDP), assuming complete knowledge of the kernel matrices. Cortes et al. (2012) solve a QP on centered kernel matrices, which corresponds to centering the data in the original input space. Low-rank approximations have been used for MKL, e.g., by performing Gram-Schmidt orthogonalization followed by MKL (Kandola et al., 2002). In a recent study, the combined kernel matrix is learned via an efficient generalized forward-backward algorithm, however assuming that low-rank approximations are available beforehand (Rakotomamonjy and Chanda, 2014). Gönen et al. develop a Bayesian treatment for joint dimensionality reduction and classification, solved via gradient descent (Gönen and Alpaydin, 2010) or variational approximation (Kaski and Gonen, 2014), while assuming access to full kernel matrices and not exploiting their symmetric structure.
In this work, we propose a joint treatment of low-rank kernel approximations and MKL. We assume as input: i) a set of objects with corresponding continuous target variables and ii) a set of kernels that define inner products on the same objects. We designed mklaren, a greedy algorithm that couples Incomplete Cholesky Decomposition and Least-angle regression to learn a low-rank approximation of the combined kernel matrix. Our approach to pivot column selection is closely related to the selection of feature vectors in least-angle regression (LAR) (Efron and Hastie, 2004). At each step, the method keeps a current estimate of the regression model within the span of the current approximation. In comparison to existing methods, it has the following two advantages. First, the method is aware of multiple kernel functions. In each step, the next pivot column to be added is chosen greedily from all remaining pivot columns of all kernels. Kernels that give more information about the current residual are thus more likely to be selected. In contrast to methods that assume access to the complete kernel matrices, the importance of a kernel is estimated at the time of its approximation. This also differs from performing the decomposition for each kernel independently and subsequently determining the kernel weights. Second, the criterion only considers the gain with respect to the current regression residual; the notion of kernel matrix approximation error is avoided entirely. Even though the accuracy of the approximation relates to the similarity with the model using the full kernel matrix (Cortes et al., 2010), it was recently shown that i) the expected generalization error is related to the maximal marginal degrees of freedom rather than the approximation error, and ii) empirically, low-rank approximation can lead to a regularization-like effect (Bach, 2012). Nevertheless, the residual approximation error is guaranteed to decrease monotonically by the definition of Cholesky decompositions. When an explicit feature space representation is available for the kernels, the relation between primal and dual regression weights is used for model interpretation. In contrast to MKL algorithms, which rely on convex optimization or Bayesian methods, our approach relies solely on geometric principles, leading to a straightforward algorithm with low computational complexity in the number of data points and kernels.
A common assumption when applying matrix approximation or MKL is that the resulting decomposition can only be applied in transductive learning (Zhang et al., 2012; Lanckriet and Cristianini, 2004), i.e., the test data samples are included in the model training phase. We apply the lemma on the uniqueness of the low-rank approximation for a fixed active set to relate the Incomplete Cholesky decomposition and the Nyström method. With this we circumvent the limitation to the transductive setting and infer the low-rank representation of arbitrary test data points, enabling out-of-sample prediction.
The predictive performance, run times and model interpretation were evaluated empirically on multiple benchmark regression datasets. The provided implementation of mklaren compared favorably against related low-rank kernel matrix approximation and state-of-the-art MKL approaches. Additionally, we isolate the effect of low-rank approximation and compare the method with full kernel matrix MKL methods on a very large set of rank-one kernels.
The article is structured as follows. The mklaren algorithm, with pivot column updates based on the LAR selection criterion, is presented in Section 2. Auxiliary results regarding out-of-sample prediction, model interpretation and computational complexity analysis are given in Section 3. Experimental evaluation is presented in Section 4. A description of the Least-angle regression method is given in the Appendix. The algorithm implementation and code to reproduce the presented experiments are available at https://github.com/mstrazar/mklaren.
2 Multiple kernel learning with least-angle regression
Let be a set of points in a Hilbert space of arbitrary dimension, associated with targets . Let the Hilbert spaces be isomorphic to and endowed with respective inner product (kernel) functions . The kernels are positive definite and map from to . Hence, a data point can be represented in multiple inner product spaces, which can be related to different data views or representations. Evaluating for each pair of determines a kernel matrix . The goal of a predictive (supervised) approximation algorithm is to learn the corresponding low-rank approximations , where , , using additional information on the targets. In the context of regression, the regression line is learned simultaneously with the approximations, as their construction depends on the residual vector .
The mklaren algorithm simultaneously learns low-rank approximations of the kernel matrices associated with each kernel function and the regression line . It uses Incomplete Cholesky Decomposition (ICD) to iteratively construct each . At each iteration, a kernel and a pivot column are chosen using a heuristic that evaluates the explained information on the residual for each potential new column of . This is achieved by using least-angle regression in the space spanned by the previously selected (normalized and centered) pivot columns of all . The high-level pseudocode of the mklaren algorithm is shown in Algorithm 1, and its steps are described in detail in the following subsections.
2.1 Simultaneous Incomplete Cholesky decompositions
We start with the description of Incomplete Cholesky Decomposition (ICD) of a single kernel matrix, and later extend it to simultaneous decomposition of multiple kernels. A kernel matrix is approximated with a Cholesky factor . The ICD is a family of methods that produce a finite sequence of matrices , such that
(1) 
Initially, is set to . A diagonal vector representing the lower bound on the approximation gain is initialized as . The active set , keeping track of the selected pivot columns, is initially empty. At iteration , a pivot is selected from the remaining set and its pivot column is computed as follows:
(2) 
Importantly, only the information on one column of is required at each iteration and need never be computed explicitly. The selected pivot is added to the active set, the counter and the diagonal vector are updated:
(3) 
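The single-kernel ICD above can be sketched as follows. This is a minimal illustration with the standard unsupervised max-diagonal pivot rule, which mklaren later replaces with the LAR-based criterion; the dataset and kernel are illustrative:

```python
import numpy as np

def icd(K, rank):
    """Incomplete Cholesky decomposition with greedy diagonal pivoting.
    Returns G (n x rank) and the list of selected pivot indices.
    A minimal sketch; only one column of K is needed per iteration."""
    n = K.shape[0]
    G = np.zeros((n, rank))
    d = np.diag(K).copy()          # lower bound on the approximation gain
    active = []
    for k in range(rank):
        j = int(np.argmax(d))      # pivot with the largest remaining gain
        # Cholesky step (Eq. 2): uses only column j of K
        G[:, k] = (K[:, j] - G[:, :k] @ G[j, :k]) / np.sqrt(d[j])
        d -= G[:, k] ** 2          # diagonal update (Eq. 3)
        active.append(j)
    return G, active

X = np.random.RandomState(0).randn(40, 2)
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))  # Gaussian kernel
G, active = icd(K, rank=10)
```

The residual matrix stays positive semidefinite, so the approximation error decreases monotonically with each added pivot.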
In the case of multiple kernels, each kernel function determines a corresponding , which is approximated with Cholesky factors . An example scenario is depicted in Fig. 1.
The set of all Cholesky factors is used to construct a combined feature matrix to be used for least-angle regression, as follows. At any point, assume the existence of the residual vector , to be constructed in Section 2.2. For each selected pivot column in kernel , define the following transformation:
(4) 
where the operator is the centering projection and is the sign of the correlation . Each is normalized and makes an angle of at most 90 degrees with the residual . The set of columns spans the combined feature space, which is equivalent to any matrix containing this same set of columns (in any order).
(5) 
Fig. 1 shows one such matrix. Note that applying the operator is equivalent to centering the positive semidefinite matrix , which represents the combined kernel.
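The per-column transformation of Eq. 4 (centering, normalization, and a sign flip toward the residual) can be sketched as follows; the function name is illustrative:

```python
import numpy as np

def center_normalize(g, r):
    """Center a pivot column g, normalize it, and flip its sign so it
    makes an angle of at most 90 degrees with the residual r.
    A minimal sketch of the operator described in Eq. 4."""
    g = g - g.mean()               # centering projection
    g = g / np.linalg.norm(g)      # unit norm
    s = np.sign(g @ r)             # sign of the correlation with r
    return s * g

rng = np.random.RandomState(0)
g, r = rng.randn(30), rng.randn(30)
h = center_normalize(g, r)
```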
Least-angle regression is used to iteratively select the next pivot column and thus determine the order of columns in , while simultaneously updating the regression line. The next kernel and pivot are selected from all remaining sets , based on the current residual . The corresponding pivot column is computed using the Cholesky step in Eq. 2 and added to . At any iteration, each may contain a different number of columns, as their selection depends on their relevance for explaining the residual.
2.2 Pivot selection based on Least-angle regression
Least-angle regression (LAR) is an active set method for feature subset selection in linear regression (see the Appendix and Efron and Hastie (2004) for a thorough description). Here, we propose an idea based on the LAR column selection to determine the next pivot column to be added to any of the and consequently to the combined feature matrix . The original LAR method assumes the availability of all variables representing the covariates (column vectors) in the sample data matrix. In our case, however, this matrix is and is constructed iteratively. The adaptation of the LAR-based column selection is non-trivial, since the exact values of the new columns and are unknown at selection time.
This section describes a method to construct given the columns and learn . For clarity, we assume (only in this section) that the values of all are known and describe the ordering of in . The problem of unknown candidate pivot column values is postponed to Section 2.3.
The matrix is initialized to . The regression line and the residual are initialized
(6) 
By construction, and for all . The columns will be added to in a defined ordering
(7) 
where a unique (kernel, pivot) pair is selected for each position . The ordering depends on the correlation with the residual . Therefore, Cholesky factors containing pivot columns with more information on the current residual are preferred.
The column selection procedure is depicted in Fig. 2a and defined as follows. At iteration , the first vector is chosen to maximize the correlation . It is then added to .
At each iteration , contains columns . By elementary linear algebra, there exists a bisector of unit length, making equal angles, each less than 90 degrees, with the vectors currently in . Updating the regression line along the direction and the residual
(8) 
causes the correlations to change equally for all in , for an arbitrary step size . The value is set such that some new column not in will have the same correlation with the residual as all the columns already in :
(9) 
The step size and are selected as follows. Define the following quantities
(10) 
Then,
(11) 
Here, is the minimum over positive arguments for each choice of . The selected column vector is the minimizer of Eq. 11 and is inserted at the -th position in , . For the last column vector (as there are no further column vectors to choose from), the step size simplifies to , yielding the ordinary least-squares solution for and . The remaining problem in our case is that the exact values of the potential pivot columns not in and their corresponding are unknown. Explicit calculation of all columns using the Cholesky step in Eq. 2 would yield quadratic computational complexity, as their values depend on all previously selected pivots. The issue is addressed by using approximations and that are less expensive to compute, as described in the following section.
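The equiangular (bisector) direction at the heart of the LAR update can be sketched as follows, following Efron and Hastie (2004); the helper name and dimensions are illustrative assumptions:

```python
import numpy as np

def equiangular(XA):
    """Unit-length bisector making equal angles with all (unit-norm)
    active columns of XA, as used in the LAR update step.
    A minimal sketch following Efron and Hastie (2004)."""
    G = XA.T @ XA                          # Gram matrix of active columns
    w = np.linalg.solve(G, np.ones(XA.shape[1]))
    A = 1.0 / np.sqrt(w.sum())             # normalization constant
    return A * (XA @ w)

rng = np.random.RandomState(0)
XA = rng.randn(50, 4)
XA /= np.linalg.norm(XA, axis=0)           # active columns are unit norm
u = equiangular(XA)
corr = XA.T @ u                            # equal for all active columns
```

Stepping the regression line along this direction changes the correlations of all active columns with the residual by the same amount, which is exactly the property Eq. 8 and Eq. 9 rely on.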
2.3 Lookahead decompositions
The selection of a new column vector to be added to the combined feature matrix , and its corresponding , is based only on the values , in Eq. 11. Instead of explicitly calculating each candidate for all at each iteration, we use an approximate column vector . The approach uses an idea similar to the lookahead (information) columns in (Cao et al., 2015; Bach and Jordan, 2005).
Consider the kernel matrix , its current Cholesky factor and active set . By the definition of ICD in Eq. 2, the values of a candidate pivot column at step and pivot are:
(12) 
The main computational cost in the above definition is the computation of a rank kernel matrix for each . Instead, lookahead columns are used to obtain a lookahead approximation (Fig. 2b). This defines the approximate values :
(13) 
Given and consequently , consider the computation of :
(14) 
Inserting as in Eq. 13, the denominator cancels out. The norm can be computed as:
(15) 
Similarly, dot product with the residual is computed as:
(16) 
The computation of is analogous. Correctly ordering the computation yields computational complexity per column. Note that the matrices , , , are the same for all columns (independent of ) and need to be computed only once per iteration.
The values and can be computed efficiently for all kernel matrices and enable the selection of the next (kernel, pivot column) pair to be added to and consequently to . After the selection, a Cholesky step is performed (Eq. 2) to compute the exact and
(17) 
The computation of a new column renders the lookahead columns in at indices invalid. After applying Eq. 17, all columns at indices are recomputed using a standard Cholesky step, with pivot selection based on the current maximal value in , at a cost of .
2.4 The mklaren algorithm
The steps described in the previous sections complete the mklaren algorithm. Given a sample of data objects with targets and kernel functions, the user specifies three additional parameters: the maximum rank of the combined feature matrix, the number of lookahead columns, and the regularization parameter (constraining , discussed in Section 3.4).
The variables related to the regression line (, residual and bisector ) and the individual decompositions (active sets , column counters ) are initialized first. Each is then initialized using standard ICD with lookahead columns, as described in Section 2.3.
The main loop is executed until the sum of selected pivot columns equals . At each iteration, a kernel and a pivot column are selected and added to , and consequently to the combined feature matrix . For each kernel and each pivot , and are computed. Based on these approximated values, the kernel and pivot are selected. Given the optimal and pivot , the pivot column and are computed. The new pivot column is added to , is incremented and the columns at are recomputed using standard ICD.
3 Auxiliary theoretical results
This section presents auxiliary theoretical results required for out-of-sample prediction (Sections 3.1–3.2), model interpretation using the relation between primal and dual regression coefficients (Section 3.3), regularization (Section 3.4), and computational complexity (Section 3.5).
3.1 Computing the regression coefficients
The regression coefficients are computed from the regression line and the combined feature space as defined in Eq. 5, using the relation
(18) 
3.2 Outofsample prediction
Inference of Cholesky factors corresponding to test (unseen) samples is possible without explicitly repeating the Cholesky steps. The coefficients are then used to predict the responses for new samples. To simplify notation, we show the approach for one kernel matrix and its corresponding Cholesky factors, while the computation for multiple kernels is analogous.
Nyström approximation. Let be an arbitrary active set of pivot indices. The Nyström approximation (Williams and Seeger, 2001) of the kernel matrix is defined as follows:
(19) 
The construction of crucially influences the prediction performance. Note that mklaren defines a method to construct .
Proposition. The Incomplete Cholesky decomposition with pivots yields the same approximation as the Nyström approximation using the active set .
(20) 
The proof follows directly from Bach and Jordan (2005), Proposition 1. There exists a unique matrix that is: (i) symmetric, (ii) has its column space spanned by , and (iii) . It follows that both the Incomplete Cholesky decomposition and the Nyström approximation result in the same approximation matrix .
Corollary. Let be the Cholesky factors obtained on the training set using pivots indices . Let be the values of the kernel function evaluated for all test samples and training samples in the active set , for . The Cholesky factors for test samples
are inferred using the linear transform:
(21)
The matrix is inexpensive to compute and can be stored permanently after the training phase. Hence, the Cholesky factors are computed from the inner products between the test samples and the active set . The combined feature matrix and the predictions are obtained after centering and normalization against the training Cholesky factors :
(22) 
where and is defined in Eq. 18.
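The out-of-sample inference can be sketched as follows. The active set, kernel, and data are illustrative; the key point is that the test-sample Cholesky factors follow from kernel values against the active set via a fixed linear transform (Eq. 21), and that the resulting factors satisfy the Nyström identity (Eq. 19–20):

```python
import numpy as np

def gauss(A, B):
    """Gaussian kernel values between the rows of A and B."""
    return np.exp(-((A[:, None] - B[None, :]) ** 2).sum(-1))

rng = np.random.RandomState(0)
X_train, X_test = rng.randn(60, 2), rng.randn(10, 2)

J = [3, 17, 42, 55]                       # hypothetical active set
L = np.linalg.cholesky(gauss(X_train[J], X_train[J]))
T = np.linalg.inv(L).T                    # fixed transform, stored once
G_train = gauss(X_train, X_train[J]) @ T  # training Cholesky factors
G_test = gauss(X_test, X_train[J]) @ T    # inferred for unseen samples
```

Centering and normalization against the training factors, as in Eq. 22, would then be applied before prediction.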
3.3 Computing the dual coefficients
Even when approximating the kernels, a limited form of model interpretation remains possible for a certain class of kernels. Again, we show the approach for one kernel matrix and the combined feature matrix , while the computation for multiple kernels is analogous.
Kernel ridge regression is often stated in terms of dual coefficients , satisfying the relation:
(23) 
This is an underdetermined system of equations. The vector with minimal norm can be obtained by solving the following least-norm problem:
(24) 
The problem has an analytical solution equal to
(25) 
Obtaining the dual coefficients can be useful if the range of the explicit feature map induced by a kernel is finite, such that , which is the case for linear, polynomial, and various string kernels (Sonnenburg et al., 2005). An interpretation of the regression coefficients in the range of is obtained by computing the matrix for the training set and considering
(26) 
Moreover, if the vector is sparse, only the relevant portions of need to be computed. This condition can be enforced by using techniques such as matching pursuit when solving for (Bach et al., 2010).
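The least-norm solution of Eq. 24–25 can be sketched as follows, with hypothetical dimensions for the Cholesky factor and primal weights:

```python
import numpy as np

# Minimum-norm dual coefficients alpha satisfying G^T alpha = beta.
# A minimal sketch: G is a low-rank Cholesky factor, beta the primal
# regression weights; both are illustrative here.
rng = np.random.RandomState(0)
G = rng.randn(100, 5)                      # n x k Cholesky factor
beta = rng.randn(5)                        # primal weights
alpha = G @ np.linalg.solve(G.T @ G, beta) # analytical least-norm solution
```

The same vector is returned by any generic minimum-norm least-squares solver applied to the underdetermined system.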
3.4 L2 norm regularization
Regularization is achieved by constraining the norm of the weights . Zou and Hastie (2005) prove the following lemma, which shows that the regularized regression problem can be stated as an ordinary least-squares problem using an appropriate augmentation of the data . The following lemma assumes for all , , and .
Lemma. Define the augmented data set to equal
The least-squares solution of is then equivalent to ridge regression of the original data with parameter . For the proof, see Zou and Hastie (2005). The augmented dataset can be included in LAR to achieve the regularized solution. This is achieved by modifying the columns of the combined feature matrix in Eq. 5:
(27) 
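The augmentation can be verified numerically; the data and regularization strength below are illustrative:

```python
import numpy as np

# Zou & Hastie (2005): ordinary least squares on the augmented data
# (X stacked over sqrt(lam) * I, y padded with zeros) equals ridge
# regression on the original data. A minimal sketch.
rng = np.random.RandomState(0)
X, y, lam = rng.randn(50, 4), rng.randn(50), 0.3
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(4)])
y_aug = np.concatenate([y, np.zeros(4)])
w_ols = np.linalg.lstsq(X_aug, y_aug, rcond=None)[0]   # OLS on augmented data
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
```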
3.5 Computational complexity
The mklaren algorithm scales as a linear function of both the number of data points and the number of kernels . The computational complexity is
(28) 
The lookahead Cholesky decompositions are standard Cholesky decompositions with pivots and complexity . The main loop is executed times. The selection of (kernel, pivot) pairs is based on the LAR criterion, which includes inverting a matrix of size , thus having a complexity of . However, as each step is a rank-one modification of , the Sherman-Morrison-Woodbury formula for matrix inversion (Meyer, 2000) can be used to achieve complexity per update. The correlations with the bisector in Eq. 14 and the residuals are computed for kernels in . The recomputation of the Cholesky factors requires standard Cholesky steps of complexity . The computation of the gradient step is of the same complexity. Updating the regression line is . The QR decomposition in Eq. 18 takes and the computation of the linear transform in Eq. 21 is of complexity .
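The rank-one inverse update mentioned above can be sketched with the Sherman-Morrison formula (the special case of Sherman-Morrison-Woodbury for rank-one modifications); the matrices below are illustrative:

```python
import numpy as np

def sherman_morrison(Ainv, u, v):
    """Inverse of (A + u v^T) from the known inverse of A, in O(k^2)
    instead of O(k^3). A minimal sketch of the update used to keep
    the per-iteration LAR cost low."""
    Au = Ainv @ u
    vA = v @ Ainv
    return Ainv - np.outer(Au, vA) / (1.0 + v @ Au)

rng = np.random.RandomState(0)
B = rng.randn(5, 5)
A = B @ B.T + np.eye(5)            # symmetric positive definite
u = rng.randn(5)
Ainv = np.linalg.inv(A)
updated = sherman_morrison(Ainv, u, u)   # inverse after adding u u^T
```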
4 Experiments
In this section, we provide an empirical evaluation of the proposed method on known regression datasets. We compare mklaren with several well-known low-rank matrix approximation methods: Incomplete Cholesky Decomposition (icd; Fine and Scheinberg, 2001), Cholesky with Side Information (csi; Bach and Jordan, 2005) and the Nyström method (Nyström; Williams and Seeger, 2001).
We also compare mklaren with a family of state-of-the-art multiple kernel learning methods that use the full kernel matrix. The comparison was performed on a sentiment analysis data set with a large number of rank-one kernels (Cortes et al., 2012).

4.1 Comparison with low-rank approximations
Dataset  mklaren  csi  icd  Nyström  uniform 

boston  4.393 0.432  
kin  0.018 0.000  
pumadyn  1.252 0.032  
abalone  2.638 0.116  
comp  5.288 0.461  
ionosphere  0.283 0.017  
bank  0.036 0.001  
diabetes  54.680 3.61 
Dataset  mklaren  csi  icd  Nyström  uniform 

boston  3.792 0.454  
kin  0.016 0.001  
pumadyn  1.257 0.032  
abalone  2.526 0.097  
comp  3.100 0.942  
ionosphere  0.234 0.028  
bank  0.035 0.001  
diabetes  55.220 3.56 
Dataset  mklaren  csi  icd  Nyström  uniform 

boston  3.493 0.489  
kin  0.014 0.000  
pumadyn  1.251 0.027  
abalone  2.500 0.110  
comp  1.330 0.409  
ionosphere  0.221 0.012  
bank  0.034 0.002  
diabetes  55.214 4.03 
The main advantage of mklaren over established kernel matrix approximation methods is the simultaneous approximation of multiple kernels, which considers the current approximation to the regression line and greedily selects the next kernel and pivot to include in the decomposition. To elucidate this, we performed a comparison on eight known regression datasets: abalone, bank, boston, comp-activ, diabetes, ionosphere, kinematics and pumadyn (available at http://archive.ics.uci.edu/ml/ and http://www.cs.toronto.edu/~delve/data/datasets.html).
Similarly to Cortes et al. (2012), seven Gaussian kernels with different length scale parameters are used. The Gaussian kernel function is defined as , where the length scale parameter is in the range . For the approximation methods icd, csi and Nyström, each kernel matrix was approximated independently using a fixed maximum rank . The combined feature space of the seven kernel matrices was used with ridge regression.
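Such a family of kernel matrices can be constructed as follows; since the actual length-scale grid is not reproduced here, the geometric grid below is illustrative:

```python
import numpy as np

def gaussian_kernel(X, lengthscale):
    """Gaussian kernel matrix with the given length scale."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * lengthscale ** 2))

X = np.random.RandomState(0).randn(30, 4)
# Seven kernels over a hypothetical geometric grid of length scales
kernels = [gaussian_kernel(X, 2.0 ** i) for i in range(-3, 4)]
```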
For mklaren, the approximation is defined simultaneously for all kernels and the maximum rank was set to , i.e., seven times the maximum rank of the individual kernels used in icd, csi and Nyström. Thus, the low-rank feature space of all four methods had exactly the same dimension. The uniform kernel combination (uniform) using the full kernel matrix was included as an empirical lower bound.
The performance was assessed using 5-fold cross-validation as follows. Initially, up to 1000 data points were selected randomly from the dataset. For each random split of the data set, a training set containing 60% of the data was used for kernel matrix approximation and fitting the regression model. A validation set containing 20% of the data was used to select the regularization parameter from the range . The final performance, reported as the root mean square error (RMSE), was obtained on the test set with the remaining 20% of the data. All variables were standardized and the targets were centered to the mean. The lookahead parameter was set to 10 for mklaren and csi.
The results for different settings of are shown in Table 1. Not surprisingly, the performance of the supervised mklaren and csi is consistently superior to the unsupervised icd and Nyström. Moreover, mklaren outperforms csi on the majority of regression tasks, especially at lower values of . At higher values of , the difference vanishes, as all approximation methods recover sufficient information about the feature space induced by the kernels.
Dataset  mklaren  csi  icd  Nyström  

boston  506  42  63  119  
kin  1000  63  
pumadyn  1000  49  56  98  
abalone  1000  21  28  35  49 
comp  1000  49  63  
ionosphere  351  14  14  42  35 
bank  1000  21  42  42  112 
diabetes  442  14  14  14  21 
Comparison of the minimal rank for which the RMSE differs by at most one standard deviation from the RMSE obtained with the full kernel matrices using uniform alignment. The number of data samples is denoted by . Shown in bold is the method with the lowest maximal rank needed to achieve performance equivalent to uniform.

It is interesting to compare the utilization of the vectors in the low-dimensional feature space. Table 2 shows the minimal setting of where the performance is at most one standard deviation away from the performance obtained by uniform. On seven out of eight datasets, mklaren reaches performance equivalent to uniform at the smallest setting of . The differences in ranks among all four evaluated methods in Table 2 are statistically significant (p=0.0012, Friedman rank-sum test). Additionally, the difference in ranks between mklaren and csi is statistically significant (Wilcoxon signed-rank test, p=0.03552). Only on the diabetes dataset do the unsupervised icd and Nyström outperform the supervised methods at low ranks. However, at higher settings of , the performance of csi and mklaren overtakes icd and Nyström, as can be seen in Table 1.
Overall, the results confirm the utility of the greedy approach to select not only the pivots, but also the kernels to be approximated, and suggest mklaren as the method of choice when competitive performance in very low-rank feature spaces is desired. The kernels that are not added to the decomposition can be discarded. This point is discussed further in the next subsection.
4.2 Comparison with MKL methods on rank-one kernels
Comparing mklaren to multiple kernel learning methods that use the full kernel matrix is challenging, as it is unrealistic to expect improved performance from low-rank approximation methods. Although the restriction to low-rank feature spaces may result in implicit regularization and consequently improved performance, the difference in the implicit dimension of the feature space makes the comparison difficult (Bach, 2012).
We focus on the ability of mklaren to select from a set of kernels in a way that takes into account the implicit correlations between the kernels. To this end, we built on the empirical analysis of Cortes et al. (2012), which used four well-known sentiment analysis datasets compiled by Blitzer et al. (2007). In each dataset, the examples are user reviews of products and the target is the product rating in a discrete range . The features are counts of the 4000 most frequent unigrams and bigrams in each dataset. Each feature was represented by a rank-one kernel, thus enabling the use of multiple kernel learning for feature selection and explicit control over the feature space dimension. The datasets contain a moderate number of examples: books (), electronics (), kitchen () and dvd (). The splits into training and test parts were readily included as part of the data set. We compared mklaren with three state-of-the-art multiple kernel learning methods. All methods are based on maximizing the centered kernel alignment (Cortes et al., 2012). The align method infers the kernel weights independently, while alignf and alignfc consider the between-kernel correlations when maximizing the alignment. The combined kernel learned by all three methods was used with a kernel ridge regression model. The align method is linear in the number of kernels (), while alignf and alignfc are cubic, as they include solving an unconstrained (alignf) or a constrained (alignfc) QP.
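The centered kernel alignment underlying align, alignf and alignfc (Cortes et al., 2012) can be sketched as follows; the test matrices are illustrative:

```python
import numpy as np

def centered_alignment(K1, K2):
    """Centered kernel alignment between two kernel matrices
    (Cortes et al., 2012). A minimal sketch: center both matrices
    and return their normalized Frobenius inner product."""
    n = K1.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n    # centering projection
    K1c, K2c = P @ K1 @ P, P @ K2 @ P
    return (K1c * K2c).sum() / (np.linalg.norm(K1c) * np.linalg.norm(K2c))

rng = np.random.RandomState(0)
B = rng.randn(20, 5)
K1 = B @ B.T                               # a valid (PSD) kernel matrix
a = centered_alignment(K1, K1)
```

The measure is invariant to the scale of either kernel, which is why kernel weights derived from it are comparable across kernels.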
When testing different ranks , the features were first filtered according to the descending centered alignment metric for align, alignf and alignfc prior to optimization. When using mklaren, the pivot columns were selected from the complete set of 4000 features. The parameter was set to 1. Note that in this scenario, mklaren is equivalent to the original LAR algorithm, thus excluding the effect of low-rank approximation and comparing only the kernel selection part. This way, the same dimension of the feature space was ensured.
The performance was measured via 5-fold cross-validation. At each step, 80% of the training set was used for kernel matrix approximation (mklaren) or for determining the kernel weights (align, alignf, alignfc). The remaining 20% of the training set was used for selecting the regularization parameter from the range , and the final performance was reported on the test set using RMSE.
The results for different settings of the rank are shown in Fig. 3. For low ranks, mklaren outperforms all four other MKL methods that assume full kernel matrices. The performance of mklaren is within one standard deviation of the best performance obtained with 160 features by any of the methods, showing that the greedy kernel and pivot selection criterion considers implicit correlations between kernels.
However, there is an important difference in computational complexity: mklaren is linear in the number of kernels, which presents a practical advantage over alignf and alignfc. A comparison of the methods' run times is shown in Fig. 4.
We compared the methods' run times on a synthetic dataset with Gaussian kernels of differing parameters, varying the approximation rank and the number of samples. Since the other methods (align, alignf, alignfc, uniform) require the computation of the whole kernel matrix, mklaren was significantly more efficient (up to 3 orders of magnitude at 4000 samples).
The experiments with a varying number of kernels were performed on the books dataset. The centered kernel alignment can be computed efficiently for rank-one linear kernels, without explicit computation of the outer product. This is particularly efficient for the uniform and align methods, where the weights are computed independently. While mklaren is also linear in the number of kernels, it accounts for the between-kernel correlations. The overhead of calculating low-rank approximations pays off when the number of kernels exceeds 2000. Thus, accounting for between-kernel correlations is achieved at a significantly lower computational cost.
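The shortcut for rank-one linear kernels follows from the identity $\langle \mathbf{x}_c\mathbf{x}_c^T, \mathbf{y}_c\mathbf{y}_c^T\rangle_F = (\mathbf{x}_c^T\mathbf{y}_c)^2$, where centering the feature vector centers the kernel, so the $n \times n$ outer products never need to be formed. A small numerical check (illustrative numpy, not the experiment code):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)             # one feature, rank-one kernel x x^T
y = rng.normal(size=200)             # response, ideal kernel y y^T
xc, yc = x - x.mean(), y - y.mean()  # centering x centers the kernel: H(xx^T)H = xc xc^T

# Explicit O(n^2) Frobenius inner product of the centered rank-one kernels
explicit = np.sum(np.outer(xc, xc) * np.outer(yc, yc))

# O(n) shortcut: <xc xc^T, yc yc^T>_F = (xc . yc)^2
shortcut = (xc @ yc) ** 2

assert np.isclose(explicit, shortcut)
```

The same trick applies to the normalization terms, since $\|\mathbf{x}_c\mathbf{x}_c^T\|_F = \|\mathbf{x}_c\|^2$.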
Finally, we compare the methods with respect to feature selection on the kitchen dataset. Each of the methods mklaren, align, alignf, and alignfc returns an ordering of the kernels (features). With mklaren, this order is obtained as the pivot columns corresponding to kernels are added to the approximation. With the alignment-based methods, we use the order induced by the kernel weight vector. In Fig. 5, we display the top 40 features obtained from each such ordering, shown as words on the horizontal axis. We incrementally add features to an active set. As each feature is added, we fit an ordinary least-squares model using all features selected so far. The explained variance is calculated from the decrease in RMSE on the training set relative to the total variance. The arrows below each word indicate the sign of the corresponding regression weight.
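This incremental evaluation can be sketched as follows; the helper below (a hypothetical name, not the paper's code) refits ordinary least squares on growing prefixes of a given feature ordering and reports the fraction of training variance explained at each step.

```python
import numpy as np

def explained_variance_path(X, y, order):
    """Refit OLS on growing prefixes of `order` and report the
    fraction of training-set variance explained at each step."""
    yc = y - y.mean()
    total = np.sum(yc ** 2)
    scores = []
    for t in range(1, len(order) + 1):
        A = X[:, order[:t]]                            # active features so far
        beta, *_ = np.linalg.lstsq(A, yc, rcond=None)  # OLS weights (signs give the arrows)
        rss = np.sum((yc - A @ beta) ** 2)
        scores.append(1.0 - rss / total)
    return np.array(scores)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))
y = 2.0 * X[:, 0] + rng.normal(size=100)  # feature 0 carries the signal
scores = explained_variance_path(X, y, order=[0, 1, 2])
```

Because the models are nested, the explained variance is non-decreasing along the path; the slope at each step reflects how informative the newly added feature is.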
Intuitively, the slope (change in explained variance) is higher for features corresponding to words associated with strong sentiments. This is most notable for words such as great, good, love, etc. Not surprisingly, the order in which features are added to the model critically influences the explained variance. Here, mklaren outperforms the alignment-based methods. Due to its linear complexity in the number of kernels, the features strongly correlated with the response are identified early, irrespective of their magnitude. On the other hand, centered alignment appears to be biased towards words with a high number of nonzero entries in the dataset, such as prepositions. Moreover, the words associated with negative or positive sentiments are approximately balanced, according to the signs of the regression weights. The results confirm that mklaren can be used for model interpretation.
5 Conclusion
Subquadratic complexity in the number of training examples is essential in large-scale applications of kernel methods. Learning the kernel matrix efficiently from the data and selecting the relevant portions of the data early can reduce time and storage requirements further up the machine learning pipeline. The complexity with respect to the number of kernels should not be disregarded when the number of kernels is large. Using a greedy low-rank approximation to multiple kernels, we achieve linear complexity in the number of kernels and data points without sacrificing the consideration of between-kernel correlations. Moreover, while the approach learns a regression model, the resulting approximation is applicable in any kernel-based model; the extension to classification or ranking tasks is an interesting subject for future work. Contrary to recent kernel matrix approximations, we present an idea based entirely on geometric principles, which is not limited to transductive learning. With the abundance of different data representations, we expect kernel methods to remain essential in machine learning applications.
Appendix
Least-angle regression
Least-angle regression (LAR) is an active set method, originally designed for feature subset selection in linear regression (Friedman et al., 2001; Hesterberg et al., 2008; Efron and Hastie, 2004). A column is chosen from the set of candidates such that the correlations with the residual are equal for all active variables. This is possible because all variables (columns) are known a priori, which clearly does not hold for candidate pivot columns. The monotonically decreasing maximal correlation in the active set is therefore not guaranteed. Moreover, the addition of a column to the active set potentially affects the values in all subsequent columns. Naively recomputing these values at each iteration would be computationally prohibitive.
Let the predictor variables $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p$ be vectors in $\mathbb{R}^n$, arranged in a matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$. The associated response vector is $\mathbf{y} \in \mathbb{R}^n$. The LAR method iteratively selects the predictor variables, and the corresponding coefficients are updated at the same time as they are moved towards their least-squares values. At the last step, the method reaches the least-squares solution $\hat{\boldsymbol{\beta}}$. The high-level pseudocode is as follows:

1. Start with the residual $\mathbf{r} = \mathbf{y}$ and regression coefficients $\beta_1 = \beta_2 = \ldots = \beta_p = 0$.

2. Find the variable $\mathbf{x}_j$ most correlated with $\mathbf{r}$.

3. Move $\beta_j$ towards its least-squares coefficient until another variable $\mathbf{x}_k$ has as much correlation with $\mathbf{r}$.

4. Move $\beta_j$ and $\beta_k$ in the direction of their joint least-squares coefficient, until some new variable $\mathbf{x}_l$ has as much correlation with $\mathbf{r}$.

5. Repeat until all variables have been entered, reaching the least-squares solution.
Note that the method is easily modified to include early stopping, after a maximum number of predictor variables has been selected. Importantly, the method can be viewed as a supervised version of the incomplete Cholesky decomposition of the linear kernel, which corresponds to the usual inner product in $\mathbb{R}^n$.
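Under the standardization assumptions stated below, the steps above can be sketched as a minimal numpy implementation of plain LAR (following Efron et al.; this is the original algorithm, not the mklaren variant, and the helper names are ours):

```python
import numpy as np

def lar(X, y, n_steps):
    """Minimal least-angle regression sketch (Efron et al., 2004).
    Assumes columns of X are centered and unit-norm, and y is centered."""
    n, p = X.shape
    mu = np.zeros(n)                        # current regression line
    active = []                             # active set A
    for _ in range(n_steps):
        c = X.T @ (y - mu)                  # correlations with the residual
        inactive = [j for j in range(p) if j not in active]
        active.append(inactive[int(np.argmax(np.abs(c[inactive])))])
        s = np.sign(c[active])              # signs of active correlations
        Xa = X[:, active] * s               # active matrix
        G = Xa.T @ Xa                       # Gram matrix of the active set
        Ginv1 = np.linalg.solve(G, np.ones(len(active)))
        Aa = 1.0 / np.sqrt(Ginv1.sum())     # normalizing constant
        u = Aa * (Xa @ Ginv1)               # equiangular vector (bisector)
        a = X.T @ u
        C = np.max(np.abs(c))               # common active correlation
        if len(active) == p:
            gamma = C / Aa                  # final step: jump to least squares
        else:
            cands = []
            for j in range(p):
                if j in active:
                    continue
                for num, den in ((C - c[j], Aa - a[j]), (C + c[j], Aa + a[j])):
                    if den > 1e-12:         # keep only positive step sizes
                        cands.append(num / den)
            gamma = min(cands)              # smallest step at which a new variable ties
        mu = mu + gamma * u
    return mu, active

# Usage: after p steps, LAR reaches the least-squares fit
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)
y = rng.normal(size=60)
y -= y.mean()
mu, active = lar(X, y, n_steps=4)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With `n_steps` equal to the number of columns, the final step size $C/A_{\mathcal{A}}$ makes the residual orthogonal to all columns, so `mu` coincides with the least-squares fit `X @ beta`.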
Assume that the predictor variables are standardized and the response has had its mean subtracted off:

$$\sum_{i=1}^{n} x_{ij} = 0, \qquad \sum_{i=1}^{n} x_{ij}^2 = 1, \qquad \sum_{i=1}^{n} y_i = 0, \qquad \text{for } j = 1, 2, \ldots, p. \qquad (29)$$
Initialize the regression line $\boldsymbol{\mu}$, the residual $\mathbf{r}$ and the active set $\mathcal{A}$:

$$\boldsymbol{\mu} = \mathbf{0}, \qquad \mathbf{r} = \mathbf{y} - \boldsymbol{\mu} = \mathbf{y}, \qquad \mathcal{A} = \emptyset. \qquad (30)$$
The LAR algorithm estimates $\boldsymbol{\mu}$ in successive steps. Say the predictor $\mathbf{x}_j$ has the largest correlation with $\mathbf{r}$. Then, the index $j$ is added to the active set and the regression line and residual are updated:

$$\mathcal{A} \leftarrow \mathcal{A} \cup \{j\}, \qquad \boldsymbol{\mu} \leftarrow \boldsymbol{\mu} + \gamma \mathbf{u}_{\mathcal{A}}, \qquad \mathbf{r} \leftarrow \mathbf{y} - \boldsymbol{\mu}. \qquad (31)$$
The step size $\gamma$ is set such that a new predictor will enter the model after $\boldsymbol{\mu}$ is updated, at which point all predictors in the active set as well as the entering predictor are equally correlated with $\mathbf{r}$. The key parts are the selection of predictors added to the model and the calculation of the step size $\gamma$.
The active matrix for a subset of indices $\mathcal{A}$ with signs $s_j = \operatorname{sign}(\mathbf{x}_j^T \mathbf{r})$ is defined as

$$\mathbf{X}_{\mathcal{A}} = (\cdots \; s_j \mathbf{x}_j \; \cdots)_{j \in \mathcal{A}}. \qquad (32)$$
By elementary linear algebra, there exists a bisector $\mathbf{u}_{\mathcal{A}}$, an equiangular vector having $\|\mathbf{u}_{\mathcal{A}}\| = 1$ and making equal angles, less than 90 degrees, with the vectors in $\mathbf{X}_{\mathcal{A}}$. Define the Gram matrix, the normalizing constant and the bisector, respectively:

$$\mathcal{G}_{\mathcal{A}} = \mathbf{X}_{\mathcal{A}}^T \mathbf{X}_{\mathcal{A}}, \qquad A_{\mathcal{A}} = \left(\mathbf{1}^T \mathcal{G}_{\mathcal{A}}^{-1} \mathbf{1}\right)^{-1/2}, \qquad \mathbf{u}_{\mathcal{A}} = A_{\mathcal{A}} \, \mathbf{X}_{\mathcal{A}} \mathcal{G}_{\mathcal{A}}^{-1} \mathbf{1}.$$
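Concretely, following Efron et al. (2004), the bisector can be computed from the Gram matrix of the active columns. A small numpy check that it has unit norm and makes equal angles, less than 90 degrees, with every active column:

```python
import numpy as np

rng = np.random.default_rng(4)
Xa = rng.normal(size=(30, 3))              # an active matrix with 3 columns
Xa /= np.linalg.norm(Xa, axis=0)           # unit-norm columns

G = Xa.T @ Xa                              # Gram matrix of the active set
Ginv1 = np.linalg.solve(G, np.ones(3))     # G^{-1} 1
Aa = 1.0 / np.sqrt(Ginv1.sum())            # normalizing constant
u = Aa * (Xa @ Ginv1)                      # equiangular vector (bisector)

# u has unit length and equal, positive inner products with every active column
assert np.isclose(np.linalg.norm(u), 1.0)
assert np.allclose(Xa.T @ u, Aa * np.ones(3))
```

The equal inner products follow directly: $\mathbf{X}_{\mathcal{A}}^T \mathbf{u}_{\mathcal{A}} = A_{\mathcal{A}} \mathcal{G}_{\mathcal{A}} \mathcal{G}_{\mathcal{A}}^{-1} \mathbf{1} = A_{\mathcal{A}} \mathbf{1}$.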