Learning the kernel matrix via predictive low-rank approximations

01/17/2016 · Martin Stražar et al. · University of Ljubljana

Efficient and accurate low-rank approximations of multiple data sources are essential in the era of big data. The scaling of kernel-based learning algorithms to large datasets is limited by the O(n^2) computation and storage complexity of the full kernel matrix, which most recent kernel learning algorithms require. We present the Mklaren algorithm, which approximates multiple kernel matrices and learns a regression model, and is based entirely on geometrical concepts. The algorithm does not require access to the full kernel matrices, yet it accounts for the correlations between all kernels. It uses Incomplete Cholesky decomposition, where pivot selection is based on least-angle regression in the combined, low-dimensional feature space. The algorithm has linear complexity in the number of data points and kernels. When the explicit feature space induced by a kernel can be constructed, a mapping from the dual to the primal Ridge regression weights is used for model interpretation. The Mklaren algorithm was tested on eight standard regression datasets. It outperforms contemporary kernel matrix approximation approaches when learning with multiple kernels. It identifies relevant kernels, achieving higher explained variance than other multiple kernel learning methods for the same number of iterations. Test accuracy equivalent to that obtained with full kernel matrices was achieved at significantly lower approximation ranks. A difference in run times of two orders of magnitude was observed when either the number of samples or kernels exceeded 3000.


1 Introduction

Kernel methods are popular in machine learning as they model relations between objects in feature spaces of arbitrary, even infinite dimension (Schölkopf and Smola, 2002). Kernels are inner product functions and provide a means for rich representations, which is useful for learning in domains not naturally associated with vector spaces, such as structured objects, strings or trees. Computing the inner product values for all pairs of data points to obtain the kernel matrix requires large computation and storage, which scales as O(n^2) in the number of data instances n. Kernel approximations are thus indispensable when learning on large datasets and can be classified in two groups: approximations of the kernel function or approximations of the kernel matrix.

Direct approximation of the kernel function can achieve significant performance gains. A large body of work relies on approximating the frequently used Gaussian kernel, which has a rapidly decaying eigenspectrum, as shown via Bochner's theorem and the concept of random features (Pennington and Yu, 2015; Szabo, 2015; Rahimi and Recht, 2007). Recently, the properties of matrices generated by the Gaussian kernel were further exploited to achieve sublinear complexity in the approximation rank (Yang et al., 2014; Si et al., 2014; Le et al., 2013). In another line of work, low-dimensional features can be derived for translation-invariant kernels based on their Fourier transform (Vedaldi and Zisserman, 2012). These methods currently present the most space- and time-efficient approximations, but are limited to kernels of particular functional forms.

Approximations of the kernel matrix are applicable to any symmetric positive-definite matrix even if the underlying kernel function is unknown. Such approximations, termed data dependent, can be obtained using the eigenvalue decomposition, by selecting a subset of data points (the Nyström method) or by using the Cholesky decomposition, which minimizes the divergence between the original matrix and its low-rank approximation (Rudi et al., 2015; Xu et al., 2015; Li et al., 2015; Gittens and Mahoney, 2013; Williams and Seeger, 2001; Fine and Scheinberg, 2001). However, these methods are unsupervised; they disregard side information such as the target variables. Predictive decompositions use the target variables to obtain a supervised low-rank approximation via Cholesky with Side Information (Bach and Jordan, 2005) or by minimizing Bregman divergence measures (Kulis et al., 2009). The effects of kernel matrix approximation have been discussed in the context of sparse Gaussian processes (Quiñonero Candela and Rasmussen, 2005), where the approximation leads to a degenerate Gaussian process. Learning the inducing points is equivalent to learning the pivots in matrix decompositions, but can be replaced by optimizing over the whole input domain (Snelson and Ghahramani, 2006; Wilson, 2015), which requires a continuous domain. Cao et al. (2015) relax this limitation with a hybrid approach to kernel function and inducing set optimization. All methods listed so far operate on single kernels. This presents a limitation, since the choice of the optimal kernel for a given learning task is often non-trivial.

Similarly to kernel (matrix) approximation, approaches that learn the optimal kernel for a given task (dependent on the data) can be classified into i) learning the kernel (covariance) function or ii) learning the kernel matrix. Learning the kernel function is possible in continuous domains, where kernel hyperparameters are optimized to match the training data (Mohsenzadeh et al., 2015; Bishop, 2006). Alternatively, kernel functions can be learned via Fourier transforms from the corresponding power spectra (Gal and Turner, 2015; Wilson and Adams, 2013).

Multiple kernel learning (MKL) methods learn the optimal weighted sum of given kernel matrices with respect to the target variables, such as class labels (Gönen and Alpaydin, 2011). Different kernels can thus be used to model the data, and their relative importance is assessed via the predictive accuracy, offering insights into the problem domain. Depending on the user-defined constraints, the resulting optimization problems are quadratic (QP) or semidefinite programs (SDP), assuming complete knowledge of the kernel matrices. Cortes et al. (2012) solve a QP on centered kernel matrices, which corresponds to centering the data in the original input space. Low-rank approximations have been used for MKL, e.g., by performing Gram-Schmidt orthogonalization followed by MKL (Kandola et al., 2002). In a recent study, the combined kernel matrix is learned via an efficient generalized forward-backward algorithm, assuming, however, that low-rank approximations are available beforehand (Rakotomamonjy and Chanda, 2014). Gönen et al. develop a Bayesian treatment for joint dimensionality reduction and classification, solved via gradient descent (Gönen and Alpaydin, 2010) or variational approximation (Kaski and Gonen, 2014), while assuming access to full kernel matrices and not exploiting their symmetric structure.

In this work, we propose a joint treatment of low-rank kernel approximations and MKL. We assume as input: i) a set of objects with corresponding continuous target variables and ii) a set of kernels that define inner products on the same objects. We designed mklaren, a greedy algorithm that couples Incomplete Cholesky decomposition and least-angle regression to learn a low-rank approximation of the combined kernel matrix. Our approach to pivot column selection is closely related to the selection of feature vectors in least-angle regression (LAR) (Efron and Hastie, 2004). At each step, the method keeps a current estimate of the regression model within the span of the current approximation. In comparison to existing methods, it has the following two advantages. First, the method is aware of multiple kernel functions. In each step, the next pivot column to be added is chosen greedily from all remaining pivot columns of all kernels. Kernels that give more information about the current residual are thus more likely to be selected. In contrast to methods that assume access to the complete kernel matrices, the importance of a kernel is estimated at the time of its approximation. This is also different from performing the decomposition for each kernel independently and subsequently determining kernel weights. Second, the criterion only considers the gain with respect to the current regression residual; the notion of kernel matrix approximation error is set aside entirely. Even though the accuracy of the approximation is related to the similarity with the model that uses the full kernel matrix (Cortes et al., 2010), it was recently shown that i) the expected generalization error is related to the maximal marginal degrees of freedom rather than the approximation error and ii) empirically, low-rank approximation can lead to a regularization-like effect (Bach, 2012). Nevertheless, the residual approximation error is guaranteed to decrease monotonically by definition of Cholesky decompositions.

When an explicit feature space representation is available for the kernels, the relation between primal and dual regression weights is used for model interpretation. In contrast to MKL algorithms, which rely on convex optimization or Bayesian methods, our approach relies solely on geometrical principles, leading to a straightforward algorithm with low computational complexity in the number of data points and kernels.

A common assumption when applying matrix approximation or MKL is that the resulting decomposition can only be applied in transductive learning (Zhang et al., 2012; Lanckriet and Cristianini, 2004), i.e., the test data samples are included in the model training phase. We apply the lemma on the uniqueness of the low-rank approximation for a fixed active set to relate the Incomplete Cholesky decomposition and the Nyström method. With this we circumvent the limitation to the transductive setting and infer the low-rank representation of arbitrary test data points, enabling out-of-sample prediction.

The predictive performance, run times and model interpretation were evaluated empirically on multiple benchmark regression datasets. The provided implementation of mklaren compared favorably against related low-rank kernel matrix approximation and state-of-the-art MKL approaches. Additionally, we isolate the effect of low-rank approximation and compare the method with full kernel matrix MKL methods on a very large set of rank-one kernels.

The article is structured as follows. The mklaren algorithm, with pivot column updates based on the LAR selection criterion, is presented in Section 2. Auxiliary results regarding out-of-sample prediction, model interpretation and computational complexity analysis are given in Section 3. Experimental evaluation is presented in Section 4. A description of the least-angle regression method is given in the Appendix. The algorithm implementation and the code to reproduce the presented experiments are available at https://github.com/mstrazar/mklaren.

2 Multiple kernel learning with least-angle regression

Let {x_1, ..., x_n} be a set of points in a Hilbert space of arbitrary dimension, associated with targets y ∈ R^n. Let the Hilbert spaces be endowed with respective inner product (kernel) functions k_1, ..., k_p. The kernels are positive definite and map pairs of points to R. Hence, a data point can be represented in multiple inner product spaces, which can be related to different data views or representations. Evaluating k_q for each pair of points determines a kernel matrix K_q ∈ R^{n×n}. The goal of a predictive (supervised) approximation algorithm is to learn the corresponding low-rank approximations G_q ∈ R^{n×k_q}, where K_q ≈ G_q G_q^T and k_q ≪ n, using additional information on the targets. In the context of regression, the regression line μ is learned simultaneously with the approximations, as their construction depends on the residual vector r = y − μ.

The mklaren algorithm simultaneously learns low-rank approximations G_q of the kernel matrices associated with each kernel function and the regression line μ. It uses Incomplete Cholesky decomposition (ICD) to iteratively construct each G_q. At each iteration, a kernel and a pivot column are chosen using a heuristic that evaluates the information explained about the residual r for each potential new column of G_q. This is achieved by using least-angle regression in the space spanned by the previously selected (normalized and centered) pivot columns of all G_q.

The high-level pseudo code of the mklaren algorithm is shown in Algorithm 1, and its steps are described in detail in the following subsections.

2.1 Simultaneous Incomplete Cholesky decompositions

We start with the description of the Incomplete Cholesky decomposition (ICD) of a single kernel matrix, and later extend it to the simultaneous decomposition of multiple kernels. A kernel matrix K ∈ R^{n×n} is approximated with a Cholesky factor G ∈ R^{n×k}. The ICD is a family of methods that produce a finite sequence of matrices G^(1), G^(2), ..., G^(k), with G^(j) ∈ R^{n×j}, such that

    K ≈ G^(k) (G^(k))^T.    (1)

Initially, G is initialized to an empty matrix with zero columns. A diagonal vector d, representing the lower bound on the approximation gain, is initialized as d = diag(K). The active set A = ∅ keeps track of the selected pivot columns. At iteration k, a pivot i is selected from the remaining set {1, ..., n} \ A and its pivot column g (the k-th column of G) is computed as follows:

    g(i) = sqrt(d(i)),    g(j) = ( K(j, i) − Σ_{l<k} G(j, l) G(i, l) ) / g(i)  for j ≠ i.    (2)

Importantly, only the information in one column of K is required at each iteration and K need never be computed explicitly. The selected pivot is added to the active set, and the counter and the diagonal vector are updated:

    A ← A ∪ {i},    k ← k + 1,    d(j) ← d(j) − g(j)^2  for all j.    (3)
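For illustration, the following minimal numpy sketch implements the single-kernel ICD with a greedy pivot choice based on the diagonal gain; the function name and the unsupervised pivot rule are assumptions of this sketch only, since mklaren replaces the pivot rule with the LAR-based criterion of Section 2.2.

import numpy as np

def icd(K, rank):
    # Incomplete Cholesky decomposition of a kernel matrix K (n x n).
    # Returns G (n x rank) with K ~= G G^T and the list of selected pivots.
    n = K.shape[0]
    G = np.zeros((n, rank))
    d = np.diag(K).copy()              # lower bound on the approximation gain
    active = []                        # active set of selected pivot indices
    for k in range(rank):
        i = int(np.argmax(d))          # greedy, unsupervised pivot choice
        if d[i] <= 1e-12:              # nothing left to gain
            return G[:, :k], active
        g = (K[:, i] - G[:, :k] @ G[i, :k]) / np.sqrt(d[i])   # Cholesky step (Eq. 2)
        g[i] = np.sqrt(d[i])
        G[:, k] = g
        d = d - g ** 2                 # update the diagonal (Eq. 3)
        active.append(i)
    return G, active

In the sketch, K is passed as a full matrix for simplicity; only its diagonal and one column per iteration are actually accessed, so the full matrix need not be stored when the kernel function is available.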

In the case of multiple kernels, each kernel function k_q determines a corresponding kernel matrix K_q, which is approximated with a Cholesky factor G_q. An example scenario is depicted in Fig. 1.

Figure 1: Overview of the variables included in a hypothetical model using three kernels. Kernel matrices drawn with dashed lines are never computed explicitly. The circle, rectangle and triangle markers represent the selected pivot columns of the three kernels, respectively.

The set of all Cholesky factors is used to construct a combined feature matrix X for least-angle regression, as follows. At any point, assume the existence of the residual vector r, to be constructed in Section 2.2. For each selected pivot column g of kernel q, define the following transformation:

    h = s · Pg / ||Pg||,    (4)

where the operator P = I − (1/n) 1 1^T is the centering projection and s is the sign of the correlation r^T Pg. Each h is normalized and makes an angle of at most 90 degrees with the residual r. The set of columns h spans the combined feature space, which is equivalent to any matrix containing this same set of columns (in any order):

    X = [ h_1  h_2  ...  h_k ].    (5)

Fig. 1 shows one such matrix. Note that applying the operator P to every column is equivalent to centering the positive semidefinite matrix X X^T, which represents the combined kernel.

Least-angle regression is used to iteratively select the next pivot column and thus determine the order of the columns in X, while simultaneously updating the regression line. The next kernel and pivot are selected from all remaining candidate pivots, based on the current residual r. The corresponding pivot column is computed using the Cholesky step in Eq. 2 and added to its G_q. At any iteration, each G_q may contain a different number of columns, as their selection depends on the relevance for explaining the residual.
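A minimal sketch of the transformation in Eq. 4, assuming that centering is performed by subtracting the column mean and that a zero sign is mapped to +1; the helper name is hypothetical.

import numpy as np

def transform_column(g, r):
    # Center, sign-flip and normalize a Cholesky column g against the residual r (Eq. 4).
    gc = g - g.mean()                   # centering projection (I - (1/n) 1 1^T) applied to g
    s = np.sign(gc @ r)                 # sign of the correlation with the residual
    s = 1.0 if s == 0 else float(s)
    return s * gc / np.linalg.norm(gc)  # unit norm, angle with r at most 90 degrees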

2.2 Pivot selection based on Least-angle regression

Least-angle regression (LAR) is an active set method for feature subset selection in linear regression (see the Appendix and Efron and Hastie (2004) for a thorough description). Here, we propose an idea based on the LAR column selection to determine the next pivot column to be added to one of the G_q and consequently to the combined feature matrix X.

The original LAR method assumes the availability of all variables representing the covariates (column vectors) in the sample data matrix. In our case, however, this matrix is X and it is constructed iteratively. The adaptation of the LAR-based column selection is non-trivial, since the exact values of the new columns are unknown at selection time.

This section describes a method to construct X given its columns and to learn the regression line μ. In favor of clarity, we assume (only in this section) that the values of all candidate columns are known and describe their ordering in X. The problem of unknown candidate pivot column values is postponed to Section 2.3.

The matrix X is initialized to an empty matrix. The regression line μ and the residual r are initialized as

    μ = 0,    r = y.    (6)

By construction, ||h|| = 1 and h^T r ≥ 0 for every transformed column h. The columns will be added to X in a defined ordering

    X = [ h_{q_1, i_1}  h_{q_2, i_2}  ... ],    (7)

where a unique (kernel, pivot) pair is selected for each position. The ordering depends on the correlation with the residual r. Therefore, the Cholesky factors containing pivot columns with more information about the current residual are selected preferentially.

The column selection procedure is depicted in Fig. 2a and is defined as follows. At the first iteration, the vector maximizing the correlation with the residual is chosen and added to X.

At each subsequent iteration, X contains the previously selected columns. By elementary linear algebra, there exists a bisector u, having unit norm and making equal angles, less than 90 degrees, with the vectors currently in X. Updating the regression line along the direction u and the residual accordingly,

    μ ← μ + γ u,    r ← r − γ u,    (8)

causes the correlations h^T r to change equally for all h in X, for an arbitrary step size γ. The value of γ is set such that some new column h not in X will have the same correlation with r as all the columns already in X:

    h^T r(γ) = h_j^T r(γ)  for every h_j in X.    (9)

The step size γ and the new column are selected as follows. Define the quantities

    C = h_j^T r  and  A = h_j^T u  for any h_j already in X,    (10)

which are equal for all columns in X, together with c = h^T r and a = h^T u for each candidate column h. Then,

    γ = min+_{h ∉ X} { (C − c) / (A − a), (C + c) / (A + a) },    (11)

where min+ denotes the minimum over positive arguments for each choice of h. The selected column vector is the minimizer of Eq. 11 and is inserted at the next position in X. For the last column vector (as there are no further column vectors to choose from), the step size simplifies to γ = C / A, yielding the ordinary least-squares solution for μ and r.
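The following numpy sketch illustrates the bisector and step-size computation; it follows the standard LAR equiangular construction (Efron and Hastie, 2004) and is an assumption about the exact form of Eqs. 10-11, with hypothetical helper names.

import numpy as np

def equiangular(Xa):
    # Unit-norm bisector making equal angles with all columns of the active matrix Xa.
    Ginv = np.linalg.inv(Xa.T @ Xa)
    ones = np.ones(Xa.shape[1])
    A = 1.0 / np.sqrt(ones @ Ginv @ ones)   # common correlation: Xa^T u = A * 1
    u = Xa @ (A * (Ginv @ ones))
    return u, A

def lar_step(Xa, candidates, r):
    # Step size gamma and index of the next column to enter (sketch of Eqs. 10-11).
    u, A = equiangular(Xa)
    C = np.max(Xa.T @ r)                    # common correlation of active columns
    best_gamma, best_j = C / A, None        # default: final step to least squares
    for j, h in enumerate(candidates):      # candidate columns not yet in Xa
        c, a = h @ r, h @ u
        for num, den in ((C - c, A - a), (C + c, A + a)):
            if abs(den) > 1e-12 and 1e-12 < num / den < best_gamma:
                best_gamma, best_j = num / den, j
    return best_gamma, best_j

Updating μ ← μ + γu and r ← r − γu with the returned step size reproduces the update in Eq. 8.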

The problem in our case is that the exact values of the potential pivot columns not yet in the decomposition, and of their corresponding transformed columns, are unknown. Explicit calculation of all candidate columns using the Cholesky step in Eq. 2 would yield quadratic computational complexity, as their values depend on all previously selected pivots. The issue is addressed by using approximations that are less expensive to compute, as described in the following section.

Figure 2: a) Updating the regression line within the combined feature matrix containing two vectors. The new residual upon selection of a column is obtained by moving the regression line along the bisector and updating the residual accordingly. The step size is increased until some new vector has the same correlation (angle) with the residual as the vectors already included. b) Schematic representation of selected pivot columns and look-ahead columns.

2.3 Look-ahead decompositions

The selection of a new column vector to be added to the combined feature matrix X is based only on the values c and a in Eq. 11. Instead of explicitly calculating each candidate pivot column at each iteration, we use an approximate column vector. The approach uses an idea similar to the look-ahead (information) columns in (Cao et al., 2015; Bach and Jordan, 2005).

Consider a kernel matrix K, its current Cholesky factor G and active set A. By the definition of ICD in Eq. 2, the values of a candidate pivot column at step k and pivot i are:

    g(i) = sqrt(d(i)),    g(j) = ( K(j, i) − Σ_{l<k} G(j, l) G(i, l) ) / g(i)  for j ≠ i.    (12)

The main computational cost in the above definition is the computation of a column of the kernel matrix for each candidate pivot i. Instead, δ look-ahead columns are used to obtain a look-ahead approximation (Fig. 2b). This defines the approximate values:

(13)

Given the approximate column and consequently its transform, consider the computation of the correlations in Eq. 11:

(14)

Inserting the approximate column as in Eq. 13, the denominator cancels out. The norm can be computed as:

(15)

Similarly, the dot product with the residual is computed as:

(16)

The computation of the correlation with the bisector is analogous. Correctly ordering the computations yields a low computational cost per column. Note that several of the matrices involved are the same for all columns (independent of the pivot i) and need to be computed only once per iteration.

The values c and a can be computed efficiently for all kernel matrices and enable the selection of the next (kernel, pivot column) pair to be added to the corresponding G_q and consequently to X. After the selection, a Cholesky step (Eq. 2) is performed to compute the exact column and its transform:

(17)

The computation of a new column renders the look-ahead columns of G_q invalid. After applying Eq. 17, all look-ahead columns are recomputed using standard Cholesky steps with pivot selection based on the current maximal value in d.

The exact values of the new column and its transform determine the column to be added to X and enable the correct computation of the correlations c, a and the step size γ in Eq. 11. The regression line and the residual can then be updated according to Eq. 8.
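One plausible realization of the look-ahead approximation is sketched below: δ extra Cholesky columns are maintained, and a candidate pivot column is approximated from the look-ahead part of the current approximation, rescaled by the square root of its diagonal gain. The exact form of Eqs. 12-16 may differ; the function names and the rescaling are assumptions of this sketch.

import numpy as np

def approx_candidate_column(G_look, d, i, k):
    # Approximate the candidate pivot column for pivot i at step k.
    # G_look holds the k selected columns plus delta look-ahead columns; the
    # candidate column is read off the look-ahead part of G_look @ G_look.T
    # and scaled so that its pivot entry is close to sqrt(d[i]).
    if d[i] <= 1e-12:
        return np.zeros(G_look.shape[0])
    return (G_look[:, k:] @ G_look[i, k:]) / np.sqrt(d[i])

def candidate_scores(G_look, d, k, r, u):
    # Approximate correlations with the residual r and bisector u for all pivots.
    scores = []
    for i in np.flatnonzero(d > 1e-12):
        g = approx_candidate_column(G_look, d, i, k)
        gc = g - g.mean()                          # centering as in Eq. 4
        nrm = np.linalg.norm(gc) + 1e-12
        scores.append((i, (gc @ r) / nrm, (gc @ u) / nrm))   # (pivot, c, a)
    return scores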

2.4 The mklaren algorithm

The steps described in the previous sections complete the mklaren algorithm. Given a sample of n data objects with targets y and p kernel functions, the user specifies three additional parameters: the maximum rank K of the combined feature matrix, the number of look-ahead columns δ, and the regularization parameter λ (constraining the norm of the regression weights, discussed in Section 3.4).

The variables related to the regression line (μ, the residual r and the bisector u) and to the individual decompositions (active sets A_q, column counters k_q) are initialized at the beginning of Algorithm 1. Each G_q is initialized using a standard ICD with δ look-ahead columns, as described in Section 2.3.

The main loop is executed until the total number of selected pivot columns equals K. At each iteration, a kernel q and a pivot column i are selected and added to G_q and consequently to the combined feature matrix X. For each kernel and each candidate pivot, the approximate correlations c and a are computed. Based on these approximated values, the kernel and pivot are selected. Given the selected kernel q and pivot i, the exact pivot column and its transform are computed. The new pivot column is added to G_q, the counter k_q is incremented and the look-ahead columns are recomputed using standard ICD.

Having computed the exact column, the true values of c and a can be computed, and the regression line and the residual are updated.

The regression coefficients β solving X β = μ, required for out-of-sample prediction, can be obtained by constructing and solving the linear system discussed in Section 3.1.

Input:
      {x_1, ..., x_n}, a set of objects,
      k_1, ..., k_p, kernel functions defined on the objects,
      y, regression targets,
      K, maximum total rank,
      δ, number of look-ahead columns,
      λ, regularization parameter.
Result:
      G_1, ..., G_p, Cholesky factors,
      X, combined feature space,
      A_1, ..., A_p, active sets of pivot indices,
      μ, regression line on the training set,
      β, regression coefficients.
1  Initialize:
2      residual r = y, bisector u = 0, regression line μ = 0, active sets A_q = ∅ and counters k_q = 0 for q = 1, ..., p. Compute standard Cholesky decompositions with δ look-ahead columns for each kernel.
   while Σ_q k_q < K do
3      Compute approximate c and a for each kernel q and candidate pivot i (Eq. 14)
4      Select the kernel q and pivot i based on the minimum in Eq. 11
5      Compute the pivot column (Eq. 2) and its transform h (Eq. 4)
6          Update the active set A_q and counter k_q
7          Append the new column to G_q
8          Append h to X
9      Recompute the look-ahead columns of G_q using standard ICD
10     Compute the true c and a (Eq. 11)
11     Compute the bisector u of the columns in X except the newly added h (Eq. 33)
12     Compute the step size γ (Eq. 11) and update
13         μ ← μ + γ u
14         r ← r − γ u
15 end
Solve the linear system for β using Eq. 18.
Algorithm 1: The mklaren algorithm pseudocode.

3 Auxiliary theoretical results

This section presents auxiliary theoretical results required for out-of-sample prediction (Sections 3.1-3.2), model interpretation using the relation between primal and dual regression coefficients (Section 3.3), regularization (Section 3.4), and computational complexity (Section 3.5).

3.1 Computing the regression coefficients

The regression coefficients β are computed from the regression line μ and the combined feature matrix X defined in Eq. 5, using the relation

    β = R^{-1} Q^T μ,    (18)

where X = Q R is the thin QR decomposition (Golub and Van Loan, 2012).
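A minimal numpy realization of Eq. 18, assuming X is the combined feature matrix and mu the fitted regression line:

import numpy as np
from scipy.linalg import solve_triangular

def regression_coefficients(X, mu):
    # Solve X @ beta = mu via the thin QR decomposition (Eq. 18).
    Q, R = np.linalg.qr(X, mode="reduced")   # X = Q R with Q: n x K, R: K x K upper triangular
    return solve_triangular(R, Q.T @ mu)     # beta = R^{-1} Q^T mu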

3.2 Out-of-sample prediction

Inference of Cholesky factors corresponding to test (unseen) samples is possible without explicitly repeating the Cholesky steps. The coefficients are then used to predict the responses for new samples. To simplify notation, we show the approach for one kernel matrix and its corresponding Cholesky factors, while the computation for multiple kernels is analogous.

Nyström approximation. Let A be an arbitrary active set of pivot indices. The Nyström approximation (Williams and Seeger, 2001) of the kernel matrix K is defined as follows:

    K̃ = K_{:,A} (K_{A,A})^{-1} K_{A,:}.    (19)

The construction of the active set A crucially influences the prediction performance. Note that mklaren defines a method to construct A.

Proposition. The Incomplete Cholesky decomposition with pivots A yields the same approximation as the Nyström approximation using the active set A:

    G G^T = K_{:,A} (K_{A,A})^{-1} K_{A,:}.    (20)

The proof follows directly from Bach and Jordan (2005), Proposition 1. There exists a unique matrix that is: (i) symmetric, (ii) has its column space spanned by the columns indexed by A and (iii) coincides with K on those columns. It follows that both the Incomplete Cholesky decomposition and the Nyström method result in the same approximation matrix.

Corollary. Let G be the Cholesky factor obtained on the training set using the pivot indices A. Let K_{*,A} be the values of the kernel function evaluated between all test samples and the training samples in the active set A. The Cholesky factor for the test samples is inferred using the linear transform T = (G_{A,:})^{-T}:

    G_* = K_{*,A} (G_{A,:})^{-T}.    (21)

The matrix T is inexpensive to compute and can be stored permanently after the training phase. Hence, the Cholesky factors for the test samples are computed from the inner products between the test samples and the active set. The combined feature matrix and the predictions are obtained after centering and normalizing against the training Cholesky factors:

    ŷ_* = X_* β,    (22)

where X_* is the transformed test feature matrix and β is defined in Eq. 18.
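A sketch of Eqs. 21-22 for a single kernel, assuming that the same centering and normalization statistics computed on the training columns are reused for the test columns; the helper names and the stored statistics are assumptions of this sketch.

import numpy as np

def test_cholesky_factors(K_test_active, G, active):
    # Infer test-sample Cholesky factors: G_test = K(test, A) @ (G[A, :])^{-T} (Eq. 21).
    T = np.linalg.inv(G[active, :]).T        # small |A| x |A| transform, computed once
    return K_test_active @ T

def predict(K_test_active, G, active, col_means, col_norms, signs, beta):
    # Out-of-sample prediction after centering/normalizing against the training columns (Eq. 22).
    G_test = test_cholesky_factors(K_test_active, G, active)
    X_test = signs * (G_test - col_means) / col_norms    # reuse training-set statistics
    return X_test @ beta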

3.3 Computing the dual coefficients

Even when the kernel matrices are approximated, a limited form of model interpretation is still possible for a certain class of kernels. Again, we show the approach for one kernel matrix and the combined feature matrix, while the computation for multiple kernels is analogous.

Kernel ridge regression is often stated in terms of dual coefficients α, satisfying the relation:

    X^T α = β.    (23)

This system of equations has more unknowns than equations. The vector α with minimal norm can be obtained by solving the following least-norm problem:

    minimize ||α||  subject to  X^T α = β.    (24)

The problem has an analytical solution equal to

    α = X (X^T X)^{-1} β.    (25)

Obtaining the dual coefficients can be useful if the range of the explicit feature map φ induced by a kernel is finite, such that k(x, x') = φ(x)^T φ(x'), which is the case for linear, polynomial, and various string kernels (Sonnenburg et al., 2005). An interpretation of the regression coefficients in the range of φ is obtained by computing the feature matrix Φ for the training set and considering

    w = Φ^T α.    (26)

Moreover, if the vector α is sparse, only the relevant portions of Φ need to be computed. This condition can be enforced by using techniques such as matching pursuit when solving for α (Bach et al., 2010).
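A sketch of the minimal-norm dual coefficients (Eq. 25) and the induced weights in the explicit feature space (Eq. 26); the assumed dual relation X^T α = β and the variable names follow the reconstruction above.

import numpy as np

def dual_coefficients(X, beta):
    # Minimal-norm alpha satisfying X.T @ alpha = beta (Eqs. 24-25).
    return X @ np.linalg.solve(X.T @ X, beta)

def primal_weights(Phi, alpha):
    # Interpretation in the explicit feature space: w = Phi^T alpha (Eq. 26).
    return Phi.T @ alpha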

3.4 ℓ2-norm regularization

Regularization is achieved by constraining the ℓ2 norm of the weights β. Zou and Hastie (2005) prove the following lemma, which shows that the ℓ2-regularized regression problem can be stated as ordinary least squares using an appropriate augmentation of the data (X, y). The lemma assumes standardized columns of X and a centered response y.

Lemma. Define the augmented data set (X_λ, y_λ) as

    X_λ = [ X ; sqrt(λ) I ],    y_λ = [ y ; 0 ].

The least-squares solution on (X_λ, y_λ) is then equivalent to Ridge regression of the original data with parameter λ. For proof, see Zou and Hastie (2005). The augmented dataset can be included in LAR to achieve the ℓ2-regularized solution. This is achieved by augmenting the columns of the combined feature matrix in Eq. 5:

    (27)

This definition is equivalent to performing LAR in the augmented space, resulting in an ℓ2-regularized solution for β after K steps of the approximation. It is straightforward to modify Eq. 11 and Eqs. 14-16 accordingly.
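The augmentation in the lemma can be sketched as follows; whether mklaren additionally rescales the augmented columns is not shown and is left out of this sketch.

import numpy as np

def augment(X, y, lam):
    # Augmented data set whose ordinary least-squares solution equals
    # Ridge regression on (X, y) with parameter lam.
    n, k = X.shape
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(k)])   # append sqrt(lam) * I below X
    y_aug = np.concatenate([y, np.zeros(k)])           # append k zeros to the targets
    return X_aug, y_aug

# Check: the normal equations of (X_aug, y_aug) are (X^T X + lam I) beta = X^T y,
# i.e., exactly the Ridge regression solution on the original data.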

3.5 Computational complexity

The mklaren algorithm scales as a linear function of both the number of data points n and the number of kernels p. The computational complexity is

    (28)

The look-ahead Cholesky decompositions are standard Cholesky decompositions with δ pivots and complexity O(nδ^2). The main loop is executed K times. The selection of (kernel, pivot) pairs is based on the LAR criterion, which includes inverting X^T X of size k × k, thus having a complexity of O(k^3). However, as each step is a rank-one modification of X^T X, the Sherman-Morrison-Woodbury formula for matrix inversion (Meyer, 2000) can be used to achieve O(k^2) complexity per update. The correlations with the bisector in Eq. 14 and with the residual are computed for all p kernels. Recomputation of the look-ahead Cholesky factors requires δ standard Cholesky steps. The computation of the step size is of the same complexity as the correlation computations. Updating the regression line is O(n). The QR decomposition in Eq. 18 takes O(nK^2) and the linear transform in Eq. 21 is computed once after training.
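The rank-one-style inverse update mentioned above can be sketched with the block-inverse (bordering) identity for appending a column, which achieves the O(k^2) update once the cross products are available; this is an illustration and not necessarily the exact form used in mklaren.

import numpy as np

def append_column_inverse(XtX_inv, X, h):
    # Update (X^T X)^{-1} when a column h is appended to X.
    # The inverse update itself costs O(k^2); forming the cross products costs O(nk).
    b = X.T @ h                              # cross products with the existing columns
    c = float(h @ h)
    Ainv_b = XtX_inv @ b
    s = c - b @ Ainv_b                       # Schur complement (scalar)
    top_left = XtX_inv + np.outer(Ainv_b, Ainv_b) / s
    return np.block([
        [top_left,               -Ainv_b[:, None] / s],
        [-Ainv_b[None, :] / s,   np.array([[1.0 / s]])],
    ])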

4 Experiments

In this section, we provide an empirical evaluation of the proposed method on well-known regression datasets. We compare mklaren with several established low-rank kernel matrix approximation methods: Incomplete Cholesky Decomposition (icd, Fine and Scheinberg, 2001), Cholesky with Side Information (csi, Bach and Jordan, 2005) and the Nyström method (Nyström, Williams and Seeger, 2001).

We also compare mklaren with a family of state-of-the-art multiple kernel learning methods that use the full kernel matrix. The comparison was performed on a sentiment analysis data set with a large number of rank-one kernels (Cortes et al., 2012).

4.1 Comparison with low-rank approximations

K = 14
Dataset mklaren csi icd Nyström uniform
boston 4.393 0.432
kin 0.018 0.000
pumadyn 1.252 0.032
abalone 2.638 0.116
comp 5.288 0.461
ionosphere 0.283 0.017
bank 0.036 0.001
diabetes 54.680 3.61

K = 28
Dataset mklaren csi icd Nyström uniform
boston 3.792 0.454
kin 0.016 0.001
pumadyn 1.257 0.032
abalone 2.526 0.097
comp 3.100 0.942
ionosphere 0.234 0.028
bank 0.035 0.001
diabetes 55.220 3.56

K = 42
Dataset mklaren csi icd Nyström uniform
boston 3.493 0.489
kin 0.014 0.000
pumadyn 1.251 0.027
abalone 2.500 0.110
comp 1.330 0.409
ionosphere 0.221 0.012
bank 0.034 0.002
diabetes 55.214 4.03

Table 1: Comparison of regression performance (RMSE) on test sets via 5-fold cross-validation for different values of the rank K (top: K=14, middle: K=28, bottom: K=42). Shown in bold is the low-rank approximation method with the lowest RMSE.

The main advantage of mklaren over established kernel matrix approximation methods is the simultaneous approximation of multiple kernels, which considers the current approximation to the regression line and greedily selects the next kernel and pivot to include in the decomposition. To elucidate this, we performed a comparison on eight well-known regression datasets: abalone, bank, boston, comp-active, diabetes, ionosphere, kinematics and pumadyn (http://archive.ics.uci.edu/ml/, http://www.cs.toronto.edu/~delve/data/datasets.html).

Similar to Cortes et al. (2012), seven Gaussian kernels with different length scale parameters are used. The Gaussian kernel function is defined as k(x, x') = exp(−||x − x'||^2 / (2σ^2)), where the length scale parameter σ is varied over a fixed range. For the approximation methods icd, csi and Nyström, each kernel matrix was approximated independently using a fixed maximum rank. The combined feature space of the seven kernel matrices was used with ridge regression.
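The kernel family used in the experiments can be sketched as follows; the particular length-scale values are hypothetical placeholders, as the exact range is not reproduced here.

import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kernels(X, lengthscales=(0.1, 0.3, 1.0, 3.0, 10.0, 30.0, 100.0)):
    # Return a list of Gaussian kernel matrices, one per length scale (values hypothetical).
    sq_dists = cdist(X, X, metric="sqeuclidean")
    return [np.exp(-sq_dists / (2.0 * s ** 2)) for s in lengthscales]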

For mklaren, the approximation is defined simultaneously for all kernels, and the maximum total rank was set to seven times the maximum rank of the individual kernels used in icd, csi and Nyström. Thus, the low-rank feature space of all four methods had exactly the same dimension. The uniform kernel combination (uniform) using the full kernel matrix was included as an empirical lower bound.

The performance was assessed using 5-fold cross-validation as follows. Initially, up to 1000 data points were selected randomly from each dataset. For each random split of the data set, a training set containing 60% of the data was used for kernel matrix approximation and fitting the regression model. A validation set containing 20% of the data was used to select the regularization parameter from a fixed range. The final reported performance, using the root mean squared error (RMSE), was obtained on the test set containing the remaining 20% of the data. All variables were standardized and the targets were centered to zero mean. The look-ahead parameter was set to 10 for mklaren and csi.

The results for different settings of K are shown in Table 1. Not surprisingly, the performance of the supervised mklaren and csi is consistently superior to the unsupervised icd and Nyström. Moreover, mklaren outperforms csi on the majority of regression tasks, especially at lower values of K. At higher values of K, the difference vanishes as all approximation methods recover sufficient information about the feature space induced by the kernels.

Dataset n mklaren csi icd Nyström
boston 506 42 63 119
kin 1000 63
pumadyn 1000 49 56 98
abalone 1000 21 28 35 49
comp 1000 49 63
ionosphere 351 14 14 42 35
bank 1000 21 42 42 112
diabetes 442 14 14 14 21

Table 2: Comparison of the minimal rank for which the RMSE differs by at most one standard deviation from the RMSE obtained with the full kernel matrices using uniform alignment. The number of data samples is denoted by n (second column). Shown in bold is the method achieving performance equivalent to uniform at the lowest rank.

It is interesting to compare the utilization of the vectors in the low-dimensional feature space. Table 2 shows the minimal setting of K where the performance is at most one standard deviation away from the performance obtained by uniform. On seven out of eight datasets, mklaren reaches performance equivalent to uniform at the smallest setting of K. The differences in ranks among all four evaluated methods in Table 2 are statistically significant (p=0.0012, Friedman rank-sum test). Additionally, the difference in ranks between mklaren and csi is statistically significant (Wilcoxon signed-rank test, p=0.03552). Only on the diabetes dataset do the unsupervised icd and Nyström outperform the supervised methods at low ranks. However, at higher settings of K the performance of csi and mklaren overtakes icd and Nyström, as can be seen in Table 1.

Overall, the results confirm the utility of the greedy approach to select not only the pivots, but also the kernels to be approximated, and suggest that mklaren is the method of choice when competitive performance with very low-rank feature spaces is desired. The kernels that are not added to the decomposition can be discarded. This point is discussed further in the next subsection.

4.2 Comparison with MKL methods on rank-one kernels

The comparison of mklaren to multiple kernel learning methods using the full kernel matrix is challenging as it is unrealistic to expect improved performance with low-rank approximation methods. Although the restriction to low-rank feature spaces may result in implicit regularization and improved performance as a consequence, the difference in implicit dimension of the feature space makes the comparison difficult (Bach, 2012).

We focus on the ability of mklaren to select from a set of kernels in a way that takes into account the implicit correlations between the kernels. To this end, we built on the empirical analysis of Cortes et al. (2012), which used four well-known sentiment analysis datasets compiled by Blitzer et al. (2007). In each dataset, the examples are user reviews of products and the target is the product rating on a discrete scale. The features are counts of the 4000 most frequent unigrams and bigrams in each dataset. Each feature was represented by a rank-one kernel, thus enabling the use of multiple kernel learning for feature selection and explicit control over the feature space dimension. The datasets, books, electronics, kitchen and dvd, each contain a moderate number of examples. The splits into training and test parts were included as part of the data set.

We compared mklaren with three state-of-the-art multiple kernel learning methods, all based on maximizing centered kernel alignment (Cortes et al., 2012). The align method infers the kernel weights independently, while alignf and alignfc consider the between-kernel correlations when maximizing the alignment. The combined kernel learned by all three methods was used with a kernel ridge regression model. The align method is linear in the number of kernels, while alignf and alignfc are cubic, as they include solving an unconstrained (alignf) or a constrained (alignfc) QP.

Figure 3: RMSE on the test set for MKL methods. The rank is equal to the number of kernels included.

When testing different ranks K, the features were first filtered according to descending centered alignment for align, alignf and alignfc prior to optimization. When using mklaren, the pivot columns were selected from the complete set of 4000 features. The parameter was set to 1. Note that in this scenario, mklaren is equivalent to the original LAR algorithm, thus excluding the effect of low-rank approximation and comparing only the kernel selection part. This way, the same dimension of the feature space was ensured for all methods.

The performance was measured via 5-fold cross-validation. At each step, 80% of the training set was used for kernel matrix approximation (mklaren) or for determining the kernel weights (align, alignf, alignfc). The remaining 20% of the training set was used for selecting the regularization parameter from a fixed range, and the final performance was reported on the test set using RMSE.

The results for different settings of K are shown in Fig. 3. For low settings of K, mklaren outperforms all four other MKL methods that assume full kernel matrices. The performance of mklaren at low K is within one standard deviation of the best performance obtained by any of the methods using 160 features, showing that the greedy kernel and pivot selection criterion considers the implicit correlations between kernels.

However, there is an important difference in computational complexity. Note that mklaren is linear in the number of kernels, which presents a practical advantage over alignf and alignfc. The comparison of the methods' run times is shown in Fig. 4.

We compared the methods' run times on a synthetic dataset with Gaussian kernels, varying the rank and the number of samples. Since the other methods (align, alignf, alignfc, uniform) require the computation of the whole kernel matrix, mklaren was significantly more efficient (up to 3 orders of magnitude with 4000 samples).

The experiments with a varying number of kernels were performed on the books dataset. The centered kernel alignment value can be computed efficiently due to the use of rank-one linear kernels, without explicit computation of the outer product. This proves very efficient for the uniform and align methods, where the weights are computed independently. While the mklaren method is also linear in the number of kernels, it accounts for the between-kernel correlations. The overhead of calculating low-rank approximations pays off when the number of kernels exceeds 2000. Thus, the effect of accounting for between-kernel correlations is achieved at a significantly lower computational cost.

Finally, we compare the methods with respect to feature selection on the kitchen dataset. Each of the methods mklaren, align, alignf, and alignfc returns an ordering of the kernels (features). With mklaren, this order is obtained as the pivot columns corresponding to the kernels are added to the approximation. With the alignment-based methods, we use the order induced by the kernel weight vector. In Fig. 5, we display the top 40 features obtained from each such ordering, shown as words on the horizontal axis. We incrementally add features to an active set. As each feature is added, we fit an ordinary least-squares model using all features selected so far. The explained variance is calculated as the ratio of the difference in RMSE on the training set to the total variance. The arrows below each word indicate the sign of the corresponding weight in the model.

Intuitively, the slope (change in explained variance) is higher for features corresponding to words associated with strong sentiments. This is most notable for words such as great, good, love, etc. Not surprisingly, the order in which features are added to the model critically influences the explained variance. Here, mklaren outperforms the alignment-based methods. Due to its linear complexity in the number of kernels, the features strongly correlated with the response are identified early, irrespective of their magnitude. On the other hand, the centered alignment appears to be biased towards words with a high number of nonzero entries in the dataset, such as prepositions. Moreover, the words associated with negative or positive sentiments are approximately balanced, according to the signs of the weights. The results confirm that mklaren can be used for model interpretation.

Figure 4: Comparison of running times. (left) Time versus the number of samples on a synthetic dataset with multiple Gaussian kernels. (right) Time versus the number of kernels on the books training dataset.
Figure 5: Increase in explained variance upon incrementally including features to an ordinary least-squares model. The order of features is determined by the magnitude of kernel weights for align, alignf and alignfc or the order of selection by mklaren. Explained variance is measured as a ratio of training RMSE vs. total variance. Arrows indicate positive (black) or negative (gray) sign of the feature in the model weight vector upon inclusion. Highlighted are words "great" and "not", which significantly alter the explained variance when discovered by align, alignf and alignfc models.

5 Conclusion

Subquadratic complexity in the number of training examples is essential in large-scale applications of kernel methods. Learning the kernel matrix efficiently from the data and selecting the relevant portions of the data early can reduce time and storage requirements further up the machine learning pipeline. The complexity with respect to the number of kernels should not be disregarded when the number of kernels is large. Using a greedy low-rank approximation to multiple kernels, we achieve linear complexity in the number of kernels and data points without sacrificing the consideration of between-kernel correlations. Moreover, although the approach learns a regression model, it is applicable within any kernel-based model. The extension to classification or ranking tasks is an interesting subject for future work. Contrary to recent kernel matrix approximations, we present an idea based entirely on geometric principles, which is not limited to transductive learning. With the abundance of different data representations, we expect kernel methods to remain essential in machine learning applications.

Appendix

Least-angle regression

Least-angle regression (LAR) is an active set method originally designed for feature subset selection in linear regression (Friedman et al., 2001; Hesterberg et al., 2008; Efron and Hastie, 2004). A column is chosen from the set of candidates such that the correlations with the residual are equal for all active variables. This is possible because all variables (columns) are known a priori, which clearly does not hold for candidate pivot columns. The monotonically decreasing maximal correlation in the active set is therefore not guaranteed. Moreover, the addition of a column to the active set potentially affects the values in all further columns. Naively recomputing these values at each iteration would yield quadratic computational complexity.

Let the predictor variables be vectors in R^n, arranged in a matrix X. The associated response vector is y. The LAR method iteratively selects predictor variables, and the corresponding coefficients are updated at the same time as they are moved towards their least-squares values. At the last step, the method reaches the least-squares solution.

The high-level pseudo code is as follows:

  1. Start with the residual r = y and all regression coefficients set to zero.

  2. Find the variable most correlated with r.

  3. Move its coefficient towards the least-squares value until another variable has as much correlation with r.

  4. Move both coefficients in the direction of their joint least-squares values, until some new variable has as much correlation with r.

  5. Repeat until all variables have been entered, reaching the least-squares solution.

Note that the method is easily modified to include early stopping, after a maximum number of selected predictor variables have been included. Importantly, the method can be viewed as a version of supervised Incomplete Cholesky decomposition of the linear kernel, which corresponds to the usual inner product between data vectors.

Assume that the predictor variables are standardized and the response has had its mean subtracted off:

    Σ_i x_{ij} = 0,    Σ_i x_{ij}^2 = 1,    Σ_i y_i = 0.    (29)

Initialize the regression line μ, the residual r and the active set A:

    μ = 0,    r = y,    A = ∅.    (30)

The LAR algorithm estimates μ in successive steps. Say the predictor x_j has the largest correlation with r. Then, the index j is added to the active set, and the regression line and residual are updated:

    μ ← μ + γ u_A,    r ← r − γ u_A.    (31)

The step size γ is set such that a new predictor will enter the model after μ is updated, and all predictors in the active set as well as the new one will be equally correlated with r. The key parts are the selection of the predictors added to the model and the calculation of the step size.

The active matrix X_A for a subset of indices A with signs s_j is defined as

    X_A = [ s_j x_j : j ∈ A ],  where s_j is the sign of the correlation of x_j with r.    (32)

By elementary linear algebra, there exists a bisector u_A, an equiangular vector having unit norm and making equal angles, less than 90 degrees, with the vectors in X_A. Define the following quantities: