Kernel machines with two layers and multiple kernel learning

01/15/2010
by   Francesco Dinuzzo, et al.
University of Pavia
0

In this paper, the framework of kernel machines with two layers is introduced, generalizing classical kernel methods. The new learning methodology provide a formal connection between computational architectures with multiple layers and the theme of kernel learning in standard regularization methods. First, a representer theorem for two-layer networks is presented, showing that finite linear combinations of kernels on each layer are optimal architectures whenever the corresponding functions solve suitable variational problems in reproducing kernel Hilbert spaces (RKHS). The input-output map expressed by these architectures turns out to be equivalent to a suitable single-layer kernel machines in which the kernel function is also learned from the data. Recently, the so-called multiple kernel learning methods have attracted considerable attention in the machine learning literature. In this paper, multiple kernel learning methods are shown to be specific cases of kernel machines with two layers in which the second layer is linear. Finally, a simple and effective multiple kernel learning method called RLS2 (regularized least squares with two layers) is introduced, and his performances on several learning problems are extensively analyzed. An open source MATLAB toolbox to train and validate RLS2 models with a Graphic User Interface is available.

READ FULL TEXT VIEW PDF

page 1

page 2

page 3

page 4

02/11/2018

Learning Multiple Levels of Representations with Kernel Machines

We propose a connectionist-inspired kernel machine model with three key ...
11/14/2016

Post Training in Deep Learning with Last Kernel

One of the main challenges of deep learning methods is the choice of an ...
07/13/2020

Deep Neural-Kernel Machines

In this chapter we review the main literature related to the recent adva...
06/15/2018

Learning kernels that adapt to GPU

In recent years machine learning methods that nearly interpolate the dat...
06/30/2011

Kernels for Vector-Valued Functions: a Review

Kernel methods are among the most popular techniques in machine learning...
06/02/2021

Transformers are Deep Infinite-Dimensional Non-Mercer Binary Kernel Machines

Despite their ubiquity in core AI fields like natural language processin...
11/15/2017

Optimizing Kernel Machines using Deep Learning

Building highly non-linear and non-parametric models is central to sever...

1 Introduction

Learning by minimizing costs in functional spaces has proven to be an important approach to better understand many estimation problems. Indeed, the functional analytic point of view is the theoretical core of many successful learning methodologies such as smoothing splines, Gaussian processes, and support vector machines

[54, 18, 41, 52, 45, 37], collectively referred to as kernel methods. One of the most appealing properties of kernel methods is optimality according to a variety of representer theorems. These results are usually presented within the theory of RKHS [5], and formalize the intuition that optimal learning machines trained with a finite number of data must be expressed by a finite number of parameters, even when the hypothesis space is infinite dimensional. Representer theorems have been generalized and analyzed in many forms [10, 40, 49, 12, 31, 53, 14, 4], since their first appearance [26].

Recently, has been pointed out that standard kernel machines are somehow limited in their ability to approximate complex functional classes, as a consequence of being shallow architectures. In addition, existing representer theorems only apply to single-layer architectures, though an extension of the theory to include multi-layer networks would be useful to better understand the behavior of current multi-layer architectures and characterize new methods with more flexible approximation capabilities. Such extension is also suggested by complexity theory of circuits [8] as well as by biological motivated learning models, [43]. In the field of kernel methods, the need for complex hypothesis spaces reflecting “broad” prior knowledge has led to the idea of learning the kernel from empirical data simultaneously with the predictor [7, 27, 33, 30, 3, 55, 32]

. Indeed, the difficulty of choosing a good hypothesis space with little available a-priori knowledge is significantly reduced when the kernel is also learned from the data. The flexibility of algorithms implementing the framework of kernel learning makes also possible to address important machine learning issues such as feature selection, learning from heterogeneous sources of data, and multi-scale approximation. A major extension to the framework of classical kernel methods, based on the concept of

hyper-kernels, has been introduced in [33], encompassing many convex kernel learning algorithms. In this paper, the connection between learning the kernel in standard (single layer) kernel machines and learning in multi-layer architectures is analyzed. We introduce the framework of kernel machines with two layers, that also encompasses classical single layer kernel methods as well as many kernel learning algorithms.

Consider a generic architecture whose input-output behavior can be described as a function composition of two layers

(1)

where is a generic set while and are two Hilbert spaces. The problem of learning simultaneously the two layers and from input-output data pairs can be formalized in a functional analytic setting. Indeed, in section 2 it is shown that a representer theorem holds: even if and are searched into infinite dimensional function spaces, optimal learning architectures are finite linear combination of kernel functions on each layer, where optimality is measured according to general regularization functionals. Remarkably, such representer theorem also imply that, upon training, architecture (1) can be equivalently regarded as a standard kernel machine in which the kernel function has been learned from the data. After discussing the general result on the solution representation for two non-linear layers, the attention is focused on the case in which the second layer is linear. In section 3, we introduce a regularization framework that turns out to be equivalent to a general class of methods to perform multiple kernel learning

, that is the simultaneous supervised learning of a predictor and the associated kernel as a convex combination of

basis kernels. The general problem of learning the kernel is receiving a lot of attention in recent years, both from functional analytic point of view and from pure optimization perspectives. Since the earlier works [27, 30], many improved optimization schemes have been proposed [2, 46, 32, 35, 6]. In section 4

, a method called RLS2 (regularized least squares with two layers) based on regularized multiple kernel learning with the square loss function is introduces and studied. Along the line of recent advances in multiple kernel learning

[46, 35, 6], it is shown that the involved optimization can be efficiently carried out using a two-step procedure. For RLS2, two-step optimization turns out to be especially simple and computationally appealing, alternating between the solution of a linear system and a constrained least squares problem. The application of RLS2 on a variety of learning problems is analyzed in section 5. State of the art generalization performances are achieved on several datasets, including multi-class classification of genomic data. An open source MATLAB toolbox to train and validate RLS2 models with a Graphic User Interface is available at http://www.mloss.org. All the proof of Theorems and Lemmas are given in the Appendix.

2 A representer theorem for architectures with two layers

A learning architecture with two layers can be formalized as a map expressed as a function composition as in equation (1). Introduce two RKHS and of vector-valued functions [31] defined over and respectively, with (operator-valued) kernel functions and , and consider the following problem:

Problem 1

Here, are loss functions measuring the approximation of training data, while are two extended-valued non-decreasing functions (not identically ) that play the role of regularization terms. Problem 1 is outside the scope of standard representer theorems [40] due to the presence of the composition . Nevertheless, it still holds that linear combinations of a finite number of kernel functions are optimal solutions, as soon as there exist minimizers.

Theorem 1

If the functional of Problem 1 admit minimizers, then there exist optimal solutions of Problem 1 in the form

Therefore, there exists optimal learning architectures in the following input-output form:

(2)

Theorem 1 is a restriction theorem: the search for solutions of Problem 1 can be restricted to kernel machines with two layers involving a finite number of kernel functions, even when and are infinite dimensional spaces. Notice that Theorem 1 is not an existence theorem, since existence of minimizers is one of the hypotheses. As shown in the next sections, existence can be ensured under mild additional conditions on . Under the general hypothesis of Theorem 1, uniqueness of minimizers in Problem 1 is also not guaranteed, even when loss functions are strictly convex. Notice also that Theorem 1 do admit the presence of optimal solutions not in the form of finite kernel expansions. However, if such solutions exist, then their projections over the finite dimensional span of kernel sections are optimal as well, so that one can restrict the attention to kernel machines with two layers also in this case. Finally, when and are strictly increasing, it holds that every optimal solution of Problem 1 can be expressed as a kernel machine with two layers.

3 Multiple kernel learning as a kernel machine with two layers

Theorem 1 shows that training an architecture with two layers is equivalent to train simultaneously a single-layer kernel network and the kernel function, see equation (2). In this section, it is shown that multiple kernel learning, consisting in simultaneous learning of a finite linear combination of kernels and the associated predictor, can be interpreted as a specific instance of kernel architecture with two layers. Introduce a set of positive kernels defined on , called basis kernels and consider the following choice for and .

  • is an RKHS of vector valued functions associated with the matrix-valued kernel function such that

  • is the RKHS of real valued functions associated with the linear kernel

    where is a diagonal scaling matrix:

For any , let , () denote its components. Introduce the indicator function of the interval defined as

let denote lower semi-continuous convex loss functions and . In the following, we analyze the particular case of Problem 1 in which is a square regularization and is the indicator function regularization:

Problem 2

Let’s briefly discuss the choice of regularizers. First of all, notice that both and are convex functions. Since are convex loss functions and is linear, the problem is separately convex in both and . Apparently, regularizing with the indicator function is equivalent to impose the constraint . Lemma 1 below shows that minimization into the unitary ball can be carried out without any loss of generality.

Lemma 1

Let denote an optimal solution of the following problem:

Then, is an an optimal solution of Problem 2 with and satisfy

(3)

Thanks to the scaling properties coming from the linearity of the second layer and the use of the indicator function, the introduction of an additional regularization parameter can be avoided, thus significantly reducing the complexity of model selection. The next Theorem characterizes optimal solutions of Problem 2.

Theorem 2

There exist optimal solutions and of Problem 2 in the form

Letting , optimal coefficients solves the multiple kernel learning Problem 3 below, where

(4)

Finally, the solution of Problem 2 can be written as in equation (2), where the kernel satisfies

(5)
Problem 3

subject to

(6)

Theorem 2 shows that the variational Problem 2 for a two-layer kernel machine is equivalent to the multiple kernel learning Problem 3. The non-negativity constraints leads to a sparse selection of a subset of basis kernels. In standard formulations, multiple kernel learning problems feature the equality constraint , instead of the inequality in (6). Nevertheless, Lemma 2 below shows that there always exist optimal solutions of Problem 4 satisfying the equality, so that the two optimization problems are equivalent.

A few comments on certain degeneracies in Problem 2 are in order. First of all, observe that the absolute value of optimal coefficients characterizing the two layer kernel machine is given by , but is undetermined. Then, without loss of generality, it is possible to choose . Second, observe that the objective functional of Problem 3 depends on through the product . When is singular, the optimal vector is not unique (independently of ). In particular, if belongs to the null space of , then achieves the same objective value of , for any . This case also occurs in standard (single-layer) kernel methods. One possible way to break the indetermination, again without any loss of generality, is to constrain to belong to the range of . With such additional constraint, there exists such that

(7)

where denote the Moore-Penrose pseudo-inverse (notice that, in general, might be different from ). Remarkably, the introduction of such change of variable makes also possible to derive an addition formulation of Problem 3, which can be shown to be a convex optimization problem. Indeed, by rewriting Problem 3 as a function of , the following problem is obtained:

Problem 4
Lemma 2

Problem 4 is a convex optimization problem and there exists an optimal vector satisfying the equality constraint

(8)

Lemma 2 completes the equivalence between the specific kernel machines with two layers obtained by solving Problem 2 and multiple kernel learning algorithms. The Lemma also gives another important insight into the structure of Problem 3: local minimizers are also global minimizers, a property that directly transfer from Problem 4 through the change of variable (7).

3.1 Linear machines

In applications of standard kernel methods involving high-dimensional input data, the linear kernel on

(9)

plays an important role. Optimization algorithms for linear machines are being the subject of a renewed attention in the literature, due to some important experimental findings. First, it turns out that linear models are already enough flexible to achieve state of the art classification performances in application domains such as text document classification, word-sense disambiguation, and drug design, see e.g. [24]. Second, linear machines can be trained using extremely efficient and scalable algorithms [23, 44, 16]. Finally, linear methods can be also used to solve certain non-linear problems (by using non-linear feature maps), thus ensuring a good trade-off between flexibility and computational convenience.

Linear kernels are also meaningful in the context of multiple kernel learning methods. Indeed, when the input set is a subset of , a possible choice for the set of basis kernels is given by linear kernels on each component:

(10)

Such a choice makes the input output map (2) a linear function:

(11)

where

(12)

Here, an important benefit is sparsity in the vector of weights , that follows immediately from sparsity of vector . In this way, linear multiple kernel learning algorithms can simultaneously perform regularization and linear feature selection. Such property is apparently linked to the introduction of the additional layer in the architecture, since standard kernel machines with one layer are not able to perform any kind of automatic feature selection. From the user’s point of view, linear kernel machines with two layers behave similarly to sparse regularization methods such as the Lasso [50], performing feature selection by varying with continuity a shrinking parameter. However, it seems that regularization methods cannot be interpreted as kernel machines (not even with two layers) and these two classes of algorithms are thus distinct. An instance of linear regularization methods with two layers is proposed in subsection 4.2 and analyzed in the experimental section 5.

4 Regularized least squares with two layers

  
  
  
  while (stopping criterion is not met) do
     
     for  do
        
     end for
     
     
     for  do
        
     end for
      Solution of Problem (7).
     
  end while
Algorithm 1 Alternate optimization for RLS2

In the previous section, a general class of convex optimization problems to learn finite linear combinations of kernels is shown to be equivalent to a two-layer kernel machine. As for standard kernel machines, different choices of loss functions lead to a variety of learning algorithms. For instance, from the results of the previous section it follows that the two-layer version of standard Support Vector Machines with “hinge” loss functions

is equivalent to the SILP (Semi-Infinite Linear Programming) multiple kernel learning problem studied in

[46], whose solution can be computed, for instance, by using gradient descent or SimpleMKL [35].

In this section, attention is focussed on square loss functions and the associated kernel machine with two layers. As we shall show, coefficients and defining the architecture as well as the “equivalent input-output kernel” can be computed by solving a very simple optimization problem. Such problem features the minimization of a quartic functional in , that is separately quadratic in both and

. It is worth noticing that the square loss function can be used to solve regression problems as well as classification ones. Indeed, generalization performances of regularized least squares classifiers have been shown to be comparable to that of Support Vector Machines on many dataset, see

[38, 17] and references therein.

Problem 5 (Regularized least squares with two layers (RLS2))

Let denote the standard -simplex in :

For any fixed , Problem 5 is an unconstrained quadratic optimization problem with respect to . It is then possible to solve for the optimal in closed form as a function of :

(13)

As shown in Lemma 3 below, Problem 5 can be reduced to the following Problem in only.

Problem 6
Lemma 3

The pair is an optimal solution of Problem 5 if and only if equation (13) holds and is an optimal solution of Problem 6.

Along the lines of recent developments in multiple kernel learning optimization [46, 35], we propose a two-step minimization procedure that alternates between kernel and predictor optimization. The specific structure of our problem allows for exact optimization in each of the two phases of the optimization process. Let

For any fixed , minimization with respect to boils down to the following simplex-constrained least squares problem, as ensured by the subsequent Lemma 4.

Problem 7
Lemma 4

For any fixed , the optimal coefficient vector of Problem 5 can be obtained as the solution of Problem 7.

Optimal coefficients can be computed using an iterative two-step procedure such as the Algorithm 1, that alternates between minimization with respect to obtained through the solution of the linear system (13), and the solution of the simplex-constrained least squares Problem 7 in . The non-negativity constraint induce sparsity in the vector , and thus also in the input-output kernel expansion. To understand the initialization of coefficients and in Algorithm 1, consider the limiting solution of the optimization problem when the regularization parameter tends to infinity. Such solution is the most natural starting point for a regularization path because optimal coefficients can be computed in closed form.

Lemma 5

The limiting solution of Problem 5 when is given by

As shown in subsection 4.3, the result of Lemma 5 can be also used to give important insights into the choice of the scaling in the second layer.

4.1 A Bayesian maximum a posteriori interpretation of RLS2

The equivalence between regularization Problem 2 and the multiple kernel learning optimization Problem 8 can be readily exploited to give a Bayesian MAP (maximum a posteriori) interpretation of RLS2. To specify the probabilistic model, we need to put a prior distribution over the set of functions from into and define the data generation model (likelihood). In the following,

denote a real Gaussian distribution with mean

and variance

, a Gaussian measure on the set of functions from into with mean and covariance function , and

the uniform distribution in

over a set of positive finite measure. Let be such that

where is distributed according to a Gaussian measure over with zero mean and (matrix-valued) covariance function , and is a random vector independent of , distributed according to an uniform distribution over the ellipsoid :

Regarding the likelihood, assume that the data set , is generated by drawing pairs independently and identically distributed according to the usual additive Gaussian noise model:

When is a finite dimensional space, the MAP estimate of is:

where maximize the posterior density:

Specifically, we have

It follows that and, by taking the negative logarithm of , that the MAP estimate coincides with the solution of the regularization Problem 2 with square loss functions and . When is an infinite-dimensional function space, the Gaussian measure prior for

do not admit a probability density. Nevertheless, the regularization Problem

2

can be still recovered by understanding MAP estimate as a maximal point of the posterior probability measure, as described in

[21].

4.2 Linear regularized least squares with two layers

As described in subsection 3.1, when the input set is a subset of the linear choice of basis kernels (10) produces linear machines with feature selection capabilities. First of all, recall that standard regularized least squares with the linear kernel (9) boils down to finite-dimensional Tikhonov regularization [51]

also known as ridge regression

[22]:

where denote the matrix of inputs such that . Lemma 6 below states that the linear version of RLS2 is equivalent to a “scaled” ridge regression problem, in which the optimal scaling is also estimated from the data. Let

(14)

For any fixed , let be the number of non-zero coefficients , denote the diagonal sub-matrix of containing all the strictly positive coefficients. Moreover, let denote the scaled sub-matrix of selected features .

Problem 8
Lemma 6

When basis kernels are as in (10), the optimal solution of Problem 5 can be written as in (11)-(12), where solves Problem 8.

From Problem 8, one can easily see that when is fixed to its optimal value, the optimal in Problem 8 is given by the familiar expression:

The result of Lemma 6 can be also used to give an interesting interpretation of linear RLS2. In fact, the coefficient can be interpreted as a quantity proportional to the inverse variance of the -th coefficient . Hence, Problem 8 can be seen as a Bayesian MAP estimation with Gaussian residuals, Gaussian prior on the coefficients and uniform prior over a suitable simplex on the vector of inverse coefficients’s variances.

It is useful to introduce a notion of “degrees of freedom”, see e.g.

[15, 20]. Degrees of freedom is an index more interpretable than the regularization parameter, and can be also used to choose the regularization parameter according to tuning criteria such as [28], AIC [1], BIC [42], GCV [11]. A general expression for the effective degrees of freedom of non-linear kernel regression methods with one layer, based on the SURE (Stein’s Unbiased Risk Estimator) approximation [48] has been recently derived in [13]. For linear RLS2, the following quantity seems an appropriate approximation of the degrees of freedom:

(15)

Expression (15) corresponds to the equivalent degrees of freedom of a linear regularized estimator with regressors fixed to and diagonal regularization . Notice that (15) neglects the non-linear dependence of matrix on the output data and does not coincide with the SURE estimator of the degrees of freedom. Nevertheless, the property holds, so that can be conveniently used to interpret the complexity of the linear RLS2 model (see subsection 5.1 for an example).

4.3 Choice of the scaling and feature/kernel selection

The ability of RLS2 to select features or basis kernels is highly influenced by the scaling in the second layer. In this subsection, we analyze certain scaling rules that are connected with popular statistical indices, often used as “filters” for feature selection [19]. Since the issue of scaling still needs further investigation, it is not excluded that new rules different from those mentioned in this subsection may work better on specific problems.

A key observation is the following: according to Lemma 5, RLS2 with heavy regularization favors basis kernels that maximizes the quantity , that represents a kind of alignment between the kernel and the outputs. This means that RLS2 tends to select kernels that are highly aligned with the outputs. Since each alignment is proportional to the scaling factor , an effective way to choose the scaling is one that makes the alignment a meaningful quantity to maximize. First of all, we discuss the choice of scaling for the linear RLS2 algorithm introduces in subsection 4.2. The generalization to the case of non-linear basis kernels easily follows by analyzing the associated feature maps.

In the linear case, we have , where is the -th feature vector, so that

By choosing

the alignment becomes the squared cosine of the angle between the -th feature vector and the output vector:

In particular, when the outputs and the features are centered to have zero mean, coincides with the squared Pearson correlation coefficient between the outputs and the -th feature, also known as coefficient of determination. This means that RLS2 with heavy regularization selects the features that mostly correlates to the outputs. Since the term is common to all the factors, one can also use

(16)

without changing the profile of solutions along a regularization path (though, the scale of regularization parameters is shifted). Observe that rule (16) may also make sense when data are not centered or centered around values other than the mean. In fact, for some datasets, performances are better without any centering (this is the case, for instance, of Experiment 1 in subsection 5.1). Notice also that (16) only uses training inputs whereas, in a possible variation, one can replace with the vector containing values of the -th feature for both training and test data (when available). The latter procedure sometimes work better than scaling using training inputs only, and will be referred to as transductive scaling in the following. For binary classification with labels , the choice (16) with or without centering still make sense, but other rules are also possible. Let and denote the number of samples in the positive and negative class, respectively, and and denote the within-class mean values of the -th feature:

By choosing

(17)

where and

denote the within class standard deviations of the

-th feature, one obtain

When the two classes are balanced (), boils down to a quantity proportional to the classical Fisher criterion (or signal-to-interference ratio):

Rules (16) and (17) can be generalized to the case of non-linear basis kernels, by observing that non-linear kernels can be always seen as linear upon mapping the data in a suitable feature space. A rule that generalizes (16) is the following [[, see e.g.]]Rakotomamonjy08:

(18)

that amounts to scale each basis kernel by the trace of the kernel matrix, and reduces exactly to (16) in the linear case. Also (18) can be applied with or without centering. A typical centering is normalization in feature space, that amounts to subtract to the basis kernel , before computing (18). A transductive scaling rule can be obtained by extending the sum to both training and test inputs, namely computing the inverse trace of the overall kernel matrix, as in [27]. Finally, a non-linear generalization of (17) is given by the following:

5 Experiments

In this section, the behavior of linear and non-linear RLS2 on several learning problems is analyzed. In subsection 5.1, an illustrative analysis of linear RLS2 is proposed, whose goal is to study the feature selection capabilities and the dependence on the regularization parameter of the algorithm in simple experimental settings. RLS2 with non-linear kernels is analyzed in subsection 5.2, where an extensive benchmark on several regression and classification problems from UCI repository is carried out. Finally, multi-class classification of microarray data is considered in subsection 5.3.

Computations are carried out in a Matlab environment and the sub-problem (Problem 7) is solved using an SMO-like (Sequential Minimal Optimization) algorithm [34]. Current implementation features conjugate gradient to solve linear systems and a sophisticated variable shrinking technique to reduce gradient computations. The stopping criterion for Algorithm 1 used in all the experiments is the following test on the normalized residual of linear system (13):

The choice turns out to be sufficient to make all the coefficients stabilize to a good approximation of their final values. A full discussion of optimization details is outside the scope of the paper. All the experiments have been run on a Core 2 Duo T7700 2.4 GHz, 800 MHz FSB, 4 MB L2 cache, 2 GB RAM.

5.1 Linear RLS2: illustrative experiments

In this subsection, we perform two experiments to analyze the behavior of linear RLS2. In the first experiment, a synthetic dataset is used to investigate the ability of linear RLS2 to perform feature selection. The dependence of generalization performances of RLS2 and other learning algorithms on the training set size is analyzed by means of learning curves. The goal of the second experiment is to illustrate the qualitative dependence of coefficients on the regularization parameter and give an idea of the predictive potentiality of the algorithm.

Experiment 1 (Binary strings data)

Figure 1: Binary strings data: lower bounds of RMSE learning curves. The top plot shows test RMSE for training set sizes between 1 and 150 with five different methods: RLS with ideal kernel (see details in the text), RLS with linear kernel (ridge regression), RLS with Gaussian RBF kernel, linear RLS2 and Lasso. The bottom plot is the “zoomed” version of the top plot for training set sizes between 1 and 30, for RLS with ideal kernel, linear RLS2 and Lasso.

In the first experiment, a synthetic Binary strings dataset has been generated: 250 random binary strings

are obtained by independently sampling each bit from a Bernoulli distribution with

. Then, the outputs have been generated as

where are small independent Gaussian noises with zero mean and standard deviation . In this way, the outputs only depend on the first three bits of the input binary string. The dataset has been divided into a training set of 150 input output pairs and a test set containing the remaining 100 data pairs. We compare the RMSE (root mean squared error) learning curves obtained by varying the training set size using five different methods:

  1. RLS (regularized least squares) with “ideal” kernel:

    (19)
  2. RLS with linear kernel (9) (ridge regression).

  3. RLS with Gaussian RBF kernel

  4. RLS2 with linear basis kernels (10) and scaling (16).

The goal here is to assess the overall quality of regularization paths associated with different regularization algorithms, independently of model selection procedures. To this end, we compute the RMSE on the test data as a function of the training set size and evaluate the lower bounds of the learning curves with respect to variation of the regularization parameter. Results are shown in Figure 3, whose top plot reports the lower bounds of learning curves for all the five algorithms with training set sizes between 1 and 150. Notice that all the five methods are able to learn, asymptotically, the underlying “concept”, up to the precision limit imposed by the noise, but methods that exploits coefficients sparsity are faster to reach the asymptotic error rate. Not surprisingly, the best method is RLS with the “ideal kernel” (19), which incorporates a strong prior knowledge: the dependence of the outputs on the first three bits only. Though knowing in advance the optimal features is not realistic, this method can be used as a reference. The slowest learning curve is that associated to RLS with Gaussian RBF kernel, which only incorporates a notion of smoothness. A good compromise is RLS with linear kernel, which uses the knowledge of linearity of the underlying function, and reaches a good approximation of the asymptotic error rate after seeing about 100 strings. The remaining two methods (Lasso and linear RLS2) incorporate the knowledge of both linearity and sparsity. They are able to learn the underlying concept after seeing only 12 examples, despite the presence of the noise. Since after the 12-th example Lasso and linear RLS2 are basically equivalent, is it interesting to see what happen for very small sample sizes. The bottom plot of Figure 3 is a zoomed version of the top plot with training set sizes between 1 and 30, showing only the learning curves for the three methods that impose sparsity. Until the 8-th example, the Lasso learning curve stays lower than the RLS2 learning curve. After the 8-th example, the RLS2 learning curve stays uniformly lower than the Lasso, indicating an high efficiency in learning noisy sparse linear combinations. Since the multiple kernel learning interpretation of RLS2 suggests that the algorithm is being learning the “ideal” kernel (19) simultaneously with the predictor, it might be interesting to analyze the asymptotic values of kernel coefficients . Indeed, after the first 12 training examples, RLS2 sets to zero all the coefficients except the first three, which are approximately equal to 1/3.

Experiment 2 (Prostate Cancer data)

Figure 2: Prostate Cancer

data: 10-fold cross-validation prediction error curves and their standard errors bands estimated for linear RLS2. Model complexity increases from the right to the left. The vertical line corresponds to the least complex model within one standard error of the best.

Figure 3: Prostate Cancer data: profiles of RLS2 coefficients with respect to a continuous variation of the regularization parameter. Coefficients are plotted versus , the approximate degrees of freedom. The vertical line corresponds to the value of chosen in the validation phase.
Term RLS2 Best subset LS Ridge Lasso PCR PLS
Intercept 2.452 2.477 2.465 2.452 2.468 2.497 2.452
lcavol 0.544 0.740 0.680 0.420 0.533 0.543 0.419
lweight 0.207 0.316 0.263 0.238 0.169 0.289 0.344
age -0.141 -0.046 -0.152 -0.026
lbph 0.104 0.210 0.162 0.002 0.214 0.220
svi 0.170 0.305 0.227 0.094 0.315 0.243
lcp -0.288 0.000 -0.051 0.079
gleason -0.021 0.040 0.232 0.011
pgg45 0.064 0.267 0.133 -0.056 0.084
Test error 0.454 0.492 0.521 0.492 0.479 0.449 0.528
Std error 0.152 0.143 0.179 0.165 0.164 0.105 0.152
Table 1: Prostate Cancer data: comparison of RLS2 with other subset selection and shrinkage methods. Estimated coefficients, test error and their standard error are reported. Results for methods other than RLS2 are taken from [20]. Blank entries corresponds to variables not selected.

Linear RLS2 is applied to the Prostate Cancer dataset, a regression problem whose goal is to predict the level of prostate-specific antigen on the basis of a number of clinical measures in men who were about to receive a radical prostatectomy [47]. These data are used in the textbook [20] to compare different feature selection and shrinkage methods, and have been obtained from the web site http://www-stat.stanford.edu/ElemStatLearn/. Data have been preprocessed by normalizing all the inputs to zero mean and unit standard deviation. The dataset is divided into a training set of 67 examples and a test set of 30 examples. To choose the regularization parameter, the 10-fold cross-validation score has been computed for different values of in the interval on a logarithmic scale. The scaling coefficients are chosen as in (16), thus normalizing each training feature to have unit norm. An intercept term equal to the average of training outputs has been subtracted to the outputs before estimating the other coefficients. For each of the dataset splits, the MSE (mean squared error) has been computed on the validation data. Figure 2 reports average and standard error bands for validation MSE along a regularization path. Following [20], we pick the value of corresponding to the least complex model within one standard error of the best validation score.

In a second phase, the whole training set (67 examples) is used to compute the RLS2 solution with different values of . Figure 3 reports the profile of RLS2 coefficients , see equation (12), along the whole regularization path as a function of the degrees of freedom defined as in (15). RLS2 does a continuous feature selection that may resemble that of the Lasso. However, the dependence of coefficients on the regularization parameter is rather complex and the profile in Figure 2 is not piecewise linear. In correspondence with the value of chosen in the validation phase, RLS2 selects 5 input variables out of 8. Table 1 reports the value of coefficients estimated by RLS2 together with the test error and his standard error. For comparison, Table 1 also reports models and results taken from [20] associated with LS (Least Squares), Best subset regression, Ridge Regression, Lasso regression, PCR (Principal Component Regression), PLS (Partial Least Squares). The best model on these data is PCR, but RLS2 achieves the second lowest test error by using only 5 variables.

5.2 Non-linear RLS2: regression and classification benchmark

Dataset Feature standardization Examples Features Kernels
Auto-mpg Yes 392 7 104
Cpu Yes 209 8 494
Servo No 167 4 156
Housing Yes 506 13 182
Heart Yes 270 13 182
Liver No 345 6 91
Pima Yes 768 8 117
Ionosphere Yes 351 33 442
Wpbc Yes 194 34 455
Sonar No 208 60 793
Table 2: Data sets used in the experiments. The first four are regression problems while the last six are classification problems.

Figure 4: RLS2 on the Auto-mpg (left) and the Cpu (right) dataset: RMSE on the test data (top), number of selected kernels (center), and number of iterations (bottom) along a regularization path.

Figure 5: RLS2 on the Servo (left) and the Housing (right) dataset: RMSE on the test data (top), number of selected kernels (center), and number of iterations (bottom) along a regularization path.

Figure 6: RLS2 on the Heart (left) and Liver (right) dataset: classification accuracy on the test data (top), number of selected kernels (center), and number of iterations (bottom) along a regularization path.

Figure 7: RLS2 on the Pima (left) and Ionosphere (right) dataset: classification accuracy on the test data (top), number of selected kernels (center), and number of iterations (bottom) along a regularization path.

Figure 8: RLS2 on the Wpbc (left) and Sonar (right) dataset: classification accuracy on the test data (top), number of selected kernels (center), and number of iterations (bottom) along a regularization path.
Dataset 60/40 70/30
Auto-mpg
Cpu
Servo
Housing
Heart 84 (3.28)
Liver
Pima
Ionosphere
Wpbc
Sonar
Table 3: RLS2 regression and classification: average and standard deviation of test performance (RMSE for regression, accuracy for classification) over 100 dataset splits. Results with two different training/test ratio are reported: 60/40 (first two columns), and 70/30 (last two columns).
Dataset Split ratio Kernels Iterations Time (s)
Auto-mpg 60/40
70/30
Cpu 60/40
70/30
Servo 60/40
70/30
Housing 60/40
70/30
Heart 60/40
70/30
Liver 60/40
70/30
Pima 60/40
70/30
Ionosphere 60/40
70/30
Wpbc 60/40
70/30
Sonar 60/40
70/30
Table 4: RLS2 regression and classification: number of selected kernels in correspondence with the optimal value of , number of iterations and training time in seconds to compute a regularization path.

In this subsection, benchmark experiments on four regression and six classification problems from UCI repository are illustrated (Table 2). RLS2 has been run on 100 random dataset splits with two different training/test ratios: and . For each dataset split, an approximate regularization path with 30 values of on a logarithmic scale in the interval has been computed. To speed-up the regularization path computation, a warm-start technique is employed: the value of is iteratively decreased, while kernel-expansion coefficients are initialized to their optimal values obtained with the previous value of . Performances are measured by accuracy for classification and RMSE (root mean squared error) for regression. For each dataset split and value of the regularization parameter, the following quantities are computed: prediction performance on the test set, number of selected kernels (number of non-zero ), training time in seconds and number of iterations to compute the whole regularization path. Datasets have been pre-processed by removing examples with missing features and converting categorical features to binary indicators. For some of the datasets (see Table 2) input features have been standardized to have zero mean and unitary standard deviation. For classification, output labels are and predictions are given by the sign of . For regression, an intercept term equal to the mean of training outputs is subtracted to the training data. Basis kernel matrices are pre-computed and the scaling matrix is chosen according to the rule (18) with transductive scaling.

To better compare the results to similar benchmarks for multiple kernel learning, see e.g. [35], the same set of basis kernels for all the datasets has been chosen. We remark that such agnostic approach is not representative of a realistic application of the algorithm, in which the choice of basis kernels should reflect a-priori knowledge about the learning task to be solved. The set of basis kernels contains the following:

  • Polynomial kernels

    with .

  • Gaussian RBF kernels

    with 10 different values of chosen on a logarithmic scale between and .

Kernels on each single feature and on all the features are considered, so that the number of basis kernels is an affine function of the number of features (recall that categorical features have been converted to binary indicators). More precisely, we have .

All the profiles of test prediction performance, number of kernels and number of iterations for the 70/30 dataset split in correspondence with different values of the regularization parameter are reported in Figures 4-8. From the top plots, it can be seen that test performances are relatively stable to variations of the regularization parameter around the optimal value , indicating that RLS2 is robust with respect to the use of different model selection procedures. For regression datasets such as Cpu, Servo, or Housing, optimal performances seems to be reached in correspondence with the un-regularized solution . Lines in light color are associated with single dataset splits, while thick lines are the averages over different dataset splits. The vertical dotted line corresponds to the value of the regularization parameter with best average test performance. The average number of selected kernels vary quite smoothly with respect to the regularization parameter. For large values of , RLS2 chooses only one basis kernel. For small values of , the number of selected kernels grows and exhibits an higher variability. The bottom plots in Figures 4-8 give an idea of the computation burden required by alternate optimization for RLS2 in correspondence with different values of . In correspondence with high values of the regularization parameter, the algorithm converges in a single iteration. This occurs also for the very first value on the regularization path, meaning that the initialization rule is effective. With low values of , RLS2 also converges in a single iteration, since the second layer doesn’t change much from an iteration to the next.

Test performances for regression and classification are summarized in Table 3, where the average and standard deviation with respect to the 100 dataset splits of either RMSE (regression) or accuracy (classification) in correspondence with to the best value of are reported. Performances of other kernel learning algorithms on some of these datasets can be found in [27, 33, 35] and references therein. Another benchmark study that might be useful for comparison is [29]. Comparisons should be handled with care due to the use of different experimental procedures and optimization problems. For instance, [27] uses an 80/20 dataset split ratio, [33] uses 60/40, while [35] uses 70/30. Also, different numbers of dataset splits have been used. Individuating what kind of datasets are better suited to what algorithm is a complex issue, that is certainly worth further investigation. These experiments shows that RLS2 results are competitive and complexity of the model is well controlled by regularization. In particular, state of the art performances are reached on Servo, Housing, Hearth, Pima, Ionosphere. Finally, it should be observed that, although multiple kernel learning machines have been used as black box methods, the use of basis kernels on single features sometimes also selects a subset of relevant features. Such property is remarkable since standard single-layer kernel methods are not able to perform “embedded” feature selection.

Table 4 reports the average and standard deviation of number of selected kernels in correspondence with the optimal value of , number of iterations and training time needed to compute a regularization path for all the regression and classification datasets studied in this subsection. From the columns of selected kernels, it can be seen that a considerable fraction of the overall number of basis kernels is filtered out by the algorithm in correspondence with the optimal value of the regularization parameter. By looking at the number of iterations needed to compute the path with 30 values of the regularization parameter, one can see that the average number of iterations to compute the solution for a single value of is in between 1 and 3, indicating that the warm-start procedure is rather effective at exploiting the continuity of solutions with respect to the regularization parameter. As a matter of fact, most of the optimization work is spent in correspondence with a central interval of values of , as shown in the bottom plots of Figures 4-8. Finally, from the last column, reporting average and standard deviation of training times, it can be seen that, with the current implementation of RLS2, regularization paths for all the datasets in this subsection can be computed in less than one minute in the average (see the introduction of this section for experimental details). Although the current implementation of RLS2 has been designed to be efficient, it is believed that there’s still considerable margin for further computational improvements. This issue may well be the subject of future developments.

5.3 RLS2: multi-class classification of microarray data

Figure 9: 14 Cancers data: profiles of training error, 8-fold validation error and test error for different values of the regularization parameter. Standard error bands for the validation error are also shown. The vertical line corresponds to the least complex model maximizing the validation accuracy.
Method Test errors Selected genes
Support Vector Classifier 14.0 16,063
L1-penalized multinomial 13.0 269
Lasso regression (OVA) 12.5 1,429
L2-penalized discriminant analysis 12.0 16,063
Elastic-net penalized multinomial 11.8 384
Linear RLS2 (OVA) 9.8 855
Table 5: 14 Cancers data: average test error (see the text for details) and number of selected genes for different classification algorithms. For RLS2, the number of selected genes is relative to the least complex model maximizing the validation accuracy. Results for methods other than RLS2 are taken from [20].

RLS2 can be applied to multi-class classification problems by solving several binary classification problems and combining their outcomes. A possible way to combine binary classifiers is the OVA (one versus all) approach, in which each class is compared to all the others and test labels are assigned to the class maximizing the confidence (the real-valued output) of the corresponding binary classifier.

Linear RLS2 with OVA has been applied to the 14 Cancers dataset [36], a delicate multi-class classification problem whose goal is to discriminate between 14 different types of cancer, on the basis of microarray measurements of 16063 gene expressions. Gene measurements and type of cancer (labels) are available for 198 patients, the dataset being already divided in a training set of 144 patients, and a test set of 54 patients. Another important goal in this problem is to individuate a small subset of genes which is relevant to discriminate between the different kind of cancer. [20] reports several results for these data using a variety of classification methods. Algorithms such as the Support Vector Classifier uses all the genes to compute the classification boundaries, while others such as Lasso or Elastic Net are also able to select a subset of relevant genes. Since the feature selection experiment in subsection 5.1 suggests that RLS2 may be very efficient at selecting relevant features from noisy examples, a microarray dataset seems to be an appropriate choice for testing the algorithm.

Gene expressions for each patient have been firstly standardized to have zero mean and variance one. For each binary classifier, coefficients are chosen as , where and are the within-class sample variances computed using all the training data. Such scaling gives more weight to genes whose expressions exhibits small within-class variability, and seems to slightly improve classification performances. A validation accuracy profile has been computed using stratified 8-fold cross validation, where the folds are organized to approximately preserve the class proportions111We thank Trevor Hastie for kindly providing the folds used in their experiments.. For the final model, we pick the highest value of maximizing the validation accuracy. Figure 9 reports the profiles of training accuracy, cross-validation accuracy with corresponding standard error bands, and test accuracy for 50 logarithmically spaced values of the regularization parameter. Table 5 reports the number of test errors and selected genes in correspondence with the value of chosen in the validation phase, for RLS2 and other methods from [20]. Test errors in Table 5 are averages of test errors for different classifiers associated with all the different values of the regularization parameter that maximizes the cross-validation score (this explains the presence of non-integer values). For linear RLS2, such procedure yields a value of about 9.8. In correspondence with the least complex model maximizing the cross-validation accuracy, one obtain 10 test errors using 855 genes. Although the test set size is too small to draw significative conclusions from this comparison, linear RLS2 seems to work rather well on this problem and achieve the best test performances. Such good performance also confirm effectiveness of the OVA multi-class approach for RLS2.

6 Conclusions

The connection between learning with a two-layer network and the problem of learning the kernel has been analyzed. While architectures with more than one layer are justified by a representer theorem, an alternative perspective to look at the problem of kernel learning is proposed. Such perspective makes clear that these two methodologies aim both at increasing the approximation power of standard single layer methods by using machines that can adaptively select functions with a variety of shapes when little prior knowledge is available. In particular, the multiple kernel learning framework is shown to be an important specific case of a more general two-layer architecture. We also introduce RLS2, a new method to perform multiple kernel learning based on regularization with the square loss function and alternate optimization. RLS2 exhibits state of the art performances on several learning problems, including multi-class classification of microarray data. An open source set of MATLAB scripts for RLS2 and linear RLS2 is available at http://www.mloss.org and also includes a Graphic User Interface.

Appendix (proofs)

Proof of Theorem 1

For any fixed , let . By fixing an optimal , Problem 1 can be written as a function of alone as

By standard representer theorems for vector valued functions (see [31] and the remark on monotonicity in [40] after Theorem 1), there exists an optimal in the form

Then, by fixing on optimal in this form, Problem 1 can be written as a function of alone as

where . Notice that the new loss functions depends on . Again, by the single-layer representer theorem the finite kernel expansion for follows. Finally, it is immediate to see that the overall input-output relation can be written as in (2).

Proof of Lemma 1

By linearity of , it is immediate to see that (3) holds. It follows that

In addition,

In the last equation, we exploit the fact that , for any positive , a property that is satisfied only by indicator functions. The thesis follows by letting .

Proof of Theorem 2

Problem 2 is a specific instance of Problem 1. The functional to minimize is bounded below, lower semi-continuous and radially-unbounded with respect to . Existence of minimizers follows by weak-compactness of the unit ball in and . By Theorem 1, there exists an optimal in the form

Introduce the matrix whose rows are denoted by and whose columns are . Then, the -th component of can be written as:

By Theorem 1, there exists an optimal such that