1 Introduction
Learning by minimizing costs in functional spaces has proven to be an important approach to better understand many estimation problems. Indeed, the functional analytic point of view is the theoretical core of many successful learning methodologies such as smoothing splines, Gaussian processes, and support vector machines
[54, 18, 41, 52, 45, 37], collectively referred to as kernel methods. One of the most appealing properties of kernel methods is optimality according to a variety of representer theorems. These results are usually presented within the theory of RKHS [5], and formalize the intuition that optimal learning machines trained with a finite number of data must be expressed by a finite number of parameters, even when the hypothesis space is infinite dimensional. Representer theorems have been generalized and analyzed in many forms [10, 40, 49, 12, 31, 53, 14, 4], since their first appearance [26].Recently, has been pointed out that standard kernel machines are somehow limited in their ability to approximate complex functional classes, as a consequence of being shallow architectures. In addition, existing representer theorems only apply to singlelayer architectures, though an extension of the theory to include multilayer networks would be useful to better understand the behavior of current multilayer architectures and characterize new methods with more flexible approximation capabilities. Such extension is also suggested by complexity theory of circuits [8] as well as by biological motivated learning models, [43]. In the field of kernel methods, the need for complex hypothesis spaces reflecting “broad” prior knowledge has led to the idea of learning the kernel from empirical data simultaneously with the predictor [7, 27, 33, 30, 3, 55, 32]
. Indeed, the difficulty of choosing a good hypothesis space with little available apriori knowledge is significantly reduced when the kernel is also learned from the data. The flexibility of algorithms implementing the framework of kernel learning makes also possible to address important machine learning issues such as feature selection, learning from heterogeneous sources of data, and multiscale approximation. A major extension to the framework of classical kernel methods, based on the concept of
hyperkernels, has been introduced in [33], encompassing many convex kernel learning algorithms. In this paper, the connection between learning the kernel in standard (single layer) kernel machines and learning in multilayer architectures is analyzed. We introduce the framework of kernel machines with two layers, that also encompasses classical single layer kernel methods as well as many kernel learning algorithms.Consider a generic architecture whose inputoutput behavior can be described as a function composition of two layers
(1) 
where is a generic set while and are two Hilbert spaces. The problem of learning simultaneously the two layers and from inputoutput data pairs can be formalized in a functional analytic setting. Indeed, in section 2 it is shown that a representer theorem holds: even if and are searched into infinite dimensional function spaces, optimal learning architectures are finite linear combination of kernel functions on each layer, where optimality is measured according to general regularization functionals. Remarkably, such representer theorem also imply that, upon training, architecture (1) can be equivalently regarded as a standard kernel machine in which the kernel function has been learned from the data. After discussing the general result on the solution representation for two nonlinear layers, the attention is focused on the case in which the second layer is linear. In section 3, we introduce a regularization framework that turns out to be equivalent to a general class of methods to perform multiple kernel learning
, that is the simultaneous supervised learning of a predictor and the associated kernel as a convex combination of
basis kernels. The general problem of learning the kernel is receiving a lot of attention in recent years, both from functional analytic point of view and from pure optimization perspectives. Since the earlier works [27, 30], many improved optimization schemes have been proposed [2, 46, 32, 35, 6]. In section 4, a method called RLS2 (regularized least squares with two layers) based on regularized multiple kernel learning with the square loss function is introduces and studied. Along the line of recent advances in multiple kernel learning
[46, 35, 6], it is shown that the involved optimization can be efficiently carried out using a twostep procedure. For RLS2, twostep optimization turns out to be especially simple and computationally appealing, alternating between the solution of a linear system and a constrained least squares problem. The application of RLS2 on a variety of learning problems is analyzed in section 5. State of the art generalization performances are achieved on several datasets, including multiclass classification of genomic data. An open source MATLAB toolbox to train and validate RLS2 models with a Graphic User Interface is available at http://www.mloss.org. All the proof of Theorems and Lemmas are given in the Appendix.2 A representer theorem for architectures with two layers
A learning architecture with two layers can be formalized as a map expressed as a function composition as in equation (1). Introduce two RKHS and of vectorvalued functions [31] defined over and respectively, with (operatorvalued) kernel functions and , and consider the following problem:
Problem 1
Here, are loss functions measuring the approximation of training data, while are two extendedvalued nondecreasing functions (not identically ) that play the role of regularization terms. Problem 1 is outside the scope of standard representer theorems [40] due to the presence of the composition . Nevertheless, it still holds that linear combinations of a finite number of kernel functions are optimal solutions, as soon as there exist minimizers.
Theorem 1
Theorem 1 is a restriction theorem: the search for solutions of Problem 1 can be restricted to kernel machines with two layers involving a finite number of kernel functions, even when and are infinite dimensional spaces. Notice that Theorem 1 is not an existence theorem, since existence of minimizers is one of the hypotheses. As shown in the next sections, existence can be ensured under mild additional conditions on . Under the general hypothesis of Theorem 1, uniqueness of minimizers in Problem 1 is also not guaranteed, even when loss functions are strictly convex. Notice also that Theorem 1 do admit the presence of optimal solutions not in the form of finite kernel expansions. However, if such solutions exist, then their projections over the finite dimensional span of kernel sections are optimal as well, so that one can restrict the attention to kernel machines with two layers also in this case. Finally, when and are strictly increasing, it holds that every optimal solution of Problem 1 can be expressed as a kernel machine with two layers.
3 Multiple kernel learning as a kernel machine with two layers
Theorem 1 shows that training an architecture with two layers is equivalent to train simultaneously a singlelayer kernel network and the kernel function, see equation (2). In this section, it is shown that multiple kernel learning, consisting in simultaneous learning of a finite linear combination of kernels and the associated predictor, can be interpreted as a specific instance of kernel architecture with two layers. Introduce a set of positive kernels defined on , called basis kernels and consider the following choice for and .

is an RKHS of vector valued functions associated with the matrixvalued kernel function such that

is the RKHS of real valued functions associated with the linear kernel
where is a diagonal scaling matrix:
For any , let , () denote its components. Introduce the indicator function of the interval defined as
let denote lower semicontinuous convex loss functions and . In the following, we analyze the particular case of Problem 1 in which is a square regularization and is the indicator function regularization:
Problem 2
Let’s briefly discuss the choice of regularizers. First of all, notice that both and are convex functions. Since are convex loss functions and is linear, the problem is separately convex in both and . Apparently, regularizing with the indicator function is equivalent to impose the constraint . Lemma 1 below shows that minimization into the unitary ball can be carried out without any loss of generality.
Lemma 1
Let denote an optimal solution of the following problem:
Then, is an an optimal solution of Problem 2 with and satisfy
(3) 
Thanks to the scaling properties coming from the linearity of the second layer and the use of the indicator function, the introduction of an additional regularization parameter can be avoided, thus significantly reducing the complexity of model selection. The next Theorem characterizes optimal solutions of Problem 2.
Theorem 2
Problem 3
subject to
(6) 
Theorem 2 shows that the variational Problem 2 for a twolayer kernel machine is equivalent to the multiple kernel learning Problem 3. The nonnegativity constraints leads to a sparse selection of a subset of basis kernels. In standard formulations, multiple kernel learning problems feature the equality constraint , instead of the inequality in (6). Nevertheless, Lemma 2 below shows that there always exist optimal solutions of Problem 4 satisfying the equality, so that the two optimization problems are equivalent.
A few comments on certain degeneracies in Problem 2 are in order. First of all, observe that the absolute value of optimal coefficients characterizing the two layer kernel machine is given by , but is undetermined. Then, without loss of generality, it is possible to choose . Second, observe that the objective functional of Problem 3 depends on through the product . When is singular, the optimal vector is not unique (independently of ). In particular, if belongs to the null space of , then achieves the same objective value of , for any . This case also occurs in standard (singlelayer) kernel methods. One possible way to break the indetermination, again without any loss of generality, is to constrain to belong to the range of . With such additional constraint, there exists such that
(7) 
where denote the MoorePenrose pseudoinverse (notice that, in general, might be different from ). Remarkably, the introduction of such change of variable makes also possible to derive an addition formulation of Problem 3, which can be shown to be a convex optimization problem. Indeed, by rewriting Problem 3 as a function of , the following problem is obtained:
Problem 4
Lemma 2
Problem 4 is a convex optimization problem and there exists an optimal vector satisfying the equality constraint
(8) 
Lemma 2 completes the equivalence between the specific kernel machines with two layers obtained by solving Problem 2 and multiple kernel learning algorithms. The Lemma also gives another important insight into the structure of Problem 3: local minimizers are also global minimizers, a property that directly transfer from Problem 4 through the change of variable (7).
3.1 Linear machines
In applications of standard kernel methods involving highdimensional input data, the linear kernel on
(9) 
plays an important role. Optimization algorithms for linear machines are being the subject of a renewed attention in the literature, due to some important experimental findings. First, it turns out that linear models are already enough flexible to achieve state of the art classification performances in application domains such as text document classification, wordsense disambiguation, and drug design, see e.g. [24]. Second, linear machines can be trained using extremely efficient and scalable algorithms [23, 44, 16]. Finally, linear methods can be also used to solve certain nonlinear problems (by using nonlinear feature maps), thus ensuring a good tradeoff between flexibility and computational convenience.
Linear kernels are also meaningful in the context of multiple kernel learning methods. Indeed, when the input set is a subset of , a possible choice for the set of basis kernels is given by linear kernels on each component:
(10) 
Such a choice makes the input output map (2) a linear function:
(11) 
where
(12) 
Here, an important benefit is sparsity in the vector of weights , that follows immediately from sparsity of vector . In this way, linear multiple kernel learning algorithms can simultaneously perform regularization and linear feature selection. Such property is apparently linked to the introduction of the additional layer in the architecture, since standard kernel machines with one layer are not able to perform any kind of automatic feature selection. From the user’s point of view, linear kernel machines with two layers behave similarly to sparse regularization methods such as the Lasso [50], performing feature selection by varying with continuity a shrinking parameter. However, it seems that regularization methods cannot be interpreted as kernel machines (not even with two layers) and these two classes of algorithms are thus distinct. An instance of linear regularization methods with two layers is proposed in subsection 4.2 and analyzed in the experimental section 5.
4 Regularized least squares with two layers
In the previous section, a general class of convex optimization problems to learn finite linear combinations of kernels is shown to be equivalent to a twolayer kernel machine. As for standard kernel machines, different choices of loss functions lead to a variety of learning algorithms. For instance, from the results of the previous section it follows that the twolayer version of standard Support Vector Machines with “hinge” loss functions
is equivalent to the SILP (SemiInfinite Linear Programming) multiple kernel learning problem studied in
[46], whose solution can be computed, for instance, by using gradient descent or SimpleMKL [35].In this section, attention is focussed on square loss functions and the associated kernel machine with two layers. As we shall show, coefficients and defining the architecture as well as the “equivalent inputoutput kernel” can be computed by solving a very simple optimization problem. Such problem features the minimization of a quartic functional in , that is separately quadratic in both and
. It is worth noticing that the square loss function can be used to solve regression problems as well as classification ones. Indeed, generalization performances of regularized least squares classifiers have been shown to be comparable to that of Support Vector Machines on many dataset, see
[38, 17] and references therein.Problem 5 (Regularized least squares with two layers (RLS2))
Let denote the standard simplex in :
For any fixed , Problem 5 is an unconstrained quadratic optimization problem with respect to . It is then possible to solve for the optimal in closed form as a function of :
(13) 
Problem 6
Lemma 3
Along the lines of recent developments in multiple kernel learning optimization [46, 35], we propose a twostep minimization procedure that alternates between kernel and predictor optimization. The specific structure of our problem allows for exact optimization in each of the two phases of the optimization process. Let
For any fixed , minimization with respect to boils down to the following simplexconstrained least squares problem, as ensured by the subsequent Lemma 4.
Problem 7
Lemma 4
Optimal coefficients can be computed using an iterative twostep procedure such as the Algorithm 1, that alternates between minimization with respect to obtained through the solution of the linear system (13), and the solution of the simplexconstrained least squares Problem 7 in . The nonnegativity constraint induce sparsity in the vector , and thus also in the inputoutput kernel expansion. To understand the initialization of coefficients and in Algorithm 1, consider the limiting solution of the optimization problem when the regularization parameter tends to infinity. Such solution is the most natural starting point for a regularization path because optimal coefficients can be computed in closed form.
Lemma 5
The limiting solution of Problem 5 when is given by
As shown in subsection 4.3, the result of Lemma 5 can be also used to give important insights into the choice of the scaling in the second layer.
4.1 A Bayesian maximum a posteriori interpretation of RLS2
The equivalence between regularization Problem 2 and the multiple kernel learning optimization Problem 8 can be readily exploited to give a Bayesian MAP (maximum a posteriori) interpretation of RLS2. To specify the probabilistic model, we need to put a prior distribution over the set of functions from into and define the data generation model (likelihood). In the following,
denote a real Gaussian distribution with mean
and variance
, a Gaussian measure on the set of functions from into with mean and covariance function , andthe uniform distribution in
over a set of positive finite measure. Let be such thatwhere is distributed according to a Gaussian measure over with zero mean and (matrixvalued) covariance function , and is a random vector independent of , distributed according to an uniform distribution over the ellipsoid :
Regarding the likelihood, assume that the data set , is generated by drawing pairs independently and identically distributed according to the usual additive Gaussian noise model:
When is a finite dimensional space, the MAP estimate of is:
where maximize the posterior density:
Specifically, we have
It follows that and, by taking the negative logarithm of , that the MAP estimate coincides with the solution of the regularization Problem 2 with square loss functions and . When is an infinitedimensional function space, the Gaussian measure prior for
do not admit a probability density. Nevertheless, the regularization Problem
2can be still recovered by understanding MAP estimate as a maximal point of the posterior probability measure, as described in
[21].4.2 Linear regularized least squares with two layers
As described in subsection 3.1, when the input set is a subset of the linear choice of basis kernels (10) produces linear machines with feature selection capabilities. First of all, recall that standard regularized least squares with the linear kernel (9) boils down to finitedimensional Tikhonov regularization [51]
also known as ridge regression
[22]:where denote the matrix of inputs such that . Lemma 6 below states that the linear version of RLS2 is equivalent to a “scaled” ridge regression problem, in which the optimal scaling is also estimated from the data. Let
(14) 
For any fixed , let be the number of nonzero coefficients , denote the diagonal submatrix of containing all the strictly positive coefficients. Moreover, let denote the scaled submatrix of selected features .
Problem 8
Lemma 6
From Problem 8, one can easily see that when is fixed to its optimal value, the optimal in Problem 8 is given by the familiar expression:
The result of Lemma 6 can be also used to give an interesting interpretation of linear RLS2. In fact, the coefficient can be interpreted as a quantity proportional to the inverse variance of the th coefficient . Hence, Problem 8 can be seen as a Bayesian MAP estimation with Gaussian residuals, Gaussian prior on the coefficients and uniform prior over a suitable simplex on the vector of inverse coefficients’s variances.
It is useful to introduce a notion of “degrees of freedom”, see e.g.
[15, 20]. Degrees of freedom is an index more interpretable than the regularization parameter, and can be also used to choose the regularization parameter according to tuning criteria such as [28], AIC [1], BIC [42], GCV [11]. A general expression for the effective degrees of freedom of nonlinear kernel regression methods with one layer, based on the SURE (Stein’s Unbiased Risk Estimator) approximation [48] has been recently derived in [13]. For linear RLS2, the following quantity seems an appropriate approximation of the degrees of freedom:(15) 
Expression (15) corresponds to the equivalent degrees of freedom of a linear regularized estimator with regressors fixed to and diagonal regularization . Notice that (15) neglects the nonlinear dependence of matrix on the output data and does not coincide with the SURE estimator of the degrees of freedom. Nevertheless, the property holds, so that can be conveniently used to interpret the complexity of the linear RLS2 model (see subsection 5.1 for an example).
4.3 Choice of the scaling and feature/kernel selection
The ability of RLS2 to select features or basis kernels is highly influenced by the scaling in the second layer. In this subsection, we analyze certain scaling rules that are connected with popular statistical indices, often used as “filters” for feature selection [19]. Since the issue of scaling still needs further investigation, it is not excluded that new rules different from those mentioned in this subsection may work better on specific problems.
A key observation is the following: according to Lemma 5, RLS2 with heavy regularization favors basis kernels that maximizes the quantity , that represents a kind of alignment between the kernel and the outputs. This means that RLS2 tends to select kernels that are highly aligned with the outputs. Since each alignment is proportional to the scaling factor , an effective way to choose the scaling is one that makes the alignment a meaningful quantity to maximize. First of all, we discuss the choice of scaling for the linear RLS2 algorithm introduces in subsection 4.2. The generalization to the case of nonlinear basis kernels easily follows by analyzing the associated feature maps.
In the linear case, we have , where is the th feature vector, so that
By choosing
the alignment becomes the squared cosine of the angle between the th feature vector and the output vector:
In particular, when the outputs and the features are centered to have zero mean, coincides with the squared Pearson correlation coefficient between the outputs and the th feature, also known as coefficient of determination. This means that RLS2 with heavy regularization selects the features that mostly correlates to the outputs. Since the term is common to all the factors, one can also use
(16) 
without changing the profile of solutions along a regularization path (though, the scale of regularization parameters is shifted). Observe that rule (16) may also make sense when data are not centered or centered around values other than the mean. In fact, for some datasets, performances are better without any centering (this is the case, for instance, of Experiment 1 in subsection 5.1). Notice also that (16) only uses training inputs whereas, in a possible variation, one can replace with the vector containing values of the th feature for both training and test data (when available). The latter procedure sometimes work better than scaling using training inputs only, and will be referred to as transductive scaling in the following. For binary classification with labels , the choice (16) with or without centering still make sense, but other rules are also possible. Let and denote the number of samples in the positive and negative class, respectively, and and denote the withinclass mean values of the th feature:
By choosing
(17) 
where and
denote the within class standard deviations of the
th feature, one obtainWhen the two classes are balanced (), boils down to a quantity proportional to the classical Fisher criterion (or signaltointerference ratio):
Rules (16) and (17) can be generalized to the case of nonlinear basis kernels, by observing that nonlinear kernels can be always seen as linear upon mapping the data in a suitable feature space. A rule that generalizes (16) is the following [[, see e.g.]]Rakotomamonjy08:
(18) 
that amounts to scale each basis kernel by the trace of the kernel matrix, and reduces exactly to (16) in the linear case. Also (18) can be applied with or without centering. A typical centering is normalization in feature space, that amounts to subtract to the basis kernel , before computing (18). A transductive scaling rule can be obtained by extending the sum to both training and test inputs, namely computing the inverse trace of the overall kernel matrix, as in [27]. Finally, a nonlinear generalization of (17) is given by the following:
5 Experiments
In this section, the behavior of linear and nonlinear RLS2 on several learning problems is analyzed. In subsection 5.1, an illustrative analysis of linear RLS2 is proposed, whose goal is to study the feature selection capabilities and the dependence on the regularization parameter of the algorithm in simple experimental settings. RLS2 with nonlinear kernels is analyzed in subsection 5.2, where an extensive benchmark on several regression and classification problems from UCI repository is carried out. Finally, multiclass classification of microarray data is considered in subsection 5.3.
Computations are carried out in a Matlab environment and the subproblem (Problem 7) is solved using an SMOlike (Sequential Minimal Optimization) algorithm [34]. Current implementation features conjugate gradient to solve linear systems and a sophisticated variable shrinking technique to reduce gradient computations. The stopping criterion for Algorithm 1 used in all the experiments is the following test on the normalized residual of linear system (13):
The choice turns out to be sufficient to make all the coefficients stabilize to a good approximation of their final values. A full discussion of optimization details is outside the scope of the paper. All the experiments have been run on a Core 2 Duo T7700 2.4 GHz, 800 MHz FSB, 4 MB L2 cache, 2 GB RAM.
5.1 Linear RLS2: illustrative experiments
In this subsection, we perform two experiments to analyze the behavior of linear RLS2. In the first experiment, a synthetic dataset is used to investigate the ability of linear RLS2 to perform feature selection. The dependence of generalization performances of RLS2 and other learning algorithms on the training set size is analyzed by means of learning curves. The goal of the second experiment is to illustrate the qualitative dependence of coefficients on the regularization parameter and give an idea of the predictive potentiality of the algorithm.
Experiment 1 (Binary strings data)
In the first experiment, a synthetic Binary strings dataset has been generated: 250 random binary strings
are obtained by independently sampling each bit from a Bernoulli distribution with
. Then, the outputs have been generated aswhere are small independent Gaussian noises with zero mean and standard deviation . In this way, the outputs only depend on the first three bits of the input binary string. The dataset has been divided into a training set of 150 input output pairs and a test set containing the remaining 100 data pairs. We compare the RMSE (root mean squared error) learning curves obtained by varying the training set size using five different methods:

RLS (regularized least squares) with “ideal” kernel:
(19) 
RLS with linear kernel (9) (ridge regression).

RLS with Gaussian RBF kernel
The goal here is to assess the overall quality of regularization paths associated with different regularization algorithms, independently of model selection procedures. To this end, we compute the RMSE on the test data as a function of the training set size and evaluate the lower bounds of the learning curves with respect to variation of the regularization parameter. Results are shown in Figure 3, whose top plot reports the lower bounds of learning curves for all the five algorithms with training set sizes between 1 and 150. Notice that all the five methods are able to learn, asymptotically, the underlying “concept”, up to the precision limit imposed by the noise, but methods that exploits coefficients sparsity are faster to reach the asymptotic error rate. Not surprisingly, the best method is RLS with the “ideal kernel” (19), which incorporates a strong prior knowledge: the dependence of the outputs on the first three bits only. Though knowing in advance the optimal features is not realistic, this method can be used as a reference. The slowest learning curve is that associated to RLS with Gaussian RBF kernel, which only incorporates a notion of smoothness. A good compromise is RLS with linear kernel, which uses the knowledge of linearity of the underlying function, and reaches a good approximation of the asymptotic error rate after seeing about 100 strings. The remaining two methods (Lasso and linear RLS2) incorporate the knowledge of both linearity and sparsity. They are able to learn the underlying concept after seeing only 12 examples, despite the presence of the noise. Since after the 12th example Lasso and linear RLS2 are basically equivalent, is it interesting to see what happen for very small sample sizes. The bottom plot of Figure 3 is a zoomed version of the top plot with training set sizes between 1 and 30, showing only the learning curves for the three methods that impose sparsity. Until the 8th example, the Lasso learning curve stays lower than the RLS2 learning curve. After the 8th example, the RLS2 learning curve stays uniformly lower than the Lasso, indicating an high efficiency in learning noisy sparse linear combinations. Since the multiple kernel learning interpretation of RLS2 suggests that the algorithm is being learning the “ideal” kernel (19) simultaneously with the predictor, it might be interesting to analyze the asymptotic values of kernel coefficients . Indeed, after the first 12 training examples, RLS2 sets to zero all the coefficients except the first three, which are approximately equal to 1/3.
Experiment 2 (Prostate Cancer data)
Term  RLS2  Best subset  LS  Ridge  Lasso  PCR  PLS 

Intercept  2.452  2.477  2.465  2.452  2.468  2.497  2.452 
lcavol  0.544  0.740  0.680  0.420  0.533  0.543  0.419 
lweight  0.207  0.316  0.263  0.238  0.169  0.289  0.344 
age  0.141  0.046  0.152  0.026  
lbph  0.104  0.210  0.162  0.002  0.214  0.220  
svi  0.170  0.305  0.227  0.094  0.315  0.243  
lcp  0.288  0.000  0.051  0.079  
gleason  0.021  0.040  0.232  0.011  
pgg45  0.064  0.267  0.133  0.056  0.084  
Test error  0.454  0.492  0.521  0.492  0.479  0.449  0.528 
Std error  0.152  0.143  0.179  0.165  0.164  0.105  0.152 
Linear RLS2 is applied to the Prostate Cancer dataset, a regression problem whose goal is to predict the level of prostatespecific antigen on the basis of a number of clinical measures in men who were about to receive a radical prostatectomy [47]. These data are used in the textbook [20] to compare different feature selection and shrinkage methods, and have been obtained from the web site http://wwwstat.stanford.edu/ElemStatLearn/. Data have been preprocessed by normalizing all the inputs to zero mean and unit standard deviation. The dataset is divided into a training set of 67 examples and a test set of 30 examples. To choose the regularization parameter, the 10fold crossvalidation score has been computed for different values of in the interval on a logarithmic scale. The scaling coefficients are chosen as in (16), thus normalizing each training feature to have unit norm. An intercept term equal to the average of training outputs has been subtracted to the outputs before estimating the other coefficients. For each of the dataset splits, the MSE (mean squared error) has been computed on the validation data. Figure 2 reports average and standard error bands for validation MSE along a regularization path. Following [20], we pick the value of corresponding to the least complex model within one standard error of the best validation score.
In a second phase, the whole training set (67 examples) is used to compute the RLS2 solution with different values of . Figure 3 reports the profile of RLS2 coefficients , see equation (12), along the whole regularization path as a function of the degrees of freedom defined as in (15). RLS2 does a continuous feature selection that may resemble that of the Lasso. However, the dependence of coefficients on the regularization parameter is rather complex and the profile in Figure 2 is not piecewise linear. In correspondence with the value of chosen in the validation phase, RLS2 selects 5 input variables out of 8. Table 1 reports the value of coefficients estimated by RLS2 together with the test error and his standard error. For comparison, Table 1 also reports models and results taken from [20] associated with LS (Least Squares), Best subset regression, Ridge Regression, Lasso regression, PCR (Principal Component Regression), PLS (Partial Least Squares). The best model on these data is PCR, but RLS2 achieves the second lowest test error by using only 5 variables.
5.2 Nonlinear RLS2: regression and classification benchmark
Dataset  Feature standardization  Examples  Features  Kernels 

Autompg  Yes  392  7  104 
Cpu  Yes  209  8  494 
Servo  No  167  4  156 
Housing  Yes  506  13  182 
Heart  Yes  270  13  182 
Liver  No  345  6  91 
Pima  Yes  768  8  117 
Ionosphere  Yes  351  33  442 
Wpbc  Yes  194  34  455 
Sonar  No  208  60  793 
Dataset  60/40  70/30 

Autompg  
Cpu  
Servo  
Housing  
Heart  84 (3.28)  
Liver  
Pima  
Ionosphere  
Wpbc  
Sonar 
Dataset  Split ratio  Kernels  Iterations  Time (s) 

Autompg  60/40  
70/30  
Cpu  60/40  
70/30  
Servo  60/40  
70/30  
Housing  60/40  
70/30  
Heart  60/40  
70/30  
Liver  60/40  
70/30  
Pima  60/40  
70/30  
Ionosphere  60/40  
70/30  
Wpbc  60/40  
70/30  
Sonar  60/40  
70/30 
In this subsection, benchmark experiments on four regression and six classification problems from UCI repository are illustrated (Table 2). RLS2 has been run on 100 random dataset splits with two different training/test ratios: and . For each dataset split, an approximate regularization path with 30 values of on a logarithmic scale in the interval has been computed. To speedup the regularization path computation, a warmstart technique is employed: the value of is iteratively decreased, while kernelexpansion coefficients are initialized to their optimal values obtained with the previous value of . Performances are measured by accuracy for classification and RMSE (root mean squared error) for regression. For each dataset split and value of the regularization parameter, the following quantities are computed: prediction performance on the test set, number of selected kernels (number of nonzero ), training time in seconds and number of iterations to compute the whole regularization path. Datasets have been preprocessed by removing examples with missing features and converting categorical features to binary indicators. For some of the datasets (see Table 2) input features have been standardized to have zero mean and unitary standard deviation. For classification, output labels are and predictions are given by the sign of . For regression, an intercept term equal to the mean of training outputs is subtracted to the training data. Basis kernel matrices are precomputed and the scaling matrix is chosen according to the rule (18) with transductive scaling.
To better compare the results to similar benchmarks for multiple kernel learning, see e.g. [35], the same set of basis kernels for all the datasets has been chosen. We remark that such agnostic approach is not representative of a realistic application of the algorithm, in which the choice of basis kernels should reflect apriori knowledge about the learning task to be solved. The set of basis kernels contains the following:

Polynomial kernels
with .

Gaussian RBF kernels
with 10 different values of chosen on a logarithmic scale between and .
Kernels on each single feature and on all the features are considered, so that the number of basis kernels is an affine function of the number of features (recall that categorical features have been converted to binary indicators). More precisely, we have .
All the profiles of test prediction performance, number of kernels and number of iterations for the 70/30 dataset split in correspondence with different values of the regularization parameter are reported in Figures 48. From the top plots, it can be seen that test performances are relatively stable to variations of the regularization parameter around the optimal value , indicating that RLS2 is robust with respect to the use of different model selection procedures. For regression datasets such as Cpu, Servo, or Housing, optimal performances seems to be reached in correspondence with the unregularized solution . Lines in light color are associated with single dataset splits, while thick lines are the averages over different dataset splits. The vertical dotted line corresponds to the value of the regularization parameter with best average test performance. The average number of selected kernels vary quite smoothly with respect to the regularization parameter. For large values of , RLS2 chooses only one basis kernel. For small values of , the number of selected kernels grows and exhibits an higher variability. The bottom plots in Figures 48 give an idea of the computation burden required by alternate optimization for RLS2 in correspondence with different values of . In correspondence with high values of the regularization parameter, the algorithm converges in a single iteration. This occurs also for the very first value on the regularization path, meaning that the initialization rule is effective. With low values of , RLS2 also converges in a single iteration, since the second layer doesn’t change much from an iteration to the next.
Test performances for regression and classification are summarized in Table 3, where the average and standard deviation with respect to the 100 dataset splits of either RMSE (regression) or accuracy (classification) in correspondence with to the best value of are reported. Performances of other kernel learning algorithms on some of these datasets can be found in [27, 33, 35] and references therein. Another benchmark study that might be useful for comparison is [29]. Comparisons should be handled with care due to the use of different experimental procedures and optimization problems. For instance, [27] uses an 80/20 dataset split ratio, [33] uses 60/40, while [35] uses 70/30. Also, different numbers of dataset splits have been used. Individuating what kind of datasets are better suited to what algorithm is a complex issue, that is certainly worth further investigation. These experiments shows that RLS2 results are competitive and complexity of the model is well controlled by regularization. In particular, state of the art performances are reached on Servo, Housing, Hearth, Pima, Ionosphere. Finally, it should be observed that, although multiple kernel learning machines have been used as black box methods, the use of basis kernels on single features sometimes also selects a subset of relevant features. Such property is remarkable since standard singlelayer kernel methods are not able to perform “embedded” feature selection.
Table 4 reports the average and standard deviation of number of selected kernels in correspondence with the optimal value of , number of iterations and training time needed to compute a regularization path for all the regression and classification datasets studied in this subsection. From the columns of selected kernels, it can be seen that a considerable fraction of the overall number of basis kernels is filtered out by the algorithm in correspondence with the optimal value of the regularization parameter. By looking at the number of iterations needed to compute the path with 30 values of the regularization parameter, one can see that the average number of iterations to compute the solution for a single value of is in between 1 and 3, indicating that the warmstart procedure is rather effective at exploiting the continuity of solutions with respect to the regularization parameter. As a matter of fact, most of the optimization work is spent in correspondence with a central interval of values of , as shown in the bottom plots of Figures 48. Finally, from the last column, reporting average and standard deviation of training times, it can be seen that, with the current implementation of RLS2, regularization paths for all the datasets in this subsection can be computed in less than one minute in the average (see the introduction of this section for experimental details). Although the current implementation of RLS2 has been designed to be efficient, it is believed that there’s still considerable margin for further computational improvements. This issue may well be the subject of future developments.
5.3 RLS2: multiclass classification of microarray data
Method  Test errors  Selected genes 

Support Vector Classifier  14.0  16,063 
L1penalized multinomial  13.0  269 
Lasso regression (OVA)  12.5  1,429 
L2penalized discriminant analysis  12.0  16,063 
Elasticnet penalized multinomial  11.8  384 
Linear RLS2 (OVA)  9.8  855 
RLS2 can be applied to multiclass classification problems by solving several binary classification problems and combining their outcomes. A possible way to combine binary classifiers is the OVA (one versus all) approach, in which each class is compared to all the others and test labels are assigned to the class maximizing the confidence (the realvalued output) of the corresponding binary classifier.
Linear RLS2 with OVA has been applied to the 14 Cancers dataset [36], a delicate multiclass classification problem whose goal is to discriminate between 14 different types of cancer, on the basis of microarray measurements of 16063 gene expressions. Gene measurements and type of cancer (labels) are available for 198 patients, the dataset being already divided in a training set of 144 patients, and a test set of 54 patients. Another important goal in this problem is to individuate a small subset of genes which is relevant to discriminate between the different kind of cancer. [20] reports several results for these data using a variety of classification methods. Algorithms such as the Support Vector Classifier uses all the genes to compute the classification boundaries, while others such as Lasso or Elastic Net are also able to select a subset of relevant genes. Since the feature selection experiment in subsection 5.1 suggests that RLS2 may be very efficient at selecting relevant features from noisy examples, a microarray dataset seems to be an appropriate choice for testing the algorithm.
Gene expressions for each patient have been firstly standardized to have zero mean and variance one. For each binary classifier, coefficients are chosen as , where and are the withinclass sample variances computed using all the training data. Such scaling gives more weight to genes whose expressions exhibits small withinclass variability, and seems to slightly improve classification performances. A validation accuracy profile has been computed using stratified 8fold cross validation, where the folds are organized to approximately preserve the class proportions^{1}^{1}1We thank Trevor Hastie for kindly providing the folds used in their experiments.. For the final model, we pick the highest value of maximizing the validation accuracy. Figure 9 reports the profiles of training accuracy, crossvalidation accuracy with corresponding standard error bands, and test accuracy for 50 logarithmically spaced values of the regularization parameter. Table 5 reports the number of test errors and selected genes in correspondence with the value of chosen in the validation phase, for RLS2 and other methods from [20]. Test errors in Table 5 are averages of test errors for different classifiers associated with all the different values of the regularization parameter that maximizes the crossvalidation score (this explains the presence of noninteger values). For linear RLS2, such procedure yields a value of about 9.8. In correspondence with the least complex model maximizing the crossvalidation accuracy, one obtain 10 test errors using 855 genes. Although the test set size is too small to draw significative conclusions from this comparison, linear RLS2 seems to work rather well on this problem and achieve the best test performances. Such good performance also confirm effectiveness of the OVA multiclass approach for RLS2.
6 Conclusions
The connection between learning with a twolayer network and the problem of learning the kernel has been analyzed. While architectures with more than one layer are justified by a representer theorem, an alternative perspective to look at the problem of kernel learning is proposed. Such perspective makes clear that these two methodologies aim both at increasing the approximation power of standard single layer methods by using machines that can adaptively select functions with a variety of shapes when little prior knowledge is available. In particular, the multiple kernel learning framework is shown to be an important specific case of a more general twolayer architecture. We also introduce RLS2, a new method to perform multiple kernel learning based on regularization with the square loss function and alternate optimization. RLS2 exhibits state of the art performances on several learning problems, including multiclass classification of microarray data. An open source set of MATLAB scripts for RLS2 and linear RLS2 is available at http://www.mloss.org and also includes a Graphic User Interface.
Appendix (proofs)
Proof of Theorem 1
For any fixed , let . By fixing an optimal , Problem 1 can be written as a function of alone as
By standard representer theorems for vector valued functions (see [31] and the remark on monotonicity in [40] after Theorem 1), there exists an optimal in the form
Then, by fixing on optimal in this form, Problem 1 can be written as a function of alone as
where . Notice that the new loss functions depends on . Again, by the singlelayer representer theorem the finite kernel expansion for follows. Finally, it is immediate to see that the overall inputoutput relation can be written as in (2).
Proof of Lemma 1
By linearity of , it is immediate to see that (3) holds. It follows that
In addition,
In the last equation, we exploit the fact that , for any positive , a property that is satisfied only by indicator functions. The thesis follows by letting .
Proof of Theorem 2
Problem 2 is a specific instance of Problem 1. The functional to minimize is bounded below, lower semicontinuous and radiallyunbounded with respect to . Existence of minimizers follows by weakcompactness of the unit ball in and . By Theorem 1, there exists an optimal in the form
Introduce the matrix whose rows are denoted by and whose columns are . Then, the th component of can be written as:
By Theorem 1, there exists an optimal such that