1 Introduction
A central issue in the field of machine learning is to design and analyze the generalization ability of learning algorithms. Since the seminal work of Vapnik and Chervonenkis [1], various approaches and techniques have been advocated and a large body of literature has emerged in learning theory providing rigorous generalization and performance bounds [2]. This literature has mainly focused on scalar-valued function learning algorithms like binary classification [3] and real-valued regression [4]. However, interest in learning vector-valued functions is increasing [5]. Much of this interest stems from the need for more sophisticated learning methods suitable for complex-output learning problems such as multi-task learning [6] and structured output prediction [7]. Developing generalization bounds for vector-valued function learning algorithms then becomes more and more crucial to the theoretical understanding of such complex algorithms. Although relatively recent, the effort in this area has already produced several successful results, including [8, 9, 10, 11]. Yet, these studies have considered only the case of finite-dimensional output spaces, and have focused more on linear machines than nonlinear ones. To the best of our knowledge, the only work investigating the generalization performance of nonlinear multi-task learning methods when output spaces can be infinite-dimensional is that of Caponnetto and De Vito [12]. In their study, the authors have derived, from a theoretical (minimax) analysis, generalization bounds for regularized least squares regression in reproducing kernel Hilbert spaces (RKHS) with operator-valued kernels. It should be noted that, unlike the scalar-valued function learning setting, the reproducing kernel in this context is a positive-definite operator-valued function (the kernel is a matrix-valued function in the case of finite-dimensional output spaces). Such kernels have the advantage of allowing us to take into account dependencies between different tasks and thus to model task relatedness.
Hence, these kernels are known to extend linear multi-task learning methods to the nonlinear case, and are referred to as multi-task kernels (in the context of this paper, operator-valued kernels and multi-task kernels mean the same thing) [13, 14]. The convergence rates proposed by Caponnetto and De Vito [12], although optimal in the case of finite-dimensional output spaces, require assumptions on the kernel that can be restrictive in the infinite-dimensional case. Indeed, their proof depends upon the fact that the kernel is Hilbert-Schmidt (see Definition 1), and this restricts the applicability of their results when the output space is infinite-dimensional. To illustrate this, let us consider the identity-operator-based multi-task kernel $K(x,z) = k(x,z)I$, where $k$ is a scalar-valued kernel and $I$ is the identity operator on the output space. This kernel, which was already used by Brouard et al. [15] and Grunewalder et al. [16] for structured output prediction and conditional mean embedding, respectively, does not satisfy the Hilbert-Schmidt assumption (see Remark 1), and therefore the results of [12] cannot be applied in this case (for more details see Section 5). It is also important to note that, since the analysis of Caponnetto and De Vito [12] is based on a measure of the complexity of the hypothesis space independent of the algorithm, it does not take into account the properties of learning algorithms.
In this paper, we address these issues by studying the stability of multi-task kernel regression algorithms when the output space is a (possibly infinite-dimensional) Hilbert space. The notion of algorithmic stability, which characterizes the behavior of a learning algorithm following a change of the training data, was used successfully by Bousquet and Elisseeff [17] to derive bounds on the generalization error of deterministic scalar-valued learning algorithms. Subsequent studies extended this result to cover other learning algorithms such as randomized, transductive and ranking algorithms [18, 19, 20], both in i.i.d. (independently and identically distributed) and non-i.i.d. scenarios [21]. But none of these papers is directly concerned with the stability of non-scalar-valued learning algorithms. It is the aim of the present work to extend the stability results of [17] to cover vector-valued learning schemes associated with multi-task kernels. Specifically, we make the following contributions in this paper: 1) we show that multi-task kernel regression algorithms are uniformly stable for the general case of infinite-dimensional output spaces; 2) we derive, under a mild assumption on the kernel, generalization bounds for such algorithms, and we show their consistency even with non-Hilbert-Schmidt operator-valued kernels (see Definition 1); 3) we demonstrate how to apply these results to various multi-task regression methods such as vector-valued support vector regression (SVR) and functional ridge regression; 4) we provide examples of infinite-dimensional multi-task kernels which are not Hilbert-Schmidt, showing that our assumption on the kernel is weaker than the one in [12].
The rest of this paper is organized as follows. In Section 2 we introduce the necessary notations and briefly recall the main concepts of operator-valued kernels and the corresponding Hilbert-valued RKHS. Moreover, we describe in this section the mathematical assumptions required by the subsequent developments. In Section 3 we state the result establishing the stability and providing the generalization bounds of multi-task kernel-based learning algorithms. In Section 4, we show that many existing multi-task kernel regression algorithms, such as vector-valued SVR and functional ridge regression, do satisfy the stability requirements. In Section 5 we give examples of non-Hilbert-Schmidt operator-valued kernels that illustrate the usefulness of our result. Section 6 concludes the paper.
2 Notations, Background and Assumptions
In this section we introduce the notations we will use in this paper. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, $\mathcal{X}$ a Polish space, $\mathcal{Y}$ a (possibly infinite-dimensional) separable Hilbert space, $\mathcal{H}$ a separable Reproducing Kernel Hilbert Space (RKHS) of functions from $\mathcal{X}$ to $\mathcal{Y}$ with $K$ its reproducing kernel, and $L(\mathcal{Y})$ the space of continuous endomorphisms of $\mathcal{Y}$ equipped with the operator norm (for $T \in L(\mathcal{Y})$ and $y \in \mathcal{Y}$, we denote by $Ty$ the application of the operator $T$ to $y$). Let $(X, Y)$ and $(X_i, Y_i)_{i=1}^{m}$ be i.i.d. pairs of random variables following the unknown distribution $P$ on $\mathcal{X} \times \mathcal{Y}$. We consider a training set $Z = \{(x_i, y_i)\}_{i=1}^{m}$ consisting of a realization of $m$ i.i.d. copies of $(X, Y)$, and we denote by $Z^{\setminus i}$ the set $Z$ where the couple $(x_i, y_i)$ is removed. Let $c : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^{+}$ be a loss function. We will describe stability and consistency results in Section 3 for a general loss function, while in Section 4 we will provide examples to illustrate them with specific forms of $c$. The goal of multi-task kernel regression is to find a function $f : \mathcal{X} \to \mathcal{Y}$ that minimizes the risk functional

$$R(f) = \mathbb{E}_{(X,Y)\sim P}\left[ c(Y, f(X)) \right].$$

The empirical risk of $f$ on $Z$ is then

$$R_{emp}(f, Z) = \frac{1}{m} \sum_{i=1}^{m} c(y_i, f(x_i)),$$

and its regularized version is given by

$$R_{reg}(f, Z) = R_{emp}(f, Z) + \lambda \|f\|_{\mathcal{H}}^{2}.$$

We will denote by

$$f_Z = \operatorname*{arg\,min}_{f \in \mathcal{H}} R_{reg}(f, Z) \qquad (1)$$

the function minimizing the regularized risk over $\mathcal{H}$.
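To make the abstract minimizer (1) concrete, here is a small numerical sketch (our own illustration, not part of the paper's analysis) for the simplest operator-valued kernel, $K(x,z) = k(x,z)I$, with the square loss: in that case the representer theorem reduces (1) to an $m \times m$ linear system solved once for all output coordinates. The function names (`rbf`, `fit_ridge`) and the synthetic data are assumptions made for the example.

```python
import numpy as np

def rbf(X1, X2, gamma=1.0):
    """Scalar Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_ridge(X, Y, lam=0.01, gamma=1.0):
    """Minimize (1/m) sum ||y_i - f(x_i)||^2 + lam * ||f||^2 for K = k.I.
    With the identity operator-valued kernel the problem decouples across
    output coordinates: f_Z(x) = sum_i k(x, x_i) a_i, A = (G + lam*m*I)^{-1} Y."""
    m = len(X)
    G = rbf(X, X, gamma)
    A = np.linalg.solve(G + lam * m * np.eye(m), Y)
    return lambda Xnew: rbf(Xnew, X, gamma) @ A

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
Y = np.stack([np.sin(X[:, 0]), np.cos(X[:, 1])], axis=1)  # vector-valued targets
f = fit_ridge(X, Y)
print(np.abs(f(X) - Y).mean())  # moderate training residual
```

For a general operator-valued kernel the coefficients are elements of $\mathcal{Y}$ and the system is block-structured; the identity kernel is the special case where the blocks are scalar multiples of $I$.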
Let us now recall the definition of the operator-valued kernel $K$ associated to the RKHS $\mathcal{H}$ when $\mathcal{Y}$ is infinite-dimensional. For more details see [5].
Definition 1
The application $K : \mathcal{X} \times \mathcal{X} \to L(\mathcal{Y})$ is called the Hermitian positive-definite reproducing operator-valued kernel of the RKHS $\mathcal{H}$ if and only if:

(i) $\forall x \in \mathcal{X}$, $\forall y \in \mathcal{Y}$, the application $K(\cdot, x)y : z \mapsto K(z, x)y$ belongs to $\mathcal{H}$;

(ii) $\forall f \in \mathcal{H}$, $\forall x \in \mathcal{X}$, $\forall y \in \mathcal{Y}$, $\langle f, K(\cdot, x)y \rangle_{\mathcal{H}} = \langle f(x), y \rangle_{\mathcal{Y}}$;

(iii) $\forall x, z \in \mathcal{X}$, $K(x, z) = K(z, x)^{*}$, where $K(z, x)^{*}$ denotes the adjoint of $K(z, x)$;

(iv) $\forall n \geq 1$, $\forall (x_i)_{i=1}^{n} \subset \mathcal{X}$, $\forall (y_i)_{i=1}^{n} \subset \mathcal{Y}$, $\sum_{i,j=1}^{n} \langle K(x_i, x_j) y_j, y_i \rangle_{\mathcal{Y}} \geq 0$.

(i) and (ii) define a reproducing kernel; (iii) and (iv) correspond to the Hermitian and positive-definiteness properties, respectively.
Moreover, the kernel $K$ will be called Hilbert-Schmidt if and only if, $\forall x \in \mathcal{X}$ and for $(e_j)_{j \in J}$ an orthonormal basis of $\mathcal{Y}$, $\sum_{j \in J} \|K(x, x) e_j\|_{\mathcal{Y}}^{2} < \infty$. This is equivalent to saying that the operator $K(x, x)$ is Hilbert-Schmidt.
We now discuss the main assumptions we need to prove our results. We start with the following hypothesis on the kernel $K$.
Hypothesis 1
$\exists \kappa > 0$ such that $\forall x \in \mathcal{X}$, $\|K(x, x)\|_{op} \leq \kappa^{2}$, where $\|\cdot\|_{op}$ is the operator norm on $L(\mathcal{Y})$.
Remark 1
It is important to note that Hypothesis 1 is weaker than the one used in [12], which requires that $K(x, x)$ is Hilbert-Schmidt and that its trace is uniformly bounded over $\mathcal{X}$. While the two assumptions are equivalent when the output space is finite-dimensional, this is no longer the case when, as in this paper, $\mathcal{Y}$ is infinite-dimensional. Moreover, we can observe that if the hypothesis of [12] is satisfied, then our Hypothesis 1 holds (see proof below). The converse is not true (see Section 5 for some counterexamples).
Proof of Remark 1: Let $K$ be a multi-task kernel satisfying the hypotheses of [12], i.e. $K(x, x)$ is Hilbert-Schmidt and $\sup_{x} \operatorname{Tr}(K(x, x)) = B < \infty$. Then, $\forall x \in \mathcal{X}$, since $K(x, x)$ is a positive self-adjoint compact operator, there exist an orthonormal basis $(e_j)_{j \in J}$ of $\mathcal{Y}$ and a family of nonnegative eigenvalues $(\lambda_j)_{j \in J}$ such that $K(x, x) e_j = \lambda_j e_j$. Thus,

$$\|K(x, x)\|_{op} = \sup_{j \in J} \lambda_j \leq \sum_{j \in J} \lambda_j = \operatorname{Tr}(K(x, x)) \leq B.$$

Hence Hypothesis 1 holds with $\kappa^{2} = B$. $\square$
As a consequence of Hypothesis 1, we immediately obtain the following elementary lemma, which allows us to control $\|f(x)\|_{\mathcal{Y}}$ with $\|f\|_{\mathcal{H}}$. This is crucial to the proof of our main results.
Lemma 1
Let $K$ be a Hermitian positive kernel satisfying Hypothesis 1. Then $\forall f \in \mathcal{H}$, $\forall x \in \mathcal{X}$, $\|f(x)\|_{\mathcal{Y}} \leq \kappa \|f\|_{\mathcal{H}}$.

Proof: Using the reproducing property (ii),

$$\|f(x)\|_{\mathcal{Y}}^{2} = \langle f(x), f(x) \rangle_{\mathcal{Y}} = \langle f, K(\cdot, x) f(x) \rangle_{\mathcal{H}} \leq \|f\|_{\mathcal{H}} \, \|K(\cdot, x) f(x)\|_{\mathcal{H}},$$

and, again by (ii), $\|K(\cdot, x) f(x)\|_{\mathcal{H}}^{2} = \langle K(x, x) f(x), f(x) \rangle_{\mathcal{Y}} \leq \kappa^{2} \|f(x)\|_{\mathcal{Y}}^{2}$. Combining the two inequalities gives $\|f(x)\|_{\mathcal{Y}} \leq \kappa \|f\|_{\mathcal{H}}$. $\square$
Moreover, in order to avoid measurability problems, we assume that, $\forall y, y' \in \mathcal{Y}$, the application

$$(x, x') \mapsto \langle K(x, x') y, y' \rangle_{\mathcal{Y}}$$

is measurable. Since $\mathcal{H}$ is separable, this implies that all the functions used in this paper are measurable (for more details see [12]).
A regularized multi-task kernel-based learning algorithm with respect to a loss function $c$ is the function defined by:

$$Z \mapsto f_Z, \qquad (2)$$

where $f_Z$ is determined by equation (1). This leads us to introduce our second hypothesis.
Hypothesis 2
The minimization problem defined by (1) is well posed. In other words, the function $f_Z$ exists for all $Z$ and is unique.
Now, let us recall the notion of uniform stability of an algorithm.
Definition 2
An algorithm $Z \mapsto f_Z$ is said to be $\beta$-uniformly stable if and only if: $\forall m \geq 1$, $\forall 1 \leq i \leq m$, $\forall Z$ a training set, and $\forall (x, y)$ a realization of $(X, Y)$ independent of $Z$,

$$|c(y, f_Z(x)) - c(y, f_{Z^{\setminus i}}(x))| \leq \beta.$$
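Uniform stability can be probed numerically. The sketch below is our own illustration, under assumptions not made in the paper (identity operator-valued kernel, square loss, Gaussian scalar kernel, synthetic data): for each $i$, it measures the change in loss at a fixed independent test point when $(x_i, y_i)$ is removed, and the worst-case change shrinks as $m$ grows, as the $O(1/m)$ stability of Section 3 predicts.

```python
import numpy as np

def gram(A, B):
    """Gaussian scalar kernel matrix k(a, b) = exp(-||a - b||^2)."""
    return np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def fit(X, Y, lam):
    """Ridge coefficients for the identity operator-valued kernel K = k.I."""
    return np.linalg.solve(gram(X, X) + lam * len(X) * np.eye(len(X)), Y)

def stability_proxy(m, lam=1.0, seed=0):
    """max over i of the loss change at an independent point when (x_i, y_i)
    is removed -- an empirical proxy for the stability constant beta."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(m, 2)); Y = np.sin(X)   # synthetic vector-valued data
    x = rng.normal(size=(1, 2)); y = np.sin(x)   # held-out point
    loss_full = ((y - gram(x, X) @ fit(X, Y, lam)) ** 2).sum()
    beta = 0.0
    for i in range(m):
        Xi, Yi = np.delete(X, i, 0), np.delete(Y, i, 0)
        loss_loo = ((y - gram(x, Xi) @ fit(Xi, Yi, lam)) ** 2).sum()
        beta = max(beta, abs(loss_full - loss_looo if False else loss_full - loss_loo))
    return beta

b_small, b_large = stability_proxy(10), stability_proxy(200)
print(b_small, b_large)  # the proxy decreases as m grows
```

This is only a proxy (the leave-one-out problem here renormalizes by $m-1$ rather than $m$), but it conveys the quantity that Definition 2 bounds uniformly.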
From now on and for the rest of the paper, a $\beta$-stable algorithm will refer to uniform stability. We make the following assumption regarding the loss function.
Hypothesis 3
The application $(y, f, x) \mapsto c(y, f(x))$ is admissible, i.e. convex with respect to $f$ and Lipschitz continuous with respect to $f(x)$, with $C$ its Lipschitz constant.

The above three hypotheses are sufficient to prove the stability of a family of multi-task kernel regression algorithms. However, to show their consistency we need an additional hypothesis.
Hypothesis 4
$\exists M > 0$ such that, $\forall (x, y)$ a realization of the couple $(X, Y)$ and $\forall Z$ a training set,

$$c(y, f_Z(x)) \leq M.$$
3 Stability of MultiTask Kernel Regression
In this section, we state a result concerning the uniform stability of regularized multi-task kernel regression. This result is a direct extension of Theorem 22 in [17] to the case of infinite-dimensional output spaces. It is worth pointing out that its proof does not differ much from the scalar-valued case, and requires only small modifications of the original proof to fit the operator-valued kernel approach. For the convenience of the reader, we present here the proof taking these modifications into account.
Theorem 3.1

Let $K$ be a Hermitian positive kernel satisfying Hypothesis 1, $c$ a loss function satisfying Hypothesis 3, and $\lambda > 0$. Then the regularized multi-task kernel-based learning algorithm (2) is $\beta$-stable with $\beta = \dfrac{C^{2} \kappa^{2}}{2 \lambda m}$.
Proof: Let $\Delta = f_{Z^{\setminus i}} - f_Z$. Since $c$ is convex with respect to $f$, we have, $\forall t \in [0, 1]$,

$$c(y_j, (f_Z + t\Delta)(x_j)) - c(y_j, f_Z(x_j)) \leq t \left[ c(y_j, f_{Z^{\setminus i}}(x_j)) - c(y_j, f_Z(x_j)) \right].$$

Then, by summing over all couples $(x_j, y_j)$ in $Z^{\setminus i}$,

$$R_{emp}(f_Z + t\Delta, Z^{\setminus i}) - R_{emp}(f_Z, Z^{\setminus i}) \leq t \left[ R_{emp}(f_{Z^{\setminus i}}, Z^{\setminus i}) - R_{emp}(f_Z, Z^{\setminus i}) \right]. \qquad (3)$$

Symmetrically, we also have

$$R_{emp}(f_{Z^{\setminus i}} - t\Delta, Z^{\setminus i}) - R_{emp}(f_{Z^{\setminus i}}, Z^{\setminus i}) \leq t \left[ R_{emp}(f_Z, Z^{\setminus i}) - R_{emp}(f_{Z^{\setminus i}}, Z^{\setminus i}) \right]. \qquad (4)$$

Thus, by summing (3) and (4), we obtain

$$R_{emp}(f_Z + t\Delta, Z^{\setminus i}) - R_{emp}(f_Z, Z^{\setminus i}) + R_{emp}(f_{Z^{\setminus i}} - t\Delta, Z^{\setminus i}) - R_{emp}(f_{Z^{\setminus i}}, Z^{\setminus i}) \leq 0. \qquad (5)$$

Now, by definition of $f_Z$ and $f_{Z^{\setminus i}}$ as the minimizers of $R_{reg}(\cdot, Z)$ and $R_{reg}(\cdot, Z^{\setminus i})$ respectively,

$$R_{reg}(f_Z, Z) - R_{reg}(f_Z + t\Delta, Z) + R_{reg}(f_{Z^{\setminus i}}, Z^{\setminus i}) - R_{reg}(f_{Z^{\setminus i}} - t\Delta, Z^{\setminus i}) \leq 0; \qquad (6)$$

hence, combining (5) and (6), and since $c$ is $C$-Lipschitz continuous with respect to $f(x)$, the following inequality is true $\forall t \in [0, 1]$:

$$\lambda \left( \|f_Z\|_{\mathcal{H}}^{2} - \|f_Z + t\Delta\|_{\mathcal{H}}^{2} + \|f_{Z^{\setminus i}}\|_{\mathcal{H}}^{2} - \|f_{Z^{\setminus i}} - t\Delta\|_{\mathcal{H}}^{2} \right) \leq \frac{1}{m} \left[ c(y_i, (f_Z + t\Delta)(x_i)) - c(y_i, f_Z(x_i)) \right] \leq \frac{C t}{m} \|\Delta(x_i)\|_{\mathcal{Y}},$$

which, after expanding the squared norms, gives

$$2 \lambda t (1 - t) \|\Delta\|_{\mathcal{H}}^{2} \leq \frac{C t}{m} \|\Delta(x_i)\|_{\mathcal{Y}}.$$

Dividing by $t$, letting $t \to 0$, and then using Lemma 1, we obtain

$$\|\Delta\|_{\mathcal{H}}^{2} \leq \frac{C}{2 \lambda m} \|\Delta(x_i)\|_{\mathcal{Y}} \leq \frac{C \kappa}{2 \lambda m} \|\Delta\|_{\mathcal{H}}, \quad \text{hence} \quad \|\Delta\|_{\mathcal{H}} \leq \frac{C \kappa}{2 \lambda m}.$$

This implies that, $\forall (x, y)$ a realization of $(X, Y)$,

$$|c(y, f_Z(x)) - c(y, f_{Z^{\setminus i}}(x))| \leq C \|\Delta(x)\|_{\mathcal{Y}} \leq C \kappa \|\Delta\|_{\mathcal{H}} \leq \frac{C^{2} \kappa^{2}}{2 \lambda m}. \qquad \square$$
Note that the $\beta$ obtained in Theorem 3.1 is a $O(1/m)$. This allows us to prove the consistency of the multi-task kernel-based estimator using a result of [17].

Theorem 3.2

Let $Z \mapsto f_Z$ be a $\beta$-stable algorithm whose loss function $c$ satisfies Hypothesis 4. Then, $\forall m \geq 1$ and $\forall \delta \in (0, 1)$, the following bound holds with probability at least $1 - \delta$ over the i.i.d. draw of $Z$:

$$R(f_Z) \leq R_{emp}(f_Z, Z) + 2\beta + (4 m \beta + M) \sqrt{\frac{\ln(1/\delta)}{2m}}.$$
Proof: See Theorem 12 in [17]. $\square$
Since $\beta$ is a $O(1/m)$, the rightmost terms of the previous inequality tend to $0$ when $m \to \infty$; Theorem 3.2 thus proves the consistency of a class of multi-task kernel regression methods even when the dimensionality of the output space is infinite. We give in the next section several examples to illustrate the above results.
4 Stable MultiTask Kernel Regression Algorithms
In this section, we show that multi-task extensions of a number of existing kernel-based regression methods exhibit good uniform stability properties. In particular, we focus on functional ridge regression (RR) [22], vector-valued support vector regression (SVR) [23], and multi-task logistic regression (LR) [24]. We assume in this section that all of these algorithms satisfy Hypothesis 2.

Functional response RR. This is an extension of ridge regression (or regularized least squares regression) to the functional data analysis domain [22], where the goal is to predict a functional response by considering the output as a single function observation rather than a collection of individual observations. The operator-valued kernel RR algorithm is linked to the square loss function $c(y, f(x)) = \|y - f(x)\|_{\mathcal{Y}}^{2}$, and is defined as follows:

$$f_Z = \operatorname*{arg\,min}_{f \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^{m} \|y_i - f(x_i)\|_{\mathcal{Y}}^{2} + \lambda \|f\|_{\mathcal{H}}^{2}.$$
We should note that Hypothesis 3 is not satisfied in the least squares context, since the square loss is not globally Lipschitz. However, we will show that the following hypothesis is a sufficient condition to prove stability when Hypothesis 1 is verified (see Lemma 2).
Hypothesis 5
$\exists M_y > 0$ such that $\|Y\|_{\mathcal{Y}} \leq M_y$ almost surely.

Lemma 2

Let $K$ be a kernel satisfying Hypothesis 1 and let $c$ be the square loss. Under Hypothesis 5, $\forall Z$ a training set and $\forall (x, y)$ a realization of $(X, Y)$,

$$\|y - f_Z(x)\|_{\mathcal{Y}} \leq M_y \left( 1 + \frac{\kappa}{\sqrt{\lambda}} \right),$$

so the square loss is Lipschitz continuous on the set of values reachable by the algorithm, with constant $C = 2 M_y (1 + \kappa / \sqrt{\lambda})$.

It is important to note that this lemma can replace the Lipschitz property of $c$ in the proof of Theorem 3.1.
Proof of Lemma 2: First, note that $c$ is convex with respect to its second argument. Since $\mathcal{H}$ is a vector space, $0 \in \mathcal{H}$. Thus,

$$\lambda \|f_Z\|_{\mathcal{H}}^{2} \leq R_{reg}(f_Z, Z) \leq R_{reg}(0, Z) = \frac{1}{m} \sum_{i=1}^{m} \|y_i\|_{\mathcal{Y}}^{2} \leq M_y^{2}, \qquad (7)$$

where the first inequality follows from the definition of $f_Z$ (see Equation (1)), and the last one uses the bound on $\|Y\|_{\mathcal{Y}}$. Hence $\|f_Z\|_{\mathcal{H}} \leq M_y / \sqrt{\lambda}$ and, by Lemma 1, $\|f_Z(x)\|_{\mathcal{Y}} \leq \kappa M_y / \sqrt{\lambda}$, so that $\|y - f_Z(x)\|_{\mathcal{Y}} \leq M_y (1 + \kappa / \sqrt{\lambda})$. This inequality is uniform over $Z$, and thus also holds for $f_{Z^{\setminus i}}$. $\square$
Hypothesis 4 is also satisfied, with $M = M_y^{2} (1 + \kappa / \sqrt{\lambda})^{2}$; this follows directly from Lemma 2. Using Theorem 3.1, we obtain that the RR algorithm is $\beta$-stable with

$$\beta = \frac{2 \kappa^{2} M_y^{2} (1 + \kappa / \sqrt{\lambda})^{2}}{\lambda m},$$

and one can apply Theorem 3.2 to obtain the following generalization bound, with probability at least $1 - \delta$:

$$R(f_Z) \leq R_{emp}(f_Z, Z) + 2\beta + (4 m \beta + M) \sqrt{\frac{\ln(1/\delta)}{2m}}.$$
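The shape of the resulting bound is easy to explore numerically. The helper below is ours, with arbitrary placeholder constants (`beta_const`, `M`, `delta` are assumptions, not values from the paper): it evaluates the right-hand-side gap of a Bousquet-Elisseeff-style bound for a stability constant $\beta = \beta_0 / m$, showing that the $4m\beta$ term stays constant while the overall gap decays like $O(1/\sqrt{m})$.

```python
import math

def generalization_gap(m, beta_const, M, delta=0.05):
    """Gap term 2*beta + (4*m*beta + M) * sqrt(ln(1/delta) / (2*m)),
    with uniform stability beta = beta_const / m (order O(1/m))."""
    beta = beta_const / m
    return 2 * beta + (4 * m * beta + M) * math.sqrt(math.log(1 / delta) / (2 * m))

for m in (100, 10_000, 1_000_000):
    print(m, generalization_gap(m, beta_const=2.0, M=1.0))
```

Note that a stability constant decaying strictly slower than $O(1/m)$ would make the $4m\beta$ term blow up, which is why the $O(1/m)$ rate of Theorem 3.1 matters.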
Vector-valued SVR. It was introduced in [23] to learn a function which maps inputs to vector-valued outputs in $\mathbb{R}^{n}$, where $n$ is the number of tasks. In that paper, only the finite-dimensional output case was addressed, but a general class of loss functions associated with a norm of the error was studied. In the spirit of the scalar-valued SVR, the $\varepsilon$-insensitive loss function which was considered penalizes only errors whose norm exceeds $\varepsilon$, and from this general norm formulation the special cases of the 1-, 2- and $\infty$-norms were discussed. Since in our work we mainly focus our attention on the general case of any infinite-dimensional Hilbert space $\mathcal{Y}$, we consider here the following vector-valued SVR algorithm:

$$f_Z = \operatorname*{arg\,min}_{f \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^{m} c(y_i, f(x_i)) + \lambda \|f\|_{\mathcal{H}}^{2},$$

where the associated $\varepsilon$-insensitive loss function is defined by:

$$c(y, f(x)) = \max\left( \|y - f(x)\|_{\mathcal{Y}} - \varepsilon, \, 0 \right).$$

This algorithm satisfies Hypothesis 3 with $C = 1$. Hypothesis 4 is also verified when Hypothesis 5 holds; this can be proved in the same way as in the RR case. Theorem 3.1 gives that the vector-valued SVR algorithm is $\beta$-stable with

$$\beta = \frac{\kappa^{2}}{2 \lambda m}.$$

We then obtain the following generalization bound, with probability at least $1 - \delta$:

$$R(f_Z) \leq R_{emp}(f_Z, Z) + 2\beta + (4 m \beta + M) \sqrt{\frac{\ln(1/\delta)}{2m}}.$$
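The constant $C = 1$ for the $\varepsilon$-insensitive loss can be checked directly: by the reverse triangle inequality, $|c(y, u) - c(y, v)| \leq \big| \|y - u\| - \|y - v\| \big| \leq \|u - v\|$. A quick randomized sanity check (our own illustration, with arbitrary dimensions and $\varepsilon$):

```python
import numpy as np

def eps_insensitive(y, fx, eps=0.1):
    """Vector-valued epsilon-insensitive loss: max(||y - f(x)|| - eps, 0)."""
    return max(float(np.linalg.norm(y - fx)) - eps, 0.0)

rng = np.random.default_rng(0)
ok = all(
    abs(eps_insensitive(y, u) - eps_insensitive(y, v)) <= np.linalg.norm(u - v) + 1e-12
    for y, u, v in (rng.normal(size=(3, 5)) for _ in range(1000))
)
print(ok)  # True: the loss is 1-Lipschitz in f(x)
```

The check passes for any $\varepsilon \geq 0$, since $t \mapsto \max(t - \varepsilon, 0)$ is itself 1-Lipschitz.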
Multi-task LR. As in the case of SVR, kernel logistic regression [24] can be extended to the multi-task learning setting. The logistic loss can then be expanded in the manner of the $\varepsilon$-insensitive loss, that is, applied to the norm of the error $\|y - f(x)\|_{\mathcal{Y}}$. It is easy to see that the multi-task LR algorithm satisfies Hypothesis 3 with $C = 1$ (the logistic function is 1-Lipschitz) and, when Hypothesis 5 holds, Hypothesis 4 (the loss is bounded on the set of values reachable by the algorithm). Thus the algorithm is $\beta$-stable with

$$\beta = \frac{\kappa^{2}}{2 \lambda m}.$$

The associated generalization bound, with probability at least $1 - \delta$, is:

$$R(f_Z) \leq R_{emp}(f_Z, Z) + 2\beta + (4 m \beta + M) \sqrt{\frac{\ln(1/\delta)}{2m}}.$$
Hence, we have obtained generalization bounds for the RR, SVR and LR algorithms even when the kernel does not satisfy the Hilbert-Schmidt property (see the following section for examples of such kernels).
5 Discussion and Examples
We have provided generalization bounds for multi-task kernel regression when the output space is an infinite-dimensional Hilbert space, using the notion of algorithmic stability. As far as we are aware, the only previous study of this problem was carried out in [12]. However, only learning rates of the regularized least squares algorithm were provided there, under the assumption that the operator-valued kernel is Hilbert-Schmidt. We have shown in Section 3 that one may use non-Hilbert-Schmidt kernels while still obtaining theoretical guarantees. It should be pointed out that in the finite-dimensional case the Hilbert-Schmidt assumption is always satisfied, so it is important to discuss applied machine learning situations where infinite-dimensional output spaces can be encountered. Note that our bound can be recovered from [12] when both our and their hypotheses are satisfied.
Functional regression.
From a functional data analysis (FDA) point of view, infinite-dimensional output spaces for operator estimation problems are frequently encountered in functional response regression analysis, where the goal is to predict an entire function. FDA is an extension of multivariate data analysis suitable when the data are curves; see [25] for more details. A functional response regression problem takes the form $y_i = f(x_i) + \varepsilon_i$, where both the predictors $x_i$ and the responses $y_i$ are functions in some functional Hilbert space, most often the space $L^{2}$ of square-integrable functions. In this context, the function $f$ is an operator between two infinite-dimensional Hilbert spaces. Most previous work on this model supposes that the relation between functional responses and predictors is linear. The functional regression model is then an extension of the multivariate linear model and has the following form:

$$y_i(t) = \int x_i(s) \, \beta(s, t) \, ds + \varepsilon_i(t),$$

for a regression parameter $\beta$. In this setting, an extension to nonlinear contexts can be found in [22], where the authors showed how Hilbert spaces of function-valued functions and infinite-dimensional operator-valued reproducing kernels can be used as a theoretical framework to develop nonlinear functional regression methods. A multiplication-based operator-valued kernel was proposed, since the linear functional regression model is based on the multiplication operator.
Structured output prediction. One approach to dealing with this problem is kernel dependency estimation (KDE) [26]. It is based on defining a scalar-valued kernel on the outputs, such that one can transform the problem of learning a mapping between input data and structured outputs into a problem of learning a Hilbert-space-valued function from the input space to the real-valued RKHS associated with the output kernel, into which the structured outputs are projected. Depending on the output kernel, this RKHS can be infinite-dimensional. In this context, extending KDE to RKHS with multi-task kernels was first introduced in [15], where an identity-based operator-valued kernel was used to learn the function.
Conditional mean embedding.
As in the case of structured output learning, the output space in the context of conditional mean embedding is a scalar-valued RKHS. In the framework of probability distribution embeddings, Grünewälder et al. [16] have shown an equivalence between RKHS embeddings of conditional distributions and multi-task kernel regression. On the basis of this link, the authors derived a sparse embedding algorithm using the identity-based operator-valued kernel.

Collaborative filtering. The goal of collaborative filtering (CF) is to build a model to predict preferences of clients ("users") over a range of products ("items") based on information from the customers' past purchases. In [27], the authors show that several CF methods, such as rank-constrained optimization, trace-norm regularization, and those based on Frobenius-norm regularization, can all be cast as special cases of spectral regularization on operator spaces. Using operator estimation and spectral regularization as a framework for CF permits the use of potentially more information, and the incorporation of additional user-item attributes, to predict preferences. A generalized CF approach then consists in learning a preference function that takes the form of a compact linear operator from a Hilbert space of users to a Hilbert space of items.
We now want to emphasize that, in the infinite-dimensional case, Hypothesis 1 on the kernel is not equivalent to the one used in [12]. We have shown in Section 2 that our assumption on the kernel is weaker. To illustrate this, we provide below examples of operator-valued kernels which satisfy Hypothesis 1 but are not Hilbert-Schmidt, as was assumed in [12].
Example 1
Identity operator. Let $k$ be a bounded scalar-valued kernel on $\mathcal{X}$, $\mathcal{Y}$ an infinite-dimensional separable Hilbert space, $I$ the identity morphism in $L(\mathcal{Y})$, and $K(x, z) = k(x, z) I$. The kernel $K$ is positive, Hermitian, and satisfies Hypothesis 1 with $\kappa^{2} = \sup_{x} k(x, x)$; but it is not Hilbert-Schmidt, since for any orthonormal basis $(e_j)_{j \in J}$ of $\mathcal{Y}$,

$$\sum_{j \in J} \|K(x, x) e_j\|_{\mathcal{Y}}^{2} = k(x, x)^{2} \sum_{j \in J} \|e_j\|_{\mathcal{Y}}^{2} = \infty.$$
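A finite-dimensional numerical picture of this example (ours, for intuition only): truncating $\mathcal{Y}$ to $d$ dimensions, the operator norm of $K(x, x) = k(x, x) I$ is independent of $d$, so Hypothesis 1 holds uniformly, while the Hilbert-Schmidt norm grows like $\sqrt{d}$ and diverges in the limit.

```python
import numpy as np

kxx = 1.0  # value of the scalar kernel at (x, x), e.g. a Gaussian kernel
for d in (2, 64, 1024):                  # truncated output dimension
    Kxx = kxx * np.eye(d)                # K(x, x) = k(x, x) * I
    op = np.linalg.norm(Kxx, 2)          # operator norm: constant in d
    hs = np.linalg.norm(Kxx, 'fro')      # Hilbert-Schmidt norm: sqrt(d)
    print(d, op, hs)
```

This is exactly the gap between the two assumptions: [12] needs the second column to stay bounded, while Hypothesis 1 only needs the first.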
Example 2
Multiplication operator, separable case. Let $k$ be a positive-definite scalar-valued kernel such that $\sup_{x} k(x, x) < \infty$, $I$ an interval of $\mathbb{R}$, and $\mathcal{Y} = L^{2}(I)$. Let $g \in L^{\infty}(I)$ be such that $g \geq 0$ almost everywhere. We now define the multiplication-based operator-valued kernel $K$ as follows:

$$\left( K(x, z) y \right)(t) = k(x, z) \, g(t) \, y(t).$$

Such kernels are suited to extend linear functional regression to the nonlinear context [22]. $K$ is a positive Hermitian kernel and, since $\|K(x, x)\|_{op} \leq \sup_{x} k(x, x) \, \|g\|_{\infty}$, it always satisfies Hypothesis 1; but the Hilbert-Schmidt property of $K$ depends on the choice of $g$ and may be difficult to verify. For instance, let $g$ be bounded away from zero, i.e. $g \geq c > 0$ almost everywhere. Then

$$\sum_{j \in J} \|K(x, x) e_j\|_{\mathcal{Y}}^{2} = k(x, x)^{2} \sum_{j \in J} \|g \, e_j\|_{\mathcal{Y}}^{2} \geq k(x, x)^{2} c^{2} \sum_{j \in J} \|e_j\|_{\mathcal{Y}}^{2} = \infty, \qquad (10)$$

where $(e_j)_{j \in J}$ is an orthonormal basis of $L^{2}(I)$ (which exists, since $\mathcal{Y}$ is separable), so $K$ is not Hilbert-Schmidt for such a $g$.
Example 3
Multiplication operator, non-separable case (a kernel is called non-separable, as opposed to separable, if it cannot be written as the product of a scalar-valued kernel and an operator independent of the choice of $x$ and $z$). Let $I$ be an interval of $\mathbb{R}$, $\mathcal{Y} = L^{2}(I)$, and $\mathcal{X}$ a set of real-valued functions on $I$ uniformly bounded by a constant $M_{\mathcal{X}} > 0$. Let $K$ be the following operator-valued function:

$$\left( K(x, z) y \right)(t) = x(t) \, z(t) \, y(t).$$

$K$ is a positive Hermitian kernel satisfying Hypothesis 1. Indeed,

$$\|K(x, x)\|_{op} = \operatorname*{ess\,sup}_{t \in I} \, x(t)^{2} \leq M_{\mathcal{X}}^{2}.$$

On the other hand, $K$ is not Hilbert-Schmidt for every choice of $\mathcal{X}$ (in fact it is not Hilbert-Schmidt as long as there exists $x \in \mathcal{X}$ bounded away from zero). To illustrate this, let us choose $x$ with $|x| \geq c > 0$ almost everywhere, as in the previous example. Then, for $(e_j)_{j \in J}$ an orthonormal basis of $\mathcal{Y}$, we have

$$\sum_{j \in J} \|K(x, x) e_j\|_{\mathcal{Y}}^{2} = \sum_{j \in J} \|x^{2} e_j\|_{\mathcal{Y}}^{2} \geq c^{4} \sum_{j \in J} \|e_j\|_{\mathcal{Y}}^{2} = \infty.$$
Example 4
Sum of kernels. This example is provided to show that, in the case of multiple kernels, the sum of a non-Hilbert-Schmidt kernel and a Hilbert-Schmidt one gives a non-Hilbert-Schmidt kernel. This makes the assumption on the kernel of [12] inconvenient for multiple kernel learning (MKL) [28], since one would like to learn a combination of different kernels, some of which may be non-Hilbert-Schmidt (like the basic identity-based operator-valued kernel).

Let $k_1$ and $k_2$ be positive-definite scalar-valued kernels satisfying $\sup_{x} k_1(x, x) < \infty$ and $\sup_{x} k_2(x, x) < \infty$, $\mathcal{Y}$ an infinite-dimensional separable Hilbert space, and $T \in L(\mathcal{Y})$ a positive self-adjoint Hilbert-Schmidt operator. Let $K$ be the following kernel:

$$K(x, z) = k_1(x, z) \, I + k_2(x, z) \, T.$$

$K$ is a positive and Hermitian kernel. Note that a similar kernel was proposed for multi-task learning [28], where the identity operator is used to encode the relation between a task and itself, and a second kernel is added for sharing the information between tasks. $K$ satisfies Hypothesis 1, since

$$\|K(x, x)\|_{op} \leq k_1(x, x) + k_2(x, x) \, \|T\|_{op}.$$

However, $K$ is not Hilbert-Schmidt. Indeed, it is the sum of a non-Hilbert-Schmidt kernel ($k_1 I$) and a Hilbert-Schmidt one ($k_2 T$). To see this, note that by the reverse triangle inequality for the Hilbert-Schmidt norm,

$$\|K(x, x)\|_{HS} \geq k_1(x, x) \, \|I\|_{HS} - k_2(x, x) \, \|T\|_{HS} = \infty,$$

since $\|I\|_{HS} = \infty$ while $\|T\|_{HS} < \infty$.
6 Conclusion
We have shown that a large family of multi-task kernel regression algorithms, including functional ridge regression and vector-valued SVR, are stable even when the output space is infinite-dimensional. This result allows us to provide generalization bounds and to prove, under mild assumptions on the kernel, the consistency of these algorithms. However, obtaining learning bounds with optimal rates for infinite-dimensional multi-task kernel-based algorithms is still an open question.
References
 [1] Vapnik, V.N., Chervonenkis, A.Y.: On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl. 16(2) (1971) 264–280

 [2] Herbrich, R., Williamson, R.C.: Learning and generalization: Theoretical bounds. In: Handbook of Brain Theory and Neural Networks. 2nd edn. (2002) 619–623
 [3] Boucheron, S., Bousquet, O., Lugosi, G.: Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics 9 (2005) 323–375
 [4] Györfi, L., Kohler, M., Krzyżak, A., Walk, H.: A distribution-free theory of nonparametric regression. Springer Series in Statistics. Springer-Verlag, New York (2002)
 [5] Micchelli, C.A., Pontil, M.: On learning vector-valued functions. Neural Computation 17 (2005) 177–204
 [6] Caruana, R.: Multitask Learning. PhD thesis, School of Computer Science, Carnegie Mellon University (1997)
 [7] Bakir, G., Hofmann, T., Schölkopf, B., Smola, A.J., Taskar, B., Vishwanathan, S.: Predicting Structured Data. The MIT Press (2007)

 [8] Baxter, J.: A model of inductive bias learning. Journal of Artificial Intelligence Research 12 (2000) 149–198
 [9] Maurer, A.: Algorithmic stability and meta-learning. Journal of Machine Learning Research 6 (2005) 967–994
 [10] Maurer, A.: Bounds for linear multi-task learning. Journal of Machine Learning Research 7 (2006) 117–139
 [11] Ando, R.K., Zhang, T.: A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research 6 (2005) 1817–1853
 [12] Caponnetto, A., De Vito, E.: Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics 7 (2007) 331–368
 [13] Micchelli, C., Pontil, M.: Kernels for multi-task learning. In: Advances in Neural Information Processing Systems 17. (2005) 921–928
 [14] Evgeniou, T., Micchelli, C.A., Pontil, M.: Learning multiple tasks with kernel methods. Journal of Machine Learning Research 6 (2005) 615–637
 [15] Brouard, C., d'Alché-Buc, F., Szafranski, M.: Semi-supervised penalized output kernel regression for link prediction. ICML '11 (June 2011)
 [16] Grünewälder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., Pontil, M.: Conditional mean embeddings as regressors. ICML '12 (July 2012)
 [17] Bousquet, O., Elisseeff, A.: Stability and generalization. Journal of Machine Learning Research 2 (2002) 499–526
 [18] Elisseeff, A., Evgeniou, T., Pontil, M.: Stability of randomized learning algorithms. Journal of Machine Learning Research 6 (2005) 55–79
 [19] Cortes, C., Mohri, M., Pechyony, D., Rastogi, A.: Stability of transductive regression algorithms. In: Proceedings of the 25th International Conference on Machine Learning (ICML). (2008) 176–183
 [20] Agarwal, S., Niyogi, P.: Generalization bounds for ranking algorithms via algorithmic stability. Journal of Machine Learning Research 10 (2009) 441–474
 [21] Mohri, M., Rostamizadeh, A.: Stability bounds for stationary phimixing and betamixing processes. Journal of Machine Learning Research 11 (2010) 789–814
 [22] Kadri, H., Duflos, E., Preux, P., Canu, S., Davy, M.: Nonlinear functional regression: a functional RKHS approach. AISTATS '10 (May 2010)
 [23] Brudnak, M.: Vector-valued support vector regression. In: IJCNN, IEEE (2006) 1562–1569
 [24] Zhu, J., Hastie, T.: Kernel logistic regression and the import vector machine. In Dietterich, T.G., Becker, S., Ghahramani, Z., eds.: Advances in Neural Information Processing Systems 14 (NIPS), MIT Press (2002) 1081–1088
 [25] Ramsay, J., Silverman, B.: Functional Data Analysis. Springer Series in Statistics. Springer Verlag (2005)
 [26] Weston, J., Chapelle, O., Elisseeff, A., Schölkopf, B., Vapnik, V.: Kernel dependency estimation. In: Advances in Neural Information Processing Systems (NIPS). Volume 15., MIT Press (2003) 873–880
 [27] Abernethy, J., Bach, F., Evgeniou, T., Vert, J.P.: A new approach to collaborative filtering: Operator estimation with spectral regularization. Journal of Machine Learning Research 10 (2009) 803–826
 [28] Kadri, H., Rakotomamonjy, A., Bach, F., Preux, P.: Multiple operator-valued kernel learning. In: Neural Information Processing Systems (NIPS). (2012)