1 Introduction
Multi-task learning is an approach that learns multiple tasks simultaneously and has the potential to capture the structure of the related tasks. The idea is that exploiting task relatedness can lead to improved performance. Various learning algorithms incorporating the structure of task relations have been studied in the literature [6, 22, 23, 29]. Learning multiple related tasks simultaneously has been shown, both empirically [5, 10, 29, 30, 37, 61, 64, 65] and theoretically [5, 13, 15], to significantly improve performance relative to learning each task independently. Multi-task learning is attracting interest due to its applications in computer vision, image processing and many other fields, such as object detection/classification
[49], image denoising, inpainting, finance and economics forecasting [34], marketing modeling of the preferences of many individuals [2, 7], and in bioinformatics, for example, to study tumor prediction from multiple microarray data sets or to analyze data from multiple related diseases. In this work, we discuss a multi-task learning approach that considers a notion of relatedness based on the concept of manifold regularization. In the scalar-valued setting, Belkin et al. [14] introduced the concept of manifold regularization, which focuses on a semi-supervised framework incorporating labeled and unlabeled data in a general-purpose learner. Minh and Sindhwani [50] generalized the concept of manifold learning to vector-valued functions, exploiting output interdependencies while enforcing smoothness with respect to the input data geometry. Further, Minh and Sindhwani [49]
presented a general learning framework that encompasses three different paradigms simultaneously, namely vector-valued, multi-view and semi-supervised learning. The multi-view learning approach constructs the regularized solution based on different views of the input data using different hypothesis spaces
[18, 42, 43, 49, 56, 58, 60]. Micchelli and Pontil [48] introduced the concept of vector-valued reproducing kernel Hilbert spaces to facilitate the theory of multi-task learning. Moreover, the fact that every vector-valued RKHS corresponds to some operator-valued positive definite kernel reduces the problem of choosing an appropriate RKHS (hypothesis space) to that of choosing an appropriate kernel [48]. In [9, 19, 47], the authors proposed multiple kernel learning from a given set of kernels. Here we consider the direct sum of reproducing kernel Hilbert spaces as the hypothesis space. Multi-task learning is studied under the elegant and effective framework of kernel methods. The expansion of automatic data generation and acquisition brings data of huge size and complexity, which raises challenges to computational capacities. In order to tackle these difficulties, various techniques have been discussed in the literature, such as replacing the empirical kernel matrix with a smaller matrix obtained by (column) subsampling [8, 57, 63], greedy-type algorithms [59], and divide-and-conquer approaches [35, 36, 67]. We are inspired by the work of Kriukova et al. [39],
in which the authors discussed an approach to aggregate various regularized solutions based on Nyström subsampling in single-penalty regularization. Here we consider the so-called Nyström-type subsampling in large-scale kernel methods for dealing with big data, which in particular can be seen as a regularized projection scheme. We achieve the optimal convergence rates for multi-penalty regularization based on the Nyström-type subsampling approach, provided the subsampling size is appropriate. We adapt the aggregation approach to the multi-task manifold regularization scheme to improve the accuracy of the results. We consider linear combinations of the Nyström approximants and seek a combination that is closer to the target function. The coefficients of the linear combination are estimated by means of the linear functional strategy. The aggregation approach accumulates the information hidden inside the various approximants to produce an estimator of the target function
[38, 39] (see also the references therein). The paper is organized as follows. In Section 2, we describe the framework of the vector-valued multi-penalty learning problem with some basic definitions and notations. In Section 3, we discuss the convergence issues of the vector-valued multi-penalty regularization scheme based on Nyström-type subsampling in the RKHS norm and in the $\mathscr{L}^2$ norm. In Section 4, we discuss the aggregation approach that accumulates various estimators based on Nyström-type subsampling. In the last section, we demonstrate the performance of multi-penalty regularization based on Nyström-type subsampling on the Caltech-101 data set for multi-class image classification and the NSL-KDD benchmark data set for the intrusion detection problem.
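The aggregation step admits a compact numerical illustration. The following sketch is a hypothetical toy example, not the paper's estimator: the data, the deliberately crude polynomial approximants and the function `aggregate` are all illustrative assumptions. It estimates the coefficients of the linear combination by the linear functional strategy, i.e., by solving the empirical Gram system $Gc = b$ with $G_{jk} = \langle f_j, f_k \rangle$ and $b_j = \langle f_j, y \rangle$:

```python
import numpy as np

def aggregate(preds, y):
    """Linear functional strategy (sketch): preds is a (p, m) array whose
    row j holds the predictions of approximant f_j on the sample; y is the
    (m,) array of observed outputs. Returns coefficients c of sum_j c_j f_j."""
    m = len(y)
    G = preds @ preds.T / m            # empirical Gram matrix G_jk = <f_j, f_k>
    b = preds @ y / m                  # empirical inner products b_j = <f_j, y>
    # least-squares solve for numerical stability if G is ill-conditioned
    return np.linalg.lstsq(G, b, rcond=None)[0]

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(200)

# three crude "approximants": polynomial fits of different degrees
preds = np.stack([np.polyval(np.polyfit(x, y, d), x) for d in (1, 3, 5)])
c = aggregate(preds, y)
f_agg = c @ preds                      # aggregated approximant on the sample
```

By construction, the aggregated combination minimizes the empirical squared error over the span of the approximants, so on the sample it is never worse than the best individual approximant.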
2 Multi-task learning via vector-valued RKHS
The problem of learning multiple tasks jointly can be modeled by vector-valued functions whose components represent the individual task predictors. Here we consider the general framework of vector-valued functions developed by Micchelli and Pontil [48] to address the multi-task learning algorithm. We consider the concept of vector-valued reproducing kernel Hilbert space, which is the extension of the well-known scalar-valued reproducing kernel Hilbert space.
Definition 2.1.
Vector-valued reproducing kernel Hilbert space (RKHSvv). Let $X$ be a non-empty set and $Y$ a real separable Hilbert space. A Hilbert space $\mathcal{H}$ of functions from $X$ to $Y$ is called a reproducing kernel Hilbert space if for any $x \in X$ and $y \in Y$, the linear functional which maps $f \in \mathcal{H}$ to $\langle y, f(x) \rangle_Y$ is continuous.

Suppose $\mathcal{L}(Y)$ is the Banach space of bounded linear operators on $Y$. A function $K : X \times X \to \mathcal{L}(Y)$ is said to be an operator-valued positive definite kernel if $K(x, z)^* = K(z, x)$ for each pair $(x, z) \in X \times X$, and for every finite set of points $\{x_i\}_{i=1}^{N} \subset X$ and $\{y_i\}_{i=1}^{N} \subset Y$,

$$\sum_{i, j = 1}^{N} \langle y_i, K(x_i, x_j) y_j \rangle_Y \geq 0.$$

For such a kernel there exists a unique Hilbert space $\mathcal{H}_K$ of functions on $X$ satisfying the following conditions:

(i) for all $x \in X$ and $y \in Y$, the function $K_x y$, defined by $(K_x y)(z) = K(z, x) y$ for $z \in X$, belongs to $\mathcal{H}_K$,

(ii) the span of the set $\{K_x y : x \in X,\ y \in Y\}$ is dense in $\mathcal{H}_K$, and

(iii) for all $f \in \mathcal{H}_K$, $\langle f(x), y \rangle_Y = \langle f, K_x y \rangle_{\mathcal{H}_K}$ (reproducing property).

Moreover, there is a one-to-one correspondence between operator-valued positive definite kernels and vector-valued RKHS [48].
In learning theory, we are given random samples $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^{m}$ drawn independently and identically from an unknown joint probability measure $\rho$ on the sample space $Z = X \times Y$. We assume that the input space $X$ is a locally compact second countable Hausdorff space and the output space $Y$ is a real separable Hilbert space. The goal is to predict the output values for the inputs. Suppose we predict the value $f(x)$ for the input $x$ based on our algorithm but the true output is $y$; then we suffer a loss $V(f(x), y)$, where $V$ is the loss function. A widely used approach in regularization theory, based on the square loss function, is Tikhonov-type regularization:

$$f_{\mathbf{z}, \lambda} = \operatorname*{argmin}_{f \in \mathcal{H}} \left\{ \frac{1}{m} \sum_{i=1}^{m} \| f(x_i) - y_i \|_Y^2 + \lambda \| f \|_{\mathcal{H}}^2 \right\}.$$

The regularization parameter $\lambda > 0$ controls the trade-off between the error term measuring the fitness to the data and the complexity of the solution measured in the RKHS norm.
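In the scalar-valued square-loss case the Tikhonov minimizer has the familiar closed form $f = \sum_i \alpha_i K_{x_i}$ with $\alpha = (K + m\lambda I)^{-1} \mathbf{y}$, where $K$ is the kernel Gram matrix. A minimal sketch (the Gaussian kernel, its width and the synthetic data are illustrative assumptions, not choices from the paper):

```python
import numpy as np

def gaussian_kernel(a, b, width=0.2):
    # k(s, t) = exp(-(s - t)^2 / (2 * width^2)), evaluated on all pairs
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * width ** 2))

def krr_fit(x, y, lam):
    # closed-form Tikhonov minimizer: alpha = (K + m * lam * I)^{-1} y
    m = len(x)
    K = gaussian_kernel(x, x)
    alpha = np.linalg.solve(K + m * lam * np.eye(m), y)
    return lambda t: gaussian_kernel(t, x) @ alpha

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 100)
y = np.cos(2 * np.pi * x) + 0.1 * rng.standard_normal(100)

f = krr_fit(x, y, lam=1e-3)
t = np.linspace(0, 1, 50)
mse = np.mean((f(t) - np.cos(2 * np.pi * t)) ** 2)
```

A larger $\lambda$ yields a smoother, more heavily regularized fit, while $\lambda \to 0$ interpolates the noisy samples.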
We discuss a multi-task learning approach that considers a notion of task relatedness based on the concept of manifold regularization. In this approach, different RKHSvv are used to estimate the target functions based on different views of the input data, such as different features or modalities, and a data-dependent regularization term is used to enforce consistency of the output values obtained from different views of the same input example.
We consider the following regularization scheme to analyze the multi-task manifold learning scheme corresponding to different views:
(1) $$f_{\mathbf{z}, \lambda} = \operatorname*{argmin}_{f \in \mathcal{H}} \left\{ \frac{1}{m} \sum_{i=1}^{m} \| f(x_i) - y_i \|_Y^2 + \lambda_A \| f \|_{\mathcal{H}}^2 + \lambda_I \langle \mathbf{f}, M \mathbf{f} \rangle \right\},$$

where $\mathbf{f} = (f(x_1), \ldots, f(x_{m+u}))$, $\{x_i\}_{i=1}^{m+u}$ is the given set of labeled and unlabeled data, $M$ is a symmetric, positive operator, and $\lambda_A, \lambda_I \geq 0$.
The direct sum of reproducing kernel Hilbert spaces is also an RKHS. Suppose $K$ is the kernel corresponding to the RKHS $\mathcal{H}$.
Throughout this paper we assume the following hypothesis:
Assumption 2.1.
Let $\mathcal{H}$ be a reproducing kernel Hilbert space of functions $f : X \to Y$ such that

(i) for all $x \in X$, $K_x : Y \to \mathcal{H}$ is a Hilbert-Schmidt operator and $\kappa^2 := \sup_{x \in X} \| K_x \|_{HS}^2 < \infty$, where, for a Hilbert-Schmidt operator $A$, $\| A \|_{HS}^2 = \sum_k \| A e_k \|^2$ for an orthonormal basis $\{e_k\}$ of $Y$;

(ii) the real-valued function $\zeta : X \times X \to \mathbb{R}$, defined by $\zeta(x, t) = \langle K_t v, K_x w \rangle_{\mathcal{H}}$, is measurable for all $v, w \in Y$.
By the representer theorem [49], the solution of the multi-penalty regularization problem (1) is of the form:
(2) 
where with , is a diagonal matrix with the first diagonal entries as and the rest , is the identity matrix of order and .
In order to obtain a computationally efficient algorithm from the functional (1), we consider Nyström-type subsampling, which uses the idea of replacing the empirical kernel matrix with a smaller matrix obtained by (column) subsampling [39, 59, 63]. This can also be seen as a restriction of the optimization functional (1) to the space:
where and is a subset of the input points in the training set.
The minimizer of the manifold regularization scheme (1) over the space will be of the form:
(3) 
where denotes the Moore-Penrose pseudo-inverse of a matrix , , with and and .
The computational time of the Nyström approximation (3) is of order while the computational time complexity of the standard manifold regularized solution (2) is of order . Therefore, the randomized subsampling methods can break the memory barrier and consequently achieve much better time complexity compared with the standard manifold regularization algorithm.
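A minimal scalar-valued sketch of a single-penalty Nyström construction in the spirit of [57] (the kernel, data, subsample size and the explicit pseudo-inverse formula for the restricted problem are illustrative assumptions, not the paper's multi-penalty implementation). Restricting the minimizer to $\mathrm{span}\{k(\tilde{x}_j, \cdot) : j \leq n\}$ for $n$ subsampled centers gives coefficients $\alpha = (K_{nm} K_{mn} + m \lambda K_{nn})^{\dagger} K_{nm} \mathbf{y}$, so only an $m \times n$ block of the kernel matrix is ever formed:

```python
import numpy as np

def gaussian_kernel(a, b, width=0.2):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * width ** 2))

def nystrom_krr(x, y, lam, n, rng):
    m = len(x)
    centers = rng.choice(x, size=n, replace=False)   # uniform column subsampling
    K_mn = gaussian_kernel(x, centers)               # m x n block
    K_nn = gaussian_kernel(centers, centers)         # n x n block
    # normal equations of the restricted problem; pinv = Moore-Penrose inverse
    A = K_mn.T @ K_mn + m * lam * K_nn
    alpha = np.linalg.pinv(A) @ (K_mn.T @ y)
    return lambda t: gaussian_kernel(t, centers) @ alpha

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 500)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(500)

f = nystrom_krr(x, y, lam=1e-4, n=50, rng=rng)
t = np.linspace(0, 1, 100)
err = np.mean((f(t) - np.sin(2 * np.pi * t)) ** 2)
```

Only the $m \times n$ and $n \times n$ kernel blocks are computed and an $n \times n$ system is solved, which is the source of the improved time and memory complexity over the full $m \times m$ problem.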
We analyze the more general vector-valued multi-penalty regularization scheme based on Nyström subsampling:
(4) 
where are bounded operators, , are nonnegative real numbers and denotes the ordered set .
Here we introduce the sampling operator which is useful in the analysis of regularization schemes.
Definition 2.2.
The sampling operator $S_{\mathbf{x}} : \mathcal{H} \to Y^m$ associated with a discrete subset $\mathbf{x} = \{x_i\}_{i=1}^{m}$ is defined by $S_{\mathbf{x}} f = (f(x_1), \ldots, f(x_m))$.
Then, with $Y^m$ equipped with the normalized inner product $\langle \mathbf{y}, \mathbf{y}' \rangle = \frac{1}{m} \sum_{i=1}^{m} \langle y_i, y_i' \rangle_Y$, its adjoint is given by $S_{\mathbf{x}}^* \mathbf{y} = \frac{1}{m} \sum_{i=1}^{m} K_{x_i} y_i$.
The sampling operator is bounded: $\| S_{\mathbf{x}} \| \leq \kappa$.
We obtain the following explicit expression for the minimizer of the regularization scheme (4). The proof of the theorem follows the same steps as Lemma 1 of [57].
Theorem 2.1.
For a positive choice of , the functional (4) has a unique minimizer:
where is the orthogonal projection operator with range .
The data-free version of the considered regularization scheme (4) is
(5) 
Using the fact , we get,
(6) 
We assume
(7) 
which implies
where the integral operator is a self-adjoint, non-negative, compact operator on the Hilbert space of square-integrable functions from to with respect to , defined as
The integral operator is bounded by . The integral operator can also be defined as a self-adjoint operator on . Though it is a notational abuse, for convenience we use the same notation for both operators defined on different domains. It is well-known that is an isometry from the space of square-integrable functions to the reproducing kernel Hilbert space (for more properties see [24, 25]).
Our aim is to discuss the convergence issues of the regularized solution based on Nyström-type subsampling. We estimate the error bounds of by measuring the bounds of the sample error and the approximation error . The approximation error is estimated with the help of the single-penalty regularized solution .
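Schematically, writing $f^{n}_{\mathbf{z},\lambda}$ for the Nyström-subsampled estimator, $f_{\lambda}$ for the single-penalty regularized solution and $f_{\mathcal{H}}$ for the target (these symbols are our labels for the quantities just described), the error splits by the triangle inequality:

```latex
\| f^{n}_{\mathbf{z},\lambda} - f_{\mathcal{H}} \|
  \;\leq\;
  \underbrace{\| f^{n}_{\mathbf{z},\lambda} - f_{\lambda} \|}_{\text{sample error}}
  \;+\;
  \underbrace{\| f_{\lambda} - f_{\mathcal{H}} \|}_{\text{approximation error}},
```

where the norm may be taken either in $\mathcal{H}$ or in $\mathscr{L}^2$, and each term is bounded separately.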
For any probability measure, we can always obtain a solution converging to the prescribed target function, but the convergence rates may be arbitrarily slow. This phenomenon is known as the no free lunch theorem [28]. Therefore, we need some prior assumptions on the probability measure in order to achieve uniform convergence rates for learning algorithms. Following the notion of Bauer et al. [12] and Caponnetto and De Vito [20], we consider the following assumptions on the joint probability measure in terms of the complexity of the target function and a theoretical spectral parameter, the effective dimension:

For the probability measure on ,
(8) 
There exists the minimizer of the generalization error over the RKHS ,
(9) 
There exist some constants such that
(10) holds for almost all .
It is worthwhile to observe that for real-valued functions and multi-task learning algorithms, the boundedness of the output space can easily be ensured, so we can obtain the error estimates from our analysis without imposing any condition on the conditional probability measure (10).
The smoothness of the target function can be described in terms of the integral operator by the source condition:
Assumption 2.2.
(Source condition) Suppose
where is an operator monotone function on the interval with the assumption and is a concave function. Then the condition is usually referred to as a general source condition [46].
Assumption 2.3.
(Polynomial decay condition) For fixed positive constants $\alpha, \beta$ and $b > 1$, we assume that the eigenvalues $t_i$ of the integral operator $T$ follow the polynomial decay:

$$\alpha i^{-b} \leq t_i \leq \beta i^{-b} \qquad \text{for all } i \in \mathbb{N}.$$

We define the class of probability measures satisfying the conditions (i), (ii), (iii) and Assumption 2.2. We also consider the class of probability measures satisfying the conditions (i), (ii), (iii) and Assumptions 2.2 and 2.3.
The convergence rates discussed in our analysis depend on the effective dimension, and we achieve the optimal minimax convergence rates using this concept. For the integral operator $T$, the effective dimension is defined as

$$\mathcal{N}(\lambda) := \operatorname{Tr}\left( T (T + \lambda I)^{-1} \right), \qquad \lambda > 0.$$
The fact that $T$ is a trace-class operator implies that the effective dimension is finite. The effective dimension is a continuously decreasing function of $\lambda$. For further discussion of the effective dimension, we refer to the literature [16, 17, 40, 41, 66].
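Numerically, the effective dimension can be illustrated via the spectrum of the normalized Gram matrix: $\mathcal{N}(\lambda) = \operatorname{Tr}(T(T + \lambda I)^{-1}) = \sum_i s_i / (s_i + \lambda)$ for the eigenvalues $s_i$. In the sketch below, the Gaussian kernel and the data are illustrative assumptions and $(1/m) K$ stands in for the integral operator:

```python
import numpy as np

def gaussian_kernel(a, b, width=0.2):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * width ** 2))

def effective_dimension(eigs, lam):
    # N(lambda) = sum_i s_i / (s_i + lambda)
    return np.sum(eigs / (eigs + lam))

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 300)

# eigenvalues of the normalized Gram matrix (1/m) K approximate those of T
eigs = np.linalg.eigvalsh(gaussian_kernel(x, x) / len(x))
eigs = np.clip(eigs, 0.0, None)      # clip tiny negative round-off errors

lams = [1e-1, 1e-2, 1e-3]
dims = [effective_dimension(eigs, lam) for lam in lams]
```

As expected, the computed values grow as $\lambda$ decreases and always stay below the sample size, reflecting that $\mathcal{N}(\lambda)$ interpolates between $0$ (large $\lambda$) and the rank of the operator (small $\lambda$).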
The effective dimension can be estimated from Proposition 3 of [20] under the polynomial decay condition as follows:

(11) $$\mathcal{N}(\lambda) \leq \frac{\beta b}{b - 1} \lambda^{-1/b}, \qquad b > 1,$$

and without the polynomial decay condition, we have $\mathcal{N}(\lambda) \leq \kappa^2 / \lambda$.
Table 1: Comparison with directly related results on Nyström-type subsampling.

| | Assumption (qualification) | Scheme | General source condition | Optimal rates |
| --- | --- | --- | --- | --- |
| Kriukova et al. [39] | N/A | Single-penalty Tikhonov regularization | Yes | No |
| Rudi et al. [57] | N/A | Single-penalty Tikhonov regularization | No (Hölder source condition) | Yes |
| Our results | | Multi-penalty regularization | Yes | Yes |
Now we review the previous results on regularization schemes based on Nyström subsampling which are directly comparable to ours: Kriukova et al. [39] and Rudi et al. [57]. For convenience, we present the most essential points in a unified way in Table 1; the convergence rates are shown under Hölder's source condition. Rudi et al. [57] obtained the minimax optimal convergence rates, depending on the eigenvalues of , in the norm; to obtain the optimal rates, the concept of the effective dimension is exploited. Kriukova et al. [39] considered Tikhonov regularization with Nyström-type subsampling under a general source condition. They discussed upper convergence rates and did not take into account the polynomial decay condition on the eigenvalues of the integral operator . We use the idea of Nyström-type subsampling to efficiently implement the multi-penalty regularization algorithm. We obtain optimal convergence rates of multi-penalty regularization with Nyström-type subsampling under a general source condition. In particular, we also obtain optimal rates of single-penalty Tikhonov regularization with Nyström-type subsampling under a general source condition as a special case.
3 Convergence issues
In this section, we present the optimal minimax convergence rates for vector-valued multi-penalty regularization based on Nyström-type subsampling using the concept of the effective dimension over the classes of probability measures and .
In order to prove the optimal convergence rates, we need the following inequality, which is used in the papers [12, 20] and is based on the results of Pinelis and Sakhanenko [53].
Proposition 3.1.
Let be a random variable on the probability space with values in a real separable Hilbert space . If there exist two constants and satisfying
(12) 
then for any and for all ,
In particular, the inequality (12) holds if
In the following proposition, we measure the effect of random sampling using the noise assumption (10) in terms of the effective dimension . The quantity describes the probabilistic estimates of the perturbation measure due to random sampling.
Proposition 3.2.
Proof.
To estimate the first expression, we consider the random variable from to reproducing kernel Hilbert space with
and
Under the assumption (10) we get,
On applying Proposition 3.1 we conclude that
with confidence .
The second expression can be estimated easily by considering the random variable from to . The proof can also be found in De Vito et al. [27]. ∎
The following conditions on the sample size and the subsample size are used to derive the convergence rates of the regularized learning algorithms. In particular, we can assume the following inequality for a sufficiently large sample with confidence :
(15) 
Following the notion of Rudi et al. [57] and Kriukova et al. [39] on subsampling, we measure the approximation power of the projection method induced by the projection operator in terms of . We make the assumption on as considered in Theorem 2 of [39]:
(16) 
Under the parameter choice for , we obtain
(17) 
Moreover, from Lemma 6 of [57], under Assumption 2.3 and , , for every , the following inequality holds with probability ,
Then, under the condition (17), using Propositions 2 and 3 of [45] we get,
and
In what follows, we discuss the error analysis of the multi-penalty regularization scheme based on Nyström-type subsampling in a probabilistic sense. In general, the convergence rates for regularization algorithms are derived in the RKHS norm and in the $\mathscr{L}^2$ norm separately. In Theorems 3.1, 3.2 and 3.3, we estimate error bounds for multi-penalty regularization based on Nyström-type subsampling in a weighted norm, which consequently provides the convergence rates of the regularized solution in both the RKHS norm and the $\mathscr{L}^2$ norm.
Theorem 3.1.
Let be i.i.d. samples drawn according to the probability measure with the assumption that , , are non-increasing functions. Then, under the parameter choice for , for a sufficiently large sample according to (15) and for subsampling according to (17), the following convergence rate of holds with confidence for all ,
where , , , and .
Proof.
We discuss the error bound for by estimating the expressions and . The first term can be expressed as
which implies
where , , and .