# Manifold regularization based on Nyström type subsampling

In this paper, we study Nyström type subsampling for large scale kernel methods to reduce the computational complexity of learning from big data. We discuss a multi-penalty regularization scheme based on Nyström type subsampling, motivated by the well-studied manifold regularization schemes. We develop a theoretical analysis of the multi-penalty least-squares regularization scheme under a general source condition in the vector-valued function setting, so the results also apply to multi-task learning problems. We achieve the optimal minimax convergence rates of multi-penalty regularization using the concept of effective dimension for an appropriate subsampling size. We discuss an aggregation approach based on the linear functional strategy to combine various Nyström approximants. Finally, we demonstrate the performance of multi-penalty regularization based on Nyström type subsampling on the Caltech-101 data set for multi-class image classification and the NSL-KDD benchmark data set for the intrusion detection problem.


## 1 Introduction

Multi-task learning is an approach which learns multiple tasks simultaneously, exploiting the structure of the related tasks: the idea is that exploring task relatedness can lead to improved performance. Various learning algorithms incorporating the structure of task relations have been studied in the literature [6, 22, 23, 29]. Learning multiple related tasks simultaneously has been shown, both empirically [5, 10, 29, 30, 37, 61, 64, 65] and theoretically [5, 13, 15], to significantly improve performance relative to learning each task independently. Multi-task learning is attracting interest due to its applications in computer vision, image processing and many other fields, such as object detection/classification [49], image denoising, inpainting, finance and economics forecasting [34], marketing modeling of the preferences of many individuals [2, 7], and bioinformatics, for example to study tumor prediction from multiple microarray data sets or to analyze data from multiple related diseases.

In this work, we discuss a multi-task learning approach that considers a notion of relatedness based on the concept of manifold regularization. In the scalar-valued function setting, Belkin et al. [14] introduced the concept of manifold regularization, which focuses on a semi-supervised framework incorporating labeled and unlabeled data in a general-purpose learner. Minh and Sindhwani [50] generalized the concept of manifold learning to vector-valued functions, exploiting output inter-dependencies while enforcing smoothness with respect to the input data geometry. Further, Minh and Sindhwani [49] presented a general learning framework that encompasses three different paradigms simultaneously, namely vector-valued, multi-view and semi-supervised learning. The multi-view learning approach constructs the regularized solution based on different views of the input data using different hypothesis spaces [18, 42, 43, 49, 56, 58, 60]. Micchelli and Pontil [48] introduced the concept of vector-valued reproducing kernel Hilbert spaces to facilitate the theory of multi-task learning. Moreover, the fact that every vector-valued RKHS corresponds to some operator-valued positive definite kernel reduces the problem of choosing an appropriate RKHS (hypothesis space) to choosing an appropriate kernel [48]. In [9, 19, 47], the authors proposed learning with multiple kernels from a set of kernels. Here we consider the direct sum of reproducing kernel Hilbert spaces as the hypothesis space.

Multi-task learning is studied under the elegant and effective framework of kernel methods. The expansion of automatic data generation and acquisition brings data of huge size and complexity, which raises challenges for computational capacities. To tackle these difficulties, various techniques have been discussed in the literature, such as replacing the empirical kernel matrix with a smaller matrix obtained by (column) subsampling [8, 57, 63], greedy-type algorithms [59], and divide-and-conquer approaches [35, 36, 67]. We are inspired by the work of Kriukova et al. [39], in which the authors discussed an approach to aggregate various regularized solutions based on Nyström subsampling in single-penalty regularization. Here we consider the so-called Nyström type subsampling in large scale kernel methods for dealing with big data, which in particular can be seen as a regularized projection scheme. We achieve the optimal convergence rates for multi-penalty regularization based on the Nyström type subsampling approach, provided the subsampling size is appropriate. We adapt the aggregation approach to the multi-task manifold regularization scheme to improve the accuracy of the results. We consider linear combinations of the Nyström approximants and try to obtain a combination which is closer to the target function. The coefficients of the linear combination are estimated by means of the linear functional strategy. The aggregation approach tries to accumulate the information hidden inside the various approximants to produce the estimator of the target function [38, 39] (see also the references therein).

The paper is organized as follows. In Section 2, we describe the framework of the vector-valued multi-penalized learning problem with some basic definitions and notation. In Section 3, we discuss the convergence issues of the vector-valued multi-penalty regularization scheme based on Nyström type subsampling in the RKHS norm and in the $\mathscr{L}^2$-norm. In Section 4, we discuss the aggregation approach to accumulate various estimators based on Nyström type subsampling. In the last section, we demonstrate the performance of multi-penalty regularization based on Nyström type subsampling on the Caltech-101 data set for multi-class image classification and the NSL-KDD benchmark data set for the intrusion detection problem.

## 2 Multi-task learning via vector-valued RKHS

The problem of learning multiple tasks jointly can be modeled by vector-valued functions whose components represent the individual task-predictors. Here we consider the general framework of vector-valued functions developed by Micchelli and Pontil [48] to address the multi-task learning algorithm. We consider the concept of vector-valued reproducing kernel Hilbert space, which is the extension of the well-known scalar-valued reproducing kernel Hilbert space.

###### Definition 2.1.

Vector-valued reproducing kernel Hilbert space (RKHSvv). Let $X$ be a non-empty set and $Y$ a real separable Hilbert space. A Hilbert space $\mathcal{H}$ of functions from $X$ to $Y$ is called a reproducing kernel Hilbert space if for any $x \in X$ and $y \in Y$, the linear functional which maps $f \in \mathcal{H}$ to $\langle y, f(x)\rangle_Y$ is continuous.

Suppose $\mathcal{L}(Y)$ is the Banach space of bounded linear operators on $Y$. A function $K: X\times X \to \mathcal{L}(Y)$ is said to be an operator-valued positive definite kernel if $K(x,z)^* = K(z,x)$ for each pair $(x,z) \in X\times X$, and for every finite set of points $\{x_i\}_{i=1}^N \subset X$ and $\{y_i\}_{i=1}^N \subset Y$,

$$\sum_{i,j=1}^N \langle y_i, K(x_i,x_j)y_j\rangle_Y \ge 0.$$

There exists a unique Hilbert space $\mathcal{H}$ of functions on $X$ satisfying the following conditions:

1. for all $x \in X$ and $y \in Y$, the function $K_xy$, defined by
$$(K_xy)(z) = K(z,x)y \quad \text{for all } z \in X,$$
belongs to $\mathcal{H}$,

2. the span of the set $\{K_xy : x \in X,\ y \in Y\}$ is dense in $\mathcal{H}$, and

3. for all $f \in \mathcal{H}$, $\langle f, K_xy\rangle_{\mathcal{H}} = \langle f(x), y\rangle_Y$ (reproducing property).

Moreover, there is a one-to-one correspondence between operator-valued positive definite kernels and vector-valued RKHSs [48].
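As a concrete illustration of the positive definiteness condition above, the following sketch (a hypothetical example with a separable kernel, not from the paper) builds the block Gram matrix of an operator-valued kernel $K(x,z) = k(x,z)B$, with $k$ a scalar Gaussian kernel and $B$ a positive semi-definite output metric, and checks that $\sum_{i,j}\langle y_i, K(x_i,x_j)y_j\rangle_Y = \mathbf{y}^T G\,\mathbf{y} \ge 0$:

```python
import numpy as np

def block_gram(xs, k, B):
    """Block Gram matrix [K(x_i, x_j)] for the separable
    operator-valued kernel K(x, z) = k(x, z) * B."""
    n, d = len(xs), B.shape[0]
    G = np.zeros((n * d, n * d))
    for i, xi in enumerate(xs):
        for j, xj in enumerate(xs):
            G[i * d:(i + 1) * d, j * d:(j + 1) * d] = k(xi, xj) * B
    return G

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 2))              # hypothetical inputs
A = rng.normal(size=(3, 3))
B = A @ A.T                               # symmetric PSD output metric
rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2))
G = block_gram(xs, rbf, B)
# positive definiteness: y^T G y >= 0 for every y, i.e. G is PSD
assert np.min(np.linalg.eigvalsh(G)) > -1e-9
```

Since $k$ is positive definite and $B$ is PSD, the block Gram matrix is PSD for any choice of points.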

In learning theory, we are given random samples $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m$ drawn identically and independently from an unknown joint probability measure $\rho$ on the sample space $Z = X \times Y$. We assume that the input space $X$ is a locally compact second countable Hausdorff space and the output space $Y$ is a real separable Hilbert space. The goal is to predict the output values for new inputs. Suppose our algorithm predicts $f(x)$ for the input $x$ but the true output is $y$; then we suffer a loss $V(f(x), y)$, where $V$ is a prescribed loss function. A widely used approach in regularization theory, based on the square loss function, is Tikhonov type regularization:

$$\frac{1}{m}\sum_{i=1}^m \|f(x_i)-y_i\|_Y^2 + \lambda\|f\|_{\mathcal{H}}^2.$$

The regularization parameter $\lambda$ controls the trade-off between the error term, which measures the fit to the data, and the complexity of the solution, measured in the RKHS norm.
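For a scalar-valued kernel, the Tikhonov functional above is minimized, via the representer theorem, by solving a single linear system in the Gram matrix: $f = \sum_i c_i K_{x_i}$ with $\mathbf{c} = (\mathbf{K} + \lambda m I)^{-1}\mathbf{y}$. A minimal sketch (the kernel and data below are hypothetical choices for illustration):

```python
import numpy as np

def tikhonov_krr(X, y, lam, k):
    """Minimize (1/m) sum_i |f(x_i) - y_i|^2 + lam * ||f||_H^2 over the RKHS of k.
    By the representer theorem, f = sum_i c_i k(., x_i) with
    c = (K + lam * m * I)^{-1} y."""
    m = len(X)
    K = np.array([[k(a, b) for b in X] for a in X])     # Gram matrix
    c = np.linalg.solve(K + lam * m * np.eye(m), y)
    return lambda x: sum(ci * k(x, xi) for ci, xi in zip(c, X))

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=30)
y = np.sin(np.pi * X) + 0.05 * rng.normal(size=30)      # noisy samples of a smooth target
k = lambda a, b: np.exp(-4.0 * (a - b) ** 2)            # Gaussian kernel
f = tikhonov_krr(X, y, lam=1e-3, k=k)
```

Larger `lam` shrinks the estimator toward zero (smoother, more biased); smaller `lam` interpolates the noisy data more closely.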

We discuss a multi-task learning approach that considers a notion of task relatedness based on the concept of manifold regularization. In this approach, different RKHSvv are used to estimate the target functions based on different views of the input data, such as different features or modalities, and a data-dependent regularization term is used to enforce consistency of the output values from different views of the same input example.

We consider the following regularization scheme to analyze the multi-task manifold learning scheme corresponding to different views:

$$\underset{f\in\mathcal{H}_{K_1}\oplus\cdots\oplus\mathcal{H}_{K_v}}{\operatorname{argmin}}\left\{\frac{1}{m}\sum_{i=1}^m \|f(x_i)-y_i\|_Y^2 + \lambda_A\|f\|_{\mathcal{H}}^2 + \lambda_I\langle \mathbf{f}, M\mathbf{f}\rangle_{Y^n}\right\}, \tag{1}$$

where $\{x_i\}_{i=1}^n$ is the given set of labeled (the first $m$) and unlabeled (the remaining $n-m$) inputs, $M$ is a symmetric, positive operator on $Y^n$, $\mathbf{f} = (f(x_1),\ldots,f(x_n))$, and $\lambda_A, \lambda_I \ge 0$.

The direct sum of reproducing kernel Hilbert spaces is also an RKHS. Suppose $K$ is the kernel corresponding to the RKHS $\mathcal{H} = \mathcal{H}_{K_1}\oplus\cdots\oplus\mathcal{H}_{K_v}$.

Throughout this paper we assume the following hypothesis:

###### Assumption 2.1.

Let $\mathcal{H}$ be a reproducing kernel Hilbert space of functions $f: X \to Y$ such that

1. For all $x \in X$, $K_x: Y \to \mathcal{H}$ is a Hilbert-Schmidt operator and $\kappa := \sup_{x\in X}\|K_x\|_{HS} < \infty$, where for a Hilbert-Schmidt operator $A$, $\|A\|_{HS}^2 = \sum_k \|Ae_k\|^2$ for an orthonormal basis $\{e_k\}$ of $Y$.

2. The real-valued function on $X \times X$ defined by $(x,t) \mapsto \langle K_tv, K_xw\rangle_{\mathcal{H}}$ is measurable for all $v, w \in Y$.

By the representer theorem [49], the solution of the multi-penalized regularization problem (1) is of the form:

$$f_{\mathbf{z},\lambda} = \sum_{i=1}^n K_{x_i}c_i, \quad \text{for some } \mathbf{c} = (c_1,\ldots,c_n) = (J\mathbf{K}_n + \lambda_A m I_n + \lambda_I m L\mathbf{K}_n)^{-1}\mathbf{y}_n, \tag{2}$$

where $\mathbf{K}_n = [K(x_i,x_j)]_{i,j=1}^n$, $J$ is the diagonal matrix with the first $m$ diagonal entries equal to $1$ and the rest $0$, $I_n$ is the identity of order $n$, $L$ is the operator inducing the manifold penalty in (1), and $\mathbf{y}_n = (y_1,\ldots,y_m,0,\ldots,0)$.
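The closed form (2) can be sketched for a scalar-valued kernel as follows; the heat-kernel graph weights, the kernel widths and the data are hypothetical choices for illustration, with the manifold operator taken to be an unnormalized graph Laplacian:

```python
import numpy as np

def manifold_reg_coefficients(X_lab, y_lab, X_unlab, lam_A, lam_I):
    """Sketch of solution (2): c = (J K_n + lam_A m I_n + lam_I m L K_n)^{-1} y_n."""
    X = np.concatenate([X_lab, X_unlab])
    m, n = len(X_lab), len(X)
    K = np.exp(-np.subtract.outer(X, X) ** 2 / 0.1)    # Gram matrix K_n on all n inputs
    W = np.exp(-np.subtract.outer(X, X) ** 2 / 0.5)    # heat-kernel adjacency weights
    L = np.diag(W.sum(axis=1)) - W                     # unnormalized graph Laplacian
    J = np.diag([1.0] * m + [0.0] * (n - m))           # selects the m labeled points
    y_n = np.concatenate([y_lab, np.zeros(n - m)])
    c = np.linalg.solve(J @ K + lam_A * m * np.eye(n) + lam_I * m * (L @ K), y_n)
    return c, K

rng = np.random.default_rng(0)
X_lab = rng.uniform(-1, 1, size=10)
y_lab = np.sin(np.pi * X_lab)
X_unlab = rng.uniform(-1, 1, size=15)                  # unlabeled points enter via L only
c, K = manifold_reg_coefficients(X_lab, y_lab, X_unlab, lam_A=1e-5, lam_I=1e-6)
f_hat = K @ c                                          # fitted values at all n inputs
```

Note that the unlabeled points contribute only through the Laplacian penalty, which is exactly how the semi-supervised information enters (2).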

In order to obtain a computationally efficient algorithm from the functional (1), we consider Nyström type subsampling, which uses the idea of replacing the empirical kernel matrix with a smaller matrix obtained by (column) subsampling [39, 59, 63]. This can also be seen as a restriction of the optimization functional (1) to the space

$$\mathcal{H}_{\mathbf{x}_s} := \left\{f \;\middle|\; f = \sum_{i=1}^s K_{x_i}c_i,\ \mathbf{c} = (c_1,\ldots,c_s)\in Y^s\right\},$$

where $\mathbf{x}_s = \{x_i\}_{i=1}^s$ is a subset of the input points in the training set.

The minimizer of the manifold regularization scheme (1) over the space $\mathcal{H}_{\mathbf{x}_s}$ is of the form:

$$f^s_{\mathbf{z},\lambda} = \sum_{i=1}^s K_{x_i}c_i, \quad \text{for } \mathbf{c} = (c_1,\ldots,c_s) = \left(\mathbf{K}_{ms}^T\mathbf{K}_{ms} + \lambda_A m\,\mathbf{K}_{ss} + \lambda_I m\,\mathbf{K}_{ns}^T L\,\mathbf{K}_{ns}\right)^{\dagger}\mathbf{K}_{ms}^T\,\mathbf{y}, \tag{3}$$

where $A^{\dagger}$ denotes the Moore-Penrose pseudoinverse of a matrix $A$, $\mathbf{K}_{ms} = [K(x_i,x_j)]$ with $1 \le i \le m$, $1 \le j \le s$, the matrices $\mathbf{K}_{ns}$ and $\mathbf{K}_{ss}$ are defined analogously, and $\mathbf{y} = (y_1,\ldots,y_m)$.

The computational cost of the Nyström approximation (3) scales with the subsampling size $s$ rather than with the full sample size $n$: the linear system in (3) has only $s$ unknowns (costing on the order of $ns^2$ operations to form and solve), compared with the full $n \times n$ system in (2) (on the order of $n^3$ operations). Therefore, the randomized subsampling methods can break the memory barrier and achieve a much better time complexity compared with the standard manifold regularization algorithm.
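To make the cost saving concrete, the following sketch implements a single-penalty special case of (3) (i.e., $\lambda_I = 0$, no manifold term) with a scalar kernel and plain uniform column subsampling; all data and parameter choices are hypothetical:

```python
import numpy as np

def nystrom_krr(X, y, s, lam, k, rng):
    """Sketch of the Nystroem approximant (3) with lam_I = 0:
    c = (K_ms^T K_ms + lam * m * K_ss)^dagger K_ms^T y,
    restricting the solution to the span of s subsampled kernel columns."""
    m = len(X)
    idx = rng.choice(m, size=s, replace=False)            # plain (uniform) subsampling
    Xs = X[idx]
    K_ms = np.array([[k(a, b) for b in Xs] for a in X])   # m x s kernel block
    K_ss = K_ms[idx, :]                                   # s x s block on the subsample
    c = np.linalg.pinv(K_ms.T @ K_ms + lam * m * K_ss) @ (K_ms.T @ y)
    return lambda x: sum(ci * k(x, xi) for ci, xi in zip(c, Xs))

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=200)
y = np.sin(np.pi * X)
k = lambda a, b: np.exp(-4.0 * (a - b) ** 2)
f = nystrom_krr(X, y, s=20, lam=1e-4, k=k, rng=rng)       # only a 20 x 20 system is inverted
```

Here only an $s \times s$ system is (pseudo-)inverted while all $m$ samples still enter through $\mathbf{K}_{ms}^T\mathbf{K}_{ms}$ and $\mathbf{K}_{ms}^T\mathbf{y}$.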

We analyze the more general vector-valued multi-penalty regularization scheme based on Nyström subsampling:

$$f^s_{\mathbf{z},\lambda} = \underset{f\in\mathcal{H}_{\mathbf{x}_s}}{\operatorname{argmin}}\left\{\frac{1}{m}\sum_{i=1}^m\|f(x_i)-y_i\|_Y^2 + \lambda_0\|f\|_{\mathcal{H}}^2 + \sum_{j=1}^p\lambda_j\|B_jf\|_{\mathcal{H}}^2\right\}, \tag{4}$$

where $B_j: \mathcal{H}\to\mathcal{H}$ ($1 \le j \le p$) are bounded operators, $\lambda_0 > 0$, $\lambda_1,\ldots,\lambda_p$ are non-negative real numbers, and $\lambda$ denotes the ordered set $(\lambda_0,\lambda_1,\ldots,\lambda_p)$.

Here we introduce the sampling operator which is useful in the analysis of regularization schemes.

###### Definition 2.2.

The sampling operator $S_\mathbf{x}: \mathcal{H} \to Y^m$ associated with a discrete subset $\mathbf{x} = \{x_i\}_{i=1}^m$ is defined by

$$S_\mathbf{x}(f) = (f(x))_{x\in\mathbf{x}}.$$

Then its adjoint is given by

$$S_\mathbf{x}^*\mathbf{y} = \frac{1}{m}\sum_{i=1}^m K_{x_i}y_i, \quad \forall\, \mathbf{y} = (y_1,\ldots,y_m)\in Y^m.$$

The sampling operator is bounded: $\|S_\mathbf{x}\| \le \kappa$.
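The adjoint relation $\langle S_\mathbf{x}f, \mathbf{y}\rangle_{Y^m} = \langle f, S_\mathbf{x}^*\mathbf{y}\rangle_{\mathcal{H}}$, with the $1/m$-weighted inner product on $Y^m$, can be verified numerically for a scalar kernel and a function $f$ lying in the span of a few kernel sections; the points and coefficients below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
k = lambda a, b: np.exp(-(a - b) ** 2)

t = rng.normal(size=7)          # f = sum_j a_j K_{t_j} lies in the RKHS
a = rng.normal(size=7)
x = rng.normal(size=5)          # sampling points defining S_x
y = rng.normal(size=5)

K_xt = np.array([[k(p, q) for q in t] for p in x])   # k(x_i, t_j)

m = len(x)
# <S_x f, y>_{Y^m}: f(x_i) = sum_j a_j k(x_i, t_j), with the 1/m-weighted product
lhs = (K_xt @ a) @ y / m
# S_x^* y = (1/m) sum_i K_{x_i} y_i, so <f, S_x^* y>_H = (1/m) sum_{i,j} a_j k(t_j, x_i) y_i
rhs = a @ (K_xt.T @ y) / m
assert abs(lhs - rhs) < 1e-10
```

Both sides reduce to the same bilinear form in the cross-kernel matrix, which is exactly the reproducing property applied termwise.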

We obtain the following explicit expression for the minimizer of the regularization scheme (4). The proof follows the same steps as those of Lemma 1 of [57].

###### Theorem 2.1.

For a positive choice of $\lambda_0$, the functional (4) has the unique minimizer:

$$f^s_{\mathbf{z},\lambda} = \left(P_{\mathbf{x}_s}S_\mathbf{x}^*S_\mathbf{x}P_{\mathbf{x}_s} + \lambda_0 I + \sum_{j=1}^p\lambda_jP_{\mathbf{x}_s}B_j^*B_jP_{\mathbf{x}_s}\right)^{-1}P_{\mathbf{x}_s}S_\mathbf{x}^*\mathbf{y},$$

where $P_{\mathbf{x}_s}$ is the orthogonal projection operator with range $\mathcal{H}_{\mathbf{x}_s}$.

The data-free version of the considered regularization scheme (4) is

$$f^s_{\lambda} := \underset{f\in\mathcal{H}_{\mathbf{x}_s}}{\operatorname{argmin}}\left\{\int_Z\|f(x)-y\|_Y^2\,d\rho(x,y) + \lambda_0\|f\|_{\mathcal{H}}^2 + \sum_{j=1}^p\lambda_j\|B_jf\|_{\mathcal{H}}^2\right\}. \tag{5}$$

Using the fact that $f_{\mathcal{H}}$ minimizes the generalization error over $\mathcal{H}$, we get

$$f^s_{\lambda} = \left(P_{\mathbf{x}_s}L_KP_{\mathbf{x}_s} + \lambda_0 I + \sum_{j=1}^p\lambda_jP_{\mathbf{x}_s}B_j^*B_jP_{\mathbf{x}_s}\right)^{-1}P_{\mathbf{x}_s}L_KP_{\mathbf{x}_s}f_{\mathcal{H}}. \tag{6}$$

We define

$$f^s_{\lambda_0} := \underset{f\in\mathcal{H}_{\mathbf{x}_s}}{\operatorname{argmin}}\left\{\int_Z\|f(x)-y\|_Y^2\,d\rho(x,y) + \lambda_0\|f\|_{\mathcal{H}}^2\right\}, \tag{7}$$

which implies

$$f^s_{\lambda_0} = \left(P_{\mathbf{x}_s}L_KP_{\mathbf{x}_s} + \lambda_0 I\right)^{-1}P_{\mathbf{x}_s}L_KP_{\mathbf{x}_s}f_{\mathcal{H}},$$

where the integral operator $L_K$ is a self-adjoint, non-negative, compact operator on the Hilbert space $\mathscr{L}^2(X,\rho_X;Y)$ of square-integrable functions from $X$ to $Y$ with respect to $\rho_X$, defined as

$$L_K(f)(x) := \int_X K(x,t)f(t)\,d\rho_X(t), \quad x \in X.$$

The integral operator is bounded: $\|L_K\| \le \kappa^2$. The integral operator $L_K$ can also be defined as a self-adjoint operator on $\mathcal{H}$. Though it is a notational abuse, for convenience we use the same notation for both operators defined on different domains. It is well known that $L_K^{1/2}$ is an isometry from the space of square-integrable functions to the reproducing kernel Hilbert space (for more properties see [24, 25]).

Our aim is to discuss the convergence issues of the regularized solution based on Nyström type subsampling. We estimate the error bounds of $f^s_{\mathbf{z},\lambda} - f_{\mathcal{H}}$ by bounding the sample error $f^s_{\mathbf{z},\lambda} - f^s_{\lambda}$ and the approximation error $f^s_{\lambda} - f_{\mathcal{H}}$. The approximation error is estimated with the help of the single-penalty regularized solution $f^s_{\lambda_0}$.

For any probability measure, we can always obtain a solution converging to the prescribed target function, but the convergence rates may be arbitrarily slow. This phenomenon is known as the no free lunch theorem [28]. Therefore, we need some prior assumptions on the probability measure in order to achieve uniform convergence rates for learning algorithms. Following the notion of Bauer et al. [12] and Caponnetto and De Vito [20], we consider the following assumptions on the joint probability measure, expressed in terms of the complexity of the target function and the effective dimension, a theoretical spectral parameter:

1. For the probability measure $\rho$ on $X \times Y$,

$$\int_Z \|y\|_Y^2\,d\rho(x,y) < \infty. \tag{8}$$

2. There exists the minimizer of the generalization error over the RKHS $\mathcal{H}$,

$$f_{\mathcal{H}} := \underset{f\in\mathcal{H}}{\operatorname{argmin}}\left\{\int_Z\|f(x)-y\|_Y^2\,d\rho(x,y)\right\}. \tag{9}$$

3. There exist constants $M, \Sigma > 0$ such that

$$\int_Y\left(e^{\|y-f_{\mathcal{H}}(x)\|_Y/M} - \frac{\|y-f_{\mathcal{H}}(x)\|_Y}{M} - 1\right)d\rho(y|x) \le \frac{\Sigma^2}{2M^2} \tag{10}$$

holds for almost all $x \in X$.

It is worthwhile to observe that for real-valued functions and multi-task learning algorithms, the boundedness of the output space can be easily ensured, so we can obtain the error estimates from our analysis without imposing condition (10) on the conditional probability measure.

The smoothness of the target function can be described in terms of the integral operator $L_K$ by the source condition:

###### Assumption 2.2.

(Source condition) Suppose

$$\Omega_{\phi,R} := \left\{f \in \mathcal{H} : f = \phi(L_K)g \text{ and } \|g\|_{\mathcal{H}} \le R\right\},$$

where $\phi$ is an operator monotone function on the interval $[0,\kappa^2]$ with the assumption $\phi(0) = 0$, and $\phi$ is a concave function. Then the condition $f_{\mathcal{H}} \in \Omega_{\phi,R}$ is usually referred to as a general source condition [46].

###### Assumption 2.3.

(Polynomial decay condition) For fixed positive constants $\alpha, \beta$ and $b > 1$, we assume that the eigenvalues $t_n$ of the integral operator $L_K$ follow the polynomial decay:

$$\alpha n^{-b} \le t_n \le \beta n^{-b} \quad \forall n \in \mathbb{N}.$$

We define the class of probability measures satisfying conditions (i), (ii), (iii) and Assumption 2.2, and the class of probability measures which additionally satisfy Assumption 2.3.

The convergence rates discussed in our analysis depend on the effective dimension, through which we achieve the optimal minimax convergence rates. For the integral operator $L_K$, the effective dimension is defined as

$$\mathcal{N}(\gamma) := \operatorname{Tr}\left((L_K + \gamma I)^{-1}L_K\right), \quad \text{for } \gamma > 0.$$

The fact that $L_K$ is a trace class operator implies that the effective dimension is finite. The effective dimension $\mathcal{N}(\gamma)$ is a continuous, decreasing function of $\gamma$ on $(0,\infty)$. For further discussion on the effective dimension we refer to the literature [16, 17, 40, 41, 66].
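Numerically, $\mathcal{N}(\gamma)$ can be approximated by replacing $L_K$ with the empirical operator $\frac{1}{m}\mathbf{K}_m$, whose eigenvalues $t_i$ give $\mathcal{N}(\gamma) \approx \sum_i t_i/(t_i+\gamma)$. A sketch with a hypothetical Gaussian kernel:

```python
import numpy as np

def effective_dimension(K, gamma):
    """N(gamma) = Tr((L_K + gamma I)^{-1} L_K) = sum_i t_i / (t_i + gamma),
    with L_K approximated by the empirical operator (1/m) K."""
    t = np.linalg.eigvalsh(K / len(K))
    t = np.clip(t, 0.0, None)          # guard against tiny negative round-off
    return np.sum(t / (t + gamma))

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=100)
K = np.exp(-np.subtract.outer(X, X) ** 2)   # Gaussian Gram matrix
# N(gamma) is decreasing in gamma, as stated above
assert effective_dimension(K, 1e-3) > effective_dimension(K, 1e-1)
```

Since each summand $t_i/(t_i+\gamma)$ lies in $[0,1)$ and decreases in $\gamma$, the monotonicity and finiteness claims above are visible directly from this formula.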

The effective dimension can be estimated from Proposition 3 of [20] under the polynomial decay condition as follows,

$$\mathcal{N}(\gamma) \le \frac{\beta b}{b-1}\gamma^{-1/b}, \quad \text{for } b > 1, \tag{11}$$

and without the polynomial decay condition we have

$$\mathcal{N}(\gamma) \le \left\|(L_K+\gamma I)^{-1}\right\|_{\mathcal{L}(\mathcal{H})}\operatorname{Tr}(L_K) \le \frac{\kappa^2}{\gamma}.$$

We define the random variable $\mathcal{N}_x(\gamma) := \operatorname{Tr}\left(K_x^*(L_K+\gamma I)^{-1}K_x\right)$ for $x \in X$ and let

$$\mathcal{N}_\infty(\gamma) := \sup_{x\in X}\mathcal{N}_x(\gamma) < \infty.$$

Now we review the previous results on regularization schemes based on Nyström subsampling which are directly comparable to ours: Kriukova et al. [39] and Rudi et al. [57]. For convenience, we present the most essential points in a unified way in Table 1, showing the convergence rates under Hölder's source condition. Rudi et al. [57] obtained the minimax optimal convergence rates, depending on the eigenvalues of $L_K$, in the $\mathscr{L}^2$-norm; to obtain the optimal rates, the concept of effective dimension is exploited. Kriukova et al. [39] considered Tikhonov regularization with Nyström type subsampling under a general source condition; they discussed upper convergence rates and did not take into account the polynomial decay condition on the eigenvalues of the integral operator $L_K$. We use the idea of Nyström type subsampling to efficiently implement the multi-penalty regularization algorithm, and we obtain optimal convergence rates of multi-penalty regularization with Nyström type subsampling under a general source condition. In particular, we also obtain optimal rates of single-penalty Tikhonov regularization with Nyström type subsampling under a general source condition as a special case.

## 3 Convergence issues

In this section, we present the optimal minimax convergence rates for vector-valued multi-penalty regularization based on Nyström type subsampling, using the concept of effective dimension, over the classes of probability measures defined in Section 2.

In order to prove the optimal convergence rates, we need the following inequality, which is used in the papers [12, 20] and is based on the results of Pinelis and Sakhanenko [53].

###### Proposition 3.1.

Let $\xi$ be a random variable on the probability space $(\Omega,\mathcal{B},P)$ with values in a real separable Hilbert space $\mathcal{H}$. If there exist two constants $Q$ and $S$ satisfying

$$\mathbb{E}\left\{\|\xi-\mathbb{E}(\xi)\|_{\mathcal{H}}^n\right\} \le \frac{1}{2}\,n!\,S^2Q^{n-2} \quad \forall n \ge 2, \tag{12}$$

then for any $0 < \eta < 1$ and for all $m \in \mathbb{N}$,

$$\operatorname{Prob}\left\{(\omega_1,\ldots,\omega_m)\in\Omega^m : \left\|\frac{1}{m}\sum_{i=1}^m\left[\xi(\omega_i)-\mathbb{E}(\xi(\omega_i))\right]\right\|_{\mathcal{H}} \le 2\left(\frac{Q}{m}+\frac{S}{\sqrt{m}}\right)\log\left(\frac{2}{\eta}\right)\right\} \ge 1-\eta.$$

In particular, the inequality (12) holds if

$$\|\xi(\omega)\|_{\mathcal{H}} \le Q \quad \text{and} \quad \mathbb{E}\left(\|\xi(\omega)\|_{\mathcal{H}}^2\right) \le S^2.$$

In the following proposition, we measure the effect of random sampling, using the noise assumption (10), in terms of the effective dimension $\mathcal{N}(\gamma)$. The quantity describes the probabilistic estimates of the perturbation due to random sampling.

###### Proposition 3.2.

Let $\mathbf{z}$ be i.i.d. samples drawn according to the probability measure $\rho$ satisfying the assumptions (8), (9), (10). Then for all $0 < \eta < 1$, with confidence $1-\eta$, we have

$$\left\|(L_K+\gamma I)^{-1/2}P_{\mathbf{x}_s}\left\{S_\mathbf{x}^*\mathbf{y} - S_\mathbf{x}^*S_\mathbf{x}f_{\mathcal{H}}\right\}\right\|_{\mathcal{H}} \le 2\left(\frac{\kappa M}{m\sqrt{\gamma}} + \sqrt{\frac{\Sigma^2\mathcal{N}(\gamma)}{m}}\right)\log\left(\frac{4}{\eta}\right) \tag{13}$$

and

$$\left\|S_\mathbf{x}^*S_\mathbf{x} - L_K\right\|_{\mathcal{L}(\mathcal{H})} \le 2\left(\frac{\kappa^2}{m} + \frac{\kappa^2}{\sqrt{m}}\right)\log\left(\frac{4}{\eta}\right). \tag{14}$$
###### Proof.

To estimate the first expression, we consider the random variable $\xi_1(z) = (L_K+\gamma I)^{-1/2}P_{\mathbf{x}_s}K_x(y-f_{\mathcal{H}}(x))$ from $Z$ to the reproducing kernel Hilbert space $\mathcal{H}$ with

$$\mathbb{E}_\mathbf{z}(\xi_1) = \int_Z(L_K+\gamma I)^{-1/2}P_{\mathbf{x}_s}K_x(y-f_{\mathcal{H}}(x))\,d\rho(x,y) = 0,$$

$$\frac{1}{m}\sum_{i=1}^m\xi_1(z_i) = (L_K+\gamma I)^{-1/2}P_{\mathbf{x}_s}\left(S_\mathbf{x}^*\mathbf{y} - S_\mathbf{x}^*S_\mathbf{x}f_{\mathcal{H}}\right)$$

and

$$\begin{aligned}
\mathbb{E}_\mathbf{z}\left(\|\xi_1-\mathbb{E}_\mathbf{z}(\xi_1)\|_{\mathcal{H}}^n\right) &= \mathbb{E}_\mathbf{z}\left(\left\|(L_K+\gamma I)^{-1/2}P_{\mathbf{x}_s}K_x(y-f_{\mathcal{H}}(x))\right\|_{\mathcal{H}}^n\right) \\
&\le \mathbb{E}_\mathbf{z}\left(\left\|K_x^*P_{\mathbf{x}_s}(L_K+\gamma I)^{-1}P_{\mathbf{x}_s}K_x\right\|_{\mathcal{L}(Y)}^{n/2}\|y-f_{\mathcal{H}}(x)\|_Y^n\right) \\
&\le \mathbb{E}_x\left(\left\|K_x^*P_{\mathbf{x}_s}(L_K+\gamma I)^{-1}P_{\mathbf{x}_s}K_x\right\|_{\mathcal{L}(Y)}^{n/2}\,\mathbb{E}_y\left(\|y-f_{\mathcal{H}}(x)\|_Y^n\right)\right).
\end{aligned}$$

Under the assumption (10) we get

$$\mathbb{E}_\mathbf{z}\left(\|\xi_1-\mathbb{E}_\mathbf{z}(\xi_1)\|_{\mathcal{H}}^n\right) \le \frac{n!}{2}\left(\Sigma\sqrt{\mathcal{N}(\gamma)}\right)^2\left(\frac{\kappa M}{\sqrt{\gamma}}\right)^{n-2}, \quad \forall n \ge 2.$$

On applying Proposition 3.1 we conclude that

$$\left\|(L_K+\gamma I)^{-1/2}P_{\mathbf{x}_s}\left\{S_\mathbf{x}^*\mathbf{y} - S_\mathbf{x}^*S_\mathbf{x}f_{\mathcal{H}}\right\}\right\|_{\mathcal{H}} \le 2\left(\frac{\kappa M}{m\sqrt{\gamma}} + \sqrt{\frac{\Sigma^2\mathcal{N}(\gamma)}{m}}\right)\log\left(\frac{4}{\eta}\right)$$

with confidence $1-\eta/2$.

The second expression can be estimated easily by considering the random variable $\xi_2(x) = K_xK_x^*$ from $X$ to the space of Hilbert-Schmidt operators on $\mathcal{H}$. The proof can also be found in De Vito et al. [27]. ∎

The following conditions on the sample size and the subsampling size are used to derive the convergence rates of the regularized learning algorithms. In particular, for a sufficiently large sample we can assume the following inequality with confidence $1-\eta$:

$$\frac{8\kappa^2}{\sqrt{m}}\log\left(\frac{4}{\eta}\right) \le \lambda_0. \tag{15}$$

Following the notion of Rudi et al. [57] and Kriukova et al. [39] on subsampling, we measure the approximation power of the projection method induced by the projection operator $P_{\mathbf{x}_s}$ in terms of $\Delta_s$. We make the assumption on $\Delta_s$ as considered in Theorem 2 of [39]:

$$\Delta_s = \left\|L_K^{1/2}(I-P_{\mathbf{x}_s})\right\|_{\mathcal{L}(\mathcal{H})} \le \sqrt{\Theta_{1/2}^{-1}(m^{-1/2})}, \quad \text{for } \Theta_{1/2}(t) = \sqrt{t}\,\phi(t). \tag{16}$$

Under the parameter choice $\lambda_0 = \Psi^{-1}(m^{-1/2})$ for $\Psi(t) = t\,\phi(t)$, we obtain

$$\Delta_s^2 \le \Theta_{1/2}^{-1}(m^{-1/2}) \le \Psi^{-1}(m^{-1/2}) = \lambda_0. \tag{17}$$

Moreover, from Lemma 6 of [57], under Assumption 2.3 and for an appropriate subsampling size $s$, for every $0 < \eta < 1$ the following inequality holds with probability $1-\eta$:

$$\Delta_s^2 \le \left\|\left(L_K + \frac{\lambda_0}{3}I\right)^{1/2}(I-P_{\mathbf{x}_s})\right\|_{\mathcal{L}(\mathcal{H})}^2 \le \lambda_0.$$

Then, under the condition (17), using Propositions 2 and 3 of [45] we get

$$\left\|(I-P_{\mathbf{x}_s})\psi(L_K)\right\|_{\mathcal{L}(\mathcal{H})} \le \psi\left(\left\|L_K^{1/2}(I-P_{\mathbf{x}_s})\right\|_{\mathcal{L}(\mathcal{H})}^2\right) \le \psi(\lambda_0)$$

and

$$\left\|P_{\mathbf{x}_s}\psi(L_K)P_{\mathbf{x}_s} - \psi(P_{\mathbf{x}_s}L_KP_{\mathbf{x}_s})\right\|_{\mathcal{L}(\mathcal{H})} \le c_\psi\,\psi\left(\left\|L_K^{1/2}(I-P_{\mathbf{x}_s})\right\|_{\mathcal{L}(\mathcal{H})}^2\right) \le c_\psi\,\psi(\lambda_0).$$

In the following, we discuss the error analysis of the multi-penalty regularization scheme based on Nyström type subsampling in a probabilistic sense. In general, the convergence rates of regularization algorithms are derived separately in the RKHS norm and in the $\mathscr{L}^2$-norm. In Theorems 3.1, 3.2 and 3.3, we estimate error bounds for multi-penalty regularization based on Nyström type subsampling in the $\psi(L_K)$-weighted norm, which consequently provides the convergence rates of the regularized solution in both the $\mathcal{H}$-norm and the $\mathscr{L}^2$-norm.

###### Theorem 3.1.

Let $\mathbf{z}$ be i.i.d. samples drawn according to a probability measure satisfying conditions (i)-(iii) and Assumption 2.2, and suppose $\phi(t)/t$ and $\psi(t)/t$ are nonincreasing functions. Then under the parameter choice $\lambda_0 = \Psi^{-1}(m^{-1/2})$ for $\Psi(t) = t\,\phi(t)$, for a sufficiently large sample according to (15) and for a subsampling size according to (17), the following convergence rate of $f^s_{\mathbf{z},\lambda}$ holds with confidence $1-\eta$ for all $0 < \eta < 1$:

$$\left\|\psi(L_K)(f^s_{\mathbf{z},\lambda} - f_{\mathcal{H}})\right\|_{\mathcal{H}} \le \psi(\lambda_0)\left\{c_1\phi(\lambda_0) + c_2\frac{\mathcal{B}_\lambda}{\lambda_0^{3/2}} + c_3\frac{1}{m\lambda_0} + c_4\sqrt{\frac{\mathcal{N}(\lambda_0)}{m\lambda_0}}\right\}\log\left(\frac{4}{\eta}\right),$$

where $c_1$, $c_2$, $c_3$, $c_4$ are constants independent of $m$ and $\eta$, and $\mathcal{B}_\lambda$ collects the contribution of the additional penalties $\lambda_j\|B_jf\|_{\mathcal{H}}^2$.

###### Proof.

We discuss the error bound for $f^s_{\mathbf{z},\lambda} - f_{\mathcal{H}}$ by estimating the expressions $f^s_{\mathbf{z},\lambda} - f^s_{\lambda}$ and $f^s_{\lambda} - f_{\mathcal{H}}$. The first term can be expressed as

$$f^s_{\mathbf{z},\lambda} - f^s_{\lambda} = \left(P_{\mathbf{x}_s}S_\mathbf{x}^*S_\mathbf{x}P_{\mathbf{x}_s} + \lambda_0 I + \sum_{j=1}^p\lambda_jP_{\mathbf{x}_s}B_j^*B_jP_{\mathbf{x}_s}\right)^{-1}\left\{P_{\mathbf{x}_s}\left(S_\mathbf{x}^*\mathbf{y} - S_\mathbf{x}^*S_\mathbf{x}f_{\mathcal{H}}\right) + \left(P_{\mathbf{x}_s}S_\mathbf{x}^*S_\mathbf{x}P_{\mathbf{x}_s} - P_{\mathbf{x}_s}L_KP_{\mathbf{x}_s}\right)(f_{\mathcal{H}} - f^s_{\lambda})\right\},$$

which implies

$$\left\|\psi(L_K)(f^s_{\mathbf{z},\lambda} - f^s_{\lambda})\right\|_{\mathcal{H}} \le \frac{\psi(\lambda_0)}{\sqrt{\lambda_0}}\,I_1\left\{I_2 + \frac{1}{\sqrt{\lambda_0}}\left(I_3 + I_4\|f_{\mathcal{H}} - f^s_{\lambda}\|_{\mathcal{H}}\right)\right\},$$

where $I_1$, $I_2$, $I_3$ and $I_4$ denote the operator- and vector-norm quantities arising from this decomposition.

The estimates of $I_2$ and $I_4$ can be obtained from Proposition 3.2. Under the condition (15), using the second estimate of Proposition 3.2, we obtain

$$\operatorname{Tr}\left(\left(P_{\mathbf{x}_s}L_KP_{\mathbf{x}_s}+\lambda_0 I\right)^{-1}\left(P_{\mathbf{x}_s}L_KP_{\mathbf{x}_s} - P_{\mathbf{x}_s}S_\mathbf{x}^*S_\mathbf{x}P_{\mathbf{x}_s}\right)\right) \le \frac{I_4}{\lambda_0} \le \frac{4\kappa^2}{\sqrt{m}\,\lambda_0}\log\left(\frac{4}{\eta}\right) \le \frac{1}{2},$$

and using standard operator-norm inequalities this implies

$$\begin{aligned}
I_1 &\le \left\|(L_K+\lambda_0 I)^{1/2}\left(P_{\mathbf{x}_s}S_\mathbf{x}^*S_\mathbf{x}P_{\mathbf{x}_s}+\lambda_0 I\right)^{-1}(L_K+\lambda_0 I)^{1/2}\right\|_{\mathcal{L}(\mathcal{H})} \\
&\le \operatorname{Tr}\left(\left(P_{\mathbf{x}_s}S_\mathbf{x}^*S_\mathbf{x}P_{\mathbf{x}_s}+\lambda_0 I\right)^{-1}(L_K+\lambda_0 I)\right) \\
&= \operatorname{Tr}\left(\left(P_{\mathbf{x}_s}S_\mathbf{x}^*S_\mathbf{x}P_{\mathbf{x}_s}+\lambda_0 I\right)^{-1}\left((I-P_{\mathbf{x}_s})L_K + P_{\mathbf{x}_s}L_K(I-P_{\mathbf{x}_s})\right)\right) + \operatorname{Tr}\left(\left(P_{\mathbf{x}_s}S_\mathbf{x}^*S_\mathbf{x}P_{\mathbf{x}_s}+\lambda_0 I\right)^{-1}\left(P_{\mathbf{x}_s}L_KP_{\mathbf{x}_s}+\lambda_0 I\right)\right) \\
&\le \frac{2}{\lambda_0}\left\|L_K(I-P_{\mathbf{x}_s})\right\|\cdots
\end{aligned}$$