# Bias Correction for Regularized Regression and its Application in Learning with Streaming Data

We propose an approach to reduce the bias of ridge regression and of the regularization kernel network. When applied to a single data set, the new algorithms have learning performance comparable to the original ones. When applied to incremental learning with block-wise streaming data, the new algorithms are more efficient due to bias reduction. Both theoretical characterizations and simulation studies are used to verify the effectiveness of these new algorithms.


## 1 Introduction

As modern technologies allow data to be collected much more easily, the size of data sets is growing fast in both dimensionality and number of instances. This has made big data ubiquitous in many fields and has drawn great attention from researchers in recent years. In the statistics and machine learning context, many traditional data processing tools and techniques become infeasible due to the size of the data, and new models and computational techniques are required. This has driven the resurgence of research in online learning and the use of parallel computing.

Online learning deals with streaming data. Online algorithms update the knowledge incrementally as new data come in. The streaming data could be instance-wise or block-wise. Instance-wise streaming data can also be processed as block-wise data, which may be preferred in some application domains. For instance, in dynamic pricing problems (see e.g. [33, 2]) the price is usually not updated every time an instance of sales information becomes available, because customers may not like the price changing too frequently. When processing block-wise streaming data, a base algorithm is applied to each incoming data block and a coupling method is then used to update the knowledge by combining the knowledge from the past blocks and the incoming block; see e.g. [9, 16].

In statistical learning theory, where the knowledge is usually represented by a target function, the simplest way to couple the information is to use the average of the functions learnt from different blocks. In learning with big data, the divide and conquer algorithm [38] divides the whole data set into smaller subsets, applies a base learning algorithm on each subset, and takes the average of the learnt functions from all subsets as the target function for prediction purposes. It is computationally efficient because the second stage can be implemented via parallel computing. Although the divide and conquer algorithm is different from the aforementioned online learning with block-wise streaming data, they clearly share the same spirit: a base algorithm for a single data block is required, and the average of the outputs of this algorithm over multiple data blocks is used. A natural problem arising from these two frameworks is the choice of the base learning algorithm. An algorithm that is efficient and optimal for a single data set is not necessarily efficient and optimal for learning with block-wise data.
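As a concrete illustration, the block-wise averaging scheme can be sketched in a few lines of NumPy. The helper names `ridge_fit` and `divide_and_conquer` are ours, and the base algorithm here is plain ridge regression on centered data:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Base algorithm: ridge regression on one data block (centered data)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                      # center predictors
    yc = y - y.mean()                            # center responses
    Sigma = Xc.T @ Xc / n                        # sample covariance matrix
    return np.linalg.solve(lam * np.eye(p) + Sigma, Xc.T @ yc / n)

def divide_and_conquer(X, y, lam, num_blocks):
    """Split the data into blocks, run the base algorithm on each,
    and average the learnt coefficient vectors."""
    blocks = zip(np.array_split(X, num_blocks), np.array_split(y, num_blocks))
    return np.mean([ridge_fit(Xb, yb, lam) for Xb, yb in blocks], axis=0)
```

The same averaging step serves for block-wise streaming data: one keeps a running mean of the per-block estimates instead of storing them all.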

In this paper we focus on the regression learning problem where a set of observations $\{(x_i, y_i)\}_{i=1}^n$ are collected for the predictors and a scalar response variable. Assume they are linked by

$$y_i = f^*(x_i) + \epsilon_i, \qquad i = 1, 2, \ldots, n,$$

where $x_i$ is the vector of predictors and $\epsilon_i$ is the zero-mean noise. The target is to recover the unknown true model $f^*$ as accurately as possible, to understand the impact of the predictors and to predict the response on unobserved data. The ordinary least squares (OLS) method is the most traditional and well developed one. It assumes a linear model and estimates the coefficients by minimizing the squared error between the responses and the predictions. The OLS estimator requires the inverse of the covariance matrix of the explanatory variables and can be numerically unstable if the covariance matrix is singular or has a very large condition number. A variety of regularization approaches have been introduced to overcome the numerical instability and/or for some other purposes (e.g. sparsity). Typical regularized methods include ridge regression [18, 17], LASSO [31], elastic net [39] and many others. The nonlinear extension of ridge regression can be implemented by the regularization kernel network [13]. The data are first mapped to a feature space. Then a linear model is built in the feature space which, when projected back to the original space, becomes a nonlinear model.

Although different regularization techniques have different properties, they share some common features. The estimators obtained from regularized regression are usually biased. The regularization is helpful to improve the computational stability and reduce the variance. By trading off the bias and variance, regularization schemes may lead to smaller prediction errors than unbiased estimators. Therefore, regularization theory has become an important topic in the statistical learning context.

Regularization algorithms, such as ridge regression, the regularization kernel network, and support vector machines, have been successful in a variety of regression and classification applications. However, they may be suboptimal when they serve as base algorithms in learning with block-wise streaming data or in the divide and conquer algorithm. When there are many data blocks, the regularization algorithm may provide a good estimator for each data block. When the estimators are coupled together, the variance usually shrinks as more and more data blocks are taken into account. But the bias may not shrink, which prevents the algorithm from achieving optimal performance. To overcome this difficulty, adjustments are required to remove or reduce the bias of the algorithm.
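This phenomenon is easy to reproduce numerically. The sketch below is an illustration of ours, not from the paper; the helper `ridge_block` and the chosen constants are arbitrary. It averages ridge estimates over a growing number of independent blocks with a deliberately large regularization parameter: the error stops improving once the variance has been averaged out, because the regularization bias remains.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_block, lam = 5, 200, 0.5        # large lambda makes the bias visible
w_true = rng.normal(size=p)

def ridge_block():
    """Ridge estimate on one fresh data block (standard normal predictors)."""
    X = rng.normal(size=(n_block, p))
    y = X @ w_true + rng.normal(size=n_block)
    Sigma = X.T @ X / n_block
    return np.linalg.solve(lam * np.eye(p) + Sigma, X.T @ y / n_block)

# Averaging more blocks shrinks the variance but not the bias:
for m in (1, 10, 100):
    w_bar = np.mean([ridge_block() for _ in range(m)], axis=0)
    print(m, np.linalg.norm(w_bar - w_true))
```

With identity covariance the error plateaus near a fixed fraction of the norm of the true coefficient vector no matter how many blocks are averaged.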

In this paper, we propose an approach to correct the bias of ridge regression and of the regularization kernel network. The resulting two new algorithms and their properties are described in Section 2 and Section 4. Their theoretical properties are proved in Section 3 and Section 5, respectively. In Section 6 we discuss why the new algorithms are effective in learning with block-wise data. In Section 7 simulations are used to illustrate their effectiveness from an empirical aspect. We close in Section 8 with conclusions and discussions.

### 1.1 Related Work

The idea of bias correction has a long history in statistics. For instance, bias correction for maximum likelihood estimation dates at least back to the 1950s [25] and a variety of methods were proposed later on; see e.g. [22, 26, 14, 11]. Bias reduction for kernel density estimators was studied in [6, 1, 20, 10]. Bias correction for nonparametric estimation was studied in [15, 24, 35].

The existence of bias in ridge regression and its impact on statistical inference have been noticed since its invention [18, 23]. In high dimensional linear models where the dimension greatly exceeds the sample size, a bias correction method was introduced in [7] to correct the projection bias, i.e. the difference between the true regression coefficient and its projection onto the subspace spanned by the sample, which appears because the sample cannot span the whole Euclidean space when the dimension exceeds the sample size. In [36, 7, 19], projection bias correction was introduced to LASSO in high dimensional linear models. The purpose of projection bias correction is to construct accurate $p$-values to facilitate accurate statistical inference such as hypothesis tests and confidence intervals. It seems the bias caused by regularization has minimal impact for this purpose.

As for the regularization kernel network, its predictive consistency has been extensively studied in the literature; see e.g. [5, 37, 12, 34, 4, 8, 27, 30, 28] and the references therein. Its applications have also been extensively explored and shown successful in many problem domains. But to the best of my knowledge, the idea of using bias correction to improve this algorithm is novel. Note that bias reduction for regularized regression does not improve the learning performance on a single data set, as illustrated in Section 7. It is nevertheless worthy of exploration because of its effectiveness in learning with streaming data and distributed regression.

## 2 Bias correction for ridge regression

In linear regression, the response variable is assumed to linearly depend on the explanatory variables, i.e.

$$y_i = w^\top x_i + b + \epsilon_i$$

with some $w \in \mathbb{R}^p$ and $b \in \mathbb{R}$. Ridge regression minimizes the penalized mean squared prediction error on the observed data

$$\frac{1}{n}\sum_{i=1}^n\left(y_i - w^\top x_i - b\right)^2 + \lambda\|w\|^2,$$

where $\lambda > 0$ is the regularization parameter used to trade off the fitting error and the model complexity. Let $\bar{x}$ denote the sample mean of the $x_i$'s and $\bar{y}$ the sample mean of the $y_i$'s. Denote by $\widetilde{X}$ the centered data matrix for the explanatory variables, whose $i$-th row is $(x_i - \bar{x})^\top$, and by $\widetilde{Y}$ the vector of centered response values. Then the sample covariance matrix is $\widehat{\Sigma} = \frac{1}{n}\widetilde{X}^\top\widetilde{X}$ and the solution to the ridge regression is given by

$$\widehat{w} = \frac{1}{n}\left(\lambda I + \widehat{\Sigma}\right)^{-1}\widetilde{X}^\top\widetilde{Y}$$

and $\widehat{b} = \bar{y} - \widehat{w}^\top\bar{x}$. Here and in the sequel, $I$ denotes the identity matrix (or the identity operator).

The solution $\widehat{w}$ is a biased estimator for $w$. Define

$$b(n,\lambda) = \|E[\widehat{w}] - w\|$$

and

$$v(n,\lambda) = E\left[\|\widehat{w} - E[\widehat{w}]\|^2\right].$$

Then $b(n,\lambda)$ is the Euclidean norm of the bias and $v(n,\lambda)$ is the variance. The mean squared error is given by

$$\mathrm{mse}(n,\lambda) = E\left[\|\widehat{w} - w\|^2\right] = b^2(n,\lambda) + v(n,\lambda).$$

Denote by $\Sigma$ the covariance matrix of the explanatory variables. Let $\sigma_1, \ldots, \sigma_p$ be the eigenvalues of $\Sigma$ and $v_1, \ldots, v_p$ the corresponding eigenvectors. Then

$$\Sigma = \sum_{i=1}^p \sigma_i v_i v_i^\top.$$

The vectors $v_i$ are the principal components. The following theorem characterizes the bias and variance of ridge regression.

###### Theorem 1.

If $\|x\|$ is bounded by $M$ and $E[\epsilon^2] < \infty$, then

1. $E[\widehat{w}]$ converges to $(\lambda I+\Sigma)^{-1}\Sigma w$ as $n \to \infty$.

2. If $w = \sum_{i=1}^p c_i v_i$, then

$$\lim_{n\to\infty} b(n,\lambda) = \sqrt{\sum_{i=1}^p \frac{\lambda^2 c_i^2}{(\lambda+\sigma_i)^2}}.$$

3. $v(n,\lambda) \le \dfrac{40M^4\|w\|^2 + 8M^2E[\epsilon^2]}{n\lambda^2}$.

Without loss of generality, we assume the eigenvalues are in decreasing order, i.e., $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_p$. Then $\lambda/(\lambda+\sigma_i)$ is increasing in $i$. Theorem 1 tells us that, for a fixed $\lambda$, the bias of ridge regression will be small if the true model heavily depends on the first several principal components. Conversely, if the true model heavily depends on the last several principal components, the bias will be large.

According to Theorem 1 (i), the asymptotic bias of the ridge estimator is $(\lambda I+\Sigma)^{-1}\Sigma w - w = -\lambda(\lambda I+\Sigma)^{-1}w$. If we could subtract it from the ridge regression estimator, we would obtain the asymptotically unbiased estimator $\widehat{w} + \lambda(\lambda I+\Sigma)^{-1}w$. However, this is not computationally feasible because both $\Sigma$ and $w$ are unknown. Instead, we propose to replace $\Sigma$ by its sample version $\widehat{\Sigma}$ and $w$ by the ridge estimator $\widehat{w}$. The resulting new estimator, which we call the bias corrected ridge regression estimator, becomes

$$\widehat{w}^\sharp = \widehat{w} + \lambda(\lambda I+\widehat{\Sigma})^{-1}\widehat{w}. \qquad (1)$$
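In code, the correction in (1) is a single extra linear solve against the matrix $\lambda I + \widehat{\Sigma}$ that the ridge fit already builds. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def ridge_with_correction(X, y, lam):
    """Return the ridge estimate w_hat and the bias-corrected estimate
    w_sharp = w_hat + lam * (lam*I + Sigma_hat)^{-1} w_hat."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                      # center predictors
    yc = y - y.mean()                            # center responses
    A = lam * np.eye(p) + Xc.T @ Xc / n          # lam*I + sample covariance
    w_hat = np.linalg.solve(A, Xc.T @ yc / n)    # ridge estimator
    w_sharp = w_hat + lam * np.linalg.solve(A, w_hat)  # bias correction
    return w_hat, w_sharp
```

In practice one would factor `A` once (e.g. via a Cholesky decomposition) and reuse the factorization for both solves.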

Since the bias correction uses an estimated quantity, this new estimator is still biased. But the bias is reduced. Let

$$b^\sharp(n,\lambda) = \|E[\widehat{w}^\sharp] - w\|$$

and

$$v^\sharp(n,\lambda) = E\left[\|\widehat{w}^\sharp - E[\widehat{w}^\sharp]\|^2\right].$$

We have the following conclusion.

###### Theorem 2.

If $\|x\|$ is bounded by $M$ and $E[\epsilon^2] < \infty$, then

1. $E[\widehat{w}^\sharp]$ converges to $(\lambda I+\Sigma)^{-2}(2\lambda I+\Sigma)\Sigma w$ as $n \to \infty$.

2. If $w = \sum_{i=1}^p c_i v_i$, then

$$\lim_{n\to\infty} b^\sharp(n,\lambda) = \sqrt{\sum_{i=1}^p \frac{\lambda^4 c_i^2}{(\lambda+\sigma_i)^4}}.$$

3. $v^\sharp(n,\lambda) \le \dfrac{160M^4\|w\|^2 + 32M^2E[\epsilon^2]}{n\lambda^2}$.

Since $\lambda/(\lambda+\sigma_i) < 1$ for every $i$, the asymptotic bias is smaller. The bias reduction could be significant if the true model depends only on the first several principal components. We also remark that, although $v(n,\lambda)$ and $v^\sharp(n,\lambda)$ are of the same order, $v^\sharp(n,\lambda)$ is found to be slightly larger in simulations. The overall performance of these two estimators, as measured by the mean squared error, is comparable when they are used in learning with a single data set.

## 3 Proofs of Theorem 1 and Theorem 2

In this section we prove Theorem 1 and Theorem 2. To this end, we first introduce several useful lemmas. In our analysis, we will deal with vector or operator valued random variables. We need the following inequalities for Hilbert space valued random variables.

###### Lemma 3.

Let $\xi$ be a random variable with values in a Hilbert space $\mathcal{H}$. Then for any non-random $h \in \mathcal{H}$ we have

$$E\left[\|\xi - E[\xi]\|^2\right] \le E\left[\|\xi - h\|^2\right].$$
###### Proof.

The proof is quite direct:

$$\begin{aligned} E\left[\|\xi - E[\xi]\|^2\right] &= E\left[\|(\xi - h) - (E[\xi] - h)\|^2\right] \\ &= E\left[\|\xi - h\|^2\right] - 2E\left[\langle \xi - h,\ E[\xi] - h\rangle\right] + \|E[\xi] - h\|^2 \\ &= E\left[\|\xi - h\|^2\right] - \|E[\xi] - h\|^2 \\ &\le E\left[\|\xi - h\|^2\right]. \end{aligned}$$

∎

###### Lemma 4.

Let $\mathcal{H}$ be a Hilbert space and $\xi$ be a random variable with values in $\mathcal{H}$. Assume that $\|\xi\| \le M$ almost surely. Let $\{\xi_i\}_{i=1}^n$ be a sample of independent observations for $\xi$. Then

$$E\left[\left\|\frac{1}{n}\sum_{i=1}^n\xi_i - E[\xi]\right\|^2\right] \le \frac{M^2}{n} \quad\text{and}\quad E\left[\left\|\frac{1}{n}\sum_{i=1}^n\xi_i - E[\xi]\right\|\right] \le \frac{M}{\sqrt{n}}.$$

###### Proof.

Since $\|\xi_i\| \le M$ for all $i$ and $\xi_1, \ldots, \xi_n$ are mutually independent, we have

$$\begin{aligned} E\left[\left\|\frac{1}{n}\sum_{i=1}^n\xi_i - E[\xi]\right\|^2\right] &= E\left[\left\|\frac{1}{n}\sum_{i=1}^n\xi_i\right\|^2\right] - 2E\left[\left\langle\frac{1}{n}\sum_{i=1}^n\xi_i,\ E[\xi]\right\rangle\right] + \|E[\xi]\|^2 \\ &= \frac{1}{n}E\left[\|\xi\|^2\right] - \frac{1}{n}\|E[\xi]\|^2 \le \frac{1}{n}E\left[\|\xi\|^2\right] \le \frac{M^2}{n}. \end{aligned}$$

This proves the first inequality. The second one follows from the first one and the Schwarz inequality. ∎

In the sequel, we assume $\|x\|$ is uniformly bounded by $M$.

###### Lemma 5.

We have

$$E\left[\|\widehat{\Sigma}-\Sigma\|^2\right] \le \frac{10M^4}{n} \quad\text{and}\quad E\left[\|\widehat{\Sigma}-\Sigma\|\right] \le M^2\sqrt{\frac{10}{n}}.$$
###### Proof.

Let $\mu$ be the mean of $x$. Note that $\Sigma = E[xx^\top] - \mu\mu^\top$ and similarly

$$\widehat{\Sigma} = \frac{1}{n}\sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})^\top = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top - \bar{x}\bar{x}^\top.$$

Thus,

$$\begin{aligned} \left\|\widehat{\Sigma}-\Sigma\right\|^2 &\le 2\left\|\frac{1}{n}\sum_{i=1}^n x_ix_i^\top - E[xx^\top]\right\|^2 + 2\left\|\bar{x}\bar{x}^\top - \mu\mu^\top\right\|^2 \\ &\le 2\left\|\frac{1}{n}\sum_{i=1}^n x_ix_i^\top - E[xx^\top]\right\|_F^2 + 8M^2\|\bar{x}-\mu\|^2, \qquad (2) \end{aligned}$$

where $\|\cdot\|_F$ represents the Frobenius norm and we have used the fact that $\|A\| \le \|A\|_F$ for all matrices $A$.

Recall that the $p\times p$ matrices form a Hilbert space with the Frobenius norm. Applying Lemma 4 to $\xi = xx^\top$, which satisfies $\|xx^\top\|_F = \|x\|^2 \le M^2$, we obtain

$$E\left[\left\|\frac{1}{n}\sum_{i=1}^n x_ix_i^\top - E[xx^\top]\right\|_F^2\right] \le \frac{M^4}{n}. \qquad (3)$$

Next we apply Lemma 4 to $\xi = x$ and obtain

$$E\left[\|\bar{x}-\mu\|^2\right] \le \frac{M^2}{n}. \qquad (4)$$

Then the first inequality follows by taking expectations on both sides of (2) and applying (3) and (4). The second one follows from the first one and the Schwarz inequality. ∎

Now we can prove the two theorems.

###### Proof of Theorem 1.

Note that $\|(\lambda I+\widehat{\Sigma})^{-1}\| \le 1/\lambda$ and $\|(\lambda I+\Sigma)^{-1}\Sigma\| \le 1$. We have

$$\begin{aligned} &\left\|(\lambda I+\widehat{\Sigma})^{-1}\widehat{\Sigma}w - (\lambda I+\Sigma)^{-1}\Sigma w\right\| \\ \le\ &\left\|(\lambda I+\widehat{\Sigma})^{-1}(\widehat{\Sigma}-\Sigma)w\right\| + \left\|(\lambda I+\widehat{\Sigma})^{-1}(\Sigma-\widehat{\Sigma})(\lambda I+\Sigma)^{-1}\Sigma w\right\| \\ \le\ &\frac{1}{\lambda}\|\widehat{\Sigma}-\Sigma\|\,\|w\| + \frac{1}{\lambda}\|\widehat{\Sigma}-\Sigma\|\,\|w\| = \frac{2}{\lambda}\|\widehat{\Sigma}-\Sigma\|\,\|w\|. \end{aligned}$$

The conclusion (i) follows from

$$\begin{aligned} \left\|E[\widehat{w}] - (\lambda I+\Sigma)^{-1}\Sigma w\right\| &\le E\left[\left\|(\lambda I+\widehat{\Sigma})^{-1}\widehat{\Sigma}w - (\lambda I+\Sigma)^{-1}\Sigma w\right\|\right] \\ &\le \frac{2\|w\|}{\lambda}E\left[\|\widehat{\Sigma}-\Sigma\|\right] \le \frac{2M^2\|w\|}{\lambda}\sqrt{\frac{10}{n}} \longrightarrow 0. \end{aligned}$$

The conclusion (ii) is an easy consequence of (i) by noting that

$$\lim_{n\to\infty}\|E[\widehat{w}] - w\|^2 = \left\|(\lambda I+\Sigma)^{-1}\Sigma w - w\right\|^2.$$

Denote by $R = \widetilde{Y} - \widetilde{X}w$ the column vector of residuals. By the fact $\widetilde{Y} = \widetilde{X}w + R$ we have $\frac{1}{n}\widetilde{X}^\top\widetilde{Y} = \widehat{\Sigma}w + \frac{1}{n}\widetilde{X}^\top R$ and

$$\widehat{w} = (\lambda I+\widehat{\Sigma})^{-1}\widehat{\Sigma}w + (\lambda I+\widehat{\Sigma})^{-1}\left(\frac{1}{n}\widetilde{X}^\top R\right).$$

By the facts $\|x_i - \bar{x}\| \le 2M$ for all $i$ and $E[\epsilon_i\epsilon_j] = 0$ for $i \ne j$, we obtain

$$E\left[\left\|\frac{1}{n}\widetilde{X}^\top R\right\|^2\right] = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n E\left[\epsilon_i\epsilon_j(x_i-\bar{x})^\top(x_j-\bar{x})\right] = \frac{1}{n^2}\sum_{i=1}^n E\left[\epsilon_i^2\|x_i-\bar{x}\|^2\right] \le \frac{4M^2E[\epsilon^2]}{n}. \qquad (5)$$

By Lemma 3 and Lemma 5, we have

$$\begin{aligned} E\left[\|\widehat{w} - E[\widehat{w}]\|^2\right] &\le E\left[\left\|\widehat{w} - (\lambda I+\Sigma)^{-1}\Sigma w\right\|^2\right] \\ &\le 2E\left[\left\|(\lambda I+\widehat{\Sigma})^{-1}\widehat{\Sigma}w - (\lambda I+\Sigma)^{-1}\Sigma w\right\|^2\right] + 2E\left[\left\|(\lambda I+\widehat{\Sigma})^{-1}\left(\frac{1}{n}\widetilde{X}^\top R\right)\right\|^2\right] \\ &\le \frac{4\|w\|^2}{\lambda^2}E\left[\|\widehat{\Sigma}-\Sigma\|^2\right] + \frac{2}{\lambda^2}E\left[\left\|\frac{1}{n}\widetilde{X}^\top R\right\|^2\right] \\ &\le \frac{40M^4\|w\|^2 + 8M^2E[\epsilon^2]}{n\lambda^2}. \end{aligned}$$

This verifies (iii). ∎

###### Proof of Theorem 2.

It is easy to verify that $\widehat{w}^\sharp = (\lambda I+\widehat{\Sigma})^{-1}(2\lambda I+\widehat{\Sigma})\widehat{w}$. Therefore, $E[\widehat{w}^\sharp] = E\big[(\lambda I+\widehat{\Sigma})^{-2}(2\lambda I+\widehat{\Sigma})\widehat{\Sigma}w\big]$. To prove (i) we write

$$\begin{aligned} &(\lambda I+\widehat{\Sigma})^{-2}(2\lambda I+\widehat{\Sigma})\widehat{\Sigma}w - (\lambda I+\Sigma)^{-2}(2\lambda I+\Sigma)\Sigma w \\ =\ &(\lambda I+\widehat{\Sigma})^{-2}\big[(\lambda I+\widehat{\Sigma})^2 - \lambda^2 I\big]w - (\lambda I+\Sigma)^{-2}\big[(\lambda I+\Sigma)^2 - \lambda^2 I\big]w \\ =\ &\lambda^2\big[(\lambda I+\Sigma)^{-2} - (\lambda I+\widehat{\Sigma})^{-2}\big]w \\ =\ &\lambda^2(\lambda I+\Sigma)^{-2}\big[2\lambda(\widehat{\Sigma}-\Sigma) + (\widehat{\Sigma}-\Sigma)\widehat{\Sigma} + \Sigma(\widehat{\Sigma}-\Sigma)\big](\lambda I+\widehat{\Sigma})^{-2}w. \end{aligned}$$

By $\|(\lambda I+\Sigma)^{-2}\| \le \lambda^{-2}$, $\|\widehat{\Sigma}(\lambda I+\widehat{\Sigma})^{-1}\| \le 1$ and $\|\Sigma(\lambda I+\Sigma)^{-1}\| \le 1$, we obtain

$$\left\|(\lambda I+\widehat{\Sigma})^{-2}(2\lambda I+\widehat{\Sigma})\widehat{\Sigma}w - (\lambda I+\Sigma)^{-2}(2\lambda I+\Sigma)\Sigma w\right\| \le \frac{4\|w\|}{\lambda}\,\|\widehat{\Sigma}-\Sigma\|. \qquad (6)$$

The conclusion (i) follows from the following estimate:

$$\begin{aligned} \left\|E[\widehat{w}^\sharp] - (\lambda I+\Sigma)^{-2}(2\lambda I+\Sigma)\Sigma w\right\| &\le E\left[\left\|(\lambda I+\widehat{\Sigma})^{-2}(2\lambda I+\widehat{\Sigma})\widehat{\Sigma}w - (\lambda I+\Sigma)^{-2}(2\lambda I+\Sigma)\Sigma w\right\|\right] \\ &\le \frac{4\|w\|}{\lambda}E\left[\|\widehat{\Sigma}-\Sigma\|\right] \le \frac{4M^2\|w\|}{\lambda}\sqrt{\frac{10}{n}} \longrightarrow 0. \end{aligned}$$

The conclusion (ii) is an easy consequence of (i).

To prove (iii), we write $\widehat{w}^\sharp = (\lambda I+\widehat{\Sigma})^{-2}(2\lambda I+\widehat{\Sigma})\widehat{\Sigma}w + (\lambda I+\widehat{\Sigma})^{-2}(2\lambda I+\widehat{\Sigma})\big(\frac{1}{n}\widetilde{X}^\top R\big)$. By Lemma 3, Lemma 5, and the estimates (6) and (5), we have

$$\begin{aligned} E\left[\|\widehat{w}^\sharp - E[\widehat{w}^\sharp]\|^2\right] &\le E\left[\left\|\widehat{w}^\sharp - (\lambda I+\Sigma)^{-2}(2\lambda I+\Sigma)\Sigma w\right\|^2\right] \\ &\le 2E\left[\left\|(\lambda I+\widehat{\Sigma})^{-2}(2\lambda I+\widehat{\Sigma})\widehat{\Sigma}w - (\lambda I+\Sigma)^{-2}(2\lambda I+\Sigma)\Sigma w\right\|^2\right] \\ &\qquad + 2E\left[\left\|(\lambda I+\widehat{\Sigma})^{-2}(2\lambda I+\widehat{\Sigma})\left(\frac{1}{n}\widetilde{X}^\top R\right)\right\|^2\right] \\ &\le \frac{16\|w\|^2}{\lambda^2}E\left[\|\widehat{\Sigma}-\Sigma\|^2\right] + \frac{8}{\lambda^2}E\left[\left\|\frac{1}{n}\widetilde{X}^\top R\right\|^2\right] \\ &\le \frac{160M^4\|w\|^2 + 32M^2E[\epsilon^2]}{n\lambda^2}, \end{aligned}$$

where we have used the bound $\big\|(\lambda I+\widehat{\Sigma})^{-2}(2\lambda I+\widehat{\Sigma})\big\| \le 2/\lambda$. This verifies (iii). ∎

## 4 Bias correction for regularization kernel network

When the true regression model is nonlinear, kernel methods can be used. Denote by $\mathcal{X}$ the space of explanatory variables. A Mercer kernel is a continuous, symmetric, and positive-semidefinite function $K:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$. The inner product defined by $\langle K(x,\cdot), K(t,\cdot)\rangle_K = K(x,t)$ induces a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated to the kernel $K$. The space $\mathcal{H}_K$ is the closure of the function class spanned by $\{K(x,\cdot): x\in\mathcal{X}\}$. The reproducing property $\langle f, K(x,\cdot)\rangle_K = f(x)$ leads to $|f(x)| \le \sqrt{K(x,x)}\,\|f\|_K$. Thus $\mathcal{H}_K$ can be embedded into the space of continuous functions. We refer to [3] for more properties of RKHS.

The regularization kernel network [13] estimates the true model by a function $\widehat{f} \in \mathcal{H}_K$ that minimizes the regularized sample mean squared error

$$\frac{1}{n}\sum_{i=1}^n\left(y_i - f(x_i)\right)^2 + \lambda\|f\|_K^2.$$

The representer theorem [32] tells us that $\widehat{f} = \sum_{i=1}^n c_i K(x_i,\cdot)$. So although the RKHS may be infinite dimensional, the optimization of the regularization kernel network can be implemented in an $n$ dimensional space. Actually, let $\mathbb{K} = \big(K(x_i,x_j)\big)_{i,j=1}^n$ denote the kernel matrix on $\{x_i\}_{i=1}^n$ and $Y = (y_1,\ldots,y_n)^\top$. The coefficient vector $c = (c_1,\ldots,c_n)^\top$ can be solved from the linear system $(\lambda n I + \mathbb{K})\,c = Y$.
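For concreteness, here is a minimal NumPy sketch of fitting and evaluating the regularization kernel network by solving an $n$-dimensional linear system. The Gaussian kernel and all function names are illustrative choices of ours, not prescribed by the paper:

```python
import numpy as np

def gaussian_kernel(A, B, gamma=10.0):
    """Kernel matrix with entries K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def kernel_network_fit(X, y, lam, gamma=10.0):
    """Solve the linear system (lam*n*I + K) c = y for the coefficients."""
    n = len(X)
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(lam * n * np.eye(n) + K, y)

def kernel_network_predict(X_train, c, X_new, gamma=10.0):
    """Evaluate f_hat(x) = sum_i c_i K(x_i, x) at the new points."""
    return gaussian_kernel(X_new, X_train, gamma) @ c
```

The fit costs one $n \times n$ solve; for large $n$ a Cholesky factorization of the (positive definite) system matrix is the usual choice.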

In [27] an operator representation for $\widehat{f}$ is proved. Let $S:\mathcal{H}_K\to\mathbb{R}^n$ be the sampling operator defined by $Sf = (f(x_1),\ldots,f(x_n))^\top$ for $f\in\mathcal{H}_K$. Its dual operator $S^*:\mathbb{R}^n\to\mathcal{H}_K$ is given by $S^*c = \sum_{i=1}^n c_i K(x_i,\cdot)$ for $c\in\mathbb{R}^n$. Then we have

$$\widehat{f} = \frac{1}{n}\left(\lambda I + \frac{1}{n}S^*S\right)^{-1} S^* Y.$$

The operator $\frac{1}{n}S^*S$ is a sample version of the integral operator $L_K$ defined by

$$L_K f(x) = E_t[K(x,t)f(t)] = \int_{\mathcal{X}} K(x,t)f(t)\, dP_X(t),$$

where $P_X$ is the marginal distribution on $\mathcal{X}$. Note that $L_K$ defines a bounded operator both on $L^2$ (associated to the probability measure $P_X$) and on $\mathcal{H}_K$. Let $\{\tau_i\}$ and $\{\phi_i\}$ be the eigenvalues and eigenfunctions of $L_K$. Then $\{\phi_i\}$ form an orthonormal basis of $L^2$ and

$$L_K f = \sum_{i=1}^\infty \tau_i \langle f, \phi_i\rangle_{L^2}\,\phi_i, \qquad \forall\, f\in L^2.$$

Also, the functions $\psi_i = \sqrt{\tau_i}\,\phi_i$ with $\tau_i\ne 0$ form an orthonormal basis of $\mathcal{H}_K$ and, as an operator on $\mathcal{H}_K$,

$$L_K f = \sum_{i:\tau_i\ne 0} \tau_i \langle f, \psi_i\rangle_K\,\psi_i, \qquad \forall\, f\in\mathcal{H}_K.$$

Moreover, $L_K^{1/2}$ maps all functions in $L^2$ into $\mathcal{H}_K$ and

$$\|f\|_{L^2} = \|L_K^{1/2} f\|_K, \qquad \forall\, f\in\overline{\mathrm{span}}\{\phi_i : \tau_i\ne 0\}\subset L^2.$$

In particular, this is true for all $f$ in $\overline{\mathcal{H}_K}$, the closure of $\mathcal{H}_K$ in $L^2$. Only functions in $\overline{\mathcal{H}_K}$ can be well approximated by functions in $\mathcal{H}_K$.

The regularization kernel network can be regarded as a nonlinear extension of ridge regression. If $f^*\in\mathcal{H}_K$, we can measure the difference between $\widehat{f}$ and $f^*$ in $\mathcal{H}_K$ and prove conclusions analogous to those in Theorem 1. But unfortunately, $f^*\in\mathcal{H}_K$ is generally not true. To make our result more general, we measure the difference between $\widehat{f}$ and $f^*$ in the $L^2$ sense, which is equivalent to measuring the mean squared forecasting error. For this purpose, we define

$$b_K(n,\lambda) = \left\|E_D[\widehat{f}] - f^*\right\|_{L^2} = \sqrt{E_x\left[\left(E_D[\widehat{f}(x)] - f^*(x)\right)^2\right]}$$

and

$$v_K(n,\lambda) = E_D\left[\left\|\widehat{f} - E_D[\widehat{f}]\right\|_{L^2}^2\right] = E_{D,x}\left[\left(\widehat{f}(x) - E_D[\widehat{f}(x)]\right)^2\right],$$

where $E_D$ is the expectation with respect to the data $D$ and $E_x$ is the expectation with respect to $x$.

###### Theorem 6.

If $\kappa = \sup_{x\in\mathcal{X}}\sqrt{K(x,x)} < \infty$ and $|y| \le M$ almost surely, then

1. $E_D[\widehat{f}]$ converges to $(\lambda I + L_K)^{-1}L_K f^*$ in $L^2$.

2. If $f^* = \sum_i \alpha_i\phi_i \in \overline{\mathcal{H}_K}$, then

$$\lim_{n\to\infty} b_K(n,\lambda) = \sqrt{\sum_{i:\tau_i\ne 0}\frac{\lambda^2\alpha_i^2}{(\lambda+\tau_i)^2}}.$$

3. $v_K(n,\lambda) \to 0$ if $\lambda = \lambda(n)$ satisfies $\lambda \to 0$ and $n\lambda \to \infty$.

Theorem 6 (ii) characterizes the asymptotic bias for target functions that belong to $\overline{\mathcal{H}_K}$ and thus can be well learned by the regularization kernel network. If the target function has a component orthogonal to $\overline{\mathcal{H}_K}$, that component is not learnable and its $L^2$ norm should be added to the right hand side. The variance bound in Theorem 6 (iii) is presented under assumptions on $\lambda = \lambda(n)$ which, according to the literature (e.g. [27, 30]), usually guarantee that the regularization kernel network achieves the optimal learning rate. When they do not hold, an explicit bound can be found in the proof in Section 5.

Following the same idea as in Section 2, we propose to reduce the bias by using an adjusted function

$$\widehat{f}^\sharp = \widehat{f} + \lambda\left(\lambda I + \frac{1}{n}S^*S\right)^{-1}\widehat{f}. \qquad (7)$$

The implementation of this new approach is easy. We can verify that

$$\widehat{f}^\sharp(x) = \sum_{i=1}^n c_i^\sharp K(x_i, x)$$

with

$$c^\sharp = c + \lambda\left(\lambda I + \frac{1}{n}\mathbb{K}\right)^{-1} c.$$
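So, given the coefficients of the original fit, the correction costs one more $n\times n$ linear solve with the same kernel matrix. A minimal sketch (the function name is ours):

```python
import numpy as np

def correct_coefficients(K, c, lam):
    """Bias-corrected coefficients c_sharp = c + lam*(lam*I + K/n)^{-1} c,
    where K is the n x n kernel matrix of the training sample."""
    n = K.shape[0]
    return c + lam * np.linalg.solve(lam * np.eye(n) + K / n, c)
```

The corrected predictor is then evaluated exactly as before, f♯(x) = Σᵢ c♯ᵢ K(xᵢ, x), so existing prediction code needs no change.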

Let

$$b_K^\sharp(n,\lambda) = \sqrt{E_x\left[\left(E_D[\widehat{f}^\sharp(x)] - f^*(x)\right)^2\right]}$$

and

$$v_K^\sharp(n,\lambda) = E_{D,x}\left[\left(\widehat{f}^\sharp(x) - E_D[\widehat{f}^\sharp(x)]\right)^2\right].$$
###### Theorem 7.

If $\kappa < \infty$ and $|y| \le M$ almost surely, then

1. $E_D[\widehat{f}^\sharp]$ converges to $(\lambda I + L_K)^{-2}(2\lambda I + L_K)L_K f^*$ in $L^2$.

2. If $f^* = \sum_i \alpha_i\phi_i \in \overline{\mathcal{H}_K}$, then

$$\lim_{n\to\infty} b_K^\sharp(n,\lambda) = \sqrt{\sum_{i:\tau_i\ne 0}\frac{\lambda^4\alpha_i^2}{(\lambda+\tau_i)^4}}.$$

## 5 Proofs of Theorem 6 and Theorem 7

The proofs of Theorem 6 and Theorem 7 are more complicated than those of Theorem 1 and Theorem 2 because they require techniques to handle the estimation of integral operators. To simplify the notations, we will write $E$ for the expectation $E_D$ with respect to the data when no confusion can arise from the context.

###### Lemma 8.

Let be positive random variable. For any ,

The following concentration inequality is proved in [27].

###### Lemma 9.

Let $\mathcal{H}$ be a Hilbert space and $\xi$ be a random variable with values in $\mathcal{H}$. Assume that $\|\xi\| \le M$ almost surely. Let $\{\xi_i\}_{i=1}^m$ be a sample of independent observations for $\xi$. Then for any $0 < \delta < 1$, with confidence at least $1-\delta$,

$$\left\|\frac{1}{m}\sum_{i=1}^m \xi_i - E[\xi]\right\| \le \frac{2M\log(2/\delta)}{m} + \sqrt{\frac{2E[\|\xi\|^2]\log(2/\delta)}{m}}. \qquad (8)$$

When no information regarding $E[\|\xi\|^2]$ is available, we can use $E[\|\xi\|^2] \le M^2$ to derive a simpler estimate as follows. For any $0 < \delta < 1$, with confidence at least $1-\delta$,

$$\left\|\frac{1}{m}\sum_{i=1}^m \xi_i - E[\xi]\right\| \le 2M\sqrt{\frac{\log(2/\delta)}{m}}. \qquad (9)$$

Before we state the next lemma, let us recall the Hilbert–Schmidt operators on $\mathcal{H}_K$. Let $\{e_i\}$ be an orthonormal basis of $\mathcal{H}_K$. An operator $A$ is a Hilbert–Schmidt operator if $\sum_i \|Ae_i\|_K^2$ is finite. A Hilbert–Schmidt operator is also a bounded operator with $\|A\| \le \|A\|_{HS}$. All Hilbert–Schmidt operators form a Hilbert space. For $f, g \in \mathcal{H}_K$, the rank one tensor operator $f\otimes g$ is defined by $(f\otimes g)h = f\,\langle g, h\rangle_K$ for all $h\in\mathcal{H}_K$. A tensor operator is a Hilbert–Schmidt operator with $\|f\otimes g\|_{HS} = \|f\|_K\|g\|_K$.

###### Lemma 10.

For any $0 < \delta < 1$, we have

$$\left\|\frac{1}{n}S^*S - L_K\right\| \le 2\kappa^2\sqrt{\frac{\log(2/\delta)}{n}}$$

with confidence at least $1-\delta$. Also, we have

$$E\left[\left\|\frac{1}{n}S^*S - L_K\right\|^2\right] \le \frac{\kappa^4}{n} \quad\text{and}\quad E\left[\left\|\frac{1}{n}S^*S - L_K\right\|\right] \le \frac{\kappa^2}{\sqrt{n}}.$$
###### Proof.

Consider the random variable $\xi = K(x,\cdot)\otimes K(x,\cdot)$. It satisfies $\frac{1}{n}S^*S = \frac{1}{n}\sum_{i=1}^n K(x_i,\cdot)\otimes K(x_i,\cdot)$, $E[\xi] = L_K$, and $\|\xi\|_{HS} = \|K(x,\cdot)\|_K^2 = K(x,x) \le \kappa^2$. Then the first inequality follows from (9).

The second and third inequalities have been obtained in [12, 29]. They also follow from Lemma 4 easily. ∎

###### Lemma 11.

For any $0 < \delta < 1$, we have

$$\left\|\frac{1}{n}\sum_{i=1}^n f^*(x_i)K(x_i,\cdot) - L_K f^*\right\| \le 2\kappa M\sqrt{\frac{\log(2/\delta)}{n}}$$

with confidence at least $1-\delta$. Also, we have

$$E\left[\left\|\frac{1}{n}\sum_{i=1}^n f^*(x_i)K(x_i,\cdot) - L_K f^*\right\|^2\right] \le \frac{\kappa^2 M^2}{n} \quad\text{and}\quad E\left[\left\|\frac{1}{n}\sum_{i=1}^n f^*(x_i)K(x_i,\cdot) - L_K f^*\right\|\right] \le \frac{\kappa M}{\sqrt{n}}.$$