As modern technologies make it much easier to collect data, the size of data sets is growing fast in both dimensionality and number of instances. Big data have thus become ubiquitous in many fields and have drawn great attention from researchers in recent years. In the statistics and machine learning context, many traditional data processing tools and techniques become infeasible at this scale, and new models and computational techniques are required. This has driven the resurgence of research in online learning and the use of parallel computing.
Online learning deals with streaming data. Online algorithms update the knowledge incrementally as new data come in. The streaming data could be instance-wise or block-wise, and instance-wise streaming data can always be processed as block-wise data. This may even be preferred in some application domains. For instance, in dynamic pricing problems (see e.g. [33, 2]) the price is usually not updated every time an instance of sales information becomes available, because customers may not like the price changing too frequently. When processing block-wise streaming data, a base algorithm is applied to each incoming data block and a coupling method is then used to update the knowledge by combining the knowledge from the past blocks and the incoming block; see e.g. [9, 16].
In statistical learning theory, where the knowledge is usually represented by a target function, the simplest way to couple the information is to use the average of the functions learnt from different blocks. In learning with big data, the divide and conquer algorithm divides the whole data set into smaller subsets, applies a base learning algorithm on each subset, and takes the average of the learnt functions from all subsets as the target function for prediction purposes. It is computationally efficient because the second stage can be implemented via parallel computing. Although the divide and conquer algorithm is different from the aforementioned online learning with block-wise streaming data, they clearly share the same spirit: a base algorithm for a single data block is required, and the average of the outputs from this algorithm over multiple data blocks is used. A natural problem arising from these two frameworks is the choice of the base learning algorithm for a single data set. An algorithm that is efficient and optimal for a single data set is not necessarily efficient and optimal for learning with block-wise data.
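As a minimal illustration of the divide and conquer scheme described above, the following sketch averages block-wise estimates; the choice of ridge regression as the base algorithm and all names are our own illustration, not the paper's code:

```python
import numpy as np

def fit_block(X, y, lam=0.1):
    """Base learner applied to one data block: ridge regression (illustrative choice)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X / len(y) + lam * np.eye(p), X.T @ y / len(y))

def divide_and_conquer(X, y, n_blocks, lam=0.1):
    """Split the data into blocks, fit each block, and average the estimates."""
    betas = [fit_block(Xb, yb, lam)
             for Xb, yb in zip(np.array_split(X, n_blocks),
                               np.array_split(y, n_blocks))]
    return np.mean(betas, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
beta_true = np.arange(1.0, 6.0)
y = X @ beta_true + 0.1 * rng.normal(size=1000)
beta_avg = divide_and_conquer(X, y, n_blocks=10)
```

Averaging shrinks the variance of the block estimates, but any systematic bias of the base learner survives the average, which is the issue this paper addresses.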
In this paper we focus on the regression learning problem where a set of observations $\{(x_i, y_i)\}_{i=1}^n$ are collected for $p$ predictors and a scalar response variable. Assume they are linked by
$$y = f^*(x) + \epsilon,$$
where $x \in \mathbb{R}^p$, $y \in \mathbb{R}$, and $\epsilon$ is the zero-mean noise. The target is to recover the unknown true model $f^*$
as accurately as possible to understand the impact of the predictors and to predict the response on unobserved data. Ordinary least squares (OLS) is the most traditional and well developed method. It assumes a linear model and estimates the coefficients by minimizing the squared error between the responses and the predictions. The OLS estimator requires inverting the covariance matrix of the explanatory variables and can be numerically unstable if the covariance matrix is singular or has a very large condition number. A variety of regularization approaches have been introduced to overcome the numerical instability and/or for other purposes (e.g. sparsity). Typical regularized methods include ridge regression [18, 17], LASSO, elastic net, and many others. The nonlinear extension of ridge regression can be implemented by the regularization kernel network. The data are first mapped to a feature space. Then a linear model is built in the feature space which, when projected back to the original space, becomes a nonlinear model.
Although different regularization techniques have different properties, they share some common features. The estimators obtained from regularized regression are usually biased. The regularization is helpful to improve the computational stability and reduce the variance. By trading off the bias and variance, regularization schemes may lead to smaller prediction errors than unbiased estimators. Therefore, regularization theory has become an important topic in the statistical learning context.
Regularization algorithms, such as ridge regression, the regularization kernel network, and support vector machines, have been successful in a variety of regression and classification applications. However, they may be suboptimal when they serve as base algorithms in learning with block-wise streaming data or in the divide and conquer algorithm. When there are many data blocks, the regularization algorithm may provide a good estimator for each data block. As the estimators are coupled together, the variance usually shrinks when more and more data blocks are taken into account, but the bias may not shrink, which prevents the algorithm from achieving optimal performance. To overcome this difficulty, adjustments are required to remove or reduce the bias of the algorithm.
In this paper, we propose an approach to correct the bias of ridge regression and the regularization kernel network. The two resulting algorithms are described in Section 2 and Section 4. Their theoretical properties are proved in Section 3 and Section 5, respectively. In Section 6 we discuss why the new algorithms are effective in learning with block-wise data. In Section 7 simulations are used to illustrate their effectiveness from an empirical aspect. We close in Section 8 with conclusions and discussions.
1.1 Related Work
The idea of bias correction has a long history in statistics. For instance, bias correction for maximum likelihood estimation dates back at least to the 1950s, and a variety of methods were proposed later on; see e.g. [22, 26, 14, 11]. Bias reduction for kernel density estimators was studied in [6, 1, 20, 10]. Bias correction for nonparametric estimation was studied in [15, 24, 35].
The existence of bias in ridge regression and its impact on statistical inference have been noticed since its invention [18, 23]. In high dimensional linear models where the dimension greatly exceeds the sample size, a bias correction method was introduced in  to correct the projection bias, i.e., the difference between the true regression coefficient and its projection onto the subspace spanned by the sample, which appears because the sample cannot span the whole Euclidean space when the dimension exceeds the sample size. In [36, 7, 19], projection bias correction was introduced to LASSO in high dimensional linear models. The purpose of projection bias correction is to construct accurate p-values to facilitate accurate statistical inference such as hypothesis tests and confidence intervals. It seems the bias caused by regularization has minimal impact for this purpose.
As for the regularization kernel network, its predictive consistency has been extensively studied in the literature; see e.g. [5, 37, 12, 34, 4, 8, 27, 30, 28] and the references therein. Its applications have also been extensively explored and shown successful in many problem domains. But to the best of our knowledge, the idea of bias correction to improve this algorithm is novel. Note that bias reduction for regularized regression does not improve the learning performance on a single data set, as illustrated in Section 7. It is nevertheless worth exploring because of its effectiveness in learning with streaming data or distributed regression.
2 Bias correction for ridge regression
In linear regression, the response variable is assumed to depend linearly on the explanatory variables, i.e.
$$y = x^\top \beta^* + b^* + \epsilon$$
with some $\beta^* \in \mathbb{R}^p$ and $b^* \in \mathbb{R}$. Ridge regression minimizes the penalized mean squared prediction error on the observed data,
$$\min_{\beta,\, b}\ \frac{1}{n}\sum_{i=1}^n \big(y_i - x_i^\top \beta - b\big)^2 + \lambda \|\beta\|^2,$$
where $\lambda > 0$ is the regularization parameter used to trade off the fitting error and model complexity. Let $\bar x$ denote the sample mean of the $x_i$'s and $\bar y$ be the sample mean of the $y_i$'s. Denote by $X$ the centered data matrix for the explanatory variables and $Y$ the vector of centered response values. Then the sample covariance matrix of $x$ is $\hat C = \frac{1}{n} X^\top X$ and the solution to the ridge regression is given by
$$\hat\beta_\lambda = \big(\hat C + \lambda I\big)^{-1} \frac{1}{n} X^\top Y$$
and $\hat b_\lambda = \bar y - \bar x^\top \hat\beta_\lambda$. Here and in the sequel, $I$
denotes the identity matrix (or the identity operator).
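The centered closed-form solution can be computed directly; the sketch below is our own minimal implementation of this standard formula, not the authors' code:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression via the centered closed form:
    beta = (C_hat + lam*I)^{-1} (1/n) Xc^T Yc,  intercept = ybar - xbar^T beta."""
    n, p = X.shape
    xbar, ybar = X.mean(axis=0), y.mean()
    Xc, Yc = X - xbar, y - ybar                    # centered data
    C_hat = Xc.T @ Xc / n                          # sample covariance matrix
    beta = np.linalg.solve(C_hat + lam * np.eye(p), Xc.T @ Yc / n)
    return beta, ybar - xbar @ beta

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 3.0]) + 2.0 + 0.05 * rng.normal(size=500)
beta, intercept = ridge(X, y, lam=1e-8)
```

With a negligible regularization parameter the estimate essentially coincides with OLS; increasing `lam` shrinks the coefficients and introduces the bias studied below.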
The solution $\hat\beta_\lambda$ is a biased estimator for $\beta^*$. Define
$$\mathcal{B} = \big\|\mathbb{E}\hat\beta_\lambda - \beta^*\big\| \qquad\text{and}\qquad \mathcal{V} = \mathbb{E}\big\|\hat\beta_\lambda - \mathbb{E}\hat\beta_\lambda\big\|^2.$$
Then $\mathcal{B}$ is the Euclidean norm of the bias and $\mathcal{V}$ is the variance. The mean squared error is given by
$$\mathrm{MSE} = \mathbb{E}\big\|\hat\beta_\lambda - \beta^*\big\|^2 = \mathcal{B}^2 + \mathcal{V}.$$
Denote by $C$ the covariance matrix of the explanatory variables $x$. Let $\mu_1, \ldots, \mu_p$
be the eigenvalues of $C$ and $v_1, \ldots, v_p$
the corresponding eigenvectors. Then
$$C = \sum_{i=1}^p \mu_i\, v_i v_i^\top.$$
The vectors $v_1, \ldots, v_p$ are the principal components. The following theorem characterizes the bias and variance of ridge regression.
If $\|x\|$ is bounded and $\lambda$ is fixed, then, as $n \to \infty$, $\mathcal{B}$ converges to
$$\mathcal{B}_\infty := \lambda\big\|(C+\lambda I)^{-1}\beta^*\big\| = \lambda\bigg(\sum_{i=1}^p \frac{(v_i^\top\beta^*)^2}{(\mu_i+\lambda)^2}\bigg)^{1/2}.$$
If , then
Without loss of generality, we assume the eigenvalues are in decreasing order, i.e., $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_p$. Then $\frac{\lambda}{\mu_i + \lambda}$ is increasing in $i$. Theorem 1 shows that, for a fixed $\lambda$, the bias of ridge regression will be small if the true model heavily depends on the first several principal components. Conversely, if $\beta^*$ heavily depends on the last several principal components, the bias will be large.
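The large-sample bias of the ridge estimator, $-\lambda(C+\lambda I)^{-1}\beta^*$, can be checked numerically. The sketch below is our own illustration (all parameter values are arbitrary choices): it draws one large sample, computes the centered ridge estimate, and compares its deviation from $\beta^*$ with the theoretical expression.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, lam = 4, 100_000, 0.5
C = np.diag([4.0, 2.0, 1.0, 0.5])                 # population covariance
beta_star = np.array([1.0, -1.0, 2.0, 0.5])

X = rng.normal(size=(n, p)) @ np.sqrt(C)          # x ~ N(0, C)
y = X @ beta_star + 0.1 * rng.normal(size=n)
Xc, yc = X - X.mean(axis=0), y - y.mean()
C_hat = Xc.T @ Xc / n
beta_hat = np.linalg.solve(C_hat + lam * np.eye(p), Xc.T @ yc / n)

# For large n the deviation should be close to -lam * (C + lam*I)^{-1} beta_star.
bias_theory = -lam * np.linalg.solve(C + lam * np.eye(p), beta_star)
```

The coordinates associated with the smallest eigenvalues of $C$ carry the largest bias, in line with the discussion above.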
According to Theorem 1 (i), the asymptotic bias of the ridge estimator is $\lambda(C+\lambda I)^{-1}\beta^*$. If we could subtract it from the ridge regression estimator, we would obtain an asymptotically unbiased estimator $\hat\beta_\lambda + \lambda(C+\lambda I)^{-1}\beta^*$. However, this is not computationally feasible because both $C$ and $\beta^*$ are unknown. Instead, we propose to replace $C$ by its sample version $\hat C$ and $\beta^*$ by the ridge estimator $\hat\beta_\lambda$. The resulting new estimator, which we call the bias corrected ridge regression estimator, becomes
$$\tilde\beta_\lambda = \hat\beta_\lambda + \lambda\big(\hat C + \lambda I\big)^{-1}\hat\beta_\lambda.$$
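The correction costs only one extra linear solve. A sketch of both estimators side by side (our own minimal implementation on centered data):

```python
import numpy as np

def ridge_and_corrected(X, y, lam):
    """Return the ridge estimator and its bias corrected version
    beta_tilde = beta_hat + lam * (C_hat + lam*I)^{-1} beta_hat."""
    n, p = X.shape
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    A = Xc.T @ Xc / n + lam * np.eye(p)            # C_hat + lam*I
    beta_hat = np.linalg.solve(A, Xc.T @ yc / n)   # ridge estimator
    beta_tilde = beta_hat + lam * np.linalg.solve(A, beta_hat)
    return beta_hat, beta_tilde

rng = np.random.default_rng(2)
X = rng.normal(size=(20_000, 3))
beta_true = np.array([1.0, 2.0, 3.0])
y = X @ beta_true + 0.1 * rng.normal(size=20_000)
beta_hat, beta_tilde = ridge_and_corrected(X, y, lam=0.3)
```

On a large well-conditioned sample with a sizable $\lambda$, the corrected estimate is markedly closer to the true coefficients because most of the shrinkage bias is removed.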
Since the bias correction uses an estimated quantity, this new estimator is still biased, but the bias is reduced. Let
$$\tilde{\mathcal{B}} = \big\|\mathbb{E}\tilde\beta_\lambda - \beta^*\big\| \qquad\text{and}\qquad \tilde{\mathcal{V}} = \mathbb{E}\big\|\tilde\beta_\lambda - \mathbb{E}\tilde\beta_\lambda\big\|^2.$$
We have the following conclusion.
If $\|x\|$ is bounded and $\lambda$ is fixed, then $\tilde{\mathcal{B}}$
converges to
$$\tilde{\mathcal{B}}_\infty := \lambda^2\big\|(C+\lambda I)^{-2}\beta^*\big\| = \lambda^2\bigg(\sum_{i=1}^p \frac{(v_i^\top\beta^*)^2}{(\mu_i+\lambda)^4}\bigg)^{1/2}$$
as $n \to \infty$.
If , then
Since $\frac{\lambda}{\mu_i+\lambda} \le 1$ for each $i$, the asymptotic bias satisfies $\tilde{\mathcal{B}}_\infty \le \mathcal{B}_\infty$ and is therefore smaller. The bias reduction can be significant if the true model depends only on the first several principal components. We also remark that, although $\mathcal{V}$ and $\tilde{\mathcal{V}}$ are of the same order, $\tilde{\mathcal{V}}$ is found slightly larger in simulations. The overall performance of these two estimators, as measured by the mean squared error, is comparable when they are used in learning with a single data set.
3 Proofs of Theorem 1 and Theorem 2
In this section we prove Theorem 1 and Theorem 2. To this end, we first introduce several useful lemmas. In our analysis, we will deal with vector or operator valued random variables. We need the following inequalities for Hilbert space valued random variables.
Let $\xi$ be a random variable with values in a Hilbert space $\mathcal{H}$. Then for any $a \in \mathcal{H}$ we have
$$\mathbb{E}\|\xi - a\|^2 = \mathbb{E}\|\xi - \mathbb{E}\xi\|^2 + \|\mathbb{E}\xi - a\|^2.$$
The proof is quite direct:
$$\mathbb{E}\|\xi - a\|^2 = \mathbb{E}\|\xi - \mathbb{E}\xi\|^2 + 2\,\mathbb{E}\big\langle \xi - \mathbb{E}\xi,\ \mathbb{E}\xi - a\big\rangle + \|\mathbb{E}\xi - a\|^2 = \mathbb{E}\|\xi - \mathbb{E}\xi\|^2 + \|\mathbb{E}\xi - a\|^2.$$
Let $\mathcal{H}$ be a Hilbert space and $\xi$ be a random variable with values in $\mathcal{H}$. Assume that $\|\xi\| \le M$ almost surely. Let $\{\xi_i\}_{i=1}^n$ be a sample of independent observations of $\xi$. Then
$$\mathbb{E}\bigg\|\frac{1}{n}\sum_{i=1}^n \xi_i - \mathbb{E}\xi\bigg\|^2 \le \frac{M^2}{n} \qquad\text{and}\qquad \mathbb{E}\bigg\|\frac{1}{n}\sum_{i=1}^n \xi_i - \mathbb{E}\xi\bigg\| \le \frac{M}{\sqrt{n}}.$$
Since $\mathbb{E}\xi_i = \mathbb{E}\xi$ for all $i$ and $\xi_1, \ldots, \xi_n$ are mutually independent, we have
$$\mathbb{E}\bigg\|\frac{1}{n}\sum_{i=1}^n \xi_i - \mathbb{E}\xi\bigg\|^2 = \frac{1}{n^2}\sum_{i=1}^n \mathbb{E}\|\xi_i - \mathbb{E}\xi\|^2 \le \frac{\mathbb{E}\|\xi\|^2}{n} \le \frac{M^2}{n}.$$
This proves the first inequality. The second one follows from the first one and the Cauchy-Schwarz inequality. ∎
In the sequel, we assume $\|x\|$ and $|y|$ are almost surely uniformly bounded by a constant $M$.
Let $\mu = \mathbb{E}[x]$ be the mean of $x$. Note that $\hat C = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top - \bar x\,\bar x^\top$ and similarly $C = \mathbb{E}[xx^\top] - \mu\mu^\top$. Therefore
$$\|\hat C - C\| \le \|\hat C - C\|_F \le \bigg\|\frac{1}{n}\sum_{i=1}^n x_i x_i^\top - \mathbb{E}[xx^\top]\bigg\|_F + \big\|\bar x\,\bar x^\top - \mu\mu^\top\big\|_F,$$
where $\|\cdot\|_F$ represents the Frobenius norm and we have used the fact that $\|A\| \le \|A\|_F$ for all matrices $A$.
Recall that the $p\times p$ matrices form a Hilbert space with the Frobenius norm. Applying Lemma 4 to $\xi = xx^\top$, which satisfies $\|\xi\|_F = \|x\|^2 \le M^2$, we obtain
$$\mathbb{E}\bigg\|\frac{1}{n}\sum_{i=1}^n x_i x_i^\top - \mathbb{E}[xx^\top]\bigg\|_F \le \frac{M^2}{\sqrt{n}}.$$
Next we apply Lemma 4 to $\xi = xy$ and obtain
$$\mathbb{E}\bigg\|\frac{1}{n}\sum_{i=1}^n x_i y_i - \mathbb{E}[xy]\bigg\| \le \frac{M^2}{\sqrt{n}}.$$
Now we can prove the two theorems.
Proof of Theorem 1.
Note that $Y = X\beta^* + \varepsilon$ with $\varepsilon$ the centered noise vector, and thus $\frac{1}{n}X^\top Y = \hat C\beta^* + \frac{1}{n}X^\top\varepsilon$. We have
$$\hat\beta_\lambda - \beta^* = -\lambda\big(\hat C + \lambda I\big)^{-1}\beta^* + \big(\hat C + \lambda I\big)^{-1}\frac{1}{n}X^\top\varepsilon.$$
The conclusion (i) follows from
The conclusion (ii) is an easy consequence of (i) by noting that
Proof of Theorem 2.
It is easy to verify that $I + \lambda(\hat C + \lambda I)^{-1} = (\hat C + \lambda I)^{-1}(\hat C + 2\lambda I)$. Therefore, $\tilde\beta_\lambda = (\hat C + \lambda I)^{-1}(\hat C + 2\lambda I)\hat\beta_\lambda$. To prove (i) we write
By and we obtain
The conclusion (i) follows from the following estimate:
The conclusion (ii) is an easy consequence of (i).
4 Bias correction for regularization kernel network
When the true regression model is nonlinear, kernel methods can be used. Denote by $\mathcal{X}$ the space of explanatory variables. A Mercer kernel is a continuous, symmetric, and positive semidefinite function $K: \mathcal{X}\times\mathcal{X} \to \mathbb{R}$. The inner product defined by $\langle K_x, K_{x'}\rangle_K = K(x, x')$, where $K_x := K(x, \cdot)$, induces a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated to the kernel $K$. The space $\mathcal{H}_K$ is the closure of the function class spanned by $\{K_x: x \in \mathcal{X}\}$. The reproducing property leads to $f(x) = \langle f, K_x\rangle_K$ for all $f \in \mathcal{H}_K$. Thus $\mathcal{H}_K$ can be embedded into $C(\mathcal{X})$. We refer to  for more properties of RKHSs.
The regularization kernel network estimates the true model by a function $f_\lambda$ that minimizes the regularized sample mean squared error,
$$f_\lambda = \arg\min_{f \in \mathcal{H}_K}\ \frac{1}{n}\sum_{i=1}^n \big(y_i - f(x_i)\big)^2 + \lambda\|f\|_K^2.$$
The representer theorem tells that $f_\lambda = \sum_{i=1}^n c_i K_{x_i}$. So although the RKHS may be infinite dimensional, the optimization of the regularization kernel network can be implemented in an $n$ dimensional space. Actually, let $\mathbb{K} = \big(K(x_i, x_j)\big)_{i,j=1}^n$ denote the kernel matrix on $x_1, \ldots, x_n$ and $y = (y_1, \ldots, y_n)^\top$. The coefficients $c = (c_1, \ldots, c_n)^\top$ can be solved from the linear system
$$\big(\mathbb{K} + \lambda n I\big)\, c = y.$$
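For concreteness, here is a small sketch that solves a linear system of this form for the representer coefficients; the Gaussian kernel, its bandwidth, and all names are our illustrative choices, and the $\lambda n$ normalization is the one matching a $\frac1n$-averaged loss:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    """Gaussian (RBF) Mercer kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_network_fit(X, y, lam, sigma=0.5):
    """Solve (K + lam*n*I) c = y for the representer coefficients."""
    n = len(y)
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def kernel_network_predict(X_train, c, X_new, sigma=0.5):
    """Evaluate f(x) = sum_i c_i K(x_i, x) at the new points."""
    return gaussian_kernel(X_new, X_train, sigma) @ c

X = np.linspace(0.0, 3.0, 200).reshape(-1, 1)
y = np.sin(X).ravel()
c = kernel_network_fit(X, y, lam=1e-4)
pred = kernel_network_predict(X, c, X)
```

With a smooth target and small $\lambda$, the fitted function tracks the data closely; larger $\lambda$ shrinks the fit toward zero, producing the regularization bias discussed next.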
In  an operator representation for $f_\lambda$ is proved. Let $S_x: \mathcal{H}_K \to \mathbb{R}^n$ be the sampling operator defined by $S_x f = \big(f(x_1), \ldots, f(x_n)\big)^\top$ for $f \in \mathcal{H}_K$. Its dual operator, $S_x^*: \mathbb{R}^n \to \mathcal{H}_K$, is given by $S_x^* c = \sum_{i=1}^n c_i K_{x_i}$ for $c \in \mathbb{R}^n$. Then we have
$$f_\lambda = \Big(\frac{1}{n}S_x^* S_x + \lambda I\Big)^{-1}\frac{1}{n}S_x^* y.$$
The operator $\frac{1}{n}S_x^* S_x$ is a sample version of the integral operator $L_K$ defined by
$$L_K f = \int_{\mathcal{X}} K(\cdot, x)\, f(x)\, d\rho_X(x),$$
where $\rho_X$ is the marginal distribution on $\mathcal{X}$. Note that $L_K$ defines a bounded operator both on $L^2_{\rho_X}$
(associated to the probability measure $\rho_X$) and $\mathcal{H}_K$. Let $\{\sigma_i\}$ and $\{\phi_i\}$
be the eigenvalues and eigenfunctions of $L_K$. Then $\{\sqrt{\sigma_i}\,\phi_i\}$ form an orthogonal basis of $\mathcal{H}_K$ and, as an operator on $\mathcal{H}_K$,
$$L_K = \sum_i \sigma_i\,\langle \cdot, \phi_i\rangle_\rho\, \phi_i,$$
where $\langle\cdot,\cdot\rangle_\rho$ and $\|\cdot\|_\rho$ denote the inner product and norm of $L^2_{\rho_X}$. Also, $\{\phi_i\}$ form an orthonormal basis of $L^2_{\rho_X}$ and, as an operator on $L^2_{\rho_X}$,
$$L_K^{1/2} = \sum_i \sqrt{\sigma_i}\,\langle \cdot, \phi_i\rangle_\rho\, \phi_i.$$
Moreover, $L_K^{1/2}$ maps all functions in $L^2_{\rho_X}$ into $\mathcal{H}_K$ and
$$\big\|L_K^{1/2} f\big\|_K = \|f\|_\rho \qquad \text{for all } f \in \overline{\mathcal{H}_K}.$$
In particular, this is true for all $f \in \mathcal{H}_K$. Note that $\overline{\mathcal{H}_K}$ is the closure of $\mathcal{H}_K$ in $L^2_{\rho_X}$. Only functions in $\overline{\mathcal{H}_K}$ can be well approximated by functions in $\mathcal{H}_K$.
Regularization kernel network can be regarded as a nonlinear extension of ridge regression. If $f^* \in \mathcal{H}_K$, we could measure the difference between $f_\lambda$ and $f^*$ in $\mathcal{H}_K$ and prove some conclusions that are analogous to those in Theorem 1. But unfortunately, $f^* \in \mathcal{H}_K$ is generally not true. To make our result more general, we measure the difference between $f_\lambda$ and $f^*$ in the $L^2_{\rho_X}$ sense, which is equivalent to measuring the mean squared forecasting error. For this purpose, we define
$$\mathcal{B}^2 = \mathbb{E}_x\Big[\big(\mathbb{E}_{\mathbf z} f_\lambda(x) - f^*(x)\big)^2\Big] \qquad\text{and}\qquad \mathcal{V} = \mathbb{E}_{\mathbf z}\,\mathbb{E}_x\Big[\big(f_\lambda(x) - \mathbb{E}_{\mathbf z} f_\lambda(x)\big)^2\Big],$$
where $\mathbb{E}_{\mathbf z}$ is the expectation with respect to the data and $\mathbb{E}_x$ is the expectation with respect to $\rho_X$.
If and almost surely, then
converges to in
If , then
if satisfies and
Theorem 6 (ii) characterizes the asymptotic bias for target functions that belong to and thus can be well learned by the regularization kernel network. If the target function has a component orthogonal to , the orthogonal component is not learnable and its norm should be added to the right hand side. The variance bound in Theorem 6 (iii) is presented with the assumptions and , which, according to the literature (e.g. [27, 30]), usually guarantee the regularization kernel network to achieve the optimal learning rate. When this is not true, an explicit bound can be found in the proof in Section 5.
Following the same idea as in Section 2, we propose to reduce the bias by using an adjusted function
$$\tilde f_\lambda = f_\lambda + \lambda\Big(\frac{1}{n}S_x^* S_x + \lambda I\Big)^{-1} f_\lambda.$$
The implementation of this new approach is easy. We can verify that $\tilde f_\lambda = \sum_{i=1}^n \tilde c_i K_{x_i}$ with
$$\tilde c = c + \lambda n\,\big(\mathbb{K} + \lambda n I\big)^{-1} c.$$
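In coefficient form the adjustment costs one extra linear solve. Assuming the representer coefficients solve $(\mathbb{K} + \lambda n I)c = y$ (the normalization matching a $\frac1n$-averaged loss), a sketch of our own:

```python
import numpy as np

def bias_corrected_coeffs(K, y, lam):
    """Coefficients of the regularization kernel network and of the
    bias corrected version, obtained with one extra linear solve:
      c       solves (K + lam*n*I) c = y
      c_tilde = c + lam*n * (K + lam*n*I)^{-1} c
    """
    n = len(y)
    A = K + lam * n * np.eye(n)
    c = np.linalg.solve(A, y)
    return c, c + lam * n * np.linalg.solve(A, c)

# Small demonstration on a smooth one dimensional target (Gaussian kernel).
x = np.linspace(0.0, 3.0, 100)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.5 ** 2))
y = np.sin(x)
c, c_tilde = bias_corrected_coeffs(K, y, lam=1e-3)
```

In the eigenbasis of $K$ the plain fit attenuates each mode by $k/(k+\lambda n)$ while the corrected fit attenuates it by $k(k+2\lambda n)/(k+\lambda n)^2$, so the corrected predictor always has the smaller in-sample residual.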
If and almost surely, then
converges to in
If , then
5 Proofs of Theorem 6 and Theorem 7
The proofs of Theorem 6 and Theorem 7 are more complicated than those of Theorem 1 and Theorem 2 because they require techniques to handle the estimation of integral operators. Without loss of generality, we assume $\kappa := \sup_{x \in \mathcal{X}}\sqrt{K(x,x)} \le 1$ and $|y| \le 1$ almost surely throughout this section in order to simplify our notations. We will always use $\|\cdot\|$ for $\|\cdot\|_{L^2_{\rho_X}}$ in case there is no confusion from the context.
Let $\xi$ be a positive random variable. For any $q > 0$,
$$\mathbb{E}[\xi^q] = \int_0^\infty q\, t^{q-1}\, \mathrm{Prob}(\xi > t)\, dt.$$
The following concentration inequality is proved in .
Let $\mathcal{H}$ be a Hilbert space and $\xi$ be a random variable with values in $\mathcal{H}$. Assume that $\|\xi\| \le M$ almost surely. Let $\{\xi_i\}_{i=1}^n$ be a sample of independent observations of $\xi$. Then for any $0 < \delta < 1$, with confidence at least $1 - \delta$,
$$\bigg\|\frac{1}{n}\sum_{i=1}^n \xi_i - \mathbb{E}\xi\bigg\| \le \frac{2M\log(2/\delta)}{n} + \sqrt{\frac{2\,\mathbb{E}\|\xi\|^2\log(2/\delta)}{n}}.$$
When no information regarding $\mathbb{E}\|\xi\|^2$ is available, we can use $\mathbb{E}\|\xi\|^2 \le M^2$ to derive a simpler estimation as follows. For any $0 < \delta < 1$, with confidence at least $1 - \delta$,
$$\bigg\|\frac{1}{n}\sum_{i=1}^n \xi_i - \mathbb{E}\xi\bigg\| \le \frac{4M\log(2/\delta)}{\sqrt{n}}.$$
Before we state the next lemma, let us recall the Hilbert-Schmidt operators on $\mathcal{H}$. Let $\{e_j\}$ be an orthonormal basis of $\mathcal{H}$. An operator $A$ is a Hilbert-Schmidt operator if $\|A\|_{HS}^2 = \sum_j \|Ae_j\|^2$ is finite. A Hilbert-Schmidt operator is also a bounded operator with $\|A\| \le \|A\|_{HS}$. All Hilbert-Schmidt operators form a Hilbert space. For $f, g \in \mathcal{H}$, the rank one tensor operator $f \otimes g$ is defined by $(f \otimes g)h = \langle h, g\rangle f$ for all $h \in \mathcal{H}$. A tensor operator is a Hilbert-Schmidt operator with $\|f \otimes g\|_{HS} = \|f\|\,\|g\|$.
For any $0 < \delta < 1$ we have
$$\big\|\hat L_K - L_K\big\|_{HS} \le \frac{4\kappa^2\log(2/\delta)}{\sqrt{n}}$$
with confidence at least $1 - \delta$, where $\hat L_K := \frac{1}{n}\sum_{i=1}^n K_{x_i}\otimes K_{x_i}$. Also, we have
$$\mathbb{E}\big\|\hat L_K - L_K\big\|_{HS} \le \frac{c\,\kappa^2}{\sqrt{n}}$$
for an absolute constant $c$.
Consider the random variable $\xi = K_x \otimes K_x$, taking values in the Hilbert space of Hilbert-Schmidt operators on $\mathcal{H}_K$. It satisfies $\|\xi\|_{HS} = \|K_x\|_K^2 = K(x,x) \le \kappa^2$, $\mathbb{E}\xi = L_K$ and $\frac{1}{n}\sum_{i=1}^n \xi_i = \hat L_K$. Then the first inequality follows from (9).
For any we have
with confidence at least Also, we have