 # Correcting the bias in least squares regression with volume-rescaled sampling

Consider linear regression where the examples are generated by an unknown distribution on R^d × R. Without any assumptions on the noise, the linear least squares solution for any i.i.d. sample will typically be biased w.r.t. the least squares optimum over the entire distribution. However, we show that if an i.i.d. sample of any size k is augmented by a certain small additional sample, then the solution of the combined sample becomes unbiased. We show this when the additional sample consists of d points drawn jointly according to the input distribution that is rescaled by the squared volume spanned by the points. Furthermore, we propose algorithms to sample from this volume-rescaled distribution when the data distribution is only known through an i.i.d. sample.


## 1 Introduction

Unbiased estimators for linear regression are useful because averaging such estimators gives an unbiased estimator whose prediction variance vanishes as the number of averaged estimators increases. Such estimators might for example be produced in a distributed fashion from multiple small samples. In this paper we develop a new method for correcting the bias of linear least squares estimators. Our main methodology for producing an unbiased estimator is volume sampling. For a fixed design matrix $X\in\mathbb{R}^{n\times d}$, the most basic variant of volume sampling chooses a subset $S$ of $d$ rows (i.e. $|S|=d$) with probability proportional to the squared volume spanned by the rows, i.e. $P(S)\propto\det(X_S)^2$, where $X_S$ is the sub-matrix of rows indexed by $S$. This procedure generalizes to sampling sets of any fixed size $k\geq d$:

$$P(S)\ \stackrel{\text{def}}{=}\ \frac{\det(X_S^\top X_S)}{\binom{n-d}{k-d}\det(X^\top X)}.\tag{1}$$

Volume sampling has the property that for any design matrix $X$ with $n\geq d$ rows and any real response vector $y\in\mathbb{R}^n$, the linear least squares solution computed on the subproblem $(X_S,y_S)$ is an unbiased estimator of the least squares solution of the full problem $(X,y)$.

We propose the following previously unobserved alternate sampling method for size-$k$ volume sampling: first volume sample a set $S_\circ$ of size $d$, and then pad the sample with a uniform subset $R$ of $k-d$ rows outside of $S_\circ$. The probability of the combined size-$k$ sample $S=S_\circ\cup R$ is again given by volume sampling (1):

$$P(S)=\sum_{\substack{S_\circ\subseteq S\\|S_\circ|=d}}\underbrace{P\big(R=S\setminus S_\circ\mid S_\circ\big)}_{1/\binom{n-d}{k-d}}\ \underbrace{P(S_\circ)}_{\det(X_{S_\circ})^2/\det(X^\top X)}\ =\ \frac{\det(X_S^\top X_S)}{\binom{n-d}{k-d}\det(X^\top X)},$$

where the last equality is the Cauchy-Binet formula for determinants. Furthermore, we study a more general statistical learning model where the points come from an unknown probability distribution over $\mathbb{R}^d\times\mathbb{R}$, and the goal is to recover the least squares solution w.r.t. the distribution. In this paper we generalize volume sampling to this case by rescaling the i.i.d. sampling distribution by the squared volume spanned by the sampled points.
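The discrete identity above can be checked exactly on a small example by enumerating all subsets. The following sketch (assuming NumPy; the function names are ours, not from the paper) compares size-$k$ volume sampling with the pad-a-size-$d$-sample procedure:

```python
import itertools
import math

import numpy as np

def volume_sample_probs(X, k):
    """Exact probabilities of size-k volume sampling: P(S) proportional to det(X_S^T X_S)."""
    n, d = X.shape
    subsets = list(itertools.combinations(range(n), k))
    w = np.array([np.linalg.det(X[list(S)].T @ X[list(S)]) for S in subsets])
    return subsets, w / w.sum()

def padded_sample_probs(X, k):
    """Distribution induced by: volume sample a size-d core, then pad it
    with a uniformly random size-(k-d) subset of the remaining rows."""
    n, d = X.shape
    cores, p_core = volume_sample_probs(X, d)
    probs = {}
    for core, p in zip(cores, p_core):
        rest = [i for i in range(n) if i not in core]
        for pad in itertools.combinations(rest, k - d):
            S = tuple(sorted(core + pad))
            probs[S] = probs.get(S, 0.0) + p / math.comb(n - d, k - d)
    return probs

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))
subsets, p_direct = volume_sample_probs(X, 4)
p_padded = padded_sample_probs(X, 4)
# The two distributions over size-4 subsets coincide (Cauchy-Binet).
assert all(abs(p_padded[S] - p) < 1e-10 for S, p in zip(subsets, p_direct))
```

Running this for other shapes (any $n\geq k\geq d$) gives the same agreement, which is exactly the Cauchy-Binet identity above.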

The simplest way to obtain a linear least squares estimator in the statistical learning model is to find the linear least squares solution for a size-$k$ i.i.d. sample. Unfortunately, such estimators are generally biased. Note that this is not the kind of bias that we deliberately impose with regularization to reduce the variance of the estimator. Rather, due to the random design, the least squares estimator is typically biased even when it is not regularized at all, and we have limited control over how large that bias may be (see Section 1.2 for a motivating example). However, our alternate sampling procedure for volume sampling (discussed in the previous paragraph) suggests the following strategy for correcting the bias: we show that if an i.i.d. sample of any size $k$ is augmented with a size-$d$ volume-rescaled sample for this distribution, then the combined sample is a volume-rescaled sample of size $k+d$, and its linear least squares solution is an unbiased estimator of the optimum. In one dimension, this means that if an i.i.d. sample is augmented with just one example, where this additional example is drawn from a distribution whose marginal density on $x$ is proportional to the original (unknown) marginal density times $x^2$, then the resulting least squares estimator becomes unbiased. Curiously enough, for the purpose of correcting the bias it does not matter whether the size-$d$ volume-rescaled sample was generated before or after the original size-$k$ i.i.d. sample was drawn, since the two samples are independent of each other.

In addition to generalizing volume sampling to the continuous domain and showing that only a subsample of size $d$ needs to be rescaled by the squared volume, we study the time and sample complexity of volume-rescaled sampling when the data distribution is only known through an i.i.d. sample. Specifically:

1. We extend determinantal rejection sampling to arbitrary data distributions with bounded support, and our improved analysis reduces its time and sample complexity by a factor of $d$.

2. When the data distribution is Gaussian with unknown covariance, we propose a new algorithm with $O(d)$ sample complexity.

#### Related work.

Discrete volume sampling of size $k=d$ was first introduced to the computer science literature, with later algorithms given in [10, 14]. The extension to sets of size $k\geq d$ came later, with algorithms given in [21, 8, 9], and additional applications in experimental design explored in [1, 24, 22]. Our alternate volume sampling procedure implies that the algorithms of [10, 14] can be used to volume sample larger sets at no additional cost. The unbiasedness of least squares estimators under volume sampling was explored in [8, 9], drawing on earlier observations.

For arbitrary data distributions, volume-rescaled sampling of size $k=d$ is a special case of a determinantal point process (DPP) (see, e.g., [3, 15]). However, for $k>d$ and arbitrary distributions, we are not aware of such sampling appearing in the literature. Related variants of discrete DPPs have been extensively explored in the machine learning community [19, 18, 20, 12, 5, 6].

#### Notations and assumptions.

Throughout the paper, $(x,y)$ is a random example drawn from some distribution $D$. We assume that the point $x\in\mathbb{R}^d$ and the response $y\in\mathbb{R}$ both have finite second moments, i.e. $E[\|x\|^2]<\infty$ and $E[y^2]<\infty$. The marginal probability measure of $x$ is denoted $D_X$, while $D_X^k$ is the probability measure over size-$k$ i.i.d. samples drawn from $D_X$. We define $\Sigma_{D_X}\stackrel{\text{def}}{=}E_{D_X}[xx^\top]$ and w.l.o.g. assume that it is invertible. Given a data sample $S=\{(x_i,y_i)\}$, we denote the least squares estimators for $S$ and $D$, respectively, as

$$w^*(S)\ \stackrel{\text{def}}{=}\ \operatorname*{argmin}_{w}\sum_{(x_i,y_i)\in S}(x_i^\top w-y_i)^2\quad\text{and}\quad w^*_D\ \stackrel{\text{def}}{=}\ \operatorname*{argmin}_{w}E_D\big[(x^\top w-y)^2\big]=\Sigma_{D_X}^{-1}E_D[xy].$$

### 1.1 Statistical Results

Our results are centered around the following size-$k$ joint sampling distribution.

###### Definition 1

Given a distribution $D_X$ over $\mathbb{R}^d$ and any $k\geq d$, we define volume-rescaled size-$k$ sampling from $D_X$ as the following probability measure ${\rm VS}^k_{D_X}$: for any event $A$ measurable w.r.t. $D_X^k$, its probability is

$${\rm VS}^k_{D_X}(A)\ \stackrel{\text{def}}{=}\ \frac{E_{D_X^k}\Big[\mathbf{1}_A\,\overbrace{\textstyle\det\big(\sum_{i=1}^k x_ix_i^\top\big)}^{\text{rescaling factor}}\Big]}{d!\binom{k}{d}\det(\Sigma_{D_X})},$$

where $\mathbf{1}_A$ is the indicator variable of the event $A$.

This distribution integrates to $1$ over its domain as a consequence of a continuous version of the classic Cauchy-Binet formula, which has appeared in the literature in various contexts (Lemma 7).

Although we define volume-rescaled sampling for any sample size $k\geq d$, we focus primarily on the special case of $k=d$ in the main results below. This is because we show that any ${\rm VS}^k_{D_X}$ can be decomposed into ${\rm VS}^d_{D_X}$ and $D_X^{k-d}$, the latter being the distribution of a size-$(k-d)$ i.i.d. sample from $D_X$.

###### Theorem 1

Let $S_\circ\sim{\rm VS}^d_{D_X}$ and $S\sim D_X^{k-d}$. Let $\tilde S$ denote a random permutation of the points from $S_\circ$ concatenated with $S$, i.e. $\tilde S=(\tilde x_{\sigma(1)},\dots,\tilde x_{\sigma(k)})$ for the concatenated sequence $(\tilde x_1,\dots,\tilde x_k)$, where $\sigma$ is a uniformly random permutation of $\{1..k\}$. Then $\tilde S\sim{\rm VS}^k_{D_X}$.

Given the above decomposition, one may wonder what is the purpose of defining volume-rescaled sampling for sizes $k>d$. In fact, we will see in the following sections that both in the proofs and in the algorithms it is sometimes easier to work with ${\rm VS}^k_{D_X}$ than with its decomposed version. For example, in the theorem below we show that for any $k\geq d$, the least squares estimator computed on a volume-rescaled sample is unbiased. Despite the fact that continuous determinantal point processes have been studied extensively in the past, we were not able to find this result for arbitrary $k$ in the literature.

###### Theorem 2

Consider the following distribution on samples $S$ of size $k\geq d$:

$$\text{Sample}\ \ (x_1,\dots,x_k)\sim{\rm VS}^k_{D_X},\qquad \text{Query}\ \ y_i\sim D_{Y|x=x_i}\ \ \forall\,i=1..k.$$

Then $E[w^*(S)]=w^*_D$.

Combining Theorems 1 and 2, we conclude that an i.i.d. sample only needs to be augmented by a dimension-size volume-rescaled sample (i.e., of size $d$) for the least squares estimator to become unbiased.

###### Corollary 3

Let $S\sim D^k$, for any $k\geq 0$. Consider the following procedure:

$$\text{Sample}\ \ (\tilde x_1,\dots,\tilde x_d)\sim{\rm VS}^d_{D_X},\qquad \text{Query}\ \ \tilde y_i\sim D_{Y|x=\tilde x_i}\ \ \forall\,i=1..d.$$

Then for $S_\circ=\{(\tilde x_i,\tilde y_i)\}_{i=1}^d$,

$$E\big[w^*(\langle S,S_\circ\rangle)\big]=E_{S\sim D^k}\Big[E_{S_\circ\sim{\rm VS}^d_D}\big[w^*(\langle S,S_\circ\rangle)\big]\Big]\ \overset{\text{(Theorem 1)}}{=}\ E_{\tilde S\sim{\rm VS}^{k+d}_D}\big[w^*(\tilde S)\big]\ \overset{\text{(Theorem 2)}}{=}\ w^*_D.$$
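In one dimension, Corollary 3 can be verified by exact enumeration over a small finite distribution. The sketch below (assuming NumPy; the toy distribution and names are ours) compares a plain i.i.d. pair of examples with a pair consisting of one i.i.d. draw plus one volume-rescaled draw, whose density is proportional to the marginal density times $x^2$:

```python
import numpy as np

# Toy 1-D distribution with finite support, so all expectations are exact sums.
xs = np.array([1.0, 2.0, 3.0])
ps = np.array([0.2, 0.5, 0.3])
ys = xs + xs**3 / 3              # nonlinear response => i.i.d. estimator is biased

m2 = np.sum(ps * xs**2)                 # E[x^2]
w_opt = np.sum(ps * xs * ys) / m2       # population least squares optimum w*_D

def lstsq_1d(x, y):
    """Least squares solution for a 1-D sample."""
    return np.dot(x, y) / np.dot(x, x)

# Expected estimator over two i.i.d. draws (exact enumeration): biased.
w_iid = sum(ps[i] * ps[j] * lstsq_1d(xs[[i, j]], ys[[i, j]])
            for i in range(3) for j in range(3))

# One i.i.d. draw + one volume-rescaled draw (density ~ p(x) * x^2): unbiased.
qs = ps * xs**2 / m2
w_aug = sum(ps[i] * qs[j] * lstsq_1d(xs[[i, j]], ys[[i, j]])
            for i in range(3) for j in range(3))

assert abs(w_aug - w_opt) < 1e-9        # augmentation corrects the bias
assert abs(w_iid - w_opt) > 1e-3        # plain i.i.d. sampling does not
```

Here the responses are deterministic given $x$; the corollary does not require noise, only that $y$ is queried from the conditional distribution.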

To put the above result in context, we note that in the fixed design case it was known that a single volume sampled subset of any size $k\geq d$ produces an unbiased least squares estimator. However, this required that all $k$ points be sampled jointly from this special distribution. Thus, Corollary 3 says that volume sampling can be used to correct the bias in existing i.i.d. samples via sample augmentation (requiring labels/responses for only $d$ additional points drawn from ${\rm VS}^d_{D_X}$). This is important in active learning scenarios, where samples from $D_X$ (unlabeled data) are cheaper than queries of the responses (labels). We also develop methods for generating the small sample from ${\rm VS}^d_{D_X}$ using only additional unlabeled samples from $D_X$ (see Section 1.3). Indeed, active learning was a motivation for volume sampling in previous works [8, 9].

### 1.2 A Simple Gaussian Experiment

The bias in least squares estimators is present even when the input is a standard Gaussian. As an example, we set:

$$x^\top=(x_1,\dots,x_d)\ \overset{\text{i.i.d.}}{\sim}\ N(0,1),\qquad y=\xi(x)+\epsilon,$$

where the response is a non-linear function $\xi$ plus independent white noise $\epsilon$. It is crucial that the response contains some non-linearity, which is something one would expect in real datasets. For the purposes of the experiment, we wish to make the least squares solution easy to compute algebraically, so we choose the following response model:

$$\xi(x)=\sum_{i=1}^d\Big(x_i+\frac{x_i^3}{3}\Big),\qquad \epsilon\sim N(0,1).$$

We stress that there is nothing special about the choice of this response model other than the fact that it contains a non-linearity and is easy to solve algebraically for $w^*_D$. We now compare the bias of the least squares estimator produced for this problem by i.i.d. sampling of $k+d$ points with that of an estimator computed from $k$ i.i.d. samples augmented by a size-$d$ volume-rescaled sample (so that the total number of samples is the same in both cases). We used a special formula (Theorem 6 below) to produce the volume-rescaled samples when $D_X$ is Gaussian. Our strategy is to produce many such estimators independently (e.g. by computing them in parallel on separate machines) and look at the estimation error of the average of those estimators.

Figure 1 shows the above experiment for several values of $d$ and a range of sample sizes $k$ (each presented data point is an average over 50 runs). Since the corrected estimator "i.i.d. + volume" is unbiased, the estimation error of the averaged estimator converges to zero as more estimators are averaged (regardless of the sample size $k$). This type of convergence appears as a straight line on the log-log plot. In contrast, the i.i.d. sampled estimator is biased for any sample size (although the bias decreases with $k$), and therefore the averaged estimator does not converge to the optimum.
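For this particular response model the optimum has a closed form: with standard Gaussian inputs, $E[x_jy]=1+E[x_j^4]/3=2$, so $w^*_D=2\cdot\mathbf{1}$. A quick Monte Carlo sanity check of that calculation (a sketch assuming NumPy; the sample size and seed are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(42)
d, n = 5, 200_000

# x ~ N(0, I_d) and y = sum_i (x_i + x_i^3/3) + eps.  Since Sigma = I_d and
# E[x_j y] = 1 + E[x_j^4]/3 = 1 + 3/3 = 2, the optimum is w*_D = (2, ..., 2).
X = rng.standard_normal((n, d))
y = X.sum(axis=1) + (X**3).sum(axis=1) / 3 + rng.standard_normal(n)

w_mc = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose(w_mc, 2.0, atol=0.1)   # large-sample estimate is near w*_D
```

The large-sample estimate approaches $w^*_D$ here only because $n$ is huge; the bias discussed above concerns small, fixed sample sizes.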

### 1.3 Sampling Algorithms

To our knowledge, the existing literature on algorithms for DPPs and volume sampling (other than the exceptions discussed below) generally assumes full or considerable knowledge of the distribution $D_X$, which often may not be the case in practice, for example when the data is coming in a stream or is drawn from a larger population. In this work, we are primarily interested in the setting where access to the distribution $D_X$ is limited to some approximate statistics plus the ability to draw i.i.d. samples from it. Two key concerns in this model are the time and sample complexities of volume-rescaled sampling for a given distribution $D_X$.

We first consider distributions with bounded support. We use a standard notion of conditioning number $K_{D_X}$ for multivariate distributions.

When $D_X$ is known to be bounded and we are given exact knowledge of the covariance matrix $\Sigma_{D_X}$, it is possible to produce a volume-rescaled sample using a classical algorithm from the DPP literature by employing rejection sampling. However, sampled sets produced by that algorithm diverge from the desired distribution unless the given covariance matrix matches the true one exactly. This may be unrealistic when we do not have full access to the distribution $D_X$. Is it possible to sample from ${\rm VS}^d_{D_X}$ without exact knowledge of $\Sigma_{D_X}$?

We answer the question affirmatively. We show that a recently proposed algorithm for fixed design volume sampling can be adapted to arbitrary $D_X$ in such a way that it only requires an approximation $\hat\Sigma$ of the covariance matrix $\Sigma_{D_X}$, while still returning samples exactly from ${\rm VS}^d_{D_X}$. The original algorithm, called determinantal rejection sampling, samples from a given finite design matrix (i.e., a discrete distribution which is fully known), but it was shown that the procedure only requires an approximation of the covariance matrix. We extend this algorithm to handle arbitrary distributions $D_X$, and also improve the analysis by weakening the required approximation quality to $\epsilon=\frac{1}{\sqrt{2d}}$.

###### Theorem 4

Given any $\hat\Sigma$ and $K$ s.t.

$$(1-\epsilon)\,\Sigma_{D_X}\preceq\hat\Sigma\preceq(1+\epsilon)\,\Sigma_{D_X},\qquad\text{where}\quad\epsilon=\frac{1}{\sqrt{2d}}\quad\text{and}\quad K\geq\frac{K_{D_X}}{1-\epsilon},$$

there is an algorithm which returns $S\sim{\rm VS}^d_{D_X}$, and whose sample and time complexity bounds, stated in terms of $K$ and $d$, hold with high probability.

###### Remark 5

Our condition $\epsilon=\frac{1}{\sqrt{2d}}$ improves on the corresponding condition used in prior work. When $D_X$ is given as a finite set of $n$ vectors in $\mathbb{R}^d$, the main cost of volume sampling is a preprocessing step that computes approximate leverage scores to accuracy $\epsilon$; with the weaker condition on $\epsilon$, that cost is reduced by a factor of $d$ (up to logarithmic terms).

In Section 4, we discuss how $\hat\Sigma$ can be obtained just by sampling from the distribution $D_X$; with high probability, the number of samples and the running time required are nearly the same (up to log terms) as for the algorithm of Theorem 4 (here, the improved $\epsilon$ also plays a key role).

The conditioning number $K_{D_X}$ can be much larger than the dimension $d$, so obtaining the estimate $\hat\Sigma$ required for Theorem 4 may still be prohibitively expensive. Thus, it is natural to ask whether there are structural assumptions on the distribution $D_X$ under which we can sample from ${\rm VS}^d_{D_X}$ without any estimate of the covariance matrix. In the following result, we exploit a connection between volume-rescaled sampling and the Wishart distribution to show that when $D_X$ is a centered multivariate normal, then without any knowledge of $\Sigma_{D_X}$ we can produce a volume-rescaled sample from only $2d+2$ i.i.d. samples of $D_X$, in $O(d^3)$ running time.

###### Theorem 6

Suppose that the point distribution $D_X$ is a centered multivariate Gaussian and let $x_1,\dots,x_{2d+2}\overset{\text{i.i.d.}}{\sim}D_X$. Then $(\tilde x_1,\dots,\tilde x_d)\sim{\rm VS}^d_{D_X}$, where

$$\tilde x_i\ \stackrel{\text{def}}{=}\ \bigg(\sum_{j=d+1}^{2d+2}x_jx_j^\top\bigg)^{\frac12}\bigg(\sum_{j=1}^{d}x_jx_j^\top\bigg)^{-\frac12}x_i.$$

Note. For a positive definite matrix $A$, we define $A^{\frac12}$ as the unique lower triangular matrix with positive diagonal entries s.t. $A^{\frac12}\big(A^{\frac12}\big)^\top=A$, and $A^{-\frac12}\stackrel{\text{def}}{=}\big(A^{\frac12}\big)^{-1}$.

Finding other distribution families which allow for volume-rescaled sampling with bounded sample complexity is an interesting future research direction.
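Theorem 6 translates directly into a few lines of code. The sketch below (assuming NumPy; the function name is ours) draws a volume-rescaled Gaussian sample from only $2d+2$ i.i.d. draws, interpreting $A^{\frac12}$ as the lower triangular Cholesky factor and $A^{-\frac12}$ as its inverse:

```python
import numpy as np

def volume_rescaled_gaussian(cov, rng):
    """Draw a size-d volume-rescaled sample from N(0, cov) via Theorem 6,
    using only 2d + 2 i.i.d. Gaussian draws (cov is used here only to
    generate those draws; the transformation itself never sees it)."""
    d = cov.shape[0]
    x = rng.multivariate_normal(np.zeros(d), cov, size=2 * d + 2)
    A = x[:d].T @ x[:d]                      # sum_{j=1}^{d} x_j x_j^T
    B = x[d:].T @ x[d:]                      # sum_{j=d+1}^{2d+2} x_j x_j^T
    # A^{1/2} is the lower triangular Cholesky factor; A^{-1/2} its inverse.
    T = np.linalg.cholesky(B) @ np.linalg.inv(np.linalg.cholesky(A))
    return x[:d] @ T.T                       # rows are the d rescaled points

rng = np.random.default_rng(0)
sample = volume_rescaled_gaussian(np.eye(3), rng)
assert sample.shape == (3, 3)
```

By Theorem 8, the Gram matrix of such a sample is Wishart with $d+2$ degrees of freedom, which gives a simple statistical sanity check: averaging `sample.T @ sample` over many draws should approach $(d+2)\,\Sigma$.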

## 2 Sample Augmentation

Let $\mathbf{a}_i^\top$ denote the $i$th row of a matrix $A$. First, we extend a classic lemma which was originally used to derive the expected value of a statistic known in multivariate statistics as the "generalized variance".

###### Lemma 7 (based on )

If the (transposed) rows of the random matrices $A,B\in\mathbb{R}^{k\times d}$ are sampled as pairs of vectors $(\mathbf{a}_i,\mathbf{b}_i)$ i.i.d. from a distribution over pairs of random vectors $(\mathbf{a},\mathbf{b})$ such that $E[\mathbf{a}\mathbf{b}^\top]$ exists, then

$$E\big[\det(A^\top B)\big]\ =\ d!\binom{k}{d}\det\!\big(E[\mathbf{a}\mathbf{b}^\top]\big).$$

The above result is slightly different from the original (there, the two matrices coincided and the sample mean was subtracted from the vectors before constructing the matrix), but the analysis is similar (see proof in Appendix A). Note that for $A=B=X$ with rows $x_i^\top$ drawn i.i.d. from $D_X$, Lemma 7 shows that the rescaling factor in Definition 1 integrates to the normalizing constant $d!\binom{k}{d}\det(\Sigma_{D_X})$, making ${\rm VS}^k_{D_X}$ a well-defined probability distribution:

$$E_{D_X^k}\bigg[\det\bigg(\sum_{i=1}^kx_ix_i^\top\bigg)\bigg]=d!\binom{k}{d}\det(\Sigma_{D_X}).$$
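This normalization identity is easy to probe numerically. A Monte Carlo sketch (assuming NumPy; the dimensions, trial count, and seed are arbitrary choices of ours) with $D_X=N(0,I_d)$, so that $\det(\Sigma_{D_X})=1$:

```python
import math

import numpy as np

rng = np.random.default_rng(0)
d, k, trials = 2, 3, 50_000

# E[ det(sum_i x_i x_i^T) ] should equal d! * binom(k, d) * det(Sigma) = 2 * 3 = 6.
dets = [np.linalg.det(X.T @ X) for X in rng.standard_normal((trials, k, d))]
target = math.factorial(d) * math.comb(k, d)

assert target == 6
assert abs(np.mean(dets) - target) < 0.5
```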

The asymmetry of Lemma 7 (the fact that $A$ and $B$ need not be equal) is crucial for showing the unbiasedness property of volume-rescaled sampling.

Proof of Theorem 2  For $k=d$, the least squares estimator is simply the unique solution to a system of $d$ linear equations (unless $\det(X)=0$, a probability-zero event under volume-rescaled sampling, in which case the estimator can be set arbitrarily), so Cramer's rule states that the $i$th component of that solution is given by:

$$\big(w^*(S)\big)_i=\frac{\det(X_{i\leftarrow y})}{\det(X)},$$

where $X_{i\leftarrow y}$ is the matrix $X$ with its $i$th column replaced by $y$.
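As a quick illustration of this step, the Cramer's rule expression agrees with a direct linear solve (a sketch assuming NumPy; the random instance is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
X = rng.standard_normal((d, d))   # square system: k = d points in R^d
y = rng.standard_normal(d)

# Cramer's rule: the i-th coordinate of the solution is det(X with column i
# replaced by y) divided by det(X).
w_cramer = np.empty(d)
for i in range(d):
    Xi = X.copy()
    Xi[:, i] = y
    w_cramer[i] = np.linalg.det(Xi) / np.linalg.det(X)

assert np.allclose(w_cramer, np.linalg.solve(X, y))
```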

$$E_{{\rm VS}^d_D}\big[(w^*(S))_i\big]=\frac{E_{D^d}\big[\det(X)^2\,(w^*(S))_i\big]}{d!\det(\Sigma_{D_X})}=\frac{E_{D^d}\big[\det(X)\det(X_{i\leftarrow y})\big]}{d!\det(\Sigma_{D_X})}\ \overset{\text{(Lemma 7)}}{=}\ \frac{\det\!\big(E_D[x\,(x_{i\leftarrow y})^\top]\big)}{\det(\Sigma_{D_X})}=\frac{\det\!\big((\Sigma_{D_X})_{i\leftarrow E_D[xy]}\big)}{\det(\Sigma_{D_X})}=(w^*_D)_i,$$

where we applied Lemma 7 to the pair of matrices $X$ and $X_{i\leftarrow y}$. The case of $k>d$ follows by induction from a previously shown formula (given in Lemma 15 of Appendix A):

$$E_{{\rm VS}^k_D}\big[w^*(S)\big]=\frac{E_{D^k}\big[\det(X^\top X)\,w^*(S)\big]}{d!\binom{k}{d}\det(\Sigma_{D_X})}\ \overset{(2)}{=}\ \frac{\frac{k}{k-d}\,E_{D^{k-1}}\big[\det(X_{-k}^\top X_{-k})\,w^*(S_{-k})\big]}{d!\binom{k}{d}\det(\Sigma_{D_X})}\ =\ \frac{\frac{k}{k-d}\,d!\binom{k-1}{d}}{d!\binom{k}{d}}\,E_{{\rm VS}^{k-1}_D}\big[w^*(S)\big]\ =\ E_{{\rm VS}^{k-1}_D}\big[w^*(S)\big],$$

where $X_{-j}$ denotes the matrix $X$ without its $j$th row (and $S_{-j}$ the corresponding sample), $(2)$ follows from the formula given in Lemma 15 of Appendix A, while the last equality uses $\frac{k}{k-d}\binom{k-1}{d}=\binom{k}{d}$ together with the fact that the samples are exchangeable, i.e. $S_{-k}$ is distributed identically to a fresh size-$(k-1)$ sample.

Finally, our key observation, given in Theorem 1, is that size-$k$ volume-rescaled sampling can be decomposed into size-$d$ volume-rescaled sampling plus i.i.d. sampling of $k-d$ points. Note that a version of this decomposition already occurs for discrete volume sampling (see Section 1). However, it was not previously known even in that case.

Proof of Theorem 1  Let $\tilde D$ denote the distribution of the matrix $\tilde X$ whose transposed rows are the points of $\tilde S$. The probability of an event $A$ measurable w.r.t. $\tilde D$ can be expanded as a sum over the possible positions $T$ of the volume-rescaled points within the permuted sequence, where the matrix $\tilde X_T$ consists of the rows of $\tilde X$ indexed by the set $T$, and the claim then follows from the Cauchy-Binet formula.

## 3 Volume-Rescaled Gaussian

In this section, we obtain a simple formula for producing volume-rescaled samples when $D_X$ is a centered multivariate Gaussian with any (non-singular) covariance matrix. We achieve this by making a connection to the Wishart distribution. Thus, for this section, assume that $x_i\overset{\text{i.i.d.}}{\sim}N(\mathbf{0},\Sigma_{D_X})$, and let $x_1^\top,\dots,x_k^\top$ be the transposed rows of the matrix $X$. Then the matrix $X^\top X$ is distributed according to the Wishart distribution $W_d(k,\Sigma_{D_X})$ with $k$ degrees of freedom, whose density function is proportional to $\det(W)^{\frac{k-d-1}{2}}e^{-\frac12\mathrm{tr}(\Sigma_{D_X}^{-1}W)}$. On the other hand, if $X$ is constructed from a volume-rescaled sample $(x_1,\dots,x_k)\sim{\rm VS}^k_{D_X}$, then the density of $X^\top X$ is multiplied by an additional factor $\det(W)$, thus increasing the value of $k$ in the exponent of the determinant by two. This observation leads to the following result:

###### Theorem 8

If $(\tilde x_1,\dots,\tilde x_k)\sim{\rm VS}^k_{D_X}$ for a centered Gaussian $D_X$, and $\tilde x_1^\top,\dots,\tilde x_k^\top$ are the rows of a random matrix $\tilde X$, then

$$\tilde X^\top\tilde X\ \sim\ W_d(k+2,\,\Sigma_{D_X}).$$

Proof  Let $x_1^\top,\dots,x_k^\top$ be the rows of $X$, where $(x_1,\dots,x_k)\sim D_X^k$. For any measurable event $A$ over the random matrix $\tilde X^\top\tilde X$, we have

$$\Pr\big(\tilde X^\top\tilde X\in A\big)=\frac{E\big[\mathbf{1}_{[X^\top X\in A]}\det(X^\top X)\big]}{E\big[\det(X^\top X)\big]},$$

and the claim follows because the density function of $W_d(k,\Sigma_{D_X})$ multiplied by $\det(W)$ is proportional to the density function of $W_d(k+2,\Sigma_{D_X})$.
This gives us an easy way to produce the total covariance matrix of volume-rescaled samples in the Gaussian case. We next show that the individual vectors can also be recovered easily.

Proof of Theorem 6  The proof relies on the following two lemmas.

###### Lemma 9

For any $k\geq d$, the conditional distribution of a volume-rescaled sample $(\tilde x_1,\dots,\tilde x_k)\sim{\rm VS}^k_{D_X}$ given $\tilde X^\top\tilde X$ is the same as the conditional distribution of an i.i.d. sample $(x_1,\dots,x_k)\sim D_X^k$ given $X^\top X$.

While this lemma (proven in Appendix B) relies primarily on the definition of conditional probability, the second one uses properties of the matrix variate Beta and Dirichlet distributions.

###### Lemma 10

For $x_1,\dots,x_k\overset{\text{i.i.d.}}{\sim}N(\mathbf{0},\Sigma)$ forming the transposed rows of a matrix $X$, let

$$\tilde x_i=\Sigma^{\frac12}\big(X^\top X\big)^{-\frac12}x_i.$$

Then $\tilde x_1,\dots,\tilde x_k$ are jointly distributed as $k$ i.i.d.

Gaussians $N(\mathbf{0},\Sigma)$ conditioned on $\tilde X^\top\tilde X=\Sigma$.

Putting Theorem 8 together with the two lemmas, we observe that for $x_1,\dots,x_{2d+2}\overset{\text{i.i.d.}}{\sim}D_X$, constructing the Wishart matrix $W=\sum_{j=d+1}^{2d+2}x_jx_j^\top\sim W_d(d+2,\Sigma_{D_X})$ and plugging it into Lemma 10 in place of $\Sigma$, we obtain that $(\tilde x_1,\dots,\tilde x_d)\sim{\rm VS}^d_{D_X}$, completing the proof of Theorem 6.
We conclude this section with the proof of Lemma 10, which demonstrates an interesting application for classical results in matrix variate statistics.

Proof of Lemma 10  Let $B\sim W_d(n_1,\Omega)$ and $C\sim W_d(n_2,\Omega)$ be independent Wishart matrices (with the same positive definite $\Omega$). Then the matrix $U=(B+C)^{-\frac12}B\big((B+C)^{-\frac12}\big)^\top$ is matrix variate beta distributed, written as $U\sim B_d(n_1,n_2)$. The following was shown in the matrix variate statistics literature:

###### Lemma 11 (, Lemma 3.5)

If $U\sim B_d(n_1,n_2)$ is distributed independently of $\Sigma\sim W_d(n_1+n_2,\Omega)$, and if $\Sigma^{\frac12}$ denotes the lower triangular Cholesky factor of $\Sigma$, then

$$B=\Sigma^{\frac12}U\big(\Sigma^{\frac12}\big)^\top\qquad\text{and}\qquad C=\Sigma^{\frac12}(I-U)\big(\Sigma^{\frac12}\big)^\top$$

are independently distributed and $B\sim W_d(n_1,\Omega)$, $C\sim W_d(n_2,\Omega)$.

Now, suppose that we are given a matrix $\Sigma\sim W_d(k,\Omega)$. We can decompose it into components of degree one via a splitting procedure, namely taking $U_1\sim B_d(1,k-1)$ and computing $B_1$ and $C_1$ as in Lemma 11, then recursively repeating the procedure on $C_1$ (instead of $\Sigma$) with $U_2$, and so on, until we obtain $k$ Wishart matrices of degree one summing to $\Sigma$:

$$B_1=\Sigma^{\frac12}U_1\big(\Sigma^{\frac12}\big)^\top,\qquad B_i=C_{i-1}^{\frac12}U_i\big(C_{i-1}^{\frac12}\big)^\top,\qquad C_i=C_{i-1}^{\frac12}(I-U_i)\big(C_{i-1}^{\frac12}\big)^\top,\qquad C_0=\Sigma,\quad i=2..k.$$

The above collection of matrices can be described more simply via the matrix variate Dirichlet distribution. Given independent Wishart matrices $\Sigma_i\sim W_d(n_i,\Omega)$ for $i=1..s$, the matrix variate Dirichlet distribution $D_d(n_1,\dots,n_s)$ is the joint distribution of the matrices

$$V_i=\Sigma^{-\frac12}\Sigma_i\big(\Sigma^{-\frac12}\big)^\top,\quad i=1..s,\qquad \Sigma=\sum_{i=1}^s\Sigma_i.$$

Now, Theorem 6.3.14 from the matrix variate distributions literature states that the matrices $B_i$ defined recursively as above can also be written as

$$B_i=\Sigma^{\frac12}V_i\big(\Sigma^{\frac12}\big)^\top,\qquad(V_1,\dots,V_k)\sim D_d(1,\dots,1).$$

In particular, we can construct them as

$$B_i=\tilde x_i\tilde x_i^\top=\Sigma^{\frac12}\big(X^\top X\big)^{-\frac12}x_ix_i^\top\Big(\big(X^\top X\big)^{-\frac12}\Big)^\top\big(\Sigma^{\frac12}\big)^\top.$$

Note that since the matrix $X^\top X$ is independent of the matrices $V_i$, we can condition on it without altering their distribution. It remains to observe that the conditional distribution of $B_i=\tilde x_i\tilde x_i^\top$ determines the distribution of $\tilde x_i$ up to multiplication by $\pm1$, and since $\tilde x_i$ and $-\tilde x_i$ are identically distributed, we recover the correct joint distribution of $\tilde x_1,\dots,\tilde x_k$ conditioned on $\tilde X^\top\tilde X$, completing the proof.

## 4 General Algorithm

In this section, we present a general algorithm for volume-rescaled sampling, which uses approximate leverage score sampling to generate a larger pool of points from which the smaller volume-rescaled sample can be drawn. The method relies on a technique called "determinantal rejection sampling", introduced recently for a variant of volume sampling of finite subsets of points from a fixed set. Also, as in prior work, our algorithm uses the most standard volume sampling distribution (see (1) and the associated discussion in the introduction) as a subroutine which samples a subset of points/rows from a fixed set. This is done via an efficient implementation of "reverse iterative sampling" (see Algorithm 2 for a high-level description of this sampling method). Curiously enough, the efficient implementation of reverse iterative sampling (denoted here as "VolSamp" and not repeated here for lack of space) is again based on rejection sampling: it samples a set of $d$ points out of $n$ in time independent of $n$. The runtime bound for this implementation only holds with high probability because of its use of rejection sampling.
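Reverse iterative sampling itself is easy to state: starting from all $n$ rows, repeatedly remove one row, chosen with probability proportional to the squared volume of the remaining set. For small $n$ this can be checked against volume sampling by exact enumeration. The sketch below (assuming NumPy) is exponential-time and purely illustrative, unlike the efficient VolSamp routine:

```python
import itertools

import numpy as np

def vol_probs(X, k):
    """Exact size-k volume sampling probabilities: P(S) proportional to det(X_S^T X_S)."""
    subs = list(itertools.combinations(range(X.shape[0]), k))
    w = np.array([np.linalg.det(X[list(S)].T @ X[list(S)]) for S in subs])
    return dict(zip(subs, w / w.sum()))

def reverse_iterative_probs(X, k):
    """Distribution induced by reverse iterative sampling: start from all n rows
    and repeatedly drop row i with probability proportional to det of the Gram
    matrix of the remaining rows."""
    n = X.shape[0]
    current = {tuple(range(n)): 1.0}
    for _ in range(n - k):
        nxt = {}
        for S, p in current.items():
            dets = np.array([np.linalg.det(np.delete(X[list(S)], j, axis=0).T
                                           @ np.delete(X[list(S)], j, axis=0))
                             for j in range(len(S))])
            removal = dets / dets.sum()
            for j, pj in enumerate(removal):
                T = tuple(x for t, x in enumerate(S) if t != j)
                nxt[T] = nxt.get(T, 0.0) + p * pj
        current = nxt
    return current

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
direct = vol_probs(X, 3)
reverse = reverse_iterative_probs(X, 3)
assert all(abs(direct[S] - reverse[S]) < 1e-10 for S in direct)
```

At every intermediate size the induced distribution is exactly (1), which is what makes the subroutine usable here.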

For our algorithm we assume that an estimate $\hat\Sigma$ of the covariance matrix $\Sigma_{D_X}$ is available, along with an upper bound $K$ on the conditioning number.

Algorithm 1 takes as input the approximate covariance $\hat\Sigma$, the bound $K$, and a parameter which controls the number of inner-loop iterations. Our analysis works for any setting of this parameter, although for simplicity we use a fixed value in the main result.

Our analysis of Algorithm 1 uses the following two lemmas, both of which extend earlier results.

###### Lemma 12

For a positive definite matrix $\hat\Sigma$, let $l_{\hat\Sigma}(x)\stackrel{\text{def}}{=}x^\top\hat\Sigma^{-1}x$. Define the following probability measure over events $A$ measurable w.r.t. $D_X$:

$${\rm Lev}_{\hat\Sigma,X}(A)\ \stackrel{\text{def}}{=}\ E_{D_X}\bigg[\mathbf{1}_A\,\frac{l_{\hat\Sigma}(x)}{\mathrm{tr}\big(\Sigma_{D_X}\hat\Sigma^{-1}\big)}\bigg].$$

If $\tilde x_1,\dots,\tilde x_t$ are drawn i.i.d. from ${\rm Lev}_{\hat\Sigma,X}$ and suitably rescaled, then

$$\det\big(\tilde\Sigma\,\hat\Sigma^{-1}\big)\leq1,\qquad\text{where}\quad\tilde\Sigma=\frac1t\sum_{i=1}^t\tilde x_i\tilde x_i^\top,$$

Let