 # Unbiased estimators for random design regression

In linear regression we wish to estimate the optimum linear least squares predictor for a distribution over d-dimensional input points and real-valued responses, based on a small sample. Under standard random design analysis, where the sample is drawn i.i.d. from the input distribution, the least squares solution for that sample can be viewed as the natural estimator of the optimum. Unfortunately, this estimator almost always incurs an undesirable bias coming from the randomness of the input points. In this paper we show that it is possible to draw a non-i.i.d. sample of input points such that, regardless of the response model, the least squares solution is an unbiased estimator of the optimum. Moreover, this sample can be produced efficiently by augmenting a previously drawn i.i.d. sample with an additional set of d points drawn jointly from the input distribution rescaled by the squared volume spanned by the points. Motivated by this, we develop a theoretical framework for studying volume-rescaled sampling, and in the process prove a number of new matrix expectation identities. We use them to show that for any input distribution and ϵ>0 there is a random design consisting of O(d log d + d/ϵ) points from which an unbiased estimator can be constructed whose square loss over the entire distribution is with high probability bounded by 1+ϵ times the loss of the optimum. We provide efficient algorithms for generating such unbiased estimators in a number of practical settings and support our claims experimentally.



## 1 Introduction

We consider linear regression where the examples $(x^\top, y)$ are generated by an unknown distribution $D$ over $\mathbb{R}^d\times\mathbb{R}$, with $D_X$ denoting the marginal distribution of the row vector $x^\top$ and $D_{Y|x}$ denoting the conditional distribution of the response $y$ given $x$. In statistics, it is common to assume that the response is a linear function of $x$ plus zero-mean Gaussian noise; the goal is then to estimate this linear function. We decidedly make no such assumption. Instead, we allow the distribution to be arbitrary except for the nominal requirement that the second moments of the point and response are bounded, i.e., $\mathrm{E}[\|x\|^2]<\infty$ and $\mathrm{E}[y^2]<\infty$. The target of the estimation is the linear least squares predictor of $y$ from $x$ with respect to $D$:

$$w^*_D\;\overset{\mathrm{def}}{=}\;\operatorname*{argmin}_{w\in\mathbb{R}^d}\,L_D(w),\qquad\text{where}\quad L_D(w)\;\overset{\mathrm{def}}{=}\;\mathrm{E}\big[(x^\top w-y)^2\big].$$

Here, we assume $\Sigma_{D_X}=\mathrm{E}[xx^\top]$ is invertible, so we have the concise formula $w^*_D=\Sigma_{D_X}^{-1}\mathrm{E}[x\,y]$. Our goal is to construct a "good" estimator of this target from a small sample. For the rest of the paper we use $w^*$ as a shorthand for $w^*_D$.

In our setup, the estimator $\widehat{w}$ of $w^*$ is based on solving a least squares problem on a sample of $k$ examples $(x_1^\top,y_1),\dots,(x_k^\top,y_k)$. We assume that given $x_1,\dots,x_k$, the responses $y_1,\dots,y_k$ are conditionally independent, and the conditional distribution of $y_i$ only depends on $x_i$, i.e., $y_i\sim D_{Y|x_i}$ for $i=1,\dots,k$. However, for the applications we have in mind, the marginal distribution of $x_1,\dots,x_k$ is allowed to be flexibly designed based on $D_X$. The most standard choice is i.i.d. sampling from the marginal distribution of $D$, i.e., $x_1,\dots,x_k\overset{\text{i.i.d.}}{\sim}D_X$. We shall seek other choices that can be implemented given the ability to sample from $D_X$ but that lead to better statistical properties for $\widehat{w}$.

In particular, the properties we want of the estimator are the following.

1. Unbiasedness:  $\mathrm{E}[\widehat{w}]=w^*$.

2. Near-optimal square loss:  $L_D(\widehat{w})\le(1+\epsilon)\,L_D(w^*)$ with high probability, for some (small) $\epsilon>0$.

The central question is how to sample so as to achieve these properties with the sample size $k$ as small as possible. Note that while in general it is very natural to seek an unbiased estimator, in the context of random design regression it is highly unusual. This is because, as we discuss shortly, standard approaches fail in this regard. In fact, until recently, unbiased estimators have been considered out of reach for this problem.

An important and motivating case of our general setup occurs when $D_X$ is the uniform distribution over a fixed set of $n$ points and the response is deterministic. That is, there is a fixed $n\times d$ design matrix $X$ and a response vector $y\in\mathbb{R}^n$ such that the distribution is uniform over the rows of the pair $(X,y)$. Here, the loss of $w$ can be written as $L_D(w)=\frac1n\|Xw-y\|^2$. This traditionally fixed design setting turns into a random design when we are required to sample $k$ rows of $X$, observe only the entries of $y$ corresponding to those rows, and then construct an estimate of the least squares solution for all of $X$ and $y$. Such constraints are often imposed in the context of experimental design and active learning, where $k$ represents the budget of responses that we are allowed to observe (e.g., because the responses are expensive). Here, once again we want the estimator to be unbiased and have near-optimal loss.

Throughout the introduction we give some intuition about our results by discussing the one-dimensional case. For example, consider the following fixed design problem:

$$X=\begin{bmatrix}x_1\!:&1\\ x_2\!:&2\end{bmatrix},\qquad y=\begin{bmatrix}y_1\!:&1\\ y_2\!:&1\end{bmatrix},\qquad\text{with target:}\quad w^*=\frac{\sum_i x_iy_i}{\sum_i x_i^2}=\frac35. \tag{1.1}$$

Suppose that we wish to estimate the target after observing only a single response (i.e., $k=1$). If we draw the response uniformly at random (i.e., from the distribution $D$), then the least squares estimator $\widehat{w}=y_i/x_i$ for this sample will be a biased estimate of the target:
$$\mathrm{E}[\widehat{w}]=\frac12\cdot\frac11+\frac12\cdot\frac12=\frac34\;\neq\;\frac35=w^*.$$
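This two-point computation can be checked exactly by enumerating the two equally likely single-point samples. A minimal sketch in exact rational arithmetic (the variable names are ours, chosen for illustration):

```python
from fractions import Fraction

# Fixed design from (1.1): points x_1 = 1, x_2 = 2 with responses y_1 = y_2 = 1.
xs = [Fraction(1), Fraction(2)]
ys = [Fraction(1), Fraction(1)]

# Optimal least squares predictor over the full design.
w_star = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Expected value of the single-point estimator y_i / x_i under uniform sampling.
e_w_uniform = sum(Fraction(1, len(xs)) * (y / x) for x, y in zip(xs, ys))

print(w_star)       # 3/5
print(e_w_uniform)  # 3/4, so the uniform-sampling estimator is biased
```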

The bias in least squares estimators is present even when each input component is drawn independently from a standard Gaussian. As an example, we let $d=5$ and set:

$$x^\top=(x_1,\dots,x_d)\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0,1),\qquad y=\xi(x)+\epsilon,\quad\text{where}\;\;\xi(x)=\sum_{i=1}^d\Big(x_i+\frac{x_i^3}{3}\Big),\quad\epsilon\sim\mathcal{N}(0,1).$$

The response is a non-linear function $\xi(x)$ plus independent white noise $\epsilon$. Note that it is crucial that the response contains some non-linearity; this is something one would expect in real datasets. The response is cubic and was chosen so that it is easy to solve algebraically for the optimum solution $w^*$ (exact calculation given in Appendix A).

Figure 1.1: Averaging least squares estimators for Gaussian inputs with d=5.

For this Gaussian setup we compute the bias of the least squares estimator produced for this problem by i.i.d. sampling of $k$ points. We do this by producing many such estimators independently, and look at the estimation error of the average $\widehat{w}$ of those estimators:

$$\text{estimation error:}\quad\big\|\widehat{w}-w^*\big\|^2.$$

Figure 1.1 (red curves) shows the experiment for several values of $k$ and a range of values of the number of averaged estimators (each presented data point is an average over 50 runs). The i.i.d. sampled estimator is biased for any sample size $k$ (although the bias decreases with $k$), and therefore the averaged estimator clearly does not converge to the optimum. We next discuss how to construct an unbiased estimator (dashed blue curves), for which the estimation error of the averaged estimator exhibits convergence to zero (regardless of $k$). This type of convergence appears as a straight line on the log-log plot in Figure 1.1.
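The red curves can be re-created in a few lines. The sketch below is our own reconstruction under the stated model, not the authors' experiment code; it uses the fact that here $w^*=2\cdot\mathbf{1}$, since $\Sigma_{D_X}=\mathrm{I}$ and $\mathrm{E}[x_i\,y]=\mathrm{E}[x_i^2]+\mathrm{E}[x_i^4]/3=1+1=2$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n_estimators = 5, 20, 2000

# Optimum: w* = Sigma^{-1} E[x y] = 2 * ones, from the Gaussian moment calculation.
w_star = 2.0 * np.ones(d)

def least_squares_estimator(k):
    """One least squares estimator from k i.i.d. Gaussian examples."""
    X = rng.standard_normal((k, d))
    y = X.sum(axis=1) + (X ** 3).sum(axis=1) / 3 + rng.standard_normal(k)
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Average many independent estimators; the bias keeps this error from vanishing.
w_bar = np.mean([least_squares_estimator(k) for _ in range(n_estimators)], axis=0)
print(np.linalg.norm(w_bar - w_star))  # estimation error of the averaged estimator
```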

We extend a method for constructing unbiased estimators developed recently by Dereziński and Warmuth (2018) for the case where $D$ is uniform over a fixed design $(X,y)$. This method, called volume sampling, jointly draws a subset $S$ of $d$ rows of the design matrix with probability proportional to $\det(X_S)^2$, where $X_S$ denotes the submatrix of $X$ with rows indexed by $S$. For this distribution, the linear least squares estimator is unbiased, i.e., $\mathrm{E}\big[X_S^+y_S\big]=X^+y$, where $(\cdot)^+$ denotes the Moore-Penrose pseudoinverse. Indeed, if we volume sample a set of size 1 in the example problem (1.1), then
$$\mathrm{E}[\widehat{w}]=\frac{1^2}{1^2+2^2}\cdot\frac11+\frac{2^2}{1^2+2^2}\cdot\frac12=\frac15+\frac25=\frac35=w^*.$$
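For a finite design, the unbiasedness $\mathrm{E}[X_S^+y_S]=X^+y$ under size-$d$ volume sampling can be verified by exact enumeration of all size-$d$ subsets. A sketch (the helper name `volume_sampling_expectation` is ours, not the paper's):

```python
import itertools
import numpy as np

def volume_sampling_expectation(X, y):
    """E[X_S^+ y_S] under P(S) proportional to det(X_S)^2, over size-d subsets S."""
    n, d = X.shape
    total_weight, acc = 0.0, np.zeros(d)
    for S in itertools.combinations(range(n), d):
        XS, yS = X[list(S)], y[list(S)]
        w = np.linalg.det(XS) ** 2              # unnormalized probability of S
        if w > 0:
            acc += w * np.linalg.solve(XS, yS)  # least squares solution on S
            total_weight += w
    return acc / total_weight

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 2))
y = rng.standard_normal(6)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(volume_sampling_expectation(X, y), w_star))  # True

# The two-point design (1.1) with d = 1 recovers E[w_hat] = 3/5.
print(volume_sampling_expectation(np.array([[1.0], [2.0]]), np.array([1.0, 1.0])))  # [0.6]
```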

Our first contribution in this paper is extending these ideas to arbitrary distributions $D$ (instead of uniform over a fixed design matrix). Let the sample $x_1,\dots,x_k$, forming the rows of a random matrix $X$, be drawn jointly with probability proportional to $\det(X^\top X)$, i.e., we reweigh the $k$-fold i.i.d. distribution $D_X^k$ by the determinant of the sample covariance. We refer to this as volume-rescaled sampling from $D_X$ and denote it as $\mathrm{VS}^k_{D_X}$. In this general context, we are able to prove that for arbitrary distributions $D_X$ and $D_{Y|x}$, volume-rescaled sampling produces unbiased linear least squares estimators (Theorem 2.8). This result does not follow from the fixed design analysis, and in obtaining it we develop novel extensions of fundamental expectation identities for the determinant of a random matrix.

The fact that volume-rescaled sampling of size $d$ always produces unbiased estimators of the target stands in contrast to i.i.d. sampling from $D$, which generally fails in this regard. Yet surprisingly, we show that a volume-rescaled sample of any size $k$ is essentially composed of an i.i.d. sample of size $k-d$ from $D_X$ plus a volume-rescaled sample of size $d$ (Theorem 2.3). This means that the linear least squares estimator of such a composed sample is also unbiased. Thus, as an immediate corollary of Theorems 2.3 and 2.8 we reach the following remarkable conclusion:

Even though i.i.d. sampling typically results in a biased least squares estimator, adding a volume-rescaled sample of size $d$ to the i.i.d. sample eliminates that bias altogether:

$$\mathrm{E}\big[\text{sol. for i.i.d.}+\text{volume}\big]=w^*,\qquad\text{even though}\qquad\mathrm{E}\big[\text{sol. for i.i.d.}\big]\neq w^*\ \text{typically}.$$

Indeed, in the simple Gaussian experiment used for Figure 1.1, the estimators produced from i.i.d. samples augmented with a volume-rescaled sample of size $d$ (dashed blue curves) become unbiased (straight lines). To get some intuition, let us show how the bias disappears in the one-dimensional fixed design case, where $x_{i_1},\dots,x_{i_k}$ is a uniform sample from the fixed design $x_1,\dots,x_n$. In this case, reweighing the probability of just the first sampled point by its square already results in an unbiased estimator. Let $\widehat{w}=\frac{\sum_t x_{i_t}y_{i_t}}{\sum_t x_{i_t}^2}$ be the least squares estimator computed from $(x_{i_1},y_{i_1}),\dots,(x_{i_k},y_{i_k})$ with all indices sampled uniformly from $\{1,\dots,n\}$. Now, suppose that we replace $i_1$ with an index sampled proportionally to its squared point (a density ratio of $x_{i_1}^2/\mathrm{E}[x^2]$ relative to uniform), and denote the modified estimator as $\widetilde{w}$. Due to symmetry, it makes no difference which index we choose to replace, so

$$\mathrm{E}[\widetilde{w}]=\mathrm{E}\bigg[\frac{x_{i_1}^2}{\mathrm{E}[x^2]}\,\widehat{w}\bigg]=\frac1k\sum_{t=1}^k\mathrm{E}\bigg[\frac{x_{i_t}^2}{\mathrm{E}[x^2]}\,\widehat{w}\bigg]=\frac{\mathrm{E}\big[\frac1k\big(\sum_t x_{i_t}^2\big)\,\widehat{w}\big]}{\mathrm{E}[x^2]}.$$

By definition of the least squares estimator, $\big(\sum_t x_{i_t}^2\big)\widehat{w}=\sum_t x_{i_t}y_{i_t}$, from which it follows that $\mathrm{E}[\widetilde{w}]=\mathrm{E}\big[\frac1k\sum_t x_{i_t}y_{i_t}\big]/\mathrm{E}[x^2]=\mathrm{E}[xy]/\mathrm{E}[x^2]=w^*$. This simple argument at once shows the unbiasedness of $\widetilde{w}$ and the composition property discussed in the previous paragraph. In higher dimensions, the analysis gets considerably more involved, but it follows a similar outline.
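The one-dimensional argument can be confirmed by brute-force enumeration over all $n^k$ index sequences, reweighing only the first index by its square. A sketch with a small arbitrary design of our choosing, in exact rational arithmetic:

```python
from fractions import Fraction
from itertools import product

# A small 1-d fixed design: D is uniform over these (x, y) pairs.
xs = [Fraction(1), Fraction(2), Fraction(3)]
ys = [Fraction(1), Fraction(1), Fraction(2)]
n, k = len(xs), 3

w_star = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
ex2 = sum(x * x for x in xs) / n  # E[x^2] under the uniform design distribution

# E[w_tilde]: first index drawn proportionally to x^2, the rest uniformly.
e_w = Fraction(0)
for idx in product(range(n), repeat=k):
    # density ratio x^2 / E[x^2] of the reweighted draw vs uniform (first index only)
    weight = Fraction(1, n ** k) * (xs[idx[0]] ** 2 / ex2)
    w_hat = sum(xs[i] * ys[i] for i in idx) / sum(xs[i] ** 2 for i in idx)
    e_w += weight * w_hat

print(e_w == w_star)  # True: the reweighted estimator is exactly unbiased
```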

Perhaps surprisingly, volume-rescaled sampling may not lead to estimators with near-optimal loss guarantees: We show that for any sample size there are distributions for which volume-rescaled sampling results in the linear least squares estimator having loss at least twice as large as the optimum loss (with probability at least 0.25). However, we remedy this bad behavior by rescaling the distribution $D_X$ using statistical leverage scores. This rescaling achieves the following feat: It does not affect the unbiasedness of the estimator and, after rescaling, volume-rescaled sampling has good approximation properties; specifically, $O(d\log d+d/\epsilon)$ points are sufficient for the resulting estimator to have loss at most $(1+\epsilon)$ times the optimum with high probability. The same bound was known for vanilla i.i.d. leverage score sampling, but the estimators produced from leverage score sampling are biased.

To prove our results in arbitrary dimension and w.r.t. an arbitrary distribution $D_X$, we develop a new tool kit for computing expectations under volume-rescaled sampling, which includes new expectation formulas for sampled pseudoinverses and adjugates. Our work also leads to improved algorithms. When the distribution $D_X$ is defined by a fixed design, we get the following improvements for obtaining an unbiased subsampled estimator with loss within a factor $1+\epsilon$ of the optimum: the sample size is reduced from $O(d^2/\epsilon)$ to $O(d\log d+d/\epsilon)$, and the time complexity of the sampling procedure is improved as well.

Remarkably, we show that exact volume-rescaled sampling is possible even when the distribution $D_X$ is unknown (and possibly continuous) and we only have oracle access to it. First, if $D_X$ is a normal distribution with unknown covariance, then we show that additional samples from $D_X$ can be used to modify an i.i.d. sample so that it becomes a volume-rescaled sample of size $k$. Second, we develop distortion-free intermediate sampling, a method for exact volume-rescaled sampling from any $D_X$ with bounded support: We first sample a larger pool of points based on approximate i.i.d. leverage scores and then down-sample from that pool. We use rejection sampling for down-sampling to ensure exactness of the resulting overall sampling distribution. Surprisingly, this does not adversely affect the complexity because of the provably high acceptance rate during rejection sampling. The size of the intermediate pool that is necessary to achieve this grows linearly with a certain condition number of the distribution (which is likely unavoidable in general). This intermediate sampling method has proven to be an effective sampling strategy for other commonly studied determinantal distributions. In particular, follow-up works by Dereziński (2019) and Dereziński et al. (2019) extended our approach to obtain new efficient sampling algorithms for determinantal point processes.

### Related work

A discrete variant of volume-rescaled sampling of size $k=d$ was introduced to the computer science literature by Deshpande et al. (2006) for sampling from a finite set of vectors, with algorithms given later by Deshpande and Rademacher (2010); Guruswami and Sinop (2012). A first extension to samples of size $k>d$ is due to Avron and Boutsidis (2013), with algorithms by Li et al. (2017); Dereziński and Warmuth (2018); Dereziński et al. (2018), and additional applications in experimental design explored by Wang et al. (2017); Nikolov et al. (2018); Mariet and Sra (2017). Prior to this work, the best known time complexity for this sampling method, called here discrete volume sampling, was the one shown by Dereziński and Warmuth (2017); in this paper we give an asymptotically faster sampling algorithm.

For distributions $D_X$, volume-rescaled sampling of size $k=d$ as defined in this paper is a special case of a determinantal point process (DPP) (also called a projection DPP, see Hough et al., 2006b; Bardenet et al., 2017). Related variants of discrete DPPs have been extensively explored in the machine learning community by Kulesza and Taskar (2012, 2011); Li et al. (2016); Gartrell et al. (2016); Celis et al. (2018), among many others. The existing sampling algorithms for size-$d$ volume-rescaled sampling require exact knowledge of the covariance matrix $\Sigma_{D_X}$. We propose the first algorithm that allows exact sampling given only an approximation of $\Sigma_{D_X}$. For $k>d$, we are not aware of volume-rescaled sampling from arbitrary distributions appearing in the literature.

The unbiasedness of least squares estimators under determinantal distributions was first explored in the context of sampling from finite datasets by Dereziński and Warmuth (2017), drawing on observations of Ben-Tal and Teboulle (1990). Focusing on small sample sizes, Dereziński and Warmuth (2017) proved multiplicative bounds for the expected loss of discrete volume sampling of size $k=d$. Because the produced estimators are unbiased, averaging such estimators produced an estimator based on a sample of size $O(d^2/\epsilon)$ with expected loss at most $1+\epsilon$ times the optimum. Additional variance bounds were shown for discrete volume sampling in Dereziński and Warmuth (2018) under the assumption that the responses are linear functions of the input points plus white noise. We extend them here to arbitrary volume-rescaled sampling w.r.t. a distribution $D_X$.

Other techniques applicable to our linear regression problem include leverage score sampling (Drineas et al., 2006) and spectral sparsification (Batson et al., 2012; Lee and Sun, 2015). Leverage score sampling is an i.i.d. sampling procedure which achieves loss bounds matching the ones we obtain here for volume-rescaled sampling; however, it produces biased estimators, and experimental results (see Section 6) show that it has weaker performance for small sample sizes. A different and more elaborate sampling technique based on spectral sparsification (Batson et al., 2012; Lee and Sun, 2015) was recently shown to be effective for linear regression (Chen and Price, 2019): They show that $O(d/\epsilon)$ samples suffice to produce an estimator with expected loss at most $1+\epsilon$ times the optimum. However, this method also does not produce unbiased estimators, which is a primary concern of this paper and desirable in many settings.

Our work greatly expands and generalizes the results of two conference papers: Dereziński et al. (2018, 2019). The first introduced the leverage score rescaling method in the limited context of volume sampling from a fixed design matrix and proved the $O(d\log d+d/\epsilon)$ sample size bound for obtaining an unbiased estimator with a $(1+\epsilon)$ loss bound. The second paper showed how to correct the bias of i.i.d. sampling using a small volume-rescaled sample of size $d$. The current paper generalizes the loss bound of the first conference paper to the case of an arbitrary hidden distribution $D$ on the examples (Theorem 3.1). In the process, we developed new formulas for the expectation of inverses and pseudoinverses of random matrices under volume-rescaled sampling (Theorems 2.6 and 2.7) and characterized the marginals of this distribution (Theorem 2.5). We also extended the decomposition property of volume-rescaled sampling given in the second conference paper (Theorem 2.3), thereby greatly simplifying our proofs. Finally, we gave a new lower bound that complements our main results (Theorem 4.1).

### Outline

In Section 2 we give our basic definition of volume-rescaled sampling w.r.t. an arbitrary distribution over the examples and prove the basic expectation formulas as well as the fundamental decomposition property which is repeatedly used in later sections. We also show that the linear least squares estimator is unbiased under volume-rescaled sampling. The decomposition property is then used in Section 3 to show that volume-rescaled leverage score sampling produces a linear least squares estimator with loss at most $(1+\epsilon)$ times the optimum for sample size $O(d\log d+d/\epsilon)$. The lower bounds in Section 4 show that i.i.d. sampling leads to biased estimators and plain volume-rescaled sampling does not admit near-optimal loss bounds.

In Section 5 we show that if $D_X$ is normal, then additional samples can be used to construct a volume-rescaled sample of size $k$. When the distribution $D_X$ is arbitrary but we are given an approximation of the covariance matrix $\Sigma_{D_X}$, then a special variant of approximate leverage score sampling can be used to construct a larger intermediate sample that contains a volume sample with high probability. We then show how to construct an approximate covariance matrix from additional samples from $D_X$. The number of samples we need grows linearly with a variant of a condition number of $D_X$. Finally, we show how the new intermediate sampling method introduced here leads to improved time bounds in the fixed design case.

In Section 6 we compare the performance of the algorithms discussed in this paper on some real datasets. We conclude with an overview and some open problems in Section 7.

## 2 Volume-rescaled sampling

In this section, we formally define volume-rescaled sampling and describe its basic properties. We then use it to introduce the central concept of this paper: an unbiased estimator for random design least squares regression.

Notation.  Let $x_i^\top$ denote the $i$th row of a matrix $X$, and let $X_S$ be the submatrix of $X$ containing the rows of $X$ indexed by the set $S$. Also, we use $X_{-i}$, $X^{-j}$ and $X_{-i}^{-j}$ to denote matrix $X$ with the $i$th row removed, the $j$th column removed, and both removed, respectively. When $X$ is $d\times d$, we use $\mathrm{adj}(X)$ to denote the adjugate of $X$, which is a $d\times d$ matrix such that $\mathrm{adj}(X)\,X=\det(X)\,\mathrm{I}$. We use $D_X$ to denote the distribution of a $d$-variate random row vector $x^\top$, and we assume throughout that $\Sigma_{D_X}=\mathrm{E}[xx^\top]$ exists and is full rank. A distribution $D$ is called $(d{+}1)$-variate if it produces a joint sample $(x^\top,y)$, where $x^\top\sim D_X$ and $y\in\mathbb{R}$. A random matrix consisting of $k$ independent rows distributed as $D_X$ is denoted $X\sim D_X^k$. We also use the following standard shorthand for the falling factorial: $k^{\underline{d}}=k(k-1)\cdots(k-d+1)$.

###### Definition 1

Given a $d$-variate distribution $D_X$ and any $k\ge d$, we define volume-rescaled size-$k$ sampling from $D_X$ as a probability measure $\mathrm{VS}^k_{D_X}$ such that for any event $A$ measurable w.r.t. $D_X^k$, its probability is

$$\mathrm{VS}^k_{D_X}(A)\;\overset{\mathrm{def}}{=}\;\frac{\mathrm{E}\big[\det(X^\top X)\cdot 1_{[X\in A]}\big]}{\mathrm{E}\big[\det(X^\top X)\big]},\qquad\text{where}\;\;X\sim D_X^k.$$

For $k=d$, this volume-rescaled sampling is an example of a determinantal point process (DPP, see Hough et al., 2006a). The case of $k>d$ can be viewed as an extension of that family of distributions.

###### Remark 2.1

Distribution $\mathrm{VS}^k_{D_X}$ is well-defined whenever $\Sigma_{D_X}$ is finite. Also, for any $F$, the random variable $F(\overline{X})$ for $\overline{X}\sim\mathrm{VS}^k_{D_X}$ is measurable if and only if $F(X)$ is measurable for $X\sim D_X^k$, and then it follows that

$$\mathrm{E}_{\mathrm{VS}^k_{D_X}}\big[F(\overline{X})\big]=\frac{\mathrm{E}\big[\det(X^\top X)\,F(X)\big]}{\mathrm{E}\big[\det(X^\top X)\big]},\qquad\text{where}\;\;X\sim D_X^k.$$

The remark follows from a key lemma which is an extension of a classic result by van der Vaart (1965), who essentially showed (2.1) below in the case $A=B$, but not (2.2). Part (2.1) of the lemma lets us rewrite the normalization of volume-rescaled sampling as:

$$\mathrm{E}\big[\det(X^\top X)\big]=\frac{k^{\underline{d}}}{k^{d}}\cdot\det\!\big(\mathrm{E}[X^\top X]\big)=k^{\underline{d}}\cdot\det(\Sigma_{D_X}),\qquad\text{where}\;\;\Sigma_{D_X}=\mathrm{E}_{D_X}\big[xx^\top\big].$$
###### Lemma 2.2

If the rows of the random matrices $A,B\in\mathbb{R}^{k\times d}$ are sampled as an i.i.d. sequence of $k$ pairs of joint random row vectors $(a_i^\top,b_i^\top)$, then

$$k^{d}\,\mathrm{E}\big[\det(A^\top B)\big]=k^{\underline{d}}\,\det\!\big(\mathrm{E}[A^\top B]\big)\qquad\text{for any }k\ge d,\tag{2.1}$$
$$k^{d-1}\,\mathrm{E}\big[\mathrm{adj}(A^\top B)\big]=k^{\underline{d-1}}\,\mathrm{adj}\!\big(\mathrm{E}[A^\top B]\big)\qquad\text{for any }k\ge d-1.\tag{2.2}$$

Proof  First, suppose that $k=d$, in which case $\det(A^\top B)=\det(A)\det(B)$. Recall that by definition the determinant can be written as:

$$\det(C)=\sum_{\sigma\in S_d}\mathrm{sgn}(\sigma)\prod_{i=1}^d c_{i,\sigma_i},$$

where $S_d$ is the set of all permutations of $\{1,\dots,d\}$, and $\mathrm{sgn}(\sigma)$ is the parity of the number of swaps from $(1,\dots,d)$ to $(\sigma_1,\dots,\sigma_d)$. Using this formula and denoting $c_{ij}=\mathrm{E}[A^\top B]_{ij}=d\,\mathrm{E}[a_ib_j]$, we can rewrite the expectation as:

$$\begin{aligned}
d^{d}\,\mathrm{E}\big[\det(A)\det(B)\big]
&=\sum_{\sigma,\sigma'\in S_d}\mathrm{sgn}(\sigma)\,\mathrm{sgn}(\sigma')\prod_{i=1}^{d}\mathrm{E}\big[d\cdot a_{i\sigma_i}b_{i\sigma'_i}\big]\\
&=\sum_{\sigma\in S_d}\sum_{\sigma'\in S_d}\mathrm{sgn}(\sigma)\,\mathrm{sgn}(\sigma')\prod_{i=1}^{d}c_{\sigma_i\sigma'_i}\\
&=d!\sum_{\sigma'\in S_d}\mathrm{sgn}(\sigma')\prod_{i=1}^{d}c_{i\sigma'_i}\\
&=d!\,\det\!\big(\mathrm{E}[A^\top B]\big),
\end{aligned}$$

which proves (2.1) for $k=d$. The case of $k>d$ follows from the $k=d$ case via a standard determinantal formula:

$$\mathrm{E}\big[\det(A^\top B)\big]\overset{(*)}{=}\sum_{S\,:\,|S|=d}\mathrm{E}\big[\det(A_S)\det(B_S)\big]=\binom{k}{d}\frac{d!}{d^{d}}\,\det\!\big(d\,\mathrm{E}[a\,b^\top]\big)=\frac{k^{\underline{d}}}{k^{d}}\,\det\!\big(\mathrm{E}[A^\top B]\big),$$

where $(*)$ follows from the Cauchy-Binet formula. Finally, (2.2) can be derived from (2.1) entrywise, since each entry of the adjugate is a signed minor of size $d-1$: for any $i,j$,

$$k^{d-1}\,\mathrm{E}\big[\mathrm{adj}(A^\top B)_{ji}\big]=(-1)^{i+j}k^{d-1}\,\mathrm{E}\Big[\det\!\big((A^{-i})^\top B^{-j}\big)\Big]=(-1)^{i+j}k^{\underline{d-1}}\,\det\!\Big(\mathrm{E}\big[(A^{-i})^\top B^{-j}\big]\Big)=k^{\underline{d-1}}\,\mathrm{adj}\big(\mathrm{E}[A^\top B]\big)_{ji},$$

where recall that $B^{-j}$ denotes matrix $B$ with the $j$th column removed.
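Both identities of Lemma 2.2 can be checked exactly on a discrete row distribution by enumerating all equally likely row assignments. A sketch with $d=2$ and $A=B=X$ (the special case used for the normalization in (2.1); the small design `V` is an arbitrary choice of ours):

```python
from itertools import product
from math import factorial

import numpy as np

# Discrete row distribution: the row x^T is uniform over the rows of V (d = 2).
V = np.array([[1.0, 0.0], [1.0, 2.0], [-1.0, 1.0]])
m, d = V.shape
k = 3

def adj2(M):
    """Adjugate of a 2x2 matrix."""
    return np.array([[M[1, 1], -M[0, 1]], [-M[1, 0], M[0, 0]]])

def falling(k, j):
    """Falling factorial k(k-1)...(k-j+1)."""
    return factorial(k) // factorial(k - j)

# Left-hand sides: exact expectations over all m^k equally likely i.i.d. samples.
lhs_det, lhs_adj = 0.0, np.zeros((d, d))
for rows in product(range(m), repeat=k):
    M = V[list(rows)].T @ V[list(rows)]
    lhs_det += np.linalg.det(M) / m ** k
    lhs_adj += adj2(M) / m ** k

EM = k * (V.T @ V) / m  # E[A^T B] with A = B = X
ok1 = np.isclose(k ** d * lhs_det, falling(k, d) * np.linalg.det(EM))    # (2.1)
ok2 = np.allclose(k ** (d - 1) * lhs_adj, falling(k, d - 1) * adj2(EM))  # (2.2)
print(ok1, ok2)  # True True
```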

### 2.1 Basic properties

In this section we look at the relationship between the random matrix $X\sim D_X^k$ of an i.i.d. sample from $D_X$ and the corresponding volume-rescaled sample $\overline{X}\sim\mathrm{VS}^k_{D_X}$. Even though the rows of $\overline{X}$ are not independent, we show that they contain among them an i.i.d. sample of $k-d$ rows distributed according to $D_X^{k-d}$.

###### Theorem 2.3

Let $\overline{X}\sim\mathrm{VS}^k_{D_X}$ and let $S$ be a random size-$d$ set such that $\Pr(S\mid\overline{X})\propto\det(\overline{X}_S)^2$. Then $\overline{X}_S\sim\mathrm{VS}^d_{D_X}$, $\overline{X}_{S^c}\sim D_X^{k-d}$, $S$ is uniformly random, and the three random variables $S$, $\overline{X}_S$, and $\overline{X}_{S^c}$ are mutually independent.

Before proceeding with the proof, we would like to discuss the implications of the theorem at a high level. First, observe that it allows us to "compose" a matrix $\overline{X}$ (which must be distributed according to $\mathrm{VS}^k_{D_X}$) from a $d$-row draw from $\mathrm{VS}^d_{D_X}$, a $(k-d)$-row draw from $D_X^{k-d}$, and a uniformly drawn subset $S$ of size $d$ from $\{1,\dots,k\}$. We construct $\overline{X}$ by placing the volume-rescaled rows at the row indices in $S$ and the i.i.d. rows at the remaining indices. Another way to think of the construction of $\overline{X}$ is that we index the rows of the volume-rescaled sample from $1$ to $d$ and the rows of the i.i.d. sample from $d+1$ to $k$, and then permute the indices by a random permutation $\sigma$:

$$\text{volume + i.i.d.}\qquad\underbrace{x_1\,\dots\,x_d}_{\sim\,\mathrm{VS}^d_{D_X}}\;\;\underbrace{x_{d+1}\,\dots\dots\,x_k}_{\sim\,D_X^{k-d}}\tag{2.3}$$
$$\Updownarrow$$
$$\mathrm{VS}^k_{D_X}\qquad x_{\sigma_1}\,\dots\dots\dots\dots\,x_{\sigma_k}\tag{2.4}$$

Perhaps more surprisingly, given a volume-rescaled sample of size $k$ from $D_X$ (i.e., $\overline{X}\sim\mathrm{VS}^k_{D_X}$), sampling a set $S$ of size $d$ with probability proportional to $\det(\overline{X}_S)^2$ (discrete volume sampling) "filters out" a size-$d$ volume-rescaled sample from $D_X$ (i.e., $\overline{X}_S\sim\mathrm{VS}^d_{D_X}$). That sample is independent of the remaining rows in $\overline{X}$, so after reordering we recover (2.3).

We can repeat the steps of going "back and forth" between (2.3) and (2.4). That is, we can compose a sample from $\mathrm{VS}^k_{D_X}$ by appending the size-$d$ sub-sample we filtered out from $\overline{X}$ with its complement and permuting randomly, and then again filter out a size-$d$ volume sub-sample w.r.t. $D_X$ from the permuted sample. The size-$d$ sub-samples produced the first and second time are likely going to be different, but they have the same distribution $\mathrm{VS}^d_{D_X}$.

This phenomenon can already be observed in one dimension (i.e., $d=1$). In this case, (2.3) samples one point $x_1\sim\mathrm{VS}^1_{D_X}$ and independently draws $x_2,\dots,x_k\overset{\text{i.i.d.}}{\sim}D_X$. Note that the random variables $x_1,\dots,x_k$ are mutually independent but not identically distributed. Now, if we randomly permute the order of the variables as in (2.4), then the new variables are identically distributed but not mutually independent. Intuitively, this is because observing (the length of) any one of the variables alters our belief about where the volume-rescaled sample was placed. Applying Theorem 2.3, we can now "decompose" the dependencies by sampling a singleton subset $\{i\}$ with probability proportional to $x_i^2$. Even though the selected variable may not be the same as the one chosen originally, it is distributed according to volume-rescaled sampling w.r.t. $D_X$, and the remaining points are i.i.d. samples from $D_X$.
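The one-dimensional equivalence of (2.3) and (2.4) can be checked by computing both distributions over ordered samples exactly. A sketch with a two-point uniform $D_X$ of our choosing and $k=3$, in exact rational arithmetic:

```python
from fractions import Fraction
from itertools import product

support = [Fraction(1), Fraction(2)]
p = {x: Fraction(1, 2) for x in support}  # uniform D_X on {1, 2}
k = 3

# Volume-rescaled distribution VS^k: reweigh D_X^k by det(X^T X) = sum of squares.
weights = {t: (Fraction(1, 2) ** k) * sum(x * x for x in t)
           for t in product(support, repeat=k)}
Z = sum(weights.values())
vs_k = {t: w / Z for t, w in weights.items()}

# Composed distribution (2.3)+(2.4): position j uniform, x_j ~ VS^1 (prob
# proportional to x^2), remaining positions i.i.d. from D_X.
vs1_norm = sum(p[x] * x * x for x in support)
composed = {t: Fraction(0) for t in vs_k}
for t in composed:
    for j in range(k):
        prob = Fraction(1, k) * (p[t[j]] * t[j] ** 2 / vs1_norm)
        for i in range(k):
            if i != j:
                prob *= p[t[i]]
        composed[t] += prob

print(composed == vs_k)  # True: the two constructions give the same distribution
```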

Proof  The distribution of $S$ conditioned on $\overline{X}$ is the discrete volume sampling distribution over sets of size $d$, whose normalization constant is $\sum_{S:|S|=d}\det(\overline{X}_S)^2=\det(\overline{X}{}^\top\overline{X})$ via the Cauchy-Binet formula. Denote $S^c=\{1,\dots,k\}\setminus S$, and let $A$, $B$ and $C$ be measurable events for the variables $S$, $\overline{X}_S$ and $\overline{X}_{S^c}$, respectively. We next show that the three events are mutually independent and we compute their probabilities. The law of total probability with respect to the joint distribution of $S$ and $\overline{X}$, combined with Remark 2.1 (using $X\sim D_X^k$) implies that:

$$\begin{aligned}
\Pr\big(S\in A\wedge\overline{X}_S\in B\wedge\overline{X}_{S^c}\in C\big)
&=\mathrm{E}_{\overline{X}}\Big[\Pr\big(S\in A\wedge\overline{X}_S\in B\wedge\overline{X}_{S^c}\in C\;\big|\;\overline{X}\big)\Big]\\
&\overset{(a)}{=}\frac{\mathrm{E}\Big[\det(X^\top X)\cdot\sum_{S\in A}\frac{\det(X_S)^2}{\det(X^\top X)}\,1_{[X_S\in B]}\,1_{[X_{S^c}\in C]}\Big]}{k^{\underline{d}}\det(\Sigma_{D_X})}\\
&\overset{(b)}{=}\frac{\sum_{S\in A}\mathrm{E}\big[\det(X_S)^2\,1_{[X_S\in B]}\,1_{[X_{S^c}\in C]}\big]}{k^{\underline{d}}\det(\Sigma_{D_X})}\\
&\overset{(c)}{=}\frac{|A|\cdot\mathrm{E}\big[\det(X_{[d]})^2\,1_{[X_{[d]}\in B]}\big]\cdot\mathrm{E}\big[1_{[X_{[k]\setminus[d]}\in C]}\big]}{\binom{k}{d}\,d!\,\det(\Sigma_{D_X})}\\
&=\frac{|A|}{\binom{k}{d}}\cdot\mathrm{VS}^d_{D_X}(B)\cdot D_X^{k-d}(C).
\end{aligned}$$

Here $(a)$ uses Cauchy-Binet to obtain the normalization for the conditional distribution of $S$, which is then cancelled out in $(b)$. Finally, $(c)$ follows because the rows of $X$ are i.i.d., so $X_S$ and $X_{S^c}$ are independent for any fixed $S$, and the choice of $S$ does not affect the expectation.

Theorem 2.3 implies that for $k\gg d$, the distributions $D_X^k$ and $\mathrm{VS}^k_{D_X}$ are in fact very close to each other because they only differ on a small sub-sample of size $d$. Since the rows of $\overline{X}$ are exchangeable, they are also identically distributed. The marginal distribution of a single row exhibits a key connection between volume-rescaled sampling and leverage score sampling (when generalized to our distributional setting), which we will exploit later. Recall that for a fixed matrix $X$, the leverage score of the $i$th row is defined as $l_i=x_i^\top(X^\top X)^{-1}x_i$. Note that in this case, the leverage scores sum to $d$. The following definition is a natural generalization of leverage scores to arbitrary distributions.

###### Definition 2

Given a $d$-variate distribution $D_X$, we define leverage score sampling from $D_X$ as a $d$-variate probability measure $\mathrm{Lev}_{D_X}$ such that for any event $A$ measurable w.r.t. $D_X$, its probability is

$$\mathrm{Lev}_{D_X}(A)\;\overset{\mathrm{def}}{=}\;\frac{\mathrm{E}_{D_X}\big[1_{[x^\top\in A]}\cdot x^\top\Sigma_{D_X}^{-1}x\big]}{\mathrm{E}_{D_X}\big[x^\top\Sigma_{D_X}^{-1}x\big]},\qquad\text{where}\;\;x^\top\sim D_X.$$

Clearly, $\mathrm{E}_{D_X}\big[x^\top\Sigma_{D_X}^{-1}x\big]=d$ when $\Sigma_{D_X}$ is finite.

###### Remark 2.4

Distribution $\mathrm{Lev}_{D_X}$ is well-defined whenever $\Sigma_{D_X}$ is finite. Also, for any $F$, the random variable $F(\overline{x}^\top)$ for $\overline{x}^\top\sim\mathrm{Lev}_{D_X}$ is measurable if and only if $F(x^\top)$ is measurable for $x^\top\sim D_X$, and then it follows that

$$\mathrm{E}_{\mathrm{Lev}_{D_X}}\big[F(\overline{x}^\top)\big]=\mathrm{E}_{D_X}\big[F(x^\top)\,x^\top\Sigma_{D_X}^{-1}x\big]\big/\,d.$$
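For a finite design where $D_X$ is uniform over the rows of a matrix, $\mathrm{Lev}_{D_X}$ reduces to sampling row $i$ with probability $l_i/d$, and the identity in Remark 2.4 can be checked directly. A sketch (the test function `F` and design are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
V = rng.standard_normal((7, 3))  # D_X is uniform over the rows of V
n, d = V.shape

Sigma = V.T @ V / n  # Sigma_{D_X} = E[x x^T]
# Generalized leverage scores x^T Sigma^{-1} x for each row.
scores = np.einsum('ij,jk,ik->i', V, np.linalg.inv(Sigma), V)
assert np.isclose(scores.mean(), d)  # E[x^T Sigma^{-1} x] = d

lev = scores / scores.sum()  # leverage score sampling probabilities

F = lambda x: np.sin(x).sum()  # any measurable test function
lhs = sum(lev[i] * F(V[i]) for i in range(n))                # E_Lev[F]
rhs = np.mean([F(V[i]) * scores[i] for i in range(n)]) / d   # E_DX[F * score] / d
print(np.isclose(lhs, rhs))  # True
```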
###### Theorem 2.5

The marginal distribution of each row vector of $\overline{X}\sim\mathrm{VS}^k_{D_X}$ is

$$\frac{d}{k}\cdot\mathrm{Lev}_{D_X}+\Big(1-\frac{d}{k}\Big)\cdot D_X.$$

Proof  For $k=d$, this can be derived from existing work on determinantal point processes (see Lemma 3.3 for more details). We present an independent proof using the rank-one determinant identity $\det(M+xx^\top)=\det(M)+x^\top\mathrm{adj}(M)\,x$ and Lemma 2.2. Given ,