Exact expressions for double descent and implicit regularization via surrogate random design

12/10/2019
by Michał Dereziński, et al.
UC Berkeley

Double descent refers to the phase transition that is exhibited by the generalization error of unregularized learning models when varying the ratio between the number of parameters and the number of training samples. The recent success of highly over-parameterized machine learning models such as deep neural networks has motivated a theoretical analysis of the double descent phenomenon in classical models such as linear regression which can also generalize well in the over-parameterized regime. We build on recent advances in Randomized Numerical Linear Algebra (RandNLA) to provide the first exact non-asymptotic expressions for double descent of the minimum norm linear estimator. Our approach involves constructing what we call a surrogate random design to replace the standard i.i.d. design of the training sample. This surrogate design admits exact expressions for the mean squared error of the estimator while preserving the key properties of the standard design. We also establish an exact implicit regularization result for over-parameterized training samples. In particular, we show that, for the surrogate design, the implicit bias of the unregularized minimum norm estimator precisely corresponds to solving a ridge-regularized least squares problem on the population distribution.


1 Introduction

Classical statistical learning theory asserts that to achieve generalization one must use a training sample size that sufficiently exceeds the complexity of the learning model, where the latter is typically represented by the number of parameters (or some related structural parameter) [FHT01]. In particular, this seems to suggest the conventional wisdom that one should not use models that fit the training data exactly. However, modern machine learning practice often seems to go against this intuition, using models with so many parameters that the training data can be perfectly interpolated, in which case the training error vanishes. It has been shown that models such as deep neural networks, as well as certain so-called interpolating kernels and decision trees, can generalize well in this regime. In particular, recent work [BHMM19] empirically demonstrated a phase transition in the generalization performance of learning models which occurs at an interpolation threshold, i.e., a point where the training error goes to zero (as one varies the ratio between the model complexity and the sample size). Moving away from this threshold in either direction tends to reduce the generalization error, leading to the so-called double descent curve.

To understand this surprising phenomenon in perhaps the simplest possible setting, we study it in the context of linear (least squares) regression. Consider a full rank data matrix $\mathbf{X}\in\mathbb{R}^{n\times d}$ and a vector $\mathbf{y}\in\mathbb{R}^n$ of responses corresponding to each of the $n$ data points (the rows of $\mathbf{X}$), where we wish to find the best linear model $\mathbf{x}^\top\mathbf{w}$, parameterized by a $d$-dimensional vector $\mathbf{w}$. The simplest example of an estimator that has been shown to exhibit the double descent phenomenon [BHX19] is the Moore-Penrose estimator, $\mathbf{X}^+\mathbf{y}$: in the so-called over-determined regime, i.e., when $n\ge d$, it corresponds to the least squares solution, i.e., $\operatorname{argmin}_{\mathbf{w}}\|\mathbf{X}\mathbf{w}-\mathbf{y}\|^2$; and in the under-determined regime (also known as over-parameterized or interpolating), i.e., when $n<d$, it corresponds to the minimum norm solution of the linear system $\mathbf{X}\mathbf{w}=\mathbf{y}$. Given the ubiquity of linear regression and the Moore-Penrose solution, e.g., in kernel-based machine learning, studying the performance of this estimator can shed some light on the effects of over-parameterization/interpolation in machine learning more generally. Of particular interest are results that are exact (i.e., not upper/lower bounds) and non-asymptotic (i.e., for large but still finite $n$ and $d$).
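As a concrete illustration of the two regimes, the following minimal numpy sketch (not taken from the paper) verifies that the Moore-Penrose estimator coincides with the least squares solution in the over-determined case and with a minimum-norm interpolant in the under-determined case:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20

for n in (50, 10):  # over-determined (n >= d) and under-determined (n < d)
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)
    w_pinv = np.linalg.pinv(X) @ y                   # Moore-Penrose estimator X^+ y
    w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]   # minimum-norm least squares solution
    assert np.allclose(w_pinv, w_lstsq)
    if n < d:
        # in the under-determined regime the estimator interpolates the data exactly
        assert np.allclose(X @ w_pinv, y)
```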

We build on methods from Randomized Numerical Linear Algebra (RandNLA) in order to obtain exact non-asymptotic expressions for the mean squared error (MSE) of the Moore-Penrose estimator (see Theorem 1). This provides a precise characterization of the double descent phenomenon for perhaps the simplest and most ubiquitous regression problem. In obtaining these results, we are able to provide precise formulas for the implicit regularization induced by minimum norm solutions of under-determined training samples, relating it to classical ridge regularization (see Theorem 2). Such implicit regularization has been observed empirically for RandNLA methods [Mah11], but it has also been noted in deep learning [Ney17] and machine learning [Mah12] more generally. To obtain our precise results, we use a somewhat non-standard random design, which we term surrogate random design (see Section 2 for a detailed discussion), and which we expect will be of more general interest. Informally, the goal of a surrogate random design is to modify an original design to capture its main properties while being “nicer” in some useful way. In Theorem 3 and Section 5 we show, both theoretically and empirically, that our surrogate design accurately preserves the key properties of the original design when the data distribution is a multivariate Gaussian.

1.1 Main results: double descent and implicit regularization

As the performance metric in our analysis, we use the mean squared error (MSE), defined as $\mathrm{MSE}[\widehat{\mathbf{w}}]=\mathbb{E}\,\|\widehat{\mathbf{w}}-\mathbf{w}^*\|^2$, where $\mathbf{w}^*$ is a fixed underlying linear model of the responses. In analyzing the MSE, we make the following standard assumption on the response noise.

Assumption 1 (Homoscedastic noise).

Responses are $y_i=\mathbf{x}_i^\top\mathbf{w}^*+\xi_i$, where the noise variables $\xi_i$ are independent of $\mathbf{x}_i$, with mean zero and variance $\sigma^2$.

Our main result provides an exact expression for the MSE of the Moore-Penrose estimator under our surrogate design, denoted $S_\mu^n$, where $\mu$ is the $d$-variate distribution of the row vector $\mathbf{x}^\top$ and $n$ is the sample size (details in Section 2). This surrogate is used in place of the standard random design $\mathbf{X}\sim\mu^n$, where the $n$ data points (the rows of $\mathbf{X}$) are sampled independently from $\mu$. Unlike for the standard design, our MSE formula is fully expressible as a function of the covariance matrix $\boldsymbol\Sigma=\mathbb{E}[\mathbf{x}\mathbf{x}^\top]$. To state our main result, we need an additional minor assumption on $\mu$, which is satisfied by most standard continuous distributions, such as any multivariate Gaussian with a positive definite covariance matrix.

Assumption 2 (General position).

For $k\le d$, if $\mathbf{X}\sim\mu^k$, then $\operatorname{rank}(\mathbf{X})=k$ almost surely.

Under Assumptions 1 and 2, we can establish our first main result, stated as the following theorem.

Theorem 1 (Exact non-asymptotic MSE).

If the response noise is homoscedastic with variance $\sigma^2$ (Assumption 1) and $\mu$ is in general position (Assumption 2), then for $\bar{\mathbf{X}}\sim S_\mu^n$ (Definition 3) and $\bar{\mathbf{y}}=\bar{\mathbf{X}}\mathbf{w}^*+\bar{\boldsymbol\xi}$, the mean squared error $\mathrm{MSE}\big[\bar{\mathbf{X}}^+\bar{\mathbf{y}}\big]$ admits an exact closed-form expression in terms of $\boldsymbol\Sigma$, $\mathbf{w}^*$, $\sigma^2$, $n$, and $d$; in the under-determined regime, this expression is stated in terms of a parameter $\lambda\ge 0$ satisfying $n=\operatorname{tr}\big(\boldsymbol\Sigma(\boldsymbol\Sigma+\lambda\mathbf{I})^{-1}\big)$.

Definition 1.

We will use to denote the above expressions for .

For illustration, we plot these MSE expressions in Figure 1a, comparing them with empirical estimates of the true MSE under the i.i.d. design for a multivariate Gaussian distribution $\mu=\mathcal{N}(\mathbf{0},\boldsymbol\Sigma)$ with several different covariance matrices $\boldsymbol\Sigma$. We keep the number of features $d$ fixed and vary the number of samples $n$, observing a double descent peak at $n=d$. We observe that our theory aligns well with the empirical estimates, whereas previously no such theory was available except for special cases such as $\boldsymbol\Sigma=\mathbf{I}$ (more details in Theorem 3 and Section 5). The plots show that varying the spectral decay of $\boldsymbol\Sigma$ has a significant effect on the shape of the curve in the under-determined regime. We use a horizontal line to denote the MSE of the null estimator $\widehat{\mathbf{w}}=\mathbf{0}$. When the eigenvalues of $\boldsymbol\Sigma$ decay rapidly, the Moore-Penrose estimator suffers less error than the null estimator for some values of $n<d$, and the curve exhibits a local optimum in this regime.

(a) Surrogate MSE expressions (Theorem 1) closely match numerical estimates even for non-isotropic features. Eigenvalue decay leads to a steeper descent curve in the under-determined regime ($n<d$).
(b) The mean of the estimator exhibits shrinkage which closely matches the shrinkage of a ridge-regularized least squares optimum (theory lines), as characterized by Theorem 2.
Figure 1: Illustration of the main results for a multivariate Gaussian $\mu=\mathcal{N}(\mathbf{0},\boldsymbol\Sigma)$, where $\boldsymbol\Sigma$ is diagonal with exponentially decaying eigenvalues. We use our surrogate formulas to plot (a) the MSE (Theorem 1) and (b) the norm of the expectation (Theorem 2) of the Moore-Penrose estimator (theory lines), accompanied by empirical estimates based on the standard i.i.d. design (error bars are three times the standard error of the mean). We consider three different condition numbers of $\boldsymbol\Sigma$, with isotropic features corresponding to $\boldsymbol\Sigma=\mathbf{I}$.
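A qualitative version of the double descent curve in Figure 1a can be reproduced with a short Monte Carlo simulation of the i.i.d. Gaussian design. The sketch below uses assumed parameters (dimension, spectrum, noise level), not the paper's exact experimental setup:

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma, trials = 30, 0.5, 200
eigs = 0.9 ** np.arange(d)            # exponentially decaying spectrum (assumed)
Sigma_sqrt = np.diag(np.sqrt(eigs))
w_star = np.ones(d) / np.sqrt(d)

for n in range(5, 2 * d + 1, 5):
    errs = []
    for _ in range(trials):
        X = rng.standard_normal((n, d)) @ Sigma_sqrt    # rows x ~ N(0, Sigma)
        y = X @ w_star + sigma * rng.standard_normal(n)
        w_hat = np.linalg.pinv(X) @ y
        errs.append(np.sum((w_hat - w_star) ** 2))
    print(n, np.mean(errs))            # empirical MSE peaks near the interpolation threshold n = d
```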

One important aspect of Theorem 1 comes from the relationship between $\lambda$ and the sample size $n$, which together satisfy $n=\operatorname{tr}\big(\boldsymbol\Sigma(\boldsymbol\Sigma+\lambda\mathbf{I})^{-1}\big)$. This expression is precisely the classical notion of effective dimension $d_\lambda=\operatorname{tr}\big(\boldsymbol\Sigma(\boldsymbol\Sigma+\lambda\mathbf{I})^{-1}\big)$ for ridge regression regularized with $\lambda$ [AM15], and it arises here even though there is no explicit ridge regularization in the problem being considered in Theorem 1. The global solution to the ridge regression task (i.e., $\ell_2$-regularized least squares) with parameter $\lambda$ is defined as:

$$\mathbf{w}^*_\lambda=\operatorname*{argmin}_{\mathbf{w}}\ \mathbb{E}\big[(\mathbf{x}^\top\mathbf{w}-y)^2\big]+\lambda\|\mathbf{w}\|^2,$$

where the expectation is taken over the population distribution of $(\mathbf{x},y)$. When Assumption 1 holds, then $\mathbf{w}^*_\lambda=(\boldsymbol\Sigma+\lambda\mathbf{I})^{-1}\boldsymbol\Sigma\,\mathbf{w}^*$; however, ridge-regularized least squares is well-defined for much more general response models. Our second result makes a direct connection between the (expectation of the) unregularized minimum norm solution on the sample and the global ridge-regularized solution. While the under-determined regime (i.e., $n<d$) is of primary interest to us, for completeness we state this result for arbitrary values of $n$ and $d$. Note that, just like the definition of regularized least squares, this theorem applies more generally than Theorem 1, in that it does not require the responses to follow any linear model as in Assumption 1.
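Since $d_\lambda$ is strictly decreasing in $\lambda$, the equation $n=\operatorname{tr}\big(\boldsymbol\Sigma(\boldsymbol\Sigma+\lambda\mathbf{I})^{-1}\big)$ has a unique non-negative solution whenever $n<d$, and it can be found numerically. The following sketch (not from the paper; the spectrum of $\boldsymbol\Sigma$ is a placeholder) solves for $\lambda$ by bisection:

```python
import numpy as np

def implicit_lambda(eigs, n, tol=1e-12):
    """Solve n = sum_i eigs_i / (eigs_i + lam) for lam >= 0 (requires n < len(eigs))."""
    d_lam = lambda lam: np.sum(eigs / (eigs + lam))
    lo, hi = 0.0, 1.0
    while d_lam(hi) > n:              # bracket the root: d_lam is decreasing in lam
        hi *= 2.0
    while hi - lo > tol * (1.0 + hi):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if d_lam(mid) > n else (lo, mid)
    return 0.5 * (lo + hi)

eigs = 0.9 ** np.arange(100)          # assumed spectrum of Sigma
lam = implicit_lambda(eigs, n=40)
print(lam, np.sum(eigs / (eigs + lam)))   # the second printed value should be close to 40
```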

Theorem 2 (Implicit regularization of Moore-Penrose estimator).

For $\mu$ satisfying Assumption 2 and any response distribution such that $\mathbf{w}^*_\lambda$ is well-defined, if $\bar{\mathbf{X}}\sim S_\mu^n$ and $\bar{\mathbf{y}}$ is the corresponding vector of responses, then

$$\mathbb{E}\big[\bar{\mathbf{X}}^+\bar{\mathbf{y}}\big]=\mathbf{w}^*_\lambda,$$

where, as in Theorem 1, $\lambda$ is chosen so that the effective dimension $d_\lambda$ equals $n$. (The proof of Theorem 2 can be easily extended to probability measures that do not satisfy Assumption 2, such as discrete distributions; we include this assumption here to simplify the presentation.)

That is, when $n<d$, the Moore-Penrose estimator (which itself is not regularized), computed on the random training sample, in expectation equals the global ridge-regularized least squares solution of the underlying regression problem. Moreover, $\lambda$, i.e., the amount of implicit $\ell_2$-regularization, is controlled by the degree of over-parameterization in such a way as to ensure that the sample size $n$ becomes the ridge effective dimension (a.k.a. the effective degrees of freedom).

We illustrate this result in Figure 1b, plotting the norm of the expectation of the Moore-Penrose estimator. As for the MSE, our surrogate theory aligns well with the empirical estimates for i.i.d. Gaussian designs, showing that the shrinkage of the unregularized estimator in the under-determined regime matches the implicit ridge-regularization characterized by Theorem 2. While the shrinkage is a linear function of the sample size for isotropic features (i.e., $\boldsymbol\Sigma\propto\mathbf{I}$), it exhibits a non-linear behavior for other spectral decays. Such implicit regularization has been studied previously [MO11, PM11, GM14, Mah12]; it has been observed empirically for RandNLA sampling algorithms [Mah11, MMY15]; and it has also received attention more generally within the context of neural networks [Ney17]. While our implicit regularization result is limited to the Moore-Penrose estimator, this new connection (and others, described below) between the minimum norm solution of an unregularized under-determined system and a ridge-regularized least squares solution offers a simple interpretation for the implicit regularization observed in modern machine learning architectures.
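Theorem 2 is exact for the surrogate design; Figure 1b suggests that it also approximately describes the i.i.d. Gaussian design. The following Monte Carlo sketch (assumed parameters; it reuses `implicit_lambda` from the previous snippet) probes this by averaging the Moore-Penrose estimator over many i.i.d. Gaussian samples and comparing it with the population ridge solution $(\boldsymbol\Sigma+\lambda\mathbf{I})^{-1}\boldsymbol\Sigma\mathbf{w}^*$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, sigma, trials = 40, 20, 0.1, 5000
eigs = np.linspace(1.0, 3.0, d)            # assumed spectrum of a diagonal Sigma
Sigma = np.diag(eigs)
w_star = rng.standard_normal(d)

lam = implicit_lambda(eigs, n)             # lambda whose effective dimension equals n
ridge = np.linalg.solve(Sigma + lam * np.eye(d), Sigma @ w_star)

avg = np.zeros(d)
for _ in range(trials):
    X = rng.standard_normal((n, d)) * np.sqrt(eigs)   # rows x ~ N(0, Sigma)
    y = X @ w_star + sigma * rng.standard_normal(n)
    avg += np.linalg.pinv(X) @ y / trials

# relative gap between the averaged estimator and the ridge solution;
# small for the i.i.d. Gaussian design, in line with Figure 1b
print(np.linalg.norm(avg - ridge) / np.linalg.norm(ridge))
```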

Our exact non-asymptotic expressions in Theorem 1 and our exact implicit regularization results in Theorem 2 are derived for the surrogate design, but Figure 1 suggests that they accurately describe the MSE (up to lower order terms) also under the standard i.i.d. design $\mathbf{X}\sim\mu^n$, particularly when $\mu$ is a multivariate Gaussian. As a third result, we verify this in the cases where there exist known expressions for the MSE under the i.i.d. design (standard Gaussian for the under-determined setting, and arbitrary Gaussian for the over-determined one).

Theorem 3 (Asymptotic consistency of surrogate design).

Let $\mu$ be a multivariate Gaussian whose responses satisfy Assumption 1. If the design is standard Gaussian (in the under-determined regime) or arbitrary Gaussian (in the over-determined regime), then the absolute difference between the surrogate expressions and the true MSE is bounded as follows:

Remark 1.

For $n$ equal to $d-1$, $d$, or $d+1$, the true MSE under the Gaussian random design can be infinite, whereas the surrogate MSE is finite and has a closed form expression.

Empirical estimates given in Figure 1 suggest that the consistency of the surrogate expressions holds much more generally than stated above. Based on a detailed empirical analysis described in Section 5.2, we conjecture that an asymptotic consistency result similar to the statement of Theorem 3 holds true in the under-determined regime without the assumption that $\boldsymbol\Sigma=\mathbf{I}$. In this case, no formula is known for the true MSE under the i.i.d. design, whereas the expressions for the surrogate Gaussian design naturally extend.

1.2 Key techniques: surrogate designs and determinant preserving matrices

The standard random design model for linear regression assumes that each pair $(\mathbf{x}_i,y_i)$ is drawn independently, where the row vector $\mathbf{x}_i^\top$ comes from some $d$-variate distribution $\mu$ and $y_i$ is a random response variable drawn conditionally on $\mathbf{x}_i$. Precise theoretical analysis of under-determined regression in this setting poses significant challenges, even in such special cases as the Moore-Penrose estimator and a Gaussian data distribution $\mu$. Rather than trying to directly analyze the usual i.i.d. random design described above, we modify it slightly by introducing the notion of a surrogate random design, $S_\mu^n$. Informally, the goal of a surrogate random design is to capture the main properties of the original design while being “nicer” for theoretical or empirical analysis. In particular, here, we will modify the distribution of the matrix $\mathbf{X}$ so as to:

  1. closely preserve the behavior of the Moore-Penrose estimator from the i.i.d. design; and

  2. obtain exact expressions for double descent in terms of the mean squared error.

A key element in the construction of our surrogate designs involves rescaling the measure by the pseudo-determinant $\operatorname{pdet}(\mathbf{X}\mathbf{X}^\top)$, i.e., the product of the non-zero eigenvalues. A similar type of determinantal design was suggested in prior work [DWH19b], but it was restricted there to the over-determined regime. We broaden this definition by not only allowing the sample size to be less than $d$, but also allowing it to be randomized. Our definition of a determinantal design matrix $\bar{\mathbf{X}}$ follows by expressing $\mathbb{E}\big[F(\bar{\mathbf{X}})\big]$ for any real-valued function $F$ as (see Definition 2):

$$\mathbb{E}\big[F(\bar{\mathbf{X}})\big]\;\propto\;\mathbb{E}\big[\operatorname{pdet}(\mathbf{X}\mathbf{X}^\top)\,F(\mathbf{X})\big],$$

where $\mathbf{X}\sim\mu^K$ and $K$ is a random sample-size variable. Then, we define (in Definition 3) our surrogate design for each $n$ as a determinantal design with a carefully chosen random variable $K$, so that the expected sample size is equal to $n$ and so that it is possible to derive closed form expressions for the MSE. We achieve this by using modifications of the Poisson distribution to construct the variable $K$.

The key technical contribution that allows us to derive the MSE for determinantal designs is the concept of determinant preserving random matrices, a notion that we expect to be useful more generally. Specifically, in Section 3 we define a class of random matrices for which taking the determinant commutes with taking the expectation, for the matrix itself and any of its square submatrices (see Definition 4):

$$\mathbb{E}\big[\det(\mathbf{A}_{\mathcal{I},\mathcal{J}})\big]=\det\big(\mathbb{E}[\mathbf{A}_{\mathcal{I},\mathcal{J}}]\big)\qquad\text{for all index subsets }\mathcal{I},\mathcal{J}\text{ of equal size}.$$

Not all random matrices satisfy this property; however, many interesting and non-trivial examples can be found. Constructing these examples is facilitated by the closure properties that this class enjoys. In particular, if $\mathbf{A}$ and $\mathbf{B}$ are determinant preserving and independent, then $\mathbf{A}+\mathbf{B}$ and $\mathbf{A}\mathbf{B}$ are also determinant preserving (see Lemma 3). We use these techniques to prove a number of determinantal expectation formulas. For example, we show that if $\mathbf{X}\sim\mu^K$, where $K$ is a Poisson random variable with mean $\gamma$, then:

$$\mathbb{E}\big[\det(\mathbf{X}^\top\mathbf{X})\big]=\gamma^d\det(\boldsymbol\Sigma)\qquad\text{and}\qquad\mathbb{E}\big[\det(\mathbf{X}\mathbf{X}^\top)\big]=\mathrm{e}^{-\gamma}\det(\mathbf{I}+\gamma\boldsymbol\Sigma).$$

These formulas are used to derive the normalization constants for our surrogate design distribution, which are later used in proving Theorems 1 and 2.

1.3 Related work

There is a large body of related work, which for simplicity we cluster into three groups.

Double descent. The double descent phenomenon (a term introduced by [BHMM19]) corresponds to the phase transition in the generalization error that occurs when the ratio between the model complexity and the sample size crosses the so-called interpolation threshold. It has been observed empirically in a number of learning models, including neural networks [BHMM19, GJS19], kernel methods [BMM18, BRT19], nearest neighbor models [BHM18], and decision trees [BHMM19]. Theoretical analyses of double descent, and more broadly of the generalization properties of interpolating estimators, have primarily focused on various forms of linear regression. The works most comparable to ours are [BLLT19, LR19] and [HMRT19], which provide non-asymptotic upper/lower bounds and asymptotic formulas, respectively, for the generalization error of the Moore-Penrose estimator under essentially the same i.i.d. random design setting as ours. On the other hand, [MVSS19] provide bounds for the error of the ideal linear interpolator (instead of the minimum norm one). Note that while we analyze the classical mean squared error, many works focus on the squared prediction error instead (some of them still refer to it as the MSE). Another line of literature deals with linear regression in the so-called misspecified setting, where the set of observed features does not match the feature space in which the response model is linear [BHX19, HMRT19, Mit19, MM19b], e.g., when the learner observes a random subset of features from a larger population. This is an important distinction, because it allows varying the model complexity by changing the number of observed features while keeping the linear model fixed (see further discussion in Section 6). We believe that our results can be extended to this important setting, and we leave this as a direction for future work.

RandNLA. Randomized numerical linear algebra [Mah11, DM16, DM17] has traditionally focused on obtaining algorithmic improvements for tasks such as least squares and low-rank approximation via techniques that include sketching [Sar06] and i.i.d. leverage score sampling [DMM06]. However, there has been growing interest in understanding the statistical properties of these randomized methods [MMY15, RM16], for example looking at the mean squared error of the least squares estimator obtained via i.i.d. subsampling under the standard linear response model. Determinantal sampling methods (a.k.a. volume sampling, or determinantal point processes), which first found their way into RandNLA in the context of low-rank approximation [DRVW06], have been recently shown to combine strong worst-case guarantees with elegant statistical properties. In particular, [DW17] showed that the least-squares estimator subsampled via so-called size $d$ volume sampling (loosely corresponding to the special case of our surrogate design where $n=d$) is an unbiased estimator that admits exact formulas for both the expected square loss (a worst-case metric) and the mean squared error (a statistical metric). These results were developed further by [DWH18, DWH19a, DCMW19]; however, they were still limited to the over-determined setting (with the exception of [DW18a, DLM19], who gave upper bounds on the mean squared error of the ridge estimator under different determinantal samplings). Also in the over-determined setting, [DWH19b] provided evidence for the fact that determinantal rescaling can be used to modify the original data distribution (particularly, a multivariate Gaussian) without a significant distortion to the estimator, while making certain statistical quantities expressible analytically. We take this direction further by analyzing the unregularized least squares estimator in the under-determined setting, which is less well understood, partly due to the presence of implicit regularization.

Implicit regularization. The term implicit regularization typically refers to the notion that approximate computation (e.g., rather than exactly minimizing a function $f(\mathbf{w})$, instead running an approximation algorithm to get an approximately optimal solution) can implicitly lead to statistical regularization (e.g., exactly minimizing an objective of the form $f(\mathbf{w})+\lambda g(\mathbf{w})$, for some well-specified regularizer $g$ and parameter $\lambda$). See [MO11, PM11, GM14] and references therein for early work on the topic; and see [Mah12] for an overview. More recently, often motivated by neural networks, there has been work on implicit regularization that typically considered SGD-based optimization algorithms. See, e.g., theoretical results on simplified models [NTS14, Ney17, SHN18, GWB17, ACHL19, KBMM19] as well as extensive empirical and phenomenological results on state-of-the-art neural network models [MM18, MM19a]. The implicit regularization we observe is different in that it is not caused by an inexact approximation algorithm (such as SGD), but rather by the selection of one out of many exact solutions (e.g., the minimum norm solution). In this context, most relevant are the asymptotic results of [LJB19] (which used the asymptotic risk results for ridge regression of [DW18b]) and [KLS18]. Our non-asymptotic results are also related to recent work in RandNLA on the expectation of the inverse [DM19] and generalized inverse [MDK19] of a subsampled matrix.

2 Surrogate random designs

In this section, we provide the definition of our surrogate random design $S_\mu^n$, where $\mu$ is a $d$-variate probability measure and $n$ is the sample size. This distribution is used in place of the standard random design $\mu^n$, consisting of $n$ row vectors drawn independently from $\mu$. Our surrogate design uses determinantal rescaling to alter the joint distribution of the vectors so that certain expected quantities (such as the mean squared error of the Moore-Penrose estimator) can be expressed in closed form. We start by introducing notation.

Preliminaries. The set $\{1,\dots,n\}$ will be denoted by $[n]$. For a square matrix $\mathbf{A}$, we use $\operatorname{pdet}(\mathbf{A})$ to denote the pseudo-determinant of $\mathbf{A}$, which is the product of its non-zero eigenvalues. For index subsets $\mathcal{I}$ and $\mathcal{J}$, we use $\mathbf{A}_{\mathcal{I},\mathcal{J}}$ to denote the submatrix of $\mathbf{A}$ with rows indexed by $\mathcal{I}$ and columns indexed by $\mathcal{J}$. We may write $\mathbf{A}_{\mathcal{I}}$ to indicate that we take a subset of rows. We use $\operatorname{adj}(\mathbf{A})$ to denote the adjugate of $\mathbf{A}$, defined as follows: the $(i,j)$-th entry of $\operatorname{adj}(\mathbf{A})$ is $(-1)^{i+j}$ times the determinant of the submatrix obtained from $\mathbf{A}$ by removing row $j$ and column $i$. We will use two useful identities related to the adjugate: (1) $\operatorname{adj}(\mathbf{A})=\det(\mathbf{A})\,\mathbf{A}^{-1}$ for invertible $\mathbf{A}$, and (2) $\operatorname{adj}(\mathbf{A})\,\mathbf{A}=\mathbf{A}\,\operatorname{adj}(\mathbf{A})=\det(\mathbf{A})\,\mathbf{I}$. For a probability measure $\mu$ over $\mathbb{R}^d$, we use $\mathbf{x}^\top\sim\mu$ to denote a random row vector sampled according to this distribution. We let $\mathbf{X}\sim\mu^K$ denote a random matrix with $K$ rows drawn i.i.d. according to $\mu$, and the $i$-th row is denoted as $\mathbf{x}_i^\top$. We also let $\boldsymbol\Sigma=\mathbb{E}_\mu[\mathbf{x}\mathbf{x}^\top]$, where $\mathbb{E}_\mu$ refers to the expectation with respect to $\mu$, assuming throughout that $\boldsymbol\Sigma$ is well-defined and positive definite. We use $\operatorname{Poi}_{\le s}(\gamma)$ as the Poisson distribution with mean $\gamma$ restricted to values less than or equal to $s$, and a similar convention, $\operatorname{Poi}_{\ge s}(\gamma)$, is used for the restriction to values greater than or equal to $s$. Finally, we use $\#(\mathbf{X})$ to denote the number of rows of $\mathbf{X}$.
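Since the pseudo-determinant and the adjugate are less standard than the determinant and the inverse, the following minimal numpy helpers (an illustration, not code from the paper) make the preliminaries concrete and check the two adjugate identities listed above:

```python
import numpy as np

def pdet(A, tol=1e-10):
    """Pseudo-determinant: product of the non-zero eigenvalues of a square matrix A."""
    eigs = np.linalg.eigvals(A)
    nz = eigs[np.abs(eigs) > tol]
    return float(np.real(np.prod(nz))) if nz.size else 1.0

def adjugate(A):
    """Adjugate via cofactors: adj(A)[i, j] = (-1)^(i+j) * det(A with row j and column i removed)."""
    d = A.shape[0]
    adj = np.empty_like(A, dtype=float)
    for i in range(d):
        for j in range(d):
            minor = np.delete(np.delete(A, j, axis=0), i, axis=1)
            adj[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return adj

A = np.random.default_rng(3).standard_normal((4, 4))
assert np.allclose(adjugate(A), np.linalg.det(A) * np.linalg.inv(A))   # identity (1)
assert np.allclose(adjugate(A) @ A, np.linalg.det(A) * np.eye(4))      # identity (2)
```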

We now define a family of determinantal distributions over random matrices $\bar{\mathbf{X}}$, where not only the entries but also the number of rows is randomized. This randomized sample size is a crucial property of our designs that enables our analysis. Our definition follows by expressing $\mathbb{E}\big[F(\bar{\mathbf{X}})\big]$ for real-valued functions $F$ (the expectation may be undefined for some functions).

Definition 2.

Let $\mu$ satisfy Assumption 2 and let $K$ be a random variable over non-negative integers. A determinantal design with sample size variable $K$ is a distribution of a random matrix $\bar{\mathbf{X}}$ such that, for any $F$ as above,

$$\mathbb{E}\big[F(\bar{\mathbf{X}})\big]\;\propto\;\mathbb{E}\big[\operatorname{pdet}(\mathbf{X}\mathbf{X}^\top)\,F(\mathbf{X})\big],\qquad\mathbf{X}\sim\mu^K.$$

Setting $F\equiv 1$, observe that the proportionality constant must be $\mathbb{E}\big[\operatorname{pdet}(\mathbf{X}\mathbf{X}^\top)\big]$. The above definition can be interpreted as rescaling the density function of $\mathbf{X}\sim\mu^K$ by the pseudo-determinant, and then renormalizing it.

We now construct our surrogate design by appropriately selecting the random variable $K$. One might be tempted to use the obvious choice of a deterministic sample size $K=n$, but this does not result in simple closed form expressions for the MSE in the under-determined regime (i.e., $n<d$), which is the regime of primary interest to us. Instead, we derive our random variables from the Poisson distribution.

Definition 3.

For $\mu$ satisfying Assumption 2, define the surrogate design $S_\mu^n$ as the determinantal design of Definition 2 with the sample size variable $K$ chosen as follows:

  1. if $n<d$, then $K\sim\operatorname{Poi}_{\le d}(\gamma)$, with $\gamma$ being the solution of $n=\operatorname{tr}\big(\gamma\boldsymbol\Sigma(\gamma\boldsymbol\Sigma+\mathbf{I})^{-1}\big)$ (equivalently, $\gamma=1/\lambda$ for $\lambda$ as in Theorem 1),

  2. if $n=d$, then we simply let $K=d$,

  3. if $n>d$, then $K\sim\operatorname{Poi}_{\ge d}(\gamma)$ with $\gamma=n-d$.

Note that the under-determined case, i.e., $n<d$, is restricted to $K\le d$ so that, under Assumption 2, $\operatorname{rank}(\bar{\mathbf{X}})=\#(\bar{\mathbf{X}})$ with probability 1. On the other hand, in the over-determined case, i.e., $n>d$, we have $K\ge d$, so that $\operatorname{rank}(\bar{\mathbf{X}})=d$. In the special case of $n=d$, both of these equations are satisfied: $\operatorname{rank}(\bar{\mathbf{X}})=\#(\bar{\mathbf{X}})=d$.
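The sample-size variables in Definition 3 are Poisson distributions restricted to $K\le d$ or $K\ge d$ (see the preliminaries). One simple, if not the most efficient, way to sample from such a restricted Poisson is rejection sampling; a sketch, where the mean parameter `gamma` is an input that would come from Definition 3:

```python
import numpy as np

def truncated_poisson(gamma, d, side, rng, max_tries=100000):
    """Sample K ~ Poisson(gamma) conditioned on K <= d (side='le') or K >= d (side='ge')."""
    for _ in range(max_tries):
        k = rng.poisson(gamma)
        if (side == "le" and k <= d) or (side == "ge" and k >= d):
            return k
    raise RuntimeError("rejection sampling failed; conditioning event too rare")

rng = np.random.default_rng(4)
samples = [truncated_poisson(gamma=30.0, d=50, side="le", rng=rng) for _ in range(1000)]
print(np.mean(samples))   # close to 30, since P(K > 50) is tiny for Poisson(30)
```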

The first non-trivial property of the surrogate design is that the expected sample size is in fact always equal to $n$, which we prove at the end of this section.

Lemma 1.

Let $\bar{\mathbf{X}}\sim S_\mu^n$ for any $n>0$. Then, we have $\mathbb{E}\big[\#(\bar{\mathbf{X}})\big]=n$.

Our general template for computing expectations under a surrogate design is to use the following expressions based on the i.i.d. random design $\mathbf{X}\sim\mu^K$ with unrestricted $K\sim\operatorname{Poi}(\gamma)$:

$$\mathbb{E}\big[F(\bar{\mathbf{X}})\big]=\frac{\mathbb{E}\big[\det(\mathbf{X}\mathbf{X}^\top)\,F(\mathbf{X})\big]}{\mathbb{E}\big[\det(\mathbf{X}\mathbf{X}^\top)\big]}\ \ \text{for }n<d,\qquad\mathbb{E}\big[F(\bar{\mathbf{X}})\big]=\frac{\mathbb{E}\big[\det(\mathbf{X}^\top\mathbf{X})\,F(\mathbf{X})\big]}{\mathbb{E}\big[\det(\mathbf{X}^\top\mathbf{X})\big]}\ \ \text{for }n>d.$$

These formulas follow from Definitions 2 and 3 because the determinants $\det(\mathbf{X}\mathbf{X}^\top)$ and $\det(\mathbf{X}^\top\mathbf{X})$ are non-zero precisely in the regimes $K\le d$ and $K\ge d$, respectively, which is why we can drop the restrictions on the range of the Poisson distribution. Crucially, the normalization constants for computing the expectations can be obtained using the following formulas: if $K\sim\operatorname{Poi}(\gamma)$ and $\mathbf{X}\sim\mu^K$, then

$\mathbb{E}\big[\det(\mathbf{X}\mathbf{X}^\top)\big]=\mathrm{e}^{-\gamma}\det(\mathbf{I}+\gamma\boldsymbol\Sigma)$   (Lemma 6)
$\mathbb{E}\big[\det(\mathbf{X}\mathbf{X}^\top)\big]=d!\,\det(\boldsymbol\Sigma)$ for fixed $K=d$   (Lemma 5)
$\mathbb{E}\big[\det(\mathbf{X}^\top\mathbf{X})\big]=\gamma^d\det(\boldsymbol\Sigma)$   (Lemma 4)
Remark 2.

We will use as a shorthand for the above normalization constants.

We prove Lemmas 4 and 6 in Section 3 by introducing the concept of determinant preserving random matrices. The lemmas play a crucial role in deriving a number of new expectation formulas for the under- and over-determined surrogate designs that we use to prove Theorems 1 and 2 in Section 4. On the other hand, Lemma 5 and the $n=d$ design can be found in the literature [vdV65, DWH19b], and we will rely on those known results in this special case. Importantly, the case $n=d$ offers a continuous transition between the under- and over-determined regimes, because the distribution $S_\mu^n$ converges to $S_\mu^d$ when $n$ approaches $d$ from above and from below. Another important property of the $n=d$ design is that it can be used to construct an over-determined design $S_\mu^n$ for any $n>d$. A similar version of this result was also previously shown by [DWH19b] for a different determinantal design.

Lemma 2.

Let $\bar{\mathbf{X}}\sim S_\mu^d$ and $\widetilde{\mathbf{X}}\sim\mu^K$, where $K\sim\operatorname{Poi}(n-d)$ for some $n>d$. Then the matrix composed of a random permutation of the rows from $\bar{\mathbf{X}}$ and $\widetilde{\mathbf{X}}$ is distributed according to $S_\mu^n$.

Proof.

Let $\widehat{\mathbf{X}}$ denote the matrix constructed from the permuted rows of $\bar{\mathbf{X}}$ and $\widetilde{\mathbf{X}}$. Letting $K=\#(\widehat{\mathbf{X}})$, we derive the expectation of $F(\widehat{\mathbf{X}})$ by summing over the possible index subsets that correspond to the rows coming from $\bar{\mathbf{X}}$:

where the starred step uses the Cauchy-Binet formula to sum over all subsets of size $d$. Finally, after shifting the summation index, the last expression can be rewritten so as to match the definition of $S_\mu^n$. ∎

We now return to the proof of Lemma 1, where we establish that the expected sample size of $S_\mu^n$ is indeed $n$.

Proof.

(of Lemma 1)  The result is obvious when $n=d$, whereas for $n>d$ it is an immediate consequence of Lemma 2. Finally, for $n<d$ the expected sample size follows as a corollary of a more general expectation formula proven in Section 4, which states that

$$\mathbb{E}\big[\bar{\mathbf{P}}\big]=\boldsymbol\Sigma(\boldsymbol\Sigma+\lambda\mathbf{I})^{-1},$$

where $\bar{\mathbf{P}}=\bar{\mathbf{X}}^+\bar{\mathbf{X}}$ is the orthogonal projection onto the subspace spanned by the rows of $\bar{\mathbf{X}}$. Since the rank of this subspace is equal to the number of the rows, we have $\#(\bar{\mathbf{X}})=\operatorname{tr}(\bar{\mathbf{P}})$, so

$$\mathbb{E}\big[\#(\bar{\mathbf{X}})\big]=\operatorname{tr}\big(\mathbb{E}[\bar{\mathbf{P}}]\big)=\operatorname{tr}\big(\boldsymbol\Sigma(\boldsymbol\Sigma+\lambda\mathbf{I})^{-1}\big)=n,$$

which completes the proof. ∎

3 Determinant preserving random matrices

In this section, we introduce the key tool for computing expectation formulas of matrix determinants. It is used in our analysis of the surrogate design, and it should be of independent interest.

The key question motivating the following definition is: when does taking expectation commute with computing a determinant for a square random matrix?

Definition 4.

A random matrix $\mathbf{A}$ is called determinant preserving (d.p.) if

$$\mathbb{E}\big[\det(\mathbf{A}_{\mathcal{I},\mathcal{J}})\big]=\det\big(\mathbb{E}[\mathbf{A}_{\mathcal{I},\mathcal{J}}]\big)\qquad\text{for all index subsets }\mathcal{I},\mathcal{J}\text{ of equal size}.$$

Note that from the definition of the adjugate matrix (see Section 2) it immediately follows that if $\mathbf{A}$ is determinant preserving then the adjugate commutes with expectation for this matrix:

$$\mathbb{E}\big[\operatorname{adj}(\mathbf{A})\big]=\operatorname{adj}\big(\mathbb{E}[\mathbf{A}]\big).\qquad(1)$$

We next give a few simple examples to provide some intuition. First, note that every $1\times 1$ random matrix is determinant preserving, simply because taking a determinant is an identity transformation in one dimension. Similarly, every fixed matrix is determinant preserving, because in this case taking the expectation is an identity transformation. In all other cases, however, Definition 4 has to be verified more carefully. Further examples (positive and negative) follow.

Example 1.

If $\mathbf{M}$ has i.i.d. Gaussian entries $m_{ij}\sim\mathcal{N}(0,1)$, then $\mathbf{M}$ is d.p. because $\mathbb{E}\big[\det(\mathbf{M}_{\mathcal{I},\mathcal{J}})\big]=0=\det\big(\mathbb{E}[\mathbf{M}_{\mathcal{I},\mathcal{J}}]\big)$ for every square submatrix.
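Example 1 can be checked numerically: for a matrix with i.i.d. standard Gaussian entries, the Monte Carlo average of the determinant of the matrix (or of any square submatrix) should be close to the determinant of the corresponding submatrix of $\mathbb{E}[\mathbf{M}]=\mathbf{0}$, i.e., close to zero. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
d, trials = 4, 100000
dets = np.empty(trials)
sub_dets = np.empty(trials)
for t in range(trials):
    M = rng.standard_normal((d, d))
    dets[t] = np.linalg.det(M)
    sub_dets[t] = np.linalg.det(M[:2, :2])    # a 2x2 submatrix
print(dets.mean(), sub_dets.mean())           # both close to det(E[M]) = 0
```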

In fact, it can be shown that all random matrices with independent entries are determinant preserving. However, this is not a necessary condition.

Example 2.

Let $\mathbf{M}=s\,\mathbf{Z}$, where $\mathbf{Z}$ is a fixed matrix with $\det(\mathbf{Z})\neq 0$, and $s$ is a scalar random variable. Then for any $k\times k$ submatrix we have $\mathbb{E}\big[\det(\mathbf{M}_{\mathcal{I},\mathcal{J}})\big]=\mathbb{E}[s^k]\det(\mathbf{Z}_{\mathcal{I},\mathcal{J}})$, so if $s$ is deterministic then $\mathbf{M}$ is determinant preserving, whereas if $d\ge 2$ and $\operatorname{Var}[s]>0$ then it is not.

To construct more complex examples, we show that determinant preserving random matrices are closed under addition and multiplication. The proof of this result is an extension of an argument given by [DM19] (Lemma 7) for computing the expected determinant of the sum of rank-1 random matrices.

Lemma 3.

If $\mathbf{A}$ and $\mathbf{B}$ are independent and d.p., then $\mathbf{A}+\mathbf{B}$ and $\mathbf{A}\mathbf{B}$ are also determinant preserving.

Proof.

First, we show that $\mathbf{A}+\mathbf{C}$ is d.p. for any fixed matrix $\mathbf{C}$. Below, we use a standard identity for the rank one update of a determinant: $\det(\mathbf{M}+\mathbf{u}\mathbf{v}^\top)=\det(\mathbf{M})+\mathbf{v}^\top\operatorname{adj}(\mathbf{M})\,\mathbf{u}$. It follows that for any $\mathbf{u}$ and $\mathbf{v}$ of the same size,

$$\mathbb{E}\big[\det(\mathbf{A}+\mathbf{u}\mathbf{v}^\top)\big]=\mathbb{E}\big[\det(\mathbf{A})\big]+\mathbf{v}^\top\mathbb{E}\big[\operatorname{adj}(\mathbf{A})\big]\mathbf{u}\ \overset{(*)}{=}\ \det\big(\mathbb{E}[\mathbf{A}]\big)+\mathbf{v}^\top\operatorname{adj}\big(\mathbb{E}[\mathbf{A}]\big)\mathbf{u}=\det\big(\mathbb{E}[\mathbf{A}]+\mathbf{u}\mathbf{v}^\top\big),$$

where $(*)$ used (1), i.e., the fact that for d.p. matrices the adjugate commutes with expectation. Crucially, through the definition of the adjugate, this step implicitly relies on the assumption that all the square submatrices of $\mathbf{A}$ are also determinant preserving. Iterating this, we get that $\mathbf{A}+\mathbf{C}$ is d.p. for any fixed $\mathbf{C}$. We now show the same for $\mathbf{A}+\mathbf{B}$:

where $(*)$ uses the fact that after conditioning on $\mathbf{B}$ we can treat it as a fixed matrix. Next, we show that $\mathbf{A}\mathbf{B}$ is determinant preserving via the Cauchy-Binet formula:

where recall that $\mathbf{A}_{\mathcal{I}}$ denotes the submatrix of $\mathbf{A}$ consisting of its (entire) rows indexed by $\mathcal{I}$. ∎

Finally, we introduce another important class of d.p. matrices: a sum of i.i.d. rank-1 random matrices with the number of i.i.d. samples being a Poisson random variable. Our use of the Poisson distribution is crucial for the below result to hold. It is an extension of an expectation formula given by [Der19] for sampling from discrete distributions.

Lemma 4.

If $K\sim\operatorname{Poisson}(\gamma)$ and $\mathbf{A},\mathbf{B}\in\mathbb{R}^{K\times d}$ are random matrices whose rows are sampled as an i.i.d. sequence of joint pairs of random vectors $(\mathbf{a}_i,\mathbf{b}_i)$, then $\mathbf{A}^\top\mathbf{B}$ is d.p., and in particular,

$$\mathbb{E}\big[\det(\mathbf{A}^\top\mathbf{B})\big]=\det\big(\mathbb{E}[\mathbf{A}^\top\mathbf{B}]\big)=\gamma^d\,\det\big(\mathbb{E}[\mathbf{a}\mathbf{b}^\top]\big).$$

To prove the above result, we will use the following lemma, many variants of which appeared in the literature (e.g., [vdV65]). We use the one given by [DWH19a].

Lemma 5 ([DWH19a]).

If the rows of random matrices $\mathbf{A},\mathbf{B}\in\mathbb{R}^{k\times d}$ are sampled as an i.i.d. sequence of pairs of joint random vectors $(\mathbf{a}_i,\mathbf{b}_i)$, then

$$\mathbb{E}\big[\det(\mathbf{A}^\top\mathbf{B})\big]=k(k-1)\cdots(k-d+1)\,\det\big(\mathbb{E}[\mathbf{a}\mathbf{b}^\top]\big).\qquad(2)$$

Here, we use the following standard shorthand: $\mathbb{E}[\mathbf{a}\mathbf{b}^\top]$ denotes the expectation for a single pair of rows, i.e., $\mathbb{E}[\mathbf{a}_1\mathbf{b}_1^\top]$. Note that the above result almost looks like we are claiming that the matrix $\mathbf{A}^\top\mathbf{B}$ is d.p., but in fact it is not, because $\det\big(\mathbb{E}[\mathbf{A}^\top\mathbf{B}]\big)=k^d\det\big(\mathbb{E}[\mathbf{a}\mathbf{b}^\top]\big)$. The difference in those factors is precisely what we are going to correct with the Poisson random variable. We now present the proof of Lemma 4.

Proof.

(of Lemma 4)  Without loss of generality, it suffices to check Definition 4 with both index sets $\mathcal{I}$ and $\mathcal{J}$ equal to $[d]$. We first expand the expectation by conditioning on the value of $K$ (whose mean we denote by $\gamma$):

$$\mathbb{E}\big[\det(\mathbf{A}^\top\mathbf{B})\big]=\sum_{k\ge 0}\mathrm{e}^{-\gamma}\frac{\gamma^k}{k!}\,k(k-1)\cdots(k-d+1)\,\det\big(\mathbb{E}[\mathbf{a}\mathbf{b}^\top]\big).\qquad\text{(Lemma 5)}$$

Note that $\det\big(\mathbb{E}[\mathbf{a}\mathbf{b}^\top]\big)$ is independent of $k$. Thus we can rewrite the above expression as:

$$\det\big(\mathbb{E}[\mathbf{a}\mathbf{b}^\top]\big)\sum_{k\ge d}\mathrm{e}^{-\gamma}\frac{\gamma^{k}}{(k-d)!}=\gamma^d\det\big(\mathbb{E}[\mathbf{a}\mathbf{b}^\top]\big)=\det\big(\gamma\,\mathbb{E}[\mathbf{a}\mathbf{b}^\top]\big)=\det\big(\mathbb{E}[\mathbf{A}^\top\mathbf{B}]\big),$$

which concludes the proof. ∎
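The formula in Lemma 4 can be sanity-checked by Monte Carlo in a toy case: taking $\mathbf{b}_i=\mathbf{a}_i$ with standard Gaussian $\mathbf{a}_i$ gives $\mathbb{E}[\mathbf{a}\mathbf{b}^\top]=\mathbf{I}$, so the predicted value of $\mathbb{E}\big[\det(\mathbf{A}^\top\mathbf{B})\big]$ is $\gamma^d$. A small sketch (not part of the paper's analysis):

```python
import numpy as np

rng = np.random.default_rng(6)
d, gamma, trials = 2, 3.0, 200000
vals = np.empty(trials)
for t in range(trials):
    K = rng.poisson(gamma)
    A = rng.standard_normal((K, d))           # rows a_i; here we take b_i = a_i
    vals[t] = np.linalg.det(A.T @ A)          # the determinant is 0 whenever K < d
print(vals.mean(), gamma ** d)                # both close to gamma^d * det(I) = 9
```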

Finally, we use Lemma 4 combined with Lemma 3 to show the expectation formula needed for obtaining the normalization constant of the under-determined surrogate design (proven by setting $\mathbf{B}=\mathbf{A}$). Note that the result below is more general than the normalization constant requires, because it allows the matrices $\mathbf{A}$ and $\mathbf{B}$ to be different. In fact, we will use this more general statement later on in our analysis.

Lemma 6.

If $K\sim\operatorname{Poisson}(\gamma)$ and $\mathbf{A},\mathbf{B}\in\mathbb{R}^{K\times d}$ are random matrices whose rows are sampled as an i.i.d. sequence of joint pairs of random vectors $(\mathbf{a}_i,\mathbf{b}_i)$, then

$$\mathbb{E}\big[\det(\mathbf{A}\mathbf{B}^\top)\big]=\mathrm{e}^{-\gamma}\det\big(\mathbf{I}+\gamma\,\mathbb{E}[\mathbf{a}\mathbf{b}^\top]\big).$$

Proof.

By Lemma 4, the matrix $\mathbf{A}^\top\mathbf{B}$ is determinant preserving. Applying Lemma 3, we conclude that $\mathbf{I}+\mathbf{A}^\top\mathbf{B}$ is also d.p., so

where the second equality is known as Sylvester's theorem. We now use the following standard determinantal formula.

Lemma 7 ([KT12]).

For any matrices $\mathbf{A},\mathbf{B}$ with the same number of rows, we have $\det(\mathbf{I}+\mathbf{A}\mathbf{B}^\top)=\sum_{\mathcal{S}}\det\big(\mathbf{A}_{\mathcal{S}}\mathbf{B}_{\mathcal{S}}^\top\big)$, where the sum ranges over all row index subsets $\mathcal{S}$.

We rewrite the expectation of $\det(\mathbf{I}+\mathbf{A}\mathbf{B}^\top)$ by applying the lemma. Letting $h(k)$ denote the expectation of $\det(\mathbf{A}_{\mathcal{S}}\mathbf{B}_{\mathcal{S}}^\top)$ for a subset of fixed size $k$, we obtain:

where the starred step follows from the exchangeability of the rows of $\mathbf{A}$ and $\mathbf{B}$, which implies that the distribution of $\det(\mathbf{A}_{\mathcal{S}}\mathbf{B}_{\mathcal{S}}^\top)$ is the same for all subsets $\mathcal{S}$ of a fixed size $k$. ∎

4 Expectation formulas for surrogate designs

In this section we prove a number of expectation formulas for determinantal surrogate designs, which we then use to prove Theorems 1 and 2. In the process, we derive closed form expressions for $\mathbb{E}\big[\bar{\mathbf{P}}\big]$, i.e., the expectation of the orthogonal projection onto the subspace spanned by the rows of $\bar{\mathbf{X}}$, and for $\mathbb{E}\big[\operatorname{tr}\big((\bar{\mathbf{X}}^\top\bar{\mathbf{X}})^+\big)\big]$, the trace of the pseudo-inverse of the sample covariance matrix. To our knowledge, neither of these quantities admits a closed form expression for standard i.i.d. random designs such as a Gaussian with general covariance (except for the isotropic case).

4.1 Proof of Theorem 1

Let $\bar{\mathbf{y}}$ follow the homoscedastic noise model with variance $\sigma^2$ (Assumption 1). Recall that we have $\bar{\mathbf{X}}\sim S_\mu^n$ and $\bar{\mathbf{y}}=\bar{\mathbf{X}}\mathbf{w}^*+\bar{\boldsymbol\xi}$, where $\bar{\boldsymbol\xi}$ denotes the noise vector. A standard decomposition of the MSE of the Moore-Penrose estimator proceeds as follows:

$$\mathrm{MSE}\big[\bar{\mathbf{X}}^+\bar{\mathbf{y}}\big]=\mathbb{E}\Big[\big\|\bar{\mathbf{X}}^+\bar{\boldsymbol\xi}\big\|^2\Big]+\mathbb{E}\Big[\big\|(\mathbf{I}-\bar{\mathbf{X}}^+\bar{\mathbf{X}})\mathbf{w}^*\big\|^2\Big]=\sigma^2\,\mathbb{E}\Big[\operatorname{tr}\big((\bar{\mathbf{X}}^\top\bar{\mathbf{X}})^+\big)\Big]+\mathbb{E}\Big[\big\|(\mathbf{I}-\bar{\mathbf{P}})\mathbf{w}^*\big\|^2\Big].$$

Thus, our task is to find closed form expressions for the two expectations above. If $n\ge d$, then the latter goes away, because when $\bar{\mathbf{X}}$ has full column rank then $\bar{\mathbf{X}}^+\bar{\mathbf{X}}=\mathbf{I}$. When $n<d$, this expectation is given in the following result.

Lemma 8.

If $n<d$ and $\bar{\mathbf{X}}\sim S_\mu^n$, then we have

$$\mathbb{E}\Big[\big\|(\mathbf{I}-\bar{\mathbf{X}}^+\bar{\mathbf{X}})\mathbf{w}^*\big\|^2\Big]=\lambda\,\mathbf{w}^{*\top}(\boldsymbol\Sigma+\lambda\mathbf{I})^{-1}\mathbf{w}^*.$$

The proof of Lemma 8 is deferred to Section 4.2, because it follows as a corollary of a more general result (Lemma 11). We next derive the remaining expectation needed to compute the MSE. The under- and over-determined cases are proven separately, starting with the former.

Lemma 9.

If $\bar{\mathbf{X}}\sim S_\mu^n$ for $n<d$, then we have

Proof.

Let $\bar{\mathbf{X}}\sim S_\mu^n$ for $n<d$. Note that if $\bar{\mathbf{X}}\bar{\mathbf{X}}^\top$ is invertible then, using the fact that

$$\operatorname{adj}(\mathbf{A})=\det(\mathbf{A})\,\mathbf{A}^{-1}$$

for any invertible matrix $\mathbf{A}$, we can write:

where we use the shorthand from Remark 2 for the normalization constant. Assumption 2 ensures that $\bar{\mathbf{X}}\bar{\mathbf{X}}^\top$ is indeed invertible almost surely, which allows us to write: