The Implicit Regularization of Ordinary Least Squares Ensembles

Ensemble methods that average over a collection of independent predictors that are each limited to a subsampling of both the examples and features of the training data command a significant presence in machine learning, such as the ever-popular random forest, yet the nature of the subsampling effect, particularly of the features, is not well understood. We study the case of an ensemble of linear predictors, where each individual predictor is fit using ordinary least squares on a random submatrix of the data matrix. We show that, under standard Gaussianity assumptions, when the number of features selected for each predictor is optimally tuned, the asymptotic risk of a large ensemble is equal to the asymptotic ridge regression risk, which is known to be optimal among linear predictors in this setting. In addition to eliciting this implicit regularization that results from subsampling, we also connect this ensemble to the dropout technique used in training deep (neural) networks, another strategy that has been shown to have a ridge-like regularizing effect.

1 Introduction

Ensemble methods (Breiman, 1996; Amit and Geman, 1997; Josse and Wager, 2016) are an oft-used strategy, successful in a broad range of problems in machine learning and statistics, in which one combines a number of weak predictors to obtain one powerful predictor. This is accomplished by giving each weak learner a different view of the training data. Various strategies for changing this training data view exist, among which many are simple sampling-based techniques in which each predictor is (independently) given access to a subsampling of the rows (examples) and columns (features) of the training data matrix, such as bagging (Breiman, 1996; Bühlmann and Yu, 2002). Another noteworthy technique is boosting (Freund and Schapire, 1997; Breiman, 1998), in which the training data examples are reweighted adaptively according to how badly they have been misclassified while building the ensemble. In this work, we consider the former class of techniques: those that train each weak predictor using an independent subsampling of the training data.

Ensemble methods based on independent example and feature subsampling are attractive for two reasons. First, they are computationally appealing: they are massively parallelizable, and since each member of the ensemble uses only part of the data, they can overcome memory limitations faced by other methods (Louppe and Geurts, 2012). Second, ensemble methods are known to achieve lower risk, since combining several different predictors reduces variance (Bühlmann and Yu, 2002; Wager et al., 2014; Scornet et al., 2015), and empirically they have been found to perform very well. Random forests (Breiman, 2001; Athey et al., 2019; Friedberg et al., 2018), for example, ensemble methods that combine example and feature subsampling with shallow decision trees, remain among the best-performing off-the-shelf machine learning methods available (Cutler and Zhao, 2001; Fernández-Delgado et al., 2014; Wyner et al., 2017).

Let $\mX \in \mathbb{R}^{n \times p}$ be the training data matrix consisting of $n$ examples of data points each having $p$ features. While there exist theoretical results on the benefits of example (row) subsampling (Bühlmann and Yu, 2002), the exact nature of the effect of feature (column) subsampling on ensemble performance remains poorly understood. In this paper, we study the prototypical form of this problem in the context of linear regression. That is, given the data matrix $\mX$ and target variables $\vy$, we study the ensemble $\hat{\bbeta}_{\mathrm{ens}}$, where each member $\hat{\bbeta}^{(i)}$ is learned using ordinary least squares on an independent random subsampling of both the examples and features of the training data. This subsampling is illustrated in Figure 1. We show that under such a scheme, the resulting predictor of this ensemble performs as well as the ridge regression (Hoerl and Kennard, 1970; Friedman et al., 2001) predictor fit using the entire training data, which is known to be the optimal linear predictor under the data assumptions that we consider. Further, the asymptotic risk of the ensemble depends only on the amount of feature subsampling and not on the amount of example subsampling, provided no individual ordinary least squares problem is underdetermined. Our main result in Theorem 3.6, made possible by the recent result on the asymptotic risk of ridge regression by Dobriban and Wager (2018), can be summarized as follows:
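As a concrete illustration of the scheme (our own sketch, not the authors' code; the dimensions, subsampling fractions, and noise level below are arbitrary), the following NumPy snippet fits such an ensemble and evaluates its parameter estimation error under the identity-covariance model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 100, 50      # examples, features, ensemble size
alpha, eta = 0.3, 1.0       # feature / example subsampling fractions
sigma = 0.5                 # noise level

beta = rng.normal(0, 1 / np.sqrt(p), p)        # i.i.d. Gaussian weights
X = rng.normal(size=(n, p))                    # i.i.d. Gaussian features
y = X @ beta + sigma * rng.normal(size=n)

beta_ens = np.zeros(p)
for _ in range(k):
    S = rng.choice(p, size=int(alpha * p), replace=False)  # feature subset
    T = rng.choice(n, size=int(eta * n), replace=False)    # example subset
    b = np.zeros(p)
    # OLS fit on the subsampled submatrix; coordinates outside S stay zero
    b[S] = np.linalg.pinv(X[np.ix_(T, S)]) @ y[T]
    beta_ens += b / k

risk = np.sum((beta - beta_ens) ** 2)   # risk under identity covariance
```

With these (arbitrary) settings the ensemble should attain substantially lower risk than the null predictor, even though each member sees only a fraction of the features.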

Theorem 3.6 (informal statement).

When the features and underlying model weights both follow i.i.d. Gaussian distributions, the optimal asymptotic risk for an ensemble of ordinary least squares predictors is equal to the optimal asymptotic ridge regression risk.

We can interpret this result as an example of implicit regularization (Neyshabur et al., 2014; Gunasekar et al., 2017; Arora et al., 2019). That is, while the individual ordinary least squares subproblems are completely unregularized, the ensemble behaves as if it had been regularized using a ridge regression penalty. Recently, there has been much interest in investigating the implicit regularization effects of commonly used heuristic methods, particularly in cases where they enable the training of highly overparameterized models that generalize well to test data despite having the capacity to overfit the training data (Zhang et al., 2017; Belkin et al., 2018). Examples of heuristic techniques that have been shown to have implicit regularization effects include stochastic gradient descent (Hardt et al., 2016) and dropout (Srivastava et al., 2014). Incidentally, we show below a strong connection between the ensemble of ordinary least squares predictors and dropout, which is known to have a ridge-like regularizing effect (Wager et al., 2013), and this link is made via stochastic gradient descent.

Contributions

We summarize our contributions as follows:

1. We prove that when the amount of feature subsampling is optimized to minimize risk, an ensemble of ordinary least squares predictors achieves the same risk as the optimal ridge regression predictor asymptotically as $n, p \to \infty$ (see Section 3).

2. We demonstrate the convergence of the ensemble risk to the optimal ridge regression risk via simulation (see Section 4.1).

3. We reveal a connection between the ordinary least squares ensemble and the popular dropout technique used in deep (neural) network training (see Section 4.3) and, from the insight gained from this connection, develop a recipe for mitigating excess risk under suboptimal feature subsampling via simple output scaling (see Section 4.4).

2 Ensembles of Ordinary Least Squares Predictors

We consider the familiar setting of linear regression, where there exists a linear relationship between the target variable $y$ and the feature variables $\vx \in \mathbb{R}^p$, i.e., $y = \langle \vx, \bbeta \rangle$, where $\bbeta \in \mathbb{R}^p$ is the model parameter vector. The goal of a machine learning algorithm is to estimate these parameters given $n$ i.i.d. noisy samples $(\vx_1, y_1), \ldots, (\vx_n, y_n)$. The noise relationship is given by

 \vy=\mX\bbeta+σ\vz, (1)

where $\mX \in \mathbb{R}^{n \times p}$ is the matrix whose rows are the $\vx_i^\top$, $\vy \in \mathbb{R}^n$ is the vector of targets, and $\vz \in \mathbb{R}^n$, where the entries of $\vz$ are i.i.d. zero-mean random variables with unit variance independent of $\mX$. We assume a Gaussian distribution on the rows of $\mX$.

Our ensemble consists of $k$ linear predictors, each fit using ordinary least squares on a submatrix of $\mX$, and the resulting prediction is the average of the $k$ outputs. Equivalently, our ensemble is defined by its estimate of the parameters

 \hat{\bbeta}_{\mathrm{ens}} \defeq \frac{1}{k} \sum_{i=1}^{k} \hat{\bbeta}^{(i)}, (2)

where $\hat{\bbeta}^{(i)}$ is the parameter estimate of the $i$-th member of the ensemble. To characterize the estimates $\hat{\bbeta}^{(i)}$, we first introduce some notation. Let the selection matrix corresponding to a subset of indices $S \subseteq \{1, \ldots, p\}$ denote the matrix obtained by selecting from $\mI_p$ the columns corresponding to the indices in $S$, where $\mI_p$ denotes the $p \times p$ identity matrix. This definition of selection matrices also applies analogously to subsets of $\{1, \ldots, n\}$. Returning to the ensemble, let $S = \{S_i\}_{i=1}^{k}$ and $T = \{T_i\}_{i=1}^{k}$ denote the collections of feature subsets and example subsets, respectively, where each $S_i \subseteq \{1, \ldots, p\}$ and each $T_i \subseteq \{1, \ldots, n\}$, and let $\mS_i$ and $\mT_i$ denote the corresponding selection matrices. Then, assuming $|T_i| \geq |S_i|$, for each member of the ensemble we let

 \hat{\bbeta}^{(i)}_{S_i} = \argmin_{\bbeta'} \norm{\mT_i^\top (\mX \mS_i \bbeta' - \vy)}_2, (3)
 \hat{\bbeta}^{(i)}_{S_i^c} = \boldsymbol{0}, (4)

where $S_i^c$ denotes the complement of the set $S_i$. This can alternatively be written in closed form as

 \hat{\bbeta}^{(i)} = \mS_i (\mT_i^\top \mX \mS_i)^\dagger \mT_i^\top \vy, (5)

where $(\cdot)^\dagger$ denotes the Moore–Penrose pseudoinverse. Thus, the closed-form expression for the ensemble parameter estimate is given by

 \hat{\bbeta}_{\mathrm{ens}} = \frac{1}{k} \sum_{i=1}^{k} \mS_i (\mT_i^\top \mX \mS_i)^\dagger \mT_i^\top \vy. (6)
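As a numerical sanity check of the closed form (our own illustration, with arbitrary small dimensions), the selection-matrix expression for a single member can be verified against a direct least-squares fit on the submatrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 20
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

S_idx = np.sort(rng.choice(p, size=8, replace=False))   # feature subset S_i
T_idx = np.sort(rng.choice(n, size=30, replace=False))  # example subset T_i
S = np.eye(p)[:, S_idx]   # p x |S_i| selection matrix
T = np.eye(n)[:, T_idx]   # n x |T_i| selection matrix

# Closed form (5): beta^(i) = S_i (T_i^T X S_i)^+ T_i^T y
beta_closed = S @ np.linalg.pinv(T.T @ X @ S) @ (T.T @ y)

# Direct OLS on the submatrix, zeros elsewhere, as in (3)-(4)
beta_direct = np.zeros(p)
beta_direct[S_idx] = np.linalg.lstsq(X[np.ix_(T_idx, S_idx)], y[T_idx], rcond=None)[0]
```

The two routes agree: the selection matrices simply embed the submatrix solution back into the full coordinate system.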

3 Ensemble Risk

We define the risk of a linear predictor $\bbeta'$ as the expected squared error of a prediction of the target variable on an independent data point $\vx$:

 R(\bbeta') = \langle \bbeta - \bbeta', \bSigma (\bbeta - \bbeta') \rangle. (7)

For any predictor of the form $\bbeta' = f(\mX)\vy$, for some $f : \mathbb{R}^{n \times p} \to \mathbb{R}^{p \times n}$, we can rewrite the parameter estimation error as

 \bbeta - \bbeta' = (\mI_p - f(\mX)\mX)\bbeta - \sigma f(\mX)\vz. (8)

Then by the independence of $\vz$ and $\mX$ and some algebra, we can decompose the risk into the so-called “bias” and “variance” components

 R(\bbeta') = \underbrace{\langle \bbeta\bbeta^\top, (\mI_p - f(\mX)\mX)^\top \bSigma (\mI_p - f(\mX)\mX) \rangle}_{\mathrm{bias}(\bbeta')} + \underbrace{\sigma^2 \langle f(\mX), \bSigma f(\mX) \rangle}_{\mathrm{variance}(\bbeta')}. (9)

For the ensemble, we obtain for the bias

 \mathrm{bias}(\hat{\bbeta}_{\mathrm{ens}}) = \frac{1}{k^2} \sum_{i,j=1}^{k} \mathrm{bias}_{ij}(\hat{\bbeta}_{\mathrm{ens}}), (10)

where

 \mathrm{bias}_{ij}(\hat{\bbeta}_{\mathrm{ens}}) = \langle \bbeta\bbeta^\top, (\mI_p - \mS_i (\mT_i^\top \mX \mS_i)^\dagger \mT_i^\top \mX)^\top \bSigma (\mI_p - \mS_j (\mT_j^\top \mX \mS_j)^\dagger \mT_j^\top \mX) \rangle. (11)

Similarly, for the variance we have

 \mathrm{variance}(\hat{\bbeta}_{\mathrm{ens}}) = \frac{1}{k^2} \sum_{i,j=1}^{k} \mathrm{variance}_{ij}(\hat{\bbeta}_{\mathrm{ens}}), (12)

where

 \mathrm{variance}_{ij}(\hat{\bbeta}_{\mathrm{ens}}) = \sigma^2 \langle \mS_i (\mT_i^\top \mX \mS_i)^\dagger \mT_i^\top, \bSigma \mS_j (\mT_j^\top \mX \mS_j)^\dagger \mT_j^\top \rangle. (13)

Thus, evaluating the risk of the ensemble is a matter of evaluating these pairwise interaction terms.

To begin evaluating the above terms, we need to introduce additional assumptions. Specifically, we assume that the subsets are independent and that all indices are equally likely to be included in each subset.

Assumption 3.1 (finite subsampling).

The subsets in the collections $S$ and $T$ are selected at random such that $|T_i| \geq |S_i|$ for all $i$ and that the following hold:

• each index in $\{1, \ldots, p\}$ is equally likely to be included in $S_i$, for all $i$,

• each index in $\{1, \ldots, n\}$ is equally likely to be included in $T_i$, for all $i$,

• the subsets are conditionally independent given the subset sizes.

A simple sampling strategy that satisfies these assumptions is to fix sizes $s$ and $t$ such that $t \geq s$ and select the subsets $S_i$ and $T_i$ uniformly at random of the given sizes. Another strategy is to construct the subsets by flipping a coin for each index, rejecting any resulting subsets that fail to satisfy $|T_i| \geq |S_i|$.
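The two sampling strategies can be sketched as follows (a hypothetical implementation; the function names and the rejection loop are our own):

```python
import numpy as np

def fixed_size_subsets(rng, p, n, alpha, eta):
    """Fixed-size strategy: draw |S_i| = floor(alpha * p) features and
    |T_i| = floor(eta * n) examples uniformly at random."""
    s, t = int(alpha * p), int(eta * n)
    assert t >= s, "each OLS subproblem must not be underdetermined"
    return rng.choice(p, size=s, replace=False), rng.choice(n, size=t, replace=False)

def coin_flip_subsets(rng, p, n, alpha, eta):
    """Coin-flipping strategy: include each index independently, rejecting
    any draw where the example subset is smaller than the feature subset."""
    while True:
        S = np.flatnonzero(rng.random(p) < alpha)
        T = np.flatnonzero(rng.random(n) < eta)
        if len(T) >= len(S):
            return S, T
```

Both strategies give every index the same marginal inclusion probability, which is all that the assumption requires.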

With Assumption 3.1, we are now equipped to evaluate the pairwise interaction terms. For simplicity, we will also assume identity covariance on the data, $\bSigma = \mI_p$. The following two lemmas enable us to characterize the bias and variance components of the risk in the finite-dimensional setting. The proofs of these lemmas are exercises in linear algebra and conditional expectations and can be found in the Appendix.

With some slight abuse of notation, we allow $\E_{\mX, S, T}$ to denote the expectation taken with respect to the choice of indices in the subsets, but not their sizes. In other words, $\E_{\mX, S, T}$ indicates the conditional expectation over $S$ and $T$, conditioned on the subset sizes indicated by the context.

Lemma 3.2 (bias).

Assume that $\bSigma = \mI_p$ and that Assumption 3.1 holds. Then for all $i, j$,

 \E_{\mX, S, T}[\mathrm{bias}_{ij}(\hat{\bbeta}_{\mathrm{ens}})] = \begin{cases} \frac{|S_i^c \cap S_j^c|}{p} \left(1 + \frac{|S_i \cap S_j|}{n - |S_i \cap S_j| - 1}\right) \norm{\bbeta}_2^2 & \text{if } i \neq j, \\ \frac{|S_i^c|}{p} \left(1 + \frac{|S_i|}{|T_i| - |S_i| - 1}\right) \norm{\bbeta}_2^2 & \text{if } i = j. \end{cases} (14)
Lemma 3.3 (variance).

Assume that $\bSigma = \mI_p$ and that Assumption 3.1 holds. Then

 \E_{\mX, S, T}[\mathrm{variance}_{ij}(\hat{\bbeta}_{\mathrm{ens}})] = \begin{cases} \sigma^2 \frac{|S_i \cap S_j|}{n - |S_i \cap S_j| - 1} & \text{if } i \neq j, \\ \sigma^2 \frac{|S_i|}{|T_i| - |S_i| - 1} & \text{if } i = j. \end{cases} (15)

One observation that we can make already from these results is that the example subsampling only affects the terms where $i = j$. Assuming that the subsampling procedure is the same for each $i$, so that for large $k$ the $i \neq j$ terms are sure to dominate the sum, this means that in the limit as $k \to \infty$, the effects of example subsampling are non-existent. We note that this is a result of the assumption that $|T_i| \geq |S_i|$, and that if we were to have $|T_i| < |S_i|$, then we would observe effects of example subsampling when $i \neq j$, which we discuss further in Section 5.2.

We now turn our attention to the setting where $n, p \to \infty$ in order to better reason about the results contained in these lemmas. We introduce the following additional assumption.

Assumption 3.4 (asymptotic subsampling).

For some $\alpha, \eta \in (0, 1]$, the subsets in the collections $S$ and $T$ are selected randomly such that $|S_i|/p \to \alpha$ and $|T_i|/n \to \eta$ almost surely as $n, p \to \infty$ for all $i$.

This assumption is easily satisfied. For example, in the sampling strategy where we fix the subset sizes $s$ and $t$, we can choose $s = \lceil \alpha p \rceil$ and $t = \lceil \eta n \rceil$. For the coin-flipping strategy, we can select feature subsets with a coin of probability $\alpha$ and example subsets with a coin of probability $\eta$.

Under this assumption, and additionally assuming without loss of generality that $\norm{\bbeta}_2 = 1$, if $n, p \to \infty$ such that $p/n \to \gamma$ and $\alpha \gamma < \eta$, the quantities in (14) and (15) converge almost surely as follows:

 \E_{\mX, S, T}[\mathrm{bias}_{ij}(\hat{\bbeta}_{\mathrm{ens}})] \xrightarrow{\text{a.s.}} \begin{cases} (1 - \alpha)^2 \left(1 + \frac{\alpha^2 \gamma}{1 - \alpha^2 \gamma}\right) & \text{if } i \neq j, \\ (1 - \alpha) \left(1 + \frac{\alpha \gamma}{\eta - \alpha \gamma}\right) & \text{if } i = j, \end{cases} (16)

and

 \E_{\mX, S, T}[\mathrm{variance}_{ij}(\hat{\bbeta}_{\mathrm{ens}})] \xrightarrow{\text{a.s.}} \begin{cases} \sigma^2 \frac{\alpha^2 \gamma}{1 - \alpha^2 \gamma} & \text{if } i \neq j, \\ \sigma^2 \frac{\alpha \gamma}{\eta - \alpha \gamma} & \text{if } i = j. \end{cases} (17)

We are now equipped to state our asymptotic risk result for the ensemble of ordinary least squares predictors. Denote for an ensemble satisfying Assumptions 3.1 and 3.4 with parameters $\alpha$, $\eta$, and $k$ the limiting risk

 R^{\mathrm{ens}}_{\alpha, \eta, k} \defeq \lim_{n, p \to \infty} \E_{\mX, \vz, S, T}[R(\hat{\bbeta}_{\mathrm{ens}})]. (18)

From (10) and (12), we know that both the bias and variance components of the limiting risk are the averages of $k^2$ terms, and from (16) and (17), we know that the $k^2 - k$ terms where $i \neq j$ will take one value and the remaining $k$ terms where $i = j$ will take another. Thus we have the limiting bias

 \lim_{n, p \to \infty} \E_{\mX, \vz, S, T}[\mathrm{bias}(\hat{\bbeta}_{\mathrm{ens}})] = \frac{k - 1}{k} \left(\frac{(1 - \alpha)^2}{1 - \alpha^2 \gamma}\right) + \frac{1}{k} \left(\frac{\eta (1 - \alpha)}{\eta - \alpha \gamma}\right) (19)

and limiting variance

 \lim_{n, p \to \infty} \E_{\mX, \vz, S, T}[\mathrm{variance}(\hat{\bbeta}_{\mathrm{ens}})] = \frac{k - 1}{k} \left(\frac{\sigma^2 \alpha^2 \gamma}{1 - \alpha^2 \gamma}\right) + \frac{1}{k} \left(\frac{\sigma^2 \alpha \gamma}{\eta - \alpha \gamma}\right). (20)

Upon careful examination of these quantities, we observe that in fact both the limiting bias and the limiting variance are decreasing in $k$, and thus the ensemble serves not only as a means to reduce variance (as is well understood), but also to reduce bias. We defer further discussion to Section 4.2. Adding the limiting bias and variance yields the following result.

Theorem 3.5 (limiting risk).

Assume that $\bSigma = \mI_p$ and $\norm{\bbeta}_2 = 1$ and that Assumptions 3.1 and 3.4 hold. Then in the limit as $n, p \to \infty$ with $p/n \to \gamma$, for $\alpha \gamma < \eta$, we have almost surely that

 R^{\mathrm{ens}}_{\alpha, \eta, k} = \frac{k - 1}{k} \left(\frac{(1 - \alpha)^2 + \sigma^2 \alpha^2 \gamma}{1 - \alpha^2 \gamma}\right) + \frac{1}{k} \left(\frac{\eta (1 - \alpha) + \sigma^2 \alpha \gamma}{\eta - \alpha \gamma}\right). (21)

Here we see again more explicitly that for large $k$, the effect of example subsampling vanishes. This leaves us with the large-ensemble risk

 R^{\mathrm{ens}}_{\alpha} \defeq \lim_{k \to \infty} R^{\mathrm{ens}}_{\alpha, \eta, k} = \frac{(1 - \alpha)^2 + \sigma^2 \alpha^2 \gamma}{1 - \alpha^2 \gamma}. (22)
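Since the large-ensemble risk is a simple closed-form function of $\alpha$, its minimizer can be found numerically. The sketch below (our own; the values of $\gamma$ and $\sigma^2$ are arbitrary) grid-searches the risk over $\alpha < \gamma^{-1}$ and numerically confirms the relation $R^{\mathrm{ens}}_{\alpha^*} = 1 - \alpha^*$ given in Corollary 3.7:

```python
import numpy as np

def large_ensemble_risk(alpha, gamma, sigma2):
    """Large-ensemble limiting risk (22), assuming Sigma = I_p and ||beta|| = 1."""
    return ((1 - alpha) ** 2 + sigma2 * alpha ** 2 * gamma) / (1 - alpha ** 2 * gamma)

gamma, sigma2 = 2.0, 1.0                 # arbitrary overparameterized setting
alphas = np.linspace(1e-4, 1 / gamma - 1e-4, 100_000)   # alpha < 1/gamma
risks = large_ensemble_risk(alphas, gamma, sigma2)
alpha_star = alphas[np.argmin(risks)]

# At the optimal subsampling fraction, the risk equals 1 - alpha*
print(alpha_star, risks.min())
```

With $\gamma = 2$ and $\sigma^2 = 1$ the minimizer lands near $\alpha^* \approx 0.219$, and the minimum risk matches $1 - \alpha^*$ to grid precision.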

We note that while the large-ensemble risk depends only upon $\alpha$, we cannot realize this risk with an ensemble if $\eta \leq \alpha \gamma$. Our remaining results concern the large-ensemble risk and therefore assume $\eta = 1$ for simplicity, but we caution the reader that some of these results may not be valid for some smaller values of $\eta$, depending on $\alpha$ and $\gamma$.

Because $\alpha$ is an algorithmic hyperparameter, it can be tuned to minimize the risk. If we do so, then what we obtain is the perhaps surprising result that the optimal large-ensemble risk of the ordinary least squares ensemble is equal to the limiting risk of the optimally tuned ridge regression predictor under our assumptions. The ridge regression predictor with parameter $\lambda$ is defined as

 \hat{\bbeta}^{\mathrm{ridge}}_{\lambda} \defeq \argmin_{\bbeta'} \norm{\mX \bbeta' - \vy}_2^2 + \lambda \norm{\bbeta'}_2^2 = (\mX^\top \mX + \lambda \mI_p)^{-1} \mX^\top \vy. (23)

We formally state this result in the following theorem. This result leverages the recent analysis of the limiting risk of ridge regression by Dobriban and Wager (2018). The proof is found in the Appendix.

Theorem 3.6.

Assume that $\bSigma = \mI_p$ and that $\bbeta$ has i.i.d. zero-mean Gaussian entries with $\norm{\bbeta}_2^2 \to 1$, and that Assumptions 3.1 and 3.4 hold with $\eta = 1$. Then in the limit as $n, p \to \infty$ with $p/n \to \gamma$, we have almost surely that

 \inf_{\alpha < \gamma^{-1}} R^{\mathrm{ens}}_{\alpha} = \inf_{\lambda} R(\hat{\bbeta}^{\mathrm{ridge}}_{\lambda}). (24)

A curious result obtained during the proof of this theorem is the following corollary relating the optimal large-ensemble risk to the optimal choice of the hyperparameter $\alpha$.

Corollary 3.7.

Assume that $\bSigma = \mI_p$ and that $\bbeta$ has i.i.d. zero-mean Gaussian entries with $\norm{\bbeta}_2^2 \to 1$, and that Assumptions 3.1 and 3.4 hold with $\eta = 1$. Then in the limit as $n, p \to \infty$ with $p/n \to \gamma$, we have almost surely that

 R^{\mathrm{ens}}_{\alpha^*} = 1 - \alpha^*, (25)

where $\alpha^* = \argmin_{\alpha < \gamma^{-1}} R^{\mathrm{ens}}_{\alpha}$.

The implication of Theorem 3.6 is quite strong. Under the assumption of the theorem that the true parameters have a Gaussian distribution with covariance $\frac{1}{p} \mI_p$, the ridge regression predictor is the predictor with the lowest expected risk of all predictors of the form $\bbeta' = f(\mX)\vy$. To see this, note that if we take the expectation of (9) with respect to $\bbeta$, we find that the optimal $f$ must satisfy the first-order optimality condition

 \bSigma f(\mX) (\mX \mX^\top + p \sigma^2 \mI_n) = \bSigma \mX^\top, (26)

which for invertible $\bSigma$ yields the optimally tuned ridge regression predictor. Thus, in the $\bSigma = \mI_p$ setting, the optimally tuned ensemble achieves the optimal risk for any linear predictor.

4 Discussion

4.1 Convergence

In practice, any ensemble will have only a finite number of members. Therefore, it is important to understand the rate at which the risk of the ensemble converges to the large-ensemble risk in (22). From Theorem 3.5, it is clear that as a function of $k$, the limiting risk converges to the large-ensemble risk at a rate of $1/k$. However, as the choice of $\alpha$ approaches $\eta / \gamma$, this rate becomes slower. In Figure 2, we plot the convergence in $k$ of the limiting risk to the large-ensemble risk for $\eta = 1$ (using all examples) and for $\eta$ near $\alpha \gamma$ (near to as small as possible while still having $\eta > \alpha \gamma$). We plot these curves for three different values of $\gamma$, with $n$ and $p$ large enough to realize the convergence in $n$ and $p$. We choose $\alpha = \alpha^*$, the minimizer of the large-ensemble risk. What we observe is that, indeed, for both choices of $\eta$, the risks converge to the optimal ridge risk. As expected, however, with the smaller choice of $\eta$ the risk converges nearly an order of magnitude more slowly.

While the choice $\alpha = \alpha^*$ will result in optimal risk for large enough ensembles, for finite $k$ this choice can in some cases be undesirable. For instance, consider the setting where $\sigma^2 = 0$, $\eta = 1$, and $\gamma > 1$. Then $\alpha^* \to \gamma^{-1}$, so that $\eta - \alpha^* \gamma \to 0$. This obviously yields the optimal large-ensemble risk, by definition, but for any finite $k$, the limiting risk tends to infinity for this choice of $\alpha$. However, if we know what the size of our ensemble will be, we can tune $\alpha$ to minimize the limiting risk for finite $k$ instead of the large-ensemble risk. In general, this means choosing an $\alpha$ smaller than $\alpha^*$. In Figure 3, we demonstrate the convergence in $k$ to the large-ensemble risk as a function of $\alpha$, both for $\alpha$ fixed at $\alpha^*$ and for $\alpha$ adapted to the ensemble size $k$. While for both choices of $\alpha$ we see convergence in $k$, as $\alpha \to \alpha^*$, the risk is very large for small $k$. For $\alpha$ adapted to the choice of $k$, however, this effect is mitigated.

4.2 Bias and Variance Decrease with Ensemble Size

We return here to the observation made in Section 3 that the limiting bias and variance are both decreasing in $k$. This can be seen by comparing the $i \neq j$ and $i = j$ terms in each case. In the case of bias, for the bias to be decreasing, it must be that

 \frac{(1 - \alpha)^2}{1 - \alpha^2 \gamma} < \frac{\eta (1 - \alpha)}{\eta - \alpha \gamma}. (27)

Since $\alpha \leq 1$ and $\alpha \gamma < \eta$, after some algebra, this reduces to

 \gamma (\alpha - 1) < \eta (1 - \alpha \gamma). (28)

Because $\alpha \leq 1$, the left-hand side is non-positive, and since $\alpha \gamma < \eta \leq 1$, the right-hand side is strictly positive. Thus this inequality always holds, and the bias is decreasing.

In the case of variance, for the variance to be decreasing, we must have

 \frac{\alpha^2 \gamma}{1 - \alpha^2 \gamma} < \frac{\alpha \gamma}{\eta - \alpha \gamma}. (29)

Again since $\alpha \leq 1$ and $\alpha \gamma < \eta$, this reduces to

 \alpha \eta < 1. (30)

So, unless both $\alpha = 1$ and $\eta = 1$, in which case every member of the ensemble is the ordinary least squares predictor fit using the entire training data, the variance is decreasing.

4.3 Dropout and Ridge Regression

There is an interesting connection between the ordinary least squares ensemble with $\eta = 1$ and the popular dropout technique (Srivastava et al., 2014) used in deep (neural) network training, which consists of randomly masking the features at each iteration of (stochastic) gradient descent. To draw this connection, define

 \ell_i(\bbeta') = \norm{\mX \mS_i \mS_i^\top \bbeta' - \vy}_2^2. (31)

Then our ensemble member parameter estimates are minimizers of this loss function:

 \hat{\bbeta}^{(i)} = \argmin_{\bbeta'} \ell_i(\bbeta') \quad \text{s.t.} \quad \bbeta'_{S_i^c} = \boldsymbol{0}. (32)

For each $i$, the $i$-th member of the ensemble is able to solve its subproblem independently of the other members. As a result, we can consider the ensemble to be a model with $kp$ parameters that are eventually averaged to reduce them down to $p$ parameters. If we were to instead constrain ourselves so that we were allowed to use only $p$ parameters, such that we could not optimize each member of the ensemble independently, we might try to optimize them jointly by minimizing the average loss. That is,

 \hat{\bbeta} = \argmin_{\bbeta'} \frac{1}{k} \sum_{i=1}^{k} \ell_i(\bbeta'). (33)

If we go a step further and let $k \to \infty$ and optimize this loss using stochastic gradient descent where at each iteration we use the gradient of an individual $\ell_i$ selected at random, then our ensemble becomes equivalent to the predictor learned using dropout. It is well known that dropout with linear regression has a very strong connection to ridge regression (Srivastava et al., 2014); specifically, we find that

 \hat{\bbeta} = \frac{1}{\alpha} \left(\mX^\top \mX + \frac{1 - \alpha}{\alpha} \mathrm{diag}(\mX^\top \mX)\right)^{-1} \mX^\top \vy. (34)

In the case of $\bSigma = \mI_p$, $\frac{1}{n} \mathrm{diag}(\mX^\top \mX)$ will converge to $\mI_p$ as $n \to \infty$, in which case dropout and ridge regression are equivalent up to a rescaling. We discuss the case where $\bSigma \neq \mI_p$ in Section 5.1.
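The marginalized form of this connection can be checked numerically: for a linear model with i.i.d. Bernoulli($\alpha$) feature masks $\vb$, the expected dropout loss $\E_{\vb} \norm{\mX\,\mathrm{diag}(\vb)\,\bbeta' - \vy}_2^2$ has the normal equations $(\alpha^2 \mX^\top \mX + \alpha(1-\alpha)\,\mathrm{diag}(\mX^\top \mX))\bbeta' = \alpha \mX^\top \vy$, which recovers (34) exactly. The sketch below is our own verification, with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, alpha = 100, 10, 0.6     # arbitrary sizes and keep probability
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
D = np.diag(np.diag(X.T @ X))

# Closed-form dropout solution (34)
beta_drop = np.linalg.solve(X.T @ X + (1 - alpha) / alpha * D, X.T @ y) / alpha

# Minimizer of the marginalized dropout loss E_b || X diag(b) beta - y ||^2:
#   (alpha^2 X^T X + alpha (1 - alpha) D) beta = alpha X^T y
A = alpha ** 2 * (X.T @ X) + alpha * (1 - alpha) * D
beta_marg = np.linalg.solve(A, alpha * X.T @ y)
```

The two solutions coincide exactly, since $A = \alpha^2 (\mX^\top \mX + \frac{1-\alpha}{\alpha} D)$.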

4.4 Scaled Ensembles

Our ensemble combines the individual predictors by simple averaging. However, in light of the fact that dropout is only equivalent to ridge regression up to a rescaling of the output, it is worth considering the effect of using an equally-weighted linear combination with weights different from $1/k$ in constructing the ensemble predictor. That is, we consider the risk of the $\mu$-scaled predictor $\mu \hat{\bbeta}_{\mathrm{ens}}$. A simple calculation, proved in the Appendix, shows that under the assumptions of Theorem 3.5 the large-ensemble risk of the $\mu$-scaled predictor is given by

 R^{\mathrm{ens}}_{\alpha, \mu} = \mu^2 R^{\mathrm{ens}}_{\alpha} + (1 - \mu)^2 + 2 \mu (1 - \mu)(1 - \alpha). (35)

Hence, it is possible to minimize the risk of $\mu \hat{\bbeta}_{\mathrm{ens}}$ over the choice of parameter $\mu$. This results in

 \mu^* = \frac{\alpha}{R^{\mathrm{ens}}_{\alpha} + 2\alpha - 1} (36)

as the optimal choice for $\mu$ and

 R^{\mathrm{ens}}_{\alpha, \mu^*} = 1 - \frac{\alpha^2}{2\alpha - 1 + R^{\mathrm{ens}}_{\alpha}} (37)

as the achieved risk for the optimally-scaled ensemble. Note that as a result of Corollary 3.7, $R^{\mathrm{ens}}_{\alpha^*} + 2\alpha^* - 1 = \alpha^*$. Therefore, for ensembles with optimally-tuned $\alpha$ we have $\mu^* = 1$, and any scaling in constructing the ensemble predictor will not further improve the achieved risk. However, it is easy to see that when $\alpha > \alpha^*$ (the ensemble members select more features than is optimal), $\mu^* < 1$, and the risk is improved by adding extra shrinkage to the ensemble predictor. Similarly, if $\alpha < \alpha^*$ (the ensemble members select fewer features than is optimal), $\mu^* > 1$, and the risk is improved by inflating the ensemble predictor. We illustrate the improvement in risk to be had in Figure 4, where we plot the risk with ($\mu = \mu^*$) and without ($\mu = 1$) optimal scaling for two choices of $\alpha$: one where we always select half as many features as optimal ($\alpha = \alpha^*/2$), and one where we always use half of the available features ($\alpha = 1/2$).
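A short numerical sketch of the scaling recipe (our own; the values of $\gamma$, $\sigma^2$, and $\alpha$ are arbitrary), confirming that a suboptimally small $\alpha$ calls for inflation ($\mu^* > 1$) and that scaling strictly improves the risk:

```python
import numpy as np

def large_ensemble_risk(alpha, gamma, sigma2):
    """Large-ensemble limiting risk (22), Sigma = I_p, ||beta|| = 1."""
    return ((1 - alpha) ** 2 + sigma2 * alpha ** 2 * gamma) / (1 - alpha ** 2 * gamma)

def scaled_risk(alpha, mu, gamma, sigma2):
    """Large-ensemble risk (35) of the mu-scaled predictor mu * beta_ens."""
    R = large_ensemble_risk(alpha, gamma, sigma2)
    return mu ** 2 * R + (1 - mu) ** 2 + 2 * mu * (1 - mu) * (1 - alpha)

gamma, sigma2 = 2.0, 1.0
alpha = 0.1                                   # fewer features than optimal
R = large_ensemble_risk(alpha, gamma, sigma2)
mu_star = alpha / (R + 2 * alpha - 1)         # optimal scaling (36)
R_scaled = scaled_risk(alpha, mu_star, gamma, sigma2)
```

Here the optimally-scaled risk also matches the closed form (37), $1 - \alpha^2 / (2\alpha - 1 + R^{\mathrm{ens}}_{\alpha})$.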

5 Future Directions

5.1 Non-Identity Covariance

Of course, it is important to understand the behavior of the ordinary least squares ensemble in the case where $\bSigma \neq \mI_p$ when considering applications of the method to real data. As discussed in Section 3, provided $\bSigma$ is invertible, ridge regression remains the optimal linear predictor, and whether the ensemble (or extensions thereto) still achieves the optimal risk in this setting remains an open question.

By inspection of the closed-form solution of dropout in (34), we see that when $\bSigma \neq \mI_p$ it is no longer equivalent (as $n \to \infty$) to ridge regression and is therefore no longer optimal. We believe that this is likely the case for the ensemble as well. However, if we extend the coin-flipping strategy for feature subset selection to one where we have a collection of coins with probabilities $\alpha_1, \ldots, \alpha_p$, one for each feature, we can extend the result in (34) to obtain the closed-form dropout solution

 \hat{\bbeta} = \mA^{-1} \left(\mX^\top \mX + (\mI_p - \mA) \mA^{-1} \mathrm{diag}(\mX^\top \mX)\right)^{-1} \mX^\top \vy, (38)

where $\mA = \mathrm{diag}(\alpha_1, \ldots, \alpha_p)$. We prove this result in the Appendix. Thus, if each $\alpha_j$ is chosen such that

 \frac{1 - \alpha_j}{\alpha_j} = \frac{\lambda}{n [\bSigma]_{jj}}, (39)

then the corrected dropout estimator

 \tilde{\bbeta} = \mA \hat{\bbeta} (40)

is equivalent to ridge regression with parameter $\lambda$ as $n \to \infty$. This leads us to believe that the optimal ensemble in the $\bSigma \neq \mI_p$ setting should also use non-uniform feature sampling, and extending our analysis to this case is an interesting area for future work.

5.2 Beyond Ordinary Least Squares: Ensembles of Interpolators

Throughout this work we have assumed that the members of the ensemble solve their subproblems using ordinary least squares, which yields the unique solution that minimizes the squared error given $|T_i|$ observations of $|S_i|$ variables, and this uniqueness requires that $|T_i|$ be no less than $|S_i|$. In the case where $|T_i| < |S_i|$, there are infinitely many solutions that minimize the squared error. However, we could in this case opt to regularize the solution to solve this problem. While analysis of the effect of regularizing the solution of the subproblems in the ensemble is beyond the scope of this work, we comment briefly on what would happen if we were to simply use the same solution presented in (5)—i.e., use the pseudoinverse solution, which has the smallest norm of all solutions to the least squares problem. In this case, when $|T_i| < |S_i|$, the learned predictor would be an interpolator (Belkin et al., 2018; Hastie et al., 2019) of the training data, and such methods have recently become increasingly of interest given the ability of deep (neural) network methods to have extremely good test performance while having (nearly) zero training error (Zhang et al., 2017; Belkin et al., 2019).
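The pseudoinverse behavior underlying this discussion is easy to verify numerically (our own sketch, with arbitrary dimensions): when the subproblem is underdetermined, the pseudoinverse solution interpolates its training subsample and has the smallest norm among all least-squares solutions:

```python
import numpy as np

rng = np.random.default_rng(3)
t, s = 20, 40                    # |T_i| < |S_i|: underdetermined subproblem
A = rng.normal(size=(t, s))      # stands in for the submatrix T_i^T X S_i
y = rng.normal(size=t)

b = np.linalg.pinv(A) @ y        # pseudoinverse gives the min-norm solution

# Any other least-squares solution differs by a null-space vector and is longer
null_vec = np.linalg.svd(A)[2][-1]   # a direction with A @ null_vec = 0
```

The fitted values `A @ b` reproduce `y` exactly (zero training error), and adding any null-space component only increases the norm of the solution.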

Specifically, it becomes immediately clear that in this setting, the effect of the choice of $\eta$ does not vanish as $k \to \infty$. Lemma 3.3 can easily be extended to this setting, since the roles of $S_i$ and $T_i$ in (13) can simply be reversed to yield

 \E_{\mX, S, T}[\mathrm{variance}_{ij}(\hat{\bbeta}_{\mathrm{ens}})] = \begin{cases} \sigma^2 \frac{|T_i \cap T_j|}{p - |T_i \cap T_j| - 1} & \text{if } i \neq j, \\ \sigma^2 \frac{|T_i|}{|S_i| - |T_i| - 1} & \text{if } i = j. \end{cases} (41)

As $n, p \to \infty$, this converges almost surely to

 \begin{cases} \sigma^2 \frac{\eta^2}{\gamma - \eta^2} & \text{if } i \neq j, \\ \sigma^2 \frac{\eta}{\alpha \gamma - \eta} & \text{if } i = j. \end{cases} (42)

Thus, the variance component of the large-ensemble risk in this setting is equal to $\sigma^2 \frac{\eta^2}{\gamma - \eta^2}$ and does not depend upon $\alpha$. In future work, we plan to extend our analysis for the bias component of the large-ensemble risk to this setting, and we expect that in this case the bias will depend on both $\alpha$ and $\eta$.

5.3 Optimal Ensemble Mixing

In the ordinary least squares ensemble, we have used equal weighting when taking the average of our predictors. Instead, we could extend the idea presented in Section 4.4 to consider unequal weighting parameterized by $\bmu = (\mu_1, \ldots, \mu_k)$, giving us the ensemble parameter estimate $\hat{\bbeta}_{\mathrm{ens}} = \sum_{i=1}^{k} \mu_i \hat{\bbeta}^{(i)}$. While equal weighting gives us optimal risk in the setting where $\bbeta$ has i.i.d. Gaussian entries, where ridge regression is optimal, under other distributional assumptions on $\bbeta$, such as sparsity, where ridge regression is not optimal, unequal weighting has the potential to yield better ensembles.

Using the sparsity example, consider $\bbeta$ such that $\norm{\bbeta}_2 = 1$, and suppose that for some subset $S_{\bbeta}$ of the features, $\bbeta_{S_{\bbeta}^c} = \boldsymbol{0}$, where $|S_{\bbeta}| = \alpha p$. For simplicity, assume that $\eta = 1$, so that $T_i = \{1, \ldots, n\}$ for all $i$. In this case, any predictor that uses the remaining features injects noise into its predictions, so the best predictor uses only the features in $S_{\bbeta}$. Under the i.i.d. Gaussian noise assumption, the predictor with lowest risk is in fact

 \hat{\bbeta} = \argmin_{\bbeta' : \bbeta'_{S_{\bbeta}^c} = \boldsymbol{0}} \norm{\vy - \mX \bbeta'}_2 = \hat{\bbeta}^{(i)}, (43)

where $i$ is such that $S_i = S_{\bbeta}$. Thus an optimal weighting is given by

 \mu_i = \begin{cases} \frac{1}{C} & \text{if } S_i = S_{\bbeta}, \\ 0 & \text{otherwise}, \end{cases} (44)

where $C = |\{i : S_i = S_{\bbeta}\}|$. This optimal weighting is decidedly non-uniform, and this raises the question of what weighting schemes could be employed, either adaptively or non-adaptively, to minimize risk, and how they would fit into this analysis framework.

Acknowledgements

This work was supported by NSF grants CCF-1911094, IIS-1838177, and IIS-1730574; ONR grants N00014-18-12571 and N00014-17-1-2551; AFOSR grant FA9550-18-1-0478; DARPA grant G001534-7500; and a Vannevar Bush Faculty Fellowship, ONR grant N00014-18-1-2047.

References

• Amit and Geman (1997) Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 9(7):1545–1588, 1997.
• Arora et al. (2019) S. Arora, N. Cohen, W. Hu, and Y. Luo. Implicit regularization in deep matrix factorization. arXiv preprint arXiv:1905.13655, 2019.
• Athey et al. (2019) S. Athey, J. Tibshirani, and S. Wager. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, Apr. 2019.
• Belkin et al. (2018) M. Belkin, D. J. Hsu, and P. Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In Advances in Neural Information Processing Systems 31, pages 2300–2311. 2018.
• Belkin et al. (2019) M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
• Breiman (1996) L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, Aug. 1996.
• Breiman (1998) L. Breiman. Arcing classifier (with discussion and a rejoinder by the author). The Annals of Statistics, 26(3):801–849, June 1998.

References

• Amit and Geman (1997) Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 9(7):1545–1588, 1997.
• Arora et al. (2019) S. Arora, N. Cohen, W. Hu, and Y. Luo. Implicit regularization in deep matrix factorization. arXiv preprint arXiv:1905.13655, 2019.
• Athey et al. (2019) S. Athey, J. Tibshirani, and S. Wager. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, Apr. 2019.
• Belkin et al. (2018) M. Belkin, D. J. Hsu, and P. Mitra. Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. In Advances in Neural Information Processing Systems 31, pages 2300–2311. 2018.
• Belkin et al. (2019) M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
• Breiman (1996) L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, Aug. 1996.
• Breiman (1998) L. Breiman. Arcing classifiers (with discussion and a rejoinder by the author). The Annals of Statistics, 26(3):801–849, June 1998.
• Breiman (2001) L. Breiman. Random forests. Machine Learning, 45(1):5–32, Oct. 2001.
• Bühlmann and Yu (2002) P. Bühlmann and B. Yu. Analyzing bagging. The Annals of Statistics, 30(4):927–961, Aug. 2002.
• Cutler and Zhao (2001) A. Cutler and G. Zhao. PERT - perfect random tree ensembles. Computing Science and Statistics, page 497, 2001.
• Dobriban and Wager (2018) E. Dobriban and S. Wager. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247–279, Feb. 2018.
• Fernández-Delgado et al. (2014) M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15:3133–3181, 2014.
• Freund and Schapire (1997) Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
• Friedberg et al. (2018) R. Friedberg, J. Tibshirani, S. Athey, and S. Wager. Local linear forests. arXiv preprint arXiv:1807.11408, 2018.
• Friedman et al. (2001) J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning. Springer Series in Statistics, 2001.
• Gunasekar et al. (2017) S. Gunasekar, B. E. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems 30, pages 6151–6159. 2017.
• Hardt et al. (2016) M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 1225–1234, June 2016.
• Hastie et al. (2019) T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
• Hoerl and Kennard (1970) A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
• Josse and Wager (2016) J. Josse and S. Wager. Bootstrap-based regularization for low-rank matrix estimation. Journal of Machine Learning Research, 17(1):4227–4255, Jan. 2016.
• Louppe and Geurts (2012) G. Louppe and P. Geurts. Ensembles on random patches. In Machine Learning and Knowledge Discovery in Databases, pages 346–361, Berlin, Heidelberg, 2012.
• Neyshabur et al. (2014) B. Neyshabur, R. Tomioka, and N. Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.
• Scornet et al. (2015) E. Scornet, G. Biau, and J.-P. Vert. Consistency of random forests. The Annals of Statistics, 43(4):1716–1741, Aug. 2015.
• Srivastava et al. (2014) N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, Jan. 2014.
• Wager et al. (2013) S. Wager, S. Wang, and P. S. Liang. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26, pages 351–359. 2013.
• Wager et al. (2014) S. Wager, T. Hastie, and B. Efron. Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. Journal of Machine Learning Research, 15:1625–1651, 2014.
• Wyner et al. (2017) A. J. Wyner, M. Olson, J. Bleich, and D. Mease. Explaining the success of AdaBoost and random forests as interpolating classifiers. Journal of Machine Learning Research, 18(48):1–33, 2017.
• Zhang et al. (2017) C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In 5th International Conference on Learning Representations, ICLR 2017, 2017.

Appendix A Useful Lemmas

The following two lemmas will be useful in deriving the bias and variance terms of the ensemble risk. Their proofs can be found in Section F.

Lemma A.1.

Let $S \subseteq [p]$ be a subset with corresponding selection matrix $\mathbf{S}$, and let $\mathbf{S}^c$ be the selection matrix corresponding to $S^c = [p] \setminus S$. Then for a random matrix $\mathbf{X} \in \mathbb{R}^{n \times p}$, $n \geq p$, with rows drawn independently from $\mathcal{N}(\mathbf{0}, \mathbf{I}_p)$, and for any random function $f$ such that $f(\mathbf{X}\mathbf{S})$ and $\mathbf{X}\mathbf{S}^c$ are independent,

$$\mathbb{E}_{\mathbf{X}\mathbf{S}^c}\!\left[\mathbf{S}^\top\mathbf{X}^\dagger\right] = (\mathbf{X}\mathbf{S})^\dagger \tag{45}$$

and

$$\mathbb{E}_{\mathbf{X}\mathbf{S}^c}\!\left[\mathbf{S}^{c\top}\mathbf{X}^\top f(\mathbf{X}\mathbf{S})\,\mathbf{S}^\top\mathbf{X}^\dagger\right] = \mathbf{0}. \tag{46}$$
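Identity (45) can be probed numerically: fix the block $\mathbf{X}\mathbf{S}$, resample the Gaussian complement block $\mathbf{X}\mathbf{S}^c$ many times, and compare the Monte Carlo average of $\mathbf{S}^\top\mathbf{X}^\dagger$ against $(\mathbf{X}\mathbf{S})^\dagger$. The sketch below is illustrative only; the dimensions and feature subsets are arbitrary choices, not values from the paper.

```python
import numpy as np

# Monte Carlo check of (45): E_{X S^c}[S^T X^dagger] = (X S)^dagger,
# holding X S fixed and resampling the Gaussian complement block X S^c.
rng = np.random.default_rng(0)
n, p = 20, 4
S = np.array([0, 1])    # selected feature indices (arbitrary example)
Sc = np.array([2, 3])   # complement indices

XS = rng.standard_normal((n, len(S)))        # fixed block X S
m = 50_000                                   # Monte Carlo samples
XSc = rng.standard_normal((m, n, len(Sc)))   # resampled blocks X S^c

X = np.empty((m, n, p))
X[:, :, S] = XS
X[:, :, Sc] = XSc

# X^dagger = (X^T X)^{-1} X^T, valid since n > p gives full column rank a.s.
G = np.einsum('mni,mnj->mij', X, X)                    # batched X^T X
Xdag = np.linalg.solve(G, np.transpose(X, (0, 2, 1)))  # batched X^dagger

lhs = Xdag[:, S, :].mean(axis=0)   # Monte Carlo estimate of E[S^T X^dagger]
rhs = np.linalg.pinv(XS)           # (X S)^dagger
err = np.abs(lhs - rhs).max()
print(err)  # shrinks toward 0 as m grows
```

The maximum entrywise deviation is pure Monte Carlo noise and decays like $m^{-1/2}$.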
Lemma A.2.

Let $T_1$ and $T_2$ be independent random subsets of $[n]$ with corresponding selection matrices $\mathbf{T}_1$ and $\mathbf{T}_2$. Then for random matrices $\mathbf{X}$ and $\mathbf{Y}$ independent of $T_1$ and $T_2$ and with independent and identically distributed rows, such that $\mathbf{X}^\top\mathbf{X}$ and $\mathbf{Y}^\top\mathbf{Y}$ are invertible, and for any matrix $\mathbf{A}$,

$$\mathbb{E}_{T_1,T_2}\!\left[(\mathbf{T}_1^\top\mathbf{X})^\dagger\mathbf{T}_1^\top\left((\mathbf{T}_2^\top\mathbf{X})^\dagger\mathbf{T}_2^\top\right)^{\!\top}\right] = (\mathbf{X}^\top\mathbf{X})^\dagger \tag{47}$$

and

$$\mathbb{E}_{T_1,T_2}\!\left[\left((\mathbf{T}_1^\top\mathbf{X})^\dagger\mathbf{T}_1^\top\right)^{\!\top}\mathbf{A}\,(\mathbf{T}_2^\top\mathbf{Y})^\dagger\mathbf{T}_2^\top\right] = (\mathbf{X}^\dagger)^\top\mathbf{A}\,\mathbf{Y}^\dagger. \tag{48}$$

Appendix B Proof of Lemma 3.2 (Bias)

To compute the bias, we need to evaluate terms of the form

$$\mathbb{E}_{\mathbf{X},S,T}\!\left\langle \boldsymbol{\beta}\boldsymbol{\beta}^\top,\ \left(\mathbf{I}_p - \mathbf{S}_i(\mathbf{T}_i^\top\mathbf{X}\mathbf{S}_i)^\dagger\mathbf{T}_i^\top\mathbf{X}\right)^{\!\top}\left(\mathbf{I}_p - \mathbf{S}_j(\mathbf{T}_j^\top\mathbf{X}\mathbf{S}_j)^\dagger\mathbf{T}_j^\top\mathbf{X}\right)\right\rangle. \tag{49}$$

First, we note that since $\mathbf{S}_i\mathbf{S}_i^\top + \mathbf{S}_i^c\mathbf{S}_i^{c\top} = \mathbf{I}_p$,

$$\begin{aligned}
\mathbf{I}_p - \mathbf{S}_i(\mathbf{T}_i^\top\mathbf{X}\mathbf{S}_i)^\dagger\mathbf{T}_i^\top\mathbf{X}
&= \mathbf{I}_p - \mathbf{S}_i(\mathbf{T}_i^\top\mathbf{X}\mathbf{S}_i)^\dagger\mathbf{T}_i^\top\mathbf{X}\left(\mathbf{S}_i\mathbf{S}_i^\top + \mathbf{S}_i^c\mathbf{S}_i^{c\top}\right) && (50) \\
&= \mathbf{I}_p - \mathbf{S}_i\mathbf{S}_i^\top - \mathbf{S}_i(\mathbf{T}_i^\top\mathbf{X}\mathbf{S}_i)^\dagger\mathbf{T}_i^\top\mathbf{X}\,\mathbf{S}_i^c\mathbf{S}_i^{c\top} && (51) \\
&= \left(\mathbf{I}_p - \mathbf{S}_i(\mathbf{T}_i^\top\mathbf{X}\mathbf{S}_i)^\dagger\mathbf{T}_i^\top\mathbf{X}\right)\mathbf{S}_i^c\mathbf{S}_i^{c\top}, && (52)
\end{aligned}$$

where (51) uses $(\mathbf{T}_i^\top\mathbf{X}\mathbf{S}_i)^\dagger\mathbf{T}_i^\top\mathbf{X}\mathbf{S}_i = \mathbf{I}$, which holds when $\mathbf{T}_i^\top\mathbf{X}\mathbf{S}_i$ has full column rank, and (52) uses $\mathbf{I}_p - \mathbf{S}_i\mathbf{S}_i^\top = \mathbf{S}_i^c\mathbf{S}_i^{c\top}$.
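The identity (50)–(52) is purely algebraic, so it can be verified numerically on a random instance (a small sketch; the dimensions and subsets below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 12, 6
S_idx = [0, 2, 4]            # hypothetical feature subset S_i
T_idx = [1, 3, 5, 7, 9]      # hypothetical example subset T_i

I_p = np.eye(p)
S = I_p[:, S_idx]                                       # selection matrix S_i
Sc = I_p[:, [j for j in range(p) if j not in S_idx]]    # complement S_i^c
T = np.eye(n)[:, T_idx]                                 # selection matrix T_i

X = rng.standard_normal((n, p))

# M = I_p - S (T^T X S)^dagger T^T X, the residual operator in (52)
M = I_p - S @ np.linalg.pinv(T.T @ X @ S) @ T.T @ X

# (52) claims M = M S^c S^{cT}, since (T^T X S)^dagger (T^T X S) = I
# whenever T^T X S has full column rank (true generically here).
err = np.abs(M - M @ Sc @ Sc.T).max()
print(err)  # ~ machine precision
```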

So, we can equivalently evaluate

$$\mathbb{E}_{\mathbf{X},S,T}\!\left\langle \boldsymbol{\beta}\boldsymbol{\beta}^\top,\ \mathbf{S}_i^c\mathbf{S}_i^{c\top}\left[\mathbf{I}_p - \mathbf{X}^\top\mathbf{T}_i(\mathbf{S}_i^\top\mathbf{X}^\top\mathbf{T}_i)^\dagger\mathbf{S}_i^\top\right]\left[\mathbf{I}_p - \mathbf{S}_j(\mathbf{T}_j^\top\mathbf{X}\mathbf{S}_j)^\dagger\mathbf{T}_j^\top\mathbf{X}\right]\mathbf{S}_j^c\mathbf{S}_j^{c\top}\right\rangle. \tag{53}$$

It suffices to evaluate the expectation of the second argument of the inner product:

$$\begin{aligned}
&\mathbb{E}_{\mathbf{X},S,T}\!\left[\mathbf{S}_i^c\mathbf{S}_i^{c\top}\left[\mathbf{I}_p - \mathbf{X}^\top\mathbf{T}_i(\mathbf{S}_i^\top\mathbf{X}^\top\mathbf{T}_i)^\dagger\mathbf{S}_i^\top\right]\left[\mathbf{I}_p - \mathbf{S}_j(\mathbf{T}_j^\top\mathbf{X}\mathbf{S}_j)^\dagger\mathbf{T}_j^\top\mathbf{X}\right]\mathbf{S}_j^c\mathbf{S}_j^{c\top}\right] \\
&\quad= \mathbb{E}_{\mathbf{X},S,T}\!\Big[\mathbf{S}_i^c\mathbf{S}_i^{c\top}\mathbf{X}^\top\mathbf{T}_i(\mathbf{S}_i^\top\mathbf{X}^\top\mathbf{T}_i)^\dagger\mathbf{S}_i^\top\mathbf{S}_j(\mathbf{T}_j^\top\mathbf{X}\mathbf{S}_j)^\dagger\mathbf{T}_j^\top\mathbf{X}\,\mathbf{S}_j^c\mathbf{S}_j^{c\top} \\
&\qquad\quad- \mathbf{S}_i^c\mathbf{S}_i^{c\top}\mathbf{X}^\top\mathbf{T}_i(\mathbf{S}_i^\top\mathbf{X}^\top\mathbf{T}_i)^\dagger\mathbf{S}_i^\top\mathbf{S}_j^c\mathbf{S}_j^{c\top}
- \mathbf{S}_i^c\mathbf{S}_i^{c\top}\mathbf{S}_j(\mathbf{T}_j^\top\mathbf{X}\mathbf{S}_j)^\dagger\mathbf{T}_j^\top\mathbf{X}\,\mathbf{S}_j^c\mathbf{S}_j^{c\top}
+ \mathbf{S}_i^c\mathbf{S}_i^{c\top}\mathbf{S}_j^c\mathbf{S}_j^{c\top}\Big]. && (54)
\end{aligned}$$
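Writing $\mathbf{B}_i = \mathbf{S}_i(\mathbf{T}_i^\top\mathbf{X}\mathbf{S}_i)^\dagger\mathbf{T}_i^\top\mathbf{X}$ and $\mathbf{B}_j$ analogously (shorthand introduced here only for exposition), the expansion behind (54) is simply

$$\mathbf{S}_i^c\mathbf{S}_i^{c\top}\left(\mathbf{I}_p - \mathbf{B}_i^\top\right)\left(\mathbf{I}_p - \mathbf{B}_j\right)\mathbf{S}_j^c\mathbf{S}_j^{c\top}
= \mathbf{S}_i^c\mathbf{S}_i^{c\top}\left(\mathbf{B}_i^\top\mathbf{B}_j - \mathbf{B}_i^\top - \mathbf{B}_j + \mathbf{I}_p\right)\mathbf{S}_j^c\mathbf{S}_j^{c\top},$$

whose four terms appear in (54) in this order.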

The second and third terms are zero in expectation. To see this for the second term, observe that $\mathbf{X}\mathbf{S}_i^c$ and $\mathbf{X}\mathbf{S}_i$ are independent and that $\mathbf{X}\mathbf{S}_i^c$ is zero-mean, so the expectation over $\mathbf{X}\mathbf{S}_i^c$ vanishes. An analogous argument applies to the third term. The fourth term is equal to

$$\frac{|S_i^c \cap S_j^c|}{p}$$
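The structure behind this fraction is that $\mathbf{S}_i^c\mathbf{S}_i^{c\top}$ is the diagonal 0–1 matrix indicating $S_i^c$, so the fourth term $\mathbf{S}_i^c\mathbf{S}_i^{c\top}\mathbf{S}_j^c\mathbf{S}_j^{c\top}$ is the diagonal indicator of $S_i^c \cap S_j^c$, whose trace is $|S_i^c \cap S_j^c|$. This is easy to confirm (the indices below are arbitrary illustrative choices):

```python
import numpy as np

p = 6
Sc_i, Sc_j = [1, 3, 5], [0, 1, 5]      # hypothetical complements S_i^c, S_j^c

I_p = np.eye(p)
Mi = I_p[:, Sc_i] @ I_p[:, Sc_i].T     # S_i^c S_i^{cT}: diagonal indicator of S_i^c
Mj = I_p[:, Sc_j] @ I_p[:, Sc_j].T     # S_j^c S_j^{cT}

inter = sorted(set(Sc_i) & set(Sc_j))  # S_i^c ∩ S_j^c = [1, 5]
expected = np.diag(np.isin(np.arange(p), inter).astype(float))

print(np.array_equal(Mi @ Mj, expected))  # True
```

In particular, $\langle\boldsymbol{\beta}\boldsymbol{\beta}^\top, \mathbf{S}_i^c\mathbf{S}_i^{c\top}\mathbf{S}_j^c\mathbf{S}_j^{c\top}\rangle = \sum_{k \in S_i^c \cap S_j^c} \beta_k^2$.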