1 Introduction
Recent experimental studies [belkin_2019, zhang_2021] reveal that in the modern high-dimensional regime, models that perfectly fit noisy training data can still generalize well. This phenomenon stands in contrast to the classical wisdom that interpolating the data results in poor statistical performance due to overfitting. Many theoretical papers have explored why, when, and to what extent interpolation can be harmless for generalization, suggesting a coherent storyline: high dimensionality itself can have a regularizing effect, in the sense that it lowers the model's sensitivity to noise. This intuition emerges from the fast-growing literature studying min-norm interpolation in the regression setting with input dimension substantially exceeding sample size (see [bartlett_2020, dobriban_2018] and references therein). Results and intuition for this setting also extend to kernel methods [ghorbani_2021, mei_2019]. However, a closer look at this literature reveals that while high dimensionality decreases the sensitivity to noise (the error due to variance), the prediction error generally does not vanish asymptotically. Indeed, the bottleneck for asymptotic consistency is a non-vanishing bias term, which can only be avoided when the features have low effective dimension relative to the sample size, as measured through the spectrum of the feature covariance matrix [tsigler_2020]. Therefore, current theory does not yet provide a convincing explanation for why interpolating models generalize well for inherently high-dimensional input data. This work takes a step towards addressing this gap. When the input data is effectively high-dimensional (isotropic covariance and dimension far exceeding sample size), we generally cannot expect any data-driven estimator to generalize well unless there is underlying structure that can be exploited. In this paper, we hence focus on linear regression on isotropic Gaussian features with the simplest structural assumption: sparsity of the ground truth in the standard basis. For this setting, the
$\ell_1$-penalized regressor (LASSO, [tibshirani_1996]) achieves minimax optimal rates in the presence of noise [vandegeer_2008], while basis pursuit (BP, [chen_1998]) – that is, min-$\ell_1$-norm interpolation – generalizes well in the noiseless case but is known to be very sensitive to noise [candes_2008, donoho_2006]. Given recent results on high dimensionality decreasing the sensitivity of interpolators to noise, and classical results on the low bias of BP for learning sparse signals, the following question naturally arises: Can we consistently learn sparse ground truth functions with minimum-$\ell_1$-norm interpolators on inherently high-dimensional features?
So far, upper bounds on the error of the BP estimator of the order of the noise level have been derived for isotropic Gaussian [koehler_2021, ju_2020, wojtaszczyk_2010], sub-exponential [foucart_2014], or heavy-tailed [chinot_2021, krahmer_2018] features. In the case of isotropic Gaussian features, even though the authors of [chinot_2021] show a tight matching lower bound for adversarial noise, for random (non-adversarial) noise the best known results are not tight: there is a gap between the non-vanishing upper bound [wojtaszczyk_2010] and the lower bound [chatterji_2021, muthukumar_2020]. For this noise model, the authors of both papers [chinot_2021, koehler_2021] conjecture that BP does not achieve consistency.
Contribution.
We are the first to answer the above question in the affirmative. Specifically, we show that for isotropic Gaussian features, BP does in fact achieve asymptotic consistency when the input dimension grows superlinearly and subexponentially in the sample size. Our result closes the aforementioned gap in the literature on BP: we give matching upper and lower bounds on the prediction error of the BP estimator, exact up to terms that are negligible in this regime. Further, our proof technique is novel and may be of independent interest.
Structure of the paper.
The rest of the article is structured as follows. In Section 2, we give our main result and discuss its implications and limitations. In Section 3, we present the proof and provide insights on why our approach leads to tighter bounds than previous works. We conclude the paper in Section 4 with possible future directions.
2 Main result
In this section we state our main result, followed by a discussion of its implications and limitations in relation to previous work. We consider a linear regression model with input vectors $x_i$ drawn from an isotropic Gaussian distribution, and response variables $y_i = x_i^\top w^\star + \xi_i$, where $w^\star$ is the ground truth to be estimated and $\xi_i$ is a noise term independent of $x_i$. Given $n$ random samples $(x_i, y_i)$, the goal is to estimate $w^\star$ and obtain a small prediction error for the estimate $\hat{w}$:

(1)   $R(\hat{w}) = \mathbb{E}_{(x,y)}\big[(y - x^\top \hat{w})^2\big] - \sigma^2 = \|\hat{w} - w^\star\|_2^2,$

where we subtract the irreducible error $\sigma^2$ and use the isotropy of the features. Note that this is also exactly the squared $\ell_2$-estimation error of the estimator. We study the min-$\ell_1$-norm interpolator (or BP solution) defined by
(2)   $\hat{w} \in \arg\min_{w \in \mathbb{R}^d} \|w\|_1 \quad \text{subject to} \quad Xw = y.$
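Since (2) is a linear program, it can be solved with off-the-shelf LP solvers. The following minimal sketch (our own illustration, not code from the paper; the function name, dimensions, and seed are our choices) solves BP with `scipy.optimize.linprog` by introducing auxiliary variables $u \ge |w|$:

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """Solve min ||w||_1 subject to Xw = y via the LP
    min sum(u)  s.t.  -u <= w <= u,  Xw = y,  with variables z = [w; u]."""
    n, d = X.shape
    c = np.concatenate([np.zeros(d), np.ones(d)])  # objective: sum of u
    I = np.eye(d)
    A_ub = np.block([[I, -I], [-I, -I]])           # w - u <= 0 and -w - u <= 0
    b_ub = np.zeros(2 * d)
    A_eq = np.hstack([X, np.zeros((n, d))])        # interpolation constraint Xw = y
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * d + [(0, None)] * d, method="highs")
    return res.x[:d]

rng = np.random.default_rng(0)
n, d, s = 30, 80, 2
X = rng.standard_normal((n, d))                    # isotropic Gaussian features
w_star = np.zeros(d); w_star[:s] = 1.0             # sparse ground truth
y = X @ w_star                                     # noiseless responses
w_hat = basis_pursuit(X, y)
assert np.allclose(X @ w_hat, y, atol=1e-6)        # w_hat interpolates the data
# w_star is feasible, so the LP optimum cannot exceed its l1-norm
assert np.sum(np.abs(w_hat)) <= np.sum(np.abs(w_star)) + 1e-6
```

The equality constraints enforce interpolation; the inequality block makes each $u_i$ an upper bound on $|w_i|$, so minimizing $\sum_i u_i$ minimizes the $\ell_1$-norm.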
Our main result, Theorem 2, provides non-asymptotic matching upper and lower bounds for the prediction error of this estimator: Suppose the ground truth satisfies a norm condition involving a universal constant (discussed below). There exist universal constants such that, for any dimension and sample size in the regime of interest, the prediction error is upper- and lower-bounded as
(3) 
with probability at least
over the draws of the dataset. This theorem establishes a slow but vanishing statistical rate for the prediction error of the BP solution. Previously, lower bounds of this order for the same distributional setting (isotropic Gaussian features with random noise) only applied under more restrictive assumptions, such as the zero-signal case [muthukumar_2020], or under additional growth assumptions [ju_2020]. Moreover, this lower bound stood against a constant upper bound [chinot_2021, wojtaszczyk_2010]. Our result both proves the lower bound in more generality and significantly improves the upper bound, showing that the lower bound is in fact tight. Furthermore, the upper bound implies that BP achieves high-dimensional asymptotic consistency when the dimension grows superlinearly (and subexponentially) in the sample size.

Dependency on the ground truth norm.
Existing upper bounds in the literature are of the form²

²The notation $a \lesssim b$ means that there exists a universal constant $c > 0$ such that $a \le c\,b$, and we write $a \asymp b$ for $a \lesssim b$ and $b \lesssim a$.
(4) 
(see [chinot_2021, Theorem 3.1]). That is, they contain a first term reflecting the error due to overfitting of the noise, and a second term depending explicitly on the norm of the ground truth. In particular, for a constant noise level and a sparse ground truth, the first term clearly dominates. By contrast, our bound has no explicit dependency on the ground truth, but holds under a condition controlling the magnitude of its $\ell_1$-norm, which essentially ensures that the effect of fitting the noise is dominant (in the notation of the proof, Section 3). Furthermore, note that we can rewrite this condition as an effective sparsity assumption on the ground truth, which is not very restrictive. Indeed, it holds if the noise level is constant and the ground truth is sparse with constant sparsity, or more generally when the sparsity grows sufficiently slowly.
2.1 Numerical simulations
We now present numerical simulations illustrating Theorem 2. Figure 0(a) shows the prediction error of BP as one problem parameter varies with the others held fixed, for isotropic inputs generated from the zero-mean and unit-variance Normal, Log-Normal and Rademacher distributions. For all three distributions, the prediction error closely follows the theoretical trend line (dashed curve). While Theorem 2 only applies to Gaussian features, the figure suggests that this statistical rate of BP holds more generally (see the discussion in Section 2.3). Figure 0(b) shows the prediction error of the min-$\ell_1$-norm (BP) and min-$\ell_2$-norm interpolators as a function of the noise level, for fixed dimension and sample size. The prediction error of the former again aligns with the theoretical prediction. Furthermore, we observe that the min-$\ell_1$-norm interpolator is sensitive to the noise level, while the min-$\ell_2$-norm interpolator has a similar (non-vanishing) prediction error across all noise levels. For both plots we average the prediction error over 20 runs; in Figure 0(b) we additionally show the standard deviation (shaded regions). The ground truth is a fixed sparse vector; the remaining problem parameters are chosen separately for Figures 0(a) and 0(b).

2.2 Implications and insights
We now discuss further highlevel implications and insights that follow from Theorem 2.
High-dimensional asymptotic consistency.
Our result proves consistency of BP for any asymptotic regime in which the dimension grows superlinearly and subexponentially in the sample size. In fact, we argue that those are the only regimes of interest. For dimension growing exponentially with the sample size, known minimax lower bounds for sparse problems preclude consistency [verzelen_2012]. On the other hand, for linear growth of the dimension – studied in detail in [li_2021] – the uniform prediction error lower bound holding for all interpolators (see Section 3.3) also forbids vanishing prediction error. Note that in the superlinear regime, asymptotic consistency can also be achieved by a carefully designed "hybrid" interpolating estimator [muthukumar_2020, Section 5.2]; contrary to BP, this estimator is not a minimum-norm interpolator, and is not structured (not sparse).
Tradeoff between structural bias and sensitivity to noise.
As mentioned in the introduction, our upper bound on the prediction error shows that, contrary to min-$\ell_2$-norm interpolation, BP is able to learn sparse signals in high dimensions thanks to its structural bias towards sparsity. However, our lower bound can be seen as a tempering negative result: the prediction error decays only at a slow rate. Compared to min-$\ell_2$-norm interpolation, BP (min-$\ell_1$-norm interpolation) suffers from a higher sensitivity to noise, but possesses a more advantageous structural bias. To compare the two methods' sensitivity to noise, consider the zero-signal case, where the prediction error purely reflects the effect of noise. In this case, although both methods achieve vanishing error, the statistical rate for BP is much slower than that of min-$\ell_2$-norm interpolation [koehler_2021, Theorem 3]. Conversely, to compare the effect of structural bias, consider the noiseless case with a nonzero ground truth. It is well known that BP successfully learns sparse signals [candes_2008], while min-$\ell_2$-norm interpolation always fails to learn the ground truth due to the lack of any corresponding structural bias. Thus, there appears to be a tradeoff between structural bias and sensitivity to noise: BP benefits from a strong structural bias, allowing it to have good performance for noiseless recovery of sparse signals, but in return displays a slow rate in the presence of noise – while min-$\ell_2$-norm interpolation has no structural bias (except towards zero), causing it to fail to recover any nonzero signal even in the absence of noise, but in return does not suffer from overfitting of the noise. This behavior is also illustrated in Figure 0(b).
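The structural-bias side of this tradeoff is easy to observe in the noiseless sparse setting. The following sketch (our own illustration, not code from the paper; the dimensions and seed are arbitrary choices) compares the min-$\ell_1$-norm (BP) and min-$\ell_2$-norm interpolators on the same noiseless data:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, d, s = 50, 200, 3
X = rng.standard_normal((n, d))                 # isotropic Gaussian features
w_star = np.zeros(d); w_star[:s] = 1.0          # sparse ground truth
y = X @ w_star                                  # noiseless case

# min-l2-norm interpolator: pseudoinverse solution w = X^+ y
w_l2 = np.linalg.pinv(X) @ y

# min-l1-norm interpolator (BP) as a linear program over z = [w; u]
I = np.eye(d)
res = linprog(np.concatenate([np.zeros(d), np.ones(d)]),
              A_ub=np.block([[I, -I], [-I, -I]]), b_ub=np.zeros(2 * d),
              A_eq=np.hstack([X, np.zeros((n, d))]), b_eq=y,
              bounds=[(None, None)] * d + [(0, None)] * d, method="highs")
w_l1 = res.x[:d]

err_l2 = np.linalg.norm(w_l2 - w_star)          # prediction error (isotropic features)
err_l1 = np.linalg.norm(w_l1 - w_star)
assert np.allclose(X @ w_l2, y) and np.allclose(X @ w_l1, y, atol=1e-6)
assert err_l1 < err_l2                          # BP's sparsity bias pays off without noise
```

With these parameters, BP typically recovers the sparse ground truth (almost) exactly, while the min-$\ell_2$-norm solution is the projection of the ground truth onto the row space of the data and so retains a large bias.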
2.3 Limitations
In this section we discuss how restrictive the assumptions of Theorem 2 are, and whether they can be relaxed.
Gaussianity of the features.
The proof of Theorem 2 crucially relies on the (Convex) Gaussian Minimax Theorem [thrampoulidis_2015, gordon_1988], and hence on the assumption that the input features are drawn from a Gaussian distribution. In Figure 0(a), we include plots of the prediction error not only for Gaussian but also for Log-Normal and Rademacher distributed features. We observe that in all three cases, the prediction error closely follows the theoretical trend line (dashed curve). We therefore conjecture that Theorem 2 can be extended to a more general class of distributions. For heavy-tailed distributions, a popular theoretical framework is the small-ball method [mendelson_2014, koltchinskii_2015], which covers the Log-Normal and Rademacher distributions. The authors of [chinot_2021] apply this approach to min-$\ell_1$-norm interpolation and obtain a constant upper bound, under more general assumptions than our setting (in particular, their analysis handles adversarial noise of bounded magnitude). Yet, it is unclear whether the looseness of their upper bound is an artifact of their proof, or whether the small-ball method itself is too general to capture the rates observed in Figure 0(a).
Isotropic features.
Our Theorem 2 assumes isotropic features as we are interested in showing consistency of BP for inherently highdimensional input data. By contrast, recently there has been an increased interest in studying spiked covariance data models (see [bartlett_2020] and references therein). We leave it as future work to generalize our result to Gaussian features with an arbitrary covariance matrix.
3 Proof of main result
In this section, we present the proof of our main result, Theorem 2. The proof is given in Section 3.1. In Section 3.2, we give intuition on the main improvements of our novel proof technique, which we describe in Section 3.4. In Section 3.3, we discuss a universal lower bound for interpolators which follows from our proof. Full proofs of the intermediate lemmas and propositions are given in Appendix A.
Notation.
On the finite-dimensional space $\mathbb{R}^d$, we write $\|\cdot\|_2$ for the Euclidean norm and $\langle\cdot,\cdot\rangle$ for the inner product. The $\ell_1$ and $\ell_\infty$ norms are denoted by $\|\cdot\|_1$ and $\|\cdot\|_\infty$, respectively. The vectors of the standard basis are denoted by $e_1, \dots, e_d$, and $\mathbf{1}$ is the vector with all components equal to $1$. For $S \subseteq \{1, \dots, d\}$ and $v \in \mathbb{R}^d$, $v_S$ is the vector such that $(v_S)_i = v_i$ if $i \in S$ and $(v_S)_i = 0$ otherwise. $\mathcal{N}(\mu, \Sigma)$ is the normal distribution with mean $\mu$ and covariance $\Sigma$, $\Phi$ is the cumulative distribution function of the scalar standard normal distribution, and $\log$ denotes the natural logarithm. The sample inputs form the rows of the data matrix $X \in \mathbb{R}^{n \times d}$. The responses and noise terms are aggregated into vectors $y, \xi \in \mathbb{R}^n$, with $y = Xw^\star + \xi$. With this notation, an estimator $w$ interpolates the data if and only if $Xw = y$. To easily keep track of the dependency on dimension and sample size, we reserve asymptotic notation for universal constants only, without any hidden dependency on the dimension, the sample size, or the noise level. We will also use $c$ and $C$ (with variants) to denote positive universal constants, reintroduced each time in the proposition and lemma statements, except for one constant which should be considered as fixed throughout the whole proof.

3.1 Proof of Theorem 2
We proceed by a localized uniform convergence approach, similar to the papers [chinot_2021, koehler_2021, ju_2020, muthukumar_2020] and common in the literature on structural risk minimization. That is, the proof consists of two steps:

Localization. We derive a (finer than previously known) high-probability upper bound on the $\ell_1$-norm of the min-$\ell_1$-norm interpolator $\hat{w}$: we find a deterministic quantity that upper-bounds $\|\hat{w}\|_1$ with high probability.

Uniform convergence. We derive high-probability uniform upper and lower bounds on the prediction error over all interpolators whose $\ell_1$-norm is at most the localization bound from the first step.

By definition of the localization bound, with high probability the min-$\ell_1$-norm interpolator belongs to the feasible set of both uniform convergence problems, and hence the second step yields high-probability upper and lower bounds on its prediction error. The key is thus to derive tight high-probability bounds for the three quantities involved. Our derivation proceeds in two parts, described below; the first part follows arguments already used in [koehler_2021], while the second part is novel. The techniques developed in the latter are crucial to obtain our tight bounds and might be of independent interest.
Part a: (Convex) Gaussian Minimax Theorem.
Since each of these quantities is defined as the optimal value of a stochastic program with Gaussian parameters, we may apply the (Convex) Gaussian Minimax Theorem ((C)GMT) [gordon_1988, thrampoulidis_2015]. On a high level, given a "primary" optimization program with Gaussian parameters, the (C)GMT relates it to an "auxiliary" optimization program, so that high-probability bounds on the latter imply high-probability bounds on the former. The following proposition applies the CGMT to the localization program and the GMT to the two uniform convergence programs. [] We define the stochastic auxiliary optimization problems:
()  
()  
() 
where the slack parameter can be any small enough quantity. For any threshold value, it holds that
(5)  
(6)  
(7) 
where on the left-hand side the probability is over the draws of the data matrix and the noise vector, and on the right-hand side over the draws of the auxiliary Gaussian vectors. For the remainder of this paper, we fix a specific small choice of the slack parameter.³

³This choice is justified by the proof of Proposition 3.1: for an arbitrary choice of the slack parameter, one could still show the same bound with just an extra factor, holding with the same probability, and this would translate to a bound on the prediction error holding with the same probability as well. So the choice "comes at no cost" in terms of the probability with which the bound holds, while being sufficiently small to allow for a satisfactory bound (it only affects a universal constant).

As such, from now on, we drop the slack parameter from the notation. The proof of Proposition 3.1, given in Appendix LABEL:apx:subsec:CGMT, closely follows Lemmas 3-7 in [koehler_2021]. For clarity, note that the three pairs of primary/auxiliary stochastic programs are not coupled: Proposition 3.1 should be understood as consisting of three separate statements, each using a different independent copy of the auxiliary Gaussian randomness. As a result of the proposition, the goal of finding high-probability bounds on the three primary quantities now reduces to finding high-probability bounds on the three auxiliary quantities, respectively.

Part b: Bounds on the auxiliary quantities.
To obtain tight bounds on the auxiliary quantities, we adopt a significantly different approach from previous works. The main idea is to reduce the three auxiliary optimization problems to optimization problems over a parametric path. Here we only state the results and refer to Section 3.4 for their proofs and further intuition. For the remainder of this proof, we denote by $t_q$ the upper $q$-quantile of the standard normal distribution, defined by $\Phi(t_q) = 1 - q$. [] There exist universal constants such that, if the dimension and sample size satisfy the stated growth conditions, then

(8)
with high probability over the random draws. Consequently, the min-$\ell_1$-norm interpolator has $\ell_1$-norm bounded by the deterministic quantity
(9) 
with high probability over the random draws. Therefore, we henceforth fix this localization radius in the uniform convergence problems, both primary and auxiliary. We now establish high-probability upper and lower bounds on the auxiliary uniform convergence quantities for this specific choice of radius. [] Under the assumptions of Theorem 2, there exist universal constants such that each of the two events
(10) 
happens with high probability over the random draws. Theorem 2 then follows straightforwardly by combining the three propositions above.
3.2 Key improvements of the proof over previous results
Let us briefly point out the main features of our derivation that allow for a bound that is tighter than previous results.
Tighter bounds for the localization step.
Proposition 3.1 gives a high-probability upper bound on the $\ell_1$-norm of the interpolator. Its expression involves a quantile of the standard normal distribution, for which we have a sharp estimate (see Lemma 3.4.3 in Appendix LABEL:apx:subsec:concentration_gamma). Hence, we may give the following more explicit estimate:

(11)

While existing bounds on the interpolator's $\ell_1$-norm in the literature are of the same asymptotic order [ju_2020, chinot_2021], using them instead of our sharper estimate in the derivation would only result in a constant, non-vanishing upper bound on the prediction error. Further note that, while the estimate in Equation (11) would already lead to upper and lower bounds of matching order on the prediction error, in order to obtain the precise bounds presented in Proposition 3.1 we make use of finer properties of the quantile not captured by Equation (11).
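For intuition on the size of such Gaussian quantiles, the classical tail expansion of the standard normal upper quantile can be checked numerically against `scipy.stats.norm.isf`. This is our own sanity check of a textbook asymptotic; the paper's Lemma 3.4.3 may state a different, sharper form:

```python
import numpy as np
from scipy.stats import norm

def quantile_approx(q):
    """Classical tail expansion of the standard normal upper quantile t_q,
    where Phi(t_q) = 1 - q:  t_q^2 ~ 2 log(1/q) - log(2 log(1/q)) - log(2*pi)."""
    L = 2.0 * np.log(1.0 / q)
    return np.sqrt(L - np.log(L) - np.log(2.0 * np.pi))

for q in [1e-4, 1e-6, 1e-8]:
    exact = norm.isf(q)                        # inverse survival function: exact t_q
    approx = quantile_approx(q)
    assert abs(approx / exact - 1.0) < 0.02    # within 2% already for moderate q
```

The leading term alone ($t_q \approx \sqrt{2\log(1/q)}$) is noticeably looser; the logarithmic corrections matter at the accuracy level needed for constant-level tightness.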
Parametric path optimization for the uniform convergence step.
The overall structure of our proof is similar to that of Theorems 1 to 4 in [koehler_2021]: we use the CGMT to localize the interpolator, and then use the GMT to derive a uniform prediction error bound. The results in [koehler_2021] are applicable to general minimum-norm interpolators, while we focus on the $\ell_1$-norm only. However, the relaxations used in that paper are not tight enough to capture the consistency of min-$\ell_1$-norm interpolation for isotropic features (see Section 6, "Application: Isotropic features" in that paper). To intuitively understand why their general theorems fail to give the accurate rate, let us briefly reproduce their derivation in our notation. They derive an upper bound on the uniform convergence quantity for an arbitrary localization radius by a simple relaxation of the auxiliary problem: the second constraint is relaxed using only the localization radius, which yields a closed-form upper bound. However this bound is loose, even when we plug in our tight localization bound. Indeed, by the estimate in Equation (11) and by Gaussian concentration results, the resulting bound reads

(12)

Note that this bound is constant in any polynomial growth regime of the dimension, while the upper bound we prove in Theorem 2 vanishes in these regimes. So the relaxation performed in [koehler_2021] is not sufficiently tight. In order to obtain tighter bounds, we instead carry out a more refined analysis taking into account the interplay between the constraints, by reducing the auxiliary optimization problems to optimization problems over a parametric path (see Section 3.4.1).
3.3 Implication of the proof: Universal lower bound for interpolators
Interestingly, Proposition 3.1 immediately implies a lower bound for all interpolators holding with high probability. Indeed, consider the uniform lower bound with the localization radius taken to infinity, so that the norm constraint is vacuous; the resulting quantity is then a lower bound on the prediction error of all interpolating estimators. Using an additional convergence argument as in Lemma 4 of [koehler_2021], one can show that the GMT is still applicable in this limit, and results in the corresponding auxiliary optimization problem with a vacuous first constraint. By the Cauchy-Schwarz inequality, we can relax the second constraint. Simple manipulations and concentration results for Gaussians then yield the high-probability bound

(13)

By the GMT, this implies a high-probability lower bound on the prediction error of all interpolators uniformly. In particular, this bound implies that no linear interpolator can achieve asymptotic consistency in the regime where the dimension grows only linearly in the sample size. A weaker lower bound was already noted in [muthukumar_2020, Corollary 1, case 3], which however does not capture the divergence of the prediction error at the interpolation threshold $d = n$.
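This divergence at the interpolation threshold is easy to observe empirically for the min-$\ell_2$-norm interpolator, which is a valid probe since the lower bound applies to all interpolators. The sketch below (our own illustration; the sample size, dimensions, and trial count are arbitrary choices) fits pure noise and compares the error just above $d = n$ with the heavily overparametrized regime:

```python
import numpy as np

rng = np.random.default_rng(2)

def min_l2_noise_error(n, d, sigma=1.0, trials=30):
    """Average squared error ||w_hat||^2 of the min-l2-norm interpolator
    when the ground truth is zero and responses are pure noise (requires d > n)."""
    errs = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        xi = sigma * rng.standard_normal(n)
        w_hat = X.T @ np.linalg.solve(X @ X.T, xi)   # min-norm interpolator of the noise
        errs.append(np.sum(w_hat ** 2))
    return np.mean(errs)

n = 50
near_threshold = min_l2_noise_error(n, d=55)    # d barely above n: error blows up
overparam = min_l2_noise_error(n, d=2000)       # d >> n: the noise energy is spread out
assert near_threshold > 10 * overparam
```

For Gaussian features the expected squared norm of this interpolator is $\sigma^2\,n/(d-n-1)$, which diverges as $d$ approaches $n$ from above and vanishes as $d/n \to \infty$, matching the qualitative picture described in the text.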
3.4 Proof of Propositions 3.1 and 3.1
In this section we detail our analysis of the three auxiliary optimization problems. We start with a remark that considerably simplifies notation: the definitions of the auxiliary quantities are unchanged if the auxiliary Gaussian vector is replaced by the reordered vector of its absolute order statistics, i.e., the vector whose $i$-th component is the $i$-th largest absolute value among the components of the original vector. Throughout this proof, we condition on the event where this reordered vector has distinct and positive components, which holds with probability one. Henceforth, unless specified otherwise, references to the auxiliary optimization problems refer to the equivalent problems where the Gaussian vector is replaced by its vector of absolute order statistics. Also recall that the slack parameter was fixed above. The key steps in the proof of the two propositions are as follows.

For each of the three auxiliary optimization problems, we show that the argmax (or argmin) lies on a parametric path, up to scaling. Hence we can restate the three problems as optimization problems over a scalar path parameter and a scale variable. (Section 3.4.1)

Still conditioning on the almost-sure event above, we explicitly characterize the parametric path. In particular, we show that it is piecewise linear with breakpoints admitting closed-form expressions. (Section 3.4.2)

A fine-grained study of the intersection of the path with the constraint sets of the uniform convergence problems, as well as of the concentration properties of the path's breakpoints, yields the desired high-probability bounds. (Section 3.4.5)
3.4.1 Parametrizing the argmax/argmin
Note that in the three auxiliary optimization problems, the optimization variable only appears through a small number of functionals. Thus, we can add a further constraint without affecting the optimal solution. We will show that a parametric path can be used to parametrize the solutions of the optimization problems, defined by

(14)

Specifically, the following key lemma states that (at least one element of) the argmax/argmin of each of the three problems is a scaled path point. This allows us to reduce the optimization problems to a single scalar path parameter and a scale variable. [] We have:
The proof of the lemma is given in Appendix LABEL:apx:subsec:param_argmaxmin. To give an intuitive explanation for the equivalence, consider a penalized version of the problem, obtained by moving the interpolation constraint into the objective with a Lagrange multiplier. For fixed values of the multiplier and of the scale variable, minimizing this penalized objective reduces to a problem whose solution lies on the path. Hence, we can expect the argmin to be attained at a scaled path point.
3.4.2 Characterizing the parametric path
As the path is defined as the optimal solution of a convex optimization problem, we are able to obtain a closed-form expression for it by a straightforward application of Lagrangian duality. The only other nontrivial ingredient is to notice that, at optimality, the inequality constraint necessarily holds with equality. Denote the vector equal to one on the first components and zero on the remaining ones, and similarly for the complementary pattern. Define, for any integer,
(15) 
Note that . Let . [] For all , denote the unique integer in such that . Then where the dual variables and are given by
(16)  
(17) 
The proof of the lemma is given in Appendix LABEL:apx:subsec:geometric_lemma.
3.4.3 Concentration of norms of
Given the explicit characterization of the parametric path, we now study its breakpoints (), and more precisely we estimate and as a function of (we have by definition ). Namely, we prove the following concentration result, where, analogously to , we let denote the quantity such that . [] There exist universal constants such that for any with and ,
(18) 
with probability at least over the draws of . This proposition relies on and extends the literature studying concentration of order statistics [boucheron_2012, li_2020]. An important ingredient for the proof of the proposition is the following lemma, which gives a tight approximation for . [] There exist universal constants such that, for all , satisfies
(19) 
where
(20) 
Furthermore, and can be chosen ( and ) such that . The proofs of Proposition 3.4.3 and of Lemma 3.4.3 are given in Appendix LABEL:apx:subsec:concentration_gamma.
3.4.4 Localization: Proof of Proposition 3.1 (upper bound for )
We now use the concentration bounds of Proposition 3.4.3 to obtain a high-probability upper bound for the localization quantity. Recall from Lemma 14 that it is given by:
(21) 
We may rewrite the constraint as
(22)  
(23) 
Thus minimizing over the remaining variable gives the claimed expression. Since we want to upper-bound this minimum, it is sufficient to further restrict the optimization problem by an additional constraint, yielding
(24) 
We now show that for this choice, the constraint is satisfied with high probability, and we give a high-probability estimate for the resulting upper bound. See Remark 3.4.4 below for a justification of this choice. For the remainder of the proof of Proposition 3.1, we condition on the event where the inequalities in Equation (18) hold. By the concentration bound, a sufficient condition for the choice to be feasible is
(25) 
with some universal constant. Now by Lemma 3.4.3 and our choice of parameters, we can choose the constant sufficiently large such that the above inequality holds in the regime under consideration. Moreover, by the concentration bounds, the localization quantity is upper-bounded by
(26) 
Furthermore, by Lemma 3.4.3, so for a universal constant . This concludes the proof of Proposition 3.1. Let us informally justify why we can expect the choice