Tight bounds for minimum ℓ_1-norm interpolation of noisy data

11/10/2021
by   Guillaume Wang, et al.

We provide matching upper and lower bounds of order σ^2/log(d/n) for the prediction error of the minimum ℓ_1-norm interpolator, a.k.a. basis pursuit. Our result is tight up to negligible terms when d ≫ n, and is the first to imply asymptotic consistency of noisy minimum-norm interpolation for isotropic features and sparse ground truths. Our work complements the literature on "benign overfitting" for minimum ℓ_2-norm interpolation, where asymptotic consistency can be achieved only when the features are effectively low-dimensional.



1 Introduction

Recent experimental studies [belkin_2019, zhang_2021] reveal that in the modern high-dimensional regime, models that perfectly fit noisy training data can still generalize well. This phenomenon stands in contrast to the classical wisdom that interpolating the data results in poor statistical performance due to overfitting. Many theoretical papers have explored why, when, and to what extent interpolation can be harmless for generalization, suggesting a coherent storyline: high dimensionality itself can have a regularizing effect, in the sense that it lowers the model's sensitivity to noise. This intuition emerges from the fast-growing literature studying min-ℓ_2-norm interpolation in the regression setting with input dimension d substantially exceeding the sample size n (see [bartlett_2020, dobriban_2018] and references therein). Results and intuition for this setting also extend to kernel methods [ghorbani_2021, mei_2019]. However, a closer look at this literature reveals that while high dimensionality decreases the sensitivity to noise (the error due to variance), the prediction error generally does not vanish as n, d → ∞. Indeed, the bottleneck for asymptotic consistency is a non-vanishing bias term, which can only be avoided when the features have low effective dimension with respect to their covariance matrix [tsigler_2020]. Therefore, current theory does not yet provide a convincing explanation for why interpolating models generalize well for inherently high-dimensional input data.

This work takes a step towards addressing this gap. When the input data is effectively high-dimensional (isotropic covariance and d ≫ n), we generally cannot expect any data-driven estimator to generalize well unless there is underlying structure that can be exploited. In this paper, we hence focus on linear regression on isotropic Gaussian features with the simplest structural assumption: sparsity of the ground truth in the standard basis. For this setting, the ℓ_1-penalized regressor (LASSO, [tibshirani_1996]) achieves minimax optimal rates in the presence of noise [vandegeer_2008], while basis pursuit (BP, [chen_1998]) – that is, min-ℓ_1-norm interpolation – generalizes well in the noiseless case but is known to be very sensitive to noise [candes_2008, donoho_2006]. Given recent results on high dimensionality decreasing the sensitivity of interpolators to noise, and classical results on the low bias of BP for learning sparse signals, the following question naturally arises:

Can we consistently learn sparse ground truth functions with minimum-norm interpolators on inherently high-dimensional features?

So far, upper bounds on the ℓ_2-error of the BP estimator of the order of the noise level have been derived for isotropic Gaussian [koehler_2021, ju_2020, wojtaszczyk_2010], sub-exponential [foucart_2014], or heavy-tailed [chinot_2021, krahmer_2018] features. In the case of isotropic Gaussian features, even though the authors of the paper [chinot_2021] show a tight matching lower bound for adversarial noise, for i.i.d. Gaussian noise the best known results are not tight: there is a gap between the non-vanishing (constant-order) upper bound [wojtaszczyk_2010] and the vanishing lower bound of order σ^2/log(d/n) [chatterji_2021, muthukumar_2020]. For Gaussian noise, the authors of both papers [chinot_2021, koehler_2021] conjecture that BP does not achieve consistency.

Contribution.

We are the first to answer the above question in the affirmative. Specifically, we show that for isotropic Gaussian features, BP does in fact achieve asymptotic consistency when d grows superlinearly and subexponentially in n. Our result closes the aforementioned gap in the literature on BP: we give matching upper and lower bounds of order σ^2/log(d/n) on the ℓ_2-error of the BP estimator, exact up to terms that are negligible when d ≫ n. Further, our proof technique is novel and may be of independent interest.

Structure of the paper.

The rest of the article is structured as follows. In Section 2, we give our main result and discuss its implications and limitations. In Section 3, we present the proof and provide insights on why our approach leads to tighter bounds than previous works. We conclude the paper in Section 4 with possible future directions.

2 Main result

In this section we state our main result, followed by a discussion of its implications and limitations in relation to previous work. We consider a linear regression model with input vectors x_i ∈ R^d drawn from an isotropic Gaussian distribution N(0, I_d), and response variables y_i = ⟨x_i, w^*⟩ + ξ_i, where w^* ∈ R^d is the ground truth to be estimated and ξ_i ∼ N(0, σ^2) is a noise term independent of x_i. Given n random samples (x_i, y_i), the goal is to estimate w^* and obtain a small prediction error for the estimate ŵ,

(1)   R(ŵ) := E_{(x, y)} [ (y - ⟨x, ŵ⟩)^2 ] - σ^2 = ‖ŵ - w^*‖_2^2,

where we subtract the irreducible error σ^2. Note that this is also exactly the ℓ_2-error of the estimator. We study the min-ℓ_1-norm interpolator (or BP solution) defined by

(2)   ŵ := arg min { ‖w‖_1 : ⟨x_i, w⟩ = y_i for all i = 1, …, n }.
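For concreteness, the BP solution (2) can be computed by rewriting the problem as a linear program. The following sketch shows one standard way to do so; the helper name and the solver choice are ours, not part of the paper.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """Min-l1-norm interpolator: argmin ||w||_1 subject to X w = y.

    Standard LP reformulation with slack variables u >= |w|:
    minimize sum(u) over (w, u) such that X w = y and -u <= w <= u.
    """
    n, d = X.shape
    c = np.concatenate([np.zeros(d), np.ones(d)])   # objective: sum of slacks u
    A_eq = np.hstack([X, np.zeros((n, d))])         # interpolation constraint X w = y
    A_ub = np.block([[np.eye(d), -np.eye(d)],       #  w - u <= 0
                     [-np.eye(d), -np.eye(d)]])     # -w - u <= 0
    b_ub = np.zeros(2 * d)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * d + [(0, None)] * d, method="highs")
    return res.x[:d]
```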

Our main result, Theorem 2, provides non-asymptotic matching upper and lower bounds for the prediction error of this estimator: suppose that ‖w^*‖_1 satisfies the effective-sparsity condition discussed below; then there exist universal constants such that, whenever d is sufficiently large relative to n, the prediction error is upper- and lower-bounded as

(3)   R(ŵ) ≍ σ^2 / log(d/n)

with high probability over the draws of the dataset. This theorem proves a statistical rate of σ^2/log(d/n) for the prediction error of the BP solution. Previously, lower bounds of this order for the same distributional setting (isotropic Gaussian features, i.i.d. Gaussian noise) only applied under more restrictive assumptions, such as the zero-signal case w^* = 0 [muthukumar_2020], or under additional restrictions on the problem parameters [ju_2020]. Moreover, this lower bound stood against a constant upper bound [chinot_2021, wojtaszczyk_2010]. Our result both proves the lower bound in more generality and significantly improves the upper bound, showing that the lower bound is in fact tight. Furthermore, the upper bound implies that BP achieves high-dimensional asymptotic consistency when d grows superlinearly (and subexponentially) in n.

Dependency on ‖w^*‖_1.

Existing upper bounds in the literature are of the form (here the notation a ≲ b means that there exists a universal constant C such that a ≤ C b, and we write a ≍ b for a ≲ b and b ≲ a)

(4)

(see [chinot_2021, Theorem 3.1]). That is, they contain a first term reflecting the error due to overfitting of the noise ξ, and a second term depending explicitly on ‖w^*‖_1. In particular, for a constant noise level σ and a sparse ground truth, the first term clearly dominates. By contrast, our bound has no explicit dependency on ‖w^*‖_1, but holds under a condition controlling the magnitude of ‖w^*‖_1 which essentially ensures that the effect of fitting the noise is dominant (see Section 3 for the precise condition). Furthermore, note that the condition on ‖w^*‖_1 can be rewritten as an effective-sparsity condition on the ground truth, involving only a universal constant. This assumption is not very restrictive: indeed, it holds if σ is constant and w^* is sparse with constant sparsity, or more generally under a suitable growth condition on the effective sparsity.

Figure 1: Prediction error as a function of (a) the dimension d, with d varying and n, σ fixed, and (b) the noise level σ, with n and d fixed. The features are generated by drawing from isotropic zero-mean and unit-variance distributions: Normal in (b); Normal, Log Normal and Rademacher in (a). For BP on Gaussian-distributed features (orange squares), the plots correctly reflect the theoretical rate σ^2/log(d/n) (dashed curve). See Section 2.1 for further details.

2.1 Numerical simulations

We now present numerical simulations illustrating Theorem 2. Figure 1(a) shows the prediction error of BP plotted as a function of d, with d varying and n, σ fixed, for isotropic inputs generated from the zero-mean and unit-variance Normal, Log Normal and Rademacher distributions. For all three distributions, the prediction error closely follows the trend line σ^2/log(d/n) (dashed curve). While Theorem 2 only applies to Gaussian features, the figure suggests that this statistical rate of BP holds more generally (see discussion in Section 2.3). Figure 1(b) shows the prediction error of the min-ℓ_1-norm (BP) and min-ℓ_2-norm interpolators as a function of the noise level σ, for n and d fixed. The prediction error of the former again aligns with the theoretical prediction σ^2/log(d/n). Furthermore, we observe that the min-ℓ_1-norm interpolator is sensitive to the noise level σ, while the min-ℓ_2-norm interpolator has a similar (non-vanishing) prediction error across all values of σ. For both plots we average the prediction error over 20 runs; in Figure 1(b) we additionally show the standard deviation (shaded regions). The ground truth w^* is a sparse vector; n is fixed in Figure 1(a), and both n and d are fixed in Figure 1(b).
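A minimal simulation in the spirit of Figure 1(a) can be sketched as follows, reusing the basis_pursuit helper from the sketch in Section 2; the sample size, dimensions, sparsity and seed below are illustrative choices, not the exact values used for the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, n_runs = 50, 1.0, 20

for d in [200, 400, 800, 1600]:
    w_star = np.zeros(d)
    w_star[0] = 1.0                                   # 1-sparse ground truth (illustrative)
    errs = []
    for _ in range(n_runs):
        X = rng.standard_normal((n, d))               # isotropic Gaussian features
        y = X @ w_star + sigma * rng.standard_normal(n)
        w_hat = basis_pursuit(X, y)                   # min-l1-norm interpolator (BP)
        errs.append(np.sum((w_hat - w_star) ** 2))    # prediction error ||w_hat - w*||_2^2
    rate = sigma**2 / np.log(d / n)                   # theoretical rate sigma^2 / log(d/n)
    print(f"d={d:5d}  empirical error={np.mean(errs):.3f}  sigma^2/log(d/n)={rate:.3f}")
```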

2.2 Implications and insights

We now discuss further high-level implications and insights that follow from Theorem 2.

High-dimensional asymptotic consistency.

Our result proves consistency of BP for any asymptotic regime in which d grows superlinearly and subexponentially in n. In fact, we argue that those are the only regimes of interest. For d growing exponentially with n, known minimax lower bounds for sparse problems preclude consistency [verzelen_2012]. On the other hand, for linear growth d ≍ n – studied in detail in the paper [li_2021] – the uniform prediction error lower bound holding for all interpolators (see Section 3.3) also forbids a vanishing prediction error. Note that asymptotic consistency can also be achieved by a carefully designed "hybrid" interpolating estimator [muthukumar_2020, Section 5.2]; contrary to BP, this estimator is not a minimum-norm interpolator, and is not structured (not sparse).

Trade-off between structural bias and sensitivity to noise.

As mentioned in the introduction, our upper bound on the prediction error shows that, contrary to min-ℓ_2-norm interpolation, BP is able to learn sparse signals in high dimensions thanks to its structural bias towards sparsity. However, our lower bound can be seen as a tempering negative result: the prediction error decays only at a slow rate of σ^2/log(d/n). Compared to min-ℓ_2-norm interpolation, BP (min-ℓ_1-norm interpolation) suffers from a higher sensitivity to noise, but possesses a more advantageous structural bias. To compare the two methods' sensitivity to noise, consider the zero-signal case w^* = 0, where the prediction error purely reflects the effect of noise. In this case, although both methods achieve vanishing error, the statistical rate for BP, σ^2/log(d/n), is much slower than that of min-ℓ_2-norm interpolation [koehler_2021, Theorem 3]. Contrariwise, to compare the effect of structural bias, consider the noiseless case σ = 0 with a non-zero ground truth. It is well known that BP successfully learns sparse signals [candes_2008], while min-ℓ_2-norm interpolation always fails to learn the ground truth due to the lack of any corresponding structural bias. Thus, there appears to be a trade-off between structural bias and sensitivity to noise: BP benefits from a strong structural bias, allowing it to have good performance for noiseless recovery of sparse signals, but in return displays a poor rate in the presence of noise – while min-ℓ_2-norm interpolation has no structural bias (except towards zero), causing it to fail to recover any non-zero signal even in the absence of noise, but in return does not suffer from overfitting of the noise. This behavior is also illustrated in Figure 1(b).
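The following snippet illustrates this trade-off numerically, in the spirit of Figure 1(b). It compares the min-ℓ_2-norm interpolator (computed via the pseudoinverse) with BP (the basis_pursuit helper from the sketch in Section 2); the parameters and the sparse ground truth are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 2000
w_star = np.zeros(d)
w_star[0] = 1.0                                    # sparse ground truth (illustrative)

for sigma in [0.0, 0.5, 1.0, 2.0]:
    X = rng.standard_normal((n, d))
    y = X @ w_star + sigma * rng.standard_normal(n)
    w_l2 = np.linalg.pinv(X) @ y                   # min-l2-norm interpolator
    w_l1 = basis_pursuit(X, y)                     # min-l1-norm interpolator (BP)
    err_l2 = np.sum((w_l2 - w_star) ** 2)
    err_l1 = np.sum((w_l1 - w_star) ** 2)
    print(f"sigma={sigma:3.1f}  min-l2 error={err_l2:.3f}  BP error={err_l1:.3f}")
```

One should observe that the min-ℓ_2-norm interpolator's error stays close to ‖w^*‖_2^2 for every noise level (it is bias-dominated), whereas BP is essentially exact at σ = 0 and degrades as σ grows.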

2.3 Limitations

In this section we discuss how restrictive the assumptions of Theorem 2 are, and whether they can be relaxed.

Gaussianity of the features.

The proof of Theorem 2 crucially relies on the (Convex) Gaussian Minimax Theorem [thrampoulidis_2015, gordon_1988], and hence on the assumption that the input features are drawn from a Gaussian distribution. In Figure 1(a), we include plots of the prediction error not only for Gaussian but also for Log Normal and Rademacher distributed features. We observe that in all three cases, the prediction error closely follows the trend line σ^2/log(d/n) (dashed curve). We therefore conjecture that Theorem 2 can be extended to a more general class of distributions. For heavy-tailed distributions, a popular theoretical framework is the small-ball method [mendelson_2014, koltchinskii_2015], which covers the Log Normal and Rademacher distributions. The authors of the paper [chinot_2021] apply this approach to min-ℓ_1-norm interpolation, and obtain a constant (non-vanishing) upper bound, under more general assumptions than our setting (in particular, their analysis handles adversarial noise of bounded magnitude). Yet, it is unclear whether the looseness of their upper bound is an artifact of their proof, or whether the small-ball method itself is too general to capture the rates observed in Figure 1(a).
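For reference, isotropic features with the three entry distributions used in Figure 1(a) can be generated as below; the Log Normal parameters and the standardization are our own illustrative choices (the exact parameters used for the figure are not specified here).

```python
import numpy as np

rng = np.random.default_rng(2)

def features(dist, n, d):
    """n x d matrix with i.i.d. zero-mean, unit-variance entries (isotropic features)."""
    if dist == "normal":
        return rng.standard_normal((n, d))
    if dist == "rademacher":
        return rng.choice([-1.0, 1.0], size=(n, d))
    if dist == "lognormal":
        z = rng.lognormal(mean=0.0, sigma=1.0, size=(n, d))
        mu = np.exp(0.5)                    # mean of LogNormal(0, 1)
        var = (np.e - 1.0) * np.e           # variance of LogNormal(0, 1)
        return (z - mu) / np.sqrt(var)
    raise ValueError(f"unknown distribution: {dist}")
```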

Isotropic features.

Our Theorem 2 assumes isotropic features as we are interested in showing consistency of BP for inherently high-dimensional input data. By contrast, recently there has been an increased interest in studying spiked covariance data models (see [bartlett_2020] and references therein). We leave it as future work to generalize our result to Gaussian features with an arbitrary covariance matrix.

3 Proof of main result

In this section, we present the proof of our main result, Theorem 2. The proof is given in Section 3.1. In Section 3.2, we give intuition on the main improvements of our novel proof technique, that we describe in Section 3.4. In Section 3.3, we discuss a universal lower bound for interpolators which follows from our proof. Full proofs for the intermediary Lemmas and Propositions are given in Appendix A.

Notation.

On the finite-dimensional space R^d, we write ‖·‖_2 for the Euclidean norm and ⟨·, ·⟩ for the inner product. The ℓ_1- and ℓ_∞-norms are denoted by ‖·‖_1 and ‖·‖_∞, respectively. The vectors of the standard basis are denoted by e_1, …, e_d, and 𝟙 is the vector with all components equal to 1. For a vector v and an index set S, v_S is the vector that coincides with v on S and is 0 otherwise. N(μ, Σ) is the normal distribution with mean μ and covariance Σ, Φ is the cumulative distribution function of the scalar standard normal distribution, and log denotes the natural logarithm. The samples form the rows of the data matrix X ∈ R^{n×d}, with x_i^T the i-th row. The scalars y_i, ξ_i are also aggregated into vectors y, ξ ∈ R^n, so that y = Xw^* + ξ. With this notation, "ŵ interpolates the data" is equivalent to Xŵ = y. To easily keep track of the dependency on dimension and sample size, we reserve the notations ≲ and ≍ to contain only universal constants, without any hidden dependency on n, d, or σ. We will also use C, c, c_1, c_2, … to denote positive universal constants, reintroduced each time in the proposition and lemma statements, except for one of them, which should be considered as fixed throughout the whole proof.

3.1 Proof of Theorem 2

We proceed by a localized uniform convergence approach, similar to the papers [chinot_2021, koehler_2021, ju_2020, muthukumar_2020], and common in the literature on structural risk minimization. That is, the proof consists of two steps:

  1. Localization. We derive a (finer than previously known) high-probability upper bound on the ℓ_1-norm of the min-ℓ_1-norm interpolator ŵ, by finding B > 0 such that

    ()   min { ‖w‖_1 : Xw = y } ≤ B,

    and consequently ‖ŵ‖_1 ≤ B, with high probability.

  2. Uniform convergence. We derive high-probability uniform upper and lower bounds on the prediction error for all interpolators of ℓ_1-norm less than B. Namely, we find a high-probability upper bound for

    ()   max { ‖w - w^*‖_2^2 : Xw = y, ‖w‖_1 ≤ B }

    and a high-probability lower bound for

    ()   min { ‖w - w^*‖_2^2 : Xw = y, ‖w‖_1 ≤ B }.

By definition of B in the first step, with high probability the min-ℓ_1-norm interpolator belongs to the set of feasible solutions of the two problems in the second step, and hence the second step yields high-probability upper and lower bounds on its prediction error. The key is thus to derive tight high-probability bounds for these three quantities. Our derivation proceeds in two parts, described below; the first part follows arguments already used by the authors of the paper [koehler_2021], while the second part is novel. The techniques developed in the latter are crucial to obtain our tight bounds and might be of independent interest.
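Schematically, the two steps combine via a union bound; the symbols B, U, δ_1, δ_2 below are generic placeholders for the localization radius, the uniform upper bound, and the two failure probabilities, not the precise quantities defined in the paper:

```latex
\mathbb{P}\big(\|\hat w\|_1 \le B\big) \ge 1-\delta_1
\quad\text{and}\quad
\mathbb{P}\Big(\max_{Xw=y,\;\|w\|_1\le B} \|w-w^*\|_2^2 \le U\Big) \ge 1-\delta_2
\quad\Longrightarrow\quad
\mathbb{P}\big(\|\hat w-w^*\|_2^2 \le U\big) \ge 1-\delta_1-\delta_2 ,
```

and analogously for the uniform lower bound.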

Part a: (Convex) Gaussian Minimax Theorem.

Since each of the three quantities above is defined as the optimal value of a stochastic program with Gaussian parameters, we may apply the (Convex) Gaussian Minimax Theorem ((C)GMT) [gordon_1988, thrampoulidis_2015]. On a high level, given a "primary" optimization program with Gaussian parameters, the (C)GMT relates it to an "auxiliary" optimization program, so that high-probability bounds on the latter imply high-probability bounds on the former. The following proposition applies the CGMT to the localization quantity and the GMT to the two uniform-convergence quantities. For a small parameter, define the stochastic auxiliary optimization problems:

()
()
()

where the parameter appearing above can be any small enough quantity. For any such choice, it holds that

(5)
(6)
(7)

where on the left-hand side the probability is over the draws of the data, and on the right-hand side over the draws of the auxiliary Gaussian vectors. For the remainder of this paper, we fix a particular small choice of the parameter appearing above. (This choice is justified by the proof of Proposition 3.1: for an arbitrary choice, one could still show the same bound up to an extra factor, holding with the same probability, which would translate into a correspondingly weaker bound on the prediction error. So the choice "comes at no cost" in terms of the probability with which the bound holds, while being sufficiently small to allow for a satisfactory bound; it only affects a universal constant.) As such, from now on, we drop this parameter from the notation. The proof of Proposition 3.1, given in Appendix A, closely follows Lemmas 3-7 in the paper [koehler_2021]. For clarity, note that the three pairs of stochastic programs (primary/auxiliary) are not coupled: Proposition 3.1 should be understood as consisting of three separate statements, each using a different independent copy of the auxiliary Gaussian vectors. As a result of the proposition, the goal of finding high-probability bounds on the three primary quantities now reduces to finding high-probability bounds on the three auxiliary quantities, respectively.
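For the reader's convenience, the generic form of the (C)GMT used here reads roughly as follows. This is an informal restatement under simplifying assumptions (compact constraint sets S_w, S_u and a continuous coupling function ψ); see [gordon_1988, thrampoulidis_2015] for the precise conditions.

```latex
% Primary program (G has i.i.d. N(0,1) entries) and auxiliary program
% (g ~ N(0, I_n), h ~ N(0, I_d), independent):
\Phi(G)   := \min_{w \in S_w} \max_{u \in S_u} \; u^\top G w + \psi(w,u), \qquad
\phi(g,h) := \min_{w \in S_w} \max_{u \in S_u} \; \|w\|_2\, g^\top u + \|u\|_2\, h^\top w + \psi(w,u).
% GMT:  for every c,  P( \Phi(G) < c ) <= 2 P( \phi(g,h) <= c ).
% CGMT: if S_w, S_u are convex and \psi is convex-concave, then in addition
%        P( \Phi(G) > c ) <= 2 P( \phi(g,h) >= c ).
```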

Part b: Bounds on the auxiliary quantities.

To obtain tight bounds on the auxiliary quantities, we adopt a significantly different approach from previous works. The main idea is to reduce the three auxiliary optimization problems to optimization problems over a parametric path. Here we only state the results and refer to Section 3.4 for their proofs and further intuition. For the remainder of this proof, we work with a suitably chosen quantile of the standard normal distribution (see Section 3.4.3). There exist universal constants such that, for n and d as in Theorem 2,

(8)

with high probability over the draws of the auxiliary Gaussian vector. Consequently, the min-ℓ_1-norm interpolator has ℓ_1-norm bounded by the deterministic quantity

(9)

with high probability over the draws of the data and of the auxiliary randomness. Therefore, we henceforth set the localization radius B to this deterministic quantity in the uniform-convergence problems and in their auxiliary counterparts. We now establish high-probability upper and lower bounds for the two auxiliary uniform-convergence quantities for this specific choice of B. Suppose that ‖w^*‖_1 satisfies the assumption of Theorem 2 with some universal constant. There exist universal constants such that, for n and d as in Theorem 2, each of the two events

(10)

happens with high probability over the draws of the auxiliary randomness. Theorem 2 then follows straightforwardly by combining the three propositions above.

3.2 Key improvements of the proof over previous results

Let us briefly point out the main features of our derivation that allow for a bound that is tighter than previous results.

Tighter bounds for the localization step.

Proposition 3.1 gives a high-probability upper bound for the ℓ_1-norm of the min-ℓ_1-norm interpolator. Its expression contains a quantile of the standard normal distribution, for which we have a tight estimate (see Lemma 3.4.3 in Appendix A). Hence, we may give the following more explicit estimate:

(11)

While existing bounds for the ℓ_1-norm of the interpolator in the literature are of the same asymptotic order [ju_2020, chinot_2021], using those estimates instead of ours in the derivation of Proposition 3.1 would only result in a constant (non-vanishing) upper bound. Further note that, while the estimate in Equation (11) would already lead to upper and lower bounds of matching order σ^2/log(d/n), in order to obtain the precise bounds presented in Proposition 3.1 we make use of finer properties of the quantile not captured by Equation (11).
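The exact quantile entering the bound is not reproduced here, but estimates of this kind typically boil down to the classical leading-order approximation of Gaussian upper quantiles, Φ^{-1}(1 - s) ≈ √(2 log(1/s)) as s → 0. A quick numerical sanity check (purely illustrative, not the paper's precise estimate):

```python
import numpy as np
from scipy.stats import norm

for s in [1e-2, 1e-4, 1e-6, 1e-8]:
    exact = norm.isf(s)                      # upper-tail quantile: P(g > exact) = s
    approx = np.sqrt(2 * np.log(1 / s))      # leading-order approximation
    print(f"s={s:.0e}  exact={exact:.3f}  approx={approx:.3f}  ratio={exact/approx:.3f}")
```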

Parametric path optimization for the uniform convergence step.

The overall structure of our proof is similar to that of Theorems 1 to 4 of the paper [koehler_2021]: we use the CGMT to localize the interpolator, and then use the GMT to derive a uniform prediction error bound. The results in the paper [koehler_2021] are applicable to general minimum-norm interpolators, while we focus on the ℓ_1-norm only. However, the relaxations used in the mentioned paper are not tight enough to capture the consistency of min-ℓ_1-norm interpolation for isotropic features (see Section 6, "Application: Isotropic features" in that paper). To intuitively understand why their general theorems fail to give the accurate rate, let us briefly reproduce their derivation in our notation. They derive an upper bound on the uniform-convergence quantity for an arbitrary localization radius B, by a simple relaxation of the corresponding auxiliary problem: for all w with ‖w‖_1 ≤ B, Hölder's inequality gives |⟨h, w⟩| ≤ B ‖h‖_∞ for any vector h, so the corresponding constraint can be relaxed accordingly, implying an explicit upper bound. However, this bound is loose, even when we plug in our tight localization bound for B. Indeed, by the estimate in Equation (11) and by Gaussian concentration results, the above bound reads, for d ≫ n,

(12)

Note that this bound is constant in any polynomial growth regime d ≍ n^β, while we prove an upper bound in Theorem 2 which vanishes in these regimes as n → ∞. So the relaxation performed in the paper [koehler_2021] is not sufficiently tight. In order to obtain tighter bounds, we instead carry out a more refined analysis that takes into account the relationship between the ℓ_1-constraint and the remaining constraints, by reducing the three auxiliary optimization problems to optimization problems over a parametric path (see Section 3.4.1).

3.3 Implication of the proof: Universal lower bound for interpolators

Interestingly, Proposition 3.1 immediately implies a lower bound, holding with high probability, for all interpolators. Indeed, consider the uniform lower bound of the second step with the localization radius B taken so large that the ℓ_1-constraint is vacuous, so that the resulting quantity lower-bounds the prediction error of all interpolating estimators. Using an additional convergence argument as in Lemma 4 in the paper [koehler_2021], one can show that the GMT is still applicable in the limit B → ∞, and results in the corresponding auxiliary optimization problem with a vacuous first constraint. By the Cauchy-Schwarz inequality, we can relax the second constraint; simple manipulations and concentration results for Gaussians then yield the high-probability bound

(13)

By the GMT, this implies a high-probability lower bound on the prediction error of all interpolators uniformly. In particular, this bound implies that no linear interpolator can achieve asymptotic consistency in the linear-growth regime d ≍ n. A weaker lower bound was already noted in [muthukumar_2020, Corollary 1, case 3], which however does not capture the divergence of the prediction error at the interpolation threshold d = n.
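While the uniform lower bound itself requires the GMT argument above, the divergence at the interpolation threshold is easy to observe numerically for a particular interpolator, e.g. the min-ℓ_2-norm one with zero signal. This does not verify the uniform bound; it only illustrates the blow-up near d ≈ n, with illustrative parameters.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, n_runs = 100, 1.0, 20

for d in [105, 110, 125, 150, 200, 400, 800]:
    errs = []
    for _ in range(n_runs):
        X = rng.standard_normal((n, d))
        y = sigma * rng.standard_normal(n)       # zero signal: w* = 0
        w_hat = np.linalg.pinv(X) @ y            # min-l2-norm interpolator
        errs.append(np.sum(w_hat ** 2))          # prediction error ||w_hat||_2^2
    theory = sigma**2 * n / (d - n - 1)          # E||w_hat||^2 for Gaussian X (d > n + 1)
    print(f"d={d:4d}  empirical={np.mean(errs):6.2f}  sigma^2 n/(d-n-1)={theory:6.2f}")
```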

3.4 Proof of Propositions 3.1 and 3.1

In this section we detail our analysis of the three auxiliary optimization problems. We start with a remark that considerably simplifies notation: their definitions are unchanged if the auxiliary Gaussian vector is replaced by the reordered vector of its absolute order statistics, i.e., by the vector whose i-th entry is the i-th largest absolute value among the original entries. Throughout this proof, we condition on the event where this reordered vector has distinct and positive components, which holds with probability one. Henceforth, unless specified otherwise, references to the auxiliary optimization problems refer to the equivalent problems in which the Gaussian vector is replaced by its reordered version. Also recall the fixed choice of the small parameter made after Proposition 3.1. The key steps in the proof of Propositions 3.1 and 3.1 are as follows.

  • For each of the three auxiliary optimization problems, we show that the argmax (or argmin) lies on a parametric path (which depends on the reordered Gaussian vector). Hence we can restate the three problems as optimization problems over a scalar parameter and a scale variable. (Section 3.4.1)

  • Still conditioning on the event above, we explicitly characterize the parametric path. In particular, we show that it is piecewise linear with breakpoints having closed-form expressions. (Section 3.4.2)

  • Thanks to concentration properties of the reordered Gaussian vector (Section 3.4.3), evaluating the path at one of its breakpoints yields the desired high-probability upper bound for the localization quantity. (Section 3.4.4)

  • A fine-grained study of the intersection of the path with the constraint sets of the two uniform-convergence problems, together with the same concentration properties, yields the desired high-probability bounds for the uniform-convergence quantities. (Section 3.4.5)

3.4.1 Parametrizing the argmax/argmin

Note that in the three auxiliary optimization problems, the optimization variable only appears through a small number of scalar quantities. Thus, we can add a normalization constraint without affecting the optimal solution. We will show that a parametric path can be used to parametrize the solutions of the optimization problems, where the path is defined by

(14)

Specifically, the following key lemma states that (at least one element of) the argmax/argmin of each of the three auxiliary problems lies on this path, for some value of the scalar parameter and of the scale variable. This allows us to reduce the optimization problems to a single scalar variable and a scale variable. We have:

  1. The variable in the first auxiliary problem can equivalently be constrained to lie on the parametric path, i.e.,

    ()
  2. The variable in the second auxiliary problem can equivalently be constrained to lie on the parametric path, i.e.,

    ()
  3. The variable in the third auxiliary problem can equivalently be constrained to lie on the parametric path, i.e.,

    ()

The proof of the lemma is given in Appendix A. To give an intuitive explanation for the equivalence, consider a penalized version of the constrained problem. For fixed values of the penalty parameters, minimizing the penalized objective is equivalent to a problem whose minimizer is exactly of the parametric form above. Hence, we can expect the argmin to be attained on the path.

3.4.2 Characterizing the parametric path

As is defined as the optimal solution of a convex optimization problem, we are able to obtain a closed-form expression, by a straightforward application of Lagrangian duality. The only other non-trivial ingredient is to notice that, at optimality, the inequality constraint necessarily holds with equality. Denote the vector equal to on the first components and on the last , and similarly for . Define, for any integer ,

(15)

Note that . Let . [] For all , denote the unique integer in such that . Then where the dual variables and are given by

(16)
(17)

The proof of the lemma is given in Appendix A.

3.4.3 Concentration of norms of the reordered Gaussian vector

Given the explicit characterization of the parametric path, we now study its breakpoints, and more precisely we estimate the relevant norms at the breakpoints as a function of their index. Namely, we prove the following concentration result, stated in terms of a suitably defined quantile of the standard normal distribution. There exist universal constants such that, for n and d as in Theorem 2,

(18)

with high probability over the draws of the Gaussian vector. This proposition relies on and extends the literature studying concentration of order statistics [boucheron_2012, li_2020]. An important ingredient for the proof of the proposition is the following lemma, which gives a tight approximation for the quantile. There exist universal constants such that, for all admissible indices, the quantile satisfies

(19)

where

(20)

Furthermore, the constants can be chosen explicitly so that the approximation is sufficiently tight for our purposes. The proofs of Proposition 3.4.3 and of Lemma 3.4.3 are given in Appendix A.

3.4.4 Localization: Proof of Proposition 3.1 (upper bound for the localization quantity)

We now use the concentration bounds of Proposition 3.4.3 to obtain a high-probability upper bound for . Recall from Lemma 14 that it is given by ():

(21)

We may rewrite the constraint as

(22)
(23)

Thus minimizing over shows that . Since we want to upper-bound this minimum, it is sufficient to further restrict the optimization problem by the constraint , yielding

(24)

We now show that for the choice , the constraint is satisfied with high probability, and we give a high-probability estimate for the resulting upper bound . See Remark 3.4.4 below for a justification of this choice. For the remainder of the proof of Proposition 3.1, we condition on the event where the inequalities in Equation (18) hold for . By the concentration bound for , a sufficient condition for the choice to be feasible is

(25)

with some universal constant. Now by Lemma 3.4.3, and we chose . So we can choose sufficiently large such that the above inequality holds for any with . Moreover, by the concentration bounds for and , is upper-bounded by

(26)

Furthermore, by Lemma 3.4.3, so for a universal constant . This concludes the proof of Proposition 3.1. Let us informally justify why we can expect the choice