Understanding overfitting peaks in generalization error: Analytical risk curves for l_2 and l_1 penalized interpolation

06/09/2019 · Partha P. Mitra, et al. · Cold Spring Harbor Laboratory

Traditionally in regression one minimizes the number of fitting parameters or uses smoothing/regularization to trade off training (TE) and generalization error (GE). Driving TE to zero by increasing fitting degrees of freedom (dof) is expected to increase GE. However, modern big-data approaches, including deep nets, seem to over-parametrize and send TE to zero (data interpolation) without impacting GE. Overparametrization has the benefit that global minima of the empirical loss function proliferate and become easier to find. These phenomena have drawn theoretical attention. Regression and classification algorithms have been shown that interpolate data but also generalize optimally. An interesting related phenomenon has been noted: the existence of non-monotonic risk curves, with a peak in GE with increasing dof. It was suggested that this peak separates a classical regime from a modern regime where over-parametrization improves performance. Similar over-fitting peaks were reported previously (statistical physics approach to learning) and attributed to increased fitting model flexibility. We introduce a generative and fitting model pair ("Misparametrized Sparse Regression" or MiSpaR) and show that the overfitting peak can be dissociated from the point at which the fitting function gains enough dofs to match the data generative model and thus provides good generalization. This complicates the interpretation of overfitting peaks as separating a "classical" from a "modern" regime. Data interpolation itself cannot guarantee good generalization: we need to study the interpolation with different penalty terms. We present analytical formulae for GE curves for MiSpaR with l_2 and l_1 penalties in the interpolating limit λ → 0. These risk curves exhibit important differences and help elucidate the underlying phenomena.


1 Introduction

Modern machine learning has two salient characteristics: large numbers of measurements, and non-linear parametric models with very many fitting parameters, with both numbers very large for many applications. Fitting data with such large numbers of parameters stands in contrast to the inductive scientific process, where models with small numbers of parameters are normative. Nevertheless, these large-parameter models are successful in dealing with real-life complexity, raising interesting theoretical questions about the generalization ability of models with large numbers of parameters, particularly in the overparametrized regime.

Classical statistical procedures trade training (TE) and generalization error (GE) by controlling the model complexity. Sending TE to zero (for noisy data) is expected to increase GE [12]. However, deep nets seem to over-parametrize and drive TE to zero (data interpolation) while maintaining good GE [26, 4]. Over-parametrization has the benefit that global minima of the empirical loss function proliferate and become easier to find [17, 22]. These observations have led to recent theoretical activity [3, 4, 16]. Regression and classification algorithms have been shown that interpolate data but also generalize optimally [3]. An interesting related phenomenon has been noted: the existence of a peak in GE with increasing fitting model complexity [2, 1, 10, 11]. In [2] it was suggested that this peak separates a classical regime from a modern (interpolating) regime where over-parametrization improves performance. While the presence of a peak in the GE curve is in stark contrast with the classical statistical folk wisdom, where the GE curve is thought to be U-shaped, understanding the significance of such peaks is an open question, and motivates the current paper. Parenthetically, similar over-fitting peaks were reported almost twenty years ago (cf. the statistical physics approach to learning) and attributed to increased fitting model entropy near the peak (see in particular Figs. 4.3 and 5.2 in [8]).
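The peak itself is easy to reproduce numerically with ordinary least squares. The sketch below is a generic illustration of the phenomenon, not the MiSpaR model of this paper; the sizes, noise level and dof grid are arbitrary choices. It fits a minimum-norm least-squares model (the λ → 0 ridge limit) with an increasing number of fitting degrees of freedom, and exhibits the GE peak at the interpolation point, where the dof count equals the number of measurements.

```python
import numpy as np

def avg_ge(n_train, n_test, p_gen, dof, noise, rng, trials=50):
    """Average test MSE of the minimum-norm least-squares fit that
    uses only the first `dof` of the p_gen generative features."""
    errs = []
    for _ in range(trials):
        beta = rng.standard_normal(p_gen)              # generative parameters
        X = rng.standard_normal((n_train + n_test, p_gen))
        y = X @ beta + noise * rng.standard_normal(n_train + n_test)
        # pseudoinverse gives the minimum-norm interpolator (lambda -> 0)
        b_hat = np.linalg.pinv(X[:n_train, :dof]) @ y[:n_train]
        errs.append(np.mean((X[n_train:, :dof] @ b_hat - y[n_train:]) ** 2))
    return float(np.mean(errs))

rng = np.random.default_rng(0)
n, p = 20, 40
ge = {dof: avg_ge(n, 200, p, dof, 0.5, rng) for dof in (5, 15, 20, 25, 40)}
# ge peaks at dof == n (the interpolation point) and falls on either side
```

The peak at dof = n arises because the smallest singular value of a square random design matrix is typically tiny, so the fit amplifies noise; adding more dofs past the peak shrinks the minimum-norm solution and reduces GE again.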


Figure 1: Numerical simulations of the MiSpaR model inferred using l_1 and l_2 penalties are compared with theoretical TE and GE curves for regularized regression. Panels (A,B) and the zooms (D,E) correspond to the interpolation limit λ → 0. Plots in (C,F) show a theory-simulation comparison just for the l_1 case at a finite value of λ. The numerical values are averaged over 100 draws of the design matrix, parameters and measurement noise; the rows of the design matrix are sub-sampled in the under-parametrized regime. Standard errors across the 100 trials are shown. Note that at large overparametrization the GE values for the l_1 case are close to zero, whereas the values for the l_2 penalized case can be much larger. Note also that the overfitting peaks for the two penalties differ markedly in size, and that the region of good generalization begins where the fitting model first captures the non-zero generative parameters, which can be to the left or right of the overfitting peak depending on the value of the undersampling parameter. For a single draw of the design matrix one still obtains agreement between the theoretical curves and simulations due to self-averaging, although there is greater scatter (Fig. 2).

1.1 Summary of Results

  1. We introduce a model, Misparametrized (or Misspecified) Sparse Regression (MiSpaR), which separates the number of measurements, the number of model parameters (which can be controlled for sparsity by a sparsity parameter), and the number of fitting degrees of freedom.¹ (¹A similar misspecified model has been studied in [11] with l_2 regularization, but that paper did not study the effects of sparsity and l_1 penalized regression.)

  2. We obtain analytical expressions for the GE and TE curves for l_2 penalized regression in the "high-dimensional" asymptotic regime, keeping the relevant ratios of measurements and parameters fixed. We also present analytical expressions that permit computation of the GE for l_1 penalized regression, and obtain explicit expressions for TE and GE in the interpolating limit λ → 0.

  3. We show that the overfitting peak appears at the data interpolation point for both l_1 and l_2 penalized interpolation (λ near 0), but does not demarcate the point at which "good generalization" first occurs; the latter corresponds to the point at which the fitting function gains enough degrees of freedom to match the data generative model (Figures 1-3). The region of good generalization can start before or after the overfitting peak. The overfitting peak is suppressed at finite values of λ.

  4. For infinitely large overparametrization, generalization does not occur, for both l_1 and l_2 penalized interpolation. However, for small values of the sparsity parameter and of the measurement noise variance, there is a large range of overparametrization values where l_1 regularized interpolation generalizes well, but l_2 penalized interpolation generalizes poorly (Figure 1); the boundaries of this range are given analytically below. The reason for this difference is that in this regime the sparsity penalty is effective, and suppresses noise-driven mis-estimation of parameters for the l_1 penalty. This concretely demonstrates how the generalization properties of penalized interpolation depend strongly on the inductive bias, and are not properties of data interpolation per se.

  5. For and for , . In contrast, if is greater than a critical value that depends on , then for a range of overparameterization . The maximum overparametrization for which depends on . For small values of , . For , rises quadratically from zero ( for small ) and .

  6. For and , goes to zero linearly at ( for small). When , only at the single point . In this case goes to zero with a nontrivial power on the left, but rises quadratically on the right . For , for all values of .

2 Model: Misparametrized Sparse Regression

Usually in linear regression the same (known) design matrix is used both for data generation and for parameter inference. In MiSpaR the generative model has a fixed number of parameters, which generate the measurements, but the number of parameters in the inference model is allowed to vary freely, with fewer parameters corresponding to the under-parametrized and more parameters to the over-parametrized case. For the under-parametrized case, a truncated version of the design matrix is used for inference, whereas for the over-parametrized case, the design matrix is augmented with extra rows.
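The truncation/augmentation step can be sketched in a few lines. This is an illustrative construction, using the convention that parameters correspond to columns of an n × p design matrix (the text above phrases the augmentation in terms of rows, which is the transposed convention); the function name and sizes are hypothetical.

```python
import numpy as np

def fitting_design(X_gen, p_fit, rng):
    """Return the design matrix seen by the inference model.

    X_gen : (n, p) generative design matrix.
    p_fit < p  -> truncate (under-parametrized fit; unobserved predictors);
    p_fit > p  -> augment with fresh effect-free Gaussian predictors
                  (over-parametrized fit).
    """
    n, p = X_gen.shape
    if p_fit <= p:
        return X_gen[:, :p_fit]
    extra = rng.standard_normal((n, p_fit - p))  # effect-free predictors
    return np.hstack([X_gen, extra])

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
assert fitting_design(X, 4, rng).shape == (50, 4)    # under-parametrized
assert fitting_design(X, 30, rng).shape == (50, 30)  # over-parametrized
```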

In addition, we assume that the parameters in the generative model are sparse, and consider the effect of sparsity-inducing regularization in the interpolation limit. Combining misparametrization with sparsity is important to our study for two reasons

  • Dissociating data interpolation (which happens when the fitting degrees of freedom reach the number of measurements) from the regime where good generalization can occur (this is controlled by the undersampling as well as by the model sparsity).

  • We are able to study the effect of different regularization procedures on data interpolation in an analytically tractable manner and obtain analytical expressions for the generalization error.

To motivate this separation and the sparsity constraint, consider the following hypothetical scenario: suppose we are trying to make a diagnostic prediction characterized by some scalar parameter for each individual. We obtain a training sample of individuals, and for each individual measure a set of biological variables (e.g. age, weight, etc.) which we think might have diagnostic predictive value. We then proceed to fit a predictive model using these phenotypic variables (which will themselves stochastically vary from person to person, but can be measured for a new "test" person).

Now consider that the generative model is itself linear, but (i) only a fraction of these variables have any effect on the diagnosis, and (ii) there are latent or unmeasured variables that further subdivide the population into groups, where only a subset of the parameters have predictive value, with a different choice of parameters in each group. For example, in one such latent group height may have predictive value, and not in another. Neither the number of relevant parameters, nor which specific parameters could have non-zero values, are known in advance (and one might err on the side of caution in measuring extra phenotypic variables that may or may not impact the disease in question), although one might be able to prioritize some of the parameters due to prior scientific knowledge. Since we can now obtain data from large populations, it is tempting to measure a large number of biological variables per person; this is almost inevitable given low-cost genomics and advanced imaging techniques. Thus, the model studied here is well-motivated and relates to real-life scenarios.


Figure 2: A simulation with a single draw of the design matrix, with parameters otherwise corresponding to Fig. 1. Although considerable scatter is seen, there is qualitative correspondence between theory and simulation. Averaging over 100 such draws produces a better correspondence (Fig. 1).

2.1 Generative Model

We assume that the (known/measured) design variables are i.i.d. Gaussian distributed from one realization of the generative model to another, with a fixed variance.²

²Note that these choices are convenient, but could be relaxed. The calculations rely on the large-n asymptotics of random matrix theory, and therefore exhibit corresponding universality properties.

This choice of variance is important to fix normalization. Other choices have also been employed in the literature; this is important to keep in mind when comparing with literature formulae, where factors of n may need to be inserted appropriately to obtain a match.

The non-zero model parameters are drawn from a distribution which we assume to be Gaussian, as this permits closed-form evaluation of the integrals appearing in the l_1 case. Note that we use the term overparametrization to refer to the case where the fitting model has more parameters than the generative model, and undersampling to refer to the case where there are fewer measurements than generative model parameters.

2.2 Inference Model

The design matrix used for inference is mis-parametrized or mis-specified: under-specified (or partially observed) when the fitting model has fewer parameters than the generative model; over-specified, with extra, effect-free rows in the design matrix, when it has more.

In the remaining sections, we will generally not explicitly annotate the design matrix used in inference, since the usage is clear from context. Parameter inference is carried out by minimizing a penalized mean squared error.

Note that for the over-specified case, the model parameters are augmented by zero entries. We consider l_2 and l_1 penalties, and the interpolation limit is obtained by taking λ → 0⁺. For the l_2 penalty (ridge regression), the minimization has a closed-form solution. The training and generalization errors are defined as the normalized MSEs on the training and test sets. Note that the expectation is taken simultaneously over the parameter values, the design matrix and the measurement noise. Where necessary below for clarity, we explicitly separate out the averaging over the design matrix.
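The penalized objective and the two estimators can be sketched concretely. This is a generic numerical illustration, not the paper's analytical setup: `ridge_fit` uses the standard closed form, `lasso_fit` is a plain proximal-gradient (ISTA) iteration, the objective is normalized per sample (a choice made here for convenience), and the problem sizes are arbitrary.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/(2n))||y - Xb||^2 + (lam/2)||b||_2^2 (closed form)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

def soft(u, t):
    """Soft-thresholding: the proximal operator of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def lasso_fit(X, y, lam, iters=5000):
    """Minimize (1/(2n))||y - Xb||^2 + lam*||b||_1 by proximal gradient."""
    n = len(y)
    L = np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        b = soft(b - X.T @ (X @ b - y) / (n * L), lam / L)
    return b

def mse(X, y, b):
    return float(np.mean((y - X @ b) ** 2))

# In the over-specified regime (more fitting dofs than measurements),
# both penalized fits interpolate the training data as lam -> 0.
rng = np.random.default_rng(0)
n, p_fit = 10, 20
X = rng.standard_normal((n, p_fit))
beta = np.zeros(p_fit)
beta[:3] = rng.standard_normal(3)        # sparse generative parameters
y = X @ beta + 0.1 * rng.standard_normal(n)
te_ridge = mse(X, y, ridge_fit(X, y, 1e-10))   # TE -> 0 as lam -> 0
te_lasso = mse(X, y, lasso_fit(X, y, 1e-4))    # TE near 0 for small lam
```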

3 Risk Curves

The focus of this paper is to obtain exact analytical expressions for the risk, rather than bounds. We do this in the limit where the numbers of measurements and parameters all tend to infinity, while their ratios are held finite. Similar "thermodynamic" or "high-dimensional" limiting procedures are used in statistical physics, e.g. in the study of random matrices and spin-glass models in large spatial dimensions [21, 19]. Such limits are also well-studied in modern statistics [25] (for example to understand phase-transition phenomena in the LASSO algorithm [6]).

In this limit, we give analytical formulae for TE and GE with l_2 or ridge regularization. For l_1 regularization, explicit formulae are given in some parameter regimes. More generally, for the l_1 case we obtain a pair of simultaneous nonlinear equations in two variables, which can be solved numerically to obtain the GE. The nonlinear equations are given in closed form without hidden parameters and do not require integration.

3.1 Risk Curves: l_2 Formulae

The formulae are split into two cases: the underspecified case, where the fitting model has fewer parameters than the generative model, and the overspecified case, where it has more. In the former case, there are un-observed columns of the design matrix which correspond to generative model parameters that can be non-zero, but are missing from the fitting model. Due to the model setup, different parameters are non-zero for different measurements (recall the motivating example). Thus, these un-observed parameters contribute an effective additive noise, resulting in an increase in the effective noise variance.

In the overspecified case, there are extra rows in the design matrix, which correspond to parameters in the generative model which are always zero. Since these parameters appear in the fitting model, they can in general have non-zero inferred values (due to measurement noise), and contribute to the estimation variance and the generalization error.

3.1.1 Underspecified case

The training error is given by the following formulae:

The generalization error is given by the following formula:


Figure 3: The theoretical generalization error for l_1 penalized interpolating regression is shown as a function of the undersampling and overparametrization, with substantial sparsity and small additive noise. The noise peak at the data interpolation point appears as a vertical white line. It can be clearly seen that the starting point of the "good generalization" regime is dissociated from the data interpolation line, and starts instead at a parabolic curve visible in the figure, to the left of which one obtains small values of GE.

3.1.2 Overspecified case

The formulae for TE and GE for the overspecified case are obtained from the corresponding formulae above for the underspecified case by making the appropriate substitutions for the effective noise and sparsity. That this should be the case is intuitively obvious: in the overspecified case there are no unobserved parameters contributing to an effective noise; on the other hand, the model parameters are effectively more sparse when compared to the total number of fitting parameters, and it makes sense that the sparsity should be rescaled accordingly. This follows from the relevant derivation. We will not explicitly write out these formulae, since they are straightforward to obtain by making the substitutions mentioned.

Note that in both cases (overspecified and underspecified), TE and GE agree at the boundary between the two cases, as can be verified by taking the appropriate limit in the corresponding formulae above.

3.1.3 Interpolating limit

It is useful to collect together the formulae for TE and GE in the interpolating limit λ → 0.

3.2 Risk Curves: Derivation

First consider the underspecified case. The design matrix can be split into two parts, where the second part corresponds to parameters that are not in the fitting model; this second part contributes an effective noise.

Across different realizations of the generative model, the unobserved parameters vary with a fixed variance. Thus the only change from ordinary ridge regression with a matched number of parameters in the fitting model is the replacement of the noise variance with an effective variance.

The inferred parameter values are given by the standard ridge-regression solution. A little algebra then shows that the training error is given by the following expression, where the sum runs over the eigenvalues of the Wishart matrix formed from the design matrix:

(1)

Under the stated assumptions (including the asymptotic limits), the eigenvalues follow the Marchenko-Pastur distribution [18] across different realizations of the generative model. Importantly, sums such as in Eq. 1 are self-averaging, and even for a given realization of the generative model they can be replaced by the ensemble average for large enough n. The following formula is used to compute the necessary sums for the TE and GE (using the appropriate form of the summand for the corresponding cases):

(2)

Applying this formula to the expression for the training error and performing the relevant integral using the method of residues and contour integration, one obtains the formula presented earlier for the training error.
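The self-averaging step can be checked numerically: spectral sums of a single Wishart draw concentrate around the Marchenko-Pastur ensemble average. The sketch below is an illustration with arbitrary sizes; the closed-form expression is the standard Marchenko-Pastur Stieltjes transform for aspect ratio γ = p/n and unit-variance entries, which matches the normalization convention assumed here.

```python
import numpy as np

def mp_mean_resolvent(gamma, lam):
    """Marchenko-Pastur ensemble average of 1/(x + lam), i.e. the
    Stieltjes transform m(-lam) for the spectrum of (1/n) X^T X,
    gamma = p/n, unit-variance i.i.d. entries."""
    a = lam + 1.0 + gamma
    return (np.sqrt(a * a - 4.0 * gamma) - (1.0 - gamma) - lam) / (2.0 * gamma * lam)

rng = np.random.default_rng(2)
n, p, lam = 2000, 1000, 0.1                 # gamma = 0.5
X = rng.standard_normal((n, p))             # unit-variance entries
eigs = np.linalg.eigvalsh(X.T @ X / n)      # Wishart eigenvalues
empirical = float(np.mean(1.0 / (eigs + lam)))   # single-draw spectral sum
theory = mp_mean_resolvent(p / n, lam)
# empirical and theory agree to within a few percent for a single draw
```

As a sanity check on the branch choice, the expression reduces to 1/(1 + λ) as γ → 0, where the spectrum concentrates at 1.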

To compute the generalization error, one has to pick a new row of the design matrix. Only a subset of the parameters are used in the forward prediction, corresponding to a restricted portion of the parameter vector. Some algebra then leads to the following equation for GE:

(3)

Applying the Marchenko-Pastur distribution as above to compute the sum in the large-n limit leads to the GE formula given in the previous section. The formulae for the overspecified case follow a very similar derivation.

3.3 Risk Curves: l_1 Formulae

The formulae in this section are for a Gaussian distribution of the non-zero parameters. Some other forms of this distribution also lead to closed-form expressions. The considerations of this section will continue to hold qualitatively for more general distributions. Even better generalization is expected if the distribution is bounded away from zero, and especially if it has a gap region near zero where it vanishes throughout. However, these other choices of the distribution do not change the range of overparametrization values for which the GE vanishes.

3.3.1 Underspecified case

The generalization error with l_1 regularization is given by the expression below, where the unknowns have to be found by solving the following three equations, Eqs. 4-6, in three unknowns. One of these unknowns is the fraction of estimated parameters which are non-zero, a number which is constrained to lie between 0 and 1.

(4)
(5)
(6)

where the auxiliary quantities entering these equations are defined in terms of the standard error function erf.

3.3.2 Overspecified case

As for l_2 regularization, the formulae for the overspecified case can be obtained by making the corresponding substitutions for the effective noise and sparsity. This involves only the first two equations above (the remaining equations are unchanged):

(7)
(8)

3.3.3 Analytical expressions for GE at finite measurement noise

One can obtain analytical insights by considering special cases and limits. In the interpolating limit λ → 0⁺, Eq. 4 implies that one of the unknowns must go to zero. One of these possibilities, it follows from Eq. 6, requires the noise-free case. This is an interesting limit, as it corresponds to the well-known algorithmic phase transition for l_1 penalized regression [6]. We will consider it in the next section, but first we examine the finite-noise case.

For the considerations below, except where explicitly noted, we assume finite noise. Thus in the interpolating limit one must consider two cases, which we treat in turn: as can be seen below, the first case corresponds to the region to the left of the interpolation point and the second case to the region to its right.

Case 1:
In this case, from Eq. 4 or Eq. 7 it follows that all the fitted parameters are non-zero. It then follows from Eq. 5 and Eq. 8 that this limit produces a solution to the simultaneous equations only up to the interpolation point, and that for these values the generalization error is given by

(9)

Notably, in this limit, l_1 and l_2 regularization yield the same results; thus, there is no difference between l_1 and l_2 penalized interpolation (see Figure 1 for numerical confirmation). This is intuitively consistent with the observation that in this limit all the fitted parameters are non-zero, so that the sparsity constraint is not active; one does not expect it to have an extra beneficial effect as a result. Nevertheless, this constitutes an analytical result for the generalization error for l_1 penalized interpolation, and shows that the overfitting peak at the interpolation point also appears for the l_1 case. The behavior of GE to the left of the overfitting peak is identical for the l_1 and l_2 penalties. However, the behavior to the right of the overfitting peak is quite different for the two penalty terms, as can be seen from the considerations below and from Fig. 1.

Case 2:

This is the more interesting case, as both of the nonlinear equations remain operative. To obtain analytical insight one needs to take a further limit. Within the overparametrized regime, we consider slight overparametrization and large overparametrization separately.

Case 2.1: slight overparametrization
For small amounts of overparametrization, one can recover the behavior of GE by looking at the case where the amount of overparametrization is non-zero but small compared to 1. In this case, one can proceed by expanding the LHS of Eqs. 5,6 or 8,9 and retaining terms to linear order in the small quantities. Solving the resulting simultaneous linear equations and substituting into the GE expression, one obtains

(10)

Thus the noise peak is recovered in this expansion. Note that unlike the l_2 case, where we obtain an exact expression for GE, this expression is only approximately valid, but the approximation improves near the interpolation point (this is numerically confirmed in Figure 1). However, the situation is quite different for large overparametrization, as can be seen below.

Case 2.2: large overparametrization

We are particularly interested in the case of large overparametrization. We confine our attention to Eqs. 7,8, since for large overparametrization at fixed sparsity one is eventually in the overspecified case. First consider the finite-noise case. In this case, it follows from Eq. 7 that the fraction of non-zero estimated parameters must become small. More precisely, in the large-overparametrization limit Eq. 7 fixes the asymptotic rate at which this fraction decreases.

In this limit, one obtains from the corresponding formulae above the asymptotic behavior of the auxiliary quantities, which vanish as the overparametrization grows (although note that they do so slowly, only logarithmically; this will have an implication for the finite-overparametrization behavior, as can be seen below). After substitution into Eq. 8, one therefore obtains

(11)

Thus, in the limit of infinite overparametrization, the regularization ceases to be effective, and for sufficiently large overparametrization the inferred model can no longer generalize.

However, when the sparsity and the noise are small, there is an interesting regime of overparametrization values where the GE is small and good generalization is possible. In this regime, the fraction of non-zero estimated parameters is small while the relevant scaled variable remains large. Expanding the LHS of Eq. 7 in this regime, we obtain

(12)

This implies that , provided . The last condition implies that . To satisfy Eq.8 and maintain , one also requires to be small. Since one gets the additional condition .

Collecting these conditions together, we find that when the noise variance is small, there is significant sparsity and the degree of overparametrization is modest, the l_1 penalized interpolation provides much better generalization than the l_2 case. This can be seen numerically in Fig. 1. In fact, from this figure it would be difficult to predict that generalization eventually fails; however, the theoretical considerations above show that this must be the case for sufficiently large overparametrization. Nevertheless, the figure shows that for large overparametrization there can be a large difference between the generalization errors for interpolating regression, depending on which penalty term is used. This difference is cleanly demonstrated in the noise-free limit, as can be seen in the next section.
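This l_1/l_2 gap is easy to observe numerically. The sketch below is a generic illustration with arbitrary sizes, sparsity and noise, using a plain ISTA lasso solver rather than the paper's analytical machinery: with a sparse generative model, low noise, and modest overparametrization, a near-interpolating l_1 fit generalizes close to the noise floor while the minimum-norm (λ → 0 ridge) interpolator does not.

```python
import numpy as np

def soft(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def lasso_ista(X, y, lam, iters=8000):
    """Minimize (1/(2n))||y - Xb||^2 + lam*||b||_1 via proximal gradient."""
    n = len(y)
    L = np.linalg.norm(X, 2) ** 2 / n
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        b = soft(b - X.T @ (X @ b - y) / (n * L), lam / L)
    return b

rng = np.random.default_rng(3)
n, p_fit, k, sigma = 100, 400, 5, 0.1   # overparametrized, sparse, low noise
beta = np.zeros(p_fit)
beta[:k] = rng.standard_normal(k)       # k non-zero generative parameters
X = rng.standard_normal((n, p_fit))
y = X @ beta + sigma * rng.standard_normal(n)
Xte = rng.standard_normal((2000, p_fit))
yte = Xte @ beta + sigma * rng.standard_normal(2000)

b_l2 = np.linalg.pinv(X) @ y            # minimum-norm (ridge, lambda -> 0)
b_l1 = lasso_ista(X, y, lam=1e-2)       # near-interpolating lasso

ge_l2 = float(np.mean((Xte @ b_l2 - yte) ** 2))
ge_l1 = float(np.mean((Xte @ b_l1 - yte) ** 2))
# ge_l1 sits near the noise floor; ge_l2 is far larger in this regime
```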

3.3.4 Analytical expressions for GE with l_1 penalty in the noise-free case

This is an interesting limit, since in the absence of noise one can observe the algorithmic phase transition phenomena associated with l_1 penalized sparse regression [6], and there is a parameter range where GE vanishes, i.e. there is perfect generalization (in contrast with the l_2 case, where there is no such range). It allows for a clean demonstration of one of the main points of the paper: that the region of good generalization is demarcated by a sparsity-dependent threshold rather than by the interpolation point. The small-noise case considered earlier can be qualitatively understood in the noise-free limit, and an important phenomenon is observed, namely that there is a regime of large overparametrization where generalization again fails. We separately consider the underparametrized and overparametrized regimes.

Case 1 (underparametrized regime):

In this case, GE can be obtained by setting the noise to zero in Eq. 9. If the non-zero parameters outnumber the measurements throughout this regime, then the theta function is not operative, and one obtains the noise peak as the interpolation point is approached from the left. Although there is no additive noise, the underparametrization produces an effective noise, and this leads to the noise peak. However, if the measurements suffice to cover the non-zero parameters before the interpolation point is reached, then GE goes to zero at that point and stays zero for the remainder of the underparametrized regime; in this case there is no noise peak. Note, as before, that in this underparametrized regime the l_1 and l_2 results coincide.

Case 2 (overparametrized regime): In this regime, l_1 and l_2 penalized interpolation can produce dramatically different generalization behavior.

We first consider the l_2 case. Here, it follows from the formula in section 3.3 that the generalization error does not vanish: depending on the parameters, the noise peak may be present, with the generalization error reaching a minimum value and then increasing again with increasing overparametrization. In either case, the l_2 GE remains positive.

On the other hand, we will show below that if the overparametrization is below a critical value (to be defined below), then the l_1 GE vanishes. When this critical value is large (which is the case when the sparsity is small), there is a sizeable region of large overparametrization where the l_1 penalized GE is zero but the l_2 penalized GE is not (consistent with Fig. 1). This "gap" region differentiates l_1 and l_2 penalized interpolating regression, and shows that good generalization is not a property of regularized interpolation per se, but depends strongly on the method of interpolation.

In order to show the existence of the gap region, and to derive the formula for the critical overparametrization, we need to study Eqs. 4-8 after setting the noise variance to zero.

Case 2.1:

In this case, there is a noise peak even without additive noise, due to the effective noise. Near the interpolation point, from Eq. 10, the noise peak shape matches that of l_2 penalized interpolation. However, as the overparametrization increases, the effective noise vanishes. At that point Eqs. 4,5 are identical to the corresponding equations for ordinary l_1 penalized regression [24], and we expect a continuous phase transition where GE goes to zero.

From Eq. 6, since the noise is zero, the unknowns are constrained: as we approach the transition from the left, one of them diverges while, to satisfy Eq. 6, the other goes to zero. Expanding Eqs. 4,5 to leading order in the small quantity and taking this limit, we obtain the following equations (note that the auxiliary function is defined below Eq. 6):

(13)
(14)

From Eq. 14 it can be seen that in this limit the two small quantities must remain proportional, with a finite non-negative constant of proportionality (possibly zero). Thus we obtain the following equations that must be satisfied for the limit to exist:

(15)
(16)

The function in question is analytic and can be verified to be positive, with a minimum at a critical value of its argument. Thus, for the equations to be satisfied, the overparametrization must lie below a critical value, and no solution exists above it; below it, the GE vanishes, approaching zero continuously at the transition. The proportionality constant can be worked out explicitly.

Here we use the notation to denote that approaches from below. Note that as the slope diverges. This is due to the fact that exactly at , goes to zero as a power of that is smaller than 1. At from Eq.14 one has , so that and

This completes the analysis of this sub-case. Beyond it, one no longer has the effective noise term, and there are two cases. In the former case, GE remains finite at the boundary and, from an examination of the continuity of Eqs. 4,5 with Eqs. 7,8 at the boundary, continues to remain finite beyond it. In the latter case, however, one has a non-trivial solution with vanishing GE, given by the equations

(17)
(18)

Note that in this regime the earlier proportionality relation no longer has to hold, but the solution must still satisfy the constraint on the fraction of non-zero estimated parameters. Solving these equations simultaneously, one obtains the unknowns as functions of the overparametrization.

It is important to note that just after the transition, the RHS of Eq. 18 differs from the RHS of Eq. 16, since one term is missing. This has a significant consequence. The auxiliary function has a minimum at its critical point, and there are two solutions to Eq. 18. Further, examination of Eqs. 16 and 18 shows that at the transition the solution to Eqs. 15,16 sits at the critical point. Since the relevant quantity is a decreasing function of its argument, taking one branch of the solution to Eqs. 17,18 would produce an inadmissible value just to the right of the transition point. Therefore just to the right of the transition point there is a jump, and one must adopt the other solution branch. Thus the number of non-zero estimated parameters (for infinitesimally small λ) jumps discontinuously to a smaller value as the overparametrization crosses the transitional value.

As the overparametrization increases further, however, the solution evolves and eventually one reaches the point at which the critical condition is again attained. The point at which this happens is obtained by taking the corresponding limit in Eq. 17. This gives rise to the equation

(19)

Note that this equation has the same form as the earlier critical condition, with rescaled sparsity and undersampling variables. This makes sense, since the number of fitting parameters now plays the role of the number of parameters in the generative model which could potentially be non-zero. This equation helps explain why there is a transition when the overparametrization is large: as it grows, the rescaled sparsity and undersampling both become small while remaining proportional to each other, with a fixed ratio. As the overparametrization increases, the corresponding point moves towards the origin along a straight line. This line intersects the critical curve, giving the transition point at which one goes from the "perfect recovery" regime to the regime where recovery is not possible in the LASSO problem.
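The recovery transition invoked here can be illustrated directly. The following sketch is a generic check, not the paper's derivation: it approximates noiseless basis pursuit by a small-penalty lasso solved with plain ISTA, using illustrative sizes; the precise transition location is the phase-transition curve discussed above, which the sketch does not compute.

```python
import numpy as np

def soft(u, t):
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def lasso_ista(X, y, lam, iters=10000):
    """Minimize (1/(2n))||y - Xb||^2 + lam*||b||_1 via proximal gradient."""
    n = len(y)
    L = np.linalg.norm(X, 2) ** 2 / n
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        b = soft(b - X.T @ (X @ b - y) / (n * L), lam / L)
    return b

def recovery_error(n, p, k, rng):
    """Relative l_2 error recovering a k-sparse vector from n noiseless
    random measurements via small-lambda lasso (approximate basis pursuit)."""
    beta = np.zeros(p)
    beta[:k] = rng.standard_normal(k)
    X = rng.standard_normal((n, p))
    b = lasso_ista(X, X @ beta, lam=1e-2)
    return float(np.linalg.norm(b - beta) / np.linalg.norm(beta))

rng = np.random.default_rng(4)
err_above = recovery_error(120, 200, 10, rng)  # enough measurements: recovers
err_below = recovery_error(25, 200, 10, rng)   # too few: recovery fails
```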


Figure 4: Critical value of the overparametrization as a function of the sparsity (thick line), together with the approximate asymptotic form (thin line). Note that the plot is semi-logarithmic in base 10.

The equations giving the transitional value can be obtained by some algebra from Eqs. 17,18 after imposing the critical condition, and are given by