A Farewell to the Bias-Variance Tradeoff? An Overview of the Theory of Overparameterized Machine Learning

09/06/2021 · Yehuda Dar, et al. · Georgia Institute of Technology, Rice University

The rapid recent progress in machine learning (ML) has raised a number of scientific questions that challenge the longstanding dogma of the field. One of the most important riddles is the good empirical generalization of overparameterized models. Overparameterized models are excessively complex relative to the size of the training dataset, and consequently they perfectly fit (i.e., interpolate) the training data, which is usually noisy. Such interpolation of noisy data is traditionally associated with detrimental overfitting, and yet a wide range of interpolating models -- from simple linear models to deep neural networks -- have recently been observed to generalize extremely well on fresh test data. Indeed, the recently discovered double descent phenomenon has revealed that highly overparameterized models often improve over the best underparameterized model in test performance. Understanding learning in this overparameterized regime requires new theory and foundational empirical studies, even for the simplest case of the linear model. The underpinnings of this understanding have been laid in very recent analyses of overparameterized linear regression and related statistical learning tasks, which resulted in precise analytic characterizations of double descent. This paper provides a succinct overview of this emerging theory of overparameterized ML (henceforth abbreviated as TOPML) that explains these recent findings through a statistical signal processing perspective. We emphasize the unique aspects that define the TOPML research area as a subfield of modern ML theory and outline interesting open questions that remain.


1 Introduction

Deep learning techniques have revolutionized the way many engineering and scientific problems are addressed, establishing the data-driven approach as a leading choice for practical success. Contemporary deep learning methods are extreme, extensively developed versions of classical machine learning (ML) settings that were previously restricted by limited computational resources and insufficient availability of training data. Established practice today is to learn a highly complex deep neural network (DNN) from a set of training examples that, while itself large, is quite small relative to the number of parameters in the DNN. While such overparameterized DNNs comprise the state-of-the-art in ML practice, the fundamental reasons for this practical success remain unclear. Particularly mysterious are two empirical observations: i) the explicit benefit (in generalization) of adding more parameters into the model, and ii) the ability of these models to generalize well despite perfectly fitting even noisy training data. These observations endure across diverse architectures in modern ML — while they were first made for complex, state-of-the-art DNNs (Neyshabur et al., 2014; Zhang et al., 2017), they have since been unearthed in far simpler model families including wide neural networks, kernel methods, and even linear models (Belkin et al., 2018b; Spigler et al., 2019; Geiger et al., 2020; Belkin et al., 2019a).

In this paper, we survey the recently developed theory of overparameterized machine learning (henceforth abbreviated as TOPML) that establishes foundational mathematical principles underlying phenomena related to interpolation (i.e., perfect fitting) of training data. We will shortly provide a formal definition of overparameterized ML, but describe here some salient properties that a model must satisfy to qualify as overparameterized. First, such a model must be highly complex in the sense that its number of independently tunable parameters[1] is significantly higher than the number of examples in the training dataset. Second, such a model must not be explicitly regularized in any way. DNNs are popular instances of overparameterized models that are usually trained without explicit regularization (see, e.g., Neyshabur et al., 2014; Zhang et al., 2017). This combination of overparameterization and lack of explicit regularization yields a learned model that interpolates the training examples and therefore achieves zero training error on any training dataset. The training data is usually considered to be noisy realizations from an underlying data class (i.e., a noisy data model). Hence, interpolating models perfectly fit both the underlying data and the noise in their training examples. Conventional statistical learning has always associated such perfect fitting of noise with poor generalization performance (e.g., Friedman et al., 2001, p. 194); hence, it is remarkable that these interpolating solutions often generalize well to new test data beyond the training dataset.

[1] The number of parameters is an accepted measure of learned model complexity for simple cases such as ordinary least squares regression in the underparameterized regime, and it underlies classical complexity measures like Akaike's information criterion (Akaike, 1998). For overparameterized models, the correct definition of learned model complexity is an open question at the heart of TOPML research. See Section 7.3 for a detailed exposition.

Figure 1: Double descent of test errors (i.e., generalization errors) with respect to the complexity of the learned model. TOPML studies often consider settings in which the learned model complexity is expressed as the number of (independently tunable) parameters in the model. In this qualitative demonstration, the global minimum of the test error is achieved by maximal overparameterization.

The observation that overparameterized models interpolate noise and yet generalize well was first elucidated in pioneering experiments by Neyshabur et al. (2014); Zhang et al. (2017). These findings sparked widespread conversation across deep learning practitioners and theorists, and inspired several new research directions in deep learning theory. However, meaningful progress in understanding the inner workings of overparameterized DNNs has remained elusive,[2] owing to their multiple challenging aspects. Indeed, the way for recent TOPML research was largely paved by the discovery of similar empirical behavior in far simpler parametric model families (Belkin et al., 2019a; Geiger et al., 2020; Spigler et al., 2019; Advani et al., 2020).

[2] For example, the original paper (Zhang et al., 2017) now appears in a 2021 version of Communications of the ACM (Zhang et al., 2021). The latter survey highlights that almost all of the questions posed in the original paper remain open.

In particular, Belkin et al. (2019a); Spigler et al. (2019) explicitly evaluated the test error of these model families as a function of the number of tunable parameters and showed that it exhibits a remarkable double descent behavior (pictured in Figure 1). The first descent occurs in the underparameterized regime and is a consequence of the classical bias–variance tradeoff; indeed, the test error peaks when the learned model first becomes sufficiently complex to interpolate the training data. More unusual is the behavior in the overparameterized regime: there, the test error is observed to decrease monotonically with the number of parameters, forming the second descent in the double descent curve. The double descent phenomenon suggests that the classical bias–variance tradeoff, a cornerstone of conventional ML, is predictive only in the underparameterized regime where the learned model is not sufficiently complex to interpolate the training data. A fascinating implication of the double descent phenomenon is that the global minimum of the generalization error can be achieved by a highly overparameterized model even without explicit regularization, and despite perfect fitting of noisy training data.

Somewhat surprisingly, a good explanation for the double descent behavior did not exist until recently even for the simplest case of the overparameterized linear model; for example, classical theories that aim to examine the optimal number of variables in a linear regression task (e.g., Breiman and Freedman, 1983) only considered the underparameterized regime.[3] Particularly mysterious was the behavior of linear models that interpolate noise in data. The aforementioned double descent experiments (Belkin et al., 2019a; Spigler et al., 2019) inspired substantial TOPML research in the simple linear model.[4] The theoretical foundations of this research originate in the studies by Belkin et al. (2020); Bartlett et al. (2020); Hastie et al. (2019); Muthukumar et al. (2020b) in early 2019. These studies provide a precise analytical characterization of the test error of minimum-norm solutions to the task of overparameterized linear regression with the squared loss (henceforth abbreviated as LS regression). Of course, linear regression is a much simpler task than learning a deep neural network; however, it is a natural starting point for theory and is itself of great independent interest for a number of reasons. First, minimum $\ell_2$-norm solutions to linear regression do not include explicit regularization to avoid interpolating noise in training data, unlike more classical estimators such as ridge regression and the Lasso. Second, the closed form of the minimum $\ell_2$-norm solution to linear regression is equivalent to the solution obtained by gradient descent when the optimization process is initialized at zero (Engl et al., 1996); thus, this solution arises easily and ubiquitously in practice.

[3] This problem is usually called principal component regression. Recently, Xu and Hsu (2019) showed that double descent can occur with principal component regression in the overparameterized regime.

[4] The pioneering experiments by Belkin et al. (2018b) also inspired a parallel thread of mathematical research on harmless interpolation of noise by nonparametric and local methods, beginning with Belkin et al. (2018a, 2019b); Liang and Rakhlin (2020). These models are not explicitly parameterized and are not surveyed in this paper, but they possess possible connections to TOPML research of future interest.
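As an aside for readers who want to see this equivalence numerically: the sketch below (with arbitrary dimensions and synthetic data, not taken from the paper) runs plain gradient descent from a zero initialization on an overparameterized least-squares problem and compares the result to the pseudoinverse-based minimum $\ell_2$-norm solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                      # overparameterized: more features than examples
Phi = rng.standard_normal((n, p))   # feature matrix (arbitrary Gaussian features)
y = rng.standard_normal(n)          # training targets (their content is irrelevant here)

# Gradient descent on the squared loss, initialized at zero.
theta = np.zeros(p)
step = 1.0 / np.linalg.norm(Phi, 2) ** 2   # safe step size: 1 / largest eigenvalue of Phi^T Phi
for _ in range(50_000):
    theta -= step * Phi.T @ (Phi @ theta - y)

theta_min_norm = np.linalg.pinv(Phi) @ y   # closed-form minimum l2-norm interpolator

print(np.max(np.abs(Phi @ theta - y)))          # ~0: the training data is interpolated
print(np.max(np.abs(theta - theta_min_norm)))   # ~0: GD from zero converges to the min-norm solution
```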

1.1 Contents of this paper

In this paper, we survey the emerging field of TOPML research with a principal focus on foundational principles developed in the past few years. Compared to other recent surveys (Bartlett et al., 2021; Belkin, 2021), we take a more elementary signal processing perspective to elucidate these principles. Formally, we define the TOPML research area as the sub-field of ML theory where

  1. there is clear consideration of exact or near interpolation of training data

  2. the learned model complexity is high with respect to the training dataset size. Note that the complexity of the learned model is typically affected by (implicit or explicit) regularization aspects as a consequence of the learning process.

Importantly, this definition highlights that while TOPML was inspired by observations in deep learning, several aspects of deep learning theory do not involve overparameterization. More strikingly, TOPML is relevant to diverse model families other than DNNs.

The first studies of TOPML were conducted for the linear regression task; accordingly, much of our treatment centers around a comprehensive survey of overparameterized linear regression. However, TOPML goes well beyond the linear regression task. Overparameterization naturally arises in diverse ML tasks, such as classification (e.g., Muthukumar et al., 2020a), subspace learning for dimensionality reduction (Dar et al., 2020), data generation (Luzi et al., 2021), and dictionary learning for sparse representations (Sulam et al., 2020). In addition, overparameterization arises in various learning settings that are more complex than elementary fully supervised learning: unsupervised and semi-supervised learning (Dar et al., 2020), transfer learning (Dar and Baraniuk, 2020), pruning of learned models (Chang et al., 2021), and others. We also survey recent work on these topics.

This paper is organized as follows. In Section 2 we introduce the basics of interpolating solutions in overparameterized learning, as a machine learning domain that is outside the scope of the classical bias–variance tradeoff. In Section 3 we overview recent results on overparameterized regression. Here, we provide intuitive explanations of the fundamentals of overparameterized learning by taking a signal processing perspective. In Section 4 we review the state-of-the-art findings on overparameterized classification. In Section 5 we overview recent work on overparameterized subspace learning. In Section 6 we examine recent research on overparameterized learning problems beyond regression and classification. In Section 7 we discuss the main open questions in the theory of overparameterized ML.

2 Beyond the classical bias–variance tradeoff: The realm of interpolating solutions

We start by setting up basic notation and definitions for both underparameterized and overparameterized models. Consider a supervised learning setting with $n$ training examples in the dataset $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$. The examples in $\mathcal{D}$ reflect an unknown functional relationship $f$ that, in its ideal, clean form, maps a $d$-dimensional input vector $\mathbf{x} \in \mathbb{R}^d$ to an output

$$y = f(\mathbf{x}), \tag{1}$$

where the output domain $\mathcal{Y}$ depends on the specific problem (e.g., $\mathcal{Y} = \mathbb{R}$ in regression with scalar response values, and $\mathcal{Y} = \{-1, +1\}$ in binary classification). The mathematical notations and analysis in this paper consider a one-dimensional output domain $\mathcal{Y}$, unless otherwise specified (e.g., in Section 5). One can also perceive $f$ as a signal that should be estimated from the measurements in $\mathcal{D}$. Importantly, it is commonly the case that the output is degraded by noise. For example, a popular noisy data model for regression extends (1) into

$$y = f(\mathbf{x}) + \epsilon, \tag{2}$$

where $\epsilon$ is a scalar-valued noise term that has zero mean and variance $\sigma^2$. Moreover, the noise $\epsilon$ is also independent of the input $\mathbf{x}$.

The input vector $\mathbf{x}$ can be perceived as a random vector that, together with the unknown mapping $f$ and the relevant noise model, induces a probability distribution $P$ over $\mathbb{R}^d \times \mathcal{Y}$ for the input-output pair $(\mathbf{x}, y)$. Moreover, the $n$ training examples in $\mathcal{D}$ are usually considered to be i.i.d. random draws from $P$.

The goal of supervised learning is to provide a mapping $\widehat{f}$ such that $\widehat{f}(\mathbf{x})$ constitutes a good estimate of the true output $y$ for a new sample $(\mathbf{x}, y) \sim P$. This mapping is learned from the training examples in the dataset $\mathcal{D}$; consequently, a mapping that operates well on a new $(\mathbf{x}, y)$ beyond the training dataset is said to generalize well. Let us now formulate the learning process and its performance evaluation. Consider a loss function $L : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$ that evaluates the distance, or error, between two elements in the output domain. Then, the learning of $\widehat{f}$ is done by minimizing the training error given by

$$\mathcal{E}_{\rm train}(f) \;\triangleq\; \frac{1}{n} \sum_{i=1}^{n} L\big(f(\mathbf{x}_i),\, y_i\big). \tag{3}$$

The training error is simply the empirical average (over the training data) of the error in estimating an output $y_i$ given an input $\mathbf{x}_i$. The search for the mapping that minimizes $\mathcal{E}_{\rm train}$ is done within a limited set of mappings $\mathcal{F}$ that are induced by specific computational architectures. For example, in linear regression with scalar-valued response, $\mathcal{F}$ denotes the set of all the linear mappings from $\mathbb{R}^d$ to $\mathbb{R}$. Accordingly, the training process can be written as

$$\widehat{f} \;\in\; \mathop{\rm arg\,min}_{f \in \mathcal{F}} \; \mathcal{E}_{\rm train}(f), \tag{4}$$

where $\widehat{f}$ denotes the mapping with minimal training error among the constrained set of mappings $\mathcal{F}$. The generalization performance of a mapping $f$ is evaluated using the test error

$$\mathcal{E}_{\rm test}(f) \;\triangleq\; \mathbb{E}_{(\mathbf{x}, y) \sim P}\big[ L\big(f(\mathbf{x}),\, y\big) \big], \tag{5}$$

where $(\mathbf{x}, y) \sim P$ are random test data. The best generalization performance corresponds to the lowest test error, which is achieved by the optimal mapping

$$f^{*} \;\in\; \mathop{\rm arg\,min}_{f} \; \mathcal{E}_{\rm test}(f). \tag{6}$$

Note that the optimal mapping $f^{*}$ as defined above does not posit any restrictions on the function class. If the solution space of the constrained optimization problem in (4) includes the optimal mapping $f^{*}$, the learning architecture is considered well specified. Otherwise, the training procedure cannot possibly induce the optimal mapping $f^{*}$, and the learning architecture is said to be misspecified.

The learned mapping $\widehat{f}$ depends on the specific training dataset $\mathcal{D}$ that was used in the training procedure (4). If we use a larger class of mappings $\mathcal{F}$ for training, we naturally expect the training error to improve. However, what we are really interested in is the mapping's performance on independently chosen test data, which is unavailable at training time. The corresponding test error, $\mathcal{E}_{\rm test}(\widehat{f})$, is also known as the generalization error in statistical learning theory, as indeed this error reflects performance on "unseen" examples beyond the training dataset. For ease of exposition, we will consider the expectation of the test error over the training dataset, denoted by $\mathbb{E}_{\mathcal{D}}\big[\mathcal{E}_{\rm test}(\widehat{f})\big]$, throughout this paper. We note that several of the theoretical analyses that we survey in fact provide expressions for the test error $\mathcal{E}_{\rm test}(\widehat{f})$ that hold with high probability over the training dataset (recall that both the inputs $\{\mathbf{x}_i\}$ and the output noise are random; in fact, they are independently and identically distributed across $i = 1, \ldots, n$).[5] See the discussion at the end of Section 3.1 for further details on the nature of these high-probability expressions.

[5] As will become clear from the type of examples and results provided in this paper, the overparameterized regime necessitates a consideration of random design on training data to make the idea of good generalization even possible. Traditional analyses involving underparameterization or explicit regularization characterize the error of random design in terms of the fixed design (training) error, which is typically non-zero. In the overparameterized regime, the fixed design error is typically zero, but little can be said about the test error solely from this fact. See Section 7.1 and the discussions on uniform convergence in the survey paper (Belkin, 2021) for further discussion.

The choice of loss function to evaluate training and test error involves several nuances, and it is both context- and task-dependent.[6] The simplest case is a regression task (i.e., real-valued output space $\mathcal{Y} = \mathbb{R}$), where we focus on the squared-loss error function for both training and test data for simplicity. Such regression tasks possess the following important property: the optimal mapping $f^{*}$ as defined in Equation (6) is the conditional mean of $y$ given $\mathbf{x}$, i.e., $f^{*}(\mathbf{x}) = \mathbb{E}[y \,|\, \mathbf{x}]$ for both data models in (1) and (2). Moreover, the expected test error has the following bias-variance decomposition, which is well-suited for our random design setting (similar decompositions are given, e.g., by Geman et al. (1992); Yang et al. (2020)):

$$\mathbb{E}_{\mathcal{D}}\big[\mathcal{E}_{\rm test}(\widehat{f})\big] \;=\; \mathrm{B}^2 + \mathrm{V} + \mathcal{E}_{\rm irreducible}. \tag{7}$$

Above, the squared bias term is defined as

$$\mathrm{B}^2 \;\triangleq\; \mathbb{E}_{\mathbf{x}}\Big[\big(\mathbb{E}_{\mathcal{D}}[\widehat{f}(\mathbf{x})] - f^{*}(\mathbf{x})\big)^2\Big], \tag{8}$$

which is the expected squared difference between the estimates produced by the learned and the optimal mappings (where the learned estimate is under expectation with respect to the training dataset). For several popular estimators used in practice, such as the least squares estimator (which is, in fact, the maximum likelihood estimator under Gaussian noise), the expectation of the estimator is equal to the best fit of the data within the class of mappings $\mathcal{F}$; therefore, the bias can be thought of in these simple cases as the approximation-theoretic gap between the performance achieved by this best-in-class model and the optimal model given by $f^{*}$. The variance term is defined as

$$\mathrm{V} \;\triangleq\; \mathbb{E}_{\mathbf{x}}\Big[\mathbb{E}_{\mathcal{D}}\Big[\big(\widehat{f}(\mathbf{x}) - \mathbb{E}_{\mathcal{D}}[\widehat{f}(\mathbf{x})]\big)^2\Big]\Big], \tag{9}$$

which reflects how much the estimate can fluctuate due to the dataset $\mathcal{D}$ used for training. The irreducible error term is defined as

$$\mathcal{E}_{\rm irreducible} \;\triangleq\; \mathbb{E}_{\mathbf{x}, y}\Big[\big(y - f^{*}(\mathbf{x})\big)^2\Big], \tag{10}$$

which quantifies the portion of the test error that does not depend on the learned mapping (or the training dataset). The model in (2) implies that $f^{*}(\mathbf{x}) = f(\mathbf{x})$ and $\mathcal{E}_{\rm irreducible} = \sigma^2$.

[6] The choice is especially interesting for the classification task; for interesting classical and modern discussions on this topic, see (Bartlett et al., 2006; Zhang, 2004; Ben-David et al., 2012; Muthukumar et al., 2020a; Hsu et al., 2021).

The test error decomposition in (7) demonstrates a prominent concept in conventional statistical learning: the bias-variance tradeoff as a function of the "complexity" of the function class (or set of mappings) $\mathcal{F}$. The idea is that increased model complexity will decrease the bias, as better approximations of $f^{*}$ can be obtained, but will increase the variance, as the magnitude of fluctuations around the best-in-class model increases. Accordingly, the function class is chosen to minimize the sum of the squared bias and the variance; in other words, to optimize the bias-variance tradeoff. Traditionally, this measure of complexity is given by the number of free parameters in the model.[7] In the underparameterized regime, where the learned model complexity is insufficient to interpolate the training data, this bias-variance tradeoff is ubiquitous in examples across model families and tasks. However, as we will see, the picture is significantly different in the overparameterized regime, where the learned model complexity is sufficiently high to allow interpolation of the training data.

[7] The number of free parameters is an intuitive measure of the worst-case capacity of a model class, and it has seen fundamental relationships with model complexity in learning theory (Vapnik, 2013) and classical statistics (Akaike, 1998) alike. In the absence of additional structural assumptions on the model or data distribution, this capacity measure is tight and can be matched by fundamental limits on performance. There are several situations in which the number of free parameters is not reflective of the true model complexity; these involve additional structure posited both on the true function (Candès et al., 2006; Bickel et al., 2009) and the data distribution (Bartlett et al., 2005). As we will see throughout this paper, these situations are intricately tied to the double descent phenomenon.
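To make the decomposition in (7)-(10) concrete, here is a small Monte Carlo sketch. The dimensions, the well-specified linear data model, and the choice of ordinary least squares as the estimator are illustrative assumptions, not taken from the paper; the point is only to show how the bias, variance, and irreducible error can be estimated by averaging over many freshly drawn training sets.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 10, 40, 0.5
beta = rng.standard_normal(d)                 # true parameter: f*(x) = beta^T x

def draw(m):                                  # draw m samples from the noisy data model (2)
    X = rng.standard_normal((m, d))
    return X, X @ beta + sigma * rng.standard_normal(m)

X_test, _ = draw(2000)
f_star = X_test @ beta                        # optimal predictions on the test inputs

# Train the (underparameterized) least-squares estimator on many independent training sets.
preds = []
for _ in range(500):
    X, y = draw(n)
    theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    preds.append(X_test @ theta_hat)
preds = np.array(preds)                       # shape: (training sets, test points)

mean_pred = preds.mean(axis=0)                # estimate of E_D[f_hat(x)] at each test point
bias2 = np.mean((mean_pred - f_star) ** 2)    # Eq. (8)
variance = np.mean((preds - mean_pred) ** 2)  # Eq. (9)
irreducible = sigma ** 2                      # Eq. (10) under model (2)

print(bias2, variance, irreducible)
print(bias2 + variance + irreducible)         # approximately the expected test error, Eq. (7)
```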

2.1 Underparameterized vs. overparameterized regimes: Examples for linear models in feature space

We now illustrate generalization performance using two examples of parametric linear function classes that are inspired by popular models in statistical signal processing. Both examples rely on transforming the $d$-dimensional input $\mathbf{x}$ into a $p$-dimensional feature vector $\boldsymbol{\phi}(\mathbf{x})$ that is used for the linear mapping to the output. In other words, the learned mapping corresponding to learned parameters $\widehat{\boldsymbol{\theta}} \in \mathbb{R}^p$ is given by

$$\widehat{f}(\mathbf{x}) \;=\; \widehat{\boldsymbol{\theta}}^{\top}\boldsymbol{\phi}(\mathbf{x}). \tag{11}$$

On the other hand, as we will now see, the examples differ in their composition of the feature map $\boldsymbol{\phi}(\cdot)$ as well as in the relationship between the input dimension $d$ and the feature dimension $p$.

Example 2.1 (Linear feature map). Consider an input domain $\mathbb{R}^d$ with $p \le d$, and a set of $p$ real orthonormal vectors $\{\mathbf{v}_j\}_{j=1}^{p} \subset \mathbb{R}^d$ that form a basis for a $p$-dimensional subspace of $\mathbb{R}^d$. We define $\mathcal{F}$ as a class of linear mappings with $p$ parameters, given by:

$$\mathcal{F} \;=\; \Big\{ f_{\boldsymbol{\theta}}(\mathbf{x}) = \sum_{j=1}^{p} \theta_j \mathbf{v}_j^{\top}\mathbf{x} \;:\; \boldsymbol{\theta} = [\theta_1, \ldots, \theta_p]^{\top} \in \mathbb{R}^p \Big\}. \tag{12}$$

In this case, we can also write $f_{\boldsymbol{\theta}}(\mathbf{x}) = \boldsymbol{\theta}^{\top}\mathbf{V}^{\top}\mathbf{x}$, where $\mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_p]$ is a $d \times p$ matrix with orthonormal columns. We define the $p$-dimensional feature vector of $\mathbf{x}$ as $\boldsymbol{\phi}(\mathbf{x}) = \mathbf{V}^{\top}\mathbf{x}$. Then, the training optimization problem in (4) constitutes least squares (LS) regression and is given by

$$\widehat{\boldsymbol{\theta}} \;\in\; \mathop{\rm arg\,min}_{\boldsymbol{\theta} \in \mathbb{R}^p} \; \frac{1}{n}\,\big\| \mathbf{y} - \boldsymbol{\Phi}\boldsymbol{\theta} \big\|_2^2. \tag{13}$$

Here, $\boldsymbol{\Phi} \in \mathbb{R}^{n \times p}$ is the matrix whose $(i, j)$ entry is given by $[\boldsymbol{\Phi}]_{i,j} = \mathbf{v}_j^{\top}\mathbf{x}_i$, and the training data output is given by the $n$-dimensional vector $\mathbf{y} = [y_1, \ldots, y_n]^{\top}$.

Example 2.2 (Nonlinear feature map). Consider the input domain $[0, 1]^d$, which is the $d$-dimensional unit cube. Let $\phi_j : [0, 1]^d \to \mathbb{R}$ for $j = 1, \ldots, p$ be a family of real-valued functions that are defined over the unit cube and orthonormal in function space, i.e., we have $\int_{[0,1]^d} \phi_j(\mathbf{x})\,\phi_k(\mathbf{x})\,d\mathbf{x} = \delta_{j,k}$, where $\delta_{j,k}$ denotes the Kronecker delta.[8] We define $\mathcal{F}$ as a class of nonlinear mappings with $p$ parameters, given by:

$$\mathcal{F} \;=\; \Big\{ f_{\boldsymbol{\theta}}(\mathbf{x}) = \sum_{j=1}^{p} \theta_j \phi_j(\mathbf{x}) \;:\; \boldsymbol{\theta} = [\theta_1, \ldots, \theta_p]^{\top} \in \mathbb{R}^p \Big\}. \tag{14}$$

In this case, we again define $\boldsymbol{\theta} = [\theta_1, \ldots, \theta_p]^{\top}$, but now the $p$-dimensional feature vector of $\mathbf{x}$ is defined as $\boldsymbol{\phi}(\mathbf{x}) = [\phi_1(\mathbf{x}), \ldots, \phi_p(\mathbf{x})]^{\top}$. The training optimization in (4) takes the form of

$$\widehat{\boldsymbol{\theta}} \;\in\; \mathop{\rm arg\,min}_{\boldsymbol{\theta} \in \mathbb{R}^p} \; \frac{1}{n}\,\big\| \mathbf{y} - \boldsymbol{\Phi}\boldsymbol{\theta} \big\|_2^2. \tag{15}$$

Here, $\boldsymbol{\Phi} \in \mathbb{R}^{n \times p}$ is the matrix whose $(i, j)$ entry is given by $[\boldsymbol{\Phi}]_{i,j} = \phi_j(\mathbf{x}_i)$, and the training data output is given by the $n$-dimensional vector $\mathbf{y} = [y_1, \ldots, y_n]^{\top}$.

[8] One can consider a more general definition of orthonormality where $\mathbb{E}_{\mathbf{x} \sim P_{\mathbf{x}}}[\phi_j(\mathbf{x})\,\phi_k(\mathbf{x})] = \delta_{j,k}$ and $P_{\mathbf{x}}$ is the distribution over the input.

For Examples 2.1 and 2.2, we also consider overparameterized settings where $p$ is sufficiently large (in particular, $p > n$) such that (13) and (15) have multiple solutions that correspond to zero training error (i.e., $\boldsymbol{\Phi}\boldsymbol{\theta} = \mathbf{y}$). Accordingly, we choose the minimum $\ell_2$-norm solution

$$\widehat{\boldsymbol{\theta}} \;=\; \boldsymbol{\Phi}^{+}\mathbf{y} \;=\; \mathop{\rm arg\,min}_{\boldsymbol{\theta} \in \mathbb{R}^p} \Big\{ \|\boldsymbol{\theta}\|_2 \;:\; \boldsymbol{\Phi}\boldsymbol{\theta} = \mathbf{y} \Big\}, \tag{16}$$

where $\boldsymbol{\Phi}^{+}$ denotes the pseudoinverse of the feature matrix $\boldsymbol{\Phi}$ (which is defined according to the relevant example above). The minimum $\ell_2$-norm solution is particularly simple to compute via gradient descent (Engl et al., 1996), and it is therefore a natural starting point for the study of interpolating solutions.
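As a quick numerical illustration of (16) (again with arbitrary synthetic data rather than the paper's setup), the sketch below computes the minimum $\ell_2$-norm interpolator via the pseudoinverse, confirms that it attains zero training error, and confirms that perturbing it along the null space of the feature matrix yields another interpolator with a strictly larger norm.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 15, 60                              # overparameterized regime: p > n
Phi = rng.standard_normal((n, p))          # feature matrix, as produced by Example 2.1 or 2.2
y = rng.standard_normal(n)                 # noisy training outputs

theta_hat = np.linalg.pinv(Phi) @ y        # Eq. (16): minimum l2-norm interpolator

print(np.max(np.abs(Phi @ theta_hat - y))) # ~0: zero training error (interpolation)

# Any other interpolator differs by a null-space vector of Phi and has a larger l2 norm.
_, _, Vt = np.linalg.svd(Phi)
null_dir = Vt[-1]                          # a direction with Phi @ null_dir ~ 0
theta_other = theta_hat + 3.0 * null_dir
print(np.max(np.abs(Phi @ theta_other - y)))                   # still ~0: also interpolates
print(np.linalg.norm(theta_hat), np.linalg.norm(theta_other))  # the pinv solution has the smaller norm
```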

Figures 2 and 3 demonstrate particular cases of solving LS regression using (16) and function classes as in Example 2.1. For both Figures 2 and 3, the data model (2) takes the form $y = \boldsymbol{\beta}^{\top}\mathbf{x} + \epsilon$, where $\boldsymbol{\beta} \in \mathbb{R}^d$ is an unknown, deterministic (non-random) parameter vector, $d$ is the input space dimension, and the noise component is $\epsilon \sim \mathcal{N}(0, \sigma^2)$. The number of training examples is $n$. We denote by $\mathbf{W}$ the $d \times d$ Discrete Cosine Transform (DCT) matrix (recall that the columns of $\mathbf{W}$ are orthonormal and span $\mathbb{R}^d$). Then, we consider the input data $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_{\mathbf{x}})$, where $\boldsymbol{\Sigma}_{\mathbf{x}} = \mathbf{W}\boldsymbol{\Lambda}\mathbf{W}^{\top}$ and $\boldsymbol{\Lambda}$ is a real diagonal matrix. By Example 2.1, $\boldsymbol{\phi}(\mathbf{x}) = \mathbf{V}^{\top}\mathbf{x}$ is the $p$-dimensional feature vector, where $\mathbf{V}$ is a real $d \times p$ matrix with orthonormal columns that extract the features for the learning process. Consequently, $\boldsymbol{\phi}(\mathbf{x}) \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_{\boldsymbol{\phi}})$, where $\boldsymbol{\Sigma}_{\boldsymbol{\phi}} = \mathbf{V}^{\top}\boldsymbol{\Sigma}_{\mathbf{x}}\mathbf{V}$. If the feature space for learning is based on DCT basis vectors (i.e., the columns of $\mathbf{V}$ are the first $p$ columns of $\mathbf{W}$), then the input feature covariance matrix $\boldsymbol{\Sigma}_{\boldsymbol{\phi}}$ is diagonal; in fact, its diagonal entries are exactly the first $p$ entries of the diagonal of $\boldsymbol{\Lambda}$, the diagonalized form of the data covariance matrix.

2.2 The double descent phenomenon

Figure 2 studies the relationship between model complexity (in number of parameters) and the test error induced by linear regression with a linear feature map (in line with Example 2.1). For both experiments, we fix the input dimension $d$ and the number of training examples $n$ (with $n < d$), and we vary the number of parameters $p$ between $1$ and $d$. We also consider $\boldsymbol{\beta}$ to be an approximately sparse parameter vector such that the majority of its energy is contained in the first few discrete cosine transform (DCT) features (see Figure 2 for an illustration). The input data covariance matrix is also considered to be anisotropic, as pictured in the heatmap of its values in Figure 2.

We consider two choices of linear feature maps, i.e., two choices of orthonormal bases: the discrete cosine transform (DCT) basis and the Hadamard basis. In the first case of DCT features, the model is well specified for $p = d$, and the bulk of the optimal model fit will involve the first few features as a consequence of the approximate sparsity of the true parameter $\boldsymbol{\beta}$ in the DCT basis (see Figure 2). As a consequence of this alignment, the feature covariance matrix $\boldsymbol{\Sigma}_{\boldsymbol{\phi}}$ is diagonal, and its entries constitute the first $p$ entries of the diagonalized form of the data covariance matrix, denoted by $\boldsymbol{\Lambda}$. The ensuing spiked covariance structure is depicted in Figure 2. Figure 2 plots the test error as a function of the number of parameters, $p$, in the model and shows a counterintuitive relationship. When $p < n$, we obtain non-zero training error and observe the classic bias-variance tradeoff. When $p \ge n$, we are able to perfectly interpolate the training data (with the minimum $\ell_2$-norm solution). Although interpolation of noise is traditionally associated with overfitting, here we observe that the test error monotonically decays with the number of parameters $p$! This is a manifestation of the double descent phenomenon in an elementary example inspired by signal processing perspectives.

The second choice of the Hadamard basis also gives rise to a double descent behavior, but is quite different both qualitatively and quantitatively. The feature space utilized for learning (the Hadamard basis) is fundamentally mismatched to the feature space in which the original parameter vector is sparse (the DCT basis). Consequently, as seen in Figure 2, the energy of the true parameter vector is spread over the Hadamard features in an unorganized manner. Moreover, the covariance matrix of Hadamard features, depicted as a heat-map in Figure 2, is significantly more correlated and dispersed in its structure. Figure 2 displays a stronger benefit of overparameterization in this model owing to the increased model misspecification arising from the choice of Hadamard feature map. Unlike in the well-specified case of DCT features, the overparameterized regime significantly dominates the underparameterized regime in terms of test performance.
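A minimal simulation in this spirit is sketched below. It is not the paper's exact experiment: the dimensions ($d = 64$, $n = 16$), the eigenvalue decay, the sparsity level, and the noise level are placeholder choices, but the resulting curve exhibits the same qualitative double descent for minimum $\ell_2$-norm LS regression with DCT features.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma, trials = 64, 16, 0.3, 200

# Orthonormal DCT-II basis: here the ROWS of W are the basis vectors, so W @ W.T = I.
j = np.arange(d)
W = np.cos(np.pi * (j[None, :] + 0.5) * j[:, None] / d) * np.sqrt(2.0 / d)
W[0, :] /= np.sqrt(2.0)

# True parameter: approximately sparse in the DCT domain (energy in the first few DCT coefficients).
beta_dct = np.zeros(d)
beta_dct[:8] = rng.standard_normal(8)
beta = W.T @ beta_dct                        # expressed back in the input domain

# Anisotropic input covariance, diagonal in the DCT basis with decaying eigenvalues (placeholder decay).
lam = 1.0 / (1.0 + np.arange(d))
Sigma_x = W.T @ np.diag(lam) @ W

test_err = np.zeros(d)
for p in range(1, d + 1):
    V = W[:p, :].T                           # first p DCT basis vectors as feature extractors (d x p)
    errs = []
    for _ in range(trials):
        G = rng.standard_normal((n, d))
        X = (G * np.sqrt(lam)) @ W           # training inputs with rows ~ N(0, Sigma_x)
        y = X @ beta + sigma * rng.standard_normal(n)
        theta = np.linalg.pinv(X @ V) @ y    # minimum l2-norm LS solution in the p-dim feature space
        diff = beta - V @ theta              # prediction error direction: (beta - V theta)^T x
        errs.append(diff @ Sigma_x @ diff + sigma ** 2)   # closed-form test error for this theta
    test_err[p - 1] = np.mean(errs)

print(np.round(test_err, 2))   # expect a peak near p = n and a second descent as p grows beyond n
```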

Finally, the results in Fig. 3 correspond to a different true parameter vector $\boldsymbol{\beta}$, whose components are shown in Fig. 3 and whose DCT-domain representation is presented in Fig. 3. The majority of $\boldsymbol{\beta}$'s energy is in a mid-range "feature band" that includes 18 DCT features; in addition, there are two more high-energy features at the first two coordinates, and the remaining coordinates are of low energy (see Fig. 3). The input covariance matrix $\boldsymbol{\Sigma}_{\mathbf{x}}$, presented in Fig. 3, is the same as before (i.e., the majority of the input feature variances is located at the first 20 DCT features). Therefore, when using a function class with DCT features (see results in the second row of subfigures of Fig. 3), there is a mismatch between the "feature bands" that contain the majority of $\boldsymbol{\beta}$'s energy and the majority of the input variances. This situation leads to the error curves presented in Fig. 3, where double descent of test errors occurs, but the global minimum of test error is not achieved by an interpolating solution. This will be explained in more detail below.


Figure 2: Empirical results for LS regression based on a class of linear functions in the form described in Example 2.1. The data model is $y = \boldsymbol{\beta}^{\top}\mathbf{x} + \epsilon$. (a) The components of $\boldsymbol{\beta}$. (b) The components of the DCT transform of $\boldsymbol{\beta}$, i.e., $\mathbf{W}^{\top}\boldsymbol{\beta}$. This shows that the majority of $\boldsymbol{\beta}$'s energy is in the low-range DCT "feature band". (c) The input data covariance matrix $\boldsymbol{\Sigma}_{\mathbf{x}}$. As explained in detail in the main text, the second row of subfigures corresponds to learning with DCT features, and the third row of subfigures corresponds to learning with Hadamard features.


Figure 3: Empirical results for LS regression based on a class of linear functions in the form described in Example 2.1. The data model is $y = \boldsymbol{\beta}^{\top}\mathbf{x} + \epsilon$. (a) The components of $\boldsymbol{\beta}$. (b) The components of the DCT transform of $\boldsymbol{\beta}$, i.e., $\mathbf{W}^{\top}\boldsymbol{\beta}$. This shows that, besides two high-energy components at the first two features, the majority of $\boldsymbol{\beta}$'s energy is in a mid-range DCT "feature band". (c) The input data covariance matrix $\boldsymbol{\Sigma}_{\mathbf{x}}$. As explained in detail in the main text, the second row of subfigures corresponds to learning with DCT features.

2.3 Why is double descent a surprise?

As a consequence of the classical bias-variance tradeoff, the test error as a function of the number of parameters forms a U-shaped curve in the underparameterized regime. For example, Fig. 2 depicts this U-shaped curve in the underparameterized range of solutions $p < n$, i.e., to the left of the interpolation threshold (the vertical dotted line at $p = n$). This classical bias-variance tradeoff is usually induced by two fundamental behaviors of non-interpolating solutions:

  • The bias term usually decreases because the optimal mapping $f^{*}$ can be better approximated by classes of mappings that are more complex. In other words, models with more parameters or less regularization typically result in a lower (squared) bias component in the test error decomposition (7). As an example, see the red dashed curve to the left of the interpolation threshold in Fig. 2.

  • The variance term usually increases because classes of mappings that are of higher complexity result in higher sensitivity to (or variation with respect to) the noise in the specific training dataset $\mathcal{D}$. Note that such noise can arise due to stochastic measurement errors and/or model misspecification (see, e.g., Rao, 1971). In other words, models with more parameters or less regularization typically result in a higher variance component in the test error decomposition (7). As an example, see the magenta dashed curve to the left of the interpolation threshold in Fig. 2.

Typically, the interpolating solution of minimal complexity (e.g., for $p = n$) has a high test error due to a high variance term around the transition between the non-interpolating and interpolating regimes. This inferior performance at the entrance to the interpolation threshold (which can be mathematically explained through the poor conditioning of the training data matrix) led to a neglect of the study of the range of interpolating solutions, i.e., models such as Examples 2.1-2.2 with $p > n$. However, the recently observed success of interpolation and overparameterization across deep learning (Neyshabur et al., 2014; Zhang et al., 2017; Advani et al., 2020) and simpler models like linear models and shallow neural networks (Spigler et al., 2019; Belkin et al., 2019a; Geiger et al., 2020) alike has changed this picture; the interpolating/overparameterized regime is now actively researched. It is essential to note that we cannot expect double descent behavior to arise from any interpolating solution; indeed, it is easy to find interpolating solutions that generalize poorly even as more parameters are added. Accordingly, we should expect the implicit regularization[9] present in the minimum $\ell_2$-norm interpolator to play a critical role in aiding generalization. However, implicit regularization constitutes a sanity check rather than a full explanation for the double descent behavior. For one, it does not explain the seemingly harmless interpolation of noise in training data in the highly overparameterized regime. For another, minimum $\ell_2$-norm interpolators will not generalize well for every data distribution: indeed, classic experiments demonstrate catastrophic signal reconstruction in some cases by the minimum $\ell_2$-norm interpolator (Chen et al., 2001). An initial examination of the bias-variance decomposition of the test error in the interpolating regime (Figures 2 and 3) demonstrates significantly different behavior from the underparameterized regime, at least when minimum $\ell_2$-norm solutions are used: there appears to be a significant decrease in the variance component that dominates the increase in the bias component, yielding an overall benefit of overparameterization. It is also unclear whether the interpolating regime is strictly beneficial for generalization in the sense that the global minimum of the test error is achieved by an interpolating solution. The simple experiments in this section demonstrate a mixed answer to this question: both feature choices in Fig. 2 demonstrate beneficial double descent behaviors, whereas Fig. 3 shows a double descent behavior in which the underparameterized regime is the one that minimizes the test error.

[9] A complementary, influential line of work (beginning with Soudry et al. (2018); Ji and Telgarsky (2019)) shows that the max-margin support vector machine (SVM), which minimizes the $\ell_2$-norm of the solution subject to a hard margin constraint, arises naturally as a consequence of running gradient descent to minimize the training error. This demonstrates the relative ease of finding $\ell_2$-norm minimizing solutions in practice, and further motivates their study.

2.4 The role of model misspecification in beneficial interpolation

Misspecification was defined above as a learning setting where the function class does not include the optimal solution that minimizes the test error. In this section, we provide an elementary calculation to illustrate that this model misspecification is a significant ingredient in inducing the double descent behavior. Consider regression with respect to the squared error loss, the data model $y = \boldsymbol{\beta}^{\top}\mathbf{x} + \epsilon$, and a function class $\mathcal{F}$ as in Example 2.1. Then, the optimal solution is $f^{*}(\mathbf{x}) = \boldsymbol{\beta}^{\top}\mathbf{x}$. By the definition of the function class $\mathcal{F}$, the learned mapping is applied on a feature vector that includes only $p$ features (out of the $d$ possible features) of the input $\mathbf{x}$. This means that the optimal solution in the function class $\mathcal{F}$ is

$$f^{*}_{\mathcal{F}}(\mathbf{x}) \;=\; \big(\mathbf{V}^{\top}\boldsymbol{\beta}\big)^{\top}\mathbf{V}^{\top}\mathbf{x} \;=\; \boldsymbol{\beta}^{\top}\mathbf{V}\mathbf{V}^{\top}\mathbf{x}, \tag{17}$$

where the second equality is due to the orthonormality of the basis vectors $\{\mathbf{v}_j\}_{j=1}^{p}$. We can notice the difference between the optimal solution in $\mathcal{F}$ and the unconstrained optimal solution $f^{*}$ as follows. Note that $\mathbf{V}\mathbf{V}^{\top}$ in (17) is the projection matrix onto the $p$-dimensional feature space that is utilized for learning and spanned by the orthonormal vectors $\{\mathbf{v}_j\}_{j=1}^{p}$. Hence, unless both $\boldsymbol{\beta}$ and the entire distribution of $\mathbf{x}$ are contained in the $p$-dimensional feature space, there is misspecification that excludes the optimal solution $f^{*}$ from the function class $\mathcal{F}$.

Consider cases where the features are statistically independent (e.g., the above example with DCT features and a Gaussian input whose covariance matrix is diagonalized by the DCT matrix). These cases allow us to further emphasize the effect of misspecification on the bias and variance of the solution learned from $\mathcal{F}$. The important observation here is that the data model can be written as

$$y \;=\; \big(\mathbf{V}^{\top}\boldsymbol{\beta}\big)^{\top}\boldsymbol{\phi}(\mathbf{x}) \;+\; z_{\rm mis} \;+\; \epsilon, \tag{18}$$

where $\boldsymbol{\phi}(\mathbf{x}) = \mathbf{V}^{\top}\mathbf{x}$ is the $p$-dimensional feature vector of the input, and $z_{\rm mis} \triangleq \boldsymbol{\beta}^{\top}\big(\mathbf{I}_d - \mathbf{V}\mathbf{V}^{\top}\big)\mathbf{x}$ is a random variable which is statistically independent of the feature vector $\boldsymbol{\phi}(\mathbf{x})$. We consider a zero-mean Gaussian input $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_{\mathbf{x}})$ here, with covariance matrix $\boldsymbol{\Sigma}_{\mathbf{x}} = \mathbf{W}\boldsymbol{\Lambda}\mathbf{W}^{\top}$. Hence, the misspecification variable $z_{\rm mis}$ is also zero-mean Gaussian, but with variance

$$\sigma_{\rm mis}^2 \;\triangleq\; \sum_{j=p+1}^{d} \lambda_j \big(\mathbf{w}_j^{\top}\boldsymbol{\beta}\big)^2, \tag{19}$$

where $\lambda_j$ is the $j$-th diagonal entry of $\boldsymbol{\Lambda}$ and $\mathbf{w}_j$ is the $j$-th column of $\mathbf{W}$; i.e., the sum is over the feature directions that are not utilized in the $p$-dimensional feature space of $\mathcal{F}$.
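The identity (19) is easy to check numerically. The sketch below (placeholder dimensions, eigenvalue decay, and parameter vector) compares the analytic misspecification variance with the empirical variance of the excluded-feature component $\boldsymbol{\beta}^{\top}(\mathbf{I}_d - \mathbf{V}\mathbf{V}^{\top})\mathbf{x}$.

```python
import numpy as np

rng = np.random.default_rng(2)
d, p = 32, 8

# Orthonormal DCT-II basis (rows are basis vectors) and an input covariance diagonalized by it.
j = np.arange(d)
W = np.cos(np.pi * (j[None, :] + 0.5) * j[:, None] / d) * np.sqrt(2.0 / d)
W[0, :] /= np.sqrt(2.0)
lam = 1.0 / (1.0 + j)                       # eigenvalues of the input covariance (placeholder decay)
beta = rng.standard_normal(d)               # true parameter vector (placeholder)

V = W[:p, :].T                              # learning uses only the first p DCT features

# Analytic misspecification variance, Eq. (19): sum over the unused feature directions.
beta_dct = W @ beta
sigma_mis2 = np.sum(lam[p:] * beta_dct[p:] ** 2)

# Empirical check: variance of the part of beta^T x that lies outside the learned feature space.
G = rng.standard_normal((200_000, d))
X = (G * np.sqrt(lam)) @ W                  # samples x ~ N(0, W^T diag(lam) W)
z_mis = X @ (beta - V @ (V.T @ beta))       # beta^T (I - V V^T) x for each sample
print(sigma_mis2, z_mis.var())              # the two values should approximately agree
```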

We define the misspecification bias (squared) as

$$\mathrm{B}_{\rm mis}^2 \;\triangleq\; \mathbb{E}_{\mathbf{x}}\Big[\big(f^{*}_{\mathcal{F}}(\mathbf{x}) - f^{*}(\mathbf{x})\big)^2\Big], \tag{20}$$

namely, the expected squared difference between the optimal solution in the function class $\mathcal{F}$ and the optimal unconstrained solution. We also define the in-class bias (squared) as

$$\mathrm{B}_{\rm in}^2 \;\triangleq\; \mathbb{E}_{\mathbf{x}}\Big[\big(\mathbb{E}_{\mathcal{D}}[\widehat{f}(\mathbf{x})] - f^{*}_{\mathcal{F}}(\mathbf{x})\big)^2\Big], \tag{21}$$

which is the expected squared difference between the learned solution (expected over the training dataset) and the optimal solution in the function class $\mathcal{F}$ that was utilized for learning. Now, based on the orthogonality of the feature maps that form $\mathcal{F}$, and considering cases where the features are statistically independent, we can decompose the squared bias from (8) as

$$\mathrm{B}^2 \;=\; \mathrm{B}_{\rm mis}^2 + \mathrm{B}_{\rm in}^2, \tag{22}$$

which can be further set in the test error decomposition in (7). More specifically, the definition of $\mathcal{F}$ in Example 2.1 implies that

$$\mathrm{B}_{\rm mis}^2 \;=\; \sum_{j=p+1}^{d} \lambda_j \big(\mathbf{w}_j^{\top}\boldsymbol{\beta}\big)^2 \;=\; \sigma_{\rm mis}^2, \tag{23}$$

which is the variance (19) of the misspecification variable $z_{\rm mis}$.

Still considering statistically independent features, we can also decompose the variance term (9) to express its portion due to the misspecification in $\mathcal{F}$. We can write the variance decomposition as

$$\mathrm{V} \;=\; \mathrm{V}_{\rm mis} + \mathrm{V}_{\rm in}. \tag{24}$$

Here, the variance component due to misspecification is

$$\mathrm{V}_{\rm mis} \;=\; \sigma_{\rm mis}^2 \cdot \mathbb{E}\Big[\mathrm{Tr}\big[\big(\boldsymbol{\Phi}^{\top}\boldsymbol{\Phi}\big)^{+}\boldsymbol{\Lambda}_p\big]\Big], \tag{25}$$

where $\boldsymbol{\Phi}$ is the $n \times p$ matrix of training input features, and $\boldsymbol{\Lambda}_p$ is the $p \times p$ diagonal matrix with the input covariance eigenvalues $\lambda_1, \ldots, \lambda_p$. The in-class variance component $\mathrm{V}_{\rm in}$ has a somewhat more intricate formulation that we exclude from this discussion. Hastie et al. (2019) further assume that the input and the true parameter vector are both isotropic random vectors and, accordingly, provide a simple formulation for the in-class variance term in their setting.

Figure 4: Bias-variance decomposition of the test error, including a further decomposition into misspecification and in-class components as formulated in (22)-(25) for cases with statistically independent features. This figure is an extension of Fig. 2. These results are for LS regression with DCT features and a Gaussian input with covariance that is diagonalized by the DCT matrix.

Let us return to the example given in Fig. 2 for regression with DCT features that are statistically independent due to the covariance form of the Gaussian input. This setting allows us to demonstrate the misspecification components of the bias and variance from (23) and (25). Figure 4 extends Fig. 2 by illustrating the misspecification and in-class components of the bias and variance versus the number of parameters $p$ in the function class. Indeed, the misspecification bias decreases as the function class is more parameterized. In contrast, the in-class bias is zero in the underparameterized regime, and it increases with $p$ in the overparameterized regime. Yet, the reduction in the misspecification bias due to overparameterization is more significant than the corresponding increase in the in-class bias, and this is in accordance with having the global minimum of the test error in the overparameterized range. We can observe that both the in-class and misspecification variance terms peak around the interpolation threshold (i.e., $p = n$) due to the poor conditioning of the input feature matrix. Based on the structure of the true parameter vector $\boldsymbol{\beta}$ (see Fig. 2), the significant misspecification occurs at small $p$, where the learned feature space still excludes high-energy DCT components of $\boldsymbol{\beta}$, and this is apparent in the misspecification components of both the bias and the variance.

More generally than the case of statistically independent features, the examples in Figures 2-3 include misspecification for any $p < d$. This is because, for both DCT and Hadamard features, the representation of the true parameter vector $\boldsymbol{\beta}$ requires all $d$ features (see Figures 2 and 3). Learning using DCT features benefits from the fact that $\boldsymbol{\beta}$ has energy compaction in the DCT domain, which in turn significantly reduces the misspecification bias for $p$ that is large enough to include the majority of $\boldsymbol{\beta}$'s energy. In contrast, in the case of learning using the Hadamard features, the representation of $\boldsymbol{\beta}$ in the feature space has its energy spread randomly all over the features, without energy compaction, and therefore the effect of misspecification is further amplified in this case. The amplified misspecification due to a poor selection of feature space actually promotes increased benefits from overparameterization (compared to underparameterized solutions, which are extremely misspecified), as can be observed in Fig. 2.
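The energy-compaction contrast described above can be seen in a few lines (the dimension and sparsity level below are arbitrary illustrative choices): a vector that is sparse in the DCT basis keeps almost all of its energy in a few DCT coefficients, but spreads it over many Hadamard coefficients.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(3)
d = 64

# Orthonormal DCT-II basis (rows are basis vectors) and a normalized Hadamard basis (rows).
j = np.arange(d)
W_dct = np.cos(np.pi * (j[None, :] + 0.5) * j[:, None] / d) * np.sqrt(2.0 / d)
W_dct[0, :] /= np.sqrt(2.0)
W_had = hadamard(d) / np.sqrt(d)

# A parameter vector whose energy sits entirely in the first 8 DCT features.
beta_dct = np.zeros(d)
beta_dct[:8] = rng.standard_normal(8)
beta = W_dct.T @ beta_dct

def energy_in_top_k(coeffs, k):
    """Fraction of total energy captured by the k largest-magnitude coefficients."""
    e = np.sort(coeffs ** 2)[::-1]
    return e[:k].sum() / e.sum()

print(energy_in_top_k(W_dct @ beta, 8))   # 1.0: fully compact in the DCT basis
print(energy_in_top_k(W_had @ beta, 8))   # typically well below 1: spread over Hadamard features
```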

The in-depth experiments and explanation above illustrate the rich variety of generalization behaviors even in elementary settings. Understanding these behaviors requires mathematical study beyond the fundamental discussion in this section. In the next section, we review recently conducted mathematical studies on this topic, and illuminate the core phenomena that explain generalization behavior in this interpolating regime.

3 Overparameterized regression

Precise theoretical explanations now exist for the phenomena of harmless interpolation of noise and double descent (Belkin et al., 2020; Bartlett et al., 2020; Hastie et al., 2019; Kobak et al., 2020; Muthukumar et al., 2020b; Mitra, 2019). In this section, we provide a brief recap of these results, all focused on minimum $\ell_2$-norm interpolators. Additionally, we provide intuitive explanations for the double descent behavior in simplified toy settings that are inspired by statistical signal processing. We will see that the test error can be upper bounded by an additive decomposition of two terms:

  • A signal-dependent term, which expresses how much error is incurred by a solution that interpolates the noiseless version of our (noisy) training data.

  • A noise-dependent term, which expresses how much error is incurred by a solution that interpolates only the noise in our training data.

Situations that adequately tradeoff these terms give rise to the double descent behavior, and can be described succinctly for several families of linear models. We will now describe these results in more detail.

3.1 Summary of results

The minimum $\ell_2$-norm interpolator (MNI), denoted by the $p$-dimensional vector $\widehat{\boldsymbol{\theta}} = \boldsymbol{\Phi}^{+}\mathbf{y}$ where $p > n$, has a closed-form expression (recall Eq. (16)). In turn, this allows one to derive closed-form exact expressions for the generalization error as a function of the training dataset; indeed, all aforementioned theoretical analyses of the minimum $\ell_2$-norm interpolator take this route.

We can state the diversity of results by Belkin et al. (2020); Bartlett et al. (2020); Hastie et al. (2019); Kobak et al. (2020); Muthukumar et al. (2020b); Mitra (2019) in a common framework by decomposing the test error incurred by the MNI, denoted by $\mathcal{E}_{\rm test}(\widehat{\boldsymbol{\theta}})$, into the respective test errors that would be induced by two related solutions. To see this, consider the noisy data model (2) with scalar output, and note that the MNI from (16) satisfies $\widehat{\boldsymbol{\theta}} = \widehat{\boldsymbol{\theta}}_{\rm signal} + \widehat{\boldsymbol{\theta}}_{\rm noise}$, where

  • $\widehat{\boldsymbol{\theta}}_{\rm signal} = \boldsymbol{\Phi}^{+}\mathbf{y}_{\rm clean}$ is the minimum $\ell_2$-norm interpolator on the noiseless training dataset $\{(\mathbf{x}_i, f(\mathbf{x}_i))\}_{i=1}^{n}$. Note that $\mathbf{y} = \mathbf{y}_{\rm clean} + \boldsymbol{\epsilon}$, where the vectors $\mathbf{y}_{\rm clean} = [f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]^{\top}$ and $\boldsymbol{\epsilon} = [\epsilon_1, \ldots, \epsilon_n]^{\top}$ include the clean training data outputs and their error components, respectively.

  • $\widehat{\boldsymbol{\theta}}_{\rm noise} = \boldsymbol{\Phi}^{+}\boldsymbol{\epsilon}$ is the minimum $\ell_2$-norm interpolator on the pure noise in the training data. In other words, $\widehat{\boldsymbol{\theta}}_{\rm noise}$ is the minimum $\ell_2$-norm interpolator of the dataset $\{(\mathbf{x}_i, \epsilon_i)\}_{i=1}^{n}$.

Then, a typical decomposition of an upper bound on the test error is given by

$$\mathcal{E}_{\rm test}(\widehat{\boldsymbol{\theta}}) \;\lesssim\; \mathcal{E}_{\rm signal} + \mathcal{E}_{\rm noise}, \tag{26}$$

where $\mathcal{E}_{\rm signal}$ is a signal-specific component that is determined by the noiseless data interpolator $\widehat{\boldsymbol{\theta}}_{\rm signal}$, and $\mathcal{E}_{\rm noise}$ is a noise-specific component that is determined by the pure noise interpolator $\widehat{\boldsymbol{\theta}}_{\rm noise}$. Moreover, we can typically lower bound the test error as $\mathcal{E}_{\rm test}(\widehat{\boldsymbol{\theta}}) \gtrsim \max\{\mathcal{E}_{\rm signal}, \mathcal{E}_{\rm noise}\}$; therefore, this decomposition is sharp in a certain sense.
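The linearity underlying this decomposition is straightforward to verify numerically. The following sketch (arbitrary dimensions and an isotropic Gaussian feature model, chosen only for illustration) splits the MNI into its noiseless-data and pure-noise interpolators and evaluates the corresponding error terms.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 20, 120, 0.4
Phi = rng.standard_normal((n, p))               # training feature matrix (isotropic features)
beta = rng.standard_normal(p) / np.sqrt(p)      # true linear model in feature space, ||beta|| ~ 1
y_clean = Phi @ beta
eps = sigma * rng.standard_normal(n)
y = y_clean + eps                               # noisy training outputs, model (2)

pinv = np.linalg.pinv(Phi)
theta_mni = pinv @ y                            # minimum l2-norm interpolator of the noisy data
theta_signal = pinv @ y_clean                   # interpolator of the noiseless data
theta_noise = pinv @ eps                        # interpolator of the pure noise

print(np.max(np.abs(theta_mni - (theta_signal + theta_noise))))   # ~0: MNI = signal part + noise part

# Excess test errors under isotropic test features: ||theta - beta||^2 for each solution.
err_signal = np.sum((theta_signal - beta) ** 2)   # E_signal-type term
err_noise = np.sum(theta_noise ** 2)              # E_noise-type term
err_mni = np.sum((theta_mni - beta) ** 2)
print(err_mni, err_signal, err_noise)             # err_mni <= 2*(err_signal + err_noise) always holds
```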

All of the recent results on minimum $\ell_2$-norm interpolation can be viewed from the perspective of (26) in that they provide precise characterizations for the error terms $\mathcal{E}_{\rm signal}$ and $\mathcal{E}_{\rm noise}$ under the linear model, i.e., $f(\mathbf{x}) = \boldsymbol{\beta}^{\top}\mathbf{x}$ for some $\boldsymbol{\beta} \in \mathbb{R}^d$. Moreover, as we will see, these analytical results demonstrate that the goals of minimizing $\mathcal{E}_{\rm signal}$ and $\mathcal{E}_{\rm noise}$ can sometimes be at odds with one another. Accordingly, the number of samples $n$, the number of parameters $p$, and the properties of the feature covariance matrix $\boldsymbol{\Sigma}_{\boldsymbol{\phi}}$ should be chosen to trade off these error terms. In some studies, the ensuing error bounds are non-asymptotic and can be stated as a closed-form function of these quantities. In other studies, the parameters $n$ and $p$ are grown together at a fixed rate, and the asymptotic test error is exactly characterized as a solution to either linear or nonlinear equations that is sometimes closed form, but at the very least is typically numerically evaluatable. Such exact characterizations of the test error are typically called precise asymptotics.

For simplicity, we summarize recent results from the literature in the non-asymptotic setting, i.e., error bounds that hold for finite values of $n$, $p$, and $d$ (in addition to their infinite counterparts). Accordingly, we characterize the behavior of the test error (and various factors influencing it) as a function of these parameters using two types of notation. First, considering an arbitrary function $g(n, p)$, we will use $\mathcal{E} \lesssim g(n, p)$ to mean that there exists a universal constant $C > 0$ such that $\mathcal{E} \le C\, g(n, p)$ for all $n, p$. Second, we will use $\mathcal{E} \asymp g(n, p)$ to mean that there exist universal constants $c, C > 0$ such that $c\, g(n, p) \le \mathcal{E} \le C\, g(n, p)$ for all $n, p$. Moreover, the bounds that we state typically hold with high probability over the training data, in the sense that the probability that the upper bound holds goes to $1$ as $n \to \infty$ (with $p$ growing in the desired proportion with respect to $n$ in order to maintain overparameterization).

3.1.1 When does the minimum $\ell_2$-norm solution enable harmless interpolation of noise?

We begin by focusing on the contribution to the upper bound on the test error that arises from a fit of pure noise, i.e., $\mathcal{E}_{\rm noise}$, and identifying necessary and sufficient conditions under which this error is sufficiently small. We will see that these conditions, in all cases, reduce to a form of high "effective overparameterization" in the data relative to the number of training examples (in a sense that we will define shortly). In particular, these results can be described under the following two particular models for the feature covariance matrix $\boldsymbol{\Sigma}_{\boldsymbol{\phi}}$:

  1. The simplest result to obtain is in the case of isotropic covariance, for which $\boldsymbol{\Sigma}_{\boldsymbol{\phi}} = \mathbf{I}_p$. For instance, this case occurs when the feature maps in Example 2.1 are applied on input data with isotropic covariance matrix $\boldsymbol{\Sigma}_{\mathbf{x}} = \mathbf{I}_d$. If the features are also statistically independent (e.g., consider Example 2.1 with isotropic Gaussian input data), the results by Bartlett et al. (2020); Hastie et al. (2019); Muthukumar et al. (2020b) imply that the noise error term scales as

    $$\mathcal{E}_{\rm noise} \;\asymp\; \sigma^2 \cdot \frac{n}{p}, \tag{27}$$

    where $\sigma^2$ denotes the noise variance. The result in (27) displays an explicit benefit of overparameterization in fitting noise in a harmless manner. In more detail, this result is (i) a special case of the more general (non-asymptotic) results for arbitrary covariance by Bartlett et al. (2020), (ii) one of the closed-form precise asymptotic results presented by Hastie et al. (2019), and (iii) a consequence of the more general characterization of the minimum-risk interpolating solution, or "ideal interpolator", derived by Muthukumar et al. (2020b).

    On the other hand, the signal error term can be bounded as

    $$\mathcal{E}_{\rm signal} \;\asymp\; \|\boldsymbol{\beta}\|_2^2 \left(1 - \frac{n}{p}\right), \tag{28}$$

    displaying an explicit harm of overparameterization from the point of view of signal recovery. In more detail, this result is (i) a non-asymptotic upper bound provided by Bartlett et al. (2020) (and subsequent work by Tsigler and Bartlett (2020) shows that this term is matched by a sharp lower bound), and (ii) precise asymptotics provided by Hastie et al. (2019).

    As discussed later in Section 3.1.2, the behavior of $\mathcal{E}_{\rm signal}$ has negative implications for consistency in a certain high-dimensional sense. Nevertheless, the non-asymptotic example in Figure 5 shows that one can construct signal models with isotropic covariance that can benefit from overparameterization. Such benefits are partly due to the harmless interpolation of noise, and partly due to a decrease in the misspecification error arising from increased overparameterization.

  2. The general case of anisotropic covariance and features that are independent in their principal component basis is significantly more complicated. Nevertheless, it turns out that a high amount of "effective overparameterization" can be defined, formalized, and shown to be sufficient and necessary for interpolating noise in a harmless manner. This formalization was pioneered[10] by Bartlett et al. (2020), who provide a definition of effective overparameterization solely as a function of $n$, $p$, and the spectrum of the covariance matrix $\boldsymbol{\Sigma}_{\boldsymbol{\phi}}$, denoted by the eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p > 0$. We refer the reader to Bartlett et al. (2021) for a detailed exposition of these results and their consequences. Here, we present the essence of these results in a popular model used in high-dimensional statistics, the spiked covariance model. Under this model (recall that $p > n$), we have a feature covariance matrix with $s$ high-energy eigenvalues and $p - s$ low-energy (but non-zero) eigenvalues. In other words, $\lambda_j = \lambda_H$ for $j \le s$ and $\lambda_j = \lambda_L$ for $j > s$, for some $\lambda_H \gg \lambda_L > 0$ (for example, notice that Figure 2 is an instance of this model). In this special case, the results of Bartlett et al. (2020) can be simplified to show that the test error arising from overfitting noise is given by

    $$\mathcal{E}_{\rm noise} \;\asymp\; \sigma^2 \left( \frac{s}{n} + \frac{n}{p - s} \right). \tag{29}$$

    Equation (29) demonstrates a remarkably simple set of sufficient and necessary conditions for harmless interpolation of noise in this anisotropic ensemble:

    • The number of high-energy directions, given by $s$, must be significantly smaller than the number of samples $n$.

    • The number of low-energy directions, given by $p - s$, must be significantly larger than the number of samples $n$. This represents a requirement for sufficiently high "effective overparameterization" to be able to absorb the noise in the data, and fit it in a relatively harmless manner. (A simulation sketch of these two conditions follows this list.)

[10] Asymptotic (not closed-form) results are also provided by the concurrent work of Hastie et al. (2019). These asymptotics match the non-asymptotic scalings provided by Bartlett et al. (2020) for the case of the spiked covariance model and Gaussian features in the regime where $p \gg n$. These results can also be used to recover in part the benign overfitting results, as shown by Bartlett et al. (2021). Muthukumar et al. (2020b) recover these scalings in a toy Fourier feature model through elementary calculations that will be reviewed in Section 3.2 using cosine features.
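As referenced in the list above, here is a small simulation of these two conditions (the values of $n$, $p$, $s$, the eigenvalues, and the noise level are arbitrary illustrative choices): it fits the minimum $\ell_2$-norm interpolator to pure noise under a spiked covariance and compares the resulting test error with the scaling in (29).

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, s, sigma = 50, 5000, 5, 1.0
lam_high, lam_low = 10.0, 0.01
lam = np.concatenate([np.full(s, lam_high), np.full(p - s, lam_low)])  # spiked spectrum

trials, errs = 50, []
for _ in range(trials):
    Phi = rng.standard_normal((n, p)) * np.sqrt(lam)   # rows ~ N(0, diag(lam))
    eps = sigma * rng.standard_normal(n)               # pure-noise "labels"
    theta_noise = np.linalg.pinv(Phi) @ eps            # min l2-norm interpolator of the noise
    errs.append(theta_noise @ (lam * theta_noise))     # test error E[(phi^T theta)^2] = theta^T diag(lam) theta

print(np.mean(errs))                                   # empirical E_noise
print(sigma ** 2 * (s / n + n / (p - s)))              # the scaling in Eq. (29); same order, up to constants
```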

Interestingly, the work of Kobak et al. (2020) considered the above anisotropic model for the special case of a single spike (i.e., $s = 1$, which usually implies that the first condition is trivially met) and showed that the error incurred by interpolation is comparable to the error incurred by ridge regression. These results are related in spirit to the above framework, and so we can view the conditions derived by Bartlett et al. (2020) as sufficient and necessary for ensuring that interpolation of (noisy) training data does not lead to sizably worse performance than ridge regression. Of course, good generalization of ridge regression itself is not always a given (in particular, it generalizes poorly for isotropic models). In this vein, the central utility of studying this anisotropic case is that it is now possible to also identify situations under which the test error arising from fitting the pure signal, i.e., $\mathcal{E}_{\rm signal}$, is sufficiently small. We elaborate on this point next.


Figure 5: Empirical results for LS regression based on a class of linear functions in the form described in Example 2.1. The data model is $y = \boldsymbol{\beta}^{\top}\mathbf{x} + \epsilon$. (a) The components of $\boldsymbol{\beta}$. (b) The components of the DCT transform of $\boldsymbol{\beta}$, i.e., $\mathbf{W}^{\top}\boldsymbol{\beta}$. This shows that the majority of $\boldsymbol{\beta}$'s energy is in a specific DCT "feature band". (c) The input data covariance matrix $\boldsymbol{\Sigma}_{\mathbf{x}}$. The learning is done using DCT features and, accordingly, the features are independent and isotropic due to the isotropic Gaussian input. Subfigure (d) shows the test error (as a function of $p$) and its decomposition into bias, variance, and their in-class and misspecification components.

3.1.2 When does the minimum $\ell_2$-norm solution enable signal recovery?

The above results paint a compelling picture for an explicit benefit of overparameterization in harmless interpolation of noise (via minimum $\ell_2$-norm interpolation). However, good generalization requires the term $\mathcal{E}_{\rm signal}$ to be sufficiently small as well. As we now demonstrate, the picture is significantly more mixed for signal recovery.

As shown in Eq. (28), for the case of isotropic covariance, the signal error term can be upper and lower bounded up to universal constants as $\mathcal{E}_{\rm signal} \asymp \|\boldsymbol{\beta}\|_2^2 \big(1 - \tfrac{n}{p}\big)$, displaying an explicit harm of overparameterization from the point of view of signal recovery. The ills of overparameterization in signal recovery with the minimum $\ell_2$-norm interpolator are classically documented in statistical signal processing; see, e.g., the experiments in the seminal survey by Chen et al. (2001). A fundamental consequence of this adverse signal recovery is that the minimum $\ell_2$-norm interpolator cannot have statistical consistency (i.e., we cannot have $\mathcal{E}_{\rm signal} \to 0$ as $n \to \infty$) under isotropic covariance for any scaling of $p$ that grows with $n$, whether proportional (i.e., $p = \gamma n$ for some fixed $\gamma > 1$) or ultra high-dimensional (i.e., $p = n^{q}$ for some $q > 1$). Instead, the test error converges to the "null risk" (terminology used by Hastie et al. (2019)), i.e., the test error that would be induced by the "zero fit" (terminology used by Muthukumar et al. (2020b)) $\widehat{f} \equiv 0$.

These negative results make clear that anisotropy of the features is required for the signal-specific error term $\mathcal{E}_{\rm signal}$ to be sufficiently small; in fact, this is the main reason for the requirement that $s \ll n$ in the spiked covariance model to ensure good generalization in regression tasks. In addition to this effective low dimensionality in the data, we also require a low-dimensional assumption on the signal model. In particular, the signal energy must be almost entirely aligned with the directions (i.e., eigenvectors) of the top $s$ eigenvalues of the feature covariance matrix (representing high-energy directions in the data). Finally, we note as a matter of detail that the original work by Bartlett et al. (2020) precisely characterizes the contribution from $\mathcal{E}_{\rm noise}$ up to constants independent of $n$ and $p$; however, they only provide upper bounds on the bias term. In follow-up work, Tsigler and Bartlett (2020) provide more precise expressions for the bias that show that the bias is non-vanishing in "tail" signal components outside the first $s$ "preferred" directions. Their upper and lower bounds match up to a certain condition number. The follow-up studies by Mei and Montanari (2019); Liao et al. (2020); Derezinski et al. (2020); Mei et al. (2021) also provide precise asymptotics for the bias and variance terms in very general random feature models, which can be shown to fall under the category of anisotropic covariance models.
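To illustrate this contrast numerically (with placeholder parameters that are not from the paper), the sketch below evaluates the excess error of the minimum $\ell_2$-norm interpolator fitted to noiseless responses under an isotropic covariance and under a spiked covariance whose spike directions are aligned with the signal.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, s = 50, 2000, 5

def signal_error(lam, beta, trials=30):
    """Excess test error of the min l2-norm interpolator fit to noiseless data."""
    errs = []
    for _ in range(trials):
        Phi = rng.standard_normal((n, p)) * np.sqrt(lam)   # rows ~ N(0, diag(lam))
        theta = np.linalg.pinv(Phi) @ (Phi @ beta)         # interpolate the clean responses
        diff = theta - beta
        errs.append(diff @ (lam * diff))                   # E[(phi^T(theta - beta))^2]
    return np.mean(errs)

beta = np.zeros(p)
beta[:s] = 1.0 / np.sqrt(s)                                # unit-energy signal in the first s directions

lam_iso = np.ones(p)                                       # isotropic covariance
lam_spiked = np.concatenate([np.full(s, 10.0), np.full(p - s, 0.01)])  # spiked, aligned with the signal

print(signal_error(lam_iso, beta))      # close to ||beta||^2 (1 - n/p): poor signal recovery
print(signal_error(lam_spiked, beta))   # much smaller: the spike (signal) directions are recovered well
```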

3.1.3 When does double descent occur?

The above subsections show that the following conditions are sufficient and necessary for good generalization of the minimum $\ell_2$-norm interpolator:

  • low-dimensional signal structure, with the non-zero components of the signal aligned with the eigenvectors corresponding to the highest-value eigenvalues (i.e., high-energy directions in the data)

  • low effective dimension in data

  • an overparameterized number of low-value (but non-zero) directions.

Given these pieces, the ingredients for double descent become clear. In addition to the aforementioned signal- and noise-oriented error terms in (26), we will in general have a misspecification error term, denoted by $\mathcal{E}_{\rm mis}$, which will always decrease with overparameterization and therefore further contributes to the double descent phenomenon. For example, recall Section 2.4 and the example in Fig. 4, where the misspecification bias and variance terms decrease with $p$ in the overparameterized range. Optimization aspects can also affect the double descent behavior; for example, D'Ascoli et al. (2020) study the contribution of the optimization initialization to double descent phenomena in a lazy training setting (i.e., where the learned parameters are close to their initializations (Chizat et al., 2019)).

Examples of the approximation-theoretic benefits of overparameterization are also provided by Hastie et al. (2019), but these do not go far enough to recover the double descent behavior, as they are still restricted to the isotropic setting, in which increasing the number of parameters only further harms signal recovery. Belkin et al. (2020) provided one of the first explicit characterizations of double descent under two models of randomly selected "weak features". In their model, the learned features are selected uniformly at random from a larger ensemble of features (which are themselves isotropic in distribution and follow either a Gaussian or a Fourier model). This idea of "weak features" is spiritually connected to the classically observed benefits of overparameterization in ensemble methods like random forests and boosting, which were recently discussed in an insightful unified manner (Wyner et al., 2017). This connection was first mentioned in Belkin et al. (2019a).
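A Belkin-style random selection of "weak features" is simple to simulate. The sketch below is only a rough numerical illustration of that idea, not the precise Gaussian or Fourier models analyzed in the cited works: the ensemble size, sample size, and noise level are our own assumptions. It fits a least-squares/minimum $\ell_2$-norm solution on $d$ features chosen uniformly at random out of an isotropic Gaussian ensemble; the averaged excess risk spikes at the interpolation threshold $d = n$ and descends again as $d$ grows past it, tracing out a double descent curve.

```python
import numpy as np

rng = np.random.default_rng(2)
D, n, sigma = 400, 40, 0.1               # assumed ensemble size, sample size, noise level
beta = rng.standard_normal(D)
beta /= np.linalg.norm(beta)             # unit-energy signal spread over all D isotropic features

X_full = rng.standard_normal((n, D))     # isotropic Gaussian features
y = X_full @ beta + sigma * rng.standard_normal(n)

for d in [10, 20, 39, 40, 41, 80, 200, 400]:          # number of randomly selected weak features
    risks = []
    for _ in range(50):                               # average over random feature subsets
        S = rng.choice(D, size=d, replace=False)
        theta = np.linalg.pinv(X_full[:, S]) @ y      # least-squares / min l2-norm fit on the subset
        beta_hat = np.zeros(D)
        beta_hat[S] = theta
        risks.append(np.sum((beta_hat - beta) ** 2))  # excess risk under isotropic covariance
    print(f"d={d:4d}   mean excess risk={np.mean(risks):.2f}")
```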

Subsequent to the first theoretical works on this topic, Mei and Montanari (2019) also showed precise asymptotic characterizations of the double descent behavior in a random features model.

3.2 A signal processing perspective on harmless interpolation

In this section, we present an elementary explanation for harmless interpolation using the concepts of aliasing and Parseval’s theorem for a “toy” case of overparameterized linear regression with cosine features on regularly spaced, one-dimensional input data. This elementary explanation was first introduced by Muthukumar et al. (2020b), and inspired subsequent work for the case of classification (Muthukumar et al., 2020a). While this calculation illuminates the impact of interpolation in both isotropic and anisotropic settings, we focus here on the isotropic case for simplicity.

This toy setting can be perceived as a special case of Example 13 with one-dimensional input and cosine feature maps. Note that our example with real-valued cosine feature maps differs in some low-level mathematical details from the analysis originally given by Muthukumar et al. (2020b) for Fourier features.

[Cosine features for one-dimensional input] Throughout this example, we consider the number of features $p$ to be an integer multiple of $2n$, where $n$ is the number of training examples. We also consider one-dimensional data $x \in [0,1]$ and a family of cosine feature maps $\phi_k(x) = c_k \cos(\pi k x)$, $k = 0, 1, 2, \dots$, where the normalization constant $c_k$ equals $1$ for $k = 0$ and $\sqrt{2}$ for $k \geq 1$. Then, the $p$-dimensional cosine feature vector is given by

$$\boldsymbol{\phi}(x) = \big(\phi_0(x), \phi_1(x), \dots, \phi_{p-1}(x)\big)^{\top}.$$
This is clearly an orthonormal, i.e., isotropic, feature family in the sense that

$$\mathbb{E}_{x \sim \mathrm{Unif}[0,1]}\big[\phi_k(x)\,\phi_\ell(x)\big] = \delta_{k\ell},$$

where $\delta_{k\ell}$ denotes the Kronecker delta. Furthermore, in this "toy" model we assume $n$ regularly spaced training data points, i.e., $x_i = \frac{i-1}{n}$ for $i = 1, \dots, n$. While the more standard model in the ML setup would be to consider the $x_i$ to be i.i.d. draws from the input distribution (which is uniform over $[0,1]$ in this example), the toy model with deterministic, regularly spaced $x_i$ yields a particularly elementary and insightful analysis owing to the presence of exact aliases. In the regime of overparameterized learning, we have $p > n$. The case of regularly spaced training inputs allows us to see overparameterized learning as reconstructing an undersampled signal: the number of samples (given by $n$) is much less than the number of features (given by $p$). Thus, the undersampling perspective implies that aliasing, in some approximate sense, is a core issue that underlies model behavior in overparameterized learning. (The reason that this statement holds only in an approximate sense in practice is that the training data is typically random, and we can have a wide variety of feature families.) We illustrate this through Example 3.2, which is an idealized toy model and a special case of Example 13.
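The exact aliases are easy to verify numerically. The snippet below adopts the conventions reconstructed above (features proportional to $\cos(\pi k x)$ on $[0,1]$ and the regular grid $x_i = (i-1)/n$, which are our assumptions for this sketch) and checks that every frequency of the form $2ln \pm k$ agrees with the base frequency $k$ at all of the training inputs.

```python
import numpy as np

# Assumed conventions: cosine features proportional to cos(pi*k*x) on [0, 1],
# evaluated on the regular training grid x_i = (i - 1)/n.
n, k = 8, 3                                        # sample size and a base frequency k < n
x = np.arange(n) / n                               # regularly spaced training inputs

base = np.cos(np.pi * k * x)                       # the low-frequency cosine on the grid
for l in range(1, 5):
    alias_plus = np.cos(np.pi * (2 * l * n + k) * x)    # frequency 2*l*n + k
    alias_minus = np.cos(np.pi * (2 * l * n - k) * x)   # frequency 2*l*n - k
    assert np.allclose(alias_plus, base)
    assert np.allclose(alias_minus, base)

print("Every cosine of frequency 2ln ± k matches frequency k on all n training inputs.")
```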

Here, the true signal is of the form $f(x) = \sum_{k} \beta_k \phi_k(x)$, i.e., a linear combination of cosine features (in the experiment of Figure 6, of the first 25 of them). Then, we have the noisy training data

$$y_i = f(x_i) + \epsilon_i, \qquad i = 1, \dots, n, \qquad\qquad (30)$$

where the noise terms $\epsilon_i$ are i.i.d. zero-mean Gaussian with variance $\sigma^2$. The overparameterized estimator has to reconstruct a signal over the continuous interval $[0,1]$ while interpolating the given $n$ data points using a linear combination of the orthonormal cosine functions $\phi_0, \dots, \phi_{p-1}$ that were defined in Example 3.2 for $p > n$. In other words, various interpolating function estimates can be induced by plugging different parameter vectors $\boldsymbol{\theta} = (\theta_0, \dots, \theta_{p-1})$ into

$$\hat{f}_{\boldsymbol{\theta}}(x) = \sum_{k=0}^{p-1} \theta_k \phi_k(x) \qquad\qquad (31)$$

such that $\hat{f}_{\boldsymbol{\theta}}(x_i) = y_i$ for $i = 1, \dots, n$. The assumed uniform distribution on the test input implies that the performance of an estimate is evaluated as $\mathbb{E}_{x \sim \mathrm{Unif}[0,1]}\big[(\hat{f}_{\boldsymbol{\theta}}(x) - f(x))^2\big]$. In Figure 6 we demonstrate results in the setting of Example 3.2. Figure 6 illustrates the minimum $\ell_2$-norm solution of the form (31) for an overparameterized number of parameters $p > n$. The training and test error curves (as a function of $p$) in Figure 6 show that the global minimum of the test error is obtained in the overparameterized range, i.e., for a number of parameters $p$ that exceeds the number of training examples $n$.


Figure 6: Evaluation of LS regression using orthonormal cosine features as in Example 3.2. The true signal is a linear combination of the first 25 cosine functions (see the blue curve in (a)), and the training data are corrupted by additive Gaussian noise. (a) An example of the minimum $\ell_2$-norm solution for an overparameterized number of features $p > n$. (b) Test and training error curves for the minimum $\ell_2$-norm solutions with a varying number of parameters $p$.
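The kind of experiment behind Figure 6 is easy to sketch. The settings below (sample size, noise level, and a random unit-energy signal over the first 25 cosine features) are our own illustrative choices rather than the figure's exact configuration, and the cosine features follow the conventions assumed in Example 3.2 above. The printout shows the training error dropping to (numerically) zero once $p \ge n$, while the test error varies with $p$ and, in this isotropic setting, eventually drifts toward the null risk for very large $p$, in line with the discussion in Section 3.1.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma, k_sig = 20, 0.3, 25            # illustrative sample size, noise level, signal support
x_train = np.arange(n) / n               # regular training grid on [0, 1)
x_test = np.linspace(0.0, 1.0, 2000)     # dense grid standing in for uniform test inputs

def features(x, p):
    k = np.arange(p)
    c = np.where(k == 0, 1.0, np.sqrt(2.0))        # assumed orthonormal scaling
    return c * np.cos(np.pi * np.outer(x, k))      # len(x) x p matrix of cosine features

coef = rng.standard_normal(k_sig)
coef /= np.linalg.norm(coef)                       # unit-energy signal over the first 25 features
f = lambda x: features(x, k_sig) @ coef

y = f(x_train) + sigma * rng.standard_normal(n)

for p in [5, 10, 15, 20, 25, 40, 60, 100, 200, 400]:
    Phi = features(x_train, p)
    theta = np.linalg.pinv(Phi) @ y                # least-squares / minimum l2-norm solution
    train_mse = np.mean((Phi @ theta - y) ** 2)
    test_mse = np.mean((features(x_test, p) @ theta - f(x_test)) ** 2)
    print(f"p={p:4d}   train MSE={train_mse:.3f}   test MSE={test_mse:.3f}")
```

Replacing the regular grid with i.i.d. uniform training inputs tends to make the test-error peak near $p = n$ far more pronounced, since a random design becomes nearly singular at the interpolation threshold.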

In the discussion below, we will consider (30) in both noisy and noiseless settings (where $\sigma = 0$ in the latter case). To demonstrate the concept of aliasing clearly, suppose that the true signal is

$$f(x) = \cos(\pi k^* x), \qquad\qquad (32)$$

i.e., a single cosine function whose frequency is determined by an integer $k^* \in \{1, \dots, n-1\}$. We start by considering the noiseless version of (30), which we illustrate in Figure 7. (Note that in this figure the true signal and the training data points are presented as the blue curve and black circle markers, respectively.) Then, a trivial estimate that interpolates all of the training data points is the unnormalized cosine function $\hat{f}(x) = \cos(\pi k^* x) = \frac{1}{c_{k^*}}\phi_{k^*}(x)$. Yet, this is not the only interpolating estimate that relies on a single cosine function. Recall that in Example 3.2 we assume $p$ to be an integer multiple of $2n$. Then, using the periodicity and other basic properties of cosine functions, one can prove that there is a total of $p/n$ estimates (each in the form of a single cosine function from $\{\phi_k\}_{k=0}^{p-1}$) that interpolate the training data over the grid $x_i = \frac{i-1}{n}$, $i = 1, \dots, n$. Specifically, one subset of single-cosine interpolating estimates is given by

$$\hat{f}_l(x) = \cos\big(\pi(2ln + k^*)x\big), \qquad l = 0, 1, \dots, \tfrac{p}{2n} - 1, \qquad\qquad (33)$$

and the second subset of single-cosine interpolating estimates is given by

$$\hat{f}_l(x) = \cos\big(\pi(2ln - k^*)x\big), \qquad l = 1, 2, \dots, \tfrac{p}{2n}. \qquad\qquad (34)$$

One of these aliases is shown as the green curve (along with the true signal as the blue curve) in Figure 7. Clearly, the total number of such aliases, given by $p/n$, increases with overparameterization. Next, note that each of the interpolating aliases in (33)–(34) is induced by setting a single parameter in the estimate form (31) to a value of magnitude $1/c_k = 1/\sqrt{2}$ (for the corresponding frequency $k$) and all of the other parameters to zero. Accordingly, we will now see that none of these aliases is the minimum $\ell_2$-norm interpolating estimate (interestingly, note that the alias $\hat{f}_0$ from (33) is not of minimum $\ell_2$-norm even though this estimate coincides with the true signal for any $x \in [0,1]$). Indeed, the minimum $\ell_2$-norm solution in this noiseless case of a pure cosine function equally weights all of the exact aliases, assigning one and the same coefficient in (31) to every alias frequency appearing in (33)–(34).
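This equal weighting is easy to observe numerically. The sketch below again uses the conventions assumed for Example 3.2 (features $c_k \cos(\pi k x)$ on the regular grid), with illustrative values of $n$, $p$, and $k^*$: the coefficients of the minimum $\ell_2$-norm interpolator on the alias frequencies come out equal to one another, while the weight placed on every other frequency is far smaller (and shrinks as $n$ grows).

```python
import numpy as np

# Noiseless pure-cosine case of Example 3.2 under the conventions assumed above
# (features c_k * cos(pi*k*x), regular grid x_i = (i-1)/n); n, p, k_star are illustrative.
n, p, k_star = 100, 600, 3
x = np.arange(n) / n

k = np.arange(p)
c = np.where(k == 0, 1.0, np.sqrt(2.0))
Phi = c * np.cos(np.pi * np.outer(x, k))            # n x p feature matrix on the training grid

y = np.cos(np.pi * k_star * x)                      # noiseless labels from the true signal
theta = np.linalg.pinv(Phi) @ y                     # minimum l2-norm interpolator

aliases = [2 * l * n + k_star for l in range(p // (2 * n))] + \
          [2 * l * n - k_star for l in range(1, p // (2 * n) + 1)]
others = np.delete(np.arange(p), aliases)

print("alias frequencies:       ", sorted(aliases))
print("coefficients on aliases: ", np.round(theta[aliases], 4))     # numerically identical values
print("largest |coef| elsewhere:", float(np.max(np.abs(theta[others]))))
```

Because the common alias coefficient is roughly $1/(\sqrt{2}\,M)$ with $M = p/n$ the number of aliases, the component of the fitted function on the true frequency is shrunk by a factor of about $M$; this "bleeding" of the signal across its aliases is the elementary mechanism behind the poor signal recovery of the isotropic case discussed in Section 3.1.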


Figure 7: Illustration of aliasing and its effect on the minimum $\ell_2$-norm interpolator. Two idealized settings of Example 3.2 are shown. Left subfigure (a): a noiseless case ($\sigma = 0$) in which the true signal is a single low-frequency cosine function and the training data points are its noiseless samples on the regular grid. The trivial single-cosine alias coincides with the true signal (blue curve), a corresponding high-frequency alias is presented as the green curve, and the minimum $\ell_2$-norm estimate is presented as the red curve. Right subfigure (b): a noisy case ($\sigma > 0$) in which the training data points are noisy samples of a low-frequency cosine signal on the same grid. The interpolating estimate that uses only the first (low-frequency) cosine functions is presented as the purple curve, a high-frequency alias of this estimate is presented as the green curve, and the minimum $\ell_2$-norm estimate is presented as the red curve.