Rethinking Bias-Variance Trade-off for Generalization of Neural Networks

02/26/2020, by Zitong Yang et al.

The classical bias-variance trade-off predicts that bias decreases and variance increases with model complexity, leading to a U-shaped risk curve. Recent work calls this into question for neural networks and other over-parameterized models, for which it is often observed that larger models generalize better. We provide a simple explanation for this by measuring the bias and variance of neural networks: while the bias is monotonically decreasing as in the classical theory, the variance is unimodal or bell-shaped: it increases then decreases with the width of the network. We vary the network architecture, loss function, and choice of dataset and confirm that variance unimodality occurs robustly for all models we considered. The risk curve is the sum of the bias and variance curves and displays different qualitative shapes depending on the relative scale of bias and variance, with the double descent curve observed in recent literature as a special case. We corroborate these empirical results with a theoretical analysis of two-layer linear networks with random first layer. Finally, evaluation on out-of-distribution data shows that most of the drop in accuracy comes from increased bias, while variance increases by a relatively small amount. Moreover, we find that deeper models decrease bias and increase variance for both in-distribution and out-of-distribution data.


1 Introduction

Figure 1: Typical cases of the expected risk curve (in black) for neural networks: (a) Case 1, (b) Case 2, (c) Case 3. Blue: squared bias curve. Red: variance curve.

The bias-variance trade-off is a fundamental principle for understanding the generalization of predictive learning models (hastie01statisticallearning). The bias is an error term that stems from a mismatch between the model class and the underlying data distribution, and is typically monotonically non-increasing as a function of the complexity of the model. The variance measures sensitivity to fluctuations in the training set and is often attributed to a large number of model parameters. Classical wisdom predicts that model variance increases and bias decreases monotonically with model complexity (geman1992neural). Under this perspective, we should seek a model that has neither too little nor too much capacity and achieves the best trade-off between bias and variance.

In contrast, modern practice for neural networks repeatedly demonstrates the benefit of increasing the number of neurons (krizhevsky2012imagenet; simonyan2014very; zhang2017understanding), even up to the point of saturating available memory. This gap between classical theory and modern practice motivates us to revisit the bias-variance trade-off theory in this new context.

Recently, a series of works has offered an explanation for the generalization of deep learning methods by proposing a double descent risk curve (belkindouble; belkin2018understand; belkin2019does), which has also been analytically characterized for certain regression models (meisong; trevor18; Spigler:IOP19; deng2019model; Advani2017HighdimensionalDO). However, there exist two mysteries around the double descent risk curve. First, the double descent phenomenon cannot be robustly observed (Ba2020Generalization); to observe it in modern neural network architectures, we have to artificially inject label noise (nakkiran2019deep). Second, there is no explanation for why the double descent risk curve should occur. In this work, we offer a simple explanation for these two mysteries by identifying an unexpected unimodal variance curve.

Specifically, we measure the bias and variance of modern deep neural networks trained on commonly used computer vision datasets. Our main finding is that while the bias is monotonically decreasing with network width, as in the classical theory, the variance curve is unimodal or bell-shaped: it first increases and then decreases (see Figure 2). Importantly, this unimodal variance phenomenon occurs robustly for varying network architectures and datasets. Moreover, by using a generalized bias-variance decomposition for Bregman divergences (pfau2013bergman), we verify that it occurs for both squared loss and cross-entropy loss.

This unimodal variance phenomenon initially appears to contradict recent theoretical work suggesting that both bias and variance are non-monotonic and exhibit a peak in some regimes (meisong; trevor18). The difference is that this previous work considered the fixed-design bias and variance, while we measure the random-design bias and variance (we describe the differences in detail in §2.1). Neal:arxiv18 and nakkiran2019more also considered the random-design setting, but neither studied the shape of the variance curve as a function of model width, and so did not investigate the unimodality phenomenon.

A key finding of our work is that the complex behavior of the risk curve arises from the simple but non-classical variance unimodality phenomenon. Indeed, since the expected risk (test loss) is the sum of bias and variance, monotonic bias and unimodal variance can lead to three characteristic behaviors, illustrated in Figure 1, depending on the relative scale of the bias and variance. If the bias completely dominates, we obtain a monotonically decreasing risk curve (see Figure 1(a)). Meanwhile, if the variance dominates, we obtain a bell-shaped risk curve that first increases then decreases (see Figure 1(c)). The most complex behavior arises when bias and variance dominate in different regimes, leading to the double-descent risk curve in Figure 1(b). All three behaviors are well-aligned with the empirical observation in deep learning that larger models typically perform better. The most common behavior in our experiments is the first case (monotonically decreasing risk curve), as bias is typically larger than variance. We observe the double-descent risk curve when label noise is added to the training set (see §3.3), and the unimodal risk curve when we use the generalized bias-variance decomposition for cross-entropy loss (see §3.2).
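To make the arithmetic behind Figure 1 concrete, here is a minimal, purely illustrative Python sketch (the bias and variance curves below are synthetic toys, not measurements from this work): a fixed monotonically decreasing bias is added to a unimodal variance at three different scales, producing a monotone, double-descent, or unimodal risk curve.

```python
import numpy as np

width = np.linspace(1, 100, 400)                     # model "width" (complexity)
bias2 = np.exp(-width / 80)                          # synthetic monotonically decreasing bias^2
variance = np.exp(-((width - 40) ** 2) / 450)        # synthetic unimodal (bell-shaped) variance

for scale, label in [(0.05, "Case 1 (bias dominates, monotone risk)"),
                     (0.6,  "Case 2 (mixed regimes, double descent)"),
                     (5.0,  "Case 3 (variance dominates, unimodal risk)")]:
    risk = bias2 + scale * variance                  # expected risk = bias^2 + variance
    d = np.diff(risk)
    turning_points = int(np.sum(d[1:] * d[:-1] < 0)) # strict sign changes of the slope
    print(f"{label}: {turning_points} turning point(s)")
```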

Further Implications.

The investigations described above characterize bias and variance as a function of network width, but we can explore the dependence on other quantities as well, such as model depth (§4.2). Indeed, we find that deeper models tend to have lower bias but higher variance. Since bias is larger at current model sizes, this confirms the prevailing wisdom that we should generally use deeper models when possible. On the other hand, it suggests that this process may have a limit—eventually very deep models may have low bias but high variance such that increasing the depth further harms performance.

We also investigate the commonly observed drop in accuracy for models evaluated on out-of-distribution data, and attribute it primarily to increased bias. Combined with the previous observation, this suggests that increasing model depth may help combat the drop in out-of-distribution accuracy, which is supported by experimental findings in hendrycks2019benchmarking.

Theoretical Analysis of a Two-Layer Neural Network.

Finally, we conduct a theoretical study of a two-layer linear network with a random Gaussian first layer. While this model is much simpler than those used in practice, we nevertheless observe the same characteristic behaviors for the bias and variance. In particular, by working in the asymptotic setting where the input data dimension, amount of training data, and network width go to infinity with fixed ratios, we show that the bias is monotonically decreasing while the variance curve is unimodal. Our analysis also characterizes the location of the variance peak as the point where the number of hidden neurons is approximately half of the dimension of the input data.

2 Preliminaries

In this section we present the bias-variance decomposition for the squared loss. We also present a generalized bias-variance decomposition for the cross-entropy loss in §2.2. The task is to learn a function $f(\mathbf{x}; \mathcal{T})$ based on i.i.d. training samples $\mathcal{T} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$ drawn from a joint distribution $\mathcal{P}$ on $\mathcal{X} \times \mathcal{Y}$, such that the mean squared error $\mathbb{E}_{\mathbf{x}, y}\big[(y - f(\mathbf{x}; \mathcal{T}))^2\big]$ is minimal, where $(\mathbf{x}, y) \sim \mathcal{P}$. Here we denote the learned function by $f(\mathbf{x}; \mathcal{T})$ to make the dependence on the training samples $\mathcal{T}$ clear.

Note that the learned predictor $f(\cdot; \mathcal{T})$ is a random quantity depending on $\mathcal{T}$. We can assess its performance in two different ways. The first way, random-design, takes the expectation over $\mathcal{T}$, so that we consider the expected error $\mathbb{E}_{\mathcal{T}}\mathbb{E}_{\mathbf{x}, y}\big[(y - f(\mathbf{x}; \mathcal{T}))^2\big]$. The second way, fixed-design, holds the training covariates $\mathbf{x}_1, \ldots, \mathbf{x}_n$ fixed and only takes the expectation over the training labels $y_1, \ldots, y_n$, i.e., $\mathbb{E}_{y_1, \ldots, y_n}\mathbb{E}_{\mathbf{x}, y}\big[(y - f(\mathbf{x}; \mathcal{T}))^2\big]$. The choice of random/fixed design leads to different bias-variance decompositions. Throughout the paper, we focus on random design, as opposed to the fixed design studied in meisong; trevor18; Ba2020Generalization.

2.1 Bias-Variance Decomposition

Random Design.

In the random-design setting, decomposing the expected error $\mathbb{E}_{\mathcal{T}}\mathbb{E}_{\mathbf{x}, y}\big[(y - f(\mathbf{x}; \mathcal{T}))^2\big]$ gives the usual bias-variance trade-off from machine learning, e.g. geman1992neural; hastie01statisticallearning:

$$\mathbb{E}_{\mathcal{T}}\mathbb{E}_{\mathbf{x}, y}\big[(y - f(\mathbf{x}; \mathcal{T}))^2\big] = \mathbb{E}_{\mathbf{x}, y}\big[(y - \bar{f}(\mathbf{x}))^2\big] + \mathbb{E}_{\mathbf{x}}\mathbb{E}_{\mathcal{T}}\big[(f(\mathbf{x}; \mathcal{T}) - \bar{f}(\mathbf{x}))^2\big],$$

where $\bar{f}(\mathbf{x}) = \mathbb{E}_{\mathcal{T}}[f(\mathbf{x}; \mathcal{T})]$ is the prediction averaged over different realizations of the training sample. In addition to taking the expectation $\mathbb{E}_{\mathcal{T}}$, we also average over the test point $\mathbf{x}$, as discussed in bishop:2006:PRML. For future reference, we define

$$\mathrm{Bias}^2 = \mathbb{E}_{\mathbf{x}, y}\big[(y - \bar{f}(\mathbf{x}))^2\big], \tag{1}$$
$$\mathrm{Variance} = \mathbb{E}_{\mathbf{x}}\mathbb{E}_{\mathcal{T}}\big[(f(\mathbf{x}; \mathcal{T}) - \bar{f}(\mathbf{x}))^2\big]. \tag{2}$$

In §2.2, we present our estimators for the bias and variance defined in equations (1) and (2).

Fixed Design.

In the fixed-design setting, the covariates $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are held fixed, and the only randomness in the training set comes from the labels. As presented in meisong; trevor18; Ba2020Generalization, a more natural way to state the fixed-design assumption is to hold $\mathbf{x}_1, \ldots, \mathbf{x}_n$ fixed and let $y_i = f_0(\mathbf{x}_i) + \varepsilon_i$ for $i = 1, \ldots, n$, where $f_0$ is a ground-truth function and the $\varepsilon_i$ are random noise terms. Under this assumption, the randomness in $\mathcal{T}$ comes entirely from the random noise $\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)$. To make this clear, we write $f(\mathbf{x}; \mathcal{T})$ as $f(\mathbf{x}; \boldsymbol{\varepsilon})$. Then, we obtain the fixed-design bias-variance decomposition

$$\mathbb{E}_{\boldsymbol{\varepsilon}}\mathbb{E}_{\mathbf{x}, y}\big[(y - f(\mathbf{x}; \boldsymbol{\varepsilon}))^2\big] = \mathbb{E}_{\mathbf{x}, y}\big[(y - \bar{f}_{\boldsymbol{\varepsilon}}(\mathbf{x}))^2\big] + \mathbb{E}_{\mathbf{x}}\mathbb{E}_{\boldsymbol{\varepsilon}}\big[(f(\mathbf{x}; \boldsymbol{\varepsilon}) - \bar{f}_{\boldsymbol{\varepsilon}}(\mathbf{x}))^2\big],$$

where $\bar{f}_{\boldsymbol{\varepsilon}}(\mathbf{x}) = \mathbb{E}_{\boldsymbol{\varepsilon}}[f(\mathbf{x}; \boldsymbol{\varepsilon})]$. In most practical settings, the expectation $\mathbb{E}_{\boldsymbol{\varepsilon}}$ cannot be estimated from the training samples, because we do not have access to independent copies of the noise. In comparison to the random-design setting, the fixed-design setting tends to have larger bias and smaller variance, since more "randomness" is introduced into the variance term.

2.2 Estimating Bias and Variance

In this section, we present the estimators we use for the bias and variance defined in equations (1) and (2). The high-level idea is to approximate the expectation $\mathbb{E}_{\mathcal{T}}$ by computing a sample average over models trained on multiple training sets $\mathcal{T}_1, \ldots, \mathcal{T}_N$. When evaluating this expectation, there is a trade-off between having larger training sets (more samples within each split) and having a larger number of splits $N$, since the splits must share a fixed total number of training samples. For example, with the 50,000 CIFAR10 training images, $N = 2$ splits give 25,000 samples per split, while $N = 5$ splits give 10,000.

Mean Squared Error (MSE).

To estimate the bias and variance in equations (1) and (2), we introduce an unbiased estimator for the variance, and obtain the bias by subtracting the variance from the risk. Let $\mathcal{T}_1, \ldots, \mathcal{T}_N$ be a random disjoint split of the training samples. In our experiments, we mainly take $N = 2$ (for CIFAR10 each $\mathcal{T}_j$ then has 25k samples). To estimate the variance, we use the unbiased estimator

$$\widehat{\mathrm{var}}(\mathbf{x}; \mathcal{T}_1, \ldots, \mathcal{T}_N) = \frac{1}{N-1} \sum_{j=1}^{N} \Big( f(\mathbf{x}; \mathcal{T}_j) - \frac{1}{N} \sum_{k=1}^{N} f(\mathbf{x}; \mathcal{T}_k) \Big)^2,$$

where $\widehat{\mathrm{var}}$ depends on the test point $\mathbf{x}$ and on the random training sets $\mathcal{T}_1, \ldots, \mathcal{T}_N$. While $\widehat{\mathrm{var}}$ is unbiased, its variance can be reduced by using multiple random splits to obtain estimators $\widehat{\mathrm{var}}_1, \ldots, \widehat{\mathrm{var}}_k$ and taking their average. This reduces the variance of the variance estimator since

$$\mathrm{Var}\Big[\frac{1}{k}\sum_{i=1}^{k}\widehat{\mathrm{var}}_i\Big] = \frac{1}{k^2}\sum_{i,j}\mathrm{Cov}\big(\widehat{\mathrm{var}}_i, \widehat{\mathrm{var}}_j\big) \le \frac{1}{k^2}\sum_{i,j}\sqrt{\mathrm{Var}\big[\widehat{\mathrm{var}}_i\big]\,\mathrm{Var}\big[\widehat{\mathrm{var}}_j\big]} = \mathrm{Var}\big[\widehat{\mathrm{var}}_1\big],$$

where the $\widehat{\mathrm{var}}_i$ are identically distributed but not independent, and we used the Cauchy-Schwarz inequality.
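As a concrete companion to the formulas above, the following is a minimal NumPy sketch of this estimation procedure, assuming the predictions of the models trained on the disjoint splits have already been collected into an array (the function and variable names are illustrative, not taken from the paper's code):

```python
import numpy as np

def variance_estimate(preds):
    """Unbiased variance estimate from models trained on disjoint splits.

    preds: array of shape (k, N, num_test) -- k independent random splits,
           N models per split (one per disjoint subset), predictions at
           num_test test points.
    Returns the variance estimate averaged over splits and test points.
    """
    # Per-split sample variance across the N models (ddof=1 gives the
    # unbiased 1/(N-1) normalization), then average over splits/test points.
    per_split_var = preds.var(axis=1, ddof=1)   # shape (k, num_test)
    return per_split_var.mean()

def bias_estimate(preds, targets):
    """Bias^2 (plus irreducible noise) obtained as risk minus variance."""
    risk = ((preds - targets) ** 2).mean()      # mean over splits, models, test points
    return risk - variance_estimate(preds)

# Toy usage with synthetic numbers (k=3 random splits, N=2 models per split).
rng = np.random.default_rng(0)
targets = rng.normal(size=100)
preds = targets + 0.3 + 0.2 * rng.normal(size=(3, 2, 100))
print(variance_estimate(preds), bias_estimate(preds, targets))
```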

Cross-Entropy Loss (CE).

In addition to the classical bias-variance decomposition for the MSE loss, we also consider a generalized bias-variance decomposition for the cross-entropy loss. Let $\hat{\boldsymbol{\pi}}(\mathbf{x}; \mathcal{T})$ be the output of the neural network (a probability distribution over the class labels); $\hat{\boldsymbol{\pi}}$ is a random variable since the training set $\mathcal{T}$ is random. Let $\boldsymbol{\pi}_0(\mathbf{x})$ be the one-hot encoding of the ground-truth label. Then, omitting the dependence of $\hat{\boldsymbol{\pi}}$ and $\boldsymbol{\pi}_0$ on $\mathbf{x}$ and $\mathcal{T}$, the cross-entropy loss $H(\boldsymbol{\pi}_0, \hat{\boldsymbol{\pi}}) = -\sum_{j} \boldsymbol{\pi}_0[j] \log \hat{\boldsymbol{\pi}}[j]$ can be decomposed as

$$\mathbb{E}_{\mathcal{T}}\big[H(\boldsymbol{\pi}_0, \hat{\boldsymbol{\pi}})\big] = \underbrace{D_{\mathrm{KL}}\big(\boldsymbol{\pi}_0 \,\|\, \bar{\boldsymbol{\pi}}\big)}_{\mathrm{Bias}^2} + \underbrace{\mathbb{E}_{\mathcal{T}}\big[D_{\mathrm{KL}}\big(\bar{\boldsymbol{\pi}} \,\|\, \hat{\boldsymbol{\pi}}\big)\big]}_{\mathrm{Variance}}, \tag{3}$$

where $\hat{\boldsymbol{\pi}}[j]$ is the $j$-th element of $\hat{\boldsymbol{\pi}}$, and $\bar{\boldsymbol{\pi}}$ is the average of the log-probabilities after normalization, i.e.,

$$\bar{\boldsymbol{\pi}}[j] \propto \exp\big(\mathbb{E}_{\mathcal{T}}[\log \hat{\boldsymbol{\pi}}[j]]\big).$$

This decomposition is a special case of the general decomposition for Bregman divergences discussed in pfau2013bergman.

Algorithm 1 Estimating Generalized Variance
  Input: Test point $\mathbf{x}$, training set $\mathcal{T}$.
  for $i = 1$ to $k$ do
     Split $\mathcal{T}$ into $\mathcal{T}_1^{(i)}, \ldots, \mathcal{T}_N^{(i)}$.
     for $j = 1$ to $N$ do
        Train the model using $\mathcal{T}_j^{(i)}$;
        Evaluate the model at $\mathbf{x}$; call the result $\hat{\boldsymbol{\pi}}_j^{(i)}$;
     end for
  end for
  Compute $\bar{\boldsymbol{\pi}} = \exp\big(\frac{1}{kN}\sum_{i,j} \log \hat{\boldsymbol{\pi}}_j^{(i)}\big)$
  (using element-wise log and exp; this estimates $\exp(\mathbb{E}_{\mathcal{T}}[\log \hat{\boldsymbol{\pi}}])$).
  Normalize $\bar{\boldsymbol{\pi}}$ to get a probability distribution.
  Compute the variance estimate $\frac{1}{kN}\sum_{i,j} D_{\mathrm{KL}}\big(\bar{\boldsymbol{\pi}} \,\|\, \hat{\boldsymbol{\pi}}_j^{(i)}\big)$.

We apply Algorithm 1 to estimate the generalized variance in (3). Here we could not obtain an unbiased estimator, but the estimate improves as we take more random splits. In practice, we choose the number of splits to be large enough that the estimated variance stabilizes when we increase it further (see §3.4). As in the case of the squared loss, we estimate the bias by subtracting the variance from the risk.
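The following is a minimal NumPy sketch of the computation in Algorithm 1, assuming the predicted class-probability vectors of the $k \times N$ trained models at a single test point are already available (names and shapes are illustrative assumptions):

```python
import numpy as np

def generalized_variance(probs, eps=1e-12):
    """Bregman (KL) variance estimate of Algorithm 1 at one test point.

    probs: array of shape (M, num_classes) -- predicted probability vectors
           pi_hat for M trained models (M = k * N in Algorithm 1).
    """
    log_probs = np.log(probs + eps)
    pi_bar = np.exp(log_probs.mean(axis=0))        # element-wise exp of averaged log-probs
    pi_bar = pi_bar / pi_bar.sum()                  # normalize to a probability distribution
    # Average KL(pi_bar || pi_hat) over the M models.
    kl = (pi_bar * (np.log(pi_bar + eps) - log_probs)).sum(axis=1)
    return kl.mean()

def generalized_bias(probs, true_class, eps=1e-12):
    """Bias term obtained as (expected cross-entropy risk) - (variance)."""
    risk = -np.log(probs[:, true_class] + eps).mean()
    return risk - generalized_variance(probs, eps)

# Toy usage: 10 models, 4 classes, ground-truth class 2.
rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 4)) + np.array([0.0, 0.0, 2.0, 0.0])
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(generalized_variance(probs), generalized_bias(probs, true_class=2))
```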

3 Measuring Bias and Variance for Neural Networks

Figure 2: Mainline experiment on ResNet34, CIFAR10 dataset (25,000 training samples). (Left) Risk, bias, and variance for ResNet34. (Middle) Variance for ResNet34. (Right) Train error and test error for ResNet34.
Figure 3: Risk, bias, and variance with respect to different network architectures, training loss functions, and datasets. (a) ResNext29 trained with MSE loss on the CIFAR10 dataset (25,000 training samples). (b) ResNet34 trained with CE loss (estimated by the generalized bias-variance decomposition using Bregman divergence) on the CIFAR10 dataset (10,000 training samples). (c) Fully connected network with one hidden layer and ReLU activation trained with MSE loss on the MNIST dataset (10,000 training samples).

In this section, we study the bias and variance (equations (1) and (2)) of deep neural networks. While the bias is monotonically decreasing, as folk wisdom would predict, the variance is unimodal (it first increases to a peak and then decreases). We conduct extensive experiments to verify that this phenomenon appears robustly across architectures, datasets, optimizers, and loss functions.

3.1 Mainline Experimental Setup

Figure 4: Increasing label noise leads to double-descent. (Left) Bias and variance under different label noise percentage. (Right) Training error and test error under different label noise percentage.

We first describe our mainline experimental setup; in the next subsection we vary each design choice to check robustness of the phenomenon. More extensive experimental results are given in the appendix.

For the mainline experiment, we trained a ResNet34 (he2016deep) on the CIFAR10 dataset (krizhevsky2009learning). We trained using stochastic gradient descent (SGD) with momentum. The initial learning rate is 0.1; we applied stage-wise training (decaying the learning rate by a factor of 10 every 200 epochs) and used weight decay 5e-4. To change the model complexity of the neural network, we scale the number of filters (i.e., the width) of the convolutional layers. More specifically, with width $k$, the number of filters in the successive stages is $k$, $2k$, $4k$, and so on. We vary $k$ from 2 to 64 (the width of a regular ResNet34 designed for CIFAR10 in he2016deep is 16).

Relative to the standard experimental setup (he2016deep), there are two main differences. First, since the bias-variance decomposition is usually defined for the squared loss (see (1) and (2)), our loss function is the squared error (squared $\ell_2$ distance between the softmax probabilities and the one-hot class vector) rather than the log-loss. In the next section we also consider models trained with the log-loss and estimate the bias and variance by using a generalized bias-variance decomposition, as described in §2.2. Second, to measure the variance (and hence the bias), we need two models trained on independent subsets of the data, as discussed in §2.2. Therefore, the training dataset is split in half and each model is trained on only 25,000 data points. We estimate the variance by averaging over three such random splits (i.e., we train $3 \times 2 = 6$ copies of each model at every width).

In Figure 2, we can see that the variance as a function of the width is unimodal and the bias is monotonically decreasing. Since the scale of the variance is small relative to the bias, the overall behavior of the risk is monotonically decreasing.

3.2 Varying Architectures, Loss Functions, Datasets

Architectures. We observe the same monotonically decreasing bias and unimodal variance phenomenon for ResNext29 (xie2017aggregated). To scale the "width" of ResNext29, we first set the number of channels to 1 and increase the cardinality, as defined in xie2017aggregated, from 2 to 4, and then fix the cardinality at 4 and increase the channel size from 1 to 32. Results are shown in Figure 3(a), where the width on the $x$-axis is defined as the cardinality times the filter size.

Loss Function. In addition to the bias-variance decomposition for the MSE loss, we also consider a similar decomposition for the cross-entropy loss, as described in §2.2. We train with cross-entropy loss on 10,000 training samples per split (5 splits), repeating with multiple independent random splits. As shown in Figure 3(b), the behavior of the generalized bias and variance for cross-entropy is consistent with our earlier observations: the bias is monotonically decreasing and the variance is unimodal. The risk first increases and then decreases, corresponding to the unimodal risk pattern in Figure 1(c).

Datasets. In addition to CIFAR10, we study bias and variance on MNIST (lecun1998mnist) and Fashion-MNIST (xiao2017fashion). For these two datasets, we use a fully connected neural network with one hidden layer and ReLU activation. The "width" of the network is the number of hidden nodes. We use 10,000 training samples. As seen in Figures 3(c) and 10 (in Appendix B), for both MNIST and Fashion-MNIST, the variance is again unimodal and the bias is monotonically decreasing.

In addition to the above experiments, we also conduct experiments on the CIFAR100 dataset, the VGG network architecture (simonyan2014very), various training sample sizes, and different weight decay regularization, and present the results in Appendix B. We observe the same monotonically decreasing bias and unimodal variance phenomenon in all of these experiments.

3.3 Connection to Double-Descent Risk

When the relative scale of bias and variance changes, the risk displays one of three patterns: monotonically decreasing, double descent, and unimodal, as presented in Figures 1(a), 1(b), and 1(c). In particular, the recent stream of observations on double-descent risk (belkindouble) can be explained by unimodal variance and monotonically decreasing bias. In our experiments, including those in previous sections, we typically observe monotonically decreasing risk; with more label noise, however, the variance increases and we observe the double-descent risk curve.

Label Noise.

Similar to the setup in nakkiran2019more, for each split, we sample training data from the whole training dataset, and replace the label of each training example with a uniformly random class with independent probability $p$. Label noise increases the variance of the model and hence leads to double-descent risk, as seen in Figure 4. If the variance is small, the risk does not have the double-descent shape because the variance peak is not large enough to overwhelm the bias, as observed in Figures 2, 3(a), 3(c), and 10.
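For concreteness, a small sketch of this label-noise procedure is given below (the array and function names are illustrative); note that because the replacement class is drawn uniformly, a fraction of the replaced labels coincides with the original label by chance.

```python
import numpy as np

def corrupt_labels(labels, num_classes, p, seed=0):
    """Replace each label with a uniformly random class with independent
    probability p (the replacement may coincide with the original label)."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    flip = rng.random(labels.shape[0]) < p
    labels[flip] = rng.integers(0, num_classes, size=flip.sum())
    return labels

# Example: 10% label noise on CIFAR10-style labels.
clean = np.random.default_rng(1).integers(0, 10, size=25000)
noisy = corrupt_labels(clean, num_classes=10, p=0.1)
print((noisy != clean).mean())  # roughly 0.1 * 9/10, since some replacements coincide
```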

Figure 5: (a) Risk, bias, and variance for ResNet34 on out-of-distribution examples (CIFAR10-C dataset). (b)-(c) Bias and variance for ResNets of different depth trained with MSE loss on CIFAR10 (25,000 training samples).

3.4 Discussion of Possible Sources of Error

In this section, we briefly describe the possible sources of error in our estimator defined in §2.2.

Mean Squared Error. As argued in §2.2, the variance estimator is unbiased. To understand the variance of the estimator, we first split the data into two parts. For each part, we compute the bias and variance for varying network width using our estimator. Averaging across different model widths, the relative difference between the two parts is 0.6% for bias and 3% for variance, so our results for MSE are minimally sensitive to finite-sample effects. The complete experiments can be found in the appendix (see Figure 16).

Cross Entropy Loss. For the cross-entropy loss, we are currently unable to obtain an unbiased estimator. We can assess the quality of our estimator using the following scheme. We partition the dataset into five parts, i.e., we set $N = 5$ in Algorithm 1. Then, we sequentially plot the estimates of bias and variance obtained from an increasing number of these parts, as described in Algorithm 1. We observe that using more parts gives better estimates. In Figure 17 of Appendix B.8, we observe that as the number of parts increases, the bias curve systematically decreases and the variance curve increases. Therefore our estimator over-estimates the bias and under-estimates the variance, but the overall behavior of the curves remains consistent.

4 What Affects the Bias and Variance?

In this section, using the bias-variance decomposition analyzed in §3, we investigate the role of depth in neural networks and the robustness of neural networks to out-of-distribution examples.

4.1 Bias-Variance Tradeoff for Out-of-Distribution (OOD) Example

For many real-world computer vision applications, inputs can be corrupted by random noise, blur, weather, etc. These commonly occurring corruptions have been shown to significantly decrease model performance (azulay2019deep; hendrycks2019benchmarking). To better understand the "generalization gap" between in-distribution and out-of-distribution test examples, we empirically evaluate the bias and variance on the CIFAR10-C dataset developed by hendrycks2019benchmarking, a common corruption benchmark that includes 15 types of corruption.

By applying the models trained in the mainline experiment, we evaluate the bias and variance on the CIFAR10-C test dataset according to the definitions in (1) and (2). As we can see from Figure 5(a), both the bias and the variance increase relative to the original CIFAR10 test set. Consistent with the mainline experiment, the bias dominates the overall risk. The results indicate that the "generalization gap" mainly comes from increased bias, with a relatively smaller contribution from increased variance.

4.2 Effect of Model Depth on Bias and Variance

In addition to the ResNet34 considered in the mainline experiment, we also evaluate the bias and variance for ResNet18 and ResNet50. As in the mainline setup, we estimate the bias and variance using 25,000 training samples per model and three independent random splits. The standard building block of the ResNet50 architecture in he2016deep is the bottleneck block, which differs from the basic block used in ResNet18 and ResNet34. To ensure that depth is the only changing variable across the three architectures, we use the basic block for ResNet50 as well. The same training epochs and learning rate decays are applied to all three models.

From Figures 5(b) and 5(c), we observe that the bias decreases as the depth increases, while the variance increases with depth. For each model, the bias is monotonically decreasing and the variance is unimodal. The differences in variance are small (around 0.01) compared with the changes in bias. Overall, the risk typically decreases as the depth increases. Our experimental results suggest that the improved generalization of deeper models, within the same family of network architectures, is mainly attributable to lower bias.

For completeness, we also include the bias and variance versus depth when the basic blocks in ResNet are replaced by bottleneck blocks (see Figure 19 in the appendix). We observe a similar qualitative trend in bias and variance.

Note that at high width, the bias of ResNet50 is slightly higher than the bias of ResNet18 and ResNet34. We attribute this inconsistency to difficulties when training ResNet50 without bottleneck blocks at high width. Lastly, we also include the bias and variance versus depth for out-of-distribution test samples, in which case we also observed decreased bias and increased variance as depth increases, as shown in Figure 18 of Appendix B.9.

5 Theoretical Insights from a Two-layer Linear Model

Figure 6: Risk, bias, and variance for a two-layer linear neural network, plotted as functions of the network width for different numbers of training samples. (a) Risk. (b) Bias. (c) Variance.

While the preceding experiments show that the bias and variance robustly exhibit monotonic and unimodal behavior, respectively, in the random-design setting, existing theoretical analyses hold instead for the fixed-design setting, where the behavior of the bias and variance is more complex, with both exhibiting a peak and the risk exhibiting a double-descent pattern (meisong). In general, while the risk should be the same (in expectation) for the random and fixed design settings, the fixed-design setting has lower bias and higher variance.

Motivated by the more natural behavior in the random-design setting, we work to extend the existing fixed-design theory to the random-design case. Our starting point is meisong, who consider two-layer non-linear networks with random hidden-layer weights. However, the randomness in the design complicates the analysis, so we make two simplifying points of departure: first, we consider two-layer linear rather than non-linear networks, and second, we consider a different scaling limit, in which the number of training samples is much larger than the input dimension rather than proportional to it. In this setting, we rigorously show that the variance is indeed unimodal and the bias is monotonically decreasing (Figure 6). Our precise assumptions are given below.

5.1 Model Assumptions

We consider the task of learning a function that maps each input vector $\mathbf{x} \in \mathbb{R}^d$ to an output (label) value $y \in \mathbb{R}$. The input-output pair $(\mathbf{x}, y)$ is assumed to be drawn from a distribution where $\mathbf{x} \sim \mathcal{N}(0, I_d)$ and

$$y = \boldsymbol{\theta}^\top \mathbf{x}, \tag{4}$$

where $\boldsymbol{\theta} \in \mathbb{R}^d$ is a weight vector. Given a training set $\mathcal{T} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ with $n$ training samples drawn independently from the data distribution, we learn a two-layer linear neural network parametrized by $W \in \mathbb{R}^{p \times d}$ and $\boldsymbol{\beta} \in \mathbb{R}^{p}$ as

$$f(\mathbf{x}; \mathcal{T}) = \boldsymbol{\beta}^\top W \mathbf{x},$$

where $p$ is the number of hidden units in the network. In the above, we take $W$ as a parameter independent of the training data $\mathcal{T}$, whose entries are drawn i.i.d. from a Gaussian distribution. Given $W$, the parameter $\boldsymbol{\beta}$ is estimated by solving the following ridge regression problem ($\ell_2$ regularization on weight parameters is arguably the most widely used technique in training neural networks, known for improving generalization (krogh1992simple); other regularization such as $\ell_1$ can also be used and leads to qualitatively similar behavior):

$$\hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \ \frac{1}{n}\big\|\boldsymbol{y} - X^\top W^\top \boldsymbol{\beta}\big\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2, \tag{5}$$

where $X = [\mathbf{x}_1, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d \times n}$ denotes a matrix that contains the training data vectors as its columns, $\boldsymbol{y} = (y_1, \ldots, y_n)^\top$ denotes a vector containing the training labels as its entries, and $\lambda$ is the regularization parameter. By noting that the solution to (5) is given by

$$\hat{\boldsymbol{\beta}} = \Big(\frac{1}{n} W X X^\top W^\top + \lambda I_p\Big)^{-1} \frac{1}{n} W X \boldsymbol{y},$$

our estimator is given as

$$f(\mathbf{x}; \mathcal{T}) = \hat{\boldsymbol{\beta}}^\top W \mathbf{x} = \frac{1}{n}\boldsymbol{y}^\top X^\top W^\top \Big(\frac{1}{n} W X X^\top W^\top + \lambda I_p\Big)^{-1} W \mathbf{x}. \tag{6}$$
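Before turning to the asymptotic analysis, the model in (4)-(6) can also be probed numerically. The following is a minimal Monte Carlo sketch of the random-design bias and variance of this estimator; the Gaussian scaling of $W$, the prior on the weight vector, and all numeric settings are illustrative assumptions rather than the exact setting of the theory below.

```python
import numpy as np

def simulate_bias_variance(d=50, n=800, widths=(5, 15, 25, 35, 45),
                           lam=1e-3, n_trials=200, n_test=500, seed=0):
    """Monte Carlo estimate of the random-design bias^2 and variance of the
    two-layer linear model f(x) = beta_hat^T W x with ridge-regressed beta_hat.
    The 1/sqrt(d) scalings and the draw of theta are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=d) / np.sqrt(d)            # true weight vector (assumed prior)
    x_test = rng.normal(size=(n_test, d))              # fresh test points
    y_test = x_test @ theta
    results = {}
    for p in widths:
        W = rng.normal(size=(p, d)) / np.sqrt(d)       # random first layer, held fixed across training sets
        preds = np.empty((n_trials, n_test))
        for t in range(n_trials):
            X = rng.normal(size=(n, d))                # fresh training covariates (random design)
            y = X @ theta
            Z = X @ W.T                                # hidden representation, shape (n, p)
            beta_hat = np.linalg.solve(Z.T @ Z / n + lam * np.eye(p), Z.T @ y / n)
            preds[t] = (x_test @ W.T) @ beta_hat
        mean_pred = preds.mean(axis=0)                 # empirical average prediction over training sets
        results[p] = (((mean_pred - y_test) ** 2).mean(),  # bias^2
                      preds.var(axis=0).mean())            # variance
    return results

for p, (b2, var) in simulate_bias_variance().items():
    print(f"p = {p:2d}   bias^2 = {b2:.4f}   variance = {var:.4f}")
```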

5.2 Bias-Variance Analysis

We may now calculate the bias and variance of the model described above via the following formulations:

$$\mathrm{Bias}^2(\boldsymbol{\theta}) = \mathbb{E}_{\mathbf{x}}\big[\big(\mathbb{E}_{\mathcal{T}}[f(\mathbf{x}; \mathcal{T})] - y\big)^2\big], \qquad \mathrm{Variance}(\boldsymbol{\theta}) = \mathbb{E}_{\mathbf{x}}\mathbb{E}_{\mathcal{T}}\big[\big(f(\mathbf{x}; \mathcal{T}) - \mathbb{E}_{\mathcal{T}}[f(\mathbf{x}; \mathcal{T})]\big)^2\big],$$

where $y$ and $f(\mathbf{x}; \mathcal{T})$ are defined in (4) and (6), respectively. Note that the bias and variance are functions of the model parameter $\boldsymbol{\theta}$. To simplify the analysis, we introduce a prior on $\boldsymbol{\theta}$ and calculate the expected bias and expected variance as

$$\mathbb{E}\mathrm{Bias}^2 = \mathbb{E}_{\boldsymbol{\theta}}\big[\mathrm{Bias}^2(\boldsymbol{\theta})\big], \tag{7}$$
$$\mathbb{E}\mathrm{Variance} = \mathbb{E}_{\boldsymbol{\theta}}\big[\mathrm{Variance}(\boldsymbol{\theta})\big]. \tag{8}$$

The precise formulas for the expected bias and the expected variance are parametrized by the dimension of the input features $d$, the number of training points $n$, the number of hidden units $p$, and the regularization parameter $\lambda$.

Previous literature (meisong) suggests that both the risk and the variance achieve a peak at the interpolation threshold. In the regime where the number of training samples $n$ is very large, the risk no longer exhibits a peak, but the unimodal pattern of the variance still holds. In the rest of the section, we consider this large-sample regime (monotonically decreasing risk), and derive precise expressions for the bias and variance of the model. From these expressions, we obtain the location at which the variance achieves its peak. For this purpose, we consider the following asymptotic regime of $d$, $p$, and $n$:

Assumption 1.

Let be a given sequence of triples. We assume that there exists a such that

For simplicity, we will write and .

With the assumption above, we can express the expected bias, variance, and risk as functions of the limiting quantities.

Theorem 1.

Given a sequence satisfying Assumption 1 and a fixed regularization level, the asymptotic expressions for the expected bias and variance are given by

(9)

where

The proof is given in Appendix C.

The risk can be obtained as the sum of the bias and variance. The expression in Theorem 1 is plotted as the red curves in Figure 6. We also plot the shape of the bias, variance, and risk for other sample sizes. We find that the risk of the model changes from unimodal to monotonically decreasing as the number of samples increases (see Figure 6(a)). Moreover, the bias of the model is monotonically decreasing (see Figure 6(b)) and the variance is unimodal (see Figure 6(c)).

Corollary 1 (Monotonicity of Bias).

The derivative of the limiting expected Bias in (9) can be calculated as

(10)

In this case, the expression in (10) is strictly non-positive, and therefore the limiting expected bias is monotonically non-increasing as a function of the width, as classical theory predicts.

To gain further insight into the above formulas, we also consider the case where the amount of ridge regularization $\lambda$ is small. In particular, we consider the first-order effect of $\lambda$ on the bias and variance terms, and compute the width at which the variance attains its peak.

Corollary 2 (Unimodality of Variance – small $\lambda$ limit).

Under the assumptions of Theorem 1, the first-order effect of $\lambda$ on the variance is given by

and the risk is given by

Moreover, up to first order, the peak in the variance is

Corollary 2 suggests that when $\lambda$ is sufficiently small, the variance of the model is maximized when the number of hidden units is approximately half the input dimension, and the effect of $\lambda$ is to shift the peak slightly.

From a technical perspective, to compute the variance in the random-design setting, we need to compute the element-wise expectation of certain random matrices. For this purpose, we apply the combinatorics of non-crossing partitions to characterize the asymptotic expectation of products of Wishart matrices.

6 Conclusion and Discussion

In this paper we re-examine the classical theory of bias and variance trade-off as the width of a neural network increases. Through extensive experimentation, our main finding is that, while the bias is monotonically decreasing as classical theory would predict, the variance is unimodal. This combination leads to three typical risk curve patterns, all observed in practice. Theoretical analysis of a two-layer linear network corroborates these experimental observations.

The seemingly varied and baffling behaviors of modern neural networks are thus in fact consistent, and explainable through classical bias-variance analysis. The main unexplained mystery is the unimodality of the variance. We conjecture that as the model complexity approaches and then goes beyond the data dimension, it is regularization in model estimation (the ridge penalty in our theoretical example) that helps bring down the variance. Under this account, the decrease in variance for large dimension comes from better conditioning of the empirical covariance, making it better-aligned with the regularizer.

In the future, it would be interesting to see whether the phenomena characterized by this simple two-layer model can be rigorously generalized to deeper networks with nonlinear activations, possibly revealing other interplays between model complexity and regularization (explicit or implicit). Such a study could also help explain another phenomenon we (and others) have observed: bias decreases and variance increases as more layers are added. We believe that (classical) bias-variance analysis remains a powerful and insightful framework for understanding the behavior of deep networks; properly used, it can guide practitioners to design more generalizable and robust networks in the future.

Acknowledgements.

We would like to thank Emmanuel Candès for first bringing the double-descent phenomenon to our attention, Song Mei for helpful discussions regarding random vs. fixed design regression, and Nikhil Srivastava for pointing out relevant references in random matrix theory. We would also like to thank Preetum Nakkiran, Mihaela Curmei, and Chloe Hsu for valuable feedback during the preparation of this manuscript.

References

Appendix A Summary of Experiments

We summarize the experiments of this paper in Table 1; each row corresponds to one experiment (some of which include several independent splits). Every experiment is related to one or more figures, specified in the column "Figure".

Dataset | Architecture | Loss | Optimizer | Train Size | Splits | Label Noise | Figure | Comment
CIFAR10 | ResNet34 | MSE | SGD(wd=5e-4) | 25000 | 3 | - | 2, 5 | Mainline
CIFAR10 | ResNext29 | MSE | SGD(wd=5e-4) | 25000 | 3 | - | 3(a), 7 | Architecture
CIFAR10 | VGG11 | MSE | SGD(wd=5e-4) | 10000 | 1 | - | 8 |
CIFAR10 | ResNet34 | CE | SGD(wd=5e-4) | 10000 | 4 | - | 3(b), 9 | Loss
MNIST | DNN | MSE | SGD(wd=5e-4) | 10000 | 1 | - | 3(c) | Dataset
Fashion-MNIST | DNN | MSE | SGD(wd=5e-4) | 10000 | 1 | - | 10 |
CIFAR100 | ResNet34 | CE | SGD(wd=5e-4) | 10000 | 1 | - | 11 |
CIFAR10 | ResNet34 | MSE | SGD(wd=5e-4) | 10000 | 1 | 10%/20% | 4 | Label noise
CIFAR10 | ResNet18 | MSE | SGD(wd=5e-4) | 25000 | 3 | - | 5 | Depth
CIFAR10 | ResNet50 | MSE | SGD(wd=5e-4) | 25000 | 3 | - | 5 |
CIFAR10 | ResNet34 | MSE | SGD(wd=5e-4) | 10000 | 1 | - | 12 | Train size
CIFAR10 | ResNet34 | MSE | SGD(wd=5e-4) | 2500 | 1 | - | 13 |
CIFAR10 | ResNet34 | MSE | SGD(wd=1e-4) | 10000 | 1 | - | 14 | Weight decay
CIFAR10 | ResNet26-B | MSE | SGD(wd=5e-4) | 25000 | 3 | - | 19 | Depth (with bottleneck block)
CIFAR10 | ResNet38-B | MSE | SGD(wd=5e-4) | 25000 | 3 | - | 19 |
CIFAR10 | ResNet50-B | MSE | SGD(wd=5e-4) | 25000 | 3 | - | 19 |
Table 1: Summary of Experiments.

Appendix B Additional Experiments

In this section, we provide additional experimental results, some of which are mentioned in §3 and §4.

Network Architecture:

The implementation of the deep neural networks used in this work is mainly adapted from https://github.com/kuangliu/pytorch-cifar.

Training Details:

For the CIFAR10 and CIFAR100 datasets, when the training sample size is 25,000, we train for 500 epochs and decay the learning rate by a factor of 10 every 200 epochs. When the training sample size is 10,000 or 5,000, we train for 1,000 epochs and decay the learning rate by a factor of 10 every 400 epochs. For the MNIST and Fashion-MNIST datasets, we train for 200 epochs and decay the learning rate by a factor of 10 every 100 epochs.
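For reference, a minimal PyTorch sketch of the stage-wise schedule described above (500 epochs with the learning rate decayed by a factor of 10 every 200 epochs, for the 25,000-sample runs) is given below; the momentum value and the stock ResNet34 constructor are assumptions for illustration, since this work uses width-scaled CIFAR variants.

```python
import torch
from torchvision.models import resnet34

# Placeholder model: the experiments use width-scaled CIFAR-style ResNets,
# not the stock ImageNet ResNet34 used here for illustration.
model = resnet34(num_classes=10)

optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.1,              # initial learning rate from the text
                            momentum=0.9,        # assumed value, not stated in the text
                            weight_decay=5e-4)   # as in Table 1

# Decay the learning rate by a factor of 10 every 200 epochs over 500 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[200, 400], gamma=0.1)

for epoch in range(500):
    # ... run one training epoch (MSE between softmax outputs and one-hot labels) ...
    scheduler.step()
```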

B.1 Architecture

We provide additional results on ResNext29 presented in §3.2. The results are shown in Figure 7.

We also study the behavior of the risk, bias, and variance of the VGG network (simonyan2014very) on the CIFAR10 dataset. Here we use VGG11 and scale the number of filters of the convolutional layers proportionally to the width $k$ shown in Figure 8. The number of training samples in each split is 10,000. We use the same optimization setup as the mainline experiment (ResNet34 in Figure 2).

Figure 7: Risk, bias, variance, train/test error for ResNext29 trained by MSE loss on CIFAR10 dataset (25,000 training samples). (Left) Risk, bias, and variance for ResNext29. (Middle) Variance for ResNext29. (Right) Train error and test error for ResNext29.
Figure 8: Risk, bias, variance, train/test error for VGG11 trained by MSE loss on CIFAR10 dataset (10,000 training samples). (Left) Risk, bias, and variance for VGG11. (Middle) Variance for VGG11. (Right) Train error and test error for VGG11.

B.2 Loss

We provide additional results for the cross-entropy loss experiments presented in §3.2; the results are shown in Figure 9.

Figure 9: Variance and train/test error for ResNet34 trained by cross-entropy loss (estimated by generalized bias-variance decomposition using Bregman divergence) on CIFAR10 dataset (10,000 training samples). (Left) Variance for ResNet34. (Right) Train error and test error for ResNet34.

B.3 Dataset

We provide the results on Fashion-MNIST dataset in Figure 10, which is mentioned in §3.2.

Figure 10: Fully connected network with one-hidden-layer and ReLU activation trained by MSE loss on Fashion-MNIST dataset (10,000 training samples).

We study the behavior of the risk, bias, and variance of ResNet34 on the CIFAR100 dataset. Because the number of classes is large, we use cross-entropy during training and apply the classical bias-variance decomposition for MSE in (1) and (2) to estimate the risk, bias, and variance. As shown in Figure 11, we observe the bell-shaped variance curve and the monotonically decreasing bias curve on CIFAR100 as well.

Figure 11: Risk, bias, variance, and train/test error for ResNet34 trained by cross-entropy loss (estimated by MSE bias-variance decomposition) on CIFAR100 (10,000 training samples). (Left) Risk, bias, and variance for ResNet34. (Middle) Variance for ResNet34. (Right) Train error and test error for ResNet34.

B.4 Training Size

Apart from the 2-splits case in Figure 2, we also consider the 5-splits case (10,000 training samples) and the 20-splits case (2,500 training samples). We present the 5-splits case (10,000 training samples) in Figure 12, which corresponds to the 0% label-noise case in Figure 4. We present the 20-splits case (2,500 training samples) in Figure 13. With fewer training samples, both the bias and the variance increase.

Figure 12: Risk, bias, variance, train/test error for ResNet34 trained by MSE loss on CIFAR10 dataset (10,000 training samples). (Left) Risk, bias, and variance for ResNet34. (Middle) Variance for ResNet34. (Right) Train error and test error for ResNet34.
Figure 13: Risk, bias, variance, train/test error for ResNet34 trained by MSE loss on CIFAR10 dataset (2,500 training samples). (Left) Risk, bias, and variance for ResNet34. (Middle) Variance for ResNet34. (Right) Train error and test error for ResNet34.

B.5 Weight Decay

We study a different weight decay parameter (wd=1e-4) for ResNet34 on the CIFAR10 dataset (10,000 training samples). The risk, bias, variance, and train/test error curves are shown in Figure 14. Compared with Figure 12, we observe that larger weight decay decreases the variance.

Figure 14: Risk, bias, variance, train/test error for ResNet34 trained by MSE loss on CIFAR10 dataset (10,000 training samples), the weight decay parameter of SGD is 1e-4. (Left) Risk, bias, and variance for ResNet34. (Middle) Variance for ResNet34. (Right) Train error and test error for ResNet34.

B.6 Label Noise

We provide the risk curve for ResNet34 under different label noise percentage as described in §3.3, and the results are shown in Figure 15.

Figure 15: Risk under different label noise percentage. Increasing label noise leads to double descent risk curve.

B.7 Sources of Error for Mean Squared Error (MSE)

As argued in §2.2, the estimator for the variance is unbiased. To understand the variance of the estimator, we first split the data into two parts. For each part, we take multiple random splits and estimate the variance by averaging the resulting estimators, varying the number of random splits from 1 to 5. The results are shown in Figure 16. We can see that the variation between the two parts of the data is small. Quantitatively, averaging across different model widths, the relative difference between the two parts of the data is 0.65% for bias and 3.15% for variance.

Figure 16: Bias and variance for the two portions of data, with the number of random splits ranging from 1 to 5. (Left) Bias for ResNet18. (Right) Variance for ResNet18.

B.8 Sources of Error for Cross Entropy Loss (CE)

For the cross-entropy loss, we are currently unable to obtain an unbiased estimator. We can assess the quality of our estimator using the following scheme. We partition the dataset into five parts, i.e., we set $N = 5$ in Algorithm 1. Then, we sequentially plot the estimates of bias and variance obtained from an increasing number of these parts, as described in Algorithm 1. Using more parts gives a better estimate. As shown in Figure 17, when the number of parts is small, our estimator over-estimates the bias and under-estimates the variance, but the overall behavior of the curves is consistent.

Figure 17: Estimates of bias, variance, and risk using a varying number of parts (as in Algorithm 1). (Left) Bias (CE) for ResNet34. (Middle) Variance (CE) for ResNet34. (Right) Risk (CE) for ResNet34.

B.9 Effect of Depth on Bias and Variance for Out-Of-Distribution Data

We study the role of depth on out-of-distribution test data. In Figure 18, we observe that increasing the depth decreases the bias and increases the variance. Deeper ResNets also generalize better on the CIFAR10-C dataset, as shown in Figure 18.

Figure 18: Bias, variance, and test error for ResNet with different depth (ResNet18, ResNet34 and ResNet50 trained by MSE loss on 25,000 CIFAR10 training samples) evaluated on out-of-distribution examples (CIFAR10-C dataset). (Left) Bias for ResNet18, ResNet34 and ResNet50. (Middle) Variance for ResNet18, ResNet34 and ResNet50. (Right) Test error for ResNet18, ResNet34 and ResNet50.

B.10 Effect of Depth on ResNet using Bottleneck Blocks

In §4.2, in order to make depth the only changing variable, we applied the basic residual block to ResNet50. To further investigate the role of depth, we here use the bottleneck block for ResNet26, ResNet38, and ResNet50, with the number of 3-layer bottleneck blocks increasing with depth. As shown in Figure 19, we observe that deeper ResNets with bottleneck blocks have lower bias and higher variance.

Figure 19: Bias and variance for ResNet (bottleneck block) with different depth. (Left) Bias for ResNet26, ResNet38 and ResNet50. (Right) Variance for ResNet26, ResNet38 and ResNet50.

Appendix C Proof of Theorems in §5

Throughout this section, we use $\|\cdot\|_F$ and $\|\cdot\|$ to denote the Frobenius norm and the spectral norm of a matrix, respectively. Recall that for any given $\boldsymbol{\theta}$, the training set satisfies the relation $\boldsymbol{y} = X^\top \boldsymbol{\theta}$. By plugging this relation into (6), we get

(11)

where we define

(12)

To avoid cluttered notations, we omit the dependency of on and .

By using (11), the expected bias and expected variance in (7) and (8) can be written as functions of the statistics of the matrix defined in (12). This is stated in the following proposition. To proceed, we introduce a change of variable in order to be consistent with conventions in random matrix theory.

Proposition 1 (Expected Bias/Variance).

The expected bias and expected variance are given by

where is defined in (12).

Proof.

By plugging (11) into (7), and using the prior that and , we get

Similarly, by plugging (11) into (8) we get

The risk is given by

First, we show that in the asymptotic setting defined in Assumption 1, the expected bias and expected variance can be calculated as functions of the statistics of the following matrix:

(13)

In the following, we omit the dependency of on and .

Proposition 2 (Gap between and ).

Under Assumption 1 with , we have

Proof.

It suffices to show that almost surely. From (12) and (13), we have

where and

By using the triangle inequality and the sub-multiplicative property of the spectral norm, we have

(14)

Furthermore, by a classical result on the perturbation of matrix inverse (see e.g., ELGHAOUI2002171), we have

Combining this bound with (14) gives

It remains to show that and that , , and are bounded from above almost surely. By wainwright_2019, and ,

By letting and taking the asymptotic limit as in Assumption 1, we have

From geman1980, the largest eigenvalue of the relevant sample covariance matrix is almost surely bounded in the limit. Therefore, we have

Finally, note that

We therefore conclude that almost surely, as desired. ∎

Proposition 3 (Asymptotic Risk).

Given the expression for Bias and Variance in Proposition 1, under the asymptotic assumptions from Assumption 1,

where , and for any ,

Proof.

Recalling the definitions above and applying the Sherman-Morrison formula,

where . Let be the eigenvalues of . For notational simplicity, let . Then

Let , and be the spectral measure of . Then

According to the Marchenko-Pastur law (baibook), in the asymptotic limit,

where , and . For convenience, define

When ,

Then, in the asymptotic regime,

Proposition 4 (Asymptotic Bias).

Given the expression for Bias in Proposition 1, under the asymptotic assumptions in Assumption 1, the Bias for the model is given by

Proof.

Recall that

Recall that . Thus

By Neumann series,

where . According to Corollary 3.3 in bishop2018 (recall we are considering the asymptotic regime of ),

where the coefficient is the Narayana number. Therefore,