Revisiting complexity and the bias-variance tradeoff

06/17/2020
by   Raaz Dwivedi, et al.

The recent success of high-dimensional models, such as deep neural networks (DNNs), has led many to question the validity of the bias-variance tradeoff principle in high dimensions. We reexamine it with respect to two key choices: the model class and the complexity measure. We argue that failing to suitably specify either one can falsely suggest that the tradeoff does not hold. This observation motivates us to seek a valid complexity measure, defined with respect to a reasonably good class of models. Building on Rissanen's principle of minimum description length (MDL), we propose a novel MDL-based complexity (MDL-COMP). We focus on the context of linear models, which have recently been used as a stylized tractable approximation to DNNs in high dimensions. MDL-COMP is defined via an optimality criterion over the encodings induced by a good class of ridge estimators. We derive closed-form expressions for MDL-COMP and show that for a dataset with n observations and d parameters it is not always equal to d/n; rather, it is a function of the singular values of the design matrix and the signal-to-noise ratio. For random Gaussian design, we find that while MDL-COMP scales linearly with d in low dimensions (d < n), in high dimensions (d > n) the scaling is exponentially smaller, of order log d. We hope that such a slow growth of complexity in high dimensions can help shed light on the good generalization performance of several well-tuned high-dimensional models. Moreover, via an array of simulations and real-data experiments, we show that a data-driven Prac-MDL-COMP can inform hyperparameter tuning for ridge regression in limited data settings, sometimes improving upon cross-validation.



Code Repositories

mdl-complexity: MDL complexity computations and experiments from the paper "Revisiting complexity and the bias-variance tradeoff".

1 Introduction

Occam’s razor and the bias-variance tradeoff principle have long provided guidance for model selection in machine learning and statistics. Given a dataset, these principles recommend selecting a model which (1) fits the data well and (2) has low complexity. On the one hand, a model whose complexity is too low incurs high bias, i.e., it does not fit the training data well (under-fitting). On the other hand, an overly complex model memorizes the training data, suffering from high variance, i.e., poor performance on unforeseen data (over-fitting). A good model can be chosen by trading off bias and variance as a function of complexity. However, as stated, these principles are qualitative and their widespread usage is often imprecise; it is generally unclear for what class of models the tradeoff holds, and it is left to the user to choose the notion of complexity.

In statistics and machine learning, complexity has been investigated and used multiple times to bound the out-of-sample prediction error of a fitted model and to select between competing models. A common proxy for model complexity is the degrees of freedom efron1986biased; buja1989linear; efron2004estimation for linear models and its variants (effective/generalized degrees of freedom) for more general models meyer2000degrees; shen2002adaptive; efron2004estimation; hastie2005elements; zhang2012generalized or high-dimensional sparse regression zou2007degrees; tibshirani2012degrees; hastie2017generalized. Intuitively, these measures reflect the effective number of parameters used to fit the model; e.g., in linear regression with n samples and d features, the degrees of freedom equal d (for the case d < n). However, recent works kaufman2014does; janson2015effective; tibshirani2015degrees have shown that effective degrees of freedom may poorly capture complexity in high dimensions. Even earlier, several other complexity measures, such as the Vapnik-Chervonenkis dimension vapnik2015uniform, Rademacher complexity bartlett2002rademacher; anthony2009neural; van2006empirical, the Akaike information criterion akaike1974new, Mallows' Cp mallows1973some, and the Bayesian information criterion schwarz1978estimating, were proposed in the literature. All of these measures simplify to parameter counting for linear models, and they have been justified for well-conditioned design matrices with d < n. A more recent line of work has derived Rademacher(-like) complexity-based generalization bounds for deep neural networks neyshabur2015norm; bartlett2017spectrally; neyshabur2017pac; golowich2017size; li2018tighter; neyshabur2018role, but such bounds have remained too loose to inform practical performance arora2018stronger. Moreover, there is growing evidence that large-scale trained models are often not complex, due to the implicit regularization induced by model architecture, optimization methods, and training datasets Nakkiran2019DeepDD; neyshabur2014search; arora2019implicit; neyshabur2018role.

Using complexity measures based on counting parameters, several recent works hastie2019surprises; belkin2019two have demonstrated a seemingly contradictory behavior of the test error in the overparameterized setting. Even in the simple case of linear regression, the test error of the minimum-norm ordinary least squares (OLS) estimator does not follow the classical U-shaped curve (as would have been predicted by the bias-variance tradeoff), and can decrease when the number of parameters d grows beyond the number of observations n (we reproduce this phenomenon in Fig. 1). Such a double-descent phenomenon was observed in early works on linear discriminant analysis skurichina1998bagging; skurichina2001bagging; skurichina2002bagging, and more recently in high-dimensional linear regression hastie2019surprises; belkin2019two; muthukumar2020harmless, and in DNNs for classification advani2017high; Nakkiran2019DeepDD. These observations have prompted several researchers to question the validity of the bias-variance tradeoff principle, especially in high-dimensional regimes belkin2019reconciling; belkin2019two.

This apparent contradiction prompts our revisit to two choices underlying the bias-variance tradeoff: (1) the class of estimators, and (2) the definition of complexity. With regard to the class of estimators, we argue that the bias-variance tradeoff is only sensible when the class of estimators is reasonable (Sec. 2). This observation motivates us to seek a valid complexity measure defined with respect to a reasonably good class of models, building on the optimality principle used in the algorithmic complexity of Kolmogorov, Chaitin, and Solomonoff kolmogorov1963tables; kolmogorov1968three; li2008introduction and the principle of minimum description length (MDL) of Rissanen rissanen1986stochastic; barron1998minimum; hansen2001model; grunwald2007minimum (Sec. 3). In Sec. 4, we propose an MDL-based complexity measure (MDL-COMP) for linear models, defined generally as the minimum excess bits required to transmit the data when constructing an optimal code for a given dataset via a family of models based on ridge estimators. We focus on (high-dimensional) linear models, which have been studied in several recent works as a stylized tractable approximation for deep neural networks jacot2018neural; du2018gradient; allen2018convergence. We provide an analytical and numerical investigation of MDL-COMP both for finite n and d, and in the large-scale setting where n and d grow together (Theorems 1 and 3). We show that MDL-COMP has a wide range of behavior depending on the design matrix: while it usually scales linearly with d for d < n, it grows very slowly—logarithmically in d—for d > n. Finally, we show the usefulness of MDL-COMP by proving its optimality for the in-sample MSE (Theorem 2) and by showing that a data-driven, MDL-COMP-inspired hyperparameter selection criterion (Prac-MDL-COMP) provides competitive performance relative to cross-validation for ridge regression in several simulations and real-data experiments in limited data settings (Sec. 5 and Appendix B, Figs. 3 and 4). Moreover, in Sec. A.2, we propose a data-driven approximation to the (oracle) MDL-COMP, called Approx-MDL-COMP, which can be computed in practice and used to compare different high-dimensional models, and possibly, in the future, for sample-size calculations in experimental design.

2 Revisiting the bias-variance tradeoff

We begin by revisiting recent results on the behavior of the test mean-squared error (MSE) for the minimum-norm ordinary least squares (OLS) estimator and the cross-validation-tuned ridge estimator (Ridge) in high-dimensional linear regression hastie2019surprises; muthukumar2020harmless. As noted earlier, the bias-variance tradeoff should be studied for a given dataset where only the choice of models fitted to the data is varied (and not the data-generating mechanism itself). Here we consider a dataset of n observations and (ordered) features simulated from a Gaussian linear model (2), such that the true response depends only on the first d⋆ features, for a fixed true signal θ⋆. Moreover, Gaussian noise with variance σ² is added independently to each observation. We then fit OLS and Ridge estimators with the first d features, for different values of d. Fig. 1 plots the mean-squared error (MSE) on a held-out dataset and the bias-variance tradeoff for the OLS and Ridge estimators as a function of d.
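A minimal sketch of this simulation (illustrative: the sample sizes, d⋆, noise level, and signal scaling below are assumptions, not the exact values used for Fig. 1) reproduces the qualitative behavior, fitting minimum-norm OLS via the pseudoinverse and a cross-validated ridge estimator as d is varied past n.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
n, n_test, d_max, d_star, sigma = 100, 1000, 300, 50, 1.0

X_full = rng.standard_normal((n, d_max))
X_test_full = rng.standard_normal((n_test, d_max))
theta_star = np.zeros(d_max)
theta_star[:d_star] = rng.standard_normal(d_star) / np.sqrt(d_star)

y = X_full @ theta_star + sigma * rng.standard_normal(n)
y_test = X_test_full @ theta_star + sigma * rng.standard_normal(n_test)

for d in [10, 50, 90, 100, 110, 200, 300]:
    X, X_test = X_full[:, :d], X_test_full[:, :d]
    theta_ols = np.linalg.pinv(X) @ y            # minimum-norm OLS
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 30)).fit(X, y)
    mse_ols = np.mean((X_test @ theta_ols - y_test) ** 2)
    mse_ridge = np.mean((ridge.predict(X_test) - y_test) ** 2)
    print(f"d={d:3d}  OLS test MSE={mse_ols:6.2f}  Ridge test MSE={mse_ridge:6.2f}")
```

The OLS test MSE peaks sharply near d = n, while the tuned ridge curve stays smooth, mirroring the behavior shown in Fig. 1.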

Figure 1: The bias-variance tradeoff for linear models fit to Gaussian data as the number of features d is varied. While the OLS estimator shows a peaky behavior in test error around d = n, the error of the ridge estimator shows the typical U-shaped curve. Both the OLS and Ridge estimators show the classical bias-variance tradeoff for d < n (left of the vertical dashed line). As d increases further, the bias increases monotonically, while the variance increases up to d = n and decreases beyond it.

In Fig. 1(a), we also plot the test error of the trivial zero estimator for the Gaussian design (Zero-Est). We observe that while the test error of the ridge estimator shows a typical U-shaped curve (as the bias-variance tradeoff principle suggests), the test MSE of OLS shows a peak at d = n, although the rest of the curve tracks the Ridge curve fairly closely. The peak at d = n is caused by the instability of the matrix inverse involved in the OLS solution when d ≈ n. Recently, this phenomenon has been referred to as double descent (because of the second decrease in the test MSE of OLS for d > n) and has reinvigorated the discussion of whether a bias-variance tradeoff holds in high-dimensional models hastie2019surprises; belkin2019two; muthukumar2020harmless. However, it is worth noting that: (I) The trivial (zero) estimator has constant bias and variance, and thus does not exhibit any bias-variance tradeoff for any choice of d. (II) The OLS estimator admits a bias-variance tradeoff for d < n (left of the vertical dashed line), wherein the bias decreases and the variance increases as d increases. Beyond that point, however, the variance peaks abruptly at d = n and then decreases, while the bias is negligible up to that point and then increases with d. (III) The Ridge estimator shows a bias-variance tradeoff similar to that of OLS for most values of d. A key difference, however, is the absence of the peaking behavior in the variance (and hence in the test MSE), showing that the peaky behavior of OLS is indeed due to its instability in high dimensions.

In Appendix E, we further discuss the bias-variance tradeoff for a few other designs and settings. Figs. E1 and E2 show that for different choices of covariates X, the test MSE of OLS can have multiple peaks, although overall the OLS estimator seems to have a well-defined global minimum (over d) and can be a reasonable choice for model selection in all settings. The observations for the Ridge estimator remain qualitatively similar to those in Fig. 1, and its test MSE is always lowest at the true dimension d⋆. These observations suggest that the bias-variance tradeoff is not merely a function of the number of parameters considered; rather, it is tightly coupled with both the estimators considered and the design matrix X. Part of the surprise with the bias-variance tradeoff can be attributed to the choice of x-axis, where we treat d as the complexity of the estimator—but as we now discuss, such a choice is not well justified for d > n.

3 Revisiting complexity

The choice of parameter count as a complexity measure for linear models has been justified rigorously in low-dimensional, well-conditioned settings: OLS is a good estimator when d < n, and its test error increases proportionally to d/n, similar to most other complexity measures such as Rademacher complexity, AIC, and BIC, which scale linearly with d. However, using d as the complexity measure when d > n remains unjustified (even for linear models). Indeed, in Fig. 1, we observe that the variance term for both OLS and Ridge first increases with d up to d = n, and then decreases as d increases further. Such an observation has also been made in regard to the double descent observed in deep neural networks yang2020rethinking, wherein an increase in the width of the hidden layers beyond a threshold leads to a decrease in the variance. In prior works, the variance term itself has sometimes been used as a measure of complexity, e.g., in boosting buhlmann2003boosting. However, since the fitted model does not match the dimensionality of the true model (in Fig. 1), such a choice may be misinformed here.

To define a complexity measure in a principled manner, we build on Rissanen’s principle of minimum description length (MDL), which has its intellectual roots in Kolmogorov complexity. Both Kolmogorov complexity and MDL are based on an optimality requirement: the shortest-length program that outputs the sequence on a universal Turing machine for Kolmogorov complexity, and the shortest (probability-distribution-based) codelength for the data for MDL. Indeed, Rissanen generalized Shannon’s coding theory to universal coding, used probability distributions to define universal encodings for a dataset, and then used the codelength as an approximate measure of Kolmogorov complexity. In the MDL framework, a model or a probability distribution for the data is equivalent to a (prefix) code; one prefers a code that exploits redundancy in the data to compress it into as few bits as possible (see rissanen1986stochastic; barron1998minimum; grunwald2007minimum). Since its introduction, MDL has been used in many tasks, including density estimation zhang2006, time-series analysis tanaka2005discovery, model selection barron1998minimum; hansen2001model; miyaguchi2018high, and DNN training hinton1993keeping; schmidhuber1997discovering; li2018measuring; blier2018description.

3.1 Universal codes and MDL-based complexity

We now provide a brief background on MDL relevant to our later definitions. For readers unfamiliar with MDL, Sections 1 and 2 of the papers barron1998minimum; hansen2001model serve as excellent references. Suppose we are given a set of n observations y = (y_1, …, y_n). Let Q denote a probability distribution on the space 𝒴 (e.g., a subset of ℝ^n) in which y takes values, and let 𝒬 denote a set of such distributions. Given a probability distribution Q, we can associate with it an information-theoretic prefix code for the space 𝒴, wherein for any observation y we use log(1/Q(y)) bits to describe y. Assuming a generative model P for the observations y, a natural way to measure the effectiveness of a coding scheme is its excess codelength compared to the true (but unknown) distribution P. The normalized regret, or expected excess codelength, of Q is then given by

    (1/n) E_P[ log(1/Q(y)) − log(1/P(y)) ] = (1/n) KL(P ∥ Q),

where we normalize by n to reflect that the n observations are coded together. Here KL(P ∥ Q) denotes the Kullback-Leibler divergence between the two distributions and quantifies the excess number of bits required by the code Q, compared to the best possible code P, which is infeasible because of the lack of knowledge of P. Given a set of encodings 𝒬 and a generative model P, a key quantity in MDL is the best possible excess codelength (or regret), defined as

    R_opt(𝒬; P) := inf_{Q ∈ 𝒬} (1/n) KL(P ∥ Q).    (1)

This quantity is often treated as a complexity measure for the pair (𝒬, P), and for it to be effective, the set of codes 𝒬 should be rich enough; otherwise the resulting complexity would be loose. The power of the MDL framework comes from the fact that the set 𝒬 does not have to be a parametric class of fixed dimension, and one should note that Q(y) does not denote the likelihood of the observation in general. MDL can indeed be considered a global generalization of the maximum likelihood principle (MLE); see Sec. D.1 for further discussion of MDL vs. MLE. In the recent MDL universal coding form known as normalized maximum likelihood (NML), the distribution Q is defined directly on the space 𝒴 and is in general not explicitly parametrized by the parameter of interest (e.g., the regression coefficient in linear models). However, when 𝒬 is a parametric model class of dimension d, several MDL-based complexities correspond to universally optimal codes in information theory, simplify to the familiar (d/2) log n term asymptotically (for fixed d and n → ∞), and are thus equivalent (asymptotically), to first order, to BIC barron1998minimum; foster2004contribution; hansen2001model. In a recent work, Grünwald et al. grunwald2017tight further developed a framework unifying several complexities, including Rademacher complexity and NML, and derived excess risk bounds in terms of the unifying complexity measure for several low-dimensional settings. We now analyze a variant of the NML encoding to obtain a valid complexity measure in high dimensions and investigate its dependence on the design matrix of covariates.

4 MDL-COMP for linear models

We now derive analytical expressions for MDL-COMP with Gaussian linear models. We are given a dataset {(x_i, y_i)}_{i=1}^n, with y_i ∈ ℝ the response variable and x_i the d-dimensional covariates, such that

    y = Xθ⋆ + ξ,   ξ ∼ N(0, σ² I_n),    (2)

where X ∈ ℝ^{n×d} stacks the covariates row-wise and ξ denotes the noise in the response y compared to the true (noiseless) response Xθ⋆. Our aim is to compare the hardness of different problem instances of this linear model as (n, d) vary, especially when d > n.

As in the MDL literature, we design codes for y (from the model (2)) conditional on the covariate matrix X, i.e., assuming X is known. Since we are interested in an operational definition of complexity, we consider encodings associated with a computationally feasible set of estimators that achieve good predictive performance, here ridge estimators. We do not use the OLS estimator since (i) it has poor performance in the over-parameterized regime, and (ii) the excess codelength of the code induced by such an estimator is typically infinite when the observation space is unbounded (see Example 11.1 of grunwald2007minimum). Instead, we consider ℓ2-regularized least squares estimators, with penalty θ⊤Λθ for a positive definite matrix Λ:

    θ̂_Λ(y) := arg min_θ { ‖y − Xθ‖₂² + θ⊤Λθ } = (X⊤X + Λ)⁻¹ X⊤ y.    (3)

A common choice of Λ for ridge regression is Λ = λ I_d with λ > 0. Given the ridge estimator θ̂_Λ, it remains to induce an encoding based on it. Recall that we assume that the covariates X are known and only the information in y is to be encoded. We use the notation

    g_θ(y) := (2πσ²)^{−n/2} exp( −‖y − Xθ‖₂² / (2σ²) )    (4)

for any θ ∈ ℝ^d. For a given Λ, we use the following code Q_Λ with its density given by

    q_Λ(y) := g_{θ̂_Λ(y)}(y) · exp( −θ̂_Λ(y)⊤ Λ θ̂_Λ(y) / (2σ²) ) / C_Λ,    (5)

where C_Λ is a normalization constant. For regularized estimators, such a code (5) is called a luckiness normalized maximum likelihood (LNML) code (see Chapter 11.3 of grunwald2007minimum): the first factor captures the data fit, and the second factor corresponds to the regularization term, i.e., the luckiness factor. Neither factor is a probability density on its own, and their product does not correspond to a valid prefix code since it does not satisfy Kraft's inequality in general. Moreover, the unnormalized product depends on y in two ways (since θ̂_Λ also depends on y). Hence, to associate a valid prefix code, following the convention in the MDL literature, we normalize by C_Λ to obtain the density q_Λ, so that the codelength of y is universally valid for any choice of y. See Sec. D.2 for further connections with LNML. Next, to define MDL-COMP, we choose the class of codes {q_Λ} over a suitably defined set of positive definite matrices Λ. We note that this family is not parametric in the classical sense (as it is not indexed by the canonical parameter θ). However, since the codes are induced by estimators, selecting a particular code corresponds to choosing a particular Λ-based ridge estimator for all y.
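As a sanity check on the estimator in equation (3), the following minimal sketch (illustrative, not the paper's code) verifies that for Λ = λI it coincides with standard ridge regression as implemented in scikit-learn (alpha = λ, no intercept).

```python
import numpy as np
from sklearn.linear_model import Ridge

def generalized_ridge(X, y, Lam):
    """theta_hat_Lambda = (X^T X + Lambda)^{-1} X^T y, cf. equation (3)."""
    return np.linalg.solve(X.T @ X + Lam, X.T @ y)

rng = np.random.default_rng(0)
n, d, lam = 50, 10, 2.0
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + rng.standard_normal(n)

theta_general = generalized_ridge(X, y, lam * np.eye(d))
theta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(theta_general, theta_sklearn))  # True
```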

To keep the complexity calculations analytically tractable, we assume that the matrix X⊤X and the regularization matrix Λ are simultaneously orthogonally diagonalizable. In particular, let

    X⊤X = U diag(ρ₁, …, ρ_d) U⊤   and   w := U⊤ θ⋆,    (6)

where for d > n we use the convention that ρ_i = 0 for i > n. Here the matrix U denotes the (complete set of) eigenvectors of the matrix X⊤X. Moreover, let ℳ denote the set of positive semi-definite matrices of the form Λ = U diag(λ₁, …, λ_d) U⊤ with λ_i ≥ 0. Given that our encodings are indexed by the hyperparameters Λ, it remains to add the codelength required to encode these choices. We do so via a simple quantization-based encoding for these parameters (denoted by L(Λ); see Sec. C.1.1). We are now ready to define the MDL-based complexity for linear models.

Definition 1.

For the linear model (2) with Λ ∈ ℳ and the ridge-induced encodings {q_Λ}, we define MDL-COMP as follows:

    MDL-COMP := R_opt + L(Λ_opt),    (7)

where R_opt denotes the minimum (normalized by n) KL divergence (1) between P and any code q_Λ with Λ ∈ ℳ, and L(Λ_opt) denotes the codelength needed to encode the optimal Λ_opt achieving the minimum in equation (1).

Note that the codelength L(Λ_opt) does not appear in the optimality principle (1); it is added in MDL-COMP to reflect the bits required to transmit the choice of the optimal hyperparameter Λ_opt, in the spirit of the two-stage encoding form of MDL in the parametric case barron1998minimum; hansen2001model; grunwald2007minimum.

4.1 Analytic expressions of MDL-COMP for finite n and d

With the definition and notations in place, we now state our first main result.

Theorem 1.

For the linear model (2), the MDL complexity (7) is given by

    (8)

where ρ₁, …, ρ_d denote the eigenvalues of X⊤X and the vector w was defined in equation (6).

See Sec. C.1 for the proof. The expression (8) depends on the design matrix and the noise variance σ² in addition to n and d. Depending on the interaction between the eigenvalues of X⊤X and the rotated true parameter w, MDL-COMP exhibits a wide range of behavior. Appendix A contains expansions of MDL-COMP that demonstrate this range. In particular, we show that MDL-COMP scales as log d in high dimensions for random i.i.d. Gaussian design. In low dimensions, MDL-COMP—like most other complexity measures—scales linearly with d, but its scaling changes to log d in high dimensions (d > n). We numerically verify this scaling in Fig. 2 for a few other settings. See equations (12) and (13) in Appendix A for mathematical expressions, where the dependence on the signal-to-noise ratio and on the (intrinsic) dimensionality d⋆ of the true parameter is also highlighted.

Figure 2: MDL-COMP exhibits non-linear scaling with d for various covariate designs. As we vary the dimension d of the covariates used for computing MDL-COMP, we observe a linear scaling of MDL-COMP with d for d < n, but a slow logarithmic growth for d > n. We fix the sample size n; σ² denotes the variance of the noise in the observations (2); the decay parameter governs how fast the eigenvalues of the (diagonal) covariance of the full covariates decrease; and d⋆ denotes the true dimensionality of θ⋆ (i.e., the observations in (2) depend only on the first d⋆ columns of X).

The expression (8) is oracle in nature since it depends on an unknown quantity, the true parameter θ⋆, via equation (6). In Sec. 5, we propose a data-driven, MDL-COMP-inspired regularization-parameter selection criterion, called Prac-MDL-COMP, to tune a single regularization parameter. Later, in Sec. A.2, we seek a data-driven approximation to MDL-COMP so that the proposed complexity measure can be used in practice without requiring knowledge of θ⋆.

4.2 MDL-COMP versus mean-squared error (MSE)

Next, we show that the definition of MDL-COMP indirectly yields an optimality claim for the in-sample MSE E‖Xθ̂ − Xθ⋆‖₂²/n (a quantity different from the training MSE ‖y − Xθ̂‖₂²/n). The in-sample MSE can be considered a valid proxy for the out-of-sample MSE when the fresh samples share the covariates of the training data, and it has often been used in prior work to study the bias-variance tradeoff of different estimators raskutti2014early. Our next result shows that the Λ that defines MDL-COMP also minimizes the in-sample MSE. This provides an indirect justification for the choice of codes (5) used to compute MDL-COMP. (See Sec. C.2 for the proof.)

Theorem 2.

For the ridge estimators (3), the expected in-sample mean-squared error and the excess codelength (7) admit the same minimizer over Λ ∈ ℳ, i.e.,

    arg min_{Λ ∈ ℳ} E‖Xθ̂_Λ − Xθ⋆‖₂² = arg min_{Λ ∈ ℳ} (1/n) KL(P ∥ q_Λ).    (9)

Later, we show in experiments that tuning the ridge model via a practical (data-driven) variant of MDL-COMP achieves an out-of-sample MSE that is competitive with the CV-tuned ridge estimator.
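To make the in-sample MSE appearing in Theorem 2 concrete, the sketch below (illustrative: it uses an oracle θ⋆, a scalar penalty Λ = λI rather than the general Λ of the theorem, and Monte-Carlo averaging) estimates E‖Xθ̂_λ − Xθ⋆‖²/n over a grid of λ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 80, 1.0
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d) / np.sqrt(d)

def in_sample_mse(lam, n_reps=200):
    """Monte-Carlo estimate of E ||X theta_hat_lam - X theta_star||^2 / n."""
    errs = []
    for _ in range(n_reps):
        y = X @ theta_star + sigma * rng.standard_normal(n)
        theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        errs.append(np.mean((X @ theta_hat - X @ theta_star) ** 2))
    return np.mean(errs)

for lam in [1e-3, 0.1, 1.0, 10.0, 100.0]:
    print(f"lambda={lam:7.3f}  in-sample MSE={in_sample_mse(lam):.3f}")
```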

4.3 Expressions for R_opt in large-scale settings

More recently, several works have derived the generalization behavior of different estimators for linear models when both n and d grow hastie2019surprises; belkin2019two, to account for the large-scale datasets that arise in practice. We now derive an expression for R_opt, a key quantity in MDL-COMP (7), in such settings. We assume that the covariate matrix X has i.i.d. entries drawn from a distribution with zero mean, finite variance, and suitably bounded fourth moments, and that the parameter θ⋆ is drawn randomly from a rotationally invariant distribution with ‖θ⋆‖₂ held constant.

Theorem 3.

Let n, d → ∞ such that d/n → γ ∈ (0, ∞). Then we have

    (10)

almost surely.

See Sec. C.3 for the proof. We remark that the inequality in the theorem is an artifact of Jensen's inequality and can be removed whenever good control over the relevant quantity is available. The first term on the right-hand side of equation (10) grows when the norm of θ⋆ grows with d; but when ‖θ⋆‖₂ is held constant as d grows, the regret does not grow and remains bounded by a constant. In Sec. A.2, we show that Theorem 3 provides a reasonable approximation of R_opt even for finite n and d. An analytical derivation of the limit of the codelength for the hyperparameters (and thereby of MDL-COMP) in such settings is left for future work.

5 Data driven MDL-COMP and experiments

This section proposes a practical version of MDL-COMP. Simulations and real-data experiments show that this data-driven criterion is useful for informing generalization. In particular, we first demonstrate that solving (11) correlates well with minimizing test error (Fig. 3). Next, we find that, for real datasets, Prac-MDL-COMP is competitive with cross-validation (CV) for performing model selection, actually outperforming CV in the low-sample regime (Fig. 4).

5.1 MDL-COMP-inspired hyperparameter tuning

As defined, MDL-COMP cannot be computed in practice, since it assumes knowledge of the true parameter θ⋆. Moreover, ridge estimators are typically fit with a single regularization parameter shared across all features. As a result, we propose the following data-driven, practical MDL-COMP:

    (11)

where θ̂_λ is the ridge estimator (3) for Λ = λ I_d. This expression is the sample version of the expression (19) derived in the proof of Theorem 1, where we enforce Λ = λ I_d. Moreover, note that Prac-MDL-COMP is a hyperparameter tuning criterion, and we have omitted the codelength for λ since we tune only a single scalar parameter. In Sec. A.2 (see Fig. A2), we show how Prac-MDL-COMP can be tweaked to compute tractable and reasonable approximations to MDL-COMP in practice (without knowledge of θ⋆). Linear (ridge) models are fit using scikit-learn pedregosa2011scikit, and the optimization to obtain Prac-MDL-COMP (11) is performed using SciPy 2020SciPy-NMeth.
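The sketch below illustrates how such a criterion can be optimized with scikit-learn and SciPy. The exact objective is equation (11); since that display is not reproduced here, the formula in the code is an illustrative stand-in built from the three-term structure described around equation (14) in Appendix A.2 (data fit, ridge penalty, and an "eigensum" term), and should not be read as the paper's definition.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, d = 50, 100
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d) / np.sqrt(d)
y = X @ theta_star + rng.standard_normal(n)

rho = np.linalg.eigvalsh(X.T @ X)   # eigenvalues of X^T X

def prac_mdl_objective(log_lam):
    """Illustrative stand-in for equation (11): data fit + ridge penalty
    + eigensum term, evaluated at Lambda = lam * I."""
    lam = np.exp(log_lam)
    theta_hat = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
    data_fit = np.sum((y - X @ theta_hat) ** 2)
    penalty = lam * np.sum(theta_hat ** 2)
    eigensum = np.sum(np.log1p(rho / lam))
    return (data_fit + penalty + eigensum) / (2 * n)

res = minimize_scalar(prac_mdl_objective, bounds=(-10, 10), method="bounded")
lam_opt = np.exp(res.x)
print("selected regularization:", lam_opt)
```

Optimizing over log λ keeps the search well-conditioned across several orders of magnitude of the regularization strength.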

Prac-MDL-COMP informs out-of-sample MSE in Gaussian simulations:

While Theorem 2 guarantees that the optimal regularization defining the (oracle) MDL-COMP also achieves the minimum possible in-sample MSE, here we show the usefulness of Prac-MDL-COMP for achieving good test (out-of-sample) MSE. Fig. 3 shows that the model corresponding to the optimal λ achieving the minimum in equation (11) has a low test MSE (panels A-E), comparable to the one obtained by leave-one-out cross-validation (LOOCV, panel F). The setup follows the same Gaussian design as in Fig. 1: the covariates are drawn i.i.d. from a standard Gaussian, and θ⋆ has dimension d⋆ (resulting in misspecification when the fitted dimension is smaller than d⋆); its entries are first drawn i.i.d. and then rescaled to a fixed norm. The noise variance σ² is set to 1.

Across all panels (A-E), i.e., for any choice of the fitted dimension, we observe that the minima of the test MSE and of the objective (11) defining Prac-MDL-COMP often occur close to each other (points towards the bottom left of these panels). Fig. 3F shows the generalization performance of the models selected by Prac-MDL-COMP in the same setting. Selection via Prac-MDL-COMP generalizes well, very close to the best ridge estimator selected by leave-one-out cross-validation. Appendix B shows more results suggesting that Prac-MDL-COMP can select models well even under different forms of misspecification.

Figure 3: Minimizing the objective (equation (11)) which defines Prac-MDL-COMP selects models with low test error. A-E. Different panels show that this holds as the problem dimensions are varied. F. Prac-MDL-COMP selects a model with test MSE very close to that of ridge cross-validation, avoiding the peak exhibited by OLS.

5.2 Experiments with real data

We now analyze the ability of Prac-MDL-COMP (11) to perform model selection on real datasets. Datasets are taken from PMLB olson2017pmlb; OpenML2013, a repository of diverse tabular datasets for benchmarking machine-learning algorithms. We omit datasets which are simply transformations of one another or which have too few features; this results in 19 datasets spanning a variety of tasks, such as predicting breast cancer from image features, predicting automobile prices, and predicting election results from previous elections simonoff2013analyzing. The mean number of data points for these datasets is 122,259 and the mean number of features is 19.1. Fig. 4A shows that Prac-MDL-COMP tends to outperform Ridge-CV in the limited-data regime, i.e., when d/n is large. Here d is fixed at the number of features in each dataset and n is varied (by subsampling) from its maximum value. As the number of training samples is increased (i.e., d/n decreases), we no longer see an advantage provided by selection via Prac-MDL-COMP. Further details and experiments on the omitted datasets are provided in Sec. B.2 (see Tables 1, B2 and B3).
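A minimal sketch of this kind of limited-data comparison (illustrative: the dataset choice, preprocessing, subsample size, and the stand-in objective from Sec. 5.1 are assumptions; RidgeCV with its default leave-one-out scheme plays the role of Ridge-CV):

```python
import numpy as np
from pmlb import fetch_data                      # PMLB datasets (olson2017pmlb)
from scipy.optimize import minimize_scalar
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = fetch_data("542_pollution", return_X_y=True)
X = StandardScaler().fit_transform(np.asarray(X, dtype=float))
y = (np.asarray(y, dtype=float) - np.mean(y)) / np.std(y)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
n_sub = min(len(y_tr), X.shape[1])               # keep n comparable to d (limited data)
X_tr, y_tr = X_tr[:n_sub], y_tr[:n_sub]

rho = np.linalg.eigvalsh(X_tr.T @ X_tr)

def objective(log_lam):                          # stand-in for equation (11)
    lam = np.exp(log_lam)
    theta = Ridge(alpha=lam, fit_intercept=False).fit(X_tr, y_tr).coef_
    return (np.sum((y_tr - X_tr @ theta) ** 2) + lam * np.sum(theta**2)
            + np.sum(np.log1p(rho / lam))) / (2 * len(y_tr))

lam = np.exp(minimize_scalar(objective, bounds=(-10, 10), method="bounded").x)
mdl_model = Ridge(alpha=lam, fit_intercept=False).fit(X_tr, y_tr)
cv_model = RidgeCV(alphas=np.logspace(-3, 3, 30), fit_intercept=False).fit(X_tr, y_tr)

print("Prac-MDL test MSE:", np.mean((mdl_model.predict(X_te) - y_te) ** 2))
print("Ridge-CV test MSE:", np.mean((cv_model.predict(X_te) - y_te) ** 2))
```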

Figure 4: Comparing test error when selecting models using Prac-MDL-COMP versus cross-validation on real datasets. A. When data is most limited, Prac-MDL-COMP-based hyperparameter tuning outperforms Ridge-CV. B, C. As the number of training samples increases, Prac-MDL-COMP performs comparably to Ridge-CV. Each point is averaged over 3 random bootstrap samples.

6 Discussion

In this work, we revisited the two factors dictating the bias-variance tradeoff principle: the class of estimators and the complexity measure. We defined an MDL-based complexity measure, MDL-COMP, for linear models with ridge estimators. Our analytical results show that MDL-COMP depends on n, d, the design matrix, and the true parameter values. It does not grow linearly with d in high dimensions; it can actually grow much more slowly, like log d for d > n. Numerical experiments show that an MDL-COMP-inspired hyperparameter tuning for ridge estimators provides good generalization performance, often outperforming cross-validation when the number of observations is limited.

Many future directions arise from our work. Ridge estimators are not suitable for parameter estimation with sparse models in high dimensions, and thus deriving MDL-COMP with codes suitably adapted to ℓ1-regularization is an interesting future direction. Additionally, it remains to define and investigate MDL-COMP beyond linear models, e.g., for kernel methods and neural networks, as well as for models used for classification. Indeed, modern neural networks contain massive numbers of parameters, sometimes ranging from a few million Szegedy_2015 to a hundred million he2016deep, while the training data remains fixed at a much smaller number of examples. Directly plugging in the logarithmic scaling of MDL-COMP (for d > n), such an increase in the number of parameters increases the suggested complexity only logarithmically, rather than in proportion to the parameter count. This suggested complexity increase is much more modest, possibly explaining the ability of such large models to still generalize well. While such a conclusion does not directly follow from our results, we believe this work takes a useful step towards better understanding the bias-variance tradeoff, and provides a proof of concept for the value of revisiting the complexity measure in modern high-dimensional settings. We hope this path can eventually shed light on a meaningful comparison of complexities of different DNNs and aid in the design of DNNs for new applications beyond cognitive tasks.

Broader impacts

As with any machine-learning methodology, a blind application of the results here can have harmful consequences. Using MDL-COMP for model selection in a prediction setting without external validation may result in a model that does not work well, resulting in potentially adverse effects. That being said, most of the results in this work are theoretical in nature and can hopefully lead to a more precise understanding of complexity and the bias-variance tradeoff. In addition, from an applied perspective, we hope MDL-COMP can help to select and develop better models in a wide variety of potential applications with real-world impacts. In particular, it can help with model selection in practical settings when labeled data is scarce, such as in medical imaging.

Acknowledgements

This work was partially supported by National Science Foundation grant NSF-DMS-1613002 and the Center for Science of Information (CSoI), a US NSF Science and Technology Center, under grant agreement CCF-0939370, and an Amazon Research Award to BY, and Office of Naval Research grant DOD ONR-N00014-18-1-2640 and National Science Foundation grant NSF-DMS-1612948 to MJW.


Appendix A Further discussion on MDL-COMP

In this appendix, we discuss the scaling of MDL-COMP for several settings (deferred from the main paper) in Sec. A.1. Then, in Sec. A.2, we provide (a) comparisons of our theoretical results in Theorems 1 and 3, and (b) a practical strategy to approximate MDL-COMP via Prac-MDL-COMP which does not require knowledge of the unknown parameter θ⋆ (see Fig. A2).

A.1 Scaling of MDL-COMP for random Gaussian design

We now describe the set-up associated with Fig. 2 in more detail.

When we say that the true dimensionality is d⋆, we mean that the observations depend only on the first d⋆ covariates of the full covariate matrix. Given a sample size n (fixed for a given dataset and a given plot), the full covariate matrix has d_max columns, where d_max denotes the maximum number of covariates available for fitting the model (in our plots, d_max can be computed by multiplying the sample size n by the maximum value of d/n denoted on the x-axis). When we vary d, we select the first d columns of the full matrix for fitting the ridge model and computing MDL-COMP.

Next, to compute MDL-COMP via equation (8), we need to compute the eigenvalues of X⊤X, where X denotes the first d columns of the full matrix used for fitting the ridge model and computing the corresponding encoding and MDL-COMP. We use the SciPy linear-algebra library to compute the eigenvalues ρ_i and the eigenvectors U of X⊤X. Moreover, we need to compute the vector w defined in equation (6). Note that U has size d × d while θ⋆ is d⋆-dimensional. So when d < d⋆, the vector w is computed by restricting θ⋆ to its first d coordinates (its orthogonal projection onto those coordinates) and setting w = U⊤θ⋆_{1:d}. On the other hand, for d ≥ d⋆, we simply extend θ⋆ by appending zeros to obtain a d-dimensional vector θ̃⋆, and then set w = U⊤θ̃⋆.
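A sketch of this computation (illustrative; the specific n, d_max, d⋆, and signal used below are assumptions):

```python
import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
n, d_max, d_star = 100, 300, 50
X_full = rng.standard_normal((n, d_max))
theta_star = rng.standard_normal(d_star) / np.sqrt(d_star)

def eigen_quantities(d):
    """Eigenvalues of X^T X and the rotated signal w (equation (6))
    when fitting with the first d columns of the full design."""
    X = X_full[:, :d]
    rho, U = linalg.eigh(X.T @ X)
    if d <= d_star:
        theta_d = theta_star[:d]                                      # restrict to first d coords
    else:
        theta_d = np.concatenate([theta_star, np.zeros(d - d_star)])  # pad with zeros
    w = U.T @ theta_d
    return rho, w

rho, w = eigen_quantities(d=80)
print(rho[-5:], np.linalg.norm(w))
```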

In the next figure, we plot the behavior of MDL-COMP for even larger d, to illustrate the slow growth of MDL-COMP for d > n. Here we plot the results only for the i.i.d. Gaussian design. We also plot an adjusted MDL-COMP that accounts for the codelength caused by the finite resolution chosen when encoding the hyperparameters (24). This adjustment leads to an extra offset and does not affect the qualitative trends in the figure.

Figure A1: MDL-COMP shows a linear scaling in small dimensions and a logarithmic scaling for d > n, both for a low-dimensional true signal (panel (a), d⋆ ≤ n) and a high-dimensional one (panel (b), d⋆ > n). Here d⋆ denotes the true dimensionality of the signal and d denotes the dimensionality of the covariates used for fitting the ridge estimator (and thereby the encoding). We also plot MDL-COMP adjusted by the resolution term in (24) and observe that the adjusted curve remains parallel to MDL-COMP and thereby does not change the conclusions qualitatively.

Let us now unpack the expression for MDL-COMP from Theorem 1, namely equation (8), and illustrate (informally) why MDL-COMP has a scaling of order d for d < n, and a much slower scaling for d > n. Note that MDL-COMP (8) depends on two sets of quantities, the eigenvalues ρ_i and the rotated signal coordinates w_i, both of which change as the fitting dimension d varies. For simplicity, we normalize σ and θ⋆; the argument can be sketched for other values in a similar fashion.

Using standard concentration arguments, we can argue that for d < n, the matrix X⊤X concentrates around n·I_d, and hence ρ_i ≈ n for all i. For d > n, the matrix X⊤X has rank n, and one can argue that its n non-zero eigenvalues are roughly equal to d. On the other hand, for the terms w_i (where w = U⊤θ⋆), we note that since U (the eigenvectors of X⊤X) can be seen as a random orthogonal matrix of size d × d, the terms w_i² are roughly equal. Since ‖w‖₂ = ‖θ⋆‖₂, and the different coordinates of θ⋆ are drawn i.i.d., we conclude that w_i² ≈ ‖θ⋆‖₂²/d. Plugging in these approximations, when d⋆ ≤ n, we find that

MDL-COMP
(12)

For the case when d⋆ > n, i.e., when the true dimensionality is larger than the sample size, we find that

MDL-COMP
(13)

In both cases (12) and (13), we find that MDL-COMP scales linearly with d for small d, and logarithmically with d for d > n. Overall, our sketch above and Figs. 2 and A1 suggest that d should not be treated as the default complexity for d > n, and that MDL-COMP instead provides a scaling of order log d for d > n.

A.2 Data-driven approximation to MDL-COMP

We now elaborate on how we can tweak Prac-MDL-COMP to derive an excellent approximation to MDL-COMP which can be computed in practice without knowledge of θ⋆. Note that Prac-MDL-COMP (11) is composed of three terms:

(14)

From the proof of Theorem 1, we find that MDL-COMP (cf. (8)) is closely associated with the “Eigensum” term. In fact, noting the expression for the hyperparameter codelength in equation (24) in the proof of Theorem 1, we find that to approximate MDL-COMP, a suitable adjustment is to add the corresponding codelength term to the “Eigensum”, evaluated at the optimal value λ_opt achieving the minimum in the optimization problem (11). We call the resulting quantity Approx-MDL-COMP:

Approx-MDL-COMP
(15)

Fig. A2(a) shows that Approx-MDL-COMP provides an excellent approximation of the (oracle) MDL-COMP over the whole range of d in the i.i.d. Gaussian design case. A more extensive investigation of Approx-MDL-COMP in real-data settings is left for future work.

In Fig. A2, we also compare the expression stated in Theorem 3, derived in the large-scale limit, with the expression for R_opt obtained for finite n and d in the proof of Theorem 1 (see equation (16a)). From Fig. A2(b), we observe that Theorem 3 provides a good approximation of the finite-sample expression even for moderate n and d in the i.i.d. Gaussian random design.

Figure A2: Comparison of different MDL bounds for the i.i.d. Gaussian design. (a) Approx-MDL-COMP (15) provides an excellent approximation of the (oracle) MDL-COMP; once again we observe the linear growth of MDL-COMP in low dimensions and the slow logarithmic growth in high dimensions. (b) The asymptotic bound from Theorem 3 for large-scale settings closely approximates the exact expression (16a) derived in the proof of Theorem 1 for finite n and d.

Appendix B Further numerical experiments

We now present additional experiments showing the usefulness of the data-driven Prac-MDL-COMP (11). In particular, we show that Prac-MDL-COMP informs out-of-sample MSE in Gaussian as well as several misspecified linear models (Sec. B.1), and then provide further insight into the performance on the real-data experiments (deferred from Sec. 5.2 of the main paper) in Sec. B.2.

B.1 Misspecified linear models

We consider three different types of model misspecification taken from prior work yu2020veridical and analyze the ability of Prac-MDL-COMP to select models that generalize. Fig. B1 shows that under these conditions, Prac-MDL-COMP still manages to pick a regularization level which generalizes fairly well. "T-distribution" refers to errors being distributed with a t-distribution with three degrees of freedom. "Decaying eigenvalues" refers to a covariance matrix with decaying eigenvalues, inducing block structure in the covariance matrix. "Thresholds" refers to calculating the outcome using indicator functions of the covariates in place of the raw covariate values.
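A sketch of how such misspecified responses can be generated (illustrative; the dimensions, the eigenvalue-decay rate, and the thresholding rule below are assumptions based on the descriptions above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 50
theta_star = rng.standard_normal(d) / np.sqrt(d)

# (1) t-distributed errors with three degrees of freedom.
X = rng.standard_normal((n, d))
y_t = X @ theta_star + rng.standard_t(df=3, size=n)

# (2) Decaying eigenvalues of the covariance matrix (decay rate assumed).
eigvals = 0.9 ** np.arange(d)
X_decay = rng.standard_normal((n, d)) * np.sqrt(eigvals)   # diagonal covariance
y_decay = X_decay @ theta_star + rng.standard_normal(n)

# (3) Thresholded covariates: outcome computed from indicator functions.
y_thresh = (X > 0).astype(float) @ theta_star + rng.standard_normal(n)
```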

Figure B1: Model selection under misspecification. Under various misspecifications, Prac-MDL-COMP still manages to select models which generalize reasonably well. Different from the other figures presented in the paper, here d is fixed and the sample size n is varied.

B.2 Real-data experiments continued

We now provide further investigation of the real-data experiments. First, the good performance of the Prac-MDL-COMP-based linear model also holds against 5-fold CV; see Fig. B2. Next, we show results for 28 real datasets in Fig. B3, where the plot titles correspond to dataset IDs in OpenML OpenML2013. In the limited-data regime (when d/n is large, the right-hand side of each plot), Prac-MDL-COMP tends to outperform Ridge-CV. As the number of training samples is increased, the gap closes or cross-validation begins to outperform Prac-MDL-COMP. These observations provide evidence for the promise of Prac-MDL-COMP-based regularization in limited-data settings.

Figure B2: Results for Fig 4 hold when using 5-fold CV rather than LOOCV.

Figure B3: Test MSE when varying the number of training points sampled from real datasets (lower is better). MDL-COMP performs well, particularly when the number of training points is small (right-hand side of each plot). Each point averaged over 3 random bootstrap samples.

Dataset name (OpenML ID) | No. of observations (n) | No. of features (d) | Ridge-CV MSE | Prac-MDL-COMP MSE
1028_SWD | 1000 | 11 | 1.17 | 0.97
1029_LEV | 1000 | 5 | NaN* | 1.10
1030_ERA | 1000 | 5 | NaN* | 1.05
1096_FacultySalaries | 50 | 5 | NaN* | 2.01
1191_BNG_pbc | 1000000 | 19 | 1.02 | 1.00
1193_BNG_lowbwt | 31104 | 10 | 0.98 | 1.01
1196_BNG_pharynx | 1000000 | 11 | 1.03 | 1.00
1199_BNG_echoMonths | 17496 | 10 | 1.47 | 1.00
1201_BNG_breastTumor | 116640 | 10 | 2.61 | 1.12
192_vineyard | 52 | 3 | NaN* | 1.57
201_pol | 15000 | 49 | 0.82 | 0.87
210_cloud | 108 | 6 | 3.41 | 0.73
215_2dplanes | 40768 | 11 | 0.88 | 0.85
218_house_8L | 22784 | 9 | 4.37 | 1.01
225_puma8NH | 8192 | 9 | 3.26 | 0.87
228_elusage | 55 | 3 | NaN* | 1.44
229_pwLinear | 200 | 11 | 0.83 | 0.67
230_machine_cpu | 209 | 7 | 2.48 | 0.71
294_satellite_image | 6435 | 37 | 0.47 | 0.44
344_mv | 40768 | 11 | 0.85 | 1.06
4544_GeographicalOriginalofMusic | 1059 | 118 | 0.67 | 0.60
519_vinnie | 380 | 3 | NaN* | 0.92
522_pm10 | 500 | 8 | 2.31 | 0.96
523_analcatdata_neavote | 100 | 3 | NaN* | 1.15
527_analcatdata_election2000 | 67 | 15 | 0.50 | 0.49
529_pollen | 3848 | 5 | NaN* | 0.94
537_houses | 20640 | 9 | 3.00 | 1.06
542_pollution | 60 | 16 | 0.88 | 0.82

Table 1: Prac-MDL-COMP vs. cross-validation for a range of datasets when the training data is limited. This table contains details about the datasets used in Figs. 4, B2, and B3. The first column gives the name of each dataset, and the second and third columns report its size. In the last two columns, we report the performance of the CV-tuned ridge estimator (LOOCV) and the Prac-MDL-COMP-tuned ridge estimator, trained on a subsampled dataset in the limited-data regime. The reported MSE is computed on a hold-out test set (consisting of 25% of the observations). We observe that Prac-MDL-COMP-based tuning leads to superior performance compared to cross-validation for most of the datasets. *When too few points are available, cross-validation fails numerically for low values of the regularization parameter, returning a value of NaN.

Appendix C Proofs

In this appendix, we collect the proofs of our main results, Theorems 1, 2, and 3.

C.1 Proof of Theorem 1

We claim that

(16a)
(16b)

Putting these two expressions together directly yields the claimed expression for MDL-COMP:

MDL-COMP
(17)
Computing the codelength (claim (16a)):

For the linear model (2) with the true distribution P, we have y ∼ N(Xθ⋆, σ² I_n). Let p denote the density of P. To simplify notation, we use the shorthand θ̂_Λ for θ̂_Λ(y).

(18)

Simplifying the corresponding term and dividing both sides by n yields that

(19)

We note that the expression (19) was used to motivate Prac-MDL-COMP in Sec. 5.1. Next, we claim that

(20a)
(20b)

Assuming these claims as given at the moment, let us now complete the proof. We have

(21)

Finally, to compute the quantity in (19), we need to minimize the KL divergence (21), noting that the objective depends only on the hyperparameters λ₁, …, λ_d. We also note that the objective (the right-hand side of equation (21)) is separable across the terms indexed by i. We have

(22)

Moreover, checking the behavior of each term on the boundary of feasible values of λ_i (as λ_i → 0 and as λ_i → ∞), we find that the minimum is attained at an interior point. Finally, plugging this optimal value into the expression (22), we obtain the conclusion

(23)

We now prove the claim (16b) before turning to the proofs of our earlier claims (20a) and (20b) in Sections C.1.2 and C.1.3.

C.1.1 Computing the codelength (claim (16b))

Now we provide the codelength for the simple quantization-based encoding scheme for the regularization parameters λ₁, …, λ_d. In MDL, a (bounded) continuous hyperparameter is typically encoded by quantizing it to some small enough resolution δ; thus, for a real value, one can associate a (prefix) codelength of order log(1/δ) bits. With such an encoding scheme, and recalling that (i) the optimal value of each λ_i is given by equation (22), and (ii) for d > n, the λ_i's for i > n can be chosen to be zero and hence do not contribute to the codelength, we conclude that the number of bits needed to encode Λ_opt is given by

(24)

Here we have normalized by the number of observations to account for the fact that the n observations are encoded together. In fact, for analytical convenience we normalize by a slightly different constant; this discrepancy can be absorbed without difficulty by deciding a priori to always transmit the quantity up to a correspondingly adjusted resolution, so that the codelength remains the same up to this convention. We note that the resolution-dependent term in equation (24) contributes a term growing with d in low dimensions and a constant term for d > n; it is generally omitted in MDL conventions. Moreover, we show in Fig. A1 that adding this factor does not distort the conclusions.

We now turn to proving our earlier claims (20a) and (20b). We prove them separately, starting with the low-dimensional case, i.e., when d < n. Since the proof for the high-dimensional case is similar, we only outline the main steps in Appendix C.1.3.

C.1.2 Proof of claims (20a) and (20b) for the case d < n

We now expand the relevant terms from the display (18). To do so, we first introduce some notation. Let the singular value decomposition of the matrix X be given by