Inject Machine Learning into Significance Test for Misspecified Linear Models

06/04/2020
by   Jiaye Teng, et al.
Tsinghua University

Due to its strong interpretability, linear regression is widely used in social science, where significance tests provide the significance level of models or coefficients in traditional statistical inference. However, linear regression relies on linear assumptions about the ground truth function, which do not necessarily hold in practice. As a result, even for simple non-linear cases, linear regression may fail to report the correct significance level. In this paper, we present a simple and effective assumption-free method for linear approximation in both linear and non-linear scenarios. First, we apply a machine learning method to fit the ground truth function on the training set and calculate its linear approximation. Afterward, we obtain the estimator by adding adjustments based on the validation set. We prove concentration inequalities and asymptotic properties of our estimator, which lead to the corresponding significance test. Experimental results show that our estimator significantly outperforms linear regression for non-linear ground truth functions, indicating that our estimator might be a better tool for the significance test.



1 Introduction

Linear regression is commonly used in practice because it can provide explanations for important features via significance tests, but the required linear assumptions (e.g. linearity and normality) on the ground truth function are easily violated in practice (Osborne and Waters, 2002; Casson and Farmer, 2014). This problem is critical: when linear assumptions are violated, one may ignore important features or focus on unimportant features due to fake significance levels. One potential solution is to consider all kinds of ground truth functions by removing the linear assumption, but this triggers another problem: since there are numerous types of non-linear functions, how can we learn feature importance without knowing the exact type of the ground truth function?

The answer is to simply apply linear regression to the unknown non-linear ground truth functions, i.e. to use misspecified linear models (Fahrmeir, 1990; Hainmueller and Hazlett, 2014; Grünwald et al., 2017; Markiewicz and Puntanen, 2019). Apart from the fake significance level, misspecified linear models can also address other problems of linear regression, such as bias, inefficiency, and incorrect inferences (e.g. King and Zeng (2006)). Indeed, as we will show in Section 5, the traditional significance test based on linear regression fails even for simple non-linear ground truth functions such as the square function.

The most common approach to misspecification is introducing high-order terms and interactions (e.g. Friedrich (1982); Brambor et al. (2006)), but this only works for prescribed functional types and usually cannot recover the correct functional form. Another line of work (White, 1980; Berk et al., 2013; MacKinnon and White, 1985; Buja et al., 2015; Bachoc et al., 2020) performs the significance test directly on the least squares estimate and derives consistent estimators of its variance. The downside of this approach is that the corresponding estimators contain inevitable system errors and bias due to wrong model selection (see Section 5.2).

In this work, we introduce machine learning methods into misspecified linear models, so that we neither need to know the correct functional form nor incur system errors. We first use a machine learning method to fit the ground truth function in the training step and estimate the corresponding linear approximation. Afterward, we correct the mistakes made by the machine learning method in the validation step. We show a positive correlation between the performance of the underlying machine learning method and the performance of our new estimator (see Theorem 1). Moreover, we prove concentration inequalities (see Theorem 2) and asymptotic properties (see Theorem 3) of the newly proposed estimator, which can be further applied to the significance test.
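
To make this workflow concrete, the following minimal sketch (in Python, using NumPy and scikit-learn) illustrates the train-then-correct idea. It is an illustration rather than the paper's implementation: the 50/50 split, the network architecture, and the exact form of the residual correction computed on the validation set are our assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

def estimate_linear_approximation(X, y, X_unlabeled=None, seed=0):
    """Sketch of the train/validate pipeline described above."""
    # Step 1: randomly split the labeled data into training and validation sets.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=seed)

    # Step 2: fit a machine learning model on the training set only.
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=seed)
    model.fit(X_tr, y_tr)

    # Step 3: linear approximation of the fitted model, computed over all
    # available covariates (labeled + unlabeled), with an intercept column.
    X_all = X if X_unlabeled is None else np.vstack([X, X_unlabeled])
    Z_all = np.column_stack([np.ones(len(X_all)), X_all])
    gamma_hat, *_ = np.linalg.lstsq(Z_all, model.predict(X_all), rcond=None)

    # Step 4: residual correction computed on the held-out validation set
    # (our reading of the adjustment described in Section 4.2).
    Z_val = np.column_stack([np.ones(len(X_val)), X_val])
    Sigma_hat = Z_all.T @ Z_all / len(Z_all)            # second-moment estimate
    resid = y_val - model.predict(X_val)
    correction = np.linalg.solve(Sigma_hat, Z_val.T @ resid / len(Z_val))
    return gamma_hat + correction
```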

Several experiments are conducted to show that the newly proposed estimator works well in both non-linear and linear scenarios. In particular, our estimator significantly outperforms traditional linear regression (see Table 2) when measured by the Kolmogorov-Smirnov statistic in the non-linear scenario (square function), indicating that we make fewer mistakes in the significance test. For example, as we will show in Section 5.3, in the non-linear scenario our method makes mistakes with probability , while for traditional linear regression the number is .

2 Related Work

The research on misspecified linear models can be broadly divided into Conformal Prediction, which focuses on the inference of predictions, and Parameter Inference, which focuses on the inference of the linear approximation parameters of the ground truth function.

Conformal Prediction is a framework pioneered by Law (2006), which uses past experience to determine precise levels of confidence in new predictions. Conformal prediction (Shafer and Vovk, 2008; Papadopoulos et al., 2014; Barber et al., 2019; Cauchois et al., 2020; Zeni et al., 2020) mainly focuses on confidence intervals for predictions, so it cannot provide explanations for feature importance. Our work can be regarded as a parallel line to conformal prediction, focusing on assumption-free confidence intervals for parameter estimation.

Parameter Inference dates back to White (1980); MacKinnon and White (1985), where sandwich-type estimators of the variance are proposed. Furthermore, Buja et al. (2015) introduce M-of-N bootstrap techniques to improve the variance estimation. Hainmueller and Hazlett (2014) reduce the misspecification bias from a kernel-based point of view. Other techniques, e.g. LASSO (Lee et al., 2016) and least angle regression (Taylor et al., 2014), are introduced in post-selection inference, and some works (Rinaldo et al., 2016; Bühlmann et al., 2015) focus on high-dimensional regressions. More discussion can be found in Berk et al. (2013); Bachoc et al. (2020). However, this line of work relies on direct misspecification of linear models, which means system errors are inevitable when the ground truth function is non-linear. Furthermore, estimators of this type, based on least squares estimation, contain considerable bias, which will be further discussed in Section 5.2. In this paper, we use a machine learning based estimator instead of the least squares estimation, which contains less bias, as we will see in Section 5.2.

3 Preliminaries

In this section, we define the basic notations, starting from the definition of function norm and function distance.

Definition 1 (Function Norm and Function Distance)

Given a functional family defined on domain , for any and a probability distribution on with density , the function -norm of with respect to is defined as

When the context is clear, we simply use instead. Moreover, the function distance between and is defined as

Based on the function distance, we can define the least squares linear approximation, or simply linear approximation.

Definition 2 (Linear Approximation)

For a given function , its least squares linear approximation is defined as

where is the linear functional family.
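
As a small illustration of Definition 2, the least squares linear approximation of a known function under a distribution P can be estimated by Monte-Carlo least squares. The sampler interface, the example function, and the uniform distribution below are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

def linear_approximation(f, sample_P, n_samples=100_000, seed=0):
    """Monte-Carlo estimate of the least squares linear approximation of f
    under the distribution P (represented here by a sampler)."""
    rng = np.random.default_rng(seed)
    X = sample_P(n_samples, rng)                       # draws from P, shape (n, d)
    Z = np.column_stack([np.ones(len(X)), X])          # prepend an intercept
    beta, *_ = np.linalg.lstsq(Z, f(X), rcond=None)    # minimizer of the function distance
    return beta                                        # (intercept, slopes)

# Example: f(x) = x^2 with x ~ Uniform[0, 1] gives approximately (-1/6, 1).
beta = linear_approximation(lambda X: X[:, 0] ** 2,
                            lambda n, rng: rng.uniform(0.0, 1.0, size=(n, 1)))
```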

Traditional linear regression uses a single dataset to compute the parameters, but our method splits the dataset into two parts, a training set and a validation set, as defined below.

Definition 3 (Training and Validation Set)

Given a dataset , we randomly split into training set and validation set , where , , and , .

In some scenarios, we may have additional unlabeled data points, and therefore data points in total. Unlabeled data are usually easier to obtain than labeled data; they can be used to calculate the linear approximation of the machine learning model and help improve the estimation of . As we will discuss in Section 5, our analysis still applies without unlabeled data, provided that the machine learning model takes a linear form and is estimated from the validation set only. But more unlabeled data can enrich the range of machine learning models we may choose from.

In order to evaluate the performance of our model on the population distribution, we need to estimate the upper and lower bounds of a given function (defined below).

Definition 4 (Upper and Lower Bounds)

Given a function defined on , the upper and lower bounds of are

Similarly, given a dataset , where , the empirical lower bound and empirical upper bound of on set are defined by

While and are hard to obtain exactly, we may assume that is at least loosely bounded.

Assumption 1

Given a loss function defined on , we assume .

Notice that for linear regression, Assumption 1 usually holds, as we may assume that the domain of input is bounded, and the weight is also bounded. After applying proper scaling, we get .

As we will see in Theorem 1, our analysis depends on , and smaller gives more accurate estimations. If Assumption 1 holds, we immediately have . However, with additional prior knowledge on and , we may get tighter bounds of using Bayesian methods, as discussed in Lemma 1.

Lemma 1 (Tighter Estimation on )

Given a validation dataset whose data points are randomly sampled from , and a function defined on with bounds and , where and are unknown and . Let , and assume the prior , where represents the uniform distribution. For any , if , we have

where are the empirical bounds of on set .

The following Assumption 2 and Assumption 3 mainly focus on the explanatory variables .

Assumption 2 (Concentration of Explanatory Variables)

Let be a random vector in , and assume that there exists a constant such that almost surely.

Assumption 3 (Bounded Second Moment)

Let be a random vector in , and assume is invertible. Denote and . We assume that

In the following statements, we always use all data, including labeled and unlabeled data, to estimate , leading to the estimator .

4 Estimation

In this section, we study the problem of linear approximation of the oracle model , based on a machine learning framework. Specifically, we show the following: a) how the performance of our machine learning model affects the linear approximation estimator; b) how adding a bias term (which we call the residual term) yields an estimator with better guarantees; and c) how to run the hypothesis test (coefficient significance) based on the asymptotic distribution of our estimator. We defer all proofs to Appendix A.

4.1 Approach Based on MSE

In this subsection, we study the relationship between the linear approximation functions and given that is close to . We use mean squared error (MSE) to measure the performance of machine learning models.

Theorem 1 (Performance of Machine Learning Models)

For , given the oracle model and machine learning model with linear approximations and , respectively, suppose the labeled data are randomly split following Definition 3. Denote the loss function as ; the population loss is then . Under Assumption 3, for any , the following inequality holds:

where and is the sample loss defined on the validation set.

Remark: Theorem 1 shows that controls the approximation quality of and , which depends on both the validation set size and the validation loss. In other words, if the machine learning model generalizes well, we get good estimations of and .

The term depends on , which is bounded by based on Assumption 1. Although this term shrinks as the validation set grows, below we show that it can be bounded more accurately using Lemma 1.

Corollary 1

Under the assumptions of Lemma 1 and Theorem 1, by replacing in Lemma 1 by loss function and plugging it into Theorem 1, we have

where

Intuitively, using (the best linear approximation of ) to approximate seems to be the optimal choice. However, as we will show below, this is not true, since may contain bias in the linear setting.

4.2 Filling the Bias

In this subsection, we move beyond the restriction to MSE and improve our estimation by adding a bias term. We first present Lemma 2, which focuses on the estimation of the second-moment matrix of the explanatory variables .

Lemma 2 (The Second Moment Concentration)

Under Assumption 2 and Assumption 3, for every integer n, the following inequality holds:

where is a constant and are defined in Assumption 3.

By adding a small bias, we can derive the following Theorem 2, which mainly focuses on coordinate-wise bounds. Here we denote as the feature of and as the row of matrix .

Theorem 2 (Adding a bias term)

For , given the oracle model and machine learning model with linear approximations and , respectively, suppose the labeled data are randomly split based on Definition 3, and that Assumption 2 and Assumption 3 hold. Denote . Then, for any , the following inequality holds:

where is the realization of defined on the validation set.

Notice that is the total number of data points. The first term in the bound arises because we use samples in the validation set to estimate the population; a tighter bound requires smaller fluctuation of (which is ) and more samples in the validation set (). The second term arises because we use to replace ; a smaller condition number and a larger help tighten the bound.

Similar to Corollary 1, we can use and to replace and under additional assumptions, see below.

Corollary 2

Under the assumptions of Lemma 1 and Theorem 2, by replacing in Lemma 1 by and plugging it into Theorem 2, we have

where

Therefore, we should use as the new estimator. Recall the bound in Theorem 1 (denoted ) and the bound in Theorem 2 (denoted ). We can see that as goes to zero, while , where denotes the average sample loss. This means that cannot be arbitrarily close to for a fixed machine learning model, even with infinite validation data. That is to say, although Theorem 1 uses the frequently-used MSE as its measure, it incurs some natural bias, and Theorem 2 fills this bias by adding a correction term.
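
For concreteness, one natural way to write the corrected estimator in symbols (our notation and our reading of the correction term, stated as a sketch rather than the paper's exact display; here $\hat{\gamma}$ is the linear approximation of the fitted model $\hat{f}$, $\hat{\Sigma}$ the second-moment estimate of the covariates, and $D_{\mathrm{val}}$ the validation set of size $m$) is

$$\hat{\beta} \;=\; \hat{\gamma} \;+\; \hat{\Sigma}^{-1}\,\frac{1}{m}\sum_{(x_i,\,y_i)\in D_{\mathrm{val}}} x_i\,\bigl(y_i - \hat{f}(x_i)\bigr),$$

where the second term corrects the bias that remains after approximating the ground truth by $\hat{f}$.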

Furthermore, as is standard practice in statistics, we derive the asymptotic properties of in Section 4.3.

4.3 Asymptotic Properties

In this section, we study the asymptotic property of estimator , which gives us tighter and more practical guarantees. In the following analysis, we assume that . This is without loss of generality because otherwise we can directly use to estimate without any loss.

Theorem 3 (Asymptotic Properties)

Given the oracle model and machine learning model , with corresponding linear approximations and , respectively, suppose the labeled data are randomly split based on Definition 3. Denote , and assume are bounded (this usually holds in practice as long as are all bounded). Then, under Assumption 3, the following asymptotic property of holds:

where represents the normal distribution, , , .

Remark: Traditional asymptotic analysis is usually based on assumptions about the ground truth function, but our analysis does not need such assumptions and instead relies on the training-validation framework. For example, in the traditional analysis, the claim that follows a normal distribution rests on the assumptions that the ground truth function is linear and that the label noise follows a well-defined distribution.

Now we can perform hypothesis tests based on the asymptotic property of , including the model test and coefficient tests under significance level . The details can be found in Appendix B.
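
As a sketch of such a coefficient test: given the estimated coefficients and plug-in standard errors (assumed here to come from an estimate of the asymptotic variance in Theorem 3; the helper below is illustrative, not the paper's code), the z-statistics and two-sided p-values follow directly from the normal limit.

```python
import numpy as np
from scipy import stats

def coefficient_test(beta_hat, std_err, null_value=0.0):
    """Two-sided coefficient significance tests from an asymptotic normal law.

    beta_hat : estimated coefficients, shape (d,)
    std_err  : their estimated standard errors, shape (d,), assumed to come
               from a plug-in estimate of the asymptotic variance in Theorem 3
    """
    z = (np.asarray(beta_hat) - null_value) / np.asarray(std_err)  # z-statistics
    p = 2.0 * stats.norm.sf(np.abs(z))                             # two-sided p-values
    return z, p

# A coefficient j is declared significant at level alpha when p[j] < alpha.
```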

5 Experiments

In this section, we conduct experiments on the significance test derived in Section 4.3. We show that our method works in both linear and non-linear scenarios, while traditional linear regression fails in non-linear scenarios, even in a simple square case. Due to space limitations, the linear scenarios are shown in Appendix D, and more experimental details can be found in Appendix C. For each statistic, we repeat the experiment six times and compute a confidence interval for its mean.

We choose two types of machine learning models: a three-layer Neural Network (labeled Ours(NN)) and a Linear-form model (labeled Ours(L)). Note that Ours(L) does not need unlabeled data, while Ours(NN) needs unlabeled data to calculate the linear approximation of the machine learning model.
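
For illustration, the two model families could be instantiated as follows with scikit-learn; the depth, widths, and iteration counts are placeholders rather than the paper's exact configuration.

```python
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

# Hypothetical instantiations of the two model types used in the experiments.
models = {
    "Ours(NN)": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000),  # one reading of a "three-layer" network
    "Ours(L)": LinearRegression(),  # linear-form model; needs no unlabeled data
}
```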

5.1 Metrics

Two metrics are considered here, focusing on the correctness and efficiency, respectively.

Correctness is measured by the Kolmogorov-Smirnov statistic, defined in Equation 1. The Kolmogorov-Smirnov statistic measures how close the simulation results are to the theoretical results; a smaller Kolmogorov-Smirnov statistic is better.

$$D_n = \sup_{x} \left| \hat{F}_n(x) - F(x) \right| \qquad (1)$$

where $\hat{F}_n$ is the empirical CDF in simulation, and $F$ is the theoretical CDF.
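
A short sketch of how the statistic in Equation 1 can be computed from the simulated statistics (the standard normal reference in the SciPy call is only an example):

```python
import numpy as np
from scipy import stats

def ks_statistic(samples, theoretical_cdf):
    """Kolmogorov-Smirnov statistic of Equation 1: the largest gap between the
    empirical CDF of the simulated statistics and the theoretical CDF."""
    x = np.sort(np.asarray(samples))
    ecdf_hi = np.arange(1, len(x) + 1) / len(x)   # empirical CDF just after each point
    ecdf_lo = np.arange(0, len(x)) / len(x)       # empirical CDF just before each point
    cdf = theoretical_cdf(x)
    return max(np.max(ecdf_hi - cdf), np.max(cdf - ecdf_lo))

# Equivalent via SciPy, e.g. against a standard normal reference:
# D, _ = stats.kstest(samples, "norm")
```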

Efficiency is measured by the average standard deviation ( ) of the estimators, which reflects how much uncertainty the new estimators have. Since we have removed the linear assumptions, more uncertainty appears in our newly proposed method. A smaller means that we have more confidence in our estimators.

5.2 Unbiasedness

First, we show that traditional estimators based on least squares (LSE) contain more bias (see also Rinaldo et al. (2016)), including the linear regression method and the estimators proposed in White (1980); MacKinnon and White (1985); Lee et al. (2016); Taylor et al. (2014); Bühlmann et al. (2015); Bachoc et al. (2020), among others.

LSE-based Estimator Ours(NN) Ours(L)
-0.0048 () -0.0010 () -0.0001 ()
0.0015 () 0.0048 () 0.0021 ()
Table 1: Bias Comparison (Non-Linear)
Figure 1: Confidence interval for each bias. LSE, NN, and L are short for the traditional LSE-based estimator, Ours(NN), and Ours(L), respectively.

We test a simple square case, where , and our aim is to estimate its linear approximation. We repeat the simulation 1000 times, and each time we compute the mean of the estimators. Note that, to better expose the bias, we use a smaller dataset (see Appendix C.2). Table 1 shows the differences between the simulation results and the theoretical parameter, with their confidence intervals.

Figure 1 shows that the traditional LSE-based estimators of are biased: the confidence interval of the bias does not contain . Our proposed methods outperform these LSE-based methods because they have smaller bias. For simplicity, in Section 5.3 we compare our proposed methods only with linear regression on correctness and efficiency, since linear regression is the most widely used of these LSE-based methods in practice.

5.3 Non-linear Scenarios

In this section, we focus on the performance of the linear regression method and our newly proposed method under a non-linear scenario. We consider a simple non-linear ground truth model, which is

with no randomness . Its linear approximation can be calculated theoretically as

Our aim is to estimate . Thus the hypothesis test can be written as
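
For intuition about the theoretically calculated linear approximation, consider an assumed covariate distribution (the paper's exact data-generating setup is given in Appendix C and is not reproduced here): if $f^*(x) = x^2$ with $x \sim \mathrm{Uniform}[0,1]$, the least squares linear approximation $\beta_0 + \beta_1 x$ works out to

$$\beta_1 = \frac{\operatorname{Cov}(x, x^2)}{\operatorname{Var}(x)} = \frac{1/4 - (1/2)(1/3)}{1/12} = 1, \qquad \beta_0 = \mathbb{E}[x^2] - \beta_1\,\mathbb{E}[x] = \frac{1}{3} - \frac{1}{2} = -\frac{1}{6}.$$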

We repeat the simulation 1000 times; each time we calculate the statistic and plot the results in Figure 2. We also plot the theoretical distribution, which helps visualize how far apart the simulation results and theoretical results are. Figure 2 shows that traditional linear regression fails even in a simple square case, while our new estimator works well. We show one of the six experiment groups in the figure.

Figure 2: Test for parameters under non-linear scenarios; panels (a)-(c) show the baseline (linear regression), panels (d)-(f) Ours(NN), and panels (g)-(i) Ours(L). The orange line is the theoretical true distribution. The baseline (linear regression) is more fat-tailed compared with Ours(NN) and Ours(L).

The phenomenon shown in Figure 2 leads to fake significance test results: the fat-tailed distribution causes more variables to be incorrectly declared significant. For instance, when we set the significance level to (meaning the parameter should be determined incorrectly with probability around 0.05), is determined incorrectly by linear regression (LSE) with probability , while by Ours(L) with probability . We repeat the experiments six times, and the brackets show confidence intervals. The results of Ours(L) make the significance test more accurate.

We further report the correctness comparison quantitatively in Table 2, with confidence intervals, where our newly proposed estimators perform better. More details about Table 2 are given in Appendix E.

normal () normal ()
Linear Reg 0.1150 () 0.0635 () 0.0715 ()
Ours(NN) 0.0679 () 0.0326 () 0.0650 ()
Ours(L) 0.0810 () 0.0489 () 0.0276 ()
Table 2: Correctness Comparison (Non-Linear)

We compare the efficiency of Ours(NN) and Ours(L) in Table 3, with confidence intervals. Note that since linear regression returns an incorrect asymptotic efficiency, it is not listed here.

Ours(NN) 0.0214() 0.0593()
Ours(L) 0.0328() 0.0604 ()
Table 3: Efficiency Comparison (Non-Linear)

6 Conclusion

In this paper, we propose a new estimator for the linear coefficients that works well in both linear and non-linear cases. Unlike traditional statistical inference methods, machine learning models are introduced into the significance test process. For future work, our framework may be extended to more general statistical inference scenarios (e.g. high-dimensional settings), and it will be interesting to study how machine learning models affect the efficiency of our estimator.

Broader Impact

Compared with the traditional significance test, our new methods output a more precise significance level (or p-value) when linear assumptions do not hold. Moreover, with small efficiency loss, one can better extract the relationship between explanatory variables and response variables. Therefore, our estimator might be a better tool for the significance test.

We are grateful to Yang Bai, Chenwei Wu for helpful comments on an early draft of this paper. This work has been partially supported by Shanghai Qi Zhi Institute, Zhongguancun Haihua Institute for Frontier Information Technology, the Institute for Guo Qiang Tsinghua University (2019GQG1002), and Beijing Academy of Artificial Intelligence.

References

  • F. Bachoc, D. Preinerstorfer, L. Steinberger, et al. (2020) Uniformly valid confidence intervals post-model-selection. The Annals of Statistics 48 (1), pp. 440–463. Cited by: §1, §2, §5.2.
  • R. F. Barber, E. J. Candes, A. Ramdas, and R. J. Tibshirani (2019) Conformal prediction under covariate shift. arXiv: Methodology. Cited by: §2.
  • R. Berk, L. Brown, A. Buja, K. Zhang, L. Zhao, et al. (2013) Valid post-selection inference. The Annals of Statistics 41 (2), pp. 802–837. Cited by: §1, §2.
  • T. Brambor, W. R. Clark, and M. Golder (2006) Understanding interaction models: improving empirical analyses. Political analysis 14 (1), pp. 63–82. Cited by: §1.
  • P. Bühlmann, S. van de Geer, et al. (2015) High-dimensional inference in misspecified linear models. Electronic Journal of Statistics 9 (1), pp. 1449–1473. Cited by: §2, §5.2.
  • A. Buja, R. A. Berk, L. D. Brown, E. I. George, E. Pitkin, M. Traskin, L. Zhao, and K. Zhang (2015) Models as approximations: a conspiracy of random regressors and model deviations against classical inference in regression. Statistical Science, pp. 1. Cited by: §1, §2.
  • R. J. Casson and L. D. Farmer (2014) Understanding and checking the assumptions of linear regression: a primer for medical researchers. Clinical & experimental ophthalmology 42 (6), pp. 590–596. Cited by: §1.
  • M. Cauchois, S. Gupta, and J. Duchi (2020) Knowing what you know: valid confidence sets in multiclass and multilabel prediction. arXiv e-prints, pp. arXiv:2004.10181. External Links: 2004.10181 Cited by: §2.
  • L. Fahrmeir (1990) Maximum likelihood estimation in misspecified generalized linear models. Statistics 21 (4), pp. 487–502. Cited by: §1.
  • R. J. Friedrich (1982) In defense of multiplicative terms in multiple regression equations. American Journal of Political Science, pp. 797–833. Cited by: §1.
  • P. Grünwald, T. Van Ommen, et al. (2017) Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis 12 (4), pp. 1069–1103. Cited by: §1.
  • J. Hainmueller and C. Hazlett (2014) Kernel regularized least squares: reducing misspecification bias with a flexible and interpretable machine learning approach. Political Analysis 22 (2), pp. 143–168. Cited by: §1, §2.
  • G. King and L. Zeng (2006) The dangers of extreme counterfactuals. Political Analysis 14 (2), pp. 131–159. Cited by: §1.
  • J. Law (2006) Review of "Algorithmic Learning in a Random World" by Vovk, Gammerman and Shafer, Springer, 2005, ISBN: 0-387-00152-2. SIGACT News 37 (4), pp. 38–40. External Links: Link, Document Cited by: §2.
  • J. D. Lee, D. L. Sun, Y. Sun, J. E. Taylor, et al. (2016) Exact post-selection inference, with application to the lasso. The Annals of Statistics 44 (3), pp. 907–927. Cited by: §2, §5.2.
  • J. G. MacKinnon and H. White (1985) Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties. Journal of econometrics 29 (3), pp. 305–325. Cited by: §1, §2, §5.2.
  • A. Markiewicz and S. Puntanen (2019) Linear prediction sufficiency in the misspecified linear model. Communications in Statistics-Theory and Methods, pp. 1–20. Cited by: §1.
  • J. W. Osborne and E. Waters (2002) Four assumptions of multiple regression that researchers should always test. Practical Assessment, Research, and Evaluation 8 (1), pp. 2. Cited by: §1.
  • H. Papadopoulos, V. Vovk, and A. J. Gammerman (2014) Regression conformal prediction with nearest neighbours. CoRR abs/1401.3880. External Links: Link, 1401.3880 Cited by: §2.
  • A. Rinaldo, L. Wasserman, M. G’Sell, and J. Lei (2016) Bootstrapping and sample splitting for high-dimensional, assumption-free inference. arXiv preprint arXiv:1611.05401. Cited by: §2, §5.2.
  • G. Shafer and V. Vovk (2008) A tutorial on conformal prediction. J. Mach. Learn. Res. 9, pp. 371–421. External Links: Link Cited by: §2.
  • J. Taylor, R. Lockhart, R. J. Tibshirani, and R. Tibshirani (2014) Exact post-selection inference for forward stepwise and least angle regression. arXiv preprint arXiv:1401.3889 7, pp. 10–1. Cited by: §2, §5.2.
  • R. Vershynin (2018) High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press. External Links: Document Cited by: §A.4.
  • H. White (1980) Using least squares to approximate unknown regression functions. International Economic Review, pp. 149–170. Cited by: §1, §2, §5.2.
  • G. Zeni, M. Fontana, and S. Vantini (2020) Conformal prediction: a unified review of theory and new challenges. arXiv preprint arXiv:2005.07972. Cited by: §2.

Appendix A Proofs

A.1 The Proof of Lemma 1

This proof is mainly based on Bayesian Estimation, where we have the prior information that .

For a given , we have

The first equation is due to its definition. The second equation is because the probability is zero when . Denote . By setting , we have

We slightly enlarge with , which leads to the result that

The proof is done.

A.2 The Proof of Theorem 1

In this section, we prove Theorem 1. Before the proof, we first present Lemma 3, which explains why we need to split the dataset into a training set and a validation set.

Lemma 3 (Independence Lemma)

If and are independent random variables, and is a fixed function that is independent of and , then is independent of .

Proof 1 (Proof of Lemma 3)

The proof directly follows the definition of independence of random variables.

(2)

where and are the corresponding measurable sets determined by and , respectively. The second equality follows from the independence of and . By definition, is independent of . The proof is done.

The next Corollary 3 is a direct application of Lemma 3.

Corollary 3 (Random Split of Datasets)

Given i.i.d. data randomly split into training data and validation data , if we use to train a machine learning model , then for two independent samples , and the loss of sample , is independent of .

It should be noted that is trained by , thus is independent of samples in the validation set. That is why the dataset needs to be randomly split. Armed with Corollary 3, we can go on to finish the proof.

We split the proof into two parts. In Lemma 4, we give an approximation measure for the machine learning model . In Lemma 5, we prove that the linear approximations of two close functions are also close. In this part, we use MSE to measure how well the machine learning model approximates the ground truth function.

Lemma 4 (Approximation measure for )

Given an i.i.d. dataset split into a training set and a validation set , where . Suppose we use MSE () as the loss, with sample loss denoted and population loss denoted . Then for a given , we have

Namely, , .

Note that Lemma 4 gives a measure for , which is a probabilistic upper bound. The results also show a trade-off between the training set and the validation set: if more data are assigned to the training set, decreases theoretically; if more data are assigned to the validation set, decreases theoretically.

Proof 2 (Proof of Lemma 4)

The key to the proof is Hoeffding's inequality, which bounds the deviation of the average of bounded random variables from its expectation.
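
One standard two-sided form, for i.i.d. random variables $Z_1, \dots, Z_n$ taking values in $[a, b]$ and any $t > 0$, is

$$\mathbb{P}\!\left( \left| \frac{1}{n}\sum_{i=1}^{n} Z_i - \mathbb{E}[Z_1] \right| \ge t \right) \le 2\exp\!\left( -\frac{2 n t^2}{(b-a)^2} \right).$$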

Plugging the bound of the loss into this inequality and setting , it holds that

Finally, notice that

The proof is done.

The next Lemma 5 shows that when the distance between two functions is bounded, the distance between their linear approximations is also bounded.

Lemma 5

Given the oracle model and machine learning model with linear approximations and , respectively, where is a linear function family. Given , if , then

where d is the dimension of .

Proof 3 (Proof of Lemma 5)

First, we would like to represent in a linear form. For simplicity, we temporarily omit the superscript .

Then we can derive an explicit representation of

Therefore, by adding the superscript, we have

(3)

It can be further calculated that

where the first inequality comes from the definition of the matrix norm, the third inequality is due to the Cauchy-Schwarz inequality, and the final equality follows from the bounds of the functions and .

If the eigenvalues of are denoted by , then

That is to say,

(4)

Plugging Equation 4 into ,

Therefore, we have

(5)

Plugging into Equation 4 and Equation 5, we have

(6)

The proof is done.

Theorem 1 is a direct combination of Lemma 4 and Lemma 5.

A.3 The Proof of Corollary 1

Corollary 1 directly follows Theorem 1. From Theorem 1 we know that

where . By Lemma 1, we use the empirical bounds to replace the real bounds; that is to say, we use to replace . This operation incurs an additional loss, and the probability becomes