1 Introduction
Linear regression is commonly used in practice because it can explain which features are important through significance tests, but the required linear assumptions (e.g. linearity and normality) on the ground truth function are easily violated in practice (Osborne and Waters, 2002; Casson and Farmer, 2014). This problem is critical: when linear assumptions are violated, one may ignore important features or focus on unimportant features due to the fake significance level. One potential solution is to consider all kinds of ground truth functions by removing the linear assumption, but this triggers another problem: since there are numerous types of nonlinear functions, how can we learn the feature importance without knowing the exact type of the ground truth function?
The answer is simply to apply linear regression to the unknown nonlinear ground truth functions, i.e. to use misspecified linear models (Fahrmeir, 1990; Hainmueller and Hazlett, 2014; Grünwald et al., 2017; Markiewicz and Puntanen, 2019). Apart from the fake significance level, misspecified linear models can also address other problems of linear regression, such as bias, inefficiency, and incorrect inferences (e.g. King and Zeng (2006)). Indeed, as we will show in Section 5, the traditional significance test based on linear regression fails even for simple nonlinear ground truth functions such as the square function.
The most common approach to misspecification is introducing high-order terms and interactions (e.g. Friedrich (1982); Brambor et al. (2006)), but this only works for the prescribed types and usually cannot find the correct functional form. Another line of work (White, 1980; Berk et al., 2013; MacKinnon and White, 1985; Buja et al., 2015; Bachoc et al., 2020) performs the significance test directly on the least squares estimation and derives consistent estimators of its variance. The downside of this approach is that the corresponding estimators contain inevitable system errors and bias due to wrong model selections (see Section 5.2).
In this work, we introduce machine learning methods into the misspecified linear models, where we do not need to know the correct functional form and also effectively avoid system errors. We first use a machine learning method to fit the ground truth function in the training step and estimate the corresponding linear approximation. Afterward, we correct the mistakes made by the machine learning methods in the validation step. We show a positive correlation between the performance of the underlying machine learning method and the performance of our new estimator (see Theorem 1). Moreover, we prove the concentration inequalities (see Theorem 2) and asymptotic properties (see Theorem 3) of the newly proposed estimator, which can be further applied to the significance test.
Several experiments are conducted to show that this newly proposed estimator works well in both nonlinear and linear scenarios. In particular, our newly proposed estimator significantly outperforms the traditional linear regression (see Table 2) in terms of the Kolmogorov-Smirnov statistic in the nonlinear scenarios (square function). This indicates that we make fewer mistakes in the significance test. For example, as we will show in Section 5.3, in the nonlinear scenario, our method makes mistakes with probability , while for the traditional linear regression, the number is .
2 Related Work
The research on misspecified linear models can be broadly divided into Conformal Prediction, which focuses on the inference of predictions, and Parameter Inference, which focuses on the inference of the linear approximation parameter of the ground truth function.
Conformal Prediction is a framework pioneered by Law (2006), which uses past experience to determine precise levels of confidence in new predictions. Conformal prediction (Shafer and Vovk, 2008; Papadopoulos et al., 2014; Barber et al., 2019; Cauchois et al., 2020; Zeni et al., 2020) mainly focuses on the confidence interval for predictions, so it cannot provide explanations for feature importance. Our work can be regarded as a parallel line of conformal prediction that focuses on the assumption-free parameter estimation confidence interval.
Parameter Inference can be dated back to White (1980); MacKinnon and White (1985), where sandwich-type estimators for variance are proposed. Furthermore, Buja et al. (2015) introduce M-of-N Bootstrap techniques to improve the estimation of variance. Hainmueller and Hazlett (2014) reduce this misspecification bias from a kernel-based point of view. Some other techniques, e.g. LASSO (Lee et al., 2016) and least angle regression (Taylor et al., 2014), are introduced in post-selection inference. Some works (Rinaldo et al., 2016; Bühlmann et al., 2015) focus on the high-dimensional setting. More discussions can be found in Berk et al. (2013); Bachoc et al. (2020). However, this line of work relies on the direct misspecification of linear models, which means system errors are inevitable when the ground truth function is nonlinear. Furthermore, this type of estimator based on the least squares estimation contains considerable bias, which will be further discussed in Section 5.2. In this paper, we use a machine learning based estimator instead of least squares estimation, which contains less bias as we will see later in Section 5.2.
3 Preliminaries
In this section, we define the basic notations, starting from the definition of function norm and function distance.
Definition 1 (Function Norm and Function Distance)
Given a functional family defined on domain , for any and a probability distribution on with density , the function norm of with respect to is defined as
When the context is clear, we simply use instead. Moreover, the function distance between and is defined as
Based on function distance, we can define the least square linear approximation, or simply linear approximation.
Definition 2 (Linear Approximation)
For a given function , its least square linear approximation is defined as
where is the linear functional family.
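As an illustration of Definition 2, the following minimal sketch approximates the least squares linear approximation of a square function by Monte Carlo. It assumes that the function norm is the usual L2 norm under the input distribution and that the linear family includes an intercept; the helper `linear_approximation`, the normal input distribution, and the sample size are hypothetical choices made only for this example.

```python
import random


def linear_approximation(f, xs):
    """Least squares fit of f on sample points xs; returns (intercept, slope)."""
    n = len(xs)
    ys = [f(x) for x in xs]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


rng = random.Random(0)
mu, sigma = 1.0, 1.0  # hypothetical input distribution N(mu, sigma^2)
xs = [rng.gauss(mu, sigma) for _ in range(200_000)]

intercept, slope = linear_approximation(lambda x: x * x, xs)

# For x ~ N(mu, sigma^2) and f(x) = x^2, the population solution is
# slope = 2 * mu and intercept = sigma^2 - mu^2.
print("Monte Carlo :", intercept, slope)
print("Closed form :", sigma ** 2 - mu ** 2, 2 * mu)
```

The closed form in the final comment follows from Cov(x, x^2) = 2 mu sigma^2 for a normal input, so the sketch doubles as a sanity check that the linear approximation of a nonlinear function is well defined even though the function itself is not linear.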
The traditional linear regression uses a single dataset to compute the parameters, but our method splits the dataset into two parts, a training set, and a validation set, as defined below.
Definition 3 (Training and Validation Set)
Given a dataset , we randomly split into training set and validation set , where , , and , .
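The random split in Definition 3 can be sketched as follows. The function name `split_dataset`, the seed argument, and the toy dataset are hypothetical; the only property the sketch aims to show is that the two subsets are disjoint and together cover the labeled data.

```python
import random


def split_dataset(data, n_train, seed=0):
    """Randomly split labeled pairs into disjoint training and validation sets."""
    if not 0 < n_train < len(data):
        raise ValueError("n_train must leave both subsets non-empty")
    shuffled = list(data)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Toy labeled dataset of (x, y) pairs.
data = [(x, x * x) for x in range(10)]
train, val = split_dataset(data, n_train=6)
print(len(train), len(val))
```

Because the model is fitted on the training part only, its losses on validation points do not depend on the samples used for fitting, which is the property the later analysis relies on.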
In some scenarios, we may have additional unlabeled data points, therefore in total data points. Unlabeled data are usually easier to obtain than labeled data, and they can be used to calculate the linear approximation of the machine learning model and to help improve the estimation of . As we will discuss in Section 5, our analysis still applies without unlabeled data, given that the machine learning model has a linear form, and is estimated from only the validation set. More unlabeled data, however, could widen the range of machine learning models we can choose from.
In order to evaluate the performance of our model on the population distribution, we need to estimate the upper and lower bounds of a given function (will be defined later).
Definition 4 (Upper and Lower Bounds)
Given a function defined on , the upper and lower bounds of are
Similarly, given a dataset , where , the empirical lower bound and empirical upper bound of on set are defined by
While and are hard to get, we may assume that is at least loosely bounded.
Assumption 1
Notice that for linear regression, Assumption 1 usually holds, as we may assume that the domain of input is bounded, and the weight is also bounded. After applying proper scaling, we get .
As we will see in Theorem 1, our analysis depends on , and smaller gives more accurate estimations. If Assumption 1 holds, we immediately have . However, with additional prior knowledge on and , we may get tighter bounds of using Bayesian methods, as discussed in Lemma 1.
Lemma 1 (Tighter Estimation on )
Given a validation dataset where data points are randomly sampled from , and a function defined on with bounds and where and are unknown, and . Let , and assume the prior , where represents uniform distribution. For any , if , we have
where are the empirical bounds of on set .
Assumption 2 (Concentration of Explanatory Variables)
Assumption 3 (Bounded Second Moment)
Let be a random vector in , and assume is invertible. Denote and . We assume that
In the following statements, we always use all data, including labeled and unlabeled data, to estimate , leading to the estimator .
4 Estimation
In this section, we study the problem of linear approximation of the oracle model , based on a machine learning framework. Specifically, we show the following: a) how the performance of our machine learning model affects the linear approximation estimator. b) after adding a bias term (we call it the residual term), one can get an estimator with better guarantees. c) how to run the hypothesis test (coefficient significance) based on the asymptotic distribution of our estimator. We defer all the proofs to Appendix A.
4.1 Approach Based on MSE
In this subsection, we study the relationship between the linear approximation functions and given that is close to . We use mean squared error (MSE) to measure the performance of machine learning models.
Theorem 1 (Performance of Machine Learning Models)
For , given oracle model and machine learning model , where their linear approximations are and , respectively. The labeled data is randomly split following Definition 3. Denote the loss function as , the population loss is then . Under Assumption 3, for any , the following inequality holds :
where , is the sample loss defined on validation set.
Remark: Theorem 1 shows that controls the approximation quality of and , which depends on both the validation set size and the validation loss. In other words, if the machine learning model generalizes well, we get good estimations of and .
The term depends on , which is bounded by based on Assumption 1. Although this term shrinks as the validation set grows, below we show that it can be bounded more accurately using Lemma 1.
Corollary 1
Under the assumptions of Lemma 1 and Theorem 1, by replacing in Lemma 1 by loss function and plugging it into Theorem 1, we have
where
Intuitively, using (the best linear approximation of ) to approximate seems to be the optimal choice. However, as we will show below, this is not true, as may contain bias in the linear setting.
4.2 Filling the Bias
In this subsection, we step outside the restrictions of MSE and improve our estimation by adding a bias term. We first present Lemma 2, which focuses on the estimation of the second-moment matrix of explanatory variables .
Lemma 2 (The Second Moment Concentration)
By adding a small bias, we can derive the following Theorem 2, which mainly focuses on the coordinate-wise bound. Here we denote as the feature of , and as the row of matrix .
Theorem 2 (Adding a bias term)
For , given oracle model and machine learning model , where the linear approximation is and , respectively. The labeled data is randomly split based on Definition 3. Assume that Assumption 2 and Assumption 3 hold. Denote . Then, for any , the following inequality holds :
where is the realization of defined on validation set.
Notice that is the total number of data. The first term in the bound is because we use samples in the validation set to estimate the population. A tighter bound requires smaller fluctuation for (which is ) and more samples in validation set (). The second term is because we use to replace . A smaller condition number and a larger help tighten the bound.
Similar to Corollary 1, we can use and to replace and under additional assumptions, see below.
Corollary 2
Therefore, we should use as the new estimator. Recall the bound in Theorem 1 (denoted as ) and the bound in Theorem 2 (denoted as ). We can see that as goes to zero, while , where denotes the average sample loss. This means that cannot be arbitrarily close to given a fixed machine learning model, even if we have infinite data for validation. That is to say, although Theorem 1 uses the frequently used MSE as a measure, it carries some natural bias. Theorem 2 removes this bias by adding a correction term.
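The difference between the plug-in estimator of Section 4.1 and an estimator with a residual correction can be sketched in one dimension. The exact form of the correction term is not reproduced in the text above, so the sketch assumes one plausible reading: the linear approximation of the fitted model plus the projection of the validation residuals onto the features (1, x), with the second-moment matrix estimated from all inputs. The fitted model `f_hat`, the helpers `solve_2x2` and `project`, and all sample sizes are hypothetical.

```python
import random


def solve_2x2(a, b, c, d, u, v):
    """Solve [[a, b], [c, d]] @ [p, q] = [u, v] by Cramer's rule."""
    det = a * d - b * c
    return (u * d - b * v) / det, (a * v - c * u) / det


def project(xs, targets, m1, m2):
    """Return (intercept, slope) = S^{-1} * mean[(1, x) * target],
    where S = [[1, m1], [m1, m2]] is the second-moment matrix of (1, x)."""
    n = len(xs)
    u = sum(targets) / n
    v = sum(x * t for x, t in zip(xs, targets)) / n
    return solve_2x2(1.0, m1, m1, m2, u, v)


rng = random.Random(0)
truth = lambda x: x * x              # noiseless square ground truth
f_hat = lambda x: 0.9 * x * x + 0.2  # hypothetical, deliberately imperfect model

all_x = [rng.gauss(1.0, 1.0) for _ in range(100_000)]  # labeled + unlabeled inputs
val_x = all_x[:20_000]                                  # validation inputs
val_y = [truth(x) for x in val_x]                       # validation labels

# Second moments estimated from all available inputs.
m1 = sum(all_x) / len(all_x)
m2 = sum(x * x for x in all_x) / len(all_x)

# Step 1: linear approximation of the fitted model (plug-in estimate).
plug_in = project(all_x, [f_hat(x) for x in all_x], m1, m2)

# Step 2: residual correction computed on the validation set only.
residuals = [y - f_hat(x) for x, y in zip(val_x, val_y)]
correction = project(val_x, residuals, m1, m2)

corrected = (plug_in[0] + correction[0], plug_in[1] + correction[1])
print("plug-in   (intercept, slope):", plug_in)
print("corrected (intercept, slope):", corrected)
```

Under these assumptions the plug-in estimate stays near the linear approximation of the imperfect model, roughly (0.2, 1.8), no matter how much validation data is available, whereas the corrected estimate moves toward the linear approximation of the ground truth, (0, 2) for this input distribution.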
Furthermore, as is standard practice in statistics, we will derive the asymptotic property of in Section 4.3.
4.3 Asymptotic Properties
In this section, we study the asymptotic property of estimator , which gives us tighter and more practical guarantees. In the following analysis, we assume that . This is without loss of generality because otherwise we can directly use to estimate without any loss.
Theorem 3 (Asymptotic Properties)
Given oracle model and machine learning model , with its corresponding linear approximation , , respectively. The labeled data is randomly split based on Definition 3. Denote , and assume are bounded (this usually holds in practice as long as are all bounded). Then, under Assumption 3, the following asymptotic property of holds:
where represents normal distribution, , , .
Remark: Traditional asymptotic analysis is usually based on assumptions about the ground truth function, but our analysis does not need such assumptions and instead relies on the training-validation framework. For example, in the traditional analysis, the claim that follows normal distribution is based on the assumption that the ground truth function is linear, and also that the label noise follows a well-defined distribution.
Now we can do the hypothesis test based on the asymptotic property of , including model test and coefficient test under significance level . The details can be found in Appendix B.
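Once an estimator is asymptotically normal, a coefficient test reduces to a standard two-sided z-test. The sketch below shows that generic step only. The standard error would come from the variance expression in Theorem 3, which is not reproduced here, so the function `coefficient_z_test` and the numeric inputs are hypothetical placeholders.

```python
from statistics import NormalDist


def coefficient_z_test(estimate, std_error, null_value=0.0, alpha=0.05):
    """Two-sided z-test for a single coefficient under asymptotic normality.

    Returns the z statistic, the p-value, and whether the null is rejected.
    """
    z = (estimate - null_value) / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


# Hypothetical coefficient estimate and standard error.
z, p_value, reject = coefficient_z_test(estimate=0.12, std_error=0.05)
print(z, p_value, reject)
```

The test is only as reliable as the normal approximation behind it, which is why the experiments compare the simulated distribution of the statistic with its theoretical distribution.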
5 Experiments
In this section, we conduct experiments on the significance test derived in Section 4.3. We will show that our method works in both linear and nonlinear scenarios, while the traditional linear regression fails in nonlinear scenarios even in a simple square case. Due to space limitations, we show linear scenarios in Appendix D. More experimental details can be found in Appendix C. For each statistic, we repeat the experiments six times and compute the confidence interval of its mean.
We choose two types of machine learning models: a three-layer Neural Network (labeled as Ours(NN)) and a linear-form model (labeled as Ours(L)), respectively. Note that Ours(L) does not need unlabeled data, while Ours(NN) needs unlabeled data to calculate the linear approximation of machine learning methods.
5.1 Metrics
Two metrics are considered here, focusing on the correctness and efficiency, respectively.
Correctness is shown by the Kolmogorov-Smirnov statistic, which is defined in Equation 1. The Kolmogorov-Smirnov statistic measures how close the simulation results are to the theoretical results. A smaller Kolmogorov-Smirnov statistic is better.
$$D = \sup_x \left| F_n(x) - F(x) \right| \qquad (1)$$
where $F_n$ is the empirical CDF in simulation, and $F$ is the theoretical CDF.
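The Kolmogorov-Smirnov statistic described above can be computed directly from a sample and a theoretical CDF. The following is a minimal sketch in standard-library Python; the function name `ks_statistic` and the simulated samples are hypothetical and are not the data used in the experiments.

```python
import random
from statistics import NormalDist


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF of samples and a theoretical CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x, so both sides
        # of the jump have to be compared with the theoretical value.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


rng = random.Random(0)
target = NormalDist().cdf

matching = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # drawn from the target law
shifted = [s + 0.5 for s in matching]                  # systematically biased copy

print("matching sample:", ks_statistic(matching, target))
print("shifted sample :", ks_statistic(shifted, target))
```

A sample that follows the theoretical distribution gives a value that shrinks as the number of repetitions grows, while a biased or fat-tailed statistic keeps a gap that does not vanish.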
Efficiency is shown by the average standard deviation ( ) of the estimators. Efficiency measures how much uncertainty the new estimators have. Since we have removed the requirements of linear assumptions, more uncertainty appears in our newly proposed method. A smaller means that we have more confidence in our estimators.
5.2 Unbiasedness
First, we show that the traditional estimators based on least squares (LSE) contain more bias (see also Rinaldo et al. (2016)), including the linear regression methods and the estimators proposed in White (1980); MacKinnon and White (1985); Lee et al. (2016); Taylor et al. (2014); Bühlmann et al. (2015); Bachoc et al. (2020), etc.
LSE-based Estimator  |  Ours(NN)  |  Ours(L)  |

0.0048 ()  0.0010 ()  0.0001 ()  
0.0015 ()  0.0048 ()  0.0021 () 
We test a simple square case, where , our aim is to estimate its linear approximation. We repeat the simulation 1000 times, and each time we calculate the mean of the estimators. Note that to show the bias more clearly, we use a smaller dataset (see Appendix C.2). Table 1 shows the difference between simulation results and the theoretical parameter with their confidence interval.
Figure 1 shows that the traditional LSE-based estimators on are biased, where the confidence interval of bias does not contain . Our proposed methods can outperform these LSE-based methods because the newly proposed methods have a smaller bias. For simplicity, we compare our proposed methods with only linear regression on their correctness and efficiency in Section 5.3, since linear regression is the most widely used among these LSE-based methods in practice.
5.3 Nonlinear Scenarios
In this section, we focus on the performance of the linear regression method and our newly proposed method under a nonlinear scenario. We focus on a simple nonlinear ground truth model, which is
with no randomness . Its linear approximation can be theoretically calculated by
Our aim is to estimate . Thus the hypothesis test can be written as
We repeat the simulation 1000 times, each time calculating the statistic, and plot the results in Figure 2. We also plot the theoretical distribution, which helps visualize how far the simulation results are from the theoretical results. Figure 2 shows that traditional linear regression fails even in a simple square case, while our new estimator works well. The figure shows one of the six groups of experiments.
The phenomenon shown in Figure 2 leads to fake significance test results. This fake fat-tailed distribution causes more variables to be incorrectly determined as significant. For instance, when we set significance level as (which means that the parameter can be determined incorrectly with probability around 0.05), is determined incorrectly in linear regression (LSE) with probability , while Ours(L) with probability . We repeat the experiments six times, and the brackets show a confidence interval. The results of Ours(L) make the significance test more accurate.
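The effect of a fat-tailed statistic on the significance test can be illustrated with a small simulation. The sketch below is not the experiment of this section: it simply assumes a statistic that is believed to be standard normal under the null but is in fact overdispersed, and counts how often it crosses the nominal critical value. The function `rejection_rate` and the chosen dispersion are hypothetical.

```python
import random
from statistics import NormalDist


def rejection_rate(null_statistics, alpha=0.05):
    """Fraction of statistics beyond the two-sided N(0, 1) critical value."""
    critical = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    rejected = sum(1 for s in null_statistics if abs(s) > critical)
    return rejected / len(null_statistics)


rng = random.Random(0)
# Statistic that really is N(0, 1) under the null hypothesis.
calibrated = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
# Hypothetical overdispersed statistic that is wrongly treated as N(0, 1).
fat_tailed = [rng.gauss(0.0, 1.5) for _ in range(20_000)]

print("calibrated :", rejection_rate(calibrated))
print("fat-tailed :", rejection_rate(fat_tailed))
```

With a nominal level of 0.05, the calibrated statistic is rejected about 5% of the time, while the overdispersed one is rejected far more often, which is the kind of inflated error rate described above.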
We further show its confidence interval for correctness in Table 2 quantitatively, where our newly proposed estimators work better. More details about Table 2 are shown in Appendix E.
normal ()  normal ()  

Linear Reg  0.1150 ()  0.0635 ()  0.0715 () 
Ours(NN)  0.0679 ()  0.0326 ()  0.0650 () 
Ours(L)  0.0810 ()  0.0489 ()  0.0276 () 
We compare the efficiency of Ours(NN) and Ours(L) in Table 3 with its confidence interval. Note that since linear regression returns a wrong asymptotic efficiency, it is not listed here.
Ours(NN)  0.0214()  0.0593() 

Ours(L)  0.0328()  0.0604 () 
6 Conclusion
In this paper, we propose a new estimator for the linear coefficient that works well in both linear and nonlinear cases. Unlike traditional statistical inference methods, machine learning models are introduced to the significance test process. For future work, our new framework may be extended to the more general statistical inference scenarios (e.g. highdimensional settings), and it will be interesting to show how machine learning models affect the efficiency of our estimator.
Broader Impact
Compared with the traditional significance test, our new methods output a more precise significance level (or p-value) when linear assumptions do not hold. Moreover, with small efficiency loss, one can better extract the relationship between explanatory variables and response variables. Therefore, our estimator might be a better tool for the significance test.
We are grateful to Yang Bai and Chenwei Wu for helpful comments on an early draft of this paper. This work has been partially supported by Shanghai Qi Zhi Institute, Zhongguancun Haihua Institute for Frontier Information Technology, the Institute for Guo Qiang Tsinghua University (2019GQG1002), and Beijing Academy of Artificial Intelligence.
References
Bachoc et al. (2020). Uniformly valid confidence intervals post-model-selection. The Annals of Statistics 48 (1), pp. 440–463.
Barber et al. (2019). Conformal prediction under covariate shift. arXiv: Methodology.
Berk et al. (2013). Valid post-selection inference. The Annals of Statistics 41 (2), pp. 802–837.
Brambor et al. (2006). Understanding interaction models: improving empirical analyses. Political Analysis 14 (1), pp. 63–82.
Bühlmann et al. (2015). High-dimensional inference in misspecified linear models. Electronic Journal of Statistics 9 (1), pp. 1449–1473.
Buja et al. (2015). Models as approximations - a conspiracy of random regressors and model deviations against classical inference in regression. Statistical Science, pp. 1.
Casson and Farmer (2014). Understanding and checking the assumptions of linear regression: a primer for medical researchers. Clinical & Experimental Ophthalmology 42 (6), pp. 590–596.
Cauchois et al. (2020). Knowing what you know: valid confidence sets in multiclass and multilabel prediction. arXiv e-prints, arXiv:2004.10181.
Fahrmeir (1990). Maximum likelihood estimation in misspecified generalized linear models. Statistics 21 (4), pp. 487–502.
Friedrich (1982). In defense of multiplicative terms in multiple regression equations. American Journal of Political Science, pp. 797–833.
Grünwald et al. (2017). Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis 12 (4), pp. 1069–1103.
Hainmueller and Hazlett (2014). Kernel regularized least squares: reducing misspecification bias with a flexible and interpretable machine learning approach. Political Analysis 22 (2), pp. 143–168.
King and Zeng (2006). The dangers of extreme counterfactuals. Political Analysis 14 (2), pp. 131–159.
Law (2006). Review of "Algorithmic learning in a random world by Vovk, Gammerman and Shafer", Springer, 2005, ISBN: 0387001522. SIGACT News 37 (4), pp. 38–40.
Lee et al. (2016). Exact post-selection inference, with application to the lasso. The Annals of Statistics 44 (3), pp. 907–927.
MacKinnon and White (1985). Some heteroskedasticity-consistent covariance matrix estimators with improved finite sample properties. Journal of Econometrics 29 (3), pp. 305–325.
Markiewicz and Puntanen (2019). Linear prediction sufficiency in the misspecified linear model. Communications in Statistics - Theory and Methods, pp. 1–20.
Osborne and Waters (2002). Four assumptions of multiple regression that researchers should always test. Practical Assessment, Research, and Evaluation 8 (1), pp. 2.
Papadopoulos et al. (2014). Regression conformal prediction with nearest neighbours. CoRR abs/1401.3880.
Rinaldo et al. (2016). Bootstrapping and sample splitting for high-dimensional, assumption-free inference. arXiv preprint arXiv:1611.05401.
Shafer and Vovk (2008). A tutorial on conformal prediction. J. Mach. Learn. Res. 9, pp. 371–421.
Taylor et al. (2014). Exact post-selection inference for forward stepwise and least angle regression. arXiv preprint arXiv:1401.3889 7, pp. 10–1.
Vershynin (2018). High-dimensional probability: an introduction with applications in data science. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.
White (1980). Using least squares to approximate unknown regression functions. International Economic Review, pp. 149–170.
Zeni et al. (2020). Conformal prediction: a unified review of theory and new challenges. arXiv preprint arXiv:2005.07972.
Appendix A Proofs
A.1 The Proof of Lemma 1
This proof is mainly based on Bayesian Estimation, where we have the prior information that .
For a given , we have
The first equation is due to its definition. The second equation is because the probability is zero when . Denote . By setting , we have
We slightly enlarge with , which leads to the result that
The proof is done.
A.2 The Proof of Theorem 1
In this section, we will prove Theorem 1. Before the proof, we propose Lemma 3 first, which focuses on why we need to split the datasets into the training set and the validation set.
Lemma 3 (Independence Lemma)
If and are independent random variables, and is a fixed function which is independent of and , then is independent of .
Proof 1 (Proof of Lemma 3)
The proof directly follows the definition of independence of random variables.
(2)  
where , are the corresponding measurable sets, which are determined by and , . The second equality follows from the independence of and . By definition, is independent of . The proof is done.
Corollary 3 (Random Split of Datasets)
Given i.i.d. data which is randomly split into training data and test data . If we use to train a machine learning model , then for two independent samples , , and loss of sample , is independent of .
It should be noted that is trained by , thus is independent of samples in the validation set. That is why the dataset needs to be randomly split. Armed with Corollary 3, we can go on to finish the proof.
We split the proof into two parts. In Lemma 4, we give an approximation measure for the machine learning model . In Lemma 5, we will prove that the linear approximations of two close functions are also close. In this part, we use MSE to measure how the machine learning model approaches ground truth functions.
Lemma 4 (Approximation measure for )
Given an i.i.d. dataset which is split into training set and validation set , where . Suppose we use MSE () as our loss, and the sample loss is denoted as , the population loss is denoted as . Then for a given , we have
Namely, , .
Note that Lemma 4 gives an measure for , which is a probabilistic upper bound. And the results also show a tradeoff between training set and validation set. If more data are split into training set, decreases theoretically. If more data are split into validation set, decreases theoretically.
Proof 2 (Proof of Lemma 4)
The key of the proof is Hoeffding’s inequality, which states that given a series of bounded random variables , and , if , then for any , we have
Plug the bound of loss into this inequality. By setting , it holds that
Finally, notice that
The proof is done.
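The way Hoeffding's inequality turns a validation loss into a probabilistic bound on the population loss can be sketched numerically. The sketch uses the standard one-sided form of the inequality for the mean of n independent variables bounded in an interval; whether the lemma above uses the one-sided or two-sided form, and which loss bounds it plugs in, is not recoverable from the text, so the function `hoeffding_deviation` and its inputs are assumptions.

```python
import math


def hoeffding_deviation(n, delta, lower=0.0, upper=1.0):
    """Deviation t such that exp(-2 * n * t^2 / (upper - lower)^2) = delta.

    With probability at least 1 - delta, the mean of n independent variables
    bounded in [lower, upper] exceeds its expectation by less than t.
    """
    return (upper - lower) * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


# Hypothetical setting: losses bounded in [0, 1], 1000 validation points.
t = hoeffding_deviation(n=1000, delta=0.05)
print("deviation at n = 1000:", t)
print("deviation at n = 4000:", hoeffding_deviation(n=4000, delta=0.05))
```

The deviation shrinks at rate one over the square root of the validation set size, which is one side of the trade-off between training and validation data noted after Lemma 4.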
The next Lemma 5 shows that when the distance of two functions is bounded, the distance of their linear approximation is also bounded.
Lemma 5
Given oracle model and machine learning model , where their linear approximation is , , respectively, where is a linear function family. Given , if , then
where d is the dimension of .
Proof 3 (Proof of Lemma 5)
First, we would like to represent in a linear form. For simplification, we omit the superscript for a while.
Then we can derive an explicit representation of
Therefore, by adding the superscript, we have
(3) 
It can be further calculated that
where the first inequality comes from the definition of matrix norm, the third inequality is due to the Cauchy-Schwarz inequality, and the final equality is by the bound of functions and .
That is to say,
(4) 
Plug Equation 4 into ,
Therefore, we have
(5) 
The proof is done.