Homeostasis phenomenon in predictive inference when using a wrong learning model: a tale of random split of data into training and test sets

This note uses a conformal prediction procedure to provide further support for several points discussed by Professor Efron (Efron, 2020) concerning prediction, estimation and the IID assumption. It aims to convey the following messages: (1) Under the IID assumption (e.g., a random split of data into training and testing sets), prediction is indeed an easier task than estimation, since prediction has a 'homeostasis property' in this case: even if the model used for learning is completely wrong, the prediction results remain valid. (2) If the IID assumption is violated (e.g., a targeted prediction on specific individuals), the homeostasis property is often disrupted and the prediction results under a wrong model are usually invalid. (3) Better model estimation typically leads to more accurate prediction in both IID and non-IID cases. Good modeling and estimation practices are important and often crucial for obtaining good prediction results. The discussion also provides one explanation of why deep learning methods work so well in academic exercises (with experiments set up by randomly splitting the entire data into training and testing data sets), but fail to deliver many 'killer applications' in the real world.

1 Introduction

This outstanding paper by Professor Efron [6] provides stimulating discussions on the future of our field in the remainder of the 21st Century. In this note, we echo and provide additional support for two important points made by Professor Efron: (1) prediction is “an easier task than either attribution or estimation”; (2) the IID assumption (on both training and testing data sets) is crucial in the current developments on prediction, but we also need to do more for the case when the IID assumption is not met. Based on our own research, we provide additional evidence to support these discussions, including an explanation of why prediction has a homeostasis property and works well under the IID setting even if the learning model used is completely wrong. We specifically highlight the importance of having a good learning model with good estimation to obtain a good prediction. We provide examples to show that, for the task of prediction, good modeling and inference practice is important in the IID case and becomes essential in the non-IID case. The message remains: to get a good prediction outcome, we still need to make an effort to build a good learning model and estimation algorithm, even if prediction sometimes appears to be an easier task than estimation.

From the outset, we would like to note that it is not a straw-man argument to consider non-IID testing data. On the contrary, such data are prevalent in data science. In addition to the examples provided by Professor Efron that showed “drift,” we can easily imagine non-IID examples in many typical applications. For instance, a predictive algorithm is trained on a database of patient medical records and we would like to predict potential outcomes of a treatment for a new patient with more severe symptoms than the average patient shows. The new patient with more severe symptoms is not a typical IID draw from the general patient population. Similarly, in the finance sector, one is often interested in predicting the financial performance of a particular company or group. If a predictive model is trained on all institutions, then the testing data (of the specific group of companies of interest) are unlikely to be IID draws from the same general population as the learning data. The limitation of the IID assumption has hampered our efforts to take full advantage of fast-developing machine learning methodologies (e.g., deep neural network models, tree-based methods, etc.) in many real-world applications, a point that we elaborate on later.

Our discussions in this note are based on a so-called conformal prediction procedure, an attractive new prediction framework that is error (or model) distribution free; cf., e.g., [12, 10, 8, 1, 2]. We demonstrate that, under the IID assumption, the predictive conclusion is always valid even if the model used to train the data is completely wrong. We discover a homeostasis phenomenon that the prediction is resistant to wrong learning models in the IID case because the expected bias caused by learning using a wrong model can largely be offset by the corresponding negatively shifted predictive errors (cf., Sections 2.3 and 3.1). This robustness result clearly supports the claim that prediction is an easier task than modeling and estimation. However, the use of a wrong learning model has at least two undesirable impacts on prediction: (a) A prediction based on a wrong model typically produces a much wider predictive interval (or a wider predictive curve) than that based on a correct model; (b) Although the IID case enjoys a nice homeostatic cancellation of bias (in fitted model) and shifts (in associated predictive errors) when using a wrong learning model, in the non-IID case this cancellation is often no longer effective, resulting in invalid predictions. The use of a correct learning model can help mitigate and sometimes solve the problem of invalid prediction for non-IID (e.g., drifted or individual-specific) testing data.

The rest of the note is arranged as follows. Section 2 describes the conformal predictive inference in general terms. The prediction is valid under the IID setting, even if the learning model used is completely wrong. Section 3 contains two case studies, one on linear regression and the other on neural network model, to study the impact of using a wrong learning model on prediction under both IID and non-IID settings. Concluding remarks are in Section 4.

2 Prediction, testing data and learning models

As in equation (6.4) of Professor Efron’s article, we consider a typical setup in data science: Suppose we have a training (observed) data set of size $n$: $\{(x_i, y_i), i = 1, \dots, n\}$, where $(x_i, y_i)$, $i = 1, \dots, n$, are IID random samples from an unknown population. For a given $x_{\text{new}}$, we would like to predict what $y_{\text{new}}$ would be. We first use the typical assumption that $(x_{\text{new}}, y_{\text{new}})$ is also an IID draw from the same population. Later we relax this requirement and only assume that $y_{\text{new}}$ relates to $x_{\text{new}}$ the same way as $y_i$ relates to $x_i$, but $x_{\text{new}}$ is fixed or follows a marginal distribution that is different from that of $x_i$.

For notational convenience, we consider $(x_{\text{new}}, y_{\text{new}})$ as the $(n+1)$-th observation and introduce the index $n+1$, with $x_{n+1} = x_{\text{new}}$ and $y_{n+1}$ as a potential value (or a “guess”) of the unobserved $y_{\text{new}}$. Unless specified otherwise, the index “$n+1$” and the index “new” are interchangeable throughout the note.

2.1 Conformal prediction and level (1−α) conformal predictive intervals

A conformal prediction procedure is a distribution-free prediction method that has attracted increasing attention in computer science and statistical learning communities in recent years; cf., e.g., [12, 10, 8, 1, 2]. The idea of conformal prediction is straightforward. In order to make a prediction of the unknown $y_{\text{new}}$ given $x_{\text{new}}$, we examine a potential value $y_{n+1}$, and see how “conformal” the pair $(x_{n+1}, y_{n+1})$ is among the observed pairs of IID data points $(x_i, y_i)$, $i = 1, \dots, n$. The higher the “conformality,” the more likely $y_{\text{new}}$ takes the potential value $y_{n+1}$. Frequently, a learning model, say $\mu(\cdot)$ for the relation between $x$ and $y$, is used to assist prediction. However, the learning model is not essential. As we will see later, even if $\mu(\cdot)$ is totally wrong or does not exist, a conformal prediction can still provide us with valid prediction, as long as the IID assumption holds for both the training and testing data, i.e., $(x_i, y_i)$, for $i = 1, \dots, n$ and new, are IID draws from the same population.

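To be specific, this note employs a conformal prediction procedure that is referred to as the Jackknife-plus method by [1]. Consider a combined collection of both the training and testing data but with the unknown $y_{\text{new}}$ replaced by a potential value $y_{n+1}$: $\mathcal{A} = \{(x_i, y_i), i = 1, \dots, n+1\}$. We define conformal residuals $R_{i,j} = y_i - \hat y_i^{-(i,j)}$ for $i \neq j$, where $\hat y_i^{-(i,j)}$ is the prediction of $y_i$ based on the leave-two-out dataset $\mathcal{A}^{-(i,j)} = \mathcal{A} \setminus \{(x_i, y_i), (x_j, y_j)\}$. If a working model $\mu(\cdot)$ is used, for instance, the model is first fit based on the leave-two-out dataset $\mathcal{A}^{-(i,j)}$ and the point prediction is set to be $\hat y_s^{-(i,j)} = \hat\mu(x_s; \mathcal{A}^{-(i,j)})$, $s = i$ or $j$, where $\hat\mu(\cdot\,; \mathcal{A}^{-(i,j)})$ is the fitted (trained) model using $\mathcal{A}^{-(i,j)}$.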

For each given $y_{n+1}$ (a potential value of $y_{\text{new}}$), we define

$$ Q_n(y_{n+1}) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{R_{n+1,i} \ge R_{i,n+1}\}, \tag{1} $$

which relates to the degree of “conformity” of the residual values $R_{n+1,i}$ among the residuals $R_{i,n+1}$ (which in fact are the leave-one-out residuals obtained using only the training data), $i = 1, \dots, n$. If $Q_n(y_{n+1}) \approx 1/2$, then the residual of $y_{n+1}$ is around the middle of the training data residuals and thus “most conformal.” When $Q_n(y_{n+1}) \approx 0$ or $Q_n(y_{n+1}) \approx 1$, it is at the extreme ends of the training data residuals and thus “least conformal.” This intuition leads us to define a conformal predictive interval of $y_{\text{new}}$ as

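$$ C_\alpha = \left\{y: Q_n(y) \ge \tfrac{\alpha}{2}\right\} \bigcap \left\{y: 1 - Q_n(y) \ge \tfrac{\alpha}{2}\right\} = \left[\, q_{\alpha/2}\!\left(\left\{\hat y_{n+1}^{-(i,n+1)} + R_{i,n+1}\right\}_{i=1}^{n}\right),\; q_{1-\alpha/2}\!\left(\left\{\hat y_{n+1}^{-(i,n+1)} + R_{i,n+1}\right\}_{i=1}^{n}\right) \right], \tag{2} $$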

where $q_{\gamma}(\{a_i\}_{i=1}^{n})$ is the $\gamma$-th quantile of $a_1, \dots, a_n$. The predictive interval (2) is a slight variant of the Jackknife-plus predictive interval proposed by [1].
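
The construction in (1) and (2) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the callback `fit_predict`, the helper `least_squares`, the simulated data and the use of NumPy's default (linearly interpolated) quantile are all assumptions made here for illustration. Because the new point's response is unknown, leaving out the pair $(i, n+1)$ amounts to refitting on the training data without observation $i$.

```python
import numpy as np

def jackknife_plus_interval(X, y, x_new, fit_predict, alpha=0.1):
    """Signed-residual Jackknife-plus interval, following the form of (2).

    fit_predict(X_train, y_train, X_eval) -> predictions at X_eval
    (a hypothetical callback standing in for any learning model).
    """
    n = len(y)
    centers = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                    # drop (x_i, y_i)
        preds = fit_predict(X[keep], y[keep], np.vstack([X[i], x_new]))
        resid_i = y[i] - preds[0]                   # R_{i,n+1}
        centers[i] = preds[1] + resid_i             # y-hat_{n+1} + R_{i,n+1}
    lo, hi = np.quantile(centers, [alpha / 2, 1 - alpha / 2])
    return lo, hi

def least_squares(X_train, y_train, X_eval):
    """Ordinary least squares with an intercept (illustrative model)."""
    add1 = lambda M: np.column_stack([np.ones(len(M)), M])
    beta, *_ = np.linalg.lstsq(add1(X_train), y_train, rcond=None)
    return add1(X_eval) @ beta

# Simulated data (assumed values, not taken from the note).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=200)
x_new = np.array([0.5, 0.5])
lo, hi = jackknife_plus_interval(X, y, x_new, least_squares, alpha=0.1)
print(lo, hi)
```

Any other learner can be plugged in through `fit_predict`; the interval is always formed from quantiles of the $n$ shifted predictions.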

The following proposition states that, under the IID assumption, $C_\alpha$ defined in (2) is guaranteed to be a level-$(1-2\alpha)$ predictive set for $y_{\text{new}}$. We outline a proof of the proposition in the Supplementary. The proposition and proof are almost the same as those provided in Theorem 1 of [1], except that absolute residuals are used throughout their development.

Proposition 1.

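Suppose $(x_i, y_i)$, for $i = 1, \dots, n$ and new, are IID draws from the same population. Then, we have $P(y_{\text{new}} \in C_\alpha) \ge 1 - 2\alpha$.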

Proposition 1 is proved for a finite $n$, with a (conservatively) guaranteed coverage rate of $1 - 2\alpha$. As [1] pointed out, empirically $C_\alpha$ has a typical coverage rate of $1 - \alpha$. In the rest of the note, we treat $C_\alpha$ as an approximate level-$(1-\alpha)$ predictive interval.

A striking result is that Proposition 1 holds even if the learning model used to assist prediction is completely wrong, as long as $R_{i,j}$ and $R_{j,i}$, for any two pairs $(x_i, y_i)$ and $(x_j, y_j)$, $i \neq j$, $i, j = 1, \dots, n+1$, maintain “symmetry” or “exchangeability” (when shuffling indices) due to the IID assumption. This amazingly robust property is highly touted in the machine learning community. It gives support to the sentiment of using “black box” algorithms where the role of model fitting is reduced to an afterthought, although we will provide arguments to counter this sentiment later in the note.
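
A quick Monte Carlo check illustrates this robustness under the IID setting. The sketch below is an assumption-laden illustration: the data-generating model and all constants are chosen here, not taken from the note. It uses the crudest possible working model, an intercept only. For that model the leave-one-out mean cancels in $\hat y_{n+1}^{-(i,n+1)} + R_{i,n+1}$, which reduces to $y_i$, so the interval is simply a pair of empirical quantiles of the training responses.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps, alpha = 100, 2000, 0.1
hits = 0
for _ in range(reps):
    # True model depends on two covariates (assumed coefficients).
    z, w = rng.normal(size=(2, n + 1))
    y = 1.0 + 2.0 * z + 2.0 * w + rng.normal(size=n + 1)
    y_train, y_new = y[:n], y[n]          # the new pair is an IID draw
    # Intercept-only working model: the Jackknife-plus centers equal y_i.
    lo, hi = np.quantile(y_train, [alpha / 2, 1 - alpha / 2])
    hits += lo <= y_new <= hi
coverage = hits / reps
print(coverage)   # close to the nominal 1 - alpha, despite the wrong model
```

The price of ignoring both covariates shows up in the interval length rather than in the coverage.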

2.2 Conformal predictive distribution and predictive curve

To get a full picture of the prediction intervals at all significance levels (as we present later in Figures 2 and 4), we would like to briefly describe the notions of predictive distribution (cf., [7, 11, 13]) and predictive curve. Predictive distribution in Bayesian inference is well known, but the development of predictive distribution with confidence interpretation is relatively new; cf., [7, 11]. Note that a predictive interval has the same frequency interpretation as a confidence interval, except that it is developed for a random $y_{\text{new}}$ instead of a parameter of interest. Similarly, a predictive distribution (with a confidence interpretation) can be viewed as an extension of a confidence distribution but developed for the random $y_{\text{new}}$ instead of for a parameter of interest.

[4] suggested that a confidence distribution be introduced “in terms of the set of confidence intervals of all levels”. To better understand the concept of predictive distribution and predictive curve, especially how to relate them to predictive intervals of all levels, it is prudent to briefly take a look at confidence distribution and confidence curve, and then move on to prediction. We consider a toy example below.

Example 1.

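Assume in this toy example that $y_1, \dots, y_n \overset{iid}{\sim} N(\theta, 1)$. Instead of using a point (e.g., $\bar y$) or an interval (e.g., $\bar y \pm 1.96/\sqrt{n}$) to estimate the unknown parameter $\theta$, a confidence distribution suggests using a sample-dependent function $N(\bar y, 1/n)$, or more formally in the cumulative distribution function form $H_n(\theta) = \Phi(\sqrt{n}(\theta - \bar y))$, to estimate the unknown parameter $\theta$; cf., e.g., [5, 14, 9]. A nice feature of a confidence distribution is that it can represent confidence intervals of all levels. For example, the level-$(1-\alpha)$ one-sided interval is $(-\infty, H_n^{-1}(1-\alpha)]$ and the level-$(1-\alpha)$ two-sided interval is $[H_n^{-1}(\alpha/2), H_n^{-1}(1-\alpha/2)]$. Here, $H_n^{-1}(\cdot)$ is the inverse function of $H_n(\cdot)$.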

A closely related concept is confidence curve

$$ CV_n(\theta) = 2 \min\{H_n(\theta),\, 1 - H_n(\theta)\}, $$

which was first introduced by [3] as an “omnibus form of estimation” that “incorporates confidence limits and intervals at all levels.” For any $\alpha \in (0, 1)$, $\{\theta: CV_n(\theta) \ge \alpha\}$ is a level-$(1-\alpha)$ two-sided confidence interval. We could view the function $CV_n(\theta)$ as a result of stacking up two-sided confidence intervals of all levels for $\alpha$ going from $0$ to $1$; cf., Figure 1 (a). The plot of the confidence curve function provides a full picture of confidence intervals of all levels, with a peak point corresponding to a median unbiased estimator $\hat\theta = \bar y$ with $H_n(\hat\theta) = 1/2$ and $CV_n(\hat\theta) = 1$.

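For a new sample $y_{\text{new}} \sim N(\theta, 1)$, a predictive distribution is $N(\bar y, 1 + 1/n)$, or in its cumulative distribution function form $Q_n(y) = \Phi\big((y - \bar y)/\sqrt{1 + 1/n}\big)$. Parallel to the confidence curve, we can define a predictive curve

$$ PV_n(y) = 2 \min\{Q_n(y),\, 1 - Q_n(y)\} = 2 \min\left\{ \Phi\!\left(\frac{y - \bar y}{\sqrt{1 + 1/n}}\right),\; 1 - \Phi\!\left(\frac{y - \bar y}{\sqrt{1 + 1/n}}\right) \right\}. \tag{3} $$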

Figure 1 (b) is a plot of the predictive curve in (3). Again, we can view the function $PV_n(y)$ as a result of stacking up two-sided predictive intervals of all levels for $\alpha$ going from $0$ to $1$. The plot of the predictive curve provides a full picture of predictive intervals of all levels. The peak point in Figure 1(b) corresponds to a median unbiased point predictor $\hat y_{\text{new}} = \bar y$ with $Q_n(\hat y_{\text{new}}) = 1/2$ and $PV_n(\hat y_{\text{new}}) = 1$.
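
The predictive curve (3) is easy to evaluate numerically. The sketch below uses only the Python standard library; the function names and the values of $\bar y$, $n$ and $\alpha$ are illustrative assumptions. It checks that the curve peaks at $\bar y$ and equals $\alpha$ at the endpoints of the level-$(1-\alpha)$ predictive interval.

```python
import math
from statistics import NormalDist

def predictive_curve(y, ybar, n):
    """PV_n(y) in (3) for the normal toy example with unit variance."""
    q = NormalDist().cdf((y - ybar) / math.sqrt(1 + 1 / n))
    return 2 * min(q, 1 - q)

def predictive_interval(ybar, n, alpha):
    """Level-(1 - alpha) interval {y : PV_n(y) >= alpha}, in closed form."""
    half = NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(1 + 1 / n)
    return ybar - half, ybar + half

ybar, n, alpha = 0.3, 25, 0.1          # assumed values for illustration
lo, hi = predictive_interval(ybar, n, alpha)
print(predictive_curve(ybar, ybar, n))                       # peak of the curve
print(predictive_curve(lo, ybar, n), predictive_curve(hi, ybar, n))
```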

Back to our conformal prediction development, the function $Q_n(\cdot)$ defined in (1) is in essence a predictive distribution of $y_{\text{new}}$. The associated predictive curve for $y_{\text{new}}$ can then be defined as

$$ PV_n(y) = 2 \min\{Q_n(y),\, 1 - Q_n(y)\}. $$

The predictive interval in (2) is $C_\alpha = \{y: PV_n(y) \ge \alpha\}$. We later plot our predictive curves in Figures 2 and 4, which provide a full picture of conformal predictive intervals of all levels in various setups.
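
Since $R_{n+1,i} \ge R_{i,n+1}$ is equivalent to $y \ge \hat y_{n+1}^{-(i,n+1)} + R_{i,n+1}$, the function $Q_n$ in (1) is the empirical distribution function of the $n$ shifted predictions. A minimal sketch of the resulting conformal predictive curve follows; the array `centers` holds hypothetical values of $\hat y_{n+1}^{-(i,n+1)} + R_{i,n+1}$ chosen only for illustration.

```python
import numpy as np

def predictive_distribution(centers, y):
    """Q_n(y): fraction of centers c_i = y-hat_{n+1} + R_{i,n+1} with c_i <= y."""
    return float(np.mean(centers <= y))

def predictive_curve(centers, y):
    """PV_n(y) = 2 * min{Q_n(y), 1 - Q_n(y)}."""
    q = predictive_distribution(centers, y)
    return 2 * min(q, 1 - q)

centers = np.arange(1.0, 11.0)      # hypothetical centers 1, 2, ..., 10
for y in (0.5, 2.0, 5.0, 10.0):
    print(y, predictive_distribution(centers, y), predictive_curve(centers, y))
```

Evaluating `predictive_curve` on a grid of candidate values and keeping those with $PV_n(y) \ge \alpha$ traces out the interval at any level.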

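Note that, in Example 1, $H_n(\theta_0)$ is the $p$-value for the one-sided test $K_0: \theta \le \theta_0$ versus $K_1: \theta > \theta_0$ and $CV_n(\theta_0)$ is the $p$-value for the two-sided test $K_0: \theta = \theta_0$ versus $K_1: \theta \neq \theta_0$; cf., e.g., [14, 9]. Thus, $H_n(\cdot)$ and $CV_n(\cdot)$ can be interpreted as the $p$-value functions of one-sided and two-sided tests, respectively. Similarly, the predictive function $Q_n(y)$ and the predictive curve $PV_n(y)$ have the corresponding interpretation as $p$-value functions of the one-sided test $K_0: y_{\text{new}} \le y$ versus $K_1: y_{\text{new}} > y$ and the two-sided test $K_0: y_{\text{new}} = y$ versus $K_1: y_{\text{new}} \neq y$, respectively.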

2.3 Validity vs efficiency and IID vs non-IID under a wrong learning model

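Although the validity result in Proposition 1 is robust against wrong learning models under the IID setting, there is no free lunch. The predictive intervals obtained under a wrong model will typically be wider. For instance, suppose that the true model is $y = \mu_0(x) + \epsilon$, but a wrong model $y = \mu_1(x) + e$ is used. Since $e = \epsilon + \{\mu_0(x) - \mu_1(x)\}$, we have $\mathrm{var}(e) = \mathrm{var}(\epsilon) + \mathrm{var}\{\mu_0(x) - \mu_1(x)\} + 2\,\mathrm{cov}\{\epsilon, \mu_0(x) - \mu_1(x)\}$. So, $\mathrm{var}(e) \ge \mathrm{var}(\epsilon)$ when $\epsilon$ is independent of $x$, and the equality holds only when $\mu_0(x) - \mu_1(x)$ is (almost surely) constant. Thus, the error term $e$ under a wrong model has a larger variance than the error term $\epsilon$ under the true model. The more discrepant $\mu_0(x)$ and $\mu_1(x)$ are, the larger the variance of the error term is. A larger error variance typically translates to less accurate estimation and prediction.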
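
The variance inflation can be seen in a small simulation. Everything in the sketch below (coefficients, distributions, sample size) is an assumption made for illustration: the true mean depends on two independent covariates, the wrong model drops one of them, and the residual variance grows from about $\sigma^2 = 1$ to about $\sigma^2 + \beta_2^2\,\mathrm{var}(w) = 5$.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 100_000
z, w = rng.normal(size=(2, N))
eps = rng.normal(size=N)
y = 1.0 + 2.0 * z + 2.0 * w + eps            # true model (assumed coefficients)

# Residuals under the true model form (least squares on 1, z, w).
D_true = np.column_stack([np.ones(N), z, w])
coef_true, *_ = np.linalg.lstsq(D_true, y, rcond=None)
var_true = np.var(y - D_true @ coef_true)

# Residuals under a wrong model that omits w (least squares on 1, z).
slope, intercept = np.polyfit(z, y, 1)
var_wrong = np.var(y - (intercept + slope * z))

print(var_true, var_wrong)   # roughly 1 versus roughly 1 + 2**2
```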

We have an intuitive explanation why a conformal predictive algorithm can still provide valid prediction even under a totally wrong model in the IID setting. Specifically, when we use a wrong model $\mu_1(\cdot)$, the corresponding point predictor will be biased by the magnitude of $\mu_1(x_{\text{new}}) - \mu_0(x_{\text{new}})$, but at the same time the error term absorbs the bias, thus producing residuals with a shift by the magnitude of $\mu_0(x_i) - \mu_1(x_i)$. In a conformal prediction algorithm, the quantiles of residuals are added back to the point prediction to form the bounds of predictive intervals. If the IID assumption holds, the bias is offset by the shift. Along with greater residual variance, the offsetting helps ensure the validity of the conformal prediction. We call this tendency of self-balance to maintain validity a homeostasis phenomenon, and will explain it in explicit mathematical terms under a linear model in Section 3.1.

The IID assumption is a crucial condition to ensure the validity of a prediction under a wrong model. If the IID assumption does not hold for the testing data, the prediction based on a wrong model (or a correct model with poorly estimated parameters) is often invalid with huge errors, as we see in our case studies in Section 3. We think this IID assumption also explains why deep neural networks and other machine learning methods work so well in academic research settings (where random split of data into training and testing sets is a common practice) but fail to produce “killer applications” to make predictions for a given patient or company whose covariates $x_{\text{new}}$ are often not close to the center of the training data. The good news is that, if we use a correct model for training and can get good model estimates, a reasonably acceptable prediction for a fixed $x_{\text{new}}$ is possible. This is illustrated in the case studies in Section 3 under both linear and neural network models. Indeed, modeling and estimation remain relevant and often crucial for prediction in both IID and non-IID cases.

3 Case studies: conformal prediction under specific models

3.1 Prediction with data from a linear regression model

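We assume in this subsection that the training data $(x_i, y_i)$, $i = 1, \dots, n$, and $(x_{\text{new}}, y_{\text{new}})$ are from the following linear model:

$$ y_i = \mu_0(x_i) + \epsilon_i = x_i^T \beta + \epsilon_i, \tag{4} $$

where $\beta$ is the unknown regression coefficient and the $\epsilon_i$ are IID random errors with mean $0$ and variance $\sigma^2$. We would like to compare the performances of the Jackknife-plus prediction procedure under two different learning models:

$$ \text{(a) the true model: } \mu_0(x_i) = x_i^T \beta \quad \text{vs} \quad \text{(b) a wrong model: } \mu_1(x_i) = z_i^T \gamma, $$

where $z_i$ consists of the first $p_1$ elements of the $p$ covariates in $x_i$ ($p_1 < p$), $i = 1, \dots, n$, and $\gamma$ is the corresponding unknown regression coefficient. We define notations: $Y = (y_1, \dots, y_n)^T$ is the $n \times 1$ response vector of the training (observed) data, $X$ and $Z$ are the $n \times p$ and $n \times p_1$ design matrices, respectively, and we have a matrix partition $X = (Z, W)$.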

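Under the true learning model and from the least squares estimation, we have, for each given $i = 1, \dots, n$ and $s = i$ or $n+1$,

$$ \hat y_s^{-(i,n+1)} = \hat\mu_0(x_s; \mathcal{A}^{-(i,n+1)}) = x_s^T (X^T X - x_i x_i^T)^{-1} (X^T Y - x_i y_i) = \dots = x_s^T \hat\beta - (y_i - x_i^T \hat\beta)\, \frac{h_{is}}{1 - h_{ii}}, $$

where $\hat\beta = (X^T X)^{-1} X^T Y$ is the least squares estimator using all training data (of size $n$), $h_{is} = x_i^T (X^T X)^{-1} x_s$ and $h_{ii} = x_i^T (X^T X)^{-1} x_i$. Therefore, for $s = n+1$ and replacing index $n+1$ with index new,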
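
The steps elided by the dots in the display above can be filled in with the Sherman-Morrison formula. The following is a sketch of one standard route, writing $G = (X^T X)^{-1}$ as shorthand introduced here (it is not notation from the note):

```latex
\begin{aligned}
(X^T X - x_i x_i^T)^{-1}
  &= G + \frac{G x_i x_i^T G}{1 - h_{ii}}, \qquad h_{ii} = x_i^T G x_i, \\
(X^T X - x_i x_i^T)^{-1}(X^T Y - x_i y_i)
  &= \hat\beta - G x_i y_i
     + \frac{G x_i \,(x_i^T \hat\beta)}{1 - h_{ii}}
     - \frac{G x_i \, h_{ii}\, y_i}{1 - h_{ii}} \\
  &= \hat\beta - G x_i \,\frac{y_i - x_i^T \hat\beta}{1 - h_{ii}}, \\
x_s^T (X^T X - x_i x_i^T)^{-1}(X^T Y - x_i y_i)
  &= x_s^T \hat\beta - (y_i - x_i^T \hat\beta)\,\frac{h_{is}}{1 - h_{ii}},
  \qquad h_{is} = x_i^T G x_s .
\end{aligned}
```

The second line uses $G X^T Y = \hat\beta$, and the third collects the $y_i$ terms over the common denominator $1 - h_{ii}$.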

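$$ \hat y_{n+1}^{-(i,n+1)} + R_{i,n+1} = \hat y_{\text{new}}^{-(i,\text{new})} + y_i - \hat y_i^{-(i,\text{new})} = x_{\text{new}}^T \hat\beta + (y_i - x_i^T \hat\beta)\, \frac{1 - h_{i,\text{new}}}{1 - h_{ii}} = x_{\text{new}}^T \hat\beta + (1 - h_{i,\text{new}})\, u_i, $$

where $u_i = (y_i - x_i^T \hat\beta)/(1 - h_{ii})$ is the deleted residual (using all training data of size $n$) and $h_{i,\text{new}} = x_i^T (X^T X)^{-1} x_{\text{new}}$. Thus, from (2), the predictive interval of $y_{\text{new}}$ is:

$$ \left[\, x_{\text{new}}^T \hat\beta + q_{\alpha/2}\!\left(\{(1 - h_{i,\text{new}})\, u_i\}_{i=1}^{n}\right),\; x_{\text{new}}^T \hat\beta + q_{1-\alpha/2}\!\left(\{(1 - h_{i,\text{new}})\, u_i\}_{i=1}^{n}\right) \right]. \tag{5} $$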

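Note that, given $x_{\text{new}}$, the point predictor $x_{\text{new}}^T \hat\beta$ is an unbiased estimator of $x_{\text{new}}^T \beta$ and $E\{(1 - h_{i,\text{new}})\, u_i \mid X, x_{\text{new}}\} = 0$, for $i = 1, \dots, n$. Thus, the prediction interval (5) can be interpreted as an interval “centered” at the unbiased predictor $x_{\text{new}}^T \hat\beta$ with its width determined by the “spread” of the mean-zero “noises” $(1 - h_{i,\text{new}})\, u_i$.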
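
The closed form behind (5) can be checked numerically against brute-force refitting. The sketch below simulates data with assumed dimensions and coefficients, computes $x_{\text{new}}^T\hat\beta + (1 - h_{i,\text{new}})u_i$ from a single fit, and compares it with $\hat y_{\text{new}}^{-(i,\text{new})} + R_{i,\text{new}}$ obtained by refitting without observation $i$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)   # assumed beta
x_new = rng.normal(size=p)

# Closed form: one fit on all training data.
G = np.linalg.inv(X.T @ X)
beta_hat = G @ X.T @ y
h_diag = np.einsum("ij,jk,ik->i", X, G, X)    # h_ii
h_new = X @ G @ x_new                         # h_{i,new}
u = (y - X @ beta_hat) / (1 - h_diag)         # deleted residuals u_i
closed = x_new @ beta_hat + (1 - h_new) * u

# Brute force: refit n times, each time without observation i.
brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    brute[i] = x_new @ b_i + (y[i] - X[i] @ b_i)

max_gap = np.max(np.abs(closed - brute))
print(max_gap)   # numerically negligible
```

The closed form costs one matrix inversion instead of $n$ refits, which matters when the interval is recomputed over many simulation repetitions.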

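When the wrong model $\mu_1$ is used, we can use a similar derivation to get the predictive interval of $y_{\text{new}}$:

$$ \left[\, z_{\text{new}}^T \hat\gamma + q_{\alpha/2}\!\left(\{(1 - g_{i,\text{new}})\, v_i\}_{i=1}^{n}\right),\; z_{\text{new}}^T \hat\gamma + q_{1-\alpha/2}\!\left(\{(1 - g_{i,\text{new}})\, v_i\}_{i=1}^{n}\right) \right], \tag{6} $$

where $\hat\gamma = (Z^T Z)^{-1} Z^T Y$ is the least squares estimator using the wrong model, $g_{ij} = z_i^T (Z^T Z)^{-1} z_j$ (so that $g_{i,\text{new}} = z_i^T (Z^T Z)^{-1} z_{\text{new}}$), and $v_i = (y_i - z_i^T \hat\gamma)/(1 - g_{ii})$. Here, given $x_{\text{new}}$, the point predictor $z_{\text{new}}^T \hat\gamma$ is actually biased, with mean $z_{\text{new}}^T (Z^T Z)^{-1} Z^T X \beta$. Thus, the bias caused by missing the covariates is

$$ \text{bias} = E[z_{\text{new}}^T \hat\gamma \mid X, x_{\text{new}}] - x_{\text{new}}^T \beta = -w_{\text{new}}^T \beta_2 + z_{\text{new}}^T (Z^T Z)^{-1} Z^T W \beta_2, \tag{7} $$

where $w_{\text{new}}$ consists of the last elements of $x_{\text{new}}$ (those not in $z_{\text{new}}$, matching the partition $X = (Z, W)$) and $\beta_2$ is the corresponding sub-vector of $\beta$.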

Luckily, when the IID assumption holds, this bias can often be mitigated by a shift in the residual terms used to construct the predictive interval. Note that, the expectations of the residual terms are not zero:

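$$ E\{(1 - g_{i,\text{new}})\, v_i \mid X, x_{\text{new}}\} = \frac{1 - g_{i,\text{new}}}{1 - g_{ii}} \left(w_i^T - z_i^T (Z^T Z)^{-1} Z^T W\right) \beta_2 = \frac{1 - g_{i,\text{new}}}{1 - g_{ii}}\, (w_i^{\perp})^T \beta_2 \overset{\text{def}}{=} \text{shift}, \tag{8} $$

where $(w_i^{\perp})^T$ is the $i$th row of the matrix $W^{\perp} = \{I - Z (Z^T Z)^{-1} Z^T\} W$. The bias and the shift often have opposite signs and thus, when added together, they cancel each other to a certain extent.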

For example, suppose a new individual case is an “average individual” of the training data with $x_{\text{new}} = \bar x = \frac{1}{n} \sum_{i=1}^{n} x_i$. Then, by (7) and (8), the bias of the point predictor and the average shift of the residual terms are $-\frac{1}{n} \sum_{i=1}^{n} (w_i^{\perp})^T \beta_2$ and $\frac{1}{n} \sum_{i=1}^{n} \frac{1 - g_{i,\text{new}}}{1 - g_{ii}} (w_i^{\perp})^T \beta_2$, respectively. Since $(1 - g_{i,\text{new}})/(1 - g_{ii}) \approx 1$ for large $n$ when the $x_i$’s are IID (cf., Lemma A1 in Supplementary), the average shift $\approx -\text{bias}$, thus they approximately cancel out in the predictive interval (6). This cancellation explains in part why the prediction interval (6) is still roughly on target, even if the learning model is wrong. The cancellation is not as complete when the testing data is just an IID sample and not the “average” $\bar x$. It appears that the combination of an enlarged interval and the cancellation of the bias and shift helps ensure the validity of conformal prediction under a wrong model for IID testing data. This self-balance to maintain validity mirrors a homeostasis process and we refer to it as a homeostasis phenomenon.
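
The cancellation for the “average individual” can be verified directly from (7) and (8). The sketch below assumes a deliberately simple setting chosen for illustration only: one retained covariate $z$, one omitted covariate $w$, no intercept, and made-up means and coefficient, so that the bias is visibly non-zero.

```python
import numpy as np

rng = np.random.default_rng(4)
n, beta2 = 500, 2.0                       # assumed sample size and coefficient
z = rng.normal(1.0, 1.0, size=n)          # covariate kept by the wrong model
w = rng.normal(2.0, 1.0, size=n)          # covariate omitted by the wrong model

slope = (z @ w) / (z @ z)                 # (Z'Z)^{-1} Z'W for scalar z and w
z_new, w_new = z.mean(), w.mean()         # the "average individual"

# Equation (7): bias of the wrong-model point predictor at x_new.
bias = (-w_new + z_new * slope) * beta2

# Equation (8): expected shift of each residual term, then its average.
w_perp = w - z * slope                    # rows of W-perp
g_ii = z**2 / (z @ z)
g_new = z * z_new / (z @ z)
shift = (1 - g_new) / (1 - g_ii) * w_perp * beta2
avg_shift = shift.mean()

print(bias, avg_shift, bias + avg_shift)  # the sum is close to zero
```

Moving `z_new` and `w_new` away from the sample means changes the bias in (7) while leaving the shifts in (8) nearly unchanged, which is how the cancellation breaks down for non-IID test points.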

A wrong learning model also has implications for the lengths of the prediction intervals. The proposition below states that the predictive interval based on the wrong model $\mu_1$ is expected to be wider than that based on the correct model $\mu_0$, if $\beta_2 \neq 0$. A proof can be found in the Supplementary.

Proposition 2.

Under model (4), assume ,

’s are IID from a normal distribution and

, where . Suppose , then

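$$ \lim_{n \to \infty} P\Big[\, q_{1-\alpha/2}\!\left(\{(1 - g_{i,\text{new}})\, v_i\}_{i=1}^{n}\right) - q_{\alpha/2}\!\left(\{(1 - g_{i,\text{new}})\, v_i\}_{i=1}^{n}\right) > q_{1-\alpha/2}\!\left(\{(1 - h_{i,\text{new}})\, u_i\}_{i=1}^{n}\right) - q_{\alpha/2}\!\left(\{(1 - h_{i,\text{new}})\, u_i\}_{i=1}^{n}\right) \Big] = 1. $$

That is, with probability tending to 1, the width of predictive interval (5) $<$ the width of predictive interval (6).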

The following numerical example provides empirical evidence to support our discussions.

Example 2.

Suppose we have only two covariates in the true model

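$$ y_i = \mu_0(x_i) + \epsilon_i = \beta_0 + \beta_1 z_i + \beta_2 w_i + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2), \tag{9} $$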

where and and are independent. In our numerical study, , , , the -element of is , and .

For the testing data, we consider two scenarios: (i) under the IID assumption that and follows (9); (ii) the marginal distribution of is instead from and, given , follows (9). Here, and the -element of is , .

In addition to the correct model (a) , three wrong learning models are considered:

(b) $\mu_1(x_i) = \gamma_0 + \gamma_1 z_i$ (partially correct, without covariate $w_i$);
(c) $\mu_2(x_i) = \xi_0 + \xi_1 z_i^2$ (a wrong regression form);
(d) $\mu_3(x_i) = \eta_0$ (without any covariates).

For model fitting, we use the least squares method in all four cases.

Reported in each cell of Table 1 are the coverage rate and average length (inside brackets) of conformal predictive intervals for , computed based on repetitions. As expected, in the IID scenario, all learning models can provide valid prediction results. However, the smallest interval length is observed under the true model. In the non-IID scenario, only the true model can provide a valid prediction. The other three learning models do not provide valid predictive inference in terms of a correct coverage rate, even though their predictive intervals are wider. The results in both scenarios underscore the importance of using a correct learning model for prediction.

In order to get the full picture of the predictive intervals of all confidence levels under different scenarios and different learning models, we plot in Figure 2 the predictive curves obtained from the first realization of the repetitions (other realizations produce more or less the same plots). Plot (a) is for the “average individual” case, plot (b) is for the IID case and plot (c) is for the non-IID case. In each plot, we have four predictive curves corresponding to the four working models, plus the target (oracle) predictive curve of $y_{\text{new}}$ obtained by pretending that we know exactly the distribution of $y_{\text{new}}$ given $x_{\text{new}}$ under model (9). In each of the plots (a)-(c), the predictive curves trained with the correct model (black solid curves) are very close to the target oracle predictive curves (red solid curves), indicating that if we use the true model as the learning model, we are able to provide very accurate prediction at all confidence levels. Under the wrong models, however, the take-home messages are very different. In plot (a) with $x_{\text{new}}$ being the “average individual,” we see an almost complete cancellation of bias and shift as described earlier. However, the predictive curves are much wider than those based on the correct model. Plot (b) is for the IID case of $x_{\text{new}}$. In this case the curves are similar to those in plot (a), although the cancellations are not as complete as for the “average individual.” Nevertheless, the enlarged interval widths help maintain the coverage. Plot (c) is for the non-IID case, in which the cancellations of bias and shift are not effective when wrong learning models are used, leading to wrong predictions. In plots (a)-(c), we can also see that the partially correct model $\mu_1$ performs better than the other two completely wrong models $\mu_2$ and $\mu_3$.

In summary, when we train prediction algorithms using a wrong model, the IID assumption is essential for the validity of prediction, and using a wrong model often results in wider, sometimes much wider, predictive intervals. When we train the same algorithms using the correct model, the validity and efficiency of the predictions are observed in both IID and non-IID scenarios conditional on $x_{\text{new}}$.
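
The qualitative pattern summarized above can be reproduced with a small simulation in the spirit of Example 2. This is a sketch under stated assumptions, not a replication of Table 1: the coefficients, sample size, number of repetitions, nominal level and the shifted mean of $w_{\text{new}}$ are all chosen here, and only the true model and the partially correct model (b) are compared. The intervals use the closed forms (5) and (6).

```python
import numpy as np

def jackknife_plus_ls(D, y, d_new, alpha):
    """Interval (5)/(6) for a least-squares fit with design matrix D."""
    G = np.linalg.inv(D.T @ D)
    coef = G @ D.T @ y
    lev = np.einsum("ij,jk,ik->i", D, G, D)      # h_ii (or g_ii)
    lev_new = D @ G @ d_new                      # h_{i,new} (or g_{i,new})
    deleted = (y - D @ coef) / (1 - lev)         # deleted residuals
    centers = d_new @ coef + (1 - lev_new) * deleted
    return np.quantile(centers, [alpha / 2, 1 - alpha / 2])

def simulate(w_new_mean, reps=1000, n=100, alpha=0.1, seed=3):
    rng = np.random.default_rng(seed)
    beta = np.array([1.0, 2.0, 2.0])             # assumed (beta0, beta1, beta2)
    cover = {"true": 0, "wrong": 0}
    width = {"true": 0.0, "wrong": 0.0}
    for _ in range(reps):
        z, w = rng.normal(size=(2, n))
        X = np.column_stack([np.ones(n), z, w])
        y = X @ beta + rng.normal(size=n)
        # Test point: w_new is centered at w_new_mean (0 = IID, else shifted).
        x_new = np.array([1.0, rng.normal(), rng.normal(w_new_mean)])
        y_new = x_new @ beta + rng.normal()
        for name, cols in (("true", [0, 1, 2]), ("wrong", [0, 1])):
            lo, hi = jackknife_plus_ls(X[:, cols], y, x_new[cols], alpha)
            cover[name] += lo <= y_new <= hi
            width[name] += hi - lo
    return ({k: v / reps for k, v in cover.items()},
            {k: v / reps for k, v in width.items()})

iid_cov, iid_width = simulate(w_new_mean=0.0)       # IID test points
shift_cov, shift_width = simulate(w_new_mean=3.0)   # non-IID test points
print("IID     coverage", iid_cov, "mean width", iid_width)
print("non-IID coverage", shift_cov, "mean width", shift_width)
```

Under these assumed settings, both models cover at close to the nominal rate for IID test points, with the wrong model giving much wider intervals; for shifted test points only the true model retains its coverage.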

3.2 Prediction in neural network model

The discussion of the linear model in Section 3.1 can be extended to other models. We consider in this subsection an example of simple neural network models. We use a simulation study to provide empirical support for our discussion. Note that, in current neural network development, model fitting algorithms do not pay much attention to correctly estimating the model parameters. In addition to what we learned in the linear model, we find that the estimation of model parameters plays an important role in prediction as well.

Example 3.

Suppose our training data , , are IID samples from the model

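$$ y_i = \mu_0(x_i) + \epsilon_i = \max\{0,\ \max\{0, z_{i1} + z_{i2}\} - \max\{0, w_i\}\} + \epsilon_i, \qquad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2), \tag{10} $$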

where and and are independent. Here, , the -element of is , for , and . Model (10) is in fact a neural network model (with a diagram presented in Figure 3 (a)) and we can re-express  as

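$$ \mu_0(x_i) = f(A_2 f(A_1 x_i)). \tag{11} $$

Here, $f(t) = \max\{0, t\}$ is the ReLU activation function (applied elementwise), and $A_1$ and $A_2$ are the model parameters. Corresponding to (10) with $x_i = (z_{i1}, z_{i2}, w_i)^T$, the true model parameter values are $A_1 = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$ and $A_2 = (1, -1)$. In our analysis, we assume that we know the model form (11) but do not know the values of model parameters $A_1$ and $A_2$.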
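
The equivalence between the explicit form (10) and the network form (11) can be checked numerically. In the sketch below the weight matrices are obtained by matching (10) term by term, assuming the covariates are ordered as $(z_{i1}, z_{i2}, w_i)$; that ordering, the function names and the simulated inputs are assumptions made for illustration.

```python
import numpy as np

def relu(t):
    """ReLU activation, applied elementwise."""
    return np.maximum(0.0, t)

# Weights matched to (10), assuming covariate order (z1, z2, w).
A1 = np.array([[1.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
A2 = np.array([[1.0, -1.0]])

def mu0_network(x):
    """Network form (11): f(A2 f(A1 x)) for each row of x, shape (n, 3)."""
    return relu(relu(x @ A1.T) @ A2.T).ravel()

def mu0_direct(x):
    """Explicit form of the mean function in (10)."""
    z1, z2, w = x.T
    return np.maximum(0.0, np.maximum(0.0, z1 + z2) - np.maximum(0.0, w))

x = np.random.default_rng(5).normal(size=(1000, 3))
print(np.max(np.abs(mu0_network(x) - mu0_direct(x))))
```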

For the testing data, we consider two scenarios: (i) [IID case] and, given , follows (10); (ii) [Non-IID case] the marginal distribution and, given , follows (10). Here,