 # Predictive Interval Models for Non-parametric Regression

Having a regression model, we are interested in finding two-sided intervals that are guaranteed to contain at least a desired proportion of the conditional distribution of the response variable given a specific combination of predictors. We name such intervals predictive intervals. This work presents a new method to find two-sided predictive intervals for non-parametric least squares regression without the homoscedasticity assumption. Our predictive intervals are built by using tolerance intervals on prediction errors in the query point's neighborhood. We proposed a predictive interval model test and we also used it as a constraint in our hyper-parameter tuning algorithm. This gives an algorithm that finds the smallest reliable predictive intervals for a given dataset. We also introduce a measure for comparing different interval prediction methods yielding intervals having different size and coverage. These experiments show that our methods are more reliable, effective and precise than other interval prediction methods.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Regression analysis is a statistical technique for estimating the value of one variable as a function of independent variables. They are widely applied in science and engineering, they are used in problems like function estimation, financial forecasting, and time series prediction. In the general form a regression equation has three variables: the response variable , a function and a random error , where . There are different kinds of regression techniques which estimate different characteristics of the conditional distribution of the response variable

. The most common approaches estimate the mean of the random variable

and are usually known as least-squares techniques. Robust regression approaches are similar to least-squares techniques but they are designed to be robust to outliers and violations of the least-squares assumptions. Another kind, called quantile regression, estimates the conditional quantiles of the response variable. In each category, the regression function can be estimated with a parametric linear, a parametric non-linear or a non-parametric method. This results in linear, non-linear or non-parametric regression models.

Interval prediction is an important part of every regression modeling procedure because regression models are always built with finite sample sizes. Thus the predicted mean or quantile is an estimate of the true unknown conditional mean or quantile of the random variable . Therefore while dealing with finite size datasets, we need to make some statistical inferences. There are currently many interval prediction methods for the regression context, however each interval has its own specific interpretation and application. In this work we are interested in finding two-sided prediction intervals in regression models which contain, with a high confidence level, at least a desired proportion of the conditional response variable. Such intervals are mostly used in application demanding a high level of confidence, like quality control, environmental monitoring, industrial hygiene, exposure data analysis, aircraft trajectory prediction, security systems etc.

We divide interval prediction approaches in regression into two main categories: The first category methods are based on the estimated conditional mean. These methods are usually based on least-squares models and propose interval prediction techniques that are centered on the estimation of the mean regression function. These approaches generally assume a non-biased regression model with a Gaussian error having constant variance. On the other hand we have quantile regression methods which directly estimate these intervals. Quantile regression methods are more robust to outliers and have less assumptions than the least-squares approaches. But they suffer from other weaknesses like slower speed of convergence and the crossing quantile effect (two different quantiles may cross or overlap each other).

One of our contributions is the review and the comparison of different least-squares and quantile regression techniques used to find intervals which contain a desired proportion of the response variable. We take advantage of this work to address common misunderstood questions about interval prediction methods in the machine learning community. We explain their applications and review their drawbacks. As pointed out at the beginning paragraph, we are interested in finding intervals in regression models which contain, with a high confidence level, at least a desired proportion of the conditional response variable. For that purpose, we introduce a new type of interval prediction method named “predictive interval methods”. A predictive interval model is guaranteed to contain for any query point

, at least a desired proportion of the conditional distribution of the response variable. Such models can be obtained with tolerance intervals for regression or confidence interval on quantile regression, but these concepts have limited applications in the literature. So we propose tolerance intervals for LLR. Then we use the tolerance intervals for LLR to obtain predictive interval models for LLR. Our predictive interval models are applied for two-sided interval prediction, however one can easily extend them to a one-sided interval prediction context. Then, we introduce a statistical test to check if an “interval prediction model” is a “predictive interval model”. In the same context, we introduce two measures for ranking interval prediction models. These measures rate the efficiency and the tightness of the obtained envelope. Our predictive interval methods are based on LLR and give variable size intervals. We assume that the mean regression function is locally linear and that the prediction error is locally homoscedastic (heterocedastic in general). Our method does not neglect the regression bias and finds intervals that work properly with biased regression models. The proposed predictive intervals are based on the LOO or -fold cross-validation prediction errors of the local linear regression.

In order to validate our methods, we use several regression datasets to compare our predictive interval method for local linear regression with other interval prediction methods. The selected methods will be tested on their capacity to provide two-sided -content predictive interval models. The models are compared by their reliability, efficiency, precision and the tightness of their obtained envelope. We also take advantage of our evaluation chapter to show that the conventional interval prediction method is not appropriate for high confidence interval prediction. It is almost always less efficient than our predictive interval methods and their envelope is almost always larger than the envelope obtained by our methods. ghasemi_reg_interval_sim

proposed a KNN based interval prediction method which finds intervals that guarantee to contain, simultaneously for all instances in the predictor space, at least a proportion

of the conditional distribution of the response value. They called it simultaneous interval regression for KNN and they stated that, their method is similar to simultaneous tolerance intervals for KNN regression with a high confidence level. Their work differs from our contribution in at least three points: first, we do not look for models that guarantee the simultaneous condition. Our models could be obtained by tolerance intervals for regression and not simultaneous tolerance intervals for regression. The difference between these concepts is explained in Section 3. Second, our methods are based on LLR and use local distribution of the prediction error instead of local distribution of the conditional response variable. Third, we propose a test and two interval comparison measures to obtain and compare our models.

This paper is organized as follows: Section 2 is a background on regression and tolerance interval for least squares regression. Section 3 describes the state of the art in for interval prediction in regression. Section 4 introduces two efficiency measures which can be used to compare different interval prediction models. Section 5 introduces the concept of predictive interval models. Section 6 proposes a methods to obtain tolerance intervals for LLR under the LHNPE conditions. Section 7 describes how to use the tolerance intervals for LLR to find predictive interval models for LLR. Chapter 8 use experiments to compares our methods with other least squares and quantile regression method on nine benchmark datasets and the final section is a discussion with concluding remarks.

## 2 Background

### 2.1 Context and Notation

We choose a fixed regression design where dataset is a random sample. The ’s are vectors of observations and

are random variables. These distributions are continuous probability distributions. We always suppose that there is one true mean regression function

with a zero mean error and an unknown variance . The most practical assumption is the Gaussian homoscedastic error, but it is not mandatory. is a finite random sample, so the estimated regression model finds a pair of ; denotes the estimated regression function and

is the estimated error standard deviation. This pair is a random vector in the probability space of regression models defined for the underlying regression type (for ex: OLSE). Note that in the case of error being not normally distributed, the pair

does not correctly represent the estimated regression model. Thus we will use the symbol instead of to refer to a probability distribution where the random vector is the estimated regression model based on the random sample . We also use the following notation:

• : the random sample of regression, the training set;

• : the true and unknown regression function;

• : the conditional mean of the response variable for specified combination of the predictors;

• : the estimated regression function;

• : the estimated regression function at point ;

• : the error variable;

• : the prediction error at point , ;

• : the true and unknown variance of the error variable;

• : the estimated variance of the error variable;

• : the variance of the estimated regression function at point ;

• : a point in the predictor space that does not exists in the training set;

• : the conditional response variable for a given combination of the predictors, ;

• : the random response variable, ;

• : an observation of the random variable ;

• : -content -coverage tolerance interval for the distribution of the response variable at point ;

• : -content -coverage tolerance interval for the distribution of the prediction error at point ;

• : -content predictive interval for the distribution of ;

• : -content predictive interval for the distribution of the prediction error at point ;

• : the -quantile of a standard normal distribution;

• : the

-quantile of a chi-square distribution with

degrees of freedom.

Note that, in this work we suppose that the prediction error for any point , is obtained with , where the mean estimate does not use the observation .

### 2.2 Least-squares Regression

Regression analysis is a statistical technique for estimating the value of one variable as a function of independent variables. In fixed design regression, there are pairs of observations , where is the vector of observations known as covariates and is the response variable. In other words, the random variable or follows a mean function with a random error term defined as:

 Yi=f(xi)+εi,where E(εi)=0. (1)

The model assumes that are mutually IID random variables. The goal is to estimate the mean function by , being as close as possible to the unknown function

. The usual assumption is to suppose that the variance of the error is the same everywhere. This is known as homoscedasticity and the opposite hypothesis (variable error variance) is known as heteroscedasticity.

In a least squares regression, the idea is to estimate the mean of by and based on some assumptions it results in finding the function that minimizes the MSE, i.e. finding that minimizes:

 MSE(f)=1nn∑i=1(yi−^f(xi))2.

If we have a homoscedastic error, the average risk of the regression estimator plus a constant value is equal to MSE. So in small to medium size datasets, leave-one-out or 10-fold cross validation MSE are well-known estimators of the risk function. For more details on the risk, see Appendix B.

### 2.3 Local regression methods

Local Polynomial Regression (LPR) assumes that the unknown function can be locally approximated by a low degree polynomial. LPR fits a low degree polynomial model in the neighborhood () of the point . The estimated vector of parameters used in the fitted LPR is the vector that minimizes a locally weighted sum of squares. Thus for each a new polynomial is fitted to its neighborhood and the response value is estimated by evaluating the fitted local polynomial with the vector as covariate. In general the polynomial degree () is or ; for , LPR becomes a kernel regression and when it changes to LLR.

#### 2.3.1 Definition of Local Polynomial Regression (LPR)

Suppose that the regression function at the point can be approximated locally for inside a neighborhood of . The idea is to write the Taylor’s expansion for inside a neighborhood of as follows fan_gijbels_1996:

 f(z)=d∑j=0fj(x)j!(z−x)j≡d∑j=0βj(z−x)j. (2)

Equation (2) models the regression function by a polynomial function. Thus, for every observation in the neighborhood of , we write (2) and estimate the vector by the vector of parameters which minimizes the locally weighted sum of squares defined in Equation (3), and where represents a kernel function with bandwidth . In fact, estimating for the random design as well as for the fixed design results in the locally weighted polynomial regression expressed by the equation below fan_gijbels_1996:

 ^βx=Argminβ∈R(d+1)n∑i=1Kb(xi−x)(Yi−d∑j=0βx,j(xi−x)j)2 (3)

where . So the local polynomial estimation for point is computed as follows:

 ^f(x)=n∑i=1ai(x)Yi, (4)
 where a(x)=IT1^βx and IT1=(1,0,⋯,0).

Kernel functions are used to weight the observations. They are chosen so that observations closer to the fitting point have bigger weights and those far from have smaller weights. If is a kernel, then is also a kernel function.

 Kb(u)=1bK(ub), where b>0.

Here, , known as the bandwidth, is a constant scalar value used to select an appropriate scale for the data. In this article, we use the following kernel:

 Kb(u)=1bK(D(u)b), (5)

where is a distance function like the -norm. Some authors including local_clev_1979 and local_clev_1988, took the -nearest neighbors of as its neighborhood. In this case, for each , , where is the distance from the -th nearest neighbors (the farthest neighbor) from the point . For a detailed discussion see locally_weighted_learning.

#### 2.3.2 Bandwidth Selection

A popular bandwidth selection method is the LOO technique suggested in Stone (1977) local_stone_1977 which chooses the following bandwidth :

 b=Argminn∑i=1(yi−^f−i(xi))2, (6)

where is the estimation without using the observation. Estimating the bandwidth by LOO is a time-consuming task, so it is common to minimize the K-fold cross validation score with or ; this leads to an approximation of LOO. Plug-in bandwidth is another smoothing strategy which is a formula for the asymptotically optimal bandwidth. The plug-in bandwidth requires several unknown quantities that must be estimated from the data. In section 4.2 of fan_gijbels_1996, a plug-in bandwidth for linear weighted local linear regression is defined. One of the required parameters for this estimator is ’s second derivative which is more difficult to estimate than itself. In this work we use -fold cross validation to find the best bandwidth of our dataset.

#### 2.3.3 Loess

Loess was introduced by Cleveland and Delvin local_clev_1988, and is a multivariate version of LOWESS local_clev_1979, which is another version of LPR. Loess is described by (4), where the polynomial degree is one or two . For the bandwidth selection and weight calculation, loess is similar to KNN. Its weights are calculated with (5) where, , is ’s -norm in the predictor space and is the Euclidean distance between the input vector and its nearest neighbor. The weight function chosen by Cleveland and Delvin local_clev_1988 was the Tricube kernel, however it is not mandatory.

In this work, linear loess is the non-parametric smoother function. Therefore, for each input vector , we use (3), with , to estimate the vector of parameter from the training set. As we can see in (7), for each prediction the locally weighted linear regression problem is converted to a WLSE in which the weights are estimated by a kernel function.

 ^βx=argminn∑i=1wi(yi−xTiβ)2, (7)

where is the weight of the observation.

### 2.4 Tolerance intervals

let denote a random sample from a continuous probability distribution. A tolerance interval is an interval that is guaranteed, with a specified confidence level , to contain a specified proportion of the population. The sign is used to refer to a -content -coverage tolerance interval Mathew_Tolerance_book. Then, we have:

 PX(P(X∈ITγ,β|X)≥β)=γ. (8)

When our sample set (of size ) comes from a univariate normal distribution, the lower and upper tolerance bounds ( and , respectively) are are obtained as follows:

 Xl=^θ−c^σ,Xu=^θ+c^σ (9) c=   ⎷(n−1)(1+1n)Z21−1−β2χ21−γ,n−1 (10)

Where is the sample mean of the distribution, its sample standard deviation, represents the quantile of the chi-square distribution with degree of freedom and is the squared of quantile of the standard normal distribution normal_tolrance_howe_1969.

### 2.5 Tolerance intervals for least-squares regression

In the case of regression with constant error variance and normal distribution of errors, usually inter-quantiles of a normal distribution with mean zero and variance , (being the error variance estimator) are used as an approximate solution to find intervals that contain a desired proportion of the distribution of the response variable for a given value of dependent variables. For instance, the inter-quantile is often used as the interval containing of the distribution of (i.e., as a regression tolerance interval). As shown by Wallis linear_tolerance_wallis, this statement is not true since and are only estimations of the true error variance and the true mean function at point , . These estimations are always made on a finite sample and are then pervaded with uncertainty. Tolerance intervals for least squares regression have been introduced in order to take into account this uncertainty. These intervals are described formally by (11). We will refer to such intervals, -content -coverage regression tolerance intervals and they are denoted by .

 P(∫UTβ,γ(x)UTβ,γ(x)px(t)dt≥β)=γ, (11)

where and

denotes the probability density function of

for a specified value of the predictor variable

. A two-sided tolerance interval for is of the form , where is the tolerance factor to be determined subject to the content and the desired confidence level . Let represent the content of this tolerance interval,

 C(x;^f,^σ)=PY(x)(^f(x)−ρ(x)^σ≤Y(x)≤^f(x)+ρ(x)^σ). (12)

The tolerance factor satisfies the following condition:

 P^f,^σ(C(x;^f,^σ)≥β)=γ. (13)

Equations (11) and (13) could also be expressed as follows:

 P^f,^σ(PY(x)(Y(x)∈ITγ,β(x))≥β)=γ, (14)
 ITγ,β(x)=[LTβ,γ(x),UTβ,γ(x)]=[^f(x)−ρ(x)^σ,^f(x)+ρ(x)^σ].

It is important to observe that tolerance intervals in regression are defined separately for each input vector. Therefore, for two different input vectors and , and are different and the event is independent of . For more details see stat_intervals_guide_1991 and Mathew_Tolerance_book.

This intervals has been studied for non-linear parametric in carroll_tol. They use transformation and/or weighting for the construction of prediction and tolerance intervals for a new response following a non-linear regression fit. Their work addresses the case of normally distributed errors and the non-parametric case in which the error distribution is unknown. But the non-parametric case concerns the error distribution and not the regression fit, and tolerance intervals for non-parametric regression have not, so far, been applied to non-parametric regression.

## 3 State of the Art of the Interval Prediction

Regression models are always built with finite sample size (), thus the predicted mean or quantile is an estimate of the true unknown conditional mean or quantile of the random variable . Therefore while dealing with datasets of finite size, we need to make some statistical inferences. This section reviews and gives a comparison of different least-squares and quantile regression techniques used to find such intervals. Besides, we take advantage of this section to address a misunderstood interval prediction method in the machine learning community. We explain its applications and review its drawbacks.

### 3.1 Interval Prediction Models

The goal of this paragraph is to emphasize the differences between intervals, interval prediction methods and interval prediction models. An interval prediction method is the procedure required for obtaining an interval. It is just the way to do it but when it is applied to a dataset, we obtain an interval prediction model. For example, a tolerance interval for regression is a type of interval. The method to obtain it in linear models is described in Mathew_Tolerance_book and, when applied to a dataset, the model which gives the tolerance interval for each point in the predictor space is the interval prediction model.

###### Definition 1

A regression -content interval prediction model, built on the random sample , is function from the predictor space to the response variable space such that:

 (15)

and, the expected content of the intervals is at least :

 (16)

Thus when the size of our training set goes to infinity and under certain conditions, a -content interval prediction model finds intervals that on average contain, at least, a of the distribution of . This is a quite broad definition which covers all the interval prediction method for and we will use it for such purpose.

Our works deals with the regression models, so we omit to mention the regression word and use “interval prediction model” instead of “regression interval prediction model”. Note that test and model selection techniques are always applied to models and not to methods. However when a method is more efficient than its competitors on several datasets or in a general theoretical framework, we can state that this method is more efficient than others.

### 3.2 Least-Squares Based Interval Prediction

We review briefly some well-known statistical inference techniques applied to least-squares regression models. There is an extensive literature on this topic and the goal of this section is to emphasize that least-squares interval prediction methods are not restricted to large sample techniques. However, there are still some subjects like tolerance intervals and simultaneous intervals that need further study. We will see further that prediction and tolerance intervals have some equivalents in the quantile regression set-up, but confidence band and simultaneous tolerance intervals seem to be restricted to the least-squares world.

### 3.3 Conventional Interval prediction

One of the most common interval prediction techniques used in practice is to take as the interval which contains a proportion of ’s population,where is the average MSE given by a LOO or a 10-fold cross validation scheme. One might assume that the intervals expressed below have similar properties to the regression tolerance interval defined in the next section.

 P(Y(x)∈[^f(x)−Z1−β2SSE,^f(x)+Z1−1−β2SSE])=β. (17)

We assume that:

• the error variance for all ’s is constant (homoscedasticity).

• the estimator’s variance is constant for all .

• is an unbiased estimator of .

• the error , and , are independent and both have normal distributions.

Equation (17) becomes asymptotically valid, but it remains non-applicable for datasets of finite size. For more details on the proof of our statements, refer to Appendix B. Unfortunately, it is a common practice in the Machine Learning community to employ this method for obtaining intervals having the properties of prediction intervals, tolerance intervals or simultaneous tolerance intervals. The conventional interval prediction method has some drawbacks:

• First of all, the estimation does not take into account the regression sample size like prediction interval, confidence bands and tolerance intervals.

• It just estimates asymptotic global inter-quantile for the conditional response variable.

• It supposes that the estimated regression function is non-biased.

The problem of finding intervals that satisfy (17) is treated by tolerance for in least-squares regression.

#### 3.3.1 Inference on the conditional mean f(x)

This part describes interval prediction techniques that obtain intervals that contain with a confidence level the conditional mean at point .

• Point-wise confidence interval for : The confidence interval for the mean regression function contains asymptotically, a desired proportion of the conditional distribution of the estimated mean function for each combination of predictors Applied_non_Hardle.

 P^f(f(x)∈Ipwβ(x))=β. (18)
• Simultaneous Confidence band for the conditional mean function, for all : this is the idea of -confidence band for the regression. These intervals create an envelope around the entire mean regression function such that, for all , the probability that the true is contained in the band is simultaneously .

 P^f(f(x)∈Icbγ(x) for all x∈X)=γ, where Icbγ(x)=[Uγ(x),Lγ(x)] (19)

where is the domain of .

#### 3.3.2 Inference on the response variable Y(x)=f(x)+ε

This part describes interval prediction techniques that obtain intervals that contain a desired proportion of the conditional distribution of the response variable. These methods are closely related to the methods explained just above.

• Prediction interval for : they are also called expectation tolerance intervals for regression which contains on average and not at least, a desired proportion of the conditional distribution of the response variable for each combination of predictors. They are described by the equation below:

 E^f,^σ[P(Y(x)∈IPredβ(x)|^f,^σ)]=β where Y(x)=f(x)+ε (20)

For a detailed discussion about the differences between prediction and tolerance intervals, the reader can find more in paulson_1943; stat_intervals_guide_1991; Mathew_Tolerance_book.

• Tolerance interval for regression: tolerance interval for regression are explained in 2.5. The interval contains, with a confidence level , at least a desired proportion of the distribution of .

• Simultaneous tolerance intervals for regression: -content -coverage simultaneous regression tolerance intervals, described in Mathew_Tolerance_book, create an envelope around the entire mean regression function such that for all , the probability that is contained in the envelope is , and this coverage is guaranteed with a confidence level . They are can be described by (12) and (13) where (13) must be replaced by the equation below:

 P^f,^σ(minx∈XC(x;^f,^σ)≥β)=γ. (21)

Note that tolerance intervals and simultaneous tolerance intervals for least squares regression have been well studied for linear regression but the application of these methods in the non-linear and particularly the non-parametric case are limited in the literature.

### 3.4 Quantile Regression Based Interval prediction

A quantile regression model can estimate one conditional quantile so one-sided and two-sided interval estimation are treated separately.

#### 3.4.1 One-sided interval prediction

• Quantile regression: one-sided intervals obtained by the estimation of the conditional quantile are similar to one-sided prediction intervals in least-squares models.

• Confidence intervals on regression quantiles: the obtained one-sided interval contains, with a confidence level , at least a desired proportion of quantile_regression_practical_confidence_interal; quantile_regression_koenker_confidence_interal. They have so far been studied for linear models, and they are similar to one-sided tolerance intervals for regression explained in 2.5.

#### 3.4.2 Two-sided interval prediction

in order to obtain two-sided -content conditional intervals (), one must build two distinct quantile regression models: a lower -quantile regression model and an upper -quantile regression model.

• Quantile regression: This is done by a pair of upper and lower quantile regression model. These intervals are estimations and they are similar to two-sided prediction intervals in least-squares models.

• Confidence intervals on regression quantiles: These two-sided intervals contain with a confidence level, a proportion of the . As noted, we need a pair of () quantile regression models but each model now itself needs a one-sided confidence interval on regression quantile. Once we have built the upper and lower quantile regression models, we must obtain a lower (one-sided) confidence interval on the lower -quantile regression model and an upper (one-sided) confidence interval on the upper -quantile regression model. By applying the Bonferroni inequality, one can merge the pair of confidence intervals to obtain a joint confidence statement with a probability greater or equal to . More details on this combination can be found in Appendix B.

It is important to emphasize that although these intervals are theoretically feasible, there has not been any work, until now, which treats the problem of two-sided interval prediction with one-sided confidence intervals on regression quantiles. Such intervals are similar to two-sided -coverage -content least-squares tolerance intervals and they are explained in 2.5.

As discussed in koenker_book_2005, two different quantile regression models may cross or overlap each other, which is called as quantile crossing. Thus two-sided interval prediction is more meaningful by enforcing the non-crossing constraint. However after enforcing this constraint, the conditional quantile estimator may not converge to the true conditional quantile. Thus we have to choose between a “non-correct” or non-convergent estimator.

## 4 Comparing Interval Prediction Models

For a given dataset, we may have several interval prediction models but we need some quality measure to compare them. For this purpose, we define the dataset measures listed below. These measures are then used as building blocks for some graphical charts and plots explained further in this section. The idea is to provide graphical tools which can help us to compare the effectiveness of different interval prediction methods through different datasets.

### 4.1 Direct Dataset Measures

For each of the datasets the following quality measures can be computed. Note that the -content intervals must be obtained for observations not contained in the training set . So for small to medium datasets, we obtain these measures with a cross-validation or a LOO schema.

• MIP: Mean Inclusion Percentages: the fraction of the response values that are contained in the -content interval .

 MIPβ=n−1n∑i=1V(xi).

where is is defined as:

 V(xi)={1if Y(xi)∈I(xi)β,0otherwise.
• MIS: Mean of Interval Size.

 MISβ=n−1n∑i=1size(I(xi)β).
• : sample standard deviation of interval sizes.

 σis=n−1n∑i=1(size(I(xi)β)−MISβ)2,

where .

In to verify the reliability of an interval prediction method, we introduce the following constraint.

###### Definition 2

Let be a sample of size and be an function of , a -interval prediction model is a reliable model if we have:

 MIP Constraint: MIPβ≥b(n). (22)

This definition will further be used in the next section.

### 4.2 Composed Dataset Measures

We use the above quality measures to define the following composed measures:

#### 4.2.1 Normalized MIS

Suppose that we want to test different -content methods (“”, “” ,…, “”) on the dataset . They give us distinct models and each model has a Mean of Interval Size (MIS), so we have: . But depending on the dataset and ’s value, one model may satisfy the MIP constraint or not. For a model that does not satisfy this constraint, we do not compute its normalized MIS. For each model its normalized MIS is equal to the ratio of its MIS to the maximum MIS on this dataset

 normalizedMISiS=MISiSmaxi∈(1,⋯,c)(MISiS).

If we have :

 MISm1S≥MISm2S % and MIPm1S,β≥b(n) and MIPm2S,β≥b(n) ⇔m1 provides a wider reliable envelope % than m2.

is better than because it satisfies the MIP constraint and it also gives the smallest normalized MIS value. Choosing the ratio to the maximum MIS value rescales the MIS value between and and lets us compare the strength of methods across different datasets. However we can not use the normalized MIS to compare two models (constructed on the same dataset) that obtain different MIP values but have equal or approximately equal MIS values. In this case, we have to compare them by their Equivalent Gaussian Standard Deviation, explained below.

#### 4.2.2 Equivalent Gaussian Standard Deviation (EGSD)

If we have two reliable models (constructed on the same dataset) having different MIP values but approximately equal MIS values, we normally choose the one that gives the higher MIP. But the situation can get more complicated for models (constructed on the same dataset) with different MIS values and different MIP values. EGSD is a measure which can be used to compare interval prediction models, constructed on the same dataset, which have different MIP values. Such models can have different or equal MIS values. Let be a -content interval prediction model built on the dataset , yielding

. The idea behind EGSD is to find the Equivalent Gaussian Distribution (EGD) for successful predicted intervals of

. We have seen that by taking intervals size on average equal to , that of the observations will be contained in their prediction interval. So EGD is the distribution of the size of predicted intervals obtained by model that correctly contains their response variable. Therefore the EGD which has the smallest variance corresponds to the most efficient model. The Equivalent Gaussian Distribution for is the normal distribution -content inter-quantile size of which will be equal to . We have: . So the Equivalent Gaussian Standard Deviation of is calculated by:

 EGSDmS=MISmS2Z1−α2θ, where θ=MIPmS,β, α=1−β.

Now by using each model’s EGSD, we can compare models with different values of and . EGSD measures the trade-off between average interval size and the fraction of successful predictions. Smaller EGSD values denote more effective interval prediction models. Finally, for the sake of readability, all found EGSD are normalized in each dataset. Thus the final value is the ratio of the method’s to the maximum value on the underlying dataset:

 normalizedEGSDmS=EGSDmSmaxi∈(1,⋯,c)(EGSDiS).

Note that if the model has smaller EGSD than the model , it does not mean ’s envelope is wider than ’s envelope. As seen above smaller normalized MIS values mean smaller envelopes and smaller EGSD values means more effective models.

### 4.3 Figures

Plots and charts help us to compare different interval prediction methods on different datasets because a figure can visualize complex and big tables. Each plot is dedicated to one dataset and it compares dataset measures of different interval prediction methods on the same dataset whereas a chart compares a desired dataset measure for different methods and across different datasets. All the presented plots have the same x axis. This axis is labeled “Nominal MIP”, and it represents distinct values of the desired proportion (distinct values). On the other hand, each plot type has a different y axis. This axis denotes the underlying dataset measure on the tested interval prediction models.

#### 4.3.1 MIP plot

The MIP plot is similar to the well-known Q-Q plot with the difference that it compares MIP instead of quantiles. The x axis is denoted by “Nominal MIP” and it represents the desired proportion of inclusion (distinct values of ). The y axis is denoted by “Obtained MIP”. It represents the model’s MIP. Each point represents the model’s obtained MIP for its value on the “Nominal MIP” axis. This figure always has two lines: the “MIP constraint line” and the “Nominal MIP line”. The “MIP constraint line” displays for different values of nominal MIP, and the “Nominal MIP line” represents the function . By looking at this figure we can see the reliability of a method for different nominal MIP. The first value in the x axis where a line crosses the MIP constraint line will be called its failure MIP. It is obvious that the method having the higher failure MIP is the most reliable one.

One can also use the MIP plot to rate the model’s precision. If a model obtains MIP values much higher than the desired nominal MIP, it means that the method is reliable but not precise. For example a model which obtains MIP values of , and for respective nominal MIP of , and is reliable but not precise. The most precise model is the one having the nearest line to the “Nominal MIP line”. Finally, the best model in this plot is the one which is the most precise and the most reliable. It means that the best model in a MIP plot is the one having the nearest line to the upper side of the “Nominal MIP line”.

#### 4.3.2 EGSD plot

The y axis of an EGSD plot is labeled by “Normalized EGSD Value” and it represents the model’s normalized EGSD value. By looking at this figure, we can compare the efficiency of different models. It is obvious that the model having the highest line is the most inefficient model. We suggest using this plot along with the MIP plot to rate the efficiency of reliable methods. However one may ignore the reliability aspect and take advantage of this plot to compare the efficiency of different models.

#### 4.3.3 MIS plot

The y axis of an EGSD plot labeled “Normalized MIS Value” and it represents the model’s normalized MIS value. By looking at this figure, we can compare the model which obtains the tightest reliable envelope. The model having the highest line provides the widest envelope. If a model does not satisfy the MIP constraint, its normalized MIS value is not computed. The MIS plot shows each model’s normalized MIS until its “failure MIP”. We suggest using this plot along with the EGSD plot.

#### 4.3.4 Charts

Charts are used to compare one dataset measure on different datasets. We propose the following charts:

• Mean Inclusion Percentage Chart (MIP chart): the goal of this chart is to compare the mentioned methods based on their fraction of response values located inside their predicted intervals. It just displays the MIP value and it usually does not contain important information.

• MIS ratio chart: this chart displays the normalized MIS measure on different datasets.

• Equivalent Gaussian Standard Deviation chart (EGSD chart): it displays the normalized EGSD measure on different datasets.

## 5 Predictive Interval Framework

The goal of this section is to propose a new interval prediction framework. We introduce the concept of regression predictive intervals and regression Predictive Interval Model (PIM). Next, we propose a statistical test to verify if an “interval prediction model” is a “predictive interval model”. In the same context, we introduce two measures for rating interval prediction models. These measures rate the efficiency and the tightness of the obtained envelope.

### 5.1 Predictive Interval Models (PIMs)

In the frequentist interpretation of confidence intervals, a confidence interval for a parameter contains zero or one parameter. The parameter is fixed and confidence intervals change with different random samples. In the same way, the confidence level () used in tolerance intervals for regression defined in (14) and confidence intervals for regression quantiles (explained in Appendix B) mean the following: the probability that the obtained intervals contain, under re-sampling, at least a proportion of the conditional distribution of the response value is . We know that the confidence level in Neyman-Pearson confidence intervals is independent of the observed sample. It means that if we obtain -confidence -content tolerance intervals of an observed sample from a regression function, then the confidence level

does not induce any posterior probability of including

proportion of the distribution of . Therefore, the confidence coefficient in frequentist confidence intervals cannot be interpreted as posterior probability. This idea is discussed in detail in Chapter 7 in imprecise_Walley.

Hence, under the frequentist viewpoint of regression, the true conditional response variable’s inter-quantile is included with probability zero or one in the obtained interval (by using tolerance intervals for regression or confidence intervals for regression quantiles). Our goal is to obtain intervals that correctly bracket these inter-quantiles. They can be found in two ways: the first approach takes a very high confidence level like and the second method finds the smallest confidence level which includes the true unknown model. We introduce the concept of predictive intervals which refer to both of these intervals. A predictive interval built on , is guaranteed to contain for the query point , at least a desired proportion of the conditional distribution of the response variable. It can be obtained with tolerance intervals for regression or confidence intervals for regression quantiles but these concepts have so far only been treated for linear models

###### Definition 3

Let denote a random sample where and

is white noise. A

-content predictive interval for , denoted by , is an interval such that:

 PY(x)(Y(x)∈I(x)Pβ∣∣∣S)≥β, where I(x)Pβ=[L(x)Pβ,U(x)Pβ]. (23)

Since we have observed , is no longer random and the probability measure is just related to cover at least a proportion of the conditional distribution of the response variable for a specified combination of the predictors.

###### Definition 4

Let denote a random sample where and is white noise. A -content predictive interval model, denoted by , is a function such that:

 I(⋅)Pβ:Rp→I, where I={[a,b]|a,b∈R∪{−∞,∞},a

and for all in the predictor space, the obtained interval is a -content predictive interval described by (23).

### 5.2 Predictive Interval Model Test (PIM Test)

The goal of this part is to develop a statistical test with which we can rate the reliability of any interval prediction model claiming to provide -content predictive intervals. A predictive interval model must provide predictive intervals for each point in the predictor space. We saw that the distribution of changes for each value of . So, in order to see whether an interval for the regression function at the point contains at least a proportion of the s distribution of , we need (for each combination of predictors ) a sample set from the distribution of , and then we can observe if the constructed interval contains a proportion s distribution of the distribution of . In the same way, in order to verify if the methods works for an entire dataset , we need a distinct sample set for each and this sample must be drawn from . Since a sample set is required for each , the described procedure requires a huge dataset having many observations for each point in the feature space which makes it impractical or impossible for multiple regression problems. However, we can make some approximations and use the results stated above to derive the following test. We first begin by defining a variable obtaining MIP on the dataset. Then we will see that this variable can be approximated by a normal distribution, so we use the quantiles of the standard normal distribution as the function used in Equation (22).

#### 5.2.1 Simultaneous Inclusion for Predictive Intervals

A -content predictive interval must contain at least proportion of the conditional distribution of . Hence, the probability in (23) is just related to contain at least a proportion of the conditional distribution of the . We define the function as:

 V(x)={1if Y(x)∈I(x)Pβ,0otherwise.

The above definition means that the probability that is equal to is , so

has a Bernoulli distribution with

.

 V(x)∼Bernoulli(β). (25)

Suppose that we have a dataset of observations with which we build our model, and other observations , not contained in the original dataset as our test set. If we apply the function on the whole test set and sum the result, we obtain:

 MIPβ=n−1vnv∑i=1V(xvi). (26)

Therefore, we can deduce that

has a Binomial distribution. This is expressed formally by (

27) where is a binomial distribution with and .

 nvMIPβ∼Binomial(nv,β). (27)

If is sufficiently large, we can assume that has a normal distribution as:

 MIPβ∼N(β,β(1−β)nv). (28)

Thus in a PIM, the fraction of instances having their response value included in their predictive intervals is on average . This means such predictive intervals for regression have in average a simultaneous content of so, on average, they are like simultaneous regression tolerance intervals. For small to medium datasets, is computed in a cross-validation or leave-one-out schema on the whole dataset which means that .

#### 5.2.2 Testing Predictive Interval Models

As we have seen in (28), the random variable can usually be well approximated by a normal distribution. The test below is used to verify, with level , if the interval prediction method does provide -content predictive intervals for all in the predictor space.

 H0:β≥β0 versus H1:β<β0, (29)

then can be rejected with significance where:

 MIPβ

where is the -quantile of the standard normal distribution. So if (30

) is not true, we fail to reject the null hypothesis with significance level

, and accept the underlying model as a model providing -content predictive intervals.

In this work we used a significance level of , so for each dataset we compared the ’s value on the test set with . Thus, any PIM must pass the PIM test as defined below:

 PIM Test: MIPβ≥F0.05β0,nv. (31)

For the sake of simplicity, we refer to for a given dataset and desired proportion as MIP (Mean Inclusion Percentage). As we have seen in (28), the fraction of response values inside their -content predictive intervals converges to , so the test defined in (31), where is used to verify if the obtained intervals, with a confidence level and on average and not at least like in simultaneous tolerance internals, do simultaneously contain a proportion of the distribution of for all in the predictor space.

#### Remark

The average simultaneous content of PIMs mentioned just above, means that: for a PIM built on the training set , if we test the PIM with a large number of new observations , the fraction of the new response values included in their predictive intervals is guaranteed to be . This is expressed formally in (28).

Note that, this is not the same as the average content for any -content interval prediction models. A -content interval prediction model built on a specified training set , is not guaranteed to contain, on average, a proportion of successful predictions. However, if we have a large number of -content interval prediction model , built on a large number of training set , and we test each with a large number of unseen instances . In this case, the expected probability of each -content interval to contain a proportion of the distribution of is . This is expressed formally in (16).
Thus, having a -content interval prediction model built on the training set , if we test the model with large number of training set and new testing observations, the fraction of response value included in their predictive intervals is on average . An arbitrary -content interval prediction method needs a large number of training set and new testing observations, or one large training set and a large number of new testing observations, to guarantee an equal to . However a PIM, must guarantee an with any training set.

### 5.3 Hyper-parameter Tuning and Model Selection

This part addresses hyper-parameter tuning questions related to predictive interval explained in Section 5.1. We first convert the hyper-parameter tuning problem to an optimization problem. Then, we propose a simple optimization algorithm that takes advantage of existing least-squares regression method to find an efficient model for predicting both the mean and the predictive interval. Finally, we give an application with tolerance intervals for least-square regression. Note that, the dataset does not change in the hyper-parameter tuning phase, so we do not use it to index our variables.

#### 5.3.1 The Optimization Problem

Let denote the vector of hyper-parameter of the -content interval prediction method . The vector is divided into two set of elements. The first set is the vector of hyper-parameter for the regression method and the second set is specific to the interval prediction method. Our goal is to find the optimal value of , for any -content interval prediction model that is also a -content predictive interval model (pass the PIM test). This result in the following problem:

 λ0=Argmin(MISλβ), % where MISλβ=1nn∑i=1I(xi)λβ (32)

Subject to:

 PIM Tuning Constraint: MIPλ0β=βλ-Specific Constraints:  depends on the % PIM. (33)

Note that the MIP constraint is a hard constraint and there is no trade-off between satisfying this constraint and minimizing the MIS. We found which satisfies the constraint defined above. This results in intervals having the smallest MIS measure where MIP and MIS are computed based on a leave-one-out or 10-fold cross validation scheme on the training set.

#### 5.3.2 The Optimization Algorithm

We propose the following hyper-parameter tuning method:

• First, find the regression model with the smallest prediction error MSE.

• Then use an optimization method that finds the that satisfies the tuning constraint and also results in intervals having the smallest MIS measure on the previously found regression model.

• Note that the space of in the second step is restricted to values that have their regression hyper-parameter equal to the hyper-parameter of the regression model found in the first step.

This process divide the optimization in two phase, the hyper-parameter of the regression method tuning and the hyper-parameter tuning of the interval prediction method. In order to obtain the smallest intervals (measured by MIS), we need the smallest values of prediction errors. So, choosing the regression model that minimizes the LOO or 10-fold cross validation MSE can give the most efficient PIM. This approach does not always find the optimal solution but remains a simple and efficient tuning algorithm. By using this algorithm, we can take advantage of existing regression packages to find the best regression model, and then use the aforementioned method to tune the the remaining hyper-parameters. Finally, this approach give us one model for predicting both the mean and the predictive interval.

#### 5.3.3 An Application with Tolerance Intervals

We know that -content predictive intervals can be obtained via tolerance intervals for regression or via confidence interval on regression quantiles and such intervals are obtained upon regression models which may themselves have hyper-parameters (For the sake of brevity we continue this part with tolerance intervals for regression but the same procedure and statements hold for confidence interval on regression quantiles). Consider an example of constructing PIMs by tolerance intervals on a KNN regression. This model has two hyper-parameters, the KNN regression’s hyper-parameter which is the number , and the confidence level which is the coverage level in -coverage -content tolerance intervals. So we have . First we find the regression’s hyper-parameters ; it is for KNN but it can be kernel related parameters in SVM regression or nothing in the linear regression (this hyper-parameter depends on the regression method). Once we have found the best regression model, we use an iterative algorithm that searches the smallest that satisfies the tuning constraint defined above. High values of will guarantee the PIM tuning constraint but the computed intervals can be very large, so the search begins with a high confidence value like or and we try to decrease and thus decrease the mean interval size. This procedure is repeated as long as the tuning constraints are satisfied and the search strategy is left to the user. Some datasets might require just 2 or 3 iterations but some others may work with small . It depends on the dataset and it can be influenced by the domain expert.

## 6 Tolerance Intervals for Local Linear Regression

In the previous section we introduced the predictive interval framework. In this section we introduce two methods for obtaining tolerance intervals for local linear regression which in the next section, will be employed to obtain PIMs for LLR. We assume that the mean regression function is locally linear and the prediction error is locally homoscedastic. Our methods do not neglect the regression bias and find variable size intervals that work properly with biased regression models. The proposed tolerance intervals are constructed based on the leave-one-out or -fold cross validation prediction errors of the local linear regression. The local linear regression needs a regression bandwidth which could be found by any of the existing methods in the literature. In order to obtain our non-parametric predictive intervals, we need a second bandwidth, which is the tolerance interval bandwidth (LHNPE bandwidth). This work suggests two different tolerance interval bandwidths: a bandwidth having a fixed number of neighbors and a bandwidth having a variable one but both obtain variable size intervals.

Another important subject is the regression bias. It is well known that the optimal smoothing non-parametric regression method consists of balancing between the bias and the standard deviation. This non-parametric bias does not vanish even with large sample sizes, so it is important for interval prediction methods to take it into account. The idea behind our tolerance intervals is to exploit the local density of prediction error () in the LHNPE neighborhood (explained further) of the to find the most appropriate intervals that contain the desired proportion of the distribution of . We find tolerance intervals for the response variable by adding the regression estimates to the tolerance intervals for the prediction error. Prediction error’s tolerance intervals are centered on the estimation of negative bias, so when added to the regression estimates, they remove the regression bias. Therefore, we obtain response variable’s tolerance intervals which correctly contains, with confidence , a proportion of .

### 6.1 Theoretical Aspect

This part describes the theoretical context of tolerance intervals for local linear regression. We first define the concept of a Local Homoscedastic Normal Prediction Error (LHNPE) regression estimator. Then, we define the LHNPE neighborhood of a query point and in the end we will use a simple straightforward inference to obtain the formula of tolerance intervals for local linear regression.

###### Definition 5

The oscillation of the function on an open set is defined as:

 ωf(U)=supx∈Uf(x)−infx∈Uf(x).
###### Definition 6

A regression estimator is a Local Homoscedastic Normal Prediction Error (LHNPE) if it satisfies the following conditions:

• Normal distribution: the prediction error has a normal distribution.

• Almost constant distribution the prediction error: We suppose that the mean and the standard deviation of the distribution for the prediction error have small local oscillations. This is defined formally as:

For all x, there exists an open set , such that:

 ωμ(εpredx)(U)≤υ1 and ωσ(εpredx)(U)≤υ2,

where and are small fixed positive values.

###### Definition 7

Let be a LHNPE regression estimator for the query point . The LHNPE neighborhood for are instances for which the prediction error satisfies the LHNPE conditions. This neighborhood is described as below:

 Ksetx∗={(xi,Yi)|d(x∗,xi)≤b}, (34)

where is a distance function in the feature space and denotes the LHNPE bandwidth.

Note that the LHNPE bandwidth is different from the regression bandwidth in local linear regression:

 Regx∗={(xi,Yi)|d(x∗,xi)≤breg}. (35)

The regression bandwidth minimizes the regression bias-variance trade-off but the LHNPE bandwidth is used to find the neighborhood which satisfies the LHNPE conditions. The LHNPE neighborhood is almost always included in the regression neighborhood:

 Ksetx∗⊆Regx∗, (36)

because the constant’s distribution in the neighborhood of the query point usually occurs inside its regression neighborhood. It is possible to find two different regression neighborhoods being next to each other having approximately the same prediction error distribution and not the same regression neighborhood. There are already several references on regression bandwidth selection in non-parametric regression. We do not treat this problem and the reader can find more details in fan_gijbels_1996 and Applied_non_Hardle.

###### Proposition 1

Let denote a regression function and let denote its Local Linear regression estimator. If our regression estimator satisfies the conditions below:

• Normal error distribution: .

• Normal distribution of the local linear estimator: , Fan95_localpolynomial have shown that this assumption holds under certain regularity conditions.

• satisfies the LHNPE conditions defined above.

where is the estimator’s bias, is the variance of the error and is the variance of the estimator. Then the -confidence -content regression tolerance interval for the query point is:

 I(x∗)Tγ,β=^f(x∗)+I(εpredx∗)Tγ,β, (37)
 where εpredx∗=Y(x∗)−^f(x∗).

In the above equation, and denote, respectively, the regression tolerance interval and the prediction error tolerance interval.

Proof: See Appendix B.

Even though we have a biased prediction, our tolerance interval for contains the desired proportion of the conditional distribution of the response variable. This is due to the fact that our tolerance intervals on the response variable are computed based on the tolerance intervals on the prediction error . LHNPE conditions assume that the prediction error has an unknown normal distribution with mean and variance being respectively the negative bias and the variance of the prediction error. So, for high values of and for ,