Log In Sign Up

Nonparametric Quantile Regression: Non-Crossing Constraints and Conformal Prediction

by   Wenlu Tang, et al.

We propose a nonparametric quantile regression method using deep neural networks with a rectified linear unit penalty function to avoid quantile crossing. This penalty function is computationally feasible for enforcing non-crossing constraints in multi-dimensional nonparametric quantile regression. We establish non-asymptotic upper bounds for the excess risk of the proposed nonparametric quantile regression function estimators. Our error bounds achieve optimal minimax rate of convergence for the Holder class, and the prefactors of the error bounds depend polynomially on the dimension of the predictor, instead of exponentially. Based on the proposed non-crossing penalized deep quantile regression, we construct conformal prediction intervals that are fully adaptive to heterogeneity. The proposed prediction interval is shown to have good properties in terms of validity and accuracy under reasonable conditions. We also derive non-asymptotic upper bounds for the difference of the lengths between the proposed non-crossing conformal prediction interval and the theoretically oracle prediction interval. Numerical experiments including simulation studies and a real data example are conducted to demonstrate the effectiveness of the proposed method.


Estimation of Non-Crossing Quantile Regression Process with Deep ReQU Neural Networks

We propose a penalized nonparametric approach to estimating the quantile...

Sparse Quantile Regression

We consider both ℓ _0-penalized and ℓ _0-constrained quantile regression...

Quantile index regression

Estimating the structures at high or low quantiles has become an importa...

Deep Quantile Regression: Mitigating the Curse of Dimensionality Through Composition

This paper considers the problem of nonparametric quantile regression un...

Solution to the Non-Monotonicity and Crossing Problems in Quantile Regression

This paper proposes a new method to address the long-standing problem of...

Prediction intervals with controlled length in the heteroscedastic Gaussian regression

We tackle the problem of building a prediction interval in heteroscedast...

The upper-crossing/solution (US) algorithm for root-finding with strongly stable convergence

In this paper, we propose a new and broadly applicable root-finding meth...

1 Introduction

How to assess uncertainty in prediction is a fundamental problem in statistics. Conformal prediction is a general distribution-free methodology for constructing prediction intervals with a guaranteed coverage probability in finite samples

(papadopoulos2002inductive; vovk2005algorithmic). We propose a method for nonparametric estimation of quantile regression functions with the constraint that two quantile functions for different quantile levels do not cross. We then use estimated non-crossing quantile regression functions for constructing conformal prediction intervals.

Since vovk2005algorithmic formally introduced the basic framework of conformal prediction, there has been a number of important advancements on conformal prediction (vovk2012conditional; lei2013distribution; lei2014distribution; lei2018distribution). lei2014distribution and vovk2012conditional showed that the conditional validity for prediction interval with finite length is impossible without any regularity and consistency assumptions on the model and the estimator. zeni2020conformal established that the marginal validity, a conventional coverage guarantee, can be achieved under the assumption that the observations are independent and identically distributed. Recently, several papers have studied the coverage probability, the prediction accuracy in terms of the length of prediction interval, and the computational complexities of conformal prediction using neural networks (Barber2021limits; lei2018distribution; papadopoulos2008inductive).

Earlier works on conformal prediction were based on estimating a conditional mean function and constructing intervals of constant width, assuming homoscedastic errors. Recently, romano2019conformalized

proposed a conformal prediction method based on quantile regression, called conformalized quantile regression. This method is adaptive to data heteroscedasticity and can have varying length across the input space. A similar construction of adaptive and distribution-free prediction intervals using deep neural networks have been considered by

kivaranovic2020adaptive. A comparison study of conformal prediction based on quantile regression with two choices of the conformity scores is given in sesia2020comparison.

Nonetheless, associated with the great flexibility of regression quantiles is the quantile-crossing phenomenon. The quantile crossing problem, due to separate estimation of regression quantile curves at individual quantile levels, has been observed in simple linear quantile regression, and can happen more frequently in multiple regression. Several papers have attempted to deal with the crossing problem. In linear quantile regression, koenker1984note studied a parallel quantile plane approach to avoid the crossing problem. he1997quantile proposed a restricted regression quantile method that avoids quantile crossing while maintaining modeling flexibility. neocleous2008monotonicity

established the asymptotic guarantees that the probability of crossing will tend to zero for the linear interpolation of the Koenker-Bassett linear quantile regression estimator. These papers focused on the linear quantile regression setting.

bondell2010noncrossing proposed a constrained quantile regression to avoid the crossing problem, and considered nonparametric non-crossing quantile regression using smoothing splines with a one-dimensional predictor. However, this approach may not work well with a multi-dimensional predictor. Recently, interesting findings on simultaneous quantile regression that alleviates the crossing quantile problem were reported. tagasovska2019single proposed simultaneous quantile regression to estimate the quantiles by minimizing the pinball loss where the target quantile is randomly sampled in every training iteration. brando2022deep proposed an algorithm for predicting an arbitrary number of quantiles, which ensures the quantile monotonicity by imposing a restriction on the partial derivative of the quantile functions.

In this paper, we make the following methodological and theoretical contributions.

  • We propose a penalized deep quantile regression approach, in which a novel penalty function based on the rectified linear unit (ReLU) function is proposed to encourage the non-crossing of the estimated quantile regression curves.

  • Based on the estimated non-crossing quantile regression curves, we study a conformalized quantile regression approach to construct non-crossing conformal prediction intervals, which are fully adaptive to heteroscedasticity and have locally varying length.

  • We study the properties of the ReLU-penalized nonparametric quantile regression using deep feedforward neural networks. We derive non-asymptotic upper bounds for the excess risk of the non-crossing empirical risk minimizers. Our error bounds achieve optimal minimax rate of convergence, and the prefactor of the error bounds depends polynomially on the dimension of the predictor, instead of exponentially.

  • We establish theoretical guarantees of valid coverage of the proposed approach to constructing conformal prediction intervals. We also give a non-asymptotic upper bound of the difference between the non-crossing conformal prediction interval and the oracle prediction interval. Extensive numerical studies are conducted to support the theory.

2 Non-crossing nonparametric quantile regression

In this section, we describe the proposed method for nonparametric estimation of quantile curves with non-crossing constraints.

For any given , the conditional quantile function of given is defined by


where is a response, is a

-dimensional vector of covariates, and

is the conditional distribution function (c.d.f) of given . It holds that , where is the conditional probability measure of given . By definition (1), an inherent constraint of the conditional quantile curves is the monotonicity property: for any , it holds that


Therefore, estimated conditional quantile curves should also satisfy this property, otherwise, it would be difficult to interpret the estimated quantiles.

Quantile regression is based on the check loss function defined by


where is the indicator function. The target conditional quantile function is the minimizer of over all measurable function (koenker_2005). In applications, only a finite random sample is available. The quantile regression estimator for a given is


where is a class of functions which may depend on the sample size .

For two quantile levels , we can obtain the estimated quantile curves and by using (4) separately for and . However, such estimated quantile curves may not satisfy the monotonicity constraint (2), that is, there may exist for which Below, we propose a penalized method to mitigate this problem.

2.1 Non-crossing quantile regression via ReLU penalty

In this subsection, we propose a penalized quantile regression framework to estimate quantile curves that can avoid the crossing problem. We first introduce a ReLU-based penalty function to enforce the non-crossing constraint in quantile regression. The ReLU penalty is defined as

This penalty function encourages when combined with the quantile loss function. At the population level, for , the expected penalized quantile loss function for is


where is a tuning parameter. Note that, with defined in (1), for . Moreover, as mentioned earlier, is the minimizer of the expected check loss over measurable function for respectively. The following lemma establishes the identifiability of the quantile functions through the loss function (5). This is the basis of the proposed method.

Lemma 2.1


is a random variable with c.d.f.

. For a given , any element in is a minimizer of the expected check loss function . Moreover, the pair of true conditional quantile functions is the minimizer of the loss function (5), that is,

where is a class of measurable functions that contains the true conditional quantile functions.

Lemma 2.1 shows that the ReLU penalty does not introduce any bias at the population level. This is a special property of this penalty function, which differs from the usual penalty function such as the ridge penalty used in penalized regression.

When only a random sample is available, we use the empirical loss function


and define


where is a class of functions that approximate . In (6), a positive value of (i.e., the quantile curves cross at ) will be penalized with the penalty parameter . On the other hand, a negative value of (i.e., the quantile curves do not cross at ) will not incur any penalty. Therefore, with a sufficiently large penalty parameter , quantile crossing will be prevented.

Throughout the paper, we choose the function class to be a function class of feedforward neural networks. We note that other approximation classes can also be used. However, an important advantage of neural network functions is that they are effective in approximating smooth functions in , see, for example, jiao2021deep and the references therein. A detailed description of neural network functions will be given in Section 2.2. Figure 1 previews non-crossing curve estimation via a toy example, which shows the comparison between the proposed method in (7) (with penalty) and the deep quantile estimation studied in shen2021deep (without penalty); see Section B.1.2 in Appendix for more details.

Figure 1: A toy example shows that NC-CQR can avoid quantile crossing. The quantile curves on the left panel are estimated by NC-CQR, and those on the right are estimated by the deep quantile regressionshen2021deep. The yellow color indicates quantile crossing.
Remark 2.1

The rectified linear unit (ReLU) function is a piecewise-linear function. An important advantage of the ReLU penalty is that it is convex with respect to and , thus the loss function in (6) is also convex with respect to and .

Remark 2.2

The tuning parameter controls the amount of penalty on crossing quantile curves. As shown in Section 4, a bigger leads to a larger estimation bias. Therefore, there is a trade-off between the crossing and the accuracy of prediction intervals. Both can be achieved with proper choice of . In the appendix, we propose a cross-validation method to select .

Remark 2.3

bondell2010noncrossing proposed a non-crossing nonparametric quantile regression using smoothing splines with non-crossing constraints. For two given quantile levels , they proposed to estimate the quantile curves and by minimizing the following constrained loss function


where denotes the derivative of a function , and is the total variation penalty to guarantee smoothness. Such a spline-based method works well in the one-dimensional setting, however, it is difficult to apply this approach to multi-dimensional problems.

2.2 ReLU Feedforward neural networks

For the estimation of conditional quantile functions, we choose the function class in (7) to be , a class of feedforward neural networks with parameter , depth , width , size

, number of neurons

and satisfying for some positive constant , where is the supreme norm of a function . Note that the network parameters may depend on the sample size , but this dependence is omitted for notational simplicity. Such a network has hidden layers and layers in total. We use a vector to describe the width of each layer; particularly in nonparametric regression problems, is the dimension of the input and is the dimension of the response. The width is defined as the maximum width of hidden layers, i.e., the size is defined as the total number of parameters in the network , i.e., the number of neurons is defined as the number of computational units in hidden layers, i.e., For an MLP , its size satisfies

From now on, we write as for short. Then, at the population level, the non-crossing nonparametric quantile estimation is to find a pair of measurable functions satisfying


3 Non-crossing quantile regression for conformal prediction

Now suppose we have a new observation . We are interested in predicting the corresponding unknown value of . Our goal is to construct a distribution-free prediction interval with a coverage probability satisfying


for any joint distribution

and any sample size , where is often called a miscoverage rate. We refer to such a coverage stated in (10) as marginal coverage.

First, we define an oracle prediction band based on the conditional quantile functions. For a pre-specified miscoverage rate , we consider the lower and upper quantiles levels such as and . Then, a conditional prediction interval for given with a nominal miscoverage rate is


where and are the conditional quantile functions defined in (1) for quantile levels and respectively. Such a prediction interval with true quantile functions is ideal but cannot be constructed, only a corresponding empirical version can be estimated based on data in practice.

Next, we use the split conformal method vovk2005algorithmic for constructing non-crossing conformal intervals. We split the observations into two disjoint sets: a training set and a calibration set . Non-crossing deep neural estimators of and based on the training set are given by



and is a tuning parameter. A key step is to compute the conformity score based on the calibration set romano2019conformalized, which is defined as


Let . The conformity scores in quantify the errors incurred by the plug-in prediction interval evaluated on the calibration set , and can account for undercoverage and overcoverage.

Finally, for a new input data point and , the prediction interval for is defined as


where is the -th empirical quantile of , namely, the -th smallest value in , and denotes the smallest integer no less than . Here, is the cardinality of a set . The empirical quantile is a data-driven quantity, which conformalizes the plug-in prediction interval.

In contrast to the conformalized quantile regression procedure in romano2019conformalized, our method produces conformal prediction intervals that avoid the quantile crossing problem. We refer to the proposed non-crossing conformalized quantile regression as NC-CQR, and the corresponding conformal interval (14) as NC-CQR interval.

Remark 3.1

The usual linear quantile regression (QR) model assumes that, for a given ,


where and are the intercept and slope parameters. Following romano2019conformalized, a conformal interval based on linear quantile regression can be constructed. Specifically, by splitting the observations into two disjoint subsets: a training set and a calibration set , we can fit model (15) on the training set and obtain the estimators for and for a given , denoted by . Under model (15), the conformal interval with miscoverage rate is given by


where , with and , and is the -th smallest value among the conformity scores .

We summarize the implementation of NC-CQR interval construction in the following algorithm.

Algorithm Computation of non-crossing conformalized prediction intervals

Input: Observations and miscoverage level



  1. Split into a training set and a calibration set .

  2. Fit to (12) to obtain for and .

  3. Compute the conformity score ,

  4. Find , the -th smallest value of
       5. Compute the prediction band according to (14).

4 Theoretical properties

In this section, we study the theoretical properties of the proposed NC-CQR method. We evaluate NC-CQR using the following two criteria:

  • Validity: Under proper conditions, a conformal prediction interval satisfies that

  • Accuracy: If the validity requirement (17) is satisfied, a conformal prediction interval should be as narrow as possible.

The validity requirement (17) is evaluated based on the finite-sample marginal coverage in (10), which holds in the sense of averaging over all possible test values of . The accuracy of a prediction interval is usually measured by the discrepancy defined in (19) between the lengths of the prediction interval and the oracle one.

We assume that the target conditional quantile function defined in (1) is a -Hölder smooth function with as stated in condition (C3) below. Let , and , where denotes the largest integer strictly smaller than and denotes the set of non-negative integers. For a finite constant , the Hölder class of functions is defined as


where with and . We assume the following conditions.

  • (C1) The observations are i.i.d. copies of .

  • (C2) (i) The support of the predictor vector is a bounded compact set in , and without loss of generality we let (ii) Let be the probability measure of . The probability measure is absolutely continuous with respect to the Lebesgue measure.

  • (C3) For any fixed , the target conditional quantile function defined in (1) is a Hölder smooth function of order and a finite constant .

  • (C4) There exist constants and such that for any and any ,

    for all up to a -negligible set, where

    is the conditional cumulative distribution function of

    given .

Condition (C1) is a basic assumption in conformal inference. The boundedness support assumption in Condition (C2) is made for technical convenience in the proof for deep neural estimation. Condition (C3) is a regular smoothness assumption on the target regression functions so that whose approximation result using deep neural networks can be obtained. Condition (C4) is imposed for self-calibration of the resulting neural estimator.

Theorem 4.1

Under Condition (C1), for a new i.i.d pair , the proposed NC-CQR interval satisfies

Theorem 4.1 shows that the proposed NC-CQR interval enjoys a rigorous coverage guarantee. The proof of this theorem is given in the Appendix.

We now quantify the accuracy of the prediction interval in terms of the difference between the NC-CQR interval and the oracle in (11) on the support of Let be the probability measure of defined in Condition (C2) and define for . The difference between the lengths of and is given by


By the triangle inequality, we have where

To bound and , we need to bound the error To this end, we first derive bounds for the the excess risk of defined as where is defined in (5). Without loss of generality, we assume that and , where denotes the smallest integer no less than and denotes the largest integer no greater than .

Theorem 4.2

Letting and , then the width, depth and size of the neural network satisfy

Suppose that Conditions (C1)-(C3) hold. If , the non-asymptotic error bound of the excess risk satisfies

where is a universal constant independent of and and .

The convergence rate of the excess risk is up to a logarithmic factor. The next theorem gives an upper bound for the accuracy of the proposed prediction interval defined in (19).

Theorem 4.3

(Non-asymptotic upper bound for prediction accuracy) Suppose that Conditions (C1)-(C4) hold. Let be a class of ReLU activated feedforward neural networks with width, depth specified as in Theorem 4.2 and let be the empirical risk minimizer over . Then, there exists a constant , for ,

where is a constant independent of and .

Theorem 4.3 gives an upper bound for the difference between the lengths of our proposed prediction interval and the oracle interval. With properly-selected neural network parameters, the oracle band can be consistently estimated by the NC-CQR band.

Finally, we consider the conformal interval based on linear quantile regression defined in (16).

Corollary 4.1

Suppose that at least one of and is a non-linear function on a subset of with non-zero measure. Then for any sample size , the accuracy of the conformal band defined in (16) is strictly worse than that of the oracle conformal band , i.e., there exists an such that

According to Corollary 4.1, a conformal interval based on the linear quantile regression will not reach the oracle accuracy in the presence of nonlinearity.

5 Numerical studies

5.1 Synthetic data

We first simulate a synthetic dataset with a -dimensional feature and a continuous response from the distributions defined in Section B.1.3. Our method is applied to independent observations from this distribution, using of them to train the deep quantile regression estimators and of them for calibration. The remaining data is for testing. We consider different dimensions to investigate how the dimensionality of the input affects the overall performance under multivariate input settings. The result is shown in Figure 2. In Figure 1(c), when dimension increases, NC-CQR method performs better than CQR in terms of smaller crossing rate. It shows that the NC-CQR method can mitigate the crossing problem in quantile regression. We also give a 3D visualization of the conformal intervals of our proposed NC-CQR estimation and that of the CQR method when in Figure 2. One can see from Figure 1(a) that the conformal interval by our proposed NC-CQR method does not have any crossing, while in Figure 1(b), the red region indicates that the lower bound is larger than the upper bound of the interval. More details of the results are given in Section B.1.3.

(a) The crossing rates of two methods as dimension increases.
(b) The non-crossing quantile interval.
(c) The quantile interval without non-crossing penalty.
Figure 2: The comparison between the proposed interval estimation with penalty and interval estimation without penalty. The blue surface is the estimated -th upper quantile surface and the yellow surface is the -th lower quantile surface. Red color indicates quantile crossing.

Our next synthetic example illustrates the adaptive property of the proposed NC-CQR prediction interval. We consider the following model with a discontinuous regression function:


A 90% NC-CQR interval is shown in color blue. The green lines represent a 90% conformal prediction intervals based on the standard linear quantile regression. As can be seen from the plots, the NC-CQR intervals automatically adapt to the shape of the regression function and the heteroscedasticity of the model error. Additional examples demonstrating the advantages of NC-CQR prediction intervals are given in the Appendix.

Figure 3: Prediction intervals based on NC-CQR and linear QR for model (20)

5.2 Real data examples

We compute the NC-CQR prediction intervals for the following datasets: bike sharing111, house price222 and the air foil333 data sets. To examine the performance of the ReLU penalty, we compare the NC-CQR with CQR. We compute the conformal prediction intervals based on these two methods using the same deep quantile regression estimator. Their performances are evaluated as in Section 5.1. We subsample of data for training,

for calibration, and the remaining data is used for testing. All features are standardized to have 0 mean and unit variance. The nominal coverage rate is set to be

. Figure 4 shows that the proposed NC-CQR method can mitigate the crossing problem encountered in the CQR estimation. Additional details of the result are given in Section B.2 in the Appendix.

(a) CQR interval
(b) NC-CQR interval
Figure 4: The performance of NC-CQR and CQR on several real data sets based a deep nonparametric quantile regression model. Since the input are multidimensional, we sorted the data by the value of responses and depict them along with the corresponding estimated interval in a 2D plot.

6 Concluding remarks

We have proposed NC-CQR, a penalized deep quantile regression method which avoids the quantile crossing problem via a ReLU penalty. We have derived non-asymptotic upper bounds for the excess risk of the non-crossing empirical risk minimizers. Our error bounds achieve optimal minimax rate of convergence, the with the prefactor of the error bounds depending polynomially on the dimension of the predictor, instead of exponentially. The proposed ReLU penalty for monotonic constraints is applicable to problems with different loss functions. It is applicable to other nonparametric estimation method, including smoothing splines, kernel method and neural network. Therefore, the proposed ReLU penalty method may be of independent interest. Based on the estimated non-crossing conditional quantile curves, we construct conformal prediction intervals that are fully adaptive to heterogeneity. We have also provided conditions under which the proposed NC-CQR conformal prediction bands are valid with the correct coverage probability and achieve optimal accuracy in terms of the width of the prediction band.

A main limitation of the proposed method and the theoretical properties is that they rely on the i.i.d. assumption for the data. There are applications where observations are not independent (e.g., time series data) and there may be distribution shift. It would be interesting to extend the proposed method to deal with such non-i.i.d. settings and establish the corresponding theoretical properties.

Appendix A Appendix: Proofs

In this appendix, we prove the theoretical results stated in Section 4. First, we prove Lemma 2.1. For convenience, we restate this lemma below.

Lemma A.1

Suppose is a random variable with c.d.f. . For a given , any element in is a minimizer of the expected check loss function . Moreover, the pair of the true conditional quantile functions is the minimizer of the loss function (5), that is,

where is a class of measurable functions that contains the true conditional quantile functions.

Proof[of Lemma A.1] First, we write

Taking derivative w.r.t , we obtain

Since is a c.d.f., it is monotonic and thus any element of minimizes the expected loss . For functions , recall that the loss function (5) in the main context is

By the definition of in (1) in the main context, satisfies for . Then minimizes the expected loss for . Since the true quantile function satisfies the monotonicity requirement (2) in the main context, then . Therefore, is the minimizer of the loss function (5) in the main context.

Lemma A.2

Suppose that


for some sequence and such that and when . Define

Then, under Conditions (C1)-(C4), for any independent of ,

Furthermore, if the calibration set is partitioned into

then for any constant , we have

Proof[of Lemma A.2] By Markov inequality, we have

When the calibration set is partitioned into

it follows from Hoeffding’s inequality that, for any ,

Letting , the proof of the second inequality in Lemma (A.2) is completed.

Proof[of Theorem 4.1] The proof of Theorem 4.1 is similar to that of Theorem 1 in romano2019conformalized. Let be the conformity score

at the test point Recall that the prediction interval

By the construction of the prediction interval,