1 Introduction
How to assess uncertainty in prediction is a fundamental problem in statistics. Conformal prediction is a general distributionfree methodology for constructing prediction intervals with a guaranteed coverage probability in finite samples
(papadopoulos2002inductive; vovk2005algorithmic). We propose a method for nonparametric estimation of quantile regression functions with the constraint that two quantile functions for different quantile levels do not cross. We then use estimated noncrossing quantile regression functions for constructing conformal prediction intervals.Since vovk2005algorithmic formally introduced the basic framework of conformal prediction, there has been a number of important advancements on conformal prediction (vovk2012conditional; lei2013distribution; lei2014distribution; lei2018distribution). lei2014distribution and vovk2012conditional showed that the conditional validity for prediction interval with finite length is impossible without any regularity and consistency assumptions on the model and the estimator. zeni2020conformal established that the marginal validity, a conventional coverage guarantee, can be achieved under the assumption that the observations are independent and identically distributed. Recently, several papers have studied the coverage probability, the prediction accuracy in terms of the length of prediction interval, and the computational complexities of conformal prediction using neural networks (Barber2021limits; lei2018distribution; papadopoulos2008inductive).
Earlier works on conformal prediction were based on estimating a conditional mean function and constructing intervals of constant width, assuming homoscedastic errors. Recently, romano2019conformalized
proposed a conformal prediction method based on quantile regression, called conformalized quantile regression. This method is adaptive to data heteroscedasticity and can have varying length across the input space. A similar construction of adaptive and distributionfree prediction intervals using deep neural networks have been considered by
kivaranovic2020adaptive. A comparison study of conformal prediction based on quantile regression with two choices of the conformity scores is given in sesia2020comparison.Nonetheless, associated with the great flexibility of regression quantiles is the quantilecrossing phenomenon. The quantile crossing problem, due to separate estimation of regression quantile curves at individual quantile levels, has been observed in simple linear quantile regression, and can happen more frequently in multiple regression. Several papers have attempted to deal with the crossing problem. In linear quantile regression, koenker1984note studied a parallel quantile plane approach to avoid the crossing problem. he1997quantile proposed a restricted regression quantile method that avoids quantile crossing while maintaining modeling flexibility. neocleous2008monotonicity
established the asymptotic guarantees that the probability of crossing will tend to zero for the linear interpolation of the KoenkerBassett linear quantile regression estimator. These papers focused on the linear quantile regression setting.
bondell2010noncrossing proposed a constrained quantile regression to avoid the crossing problem, and considered nonparametric noncrossing quantile regression using smoothing splines with a onedimensional predictor. However, this approach may not work well with a multidimensional predictor. Recently, interesting findings on simultaneous quantile regression that alleviates the crossing quantile problem were reported. tagasovska2019single proposed simultaneous quantile regression to estimate the quantiles by minimizing the pinball loss where the target quantile is randomly sampled in every training iteration. brando2022deep proposed an algorithm for predicting an arbitrary number of quantiles, which ensures the quantile monotonicity by imposing a restriction on the partial derivative of the quantile functions.In this paper, we make the following methodological and theoretical contributions.

We propose a penalized deep quantile regression approach, in which a novel penalty function based on the rectified linear unit (ReLU) function is proposed to encourage the noncrossing of the estimated quantile regression curves.

Based on the estimated noncrossing quantile regression curves, we study a conformalized quantile regression approach to construct noncrossing conformal prediction intervals, which are fully adaptive to heteroscedasticity and have locally varying length.

We study the properties of the ReLUpenalized nonparametric quantile regression using deep feedforward neural networks. We derive nonasymptotic upper bounds for the excess risk of the noncrossing empirical risk minimizers. Our error bounds achieve optimal minimax rate of convergence, and the prefactor of the error bounds depends polynomially on the dimension of the predictor, instead of exponentially.

We establish theoretical guarantees of valid coverage of the proposed approach to constructing conformal prediction intervals. We also give a nonasymptotic upper bound of the difference between the noncrossing conformal prediction interval and the oracle prediction interval. Extensive numerical studies are conducted to support the theory.
2 Noncrossing nonparametric quantile regression
In this section, we describe the proposed method for nonparametric estimation of quantile curves with noncrossing constraints.
For any given , the conditional quantile function of given is defined by
(1) 
where is a response, is a
dimensional vector of covariates, and
is the conditional distribution function (c.d.f) of given . It holds that , where is the conditional probability measure of given . By definition (1), an inherent constraint of the conditional quantile curves is the monotonicity property: for any , it holds that(2) 
Therefore, estimated conditional quantile curves should also satisfy this property, otherwise, it would be difficult to interpret the estimated quantiles.
Quantile regression is based on the check loss function defined by
(3) 
where is the indicator function. The target conditional quantile function is the minimizer of over all measurable function (koenker_2005). In applications, only a finite random sample is available. The quantile regression estimator for a given is
(4) 
where is a class of functions which may depend on the sample size .
For two quantile levels , we can obtain the estimated quantile curves and by using (4) separately for and . However, such estimated quantile curves may not satisfy the monotonicity constraint (2), that is, there may exist for which Below, we propose a penalized method to mitigate this problem.
2.1 Noncrossing quantile regression via ReLU penalty
In this subsection, we propose a penalized quantile regression framework to estimate quantile curves that can avoid the crossing problem. We first introduce a ReLUbased penalty function to enforce the noncrossing constraint in quantile regression. The ReLU penalty is defined as
This penalty function encourages when combined with the quantile loss function. At the population level, for , the expected penalized quantile loss function for is
(5) 
where is a tuning parameter. Note that, with defined in (1), for . Moreover, as mentioned earlier, is the minimizer of the expected check loss over measurable function for respectively. The following lemma establishes the identifiability of the quantile functions through the loss function (5). This is the basis of the proposed method.
Lemma 2.1
Suppose
is a random variable with c.d.f.
. For a given , any element in is a minimizer of the expected check loss function . Moreover, the pair of true conditional quantile functions is the minimizer of the loss function (5), that is,where is a class of measurable functions that contains the true conditional quantile functions.
Lemma 2.1 shows that the ReLU penalty does not introduce any bias at the population level. This is a special property of this penalty function, which differs from the usual penalty function such as the ridge penalty used in penalized regression.
When only a random sample is available, we use the empirical loss function
(6) 
and define
(7) 
where is a class of functions that approximate . In (6), a positive value of (i.e., the quantile curves cross at ) will be penalized with the penalty parameter . On the other hand, a negative value of (i.e., the quantile curves do not cross at ) will not incur any penalty. Therefore, with a sufficiently large penalty parameter , quantile crossing will be prevented.
Throughout the paper, we choose the function class to be a function class of feedforward neural networks. We note that other approximation classes can also be used. However, an important advantage of neural network functions is that they are effective in approximating smooth functions in , see, for example, jiao2021deep and the references therein. A detailed description of neural network functions will be given in Section 2.2. Figure 1 previews noncrossing curve estimation via a toy example, which shows the comparison between the proposed method in (7) (with penalty) and the deep quantile estimation studied in shen2021deep (without penalty); see Section B.1.2 in Appendix for more details.
Remark 2.1
The rectified linear unit (ReLU) function is a piecewiselinear function. An important advantage of the ReLU penalty is that it is convex with respect to and , thus the loss function in (6) is also convex with respect to and .
Remark 2.2
The tuning parameter controls the amount of penalty on crossing quantile curves. As shown in Section 4, a bigger leads to a larger estimation bias. Therefore, there is a tradeoff between the crossing and the accuracy of prediction intervals. Both can be achieved with proper choice of . In the appendix, we propose a crossvalidation method to select .
Remark 2.3
bondell2010noncrossing proposed a noncrossing nonparametric quantile regression using smoothing splines with noncrossing constraints. For two given quantile levels , they proposed to estimate the quantile curves and by minimizing the following constrained loss function
(8) 
where denotes the derivative of a function , and is the total variation penalty to guarantee smoothness. Such a splinebased method works well in the onedimensional setting, however, it is difficult to apply this approach to multidimensional problems.
2.2 ReLU Feedforward neural networks
For the estimation of conditional quantile functions, we choose the function class in (7) to be , a class of feedforward neural networks with parameter , depth , width , size
, number of neurons
and satisfying for some positive constant , where is the supreme norm of a function . Note that the network parameters may depend on the sample size , but this dependence is omitted for notational simplicity. Such a network has hidden layers and layers in total. We use a vector to describe the width of each layer; particularly in nonparametric regression problems, is the dimension of the input and is the dimension of the response. The width is defined as the maximum width of hidden layers, i.e., the size is defined as the total number of parameters in the network , i.e., the number of neurons is defined as the number of computational units in hidden layers, i.e., For an MLP , its size satisfiesFrom now on, we write as for short. Then, at the population level, the noncrossing nonparametric quantile estimation is to find a pair of measurable functions satisfying
(9) 
3 Noncrossing quantile regression for conformal prediction
Now suppose we have a new observation . We are interested in predicting the corresponding unknown value of . Our goal is to construct a distributionfree prediction interval with a coverage probability satisfying
(10) 
for any joint distribution
and any sample size , where is often called a miscoverage rate. We refer to such a coverage stated in (10) as marginal coverage.First, we define an oracle prediction band based on the conditional quantile functions. For a prespecified miscoverage rate , we consider the lower and upper quantiles levels such as and . Then, a conditional prediction interval for given with a nominal miscoverage rate is
(11) 
where and are the conditional quantile functions defined in (1) for quantile levels and respectively. Such a prediction interval with true quantile functions is ideal but cannot be constructed, only a corresponding empirical version can be estimated based on data in practice.
Next, we use the split conformal method vovk2005algorithmic for constructing noncrossing conformal intervals. We split the observations into two disjoint sets: a training set and a calibration set . Noncrossing deep neural estimators of and based on the training set are given by
(12) 
where
and is a tuning parameter. A key step is to compute the conformity score based on the calibration set romano2019conformalized, which is defined as
(13) 
Let . The conformity scores in quantify the errors incurred by the plugin prediction interval evaluated on the calibration set , and can account for undercoverage and overcoverage.
Finally, for a new input data point and , the prediction interval for is defined as
(14) 
where is the th empirical quantile of , namely, the th smallest value in , and denotes the smallest integer no less than . Here, is the cardinality of a set . The empirical quantile is a datadriven quantity, which conformalizes the plugin prediction interval.
In contrast to the conformalized quantile regression procedure in romano2019conformalized, our method produces conformal prediction intervals that avoid the quantile crossing problem. We refer to the proposed noncrossing conformalized quantile regression as NCCQR, and the corresponding conformal interval (14) as NCCQR interval.
Remark 3.1
The usual linear quantile regression (QR) model assumes that, for a given ,
(15) 
where and are the intercept and slope parameters. Following romano2019conformalized, a conformal interval based on linear quantile regression can be constructed. Specifically, by splitting the observations into two disjoint subsets: a training set and a calibration set , we can fit model (15) on the training set and obtain the estimators for and for a given , denoted by . Under model (15), the conformal interval with miscoverage rate is given by
(16) 
where , with and , and is the th smallest value among the conformity scores .
We summarize the implementation of NCCQR interval construction in the following algorithm.
Algorithm Computation of noncrossing conformalized prediction intervals
Input: Observations and miscoverage level
Output:
4 Theoretical properties
In this section, we study the theoretical properties of the proposed NCCQR method. We evaluate NCCQR using the following two criteria:

Validity: Under proper conditions, a conformal prediction interval satisfies that
(17) 
Accuracy: If the validity requirement (17) is satisfied, a conformal prediction interval should be as narrow as possible.
The validity requirement (17) is evaluated based on the finitesample marginal coverage in (10), which holds in the sense of averaging over all possible test values of . The accuracy of a prediction interval is usually measured by the discrepancy defined in (19) between the lengths of the prediction interval and the oracle one.
We assume that the target conditional quantile function defined in (1) is a Hölder smooth function with as stated in condition (C3) below. Let , and , where denotes the largest integer strictly smaller than and denotes the set of nonnegative integers. For a finite constant , the Hölder class of functions is defined as
(18) 
where with and . We assume the following conditions.

(C1) The observations are i.i.d. copies of .

(C2) (i) The support of the predictor vector is a bounded compact set in , and without loss of generality we let (ii) Let be the probability measure of . The probability measure is absolutely continuous with respect to the Lebesgue measure.

(C3) For any fixed , the target conditional quantile function defined in (1) is a Hölder smooth function of order and a finite constant .

(C4) There exist constants and such that for any and any ,
for all up to a negligible set, where
is the conditional cumulative distribution function of
given .
Condition (C1) is a basic assumption in conformal inference. The boundedness support assumption in Condition (C2) is made for technical convenience in the proof for deep neural estimation. Condition (C3) is a regular smoothness assumption on the target regression functions so that whose approximation result using deep neural networks can be obtained. Condition (C4) is imposed for selfcalibration of the resulting neural estimator.
Theorem 4.1
Under Condition (C1), for a new i.i.d pair , the proposed NCCQR interval satisfies
Theorem 4.1 shows that the proposed NCCQR interval enjoys a rigorous coverage guarantee. The proof of this theorem is given in the Appendix.
We now quantify the accuracy of the prediction interval in terms of the difference between the NCCQR interval and the oracle in (11) on the support of Let be the probability measure of defined in Condition (C2) and define for . The difference between the lengths of and is given by
(19) 
By the triangle inequality, we have where
To bound and , we need to bound the error To this end, we first derive bounds for the the excess risk of defined as where is defined in (5). Without loss of generality, we assume that and , where denotes the smallest integer no less than and denotes the largest integer no greater than .
Theorem 4.2
Letting and , then the width, depth and size of the neural network satisfy
Suppose that Conditions (C1)(C3) hold. If , the nonasymptotic error bound of the excess risk satisfies
where is a universal constant independent of and and .
The convergence rate of the excess risk is up to a logarithmic factor. The next theorem gives an upper bound for the accuracy of the proposed prediction interval defined in (19).
Theorem 4.3
(Nonasymptotic upper bound for prediction accuracy) Suppose that Conditions (C1)(C4) hold. Let be a class of ReLU activated feedforward neural networks with width, depth specified as in Theorem 4.2 and let be the empirical risk minimizer over . Then, there exists a constant , for ,
where is a constant independent of and .
Theorem 4.3 gives an upper bound for the difference between the lengths of our proposed prediction interval and the oracle interval. With properlyselected neural network parameters, the oracle band can be consistently estimated by the NCCQR band.
Finally, we consider the conformal interval based on linear quantile regression defined in (16).
Corollary 4.1
Suppose that at least one of and is a nonlinear function on a subset of with nonzero measure. Then for any sample size , the accuracy of the conformal band defined in (16) is strictly worse than that of the oracle conformal band , i.e., there exists an such that
According to Corollary 4.1, a conformal interval based on the linear quantile regression will not reach the oracle accuracy in the presence of nonlinearity.
5 Numerical studies
5.1 Synthetic data
We first simulate a synthetic dataset with a dimensional feature and a continuous response from the distributions defined in Section B.1.3. Our method is applied to independent observations from this distribution, using of them to train the deep quantile regression estimators and of them for calibration. The remaining data is for testing. We consider different dimensions to investigate how the dimensionality of the input affects the overall performance under multivariate input settings. The result is shown in Figure 2. In Figure 1(c), when dimension increases, NCCQR method performs better than CQR in terms of smaller crossing rate. It shows that the NCCQR method can mitigate the crossing problem in quantile regression. We also give a 3D visualization of the conformal intervals of our proposed NCCQR estimation and that of the CQR method when in Figure 2. One can see from Figure 1(a) that the conformal interval by our proposed NCCQR method does not have any crossing, while in Figure 1(b), the red region indicates that the lower bound is larger than the upper bound of the interval. More details of the results are given in Section B.1.3.
Our next synthetic example illustrates the adaptive property of the proposed NCCQR prediction interval. We consider the following model with a discontinuous regression function:
(20) 
A 90% NCCQR interval is shown in color blue. The green lines represent a 90% conformal prediction intervals based on the standard linear quantile regression. As can be seen from the plots, the NCCQR intervals automatically adapt to the shape of the regression function and the heteroscedasticity of the model error. Additional examples demonstrating the advantages of NCCQR prediction intervals are given in the Appendix.
5.2 Real data examples
We compute the NCCQR prediction intervals for the following datasets: bike sharing^{1}^{1}1https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset, house price^{2}^{2}2https://www.kaggle.com/datasets/harlfoxem/housesalesprediction and the air foil^{3}^{3}3https://archive.ics.uci.edu/ml/datasets/airfoil+selfnoise data sets. To examine the performance of the ReLU penalty, we compare the NCCQR with CQR. We compute the conformal prediction intervals based on these two methods using the same deep quantile regression estimator. Their performances are evaluated as in Section 5.1. We subsample of data for training,
for calibration, and the remaining data is used for testing. All features are standardized to have 0 mean and unit variance. The nominal coverage rate is set to be
. Figure 4 shows that the proposed NCCQR method can mitigate the crossing problem encountered in the CQR estimation. Additional details of the result are given in Section B.2 in the Appendix.


6 Concluding remarks
We have proposed NCCQR, a penalized deep quantile regression method which avoids the quantile crossing problem via a ReLU penalty. We have derived nonasymptotic upper bounds for the excess risk of the noncrossing empirical risk minimizers. Our error bounds achieve optimal minimax rate of convergence, the with the prefactor of the error bounds depending polynomially on the dimension of the predictor, instead of exponentially. The proposed ReLU penalty for monotonic constraints is applicable to problems with different loss functions. It is applicable to other nonparametric estimation method, including smoothing splines, kernel method and neural network. Therefore, the proposed ReLU penalty method may be of independent interest. Based on the estimated noncrossing conditional quantile curves, we construct conformal prediction intervals that are fully adaptive to heterogeneity. We have also provided conditions under which the proposed NCCQR conformal prediction bands are valid with the correct coverage probability and achieve optimal accuracy in terms of the width of the prediction band.
A main limitation of the proposed method and the theoretical properties is that they rely on the i.i.d. assumption for the data. There are applications where observations are not independent (e.g., time series data) and there may be distribution shift. It would be interesting to extend the proposed method to deal with such noni.i.d. settings and establish the corresponding theoretical properties.
Appendix A Appendix: Proofs
In this appendix, we prove the theoretical results stated in Section 4. First, we prove Lemma 2.1. For convenience, we restate this lemma below.
Lemma A.1
Suppose is a random variable with c.d.f. . For a given , any element in is a minimizer of the expected check loss function . Moreover, the pair of the true conditional quantile functions is the minimizer of the loss function (5), that is,
where is a class of measurable functions that contains the true conditional quantile functions.
Proof[of Lemma A.1] First, we write
Taking derivative w.r.t , we obtain
Since is a c.d.f., it is monotonic and thus any element of minimizes the expected loss . For functions , recall that the loss function (5) in the main context is
By the definition of in (1) in the main context, satisfies for . Then minimizes the expected loss for . Since the true quantile function satisfies the monotonicity requirement (2) in the main context, then . Therefore, is the minimizer of the loss function (5) in the main context.
Lemma A.2
Suppose that
(S21) 
for some sequence and such that and when . Define
Then, under Conditions (C1)(C4), for any independent of ,
Furthermore, if the calibration set is partitioned into
then for any constant , we have
Proof[of Lemma A.2] By Markov inequality, we have
When the calibration set is partitioned into
it follows from Hoeffding’s inequality that, for any ,
Letting , the proof of the second inequality in Lemma (A.2) is completed.
Proof[of Theorem 4.1] The proof of Theorem 4.1 is similar to that of Theorem 1 in romano2019conformalized. Let be the conformity score
at the test point Recall that the prediction interval
By the construction of the prediction interval,