AutoNCP: Automated pipelines for accurate confidence intervals

06/24/2020 ∙ by Yao Zhang, et al. ∙ University of Cambridge

Successful application of machine learning models to real-world prediction problems - e.g. financial predictions, self-driving cars, personalized medicine - has proved to be extremely challenging, because such settings require limiting and quantifying the uncertainty in the model's predictions, i.e. providing valid and useful confidence intervals. Conformal Prediction is a distribution-free approach that achieves valid coverage and provides valid confidence intervals in finite samples. However, the confidence intervals constructed by Conformal Prediction are often (because of over-fitting, inappropriate measures of nonconformity, or other issues) overly conservative and hence inadequate for the application at hand. This paper proposes a framework called Automatic Machine Learning for Nested Conformal Prediction (AutoNCP). AutoNCP is an AutoML framework, but unlike familiar AutoML frameworks that attempt to select the best model (from among a given set of models) for a particular dataset or application, AutoNCP uses frequentist and Bayesian methodologies to construct a prediction pipeline that achieves the desired frequentist coverage while simultaneously optimizing the length of confidence intervals. Using a wide variety of real-world datasets, we demonstrate that AutoNCP substantially outperforms benchmark algorithms.







1 Introduction

Machine Learning (ML) has made remarkable contributions to prediction in areas ranging from the natural sciences to the social sciences, from advertisement to medical imaging. It has been less successful in areas such as personalized medicine, autonomous vehicles and finance, in large part because actionable decisions in those areas require not only predictions but also confidence in predictions Amodei et al. (2016). This paper proposes a new method, which we call Automatic Machine Learning for Nested Conformal Prediction (AutoNCP), that produces confidence intervals that are tight and satisfy a rigorous coverage guarantee for predictions (the desired confidence level can be specified arbitrarily by the user), and demonstrates that it substantially outperforms competing methods on a wide variety of both real-world and simulated datasets.

AutoNCP builds on the by-now familiar Automatic Machine Learning (AutoML) framework Feurer et al. (2015); Kotthoff et al. (2017); Alaa and van der Schaar (2018); Escalante et al. (2020). However, unlike these frameworks, which build pipelines to optimize predictive accuracy, AutoNCP builds a pipeline to optimize confidence intervals for a prescribed confidence level and frequentist coverage requirement (e.g. construct the shortest confidence intervals with 90% coverage). AutoNCP builds on the nested set formulation of Vovk et al. (2005); Kuchibhotla and Ramdas (2019) to combine conformal prediction methods with discriminative methods. This combination is crucial because conformal prediction can provide reliable confidence intervals for any predictive model, while discriminative methods identify regions of high and low predictive uncertainty. (This paper specifies a particular collection of models and algorithms, but many other collections, both smaller and larger, could be used.) The structure of AutoNCP is shown in Figure 1. In view of the high dimensionality and the mixture of discrete and continuous variables in the optimization problem (e.g. which models/estimators we should use and how to choose their hyperparameters), we model pipeline performance as a black-box function with a factor structure, in which the mapping from the model and hyperparameters in each factor to the confidence interval length is treated as a black-box function. These black-box functions are jointly optimized using Bayesian Optimization (BO) Snoek et al. (2012). The factor structure allows the high-dimensional problem to be decomposed into a set of lower-dimensional subproblems, each of which can be solved separately with many fewer BO iterations, leading to an efficient algorithm.

We note that there would be no difficulty in marrying a given conformal prediction algorithm with a given machine learning model. However, choosing by hand the conformal prediction algorithm and machine learning model to be used for a given dataset would be an extremely difficult task, requiring expertise in the various algorithms and models and specialized knowledge of the particular dataset. AutoNCP makes this entirely unnecessary, thereby allowing non-experts to use machine learning models with confidence intervals that satisfy the user-specified coverage guarantee.

Figure 1: A pictorial depiction of the operation of AutoNCP. The steps of NCP are enumerated in order. Implementing NCP on a data set is a compound decision problem, requiring the choice of a model, hyperparameters for that model, an estimator, and a calibration method. AutoNCP aims to solve this compound decision efficiently.

To demonstrate the effectiveness of AutoNCP, we conduct two sets of experiments. The first set of experiments pits AutoNCP against eight benchmark algorithms on eight real-world datasets from different application domains. Our results show that AutoNCP achieves confidence intervals that are from 5% to 38% shorter than those of the best benchmark algorithm while satisfying the rigorous coverage guarantee. A separate source-of-gain analysis (using the same eight datasets) confirms the importance of globally optimizing the AutoNCP pipeline. The second set of experiments adapts AutoNCP to the problem of estimating Conditional Average Treatment Effects, and pits AutoNCP against three benchmark algorithms on two frequently-used datasets. Our results show that only AutoNCP and one of the benchmark algorithms achieve the required coverage, and that AutoNCP achieves confidence intervals that are much shorter than this competitor's on both datasets.

Following this Introduction, we first discuss Related Work, then lay out the Problem Formulation, describe Nested Conformal Prediction, provide details of AutoNCP, describe the Experimental Results, and present a brief Conclusion. We relegate some technical details and additional information about the experiments to the Appendix.

2 Related work

Bayesian statistics offers a probabilistic framework for modeling uncertainty in the data. In that framework, predictive uncertainty is quantified by computing posterior credible sets. In practice, exact Bayesian inference is computationally prohibitive for deep learning models with millions of parameters. Variational inference Welling and Teh (2011); Gal and Ghahramani (2016); Hernández-Lobato and Adams (2015); Maddox et al. (2019) approximates the true posterior distribution by an ensemble of neural networks, in the same form as non-Bayesian ad-hoc ensemble approaches Lakshminarayanan et al. (2017). These methods measure uncertainty by the model disagreement within the ensemble. They are sensitive to the uncertainty in predictions because it is unlikely that all the models in the ensemble are misled and overfitted to the training data in the same way. However, it is well known that the construction of honest credible sets (i.e. sets satisfying the frequentist coverage property) is impossible in general Cox (1993); Bayarri and Berger (2004). Hierarchical and Empirical Bayes methods yield honest credible sets asymptotically under extra assumptions on the functional parameters Szabó et al. (2015); Rousseau and Szabo (2016), but these assumptions do not generally hold for complex datasets in machine learning, e.g. electronic health records.

Conformal prediction, pioneered by Vovk et al. (2005), is a general framework for constructing confidence intervals that satisfy the frequentist coverage guarantee in finite samples. This framework is applicable to (almost) any prediction model. However, naive application of conformal prediction can yield poor results, e.g. intervals of the same length at every point in the input space, failing to reflect that more accurate predictions are possible for some inputs than for others, especially if the data is heteroscedastic (as is often the case). A variety of localized techniques have been proposed to address this problem, such as accounting for the variability of the regression function Barber et al. (2019), combining conformal prediction with classical quantile regression Romano et al. (2019); Sesia and Candès (2019), or reweighting samples using appropriate distance metrics Guan (2019).

In our work, we draw a connection between Bayesian methods and conformal prediction for the task of producing actionable estimates of predictive uncertainty. Instead of adding new heuristics that may work best in a particular case, we develop a general and automatic framework for constructing actionable confidence intervals, in the interest of practitioners.

3 Problem Formulation

We consider a standard supervised learning setup.

X is the set of features and Y is the set of labels, so a typical datapoint is (x, y) with x ∈ X, y ∈ Y. We assume throughout that data is drawn exchangeably from an (unknown) distribution P on X × Y. (It is standard in the literature to assume that data is drawn exchangeably rather than i.i.d., which would be a stronger assumption.) Let D denote a set of training samples, and let α denote a specified miscoverage rate (equivalently: a specified target coverage rate (TCR) of 1 − α). For a regression model f trained on D (so that f(x) is the predicted label for a randomly sampled testing data point (x, y)), we wish to provide a confidence interval C(x) around f(x) that covers the true y with probability equal to the TCR. (Both the function f and the confidence intervals C(x) depend on the training set D and the specified miscoverage rate α; we view these as fixed throughout the exercise and so suppress them in the notation.) In what follows, we write |C(x)| for the length of the confidence interval C(x). In order that our confidence intervals be actionable, we require two things:

Frequentist Coverage. The confidence intervals satisfy P(y ∈ C(x)) ≥ 1 − α, where the probability is taken with respect to D and (x, y) Vovk et al. (2005); Lawless and Fredette (2005).


Discrimination. On average, confidence intervals are wider for test points with larger variance. That is, for two different test points x1, x2 with Var(y | x1) ≥ Var(y | x2), we have E[|C(x1)|] ≥ E[|C(x2)|], where the variance is taken with respect to the underlying data distribution P.

Discrimination does not come with a guarantee, but it provides a guess of the high- and low-variance regions in the input space so that we can adapt the length of C(x) locally. Both requirements are crucial for confidence intervals to be actionable. If the intervals fail to achieve the coverage rate, a user cannot use them with confidence. On the other hand, an algorithm can construct very wide intervals in order to satisfy the coverage rate; but in this case the intervals exaggerate the uncertainty in the predictions, and the user cannot trust the predictions made by a model even when they are accurate.

Frequentist Coverage can be achieved using the idea of Conformal Prediction Vovk et al. (2005). Conformal algorithms achieve frequentist coverage under the mild assumption (weaker than i.i.d.) that the training and testing samples are drawn exchangeably from the same data distribution. The confidence intervals constructed by conformal algorithms are not intended to solve out-of-distribution detection or anomaly detection problems. Nevertheless, in practice, the predictions that models make on in-distribution samples still have prediction errors; confidence intervals provide bounds on the magnitude of these errors. These bounds are important to a potential user (e.g. a clinician making treatment decisions), and tighter bounds mean the potential user can be more certain about the model predictions. Providing tighter bounds, i.e. narrower confidence intervals, is the purpose of our algorithm AutoNCP.

4 Nested Conformal Prediction

In this section, we summarize a variety of conformal algorithms from the vast literature that can be used in our flexible AutoML framework. We start with a simple method called Split Conformal Prediction (SCP) Lei et al. (2018). SCP splits the training samples into two equal-size subsets D1 and D2, and computes the residuals on D2 of a model f fitted to D1, denoted by

R = { |y_i − f(x_i)| : (x_i, y_i) ∈ D2 }.

Taking Q as the (1 − α)-th quantile of R, i.e. the ⌈(1 − α)(|D2| + 1)⌉-th smallest residual in R, the SCP confidence interval C(x) = [f(x) − Q, f(x) + Q] satisfies P(y ∈ C(x)) ≥ 1 − α under the exchangeability assumption mentioned above.
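For concreteness, the SCP recipe just described can be sketched in a few lines of pure Python. This is an illustrative sketch, not the paper's implementation: the `fit` and `predict` callables are placeholders for whatever underlying model is used.

```python
import math
import random

def split_conformal(fit, predict, data, alpha=0.1, seed=0):
    """Split Conformal Prediction (SCP) sketch.

    fit:     callable mapping a list of (x, y) pairs to a fitted model
    predict: callable (model, x) -> point prediction f(x)
    Returns an interval-valued predictor x -> (f(x) - Q, f(x) + Q).
    """
    data = list(data)
    random.Random(seed).shuffle(data)
    half = len(data) // 2
    d1, d2 = data[:half], data[half:]      # fitting split D1 / holdout split D2
    model = fit(d1)
    # Absolute residuals |y_i - f(x_i)| on the holdout split D2.
    residuals = sorted(abs(y - predict(model, x)) for x, y in d2)
    # Q = ceil((1 - alpha) * (|D2| + 1))-th smallest holdout residual.
    k = min(math.ceil((1 - alpha) * (len(d2) + 1)), len(d2))
    q = residuals[k - 1]
    return lambda x: (predict(model, x) - q, predict(model, x) + q)
```

Note that the returned interval has the same half-width `q` everywhere in the input space, which is exactly the non-discriminative behavior the next subsections set out to fix.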

With the nested set formulation of Vovk et al. (2005); Kuchibhotla and Ramdas (2019), we can generalize SCP to the more sophisticated methods in Subsections 4.1 and 4.2. Let {F_t(x) : t ∈ T} be a family of prediction regions indexed by an ordered set T, including intervals wider than the confidence interval of interest, such that

t1 ≤ t2 implies F_{t1}(x) ⊆ F_{t2}(x),

where t1, t2 ∈ T. Note that the family of prediction regions here is nested. "Nested" is not a property of a single set: two or more distinct sets are nested if they can be ordered so that each one is a subset of the next. For SCP, take F_t(x) = [f(x) − t, f(x) + t] with T = [0, ∞); it is clear that F_{t1}(x) ⊆ F_{t2}(x) whenever t1 ≤ t2. Therefore {F_t(x) : t ∈ T} is a nested family, i.e. a family of nested sets, if we order the regions according to the constant t. The process of looking for the confidence interval (with coverage guarantee) among the prediction regions can be thought of as finding a mapping from each input x to an index t(x) ∈ T. For example, SCP uses the residual set R to estimate this mapping as follows. First rewrite C(x) as

C(x) = F_Q(x) = [f(x) − Q, f(x) + Q],

and consider the nonconformity score function

r(x, y) = inf{ t ∈ T : y ∈ F_t(x) } = |y − f(x)|.

Informally, r(x, y) can be thought of as the radius of the smallest closed ball centered at f(x) that contains y. Taking Q as the smallest t ∈ T such that F_t(x) contains at least a ⌈(1 − α)(|D2| + 1)⌉ / |D2| proportion of the holdout residuals in R, we re-derive the confidence interval of SCP.

From the example of SCP, we can see that the process of NCP has four steps: (1) fit an estimator to the training split; (2) define a family of nested prediction regions with the fitted estimator; (3) define the nonconformity score function for the prediction regions; (4) construct the confidence interval from the smallest prediction region in the family that satisfies the coverage requirement on the holdout set. Note that the length of the SCP interval is constant over the input space and hence non-discriminative. This problem motivates us to generalize SCP in two different directions: (1) interval estimator methods, and (2) calibration methods.
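The four steps can also be written generically, with the estimator, the region family and the score left as pluggable callables. The sketch below is illustrative only; the callable names are ours, not the paper's.

```python
import math

def ncp_interval(train, holdout, fit, score, region, alpha=0.1):
    """Generic sketch of the four NCP steps.

    fit:    step (1) -- fit an estimator to the training split D1
    region: step (2) -- map (estimator, x, t) to the region F_t(x),
            returned here as a (lower, upper) interval
    score:  step (3) -- nonconformity score r(x, y) = inf{t : y in F_t(x)}
    Step (4) picks the smallest index t covering enough holdout points.
    """
    est = fit(train)                                       # step (1)
    scores = sorted(score(est, x, y) for x, y in holdout)  # step (3)
    k = min(math.ceil((1 - alpha) * (len(holdout) + 1)), len(holdout))
    t_hat = scores[k - 1]                                  # step (4)
    return lambda x: region(est, x, t_hat)                 # step (2)
```

Instantiating `fit` with a mean estimator, `score` with the absolute residual, and `region` with [f(x) − t, f(x) + t] recovers SCP; the interval-estimator and calibration variants of Subsections 4.1 and 4.2 change these callables rather than the overall procedure.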

4.1 Interval estimator

Instead of using a mean estimator, interval estimators, such as quantile functions or Bayesian credible sets, can be applied in SCP to achieve discriminative confidence intervals.

The method Locally Weighted Split Conformal Lei et al. (2018) learns both a mean estimator f and a mean absolute deviation (MAD) estimator σ from the training split D1. In the original paper, the MAD estimator is obtained by fitting a separate regression model to the fitting residuals of f on D1. For Bayesian models, the mean and MAD estimators can be derived from the posterior distribution. The prediction region is defined with f and σ as follows,

F_t(x) = [f(x) − t σ(x), f(x) + t σ(x)].

The resulting score function is given as

r(x, y) = |y − f(x)| / σ(x).

Then, taking Q as the ⌈(1 − α)(|D2| + 1)⌉-th smallest normalized residual in D2, we obtain the confidence interval as

C(x) = [f(x) − Q σ(x), f(x) + Q σ(x)].
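A minimal sketch of this locally weighted calibration step, assuming f and σ have already been fitted on D1 (both are placeholder callables here):

```python
import math

def locally_weighted_scp(f, sigma, holdout, alpha=0.1):
    """Locally Weighted Split Conformal sketch: f is a fitted mean
    estimator and sigma a fitted MAD estimator (both trained on D1);
    `holdout` plays the role of D2.  The interval half-width scales
    with sigma(x), so it is wider where the MAD estimate is larger.
    """
    # Normalized residuals |y - f(x)| / sigma(x) on the holdout split.
    scores = sorted(abs(y - f(x)) / sigma(x) for x, y in holdout)
    k = min(math.ceil((1 - alpha) * (len(holdout) + 1)), len(holdout))
    q = scores[k - 1]
    return lambda x: (f(x) - q * sigma(x), f(x) + q * sigma(x))
```

Unlike plain SCP, the resulting interval length 2 Q σ(x) varies over the input space, which is precisely the Discrimination property of Section 3.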

Conformal Quantile Regression Romano et al. (2019) is another conformal prediction method, which estimates the interval directly via quantile regression. Let q_lo and q_hi be two conditional quantile functions fitted on D1. The prediction region is given as

F_t(x) = [q_lo(x) − t, q_hi(x) + t].

Taking Q as the ⌈(1 − α)(|D2| + 1)⌉-th smallest out-of-quantile residual in D2, we obtain the confidence interval

C(x) = [q_lo(x) − Q, q_hi(x) + Q],

where the out-of-quantile residual is

r(x, y) = max{ q_lo(x) − y, y − q_hi(x) }.

There are other works that also use the idea of interval estimators, as summarized in Table 4 of Appendix B. These methods differ in the choice of estimators; they use the same definition of the nonconformity score r(x, y), and their confidence intervals are given in the same form as above. The locally weighted and quantile-based intervals have different lengths for different x, while the length of the SCP interval is the same over the input space. Using an interval estimator is an attempt to achieve Discrimination. For example, the CQR interval is wide in regions of high predictive variance, where the gap between the upper quantile and the lower quantile is large. We note that an interval estimator does not issue a prediction of the mean of y. In practice, we can train another machine learning model to predict y, and use the confidence intervals returned by conformal algorithms only to provide lower and upper bounds on y for the sake of uncertainty quantification.
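The CQR calibration step can be sketched analogously, assuming the two quantile functions have already been fitted on D1 (again placeholder callables):

```python
import math

def conformalize_quantiles(q_lo, q_hi, holdout, alpha=0.1):
    """Conformal Quantile Regression calibration sketch: q_lo and q_hi
    are conditional quantile functions fitted on D1; `holdout` is D2.
    """
    # Out-of-quantile residual: positive iff y falls outside [q_lo, q_hi].
    scores = sorted(max(q_lo(x) - y, y - q_hi(x)) for x, y in holdout)
    k = min(math.ceil((1 - alpha) * (len(holdout) + 1)), len(holdout))
    q = scores[k - 1]
    return lambda x: (q_lo(x) - q, q_hi(x) + q)
```

The correction Q can be negative when the raw quantile band already over-covers, in which case calibration tightens the interval rather than widening it.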

4.2 Calibration methods

In SCP, the training samples are split into two equal-size subsets, D1 and D2. One could consider other calibration methods, such as leave-one-out Barber et al. (2019), K-fold cross-validation Vovk (2015) and the bootstrap Kuchibhotla and Ramdas (2019). These variants enable the estimator to be trained on more samples, and hence to have smaller residuals on the holdout samples. The confidence intervals obtained by K-fold cross-validation and the bootstrap are given in Equations (5-6) of Appendix C. As in conventional model selection, K-fold splitting tends to overestimate the true prediction error while the bootstrap tends to underestimate it Hastie et al. (2009). Both methods can achieve the coverage guarantee, as summarized in Table 5 of Appendix C. Our algorithm AutoNCP aims to select the better of the two, and to optimize the number of folds in the calibration method for constructing tight confidence intervals.
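To illustrate why a K-fold variant can tighten intervals, here is a simplified pooled-residual cross-conformal sketch: each sample is held out exactly once, so the residual pool comes from models trained on (K−1)/K of the data instead of half of it. (The exact CV+/jackknife+ constructions that carry the formal guarantees are given in the cited papers; this sketch only illustrates the mechanics.)

```python
import math

def cross_conformal(fit, predict, data, alpha=0.1, k_folds=5):
    """Simplified K-fold cross-conformal sketch (a cousin of CV+)."""
    n = len(data)
    residuals = []
    for j in range(k_folds):
        holdout = data[j::k_folds]
        train = [p for i, p in enumerate(data) if i % k_folds != j]
        model = fit(train)
        # Each point contributes a residual from a model that never saw it.
        residuals.extend(abs(y - predict(model, x)) for x, y in holdout)
    residuals.sort()
    q = residuals[min(math.ceil((1 - alpha) * (n + 1)), n) - 1]
    model = fit(data)        # final point predictor trained on all the data
    return lambda x: (predict(model, x) - q, predict(model, x) + q)
```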

5 AutoNCP

In the last section, we showed that a user has a large degree of freedom in constructing a conformal prediction pipeline, with a variety of mean estimators or interval estimators, and calibration methods. The user also needs to choose which machine learning model, and which of its hyperparameters, to use for parametrizing the chosen estimator. We now introduce our AutoML system AutoNCP for solving this complex conformal pipeline optimization problem while keeping the human out of the loop.

Fix a set M of prediction models; for each model m ∈ M, we will choose hyperparameters θ_m; write Θ for the set of pairs (m, θ_m) consisting of a model and a choice of hyperparameters for it. Fix a set E of estimators and a set C of calibration methods. An NCP pipeline is a 4-tuple consisting of a model, a set of hyperparameters for that model, an estimator and a calibration method; the space of all possible pipelines is P = Θ × E × C.

An example pipeline might be (Random Forest; 1000 stumps using 4 features; quantile estimator; CV with 5 folds). Since each model has many possible sets of hyperparameters, the space of pipelines will be very large, even if we restrict attention to a few models, estimators and calibration methods. The goal of AutoNCP is to identify the pipeline configuration p* for a given dataset that yields the smallest average length of confidence intervals on the validation set D_val:

p* = argmin_{p ∈ P} (1/|D_val|) Σ_{(x_i, y_i) ∈ D_val} |C_p(x_i)|,    (1)

where |C_p(x_i)| is the length of the confidence interval for the validation sample x_i. We note that the Frequentist Coverage guarantee will be satisfied automatically by any conformal algorithm. The objective in (1) has no analytic form, and hence we treat it as a black-box optimization problem and solve it using the idea of divide and conquer. We assume the observed average interval length is a noisy version of a black-box function g_m for each model m, where p_m is a pipeline for model m and P_m is the space of all possible pipelines for model m. We use Bayesian Optimization (BO) Snoek et al. (2012) to search for the optimal pipeline configuration p*. The BO algorithm specifies a Gaussian process (GP) prior on each g_m as follows:

g_m ~ GP(μ_m(·), k_m(·, ·)),    (2)

where μ_m is the mean function, encoding the expected performance of different pipelines, and k_m is the covariance kernel Williams and Rasmussen (2006), measuring the similarity between different pipelines.
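To make the size of the search space concrete, here is a hypothetical, deliberately tiny pipeline space; the model names, hyperparameter grids and method labels are illustrative placeholders (the paper's actual sets are listed in its appendices).

```python
from collections import namedtuple
from itertools import product

Pipeline = namedtuple("Pipeline", ["model", "hyperparams", "estimator", "calibration"])

# Toy choice sets (placeholders, not the paper's).
models = {"ridge": [{"alpha": a} for a in (0.1, 1.0, 10.0)],
          "rf":    [{"n_trees": n} for n in (100, 500, 1000)]}
estimators = ["mean", "mad", "quantile"]
calibrations = ["split", "cv5", "cv10", "bootstrap"]

# The full product space P = Theta x E x C.
space = [Pipeline(m, h, e, c)
         for m, hs in models.items() for h in hs
         for e, c in product(estimators, calibrations)]
```

Even with 2 models, 3 hyperparameter settings each, 3 estimators and 4 calibration methods, the space already has 72 pipelines; with realistic continuous hyperparameter ranges it becomes far too large to enumerate, which motivates the Bayesian Optimization search below.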

Our BO procedure is a simple three-step iterative process to search for solutions to Equation (1): (1) fit the GPs to the pipeline performance data; (2) find the optimizer of an acquisition function Kushner (1964); Srinivas et al. (2012); Mockus (2012); Hernández-Lobato et al. (2014) and select the pipeline to evaluate next; (3) evaluate the selected pipeline and add its performance to the data used for the next fit in step (1). Note that the acquisition function is computed from the exact posterior distribution of the GPs, which balances the exploration-exploitation trade-off in the search for solutions. An iteration of BO is generally much cheaper than a pipeline evaluation, especially when we only evaluate thousands of pipelines.
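The three-step loop can be sketched as below. To keep the sketch self-contained we use a toy distance-based surrogate in place of a full Gaussian process: the predicted mean is the value of the nearest evaluated point and the "uncertainty" grows with the distance to it, a crude stand-in for the GP posterior mean and variance.

```python
import random

def bayes_opt_loop(objective, candidates, n_iter=20, beta=1.0, seed=0):
    """Three-step BO loop sketch with a toy surrogate (not a real GP)."""
    rng = random.Random(seed)
    evaluated = {}                       # pipeline -> observed interval length

    def surrogate(c):
        # Step (1): "fit" the surrogate to the performance data so far.
        nearest = min(evaluated, key=lambda e: abs(e - c))
        mean = evaluated[nearest]
        sigma = abs(nearest - c)         # uncertainty grows with distance
        return mean, sigma

    first = rng.choice(candidates)
    evaluated[first] = objective(first)
    for _ in range(n_iter):
        pending = [c for c in candidates if c not in evaluated]
        if not pending:
            break
        # Step (2): acquisition = lower confidence bound (we minimize length).
        nxt = min(pending, key=lambda c: surrogate(c)[0] - beta * surrogate(c)[1])
        # Step (3): evaluate the selected pipeline and record its performance.
        evaluated[nxt] = objective(nxt)
    return min(evaluated, key=evaluated.get)
```

In AutoNCP the surrogate is the GP of Equation (2) and the candidates are full pipeline configurations; the loop structure is the same.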

Arguably, it is more sensible in our case to model the performance of each model with an independent GP rather than modelling all models jointly. Let p = (m, θ_m, e, c), where m is a categorical variable indicating which prediction model we choose. Joint modelling requires defining a high-dimensional kernel function whose input includes the categorical variable m and the hyperparameters in the product space of all models' hyperparameter spaces. When comparing two pipelines p and p' that share the same model m, such a kernel takes into account the irrelevant dimensions belonging to the other models m' ≠ m, when the similarity between p and p' should depend only on the variables associated with m. This problem is resolved by modelling each g_m separately.

In AutoNCP, a pipeline can be high-dimensional, with a mixture of continuous and discrete variables, due to the composition of methods in the pipeline. High dimensionality Kandasamy et al. (2015); Györfi et al. (2006) and mixed-variable structure Daxberger et al. (2019) can render standard GP-based BO infeasible. We deal with these problems using the idea of sparse additive kernel decomposition Kandasamy et al. (2015); Wang et al. (2017). We assume the pipeline performance, i.e. the average length of confidence intervals, depends separately on the prediction model, the estimator and the calibration method. The underlying structure relating the hyperparameters θ_m of the model m, the choice of estimator e and the calibration method c can be expressed via the following sparse additive kernel decomposition:

k_m(p_m, p'_m) = k_{θ_m}(θ_m, θ'_m) + k_est(e, e') + k_cal(c, c'),

where p_m = (θ_m, e, c) and p'_m = (θ'_m, e', c'). The kernel functions k_m, m ∈ M, share the same “estimator” kernel k_est and “calibration” kernel k_cal. This additive structure separates the continuous hyperparameter variables in θ_m from the integer variables encoding e and c (e.g. the one-hot indicator of which estimator is used). Furthermore, the hyperparameters of k_est and k_cal can be learned more efficiently by maximizing the sum of the marginal likelihood functions Williams and Rasmussen (2006) over all the black-box functions g_m, m ∈ M. The kernel decomposition also breaks down each function g_m as follows:

g_m(p_m) = g_{θ_m}(θ_m) + g_est(e) + g_cal(c).    (3)

The additively sparse structure in (3) gives a statistically efficient BO procedure: if the objective is suitably smooth, additive kernels reduce the sample complexity of GP-based BO from a rate exponential in the full input dimension to one exponential only in the maximum number of dimensions in any subspace Yang et al. (2015). Figure 3 at the end of Appendix B.2 demonstrates that our BO method outperforms the naive joint-modelling approach.
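A minimal sketch of such an additive pipeline kernel, assuming (for illustration only) a scalar RBF kernel on the continuous hyperparameter and Kronecker-delta kernels on the categorical choices:

```python
import math

def rbf(u, v, ls=1.0):
    """Squared-exponential kernel on a scalar hyperparameter."""
    return math.exp(-(u - v) ** 2 / (2 * ls ** 2))

def delta(u, v):
    """Kronecker kernel for categorical (one-hot) choices."""
    return 1.0 if u == v else 0.0

def additive_pipeline_kernel(p, q):
    """Sparse additive kernel sketch: a pipeline is (theta, estimator,
    calibration) and similarity decomposes into a hyperparameter term
    plus shared 'estimator' and 'calibration' terms, with no cross-terms
    coupling the three groups.
    """
    theta_p, est_p, cal_p = p
    theta_q, est_q, cal_q = q
    return rbf(theta_p, theta_q) + delta(est_p, est_q) + delta(cal_p, cal_q)
```

Because the three terms never interact, changing the calibration method alone moves the kernel value by at most the "calibration" component, which is what makes the per-subspace optimization of the main text possible.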

It is worth mentioning that BO can be extended to balance the trade-off between pipeline performance and training computational complexity Klein et al. (2016). This is an important feature in our case, since the computational complexity increases when we split the dataset into more subsets in calibration. We treat the computational time as another black-box function and model it with a separate Gaussian process as in Equation (2). When we acquire a pipeline by maximizing the acquisition function, we adjust the acquisition value by the predicted computational cost of the pipeline. In this design, the pipeline acquisition process in BO takes the computational time of pipeline evaluation into account: it will prioritize pipelines that are cheap to evaluate, but if there is a trend of performance gains from using a larger model or a larger number of splits, BO will start to acquire these more expensive pipelines. The implementation details of AutoNCP are given in the experiment section and Appendix B.2.
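One common form of such a cost adjustment (in the spirit of "expected improvement per second", cf. Snoek et al. (2012); Klein et al. (2016)) divides the raw acquisition value by the predicted cost; the exact transformation used in the paper is not reproduced here, so this is an assumption for illustration:

```python
def cost_aware(acquisition, cost):
    """Make an acquisition function cost-aware by dividing its value by
    the predicted evaluation cost: cheap pipelines are preferred unless
    an expensive one promises a clearly better confidence interval.
    """
    return lambda p: acquisition(p) / max(cost(p), 1e-9)
```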

6 Experiments

We conducted two sets of experiments. The first set of experiments employs eight regression datasets: Communities and Crime (Community), Boston Housing (Boston), Concrete Compressive Strength (Concrete), Red Wine Quality (Wine), Energy Efficiency (Energy), and three Medical Expenditure Panel Surveys (MEPS 19, MEPS 20, MEPS 21). (In a classification problem with a discrete label, a confidence interval may contain more than one class; our method still works with appropriate nonconformity scores, e.g. distance to the nearest neighbors. The first five datasets are UCI datasets Dua and Graff (2017); the MEPS datasets are in:) We fix the miscoverage rate α = 0.1, so the target coverage rate is 90%. (The coverage rates in the experiments differ slightly from 90% because we are dealing with finite samples.) For each dataset we constructed 20 random training-test splits, with part of the examples used for training and the remainder for testing, and we average the performance over these 20 splits.

We compare our AutoML framework AutoNCP with a total of eight benchmark algorithms, all of which have valid finite-sample coverage guarantees. The underlying models we consider are Ridge Regression, Random Forest, and Neural Network (here, a standard multilayer perceptron). The benchmarks consist of (a) the original version of SCP for each of these models, denoted SCP-Ridge, SCP-RF, and SCP-NN; (b) the locally adaptive conformal prediction version of SCP Lei et al. (2018) for each of these models, denoted SCP-Ridge-Local, SCP-RF-Local, and SCP-NN-Local; and (c) Conformalized Quantile Regression (CQR) Romano et al. (2019), using Neural Network and Random Forest as underlying models, denoted CQR-NN and CQR-RF. (These methods subsume the standard conformal variants of random forests Johansson et al. (2014) and neural networks Papadopoulos and Haralambous (2011).) In our implementation of AutoNCP, we use Ridge Regression, Random Forest and Neural Network as underlying models, the first five estimators listed in Table 4 of Appendix D, and the calibration methods shown in Table 5 of Appendix C. (Further details of the implementation of the benchmark algorithms and of AutoNCP are provided in Appendix B.)

Figure 2: Normalized Interval Lengths

Performance results. For AutoNCP and each of the eight benchmarks, we compute the average coverage rate and interval length for each dataset. The relative performance of each algorithm on each dataset is displayed in Figure 2: for each dataset we normalize by dividing by the average interval length of the worst-performing algorithm (so scores all lie in the interval (0, 1]). (To aid legibility, we ordered the presentation of the datasets according to the relative performance of AutoNCP.) As is easily seen, AutoNCP displays the best performance on every dataset. Table 1 compares the performance of AutoNCP against the Best and Worst Benchmarks on each dataset. As can be seen, all nine algorithms achieve the target coverage rate of 90% (or very close to it) on all datasets, but the length of the confidence intervals varies widely across datasets and across algorithms. The last two columns show the percentage improvement in length that AutoNCP achieves over the Best and Worst Benchmarks. We highlight that the improvement of AutoNCP over the Best Benchmark ranges from 5.36% on the Community dataset to 38.89% on the Wine dataset.

As noted in the Introduction, it is not difficult to marry a conformal prediction algorithm and a machine learning model, but the “right” algorithm and model to marry will depend on the particular dataset. This can be seen in Figure 2: CQR-NN is the best-performing benchmark (marriage of an algorithm and a model) on the MEPS 19, 20, 21 and Concrete datasets (and still AutoNCP outperforms it by 36.41%, 24.48%, 24.05% and 6.38%, respectively), but CQR-NN is in the middle of the pack on the Energy, Wine, Boston and Community datasets (where AutoNCP outperforms it by 55.56%, 43.59%, 14.28% and 13.87%, respectively).

[Table 1: for each of the eight datasets (Community, Concrete, Boston, Energy, MEPS 19, MEPS 20, MEPS 21, Wine), mean ± std of Avg. Coverage (%) and Avg. Length for AutoNCP, the Best Benchmark and the Worst Benchmark, together with the Length Decrease (%) of AutoNCP v.s. the Best and v.s. the Worst Benchmark.]

Table 1: Comparison of the average coverage rate and interval length (AutoNCP v.s. the best and worst benchmark for each of the eight datasets): AutoNCP yields large improvements of interval length while the coverage rates are very similar among the algorithms. AutoNCP outperforms the best model, which shows the gain of selecting good estimators and calibration methods jointly.

[Table 2: for each of the eight datasets (Boston, Energy, Wine, MEPS 19, MEPS 20, MEPS 21, Concrete, Community), the average interval length achieved by Model + Cal, Estimator + Cal, and the full AutoNCP.]

Table 2: Source of gain

Source of gain. To better understand how the gains achieved by AutoNCP arise, we consider the effect of fixing a particular component and then optimizing the others. We consider two cases: in “Model+Cal”, we fix the estimator to be conformal quantile regression and optimize the selection of the prediction model and the calibration method; in “Estimator+Cal”, we fix the model to be a neural network and optimize the selection of the estimator and the calibration method. We report the results in Table 2. Both restricted variants are outperformed by the full version of AutoNCP on all the datasets, and the improvement of AutoNCP over either restricted variant is quite significant. This supports the view that it is worthwhile to optimize the selection of all components of the pipeline, and not just a portion of them.

We provide more experimental results in the Appendix. First, Appendix A reports the second set of experiments, on Conditional Average Treatment Effects. In those experiments, AutoNCP delivers the shortest average confidence interval length on two datasets, IHDP Hill (2011) and LBIDD Shimoni et al. (2018); Mathews and Atkinson (1998), among the methods that achieve the target coverage rate. For the first set of experiments, Tables 6 and 7 in Appendix D show the average coverage rate and interval length of each algorithm on each of the eight datasets, at target coverage rates of 90% and 95% respectively. In Appendix D.1, we illustrate the concept of Discrimination by comparing the histograms and cumulative distribution functions (CDFs) of the confidence interval lengths given by AutoNCP and CQR-NN.

7 Conclusion

This paper introduces AutoNCP, a simple and powerful AutoML framework for constructing predictive confidence intervals with valid coverage guarantees, without human intervention. Experiments using real-world datasets from a variety of domains demonstrate that AutoNCP outperforms existing methods, i.e. produces tighter intervals. AutoNCP is also effective for the problem of CATE estimation, which plays an important role in personalized medicine and policymaking. Because AutoNCP provides tighter confidence intervals in real-world applications, it allows non-experts to use machine learning methods with more certainty.

Broader Impact

Our research attempts to develop actionable confidence intervals based on automatic machine learning and conformal prediction algorithms. Conformal prediction is distribution-free and able to construct confidence intervals with valid coverage in finite samples. However, it is often unclear how tight the constructed intervals are. With increasing interest in the topic of conformal prediction, many methods have been proposed, but there is no unified framework for selecting which model, estimator, or calibration method to use in practice. The positive outcome of our research is that our AutoML system can construct tight confidence intervals with a rigorous coverage guarantee for machine learning models while keeping the human out of the loop. This is important for non-expert users, e.g. natural scientists, who need good confidence intervals on their predictive models for scientific discovery. Since our work is the first AutoML algorithm for confidence intervals, it also helps machine learning practitioners simplify the daily job of finding good confidence intervals for their predictive models. We do note that the safety and fairness challenges in machine learning are not solved by our method, which focuses on constructing tight confidence intervals. Even when a confidence interval is tight and we are very certain about a particular prediction, the prediction can still be unfair due to bias in the training data.


References

  • [1] A. M. Alaa and M. van der Schaar (2017) Bayesian inference of individualized treatment effects using multi-task Gaussian processes. In Advances in Neural Information Processing Systems, pp. 3424–3432.
  • [2] A. Alaa and M. van der Schaar (2018) AutoPrognosis: automated clinical prognostic modeling via Bayesian optimization with structured kernel learning. In Proceedings of the 35th International Conference on Machine Learning, Vol. 80, pp. 139–148.
  • [3] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané (2016) Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
  • [4] S. Athey and G. Imbens (2016) Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113 (27), pp. 7353–7360.
  • [5] R. F. Barber, E. J. Candes, A. Ramdas, and R. J. Tibshirani (2019) Predictive inference with the jackknife+. arXiv preprint arXiv:1905.02928.
  • [6] M. J. Bayarri and J. O. Berger (2004) The interplay of Bayesian and frequentist analysis. Statistical Science, pp. 58–80.
  • [7] J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13 (Feb), pp. 281–305.
  • [8] V. Chernozhukov, K. Wüthrich, and Y. Zhu (2019) Distributional conformal prediction. arXiv preprint arXiv:1909.07889.
  • [9] H. A. Chipman, E. I. George, R. E. McCulloch, et al. (2010) BART: Bayesian additive regression trees. The Annals of Applied Statistics 4 (1), pp. 266–298.
  • [10] D. D. Cox (1993) An analysis of Bayesian inference for nonparametric regression. The Annals of Statistics, pp. 903–923.
  • [11] E. Daxberger, A. Makarova, M. Turchetta, and A. Krause (2019) Mixed-variable Bayesian optimization. arXiv preprint arXiv:1907.01329.
  • [12] D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
  • [13] H. J. Escalante, W. Tu, I. Guyon, D. L. Silver, E. Viegas, Y. Chen, W. Dai, and Q. Yang (2020) AutoML@NeurIPS 2018 challenge: design and results. In The NeurIPS'18 Competition, pp. 209–229.
  • [14] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter (2015) Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, pp. 2962–2970.
  • [15] Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.
  • [16] L. Guan (2019) Conformal prediction with localization. arXiv preprint arXiv:1908.08558.
  • [17] L. Györfi, M. Kohler, A. Krzyzak, and H. Walk (2006) A distribution-free theory of nonparametric regression. Springer Science & Business Media.
  • [18] T. Hastie, R. Tibshirani, and J. Friedman (2009) The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media.
  • [19] J. M. Hernández-Lobato and R. Adams (2015) Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pp. 1861–1869.
  • [20] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani (2014) Predictive entropy search for efficient global optimization of black-box functions. In Advances in Neural Information Processing Systems, pp. 918–926.
  • [21] J. L. Hill (2011) Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics 20 (1), pp. 217–240.
  • [22] U. Johansson, H. Boström, T. Löfström, and H. Linusson (2014) Regression conformal prediction with random forests. Machine Learning 97 (1-2), pp. 155–176.
  • [23] D. R. Jones, M. Schonlau, and W. J. Welch (1998) Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13 (4), pp. 455–492.
  • [24] K. Kandasamy, J. Schneider, and B. Póczos (2015) High dimensional Bayesian optimisation and bandits via additive models. In International Conference on Machine Learning, pp. 295–304.
  • [25] B. Kim, C. Xu, and R. F. Barber (2020) Predictive inference is free with the jackknife+-after-bootstrap. arXiv preprint arXiv:2002.09025.
  • [26] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [27] D. Kivaranovic, K. D. Johnson, and H. Leeb (2019) Adaptive, distribution-free prediction intervals for deep neural networks. arXiv preprint arXiv:1905.10634.
  • [28] A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter (2016) Fast Bayesian optimization of machine learning hyperparameters on large datasets. arXiv preprint arXiv:1605.07079.
  • [29] R. Koenker and G. Bassett Jr (1978) Regression quantiles. Econometrica: Journal of the Econometric Society, pp. 33–50.
  • [30] L. Kotthoff, C. Thornton, H. H. Hoos, F. Hutter, and K. Leyton-Brown (2017) Auto-WEKA 2.0: automatic model selection and hyperparameter optimization in WEKA. The Journal of Machine Learning Research 18 (1), pp. 826–830.
  • [31] A. K. Kuchibhotla and A. K. Ramdas (2019) Nested conformal prediction and the generalized jackknife+. arXiv preprint arXiv:1910.10562.
  • [32] H. J. Kushner (1964) A new method of locating the maximum point of an arbitrary multipeak curve in the presence of noise.
  • [33] B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413.
  • [34] J. Lawless and M. Fredette (2005) Frequentist prediction intervals and predictive distributions. Biometrika 92 (3), pp. 529–542.
  • [35] J. Lei, M. G'Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman (2018) Distribution-free predictive inference for regression. Journal of the American Statistical Association 113 (523), pp. 1094–1111.
  • [36] W. Maddox, T. Garipov, P. Izmailov, D. Vetrov, and A. G. Wilson (2019) A simple baseline for Bayesian uncertainty in deep learning. arXiv preprint arXiv:1902.02476.
  • [37] T. Mathews and J. Atkinson (1998) Infant mortality statistics from the linked birth/infant death data set—1995 period data. Monthly Vital Statistics Reports 46 (6).
  • [38] N. Meinshausen (2006) Quantile regression forests. Journal of Machine Learning Research 7 (Jun), pp. 983–999.
  • [39] J. Mockus (2012) Bayesian approach to global optimization: theory and applications. Vol. 37, Springer Science & Business Media.
  • [40] H. Papadopoulos and H. Haralambous (2011) Reliable prediction intervals with regression neural networks. Neural Networks 24 (8), pp. 842–851.
  • [41] Y. Romano, E. Patterson, and E. J. Candès (2019) Conformalized quantile regression. arXiv preprint arXiv:1905.03222.
  • [42] J. Rousseau and B. Szabo (2016) Asymptotic frequentist coverage properties of Bayesian credible sets for sieve priors. arXiv preprint arXiv:1609.05067.
  • [43] M. Sesia and E. J. Candès (2019) A comparison of some conformal quantile regression methods. arXiv preprint arXiv:1909.05433.
  • [44] Y. Shimoni, C. Yanover, E. Karavani, and Y. Goldschmidt (2018) Benchmarking framework for performance-evaluation of causal inference analysis. arXiv preprint arXiv:1802.05046.
  • [45] J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959.
  • [46] N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger (2012) Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory 58 (5), pp. 3250–3265.
  • [47] I. Steinwart, A. Christmann, et al. (2011) Estimating conditional quantiles with the help of the pinball loss. Bernoulli 17 (1), pp. 211–225.
  • [48] B. Szabó, A. W. van der Vaart, J. van Zanten, et al. (2015) Frequentist coverage of adaptive nonparametric Bayesian credible sets. The Annals of Statistics 43 (4), pp. 1391–1428.
  • [49] J. W. Taylor (2000) A quantile regression neural network approach to estimating the conditional density of multiperiod returns. Journal of Forecasting 19 (4), pp. 299–311.
  • [50] V. Vovk, A. Gammerman, and G. Shafer (2005) Algorithmic learning in a random world. Springer Science & Business Media.
  • [51] V. Vovk (2015) Cross-conformal predictors. Annals of Mathematics and Artificial Intelligence 74 (1-2), pp. 9–28.
  • [52] S. Wager and S. Athey (2018) Estimation and inference of heterogeneous treatment effects using random forests. Journal of the American Statistical Association 113 (523), pp. 1228–1242.
  • [53] Z. Wang, C. Gehring, P. Kohli, and S. Jegelka (2017) Batched large-scale Bayesian optimization in high-dimensional spaces. arXiv preprint arXiv:1706.01445.
  • [54] M. Welling and Y. W. Teh (2011) Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681–688.
  • [55] C. K. Williams and C. E. Rasmussen (2006) Gaussian processes for machine learning. Vol. 2, MIT Press, Cambridge, MA.
  • [56] Y. Yang, S. T. Tokdar, et al. (2015) Minimax-optimal nonparametric regression in high dimensions. The Annals of Statistics 43 (2), pp. 652–674.
  • [57] Y. Zhang, A. Bellot, and M. van der Schaar (2020) Learning overlapping representations for the estimation of individualized treatment effects. arXiv preprint arXiv:2001.04754.

Appendix A Results on conditional average treatment effect

For the second set of experiments, we turn to the problem of developing reliable confidence intervals for estimating Conditional Average Treatment Effects (CATE). If the responses are Y(0) (untreated) and Y(1) (treated), then the CATE is E[Y(1) − Y(0) | X = x]. (In actual data, either the patient was treated or not treated, but not both – so only one of Y(0), Y(1) is actually observed.) State-of-the-art methods for CATE estimation rely either on Bayesian credible sets [9, 1, 57] or on (tree-based) sample splitting [4, 52]. To construct confidence intervals for CATE estimation, we apply AutoNCP to construct confidence intervals for Y(0) and for Y(1) separately. If these confidence intervals are [l0, u0] and [l1, u1], respectively, then the confidence interval for CATE estimation is [l1 − u0, u1 − l0] and its length is (u1 − l0) − (l1 − u0). To be consistent with our previous experiment, we use 95% confidence intervals (miscoverage 0.05) for each response, so that we obtain 90% confidence intervals (miscoverage 0.1) for the response difference Y(1) − Y(0). We compare the confidence intervals produced by AutoNCP (using CMGP as the prediction model and optimizing only the estimators and calibration methods) with the confidence intervals produced by Causal Random Forest (CRF) [52] and the credible sets produced by Causal Multitask Gaussian Process (CMGP) [1] and Bayesian Additive Regression Trees (BART) [9], using the datasets IHDP [21] and LBIDD [44, 37]. (We omit comparison against estimation methods for CATE that do not produce confidence intervals.) The results are reported in Table 3. As can be seen, CMGP and AutoNCP achieve the target coverage for both the responses and the CATE, but AutoNCP provides much better (smaller) confidence intervals. Both BART and CRF fall short on response coverage on both datasets, and BART also falls short on CATE coverage. The credible sets of CMGP over-cover the data at the price of interval length.
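The interval arithmetic above can be written out concretely. A minimal sketch (the function name and the example numbers are ours, purely illustrative):

```python
def cate_interval(l0, u0, l1, u1):
    """Combine per-arm confidence intervals into a CATE interval.

    [l0, u0]: interval for the untreated response; [l1, u1]: interval
    for the treated response. By a union bound, if each arm interval
    has miscoverage alpha/2, the difference interval has miscoverage
    at most alpha (e.g. two 95% intervals give a 90% CATE interval).
    """
    lower = l1 - u0  # smallest plausible treated-minus-untreated gap
    upper = u1 - l0  # largest plausible gap
    return lower, upper, upper - lower

# Example: arm intervals [1, 2] and [4, 6] give the CATE interval [2, 5].
lo, hi, length = cate_interval(l0=1.0, u0=2.0, l1=4.0, u1=6.0)
```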

Table 3: CATE estimation on the IHDP and LBIDD datasets (columns: Response Avg. Coverage, CATE Avg. Coverage, Avg. Length). Interval lengths are highlighted in bold if the intervals achieve the target coverage of both the response difference and the CATE.

Appendix B Implementation Details

B.1 Benchmarks

In this section, we provide the details of each benchmark compared in the experiments. We briefly describe how the models are built and how their hyperparameters are chosen by random search [7] in cross-validation.

  • SCP-Ridge: We construct the mean estimator using the standard linear ridge regression model “RidgeCV” in the Python package sklearn. We tune the logarithm of the regularization parameter and choose the value with the smallest cross-validation error.

  • SCP-Ridge-Local: The mean estimator is the same as in SCP-Ridge. The mean absolute deviation (MAD) estimator is a k-nearest neighbors regressor with a single hyperparameter, the number of nearest neighbors.

  • SCP-RF: We build the mean estimator using the Random Forest Regressor in sklearn. We tune the following hyperparameters: the number of estimators, the minimum fraction of samples in a leaf, and the maximum fraction of features to consider when looking for the best split. The other hyperparameters are set to their default values in sklearn.

  • SCP-RF-Local: Both the mean estimator and the MAD estimator are Random Forest Regressors in sklearn. The hyperparameters of both are chosen as described above, while the other hyperparameters are set to their default values.

  • SCP-NN: The mean estimator is parameterized by a three-layer feedforward network, with a ReLU activation at each layer. We optimize the network with the stochastic optimizer Adam [26], with a fixed learning rate. The batch size is set to the minimum of a fixed cap and the size of the dataset. The number of units (i.e. the layer size) is the same for each layer. We tune the logarithm of the weight decay, the dropout rate, and the layer size. The maximum number of training epochs is 300; we then choose the number of training epochs via the standard early stopping method.

  • SCP-NN-Local: Both the mean estimator and the MAD estimator have the same architecture as the ReLU network used in SCP-NN. The hyperparameters of both networks are chosen from the same ranges as in SCP-NN.

  • CQR-RF: We use Conformalized Quantile Regression (CQR) with quantile regression forests (QRF) [38]. Built on the hyperparameter space of SCP-RF, QRF has two additional hyperparameters, the levels of the upper and lower quantile estimators, which control the coverage rate on the training data.

  • CQR-NN: We combine Conformalized Quantile Regression (CQR) with the quantile regression neural network approach of [49]. The network architecture and hyperparameters are the same as in SCP-NN, except that the network output is a two-dimensional vector representing the lower and upper conditional quantiles, and the loss function is the pinball loss [29, 47]. Likewise, CQR-NN has two extra quantile hyperparameters, chosen from the same ranges as in CQR-RF.
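As a concrete reference point for the SCP-* benchmarks above, split conformal prediction with the RidgeCV mean estimator can be sketched as follows (a simplified sketch with a single fixed train/calibration split; the function name and regularization grid are our choices):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def split_conformal_ridge(X_train, y_train, X_cal, y_cal, X_test, alpha=0.1):
    """Split conformal prediction with a ridge mean estimator."""
    # Fit the mean estimator on the proper training set.
    model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_train, y_train)
    # Nonconformity scores: absolute residuals on the calibration set.
    scores = np.abs(y_cal - model.predict(X_cal))
    n = len(scores)
    # Empirical (1 - alpha) quantile with the finite-sample correction.
    level = min(1.0, np.ceil((1 - alpha) * (n + 1)) / n)
    q = np.quantile(scores, level)
    mu = model.predict(X_test)
    return mu - q, mu + q
```

The -Local variants additionally fit a MAD estimator on the training set and calibrate normalized residuals, so the interval width varies with x instead of being constant.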

B.2 BO Implementation

In this section, we introduce the hyperparameter spaces of the three prediction models used in AutoNCP: Ridge Regression, Random Forest, and Neural Network. The neural network has the same architecture as the one used in SCP-NN. Hyperparameters of the Random Forest not mentioned below are set to the default values of the Random Forest Regressor in sklearn. The model hyperparameter spaces in AutoNCP are as follows.

  1. Ridge Regression:

    • Logarithm of regularization: float.

  2. Random Forest:

    • Maximum fraction of features to consider when looking for the best split: float.

    • Minimum fraction of samples in a leaf: float.

    • Number of estimators: int.

  3. Neural network:

    • Logarithm of weight decay: float.

    • Dropout rate: float.

    • Layer size: int.

    • Number of training epochs: int.

    • Batch size: int.

Since we use three prediction models in AutoNCP, we need to specify five kernel functions based on the kernel decomposition introduced in Section 5: one kernel per prediction model, plus an “estimator” kernel and a “calibration” kernel. The input spaces of the three model kernels are the hyperparameter spaces of Ridge Regression, Random Forest, and Neural Network described above, respectively. The input of the “estimator” kernel is a one-hot vector indicating which of the first five estimators in Table 4 is used, together with the hyperparameters of the lower and upper quantile estimators.

When the second estimator in Table 4 is chosen, we adopt a Bayesian approach to construct the MAD estimator on the training data. For Ridge Regression, it is given by the standard deviation estimator of Bayesian linear regression; for Random Forest, we use the standard deviation of the predictions in the ensemble; for Neural Network, we take the estimated standard deviation of MC-Dropout [15]. We then compute the upper and lower quantiles of the Bayesian credible set, where the multiplicative constants correspond to the chosen quantile hyperparameters (e.g., a constant of 1.96 corresponds to the 97.5th percentile of a Gaussian).
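For the Random Forest case, the ensemble-based standard deviation can be computed directly from the per-tree predictions. A minimal sketch (the function name is ours):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_mean_and_std(X_train, y_train, X_test, n_estimators=100):
    """Mean prediction and ensemble spread for a random forest.

    The standard deviation of the per-tree predictions plays the role
    of the sigma-hat used to scale the credible set.
    """
    rf = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    rf.fit(X_train, y_train)
    # Stack each tree's prediction: shape (n_estimators, n_test).
    per_tree = np.stack([tree.predict(X_test) for tree in rf.estimators_])
    return per_tree.mean(axis=0), per_tree.std(axis=0)
```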

The input of the “calibration” kernel is a two-dimensional vector, consisting of a binary variable indicating whether we use CV+ or Bootstrap and a discrete variable K, the number of folds in CV+ or the number of bags in Bootstrap. Recall that CV+ in Table 5 is only available for K ≥ 2. Therefore, when K = 1 and CV+ are chosen by the BO, we implement Split Conformal.

Table 4: Comparison of interval estimators (columns: Paper, Estimators).

In the experiments, we use 20% of the training data as the validation set. We use the Matérn-5/2 kernel [55] for all of these kernel functions, and the expected improvement (EI) [23, 39] as the acquisition function. The maximum number of BO iterations is set to 1000.
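One step of such a GP-based BO loop, with a Matérn-5/2 surrogate and the EI acquisition, can be sketched as follows (a one-dimensional toy objective stands in for the validation interval length; all names here are ours):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(gp, X_cand, y_best):
    """EI acquisition for minimization under a GP surrogate."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # guard against zero predictive std
    z = (y_best - mu) / sigma
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# One BO iteration: fit the Matern-5/2 surrogate on observed
# (hyperparameter, validation interval length) pairs, then pick the
# candidate maximizing EI. The sine objective is a toy stand-in.
rng = np.random.default_rng(0)
X_obs = rng.uniform(0.0, 1.0, size=(5, 1))
y_obs = np.sin(5.0 * X_obs[:, 0]) + 0.1 * rng.standard_normal(5)
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_obs, y_obs)
X_cand = np.linspace(0.0, 1.0, 200)[:, None]
x_next = X_cand[np.argmax(expected_improvement(gp, X_cand, y_obs.min()))]
```

In AutoNCP, the candidate set also carries the discrete estimator and calibration choices, handled through the one-hot encodings described above.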

Figure 3: The interval length averaged over all the datasets in the 1000 BO iterations. Our BO procedure (the red curve) converges faster and finds a better minimum than the joint modelling approach (the green curve).

Figure 3 shows the convergence of BO when optimizing the average interval length on the validation set, averaged over all the UCI and MEPS datasets in the experiments section. Our BO procedure converges faster and finds a better minimum than the joint modelling approach. Compared with the naive joint modelling approach, which models the whole NCP pipeline using a single GP with one high-dimensional kernel, our modelling method differs in two aspects: (1) the sparse additive structural kernels, and (2) the joint marginal likelihood optimization used to learn the shared kernels. Figure 3 shows the importance of modelling the pipeline of each prediction model with a separate GP, so that the kernel can measure the similarity between two different pipelines without taking any redundant dimensions into account. It also shows the benefit of optimizing the shared kernels with a joint marginal likelihood function.

Appendix C Calibration

K-folds. Suppose S_1, …, S_K denote a disjoint partition of {1, …, n} such that |S_1| = ⋯ = |S_K| = n/K. The equality of sizes is needed for exchangeability. Let F_t^{-S_k}, t ∈ T, denote the nested family of prediction regions fitted with the fold S_k held out. Define the sample-wise score

    R_i = inf{ t ∈ T : y_i ∈ F_t^{-S_{k(i)}}(x_i) },

where k(i) identifies the subset S_{k(i)} that contains the sample i. Taking R_i to be the smallest prediction region that contains the residual for example i, the prediction regions of the conformal prediction methods in Table 4 have the general form F_t(x) for some t ∈ T. Using this notation, the cross-conformal confidence interval [51] is defined as

    C_cross(x) = { y : Σ_{i=1}^{n} 1{ y ∉ F_{R_i}^{-S_{k(i)}}(x) } < (1 − α)(n + 1) }.

It has been shown in [31] that for all x, we have

    C_cross(x) ⊆ [ q_{n,α}^−, q_{n,α}^+ ],

where q_{n,α}^− denotes the ⌊α(n+1)⌋-th smallest of the lower endpoints of the regions F_{R_i}^{-S_{k(i)}}(x) and q_{n,α}^+ denotes the ⌈(1−α)(n+1)⌉-th smallest of their upper endpoints. The interval [q_{n,α}^−, q_{n,α}^+] is a generalization of the CV+ method in [5]. Under the exchangeability assumption, Theorem 4 in [5] and the results in [31] show that the coverage of this interval is at least 1 − 2α, up to an excess-miscoverage term that vanishes as n grows with K fixed. Therefore, for large n, we can achieve a valid coverage guarantee. Although the cross-conformal region is deterministically contained in the CV+ interval, and hence tighter, it does not by itself yield a meaningful coverage guarantee [5], especially for large α.
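The K-fold (CV+) calibration above can be sketched numerically with absolute residual scores, following the order-statistic endpoints of [5] (the ridge base model and function name are our illustrative choices, not part of AutoNCP):

```python
import numpy as np
from sklearn.linear_model import Ridge

def cv_plus_interval(X, y, x_test, K=5, alpha=0.1):
    """CV+ interval at a single test point, with absolute residual scores.

    Each training point's score R_i comes from the model whose fold held
    that point out; the endpoints are the floor(alpha(n+1))-th smallest
    lower value and the ceil((1-alpha)(n+1))-th smallest upper value.
    """
    n = len(y)
    lower, upper = [], []
    for S in np.array_split(np.arange(n), K):
        mask = np.ones(n, dtype=bool)
        mask[S] = False
        model = Ridge().fit(X[mask], y[mask])   # fitted without fold S
        R = np.abs(y[S] - model.predict(X[S]))  # out-of-fold scores
        mu = model.predict(x_test.reshape(1, -1))[0]
        lower.extend(mu - R)
        upper.extend(mu + R)
    lower, upper = np.sort(lower), np.sort(upper)
    lo = lower[max(int(np.floor(alpha * (n + 1))), 1) - 1]
    hi = upper[min(int(np.ceil((1 - alpha) * (n + 1))), n) - 1]
    return lo, hi
```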

Bootstrap. The Bootstrap [31, 25] is an alternative to the K-fold splitting method, in which each bag is obtained by randomly sampling the training indices with replacement. The score of sample i is computed using only the models whose bags do not contain i; this differs from Equation 5 in the main manuscript in the upper index of the estimator, because the indices overlap among the bags. Define the score function