Estimating Uncertainty Intervals from Collaborating Networks

02/12/2020 · by Tianhui Zhou, et al. · Duke University

Effective decision making requires understanding the uncertainty inherent in a prediction. To estimate uncertainty in regression, one could modify a deep neural network to predict coverage intervals, such as by predicting the mean and standard deviation. Unfortunately, in our empirical evaluations the predicted coverage from existing approaches is either overconfident or lacks sharpness (gives imprecise intervals). To address this challenge, we propose a novel method to estimate uncertainty based on two distinct neural networks with two distinct loss functions in a similar vein to Generative Adversarial Networks. Specifically, one network tries to learn the cumulative distribution function, and the second network tries to learn its inverse. Theoretical analysis demonstrates that the idealized solution is a fixed point and that under certain conditions the approach is asymptotically consistent to ground truth. We benchmark the approach on one synthetic and five real-world datasets, including forecasting A1c values in diabetic patients from electronic health records, where uncertainty is critical. In synthetic data, the proposed approach essentially matches the theoretically optimal solution in all aspects. In the real datasets, the proposed approach is empirically more faithful in its coverage estimates and typically gives sharper intervals than competing methods.


1 Introduction

Deep learning techniques are a proven breakthrough in a multitude of prediction problems; however, effective decision-making often requires an accurate assessment of uncertainty in addition to the prediction itself Bellman and Zadeh (1970). For a single continuous outcome, knowing the full conditional probability distribution enables more effective decision-making. For example, consider forecasting future lab values of diabetic patients from electronic health records. Whether this forecast should inform clinical actions depends on how certain it is, which in turn depends on the recency and completeness of the data. Therefore, we want to build a system that faithfully quantifies its uncertainty based on the available information.

Quantifying uncertainty has a long history in statistics and has been extended into neural network frameworks Blundell et al. (2015); Gal (2016); Blei et al. (2017). The outputs of these systems should ideally be statistically calibrated Mitchell and Wallis (2011), meaning that the predicted probability matches the ground truth occurrence rate of an event (e.g., in binary prediction, 70% positive probability should indicate that the event is actually positive 70% of the time). This is checked by evaluating true hit rates on validation data. Notably, out-of-the-box deep learning methods typically result in over-confident models, but correction methods such as Platt scaling in binary classification can be effective Platt (1999); Guo et al. (2017).

In regression, there have been many recent efforts to provide effective confidence intervals, including heteroskedastic regression Harvey (1976), post-hoc isotonic regression Zadrozny and Elkan (2002), approximate Bayesian methods Gal (2016), and ensemble methods Lakshminarayanan et al. (2017). Empirically, we have found that existing methods are either faithful to the prediction interval (e.g., the 95% confidence interval covers 95% of the outcomes) or sharp (e.g., the prediction interval is narrow), but not both, as we demonstrate in our experiments. This challenge may partly stem from existing methods achieving sharpness by overfitting, which leads to unfaithful coverage.

In this manuscript, we introduce a new modeling framework capable of learning a network that provides both faithful coverage intervals and sharp predictions. A straightforward idea would be to learn a function g to approximate the conditional Cumulative Distribution Function (CDF) given a data sample. However, there is no effective loss that can be directly used to learn this function; instead, we introduce a second network f to approximate the conditional inverse CDF. Despite the seeming similarity of these paired functions to an autoencoder, the networks must be learned with separate losses, in a similar vein to the Generative Adversarial Network (GAN) Goodfellow et al. (2014). We show that the desired solution, in which g and f equal the true conditional CDF and inverse CDF, is a fixed point of the optimization scheme. Furthermore, we show that the optimality of g only requires f to satisfy fairly mild conditions, yielding a stable solution with asymptotic consistency under certain model classes. This framework can either be used to extend existing neural network architectures or as a post-hoc correction method. In the following sections we provide theoretical analysis and demonstrate empirical performance on one synthetic dataset and five real-world datasets.

2 Related Work

Uncertainty in binary classification is relatively well explored. In the classical logistic regression setting, the predicted probability is usually well calibrated Kull et al. (2017). However, deep networks become overconfident from overfitting, which can be partially addressed by normalization and weight decay Guo et al. (2017) or by intelligently varying the inputs Thulasidasan et al. (2019). Platt scaling has been fairly effective Platt (1999). In this two-step method, an initial score s(x) is learned on the training data, and then a small reserved dataset is used to fit σ(a·s(x) + b) as the calibrated probability. Another frequently used nonparametric calibration method is isotonic regression Zadrozny and Elkan (2002), where the [0, 1] output range of the pre-trained network is binned and the observed proportion of positives in each bin replaces the original predicted probability.
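To make the two-step recipe concrete, below is a minimal sketch of Platt scaling in Python; the helper name platt_scale and the use of scikit-learn's default LogisticRegression (which adds mild L2 regularization) are our illustrative assumptions, not details from the cited works.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_scale(scores_cal, labels_cal):
    """Fit sigma(a*s + b) on a small reserved calibration set of raw scores."""
    lr = LogisticRegression()  # assumption: default regularized fit
    lr.fit(np.asarray(scores_cal).reshape(-1, 1), labels_cal)
    return lambda s: lr.predict_proba(np.asarray(s).reshape(-1, 1))[:, 1]

# usage: calibrate = platt_scale(val_scores, val_labels)
#        p_calibrated = calibrate(test_scores)
```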

For a continuous output, one needs a separate predicted value for each percentile of interest; succinctly, this requires covering the full span of the conditional distribution. Classically, this has been performed with the tilted (pinball) loss of quantile regression Koenker and Bassett (1978), which has been extended into neural network frameworks to simultaneously mimic the full distribution over its support Romano et al. (2019); Tagasovska and Lopez-Paz (2019). Alternatively, Bayesian neural networks naturally provide a scheme to generate uncertainty estimates Gal (2016). One can draw from the posterior predictive distribution based on the observed data using sampling methods Neal (2012). To address computational challenges, one can approximate the Bayesian solution by using dropout training as an approximation to Bayesian inference Gal and Ghahramani (2016), by using variational inference Blei et al. (2017), or by using stochastic gradient sampling methods Li et al. (2016). While Bayesian approaches have nice theoretical properties, they suffer from mis-specification of the model and the inability to exactly assess the posterior, resulting in a mismatch between the claimed credibility and reality.

Alternatively, one could directly estimate uncertainty by training the model to produce both a mean and a variance estimate Nix and Weigend (1994). However, this initial approach is unstable and prone to overfitting; it can be enhanced by adjusting the gradient calculation and training the mean and variance models separately Skafte et al. (2019). One could also combine different neural networks trained with mild perturbations to create uncertainty estimates Lakshminarayanan et al. (2017), or decompose the overall uncertainty into components accounting for different noises and errors on top of an ensemble model Liu et al. (2019). Mixture models can also be used to approximate arbitrary distribution functions Brando et al. (2019). Finally, a continuous counterpart to isotonic regression called calibrated regression can be applied after an initial model has been trained Kuleshov et al. (2018).

Compared to these approaches, our method is unique in that it relies neither on underlying distributional assumptions nor on approximations in its gradient calculation. We can incorporate all training covariates to learn a function that effectively generates the prediction interval for any possible quantile, yielding precise and faithful uncertainty estimates. Our model is also theoretically sound, and we prove consistency to the ground truth for given model classes.

(a) g-network.
(b) Training scheme.
(c) Quantile estimation.
Figure 1: 1(a) gives a visualization of the g-network. The dashed line means the objective function does not produce a useful derivative. 1(b) gives the f-network, which helps with the non-differentiable objective function in Eq. (1). In this framework, f and g can be jointly trained to learn the two functions. 1(c) gives the final mapping to the quantiles after the network has been trained.

3 Preliminaries

Before introducing our learning methodology, we first set up the notation and network definitions.

3.1 Notation

Let x denote the input features, x ∈ X, with x_i denoting an observed sample. We denote y as the continuous outcome variable, with y_i an observed value. The pair is assumed to have a joint distribution function F_{X,Y}(x, y) and an underlying conditional distribution function F_{Y|X}(y|x). Let q denote the quantile, where q ∈ (0, 1). For instance, q = 0.5 would yield the median of the conditional distribution. D_q is a chosen distribution used to generate the percentiles q ∈ (0, 1).

3.2 Neural Network Approximation Functions

The proposed method is based on two functions. We will first define them along with their predictive goals, and then discuss how such functions can be learned with neural networks. First, we have the function f(q, x; θ_f) with parameters θ_f, denoted f(q, x) for simplicity. The goal of this network is to approximate the inverse CDF, f(q, x) ≈ F_{Y|X}^{-1}(q|x). An optimal f should have the property that P(y ≤ f(q, x) | x) = q. Our second function is g(y, x; θ_g) with parameters θ_g (simplified as g(y, x)), which tries to approximate the conditional CDF. It takes a value y, together with the covariates x, and predicts its location in the conditional distribution, g(y, x) ≈ F_{Y|X}(y|x). A perfect g should have the property g(f(q, x), x) = q. When both f and g perfectly match their goals, they satisfy the following properties: g(f(q, x), x) = q for all q ∈ (0, 1); and the quantile is true in expectation, with

E_{x,y} [ 1{y ≤ f(q, x)} ] = q.   (1)

These two properties will be exploited in the following to create a learning scheme based on neural networks.
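To fix ideas, the following is a minimal PyTorch sketch of the two functions' interfaces; the class names FNet/GNet and the simple MLP architectures are illustrative assumptions, as the paper's architectures are not specified here.

```python
import torch
import torch.nn as nn

class FNet(nn.Module):
    """f(q, x): approximates the conditional inverse CDF F^{-1}(q | x)."""
    def __init__(self, d_x, d_h=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_x + 1, d_h), nn.ReLU(), nn.Linear(d_h, 1))

    def forward(self, q, x):                    # q: (n, 1), x: (n, d_x)
        return self.net(torch.cat([q, x], dim=1))

class GNet(nn.Module):
    """g(y, x): approximates the conditional CDF F(y | x), output in (0, 1)."""
    def __init__(self, d_x, d_h=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_x + 1, d_h), nn.ReLU(), nn.Linear(d_h, 1))

    def forward(self, y, x):                    # y: (n, 1), x: (n, d_x)
        return torch.sigmoid(self.net(torch.cat([y, x], dim=1)))
```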

4 Joint Function Learning

A first approach to achieving the equality in Eq. (1) would be to directly adopt it as an objective (e.g., minimize the squared loss (E[1{y ≤ f(q, x)}] − q)²); unfortunately, this objective function's gradient comes from an indicator function, which is zero almost everywhere and therefore useless for learning. We circumvent this issue with our joint learning scheme.

Specifically, the neural networks f and g are bestowed with two distinct losses,

ℓ_f(f; g) = E_x E_{q∼D_q} [ ( g(f(q, x), x) − q )² ],   (2)

ℓ_g(g; f) = E_{x,y} E_{q∼D_q} [ CE( 1{y ≤ f(q, x)}, g(f(q, x), x) ) ].   (3)

Here CE is the logistic regression (binary cross-entropy) loss, CE(t, p) = −t log p − (1 − t) log(1 − p). Eqs. (2) and (3) are the losses in true expectation; in practice, we use an empirical risk minimization approach to estimate the expectations. The distribution D_q for the quantiles can be chosen as desired; any distribution with full support on (0, 1) satisfies our theoretical framework. In practice, we choose D_q = U(0, 1) (the uniform distribution). A visualization of the proposed model framework is given in Figure 1.

The f-loss and g-loss defined in Eq. (2) and Eq. (3) are straightforward to optimize, allowing an alternating learning scheme with standard gradient methods; note that most neural network architectures can easily be incorporated in this framework. Remarkably, under conditions similar to the theoretical claims for GANs Goodfellow et al. (2014), these losses induce a fixed point for f and g at their desired properties (see Section 4.1). We further demonstrate that g has a fixed point at the desired solution under very mild conditions on f, meaning that g is robust to a sub-optimal f. We describe the full learning strategy that exploits this realization in Section 4.2 and Algorithm 1.
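A hedged sketch of one alternating update on a mini-batch is below, assuming the FNet/GNet interfaces sketched in Section 3.2; the optimizer handling and the detach of f's output inside the g-loss are our implementation assumptions.

```python
import torch
import torch.nn.functional as nnf

def cn_step(f_net, g_net, opt_f, opt_g, x, y):
    """One alternating step of Eq. (2) and Eq. (3) on a mini-batch; y: (n, 1)."""
    q = torch.rand(x.shape[0], 1)                  # q ~ D_q = U(0, 1)

    # g-loss (Eq. 3): logistic loss pushing g toward the conditional CDF.
    y_tilde = f_net(q, x).detach()                 # hold f fixed for this update
    target = (y <= y_tilde).float()                # indicator 1{y <= f(q, x)}
    g_loss = nnf.binary_cross_entropy(g_net(y_tilde, x), target)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    # f-loss (Eq. 2): push g(f(q, x), x) toward q (only theta_f is updated).
    f_loss = ((g_net(f_net(q, x), x) - q) ** 2).mean()
    opt_f.zero_grad(); f_loss.backward(); opt_f.step()
    return g_loss.item(), f_loss.item()
```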

Additionally, if desired, our method can be integrated with heteroskedastic regression in a two-stage procedure: first learn an initial model, and then learn a second calibration estimate on a small reserved dataset with a reduced set of variables Weigend et al. (1990); Steyerberg (2019).

4.1 Theoretical Results

The functions g and f should learn the CDF and inverse CDF of F_{Y|X}(y|x). Here, we explore when the loss functions in Eq. (2) and Eq. (3) lead to these goals. Suppose that f and g have enough capacity to represent the ground truth distribution functions (a mild assumption for neural networks); then we can show that the optimal solution is a fixed point of the training scheme. Below is a sketch of the proof of these claims, with significantly more detailed proofs available in the Supplemental Material. We begin our analysis with g.

Proposition 1.

Assume that f(q, x) is a function used to approximate the inverse CDF of F_{Y|X} (not necessarily optimal). Then a g-function minimizing Eq. (3) is optimal when it is equivalent to the conditional CDF, i.e., g(y, x) = F_{Y|X}(y|x).

Proof.

First, recall that our g-loss can be expanded by taking the expectation over y given x:

ℓ_g(g; f) = E_x E_{q∼D_q} [ −F_{Y|X}(ỹ|x) log g(ỹ, x) − (1 − F_{Y|X}(ỹ|x)) log(1 − g(ỹ, x)) ],  where ỹ = f(q, x).   (4)

Succinctly, fixing any x and letting ỹ = f(q, x), the inner part becomes −p log u − (1 − p) log(1 − u) with p = F_{Y|X}(ỹ|x) and u = g(ỹ, x). For any p ∈ (0, 1), this cross-entropy attains its unique minimum at u = p. Therefore, g is optimal when:

g(ỹ, x) = F_{Y|X}(ỹ|x) for all ỹ in the range of f and all x. ∎

The result is robust to the distribution D_q over percentiles as long as it has support over all of (0, 1). Our default choice is the uniform distribution.

Proposition 1 reveals an interesting result: g has a fixed point at the optimal solution even when f is not optimal. This leads to an additional question: will our estimate actually achieve this optimal result? To answer it, we sketch a consistency proof for g that is independent of f. The argument depends critically on existing M-estimator theory Van der Vaart (2000). Prior to stating the theorem, we introduce some additional notation. We define any learned g to be a function from a function space G, and we assume that the ground truth CDF is included in G as g_0. Note that working in this function space is important for the theory because two different parameter settings θ_g can map to the same function (see Supplemental Material). Let d(·, ·) be a distance measure (e.g., absolute difference in L1 or squared difference in L2).

Next, note that the g-loss of a single sample (x_i, y_i, q_i), with ỹ_i = f(q_i, x_i), is:

m_g(x_i, y_i, q_i) = 1{y_i ≤ ỹ_i} log g(ỹ_i, x_i) + 1{y_i > ỹ_i} log(1 − g(ỹ_i, x_i)).

Let the M-estimator objectives be the sample average of this loss evaluated at a function g, M_n(g) = (1/n) Σ_i m_g(x_i, y_i, q_i), and its true expectation, M(g) = E[m_g(x, y, q)]. With that, we can now state the theorem:

Theorem 1.

For ĝ_n ∈ G, if the following three conditions are satisfied, then d(ĝ_n, g_0) →p 0:

  1. Uniform convergence: sup_{g∈G} |M_n(g) − M(g)| →p 0.
  2. Well-separated maximum: for every ε > 0, sup_{g: d(g, g_0) ≥ ε} M(g) < M(g_0).
  3. Near-maximization: the sequence of estimators satisfies M_n(ĝ_n) ≥ M_n(g_0) − o_P(1).

To show that our optimal finite sample estimator is consistent (ĝ_n → g_0, the ground truth conditional CDF), we need to satisfy these three conditions. A detailed derivation can be found in the Supplemental Material, but we give some intuition on the conditions here. Note that in our derivations we limit the function class to those satisfying Lipschitz continuity in order to satisfy these conditions; Lipschitz continuity can be imposed in neural networks Arjovsky et al. (2017) and is a realistic assumption in uncertainty quantification because of the smoothness of the CDF. The first condition is a form of uniform convergence in probability: the finite sample objective should approximate the objective in expectation well as the number of samples increases, regardless of the chosen g. The second condition states that the ground truth is the only function that maximizes the objective in expectation. The third condition states that the sequence of estimators nearly maximizes the finite sample objective, so that, by the uniform convergence of condition one, its limit must also nearly maximize the objective in expectation. Thus, if ĝ_n were not consistent (proximal to the ground truth), these conditions would be violated.

Succinctly, for our smooth model class, our estimator is consistent if we can find a way to optimize the objective function. While the proof assumes that the optimization attains the maximum, recent theoretical advances in deep learning suggest that this may be a reasonable assumption in practice Du et al. (2018), although those results do not cover our precise situation.

Next we explore the fixed point properties of f:

Proposition 2.

When the g-function is ideal, the f-function minimizing Eq. (2) is optimal. The optimum is attained when f captures the inverse CDF, i.e., f(q, x) = F_{Y|X}^{-1}(q|x).

Proof.

For an ideal g-function, g(y, x) = F_{Y|X}(y|x). Substituting this into our f-loss gives

ℓ_f = E_x E_{q∼D_q} [ ( F_{Y|X}(f(q, x)|x) − q )² ].

If we take f(q, x) = F_{Y|X}^{-1}(q|x), then F_{Y|X}(f(q, x)|x) = q, so ℓ_f = 0, its minimum, and the loss is optimal. When the conditional density is positive, F_{Y|X}(·|x) is strictly increasing, so this minimizer is unique. ∎

By combining Propositions 1 and 2, it is clear that the ideal functions form a fixed point when we have access to the complete data distribution and our learned functions have enough capacity.

However, a weakness is that Proposition 2 relies on g being optimal first. This begs the question: if g does not rely on an optimal f (samples from a fixed proposal distribution could be used instead, for example), why is f helpful? We find that when f is closer to the inverse conditional CDF, it explores the outcome space more efficiently by placing higher emphasis on regions with higher density, which expedites convergence.

Noting that our theory is significantly more robust for g than for f, we recommend using g to quantify the uncertainty of y, which requires some extra function evaluations and additional computational overhead to find the precise intervals. In Section 5.2, we empirically show that using the g network is more robust and smooth than using the f network, matching our theoretical analysis.
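One way to organize those extra function evaluations, under our assumption that g is monotone in its first argument, is a simple bisection on g:

```python
def quantile_from_g(g_hat, x, q, y_lo, y_hi, tol=1e-4):
    """Solve g(y, x) = q for y by bisection on a bracketing interval."""
    while y_hi - y_lo > tol:
        mid = 0.5 * (y_lo + y_hi)
        if g_hat(mid, x) < q:
            y_lo = mid            # CDF too small: move right
        else:
            y_hi = mid            # CDF too large: move left
    return 0.5 * (y_lo + y_hi)
```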

4.2 Learning Initialization and Stabilization

As demonstrated in our theoretical results, the learning of the g-function has a fixed point at the optimal solution as long as the f-function possesses some mild properties. Moreover, a "good enough" f helps learn g more efficiently. Therefore, we need to ensure a reasonable initialization. Second, we note that f can collapse if g becomes "too good" (100% prediction confidence), as g is embedded in the loss of f. Therefore, we want to ensure that g is initialized properly and stays stable, to prevent the f function from collapsing (note that many GAN models suffer from mode collapse Creswell et al. (2018); Salimans et al. (2016)).

For the initialization, we start by training g independently of f. Instead of using f to randomly generate samples from the conditional distribution, we adopt a generator with enough variability to help g capture the distribution over the whole space. As shown in the theory, this does not change the optimal value of g in expectation, so it is a reasonable initialization. Here, we pick the generator to be a uniform distribution between the smallest and largest observed outcomes, ỹ ∼ U(y_min, y_max). It could also be chosen as the marginal empirical distribution of the outcomes, as long as enough variability is involved. This pre-training step adopts the following loss,

ℓ_pre(g) = E_{x,y} E_{ỹ∼U(y_min, y_max)} [ CE( 1{y ≤ ỹ}, g(ỹ, x) ) ].   (5)
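A minimal sketch of this pre-training step, reusing the GNet interface assumed earlier; drawing ỹ uniformly over the observed outcome range stands in for samples from f during initialization.

```python
import torch
import torch.nn.functional as nnf

def pretrain_g_step(g_net, opt_g, x, y, y_min, y_max):
    """One pre-training step for g under Eq. (5); y: (n, 1)."""
    y_tilde = y_min + (y_max - y_min) * torch.rand(x.shape[0], 1)
    target = (y <= y_tilde).float()                # indicator 1{y <= y_tilde}
    loss = nnf.binary_cross_entropy(g_net(y_tilde, x), target)
    opt_g.zero_grad(); loss.backward(); opt_g.step()
    return loss.item()
```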

Second, we want to make sure that the output of the g function actually matches the chosen D_q distribution, which it should theoretically do for an optimally chosen model. This is accomplished with a modified batch normalization Ioffe and Szegedy (2015) that enforces that the moments of the learned logits match those of the ideal theoretical distribution. For the uniform distribution, the first and second moments of the transformed distribution in logit space are straightforwardly calculated: the logit of a U(0, 1) variable follows a standard logistic distribution, with mean 0 and variance π²/3. We then choose the batch normalization scale γ and shift β to match these ideal first and second moments on the logit output. This technique forces the predicted coverages to roughly match the implied optimal distribution over q. In our experience, these two procedures make our network learning stable and robust.
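A hedged sketch of the moment-matching idea: since g's output should be uniform on (0, 1) at the optimum, its logits should follow a standard logistic law with mean 0 and variance π²/3, and the normalization below standardizes the batch logits to those moments. This is our reading of the modified batch normalization, not a verbatim implementation.

```python
import math
import torch

def match_logit_moments(logits, eps=1e-5):
    """Standardize batch logits to the moments of a standard logistic law."""
    target_std = math.pi / math.sqrt(3.0)          # std of Logistic(0, 1)
    z = (logits - logits.mean()) / (logits.std() + eps)
    return torch.sigmoid(z * target_std)           # output roughly U(0, 1)
```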

  Input: Training samples (x_i, y_i) for i = 1, …, n; a random generator D_q for the percentiles.
  Output: Model parameters θ_f and θ_g.
  Reduce the raw covariates of x (if using the two-stage procedure).
  for iter = 1, …, T_pre do
     Sample a mini-batch from the training samples.
     Compute the pre-training loss in Eq. (5) for g and update θ_g.
  end for
  for iter = 1, …, T do
     Sample a mini-batch of training samples and a quantile q from D_q for each sample.
     Compute the f-loss in Eq. (2) and update θ_f.
     Compute the g-loss in Eq. (3) and update θ_g.
  end for
Algorithm 1 Full Learning Algorithm
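Putting the pieces together, an illustrative driver for Algorithm 1 that reuses the sketched helpers; loader, the iteration counts, and the Adam optimizers are assumptions.

```python
import torch

opt_f = torch.optim.Adam(f_net.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(g_net.parameters(), lr=1e-3)
y_min, y_max = float(y_train.min()), float(y_train.max())

for _ in range(n_pretrain_iters):       # initialize g with Eq. (5)
    for x_b, y_b in loader:
        pretrain_g_step(g_net, opt_g, x_b, y_b, y_min, y_max)

for _ in range(n_train_iters):          # alternate Eq. (2) and Eq. (3)
    for x_b, y_b in loader:
        cn_step(f_net, g_net, opt_f, opt_g, x_b, y_b)
```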

5 Experiments

We first evaluate our proposed method on synthetic data to explore performance compared to optimal ground truth functions. Then we compare methods on multiple real datasets, showing improved performance across a variety of situations. We first discuss the chosen comparison metrics.

5.1 Metrics

We base our uncertainty estimation evaluation on three main criteria, described mathematically below. First is calibration, which measures how well the predicted coverage of a given interval matches the actual coverage. The second is sharpness, which evaluates the width of the interval; for example, if two methods both produce calibrated intervals but one has a much smaller range of uncertainty, it would be greatly preferred. Third, we evaluate how well we capture the conditional distribution by evaluating a discretized approximation to the conditional log-likelihood.

5.1.1 Calibration

Our quantitative calibration definition follows the established literature Kuleshov et al. (2018); Gneiting et al. (2007). A predicted nominal quantile function F̂^{-1}(q|x) is well calibrated when

P( y ≤ F̂^{-1}(q|x) ) = q,   (6)

which holds for all q ∈ (0, 1). To define a metric on calibration, define the two-sided interval with nominal level q (e.g., 95%) as I_q(x) = [ F̂^{-1}((1−q)/2 | x), F̂^{-1}((1+q)/2 | x) ] for q ∈ (0, 1). The importance of each nominal level q can be weighted with w_q, creating a metric:

cal = Σ_q w_q | q − P( y ∈ I_q(x) ) |.   (7)

This can be empirically evaluated by defining desired coverage levels 0 < q_1 < … < q_m < 1 and evaluating

cal = (1/m) Σ_{j=1}^{m} w_{q_j} | q_j − (1/n) Σ_{i=1}^{n} 1{ y_i ∈ I_{q_j}(x_i) } |.   (8)

This metric evaluates the discrepancy between the nominal coverage and the empirical coverage of any method that provides uncertainty intervals.

Note that several definitions of the interval could be used; for the purposes of our study, we picked the two-sided equal-tail interval as our uncertainty objective.

In the empirical evaluation of Eq. (8), we picked m = 6 equally spaced percentiles, with all weights set to 1.
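A sketch of the empirical calibration metric in Eq. (8), assuming a vectorized quantile function predict_quantile(q, x) and equal-tail intervals; the six levels below are illustrative placeholders for the equally spaced percentiles.

```python
import numpy as np

def calibration_error(predict_quantile, x, y,
                      levels=(0.3, 0.4, 0.5, 0.6, 0.7, 0.8)):
    """Mean absolute gap between nominal and empirical two-sided coverage."""
    err = 0.0
    for q in levels:
        lo = predict_quantile((1 - q) / 2, x)      # lower equal-tail bound
        hi = predict_quantile((1 + q) / 2, x)      # upper equal-tail bound
        empirical = np.mean((y >= lo) & (y <= hi)) # observed coverage
        err += abs(q - empirical)                  # weights w_q = 1
    return err / len(levels)
```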

5.1.2 Sharpness

At first glance, evaluating sharpness appears straightforward, because a sharper method should produce a narrower interval at a given nominal level q. However, methods that are overconfident or conservative in their nominal (predicted) coverage will have predicted intervals that are too narrow or too wide, respectively, so simply reporting the width could reward poorly calibrated methods. Therefore, we adopt a visual approach that plots the empirical coverage attained by each method against its average predicted interval width, and we use these visualizations in the results. This allows the reader to assess sharpness with respect to the actual coverage.
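One point on such a sharpness curve can be computed as follows, with the same assumed predict_quantile interface; plotting (coverage, width) pairs over a grid of q traces the curves in Figures 2(b) and 3(c)-(d).

```python
import numpy as np

def sharpness_point(predict_quantile, x, y, q=0.9):
    """Return (empirical coverage, mean interval width) at nominal level q."""
    lo = predict_quantile((1 - q) / 2, x)
    hi = predict_quantile((1 + q) / 2, x)
    return np.mean((y >= lo) & (y <= hi)), np.mean(hi - lo)
```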

5.1.3 Goodness of Fit

In order to assess how well the predicted conditional distribution actually fits the data, a standard statistical approach is to evaluate the log-likelihood. Because the algorithms we compare more naturally produce intervals rather than probability density functions, this is challenging to do directly. Instead, we use a "goodness-of-fit" (GoF) metric that approximates the log-likelihood by using a discretization of the real line. Specifically, we discretize the real line into K mutually exclusive bins B_1 = (−∞, b_1], B_k = (b_{k−1}, b_k] for k = 2, …, K−1, and B_K = (b_{K−1}, ∞). The discretized approximation to our log-likelihood is then given by

GoF = (1/n) Σ_{i=1}^{n} Σ_{k=1}^{K} 1{ y_i ∈ B_k } log P̂( y ∈ B_k | x_i ).   (9)

In all our experiments, we picked K = 10 bins, with b_1 and b_{K−1} denoting the fifth and ninety-fifth percentiles of the empirical distribution of y. The intermediate bin edges were chosen to be equally spaced between those two values.

Note that the log-likelihood values here will be maximized in expectation when the estimated distribution exactly matches the conditional distribution, which can also be viewed as comprehensively assessing calibration and sharpness together. Note also that the values of our metric depend on the locations of the bin edges and the number of bins; however, these are the same for all algorithms, so the comparison is fair.
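A sketch of the binned score in Eq. (9), assuming access to an estimated conditional CDF g_hat(y, x) (for CN this is the g network; for interval-based baselines a CDF must first be assembled from their quantiles); the small floor inside the log guards against empty bins.

```python
import numpy as np

def goodness_of_fit(g_hat, x, y, n_bins=10):
    """Average log-probability of the bin containing each observed outcome."""
    lo, hi = np.percentile(y, 5), np.percentile(y, 95)
    inner = np.linspace(lo, hi, n_bins - 1)        # finite bin edges
    total = 0.0
    for xi, yi in zip(x, y):
        cdf = np.concatenate(([0.0], [g_hat(e, xi) for e in inner], [1.0]))
        k = int(np.searchsorted(inner, yi))        # bin index in 0..n_bins-1
        total += np.log(max(cdf[k + 1] - cdf[k], 1e-12))
    return total / len(y)
```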

5.2 Synthetic Data Example

(a) Coverage
(b) Sharpness
(c) Precise Predicted Intervals
(d) CDF of a random sample
Figure 2: 2(a) shows the absolute difference between the nominal and the empirical coverage. There is a mild discrepancy on the theoretical (TH) curve due to sampling error, and it is on the same scale as our proposed approach, CN. 2(b) visualizes sharpness versus empirical coverage: for each nominal level of interest, the true coverage was calculated and then plotted against the interval width, each represented by a point. 2(c) gives counts, for each method, of how many highly confident (1 ≤ interval width ≤ 2) 90% intervals are generated, and the proportion of actual coverage. TH gives the benchmark using the theoretically optimal intervals, and approximately 90% are covered as expected. CN and CN-f nearly match these predictions; DP gives an overestimation of highly confident samples, and EN is never that confident. 2(d) visualizes the estimated CDF of each method against the ground truth for a random data sample.

Our synthetic data follow a Gaussian distribution with a unique mean and variance for each sample, y_i ∼ N(μ(x_i), σ(x_i)²), where the mean μ(x_i) and standard deviation σ(x_i) are constructed from the input features. The task is to evaluate whether our training scheme faithfully recovers the true distributions. We generated 1200 training samples and 800 evaluation samples.

The input features are x_i, but their relationship to the uncertainty is unknown to the networks a priori. Let Φ denote the distribution function of the standard normal. Under our simulation, the true conditional distribution for each sample can be easily derived: the q-th quantile, for example, can be constructed as μ(x_i) + σ(x_i)Φ^{-1}(q), and a perfectly calibrated, sharpest (1 − α) confidence interval can be constructed as μ(x_i) ± σ(x_i)Φ^{-1}(1 − α/2). This is the theoretically sharpest interval for a Gaussian sample Lehmann and Romano (2006). We denote the theoretically optimal uncertainty estimate as TH.
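For reference, the TH intervals used below can be computed directly from the generating parameters; this sketch assumes per-sample ground-truth mu and sigma arrays.

```python
import numpy as np
from scipy.stats import norm

def th_interval(mu, sigma, coverage=0.9):
    """Sharpest calibrated two-sided interval for Gaussian samples."""
    z = norm.ppf(0.5 + coverage / 2.0)   # e.g., 1.645 for 90% coverage
    return mu - z * sigma, mu + z * sigma
```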

We evaluate our method (denoted Collaborating Networks, CN) against the ensemble model (EN) and MC dropout (DP) for a comprehensive comparison of how well they approximate the ground truth conditional distribution. Because our theory suggests that we should use the g network to estimate uncertainty, that is our default approach (CN); however, we also include the uncertainty estimate from f (CN-f). Since this is an empirical evaluation, the TH method has minor sampling error. The empirical results can be seen in Table 1. Note that EN, CN, and CN-f all did well in calibration, but CN and CN-f are dominant in our goodness-of-fit metric.

Method | cal | GoF | True coverage of 90% CI
TH | 0.010 | -0.960 | 0.904
CN | 0.015 | -0.965 | 0.913
CN-f | 0.022 | -0.992 | 0.923
DP | 0.115 | -2.711 | 0.663
EN | 0.044 | -2.206 | 0.922
Table 1: Metric evaluation on the synthetic dataset.

We next visualize the calibration error at each evaluated level of nominal coverage in Figure 2(a). Note that CN and CN-f are drastically sharper than EN, the only other well-calibrated method, as shown in Figure 2(b); this figure is computed against the true coverage rather than the predicted nominal coverage. Intriguingly, DP shows sharper intervals than the proposed methods and the theoretically optimal method in the low-coverage regime. We explored this further by examining the samples with precise predicted intervals (90% nominal intervals with widths between 1 and 2), shown in Figure 2(c). The proposed method nearly matches the optimum, EN predicts that nothing is that sharp (despite being well-calibrated), and DP is drastically overconfident, frequently yielding mistakes, even though 90% of samples should actually fall within the interval. Because the average interval width is an average over all samples, samples predicted with large variation pull up the average; after removing samples with large variability, DP's sharpness advantage disappeared (details in the Supplemental Material). Therefore, EN seems to be well calibrated across levels of coverage without actually capturing the conditional distribution, whereas CN nearly matches the theoretically optimal functions.

Finally, we further assessed whether the estimated conditional CDF reproduces the ground truth by plotting the learned function against the ground truth in Figure 2(d). Note that there is a nearly perfect match between CN and the ground truth, and this holds across a variety of input features (more examples are in the Supplemental Material). We note that CN is smoother than CN-f and captures the curve slightly better, whereas DP and EN are far off because they do not capture the shift in mean.

(a) CPU Coverage
(b) EHR Coverage
(c) CPU Sharpness
(d) EHR Sharpness
Figure 3: Visualization of the results on the CPU and EHR datasets, showing the same coverage and sharpness plots as in Figure 2. Specifically, 3(a) and 3(b) show the nominal coverage versus the absolute difference between nominal and empirical coverage; 3(c) and 3(d) show sharpness versus empirical coverage. Note that the proposed method is more faithful in its coverage and is either the sharpest or equally sharp at every coverage level. The other datasets showed similar properties (see Supplemental Material).

5.3 Real Data Analysis

In this section, we compare our method against others on five real data examples. The first four are publicly available UCI datasets (http://archive.ics.uci.edu/ml/datasets). Specifically, they are: the Computer Hardware Data Set (cpu), the Individual Household Electric Power Consumption Data Set (energy), the Auto MPG Data Set (mpg), and the Communities and Crime Data Set (crime). Among them, energy is the only time series, documenting the energy measurements of a single house over 47 months. In addition to the CN, CN-f, EN, and DP methods used previously, Kuleshov et al. (2018) offer "recalibrated regression" to recalibrate pre-trained uncertainty estimates; we therefore include it as a second-step procedure on top of EN (EN-CR) and DP (DP-CR). Our method can also re-explore the data given summary statistics; we therefore append it on top of EN as an additional contender (EN-CN). In all examples, training and validation follow a 0.6/0.4 split.

Method/Data | CPU | energy | MPG | Crime | EHR
CN | 0.022 / -0.936 | 0.012 / -1.781 | 0.025 / -1.224 | 0.022 / -1.454 | 0.012 / -1.568
CN-f | 0.018 / -1.138 | 0.041 / -2.071 | 0.043 / -1.358 | 0.029 / -1.710 | 0.023 / -1.925
DP | 0.064 / -1.419 | 0.389 / -3.949 | 0.194 / -1.880 | 0.154 / -2.660 | —
EN | 0.163 / -2.375 | 0.030 / -2.147 | 0.041 / -2.249 | 0.045 / -2.241 | 0.029 / -2.169
DP-CR | 0.029 / -1.423 | 0.076 / -3.618 | 0.101 / -1.678 | 0.121 / -2.403 | —
EN-CR | 0.043 / -1.691 | 0.014 / -2.139 | 0.044 / -2.139 | 0.010 / -1.997 | 0.031 / -2.037
EN-CN | 0.060 / -1.656 | 0.056 / -1.924 | 0.035 / -1.678 | 0.070 / -2.195 | 0.026 / -1.787
Table 2: Quantitative results on the real datasets. Each method is reported as (cal / GoF) on each dataset. Because the EHR data is significantly larger than the others, DP and DP-CR were infeasible due to the amount of sampling required, denoted by a dash (—).

The last dataset is an electronic health record (EHR) collection developed from the Southeastern Diabetes Initiative (SEDI) Miranda et al. (2013). It includes diabetic patients' medical records, with the goal of forecasting Hemoglobin A1c. There are records from 18,335 patients with at least 6 A1c measurements, along with additional demographic information and lab values. The measurements from individual visits were discretized to months, and a patient's first visit was set to month 0, so all time stamps are time since first measurement. Electronic health records are rife with missing data and informative missingness, so recent medical-record-specific LSTM-based methods were used to address this challenge Che et al. (2018). We use this example to demonstrate how the technique can adapt to complex data structures with missing data.

The quantitative results can be seen in Table 2. In the GoF metric, CN dominated the other methods by a wide margin, and CN-f also performed well, ranking second in 3 of the 5 experiments. On calibration, CN is usually the best, and was only slightly beaten by EN-CR on the crime dataset. In all situations, the nominal coverage generated by CN is close to the true empirical coverage (the Supplemental Material gives full comparisons between methods on all datasets). CN-f did well, though it was not as competitive as CN, as predicted by our theoretical analysis. Visualizations of the results on CPU and EHR can be seen in Figure 3. Note that although EN was well calibrated on CPU, its generated intervals are not comparably sharp. In all tested situations, CN was either the sharpest or equally sharp relative to the competing methods, and it was the only method that was simultaneously sharp, well calibrated, and able to capture the conditional distribution (GoF).

CR is an efficient method to calibrate existing uncertainty estimates via a two-step procedure. However, it is post-hoc and strongly dependent on the original method; in this sense, CR is useful for matching coverage levels but cannot fix poor initial predictions. In direct contrast, CN fully utilizes the data, so it can produce a faithful interval estimate at every level. However, if desired, CN can also be used in a two-step procedure, as in EN-CN. We note that this combination is substantially sharper than EN. Thus, if a good uncertainty model has been trained and its summary statistics (mean, variance) are informative, then CN can also be used as a two-stage correction (EN-CN). However, our results suggest that using CN directly is the optimal strategy.

In the EHR data, CN was essentially perfect in calibration and showed a vastly better goodness of fit. Furthermore, our method demonstrated its capability to draw reliable uncertainty estimates for large-scale, complex data, and it produced significantly sharper intervals. These uncertainty estimates can be used to forecast future values for patients, and our empirical results suggest that the uncertainty intervals are highly trustworthy.

6 Discussion and Conclusion

In this paper we propose a collaborative learning scheme that simultaneously trains two neural networks characterizing the CDF and inverse CDF of the conditional distribution F_{Y|X}(y|x). In analyses of real and synthetic data, our method showed its capability to draw reliable uncertainty estimates from small- to large-scale data with both non-temporal and temporal structures. Empirically, our proposed method gave more accurate estimates of coverage and improved sharpness compared to the competing approaches. The method is supported by our theoretical analysis and appears to be robust in practice. Moving forward, we will consider extensions to causal inference to model the heterogeneous treatment effect for each individual, and we will focus on interpretable modeling.

Acknowledgments

Research reported in this publication was supported by the National Institute of Biomedical Imaging and Bioengineering and the National Institute of Mental Health through the National Institutes of Health BRAIN Initiative under Award Number R01EB026937.

The data used in the EHR analysis was provided by the Southeastern Diabetes Initiative (SEDI), directed by Ebony Boulware. SEDI was supported by Duke Clinical & Translational Science Award (CTSA) grant UL1TR001117; Cooperative Agreement Number 1C1CMS331018-01-00 from the Department of Health and Human Services, Centers for Medicare & Medicaid Services; and the Bristol-Myers Squibb Foundation. The data was used in accordance with Duke Health IRB Pro00025650. The EHR analysis was executed within the Duke Protected Analytics Computing Environment (PACE) supported by Duke’s Clinical and Translational Science Award (CTSA) grant (UL1TR001117), and by Duke University Health System. The CTSA initiative is led by the National Center for Advancing Translational Sciences (NCATS) at the National Institutes of Health.

The contents of this publication are solely the responsibility of the authors and do not necessarily represent the official views of any of the funding agencies or sponsors.

References

  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. International Conference on Machine Learning.
  • R. E. Bellman and L. A. Zadeh (1970) Decision-making in a fuzzy environment. Management Science 17 (4), pp. B-141.
  • D. M. Blei, A. Kucukelbir, and J. D. McAuliffe (2017) Variational inference: a review for statisticians. Journal of the American Statistical Association 112 (518), pp. 859–877.
  • C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural networks. Proceedings of Machine Learning Research.
  • A. Brando, J. A. Rodriguez, J. Vitria, and A. R. Muñoz (2019) Modelling heterogeneous distributions with an uncountable mixture of asymmetric laplacians. In Advances in Neural Information Processing Systems, pp. 8836–8846.
  • Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu (2018) Recurrent neural networks for multivariate time series with missing values. Scientific Reports 8 (1), pp. 6085.
  • A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath (2018) Generative adversarial networks: an overview. IEEE Signal Processing Magazine 35 (1), pp. 53–65.
  • S. S. Du, X. Zhai, B. Poczos, and A. Singh (2018) Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054.
  • Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.
  • Y. Gal (2016) Uncertainty in deep learning. PhD thesis, University of Cambridge.
  • T. Gneiting, F. Balabdaoui, and A. E. Raftery (2007) Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69 (2), pp. 243–268.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  • C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330.
  • A. C. Harvey (1976) Estimating regression models with multiplicative heteroscedasticity. Econometrica 44 (3), pp. 461–465.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. Proceedings of Machine Learning Research.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • R. Koenker and G. Bassett (1978) Regression quantiles. Econometrica 46, pp. 33–50.
  • V. Kuleshov, N. Fenner, and S. Ermon (2018) Accurate uncertainties for deep learning using calibrated regression. Proceedings of Machine Learning Research.
  • M. Kull, T. M. Silva Filho, and P. Flach (2017) Beyond sigmoids: how to obtain well-calibrated probabilities from binary classifiers with beta calibration. Electronic Journal of Statistics 11 (2), pp. 5052–5080.
  • B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402–6413.
  • E. L. Lehmann and J. P. Romano (2006) Confidence intervals and families of tests. Springer Science & Business Media.
  • C. Li, C. Chen, D. Carlson, and L. Carin (2016) Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In AAAI Conference on Artificial Intelligence.
  • J. Liu, J. Paisley, M. Kioumourtzoglou, and B. Coull (2019) Accurate uncertainty estimation and decomposition in ensemble learning. In Advances in Neural Information Processing Systems, pp. 8950–8961.
  • M. L. Miranda, J. Ferranti, B. Strauss, B. Neelon, and R. M. Califf (2013) Geographic health information systems: a platform to support the "triple aim". Health Affairs 32 (9), pp. 1608–1615.
  • J. Mitchell and K. F. Wallis (2011) Evaluating density forecasts: forecast combinations, model mixtures, calibration and sharpness. Journal of Applied Econometrics 26 (6), pp. 1023–1040.
  • R. M. Neal (2012) Bayesian learning for neural networks. Vol. 118, Springer Science & Business Media.
  • D. A. Nix and A. S. Weigend (1994) Estimating the mean and variance of the target probability distribution. In IEEE International Conference on Neural Networks, Vol. 1, pp. 55–60.
  • J. Platt (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers 10 (3), pp. 61–74.
  • Y. Romano, E. Patterson, and E. J. Candès (2019) Conformalized quantile regression. Advances in Neural Information Processing Systems.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pp. 2234–2242.
  • N. Skafte, M. Jørgensen, and S. Hauberg (2019) Reliable training and estimation of variance networks. In Advances in Neural Information Processing Systems.
  • E. W. Steyerberg (2019) Overfitting and optimism in prediction models. In Clinical Prediction Models, pp. 95–112.
  • N. Tagasovska and D. Lopez-Paz (2019) Single-model uncertainties for deep learning. In Advances in Neural Information Processing Systems.
  • S. Thulasidasan, G. Chennupati, J. Bilmes, T. Bhattacharya, and S. Michalak (2019) On mixup training: improved calibration and predictive uncertainty for deep neural networks. Advances in Neural Information Processing Systems.
  • A. Van der Vaart et al. (1996) New Donsker classes. The Annals of Probability 24 (4), pp. 2128–2140.
  • A. W. Van der Vaart (2000) Asymptotic statistics. Vol. 3, Cambridge University Press.
  • A. Van der Vaart and J. A. Wellner (1996) Weak convergence. In Weak Convergence and Empirical Processes, pp. 16–28.
  • A. S. Weigend, B. A. Huberman, and D. E. Rumelhart (1990) Predicting the future: a connectionist approach. International Journal of Neural Systems 1 (03), pp. 193–209.
  • B. Zadrozny and C. Elkan (2002) Transforming classifier scores into accurate multiclass probability estimates. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 694–699.

Appendix A M-Estimator for Consistency

We start by restating Theorem 1 from Section 4.1 as an M-estimator result due to Van der Vaart (2000):

Theorem 1

For ĝ_n ∈ G, if the following three conditions are satisfied, then d(ĝ_n, g_0) →p 0:

  1. sup_{g∈G} |M_n(g) − M(g)| →p 0.
  2. For every ε > 0, sup_{g: d(g, g_0) ≥ ε} M(g) < M(g_0).
  3. The sequence of estimators satisfies M_n(ĝ_n) ≥ M_n(g_0) − o_P(1).

In this theorem, G stands for the function space and g_0 is the ground truth. M_n(g) represents the sample average of the loss as a function of g, and M(g) represents its expectation. The theorem proves the consistency of the estimator within a maximization framework.

A.1 Marginal Consistency

We first prove the consistency of our estimating function in marginal form, without conditioning on x. Throughout the rest of this document we use C as a positive constant that may take different values in different places.

A.1.1 Consistency of the g-network

Defining an Alternative Form of the Objective Function.

From the g-loss, marginalizing over x, we have

ℓ_g(g) = E_{ỹ, y} [ −1{y ≤ ỹ} log g(ỹ) − 1{y > ỹ} log(1 − g(ỹ)) ],

where the distribution of ỹ can be replaced by any pre-defined distribution D_ỹ, which does not influence the consistency as long as it has support over the domain of interest.

We take the underlying measure to be the probability measure of ỹ, with distribution function F_ỹ. Based upon that, the distance between two functions g_1 and g_2 can be defined as

d(g_1, g_2) = ∫ | g_1(ỹ) − g_2(ỹ) | dF_ỹ(ỹ).   (10)

Then our g-loss can be re-written as the expectation of this per-sample loss. In order to use Theorem 1 to prove the consistency of the function, we transform the original minimization of the g-loss into a maximization problem by negating it; with this objective, minimizing the g-loss is equivalent to maximizing the objective function with respect to g. Further, define the sample-average objective M_n(g) and its expectation M(g), where g takes the general form of a CDF and is therefore monotonically non-decreasing.

Defining a New Parameter Space.

In reality, with limited training samples, it is impossible to learn the full span of the distribution (i.e., we get very few samples in the tails). For inference, we are only interested in exploring a partial range of the distribution. Assume there exist two real numbers l and u such that, for small positive numbers δ_1 and δ_2, g_0(l) = δ_1 and g_0(u) = 1 − δ_2, where g_0 is the ground truth CDF. Thus, we bound the domain of the exploration space to [l, u], which can be adjusted to cover our inferences of interest.

Based upon the above claim, the function space G in Theorem 1 can be further extended to any function within the following family:

G = { g : [l, u] → [0, 1], g monotonically non-decreasing },   (11)

which contains g_0 restricted to [l, u]. It is always possible to find such a range, since we can always choose small δ's or scale the variables in practice. Under this setup, since we are only interested in learning a fraction of the distribution function within a bounded domain, it is possible to define a well-behaved distribution to generate the output ỹ. Specifically, we define ỹ ∼ U(l, u), placing equal importance on each point. Then, according to Van der Vaart and Wellner (1996), for every r ≥ 1 (in our case r = 1, for the L1 norm)

we have

log N_[]( ε, G, L_r(Q) ) ≤ C / ε,

where we choose Q = U(l, u), the probability measure for ỹ. Here N_[] denotes the bracketing number, which stands for the complexity of a family of functions (Van der Vaart and Wellner, 1996). The bound continues to hold when restricting the domains of all functions in G to [l, u], so we have a finite bracketing number for our function space.

Therefore, for every ε > 0 there exist finitely many brackets such that every g ∈ G lies in some bracket [g^L, g^U], where ||g^U − g^L||_{L_1(Q)} ≤ ε.

Based upon the above inequality, the new family of functions is defined as M = { m_g : g ∈ G } with

m_g(y, ỹ) = 1{y ≤ ỹ} log g(ỹ) + 1{y > ỹ} log(1 − g(ỹ)).   (12)

We can easily construct a bracket for this new function by bracketing m_g with

m^L = 1{y ≤ ỹ} log g^L(ỹ) + 1{y > ỹ} log(1 − g^U(ỹ))

and

m^U = 1{y ≤ ỹ} log g^U(ỹ) + 1{y > ỹ} log(1 − g^L(ỹ)).

Then, by the mean value theorem, |log a − log b| ≤ C |a − b| for a, b with a positive lower bound, so for the probability measure Q there is

|| m^U − m^L ||_{L_1(Q)} ≤ C || g^U − g^L ||_{L_1(Q)}.

Therefore, it is proved that the bracketing number of M is finite for every ε > 0.

Hence, our newly defined M in Eq. (12) is a Glivenko-Cantelli class, a standard way to show uniform convergence of the sample objective to the objective in expectation over the family, and therefore condition 1 in Theorem 1 follows naturally Van der Vaart (2000).

The second condition focuses on proving that the expectation of the loss function has enough separation between two different functions. Let g_0 denote the ground truth g-function; then the difference of the objective function in expectation under g and g_0 is

M(g_0) − M(g) = E_ỹ [ g_0(ỹ) log( g_0(ỹ)/g(ỹ) ) + (1 − g_0(ỹ)) log( (1 − g_0(ỹ))/(1 − g(ỹ)) ) ],

i.e., the expected Kullback-Leibler divergence between two Bernoulli distributions. Since the Bernoulli KL divergence is bounded below by twice the squared difference of the success probabilities (Pinsker's inequality), there exists a constant C > 0 s.t.

M(g_0) − M(g) ≥ C d(g_0, g)²,

where d is defined in Eq. (10). Condition 2 follows from this inequality.

Condition 3 is obvious if we can obtain the maximum likelihood estimator (MLE) over the full class defined in Eq. (11). In practice, we are actually trying to attain the MLE within a function class defined by neural networks. Note that, based on the proof of Theorem 1 Van der Vaart (2000), the argument requires that the class include both ĝ_n and (an approximation of) g_0 for large n, i.e., that ĝ_n be monotone. Hence, to guarantee this monotonicity, we further assume that the class attained by the neural network is a subclass of G consisting of monotone functions representable by the network.

Now, we take a detour to establish condition 3. Assume there exists a sequence of network-representable functions converging to g_0, so that it is trivial to find a near-maximizer. Under these conditions, we can get M_n(ĝ_n) ≥ M_n(g_0) − o_P(1).

Hence, with all three conditions satisfied, we arrive at d(ĝ_n, g_0) →p 0.

Now, we switch our focus to the existence of such a network-representable sequence. To simplify the argument, assume the true density function has a positive lower bound within the bounded domain. By the Universal Approximation Theorem, there exists a function attained by a neural network that approximates this density arbitrarily well in the supremum norm; for a sufficiently good approximation, it also has a positive lower bound.

Next, we define the candidate CDF as the integral of this approximating density. From this definition, it is easy to see that the candidate is also attained by a neural network (its derivative is the network's output), that it approximates g_0, and that, because the approximating density has a positive lower bound, it is also a monotone function. This proves that the candidate lies in the network-representable subclass.

In the following, we show the remaining convergence to finish the verification of condition 3 and complete the proof. The required bounds follow from the mean value theorem applied to the logarithm (for arguments with a positive lower bound), together with the boundedness of the class and the quality of the approximation. From these, the objective evaluated at the candidate converges to the objective at g_0, and condition 3 follows.

A.1.2 Consistency of the f-network

The form of the f-loss determines its strong reliance on the g function. This dependency shows up in both the consistency and the fixed-point solution. Although we can use the g function to directly solve for the quantiles, we are still interested in showing that f can reach consistency under some additional assumptions.

We start from a stronger version of convergence for g. Previously, we showed that ĝ_n is consistent in probability; now assume uniform convergence, sup_y |ĝ_n(y) − g_0(y)| →p 0. When ĝ_n is attained, f̂_n can be defined as its inverse, f̂_n(q) = ĝ_n^{-1}(q), which minimizes the empirical f-loss, since ĝ_n(f̂_n(q)) = q. The question is whether the sequence f̂_n converges to f_0 = g_0^{-1} as expected.

We want to show that f̂_n converges uniformly to f_0 = g_0^{-1}. Since g_0 is a monotonic and continuous function, its inverse is monotonic and continuous, with a compact domain, and is therefore uniformly continuous. Fix any ε > 0; by uniform continuity there exists δ > 0 s.t. |g_0^{-1}(q_1) − g_0^{-1}(q_2)| < ε whenever |q_1 − q_2| < δ. By the assumed uniform convergence of ĝ_n (I), there exists N s.t. for n > N, sup_y |ĝ_n(y) − g_0(y)| < δ with high probability. Hence, for any q, |g_0(f̂_n(q)) − q| = |g_0(f̂_n(q)) − ĝ_n(f̂_n(q))| < δ; if not, we would have a contradiction with (I). Applying g_0^{-1} and uniform continuity gives |f̂_n(q) − f_0(q)| < ε; the reverse direction is proved similarly by contradiction. In this whole process, we showed that for all ε > 0,

P( sup_q |f̂_n(q) − f_0(q)| > ε ) → 0,

which further implies that f̂_n converges uniformly in probability to the ground truth inverse CDF. That completes the proof.

A.2 Extension to the Conditional Distribution

In general, we are estimating the conditional distribution of y given x, and we want to prove the consistency of ĝ_n in this setting. We still rely on the construction of the M-estimator, but limited to a multivariate function family. The major difference in the general case is using the smoothness of the conditional distribution to establish that a multivariate function family is Glivenko-Cantelli, whereas for the marginal case we showed that a univariate monotone function class is Glivenko-Cantelli. The detailed proof for the general case with the conditional distribution is otherwise almost the same as the previous proof and is not presented here. In the following, we only present the theorem Van der Vaart et al. (1996) for bounding the bracketing number of a smooth multivariate function class, thereby verifying that the function class is Glivenko-Cantelli.

Theorem 2.

Suppose X ⊂ R^d is bounded and convex with nonempty interior. There exists a constant K, depending only on the smoothness order α, diam(X), and d, s.t.

log N_[]( ε, C_M^α(X), L_r(Q) ) ≤ K (M / ε)^{d/α}

for every r ≥ 1 and probability measure Q, where C_M^α(X) is the class of functions on X with Hölder smoothness of order α bounded by M.

In terms of the convergence of f, we are less concerned, since introducing f and forcing it to learn together with g serves to explore the space more efficiently where the density is higher. In real inference, f is only a handy way to draw each desired quantile. In practice, if g perfectly captures the conditional distribution F_{Y|X}, we can numerically solve for each desired conditional quantile given g, which is laborious but realizable.

Appendix B Proofs of Additional Claimed Properties

Lemma 1.

The sampling distribution D_q on the quantiles does not influence the optimal solutions of f and g.

Proof.

For the g-loss and f-loss to attain their optima, we have shown in the main article that, for fixed x and q, the optimal solution is