PIVEN: A Deep Neural Network for Prediction Intervals with Specific Value Prediction

06/09/2020 ∙ by Eli Simhayev, et al. ∙ Ben-Gurion University of the Negev

Improving the robustness of neural nets in regression tasks is key to their application in multiple domains. Deep learning-based approaches aim to achieve this goal either by improving the manner in which they produce their prediction of specific values (i.e., point prediction), or by producing prediction intervals (PIs) that quantify uncertainty. We present PIVEN, a deep neural network for producing both a PI and a prediction of specific values. Benchmark experiments show that our approach produces tighter uncertainty bounds than the current state-of-the-art approach for producing PIs, while maintaining performance comparable to the state-of-the-art approach for specific value prediction. Additional evaluation on large image datasets further supports our conclusions.




1 Introduction

Deep neural networks (DNNs) have been achieving state-of-the-art results in a large variety of complex problems. These include automated decision making and recommendation systems in the medical domain medical_dnn, autonomous control of drones dronednn and self-driving cars carsdnn. In many of these domains, it is crucial not only that the prediction made by the DNN is accurate, but also that its uncertainty is quantified. Quantifying uncertainty has many benefits, including risk reduction and the ability to plan in a more reliable fashion lube.

For regression problems, uncertainty is quantified by the creation of prediction intervals (PIs), which offer upper and lower bounds on the value of a data point for a given probability (e.g., 95% or 99%). Existing non-Bayesian methods for PI generation can be roughly divided into two groups:

a) carrying out multiple runs of the regression problem (e.g., dropout mc_dropout, ensemble-based methods deep_ensemble) and deriving the PI from the prediction variance in a post-hoc manner; and

b) the use of dedicated architectures for the generation of the PI, which produce the upper and lower bounds of the PI.

While effective, each approach has limitations. On the one hand, the ensemble-based approaches produce a specific value for the regression problem (i.e., a point prediction), but they are not optimized for PI construction. This lack of a PI makes the use of such approaches difficult in domains such as financial risk mitigation or maintenance scheduling. For example, providing a PI for the number of days a machine can function without malfunctioning (e.g., 30-45 days with 99% certainty) is more valuable than a prediction for the specific time of failure. On the other hand, PI-dedicated architectures qd; SQR provide accurate upper and lower bounds for the prediction, but do not provide a method for specifically selecting a value within the interval. As a result, these approaches choose the middle of the interval as their value prediction, which is a sub-optimal strategy as it makes assumptions regarding the value distribution within the interval. The shortcomings of this approach to value prediction are supported by qd, as well as by our own experiments in Section 5.

In this study we propose PIVEN (prediction intervals with specific value prediction), a novel approach for uncertainty modeling using DNNs. Our approach combines the benefits of the two types of approaches described above by producing both a PI and a value prediction. We follow the experimental procedure of recent works, and compare our approach to current best-performing methods: Quality-Driven PI method (QD) qd (a dedicated PI generation architecture), and Deep Ensembles (DE) deep_ensemble. The results of our evaluation show that PIVEN outperforms QD by producing narrower PIs, while simultaneously achieving comparable results to DE in terms of value prediction.

2 Related Work

2.1 Uncertainty Modeling in Data

In the field of uncertainty modeling, one considers two types of uncertainty: a) aleatoric uncertainty, which captures noise inherent in the observations; and b) epistemic uncertainty, which accounts for uncertainty in the model parameters, thus capturing our ignorance about the correctness of the model generated from our collected data. Overall uncertainty can therefore be modeled as $\sigma_y^2 = \sigma_{model}^2 + \sigma_{noise}^2$, where $\sigma_{model}^2$ denotes epistemic uncertainty and $\sigma_{noise}^2$ denotes aleatoric uncertainty. Aleatoric uncertainty can further be categorized into homoscedastic uncertainty, where $\sigma_{noise}^2$ is constant for different inputs, and heteroscedastic uncertainty, where $\sigma_{noise}^2$ is dependent on the inputs to the model, with some inputs potentially being more noisy than others. In this work we quantify uncertainty using PIs, which by definition quantify $\sigma_y^2$, whereas confidence intervals (CIs) quantify only $\sigma_{model}^2$. Therefore, PIs are necessarily wider than CIs.

2.2 Modeling Uncertainty in Regression Problems

Enabling deep learning algorithms to cope with uncertainty has been an active area of research in recent years qd; rio; mc_dropout; deep_ensemble; confidNet; vision_uncertainties; bias_reduced; can_you_trust. Studies on uncertainty modeling in regression can be roughly divided into two groups: sampling-based and PI-based.

Sampling-based approaches initially utilized Bayesian neural networks bnn, in which a prior distribution was defined on the weights and biases of a neural net (NN), and a posterior distribution was then inferred from the training data. The main shortcomings of these approaches were their heavy computational costs and the fact that they were difficult to implement. Subsequently, non-Bayesian methods mc_dropout; deep_ensemble; rio were proposed. In mc_dropout, Monte Carlo sampling was used to estimate the predictive uncertainty of NNs through the use of dropout over multiple runs. A later study deep_ensemble employed a combination of ensemble learning and adversarial training to quantify data uncertainty. In an expansion of a previously proposed approach mve, each NN was optimized to learn the mean and variance of the data, assuming a Gaussian distribution. In a recent study rio, the authors proposed a post-hoc procedure using Gaussian processes to measure the uncertainty of the predictions of NN regressors.

PI-based approaches, whose aim is to explicitly produce a PI for each analyzed sample, belong to a field of research that has been gaining popularity in recent years. In keren_pi_calibrated, the authors propose a post-processing approach that considers the regression problem as one of classification, and uses the output of the final softmax layer to produce PIs. Another recent study SQR proposed the use of a loss function designed to learn all conditional quantiles of a given target variable. Khosravi et al. lube proposed a method called LUBE, which consists of a loss function optimized for the creation of PIs but has the caveat of not being able to use stochastic gradient descent (SGD) for its optimization. Finally, a recent study qd inspired by LUBE proposed a loss function that is both optimized for the generation of PIs and can be optimized using SGD.

Each of the two groups presented above tends to under-perform when applied to tasks for which its loss function was not optimized: sampling-based approaches, which are optimized to produce value predictions, tend to produce PIs of lesser accuracy than those of the PI-based methods, which are optimized to produce tight PIs, and vice versa. Recent studies adaptive; conformal_prediction_nips19 attempted to produce both value predictions and PIs by using conformal prediction with quantile regression. While effective, these methods use a complex splitting strategy, where one part of the data is used to produce value predictions and PIs, while the other part is used to further adjust the PIs. Contrary to these approaches, PIVEN produces PIs with value predictions in an end-to-end manner by relying on a novel loss function.

3 Problem Formulation

In this work we consider a neural network regressor that processes an input $x \in \mathcal{X}$ with an associated label $y \in \mathbb{R}$, where $\mathcal{X}$ can be any feature space (e.g., tabular data, age prediction from images). Let $(x_i, y_i) \in \mathcal{X} \times \mathbb{R}$ be a data point along with its target value. Let $U_i$ and $L_i$ be the upper and lower bounds of the PI corresponding to the $i$th sample. Our goal is to construct $(L_i, U_i)$ such that $\Pr(L_i \le y_i \le U_i) \ge 1 - \alpha$. We refer to $1 - \alpha$ as the confidence level of the PI. In standard regression problems, the goal is to estimate a function $f$ such that $y(x) = f(x) + \epsilon(x)$, where $\epsilon(x)$ is referred to as noise and is usually assumed to have zero mean.

Next we define two quantitative measures for the evaluation of PIs, as defined in lube. First we define coverage as the ratio of dataset samples that fall within their respective PIs. We measure coverage using the prediction interval coverage probability (PICP) metric:

$$PICP = \frac{1}{n} \sum_{i=1}^{n} k_i$$

where $n$ denotes the number of samples and $k_i = 1$ if $L_i \le y_i \le U_i$, otherwise $k_i = 0$. We now define a metric to measure the quality of the generated PIs. Naturally, we are interested in producing as tight a bound as possible while maintaining adequate coverage. We define the mean prediction interval width (MPIW) as

$$MPIW = \frac{1}{n} \sum_{i=1}^{n} (U_i - L_i)$$

When combined, these metrics enable us to comprehensively evaluate the quality of generated PIs.
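As a minimal sketch of these two metrics, the following Python snippet computes PICP as the fraction of targets that fall inside their intervals and MPIW as the average interval width. The function names `picp` and `mpiw`, the closed-interval capture test, and the toy data are illustrative assumptions rather than part of the original implementation.

```python
def picp(y, lower, upper):
    """Fraction of targets y_i with lower_i <= y_i <= upper_i."""
    captured = [1 if lo <= yi <= up else 0 for yi, lo, up in zip(y, lower, upper)]
    return sum(captured) / len(y)


def mpiw(lower, upper):
    """Mean interval width, (1/n) * sum(upper_i - lower_i)."""
    return sum(up - lo for lo, up in zip(lower, upper)) / len(lower)


# Toy example: the second target lies below its interval.
y = [1.0, 2.0, 3.0, 4.0]
lower = [0.5, 2.5, 2.0, 3.0]
upper = [1.5, 3.5, 4.0, 5.0]

coverage = picp(y, lower, upper)  # 3 of 4 targets are captured
width = mpiw(lower, upper)        # widths are 1, 1, 2 and 2
print(coverage, width)
```

A method is preferable when it reaches the target coverage with a smaller mean width, which is why the two numbers are always reported together.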

4 Method

In this section we first define PIVEN, a deep neural architecture for the generation of both PIs and value predictions for regression problems. We then present a suitable loss function that enables us to train our architecture to generate the PIs for a desired confidence level $1 - \alpha$.

4.1 System Architecture

The proposed architecture is presented in Figure 1. It consists of three components:

  • Backbone block. The main body block, consisting of a varying number of DNN layers or sub-blocks. The goal of this component is to transform the input into a latent representation that is then provided as input to the other components. It is important to note that PIVEN supports any architecture type (e.g., dense, convolutions) that can be applied to a regression problem. Moreover, pre-trained architectures can also be used seamlessly. For example, we use pre-trained VGG-16 and DenseNet architectures in our experiments.

  • Upper & lower-bound heads. $L(x)$ and $U(x)$ produce the lower and upper bounds of the PI respectively, such that $\Pr(L(x) \le y(x) \le U(x)) \ge 1 - \alpha$, where $y(x)$ is the value prediction and $1 - \alpha$ is the predefined confidence level.

  • Auxiliary head. The auxiliary prediction head, $v(x)$, enables us to produce a value prediction. $v(x)$ does not produce the value prediction directly, but rather produces a parameter indicating the relative weight that should be given to each of the two bounds. We derive the value prediction $\hat{y}$ using

    $$\hat{y} = v \cdot U + (1 - v) \cdot L$$

    where $v \in (0, 1)$. By expressing the output of the auxiliary head as a function of the other two heads, we bind them together and improve their performance. See Section 4.3 for details.
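The way the three heads interact can be sketched in plain Python. This is only an illustration under stated assumptions: each head is a single linear map on the latent vector, the auxiliary output is passed through a sigmoid so that it stays in (0, 1), and the value prediction is taken as the weighted combination of the two bounds. The names `piven_heads`, `w_lower`, `w_upper` and `w_aux` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function; adequate for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def piven_heads(latent, w_lower, w_upper, w_aux):
    """Map a latent vector to (lower, upper, v, y_hat).

    Assumption: linear heads without biases. Nothing here forces
    lower <= upper; in training that is left to the loss.
    """
    lower = dot(latent, w_lower)
    upper = dot(latent, w_upper)
    v = sigmoid(dot(latent, w_aux))          # relative weight of the upper bound
    y_hat = v * upper + (1.0 - v) * lower    # value prediction inside the interval
    return lower, upper, v, y_hat


latent = [1.0, 2.0]  # stand-in for the backbone output
lower, upper, v, y_hat = piven_heads(latent, [1.0, 0.0], [1.0, 1.0], [0.0, 0.0])
print(lower, upper, v, y_hat)
```

With zero auxiliary weights the sigmoid returns 0.5, so the prediction falls at the midpoint; any other weights move it toward one of the bounds.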

This architecture has several advantages compared to previous studies, particularly in terms of robustness and the ability to represent PIs that are not uniformly distributed. We elaborate on this subject further in Section 4.3.


Figure 1: The PIVEN schematic architecture

4.2 Network Optimization

Our goal is to generate narrow PIs, measured by MPIW, while maintaining the desired level of coverage, measured by PICP. However, PIs that fail to capture their respective data point should not be encouraged to shrink further. We follow the derivation presented in qd and define captured MPIW ($MPIW_{capt}$) as the MPIW of only those points for which $L_i \le y_i \le U_i$,

$$MPIW_{capt} = \frac{1}{c} \sum_{i=1}^{n} (U_i - L_i) \cdot k_i$$

where $c = \sum_{i=1}^{n} k_i$. Hence, we seek to minimize $MPIW_{capt}$ subject to $PICP \ge 1 - \alpha$:

$$\theta^* = \arg\min_{\theta} MPIW_{capt,\theta} \quad \text{s.t.} \quad PICP_{\theta} \ge 1 - \alpha$$

where $\theta$ denotes the parameters of the neural net. To enforce the coverage constraint, we utilize a variant of the well-known Interior Point Method (IPM) IPM, resulting in an unconstrained loss:

$$\mathcal{L}_{PI} = MPIW_{capt,\theta} + \sqrt{n} \cdot \lambda \Psi(1 - \alpha - PICP_{\theta})$$

where $\lambda$ is a hyperparameter controlling the relative importance of width vs. coverage, $\Psi(x) = \max(0, x)^2$ is a quadratic penalty function, and $n$ is the batch size. We include dependency on batch size in the loss since a larger sample size increases confidence in the value of PICP, thus increasing the loss. In practice, optimizing the loss with the discrete version of $k$ (as used in $MPIW_{capt}$ above) fails to converge, because the gradient is always positive for all possible values. We therefore define a continuous version of $k$, denoted as $k_{soft} = \sigma(s \cdot (y - L)) \odot \sigma(s \cdot (U - y))$, where $\sigma$ is the sigmoid function and $s > 0$ is a softening factor. The final version of $\mathcal{L}_{PI}$ uses the continuous and discrete versions of $k$ in its calculations of the $PICP$ and $MPIW_{capt}$ metrics, respectively. By doing so, it discourages the PIs from shrinking further when failing to capture their respective data points.
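The interval loss described above can be sketched in plain Python as follows. This is a hedged illustration, not the authors' code: the function name `pi_loss`, the square-root scaling by batch size, the product-of-sigmoids form of the soft indicator, and the default `lam` are assumptions of this sketch, while the softening factor of 160.0 is the value reported in the appendix.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def pi_loss(y, lower, upper, alpha=0.05, lam=1.0, s=160.0):
    """Captured mean width plus a quadratic penalty on missing coverage."""
    n = len(y)
    rows = list(zip(y, lower, upper))
    # Discrete indicator: used for the captured width.
    k_hard = [1.0 if lo <= yi <= up else 0.0 for yi, lo, up in rows]
    # Continuous indicator: used for the (differentiable) coverage estimate.
    k_soft = [sigmoid(s * (yi - lo)) * sigmoid(s * (up - yi)) for yi, lo, up in rows]

    captured = sum(k_hard)
    width_sum = sum((up - lo) * k for (_, lo, up), k in zip(rows, k_hard))
    mpiw_capt = width_sum / captured if captured else 0.0

    picp_soft = sum(k_soft) / n
    penalty = max(0.0, (1.0 - alpha) - picp_soft) ** 2
    return mpiw_capt + math.sqrt(n) * lam * penalty  # sqrt(n) scaling is assumed


full_cover = pi_loss([0.0, 0.0], [-1.0, -1.0], [1.0, 1.0])
half_cover = pi_loss([0.0, 1.0], [-1.0, 2.0], [1.0, 3.0])
print(full_cover, half_cover)
```

When every target is captured the penalty vanishes and the loss reduces to the mean width; when coverage drops below the target level, the penalty term dominates as `lam` grows.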

Neural networks optimized by the abovementioned objective are able to generate well-calibrated PIs, but they disregard the original value prediction task. This omission has two significant drawbacks:

  • Overfitting. The $MPIW_{capt}$ term in $\mathcal{L}_{PI}$, as defined in qd, focuses only on the fraction of the training set where the data points are successfully captured by the PI. As a result, the network is likely to overfit to a subset of the data. Our reasoning is supported by our experiments in Section 5 and Appendix C.

  • Lack of value prediction. In its current form, $\mathcal{L}_{PI}$ is not able to perform value prediction, i.e., returning a specific prediction for the regression problem. To overcome this limitation, one can return the middle of the PI, as done in qd; SQR. This approach sometimes yields sub-optimal results, as it is based on assumptions regarding the distribution of the data. These assumptions do not always hold, as we show in our experiments in Section 5.4.

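We propose a novel loss function that combines the generation of both PIs and value predictions. To optimize the output of $v$ (the auxiliary head), we minimize the standard regression loss,

$$\mathcal{L}_{v} = \frac{1}{n} \sum_{i=1}^{n} \ell(v_i \cdot U_i + (1 - v_i) \cdot L_i,\; y_i)$$

where $\ell$ is a regression objective against the ground truth, and $v_i \in (0, 1)$. Our final loss function is a convex combination of $\mathcal{L}_{PI}$ and the auxiliary loss $\mathcal{L}_{v}$. Thus, the overall training objective is:

$$\mathcal{L}_{PIVEN} = \beta \mathcal{L}_{PI} + (1 - \beta) \mathcal{L}_{v}$$

where $\beta$ is a hyperparameter that balances the two goals of our approach: producing narrow PIs and accurate value predictions. To quantify epistemic uncertainty, we employ an ensemble of different networks with parameter resampling, as proposed in deep_ensemble. Given an ensemble of $m$ NNs trained with $\mathcal{L}_{PIVEN}$, let $\tilde{U}$ and $\tilde{L}$ represent the ensemble's upper and lower estimates of the PI, and let $\tilde{v}$ represent the ensemble's auxiliary prediction. We calculate model uncertainty and use the ensemble to generate the PIs and $\tilde{v}$ as follows:

$$\bar{U}_i = \frac{1}{m} \sum_{j=1}^{m} \hat{U}_{ij}, \qquad \hat{\sigma}^2_{U_i} = \frac{1}{m-1} \sum_{j=1}^{m} (\hat{U}_{ij} - \bar{U}_i)^2$$

$$\tilde{U}_i = \bar{U}_i + z_{\alpha/2} \hat{\sigma}_{U_i}, \qquad \tilde{v}_i = \frac{1}{m} \sum_{j=1}^{m} \hat{v}_{ij}$$

where $\hat{U}_{ij}$ and $\hat{v}_{ij}$ represent the upper bound of the PI and the auxiliary prediction for data point $i$, for NN $j$. A similar procedure is followed for $\tilde{L}_i$, subtracting $z_{\alpha/2} \hat{\sigma}_{L_i}$, where $z_{\alpha/2}$ is the Z score for a confidence level $1 - \alpha$.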
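The ensemble step can be illustrated for a single data point with the standard library alone. The sketch assumes that the spread across networks is measured by the sample standard deviation (denominator m − 1) and that the Z score is the two-sided normal quantile for the chosen level; the name `ensemble_interval` and the toy numbers are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def ensemble_interval(lowers, uppers, vs, alpha=0.05):
    """Combine per-network outputs for ONE data point.

    lowers, uppers, vs: one entry per network in the ensemble.
    Returns (lower, upper, v) for the ensemble.
    """
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # about 1.96 for alpha = 0.05
    upper = mean(uppers) + z * stdev(uppers)      # widen upward by model spread
    lower = mean(lowers) - z * stdev(lowers)      # widen downward by model spread
    v = mean(vs)                                  # averaged auxiliary output
    return lower, upper, v


# Three networks agree on the lower bound but disagree on the upper bound.
lo, up, v = ensemble_interval([1.0, 1.0, 1.0], [2.0, 3.0, 4.0], [0.2, 0.4, 0.6])
print(lo, up, v)
```

Disagreement between networks only ever widens the interval, which is how the ensemble adds model uncertainty on top of the noise captured by each individual network.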

4.3 Discussion of contributions

PIVEN is different from previous studies in two important aspects. First, our approach is the first to propose an integrated architecture capable of producing both PIs and exact value predictions. Moreover, since the auxiliary head produces predictions for all training set samples, it prevents PIVEN from overfitting to only the data points which were contained in their respective PIs (a possible problem for studies such as qd; SQR), thus increasing the robustness of our approach.

The second differentiating aspect of PIVEN with respect to previous work is its method for producing the value prediction. While previous studies either provided the middle of the PI qd; SQR or the mean-variance deep_ensemble as their value predictions, PIVEN’s auxiliary head can produce any value within the PI as its prediction. By expressing the value prediction as a function of the upper and lower bounds, we ensure that the three heads are synchronized. Finally, this representation enables us to produce value predictions that are not in the middle of the interval, thus creating representations that are more characteristic of many real-world cases, where the PI is not necessarily uniformly distributed. Our experiments, presented in Sections 5.4 and 5.5, support our conclusions.

5 Evaluation

5.1 Datasets

UCI Datasets. To compare PIVEN to recent state-of-the-art studies mc_dropout; pbp; deep_ensemble; qd, we conduct our experiments on a set of benchmark datasets used by them for evaluation. This benchmark includes ten datasets from the UCI repository uci.

IMDB age estimation dataset (https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/). The IMDB-WIKI dataset imdb_ds is currently the largest age-labeled facial dataset available. Our dataset consists of 460,723 images from 20,284 celebrities, and the regression goal is to predict the age of the person in the image. It is important to note that this dataset is known to contain noise (i.e., aleatoric uncertainty), thus making it highly relevant to this study. We apply the same preprocessing as in imdb_large_work; SSRNet, and refer the reader to Appendix A for full details.

RSNA pediatric bone age dataset (https://www.kaggle.com/kmader/rsna-bone-age). This dataset is a popular medical imaging dataset consisting of X-ray images of children’s hands rsna. The regression task is to predict a child’s age from an X-ray image of their hand. The dataset contains 12,611 training images and 200 test set images.

While the first group of datasets enables us to compare PIVEN’s performance to recent state-of-the-art studies in the field, the two latter datasets enable us to demonstrate that our approach is both scalable and effective on multiple types of input.

5.2 Baselines

We compare our performance to two top-performing NN-based baselines from recent years:

  • Quality driven PI method (QD) qd. This approach produces prediction intervals that minimize a smooth combination of the PICP/MPIW metrics without considering the value prediction task in its objective function. Its reported results make this approach state-of-the-art in terms of PI width and coverage.

  • Deep Ensembles (DE) deep_ensemble. This work combines individual conditional Gaussian distributions with adversarial training, and uses the models’ variance to compute prediction intervals. Because DE outputs a distribution instead of PIs, we first convert it to PIs, and then compute PICP and MPIW (replicating the process described in qd). Its reported results make this method one of the top performers with respect to the RMSE metric (i.e., value prediction).
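Converting a predicted Gaussian into a PI, as done for the DE baseline, can be sketched as below. The snippet assumes a central, symmetric interval built from the predicted mean and standard deviation; the exact conversion used in qd is not spelled out here, and the name `gaussian_to_pi` is hypothetical.

```python
from statistics import NormalDist


def gaussian_to_pi(mu, sigma, alpha=0.05):
    """Central (1 - alpha) interval of a Gaussian N(mu, sigma^2)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided normal quantile
    return mu - z * sigma, mu + z * sigma


pi_lower, pi_upper = gaussian_to_pi(10.0, 2.0)
print(pi_lower, pi_upper)
```

Once every prediction has been turned into a pair of bounds, the same coverage and width metrics can be applied to all methods.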

By comparing PIVEN to these two baselines, we are able to evaluate its ability to simultaneously satisfy the two main requirements for regression problems in domains with high uncertainty.

5.3 Experimental Setup

Throughout our experiments, we evaluate our two baselines qd; deep_ensemble using their reported deep architectures and hyperparameters. For full experimental details, please see Appendix A. We ran our experiments using a GPU server with two NVIDIA Tesla P100 GPUs. Our code is implemented using TensorFlow and Keras tf; keras, and is made available online (https://github.com/elisim/piven).

UCI datasets. We implemented the experimental setup proposed by pbp, which was also used by our baselines. Results are averaged over 20 random 90%/10% splits of the data, except for the “Year Prediction MSD” and “Protein” datasets, which were split once and five times respectively. Our network architecture is identical to previous work mc_dropout; pbp; deep_ensemble; qd: one hidden layer with the ReLU activation function relu, trained with the Adam optimizer adam. Input and target variables are normalized to zero mean and unit variance.
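The evaluation protocol for the UCI datasets (repeated random 90%/10% splits with standardized variables) can be sketched as follows. Computing the normalization statistics on the training split only is an assumption of this sketch, as are the helper names `split_90_10` and `standardize` and the synthetic targets.

```python
import random
from statistics import mean, pstdev


def split_90_10(n_samples, seed):
    """Return (train_indices, test_indices) for one random 90%/10% split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = n_samples * 9 // 10
    return indices[:cut], indices[cut:]


def standardize(values, reference):
    """Scale values using the mean and standard deviation of `reference`."""
    mu, sigma = mean(reference), pstdev(reference)
    return [(v - mu) / sigma for v in values]


targets = [float(i) for i in range(100)]                            # synthetic targets
splits = [split_90_10(len(targets), seed) for seed in range(20)]    # 20 random splits

train_idx, test_idx = splits[0]
train_y = [targets[i] for i in train_idx]
train_y_norm = standardize(train_y, reference=train_y)
test_y_norm = standardize([targets[i] for i in test_idx], reference=train_y)
```

Metrics would then be computed per split and averaged across the 20 repetitions.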

IMDB age estimation dataset. We use the DenseNet architecture densenet as the backbone block, upon which we add two fully connected layers. We apply the data preprocessing used in SSRNet; zhang (see appendix for details). We report the results for 5-fold cross validation, as the dataset has no predefined test set.

RSNA bone age dataset. We use the VGG-16 architecture vgg as the backbone block, with weights pre-trained on ImageNet. We then add two convolutional layers followed by a dense layer. This dataset has a predefined test set of 200 images.

5.4 Evaluation Results: UCI Datasets

We use two evaluation metrics: MPIW and RMSE, with the desired coverage, measured by the PICP metric, set to 95% (as done in qd). In terms of PI quality, shown in Table 1, PIVEN outperforms QD in nine out of ten datasets (although it should be noted that no method reached the required PICP in two of these datasets, “Boston” and “Concrete”), while achieving equal performance in the remaining dataset. DE trails behind PIVEN and QD in most datasets, which is to be expected since this approach does not attempt to optimize MPIW.

Table 2 presents the RMSE metric values for all methods. It is clear that PIVEN and DE are the top performers, with the former achieving the best results in five datasets, and the latter in four. The QD baseline trails behind the other methods in all datasets but one (“Naval", where all methods achieve equal performance). QD’s performance is not surprising given that the focus of the said approach is the generation of PIs rather than value predictions.

The results of our experiments clearly show that PIVEN is capable of providing accurate value predictions for regression problems (i.e., achieving competitive results with the top-performing DE baseline) while achieving state-of-the-art results in uncertainty modeling by the use of PIs.

Ablation Analysis. In Section 4.3 we describe our rationale for expressing the value prediction as a function of the upper and lower bounds of the interval. To demonstrate the merits of our approach we evaluate two variants of PIVEN. In the first variant, denoted as POO (point-only optimization), we decouple the value prediction from the PI. The loss function of this variant combines the PI loss with a regression loss $\ell$ applied directly to the output of the auxiliary head, where $\ell$ is set to be the MSE loss. In the second variant, denoted MOI (middle of interval), the value prediction produced by the model is always the middle of the PI (in other words, $v$ is set to 0.5).
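To make the difference between the variants concrete, the toy comparison below contrasts a midpoint prediction (as in MOI) with a prediction placed by a learned weight, for a target that sits near the lower bound. The interval, the weight of 0.25 and the target are invented for illustration, and the weighted-combination form of the prediction is an assumption of this sketch. POO is not modeled here, since its output is not tied to the interval at all.

```python
def weighted_value(lower, upper, v):
    """Value prediction as a v-weighted combination of the two bounds."""
    return v * upper + (1.0 - v) * lower


def squared_error(pred, target):
    return (pred - target) ** 2


lower, upper, target = 30.0, 45.0, 32.0   # hypothetical interval and ground truth

moi_pred = weighted_value(lower, upper, 0.5)     # MOI: always the midpoint
piven_pred = weighted_value(lower, upper, 0.25)  # learned weight (made-up value)

moi_err = squared_error(moi_pred, target)
piven_err = squared_error(piven_pred, target)
print(moi_pred, piven_pred, moi_err, piven_err)
```

Whenever the target distribution inside the interval is skewed, a free weight can reduce the squared error without changing the interval itself.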

The results of our ablation study are presented in Table 3, which contains the results of the MPIW and RMSE metrics (the PICP values are identical for all variants and are therefore omitted here; they are presented in Appendix B). It is clear that the full PIVEN significantly outperforms the two other variants. This leads us to conclude that both novel aspects of our approach (the simultaneous optimization of PI width and RMSE, and the ability to select any value on the PI as the value prediction) contribute to PIVEN’s performance. Finally, it is important to note that even though their performance is inferior to PIVEN, both the POO and MOI variants outperform the QD baseline in terms of MPIW, while being equal or better for RMSE.

Table 1: Results on regression benchmark UCI datasets comparing PICP and MPIW. Best performance defined as in qd: every approach with PICP ≥ 0.95 was defined as best for PICP. For MPIW, best performance was awarded to the lowest value. If PICP ≥ 0.95 for neither, the largest PICP was best, and MPIW was only assessed if the one with larger PICP also had the smallest MPIW.

| Dataset | PICP (DE) | PICP (QD) | PICP (PIVEN) | MPIW (DE) | MPIW (QD) | MPIW (PIVEN) |
|---|---|---|---|---|---|---|
| Boston | 0.87 ± 0.01 | 0.93 ± 0.01 | 0.93 ± 0.01 | 0.87 ± 0.03 | 1.15 ± 0.02 | 1.09 ± 0.01 |
| Concrete | 0.92 ± 0.01 | 0.93 ± 0.01 | 0.93 ± 0.01 | 1.01 ± 0.02 | 1.08 ± 0.01 | 1.02 ± 0.01 |
| Energy | 0.99 ± 0.00 | 0.97 ± 0.01 | 0.97 ± 0.00 | 0.49 ± 0.01 | 0.45 ± 0.01 | 0.42 ± 0.01 |
| Kin8nm | 0.97 ± 0.00 | 0.96 ± 0.00 | 0.96 ± 0.00 | 1.14 ± 0.01 | 1.18 ± 0.00 | 1.10 ± 0.00 |
| Naval | 0.98 ± 0.00 | 0.97 ± 0.00 | 0.98 ± 0.00 | 0.31 ± 0.01 | 0.27 ± 0.00 | 0.24 ± 0.00 |
| Power plant | 0.96 ± 0.00 | 0.96 ± 0.00 | 0.96 ± 0.00 | 0.91 ± 0.00 | 0.86 ± 0.00 | 0.86 ± 0.00 |
| Protein | 0.96 ± 0.00 | 0.95 ± 0.00 | 0.95 ± 0.00 | 2.68 ± 0.01 | 2.27 ± 0.01 | 2.26 ± 0.01 |
| Wine | 0.90 ± 0.01 | 0.91 ± 0.01 | 0.91 ± 0.01 | 2.50 ± 0.02 | 2.24 ± 0.02 | 2.22 ± 0.01 |
| Yacht | 0.98 ± 0.01 | 0.95 ± 0.01 | 0.95 ± 0.01 | 0.33 ± 0.02 | 0.18 ± 0.00 | 0.17 ± 0.00 |
| Year Prediction MSD | 0.95 ± NA | 0.95 ± NA | 0.95 ± NA | 2.91 ± NA | 2.45 ± NA | 2.42 ± NA |

Table 2: Evaluation results for the UCI benchmark datasets, using the RMSE metric

| Dataset | DE | QD | PIVEN |
|---|---|---|---|
| Boston | 2.87 ± 0.19 | 3.39 ± 0.26 | 3.13 ± 0.21 |
| Concrete | 5.21 ± 0.09 | 5.88 ± 0.10 | 5.43 ± 0.13 |
| Energy | 1.68 ± 0.06 | 2.28 ± 0.04 | 1.65 ± 0.03 |
| Kin8nm | 0.08 ± 0.00 | 0.08 ± 0.00 | 0.07 ± 0.00 |
| Naval | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 |
| Power plant | 3.99 ± 0.04 | 4.14 ± 0.04 | 4.08 ± 0.04 |
| Protein | 4.36 ± 0.02 | 4.99 ± 0.02 | 4.35 ± 0.02 |
| Wine | 0.62 ± 0.01 | 0.67 ± 0.01 | 0.63 ± 0.01 |
| Yacht | 1.38 ± 0.07 | 1.10 ± 0.06 | 0.98 ± 0.07 |
| Year Prediction MSD | 8.95 ± NA | 9.30 ± NA | 8.93 ± NA |

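Table 3: Ablation analysis, comparing MPIW and RMSE. Results were analyzed as in Table 1

| Dataset | MPIW (POO) | MPIW (MOI) | MPIW (PIVEN) | RMSE (POO) | RMSE (MOI) | RMSE (PIVEN) |
|---|---|---|---|---|---|---|
| Boston | 1.09 ± 0.02 | 1.15 ± 0.02 | 1.09 ± 0.01 | 3.21 ± 0.24 | 3.39 ± 0.27 | 3.13 ± 0.21 |
| Concrete | 1.02 ± 0.01 | 1.07 ± 0.01 | 1.02 ± 0.01 | 5.55 ± 0.11 | 5.73 ± 0.10 | 5.43 ± 0.13 |
| Energy | 0.42 ± 0.01 | 0.45 ± 0.01 | 0.42 ± 0.01 | 2.16 ± 0.04 | 2.27 ± 0.04 | 1.65 ± 0.03 |
| Kin8nm | 1.13 ± 0.00 | 1.17 ± 0.00 | 1.10 ± 0.00 | 0.08 ± 0.00 | 0.08 ± 0.00 | 0.07 ± 0.00 |
| Naval | 0.24 ± 0.00 | 0.30 ± 0.02 | 0.24 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00 |
| Power plant | 0.86 ± 0.00 | 0.86 ± 0.00 | 0.86 ± 0.00 | 4.13 ± 0.04 | 4.15 ± 0.04 | 4.08 ± 0.04 |
| Protein | 2.25 ± 0.01 | 2.27 ± 0.01 | 2.26 ± 0.01 | 4.78 ± 0.02 | 4.99 ± 0.01 | 4.35 ± 0.02 |
| Wine | 2.24 ± 0.01 | 2.23 ± 0.01 | 2.22 ± 0.01 | 0.64 ± 0.01 | 0.67 ± 0.01 | 0.63 ± 0.01 |
| Yacht | 0.18 ± 0.00 | 0.19 ± 0.01 | 0.17 ± 0.00 | 0.99 ± 0.07 | 1.15 ± 0.08 | 0.98 ± 0.07 |
| Year Prediction MSD | 2.42 ± NA | 2.43 ± NA | 2.42 ± NA | 9.10 ± NA | 9.25 ± NA | 8.93 ± NA |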

5.5 Large-Scale Datasets

In our discussion in Section 4.3, we argue that PIVEN’s auxiliary head forces it to train on the entire training set rather than overfit itself to the data points it manages to capture within their respective PIs. We hypothesize that this advantage will become more pronounced in large and complex data, and therefore perform an evaluation on two image datasets: bone age and age estimation. Since training the DE approach on datasets of this size is computationally prohibitive, we instead use a dense layer on top of the used architecture (DenseNet/VGG, see Section 5.3) that outputs value prediction using the MSE metric. In doing so, we follow the approach used in rio for similar evaluation. We refer to this architecture as NN, because of the dense layer we add.

Our results are presented in Table 4. We use mean absolute error (MAE), which was the datasets’ chosen metric. For the IMDB age prediction dataset, results show that PIVEN outperforms both baselines across all metrics. It is particularly noteworthy that our approach achieves both higher coverage and tighter PIs compared to QD. We attribute the significant improvement in MPIW – 17% – to the fact that this dataset has relatively high degrees of noise SSRNet. In the bone age dataset, PIVEN outperforms both baselines in terms of MAE. Our approach fares slightly worse compared to QD on the MPIW metric, but that is likely due to the higher coverage (i.e., PICP) it is able to achieve.
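The 17% figure can be reproduced from the IMDB rows of Table 4, assuming it denotes the relative reduction in MPIW of PIVEN with respect to QD:

$$\frac{MPIW_{QD} - MPIW_{PIVEN}}{MPIW_{QD}} = \frac{3.47 - 2.87}{3.47} = \frac{0.60}{3.47} \approx 0.173$$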

Our results support our hypothesis that for large and high-dimensional data (and in particular those with high degrees of noise), PIVEN is likely to outperform previous work due to its ability to combine value predictions with PI generation. PIVEN produces tighter PIs and places the value prediction more accurately within the PI. A detailed analysis of the training process, in terms of training/validation loss, MAE, PICP and MPIW, is presented in Appendix C and further supports our conclusions.

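Table 4: Results on the RSNA bone age and IMDB age estimation datasets

| Dataset | Method | PICP | MPIW | MAE |
|---|---|---|---|---|
| Bone age | NN | NA | NA | 18.68 |
| Bone age | PIVEN | 0.93 | 2.09 | 18.13 |
| Bone age | QD | 0.9 | 1.99 | 20.24 |
| IMDB age | NN | NA | NA | 7.08 ± 0.03 |
| IMDB age | PIVEN | 0.95 ± 0.01 | 2.87 ± 0.04 | 7.03 ± 0.04 |
| IMDB age | QD | 0.92 ± 0.01 | 3.47 ± 0.03 | 10.23 ± 0.12 |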

6 Conclusions

We present PIVEN, a novel deep architecture for addressing uncertainty. Our approach is the first to combine the generation of prediction intervals with specific value predictions. By optimizing for these two goals simultaneously we are able to produce tighter intervals while at the same time achieving greater precision in our value predictions. Our evaluation on a set of widely accepted benchmark datasets as well as large image datasets supports the merits of our approach. For future work, we will consider applying PIVEN in the field of deep reinforcement learning (DRL). While DRL algorithms usually employ neural nets for their utility estimations, the ability to provide both a specific value and a PI has, in our view, the potential to produce more effective exploration/exploitation strategies. Such an improvement is particularly important in domains where exploration is expensive or time consuming.


Appendix A Experimental Setup

In this section we provide full details of our dataset preprocessing and experiments presented in the main study. Our code is available online (https://github.com/elisim/piven).

A.1 Dataset Preprocessing

In addition to the ten benchmark datasets used by all recent studies in the field, we evaluated PIVEN on two large image datasets. Due to the size of the datasets and the nature of the domain, preprocessing was required. We provide the full details of the process below.

UCI datasets.   For the UCI datasets, we used the experimental setup proposed by [pbp], which was also used by both baselines described in this study. Results were averaged over 20 random splits of the data, except for the “Year Prediction MSD” and “protein” datasets. Since “Year Prediction MSD” has predefined fixed splits by the provider, only one run was conducted. For “protein”, 5 splits were used, as was done in previous work. We used identical network architectures to those described in [deep_ensemble, mc_dropout, qd, pbp]: one dense layer with ReLU [relu], containing 50 neurons for each network, except for the “Year Prediction MSD” and “protein” datasets, where the NNs had 100 neurons. Regarding train/test split and hyperparameters, we employ the same setup as [qd]: train/test folds were randomly split 90%/10%, and input and target variables were normalized to zero mean and unit variance. The softening factor was constant for all datasets, $s = 160.0$. For the majority of the datasets we used a single value of $\lambda$, except for “naval”, “protein”, “wine” and “yacht”, where $\lambda$ was set to 4.0, 40.0, 30.0 and 3.0 respectively. The value of the parameter $\beta$ was set to 0.5. The Adam optimizer [adam] was used with exponential decay, where the learning rate and decay rate were tuned. A batch size of 100 was used for all the datasets, except for “Year Prediction MSD”, where the batch size was set to 1000. Five neural nets were used in each ensemble, using parameter re-sampling. The objective used to optimize the auxiliary head was mean squared error (MSE) for all datasets. We also tune $\lambda$, the initializing variance, and the number of training epochs using early stopping. To ensure that our comparison with the state-of-the-art baselines is accurate, we first set the parameters of our neural nets so that they produce the results reported in [qd]. We then use the same parameter configurations in our experiments with PIVEN.

IMDB age estimation dataset   For the IMDB dataset, we used the DenseNet architecture [densenet] as a feature extractor. On top of this architecture we added two dense layers with dropout. The sizes of the two dense layers were 128 and 32 neurons respectively, with a dropout factor of 0.2 and ReLU activation [relu]. In the last layer, the biases of the upper- and lower-bound outputs were given separate initial values. We used data preprocessing similar to that of previous work [SSRNet, zhang]: all face images were aligned using facial landmarks such as the eyes and the nose. After alignment, the face region of each image was cropped and resized to a 64 × 64 resolution. In addition, common data augmentation methods, including zooming, shifting, shearing, and flipping, were randomly activated. The Adam optimization method [adam] was used for optimizing the network parameters over 90 epochs, with a batch size of 128. The learning rate was set to 0.002 initially and reduced by a factor of 0.1 every 30 epochs. Regarding loss hyperparameters, we used the standard configuration proposed in [qd]: confidence level set to 0.95, softening factor set to 160.0, and $\lambda$ set as in that work. For PIVEN we used the same setting, together with the additional hyperparameter $\beta$. Since there was no predefined test set for this dataset, we employed 5-fold cross validation: in each split, we used 20% as the test set. Additionally, 20% of the train set was designated as the validation set. The best model was obtained by minimizing the validation loss. In QD and PIVEN, we normalized ages to zero mean and unit variance.
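The learning-rate schedule described for the IMDB experiments (0.002 initially, multiplied by 0.1 every 30 epochs, over 90 epochs) can be written as a small step-decay function. The zero-based epoch index and the function name `step_decay_lr` are assumptions of this sketch.

```python
def step_decay_lr(epoch, initial_lr=0.002, factor=0.1, step=30):
    """Learning rate for a zero-based epoch index under step decay."""
    return initial_lr * factor ** (epoch // step)


# Learning rate at the first epoch of each 30-epoch stage.
schedule = [step_decay_lr(epoch) for epoch in (0, 30, 60)]
print(schedule)
```

In a Keras training loop, a function of this shape could be passed to a learning-rate scheduler callback.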

RSNA pediatric bone age dataset   For the RSNA dataset, we used the well-known VGG-16 architecture [vgg] as a base model, with weights pre-trained on ImageNet. On top of this architecture, we added batch normalization [batch_norm], an attention mechanism with two CNN layers of 64 and 16 neurons each, two average pooling layers, dropout [dropout] with a 0.25 probability, and a fully connected layer with 1024 neurons. The activation function for the CNN layers was ReLU [relu], and we used ELU for the fully connected layer. For the last layer of the PIs, we used biases of , for the upper and lower bound initialization, respectively. We used standard data augmentation consisting of horizontal flips, vertical and horizontal shifts, and rotations. In addition, we normalized targets to zero mean and unit variance. To reduce computational costs, we downscaled input images to 384 × 384 pixels. The network was optimized using the Adam optimizer [adam], with an initial learning rate of 0.01, which was reduced when the validation loss stopped improving for 10 epochs. We trained the network for 50 epochs using a batch size of 100. For our loss hyperparameters, we used the standard configuration proposed in [qd]: confidence interval set to 0.95, soften factor set to 160.0 and . For PIVEN we used the same setting, with .
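The plateau-based learning-rate reduction mentioned above can be sketched in a framework-independent way. The class name `PlateauLR` is hypothetical, and the reduction factor of 0.1 is an assumption, since the text gives only the initial rate (0.01) and the patience (10 epochs).

```python
class PlateauLR:
    """Reduce the learning rate when the validation loss stops improving.

    `factor` is an assumed value; the text does not state it.
    """

    def __init__(self, lr=0.01, patience=10, factor=0.1):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

# Three improving epochs followed by ten epochs without improvement.
sched = PlateauLR()
losses = [1.0, 0.8, 0.7] + [0.7] * 10
lrs = [sched.step(loss) for loss in losses]
```

In this toy run the rate stays at 0.01 until the tenth consecutive epoch without improvement, at which point it is multiplied by the assumed factor.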

Appendix B Ablation analysis full results

We now present the full results of our ablation studies, including PICP, for the ablation variants:

Table 5: Ablation analysis, comparing PICP and MPIW. Best was assessed as in Table 1.
Dataset | PICP (three ablation variants) | MPIW (three ablation variants)
Boston | 0.93 ± 0.01, 0.93 ± 0.01, 0.93 ± 0.01 | 1.09 ± 0.02, 1.15 ± 0.02, 1.09 ± 0.01
Concrete | 0.93 ± 0.01, 0.93 ± 0.01, 0.93 ± 0.01 | 1.02 ± 0.01, 1.07 ± 0.01, 1.02 ± 0.01
Energy | 0.97 ± 0.01, 0.97 ± 0.00, 0.97 ± 0.00 | 0.42 ± 0.01, 0.45 ± 0.01, 0.42 ± 0.01
Kin8nm | 0.96 ± 0.00, 0.96 ± 0.00, 0.96 ± 0.00 | 1.13 ± 0.00, 1.17 ± 0.00, 1.10 ± 0.00
Naval | 0.98 ± 0.00, 0.98 ± 0.00, 0.98 ± 0.00 | 0.24 ± 0.00, 0.30 ± 0.02, 0.24 ± 0.00
Power plant | 0.96 ± 0.00, 0.96 ± 0.00, 0.96 ± 0.00 | 0.86 ± 0.00, 0.86 ± 0.00, 0.86 ± 0.00
Protein | 0.95 ± 0.00, 0.95 ± 0.00, 0.95 ± 0.00 | 2.25 ± 0.01, 2.27 ± 0.01, 2.26 ± 0.01
Wine | 0.91 ± 0.01, 0.91 ± 0.01, 0.91 ± 0.01 | 2.24 ± 0.01, 2.23 ± 0.01, 2.22 ± 0.01
Yacht | 0.95 ± 0.01, 0.95 ± 0.01, 0.95 ± 0.01 | 0.18 ± 0.00, 0.19 ± 0.01, 0.17 ± 0.00
Year Prediction MSD | 0.95 ± NA, 0.95 ± NA, 0.95 ± NA | 2.42 ± NA, 2.43 ± NA, 2.42 ± NA
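For reference, the two interval metrics reported in Table 5 can be computed as in the sketch below. It assumes the usual definitions: PICP is the fraction of targets that fall inside their predicted interval, and MPIW is the mean interval width. The arrays are made-up values for illustration, not results from the experiments.

```python
import numpy as np

def picp(y, lower, upper):
    """Prediction interval coverage probability: share of targets inside [lower, upper]."""
    captured = (y >= lower) & (y <= upper)
    return captured.mean()

def mpiw(lower, upper):
    """Mean prediction interval width."""
    return np.mean(upper - lower)

# Illustrative (normalized) targets and interval bounds.
y = np.array([0.1, -0.4, 1.2, 0.0, 2.5])
lower = np.array([-0.5, -1.0, 0.5, -0.8, 0.0])
upper = np.array([0.5, 0.0, 1.5, 0.8, 2.0])

coverage = picp(y, lower, upper)  # 4 of the 5 targets are captured
width = mpiw(lower, upper)        # widths are 1.0, 1.0, 1.0, 1.6 and 2.0
```

A method is preferable when it reaches the desired coverage with a smaller width, which is how the two halves of the table are meant to be read together.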
Table 6: Ablation analysis comparing value prediction in terms of RMSE.
Dataset | RMSE (three ablation variants)
Boston | 3.21 ± 0.24, 3.39 ± 0.27, 3.13 ± 0.21
Concrete | 5.55 ± 0.11, 5.73 ± 0.10, 5.43 ± 0.13
Energy | 2.16 ± 0.04, 2.27 ± 0.04, 1.65 ± 0.03
Kin8nm | 0.08 ± 0.00, 0.08 ± 0.00, 0.07 ± 0.00
Naval | 0.00 ± 0.00, 0.00 ± 0.00, 0.00 ± 0.00
Power plant | 4.13 ± 0.04, 4.15 ± 0.04, 4.08 ± 0.04
Protein | 4.78 ± 0.02, 4.99 ± 0.01, 4.35 ± 0.02
Wine | 0.64 ± 0.01, 0.67 ± 0.01, 0.63 ± 0.01
Yacht | 0.99 ± 0.07, 1.15 ± 0.08, 0.98 ± 0.07
Year Prediction MSD | 9.10 ± NA, 9.25 ± NA, 8.93 ± NA
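Because the targets are normalized before training, an RMSE computed on normalized values has to be rescaled by the target standard deviation if it is to be expressed in the original units. The sketch below shows this conversion. Whether Table 6 reports RMSE on the original or the normalized scale is not stated here, so the rescaling step is an assumption, and the function name and numbers are purely illustrative.

```python
import numpy as np

def rmse_original_units(y_true_norm, y_pred_norm, y_std):
    """RMSE on z-scored targets, rescaled to the original target units.

    The mean offset cancels in the difference, so only y_std is needed.
    """
    rmse_norm = np.sqrt(np.mean((y_true_norm - y_pred_norm) ** 2))
    return y_std * rmse_norm

# Made-up normalized targets and predictions.
y_true_norm = np.array([0.0, 1.0, -1.0, 0.5])
y_pred_norm = np.array([0.1, 0.9, -1.2, 0.5])
rmse = rmse_original_units(y_true_norm, y_pred_norm, y_std=9.0)
```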

Appendix C IMDB age estimation training process and robustness to outliers

C.1 Training process

In the following figures we present comparisons of the training progression for PIVEN, QD and NN on the MAE, PICP and MPIW evaluation metrics. We used 80% of the images as the training set, while the remaining 20% were used as the validation set (we did not define a test set, as we were only interested in analyzing the progression of the training). For the MAE metric, presented in Figure 6, we observe that the values for QD do not improve. This is to be expected, since QD does not consider this goal in its training process (i.e., its loss function). This result further strengthens our argument that choosing the middle of the interval is often a sub-optimal strategy for value prediction. For the remaining two approaches, NN and PIVEN, we note that NN suffers from overfitting, given that its validation error is greater than its training error after convergence. This phenomenon does not occur in PIVEN, which further supports our conclusions regarding the method’s robustness.
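The point about the middle of the interval can be made concrete with a small numeric sketch. All numbers below are invented for illustration and are not experimental results; `point` stands for a hypothetical dedicated value prediction. Every target lies inside its interval, yet the interval midpoint is a noticeably worse point estimate.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

# Invented ages and prediction intervals; every age is covered by its interval.
y_true = np.array([23.0, 35.0, 61.0, 8.0])
lower = np.array([18.0, 20.0, 40.0, 2.0])
upper = np.array([40.0, 50.0, 70.0, 30.0])

midpoint = (lower + upper) / 2.0             # middle-of-interval value prediction
point = np.array([25.0, 33.0, 58.0, 10.0])   # hypothetical dedicated point prediction

mae_midpoint = mae(y_true, midpoint)
mae_point = mae(y_true, point)
```

Here the midpoint errors are 6, 0, 6 and 8 years, while the hypothetical point predictions are off by 2, 2, 3 and 2 years, so full coverage does not imply an accurate midpoint.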

For the MPIW metric (Figure 10), PIVEN presents better performance than QD on both the validation and training sets. Moreover, we observe a smaller gap between the errors produced by PIVEN on the two sets (validation and training), which indicates that PIVEN enjoys greater robustness and an ability to avoid overfitting to a subset of the data. Our analysis also shows that for the PICP metric (Figure 14), PIVEN converges to higher coverage.

Figure 6: Comparison of the MAE metric in the training process. Panels: (a) MAE NN, (b) MAE QD, (d) MAE validation errors. We observe that the values for QD (b) do not improve, which is expected since QD does not consider value prediction in its loss function. Moreover, we note that NN suffers from overfitting, given that the validation error is greater than the training error after convergence. This phenomenon does not affect PIVEN, which provides an indication of its robustness.
Figure 10: Comparison of the MPIW metric between QD and PIVEN in the training process. Panels: (c) MPIW validation. As can be seen, PIVEN significantly improves over QD, and has a smaller gap between training and validation errors.
Figure 14: Comparison of the PICP metric between QD and PIVEN in the training process. Panels: (c) PICP validation. PIVEN achieves higher coverage when the two methods converge.

C.2 Robustness to outliers

Since PIVEN is capable of learning from the entire dataset while QD learns only from data points which were captured by the PI, it is reasonable to expect that the former will outperform the latter when coping with outliers. In the IMDB age estimation dataset, we can consider images with very high or very low age as outliers. Our analysis shows that for this subset of cases, there is a large gap in performance between PIVEN and QD. In Figure 15 we provide several images of very young/old individuals and the results returned by the two methods. We can observe that PIVEN copes with these outliers significantly better.

Figure 15: The predictions produced for outliers (i.e., very young/old individuals) by both PIVEN and QD for the IMDB age estimation dataset. The results for QD are on the left, results for PIVEN on the right.