1 Introduction
With the advancement of deep learning techniques, models based on neural networks are entrusted with various applications that involve complex decision making, such as medical diagnosis
(Caruana et al., 2015), self-driving cars (Bojarski et al., 2016), or safe exploration of an agent's environment in a reinforcement learning setting
(Kahn et al., 2017). While the accuracy of these techniques has improved significantly in recent years, they lack a very important capability: the ability to reliably detect whether the model has produced an incorrect prediction. This is especially crucial in real-world decision-making systems: if the model is able to sense that its prediction is likely incorrect, control of the system should be passed to fallback systems or to a human expert. For example, control should be passed to a human medical doctor when the confidence of a diagnosis with respect to a particular symptom is low (Jiang et al., 2011). Similarly, when a self-driving car's obstacle detector is not sufficiently certain, the car should rely on fallback sensors, or choose the conservative action of slowing down the vehicle (Kendall and Gal, 2017). A lack of confidence estimates, or poor ones, may result in loss of human life
(National Highway Traffic Safety Administration, 2017). We address this problem by pursuing the following paradigm: a learnable confidence scorer acting as an "observer" (meta-model) on top of an existing neural classifier (base model). The observer collects various features from the base model and is trained to predict success or failure of the base model with respect to its original task (e.g., image recognition).
Formally, we seek a meta-model $m$ that, given a base model $b$ with parameters $\theta$, produces a confidence score $c(x) = m(x; b, \theta)$ for each input $x$. The confidence score
need not be a probability: it can be any scalar value that relates to uncertainty and can be used to filter out the most uncertain samples based on a threshold value.
To generate confidence scores, we propose a meta-model utilizing linear classifier probes (Alain and Bengio, 2016) inserted into the intermediate layers of the base model (hence referred to as "white-box" due to its transparency with respect to the internal states). We use a well-studied task, image classification, as the focus of this paper, and show that the confidence scores generated by the white-box meta-models are superior to standard baselines when noisy data are considered in the training. By removing the samples deemed most uncertain by our method, the precision of the base model's predictions on the remaining examples improves significantly. Additionally, we show in the experiments that our method extends to handling out-of-domain samples: when the base model encounters out-of-domain data, the white-box meta-model is shown to be capable of rejecting these with better accuracy than the baselines.
2 Related work
Previous work on Monte Carlo dropout (Gal et al., 2017; Gal and Ghahramani, 2016) to estimate model uncertainty can be applied to the filtering task at hand. In an autonomous driving application, this approach showed that model uncertainty correlates with positional error (Kendall and Cipolla, 2016). In an application to image segmentation, uncertainty analysis was done at the pixel level, and overall classification accuracy improved when pixels with higher uncertainty were dropped (Kampffmeyer et al., 2016). Monte Carlo dropout was also used to estimate uncertainty in diagnosing diabetic retinopathy from fundus images (Leibig et al., 2017); diagnostic performance improvement was reported when uncertainty was used to filter out some instances from model-based classification.
Uncertainty estimations from methods like Monte Carlo dropout can be viewed as providing additional features about a model's prediction for an instance, which can be subsumed by our proposed meta-model approach.
In a broader context, the ability to rank samples is a fundamental notion in receiver operating characteristics (ROC) analysis. ROC is primarily concerned with the task of detection (filtering), which is in contrast to estimating a prognostic measure of uncertainty (implying calibration). A plethora of ROC-related work spanning a variety of disciplines, including biomedical, signal, speech, language, and image processing, has been explored in the context of filtering and decision making (Zou et al., 2011; ICML Workshop, 2006). Moreover, ROC, either as a whole or through a part of its operating range, has been used in optimization in various applications (Wang et al., 2016; Navrátil and Ramaswamy, 2002). Since we are focusing on the filtering aspect of confidence scoring rather than on calibration, we adopt ROC analysis as our primary metric in this work (Ferri et al., 2009).
Modern neural networks are known to be miscalibrated (Guo et al., 2017): the predicted probability is highly biased with respect to the true correctness likelihood. Calibration has been proposed as a post-processing step to mitigate this problem for any model (Zadrozny and Elkan, 2001, 2002; Guo et al., 2017). Calibration methods like isotonic regression (Zadrozny and Elkan, 2002) perform transformations that are monotonic with respect to scores for sets of instances and so will not alter the ranking of confident vs. uncertain samples. The more recent temperature scaling calibration method (Guo et al., 2017) can alter the ranking of instances and will be considered and compared in our analysis.
The recent work on selective classification for deep neural networks (Geifman and El-Yaniv, 2017) shares the same broad goal of filtering out instances where the base model prediction is in doubt. Their method uses only the outputs of the base model (softmax response) to determine a threshold that would optimize coverage (recall) while guaranteeing the desired risk (precision) at some specified confidence level. From an application perspective, our work extends this by showing that in noisy settings white-box models for this task outperform methods using only the base model output scores. We also consider an additional task using out-of-domain instances to evaluate filtering methods when encountering domain shifts.
3 Method
For any classification model $b: x \mapsto \hat{p}$, where $\hat{p}$
is the probability vector over the predicted classes, we define a confidence scoring model $m$
(the meta-model) that operates on $b$ (the base model) and produces a confidence score $c(x)$ for each prediction $\hat{y}$. We explore two kinds of meta-models, namely the black-box and the white-box type.
Black-box
In the black-box version it is assumed that the internal mechanism of the base model is not accessible to the meta-model, i.e., the only observable variable for the meta-model is its output $\hat{p}$:

$c(x) = m(\hat{p}), \quad \hat{p} = b(x; \theta)$   (1)

For example, in a $k$-class classification problem, the meta-model is only allowed to take the final $k$-dimensional probability vector $\hat{p}$ into account. A typical representative of a black-box baseline commonly employed in real-world scenarios is the softmax response (Geifman and El-Yaniv, 2017): simply taking the probability output of the predicted class label:

$c(x) = \hat{p}_{\hat{y}}, \quad \hat{y} = \arg\max_i \hat{p}_i$   (2)

where $\hat{p}_i$ is the $i$-th dimension of the vector $\hat{p}$, $\hat{y}$ is the label with the highest predicted probability, and $\theta$ denotes the parameters of the base model $b$.
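As a concrete illustration, the softmax-response baseline of Eq. (2) reduces to taking the maximum of the output probability vector. A minimal NumPy sketch (the function name is ours, for illustration only):

```python
import numpy as np

def softmax_response(probs):
    """Black-box baseline: the confidence score is simply the
    probability assigned to the predicted (arg-max) class."""
    probs = np.asarray(probs, dtype=float)
    return probs.max(axis=-1)
```

Any such scalar score can then be thresholded to filter out the least confident predictions.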
White-box
A white-box meta-model assumes full access to the internals of the base model. A neural model, consisting of multiple stacked layers, can be regarded as a composition of functions:

$b = f_L \circ f_{L-1} \circ \cdots \circ f_1$   (3)

We denote the intermediate results as $z_1 = f_1(x)$; $z_2 = f_2(z_1)$; $z_3 = f_3(z_2)$; etc. A white-box meta-model is capable of accessing these intermediate results:

$c(x) = m(z_1, z_2, \ldots, z_L)$   (4)

where $z_L$ is the output of the last layer. It should be noted that in general the meta-model may employ additional functions to combine the base model's intermediate results in various ways, and we explore one such option by using the linear classifier probes described below.
3.1 White-box meta-model with linear classifier probes
We propose a white-box model using linear classifier probes (hereafter just "probes"). The concept of probes was originally proposed by Alain and Bengio (2016) as an aid for enhancing the interpretability of neural networks. Here, however, we apply this concept for the purpose of extracting features from the base model. Our intuition draws from the fact that probes for different layers tend to learn different levels of abstraction of the input data: lower layers (those closer to the input) learn more elementary patterns, whereas higher layers (those closer to the output) capture conceptual abstractions of the data and tend to be more informative with respect to the class label of a given instance.
For each intermediate result $z_i$ ($i = 1, \ldots, L$, with $z_L$ being the final output of the multi-layer neural network), we train a probe to predict the correct class using only that specific intermediate result:

$q_i = \mathrm{probe}_i(z_i) = \mathrm{softmax}(W_i z_i + b_i)$   (5)

Given the set of trained probes $\{q_1, \ldots, q_L\}$, we build the meta-model using the probe outputs (either probabilities or logits) as training input. The meta-model is then trained with the objective of predicting whether the base model's classification is correct or not. Finally, the predicted probability of the base model being correct is the confidence score $c(x)$:

$c(x) = m(q_1, q_2, \ldots, q_L)$   (6)
This architecture is illustrated in Figure 1. The diode symbol represents the one-way nature of the information flow, emphasizing that the probes are not trained jointly with the base model. Instead, they are trained with the underlying base model's parameters fixed.
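As a sketch of how such a probe can be obtained: a linear classifier is fit on the (flattened) intermediate activations of the frozen base model, with the base model itself never updated. The activation array below is random stand-in data, and scikit-learn's `LogisticRegression` is one possible choice of linear classifier; neither is prescribed by the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_probe(activations, labels):
    """Fit a linear classifier probe on intermediate activations z_i
    of the frozen base model; only the probe's weights are learned."""
    X = np.asarray(activations).reshape(len(activations), -1)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X, labels)
    return probe

# Stand-in data: 60 samples with 16-dimensional activations, 3 classes.
rng = np.random.default_rng(0)
acts = rng.normal(size=(60, 16))
labels = rng.integers(0, 3, size=60)
probe = train_probe(acts, labels)
q = probe.predict_proba(acts)  # per-class probe outputs q_i, later fed to the meta-model
```

One probe is trained per intermediate layer, and their output vectors together form the meta-model's input.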
3.2 Meta-model structure
We explore two different forms of the meta-model function $m$ from Eq. (6). The meta-model is trained as a binary classifier, where $m$ predicts whether the base model prediction is correct or not. The probability of the positive class is used as the confidence score $c(x)$.
Logistic regression (LR)
This meta-model has the simple form

$c(x) = w^\top [q_1; q_2; \ldots; q_L] + \beta$   (7)

where the probe vectors are concatenated. The logit value in Eq. (7) is used directly as the confidence score. The model is regularized.
Gradient boosting machine (GBM)
The concatenated probe vectors are fed into a gradient boosting machine
(Friedman, 2001). The GBM hyperparameters include the learning rate, the number of boosting stages, the maximum depth of the trees, and the fraction of samples used for fitting the individual base learners.
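Both meta-model forms can be sketched with scikit-learn: the concatenated probe outputs form the feature matrix, and the binary target marks whether the base model was correct. The data and hyperparameter values below are illustrative stand-ins, not the paper's settings.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))   # stand-in for concatenated probe vectors [q_1; ...; q_L]
y = (X[:, 0] > 0).astype(int)    # stand-in for "base model prediction was correct"

# Logistic-regression meta-model: the logit serves directly as the confidence score.
lr_meta = LogisticRegression(max_iter=1000).fit(X, y)
lr_conf = lr_meta.decision_function(X)

# GBM meta-model: the positive-class probability serves as the confidence score.
gbm_meta = GradientBoostingClassifier(
    learning_rate=0.1, n_estimators=50, max_depth=3, subsample=0.8).fit(X, y)
gbm_conf = gbm_meta.predict_proba(X)[:, 1]
```

In practice the hyperparameters would be tuned on the dev set, and the meta-models fit on train-meta rather than on the data used to train the base model.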
Besides the aforementioned structures, we also investigated fully connected two-layer neural networks; however, we omit them from this paper as their performance was essentially identical to that of the GBMs.
4 Tasks, datasets and metrics
We use the CIFAR-10 and CIFAR-100 image classification datasets (https://www.cs.toronto.edu/~kriz/cifar.html) in our experiments. For each dataset we conduct two flavors of experiments: an in-domain confidence scoring task, and an in-domain plus out-of-domain pool task (referred to as "out-of-domain" from now on).
In-domain task
Given a base model and a held-out set, the base model makes predictions about the samples in the held-out set. Can the trained meta-model prune out the predictions considered uncertain? Furthermore, after removing a varying percentile of the most uncertain predictions, how does the residual precision on the pruned held-out set change? The expected behavior is that the proposed meta-model should increase the overall residual accuracy after the uncertain samples are removed.
Out-of-domain task
Given a base model (here trained on CIFAR-10), what would the model do if presented with images not belonging to one of the 10 classes? The predictions made by the base model will surely be wrong; however, can the meta-model flag these predictions as incorrect? Our proposed meta-model should, in theory, assign a low confidence score to these out-of-domain predictions. Note that the out-of-domain task comprises both in-domain and out-of-domain samples, processed as a single pool.
We use the ROC (receiver operating characteristic) curve and the precision/recall curve to study the diagnostic ability of our meta-models. In the ROC curve, the $x$-axis is the false positive rate (i.e., the rate of incorrectly detected success events) and the $y$-axis is the true positive rate (i.e., recall): an operating point on the ROC plot corresponds to a threshold inducing a trade-off between the proportion of wrongly classified samples not detected by the meta-model and the proportion of correctly classified samples that the meta-model agrees with. Additionally, we compute the area under the curve (AUC) of the ROC as a summary value.
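Since the confidence score is used only for ranking, the ROC AUC has a simple probabilistic reading: the chance that a correctly classified sample receives a higher confidence score than an incorrectly classified one (ties counting one half). A small self-contained sketch of that pairwise computation:

```python
import numpy as np

def roc_auc(conf, correct):
    """AUC of the confidence scorer: probability that a correct base-model
    prediction outranks an incorrect one, with ties counted as 1/2.
    This pairwise formulation equals the area under the ROC curve."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = conf[correct], conf[~correct]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

The O(n^2) pairwise form is shown for clarity; a rank-based computation is preferable for large held-out sets.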
4.1 Datasets
The original CIFAR-10 dataset contains 50,000 training images and 10,000 test images. We divide the original training set into three subsets, namely train-base, train-meta, and dev.
Original partition | New partition | Size
-------------------+---------------+-------
50,000 train       | train-base    | 30,000
                   | train-meta    | 10,000
                   | dev           | 10,000
10,000 test        | test          | 10,000
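The partition above can be expressed as a simple index split. This is a sketch under the assumption of a uniformly random split; the seed and exact assignment of images to subsets are ours, not the paper's:

```python
import numpy as np

def split_cifar10_train(n_train=50000, seed=0):
    """Partition the 50,000 CIFAR-10 training indices into the three
    disjoint subsets: train-base (30k) for the base model, train-meta
    (10k) for the probes/meta-model, and dev (10k) for tuning."""
    idx = np.random.default_rng(seed).permutation(n_train)
    return idx[:30000], idx[30000:40000], idx[40000:]

train_base, train_meta, dev = split_cifar10_train()
```

Keeping the three subsets disjoint is what guarantees the meta-model never sees data the base model was trained on.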
We adopt the following training strategy so as to completely separate the data used by the base model and the meta-model:

Train the base model using the train-base subset. Because this training set (30,000 samples instead of 50,000) is smaller than the standard setup (for which the base model is reported to reach 92.5% accuracy), the accuracy on dev and test is slightly lower: we obtain 90.4% accuracy on test.

Train the white-box meta-model (including the probes) on train-meta.

The dev set is used for tuning the various hyperparameters and for validation.

The test set is used for final held-out performance reporting.
The out-of-domain task is evaluated by combining the test sets of the CIFAR-10 and CIFAR-100 datasets. The CIFAR-100 class labels are completely disjoint from those of CIFAR-10. The out-of-domain set will be referred to as OOD.
4.2 Base model
We reuse the high-performing ResNet model for image classification implemented in the official TensorFlow
(Abadi et al., 2016) example model code (https://github.com/tensorflow/models/tree/master/research/resnet). This model consists of a sequential stack of residual units of convolutional networks (He et al., 2016a,b; Zagoruyko and Komodakis, 2016), as shown in Figure 2. Each layer's tensor size is specified in the figure.
In subsequent experiments, we train probes for all intermediate layers, from $z_1$ to $z_{17}$. (We do not insert probes between the two convolutional layers within a residual unit; instead, we consider a residual unit an atomic layer.)
5 Experimental results
To assess the various models, we organize the experiments in several parts by varying the quality of the data used to create the models. Performance in each part is evaluated on both the in-domain and the out-of-domain tasks. The varying-quality aspect comprises the following conditions:
Clean base / Clean meta
All sets involved in training, i.e., train-base, train-meta, and dev, are used in their original form from the CIFAR-10 dataset;
Noisy base / Noisy meta
In this case the sets train-base, train-meta, and dev are modified by adding artificial noise to the labels of the images, hence degrading the base model performance. Specifically, for a random subset of 30% of the samples, the correct label is replaced by another label (randomly chosen from the complement of the label set). This results in an artificially degraded base model with a test-set accuracy of 77.4% (compared to 90.4% for the same model trained on clean data). This condition, in combination with the degraded base model, represents a scenario of obtaining training data from a noisy environment, e.g., via crowdsourcing, in which labels are not always correct.
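The label-corruption procedure can be sketched as follows: pick a random 30% of the samples and shift each of their labels by a nonzero random offset, which guarantees that the replacement label always differs from the original. The seed and function name are ours:

```python
import numpy as np

def corrupt_labels(labels, n_classes=10, frac=0.3, seed=0):
    """Replace the label of a random `frac` of the samples with a
    different label drawn uniformly from the remaining classes."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    n_noisy = int(frac * len(labels))
    noisy_idx = rng.choice(len(labels), size=n_noisy, replace=False)
    shift = rng.integers(1, n_classes, size=n_noisy)  # nonzero -> always a different label
    labels[noisy_idx] = (labels[noisy_idx] + shift) % n_classes
    return labels
```

Adding the offset modulo the number of classes is one way to sample uniformly from the complement of the original label.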
In both conditions, the test set (both in-domain and out-of-domain) is applied clean, without artificial corruption. The above conditions, in combination with the two tasks, offer a representative set of classification scenarios encountered in practice.
We compare the following confidence scoring methods:

(Black-box LR/GBM) Using the final output $\hat{p}$ as the only feature for the meta-models;

(Probes LR/GBM) White-box models using all the probes as features for the meta-models.
Fig. 3 shows the main results for the two conditions and the two tasks defined above, in terms of ROC and precision/recall curves. Table 2 summarizes the AUC (area under the ROC) results.
Method             | Condition (base/meta)
                   | Clean/Clean | Noisy/Noisy
-------------------+-------------+------------
In-domain task
Softmax            | 0.91        | 0.74
Black-box (LR)     | 0.91        | 0.79
Black-box (GBM)    | 0.91        | 0.80
Probes (LR)        | 0.88        | 0.87
Probes (GBM)       | 0.91        | 0.88
Out-of-domain task
Softmax            | 0.89        | 0.72
Black-box (LR)     | 0.89        | 0.76
Black-box (GBM)    | 0.89        | 0.77
Probes (LR)        | 0.85        | 0.83
Probes (GBM)       | 0.89        | 0.84
Under the Clean/Clean condition we observe little difference among the methods, with AUC values at 0.91 (in-domain setting on the test set, hereafter "test") and 0.89 (out-of-domain setting, hereafter "ood"), with the exception of the Probes-LR model (see the discussion below).
On the other hand, under the Noisy/Noisy condition, the probe-based (white-box) models separate themselves well from the baseline as well as from their black-box counterparts. Under this condition, the Probes-GBM model, with AUC values of 0.88 (test) and 0.84 (ood), dominates its Black-box GBM counterpart at 0.80 (test) and 0.77 (ood).
Overall, under the Noisy/Noisy condition, two trends can be identified: (1) white-box probe-based models outperform their black-box counterparts, all of which fare significantly better than the softmax baseline, and (2) the probe-based GBM model dominates, albeit moderately, the simpler LR model in all cases.
We further analyzed the lower performance of the regularized Probes-LR model in the Clean/Clean condition. We explored variants, including a sparse regularized LR model, but could not find a satisfactory explanation for this performance drop.
We also compared the performance of the temperature-scaled base model scores (Guo et al., 2017) in the two cases, Clean/Clean and Noisy/Noisy. In each case, the performance on both the in-domain and out-of-domain tasks after scaling stayed essentially the same as with the original base model scores, suggesting that calibration remains an orthogonal aspect of confidence scoring (i.e., it changes the distribution of the predicted scores but not the sample ranking).
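For reference, temperature scaling is a one-parameter transform of the logits. A minimal sketch; in practice the temperature would be fit on the dev set by minimizing the negative log-likelihood:

```python
import numpy as np

def temperature_scale(logits, T):
    """Temperature scaling (Guo et al., 2017): divide the logits by a
    scalar temperature T before the softmax. T > 1 softens (flattens)
    the predicted distribution; T < 1 sharpens it."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

Because a single T is shared across all samples, the transform mainly reshapes the score distribution, consistent with the observation above.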
6 Discussion
The experimental results presented in the previous section show that white-box meta-models using probes are significantly better in noisy settings, and also in out-of-domain settings, when compared to the softmax baseline and the black-box models, as shown by the various ROC and precision/recall curve plots. In this section we extract some insights by diving deeper into the results.
It is instructive to start with a comparison of the accuracies achieved by the probes at various levels. The chart in Figure 4 depicts these accuracies based on the meta-model training data in the two scenarios: Clean base / Clean meta and Noisy base / Noisy meta, respectively. The impact of noise is seen in the top accuracy achieved in each of the two scenarios. The accuracy improves with neural network depth for the most part in both scenarios. We also explored nonlinear probes using neural networks with one hidden layer of size 100. Although the probe accuracies did improve for many of the earlier layers, the resulting meta-model performance remained comparable, and we therefore present results using the simpler linear probes only.
The accuracy plots do not provide insights into how the white-box models achieve their higher performance and how this changes going from the clean-data scenario to the scenarios with added label noise.
To gain additional insight, we performed a feature informativeness analysis based on a method described in (Friedman, 2001). Derived from the GBM meta-model's feature usage statistics on the test set, feature importance scores for the two conditions (Clean/Clean and Noisy/Noisy) are shown in Figure 5. Here, each of the 10 outputs of each of the 17 probes is assigned an intensity level according to its importance score, thus forming a heatmap representation. Recall that the features are sorted according to the top-layer class probabilities, i.e., for each sample, feature 1 (on the vertical axis in Figure 5) corresponds to the top-scoring class, feature 2 to the second-highest-scoring class, etc., across all the probes (horizontal axis).
Considering the Clean/Clean scenario first (top portion of Figure 5), the most important features include probe outputs in the last layer (Layer 17), focusing on the score of the predicted class (i.e., the output with the highest base model score) and the class with the second-highest base model score. This aligns with the intuition that having a high score for the predicted class and a large gap relative to the next competing class (i.e., mostly looking at the top 2 scores) is indicative of the base model being correct. However, the observation changes in the Noisy/Noisy scenario (bottom portion of Figure 5). Here, two observations can be made: (1) there is a distinct shift in the GBM's reliance toward the second-to-last layer (Layer 16), preserving the pattern of looking at the top 2-3 scores within the probe, and (2) a significantly deeper-reaching attention of the meta-model within the probe cascade, including layers 12 through 16. We conjecture that these observations reflect the meta-model's pattern of "hedging" against the adverse effect of the label noise introduced in the Noisy/Noisy task. As the base model's error rate becomes higher (approximately 25%), the meta-model learns to almost completely ignore Layer 17 (which is directly exposed to the label noise) and to pick up on more robust, deeper-residing features in the ResNet model. This ability to adjust is the powerful advantage of the meta-model approach and appears to drive its significant performance improvement in the noisy scenario.
There is another advantage of the white-box meta-models that can be illustrated by considering the relative performance in the in-domain and out-of-domain settings. We argue that the Noisy/Noisy scenario is relevant for many real-life applications in which labels for the training data come from noisy sources. Figure 6 shows the comparative performance in the in-domain and out-of-domain settings for the white-box GBM meta-model and the base model final scores, respectively.
The horizontal axes in these plots represent the corresponding threshold values for the respective models for filtering the base model predictions (i.e., samples with confidence scores lower than the threshold value are filtered out). First, consider the white-box meta-model case in Figure 6 (left). Say that, in an application setting, we pick a threshold (0.59) that achieves an in-domain recall of 0.7. At this threshold, the GBM white-box meta-model achieves an in-domain precision of 0.95. If we encounter a domain shift, as represented by the out-of-domain task, the precision degrades to 0.71. Consider the same situation when using the base model score, as in Figure 6 (right). The threshold value of 0.51 achieves the same in-domain recall of 0.7. The in-domain precision is 0.87, but the drop in precision for the out-of-domain case is steeper, to 0.54. The lower performance degradation of the white-box meta-model when encountering domain shifts can be viewed as a form of robustness when compared with simply using the base model's scores.
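The threshold-based filtering used in this comparison can be made precise with a short sketch: keep only the predictions whose confidence clears the threshold, then report the accuracy on the kept set (precision) and the fraction of correct predictions that survive filtering (recall). The numbers below are toy values, not the paper's:

```python
import numpy as np

def filtered_precision_recall(conf, correct, threshold):
    """Filter out predictions whose confidence falls below `threshold`,
    then report precision (accuracy on the kept samples) and recall
    (fraction of correct predictions that survive filtering)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    kept = conf >= threshold
    precision = correct[kept].mean() if kept.any() else float("nan")
    recall = (kept & correct).sum() / correct.sum()
    return precision, recall
```

Sweeping the threshold traces out the precision/recall curves of Figure 6 for any confidence scorer.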
The impact of meta-model-based filtering can be further illustrated using examples representing the four quadrants of the binary confusion matrix. We chose the CIFAR-10 class "deer" and considered all instances from the Noisy/Noisy out-of-domain test set.^4 Figure 7 compares image examples sampled from the confusion quadrants when using the meta-model scores (left-hand side) with those sampled using the base model class score (baseline, right-hand side). The thresholds for each system were chosen so as to achieve the highest precision while still obtaining at least four samples in each confusion quadrant. The representative images shown in Figure 7 were randomly sampled from the resulting quadrant sets. Subjectively, it appears that the FP images from the white-box meta-model are relatively competitive with the "deer" class compared to the ones the simple baseline falsely accepts. A similar, albeit subjective, assessment in favor of the meta-model can be made comparing the FN images across the two systems.

^4 An interesting article showing some CIFAR examples of false positives can be found at https://hjweide.github.io/quantifyinguncertaintyinneuralnetworks.

7 Conclusion and future work
We proposed the paradigm of meta-models for confidence scoring, and investigated a white-box meta-model with linear classifier probes. Experiments on CIFAR-10 and CIFAR-100 data showed that our proposed method is capable of more accurately rejecting samples with low confidence compared to various baselines in noisy and/or out-of-domain scenarios. Its superiority over black-box baselines supports the use of white-box models, and our results demonstrate that probes into the intermediate states of a neural network provide a useful signal for confidence scoring.
Future work includes incorporating other base model features. One example is the work of Gal et al. (2017), whereby the uncertainty measures obtained using Monte Carlo dropout could serve as additional features to our proposed white-box meta-model.
References

Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283.

Alain and Bengio (2016) Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.
Bojarski et al. (2016) Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. 2016. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.

Caruana et al. (2015) Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. 2015. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In Proc. KDD, pages 1721–1730. ACM.

Ferri et al. (2009) C. Ferri, J. Hernández-Orallo, and R. Modroiu. 2009. An experimental comparison of performance measures for classification. Pattern Recogn. Lett., 30(1):27–38.

Friedman (2001) Jerome H. Friedman. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics, pages 1189–1232.

Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. Proc. ICML.

Gal et al. (2017) Yarin Gal, Jiri Hron, and Alex Kendall. 2017. Concrete dropout. arXiv preprint arXiv:1705.07832.

Geifman and El-Yaniv (2017) Yonatan Geifman and Ran El-Yaniv. 2017. Selective classification for deep neural networks. In Proc. NeurIPS, pages 4885–4894.

Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In Proc. ICML, pages 1321–1330.

He et al. (2016a) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016a. Deep residual learning for image recognition. In Proc. CVPR, pages 770–778.

He et al. (2016b) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016b. Identity mappings in deep residual networks. In Proc. ECCV, pages 630–645.

ICML Workshop (2006) Third Workshop on ROC Analysis in Machine Learning, ICML Workshop.

Jiang et al. (2011) Xiaoqian Jiang, Melanie Osl, Jihoon Kim, and Lucila Ohno-Machado. 2011. Calibrating predictive model estimates to support personalized medicine. Journal of the American Medical Informatics Association, 19(2):263–274.

Kahn et al. (2017) Gregory Kahn, Adam Villaflor, Vitchyr Pong, Pieter Abbeel, and Sergey Levine. 2017. Uncertainty-aware reinforcement learning for collision avoidance. arXiv preprint arXiv:1702.01182.

Kampffmeyer et al. (2016) Michael Kampffmeyer, Arnt-Borre Salberg, and Robert Jenssen. 2016. Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.

Kendall and Cipolla (2016) Alex Kendall and Roberto Cipolla. 2016. Modeling uncertainty in deep learning for camera relocalization. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 4762–4769. IEEE.

Kendall and Gal (2017) Alex Kendall and Yarin Gal. 2017. What uncertainties do we need in Bayesian deep learning for computer vision? In Proc. NeurIPS.

Leibig et al. (2017) Christian Leibig, Vaneeda Allken, Murat Seckin Ayhan, Philipp Berens, and Siegfried Wahl. 2017. Leveraging uncertainty information from deep neural networks for disease detection. bioRxiv doi: 10.1101/084210.

National Highway Traffic Safety Administration (2017) National Highway Traffic Safety Administration. 2017. PE 16007. Technical report.

Navrátil and Ramaswamy (2002) J. Navrátil and G. N. Ramaswamy. 2002. DETAC - a discriminative criterion for speaker verification. In Seventh International Conference on Spoken Language Processing (ICSLP), Denver, CO.

Wang et al. (2016) Sheng Wang, Siqi Sun, and Jinbo Xu. 2016. AUC-maximized deep convolutional neural fields for protein sequence labeling. In Proc. ECML PKDD, pages 1–16.

Zadrozny and Elkan (2001) Bianca Zadrozny and Charles Elkan. 2001. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In Proc. ICML, pages 609–616, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Zadrozny and Elkan (2002) Bianca Zadrozny and Charles Elkan. 2002. Transforming classifier scores into accurate multiclass probability estimates. In Proc. KDD, pages 694–699. ACM.

Zagoruyko and Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. 2016. Wide residual networks. arXiv preprint arXiv:1605.07146.

Zou et al. (2011) Kelly H. Zou, Aiyi Liu, Andriy I. Bandos, Lucila Ohno-Machado, and Howard E. Rockette. 2011. Statistical Evaluation of Diagnostic Performance: Topics in ROC Analysis. CRC Press.