Label disagreement between human experts is a common issue in the medical domain and poses unique challenges in the evaluation and learning of classification models. In this work, we extend metrics for probability prediction, including calibration, i.e., the reliability of predictive probability, to adapt to such a situation. We further formalize the metrics for higher-order statistics, including inter-rater disagreement, in a unified way, which enables us to assess the quality of distributional uncertainty. In addition, we propose a novel post-hoc calibration method that equips trained neural networks with calibrated distributions over class probability estimates. With a large-scale medical imaging application, we show that our approach significantly improves the quality of uncertainty estimates in multiple metrics.
Reliable uncertainty quantification is an indispensable property for safety-critical systems such as medical diagnosis assistance. Despite the high accuracy of modern neural networks across a wide range of classification tasks, their predictive probabilities often tend to be uncalibrated [guo2017calibration]. Measuring and improving probability calibration, which is the closeness of a predictive probability to the actual class frequency, has become one of the central issues in machine learning research [vaicenavicius2019evaluating, widmann2019calibration, kumar2019verified]. At the same time, uncertainty exists in the real-world labels used as ground truth for training classifiers. Notably, in the medical domain, inter-rater variability of annotations is commonly observed despite the annotators’ expertise [raghu2018direct, sasada2018inter, jensen2019improving]. As current machine learning research primarily relies on ground truth labels, the evaluation and learning of classifiers under label uncertainty pose unique challenges in the medical domain.
To obtain reliable class probability estimates (CPEs) from a trained classifier, post-hoc calibration, which transforms the classifier’s output scores to fit empirical class probabilities, has been proposed for both general classifiers [platt1999probabilistic, zadrozny2001obtaining, zadrozny2002transforming] and neural networks [guo2017calibration, kull2019beyond]. However, current evaluation metrics for calibration rely on an empirical accuracy calculated with ground truth, in which the uncertainty of labels is not considered. Moreover, an essential aspect of uncertainty is not fully accounted for by CPEs; e.g., whether a given confidence in a class comes from the machine’s lack of confidence or from human disagreement about the classification is not differentiated, even when the CPE matches an observed label frequency in expectation. Recent work [raghu2018direct] has indicated that inferring label-side uncertainty measures, such as an empirical disagreement frequency, from CPEs is suboptimal. The authors instead used direct discrimination of high-uncertainty instances from input features as a superior approach. However, this treatment requires training an additional predictor for each objective and lacks an integrated view of the problem with classification. Also, since the discrimination threshold is given in a problem-specific way, how such a prediction should be expressed and evaluated in a general situation has not been formalized.
In this work, we first develop an evaluation framework for CPEs under label uncertainty, in which multiple annotations per instance (called label histograms) are available. With the insight gained from proper scoring rules [gneiting2007strictly] and their decomposition [degroot1983comparison], we extend existing metrics, including calibration measures, to the situation with label histograms. Next, we generalize our formulation to probabilistic predictions of higher-order statistics, including inter-rater disagreement, which enables us to evaluate these statistics in a unified way with CPEs. At the same time, we emphasize that awareness of the distribution of CPEs is important for reliable predictions of these statistics. To fit this situation, we propose a novel approach that enhances the reliability of neural network models in terms of CPE distributions. While our approach, which we refer to as $\alpha$-calibration, uses only a single parameter to control the distribution, it enables us to capture well-recognized notions of uncertainty, epistemic and aleatoric [der2009aleatory, senge2014reliable, kendall2017uncertainties], within the distributional model. Finally, we apply our evaluation framework and $\alpha$-calibration to a large-scale classification task on cellular image data provided by a study of myelodysplastic syndrome (MDS) [sasada2018inter]. We show that our uncertainty-aware evaluation metrics for CPEs offer a meaningful interpretation of classification performance. Also, $\alpha$-calibration shows a significant improvement in the prediction of disagreement between annotators.
We overview calibration measures and proper scoring rules as a prerequisite for our work.
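Let $K$ be the number of categories, $e^K = \{e_1, \dots, e_K\}$ be the set of $K$-dimensional one-hot vectors (i.e., $e_{kl} = \mathbb{I}[k = l]$), and $\Delta^{K-1}$ be the $(K-1)$-dimensional probability simplex. Let $X \in \mathcal{X}$ and $Y \in e^K$ be random variables, where $X$ denotes an input feature, such as image data, and $Y$ denotes a $K$-way label. Given a classification model $f: \mathcal{X} \to \Delta^{K-1}$, the predictive class probability of the model for an input feature $X$ is also a random variable and is denoted by $Z = f(X)$.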
The notion of calibration, which is an agreement between a predictive class probability and an empirical class frequency, is a desirable property for the prediction to be reliable. Formally, we reference [kull2015novel] for the definition of calibration.
A classification model $f$ is said to be calibrated if its predictive probability $Z$ satisfies $C(Z) = Z$, where $C(Z) := \mathbb{E}[Y \mid Z]$ is called a calibration map.
The following metric is commonly used to measure an error of calibration for binary classifiers:
Note that the case of $p = 2$ is called the squared calibration error [kumar2019verified] and $p = 1$ is the expected calibration error (ECE) [naeini2015obtaining]. CE takes its minimum value of zero iff $Z = C(Z)$. We refer to CE as the case of $p = 2$ in this work. For multiclass cases, we use the commonly used definition of class-wise calibration error [kumar2019verified], which aggregates the binary calibration error over the $K$ classes.
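As a rough illustration of the binary calibration error discussed above, the sketch below assumes the $\ell_p$ form $\mathrm{CE}_p = \left(\mathbb{E}\left[\,|C(Z)_1 - Z_1|^p\,\right]\right)^{1/p}$, where $Z_1$ is the predicted probability of the positive class. Whether the outer $1/p$ root is applied is a convention assumed here, and the function name, the equal-width binning, and the default bin count are illustrative choices rather than anything prescribed by the text. The calibration map is approximated by the empirical label frequency within each bin, so this is only a plug-in estimate.

```python
import numpy as np

def binned_calibration_error(z, y, p=2, n_bins=10):
    """Plug-in estimate of (E|C(Z) - Z|^p)^(1/p) for a binary classifier.

    z: predicted probabilities of the positive class, values in [0, 1]
    y: binary labels (0 or 1)
    """
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins on [0, 1];
    # a prediction of exactly 1.0 goes into the last bin.
    idx = np.minimum((z * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue
        # Gap between the empirical label frequency and the mean prediction.
        gap = abs(y[mask].mean() - z[mask].mean())
        # Weight the gap by the fraction of samples falling in this bin.
        total += mask.mean() * gap ** p
    return total ** (1.0 / p)
```

With `p=1` this reduces to the familiar binned ECE. Estimates of this kind depend on the binning scheme, so they are better suited to comparing models under a fixed scheme than to reading as absolute values.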
Although calibration is a desirable property, being calibrated is not sufficient for useful predictions. For instance, a predictor that always outputs the marginal class frequency is perfectly calibrated, but entirely lacks sharpness of prediction for labels stratified by the input $X$. In contrast, strictly proper scoring rules [gneiting2007strictly] elicit the instance-wise true probability in expectation and do not suffer from this problem.
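The point that calibration alone does not imply useful predictions can be checked on a toy example. The sketch below uses made-up data with two groups of instances whose positive rates are 0.25 and 0.75. Both predictors are calibrated on this sample, yet the squared loss (Brier score), a strictly proper scoring rule, prefers the sharper one. All names and numbers are illustrative assumptions.

```python
import numpy as np

def brier_score(z, y):
    """Mean squared difference between predicted probabilities and binary labels."""
    z = np.asarray(z, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((z - y) ** 2))

# Toy labels: the first four instances have positive rate 0.25,
# the last four have positive rate 0.75.
y = np.array([1, 0, 0, 0, 1, 1, 1, 0], dtype=float)

# Predictor 1: always outputs the marginal positive frequency (0.5 here).
z_marginal = np.full(y.shape, y.mean())

# Predictor 2: outputs the group-wise positive frequency.
z_sharp = np.array([0.25] * 4 + [0.75] * 4)

score_marginal = brier_score(z_marginal, y)
score_sharp = brier_score(z_sharp, y)
```

Each predicted value matches the observed label frequency among the instances receiving it, so neither predictor has a calibration gap on this sample, while `score_sharp` is lower than `score_marginal`.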
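A loss function $\ell: e^K \times \Delta^{K-1} \to \mathbb{R}$ is said to be proper if, for all $q, p \in \Delta^{K-1}$, $\mathbb{E}_{Y \sim \mathrm{Cat}(q)}[\ell(Y, p)] \geq \mathbb{E}_{Y \sim \mathrm{Cat}(q)}[\ell(Y, q)]$ holds, where $\mathrm{Cat}(q)$ denotes a categorical distribution. If the strict inequality holds whenever $p \neq q$, $\ell$ is said to be strictly proper. Following the convention, we write $\ell(q, p)$ for $\mathbb{E}_{Y \sim \mathrm{Cat}(q)}[\ell(Y, p)]$.
For a strictly proper loss $\ell$, the divergence function $d(q, p) := \ell(q, p) - \ell(q, q)$ takes a non-negative value and is zero iff $p = q$, by definition. The squared loss $\ell_{\mathrm{sq}}(y, p) := \|y - p\|^2$ and the logarithmic loss $\ell_{\log}(y, p) := -\sum_{k} y_k \log p_k$ are the most well-known examples of strictly proper losses. For these cases, the divergence functions are given as $d_{\mathrm{sq}}(q, p) = \|q - p\|^2$ and $d_{\log}(q, p) = \sum_{k} q_k \log (q_k / p_k)$, a.k.a. the KL divergence, respectively.
Let $L := \mathbb{E}[\ell(Y, Z)]$ denote the expected loss, where the expectation is taken over a distribution $P(X, Y)$. As special examples of $L$ that are based on $\ell_{\mathrm{sq}}$ and $\ell_{\log}$,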
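The two divergences named above can be verified numerically. The sketch below assumes the standard construction in which the divergence is the expected loss under labels drawn from a categorical distribution with parameter `q`, evaluated at the prediction `p`, minus the same expectation evaluated at `q` itself. The function names are hypothetical, and `q` and `p` are assumed to have strictly positive entries so that the logarithm is finite.

```python
import numpy as np

def squared_loss(y, p):
    """Squared loss between a one-hot label y and a probability vector p."""
    return float(np.sum((y - p) ** 2))

def log_loss(y, p):
    """Logarithmic loss of a probability vector p for a one-hot label y."""
    return float(-np.sum(y * np.log(p)))

def expected_loss(q, p, loss):
    """E_{Y ~ Cat(q)}[loss(Y, p)], computed by enumerating the one-hot labels."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    one_hot = np.eye(len(q))
    return sum(q[k] * loss(one_hot[k], p) for k in range(len(q)))

def divergence(q, p, loss):
    """Excess expected loss of predicting p when labels follow Cat(q)."""
    return expected_loss(q, p, loss) - expected_loss(q, q, loss)

q = np.array([0.2, 0.8])
p = np.array([0.5, 0.5])

d_sq = divergence(q, p, squared_loss)   # compare with the squared Euclidean distance
d_log = divergence(q, p, log_loss)      # compare with the KL divergence from q to p
```

On this example, `d_sq` agrees with `np.sum((q - p) ** 2)` and `d_log` with `np.sum(q * np.log(q / p))` up to floating-point error, and both divergences vanish when `p` equals `q`.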