Prediction accuracy, quantified as the number of correct predictions divided by the number of total predictions, has emerged as the de facto
measure of a classifier’s quality. Modern deep-learning-based classifiers, however, are well known for being both highly accurate and quite poor at predicting their own level of certainty. In particular, the “logit” values commonly used as a surrogate for probability have been shown to be vulnerable to a wide range of attacks that produce high-confidence but incorrect classification decisions. In light of this, it is worth considering a more general conception of accuracy.
In recent work [7, 6], Nelson argues that the geometric mean of the probabilities assigned to the correct classes, rather than the prediction accuracy defined above, should be used as the key measure of a classifier’s quality. He then generalizes this concept to introduce two additional metrics, which he calls decisiveness (or decisive-biased accuracy) and robustness (or robust-biased accuracy), to provide further insight. Nelson shows that these metrics can be assessed using both the classifier’s reported probability and the classifier’s measured probability, giving insight into the model’s degree of under- or over-confidence.
In this work, we take up the challenge of applying these metrics to large-scale, deep-learning-based classifiers. By calculating the metrics for the classification decisions made by two convolutional neural networks applied to two real-world imagery datasets, we give insight into both the metrics and the models. We also propose some minor clarifications to the original metric definitions, allowing us to standardize these metrics.
We organize the remainder of this paper as follows. In section 2, we define the terms we will use in the remainder of this discussion. We then review Nelson’s past work in more detail (section 3) and then provide the full prescription for calculating our metrics (section 4), highlighting clarifications to the original methodology. In section 5, we show the results of applying these metrics to our two datasets and two classifiers. We then conclude in section 6.
2 Definitions
To avoid ambiguity, we define the following terms:
Prediction Accuracy: the number of correct predictions divided by the number of total predictions.
Reported Probability: the probability that a classifier assigns to a given class for a given target. These may be the softmax-normalized “logit” values from a neural network, or may be calculated using any other technique. This can also be called “model probability.”
Measured Probability: the actual probability that a given classification decision is correct, based on the classifier’s historical performance on items with similar reported probabilities. Note, this could be generalized to consider other types of historical performance (e.g., per reported class), but we do not consider this here. This can also be called “source probability.”
Correct-Class Probability: the probability (reported or measured) assigned to the correct class for a given target.
Geometric, Arithmetic, or -2/3 Accuracy: the geometric, arithmetic, -2/3, or generalized mean of the correct-class probabilities (where probabilities lower than a threshold $\epsilon$ are set equal to $\epsilon$ to avoid zero values). These quantities are defined for both the reported probabilities and the measured probabilities.
3 Review of Nelson’s Past Work
Machine-learning-based classifiers are generally trained with the cross-entropy, also called the log loss. This cross-entropy represents the negative arithmetic mean of the log of the “probability,” $p_i$, assigned to the correct class (measured state) for each sample $i$. In general, most neural-network-based architectures define $p_i$ as the logit value assigned to each class, after normalization with a softmax function. Mathematically, then, the cross-entropy is defined as:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \ln p_i$$
Minimizing this cross-entropy is mathematically equivalent to maximizing the geometric mean of the probabilities, since exponentiating the negative cross-entropy yields the geometric mean. Nelson argues that this same quantity – the geometric mean of the measured-state probabilities (the classifier’s reported probabilities of the correct classes) – should be used so that the interpretation of performance is clear. We will call this quantity the geometric accuracy. Nelson justifies this choice with arguments based on both Bayesian statistics and information theory, and cites a large body of literature going back to 1879. This approach becomes particularly important as advances in generalized entropy are incorporated: a cumbersome body of proposed entropy functions all translate to use of the generalized mean of the probabilities as a spectrum of metrics.
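This equivalence can be checked numerically. The sketch below (with hypothetical probability values of our own choosing) confirms that exponentiating the negative cross-entropy recovers the geometric mean:

```python
import math

# Hypothetical correct-class probabilities for five predictions.
probs = [0.9, 0.8, 0.95, 0.7, 0.85]
n = len(probs)

# Cross-entropy: the negative mean log correct-class probability.
cross_entropy = -sum(math.log(p) for p in probs) / n

# Geometric mean of the same probabilities.
geometric_mean = math.prod(probs) ** (1.0 / n)

# exp(-cross_entropy) recovers the geometric mean.
assert abs(math.exp(-cross_entropy) - geometric_mean) < 1e-12
```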
The geometric mean, of course, is simply a special case ($r \to 0$) of the generalized mean, which is defined as:

$$M_r(p_1, \ldots, p_N) = \left(\frac{1}{N}\sum_{i=1}^{N} p_i^{\,r}\right)^{1/r}$$
Nelson’s metrics are therefore as follows:
Geometric Accuracy: the generalized mean of the correct-class probabilities with $r \to 0$ (the geometric mean). Mathematically, this represents the translation of the cross-entropy back to the probability domain, and so is a “neutral-biased” measure of the accuracy.
Decisiveness (Arithmetic Accuracy): the generalized mean of the probabilities with $r = 1$ (the arithmetic mean). Mathematically, this would be equal to the prediction accuracy if the classifier perfectly reported its uncertainty. This arithmetic mean is therefore “decisive-biased” in that it is closely tied to the decision performance of the algorithm. Decisiveness is relatively insensitive to reported probabilities near zero, and therefore provides more sensitivity to the high-confidence region.
Robustness (-2/3 Accuracy): the generalized mean of the probabilities with $r = -2/3$ (the -2/3 mean). Mathematically, the -2/3 mean is the complement of the arithmetic mean, due to the conjugate relationship between positive and negative generalizations of the log-score. This -2/3 mean is “robustness-biased” in that it is highly sensitive to reported probabilities near zero, and therefore provides more sensitivity to the low-confidence region. In particular, it assesses how well the classifier handles sources of severe error, such as events not included in the training data.
Together, decisiveness and robustness place bounds on the geometric accuracy.
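As an illustrative sketch, all three metrics can be computed with a single generalized-mean routine; the probabilities and the clipping threshold `eps` here are hypothetical choices of our own, not values from the paper:

```python
import math

def generalized_mean(probs, r, eps=1e-6):
    """Generalized mean M_r of probabilities, clipping values below eps.

    r = 1 gives the arithmetic mean (decisiveness), r -> 0 the geometric
    mean (geometric accuracy), and r = -2/3 the robustness metric.
    """
    clipped = [max(p, eps) for p in probs]
    n = len(clipped)
    if r == 0:  # the limit r -> 0 is the geometric mean
        return math.exp(sum(math.log(p) for p in clipped) / n)
    return (sum(p ** r for p in clipped) / n) ** (1.0 / r)

# Hypothetical correct-class probabilities.
probs = [0.9, 0.8, 0.95, 0.7, 0.85]
decisiveness = generalized_mean(probs, 1)       # arithmetic mean
geo_accuracy = generalized_mean(probs, 0)       # geometric mean
robustness   = generalized_mean(probs, -2 / 3)  # -2/3 mean

# The power-mean inequality places the geometric accuracy between the two.
assert robustness <= geo_accuracy <= decisiveness
```

The bounding behavior noted above follows directly from the power-mean inequality: the generalized mean is nondecreasing in $r$.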
These metrics can be calculated for both the reported and the measured probabilities (as defined in section 2). Indeed, Nelson proposes calculating both, and using their slope to determine underconfidence or overconfidence. In particular, the slope is defined by:

$$m = \frac{D_{\text{measured}} - R_{\text{measured}}}{D_{\text{reported}} - R_{\text{reported}}}$$
where $D$ and $R$ represent decisiveness and robustness, respectively. Slopes greater than unity then indicate underconfidence (as the measured confidence rises faster than the reported confidence), and slopes less than unity indicate overconfidence.
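As a concrete illustration, the slope can be computed from the decisiveness and robustness values in the first data column of Table 1 (reported / measured):

```python
# Decisiveness and robustness (reported, measured) from the first
# data column of Table 1.
decisiveness_reported, decisiveness_measured = 0.973, 0.962
robustness_reported, robustness_measured = 0.603, 0.679

# Slope of the line connecting robustness to decisiveness in
# reported-vs-measured space.
slope = (decisiveness_measured - robustness_measured) / (
    decisiveness_reported - robustness_reported
)

# A slope below unity indicates overconfidence.
print("overconfident" if slope < 1 else "underconfident")
```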
We note also that both the geometric mean and the -2/3 mean are highly sensitive to low values (and, indeed, a single zero value sets the entire metric to zero). This is partially by design (robustness in particular is designed to give sensitivity to the lower-probability tail), but we will avoid the extreme case of zero values by setting all probabilities below some threshold $\epsilon$ to $\epsilon$.
4 Metric Calculation Procedure and Clarifications
Calculating the reported metrics is simple: following Nelson’s prescription, we calculate the generalized mean of the reported probability for each correct decision, setting all reported probabilities below the threshold $\epsilon$ to $\epsilon$.
Calculating the truth metrics is more complicated. Intuitively, to calculate a decision’s “measured probability” we need to find several decisions with similar reported probabilities and calculate the fraction of correct decisions. We can then combine these measured probabilities with the generalized mean, and calculate the metrics as before.
We implement this using histograms, largely following Nelson’s original prescription. The prescription, however, specifies bins with “equal amounts of data” to ensure “adequate data for the analysis.” We must clarify this statement in three respects:
It is essential that the bins have (approximately) equal data. The overall metrics are an average of the metrics in each bin. If some bins have more data than others, the metrics will be distorted accordingly. Thus, even if evenly-spaced bins could provide adequate data, they would still not be an appropriate choice (unless each bin were weighted by its population). We allow for “approximately” equal data only to the extent that the number of data points may not be exactly divisible by the number of bins.
By “equal data bins,” we mean that each bin should have (approximately) the same number of correct-class probabilities. The incorrect-class probabilities have no role in defining the bins, as the metrics by definition correspond to the correct-class probabilities.
We adjust our binning to avoid singularities (very high numbers of decisions with equal correct-class probabilities). In particular, there is likely to be a singularity for correct-class probabilities equal to one. As needed, therefore, we exempt some bins from the equal-population requirement, but compensate by (a) limiting their width (for convenience, we assign overfull bins a fixed, small width) and (b) weighting overfull bins in proportion to their population when calculating the metrics.
The final procedure for calculating the truth metrics is therefore as follows:
Identify any singularities and assign each a bin of the fixed width described above.
Bin the reported correct-class probabilities into equal-population bins (modulo divisibility issues).
Count the number of correct-class predictions and incorrect-class predictions in each bin.
Calculate the fraction correct, $f_b$, for each bin.
Obtain the truth metrics by taking the generalized mean of these fractions, where we set any fractions below $\epsilon$ equal to $\epsilon$ and weight any overfull bins appropriately.
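The steps above can be sketched as follows. This is a simplified implementation with synthetic data; the function name is our own, and singularity handling and overfull-bin weighting are omitted for brevity:

```python
import numpy as np

def measured_metrics(correct_probs, is_correct, n_bins=10, eps=1e-6):
    """Simplified sketch of the truth-metric procedure.

    correct_probs: reported correct-class probability for each decision
    is_correct: whether each classification decision was actually correct
    """
    # Sort decisions by reported probability, then split into
    # (approximately) equal-population bins.
    order = np.argsort(correct_probs)
    bins = np.array_split(order, n_bins)

    # Fraction of correct decisions in each bin, floored at eps.
    fractions = np.array(
        [max(float(np.mean(is_correct[b])), eps) for b in bins]
    )

    def gen_mean(x, r):
        if r == 0:  # r -> 0 limit: geometric mean
            return float(np.exp(np.mean(np.log(x))))
        return float(np.mean(x ** r) ** (1.0 / r))

    return {
        "geometric_accuracy": gen_mean(fractions, 0),
        "decisiveness": gen_mean(fractions, 1),
        "robustness": gen_mean(fractions, -2 / 3),
    }

# Toy example: 1000 decisions, calibrated by construction.
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 1.0, size=1000)
correct = rng.uniform(size=1000) < p
print(measured_metrics(p, correct))
```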
5 Numerical Experiments and Results
We conduct numerical experiments by calculating these metrics for two convolutional neural network architectures and two datasets. Our neural networks are AlexNet and DenseNet, representing one of the earliest and one of the most recent neural architectures, respectively. For the datasets, we considered the German Traffic Sign Recognition Benchmark (GTSRB) and ImageNet. The GTSRB has approximately 39K training images over 43 classes, while ImageNet has approximately 1.3M training images over 1000 classes. Both datasets have been widely used to benchmark computer vision architectures. We used a single fixed value of $\epsilon$ for this work.
Table 1 shows the results. Each cell gives the reported / measured value of the metric for one network–dataset combination.

| Metric | | | | |
|---|---|---|---|---|
| Geometric Accuracy | 0.913 / 0.913 | 0.831 / 0.828 | 0.063 / 0.185 | 0.148 / 0.304 |
| Robustness | 0.603 / 0.679 | 0.437 / 0.505 | 0.054 / 0.054 | 0.085 / 0.087 |
| Decisiveness | 0.973 / 0.962 | 0.939 / 0.920 | 0.465 / 0.444 | 0.607 / 0.573 |
Figure 1 shows a visualization of our results. We interpret this figure as follows:
The blue histogram represents the measured prediction accuracy in each bin. Note the variable bin widths; since the classifiers are in general highly accurate, the bins are smallest on the right side of the graph (and the last bin includes all reported accuracies above 0.995).
The green line has slope 1. A classifier that perfectly reports its uncertainty would be well-aligned with the green line.
The red dots indicate the metrics as calculated with the reported probabilities (x-axis) and truth probabilities (y-axis).
The magenta line shows the slope of the line connecting the decisiveness with the robustness.
Since there is little difference between the reported and measured metrics, we conclude that the classifiers generally assess their probability correctly on the validation dataset. We find, however, that all four magenta lines in Figure 1 have slopes less than one; this implies that all four classifiers are slightly overconfident. We note also that the geometric accuracy and robustness are affected by probabilities near zero, and hence by the choice of $\epsilon$. Table 2 illustrates the effect of $\epsilon$ on the metrics for one of our classifiers.
6 Conclusion
We have applied the generalized metrics – geometric accuracy, decisiveness, and robustness – to real-world, deep-learning-based classifiers on large-scale datasets. We have also clarified the binning scheme to define a well-defined procedure by which these metrics can be calculated across different classifiers. In particular, we found it necessary to set a minimum bound $\epsilon$ on all probability values when calculating the metrics; the value of $\epsilon$ significantly affected the geometric accuracy and (especially) the robustness.
Acknowledgments
This work was supported by the Air Force Research Laboratory under contract number FA8750-17-C-0282.
References
- Coupled VAE: improved accuracy and robustness of a variational autoencoder.
- (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- (2016) Densely connected convolutional networks. CoRR abs/1608.06993.
- (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.
- (1879) XIII. The law of the geometric mean. Proceedings of the Royal Society of London 29 (196–199), pp. 367–376.
- (2020) In Advances in Info-Metrics, M. Chen, J. M. Dunn, A. Golan, and A. Ullah (Eds.).
- (2017) Assessing probabilistic inference by comparing the generalized mean of the model and source probabilities. Entropy 19, pp. 286.
- (2014) Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. arXiv e-prints.
- (2012) Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. Neural Networks.