1 Introduction
Prediction accuracy, quantified as the number of correct predictions divided by the total number of predictions, has emerged as the de facto measure of a classifier’s quality. Modern deep-learning-based classifiers, however, are well known for being both highly accurate and quite poor at predicting their own level of certainty. In particular, the “logit” values commonly used as a surrogate for probability have been shown to be vulnerable to a wide range of attacks that produce high-confidence but incorrect classification decisions [8]. In light of this, it is worth considering a more general conception of accuracy.
In recent work [7, 6], Nelson argues that the geometric mean of the probabilities assigned to the correct classes, rather than the prediction accuracy defined above, should be used as the key measure of a classifier’s quality. He then generalizes this concept to introduce two additional metrics, which he calls decisiveness (or decisive-biased accuracy) and robustness (or robust-biased accuracy), to provide further insight. Nelson shows that these metrics can be assessed using both the classifier’s reported probability and the classifier’s measured probability, giving insight into the model’s degree of under- or overconfidence.
In this work, we take up the challenge of applying these metrics to large-scale, deep-learning-based classifiers. By calculating the metrics for the classification decisions made by two convolutional neural networks applied to two real-world imagery datasets, we give insight into both the metrics and the models. We also propose some minor clarifications to the original metric definition, allowing us to standardize these metrics.
We organize the remainder of this paper as follows. In section 2, we define the terms we will use in the remainder of this discussion. We then review Nelson’s past work in more detail (section 3) and then provide the full prescription for calculating our metrics (section 4), highlighting clarifications to the original methodology. In section 5, we show the results of applying these metrics to our two datasets and two classifiers. We then conclude in section 6.
2 Nomenclature
To avoid ambiguity, we define the following terms:

Prediction Accuracy: the number of correct predictions divided by the number of total predictions.

Reported Probability: the probability that a classifier assigns to a given class for a given target. These may be the “logit” values from a neural network, or may be calculated using any other technique. This can also be called “model probability.”

Measured Probability: the actual probability that a given classification decision is correct, based on the classifier’s historical performance on items with similar reported probabilities. Note, this could be generalized to consider other types of historical performance (e.g., per reported class), but we do not consider this here. This can also be called “source probability.”

Correct-Class Probability: the probability (reported or measured) assigned to the correct class for a given target.

Geometric, Arithmetic, or −2/3 Accuracy: the geometric, arithmetic, −2/3, or generalized mean of the probabilities (where probabilities lower than a threshold $\epsilon$ are set equal to $\epsilon$ to avoid zero values). These quantities are defined for both the reported probabilities and the truth (measured) probabilities.
3 Review of Nelson’s Past Work
Machine-learning-based classifiers are generally trained with the binary cross-entropy, also called the log loss. This binary cross-entropy represents the negated arithmetic mean of the log of the “probability,” $p_i$, assigned to the correct class (measured state) of each sample $i = 1, \dots, N$. In general, most neural-network-based architectures define $p_i$ as the logit value assigned to each class, after normalization with a softmax function. Mathematically, then, the cross-entropy is defined as:
$$H = -\frac{1}{N}\sum_{i=1}^{N} \ln p_i \qquad (1)$$
Optimizing this cross-entropy is mathematically equivalent to optimizing the geometric mean of the probabilities. Nelson argues that this same quantity, the geometric mean of the measured-state probabilities (the classifier’s reported probabilities of the correct classes), should be used so that the interpretation of performance is clear. We will call this quantity the geometric accuracy. Nelson justifies this choice with arguments based on both Bayesian statistics and information theory, and cites a large body of literature going back to 1879 [5]. This approach becomes particularly important as advances in generalized entropy are incorporated. In this case, a cumbersome body of proposed functions all translate to use of the generalized mean of the probabilities as a spectrum of metrics.
The geometric mean, of course, is simply a special case ($r \to 0$) of the generalized mean, which is defined as:
$$M_r = \left(\frac{1}{N}\sum_{i=1}^{N} p_i^{\,r}\right)^{1/r} \qquad (2)$$
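The equivalence between the cross-entropy and the geometric mean can be written out in one line. The notation here is assumed for illustration: $p_i$ is the probability assigned to the correct class of sample $i$, $N$ is the number of samples, $H = -\frac{1}{N}\sum_i \ln p_i$ is the cross-entropy, and $r$ is the exponent of the generalized mean.

```latex
\exp(-H)
  = \exp\!\left(\frac{1}{N}\sum_{i=1}^{N}\ln p_i\right)
  = \left(\prod_{i=1}^{N} p_i\right)^{1/N},
\qquad
\lim_{r \to 0}\left(\frac{1}{N}\sum_{i=1}^{N} p_i^{\,r}\right)^{1/r}
  = \left(\prod_{i=1}^{N} p_i\right)^{1/N}.
```

Because $\exp(-H)$ is a monotonically decreasing function of $H$, minimizing the cross-entropy and maximizing the geometric mean select the same classifier.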
Nelson’s metrics are therefore as follows:

Geometric Accuracy: the generalized mean of the correct-class probabilities with exponent $r = 0$ (geometric mean). Mathematically, this represents the translation of the cross-entropy back to the probability domain, and so is a “neutral-biased” measure of the accuracy.

Decisiveness (Arithmetic Accuracy): the generalized mean of the probabilities with exponent $r = 1$ (arithmetic mean). Mathematically, this would be equal to the prediction accuracy if the classifier perfectly reported its uncertainty. This arithmetic mean is therefore “decisive-biased” in that it is closely tied to the decision performance of the algorithm. Decisiveness is relatively insensitive to reported probabilities near zero, and therefore provides more sensitivity to the high-confidence region.

Robustness (−2/3 Accuracy): the generalized mean of the probabilities with exponent $r = -2/3$ (−2/3 mean). Mathematically, the −2/3 mean is the complement to the arithmetic mean due to the conjugate relationship between positive and negative generalizations of the log score [6]. This −2/3 mean is “robustness-biased” in that it is highly sensitive to reported probabilities near zero, and therefore provides more sensitivity to the low-confidence region. In particular, it assesses how well the classifier handles sources of severe error, such as events not included in the training.
Together, decisiveness and robustness place bounds on the geometric accuracy.
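As a minimal sketch of the three metrics, the snippet below evaluates the generalized mean at exponents 1, 0, and −2/3 on a handful of made-up correct-class probabilities. The function name `generalized_mean`, the sample values, and the default threshold `eps` are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def generalized_mean(p, r, eps=1e-3):
    """Generalized (power) mean of probabilities p with exponent r.

    Values below eps are raised to eps first; eps here is an
    illustrative default, not a value prescribed by the paper.
    """
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    if r == 0:
        # Limit r -> 0 is the geometric mean, computed in log space.
        return float(np.exp(np.mean(np.log(p))))
    return float(np.mean(p ** r) ** (1.0 / r))

# Hypothetical reported correct-class probabilities for six decisions.
p_correct = [0.99, 0.95, 0.90, 0.60, 0.30, 0.02]

decisiveness = generalized_mean(p_correct, r=1.0)         # arithmetic mean
geometric_accuracy = generalized_mean(p_correct, r=0.0)   # geometric mean
robustness = generalized_mean(p_correct, r=-2.0 / 3.0)    # -2/3 mean

print(f"robustness={robustness:.3f}  "
      f"geometric={geometric_accuracy:.3f}  "
      f"decisiveness={decisiveness:.3f}")
```

Since the generalized mean is non-decreasing in its exponent, the output always satisfies robustness ≤ geometric accuracy ≤ decisiveness, which is the bounding relationship stated above.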
These metrics can be calculated for both the reported and the measured probabilities (as defined in section 2). Indeed, Nelson proposes calculating both, and using the slope between them to determine underconfidence or overconfidence. In particular, the slope is defined by:
$$m = \frac{D_{\mathrm{measured}} - R_{\mathrm{measured}}}{D_{\mathrm{reported}} - R_{\mathrm{reported}}} \qquad (3)$$
where $D$ and $R$ represent decisiveness and robustness, respectively. Then, slopes greater than unity indicate underconfidence (as the measured confidence rises faster than the reported confidence), and slopes less than unity indicate overconfidence.
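The slope test can be sketched numerically. The snippet below takes the slope to be the change in the measured metric divided by the change in the reported metric between the robustness and decisiveness points, and applies it to the values in Table 1. It assumes each Table 1 cell is ordered "reported / measured"; that ordering is not stated in the table itself, although it is the one consistent with the overconfidence finding in section 5. The function and variable names are hypothetical.

```python
def confidence_slope(dec_reported, rob_reported, dec_measured, rob_measured):
    """Slope of the line from the robustness point to the decisiveness point,
    with reported metrics on the x-axis and measured metrics on the y-axis."""
    return (dec_measured - rob_measured) / (dec_reported - rob_reported)

# Table 1 values, read as (reported, measured); this ordering is an assumption.
table1 = {
    "AlexNet-GTSRB":     {"rob": (0.603, 0.679), "dec": (0.973, 0.962)},
    "DenseNet-GTSRB":    {"rob": (0.437, 0.505), "dec": (0.939, 0.920)},
    "AlexNet-ImageNet":  {"rob": (0.054, 0.054), "dec": (0.465, 0.444)},
    "DenseNet-ImageNet": {"rob": (0.085, 0.087), "dec": (0.607, 0.573)},
}

slopes = {
    name: confidence_slope(v["dec"][0], v["rob"][0], v["dec"][1], v["rob"][1])
    for name, v in table1.items()
}

for name, m in slopes.items():
    verdict = "overconfident" if m < 1 else "underconfident"
    print(f"{name}: slope = {m:.3f} ({verdict})")
```

Under that reading, all four slopes come out below one (roughly 0.76, 0.83, 0.95, and 0.93).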
We note also that both the geometric mean and the −2/3 mean are highly sensitive to low values (and, indeed, a single zero value sets the entire metric to zero). This is partially by design (robustness in particular is designed to give sensitivity to the lower-probability tail), but we will avoid the extreme case of zero values by setting all probabilities below some threshold $\epsilon$ to $\epsilon$.
4 Metric Calculation Procedure and Clarifications
Calculating the reported metrics is simple: following Nelson’s prescription, we calculate the generalized mean of the reported correct-class probability for each decision, setting all reported probabilities below some threshold $\epsilon$ to $\epsilon$.
Calculating the truth metrics is more complicated. Intuitively, to calculate a decision’s “measured probability” we need to find several decisions with similar reported probabilities and calculate the fraction of correct decisions. We can then combine these measured probabilities with the generalized mean, and calculate the metrics as before.
We implement this using histograms, largely following Nelson’s original prescription. The prescription, however, specifies bins with “equal amounts of data” to ensure “adequate data for the analysis.” We must clarify this statement in three respects:

It is essential that the bins have (approximately) equal data. The overall metrics are an average of the metrics in each bin. If some bins have more data than others, the metrics will be distorted accordingly. Thus, even if evenly spaced bins could provide adequate data, they would still not be an appropriate choice (unless each bin were weighted by its population). We allow for “approximately” equal data only to the extent that the number of data points may not be exactly divisible by the number of bins.

By “equal data bins,” we mean that each bin should have (approximately) the same number of correct-class probabilities. The incorrect-class probabilities have no role in defining the bins, as the metrics by definition correspond to the correct-class probabilities.

We adjust our binning to avoid singularities (very high numbers of decisions with equal correct-class probabilities). In particular, there is likely to be a singularity for correct-class probabilities equal to one. As needed, therefore, we exempt some bins from the equal-population requirement, but compensate by (a) limiting their width (for convenience, we assign overfull bins a width of ) and (b) weighting overfull bins in proportion to their population when calculating the metrics.
The final procedure for calculating the truth metrics is therefore as follows:

Identify any singularities and assign such singularities a bin of width .

Bin the reported correct-class probabilities into equal-population bins (modulo divisibility issues).

Count the number of correct-class predictions and incorrect-class predictions in each bin.

Calculate the fraction correct for each bin.

Obtain the truth metrics by taking the generalized mean of these fractions, where we set any fractions below a threshold $\epsilon$ equal to $\epsilon$ and weight any overfull bins appropriately.
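The procedure above can be sketched as follows. This is one possible reading under stated assumptions, not the authors’ implementation: the function name `measured_metrics`, the number of bins, the threshold `eps`, and the `singular_cutoff` of 0.995 (borrowed from the description of the last bin in Figure 1) are all illustrative. Every bin is weighted by its population, which gives equal-population bins equal weight and weights the overfull bin in proportion to its population.

```python
import numpy as np

def measured_metrics(p_correct, is_correct, n_bins=10, eps=1e-2,
                     singular_cutoff=0.995):
    """Measured (truth) metrics from equal-population bins.

    p_correct  : reported correct-class probability for each decision
    is_correct : True where the classifier's decision was correct
    All keyword defaults are illustrative assumptions.
    """
    p_correct = np.asarray(p_correct, dtype=float)
    is_correct = np.asarray(is_correct, dtype=bool)

    # Step 1: give the pile-up of probabilities near one its own narrow bin.
    singular = p_correct >= singular_cutoff
    bins = []
    if singular.any():
        bins.append(is_correct[singular])

    # Step 2: sort the remaining decisions by reported probability and
    # split them into (nearly) equal-population bins.
    order = np.argsort(p_correct[~singular])
    rest = is_correct[~singular][order]
    bins.extend(b for b in np.array_split(rest, n_bins) if b.size > 0)
    if not bins:
        raise ValueError("no decisions to bin")

    # Steps 3-4: population and fraction correct in each bin.
    counts = np.array([b.size for b in bins], dtype=float)
    fractions = np.clip(np.array([b.mean() for b in bins]), eps, 1.0)
    weights = counts / counts.sum()

    # Step 5: population-weighted generalized mean of the bin fractions.
    def weighted_mean(r):
        if r == 0:
            return float(np.exp(np.sum(weights * np.log(fractions))))
        return float(np.sum(weights * fractions ** r) ** (1.0 / r))

    return {
        "decisiveness": weighted_mean(1.0),
        "geometric_accuracy": weighted_mean(0.0),
        "robustness": weighted_mean(-2.0 / 3.0),
    }

# Synthetic data: a decision is correct with probability equal to its
# reported correct-class probability (hypothetical, for illustration only).
rng = np.random.default_rng(0)
p = rng.uniform(0.0, 1.0, size=5000)
correct = rng.uniform(size=5000) < p
m = measured_metrics(p, correct)
print(m)
```

With population weights and no bin fraction falling below `eps`, the measured decisiveness reduces to the overall fraction of correct decisions, that is, the prediction accuracy.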
5 Numerical Experiments and Results
We conduct numerical experiments by calculating these metrics for two convolutional neural network architectures and two datasets. Our neural networks are AlexNet [4] and DenseNet [3], representing one of the earliest and most recent neural architectures, respectively. For the datasets, we considered the German Traffic Sign Recognition Benchmark (GTSRB) [9] and ImageNet [2]. The GTSRB has approximately 39K training images over 43 classes, while ImageNet has approximately 1.3M training images over 1000 classes. Both datasets have been widely used to benchmark computer vision architectures. We used for this work.
Table 1 shows the results.
Metric               AlexNet-GTSRB   DenseNet-GTSRB   AlexNet-ImageNet   DenseNet-ImageNet
Prediction Accuracy  0.974           0.948            0.568              0.687
Geometric Accuracy   0.913 / 0.913   0.831 / 0.828    0.063 / 0.185      0.148 / 0.304
Robustness           0.603 / 0.679   0.437 / 0.505    0.054 / 0.054      0.085 / 0.087
Decisiveness         0.973 / 0.962   0.939 / 0.920    0.465 / 0.444      0.607 / 0.573
Figure 1 shows a visualization of our results. We interpret this figure as follows:

The blue histogram represents the measured prediction accuracy in each bin. Note the variable bin widths; since the classifiers are in general highly accurate, the bins are narrowest on the right side of the graph (and the last bin includes all reported probabilities above 0.995).

The green line has slope 1. A classifier that perfectly reports its uncertainty would be well aligned with the green line.

The red dots indicate the metrics as calculated with the reported probabilities (x-axis) and truth probabilities (y-axis).

The magenta line shows the slope of the line connecting the decisiveness with the robustness.
Since there is little difference between the reported and measured metrics, we conclude that the classifiers generally assess their probability correctly on the validation dataset. We find, however, that all four magenta lines in Figure 1 have slopes less than one; this implies that all four classifiers are slightly overconfident. We note also that the geometric accuracy and robustness are affected by probabilities near the threshold $\epsilon$. Table 2 illustrates the effect of $\epsilon$ on the metrics for one of our classifiers.
Threshold $\epsilon$  Robustness  Geometric Accuracy  Decisiveness
.05                   .708        .863                .938
.01                   .509        .832                .938
.005                  .425        .823                .938
.001                  .255        .808                .938
0                     .019        .790                .938
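The qualitative trend in Table 2 can be reproduced on synthetic data. The sketch below uses made-up correct-class probabilities (mostly confident, with a few near zero) and the same threshold values as the table. The data and the function name `generalized_mean` are illustrative assumptions, and the resulting numbers are not expected to match Table 2.

```python
import numpy as np

def generalized_mean(p, r, eps):
    """Generalized mean with exponent r, after raising values below eps to eps."""
    p = np.maximum(np.asarray(p, dtype=float), eps)
    if r == 0:
        # Limit r -> 0 is the geometric mean, computed in log space.
        return float(np.exp(np.mean(np.log(p))))
    return float(np.mean(p ** r) ** (1.0 / r))

# Hypothetical correct-class probabilities: 90 confident decisions,
# 8 uncertain ones, and 2 that are almost certainly wrong.
p_correct = np.array([0.99] * 90 + [0.5] * 8 + [1e-6] * 2)

thresholds = [0.05, 0.01, 0.005, 0.001, 0.0]
rows = [
    (
        eps,
        generalized_mean(p_correct, -2.0 / 3.0, eps),  # robustness
        generalized_mean(p_correct, 0.0, eps),         # geometric accuracy
        generalized_mean(p_correct, 1.0, eps),         # decisiveness
    )
    for eps in thresholds
]

print("eps     robustness  geometric  decisiveness")
for eps, rob, geo, dec in rows:
    print(f"{eps:<7} {rob:<11.3f} {geo:<10.3f} {dec:.3f}")
```

As in Table 2, lowering the threshold sharply reduces robustness, reduces the geometric accuracy more mildly, and leaves decisiveness essentially unchanged, because only the arithmetic mean is insensitive to a few values near zero.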
6 Conclusion
We have applied the generalized metrics (geometric accuracy, decisiveness, and robustness) to real-world, deep-learning-based classifiers on large-scale datasets. We have also clarified the binning scheme to establish a well-defined procedure by which these metrics can be calculated across different classifiers. In particular, we found it necessary to set a minimum bound ($\epsilon$) on all values when calculating the metrics; the value of $\epsilon$ significantly affected geometric accuracy and (especially) robustness.
We found that the source and model metrics were remarkably consistent. Integrating decisiveness and robustness into the classifier’s loss function is a promising future direction, and has indeed already been partially considered [1].
Acknowledgment
This work was supported by the Air Force Research Laboratory under contract number FA875017C0282.
References

[1] (2019-06) Coupled VAE: improved accuracy and robustness of a variational autoencoder. arXiv:1906.00536.

[2] (2009-06) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. ISSN 1063-6919.

[3] (2016) Densely connected convolutional networks. CoRR abs/1608.06993.

[4] ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 2012.

[5] (1879) XIII. The law of the geometric mean. Proceedings of the Royal Society of London 29 (196–199), pp. 367–376. https://royalsocietypublishing.org/doi/pdf/10.1098/rspl.1879.0061

[6] (2020) In Advances in Info-Metrics, M. Chen, J. M. Dunn, A. Golan, and A. Ullah (Eds.).

[7] (2017-06) Assessing probabilistic inference by comparing the generalized mean of the model and source probabilities. Entropy 19, pp. 286.

[8] (2014-12) Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images. arXiv e-prints. arXiv:1412.1897.

[9] (2012) Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. Neural Networks. ISSN 0893-6080.