In real-world machine learning systems, the predictive performance of a model is often evaluated on several test datasets rather than one, and the results are compared. These datasets can correspond to sub-populations in the data, or to different periods in time [18, 15]. Choosing the best-suited metrics for these comparisons is not a trivial task. Some metrics may prevent a proper interpretation of the performance differences between the sets [8, 14], especially because the sets generally differ not only in their class prior $P(y)$ but also in their likelihood $P(x \mid y)$. For instance, a metric that depends on the prior (e.g. precision, which is naturally higher if positive examples are globally more frequent) will be affected by both differences indiscernibly, whereas a practitioner could be interested in isolating the variation of performance due to the likelihood, which reflects the model’s intrinsic performance. Take the example of comparing the performance of a model across time periods. At time $t$, we receive data drawn from
$$P_t(x, y) = P_t(x \mid y)\, P_t(y)$$
where $x$ are the features and $y$ the label. Hence the optimal scoring function (i.e. model) for this dataset is the likelihood ratio [11]:
$$s_t(x) := \frac{P_t(x \mid y = 1)}{P_t(x \mid y = 0)}$$
In particular, if $P_t(x \mid y)$ does not vary with time, neither will $s_t$. In this case, even if the prior $P_t(y)$ varies, it is desirable to have a performance metric $M(\cdot)$ satisfying $M(s_t, P_t) = M(s_{t+1}, P_{t+1})$ for all $t$, so that the model maintains the same metric value over time.
In binary classification, researchers often rely on the AUC-ROC (Area Under the Curve of Receiver Operating Characteristic) to measure a classifier’s performance [9, 6]. While the AUC-ROC has the advantage of being invariant to the class prior, many real-world applications, especially when data are imbalanced, favor precision-based metrics. The reason is that ROC measures false positives as a proportion of the negative examples, which are in large majority, and therefore gives them too little importance. Yet false positives deteriorate user experience and waste human effort with false alerts. Thus, ROC can give over-optimistic scores to a classifier that performs poorly in terms of precision, which is particularly inconvenient when the class of interest is in the minority. Metrics directly based on recall and precision, which give more weight to false positives, such as AUC-PR or F1-score, have been shown to be much more relevant and have gained in popularity in recent years. That being said, these metrics are strongly tied to the class prior [2, 3]. A new definition of precision and recall as precision gain and recall gain has recently been proposed to correct several drawbacks of AUC-PR [7]. However, while the resulting AUC-PR Gain has some of the advantages of the AUC-ROC, such as the validity of linear interpolation between points, it remains dependent on the class prior.
This study aims at providing (i) a precision-based metric to cope with problems where the class of interest is in a slim minority and (ii) a metric independent of the prior to monitor performance over time. The motivation is illustrated by a fraud detection use-case. Figure 1 shows a situation where, over a given period of time, both the detection system’s performance and the fraud ratio (i.e. the empirical class prior) decrease. As the metric used, AUC-PR, depends on the prior, it cannot tell whether the performance variation is only due to the fraud ratio or whether other factors are involved (e.g. a drift in the likelihood).
To cope with the limitations of the widely used existing precision-based metrics, we introduce a calibrated version of them which excludes the impact of the class ratio. More precisely, our contributions are:
- A formulation, in section 3.1, of calibration for precision-based metrics. It estimates the value that the metrics would have if the class ratio in the test set was equal to a reference class ratio $\pi_0$ fixed by the user. We give a theoretical argument to show that it allows invariance to the class prior. We also provide a calibrated version of precision gain and recall gain.
- An empirical analysis on both synthetic and real-world data, in section 3.2, to confirm our claims.
- A further analysis, in section 3.3, showing that calibrated metrics are able to assess the model’s performance and are easier to interpret.
We emphasize that calibration not only solves the issue of dependence on the prior but also, through the reference ratio parameter $\pi_0$, lets the user project the evaluation onto a different configuration (a different class ratio) and control precisely what the metric reflects. These new properties have several practical benefits (e.g. for development, reporting, analysis), which we discuss through realistic use-cases in section 5.
2 Popular metrics for binary classification: advantages and limits
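We consider a usual binary classification setting where a model has been trained and its performance is evaluated on a test dataset of $N$ instances. $y_i$ is the ground-truth label of the $i$-th instance and is equal to $1$ (resp. $0$) if the instance belongs to the positive (resp. negative) class. The model provides $s_i$, a score for the $i$-th instance to belong to the positive class. For a given threshold $\tau$, the predicted label is $\hat{y}_i = 1$ if $s_i > \tau$ and $\hat{y}_i = 0$ otherwise.
The number of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) are defined as follows:
$$TP = \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = 1, y_i = 1), \qquad TN = \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = 0, y_i = 0),$$
$$FP = \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = 1, y_i = 0), \qquad FN = \sum_{i=1}^{N} \mathbb{1}(\hat{y}_i = 0, y_i = 1)$$
where $\mathbb{1}(\cdot)$ yields 1 when its argument is true and 0 otherwise. Based on these statistics, one can compute relevant ratios such as the True Positive Rate (TPR), also referred to as the Recall ($Rec$), the False Positive Rate (FPR), also referred to as the Fall-out, and the Precision ($Prec$):
$$TPR = Rec = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}, \qquad Prec = \frac{TP}{TP + FP}$$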
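As a minimal sketch of these definitions, the counts and ratios can be computed as below. The function names and the toy data are hypothetical, and the decision rule "positive if the score is strictly above the threshold" is an assumption made for the example.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, TN, FN; an instance is predicted positive if score > threshold."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s > threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return (recall, false positive rate, precision); assumes non-zero denominators."""
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    return recall, fpr, precision


# Toy data (hypothetical): 3 positives and 5 negatives.
y_true = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.2, 0.1]

tp, fp, tn, fn = confusion_counts(y_true, scores, threshold=0.5)
recall, fpr, precision = rates(tp, fp, tn, fn)
print(tp, fp, tn, fn, recall, fpr, precision)
```

Moving the threshold trades one type of error for the other, which is why the threshold-free area-under-curve metrics discussed next are often preferred.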
As these ratios are biased towards a specific type of error and can easily be manipulated with the threshold, more complex metrics have been proposed. In this paper, we discuss the most popular ones which have been widely adopted in binary classification: F1-Score, AUC-ROC, AUC-PR and AUC-PR Gain.
F1-Score is the harmonic mean of $Prec$ and $Rec$:
$$F_1 = \frac{2 \cdot Prec \cdot Rec}{Prec + Rec} \tag{3}$$
The three other metrics consider every threshold from the highest to the lowest. For each one, they compute TP, FP, TN and FN. Then, they plot one ratio against another and compute the Area Under the Curve (Figure 2). AUC-ROC considers the Receiver Operating Characteristic curve, where TPR is plotted against FPR. AUC-PR considers the Precision vs Recall curve. Finally, in AUC-PR Gain, the precision gain ($precG$) is plotted against the recall gain ($recG$). They are defined in [7] as follows:
$$precG = \frac{Prec - \pi}{(1 - \pi)\, Prec} = 1 - \frac{\pi}{1 - \pi} \frac{FP}{TP} \tag{4}$$
$$recG = \frac{Rec - \pi}{(1 - \pi)\, Rec} = 1 - \frac{\pi}{1 - \pi} \frac{FN}{TP} \tag{5}$$
where $\pi$ is the proportion of positive examples, or positive class ratio. (In this paper, we always consider that the positive class is the minority class.) This allows PR Gain to enjoy many interesting properties of the ROC that the original PR analysis does not, such as the validity of linear interpolations, the existence of universal baselines or the interpretability of the area under its curve [7]. However, AUC-PR Gain can become limited in an extremely imbalanced setting. In particular, we can derive from (4) and (5) that both $precG$ and $recG$ will often be close to 1 if $\pi$ is close to 0. This is illustrated in Figure 2 for a very small $\pi$.
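The saturation of the gains under extreme imbalance can be checked numerically. The sketch below uses the usual precision-gain and recall-gain definitions, with `pi` standing for the positive class ratio; the precision, recall and ratio values are hypothetical and chosen only to illustrate the effect.

```python
def precision_gain(precision, pi):
    """Precision gain for a positive class ratio pi."""
    return (precision - pi) / ((1 - pi) * precision)


def recall_gain(recall, pi):
    """Recall gain for a positive class ratio pi."""
    return (recall - pi) / ((1 - pi) * recall)


# A mediocre operating point (hypothetical): precision 0.1, recall 0.5.
for pi in (0.05, 0.01, 0.001):
    print(pi, precision_gain(0.1, pi), recall_gain(0.5, pi))
```

With a precision of only 0.1, the precision gain is already above 0.99 once the positive ratio is 0.001, so differences between models get compressed near 1.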
Each metric has its own raison d’être. For instance, we visually observe that only AUC-ROC is invariant to the positive class ratio. Indeed, FPR and $Rec$ are both unrelated to the class ratio because they each focus on a single class, but this is not the case for $Prec$: its dependency on the positive class ratio is illustrated in Figure 3. Instances, represented as circles (resp. squares) if they belong to the positive (resp. negative) class, are ordered from left to right according to their score given by the model. A threshold is illustrated as a vertical line between the instances: those on the left (resp. right) are classified as positive (resp. negative). When comparing a case (i) with a given ratio and another case (ii) where a randomly selected half of the positive examples have been removed, one can visually understand that both recall and false positive rate are the same but the precision is lower in the second case.
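The comparison between the two cases can also be sketched with numbers. The counts below are hypothetical; case (ii) keeps exactly half of the true positives and half of the false negatives, which is what removing a random half of the positive examples gives on average.

```python
def recall(tp, fn):
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    return fp / (fp + tn)


def precision(tp, fp):
    return tp / (tp + fp)


# Case (i): hypothetical counts at a fixed threshold.
case_i = {"tp": 40, "fn": 10, "fp": 20, "tn": 180}
# Case (ii): half of the positive examples removed, negatives untouched.
case_ii = {"tp": 20, "fn": 5, "fp": 20, "tn": 180}

for name, c in (("(i)", case_i), ("(ii)", case_ii)):
    print(
        name,
        recall(c["tp"], c["fn"]),
        false_positive_rate(c["fp"], c["tn"]),
        precision(c["tp"], c["fp"]),
    )
```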
These different properties can become strengths or limitations depending on the context. As stated in the introduction, we will consider a motivating scenario where (i) data are imbalanced and the minority (positive) class is the one of interest and (ii) we monitor performance across time. In that case, we need a metric that considers precision rather than FPR and that is invariant to the prior. By definition, AUC-ROC does not involve precision whereas the other metrics do. But it is the only one invariant to the positive class ratio.
3 Calibrated Metrics
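To obtain a metric that satisfies both properties from the last section, we will modify those based on $Prec$ (AUC-PR, F1-Score and AUC-PR Gain) to make them independent of the positive class ratio $\pi$.
The idea is to fix a reference ratio $\pi_0$ and to weigh the count of TP or FP in order to calibrate them to the value that they would have if $\pi$ was equal to $\pi_0$. $\pi_0$ can be chosen arbitrarily (e.g. $0.5$ for balanced) but it is preferable to fix it according to the task at hand. We analyze the impact of $\pi_0$ in section 4 and describe simple guidelines to fix it in section 5.
If the positive class ratio is $\pi_0$ instead of $\pi$, the ratio between negative examples and positive examples is multiplied by $\frac{\pi(1-\pi_0)}{\pi_0(1-\pi)}$. In this case, we expect the ratio between false positives and true positives to be multiplied by the same factor. Therefore, we define the calibrated precision $Prec_c$ as follows:
$$Prec_c = \frac{TP}{TP + \frac{\pi(1-\pi_0)}{\pi_0(1-\pi)}\, FP} = \frac{1}{1 + \frac{\pi(1-\pi_0)}{\pi_0(1-\pi)} \frac{FP}{TP}}$$
Since $\frac{1-\pi}{\pi} = \frac{N_-}{N_+}$ is the imbalance ratio, where $N_+$ (resp. $N_-$) is the number of positive (resp. negative) examples, we have $\frac{\pi}{1-\pi} \frac{FP}{TP} = \frac{FP / N_-}{TP / N_+} = \frac{FPR}{TPR}$, which is independent of $\pi$. Consequently, $Prec_c = \left(1 + \frac{1-\pi_0}{\pi_0} \frac{FPR}{TPR}\right)^{-1}$ depends on the test set only through FPR and TPR.
Based on the calibrated precision, we can also define the calibrated F1-score, the calibrated $precG$ and the calibrated $recG$ by replacing $Prec$ by $Prec_c$ and $\pi$ by $\pi_0$ in equations (3), (4) and (5). Note that calibration does not change precision gain. Indeed, the calibrated precision gain $\frac{Prec_c - \pi_0}{(1 - \pi_0)\, Prec_c}$ can be rewritten as $1 - \frac{\pi_0}{1-\pi_0} \cdot \frac{\pi(1-\pi_0)}{\pi_0(1-\pi)} \frac{FP}{TP} = 1 - \frac{\pi}{1-\pi} \frac{FP}{TP}$, which is equal to the regular precision gain. Also, the interesting properties of the recall gain were proved independently of the ratio $\pi$ in [7], which means calibration preserves them.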
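A minimal sketch of the calibrated precision follows, where `pi` is the positive class ratio of the test set and `pi0` the reference ratio. The two test sets are hypothetical: they share the same TPR and FPR but have different priors, so the regular precision differs while the calibrated one should not. The last function checks that calibrating the precision gain leaves it unchanged.

```python
def calibrated_precision(tp, fp, pi, pi0):
    """Precision with FP re-weighted as if the positive ratio were pi0 instead of pi."""
    weight = (pi * (1 - pi0)) / (pi0 * (1 - pi))
    return tp / (tp + weight * fp)


def calibrated_precision_from_rates(tpr, fpr, pi0):
    """Same quantity written with TPR and FPR only, so pi no longer appears."""
    return 1.0 / (1.0 + ((1 - pi0) / pi0) * (fpr / tpr))


def precision_gain(precision, ratio):
    return (precision - ratio) / ((1 - ratio) * precision)


PI0 = 0.05  # hypothetical reference ratio

# Two hypothetical test sets with TPR = 0.8 and FPR = 0.1 but different priors.
set_a = {"tp": 80, "fn": 20, "fp": 90, "tn": 810}    # pi = 0.1
set_b = {"tp": 80, "fn": 20, "fp": 990, "tn": 8910}  # pi = 0.01

results = {}
for name, c in (("a", set_a), ("b", set_b)):
    n_pos = c["tp"] + c["fn"]
    n_neg = c["fp"] + c["tn"]
    pi = n_pos / (n_pos + n_neg)
    prec = c["tp"] / (c["tp"] + c["fp"])
    prec_c = calibrated_precision(c["tp"], c["fp"], pi, PI0)
    results[name] = {"pi": pi, "prec": prec, "prec_c": prec_c}
    print(name, round(prec, 4), round(prec_c, 4))
```

The regular precision drops sharply from set `a` to set `b`, whereas the calibrated value is identical on both and matches the rate-based expression.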
3.2 Robustness to variations in $\pi$
In order to evaluate the robustness of the new metrics to variations in $\pi$, we create a synthetic dataset where the label is drawn from a Bernoulli distribution with parameter $\pi$ and the feature is drawn from a Normal distribution that depends on the class.
Figure 4 presents the results averaged over 30 runs with their confidence interval. We observe that the class prior has a strong impact on the regular metrics. This can be a serious issue for applications where $\pi$ sometimes varies by one order of magnitude from one day to another (see [4] for a real-world example), as it leads to a significant variation of the measured performance (see how AUC-PR changes with $\pi$) even if the model remains the same. In contrast, the calibrated versions remain very robust to changes in the class prior, even for extreme values. Note that we experiment here with synthetic data in order to have strong control over the distribution and the prior. In appendix A, we run a similar experiment with real-world data where we artificially change $\pi$ in the test set with undersampling. The conclusions remain the same.
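The behavior described above can be reproduced in a simplified, population-level form without sampling. In the sketch below, the class means, the unit variance, the threshold and the reference ratio are all assumptions chosen for illustration and are not the settings of the experiment; TPR and FPR are obtained from the Gaussian tail probabilities.

```python
import math


def normal_sf(x, mean):
    """P(X > x) for X ~ Normal(mean, 1)."""
    return 0.5 * math.erfc((x - mean) / math.sqrt(2))


# Hypothetical setting: unit-variance Gaussians and a fixed decision threshold.
MU_NEG, MU_POS, THRESHOLD = 0.0, 2.0, 1.0
PI0 = 0.5  # hypothetical reference ratio

tpr = normal_sf(THRESHOLD, MU_POS)
fpr = normal_sf(THRESHOLD, MU_NEG)


def population_precision(pi):
    """Precision when a fraction pi of the population is positive."""
    return pi * tpr / (pi * tpr + (1 - pi) * fpr)


def population_calibrated_precision(pi, pi0=PI0):
    """Same ratio with the false positive mass re-weighted to the reference ratio pi0."""
    weight = (pi * (1 - pi0)) / (pi0 * (1 - pi))
    return pi * tpr / (pi * tpr + weight * (1 - pi) * fpr)


PIS = (0.5, 0.1, 0.01, 0.001)
for pi in PIS:
    print(pi, population_precision(pi), population_calibrated_precision(pi))
```

The regular precision collapses as the positive ratio shrinks, while the calibrated precision stays at the value the regular one takes when the ratio equals the reference.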
Let us remark that for test datasets in which $\pi = \pi_0$, the calibrated precision $Prec_c$ is equal to the regular precision, since $\frac{\pi(1-\pi_0)}{\pi_0(1-\pi)} = 1$ (the intersection of the metrics in Figure 4 when $\pi = \pi_0$ reflects that). The new metrics essentially have the value that the original ones would have if the positive class ratio $\pi$ was equal to $\pi_0$. This is analyzed in depth in appendix B, where we compare the proposed formula for calibration with the most promising proposal from the past [10]: a heuristic-based calibration. Their approach consists in randomly undersampling the test set so that $\pi = \pi_0$ and then computing the regular metrics. Because of the randomness, sampling can remove more hard examples than easy examples, so the performance can be over-estimated, and vice versa. To avoid that, they perform several runs and compute the mean performance. The appendix shows and explains why our formula directly computes a value equal to their estimation. Therefore, while similar in spirit, our proposal can be seen as an improvement: it provides a closed-form solution, is deterministic, and is less computationally expensive than their Monte-Carlo approximation.
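The agreement between the closed-form calibration and the undersampling heuristic can be sketched on a toy example. Everything below is hypothetical (counts, reference ratio, number of runs, seed); the heuristic is simulated at a single threshold by repeatedly sampling positive examples so that the positive ratio reaches the reference ratio `pi0`.

```python
import random


def calibrated_precision(tp, fp, pi, pi0):
    weight = (pi * (1 - pi0)) / (pi0 * (1 - pi))
    return tp / (tp + weight * fp)


def heuristic_precision(positive_predictions, fp, n_pos_keep, n_runs, rng):
    """Mean precision over random undersamplings of the positive class.

    positive_predictions holds the predicted label (1 or 0) of every positive example.
    All negatives are kept, so the number of false positives fp never changes.
    """
    total = 0.0
    for _ in range(n_runs):
        kept = rng.sample(positive_predictions, n_pos_keep)
        tp_sampled = sum(kept)
        total += tp_sampled / (tp_sampled + fp)
    return total / n_runs


# Hypothetical test set: 100 positives (80 detected), 950 negatives (95 false alarms).
tp, fn, fp, tn = 80, 20, 95, 855
n_pos, n_neg = tp + fn, fp + tn
pi = n_pos / (n_pos + n_neg)
pi0 = 0.05
n_pos_keep = round(n_neg * pi0 / (1 - pi0))  # positives to keep to reach pi0

closed_form = calibrated_precision(tp, fp, pi, pi0)
monte_carlo = heuristic_precision(
    [1] * tp + [0] * fn, fp, n_pos_keep, n_runs=2000, rng=random.Random(0)
)
print(closed_form, monte_carlo)
```

The closed form needs a single pass over the counts, whereas the heuristic needs one evaluation per run and still carries sampling noise.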
3.3 Assessment of the model quality
Besides the robustness of the calibrated metrics to changes in $\pi$, we also want them to be sensitive to the quality of the model. If the latter decreases, regardless of the value of $\pi$, we expect all metrics, calibrated ones included, to decrease in value. Let us consider an experiment where we use the same synthetic dataset as defined in the previous section. However, instead of changing the value of $\pi$ only, we also change the class-conditional distributions to make the problem harder and harder and thus worsen the optimal model’s performance. This can be done by reducing the distance between the two normal distributions in (7). As a distance, we consider the KL-divergence between them.
Figure 5 shows how the values of the metrics evolve as the KL-divergence gets closer to zero. For each run, $\pi$ is chosen at random. As expected, all metrics globally decrease as the problem gets harder. However, we can notice an important difference: the variations of the calibrated metrics are smooth compared to those of the original ones, which are affected by the random changes in $\pi$. In that sense, variations of the calibrated metrics across the different generated datasets are much easier to interpret than those of the original metrics.
4 Link with the original metrics
Calibrated metrics are robust to variations in $\pi$ and improve the interpretability of the model’s performance. But a question remains: are they still linked to the original metrics, and does $\pi_0$ play a role in that link? We tackle this question from the model selection point of view by empirically analyzing the correlation of the metrics in terms of model ordering. We use OpenML [16] to select the 614 supervised binary classification datasets over which at least 30 models have been evaluated with a 10-fold cross-validation. For each one, we randomly choose 30 models, fetch their predictions, and evaluate their performance with several metrics. This leaves us with $614 \times 30$ values for each metric. To analyze whether the metrics rank the models in the same order, we compute the Spearman rank correlation coefficient between them for each of the 614 problems. Most datasets have roughly balanced classes (see the cumulative distribution function in Figure 9 in appendix C). We also run the same experiment with the subset of highly imbalanced datasets, where $\pi$ is very small. A third experiment, with another subset of 21 datasets, is available in appendix C.
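For readers unfamiliar with the rank correlation used here, a small standard-library sketch follows. It computes Spearman's coefficient as the Pearson correlation of average ranks; the metric values for the five models are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Spearman rank correlation between two lists of metric values."""
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical metric values for five models evaluated on one dataset.
auc_roc = [0.90, 0.85, 0.95, 0.70, 0.80]
auc_pr = [0.60, 0.65, 0.70, 0.30, 0.40]
print(spearman(auc_roc, auc_pr))
```

A coefficient of 1 means two metrics order the models identically; averaging the coefficient over datasets gives one entry of a correlation matrix between metrics.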
The compared metrics are AUC-ROC, AUC-PR, AUC-PR Gain and F1-score. The threshold for the latter is fixed on a holdout validation set. We also add the calibrated version of the last three. In order to understand the impact of $\pi_0$ on the calibration, we use two different values: the arbitrary $\pi_0 = 0.5$ and another value $\pi_0 \approx \pi$ (in the second experiment, where $\pi$ is very small, $\pi_0$ is taken further away from $\pi$ than in the first one, while remaining closer to $\pi$ than $0.5$). The obtained correlation matrices are shown in Figure 6. Each entry corresponds to the average Spearman correlation, over all datasets, between the row metric and the column metric.
In general, we observe that metrics are less correlated when classes are unbalanced (right figure). We also note that F1-score is more correlated to PR based metrics than to AUC-ROC. And it agrees more with AUC-PR than AUC-PR Gain. The two latter have a high correlation, especially in the balanced case (left matrix in Figure 6). Also in the balanced case, we can see that metrics defined as area under curves are more correlated with each other than with the threshold sensitive classification metric F1-score.
Let us now analyze calibration. As expected, in general, when $\pi_0 \approx \pi$, calibrated metrics behave very much like the original metrics because $\frac{\pi(1-\pi_0)}{\pi_0(1-\pi)} \approx 1$ and therefore $Prec_c \approx Prec$. In the balanced case (left), since $\pi$ is close to $0.5$, calibrated metrics where $\pi_0 = 0.5$ are also highly correlated with the original metrics. In the imbalanced case (right matrix of Figure 6), when $\pi_0$ is arbitrarily set to $0.5$, the calibrated metrics seem to disagree with the original ones. In fact, they are even less correlated with the original metrics than with AUC-ROC. This can be explained by the relative weights given to FP and TP by each of these metrics. The original precision gives the same weight to FP as to TP, although false positives are $\frac{1-\pi}{\pi}$ times more likely to occur (note that $\frac{1-\pi}{\pi}$ is larger than 100 if $\pi < 1/101 \approx 0.01$). The calibrated precision with the arbitrary value $\pi_0 = 0.5$ boils down to $\frac{TP}{TP + \frac{\pi}{1-\pi} FP}$ and gives a weight $\frac{1-\pi}{\pi}$ times smaller to false positives, which counterbalances their higher likelihood. ROC also implicitly gives less weight to FP because it is computed from FPR and TPR, which are linked to TP and FP by the relationship $\frac{FPR}{TPR} = \frac{\pi}{1-\pi} \frac{FP}{TP}$.
To conclude this analysis, we first emphasize that the choice of metric seems to matter much less when datasets are rather balanced than in the extremely imbalanced case. Indeed, in the balanced case, the least correlated metrics are F1-score and AUC-ROC (see the left matrix in Figure 6 for their correlation coefficient). For the imbalanced datasets, on the other hand, many metrics are uncorrelated, which means that most of the time they disagree on the best model. The choice of the metric is very important here, and our experiment reflects that it is a matter of how much weight we are willing to give to each type of error. With calibration, in order to preserve the nature of the original metrics, $\pi_0$ has to be fixed to a value close to $\pi$ and not arbitrarily. $\pi_0$ can also be fixed to a value different from $\pi$ if the user has another purpose.
5 Guidelines and use-cases
Calibration could benefit ML practitioners when comparing the performance of a single model on different datasets or time periods. Without being exhaustive, we give four use-cases where it is beneficial (the strategy for setting $\pi_0$ depends on the target use-case):
- Model performance monitoring in an industrial context: in systems where performance is monitored over time with F1-score, using calibration in addition to the regular metrics makes it easier to analyze the drift (i.e. to distinguish between variations linked to $\pi$ and variations linked to the likelihood, see Appendix D) and to design adapted solutions: either updating the threshold or completely retraining the model. To avoid denaturing the F1-score too much, $\pi_0$ can here be fixed, for instance, based on expert knowledge (e.g. the average $\pi$ in historical data).
- Comparing the performance of a model on two populations: if the prior differs from one population to another, the calibrated metric will set each population to the same reference. It will provide a balanced point of view and make the analysis richer. This might be useful to study fairness [1], for instance. Here, $\pi_0$ can be chosen as the prior of one of the two populations.
- Establishing agreements with clients: the positive label ratio can vary extremely on particular events (e.g. fraudster attacks), which significantly affects the measured F1-score (see Figure 4) and makes it difficult to guarantee a standard to a client. Calibration helps provide a more controlled guarantee, such as a minimal level of precision at a given reference ratio $\pi_0$. Here $\pi_0$ can be a norm fixed by both parties based on expert knowledge.
- Anticipating the deployment of an algorithm in a real-world system: if the prior in the data collected for offline development is different from reality, non-calibrated metrics measured during development might give pessimistic or optimistic estimations of the post-deployment performance. In particular, this can be harmful when industry has constraints on the performance (e.g. precision has to be strictly above 1/2). Calibration with, for instance, a value of $\pi_0$ equal to the minimal prior envisioned for the application at hand will allow anticipating the worst-case scenario.
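The last use-case can be sketched as a simple pre-deployment check. All numbers below are hypothetical: an offline evaluation set with a 20% positive ratio, a minimal envisioned prior of 2%, and a requirement that precision stays strictly above 1/2.

```python
def calibrated_precision(tp, fp, pi, pi0):
    """Precision with FP re-weighted as if the positive ratio were pi0 instead of pi."""
    weight = (pi * (1 - pi0)) / (pi0 * (1 - pi))
    return tp / (tp + weight * fp)


# Hypothetical offline evaluation at the chosen threshold.
tp, fp = 150, 50
pi_offline = 0.2           # positive ratio in the offline data
pi_min = 0.02              # hypothetical lowest prior expected after deployment
required_precision = 0.5   # hypothetical industrial constraint

offline_precision = tp / (tp + fp)
worst_case_precision = calibrated_precision(tp, fp, pi_offline, pi_min)
meets_constraint = worst_case_precision > required_precision

print(offline_precision, worst_case_precision, meets_constraint)
```

In this toy setting the offline precision looks comfortable, but the value calibrated to the minimal prior falls below the requirement, which is the kind of warning the use-case aims to provide before deployment.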
6 Conclusion
In this paper, we provided a formula and guidelines to calibrate metrics in order to make the variation of their value across different datasets more interpretable. As opposed to the regular metrics, the new ones are shown to be robust to random variations in the class prior. This property can be useful in both academic and industrial applications. On the one hand, in a research study that involves incremental learning on streams, having a metric that is resistant to a virtual concept drift [17] can help to better focus on other sources of drift and design adapted solutions. On the other hand, in an industrial context, calibrated metrics will give more stable results, help prevent false conclusions about a deployed model and also allow for more interpretable performance indicators that are easier to guarantee and report. Calibration relies on a reference positive class ratio $\pi_0$ and transforms the metric such that its value is the one that would be obtained if $\pi$ in the evaluated test set was equal to $\pi_0$. $\pi_0$ has a simple interpretation and should be chosen with caution. If one’s goal is to preserve the nature of the original metric, $\pi_0$ has to be close to the real $\pi$. But choosing a different $\pi_0$ also makes it possible, if needed, to situate the algorithm’s performance with respect to a reference situation. Because it directly controls the importance given to false positives relative to true positives, calibration opens an interesting perspective for future work: investigating a generalized metric in which the cost associated with each type of error (FP, FN) appears as a parameter and from which we can retrieve the definition of existing popular metrics such as TPR, FPR or Precision.
References
- [1] (2017) Fairness in machine learning. NIPS Tutorial.
- [2] (2016) A survey of predictive modeling on imbalanced domains. ACM Computing Surveys (CSUR) 49 (2), pp. 31.
- [3] On the dynamics of classification measures for imbalanced and streaming data. IEEE Transactions on Neural Networks and Learning Systems.
- [4] (2018) Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE Transactions on Neural Networks and Learning Systems 29 (8), pp. 3784–3797.
- [5] (2006) The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pp. 233–240.
- [6] (2006) An introduction to ROC analysis. Pattern Recognition Letters 27 (8), pp. 861–874.
- [7] (2015) Precision-recall-gain curves: PR analysis done right. In Advances in Neural Information Processing Systems, pp. 838–846.
- [8] (2012) On the suitability of numerical performance measures for class imbalance problems. In International Conference in Pattern Recognition Applications and Methods, pp. 310–313.
- [9] (1982) The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143 (1), pp. 29–36.
- [10] (2013) Facing imbalanced data–recommendations for the use of performance metrics. In 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, pp. 245–251.
- [11] (1933) IX. On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 231 (694-706), pp. 289–337.
- [12] (2015) The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PloS one 10 (3), pp. e0118432.
- [13] (2018) Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, pp. 5228–5237.
- [14] (2015) Dealing with the evaluation of supervised classification algorithms. Artificial Intelligence Review 44 (4), pp. 467–508.
- [15] (2018) Precision and recall for time series. In Advances in Neural Information Processing Systems, pp. 1920–1930.
- [16] (2014) OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter 15 (2), pp. 49–60.
- [17] (1993) Effective learning in dynamic environments by explicit context tracking. In European Conference on Machine Learning, pp. 227–243.
- [18] (2017) A framework of online learning with imbalanced streaming data. In Thirty-First AAAI Conference on Artificial Intelligence.
Appendix A Robustness to variations in $\pi$ on real-world data
In this appendix, we study again the behavior of the F1-score, AUC-PR, AUC-PR Gain and their calibrated versions as $\pi$ varies, but on real-world data instead of synthetic data. We arbitrarily set $\pi_0 = 0.5$ for the calibrated metrics. The experiment is carried out on the highly unbalanced credit card fraud detection dataset available on Kaggle (https://www.kaggle.com/mlg-ulb/creditcardfraud) as follows:
- Data are split into train/test sets.
- A logistic regression model is trained with scikit-learn default parameters.
- The model makes predictions on the test set and the metrics are evaluated. This gives us the reference performance at the original $\pi$.
- Next, $\pi$ is artificially increased in the test set by randomly undersampling the majority class. For each sampled test set, we use the model for predictions and evaluate the metrics. We perform 1000 runs to reduce the variance of our estimations.
Figure 7 displays the evolution of the metrics as $\pi$ varies.
The conclusions are the same as with synthetic data: the calibrated metrics are robust to changes in $\pi$ even for extreme values.
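The undersampling step of this protocol can be sketched with the standard library only. The helper name, the toy labels and scores, and the target ratio below are hypothetical; the real experiment uses the predictions of the trained model on the Kaggle test set.

```python
import random


def undersample_negatives(y_true, scores, target_pi, rng):
    """Keep every positive and a random subset of negatives so the positive ratio is target_pi."""
    pos_idx = [i for i, y in enumerate(y_true) if y == 1]
    neg_idx = [i for i, y in enumerate(y_true) if y == 0]
    n_neg_keep = round(len(pos_idx) * (1 - target_pi) / target_pi)
    if n_neg_keep > len(neg_idx):
        raise ValueError("target_pi is below the current ratio; sample positives instead")
    kept = sorted(pos_idx + rng.sample(neg_idx, n_neg_keep))
    return [y_true[i] for i in kept], [scores[i] for i in kept]


rng = random.Random(0)
# Hypothetical test set: 10 positives among 1000 instances, with random scores.
y_true = [1] * 10 + [0] * 990
scores = [rng.random() for _ in y_true]

y_sub, s_sub = undersample_negatives(y_true, scores, target_pi=0.1, rng=rng)
print(len(y_sub), sum(y_sub) / len(y_sub))
```

Repeating this with different seeds and averaging the metrics over the runs reduces the variance introduced by the random choice of negatives.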
Appendix B Comparison between the proposed calibrated metrics and a previously published heuristic for calibration
In [10], the authors propose a heuristic-based calibration which consists in randomly undersampling the test set so that $\pi = \pi_0$ and then computing the regular metrics. Because of the randomness, sampling can remove more hard examples than easy examples, so the performance can be over-estimated, and vice versa. To avoid that, they perform several runs and compute the mean value of the metrics. In this appendix, we experimentally compare the results obtained with our calibration formula and with their calibration heuristic. We use the same train/test data and model as in Appendix A. Figure 8 displays the AUC-PR on the test set calibrated with our formula (blue dot) and with the heuristic from [10] (red line) at several reference ratios $\pi_0$. The red shadow represents the standard deviation over 1000 runs when applying the heuristic.
We can observe that our formula and the heuristic provide the same value, which can be theoretically confirmed (Proposition 1). This is an important result as it confirms that the formula proposed in the paper really plays the role of calibrating the metric to the value that it would have if $\pi = \pi_0$.
Note that the closed-form calibration can be seen as an improvement over the heuristic because it directly provides the targeted value. It is deterministic and computes the metric only once, which makes it faster by a factor roughly equal to the number of runs used in the heuristic.
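Proposition 1. Let $\hat{y}$ be the predictions of a classifier on a test set of $N$ samples and $y$ the ground truth label vector, with $N_+$ non-zero values. Let $(\hat{y}', y')$ be the prediction and ground truth vectors of a sampled test set where we keep all the negative examples and a random sample of $N'_+$ positive examples so that the positive class ratio becomes $\pi_0$.
If we denote by $Prec$ (resp. $Prec'$) and $Rec$ (resp. $Rec'$) the precision and recall of $(\hat{y}, y)$ (resp. $(\hat{y}', y')$), then:
$$\mathbb{E}[Rec'] = Rec \qquad \text{and} \qquad Rec' = Rec \;\Rightarrow\; Prec' = Prec_c$$
Proof. Let us first introduce some notations:
- $I_+$ (resp. $I'_+$): the set of indices of the positive examples in $y$ (resp. $y'$); $I'_+$ is a random sample, of size $N'_+$, without replacement of $I_+$.
- $N_-$: the number of negative examples, which is the same in $y$ and $y'$.
- $TP$, $FP$, $FN$, $\pi$ (resp. $TP'$, $FP'$, $FN'$, $\pi_0$): the number of true positives, the number of false positives, the number of false negatives and the positive class ratio in $(\hat{y}, y)$ (resp. in $(\hat{y}', y')$).
By definition, we have:
$$Rec = \frac{TP}{N_+}, \qquad Rec' = \frac{TP'}{N'_+}, \qquad \frac{\pi}{1-\pi} = \frac{N_+}{N_-}, \qquad \frac{\pi_0}{1-\pi_0} = \frac{N'_+}{N_-}$$
Each index of $I_+$ belongs to $I'_+$ with probability $\frac{N'_+}{N_+}$, so $\mathbb{E}[TP'] = \frac{N'_+}{N_+} TP$ and therefore $\mathbb{E}[Rec'] = \frac{TP}{N_+} = Rec$.
Since only positive examples are sampled from $y$ to $y'$, the number of false positives remains unchanged, so $FP' = FP$. Moreover, if $Rec' = Rec$ then:
$$TP' = \frac{N'_+}{N_+}\, TP = \frac{\pi_0(1-\pi)}{\pi(1-\pi_0)}\, TP$$
Therefore, if $Rec' = Rec$ then:
$$Prec' = \frac{TP'}{TP' + FP'} = \frac{TP}{TP + \frac{\pi(1-\pi_0)}{\pi_0(1-\pi)}\, FP} = Prec_c$$
Note: Proposition 1 reflects the case where we sample the positive class in the test set, so it explains why the heuristic in [10] provides results close to our calibration formula when the target reference ratio $\pi_0$ is lower than the original ratio $\pi$. If the targeted reference ratio $\pi_0$ is higher than $\pi$, the heuristic has to sample the negative class. In that case, we can show a similar property (and proof) in which we simply have to replace the recall $Rec$ (resp. $Rec'$) with the false positive rate $FPR$ (resp. $FPR'$).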
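For completeness, the symmetric case mentioned in the note can be sketched as follows. This is an illustrative derivation rather than the authors’ written proof. Here $\pi$ and $\pi_0$ are the original and reference positive class ratios, $Prec_c$ is the calibrated precision, and $N_+$, $N_-$, $N'_-$ (number of positives, and number of negatives before and after sampling) are symbols introduced for this sketch.

```latex
% Negatives are undersampled and all positives are kept, hence TP' = TP.
\begin{align*}
\frac{1-\pi}{\pi} = \frac{N_-}{N_+}, \quad \frac{1-\pi_0}{\pi_0} = \frac{N'_-}{N_+}
  \;&\Longrightarrow\; \frac{N'_-}{N_-} = \frac{\pi(1-\pi_0)}{\pi_0(1-\pi)} \\
FPR' = FPR \;&\Longrightarrow\; FP' = \frac{N'_-}{N_-}\, FP \\
Prec' = \frac{TP'}{TP' + FP'} &= \frac{TP}{TP + \frac{\pi(1-\pi_0)}{\pi_0(1-\pi)}\, FP} = Prec_c
\end{align*}
```

The weight applied to FP is the same as in the positive-sampling case, which is why a single formula covers reference ratios both below and above the original ratio.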
Appendix C Additional experiment with OpenML
Here we display the CDF graph of the minority class ratio in OpenML binary classification datasets (Figure 9).
We also present a third experiment on OpenML. The protocol is the same as explained in the paper, but here we instead select another subset of 21 datasets. The correlation matrix in Figure 10 presents the results.
Appendix D Real-world use-case of calibration: fraud detection systems
In this appendix, we consider a real dataset from a credit card fraud detection system which fits the scenario described in the paper. We reuse the example presented in Figure 1, where we observed that both the model’s performance in terms of AUC-PR and the fraud ratio $\pi$ decrease. This example is from a private dataset similar to the credit card fraud detection dataset available on Kaggle. Let us take a look at the calibrated metric. To set $\pi_0$, we recommend avoiding the arbitrary value $0.5$ in order to preserve the behavior of precision. $\pi_0$ has to be fixed to a value that makes sense for the application at hand. A straightforward strategy is to fix it with expert knowledge, such as the average proportion of fraudulent transactions in historical data.
Figure 11 presents the comparison between AUC-PR and calibrated AUC-PR. On the left figure, we can see that the two metrics differ when the fraud ratio $\pi$ is far from the reference $\pi_0$. As expected, when $\pi$ is higher (resp. lower) than $\pi_0$, calibration reduces (resp. increases) the value of the precision (see how the difference between AUC-PR and calibrated AUC-PR is correlated with the variation of $\pi$ in the right figure). That being said, although it could not be concluded with certainty from the original metric, the calibrated metric shows that the model is getting worse with time independently of $\pi$. With this in mind, we can start making hypotheses about the reasons for such behavior and take proper actions to correct future predictions.