Master your Metrics with Calibration

by Wissam Siblini, et al.
Jean Monnet University

Machine learning models deployed in real-world applications are often evaluated with precision-based metrics such as the F1-score or AUC-PR (Area Under the Precision-Recall Curve). Heavily dependent on the class prior, such metrics may sometimes lead to wrong conclusions about the performance. For example, when dealing with non-stationary data streams, they do not allow the user to discern the reasons why a model's performance varies across different periods. In this paper, we propose a way to calibrate the metrics so that they are no longer tied to the class prior. The calibration corresponds to a readjustment, based on probabilities, to the value that the metric would have if the class prior were equal to a reference prior (a user parameter). We conduct a large number of experiments on balanced and imbalanced data to assess the behavior of calibrated metrics and show that they improve interpretability and provide better control over what is really measured. We describe specific real-world use-cases where calibration is beneficial such as, for instance, model monitoring in production, reporting, or fairness evaluation.


1 Introduction

In real-world machine learning systems, the predictive performance of a model is often evaluated on several test datasets rather than one, and comparisons are made. These datasets can correspond to sub-populations in the data, or to different periods in time [18, 15]. Choosing the best-suited metrics for these comparisons is not a trivial task. Some metrics may prevent a proper interpretation of the performance differences between the sets [8, 14], especially because the latter generally have not only a different class prior P(Y) but also a different likelihood P(X|Y). For instance, a metric dependent on the prior (e.g. precision, which is naturally higher if positive examples are globally more frequent) will be affected by both differences indiscernibly [3], but a practitioner could be interested in isolating the variation of performance due to the likelihood, which reflects the model's intrinsic performance. Take the example of comparing the performance of a model across time periods: at time t, we receive data drawn from

P_t(X, Y) = P_t(Y) · P_t(X|Y),

where X are the features and Y the label. Hence the optimal scoring function (i.e. model) for this dataset is the likelihood ratio [11]:

s_t(x) = P_t(X = x | Y = 1) / P_t(X = x | Y = 0).   (1)

In particular, if P_t(X|Y) does not vary with time, neither will s_t. In this case, even if the prior P_t(Y) varies, it is desirable to have a performance metric that takes the same value at every period, so that the model maintains the same metric value over time.

In binary classification, researchers often rely on the AUC-ROC (Area Under the Receiver Operating Characteristic curve) to measure a classifier's performance [9, 6]. While the AUC-ROC has the advantage of being invariant to the class prior, many real-world applications, especially when data are imbalanced, favor precision-based metrics. The reason is that, by measuring false positives against the negative examples, which are in large majority, ROC gives them too little importance [5], even though false positives deteriorate user experience and waste human effort. Thus, ROC can give over-optimistic scores for a classifier that performs poorly in terms of precision, and this is particularly inconvenient when the class of interest is in the minority [10]. Metrics directly based on recall and precision, which give more weight to false positives, such as AUC-PR or the F1-score, have been shown to be much more relevant [12] and have gained in popularity in recent years [13]. That being said, these metrics are strongly tied to the class prior [2, 3].

A new definition of precision and recall as precision gain and recall gain has recently been proposed [7] to correct several drawbacks of AUC-PR. However, while the resulting AUC-PR Gain has some of the advantages of AUC-ROC, such as the validity of linear interpolation between points, it remains dependent on the class prior.

This study aims at providing (i) a precision-based metric to cope with problems where the class of interest is a slim minority and (ii) a metric independent of the prior, so as to monitor performance over time. The motivation is illustrated by a fraud detection use-case [4]. Figure 1 shows a situation where, over a given period of time, both the detection system's performance and the fraud ratio (i.e. the empirical prior π) decrease. As the metric used, AUC-PR, is dependent on the prior, it cannot tell whether the performance variation is only due to the fraud ratio or whether there are other factors (e.g. a drift in P(X|Y)).

Figure 1: Evolution of the fraudulent transaction ratio (π) and of the AUC-PR of the model over time.

To cope with the limitations of the widely used precision-based metrics, we introduce a calibrated version of them which excludes the impact of the class ratio. More precisely, our contributions are:

  1. A formulation, in section 3.1, of calibration for precision-based metrics. It estimates the value that the metrics would have if the class ratio in the test set were equal to a reference class ratio π₀ fixed by the user. We give a theoretical argument to show that it provides invariance to the class prior. We also provide a calibrated version of precision gain and recall gain.

  2. An empirical analysis on both synthetic and real-world data in section 3.2 to confirm our claims.

  3. A further analysis, in section 3.3, showing that calibrated metrics are able to assess the model’s performance and are easier to interpret.

  4. Large-scale experiments on 614 datasets from OpenML [16], in section 4, to (a) give more insight into the correlation between popular metrics by analyzing how they rank models, and (b) explore the links between the calibrated metrics and the regular ones and explain how they are impacted by the choice of π₀.

We emphasize that calibration not only solves the issue of dependence on the prior but also allows, through the parameter π₀, projecting the user into a different configuration (a different class ratio) and controlling what the metric will precisely reflect. These new properties have several practical interests (e.g. for development, reporting, analysis) and we discuss them in realistic use-cases in section 5.

2 Popular metrics for binary classification: advantages and limits

We consider a usual binary classification setting where a model has been trained and its performance is evaluated on a test dataset of N instances. y_i denotes the ground-truth label of the i-th instance and is equal to 1 (resp. 0) if the instance belongs to the positive (resp. negative) class. The model provides s_i, a score for the i-th instance to belong to the positive class. For a given threshold τ, the predicted label ŷ_i is 1 if s_i > τ and 0 otherwise.

The numbers of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) are defined as follows:

TP = Σ_i 1(ŷ_i = 1 ∧ y_i = 1),  FP = Σ_i 1(ŷ_i = 1 ∧ y_i = 0),
TN = Σ_i 1(ŷ_i = 0 ∧ y_i = 0),  FN = Σ_i 1(ŷ_i = 0 ∧ y_i = 1),   (2)

where 1(·) yields 1 when its argument is true and 0 otherwise. Based on these statistics, one can compute relevant ratios such as the True Positive Rate (TPR), also referred to as the Recall (Rec), the False Positive Rate (FPR), also referred to as the Fall-out, and the Precision (Prec):

Rec = TPR = TP / (TP + FN),  FPR = FP / (FP + TN),  Prec = TP / (TP + FP).
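These definitions translate directly into code. The following sketch (helper names are ours, not from the paper) computes the counts and ratios for a given threshold:

```python
import numpy as np

def confusion_counts(scores, labels, tau):
    """Confusion counts at threshold tau: predict positive when score > tau."""
    pred = scores > tau
    pos = labels == 1
    tp = int(np.sum(pred & pos))
    fp = int(np.sum(pred & ~pos))
    tn = int(np.sum(~pred & ~pos))
    fn = int(np.sum(~pred & pos))
    return tp, fp, tn, fn

def tpr_fpr_precision(tp, fp, tn, fn):
    recall = tp / (tp + fn) if tp + fn else 0.0     # TPR (Rec)
    fpr = fp / (fp + tn) if fp + tn else 0.0        # Fall-out
    precision = tp / (tp + fp) if tp + fp else 0.0  # Prec
    return recall, fpr, precision
```

Sweeping tau over all observed scores yields the curves used by the AUC metrics discussed next.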

As these ratios are biased towards a specific type of error and can easily be manipulated with the threshold, more complex metrics have been proposed. In this paper, we discuss the most popular ones which have been widely adopted in binary classification: F1-Score, AUC-ROC, AUC-PR and AUC-PR Gain.

The F1-Score is the harmonic mean of Prec and Rec:

F1 = 2 · Prec · Rec / (Prec + Rec).   (3)
The three other metrics consider every threshold from the highest to the lowest. For each one, they compute TP, FP, TN and FN, then plot one ratio against another and compute the Area Under the Curve (Figure 2). AUC-ROC considers the Receiver Operating Characteristic curve where TPR is plotted against FPR. AUC-PR considers the Precision vs Recall curve. Finally, in AUC-PR Gain, the precision gain (PrecGain) is plotted against the recall gain (RecGain). They are defined in [7] as follows:

PrecGain = (Prec − π) / ((1 − π) · Prec)   (4)
RecGain = (Rec − π) / ((1 − π) · Rec)   (5)

where π is the proportion of positive examples, or positive class ratio (in this paper, we always consider that the positive class is the minority class). This allows PR Gain to enjoy many interesting properties of ROC analysis that the original PR analysis does not, such as the validity of linear interpolations, the existence of universal baselines and the interpretability of the area under its curve [7]. However, AUC-PR Gain can become limited in an extremely imbalanced setting. In particular, we can derive from (4) and (5) that both PrecGain and RecGain will often be close to 1 if π is close to 0. This is illustrated in Figure 2 with π in the order of 10⁻³.
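A quick computation with the gain formulas of [7] illustrates this saturation effect (function names are ours): for a small π, even a modest precision or recall already maps to a gain close to 1.

```python
def precision_gain(prec, pi):
    # PrecGain = (Prec - pi) / ((1 - pi) * Prec), as defined in [7]
    return (prec - pi) / ((1.0 - pi) * prec)

def recall_gain(rec, pi):
    # RecGain = (Rec - pi) / ((1 - pi) * Rec), as defined in [7]
    return (rec - pi) / ((1.0 - pi) * rec)

# Balanced case: a precision of 0.5 gives a gain of 0.
print(precision_gain(0.5, 0.5))
# Extremely imbalanced case: the same precision of 0.5 gives a gain near 1.
print(precision_gain(0.5, 0.001))
```

This is why, with π in the order of 10⁻³, most of the PR Gain curve is squeezed into the top-right corner.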

Figure 2: ROC, PR and PR Gain curves for the same model evaluated on an extremely imbalanced test set from a fraud detection application (π in the order of 10⁻³, top row) and on a balanced sample (π = 0.5, bottom row).

Each metric has its own raison d'être. For instance, we visually observe that only AUC-ROC is invariant to the positive class ratio. Indeed, FPR and Rec are both unrelated to the class ratio because they each focus on a single class, but this is not the case for Prec: its dependency on the positive class ratio is illustrated in Figure 3. Instances, represented as circles (resp. squares) if they belong to the positive (resp. negative) class, are ordered from left to right according to the score given by the model. A threshold is illustrated as a vertical line between the instances: those on the left (resp. right) are classified as positive (resp. negative). When comparing a case (i) with a given ratio π and a case (ii) where a randomly selected half of the positive examples has been removed, one can visually understand that both the recall and the false positive rate are the same but the precision is lower in the second case.

Figure 3: Illustration of the impact of π on the precision, the recall, and the false positive rate.

These different properties can become strengths or limitations depending on the context. As stated in the introduction, we will consider a motivating scenario where (i) data are imbalanced and the minority (positive) class is the one of interest and (ii) we monitor performance across time. In that case, we need a metric that considers precision rather than FPR and that is invariant to the prior. By definition, AUC-ROC does not involve precision whereas the other metrics do. But it is the only one invariant to the positive class ratio.

3 Calibrated Metrics

To obtain a metric that satisfies both properties from the last section, we modify the precision-based ones (AUC-PR, F1-Score and AUC-PR Gain) to make them independent of the positive class ratio π.

3.1 Calibration

The idea is to fix a reference ratio π₀ and to weight the counts of TP and FP in order to calibrate them to the values that they would have if π were equal to π₀. π₀ can be chosen arbitrarily (e.g. 0.5 for a balanced reference) but it is preferable to fix it according to the task at hand. We analyze the impact of π₀ in section 4 and describe simple guidelines for fixing it in section 5.

If the positive class ratio is π₀ instead of π, the ratio between negative and positive examples is multiplied by ((1 − π₀)/π₀) · (π/(1 − π)). In this case, we expect the ratio between false positives and true positives to be multiplied by the same factor. Therefore, we define the calibrated precision as follows:

Prec_c = TP / (TP + ((1 − π₀) π)/(π₀ (1 − π)) · FP).   (6)

Since (1 − π)/π = N⁻/N⁺ is the imbalance ratio, where N⁺ (resp. N⁻) is the number of positive (resp. negative) examples, we have Prec_c = TPR / (TPR + ((1 − π₀)/π₀) · FPR), which is independent of π.

Based on the calibrated precision, we can also define the calibrated F1-score, the calibrated PrecGain and the calibrated RecGain by replacing Prec by Prec_c and π by π₀ in equations (3), (4) and (5). Note that calibration does not change precision gain: the calibrated precision gain can be rewritten as 1 − FPR/TPR, which is equal to the regular precision gain. Also, the interesting properties of the recall gain were proved independently of the ratio π in [7], which means calibration preserves them.
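As a minimal sketch (function names are ours), the calibrated precision of equation (6) and the corresponding calibrated F1-score can be computed as follows; setting π₀ = π recovers the regular precision:

```python
def calibrated_precision(tp, fp, pi, pi0):
    """Prec_c: precision readjusted to the reference positive class ratio pi0.

    The FP count is reweighted by the factor by which the negative/positive
    ratio would change if the class ratio were pi0 instead of pi.
    """
    w = ((1.0 - pi0) / pi0) * (pi / (1.0 - pi))
    return tp / (tp + w * fp)

def calibrated_f1(tp, fp, fn, pi, pi0):
    """Calibrated F1: harmonic mean of Prec_c and the (unchanged) recall."""
    prec_c = calibrated_precision(tp, fp, pi, pi0)
    rec = tp / (tp + fn)
    return 2.0 * prec_c * rec / (prec_c + rec)
```

For example, with 100 positives and 900 negatives (π = 0.1), TP = 50 and FP = 90, calibrating to π₀ = 0.5 gives Prec_c = 0.5/(0.5 + 0.1) = 5/6; removing half the positives (TP = 25, π = 50/950) leaves Prec_c unchanged, illustrating the invariance argument above.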

3.2 Robustness to variations

In order to evaluate the robustness of the new metrics to variations in π, we create a synthetic dataset where the label y is drawn from a Bernoulli distribution with parameter π and the feature x is drawn from class-conditional Normal distributions:

x | (y = 0) ~ N(μ₀, σ²),  x | (y = 1) ~ N(μ₁, σ²).   (7)

We empirically study the behavior of the F1-score, AUC-PR, AUC-PR Gain and their calibrated versions on the optimal model defined in (1) as π decreases from 0.5 (balanced) to much smaller values. Figure 4 presents the results averaged over 30 runs with their confidence intervals. We observe that the impact of the class prior on the regular metrics is important. This can be a serious issue for applications where π sometimes varies by one order of magnitude from one day to another (see [4] for a real-world example), as it leads to a significant variation of the measured performance (see the gap between the AUC-PR values at the two ends of the π range) even if the model remains the same. On the contrary, the calibrated versions remain very robust to changes in the class prior, even for extreme values. Note that we experiment here with synthetic data in order to have strong control over the distribution/prior. In appendix A, we run a similar experiment on real-world data where we artificially change π in the test set with undersampling. The conclusions remain the same.

Figure 4: Evolution of AUC-PR, AUC-PR Gain, F1-score and their calibrated versions as π decreases. For every value of π, data points are generated from (7) — so that the observed class ratio is approximately equal to the Bernoulli parameter π — and the metrics are evaluated on the optimal model defined in (1). We arbitrarily set π₀ = 0.5 for the calibrated metrics. The curves are obtained by averaging results over 30 runs.
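The robustness experiment can be sketched as follows. The Gaussian parameters (μ₀ = 0, μ₁ = 2, σ = 1), the sample size and the helper names are illustrative assumptions, not the paper's exact settings; since the likelihood ratio (1) is monotone in x for these Gaussians, x itself serves as the optimal score.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(pi, n, mu0=0.0, mu1=2.0, sigma=1.0):
    """Labels ~ Bernoulli(pi); features from class-conditional Normals, cf. (7)."""
    y = (rng.random(n) < pi).astype(int)
    x = rng.normal(np.where(y == 1, mu1, mu0), sigma)
    return x, y

def auc_pr(x, y, pi0=None):
    """Step-wise AUC-PR; if pi0 is given, the precision is calibrated to pi0."""
    order = np.argsort(-x)                 # sweep thresholds from high to low
    ys = y[order]
    tp = np.cumsum(ys)
    fp = np.cumsum(1 - ys)
    pi = y.mean()
    w = 1.0 if pi0 is None else ((1 - pi0) / pi0) * (pi / (1 - pi))
    prec = tp / np.maximum(tp + w * fp, 1e-12)
    rec = tp / tp[-1]
    # area via right-endpoint steps over the recall axis
    return float(np.sum(np.diff(np.concatenate(([0.0], rec))) * prec))

for pi in (0.5, 0.1, 0.01):
    x, y = sample(pi, 200_000)
    print(f"pi={pi}: AUC-PR={auc_pr(x, y):.3f}  calibrated={auc_pr(x, y, pi0=0.5):.3f}")
```

The regular AUC-PR degrades as π shrinks while the calibrated value stays roughly constant, mirroring Figure 4; note that passing pi0 equal to the observed ratio recovers the regular AUC-PR.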

Let us remark that for test datasets in which π = π₀, Prec_c is equal to the regular precision since the reweighting factor in (6) equals 1 (the intersection of the metrics in Figure 4 at π = π₀ reflects that). The new metrics essentially take the value that the original ones would have if the positive class ratio were equal to π₀. This is analyzed in depth in appendix B, where we compare the proposed calibration formula with the most promising proposal from the past [10]: a heuristic-based calibration. Their approach consists in randomly undersampling the test set so that the class ratio becomes π₀ and then computing the regular metrics. Because of the randomness, sampling can remove more hard examples than easy ones, so the performance can be over-estimated, and vice versa. To avoid that, they perform several runs and compute the mean performance. The appendix shows and explains why our formula directly computes a value equal to their estimation. Therefore, while similar in spirit, our proposal can be seen as an improvement: it provides a closed-form solution, is deterministic, and is less computationally expensive than their Monte-Carlo approximation.

3.3 Assessment of the model quality

Besides the robustness of the calibrated metrics to changes in π, we also want them to be sensitive to the quality of the model: if the latter decreases, regardless of the value of π, we expect all metrics, calibrated ones included, to decrease in value. Let us consider an experiment where we use the same synthetic dataset as defined in the previous section. However, instead of changing the value of π only, we change P(X|Y) to make the problem harder and harder and thus worsen the optimal model's performance. This can be done by reducing the distance between the two normal distributions in (7). As a distance, we consider the KL-divergence, which here boils down to (μ₁ − μ₀)²/(2σ²).

Figure 5: Evolution of AUC-PR, AUC-PR Gain, F1-score and their calibrated versions as the KL-divergence tends to 0 and as π randomly varies. This curve was obtained by averaging results over 30 runs.

Figure 5 shows how the values of the metrics evolve as the KL-divergence gets closer to zero. For each run, π is randomly chosen in a fixed interval. As expected, all metrics globally decrease as the problem gets harder. However, we can notice an important difference: the variations of the calibrated metrics are smooth compared to those of the original ones, which are affected by the random changes in π. In that sense, variations of the calibrated metrics across the different generated datasets are much easier to interpret.

4 Link with the original metrics

Calibrated metrics are robust to variations in π and improve the interpretability of the model's performance. But a question remains: are they still linked to the original metrics, and does π₀ play a role in that link? We tackle this question from the model selection point of view by empirically analyzing the correlation between the metrics in terms of model ordering. We use OpenML [16] to select the 614 supervised binary classification datasets on which at least 30 models have been evaluated with a 10-fold cross-validation. For each one, we randomly choose 30 models, fetch their predictions, and evaluate their performance with several metrics. This leaves us with 30 values per dataset for each metric. To analyze whether the metrics rank the models in the same order, we compute the Spearman rank correlation coefficient between them for each of the 614 problems. Most datasets have roughly balanced classes (see the cumulative distribution function in appendix C). We also run the same experiment on the subset of highly imbalanced datasets. A third experiment, on the 21 datasets with a moderately small positive class ratio, is available in appendix C.

The compared metrics are AUC-ROC, AUC-PR, AUC-PR Gain and the F1-score. The threshold for the latter is fixed on a holdout validation set. We also add the calibrated versions of the last three. In order to understand the impact of π₀ on the calibration, we use two different values: the arbitrary π₀ = 0.5 and a value chosen closer to the actual positive class ratios of the datasets (for the second experiment, where π is very small, we go further and pick a π₀ that remains much closer to π than 0.5 is). The obtained correlation matrices are shown in Figure 6. The individual entries correspond to the average Spearman correlation over all datasets between the row metric and the column metric.
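The per-dataset rank-correlation computation can be sketched with a minimal Spearman implementation (our own helpers, assuming no tied metric values):

```python
import numpy as np

def rank(v):
    """Ranks 0..n-1 of the values in v (assumes no ties)."""
    r = np.empty(len(v), dtype=float)
    r[np.argsort(v)] = np.arange(len(v))
    return r

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    ra, rb = rank(np.asarray(a, float)), rank(np.asarray(b, float))
    ra -= ra.mean()
    rb -= rb.mean()
    return float(np.sum(ra * rb) / np.sqrt(np.sum(ra**2) * np.sum(rb**2)))
```

Given, for one dataset, the 30 models' scores under two metrics, spearman returns 1.0 when both metrics order the models identically and -1.0 when they fully disagree; averaging these coefficients over datasets yields one entry of the matrices in Figure 6.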

Figure 6: Spearman rank correlation matrices between 10 metrics over 30 models and 614 datasets for the left figure and 30 models and 4 highly imbalanced datasets for the right figure. The used supervised binary classification datasets and model predictions are all available in OpenML.

In general, we observe that metrics are less correlated when classes are imbalanced (right figure). We also note that the F1-score is more correlated with the PR-based metrics than with AUC-ROC, and that it agrees more with AUC-PR than with AUC-PR Gain. The latter two have a high correlation, especially in the balanced case (left matrix in Figure 6). Also in the balanced case, we can see that the metrics defined as areas under curves are more correlated with each other than with the threshold-sensitive classification metric F1-score.

Let us now analyze calibration. As expected, when π₀ = π, the calibrated metrics behave very closely to the original ones because the reweighting factor equals 1 and therefore Prec_c = Prec. In the balanced case (left), since π is close to π₀ = 0.5, the calibrated metrics with π₀ = 0.5 are also highly correlated with the original metrics. In the imbalanced case (right matrix of Figure 6), when π₀ is arbitrarily set to 0.5, the calibrated metrics seem to disagree with the original ones; in fact, they are even less correlated with the original metrics than with AUC-ROC. This can be explained by the relative weights given to FP and TP by each of these metrics. The original precision gives the same weight to a true positive as to a false positive, although false positives are (1 − π)/π times more likely to occur ((1 − π)/π is larger than 100 if π < 0.01). The calibrated precision with the arbitrary value π₀ = 0.5 boils down to TP/(TP + (π/(1 − π)) · FP) and thus gives a weight (1 − π)/π times smaller to false positives, which counterbalances their higher likelihood. ROC also implicitly gives less weight to FP because it is computed from FPR and TPR, which are linked to FP and TP through counts of very different sizes: FPR = FP/N⁻ and TPR = TP/N⁺, with N⁻ = ((1 − π)/π) · N⁺.

To conclude this analysis, we first emphasize that the choice of the metric is much less sensitive when datasets are rather balanced than in the extremely imbalanced case. Indeed, in the balanced case the least correlated metrics are the F1-score and AUC-ROC, with a correlation coefficient that remains high. For the imbalanced datasets, on the other hand, many metrics are uncorrelated, which means that most of the time they disagree on the best model. The choice of the metric is very important here, and our experiment reflects that it is a matter of how much weight we are willing to give to each type of error. With calibration, in order to preserve the nature of the original metrics, π₀ has to be fixed to a value close to π and not arbitrarily. π₀ can also be fixed to a different value if the user has another purpose.

5 Guidelines and use-cases

Calibration could benefit ML practitioners when comparing the performance of a single model on different datasets/time periods. Without being exhaustive, we give four use-cases where it is beneficial (the strategy for setting π₀ depends on the target use-case):

  • Model performance monitoring in an industrial context: in systems where performance is monitored over time with the F1-score, using calibration in addition to the regular metrics makes it easier to analyze the drift (i.e. distinguish between variations linked to π and variations linked to P(X|Y) — see Appendix D) and to design adapted solutions: either updating the threshold or completely retraining the model. To avoid denaturing the F1-score too much, π₀ can here be fixed based on expert knowledge (e.g. the average π in historical data).

  • Comparing the performance of a model on two populations: if the prior differs from one population to the other, the calibrated metric will evaluate each population against the same reference. It will provide a balanced point of view and make the analysis richer. This might be useful to study fairness [1], for instance. Here, π₀ can be chosen as the prior of one of the two populations.

  • Establishing agreements with clients: the positive label ratio can vary extremely on particular events (e.g. fraudster attacks), which significantly affects the measured F1-score (see Figure 4), making it difficult to guarantee a standard to a client. Calibration helps provide a more controlled guarantee such as a minimal level of precision at a given reference ratio π₀. Here π₀ can be a norm fixed by both parties based on expert knowledge.

  • Anticipating the deployment of an algorithm in a real-world system: if the prior in the data collected for offline development differs from reality, non-calibrated metrics measured during development might give pessimistic or optimistic estimations of the post-deployment performance. In particular, this can be harmful when industry has constraints on the performance (e.g. precision has to be strictly above 1/2). Calibration with, for instance, a value of π₀ equal to the minimal prior envisioned for the application at hand allows anticipating the worst-case scenario.

6 Conclusion

In this paper, we provided a formula and guidelines to calibrate metrics in order to make the variation of their value across different datasets more interpretable. As opposed to the regular metrics, the new ones are shown to be robust to random variation in the class prior. This property can be useful to both academic and industrial applications. On the one hand, in a research study that involves incremental learning on streams, having a metric which is robust to virtual concept drift [17] can help to better focus on other sources of drift and design adapted solutions. On the other hand, in an industrial context, calibrated metrics will give more stable results, help prevent false conclusions about a deployed model, and allow for more interpretable performance indicators that are easier to guarantee and report. Calibration relies on a reference positive class ratio π₀ and transforms the metric such that its value is the one that would be obtained if π in the evaluated test set were equal to π₀. π₀ has a simple interpretation and should be chosen with caution. If one's goal is to preserve the nature of the original metric, π₀ has to be close to the real π. But choosing a different π₀ also allows, if needed, situating the algorithm's performance in a reference situation. Because π₀ directly controls the importance given to false positives relative to true positives, calibration draws an interesting perspective for future work: investigating a generalized metric in which the cost associated to each type of error (FP, FN) appears as a parameter and from which we can retrieve the definitions of existing popular metrics such as TPR, FPR or Precision.


  • [1] S. Barocas, M. Hardt, and A. Narayanan (2017) Fairness in machine learning. NIPS Tutorial. Cited by: 2nd item.
  • [2] P. Branco, L. Torgo, and R. P. Ribeiro (2016) A survey of predictive modeling on imbalanced domains. ACM Computing Surveys (CSUR) 49 (2), pp. 31. Cited by: §1.
  • [3] D. Brzezinski, J. Stefanowski, R. Susmaga, and I. Szczech (2019) On the dynamics of classification measures for imbalanced and streaming data. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §1, §1.
  • [4] A. Dal Pozzolo, G. Boracchi, O. Caelen, C. Alippi, and G. Bontempi (2018) Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE transactions on neural networks and learning systems 29 (8), pp. 3784–3797. Cited by: §1, §3.2.
  • [5] J. Davis and M. Goadrich (2006) The relationship between precision-recall and roc curves. In Proceedings of the 23rd international conference on Machine learning, pp. 233–240. Cited by: §1.
  • [6] T. Fawcett (2006) An introduction to roc analysis. Pattern recognition letters 27 (8), pp. 861–874. Cited by: §1.
  • [7] P. Flach and M. Kull (2015) Precision-recall-gain curves: pr analysis done right. In Advances in Neural Information Processing Systems, pp. 838–846. Cited by: §1, §2, §2, §3.1.
  • [8] V. Garcıa, J. S. Sánchez, and R. A. Mollineda (2012) On the suitability of numerical performance measures for class imbalance problems. In International Conference in Pattern Recognition Applications and Methods, pp. 310–313. Cited by: §1.
  • [9] J. A. Hanley and B. J. McNeil (1982) The meaning and use of the area under a receiver operating characteristic (roc) curve.. Radiology 143 (1), pp. 29–36. Cited by: §1.
  • [10] L. A. Jeni, J. F. Cohn, and F. De La Torre (2013) Facing imbalanced data–recommendations for the use of performance metrics. In 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, pp. 245–251. Cited by: Appendix B, Appendix B, §1, §3.2.
  • [11] J. Neyman and E. S. Pearson (1933) IX. on the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character 231 (694-706), pp. 289–337. Cited by: §1.
  • [12] T. Saito and M. Rehmsmeier (2015) The precision-recall plot is more informative than the roc plot when evaluating binary classifiers on imbalanced datasets. PloS one 10 (3), pp. e0118432. Cited by: §1.
  • [13] M. S. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly (2018) Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, pp. 5228–5237. Cited by: §1.
  • [14] G. Santafe, I. Inza, and J. A. Lozano (2015) Dealing with the evaluation of supervised classification algorithms. Artificial Intelligence Review 44 (4), pp. 467–508. Cited by: §1.
  • [15] N. Tatbul, T. J. Lee, S. Zdonik, M. Alam, and J. Gottschlich (2018) Precision and recall for time series. In Advances in Neural Information Processing Systems, pp. 1920–1930. Cited by: §1.
  • [16] J. Vanschoren, J. N. Van Rijn, B. Bischl, and L. Torgo (2014) OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter 15 (2), pp. 49–60. Cited by: item 4, §4.
  • [17] G. Widmer and M. Kubat (1993) Effective learning in dynamic environments by explicit context tracking. In European Conference on Machine Learning, pp. 227–243. Cited by: §6.
  • [18] Y. Yan, T. Yang, Y. Yang, and J. Chen (2017) A framework of online learning with imbalanced streaming data. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §1.

Appendix A Robustness to variations on real-world data

In this appendix, we study again the behavior of the F1-score, AUC-PR, AUC-PR Gain and their calibrated versions as π varies, but on real-world data instead of synthetic data. We arbitrarily set π₀ = 0.5 for the calibrated metrics. The experiment is carried out on the highly unbalanced (π ≈ 0.002) credit card fraud detection dataset available on Kaggle, as follows:

  1. Data are split into train/test sets.

  2. A logistic regression model is trained with scikit-learn default parameters.

  3. The model makes predictions on the test set and the metrics are evaluated. This gives us the reference performance at the original ratio π.

  4. Next, π is artificially increased in the test set by randomly undersampling the majority class. For each sampled test set, we use the same model for predictions and evaluate the metrics. We perform 1000 runs to reduce the variance of our estimations.
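Step 4's majority-class undersampling can be sketched as follows (the helper name is ours):

```python
import numpy as np

def undersample_majority(x, y, target_pi, rng):
    """Randomly drop majority-class (y == 0) rows until P(y = 1) = target_pi.

    Keeps all positives and a random subset of negatives whose size is chosen
    so that the resulting positive class ratio equals target_pi.
    """
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    n_neg = int(round(len(pos) * (1 - target_pi) / target_pi))
    keep = np.concatenate([pos, rng.choice(neg, size=n_neg, replace=False)])
    return x[keep], y[keep]
```

Evaluating the metrics on many such resampled test sets, for a grid of target ratios, and averaging over runs produces curves like those of Figure 7.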

Figure 7 displays the evolution of the metrics as varies.

Figure 7: Evolution of AUC-PR, AUC-PR Gain, F1-score and their calibrated versions on real-world data as π varies. This curve was obtained by averaging results over 1000 runs.

The conclusions are the same as with synthetic data: the calibrated metrics are robust to changes in π, even for extreme values.

Appendix B Comparison between the proposed calibrated metrics and a previously published heuristic for calibration

In [10], the authors propose a heuristic-based calibration which consists in randomly undersampling the test set so that the class ratio becomes π₀ and then computing the regular metrics. Because of the randomness, sampling can remove more hard examples than easy ones, so the performance can be over-estimated, and vice versa. To avoid that, they perform several runs and compute the mean value of the metrics. In this appendix, we experimentally compare the results obtained with our calibration formula and with their calibration heuristic. We use the same train/test data and model as in Appendix A. Figure 8 displays the AUC-PR on the test set calibrated with our formula (blue dots) and with the heuristic from [10] (red line) at several reference ratios π₀. The red shadow represents the standard deviation over 1000 runs when applying the heuristic.

Figure 8: Comparison between heuristic-based calibration and closed-form calibration applied to AUC-PR on real-world data

We can observe that our formula and the heuristic provide the same value, and this can be confirmed theoretically (Proposition 1). This is an important result as it confirms that the formula proposed in the paper really plays the role of calibrating the metric to the value that it would have if π were equal to π₀.

Note that the closed-form calibration can be seen as an improvement of the heuristic because it directly provides the targeted value. It is deterministic and computes the metric a single time, which is therefore roughly K times faster (where K is the number of runs in the heuristic).

Proposition 1.

Let $\hat{y}$ be the predictions of a classifier on a test set of $n$ samples and $y$ the ground truth label vector with $n_+$ non-zero values. Let $(\hat{y}', y')$ be the prediction and ground truth vectors of a sampled test set where we keep all the negative examples and a random sample of positive examples so that the positive class ratio becomes $\pi_0$. If we denote by $P$ (resp. $P'$) and $R$ (resp. $R'$) the precision and recall of $(\hat{y}, y)$ (resp. $(\hat{y}', y')$), then:

$$\mathbb{E}[R'] = R \quad \text{and} \quad \mathbb{E}[P'] \approx \frac{TP}{TP + \frac{\pi(1-\pi_0)}{\pi_0(1-\pi)}\,FP},$$

i.e., the expected precision on the sampled test set is the calibrated precision.

Proof. Let us first introduce some notations:

  • $I_+$ (resp. $I'_+$): the set of indices of the positive examples in $y$ (resp. $y'$)

  • $S$: a random sample, of size $n'_+$, without replacement of $I_+$

  • $TP$, $FP$, $FN$, $\pi$ (resp. $TP'$, $FP'$, $FN'$, $\pi_0$): the number of true positives, the number of false positives, the number of false negatives and the positive class ratio in $(\hat{y}, y)$ (resp. in $(\hat{y}', y')$)

By definition, we have:

$$P' = \frac{TP'}{TP' + FP'} \tag{12}$$

and, since each positive example of $I_+$ is kept in $S$ with probability $\frac{n'_+}{n_+}$,

$$\mathbb{E}[TP'] = \frac{n'_+}{n_+}\,TP. \tag{13}$$

By injecting (13) in (12), we obtain:

$$\mathbb{E}[P'] \approx \frac{\frac{n'_+}{n_+}\,TP}{\frac{n'_+}{n_+}\,TP + FP'}.$$

Since only positive examples are sampled from $y$ to $y'$, the number of false positives remains unchanged, so $FP' = FP$. Moreover, if $n_-$ denotes the number of negative examples, the positive class ratios give $n_+ = \frac{\pi}{1-\pi}\,n_-$ and $n'_+ = \frac{\pi_0}{1-\pi_0}\,n_-$, so that:

$$\frac{n'_+}{n_+} = \frac{\pi_0(1-\pi)}{\pi(1-\pi_0)}.$$

Note that $R' = \frac{TP'}{n'_+}$, hence $\mathbb{E}[R'] = \frac{\mathbb{E}[TP']}{n'_+} = \frac{TP}{n_+} = R$.

Therefore, if $\pi_0 \leq \pi$, then:

$$\mathbb{E}[P'] \approx \frac{TP}{TP + \frac{\pi(1-\pi_0)}{\pi_0(1-\pi)}\,FP}.$$
Note: Proposition 1 reflects the case where we sample the positive class in the test set, so it explains why the heuristic in [10] provides results close to our calibration formula when the target reference ratio $\pi_0$ is lower than the original ratio $\pi$. If the target reference ratio is higher than $\pi$, the heuristic has to sample the negative class. In that case, we can show a similar property (with a similar proof) in which we simply replace the recall $R$ (resp. $R'$) with the false positive rate $FPR$ (resp. $FPR'$).

Appendix C Additional experiment with OpenML

Here we display the CDF graph of the minority class ratio in OpenML binary classification datasets (Figure 9).

Figure 9: CDF of the minority class ratio in OpenML binary classification datasets.

We also present a third experiment on OpenML. The protocol is the same as explained in the paper, except that we select the 21 datasets for which . The correlation matrix in Figure 10 presents the results.

Figure 10: Spearman rank correlation matrices between 10 metrics over 30 models and 21 datasets. The supervised binary classification datasets and model predictions used are all available on OpenML.
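Rank correlation matrices like the one in Figure 10 can be computed with a short sketch (our own code, not the experiment scripts; it assumes a `scores` array with one row per (model, dataset) pair, one column per metric, and no ties):

```python
import numpy as np

def spearman_matrix(scores):
    """Spearman rank correlation between metric columns (no ties assumed):
    rank each column, then take the Pearson correlation of the ranks."""
    ranks = scores.argsort(axis=0).argsort(axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)
```

Because Spearman correlation only depends on rankings, it captures whether two metrics order the models the same way, regardless of their scales.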

Appendix D Real-world use-case of calibration: fraud detection systems

In this appendix, we consider a real dataset from a credit card fraud detection system which fits the scenario described in the paper. We reuse the example presented in Figure 1, where we observed that both the model’s performance in terms of AUC-PR and the fraud ratio decrease. This example is from a private dataset similar to the credit card fraud detection dataset available on Kaggle. Let us take a look at the calibrated metric. To set $\pi_0$, we recommend avoiding an arbitrary value, in order to preserve the behavior of precision. $\pi_0$ has to be fixed to a value that makes sense for the application at hand. A straightforward strategy is to fix it with expert knowledge, such as the average proportion of fraudulent transactions in historical data (as done in our case).
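The monitoring scenario can be illustrated with a simplified sketch (calibrated precision instead of AUC-PR for brevity; the per-period confusion counts and the function name are hypothetical, and the false-positive reweighting follows the calibration formula):

```python
def monitor_periods(periods, pi0):
    """For each period given as (tp, fp, fn, tn) counts, report the observed
    positive ratio pi, the raw precision, and the calibrated precision at the
    reference ratio pi0, so that prior drift is separated from model drift."""
    report = []
    for tp, fp, fn, tn in periods:
        pi = (tp + fn) / (tp + fp + fn + tn)   # observed fraud ratio
        raw = tp / (tp + fp)
        weight = (pi * (1 - pi0)) / (pi0 * (1 - pi))
        report.append((pi, raw, tp / (tp + weight * fp)))
    return report
```

When the observed ratio equals $\pi_0$, the two values coincide; when it drops below $\pi_0$, calibration raises the precision back to its reference-prior value, which is the kind of correction visible in Figure 11.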

Figure 11: On the left, the figure shows the performance of a model over time on the fraud detection task in terms of AUC-PR and calibrated AUC-PR. On the right, the figure presents the normalized difference between calibrated AUC-PR and AUC-PR against the value of $\pi$ in the test sets.

Figure 11 presents the comparison between AUC-PR and calibrated AUC-PR. On the left figure, we can see that the two metrics differ when the fraud ratio $\pi$ is far from the reference $\pi_0$. As expected, when $\pi$ is higher (resp. lower) than $\pi_0$, calibration reduces (resp. increases) the value of the precision (see how the difference between calibrated AUC-PR and AUC-PR correlates with the variation of $\pi$ in the right figure). That being said, although it could not be concluded with certainty from the original metric, the calibrated metric shows that the model is getting worse over time, independently of $\pi$. With this in mind, we can start making hypotheses on the reasons for such behavior and take proper actions to correct future predictions.