EPP: interpretable score of model predictive power

by   Alicja Gosiewska, et al.

Evaluating model performance is the most important part of model selection and hyperparameter tuning. The most popular measures, such as AUC, F1, and ACC for binary classification, RMSE and MAD for regression, or cross-entropy for multilabel classification, share two weaknesses. First, they are not on an interval scale: the difference in performance between two models has no direct interpretation, so it makes no sense to compare such differences between datasets. Second, for k-fold cross-validation, model performance is in most cases calculated as the average over folds, which discards information about how stable the performance is across folds. In this talk, we introduce EPP, a new rating system for predictive models, and demonstrate several of its advantages. First, differences in EPP scores have a probabilistic interpretation: based on them, we can assess the probability that one model will achieve better performance than another. Second, EPP scores can be directly compared between datasets. Third, they can be used for navigated hyperparameter tuning and model selection. Fourth, we can create embeddings for datasets based on EPP scores.




1 Introduction

Performance measurement is the foundation of model selection and hyperparameter tuning. The choice of algorithm strongly depends on the choice of evaluation measure. Selecting an evaluation score may be challenging; whole articles are devoted to the selection of measures and their properties (Powers, 2008).

For example, in binary classification problems, one of the most common measures is the Area Under the ROC Curve (AUC) (Sokolova and Lapalme, 2009). However, if there is a high cost associated with False Negatives, we would rather use Recall instead of AUC. If the cost of False Positives is high, we would use Precision. If we care about the balance between Precision and Recall, we should use F1 (Goutte and Gaussier, 2005).

In this paper, we will show weaknesses of the most common measures, such as AUC, F1, ACC for binary classification, or RMSE, MAE for regression, or cross-entropy for multilabel classification and propose our novel measure EPP (Elo-based Predictive Power).

The paper is organized as follows. In Section 2 we present weaknesses of most common performance measures. In Section 3 we describe the idea of Elo score. In Section 4 we introduce Elo-Based Predictive Power (EPP) measure and address all weaknesses pointed out in Section 2. In Section 5 we show experiments and applications of EPP score. In Section 6 we discuss possible extensions of EPP.

2 What is wrong with most common measures?

In this section, we point out four weaknesses of the most popular performance measures. Our examples use the AUC measure, but the reasoning applies to other measures as well, such as F1, MSE, or cross-entropy. Each subsection corresponds to a different issue.

2.1 There is no interpretation of differences in performance

Team AUC
Erkut & Mark,Google AutoML 0.618492
Erkut & Mark 0.616913
Google AutoML 0.615982
Erkut & Mark,Google AutoML,Sweet Deal 0.615858
Sweet Deal 0.615766
Arno Candel @ H2O.ai 0.615492
ALDAPOP 0.615040
9hr Overfitness 0.614371
Shlandryn 0.614132
Erin (H2O AutoML 100 mins) 0.612657
Table 1: Top 10 results of KaggleDays SF competition in 2019. https://www.kaggle.com/antgoldbloom/analyzing-kaggledays-sf-competition-data/notebook

In Table 1 we present the AUC of 10 machine learning models and AutoML solutions calculated on the same data set.

The difference between the AUC of the first and the second team equals 0.001579. This difference has no direct interpretation: it does not provide any quantitative comparison of the models' performances. AUC is useful for ordering, but its differences have no interpretation.

2.2 There is no procedure for assessing the significance of the difference in performances

The results in Table 1 differ in the third decimal place. There is no reference point to indicate whether this difference represents a significant improvement in prediction. Significance in the statistical sense would mean that these differences are above the noise level.

2.3 You cannot compare performances between data sets

Team Name AUC
Asian Ensemble 0.8043
.baGGaj. 0.8039
Erkut & Mark,Google AutoML,Sweet Deal 0.8039
ARG eMMSamble 0.8037
n_m 0.8021
Table 2: Springleaf Marketing Response Kaggle Competition, https://www.kaggle.com/c/springleaf-marketing-response

Team Name AUC
alijs 0.9562
7777777777777… 0.9559
ML Keksika 0.9546
krivoship 0.9544
2 old mipt dogs 0.9543
Table 3: IEEE-CIS Fraud Detection Kaggle Competition, https://www.kaggle.com/c/ieee-fraud-detection

In Tables 2 and 3, the differences between the two best models are 0.0004 and 0.0003, respectively. One would like to know whether these differences are comparable between data sets. Is an increase of 0.0004 on the Springleaf Marketing data the same increase in model quality as an increase of 0.0003 on the IEEE-CIS Fraud data?

There are at least three points of view. One is that the gap between the first and second places is almost the same for both data sets, because the differences in AUC are almost equal. The second is that the gap in the IEEE-CIS Fraud competition is larger, as the AUC is closer to 1: measured against the room left for improvement, the relative improvement for Fraud Detection is larger than the relative improvement for Springleaf Marketing. The third is that the gap between the first and second place for Springleaf is larger than the difference between the second and third place, while the opposite is true for IEEE-CIS Fraud Detection.
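The three readings can be checked numerically from the tables. The sketch below is only an illustration: the function name `gap_views` is hypothetical, and the relative improvement is assumed here to be the gain divided by the runner-up's remaining distance to a perfect AUC of 1, which is one plausible definition and not necessarily the one used by the authors.

```python
# AUC values of the top three teams, copied from Tables 2 and 3.
springleaf = [0.8043, 0.8039, 0.8039]
fraud = [0.9562, 0.9559, 0.9546]


def gap_views(aucs):
    """Return three ways of reading the gap between 1st and 2nd place."""
    first, second, third = aucs
    absolute = first - second
    # Assumed definition: share of the runner-up's headroom (1 - AUC) gained.
    relative_to_headroom = absolute / (1.0 - second)
    # Gap between 2nd and 3rd place, used as a reference point.
    next_gap = second - third
    return absolute, relative_to_headroom, next_gap


for name, aucs in [("Springleaf", springleaf), ("IEEE-CIS Fraud", fraud)]:
    absolute, relative, next_gap = gap_views(aucs)
    print(f"{name}: abs={absolute:.4f} rel={relative:.4f} 2nd-3rd={next_gap:.4f}")
```

Under this assumed definition the absolute gaps are nearly equal, the relative gap is larger for the fraud data, and the comparison with the next gap points in opposite directions for the two competitions.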

2.4 You cannot assess the stability of the performance in cross-validation folds

k AUC AutoML_1 AUC AutoML_2
1 0.8 0.9
2 0.8 0.78
3 0.8 0.78
4 0.8 0.78
Mean AUC 0.8 0.81
Table 4: Artificial results from 4-fold cross-validation.

For k-fold cross-validation, model performance is usually reported as the average performance of models trained on different folds. Table 4 shows artificial AUC values for four folds and the mean AUC across all folds. Comparing only the averages creates the false impression that the AutoML_2 model is better than AutoML_1. Yet AutoML_1 wins in 3 out of 4 folds.
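The contrast between the mean and the fold-by-fold outcome in Table 4 can be reproduced in a few lines. This is a small illustrative sketch; the variable names are chosen here and the values are the artificial ones from the table.

```python
from statistics import mean

# Fold-level AUC values copied from Table 4.
auc_automl_1 = [0.80, 0.80, 0.80, 0.80]
auc_automl_2 = [0.90, 0.78, 0.78, 0.78]

mean_1 = mean(auc_automl_1)
mean_2 = mean(auc_automl_2)

# Count the folds in which each model has the strictly higher AUC.
wins_1 = sum(a > b for a, b in zip(auc_automl_1, auc_automl_2))
wins_2 = sum(b > a for a, b in zip(auc_automl_1, auc_automl_2))

print(f"mean AUC:  AutoML_1={mean_1:.2f}  AutoML_2={mean_2:.2f}")
print(f"fold wins: AutoML_1={wins_1}  AutoML_2={wins_2}")
```

The mean favours AutoML_2 only because of a single fold, while the win counts favour AutoML_1.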

3 What is Elo ranking system?

The Elo rating is a ranking system used for calculating the relative skill level of players. It is used by, for example, chess and football federations. A player's score is updated after each match they have played; the new Elo rating is calculated from two components, the result of the match and the rating of the opponent. A player's level is not measured absolutely but is inferred from wins, losses, and draws against other players. What is more, the difference between the Elo scores of two players can be translated into the probability of winning when they play against each other.

Elo scores can be interpreted in terms of the probability of winning. There are many variations of Elo; we give a short overview of one of the most popular, introduced by Elo and Sloan (2008).

Let $R_1$ and $R_2$ be the ratings of Player 1 and Player 2. The expected score of Player 1 is

$$E_1 = \frac{1}{1 + 10^{(R_2 - R_1)/400}}.$$

The expected score of Player 1 is their probability of winning plus half of the probability of drawing with Player 2. After the match, the rating of Player 1 is updated using the formula

$$R_1' = R_1 + K (S_1 - E_1),$$

where $S_1$ is the actual score, which indicates whether the player won or lost and takes the value 1 or 0. $K$ is a given constant; it can take different values and is usually defined by the organizer of the competition.

The most common scaling means that a difference of 200 rating points gives the more skilled player an expected score of approximately 0.75. An average player has a rating of 1500, and a rating over 2000 means that the player is one of the best.
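The rating mechanics described above can be sketched as follows. This is a minimal illustration assuming the commonly used logistic form with base 10 and a scale of 400; the function names and the default `k=32` are assumptions made for this example and are not taken from the paper.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Expected score of player A against player B (logistic curve, base 10)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def update_rating(rating_a, rating_b, actual_score, k=32.0):
    """Rating of player A after one match; actual_score is 1 (win) or 0 (loss)."""
    return rating_a + k * (actual_score - expected_score(rating_a, rating_b))


# A 200-point advantage gives an expected score of roughly three quarters.
print(round(expected_score(1700, 1500), 2))

# Two equally rated players: the winner gains k / 2 points.
print(update_rating(1500, 1500, actual_score=1))
```

With this scaling a 200-point gap yields an expected score close to the 0.75 quoted in the text, and the expected scores of the two opponents always sum to 1.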

In addition to being interpretable in terms of probability, Elo has one more advantage: it is not necessary for each player to play against every other player. In the real world, it would be impossible to play matches between all chess players, so Elo is used to approximate true skill. Of course, the more matches played, the better the approximation.

The concept of Elo is not completely new in machine learning. The performance of neural networks that play Dota 2 is often expressed in terms of TrueSkill, a ranking system developed by Microsoft (Herbrich et al., 2007) for e-sport. TrueSkill is an extension of Elo to games with more than two players. It is used not only to compare algorithms with each other, but also to compare them with human players. However, Elo has not previously been used to assess predictive models.

4 Elo-based Predictive Power (EPP) score

Our novel idea is to transfer the way players are ranked in the Elo system to create rankings of models.

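Let $\beta_{M_i}$ stand for the EPP score of model $M_i$. The desired property is that

$$\beta_{M_i} - \beta_{M_j} = \log \mathrm{odds}(M_i, M_j), \qquad (1)$$

where $\mathrm{odds}(M_i, M_j)$ stands for the odds that model $M_i$ beats model $M_j$.

The following procedure satisfies Property 1.

To calculate EPP we propose a logistic regression. Let $p_{i,j}$ be the probability of model $M_i$ winning against model $M_j$. Then we can specify the formula

$$\mathrm{logit}(p_{i,j}) = \beta_{M_i} - \beta_{M_j}. \qquad (2)$$

In the case of a larger number of models, $M_1, \dots, M_k$, it can be extended to

$$\mathrm{logit}(p_{i,j}) = \beta_{M_1} x_1 + \beta_{M_2} x_2 + \dots + \beta_{M_k} x_k, \qquad (3)$$

$$x_a = \begin{cases} 1 & \text{if } a = i, \\ -1 & \text{if } a = j, \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$

The $\beta_{M_a}$ coefficients can be estimated with a simple logistic regression. Once the coefficients are estimated, one can calculate $p_{i,j}$ from the following formula

$$p_{i,j} = \frac{\exp(\beta_{M_i} - \beta_{M_j})}{1 + \exp(\beta_{M_i} - \beta_{M_j})}. \qquad (5)$$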
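The estimation step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `fit_epp` and `win_probability`, the toy match list, the gradient-ascent optimizer with its learning rate and iteration count, and the centring of scores at zero are all choices made for this example. The model fitted is a logistic regression without intercept in which the log-odds that model i beats model j equal the difference of their scores.

```python
import math


def fit_epp(matches, n_models, lr=0.5, n_iter=2000):
    """Estimate scores from (winner, loser) index pairs by gradient ascent.

    Assumed model: logit P(i beats j) = beta[i] - beta[j].
    """
    beta = [0.0] * n_models
    for _ in range(n_iter):
        grad = [0.0] * n_models
        for winner, loser in matches:
            p_win = 1.0 / (1.0 + math.exp(-(beta[winner] - beta[loser])))
            # Gradient of the log-likelihood of one observed win.
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        beta = [b + lr * g / len(matches) for b, g in zip(beta, grad)]
        # Scores are identified only up to a constant; centre them at zero.
        centre = sum(beta) / n_models
        beta = [b - centre for b in beta]
    return beta


def win_probability(beta_i, beta_j):
    """Probability that the model with score beta_i beats the other model."""
    return 1.0 / (1.0 + math.exp(-(beta_i - beta_j)))


# Hypothetical results: (winner, loser) pairs for three models.
matches = (
    [(0, 1)] * 3 + [(1, 0)] * 1
    + [(1, 2)] * 3 + [(2, 1)] * 1
    + [(0, 2)] * 7 + [(2, 0)] * 1
)
beta = fit_epp(matches, n_models=3)
print([round(b, 3) for b in beta])
print(round(win_probability(beta[0], beta[2]), 3))
```

The fit needs every model to have at least one win and one loss; otherwise the maximum-likelihood scores diverge. In practice a library logistic regression without intercept could replace the hand-written loop.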


4.1 The advantages of EPP

In this section, we will address four problems pointed out in Section 2, related to the weaknesses of the most common performance measures. We will show that EPP handles these identified issues.

Ad 2.1 There is an interpretation of differences in performance
The EPP score has a direct interpretation in terms of probability. The EPP difference for models $M_i$ and $M_j$ is the logit of the probability that $M_i$ achieves better performance than $M_j$ (see Formula 2 and Formula 5).
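As a small worked example of this interpretation (the differences 0, 1, and 2 are illustrative values, not results from the experiments, and the symbol $\beta_{M_i}$ for the EPP score of model $M_i$ is notation assumed here), applying the inverse logit to an EPP difference gives the probability of better performance:

```latex
\begin{aligned}
\beta_{M_i} - \beta_{M_j} = 0 &\;\Rightarrow\; p_{i,j} = \frac{e^{0}}{1 + e^{0}} = 0.5, \\
\beta_{M_i} - \beta_{M_j} = 1 &\;\Rightarrow\; p_{i,j} = \frac{e^{1}}{1 + e^{1}} \approx 0.73, \\
\beta_{M_i} - \beta_{M_j} = 2 &\;\Rightarrow\; p_{i,j} = \frac{e^{2}}{1 + e^{2}} \approx 0.88.
\end{aligned}
```

Equal scores therefore correspond to an even chance, and each additional unit of difference multiplies the odds of better performance by $e \approx 2.72$.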

The next three points benefit from this probabilistic interpretation of the difference of scores.

Ad 2.2 There is a procedure for assessing the significance of the difference in performances
The EPP score allows one to assess significance via the probability of better performance, which gives an intuition of whether the difference in performance is noise or not.

Ad 2.3 You can compare performances between data sets
The difference between two EPP scores has the same meaning regardless of the data set. As stated in Equation 2, it is the logit of the probability that one model performs better than another.

Ad 2.4 You can assess the stability of the performance in cross-validation folds
The EPP score takes into account how many times one model beats another. Thus, the better model is the one that more often achieved higher performance.

5 Experiments and applications of EPP score

In Figure 1, we present the concept of Elo-based comparison of machine learning algorithms. We describe the ratings of models by analogy to tournaments with the Elo system. Countries (algorithms) field their players (sets of hyperparameters) for duels. These duels are held within tournaments (data sets) divided into rounds (train/test splits). The results of matches (model training and testing) are used to create leaderboards (EPP rankings of models).

Figure 1: Our novel concept of Elo-based model ranking. Colors represent machine learning algorithms, gradients represent sets of hyperparameters, border styles represent data set.

The output rankings can be analyzed according to the type of algorithm, a specific set of hyperparameters, or particular data set.

We have calculated EPP score for several algorithms and data sets. In the following subsections we will present the results of the experiment.

5.1 Experiment Setup

We used 4 machine learning algorithms (gradient boosting machines, generalized linear model with regularization, k-nearest neighbours, and random forest). Each algorithm was studied for 11 different hyperparameter settings on 11 selected classification data sets from the OpenML100 benchmark (Bischl et al., 2017). For each data set, we specified 20 train/test splits. For each split, we fitted models on the train data and computed AUC on the test data. For each algorithm-data set combination this gives $11 \times 20 = 220$ AUC values, and the overall number of AUC values equals $4 \times 11 \times 11 \times 20 = 9680$.

To the computed AUC scores we applied the methodology for calculating EPP presented in Section 4 and Figure 1. As a single round, we consider a comparison of the performances of two models with specified hyperparameters on the same data set, yet not necessarily on the same train/test split. As a result, we obtained EPP scores for each data-model-hyperparameters combination, which gave us $11 \times 4 \times 11 = 484$ EPP scores.
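The sizes implied by this setup, and one way a "round" between two models might be tallied, can be sketched as follows. This is an illustration under assumptions: the helper `count_wins` is hypothetical, the AUC values are made up, and comparing every split of one model with every split of the other is only one possible reading of rounds that are not restricted to the same split.

```python
from itertools import product

n_algorithms, n_settings, n_datasets, n_splits = 4, 11, 11, 20

# One reading of "model-data combination": one algorithm on one data set.
auc_per_algorithm_dataset = n_settings * n_splits
auc_total = n_algorithms * n_settings * n_datasets * n_splits
epp_total = n_algorithms * n_settings * n_datasets
print(auc_per_algorithm_dataset, auc_total, epp_total)


def count_wins(auc_a, auc_b):
    """Compare every split of model A with every split of model B.

    Returns (wins of A, wins of B); ties are counted for neither model.
    """
    wins_a = sum(a > b for a, b in product(auc_a, auc_b))
    wins_b = sum(b > a for a, b in product(auc_a, auc_b))
    return wins_a, wins_b


# Hypothetical AUC values of two models on three splits of one data set.
auc_a = [0.81, 0.79, 0.80]
auc_b = [0.78, 0.80, 0.77]
print(count_wins(auc_a, auc_b))
```

Win counts of this kind are the input a pairwise logistic regression needs; restricting the comparison to identical splits would replace `product` with `zip`.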

5.2 Tuning hyperparameters of algorithms

By analyzing EPP scores we can assess the tunability of a model. Probst et al. (2019) attempted to measure the tunability of algorithms; the EPP score can extend the tunability measure and add an interpretation to it.

Figure 2: EPP scores of different hyperparameter settings for kknn across data sets. Each panel corresponds to the number of a data set in the OpenML (Vanschoren et al., 2013) database. The y-axis shows 11 different hyperparameter settings of k-nearest neighbours.
Figure 3: Boxplots of EPP scores for different models across data sets.

The EPP score has great potential for supporting hyperparameter tuning. Figure 2 shows EPP scores for different hyperparameter settings of k-nearest neighbours across 11 data sets. The closer the end of the bar is to the right side, the better the model; the further to the left, the worse the model. Models with EPP equal to 0 have average performance.

Looking at the results presented in Figure 2, we can make two observations. First, as the EPP scores differ, we can say that the k-nearest neighbours model is susceptible to tuning. Second, data set number 334 is somehow different, as the direction of model quality is reversed. During modelling, this would be a hint that one should examine this data set.

Figure 3 shows the distributions of EPP scores across models and data sets. The longer the boxplot, the more tunable the model. We can also see, for example, that tree-based models (random forest and GBM) perform better on data set number 3 than the other two models. In addition, all EPP scores for random forest are positive, which means that the performance of random forest is generally above average.

The insights about the performance of models and particular hyperparameter settings could be further used for navigated tuning, that is, an automated way to find the best model.

5.3 Building Embeddings of data sets

Figure 4: PCA biplot of EPP scores for different hyperparameter settings for k-nearest neighbours model. Groups presented as symbols correspond to the data sets.

Figure 4 shows a PCA biplot for different hyperparameter settings with data sets marked. Such a projection can lead to additional insights. We can observe a separation of hyperparameter settings for k-nearest neighbours. The dimension linked with the x-axis provides a way to divide hyperparameter settings across data sets. Let us focus on the most extreme data sets, 50 and 334. When we analyze the results presented in the bar plots in Figure 2, we can see that the performances of individual hyperparameter settings are reversed for these two data sets.

Because we observe a connection between the model and the data, one can use EPP values to create embeddings of data sets. Such embeddings could be further used for model tuning.

6 Possible extensions

The results of the experiments are very promising. We see possible applications and extensions in many areas of machine learning.

First of all, the EPP score would be beneficial for Explainable Artificial Intelligence (XAI). Interpretability brings multiple benefits, such as increasing trust in model predictions or identifying the reasons behind poor predictions (Biecek, 2018). Interpretable differences of scores open many new ways to develop explanations of machine learning models.

The second major opportunity is to use EPP for navigated hyperparameter tuning. The EPP score can be used to assess the probability that we can improve the performance if we continue searching the hyperparameter space. What is more, the stopping condition may also take into account the time needed to train further models. Automating the EPP-based tuning process could lead to a navigated tuning method.

The idea of the EPP score may be extended with TrueSkill (Herbrich et al., 2007), which was mentioned in Section 3. In the same way that TrueSkill allows grading humans' skill in games with more than two players, it can be used for assessing the performance of model ensembles. This could make it possible to assess separately the performance of a single model, the performance of an ensemble of models, and the potential of the model in ensembles.

Possible modifications to the calculation of EPP scores are also worth considering. For example, in the experiments presented in Section 5, we compared wins and losses of models across different train/test splits. An alternative would be to compare the results only between identical splits.


  • P. Biecek (2018) DALEX: Explainers for Complex Predictive Models in R. Cited by: §6.
  • B. Bischl, G. Casalicchio, M. Feurer, F. Hutter, M. Lang, R. G. Mantovani, J. N. van Rijn, and J. Vanschoren (2017) OpenML benchmarking suites and the OpenML100. Cited by: §5.1.
  • A. E. Elo and S. Sloan (2008) The Rating of Chess Players, Past and Present. Cited by: §3.
  • C. Goutte and E. Gaussier (2005) A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In European Conference on Information Retrieval. Cited by: §1.
  • R. Herbrich, T. Minka, and T. Graepel (2007) TrueSkill(TM): A Bayesian Skill Rating System. In Advances in Neural Information Processing Systems 20. Cited by: §3, §6.
  • D. Powers (2008) Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation. Cited by: §1.
  • P. Probst, A.-L. Boulesteix, and B. Bischl (2019) Tunability: importance of hyperparameters of machine learning algorithms. Cited by: §5.2.
  • M. Sokolova and G. Lapalme (2009) A systematic analysis of performance measures for classification tasks. Information Processing & Management. Cited by: §1.
  • J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo (2013) OpenML: Networked Science in Machine Learning. SIGKDD Explorations. Cited by: Figure 2.