1 Introduction
The measurement of performance is the foundation of model selection and hyperparameter tuning. The choice of algorithm strongly relies on the choice of an evaluation measure. The task of selecting an evaluation score may be challenging; indeed, whole articles are devoted solely to the choice of measures and their properties (Powers, 2008).
For example, in binary classification problems, one of the most common measures is the Area Under the ROC Curve (AUC) (Sokolova and Lapalme, 2009). However, if there is a high cost associated with False Negatives, we would rather use Recall instead of AUC. If the cost of False Positives is high, we would use Precision. If we care about the balance between precision and recall, we should use F1 (Goutte and Gaussier, 2005).
In this paper, we will show weaknesses of the most common measures, such as AUC, F1, and ACC for binary classification, RMSE and MAE for regression, and cross-entropy for multilabel classification, and propose our novel measure EPP (Elo-based Predictive Power).
The paper is organized as follows. In Section 2 we present weaknesses of the most common performance measures. In Section 3 we describe the idea of the Elo score. In Section 4 we introduce the Elo-based Predictive Power (EPP) measure and address all weaknesses pointed out in Section 2. In Section 5 we show experiments and applications of the EPP score. In Section 6 we discuss possible extensions of EPP.
2 What is wrong with most common measures?
In this section, we point out four weaknesses of the most popular performance measures. We introduce examples for the AUC measure; however, the reasoning also applies to other measures, such as F1, MSE, or cross-entropy. Each subsection corresponds to a different issue.
2.1 There is no interpretation of differences in performance
Team | AUC |
---|---|
Erkut & Mark,Google AutoML | 0.618492 |
Erkut & Mark | 0.616913 |
Google AutoML | 0.615982 |
Erkut & Mark,Google AutoML,Sweet Deal | 0.615858 |
Sweet Deal | 0.615766 |
Arno Candel @ H2O.ai | 0.615492 |
ALDAPOP | 0.615040 |
9hr Overfitness | 0.614371 |
Shlandryn | 0.614132 |
Erin (H2O AutoML 100 mins) | 0.612657 |
In Table 1, we present the AUC of 10 machine learning models and AutoML solutions calculated on the same data set. The difference between the AUC of the first and the second team equals 0.001579. This difference has no direct interpretation and does not provide any quantitative comparison of the models' performance. AUC is useful for ordering models, but its differences have no interpretation.
2.2 There is no procedure for assessing the significance of the difference in performances
The results in Table 1 differ only in the third decimal place. There is no reference point to indicate whether such a difference represents a significant improvement in prediction or not. Significance in the statistical sense means that the differences are not at the noise level.
2.3 You cannot compare performances between data sets
In Tables 2 and 3, the differences between the best models for each data set are of similar magnitude. One would like to know whether these differences are comparable between data sets. Does a given improvement in AUC on the Springleaf Marketing data correspond to the same increase in model quality as on the IEEE-CIS Fraud data?
There are at least three points of view. The first is that the gap between the first and second places is almost the same for both data sets, because the differences in AUC are nearly identical. The second is that the gap in the IEEE-CIS Fraud Competition is larger, as the AUC is close to 1, so the relative improvement for Fraud detection is larger than the relative improvement for Springleaf Marketing. The third point of view is that the gap between the first and second place for Springleaf is larger than the difference between the second and third place, while the opposite is true for IEEE-CIS Fraud detection.
2.4 You cannot assess the stability of the performance in cross-validation folds
k | AUC AutoML_1 | AUC AutoML_2 |
---|---|---|
1 | 0.8 | 0.9 |
2 | 0.8 | 0.78 |
3 | 0.8 | 0.78 |
4 | 0.8 | 0.78 |
Mean AUC | 0.8 | 0.81 |
In k-fold cross-validation, model performance is usually reported as the averaged performance of models trained on different folds. Table 4 shows artificial values of AUC for four folds and the mean AUC across all folds. Comparing only the averages creates the false impression that the AutoML_2 model is better than AutoML_1. Yet, AutoML_1 wins in 3 out of 4 folds.
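The same observation can be reproduced directly from the numbers in Table 4, as in the small snippet below:

```python
# Worked example using the values from Table 4: the mean AUC favours AutoML_2,
# yet AutoML_1 has the higher AUC in 3 of the 4 folds.
auc_automl_1 = [0.80, 0.80, 0.80, 0.80]
auc_automl_2 = [0.90, 0.78, 0.78, 0.78]

mean_1 = sum(auc_automl_1) / len(auc_automl_1)   # 0.80
mean_2 = sum(auc_automl_2) / len(auc_automl_2)   # 0.81

wins_1 = sum(a1 > a2 for a1, a2 in zip(auc_automl_1, auc_automl_2))
print(mean_1 < mean_2, wins_1)  # True, 3
```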
3 What is the Elo ranking system?
The Elo rating is a ranking system used for calculating the relative skill levels of players. It is used by, for example, chess and football federations. A player's score is updated after each match they participate in; the new Elo rating is calculated from two components: the result of the match and the rating of the opponent. A player's level is not measured absolutely, but is inferred from wins, losses, and draws against other players. Moreover, the difference between the Elo scores of two players can be translated into the probability of winning when they play against each other.
Elo scores can be interpreted in terms of the probability of winning. There are many variations of Elo; we give a short overview of one of the most popular, introduced by Elo and Sloan (2008).
Let $R_1$ and $R_2$ be the ratings of Player 1 and Player 2. The expected score of Player 1 is

$$E_1 = \frac{1}{1 + 10^{(R_2 - R_1)/400}}.$$

Player 1's expected score is the probability of winning plus half of the probability of drawing with Player 2. After the match, the rating of Player 1 is updated using the formula

$$R_1' = R_1 + K (S_1 - E_1),$$

where $S_1$ is the actual score, indicating whether the player won or lost; it takes the value 1 or 0. $K$ is a given constant that can take different values and is usually defined by the organizer of the competition.
Under the most common scaling, a difference of 200 rating points means that the more skilled player has an expected score of approximately 0.75. An average player has a rating of 1500, and reaching a rating over 2000 means that the player is one of the best.
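As a minimal sketch of these update rules (the $K$ value and ratings below are illustrative, and the function names are ours), the expected score and rating update can be computed as follows:

```python
# A minimal sketch of a single Elo update under the common logistic scaling with
# 400 in the exponent; the K value and ratings below are illustrative only.

def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_rating(r_a: float, r_b: float, actual_score: float, k: float = 32.0) -> float:
    """New rating of player A; actual_score is 1 for a win, 0.5 for a draw, 0 for a loss."""
    return r_a + k * (actual_score - expected_score(r_a, r_b))

# A 1700-rated player is expected to score about 0.76 against a 1500-rated one,
# so a win moves the rating up by only a few points (about 7.7 with K = 32).
print(expected_score(1700, 1500))       # ~0.76
print(update_rating(1700, 1500, 1.0))   # ~1707.7
```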
In addition to being interpretable in terms of probability, Elo has one more advantage. It is not necessary for each player to play against every other player. In the real world, it would be impossible to play matches between all chess players; therefore, Elo is used to find an approximation of true skill. Of course, the more matches played, the better the approximation, but each player does not need to play against all other players.
The concept of Elo is not completely new in machine learning. The performance of neural networks that play Dota 2 is often expressed in terms of TrueSkill, a ranking system developed by Microsoft (Herbrich et al., 2007) for e-sport. TrueSkill is an extension of Elo to games with more than two players. It is used not only to compare algorithms with each other, but also to compare them with human players. However, Elo has not previously been used to assess predictive models.
4 Elo-based Predictive Power (EPP) score
Our novel idea is to transfer the way players are ranked in the Elo system to create rankings of models.
Let $\mathrm{EPP}_i$ stand for the EPP score of model $M_i$. The desired property is that

$$\mathrm{EPP}_i - \mathrm{EPP}_j = \mathrm{logit}\big(\Pr(M_i \text{ achieves better performance than } M_j)\big). \qquad \text{(Property 1)}$$
The following procedure satisfies Property 1.
To calculate Elo, we propose logistic regression. Let $p_{i,j}$ be the probability of model $M_i$ winning against model $M_j$. Then we can specify the formula

$$\mathrm{logit}(p_{i,j}) = \beta_i - \beta_j. \tag{2}$$

In the case of a larger number of models, it can be extended to

$$\mathrm{logit}(p_{i,j}) = z_{i,j}^{\top} \beta, \tag{3}$$

where

$$z_{i,j} = (0, \ldots, 0, \underset{i}{1}, 0, \ldots, 0, \underset{j}{-1}, 0, \ldots, 0)^{\top}. \tag{4}$$

The unknown $\beta$ coefficients can be estimated with simple logistic regression. Once the $\beta$ coefficients are estimated, one can calculate $\mathrm{EPP}_i$ from the following formula:

$$\mathrm{EPP}_i = \beta_i. \tag{5}$$
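To make the procedure concrete, the following is a minimal sketch (not the authors' implementation) of estimating EPP with off-the-shelf logistic regression on a made-up toy tournament; fixing one model's score to zero is one common way to handle the fact that the scores are identified only up to an additive constant:

```python
# A minimal sketch of estimating EPP scores with logistic regression,
# following Formulas (2)-(5). The toy match results and all variable
# names below are made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

n_models = 3

def rounds(i, j, wins_i, n_rounds):
    """Design-matrix rows and outcomes for n_rounds comparisons of model i vs model j."""
    Z = np.zeros((n_rounds, n_models))
    Z[:, i], Z[:, j] = 1.0, -1.0                      # the vector z_{i,j} from Formula (4)
    y = np.array([1] * wins_i + [0] * (n_rounds - wins_i))
    return Z, y

played = [rounds(0, 1, 7, 10), rounds(1, 2, 7, 10), rounds(0, 2, 9, 10)]
Z = np.vstack([z for z, _ in played])
y = np.concatenate([y for _, y in played])

# EPP scores are identified only up to an additive constant, so we drop the last
# column, i.e. fix the EPP of the last model to 0 as a reference level.
model = LogisticRegression(fit_intercept=False, C=1e6)
model.fit(Z[:, :-1], y)

epp = np.append(model.coef_.ravel(), 0.0)             # Formula (5): EPP_i = beta_i
print("EPP scores:", np.round(epp, 2))

# Inverse logit of the EPP difference: probability that model 0 beats model 1 in a round.
p_01 = 1.0 / (1.0 + np.exp(-(epp[0] - epp[1])))
print("P(model 0 beats model 1):", round(p_01, 2))
```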
4.1 The advantages of EPP
In this section, we will address four problems pointed out in Section 2, related to the weaknesses of the most common performance measures. We will show that EPP handles these identified issues.
Ad 2.1 There is an interpretation of differences in performance
The EPP score provides a direct interpretation in terms of probability. The EPP difference for models $M_i$ and $M_j$ is the logit of the probability that $M_i$ achieves better performance than $M_j$ (see Formula 2 and Formula 5). The next three points benefit from this probabilistic interpretation of the difference of scores.
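For example, an EPP difference of 1 corresponds, through the inverse logit, to a probability of $e^{1}/(1 + e^{1}) \approx 0.73$ that the higher-rated model achieves the better performance in a single comparison, while a difference of 0 corresponds to an even chance.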
Ad 2.2 There is a procedure for assessing the significance of the difference in performances
The EPP score allows one to assess significance via the probability of better performance, which gives an intuition whether the difference in performance is noise or not.
Ad 2.3 You can compare performances between data sets
The difference between two EPP scores has the same meaning regardless of the data set. As stated in Equation 2, it is the logit of the probability of better performance of one model over another.
Ad 2.4 You can assess the stability of the performance in cross-validation folds
The EPP score takes into consideration how many times one model beat another. Thus, the better model would be the one that more often had higher performance.
5 Experiments and applications of EPP score
In Figure 1, we present the concept of Elo-based comparison of machine learning algorithms. We describe the rating of models by analogy to tournaments with the Elo system. Countries (algorithms) stage their players (sets of hyperparameters) for duels. These duels are held within tournaments (data sets) divided into rounds (train/test splits). The results of matches (model training and testing) are used to create leaderboards (EPP rankings of models).
[Figure 1: The analogy between the Elo tournament system and the comparison of machine learning models.]
The output rankings can be analyzed according to the type of algorithm, a specific set of hyperparameters, or a particular data set.
We have calculated the EPP score for several algorithms and data sets. In the following subsections we present the results of the experiment.
5.1 Experiment Setup
We used 4 machine learning algorithms (gradient boosting machines, generalized linear models with regularization, k-nearest neighbours, and random forest). Each algorithm was studied for 11 different hyperparameter settings on 11 selected classification data sets from the OpenML100 (Bischl et al., 2017) benchmark. For each data set, we specified 20 train/test splits. For each split, we fitted the models on the train data and computed the AUC on the test data. For a single model-data combination, this gives 20 values of AUC, and the overall number of AUC values equals 4 × 11 × 11 × 20 = 9680.
On the computed AUC scores, we applied the methodology of calculating EPP presented in Section 4 and Figure 1. As a single round, we consider a comparison of the performance of two models with specified hyperparameters on the same data set, yet not necessarily on the same train/test split. As a result, we obtained an EPP score for each data-model-hyperparameters combination, which gives 4 × 11 × 11 = 484 EPP values.
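A possible way to generate such rounds from the raw AUC results is sketched below; the data frame layout and column names are assumptions for illustration, and skipping exact ties is one of several possible choices:

```python
# Sketch of how rounds could be constructed from raw results, assuming a pandas
# DataFrame `results` with columns ["dataset", "model", "split", "auc"]
# (hypothetical names). A round compares two models on the same data set but
# not necessarily on the same train/test split.
import itertools
import pandas as pd

def rounds_for_dataset(results: pd.DataFrame, dataset: str) -> pd.DataFrame:
    """One row per round: which of two models achieved the higher AUC."""
    d = results[results["dataset"] == dataset]
    rows = []
    for model_i, model_j in itertools.combinations(sorted(d["model"].unique()), 2):
        auc_i = d.loc[d["model"] == model_i, "auc"].to_numpy()
        auc_j = d.loc[d["model"] == model_j, "auc"].to_numpy()
        for a_i, a_j in itertools.product(auc_i, auc_j):   # all pairs of splits
            if a_i != a_j:                                   # exact ties are skipped here
                rows.append({"model_i": model_i, "model_j": model_j,
                             "model_i_won": int(a_i > a_j)})
    return pd.DataFrame(rows)
```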
5.2 Tuning hyperparameters of algorithms
By analyzing EPP scores, we can assess the tunability of a model. Probst, Boulesteix, and Bischl (2019) attempted to measure the tunability of algorithms; the EPP score can extend their approach and add an interpretation to the tunability measure.
[Figure 2: EPP scores for different hyperparameter settings of k-nearest neighbours across 11 OpenML data sets (Vanschoren et al., 2014).]
[Figure 3: Distributions of EPP scores across models and data sets.]
The EPP score has huge potential for supporting hyperparameter tuning. In Figure 2, we show EPP scores for different hyperparameter settings of k-nearest neighbours across 11 data sets. The closer the end of the strap is to the right side, the better the model; the further to the left, the worse the model. Models with an EPP of 0 have average performance.
Looking at the results presented in Figure 2, we can make two observations. First, as the EPP scores differ, we can say that the k-nearest neighbours model is susceptible to tuning. Second, data set number 334 is somehow different, as the ordering of model quality is reversed. During the modelling process, this would be a hint that one should examine this data set.
In Figure 3, we show the distributions of EPP scores across models and data sets. The longer the boxplot, the more tunable the model. We can also see that tree-based models (random forest and GBM) perform better on data set number 3 than the other two models. Moreover, all EPP scores for random forest are positive, which means that, in general, the performance of random forest is above average.
The insights about the performance of models and particular hyperparameter settings could be further used for navigated tuning, that is, an automated way to find the best model.
5.3 Building Embeddings of data sets
[Figure 4: PCA biplot of EPP scores for different hyperparameter settings, with data sets marked.]
In Figure 4, we show a PCA biplot for different hyperparameter settings with the data sets marked. Such a projection can lead to additional insights. We can observe a separation of hyperparameter settings for k-nearest neighbours. The dimension linked with the x-axis provides a way to divide hyperparameter settings across data sets. Let us focus on the two most extreme data sets, 50 and 334. When analyzing the results presented in Figure 2, we can see that the performance of individual hyperparameter settings is reversed for these two data sets.
This observed connection between models and data sets suggests that EPP values can be used to create embeddings of data sets. Such embeddings could be further used for model tuning.
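One possible realization of such embeddings is sketched below, assuming the EPP results are stored in a long-format table; the column names and the choice of PCA follow Figure 4 but are otherwise illustrative:

```python
# Sketch of building data set embeddings from EPP scores, assuming a pandas
# DataFrame `epp` with columns ["dataset", "model", "epp"] (hypothetical names).
# Each data set is described by the vector of EPP scores of all model and
# hyperparameter configurations evaluated on it, then projected with PCA,
# in the spirit of the biplot in Figure 4.
import pandas as pd
from sklearn.decomposition import PCA

def dataset_embeddings(epp: pd.DataFrame, n_components: int = 2) -> pd.DataFrame:
    wide = epp.pivot(index="dataset", columns="model", values="epp")  # data sets x configurations
    coords = PCA(n_components=n_components).fit_transform(wide.to_numpy())
    return pd.DataFrame(coords, index=wide.index,
                        columns=[f"PC{k + 1}" for k in range(n_components)])
```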
6 Possible extensions
The results of the experiments are very promising. We see possible applications and extensions in many areas of machine learning.
First of all, the EPP score would be beneficial for Explainable Artificial Intelligence (XAI). Interpretability brings multiple benefits, such as increasing trust in model predictions or identifying the reasons behind poor predictions (Biecek, 2018). Interpretable differences of scores open many new ways to develop explanations of machine learning models.
The second major opportunity is to use EPP for navigated hyperparameter tuning. The EPP score can be used to assess the probability that performance can be improved by continuing the search of the hyperparameter space. What is more, the stopping condition may also take into account the time needed to train further models. The automatization of the EPP-based tuning process could lead to a navigated tuning method.
The idea of the EPP score may be extended in the direction of TrueSkill (Herbrich et al., 2007), which was mentioned in Section 3. In the same way that TrueSkill allows grading human skill in games with more than two players, it can be used to assess the performance of model ensembles. This could make it possible to assess separately the performance of a single model, the performance of an ensemble of models, and the potential of a model within ensembles.
Possible modifications to the calculation of EPP scores are also worth considering. For example, in the experiments presented in Section 5, we compared wins and losses of models across different train/test splits. An alternative would be to compare results only between identical splits.
References
- Biecek, P. (2018). DALEX: Explainers for Complex Predictive Models in R. Journal of Machine Learning Research.
- Bischl, B., Casalicchio, G., Feurer, M., Hutter, F., Lang, M., Mantovani, R. G., van Rijn, J. N., and Vanschoren, J. (2017). OpenML benchmarking suites and the OpenML100.
- Elo, A. and Sloan, S. (2008). The Rating of Chess Players, Past and Present.
- Goutte, C. and Gaussier, E. (2005). A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In European Conference on Information Retrieval.
- Herbrich, R., Minka, T., and Graepel, T. (2007). TrueSkill(TM): A Bayesian Skill Rating System. In Advances in Neural Information Processing Systems 20.
- Probst, P., Boulesteix, A.-L., and Bischl, B. (2019). Tunability: Importance of Hyperparameters of Machine Learning Algorithms.
- Powers, D. M. W. (2008). Evaluation: From Precision, Recall and F-Factor to ROC, Informedness, Markedness & Correlation.
- Sokolova, M. and Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management.
- Vanschoren, J., van Rijn, J. N., Bischl, B., and Torgo, L. (2014). OpenML: Networked Science in Machine Learning. SIGKDD Explorations.