1 Introduction
This paper is about improving current practices for benchmarking NLP systems. As pointed out by (ruder2021benchmarking), benchmarks are made of datasets, metrics, and a way to aggregate performance. Our point is that, while the bulk of the NLP community's efforts in this domain goes to collecting new datasets and introducing new metrics, little work is concerned with the third component, namely how to aggregate the various performances.
Why are benchmarks vital? Research advances in Machine Learning (ML) are crucially fueled by reliable evaluation procedures (post-2018-call; benchmark_lottery). The latter are indeed mandatory to fairly compare new methods and systems. Usually, one relies on a well-chosen metric that reflects the ability to perform on a task – e.g. accuracy for classification, mean-squared error for regression.
The multi-task evaluation setting. While single-task problems are quite common, to better understand model weaknesses in real-world scenarios, the community is heading towards more complex evaluations involving fine-grained evaluation (liu2021explainaboard) across several metrics (or criteria (gardent2017creating; yuan2019adversarial)) and several tasks (wang2018glue; zheng2021fewnlu; gehrmann-etal-2021-gem; mcmillan-major-etal-2021-reusable; dhole2021nl; benchmark_transformers). This is due to the increasing performance of deep neural networks, which are nowadays designed to generalize in a great variety of situations and to solve complex tasks (silver2016mastering). One is typically seeking models with good transfer learning properties, meaning an ability to generalize well under distribution shift and/or task shift (kawaguchi2017generalization).
How to aggregate performances? The multi-task setting has been investigated in recent works that provide benchmarks of state-of-the-art models across a great variety of tasks (rajpurkar2016squad; mccann2018natural; conneau2018you; zheng2021fewnlu; tay2020would), sometimes with more than fifty tasks (siddhant2020xtreme; aribandi2021ext5; wei2021finetuned; sanh2021multitask). These papers provide tables of scores across the considered tasks, but the only non-qualitative way to compare systems consists in averaging the performances across tasks and then ranking systems according to their mean score values. This is, for instance, done with the GLUE benchmark (wang2018glue) and its derivatives (wang2019superglue). However, taking the mean is seriously flawed since the different metrics are usually not on the same scale and can even be unbounded (yuan2021bartscore; colombo2021infolm). Even a pre-processing renormalization scheme would fail to capture the intrinsic difficulty of the tasks.
Contribution 1. Our first contribution is to provide a reliable tool to rank systems in a multi-task setting. We rely on a ranking aggregation procedure which, from the set of rankings induced by each criterion, returns a single ranking that aggregates them. This procedure, called the Kemeny consensus (kemeny1959mathematics), can be seen as a voting rule and stems from social choice theory (myerson1996fundamentals).
Aggregation when instance-level information is available. As illustrated by zhong2021larger; ruder2021benchmarking, a fine-grained understanding of model performance should include instance-level scores. While taking the mean is quite natural in the classification setting, this is not always the case, as recently pointed out by (peyrard2021better) in the NLG setting. In this article, the authors investigate pairwise comparison of NLG systems for a single metric (e.g. BLEU (papineni2002bleu), ROUGE (lin2004rouge), METEOR (banerjee2005meteor; denkowski2014meteor; guo-hu-2019-meteor), CHRF (popovic2015chrf; popovic2017chrf++), BertScore (zhang2019bertscore)). They prove that a comparison based on the mean or the median of the scores across test utterances can be highly flawed. They instead advise relying on the Bradley-Terry (bradley1952rank) pairwise comparison method, which consists, for two systems A and B, in computing the proportion of utterances on which A achieves a better score than B. Their work is a significant advance but remains limited to pairwise comparisons.
Contribution 2. Our second contribution consists in going one step further than (peyrard2021better) by applying our ranking procedure to an arbitrarily large set of NLG systems with respect to a fixed set of criteria. Our evaluation methodology can be seen as a natural extension of (peyrard2021better) since it coincides with the latter in the particular case of pairwise comparison. In the more realistic multi-criteria scenario, we combine our two contributions and develop a two-stage ranking aggregation procedure which first aggregates along utterances and then along criteria.
Experiments. Our two contributions rely on our aggregation procedure, which is shown to be effective through several experiments.
-
We explain on a simple synthetic example the superiority of our approach over the mean-aggregation procedure and the pairwise-aggregation procedure, both in terms of consistency and robustness.
-
We use our ranking procedure on 10 multi-task / multi-criteria benchmarks and observe that it leads to different conclusions than the mean- and pairwise-aggregation procedures.
-
We argue our procedure is more robust by investigating its stability with respect to the addition of criteria and with respect to the addition of systems.
Our code and the collected data will be released (https://github.com/PierreColombo/RankingNLPSystems.git) to accelerate the adoption of what we believe is a reliable evaluation method for multi-task and multi-criteria benchmarks.
2 Problem Formulation and limitations of existing methods
2.1 General Considerations
When comparing system performances, two settings can be distinguished, depending on the granularity of the information at our disposal and on the way one wishes to use it. In general, each system is scored on each instance of a test set with respect to a given metric. The final (single) score of the system with respect to this metric is obtained through an aggregation procedure we will call instance-level aggregation, which has to be chosen by the practitioner (usually, the mean of the instance scores). Then, the final benchmark score of the system is obtained through an aggregation we call task-level aggregation of the scores of this system for each metric of the considered benchmark. See Fig. 1 for an illustration.

Notations. Suppose we are given $N$ systems evaluated on $T$ tasks, each task $t$ being associated with a metric and a test set made of $n_t$ instances. For every $i \in \{1,\dots,N\}$, $t \in \{1,\dots,T\}$ and $j \in \{1,\dots,n_t\}$, we denote by $s_{i,t,j}$ the score of system $i$ on the $j$-th instance of task $t$.
Instance-level aggregation. The performance of system $i$ on task $t$ is an aggregation of its scores on each instance. This aggregation is chosen by the practitioner and is usually the mean-aggregation defined by $s_{i,t} = \frac{1}{n_t}\sum_{j=1}^{n_t} s_{i,t,j}$.
Task-level aggregation. Sometimes, one only has access to the aggregated scores of each system on each task, that is, for every $i$ and $t$, to a score $s_{i,t}$ which corresponds to an instance-level aggregation for system $i$ on the instances of task $t$. From these quantities, one can then compute, for each system, a synthetic score reflecting the overall performance of this system across every considered task. Again, the usually-taken synthetic score of system $i$ is the mean-aggregation $s_i = \frac{1}{T}\sum_{t=1}^{T} s_{i,t}$.
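As a concrete illustration, here is a minimal Python sketch of these two mean-aggregation steps, assuming the scores are stored as a NumPy array of shape (systems, tasks, instances) with an equal number of instances per task (an assumption made only for brevity; all sizes and values are illustrative):

```python
import numpy as np

# Hypothetical score tensor: scores[i, t, j] is the score of system i on the
# j-th instance of task t (same number of instances per task for simplicity).
rng = np.random.default_rng(0)
N, T, n = 5, 3, 100
scores = rng.normal(size=(N, T, n))

# Instance-level mean-aggregation: s_{i,t} = (1/n_t) * sum_j s_{i,t,j}
task_scores = scores.mean(axis=2)            # shape (N, T)

# Task-level mean-aggregation: s_i = (1/T) * sum_t s_{i,t}
final_scores = task_scores.mean(axis=1)      # shape (N,)

# Mean-aggregation ranking: rank 0 is assigned to the best system.
sigma_mean = np.argsort(np.argsort(-final_scores))
print(sigma_mean)
```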
2.2 Problem Formulation
Ranking objective. When benchmarking systems, the goal is to output a ranking of the systems according to some objective criterion. Formally, we need to introduce the symmetric group on $N$ elements, denoted by $\mathfrak{S}_N$, whose elements are the permutations of $\{1,\dots,N\}$. Equipped with this notation, our goal is to output a permutation $\sigma \in \mathfrak{S}_N$ corresponding to the ranking of the systems. For instance, one reads $\sigma(i) = k$ as "system $i$ is the $k$-th best system". Depending on the granularity of the information at our disposal, we distinguish two problems.
Ranking Systems from Task Level Information.
Given a set of scores $\{s_{i,t}\}$ of $N$ systems on $T$ tasks, find a proper aggregation procedure.
Ranking Systems from Instance Level Information.
Given a set of scores $\{s_{i,t,j}\}$ of $N$ systems on the different instances of $T$ tasks, find a proper aggregation procedure.
2.3 Limitation of Existing Methods
Mean-aggregation procedure. The mean-aggregation procedure consists in taking the permutation that ranks the aggregated means $s_1, \dots, s_N$. This procedure suffers from several flaws. First, it is not well-suited when the metrics associated with the considered tasks are not on the same scale. Consider, for instance, the situation where one of the tasks (say task $t_0$) is associated with a metric that lives on a significantly larger scale than the others. In that case, the ranking obtained through mean-aggregation would essentially correspond to the ranking induced by task $t_0$. One could argue that a remedy would be to first normalize each metric so that everything is on the same scale. However, the resulting aggregation would still fail to capture each task's intrinsic difficulty. Worse, this procedure is impracticable when some metrics are unbounded – this is, for instance, the case of BARTScore (yuan2021bartscore). Finally, another weakness of the mean-aggregation ranking procedure is that the score of a system is computed irrespective of its performance relative to the others. This simple observation has been pointed out by (peyrard2021better), who advise, in the special case of two systems and one metric, to count the number of instances on which one system is better than the other.
Pairwise ranking. To be a bit more formal, the pairwise ranking aggregation proposed by (peyrard2021better) to rank two systems $A$ and $B$, whose scores on a given task $t$ are $(s_{A,t,j})_j$ and $(s_{B,t,j})_j$, consists in computing $p_{A \succ B} = \frac{1}{n_t}\sum_{j=1}^{n_t} \mathbb{1}\{s_{A,t,j} > s_{B,t,j}\}$.
Then, $A$ is declared better than $B$ if and only if $p_{A \succ B} > 1/2$. As explained by the authors, this method is more relevant than the mean-aggregation in the context of NLG evaluation. However, it is limited to the evaluation of two systems and does not apply to a general number of systems. A solution would be to compute a rank for each pair of systems and then aggregate these pairwise rankings. However, this would introduce a prohibitive quadratic (in the number of systems) factor in the computational complexity. Moreover, the conclusions of these pairwise rankings can be paradoxical. Tab. 1 below provides a toy example where three systems $A$, $B$, and $C$ are evaluated on 6 tasks: pairwise, $A$ beats $B$ and $B$ beats $C$, yet the mean-aggregation (SUM column) ranks the systems in the exact opposite order (a sketch of this computation is given after the table).
System | Task1 | Task2 | Task3 | Task4 | Task5 | Task6 | SUM |
A | 0.3 | 5 | 10 | 0.02 | 1.0 | 0.4 | 16.72 |
B | 0.1 | 4 | 13 | 0.01 | 2.2 | 0.3 | 19.61 |
C | 0.0 | 3 | 15 | 0.03 | 2.0 | 0.2 | 20.23 |
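The following minimal Python sketch computes the pairwise win proportions of (peyrard2021better) together with the score sums on the toy values of Tab. 1 (the function and variable names are ours, introduced only for illustration):

```python
import numpy as np

# Scores of the three systems on the 6 tasks of Tab. 1.
scores = {
    "A": np.array([0.3, 5, 10, 0.02, 1.0, 0.4]),
    "B": np.array([0.1, 4, 13, 0.01, 2.2, 0.3]),
    "C": np.array([0.0, 3, 15, 0.03, 2.0, 0.2]),
}

def pairwise_win_rate(x, y):
    """Proportion of tasks on which the first system gets a strictly better score."""
    return float(np.mean(x > y))

for a, b in [("A", "B"), ("B", "C"), ("A", "C")]:
    print(f"p({a} beats {b}) = {pairwise_win_rate(scores[a], scores[b]):.2f}")

# The sums used by the mean-aggregation tell a different story.
for name, s in scores.items():
    print(name, "sum =", round(float(s.sum()), 2))
```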
3 Ranking via Kemeny consensus
We now turn to the description of our methodology to rank an arbitrary number of systems on multi-task / multi-criteria benchmarks.
3.1 Kemeny Consensus
Let us consider the problem of ranking $N$ systems on $T$ tasks based on the scores $s_{i,t}$ of each system $i$ on each task $t$. We believe a robust approach to this problem consists in relying on the relative performance between systems on each task. More precisely, for each task $t$, we consider the permutation $\tau_t \in \mathfrak{S}_N$, where $\tau_t(i)$ corresponds to the rank of system $i$ on task $t$, scores being sorted in decreasing order. Then, we would like to find an appropriate procedure that aggregates the rankings $\tau_1, \dots, \tau_T$. More formally, we would like to define an aggregation function mapping $\mathfrak{S}_N^T$ to $\mathfrak{S}_N$.
This function, from a set of $T$ permutations corresponding to rankings, should return a final permutation that summarizes them. One difficulty is that the mean procedure makes no sense on the symmetric group, which is not a vector space. It turns out that a very natural choice consists in taking as aggregation the so-called Kemeny consensus (kemeny1959mathematics), which corresponds to computing a barycenter with respect to a distance on $\mathfrak{S}_N$.
Kemeny consensus.
Let $d_K$ be the Kendall distance on the symmetric group, defined for every $\sigma, \sigma' \in \mathfrak{S}_N$ by $d_K(\sigma, \sigma') = \sum_{i < j} \mathbb{1}\{(\sigma(i) - \sigma(j))(\sigma'(i) - \sigma'(j)) < 0\}$, i.e. the number of pairs on which $\sigma$ and $\sigma'$ disagree.
A Kemeny consensus of $\tau_1, \dots, \tau_T$ is a solution of the following minimization problem: $\sigma^\star \in \operatorname{arg\,min}_{\sigma \in \mathfrak{S}_N} \sum_{t=1}^{T} d_K(\sigma, \tau_t)$.
Why is the Kemeny consensus natural? As proved by young1978consistent, the Kemeny consensus aggregation procedure is the only rule that satisfies three natural properties: neutrality, meaning that it does not depend on the order of the tasks; consistency, meaning that if the tasks are split into two subsets and the aggregations on the two subsets both rank system $i$ above system $j$, then so does the overall aggregation; and the Condorcet criterion (nicolas1785essai), meaning that an item winning all its pairwise comparisons is ranked first. Moreover, the Kemeny consensus is also the maximum likelihood estimator of the widely-used Mallows statistical model on the symmetric group (young1988condorcet).
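For small numbers of systems, the Kemeny consensus can be computed exactly by exhaustive search over $\mathfrak{S}_N$. The sketch below is illustrative only (rankings are encoded as vectors of ranks, 0 denoting the best system):

```python
import itertools
import numpy as np

def kendall_distance(sigma, tau):
    """Number of pairs (i, j) ranked in opposite order by sigma and tau.

    Rankings are encoded as rank vectors: sigma[i] is the rank of system i.
    """
    n = len(sigma)
    return sum(
        (sigma[i] - sigma[j]) * (tau[i] - tau[j]) < 0
        for i in range(n) for j in range(i + 1, n)
    )

def kemeny_consensus(rankings):
    """Exact Kemeny consensus by exhaustive search (feasible only for small N)."""
    n = len(rankings[0])
    best, best_cost = None, float("inf")
    for perm in itertools.permutations(range(n)):
        cost = sum(kendall_distance(perm, r) for r in rankings)
        if cost < best_cost:
            best, best_cost = perm, cost
    return np.array(best), best_cost

# Rankings of N=4 systems induced by T=3 tasks (rank 0 = best).
rankings = [np.array([0, 1, 2, 3]), np.array([1, 0, 2, 3]), np.array([0, 2, 1, 3])]
consensus, cost = kemeny_consensus(rankings)
print(consensus, cost)
```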
3.2 Borda’s count approximation
While the Kemeny consensus is the ideal objective one would like to obtain, its computation is, in general, an NP-hard problem (bartholdi1989computational; dwork2001rank) – although some regularity assumptions, rarely satisfied in practice, can speed up the computation, see for instance (betzler2009similarity) and (brandt2010bypassing). Fortunately, there exist many ways to obtain satisfying approximations of the latter: see for example (ali2012experiments) for a comprehensive empirical study. For our experiments, we choose the so-called Borda's count procedure, defined hereafter for the instance-level and/or task-level aggregation.
Borda's count.
The Borda's count consists, from a set of permutations $\tau_1, \dots, \tau_T$ corresponding to the rankings of the $N$ systems across tasks or instances, in summing the ranks of each system and then ranking the obtained sums. Formally, it proceeds in two steps (see also the sketch below):
-
Compute for every $i$, $B_i = \sum_{t=1}^{T} \tau_t(i)$;
-
Output the permutation $\sigma_{BC}$ that ranks the sums $B_1, \dots, B_N$ in increasing order ($\sigma_{BC}(i) < \sigma_{BC}(j)$ if $B_i < B_j$).
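A minimal NumPy implementation of this two-step procedure could look as follows (rankings are encoded as vectors of ranks, 0 being the best; ties in the sums are broken arbitrarily):

```python
import numpy as np

def borda_count(rankings):
    """Borda's count: sum each system's ranks, then rank the sums (0 = best)."""
    sums = np.sum(np.asarray(rankings), axis=0)          # step 1: B_i
    return np.argsort(np.argsort(sums))                  # step 2: rank the sums

# Example: rankings induced by T=3 tasks on N=4 systems (rank 0 = best).
rankings = [np.array([0, 1, 2, 3]), np.array([1, 0, 2, 3]), np.array([0, 2, 1, 3])]
print(borda_count(rankings))
```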
There are at least four reasons for choosing the Borda's count procedure. First, it coincides with the pairwise ranking procedure in the case of two systems, making it a natural generalization. Second, there exists a theoretical result assessing that it is a constant-factor approximation of the Kemeny consensus (coppersmith2006ordering) with respect to the Kendall distance (fagin2003comparing). Third, it is an unbiased estimator of the Kemeny consensus with low sample complexity for data distributed according to standard ranking models such as Mallows (Caragiannis2013; Fligner1988). Fourth, from a practical perspective, (ali2012experiments) observe that it is efficient, accurate, and considerably faster than the other approximation algorithms. We are now in a position to give our answers to the initial ranking problems from Task Level Information and from Instance Level Information.
3.3 Our Ranking Procedures
How to rank Systems from Task Level Information.
For every $t$, let $\tau_t$ be the permutation that ranks the scores $s_{1,t}, \dots, s_{N,t}$. Our aggregation procedure ($\sigma^{l}$) outputs $\sigma^{l} = \mathrm{BC}(\tau_1, \dots, \tau_T)$, where $\mathrm{BC}$ denotes the Borda's count.
How to rank Systems from Instance Level Information.
We actually give two different procedures. For every task $t$ and every instance $j$ of that task, let $\tau_{t,j}$ be the permutation that ranks the scores $s_{1,t,j}, \dots, s_{N,t,j}$. See Figure 2 for an illustration.
Two-level aggregation ($\sigma^{2l}$). This procedure
-
Computes for each task $t$, $\sigma_t = \mathrm{BC}(\tau_{t,1}, \dots, \tau_{t,n_t})$;
-
Outputs $\sigma^{2l} = \mathrm{BC}(\sigma_1, \dots, \sigma_T)$.
One-level aggregation ($\sigma^{1l}$). This procedure outputs $\sigma^{1l} = \mathrm{BC}(\tau_{1,1}, \dots, \tau_{1,n_1}, \dots, \tau_{T,1}, \dots, \tau_{T,n_T})$, i.e. the Borda's count applied directly to all instance-level rankings.
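The three procedures can be sketched end-to-end in a few lines of NumPy. In this illustration we assume that the task-level scores fed to $\sigma^{l}$ are the instance-level means and that every task has the same number of instances; both are simplifications made only for the example:

```python
import numpy as np

def rank(values):
    """0-indexed ranks, where rank 0 corresponds to the highest value."""
    return np.argsort(np.argsort(-values))

def borda(rankings):
    """Borda's count: sum the ranks over the input rankings and rank the sums."""
    return np.argsort(np.argsort(np.sum(rankings, axis=0)))

# Hypothetical scores[i, t, j]: system i, task t, instance j.
rng = np.random.default_rng(1)
scores = rng.normal(size=(6, 4, 50))     # N=6 systems, T=4 tasks, 50 instances each
N, T, n = scores.shape

# Task-level procedure sigma^l: Borda over the T task-level rankings
# (here the task-level scores are taken to be the instance-level means).
task_rankings = np.stack([rank(scores[:, t, :].mean(axis=1)) for t in range(T)])
sigma_l = borda(task_rankings)

# Two-level aggregation sigma^{2l}: Borda within each task, then Borda across tasks.
per_task = np.stack(
    [borda(np.stack([rank(scores[:, t, j]) for j in range(n)])) for t in range(T)]
)
sigma_2l = borda(per_task)

# One-level aggregation sigma^{1l}: Borda directly over all instance-level rankings.
all_instance_rankings = np.stack(
    [rank(scores[:, t, j]) for t in range(T) for j in range(n)]
)
sigma_1l = borda(all_instance_rankings)

print(sigma_l, sigma_2l, sigma_1l)
```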
3.4 How to compare rankings
The rest of the paper is dedicated to synthetic and empirical experiments, through which we demonstrate the soundness of our approach. In order to obtain quantitative results, one needs to be able to compare different rankings quantitatively. Two measures can be used for that purpose: (1) the Kendall distance and (2) the Kendall $\tau$ correlation (kendall1938new; kendall1945treatment). The Kendall distance counts the number of inversions between two permutations and is therefore well adapted to our purpose of ranking systems. The values of $\tau$ range from $-1$ to $1$, where a value of 1 corresponds to strong agreement and a value close to $-1$ indicates strong disagreement.
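Both quantities are straightforward to compute; a small sketch (using scipy.stats.kendalltau for the correlation, on toy rank vectors of our own) is given below:

```python
import numpy as np
from scipy.stats import kendalltau

def kendall_distance(sigma, tau):
    """Number of discordant pairs (inversions) between two rankings."""
    n = len(sigma)
    return sum(
        (sigma[i] - sigma[j]) * (tau[i] - tau[j]) < 0
        for i in range(n) for j in range(i + 1, n)
    )

sigma = np.array([0, 1, 2, 3, 4])        # rank vectors: sigma[i] = rank of system i
tau = np.array([1, 0, 2, 4, 3])

print("Kendall distance:", kendall_distance(sigma, tau))
tau_corr, _ = kendalltau(sigma, tau)
print("Kendall tau correlation:", tau_corr)
```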
4 Synthetic Experiments
In this section, we validate on simulated data the performance of our method on two criteria: robustness to manipulation and robustness to scaling.
4.1 Data Generation
The toy experiment is carried out on synthetic scores for $N$ systems, $T$ tasks and $n$ instances per task. For each $i \in \{1,\dots,N\}$, we model the performance of system $i$ by a Gumbel random variable $G_i$ centered at $i$ with scale $\beta$, where $\beta$ is a dispersion parameter. The scores of system $i$, $(s_{i,t,j})_{t,j}$, are i.i.d. samples of $G_i$, and the scores of different systems are sampled independently. Since $G_i - G_{i'}$ follows a logistic distribution with mean $i - i'$ and scale $\beta$, this implies that, for $i > i'$, the probability that system $i$ performs better than system $i'$ is at least $1/2$ (and increases as $\beta$ decreases). Therefore, for every task-instance pair, the induced ranking of systems is a noisy realization of the ground-truth ranking (system $N$ first, system $1$ last), with a noise level controlled by the dispersion parameter $\beta$.
Extreme scenarios correspond to the choices $\beta \to \infty$ and $\beta \to 0$. More precisely, $\beta \to \infty$ implies that all scores have essentially the same distribution, whereas $\beta \to 0$ induces a strong consensus, i.e., a clear system ranking emerges.
Remark 1.
Sampling according to the described procedure is equivalent to sampling the ranking of the systems from the well-known Plackett-Luce distribution (duncan1959individual; plackett1975analysis) with weights $w_i \propto e^{i/\beta}$. Interestingly, this distribution over rankings can be seen both from a utilitarian perspective, in which the scores are real numbers, and from a ranking-model perspective, in which the ranking of systems has a known distribution.
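A compact sketch of this generation scheme, using NumPy's Gumbel sampler (function and variable names are ours):

```python
import numpy as np

def sample_scores(n_systems, n_tasks, n_instances, beta, seed=0):
    """Scores of system i ~ Gumbel(loc=i, scale=beta), i.i.d. over tasks/instances."""
    rng = np.random.default_rng(seed)
    locs = np.arange(n_systems, dtype=float)[:, None, None]   # center of system i is i
    return rng.gumbel(loc=locs, scale=beta,
                      size=(n_systems, n_tasks, n_instances))

scores = sample_scores(n_systems=10, n_tasks=20, n_instances=50, beta=1.0)

# With a small beta, the ranking induced by any (task, instance) pair is close to
# the ground truth (the system with the largest center is the best).
print(np.argsort(np.argsort(-scores[:, 0, 0])))
```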
4.2 Robustness to manipulation
Setting. To test the robustness of our ranking procedures, we analyze their stability with respect to perturbations of the scores. More precisely, our way to corrupt the scores of a given task $t$ consists in sampling $(s_{i,t,j})_j$ as i.i.d. samples of a Gumbel distribution centered at $-i$ (with the same scale $\beta$). This implies that, for that task $t$, the underlying ranking is the reverse of the ground truth, namely its exact opposite. The robustness-to-manipulation analysis shows how the error on the final ranking of systems increases as the scores of some tasks are 'corrupted'. Here, the error is computed as the normalized Kendall distance between the ground-truth ranking and the ranking of systems obtained from the corrupted scores.
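A self-contained sketch of this corruption experiment for the mean-aggregation and the two-level aggregation is given below; the normalization of the Kendall distance via $(1-\tau)/2$ and all numerical sizes are our own choices, made only for illustration:

```python
import numpy as np
from scipy.stats import kendalltau

def rank(values):
    """0-indexed ranks; rank 0 goes to the highest value."""
    return np.argsort(np.argsort(-values))

def borda(rankings):
    """Borda's count: sum ranks over the input rankings and rank the sums."""
    return np.argsort(np.argsort(np.sum(rankings, axis=0)))

def normalized_kendall_error(sigma, sigma_star):
    """(1 - tau) / 2: 0 for identical rankings, 1 for fully reversed ones."""
    tau_corr, _ = kendalltau(sigma, sigma_star)
    return (1 - tau_corr) / 2

N, T, n, beta = 10, 20, 50, 1.0                     # arbitrary illustrative sizes
rng = np.random.default_rng(0)
ground_truth = rank(np.arange(N, dtype=float))      # system N-1 is the best

def corrupted_scores(n_corrupted):
    locs = np.arange(N, dtype=float)[:, None, None]
    scores = rng.gumbel(loc=locs, scale=beta, size=(N, T, n))
    # Corrupted tasks: Gumbel centered at -i, i.e. the reversed ground truth.
    scores[:, :n_corrupted, :] = rng.gumbel(loc=-locs, scale=beta,
                                            size=(N, n_corrupted, n))
    return scores

for k in [0, 2, 5, 10]:
    s = corrupted_scores(k)
    sigma_mean = rank(s.mean(axis=(1, 2)))
    sigma_2l = borda(np.stack(
        [borda(np.stack([rank(s[:, t, j]) for j in range(n)])) for t in range(T)]
    ))
    print(k, normalized_kendall_error(sigma_mean, ground_truth),
          normalized_kendall_error(sigma_2l, ground_truth))
```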
Results: For each of the considered methods, $\sigma^{\mathrm{mean}}$, $\sigma^{1l}$ and $\sigma^{2l}$, we report in Fig. 7 the results of the robustness analysis for several values of the dispersion parameter and a varying number of corrupted tasks. The results show that $\sigma^{2l}$ outperforms $\sigma^{1l}$, which in turn consistently outperforms $\sigma^{\mathrm{mean}}$. Overall, for the same number of corrupted tasks and the same dispersion, the error of $\sigma^{2l}$ is always the smallest.
Moreover, the score-based method $\sigma^{\mathrm{mean}}$ gets an error larger than 0.75 when just 2, 3, and 5 tasks (depending on the dispersion) have been corrupted, while for $\sigma^{1l}$ the same error is only reached with 5, 7 and 10 corrupted tasks. The most robust method is the two-level aggregation $\sigma^{2l}$, for which 10, 11 and 11 tasks have to be corrupted to reach the same error of 0.75.
Takeaways: We conclude that the ranking-based methods are more robust than $\sigma^{\mathrm{mean}}$. In particular, the two-level aggregation $\sigma^{2l}$ is the most robust aggregation procedure.
4.3 Robustness to scaling
To further compare the aggregation procedures, we corrupt the scores of a given task by re-scaling them by a factor $\lambda$. Whereas this does not affect our ranking procedures (every ranking induced by a task-instance pair remains the same), it increasingly perturbs the mean-aggregation procedure as $\lambda$ increases. Re-scaling the scores by a factor of 2 already produces an error larger than 90%; for larger values of the dispersion parameter, re-scaling by a factor of 7 produces the same error (see Fig. 19 for detailed results).
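The invariance of rank-based aggregation to such re-scaling is easy to check numerically; a minimal sketch (with arbitrary toy scores) follows:

```python
import numpy as np

def rank(values):                        # rank 0 = best (highest value)
    return np.argsort(np.argsort(-values))

def borda(rankings):                     # Borda's count over a stack of rankings
    return np.argsort(np.argsort(np.sum(rankings, axis=0)))

rng = np.random.default_rng(0)
task_scores = rng.uniform(size=(8, 5))   # toy table: 8 systems, 5 tasks
N, T = task_scores.shape

for lam in [1, 2, 10, 100]:
    rescaled = task_scores.copy()
    rescaled[:, 0] *= lam                # rescale the scores of the first task
    sigma_mean = rank(rescaled.mean(axis=1))                                # changes
    sigma_rank = borda(np.stack([rank(rescaled[:, t]) for t in range(T)]))  # does not
    print(lam, sigma_mean, sigma_rank)
```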
Takeaways: Re-scaling one task's scores by an arbitrarily large factor will always produce an arbitrarily large error for the mean aggregation, while not affecting ranking-based aggregation.
5 Empirical Experiments
In this section, we present our results on real evaluation scores. Our large-scale experiments rely on real evaluation scores (over 270k scores), which are described in Ssec. 5.1. In Ssec. 5.2 we gather experimental results for Ranking Systems from Task Level Information, while Ssec. 5.3 is dedicated to the problem of Ranking Systems from Instance Level Information.
5.1 Data Collection
5.1.1 Datasets with Task Level Information
We collect the results of GLUE (wang2018glue), SGLUE (wang2019superglue) (results can be found at https://super.gluebenchmark.com/) and XTREME (hu2020xtreme).
For GLUE, the dataset is composed of systems that are evaluated on 9 different tasks: CoLA (warstadt2019neural), SST-2 (socher2013recursive), MRPC (dolan2005automatically), STS-B (cer2017semeval), QQP, MNLI (williams2017broad), QNLI (rajpurkar2016squad), RTE (dagan2005pascal; giampiccolo2007third; bentivogli2009fifth) and WNLI (levesque2012winograd). For SGLUE, the final dataset gathers scores from systems that are evaluated on 10 different tasks: BoolQ (clark2019boolq), CB (de2019commitmentbank), COPA (roemmele2011choice), MultiRC (khashabi2018looking), ReCoRD (zhang2018record), RTE, WiC (pilehvar2018wic), WSC and its derivatives AX-b and AX-g (levesque2012winograd).
The XTREME benchmark is composed of systems and includes tasks such as sentence classification (using XNLI (xnli; mnli) and PAWS-X (paws_1; paws_2)), structured prediction (relying on Universal Dependencies v2.5 (universal_2) and Wikiann (pos_2; pos)), sentence retrieval (with BUCC (bucc_1; bucc_2) and Tatoeba (artetxe2019massively)) and question answering (via XQuAD (squad_2; rajpurkar2016squad), MLQA (lewis2019mlqa), TyDiQA-GoldP (clark2020tydi)). For all benchmarks, various types of metrics with various scales are reported (e.g. accuracy, F1, correlation).
5.1.2 Datasets with Instance-level information
In this setting, we focus on NLG evaluation, as these scores are among the easiest to collect. We consider four different tasks: summary evaluation, image description, dialogue and translation. For summary evaluation, we use TAC08 (dang2008overview), TAC09, TAC11 (owczarzak2011overview), RSUM (bhandari2020re) and SEVAL (fabbri2021summeval). For sentence-based image description we rely on FLICKR (young2014image), and for dialogue we use PersonaChat (PC) and TopicalChat (TC) (mehri2020usr). Finally, for machine translation, we rely on the multilingual quality estimation dataset (MLQE) introduced in ranasinghe2021exploratory. For all datasets except MLQE, we consider automatic metrics based on S3 (both variants, pyr/resp) (peyrard2017learning), ROUGE (lin2004rouge) (including 5 of its variants (ng2015better)), JS [1-2] (lin2006information), Chrfpp (popovic2017chrf++), BLEU, BERTScore (zhang2019bertscore) and MoverScore (zhao2019moverscore). For MLQE, we solely consider several versions of BERTScore, MoverScore and ContrastScore. We also add human evaluation, which is specific to each dataset. All details corresponding to these datasets can be found in Appendix A, and the datasets will be made available at https://github.com/PierreColombo/RankingNLPSystems.git.


5.2 Task-level Aggregation Experiments
In this section, we address the aggregation problem when task-level information is available. We first study the final ranking obtained by the different methods on GLUE, SGLUE, and XTREME. Then, we assess the robustness of $\sigma^{l}$ when removing tasks.
5.2.1 Comparison with mean-aggregation
Setting. To compare the rankings $\sigma^{l}$ and $\sigma^{\mathrm{mean}}$, we compute (i) the agreement rate (in %), which is the proportion of common top-ranked systems between $\sigma^{l}$ and $\sigma^{\mathrm{mean}}$, and (ii) the Kendall $\tau$ correlation between the rankings.
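For completeness, these two quantities can be computed as in the following sketch (toy rankings of our own, rank 0 denoting the best system):

```python
import numpy as np
from scipy.stats import kendalltau

def top_k_agreement(sigma_a, sigma_b, k):
    """Proportion of systems appearing in the top-k of both rankings."""
    top_a = set(np.argsort(sigma_a)[:k])      # indices of the k best systems
    top_b = set(np.argsort(sigma_b)[:k])
    return len(top_a & top_b) / k

# Toy rankings of 8 systems (rank 0 = best system).
sigma_l = np.array([0, 1, 2, 3, 5, 4, 6, 7])
sigma_mean = np.array([0, 2, 1, 3, 4, 5, 7, 6])

print("Top-3 agreement:", top_k_agreement(sigma_l, sigma_mean, 3))
tau_corr, _ = kendalltau(sigma_l, sigma_mean)
print("Kendall tau correlation:", round(tau_corr, 2))
```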
Results. In Tab. 2, we compare the rankings of the aforementioned methods for the Top K systems (strongest systems) and the Last K systems (weakest systems). For the three benchmarks, we observe a high correlation between the final rankings (correlation values between 0.82 and 0.92). At a finer level, we also observe that the methods tend to agree on which are the best/worst systems. However, although $\sigma^{l}$ and $\sigma^{\mathrm{mean}}$ agree on the best/worst systems, they do not rank them in the same order (see Tab. 3). For instance, on XTREME, the third-best system according to one aggregation (rank 2) is only the sixth-best according to the other.
Takeaways. When changing the aggregation function, the answer to our initial question "what are the best systems?" varies.
Dataset | Top 1 | Top 3 | Top 5 | Top 10 |
XT. | 1 | 0.66 | 0.8 | 0.9 |
GLUE | 1 | 1 | 0.8 | 0.8 |
SGLUE | 1 | 1 | 0.8 | 0.9 |
Dataset | Last 3 | Last 5 | Last 10 | $\tau$ |
XT. | 1 | 0.8 | 0.9 | 0.82 |
GLUE | 1 | 0.8 | 0.7 | 0.92 |
SGLUE | 1 | 1 | 1 | 0.91 |
GLUE | XTREM | ||||
Team | Team | ||||
0 | Ms Alex | 0 | 0 | ULR | 0 |
1 | ERNIE | 1 | 1 | CoFe | 1 |
2 | DEBERTA | 2 | 2 | InfoLXL | 3 |
3 | AliceMind | 3 | 3 | VECO | 4 |
4 | PING-AH | 5 | 4 | Unicoder | 5 |
5 | HFL | 4 | 5 | PolyGlot | 2 |
6 | T5 | 6 | 6 | ULR-v2 | 6 |
7 | DIRL | 10 | 7 | HiCTL | 8 |
8 | Zihan | 7 | 8 | Ernie | 7 |
9 | ELECTRA | 11 | 9 | Anony | 10 |
5.2.2 How does the addition/removal of new tasks/metrics affect the ranking?
When building a benchmark, practitioners can always add new tasks to refine the model performance assessment (it boils down to adding a new column in Tab. 1). In this experiment, we analyze how adding and removing tasks affect the rankings of the aggregation procedures.
Setting. We compare the rankings obtained when considering a subset of the tasks with the ones obtained using all the tasks. Formally, for a given number of tasks $T' \le T$, we randomly sample $T'$ tasks, compute the rankings obtained by our procedure and by the mean procedure, $\sigma^{l}_{T'}$ and $\sigma^{\mathrm{mean}}_{T'}$, on these tasks, and finally compute the Kendall $\tau$ correlation between $\sigma^{l}_{T'}$ (resp. $\sigma^{\mathrm{mean}}_{T'}$) and the "ground truth" $\sigma^{l}$ (resp. $\sigma^{\mathrm{mean}}$) computed on all tasks. We repeat this random sampling 100 times to obtain a mean/variance plot of the correlations as a function of the number of sub-tasks (a sketch of this protocol is given below).
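A minimal sketch of this subsampling protocol on a toy score table (all sizes and values are arbitrary and chosen only for illustration):

```python
import numpy as np
from scipy.stats import kendalltau

def rank(values):                                   # rank 0 = best (highest value)
    return np.argsort(np.argsort(-values))

def borda(rankings):                                # Borda's count over rankings
    return np.argsort(np.argsort(np.sum(rankings, axis=0)))

rng = np.random.default_rng(0)
task_scores = rng.uniform(size=(20, 12))            # toy table: 20 systems, 12 tasks
N, T = task_scores.shape

sigma_l_full = borda(np.stack([rank(task_scores[:, t]) for t in range(T)]))
sigma_mean_full = rank(task_scores.mean(axis=1))

def subset_correlations(t_sub, n_repeats=100):
    """Mean Kendall tau between rankings on random task subsets and on all tasks."""
    corr_l, corr_mean = [], []
    for _ in range(n_repeats):
        tasks = rng.choice(T, size=t_sub, replace=False)
        sub = task_scores[:, tasks]
        sigma_l = borda(np.stack([rank(sub[:, t]) for t in range(t_sub)]))
        sigma_mean = rank(sub.mean(axis=1))
        tau_l, _ = kendalltau(sigma_l, sigma_l_full)
        tau_m, _ = kendalltau(sigma_mean, sigma_mean_full)
        corr_l.append(tau_l)
        corr_mean.append(tau_m)
    return float(np.mean(corr_l)), float(np.mean(corr_mean))

for t_sub in [3, 6, 9]:
    print(t_sub, subset_correlations(t_sub))
```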
Results. We report in Fig. 18(a,f) the results obtained for varying subset sizes. Interestingly, we observe that the correlation between $\sigma^{l}_{T'}$ and $\sigma^{l}$ is consistently higher than the one between $\sigma^{\mathrm{mean}}_{T'}$ and $\sigma^{\mathrm{mean}}$. This difference is particularly visible when only a fraction of the tasks is sampled. We observe a similar behavior when considering SGLUE (see Fig. 23).
Takeaways. The ranking obtained with $\sigma^{l}$ is more robust to task addition/drop than the one obtained with $\sigma^{\mathrm{mean}}$.
5.3 Instance-level Aggregation Experiments
For instance-level aggregation, we conduct experiments on the 9 aforementioned datasets. We study both the final ranking obtained for each aggregation (i.e. $\sigma^{\mathrm{mean}}$, $\sigma^{2l}$ and $\sigma^{1l}$) and the effect of task addition/removal.
PC | TC | FLI. | MLQE | |
-0.08 | -0.01 | 0 | -0.03 | |
0.32 | 0.27 | 0.29 | 0.01 | |
-0.10 | -0.15 | -0.04 | 0.00 | |
RSUM | SEVAL | TAC08 | TAC09 | TAC11 |
0.04 | 0.14 | 0.28 | 0.06 | -0.06 |
0.07 | 0.52 | 0.32 | 0.37 | 0.37 |
0 | 0.10 | 0.23 | 0.19 | 0.07 |
5.3.1 Global analysis
When conducting our experiments, we observe that the three different aggregation procedures lead to three different state-of-the-art systems on 8 datasets out of 9. Furthermore, they never agree on the top 3 systems. In what follows, we conduct a finer analysis to compare $\sigma^{\mathrm{mean}}$, $\sigma^{2l}$ and $\sigma^{1l}$.
Setting. We compare the obtained rankings by (1) computing the Kendall $\tau$ correlation between them (see Tab. 4), (2) counting the number of agreements among the top N systems, and (3) computing the Kendall $\tau$ correlation restricted to the common top N systems.
Results. When considering the agreement analysis of Fig. 7(a), we observe that $\sigma^{\mathrm{mean}}$ and $\sigma^{2l}$ select a high number of common top systems. However, the correlation of the rankings induced by $\sigma^{\mathrm{mean}}$ and $\sigma^{2l}$ on these top systems is low (see Fig. 7(b)). This is also the case for the correlation of the entire rankings (see Fig. 7). In short, $\sigma^{\mathrm{mean}}$ and $\sigma^{2l}$ select similar systems but rank them differently. A similar analysis shows that $\sigma^{1l}$ disagrees with $\sigma^{\mathrm{mean}}$ and $\sigma^{2l}$ both on the top systems and on their order.
Takeaways. $\sigma^{2l}$ exhibits a more similar behavior than $\sigma^{1l}$ with respect to $\sigma^{\mathrm{mean}}$.
5.3.2 What is the impact of removing/adding tasks?
In NLG, different metrics (i.e. tasks in our formalism) assess the quality of a generated sentence along different axes. As adding a new metric may affect the final ranking, we investigate the impact of task addition/removal on the final system ordering.
Setting. Similarly to Sssec. 5.2.2, we study the correlation between the ranking computed on a subset of tasks and the ground-truth ranking (computed on all tasks), for each of the three procedures.
Results. We observe that, for all datasets, both $\sigma^{2l}$ and $\sigma^{1l}$ obtain higher correlation and lower variance than $\sigma^{\mathrm{mean}}$ when adding/removing tasks. Results for RSUM show a similar trend (see Fig. 33).
Takeaways. The rankings obtained with either $\sigma^{2l}$ or $\sigma^{1l}$ are more robust to task addition/drop than the one obtained with $\sigma^{\mathrm{mean}}$.
6 Conclusion and Future Works
In this paper, we introduced new aggregation procedures to rank systems when either task level scores or instance level scores are available. Our methods, which are theoretically grounded and rely on Kemeny ranking consensus, address fundamental flaws of the widely used arithmetic mean.
We conducted extensive numerical experiments, which show that our methods are both more reliable and more robust than the mean aggregation, while leading to different conclusions on which are the best systems. Overall, when task-level (resp. instance-level) information is available, we would recommend using the $\sigma^{l}$ (resp. $\sigma^{2l}$) aggregation procedure rather than $\sigma^{\mathrm{mean}}$ (resp. $\sigma^{\mathrm{mean}}$ and $\sigma^{1l}$).
Although we focused on NLP benchmarks, our methodology could be applied to other modalities (e.g. Computer Vision, Audio). Another interesting avenue would be to consider a weighted version of the Kemeny consensus, where the weights could reflect task-specific criteria (e.g. fairness, robustness, complexity).
7 Acknowledgements and Funding
We thank Maxime Peyrard, Saibo Geng, and Wei Zhao for sharing their collected datasets. A warm thanks to Matthieu Labeau for his sharp eyes and excellent comments when reading our manuscript. This work was granted access to the HPC resources of IDRIS under the allocations 2021-101838 and 2021-101913 made by GENCI. Nathan is funded by the ANR project LIMPID.
References
Appendix A Details on Datasets
We report in Tab. 6 global statistics of the data described in Ssec. 5.1. Overall, our conclusions are drawn from a total of over 270k scores.
Dataset | # of systems | # of tasks |
GLUE | 105 | 15 |
SGLUE | 24 | 15 |
XTREM | 15 | 5 |
Dataset | # of tasks | # of instances |
PC | 19 | 240 |
TC | 19 | 300 |
FLICKR | 14 | 864 |
MLQE | 10 | 7000 |
RSUM | 15 | 2500 |
SEVAL | 17 | 1600 |
TAC08 | 15 | 2976 |
TAC09 | 15 | 2596 |
TAC11 | 15 | 2376 |
Appendix B Additional Experiments
In this section, we report additional experimental results, including the details of the robustness-to-scaling experiment (see Ssec. B.1), the complete rankings on GLUE, SGLUE and XTREME (see Ssec. B.2), and the complete results of the experiments on adding/removing metrics/tasks when Task Level Information (see Sssec. B.3.1) and Instance Level Information (see Sssec. B.3.2) is available. In the main paper we only report the aggregated score for the agreement analysis when instance-level information is available; detailed results are reported in Ssec. B.4.
B.1 Toy Experiment on Scale
We display in Fig. 19 the results of the toy experiment on scaling robustness. When corrupting one task by rescaling, we see that the error of the ranking induced by $\sigma^{\mathrm{mean}}$ increases to 1, while the error of the ranking-based aggregations remains constant.

B.2 Ranking of SGLUE
We display in Tab. 7 the resulting rankings on the three considered benchmarks. Although the aggregation procedures tend to agree on the good and bad systems, the rankings vary when changing the aggregation function. Thus, the answer to the initial question "what are the best systems?" might change.
GLUE | SGLUE | XTREM | ||||||
Team | Team | Team | ||||||
0 | Ms Alex | 0 | 0 | Liam | 0 | 0 | ULR | 0 |
1 | ERNIE | 1 | 1 | Ms Alex | 1 | 1 | CoFe | 1 |
2 | DEBERTA | 2 | 2 | ERNIE | 2 | 2 | InfoLXL | 3 |
3 | AliceMind | 3 | 3 | HUMAN | 3 | 3 | VECO | 4 |
4 | PING-AH | 5 | 4 | DEBERTA | 5 | 4 | Unicoder | 5 |
5 | HFL | 4 | 5 | Zirui | 4 | 5 | PolyGlot | 2 |
6 | T5 | 6 | 6 | T5 | 6 | 6 | ULR-v2 | 6 |
7 | DIRL | 10 | 7 | Alibaba | 7 | 7 | HiCTL | 8 |
8 | Zihan | 7 | 8 | Anuar | 8 | 8 | Ernie | 7 |
9 | ELECTRA | 11 | 9 | Huawei | 11 | 9 | Anony | 10 |
B.3 Complete results on the task addition/removal experiments
B.3.1 Task Level Aggregation
Results of task addition/removal experiments when Task level information is available are reported in Fig. 23. Overall, we observe that ranking-based aggregation is more robust than mean-based aggregation.
B.3.2 Instance Level Aggregation
Results of task addition/removal experiments when instance-level information is available are reported in Fig. 33. Overall, we observe that ranking-based aggregation is more robust than mean-based aggregation.
B.4 Agreement analysis on Instance Level Aggregation
To further complete the agreement analysis of Sssec. 5.3.1, we report results on individual tasks. Fig. 39 reports the correlation of the rankings restricted to the top K systems for different values of K, while Fig. 45 reports the agreement analysis.
B.5 Possible Extensions
In the future, we would like to extend the proposed ranking procedures to:
-
Sequence labelling tasks (chapuis2020hierarchical; chapuis2021code; colombo2020guiding), where instances are sequences, thus ordering matters.
-
Emotion classification (dinkar2020importance; garcia2019token; colombo2021improving; witon2018disney), where scores are continuous.
-
NLG (colombo2021beam; colombo2021automatic; colombo2021novel; staerman2021depth; colombo2021learning; colombo2019affect), where instances are sentences and both ordering and content matter.
Appendix C Dispersion Analysis
In this section, we introduce the notion of dispersion across a set of different rankings $\tau_1, \dots, \tau_T$.
C.1 Dispersion as a measure of performance
Suppose we have two ranking candidates, $\sigma$ and $\sigma'$, to summarize $\tau_1, \dots, \tau_T$. A natural question consists in measuring the performance of $\sigma$ and $\sigma'$. Denoting by $d_K$ the Kendall distance, a natural measure is
$$\widehat{D}(\sigma) = \frac{1}{T}\sum_{t=1}^{T} d_K(\sigma, \tau_t). \qquad (1)$$
Of course, in our task-level framework, our candidate $\sigma^{l}$ will achieve better performance than the mean-based candidate since it is designed to (approximately) minimize this quantity. From a probabilistic point of view, one can see $\tau_1, \dots, \tau_T$ as i.i.d. realizations of a random variable $\tau$ on the symmetric group $\mathfrak{S}_N$. Denoting by $P$ its law and $\mathbb{E}$ the associated expectation, Equation (1) is an approximation of
$$D(\sigma) = \mathbb{E}\left[d_K(\sigma, \tau)\right]. \qquad (2)$$
C.2 Dispersion to measure ranking complexity
In order to take into account the intrinsic complexity of ranking $\tau_1, \dots, \tau_T$, a natural approach is to compute some kind of dispersion among these permutations. Using the same notations as before, one can rely on the pairwise distance
$$\widehat{\Delta} = \frac{2}{T(T-1)} \sum_{t < t'} d_K(\tau_t, \tau_{t'}). \qquad (3)$$
From a practical perspective, the computational complexity of Equation (3) is quadratic in the number of rankings $T$. Nevertheless, it is possible to efficiently approximate this value via empirical stopping algorithms based on Bernstein or Hoeffding bounds (mnih2008empirical; Korba2017). Notice that the pairwise distance has a solid theoretical foundation as a measure of spread. It is an empirical approximation of $\mathbb{E}[d_K(\tau, \tau')]$, where $\tau$ and $\tau'$ are independent with law $P$. Therefore, both are directly related to the noise level in different probabilistic models for permutations (gMallows). Moreover, it is the basis of statistical uniformity tests for ranked data (busa2021private).
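As an illustration, the empirical quantities in Equations (1) and (3) can be computed as follows (a minimal sketch on toy rankings; the exact pairwise computation is used here rather than the stopping-rule approximations mentioned above):

```python
import itertools
import numpy as np

def kendall_distance(sigma, tau):
    n = len(sigma)
    return sum((sigma[i] - sigma[j]) * (tau[i] - tau[j]) < 0
               for i in range(n) for j in range(i + 1, n))

def performance(sigma, rankings):
    """Empirical counterpart of Eq. (1): average Kendall distance to the rankings."""
    return float(np.mean([kendall_distance(sigma, tau) for tau in rankings]))

def dispersion(rankings):
    """Empirical counterpart of Eq. (3): average pairwise Kendall distance."""
    pairs = list(itertools.combinations(rankings, 2))
    return float(np.mean([kendall_distance(a, b) for a, b in pairs]))

rankings = [np.array([0, 1, 2, 3]), np.array([1, 0, 2, 3]), np.array([0, 2, 1, 3])]
candidate = np.array([0, 1, 2, 3])
print(performance(candidate, rankings), dispersion(rankings))
```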
Remark 2.
A known relation between $\min_{\sigma} D(\sigma)$ and $\Delta := \mathbb{E}[d_K(\tau, \tau')]$ has remarkable consequences for assessing the quality of our estimation. It says that the former is bounded below (respectively, above) by 0.5 (respectively, 1) times the latter (Korba2017). This fact has broad practical application since it means that the expected quality of the considered estimators is lower and upper bounded by the intrinsic difficulty of the problem, which we can approximate via the sample statistic in Equation (3).
We conclude by noting that $\widehat{\Delta}$ is the natural ranking counterpart of the variance used for real-valued aggregation. However, when the scores are not on the same scale, the variance of the scores is no longer interpretable as a measure of spread in the population.
C.3 Experiments
We report in Tab. 8 the results of the dispersion analysis. We compare the dispersion-based performance measure of Equation (1) obtained with the rankings induced by $\sigma^{l}$ and $\sigma^{\mathrm{mean}}$, and the one obtained by random permutations (averaged over 100 draws).
Takeaways. As expected, $\sigma^{l}$ obtains the lowest dispersion, which further validates our approach.
| $\sigma^{l}$ | $\sigma^{\mathrm{mean}}$ | Random |
GLUE | 793 | 805 | 2746 |
SGLUE | 44.9 | 47.21 | 137.3 |
XTREM | 12.25 | 12.75 | 50.6 |