What are the best systems? New perspectives on NLP Benchmarking

by Pierre Colombo, et al.

In Machine Learning, a benchmark refers to an ensemble of datasets associated with one or multiple metrics, together with a way to aggregate different systems' performances. Benchmarks are instrumental in (i) assessing the progress of new methods along different axes and (ii) selecting the best systems for practical use. This is particularly the case for NLP, with the development of large pre-trained models (e.g. GPT, BERT) that are expected to generalize well on a variety of tasks. While the community has mainly focused on developing new datasets and metrics, there has been little interest in the aggregation procedure, which is often reduced to a simple average over various performance measures. However, this procedure can be problematic when the metrics are on different scales, which may lead to spurious conclusions. This paper proposes a new procedure to rank systems based on their performance across different tasks. Motivated by social choice theory, the final system ordering is obtained by aggregating the rankings induced by each task and is theoretically grounded. We conduct extensive numerical experiments (on over 270k scores) to assess the soundness of our approach both on synthetic and real scores (e.g. GLUE, EXTREM, SEVAL, TAC, FLICKR). In particular, we show that our method yields different conclusions on state-of-the-art systems than the mean-aggregation procedure while being both more reliable and robust.





1 Introduction

This paper is about improving current practices regarding benchmarks of NLP systems. As pointed out by (ruder2021benchmarking), benchmarks are made of datasets, metrics, and a way to aggregate performance. Our point is that, if the bulk of the NLP community efforts on this domain is about collecting new datasets and introducing new metrics, little work is concerned with the third part, namely how to aggregate various performances.

Why are benchmarks vital? Research advances in Machine Learning (ML) are crucially fueled by reliable evaluation procedures (post-2018-call; benchmark_lottery). The latter are indeed mandatory to fairly compare new methods and systems. Usually, one relies on a well-chosen metric that reflects the ability to perform on a task – e.g. accuracy for classification, mean-squared error for regression.

The multi-tasks evaluation setting. While single-task problems are quite common, to better understand model weaknesses in real-world scenarios, the community is heading towards more complex evaluations involving fine-grained evaluation (liu2021explainaboard) across several metrics (or criteria (gardent2017creating; yuan2019adversarial)) and several tasks (wang2018glue; zheng2021fewnlu; gehrmann-etal-2021-gem; mcmillan-major-etal-2021-reusable; dhole2021nl; benchmark_transformers). This is due to the increasing performance of deep neural networks, which are nowadays designed to generalize in a great variety of situations and to solve complex tasks (silver2016mastering). One typically seeks models with good transfer learning properties, meaning an ability to generalize well under distribution shift and/or task shift (kawaguchi2017generalization).

How to aggregate performances? The multi-tasks setting has been investigated in recent works that provide benchmarks of state-of-the-art models across a great variety of tasks (rajpurkar2016squad; mccann2018natural; conneau2018you; zheng2021fewnlu; tay2020would), sometimes with more than fifty tasks (siddhant2020xtreme; aribandi2021ext5; wei2021finetuned; sanh2021multitask). These papers provide tables of scores across the considered tasks, but the only non-qualitative way to compare systems consists in averaging the performances across tasks and then ranking systems according to their mean score values. This is, for instance, done with the GLUE benchmark (wang2018glue) and its derivatives (wang2019superglue). However, taking the mean is seriously flawed since the different metrics are usually not on the same scale and can even be unbounded (yuan2021bartscore; colombo2021infolm). Even a pre-processing renormalization scheme would fail to capture the intrinsic difficulty of the tasks.

Contribution 1. Our first contribution is to provide a reliable tool to rank systems in a multi-tasks setting. We rely on a ranking aggregation procedure which, from a set of rankings induced by each criterion, returns a single ranking that somehow aggregates the former. This procedure, called the Kemeny consensus (kemeny1959mathematics), can be seen as a voting rule and stems from the social choice theory (myerson1996fundamentals).

Aggregation when instance-level information is available. As illustrated by zhong2021larger; ruder2021benchmarking, a fine-grained understanding of the model performance should include instance-level scores. If taking the mean is quite natural in the classification setting, this is not always the case, as recently pointed out by (peyrard2021better) in the NLG setting. In this article, the authors investigate pairwise comparison of NLG systems for a single metric (e.g. BLEU (papineni2002bleu), ROUGE (lin2004rouge), METEOR (banerjee2005meteor; denkowski2014meteor; guo-hu-2019-meteor), CHRF (popovic2015chrf; popovic2017chrf++), BertScore (zhang2019bertscore)). They prove that a comparison based on the mean or the median of the scores across test utterances can be highly flawed. They rather advise to rely on the Bradley-Terry (bradley1952rank) pairwise comparison method, which consists, for two systems A and B, in computing the proportion of utterances on which A achieves a better score than B. Their work is a significant advance but remains limited to pairwise comparisons.

Contribution 2. Our second contribution consists in going one step further than (peyrard2021better) by applying our ranking procedure to an arbitrarily large set of NLG systems with respect to a fixed set of criteria. Our evaluation methodology can be seen as a natural extension of (peyrard2021better) since it coincides with the latter in the particular case of pairwise comparison. In a more realistic multi-criteria scenario, we combine our two contributions and develop a two-stage ranking aggregation procedure which first aggregates along utterances and then along criteria.

Experiments. Our two contributions rely on our aggregation procedure which is proved to be effective through several experiments.

  • We explain on a simple synthetic example the superiority of our approach compared to the mean-aggregation procedure and the pairwise-aggregation procedure, both in terms of consistency and robustness.

  • We use our ranking procedure on 10 multi-tasks / multi-criteria benchmarks and observe it leads to different conclusions than mean- and pairwise-aggregation procedures.

  • We argue our procedure is more robust by investigating its stability with respect to the addition of criteria and with respect to the addition of systems.

Our code and the collected data will be released (https://github.com/PierreColombo/RankingNLPSystems.git) to accelerate the adoption of what we think is a reliable evaluation method for multi-tasks and multi-criteria benchmarks.

2 Problem Formulation and limitations of existing methods

2.1 General Considerations

When comparing systems' performances, two settings can be distinguished depending on the information granularity at our disposal and on the way one wishes to use this information. In general, each system is scored on each instance of a test set with respect to a given metric. The final (single) score of the system with respect to this metric is obtained through an aggregation procedure we will call instance-level aggregation, which has to be chosen by the practitioner (usually, the mean of the instance scores). Then, the final benchmark score of the system is obtained through an aggregation we call task-level aggregation of the scores of this system for each metric of the considered benchmark. See Fig. 1 for an illustration.

Figure 1: Illustration of the two considered frameworks relying on different information granularity: task-level information (below) or instance-level information (above). From the latter, one can derive the former relying on what we call an instance-level aggregation. A task-level aggregation can then be performed to synthesize a system performance.

Notations. Suppose we are given N systems evaluated on T tasks, each task t being associated with a metric and a test set made of n_t instances. For every i ∈ {1, …, N}, t ∈ {1, …, T} and j ∈ {1, …, n_t}, we denote by s_{i,t,j} the score of system i on the j-th instance of task t.

Instance-level aggregation. The performance of system i on task t is an aggregation of its scores on each instance. This aggregation is chosen by the practitioner and is usually the mean-aggregation defined by

s_{i,t} = (1/n_t) Σ_{j=1}^{n_t} s_{i,t,j}.

Task-level aggregation. Sometimes, one only has access to the aggregated scores of each system on each task, that is, for every i and t, to a score s_{i,t} which corresponds to an instance-level aggregation for system i on the instances of task t. From these quantities, one can then compute, for each system, a synthetic score reflecting the overall performance of this system across every considered task. Again, the usually-taken synthetic score of system i is the mean-aggregation:

s_i = (1/T) Σ_{t=1}^{T} s_{i,t}.
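As a concrete illustration of these two aggregation levels, here is a minimal numpy sketch of the usual mean-aggregation pipeline, with hypothetical shapes (3 systems, 2 tasks, 4 instances per task) and random scores:

```python
import numpy as np

# Hypothetical example: N = 3 systems, T = 2 tasks, n_t = 4 instances per task.
# scores[i, t, j] is the score s_{i,t,j} of system i on instance j of task t.
rng = np.random.default_rng(0)
scores = rng.random((3, 2, 4))

# Instance-level aggregation: mean over instances -> one score per (system, task).
task_scores = scores.mean(axis=2)        # shape (3, 2)

# Task-level aggregation: mean over tasks -> one synthetic score per system.
final_scores = task_scores.mean(axis=1)  # shape (3,)
```

When all tasks have the same number of instances, chaining the two means is equivalent to a single mean over all (task, instance) pairs; with unequal n_t the two-step version weights tasks equally instead of instances.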

2.2 Problem Formulation

Ranking objective. When benchmarking N systems, the goal is to output a ranking of the systems according to some objective criterion. Formally, we need to introduce the symmetric group on N elements, denoted by S_N, whose elements are the permutations of {1, …, N}. Equipped with this notation, our goal is to output a permutation σ ∈ S_N corresponding to the ranking of the systems. For instance, one reads σ(i) = j as "system i is the j-th best system". Depending on the granularity of the information at our disposal, we distinguish two problems.

Ranking Systems from Task Level Information.

Given a set of scores (s_{i,t}) of N systems on T tasks, find a proper aggregation procedure.

Ranking Systems from Instance Level Information.

Given a set of scores (s_{i,t,j}) of N systems on the different instances of T tasks, find a proper aggregation procedure.

2.3 Limitation of Existing Methods

Mean-aggregation procedure. The mean-aggregation procedure consists in taking the permutation σ^mean that ranks the aggregated means s_1, …, s_N. This procedure suffers from several flaws. First, it is not well-suited when the metrics associated with the considered tasks are not on the same scale. Consider, for instance, the situation where one of the tasks (say task t_0) is associated with a metric that is at a significantly larger scale than the others. In that case, the ranking obtained through mean-aggregation would essentially coincide with the ranking induced by task t_0 alone. One could argue that a remedy would be to first normalize each metric so that everything is on the same scale. However, the resulting aggregation would still fail to capture each task's intrinsic difficulty. Worse, this procedure is impracticable in cases where some metrics are unbounded – for instance, this is the case of the BARTScore (yuan2021bartscore). Finally, another weakness of the mean-aggregation ranking procedure is that the score of a system is computed irrespective of its relative performance with respect to the others. This simple observation has been pointed out by (peyrard2021better), who advise, in the special case of two systems and one metric, to compute the number of times a system is better than the other on the instances.

Pairwise ranking. To be a bit more formal, the pairwise ranking aggregation proposed by (peyrard2021better) to rank two systems A and B, whose scores on a given task are (s_{A,j}) and (s_{B,j}) for j ∈ {1, …, n}, consists in computing

w(A, B) = (1/n) |{ j : s_{A,j} > s_{B,j} }|.

Then, A is better than B if and only if w(A, B) > 1/2. As explained by the authors, this method is more relevant than the mean-aggregation in the context of NLG evaluation. However, it is limited to the evaluation of two systems and does not directly apply to a general number of systems. A solution would be to rank each pair of systems and then aggregate these pairwise rankings. However, this would multiply the computational complexity by a prohibitive factor of order N². Moreover, the conclusions of these pairwise rankings can be paradoxical. Tab. 1 below provides a toy example where three systems A, B, and C are evaluated on 6 tasks, and where the pairwise comparisons yield A ≻ B and B ≻ C while A does not beat C, so no consistent total order emerges.

Task1 Task2 Task3 Task4 Task5 Task6 SUM
A 0.3 5 10 0.02 1.0 0.4 16.72
B 0.1 4 13 0.01 2.2 0.3 19.61
C 0.0 3 15 0.03 2.0 0.2 20.23
Table 1: Example where pairwise rankings can be paradoxical. Mean aggregation outputs C ≻ B ≻ A, while the pairwise ranking considered in (peyrard2021better) fails to rank the systems (A ≻ B and B ≻ C, but A and C each win on 3 tasks). Our method does not have this flaw and outputs A ≻ B ≻ C.
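The paradox in Tab. 1 can be checked directly. The sketch below (plain Python, with the table's scores hard-coded) computes the mean-based ranking, the pairwise win counts, and a rank-sum (Borda-style) ranking; the last function is a simplified stand-in for the aggregation described in Sec. 3:

```python
# Scores from Tab. 1: systems A, B, C on 6 tasks.
scores = {
    "A": [0.3, 5, 10, 0.02, 1.0, 0.4],
    "B": [0.1, 4, 13, 0.01, 2.2, 0.3],
    "C": [0.0, 3, 15, 0.03, 2.0, 0.2],
}

def mean_ranking(scores):
    # Rank systems by the sum (equivalently the mean) of their scores.
    return sorted(scores, key=lambda s: -sum(scores[s]))

def wins(a, b, scores):
    # Number of tasks on which system a scores strictly higher than b.
    return sum(x > y for x, y in zip(scores[a], scores[b]))

def borda_ranking(scores):
    # Sum the per-task ranks of each system (1 = best), then rank the sums.
    systems = list(scores)
    totals = {s: 0 for s in systems}
    for t in range(len(scores["A"])):
        order = sorted(systems, key=lambda s: -scores[s][t])
        for rank, s in enumerate(order, start=1):
            totals[s] += rank
    return sorted(systems, key=lambda s: totals[s])
```

On these scores, `mean_ranking` puts C first, the pairwise wins give A over B (4-2) and B over C (4-2) but a 3-3 tie between A and C, and `borda_ranking` outputs A, B, C.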

3 Ranking via Kemeny consensus

We now turn to the description of our methodology to rank an arbitrary number of systems on multi-tasks / multi-criteria benchmarks.

3.1 Kemeny Consensus

Let us consider the problem of ranking N systems on T tasks based on the information of the scores s_{i,t} of each system i on each task t. We believe a robust approach to this problem consists in relying on the relative performance between systems on each task. More precisely, for each task t, we consider the permutation σ_t ∈ S_N, where σ_t(i) corresponds to the rank of system i on task t, in decreasing order of score. Then, we would like to find an appropriate procedure that aggregates the rankings σ_1, …, σ_T. More formally, we would like to define a function from (S_N)^T to S_N which, from a set of T permutations corresponding to rankings, returns a final permutation that summarizes them. One difficulty is that the mean procedure makes no sense on the symmetric group, which is not a vector space. It turns out that a very natural choice consists in taking the so-called Kemeny consensus (kemeny1959mathematics) aggregation procedure, which somehow corresponds to computing a barycenter.

Kemeny consensus.

Let d be the Kendall distance on the symmetric group, defined for every σ, σ' ∈ S_N by

d(σ, σ') = |{ (i, j) : i < j, (σ(i) − σ(j))(σ'(i) − σ'(j)) < 0 }|,

i.e., the number of pairs on which the two permutations disagree. A Kemeny consensus of σ_1, …, σ_T is a solution σ* of the following minimization problem:

σ* ∈ argmin_{σ ∈ S_N} Σ_{t=1}^{T} d(σ, σ_t).
Why is Kemeny consensus natural? As proved by young1978consistent, the Kemeny consensus aggregation procedure is the only rule that satisfies three natural properties: neutrality, meaning that it does not depend on the order of the tasks; consistency, meaning that if the tasks are split into two subsets and the aggregations on the two subsets both rank system i above system j, then so does the global aggregation; and the Condorcet criterion (nicolas1785essai), meaning that an item winning all its pairwise comparisons is ranked first. Moreover, the Kemeny consensus is also the maximum likelihood estimator of the widely-used Mallows statistical model on the symmetric group (young1988condorcet).

3.2 Borda’s count approximation

While the Kemeny consensus is the ideal objective one would like to obtain, its computation is, in general, an NP-hard problem (bartholdi1989computational; dwork2001rank) – although some regularity assumptions, rarely satisfied in practice, can speed up the computation; see for instance (betzler2009similarity) and (brandt2010bypassing). Fortunately, there exist many ways to obtain satisfying approximations of the latter: see for example (ali2012experiments) for a comprehensive empirical study. For our experiments, we choose the so-called Borda's count procedure, defined hereafter for instance-level and/or task-level aggregation.

Borda’s count.

Borda’s count consists, from a set of K permutations σ_1, …, σ_K corresponding to rankings of N systems across tasks or instances, in summing the ranks of each system and then ranking the obtained sums. Formally, it proceeds as follows:

  1. Compute, for every system i, B(i) = Σ_{k=1}^{K} σ_k(i);

  2. Output the permutation that ranks the sums B(1), …, B(N) in increasing order (the system with the smallest rank sum is ranked first).

There are at least four reasons for choosing Borda's count procedure. First, it coincides with the pairwise ranking procedure in the case of two systems, making it a natural generalization. Second, there exists a theoretical result assessing that it is a 5-approximation of the Kemeny consensus (coppersmith2006ordering) with respect to the Kendall distance (fagin2003comparing). Third, it is an unbiased estimator of the Kemeny consensus with low sample complexity for data distributed according to standard ranking models such as Mallows (Caragiannis2013; Fligner1988). Fourth, from a practical perspective, (ali2012experiments) observe that it is efficient, accurate, and markedly faster than the other approximation algorithms. We are now in a position to give our answers to the initial ranking problems from Task Level Information and from Instance Level Information.

Figure 2: Illustration of our two aggregation procedures to rank systems from instance-level information.

3.3 Our Ranking Procedures

How to rank Systems from Task Level Information.

For every task t, let σ_t be the permutation that ranks the scores s_{1,t}, …, s_{N,t}. Our aggregation procedure (σ^task) outputs

σ^task = Borda(σ_1, …, σ_T).

How to rank Systems from Instance Level Information.

We actually give two different procedures. For every task t and every instance j of that task, let σ_{t,j} be the permutation that ranks the scores s_{1,t,j}, …, s_{N,t,j}. See Figure 2 for an illustration.

Two-level aggregation (σ^2l). This procedure:

  1. Computes, for each task t, σ_t = Borda(σ_{t,1}, …, σ_{t,n_t});

  2. Outputs σ^2l = Borda(σ_1, …, σ_T).

One-level aggregation (σ^1l). This procedure outputs the Borda aggregation of all instance-level rankings pooled together:

σ^1l = Borda(σ_{1,1}, …, σ_{1,n_1}, …, σ_{T,1}, …, σ_{T,n_T}).
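The procedures above can be sketched with a common Borda helper. A minimal numpy version (illustrative names; rankings stored as arrays whose entry [k, i] is the rank of system i in the k-th ranking, 1 = best):

```python
import numpy as np

def borda(rankings):
    # Sum the ranks per system, then rank the sums (stable tie-breaking).
    totals = np.asarray(rankings).sum(axis=0)
    order = np.argsort(totals, kind="stable")
    out = np.empty_like(order)
    out[order] = np.arange(1, len(order) + 1)
    return out  # out[i] is the final rank of system i

def two_level(instance_ranks):
    # instance_ranks: one array of shape (n_t, N) per task.
    per_task = [borda(r) for r in instance_ranks]  # aggregate within each task
    return borda(np.stack(per_task))               # then aggregate across tasks

def one_level(instance_ranks):
    # Pool every (task, instance) ranking into a single Borda aggregation.
    return borda(np.concatenate(instance_ranks, axis=0))
```

The two procedures can differ when tasks have unequal numbers of instances: the two-level version weights tasks equally, while the one-level version weights instances equally.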

3.4 How to compare rankings

The rest of the paper is dedicated to synthetic and empirical experiments, through which we demonstrate the soundness of our approach. In order to obtain quantitative results, one needs to be able to compare different rankings quantitatively. Two measures can be used for that purpose: (1) the Kendall distance and (2) the Kendall correlation τ (kendall1938new; kendall1945treatment). The Kendall distance computes the number of inversions between two permutations and is therefore adapted to our purpose of ranking systems. The values of τ range from -1 to 1, where a value close to 1 corresponds to strong agreement and a value close to -1 indicates strong disagreement.
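A dependency-free sketch of the Kendall correlation for two tie-free rank vectors (scipy.stats.kendalltau offers an equivalent, more general implementation):

```python
from itertools import combinations

def kendall_tau(s1, s2):
    # tau = (concordant - discordant) / (n choose 2), in [-1, 1]; no tie handling.
    pairs = list(combinations(range(len(s1)), 2))
    discordant = sum((s1[i] - s1[j]) * (s2[i] - s2[j]) < 0 for i, j in pairs)
    return 1 - 2 * discordant / len(pairs)
```

Identical rankings give τ = 1, fully reversed rankings give τ = -1, and a single swapped adjacent pair among three items gives τ = 1/3.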

4 Synthetic Experiments

In this section, we validate on simulated data the performance of our method on two criteria: robustness to manipulation and robustness to scaling.

4.1 Data Generation

The toy experiment analysis is carried out on synthetic scores for N systems, T tasks and n instances per task. For each system i, we model its performance by a Gumbel random variable G_i centered at θ_i = −i/λ with unit scale, where λ > 0 is a dispersion parameter. The scores (s_{i,t,j}) of system i are i.i.d. samples of G_i, and the scores of different systems are sampled independently. Since the difference G_i − G_{i'} follows a logistic distribution with mean θ_i − θ_{i'}, for i < i' the probability that system i performs better than system i' on a given instance is at least 1/2. Therefore, for all (t, j), the ranking of systems induced by the instance scores is a noisy realization of the ground-truth ranking 1 ≻ 2 ≻ ⋯ ≻ N, with a noise level controlled by the dispersion parameter λ.

Extreme scenarios correspond to the limits λ → ∞ and λ → 0. More precisely, λ → ∞ implies that all scores have the same distribution, whereas λ → 0 induces a strong consensus, i.e., a clear system ranking emerges.

Remark 1.

Sampling according to the described procedure is equivalent to sampling the ranking of the systems from the well-known Plackett-Luce distribution (duncan1959individual; plackett1975analysis) with weights proportional to e^{θ_i}. Interestingly, this distribution over rankings can be seen both from a utilitarian perspective, in which the scores are real numbers, and from a ranking-model perspective, in which the ranking of systems has a known distribution.
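The generator described above can be sketched as follows (a minimal numpy version, assuming locations θ_i = −i/λ and unit scale as in our reading of the setup; the function name is illustrative):

```python
import numpy as np

def sample_scores(n_systems, n_tasks, n_instances, lam, seed=0):
    # Gumbel scores with system-dependent locations theta_i = -i / lam:
    # lam -> infinity gives identically distributed scores (no consensus),
    # lam -> 0 separates the systems sharply (strong consensus).
    rng = np.random.default_rng(seed)
    theta = -np.arange(n_systems) / lam
    return rng.gumbel(
        loc=theta[:, None, None], scale=1.0,
        size=(n_systems, n_tasks, n_instances),
    )
```

With a small λ, the empirical mean scores of the systems are well separated and recover the ground-truth ordering 1 ≻ 2 ≻ ⋯ ≻ N.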

4.2 Robustness to manipulation

Setting. To test the robustness of our ranking procedures, we analyze their stability with respect to perturbations of the scores. More precisely, our way to corrupt the scores of a given task t consists in sampling (s_{i,t,j}) as i.i.d. samples following a Gumbel distribution centered at +i/λ. This implies that, for that task t, the underlying ranking is N ≻ ⋯ ≻ 1, namely the exact opposite of the ground truth. The robustness-to-manipulation analysis shows how the error on the final ranking of systems increases as the scores of some tasks are 'corrupted'. Here, the error is computed as the normalized Kendall distance between the ground-truth ranking and the ranking of systems obtained from the corrupted scores.
Results: For each of the considered methods (the mean aggregation σ^mean, the one-level aggregation σ^1l and the two-level aggregation σ^2l), we report in Fig. 4 the results of the robustness analysis as the dispersion λ and the number of corrupted tasks vary. The results show that σ^2l outperforms σ^1l, which in turn consistently outperforms σ^mean. Overall, for the same number of corrupted tasks and the same dispersion, the error of σ^2l is always the smallest. Moreover, the score-based method σ^mean reaches an error larger than 0.75 when just 2, 3, and 5 of the T tasks have been corrupted (depending on the dispersion), while σ^1l only reaches the same error with 5, 7 and 10 corrupted tasks. The most robust method is the two-level σ^2l, for which 10, 11 and 11 of the T tasks have to be corrupted to reach the same error of 0.75.
Takeaways: We conclude that the ranking-based methods are more robust than σ^mean. In particular, the two-level aggregation σ^2l is the most robust aggregation procedure.

4.3 Robustness to scaling

To further compare the aggregation procedures, we corrupt the scores of a given task by re-scaling them by a factor κ. Whereas this does not affect our ranking procedures (every ranking induced by a task-instance pair remains the same), it increasingly perturbs the mean aggregation procedure as κ increases. Re-scaling the scores by a factor of 2 produces an error larger than 90%; for larger dispersion, re-scaling the scores by a factor of 7 produces the same error (see Fig. 19 for detailed results).
Takeaways: Re-scaling one task's scores by an arbitrarily large factor can produce an arbitrarily large error for mean aggregation while not affecting ranking-based aggregation.
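A tiny worked example of this scaling effect, with made-up scores for two systems on three tasks: rescaling the third task flips the mean-based ordering while leaving every per-task ranking untouched.

```python
import numpy as np

# Made-up scores: 2 systems (rows) on 3 tasks (columns).
scores = np.array([[0.9, 0.8, 0.10],   # system 0: best on tasks 1 and 2
                   [0.7, 0.6, 0.12]])  # system 1: best on task 3 only

mean_before = scores.mean(axis=1)      # system 0 wins on average
scaled = scores.copy()
scaled[:, 2] *= 100                    # re-scale task 3 by a factor of 100
mean_after = scaled.mean(axis=1)       # system 1 now wins on average

# Per-task orderings are unchanged by the re-scaling.
ranks_before = np.argsort(-scores, axis=0)
ranks_after = np.argsort(-scaled, axis=0)
```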

5 Empirical Experiments

In this section, we present our results on real evaluation scores. Our large-scale experiments rely on real evaluation scores (over 270k scores), which are described in Ssec. 5.1. In Ssec. 5.2, we gather experimental results for Ranking Systems from Task Level Information, while Ssec. 5.3 is dedicated to the problem of Ranking Systems from Instance Level Information.

5.1 Data Collection

5.1.1 Datasets with Task Level Information

We collect the results of GLUE (wang2018glue), SGLUE (wang2019superglue) (results can be found at https://super.gluebenchmark.com/) and XTREME (hu2020xtreme).
For GLUE, the dataset is composed of 105 systems that are evaluated on 9 different tasks: CoLA (warstadt2019neural), SST-2 (socher2013recursive), MRPC (dolan2005automatically), STS-B (cer2017semeval), QQP, MNLI (williams2017broad), QNLI, RTE (dagan2005pascal; giampiccolo2007third; bentivogli2009fifth) and WNLI (levesque2012winograd).
For SGLUE, the final dataset gathers scores from 24 systems that are evaluated on 10 different tasks: BoolQ (clark2019boolq), CB (de2019commitmentbank), COPA (roemmele2011choice), MultiRC (khashabi2018looking), ReCoRD (zhang2018record), RTE, WiC (pilehvar2018wic), WSC and its derivatives AX-b and AX-g (levesque2012winograd).
The XTREM benchmark is composed of 15 systems and includes tasks such as sentence classification (using XNLI (xnli; mnli) and PAWS-X (paws_1; paws_2)), structured prediction (relying on Universal Dependencies v2.5 (universal_2) and Wikiann (pos_2; pos)), sentence retrieval (with BUCC (bucc_1; bucc_2) and Tatoeba (artetxe2019massively)) and question answering (via XQuAD (squad_2; rajpurkar2016squad), MLQA (lewis2019mlqa), TyDiQA-GoldP (clark2020tydi)). For all benchmarks, various types of metrics with various scales are reported (i.e., accuracy, F1, correlation).

5.1.2 Datasets with Instance-level information

In this setting, we focus on NLG evaluation, as these scores are among the easiest to collect. We focus on four different tasks: summary evaluation, image description, dialogue, and translation. For summary evaluation, we use TAC08 (dang2008overview), TAC10, TAC11 (owczarzak2011overview), RSUM (bhandari2020re) and SEVAL (fabbri2021summeval). For sentence-based image description, we rely on FLICKR (young2014image), and for dialogue we use PersonaChat (PC) and TopicalChat (TC) (mehri2020usr). Finally, for machine translation, we rely on the multilingual quality estimation (MLQE) dataset introduced in ranasinghe2021exploratory. For all datasets except MLQE, we consider automatic metrics based on S3 (both variants pyr/resp) (peyrard2017learning), ROUGE (lin2004rouge) (including 5 of its variants (ng2015better)), JS [1-2] (lin2006information), Chrfpp (popovic2017chrf++), BLEU, BERTScore (zhang2019bertscore) and MoverScore (zhao2019moverscore). For MLQE, we solely consider several versions of BERTScore, MoverScore and ContrastScore. We also add human evaluation, which is specific to each dataset. All details corresponding to these datasets can be found in Appendix A, and the datasets will be made available at https://github.com/PierreColombo/RankingNLPSystems.git.

Figure 4: Robustness on synthetic scores ((a) synthetic scores).
Figure 7: Global Analysis of Instance Level Ranking ((a) Agreement Analysis; (b) Correlation Analysis).

5.2 Task-level Aggregation Experiments

In this section, we address the aggregation problem when task-level information is available. We first study the final rankings obtained by the different methods on GLUE, SGLUE, and XTREM. Then, we assess the robustness of σ^task when removing tasks.

5.2.1 Comparison with mean-aggregation

Setting. To compare the rankings σ^mean and σ^task, we compute (i) the agreement rate (in %), which is the proportion of common top-ranked systems between σ^mean and σ^task, and (ii) the Kendall Tau correlation (τ) between the rankings.
Results. In Tab. 2, we compare the rankings of the aforementioned methods for the Top K systems (strongest systems) and the Last K systems (weakest systems). For the three benchmarks, we observe a high correlation between the final rankings (i.e., correlation values in the range [0.82, 0.92]). To a finer degree, we also observe that the methods tend to agree on which are the best/worst systems. Although σ^mean and σ^task agree on the best/worst systems, they do not rank them in the same order (see Tab. 3). For instance, on XTREM, the third-best system according to σ^mean (rank 2) is actually the sixth-best system according to σ^task.
Takeaways. When changing the aggregation function, the response to our initial question "what are the best systems?" varies.

Dataset  Top 1  Top 3  Top 5  Top 10
XT.      1      0.66   0.8    0.9
GLUE     1      1      0.8    0.8
SGLUE    1      1      0.8    0.9

Dataset  Last 3  Last 5  Last 10  τ
XT.      1       0.8     0.9      0.82
GLUE     1       0.8     0.7      0.92
SGLUE    1       1       1        0.91
Table 2: Agreement count between the Top N/Last N systems of σ^mean and σ^task when Task Level Information is available. τ is computed on the total ranking.
σ^task  Team       σ^mean    σ^task  Team      σ^mean
0       Ms Alex    0         0       ULR       0
1       ERNIE      1         1       CoFe      1
2       DEBERTA    2         2       InfoLXL   3
3       AliceMind  3         3       VECO      4
4       PING-AH    5         4       Unicoder  5
5       HFL        4         5       PolyGlot  2
6       T5         6         6       ULR-v2    6
7       DIRL       10        7       HiCTL     8
8       Zihan      7         8       Ernie     7
9       ELECTRA    11        9       Anony     10
Table 3: Qualitative comparison of the rankings obtained with σ^task (left index of each block) and σ^mean (right index).
Figure 18: Impact of adding/removing metrics/tasks. Panels: (a) GLUE, (b) Persona Chat, (c) Topic Chat, (e) MLQE, (g) SUM Eval, (h) TAC08, (i) TAC09, (j) TAC10. The first column refers to rankings obtained with task-level information, while the other columns refer to rankings obtained with instance-level information.

5.2.2 How does the addition/removal of new tasks/metrics affect the ranking?

When building a benchmark, practitioners can always add new tasks to refine the model performance assessment (it boils down to adding a new column in Tab. 1). In this experiment, we analyze how adding and removing tasks affect the rankings of the aggregation procedures.
Setting. We compare the rankings obtained when considering a subset of the tasks with the ones obtained using all the tasks. Formally, for a given number of tasks T' ≤ T, we randomly sample T' tasks, compute the rankings obtained by our procedure (σ^task) and by the mean procedure (σ^mean) on these tasks, and finally compute the Kendall correlation between each sub-sampled ranking and its "ground truth" counterpart computed on all tasks. We repeat this random sampling 100 times to obtain a mean/variance plot of the correlations as a function of the number of sub-tasks.

Results. We report in Fig. 18(a,f) the obtained results for varying subset sizes. Interestingly, we observe that the correlation between the sub-sampled and full rankings is consistently higher for σ^task than for σ^mean. This difference is particularly visible for small subsets of tasks. We observe a similar behavior when considering SGLUE (see Fig. 23).
Takeaways. The ranking obtained with σ^task is more robust to task addition/drop than the one from σ^mean.

5.3 Instance-level Aggregation Experiments

For instance-level aggregation, we conduct experiments on the 9 aforementioned datasets. We study both the final ranking obtained by each aggregation (i.e., σ^mean, σ^1l and σ^2l) and the effect of task addition/removal.

-0.08 -0.01 0 -0.03
0.32 0.27 0.29 0.01
-0.10 -0.15 -0.04 0.00
0.04 0.14 0.28 0.06 -0.06
0.07 0.52 0.32 0.37 0.37
0 0.10 0.23 0.19 0.07
Table 4: on global instance-level rankings.

5.3.1 Global analysis

When conducting our experiments, we observe that the three different aggregation procedures lead to three different state-of-the-art systems on 8 datasets out of 9. Furthermore, they never agree on the top 3 systems. In what follows, we conduct a finer analysis to compare σ^mean, σ^1l and σ^2l.
Setting. We compare the obtained rankings by (1) computing the Kendall correlation between them (see Tab. 4), (2) counting the number of agreements between the top N systems, and (3) computing the Kendall correlation between the common top N systems.
Results. When considering the agreement analysis of Fig. 7(a), we observe that σ^1l and σ^2l select a high number of common top systems. However, the correlation of the rankings induced by σ^1l and σ^2l on these top systems is low (see Fig. 7(b)). This is also the case for the correlation of the entire rankings (see Fig. 7). In short, σ^1l and σ^2l select similar systems but rank them differently. A similar analysis shows that σ^mean disagrees with both σ^1l and σ^2l, on the top systems as well as on their orders.
Takeaways. σ^1l behaves more similarly to σ^2l than σ^mean does.

5.3.2 What is the impact of removing/adding tasks?

In NLG, different metrics (i.e., tasks) assess the quality of a generated sentence along different axes. As adding a new metric may affect the final ranking, we investigate the impact of task addition/removal on the final system ordering.
Setting. Similarly to Sssec. 5.2.2, we study the evolution of the correlation between the ranking computed on a subset of tasks and the ground-truth ranking (computed on all tasks) for each of the three procedures.
Results. We observe that, for all datasets, both σ^1l and σ^2l obtain higher correlation and lower variance than σ^mean when adding/removing tasks. Results for RSUM report similar trends (see Fig. 33).
Takeaways. The rankings obtained with either σ^1l or σ^2l are more robust to task addition/drop than the one from σ^mean.

6 Conclusion and Future Works

In this paper, we introduced new aggregation procedures to rank systems when either task level scores or instance level scores are available. Our methods, which are theoretically grounded and rely on Kemeny ranking consensus, address fundamental flaws of the widely used arithmetic mean.

We conducted extensive numerical experiments, which show that our methods are both more reliable and more robust than mean aggregation, while leading to different conclusions on which systems are best. Overall, whether task-level or instance-level information is available, we recommend the corresponding ranking-based aggregation procedure over the mean aggregation.

Although we focused on NLP benchmarks, our methodology could be applied to other modalities (e.g., Computer Vision, Audio). Another interesting avenue would be to consider a weighted version of the Kemeny consensus, where weights could reflect task-specific criteria (e.g., fairness, robustness, complexity).

7 Acknowledgements and Funding

We thank Maxime Peyrard, Saibo Geng, and Wei Zhao for sharing their collected dataset. Warm thanks to Matthieu Labeau for his sharp eye and excellent comments when reading our manuscript. This work was granted access to the HPC resources of IDRIS under the allocations 2021-101838 and 2021-101913 made by GENCI. Nathan is funded by the ANR project LIMPID.


Appendix A Details on Datasets

We report in Tab. 5 and Tab. 6 global statistics of the considered data. Overall, our conclusions are drawn from a total of over 270k scores.

        # of tasks  # of instances
GLUE    105         15
SGLUE   24          15
XTREM   15          5
Table 5: Summary of the considered benchmarks. Overall, the total number of scores is over 2,010.
        # of tasks  # of instances
PC      19          240
TC      19          300
FLICKR  14          864
MLQE    10          7000
RSUM    15          2500
SEVAL   17          1600
TAC08   15          2976
TAC09   15          2596
TAC11   15          2376
Table 6: Summary of the considered datasets. Overall, this benchmark is composed of over 276,276 scores.

Appendix B Additional Experiments

In this section, we report additional experimental results, including the details of the robustness-to-scaling experiment (see Ssec. B.1), the ranking on XTREM (see Ssec. B.2), and complete results of the experiments on adding/removing metrics/tasks when task-level (see Sssec. B.3.1) and instance-level information (see Sssec. B.3.2) is available. In the main paper, we only report the aggregated score for the agreement analysis when instance-level information is available; detailed results are reported in Ssec. B.4.

B.1 Toy Experiment on Scale

We display in Fig. 19 the results of the toy experiment on scaling robustness. When one task is corrupted by rescaling, the error of the ranking induced by the mean aggregation increases to 1, while the error of the ranking-based aggregation remains constant.

Figure 19: Synthetic experiment on robustness to scaling. Error is measured in terms of the Kendall distance.
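The mechanism behind this toy experiment can be illustrated with hypothetical scores. Here a Borda-style rank aggregation stands in for the ranking-based procedure; the data and function names are illustrative, not the paper's:

```python
def mean_ranking(score_matrix):
    """Rank systems by mean score across tasks (rank 0 = best)."""
    means = [sum(row) / len(row) for row in score_matrix]
    order = sorted(range(len(means)), key=lambda i: -means[i])
    ranks = [0] * len(means)
    for pos, sys_id in enumerate(order):
        ranks[sys_id] = pos
    return ranks

def borda_ranking(score_matrix):
    """Rank systems by summing their per-task ranks: invariant to any
    monotone rescaling of an individual task."""
    n_sys, n_tasks = len(score_matrix), len(score_matrix[0])
    borda = [0] * n_sys
    for t in range(n_tasks):
        col = sorted(range(n_sys), key=lambda i: -score_matrix[i][t])
        for pos, sys_id in enumerate(col):
            borda[sys_id] += pos
    order = sorted(range(n_sys), key=lambda i: borda[i])
    ranks = [0] * n_sys
    for pos, sys_id in enumerate(order):
        ranks[sys_id] = pos
    return ranks

# Hypothetical scores: system 0 beats system 1 on two of three tasks.
scores = [[0.9, 0.8, 0.4],
          [0.7, 0.6, 0.6]]
# Rescale the third task by 100: the mean-based winner flips,
# while the rank-based winner is unchanged.
rescaled = [[a, b, c * 100] for a, b, c in scores]
```

Rescaling a single task inflates its contribution to the mean and flips the mean-based ranking, whereas the per-task ranks, and hence the rank-based aggregation, are unaffected.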

B.2 Ranking of SGLUE

We display in Tab. 7 the resulting rankings on the three considered benchmarks. Although the aggregation procedures tend to agree on good and bad systems, the rankings vary when changing the aggregation function. Thus, the conclusion to the initial question ”what are the best systems?” might change.

Rank  Team           Team           Team
0     Ms Alex (0)    Liam (0)       ULR (0)
1     ERNIE (1)      Ms Alex (1)    CoFe (1)
2     DEBERTA (2)    ERNIE (2)      InfoLXL (3)
3     AliceMind (3)  HUMAN (3)      VECO (4)
4     PING-AH (5)    DEBERTA (5)    Unicoder (5)
5     HFL (4)        Zirui (4)      PolyGlot (2)
6     T5 (6)         T5 (6)         ULR-v2 (6)
7     DIRL (10)      Alibaba (7)    HiCTL (8)
8     Zihan (7)      Anuar (8)      Ernie (7)
9     ELECTRA (11)   Huawei (11)    Anony (10)
Table 7: Qualitative analysis of the rankings obtained with the different aggregation procedures. Numbers in parentheses report the rank under the compared aggregation procedure.

B.3 Complete results on the task addition/removal experiments

B.3.1 Task Level Aggregation

Results of task addition/removal experiments when Task level information is available are reported in Fig. 23. Overall, we observe that ranking-based aggregation is more robust than mean-based aggregation.

Figure 23: Experiment on task addition/removal when task-level information is available; panel (a): GLUE.

B.3.2 Instance Level Aggregation

Results of task addition/removal experiments when instance-level information is available are reported in Fig. 33. Overall, we observe that ranking-based aggregation is more robust than mean-based aggregation.

Figure 33: Experiment on task addition/removal when instance-level information is available; panels: (a) Persona Chat, (b) Topic Chat, (d) MLQE, (e) RSUM, (g) TAC08, (h) TAC09, (i) TAC10.

B.4 Agreement analysis on Instance Level Aggregation

To further complete the agreement analysis of Sssec. 5.3.1, we report results on individual tasks. Fig. 39 shows the correlation of the rankings restricted to the top-K systems for different values of K, while Fig. 45 reports the agreement analysis.

Figure 39: Correlation of the rankings restricted to the top-K systems; panels: (a) Dialog, (c) MLQE, (d) Summarization Datasets, (e) TAC Datasets.
Figure 45: Agreement analysis; panels: (a) Dialog, (c) MLQE, (d) Summarization Datasets, (e) TAC Datasets.

B.5 Possible Extensions

In the future, we would like to extend the proposed ranking procedures to:

  • Sequence labelling tasks (chapuis2020hierarchical; chapuis2021code; colombo2020guiding), where instances are sequences, so ordering matters.

  • Emotion classification (dinkar2020importance; garcia2019token; colombo2021improving; witon2018disney), where scores are continuous.

  • NLG (colombo2021beam; colombo2021automatic; colombo2021novel; staerman2021depth; colombo2021learning; colombo2019affect), where instances are sentences, so both ordering and content matter.

Appendix C Dispersion Analysis

In this section, we introduce the notion of dispersion across a set of rankings σ_1, …, σ_T (permutations of the N compared systems, i.e., elements of the symmetric group S_N).

C.1 Dispersion as a measure of performance

Suppose we have two ranking candidates, σ and σ′, to summarize σ_1, …, σ_T. A natural question consists in measuring the performance of σ and σ′. Denoting by d the Kendall distance, a natural measure is

(1/T) Σ_{t=1}^{T} d(σ, σ_t).   (1)
Of course, in our task-level framework, our candidate will achieve better performance than the mean since it is designed to minimize this quantity. From a probabilistic point of view, one can see each σ_t as an i.i.d. realization of a random variable Σ on the symmetric group S_N. Denoting by P its law and E the associated expectation, Equation (1) is an approximation of

E[d(σ, Σ)].   (2)
C.2 Dispersion to measure ranking complexity

In order to take into account the intrinsic complexity of the rankings σ_1, …, σ_T, a natural approach is to compute some measure of dispersion among these permutations. Using the same notation as before, one can rely on the mean pairwise distance:

(2 / (T(T−1))) Σ_{1≤i<j≤T} d(σ_i, σ_j).   (3)
From a practical perspective, evaluating Equation (3) requires O(T²) Kendall distance computations. Nevertheless, it is possible to efficiently approximate this value via adaptive stopping algorithms based on Bernstein or Hoeffding bounds (mnih2008empirical; Korba2017). Notice that the pairwise distance has a solid theoretical foundation as the basis for a measure of spread: it is an empirical approximation of E[d(Σ, Σ′)], where Σ and Σ′ are independent with law P. Therefore, both quantities are directly related to the noise level in different probabilistic models for permutations (gMallows). Moreover, the former is the basis of statistical uniformity tests for ranked data (busa2021private).
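A direct quadratic computation of the pairwise dispersion of Equation (3) can be sketched as follows (the adaptive-stopping approximation mentioned above is not implemented; rankings are encoded as per-system ranks, and the function names are our own):

```python
from itertools import combinations

def kendall_distance(r1, r2):
    """Number of discordant system pairs between two rankings (rank per system)."""
    return sum(
        1
        for i, j in combinations(range(len(r1)), 2)
        if (r1[i] - r1[j]) * (r2[i] - r2[j]) < 0
    )

def pairwise_dispersion(rankings):
    """Mean pairwise Kendall distance among a set of rankings, i.e. the
    dispersion of Equation (3); costs O(T^2) distance evaluations."""
    pairs = list(combinations(rankings, 2))
    return sum(kendall_distance(a, b) for a, b in pairs) / len(pairs)
```

Identical rankings yield zero dispersion, while a set mixing a ranking with its reverse yields the maximum pairwise distance.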

Remark 2.

A known relation between the expected distance in Equation (2) and the expected pairwise distance E[d(Σ, Σ′)] has remarkable consequences for assessing the quality of our estimation: the former is bounded below (respectively, above) by 0.5 (respectively, 1) times the latter (Korba2017). This fact has broad practical application, since it means that the expected quality of the estimators is lower and upper bounded by the intrinsic difficulty of the problem, which we can approximate via the sample statistic in Equation (3).

We conclude by noting that the pairwise dispersion is the natural ranking counterpart of the variance used for real-valued aggregation. However, when the scores are not on the same scale, the variance of the scores is no longer interpretable as a measure of spread in the population.

C.3 Experiments

We report in Tab. 8 the results of the dispersion analysis. We compare the dispersion (in the sense of Equation (1)) of the rankings induced by the aggregation procedures with the dispersion obtained by 100 random permutations.
Takeaways. As expected, the ranking-based aggregation obtains the lowest dispersion, which further validates our approach.

        Aggregation rankings    Random
GLUE    793     805             2746
SGLUE   44.9    47.21           137.3
XTREM   12.25   12.75           50.6
Table 8: Results of the dispersion analysis on the considered benchmarks. The last column reports the dispersion obtained with 100 random permutations.