Re-evaluating Evaluation in Text Summarization

by Manik Bhandari, et al.

Automated evaluation metrics as a stand-in for manual evaluation are an essential part of the development of text-generation tasks such as text summarization. However, while the field has progressed, our standard metrics have not – for nearly 20 years ROUGE has been the standard evaluation in most summarization papers. In this paper, we make an attempt to re-evaluate the evaluation method for text summarization: assessing the reliability of automatic metrics using top-scoring system outputs, both abstractive and extractive, on recently popular datasets for both system-level and summary-level evaluation settings. We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.




1 Introduction

Table 1: Summary of our experiments, observations on existing human judgments on TAC, and contrasting observations on newly obtained human judgments on the CNNDM dataset. Please refer to Sec. 4 for more details.

Exp-I: Ability of metrics to evaluate all systems? (Sec. 4.1)
  Existing human judgments (TAC): MoverScore and JS-2 outperform all other metrics.
  New human judgments (CNNDM): ROUGE-2 outperforms all other metrics. MoverScore and JS-2 perform worse on both extractive summaries (achieving only about 0.1 Pearson correlation) and abstractive summaries.

Exp-II: Ability of metrics to evaluate top-k systems? (Sec. 4.2)
  Existing human judgments (TAC): As k becomes smaller, ROUGE-2 de-correlates with humans.
  New human judgments (CNNDM): For extractive and abstractive systems, ROUGE-2 highly correlates with humans. For evaluating a mix of extractive and abstractive systems, all metrics de-correlate.

Exp-III: Ability of metrics to compare 2 systems? (Sec. 4.3)
  Existing human judgments (TAC): MoverScore and JS-2 outperform all other metrics.
  New human judgments (CNNDM): ROUGE-2 is the most reliable for abstractive systems while ROUGE-1 is most reliable for extractive systems.

Exp-IV: Ability of metrics to evaluate summaries? (Sec. 4.4)
  Existing human judgments (TAC): (1) MoverScore and JS-2 outperform all other metrics. (2) Metrics have much lower correlations when evaluating summaries than systems.
  New human judgments (CNNDM): (1) ROUGE metrics outperform all other metrics. (2) For extractive summaries, most metrics are better at evaluating summaries than systems. For abstractive summaries, some metrics are better at summary level, others are better at system level.

In text summarization, manual evaluation, as exemplified by the Pyramid method nenkova-passonneau-2004-pyramid-og, is the gold standard in evaluation. However, due to the time required and the relatively high cost of annotation, the great majority of research papers on summarization use exclusively automatic evaluation metrics, such as ROUGE lin2004rouge, JS-2 js2, S3 peyrard_s3, BERTScore bert-score, and MoverScore zhao-etal-2019-moverscore. Among these metrics, ROUGE is by far the most popular, and there is relatively little discussion of how ROUGE may deviate from human judgment and the potential for this deviation to change conclusions drawn regarding the relative merit of baseline and proposed methods. To characterize the relative goodness of evaluation metrics, it is necessary to perform meta-evaluation graham-2015-evaluating; lin-och-2004-orange, where a dataset annotated with human judgments (e.g. TAC 2008 tac2008) is used to test the degree to which automatic metrics correlate with those judgments.

However, the classic TAC meta-evaluation datasets are now 6-12 years old (in TAC, summarization tasks ran in 2008, 2009, 2010, 2011, and 2014; in 2014, the task was biomedical summarization), and it is not clear whether conclusions found there will hold with modern systems and summarization tasks. Two earlier works exemplify this disconnect: (1) peyrard-2019-studying observed that the human-annotated summaries in the TAC dataset are mostly of lower quality than those produced by modern systems and that various automated evaluation metrics strongly disagree in the higher-scoring range in which current systems now operate. (2) rankel-etal-2013-decade observed that the correlation between ROUGE and human judgments in the TAC dataset decreases when looking at the best systems only, even for systems from eight years ago, which are far from today’s state-of-the-art.

Constrained by few existing human judgment datasets, it remains unknown how existing metrics behave on current top-scoring summarization systems. In this paper, we ask the question: does the rapid progress of model development in summarization models require us to re-evaluate the evaluation process used for text summarization? To this end, we create and release a large benchmark for meta-evaluating summarization metrics including:

  • Outputs from 25 top-scoring extractive and abstractive summarization systems on the CNN/DailyMail dataset.

  • Automatic evaluations from several evaluation metrics including traditional metrics (e.g. ROUGE) and modern semantic matching metrics (e.g. BERTScore, MoverScore).

  • Manual evaluations using the lightweight pyramids method litepyramids-shapira-etal-2019-crowdsourcing, which we use as a gold-standard to evaluate summarization systems as well as automated metrics.

Using this benchmark, we perform an extensive analysis, which indicates the need to re-examine our assumptions about the evaluation of automatic summarization systems. Specifically, we conduct four experiments analyzing the correspondence between various metrics and human evaluation. Somewhat surprisingly, we find that many of the previously attested properties of metrics found on the TAC dataset demonstrate different trends on our newly collected CNNDM dataset, as shown in Tab. 1. For example, MoverScore is the best performing metric for evaluating summaries on dataset TAC, but it is significantly worse than ROUGE-2 on our collected CNNDM set. Additionally, many previous works novikova-etal-2017-need; peyrard_s3; chaganty2018price show that metrics have much lower correlations at comparing summaries than systems. For extractive summaries on CNNDM, however, most metrics are better at comparing summaries than systems.

Calls for Future Research

These observations demonstrate the limitations of our current best-performing metrics, highlighting (1) the need for future meta-evaluation to (i) be across multiple datasets and (ii) evaluate metrics on different application scenarios, e.g. summary level vs. system level; (2) the need for more systematic meta-evaluation of summarization metrics that updates with our ever-evolving systems and datasets; and (3) the potential benefit to the summarization community of a shared task similar to the WMT Metrics Task in Machine Translation, where systems and metrics co-evolve.

2 Preliminaries

In this section we describe the datasets, systems, metrics, and meta evaluation methods used below.

2.1 Datasets

TAC-2008, 2009 tac2008; tac2009 are multi-document, multi-reference summarization datasets. Human judgments are available for the system summaries submitted during the TAC-2008 and TAC-2009 shared tasks.

CNN/DailyMail (CNNDM) hermann2015teaching; nallapati2016abstractive is a commonly used summarization dataset that contains news articles and associated highlights as summaries. We use the version without entities anonymized.

2.2 Representative Systems

We use the following representative top-scoring systems that either achieve state-of-the-art (SOTA) results or competitive performance, for which we could gather the outputs on the CNNDM dataset.

Extractive summarization systems. We use CNN-LSTM-BiClassifier (CLSTM-SL; kedzie2018content), Latent zhang-etal-2018-neural-latent, BanditSum dong-etal-2018-banditsum, REFRESH narayan-etal-2018-ranking, NeuSum zhou-etal-2018-neural, HIBERT  zhang-etal-2019-hibert, Bert-Sum-Ext liu-lapata-2019-text, CNN-Transformer-BiClassifier (CTrans-SL; zhong2019searching), CNN-Transformer-Pointer (CTrans-PN; zhong2019searching), HeterGraph wang2020heterogeneous and MatchSum zhong2020extractive as representatives of extractive systems, totaling 11 extractive system outputs for each document in the CNNDM test set.

Abstractive summarization systems. We use pointer-generator+coverage see-etal-2017-get, fastAbsRL chen-bansal-2018-fast-abs, fastAbsRL-rank chen-bansal-2018-fast-abs, Bottom-up gehrmann2018bottom, T5 raffel2019exploring-t5, Unilm-v1 dong2019unified, Unilm-v2 dong2019unified, twoStageRL zhang2019pretraining, preSummAbs liu2019text-presumm, preSummAbs-ext liu2019text-presumm, BART lewis2019bart and Semsim yoon2020learning as abstractive systems. In total, we use 14 abstractive system outputs for each document in the CNNDM test set.

2.3 Evaluation Metrics

We examine eight metrics that measure the agreement between two texts, in our case, between the system summary and reference summary.

BERTScore (BScore) measures soft overlap between contextual BERT embeddings of tokens between the two texts (bert-score), computed with an existing implementation.

MoverScore (MScore) applies a distance measure to contextualized BERT and ELMo word embeddings (zhao-etal-2019-moverscore), computed with an existing implementation.

Sentence Mover Similarity (SMS) applies minimum distance matching between texts based on sentence embeddings (clark2019sentence).

Word Mover Similarity (WMS) measures similarity using minimum distance matching between texts which are represented as a bag of word embeddings (kusner2015word). WMS and SMS are computed with the same existing implementation.

JS divergence (JS-2) measures Jensen-Shannon divergence between the two texts’ bigram distributions, calculated using an existing function (lin-etal-2006-information).

ROUGE-1 and ROUGE-2 measure overlap of unigrams and bigrams respectively (lin2004rouge). For ROUGE-1, ROUGE-2, and ROUGE-L, we used a Python wrapper.

ROUGE-L measures overlap of the longest common subsequence between two texts (lin2004rouge).

We use the recall variant of all metrics (since the Pyramid method of human evaluations is inherently recall based) except MScore which has no specific recall variant.
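To make the two simplest lexical metrics concrete, the following is a minimal sketch of ROUGE-N recall and of the Jensen-Shannon divergence between n-gram distributions. It is not the implementation used in the paper: whitespace tokenization, the absence of stemming and smoothing, and the function names `ngrams`, `rouge_n_recall` and `js_divergence` are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(reference, system, n=2):
    """Fraction of reference n-grams (with multiplicity) also found in the system summary.

    Assumption: plain whitespace tokenization, no stemming or stopword removal.
    """
    ref = ngrams(reference.split(), n)
    hyp = ngrams(system.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, hyp[gram]) for gram, count in ref.items())
    return overlap / total


def js_divergence(reference, system, n=2):
    """Jensen-Shannon divergence (base 2) between the two texts' n-gram distributions.

    0 means identical distributions, 1 means no shared n-grams.
    Assumption: unsmoothed relative frequencies.
    """
    p = ngrams(reference.split(), n)
    q = ngrams(system.split(), n)
    p_total, q_total = sum(p.values()), sum(q.values())
    if p_total == 0 or q_total == 0:
        raise ValueError("both texts need at least one n-gram")
    divergence = 0.0
    for gram in set(p) | set(q):
        p_i, q_i = p[gram] / p_total, q[gram] / q_total
        mean = 0.5 * (p_i + q_i)
        if p_i > 0:
            divergence += 0.5 * p_i * log2(p_i / mean)
        if q_i > 0:
            divergence += 0.5 * q_i * log2(q_i / mean)
    return divergence
```

Because a divergence is lower for better summaries, a score based on it would presumably need its sign flipped before being correlated with human judgments; that detail is not stated in this section.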

2.4 Correlation Measures

Pearson Correlation is a measure of linear correlation between two variables and is popular in meta-evaluating metrics at the system level (lee1988thirteen). We use the implementation given by scipy-2020.

Williams’ Significance Test is a means of calculating the statistical significance of differences in correlations for dependent variables williams-test; graham-baldwin-2014-testing. This is useful for us since metrics evaluated on the same dataset are not independent of each other.
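As a rough illustration of these two measures, the sketch below computes a Pearson correlation and the test statistic for comparing two dependent correlations that share one variable (here, the human scores). It is a standard-library sketch rather than the scipy-based code the paper refers to, and the function names `pearson` and `williams_t` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (undefined if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def williams_t(r12, r13, r23, n):
    """t statistic (n - 3 degrees of freedom) for testing whether r12 exceeds r13.

    r12: correlation of human scores with metric A
    r13: correlation of human scores with metric B
    r23: correlation between metric A and metric B
    n:   number of items (e.g. systems) the correlations were computed on
    """
    k = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    denominator = sqrt(
        2 * k * (n - 1) / (n - 3) + ((r12 + r13) ** 2 / 4) * (1 - r23) ** 3
    )
    return (r12 - r13) * sqrt((n - 1) * (1 + r23)) / denominator
```

A one-sided p-value would then come from the upper tail of Student's t distribution with n - 3 degrees of freedom, for instance via `scipy.stats.t.sf(t, n - 3)`.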

2.5 Meta Evaluation Strategies

There are two broad meta-evaluation strategies: summary-level and system-level.

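Setup: For each document $d_i$, $i \in \{1, \dots, n\}$, in a dataset $D$, we have $J$ system outputs, where the outputs can come from (1) extractive systems (Ext), (2) abstractive systems (Abs) or (3) a union of both (Mix). Let $s_{ij}$, $j \in \{1, \dots, J\}$, be the $j$-th summary of the $i$-th document, $m_1$ and $m_2$ be two specific metrics (one of which may be the human score) and $K$ be a correlation measure.

2.5.1 Summary Level

Summary-level correlation is calculated as follows:

$$K^{\mathrm{sum}}_{m_1 m_2} = \frac{1}{n} \sum_{i=1}^{n} K\big([m_1(s_{i1}), \dots, m_1(s_{iJ})],\ [m_2(s_{i1}), \dots, m_2(s_{iJ})]\big) \tag{1}$$

Here, correlation is calculated for each document, among the different system outputs of that document, and the mean value is reported.

2.5.2 System Level

System-level correlation is calculated as follows:

$$K^{\mathrm{sys}}_{m_1 m_2} = K\Big(\Big[\tfrac{1}{n}\sum_{i=1}^{n} m_1(s_{i1}), \dots, \tfrac{1}{n}\sum_{i=1}^{n} m_1(s_{iJ})\Big],\ \Big[\tfrac{1}{n}\sum_{i=1}^{n} m_2(s_{i1}), \dots, \tfrac{1}{n}\sum_{i=1}^{n} m_2(s_{iJ})\Big]\Big) \tag{2}$$

Additionally, the “quality” of a system $j$ is defined as the mean human score received by it, i.e.

$$\mathrm{Quality}(j) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{human}(s_{ij}) \tag{3}$$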
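The difference between the two strategies can be sketched in a few lines. The code below assumes scores are stored as a documents-by-systems nested list (`scores[i][j]` is the score of system j's summary for document i) and uses Pearson as the correlation measure; the layout and the function names are illustrative assumptions, not the paper's code.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (undefined if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def summary_level(metric_scores, human_scores):
    """Correlate within each document across systems, then average over documents."""
    per_document = [
        pearson(metric_row, human_row)
        for metric_row, human_row in zip(metric_scores, human_scores)
    ]
    return sum(per_document) / len(per_document)


def system_means(scores):
    """Mean score of each system over all documents (its 'quality' when scores are human)."""
    n_documents = len(scores)
    return [sum(column) / n_documents for column in zip(*scores)]


def system_level(metric_scores, human_scores):
    """Average each system over documents first, then correlate across systems."""
    return pearson(system_means(metric_scores), system_means(human_scores))


# Toy data: 2 documents x 3 systems. The metric ranks systems backwards on the
# first document but its large scores on the second dominate the system means.
human = [[1, 2, 3], [1, 2, 3]]
metric = [[3, 2, 1], [10, 20, 60]]
print(summary_level(metric, human), system_level(metric, human))
```

In this toy case the summary-level value is close to zero while the system-level value is high, which illustrates how the two settings can rank the same metric very differently.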


3 Collection of Human Judgments

We follow a 3-step process to collect human judgments: (1) we collect system-generated summaries on the most-commonly used summarization dataset, CNNDM; (2) we select representative test samples from CNNDM and (3) we manually evaluate system-generated summaries of the above-selected test samples.

3.1 System-Generated Summary Collection

We collect the system-generated summaries from 25 top-scoring systems, covering 11 extractive and 14 abstractive systems (Sec. 2.2) on the CNNDM dataset. We contacted the authors of these systems to gather the corresponding outputs, including variants of the systems. We organize our collected generated summaries into three groups based on system type:

  • CNNDM Abs denotes collected output summaries from abstractive systems.

  • CNNDM Ext denotes collected output summaries from extractive systems.

  • CNNDM Mix is the union of the two.

3.2 Representative Sample Selection

Since collecting human annotations is costly, we sample 100 documents from the CNNDM test set (11,490 samples) and evaluate system-generated summaries of these 100 documents. We aim to include documents of varying difficulties in the representative sample. As a proxy for the difficulty of summarizing a document, we use the mean score received by the system-generated summaries for the document. Based on this, we partition the CNNDM test set into 5 equal-sized bins and sample 4 documents from each bin. We repeat this process for 5 metrics (BERTScore, MoverScore, R-1, R-2, R-L), obtaining a sample of 100 documents (5 metrics × 5 bins × 4 documents). This methodology is detailed in Alg. 1 in Sec. A.1.
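The binning procedure described above might be sketched as follows. This is a simplified reading of the text, not Alg. 1 itself: it bins only by the per-document mean score, skips documents already chosen under an earlier metric so that the total is exactly metrics × bins × draws, and the name `sample_by_difficulty` and its parameters are hypothetical.

```python
import random


def sample_by_difficulty(mean_scores, n_bins=5, per_bin=4, seed=0):
    """Sample documents so that easy and hard ones are both represented.

    mean_scores: dict mapping a metric name to a list whose i-th entry is the
                 mean score of all system summaries for document i.
    Returns the sorted indices of the sampled documents.
    """
    rng = random.Random(seed)
    chosen = set()
    for scores in mean_scores.values():
        # Order documents from hardest (lowest mean score) to easiest.
        order = sorted(range(len(scores)), key=lambda i: scores[i])
        bin_size = len(order) // n_bins
        for b in range(n_bins):
            start = b * bin_size
            end = (b + 1) * bin_size if b < n_bins - 1 else len(order)
            # Assumption: never pick the same document twice across metrics.
            candidates = [i for i in order[start:end] if i not in chosen]
            chosen.update(rng.sample(candidates, per_bin))
    return sorted(chosen)
```

With 5 metrics and the default settings the function returns 100 document indices, matching the sample size used in the paper.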

(a) Reference Summary: Bayern Munich beat Porto 6 - 1 in the Champions League on Tuesday. Pep Guardiola’s side progressed 7 - 4 on aggregate to reach semi-finals. Thomas Muller scored 27th Champions League goal to pass Mario Gomez. Muller is now the leading German scorer in the competition. After game Muller led the celebrations with supporters using a megaphone.
(b) System Summary (BART, lewis2019bart): Bayern Munich beat Porto 6 - 1 at the Allianz Arena on Tuesday night. Thomas Muller scored his 27th Champions League goal. The 25 - year - old became the highest - scoring German since the tournament took its current shape in 1992. Bayern players remained on the pitch for some time as they celebrated with supporters.
(c) SCUs with corresponding evaluations:

  • Bayern Munich beat Porto.

  • Bayern Munich won 6 - 1.

  • Bayern Munich won in Champions League.

  • Bayern Munich won on Tuesday.

  • Bayern Munich is managed by Pep Guardiola.

  • Bayern Munich progressed in the competition.

  • Bayern Munich reached semi-finals.

  • Bayern Munich progressed 7 - 4 on aggregate.

  • Thomas Muller scored 27th Champions League goal.

  • Thomas Muller passed Mario Gomez in goals.

  • Thomas Muller is now the leading German scorer in the competition.

  • After the game Thomas Muller led the celebrations.

  • Thomas Muller led the celebrations using a megaphone.

Table 2: Example of a summary and corresponding annotation. (a) shows a reference summary from the representative sample of the CNNDM test set. (b) shows the corresponding system summary generated by BART, one of the abstractive systems used in the study. (c) shows the SCUs (Semantic Content Units) extracted from (a) and the “Present”/“Not Present” labels marked by crowd workers when evaluating (b).

3.3 Human Evaluation

In text summarization, a “good” summary should represent as much relevant content from the input document as possible, within the acceptable length limits. Many human evaluation methods have been proposed to capture this desideratum nenkova-passonneau-2004-pyramid-og; chaganty2018price; fan-etal-2018-controllable; litepyramids-shapira-etal-2019-crowdsourcing. Among these, Pyramid nenkova-passonneau-2004-pyramid-og is a reliable and widely used method, that evaluates content selection by (1) exhaustively obtaining Semantic Content Units (SCUs) from reference summaries, (2) weighting them based on the number of times they are mentioned and (3) scoring a system summary based on which SCUs can be inferred.

Recently, litepyramids-shapira-etal-2019-crowdsourcing extended Pyramid to a lightweight, crowdsourceable method, LitePyramids, which uses Amazon Mechanical Turk (AMT) for gathering human annotations. LitePyramids simplifies Pyramid by (1) allowing crowd workers to extract a subset of all possible SCUs and (2) eliminating the difficult task of merging duplicate SCUs from different reference summaries, instead using SCU sampling to simulate frequency-based weighting.

Both Pyramid and LitePyramids rely on the presence of multiple references per document to assign importance weights to SCUs. However, in the CNNDM dataset there is only one reference summary per document. We therefore adapt the LitePyramids method for the single-reference setting as follows.

SCU Extraction. The LitePyramids annotation instructions define a Semantic Content Unit (SCU) as a sentence containing a single fact written as briefly and clearly as possible. Instead, we focus on shorter, more fine-grained SCUs that contain at most 2-3 entities. This allows for partial content overlap between a generated and reference summary, and also makes the task easy for workers. Tab. 2 gives an example. We exhaustively extract (up to 16) SCUs from each reference summary; in our representative sample we found no document having more than 16 SCUs. Requiring the set of SCUs to be exhaustive increases the complexity of the SCU generation task, and hence instead of relying on crowd workers, we create SCUs from reference summaries ourselves. In the end, we obtained nearly 10.5 SCUs on average from each reference summary.

System Evaluation. During system evaluation the full set of SCUs is presented to crowd workers. Workers are paid similarly to litepyramids-shapira-etal-2019-crowdsourcing, scaling the rates for fewer SCUs and shorter summary texts. For abstractive systems, we pay $0.20 per summary and for extractive systems, we pay $0.15 per summary since extractive summaries are more readable and might precisely overlap with SCUs. We post-process system output summaries before presenting them to annotators by true-casing the text using Stanford CoreNLP manning-EtAl-2014-CoreNLP and replacing “unknown” tokens with a special symbol (chaganty2018price).

Tab. 2 depicts an example reference summary, system summary, SCUs extracted from the reference summary, and annotations obtained in evaluating the system summary.

Annotation Scoring. For robustness litepyramids-shapira-etal-2019-crowdsourcing, each system summary is evaluated by 4 crowd workers. Each worker annotates up to 16 SCUs by marking an SCU “present” if it can be inferred from the system summary or “not present” otherwise. We obtain a total of 10,000 human annotations (100 documents × 25 systems × 4 workers). For each document, we identify a “noisy” worker as the one who disagrees with the majority (i.e. marks an SCU as “present” when the majority thinks “not present” or vice versa) on the largest number of SCUs. We remove the annotations of noisy workers and retain 7,742 annotations of the 10,000. After this filtering, we obtain an average inter-annotator agreement (Krippendorff’s alpha krippendorff2011computing) of 0.66; the agreement was 0.57 and 0.72 for extractive and abstractive systems respectively. Finally, we use the majority vote to mark the presence of an SCU in a system summary, breaking ties in favor of the class “not present”.
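The filtering and voting steps could look roughly like the sketch below for a single system summary. Several details are assumptions rather than statements from the paper: a worker is dropped only if they disagree with the majority at least once, ties in the initial majority also count as “not present”, and the final score is taken to be the fraction of SCUs marked present. All function names are hypothetical.

```python
def majority_vote(annotations):
    """Per-SCU majority over workers; a tie resolves to False ("not present").

    annotations: one list of booleans per worker, one boolean per SCU.
    """
    n_workers = len(annotations)
    n_scus = len(annotations[0])
    return [
        2 * sum(worker[k] for worker in annotations) > n_workers
        for k in range(n_scus)
    ]


def drop_noisiest_worker(annotations):
    """Remove the worker who disagrees with the majority on the most SCUs."""
    majority = majority_vote(annotations)
    disagreements = [
        sum(label != vote for label, vote in zip(worker, majority))
        for worker in annotations
    ]
    worst = max(disagreements)
    if worst == 0:  # assumption: keep everyone when nobody disagrees
        return list(annotations)
    noisiest = disagreements.index(worst)
    return [worker for i, worker in enumerate(annotations) if i != noisiest]


def summary_score(annotations):
    """Fraction of SCUs judged present after filtering (assumed scoring rule)."""
    present = majority_vote(drop_noisiest_worker(annotations))
    return sum(present) / len(present)


# 4 workers x 4 SCUs for one system summary.
workers = [
    [True, True, False, False],
    [True, True, False, False],
    [True, False, False, False],
    [False, False, True, True],  # disagrees with the majority most often
]
print(summary_score(workers))
```

Here the fourth worker is removed, after which two of the four SCUs are marked present by the remaining three workers.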

4 Experiments

Motivated by the central research question, “does the rapid progress of model development in summarization models require us to re-evaluate the evaluation process used for text summarization?”, we use the collected human judgments to meta-evaluate current metrics from four diverse viewpoints, measuring the ability of metrics to: (1) evaluate all systems; (2) evaluate the top-k strongest systems; (3) compare two systems; (4) evaluate individual summaries. We find that many previously attested properties of metrics observed on TAC exhibit different trends on the new CNNDM dataset.

4.1 Exp-I: Evaluating All Systems

Figure 1 (panels: TAC-2008, TAC-2009, CNNDM Mix, CNNDM Abs, CNNDM Ext): p-value of Williams’ Significance Test for the hypothesis “Is the metric on left (y-axis) significantly better than metric on top (x-axis)”. ‘BScore’ refers to BERTScore and ‘MScore’ refers to MoverScore. A dark green value in a cell denotes that the row metric has a significantly higher Pearson correlation with human scores than the column metric (p-value below the significance threshold). Dark cells showing a p-value at the threshold have been rounded up. ‘-’ in a cell refers to the case when the Pearson correlation of the row metric with human scores is less than that of the column metric (Sec. 4.1).

Figure 2: System-level Pearson correlation between metrics and human scores (Sec. 4.1).

Automatic metrics are widely used to determine where a new system may rank against existing state-of-the-art systems. Thus, in meta-evaluation studies, calculating correlation of automatic metrics with human judgments at the system level is a commonly-used setting novikova-etal-2017-need; bojar-etal-2016-wmt-results; graham-2015-evaluating. We follow this setting and specifically, ask two questions: Can metrics reliably compare different systems? To answer this we observe the Pearson correlation between different metrics and human judgments in Fig. 2, finding that:

(1) MoverScore and JS-2, which were the best performing metrics on TAC, have poor correlations with humans in comparing CNNDM Ext systems.

(2) Most metrics have high correlations on the TAC-2008 dataset but many suffer on TAC-2009, especially ROUGE based metrics. However, ROUGE metrics consistently perform well on the collected CNNDM datasets.

Are some metrics significantly better than others in comparing systems? Since automated metrics calculated on the same data are not independent, we must perform Williams’ test williams-test to establish whether the difference in correlations between metrics is statistically significant graham-baldwin-2014-testing. In Fig. 1 we report the p-values of Williams’ test. We find that:

(1) MoverScore and JS-2 are significantly better than other metrics in correlating with human judgments on the TAC datasets.

(2) However, on CNNDM Abs and CNNDM Mix, R-2 significantly outperforms all others whereas on CNNDM Ext none of the metrics show significant improvements over others.

Takeaway: These results suggest that metrics run the risk of overfitting to some datasets, highlighting the need to meta-evaluate metrics for modern datasets and systems. Additionally, there is no one-size-fits-all metric that can outperform others on all datasets. This suggests the utility of using different metrics for different datasets to evaluate systems e.g. MoverScore on TAC-2008, JS-2 on TAC-2009 and R-2 on CNNDM datasets.

4.2 Exp-II: Evaluating Top-k Systems

Most papers that propose a new state-of-the-art system use automatic metrics as a proxy for human judgments to compare their proposed method against other top-scoring systems. However, can metrics reliably quantify the improvements that one high-quality system makes over other competitive systems? To answer this, instead of focusing on all of the collected systems, we evaluate the correlation between automatic metrics and human judgments in comparing the top-k systems, where the top-k are chosen based on a system’s mean human score (Eqn. 3). As a caveat, we do not perform significance testing for this experiment, due to the small number of data points. Our observations are presented in Fig. 3. We find that:

(1) As k becomes smaller, metrics de-correlate with humans on the TAC-2008 and CNNDM Mix datasets, even getting negative correlations for small values of k (Fig. 3). Interestingly, SMS, R-1, R-2 and R-L improve in performance as k becomes smaller on CNNDM Ext.

(2) R-2 had negative correlations with human judgments on TAC-2009 for some values of k; however, it remains highly correlated with human judgments on CNNDM Abs for all values of k.

Takeaway: Metrics cannot reliably quantify the improvements made by one system over others, especially for the top few systems across all datasets. Some metrics, however, are well suited for specific datasets, e.g. JS-2 and R-2 are reliable indicators of improvements on TAC-2009 and CNNDM Abs respectively.
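The top-k setting amounts to restricting the system-level correlation to the k systems with the highest mean human score. A minimal sketch, with hypothetical function names and invented toy numbers, is:

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (undefined if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def top_k_correlation(metric_by_system, human_by_system, k):
    """Correlation between metric and human system scores over the k best systems.

    Both inputs hold one mean score per system; "best" is decided by the human score.
    """
    ranked = sorted(
        range(len(human_by_system)),
        key=lambda j: human_by_system[j],
        reverse=True,
    )[:k]
    return pearson(
        [metric_by_system[j] for j in ranked],
        [human_by_system[j] for j in ranked],
    )


# Toy scores for 6 systems: the metric tracks humans overall
# but swaps the order of the two best systems.
human = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
metric = [1, 2, 3, 4, 6, 5]
print(top_k_correlation(metric, human, k=6))
print(top_k_correlation(metric, human, k=2))
```

In this invented example the correlation is high over all six systems and negative over the top two, which mirrors the kind of de-correlation reported for small k.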

Figure 3 (panels: TAC-2008, TAC-2009, CNNDM Mix, CNNDM Abs, CNNDM Ext): System-level Pearson correlation with humans on top-k systems (Sec. 4.2).

4.3 Exp-III: Comparing Two Systems

Instead of comparing many systems (Sec. 4.1, 4.2), this experiment tests the discriminative power of a metric, i.e., the degree to which the metric can capture statistically significant differences between two summarization systems.

We analyze the reliability of metrics along a useful dimension: can metrics reliably say if one system is significantly better than another? Since we only have 100 annotated summaries to compare any two systems, $sys_1$ and $sys_2$, we use paired bootstrap resampling to test with statistical significance whether $sys_1$ is better than $sys_2$ according to metric $m$ bootstrap_paper; dror-etal-2018-hitchhikers. We take all pairs of systems and compare their mean human score (Eqn. 3) using paired bootstrap resampling. We assign each pair a label $y_{true}$ with three possible values: one if $sys_1$ is better than $sys_2$ with 95% confidence, another for the reverse, and a third if the confidence is below 95%. We treat this as the ground truth label of the pair $(sys_1, sys_2)$. This process is then repeated for all metrics, to get a “prediction”, $y^{m}_{pred}$, from each metric $m$ for the same pairs. If $m$ is a good proxy for human judgments, the F1 score goutte2005probabilistic between $y^{m}_{pred}$ and $y_{true}$ should be high. We calculate the weighted macro F1 score for all metrics and show them in Fig. 4.
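The labelling and scoring steps can be sketched as below. The label encoding (1, 2, 0), the number of resamples, and the function names are assumptions made for illustration; the weighted F1 follows the usual definition of per-class F1 averaged by class support in the ground truth.

```python
import random


def bootstrap_label(scores_a, scores_b, n_resamples=1000, confidence=0.95, seed=0):
    """Paired bootstrap comparison of two systems on the same documents.

    Returns 1 if A beats B in at least `confidence` of the resamples,
    2 if B beats A that often, and 0 otherwise (assumed label encoding).
    """
    rng = random.Random(seed)
    n = len(scores_a)
    wins_a = wins_b = 0
    for _ in range(n_resamples):
        # Resample document indices with replacement, keeping pairs aligned.
        sample = [rng.randrange(n) for _ in range(n)]
        difference = sum(scores_a[i] - scores_b[i] for i in sample)
        if difference > 0:
            wins_a += 1
        elif difference < 0:
            wins_b += 1
    if wins_a / n_resamples >= confidence:
        return 1
    if wins_b / n_resamples >= confidence:
        return 2
    return 0


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class frequency in y_true."""
    total = 0.0
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        total += f1 * y_true.count(label)
    return total / len(y_true)
```

To use it, one would call `bootstrap_label` once per system pair with human scores to obtain the ground-truth labels, once per pair with each metric's scores to obtain that metric's predictions, and then pass both label lists to `weighted_f1`.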

We find that ROUGE based metrics perform moderately well in this task. R-2 performs the best on CNNDM datasets. While on the TAC 2009 dataset, JS-2 achieves the highest F1 score, its performance is low on CNNDM Ext.

Takeaway: Different metrics are better suited for different datasets. For example, on the CNNDM datasets, we recommend using R-2 while, on the TAC datasets, we recommend using JS-2.

Figure 4: F1-Scores with bootstrapping (Sec. 4.3).

4.4 Exp-IV: Evaluating Summaries

In addition to comparing systems, real-world application scenarios also require metrics to reliably compare multiple summaries of a document. For example, top-scoring reinforcement learning based summarization systems bohm2019better and the current state-of-the-art extractive system zhong2020extractive heavily rely on summary-level reward scores to guide the optimization process.

In this experiment, we ask the question: how well do different metrics perform at the summary level, i.e. in comparing system summaries generated from the same document? We use Eq. 1 to calculate Pearson correlation between different metrics and human judgments for different datasets and collected system outputs.

Figure 5: Pearson correlation between metrics and human judgments across different datasets (Sec. 4.4). Panels: summary-level Pearson correlation with human scores; difference between system-level and summary-level Pearson correlation.

Our observations are summarized in Fig. 5. We find that:

(1) As compared to semantic matching metrics, R-1, R-2 and R-L have lower correlations on the TAC datasets but are strong indicators of good summaries especially for extractive summaries on the CNNDM dataset.

(2) Notably, BERTScore, WMS, R-1 and R-L have negative correlations on TAC-2009 but perform moderately well on other datasets including CNNDM.

(3) Previous meta-evaluation studies novikova-etal-2017-need; peyrard_s3; chaganty2018price conclude that automatic metrics tend to correlate well with humans at the system level but have poor correlations at the instance (here summary) level. We find this observation only holds on TAC-2008. Some metrics’ summary-level correlations can outperform system-level correlations on the CNNDM dataset, as shown in Fig. 5 (bins below zero). Notably, MoverScore has a correlation of only 0.05 on CNNDM Ext at the system level but 0.74 at the summary level.

Takeaway: Meta-evaluations of metrics on the old TAC datasets show significantly different trends than meta-evaluation on modern systems and datasets. Even though some metrics might be good at comparing summaries, they may point in the wrong direction when comparing systems. Moreover, some metrics show poor generalization ability to different datasets (e.g. BERTScore on TAC-2009 vs other datasets). This highlights the need for empirically testing the efficacy of different automatic metrics in evaluating summaries on multiple datasets.

5 Related Work

This work is connected to the following threads of topics in text summarization.

Human Judgment Collection Despite many approaches to the acquisition of human judgment chaganty2018price; nenkova-passonneau-2004-pyramid-og; litepyramids-shapira-etal-2019-crowdsourcing; fan-etal-2018-controllable, Pyramid nenkova-passonneau-2004-pyramid-og has been a mainstream method to meta-evaluate various automatic metrics. Specifically, Pyramid provides a robust technique for evaluating content selection by exhaustively obtaining a set of Semantic Content Units (SCUs) from a set of references, and then scoring system summaries on how many SCUs can be inferred from them. Recently, litepyramids-shapira-etal-2019-crowdsourcing proposed a lightweight and crowdsourceable version of the original Pyramid, and demonstrated it on the DUC 2005 Dang05overviewof and 2006 Dang06overviewof multi-document summarization datasets. In this paper, our human evaluation methodology is based on the Pyramid nenkova-passonneau-2004-pyramid-og and LitePyramids litepyramids-shapira-etal-2019-crowdsourcing techniques. chaganty2018price also obtain human evaluations on system summaries on the CNNDM dataset, but with a focus on language quality of summaries. In comparison, our work is focused on evaluating content selection. Our work also covers more systems than their study (11 extractive + 14 abstractive vs. 4 abstractive).

Meta-evaluation with Human Judgment. The effectiveness of different automatic metrics, such as ROUGE-2 lin2004rouge, ROUGE-L lin2004rouge, ROUGE-WE ng-abrecht-2015-better, JS-2 js2 and S3 peyrard_s3, is commonly evaluated based on their correlation with human judgments (e.g., on the TAC-2008 tac2008 and TAC-2009 tac2009 datasets). As an important supplementary technique to meta-evaluation, graham-2015-evaluating advocate for the use of a significance test, Williams’ test williams-test, to measure the improved correlations of a metric with human scores and show that the popular variant of ROUGE (mean ROUGE-2 score) is sub-optimal. Unlike these works, instead of proposing a new metric, in this paper we upgrade the meta-evaluation environment by introducing a sizeable human judgment dataset evaluating current top-scoring systems and mainstream datasets. We then re-evaluate diverse metrics in both system-level and summary-level settings. novikova-etal-2017-need also analyze existing metrics, but they focus only on dialog generation.

6 Implications and Future Directions

Our work not only diagnoses the limitations of current metrics but also highlights the importance of upgrading the existing meta-evaluation testbed, keeping it up-to-date with the rapid development of systems and datasets. In closing, we highlight some potential future directions: (1) The choice of metrics depends not only on different tasks (e.g., summarization, translation) but also on different datasets (e.g., TAC, CNNDM) and application scenarios (e.g., system-level, summary-level). Future works on meta-evaluation should investigate the effect of these settings on the performance of metrics. (2) Metrics easily overfit on limited datasets. Multi-dataset meta-evaluation can help us better understand each metric’s peculiarities, and therefore make a better choice of metrics under diverse scenarios. (3) Our collected human judgments can be used as supervision to instantiate the recently proposed pretrain-then-finetune framework (originally for machine translation) sellam2020bleurt, learning a robust metric for text summarization.


We sincerely thank all authors of the systems that we used in this work for sharing their systems’ outputs.


Appendix A Appendices

A.1 Sampling Methodology

Please see Algorithm 1.

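Algorithm 1 Sampling Methodology
Data: $(d_i, r_i, S_i) \in D$, where $D$ is the CNNDM test set, $d_i$ is a source document, $r_i$ is its reference summary, and $S_i$ is the set of individual system summaries $s_{ij} \in S_i$. $M$ = [ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, MoverScore]
Output: $D_{out}$: sampled set of documents
for each metric $m \in M$ do
      $D'$ := $D$ sorted by the mean score under $m$ of each document's system summaries (Sec. 3.2)
      for each of the 5 equal-sized bins of $D'$ do
            bin sorted by …
            for … do
                  Randomly sample a document from the bin and add it to $D_{out}$
            end for
      end for
end for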

A.2 Exp-I using Kendall’s tau correlation

Please see Figure 6 for the system-level Kendall’s tau correlation between different metrics and human judgments.

Figure 6: System-level Kendall correlation between metrics and human scores.

Figure 7: Kendall correlation between metrics and human judgments across different datasets. Panels: summary-level Kendall correlation with human scores; difference between system-level and summary-level Kendall correlation.

A.3 Exp-II using Kendall’s tau correlation

Please see Figure 8 for the system-level Kendall’s tau correlation on top-k systems, between different metrics and human judgments.

Figure 8 (panels: TAC-2008, TAC-2009, CNNDM Mix, CNNDM Abs, CNNDM Ext): System-level Kendall correlation with humans on top-k systems.

A.4 Exp-IV using Kendall’s tau correlation

Please see Figure 7 for the summary-level Kendall’s tau correlation between different metrics and human judgments.