DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence

01/26/2022
by   Wei Zhao, et al.
HITS gGmbH

Recently, there has been a growing interest in designing text generation systems from a discourse coherence perspective, e.g., modeling the interdependence between sentences. Still, recent BERT-based evaluation metrics cannot recognize coherence and fail to punish incoherent elements in system outputs. In this work, we introduce DiscoScore, a parametrized discourse metric, which uses BERT to model discourse coherence from different perspectives, driven by Centering theory. Our experiments encompass 16 non-discourse and discourse metrics, including DiscoScore and popular coherence models, evaluated on summarization and document-level machine translation (MT). We find that (i) the majority of BERT-based metrics correlate much worse with human rated coherence than early discourse metrics invented a decade ago; (ii) the recent state-of-the-art BARTScore is weak when operated at system level – which is particularly problematic as systems are typically compared in this manner. DiscoScore, in contrast, achieves strong system-level correlation with human ratings, not only in coherence but also in factual consistency and other aspects, and surpasses BARTScore by over 10 correlation points on average. Further, aiming to understand DiscoScore, we provide justifications for the importance of discourse coherence for evaluation metrics, and explain the superiority of one variant over another. Our code is available at <https://github.com/AIPHES/DiscoScore>.


1 Introduction

In discourse, coherence refers to the continuity of semantics in text. Often, discourse relations and lexical cohesion devices, such as repetition and coreference, are employed to connect text spans, aiming to ensure text coherence. Popular theories on discourse in the linguistics community were provided by Grosz et al. (1995) and Mann and Thompson (1988); they formulate coherence through the lens of readers’ focus of attention and of rhetorical discourse structures over sentences. Later on, coherence models emerged as computational instantiations of these theories, judging text coherence in discourse tasks such as sentence ordering and essay scoring Barzilay and Lapata (2008); Lin et al. (2011); Guinaudeau and Strube (2013).

While humans often use text planning at the discourse level prior to writing and speaking, up until recently the majority of natural language generation (NLG) systems, be it text summarization or document-level MT, have performed sequential word prediction without considering text coherence. For instance, MT systems mostly do not model the interdependence between sentences and translate a document at sentence level, and as such produce many incoherent elements such as coreference mistakes in system outputs Maruf et al. (2021). Only more recently has there been a surge of interest towards discourse based summarization and MT systems, aiming to model inter-sentence context, with a focus on pronominal anaphora Voita et al. (2018); Liu et al. (2021) and discourse relations Miculicich et al. (2018); Xu et al. (2020).

However, there appears to be a mismatch between discourse based NLG systems and non-discourse NLG evaluation metrics such as MoverScore Zhao et al. (2019) and BERTScore Zhang et al. (2020), which have recently become popular for MT and summarization evaluation. As these metrics base their judgment on semantic similarity (and lexical overlap (Kaster et al., 2021)) between hypotheses and references—which by design does not target text coherence—it is not surprising that they do not correlate well with human rated coherence Fabbri et al. (2021); Yuan et al. (2021); Sai et al. (2021).

In this work, we fill the gap of missing discourse metrics in MT and summarization evaluation, particularly in reference-based evaluation scenarios. We introduce DiscoScore, a parametrized discourse metric, which uses BERT to model discourse coherence through the lens of readers’ focus, driven by Centering theory Grosz et al. (1995). The DiscoScore variants differ in how we use focus—see Figure 1: (i) we model focus frequency and semantics, and compare their difference between hypothesis and reference, and (ii) we use focus transitions to model the interdependence between sentences. Building upon this, we present a simple graph-based approach to compare hypothesis with reference.

(a) FocusDiff: focus-by-token adjacency matrix. (b) SentGraph: sentence-by-sentence adjacency matrix.
Figure 1: Sample hypothesis and reference from SUMMEval. Each focus is marked in a different color, corresponding to multiple tokens as instances of a focus. Focuses shared in Hypothesis and Reference are marked in the same color. (a)+(b) are adjacency matrices used to model focus-based coherence for Hypothesis; for simplicity, adjacency matrices for Reference are omitted. FocusDiff and SentGraph are the variants of DiscoScore. For FocusDiff, we use (a) to depict the relations between foci and tokens, reflecting focus frequency. For SentGraph, we use (b) to depict the interdependence between sentences according to the number of foci shared between sentences and the distance between sentences.

We compare DiscoScore with a range of baselines, including discourse and non-discourse metrics, and coherence models on summarization and document-level MT datasets. Our contributions and findings are summarized as follows:

  • Recent BERT-based metrics and the state-of-the-art BARTScore Yuan et al. (2021) are all weak in system-level correlation with human ratings, not only in coherence but also in other aspects such as factual consistency. Most of them are even worse than very early discourse metrics, RC and LC Wong and Kit (2012)—which require neither source texts nor references and use discourse features to predict hypothesis coherence.

  • DiscoScore strongly correlates with human rated coherence and many other aspects, over 10 points (on average across aspects) better than BARTScore and the two strong baselines RC and LC in both single- and multi-reference settings. This indicates that neither leveraging contextualized encoders nor finding discourse features alone is sufficient, suggesting combining both as DiscoScore does.

  • We demonstrate the importance of including discourse signals in the assessment of system outputs, as the discourse features derived from DiscoScore can strongly separate hypothesis from reference. Further, we show that the more discriminative these features are, the better the metrics perform, which allows for interpreting the performance gaps between the variants of DiscoScore.

  • We investigate two focus choices popular in the discourse community, i.e., noun Elsner and Charniak (2011) and semantic entity Mesgar and Strube (2016). Our results show that entity as focus is not always helpful, but when it helps, the gain is big.

2 Related work

Evaluation Metrics.

Traditional metrics such as BLEU Papineni et al. (2002) and ROUGE Lin (2004) measure lexical n-gram overlap between a hypothesis and a human reference. As they fail to measure semantic similarity in the absence of lexical overlap, several metrics have been proposed to overcome this issue, which carry out soft lexical matching with static word embeddings Ng and Abrecht (2015) and synonym matching Lavie and Agarwal (2007). However, none of these metrics can properly judge text coherence Kryscinski et al. (2019); Zhu and Bhat (2020).

Recently, a class of novel metrics based on BERT Devlin et al. (2019) has received a surge of attention, as they correlate strongly with human judgment of text quality in both reference-based and reference-free scenarios Zhao et al. (2019); Zhang et al. (2020); Sellam et al. (2020); Rei et al. (2020); Gao et al. (2020); Thompson and Post (2020); Zhao et al. (2020); Pu et al. (2021); Chen et al. (2021). While strong at sentence level, these metrics cannot recognize coherence in inter-sentence contexts (just like BLEU and ROUGE), as BERT and the majority of BERT variants that these metrics build on are inadequate in capturing discourse phenomena Koto et al. (2021); Laban et al. (2021); Beyer et al. (2021). (Recently, several discourse BERT variants such as Conpono Iter et al. (2020) have been proposed, but we find that they are not always helpful for evaluation metrics—see Table 4.) Thus, these metrics are not suitable for evaluating long texts as in document-level MT evaluation. Works that either (i) use the average of sentence-level scores as the document score or (ii) assign a score to the concatenation of sentences within a document Xiong et al. (2019); Liu et al. (2020); Saunders et al. (2020) completely ignore the interdependence between sentences and are thus also inadequate.

Several attempts have been made towards discourse metrics in MT evaluation. Wong and Kit (2012); Gong et al. (2015); Cartoni et al. (2018) use the frequency of lexical cohesion devices (e.g., word repetition) over sentences to predict the coherence of hypothesis translations, while Guzmán et al. (2014) and Joty et al. (2017) suggest comparing the difference of rhetorical structures between hypothesis and reference translations. Recently, Jiang et al. (2021) measure the inconsistency between hypothesis and reference translations in several aspects such as verb tense and named entities. However, these metrics do not leverage strong contextualized encoders, which have been shown to be a key ingredient for the recent success of BERT-based metrics. Most recently, BARTScore Yuan et al. (2021) uses sequence-to-sequence pretrained language models such as BART Lewis et al. (2020) to measure how likely hypothesis and reference are paraphrases of each other, according to the probability of one given the other. While BARTScore constitutes the recent state-of-the-art in sentence-level correlation with human ratings, we find that (i) it still performs poorly at system level—which is particularly problematic as systems are typically compared in this manner—and (ii) as it is based on a ‘blackbox’ language model, it cannot offer insights towards how it models coherence and what discourse phenomena it does (not) capture.

Coherence Models.

In discourse, there have been many computational models Barzilay and Lapata (2008); Guinaudeau and Strube (2013); Pitler and Nenkova (2008); Lin et al. (2011) for text coherence assessment, the majority of which differ in the regularities they use to distinguish coherent from incoherent text, driven by different linguistic theories, viz., patterns of (i) focus transitions in adjacent sentences Grosz et al. (1995) and (ii) text organization regarding discourse relations over sentences Mann and Thompson (1988). For instance, Barzilay and Lapata (2008) and Guinaudeau and Strube (2013) use the distribution of entity transitions over sentences to predict text coherence, while Pitler and Nenkova (2008) and Lin et al. (2011) suggest producing discourse relations over sentences with a discourse parser, showing that these relations are indicative of text coherence. In the last few years, neural coherence models have been explored; popular examples are Tien Nguyen and Joty (2017), Mesgar and Strube (2018) and Moon et al. (2019). As they and the recent state-of-the-art Mesgar et al. (2021) all have been trained on text readability datasets, with readability labels as supervision, they may suffer from domain shift when applied to MT and summarization evaluation. More importantly, they judge hypothesis coherence in the absence of a reference, and thus are not sufficient for reference-based evaluation. Our experiments involve two popular, unsupervised coherence models, the entity graph Guinaudeau and Strube (2013) and the lexical graph Mesgar and Strube (2016), treated as discourse metrics due to their robustness advantages Lai and Tetreault (2018).

Discourse Test Sets.

Apart from evaluation metrics, several discourse-focused test sets have been proposed to compare NLG systems, most of which have been studied in MT evaluation. For instance, the DiscoMT15 shared task Hardmeier et al. (2015) compares MT systems not based on translation adequacy but on the accuracy of pronoun translation for English-to-French. Bawden et al. (2018) extend this by labeling both anaphoric pronouns and lexical cohesion devices on test sets, while Voita et al. (2018) construct English-to-Russian test sets targeting deixis, ellipsis and lexical cohesion. Lopes et al. (2020) reconstruct a large English-to-French test set targeting pronouns. While reliable, these test sets involve costly manual annotation and are thus limited to a few language pairs.

In this work, we introduce DiscoScore to judge system outputs, which uses BERT to model readers’ focus within hypothesis and reference, and thus clearly outlines the discourse phenomena being captured, serving as a low-cost alternative to discourse test sets for comparing discourse based NLG systems. More prominently, we derive discourse features from DiscoScore, which we use to understand the importance of discourse for evaluation metrics and to explain why one metric is superior to another. This parallels recent efforts towards explainability for non-discourse evaluation metrics Kaster et al. (2021); Fomicheva et al. (2021). Finally, we show that simple features can be indicative of the superiority of a metric, which fosters research towards finding insightful features with domain expertise and building upon these insights to design high-quality metrics.

3 Our Approach

In the following, we elaborate on the two variants of DiscoScore, FocusDiff and SentGraph, referred to as DS-Focus and DS-Sent.

Focus Difference.

In discourse, there have been many corpus-based studies towards modeling focus transitions over sentences, showing that focus transition patterns are indicative of text coherence Barzilay and Lapata (2008); Guinaudeau and Strube (2013). When reading a document, readers may maintain multiple foci of attention, each associated with a group of expressions: (i) referring expressions such as pronouns and (ii) semantically related elements such as [Berlin, capital].

Here we hypothesize two focus based conditions that a coherent hypothesis should meet in reference-based evaluation scenarios:

  • A large number of focus overlaps between a hypothesis and a reference.

  • Each focus overlap is nearly identical in terms of semantics and frequency (focus frequency denotes how often a focus is mentioned in a hypothesis or in a reference).

In the following, we present focus modeling towards semantics and frequency, according to which we compare hypothesis with reference.

For a hypothesis, we introduce a bipartite graph $G = (V_F, V_T, A)$, where $V_F$ and $V_T$ are two sets of vertices corresponding to the set of foci and all tokens (each occurrence of a word is a separate token) within the hypothesis. Let $A \in \{0, 1\}^{m \times n}$ be an adjacency matrix, where $m$ and $n$ are the numbers of foci and tokens respectively, and $A_{ij}$ equals 1 if and only if the $i$-th focus is associated with the $j$-th token. Let $E \in \mathbb{R}^{m \times d}$ be a matrix of focus embeddings and $X \in \mathbb{R}^{n \times d}$ be a matrix of contextualized token embeddings, with $d$ as the embedding size. Similarly, we use the notation $\hat{A}$, $\hat{E}$ and $\hat{X}$ for a human reference.

We use contextualized encoders such as BERT to produce the token embeddings $X$ and $\hat{X}$. We use a simple approach to model both the semantics and the frequency of a focus. That is, we assign each focus an embedding by summing the token embeddings that the focus is associated with:

$$\mathbf{e}_i = \sum_{t \in \mathcal{T}(f_i)} \mathbf{x}_t, \qquad (1)$$

where $\mathcal{T}(f_i)$ is the set of tokens (e.g., a group of semantically related expressions) associated with the focus $f_i$. In matrix notation, we rewrite Eq. (1) as $E = A X$; similarly $\hat{E} = \hat{A} \hat{X}$ for the reference.

Next, we measure the distance between a common set of foci in a hypothesis and reference pair based on their embeddings:

$$\text{DS-Focus}(hyp, ref) = \frac{1}{\zeta} \sum_{f \in \mathcal{F} \cap \hat{\mathcal{F}}} \big\lVert \mathbf{e}_f - \hat{\mathbf{e}}_f \big\rVert, \qquad (2)$$

where $\mathcal{F}$ and $\hat{\mathcal{F}}$ are the sets of foci in hypothesis and reference, and $\zeta$ is a penalty term used to punish little focus overlap between hypothesis and reference. We set $\zeta$ to the number of foci in the hypothesis.
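To make the notation concrete, here is a minimal NumPy sketch of FocusDiff. It is a sketch under assumptions, not the authors’ implementation: the token embeddings are taken from any contextualized encoder, the focus-to-token adjacency matrices are assumed to be precomputed, and the function names and choice of norm are illustrative.

```python
import numpy as np

def focus_embeddings(A, X):
    """Eq. (1) in matrix form: each focus embedding is the sum of the
    contextualized embeddings of the tokens it is associated with (E = A X)."""
    return A @ X

def ds_focus(foci_hyp, A_hyp, X_hyp, foci_ref, A_ref, X_ref):
    """FocusDiff sketch: sum of distances between embeddings of foci shared by
    hypothesis and reference, scaled by a penalty term for little overlap."""
    E_hyp = focus_embeddings(A_hyp, X_hyp)   # (num_foci_hyp, dim)
    E_ref = focus_embeddings(A_ref, X_ref)   # (num_foci_ref, dim)
    common = [f for f in foci_hyp if f in foci_ref]
    zeta = max(len(foci_hyp), 1)             # penalty: number of foci in the hypothesis
    dist = 0.0
    for f in common:
        i, j = foci_hyp.index(f), foci_ref.index(f)
        dist += np.linalg.norm(E_hyp[i] - E_ref[j])  # norm choice is an assumption
    return dist / zeta
```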

Sentence Graph.

Few contextualized encoders can produce high-quality sentence embeddings in document context, as they do not model the interdependence between sentences. According to Centering theory Grosz et al. (1995), two sentences are marked as continuous in meaning when they share at least one focus; when no focus appears in common, this marks a meaning shift. From this, one can aggregate the embeddings of sentences that are considered continuous. In the following, we present a graph-based approach to do so.

For a hypothesis (for simplicity, we omit the analogous notation for a reference), let $S \in \mathbb{R}^{k \times d}$ be a matrix of sentence embeddings, with $k$ and $d$ as the number of sentences and the embedding size. We introduce a graph $G = (V, W)$, where $V$ is the set of sentences and $W$ is an adjacency matrix weighted according to the number of foci shared between sentences and the distance between sentences. We consider two variants of $W$:

  • unweighted: $W_{ij} = 1 / d_{ij}$ if the $i$-th and the $j$-th sentences have at least one focus in common (otherwise 0), where $d_{ij} = j - i$ denotes the distance between the two sentences, for $i < j$.

  • weighted: $W_{ij} = c_{ij} / d_{ij}$, where $c_{ij}$ is the number of foci shared by the $i$-th and the $j$-th sentences, with the same constraints on $d_{ij}$ and $i < j$ as above.

Analyses by Guinaudeau and Strube (2013) indicate that global statistics (e.g., the average) over such adjacency matrices can distinguish incoherent from coherent text to some degree. Here we depict adjacency matrices as a form of sentence connectivity derived from focus transitions over sentences. We use them to aggregate the sentence embeddings $S$ from the hypothesis and $\hat{S}$ from the reference:

$$\bar{S} = (W + I)\, S, \qquad \bar{\hat{S}} = (\hat{W} + I)\, \hat{S},$$

where $I$ is an identity matrix that adds a self-loop to a graph so as to include each sentence's own embedding when updating it.

Next, we derive per graph an embedding with simple statistics from $\bar{S}$ and $\bar{\hat{S}}$, i.e., the concatenation of mean-max-min-sum embeddings. (We compare three mechanisms regarding how best to measure the distance between the two sets of embeddings $\bar{S}$ and $\bar{\hat{S}}$, and find that the statistics-based approach is much better than the two alignment-based ones Zhang et al. (2020); Zhao et al. (2019)—see Table 6.) Finally, we compute the cosine similarity between the two graph-level embeddings $\mathbf{g}$ and $\hat{\mathbf{g}}$:

$$\text{DS-Sent}(hyp, ref) = \cos(\mathbf{g}, \hat{\mathbf{g}}). \qquad (3)$$
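Below is a minimal sketch of SentGraph under the definitions above, assuming precomputed sentence embeddings and per-sentence focus sets; the 1/distance weighting follows the unweighted and weighted variants listed above, and the mean-max-min-sum statistics follow the description in this section. Function names are illustrative, not the authors' code.

```python
import numpy as np

def sent_adjacency(sent_foci, weighted=False):
    """Adjacency over sentences: entry (i, j) with i < j is nonzero when the two
    sentences share at least one focus; 1/distance (unweighted variant) or
    #shared_foci/distance (weighted variant)."""
    k = len(sent_foci)
    W = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            shared = len(sent_foci[i] & sent_foci[j])
            if shared > 0:
                W[i, j] = (shared if weighted else 1.0) / (j - i)
    return W

def graph_embedding(S, W):
    """Aggregate sentence embeddings with (W + I) S, then concatenate
    mean-max-min-sum statistics into a single graph-level embedding."""
    S_agg = (W + np.eye(len(S))) @ S
    return np.concatenate([S_agg.mean(0), S_agg.max(0), S_agg.min(0), S_agg.sum(0)])

def ds_sent(S_hyp, foci_hyp, S_ref, foci_ref, weighted=False):
    """Eq. (3): cosine similarity between the two graph-level embeddings."""
    g = graph_embedding(S_hyp, sent_adjacency(foci_hyp, weighted))
    h = graph_embedding(S_ref, sent_adjacency(foci_ref, weighted))
    return float(g @ h / (np.linalg.norm(g) * np.linalg.norm(h)))
```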

Choice of Focus.

In discourse, often four popular choices are used to describe a focus: (i) a noun that heads a NP Barzilay and Lapata (2008), (ii) a noun Elsner and Charniak (2011), (iii) a coreferent entity associated with a set of referring expressions Guinaudeau and Strube (2013) and (iv) a semantic entity associated with a set of lexical related words Mesgar and Strube (2016).

In this work, we investigate two focus choices, (ii) and (iv). In theory, (iv) is better than (ii), as only a small portion of nouns in a text can be lexical cohesion devices indicative of coherence. From this, nouns as focus yield few useful coherence signals but a lot of noise, while entity as focus uses ‘signal compression’ by means of aggregation to produce better signals. Concerning (iv), we first extract all nouns from the hypothesis (or reference), and aggregate them into different semantic entities if their cosine similarities based on Dep2Vec word embeddings (http://u.cs.biu.ac.il/~yogo/data/syntemb/deps.words.bz2) are greater than a threshold—assuming that nouns with high similarity refer to the same semantic entity. Note that DiscoScore with entity as focus is not a parameter-free metric, but the threshold choice is robust to some degree: after initial trials, we set the threshold to 0.8 for DS-Focus in all setups, and to 0.8 in summarization and 0.5 in MT for DS-Sent.
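The threshold-based grouping of nouns into semantic entities can be sketched as follows, assuming precomputed word vectors (e.g., Dep2Vec); the greedy single-pass clustering shown here is one plausible reading of "aggregate nouns whose cosine similarity exceeds a threshold", not necessarily the authors' exact procedure.

```python
import numpy as np

def group_nouns_into_entities(nouns, vectors, threshold=0.8):
    """Greedily assign each noun to the first entity whose representative noun
    has cosine similarity above the threshold; otherwise start a new entity.
    `vectors` maps a noun to its word vector (np.ndarray)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    entities = []  # each entity: list of nouns; the first noun is its representative
    for noun in nouns:
        v = vectors.get(noun)
        if v is None:          # out-of-vocabulary noun becomes its own entity
            entities.append([noun])
            continue
        for entity in entities:
            rep = vectors.get(entity[0])
            if rep is not None and cos(v, rep) > threshold:
                entity.append(noun)
                break
        else:
            entities.append([noun])
    return entities
```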

4 Experiments

4.1 Evaluation Metrics

In the following, we list all of the evaluation metrics, and elaborate on them in Appendix A.1.

Non-discourse Metrics.

We consider BLEU Papineni et al. (2002), ROUGE Lin (2004), BERTScore Zhang et al. (2020), MoverScore Zhao et al. (2019), SBERT Reimers and Gurevych (2019), S3-pyr Peyrard et al. (2017), BLEURT Sellam et al. (2020), BARTScore Yuan et al. (2021) and PRISM Thompson and Post (2020).

Discourse Metrics.

We consider RC and LC Wong and Kit (2012) and Lexical Chain Gong et al. (2015). We also consider two coherence models, Entity Graph Guinaudeau and Strube (2013) and Lexical Graph Mesgar and Strube (2016), and treat them as discourse metrics.

DiscoScore.

DS-Focus can be parameterized with two focus choices: noun (NN) or semantic entity (Entity). DS-Sent can be parameterized not only with the focus choice, but also with the unweighted and weighted variants. For DS-Focus, we use Conpono Iter et al. (2020), which fine-tunes BERT with a novel discourse-level objective regarding sentence ordering. For DS-Sent, we use BERT-NLI. This is because we find this configuration performs best after initial trials—see Table 4. Figure 2 shows all of the variants of DiscoScore.

4.2 Datasets

We outline two datasets in summarization, and one in document-level MT. We provide data statistics in Table 8 (appendix).

Text Summarization.

While the DUC (https://duc.nist.gov/data.html) and TAC (https://tac.nist.gov/data/) datasets with human rated summaries, constructed a decade ago, were the standard benchmarks for comparing evaluation metrics in summarization, they collect summaries only from extractive summarization systems. In the last few years, abstractive systems have become popular; however, little is known about how well metrics judge them. Recently, several datasets based on CNN/DailyMail have been constructed to address this. For instance, SummEval Fabbri et al. (2021), REALSumm Bhandari et al. (2020), XSum Maynez et al. (2020) and FEQA Durmus et al. (2020) all collect summaries from both extractive and abstractive systems, but differ in the aspects on which human experts rate summaries. In this work, we consider the following two complementary summarization datasets.

  • SummEval has been constructed in a multi-reference setting, i.e., each hypothesis is associated with multiple references. It contains human judgments of summary coherence, factual consistency, fluency and relevance. We only consider abstractive summaries, as they have little lexical overlap with references.

  • NeR18 Grusky et al. (2018), in contrast, has been constructed in a single-reference setting. It contains human judgments of summary coherence, fluency, informativeness and relevance. As the majority of summaries are extractive, we include both extractive and abstractive summaries for a complete picture.

Document-level Machine Translation.

As document-level human ratings in MT are particularly laborious, there have hardly been MT datasets that directly provide them. First attempts suggested using the average of much cheaper sentence-level ratings as a document score for comparing document-level metrics Comelles et al. (2010); Wong and Kit (2012); Gong et al. (2015). However, human experts were asked to rate sentences in isolation within a document. Thus, human ratings at both sentence and document level cannot reflect inter-sentence coherence. Recently, the WMT20 workshop Mathur et al. (2020) asked humans to rate each sentence translation in document context, and followed the previous idea of averaging to yield document scores.

In this work, we use the WMT20 dataset with ‘artificial’ document-level ratings. Note that WMT20 comes with two issues: (i) though sentences are rated in the document context, averaging sentence-level ratings may zero out negative effects of incoherent elements on document level and (ii) unlike SummEval and NeR18, WMT20 only contains human judgment of translation adequacy (which may subsume multiple aspects), not coherence.

For simplicity, we exclude system and reference translations longer than 512 tokens—the maximum allowed by BERT—as only a small portion of instances is over the token limit. Note that it is effortless to replace BERT with Longformer Beltagy et al. (2020) to deal with longer documents for DiscoScore.

5 Results

In this section, we start by investigating the importance of discourse for evaluation metrics, then compare DiscoScore with a range of metrics on summarization and MT datasets, and finally provide an understanding of DiscoScore.

Importance of Discourse.

DS-Focus and DS-Sent concern the modeling of discourse coherence on two different levels: (i) the occurrences of foci, and (ii) the interdependence between sentences driven by focus transitions, both reflecting the discourse characteristics of a text. In the following, we describe these discourse features, and examine how important these features are for the assessment of system outputs. Figure 2 shows the mapping between DiscoScore and these features.

  • Focus Frequency is the ratio between the total frequencies of foci and the number of foci in a text. We exclude foci occurring only once.

  • Sentence Connectivity is the average of the adjacency-matrix entries representing the interdependence between sentences (both features are sketched in code after this list).
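To make the two features concrete, below is a minimal sketch; focus occurrences are represented as a mapping from each focus to its frequency, and the sentence adjacency matrix is built as in the SentGraph sketch above. Averaging over all matrix entries is an assumption about how the "average of the adjacency matrix" is computed.

```python
import numpy as np

def focus_frequency(focus_counts):
    """Ratio between the total frequency of foci and the number of foci,
    excluding foci that occur only once. `focus_counts`: dict focus -> count."""
    repeated = {f: c for f, c in focus_counts.items() if c > 1}
    if not repeated:
        return 0.0
    return sum(repeated.values()) / len(repeated)

def sentence_connectivity(W):
    """Average of the adjacency-matrix entries (see the SentGraph sketch)."""
    return float(np.asarray(W).mean())
```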

The mapping is as follows: DS-Focus (NN) → FREQ (NN); DS-Focus (Entity) → FREQ (Entity); DS-Sent-u (NN) → Conn-u (NN); DS-Sent-u (Entity) → Conn-u (Entity); DS-Sent-w (NN) → Conn-w (NN); DS-Sent-w (Entity) → Conn-w (Entity).
Figure 2: Mapping between the DiscoScore variants and discourse features. FREQ and Conn mean Focus Frequency and Sentence Connectivity.

Figure 3: Distribution of Freq (NN) on SUMMEval. Each coordinate is the output of Freq (NN) on a pair of hypothesis and reference. The coordinates below the auxiliary line are the ones for which Freq (NN) favors hypothesis over reference.

Figure 3 shows that the scales on reference and hypothesis differ by a large amount, i.e., from 0.5 to 2.5 on reference and up to 6 on hypothesis. This means that hypothesis and reference can be strongly distinguished by Freq (NN), which underpins the usefulness of including such discourse signals in the assessment of system outputs when references are available. The results for other discourse features are similar; we provide them in Figure 7 (appendix). Further, we see that Freq (NN), like the other features, mostly favors hypothesis over reference. This might be because foci in a hypothesis are more repetitive than in a reference, as is the case in machine translation, where incoherent, low-quality translations often contain needless repetition Guillou (2013); Voita et al. (2019). As such, these discourse features capture incoherent patterns in a text.

Overall, these results show discourse features mostly separate hypothesis from reference.

5.1 Text Summarization

Settings Metrics Coherence Consistency Fluency Relevance Average
Non-discourse metrics
ROUGE-1 9.09 27.27 18.18 9.09 15.91
ROUGE-L 0.00 36.36 21.21 18.18 18.94
BERTScore 30.30 30.30 51.52 54.55 41.67
MoverScore 36.36 42.42 63.64 60.61 50.76
SBERT 3.03 33.33 30.30 27.27 23.48
BLEURT 45.45 51.52 72.73 63.64 58.33
BARTScore 60.61 36.36 45.45 48.48 47.73
PRISM 51.52 39.39 72.73 69.70 58.33
S3-pyr 18.18 24.24 9.09 6.06 14.39
Discourse metrics
RC 45.45 51.52 54.55 57.58 52.27
LC 51.52 45.45 48.48 57.58 50.76
Entity Graph 42.42 12.12 15.15 18.18 21.97
Lexical Graph 48.48 6.06 15.15 18.18 21.97
Lexical Chain 42.42 6.06 9.09 18.18 18.94
DS-Focus (NN) 75.76 63.64 78.79 81.82 75.00
DS-Focus (Entity) 69.70 57.58 72.73 75.76 68.94
DS-Sent-u (NN) 48.48 54.55 63.64 60.61 56.82
DS-Sent-u (Entity) 54.55 60.61 75.76 66.67 64.39
DS-Sent-w (NN) 51.52 51.52 66.67 63.64 58.33
DS-Sent-w (Entity) 51.52 57.58 66.67 63.64 59.85
Table 1:

System-level Kendall correlations between metrics and human ratings of summary quality on SUMMEval. We bold numbers that significantly outperform others according to a paired t-test Fisher and others (1937).
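As a reminder of how system-level correlation is typically computed (the exact protocol may differ from SUMMEval's), the sketch below averages per-document scores into one score per system and correlates the resulting system scores with Kendall's tau.

```python
import numpy as np
from scipy.stats import kendalltau

def system_level_kendall(metric_scores, human_scores):
    """metric_scores, human_scores: dicts mapping system id -> list of
    per-document scores. Returns Kendall's tau over per-system averages."""
    systems = sorted(metric_scores)
    metric_avg = [np.mean(metric_scores[s]) for s in systems]
    human_avg = [np.mean(human_scores[s]) for s in systems]
    tau, _ = kendalltau(metric_avg, human_avg)
    return tau
```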

Correlation Results.

Table 1 compares metrics on SUMMEval on system level. Most non-discourse metrics have their lowest correlation with human rated coherence among the four quality aspects. Even worse, ROUGE-L and SBERT do not correlate with coherence whatsoever. BARTScore, the recent state-of-the-art metric, is very weak when operated on system level, notwithstanding that it has been fine-tuned on “document-to-summary” parallel data from CNN/DailyMail—from which SUMMEval is constructed. We note that SUMMEval uses multiple references. BARTScore by default compares a hypothesis with one reference at a time, then takes the average of the multiple evaluation scores as the final score. Table 7 (appendix) shows that we can improve system-level BARTScore to some degree by replacing ‘average’ with ‘max’ (i.e., taking the maximum score), but DS-Focus is still much better overall—it surpasses BARTScore by ca. 10 points on average across aspects.
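The two multi-reference aggregation strategies discussed above ('average' vs. 'max' over per-reference scores) can be sketched as follows; score_fn stands for any single-reference metric such as BARTScore.

```python
import numpy as np

def multi_ref_score(score_fn, hypothesis, references, aggregate="mean"):
    """Score a hypothesis against multiple references and aggregate:
    'mean' is the default behavior described above, 'max' takes the
    best-matching reference instead."""
    scores = [score_fn(hypothesis, ref) for ref in references]
    return float(np.max(scores) if aggregate == "max" else np.mean(scores))
```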

Table 2 reports correlation results on NeR18, which uses a single reference. We find that half of the hypotheses do not contain ‘good foci’, and as such the foci-based discourse features outlined previously are less discriminative on NeR18 than on SUMMEval—see Table 8 (appendix). However, DS-Focus is still strong, ca. 20 points better than BARTScore in all aspects, despite using a much smaller contextualized encoder (DS-Focus uses Conpono, the same size as BERT-Base; BARTScore uses BART-Large finetuned on CNN/DailyMail). We note that the ‘F-score’ version of DS-Focus seems extremely strong on NeR18, but it is not robust across datasets, e.g., it is much worse than the original, precision-based DS-Focus on SUMMEval.

On a side note, coherence (mostly) correlates strongly with the other rating aspects on both SUMMEval and NeR18—see Figure 4. Thus, it is not surprising that both DS-Focus and DS-Sent correlate well with these aspects, even though we have not targeted them. While strong on system level, DiscoScore does not show advantages on summary level (see Table 3), but we argue that system-level correlation deserves the highest priority as systems are compared in this manner.

Overall, these results show that BERT-based non-discourse metrics correlate weakly with human ratings on system level. So does BARTScore, though we improve it to some degree in the multi-reference setting. DiscoScore, particularly DS-Focus, performs consistently best in both single- and multi-reference settings, and it is equally strong in all aspects.

As for discourse metrics, RC and LC, which use discourse features, are strong baselines, as they outperform most non-discourse metrics and coherence models (i.e., Entity and Lexical Graph) without access to source texts or references. However, they are worse than both DS-Focus and DS-Sent. This confirms the inadequacy of RC and LC, in that they do not leverage strong contextualized encoders and judge a hypothesis in the absence of references. Moreover, we compare DiscoScore with a combination of two strong, complementary baselines, BARTScore and RC—a simple attempt to address text coherence in non-discourse metrics. To combine them, we simply average their scores. We see that the gains are additive in all aspects but coherence; DS-Focus wins all the time by a large margin—see Table 9 (appendix).

Taken together, these results show that none of the three—(i) leveraging contextualized encoders as in BERT-based metrics and BARTScore; (ii) leveraging discourse features as in RC; and (iii) the ensemble of (i) and (ii)—is sufficient, suggesting combining (i) and (ii) as DiscoScore does.

Figure 4: Pearson Correlation between coherence and other aspects on system level. SUMMEval and NeR18 use Consistency and Informativeness respectively.
Metrics Coherence Fluency Informativeness Relevance Average
BARTScore 42.58 42.58 23.80 33.33 35.57
PRISM 51.52 42.58 42.86 52.38 47.33
DS-Focus (NN) 61.90 61.90 42.86 52.38 54.76
DS-Focus* (NN) 80.95 80.95 100.00 90.47 88.09
DS-Sent-u (NN) 14.29 14.29 14.29 23.81 16.67
Table 2: System-level Kendall correlations between metrics and human ratings on NeR18. DS-Focus* is the ‘F-score’ version of DS-Focus.

Understanding DiscoScore.

For the variants of DiscoScore, we provide an understanding of why one variant is superior to another, using the discourse features outlined in Figure 2. To this end, we begin by quantifying the discriminativeness of these features, which concerns the magnitude of separation between hypothesis and reference:

$$D(\phi) = \frac{1}{Z} \sum_{(hyp,\, ref)} \big| \phi(hyp) - \phi(ref) \big|, \qquad (4)$$

where $Z$ is a normalization term and $\phi$ is any one of the discourse features in Figure 2.
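A sketch of the discriminativeness computation, assuming Eq. (4) is the normalized absolute gap between a feature's values on hypotheses and references aggregated over hypothesis-reference pairs; the normalization term used here is an assumption.

```python
import numpy as np

def discriminativeness(feature_fn, pairs):
    """pairs: list of (hypothesis, reference) inputs, each already in the form
    feature_fn expects (e.g., focus counts or an adjacency matrix). feature_fn:
    a discourse feature from Figure 2, such as focus_frequency. Returns the
    normalized average absolute gap between hypothesis and reference values."""
    gaps = [abs(feature_fn(hyp) - feature_fn(ref)) for hyp, ref in pairs]
    # normalization term Z; its exact definition is an assumption here
    norm = max(np.mean([feature_fn(ref) for _, ref in pairs]), 1e-12)
    return float(np.mean(gaps) / norm)
```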

Figure 5 shows that the discriminativeness of these features strongly correlates with the results of the DiscoScore variants, i.e., the more discriminative the features are, the better the metrics perform. This attributes the superiority of a metric to the fact that its underlying discourse feature can better separate hypothesis from reference.

From this, we can interpret the performance gaps between the DiscoScore variants, namely (i) DS-Focus over DS-Sent: given that Focus Frequency is more discriminative than Sentence Connectivity, it is not surprising that DS-Focus, which models discourse coherence with the former, outperforms DS-Sent, which models it with the latter; and (ii) DS-Focus (NN) outperforms DS-Focus (Entity) because Frequency (NN) can better separate hypothesis from reference than Frequency (Entity).

Figure 5: Correlations between the results of metrics and the discriminativeness of features on SUMMEval. Metric results are averaged across four rating aspects.
Metrics SUMMEval NeR18
BARTScore 14.13 24.78
PRISM 14.92 18.89
DS-Focus (NN) 10.81 10.42
DS-Sent-u (NN) 15.71 3.81
Table 3: Summary-level averaged Kendall correlations across all rating aspects.
Metrics Encoders Average
DS-Focus (NN) + BERT 71.97
+ BERT-NLI 70.45
+ Conpono 75.00
DS-Sent-u (NN) + BERT 35.61
+ BERT-NLI 56.82
+ Conpono 23.48
Table 4: Results of three contextualized encoders on SUMMEval. Results are averaged across four aspects.
Metrics Average
DS-Sent-u (NN) 56.82
w/o sentence aggregation 46.21
Table 5: Ablation study on the use of adjacency matrix to aggregate sentence embeddings.
Metrics Mechanisms Average
DS-Sent-u (NN) + greedy align 21.97
+ optimal align 26.52
+ mean-max-min-sum 56.82
Table 6: Averaged results of SentGraph variants based on three mechanisms on SUMMEval.

Choice of BERT Variants.

Table 4 compares the impact of three BERT variants on DiscoScore. Conpono, referred to as a discourse BERT, fine-tunes BERT with a novel discourse-level objective regarding sentence ordering. While strong on discourse evaluation benchmarks Chen et al. (2019), Conpono is not always helpful, e.g., BERT-NLI is better for DS-Sent. These results suggest the best configuration for DiscoScore.

Impact of Sentence Connectivity.

Table 5 shows an ablation study on the use of sentence connectivity. Aggregating sentence embeddings with our adjacency matrices (see §3) helps considerably. This confirms the usefulness of aggregation, through which we include coherence signals in sentence embeddings.

SentGraph Variants.

Table 6 compares three DS-Sent variants with respect to how we measure the distance between two sets of sentence embeddings from hypothesis and reference. (In particular, we refer to BERTScore Zhang et al. (2020) as a ‘greedy align’ mechanism used to compute the similarity between two sets of sentence embeddings; for ‘optimal align’, we use MoverScore Zhao et al. (2019).) While the two alignments directly measure the distance between the two sets, the simple statistics, i.e., mean-max-min-sum, derive a graph embedding from each set, and we compute the cosine similarity between the two graph embeddings. The ‘statistics’ variant wins by a big margin.

5.2 Document-level Machine Translation

Correlation Results.

Table 11 (appendix) compares metrics on WMT20. We see that non-discourse metrics seem much better, but these results are not consistent with the discriminativeness of the discourse features—see Table 10 (appendix). For instance, in cs-en, the discourse features of both DS-Focus and DS-Sent clearly separate hypothesis from reference. Surprisingly, both DS-Focus and DS-Sent correlate weakly with human rated adequacy. One should not forget that document-level adequacy is ‘artificial’, as it averages sentence-level ratings. Thus, poor correlation with ‘artificial adequacy’ means either that ‘adequacy’ does not properly reflect discourse coherence or that averaging sentence-level ratings zeros out the negative effects of incoherent elements at document level.

On the other hand, these results provide justification for recent criticisms of ‘adequacy’ in WMT20 evaluation setups Freitag et al. (2021): they showed that human ratings of adequacy sometimes have trouble distinguishing human from system translations, and that ‘adequacy’ correlates weakly with other aspects (e.g., fluency and accuracy) selected from the MQM rating scheme. However, their studies are limited in scope to non-discourse aspects and reference-free evaluation; we complement them with analyses in reference-based, discourse coherence evaluation.

Correlation between Metrics.

Figure 6: Pearson Correlations between metrics on WMT20 in Czech-to-English.

Figure 6 shows inter-correlations between metrics on WMT20 in Czech-to-English. Correlations are mostly high between non-discourse metrics, and much weaker between discourse and non-discourse metrics—which confirms their orthogonality, in that they rate translations in different aspects. We note that DS-Focus has the lowest correlations with all other metrics; for instance, DS-Focus is almost orthogonal to BERTScore and MoverScore. We investigated whether combining them yields additive gains, and find that a combination of DS-Focus and BERTScore (or MoverScore) provides zero help in correlation with adequacy—which further reinforces our belief that recent MT document-level setups are problematic in the previously mentioned aspects and call for improvement. The results for other language pairs are similar; we provide them in Figure 8 (appendix).

6 Conclusions

Given the recent growth in discourse based NLG systems, evaluation metrics targeting the assessment of text coherence are essential next steps for properly tracking the progress of these systems. Although several attempts have been made towards discourse metrics, none of them leverages strong contextualized encoders, which have been held responsible for much of the recent success in NLP. In this work, we introduced DiscoScore, which uses BERT to model discourse coherence from two perspectives of readers’ focus: (i) frequencies and semantics of foci and (ii) focus transitions over sentences, used to predict the interdependence between sentences. We find that BERT-based non-discourse metrics cannot address text coherence, and are even much worse than early feature-based discourse metrics invented a decade ago. We also find that the recent state-of-the-art BARTScore correlates weakly with human ratings on system level. DiscoScore, on the other hand, performs consistently best in both single- and multi-reference settings, and is equally strong in coherence and several other aspects such as factual consistency, even though we have not targeted them. More prominently, we provide an understanding of the importance of discourse for evaluation metrics, and explain the superiority of one metric over another with simple features, in line with recent efforts towards explainability for evaluation metrics Kaster et al. (2021); Fomicheva et al. (2021). However, we acknowledge that DiscoScore is weak on summary level, for which our studies are shallow compared to the thorough analyses on system level. We hope that our results provide insights for future research to improve discourse metrics on summary level. Last but not least, with the justifications we provided, we argue that recent WMT evaluation setups that use ‘adequacy’ alone are not suitable for comparing MT document-level metrics, particularly in coherence. We advocate that future WMT evaluation campaigns consider multiple rating aspects, as in summarization (for instance, Freitag et al. (2021) suggested replacing ‘adequacy’ with multiple other aspects selected from the MQM rating scheme).

There is ample scope for future research, e.g., developing reference-free discourse metrics that compare source texts to hypotheses, improving document-level MT evaluation setups, discovering other novel features to design high-quality metrics, and re-ranking NLP systems on leaderboards with discourse metrics and rigorous comparison mechanisms Peyrard et al. (2021); Kocmi et al. (2021).

Acknowledgments

We thank Yannik Benz, ZhanKe Liu, Frank Pfirmann and Dennis Weinberger who completed a seminar project related to this work. This work has been funded by the Klaus Tschira Foundation, Heidelberg, Germany.

References

  • R. Barzilay and M. Lapata (2008) Modeling local coherence: an entity-based approach. Computational Linguistics 34 (1), pp. 1–34. Cited by: §1, §2, §3, §3.
  • R. Bawden, R. Sennrich, A. Birch, and B. Haddow (2018)

    Evaluating discourse phenomena in neural machine translation

    .
    In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1304–1313. External Links: Link, Document Cited by: §2.
  • I. Beltagy, M. E. Peters, and A. Cohan (2020) Longformer: the long-document transformer. arXiv:2004.05150. Cited by: §4.2.
  • A. Beyer, S. Loáiciga, and D. Schlangen (2021) Is incoherence surprising? targeted evaluation of coherence prediction from language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 4164–4173. External Links: Link, Document Cited by: §2.
  • M. Bhandari, P. Narayan Gour, A. Ashfaq, P. Liu, and G. Neubig (2020) Re-evaluating evaluation in text summarization. In

    Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

    ,
    Cited by: §4.2.
  • B. Cartoni, J. Libovický, and T. B. (Meyer) (Eds.) (2018) Machine translation evaluation beyond the sentence level. Alicante, Spain. Cited by: §2.
  • M. Chen, Z. Chu, and K. Gimpel (2019) Evaluation benchmarks and learning criteria for discourse-aware sentence representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 649–662. External Links: Link, Document Cited by: §5.1.
  • W. Chen, P. Li, and I. King (2021) A training-free and reference-free summarization evaluation metric via centrality-weighted relevance and self-referenced redundancy. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 404–414. External Links: Link, Document Cited by: §2.
  • E. Comelles, J. Giménez, L. Màrquez, I. Castellón, and V. Arranz (2010) Document-level automatic MT evaluation based on discourse representations. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, Uppsala, Sweden, pp. 333–338. External Links: Link Cited by: §4.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §2.
  • E. Durmus, H. He, and M. Diab (2020) FEQA: a question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 5055–5070. External Links: Link, Document Cited by: §4.2.
  • M. Elsner and E. Charniak (2011) Extending the entity grid with entity-specific features. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 125–129. Cited by: 4th item, §3.
  • A. R. Fabbri, W. Kryściński, B. McCann, C. Xiong, R. Socher, and D. Radev (2021) Summeval: re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics 9, pp. 391–409. Cited by: §1, §4.2.
  • R. A. Fisher et al. (1937) The design of experiments.. The design of experiments. (2nd Ed). Cited by: Table 1.
  • M. Fomicheva, P. Lertvittayakumjorn, W. Zhao, S. Eger, and Y. Gao (2021)

    The Eval4NLP shared task on explainable quality estimation: overview and results

    .
    In Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems, Punta Cana, Dominican Republic, pp. 165–178. External Links: Link Cited by: §2, §6.
  • M. Freitag, G. Foster, D. Grangier, V. Ratnakar, Q. Tan, and W. Macherey (2021) Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation. Transactions of the Association for Computational Linguistics 9, pp. 1460–1474. Cited by: §5.2, footnote 10, footnote 11.
  • Y. Gao, W. Zhao, and S. Eger (2020) SUPERT: towards new frontiers in unsupervised evaluation metrics for multi-document summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 1347–1354. External Links: Link, Document Cited by: §2.
  • Z. Gong, M. Zhang, and G. Zhou (2015) Document-level machine translation evaluation with gist consistency and text cohesion. In Proceedings of the Second Workshop on Discourse in Machine Translation, pp. 33–40. Cited by: 3rd item, §2, §4.1, §4.2.
  • B. J. Grosz, A. K. Joshi, and S. Weinstein (1995) Centering: a framework for modeling the local coherence of discourse. Computational Linguistics 21 (2), pp. 203–225. External Links: Link Cited by: §1, §1, §2, §3.
  • M. Grusky, M. Naaman, and Y. Artzi (2018) Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 708–719. External Links: Link, Document Cited by: 2nd item.
  • L. Guillou (2013) Analysing lexical consistency in translation. In Proceedings of the Workshop on Discourse in Machine Translation, Sofia, Bulgaria, pp. 10–18. External Links: Link Cited by: §5.
  • C. Guinaudeau and M. Strube (2013) Graph-based local coherence modeling. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, pp. 93–103. External Links: Link Cited by: 2nd item, §1, §2, §3, §3, §3, §4.1.
  • F. Guzmán, S. Joty, L. Màrquez, and P. Nakov (2014) Using discourse structure improves machine translation evaluation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, pp. 687–698. External Links: Link, Document Cited by: §2.
  • C. Hardmeier, P. Nakov, S. Stymne, J. Tiedemann, Y. Versley, and M. Cettolo (2015) Pronoun-focused MT and cross-lingual pronoun prediction: findings of the 2015 DiscoMT shared task on pronoun translation. In Proceedings of the Second Workshop on Discourse in Machine Translation, Lisbon, Portugal, pp. 1–16. External Links: Link, Document Cited by: §2.
  • K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom (2015) Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28, pp. . External Links: Link Cited by: 6th item.
  • D. Iter, K. Guu, L. Lansing, and D. Jurafsky (2020) Pretraining with contrastive sentence objectives improves discourse performance of language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 4859–4870. External Links: Link, Document Cited by: §4.1, footnote 1.
  • Y. Jiang, S. Ma, D. Zhang, J. Yang, H. Huang, and M. Zhou (2021) BlonD: an automatic evaluation metric for document-level machinetranslation. CoRR abs/2103.11878. External Links: Link, 2103.11878 Cited by: §2.
  • S. Joty, F. Guzmán, L. Màrquez, and P. Nakov (2017) Discourse structure in machine translation evaluation. Computational Linguistics 43 (4), pp. 683–722. Cited by: §2.
  • M. Kaster, W. Zhao, and S. Eger (2021) Global explainability of BERT-based evaluation metrics by disentangling along linguistic factors. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, pp. 8912–8925. External Links: Link Cited by: §1, §2, §6.
  • T. Kocmi, C. Federmann, R. Grundkiewicz, M. Junczys-Dowmunt, H. Matsushita, and A. Menezes (2021) To ship or not to ship: an extensive evaluation of automatic metrics for machine translation. CoRR abs/2107.10821. External Links: Link, 2107.10821 Cited by: §6.
  • F. Koto, J. H. Lau, and T. Baldwin (2021) Discourse probing of pretrained language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 3849–3864. External Links: Link, Document Cited by: §2.
  • W. Kryscinski, N. S. Keskar, B. McCann, C. Xiong, and R. Socher (2019) Neural text summarization: a critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 540–551. External Links: Link, Document Cited by: §2.
  • M. Kusner, Y. Sun, N. Kolkin, and K. Weinberger (2015) From word embeddings to document distances. In

    International conference on machine learning

    ,
    pp. 957–966. Cited by: 2nd item.
  • P. Laban, L. Dai, L. Bandarkar, and M. A. Hearst (2021) Can transformer models measure coherence in text: re-thinking the shuffle test. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Online, pp. 1058–1064. External Links: Link, Document Cited by: §2.
  • A. Lai and J. Tetreault (2018) Discourse coherence in the wild: a dataset, evaluation and methods. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, Melbourne, Australia, pp. 214–223. External Links: Link, Document Cited by: §2.
  • A. Lavie and A. Agarwal (2007) METEOR: an automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic, pp. 228–231. External Links: Link Cited by: §2.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7871–7880. External Links: Link, Document Cited by: §2.
  • C. Lin (2004) ROUGE: A Package for Automatic Evaluation of summaries. In Proceedings of ACL workshop on Text Summarization Branches Out, Barcelona, Spain, pp. 74–81. Cited by: 1st item, §2, §4.1.
  • Z. Lin, H. T. Ng, and M. Kan (2011) Automatically evaluating text coherence using discourse relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, Oregon, USA, pp. 997–1006. External Links: Link Cited by: §1, §2.
  • Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer (2020) Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, pp. 726–742. Cited by: §2.
  • Z. Liu, K. Shi, and N. Chen (2021) Coreference-aware dialogue summarization. In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, Singapore and Online, pp. 509–519. External Links: Link Cited by: §1.
  • A. Lopes, M. A. Farajian, R. Bawden, M. Zhang, and A. F. T. Martins (2020) Document-level neural MT: a systematic comparison. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, Lisboa, Portugal, pp. 225–234. External Links: Link Cited by: §2.
  • W. C. Mann and S. A. Thompson (1988) Rhetorical structure theory: toward a functional theory of text organization. Text-interdisciplinary Journal for the Study of Discourse 8 (3), pp. 243–281. Cited by: §1, §2.
  • S. Maruf, F. Saleh, and G. Haffari (2021) A survey on document-level neural machine translation: methods and evaluation. ACM Computing Surveys (CSUR) 54 (2), pp. 1–36. Cited by: §1.
  • N. Mathur, J. Wei, M. Freitag, Q. Ma, and O. Bojar (2020) Results of the WMT20 metrics shared task. In Proceedings of the Fifth Conference on Machine Translation, Online, pp. 688–725. External Links: Link Cited by: §4.2.
  • J. Maynez, S. Narayan, B. Bohnet, and R. T. Mcdonald (2020) On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 1906–1919. Cited by: §4.2.
  • M. Mesgar, L. F. R. Ribeiro, and I. Gurevych (2021) A neural graph-based local coherence model. In Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, pp. 2316–2321. External Links: Link Cited by: §2.
  • M. Mesgar and M. Strube (2016) Lexical coherence graph modeling using word embeddings. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 1414–1423. External Links: Link, Document Cited by: 2nd item, 4th item, §2, §3, §4.1.
  • M. Mesgar and M. Strube (2018) A neural local coherence model for text quality assessment. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4328–4339. External Links: Link, Document Cited by: §2.
  • L. Miculicich, D. Ram, N. Pappas, and J. Henderson (2018) Document-level neural machine translation with hierarchical attention networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2947–2954. External Links: Link, Document Cited by: §1.
  • H. C. Moon, T. Mohiuddin, S. Joty, and C. Xu (2019) A unified neural coherence model. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2262–2272. External Links: Link, Document Cited by: §2.
  • J. Ng and V. Abrecht (2015) Better summarization evaluation with word embeddings for ROUGE. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1925–1930. External Links: Link, Document Cited by: §2.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, Stroudsburg, PA, USA, pp. 311–318. External Links: Link, Document Cited by: 1st item, §2, §4.1.
  • M. Peyrard, T. Botschen, and I. Gurevych (2017) Learning to score system summaries for better content selection evaluation.. In Proceedings of the Workshop on New Frontiers in Summarization, Copenhagen, Denmark, pp. 74–84. External Links: Link, Document Cited by: 4th item, §4.1.
  • M. Peyrard, W. Zhao, S. Eger, and R. West (2021) Better than average: paired evaluation of NLP systems. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 2301–2315. External Links: Link, Document Cited by: §6.
  • E. Pitler and A. Nenkova (2008) Revisiting readability: A unified framework for predicting text quality. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Honolulu, Hawaii, pp. 186–195. External Links: Link Cited by: §2.
  • A. Pu, H. W. Chung, A. Parikh, S. Gehrmann, and T. Sellam (2021) Learning compact metrics for MT. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, pp. 751–762. External Links: Link Cited by: §2.
  • R. Rei, C. Stewart, A. C. Farinha, and A. Lavie (2020) COMET: a neural framework for MT evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 2685–2702. External Links: Link Cited by: §2.
  • N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), pp. 3980–3990. External Links: Link, Document Cited by: 3rd item, §4.1.
  • A. B. Sai, T. Dixit, D. Y. Sheth, S. Mohan, and M. M. Khapra (2021) Perturbation CheckLists for evaluating NLG evaluation metrics. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, pp. 7219–7234. External Links: Link Cited by: §1.
  • D. Saunders, F. Stahlberg, and B. Byrne (2020) Using context in neural machine translation training objectives. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7764–7770. External Links: Link, Document Cited by: §2.
  • T. Sellam, D. Das, and A. Parikh (2020) BLEURT: learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7881–7892. External Links: Link, Document Cited by: 5th item, §2, §4.1.
  • B. Thompson and M. Post (2020) Automatic machine translation evaluation in many languages via zero-shot paraphrasing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 90–121. External Links: Link, Document Cited by: 6th item, §2, §4.1.
  • D. Tien Nguyen and S. Joty (2017) A neural local coherence model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1320–1330. External Links: Link, Document Cited by: §2.
  • E. Voita, R. Sennrich, and I. Titov (2019) When a good translation is wrong in context: context-aware machine translation improves on deixis, ellipsis, and lexical cohesion. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1198–1212. External Links: Link, Document Cited by: §5.
  • E. Voita, P. Serdyukov, R. Sennrich, and I. Titov (2018) Context-aware neural machine translation learns anaphora resolution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1264–1274. External Links: Link, Document Cited by: §1, §2.
  • B. T. M. Wong and C. Kit (2012) Extending machine translation evaluation metrics with lexical cohesion to document level. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jeju Island, Korea, pp. 1060–1068. External Links: Link Cited by: 1st item, 1st item, §2, §4.1, §4.2.
  • H. Xiong, Z. He, H. Wu, and H. Wang (2019) Modeling coherence for discourse neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7338–7345. Cited by: §2.
  • J. Xu, Z. Gan, Y. Cheng, and J. Liu (2020) Discourse-aware neural extractive text summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 5021–5031. External Links: Link, Document Cited by: §1.
  • W. Yuan, G. Neubig, and P. Liu (2021) BARTScore: evaluating generated text as text generation. In Thirty-Fifth Conference on Neural Information Processing Systems, External Links: Link Cited by: 6th item, 1st item, §1, §2, §4.1.
  • T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi (2020) BERTScore: evaluating text generation with BERT. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: 2nd item, §1, §2, §4.1, footnote 4, footnote 9.
  • W. Zhao, G. Glavaš, M. Peyrard, Y. Gao, R. West, and S. Eger (2020) On the limitations of cross-lingual encoders as exposed by reference-free machine translation evaluation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 1656–1671. External Links: Link, Document Cited by: §2.
  • W. Zhao, M. Peyrard, F. Liu, Y. Gao, C. M. Meyer, and S. Eger (2019) MoverScore: text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 563–578. External Links: Link, Document Cited by: 2nd item, §1, §2, §4.1, footnote 4, footnote 9.
  • W. Zhu and S. Bhat (2020) GRUEN for evaluating linguistic quality of generated text. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 94–108. External Links: Link, Document Cited by: §2.

Appendix A

Metrics Coherence Consistency Fluency Relevance Average
BARTScore (max) 78.79 48.48 63.64 72.73 65.91
BARTScore (original) 60.61 36.36 45.45 48.48 47.73
FocusDiff (NN) 75.76 63.64 78.79 81.82 75.00
FocusDiff (Entity) 69.70 57.58 72.73 75.76 68.94
SentGraph-u (NN) 48.48 54.55 63.64 60.61 56.82
SentGraph-u (Entity) 54.55 60.61 75.76 66.67 64.39
Table 7: System-level Kendall correlations between metrics and human ratings on SUMMEval in multi-reference settings. BARTScore (original) compares a hypothesis with one reference at a time, and takes the average of evaluation scores as a final score, while BARTScore (max) takes the maximum score.
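For concreteness, the two aggregation strategies compared in Table 7 can be sketched as follows. This is an illustration of the aggregation step only, not the BARTScore API; `score_fn` is a hypothetical placeholder for any single-reference hypothesis-reference scorer.

```python
from statistics import mean

def aggregate_multi_reference(score_fn, hypothesis, references, mode="max"):
    """Score a hypothesis against each reference separately, then aggregate."""
    per_ref_scores = [score_fn(hypothesis, ref) for ref in references]
    # 'original' averages over references; 'max' keeps the best-matching reference.
    return max(per_ref_scores) if mode == "max" else mean(per_ref_scores)
```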
SUMMEval NeR18 WMT20: cs-en de-en ja-en ru-en
Number of references 11 1 1 1 1 1
Number of systems 12 7 13 14 11 13
Number of hypotheses per system 100 60 102 118 80 91
Number of sentences per hypothesis 3.13 1.90 15.21 13.84 11.29 9.46
Average number of foci in hypothesis 15.18 12.85 62.01 56.68 57.09 44.99
Average number of ‘good foci’ in hypothesis 2.47 2.56 13.16 13.37 15.07 9.95
Percent of hypotheses with ‘good foci’ 80.50% 43.80% 100% 98.60% 100% 100%
Table 8: Characteristics of summarization and MT datasets. A ‘good focus’ is a focus that appears more than once in a hypothesis; the more often a focus appears, the stronger the discourse signals are.
Metrics Coherence Consistency Fluency Relevance Average
RC 45.45 51.52 54.55 57.58 52.27
BARTScore (max) 78.79 48.48 63.64 72.73 65.91
BARTScore (max) + RC 66.67 54.55 69.70 78.79 67.42
DS-Focus (NN) 75.76 63.64 78.79 81.82 75.00
Table 9: Ensemble of non-discourse and discourse metrics (BARTScore + RC) vs. DiscoScore.
Figure 7: Distribution of discourse features over hypothesis and reference on SUMMEval. Panels: (a) Focus Frequency (NN), (b) Focus Frequency (Entity), (c) Connectivity-u (NN), (d) Connectivity-u (Entity), (e) Connectivity-w (NN), (f) Connectivity-w (Entity).
Figure 8: Pearson correlations between metrics on WMT20 in German-to-English (left), Japanese-to-English (middle) and Russian-to-English (right).
cs-en de-en ja-en ru-en
DiscoFeatures
Frequency (NN) 74.18 2.00 23.82 57.38 9.65 32.97 53.04 2.63 44.33 52.77 7.31 39.92
Frequency (Entity) 76.17 1.76 22.07 59.74 8.38 31.88 52.38 1.48 46.14 53.61 7.31 39.08
Connectivity-u (NN) 78.05 0.35 21.60 63.11 8.29 28.60 59.61 5.25 35.14 52.04 10.03 37.93
Connectivity-u (Entity) 79.46 0.35 20.19 62.02 8.20 29.78 59.44 5.09 35.47 52.87 9.40 37.72
Connectivity-w (NN) 77.93 0.24 21.83 64.85 4.64 30.51 59.12 0.49 40.39 59.98 5.12 34.90
Connectivity-w (Entity) 80.40 0.23 19.37 63.48 4.73 31.79 60.76 0.33 38.91 60.82 4.60 34.58
Table 10: Statistics of discourse features on WMT20. For each language pair, the three columns give the percent of ‘reference-hypothesis’ pairs whose feature values stand in each of the three possible order relations (greater, equal, smaller), with the feature being any one of those listed. We exclude pairs for which hypothesis and reference are exactly the same.
Direct Assessment (Adequacy)
Settings Metrics cs-en de-en ja-en ru-en Average
Non-discourse metrics
BLEU 7.44 57.52 41.48 10.74 29.30
BERTScore 10.82 60.38 46.95 13.08 32.81
MoverScore 15.40 61.69 42.12 13.78 33.25
BARTScore 10.82 60.26 46.30 14.95 33.09
PRISM 8.64 58.83 32.48 15.42 28.84
SBERT 13.20 55.26 33.44 10.04 27.99
BLEURT 12.01 58.83 37.94 18.22 31.75
S3-pyr 6.25 58.83 42.44 13.78 30.33
S3-resp 5.85 58.59 47.26 14.71 31.61
Discourse metrics (reference-free)
RC 5.85 7.19 8.68 9.34 7.77
LC 9.23 1.72 3.53 6.07 5.14
Entity Graph 5.06 43.24 3.53 10.51 15.59
Lexical Graph 2.28 43.60 5.14 13.55 16.15
Discourse metrics (reference-based)
Lexical Chain 21.54 35.15 15.11 16.12 21.99
FocusDiff (NN) 7.64 33.13 19.29 2.57 15.66
FocusDiff (Entity) 6.45 33.73 19.94 1.64 15.44
SentGraph-u (NN) 7.64 57.16 39.22 18.22 30.56
SentGraph-u (Entity) 7.65 57.17 39.23 18.22 30.57
SentGraph-w (NN) 7.65 57.18 39.22 18.21 30.57
SentGraph-w (Entity) 7.65 57.17 39.23 18.22 30.57
Table 11: Document-level Kendall correlations between metrics and human-rated translation quality on WMT20.

A.1 Evaluation Metrics

Non-discourse Metrics.

We consider the following non-discourse metrics.

  • BLEU Papineni et al. (2002) and ROUGE Lin (2004) are precision- and recall-oriented metrics, respectively; both measure n-gram overlap between a hypothesis and a reference.

  • BERTScore Zhang et al. (2020) and MoverScore Zhao et al. (2019) are set-based metrics that measure the semantic similarity between a hypothesis and a reference. BERTScore uses greedy alignment to compare two sets of BERT-based word embeddings, one from the hypothesis and one from the reference, while MoverScore instead computes optimal alignments via Word Mover’s Distance Kusner et al. (2015); a sketch of the greedy-matching idea is given after this list.

  • SBERT Reimers and Gurevych (2019) fine-tunes BERT on NLI datasets and uses pooling operations to produce sentence embeddings. We compute the cosine similarity between the sentence representations of the hypothesis and the reference.

  • S3-pyr and S3-resp Peyrard et al. (2017) are supervised metrics that linearly combine ROUGE, JS-divergence and ROUGE-WE scores; they are trained on the TAC datasets with human-annotated pyramid and responsiveness scores as supervision.

  • BLEURT Sellam et al. (2020) is another supervised metric, which fine-tunes BERT on the concatenation of WMT datasets and synthetic in-domain MT data, using human judgments of translation quality as supervision.

  • BARTScore Yuan et al. (2021) and PRISM Thompson and Post (2020) cast sequence-to-sequence language models as metrics for comparing a hypothesis with a reference. In reference-based settings, both measure the likelihood that hypothesis and reference are paraphrases, but they differ in the underlying language model: PRISM is based on a neural MT system trained from scratch on parallel MT data, while BARTScore uses BART fine-tuned on CNN/DailyMail Hermann et al. (2015), i.e., parallel data in summarization. We use the ‘F-score’ version of BARTScore, as recommended in Yuan et al. (2021).
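For illustration, here is a minimal sketch of the greedy matching behind BERTScore mentioned above. It is our simplification, not the official bert-score implementation, and it assumes `hyp_emb` and `ref_emb` are token-by-dimension matrices of contextualized embeddings produced elsewhere, e.g., by a BERT encoder.

```python
import numpy as np

def greedy_f1(hyp_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """BERTScore-style greedy matching over precomputed token embeddings."""
    # L2-normalize so that dot products are cosine similarities.
    hyp = hyp_emb / np.linalg.norm(hyp_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = hyp @ ref.T                   # pairwise similarities, shape (|hyp|, |ref|)
    precision = sim.max(axis=1).mean()  # each hypothesis token keeps its best reference match
    recall = sim.max(axis=0).mean()     # each reference token keeps its best hypothesis match
    return 2 * precision * recall / (precision + recall)
```

With real BERT embeddings this corresponds to the F variant of BERTScore up to IDF weighting and baseline rescaling, both of which the sketch omits.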

Discourse Metrics.

We consider the following discourse metrics (including ours and coherence models).

  • RC and LC Wong and Kit (2012) require neither source texts nor references; they use lexical cohesion devices (e.g., repetition) within a hypothesis to predict its coherence. LC computes the proportion of words in the hypothesis that are lexical cohesion devices, while RC computes the proportion of times that lexical cohesion devices occur in the hypothesis.

  • Entity Graph Guinaudeau and Strube (2013) and Lexical Graph Mesgar and Strube (2016) are popular coherence models used in discourse tasks such as essay scoring. Both build a graph whose nodes are sentences and whose adjacency matrix encodes the connectivity between sentences. Here, we use the average of the adjacency-matrix entries of the hypothesis graph as a proxy for hypothesis coherence (see the sketch after this list). Entity Graph draws an edge between two sentences if they have at least one noun in common, whereas Lexical Graph draws an edge if the two sentences contain a pair of similar words, i.e., words whose embeddings have cosine similarity above a threshold.

  • Lexical Chain Gong et al. (2015) extracts multiple lexical chains from the hypothesis and from the reference. A word is associated with a lexical chain if it appears in more than one sentence, and each chain stores the set of sentence positions in which the word appears. Finally, the metric performs soft matching to measure lexical-chain overlap between hypothesis and reference.

  • FocusDiff and SentGraph are the two variants of DiscoScore; both use BERT to model the semantics and coherence of readers’ foci in hypothesis and reference. FocusDiff measures the difference between a common set of foci in hypothesis and reference in terms of semantics and frequency, while SentGraph measures the semantic similarity between two sets of sentence embeddings from hypothesis and reference, aggregated according to the number of foci shared across sentences and the distance between sentences.
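To make the graph-based coherence proxy above concrete (see the Entity Graph item), here is a rough sketch under two simplifying assumptions of ours: shared content words stand in for shared nouns (avoiding a POS tagger), and the score is the average off-diagonal adjacency value of the hypothesis graph. It illustrates the idea rather than reproducing the published model.

```python
import itertools
import numpy as np

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "was", "it"}

def content_words(sentence: str) -> set:
    """Crude content-word extraction; a stand-in for proper noun extraction."""
    return {w.lower().strip(".,;:!?") for w in sentence.split()} - STOPWORDS

def entity_graph_score(sentences: list) -> float:
    """Average sentence-graph connectivity as a proxy for hypothesis coherence."""
    n = len(sentences)
    if n < 2:
        return 0.0
    words = [content_words(s) for s in sentences]
    adj = np.zeros((n, n))
    for i, j in itertools.combinations(range(n), 2):
        if words[i] & words[j]:            # the two sentences share a (pseudo-)noun
            adj[i, j] = adj[j, i] = 1.0
    return adj.sum() / (n * (n - 1))       # mean over off-diagonal entries
```

The sketch corresponds to the unweighted case; the ‘-u’ and ‘-w’ suffixes in the tables distinguish unweighted from weighted connectivity, the latter replacing binary edges with weights.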