On conducting better validation studies of automatic metrics in natural language generation evaluation

07/31/2019 ∙ by Johnny Tian-Zheng Wei, et al. ∙ University of Massachusetts Amherst

Natural language generation (NLG) has received increasing attention, which has highlighted evaluation as a central methodological concern. Since human evaluations for these systems are costly, automatic metrics have broad appeal in NLG. Research in language generation often finds situations where it is appropriate to apply existing metrics or propose new ones. The application of these metrics is entirely dependent on validation studies - studies that determine a metric's correlation to human judgment. However, there are many details and considerations in conducting strong validation studies. This document is intended for those validating existing metrics or proposing new ones in the broad context of NLG: we 1) begin with a write-up of best practices in validation studies, 2) outline how to adopt these practices, 3) conduct analyses in the WMT'17 metrics shared task [Our jupyter notebook containing the analyses is available at <https://github.com>], 4) highlight promising approaches to NLG metrics, and 5) conclude with our opinions on the future of this area.




1 Introduction

Increasing interest in tasks that require generating natural language, such as image captioning Lin et al. (2014), dialogue Vinyals and Le (2015), and style transfer Fu et al. (2018), highlights evaluation as a central methodological concern. With human involvement, a system can be evaluated extrinsically (how well does the system fulfill its intended purpose? e.g. Reiter et al., 2003) or intrinsically (what is the quality of the output?). In domains such as machine translation (MT), a system’s extrinsic value is both hard to define and measure, and intrinsic human judgments of a system’s output quality have been the main indicator of progress in the field (Bojar et al., 2016b).

This paper focuses on those domains best evaluated intrinsically. Since acquiring intrinsic human judgments is costly, automatic metrics, which are both computed automatically and highly correlated with human judgment, are ideal. If sufficiently correlated, a metric can be used as a surrogate evaluation, which may be useful in development cycles. Therefore, the application of such metrics is dependent on studies of their validity (how well does the metric correlate with human judgment? Reiter, 2018). In MT, BLEU Papineni et al. (2002) has seen widespread use, and, consequently, its validity has been extensively studied (Callison-Burch et al., 2006, inter alia).

Example 1
Source: 双方很难,甚至不可能重新建立真正的信任。
Reference: rebuilding real trust will be hard , perhaps impossible .
Output: it is difficult , if not impossible , to re-establish real trust .
DA: 0.55; BEER: 0.39

Example 2
Source: 如何在持枪攻击中使用马伽术保护自己
Reference: how to defend yourself from gun attacks using krav …
Output: how to use marcella to protect himself in a gun attack
DA: -0.85; BEER: 0.42

Figure 1: Examples from the WMT’17 metrics task for zh-en translation evaluation. Outputs produced from Sennrich et al. (2017). The DA score is a mean human judgment of translation quality. More on DA (Direct Assessment) in §2.1. Scores of a participating metric (BEER; Stanojevic and Sima’an, 2014) are shown. Metrics aim to achieve high correlation with DA scores. More on task details in §2.

These automatic metrics are generally appealing to natural language generation (NLG) domains. During 2005-2014, Gkatzia and Mahamood (2015) found that a significant portion of NLG research in ACL reported results from automatic metrics. Research in under-explored NLG domains has begun with proposals of both models and novel evaluation metrics (Fu et al., 2018; Yao et al., 2018). For the application of any existing metric (e.g. BLEU) or newly proposed metric to bear validity in various domains, researchers have attempted to conduct validation studies, for tasks such as surface realization Novikova et al. (2017), open-domain dialogue Liu et al. (2016), and image captioning Kilickaya et al. (2017), at times reporting negative results.

However, there are many considerations when designing a validation study. There are at least two aspects that experimenters should be cognizant of in conducting a robust validation study: the first is the assumptions made; the second is the statistical methodology. Not unlike the testing of our models, the validation of our metrics should also be approached with rigor. At the time of writing, the metrics shared task in the Conference on Machine Translation (WMT; Bojar et al., 2018b) annually conducts strong validation studies, and we recommend adopting best practices from this domain.

Figure 2: The interface for collection of direct assessment (DA) scores from Amazon Mechanical Turk workers. Figure taken from Bojar et al. (2017b). Each worker rates system outputs on a continuous 0-100 scale, which are converted to deviations from the worker’s mean to normalize over scoring strategies.

This document is intended for those validating existing metrics or proposing new ones in NLG. To this end, we present:

  • an overview of the WMT metrics shared task validation procedures (§2) and an outline on how to adopt these best practices, generally, to other NLG domains (§3).

  • proposals of basic analyses of metrics scoring, complemented by analysis on existing data from the WMT’17 metrics task (§4).

  • a literature review in both metrics (§5) and their analysis (§6), concluding with the authors’ opinions on directions of the area (§7).

2 WMT metrics shared task

The metrics shared task Ma et al. (2018) utilizes the WMT evaluation results of the popular translation shared task. The data provided from the translation shared task is diverse, including output from a range of state-of-the-art systems. Over the years, WMT also has developed robust statistical methodology for collecting human judgments and significance testing of metric performance. For these reasons, the results from the metrics shared task are particularly strong. We highlight relevant methodological aspects in the following sections.

Figure 3: The effect of the number of assessors on the consistency of DA scores. Figure taken from Graham et al. (2015). One axis shows a sample mean of translation quality calculated from $N$ judgments, and the other shows a true mean estimated from a much larger population.

2.1 Direct Assessment

Beginning in WMT’17, the WMT human evaluation campaign has adopted direct assessment (DA; Graham et al., 2015) as the primary human evaluation (Bojar et al., 2017b). DA scores are founded on the law of large numbers - a sample mean of many human judgments is close to the population mean, which represents an intrinsic property of the translation quality.

Figure 4: Sample Pearson of segment/system-level BEER and DA scores from the WMT’17 metrics task for zh-en translation evaluation. Left. Each point denotes a test example from the metrics task. Examples are a subset from the uedin-nmt system Sennrich et al. (2017). Right. Each point denotes a system participating in the zh-en translation task. DA and BEER scores have been averaged over all outputs for a given system.

However, we do not know the variance of human scores for a given translation, so an appropriate sample size $N$ cannot be calculated in advance. In addition, the variance in judgments can vary considerably across translations. To overcome this, DA uses a two-step process: 1. Empirically determine the number of assessors needed per translation for the desired consistency. 2. Collect human judgments for the remaining system outputs.

Refer to Figure 2. DA uses a continuous sliding bar from 0-100, and the ratings are averaged over many workers for a sample mean (Graham et al., 2013). Refer to Figure 3. In the first phase of DA, a large number of assessments are collected for a small number of output translations. Compute a simulated DA score for each segment with the first $N$ judgments, and estimate a true mean with the remaining judgments. For the $N$ at which the correlation between simulated DA scores and the “true mean” is sufficient, gather $N$ judgments for each segment in the full data collection.
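The first phase described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the WMT procedure itself: the function names, the synthetic judgments, the candidate values of $N$, and the 0.9 threshold are all hypothetical choices made for the example.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def smallest_sufficient_n(judgments, candidates, threshold):
    """Pick the smallest candidate N whose simulated DA scores are consistent.

    judgments[s] holds every score collected for segment s in phase one.
    The mean of the first N scores acts as the simulated DA score; the mean
    of the remaining scores acts as the "true mean". Returns the first N
    whose correlation reaches the threshold, or None if no candidate does.
    """
    for n in sorted(candidates):
        simulated = [sum(j[:n]) / n for j in judgments]
        true_mean = [sum(j[n:]) / len(j[n:]) for j in judgments]
        if pearson(simulated, true_mean) >= threshold:
            return n
    return None


# Synthetic phase-one data (hypothetical): 50 segments, 200 noisy scores each.
rng = random.Random(0)
qualities = [rng.uniform(20, 80) for _ in range(50)]
judgments = [[q + rng.gauss(0, 25) for _ in range(200)] for q in qualities]

chosen_n = smallest_sufficient_n(judgments, [1, 5, 15, 40], threshold=0.9)
print("judgments to collect per segment:", chosen_n)
```

Every candidate $N$ must be smaller than the number of judgments collected per segment, so that some judgments remain for estimating the true mean.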

Several aspects of DA have made it appealing for the WMT shared tasks. These points are stated in Bojar et al. (2017b): 1) DA scores are more consistent than relative rankings of system output, which are often contradictory; 2) the sample means from DA scores are absolute, facilitating comparisons across translations; 3) sliding bar judgments can be collected from crowdsourcing websites, and there are effective measures for quality control.

2.2 System and sentence-level correlation

Direct Assessment gives human judgment scores on a sentence (or segment) level. System-level quality is then defined as the average of DA scores of a system’s output over an entire test set, and is how the rankings for the annual WMT translation task are produced (Bojar et al., 2018b; 2017b). Refer to Figure 4. Therefore, a metric can be evaluated on two correlations: at the segment level - between segments’ metric scores and DA scores, or at the system level - between systems’ average DA scores and aggregate metric scores (most commonly a mean).
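The two correlations can be computed from the same per-segment records. The sketch below is illustrative only: the three systems and their scores are made-up values, and the metric is aggregated by a plain mean, which is one of several possible aggregations.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Hypothetical records: (system name, DA score, metric score) per segment.
segments = [
    ("sysA", 0.6, 0.42), ("sysA", -0.2, 0.35), ("sysA", 0.3, 0.30),
    ("sysB", -0.5, 0.25), ("sysB", 0.1, 0.33), ("sysB", -0.3, 0.20),
    ("sysC", 0.9, 0.50), ("sysC", 0.4, 0.38), ("sysC", 0.5, 0.47),
]


def segment_level(records):
    """Correlate DA and metric scores over individual segments."""
    return pearson([da for _, da, _ in records], [m for _, _, m in records])


def system_level(records):
    """Average DA and metric scores per system, then correlate the averages."""
    by_system = defaultdict(list)
    for system, da, metric in records:
        by_system[system].append((da, metric))
    da_means = [sum(d for d, _ in rows) / len(rows) for rows in by_system.values()]
    metric_means = [sum(m for _, m in rows) / len(rows) for rows in by_system.values()]
    return pearson(da_means, metric_means)


print("segment-level r:", round(segment_level(segments), 3))
print("system-level r:", round(system_level(segments), 3))
```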

There are several notable distinctions between the two. For a metric to correlate highly on the system level does not imply it correlates on the segment level. In fact, in the MT domain, baselines such as BLEU have high correlations on the system level for *-en translation evaluation, but have low segment-level correlations (Ma et al., 2018). This is also seen for the SPICE Anderson et al. (2016) metric in image captioning. Intuitively, a metric may only be competent in penalizing bad output, but cannot differentiate between average and good output (refer to §4.1 for an analysis of BLEU). This causes the discrepancy of low correlation at the segment level but high correlation at the system level (Novikova et al., 2017; Chaganty et al., 2018).

In practice, system-level correlation is relevant. Research will often report results of system-level BLEU scores to show the effectiveness of a new system over baselines or existing literature (Koehn, 2004). When making decisions about hyperparameters or model architectures, we rely on metric-produced system rankings (Britz et al., 2017). However, the existence of metrics with high segment-level correlations opens up research questions, e.g. can we use such a metric as an alternative training objective? (Ranzato et al., 2015)

Figure 5: $p$-values from the Williams test on all pairs of metrics participating in system-level, zh-en translation evaluation in WMT’17. A green cell denotes a significant result. Figure taken from Bojar et al. (2017c). Significance in cell $(i, j)$ denotes that the metric in row $i$ significantly outperformed the metric in column $j$.

2.3 Pearson correlation

The emerging consensus in WMT is the use of Pearson correlation [https://libguides.library.kent.edu/SPSS/PearsonCorr] in segment and system-level evaluation of metrics. Given paired data points $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, the sample Pearson correlation $r$ is defined as:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

where $\bar{x}$ and $\bar{y}$ are the sample means for $x$ and $y$, respectively. This correlation measures a linear association between two variables and ranges from $-1$ to $1$. In segment-level metric validation, $(x_i, y_i)$ may represent the DA and metric scores of a system translation. On the system level, $(x_i, y_i)$ may represent the average DA and aggregated metric scores of an MT system.

There are at least three reasons motivating the use of the Pearson coefficient over Kendall’s $\tau$ and Spearman’s $\rho$. (These reasons are stated across Graham et al. 2015; 2014.) 1) The linear measurement of Pearson correlation is more sensitive than rank-based correlations (e.g. Spearman), which measure any monotonic relationship. 2) Pearson correlations are absolute, facilitating comparisons across testing settings. 3) Significance testing of differences in Pearson values is possible (see §2.4).

2.4 Significance tests for correlations

For the sample Pearson correlation calculated between metric and DA scores, there are two significance tests that are often applied. The first test (which is not used in the WMT metrics task) calculates a $p$-value to reject the null hypothesis $H_0: \rho = 0$. An unlikely null hypothesis means your metric has some true non-zero correlation. This is a common test, and $p$-values will be provided by most standard statistical packages [cor.test in R or stats.pearsonr in scipy].
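For readers who want to see what such a package computes, the following sketch evaluates the sample Pearson correlation and the standard $t$ statistic for the null hypothesis of zero correlation, $t = r\sqrt{(n-2)/(1-r^2)}$ with $n-2$ degrees of freedom. The data are made up, the function names are our own, and the final conversion of $t$ to a $p$-value is left to a statistics package.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def t_statistic_nonzero(r, n):
    """t statistic (n - 2 degrees of freedom) for the null hypothesis rho = 0."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical system-level scores for five systems.
metric_scores = [1, 2, 3, 4, 5]
da_scores = [2, 4, 5, 4, 5]

r = pearson(metric_scores, da_scores)
t = t_statistic_nonzero(r, len(metric_scores))
print("r =", round(r, 4), "t =", round(t, 4), "df =", len(metric_scores) - 2)
```

The statistic is undefined when $r = \pm 1$, so a perfectly linear sample needs to be handled separately.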

The WMT metrics task uses the Williams test (Williams, 1959), which tests for the significance of differences in Pearson correlation. Concretely, a $p$-value is calculated to reject $H_0: \rho(X_1, X_3) = \rho(X_2, X_3)$, where $X_1$ and $X_2$ are correlated. In our case, $X_1$ and $X_2$ represent the scores of two different metrics, and $X_3$ the DA scores (Graham et al., 2015). An unlikely null hypothesis means that it is likely one metric is better than the other.

Refer to Figure 5. In the WMT metrics task, the winners are declared as those metrics which were not significantly outperformed by any other metric. In WMT’17 for zh-en, 9 different metrics were co-winners. (Bojar et al., 2017c, count the empty columns in Figure 5)
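The test statistic behind this comparison can be written down compactly. The sketch below follows the form of the Williams statistic as we understand it from Graham and Baldwin (2014); the function and argument names are our own, the input correlations are made-up values, and the open source implementation referenced in §3.4 should be treated as authoritative. The resulting value is compared against a Student $t$ distribution with $n - 3$ degrees of freedom.

```python
import math


def williams_t(r_ah, r_bh, r_ab, n):
    """Williams test statistic for a difference in dependent correlations.

    r_ah: correlation of metric A with human (DA) scores
    r_bh: correlation of metric B with human (DA) scores
    r_ab: correlation between metric A and metric B
    n:    number of data points (segments or systems)
    A positive value favors metric A.
    """
    k = 1 - r_ah ** 2 - r_bh ** 2 - r_ab ** 2 + 2 * r_ah * r_bh * r_ab
    numerator = (r_ah - r_bh) * math.sqrt((n - 1) * (1 + r_ab))
    denominator = math.sqrt(
        2 * k * (n - 1) / (n - 3) + ((r_ah + r_bh) ** 2 / 4) * (1 - r_ab) ** 3
    )
    return numerator / denominator


# Hypothetical correlations for two metrics evaluated on 16 systems.
t_value = williams_t(r_ah=0.9, r_bh=0.8, r_ab=0.85, n=16)
print("t =", round(t_value, 4), "with", 16 - 3, "degrees of freedom")
```

Note that the statistic depends on how strongly the two metrics correlate with each other: the same gap in correlation with human scores is easier to call significant when the metrics themselves are highly correlated.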

3 Conducting a validation study

We assume that a suitable NLG task has been chosen for evaluation. This section outlines, step by step, the considerations and statistical methodology for your validation study. We synthesize from the existing literature, and borrow heavily from the design of the WMT metrics shared task.

3.1 Collecting diverse system output

The results of your validation study will vary greatly based on testing conditions of the metrics. Conclusions generalize best to similar conditions in practice, and so it is important to cover as much ground as possible.

  • If you are using Pearson correlation (recommended, refer to §2.3) to measure a metric’s system-level correlation, choose at least five (5) systems. Five is the minimum number of data points needed to draw a statistically significant conclusion that a correlation is non-zero (Reiter, 2018) [https://ehudreiter.com/2018/07/10/how-to-validate-metrics/].

  • Produce outputs from a mix of baselines and state-of-the-art systems using a variety of approaches. In MT, a few popular approaches may include transfer-based, statistical, and neural systems. It is known, for instance, that BLEU correlates poorly with human judgment for rule-based systems (Callison-Burch et al., 2006), and the exclusion of these systems has caused correlations between BLEU and human judgments to increase (Reiter, 2018).

  • If possible, include a few synthetic variations of a system, ideally variations seen in practice. Callison-Burch et al. (2006) One such variation could be several identical models with ablated subsets of the training data.

  • Report characteristics of the test set from which system output is elicited. Metrics using linguistic resources (e.g. WordNet, parsers, taggers) will be sensitive to the language they are applied to (Kilickaya et al., 2017).

3.2 Collecting consistent human judgments

Consistent human judgments are necessary to accurately evaluate metrics. Collect data in a manner that is replicable and publicly release the data.

  • Choose a question to elicit intrinsic quality judgments from humans. Unfortunately, the choice of question is nearly an art, and the considerations may be philosophical (Gatt and Krahmer, 2018). However, a question that elicits consistent judgment saves effort in annotation, and may mean the question reflects a true intrinsic property. Note that BLEU was originally validated against human judgment of “general translation quality” (Papineni et al., 2002).

  • Choose one intrinsic quality question. In WMT, aspects of adequacy and fluency were previously judged separately, but this practice was later abandoned (Bojar et al., 2016a). Separate judgments are not ideal because two forms of results are confusing, and if you expect a metric to correlate with both aspects, only one question is needed.

  • Direct assessment (§2.1) collects consistent human judgments, and can be crowdsourced relatively hassle-free through Amazon Mechanical Turk (Graham et al., 2015; Bojar et al., 2017b). This collection methodology has also been successfully applied to NLG domains outside of MT. Refer to Graham et al. (2018) for an example in video captioning.

3.3 Producing automatic metric scores

  • Use consistent tokenization across all metrics. Most metrics, especially $n$-gram based metrics, will be affected by tokenization (Post, 2018).

  • You will likely be using BLEU and/or sentBLEU as a baseline. If so, note factors affecting scores (e.g. preprocessing, $n$-gram weighting, length penalty), and that SACREBLEU is an existing tool to manage these parameters (Post, 2018).
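To make the tokenization point concrete, the sketch below scores the same hypothesis against the same reference under two tokenizations. It uses a simplified clipped unigram precision rather than full BLEU, and both the `split_punct` tokenizer and the example sentences are hypothetical; it is not how SACREBLEU tokenizes.

```python
import re
from collections import Counter


def ngram_precision(hypothesis_tokens, reference_tokens, n):
    """Clipped n-gram precision of a hypothesis against a single reference."""
    hyp = Counter(
        tuple(hypothesis_tokens[i:i + n]) for i in range(len(hypothesis_tokens) - n + 1)
    )
    ref = Counter(
        tuple(reference_tokens[i:i + n]) for i in range(len(reference_tokens) - n + 1)
    )
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


def split_punct(text):
    """Crude illustrative tokenizer: separates punctuation from words."""
    return re.findall(r"\w+|[^\w\s]", text)


hypothesis = "it's hard, perhaps impossible."
reference = "it is hard , perhaps impossible ."

# Same strings, two tokenizations, two different scores.
whitespace_precision = ngram_precision(hypothesis.split(), reference.split(), 1)
punct_precision = ngram_precision(split_punct(hypothesis), split_punct(reference), 1)
print(whitespace_precision, punct_precision)
```

With whitespace splitting, punctuation stays attached to the hypothesis words and most of them fail to match the pre-tokenized reference; separating punctuation recovers those matches.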

3.4 Conducting significance tests

Applying robust statistical methodology will allow sound conclusions to be drawn about the performance of our metrics.

  • Don’t leave anything to chance - use the Williams test (§2.4) to test for the significance of an increase in Pearson correlation. Graham and Baldwin (2014) provide an open source implementation at https://github.com/ygraham/significance-williams.

  • Report both system and segment-level results with significance tests (§2.2). You may also want to use Kendall’s $\tau$ or Spearman’s $\rho$ as secondary metric evaluations (§4).

4 Analyses of metrics in WMT’17

This section demonstrates several basic analysis methods of metric scoring on both the segment and system level. We will exclusively perform these analyses on BLEU and its sentence-level variant sentence-BLEU [https://github.com/moses-smt/mosesdecoder/blob/master/mert/sentence-bleu.cpp] (Koehn et al., 2007) for demonstration and for insight into a widely applied baseline metric.

Figure 6: Distribution of sentence-BLEU scores conditional on bin of DA score for zh-en translation evaluation in WMT’17. DA scores are binned into three intervals. Every bin has a near equal number of points.

Bin: good
Source: 如果两家工厂关闭,则电力市场的需求量会大大减少。
Reference: if both of those plants go from the market that ’s a significant reduction in demand in the [ electricity ] market .
Output: if the two factories closed , the power market demand will be greatly reduced .
DA: 0.355; sentence-BLEU: 0.066

Bin: good
Source: 上次产油国召开会议已经是4月份的事情,OPEC成员国未能就任何措施达成协议。
Reference: it was april when the oil-producing countries had meeting . no agreement was reached among the opec member …
Output: the last time oil producers are meeting in april , opec member states that have failed to agree on any measures to reach an …
DA: 0.247; sentence-BLEU: 0.079

Bin: good
Source: 小鹏和吴言在一个大群里相识,因为两人都是积极发言者,很快就熟络起来了。
Reference: xiaopeng and wuyan meet each other in the group chatting . they two get familiar with each other because they are active …
Output: xiao peng and wu met in a large group , because both were active speakers and soon became familiar .
DA: 0.537; sentence-BLEU: 0.087

Bin: bad
Source: 朗兹曼写道。
Reference: lanzmann wrote .
Output: 朗兹曼 wrote .
DA: -1.245; sentence-BLEU: 0.707

Bin: bad
Source: 云南省首届青运会13日开赛开幕式打民族牌展示青春活力
Reference: the opening ceremony of the first youth games of yunnan province on the 13th day of the month showed youth vitality
Output: the opening ceremony of the opening ceremony of the first olympic games of yunnan province on the opening ceremony of yunnan province
DA: -1.081; sentence-BLEU: 0.481

Bin: bad
Source: 远在千里之外,巨嘴鸟格雷西亚的故事感动了著名纪录片导演葆拉·埃雷迪亚和探索新闻频道制片人约翰·霍夫曼。
Reference: thousands of miles away , the story of the toucan gracia touched the famous documentary director paula el reidia and the news channel producer john hoffman .
Output: far more than thousands of thousands , the story of greecia is moved by the famous documentary director paula mareda and the exploration of the producer of news …
DA: -0.776; sentence-BLEU: 0.282

Table 1: Selections of lowest scoring BLEU examples in the “good” bin and highest scoring BLEU examples in the “bad” bin for zh-en translation evaluation in WMT’17. Examples are binned by DA scores, and chosen from the top/bottom 10. Some words have been elided. Readers are encouraged to view other examples on our notebook.

4.1 Segment-level analysis: metric score distributions conditional on DA

To understand how your metric scores translations of different quality, you may visualize the distributions of your metric score on “bins” of DA scores. Start by partitioning your examples by DA score into a lower one-third bin (bad translations), a middle one-third bin (average), and an upper one-third bin (good). Each bin should have a near equal number of points. Produce violin plots of the conditional distributions as in Chaganty et al. (2018). Examining conditional correlations within bins does not provide the same information!
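The partitioning step can be sketched as below. The function name and the (DA, metric) pair layout are assumptions made for the example; the resulting per-bin metric scores can then be passed to any plotting routine that draws violin plots.

```python
def bin_by_da(examples):
    """Split (da_score, metric_score) pairs into three near-equal bins by DA.

    Returns a dict with keys "bad", "average", and "good". When the number
    of examples is not divisible by three, the later bins absorb the extras.
    """
    ranked = sorted(examples, key=lambda example: example[0])
    n = len(ranked)
    cut_low, cut_high = n // 3, (2 * n) // 3
    return {
        "bad": ranked[:cut_low],
        "average": ranked[cut_low:cut_high],
        "good": ranked[cut_high:],
    }


# Hypothetical (DA, metric) pairs for ten segments.
examples = [
    (-1.2, 0.10), (0.4, 0.35), (-0.3, 0.22), (0.9, 0.61), (0.1, 0.30),
    (-0.8, 0.41), (0.6, 0.18), (-0.1, 0.27), (0.3, 0.33), (-0.5, 0.15),
]
bins = bin_by_da(examples)

# Metric scores per bin, ready to be plotted as conditional distributions.
metric_by_bin = {name: [metric for _, metric in rows] for name, rows in bins.items()}
print({name: len(scores) for name, scores in metric_by_bin.items()})
```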

Refer to Figure 6. We see that sentence-BLEU has significant overlap in scoring bad and average translations, but not good translations. Our interpretation is that sentence-BLEU mainly differentiates good translations well. Points that lie high on the sentence-BLEU axis (high $n$-gram overlap) will almost always be good translations, as the reference and system output will be nearly exact matches. In *-en translation, this strategy is effective for achieving high system-level correlation.

4.2 Segment-level analysis: qualitative analysis on metric failure cases

One of the best ways to understand your metric is to examine failure cases - cases in which there is large disagreement between human DA scores and your metric score. We have found this analysis to be insightful even when looking at only 5-10 examples. After binning examples into three bins by DA score (as in §4.1), sort the “good” bin in ascending order of metric score, and the “bad” bin in descending order. You may also want to produce some features of your metric for comparison. Qualitative analysis has been used in Tao et al. (2018).
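A minimal sketch of this selection follows. The dictionary keys (`id`, `da`, `metric`), the function name, and the example values are hypothetical; the sort is by metric score, so the first items returned are the cases where metric and human judgment disagree most.

```python
def failure_cases(examples, k=5):
    """Return the k most suspicious examples from the good and bad DA bins.

    examples: list of dicts with at least the keys "da" and "metric".
    Returns (good_but_low_metric, bad_but_high_metric).
    """
    ranked = sorted(examples, key=lambda example: example["da"])
    n = len(ranked)
    bad, good = ranked[: n // 3], ranked[(2 * n) // 3:]
    # Good translations that the metric scored lowest.
    good_but_low_metric = sorted(good, key=lambda example: example["metric"])[:k]
    # Bad translations that the metric scored highest.
    bad_but_high_metric = sorted(bad, key=lambda example: example["metric"], reverse=True)[:k]
    return good_but_low_metric, bad_but_high_metric


# Hypothetical segment records.
examples = [
    {"id": "a", "da": -1.2, "metric": 0.71},
    {"id": "b", "da": -1.0, "metric": 0.48},
    {"id": "c", "da": -0.1, "metric": 0.30},
    {"id": "d", "da": 0.0, "metric": 0.20},
    {"id": "e", "da": 0.4, "metric": 0.07},
    {"id": "f", "da": 0.5, "metric": 0.60},
]
good_low, bad_high = failure_cases(examples, k=1)
print(good_low[0]["id"], bad_high[0]["id"])
```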

Refer to Table 1. One of the biggest shortcomings of BLEU is its disregard for semantic content. In the good system outputs that received low BLEU scores, we see several meaning-preserving paraphrases of the reference. However, we note that these instances are quite rare (observe Figure 6). In the bad output examples, there are at least two failure cases of BLEU: matching some words in very short sentences and repeating long phrases both artificially inflate the BLEU score.

Figure 7: Sample Pearson correlation of BLEU with DA conditional on system type for zh-en translation evaluation in WMT’17. System types were classified by hand. Four out of five statistical systems were anonymized online translation systems.

4.3 System-level analysis: metric correlation with DA conditional on system type

The correlation of a metric can vary greatly depending on the diversity of systems it is validated on. As shown in Callison-Burch et al. (2006), your correlation may decrease when validated against a diverse range of systems. In practice, if we know that a metric has weak correlation for comparing systems with different approaches, we may want to constrain metric use to comparing systems using the same approach (e.g. neural).

Refer to Figure 7. The correlation for neural systems is higher than that of the statistical systems. The recent switch from statistical to neural MT systems is a likely factor in Reiter (2018) observing human-BLEU correlation increasing over time. When comparing BLEU scores, it is more effective to compare neural systems to other neural systems. In validating our metrics, we must choose a mix of possible approaches to better understand our correlation.
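An analysis of this kind amounts to computing the system-level correlation separately within each group of systems. The sketch below assumes each system has already been labeled with a type by hand; the labels, scores, and the minimum group size of three are hypothetical choices for illustration, not the WMT’17 data.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def correlation_by_type(systems, min_systems=3):
    """systems: list of (system_type, mean DA, metric score) per system.

    Returns the Pearson correlation within each type that has at least
    min_systems systems; smaller groups are skipped.
    """
    groups = defaultdict(list)
    for system_type, da, metric in systems:
        groups[system_type].append((da, metric))
    return {
        system_type: pearson([d for d, _ in rows], [m for _, m in rows])
        for system_type, rows in groups.items()
        if len(rows) >= min_systems
    }


# Hypothetical system-level scores with hand-assigned types.
systems = [
    ("neural", 0.1, 20.0), ("neural", 0.2, 22.0), ("neural", 0.3, 24.0),
    ("statistical", -0.3, 18.0), ("statistical", -0.1, 17.0),
    ("statistical", 0.0, 19.0), ("statistical", -0.2, 18.5),
    ("rule-based", 0.05, 12.0),
]
by_type = correlation_by_type(systems)
print({system_type: round(r, 3) for system_type, r in by_type.items()})
```

With so few systems per group, the within-group correlations are noisy, so differences between groups should be read with caution.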

Refer to Figure 8. The WMT’17 tuning task Bojar et al. (2017d) calls for participants to tune the same neural MT model (Neural Monkey; Helcl and Libovický, 2017) for en-cs translation with varying training settings e.g. curriculum learning etc. Intuitively, models of the same architecture are likely to make similar mistakes, on average, over the same test set, so are penalized by humans equally. This causes the spread (or residuals) of the best-fit-line to be much tighter.

Figure 8: Sample Pearson correlation of BLEU with DA conditional on submission track for en-cs translation evaluation in WMT’17. All tuning task systems are instances of Neural Monkey (Helcl and Libovický, 2017). The translation task includes different systems.
Language pair   Systems   Pearson r   Kendall τ
cs-en           4         0.971       1.000
de-en           11        0.923       0.564
fi-en           6         0.903       0.867
lv-en           9         0.979       0.833
ru-en           9         0.912       0.778
tr-en           10        0.976       0.911
zh-en           16        0.864       0.767
Table 2: Sample Pearson’s $r$ and Kendall’s $\tau$ correlations between BLEU and DA at the system level.

4.4 System-level analysis: Kendall’s coefficient and its interpretation

In this section we propose the use of Kendall’s $\tau$ as a secondary evaluation metric on the system level. While this coefficient does not have an appropriate significance test in our setting, it has an intuitive interpretation. For paired data points $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, Kendall’s $\tau$ is a rank-based correlation coefficient defined as

$$ \tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{\binom{n}{2}} $$

where the number of concordant pairs is the number of pairs $(i, j)$ such that $x_i > x_j$ and $y_i > y_j$, or $x_i < x_j$ and $y_i < y_j$, for all $i < j$. Pairs are discordant when there is disagreement in rank between the two variables. Kendall’s $\tau$ is then the difference in the percentage (normalized over total pairs) of concordant and discordant pairs.

Intuitively, $x$ and $y$ represent the BLEU and DA scores. Refer to Table 2. Kendall’s $\tau$ then tells us the percentage-point difference between how often BLEU agrees and how often it disagrees with DA in pairwise judgments of system quality. For instance in zh-en, BLEU agrees with DA about 77 percentage points more often than it disagrees in pairwise judgments. We leave the interpretation of correlation metrics more relevant in practice as future work.
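The pairwise reading of Kendall’s coefficient is easy to verify by direct counting. The sketch below is a plain implementation of the definition given in this section, with tied pairs counted as neither concordant nor discordant (an assumption; other variants adjust the denominator for ties). The scores are made up.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau by direct counting over all pairs of data points.

    Tied pairs count as neither concordant nor discordant, and the
    denominator is the total number of pairs n * (n - 1) / 2.
    """
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical system-level scores for four systems.
bleu_scores = [1, 2, 3, 4]
da_scores = [1, 2, 4, 3]
tau = kendall_tau(bleu_scores, da_scores)
print(round(tau, 4))
```

As a check against Table 2: 16 systems give 120 pairs, so, assuming no ties, a value of 0.767 corresponds to 106 concordant and 14 discordant pairs.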

5 Relevant work: metrics

This section presents a selection of influential work that gives an overview of the metrics development literature. Besides the well-known BLEU metric, other metrics primarily using $n$-gram features include the NIST Doddington (2002) metric for translation, which is similar to BLEU but weighs $n$-grams based on their rarity, and ROUGE Lin (2004) for summarization. Character-level $n$-gram features have also been proposed to capture subword information. The CHRF metric Popovic (2015) calculates F-scores based on character-level $n$-grams. CDER and BEER Stanojevic and Sima’an (2014) also use character features.

Several translation metrics calculate scores based on alignments of extracted surface linguistic features between hypotheses and references. METEOR Denkowski and Lavie (2014) uses alignments based on exact, stem, synonym, and paraphrase matches between words and phrases [http://www.cs.cmu.edu/~alavie/METEOR/]. MEANT Lo (2017) evaluates translations by first aligning semantic frames and then aligning role fillers. METEOR and MEANT incorporate linguistic resources such as WordNet and semantic role parsers, respectively, and their incorporation, along with other character-level and shallow linguistic approaches, has shown benefits in the WMT metrics task (Bojar et al., 2016b).

More recently, both formal and distributed semantic representations have found success in metrics, even in NLG domains outside of translation. SPICE Anderson et al. (2016) is an image captioning metric based on semantic parsing, where scores are based on overlap of semantic propositions in the graph meaning representations of both the hypothesis and reference captions. The RUBER metric Tao et al. (2018) for dialogue uses cosine similarities of word embeddings to predict response appropriateness, and a neural model trained with negative sampling to predict relevancy. The first metric in WMT’18 to use sentence embeddings, RUSE Shimanaka et al. (2018), scores with a trained neural regression model on sentence embeddings and is the highest performing metric for *-en translation.

Finally, we highlight MT quality estimation systems (QE, evaluating how good a translation is without the reference), which include both neural and grammar-based approaches. Recurrent neural networks encoding both the source and MT output have been used for regression on human quality scores (Dusek et al., 2017). The WMT quality estimation shared task Specia et al. (2018) includes examples of many such systems, with a winning submission using neural models (Wang et al., 2018). Approaches based on grammatical error correction systems are reference-less and have also been proposed for QE (Napoles et al., 2016).

6 Conclusion

The authors believe a metric that applies across multiple NLG domains may be widely adopted. This metric might focus only on semantic similarity (with a reference), and be complemented with another domain-specific measure, e.g. style preservation for style transfer (Fu et al., 2018), summary length for summarization, or relevancy for dialogue (Tao et al., 2018). To compute semantic similarity, a model may use meaning representations and parsing methods (Konstas et al., 2017) or continuous sentence/word embeddings (Peters et al., 2018; Conneau et al., 2017), as these technologies are becoming more widespread and powerful in NLP.

For a few NLG tasks, e.g. MT, image captioning, and summarization, achieving usable system-level correlation is not impossible. In some cases, it can be argued that achieving usable system-level correlation is not difficult, as merely correctly differentiating good output is sufficient (§4.1). However, segment-level correlation leaves much to be desired. The existence of metrics with high segment-level correlation opens up exciting research directions. Such a metric can also be useful in practice, as it can identify model failure modes, or detect low quality output and fall back on rule-based models as needed.