Human vs Automatic Metrics: on the Importance of Correlation Design

by Anastasia Shimorina, et al.

This paper discusses two existing approaches to correlation analysis between automatic evaluation metrics and human scores in the area of natural language generation. Our experiments show that, depending on whether a system- or sentence-level correlation analysis is used, correlation results between automatic scores and human judgments are inconsistent.








1 Context and Motivation

This work seeks to gain more insight into existing approaches to correlation analysis between automatic and human metrics in the area of natural language generation (NLG).

In the machine translation community, the practice of comparing system- and sentence-level correlation results is well established (Callison-Burch et al., 2008, 2009). System-level analysis is motivated by the fact that automatic evaluation metrics such as bleu (Papineni et al., 2002), meteor (Denkowski and Lavie, 2014), and ter (Snover et al., 2006) were initially created to evaluate whole systems (i.e. they are corpus-based metrics). Correlation analysis at the sentence level, on the other hand, is motivated by the need to gauge the quality of individual sentences and, more generally, by the need for a more fine-grained analysis of the results produced (Kulesza and Shieber, 2004). The common finding in MT is that automatic metrics correlate well with human judgments at the system level but much less so at the sentence level. This in turn prompted the search for alternative automatic metrics that would correlate well with human judgments at the sentence level.
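The two designs can be contrasted on toy data. A minimal sketch (system names and all scores are invented for illustration; the corpus-level metric is approximated here by a mean of sentence scores):

```python
# Toy per-sentence scores: for each system, a list of (automatic, human) pairs.
# All numbers are invented for illustration.
per_sentence = {
    "sysA": [(0.9, 3), (0.7, 2), (0.8, 3)],
    "sysB": [(0.5, 2), (0.4, 1), (0.6, 2)],
    "sysC": [(0.2, 1), (0.3, 1), (0.1, 2)],
}

# System-level design: one data point per system
# (a corpus-based metric would compute its score over all sentences at once;
#  here we approximate it with the mean of sentence scores).
system_points = [
    (sum(a for a, _ in pairs) / len(pairs),   # automatic score for the system
     sum(h for _, h in pairs) / len(pairs))   # mean human rating
    for pairs in per_sentence.values()
]

# Sentence-level design: one data point per generated sentence, systems pooled.
sentence_points = [p for pairs in per_sentence.values() for p in pairs]

print(len(system_points))    # 3 (one point per system)
print(len(sentence_points))  # 9 (one point per sentence)
```

The correlation coefficient is then computed over `system_points` in the first design and over `sentence_points` in the second; the sentence-level design yields far more data points, which is why significance is easier to reach there.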

In NLG, there is a lack of such comparative studies. Traditional NLG evaluations and challenges (Reiter and Belz, 2009; Gatt and Belz, 2010, among others) used only system-level comparisons and reported low to strong correlations depending on the automatic metric used. Reiter and Belz (2009) explicitly wrote that they did not compute correlations on individual texts because bleu-type metrics “are not intended to be meaningful for individual sentences”.

Nonetheless, when researchers have only one or a few systems to evaluate, they resort to sentence-level correlation analysis: see, e.g., Stent et al. (2005) for paraphrasing and Elliott and Keller (2014) for image caption generation. They usually report low to moderate correlations. More recently, Novikova et al. (2017) observed that results differ depending on the evaluation design, but only reported sentence-level correlation results, as they had few systems in their study.

In their recent survey of the state of the art in NLG, Gatt and Krahmer (2018) compared various validation studies, concluding that the studies yielded inconsistent results. However, they did not mention that the underlying designs of those studies differed (some were system-based, others sentence-based).

With this paper, we hope to raise awareness of the different designs used in correlation analysis for NLG evaluation. We present both a system- and a sentence-level correlation analysis on NLG data, show that the results are similar to those obtained for MT systems, and conclude with some recommendations concerning the evaluation of NLG systems.

(a) Spearman’s ρ at the system level. Crossed squares indicate that statistical significance was not reached. Human vs automatic metrics are in the black square.
(b) Spearman’s ρ at the sentence level. All correlations are statistically significant. Human vs automatic metrics are in the black square.
Figure 1: System- and sentence-level correlation analysis.

2 Experimental Setup

We used the webnlg dataset (Gardent et al., 2017a) for our experiments. The dataset maps data to text: a data input is a set of triples extracted from DBpedia, and a text is a verbalisation of those triples. We sampled 223 data inputs from webnlg and used the outputs of the nine NLG systems which participated in the WebNLG Challenge (Gardent et al., 2017b).

The data inputs were chosen based on different characteristics of the webnlg corpus: how many RDF triples were in the data units (sizes from 1 to 5) and what the DBpedia category was (Building, City, Artist, etc.). A sample for each system comprised texts from each category (15 categories); in each category all triple set sizes were covered (5 sizes), and we extracted 3 texts for every category and size. Note that, in such a way, our sample should have had 225 (i.e. 15 × 5 × 3) texts; however, the count was reduced to 223, as one category (ComicsCharacter) had few data units for a particular size.
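The sample-size arithmetic can be checked directly; a quick sketch (the two-text shortfall in the ComicsCharacter category is inferred from the counts above):

```python
n_categories = 15   # DBpedia categories (Building, City, Artist, ...)
n_sizes = 5         # triple-set sizes 1 to 5
n_per_cell = 3      # texts sampled per (category, size) cell

expected = n_categories * n_sizes * n_per_cell
print(expected)     # 225

# One category (ComicsCharacter) had few data units for a particular size,
# leaving it 2 texts short of a full cell and reducing the sample to 223.
actual = expected - 2
print(actual)       # 223
```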

Automatic evaluation results (i.e. meteor, ter, and bleu-4 scores) were calculated for each NLG system, both at the system and at the sentence level, comparing each generated sentence against three references on average.

We crowdsourced human judgments, using CrowdFlower to run the evaluation and mace (Hovy et al., 2013) to remove unreliable judges. Each participant was asked to rate each generated text on a three-point Likert scale for semantic adequacy, grammaticality, and fluency. We collected three judgments per text.

To perform correlation analysis at both system and sentence levels, we used Spearman’s correlation coefficient. All statistical experiments were conducted using R; data and scripts are publicly available. To prevent a possible bias, we excluded human references from the analysis, as their automatic scores are equal to 1.0 (bleu, meteor) and 0.0 (ter). Thus, for the system-level analysis, we have nine data points to build a regression line.
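Spearman’s ρ itself is straightforward to sketch: rank both score lists, averaging ranks over ties (which occur often with three-point ratings), and take the Pearson correlation of the rank vectors. A minimal stdlib-only version, not the paper’s R implementation:

```python
import math

def _ranks(xs):
    """Average ranks (1-based), with tied values sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with xs[order[i]]
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

print(spearman([1, 2, 3, 4], [2, 3, 4, 5]))  # ≈ 1.0 (perfect monotone agreement)
print(spearman([1, 2, 3], [3, 2, 1]))        # ≈ -1.0 (perfect disagreement)
```

Because ρ depends only on ranks, it is insensitive to the very different scales of the metrics involved (e.g. bleu in [0, 1] vs three-point human ratings), which is one reason it is a common choice for metric validation studies.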

3 Correlation Analysis Results

At the system level (Figure 1(a)), the only statistically significant correlation is between meteor and semantic adequacy. (We focus here mostly on the correlation between human and automatic metrics, as delineated by the black square in Figure 1.) Similar findings for meteor were reported in the MT community (Callison-Burch et al., 2009) and in the image caption generation domain (Bernardi et al., 2017). We also found a strong correlation between ter and bleu, and between grammaticality and fluency judgments.

At the sentence level, on the other hand (Figure 1(b)), all correlations are statistically significant. The highest correlation between human and automatic metrics is again between meteor and semantic adequacy. For the other human/automatic pairs, the correlation is moderate. Automatic metrics show strong correlations with each other.

In sum, there is a strong discrepancy between system- and sentence-level correlation results. Significance was not reached for most of the system-level correlations. At the sentence level, all correlations are significant; however, the correlation between automatic metrics and human scores remains relatively low, thereby confirming the findings of the MT community. At the sentence level, statistical significance is easier to achieve, since there are far more data points than in the system-level analysis. One way to obtain statistically significant results at the system level would be to use a one-tailed test (instead of a two-tailed one) without a Bonferroni multiple-hypothesis correction, as was done by Reiter and Belz (2009). However, that approach is considered less statistically robust.
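The interplay between the number of simultaneous tests and the significance threshold can be made concrete. A sketch (the test counts and p-values below are illustrative, not the paper’s exact figures):

```python
alpha = 0.05

# Bonferroni correction: with m simultaneous correlation tests, each individual
# test must clear alpha / m to keep the family-wise error rate at alpha.
def bonferroni_threshold(alpha, m):
    return alpha / m

# e.g. testing every pair among 6 metrics (3 automatic + 3 human) -> 15 tests
m = 6 * 5 // 2
print(bonferroni_threshold(alpha, m))  # 0.05 / 15 ≈ 0.00333

# A one-tailed test halves the p-value of the corresponding two-tailed test,
# so a borderline two-tailed p of 0.08 becomes a "significant" 0.04 --
# which is why dropping both safeguards is considered less robust.
p_two_tailed = 0.08
p_one_tailed = p_two_tailed / 2
print(p_two_tailed < alpha)  # False
print(p_one_tailed < alpha)  # True
```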

4 Conclusion

We argued that in NLG, as in MT, the specific type of correlation analysis (system- vs sentence-level) chosen to compare human and automatic metrics strongly impacts the outcome. While system-level correlation analyses have repeatedly been used in NLG challenges, sentence-level correlation is more relevant, as it better supports error analysis. Based on our experiment, we showed that, in NLG as in MT, the sentence-level correlation between human and automatic metrics is low, which in turn suggests the need for new automatic evaluation metrics for NLG that correlate better with human scores at the sentence level.