The dataset and statistical analysis code released with the submission of EMNLP 2017 paper "Why We Need New Evaluation Metrics for NLG"
The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: We investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at system-level and can support system development by finding cases where a system performs poorly.
Automatic evaluation measures, such as bleu (Papineni et al., 2002), are used with increasing frequency to evaluate Natural Language Generation (NLG) systems: Up to 60% of NLG research published between 2012 and 2015 relies on automatic metrics (Gkatzia and Mahamood, 2015). Automatic evaluation is popular because it is cheaper and faster to run than human evaluation, and it is needed for automatic benchmarking and tuning of algorithms. The use of such metrics is, however, only sensible if they are known to be sufficiently correlated with human preferences. This is rarely the case, as shown by various studies in NLG (Stent et al., 2005; Belz and Reiter, 2006; Reiter and Belz, 2009), as well as in related fields, such as dialogue systems (Liu et al., 2016), machine translation (MT) (Callison-Burch et al., 2006), and image captioning (Elliott and Keller, 2014; Kilickaya et al., 2017). This paper follows on from the above previous work and presents another evaluation study into automatic metrics with the aim of firmly establishing the need for new metrics. We consider this paper to be the most complete study to date, across metrics, systems, datasets and domains, focusing on recent advances in data-driven NLG. In contrast to previous work, we are the first to:
Target end-to-end data-driven NLG, where we compare 3 different approaches. In contrast to NLG methods evaluated in previous work, our systems can produce ungrammatical output by (a) generating word-by-word, and (b) learning from noisy data.
Compare a large set of 21 automated metrics, including novel grammar-based ones.
Report results on two different domains and three different datasets, which allows us to draw more general conclusions.
Conduct a detailed error analysis, which suggests that, while metrics can be reasonable indicators at the system-level, they are not reliable at the sentence-level.
Make all associated code and data publicly available, including detailed analysis results (available for download at https://github.com/jeknov/EMNLP_17_submission).
In this paper, we focus on recent end-to-end, data-driven NLG methods, which jointly learn sentence planning and surface realisation from non-aligned data (Dušek and Jurčíček, 2015; Wen et al., 2015; Mei et al., 2016; Wen et al., 2016; Sharma et al., 2016; Dušek and Jurčíček, 2016; Lampouras and Vlachos, 2016). These approaches do not require costly semantic alignment between Meaning Representations (MR) and human references (also referred to as “ground truth” or “targets”), but are based on parallel datasets, which can be collected in sufficient quality and quantity using effective crowdsourcing techniques, e.g. (Novikova et al., 2016), and as such, enable rapid development of NLG components in new domains. In particular, we compare the performance of the following systems:
rnnlg: This system uses a Long Short-term Memory (LSTM) network to jointly address sentence planning and surface realisation. It augments each LSTM cell with a gate that conditions it on the input MR, which allows it to keep track of MR contents generated so far.
TGen (https://github.com/UFAL-DSG/tgen): The system by Dušek and Jurčíček (2015) learns to incrementally generate deep-syntax dependency trees of candidate sentence plans (i.e. which MR elements to mention and the overall sentence structure). Surface realisation is performed using a separate, domain-independent rule-based module.
lols (Lampouras and Vlachos, 2016): an imitation learning framework which learns using bleu and rouge as non-decomposable loss functions.
We consider the following crowdsourced datasets, which target utterance generation for spoken dialogue systems. Table 1 shows the number of system outputs for each dataset. Each data instance consists of one MR and one or more natural language references as produced by humans. Note that we use lexicalised versions of SFHotel and SFRest and a partially lexicalised version of Bagel, where proper names and place names are replaced by placeholders (“X”), in correspondence with the outputs generated by the systems, as provided by the system authors. The following example is taken from the Bagel dataset:
MR: inform(name=X, area=X, pricerange=moderate, type=restaurant)
Reference: “X is a moderately priced restaurant in X.”
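To make the structure of a data instance concrete, the MR above can be read as a dialogue act type followed by slot-value pairs. The following is a minimal sketch of such a reading; the function name parse_mr and the assumption that values never contain commas or parentheses are ours and are not taken from the released code.

```python
import re


def parse_mr(mr: str):
    """Split an MR string such as 'inform(name=X, area=X)' into (act, slots).

    Assumption: slot values contain no commas or parentheses.
    """
    match = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", mr)
    if match is None:
        raise ValueError(f"not a recognisable MR: {mr!r}")
    act, body = match.groups()
    slots = {}
    # Skip empty pieces so that slot-less acts such as 'goodbye()' work.
    for pair in filter(None, (p.strip() for p in body.split(","))):
        slot, _, value = pair.partition("=")
        slots[slot.strip()] = value.strip()
    return act, slots


act, slots = parse_mr(
    "inform(name=X, area=X, pricerange=moderate, type=restaurant)"
)
```

Here act holds the dialogue act type and slots maps each attribute to its (possibly placeholder) value.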
SFHotel & SFRest (Wen et al., 2015) provide information about hotels and restaurants in San Francisco. There are 8 system dialogue act types, such as inform, confirm, goodbye etc. Each domain contains 12 attributes, where some are common to both domains, such as name, type, pricerange, address, area, etc., and the others are domain-specific, e.g. food and kids-allowed for restaurants; hasinternet and dogs-allowed for hotels. For each domain, around 5K human references were collected with 2.3K unique human utterances for SFHotel and 1.6K for SFRest. The number of unique system outputs produced is 1181 for SFRest and 875 for SFHotel.
Bagel (Mairesse et al., 2010) provides information about restaurants in Cambridge. The dataset contains 202 aligned pairs of MRs and 2 corresponding references each. The domain is a subset of SFRest, including only the inform act and 8 attributes.
NLG evaluation has borrowed a number of automatic metrics from related fields, such as MT, summarisation or image captioning, which compare output texts generated by systems to ground-truth references produced by humans. We refer to this group as word-based metrics (WBMs). In general, the higher these scores are, the better or more similar to the human references the output is (except for ter, whose scale is reversed). The following order reflects the degree to which these metrics move from simple n-gram overlap to also considering term frequency (TF-IDF) weighting and semantically similar words.
Word-overlap Metrics (WOMs): We consider frequently used metrics, including ter (Snover et al., 2006), bleu (Papineni et al., 2002), rouge (Lin, 2004), nist (Doddington, 2002), lepor (Han et al., 2012), cider (Vedantam et al., 2015), and meteor (Lavie and Agarwal, 2007).
Semantic Similarity (sim): We calculate the Semantic Text Similarity measure designed by Han et al. (2013). This measure is based on distributional similarity and Latent Semantic Analysis (LSA) and is further complemented with semantic relations extracted from WordNet.
Grammar-based measures (GBMs) have been explored in related fields, such as MT (Giménez and Màrquez, 2008) or grammatical error correction (Napoles et al., 2016), and, in contrast to WBMs, do not rely on ground-truth references. To our knowledge, we are the first to consider GBMs for sentence-level NLG evaluation. We focus on two important properties of texts here – readability and grammaticality:
Readability quantifies the difficulty with which a reader understands a text, as used for e.g. evaluating summarisation (Kan et al., 2001) or text simplification (Francois and Bernhard, 2014). We measure readability by the Flesch Reading Ease score (re) (Flesch, 1979), which calculates a ratio between the number of characters per sentence, the number of words per sentence, and the number of syllables per word. Higher re score indicates a less complex utterance that is easier to read and understand. We also consider related measures, such as characters per utterance (len) and per word (cpw), words per sentence (wps), syllables per sentence (sps) and per word (spw), as well as polysyllabic words per utterance (pol) and per word (ppw). The higher these scores, the more complex the utterance.
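As an illustration of the readability measures, the sketch below computes a few of them for a single utterance. It uses the standard Flesch Reading Ease formulation, 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word). The vowel-run syllable counter, the threshold of three syllables for a polysyllabic word, and all function names are simplifying assumptions of ours; the released analysis code may compute these values differently.

```python
import re


def count_syllables(word: str) -> int:
    """Crude heuristic (assumption): one syllable per run of vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability_features(utterance: str) -> dict:
    """Compute wps, spw, cpw, pol and re for a non-empty utterance."""
    sentences = [s for s in re.split(r"[.!?]+", utterance) if s.strip()]
    words = re.findall(r"[A-Za-z']+", utterance)
    syllables = [count_syllables(w) for w in words]
    wps = len(words) / len(sentences)   # words per sentence
    spw = sum(syllables) / len(words)   # syllables per word
    return {
        "wps": wps,
        "spw": spw,
        "cpw": sum(len(w) for w in words) / len(words),  # characters per word
        "pol": sum(1 for s in syllables if s >= 3),      # polysyllabic words
        "re": 206.835 - 1.015 * wps - 84.6 * spw,        # Flesch Reading Ease
    }


features = readability_features("X is a moderately priced restaurant in X.")
```

Longer sentences and longer words both lower the re value, which is why re moves in the opposite direction to the other complexity measures.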
Grammaticality: In contrast to previous NLG methods, our corpus-based end-to-end systems can produce ungrammatical output by (a) generating word-by-word, and (b) learning from noisy data. As a first approximation of grammaticality, we measure the number of misspellings (msp) and the parsing score as returned by the Stanford parser (prs). The lower the msp, the more grammatically correct an utterance is. The Stanford parser score is not designed to measure grammaticality; however, it will generally prefer a grammatical parse to a non-grammatical one (see http://nlp.stanford.edu/software/parser-faq.shtml). Thus, lower parser scores indicate less grammatically correct utterances. In future work, we aim to use specifically designed grammar-scoring functions, e.g. (Napoles et al., 2016), once they become publicly available.
To collect human rankings, we presented the MR together with 2 utterances generated by different systems side-by-side to crowdworkers, who were asked to score each utterance on a 6-point Likert scale for:
Informativeness: Does the utterance provide all the useful information from the meaning representation?
Naturalness: Could the utterance have been produced by a native speaker?
Quality: How do you judge the overall quality of the utterance in terms of its grammatical correctness and fluency?
Each system output (see Table 1) was scored by 3 different crowdworkers. To reduce participants’ bias, the order of appearance of utterances produced by each system was randomised and crowdworkers were restricted to evaluate a maximum of 20 utterances. The crowdworkers were selected from English-speaking countries only, based on their IP addresses, and asked to confirm that English was their native language.
To assess the reliability of ratings, we calculated the intra-class correlation coefficient (ICC), which measures inter-observer reliability on ordinal data for more than two raters (Landis and Koch, 1977). The overall ICC across all three datasets is 0.45, which corresponds to a moderate agreement. In general, we find consistent differences in inter-annotator agreement per system and dataset, with lower agreements for lols than for rnnlg and TGen. Agreement is highest for the SFHotel dataset, followed by SFRest and Bagel (details provided in supplementary material).
Table 2 summarises the individual systems’ overall corpus-level performance in terms of automatic and human scores (details are provided in the supplementary material).
All WOMs produce similar results, with sim showing different results for the restaurant domain (Bagel and SFRest). Most GBMs show the same trend (with different levels of statistical significance), but re is showing inverse results. System performance is dataset-specific: For WBMs, the lols system consistently produces better results on Bagel compared to TGen, while for SFRest and SFHotel, lols is outperformed by rnnlg in terms of WBMs. We observe that human informativeness ratings follow the same pattern as WBMs, while the average similarity score (sim) seems to be related to human quality ratings.
Looking at GBMs, we observe that they seem to be related to naturalness and quality ratings. Less complex utterances, as measured by readability (re) and word length (cpw), have higher naturalness ratings. More complex utterances, as measured in terms of their length (len), number of words (wps), syllables (sps, spw) and polysyllables (pol, ppw), have lower quality evaluation. Utterances measured as more grammatical are on average evaluated higher in terms of naturalness.
These initial results suggest a relation between automatic metrics and human ratings at system level. However, average scores can be misleading, as they do not identify worst-case scenarios. This leads us to inspect the correlation of human and automatic metrics for each MR-system output pair at utterance level.
We calculate the correlation between automatic metrics and human ratings using the Spearman coefficient (ρ). We split the data per dataset and system in order to make valid pairwise comparisons. To handle outliers within human ratings, we use the median score of the three human raters. (As an alternative to using the median human judgment for each item, a more effective way to use all the human judgments could be to use the MACE tool of Hovy et al. (2013) for inferring the reliability of judges.) Following Kilickaya et al. (2017), we use the Williams’ test (Williams, 1959) to determine significant differences between correlations. Table 3 summarises the utterance-level correlation results between automatic metrics and human ratings, listing the best (i.e. highest absolute ρ) results for each type of metric (details provided in supplementary material). Our results suggest that:
In sum, no metric produces an even moderate correlation with human ratings, independently of dataset, system, or aspect of human rating. This contrasts with our initially promising results on the system level (see Section 6) and will be further discussed in Section 8. Note that similar inconsistencies between document- and sentence-level evaluation results are observed in MT (Specia et al., 2010).
Similar to our results in Section 6, we find that WBMs show better correlations to human ratings of informativeness (which reflects content selection), whereas GBMs show better correlations to quality and naturalness.
Human ratings for informativeness, naturalness and quality are highly correlated with each other, with the highest correlation between the latter two, reflecting that they both target surface realisation.
All WBMs produce similar results (see Figures 1 and 2): They are strongly correlated with each other, and most of them produce correlations with human ratings which are not significantly different from each other. GBMs, on the other hand, show greater diversity.
Correlation results are system- and dataset-specific (details provided in supplementary material). We observe the highest correlation for TGen on Bagel (Figures 1 and 2) and lols on SFRest, whereas rnnlg often shows low correlation between metrics and human ratings. This lets us conclude that WBMs and GBMs are sensitive to different systems and datasets.
The highest positive correlation is observed between the number of words (wps) and informativeness for the TGen system on Bagel (see Figure 1). However, the wps metric (amongst most others) is not robust across systems and datasets: Its correlation on other datasets is very weak, and its correlation with informativeness ratings of lols outputs is insignificant.
As a sanity check, we also measure a random score, which proves to have a close-to-zero correlation with human ratings.
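The utterance-level procedure described in this section (take the median of the three human ratings per output, then correlate it with a metric using the Spearman coefficient) can be sketched as follows. This is an illustrative pure-Python version under our own assumptions: the toy ratings and metric scores are invented, the function names are hypothetical, and the Williams’ test for comparing correlations is not included.

```python
from statistics import median


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman coefficient: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented toy data: three Likert ratings and one metric score per output.
human = [[6, 5, 6], [2, 1, 3], [4, 4, 5], [3, 6, 3]]
metric = [0.71, 0.10, 0.48, 0.52]

median_ratings = [median(r) for r in human]
rho = spearman(metric, median_ratings)
```

Using the median means that a single outlying rater (such as the rating of 6 in the last toy item) does not change the score that enters the correlation.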
We now evaluate a coarser measure, namely the metrics’ ability to predict relative human ratings. That is, we compute the score of each metric for two system output sentences corresponding to the same MR. The prediction of a metric is correct if it orders the sentences in the same way as median human ratings (note that ties are allowed). Following previous work (Vedantam et al., 2015; Kilickaya et al., 2017), we mainly concentrate on WBMs. Results summarised in Table 4 show that most metrics’ performance is not significantly different from that of a random score (Wilcoxon signed rank test). While the random score fluctuates between 25.4% and 44.5% prediction accuracy, the metrics achieve an accuracy of between 30.6% and 49.8%. Again, the performance of the metrics is dataset-specific: Metrics perform best on Bagel data; for SFHotel, metrics show mixed performance while for SFRest, metrics perform worst.
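The relative-ranking criterion can be written down compactly: a metric is counted as correct for a pair of outputs if it yields the same three-way outcome (first better, tie, second better) as the median human ratings. The sketch below is our own illustration with invented scores and hypothetical function names; it does not include the significance test.

```python
def order(a, b):
    """Return 1 if a is higher than b, -1 if lower, and 0 for a tie."""
    return (a > b) - (a < b)


def ranking_accuracy(pairs):
    """Share of pairs on which metric and human orderings agree.

    pairs: iterable of ((metric_a, metric_b), (human_a, human_b)), where
    a and b are two system outputs generated for the same MR.
    """
    pairs = list(pairs)
    correct = sum(order(*m) == order(*h) for m, h in pairs)
    return correct / len(pairs)


# Invented toy data: metric scores and median human ratings per pair.
toy_pairs = [
    ((0.6, 0.4), (5, 4)),  # both prefer the first output
    ((0.3, 0.5), (6, 6)),  # humans tie, the metric does not
    ((0.2, 0.9), (3, 5)),  # both prefer the second output
    ((0.7, 0.7), (4, 4)),  # both tie
]
accuracy = ranking_accuracy(toy_pairs)
```

The second toy pair shows how a continuous metric can be marked wrong merely because it breaks a tie that the human ratings leave open.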
Discussion: Our data differs from the one used in previous work (Vedantam et al., 2015; Kilickaya et al., 2017), which uses explicit relative rankings (“Which output do you prefer?”), whereas we compare two Likert-scale ratings. As such, we have 3 possible outcomes (allowing ties). This way, we can account for equally valid system outputs, which is one of the main drawbacks of forced-choice approaches (Hodosh and Hockenmaier, 2016). Our results are akin to previous work: Kilickaya et al. (2017) report results between 60-74% accuracy for binary classification on machine-machine data, which is comparable to our results for 3-way classification.
Still, we observe a mismatch between the ordinal human ratings and the continuous metrics. For example, humans might rate system A and system B both as a 6, whereas bleu, for example, might assign 0.98 and 1.0 respectively, meaning that bleu will declare system B as the winner. In order to account for this mismatch, we quantise our metric data to the same scale as the median scores from our human ratings. (Note that this mismatch can also be accounted for by continuous rating scales, as suggested by Belz and Kow (2011).) Applied to SFRest, where we previously got our worst results, we can see an improvement for predicting informativeness, where all WBMs now perform significantly better than the random baseline (see Table 4). In the future, we will investigate related discriminative approaches, e.g. (Hodosh and Hockenmaier, 2016; Kannan and Vinyals, 2017), where the task is simplified to distinguishing correct from incorrect output.
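One simple way to quantise a continuous metric onto the 6-point rating scale is to cut its range into six equal-width bins. The sketch below assumes exactly that, for a metric already normalised to the range 0 to 1 where higher is better; the binning scheme and the function name are our assumptions, and the exact quantisation used in the study may differ (a reversed-scale metric such as ter would need to be inverted first).

```python
import math


def quantise(score, low=0.0, high=1.0, levels=6):
    """Map a continuous score in [low, high] onto the integers 1..levels.

    Assumption: equal-width bins; scores outside the range are clipped.
    """
    position = (score - low) / (high - low)
    position = min(max(position, 0.0), 1.0)
    # The top edge of the range falls into the highest level, not level + 1.
    return min(levels, int(math.floor(position * levels)) + 1)


system_a = quantise(0.98)
system_b = quantise(1.0)
```

With this mapping, the two bleu values from the example above land on the same level, so the quantised metric no longer declares a winner where the human ratings are tied.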
In this section, we attempt to uncover why automatic metrics perform so poorly.
We first explore the hypothesis that metrics are good at distinguishing extreme cases, i.e. system outputs which are rated as clearly good or bad by the human judges, but do not perform well for utterances rated in the middle of the Likert scale, as suggested by Kilickaya et al. (2017). We ‘bin’ our data into three groups: bad, which comprises low ratings (≤2); good, comprising high ratings (≥5); and finally a group comprising average ratings.
We find that utterances with low human ratings of informativeness and naturalness correlate significantly better with automatic metrics than those with average and good human ratings. For example, as shown in Figure 3, the correlation between WBMs and human ratings is moderate for utterances with low informativeness scores, while the highest correlation for utterances of average and high informativeness is only very weak. The same pattern can be observed for correlations with quality and naturalness ratings.
This discrepancy in correlation results between low and other user ratings, together with the fact that the majority of system outputs are rated “good” for informativeness (79%), naturalness (64%) and quality (58%), whereas low ratings do not exceed 7% in total, could explain why the overall correlations are low (Section 7) despite the observed trends in relationship between average system-level performance scores (Section 6). It also explains why the rnnlg system, which contains very few instances of low user ratings, shows poor correlation between human ratings and automatic metrics.
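The binning step used in this analysis can be sketched as below. We assume thresholds of at most 2 for the bad group and at least 5 for the good group on the 6-point scale, with everything in between counted as average; the function names and the toy data are hypothetical. Correlations would then be computed separately within each group.

```python
from collections import defaultdict


def rating_bin(median_rating, low=2, high=5):
    """Assign a median Likert rating to 'bad', 'average' or 'good'.

    Assumption: bad means rating <= low, good means rating >= high.
    """
    if median_rating <= low:
        return "bad"
    if median_rating >= high:
        return "good"
    return "average"


def split_by_bin(items):
    """Group (median_rating, metric_score) pairs by their rating bin."""
    groups = defaultdict(list)
    for rating, score in items:
        groups[rating_bin(rating)].append((rating, score))
    return dict(groups)


# Invented toy data: (median human rating, metric score) per output.
toy_items = [(1, 0.05), (2, 0.20), (4, 0.41), (5, 0.39), (6, 0.44), (6, 0.43)]
groups = split_by_bin(toy_items)
```

In the toy data, as in the study, the good group is the largest, so a metric that only separates the bad outputs well contributes little to the overall correlation.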
Characteristics of Data: In Section 7.1, we observed that datasets have a significant impact on how well automatic metrics reflect human ratings. A closer inspection shows that Bagel data differs significantly from SFRest and SFHotel, both in terms of grammatical and MR properties. Bagel has significantly shorter references both in terms of number of characters and words compared to the other two datasets. Although shorter, the words in Bagel references are significantly more often polysyllabic. Furthermore, Bagel only consists of utterances generated from inform MRs, while SFRest and SFHotel also have less complex MR types, such as confirm, goodbye, etc. Utterances produced from inform MRs are significantly longer and have a significantly higher correlation with human ratings of informativeness and naturalness than non-inform utterance types. In other words, Bagel is the most complex dataset to generate from. Even though it is more complex, metrics perform most reliably on Bagel here (note that the correlation is still only weak). One possible explanation is that Bagel only contains two human references per MR, whereas SFHotel and SFRest both contain 5.35 references per MR on average. Having more references means that WBMs naturally will return higher scores (‘anything goes’). This problem could possibly be solved by weighting multiple references according to their quality, as suggested by Galley et al. (2015), or following a reference-less approach (Specia et al., 2010).
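The ‘anything goes’ effect can be illustrated with clipped unigram precision, the building block of bleu-style scores: each candidate word is credited if it appears in any reference, so adding references can only keep the score equal or raise it. The sketch below is a simplified illustration of ours (whitespace tokenisation, unigrams only, no brevity penalty) with invented example sentences; it is not the metric implementation used in the study.

```python
from collections import Counter


def clipped_unigram_precision(candidate, references):
    """Share of candidate tokens matched in the references.

    Each token count is clipped to the maximum count of that token
    in any single reference.
    """
    cand = Counter(candidate.lower().split())
    max_ref = Counter()
    for ref in references:
        for token, n in Counter(ref.lower().split()).items():
            max_ref[token] = max(max_ref[token], n)
    matched = sum(min(n, max_ref[token]) for token, n in cand.items())
    return matched / sum(cand.values())


# Invented example: one candidate scored against one, then two references.
candidate = "x is a cheap restaurant in x"
one_ref = ["x is a moderately priced restaurant in x"]
two_refs = one_ref + ["x is a cheap place"]

p_one = clipped_unigram_precision(candidate, one_ref)
p_two = clipped_unigram_precision(candidate, two_refs)
```

The second reference licenses the word “cheap”, so the same candidate scores higher even though its quality has not changed.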
Quality of Data: Our corpora contain crowdsourced human references that have grammatical errors, e.g. “Fifth Floor does not allow childs” (SFRest reference). Corpus-based methods may pick up these errors, and word-based metrics will rate these system utterances as correct, whereas we can expect human judges to be sensitive to ungrammatical utterances. Note that the parsing score (while being a crude approximation of grammaticality) achieves one of our highest correlation results against human ratings. Grammatical errors raise questions about the quality of the training data, especially when it is crowdsourced. For example, Belz and Reiter (2006) find that human experts assign low rankings to their original corpus text. Again, weighting (Galley et al., 2015) or reference-less approaches (Specia et al., 2010) might remedy this issue.
As shown in previous sections, word-based metrics moderately agree with humans on bad quality output, but cannot distinguish output of good or medium quality. Table 5 provides examples from our three systems. (Please note that WBMs tend to match against the reference that is closest to the generated output. Therefore, we only include the closest match in Table 5 for simplicity.) Again, we observe different behaviour between WOMs and sim scores. In Example 1, lols generates a grammatically correct English sentence, which represents the meaning of the MR well, and, as a result, this utterance received high human ratings (median = 6) for informativeness, naturalness and quality. However, WOMs rate this utterance low, i.e. scores of bleu1-4, nist, lepor, cider, rouge and meteor normalised into the 1-6 range all stay below 1.5. This is because the system-generated utterance has low overlap with the human/corpus references. Note that the sim score is high (5), as it ignores human references and computes distributional semantic similarity between the MR and the system output. Examples 2 and 3 show outputs which receive low scores from both automatic metrics and humans. WOMs score these system outputs low due to little or no overlap with human references, whereas humans are sensitive to ungrammatical output and missing information (the former is partially captured by GBMs). Examples 2 and 3 also illustrate inconsistencies in human ratings, since system output 2 is clearly worse than output 3 and both are rated by humans with a median score of 1. Example 4 shows an output of the rnnlg system which is semantically very similar to the reference (sim=4) and rated high by humans, but WOMs fail to capture this similarity. GBMs show more accurate results for this utterance, with mean of readability scores 4 and parsing score 3.5.
Table 6 summarises results published by previous studies in related fields which investigate the relation between human scores and automatic metrics. These studies mainly considered WBMs, while we are the first study to consider GBMs. Some studies ask users to provide separate ratings for surface realisation (e.g. asking about ‘clarity’ or ‘fluency’), whereas other studies focus only on sentence planning (e.g. ‘accuracy’, ‘adequacy’, or ‘correctness’). In general, correlations reported by previous work range from weak to strong. The results confirm that metrics can be reliable indicators at system-level (Reiter and Belz, 2009), while they perform less reliably at sentence-level (Stent et al., 2005). Also, the results show that the metrics capture surface realisation better than sentence planning. There is a general trend showing that best-performing metrics tend to be the more complex ones, combining word-overlap, semantic similarity and term frequency weighting. Note, however, that the majority of previous works do not report whether any of the metric correlations are significantly different from each other.
This paper shows that state-of-the-art automatic evaluation metrics for NLG systems do not sufficiently reflect human ratings, which stresses the need for human evaluations. This result is opposed to the current trend of relying on automatic evaluation identified by Gkatzia and Mahamood (2015).
A detailed error analysis suggests that automatic metrics are particularly weak in distinguishing outputs of medium and good quality, which can be partially attributed to the fact that human judgements and metrics are given on different scales. We also show that metric performance is data- and system-specific.
Nevertheless, our results also suggest that automatic metrics can be useful for error analysis by helping to find cases where the system is performing poorly. In addition, we find reliable results on system-level, which suggests that metrics can be useful for system development.
Word-based metrics make two strong assumptions: They treat human-generated references as a gold standard, which is correct and complete. We argue that these assumptions are invalid for corpus-based NLG, especially when using crowdsourced datasets. Grammar-based metrics, on the other hand, do not rely on human-generated references and are not influenced by their quality. However, these metrics can be easily manipulated with grammatically correct and easily readable output that is unrelated to the input. We have experimented with combining WBMs and GBMs using ensemble-based learning. However, while our model achieved high correlation with humans within a single domain, its cross-domain performance is insufficient.
Our paper clearly demonstrates the need for more advanced metrics, as used in related fields, including: assessing output quality within the dialogue context, e.g. (Dušek and Jurčíček, 2016); extrinsic evaluation metrics, such as NLG’s contribution to task success, e.g. (Rieser et al., 2014; Gkatzia et al., 2016; Hastie et al., 2016); building discriminative models, e.g. (Hodosh and Hockenmaier, 2016), (Kannan and Vinyals, 2017); or reference-less quality prediction as used in MT, e.g. (Specia et al., 2010). We see our paper as a first step towards reference-less evaluation for NLG by introducing grammar-based metrics. In current work (Dušek et al., 2017), we investigate a reference-less quality estimation approach based on recurrent neural networks, which predicts a quality score for an NLG system output by comparing it to the source meaning representation only.
Finally, note that the datasets considered in this study are fairly small (between 404 and 2.3k human references per domain). To remedy this, systems train on de-lexicalised versions (Wen et al., 2015), which bears the danger of ungrammatical lexicalisation (Sharma et al., 2016) and a possible overlap between testing and training set (Lampouras and Vlachos, 2016). There are ongoing efforts to release larger and more diverse data sets, e.g. (Novikova et al., 2016, 2017).
This research received funding from the EPSRC projects DILiGENt (EP/M005429/1) and MaDrIgAL (EP/N017536/1). The Titan Xp used for this research was donated by the NVIDIA Corporation.
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing, China, pages 451–461. http://aclweb.org/anthology/P15-1044.
|metric||Avg / StDev||Avg / StDev||Avg / StDev||Avg / StDev||Avg / StDev||Avg / StDev|
“StDev” stands for standard deviation; “*” denotes a statistically significant difference between the two systems on the given dataset.