Referenceless Quality Estimation for Natural Language Generation

by Ondřej Dušek, et al.

Traditional automatic evaluation measures for natural language generation (NLG) use costly human-authored references to estimate the quality of a system output. In this paper, we propose a referenceless quality estimation (QE) approach based on recurrent neural networks, which predicts a quality score for an NLG system output by comparing it to the source meaning representation only. Our method outperforms traditional metrics and a constant baseline in most respects; we also show that synthetic data helps to increase correlation results by 21%. Our results are comparable to those obtained in similar QE tasks despite the more challenging setting.





1 Introduction

Automatic evaluation of natural language generation (NLG) is a complex task due to multiple acceptable outcomes. Apart from manual human evaluation, most recent works in NLG are evaluated using word-overlap-based metrics such as BLEU (Gkatzia & Mahamood, 2015), which compute similarity against gold-standard human references. However, high-quality human references are costly to obtain, and for most word-overlap metrics, a minimum of four references is needed to achieve reliable results (Finch et al., 2004). Furthermore, these metrics tend to perform poorly at segment level (Lavie & Agarwal, 2007; Chen & Cherry, 2014; Novikova et al., 2017a).

We present a novel approach to assessing NLG output quality without human references, focusing on segment-level (utterance-level) quality assessments.[1] We train a recurrent neural network (RNN) to estimate the quality of an NLG output based on a comparison with the source meaning representation (MR) only. This allows us to assess NLG quality efficiently, not only during system development but also at runtime, e.g. for optimisation, reranking, or compensating for low-quality output with rule-based fallback strategies.

[1] In our data, a "segment" refers to an utterance generated by an NLG system in the context of a human–computer dialogue, typically 1–2 sentences in length (see Section 3.1). We estimate the utterance quality without taking the dialogue context into account. Assessing the appropriateness of responses in context is beyond the scope of this paper; see e.g. (Liu et al., 2016; Lowe et al., 2017; Cercas Curry et al., 2017).

To evaluate our method, we use crowdsourced human quality assessments of real system outputs from three different NLG systems on three datasets in two domains. We also show that adding fabricated data with synthesised errors to the training set increases relative performance by 21% (as measured by Pearson correlation).

In contrast to recent advances in referenceless quality estimation (QE) in other fields such as machine translation (MT) (Bojar et al., 2016) or grammatical error correction (Napoles et al., 2016), NLG QE is more challenging because (1) diverse realisations of a single MR are often acceptable (as the MR is typically a limited formal language); (2) human perception of NLG quality is highly variable, e.g. (Dethlefs et al., 2014); (3) NLG datasets are costly to obtain and thus small in size. Despite these difficulties, we achieve promising results – correlations with human judgements achieved by our system stay in a somewhat lower range than those achieved e.g. by state-of-the-art MT QE systems (Bojar et al., 2016), but they significantly outperform word-overlap metrics. To our knowledge, this is the first work in trainable NLG QE without references.

Figure 1: The architecture of our referenceless NLG QE model.

2 Our Model

We use a simple RNN model based on Gated Recurrent Units (GRUs) (Cho et al., 2014), composed of one GRU encoder each for the source MR and for the NLG system output to be rated, followed by fully connected layers operating over the last hidden states of both encoders. The final classification layer is linear and produces the quality assessment as a single floating-point number (see Figure 1).[2]

[2] The meaning of this number depends entirely on the training data. In our case, we use a 1–6 Likert-scale assessment (see Section 3.1), but we could, for instance, use the same network to predict the required number of post-edits, as commonly done in MT (see Section 5).

The model assumes both the source MR and the system output to be sequences of tokens x = (x_1, …, x_n) and y = (y_1, …, y_m), where each token is represented by its embedding (Bengio et al., 2003). The GRU encoders encode these sequences left-to-right into sequences of hidden states h_1, …, h_n and g_1, …, g_m:

    h_t = GRU(x_t, h_{t-1})    (1)
    g_t = GRU(y_t, g_{t-1})    (2)

The final hidden states h_n and g_m are then fed to a set of fully connected layers D_1, D_2:

    d_1 = tanh((h_n ⊕ g_m) W_1)    (3)
    d_2 = tanh(d_1 W_2)            (4)

The final prediction q is given by a linear layer:

    q = d_2 · w    (5)

In (3)–(5), W_1 and W_2 stand for square weight matrices, w is a weight vector, and ⊕ denotes concatenation.
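As an illustration, the forward pass of this dual-encoder architecture can be sketched in a few lines of numpy. This is a minimal sketch with untrained random weights and toy dimensions; in the real model, the embeddings and weights are learned by backpropagation against human scores, and the vocabulary and dimensions here are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def gru_cell(params, x, h):
    """One GRU step (Cho et al., 2014): update gate, reset gate, candidate state."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = 1.0 / (1.0 + np.exp(-(Wz @ x + Uz @ h)))   # update gate
    r = 1.0 / (1.0 + np.exp(-(Wr @ x + Ur @ h)))   # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))       # candidate state
    return (1.0 - z) * h + z * h_tilde

def encode(params, emb, tokens, dim):
    """Encode a token sequence left-to-right; return the final hidden state."""
    h = np.zeros(dim)
    for tok in tokens:
        h = gru_cell(params, emb[tok], h)
    return h

def init_gru(dim):
    # Six square weight matrices per encoder (input dim == hidden dim here)
    return [rng.normal(0, 0.1, (dim, dim)) for _ in range(6)]

DIM = 8                                            # toy size; the paper uses 300
vocab = ["inform", "name", "X-name", "area", "X-area",
         "null", "is", "in", "the", "."]
emb = {w: rng.normal(0, 0.1, DIM) for w in vocab}  # learned in the real model

gru_mr, gru_out = init_gru(DIM), init_gru(DIM)     # one encoder per input
W1 = rng.normal(0, 0.1, (2 * DIM, 2 * DIM))        # fully connected over [h_n ; g_m]
W2 = rng.normal(0, 0.1, (2 * DIM, 2 * DIM))
w = rng.normal(0, 0.1, 2 * DIM)                    # final linear scoring vector

def quality_score(mr_tokens, out_tokens):
    h_n = encode(gru_mr, emb, mr_tokens, DIM)
    g_m = encode(gru_out, emb, out_tokens, DIM)
    d1 = np.tanh(np.concatenate([h_n, g_m]) @ W1)
    d2 = np.tanh(d1 @ W2)
    return float(d2 @ w)                           # single raw quality score

score = quality_score(["inform", "name", "X-name", "inform", "area", "X-area"],
                      ["X-name", "is", "in", "the", "X-area", "."])
```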

The network is trained in a supervised setting by minimising mean square error against human-assigned quality scores on the training data (see Section 3). Embedding vectors are initialised randomly and learned during training; each token found in the training data is given an embedding dictionary entry. Dropout (Hinton et al., 2012) is applied on the inputs to the GRU encoders for regularisation.

The floating-point values predicted by the network are rounded to the precision and clipped to the range found in the training data.[3]

[3] We use a precision of 0.5 and the 1–6 range (see Section 3.1).
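The rounding and clipping step described above amounts to the following (a direct sketch, using the 0.5 precision and 1–6 range from our data):

```python
def postprocess(score, precision=0.5, lo=1.0, hi=6.0):
    """Round a raw network prediction to the training data's precision
    and clip it to the training data's rating range."""
    rounded = round(score / precision) * precision
    return min(max(rounded, lo), hi)
```

For instance, a raw prediction of 4.26 becomes 4.5, and an out-of-range prediction of 7.3 is clipped to 6.0.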

2.1 Model Variants

We also experimented with several variants of the model, which performed similarly to or worse than the setup presented in Section 4. We list them here for completeness:


  • replacing GRU cells with LSTM (Hochreiter & Schmidhuber, 1997),

  • using word embeddings pretrained by the word2vec tool (Mikolov et al., 2013) on Google News data,[4]

    [4] We used the model available at
  • using a set of independent binary classifiers, each predicting one of the individual target quality levels (see Section 3.1),
  • using an ordered set of binary classifiers trained to predict 1 for NLG outputs above a specified quality level and 0 below it,[5]

    [5] The predicted value was interpolated from the classifiers' predictions of the positive class probability.

  • pretraining the network using a different task (classifying MRs or predicting next word in the sentence).

3 Experimental Setup

In the following, we describe the data we use to evaluate our system, our method for data augmentation, detailed parameters of our model, and evaluation metrics.

3.1 Dataset

System   BAGEL   SFRest   SFHot   Total
LOLS       202      581     398   1,181
RNNLG        –      600     477   1,077
TGen       202        –       –     202
Total      404    1,181     875   2,460
Table 1: Number of ratings from different source datasets and NLG systems in our data.
Example 1:
MR:  inform(name='la ciccia', area='bernal heights', price_range=moderate)
Ref: la ciccia is a moderate price restaurant in bernal heights
Out: la ciccia, is in the bernal heights area with a moderate price range.
     H = 5.5 | B = 1 (0.000) | M = 3 (0.371) | R = 3.5 (0.542) | C = 2 (2.117) | S4 = 4.5

Example 2:
MR:  inform(name='intercontinental san francisco', price_range='pricey')
Ref: sure, the intercontinental san francisco is in the pricey range.
Out: the intercontinental san francisco is in the pricey price range.
     H = 2 | B = 4.5 (0.707) | M = 3 (0.433) | R = 5.5 (0.875) | C = 2 (2.318) | S4 = 5

Figure 2: Examples from our dataset. Ref = human reference (one selected), Out = system output to be rated. H = median human rating, B = BLEU, M = METEOR, R = ROUGE, C = CIDEr, S4 = rating given by our system in the S4 configuration. The metrics are shown as normalised and rounded values (see Section 3.5), with the original raw values in parentheses. The top example is rated low by all metrics; our system is more accurate. The bottom one is rated low by humans but high by some metrics and our system.

Using the CrowdFlower crowdsourcing platform,[6] we collected a dataset of human ratings for outputs of three recent data-driven NLG systems as provided to us by the systems' authors; see (Novikova et al., 2017a) for more details. The following systems are included in our set:


  • LOLS (Lampouras & Vlachos, 2016)

    , which is based on imitation learning,

  • RNNLG (Wen et al., 2015), a RNN-based system,

  • TGen (Dušek & Jurčíček, 2015)

    , a system using perceptron-guided incremental tree generation.

Their outputs are on the test parts of the following datasets (see Table 1):


  • BAGEL (Mairesse et al., 2010) – 404 short text segments (1–2 sentences) informing about restaurants,

  • SFRest (Wen et al., 2015) – ca. 5,000 segments from the restaurant information domain (including questions, confirmations, greetings, etc.),

  • SFHot (Wen et al., 2015) – a set from the hotel domain similar in size and contents to SFRest.

During the crowdsourcing evaluation of the outputs, crowd workers were given two random system outputs along with the source MRs and were asked to evaluate the absolute overall quality of both outputs on a 1–6 Likert scale (see Figure 2). We collected at least three ratings for each system output; this resulted in more ratings for the same sentence if two systems' outputs were identical. The ratings show a moderate inter-rater agreement, with an intra-class correlation coefficient of 0.45 (Landis & Koch, 1977) across all three datasets. In our experiments, we use the median of the three (or more) ratings for each output to obtain more consistent scores, which results in .5 values for some examples. We keep this granularity throughout our experiments.
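The median aggregation above is straightforward; the .5 values arise whenever an even number of ratings puts the median between two Likert points:

```python
from statistics import median

def aggregate_ratings(ratings):
    """Median of the (3 or more) crowd ratings collected for one system output."""
    return median(ratings)

# An odd count yields an integer rating; an even count can yield a .5 value,
# which we keep throughout the experiments.
odd = aggregate_ratings([4, 5, 6])      # 5
even = aggregate_ratings([3, 4, 5, 6])  # 4.5, the mean of the two middle ratings
```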

We use our data in a 5-fold cross-validation setting (three training, one development, and one test part in each fold). We also test our model on a subset of ratings for a particular NLG system or dataset in order to assess its cross-system and cross-domain performance (see Section 4).

3.2 Data Preprocessing

The source MRs in our data are variants of the dialogue acts (DA) formalism (Young et al., 2010) – a shallow domain-specific MR, consisting of the main DA type (hello, inform, request) and an optional set of slots (attributes, such as food or location) and values (e.g. Chinese for food or city centre for location). DAs are converted into sequences for our system as a list of triplets “DA type – slot – value”, where DA type may be repeated multiple times and/or special null tokens are used if slots or values are not present (see Figure 1). The system outputs are tokenised and lowercased for our purposes.
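A minimal sketch of this DA-to-sequence conversion might look as follows. The parsing here is a simplified regex-based illustration (the exact DA syntax handling in our pipeline may differ), and the "null" padding follows the triplet scheme described above:

```python
import re

def da_to_sequence(da):
    """Flatten a dialogue act string into "DA type - slot - value" triplets,
    repeating the DA type and inserting 'null' tokens for missing slots/values.
    Simplified sketch; real DA strings may need more robust parsing."""
    m = re.match(r"(\w+)\((.*)\)$", da.strip())
    if not m:
        return [da.strip(), "null", "null"]     # bare DA type without parentheses
    act, body = m.groups()
    if not body:
        return [act, "null", "null"]            # e.g. hello() has no slots
    seq = []
    for slot, value in re.findall(r"(\w+)(?:='?([^',]*)'?)?", body):
        seq += [act, slot, value if value else "null"]
    return seq

seq = da_to_sequence("inform(name='la ciccia',area='bernal heights')")
```

Here the DA type `inform` is repeated once per slot, yielding a flat token sequence the GRU encoder can consume.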

We use delexicalisation to prevent data sparsity, following (Mairesse et al., 2010; Henderson et al., 2014; Wen et al., 2015), where values of most DA slots (except unknown and binary yes/no values) are replaced in both the source MRs and the system outputs by slot placeholders – e.g. Chinese is replaced by X-food (cf. also Figure 1).[7]

[7] Note that only values matching the source MR are delexicalised in the system outputs – if the outputs contain values not present in the source MR, they are kept intact in the model input.
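The core of this delexicalisation step can be sketched as a simple value-for-placeholder substitution. This is an illustrative simplification: it handles only exact string matches and skips binary values, whereas the full procedure also deals with unknown values and MR-side replacement:

```python
def delexicalise(mr_slots, output):
    """Replace slot values from the source MR with X-<slot> placeholders
    in the system output. Binary yes/no values are left intact; values not
    present in the source MR are never touched (sketch of Section 3.2)."""
    for slot, value in mr_slots.items():
        if value.lower() in ("yes", "no"):      # binary values stay lexicalised
            continue
        output = output.replace(value, "X-" + slot)
    return output

delex = delexicalise({"food": "Chinese", "area": "city centre"},
                     "there is a Chinese restaurant in the city centre")
```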

3.3 Synthesising Additional Training Data

Setup   Instances
S1          1,476
S2          3,937
S3         13,442
S4         45,137
S5         57,372
S6         80,522
Table 2: Training data size comparison for the different data augmentation procedures (the numbers vary slightly across cross-validation folds due to rounding).

Following prior work in grammatical error correction (Rozovskaya et al., 2012; Felice & Yuan, 2014; Xie et al., 2016), we synthesise additional training instances by introducing artificial errors: given a training instance (source MR, system output, and human rating), we generate a number of errors in the system output and lower the human rating accordingly. We use a set of basic heuristics mimicking some of the observed system behaviour to introduce errors into the system outputs:[8]

[8] To ensure that the errors truly are detrimental to the system output quality, our rules prioritise content words, i.e. they do not change articles or punctuation if other words are present. The rules never remove the last word left in the system output.

  1. removing a word,

  2. duplicating a word at its current position,

  3. duplicating a word at a random position,

  4. adding a random word from a dictionary learned from all training system outputs,

  5. replacing a word with a random word from the dictionary.

We lower the original Likert scale rating of the instance by 1 for each generated error. If the original rating was 5.5 or 6, the rating is lowered to 4 with the first introduced error and by 1 for each additional error.

We also experiment with using additional natural language sentences where human ratings are not available – we use human-authored references from the original training datasets and assume that these would receive the maximum rating of 6. We introduce artificial errors into the human references in exactly the same way as with training system outputs.
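The error-synthesis procedure above can be sketched as follows. This is a simplified illustration: the content-word prioritisation from our rules is omitted, and the small stand-in dictionary is invented for the example; the rating-lowering rule follows the description above exactly:

```python
import random

def corrupt(tokens, rating, n_errors, rng=random.Random(42)):
    """Introduce n_errors artificial errors using the five heuristics of
    Section 3.3 and lower the Likert rating accordingly (5.5/6 drops to 4
    on the first error, then by 1 per additional error). Sketch only:
    content-word prioritisation is not implemented here."""
    vocab = ["restaurant", "the", "price", "area"]   # stand-in learned dictionary
    tokens = list(tokens)
    for _ in range(n_errors):
        op = rng.choice(["remove", "dup_here", "dup_random", "add", "replace"])
        i = rng.randrange(len(tokens))
        if op == "remove" and len(tokens) > 1:       # never remove the last word
            tokens.pop(i)
        elif op == "dup_here":
            tokens.insert(i, tokens[i])
        elif op == "dup_random":
            tokens.insert(rng.randrange(len(tokens) + 1), tokens[i])
        elif op == "add":
            tokens.insert(rng.randrange(len(tokens) + 1), rng.choice(vocab))
        elif op == "replace":
            tokens[i] = rng.choice(vocab)
    if n_errors > 0 and rating >= 5.5:               # 5.5/6 -> 4 on first error
        rating = 4 - (n_errors - 1)
    else:
        rating -= n_errors
    return tokens, max(rating, 1)

noisy, new_rating = corrupt("the hotel is in the centre".split(), 6, 2)
```

For example, an output originally rated 6 receives a rating of 3 after two synthetic errors (4 on the first error, minus 1 for the second).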

3.4 Model Training Parameters

We set the network parameters based on several experiments performed on the development set of one of the cross-validation folds (see Section 3.1).[9]

[9] We use embedding size 300, learning rate 0.0001, dropout probability 0.5, and two fully connected layers.

We train the network for 500 passes over the training data, checking Pearson and Spearman correlations on the validation set after each pass (with equal importance). We keep the configuration that yielded the best correlations overall. For setups using synthetic training data (see Section 3.3), we first perform 20 passes over all data including synthetic, keeping the best parameters, and then proceed with 500 passes over the original data. To compensate for the effects of random network initialisation, all our results are averaged over five runs with different initial random seeds following Wen et al. (2015).

3.5 Evaluation Measures

Following practices from MT quality estimation (Bojar et al., 2016),[10] we use Pearson's correlation of the predicted output quality with median human ratings as our primary evaluation metric. Mean absolute error (MAE), root mean squared error (RMSE), and Spearman's rank correlation are used as additional metrics.

[10] See also the currently ongoing WMT'17 Quality Estimation shared task at

We compare our results to some of the common word-overlap metrics – BLEU (Papineni et al., 2002), METEOR (Lavie & Agarwal, 2007), ROUGE-L (Lin, 2004), and CIDEr (Vedantam et al., 2015) – normalised into the 1–6 range of the predicted human ratings and further rounded to 0.5 steps.[11] In addition, we show the MAE/RMSE values for a trivial constant baseline that always predicts the overall average human rating (4.5).

[11] We used the Microsoft COCO Captions evaluation script to obtain the metrics scores (Chen et al., 2015). Trials with non-quantised metrics yielded very similar correlations.
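The normalisation of raw metric scores onto the rating scale can be sketched as a min-max mapping with 0.5 quantisation. Note the exact normalisation used in our pipeline is not spelled out here; a linear mapping from a metric's observed range onto 1–6 is one plausible realisation:

```python
def normalise(raw, lo, hi):
    """Map a raw metric score from [lo, hi] linearly onto the 1-6 range of
    the human ratings and quantise to 0.5 steps (assumed min-max scheme)."""
    scaled = 1 + 5 * (raw - lo) / (hi - lo)
    return round(scaled * 2) / 2

score = normalise(0.371, 0.0, 1.0)   # e.g. a raw METEOR value on a [0, 1] range
```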

4 Results

Setup                                                Pearson  Spearman   MAE    RMSE
Constant                                                –        –      1.013  1.233
BLEU*                                                 0.074    0.061    2.264  2.731
METEOR*                                               0.095    0.099    1.820  2.129
ROUGE-L*                                              0.079    0.072    1.312  1.674
CIDEr*                                                0.061    0.058    2.606  2.935
S1: Base system                                       0.273    0.260    0.948  1.258
S2: + errors generated in training system outputs     0.283    0.268    0.948  1.273
S3: + training references, with generated errors      0.278    0.261    0.930  1.257
S4: + systems' training data, with generated errors   0.330    0.274    0.914  1.226
S5: + test references, with generated errors*         0.331    0.265    0.937  1.245
S6: + complete datasets, with generated errors*       0.354    0.287    0.909  1.208
Table 3: Results using cross-validation over the whole dataset. Setups marked with "*" use human references for the test instances. All setups S1–S6 produce significantly better correlations than all metrics. Significant improvements in correlation over S1 are marked in bold.

We test the following configurations that only differ in the amount and nature of synthetic data used (see Section 3.3 and Table 2):

  S1. Base system variant, with no synthetic data.

  S2. Adding synthetic data – introducing artificial errors into system outputs from the training portion of our dataset (no additional human references are used).

  S3. Same as previous, but with additional human references from the training portion of our dataset (including artificial errors; see Section 3.3).[12]

  S4. As previous, but with additional human references from the training parts of the respective source NLG datasets (including artificial errors), i.e. references on which the original NLG systems were trained.[12]

  S5. As previous, but also including additional human references from the test portion of our dataset (including artificial errors).[13]

  S6. As previous, but also including the development parts of the source NLG datasets (including artificial errors).

[12] As mentioned in Section 3.1, our dataset only comprises the test sets of the source NLG datasets, i.e. the additional human references in S3 represent a portion of the source test sets. The difference to S4 is the amount of the additional data (see Table 2).

[13] Note that the model still does not have any access at training time to the test NLG system outputs or their true ratings.

Synthetic data are never created from system outputs in the test part of our dataset. Note that S1 and S2 only use the original system outputs and their ratings, with no additional human references. S3 and S4 use additional human references (i.e. more in-domain data), but do not use human references for the instances on which the system is tested. S5 and S6 also use human references for the test MRs, even if not directly, and are thus not strictly referenceless.

4.1 Results using the whole dataset

The correlations and error values we obtained over the whole data in a cross-validation setup are shown in Table 3. The correlations stay only moderate for all system variants. On the other hand, we can see that even the base setup (S1), trained using fewer than 2,000 examples, performs better than all the word-overlap metrics in terms of all evaluation measures. Improvements in both Pearson and Spearman correlations are significant according to the Williams test (Williams, 1959; Kilickaya et al., 2017). When comparing the base setup against the constant baseline, MAE is lower but RMSE is slightly higher, which suggests that our system does better on average but is prone to occasional large errors.
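The MAE-versus-RMSE contrast above is easy to see numerically: two predictors with the same mean absolute error can have different RMSE if one concentrates its error in occasional large mistakes. A small hypothetical example:

```python
import math

def mae(pred, gold):
    """Mean absolute error."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)

def rmse(pred, gold):
    """Root mean squared error, which penalises large errors more heavily."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))

gold   = [4.0, 4.0, 4.0, 4.0]
steady = [4.5, 4.5, 4.5, 4.5]   # uniform small errors (constant-baseline-like)
spiky  = [4.0, 4.0, 4.0, 6.0]   # mostly exact, one large error

same_mae = mae(steady, gold) == mae(spiky, gold)   # both 0.5
higher_rmse = rmse(spiky, gold) > rmse(steady, gold)
```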

The results also show that the performance can be improved considerably by adding synthetic data, especially after more than tripling the training data in S4 (the Pearson correlation improvement is statistically significant in terms of the Williams test). Using additional human references for the test data seems to help further in S6 (the difference in Pearson correlation is statistically significant): the additional references apparently provide more information even though the SFHot and SFRest datasets have similar MRs (identical when delexicalised) in training and test data (Lampouras & Vlachos, 2016).[14] Our preliminary experiments suggest that our system can also handle lexicalised data well, without any modification (Pearson correlations of 0.264–0.359 for S1–S6). The setups using larger synthetic data further improve MAE and RMSE: S4 and S6 increase the margin against the constant baseline up to ca. 0.1 in terms of MAE, and both are able to surpass the constant baseline in terms of RMSE.

[14] Note that unlike NLG systems trained on SFHot and SFRest, our system cannot simply memorise the training data when identical MRs occur in training and test data, as the NLG outputs to be rated are distinct. However, the situation is not fully referenceless, as the system may have been exposed to other NLG outputs for the same MR.

4.2 Cross-domain and Cross-System Training

          C1: small in-domain data only     C2: out-of-domain data only       C3: out-of-dom. + small in-dom.
          Pear   Spea   MAE    RMSE         Pear   Spea   MAE    RMSE         Pear   Spea   MAE    RMSE
Constant    –      –    0.994  1.224          –      –    0.994  1.224          –      –    0.994  1.224
BLEU*     0.033  0.016  2.235  2.710        0.033  0.016  2.235  2.710        0.033  0.016  2.235  2.710
METEOR*   0.076  0.074  1.719  2.034        0.076  0.074  1.719  2.034        0.076  0.074  1.719  2.034
ROUGE-L*  0.064  0.049  1.255  1.620        0.064  0.049  1.255  1.620        0.064  0.049  1.255  1.620
CIDEr*    0.048  0.043  2.590  2.921        0.048  0.043  2.590  2.921        0.048  0.043  2.590  2.921
S1        0.147  0.136  1.086  1.416        0.162  0.152  0.985  1.281        0.170  0.166  1.003  1.315
S2        0.196  0.176  1.059  1.364        0.197  0.189  1.003  1.311        0.219  0.218  0.988  1.296
S3        0.176  0.163  1.093  1.420        0.147  0.134  1.037  1.366        0.219  0.216  0.979  1.302
S4        0.264  0.218  0.983  1.307        0.162  0.138  1.084  1.448        0.247  0.211  0.983  1.306
S5*       0.280  0.221  1.009  1.341        0.173  0.145  1.077  1.438        0.210  0.162  1.095  1.442
S6*       0.271  0.202  0.991  1.331        0.188  0.178  1.037  1.392        0.224  0.210  1.002  1.339
Table 4: Cross-domain evaluation results. Setups marked with "*" use human references of test instances. All setups produce significantly better correlations than all metrics. Significant improvements in correlation over the corresponding S1 are marked in bold; significant improvements over the corresponding C1 are underlined.
          C1: small in-system data only     C2: out-of-system data only       C3: out-of-sys. + small in-sys.
          Pear   Spea   MAE    RMSE         Pear   Spea   MAE    RMSE         Pear   Spea   MAE    RMSE
Constant    –      –    1.060  1.301          –      –    1.060  1.301          –      –    1.060  1.301
BLEU*     0.079  0.043  2.514  2.971        0.079  0.043  2.514  2.971        0.079  0.043  2.514  2.971
METEOR*   0.141  0.122  1.929  2.238        0.141  0.122  1.929  2.238        0.141  0.122  1.929  2.238
ROUGE-L*  0.064  0.048  1.449  1.802        0.064  0.048  1.449  1.802        0.064  0.048  1.449  1.802
CIDEr*    0.127  0.106  2.801  3.112        0.127  0.106  2.801  3.112        0.127  0.106  2.801  3.112
S1        0.341  0.334  1.054  1.405        0.097  0.117  1.052  1.336        0.174  0.179  1.114  1.455
S2        0.358  0.345  1.007  1.342        0.115  0.119  1.057  1.355        0.203  0.222  1.253  1.613
S3        0.378  0.365  0.971  1.326        0.112  0.094  1.059  1.387        0.404  0.377  0.968  1.277
S4        0.390  0.360  0.981  1.311        0.247  0.189  1.011  1.338        0.370  0.346  0.997  1.312
S5*       0.398  0.364  1.043  1.393        0.229  0.174  1.025  1.328        0.386  0.356  0.975  1.301
S6*       0.390  0.353  1.036  1.389        0.332  0.262  0.969  1.280        0.374  0.330  0.979  1.298
Table 5: Cross-system evaluation results. Setups marked with "*" use human references of test instances. Setups that do not produce significantly better correlations than all metrics are marked in italics. Significant improvements in correlation over the corresponding S1 are marked in bold; significant improvements over the corresponding C1 are underlined.

Next, we test how well our approach generalises to new systems and datasets and how much in-set data (same domain/system) is needed to obtain reasonable results. We use the SFHot data as our test domain and LOLS as our test system, and we treat the rest as out-of-set. We test three different configurations:

  C1. Training exclusively on a small amount of in-set data (200 instances, 100 of which are reserved for validation), testing on the rest of the in-set.

  C2. Training and validating exclusively on out-of-set data, testing on the same part of the in-set as in C1 and C3.

  C3. Training on the out-of-set data plus a small amount of in-set data (200 instances, 100 reserved for validation), testing on the rest of the in-set.

The results for the cross-domain and cross-system settings are shown in Tables 4 and 5, respectively.

The correlations of C2 suggest that while our network can generalise across systems to some extent (if data fabrication is used), it does not generalise well across domains without in-domain training data. The C1 and C3 results demonstrate that even small amounts of in-set data help noticeably. However, if in-set data is used, additional out-of-set data does not improve the results in most cases (C3 is mostly not significantly better than the corresponding C1).

Except for a few cross-system C2 configurations with low amounts of synthetic data, all setups perform better than the word-overlap metrics. However, most setups are not able to improve over the constant baseline in terms of MAE and RMSE.

5 Related Work

This work is the first NLG QE system to our knowledge; the most related work in NLG is probably the system by Dethlefs et al. (2014), which reranks NLG outputs by estimating their properties (such as colloquialism or politeness) using various regression models. However, our work is also related to QE research in other areas, such as MT (Specia et al., 2010), dialogue systems (Lowe et al., 2017) or grammatical error correction (Napoles et al., 2016). QE is especially well researched for MT, where regular QE shared tasks are organised (Callison-Burch et al., 2012; Bojar et al., 2013, 2014, 2015, 2016).

Many of the past MT QE systems participating in the shared tasks are based on Support Vector Regression (Specia et al., 2015; Bojar et al., 2014, 2015). Only in the past year have NN-based solutions started to emerge. Patel & M (2016) present a system based on RNN language models, which focuses on predicting MT quality on the word level. Kim & Lee (2016) estimate segment-level MT output quality using a bidirectional RNN over both source and output sentences, combined with a logistic prediction layer. They pretrain their RNN on large MT training data.

Last year's MT QE shared task systems achieve Pearson correlations of 0.4–0.5, slightly higher than our best results. However, the results are not directly comparable: first, we predict a Likert-scale assessment instead of the number of required post-edits; second, NLG datasets are considerably smaller than corpora available for MT; third, we believe that QE for NLG is harder due to the reasons outlined in Section 1.

6 Conclusions and Future Work

We presented the first system for referenceless quality estimation of natural language generation outputs. All code and data used here is available online at:

In an evaluation spanning outputs of three different NLG systems and three datasets, our system significantly outperformed four commonly used reference-based metrics. It also improved over a constant baseline, which always predicts the mean human rating, in terms of MAE and RMSE. The smaller RMSE improvement suggests that our system is prone to occasional large errors. We have shown that generating additional training data, e.g. by using NLG training datasets and synthetic errors, significantly improves the system performance. While our system can generalise to unseen NLG systems in the same domain to some extent, its cross-domain generalisation capability is poor. However, very small amounts of in-domain/in-system data improve performance notably.

In future work, we will explore improvements to our error synthesising methods as well as changes to our network architecture (bidirectional RNNs or convolutional NNs). We also plan to focus on relative ranking of different NLG outputs for the same source MR or predicting the number of post-edits required. We intend to use data collected within the ongoing E2E NLG Challenge (Novikova et al., 2017b), which promises greater diversity than current datasets.


Acknowledgements

The authors would like to thank Lucia Specia and the two anonymous reviewers for their helpful comments. This research received funding from the EPSRC projects DILiGENt (EP/M005429/1) and MaDrIgAL (EP/N017536/1). The Titan Xp used for this research was donated by the NVIDIA Corporation.