Evaluating the State-of-the-Art of End-to-End Natural Language Generation: The E2E NLG Challenge

01/23/2019 ∙ by Ondřej Dušek, et al. ∙ Charles University in Prague ∙ Heriot-Watt University

This paper provides a detailed summary of the first shared task on End-to-End Natural Language Generation (NLG) and identifies avenues for future research based on the results. This shared task aimed to assess whether recent end-to-end NLG systems can generate more complex output by learning from datasets containing higher lexical richness, syntactic complexity and diverse discourse phenomena. We compare 62 systems submitted by 17 institutions, covering a wide range of approaches, including machine learning architectures -- with the majority implementing sequence-to-sequence models (seq2seq) -- as well as systems based on grammatical rules and templates. Seq2seq-based systems have demonstrated a great potential for NLG in the challenge. We find that seq2seq systems generally score high in terms of word-overlap metrics and human evaluations of naturalness -- with the winning SLUG system (Juraska et al. 2018) being seq2seq-based. However, vanilla seq2seq models often fail to correctly express a given meaning representation if they lack a strong semantic control mechanism applied during decoding. Moreover, seq2seq models can be outperformed by hand-engineered systems in terms of overall quality, as well as complexity, length and diversity of outputs.




1 Introduction

This paper provides a detailed report and analysis of the first shared task on End-to-End (E2E) Natural Language Generation (NLG). Shared challenges have become an established way of pushing research boundaries in the field of Natural Language Processing, with NLG benchmarking tasks running since 2007 (Belz and Gatt, 2007). These previous shared tasks have demonstrated that large-scale, comparative evaluations are vital for identifying future research challenges in NLG (Belz and Hastie, 2014).

The E2E NLG shared task is novel in that it poses new challenges for recent end-to-end, data-driven NLG systems. This type of system promises rapid development of NLG components in new domains by reducing annotation effort: such systems jointly learn sentence planning and surface realisation from non-aligned data, e.g. Dušek and Jurčíček (2015); Wen et al. (2015b); Mei, Bansal, and Walter (2016); Wen et al. (2016); Sharma et al. (2016a); Dušek and Jurčíček (2016a); Lampouras and Vlachos (2016). As such, these approaches do not require costly semantic alignment between meaning representations (MRs) and the corresponding natural language reference texts (also referred to as “ground truths” or “targets”), but they are trained on parallel datasets, which can be collected in sufficient quality and quantity using effective crowdsourcing techniques, e.g. (Novikova, Lemon, and Rieser, 2016).

So far, end-to-end approaches to NLG have been limited to small, delexicalised datasets, e.g. BAGEL (Mairesse et al., 2010), SF Hotels/Restaurants (Wen et al., 2015b), or RoboCup (Chen and Mooney, 2008). Therefore, end-to-end methods have not been able to replicate the rich dialogue and discourse phenomena targeted by previous rule-based and statistical approaches for language generation in dialogue, e.g. (Walker et al., 2004; Stent, Prasad, and Walker, 2004; Mairesse and Walker, 2007; Rieser and Lemon, 2009). In this paper, we describe a large-scale shared task based on a new crowdsourced dataset of 50k instances in the restaurant domain (see Section 3). We show that the dataset poses new challenges, such as open vocabulary, complex syntactic structures and diverse discourse phenomena, as described in Section 4. Our shared task aims to assess whether the novel end-to-end NLG systems are able to produce more complex outputs given a larger and richer training dataset.

We received 62 system submissions by 17 institutions from 11 countries for the E2E NLG Challenge, with about 1/3 of these submissions coming from industry, as summarised in Section 5. We consider this level of participation an unexpected success, which underlines the timeliness of this task [1] and allows us to reach general conclusions and issue recommendations on the suitability of different methods. We analyse how the submitted systems address the challenges posed by the dataset in Section 6, and we evaluate the submitted systems by comparing them to a challenging baseline using automatic evaluation metrics (including novel text-based measures) as well as human evaluation (see Section 7). Note that, while there are previous studies comparing a limited number of end-to-end NLG approaches (Novikova et al., 2017; Wiseman, Shieber, and Rush, 2017; Gardent et al., 2017a), this is the first research to evaluate novel end-to-end generation at scale using human assessment.

[1] Note that, in comparison, the well-established Conference on Machine Translation WMT’17 (running since 2006) got 31 institutions submitting to a total of 8 tasks (Bojar et al., 2017).

Our results in Section 8 show a discrepancy between data-driven seq2seq models and template- and rule-based systems. While seq2seq models generally score high on word-overlap similarity measures and human rankings of naturalness, manually engineered systems score better than some seq2seq systems in terms of overall quality, as well as diversity and complexity of generated outputs. We then conclude by laying out challenges for future shared tasks in this area. We also release a new dataset of 36k system outputs paired with user ratings, which will enable novel research on automatic quality estimation for NLG (Specia, Raj, and Turchi, 2010; Dušek, Novikova, and Rieser, 2017; Ueffing, Camargo de Souza, and Leusch, 2018; Kann, Rothe, and Filippova, 2018; Tian, Douratsos, and Groves, 2018). All data and scripts associated with the challenge, as well as technical descriptions of participating systems, are available at the following URL:


This journal article summarises our previous work (Novikova, Lemon, and Rieser, 2016; Novikova and Rieser, 2016; Novikova et al., 2017; Novikova, Dušek, and Rieser, 2017; Dušek, Novikova, and Rieser, 2018) and extends it with a corrected and substantially extended evaluation of the training dataset, an exhaustive analysis of results including novel metrics, and a more detailed description of all participating systems with example outputs. This allows us to reach more in-depth insights into the strengths and weaknesses of end-to-end generation systems. We furthermore provide a more comprehensive literature review and discuss directions for future work with respect to end-to-end generation, as well as NLG evaluation in general. Finally, this paper accompanies a release of all the participating systems’ outputs on the test set along with the human ratings collected in the evaluation campaign.

2 Domain and Task

Attribute | Data Type | Example value
name | verbatim string | The Eagle, …
eatType | dictionary | restaurant, pub, …
familyFriendly | boolean | Yes / No
priceRange | dictionary | cheap, expensive, …
food | dictionary | French, Italian, …
near | verbatim string | market square, Cafe Adriatic, …
area | dictionary | riverside, city center, …
customerRating | enumerable | 1 of 5 (low), 4 of 5 (high), …
Table 1: Domain ontology of the E2E dataset.

MR: name[The Wrestlers], priceRange[cheap], customerRating[1 of 5]
Reference: The Wrestlers offers competitive prices, but isn’t rated highly by customers.
Figure 1: Example pair of an MR and a corresponding human-written reference text.

In general, the task of NLG is to convert an input MR into a natural language utterance consisting of one or more sentences. In this paper, we focus on the case where an end-to-end data-driven generator is trained from simple pairs of MRs and reference texts, without fine-grained alignments between elements of the MR and words or phrases in the reference texts, as in, e.g. Dušek and Jurčíček (2015); Wen et al. (2015b). An example pair of an MR and a reference text is shown in Figure 1. We focus on restaurant recommendations in our experiments, which have previously been widely explored in dialogue systems research, e.g. Young et al. (2010); Henderson, Thomson, and Williams (2014); Wen et al. (2017). However, our E2E dataset is substantially bigger and more complex than previous NLG training datasets for this domain (Mairesse et al., 2010; Wen et al., 2015b) (see Section 4), which allows us to assess whether NLG systems are able to learn to produce more varied and complex utterances given enough training examples (cf. Section 8).

For the input representation, we use a format commonly found in task-oriented domain-specific spoken dialogue systems – unordered sets of attributes (slots) and their values, e.g. Mairesse et al. (2010); Young et al. (2010); Liu and Lane (2016). [2] The list of possible attributes used in the MRs in our dataset with example values is shown in Table 1.

[2] Most dialogue systems also include a general intent of the utterance, such as inform, confirm, or request (Young et al., 2010; Wen et al., 2015b; Liu and Lane, 2016). Since our task is focussed on recommendations, this intent would be recommend/inform for all our data, and we can therefore disregard it.
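To make the input format concrete, the following minimal sketch parses a textual MR of the form shown in Figure 1 into an attribute-value mapping. The function name `parse_mr` and the regular expression are illustrative assumptions, not part of the challenge tooling.

```python
import re

# Hypothetical helper (not from the challenge scripts): one slot looks like
# "attribute[value]", and slots are separated by commas.
SLOT_PATTERN = re.compile(r"(\w+)\[([^\]]*)\]")


def parse_mr(mr_text):
    """Parse a textual MR into a dict mapping attribute names to values."""
    return {m.group(1): m.group(2) for m in SLOT_PATTERN.finditer(mr_text)}


mr = parse_mr("name[The Wrestlers], priceRange[cheap], customerRating[1 of 5]")
print(mr["priceRange"])  # cheap
```

Since the MRs are unordered sets of slots, a plain dictionary is sufficient here; the order of slots in the string carries no meaning.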

3 Data Collection Procedure

To maximise the chances of data-driven end-to-end systems producing high-quality output, we aimed to provide training data in sufficient quality and quantity. We turned to crowdsourcing to collect training data in large enough quantities, and we used the CrowdFlower platform [3] to recruit workers. Previously, crowdsourcing has mainly been used for evaluation in the NLG community, e.g. Rieser, Lemon, and Keizer (2014); Dethlefs et al. (2012). However, recent efforts in corpus creation via crowdsourcing have proven to be successful in related tasks. For example, Zaidan and Callison-Burch (2011) showed that crowdsourcing can result in datasets of comparable quality to those created by professional translators given appropriate quality control methods. Mairesse et al. (2010) demonstrate that crowd workers can produce aligned natural language descriptions from abstract MRs for NLG, a method which has also shown success in related NLP tasks, such as spoken dialogue systems (Wang et al., 2012) or semantic parsing (Wang, Berant, and Liang, 2015). More recently, data-driven NLG systems, such as Wen et al. (2015a) and Dušek and Jurčíček (2016), have relied on crowdsourcing for collecting training data.

[3] The CrowdFlower platform was renamed to FigureEight after our study was completed. See https://www.figure-eight.com/.

When crowdsourcing corpora for training NLG systems, i.e. eliciting natural language paraphrases for given MRs from workers, the following main challenges arise:

  1. How to ensure the required quality of the collected data?

  2. What types of meaning representations can elicit spontaneous, natural and varied data from crowd workers?

In an attempt to address both challenges before collecting the main training dataset for the E2E NLG challenge, we ran a small-scale pre-study published in Novikova, Lemon, and Rieser (2016). We briefly summarise the results of this study in this section and apply the successful techniques to the whole data set.

For the pre-study, we prepared a subset of 75 distinct MRs, consisting of three, five or eight attributes from our domain (see Table 1) and their corresponding values in order to evaluate MRs with different complexities. [4] We then implemented several automatic validation procedures for filtering the crowdsourced data in order to address (1), see Section 3.1. To address (2), we explored the trade-off between semantic expressiveness of the MR and the quality of crowdsourced utterances elicited for the different semantic representations. In particular, we investigated translating MRs into pictorial representations as used in, e.g. Williams and Young (2007); Black et al. (2011) for evaluating spoken dialogue systems (see Section 3.2). In the remainder of this section, we first describe the detailed setup used to crowdsource our data (Section 3.3) and then evaluate the pre-study by comparing pictorial MRs to text-based MRs used by previous crowdsourcing work (Mairesse et al., 2010; Wang et al., 2012) in Section 3.4.

[4] The attributes were selected at random, but we excluded MRs that do not contain the attribute name as these would not be appropriate for a venue recommendation.

3.1 Automatic Validation Measures

We used two simple methods to check the quality of crowd workers on CrowdFlower: First, we only select workers that are likely to be native speakers of English, following Sprouse (2011) and Callison-Burch and Dredze (2010). We use IP addresses to ensure that workers are located in one of three English-speaking countries – Canada, the United Kingdom, or the United States. In addition, we included a requirement that “Participants must be native speakers of British or American English" both in the caption of the task listed on CrowdFlower and in the task instructions. Second, we check whether workers spend at least 20 seconds to complete a page of work. This is a standard CrowdFlower option to control the quality of contributions, and it ensures that the contributor is removed from the job if they complete the task too fast.

We also check the quality of the natural language texts produced by crowd workers for a given MR. In particular, we use three JavaScript validators to ensure that the submitted utterances are well-formed English sentences:

  1. We check if the ready-to-submit utterance only contains legal characters, i.e. letters, numbers and symbols “, ’ . : ; £”.

  2. We check whether the submitted text is not shorter than the required minimal length, which is an approximation of the total number of characters used for all attribute values in a given MR, as calculated by Eq. 1:

    $$\text{length}_{\min} = \#\,\text{MR characters} - \#\,\text{MR attributes} \times 10 \qquad (1)$$

    Here, # MR characters is the total number of characters in the given MR; # MR attributes is the number of attributes in the given MR; and 10 is an average length of an attribute name plus two associated square brackets.

  3. We check that workers do not submit the same utterance several times.

We ensured by manually checking a small number of initial trial tasks that these automatic validation methods were able to correctly identify and reject 100% of bad submissions.
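The three utterance checks can be sketched as follows. The original validators were written in JavaScript; this is a Python approximation under stated assumptions, not the code used in the study. In particular, the per-attribute constant of 10 characters, the treatment of whitespace and of straight versus curly apostrophes as legal, the restriction to ASCII letters, and the case-insensitive duplicate check are all assumptions made for illustration.

```python
import re

# Assumptions: whitespace is legal, "letters" means ASCII letters, and both
# straight and curly apostrophes are accepted alongside , . : ; £
LEGAL_CHARS = re.compile(r"[A-Za-z0-9\s,'’.:;£]*")
# Assumed constant: average attribute-name length plus two square brackets.
AVG_ATTR_OVERHEAD = 10


def min_length(mr_text):
    """Approximate the number of characters taken up by attribute values."""
    n_attributes = mr_text.count("[")
    return len(mr_text) - n_attributes * AVG_ATTR_OVERHEAD


def validate(utterance, mr_text, seen):
    """Return the list of failed checks (empty if the utterance passes).

    `seen` is a set of previously submitted utterances by the same worker;
    it is updated in place.
    """
    errors = []
    if not LEGAL_CHARS.fullmatch(utterance):
        errors.append("illegal characters")
    if len(utterance) < min_length(mr_text):
        errors.append("too short")
    normalised = utterance.strip().lower()
    if normalised in seen:
        errors.append("duplicate")
    seen.add(normalised)
    return errors
```

Returning a list of failed checks rather than a single boolean makes it possible to tell the worker which requirement was not met.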

3.2 Meaning Representations: Pictures and Text

In previous crowdsourcing tasks involving MRs, these were typically presented to workers in a textual form of dialogue acts Young et al. (2010), such as the following:

inform(type=hotel, pricerange=expensive)

However, there is a limit in the semantic complexity that crowd workers can handle when using this type of textual/logical descriptions of dialogue acts Mairesse et al. (2010). Also, Wang et al. (2012) observed that the chosen semantic formalism influences the workers’ language, i.e. crowd workers are primed by the words/tokens and ordering used in the MR. Therefore, in contrast to previous work Mairesse et al. (2010); Wen et al. (2015a); Dušek and Jurčíček (2016), we explore the usage of different modalities of meaning representation:

  • Textual/logical MRs appear as a list of comma-separated attribute-value pairs, where attribute values are shown in square brackets after each attribute (see Figures 1 and 2). The order of attributes is randomised so that crowd workers are not primed by the ordering used in the MRs Wang et al. (2012).

  • Pictorial MRs are semi-automatically generated pictures with a combination of icons corresponding to the individual attributes (see Figure 2). The icons are located on a background showing a map of a city, which makes it possible to represent the meaning of the attributes area and near.

1. name[Loch Fyne], eatType[restaurant], familyFriendly[yes], priceRange[cheap], food[Japanese]
2. name[The Wrestlers], familyFriendly[No], area[riverside], food[Italian], customerRating[5 of 5], priceRange[expensive], near[Cafe Adriatic], eatType[restaurant]
Figure 2: Examples of pictorial MRs (left: logical/textual MR, right: corresponding pictorial MR).

3.3 Data Collection Setup

We set up the data collection tasks on the CrowdFlower platform, using the automatic checks described in Section 3.1 and using both pictorial and textual MRs as input (see Section 3.2). For this pre-study, we collected 1133 distinct utterances from the 75 distinct MRs we prepared. 744 utterances were elicited using the textual MRs, and 498 utterances were elicited using the pictorial MRs. The data collected in the pre-study are freely available for download. [5] We later used the same CrowdFlower setup to collect the whole E2E NLG dataset (see Section 4).

[5] See https://github.com/jeknov/INLG_16_submission. The data is not part of the final E2E NLG dataset.

In terms of financial compensation, crowd workers were paid the standard pay on CrowdFlower, which is $0.02 per page (where each page contained 1 MR). Workers were expected to spend about 20 seconds per page. Participants were allowed to complete up to 20 pages, i.e. create utterances for up to 20 MRs. Mason and Watts (2010) found in their study of financial incentives on Mechanical Turk (counter-intuitively) that increasing the amount of compensation for a particular task does not tend to improve the quality of the results. Furthermore, Callison-Burch and Dredze (2010) observed that there can be an inverse relationship between the amount of payment and the quality of work, because it may be more tempting for crowd workers to cheat on high-paying tasks if they do not have the skills to complete them. Following these findings, we did not increase the payment for our task over the standard level.

3.4 Results and Discussion

We analysed the collected natural language reference texts, focussing on textual versus pictorial MRs and their effects on objective measures, such as time taken to collect the data and length of an utterance, and human evaluations of the reference texts collected under the different conditions. Results in full detail can be found in Novikova, Lemon, and Rieser (2016); here we only summarise the main findings. The data analysis showed that:

  • There is no significant difference in the time taken to collect data with pictorial vs. textual MRs.

  • The average length of a collected reference text, both in terms of number of characters and number of sentences, depends mainly on the number of attributes associated with the MR, rather than on whether pictures or text were used.

  • Compared to textual MRs, pictorial MRs elicit texts that are significantly less similar to the underlying MR in terms of semantic text similarity Han et al. (2013). We assume that this is because pictorial MRs are less likely to prime the crowd workers in terms of their lexical choices.

  • The human evaluation revealed that reference texts produced from pictorial MRs are rated as significantly more informative than those produced from textual MRs. Equally, utterances produced from pictorial MRs were considered to be significantly more natural and better phrased than utterances collected with textual MRs. [6]

    [6] Please see Novikova, Lemon, and Rieser (2016) for a definition of informativeness, naturalness and phrasing.

This shows that pictorial MRs have specific benefits for elicitation of NLG data from crowd workers. This may be because the lack of priming by lexical tokens in the MRs leads the crowd workers to produce more spontaneous and natural language, with more variability. As a concrete example of this phenomenon from the collected data, consider the first MR in Figure 2. The textual version of this MR elicited utterances such as “Loch Fyne is a family friendly restaurant serving cheap Japanese food.” whereas the pictorial MR elicited e.g. “Serving low cost Japanese style cuisine, Loch Fyne caters for everyone, including families with small children.”

Pictorial stimuli have also been used in other, related NLP tasks, such as crowdsourced evaluations of dialogue systems, e.g. Williams and Young (2007); Black et al. (2011). Williams and Young (2007), for example, used pictures to set dialogue goals for users (e.g. to find an expensive Italian restaurant in the town centre). However, no analysis was performed regarding the suitability of such representations. This experiment therefore has a bearing on the general issue of human natural language responses to pictorial task stimuli, and shows for example that pictorial task presentations can elicit more natural variability in user inputs to a dialogue system.

Of course, there is a limit in the meaning complexity that pictures can express. We observed that pictorial MRs tend to introduce more noise. In particular, crowd workers tend to omit information, such as eatType = restaurant, which is particularly hard to visualise. Finally, producing pictorial MRs is a semi-automatic process, which is expensive to run at large scale.

Based on these findings, we decided to use pictorial MRs to collect 20% of the full dataset and textual MRs for the rest of the data in order to keep noise and production costs low while increasing diversity. To further increase the data quality and diversity, we collected multiple references per MR to help NLG systems deal with potential noise in the data.

4 The E2E NLG dataset

Using the procedure described in Section 3, we crowdsourced a large dataset of 50k instances in the restaurant domain (Novikova, Dušek, and Rieser, 2017). Our dataset is substantially bigger than previous NLG datasets for dialogue in the restaurant domain, i.e. BAGEL (Mairesse et al., 2010) and SF Restaurants (SFRest) (Wen et al., 2015b), which typically only allowed delexicalised data-driven end-to-end approaches (see Section 4.1). In addition, we demonstrate that our data is also more challenging given its lexical richness, syntactic complexity and diverse discourse phenomena. Following an approach suggested by Perez-Beltrachini and Gardent (2017), we describe these different dimensions of our dataset and compare them to the BAGEL and SFRest datasets in Sections 4.2 and 4.3. [7]

[7] The particular versions of the BAGEL and SFRest datasets used for this research are available from http://farm2.user.srcf.net/research/bagel/ and https://www.repository.cam.ac.uk/handle/1810/251304, respectively.

To ensure a fair comparison, we analyse both fully lexicalised and delexicalised versions of all datasets. The lexicalised references in all datasets contained full natural language texts including all restaurant names. This is the default form for the E2E set; small postprocessing steps were taken for the other two sets to achieve a compatible format. [8] To obtain the delexicalised versions, we replaced most slot values from open sets that appear verbatim in the data (restaurant names, area names, addresses, and numbers) with placeholders (e.g. “X-slot”). [9] The delexicalised version of BAGEL is equivalent to how the dataset is distributed by default. SFRest would allow even more delexicalisation in practice – food types and price ranges also appear verbatim in the references. We decided to keep these values lexicalised since they are not from open sets and the two other datasets do not allow for easy delexicalisation in this case.

[8] The BAGEL texts are partially delexicalised by default, so we lexicalised them. SFRest texts were detokenised and adverb/plural markers were postprocessed, e.g. “restaurant -s” changed to “restaurants”.

[9] This included slot values for name and near in the E2E dataset, name, near, phone, address, postcode, count and area in the SFRest dataset, and name, near, addr, phone, postcode and area in the BAGEL set. For BAGEL, the values citycentre and riverside were excluded from delexicalisation as they do not always appear verbatim in the data.
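As an illustration of the delexicalisation step for the E2E data, the sketch below replaces verbatim occurrences of the name and near values with placeholders. The function name and the exact placeholder strings (`X-name`, `X-near`) are assumptions modelled on the “X-slot” example; the actual preprocessing scripts may differ.

```python
# Slots delexicalised for the E2E data according to the text.
DELEX_SLOTS = ("name", "near")


def delexicalise(mr, reference, slots=DELEX_SLOTS):
    """Replace verbatim open-set slot values with placeholders.

    `mr` is a dict of attribute -> value; returns (delexicalised MR, text).
    The "X-<slot>" placeholder scheme is an assumption for illustration.
    """
    delex_mr = dict(mr)
    present = [slot for slot in slots if mr.get(slot)]
    # Replace longer values first, so that a value contained in another one
    # (e.g. "Cafe" inside "Cafe Adriatic") does not break the longer match.
    for slot in sorted(present, key=lambda s: len(mr[s]), reverse=True):
        placeholder = f"X-{slot}"
        reference = reference.replace(mr[slot], placeholder)
        delex_mr[slot] = placeholder
    return delex_mr, reference


mr = {"name": "The Wrestlers", "near": "Cafe Adriatic", "food": "Italian"}
_, text = delexicalise(mr, "The Wrestlers serves Italian food near Cafe Adriatic.")
print(text)  # X-name serves Italian food near X-near.
```

Values that do not appear verbatim in the reference are simply left untouched in the text, which mirrors the reason given for excluding some BAGEL values from delexicalisation.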

Since the E2E and BAGEL datasets contain only restaurant recommendations, i.e. cases where the system is providing information (inform dialogue acts), whereas SFRest also includes system questions, confirmations, and greetings, we also created a subset of SFRest dubbed SFRest-inf with only inform instances for a fairer comparison.

We processed the datasets using the MorphoDiTa part-of-speech tagger Straková, Straka, and Hajič (2014) to identify tokens, words (as opposed to punctuation tokens) and sentence boundaries. We used the same tagger to preprocess our data for lexical and syntactic complexity analysis.

Statistic | E2E | SFRest | SFRest-inf | BAGEL
Total instances | 51,426 | 5,192 | - | -
Total MRs | 6,039 | 1,914 | - | -
Unique delexicalised MRs | 5,963 | 733 | - | -
Total tokens in all references | 1,166,000 | 49,081 | - | -
Total words in all references | 1,051,093 | 44,338 | - | -
Total delex. words in all references | 957,205 | 37,758 | - | -
Slots per MR | - | 2.63 | - | -
References per MR | - | 1.91 | - | -
References per MR (min-max) | 1-46 | 1-101 | 1-33 | 1-2
Tokens per reference | - | 9.45 | - | -
Words per reference | - | 8.54 | - | -
Delexicalised words per reference | - | 7.27 | - | -
Sentences per reference | - | 1.05 | - | -
Sentences per reference (min-max) | 1-6 | 1-4 | 1-4 | 1-2
Tokens per sentence | - | 8.97 | - | -
Words per sentence | - | 8.11 | - | -
Delexicalised words per sentence | - | 6.90 | - | -
Table 2: Overall size statistics for NLG datasets in the restaurant information domain. All statistics for length of MRs and human references are averages (see Section 4.1 for details). Minimum and maximum numbers of references per MR and sentences per reference are shown in separate (min-max) rows.

4.1 Size

Table 2 summarises the main size statistics of all three datasets, plus the inform-only portion of SFRest. The E2E dataset is significantly larger than the other sets in terms of the total number of different MRs, the total number of data instances (i.e. MR-reference pairs), and especially in terms of the total amount of text in the human references, which is more than 20 times bigger than the next-biggest SFRest. These differences are even more profound if we consider delexicalisation: almost all MRs in the E2E set are distinct even after delexicalisation, while the number of unique MRs is reduced significantly (by more than half) for the other sets. Delexicalisation also seems to have a less significant effect on the reference texts in the E2E set than in the other datasets (cf. the number of delexicalised words vs. the total number of words). The high number of instances directly translates to the higher average number of human references per MR, which is 8.27 for the E2E dataset as opposed to less than two for the other sets. [10]

[10] Note that the Refs/MR ratio for the SFRest dataset is skewed: the goodbye() MR has up to 101 references, but the average is less than 2 references per MR. This is apparent in the SFRest-inf section, which has a much lower maximum number of references.

While having more data with a higher number of references per MR makes the E2E data more attractive for statistical approaches and enables learning more robust models, it is also more challenging than previous sets as it contains a larger number of sentences in the human reference texts (up to 6 in our dataset, with an average of 1.54, compared to typically 1–2 for the other sets, which average below 1.1). The sentences themselves are also longer than in the other datasets. This is immediately apparent for SFRest or SFRest-inf, which are up to 40% shorter in terms of words and tokens. BAGEL’s sentences are slightly longer than E2E’s on average, but this situation is reversed when the sets are delexicalised. In addition, the input MRs in the E2E dataset are more complex than in the other sets: the average number of slot-value pairs in our set is twice that of SFRest (even if only the more complex inform dialogue acts are considered), and slightly higher than BAGEL.

E2E data part | MRs | References
training set | 4,862 | 42,061
development set | 547 | 4,672
test set | 630 | 4,693
full dataset | 6,039 | 51,426
Table 3: Total number of MRs and human references in the E2E dataset sections.

The dataset is split into training, validation and test sets (in an 82-9-9 ratio, see Table 3), keeping a similar distribution of MR and reference text lengths. We ensure that MRs in our test set are all previously unseen, i.e. none of them overlaps with training/development sets, even when restaurant names are removed, unlike the SFRest data (cf. Lampouras and Vlachos, 2016).
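One way to obtain a split in which no MR is shared between sections, even after removing restaurant names, is to group instances by their name-free MR and assign whole groups to one section. The sketch below illustrates this idea only; it is not the procedure used to create the official split. It applies the ratios to MR groups rather than to instances and does not balance MR or reference lengths. The names `mr_key` and `split_by_mr` are hypothetical.

```python
import random
from collections import defaultdict


def mr_key(mr):
    """Hashable view of an MR (dict) with the restaurant name removed."""
    return frozenset((k, v) for k, v in mr.items() if k != "name")


def split_by_mr(instances, ratios=(0.82, 0.09, 0.09), seed=0):
    """Split (mr, reference) pairs into train/dev/test by name-free MR.

    All instances sharing a name-free MR end up in the same section, so
    no section contains an MR that was seen in another one.
    """
    groups = defaultdict(list)
    for mr, reference in instances:
        groups[mr_key(mr)].append((mr, reference))
    keys = list(groups)  # insertion order, hence deterministic
    random.Random(seed).shuffle(keys)
    n_train = round(ratios[0] * len(keys))
    n_dev = round(ratios[1] * len(keys))
    parts = (keys[:n_train],
             keys[n_train:n_train + n_dev],
             keys[n_train + n_dev:])
    return [[inst for key in part for inst in groups[key]] for part in parts]
```

Splitting on groups rather than on individual instances is what guarantees the "unseen MR" property; a plain random split over instances would leak MRs with multiple references into several sections.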

4.2 Lexical Richness

Lexicalised sets | E2E | SFRest | SFRest-inf | BAGEL
Distinct tokens | - | 1,249 | - | -
Distinct tokens occurring once | - | 230 (18%) | - | -
Distinct lemmas | - | 1,186 | - | -
Distinct bigrams | - | 5,729 | - | -
Distinct bigrams occurring once | - | 2,582 (45%) | - | -
Distinct trigrams | - | 11,290 | - | -
Distinct trigrams occurring once | - | 6,832 (61%) | - | -
Lexical sophistication (LS2) | - | 0.428 | - | -
Type-token ratio (TTR) | - | 0.027 | - | -
Mean segmental TTR (MSTTR-50) | - | 0.648 | - | -
Unigram entropy | - | - | - | -
Bigram entropy | - | - | - | -
Trigram entropy | - | 11.830 | - | -
Bigram next-word conditional entropy | - | 2.714 | - | -
Trigram next-word conditional entropy | - | 1.463 | - | -

Delexicalised sets | E2E | SFRest | SFRest-inf | BAGEL
Distinct tokens | - | 504 | - | -
Distinct tokens occurring once | - | 116 (23%) | - | -
Distinct lemmas | - | 437 | - | -
Distinct bigrams | - | 3,099 | - | -
Distinct bigrams occurring once | - | 1,376 (44%) | - | -
Distinct trigrams | - | 6,383 | - | -
Distinct trigrams occurring once | - | 3,628 (57%) | - | -
Lexical sophistication (LS2) | - | 0.323 | - | -
Type-token ratio (TTR) | - | 0.012 | - | -
Mean segmental TTR (MSTTR-50) | - | 0.602 | - | -
Unigram entropy | - | 6.305 | - | -
Bigram entropy | - | 9.083 | - | -
Trigram entropy | - | 10.546 | - | -
Bigram next-word conditional entropy | - | 2.594 | - | -
Trigram next-word conditional entropy | - | 1.414 | - | -
Table 4: Lexical complexity and diversity statistics for NLG datasets in the restaurant information domain. Counts for n-grams appearing only once are shown as absolute numbers and proportions of the total number of respective n-grams.

In order to measure various dimensions of lexical richness in the datasets under comparison, we computed statistics on token/unigram, bigram and trigram counts, and we applied the Lexical Complexity Analyser Lu (2012), as shown in Table 4. It is clear that our dataset has a much larger vocabulary – 2x larger than the second largest SFRest, but more than 5x larger if delexicalised versions of the datasets are considered. This directly translates into the number of distinct lemmas and distinct n-grams; the E2E set has almost 10x more distinct trigrams than SFRest, over 13x more in the delexicalised versions. While the proportion of n-grams only appearing once in the set is slightly lower than in the other datasets, it stays relatively high given the dataset size and narrow domain, and poses a challenging task for end-to-end data-driven approaches.

The traditional measure of lexical diversity, the type-token ratio (TTR), is not a good fit in our case when datasets of different sizes in a narrow domain are compared because the values are inversely proportional to the dataset size. Therefore, we complement TTR with the more robust measure of mean segmental TTR (MSTTR) (Lu, 2012), which divides the corpus into successive segments of a given length (50 tokens) and then calculates the average TTR of all segments. The higher the value of MSTTR, the more diverse is the measured text. Table 4 shows that our dataset has a higher MSTTR value (0.71) than the other sets (0.65). The difference is even more profound if we consider delexicalised versions of the sets and inform-only MRs in the SFRest data – 0.66 vs. 0.55 for SFRest-inf and 0.48 for BAGEL.
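The difference between TTR and MSTTR can be illustrated with a short sketch. This is not the Lexical Complexity Analyser itself; in particular, dropping a trailing segment shorter than the segment length is an assumption made here for simplicity.

```python
def ttr(tokens):
    """Type-token ratio: number of distinct tokens over total tokens."""
    return len(set(tokens)) / len(tokens)


def msttr(tokens, segment_length=50):
    """Mean segmental TTR over successive segments of `segment_length`.

    Assumption: a trailing segment shorter than `segment_length` is dropped.
    """
    starts = range(0, len(tokens) - segment_length + 1, segment_length)
    segments = [tokens[i:i + segment_length] for i in starts]
    if not segments:
        raise ValueError("text is shorter than one segment")
    return sum(ttr(segment) for segment in segments) / len(segments)


small = ["a", "b"] * 50   # 100 tokens, 2 types
large = small * 10        # 1,000 tokens, same 2 types
print(ttr(small) > ttr(large))        # True: TTR shrinks as the corpus grows
print(msttr(small) == msttr(large))   # True: MSTTR is unaffected by size
```

The toy corpora show why MSTTR is preferred for comparing datasets of different sizes: repeating the same text lowers TTR but leaves the per-segment average unchanged.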

In addition, we measure lexical sophistication (LS2) (Lu, 2012), also known as lexical rareness, which is calculated as the proportion of lexical word types not on the list of 2,000 most frequent words generated from the British National Corpus. Table 4 shows that while the E2E dataset is more sophisticated than SFRest, it is slightly less so compared to BAGEL. However, LS2 numbers on the delexicalised sets show that this is mainly caused by lexical slot values – the delexicalised E2E dataset is almost twice as sophisticated as both SFRest and BAGEL.

Following Oraby, Reed, and Tandon (2018) and Jagfeld, Jenne, and Vu (2018), we also use Shannon entropy (Manning and Schütze, 2000, p. 61ff.) as a measure of lexical diversity in the texts:

$$H(\text{text}) = -\sum_{x \in \text{text}} \frac{\text{freq}(x)}{\text{len}(\text{text})} \log_2 \frac{\text{freq}(x)}{\text{len}(\text{text})}$$

Here, $x$ stands for all unique tokens/n-grams, freq stands for the number of occurrences in the text, and len for the total number of tokens/n-grams in the text. We computed entropy over tokens (unigrams), bigrams and trigrams, as shown in Table 4. We can see that the E2E dataset has slightly lower unigram and bigram entropy than SFRest and higher trigram entropy than any other set. However, when delexicalised, the E2E set shows the highest entropy for any n-gram value. Considering that entropy is a logarithmic measure, the difference is substantial for trigrams – 12.1 vs. the closest 10.5 for SFRest, which amounts to about 2.98 times higher uncertainty.
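As a minimal sketch of the entropy computation described above, the following code estimates the probability of each unique n-gram as its frequency divided by the total number of n-grams. The helper names are illustrative assumptions and tokenisation is left to the caller.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All n-grams of a token list, as tuples, in order of occurrence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution over items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


tokens = "the vaults is an indian restaurant near the river".split()
for n in (1, 2, 3):
    print(n, shannon_entropy(ngrams(tokens, n)))
```

A text in which every n-gram is unique reaches the maximum entropy of log2 of the number of n-grams; a text repeating a single n-gram has entropy 0.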

We further complement Shannon text entropy with n-gram-language-model-style conditional entropy for next-word prediction (Manning and Schütze, 2000, p. 63ff.), given one previous word (bigram) or two previous words (trigram):

$$H_{\text{cond}}(\text{text}) = -\sum_{(c,w) \in \text{text}} \frac{\text{freq}(c,w)}{\text{len}(\text{text})} \log_2 \frac{\text{freq}(c,w)}{\text{freq}(c)}$$

Here, $(c, w)$ stands for all unique n-grams in the text, composed of $c$ (context, all tokens but the last one) and $w$ (the last token). Conditional next-word entropy gives an additional, novel measure of diversity and repetitiveness: The more diverse a text is, the less predictable is the next word given previous word(s); on the other hand, the more repetitive the text, the more predictable is the next word given previous word(s). The values for all the datasets are again shown in Table 4, and they demonstrate clearly that E2E data is much more diverse than SFRest or BAGEL. Note also that lexicalisation has a much smaller effect on this measure. In the delexicalised version, the difference against the closest SFRest (2.446 vs. 1.414) indicates about 2.04 times more uncertainty on next-word prediction given two previous words.
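The conditional next-word entropy can be sketched in the same style. This is an illustrative implementation under stated assumptions, not the authors' script: the function name is hypothetical, and context counts are taken only over contexts that are followed by a token, i.e. they are derived from the n-grams themselves.

```python
import math
from collections import Counter


def conditional_entropy(tokens, n):
    """Entropy (in bits) of the last token of each n-gram given the n-1 preceding tokens."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    gram_counts = Counter(grams)
    # Assumption: contexts are counted from the n-grams, so each context has a successor.
    context_counts = Counter(gram[:-1] for gram in grams)
    total = len(grams)
    return -sum(
        (count / total) * math.log2(count / context_counts[gram[:-1]])
        for gram, count in gram_counts.items()
    )


print(conditional_entropy("a b a b a b".split(), 2))  # fully predictable next word
print(conditional_entropy("a b a c".split(), 2))      # "a" is followed by "b" or "c"
```

Because the values are in bits, a difference of d bits corresponds to a factor of 2^d in uncertainty; for the delexicalised trigram values quoted above, 2^(2.446 - 1.414) is approximately 2.04.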

Figure 3: D-Level sentence distribution of the datasets under comparison.

4.3 Syntactic Complexity

We used the D-Level Analyser Lu (2009) to evaluate the syntactic complexity of human references in our data using the revised D-Level Scale Covington et al. (2006). We used the syntactic constituency parser of Collins (1997) to preprocess the sentences for the D-Level Analyser. [Footnote 11: We used the Model 2 variant of the parser as instructed by the D-Level Analyser website at http://www.personal.psu.edu/xxl13/downloads/d-level.html.] The D-Level scale has eight levels of syntactic complexity, where levels 0 and 1 include simple or incomplete sentences and higher levels include sentences with more complex structures, e.g. sentences joined by a subordinating conjunction, more than one level of embedding etc. Figure 3 shows the D-Level distribution in all three datasets.

The largest proportion of the datasets is composed of simple sentences (levels 0 and 1), but the proportion of simple texts is much lower for the E2E NLG dataset (46%) compared to others (59-66%). Examples of simple sentences in our dataset include: “The Vaults is an Indian restaurant”, or “The Loch Fyne is a moderate priced family restaurant”.

The majority of our data, however, contains more complex, varied syntactic structures, including phenomena explicitly modelled by early statistical approaches to NLG Stent, Prasad, and Walker (2004); Walker et al. (2004). For example, clauses may be joined by a coordinating conjunction (level 2), e.g. “Cocum is a very expensive restaurant but the quality is great”. There are 14% level-2 sentences in the E2E dataset; BAGEL only has 7% and SFRest 9%, but inform MRs in SFRest contain a similar proportion as our set. Level 3 sentences in our domain are mainly those with object-modifying relative clauses, e.g. “There is a pub called Strada which serves Italian food.” The E2E dataset contains 18% level-3 sentences, similar to BAGEL but more than SFRest’s 12% (13% in inform MRs). The levels 4-5 are not very frequent in any of the datasets. Sentences may contain verbal gerund (-ing) phrases (level 4), either in addition to previously discussed structures or separately, e.g. “The coffee shop Wildwood has fairly priced food, while being in the same vicinity as the Ranch” or “The Vaults is a family-friendly restaurant offering fast food at moderate prices”. Subordinate clauses are marked as level 5, e.g. “If you like Japanese food, try the Vaults”.

The highest levels of syntactic complexity involve sentences containing referring expressions (“The Golden Curry provides Chinese food in the high price range. It is near the Bakers”), non-finite clauses in adjunct position (“Serving cheap English food, as well as having a coffee shop, the Golden Palace has an average customer rating and is located along the riverside”) or sentences with multiple embedded structures from previous levels. As Figure 3 shows, our dataset has a substantially higher proportion of level-6-7 sentences – 15%, compared to 7% for BAGEL and 8% for SFRest (11% in inform MRs).

On average, sentences in the E2E dataset are much more syntactically complex than in the other datasets under comparison: the mean D-Level for E2E data is 2.17, compared to BAGEL’s 1.32 and SFRest’s 1.25 (1.57 for inform-only MRs).

4.4 Attribute Coverage

Fully covered 30 47 50
Missing content 11 0 0
Additional content 9 3 0
Table 5: Coverage of MR attributes in references as measured manually on a random sample of 50 MR-reference pairs for each dataset. The numbers indicate the absolute number of instances falling into the given category, out of 50.

Our crowd workers were asked to verbalise all information from the MR; however, they were not penalised if they skipped an attribute (cf. Section 3.4). This feature makes generating text from our dataset more challenging as the NLG systems need to deal with a certain amount of noise, i.e. attributes not being verbalised in the human reference texts. In order to measure the extent of this phenomenon, we examined a random sample of 50 MR-reference pairs in all three datasets under comparison. An MR-reference pair was considered “fully covered” if all attribute values present in the MR are verbalised in the reference. It was marked as “additional content” if the reference contains information not present in the MR, and as “missing content” if the MR contains information not present in the reference.

The results of our sample probe in Table 5 indicate that roughly 40% of our data contains either additional or omitted information. In order to help NLG systems account for this variation, we collected multiple references per MR (also see Table 2).

This variation often concerns the attribute-value pair eatType=restaurant, which is either omitted (“Loch Fyne provides French food near The Rice Boat. It is located in riverside and has a low customer rating”) or added in case eatType is absent from the MR (“Loch Fyne is a low-rating riverside French restaurant near The Rice Boat”). [Footnote 12: Note that inclusion of this attribute is mainly due to historical reasons, following SFRest and BAGEL.] As discussed in Section 3.4, pictorial MRs might be a possible source of this phenomenon where eatType=restaurant, eatType=pub, etc. is difficult to illustrate.

5 Systems in the Competition

| System | Architecture | Delex. slots | Copy | Semantic control | Data augmentation / diversity |
|---|---|---|---|---|---|
| TGen Novikova, Dušek, and Rieser (2017) | seq2seq (TGen) | name, near | - | MR classification reranking | - |
| Adapt Elder et al. (2018) | seq2seq (OpenNMT-py) | none | yes | none | enriching MR by output words |
| Chen Chen (2018) | seq2seq | none | yes | attention memory | - |
| Gong Gong (2018) | seq2seq (TGen) | name, near | - | MR classification reranking | - |
| Harv Gehrmann et al. (2018) | seq2seq | none | yes | coverage penalty reranking | diverse ensembling |
| NLE Agarwal, Dymetman, and Gaussier (2018) | char seq2seq (tf-seq2seq) | none | - | MR classification reranking | - |
| Sheff2 Chen, Lampouras, and Vlachos (2018) | seq2seq | name, near | - | none | - |
| Slug Juraska et al. (2018) | seq2seq | name, near | - | slot aligner reranking | using sub-MRs and aligned sentences |
| Slug-alt Juraska et al. (2018) (late submission) | seq2seq | name, near | - | slot aligner reranking | using only complex training sentences |
| TNT1 Oraby et al. (2018a) | seq2seq (TGen) | name, near | - | MR classification reranking | using Personage |
| TNT2 Tandon et al. (2018) | seq2seq (TGen) | name, near | - | MR classification reranking | shuffling MRs |
| TR1 Schilder et al. (2018) | seq2seq (tf-seq2seq) | name, near, priceRange, customerRating | - | none | - |
| Zhang Zhang et al. (2018) | sub-word seq2seq | none | - | attention regularisation | - |
| Sheff1 Chen, Lampouras, and Vlachos (2018) | linear classifiers + LOLS | name, near | - | 2-step prediction with slots | using only references with highest average word frequency |
| ZHAW1 Deriu and Cieliebak (2018) | RNN language model | name, near | - | SC-LSTM (semantic gates), MR classification loss + reranking | first word control |
| ZHAW2 Deriu and Cieliebak (2018) | RNN language model | name, near | - | SC-LSTM (semantic gates) | first word control |
| DANGNT Nguyen and Tran (2018) | rule-based | all | - | implied by architecture | - |
| FORGe1 Mille and Dasiopoulou (2018) | grammar | all | - | implied by architecture | - |
| FORGe3 Mille and Dasiopoulou (2018) | templates | all | - | implied by architecture | - |
| TR2 Schilder et al. (2018) | templates | all | - | implied by architecture | - |
| TUDA Puzikov and Gurevych (2018) | templates | all | - | implied by architecture | - |
TUDA Puzikov and Gurevych (2018) templates all implied by architecture
Table 6: A full list of the primary systems participating in the E2E challenge, with their basic architecture and other properties (list of delexicalised slots, presence of a copy mechanism, control of semantic MR coverage on the output, data augmentation and output diversity techniques). System architectures are coded with colours and symbols: seq2seq, other data-driven, rule-based, template-based.

The initial idea of the E2E NLG Challenge was first presented in Novikova and Rieser (2016). Interest and active participation in the E2E Challenge far exceeded our expectations. We received a total of 62 submitted systems by 17 institutions from 11 countries, with about one third of these submissions coming from industry. In accordance with ethical considerations for NLP shared tasks Parra Escartín et al. (2017), we allowed researchers to withdraw or anonymise their results after obtaining automatic evaluation metrics results (cf. Section 7.1). Two groups from industry withdrew their submissions and one group asked to be anonymised after obtaining automatic evaluation results. A full list of all the remaining submissions is given in Table 14 in the Appendix (including their automatic metric scores).

We asked each participating team to identify 1-2 primary systems, which resulted in 20 systems by 14 groups. Each primary system is described in a short technical paper (available on the E2E NLG Challenge website [Footnote 13: http://www.macs.hw.ac.uk/InteractionLab/E2E/]) and was evaluated both by automatic metrics and human judges (see Section 7). We compare the primary systems to a baseline system we provided ourselves (see Section 5.1). A detailed overview of all the primary systems is given in Table 6. In the following, we describe the systems in terms of different architectures; see Sections 5.2 to 5.5.

5.1 Baseline System

| System | BLEU | NIST | METEOR | ROUGE-L | CIDEr |
|---|---|---|---|---|---|
| TGen (development set) | 0.6925 | 8.4781 | 0.4703 | 0.7257 | 2.3987 |
Table 7: TGen performance on the development set (see Section 7.1 for a description of the metrics).

To establish a baseline on the task data, we use TGen Dušek and Jurčíček (2016a). [Footnote 14: TGen is freely available at https://github.com/UFAL-DSG/tgen.] TGen is based on the sequence-to-sequence model with attention (seq2seq) Bahdanau, Cho, and Bengio (2015), an encoder-decoder recurrent neural network (RNN) architecture. In addition to the standard seq2seq model with LSTM cells Hochreiter and Schmidhuber (1997), TGen uses beam search for decoding and an LSTM-based reranker over the top outputs, penalising those outputs that do not verbalise all attributes from the input MR. TGen was previously tested on the BAGEL and SFRest datasets, where it reached state-of-the-art performance (Dušek, 2017, p. 88ff.).

As TGen does not handle unknown vocabulary well, the sparsely occurring string attributes (see Table 1) name and near are delexicalised (see Section 6.1). The main seq2seq model is trained by minimising cross entropy using the Adam algorithm Kingma and Ba (2015) in direct token-by-token generation of surface strings; the reranker is trained to detect the presence of all attributes from the input MR. [Footnote 15: We use a learning rate of 0.0005, cell size 50, batch size 20, beam size 10, maximum encoder and decoder lengths 10 and 80, respectively, and up to 20 passes through training data with early stopping. The reranker uses the same parameters, except for a higher learning rate (0.001). See Novikova, Dušek, and Rieser (2017) for more details.] Based on evaluation on the development part of the E2E dataset using automatic metrics (see Table 7), as well as manual cursory checks, TGen appears to be a strong baseline, capable of generating fluent and relevant outputs in most cases.

5.2 Seq2seq-based systems

Systems based on the popular sequence-to-sequence architecture Sutskever, Vinyals, and Le (2014); Bahdanau, Cho, and Bengio (2015) represent the biggest group of systems participating in the challenge (12 out of 20 primary systems). All the seq2seq-based systems use beam search, and most of them further enhance the basic seq2seq architecture in a number of ways.

Several systems are built on top of previous systems and toolkits. A number of systems are based on the TGen baseline and aim to improve it: TNT1 Oraby et al. (2018a) and TNT2 Tandon et al. (2018) use TGen with two different data augmentation techniques (see Section 6.3). Gong Gong (2018) trains TGen with fine-tuning by the REINFORCE algorithm Williams (1992). Two systems are based on the tf-seq2seq toolkit Britz et al. (2017): NLE Agarwal, Dymetman, and Gaussier (2018) built a character-to-character seq2seq model (using simply characters of the original MR as inputs), while TR1 Schilder et al. (2018) use a regular word-based model. The Adapt system Elder et al. (2018) is based on OpenNMT-py Klein et al. (2017). It uses pointer networks (a form of a copy mechanism Vinyals, Fortunato, and Jaitly (2015)) and a two-step generation where the first step enriches the input MR for diversity (see Section 6.3).

Several other systems use custom seq2seq implementations. Slug and Slug-alt Juraska et al. (2018) use an ensemble of two bidirectional LSTM encoders and one convolutional encoder, all paired with an attention LSTM decoder (incl. self-attention). Harv Gehrmann et al. (2018) use a seq2seq model with multiple additions for MR coverage and diversity (see Sections 6.2 and 6.3). Sheff2’s model Chen, Lampouras, and Vlachos (2018), on the other hand, is a vanilla seq2seq setup with LSTM cells. Chen Chen (2018) presents a seq2seq model with a custom-tailored input data representation: 2-part input embeddings, which divide into slot name and value token embeddings. Zhang Zhang et al. (2018) apply a seq2seq model with CAEncoder Zhang et al. (2017), which adds a second layer over a bidirectional encoder with GRU cells Cho et al. (2014), summarising both directional encoders.

5.3 Other data-driven systems

Two groups submitted fully trainable systems that are not based on the seq2seq architecture. First, ZHAW1 and ZHAW2 Deriu and Cieliebak (2018) use an RNN language model with semantically conditioned LSTM (SC-LSTM) cells Wen et al. (2015b) and a 1-hot encoding of input MR slot values. The two system variants differ in the presence of an additional semantic control mechanism (see Section 6.2).

Sheff1 Chen, Lampouras, and Vlachos (2018) is the only non-neural fully data-driven system submitted to the challenge. It is based on imitation learning using linear classifiers Crammer, Kulesza, and Dredze (2009) in a two-level generation approach, where the classifiers first select the next slot to be realised and then the corresponding word-by-word realisation of that slot Lampouras and Vlachos (2016). The classifiers are trained using the Locally Optimal Learning to Search (LOLS) imitation learning framework Chang et al. (2015), optimising for BLEU, ROUGE-L, and slot error (cf. Section 7.1).

5.4 Rule-based systems

There are two rule-based entries in the E2E challenge: First, the DANGNT system Nguyen and Tran (2018) uses a two-step rule-based setup, where the first step determines the appropriate phrases to use for a delexicalised sentence; the second step selects the appropriate phrases to lexicalise slot values. Second, the FORGe1 system Mille and Dasiopoulou (2018) is a rule-based pipeline using grammars based on the Meaning-Text Theory Mel’čuk (1988). It matches the MR to handcrafted per-slot semantic templates, applies aggregation rules to build sentences, and realises the aggregated sentence structures into surface text.

5.5 Template-based systems

Three entries in the E2E challenge are based on traditional template filling. FORGe3 Mille and Dasiopoulou (2018) and TR2 Schilder et al. (2018) take a very similar approach: They mine templates from data by delexicalising slot values. TUDA Puzikov and Gurevych (2018), on the other hand, uses templates manually designed by the system authors; the templates are not based on the dataset directly, they are only informed by the data.

6 Addressing the Challenges

In this section, we focus on how the competing primary systems address specific challenges posed by the task: vocabulary unseen in training (Section 6.1), control of semantic coverage of the input MR (Section 6.2), and producing diverse outputs (Section 6.3). We also include an overview of alternative approaches to addressing these challenges in Section 6.4.

6.1 Open Vocabulary

All systems in the challenge have a way of addressing the open vocabulary in the data. In closed-domain setups, slot values are usually the only part of the data where open vocabulary is present, as is the case for the name and near slots in our dataset (see Table 1). The common approach to dealing with open vocabulary in NLG systems is to use delexicalisation (Wen et al., 2015b; see also Section 4), i.e. replacing slot values with placeholders during training and generation time (both in input MRs and training sentences). This approach is indeed one of the principles of template-based systems; accordingly, all template-based entries in the E2E Challenge use full delexicalisation of all slot values (except, perhaps, the binary-valued familyFriendly; cf. Table 6). Both rule-based systems also perform full delexicalisation.

The data-driven systems submitted to our challenge mostly opt for partial delexicalisation (see Table 6); the prevailing approach is to delexicalise only the values of the name and near slots, which allows for very simple pre- and postprocessing since these values usually appear verbatim in the outputs. [Footnote 16: Unlike other slot values, e.g., area=riverside might appear as “near the river”. Cf. also our remarks on delexicalisation in Section 4 and Footnote 9.] TR1 is the only data-driven system to use a stronger delexicalisation, which also includes the priceRange and customerRating slots. Slug and Slug-alt are the only systems to treat values with different morpho-syntactic properties differently (e.g., a value requiring “an” instead of “a” as an article).
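The simple pre- and postprocessing for partial delexicalisation can be sketched as follows. This is a minimal illustration assuming that slot values appear verbatim in the text; the placeholder format `X-<slot>` and both function names are hypothetical and not taken from any of the submitted systems.

```python
def delexicalise(mr, text, slots=("name", "near")):
    """Replace verbatim slot values in the MR and the text with placeholders."""
    delex_mr = dict(mr)
    for slot in slots:
        if slot in mr:
            placeholder = f"X-{slot}"  # hypothetical placeholder format
            text = text.replace(mr[slot], placeholder)
            delex_mr[slot] = placeholder
    return delex_mr, text


def relexicalise(text, mr, slots=("name", "near")):
    """Postprocessing: put the original slot values back into a generated text."""
    for slot in slots:
        if slot in mr:
            text = text.replace(f"X-{slot}", mr[slot])
    return text


mr = {"name": "Loch Fyne", "eatType": "restaurant", "near": "The Rice Boat"}
delex_mr, delex_text = delexicalise(mr, "Loch Fyne is a restaurant near The Rice Boat.")
print(delex_text)
print(relexicalise(delex_text, mr))
```

String replacement of this kind only works for values that are realised verbatim, which is why slots such as area, whose values may be paraphrased, are typically left lexicalised in this scheme.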

Five of the seq2seq systems in the challenge opted for using no delexicalisation and employ alternative ways of addressing open vocabulary: Adapt, Chen and Harv use a copy mechanism (cf. Section 5.2), which allows the system to copy some of the tokens from the input instead of generating them anew. Zhang operates over sub-word units instead of words; these are determined by the byte-pair encoding algorithm and can combine to create previously unseen words Sennrich, Haddow, and Birch (2016). NLE’s seq2seq system operates on the character level.

6.2 Semantic Control

Most of the participating systems explicitly attempt to realise all slots and thus cope with the noise in the training data (cf. Section 4.4). Full realisation is implied for template and rule-based systems as the templates and rules always relate to specific slots and are chosen based on the slots in the input MR. On the other hand, vanilla seq2seq systems have no way of controlling whether all input slots have been realised. While attention models Bahdanau, Cho, and Bengio (2015) certainly have an influence on this, they are not explicitly trained to attend exactly once to each slot in a vanilla seq2seq setup. Therefore, most seq2seq systems include an additional tool checking the realised parts of the input MR on the output (cf. Table 6).

The most frequent approach among the E2E submissions is an MR classification reranker Dušek and Jurčíček (2016a). Here, the generator first produces multiple outputs using beam search, then these are tested for the presence of all slots from the input MR, and deviations from the input are penalised. Apart from the TGen baseline (using an RNN MR classifier, see Section 5.1), this approach is also taken by all systems based on TGen (TNT1, TNT2, Gong) as well as NLE, which uses a logistic regression classifier.
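The reranking idea can be sketched in a few lines. This is a simplified illustration, not TGen's implementation: the keyword-based classifier is a toy stand-in for a trained MR classifier, and the penalty weight of 100 per mismatched slot is an assumed value.

```python
def rerank(candidates, mr_slots, classify, penalty=100.0):
    """Sort beam candidates by log-probability minus a penalty per slot mismatch.

    candidates: list of (text, log_probability) pairs from beam search.
    mr_slots:   set of slot names present in the input MR.
    classify:   function mapping a text to the set of slots it expresses.
    """
    def score(candidate):
        text, logprob = candidate
        # Symmetric difference counts both missing and superfluous slots.
        mismatches = len(classify(text) ^ mr_slots)
        return logprob - penalty * mismatches

    return sorted(candidates, key=score, reverse=True)


def keyword_classifier(text):
    """Toy stand-in for a trained MR classifier (assumption for illustration)."""
    keywords = {"food": "italian", "area": "riverside", "priceRange": "cheap"}
    lowered = text.lower()
    return {slot for slot, word in keywords.items() if word in lowered}


beam = [
    ("X-name serves Italian food.", -1.0),
    ("X-name serves cheap Italian food by the riverside.", -3.0),
]
best_text, _ = rerank(beam, {"food", "area", "priceRange"}, keyword_classifier)[0]
print(best_text)
```

In this toy beam, the shorter output has the higher generator probability but omits two slots, so the penalty moves the complete output to the top.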

Slug and Slug-alt apply a very similar approach: they use a heuristic slot aligner (trained on words and phrases from training data and WordNet) to align outputs to the input MR and penalise for any unaligned slots.

Harv do not build a separate classifier or aligner, but use the sum of weights from the attention model (which should not exceed 1 for each token of the input MR) in a penalty term for reranking.

Two seq2seq systems use a direct modification of the attention mechanism instead of reranking at decoding time. Chen includes attention memory (sum of attention distributions so far in the generation process) as an additional input to the attention model. Zhang adds an attention regularisation loss term to the training process, which attempts to keep the sum of weights close to 1 for each input MR token, similarly to Harv’s penalty term. Three systems, Adapt, TR1 and Sheff2, do not use any explicit semantic control mechanism.

The non-seq2seq data-driven systems use specific mechanisms to maintain input MR coverage. ZHAW1 and ZHAW2 are based on SC-LSTM cells Wen et al. (2015b), which include a special gate that keeps track of slots covered so far in the MR. In addition, ZHAW1 uses convolutional MR classifiers to rerank beam search outputs similarly to most seq2seq systems; however, this classification is also used in an additional loss term during training. The Sheff1 system explicitly decides which slot to verbalise next using a separate slot-level classifier, which is optimised to cover the input MR.

6.3 Data Augmentation and Diversity

The design of the E2E dataset attempts to provide higher text diversity (see Section 4), and several challenge participants made use of this. Others modified the training set simply to achieve better output quality.

Several systems aim at higher output quality by using data augmentation. TNT1 enriches input MRs by prepending them with the corresponding outputs of the Personage generator Mairesse and Walker (2007), with the aim to generate more diverse output. TNT2 aims to boost the robustness of the baseline TGen system by re-shuffling slots in the input MRs. Slug uses single sentences from the training data with corresponding aligned parts of the original MR. This increases the amount of training data available and simplifies the task by breaking outputs into smaller (partially) aligned units. Slug-alt, on the other hand, only uses training instances involving complex sentences in an attempt to provide more sophisticated outputs. In contrast, the system of Sheff1 is trained using only one reference text per training MR; the reference text with the highest average word frequency is selected. While this approach is likely to decrease output diversity, the authors use it to stabilise system training. Harv takes yet another approach in order to both stabilise training and increase diversity, called diverse ensembling Guzman-Rivera, Batra, and Kohli (2012). In an expectation-maximisation fashion, they split the training data instances into subsets that exhibit similar structural properties and style in the natural language references, then train different models on these subsets and deploy them as an ensemble.
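Of the augmentation techniques above, slot re-shuffling is the simplest to illustrate. The sketch below only shows the general idea of presenting the same MR with different slot orders; the function name, the number of variants and the seeding are assumptions, and the actual TNT2 procedure may differ.

```python
import random


def shuffled_variants(mr_slots, num_variants, seed=0):
    """Return copies of an MR's slot list in random orders (the input is not modified)."""
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    variants = []
    for _ in range(num_variants):
        order = list(mr_slots)
        rng.shuffle(order)
        variants.append(order)
    return variants


slots = ["name[Loch Fyne]", "eatType[restaurant]", "food[French]", "area[riverside]"]
for variant in shuffled_variants(slots, 3):
    print(", ".join(variant))
```

Each shuffled MR would be paired with the same reference texts as the original, so the generator cannot rely on a fixed slot order in its input.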

Two teams attempt to increase output diversity by directly modifying the generation process. The ZHAW1 and ZHAW2 systems use a first word control mechanism: they generate outputs starting with all (frequent enough) first words from the training set, then select the final output by sampling. ZHAW1 only samples among semantically correct outputs (see Section 6.2). Adapt takes a different approach, adding a preprocessing step before the main generator, which decides upon specific words that should appear on the output. These are then used to enrich the input MR in the main generation step, providing more diversity on the input.

6.4 Systems outside the competition

Solving the challenges outlined above is an ongoing effort addressed by many recent systems. Here we briefly summarise other attempts by systems outside the competition for completeness. Note that many of these approaches are very recent and have been published only after the E2E NLG Challenge ended.

Apart from delexicalisation, which is most often used in the E2E NLG Challenge, various variants of the copy mechanism are the most prominent approach to address open vocabulary in NLG Wiseman, Shieber, and Rush (2017); Lebret, Grangier, and Auli (2016); Bao et al. (2018); Kaffee et al. (2018); Wang et al. (2018). Shimorina and Gardent (2018) combine a copy mechanism with delexicalisation. In contrast, Freitag and Roy (2018) use subwords and recast the NLG model as a denoising autoencoder, with shared input and output embeddings (starting from slot values and “filling in” the rest of the sentence on the output).

Attempts at improving semantic accuracy of the generated texts show a wider variety of approaches. Kiddon, Zettlemoyer, and Choi (2016) use a “checklist model” – the decoder keeps a vector of items used so far during the generation; this is similar to semantic gates of Wen et al. (2015b), which have been used by the ZHAW1 and ZHAW2 systems in our challenge (see Section 6.2). Tran, Nguyen, and Tojo (2017) use a two-level attention model (composed of a standard attention model and a “refiner”, an attention-over-attention module) to improve semantic coverage. Nema et al. (2018) combine semantic gating and two-level attention (with attention over slots, slot values, and a combination thereof). Other authors explore supplementary inputs for improving semantic correctness: Reed, Oraby, and Walker (2018) use an additional supervision signal indicating the desired number of sentences to generate, while Freitag and Roy (2018) show that additional unlabeled training data improves semantic coverage in their denoising-autoencoder-based NLG model.

Since its initial release in Novikova, Dušek, and Rieser (2017), the E2E dataset has been used by several authors to explore generating more diverse outputs, mostly with additional supervision signals: The system of Wiseman, Shieber, and Rush (2018) learns latent templates (sequences of phrases/slots) while learning to generate, thus allowing more controllability and arguably more diversity of the outputs – the templates serve as an additional, fine-grained way of specifying the desired shape of the generator output. Reed, Oraby, and Walker (2018) explore using the presence of prespecified contrast markers (e.g. but, although) as additional supervision, while Juraska and Walker (2018) investigate other stylistic markers and use them to generate sentences of specified type. Oraby, Reed, and Tandon (2018) and Oraby et al. (2018b) attempt to generate outputs showing different personality traits (represented by the Big Five model) using additional synthetic training data with personality annotation. Jagfeld, Jenne, and Vu (2018) do not add more supervision but compare the diversity produced by word-level and character-level seq2seq models on E2E data, showing better performance of the latter.

Using an in-house restaurant dataset, Nayak et al. (2017) explore using a basic sentence plan specification (slot ordering and sentence grouping) as an additional training signal to increase output diversity. Working in the transport information domain, Dušek and Jurčíček (2016) and Mangrulkar et al. (2018) condition their generators on preceding dialogue context as well as the input MR to obtain greater diversity.

7 Evaluation Setup

We evaluated the systems submitted to the E2E challenge using a range of automatic metrics, which we describe in Section 7.1. This includes a novel application of textual measures [Footnote 17: These measures were previously applied by Perez-Beltrachini and Gardent (2017) and this work (see Section 4) to describe datasets, but not for evaluation of NLG outputs.] and a novel usage of standard word-overlap metrics to assess similarity among individual systems. Automatic metrics are popular in NLG Gkatzia and Mahamood (2015) because they are cheaper and faster to run than human evaluation. However, sole use of automatic metrics is only sensible if they are known to be sufficiently correlated with human preferences. Recent studies Novikova et al. (2017); Reiter (2018) have demonstrated that this is very often not the case and that automatic metrics only weakly reflect human judgements on system outputs as generated by data-driven NLG. Therefore, we also performed a large-scale crowdsourced human evaluation, as detailed in Section 7.2. For the human evaluation of the 20 primary systems, we address the problem of how to efficiently compare a large number of systems, by:

  1. Extending our previous work Novikova, Dušek, and Rieser (2018) on rank-based Magnitude Estimation (RankME) and verifying the method at scale; [Footnote 18: The original study Novikova, Dušek, and Rieser (2018) was limited to comparing 3 similar systems on 100 utterances.]

  2. Introducing the data-efficient TrueSkill algorithm Herbrich, Minka, and Graepel (2006); Sakaguchi, Post, and Van Durme (2014) to NLG. This allows us to compute an overall ranking by directly comparing the systems, rather than individually assessing them at higher cost, as done by previous NLG challenges Belz and Hastie (2014).

7.1 Automatic Metrics

We apply two types of automatic metrics: One set assessing the similarity between generated system outputs and natural language references in the corpus using word-overlap-based measures, and another set assessing the complexity and diversity of system outputs using a variety of textual measures.

Word-overlap metrics

For the first set, we selected a range of metrics measuring word-overlap between system output and references, including BLEU and NIST, which are used as standard in machine translation evaluation Bojar, Graham, and Kamran (2017) and very common in NLG, and several others which were applied in the COCO caption generation challenge Chen et al. (2015) as well as other NLG experiments (e.g. Lebret, Grangier, and Auli, 2016; Gardent et al., 2017b; Sharma et al., 2016b):

BLEU Papineni et al. (2002) is the geometric mean of n-gram precisions of the system output with respect to human-authored reference sentences, with n ≤ 4, lowered by a brevity penalty if the output is shorter than references. The n-gram precisions are proportions of n-grams in the system output that can be matched in any of the reference sentences. Repeated n-gram matches are clipped to the maximum number of times the n-gram occurs in any single reference.

NIST Doddington (2002) is a version of BLEU with higher weighting for less frequent (i.e., more informative) n-grams and a different length penalty. It uses n ≤ 5.

METEOR Lavie and Agarwal (2007) measures both precision and recall of unigrams by aligning the system output with the individual human references. In addition to exact word matches, it uses fuzzy matching based on stemming and WordNet synonyms. It computes matches against multiple references separately and uses the best-matching one.

ROUGE-L Lin (2004)

is based on longest common subsequences (LCS) between the system output and the human references, where a common subsequence requires the same words in the same order but allows additional, non-covered words in the middle of either sequence. The final ROUGE-L score is an F-measure based on maximum precision and maximum recall achieved over any of the human references, where precision and recall are computed as length of the LCS divided by the length of the system output and the reference, respectively.

CIDEr Vedantam, Zitnick, and Parikh (2015)

was primarily designed for generated image captions, but is also applicable to NLG in general. CIDEr is computed as the average cosine similarity between the system output and the reference sentences on the level of n-grams, n ∈ {1, …, 4}. The importance of the individual n-grams is given by the Term Frequency-Inverse Document Frequency (TF-IDF) measure, which weighs an n-gram’s frequency in a particular instance against its overall frequency in the whole dataset.
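The two ingredients of BLEU described above, clipped n-gram precision and the brevity penalty, can be illustrated with a minimal sketch. It works on a single output with its references, whereas BLEU is normally aggregated over a whole corpus; the function names are hypothetical and this is not the challenge's evaluation script.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(output, references, n):
    """Proportion of output n-grams matched in the references, where each
    n-gram count is clipped to its maximum count in any single reference."""
    out_counts = ngrams(output, n)
    if not out_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in out_counts.items())
    return matched / sum(out_counts.values())


def brevity_penalty(output_len, reference_len):
    """1.0 for outputs at least as long as the reference, exp(1 - r/c) otherwise."""
    if output_len == 0:
        return 0.0
    if output_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / output_len)


# Toy example (illustrative data, not taken from the E2E dataset)
output = "the the the cat".split()
references = ["the cat sat".split(), "a cat on the mat".split()]
p1 = clipped_precision(output, references, 1)  # "the" is clipped to one match
p2 = clipped_precision(output, references, 2)
```

In the toy example, only one of the three occurrences of "the" counts as a match, which is exactly the effect of clipping.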

We provided scripts to the challenge participants to run all of these metrics in a simple, easy-to-use way. The scripts are freely available online (they are partially based on the COCO caption generation challenge evaluation scripts, https://github.com/tylin/coco-caption).


In addition to evaluating all NLG systems individually against human-authored reference texts (see Section 8.1), we also apply the same metrics as measures of output similarity among the systems, comparing each system’s outputs with all other systems’ outputs in place of references (see Section 8.3).

Textual metrics

For the second set of scores, which is intended to measure complexity and diversity in the system outputs, we use the same automatic textual metrics which we used to evaluate the E2E NLG dataset itself (see Sections 4.2 and 4.3), i.e. dimensions of lexical richness, such as lexical sophistication (LS2) and mean segmental type-token ratio (MSTTR), and metrics of syntactic complexity, such as levels of the revised D-level Scale. This allows us both to evaluate the diversity and complexity of system outputs and to establish whether the text characteristics are similar to the training and test sets. To focus specifically on the style produced by the individual systems, we delexicalised restaurant names in the system outputs before computing textual metrics scores, since restaurant names could skew some of these metrics as they are mostly composed of infrequent nouns (cf. Section 4.2).
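As an illustration of MSTTR, the following minimal sketch splits a token sequence into consecutive fixed-size segments, computes the type-token ratio of each segment and averages the ratios. It assumes that a trailing incomplete segment is discarded, which is the usual convention but is not stated in this section; the function name is hypothetical.

```python
def msttr(tokens, segment_size=50):
    """Mean segmental type-token ratio over consecutive full segments.

    Assumption: tokens left over after the last full segment are ignored.
    Returns None if the text is shorter than one segment.
    """
    segments = [tokens[i:i + segment_size]
                for i in range(0, len(tokens) - segment_size + 1, segment_size)]
    if not segments:
        return None
    ratios = [len(set(segment)) / segment_size for segment in segments]
    return sum(ratios) / len(ratios)


# Toy example with a segment size of 3 to keep the data small
toy_tokens = ["a", "b", "a", "c", "d", "d", "e"]
toy_score = msttr(toy_tokens, segment_size=3)  # segments: [a b a], [c d d]
```

With the 50-token window used for MSTTR-50, the segments typically span several system outputs.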

7.2 Human Evaluation

The human evaluation was conducted on the 20 primary systems and the baseline using Rank-based Magnitude Estimation (RankME) Novikova, Dušek, and Rieser (2018). In an ordinary (i.e. not rank-based) ME task Bard, Robertson, and Sorace (1996), subjects provide a relative rating of an experimental sentence to a reference sentence, which is associated with a pre-set/fixed number. If the target sentence appears twice as good as the reference sentence, for instance, subjects are to multiply the reference score by two; if it appears half as good, they should divide it in half, etc. Rank-based ME extends this idea by asking subjects to provide a relative ranking of several target sentences, i.e. not only to the reference sentence, but also to each other.

Rank-based ME was selected for several reasons. First, its use proved to significantly increase the consistency of human ratings, compared to other data collection methods Novikova, Dušek, and Rieser (2018). Second, it implies the use of continuous scales, i.e. rating scales without numerical labels and without given end points. Recent studies show that continuous scales allow subjects to give more nuanced judgements Belz and Kow (2011); Graham et al. (2013); Bojar et al. (2017). Third, it explores relative ranking of different systems instead of directly assessing quality of each specific system, which makes it more reliable in the environment of a challenge.

The evaluation was conducted using crowdsourcing based on the CrowdFlower/FigureEight platform. Crowd workers were presented with five randomly selected outputs of different systems corresponding to a single MR, and were asked to evaluate and rank these systems from the best to the worst, ties permitted, using the RankME method.

The final evaluation results were produced using the TrueSkill algorithm Herbrich, Minka, and Graepel (2006); Sakaguchi, Post, and Van Durme (2014). TrueSkill produces system rankings by gradually updating a Bayesian estimate of each system’s capability according to the “surprisal” of pairwise comparisons of individual system outputs. This way, fewer direct comparisons between systems are needed to establish their overall ranking. In Novikova, Dušek, and Rieser (2018), we were able to show that TrueSkill is able to reduce the amount of collected human evaluation data without compromising the final ranking results.

Since the performance of some systems may be very similar and a total ordering would not reflect this, we adopt the practice used in machine translation of presenting a partial ordering into significance clusters established by bootstrap resampling Bojar et al. (2013, 2014); Sakaguchi, Post, and Van Durme (2014). The TrueSkill algorithm is run 200 times, producing slightly different rankings each time as pairs of system outputs for comparison are randomly sampled. This way we can determine the range of ranks where each system is placed 95% of the time or more often. Clusters are then formed of systems whose rank ranges overlap.
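The clustering step can be sketched as follows, assuming that the rank of each system in each of the 200 runs is available as a list. The percentile-style trimming and the chained overlap test are assumptions about the details, and both function names are hypothetical. The example rank ranges are taken from the quality results in Table 11.

```python
def rank_range(ranks, coverage=0.95):
    """Range of ranks after trimming (1 - coverage) / 2 of the runs at each end."""
    ordered = sorted(ranks)
    cut = int(len(ordered) * (1.0 - coverage) / 2.0)
    return ordered[cut], ordered[len(ordered) - 1 - cut]


def significance_clusters(ranges):
    """Group systems whose rank ranges overlap, starting from the best ranks.

    A new cluster starts when a system's best rank lies beyond the worst rank
    of every system already in the current cluster.
    """
    ordered = sorted(ranges.items(), key=lambda item: item[1])
    clusters, current, current_max = [], [], None
    for name, (low, high) in ordered:
        if current and low > current_max:
            clusters.append(current)
            current, current_max = [], None
        current.append(name)
        current_max = high if current_max is None else max(current_max, high)
    if current:
        clusters.append(current)
    return clusters


# Hypothetical ranks of one system over 200 runs
example_range = rank_range([3] * 4 + [4] * 150 + [5] * 44 + [7] * 2)

# A subset of the quality rank ranges reported in Table 11
quality_ranges = {
    "Slug": (1, 1), "TUDA": (2, 4), "Gong": (2, 5), "TGen": (3, 6),
    "TR2": (15, 16), "FORGe3": (15, 16),
}
clusters = significance_clusters(quality_ranges)
```

On this subset, the sketch reproduces the grouping in Table 11: Slug alone, then TUDA, Gong and TGen together, then TR2 and FORGe3.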

Traditionally, human evaluation aims to assess the naturalness (fluency, readability) and informativeness (relevance, correctness, adequacy) of an automatically generated output Gatt and Krahmer (2017). Naturalness targets the linguistic quality of the NLG system output; informativeness targets relevance or correctness of the output relative to the input MR, showing how well the system reflects the MR content. Recent research often adds a general, overall quality criterion Wen et al. (2015b, a); Manishina et al. (2016); Novikova, Lemon, and Rieser (2016); Novikova et al. (2017), or even uses only that Sharma et al. (2016a).

We decided against explicitly evaluating informativeness since our training instances do not always verbalise all MR attributes (cf. Section 4.4). We therefore only collected separate ranks for quality and naturalness.


When collecting quality ratings, system outputs were presented to crowd workers together with the corresponding meaning representation, which implies that correctness of the NL utterance relative to the MR should also influence this ranking. The crowd workers were asked: “How do you judge the overall quality of the utterance in terms of its grammatical correctness, fluency, adequacy and other important factors?”


When collecting naturalness ratings, system outputs were presented to crowd workers without the corresponding meaning representation. The crowd workers were asked: “Could the utterance have been produced by a native speaker?”

Ratings of quality and naturalness were collected separately, i.e. in two individual crowdsourcing tasks. Furthermore, when crowd workers were asked to assess naturalness, the MR was not shown to them since it was not necessary for the task. This setup allows us to minimise the correlation between the ratings of naturalness and quality Novikova, Dušek, and Rieser (2018); Callison-Burch et al. (2007).

8 Results

In this section, we report on the results of the evaluation of all E2E NLG Challenge primary systems, following the evaluation procedures described in Section 7. We first show the results using automatic metrics: word-overlap-based (Section 8.1) and textual metrics (Section 8.2), as well as automatically computed output similarity between systems (Section 8.3). We then summarise the human evaluation results (Section 8.4), comment on the semantic accuracy of system outputs (Section 8.5) and declare the overall winning system (Section 8.6). Finally, we provide a list of “lessons learnt” in Section 8.7 – observations that we hope will be useful for future NLG system development.

8.1 Word-overlap Metrics

System | BLEU | NIST | METEOR | ROUGE-L | CIDEr | Norm. avg.
TGen (baseline) | 0.6593 | 8.6094 | 0.4483 | 0.6850 | 2.2338 | 0.5754
Slug | 0.6619 | 8.6130 | 0.4454 | 0.6772 | 2.2615 | 0.5744
TNT1 | 0.6561 | 8.5105 | 0.4517 | 0.6839 | 2.2183 | 0.5729
NLE | 0.6534 | 8.5300 | 0.4435 | 0.6829 | 2.1539 | 0.5696
TNT2 | 0.6502 | 8.5211 | 0.4396 | 0.6853 | 2.1670 | 0.5688
Harv | 0.6496 | 8.5268 | 0.4386 | 0.6872 | 2.0850 | 0.5673
Zhang | 0.6545 | 8.1840 | 0.4392 | 0.7083 | 2.1012 | 0.5661
Gong | 0.6422 | 8.3453 | 0.4469 | 0.6645 | 2.2721 | 0.5631
TR1 | 0.6336 | 8.1848 | 0.4322 | 0.6828 | 2.1425 | 0.5563
Sheff1 | 0.6015 | 8.3075 | 0.4405 | 0.6778 | 2.1775 | 0.5537
DANGNT | 0.5990 | 7.9277 | 0.4346 | 0.6634 | 2.0783 | 0.5395
Slug-alt (late submission) | 0.6035 | 8.3954 | 0.4369 | 0.5991 | 2.1019 | 0.5378
ZHAW2 | 0.6004 | 8.1394 | 0.4388 | 0.6119 | 1.9188 | 0.5314
TUDA | 0.5657 | 7.4544 | 0.4529 | 0.6614 | 1.8206 | 0.5215
ZHAW1 | 0.5864 | 8.0212 | 0.4322 | 0.5998 | 1.8173 | 0.5205
Adapt | 0.5092 | 7.1954 | 0.4025 | 0.5872 | 1.5039 | 0.4738
Chen | 0.5859 | 5.4383 | 0.3836 | 0.6714 | 1.5790 | 0.4685
FORGe3 | 0.4599 | 7.1092 | 0.3858 | 0.5611 | 1.5586 | 0.4547
Sheff2 | 0.5436 | 5.7462 | 0.3561 | 0.6152 | 1.4130 | 0.4462
TR2 | 0.4202 | 6.7686 | 0.3968 | 0.5481 | 1.4389 | 0.4372
FORGe1 | 0.4207 | 6.5139 | 0.3685 | 0.5437 | 1.3106 | 0.4231
Table 8: Word-overlap metrics scores (see Section 7.1) for all primary systems, plus the average of all metrics’ values normalised into the 0-1 range. The list is sorted by the normalised average; any values higher than the corresponding baseline are marked in bold. System architectures are coded with colours and symbols: seq2seq, other data-driven, rule-based, template-based.

Table 8 summarises the system scores for word-overlap metrics (cf. Section 7.1). It is apparent that the TGen baseline is very strong in terms of word-overlap metrics: No primary system is able to beat it in terms of all metrics, or in terms of the normalised metrics’ mean – only Slug comes very close. Several other systems manage to beat TGen in one of the metrics but not in others. Note, however, that many secondary system submissions perform better than the primary ones (and the baseline) with respect to word-overlap metrics (see Table 14 in the Appendix).
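The normalisation behind the averaged column of Table 8 is not spelled out here. The reported averages are consistent with leaving BLEU, METEOR and ROUGE-L unchanged and dividing NIST and CIDEr by 10 before taking the mean. The sketch below checks this against two rows of the table; the scaling is an inference from the published values, not a documented procedure, and the function name is hypothetical.

```python
def normalised_average(bleu, nist, meteor, rouge_l, cider):
    """Mean of the five metrics after scaling NIST and CIDEr by 1/10.

    Assumption: the factor of 10 is inferred from the values in Table 8.
    """
    scaled = [bleu, nist / 10.0, meteor, rouge_l, cider / 10.0]
    return sum(scaled) / len(scaled)


# Values copied from the TGen and Slug rows of Table 8
tgen_avg = normalised_average(0.6593, 8.6094, 0.4483, 0.6850, 2.2338)
slug_avg = normalised_average(0.6619, 8.6130, 0.4454, 0.6772, 2.2615)
```

Rounded to four decimal places, both results agree with the averages listed in the table.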

Overall, seq2seq-based systems show the best word-based metric values, followed by Sheff1, a data-driven system based on imitation learning. As expected, attempts to increase output diversity by ZHAW1, ZHAW2, Slug-alt and Adapt result in lower scores on word-overlap-based metrics. Template-based and rule-based systems mostly score at the bottom of the list. Among the lowest-scoring systems in terms of word-overlap metrics are those of Chen and Sheff2, which tend to produce much shorter outputs than other systems (cf. Section 8.2). This most likely resulted in a severe brevity penalty.

Finally, it must be noted that the results using automatic metrics are quite different from results obtained in human evaluation (see Section 8.4), which confirms previous findings Novikova et al. (2017); Reiter (2018).

8.2 Textual Metrics

% Level 0-2 | % Level 6-7 | LS2 | MSTTR-50 | Avg. length
Gong 82.68 | Sheff1 41.27 | test set all 0.43 | test set rand 0.62 | TUDA 31.02
TNT2 79.64 | FORGe1 33.66 | test set rand 0.36 | TR2 0.62 | TR2 27.48
Slug 78.08 | Slug-alt 30.49 | Adapt 0.33 | Adapt 0.61 | FORGe1 26.88
TNT1 72.18 | ZHAW1 26.00 | FORGe1 0.30 | FORGe1 0.59 | ZHAW2 26.58
Zhang 70.83 | TR2 21.07 | TR2 0.29 | ZHAW1 0.58 | TNT1 26.37
DANGNT 66.95 | ZHAW2 19.03 | Harv 0.27 | test set all 0.58 | ZHAW1 26.16
TGen 65.12 | FORGe3 18.51 | TNT1 0.26 | ZHAW2 0.57 | TNT2 25.49
Harv 64.63 | test set rand 17.46 | Chen 0.25 | FORGe3 0.56 | Gong 25.41
TR1 64.28 | Gong 16.90 | NLE 0.25 | TUDA 0.55 | DANGNT 24.85
FORGe3 62.62 | test set all 16.48 | Sheff2 0.25 | DANGNT 0.54 | Adapt 24.47
Adapt 62.48 | Slug 11.39 | Sheff1 0.24 | Slug-alt 0.54 | Slug-alt 24.47
FORGe1 61.13 | NLE 11.12 | TNT2 0.23 | Slug 0.52 | test set rand 24.39
ZHAW1 58.91 | TUDA 10.48 | TGen 0.22 | TNT1 0.52 | TGen 24.04
NLE 58.24 | Adapt 10.28 | DANGNT 0.21 | Sheff1 0.52 | test set all 23.96
test set rand 58.16 | TNT1 9.55 | TUDA 0.21 | NLE 0.52 | Slug 23.76
test set all 57.97 | TGen 9.02 | TR1 0.20 | TGen 0.52 | FORGe3 23.49
TUDA 57.66 | DANGNT 8.91 | Zhang 0.20 | TNT2 0.51 | NLE 23.40
TR2 57.36 | TR1 8.13 | Slug 0.20 | Harv 0.51 | Harv 23.22
Chen 54.35 | Harv 8.12 | Gong 0.20 | TR1 0.50 | Sheff1 22.75
Sheff2 52.98 | Zhang 5.27 | FORGe3 0.20 | Gong 0.50 | TR1 22.43
ZHAW2 52.63 | TNT2 5.22 | Slug-alt 0.19 | Zhang 0.47 | Zhang 20.71
Slug-alt 35.12 | Chen 4.40 | ZHAW2 0.17 | Chen 0.43 | Sheff2 17.18
Sheff1 26.19 | Sheff2 2.08 | ZHAW1 0.17 | Sheff2 0.43 | Chen 16.32
Table 9: Systems sorted according to selected textual metrics (percentage of simple and complex sentences, lexical sophistication LS2, MSTTR-50, average output length in tokens). For comparison, the table also includes the same values for the whole test set (test set all) and for a randomly selected subset of the test set, with one reference text per MR (test set rand). System architectures are coded with colours and symbols: seq2seq, other data-driven, rule-based, template-based.
Distinct tokens | Distinct trigrams | % Unique trigrams | Entropy tokens | Cond. entropy bigrams
test set all 1079 | test set all 16797 | test set rand 69.13 | test set all 6.40 | test set all 2.92
test set rand 542 | test set rand 5166 | Adapt 66.61 | test set rand 6.37 | test set rand 2.70
Adapt 455 | TR2 4687 | TR2 60.44 | TR2 6.24 | TR2 2.60
TR2 399 | Adapt 3567 | test set all 44.66 | Adapt 6.18 | Adapt 2.09
ZHAW1 136 | ZHAW1 969 | ZHAW1 24.97 | FORGe3 5.74 | FORGe3 1.66
FORGe3 124 | FORGe3 896 | Harv 21.88 | ZHAW1 5.71 | Slug-alt 1.55
ZHAW2 102 | Slug-alt 855 | TNT1 21.34 | ZHAW2 5.65 | Harv 1.45
Harv 93 | Harv 777 | NLE 18.75 | Slug-alt 5.57 | ZHAW1 1.44
TNT1 89 | ZHAW2 716 | ZHAW2 18.72 | FORGe1 5.55 | TNT2 1.39
FORGe1 88 | TNT1 703 | Slug-alt 18.13 | Harv 5.50 | NLE 1.37
Slug-alt 88 | TNT2 634 | Chen 17.92 | Sheff1 5.43 | TNT1 1.37
TNT2 86 | NLE 608 | Zhang 17.81 | NLE 5.43 | Sheff1 1.33
TGen 83 | TGen 597 | Sheff1 16.44 | TGen 5.41 | TGen 1.32
NLE 81 | Sheff1 578 | Slug 15.58 | TNT1 5.37 | ZHAW2 1.32
Zhang 76 | FORGe1 549 | FORGe3 13.50 | Slug 5.35 | TR1 1.30
TR1 75 | Zhang 511 | TGen 13.23 | TNT2 5.34 | FORGe1 1.29
Slug 74 | Slug 507 | TNT2 12.93 | DANGNT 5.29 | Zhang 1.26
Chen 73 | Chen 480 | FORGe1 12.39 | TUDA 5.25 | Chen 1.17
Sheff1 72 | TR1 464 | TR1 10.78 | TR1 5.24 | Slug 1.13
DANGNT 61 | DANGNT 301 | Gong 7.30 | Zhang 5.21 | Sheff2 1.10
Sheff2 59 | Sheff2 262 | Sheff2 4.96 | Gong 5.19 | DANGNT 1.06
Gong 58 | Gong 233 | DANGNT 0.00 | Chen 5.09 | Gong 0.91
TUDA 57 | TUDA 143 | TUDA 0.00 | Sheff2 4.76 | TUDA 0.71
Table 10: Systems sorted according to selected textual diversity metrics (number of distinct tokens, number of distinct trigrams, proportion of unique trigrams, Shannon entropy over tokens (unigrams), bigram next-word conditional entropy). For comparison, the table also includes the same values for the whole test set (test set all) and for a randomly selected subset of the test set, with one reference text per MR (test set rand). System architectures are coded with colours and symbols: seq2seq, other data-driven, rule-based, template-based.

Table 9 summarises results from a range of textual metrics which aim to assess the complexity and diversity of primary system outputs (cf. Section 7.1). In addition, we include a comparison to the human references in the test set in order to assess whether systems are able to replicate characteristics of human-produced data. (Note that textual metrics have been computed with restaurant names delexicalised, cf. Section 7.1.) The results in Table 9 show the following:

  • Seq2seq-based system outputs are less syntactically complex on average than outputs of other systems (they produce more D-level 0-2 sentences and fewer D-level 6-7 sentences than other architectures).

  • The systems seem to show a relatively high variance in syntactic complexity levels, especially with respect to the higher levels; few systems match the distribution of the training and test data. The differences in D-level distributions in the outputs are mostly statistically significant (see Figure 6 in the Appendix). The only system producing a D-level distribution not significantly different from a random test set reference is FORGe3, which is based on template mining from training data.

    If we use Bhattacharyya distance to compare the D-level distributions (cf. Figure 7 in the Appendix), the greatest distances appear at both extremes. Sheff1, FORGe1 and Slug-alt produce higher-level sentences more frequently and are thus among the systems most distant from the others. The Gong system mostly produces level 0-2 sentences, and therefore it appears very distant from other systems and is also the system most distant from the human references.

  • None of the systems reaches the lexical sophistication of the human-authored test set references. The diversity-attempting seq2seq-based Adapt system comes very close, followed by the grammar-based FORGe1 and the TR2 system, which is based on template mining from data. Data-driven systems aiming at higher lexical diversity seem to achieve higher sophistication as well; note the lower performance of Slug-alt, which aims more at syntactic diversity than lexical. For rule-based systems, lexical sophistication is a direct result of the system authors’ decisions.

  • In terms of MSTTR, highest scores are achieved by template or rule-based systems and by data-driven systems that explicitly aim at greater output diversity (ZHAW1, ZHAW2, Adapt, Slug-alt). Note that MSTTR is typically higher in systems that tend to produce longer outputs, which includes most rule- and template-based systems. We assume that this is due to MSTTR’s fixed 50-token window used to segment utterances.

  • Most systems produce outputs similar in length to the test set human references. Outputs of rule- and template-based systems tend to be more verbose than those of data-driven systems. The outputs of Zhang, Sheff2 and Chen are much shorter on average than texts in the dataset, which suggests that these systems might not verbalise all the information contained in the MR (cf. Section 8.5).
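The Bhattacharyya distance used above to compare D-level distributions can be sketched as follows. The sketch assumes that each system is represented by a discrete probability distribution over the D-levels; the function name and the toy distributions are hypothetical and are not the values behind Figure 7.

```python
import math


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete distributions.

    p and q are sequences of probabilities over the same categories.
    Returns infinity when the distributions do not overlap at all.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of categories")
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    coefficient = min(1.0, coefficient)  # guard against rounding above 1
    if coefficient == 0.0:
        return math.inf
    return -math.log(coefficient)


# Toy distributions over two categories (illustrative only)
same = bhattacharyya_distance([0.5, 0.5], [0.5, 0.5])
skewed = bhattacharyya_distance([1.0, 0.0], [0.5, 0.5])
```

Identical distributions have distance zero, and the distance grows as the probability mass of the two distributions overlaps less.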

As with the dataset statistics in Section 4.2, we also computed additional textual measures to assess the diversity/repetitiveness of the generated outputs: number of distinct n-grams, Shannon entropy, and conditional next-word entropy; a selection of these metrics is shown in Table 10. (We used system outputs with delexicalised restaurant names for the evaluation, but the lexicalised outputs show the same trends. The values for n-gram lengths not displayed in Table 10 also show very similar trends.) We compare the outputs against the whole test set (multiple references) and a randomly selected single reference per MR from the test set. The results show the following:

  • None of the systems is able to produce as much diversity as is contained in a randomly selected human reference – even the most diverse systems lag behind. Adapt comes close in vocabulary size, TR2 is the closest system in terms of entropy and next-word conditional entropy.

  • In terms of vocabulary, there is a huge gap between the most diverse Adapt and TR2 systems and any other system (e.g., the 3rd-ranking ZHAW1 has a 3× smaller vocabulary than TR2 and a 2.4× smaller ratio of unique trigrams).

    TR2 demonstrates that mining templates from the training data can lead to very diverse outputs. FORGe3, which uses the same method, also ranks relatively high on vocabulary size and entropy. The diversity produced by Adapt’s seq2seq model indicates that the preprocessing step enriching the MRs works effectively (cf. Section 6.3).

  • All diversity-attempting data-driven systems (Adapt, ZHAW1, ZHAW2, Harv, TNT1, TNT2, Slug-alt) indeed rank better than most systems not incorporating diversity measures, with TNT1 and TNT2 showing lower gains than the rest of the group. However, template-mining-based systems (TR2, FORGe3) produce outputs of similar or higher diversity with no concentrated effort.

  • Outputs of seq2seq-based systems which do not explicitly model diversity (e.g. Gong, Sheff1, TR1, Slug, Chen) indeed show lower diversity scores. The rule-based DANGNT system also ranks very low on diversity, and the TUDA system with handcrafted templates is the least diverse of all.

In summary, few systems are able to approach the complexity and diversity shown in human-authored data. Seq2seq-based systems tend to favor simpler sentences than hand-engineered systems unless diversity control is in place. Vanilla seq2seq and handcrafted templates produce the least diverse outputs; highest diversity is achieved by template mining or explicit diversity control mechanisms.
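The entropy-based diversity measures reported in Table 10 can be illustrated with a minimal sketch over a list of tokenised outputs. It assumes base-2 logarithms and does not count bigrams across output boundaries; neither detail is stated in this section, and the function names are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(utterances):
    """Shannon entropy (in bits) of the token distribution over all outputs."""
    counts = Counter(token for tokens in utterances for token in tokens)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def conditional_bigram_entropy(utterances):
    """Entropy (in bits) of the next word given the previous word."""
    bigrams, contexts = Counter(), Counter()
    for tokens in utterances:
        bigrams.update(zip(tokens, tokens[1:]))
        contexts.update(tokens[:-1])
    total = sum(bigrams.values())
    return -sum(c / total * math.log2(c / contexts[prev])
                for (prev, _), c in bigrams.items())


def distinct_tokens(utterances):
    """Vocabulary size of a set of outputs."""
    return len({token for tokens in utterances for token in tokens})


# Toy outputs (illustrative only)
toy_outputs = [["a", "b", "a", "c"]]
toy_entropy = shannon_entropy(toy_outputs)
toy_cond_entropy = conditional_bigram_entropy(toy_outputs)
```

A low conditional entropy means that the next word is largely predictable from the previous one, which is what one would expect from handcrafted templates.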

8.3 System Output Similarity

System Mean
TGen 0.48
Slug 0.47
TNT1 0.46
NLE 0.46
TNT2 0.46
Harv 0.46
Zhang 0.45
Sheff1 0.44
TR1 0.44
Gong 0.44
Slug-alt 0.42
ZHAW2 0.42
ZHAW1 0.40
Chen 0.40
TUDA 0.37
Adapt 0.37
FORGe3 0.34
Sheff2 0.34
test set rand 0.34
TR2 0.33
FORGe1 0.31
Figure 4: Similarity of the systems’ outputs as measured by automatic metrics (mean of normalised BLEU, NIST, METEOR, ROUGE-L and CIDEr where one system output is used as reference). Systems are sorted by their architecture. For comparison, we also include metrics values against the full test set with multiple references (test set all) and against a single-reference randomly sampled subset of the test set (test set rand). The table on the right shows mean values of similarity of each system against all other systems (average over columns on the left, excluding the 1st line). System architectures are coded with colours and symbols: seq2seq, other data-driven, rule-based, template-based.

In order to assess the similarity of outputs produced by the individual systems, we reused the word-overlap-based metrics applied in the challenge (see Section 7.1). We created all possible pairs of systems and computed word-overlap metrics between each of their outputs for every instance in the test set. As with the textual metrics, restaurant names were delexicalised in the system outputs. (Results with fully lexicalised outputs are very similar; the differences are just slightly less pronounced.)

This process resulted in a table for each of the metrics (see Figure 5 in the Appendix), with reference systems in rows and tested systems in columns. All five metrics showed a very similar pattern. Figure 4 therefore summarises the results by taking the average of all normalised metrics (cf. Table 8). For comparison, we also measure similarity of system outputs against the reference texts in the test set, as well as a subset of the test set with a single, randomly sampled reference text per MR.

We can see from Figure 4 that all the seq2seq-based system outputs are in general most similar to each other; other data-driven systems also show higher similarity amongst each other. The exception to this rule in the case of the Chen and Sheff2 systems can be explained by the brevity of their outputs (cf. Sections 8.1 and 8.2). Systems that aim at output diversity (ZHAW1, ZHAW2, Slug-alt and mainly Adapt) also exhibit lowered similarity of their outputs to those of other systems, which might indicate that their outputs are indeed more original. The outputs of rule-based and template-based systems are markedly less similar to other outputs than those of the data-driven systems.

We can also see that most system outputs, especially those of data-driven methods, are much more similar to each other than they are to a single randomly selected human-authored reference text from the test set. This is to be expected since data-driven methods tend to select more frequent phrasing. Some of the system outputs even show a higher similarity to each other than to the closest matching human references from the test set. This is mainly the case for systems with very similar architectures, which often arrive at identical results (e.g. TGen, TNT1 and TNT2).

8.4 Results of Human Evaluation

Quality
Cluster | TrueSkill | Rank range | System
1 | 0.300 | 1–1 | Slug
2 | 0.228 | 2–4 | TUDA
2 | 0.213 | 2–5 | Gong
2 | 0.184 | 3–5 | DANGNT
2 | 0.184 | 3–6 | TGen
2 | 0.136 | 5–7 | Slug-alt (late)
2 | 0.117 | 6–8 | ZHAW2
2 | 0.084 | 7–10 | TNT1
2 | 0.065 | 8–10 | TNT2
2 | 0.048 | 8–12 | NLE
2 | 0.018 | 10–13 | ZHAW1
2 | 0.014 | 10–14 | FORGe1
2 | 0.012 | 11–14 | Sheff1
2 | 0.012 | 11–14 | Harv
3 | -0.078 | 15–16 | TR2
3 | -0.083 | 15–16 | FORGe3
4 | -0.152 | 17–19 | Adapt
4 | -0.185 | 17–19 | TR1
4 | -0.186 | 17–19 | Zhang
5 | -0.426 | 20–21 | Chen
5 | -0.457 | 20–21 | Sheff2

Naturalness
Cluster | TrueSkill | Rank range | System
1 | 0.211 | 1–1 | Sheff2
2 | 0.171 | 2–3 | Slug
2 | 0.154 | 2–4 | Chen
2 | 0.126 | 3–6 | Harv
2 | 0.105 | 4–8 | NLE
2 | 0.101 | 4–8 | TGen
2 | 0.091 | 5–8 | DANGNT
2 | 0.077 | 5–10 | TUDA
2 | 0.060 | 7–11 | TNT2
2 | 0.046 | 9–12 | Gong
2 | 0.027 | 9–12 | TNT1
2 | 0.027 | 10–12 | Zhang
3 | -0.053 | 13–16 | TR1
3 | -0.073 | 13–17 | Slug-alt (late)
3 | -0.077 | 13–17 | Sheff1
3 | -0.083 | 13–17 | ZHAW2
3 | -0.104 | 15–17 | ZHAW1
4 | -0.144 | 18–19 | FORGe1
4 | -0.164 | 18–19 | Adapt
5 | -0.243 | 20–21 | TR2
5 | -0.255 | 20–21 | FORGe3
Table 11: TrueSkill measurements of quality (first table) and naturalness (second table) for all primary systems (significance cluster number, TrueSkill value, range of ranks where the system falls in 95% of cases or more, system name). Significance clusters are indicated in the first column. System architectures are coded with colours and symbols: seq2seq, other data-driven, rule-based, template-based.
Human ratings

System | OK | A | M | A+M
Slug | 74% | 8% | 17% | 1%
Gong | 74% | 6% | 19% | 1%
DANGNT | 74% | 9% | 17% | 0%
TUDA | 74% | 19% | 7% | 0%
TR2 | 73% | 10% | 14% | 3%
Sheff1 | 72% | 9% | 18% | 1%
Slug-alt | 70% | 12% | 18% | 1%
ZHAW2 | 69% | 8% | 22% | 1%
TGen | 69% | 7% | 23% | 1%
FORGe1 | 68% | 9% | 20% | 3%
TNT1 | 66% | 7% | 25% | 1%
TNT2 | 62% | 9% | 28% | 1%
ZHAW1 | 61% | 9% | 28% | 1%
FORGe3 | 60% | 10% | 29% | 1%
NLE | 59% | 8% | 31% | 2%
Harv | 53% | 9% | 35% | 4%
TR1 | 51% | 8% | 42% | 0%
Adapt | 51% | 12% | 33% | 4%
Zhang | 43% | 8% | 49% | 0%
Chen | 27% | 10% | 62% | 0%
Sheff2 | 26% | 9% | 62% | 3%

Automatic (pattern matching)

System | OK | A | M | A+M | SER
TUDA | 100% | 0% | 0% | 0% | 0.00%
Sheff1 | 93% | 0% | 5% | 2% | 1.08%
Gong | 92% | 4% | 2% | 2% | 1.13%
FORGe1 | 92% | 0% | 8% | 0% | 1.22%
Slug | 91% | 1% | 4% | 4% | 1.26%
DANGNT | 88% | 0% | 12% | 0% | 1.75%
TGen | 79% | 3% | 16% | 2% | 3.56%
Slug-alt | 78% | 4% | 9% | 9% | 3.56%
ZHAW2 | 76% | 3% | 20% | 1% | 3.68%
TNT1 | 73% | 1% | 22% | 4% | 4.92%
TNT2 | 71% | 1% | 28% | 1% | 6.04%
ZHAW1 | 70% | 3% | 25% | 2% | 5.12%
TR2 | 66% | 6% | 23% | 5% | 5.45%
NLE | 63% | 3% | 24% | 10% | 6.20%
Harv | 54% | 2% | 30% | 14% | 10.43%
Adapt | 50% | 3% | 36% | 10% | 12.48%
TR1 | 48% | 0% | 52% | 0% | 13.83%
FORGe3 | 41% | 0% | 55% | 3% | 10.41%
Zhang | 27% | 0% | 73% | 0% | 14.80%
Chen | 11% | 0% | 88% | 1% | 23.53%
Sheff2 | 5% | 0% | 88% | 6% | 27.94%
Table 12: Results of input MR coverage evaluation, with human ratings (left) and using an automatic pattern-matching script (right). Columns legend: OK – proportion of outputs covered perfectly, A – proportion of outputs with added information, M – with missed information, A+M with both missed and added information, SER – slot/semantic error rate (see Section 8.5). The lists are sorted by the proportion of perfectly covered MRs. System architectures are colour-coded: seq2seq, other data-driven, rule-based, template-based.
# MR / System name and output Score
1 name[The Vaults], eatType[restaurant], food[French], priceRange[moderate], area[city centre], familyFriendly[yes], near[Raja Indian Cuisine]
Gong The Vaults is a french restaurant in the city centre near Raja Indian Cuisine. It is kids friendly and has a moderate price range. 100
Harv The Vaults is a moderately priced French restaurant in the city centre near Raja Indian Cuisine. 90
FORGe3 Family-friendly and moderate, The Vaults restaurant is located near Raja Indian Cuisine, in the city centre area. 85
Zhang The Vaults is a kid friendly French restaurant in the city centre near Raja Indian Cuisine . 80
Sheff1 The Vaults is a moderately priced French restaurant near Raja Indian Cuisine in the city centre is. 70
2 name[Cotto], eatType[coffee shop], near[The Bakers]
TR2 Cotto is a coffee shop located near The Bakers. 100
Slug-alt Cotto is a coffee shop and is located near The Bakers 97
TGen Cotto is a coffee shop with a low price range. It is located near The Bakers. 85
Gong Cotto is a place near The Bakers. 85
Sheff2 Cotto is a pub near The Bakers. 82
3 name[Clowns], eatType[coffee shop], customer rating[3 out of 5], near[All Bar One]
Sheff1 Clowns is a coffee shop near All Bar One with a customer rating of 3 out of 5. 100
Zhang Clowns is a coffee shop near All Bar One with a customer rating of 3 out of 5 . 100
FORGe3 Clowns is a coffee shop near All Bar One with a rating 3 out of 5. 70
ZHAW2 A coffee shop near All Bar One is Clowns. It has a customer rating of 3 out of 5. 50
Sheff2 Clowns is a pub near All Bar One. 20
4 name[The Cricketers], eatType[restaurant], customer rating[low], familyFriendly[no], near[Ranch]
Slug The Cricketers is a restaurant near Ranch. It is not family friendly and has a low customer rating. 72
Slug-alt Located near Ranch, The Cricketers is a family-friendly restaurant that is not family-friendly and has a low customer rating. 71
Adapt The Cricketers is a non - family - friendly restaurant located near the Ranch . It has a low customer satisfaction rating . 68
FORGe1 The restaurant The Cricketers is near Ranch. The Cricketers, which does not welcome kids, has a low customer rating. 65
TUDA The Cricketers is a restaurant located near Ranch. It has a low customer rating. It is not family friendly. 56
Table 13: Example system outputs with human rankings of quality and a detailed error analysis attempting to interpret the rankings.

Each example is shown as ranked for quality by a single crowd worker. The raw RankME scores assigned by the crowd workers are shown; however, note that only relative ranks are used by the TrueSkill algorithm. The outputs within each example are sorted by the score for clarity. For the purpose of error analysis, the rankings may be interpreted in the following way (note that quality rankings include both relevance and fluency):

  1. Gong and FORGe3 verbalise all attributes but the latter is less fluent. Harv misses the family-friendliness, Zhang misses the price information. Sheff1 misses family-friendliness and is not fluent.

  2. TR2 and Slug-alt provide perfect and fluent information but Slug-alt misses the full stop. Gong does not specify the type of place while TGen adds irrelevant price range information. Sheff2 indicates a wrong venue type.

  3. Sheff1 and Zhang provide perfect and fluent information, FORGe3 is less fluent and ZHAW2 even less than that. Sheff2 indicates a wrong venue type and misses the customer rating information.

  4. Slug provides perfect and fluent information. Slug-alt is repetitive and Adapt was probably penalised for lack of detokenisation. FORGe1 and TUDA provide complete information but are not very fluent.

The results of human evaluation of quality and naturalness are provided in Table 11. Using the RankME setup described in Section 7.2, we collected 2,979 data points of partial system rankings for quality, where one data point corresponds to one MR and ranked outputs of five randomly selected systems (see Table 13 for examples). From these rankings, a set of 29,790 pairwise output comparisons was produced to be used by the TrueSkill algorithm. This resulted in 1,418 pairwise comparisons per system. For naturalness, 4,239 data points were collected, which resulted in 42,390 pairwise comparisons, and 2,018 comparisons per system. For each of the 630 MRs in the test set, 9.5 systems on average (with a maximum of 14) were compared based on both naturalness and quality of their outputs. That is, using TrueSkill, we were able to reduce the number of required system comparisons by more than half. The CrowdFlower task for collecting human evaluation data ran for 235 hours and cost USD 314 in total.
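The step from ranked data points to pairwise comparisons can be sketched as follows: one data point ranks five outputs, which yields ten unordered pairs. The function name and the outcome encoding are hypothetical; the example scores are those of example 3 in Table 13.

```python
from itertools import combinations


def pairwise_comparisons(scores):
    """Turn one ranked data point into pairwise outcomes.

    scores maps system names to RankME scores (higher is better).
    Each result is (system_a, system_b, outcome), where outcome is
    1 if system_a is ranked higher, -1 if lower, and 0 for a tie.
    """
    comparisons = []
    for (sys_a, score_a), (sys_b, score_b) in combinations(scores.items(), 2):
        if score_a > score_b:
            outcome = 1
        elif score_a < score_b:
            outcome = -1
        else:
            outcome = 0
        comparisons.append((sys_a, sys_b, outcome))
    return comparisons


# Scores from example 3 in Table 13
example = {"Sheff1": 100, "Zhang": 100, "FORGe3": 70, "ZHAW2": 50, "Sheff2": 20}
pairs = pairwise_comparisons(example)
```

Ten pairs per data point account for the totals quoted above: 2,979 × 10 = 29,790 for quality and 4,239 × 10 = 42,390 for naturalness. Only the relative order within each pair is kept, which is why the raw scores do not influence TrueSkill.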

We produced the final ranking of all systems for both quality and naturalness using the TrueSkill algorithm with bootstrap resampling as described in Section 7.2. This resulted in clusters of systems with significantly different system rankings for both naturalness and quality. (Note that TrueSkill provides a relative ranking of a system in terms of its cluster and rank range, cf. Section 7.2, i.e. the numerical scores are not directly interpretable. Other systems in the same cluster are considered to show performance that is not significantly different. In other words: if a system is part of e.g. cluster 2, this system can be considered 2nd best, but it is sharing this position with all other systems in the cluster.) In both cases, there are clear winning systems (i.e., the 1st cluster only has one member): Sheff2 for naturalness and Slug for quality. The 2nd clusters are quite large for both criteria: they contain 13 systems for quality and 11 for naturalness, and they include the baseline TGen system in both cases.

The results indicate that seq2seq systems dominate in terms of naturalness of their outputs, while most systems of other architectures score lower. The bottom cluster is filled with template-based systems. The winning Sheff2 system is seq2seq-based, and the 2nd cluster mostly includes other seq2seq-based systems. The result also indicates that diversity-attempting systems are penalised in naturalness, i.e. Slug-alt, ZHAW1, ZHAW2 placed in the 3rd cluster; Adapt in the 4th.

The results for quality are, however, more mixed in terms of architectures, with none of them clearly prevailing. [Footnote 24: Note that our definition of quality in Section 7.2 also includes semantic completeness and grammaticality.] The 2nd, most populous cluster includes all different architecture types. The winner is the seq2seq-based system Slug. However, the bottom two clusters are also composed of seq2seq-based systems. This shows the importance of an explicit semantic control mechanism applied at decoding time in seq2seq systems: None of the systems in the bottom two clusters apply such a mechanism, whereas all better ranking seq2seq systems do (cf. Section 6.2). [Footnote 25: While the Chen and Zhang systems do attempt to model the coverage of the input MR, they do not use explicit beam reranking based on MR coverage.] Note that this also includes the Sheff2 system, which scored top for naturalness. With the exception of diversity-attempting Adapt, these systems tend to produce the shortest outputs (see Table 9), which indicates that they are penalised for not realising parts of the input MR too often (cf. Section 8.5).

Finally, we computed the correlation of word-overlap metrics with the human judgements of both quality and naturalness for all the systems. All of the correlations are weak (see Tables 16 and 15 in the Appendix), which confirms earlier findings of Novikova et al. (2017) and explains the discrepancy between system performances in terms of automatic and human evaluation.
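As an illustration of this kind of analysis, the sketch below computes a rank correlation between a metric and human scores. It assumes Spearman's rank correlation as the coefficient (this passage does not state which one was used), and all function names are our own.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average position of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))
```

For example, `spearman(metric_scores, human_scores)` over per-output scores yields a value between -1 and 1, where values close to 0 correspond to the weak correlations reported here.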

8.5 Error Analysis: Input MR Coverage

In order to clarify the mixed quality evaluation results, we attempted to estimate the number of semantic errors produced by the individual systems in two ways: First, we ran a specific crowdsourced evaluation of systems’ coverage of the input MR, where crowd workers were asked to manually annotate missed and added information with respect to the input MR (see Table 12). We did not check for workers’ correctness here, and thus we can expect some noise, but the annotations confirm that the systems rated low on quality, most of which also produce very short outputs, also correspond to the ones with the lowest proportion of perfectly covered MRs (Chen, Sheff2, Zhang, TR1 and Adapt).

Second, semantic errors were computed following Reed, Oraby, and Walker (2018), where we implemented a script to estimate the coverage automatically based on regular expression matching. [Footnote 26: We based the patterns for the individual attribute-value pairs on Reed, Oraby, and Walker (2018)’s script and manually enhanced them using the first 500 instances of the E2E development set.] This allowed us to produce an independent estimate of the proportion of outputs with missing or added information (see Table 12). Following Reed, Oraby, and Walker (2018), we also computed the slot error rate (SER) using this pattern-matching approach. [Footnote 27: Note that the coverage and SER values produced by the script are only estimates, as the patterns for a given attribute-value pair will not cover all possible correct ways to express it. This is different from Wen et al. (2015b)’s computation of SER, where full delexicalisation allowed them to directly count placeholders in the output.] SER is given by the following formula:

\[
\mathrm{SER} = \frac{\#\text{missed} + \#\text{added} + \#\text{value errors} + \#\text{repetitions}}{\#\text{slots}}
\]

Here, missed stands for slot values missing from the realisations, added denotes additional information not present in the MR (hallucinations), value errors denote correctly realised slots with incorrect values (e.g., specifying low price range instead of high), and repetitions are values mentioned repeatedly in the outputs; slots is the total number of slots/attributes in the test set. SER thus amounts to a proportion of erroneously realised slots. While the absolute numbers for perfectly covered MRs are different from those estimated by humans, they mostly follow the same trend. The SER value is highly correlated with the proportion of perfectly covered MRs.
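The pattern-matching estimate can be sketched as below. This is a toy illustration rather than the script used in the challenge: the patterns, the `count_errors` and `slot_error_rate` helpers and the example outputs are all hypothetical, and a slot whose MR value has no pattern would be wrongly counted as an error.

```python
import re
from collections import Counter

# Hypothetical patterns for a few attribute-value pairs.
PATTERNS = {
    ("priceRange", "high"): r"\b(?:high[- ]priced|expensive|high price)\b",
    ("priceRange", "low"): r"\b(?:low[- ]priced|cheap|low price)\b",
    ("familyFriendly", "yes"): r"\b(?:family[- ]friendly|kid[- ]friendly)\b",
    ("area", "riverside"): r"\b(?:riverside|by the river)\b",
}

def count_errors(mr, text):
    """Count error types for one output; mr maps slot names to values."""
    text = text.lower()
    errors = dict.fromkeys(
        ("missed", "added", "value_errors", "repetitions"), 0)
    for slot in {s for s, _ in PATTERNS}:
        hits = {value: len(re.findall(pattern, text))
                for (s, value), pattern in PATTERNS.items() if s == slot}
        expected = mr.get(slot)
        right = hits.get(expected, 0)
        wrong = sum(n for value, n in hits.items() if value != expected)
        if expected is None:
            errors["added"] += 1 if wrong else 0  # slot not in the MR
        elif right == 0:
            # Slot realised with another value, or not realised at all.
            errors["value_errors" if wrong else "missed"] += 1
        else:
            errors["repetitions"] += right - 1
    return errors

def slot_error_rate(examples):
    """SER over (mr, text) pairs: all errors divided by all MR slots."""
    totals, n_slots = Counter(), 0
    for mr, text in examples:
        totals.update(count_errors(mr, text))
        n_slots += len(mr)
    return sum(totals.values()) / n_slots

examples = [
    ({"priceRange": "high", "area": "riverside"},
     "An expensive place by the river."),
    ({"priceRange": "high", "familyFriendly": "yes"},
     "A cheap venue."),
    ({"area": "riverside"},
     "A riverside cafe on the riverside that is family friendly."),
]
ser = slot_error_rate(examples)  # 4 errors over 5 slots
```

The second example contributes one value error and one missed slot, the third one repetition and one added slot, so the toy SER is 4/5.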

Both evaluations show that template- and rule-based systems, where MR coverage is implied by the architecture, mostly score high in this regard. However, FORGe3, which uses template mining from training data, scores below average; here, some amount of noise was probably carried over from training data. TUDA, on the other hand, scores high in human ratings and even achieved a perfect score by the automatic script (100% perfect coverage), but this is partly due to its low diversity (cf. Section 8.2) – all its templates are probably covered well by the patterns. The results also show that some data-driven systems are able to achieve very good coverage (especially Sheff1, Gong and Slug, with SER estimates below 1.5%), which confirms the efficacy of their respective semantic control approaches (see Section 6.2). Seq2seq systems without reranking (Chen, Sheff2, Zhang, Adapt, TR1) score near the bottom of the list in both evaluations.

Both estimates also indicate that missing information is the most common type of problem; added (hallucinated) information occurs less frequently but still poses a serious problem for utterance generation in task-based dialogue systems. [Footnote 28: Note that this problem appears to be more general since it has also been reported in machine translation Koehn and Knowles (2017).] It also appears that both problems are connected – systems hallucinating less frequently tend to miss information more often.

Finally, the scores show that attempts at diversity may hurt semantic accuracy. This is most apparent in Adapt, the most diverse system with no explicit semantic control mechanism. Other systems with diverse outputs, FORGe3 and Harv, also score lower on coverage. In the case of FORGe3, this is due to the above-mentioned noise in the mined templates; Harv’s reranking is probably less aggressive than others’. On the other hand, ZHAW1, ZHAW2 and especially Slug-alt produce diverse outputs while maintaining good coverage thanks to their very powerful semantic control mechanisms.

8.6 Winning System

We consider the Slug system Juraska et al. (2018), a seq2seq-based ensemble system, as the overall winner of this challenge. It received high human ratings for both naturalness and quality, as well as for automatic word-overlap metrics. In contrast to vanilla seq2seq systems, Slug improves semantic coverage using a heuristic slot aligner in combination with a data augmentation method producing partially aligned examples, which places it among the top-scoring systems in terms of MR coverage (cf. Section 8.5). Slug’s only drawback is the relatively low output diversity; note that repetitive output is considered to be problematic for task-based dialogue systems. A variant of the same system, Slug-alt, provides much more output diversity at the cost of slightly lower quality ratings and MR coverage; it maintains higher quality and coverage scores than other diversity-attempting approaches.

While the Sheff2 system Chen, Lampouras, and Vlachos (2018), a vanilla seq2seq setup, won in terms of naturalness, it often does not realise all parts of the input MR, which severely affected its quality rating – it placed in the last cluster, ranked 20th–21st out of 21. Sheff2’s outputs also rank very low on complexity and diversity.

Furthermore, the TGen baseline system turned out to be hard to beat. It ranked highest on average in word-overlap-based automatic metrics and placed in the 2nd cluster in both quality and naturalness (ranks 3–6 and 4–8 out of 21, respectively). TGen also fared well (albeit not perfectly) in MR coverage evaluations. On the other hand, TGen only scored in the middle of the pack on output diversity.

8.7 Lessons Learnt and Future Directions

We attempt to formulate some high-level “lessons learnt” for developing future data-driven NLG systems based on the above results, while we acknowledge that our data is limited to a single domain, and that comparisons are not strictly controlled, i.e. models vary in more than one aspect.

  • Semantic control: For seq2seq-based systems, a strong semantic control of the generated content seems crucial – beam reranking based on MR classification or heuristic alignments appears to work well while attention-only models perform poorly on our data. Correct semantics is regarded by users as more important than fluency Reiter and Belz (2009) and should be prioritised when training the models (cf. also Reiter, 2019).

  • Open vocabulary: For limited domains such as ours, delexicalisation of open-set attributes still seems to be the best approach. However, the systems of Harv and NLE show that character-level models and copy mechanisms are viable alternatives. We believe that the low results of Chen, Zhang and Adapt are due to inferior semantic control, not open-vocabulary handling.

  • Complexity and diversity: In general, hand-engineered systems seem to outperform neural systems in terms of output diversity and complexity (see Section 8.2); the most diverse outputs are produced by systems using templates mined from training data and data-driven systems with explicit diversity mechanisms.

    Vanilla seq2seq-based systems produce the least diverse outputs: they are essentially probabilistic language models, which tend to settle for the most frequent phrasing, thus penalising length and favouring high-frequency word sequences. Diversity in seq2seq models can be improved by data selection (Slug-alt), diverse ensembling (Harv) or sampling from the generated beam Wen et al. (2015b). In contrast, hand-engineered system authors can control the output complexity and diversity directly: here, TUDA’s outputs are very repetitive as its set of handcrafted templates is small, while FORGe3 and TR2 with templates mined from data produce some of the most diverse outputs.

    In general, any systems attempting output diversity need to impose strong semantic control mechanisms to maintain MR coverage.

  • Best method suggestion: Rule-based methods work quite well for limited domains, such as ours. Low-effort handcrafting (as in TUDA) may lead to correct but repetitive outputs. Seq2seq models with semantic reranking emerge as the best data-driven option, in combination with controlling for diversity and using copy mechanisms to minimise preprocessing.
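The semantic reranking recommended above can be sketched as follows. This is a minimal illustration, not any participant's implementation: candidates on the beam are penalised by the number of slots they miss or add relative to the input MR. The keyword-based `detect_slots` stands in for a trained MR classifier or a slot aligner, and the penalty weight, slot names and example texts are hypothetical.

```python
# Hypothetical keyword detector standing in for an MR classifier.
KEYWORDS = {
    "food=Italian": "italian",
    "area=riverside": "riverside",
    "familyFriendly=yes": "family",
}

def detect_slots(text):
    """Return the set of slots that appear to be realised in the text."""
    lowered = text.lower()
    return {slot for slot, keyword in KEYWORDS.items() if keyword in lowered}

def rerank(beam, mr_slots, penalty=100.0):
    """Sort (text, log_prob) candidates, best first, by penalised score."""
    def score(candidate):
        text, log_prob = candidate
        # Symmetric difference counts both missed and added slots.
        distance = len(mr_slots ^ detect_slots(text))
        return log_prob - penalty * distance
    return sorted(beam, key=score, reverse=True)

mr = {"food=Italian", "area=riverside"}
beam = [
    ("Aromi serves Italian food.", -1.0),                        # misses area
    ("Aromi serves Italian food in the riverside area.", -2.5),  # full coverage
    ("Aromi is a family friendly Italian place in the riverside area.", -2.0),
]
best_text, best_log_prob = rerank(beam, mr)[0]
```

With a large penalty, the fully covering candidate wins even though the decoder assigned it the lowest probability; a smaller penalty trades coverage against model confidence.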

9 Conclusion

This paper presents the findings of the first shared task on End-to-End Natural Language Generation for Spoken Dialogue Systems. The aim of this challenge was to assess the capabilities of recent end-to-end, fully data-driven NLG systems, which can be trained from pairs of input meaning representations and corresponding texts, without the need for fine-grained semantic alignments.

As part of this challenge, we have created a novel dataset for NLG benchmarking in the restaurant information domain, which is an order of magnitude bigger than any previous publicly available dataset for task-oriented NLG. We also provided one of the previous state-of-the-art seq2seq-based NLG systems, TGen Dušek and Jurčíček (2016a), as a baseline for comparison. The challenge received 62 system submissions by 17 different participating institutions. The systems submitted ranged from complex seq2seq-based setups with different additions to the architecture, over other data-driven methods and rule-based systems, to simple template-based ones. We evaluated all the entries in terms of five different automatic metrics. 20 primary submissions (as identified by the participants) were further evaluated using a novel, crowdsourced evaluation setup. We also include a novel comparison of systems in terms of automatic textual metrics aimed at assessing output complexity and diversity. Our evaluation lets us include several general recommendations for future NLG system development.

In general, seq2seq-based systems produce very similar outputs (as measured by word-overlap, cf. Section 8.3), despite their different implementations. Seq2seq models tend to score high on word-overlap metrics and human evaluations of naturalness, while the scores for other data-driven, rule-based and template-based systems are lower. However, these other types of systems often score better in human evaluations of the overall quality. While the winning Slug system is seq2seq-based, the results also demonstrated possible pitfalls of using seq2seq models:

  1. Vanilla seq2seq models tend to produce short outputs of low diversity and syntactic complexity. Low diversity is especially problematic since it causes repetitive outputs in spoken dialogue systems.

  2. Applying a strong semantic control mechanism during decoding is crucial to preserve the input meaning. The most common semantic mistake for systems is to miss out information. However, added information (hallucinations) is also closely linked. Both types of errors can have severe consequences for task-based dialogue systems, depending on the application domain.

  3. Addressing these issues is challenging: Attempts to improve diversity can often result in lowered semantic accuracy and/or output naturalness.

In comparison, hand-engineered systems tend to produce more complex and diverse outputs and are able to reach high overall quality, but are mostly rated low on naturalness. Note that similar findings have been reported by Wiseman, Shieber, and Rush (2017) for data-to-document generation. This raises the general question regarding efficiency, costs, and performance of purely data-driven versus carefully hand-engineered NLG systems.

To facilitate further research in this domain, we have made the following data and tools freely available for download:

  • The E2E NLG training dataset (including test set with human references),

  • A set of word-overlap-based metrics used for automatic evaluation in the challenge,

  • Outputs of the baseline TGen system for the development set,

  • Outputs for the test set produced by the baseline and all participating systems,

  • The corresponding RankME ratings for quality and naturalness collected in the human evaluation campaign.

All can be accessed under the following URL:


In future work, we aim to investigate additional evaluation methods for NLG systems, such as post-edits Sripada, Reiter, and Hawizy (2005), or extrinsic evaluation, such as NLG’s contribution to task success, e.g. Rieser, Lemon, and Keizer (2014); Gkatzia, Lemon, and Rieser (2016). We also intend to continue our work on automatic quality estimation for NLG Dušek, Novikova, and Rieser (2017), where the large amount of data obtained in this challenge allows a wider range of experiments than previously possible.

This research received funding from the EPSRC projects DILiGENt (EP/M005429/1) and MaDrIgAL (EP/N017536/1) and Charles University project PRIMUS/19/SCI/10. The Titan Xp used for this research was donated by the NVIDIA Corporation. The authors would like to thank Lena Reed and Shereen Oraby for help with computing the slot error rate. We would also like to thank Prof. Ehud Reiter, whose blog (https://ehudreiter.com/) inspired some of this research.