This paper provides a detailed report and analysis of the first shared task on End-to-End (E2E) Natural Language Generation (NLG). Shared challenges have become an established way of pushing research boundaries in the field of Natural Language Processing, with NLG benchmarking tasks running since 2007 (Belz and Gatt, 2007). These previous shared tasks have demonstrated that large-scale, comparative evaluations are vital for identifying future research challenges in NLG (Belz and Hastie, 2014).
The E2E NLG shared task is novel in that it poses new challenges for recent end-to-end, data-driven NLG systems. This type of system promises rapid development of NLG components in new domains with reduced annotation effort: such systems jointly learn sentence planning and surface realisation from non-aligned data, e.g. Dušek and Jurčíček (2015); Wen et al. (2015b); Mei, Bansal, and Walter (2016); Wen et al. (2016); Sharma et al. (2016a); Dušek and Jurčíček (2016a); Lampouras and Vlachos (2016). As such, these approaches do not require costly semantic alignment between meaning representations (MRs) and the corresponding natural language reference texts (also referred to as “ground truths” or “targets”); instead, they are trained on parallel datasets, which can be collected in sufficient quality and quantity using effective crowdsourcing techniques, e.g. Novikova, Lemon, and Rieser (2016).
So far, end-to-end approaches to NLG have been limited to small, delexicalised datasets, e.g. BAGEL (Mairesse et al., 2010), SF Hotels/Restaurants (Wen et al., 2015b), or RoboCup (Chen and Mooney, 2008). Therefore, end-to-end methods have not been able to replicate the rich dialogue and discourse phenomena targeted by previous rule-based and statistical approaches for language generation in dialogue, e.g. (Walker et al., 2004; Stent, Prasad, and Walker, 2004; Mairesse and Walker, 2007; Rieser and Lemon, 2009). In this paper, we describe a large-scale shared task based on a new crowdsourced dataset of 50k instances in the restaurant domain (see Section 3). We show that the dataset poses new challenges, such as open vocabulary, complex syntactic structures and diverse discourse phenomena, as described in Section 4. Our shared task aims to assess whether the novel end-to-end NLG systems are able to produce more complex outputs given a larger and richer training dataset.
We received 62 system submissions by 17 institutions from 11 countries for the E2E NLG Challenge, with about a third of these submissions coming from industry, as summarised in Section 5. We consider this level of participation an unexpected success, which underlines the timeliness of this task (note that, in comparison, the well-established Conference on Machine Translation WMT’17, running since 2006, had 31 institutions submitting to a total of 8 tasks; Bojar et al., 2017) and allows us to reach general conclusions and issue recommendations on the suitability of different methods. We analyse how the submitted systems address the challenges posed by the dataset in Section 6, and we evaluate the submitted systems by comparing them to a challenging baseline using automatic evaluation metrics (including novel text-based measures) as well as human evaluation (see Section 7). Note that, while there are previous studies comparing a limited number of end-to-end NLG approaches (Novikova et al., 2017; Wiseman, Shieber, and Rush, 2017; Gardent et al., 2017a), this is the first research to evaluate novel end-to-end generation at scale using human assessment.
Our results in Section 8 show a discrepancy between data-driven seq2seq models and template- and rule-based systems. While seq2seq models generally score high on word-overlap similarity measures and human rankings of naturalness, manually engineered systems score better than some seq2seq systems in terms of overall quality, as well as diversity and complexity of generated outputs. In Section 9, we conclude by laying out challenges for future shared tasks in this area. We also release a new dataset of 36k system outputs paired with user ratings, which will enable novel research on automatic quality estimation for NLG (Specia, Raj, and Turchi, 2010; Dušek, Novikova, and Rieser, 2017; Ueffing, Camargo de Souza, and Leusch, 2018; Kann, Rothe, and Filippova, 2018; Tian, Douratsos, and Groves, 2018). All data and scripts associated with the challenge, as well as technical descriptions of participating systems, are available at the following URL:
This journal article summarises our previous work (Novikova, Lemon, and Rieser, 2016; Novikova and Rieser, 2016; Novikova et al., 2017; Novikova, Dušek, and Rieser, 2017; Dušek, Novikova, and Rieser, 2018) and extends it by including a corrected and substantially extended evaluation of the training dataset, providing an exhaustive analysis of results including novel metrics, as well as a more detailed description of all participating systems with example outputs. This allows us to reach more in-depth insights into the strengths and weaknesses of end-to-end generation systems. We furthermore provide a more comprehensive literature review and discuss directions for future work with respect to end-to-end generation, as well as NLG evaluation in general. Finally, this paper accompanies a release of all the participating systems’ outputs on the test set along with the human ratings collected in the evaluation campaign.
2 Domain and Task
Table 1: Attributes used in our domain, with data types and example values.

Attribute         Data Type        Example value
name              verbatim string  The Eagle, …
eatType           dictionary       restaurant, pub, …
familyFriendly    boolean          Yes / No
priceRange        dictionary       cheap, expensive, …
food              dictionary       French, Italian, …
near              verbatim string  market square, Cafe Adriatic, …
area              dictionary       riverside, city center, …
customerRating    enumerable       1 of 5 (low), 4 of 5 (high), …

Figure 1: An example MR-reference pair.

MR:         name[The Wrestlers], priceRange[cheap], customerRating[1 of 5]
reference:  The Wrestlers offers competitive prices, but isn’t rated highly by customers.
In general, the task of NLG is to convert an input MR into a natural language utterance consisting of one or more sentences. In this paper, we focus on the case where an end-to-end data-driven generator is trained from simple pairs of MRs and reference texts, without fine-grained alignments between elements of the MR and words or phrases in the reference texts, as in, e.g. Dušek and Jurčíček (2015); Wen et al. (2015b). An example pair of an MR and a reference text is shown in Figure 1. We focus on restaurant recommendations in our experiments, a domain that has been widely explored in dialogue systems research, e.g. Young et al. (2010); Henderson, Thomson, and Williams (2014); Wen et al. (2017). However, our E2E dataset is substantially bigger and more complex than previous NLG training datasets for this domain (Mairesse et al., 2010; Wen et al., 2015b) (see Section 4), which allows us to assess whether NLG systems are able to learn to produce more varied and complex utterances given enough training examples (cf. Section 8).
For the input representation, we use a format commonly found in task-oriented domain-specific spoken dialogue systems – unordered sets of attributes (slots) and their values, e.g. Mairesse et al. (2010); Young et al. (2010); Liu and Lane (2016). Most dialogue systems also include a general intent of the utterance, such as inform, confirm, or request (Young et al., 2010; Wen et al., 2015b; Liu and Lane, 2016); since our task is focussed on recommendations, this intent would be recommend/inform for all our data, and we can therefore disregard it. The list of possible attributes used in the MRs in our dataset, with example values, is shown in Table 1.
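To make the format concrete, this unordered attribute-value representation is easy to parse mechanically. The sketch below (the function name and regular expression are ours, not part of any challenge tooling) converts a textual MR into a Python dictionary:

```python
import re

def parse_mr(mr: str) -> dict:
    """Parse a textual MR such as
    "name[The Wrestlers], priceRange[cheap], customerRating[1 of 5]"
    into an attribute-value dictionary."""
    # Each attribute is a word followed by its value in square brackets.
    return dict(re.findall(r'(\w+)\[([^\]]*)\]', mr))

mr = "name[The Wrestlers], priceRange[cheap], customerRating[1 of 5]"
print(parse_mr(mr))
# {'name': 'The Wrestlers', 'priceRange': 'cheap', 'customerRating': '1 of 5'}
```

Since the sets of attributes are unordered, a dictionary (rather than a list) is a natural target representation.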
3 Data Collection Procedure
In order to maximise the chances of data-driven end-to-end systems producing high-quality output, we aimed to provide training data in sufficient quality and quantity, and turned to crowdsourcing to collect it. We used the CrowdFlower platform (renamed to FigureEight after our study was completed; see https://www.figure-eight.com/) to recruit workers. Previously, crowdsourcing has mainly been used for evaluation in the NLG community, e.g. Rieser, Lemon, and Keizer (2014); Dethlefs et al. (2012). However, recent efforts in corpus creation via crowdsourcing have proven successful in related tasks. For example, Zaidan and Callison-Burch (2011) showed that, given appropriate quality control methods, crowdsourcing can result in datasets of comparable quality to those created by professional translators. Mairesse et al. (2010) demonstrate that crowd workers can produce aligned natural language descriptions from abstract MRs for NLG, a method which has also shown success in related NLP tasks, such as spoken dialogue systems (Wang et al., 2012) or semantic parsing (Wang, Berant, and Liang, 2015). More recently, data-driven NLG systems, such as Wen et al. (2015a) and Dušek and Jurčíček (2016), have relied on crowdsourcing for collecting training data.
When crowdsourcing corpora for training NLG systems, i.e. eliciting natural language paraphrases for given MRs from workers, two main challenges arise:
(1) How can we ensure the required quality of the collected data?
(2) What types of meaning representations can elicit spontaneous, natural and varied data from crowd workers?
In an attempt to address both challenges before collecting the main training dataset for the E2E NLG Challenge, we ran a small-scale pre-study, published in Novikova, Lemon, and Rieser (2016). We briefly summarise the results of this study in this section and apply the successful techniques to the whole dataset.
For the pre-study, we prepared a subset of 75 distinct MRs, consisting of three, five or eight attributes from our domain (see Table 1) and their corresponding values, in order to evaluate MRs of different complexity. The attributes were selected at random, but we excluded MRs that do not contain the attribute name, as these would not be appropriate for a venue recommendation. We then implemented several automatic validation procedures for filtering the crowdsourced data in order to address (1), see Section 3.1. To address (2), we explored the trade-off between the semantic expressiveness of the MR and the quality of crowdsourced utterances elicited for the different semantic representations. In particular, we investigated translating MRs into pictorial representations as used, e.g., by Williams and Young (2007) and Black et al. (2011) for evaluating spoken dialogue systems (see Section 3.2). In the remainder of this section, we first describe the detailed setup used to crowdsource our data (Section 3.3) and then evaluate the pre-study by comparing pictorial MRs to the text-based MRs used by previous crowdsourcing work (Mairesse et al., 2010; Wang et al., 2012) in Section 3.4.
3.1 Automatic Validation Measures
We used two simple methods to check the quality of crowd workers on CrowdFlower. First, we only selected workers who are likely to be native speakers of English, following Sprouse (2011) and Callison-Burch and Dredze (2010): we use IP addresses to ensure that workers are located in one of three English-speaking countries – Canada, the United Kingdom, or the United States. In addition, we included the requirement that “Participants must be native speakers of British or American English” both in the caption of the task listed on CrowdFlower and in the task instructions. Second, we check whether workers spend at least 20 seconds completing a page of work. This is a standard CrowdFlower option to control the quality of contributions; it ensures that a contributor is removed from the job if they complete the task too fast.
We check that the ready-to-submit utterance contains only legal characters, i.e. letters, numbers and the symbols “, ’ . : ; £”.
We check whether the submitted text is not shorter than a required minimal length, which approximates the total number of characters used for all attribute values in the given MR, as calculated by Eq. 1:

    min. length = # MR characters − # MR attributes × ℓ_attr    (1)

Here, # MR characters is the total number of characters in the given MR; # MR attributes is the number of attributes in the given MR; and ℓ_attr is the average length of an attribute name plus the two associated square brackets.
We check that workers do not submit the same utterance several times.
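The utterance-level checks above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the original CrowdFlower configuration; in particular, the character whitelist and the attribute-length constant are our assumptions:

```python
import re

# Assumed average length of an attribute name plus its two square brackets;
# the exact constant is not given in the text.
ATTR_LEN = 12

def legal_characters(text):
    """Check 1: the utterance contains only legal characters
    (letters, numbers, whitespace and a small symbol whitelist)."""
    return re.fullmatch(r"[A-Za-z0-9\s\"'.,:;£]*", text) is not None

def long_enough(text, mr):
    """Check 2 (Eq. 1): the text must not be shorter than the MR length
    minus the characters taken up by attribute names and brackets."""
    n_attributes = mr.count('[')
    return len(text) >= len(mr) - n_attributes * ATTR_LEN

def no_duplicates(utterances):
    """Check 3: a worker must not submit the same utterance several times."""
    return len(utterances) == len(set(utterances))

mr = "name[The Eagle], food[French], priceRange[cheap]"
text = "The Eagle serves cheap French food."
print(legal_characters(text) and long_enough(text, mr))  # True
```

A submission failing any of these checks would be rejected before reaching the dataset.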
By manually checking a small number of initial trial tasks, we ensured that these automatic validation methods were able to correctly identify and reject 100% of bad submissions.
3.2 Meaning Representations: Pictures and Text
In previous crowdsourcing tasks involving MRs, these were typically presented to workers in a textual form of dialogue acts Young et al. (2010), such as the following:
However, there is a limit to the semantic complexity that crowd workers can handle when using this type of textual/logical description of dialogue acts (Mairesse et al., 2010). Also, Wang et al. (2012) observed that the chosen semantic formalism influences the workers’ language, i.e. crowd workers are primed by the words/tokens and ordering used in the MR. Therefore, in contrast to previous work (Mairesse et al., 2010; Wen et al., 2015a; Dušek and Jurčíček, 2016), we explore different modalities of meaning representation:
Textual/logical MRs appear as a list of comma-separated attribute-value pairs, where attribute values are shown in square brackets after each attribute (see Figures 1 and 2). The order of attributes is randomised so that crowd workers are not primed by the ordering used in the MRs Wang et al. (2012).
Pictorial MRs are semi-automatically generated pictures with a combination of icons corresponding to the individual attributes (see Figure 2). The icons are located on a background showing a map of a city, which makes it possible to represent the meaning of the attributes area and near.
Figure 2: Example MRs used in the pre-study.

1. name[Loch Fyne], eatType[restaurant], familyFriendly[yes], priceRange[cheap], food[Japanese]
2. name[The Wrestlers], familyFriendly[No], area[riverside], food[Italian], customerRating[5 of 5], priceRange[expensive], near[Cafe Adriatic], eatType[restaurant]
3.3 Data Collection Setup
We set up the data collection tasks on the CrowdFlower platform, using the automatic checks described in Section 3.1 and using both pictorial and textual MRs as input (see Section 3.2). For this pre-study, we collected 1133 distinct utterances from the 75 distinct MRs we prepared; 744 utterances were elicited using the textual MRs and 498 using the pictorial MRs. The data collected in the pre-study are freely available for download at https://github.com/jeknov/INLG_16_submission (they are not part of the final E2E NLG dataset). We later used the same CrowdFlower setup to collect the whole E2E NLG dataset (see Section 4).
In terms of financial compensation, crowd workers received the standard pay on CrowdFlower, which is $0.02 per page (each page contained one MR). Workers were expected to spend about 20 seconds per page and were allowed to complete up to 20 pages, i.e. create utterances for up to 20 MRs. In their study of financial incentives on Mechanical Turk, Mason and Watts (2010) found, counter-intuitively, that increasing the amount of compensation for a particular task does not tend to improve the quality of the results. Furthermore, Callison-Burch and Dredze (2010) observed that there can be an inverse relationship between the amount of payment and the quality of work, because it may be more tempting for crowd workers to cheat on high-paying tasks if they do not have the skills to complete them. Following these findings, we did not increase the payment for our task above the standard level.
3.4 Results and Discussion
We analysed the collected natural language reference texts, focussing on textual versus pictorial MRs and their effects on objective measures, such as time taken to collect the data and length of an utterance, and human evaluations of the reference texts collected under the different conditions. Results in full detail can be found in Novikova, Lemon, and Rieser (2016); here we only summarise the main findings. The data analysis showed that:
There is no significant difference in the time taken to collect data with pictorial vs. textual MRs.
The average length of a collected reference text, both in terms of number of characters and number of sentences, depends mainly on the number of attributes associated with the MR, rather than on whether pictures or text were used.
Compared to textual MRs, pictorial MRs elicit texts that are significantly less similar to the underlying MR in terms of semantic text similarity Han et al. (2013). We assume that this is because pictorial MRs are less likely to prime the crowd workers in terms of their lexical choices.
The human evaluation revealed that reference texts produced from pictorial MRs were rated as significantly more informative than those produced from textual MRs. Equally, utterances produced from pictorial MRs were considered significantly more natural and better phrased than utterances collected with textual MRs (see Novikova, Lemon, and Rieser (2016) for definitions of informativeness, naturalness and phrasing).
This shows that pictorial MRs have specific benefits for the elicitation of NLG data from crowd workers. This may be because the lack of priming by lexical tokens in the MRs leads crowd workers to produce more spontaneous and natural language, with more variability. As a concrete example of this phenomenon from the collected data, consider the first MR in Figure 2. The textual version of this MR elicited utterances such as “Loch Fyne is a family friendly restaurant serving cheap Japanese food.”, whereas the pictorial MR elicited e.g. “Serving low cost Japanese style cuisine, Loch Fyne caters for everyone, including families with small children.”
Pictorial stimuli have also been used in other, related NLP tasks, such as crowdsourced evaluations of dialogue systems, e.g. Williams and Young (2007); Black et al. (2011). Williams and Young (2007), for example, used pictures to set dialogue goals for users (e.g. to find an expensive Italian restaurant in the town centre). However, no analysis was performed regarding the suitability of such representations. This experiment therefore has a bearing on the general issue of human natural language responses to pictorial task stimuli, and shows for example that pictorial task presentations can elicit more natural variability in user inputs to a dialogue system.
Of course, there is a limit to the complexity of meaning that pictures can express, and we observed that pictorial MRs tend to introduce more noise. In particular, crowd workers tend to omit information, such as eatType=restaurant, which is particularly hard to visualise. Finally, producing pictorial MRs is a semi-automatic process, which is expensive to run at large scale.
Based on these findings, we decided to use pictorial MRs to collect 20% of the full dataset and textual MRs for the rest of the data in order to keep noise and production costs low while increasing diversity. To further increase the data quality and diversity, we collected multiple references per MR to help NLG systems deal with potential noise in the data.
4 The E2E NLG Dataset
Using the procedure described in Section 3, we crowdsourced a large dataset of 50k instances in the restaurant domain (Novikova, Dušek, and Rieser, 2017). Our dataset is substantially bigger than previous NLG datasets for dialogue in the restaurant domain, i.e. BAGEL (Mairesse et al., 2010) and SF Restaurants (SFRest; Wen et al., 2015b), which typically only allowed delexicalised data-driven end-to-end approaches (see Section 4.1). In addition, we demonstrate that our data is also more challenging given its lexical richness, syntactic complexity and diverse discourse phenomena. Following an approach suggested by Perez-Beltrachini and Gardent (2017), we describe these different dimensions of our dataset and compare them to the BAGEL and SFRest datasets in Sections 4.2 and 4.3. The particular versions of the BAGEL and SFRest datasets used for this research are available from http://farm2.user.srcf.net/research/bagel/ and https://www.repository.cam.ac.uk/handle/1810/251304, respectively.
To ensure a fair comparison, we analyse both fully lexicalised and delexicalised versions of all datasets. The lexicalised references in all datasets contain full natural language texts including all restaurant names. This is the default form for the E2E set; small postprocessing steps were taken for the other two sets to achieve a compatible format: the BAGEL texts are partially delexicalised by default, so we lexicalised them, and SFRest texts were detokenised and adverb/plural markers were postprocessed, e.g. “restaurant -s” changed to “restaurants”. To obtain the delexicalised versions, we replaced most slot values from open sets that appear verbatim in the data – restaurant names, area names, addresses, and numbers – with placeholders (e.g. “X-slot”). This included slot values for name and near in the E2E dataset; name, near, phone, address, postcode, count and area in the SFRest dataset; and name, near, addr, phone, postcode and area in the BAGEL set. For BAGEL, the values citycentre and riverside were excluded from delexicalisation as they do not always appear verbatim in the data. The delexicalised version of BAGEL is equivalent to how the dataset is distributed by default. SFRest would allow even more delexicalisation in practice – food types and price ranges also appear verbatim in the references. We decided to keep these values lexicalised since they are not from open sets and the two other datasets do not allow for easy delexicalisation in this case.
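A minimal sketch of this delexicalisation step follows. The placeholder naming scheme and function name are ours; a real pipeline would also need to handle case and tokenisation differences:

```python
def delexicalise(text, mr, open_slots=('name', 'near')):
    """Replace open-set slot values that appear verbatim in the text
    with placeholders (slot list follows the E2E setup; for SFRest or
    BAGEL the list of open slots would be longer)."""
    for slot in open_slots:
        value = mr.get(slot)
        if value:
            text = text.replace(value, 'X-' + slot)
    return text

mr = {'name': 'The Wrestlers', 'near': 'Cafe Adriatic'}
print(delexicalise('The Wrestlers is near Cafe Adriatic.', mr))
# X-name is near X-near.
```

Delexicalisation of this kind is what allowed earlier end-to-end systems to cope with small training sets, since placeholders collapse many surface variants into one.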
Since the E2E and BAGEL datasets contain only restaurant recommendations, i.e. cases where the system is providing information (inform dialogue acts), whereas SFRest also includes system questions, confirmations, and greetings, we also created a subset of SFRest dubbed SFRest-inf with only inform instances for a fairer comparison.
We processed the datasets using the MorphoDiTa part-of-speech tagger Straková, Straka, and Hajič (2014) to identify tokens, words (as opposed to punctuation tokens) and sentence boundaries. We used the same tagger to preprocess our data for lexical and syntactic complexity analysis.
                                          E2E      SFRest
Unique delexicalised MRs                5,963         733
Total tokens in all references      1,166,000      49,081
Total words in all references       1,051,093      44,338
Total delex. words in all references  957,205      37,758
Slots per MR                                –        2.63
References per MR                           –        1.91
Tokens per reference                        –        9.45
Words per reference                         –        8.54
Delexicalised words per reference           –        7.27
Sentences per reference                     –        1.05
Tokens per sentence                         –        8.97
Words per sentence                          –        8.11
Delexicalised words per sentence            –        6.90
Table 2 summarises the main size statistics of all three datasets, plus the inform-only portion of SFRest. The E2E dataset is significantly larger than the other sets in terms of the total number of different MRs, the total number of data instances (i.e. MR-reference pairs), and especially in terms of the total amount of text in the human references, which is more than 20 times bigger than the next-biggest SFRest.
These differences are even more profound if we consider delexicalisation: almost all MRs in the E2E set are distinct even after delexicalisation, while the number of unique MRs is reduced significantly (by more than half) for the other sets. Delexicalisation also seems to have a less significant effect on the reference texts in the E2E sets than in the other datasets (cf. the number of delexicalised words vs. the total number of words).
The high number of instances directly translates into a higher average number of human references per MR, which is 8.27 for the E2E dataset, as opposed to less than two for the other sets. Note that the Refs/MR ratio for the SFRest dataset is skewed: the goodbye() MR has up to 101 references, but the average is less than 2 references per MR. This is apparent in the SFRest-inf section, which has a much lower maximum number of references.
While having more data with a higher number of references per MR makes the E2E data more attractive for statistical approaches and enables learning more robust models, it is also more challenging than previous sets as it contains a larger number of sentences in the human reference texts (up to 6 in our dataset, with an average of 1.54, compared to typically 1–2 for the other sets, which average below 1.1). The sentences themselves are also longer than in the other datasets. This is immediately apparent for SFRest or SFRest-inf, which are up to 40% shorter in terms of words and tokens. BAGEL’s sentences are slightly longer than E2E’s on average, but this situation is reversed when the sets are delexicalised. In addition, the input MRs in the E2E dataset are more complex than in the other sets: the average number of slot-value pairs in our set is twice that of SFRest (even if only the more complex inform dialogue acts are considered), and slightly higher than BAGEL.
Table 3: E2E data split.

E2E data part    MRs      References
full dataset     6,039    51,426
The dataset is split into training, validation and test sets (in a 82-9-9 ratio, see Table 3), keeping a similar distribution of MR and reference text lengths. We ensure that MRs in our test set are all previously unseen, i.e. none of them overlaps with training/development sets, even when restaurant names are removed, unlike the SFRest data (cf. Lampouras and Vlachos, 2016).
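The unseen-MR condition can be sketched as follows; the function and slot names are ours, and we assume MRs are represented as attribute-value dictionaries:

```python
def overlapping_mrs(test_mrs, train_mrs, ignore=('name',)):
    """Return test MRs that coincide with a training MR once the
    ignored open-set slots (here: restaurant names) are removed.
    An empty result means all test MRs are previously unseen."""
    def key(mr):
        # Order-independent key over the remaining attribute-value pairs.
        return frozenset((slot, value) for slot, value in mr.items()
                         if slot not in ignore)
    train_keys = {key(mr) for mr in train_mrs}
    return [mr for mr in test_mrs if key(mr) in train_keys]
```

Using a set-valued key makes the check insensitive to attribute ordering, matching the unordered MR format.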
4.2 Lexical Richness
Distinct tokens occurring once             230      116
Distinct bigrams occurring once          2,582    1,376
Distinct trigrams occurring once         6,832    3,628
Lexical sophistication (LS2)             0.428    0.323
Type-token ratio (TTR)                   0.027    0.012
Mean segmental TTR (MSTTR-50)            0.648    0.602
Unigram entropy                              –    6.305
Bigram next-word conditional entropy     2.714    2.594
Trigram next-word conditional entropy    1.463    1.414

Table 4: Lexical complexity and diversity statistics for NLG datasets in the restaurant information domain. Counts for n-grams appearing only once are shown as absolute numbers and proportions of the total number of respective n-grams. Highest values on each line are typeset in bold.
In order to measure various dimensions of lexical richness in the datasets under comparison, we computed statistics on token/unigram, bigram and trigram counts, and we applied the Lexical Complexity Analyser (Lu, 2012), as shown in Table 4. It is clear that our dataset has a much larger vocabulary – twice as large as the second-largest SFRest, and more than five times larger if delexicalised versions of the datasets are considered. This directly translates into the numbers of distinct lemmas and distinct n-grams; the E2E set has almost 10 times more distinct trigrams than SFRest, and over 13 times more in the delexicalised versions. While the proportion of n-grams appearing only once in the set is slightly lower than in the other datasets, it stays relatively high given the dataset size and narrow domain, and poses a challenging task for end-to-end data-driven approaches.
The traditional measure of lexical diversity, the type-token ratio (TTR), is not a good fit in our case, where datasets of different sizes in a narrow domain are compared, because its value is inversely proportional to dataset size. Therefore, we complement TTR with the more robust mean segmental TTR (MSTTR; Lu, 2012), which divides the corpus into successive segments of a given length (50 tokens) and calculates the average TTR over all segments. The higher the MSTTR value, the more diverse is the measured text. Table 4 shows that our dataset has a higher MSTTR value (0.71) than the other sets (0.65). The difference is even more profound if we consider delexicalised versions of the sets and inform-only MRs in the SFRest data – 0.66 vs. 0.55 for SFRest-inf and 0.48 for BAGEL.
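MSTTR is straightforward to compute. A minimal sketch follows; we drop the final partial segment, which is one common convention (the exact handling in Lu's (2012) tool may differ):

```python
def msttr(tokens, segment_length=50):
    """Mean segmental type-token ratio: the average TTR over successive,
    non-overlapping, complete segments of `segment_length` tokens."""
    segments = [tokens[i:i + segment_length]
                for i in range(0, len(tokens) - segment_length + 1,
                               segment_length)]
    # TTR of a segment = number of distinct tokens / segment length.
    return sum(len(set(seg)) / len(seg) for seg in segments) / len(segments)
```

Because every segment has the same length, MSTTR is comparable across corpora of different sizes, unlike plain TTR.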
In addition, we measure lexical sophistication (LS2; Lu, 2012), also known as lexical rareness, calculated as the proportion of lexical word types not on the list of the 2,000 most frequent words generated from the British National Corpus. Table 4 shows that while the E2E data is more sophisticated than SFRest, it is slightly less so than BAGEL. However, LS2 numbers on the delexicalised sets show that this is mainly caused by lexical slot values – the delexicalised E2E dataset is almost twice as sophisticated as both SFRest and BAGEL.
We also computed Shannon text entropy:

    H = − Σ_{x ∈ X} freq(x)/len · log2(freq(x)/len)

Here, X stands for all unique tokens/n-grams, freq stands for the number of occurrences in the text, and len for the total number of tokens/n-grams in the text. We computed entropy over tokens (unigrams), bigrams and trigrams, as shown in Table 4. We can see that the E2E dataset has slightly lower unigram and bigram entropy than SFRest and higher trigram entropy than any other set. However, when delexicalised, the E2E set shows the highest entropy for any n-gram length. Considering that entropy is a logarithmic measure, the difference is substantial for trigrams – 12.1 vs. the closest 10.5 for SFRest, which amounts to about 2.98 times higher uncertainty.
We further complement Shannon text entropy with an n-gram-language-model-style conditional entropy for next-word prediction (Manning and Schütze, 2000, p. 63ff.), given one previous word (bigram) or two previous words (trigram):

    H_cond = − Σ_{(c,w) ∈ X} freq(c,w)/len · log2(freq(c,w)/freq(c))

Here, X stands for all unique n-grams in the text, each composed of a context c (all tokens but the last one) and a last token w. Conditional next-word entropy gives an additional, novel measure of diversity and repetitiveness: the more diverse a text is, the less predictable is the next word given the previous word(s); conversely, the more repetitive the text, the more predictable the next word. The values for all the datasets are again shown in Table 4, and they demonstrate clearly that the E2E data is much more diverse than SFRest or BAGEL. Note also that lexicalisation has a much smaller effect on this measure. In the delexicalised versions, the difference against the closest SFRest (2.446 vs. 1.414) indicates about 2.04 times higher uncertainty for next-word prediction given two previous words.
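Both measures can be computed directly from n-gram counts. The following sketch is our own implementation (in bits) of the two measures just described, not the scripts used in the study:

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy over item frequencies, in bits.
    Pass a list of tokens, or a list of n-gram tuples for n-gram entropy."""
    counts = Counter(items)
    total = len(items)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def cond_next_word_entropy(tokens, order=3):
    """Conditional next-word entropy given order-1 previous words:
    H = -sum over n-grams (c, w) of freq(c,w)/len * log2(freq(c,w)/freq(c))."""
    ngrams = [tuple(tokens[i:i + order]) for i in range(len(tokens) - order + 1)]
    counts = Counter(ngrams)
    contexts = Counter(ng[:-1] for ng in ngrams)  # freq of each context c
    total = len(ngrams)
    return -sum(c / total * math.log2(c / contexts[ng[:-1]])
                for ng, c in counts.items())
```

A fully repetitive text (where each context determines the next word) yields a conditional entropy of zero, while diverse continuations push the value up.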
4.3 Syntactic Complexity
We used the D-Level Analyser (Lu, 2009) to evaluate the syntactic complexity of human references in our data using the revised D-Level Scale (Covington et al., 2006). We used the syntactic constituency parser of Collins (1997) to preprocess the sentences for the D-Level Analyser (we used the Model 2 variant of the parser, as instructed by the D-Level Analyser website at http://www.personal.psu.edu/xxl13/downloads/d-level.html). The D-Level scale has eight levels of syntactic complexity, where levels 0 and 1 include simple or incomplete sentences and higher levels include sentences with more complex structures, e.g. sentences joined by a subordinating conjunction, more than one level of embedding, etc. Figure 3 shows the D-Level distribution in all three datasets.
The largest proportion of the datasets is composed of simple sentences (levels 0 and 1), but the proportion of simple texts is much lower for the E2E NLG dataset (46%) compared to others (59-66%). Examples of simple sentences in our dataset include: “The Vaults is an Indian restaurant”, or “The Loch Fyne is a moderate priced family restaurant”.
The majority of our data, however, contains more complex, varied syntactic structures, including phenomena explicitly modelled by early statistical approaches to NLG (Stent, Prasad, and Walker, 2004; Walker et al., 2004). For example, clauses may be joined by a coordinating conjunction (level 2), e.g. “Cocum is a very expensive restaurant but the quality is great”. There are 14% level-2 sentences in the E2E dataset; BAGEL only has 7% and SFRest 9%, but inform MRs in SFRest contain a proportion similar to our set. Level-3 sentences in our domain are mainly those with object-modifying relative clauses, e.g. “There is a pub called Strada which serves Italian food.” The E2E dataset contains 18% level-3 sentences, similar to BAGEL but more than SFRest’s 12% (13% in inform MRs). Levels 4 and 5 are not very frequent in any of the datasets. Sentences may contain verbal gerund (-ing) phrases (level 4), either in addition to previously discussed structures or separately, e.g. “The coffee shop Wildwood has fairly priced food, while being in the same vicinity as the Ranch” or “The Vaults is a family-friendly restaurant offering fast food at moderate prices”. Subordinate clauses are marked as level 5, e.g. “If you like Japanese food, try the Vaults”.
The highest levels of syntactic complexity involve sentences containing referring expressions (“The Golden Curry provides Chinese food in the high price range. It is near the Bakers”), non-finite clauses in adjunct position (“Serving cheap English food, as well as having a coffee shop, the Golden Palace has an average customer rating and is located along the riverside”) or sentences with multiple embedded structures from previous levels. As Figure 3 shows, our dataset has a substantially higher proportion of level-6-7 sentences – 15%, compared to 7% for BAGEL and 8% for SFRest (11% in inform MRs).
On average, sentences in the E2E dataset are much more syntactically complex than in the other datasets under comparison: the mean D-Level for E2E data is 2.17, compared to BAGEL’s 1.32 and SFRest’s 1.25 (1.57 for inform-only MRs).
4.4 Attribute Coverage
Our crowd workers were asked to verbalise all information from the MR; however, they were not penalised if they skipped an attribute (cf. Section 3.4). This makes generating text from our dataset more challenging, as NLG systems need to deal with a certain amount of noise, i.e. attributes that are not verbalised in the human reference texts. To measure the extent of this phenomenon, we examined a random sample of 50 MR-reference pairs in each of the three datasets under comparison. An MR-reference pair was considered “fully covered” if all attribute values present in the MR were verbalised in the reference. It was marked as “additional content” if the reference contained information not present in the MR, and as “missing content” if the MR contained information not present in the reference.
The results of our sample probe in Table 5 indicate that roughly 40% of our data contains either additional or omitted information. In order to help NLG systems account for this variation, we collected multiple references per MR (also see Table 2).
This variation often concerns the attribute-value pair eatType=restaurant, which is either omitted (“Loch Fyne provides French food near The Rice Boat. It is located in riverside and has a low customer rating”) or added in case eatType is absent from the MR (“Loch Fyne is a low-rating riverside French restaurant near The Rice Boat”). (Note that the inclusion of this attribute is mainly due to historical reasons, following SFRest and BAGEL. As discussed in Section 3.4, pictorial MRs might be a possible source of this phenomenon, since the distinction between eatType=restaurant, eatType=pub, etc. is difficult to illustrate.)
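The coverage probe described above can be sketched in code. This is a minimal illustration only: it approximates coverage by naive substring matching, whereas the actual annotation of the 50-pair sample was done by inspection (all function names here are illustrative).

```python
# Sketch of the MR-coverage check: label an MR-reference pair as "fully
# covered" or "missing content" via naive substring matching (assumption:
# the real annotation was manual; this only approximates it).

def parse_mr(mr):
    """Parse an MR string like 'name[Loch Fyne], food[French]' into a dict."""
    pairs = {}
    for chunk in mr.split("],"):
        attr, _, value = chunk.partition("[")
        pairs[attr.strip()] = value.rstrip("]").strip()
    return pairs

def coverage_label(mr, reference):
    """Label an MR-reference pair by checking each MR value in the text."""
    ref = reference.lower()
    missing = [a for a, v in parse_mr(mr).items() if v.lower() not in ref]
    return "fully covered" if not missing else "missing content"

print(coverage_label("name[Loch Fyne], food[French]",
                     "Loch Fyne provides French food."))  # fully covered
```

Detecting "additional content" is harder, since it requires spotting text spans that correspond to no MR attribute; a substring check cannot capture paraphrases such as area=riverside realised as "near the river".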
5 Systems in the Competition
Table 6: Overview of the primary systems in the challenge.

| System | Architecture | Delex. slots | Copy | Semantic control | Data augmentation / diversity |
|---|---|---|---|---|---|
| TGen Novikova, Dušek, and Rieser (2017) | seq2seq (TGen) | name, near | – | MR classification reranking | – |
| Adapt Elder et al. (2018) | seq2seq (OpenNMT-py) | none | ✓ | none | enriching MR by output words |
| Chen Chen (2018) | seq2seq | none | ✓ | attention memory | – |
| Gong Gong (2018) | seq2seq (TGen) | name, near | – | MR classification reranking | – |
| Harv Gehrmann et al. (2018) | seq2seq | none | ✓ | coverage penalty reranking | diverse ensembling |
| NLE Agarwal, Dymetman, and Gaussier (2018) | char seq2seq (tf-seq2seq) | none | – | MR classification reranking | – |
| Sheff2 Chen, Lampouras, and Vlachos (2018) | seq2seq | name, near | – | none | – |
| Slug Juraska et al. (2018) | seq2seq | name, near | – | slot aligner reranking | using sub-MRs and aligned sentences |
| Slug-alt (late submission) Juraska et al. (2018) | seq2seq | name, near | – | slot aligner reranking | using only complex training sentences |
| TNT1 Oraby et al. (2018a) | seq2seq (TGen) | name, near | – | MR classification reranking | using Personage |
| TNT2 Tandon et al. (2018) | seq2seq (TGen) | name, near | – | MR classification reranking | shuffling MRs |
| TR1 Schilder et al. (2018) | seq2seq (tf-seq2seq) | name, near, priceRange, customerRating | – | none | – |
| Zhang Zhang et al. (2018) | sub-word seq2seq | none | – | attention regularisation | – |
| Sheff1 Chen, Lampouras, and Vlachos (2018) | linear classifiers + LOLS | name, near | – | 2-step prediction with slots | using only references with highest average word frequency |
| ZHAW1 Deriu and Cieliebak (2018) | RNN language model | name, near | – | SC-LSTM (semantic gates), MR classification loss + reranking | first word control |
| ZHAW2 Deriu and Cieliebak (2018) | RNN language model | name, near | – | SC-LSTM (semantic gates) | first word control |
| DANGNT Nguyen and Tran (2018) | rule-based | all | – | implied by architecture | – |
| FORGe1 Mille and Dasiopoulou (2018) | grammar | all | – | implied by architecture | – |
| FORGe3 Mille and Dasiopoulou (2018) | templates | all | – | implied by architecture | – |
| TR2 Schilder et al. (2018) | templates | all | – | implied by architecture | – |
| TUDA Puzikov and Gurevych (2018) | templates | all | – | implied by architecture | – |
The initial idea of the E2E NLG Challenge was first presented in Novikova and Rieser (2016). The interest and active participation in the E2E Challenge far exceeded our expectations. We received a total of 62 submitted systems by 17 institutions from 11 countries, with about one third of these submissions coming from industry. In accordance with ethical considerations for NLP shared tasks Parra Escartín et al. (2017), we allowed researchers to withdraw or anonymise their results after obtaining their automatic evaluation metric scores (cf. Section 7.1). Two groups from industry withdrew their submissions and one group asked to be anonymised after obtaining automatic evaluation results. A full list of all the remaining submissions is given in Table 14 in the Appendix (including their automatic metric scores).
We asked each participating team to identify 1-2 primary systems, which resulted in 20 systems by 14 groups. Each primary system is described in a short technical paper (available on the E2E NLG Challenge website, http://www.macs.hw.ac.uk/InteractionLab/E2E/) and was evaluated both by automatic metrics and by human judges (see Section 7). We compare the primary systems to a baseline system we provided ourselves (see Section 5.1). A detailed overview of all the primary systems is given in Table 6. In the following, we describe the systems in terms of their different architectures; see Sections 5.2–5.5.
5.1 Baseline System
Table 7: TGen baseline scores on the development set.

| System | BLEU | NIST | METEOR | ROUGE-L | CIDEr |
|---|---|---|---|---|---|
| TGen (development set) | 0.6925 | 8.4781 | 0.4703 | 0.7257 | 2.3987 |
To establish a baseline on the task data, we use TGen Dušek and Jurčíček (2016a). (TGen is freely available at https://github.com/UFAL-DSG/tgen.) TGen is based on the sequence-to-sequence model with attention (seq2seq) Bahdanau, Cho, and Bengio (2015), an encoder-decoder recurrent neural network (RNN) architecture. In addition to the standard seq2seq model with LSTM cells Hochreiter and Schmidhuber (1997), TGen uses beam search for decoding and an LSTM-based reranker over the top outputs, penalising those outputs that do not verbalise all attributes from the input MR. TGen was previously tested on the BAGEL and SFRest datasets, where it reached state-of-the-art performance (Dušek, 2017, p. 88ff.).
As TGen does not handle unknown vocabulary well, the sparsely occurring string attributes name and near (see Table 1) are delexicalised (see Section 6.1). The main seq2seq model is trained by minimising cross entropy using the Adam algorithm Kingma and Ba (2015) in direct token-by-token generation of surface strings; the reranker is trained to detect the presence of all attributes from the input MR. (We use a learning rate of 0.0005, cell size 50, batch size 20, beam size 10, maximum encoder and decoder lengths of 10 and 80, respectively, and up to 20 passes through the training data with early stopping. The reranker uses the same parameters, except for a higher learning rate of 0.001. See Novikova, Dušek, and Rieser (2017) for more details.) Based on evaluation on the development part of the E2E dataset using automatic metrics (see Table 7), as well as cursory manual checks, TGen appears to be a strong baseline, capable of generating fluent and relevant outputs in most cases.
5.2 Seq2seq-based systems
Systems based on the popular sequence-to-sequence architecture Sutskever, Vinyals, and Le (2014); Bahdanau, Cho, and Bengio (2015) represent the biggest group of systems participating in the challenge (12 out of 20 primary systems). All the seq2seq-based systems use beam search, and most of them further enhance the basic seq2seq architecture in a number of ways.
Several systems are built on top of previous systems and toolkits. A number of systems are based on the TGen baseline and aim to improve it: TNT1 Oraby et al. (2018a) and TNT2 Tandon et al. (2018) use TGen with two different data augmentation techniques (see Section 6.3), while Gong Gong (2018) fine-tunes TGen with the REINFORCE algorithm Williams (1992). Two systems are based on the tf-seq2seq toolkit Britz et al. (2017): NLE Agarwal, Dymetman, and Gaussier (2018) built a character-to-character seq2seq model (using simply the characters of the original MR as inputs), while TR1 Schilder et al. (2018) use a regular word-based model. The Adapt system Elder et al. (2018) is based on OpenNMT-py Klein et al. (2017). It uses pointer networks (a form of copy mechanism Vinyals, Fortunato, and Jaitly (2015)) and a two-step generation process, where the first step enriches the input MR for diversity (see Section 6.3).
Several other systems use custom seq2seq implementations. Slug and Slug-alt Juraska et al. (2018) use an ensemble of two bidirectional LSTM encoders and one convolutional encoder, all paired with an attention LSTM decoder (incl. self-attention). Harv Gehrmann et al. (2018) use a seq2seq model with multiple additions for MR coverage and diversity (see Sections 6.2 and 6.3). Sheff2’s model Chen, Lampouras, and Vlachos (2018), on the other hand, is a vanilla seq2seq setup with LSTM cells. Chen Chen (2018) presents a seq2seq model with a custom-tailored input data representation: 2-part input embeddings, which divide into slot name and value token embeddings. Zhang Zhang et al. (2018) apply a seq2seq model with CAEncoder Zhang et al. (2017), which adds a second layer over a bidirectional encoder with GRU cells Cho et al. (2014), summarising both directional encoders.
5.3 Other data-driven systems
Two groups submitted fully trainable systems that are not based on the seq2seq architecture. First, ZHAW1 and ZHAW2 Deriu and Cieliebak (2018) use an RNN language model with semantically conditioned LSTM (SC-LSTM) cells Wen et al. (2015b) and a 1-hot encoding of input MR slot values. The two system variants differ in the presence of an additional semantic control mechanism (see Section 6.2).
Sheff1 Chen, Lampouras, and Vlachos (2018) is the only non-neural, fully data-driven system submitted to the challenge. It is based on imitation learning using linear classifiers Crammer, Kulesza, and Dredze (2009) in a two-level generation approach, where the classifiers first select the next slot to be realised and then the corresponding word-by-word realisation of that slot Lampouras and Vlachos (2016). The classifiers are trained using the Locally Optimal Learning to Search (LOLS) imitation learning framework Chang et al. (2015), optimising for BLEU, ROUGE-L, and slot error (cf. Section 7.1).
5.4 Rule-based systems
There are two rule-based entries in the E2E challenge: First, the DANGNT system Nguyen and Tran (2018) uses a two-step rule-based setup, where the first step determines the appropriate phrases to use for a delexicalised sentence; the second step selects the appropriate phrases to lexicalise slot values. Second, the FORGe1 system Mille and Dasiopoulou (2018) is a rule-based pipeline using grammars based on the Meaning-Text Theory Mel’čuk (1988). It matches the MR to handcrafted per-slot semantic templates, applies aggregation rules to build sentences, and realises the aggregated sentence structures into surface text.
5.5 Template-based systems
Three entries in the E2E challenge are based on traditional template filling. FORGe3 Mille and Dasiopoulou (2018) and TR2 Schilder et al. (2018) take a very similar approach: They mine templates from data by delexicalising slot values. TUDA Puzikov and Gurevych (2018), on the other hand, uses templates manually designed by the system authors; the templates are not based on the dataset directly, they are only informed by the data.
6 Addressing the Challenges
In this section, we focus on how the competing primary systems address specific challenges posed by the task: vocabulary unseen in training (Section 6.1), control of semantic coverage of the input MR (Section 6.2), and producing diverse outputs (Section 6.3). We also include an overview of alternative approaches to addressing these challenges in Section 6.4.
6.1 Open Vocabulary
All systems in the challenge have a way of addressing the open vocabulary in the data. In closed-domain setups, slot values are usually the only part of the data where open vocabulary occurs, as is the case, e.g., for the name and near slots in our dataset (see Table 1). The common approach to dealing with open vocabulary in NLG systems is delexicalisation (Wen et al., 2015b; see also Section 4), i.e. replacing slot values with placeholders at both training and generation time (both in input MRs and training sentences). This approach is indeed one of the principles of template-based systems; accordingly, all template-based entries in the E2E Challenge use full delexicalisation of all slot values (except, perhaps, the binary-valued familyFriendly; cf. Table 6). Both rule-based systems also perform full delexicalisation.
The data-driven systems submitted to our challenge mostly opt for partial delexicalisation (see Table 6); the prevailing approach is to delexicalise only the values of the name and near slots, which allows for very simple pre- and postprocessing since these values usually appear verbatim in the outputs. (Unlike other slot values: e.g., area=riverside might appear as “near the river”. Cf. also our remarks on delexicalisation in Section 4 and Footnote 9.) TR1 is the only data-driven system to use a stronger delexicalisation, which also includes the priceRange and customerRating slots. Slug and Slug-alt are the only systems to treat values with different morpho-syntactic properties differently (e.g., a value requiring “an” instead of “a” as an article).
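The prevailing name/near delexicalisation can be sketched as a simple pre-/postprocessing pair. This is a minimal illustration under the assumption that the slot values appear verbatim in the text; the placeholder format `X-slot` is our own choice, not necessarily what any particular system uses.

```python
# Minimal sketch of partial delexicalisation: name/near values are replaced
# by placeholders before training/generation and filled back in afterwards.
# Placeholder naming ("X-name", "X-near") is illustrative.

DELEX_SLOTS = ("name", "near")

def delexicalise(text, mr):
    """Replace name/near slot values in a text with placeholders."""
    for slot in DELEX_SLOTS:
        if slot in mr:
            text = text.replace(mr[slot], "X-" + slot)
    return text

def relexicalise(text, mr):
    """Fill placeholders in a generated text back with the MR values."""
    for slot in DELEX_SLOTS:
        if slot in mr:
            text = text.replace("X-" + slot, mr[slot])
    return text

mr = {"name": "The Vaults", "near": "Café Adriatic"}
delex = delexicalise("The Vaults is near Café Adriatic.", mr)
print(delex)                    # X-name is near X-near.
print(relexicalise(delex, mr))  # round-trips to the original sentence
```

This simplicity is exactly why the approach only works well for slots whose values appear verbatim; values that get paraphrased (like area=riverside above) would need fuzzier matching.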
Five of the seq2seq systems in the challenge opted for using no delexicalisation and employ alternative ways of addressing open vocabulary: Adapt, Chen and Harv use a copy mechanism (cf. Section 5.2), which allows the system to copy some of the tokens from the input instead of generating them anew. Zhang operates over sub-word units instead of words; these are determined by the byte-pair encoding algorithm and can combine to create previously unseen words Sennrich, Haddow, and Birch (2016). NLE’s seq2seq system operates on the character level.
6.2 Semantic Control
Most of the participating systems explicitly attempt to realise all slots and thus cope with the noise in the training data (cf. Section 4.4). Full realisation is implied for template- and rule-based systems, as the templates and rules always relate to specific slots and are chosen based on the slots in the input MR. On the other hand, vanilla seq2seq systems have no way of controlling whether all input slots have been realised. While attention models Bahdanau, Cho, and Bengio (2015) certainly have an influence on this, they are not explicitly trained to attend exactly once to each slot in a vanilla seq2seq setup. Therefore, most seq2seq systems include an additional tool checking which parts of the input MR are realised in the output (cf. Table 6).
The most frequent approach among the E2E submissions is an MR classification reranker Dušek and Jurčíček (2016a). Here, the generator first produces multiple outputs using beam search; these are then tested for the presence of all slots from the input MR, and deviations from the input are penalised. Apart from the TGen baseline (which uses an RNN MR classifier, see Section 5.1), this approach is also taken by all systems based on TGen (TNT1, TNT2, Gong), as well as by NLE, which uses a logistic regression classifier. Slug and Slug-alt apply a very similar approach: they use a heuristic slot aligner (trained on words and phrases from the training data and WordNet) to align outputs to the input MR and penalise any unaligned slots. Harv do not build a separate classifier or aligner, but use the sum of weights from the attention model (which should not exceed 1 for each token of the input MR) in a penalty term for reranking.
Two seq2seq systems use a direct modification of the attention mechanism instead of reranking at decoding time. Chen includes attention memory (sum of attention distributions so far in the generation process) as an additional input to the attention model. Zhang adds an attention regularisation loss term to the training process, which attempts to keep the sum of weights close to 1 for each input MR token, similarly to Harv’s penalty term. Three systems, Adapt, TR1 and Sheff2, do not use any explicit semantic control mechanism.
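The attention-regularisation idea can be illustrated with a small numeric sketch: sum the attention each input MR token receives over all decoding steps and penalise deviations from 1. This is a plain-Python illustration of the loss term only (the weight and function names are assumptions); a real system would add it to the seq2seq training objective.

```python
# Sketch of an attention-regularisation loss: the total attention mass each
# input token receives across decoding steps should be close to 1; tokens
# attended far less (omitted slot) or far more (repeated slot) are penalised.

def attention_regulariser(attn, weight=1.0):
    """attn: list of per-step attention distributions over input tokens."""
    n_inputs = len(attn[0])
    loss = 0.0
    for i in range(n_inputs):
        total = sum(step[i] for step in attn)  # total attention on token i
        loss += (total - 1.0) ** 2
    return weight * loss

perfect = [[1.0, 0.0], [0.0, 1.0]]  # each input token attended exactly once
skewed = [[1.0, 0.0], [1.0, 0.0]]   # second token never attended
print(attention_regulariser(perfect))  # 0.0
print(attention_regulariser(skewed))   # 2.0
```

Harv's reranking penalty uses essentially the same quantity, but at decoding time rather than as a training loss.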
The non-seq2seq data-driven systems use specific mechanisms to maintain input MR coverage. ZHAW1 and ZHAW2 are based on SC-LSTM cells Wen et al. (2015b), which include a special gate that keeps track of slots covered so far in the MR. In addition, ZHAW1 uses convolutional MR classifiers to rerank beam search outputs similarly to most seq2seq systems; however, this classification is also used in an additional loss term during training. The Sheff1 system explicitly decides which slot to verbalise next using a separate slot-level classifier, which is optimised to cover the input MR.
6.3 Data Augmentation and Diversity
The design of the E2E dataset attempts to provide higher text diversity (see Section 4), and several challenge participants made use of this. Others modified the training set simply to achieve better output quality.
Several systems aim at higher output quality by using data augmentation. TNT1 enriches input MRs by prepending them with the corresponding outputs of the Personage generator Mairesse and Walker (2007), with the aim of generating more diverse output. TNT2 aims to boost the robustness of the baseline TGen system by re-shuffling slots in the input MRs. Slug uses single sentences from the training data with the corresponding aligned parts of the original MR. This increases the amount of training data available and simplifies the task by breaking outputs into smaller (partially) aligned units. Slug-alt, on the other hand, only uses training instances involving complex sentences in an attempt to produce more sophisticated outputs. In contrast, the Sheff1 system is trained using only one reference text per training MR; the reference text with the highest average word frequency is selected. While this approach is likely to decrease output diversity, the authors use it to stabilise system training. Harv takes yet another approach in order to both stabilise training and increase diversity, called diverse ensembling Guzman-Rivera, Batra, and Kohli (2012). In an expectation-maximisation fashion, they split the training data instances into subsets that exhibit similar structural properties and style in the natural language references, then train different models on these subsets and deploy them as an ensemble.
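The simplest of these augmentation schemes, slot re-shuffling, can be sketched directly. This is an illustrative reconstruction under the assumption that each training pair is duplicated with the MR slots in a random order while the reference text stays unchanged; the function name and copy count are our own choices.

```python
# Sketch of slot-shuffling data augmentation (TNT2-style): duplicate each
# training instance with the MR slots in a different random order, keeping
# the reference text unchanged. Names and counts here are illustrative.
import random

def shuffle_augment(mr_pairs, reference, n_copies=2, seed=42):
    """mr_pairs: list of (slot, value) tuples; returns augmented instances."""
    rnd = random.Random(seed)
    augmented = [(list(mr_pairs), reference)]  # keep the original instance
    for _ in range(n_copies):
        perm = list(mr_pairs)
        rnd.shuffle(perm)
        augmented.append((perm, reference))
    return augmented

mr = [("name", "The Vaults"), ("food", "Indian"), ("area", "riverside")]
for shuffled_mr, ref in shuffle_augment(mr, "The Vaults serves Indian food."):
    print(shuffled_mr)
```

Since the MR is an unordered set of attributes, the shuffled copies are semantically identical, so the generator is pushed to become invariant to input slot order.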
Two teams attempt to increase output diversity by directly modifying the generation process. The ZHAW1 and ZHAW2 systems use a first word control mechanism: they generate outputs starting with all (frequent enough) first words from the training set, then select the final output by sampling. ZHAW1 only samples among semantically correct outputs (see Section 6.2). Adapt takes a different approach, adding a preprocessing step before the main generator, which decides upon specific words that should appear on the output. These are then used to enrich the input MR in the main generation step, providing more diversity on the input.
6.4 Systems outside the competition
Solving the challenges outlined above is an ongoing effort addressed by many recent systems. Here we briefly summarise other attempts by systems outside the competition for completeness. Note that many of these approaches are very recent and have been published only after the E2E NLG Challenge ended.
Apart from delexicalisation, which is most often used in the E2E NLG Challenge, various variants of the copy mechanism are the most prominent approach to addressing open vocabulary in NLG Wiseman, Shieber, and Rush (2017); Lebret, Grangier, and Auli (2016); Bao et al. (2018); Kaffee et al. (2018); Wang et al. (2018). Shimorina and Gardent (2018) combine a copy mechanism with delexicalisation. In contrast, Freitag and Roy (2018) use subwords and recast the NLG model as a denoising autoencoder with shared input and output embeddings (starting from slot values and “filling in” the rest of the sentence on the output).
Attempts at improving the semantic accuracy of generated texts show a wider variety of approaches. Kiddon, Zettlemoyer, and Choi (2016) use a “checklist model”: the decoder keeps a vector of items used so far during generation. This is similar to the semantic gates of Wen et al. (2015b), which were used by the ZHAW1 and ZHAW2 systems in our challenge (see Section 6.2). Tran, Nguyen, and Tojo (2017) use a two-level attention model (composed of a standard attention model and a “refiner”, an attention-over-attention module) to improve semantic coverage. Nema et al. (2018) combine semantic gating and two-level attention (with attention over slots, slot values, and a combination thereof). Other authors explore supplementary inputs for improving semantic correctness: Reed, Oraby, and Walker (2018) use an additional supervision signal indicating the desired number of sentences to generate, and Freitag and Roy (2018) show that additional unlabeled training data improves semantic coverage in their denoising-autoencoder-based NLG model.
Since its initial release in Novikova, Dušek, and Rieser (2017), the E2E dataset has been used by several authors to explore generating more diverse outputs, mostly with additional supervision signals: The system of Wiseman, Shieber, and Rush (2018) learns latent templates (sequences of phrases/slots) while learning to generate, thus allowing more controllability and arguably more diversity of the outputs – the templates serve as an additional, fine-grained way of specifying the desired shape of the generator output. Reed, Oraby, and Walker (2018) explore using the presence of prespecified contrast markers (e.g. but, although) as additional supervision, while Juraska and Walker (2018) investigate other stylistic markers and use them to generate sentences of specified type. Oraby, Reed, and Tandon (2018) and Oraby et al. (2018b) attempt to generate outputs showing different personality traits (represented by the Big Five model) using additional synthetic training data with personality annotation. Jagfeld, Jenne, and Vu (2018) do not add more supervision but compare the diversity produced by word-level and character-level seq2seq models on E2E data, showing better performance of the latter.
Using an in-house restaurant dataset, Nayak et al. (2017) explore using a basic sentence plan specification (slot ordering and sentence grouping) as an additional training signal to increase output diversity. Working in the transport information domain, Dušek and Jurčíček (2016) and Mangrulkar et al. (2018) condition their generators on preceding dialogue context as well as the input MR to obtain greater diversity.
7 Evaluation Setup
We evaluated the systems submitted to the E2E challenge using a range of automatic metrics, which we describe in Section 7.1. This includes a novel application of textual measures (these measures were previously applied by Perez-Beltrachini and Gardent (2017) and in this work, see Section 4, to describe datasets, but not to evaluate NLG outputs) and a novel usage of standard word-overlap metrics to assess similarity among individual systems. Automatic metrics are popular in NLG Gkatzia and Mahamood (2015) because they are cheaper and faster to run than human evaluation. However, sole use of automatic metrics is only sensible if they are known to be sufficiently correlated with human preferences. Recent studies Novikova et al. (2017); Reiter (2018) have demonstrated that this is very often not the case and that automatic metrics only weakly reflect human judgements on outputs generated by data-driven NLG systems. Therefore, we also performed a large-scale crowdsourced human evaluation, as detailed in Section 7.2. For the human evaluation of the 20 primary systems, we address the problem of how to efficiently compare a large number of systems by:
Introducing the data-efficient TrueSkill algorithm Herbrich, Minka, and Graepel (2006); Sakaguchi, Post, and Van Durme (2014) to NLG. This allows us to compute an overall ranking by directly comparing the systems, rather than individually assessing them at higher cost, as done by previous NLG challenges Belz and Hastie (2014).
7.1 Automatic Metrics
We apply two types of automatic metrics: One set assessing the similarity between generated system outputs and natural language references in the corpus using word-overlap-based measures, and another set assessing the complexity and diversity of system outputs using a variety of textual measures.
For the first set, we selected a range of metrics measuring word-overlap between system output and references, including BLEU and NIST, which are used as standard in machine translation evaluation Bojar, Graham, and Kamran (2017) and very common in NLG, and several others which were applied in the COCO caption generation challenge Chen et al. (2015) as well as other NLG experiments (e.g. Lebret, Grangier, and Auli, 2016; Gardent et al., 2017b; Sharma et al., 2016b):
- BLEU Papineni et al. (2002) is the geometric mean of n-gram precisions of the system output with respect to human-authored reference sentences, for n = 1, …, 4, lowered by a brevity penalty if the output is shorter than the references. The n-gram precisions are the proportions of n-grams in the system output that can be matched in any of the reference sentences. Repeated n-gram matches are clipped to the maximum number of times the n-gram occurs in any single reference.
- NIST Doddington (2002) is a version of BLEU with higher weighting for less frequent (i.e., more informative) n-grams and a different length penalty. It uses n = 1, …, 5.
- METEOR Lavie and Agarwal (2007) measures both precision and recall of unigrams by aligning the system output with the individual human references. In addition to exact word matches, it uses fuzzy matching based on stemming and WordNet synonyms. It computes matches against multiple references separately and uses the best-matching one.
- ROUGE-L Lin (2004) is based on the longest common subsequence (LCS) between the system output and the human references, where a common subsequence requires the same words in the same order but allows additional, non-covered words in the middle of either sequence. The final ROUGE-L score is an F-measure based on the maximum precision and maximum recall achieved over any of the human references, where precision and recall are computed as the length of the LCS divided by the length of the system output and the reference, respectively.
- CIDEr Vedantam, Zitnick, and Parikh (2015) was primarily designed for generated image captions, but is also applicable to NLG in general. CIDEr is computed as the average cosine similarity between the system output and the reference sentences on the level of n-grams, for n = 1, …, 4. The importance of individual n-grams is given by the Term Frequency Inverse Document Frequency (TF-IDF) measure, which weighs an n-gram’s frequency in a particular instance against its overall frequency in the whole dataset.
We provided scripts to the challenge participants to run all of these metrics in a simple, easy-to-use way; the scripts are freely available online. (The scripts are partially based on the COCO caption generation challenge evaluation scripts, https://github.com/tylin/coco-caption.)
In addition to evaluating all NLG systems individually against human-authored reference texts (see Section 8.1), we also apply the same metrics as measures of output similarity among the systems, comparing each system’s outputs with all other systems’ outputs in place of references (see Section 8.3).
For the second set of scores, which is intended to measure complexity and diversity in the system outputs, we use the same automatic textual metrics we used to evaluate the E2E NLG dataset itself (see Sections 4.2 and 4.3), i.e. dimensions of lexical richness, such as lexical sophistication (LS2) and mean segmental type-token ratio (MSTTR), and metrics of syntactic complexity, such as levels of the revised D-Level Scale. This allows us both to evaluate the diversity and complexity of system outputs and to establish whether their text characteristics are similar to the training and test sets. To focus specifically on the style produced by the individual systems, we delexicalised restaurant names in the system outputs before computing the textual metric scores, since restaurant names are mostly composed of infrequent nouns and could skew some of these metrics (cf. Section 4.2).
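MSTTR, the simplest of these diversity measures, can be sketched in a few lines: split the token stream into successive fixed-length segments and average the type-token ratio over them. The segment length of 50 is an assumption for illustration; the measure is defined for any fixed segment length.

```python
# Sketch of mean segmental type-token ratio (MSTTR): average the type-token
# ratio over successive fixed-length segments of the text. Segmenting keeps
# the measure comparable across texts of different lengths.

def msttr(tokens, segment_len=50):
    segments = [tokens[i:i + segment_len]
                for i in range(0, len(tokens) - segment_len + 1, segment_len)]
    if not segments:  # text shorter than one segment: plain TTR fallback
        return len(set(tokens)) / len(tokens)
    return sum(len(set(seg)) / len(seg) for seg in segments) / len(segments)

print(msttr("a b c d a b c d".split(), segment_len=4))  # 1.0
print(msttr("a a a a".split(), segment_len=4))          # 0.25
```

Higher values indicate more lexical variety within segments; a system that repeats the same few words scores low even if its overall vocabulary is large.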
7.2 Human Evaluation
The human evaluation was conducted on the 20 primary systems and the baseline using Rank-based Magnitude Estimation (RankME) Novikova, Dušek, and Rieser (2018). In an ordinary (i.e. not rank-based) ME task Bard, Robertson, and Sorace (1996), subjects rate an experimental sentence relative to a reference sentence, which is associated with a pre-set, fixed number. If the target sentence appears twice as good as the reference sentence, for instance, subjects are to multiply the reference score by two; if it appears half as good, they should halve it, etc. Rank-based ME extends this idea by asking subjects to rank several target sentences relative not only to the reference sentence, but also to each other.
Rank-based ME was selected for several reasons. First, its use proved to significantly increase the consistency of human ratings, compared to other data collection methods Novikova, Dušek, and Rieser (2018). Second, it implies the use of continuous scales, i.e. rating scales without numerical labels and without given end points. Recent studies show that continuous scales allow subjects to give more nuanced judgements Belz and Kow (2011); Graham et al. (2013); Bojar et al. (2017). Third, it explores relative ranking of different systems instead of directly assessing quality of each specific system, which makes it more reliable in the environment of a challenge.
The evaluation was conducted using crowdsourcing based on the CrowdFlower/FigureEight platform. Crowd workers were presented with five randomly selected outputs of different systems corresponding to a single MR, and were asked to evaluate and rank these systems from the best to the worst, ties permitted, using the RankME method.
The final evaluation results were produced using the TrueSkill algorithm Herbrich, Minka, and Graepel (2006); Sakaguchi, Post, and Van Durme (2014). TrueSkill produces system rankings by gradually updating a Bayesian estimate of each system’s capability according to the “surprisal” of pairwise comparisons of individual system outputs. This way, fewer direct comparisons between systems are needed to establish their overall ranking. In Novikova, Dušek, and Rieser (2018), we showed that TrueSkill is able to reduce the amount of collected human evaluation data without compromising the final ranking results.
Since the performance of some systems may be very similar and a total ordering would not reflect this, we adopt the practice used in machine translation of presenting a partial ordering into significance clusters established by bootstrap resampling Bojar et al. (2013, 2014); Sakaguchi, Post, and Van Durme (2014). The TrueSkill algorithm is run 200 times, producing slightly different rankings each time as pairs of system outputs for comparison are randomly sampled. This way we can determine the range of ranks where each system is placed 95% of the time or more often. Clusters are then formed of systems whose rank ranges overlap.
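The final clustering step described above can be sketched directly: given each system's 95% rank range from the bootstrap runs, systems whose ranges overlap are merged into one significance cluster. The input format (systems pre-sorted by their lowest rank) is an assumption for illustration.

```python
# Sketch of forming significance clusters from bootstrapped rank ranges:
# systems whose 95% rank ranges overlap end up in the same cluster.
# Input is assumed pre-sorted by each system's lowest attainable rank.

def clusters(rank_ranges):
    """rank_ranges: list of (system, low_rank, high_rank) tuples."""
    result, current, current_high = [], [], -1
    for system, low, high in rank_ranges:
        if current and low > current_high:  # no overlap: close the cluster
            result.append(current)
            current, current_high = [], -1
        current.append(system)
        current_high = max(current_high, high)
    if current:
        result.append(current)
    return result

# A and B overlap (ranks 1-2 and 2-3); C (5-6) is clearly separated.
print(clusters([("A", 1, 2), ("B", 2, 3), ("C", 5, 6)]))  # [['A', 'B'], ['C']]
```

Systems within one cluster cannot be reliably distinguished by the collected judgements, so only the cluster-level ordering is reported as significant.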
Traditionally, human evaluation aims to assess the naturalness (fluency, readability) and informativeness (relevance, correctness, adequacy) of an automatically generated output Gatt and Krahmer (2017). Naturalness targets the linguistic quality of the NLG system output; informativeness targets relevance or correctness of the output relative to the input MR, showing how well the system reflects the MR content. Recent research often adds a general, overall quality criterion Wen et al. (2015b, a); Manishina et al. (2016); Novikova, Lemon, and Rieser (2016); Novikova et al. (2017), or even uses only that Sharma et al. (2016a).
We decided against explicitly evaluating informativeness since our training instances do not always verbalise all MR attributes (cf. Section 4.4). We therefore only collected separate ranks for quality and naturalness.
When collecting quality ratings, system outputs were presented to crowd workers together with the corresponding meaning representation, which implies that correctness of the NL utterance relative to the MR should also influence this ranking. The crowd workers were asked: “How do you judge the overall quality of the utterance in terms of its grammatical correctness, fluency, adequacy and other important factors?”
When collecting naturalness ratings, system outputs were presented to crowd workers without the corresponding meaning representation. The crowd workers were asked: “Could the utterance have been produced by a native speaker?”
Ratings of quality and naturalness were collected separately, i.e. in two individual crowdsourcing tasks; as noted above, the MR was not shown for the naturalness task since it was not necessary there. This setup allows us to minimise the correlation between the ratings of naturalness and quality Novikova, Dušek, and Rieser (2018); Callison-Burch et al. (2007).
In this section, we report on the results of the evaluation of all E2E NLG Challenge primary systems, following the evaluation procedures described in Section 7. We first show the results using automatic metrics: word-overlap-based (Section 8.1) and textual metrics (Section 8.2), as well as automatically computed output similarity between systems (Section 8.3). We then summarise the human evaluation results (Section 8.4), comment on the semantic accuracy of system outputs (Section 8.5) and declare the overall winning system (Section 8.6). Finally, we provide a list of “lessons learnt” in Section 8.7 – observations that we hope will be useful for future NLG system development.
8.1 Word-overlap Metrics
Table 8 (excerpt):
| System | BLEU | NIST | METEOR | ROUGE-L | CIDEr | norm. avg. |
| Slug-alt (late submission) | 0.6035 | 8.3954 | 0.4369 | 0.5991 | 2.1019 | 0.5378 |
Table 8 summarises the system scores for word-overlap metrics (cf. Section 7.1). It is apparent that the TGen baseline is very strong in terms of word-overlap metrics: No primary system is able to beat it in terms of all metrics, or in terms of the normalised metrics’ mean – only Slug comes very close. Several other systems manage to beat TGen in one of the metrics but not in others. Note, however, that many secondary system submissions perform better than the primary ones (and the baseline) with respect to word-overlap metrics (see Table 14 in the Appendix).
Overall, seq2seq-based systems show the best word-based metric values, followed by Sheff1, a data-driven system based on imitation learning. As expected, attempts to increase output diversity by ZHAW1, ZHAW2, Slug-alt and Adapt result in lowered scores on word-overlap-based metrics. Template-based and rule-based systems mostly score at the bottom of the list. The lowest-scoring systems in terms of word-overlap metrics are those of Chen and Sheff2, which tend to produce much shorter outputs than other systems (cf. Section 8.2); this most likely resulted in a severe brevity penalty.
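The brevity penalty referred to here is part of the standard BLEU definition; a minimal sketch (illustrative, not the challenge’s scoring script):

```python
import math

def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """Standard BLEU brevity penalty: 1 if the candidate is at least as
    long as the (effective) reference, exp(1 - r/c) otherwise."""
    if candidate_len >= reference_len:
        return 1.0
    if candidate_len == 0:
        return 0.0
    return math.exp(1.0 - reference_len / candidate_len)
```

For instance, a 10-token output scored against a 20-token reference is scaled by exp(1 − 2) ≈ 0.37, which is enough to push an otherwise fluent system to the bottom of the table.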
8.2 Textual Metrics
Table 9 (excerpt): top-ranked systems per metric; each cell gives system and value.
| % Level 0–2 | % Level 6–7 | LS2 | MSTTR-50 | Avg. length |
| Gong 82.68 | Sheff1 41.27 | test set all 0.43 | test set rand 0.62 | TUDA 31.02 |
| TNT2 79.64 | FORGe1 33.66 | test set rand 0.36 | TR2 0.62 | TR2 27.48 |
| DANGNT 66.95 | ZHAW2 19.03 | Harv 0.27 | test set all 0.58 | ZHAW1 26.16 |
| Harv 64.63 | test set rand 17.46 | Chen 0.25 | FORGe3 0.56 | Gong 25.41 |
| FORGe3 62.62 | test set all 16.48 | Sheff2 0.25 | DANGNT 0.54 | Adapt 24.47 |
| FORGe1 61.13 | NLE 11.12 | TNT2 0.23 | Slug 0.52 | test set rand 24.39 |
| NLE 58.24 | Adapt 10.28 | DANGNT 0.21 | Sheff1 0.52 | test set all 23.96 |
| test set rand 58.16 | TNT1 9.55 | TUDA 0.21 | NLE 0.52 | Slug 23.76 |
| test set all 57.97 | TGen 9.02 | TR1 0.20 | TGen 0.52 | FORGe3 23.49 |
Table 10 (excerpt): diversity measures; each cell gives system and value.
| Distinct tokens | Distinct trigrams | % Unique trigrams | Entropy tokens | Cond. entropy bigrams |
| test set all 1079 | test set all 16797 | test set rand 69.13 | test set all 6.40 | test set all 2.92 |
| test set rand 542 | test set rand 5166 | Adapt 66.61 | test set rand 6.37 | test set rand 2.70 |
| TR2 399 | Adapt 3567 | test set all 44.66 | Adapt 6.18 | Adapt 2.09 |
Table 9 summarises results from a range of textual metrics which aim to assess the complexity and diversity of primary system outputs (cf. Section 7.1). In addition, we include a comparison to the human references in the test set in order to assess whether systems are able to replicate characteristics of human-produced data. (Note that textual metrics have been computed with restaurant names delexicalised, cf. Section 7.1.) The results in Table 9 show the following:
Seq2seq-based system outputs are less syntactically complex on average than outputs of other systems (they produce more D-level 0–2 sentences and fewer D-level 6–7 sentences than other architectures).
The systems seem to show a relatively high variance in syntactic complexity levels, especially with respect to the higher levels; few systems match the distribution of the training and test data. The differences in D-level distributions in the outputs are mostly statistically significant (see Figure 6 in the Appendix). The only system producing a D-level distribution not significantly different from a random test set reference is FORGe3, which is based on template mining from training data.
If we use Bhattacharyya distance to compare the D-level distributions (cf. Figure 7 in the Appendix), the greatest distances appear at both extremes. Sheff1, FORGe1 and Slug-alt produce higher-level sentences more frequently and thus rank among the most distant from other systems. The Gong system mostly produces level 0–2 sentences and therefore also appears very distant from other systems, as well as being the most distant system from the human references.
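The Bhattacharyya distance used here has a simple closed form for discrete distributions; a brief sketch (our own illustration, with D-level histograms represented as {level: probability} dicts):

```python
import math

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete distributions given as
    {outcome: probability} dicts: D_B = -ln(sum_i sqrt(p_i * q_i))."""
    support = set(p) | set(q)
    bc = sum(math.sqrt(p.get(x, 0.0) * q.get(x, 0.0)) for x in support)
    return -math.log(bc) if bc > 0 else float("inf")
```

Identical distributions have distance 0, and distributions with disjoint support are infinitely distant.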
None of the systems reaches the lexical sophistication of the human-authored test set references. The diversity-attempting seq2seq-based Adapt system comes very close, followed by the grammar-based FORGe1 and the TR2 system, which is based on template mining from data. Data-driven systems aiming at higher lexical diversity seem to achieve higher sophistication as well; note the lower performance of Slug-alt, which aims more at syntactic diversity than lexical. For rule-based systems, lexical sophistication is a direct result of the system authors’ decisions.
In terms of MSTTR, the highest scores are achieved by template- or rule-based systems and by data-driven systems that explicitly aim at greater output diversity (ZHAW1, ZHAW2, Adapt, Slug-alt). Note that MSTTR is typically higher for systems that tend to produce longer outputs, which includes most rule- and template-based systems; we assume that this is due to MSTTR’s fixed 50-token window used to segment utterances.
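MSTTR itself is straightforward to compute; the sketch below follows one common convention (dropping the incomplete trailing segment) and may differ in detail from the challenge’s implementation:

```python
def msttr(tokens, segment_size=50):
    """Mean segmental type-token ratio: split the token stream into
    consecutive segments of `segment_size` tokens, compute the
    type/token ratio of each full segment, and average.
    (Incomplete trailing segments are dropped, one common convention.)"""
    segments = [tokens[i:i + segment_size]
                for i in range(0, len(tokens) - segment_size + 1, segment_size)]
    if not segments:
        return len(set(tokens)) / len(tokens) if tokens else 0.0
    return sum(len(set(seg)) / len(seg) for seg in segments) / len(segments)
```

A stream repeating a single token scores near 1/50, while a stream of fully distinct tokens scores 1.0.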
Most systems produce outputs similar in length to the test set human references. Outputs of rule- and template-based systems tend to be more verbose than those of data-driven systems. The outputs of Zhang, Sheff2 and Chen are much shorter on average than texts in the dataset, which suggests that these systems might not verbalise all the information contained in the MR (cf. Section 8.5).
As with the dataset statistics in Section 4.2, we also computed additional textual measures to assess the diversity/repetitiveness of the generated outputs: the number of distinct n-grams, Shannon entropy, and conditional next-word entropy; a selection of these metrics is shown in Table 10. (We used system outputs with delexicalised restaurant names for this evaluation, but the lexicalised outputs show the same trends; values for n-gram lengths not displayed in Table 10 also show very similar trends.) We compare the outputs against the whole test set (multiple references) and against a randomly selected single reference per MR from the test set. The results show the following:
None of the systems is able to produce as much diversity as is contained in a randomly selected human reference – even the most diverse systems lag behind. Adapt comes close in vocabulary size, TR2 is the closest system in terms of entropy and next-word conditional entropy.
In terms of vocabulary, there is a huge gap between the most diverse systems, Adapt and TR2, and any other system (e.g., the 3rd-ranking ZHAW1 has a 3× smaller vocabulary than TR2 and a 2.4× smaller ratio of unique trigrams).
TR2 demonstrates that mining templates from the training data can lead to very diverse outputs. FORGe3, which uses the same method, also ranks relatively high on vocabulary size and entropy. The diversity produced by Adapt’s seq2seq model indicates that its preprocessing step enriching the MRs works effectively (cf. Section 6.3).
All diversity-attempting data-driven systems (Adapt, ZHAW1, ZHAW2, Harv, TNT1, TNT2, Slug-alt) indeed rank better than most systems not incorporating diversity measures, with TNT1 and TNT2 showing lower gains than the rest of the group. However, template-mining-based systems (TR2, FORGe3) produce outputs of similar or higher diversity with no concentrated effort.
Outputs of seq2seq-based systems which do not explicitly model diversity (e.g. Gong, Sheff1, TR1, Slug, Chen) indeed show lower diversity scores. The rule-based DANGNT system also ranks very low on diversity, and the TUDA system with handcrafted templates is the least diverse of all.
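The diversity measures in Table 10 (distinct n-grams, Shannon entropy over tokens, conditional next-word entropy) can be computed directly from token counts; a stdlib-only sketch of the definitions (our illustration, not the evaluation scripts):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def shannon_entropy(items):
    """Shannon entropy in bits: H = -sum_x p(x) log2 p(x)."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def cond_next_word_entropy(tokens):
    """Conditional next-word entropy H(w2 | w1) = H(w1, w2) - H(w1),
    estimated from bigram and unigram counts over the same positions."""
    bigrams = ngrams(tokens, 2)
    if not bigrams:
        return 0.0
    return shannon_entropy(bigrams) - shannon_entropy([b[0] for b in bigrams])

def distinct(tokens, n):
    """Number of distinct n-grams (the 'distinct tokens/trigrams' columns)."""
    return len(set(ngrams(tokens, n)))
```

A fully repetitive text like "a b a b …" has zero conditional next-word entropy: the next word is completely predictable from the current one.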
In summary, few systems are able to approach the complexity and diversity shown in human-authored data. Seq2seq-based systems tend to favour simpler sentences than hand-engineered systems unless diversity control is in place. Vanilla seq2seq setups and handcrafted templates produce the least diverse outputs; the highest diversity is achieved by template mining or explicit diversity control mechanisms.
8.3 System Output Similarity
In order to assess the similarity of outputs produced by the individual systems, we reused the word-overlap-based metrics applied in the challenge (see Section 7.1). We created all possible pairs of systems and computed word-overlap metrics between each of their outputs for every instance in the test set. As with the textual metrics, restaurant names were delexicalised in the system outputs. (Results with fully lexicalised outputs are very similar; the differences are just slightly less pronounced.)
This process resulted in a table for each of the metrics (see Figure 5 in the Appendix), with reference systems in rows and tested systems in columns. All five metrics showed a very similar pattern. Figure 4 therefore summarises the results by taking the average of all normalised metrics (cf. Table 8). For comparison, we also measure similarity of system outputs against the reference texts in the test set, as well as a subset of the test set with a single, randomly sampled reference text per MR.
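As a simplified illustration of this pairwise procedure (the challenge used its full word-overlap metric suite, not this score), one can average a per-instance overlap measure such as unigram F1 over the two systems’ aligned outputs:

```python
from collections import Counter

def unigram_f1(hyp_tokens, ref_tokens):
    """Unigram overlap F1 between two token lists (a lightweight stand-in
    for the word-overlap metric suite used in the challenge)."""
    overlap = sum((Counter(hyp_tokens) & Counter(ref_tokens)).values())
    if not hyp_tokens or not ref_tokens or not overlap:
        return 0.0
    p = overlap / len(hyp_tokens)
    r = overlap / len(ref_tokens)
    return 2 * p * r / (p + r)

def system_similarity(outputs_a, outputs_b):
    """Average instance-wise similarity of two systems' outputs
    (one output per MR, aligned by position)."""
    scores = [unigram_f1(a.split(), b.split())
              for a, b in zip(outputs_a, outputs_b)]
    return sum(scores) / len(scores)
```

Unlike BLEU with one side taken as the reference, this measure is symmetric, so the resulting system-by-system matrix is symmetric as well.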
We can see from Figure 4 that all the seq2seq-based system outputs are in general most similar to each other; other data-driven systems also show higher similarity amongst each other. The exception to this rule in the case of the Chen and Sheff2 systems can be explained by the brevity of their outputs (cf. Sections 8.1 and 8.2). Systems that aim at output diversity (ZHAW1, ZHAW2, Slug-alt and especially Adapt) also exhibit lowered similarity of their outputs to those of other systems, which might indicate that their outputs are indeed more original. The outputs of rule-based and template-based systems are markedly less similar to other outputs than those of the data-driven systems.
We can also see that most system outputs, especially those of data-driven methods, are much more similar to each other than they are to a single randomly selected human-authored reference text from the test set. This is to be expected since data-driven methods tend to select more frequent phrasing. Some of the system outputs even show a higher similarity to each other than to the closest matching human references from the test set. This is mainly the case for systems with very similar architectures, which often arrive at identical results (e.g. TGen, TNT1 and TNT2).
8.4 Results of Human Evaluation
Table 13 (excerpt): example MRs and system outputs with raw RankME quality scores.
| # | MR / System name and output | Score |
| 1 | name[The Vaults], eatType[restaurant], food[French], priceRange[moderate], area[city centre], familyFriendly[yes], near[Raja Indian Cuisine] | |
| | Gong: The Vaults is a french restaurant in the city centre near Raja Indian Cuisine. It is kids friendly and has a moderate price range. | 100 |
| | Harv: The Vaults is a moderately priced French restaurant in the city centre near Raja Indian Cuisine. | 90 |
| | FORGe3: Family-friendly and moderate, The Vaults restaurant is located near Raja Indian Cuisine, in the city centre area. | 85 |
| | Zhang: The Vaults is a kid friendly French restaurant in the city centre near Raja Indian Cuisine . | 80 |
| | Sheff1: The Vaults is a moderately priced French restaurant near Raja Indian Cuisine in the city centre is. | 70 |
| 2 | name[Cotto], eatType[coffee shop], near[The Bakers] | |
| | TR2: Cotto is a coffee shop located near The Bakers. | 100 |
| | Slug-alt: Cotto is a coffee shop and is located near The Bakers | 97 |
| | TGen: Cotto is a coffee shop with a low price range. It is located near The Bakers. | 85 |
| | Gong: Cotto is a place near The Bakers. | 85 |
| | Sheff2: Cotto is a pub near The Bakers. | 82 |
| 3 | name[Clowns], eatType[coffee shop], customer rating[3 out of 5], near[All Bar One] | |
| | Sheff1: Clowns is a coffee shop near All Bar One with a customer rating of 3 out of 5. | 100 |
| | Zhang: Clowns is a coffee shop near All Bar One with a customer rating of 3 out of 5 . | 100 |
| | FORGe3: Clowns is a coffee shop near All Bar One with a rating 3 out of 5. | 70 |
| | ZHAW2: A coffee shop near All Bar One is Clowns. It has a customer rating of 3 out of 5. | 50 |
| | Sheff2: Clowns is a pub near All Bar One. | 20 |
| 4 | name[The Cricketers], eatType[restaurant], customer rating[low], familyFriendly[no], near[Ranch] | |
| | Slug: The Cricketers is a restaurant near Ranch. It is not family friendly and has a low customer rating. | 72 |
| | Slug-alt: Located near Ranch, The Cricketers is a family-friendly restaurant that is not family-friendly and has a low customer rating. | 71 |
| | Adapt: The Cricketers is a non - family - friendly restaurant located near the Ranch . It has a low customer satisfaction rating . | 68 |
| | FORGe1: The restaurant The Cricketers is near Ranch. The Cricketers, which does not welcome kids, has a low customer rating. | 65 |
| | TUDA: The Cricketers is a restaurant located near Ranch. It has a low customer rating. It is not family friendly. | 56 |
Each example is shown as ranked for quality by a single crowd worker. The raw RankME scores assigned by the crowd workers are shown; however, note that only relative ranks are used by the TrueSkill algorithm. The outputs within each example are sorted by the score for clarity. For the purpose of error analysis, the rankings may be interpreted in the following way (note that quality rankings include both relevance and fluency):
Example 1: Gong and FORGe3 verbalise all attributes, but the latter is less fluent. Harv misses the family-friendliness, Zhang misses the price information. Sheff1 misses family-friendliness and is not fluent.
Example 2: TR2 and Slug-alt provide perfect and fluent information, but Slug-alt misses the full stop. Gong does not specify the type of place, while TGen adds irrelevant price range information. Sheff2 indicates a wrong venue type.
Example 3: Sheff1 and Zhang provide perfect and fluent information; FORGe3 is less fluent, and ZHAW2 even less so. Sheff2 indicates a wrong venue type and misses the customer rating information.
Example 4: Slug provides perfect and fluent information. Slug-alt is repetitive, and Adapt was probably penalised for lack of detokenisation. FORGe1 and TUDA provide complete information but are not very fluent.
The results of the human evaluation of quality and naturalness are provided in Table 11. Using the RankME setup described in Section 7.2, we collected 2,979 data points of partial system rankings for quality, where one data point corresponds to one MR and the ranked outputs of five randomly selected systems (see Table 13 for examples). From these rankings, a set of 29,790 pairwise output comparisons was produced to be used by the TrueSkill algorithm, i.e. 1,418 pairwise comparisons per system. For naturalness, 4,239 data points were collected, resulting in 42,390 pairwise comparisons and 2,018 comparisons per system. For each of the 630 MRs in the test set, 9.5 systems on average (with a maximum of 14) were compared on both the naturalness and the quality of their outputs. That is, using TrueSkill, we were able to reduce the number of required system comparisons by more than half. The CrowdFlower task for collecting human evaluation data ran for 235 hours and cost USD 314 in total.
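The comparison counts above follow directly from the setup: each data point ranks five systems, i.e. C(5,2) = 10 pairwise comparisons, and the per-system figures are the totals divided by the 21 evaluated systems (20 primary submissions plus the TGen baseline). A quick check:

```python
from math import comb

# each RankME data point ranks 5 systems -> C(5,2) = 10 pairwise comparisons
pairs_per_data_point = comb(5, 2)
quality_pairs = 2979 * pairs_per_data_point        # 29790
naturalness_pairs = 4239 * pairs_per_data_point    # 42390
per_system_quality = quality_pairs // 21           # 1418
per_system_naturalness = naturalness_pairs // 21   # 2018
```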
We produced the final ranking of all systems for both quality and naturalness using the TrueSkill algorithm with bootstrap resampling as described in Section 7.2. This resulted in clusters of systems with significantly different rankings for both naturalness and quality. (Note that TrueSkill provides a relative ranking of a system in terms of its cluster and rank range, cf. Section 7.2, i.e. the numerical scores are not directly interpretable. Systems in the same cluster are considered to show performance that is not significantly different; in other words, if a system is part of, e.g., cluster 2, it can be considered 2nd best, but it shares this position with all other systems in the cluster.) In both cases, there is a clear winning system (i.e., the 1st cluster only has one member): Sheff2 for naturalness and Slug for quality. The 2nd clusters are quite large for both criteria – they contain 13 and 11 systems, respectively, and include the baseline TGen system in both cases.
The results indicate that seq2seq systems dominate in terms of the naturalness of their outputs, while most systems of other architectures score lower; the bottom cluster is filled with template-based systems. The winning Sheff2 system is seq2seq-based, and the 2nd cluster mostly includes other seq2seq-based systems. The results also indicate that diversity-attempting systems are penalised on naturalness: Slug-alt, ZHAW1 and ZHAW2 placed in the 3rd cluster; Adapt in the 4th.
The results for quality (which, per our definition in Section 7.2, also includes semantic completeness and grammaticality) are, however, more mixed in terms of architectures, with none of them clearly prevailing. The 2nd, most populous cluster includes all different architecture types. The winner is the seq2seq-based system Slug. However, the bottom two clusters are also composed of seq2seq-based systems. This shows the importance of an explicit semantic control mechanism applied at decoding time in seq2seq systems: none of the systems in the bottom two clusters applies such a mechanism, whereas all better-ranking seq2seq systems do (cf. Section 6.2). (While the Chen and Zhang systems do attempt to model the coverage of the input MR, they do not use explicit beam reranking based on MR coverage.) Note that this also includes the Sheff2 system, which scored top for naturalness. With the exception of the diversity-attempting Adapt, these systems tend to produce the shortest outputs (see Table 9), which indicates that they are frequently penalised for not realising parts of the input MR (cf. Section 8.5).
Finally, we computed the correlation of the word-overlap metrics with the human judgements of both quality and naturalness for all the systems. All of the correlations are weak (see Tables 15 and 16 in the Appendix), which confirms earlier findings of Novikova et al. (2017) and explains the discrepancy between system performances in terms of automatic and human evaluation.
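Such metric-to-human correlations are commonly measured with a rank correlation coefficient; below is a stdlib-only sketch of Spearman’s ρ with average ranks for ties (illustrative only – it does not reproduce the exact coefficient behind the appendix tables):

```python
def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks,
    with average (1-based) ranks assigned to tied values."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and vs[order[j + 1]] == vs[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

Spearman’s ρ is 1 for any increasing relationship and −1 for any decreasing one, regardless of the metrics’ scales.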
8.5 Error Analysis: Input MR Coverage
In order to clarify the mixed quality evaluation results, we attempted to estimate the number of semantic errors produced by the individual systems in two ways. First, we ran a specific crowdsourced evaluation of the systems’ coverage of the input MR, where crowd workers were asked to manually annotate missed and added information with respect to the input MR (see Table 12). We did not check for workers’ correctness here, so we can expect some noise, but the annotations confirm that the systems rated low on quality (most of which also produce very short outputs) are also the ones with the lowest proportion of perfectly covered MRs (Chen, Sheff2, Zhang, TR1 and Adapt).
Second, semantic errors were computed following Reed, Oraby, and Walker (2018): we implemented a script to estimate the coverage automatically based on regular expression matching. (We based the patterns for the individual attribute-value pairs on Reed, Oraby, and Walker (2018)’s script and manually enhanced them using the first 500 instances of the E2E development set.) This allowed us to produce an independent estimate of the proportion of outputs with missing or added information (see Table 12). Following Reed, Oraby, and Walker (2018), we also computed the slot error rate (SER) using this pattern-matching approach and the following formula:

SER = (missed + added + value errors + repetitions) / slots

Note that the coverage and SER values produced by the script are only estimates, as the patterns for a given attribute-value pair will not cover all possible correct ways to express it. This is different from Wen et al. (2015b)’s computation of SER, where full delexicalisation allowed them to directly count placeholders in the output.
Here, missed stands for slot values missing from the realisations, added denotes additional information not present in the MR (hallucinations), value errors denotes correctly realised slots with incorrect values (e.g., specifying a low price range instead of a high one), and repetitions are values mentioned repeatedly in the outputs; slots is the total number of slots/attributes in the test set. SER thus amounts to the proportion of erroneously realised slots. While the absolute numbers for perfectly covered MRs are different from those estimated by humans, they mostly follow the same trend. The SER value is highly correlated with the proportion of perfectly covered MRs.
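A minimal sketch of such a pattern-matching coverage check follows; the patterns below are illustrative stand-ins (not those derived from Reed, Oraby, and Walker (2018)’s script), and value errors and repetitions are omitted for brevity:

```python
import re

# Illustrative patterns only -- the actual evaluation used patterns derived
# from Reed, Oraby, and Walker (2018)'s script, manually extended.
PATTERNS = {
    ("eatType", "coffee shop"): r"\bcoffee shop\b",
    ("eatType", "restaurant"): r"\brestaurant\b",
    ("eatType", "pub"): r"\bpub\b",
    ("familyFriendly", "yes"): r"\b(family[ -]friendly|kid[s]? friendly)\b",
    ("priceRange", "moderate"): r"\bmoderate(ly priced)?\b",
}

def slot_errors(mr, text):
    """Count missed and added slot values for one MR/output pair.
    `mr` is a dict like {"eatType": "restaurant", ...}."""
    missed = sum(1 for slot, value in mr.items()
                 if (slot, value) in PATTERNS
                 and not re.search(PATTERNS[(slot, value)], text, re.I))
    added = sum(1 for (slot, value), pat in PATTERNS.items()
                if mr.get(slot) != value and re.search(pat, text, re.I))
    return missed, added

def ser(instances):
    """Slot error rate over (mr, output) pairs; the full formula also adds
    value errors and repetitions to the numerator."""
    missed = added = slots = 0
    for mr, text in instances:
        m, a = slot_errors(mr, text)
        missed, added, slots = missed + m, added + a, slots + len(mr)
    return (missed + added) / slots if slots else 0.0
```

Real patterns need to cover many paraphrases per attribute value (e.g. “kid friendly”, “welcomes children”), which is why the scripted estimates remain approximate.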
Both evaluations show that template- and rule-based systems, where MR coverage is implied by the architecture, mostly score high in this regard. However, FORGe3, which uses template mining from training data, scores below average; here, some amount of noise was probably carried over from the training data. TUDA, on the other hand, scores high in human ratings and even achieved a perfect score by the automatic script (100% perfect coverage), but this is partly due to its low diversity (cf. Section 8.2) – all its templates are probably covered well by the patterns. The results also show that some data-driven systems are able to achieve very good coverage (especially Sheff1, Gong and Slug, with SER estimates below 1.5%), which confirms the efficacy of their respective semantic control approaches (see Section 6.2). Seq2seq systems without reranking (Chen, Sheff2, Zhang, Adapt, TR1) score near the bottom of the list in both evaluations.
Both estimates also indicate that missing information is the most common type of problem; added (hallucinated) information occurs less frequently, but still poses a serious problem for utterance generation in task-based dialogue systems. (This problem appears to be more general, as it has also been reported in machine translation Koehn and Knowles (2017).) It also appears that the two problems are connected – systems hallucinating less frequently tend to miss information more often.
Finally, the scores show that attempts at diversity may hurt semantic accuracy. This is most apparent for Adapt, the most diverse system, which has no explicit semantic control mechanism. Other systems with diverse outputs, FORGe3 and Harv, also score lower on coverage: in the case of FORGe3, this is due to the above-mentioned noise in the mined templates; Harv’s reranking is probably less aggressive than others’. On the other hand, ZHAW1, ZHAW2 and especially Slug-alt produce diverse outputs while maintaining good coverage, thanks to their very powerful semantic control mechanisms.
8.6 Winning System
We consider the Slug system Juraska et al. (2018), a seq2seq-based ensemble system, as the overall winner of this challenge. It received high human ratings for both naturalness and quality, as well as for automatic word-overlap metrics. In contrast to vanilla seq2seq systems, Slug improves semantic coverage using a heuristic slot aligner in combination with a data augmentation method producing partially aligned examples, which places it among the top-scoring systems in terms of MR coverage (cf. Section 8.5). Slug’s only drawback is the relatively low output diversity; note that repetitive output is considered to be problematic for task-based dialogue systems. A variant of the same system, Slug-alt, provides much more output diversity at the cost of slightly lower quality ratings and MR coverage; it maintains higher quality and coverage scores than other diversity-attempting approaches.
While the Sheff2 system Chen, Lampouras, and Vlachos (2018), a vanilla seq2seq setup, won in terms of naturalness, it often does not realise all parts of the input MR, which severely affected its quality rating – it placed in the last cluster, ranked 20th–21st out of 21. Sheff2’s outputs also rank very low on complexity and diversity.
Furthermore, the TGen baseline system turned out to be hard to beat. It ranked highest on average in word-overlap-based automatic metrics and placed in the 2nd cluster for both quality and naturalness (ranks 3–6 and 4–8 out of 21, respectively). TGen also fared well (albeit not perfectly) in the MR coverage evaluations. On the other hand, TGen only scored in the middle of the pack on output diversity.
8.7 Lessons Learnt and Future Directions
We attempt to formulate some high-level “lessons learnt” for developing future data-driven NLG systems based on the above results, while acknowledging that our data is limited to a single domain and that the comparisons are not strictly controlled, i.e. the models vary in more than one aspect.
Semantic control: For seq2seq-based systems, a strong semantic control of the generated content seems crucial – beam reranking based on MR classification or heuristic alignments appears to work well while attention-only models perform poorly on our data. Correct semantics is regarded by users as more important than fluency Reiter and Belz (2009) and should be prioritised when training the models (cf. also Reiter, 2019).
Open vocabulary: For limited domains such as ours, delexicalisation of open-set attributes still seems to be the best approach. However, the Harv and NLE systems show that character-level models and copy mechanisms are viable alternatives. We believe that the lower results of Chen, Zhang and Adapt are due to inferior semantic control, not open-vocabulary handling.
Complexity and diversity: In general, hand-engineered systems seem to outperform neural systems in terms of output diversity and complexity (see Section 8.2); the most diverse outputs are produced by systems using templates mined from training data and data-driven systems with explicit diversity mechanisms.
Vanilla seq2seq-based systems produce the least diverse outputs: they are essentially probabilistic language models, which tend to settle on the most frequent phrasing, thus penalising length and favouring high-frequency word sequences. Diversity in seq2seq models can be improved by data selection (Slug-alt), diverse ensembling (Harv) or sampling from the generated beam Wen et al. (2015b). In contrast, the authors of hand-engineered systems can control output complexity and diversity directly: here, TUDA’s outputs are very repetitive as its set of handcrafted templates is small, while FORGe3 and TR2, with templates mined from data, produce some of the most diverse outputs.
In general, any systems attempting output diversity need to impose strong semantic control mechanisms to maintain MR coverage.
Best method suggestion: Rule-based methods work quite well for limited domains, such as ours. Low-effort handcrafting (as in TUDA) may lead to correct but repetitive outputs. Seq2seq models with semantic reranking emerge as the best data-driven option, in combination with controlling for diversity and using copy mechanisms to minimise preprocessing.
This paper presents the findings of the first shared task on End-to-End Natural Language Generation for Spoken Dialogue Systems. The aim of this challenge was to assess the capabilities of recent end-to-end, fully data-driven NLG systems, which can be trained from pairs of input meaning representations and corresponding texts, without the need for fine-grained semantic alignments.
As part of this challenge, we have created a novel dataset for NLG benchmarking in the restaurant information domain, which is an order of magnitude bigger than any previous publicly available dataset for task-oriented NLG. We also provided one of the previous state-of-the-art seq2seq-based NLG systems, TGen Dušek and Jurčíček (2016a), as a baseline for comparison. The challenge received 62 system submissions by 17 different participating institutions. The submitted systems ranged from complex seq2seq-based setups with various additions to the architecture, through other data-driven methods and rule-based systems, to simple template-based ones. We evaluated all the entries in terms of five different automatic metrics. 20 primary submissions (as identified by the participants) were further evaluated using a novel, crowdsourced evaluation setup. We also include a novel comparison of systems in terms of automatic textual metrics aimed at assessing output complexity and diversity. Our evaluation lets us formulate several general recommendations for future NLG system development.
In general, seq2seq-based systems produce very similar outputs (as measured by word-overlap, cf. Section 8.3), despite their different implementations. Seq2seq models tend to score high on word-overlap metrics and human evaluations of naturalness, while the scores for other data-driven, rule-based and template-based systems are lower. However, these other types of systems often score better in human evaluations of the overall quality. While the winning Slug system is seq2seq-based, the results also demonstrated possible pitfalls of using seq2seq models:
Vanilla seq2seq models tend to produce short outputs of low diversity and syntactic complexity. Low diversity is especially problematic since it causes repetitive outputs in spoken dialogue systems.
Applying a strong semantic control mechanism during decoding is crucial to preserve the input meaning. The most common semantic mistake is missing out information; however, added information (hallucinations) is closely linked to it. Both types of errors can have severe consequences for task-based dialogue systems, depending on the application domain.
Addressing these issues is challenging: Attempts to improve diversity can often result in lowered semantic accuracy and/or output naturalness.
In comparison, hand-engineered systems tend to produce more complex and diverse outputs and are able to reach high overall quality, but are mostly rated low on naturalness. Note that similar findings have been reported by Wiseman, Shieber, and Rush (2017) for data-to-document generation. This raises the general question regarding efficiency, costs, and performance of purely data-driven versus carefully hand-engineered NLG systems.
To facilitate further research in this domain, we have made the following data and tools freely available for download:
The E2E NLG training dataset (including test set with human references),
A set of word-overlap-based metrics used for automatic evaluation in the challenge,
Outputs of the baseline TGen system for the development set,
Outputs for the test set produced by the baseline and all participating systems,
The corresponding RankME ratings for quality and naturalness collected in the human evaluation campaign.
All can be accessed under the following URL:
In future work, we aim to investigate additional evaluation methods for NLG systems, such as post-edits (Sripada, Reiter, and Hawizy, 2005), or extrinsic evaluation, such as NLG’s contribution to task success (Rieser, Lemon, and Keizer, 2014; Gkatzia, Lemon, and Rieser, 2016). We also intend to continue our work on automatic quality estimation for NLG (Dušek, Novikova, and Rieser, 2017), where the large amount of data obtained in this challenge allows a wider range of experiments than previously possible.
Acknowledgements. This research received funding from the EPSRC projects DILiGENt (EP/M005429/1) and MaDrIgAL (EP/N017536/1) and Charles University project PRIMUS/19/SCI/10. The Titan Xp used for this research was donated by the NVIDIA Corporation. The authors would like to thank Lena Reed and Shereen Oraby for help with computing the slot error rate. We would also like to thank Prof. Ehud Reiter, whose blog (https://ehudreiter.com/) inspired some of this research.
- Agarwal, Dymetman, and Gaussier (2018) Agarwal, Shubham, Marc Dymetman, and Éric Gaussier. 2018. Char2char generation with reranking for the E2E NLG Challenge. In Proceedings of INLG.
- Arnold and Emerson (2011) Arnold, Taylor B and John W Emerson. 2011. Nonparametric Goodness-of-Fit Tests for Discrete Null Distributions. The R Journal, 3(2):34–39.
- Bahdanau, Cho, and Bengio (2015) Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA. ArXiv: 1409.0473.
- Bao et al. (2018) Bao, Junwei, Duyu Tang, Nan Duan, Zhao Yan, Yuanhua Lv, Ming Zhou, and Tiejun Zhao. 2018. Table-to-Text: Describing Table Region with Natural Language. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 5020–5027, New Orleans, LA, USA. ArXiv: 1805.11234.
- Bard, Robertson, and Sorace (1996) Bard, Ellen Gurman, Dan Robertson, and Antonella Sorace. 1996. Magnitude estimation of linguistic acceptability. Language, 72:32–68.
- Belz and Gatt (2007) Belz, Anja and Albert Gatt. 2007. The attribute selection for GRE challenge: Overview and evaluation results. In Proceedings of the Machine Translation Summit XI, pages 75–83.
- Belz and Hastie (2014) Belz, Anja and Helen Hastie. 2014. Comparative evaluation and shared tasks for nlg in interactive systems. In Amanda Stent and Srinivas Bangalore, editors, Natural Language Generation in Interactive Systems. Cambridge University Press, Cambridge, chapter 13, pages 302–350.
- Belz and Kow (2011) Belz, Anja and Eric Kow. 2011. Discrete vs. continuous rating scales for language evaluation in NLP. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Short papers, pages 230–235, Portland, OR, USA.
- Black et al. (2011) Black, Alan W, Susanne Burger, Alistair Conkie, Helen Hastie, Simon Keizer, Oliver Lemon, Nicolas Merigaud, Gabriel Parent, Gabriel Schubiner, Blaise Thomson, Jason D. Williams, Kai Yu, Steve Young, and Maxine Eskenazi. 2011. Spoken dialog challenge 2010: Comparison of live and control test results. In Proceedings of the SIGDIAL 2011 Conference, pages 2–7, Association for Computational Linguistics, Portland, Oregon.
- Bojar et al. (2017) Bojar, Ondřej, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, et al. 2017. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation (WMT), pages 169–214, Copenhagen, Denmark.
- Bojar et al. (2013) Bojar, Ondřej, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Association for Computational Linguistics, Sofia, Bulgaria.
- Bojar et al. (2014) Bojar, Ondřej, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Association for Computational Linguistics, Baltimore, Maryland, USA.
- Bojar, Graham, and Kamran (2017) Bojar, Ondřej, Yvette Graham, and Amir Kamran. 2017. Results of the WMT17 Metrics Shared Task. In Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers, pages 489–513, Association for Computational Linguistics, Copenhagen, Denmark.
- Britz et al. (2017) Britz, Denny, Anna Goldie, Minh-Thang Luong, and Quoc Le. 2017. Massive Exploration of Neural Machine Translation Architectures. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark. ArXiv: 1703.03906.
- Callison-Burch and Dredze (2010) Callison-Burch, Chris and Mark Dredze. 2010. Creating speech and language data with Amazon’s Mechanical Turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, pages 1–12, Association for Computational Linguistics.
- Callison-Burch et al. (2007) Callison-Burch, Chris, Cameron Fordyce, Philipp Koehn, Christof Monz, and Josh Schroeder. 2007. (Meta-) evaluation of machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation (WMT), pages 136–158, Prague, Czech Republic.
- Chang et al. (2015) Chang, Kai-Wei, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. 2015. Learning to search better than your teacher. In Proceedings of the International Conference on Machine Learning (ICML), Lille, France. ArXiv: 1502.02206.
- Chen and Mooney (2008) Chen, David L. and Raymond J. Mooney. 2008. Learning to sportscast: A test of grounded language acquisition. In Proceedings of the 25th international conference on Machine learning (ICML), pages 128–135, Helsinki, Finland.
- Chen, Lampouras, and Vlachos (2018) Chen, Mingje, Gerasimos Lampouras, and Andreas Vlachos. 2018. Sheffield at E2E: structured prediction approaches to end-to-end language generation. In E2E NLG Challenge System Descriptions.
- Chen (2018) Chen, Shuang. 2018. A General Model for Neural Text Generation from Structured Data. In E2E NLG Challenge System Descriptions.
- Chen et al. (2015) Chen, Xinlei, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C. Lawrence Zitnick. 2015. Microsoft COCO Captions: Data Collection and Evaluation Server. CoRR, abs/1504.00325.
- Cho et al. (2014) Cho, Kyunghyun, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. ArXiv: 1406.1078.
- Collins (1997) Collins, Michael. 1997. Three Generative, Lexicalised Models for Statistical Parsing. In Proceedings of the 8th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 16–23, Madrid, Spain.
- Covington et al. (2006) Covington, Michael A, Congzhou He, Cati Brown, Lorina Naçi, and John Brown. 2006. How Complex Is That Sentence? A Proposed Revision of the Rosenberg and Abbeduto D-Level Scale. Technical Report CASPR Research Report 2006-01, University of Georgia, Athens, GA, USA.
- Crammer, Kulesza, and Dredze (2009) Crammer, Koby, Alex Kulesza, and Mark Dredze. 2009. Adaptive regularization of weight vectors. In Advances in Neural Information Processing Systems, pages 414–422, Vancouver, Canada.
- Deriu and Cieliebak (2018) Deriu, Jan and Mark Cieliebak. 2018. End-to-End Trainable System for Enhancing Diversity in Natural Language Generation. In E2E NLG Challenge System Descriptions.
- Dethlefs et al. (2012) Dethlefs, Nina, Helen Hastie, Verena Rieser, and Oliver Lemon. 2012. Optimising Incremental Dialogue Decisions Using Information Density for Interactive Systems. In Proc. of EMNLP.
- Doddington (2002) Doddington, George. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, pages 138–145, San Diego, CA, USA.
- Dušek and Jurčíček (2015) Dušek, Ondřej and Filip Jurčíček. 2015. Training a natural language generator from unaligned data. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 451–461, Beijing, China.
- Dušek (2017) Dušek, Ondřej. 2017. Novel Methods for Natural Language Generation in Spoken Dialogue Systems. Ph.D. thesis, Charles University, Prague, Czech Republic.
- Dušek and Jurčíček (2016) Dušek, Ondřej and Filip Jurčíček. 2016. A Context-aware Natural Language Generator for Dialogue Systems. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 185–190, Association for Computational Linguistics, Los Angeles, CA, USA.
- Dušek and Jurčíček (2016a) Dušek, Ondřej and Filip Jurčíček. 2016a. Sequence-to-sequence generation for spoken dialogue via deep syntax trees and strings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 45–51, Berlin, Germany. arXiv:1606.05491.
- Dušek, Novikova, and Rieser (2017) Dušek, Ondřej, Jekaterina Novikova, and Verena Rieser. 2017. Referenceless Quality Estimation for Natural Language Generation. In Proceedings of the 1st Workshop on Learning to Generate Natural Language (LGNL), Sydney, Australia. ArXiv: 1708.01759.
- Dušek, Novikova, and Rieser (2018) Dušek, Ondřej, Jekaterina Novikova, and Verena Rieser. 2018. Findings of the E2E NLG Challenge. In Proceedings of the 11th International Conference on Natural Language Generation, pages 322–328, Tilburg, The Netherlands.
- Elder et al. (2018) Elder, Henry, Sebastian Gehrmann, Alexander O’Connor, and Qun Liu. 2018. E2E NLG Challenge Submission: Towards Controllable Generation of Diverse Natural Language. In Proceedings of INLG.
- Freitag and Roy (2018) Freitag, Markus and Scott Roy. 2018. Unsupervised Natural Language Generation with Denoising Autoencoders. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3922–3929, Brussels, Belgium. ArXiv: 1804.07899.
- Gardent et al. (2017a) Gardent, Claire, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017a. The webnlg challenge: Generating text from rdf data. In Proceedings of the 10th International Conference on Natural Language Generation, pages 124–133, Association for Computational Linguistics.
- Gardent et al. (2017b) Gardent, Claire, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017b. The WebNLG Challenge: Generating Text from RDF Data. In Proceedings of the 10th International Conference on Natural Language Generation, pages 124–133, Association for Computational Linguistics, Santiago de Compostela, Spain.
- Gatt and Krahmer (2017) Gatt, Albert and Emiel Krahmer. 2017. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research (JAIR), 60.
- Gehrmann et al. (2018) Gehrmann, Sebastian, Falcon Z. Dai, Henry Elder, and Alexander M. Rush. 2018. End-to-End Content and Plan Selection for Natural Language Generation. In E2E NLG Challenge System Descriptions.
- Gkatzia, Lemon, and Rieser (2016) Gkatzia, Dimitra, Oliver Lemon, and Verena Rieser. 2016. Natural language generation enhances human decision-making with uncertain information. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 264–268, Berlin, Germany. arXiv:1606.03254.
- Gkatzia and Mahamood (2015) Gkatzia, Dimitra and Saad Mahamood. 2015. A Snapshot of NLG Evaluation Practices 2005 - 2014. In Proceedings of the 15th European Workshop on Natural Language Generation (ENLG), pages 57–60, Association for Computational Linguistics, Brighton, UK.
- Gong (2018) Gong, Heng. 2018. Technical Report for E2E NLG Challenge. In E2E NLG Challenge System Descriptions.
- Graham et al. (2013) Graham, Yvette, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous Measurement Scales in Human Evaluation of Machine Translation. In Proceedings of the 7th Linguistic Annotation Workshop & Interoperability with Discourse, pages 33–41, Sofia, Bulgaria.
- Guzman-Rivera, Batra, and Kohli (2012) Guzman-Rivera, Abner, Dhruv Batra, and Pushmeet Kohli. 2012. Multiple choice learning: Learning to produce multiple structured outputs. In Advances in Neural Information Processing Systems, pages 1799–1807, Lake Tahoe, NV, USA.
- Han et al. (2013) Han, Lushan, Abhay Kashyap, Tim Finin, James Mayfield, and Jonathan Weese. 2013. UMBC_EBIQUITY-CORE: Semantic textual similarity systems. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), volume 1, pages 44–52, Atlanta, Georgia.
- Henderson, Thomson, and Williams (2014) Henderson, Matthew, Blaise Thomson, and Jason D. Williams. 2014. The Second Dialog State Tracking Challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 263–272, Association for Computational Linguistics, Philadelphia, PA, U.S.A.
- Herbrich, Minka, and Graepel (2006) Herbrich, Ralf, Tom Minka, and Thore Graepel. 2006. TrueSkill™: a Bayesian skill rating system. In Proceedings of the 19th International Conference on Neural Information Processing Systems (NIPS), pages 569–576, Vancouver, Canada.
- Hochreiter and Schmidhuber (1997) Hochreiter, S. and J. Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
- Jagfeld, Jenne, and Vu (2018) Jagfeld, Glorianna, Sabrina Jenne, and Ngoc Thang Vu. 2018. Sequence-to-Sequence Models for Data-to-Text Natural Language Generation: Word- vs. Character-based Processing and Output Diversity. In Proceedings of the 11th International Conference on Natural Language Generation, Tilburg, The Netherlands. ArXiv: 1810.04864.
- Juraska et al. (2018) Juraska, Juraj, Panagiotis Karagiannis, Kevin K. Bowden, and Marilyn A. Walker. 2018. A Deep Ensemble Model with Slot Alignment for Sequence-to-Sequence Natural Language Generation. In Proceedings of NAACL-HLT, New Orleans, LA, USA.
- Juraska and Walker (2018) Juraska, Juraj and Marilyn Walker. 2018. Characterizing Variation in Crowd-Sourced Data for Training Neural Language Generators to Produce Stylistically Varied Outputs. In Proceedings of the 11th International Conference on Natural Language Generation, pages 441–450, Tilburg, The Netherlands. ArXiv: 1809.05288.
- Kaffee et al. (2018) Kaffee, Lucie-Aimée, Hady Elsahar, Pavlos Vougiouklis, Christophe Gravier, Frédérique Laforest, Jonathon Hare, and Elena Simperl. 2018. Learning to Generate Wikipedia Summaries for Underserved Languages from Wikidata. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 640–645, New Orleans, LA, USA. ArXiv: 1803.07116.
- Kann, Rothe, and Filippova (2018) Kann, Katharina, Sascha Rothe, and Katja Filippova. 2018. Sentence-Level Fluency Evaluation: References Help, But Can Be Spared! In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 313–323, Brussels, Belgium.
- Kiddon, Zettlemoyer, and Choi (2016) Kiddon, Chloé, Luke Zettlemoyer, and Yejin Choi. 2016. Globally Coherent Text Generation with Neural Checklist Models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 329–339, Austin, TX, USA.
- Kingma and Ba (2015) Kingma, Diederik and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA. ArXiv: 1412.6980.
- Klein et al. (2017) Klein, Guillaume, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 67–72, Vancouver, Canada.
- Koehn and Knowles (2017) Koehn, Philipp and Rebecca Knowles. 2017. Six Challenges for Neural Machine Translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver, Canada.
- Lampouras and Vlachos (2016) Lampouras, Gerasimos and Andreas Vlachos. 2016. Imitation learning for language generation from unaligned data. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1101–1112, The COLING 2016 Organizing Committee, Osaka, Japan.
- Lavie and Agarwal (2007) Lavie, Alon and Abhaya Agarwal. 2007. Meteor: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 228–231, Prague, Czech Republic.
- Lebret, Grangier, and Auli (2016) Lebret, Remi, David Grangier, and Michael Auli. 2016. Neural Text Generation from Structured Data with Application to the Biography Domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1203–1213, Austin, TX, USA. ArXiv: 1603.07771.
- Lin (2004) Lin, Chin-Yew. 2004. ROUGE: A package for automatic evaluation of summaries. In Text summarization branches out: Proceedings of the ACL-04 workshop, pages 74–81, Barcelona, Spain.
- Liu and Lane (2016) Liu, Bing and Ian Lane. 2016. Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling. In Proceedings of INTERSPEECH, San Francisco, CA, USA. ArXiv: 1609.01454.
- Lu (2009) Lu, Xiaofei. 2009. Automatic measurement of syntactic complexity in child language acquisition. International Journal of Corpus Linguistics, 14(1):3–28.
- Lu (2012) Lu, Xiaofei. 2012. The relationship of lexical richness to the quality of ESL learners’ oral narratives. The Modern Language Journal, 96(2):190–208.
- Mairesse and Walker (2007) Mairesse, F. and M.A. Walker. 2007. PERSONAGE: Personality generation for dialogue. In 45th Annual Meeting of the Association For Computational Linguistics, pages 496–503, Prague.
- Mairesse et al. (2010) Mairesse, François, Milica Gašić, Filip Jurčíček, Simon Keizer, Blaise Thomson, Kai Yu, and Steve Young. 2010. Phrase-based statistical language generation using graphical models and active learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1552–1561, Uppsala, Sweden.
- Mangrulkar et al. (2018) Mangrulkar, Sourab, Suhani Shrivastava, Veena Thenkanidiyoor, and Dileep Aroor Dinesh. 2018. A Context-aware Convolutional Natural Language Generation model for Dialogue Systems. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pages 191–200, Melbourne, Australia.
- Manishina et al. (2016) Manishina, Elena, Bassam Jabaian, Stéphane Huet, and Fabrice Lefevre. 2016. Automatic corpus extension for data-driven natural language generation. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC), pages 3624–3631, Portorož, Slovenia.
- Manning and Schütze (2000) Manning, Christopher D. and Hinrich Schütze. 2000. Foundations of statistical natural language processing, 2nd printing edition. MIT Press, Cambridge, MA, USA.
- Mason and Watts (2010) Mason, Winter and Duncan J Watts. 2010. Financial incentives and the performance of crowds. ACM SigKDD Explorations Newsletter, 11(2):100–108.
- Mei, Bansal, and Walter (2016) Mei, Hongyuan, Mohit Bansal, and Matthew R. Walter. 2016. What to talk about and how? Selective generation using LSTMs with coarse-to-fine alignment. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA. arXiv:1509.00838.
- Mel’čuk (1988) Mel’čuk, Igor’ A. 1988. Dependency syntax: theory and practice. SUNY series in linguistics. State University of New York Press, Albany, NY, USA.
- Mille and Dasiopoulou (2018) Mille, Simon and Stamatia Dasiopoulou. 2018. FORGe at E2E 2017. In E2E NLG Challenge System Descriptions.
- Nayak et al. (2017) Nayak, Neha, Dilek Hakkani-Tür, Marilyn Walker, and Larry Heck. 2017. To Plan or not to Plan? Discourse Planning in Slot-Value Informed Sequence to Sequence Models for Language Generation. In Proceedings of Interspeech, pages 3339–3343, Stockholm, Sweden.
- Nema et al. (2018) Nema, Preksha, Shreyas Shetty, Parag Jain, Anirban Laha, Karthik Sankaranarayanan, and Mitesh M. Khapra. 2018. Generating Descriptions from Structured Data Using a Bifocal Attention Mechanism and Gated Orthogonalization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1539–1550, New Orleans, LA, USA. ArXiv: 1804.07789.
- Nguyen and Tran (2018) Nguyen, Dang Tuan and Trung Tran. 2018. Structure-based Generation System for E2E NLG Challenge. In E2E NLG Challenge System Descriptions.
- Novikova et al. (2017) Novikova, Jekaterina, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2231–2242.
- Novikova, Dušek, and Rieser (2017) Novikova, Jekaterina, Ondřej Dušek, and Verena Rieser. 2017. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue (SIGDIAL), pages 201–206.
- Novikova, Dušek, and Rieser (2018) Novikova, Jekaterina, Ondrej Dušek, and Verena Rieser. 2018. RankME: Reliable human ratings for natural language generation. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 72–78, New Orleans, LA, USA. arXiv:1803.05928.
- Novikova, Lemon, and Rieser (2016) Novikova, Jekaterina, Oliver Lemon, and Verena Rieser. 2016. Crowd-sourcing NLG data: Pictures elicit better data. In Proceedings of the 9th International Natural Language Generation Conference, pages 265–273, Edinburgh, UK. arXiv:1608.00339.
- Novikova and Rieser (2016) Novikova, Jekaterina and Verena Rieser. 2016. The aNALoGuE Challenge: Non Aligned Language GEneration. In Proceedings of the 9th International Natural Language Generation conference, pages 168–170.
- Oraby, Reed, and Tandon (2018) Oraby, Shereen, Lena Reed, and Shubhangi Tandon. 2018. Controlling Personality-Based Stylistic Variation with Neural Natural Language Generators. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, Melbourne, Australia.
- Oraby et al. (2018a) Oraby, Shereen, Lena Reed, Shubhangi Tandon, Sharath T.S., Stephanie Lukin, and Marilyn Walker. 2018a. TNT-NLG, System 1: Using a statistical NLG to massively augment crowd-sourced data for neural generation. In E2E NLG Challenge System Descriptions.
- Oraby et al. (2018b) Oraby, Shereen, Lena Reed, Sharath TS, Shubhangi Tandon, and Marilyn Walker. 2018b. Neural MultiVoice Models for Expressing Novel Personalities in Dialog. In Proceedings of Interspeech, pages 3057–3061, Hyderabad, India. ArXiv: 1809.01331.
- Papineni et al. (2002) Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Association for Computational Linguistics, Philadelphia, PA, USA.
- Parra Escartín et al. (2017) Parra Escartín, Carla, Wessel Reijers, Teresa Lynn, Joss Moorkens, Andy Way, and Chao-Hong Liu. 2017. Ethical considerations in nlp shared tasks. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 66–73, Association for Computational Linguistics.
- Perez-Beltrachini and Gardent (2017) Perez-Beltrachini, Laura and Claire Gardent. 2017. Analysing data-to-text generation benchmarks. In Proceedings of the 10th International Natural Language Generation Conference, Santiago de Compostela, Spain.
- Puzikov and Gurevych (2018) Puzikov, Yevgeniy and Iryna Gurevych. 2018. E2E NLG Challenge: Neural Models vs. Templates. In Proceedings of INLG.
- Reed, Oraby, and Walker (2018) Reed, Lena, Shereen Oraby, and Marilyn Walker. 2018. Can Neural Generators for Dialogue Learn Sentence Planning and Discourse Structuring? In Proceedings of the 11th International Conference on Natural Language Generation, pages 284–295, Tilburg, The Netherlands. ArXiv: 1809.03015.
- Reiter (2018) Reiter, Ehud. 2018. A Structured Review of the Validity of BLEU. Computational Linguistics, 44(3):393–401.
- Reiter (2019) Reiter, Ehud. 2019. Does Deep Learning Prefer Readability over Accuracy? Ehud Reiter’s Blog. Available online at https://ehudreiter.com/2019/01/08/deep-learning-prefer-readability/ (accessed: Jan 10, 2019).
- Reiter and Belz (2009) Reiter, Ehud and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.
- Rieser and Lemon (2009) Rieser, Verena and Oliver Lemon. 2009. Natural language generation as planning under uncertainty for spoken dialogue systems. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL), pages 683–691, Athens, Greece.
- Rieser, Lemon, and Keizer (2014) Rieser, Verena, Oliver Lemon, and Simon Keizer. 2014. Natural language generation as incremental planning under uncertainty: Adaptive information presentation for statistical dialogue systems. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(5):979–993.
- Sakaguchi, Post, and Van Durme (2014) Sakaguchi, Keisuke, Matt Post, and Benjamin Van Durme. 2014. Efficient elicitation of annotations for human evaluation of machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation (WMT), pages 1–11, Baltimore, MD, USA.
- Schilder et al. (2018) Schilder, Frank, Charese Smiley, Elnaz Davoodi, and Dezhao Song. 2018. The E2E NLG Challenge: A tale of two systems. In Proceedings of INLG.
- Sennrich, Haddow, and Birch (2016) Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. ArXiv: 1508.07909.
- Sharma et al. (2016a) Sharma, Shikhar, Jing He, Kaheer Suleman, Hannes Schulz, and Philip Bachman. 2016a. Natural language generation in dialogue using lexicalized and delexicalized data. CoRR, abs/1606.03632.
- Sharma et al. (2016b) Sharma, Shikhar, Jing He, Kaheer Suleman, Hannes Schulz, and Philip Bachman. 2016b. Natural Language Generation in Dialogue using Lexicalized and Delexicalized Data. arXiv:1606.03632 [cs]. ArXiv: 1606.03632.
- Shimorina and Gardent (2018) Shimorina, Anastasia and Claire Gardent. 2018. Handling Rare Items in Data-to-Text Generation. In Proceedings of the 11th International Conference on Natural Language Generation, pages 360–370, Tilburg, The Netherlands.
- Specia, Raj, and Turchi (2010) Specia, Lucia, Dhwaj Raj, and Marco Turchi. 2010. Machine translation evaluation versus quality estimation. Machine translation, 24(1):39–50.
- Sprouse (2011) Sprouse, Jon. 2011. A validation of Amazon Mechanical Turk for the collection of acceptability judgments in linguistic theory. Behavior research methods, 43(1):155–167.
- Sripada, Reiter, and Hawizy (2005) Sripada, Somayajulu G, Ehud Reiter, and Lezan Hawizy. 2005. Evaluation of an nlg system using post-edit data: Lessons learnt. In 10th European Workshop on Natural Language Generation.
- Stent, Prasad, and Walker (2004) Stent, Amanda, Rashmi Prasad, and Marilyn Walker. 2004. Trainable sentence planning for complex information presentations in spoken dialog systems. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics, pages 79–86, Barcelona, Spain.
- Straková, Straka, and Hajič (2014) Straková, Jana, Milan Straka, and Jan Hajič. 2014. Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 13–18, Association for Computational Linguistics, Baltimore, Maryland.
- Sutskever, Vinyals, and Le (2014) Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112. ArXiv:1409.3215.
- Tandon et al. (2018) Tandon, Shubhangi, Sharath T.S., Shereen Oraby, Lena Reed, Stephanie Lukin, and Marilyn Walker. 2018. TNT-NLG, System 2: Data repetition and meaning representation manipulation to improve neural generation. In E2E NLG Challenge System Descriptions.
- Tian, Douratsos, and Groves (2018) Tian, Ye, Ioannis Douratsos, and Isabel Groves. 2018. Treat the system like a human student: Automatic naturalness evaluation of generated text without reference texts. In Proceedings of the 11th International Conference on Natural Language Generation, pages 109–118, Tilburg, The Netherlands.
- Tran, Nguyen, and Tojo (2017) Tran, Van-Khanh, Le-Minh Nguyen, and Satoshi Tojo. 2017. Neural-based Natural Language Generation in Dialogue using RNN Encoder-Decoder with Semantic Aggregation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 231–240, Saarbrücken, Germany. ArXiv: 1706.06714.
- Ueffing, Camargo de Souza, and Leusch (2018) Ueffing, Nicola, José G. Camargo de Souza, and Gregor Leusch. 2018. Quality Estimation for Automatically Generated Titles of eCommerce Browse Pages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pages 52–59, Association for Computational Linguistics, New Orleans, LA, USA.
- Vedantam, Zitnick, and Parikh (2015) Vedantam, Ramakrishna, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA.
- Vinyals, Fortunato, and Jaitly (2015) Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer Networks. In Advances in Neural Information Processing Systems 28 (NIPS 2015), Montréal, Canada. ArXiv: 1506.03134.
- Walker et al. (2004) Walker, Marilyn A, Stephen J Whittaker, Amanda Stent, Preetam Maloor, Johanna Moore, Michael Johnston, and Gunaranjan Vasireddy. 2004. Generation and evaluation of user tailored responses in multimodal dialogue. Cognitive Science, 28(5):811–840.
- Wang et al. (2018) Wang, Qingyun, Xiaoman Pan, Lifu Huang, Boliang Zhang, Zhiying Jiang, Heng Ji, and Kevin Knight. 2018. Describing a Knowledge Base. In Proceedings of the 11th International Conference on Natural Language Generation, pages 10–21, Tilburg University, The Netherlands.
- Wang et al. (2012) Wang, William Yang, Dan Bohus, Ece Kamar, and Eric Horvitz. 2012. Crowdsourcing the acquisition of natural language corpora: Methods and observations. In Spoken Language Technology Workshop (SLT), 2012 IEEE, pages 73–78, IEEE.
- Wang, Berant, and Liang (2015) Wang, Yushi, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1332–1342, Association for Computational Linguistics, Beijing, China.
- Wen et al. (2015a) Wen, Tsung-Hsien, Milica Gasić, Dongho Kim, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015a. Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 275–284, Association for Computational Linguistics, Prague, Czech Republic.
- Wen et al. (2016) Wen, Tsung-Hsien, Milica Gašić, Nikola Mrkšić, Lina Maria Rojas-Barahona, Pei-hao Su, David Vandyke, and Steve J. Young. 2016. Multi-domain neural network language generation for spoken dialogue systems. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 120–129, San Diego, CA, USA. arXiv:1603.01232.
- Wen et al. (2015b) Wen, Tsung-Hsien, Milica Gašić, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015b. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1711–1721, Lisbon, Portugal.
- Wen et al. (2017) Wen, Tsung-Hsien, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A Network-based End-to-End Trainable Task-oriented Dialogue System. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 438–449, Valencia, Spain. ArXiv: 1604.04562.
- Williams and Young (2007) Williams, Jason D. and Steve Young. 2007. Partially Observable Markov Decision Processes for Spoken Dialog Systems. Computer Speech & Language, 21(2):393–422.
- Williams (1992) Williams, Ronald J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256.
- Wiseman, Shieber, and Rush (2017) Wiseman, Sam, Stuart M. Shieber, and Alexander M. Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2253–2263, Copenhagen, Denmark.
- Wiseman, Shieber, and Rush (2018) Wiseman, Sam, Stuart M. Shieber, and Alexander M. Rush. 2018. Learning Neural Templates for Text Generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3174–3187, Brussels, Belgium. ArXiv: 1808.10122.
- Young et al. (2010) Young, Steve, Milica Gašić, Simon Keizer, François Mairesse, Jost Schatzmann, Blaise Thomson, and Kai Yu. 2010. The Hidden Information State model: A practical framework for POMDP-based spoken dialogue management. Computer Speech & Language, 24(2):150–174.
- Zaidan and Callison-Burch (2011) Zaidan, Omar F. and Chris Callison-Burch. 2011. Crowdsourcing translation: Professional quality from non-professionals. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 1220–1229, Portland, OR, USA.
- Zhang et al. (2017) Zhang, Biao, Deyi Xiong, Jinsong Su, and Hong Duan. 2017. A Context-Aware Recurrent Encoder for Neural Machine Translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12):2424–2432.
- Zhang et al. (2018) Zhang, Biao, Jing Yang, Qian Lin, and Jinsong Su. 2018. Attention Regularized Sequence-to-Sequence Learning for E2E NLG Challenge. In E2E NLG Challenge System Descriptions.