Neural sequence-to-sequence (Seq2Seq) models (Graves, 2013; Sutskever et al., 2014) have shown promising results for this task, especially in combination with an attention mechanism (Bahdanau et al., 2014; Luong et al., 2015). Several recent NLG approaches (Dušek and Jurcícek, 2016; Mei et al., 2016; Kiddon et al., 2016; Agarwal and Dymetman, 2017), as well as most systems in the E2E and WebNLG challenges, are based on this architecture. While most NLG models generate text word by word, promising results were also obtained by encoding the input and generating the output text character by character (Lipton et al., 2015; Goyal et al., 2016; Agarwal and Dymetman, 2017). Five out of 62 E2E challenge submissions operate on the character level. However, it is difficult to draw conclusions from the challenge results with respect to this difference, since the submitted systems also differ in other aspects and were evaluated on a single dataset only.
Besides adequacy and fluency, variation is an important aspect in NLG (Stent et al., 2005). In addition to comparing the linguistic and content-related correctness of word- and character-based Seq2Seq models through automatic and human evaluation, we investigate the variety of their outputs. While template-based systems can assure perfect content and linguistic quality, they often suffer from low diversity. Conversely, neural models might generalize beyond a limited amount of training texts or templates, thereby producing more diverse outputs. To test this hypothesis, we train Seq2Seq models on template-generated texts with a controlled amount of variation and show that they not only reproduce the templates, but also generate novel structures resulting from template combinations.
In sum, we make the following contributions:
We compare word- and character-based Seq2Seq models for NLG on two datasets.
We conduct an extensive automatic and manual analysis of the generated texts and compare them to human performance.
In an experiment with synthetic training data generated from templates, we demonstrate the ability of neural NLG models to learn template combinations and thereby generalize beyond the linguistic structures they were trained on.
2 Related Work
This section reviews relevant related work according to the two main aspects of this paper: different input and output representations for data-to-text NLG as well as measuring and controlling the variation in the generated outputs.
2.1 Input and Output Representations
While the first NLG systems relied on hand-written rules or templates that were filled with the input information (Cheyer and Guzzoni, 2006; Mirkovic et al., 2006), the availability of larger datasets has accelerated progress in statistical methods that train NLG systems from data-text pairs over the last twenty years (Oh and Rudnicky, 2000; Mairesse and Young, 2014). Generating output via language models based on recurrent neural networks (RNNs) conditioned on the input (Sutskever et al., 2011) proved to be an effective method for end-to-end NLG (Wen et al., 2015a,b, 2016).
The input can be represented in several ways: (1) in a discrete vector space via one-hot vectors (Wen et al., 2015a,b), or in a continuous space, either (2) by encoding fixed-size input information with a feed-forward neural network (Zhou et al., 2017; Wiseman et al., 2017) or (3) by means of an encoder RNN, which processes variable-sized inputs sequentially, giving rise to the Seq2Seq architecture.
Character-based Seq2Seq models were first proposed for neural machine translation (Ling et al., 2015; Chung et al., 2016; Lee et al., 2017). Their main advantage over word-based models is that they can represent an unlimited word inventory with a small vocabulary. They can learn to copy any string from the input to the output, which is especially useful for data-to-text NLG, as information from the input such as the name of a restaurant or a database entity is often expected to appear verbatim in the generated text. Word-based models, in contrast, have to make use of delexicalization during pre- and postprocessing (Wen et al., 2015b; Dušek and Jurcícek, 2016) or have to apply dedicated copy mechanisms (Gu et al., 2016; See et al., 2017; Wiseman et al., 2017) to handle open vocabularies. The other side of the coin is that sequences are much longer in character-based processing, implying longer dependencies and more computation steps for encoding and decoding.
Subword-based representations (Sennrich et al., 2016; Wu et al., 2016) can offer a trade-off between word- and character-based processing and are a popular choice in NMT and summarization (See et al., 2017). Here, the vocabulary consists of subword units of different lengths, which are assigned by minimizing the entropy on the training set. We also experimented with such representations in preliminary experiments, but found them to perform much worse than word- or character-based representations. Our impression is that recurring entity names in the training data, stemming from multiple reference texts for the same input, lead to overfitting on the training vocabulary and to poor generalization to novel inputs. This is also reflected by the rather unsatisfying performance of subword-based approaches in the E2E challenge (the subword-based bzhang_submit system has the second-best ROUGE-L score, but ranks poorly in terms of BLEU and quality in the human evaluation; see http://www.macs.hw.ac.uk/InteractionLab/E2E/#results) and the WebNLG challenge (ADAPT system; Gardent et al., 2017b).
2.2 Output Diversity
Evaluation of data-to-text NLG has traditionally centered around semantic fidelity, grammaticality, and naturalness (Gatt and Krahmer, 2018; Oraby et al., 2018b). More recently, the controllability of the style of the outputs and their variation has moved into focus as well (Ficler and Goldberg, 2017; Herzig et al., 2017; Oraby et al., 2018b, a).
Oraby et al. (2018b) showed that the n-gram entropy of the outputs of a neural NLG system is significantly lower compared to its training data. This can be seen as evidence that the NLG system extracts only a few dominant patterns from the training data that it will generate over and over. Without explicit supervision signals, neural NLG models cannot distinguish linguistic or stylistic variation from noise. In the context of image caption generation, Devlin et al. (2015) found Seq2Seq models to exactly reproduce sentences from their training data for 60% of the test instances.
Several approaches have been proposed to control NLG outputs with respect to certain stylistic aspects, e.g., mimicking a specific persona or character (Lin and Walker, 2011; Walker et al., 2011; Li et al., 2016), personality traits (Mairesse and Walker, 2008; Herzig et al., 2017; Oraby et al., 2018b, a), or various linguistic aspects such as formality, voice, descriptiveness (Ficler and Goldberg, 2017; Bawden, 2017; Niu et al., 2017). All share the feature that the NLG model is conditioned on a representation of the desired aspect in addition to the usual semantic input representation. While this approach makes it possible to successfully control particular, clearly defined aspects of the generated texts, further research is needed to grant more flexible and comprehensive NLG output control.
To encode variable-length inputs and generate variable-length texts, we implement a standard Seq2Seq model (Cho et al., 2014) with Long Short-Term Memory (LSTM) cells (Hochreiter and Schmidhuber, 1997) and attention. Given a training dataset of input-text pairs $(x^{(n)}, y^{(n)})$, the model encodes an input sequence $x = x_1, \ldots, x_{T_x}$ of symbols into a sequence of hidden states $h_1, \ldots, h_{T_x}$ by applying a recurrent neural network (RNN) with LSTM cells that can store and forget sequence information:

$h_t = \mathrm{LSTM}(E_{in}\, x_t,\ h_{t-1})$
The decoder generates the output sequence $y = y_1, \ldots, y_{T_y}$ one symbol at a time by computing $P(y_t \mid y_{<t}, x) = \mathrm{softmax}(W_{out}\, o_t)$.
The decoder output $o_t$, also referred to as context vector, summarizes the input information in each decoding step as a weighted sum of the encoder hidden states: $o_t = \sum_{i=1}^{T_x} \alpha_{ti} h_i$. The attention weights $\alpha_{ti} = \mathrm{softmax}_i(s_t^{\top} W_a h_i)$ are computed with the general attention mechanism (Luong et al., 2015). The decoder hidden states $s_t$ are computed recursively based on the previous output token and decoder output:

$s_t = \mathrm{LSTM}([E_{out}\, y_{t-1}\,;\, o_{t-1}],\ s_{t-1})$
$s_0$ is initialized to the final encoder hidden state $h_{T_x}$; $o_0$ and $y_0$ are initialized to 0; $[\cdot\,;\cdot]$ denotes concatenation. The parameters of the model are the input and output embedding matrices $E_{in}$ and $E_{out}$, the encoder and decoder LSTM parameters, the attention matrix $W_a$, and the output matrix $W_{out}$. They are optimized by minimizing the cross-entropy of the generated texts with the given references for each example in the training set.
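A single step of the general attention mechanism described above can be sketched in a few lines. This is a minimal NumPy illustration under assumed array shapes, not the implementation used in our experiments:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def luong_general_attention(s_t, H, W_a):
    """One decoding step of Luong-style 'general' attention (a sketch).

    s_t : (d,)   current decoder hidden state
    H   : (T, d) encoder hidden states h_1..h_T
    W_a : (d, d) trainable attention matrix

    Returns the context vector (weighted sum of encoder states)
    and the attention weights over encoder positions.
    """
    scores = H @ (W_a @ s_t)   # score(s_t, h_i) = s_t^T W_a h_i
    alpha = softmax(scores)    # normalize over encoder positions
    context = alpha @ H        # sum_i alpha_i * h_i
    return context, alpha
```

In a full model these weights would be recomputed at every decoding step, so the decoder can attend to different input attributes while generating the text.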
Instead of forcing the decoder to decide on a single output symbol in each decoding step, we apply beam search (Cho et al., 2014; Bahdanau et al., 2014) to explore the $k$ best partial hypotheses in parallel.
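The decoding strategy can be sketched as follows. This is a simplified stand-alone illustration; the toy `step_fn` interface and symbol names are assumptions for the example, not part of our system:

```python
import math

def beam_search(step_fn, start, eos, beam_size=5, max_len=20):
    """Minimal beam search sketch.

    step_fn(prefix) returns a dict {symbol: probability} over next
    symbols; prefixes are tuples of symbols. Keeps the beam_size best
    partial hypotheses by cumulative log-probability and collects
    hypotheses that emit the end-of-sequence symbol.
    """
    beams = [((start,), 0.0)]  # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, lp in beams:
            for sym, p in step_fn(prefix).items():
                cand = (prefix + (sym,), lp + math.log(p))
                (finished if sym == eos else candidates).append(cand)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]  # prune to the k best hypotheses
    finished.extend(beams)
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished
```

In our experiments the per-step distribution comes from the decoder's softmax rather than a hand-written function.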
In the word-based model, each input symbol and output symbol denotes a token. In contrast, in the character-based model, each input and output symbol denotes a single character. Our models learn separate encoder and decoder embedding matrices.
We use two recently collected crowd-sourced data-to-text datasets since they are larger and offer more linguistic variety than previously available datasets (Novikova et al., 2017b; Gardent et al., 2017a). The E2E dataset (Novikova et al., 2017b) consists of 47K restaurant descriptions based on 5.7K distinct inputs of 3-8 attributes (name, area, near, eat type, food, price range, family friendly, rating), split into 4862 inputs for training, 547 for development and 630 for testing. The WebNLG dataset (Gardent et al., 2017a) contains 25K verbalizations of 9.6K inputs composed of 1-7 DBpedia triples from 15 categories such as athletes, comic characters, food, and sports teams. It is divided into 6893 inputs for training, 872 for development and 1862 for testing. Both datasets have multiple verbalizations for each input: on average, there are 8.3 verbalizations per input (min. 1, max. 46) in the E2E dataset and 2.63 (min. 1, max. 12) in the WebNLG dataset.
To preprocess both datasets, we lowercase all inputs and references and represent the inputs in the bracketed format as shown in Figure 1. For the word-based processing we additionally tokenize the texts with the nltk-tokenizer (Bird et al., 2009) and apply delexicalization, as also illustrated in Figure 1. For the E2E dataset we adopt the challenge’s baseline delexicalization strategy (Dušek and Jurcícek, 2016), which replaces the values of the two open-class attributes name and near in the input and references by placeholders. For the WebNLG dataset, we adopt the delexicalization strategy of the Tilburg submissions to the challenge, since it performed well and does not require external information: the subject and object entities of the DBpedia triples are replaced in the input and text by numbered placeholders agent-n, patient-n, or bridge-n, depending on whether an entity appears only as subject, only as object, or in both roles in the input of an instance. Additionally, for this dataset we split camel-cased property names for both the word- and character-based models, as proposed by the Adapt and Melbourne submissions. Table 1 displays statistics for both datasets and processing types.
| | E2E word | E2E char. | WebNLG word | WebNLG char. |
| avg. input length | 28.5 | 106.0 | 24.8 | 139.8 |
| avg. text length | 20.0 | 109.3 | 18.8 | 117.1 |
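The delexicalization step for the E2E data can be sketched as follows. This is a simplified illustration of the idea; the placeholder format (`x-name`, `x-near`) and the attribute dictionary interface are assumptions for the example, not necessarily those of the challenge baseline:

```python
def delexicalize(mr, text):
    """Replace the values of open-class attributes (here: name and near,
    as in the E2E baseline strategy) with placeholders in both the input
    meaning representation (a dict) and the reference text."""
    open_class = ("name", "near")
    for attr in open_class:
        value = mr.get(attr)
        if value:
            mr = {**mr, attr: f"x-{attr}"}          # placeholder in the input
            text = text.replace(value, f"x-{attr}")  # placeholder in the text
    return mr, text
```

At generation time, the placeholders in the model output are substituted back with the original attribute values (relexicalization).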
We conduct our experiments with the OpenNMT toolkit (Klein et al., 2017), which we extend to also perform character-based processing. We tuned the hyperparameters for each dataset and processing method to optimize the BLEU score on the development sets. The word-based model for the E2E dataset is trained with stochastic gradient descent (SGD; Robbins and Monro, 1951) and an initial learning rate of 1.0. For all other models, we achieved better performance with the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 0.001. If there is no improvement in the development perplexity, or in any case after the eighth epoch, we halve the learning rate. We clip all gradients to a maximum of five and use a batch size of 64. To prevent overfitting, we drop out units in the context vectors with a probability of 0.3. We keep the model with the lowest development perplexity within 13 training epochs.
The word-based E2E model has 64-dimensional word embeddings and a single encoder and decoder layer with 64 units each. All other models use 500-dimensional word- or character embeddings and two layers in the encoder and decoder with 500 dimensions each. While a unidirectional encoder was sufficient for the word-based models, bidirectional encoders were beneficial for the character-based models on both datasets.
We use a beam size of 15 for decoding with the word-based models, and found a smaller beam of five to yield better results for the character-based models. This is probably due to the much smaller vocabulary size of the character-based models.
For automatic evaluation, we report BLEU (Papineni et al., 2002), which measures the precision of the generated n-grams compared to the references, and the recall-oriented ROUGE-L (Lin, 2004), which measures the longest common subsequence between the generated texts and the references. We compute these scores with the E2E challenge evaluation script (https://github.com/tuetschek/e2e-metrics).
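The core of ROUGE-L is a longest-common-subsequence computation, which can be sketched as follows. This is a single-reference illustration only; the choice of the F-score weight `beta` is an assumption here, and the challenge script additionally handles multiple references, sentence splitting, and tokenization details:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b,
    via standard dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f(candidate, reference, beta=1.2):
    """ROUGE-L F-score from LCS-based recall and precision (a sketch)."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    recall = lcs / len(reference)
    precision = lcs / len(candidate)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

Because the LCS need not be contiguous, ROUGE-L rewards outputs that preserve the reference's word order without requiring exact n-gram matches.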
6 Results and Analysis
Table 2 and 3 display the results on the E2E and WebNLG test sets for models of the respective challenges and for our own models (for an exact comparison, we recomputed the WebNLG challenge results with the E2E evaluation script; they are usually 1-2 points below the scores reported by Gardent et al. (2017b)). Since the performance of neural models can vary considerably due to random parameter initialization and randomized training procedures (Reimers and Gurevych, 2017), we train ten models with different random seeds for each setting and report the average (avg) and standard deviation (SD).
On the E2E test set, our single best word- and character-based models reach comparable results to the best challenge submissions. The word-based models achieve significantly higher BLEU and ROUGE-L scores than the character-based models (all tests for significance in this paper are conducted with Wilcoxon rank-sum tests with Bonferroni correction). On the WebNLG test set, the BLEU score of our best word-based model outperforms the best challenge submission by a small margin. The character-based model achieves a significantly higher ROUGE-L score than the word-based model, whereas the BLEU score difference is not significant. In the following, we analyze our models in more detail.
6.1 Analysis of Within-Model Performance Differences
The large performance span of the character-based models on the E2E dataset is due to a single outlier model; the second worst model scores 64.5 BLEU points. The worst-scoring model had a lower accuracy of 91.8% on the development set, whereas all other models scored above 92.2%. To gain more insight into what might cause the large performance difference, we manually compared the texts generated for ten randomly selected inputs for each number of attributes (60 inputs in total) by the character-based models with the best and worst BLEU score. We found that the worst model makes many mistakes on inputs with three to five attributes, often adding, modifying or removing information, whereas its outputs are mostly correct for inputs with six attributes or more. For these, the outputs of the model with the lowest BLEU score are occasionally even better than those of the best model, which often omits information (mainly concerning the attribute family friendly). We conclude that the large performance difference might be caused by automatic evaluation measures punishing additions more severely than omissions.
We also observe a large performance span for the WebNLG word-based models. Here, we have two models that score exceptionally well with 57.4/58.4 BLEU points, whereas the remaining eight models only obtain BLEU scores in a range of 43.8-48.1. Again, we observe that better models in terms of BLEU score obtain higher accuracies on the development set. We manually compared the outputs of ten randomly chosen inputs for each number of input triples (75 inputs in total) for the model with the highest and lowest BLEU score. In this case, we found that the large difference in the automatic evaluation measures seems justified: The low-scoring model often hallucinates information not present in the input and generally produces many ungrammatical texts, which is not the case for the best model.
E2E test set:
| System | BLEU | ROUGE-L |
| Thomson Reuters (np 3) | 68.1 | 69.3 |
| Thomson Reuters (np 4) | 67.4 | 69.8 |
| HarvardNLP & H. Elder | 67.4 | 70.8 |
| word (best on dev.) | 67.8 | 70.2 |
| char. (best on dev.) | 67.6 | 70.4 |

WebNLG test set:
| System | BLEU | ROUGE-L |
| word (best on dev.) | 44.2 | 60.9 |
| char. (best on dev.) | 41.3 | 58.4 |
6.2 Automatic Evaluation of Human Texts
To gain an impression of the expressiveness of the automatic evaluation scores for NLG, we computed the average scores that the human references would obtain. Table 4 shows the BLEU and ROUGE-L development set scores when treating each human reference as prediction once and evaluating it against the remaining references, compared to the scores of the word-based and character-based models (for a fair comparison between human and model performance, we randomly removed one reference for each instance in the models’ evaluation to ensure the same average number of references, and excluded 55 WebNLG instances that had only one reference). Strikingly, on the E2E development set, both model variants significantly outperform the human texts by far with respect to both automatic evaluation measures. While the human BLEU score is significantly higher than those of both systems on the WebNLG development set, there is no statistical difference between human and system ROUGE-L scores. This further demonstrates the limited utility of BLEU and ROUGE-L scores to evaluate NLG outputs, which was previously suggested by weak correlations of such scores with human judgments (Scott and Moore, 2006; Reiter and Belz, 2009; Novikova et al., 2017a). Furthermore, the high scores on the E2E dataset imply that the models succeed in picking up patterns from the training data that transfer well to the similar development set, whereas human variation and creativity are punished by lexical overlap-based automatic evaluation scores.
| | E2E human | E2E word | E2E char. | WebNLG human | WebNLG word | WebNLG char. |
| % new texts | 99.7±0.2 | 98.2±0.3 | 98.8±0.2 | 91.1±0.3 | 69.8±4.8 | 87.5±0.6 |
| % new sents. | 85.1±1.1 | 61.8±6.4 | 71.4±4.7 | 87.4±0.4 | 57.2±5.8 | 82.1±1.2 |
6.3 Manual Error Analysis
Since the expressiveness of automatic evaluation measures for NLG is limited, as shown in the previous subsection, we performed a manual error analysis on inputs of each length. We define the input length as the number of input attributes for the E2E dataset, ranging from three to eight, and number of input triples for the WebNLG dataset, ranging from one to seven. We randomly selected 15 development instances for each input length, resulting in a total of 90 annotated E2E instances and 105 WebNLG instances.
One annotator (one of the authors of this paper) manually assessed the outputs of the models that obtained the best development set BLEU score, as summarized in Table 5 (although multiple annotators could increase the reliability of these results, the annotator reported that the task was very straightforward; we do not expect marking content and linguistic errors to lead to annotator disagreements, with the exception of accidentally missed errors). As we can see from the bottom part of the table, all models struggle more with getting the content right than with producing linguistically correct texts; 70-80% of the texts generated by all models are completely correct linguistically.
Comparing the two datasets, we again observe that the WebNLG dataset is much more challenging than the E2E dataset, especially with respect to correctly verbalizing the content. This can be attributed to the increased diversity of the inputs and texts and to the limited availability of training data for this dataset (cf. Table 1). Moreover, spelling mistakes only appeared in WebNLG texts, mainly concerning omitted accents or umlauts. This also indicates that the data is too scarce and noisy for the models to learn the correct spelling of all words. Notably, we did not observe any non-words generated by the character-based models.
The most frequent content error in both datasets concerns omission of information. For the E2E dataset, the family friendly attribute is most frequently dropped by both model types, indicating that the verbalization of this boolean attribute is more difficult to learn than other attributes, whose values mostly appear verbatim in the text. Information modification of the word-based model is mainly due to confusing English with Italian food. Information addition and repetition only occur in the WebNLG dataset. The latter is an especially frequent problem of the character-based model, affecting more than a quarter of all texts.
In comparison, character-based models reproduce the content more faithfully on the E2E dataset while offering the same level of linguistic quality as word-based models, leading to more correct outputs overall. On the WebNLG dataset, the word-based model is more faithful to the inputs, probably because of the effective delexicalization strategy, whereas the character-based model errs less on the linguistic side. Overall, the word-based model yields more correct texts, stressing the importance of delexicalization and data normalization in low resource settings.
6.4 Automatic Evaluation of Output Diversity
While correctness is a necessity in NLG, in many settings it is not sufficient. Often, variation of the generated texts is crucial to avoid repetitive and unnatural outputs. Table 6 shows automatically computed statistics on the diversity of the generated texts of both models and of the human texts, and on the overlap of the (generated) texts with the training set. We measure diversity by the number of unique sentences and words in all development set references and generated texts, as done e.g. by Devlin et al. (2015). Additionally, we report the Shannon text entropy as a measure of the amount of variation in the texts, following Oraby et al. (2018b). We compute the text entropy $E$ for words (unigrams) as well as bi- and trigrams as follows:

$E = -\sum_{x \in V} \frac{\mathrm{freq}(x)}{\mathit{total}} \log_2 \frac{\mathrm{freq}(x)}{\mathit{total}}$

where $V$ is the set of all word types or uni-, bi-, and trigrams, $\mathrm{freq}(x)$ denotes the frequency of $x$, and $\mathit{total}$ is the token count or the total number of uni-, bi-, and trigrams in the texts, respectively.
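The entropy measure above can be sketched directly from its definition (a minimal illustration of the formula, not the exact script used for Table 6):

```python
import math
from collections import Counter

def ngram_entropy(tokens, n):
    """Shannon entropy of the n-gram distribution of a token sequence,
    as a measure of lexical variation: higher entropy means the text
    spreads its probability mass over more distinct n-grams."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    total = len(grams)
    counts = Counter(grams)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For example, a text that repeats a single word has unigram entropy 0, while a text of four distinct words reaches the maximum of 2 bits.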
To measure the extent to which the models generalize beyond plugging restaurant or other entity names into templates extracted from the training data, we compute the results on the delexicalized outputs of the word-based models and delexicalize the character-based models’ outputs. For the human scores, we generate artificial prediction files, treating each $i$-th reference as prediction (42 files for E2E, 8 for WebNLG), apply delexicalization, and average the scores over the files.
On both datasets, our systems produce significantly less varied outputs and reproduce more texts and sentences from the training data than the human texts. Interestingly, however, the character-based models generate significantly more unique sentences and copy significantly less from the training data than the word-based models, which copy about 40% of their generated sentences from the training data.
7 Generalizing from Templates
In search of empirical evidence that neural models are able to surpass the structures they were trained on, we train Seq2Seq models with synthetic training data created by templates. This enables us to control the variation in the training data and identify novel generations of the model (if any). We investigate two questions: (1) Do the neural NLG models indeed accurately learn the templates from the training data? (2) Do they learn to combine the training templates to produce more varied outputs than seen during training?
We generate synthetic training data based on two templates. Template 1 corresponds to UKP-TUDA’s submission to the E2E challenge (https://github.com/UKPLab/e2e-nlg-challenge-2017/blob/master/components/template-baseline.py), where the order of describing the input information is fixed. Specifically, the restaurant’s customer rating is always mentioned before its location. For Template 2, we change the beginning of the template and switch the order of mentioning the rating and the location of the restaurant, as shown in Figure 2. Potential combinations of the two templates are to combine the beginning of Template 1 with the ordering of rating and area of Template 2, or vice versa. We generate a single reference text for all 2261 training inputs of the E2E dataset where the name and eattype attributes are present, as these are the two obligatory attributes for the templates. We train word-based models on training data generated with Template 1, with Template 2, and with the concatenation of the training data from Templates 1 and 2. To keep the amount of training data equal in all experiments, we repeat the training corpus generated with only Template 1 or Template 2 once. The hyperparameters for the three models can be found in the appendix.
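The template-based data synthesis can be sketched as follows. The template strings and slot names below are hypothetical stand-ins that only mirror the ordering difference described above (rating before location in Template 1, location before rating in Template 2); the real challenge templates additionally handle optional attributes:

```python
# Illustrative templates: same content, different attribute order.
TEMPLATE_1 = ("{name} is a {food} {eattype}. it has a {rating} customer "
              "rating. it is located in the {area}.")
TEMPLATE_2 = ("{name} is a {food} {eattype} located in the {area}. "
              "it has a {rating} customer rating.")

def verbalize(mr, template):
    """Fill a template with the attribute values of a meaning
    representation (a dict). This sketch assumes all slots are present."""
    return template.format(**mr)
```

Training one model on references from each template and a third on the concatenation of both then lets us test whether the model only memorizes the two fixed orderings or also produces combinations of them.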
Table 7 shows our manual evaluation of the top 30 hypotheses for 10 random E2E test inputs generated by models trained with data synthesized from the two templates. As is evident from the first two rows, all models learned to generalize from the training data to produce correct texts for novel inputs consisting of unseen combinations of input attributes. It was verified in the manual evaluation that 100% of the texts generated by models trained on a single template adhered to this template. Yet, the picture is a bit different for the model trained on data generated by both templates. While the top two hypotheses are equally distributed between adhering to Template 1 and Template 2, more than 5% among the lower-ranked hypotheses constitute a template combination such as the example shown in the bottom part of Figure 2. For 60% of the examined inputs, there was at least one such hypothesis resulting from template combination, of which two thirds were actually correct verbalizations of the input.
Since we found that the models frequently ranked correct hypotheses below hypotheses with content errors, we implemented a simple rule-based reranker based on verbatim matches of attribute values. The reranker assigns an error point to each omission and addition of an attribute value. As can be seen in the final row of Table 7, this simple reranker successfully places correct hypotheses higher up in the ranking, improving the practical usability of the generation model by now offering almost three correct variants for each input among the top five hypotheses on average.
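The reranking idea can be sketched as follows. This is a simplified approximation of the rule described above: it counts missing attribute values as omissions and repeated occurrences as additions, based on verbatim string matching; the scoring details are assumptions for the example:

```python
def rerank(hypotheses, attribute_values):
    """Rule-based reranking sketch: assign error points for attribute
    values that are missing from a hypothesis (omission) or mentioned
    more than once (a simplified notion of addition/repetition), then
    sort hypotheses by ascending error points. Python's stable sort
    preserves the model's original ranking among ties."""
    def error_points(text):
        errors = 0
        for value in attribute_values:
            count = text.count(value)
            if count == 0:
                errors += 1          # omitted attribute value
            elif count > 1:
                errors += count - 1  # repeated mention
        return errors
    return sorted(hypotheses, key=error_points)
```

Because ties keep the beam order, the reranker only promotes hypotheses that verbalize the input more completely, without overriding the model's fluency-based ranking otherwise.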
We compared word-based and character-based Seq2Seq models for data-to-text NLG on two datasets and analyzed their output diversity. Our main findings are as follows: First, Seq2Seq models can learn to verbalize structured inputs in a decent way; their success depends on the breadth of the domain and the amount of available (clean) training data.
Second, in a comparison with texts produced by humans, we saw that neural NLG models can even surpass human performance in terms of automatic evaluation measures. On the one hand, this unveils the ability of the models to extract general patterns from the training data that approximate many reference texts, but on the other hand also once more stresses the limited utility of such measures to evaluate NLG systems.
Third, in light of the multi-faceted analysis we performed, it is difficult to draw a general conclusion on whether word- or character-based processing is more useful for data-to-text generation. Both models yielded comparable results with respect to automatic evaluation measures. In the manual error analysis, the character-based model performed better on the E2E dataset, whereas the word-based model generated more correct outputs on the WebNLG dataset. Character-based models were found to have a significantly higher output diversity.
Finally, in a controlled experiment with word-based Seq2Seq models trained on data synthesized from templates, we showed the capability of such models to perfectly reproduce the templates they were trained on. More importantly, models trained on two templates could generalize beyond their training data and come up with novel texts. In future work, we would like to extend this line of research and train more model variants on a higher number of templates.
- Agarwal and Dymetman (2017) Shubham Agarwal and Marc Dymetman. 2017. A surprisingly effective out-of-the-box char2char model on the e2e nlg challenge dataset. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 158–163, Saarbrücken, Germany.
- Angeli et al. (2010) Gabor Angeli, Percy Liang, and Dan Klein. 2010. A simple domain-independent probabilistic approach to generation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, pages 502–512, Stroudsburg, PA, USA.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv e-prints, abs/1409.0473.
- Bawden (2017) Rachel Bawden. 2017. Machine translation, it’s a question of style, innit? the case of english tag questions. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, pages 2507–2512, Copenhagen, Denmark.
- Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python, 1st edition. O’Reilly Media, Inc.
- Cheyer and Guzzoni (2006) Adam Cheyer and Didier Guzzoni. 2006. Method and apparatus for building an intelligent automated assistant. Patent US 11/518,292 (Patent pending).
- Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar.
- Chung et al. (2016) Junyoung Chung, Kyunghyun Cho, and Yoshua Bengio. 2016. A character-level decoder without explicit segmentation for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Volume 1: Long Papers, pages 1693–1703, Berlin, Germany.
- Devlin et al. (2015) Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, and Margaret Mitchell. 2015. Language models for image captioning: The quirks and what works. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, Volume 2: Short Papers, pages 100–105, Beijing, China.
- Dušek and Jurcícek (2016) Ondřej Dušek and Filip Jurcícek. 2016. Sequence-to-sequence generation for spoken dialogue via deep syntax trees and strings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Volume 2: Short Papers, pages 41–51, Berlin, Germany.
- Ficler and Goldberg (2017) Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. CoRR, abs/1707.02633.
- Gardent et al. (2017a) Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017a. Creating training corpora for NLG micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Volume 1: Long Papers, pages 179–188, Vancouver, Canada.
- Gardent et al. (2017b) Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017b. The WebNLG Challenge: Generating Text from RDF Data. In Proceedings of the 10th International Conference on Natural Language Generation, pages 124–133, Santiago de Compostela, Spain.
- Gatt and Krahmer (2018) Albert Gatt and Emiel Krahmer. 2018. Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research (JAIR), 61:65–170.
- Goyal et al. (2016) Raghav Goyal, Marc Dymetman, and Éric Gaussier. 2016. Natural language generation through character-based RNNs with finite-state prior knowledge. In Proceedings of the 26th International Conference on Computational Linguistics, COLING 2016, Technical Papers, pages 1083–1092, Osaka, Japan.
- Graves (2013) Alex Graves. 2013. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850.
- Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O. K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Volume 1: Long Papers, pages 1631–1640, Berlin, Germany.
- Herzig et al. (2017) Jonathan Herzig, Michal Shmueli-Scheuer, Tommy Sandbank, and David Konopnicki. 2017. Neural response generation for customer service based on personality traits. In Proceedings of the 10th International Conference on Natural Language Generation, INLG 2017, pages 252–256, Santiago de Compostela, Spain.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.
- Kiddon et al. (2016) Chloé Kiddon, Luke Zettlemoyer, and Yejin Choi. 2016. Globally coherent text generation with neural checklist models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, pages 329–339, Austin, TX, USA.
- Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, San Diego, CA, USA.
- Klein et al. (2017) Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. CoRR, abs/1701.02810.
- Koehn (2017) Philipp Koehn. 2017. Neural machine translation. CoRR, abs/1709.07809.
- Lee et al. (2017) Jason Lee, Kyunghyun Cho, and Thomas Hofmann. 2017. Fully character-level neural machine translation without explicit segmentation. TACL, 5:365–378.
- Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and William B. Dolan. 2016. A Persona-Based Neural Conversation Model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Volume 1: Long Papers, pages 994–1003, Berlin, Germany.
- Lin (2004) Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of summaries. In Proceedings of the ACL workshop on Text Summarization Branches Out, pages 74–81, Barcelona, Spain.
- Lin and Walker (2011) Grace Lin and Marilyn Walker. 2011. All the world’s a stage: Learning character models from film. In Proceedings of the Seventh AIIDE Conference, pages 46–52, Palo Alto, CA, USA.
- Ling et al. (2015) Wang Ling, Isabel Trancoso, Chris Dyer, and Alan Black. 2015. Character-based neural machine translation. CoRR, abs/1511.04586.
- Lipton et al. (2015) Zachary Chase Lipton, Sharad Vikram, and Julian McAuley. 2015. Capturing meaning in product reviews with character-level generative text models. CoRR, abs/1511.03683.
- Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, pages 1412–1421, Lisbon, Portugal.
- Mairesse and Walker (2008) François Mairesse and Marilyn Walker. 2008. Trainable Generation of Big-Five Personality Styles through Data-driven Parameter Estimation. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, ACL 2008, pages 165–173, Columbus, OH, USA.
- Mairesse and Young (2014) François Mairesse and Steve J. Young. 2014. Stochastic language generation in dialogue using factored language models. Computational Linguistics, 40(4):763–799.
- Mei et al. (2016) Hongyuan Mei, Mohit Bansal, and Matthew Walter. 2016. What to talk about and how? Selective generation using LSTMs with coarse-to-fine alignment. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics — Human Language Technologies (NAACL HLT), pages 720–730, San Diego, CA, USA.
- Mirkovic et al. (2006) Danilo Mirkovic, Lawrence Cavedon, Matthew Purver, Florin Ratiu, Tobias Scheideck, Fuliang Weng, Qi Zhang, and Kui Xu. 2006. Dialogue management using scripts and combined confidence scores. US Patent App. 11/298,765.
- Niu et al. (2017) Xing Niu, Marianna Martindale, and Marine Carpuat. 2017. A study of style in machine translation: Controlling the formality of machine translation output. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, pages 2814–2819, Copenhagen, Denmark.
- Novikova et al. (2017a) Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017a. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, pages 2231–2242, Copenhagen, Denmark.
- Novikova et al. (2017b) Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017b. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206, Saarbrücken, Germany.
- Oh and Rudnicky (2000) Alice Oh and Alexander Rudnicky. 2000. Stochastic language generation for spoken dialogue systems. In Proceedings of the 2000 ANLP/NAACL Workshop on Conversational Systems - Volume 3, ANLP/NAACL-ConvSyst ’00, pages 27–32, Stroudsburg, PA, USA.
- Oraby et al. (2018a) Shereen Oraby, Lena Reed, Sharath T. S., Shubhangi Tandon, and Marilyn A. Walker. 2018a. Neural multivoice models for expressing novel personalities in dialog. In Interspeech, pages 3057–3061, Hyderabad, India. ISCA.
- Oraby et al. (2018b) Shereen Oraby, Lena Reed, Shubhangi Tandon, Sharath T. S., Stephanie Lukin, and Marilyn Walker. 2018b. Controlling personality-based stylistic variation with neural natural language generators. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pages 180–190, Melbourne, Australia.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318, Stroudsburg, PA, USA.
- Reimers and Gurevych (2017) Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 338–348, Copenhagen, Denmark.
- Reiter and Belz (2009) Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558.
- Robbins and Monro (1951) Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407.
- Scott and Moore (2006) Donia Scott and Johanna Moore. 2006. An NLG evaluation competition? Eight Reasons to Be Cautious. In Proceedings of the Fourth International Natural Language Generation Conference, INLG 2006, Special Session on Sharing Data and Comparative Evaluations, Sydney, Australia.
- See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Volume 1: Long Papers, pages 1073–1083, Vancouver, Canada.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
- Stent et al. (2005) Amanda Stent, Matthew Marge, and Mohit Singhai. 2005. Evaluating evaluation methods for generation in the presence of variation. In Computational Linguistics and Intelligent Text Processing, pages 341–351, Berlin, Heidelberg. Springer.
- Sutskever et al. (2011) Ilya Sutskever, James Martens, and Geoffrey Hinton. 2011. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, pages 1017–1024, Bellevue, WA, USA.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS’14, Cambridge, Massachusetts, USA. MIT Press.
- Vinyals and Le (2015) Oriol Vinyals and Quoc V. Le. 2015. A neural conversational model. In Proceedings of the International Conference on Machine Learning, Deep Learning Workshop, Lille, France.
- Walker et al. (2011) Marilyn A. Walker, Ricky Grant, Jennifer Sawyer, Grace I. Lin, Noah Wardrip-Fruin, and Michael Buell. 2011. Perceived or Not Perceived: Film Character Models for Expressive NLG. In ICIDS, volume 7069 of Lecture Notes in Computer Science. Springer.
- Wen et al. (2015a) Tsung-Hsien Wen, Milica Gasic, Dongho Kim, Nikola Mrksic, Pei-hao Su, David Vandyke, and Steve J. Young. 2015a. Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. CoRR, abs/1508.01755.
- Wen et al. (2016) Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina Maria Rojas-Barahona, Pei-Hao Su, David Vandyke, and Steve J. Young. 2016. Multi-domain neural network language generation for spoken dialogue systems. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 120–129, San Diego, CA, USA.
- Wen et al. (2015b) Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-hao Su, David Vandyke, and Steve J. Young. 2015b. Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, pages 1711–1721, Lisbon, Portugal.
- Wiseman et al. (2017) Sam Wiseman, Stuart Shieber, and Alexander Rush. 2017. Challenges in data-to-document generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, pages 2253–2263, Copenhagen, Denmark.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.
- Zhou et al. (2017) Ming Zhou, Mirella Lapata, Furu Wei, Li Dong, Shaohan Huang, and Ke Xu. 2017. Learning to generate product reviews from attributes. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Volume 1: Long Papers, pages 623–632, Valencia, Spain.
Appendix A Hyperparameters for Models Trained on Synthetic Training Data
For the models trained on template-generated data, we tune the hyperparameters so that their best hypotheses reach 100% accuracy against template-generated references on the development set. All models use a single-layer LSTM with 64 hidden units in both the encoder and the decoder. We halve the learning rate from the eighth training epoch onward, or whenever the perplexity on the validation set does not improve. The gradient norm is capped at two. The decoder uses the general attention mechanism, and decoding uses a beam size of 30. Table 8 shows the hyperparameters that differ between the models.
| hyperparameter | T1 | T2 | T1+2 |
| --- | --- | --- | --- |
| init. learning rate | 0.001 | 1.0 | 1.0 |
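The learning-rate schedule described above (halve the rate from the eighth epoch onward, or when validation perplexity stalls) can be sketched as a small helper. This is an illustrative sketch only; the function name and argument defaults are our own, and the paper's actual training loop (e.g. in OpenNMT) may implement the decay differently.

```python
def next_learning_rate(lr, epoch, val_ppl, best_val_ppl,
                       start_decay_epoch=8, decay=0.5):
    """Return the learning rate for the next epoch.

    Decay kicks in once `epoch` reaches `start_decay_epoch`, or earlier
    if validation perplexity failed to improve on the best seen so far.
    """
    if epoch >= start_decay_epoch or val_ppl >= best_val_ppl:
        return lr * decay
    return lr
```

For example, starting from the T2/T1+2 initial rate of 1.0, the rate stays at 1.0 while validation perplexity keeps improving before epoch 8, and is halved each epoch thereafter. Gradient-norm capping at two would be applied separately at each update step (e.g. via `torch.nn.utils.clip_grad_norm_(params, 2.0)` in a PyTorch implementation).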