1 Introduction
†Authors contributed equally and are listed alphabetically. ††Author contributed to this work as an intern at Amazon.
Much of the recent work on Neural Natural Language Generation (NNLG) focuses on generating a natural language string given some input content, primarily in the form of a structured Meaning Representation (MR) Moryossef et al. (2019); Gong et al. (2019); Dušek et al. (2018); Liu et al. (2017); Colin et al. (2016); Wen et al. (2016); Dusek and Jurcícek (2016); Dušek and Jurcicek (2015); Wen et al. (2015). Popular datasets used for MR-to-text generation are confined to limited domains, e.g., restaurants or product information. They usually consist of simple tuples of slots and values describing the content to be realized, failing to offer any additional information that might be useful for the generation task Novikova et al. (2017b); Gardent et al. (2017); Wen et al. (2015). Table 1 shows examples of MRs from popular datasets.
|E2E Novikova et al. (2017b)||INFORM name[The Punter], food[Indian], priceRange[cheap]||The Punter offers cheap Indian food.|
|Laptop Wen et al. (2016)||INFORM name[satellite eurus65], type[laptop], memory[4gb], driveRange[medium], isForBusiness[false]||The satellite eurus 65 is a laptop designed for home use with 4 gb of memory and a medium sized hard drive.|
Only having simple and limited information within these MRs has several shortcomings. Model outputs are either very generic or generators have to be trained for a narrow domain and cannot be used for new domains. Thus, some recent work has focused on different methods to improve naturalness Zhu et al. (2019) and promote domain transfer Tran and Nguyen (2018); Wen et al. (2016).
The use of an MR is not unique to the problem of language generation: tasks such as dialog state tracking Rastogi et al. (2019), policy learning Chen et al. (2018), and task completion Li et al. (2017) also require an MR to track context and state information relevant to the task. MRs from these more dialog-oriented tasks are often referred to as a “schema.”
While dialog state tracking schemata do not necessarily include descriptions (and generally only include names of intents, slots, and values, like traditional MRs), recent work has suggested that the use of descriptions may help with different language tasks, such as zero-shot and transfer learning Bapna et al. (2017). The most recent Dialog System Technology Challenge (DSTC8) Rastogi et al. (2019) implements this by introducing the idea of schema-guided dialog state tracking.
Table 2 shows a sample schema from DSTC8. It is much richer and more contextually informative than traditional MRs. Each turn is annotated with information about the current speaker (e.g., SYSTEM, USER), dialog act (e.g., REQUEST), slots (e.g., CUISINE), and values (e.g., Mexican and Italian), as well as the surface string utterance. When comparing this schema in Table 2 to the MRs from Table 1, we can see that the only part of the schema reflected in the MRs is the ACTIONS section, which explicitly describes intents, slots, and values.
|VALUES: Mexican, Italian|
|SLOT DESCRIPTIONS -|
|CUISINE: ”Cuisine of food served in the restaurant”|
|SLOT TYPE: CUISINE: is_categorical=true|
|INTENT - FindRestaurants|
|INTENT DESCRIPTION: ”Find a restaurant of a particular cuisine in a city”|
|SERVICE - Restaurants_1|
|SERVICE DESCRIPTION: ”A leading provider for restaurant search and reservations”|
|SPEAKER - System|
|UTTERANCE - ”Is there a specific cuisine type you enjoy, such as Mexican, Italian, or something else?”|
To our knowledge, no previous work on NNLG has attempted to generate natural language strings from schemata using this vastly richer and more informative data. In this paper, we propose the new task of Schema-guided Natural Language Generation, where we take a turn-level schema as input and generate a natural language string describing the required content, guided by the context information provided in the schema. Following previous work on schema-guided language tasks, we hypothesize that descriptions in the schema will lead to better generated outputs and the possibility of zero-shot learning Bapna et al. (2017). For example, to realize the MR REQUEST(time), domain-specific descriptions of common slots like time can help us realize better outputs, such as ”What time do you want to reserve your dinner?” in the restaurant domain, and ”What time do you want to see your movie?” for movies. Similarly, we note that for dialog system developers, writing domain-specific templates for all scenarios is clearly not scalable, but providing a few domain-specific descriptions for slots/intents is much more feasible.
To allow our models to better generalize and to be more directly useful in the context of a dialog system, we specifically focus on system-side turns from the DSTC8 dataset and generate natural language templates, i.e., delexicalized surface forms, such as ”Is there a specific cuisine type you enjoy, such as $cuisine1, $cuisine2, or something else?” from the example schema in Table 2.
Our contributions in this paper are three-fold: (1) we introduce the novel task of schema-guided NLG, (2) we present our methods to include schema descriptions in state-of-the-art NNLG models, and (3) we demonstrate how using a schema leads to better quality outputs than traditional MRs. We experiment with three different NNLG models (Sequence-to-Sequence, Conditional Variational AutoEncoders, and GPT-2 as a Pretrained Language Model). We show that the rich additional information in schemata not only helps provide a context for the generation task, i.e., including domain names and descriptions, but also allows our NNLG models to learn more domain-specific information that might not otherwise be represented in the data. We also present experiments focused on quantifying model performance on domains unseen in training and present a human evaluation aimed at assessing model quality in terms of naturalness and semantic correctness.
To create a rich dataset for NNLG, we repurpose the dataset used for the Schema-Guided State Tracking track of DSTC8 Rastogi et al. (2019) (https://github.com/google-research-datasets/dstc8-schema-guided-dialogue). We describe our data preprocessing in this section.
Since we are focused on system turns, we first drop all the user turns. The second step in the preprocessing pipeline is to delexicalize each of the system utterances. The original data is annotated with the spans of the slots mentioned in each turn. We replace these mentions with the slot type plus an increasing index, prefixed by the $ sign, e.g., $cuisine_1. For example, the utterance “Is there a specific cuisine type you enjoy, such as Mexican, Italian, or something else?” becomes “Is there a specific cuisine type you enjoy, such as $cuisine_1, $cuisine_2, or something else?”
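This delexicalization step can be sketched as follows. This is a hypothetical illustration, not the original pipeline: the `delexicalize` helper and the hard-coded character spans are ours, assuming slot spans come from the DSTC8 turn annotations as (start, end, slot_type) offsets.

```python
from collections import defaultdict

def delexicalize(utterance, spans):
    """Replace annotated slot spans with $slot_i placeholders,
    indexing each slot type left to right (hypothetical helper)."""
    counters = defaultdict(int)
    out, last = [], 0
    for start, end, slot in sorted(spans):
        counters[slot] += 1
        out.append(utterance[last:start])
        out.append(f"${slot}_{counters[slot]}")
        last = end
    out.append(utterance[last:])
    return "".join(out)

utt = "Is there a specific cuisine type you enjoy, such as Mexican, Italian, or something else?"
spans = [(52, 59, "cuisine"), (61, 68, "cuisine")]  # spans of "Mexican" and "Italian"
print(delexicalize(utt, spans))
```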
The third step is to extract the MR corresponding to each system turn. We chose to represent MRs slightly differently from the original data: an MR is a list of 3-tuples, where each dialog act has exactly one slot and one value. Therefore, an MR such as REQUEST(cuisine = [Mexican, Italian]) becomes REQUEST(cuisine=$cuisine_1), REQUEST(cuisine=$cuisine_2) (see Table 3). Note that the MR has been delexicalized in the same fashion as the utterance. Similarly, for MRs that do not have a value, e.g., REQUEST(city), we introduce the null value, resulting in REQUEST(city=null). We also use the null value to replace the slot in dialog acts that do not require one, e.g., BYE() becomes BYE(null=null). This choice is due to how we encode MRs in our models.
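A minimal sketch of this flattening convention (the `flatten_mr` helper name is our own invention):

```python
def flatten_mr(act, slot=None, values=None):
    """Flatten one dialog act into (act, slot, value) 3-tuples,
    using the null conventions described in the text."""
    if slot is None:                      # e.g. BYE() -> BYE(null=null)
        return [(act, "null", "null")]
    if not values:                        # e.g. REQUEST(city) -> REQUEST(city=null)
        return [(act, slot, "null")]
    # One tuple per value, with delexicalized placeholders
    return [(act, slot, f"${slot}_{i + 1}") for i in range(len(values))]

print(flatten_mr("REQUEST", "cuisine", ["Mexican", "Italian"]))
# -> [('REQUEST', 'cuisine', '$cuisine_1'), ('REQUEST', 'cuisine', '$cuisine_2')]
```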
Once we generate template and MR pairs, we add information about the service. In DSTC8, there are multiple services within a single domain, e.g., services travel_1 and travel_2 are both part of the travel domain but have distinct schemata (we show service examples in the appendix). DSTC8 annotates each turn with the corresponding service, so we reuse this information. Our schema also includes user intent. At the time of writing, the DSTC8 test set is not annotated with user intent; since we use user intents for our task, we use the DSTC8 dev set as our test set and randomly split the DSTC8 train set into 90% training and 10% development. Since only user turns are annotated with intent information, we use the immediately preceding user turn’s intent annotation if the system turn and the user turn share the same service. If the service is not the same, we drop the intent information, i.e., we use an empty string for intent (this only happens in 3.3% of cases).
Next, we add information extracted from the schema file of the original data. This includes the service description, slot descriptions (one description for each slot in the MR), and the intent description. These descriptions are very short English sentences (on average 9.8, 5.9, and 8.3 words for service, slot, and intent descriptions, respectively). Lastly, we add to each tuple a sentence describing, in plain English, the meaning of the MR. These descriptions are not directly available in DSTC8, but are procedurally generated by a set of rules (we have a single rule for each act type). For example, the MR CONFIRM(city=$city_1) becomes “Please confirm that the [city] is [$city_1].” The intuition behind these natural language MRs is to provide a more semantically informative representation of the dialog acts, slots, and values.
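The rule-based verbalization might look like the following sketch. The template strings other than the CONFIRM example are illustrative guesses, not the authors’ actual rules:

```python
# Hypothetical per-act-type templates; only the CONFIRM output below
# matches an example given in the text.
RULES = {
    "CONFIRM": "Please confirm that the [{slot}] is [{value}].",
    "REQUEST": "What [{slot}] do you want?",
    "INFORM": "The [{slot}] is [{value}].",
}

def verbalize(act, slot, value):
    """Render one (act, slot, value) tuple as a natural language MR."""
    return RULES[act].format(slot=slot.replace("_", " "), value=value)

print(verbalize("CONFIRM", "city", "$city_1"))
# -> Please confirm that the [city] is [$city_1].
```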
In summary, the final dataset is composed of nearly 4K MRs and over 140K templates. On average, every MR has 58 templates associated with it, but there is large variance: one MR has over 1.7K templates (CONFIRM(restaurant_name, city, time, party_size, date)), and many have only one.
|VALUES: Mexican, Italian|
|UTTERANCE - ”Is there a specific cuisine type you enjoy, such as Mexican, Italian, or something else?”|
|DSTC8 Preprocessed for NLG|
|UTTERANCE - ”Is there a specific cuisine type you enjoy, such as $cuisine1, $cuisine2, or something else?”|
3.1 Feature Encoding
We categorize the features in schemata into two different types. The first type is symbolic features, which are encoded using a word embedding layer. They typically consist of single tokens, e.g., the service name or dialog act. Since these features resemble variable names, we do not consider their semantics: thus, slot types named restaurant and restaurant_name are encoded separately. The second type is natural language features. These features are typically sentences, e.g., service or slot descriptions, that we encode using BERT Devlin et al. (2018) to derive a single embedding tensor.
To represent the full schema, we adopt a flat-encoding strategy. The first part of each schema is the MR, which we define as a sequence of dialog act, slot, and value tuples. At each timestep, we encode a three-part sequence: (1) a new act, slot, and value tuple from the MR, (2) the embeddings of all schema-level features (i.e., services, intents, and their descriptions), and (3) the embedding of the current slot description (see Figure 1).
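As a rough illustration of this flat-encoding strategy (the helper names and toy dimensionalities are ours), each timestep concatenates the current tuple embedding with the shared schema-level embedding and the current slot-description embedding:

```python
import numpy as np

def encode_timestep(tuple_emb, schema_emb, slot_desc_emb):
    """One encoder input: MR tuple + schema-level + slot-description features."""
    return np.concatenate([tuple_emb, schema_emb, slot_desc_emb])

def encode_schema(tuple_embs, schema_emb, slot_desc_embs):
    """Stack one flat-encoded input per (act, slot, value) tuple."""
    return np.stack([encode_timestep(t, schema_emb, d)
                     for t, d in zip(tuple_embs, slot_desc_embs)])

# Three MR tuples with toy embedding sizes: tuple=4, schema=16, slot desc=8
tuples = [np.ones(4), np.zeros(4), np.ones(4)]
descs = [np.zeros(8)] * 3
seq = encode_schema(tuples, np.ones(16), descs)
print(seq.shape)  # (3, 28)
```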
3.2 Sequence-to-Sequence
Our first model is a Seq2Seq model with attention, copy, and constrained decoding (model diagram in the appendix). We implement the attention from Luong et al. (2015):

$$a_t(s) = \frac{\exp(\mathrm{score}(h_t, \bar{h}_s))}{\sum_{s'} \exp(\mathrm{score}(h_t, \bar{h}_{s'}))}$$

where $\mathrm{score}$ is a function that computes the alignment score of the encoder hidden state, $\bar{h}_s$, and the decoder hidden state, $h_t$. The goal of this layer is to attend to the more salient input features.
The copy mechanism we add is based on pointer-generator networks See et al. (2017). When using the pointer generator, at each decoding step $t$ we compute a generation probability:

$$p_{gen} = \sigma(w_{h^*}^{\top} h_t^* + w_s^{\top} s_t + w_x^{\top} x_t + b_{ptr})$$

where $w_{h^*}$, $w_s$, and $w_x$ are learnable weight vectors; $h_t^*$ is a context vector computed by combining the encoder hidden states and the attention weights; $s_t$ is the decoder hidden state; $x_t$ is the decoder input; and $b_{ptr}$ is a bias term. The probability $p_{gen}$ is used to determine the next word generated according to the following equation:

$$P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i : w_i = w} a_i^t$$

Thus $p_{gen}$ behaves like a switch to decide whether to generate from the vocabulary or copy from the input. The goal of the copy mechanism is to enable the generation of special symbols such as $cuisine_1 that are specific to the service.
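The generate-vs-copy mixture can be illustrated with toy numbers (the probabilities below are made up for illustration, not taken from a trained model):

```python
import numpy as np

def final_dist(p_gen, p_vocab, attn, src_ids):
    """P(w) = p_gen * P_vocab(w) + (1 - p_gen) * attention mass on w."""
    dist = p_gen * p_vocab                 # generation part
    for a, idx in zip(attn, src_ids):      # copy part: route attention
        dist[idx] += (1.0 - p_gen) * a     # mass to source-token ids
    return dist

p_vocab = np.full(5, 0.2)      # toy uniform vocab distribution over 5 words
attn = np.array([0.7, 0.3])    # attention over two source tokens
src_ids = [3, 4]               # vocab ids of the source tokens (e.g. "$cuisine_1")
dist = final_dist(0.4, p_vocab, attn, src_ids)
print(dist)                    # still sums to 1; id 3 gets generate + copy mass
```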
3.3 Conditional Variational Auto-Encoder
The Conditional Variational Auto-Encoder (CVAE) Hu et al. (2017) is an extension of the VAE model, where an additional conditioning vector $c$ is attached to the last hidden state of the encoder as the initial hidden state of the decoder. The vector $c$ is used to control the semantic meaning of the output to align with the desired MR. We use the encoded feature vector described in Section 3.1 as $c$. The model objective is the same as for the VAE: the sum of the reconstruction loss and the Kullback–Leibler divergence loss. At training time, the latent vector $z$ is the encoded input sentence. At prediction time, $z$ is sampled from a Gaussian prior learned during training. We also adapt the attention mechanism for the CVAE by adding an additional matrix $W_a$ to compute the alignment score:

$$\mathrm{score}(s_t, \bar{h}_s) = s_t^{\top} W_a \bar{h}_s$$

where $s_t$ is the decoder hidden state.
For both the Seq2Seq and CVAE models, we use constrained decoding to prune out candidate outputs that contain slot repetitions. Using a beam, we keep track of the slots that have already been generated and set the probability of a new candidate node to zero if generated slots are repeated.
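A minimal sketch of this pruning rule, assuming slot placeholders are single tokens beginning with `$` (the `prune_candidates` helper is our own, not the authors’ beam search code):

```python
def prune_candidates(candidates, generated_slots):
    """Zero out candidates that would repeat an already-emitted slot.

    candidates: list of (token, prob) pairs for the next beam step;
    generated_slots: set of placeholders (e.g. "$date_1") emitted so far.
    """
    pruned = []
    for token, prob in candidates:
        if token.startswith("$") and token in generated_slots:
            prob = 0.0  # repetition: remove from consideration
        pruned.append((token, prob))
    return pruned

cands = [("$date_1", 0.5), ("the", 0.3), ("$time_1", 0.2)]
print(prune_candidates(cands, {"$date_1"}))
# -> [('$date_1', 0.0), ('the', 0.3), ('$time_1', 0.2)]
```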
|[Schema 1] ACTIONS (MR): INFORM(price-per-night= $price-per-night1), NOTIFY-SUCCESS(null=null)|
|Slot Desc: price-per-night: ”price per night for the stay”|
|Service: hotels-4 Service Desc: ”Accommodation searching and booking portal”|
|Intent: ReserveHotel Intent Desc: ”Reserve rooms at a selected place for given dates.”|
|Natural Language MR: the [price per night] is [$price-per-night1]. the request succeeded.|
|Ref||$price-per-night1 a night|
|Seq2Seq||your reservation is booked and the total cost is $price-per-night1 .|
|CVAE||your reservation has been made . the total cost is $price-per-night1 per night .|
|GPT2||your reservation was successful! the cost of the room is $price-per-night1 per night.|
|[Schema 2] ACTIONS (MR): OFFER(movie-name= $movie-name1), OFFER(movie-name= $movie-name2)|
|OFFER(movie-name= $movie-name3), INFORM(count=$count1)|
|Slot Desc: movie-name: ”name of the movie”, count: ”the number of items that satisfy the user’s request”|
|Service: media-2 Service Desc: ”The widest selection and lowest prices for movie rentals”|
|Intent: FindMovies Intent Desc: ”Find movies to watch by genre and, optionally, director or actors”|
|Natural Language MR: there is [$movie-name2] for [movie name].|
|there is [$movie-name3] for [movie name]. there is [$movie-name1] for [movie name]. the [count] is [$count1].|
|Ref||$count1 movies $movie-name1 $movie-name2 and $movie-name3|
|Seq2Seq||i found $count1 movies . how about $movie-name2 ?|
|CVAE||i found $count1 movies you might like . how about $movie-name1 , $movie-name2 or $movie-name3 ?|
|GPT2||sure. hey! i’ve got $count1 movies. do you wish to watch any of these movies - $movie-name1, $movie-name2 or $movie-name3?|
|[Schema 3] ACTIONS (MR): REQUEST(transfer-amount=null), REQUEST(recipient-name=null)|
|Slot Desc: transfer-amount: ”the amount of money to transfer”, recipient-name: ”the name of the recipient to transfer […]”|
|Service: banks-2 Service Desc: ”Service to manage your bank accounts and finances”|
|Intent: TransferMoney Intent Desc: ”Transfer money to another user”|
|Natural Language MR: ”what [transfer amount] do you want? what [recipient name] do you want?”|
|Seq2Seq||what type of ride do you want to transfer ?|
|CVAE||what is your destination ? how much do you want to send the money to ?|
|GPT2||sure, i can help you with that. how much would you like to transfer from? who would you like to send it to?|
3.4 Pretrained Language Model: GPT-2
We also experiment with a pretrained language model, specifically GPT-2 Radford et al. (2019) (GPT-2 small from HuggingFace Transformers, https://github.com/huggingface/transformers). Since GPT-2 is trained on purely natural language strings, we first combine the symbolic and natural language features into flat natural language strings, similar to previous work by Budzianowski and Vulić (2019). We fine-tune the GPT-2 model using these natural language inputs with the target template, training with special beginning-of-sequence, end-of-sequence, and separator tokens such that each training instance is: “[BOS] schema-tokens [SEP] target-tokens [EOS].” At prediction time, given the schema tokens as input, we use our fine-tuned GPT-2 model with a language model head to conditionally generate an output sequence (until we hit an end-of-sequence token).
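The training-string construction described above might be sketched as follows; the literal token strings are placeholders for whatever special tokens the tokenizer actually defines:

```python
def make_training_instance(schema_tokens, target_tokens,
                           bos="[BOS]", sep="[SEP]", eos="[EOS]"):
    """Build one fine-tuning string: schema input, then target template."""
    return f"{bos} {schema_tokens} {sep} {target_tokens} {eos}"

# Hypothetical flattened schema string; the real input combines the MR,
# descriptions, and the natural language MR into one string.
print(make_training_instance(
    "REQUEST cuisine null Restaurants_1 what cuisine do you want?",
    "Is there a specific cuisine type you enjoy?"))
```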
For each of our three models, we generate a single output for each test instance. Table 5 shows example model outputs.
4.1 Evaluation Metrics
We focus on three distinct metric types: similarity to references, semantic accuracy, and diversity.
Similarity to references.
As a measure of how closely our outputs match the corresponding test references, we use BLEU (n-gram precision with brevity penalty) Papineni et al. (2002) and METEOR (n-gram precision and recall, with synonyms) Lavie and Agarwal (2007). To compute these, we find the score for each output compared to the references for that instance, and average across all instances (we use NLTK for BLEU4/METEOR Bird et al. (2009)). We include these metrics primarily for completeness and supplement them with a human evaluation, since it is widely agreed that lexical overlap-based metrics are weak measures of quality Novikova et al. (2017a); Belz and Reiter (2006); Bangalore et al. (2000).
Semantic accuracy. We compute the slot error rate (SER) for each model output as compared to the corresponding MR by dividing the total number of deletions, repetitions, and hallucinations by the total number of slots for that instance (lower is better). Although Wen et al. (2015) compute SER using only deletions and repetitions, we include hallucinations to capture errors more accurately. It is important to note that we only consider slots that have explicit values (e.g., MR: INFORM date=$date1) for our SER computations. We leave computing SER over implicit slots (e.g., MR: REQUEST party_size=null) as future work, since it is non-trivial due to the various ways an implicit slot might be expressed in a generated template (e.g., “How many people are in your party?” or “What is the size of your group?”). We also compute “slot match rate,” the ratio of generated outputs that contain exactly the same slots as the matching test MR.
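An illustrative SER computation under this definition (the helper below is our sketch, assuming slots appear as whitespace-separated `$` placeholders in the delexicalized template):

```python
from collections import Counter

def slot_error_rate(mr_slots, output_text):
    """(deletions + repetitions + hallucinations) / number of MR slots.

    mr_slots: placeholders like "$date_1" expected exactly once each.
    """
    found = Counter(tok for tok in output_text.split() if tok.startswith("$"))
    expected = Counter(mr_slots)
    deletions = sum((expected - found).values())          # missing slots
    repetitions = sum(max(found[s] - expected[s], 0) for s in expected)
    hallucinations = sum(c for s, c in found.items() if s not in expected)
    return (deletions + repetitions + hallucinations) / len(mr_slots)

# "$time_1" is deleted and "$date_1" is repeated: SER = (1 + 1) / 2 = 1.0
print(slot_error_rate(["$date_1", "$time_1"],
                      "your table on $date_1 at $date_1 is booked"))
```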
Diversity. We measure diversity based on vocabulary size, distinct-n (the ratio of distinct n-grams to total n-grams) Li et al. (2016), and novelty (the ratio of unique generated utterances in test versus references in train). To avoid inflating novelty metrics, we normalize our template values, e.g., “Table is reserved for $date1.” is normalized to “Table is reserved for $date.” for any $dateN value.
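These diversity measures can be sketched as follows; the normalization regex is our guess at the placeholder format described above:

```python
import re

def distinct_n(texts, n):
    """Ratio of distinct n-grams to total n-grams across all texts."""
    ngrams, total = set(), 0
    for t in texts:
        toks = t.split()
        grams = list(zip(*[toks[i:] for i in range(n)]))
        ngrams.update(grams)
        total += len(grams)
    return len(ngrams) / total if total else 0.0

def normalize(template):
    """Collapse indexed placeholders ($date1 -> $date) before novelty."""
    return re.sub(r"(\$[a-z_]+)\d+", r"\1", template)

print(normalize("Table is reserved for $date1."))
# -> Table is reserved for $date.
print(distinct_n(["a b a b", "a b c d"], 1))  # 4 distinct / 8 total = 0.5
```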
|Similarity to Refs||Semantics||Diversity|
|BLEU Avg, METEOR Avg||SER Avg, Slot Match Rate||Vocab1 (Gold: 2.5k), Vocab2 (Gold: 20k), Distinct1 (Gold: 0.01), Distinct2 (Gold: 0.1), Novelty|
Table 6: Automatic evaluation metrics comparing the traditional MR vs. the rich schema. Higher is better for all metrics except SER (lower is better).
4.2 Traditional MR vs. Rich Schema
Table 6 compares model performance when trained using only the traditional MR versus using the full schema (better result for each model in bold).
We first compare the performance of the different models. From the table, we see that Seq2Seq and CVAE have higher BLEU compared to GPT2 (for both MR and Schema), but that GPT2 has higher METEOR. This indicates that GPT2 more frequently generates outputs that are semantically similar to the references but not exact lexical matches (e.g., substituting “film” for “movie”), likely because GPT2 is a pretrained model. Similarly, GPT2 has a significantly larger vocabulary and higher diversity than both Seq2Seq and CVAE.
Next, we compare the performance of each model when trained using the MR versus the schema. For all models, we see an improvement in similarity metrics (BLEU/METEOR) when training on the full schema. Similarly, in terms of diversity, we see large increases in vocabulary for all models, as well as increases in distinct-n and novelty (with the exception of Seq2Seq novelty, which drops slightly).
In terms of semantic accuracy, we see an improvement in both SER and Slot Match Rate for both CVAE and GPT2. For Seq2Seq, however, we actually see that the model performs better on semantics when training on only the MR. To investigate, we look at a breakdown of the kinds of errors made. We find that Seq2Seq/CVAE only suffer from deletions, but GPT2 also produces repetitions and hallucinations; however, training using the schema reduces the number of these mistakes enough to result in an SER improvement for GPT2 (see appendix for details).
4.3 Seen vs. Unseen Services
Next, we are interested in seeing how our models perform on specific services in the dataset. Recall that the original dataset consists of a set of services that can be grouped into domains: e.g., services restaurant_1 and restaurant_2 are both under the restaurant domain. Based on this, we segment our test set into three parts by service: seen, or services that have been seen in training; partially-unseen, or services that are unseen in training but part of domains that have been seen; and fully-unseen, where both the service and domain are unseen (we show distribution plots by service in the appendix). Table 12 shows model performance in terms of BLEU and SER. We sort services by how many references we have for them in test; events_1, for example, constitutes 19% of the 20K test references. To focus our discussion here, we show only the top-3 and bottom-1 services in terms of percentage of test references (we show results for all services in the appendix). For fully-unseen, we show the only available service (alarm_1). We show the best scores in bold and the worst scores in italic.
For seen services (Figure 11(a)), we see the highest BLEU scores for all models on the rentalcars_1 (which has a high distribution of refs). We note that SER is consistently low across all models (the worst SER across seen is 0.23). We also observe that weather_1, which is different from the rest of the more reservation-focused domains, has the smallest test distribution and notably high SER.
For partially-unseen services (Figure 11(b)), we see the best BLEU for all models on services_4, and the best SER on restaurants_2 (again, with the highest percentage of test references). We note that flights_3 has the worst SER for Seq2Seq/CVAE (second worst for GPT2), despite having a large test distribution. Upon investigation, we find slot description errors: e.g., slot origin_airport_name has the slot description “Number of the airport flying out from,” indicating that models are sensitive to schema errors.
To better understand how the models do on average across all services (weighted by the percentage of test refs), we show average BLEU/SER scores in Table 8. We again compare performance between training on the MR vs. the schema. On average, we see that for the seen and fully-unseen partitions, training with the schema is better across almost all metrics. For partially-unseen, we see that CVAE performs better when training on only the MR; however, when averaging across the full test in Table 6, we see an improvement with schema.
We see naturally higher BLEU and lower SER for seen vs. partially-unseen services. Surprisingly, we see higher schema BLEU for Seq2Seq on fully-unseen as compared to partially-unseen, but we note that the fully-unseen sample size is very small (only 10 test MRs). We also note that GPT2 has high SER for the fully-unseen domain; upon inspection, we see slot hallucination from GPT2 within alarm_1, while Seq2Seq/CVAE never hallucinate.
4.4 Human Evaluation
We conduct an annotation study to evaluate our schema-guided output quality. We randomly sample 50 MRs from our test set, and present 3 annotators with a single output for each model as well as a reference (randomly shuffled). To make annotation more intuitive, we automatically lexicalize slots with values from the schema (although this may add noise), e.g., “The date is $date1” becomes “The date is [March 1st].” We use the same values for all templates for consistency.
We ask the annotators to give a binary rating for each output across 3 dimensions: grammar, naturalness, and semantics (as compared to the input MR). We also get an ”overall” rating for each template on a 1 (poor) to 5 (excellent) Likert scale.
Table 9 summarizes the results of the study. For grammar, naturalness, and semantics, we show the ratio of how frequently a given model or reference output is marked as correct over all outputs for that model. For the ”overall” rating, we average the 3 ratings given by the annotators for each instance, and present an average across all MRs (out of 5).
From the table, we see that in terms of grammar and naturalness, CVAE has the highest score of all models. It is particularly interesting to note that CVAE even beats the reference in terms of naturalness, highlighting the fact that reference quality is subjective and not necessarily a gold standard. In terms of semantics, we see that GPT-2 has the highest ratings of all models. Most interestingly, we see that CVAE has a significantly lower semantic rating, although it is the winner on grammar and naturalness, indicating that while CVAE outputs may be fluent, they frequently do not actually express the required content (see Schema 3 in Table 5). This finding is also consistent with our SER calculations from Table 6, where we see that CVAE has the highest SER. (We compute Fleiss kappa scores for each dimension, finding near-perfect agreement for semantics (0.87), substantial agreement for grammar (0.76), and moderate agreement for naturalness (0.58) and overall (0.47).)
In terms of overall score, we see that GPT-2 has the highest rating of all three models, closely matching the ratings for the references. This can be attributed to its higher semantic accuracy, combined with good (even if not the highest) ratings on grammar and naturalness.
5 Related Work
Most work on NNLG uses a simple MR that consists of slot and value tokens that only describe information to be realized, without including contextual information to guide the generator as we do, although some work has described how this could be useful Walker et al. (2018). WebNLG Colin et al. (2016) includes structured triples from DBpedia, which may constitute slightly richer MRs, but they are not contextualized. Oraby et al. (2019) generate rich MRs that contain syntactic and stylistic information for generating descriptive restaurant reviews, but do not include any contextual information that is not itself realized in the output. Table-to-text generation using ROTOWIRE (NBA players and stats) also includes richer information, but it is likewise not contextualized Gong et al. (2019).
Other previous work has attempted to address domain transfer in NLG. Dethlefs (2017) uses an abstract meaning representation (AMR) as a way to share common semantic information across domains. Wen et al. (2016) use a “data counterfeiting” method to generate synthetic data from existing domains to train models on unseen domains, then fine-tune on a small set of in-domain utterances. Tran and Nguyen (2018) also train models on a source-domain dataset, then fine-tune on a small sample of target-domain utterances for domain adaptation. Rather than fine-tuning models for new domains, our data-driven approach allows us to learn domain information directly from the data schema.
6 Conclusion
In this paper, we present the novel task of Schema-Guided NLG. We demonstrate how we are able to generate templates (i.e., delexicalized surface strings) across different domains using three state-of-the-art models, informed by a rich schema of information including intent and slot descriptions and domain information. We describe how we preprocess the DSTC8 schema-guided dialog dataset from the dialog state tracking community, repurposing it for NLG. In our evaluation, we demonstrate how training on this rich schema yields improvements across similarity (up to 0.51 BLEU and 0.32 METEOR), semantic (as low as 0.18 average SER), and diversity (up to 2.5K bigram vocabulary) metrics on both seen and unseen domains. Through a human evaluation, we show that our outputs are rated up to 3.61 out of 5 overall (as compared to 3.97 for references). We observe that different models have different strengths: Seq2Seq and CVAE have higher BLEU reference similarity scores, while GPT2 is significantly more diverse and is scored highest in the human evaluation.
For future work, we are interested in exploring how schema-guided NLG can be used in the context of a dialog system, where only outputs that have no slot errors and have the best overall fluency should be selected as candidate responses. We are also interested in improving both the semantic correctness and overall fluency of our model outputs by introducing improved methods for constrained decoding and language model integration. Additionally, we plan to develop more accurate automatic measures of semantic quality, as well as more fine-grained control of domain transfer.
- Bangalore et al. (2000) Srinivas Bangalore, Owen Rambow, and Steve Whittaker. 2000. Evaluation metrics for generation. In INLG’2000 Proceedings of the First International Conference on Natural Language Generation, pages 1–8, Mitzpe Ramon, Israel. Association for Computational Linguistics.
- Bapna et al. (2017) Ankur Bapna, Gokhan Tur, Dilek Hakkani-Tur, and Larry Heck. 2017. Towards zero shot frame semantic parsing for domain scaling. In Interspeech 2017.
- Belz and Reiter (2006) Anja Belz and Ehud Reiter. 2006. Comparing automatic and human evaluation of NLG systems. In 11th Conference of the European Chapter of the Association for Computational Linguistics, Trento, Italy. Association for Computational Linguistics.
- Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python, 1st edition. O’Reilly Media, Inc.
- Budzianowski and Vulić (2019) Paweł Budzianowski and Ivan Vulić. 2019. Hello, it’s GPT-2 - how can I help you? towards the use of pretrained language models for task-oriented dialogue systems. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pages 15–22, Hong Kong. Association for Computational Linguistics.
- Chen et al. (2018) Lu Chen, Bowen Tan, Sishan Long, and Kai Yu. 2018. Structured dialogue policy with graph neural networks. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1257–1268, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Colin et al. (2016) Emilie Colin, Claire Gardent, Yassine Mrabet, Shashi Narayan, and Laura Perez-Beltrachini. 2016. The webnlg challenge: Generating text from dbpedia data. In Proceedings of the 9th International Natural Language Generation conference, pages 163–167. Association for Computational Linguistics.
- Dethlefs (2017) Nina Dethlefs. 2017. Domain transfer for deep natural language generation from abstract meaning representations. IEEE Computational Intelligence Magazine, 12:18–28.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dušek and Jurcicek (2015) Ondřej Dušek and Filip Jurcicek. 2015. Training a natural language generator from unaligned data. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 451–461, Beijing, China. Association for Computational Linguistics.
- Dusek and Jurcícek (2016) Ondrej Dusek and Filip Jurcícek. 2016. A context-aware natural language generator for dialogue systems. CoRR, abs/1608.07076.
- Dušek et al. (2018) Ondřej Dušek, Jekaterina Novikova, and Verena Rieser. 2018. Findings of the E2E NLG challenge. In Proceedings of the 11th International Conference on Natural Language Generation, pages 322–328. Association for Computational Linguistics.
- Gardent et al. (2017) Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating Training Corpora for NLG Micro-Planning. In 55th annual meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada.
- Gong et al. (2019) Heng Gong, Xiaocheng Feng, Bing Qin, and Ting Liu. 2019. Table-to-text generation with effective hierarchical encoder on three dimensions (row, column and time). In EMNLP/IJCNLP.
- Hu et al. (2017) Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1587–1596. JMLR.org.
- Lavie and Agarwal (2007) Alon Lavie and Abhaya Agarwal. 2007. METEOR: An automatic metric for MT evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, StatMT ’07, pages 228–231, Stroudsburg, PA, USA. Association for Computational Linguistics.
- Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.
- Li et al. (2017) Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017. End-to-end task-completion neural dialogue systems. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 733–743, Taipei, Taiwan. Asian Federation of Natural Language Processing.
- Liu et al. (2017) Tianyu Liu, Kexiang Wang, Lei Sha, Baobao Chang, and Zhifang Sui. 2017. Table-to-text generation by structure-aware seq2seq learning. In AAAI.
- Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
- Moryossef et al. (2019) Amit Moryossef, Yoav Goldberg, and Ido Dagan. 2019. Step-by-step: Separating planning from realization in neural data-to-text generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2267–2277, Minneapolis, Minnesota. Association for Computational Linguistics.
- Novikova et al. (2017a) Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017a. Why we need new evaluation metrics for NLG. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2241–2252. Association for Computational Linguistics.
- Novikova et al. (2017b) Jekaterina Novikova, Ondřej Dušek, and Verena Rieser. 2017b. The E2E dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206. Association for Computational Linguistics.
- Oraby et al. (2019) Shereen Oraby, Vrindavan Harrison, Abteen Ebrahimi, and Marilyn Walker. 2019. Curate and generate: A corpus and method for joint control of semantics and style in neural NLG. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5938–5951, Florence, Italy. Association for Computational Linguistics.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
- Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.
- Rastogi et al. (2019) Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2019. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. arXiv preprint arXiv:1909.05855.
- See et al. (2017) Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.
- Tran and Nguyen (2018) Van-Khanh Tran and Le-Minh Nguyen. 2018. Adversarial domain adaptation for variational neural language generation in dialogue systems. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1205–1217, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
- Walker et al. (2018) Marilyn Walker, Albry Smither, Shereen Oraby, Vrindavan Harrison, and Hadar Shemtov. 2018. Exploring conversational language generation for rich content about hotels. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan. European Languages Resources Association (ELRA).
- Wen et al. (2016) Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina Maria Rojas-Barahona, Pei-Hao Su, David Vandyke, and Steve J. Young. 2016. Multi-domain neural network language generation for spoken dialogue systems. In HLT-NAACL.
- Wen et al. (2015) Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1711–1721. Association for Computational Linguistics.
- Zhu et al. (2019) Chenguang Zhu, Michael Zeng, and Xuedong Huang. 2019. Multi-task learning for natural language generation in task-oriented dialogue. In Empirical Methods in Natural Language Processing (EMNLP). ACL.
Appendix A
Service and Slot Descriptions
|Events_1||The comprehensive portal to find and reserve seats at events near you|
|category||Type of event|
|time||Time when the event is scheduled to start|
|Events_2||Get tickets for the coolest concerts and sports in your area|
|date||Date of event|
|time||Starting time for event|
|Media_1||A leading provider of movies for searching and watching on-demand|
|title||Title of the movie|
|genre||Genre of the movie|
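The service and slot descriptions above can be bundled into a single schema object per service. The sketch below is purely illustrative: the `ServiceSchema` class and its field names are our own, not the dataset's actual representation.

```python
from dataclasses import dataclass, field

@dataclass
class ServiceSchema:
    """Minimal container for one service's schema: a name,
    a free-text description, and per-slot descriptions."""
    name: str
    description: str
    slots: dict = field(default_factory=dict)  # slot name -> slot description

# Example populated from the Events_1 rows above.
events_1 = ServiceSchema(
    name="Events_1",
    description="The comprehensive portal to find and reserve seats at events near you",
    slots={
        "category": "Type of event",
        "time": "Time when the event is scheduled to start",
    },
)
```

A generator conditioned on the schema can then look up `events_1.slots["time"]` to obtain the natural-language description of a slot rather than relying on the slot name alone.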
Details of SER Errors
All of the errors made by Seq2Seq and CVAE are deletion errors, since constrained decoding prevents repetitions and hallucinations. Using the schema leads to more deletions for GPT2, but it reduces repetitions and hallucinations, resulting in a better overall SER.
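The three error types can be counted directly on delexicalized outputs. The function below is a hedged approximation of this accounting (not the paper's exact SER implementation): a deletion is an MR slot placeholder missing from the output, a repetition is a placeholder realized more than once, and a hallucination is a placeholder in the output that is absent from the MR.

```python
import re

def slot_errors(mr_slots, output):
    """Count deletion, repetition, and hallucination errors for
    delexicalized slot placeholders of the form $slot-name1."""
    found = re.findall(r"\$[\w-]+\d", output)
    deletions = [s for s in mr_slots if s not in found]
    repetitions = sorted({s for s in found if found.count(s) > 1})
    hallucinations = sorted({s for s in found if s not in mr_slots})
    return deletions, repetitions, hallucinations

mr = ["$leaving-date1", "$travelers1"]
ok = "please confirm : $travelers1 tickets for the bus leaving on $leaving-date1 ."
bad = "you want $travelers1 $travelers1 tickets ."
# slot_errors(mr, ok)  -> ([], [], [])            no errors
# slot_errors(mr, bad) -> (["$leaving-date1"], ["$travelers1"], [])
```

A per-utterance SER can then be derived by dividing the total error count by the number of slots in the MR.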
Seen vs. Unseen Domains
Data Distribution Plots
For the seen set in Figure 1(a), we present the distribution of references in both training and test. For the unseen sets in Figure 1(b), we present only the test reference distribution (since there are no corresponding training references).
Performance across Services
Table 12 shows the performance of each model across all seen and partially-unseen test sets.
Seq2Seq and CVAE Model Diagram
Figure 3 shows a model architecture diagram for Seq2Seq and CVAE.
Table 13 shows more model output examples. Schema 1 shows correct outputs for all models. Schema 2 shows a slot drop in CVAE, and Schema 3 shows incorrect outputs from Seq2Seq/CVAE for the single fully-unseen domain, alarm-1.
|[Schema 1] ACTIONS (MR): CONFIRM(leaving-date=$leaving-date1), CONFIRM(travelers=$travelers1)|
|Slot Desc: leaving-date: "date of bus leaving for journey", travelers: "number of travelers for journey"|
|Service: buses-1 Service Desc: book bus journeys from the biggest bus network in the country|
|Intent: BuyBusTickets Intent Desc: buy tickets for a bus journey|
|Natural Language MR: please confirm that the leaving date is $leaving-date1. please confirm that the travelers is $travelers1.|
|Ref||can you confirm once again that you need tickets for $travelers1 people for the bus leaving on $leaving-date1.|
|Seq2Seq||please confirm the following details : you want to book $travelers1 tickets on $leaving-date1 .|
|CVAE||please confirm : $travelers1 tickets for the bus leaving on $leaving-date1 .|
|GPT2||okay, it’s $travelers1 tickets leaving $leaving-date1, is that right?|
|[Schema 2] ACTIONS (MR): INFORM(ride-fare= $ride-fare1), INFORM(approx-ride-duration=$approx-ride-duration1)|
|Slot Desc: ride-fare: "total fare for taxi ride", approx-ride-duration: "approximate duration of ride to the destination"|
|Service: ridesharing-1 Service Desc: on-demand taxi calling service|
|Intent: GetRide Intent Desc: call a taxi to head to a given destination|
|Natural Language MR: the ride fare is [$ride-fare1]. the approximate ride duration is [$approx-ride-duration1].|
|Ref||around $approx-ride-duration1 minutes for $ride-fare1.|
|Seq2Seq||around $approx-ride-duration1 minutes for $ride-fare1.|
|CVAE||the cab is on the way . it will cost $ride-fare1.|
|GPT2||the ride will be $ride-fare1 and is expected to last $approx-ride-duration1 minutes.|
|[Schema 3] ACTIONS (MR): REQUEST(new-alarm-time=null)|
|Slot Desc: new-alarm-time: "time to set for the new alarm"|
|Service: alarm-1 Service Desc: Manage alarms by getting and setting them easily|
|Intent: AddAlarm Intent Desc: Set a new alarm|
|Natural Language MR: what [new alarm time] do you want?|
|Ref||at what time, you want the alarm to be set?|
|Seq2Seq||what time do you want to go ?|
|CVAE||what time would you like to go ?|
|GPT2||what time do you want to set the alarm?|
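The "Natural Language MR" rows in Table 13 can be produced by filling slot-level templates with schema information. The sketch below shows one way to render a sequence of CONFIRM actions; the template is read off the Schema 1 example, and the function name is our own illustration, not the paper's code.

```python
def naturalize_confirm(actions):
    """Render a list of (slot, delexicalized-value) CONFIRM actions as a
    natural-language MR string, one templated sentence per action."""
    parts = []
    for slot, value in actions:
        slot_text = slot.replace("-", " ")  # e.g. "leaving-date" -> "leaving date"
        parts.append(f"please confirm that the {slot_text} is {value}.")
    return " ".join(parts)

mr = [("leaving-date", "$leaving-date1"), ("travelers", "$travelers1")]
# naturalize_confirm(mr) ->
# "please confirm that the leaving date is $leaving-date1. please confirm that the travelers is $travelers1."
```

Other action types (INFORM, REQUEST) would each get their own template in the same spirit, e.g. "the {slot} is {value}." for INFORM, as in the Schema 2 example.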