Schema-Guided Natural Language Generation

05/11/2020 ∙ by Yuheng Du, et al. ∙ Amazon ∙ University of Illinois at Urbana-Champaign

Neural network based approaches to natural language generation (NLG) have gained popularity in recent years. The goal of the task is to generate a natural language string to realize an input meaning representation, hence large datasets of paired utterances and their meaning representations are used for training the network. However, dataset creation for language generation is an arduous task, and popular datasets designed for training these generators mostly consist of simple meaning representations composed of slot and value tokens to be realized. These simple meaning representations do not include any contextual information that may be helpful for training an NLG system to generalize, such as domain information and descriptions of slots and values. In this paper, we present the novel task of Schema-Guided Natural Language Generation, in which we repurpose an existing dataset for another task: dialog state tracking. Dialog state tracking data includes a large and rich schema spanning multiple different attributes, including information about the domain, user intent, and slot descriptions. We train different state-of-the-art models for neural natural language generation on this data and show that inclusion of the rich schema allows our models to produce higher quality outputs both in terms of semantics and diversity. We also conduct experiments comparing model performance on seen versus unseen domains. Finally, we present human evaluation results and analysis demonstrating high ratings for overall output quality.




1 Introduction

Authors contributed equally and are listed alphabetically. Author contributed to this work as an intern at Amazon.

Much of the recent work on Neural Natural Language Generation (NNLG) focuses on generating a natural language string given some input content, primarily in the form of a structured Meaning Representation (MR) Moryossef et al. (2019); Gong et al. (2019); Dušek et al. (2018); Liu et al. (2017); Colin et al. (2016); Wen et al. (2016); Dusek and Jurcícek (2016); Dušek and Jurcicek (2015); Wen et al. (2015). Popular datasets used for MR-to-text generation are confined to limited domains, e.g., restaurants or product information. They usually consist of simple tuples of slots and values describing the content to be realized, failing to offer any additional information that might be useful for the generation task Novikova et al. (2017b); Gardent et al. (2017); Wen et al. (2015). Table 1 shows examples of MRs from popular datasets.

Dataset | MR | Reference
E2E Novikova et al. (2017b) | INFORM name[The Punter], food[Indian], priceRange[cheap] | The Punter offers cheap Indian food.
Laptop Wen et al. (2016) | INFORM name[satellite eurus65], type[laptop], memory[4gb], driverRange[medium], isForBusiness[false] | The satellite eurus 65 is a laptop designed for home use with 4 gb of memory and a medium sized hard drive
Table 1: Sample MRs from popular NNLG datasets.

Only having simple and limited information within these MRs has several shortcomings. Model outputs are either very generic or generators have to be trained for a narrow domain and cannot be used for new domains. Thus, some recent work has focused on different methods to improve naturalness Zhu et al. (2019) and promote domain transfer Tran and Nguyen (2018); Wen et al. (2016).

The fact is that the use of an MR is not unique to the problem of language generation: tasks such as dialog state tracking Rastogi et al. (2019), policy learning Chen et al. (2018), and task completion Li et al. (2017) also require the use of an MR to track context and state information relevant to the task. MRs from these more dialog-oriented tasks are often referred to as a “schema.”

While dialog state tracking schema do not necessarily include descriptions (and generally only include names of intents, slots, and values like traditional MRs), recent work has suggested that the use of descriptions may help with different language tasks, such as zero-shot and transfer learning Bapna et al. (2017). The most recent Dialog System Technology Challenge (DSTC8) Rastogi et al. (2019) implements this by introducing the idea of schema-guided dialog state tracking.

Table 2 shows a sample schema from DSTC8. It is much richer and more contextually informative than traditional MRs. Each turn is annotated with information about the current speaker (e.g., SYSTEM, USER), dialog act (e.g., REQUEST), slots (e.g., CUISINE), values (e.g., Mexican and Italian), as well as the surface string utterance. When comparing this schema in Table 2 to the MRs from Table 1, we can see that the only part of the schema reflected in the MRs is the ACTIONS section, which explicitly describes intents, slots, and values.

VALUES: Mexican, Italian
CUISINE: ”Cuisine of food served in the restaurant”
SLOT TYPE: CUISINE: is_categorical=true
INTENT - FindRestaurants
INTENT DESCRIPTION: ”Find a restaurant of a particular cuisine in a city”
SERVICE - Restaurants_1
SERVICE DESCRIPTION: ”A leading provider for restaurant search and reservations”
SPEAKER - System
UTTERANCE - ”Is there a specific cuisine type you enjoy, such as Mexican, Italian, or something else?”
Table 2: Sample schema from DSTC8. ”Actions” describe a traditional MR; blue fields are newly introduced in the schema.

To our knowledge, no previous work on NNLG has attempted to generate natural language strings from schemata using this vastly richer and more informative data. In this paper, we propose the new task of Schema-guided Natural Language Generation, where we take a turn-level schema as input and generate a natural language string describing the required content, guided by the context information provided in the schema. Following previous work on schema-guided language tasks, we hypothesize that descriptions in the schema will lead to better generated outputs and the possibility of zero-shot learning Bapna et al. (2017). For example, to realize the MR REQUEST(time), domain-specific descriptions of common slots like time can help us realize better outputs, such as ”What time do you want to reserve your dinner?” in the restaurant domain, and ”What time do you want to see your movie?” for movies. Similarly, we note that for dialog system developers, writing domain-specific templates for all scenarios is clearly not scalable, but providing a few domain-specific descriptions for slots/intents is much more feasible.

To allow our models to better generalize and to be more directly useful in the context of a dialog system, we specifically focus on system-side turns from the DSTC8 dataset and generate natural language templates, i.e., delexicalized surface forms, such as ”Is there a specific cuisine type you enjoy, such as $cuisine1, $cuisine2, or something else?” from the example schema in Table 2.

Our contributions in this paper are three-fold: (1) we introduce the novel task of schema-guided NLG, (2) we present our methods to include schema descriptions in state-of-the-art NNLG models, and (3) we demonstrate how using a schema leads to better quality outputs than traditional MRs. We experiment with three different NNLG models (Sequence-to-Sequence, Conditional Variational AutoEncoders, and GPT-2 as a Pretrained Language Model). We show that the rich additional information in schemata not only helps provide a context for the generation task, i.e., including domain names and descriptions, but also allows our NNLG models to learn more domain-specific information that might not otherwise be represented in the data. We also present experiments focused on quantifying model performance on domains unseen in training and present a human evaluation aimed at assessing model quality in terms of naturalness and semantic correctness.

2 Data

To create a rich dataset for NNLG, we repurpose the dataset used for the Schema-Guided State Tracking track of DSTC8 Rastogi et al. (2019). We describe our data preprocessing in this section.

Since we are focused on system turns, we first drop all the user turns. The second step in the preprocessing pipeline is to delexicalize each of the system utterances. The original data is annotated with the spans of the slots mentioned in each turn. We replace these mentions with the slot type plus an increasing index prefixed by the $ sign, e.g., $cuisine_1. For example, the utterance “Is there a specific cuisine type you enjoy, such as Mexican, Italian, or something else?” becomes “Is there a specific cuisine type you enjoy, such as $cuisine_1, $cuisine_2, or something else?”
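The delexicalization step can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the `(slot, start, end)` span format and the function name `delexicalize` are assumptions made for the example.

```python
from collections import defaultdict


def delexicalize(utterance, spans):
    """Replace annotated slot mentions with indexed placeholders.

    `spans` is a list of (slot, start, end) character offsets with an
    exclusive end. This span format is an assumption for illustration.
    """
    counters = defaultdict(int)  # per-slot running index
    pieces = []
    cursor = 0
    for slot, start, end in sorted(spans, key=lambda s: s[1]):
        counters[slot] += 1
        pieces.append(utterance[cursor:start])          # text before the mention
        pieces.append(f"${slot}_{counters[slot]}")      # e.g. $cuisine_1
        cursor = end
    pieces.append(utterance[cursor:])                   # trailing text
    return "".join(pieces)


utterance = ("Is there a specific cuisine type you enjoy, "
             "such as Mexican, Italian, or something else?")
spans = []
for value in ("Mexican", "Italian"):
    start = utterance.index(value)
    spans.append(("cuisine", start, start + len(value)))

template = delexicalize(utterance, spans)
print(template)
```

Each mention receives a fresh index in order of appearance, so two mentions of the same slot type stay distinguishable in the template.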

The third step is to extract the MR corresponding to each system turn. We chose to represent MRs slightly differently from the original data. An MR is a 3-tuple: one dialog act has exactly one slot and one value. Therefore, an MR such as REQUEST(cuisine = [Mexican, Italian]) becomes REQUEST(cuisine=$cuisine_1), REQUEST(cuisine=$cuisine_2) (see Table 3). Note that the MR has been delexicalized in the same fashion as the utterance. Similarly, for MRs that do not have a value, e.g., REQUEST(city), we introduced the null value resulting in REQUEST(city=null). We also use the null value to replace the slot in dialog acts that do not require one, e.g., BYE() becomes BYE(null=null). This choice is due to how we encode MRs in our models.
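A small sketch of this MR normalization, under the assumption that each dialog act arrives as an `(act, slot, values)` triple where `slot` may be `None` and `values` may be empty. The input shape and the name `flatten_mr` are hypothetical.

```python
def flatten_mr(acts):
    """Expand dialog acts into (act, slot, value) 3-tuples.

    `acts` is a list of (act, slot, values); slot may be None and values
    may be an empty list. This input shape is assumed for illustration.
    """
    counters = {}
    triples = []
    for act, slot, values in acts:
        if slot is None:
            # Acts without a slot, e.g. BYE() -> BYE(null=null)
            triples.append((act, "null", "null"))
        elif not values:
            # Slots without a value, e.g. REQUEST(city) -> REQUEST(city=null)
            triples.append((act, slot, "null"))
        else:
            # One tuple per value, delexicalized like the utterance
            for _ in values:
                counters[slot] = counters.get(slot, 0) + 1
                triples.append((act, slot, f"${slot}_{counters[slot]}"))
    return triples


mr = flatten_mr([
    ("REQUEST", "cuisine", ["Mexican", "Italian"]),
    ("REQUEST", "city", []),
    ("BYE", None, []),
])
print(mr)
```

Giving every tuple exactly three fields keeps the encoder input uniform, which is the motivation the text gives for the null placeholders.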

Once we generate templates and MR pairs, we add information about the service. In DSTC8, there are multiple services within a single domain, e.g., services travel_1 and travel_2 are both part of the travel domain, but have distinct schema (we show service examples in the appendix). DSTC8 annotates each turn with the corresponding service, so we reuse this information. Our schema also includes user intent. (At the time of writing, the DSTC8 test set is not annotated with user intent. Since we are using user intents for our task, we use the DSTC8 dev set as our test set. We randomly split the DSTC8 train set into 90% training and 10% development.) Since only user turns are annotated with intent information, we use the immediately preceding user turn’s intent annotation if the system turn and the user turn share the same service. If the service is not the same, we drop the intent information, i.e., we use an empty string for intent (this only happens in 3.3% of the cases).
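The intent-propagation rule can be illustrated with a short sketch. The turn dictionaries and their keys (`speaker`, `service`, `intent`) are assumed for the example and do not reproduce the DSTC8 file format.

```python
def attach_intents(turns):
    """Keep system turns and copy the intent of the preceding user turn.

    Each turn is a dict with keys "speaker" and "service"; user turns also
    carry "intent". These keys are hypothetical, chosen for illustration.
    """
    system_turns = []
    previous_user = None
    for turn in turns:
        if turn["speaker"] == "USER":
            previous_user = turn
            continue
        intent = ""  # default when the services differ or no user turn exists
        if previous_user is not None and previous_user["service"] == turn["service"]:
            intent = previous_user["intent"]
        system_turns.append({**turn, "intent": intent})
    return system_turns


dialog = [
    {"speaker": "USER", "service": "Restaurants_1", "intent": "FindRestaurants"},
    {"speaker": "SYSTEM", "service": "Restaurants_1"},
    {"speaker": "USER", "service": "Restaurants_1", "intent": "FindRestaurants"},
    {"speaker": "SYSTEM", "service": "Hotels_4"},
]
result = attach_intents(dialog)
print([t["intent"] for t in result])
```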

Next, we add information extracted from the schema file of the original data. This includes service description, slot description (one description for each slot in the MR), and intent description. These descriptions are very short English sentences (on average 9.8, 5.9, and 8.3 words for services, slots, and intents, respectively). Lastly, we add to each tuple a sentence describing, in plain English, the meaning of the MR. These descriptions are not directly available in DSTC8 but are procedurally generated by a set of rules (we have a single rule for each act type). For example, the MR CONFIRM(city=$city_1) is “Please confirm that the [city] is [$city_1].” The intuition behind these natural language MRs is to provide a more semantically informative representation of the dialog acts, slots and values.
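A rule table of this kind might look like the sketch below. The five patterns are copied from the examples that appear in this paper (the CONFIRM example above and the natural language MRs in Table 5); the dictionary `RULES`, the function `describe_mr`, and the restriction to these five acts are assumptions, since the full rule set is not listed.

```python
# One pattern per dialog act. Only acts whose wording appears in the paper
# are included; the remaining rules are not given in the text.
RULES = {
    "CONFIRM": "please confirm that the [{slot}] is [{value}].",
    "INFORM": "the [{slot}] is [{value}].",
    "OFFER": "there is [{value}] for [{slot}].",
    "REQUEST": "what [{slot}] do you want?",
    "NOTIFY-SUCCESS": "the request succeeded.",
}


def describe_mr(triples):
    """Render (act, slot, value) tuples as a plain-English description."""
    sentences = []
    for act, slot, value in triples:
        # Slot names resemble variable names; make them readable.
        readable_slot = slot.replace("-", " ").replace("_", " ")
        sentences.append(RULES[act].format(slot=readable_slot, value=value))
    return " ".join(sentences)


description = describe_mr([
    ("INFORM", "price-per-night", "$price-per-night1"),
    ("NOTIFY-SUCCESS", "null", "null"),
])
print(description)
```

With the Schema 1 MR from Table 5 as input, the sketch yields the same natural language MR shown in that table.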

Table 4 shows statistics for the final dataset. In summary, it is composed of nearly 4K MRs and over 140K templates. On average, every MR has 58 templates associated with it, but there is a large variance. There is one MR with over 1.7K templates (CONFIRM(restaurant_name, city, time, party_size, date)) and many with only one.

DSTC8 (original)
VALUES: Mexican, Italian
UTTERANCE - ”Is there a specific cuisine type you enjoy, such as Mexican, Italian, or something else?”
DSTC8 Preprocessed for NLG
UTTERANCE - ”Is there a specific cuisine type you enjoy, such as $cuisine1, $cuisine2, or something else?”
Table 3: Data preprocessing and delexicalization.
Statistic | Train | Dev | Test
Templates | 110595 | 14863 | 20022
Meaning Representations | 1903 | 1314 | 749
Services | 26 | 26 | 17
Domains | 16 | 16 | 16
Table 4: Preprocessed dataset statistics.

3 Models

3.1 Feature Encoding

We categorize the features in schemata into two different types. The first type is symbolic features. Symbolic features are encoded using a word embedding layer. They typically consist of single tokens, e.g., service name or dialog act. Since these features resemble variable names, we do not consider their semantics: thus, slot types named restaurant and restaurant_name are encoded separately. The second type of features is natural language features. These features are typically sentences, e.g., service or slot descriptions, that we encode using BERT Devlin et al. (2018) to derive a single embedding tensor.

To represent the full schema, we adopt a flat-encoding strategy. The first part of each schema is the MR, which we define as a sequence of dialog act, slot, and value tuples. At each timestep, we encode a three-part sequence: (1) a new act, slot, and value tuple from the MR, (2) the embeddings of all schema-level features (i.e., services, intents, and their descriptions), and (3) the embedding of the current slot description (see Figure 1).

Figure 1: Flat-encoding strategy.
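The flat-encoding strategy can be pictured with the toy sketch below. Everything numerical here is a stand-in: `embed_symbol` replaces a learned embedding lookup, `embed_text` replaces a BERT sentence encoder, and joining the three parts by concatenation is an assumption about how the parts are combined at each timestep.

```python
def embed_symbol(token, dim=4):
    """Toy stand-in for a learned word-embedding lookup (hypothetical)."""
    seed = sum(ord(ch) for ch in token)
    return [((seed * (i + 1)) % 7) / 7.0 for i in range(dim)]


def embed_text(text, dim=4):
    """Toy stand-in for a BERT sentence embedding (hypothetical)."""
    words = text.split() or [""]
    vectors = [embed_symbol(word, dim) for word in words]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def flat_encode(mr, service, service_desc, intent, intent_desc, slot_descs, dim=4):
    """Build one input vector per (act, slot, value) tuple in the MR."""
    # (2) schema-level features, repeated unchanged at every timestep
    schema_level = (embed_symbol(service, dim) + embed_text(service_desc, dim)
                    + embed_symbol(intent, dim) + embed_text(intent_desc, dim))
    steps = []
    for act, slot, value in mr:
        # (1) the current act, slot, and value tuple
        triple = (embed_symbol(act, dim) + embed_symbol(slot, dim)
                  + embed_symbol(value, dim))
        # (3) the description of the current slot
        slot_part = embed_text(slot_descs.get(slot, ""), dim)
        steps.append(triple + schema_level + slot_part)
    return steps


steps = flat_encode(
    mr=[("REQUEST", "cuisine", "$cuisine_1"), ("REQUEST", "cuisine", "$cuisine_2")],
    service="Restaurants_1",
    service_desc="A leading provider for restaurant search and reservations",
    intent="FindRestaurants",
    intent_desc="Find a restaurant of a particular cuisine in a city",
    slot_descs={"cuisine": "Cuisine of food served in the restaurant"},
)
print(len(steps), len(steps[0]))
```

The sequence length equals the number of MR tuples, and only the first and last parts of each vector change from one timestep to the next.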

3.2 Sequence-to-Sequence

Our first model is a Seq2Seq model with attention, copy, and constrained decoding (model diagram in the appendix). We implement the attention from Luong et al. (2015):

\[ a_t(s) = \frac{\exp(\mathrm{score}(h_t, \bar{h}_s))}{\sum_{s'} \exp(\mathrm{score}(h_t, \bar{h}_{s'}))} \]

where \(\mathrm{score}\) is a function that computes the alignment score of the hidden state of the encoder, \(\bar{h}_s\), and the decoder hidden state, \(h_t\). The goal of this layer is to attend to the more salient input features.

The copy mechanism we add is based on pointer-generator networks See et al. (2017). When using the pointer generator, at each decoding step \(t\) we compute a probability \(p_{gen}\):

\[ p_{gen} = \sigma(w_{h^*}^\top h^*_t + w_h^\top h_t + w_x^\top x_t + b) \]

where \(w_{h^*}\), \(w_h\), and \(w_x\) are learnable weights; \(h^*_t = \sum_s a_t(s) \bar{h}_s\) is a context vector computed by combining the encoder hidden states and the attention weights, \(h_t\) is the decoder hidden state, \(x_t\) the decoder input, and \(b\) is a bias term. The probability \(p_{gen}\) is used to determine the next word \(w\) generated according to the following equation:

\[ P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{s : w_s = w} a_t(s) \]

where \(w_s\) is the token at input position \(s\). Thus \(p_{gen}\) behaves like a switch to decide whether to generate from the vocab or copy from the input. The goal of the copy mechanism is to enable the generation of special symbols such as $cuisine_1 that are specific to the service.
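A small numeric sketch of the switch described above, following the pointer-generator mixture of See et al. (2017). The vocabulary, attention weights, and the value of the generation probability are made-up numbers for illustration, and `final_distribution` is a hypothetical helper name.

```python
def final_distribution(p_gen, p_vocab, attention, source_tokens):
    """Mix generating from the vocabulary with copying from the input.

    p_gen scales the vocabulary distribution; (1 - p_gen) scales the
    attention mass placed on each input token.
    """
    mixed = {word: p_gen * prob for word, prob in p_vocab.items()}
    for weight, token in zip(attention, source_tokens):
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_gen) * weight
    return mixed


p_vocab = {"what": 0.5, "cuisine": 0.3, "food": 0.2}   # toy vocabulary distribution
attention = [0.1, 0.2, 0.7]                             # toy attention weights
source_tokens = ["REQUEST", "cuisine", "$cuisine_1"]    # tokens of the input MR

dist = final_distribution(0.4, p_vocab, attention, source_tokens)
best = max(dist, key=dist.get)
print(best, round(dist[best], 2))
```

Because the placeholder $cuisine_1 is absent from the toy vocabulary, it can only be produced through the copy term, which is the case the copy mechanism is meant to cover.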

3.3 Conditional Variational Auto-Encoder

The Conditional Variational Auto-Encoder (CVAE) Hu et al. (2017) is an extension of the VAE models, where an additional vector \(c\) is attached to the last hidden state of the encoder \(z\) as the initial hidden state of the decoder. The vector \(c\) is used to control the semantic meaning of the output to align with the desired MR. We use the encoded feature vector described in Section 3.1 as \(c\). The model objective is the same as VAE, which is the sum of reconstruction loss and Kullback–Leibler divergence loss. At training time, \(z\) is the encoded input sentence. At prediction time, \(z\) is sampled from a Gaussian prior learned at training time. We also adapt the attention mechanism for CVAE by adding an additional matrix to the computation of the alignment score with the decoder hidden state.
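For reference, the objective described above (reconstruction loss plus Kullback–Leibler divergence) has the following standard form for a conditional VAE. The notation is assumed for illustration and is not taken from the paper: \(x\) is the target template, \(c\) the conditioning vector, \(z\) the latent code, \(q_\phi\) the encoder, \(p_\theta\) the decoder, and \(p(z)\) the Gaussian prior.

\[ \mathcal{L}(\theta, \phi; x, c) = -\mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z, c)\right] + \mathrm{KL}\left(q_\phi(z \mid x) \,\|\, p(z)\right) \]

The first term is the reconstruction loss and the second is the divergence between the encoder's posterior and the prior from which \(z\) is sampled at prediction time.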

For both the Seq2Seq and CVAE models, we use constrained decoding to prune out candidate outputs that contain slot repetitions. Using a beam, we keep track of the slots that have already been generated and set the probability of a new candidate node to zero if generated slots are repeated.
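The pruning rule can be sketched as one step of beam search over token lists. This is an illustrative simplification: the hypothesis format `(tokens, log_probability)`, the candidate lists, and the slot regular expression are assumptions, and discarding a candidate stands in for setting its probability to zero.

```python
import re

# Delexicalized slots look like $cuisine_1 or $price-per-night1.
SLOT_PATTERN = re.compile(r"\$[\w-]+")


def beam_step(beam, expansions, beam_size):
    """Extend each hypothesis by one token, pruning repeated slots.

    beam: list of (tokens, log_prob) hypotheses.
    expansions: for each hypothesis, a list of (token, log_prob) candidates.
    """
    candidates = []
    for (tokens, score), options in zip(beam, expansions):
        used_slots = {t for t in tokens if SLOT_PATTERN.fullmatch(t)}
        for token, log_prob in options:
            if token in used_slots:
                continue  # repeated slot: treated as probability zero
            candidates.append((tokens + [token], score + log_prob))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:beam_size]


beam = [(["the", "date", "is", "$date_1"], -1.0)]
expansions = [[("$date_1", -0.1), ("and", -0.5)]]
new_beam = beam_step(beam, expansions, beam_size=2)
print(new_beam)
```

Although repeating $date_1 is the locally most probable continuation in this toy case, it is dropped, so the surviving hypothesis continues with "and".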

[Schema 1] ACTIONS (MR): INFORM(price-per-night= $price-per-night1), NOTIFY-SUCCESS(null=null)
Slot Desc: price-per-night: ”price per night for the stay”
Service: hotels-4 Service Desc: ”Accommodation searching and booking portal”
Intent: ReserveHotel Intent Desc: ”Reserve rooms at a selected place for given dates.”
Natural Language MR: the [price per night] is [$price-per-night1]. the request succeeded.
Ref $price-per-night1 a night
Seq2Seq your reservation is booked and the total cost is $price-per-night1 .
CVAE your reservation has been made . the total cost is $price-per-night1 per night .
GPT2 your reservation was successful! the cost of the room is $price-per-night1 per night.
[Schema 2] ACTIONS (MR): OFFER(movie-name= $movie-name1), OFFER(movie-name= $movie-name2)
OFFER(movie-name= $movie-name3), INFORM(count=$count1)
Slot Desc: movie-name: ”name of the movie”, count: ”the number of items that satisfy the user’s request”
Service: media-2 Service Desc: ”The widest selection and lowest prices for movie rentals”
Intent: FindMovies Intent Desc: ”Find movies to watch by genre and, optionally, director or actors”
Natural Language MR: there is [$movie-name2] for [movie name].
there is [$movie-name3] for [movie name]. there is [$movie-name1] for [movie name]. the [count] is [$count1].
Ref $count1 movies $movie-name1 $movie-name2 and $movie-name3
Seq2Seq i found $count1 movies . how about $movie-name2 ?
CVAE i found $count1 movies you might like . how about $movie-name1 , $movie-name2 or $movie-name3 ?
GPT2 sure. hey! i’ve got $count1 movies. do you wish to watch any of these movies - $movie-name1, $movie-name2 or $movie-name3?
[Schema 3] ACTIONS (MR): REQUEST(transfer-amount=null), REQUEST(recipient-name=null)
Slot Desc: transfer-amount: ”the amount of money to transfer”, recipient-name: ”the name of the recipient to transfer […]”
Service: banks-2 Service Desc: ”Service to manage your bank accounts and finances”
Intent: TransferMoney Intent Desc: ”Transfer money to another user”
Natural Language MR: ”what [transfer amount] do you want? what [recipient name] do you want?”
Ref amount? recipient?
Seq2Seq what type of ride do you want to transfer ?
CVAE what is your destination ? how much do you want to send the money to ?
GPT2 sure, i can help you with that. how much would you like to transfer from? who would you like to send it to?
Table 5: Example model outputs. All models are correct for Schema 1. Seq2Seq model dropped two slots for Schema 2. Schema 3 has incorrect outputs for Seq2Seq and CVAE.

3.4 Pretrained Language Model: GPT-2

We also experiment with a pretrained language model, specifically GPT-2 Radford et al. (2019) (GPT-2 small from HuggingFace Transformers). Since GPT-2 is trained on purely natural language strings, we first combine the symbolic and natural language features into flat natural language strings, similar to previous work by Budzianowski and Vulić (2019). We fine-tune the GPT-2 model using these natural language inputs with the target template. (We train with special beginning of sequence, end of sequence, and separator tokens such that each training instance is: “[BOS] schema-tokens [SEP] target-tokens [EOS].”) At prediction time, given the schema tokens as input, we use our fine-tuned GPT-2 model with a language model head to conditionally generate an output sequence (until we hit an end-of-sequence token).
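The string layout of a training instance and of a prediction-time prompt can be sketched as below. Only the "[BOS] schema-tokens [SEP] target-tokens [EOS]" layout comes from the text; the order of the schema fields, the "service:" and "intent:" labels, and the helper names are assumptions. No model or tokenizer calls are shown.

```python
def schema_to_text(example):
    """Flatten symbolic and natural language features into one string.

    The field order and the labels used here are hypothetical.
    """
    parts = [
        example["nl_mr"],
        f"service: {example['service']}. {example['service_desc']}",
        f"intent: {example['intent']}. {example['intent_desc']}",
    ]
    parts += [f"{slot}: {desc}" for slot, desc in example["slot_descs"].items()]
    return " ".join(parts)


def training_instance(example, target):
    """Schema tokens, separator, target template, end of sequence."""
    return f"[BOS] {schema_to_text(example)} [SEP] {target} [EOS]"


def generation_prompt(example):
    """Prefix given to the fine-tuned model at prediction time."""
    return f"[BOS] {schema_to_text(example)} [SEP]"


example = {
    "nl_mr": "what [transfer amount] do you want? what [recipient name] do you want?",
    "service": "banks-2",
    "service_desc": "Service to manage your bank accounts and finances",
    "intent": "TransferMoney",
    "intent_desc": "Transfer money to another user",
    "slot_descs": {"transfer-amount": "the amount of money to transfer"},
}
instance = training_instance(example, "amount? recipient?")
print(instance)
```

Sharing the prefix between training and prediction means the model always sees the separator immediately before the text it is expected to produce.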

4 Evaluation

For each of our three models, we generate a single output for each test instance. Table 5 shows example model outputs.

4.1 Evaluation Metrics

We focus on three distinct metric types: similarity to references, semantic accuracy, and diversity.

Similarity to references. As a measure of how closely our outputs match the corresponding test references, we use BLEU (n-gram precision with brevity penalty) Papineni et al. (2002) and METEOR (n-gram precision and recall, with synonyms) Lavie and Agarwal (2007). To compute these, we find the score for each output compared to references for that instance, and average across all instances (we use NLTK for BLEU4/METEOR, Bird et al. (2009)). We include these metrics in our evaluation primarily for completeness and supplement them with a human evaluation, since it is widely agreed that lexical overlap-based metrics are weak measures of quality Novikova et al. (2017a); Belz and Reiter (2006); Bangalore et al. (2000).

Semantic accuracy. We compute the slot error rate (SER) for each model output as compared to the corresponding MR by finding the total number of deletions, repetitions, and hallucinations over the total number of slots for that instance (lower is better). (Although Wen et al. (2015) compute SER using only deletions and repetitions, we include hallucinations to capture errors more accurately.) It is important to note that we only consider slots that have explicit values (e.g., MR: INFORM date=$date1) for our SER computations. We are investigating methods to compute SER over implicit slots (e.g., MR: REQUEST party_size=null) as future work, since it is non-trivial to compute due to the various ways an implicit slot might be expressed in a generated template (e.g., “How many people are in your party?”, or “What is the size of your group?”). We also compute “slot match rate”, or the ratio of generated outputs that contain exactly the same slots as the matching test MR.
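The two semantic metrics can be sketched for a single instance as follows. This is one plausible reading of the definitions, not the authors' scoring script: the slot regular expression, the way repetitions are counted (each extra occurrence of an expected slot), and the function names are assumptions. The example uses the Seq2Seq output for Schema 2 in Table 5.

```python
import re
from collections import Counter

SLOT_PATTERN = re.compile(r"\$[\w-]+")  # e.g. $count1, $movie-name2


def slot_errors(mr_slots, output):
    """Count deletions, repetitions, and hallucinations for one output."""
    expected = Counter(mr_slots)
    produced = Counter(SLOT_PATTERN.findall(output))
    deletions = sum(1 for slot in expected if produced[slot] == 0)
    repetitions = sum(produced[slot] - 1 for slot in expected if produced[slot] > 1)
    hallucinations = sum(n for slot, n in produced.items() if slot not in expected)
    return deletions, repetitions, hallucinations


def slot_error_rate(mr_slots, output):
    """Total slot errors over the number of explicit slots in the MR."""
    if not mr_slots:
        return 0.0
    return sum(slot_errors(mr_slots, output)) / len(mr_slots)


def slot_match(mr_slots, output):
    """True when the output contains exactly the slots of the MR."""
    return Counter(mr_slots) == Counter(SLOT_PATTERN.findall(output))


mr_slots = ["$movie-name1", "$movie-name2", "$movie-name3", "$count1"]
seq2seq_output = "i found $count1 movies . how about $movie-name2 ?"
errors = slot_errors(mr_slots, seq2seq_output)
ser = slot_error_rate(mr_slots, seq2seq_output)
print(errors, ser)
```

The output drops two of the four slots, giving two deletions and an SER of 0.5 for this instance; the slot match rate is the fraction of test outputs for which `slot_match` holds.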

Diversity. We measure diversity based on vocabulary, distinct-n (the ratio of distinct n-grams to total n-grams) Li et al. (2016), and novelty (the ratio of unique generated utterances in test versus references in train). To avoid inflating novelty metrics, we normalize our template values (e.g., “Table is reserved for $date1.” is normalized to “Table is reserved for $date.” for any $dateN value).
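A compact sketch of distinct-n and novelty, assuming whitespace tokenization. The novelty function implements one interpretation of the definition (the share of unique normalized outputs that do not occur among the normalized training references); the exact formula used in the paper is not spelled out, so treat it as an assumption, as are the function names.

```python
import re


def distinct_n(utterances, n):
    """Ratio of distinct n-grams to total n-grams over all utterances."""
    ngrams = []
    for utterance in utterances:
        tokens = utterance.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def normalize(template):
    """Drop slot indices, e.g. $date1 or $date_1 both become $date."""
    return re.sub(r"(\$[A-Za-z_-]*[A-Za-z])_?\d+", r"\1", template)


def novelty(generated, train_references):
    """Share of unique generated templates unseen in training (assumed form)."""
    seen = {normalize(reference) for reference in train_references}
    unique = {normalize(output) for output in generated}
    if not unique:
        return 0.0
    return sum(1 for output in unique if output not in seen) / len(unique)


outputs = ["Table is reserved for $date2.", "See you soon."]
train = ["Table is reserved for $date1."]
print(distinct_n(["a b a b"], 2), novelty(outputs, train))
```

Normalizing before comparison prevents a template from counting as novel only because its slot index differs from the training reference.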

Column groups: Similarity to Refs (BLEU Avg, METEOR Avg); Semantics (SER Avg, Slot Match Rate); Diversity (Vocab1, Vocab2, Distinct1, Distinct2, Novelty).
Model | Input | BLEU Avg | METEOR Avg | SER Avg | Slot Match Rate | Vocab1 (Gold: 2.5k) | Vocab2 (Gold: 20k) | Distinct1 (Gold: 0.01) | Distinct2 (Gold: 0.1) | Novelty
Seq2Seq | MR | 0.4616 | 0.2556 | 0.1602 | 0.7530 | 253 | 614 | 0.0398 | 0.1093 | 0.5741
Seq2Seq | Schema | 0.4885 | 0.2680 | 0.2062 | 0.7009 | 275 | 699 | 0.0445 | 0.1288 | 0.5674
CVAE | MR | 0.4899 | 0.2732 | 0.2469 | 0.6622 | 292 | 727 | 0.0406 | 0.1128 | 0.5434
CVAE | Schema | 0.5079 | 0.2906 | 0.2407 | 0.6983 | 327 | 924 | 0.0445 | 0.1401 | 0.6142
GPT2 | MR | 0.4108 | 0.2974 | 0.1929 | 0.8331 | 648 | 2491 | 0.0818 | 0.3471 | 0.5808
GPT2 | Schema | 0.4587 | 0.3266 | 0.1810 | 0.8558 | 678 | 2659 | 0.0868 | 0.3767 | 0.5955
Table 6: Automatic evaluation metrics comparing traditional MR vs. rich schema. Higher is better for all metrics except SER.

4.2 Traditional MR vs. Rich Schema

Table 6 compares model performance when trained using only the traditional MR versus using the full schema (better result for each model in bold).

We first compare the performance of different models. From the table, we see that Seq2Seq and CVAE have higher BLEU compared to GPT2 (for both MR and Schema), but that GPT2 has a higher METEOR. This indicates that GPT2 is more frequently able to generate outputs that are semantically similar to references, but that might not be exact lexical matches (e.g., substituting ”film” for ”movie”) since GPT2 is a pretrained model. Similarly, GPT2 has a significantly higher vocabulary and diversity than both Seq2Seq and CVAE.

Next, we compare the performance of each model when trained using MR versus Schema. For all models, we see an improvement in similarity metrics (BLEU/METEOR) when training on the full schema. Similarly, in terms of diversity, we see large increases in vocabulary for all models, as well as increases in distinct-n and novelty (with the exception of Seq2Seq novelty, which drops slightly).

In terms of semantic accuracy, we see an improvement in both SER and Slot Match Rate for both CVAE and GPT2. For Seq2Seq, however, we actually see that the model performs better on semantics when training on only the MR. To investigate, we look at a breakdown of the kinds of errors made. We find that Seq2Seq/CVAE only suffer from deletions, but GPT2 also produces repetitions and hallucinations; however, training using the schema reduces the number of these mistakes enough to result in an SER improvement for GPT2 (see appendix for details).

Service | % Test Refs | Seq2Seq BLEU | Seq2Seq SER | CVAE BLEU | CVAE SER | GPT2 BLEU | GPT2 SER
events_1 | 19% | 0.6747 | 0.0490 | 0.6668 | 0.0294 | 0.5573 | 0.0588
rentalcars_1 | 18% | 0.7706 | 0.1500 | 0.7383 | 0.1125 | 0.6638 | 0.1000
buses_1 | 15% | 0.4787 | 0.1542 | 0.5559 | 0.1000 | 0.4445 | 0.0167
weather_1 | 7% | 0.6310 | 0.1528 | 0.7286 | 0.1111 | 0.5866 | 0.1667
(a) Seen services.
restaurants_2 | 24% | 0.3575 | 0.2098 | 0.3291 | 0.3501 | 0.3217 | 0.0527
flights_3 | 18% | 0.3402 | 0.4579 | 0.3920 | 0.5000 | 0.3528 | 0.7368
services_4 | 18% | 0.6233 | 0.2197 | 0.5027 | 0.4013 | 0.6117 | 0.0851
movies_2 | 4% | 0.4682 | 0.4028 | 0.5003 | 0.4444 | 0.4561 | 0.8472
(b) Partially-unseen services.
alarm_1 | 100% | 0.5641 | 0.2667 | 0.5864 | 0.2667 | 0.3827 | 0.5833
(c) Fully-unseen services.
Table 7: Automatic evaluation metrics across seen, partially-unseen, and fully-unseen services when training with schema.

Partition | Input | Seq2Seq BLEU | Seq2Seq SER | CVAE BLEU | CVAE SER | GPT2 BLEU | GPT2 SER
Seen | MR | 0.55 | 0.07 | 0.59 | 0.12 | 0.49 | 0.05
Seen | Sch | 0.61 | 0.12 | 0.65 | 0.09 | 0.56 | 0.04
Partially-unseen | MR | 0.45 | 0.23 | 0.48 | 0.34 | 0.39 | 0.31
Partially-unseen | Sch | 0.45 | 0.28 | 0.43 | 0.37 | 0.44 | 0.29
Fully-unseen | MR | 0.52 | 0.27 | 0.51 | 0.27 | 0.31 | 0.48
Fully-unseen | Sch | 0.56 | 0.27 | 0.59 | 0.27 | 0.38 | 0.58
Table 8: Average BLEU and SER by service splits.

4.3 Seen vs. Unseen Services

Next, we are interested to see how our models perform on specific services in the dataset. Recall that the original dataset consists of a set of services that can be grouped into domains: e.g., services restaurant_1 and restaurant_2 are both under the restaurant domain. Based on this, we segment our test set into three parts, by service: seen, or services that have been seen in training; partially-unseen, or services that are unseen in training but are part of domains that have been seen; and fully-unseen, where both the service and domain are unseen (we show distribution plots by service in the appendix). Table 7 shows model performance in terms of BLEU and SER. We sort services by how many references we have for them in test; events_1 for example constitutes 19% of the 20K test references. To focus our discussion here, we show only the top-3 and bottom-1 services in terms of percentage of test references (we show results for all services in the appendix). For fully-unseen we show the only available service (alarm_1). We show the best scores in bold and the worst scores in italic.

For seen services (Table 7(a)), we see the highest BLEU scores for all models on rentalcars_1 (which has a high distribution of refs). We note that SER is consistently low across all models (the worst SER across seen is 0.23). We also observe that weather_1, which is different from the rest of the more reservation-focused domains, has the smallest test distribution and notably high SER.

For partially-unseen services (Table 7(b)), we see the best BLEU for all models on services_4, and the best SER on restaurants_2 (again, with the highest percentage of test references). We note that flights_3 has the worst SER for Seq2Seq/CVAE (second worst for GPT2), despite having a large test distribution. Upon investigation, we find slot description errors: e.g., slot origin_airport_name has the slot description “Number of the airport flying out from”, indicating that models are sensitive to schema errors.

To better understand how the models do on average across all services (weighted by the percentage of test refs), we show average BLEU/SER scores in Table 8. We again compare performance between training on the MR vs. the schema. On average, we see that for the seen and fully-unseen partitions, training with the schema is better across almost all metrics. For partially-unseen, we see that CVAE performs better when training on only the MR; however, when averaging across the full test in Table 6, we see an improvement with schema.

As expected, we see higher BLEU and lower SER for seen vs. partially-unseen services. Surprisingly, we see higher schema BLEU for Seq2Seq on fully-unseen as compared to partially-unseen, but we note that there is a very small fully-unseen sample size (only 10 test MRs). We also note that GPT2 has high SER for the fully-unseen domain; upon inspection, we see slot hallucination from GPT2 within alarm_1, while Seq2Seq/CVAE never hallucinate.

4.4 Human Evaluation

We conduct an annotation study to evaluate our schema-guided output quality. We randomly sample 50 MRs from our test set, and present 3 annotators with a single output for each model as well as a reference (randomly shuffled). (To make annotation more intuitive, we automatically lexicalize slots with values from the schema (although this may add noise), e.g., “The date is $date1” becomes “The date is [March 1st].” We use the same values for all templates for consistency.)

We ask the annotators to give a binary rating for each output across 3 dimensions: grammar, naturalness, and semantics (as compared to the input MR). We also get an ”overall” rating for each template on a 1 (poor) to 5 (excellent) Likert scale.

Table 9 summarizes the results of the study. For grammar, naturalness, and semantics, we show the ratio of how frequently a given model or reference output is marked as correct over all outputs for that model. For the ”overall” rating, we average the 3 ratings given by the annotators for each instance, and present an average across all MRs (out of 5).

Output | Grammar | Naturalness | Semantics | Overall
Reference | 0.95 | 0.67 | 0.91 | 3.97
Seq2Seq | 0.82 | 0.58 | 0.37 | 2.72
CVAE | 0.89 | 0.73 | 0.44 | 3.01
GPT2 | 0.80 | 0.61 | 0.70 | 3.61
Table 9: Average human evaluation scores for different quality dimensions.

From the table, we see that in terms of grammar and naturalness, CVAE has the highest score of all models. It is particularly interesting to note that CVAE even beats the reference in terms of naturalness, highlighting the fact that even reference quality is subjective and not necessarily a gold standard. In terms of semantics, we see that GPT-2 has the highest ratings of all models. Most interestingly, we see that CVAE has a significantly lower semantic rating, although it is the winner on grammar and naturalness, indicating that while CVAE outputs may be fluent, they frequently do not actually express the required content (see Schema 3 in Table 5). This finding is also consistent with our SER calculations from Table 6, where we see that CVAE has the highest SER. (We compute Fleiss Kappa scores for each dimension, finding near-perfect agreement for semantics (0.87), substantial agreement for grammar (0.76), and moderate agreement for naturalness (0.58) and overall (0.47).)
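For readers unfamiliar with the agreement statistic, Fleiss' kappa can be computed from per-item category counts as in the sketch below. The toy ratings are invented to exercise the formula and are unrelated to the study's annotations; the function name is an assumption.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of category counts.

    ratings[i][j] is the number of annotators who assigned item i to
    category j; every item must have the same number of annotators.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Proportion of all assignments that fall in each category
    p_category = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item
    p_item = [
        (sum(count * count for count in row) - n_raters)
        / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_category)
    return (p_observed - p_expected) / (1 - p_expected)


# Three annotators, binary rating (correct, incorrect), four toy items.
toy_ratings = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(toy_ratings)
print(round(kappa, 4))
```

Values near 1 indicate agreement well above chance, while values near 0 indicate agreement no better than chance.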

In terms of overall score, we see that GPT-2 has the highest rating of all three models, closely matching the ratings for the references. This can be attributed to its higher semantic accuracy, combined with good (even if not the highest) ratings on grammar and naturalness.

5 Related Work

Most work on NNLG uses a simple MR that consists of slots and value tokens that only describe information that should be realized, without including contextual information to guide the generator as we do, although some work has described how this could be useful Walker et al. (2018). WebNLG Colin et al. (2016) includes structured triples from Wikipedia which may constitute slightly richer MRs, but are not contextualized. Oraby et al. (2019) generate rich MRs that contain syntactic and stylistic information for generating descriptive restaurant reviews, but do not include any contextual information that does not need to be included in the output realization. Table-to-text generation using ROTOWIRE (NBA players and stats) also includes richer information, but it is also not contextualized Gong et al. (2019).

Other previous work has attempted to address domain transfer in NLG. Dethlefs (2017) uses an abstract meaning representation (AMR) as a way to share common semantic information across domains. Wen et al. (2016) use a “data counterfeiting” method to generate synthetic data from existing domains to train models on unseen domains, then fine-tune on a small set of in-domain utterances. Tran and Nguyen (2018) also train models on a source domain dataset, then fine-tune on a small sample of target domain utterances for domain adaptation. Rather than fine-tuning models for new domains, our data-driven approach allows us to learn domain information directly from the data schema.

6 Conclusions

In this paper, we present the novel task of Schema-Guided NLG. We demonstrate how we are able to generate templates (i.e., delexicalized surface strings) across different domains using three state-of-the-art models, informed by a rich schema of information including intent and slot descriptions and domain information. We describe how we preprocess the DSTC8 schema-guided dialog dataset from the dialog state tracking community, repurposing it for NLG. In our evaluation, we demonstrate how training using this rich schema shows improvements across different similarity (up to 0.51 BLEU and 0.32 METEOR), semantic (as low as 0.18 average SER), and diversity (up to 2.5K bigram vocabulary) metrics on both seen and unseen domains. Through a human evaluation, we show how our outputs are rated up to 3.61 out of 5 overall (as compared to 3.97 for references). We observe that different models have different strengths: Seq2Seq and CVAE have higher BLEU reference similarity scores, while GPT2 is significantly more diverse and is scored highest in human evaluation.

For future work, we are interested in exploring how schema-guided NLG can be used in the context of a dialog system, where only outputs that have no slot errors and have the best overall fluency should be selected as candidate responses. We are also interested in improving both the semantic correctness and overall fluency of our model outputs by introducing improved methods for constrained decoding and language model integration. Additionally, we plan to develop more accurate automatic measures of semantic quality, as well as more fine-grained control of domain transfer.


Appendix A

Service and Slot Descriptions

Events_1: The comprehensive portal to find and reserve seats at events near you
    category: Type of event
    time: Time when the event is scheduled to start
Events_2: Get tickets for the coolest concerts and sports in your area
    date: Date of event
    time: Starting time for event
Media_1: A leading provider of movies for searching and watching on-demand
    title: Title of the movie
    genre: Genre of the movie
Table 10: Services, slots and their descriptions. Slots are indented under their service name.

Details of SER Errors

All of the errors made by Seq2Seq and CVAE are deletion errors (constrained decoding prevents repetitions/hallucinations). While using schema leads to more deletions in GPT2, it reduces repetitions and hallucinations, leading to better SER.

Model    Input   SER     Delete  Repeat  Halluc.
Seq2Seq  MR      0.1602  0.1602  0       0
Seq2Seq  Schema  0.2062  0.2062  0       0
CVAE     MR      0.2469  0.2469  0       0
CVAE     Schema  0.2407  0.2407  0       0
GPT2     MR      0.1929  0.0791  0.0037  0.1101
GPT2     Schema  0.1810  0.0850  0.0020  0.0940
Table 11: Detailed analysis of slot errors.
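
In Table 11, the Delete, Repeat and Halluc. columns add up to the SER column in every row (for GPT2 with schema, 0.0850 + 0.0020 + 0.0940 = 0.1810). As a rough illustration of the three error types, the sketch below counts them for a single delexicalized output. The function name `slot_errors`, the placeholder regex and the counting rules are assumptions made for this example; this is not the evaluation script behind the table, and no normalization into a rate is attempted.

```python
import re
from collections import Counter

# Assumed placeholder format, based on the examples in Table 13
# (e.g. $ride-fare1, $approx-ride-duration1).
SLOT = re.compile(r"\$[a-z]+(?:-[a-z]+)*\d+")


def slot_errors(mr_slots, output):
    """Count deleted, repeated and hallucinated slot placeholders in one output."""
    expected = Counter(mr_slots)
    produced = Counter(SLOT.findall(output))
    # Deletion: a slot required by the MR never appears in the output.
    deleted = sum(1 for s in expected if produced[s] == 0)
    # Repetition: a required slot appears more than once (extra copies counted).
    repeated = sum(produced[s] - 1 for s in expected if produced[s] > 1)
    # Hallucination: a placeholder appears that the MR did not ask for.
    hallucinated = sum(n for s, n in produced.items() if s not in expected)
    return {"delete": deleted, "repeat": repeated, "halluc": hallucinated}


mr = ["$ride-fare1", "$approx-ride-duration1"]
cvae = "the cab is on the way . it will cost $ride-fare1."
gpt2 = ("the ride will be $ride-fare1 and is expected to last "
        "$approx-ride-duration1 minutes.")
print(slot_errors(mr, cvae))
print(slot_errors(mr, gpt2))
```

Applied to the Schema 2 example in Table 13, this counts one deletion for the CVAE output (the ride duration is dropped) and no errors for the GPT2 output.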

Seen vs. Unseen Domains

Data Distribution Plots

For the seen set in Figure 2(a), we present the distribution of references both in training and test. For the unseen sets in Figure 2(b), we present only the test reference distribution (since there are no corresponding train references).

(a) Distribution of refs in seen services.
(b) Distribution of refs in partially/fully unseen services.
Figure 2: Distribution of references across services.

Performance across Services

Table 12 shows the performance of each model across all seen and partially-unseen test sets.

events_1 19% 0.6747 0.0490 0.6668 0.0294 0.5573 0.0588
rentalcars_1 18% 0.7706 0.1500 0.7383 0.1125 0.6638 0.1000
buses_1 15% 0.4787 0.1542 0.5559 0.1000 0.4445 0.0167
homes_1 9% 0.4513 0.0660 0.5306 0.1176 0.5304 0.0065
ridesharing_1 9% 0.6335 0.2292 0.7025 0.1667 0.6416 0.0000
hotels_1 8% 0.5150 0.0983 0.5756 0.0700 0.3764 0.0000
music_1 8% 0.5921 0.1111 0.7761 0.0278 0.7224 0.0000
travel_1 7% 0.5473 0.0175 0.5742 0.1053 0.4705 0.0000
weather_1 7% 0.6310 0.1528 0.7286 0.1111 0.5866 0.1667
(Average) (100%) 0.6088 0.1151 0.6540 0.0896 0.5612 0.0441
(a) Seen services.
restaurants_2 24% 0.3575 0.2098 0.3291 0.3501 0.3217 0.0527
flights_3 18% 0.3402 0.4579 0.3920 0.5000 0.3528 0.7368
services_4 18% 0.6233 0.2197 0.5027 0.4013 0.6117 0.0851
hotels_4 17% 0.4503 0.2284 0.4310 0.2978 0.4915 0.1552
banks_2 10% 0.5422 0.2546 0.5702 0.2315 0.5377 0.3519
media_2 9% 0.4578 0.3218 0.4299 0.3218 0.3841 0.4483
movies_2 4% 0.4682 0.4028 0.5003 0.4444 0.4561 0.8472
(Average) (100%) 0.4515 0.2801 0.4299 0.3655 0.4425 0.2920
(b) Partially-unseen services.
Table 12: Automatic evaluation metrics across seen and partially-unseen services (best in bold, worst in italic).
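
The (Average) rows in Table 12 appear to weight each service by its share of the test set rather than averaging services equally. The sketch below checks this for the first score column of the seen services. This reading is an inference from the numbers, not something the table states, and the small remaining gap is plausibly due to the percentages being rounded.

```python
# (share of test set, first score column) for the seen services in Table 12(a).
seen = [
    (0.19, 0.6747),  # events_1
    (0.18, 0.7706),  # rentalcars_1
    (0.15, 0.4787),  # buses_1
    (0.09, 0.4513),  # homes_1
    (0.09, 0.6335),  # ridesharing_1
    (0.08, 0.5150),  # hotels_1
    (0.08, 0.5921),  # music_1
    (0.07, 0.5473),  # travel_1
    (0.07, 0.6310),  # weather_1
]


def weighted_mean(rows):
    """Mean of the scores, weighted by each service's share of the test set."""
    total = sum(w for w, _ in rows)
    return sum(w * x for w, x in rows) / total


def plain_mean(rows):
    """Unweighted mean of the scores, for comparison."""
    return sum(x for _, x in rows) / len(rows)


reported = 0.6088  # value in the (Average) row
print(round(weighted_mean(seen), 4), round(plain_mean(seen), 4), reported)
```

The weighted mean lands within about 0.002 of the reported average, while the unweighted mean is off by roughly 0.02.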

Seq2Seq and CVAE Model Diagram

Figure 3 shows a model architecture diagram for Seq2Seq and CVAE.

Figure 3: Seq2Seq and CVAE model architectures

Output Examples

Table 13 shows more model output examples. Schema 1 shows correct outputs for all models. Schema 2 shows a slot drop in CVAE, and Schema 3 shows incorrect outputs from Seq2Seq/CVAE for the single fully-unseen domain, alarm-1.

[Schema 1] ACTIONS (MR): CONFIRM(leaving-date=$leaving-date1), CONFIRM(travelers=$travelers1)
Slot Desc: leaving-date: “date of bus leaving for journey”, travelers: “number of travelers for journey”
Service: buses-1 Service Desc: book bus journeys from the biggest bus network in the country
Intent: BuyBusTickets Intent Desc: buy tickets for a bus journey
Natural Language MR: please confirm that the leaving date is $leaving-date1. please confirm that the travelers is $travelers1.
Ref can you confirm once again that you need tickets for $travelers1 people for the bus leaving on $leaving-date1.
Seq2Seq please confirm the following details : you want to book $travelers1 tickets on $leaving-date1 .
CVAE please confirm : $travelers1 tickets for the bus leaving on $leaving-date1 .
GPT2 okay, it’s $travelers1 tickets leaving $leaving-date1, is that right?
[Schema 2] ACTIONS (MR): INFORM(ride-fare=$ride-fare1), INFORM(approx-ride-duration=$approx-ride-duration1)
Slot Desc: ride-fare: “total fare for taxi ride”, approx-ride-duration: “approximate duration of ride to the destination”
Service: ridesharing-1 Service Desc: on-demand taxi calling service
Intent: GetRide Intent Desc: call a taxi to head to a given destination
Natural Language MR: the ride fare is [$ride-fare1]. the approximate ride duration is [$approx-ride-duration1].
Ref around $approx-ride-duration1 minutes for $ride-fare1.
Seq2Seq around $approx-ride-duration1 minutes for $ride-fare1.
CVAE the cab is on the way . it will cost $ride-fare1.
GPT2 the ride will be $ride-fare1 and is expected to last $approx-ride-duration1 minutes.
[Schema 3] ACTIONS (MR): REQUEST(new-alarm-time=null)
Slot Desc: new-alarm-time: “time to set for the new alarm”
Service: alarm-1 Service Desc: Manage alarms by getting and setting them easily
Intent: AddAlarm Intent Desc: Set a new alarm
Natural Language MR: what [new alarm time] do you want?
Ref at what time, you want the alarm to be set?
Seq2Seq what time do you want to go ?
CVAE what time would you like to go ?
GPT2 what time do you want to set the alarm?
Table 13: Example model outputs.
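
The "Natural Language MR" rows in Table 13 read like the output of one simple template per dialog act. A minimal sketch of such a conversion follows. The three templates are inferred only from the three examples above, and `TEMPLATES`, `ACTION` and `natural_language_mr` are hypothetical names, so this should not be taken as the preprocessing actually used. It also does not expand abbreviations, so it would produce "approx ride duration" where Schema 2 shows "approximate ride duration".

```python
import re

# Hypothetical templates, inferred from the Schema 1-3 examples in Table 13.
TEMPLATES = {
    "CONFIRM": "please confirm that the {slot} is {value}.",
    "INFORM": "the {slot} is [{value}].",
    "REQUEST": "what [{slot}] do you want?",
}

# Matches ACT(slot=value), tolerating whitespace after the equals sign.
ACTION = re.compile(r"(\w+)\(([\w-]+)=\s*([^)]*)\)")


def natural_language_mr(actions):
    """Turn an ACTIONS (MR) string into a templated natural-language MR."""
    sentences = []
    for act, slot, value in ACTION.findall(actions):
        # Slot names are de-hyphenated; values stay as delexicalized placeholders.
        sentences.append(
            TEMPLATES[act].format(slot=slot.replace("-", " "), value=value.strip())
        )
    return " ".join(sentences)


print(natural_language_mr(
    "CONFIRM(leaving-date=$leaving-date1), CONFIRM(travelers=$travelers1)"
))
print(natural_language_mr("REQUEST(new-alarm-time=null)"))
```

For the Schema 1 and Schema 3 actions, this reproduces the natural-language MRs shown in the table.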