Virtual assistants have become popular in recent years, and task completion is one of their most important aspects. These assistants help users accomplish tasks such as finding restaurants, buying sports tickets, and checking the weather by providing a natural language interface to the many services and APIs available on the web. Figure 1 shows the general architecture of a task-oriented dialogue system. Most systems include a natural language understanding and dialogue state tracking module for semantic parsing of the dialogue history. This is followed by a policy module, which interacts with the APIs whenever required and generates the actions to be taken by the system to continue the dialogue. Finally, the Natural Language Generation (NLG) module converts these actions into an utterance, which is surfaced to the user. Being the user-facing interface of the dialogue system, NLG is one of the components with the greatest impact on user experience.
Traditional NLG systems rely heavily on a set of templates to produce system utterances. Although templates give good control over the outputs generated by the system, defining them becomes increasingly tedious as more APIs are added. Supporting multi-domain conversations spanning multiple APIs quickly grows out of hand, requiring expert linguists and rigorous testing to ensure the grammatical correctness and appropriateness of the generated utterances.
Consequently, data-driven generative approaches have gained prominence. Such systems require much less effort and can generate utterances containing novel patterns. Meanwhile, with the rapid proliferation of personal assistants, supporting a large number of APIs across multiple domains has become increasingly important, spurring research on supporting new APIs with only a few labelled examples (few-shot learning). To this end, generative models pre-trained on large amounts of unannotated text have been increasingly successful.
In this work, we study the use of pre-trained generative models for NLG. Our key contributions are threefold:
We propose a simple template-based representation of system actions and formulate NLG as an utterance rewriting task. We demonstrate the superiority of this approach through automatic and human evaluations.
We introduce the SGD-NLG dataset as a benchmark for few-shot and zero-shot learning of natural language generation. Our dataset is based on the SGD dataset Rastogi et al. (2019) and exceeds all other datasets in terms of number of domains, providing a total of 20 domains across training and evaluation sets.
We conduct an extensive set of experiments to investigate the role of dialog history context, cross-domain transfer learning and few-shot learning. We share our findings to guide the design choices in future research.
Our approach achieves state-of-the-art on the MultiWOZ dataset. Next, through experiments on the multi-domain SGD-NLG dataset, we show that this approach enjoys several desirable properties such as strong generalization to unseen domains and improved sample efficiency. Finally, human evaluations show that raters prefer our model’s generated responses over human authored text.
| Approach | Representation of System Actions |
| --- | --- |
| Naive | inform ( restaurant = Opa! ) inform ( cuisine = greek ) |
| Slot Description | inform ( name of restaurant = Opa! ) inform ( type of food served = greek ) |
| Template | How about the restaurant Opa!. The restaurant serves greek food. |
| Ground Truth | Opa! is a nice greek restaurant. How does it sound? |
2 Related Work
Natural language generation from structured input (NLG) has been an active area of research, facilitated by the creation of datasets like WikiBio Lebret et al. (2016), the E2E challenge Novikova et al. (2017), WebNLG Gardent et al. (2017) and MultiWOZ Budzianowski et al. (2018). Neural sequence models have been extensively used in a variety of configurations for NLG in task-oriented dialogue systems. Wen et al. (2017) proposed a two-step approach: first generating a delexicalized utterance with placeholders for slots, then post-processing it to replace the placeholders with values from API results, whereas Nayak et al. (2017) highlighted the importance of conditioning generated responses on slot values.
Sequence-to-sequence architectures that directly convert a sequential representation of system actions into a system response are also very common Wen et al. (2015); Dušek and Jurcicek (2016b); Zhu et al. (2019); Chen et al. (2019a). Domain adaptation and transfer learning in low-resource settings have also been extensively studied Tran and Le Nguyen (2018); Chen et al. (2019b); Peng et al. (2020); Mi et al. (2019), with recently released datasets like SGD Rastogi et al. (2019) and FewShotWOZ Peng et al. (2020) providing good benchmarks.
Recently, language models pre-trained on large amounts of unannotated text have achieved state-of-the-art performance across several natural language processing tasks Devlin et al. (2019); Yang et al. (2019); Liu et al. (2019); Radford et al. (2019); Keskar et al. (2019). Pre-trained generative models have shown promising results for NLG in dialogue systems in low-resource settings Budzianowski and Vulic (2019); Peng et al. (2020); Kale and Roy (2020).
Our template-based approach bears similarities to the sentence fusion task Barzilay and McKeown (2005), where the aim is to combine multiple sentences into a single coherent sentence. While sentence fusion has been applied to multi-document summarization, in this work we demonstrate its effectiveness for task-oriented response generation.
For a given dialogue turn, let A = {a_1, ..., a_N} be the set of actions output by the system, where N is the total number of actions output by the system for this turn. Each action a_i consists of a single dialogue act d_i representing the semantics of the action, along with optional slot and value parameters, s_i and v_i respectively. For example, inform, req_more and request are some of the dialogue acts defined in the SGD-NLG dataset Rastogi et al. (2019), which are used for informing the value of a slot to the user, asking if the user needs some other help, and requesting the value of a slot from the user, respectively. Some acts like inform require both the slot and value parameters, whereas acts like request require only the slot parameter and acts like req_more require neither. Some datasets allow multiple slot-value arguments for a single act, but such actions can be converted to the above representation by decomposing them into multiple actions with the same act, each containing exactly one slot-value pair.
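The decomposition of a multi-slot action into atomic actions, each carrying exactly one slot-value pair, can be sketched as follows. This is our own illustration; the `Action` type and function name are not from the original work.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Action:
    act: str                    # dialogue act, e.g. "inform", "request", "req_more"
    slot: Optional[str] = None  # optional slot parameter
    value: Optional[str] = None # optional value parameter


def decompose(act: str, slot_values: dict) -> List[Action]:
    """Split an action with multiple slot-value arguments into one
    atomic Action per slot-value pair, all sharing the same act."""
    if not slot_values:
        return [Action(act)]
    return [Action(act, slot, value) for slot, value in slot_values.items()]


# A single inform action with two slot-value arguments becomes two actions.
actions = decompose("inform", {"restaurant": "Opa!", "cuisine": "greek"})
```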
The goal of NLG is to translate A into a natural language response with the same semantic content. To this end, we first convert the set A into a sequence (Section 6). Then, we utilize the Text-to-Text Transfer Transformer (T5) Raffel et al. (2019), a sequence-to-sequence model, to generate the natural language response.
We use the T5-small model, which has 6 layers each in the encoder and decoder, for a total of around 60 million parameters. In each of the experiments reported in this paper, we started with the pre-trained T5-small model released by the authors (github.com/google-research/text-to-text-transfer-transformer). The model was then fine-tuned on the corresponding dataset using a constant learning rate of 0.001 and a batch size of 256. In all experiments, we observed that the model converged before 1000 steps. The checkpoint yielding the highest BLEU score on the development set was picked for reporting test set results. During inference, we use beam search with a width of 4 and a length penalty.
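For reference, the fine-tuning and decoding settings stated above can be collected into a single configuration. The dictionary below is purely illustrative (the key names are our own, not from the released code), and the length penalty is left unset because its value is not specified in the text.

```python
# Illustrative hyperparameter summary for fine-tuning T5-small, as reported
# above. Key names are our own; length_penalty is intentionally None because
# its value is not given in the text.
T5_SMALL_FINETUNE_CONFIG = {
    "model": "t5-small",            # 6 encoder + 6 decoder layers, ~60M params
    "learning_rate": 1e-3,          # constant learning rate
    "batch_size": 256,
    "max_steps": 1000,              # convergence observed before 1000 steps
    "checkpoint_selection": "best dev BLEU",
    "beam_width": 4,                # beam search at inference time
    "length_penalty": None,         # value not specified in the text
}
```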
We conduct experiments on two datasets: SGD-NLG Rastogi et al. (2019) and MultiWOZ Budzianowski et al. (2018). SGD-NLG features a larger number of domains and slots than MultiWOZ, and the presence of multiple services per domain makes it representative of the practical scale-related challenges faced by today's virtual assistants. Furthermore, the evaluation sets contain many domains, and consequently slots, that are not present in the training set, enabling evaluation of model performance on unseen domains.
Prior work Mi et al. (2019); Tran and Le Nguyen (2018); Wen et al. (2016) has studied zero-shot learning, domain adaptation, etc. in a simulated setting, mainly by holding out domains for adaptation one at a time and creating small subsets. Existing datasets are also very limited in the number of domains: the largest dataset so far, MultiWOZ, has just 5 domains in the test set. Moreover, the lack of knowledge of the exact data splits makes it difficult to compare against other methods.
On the other hand, the large size and variety of the SGD dataset makes it a great testbed to study zero-shot learning, few-shot adaptation etc. Having a canonical split will make it easier for future work to compare results across methods.
To encourage reproducible research and provide a single benchmark that can support different paradigms like joint modeling and domain adaptation, we create a new version of the SGD dataset as follows:
To study few-shot learning from scratch, we create k-shot subsets for varying values of k. In this setting, each domain has k dialogues.
For all few-shot splits, we ensure that they contain examples of every dialogue act and slot type.
Multi-domain dialogues are removed from the training data. Though many dialogues are discarded, we found that this led to minimal loss in quality.
The dev and test sets are left untouched.
We call this dataset SGD-NLG. A comparison with the MultiWOZ dataset can be found in Table 1. The dataset and code will be made publicly available in the future.
Automatic Metrics Following prior work, we use BLEU and Slot Error Rate (SER) as automatic metrics. SER represents the fraction of generated texts where at least one slot was not correctly copied from the structured data. Since this metric relies on string matching, we cannot use it to evaluate binary slots like has_live_music.
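As an illustration, a minimal string-matching SER can be computed as sketched below. This is our own simplification; a real implementation would additionally handle value normalization and the binary-slot caveat noted above.

```python
def slot_error_rate(examples):
    """Fraction of generated texts missing at least one required slot value.

    examples: list of (generated_text, slot_values) pairs, where slot_values
    is the list of values the response must mention. An example counts as an
    error if any value is absent from the text (simple string matching).
    """
    errors = sum(
        any(value.lower() not in text.lower() for value in slot_values)
        for text, slot_values in examples
    )
    return errors / len(examples)


rate = slot_error_rate([
    ("Opa! is a nice greek restaurant.", ["Opa!", "greek"]),  # all values present
    ("How about this restaurant?", ["Opa!", "greek"]),        # both values missing
])
```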
Human Evaluation We conduct a human evaluation study via crowd sourcing. Each worker is shown the dialogue act and responses predicted by the NLG models. Following (Peng et al., 2020), they are asked to rate each response on a scale of 1 (bad) to 3 (good) along two axes - informativeness and naturalness. Each example is rated by 3 different workers. The final metric is an average of all the ratings.
6 Encoding System Actions
We experiment with three different representations of system actions as shown in Figure 3, and described below.
6.1 Naive Representation
This approach utilizes the most basic representation of actions, similar to that used in Peng et al. (2020). Canonical representations of each action - d(s = v), d(s) or d, depending on the parameters present in the action - are concatenated together to obtain a sequence representation of A. Although this representation is simple to obtain, it suffers from two drawbacks:
Semantics - This representation doesn't convey much information about the semantics of a slot. Consequently, the model may need a larger number of training examples to infer the semantics of each slot from its usage in system utterances.
Representation Bias - This representation is very different from what the encoder saw during pre-training, which is natural language text. As a result, the representations learnt during pre-training may not transfer well. Peng et al. (2020) mitigate this by conducting additional pre-training utilizing large-scale annotated dialogue datasets. While this method is effective, a large in-domain corpus may not always be available.
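For illustration, the naive serialization (producing strings in the format shown earlier, e.g. `inform ( restaurant = Opa! )`) can be sketched as follows; the function name and triple format are our own.

```python
def naive_representation(actions):
    """Serialize a list of (act, slot, value) triples, where slot and value
    may be None, into a single space-joined canonical string."""
    parts = []
    for act, slot, value in actions:
        if slot is not None and value is not None:
            parts.append(f"{act} ( {slot} = {value} )")   # e.g. inform with slot+value
        elif slot is not None:
            parts.append(f"{act} ( {slot} )")             # e.g. request with slot only
        else:
            parts.append(act)                             # e.g. req_more with no params
    return " ".join(parts)


seq = naive_representation([
    ("inform", "restaurant", "Opa!"),
    ("inform", "cuisine", "greek"),
])
```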
6.2 Slot Description Based Representation
Recent work on low-resource natural language understanding tasks has utilized natural language descriptions of slots. These descriptions are easy to obtain, directly encode the semantics of the slot, and have been shown to help when in-domain training data is sparse. Along similar lines, we extend the Naive representation by replacing the slot names with their natural language descriptions. The action representations become d(desc(s) = v) and d(desc(s)), where desc(s) denotes a natural language description of slot s. This addresses the first drawback of the Naive representation mentioned above. We refer to this method as SlotDesc.
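A sketch of the SlotDesc serialization: slot names are substituted by their natural language descriptions before the canonical string is built. The descriptions below mirror the earlier example (`name of restaurant`, `type of food served`); the code itself is our own illustration.

```python
# Example slot descriptions, matching the earlier table.
SLOT_DESCRIPTIONS = {
    "restaurant": "name of restaurant",
    "cuisine": "type of food served",
}


def slot_desc_representation(actions):
    """Serialize (act, slot, value) triples, replacing each slot name with
    its natural language description when one is available."""
    return " ".join(
        f"{act} ( {SLOT_DESCRIPTIONS.get(slot, slot)} = {value} )"
        for act, slot, value in actions
    )


seq = slot_desc_representation([
    ("inform", "restaurant", "Opa!"),
    ("inform", "cuisine", "greek"),
])
```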
6.3 Template Based Representation
| System action | Template |
| --- | --- |
| notify_success | Your ride is booked and the cab is on its way. |
| goodbye | Have a safe ride! |
| request(dest) | Where are you riding to? |
| request(shared) | Are you comfortable sharing the ride? |
| confirm(dest=$x) | You are going to $x. |
| inform(fare=$x) | Your ride costs $x dollars. |
| inform(seats=$x) | The cab is for $x riders. |
We solve the representation bias problem by converting the set of actions output by the system into a natural language utterance. We employ a technique similar to that used in Rastogi et al. (2019) and define a minimal set of templates. Specifically, as shown in Figure 4, we define one template for each system action. The representation of A is obtained by concatenating the templatized representations of the individual actions in A.
Note that our focus here is not to generate conversational, grammatically correct utterances, but to have a simple representation of the actions, which the model can rewrite into a natural and fluent response. Hence, we don't need to cover all the edge cases typically required in template-based methods - handling plurals, subject-verb agreement, morphological inflection, etc. - and only need to define a small number of templates. For most APIs this amounts to around 15-30 templates, with the exact number varying with the number of slots and intents supported by the API. Since this method relies on a combination of templates and transfer learning from pre-trained language models, we name it Template Guided Text Generation (T2G2).
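Using templates like those in Figure 4, where `$x` marks the value placeholder, the templatized input is produced by filling and concatenating one template per action. The sketch below is our own illustration of this step, with templates taken from the ride-hailing example.

```python
# Templates from the Figure 4 example, keyed by (act, slot); $x marks the value.
TEMPLATES = {
    ("request", "dest"): "Where are you riding to?",
    ("confirm", "dest"): "You are going to $x.",
    ("inform", "fare"): "Your ride costs $x dollars.",
    ("inform", "seats"): "The cab is for $x riders.",
    ("goodbye", None): "Have a safe ride!",
}


def templatize(actions):
    """Build the T2G2 model input: fill one template per (act, slot, value)
    triple and concatenate the results into a single utterance."""
    parts = []
    for act, slot, value in actions:
        template = TEMPLATES[(act, slot)]
        parts.append(template.replace("$x", value) if value else template)
    return " ".join(parts)


model_input = templatize([
    ("inform", "fare", "15"),
    ("inform", "seats", "2"),
])
```

The model then rewrites this crude but accurate concatenation into a single fluent response.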
Baselines Besides our proposed approaches, we also compare with the following baselines:
HDSA - Hierarchically Disentangled Self-Attention Chen et al. (2019a), a transformer based architecture that exploits the structure of dialog acts to build a multi-layer hierarchical graph.
SC-GPT - A GPT-2 based pre-train + fine-tune approach that relies on a large in-domain NLG corpus. This model currently holds state-of-the-art on MultiWOZ.
Copy - A trivial baseline where the input template is treated as the output text. Though the text is accurate and contains all the information, it is likely to sound unnatural and ungrammatical.
The results are shown in Table 2. Naive achieves a BLEU score of 34.96 and outperforms the previous best model, SC-GPT, by over 4 points, setting a new state of the art. SC-GPT consists of a GPT-2 model further pre-trained on a large in-domain NLG corpus; in contrast, we do not use any such corpus and directly fine-tune T5. While our SER score is slightly higher, we found that most of the errors can be attributed to the noisy string-matching aspect of the metric and were not actual errors. T2G2 performs on par with Naive. This is likely due to the large size of the MultiWOZ dataset (57K utterances spread over just 5 domains) and indicates that, with enough annotated data, a simple pre-train and fine-tune approach is enough to attain good performance. Few-shot and zero-shot settings offer a greater and more realistic challenge, and we explore these settings next.
The ideal NLG model should be able to handle domains it was not exposed to during training. In practice, this is very hard to achieve. The SGD-NLG dataset, which features unseen domains in the evaluation sets, lets us assess this zero-shot capability. We report results in Table 3 on two test sets: the seen set consists of domains that were seen during training, while the unseen set consists of brand new domains, i.e., the zero-shot setting. First, all models exhibit low SER scores on both seen and unseen domains, with the template approach achieving the lowest. This suggests that pre-trained language models are adept at copying, and that this skill generalizes to out-of-domain examples as well. This result also hints at the need for better evaluation metrics for text generation.
SlotDesc performs on par with Naive on seen domains. At the same time, slot descriptions do improve performance on unseen domains (+1.5 BLEU), albeit to a limited degree. More effective ways of incorporating descriptions are a promising area for future work. For the seen domains, template outperforms Naive by 2.7 BLEU. The results on the unseen domains are more striking, with template improving on Naive by 6.7 points. This confirms the hypothesis that our simple template-based input scheme offers superior generalization capabilities with low overhead. The template model learns to "fuse" sentences and is able to extend this skill to unseen schemas. The difference in performance between seen and unseen domains, which can be taken as an indicator of the generalization gap, is 12.5 BLEU for the Naive model. T2G2 reduces this to 6.3, effectively halving the gap.
Statistical significance computed using a one-tailed t-test (p < 0.01).
Qualitative Analysis In Table 4 we list a few examples of model predictions. The first is from Weather, a domain that is not present in the training set; as expected, the Naive model fails completely. SlotDesc, on the other hand, successfully utilizes the descriptions of the domain and slots to produce a largely accurate and fluent response, though it misses the word 'percent' when stating the humidity value. T2G2, which relies on simple, crude templates, is able to take that information and rewrite it into a coherent response that fully conveys the desired meaning.
The second example is also from an unseen domain, Trains. While Naive and SlotDesc correctly convey the slot-related information, they talk about a bus instead of a train, since the input most closely resembles the Bus domain seen during training. T2G2, on the other hand, produces accurate text.
The final example illustrates a case where the model has to deal with a seen domain (Media) but an unseen slot (starring). This is likely to be a common scenario, where new functionality needs to be added to an existing API. Here, both Naive and SlotDesc incorrectly treat the slot value Antonio Bustroff as a director, since the slot directed_by appears in training. T2G2, however, correctly grounds the generated text in the input templates and generates the phrase 'acted in'.
We refer the reader to the appendix for more qualitative examples.
Human Evaluation We conduct a human evaluation study as described in Section 5. A total of 500 examples are rated, 250 each from seen and unseen domains, across the 3 models discussed above and the ground truth (human) response. With 3 ratings per example, this yields a total of 6,000 ratings. In each rating task, the raters were asked to rate the responses generated by each model and the ground truth response with categorical scores 1 (bad), 2 (average) and 3 (good). For each example, the responses were shuffled into a random order to prevent positional bias among the raters.
From the results in Table 5, we find that on seen domains all models perform comparably. Somewhat surprisingly, for seen domains even the baseline model performs on par with the ground truth. This is in line with recent work on task-oriented NLG Peng et al. (2020); Kale and Roy (2020), which found that large pre-trained models can be fine-tuned to generate responses that match or exceed the quality of the data they were trained on. For unseen domains, T2G2 provides large improvements over the baselines for both informativeness and naturalness, confirming the trends from the automatic metrics. Remarkably, T2G2 also outperforms the human-authored ground truth responses. We take this as a promising indication of the real-world applicability of our approach.
7 Other Experiments
Even after settling on a model architecture, many design choices remain. In this section, we conduct a thorough analysis of these choices and report our empirical findings on different NLG datasets. We hope that these experiments will guide design choices in future NLG models.
7.1 Few-shot NLG
Virtual assistants need to support a large number of domains and APIs, so supporting new APIs without requiring large amounts of annotated data is very important. In these experiments, we study how the performance of NLG models varies with the amount of available training data. For a k-shot setting, we sample k dialogues from each domain for training. Since there are 14 domains in the train set of SGD-NLG, the 5-shot setting corresponds to 70 dialogues. The dev and test sets are untouched. We run experiments for several values of k. We refer to this dataset as FewShotSGD and will make the exact splits publicly available to facilitate easy and fair comparisons in future research.
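The k-shot subsampling can be sketched as follows. This is a simplified illustration of our own: the actual released splits additionally guarantee coverage of every dialogue act and slot type, which this sketch omits.

```python
import random


def make_k_shot_split(dialogues, k, seed=0):
    """Sample k dialogues per domain for a k-shot training split.

    dialogues: list of (domain, dialogue) pairs. Simplified: the real splits
    also ensure that every dialogue act and slot type is represented.
    """
    by_domain = {}
    for domain, dialogue in dialogues:
        by_domain.setdefault(domain, []).append(dialogue)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    split = []
    for domain in sorted(by_domain):
        items = by_domain[domain]
        split.extend(rng.sample(items, min(k, len(items))))
    return split


# 14 domains with 10 dialogues each; a 5-shot split yields 14 * 5 = 70 dialogues.
dialogues = [(f"domain_{d}", f"dialogue_{d}_{i}") for d in range(14) for i in range(10)]
five_shot = make_k_shot_split(dialogues, k=5)
```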
Results are reported in Table 6. In all k-shot settings, T2G2 gives consistent improvements of 3-4 BLEU while reducing the SER by 50%. Even in the extreme 5-shot setting, the SER is just 2.66%. Remarkably, T2G2 in the 20-shot setting (280 dialogues total) performs on par with the Naive model trained on the entire dataset, which is 20x larger (5,403 dialogues). We take this as evidence that our templatized input representation can lead to a significant reduction in labelled data requirements.
7.2 Joint Modeling
Domain- or API-specific models effectively have a larger number of parameters per API, which increases model capacity. On the other hand, parameter sharing effectively increases the amount of supervision per parameter by utilizing training examples from all APIs. Hence, joint modeling could be beneficial in low-resource settings if there is some similarity between the underlying structures. Furthermore, joint modeling reduces the maintenance workload and is resource efficient. For NLG systems, it can also help maintain consistent styles across domains and APIs.
Because of these merits, we investigate the effect of joint modeling for NLG. The SGD-NLG dataset, which features a variety of domains, offers an excellent testbed for such a study. We focus on the 12 domains that are present in all 3 splits: train, dev and test. Concretely, we train a single model jointly on all 12 domains and compare it with individual models trained on each domain separately. The results are shown in Table 7. We notice consistent improvements in both metrics across all domains. The largest improvement is in the Movies domain, where BLEU improves by 10 points and SER drops from 21.06 to just 0.15. On average, joint modeling improves BLEU by 3 points and reduces SER from 4% to just 0.53%, demonstrating successful knowledge sharing across domains.
7.3 Role of Context
Dialogue acts represent the semantic content of the system response, but they don't contain any information about its lexical and syntactic content. The utterances in the dialogue context are also important for generating good responses because they help the model capture conversational phenomena such as entrainment (lexical and syntactic alignment of responses) and avoid repetition Dušek and Jurcicek (2016a). Context also helps add variation to the responses generated across different conversations for the same system actions.
Table 8 shows the performance of NLG as more utterances from the dialogue context are given as input. In these experiments, we prepend the most recent utterances from the dialogue to the system action representation obtained from the Naive and template-based methods. Both models, Naive and T2G2, benefit from the additional context, showing an improvement of 3-4 BLEU points.
However, we would like to point out that the evaluation is not completely fair, because the context contains ground truth system utterances rather than utterances generated by the NLG model itself. Regardless, the improvements clearly point to the effectiveness of the added context. We hope these results inspire more work in this exciting direction.
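The context-augmented input can be sketched as follows; the function name, argument names, and the plain-space delimiter between turns are our own illustrative choices.

```python
def add_context(action_representation, history, num_utterances):
    """Prepend the last `num_utterances` turns of the dialogue history to the
    serialized system actions, forming the model input. Joining turns with a
    single space is an illustrative choice, not the paper's exact format."""
    context = history[-num_utterances:] if num_utterances > 0 else []
    return " ".join(context + [action_representation])


model_input = add_context(
    "request ( cuisine )",
    ["I want to book a table.", "Sure, which city?", "San Jose please."],
    num_utterances=2,
)
```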
In this work, we propose a template based input representation for task oriented response generation. Coupled with pre-trained language models, this approach enables zero shot generalization to new domains with little effort. Moreover, we show that it can lead to drastic reduction in annotation costs. We also present the first set of results on the multi-domain SGD-NLG dataset, which we hope will pave the way for further research in few-shot, zero-shot and multi-domain language generation.
While in this paper we use standard pre-trained models, designing pre-training tasks tailored to sentence fusion is an interesting line of future work. We also hope to apply T2G2 to languages other than English. Obtaining annotated data in non-English languages is an even bigger challenge, making the sample efficiency of our template rewriting approach especially suited to this setting.
References

- Sentence fusion for multidocument news summarization. Computational Linguistics 31(3), pp. 297-328.
- Hello, it's GPT-2 - how can I help you? Towards the use of pretrained language models for task-oriented dialogue systems. EMNLP-IJCNLP 2019, pp. 15.
- MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 5016-5026.
- Semantically conditioned dialog response generation via hierarchical disentangled self-attention. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3696-3709.
- Few-shot NLG with pre-trained language model. arXiv preprint arXiv:1904.09521.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171-4186.
- A context-aware natural language generator for dialogue systems. In Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 185-190.
- Sequence-to-sequence generation for spoken dialogue via deep syntax trees and strings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 45-51.
- The WebNLG challenge: generating text from RDF data. In Proceedings of the 10th International Conference on Natural Language Generation, pp. 124-133.
- Machine translation pre-training for data-to-text generation - a case study in Czech. arXiv preprint arXiv:2004.02077.
- CTRL: a conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
- Neural text generation from structured data with application to the biography domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1203-1213.
- RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Meta-learning for low-resource natural language generation in task-oriented dialogue systems. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 3151-3157.
- To plan or not to plan? Discourse planning in slot-value informed sequence to sequence models for language generation. Proc. Interspeech 2017, pp. 3339-3343.
- The E2E dataset: new challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pp. 201-206.
- Few-shot natural language generation for task-oriented dialog. arXiv preprint arXiv:2002.12328.
- Language models are unsupervised multitask learners. OpenAI Blog 1(8), pp. 9.
- Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.
- Towards scalable multi-domain conversational agents: the schema-guided dialogue dataset. arXiv preprint arXiv:1909.05855.
- Adversarial domain adaptation for variational neural language generation in dialogue systems. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1205-1217.
- Multi-domain neural network language generation for spoken dialogue systems. arXiv preprint arXiv:1603.01232.
- Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1711-1721.
- A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 438-449.
- XLNet: generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems, pp. 5754-5764.
- Multi-task learning for natural language generation in task-oriented dialogue. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 1261-1266.
Sample utterances generated by the different models for various domains are shown in the examples below. The system actions, their template-based representation used as input by the T2G2 model, and the reference response are also provided. The predictions are from models trained on the full SGD-NLG dataset without any dialogue history context. Unseen domains are marked with an asterisk.