Task-oriented dialogue systems are designed to assist users in performing daily activities, such as restaurant booking, travel planning, and online shopping. These virtual assistants provide natural language interfaces to services and online APIs Rastogi et al. (2020). Based on users’ needs, these systems frequently require support for new domains. However, the current state-of-the-art systems require a substantial amount of in-domain data to properly model a new domain. The data-collection process is both expensive and time-consuming, and thus it is very important to study methods that can build robust and scalable dialogue systems using little to no in-domain data.
Dialogue state tracking (DST) is an essential component of task-oriented dialogue systems that tracks users’ requirements over multi-turn conversations. A popular formulation of the dialogue state is a list of slot-value pairs. In DST, tracking unseen slots in a new domain, a.k.a. zero-shot domain adaptation, is a significant challenge, since the model has never seen in-domain training samples. There are two main lines of work to tackle this problem. The first proposes domain-transferable models using copy mechanisms or ontology graph information Wu et al. (2019); Zhou and Small (2019). A limitation of such models is that they may not fully leverage pre-trained language models due to their specialized model architecture. The second line of work uses slot descriptions as input to the model to facilitate slot understanding Rastogi et al. (2020). However, the provided slot descriptions are collected from crowdsourced human annotators and might be inconsistent among different domains. In general, the optimal approach for constructing slot descriptions in zero-shot settings remains unexplored.
In this work, we tackle the challenge of zero-shot cross-domain DST by leveraging large-scale pre-trained sequence-to-sequence (seq2seq) models together with an effective encoding of slot descriptions. We first introduce a generative DST model called T5DST, which models the relation between a slot and its dialogue context with a self-attentive encoder, and generates the slot value with a decoder in an autoregressive manner. This simple design allows us to effectively incorporate a pre-trained seq2seq model (e.g., T5 Raffel et al. (2020)) without any task-specific modification. To further enhance the model’s cross-domain transferability, we propose Slot Type Informed Descriptions that capture the shared information of different slots. Experimental results on the MultiWOZ benchmark Budzianowski et al. (2018) suggest that 1) our model achieves significantly higher joint goal accuracy compared to existing results in zero-shot cross-domain DST; 2) models using the proposed slot description formulation substantially outperform those using other slot description variants. Our contributions are summarized as follows:
- We propose a simple yet novel generative DST model based on T5 that significantly improves existing zero-shot cross-domain DST results;
- We investigate the effectiveness of different slot description formulations. To the best of our knowledge, this is the first work that comprehensively studies the effectiveness of slot descriptions in zero-shot cross-domain DST.
2 Related Work
Dialogue State Tracking has been of broad interest to the dialogue research community Williams and Young (2007); Williams et al. (2014); Heck et al. (2020); Liu et al. (2020); Wu et al. (2020); Madotto et al. (2020). Current state-of-the-art models Chen et al. (2020); Lin et al. (2020); Heck et al. (2020); Hosseini-Asl et al. (2020); Ye et al. (2021); Li et al. (2020) trained with extensive annotated data have shown promising performance in complex multi-domain conversations Budzianowski et al. (2018). However, collecting large amounts of data for every domain is costly and inefficient. To address this issue, several methods Wu et al. (2019); Zhou and Small (2019) have been proposed for transferring prior knowledge of existing domains to new ones. On the other hand, Campagna et al. (2020) proposed an abstract dialogue model that leverages the ontology and in-domain templates to generate a large amount of synthesized data for domain adaptation. Different from their method, in this paper we utilize a pre-trained seq2seq model and slot descriptions for cross-domain DST without any in-domain data.
Slot Description has been shown to be a promising technique in cross-domain semantic parsing Bapna et al. (2017); Shah et al. (2019); Namazifar et al. (2020). To encourage this line of research in DST as well, MultiWOZ 2.1 Eric et al. (2019) provides a further annotation for slot descriptions. Rastogi et al. (2020) incorporated slot descriptions to facilitate cross-domain DST, while Gao et al. (2019, 2020) formulated DST as a question answering problem by casting slot names into questions. However, these works did not demonstrate the effectiveness of slot descriptions by comparing the performance of models with and without them, and there is no study on how to construct slot descriptions. In this paper, we aim to fill this research gap by providing an empirical study of different slot description formulations.
| Slot Type | Slot Name |
|---|---|
| Name | attraction-name, restaurant-name, hotel-name |
| Day | hotel-book day, train-day, restaurant-book day |
|Model||Joint Goal Accuracy|
|w/ Slot Value||32.86±0.56||20.03±0.87||16.65±0.37||65.09±0.12||29.66±2.75||32.86±0.48|
|w/ Slot Type||33.09±1.60||21.21±0.61||21.65±1.07||64.62±0.55||35.42±1.42||35.20±0.59|
Table 2: Zero-shot cross-domain results on MultiWOZ 2.0. We run each experiment three times with different random seeds and report the mean and standard deviation. Note that the reported averaged zero-shot joint goal accuracy is not comparable to multi-domain joint goal accuracy. *Result from Campagna et al. (2020).
The design of our model follows the basis of generative question answering models. As illustrated in Figure 1, given a dialogue history which consists of an alternating set of utterances from two speakers, denoted as $C_t = \{U_1, R_1, \ldots, R_{t-1}, U_t\}$, we add the "user:" and "system:" prefixes to the user and system utterances respectively. Then all the utterances and the slot name $s_i$ are concatenated into a single sequence, i.e., user: $U_1$ ... system: $R_{t-1}$ user: $U_t$ [sep] $s_i$. The sequence is used as the input to the encoder, and the decoder generates the corresponding slot value $v_i$:

$$v_i = \mathrm{Seq2seq}(C_t, s_i).$$

The learning objective of this generation process is minimizing the negative log-likelihood of $v_i$ given $C_t$ and $s_i$, that is,

$$\mathcal{L} = -\sum_{i=1}^{n} \log p(v_i \mid C_t, s_i),$$

where $n$ is the number of slots to be tracked.
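Because the decoder generates the slot value autoregressively, the sequence-level likelihood factorizes over the tokens of the value. As a sketch, writing the dialogue history as $C_t$, the $i$-th slot as $s_i$, and its value as a token sequence $v_i = (v_i^{1}, \ldots, v_i^{m_i})$ (the token-level symbols $v_i^{k}$ and $m_i$ are notation assumed here for illustration), the objective expands to:

```latex
\mathcal{L}
  = -\sum_{i=1}^{n} \log p(v_i \mid C_t, s_i)
  = -\sum_{i=1}^{n} \sum_{k=1}^{m_i} \log p\left(v_i^{k} \mid v_i^{<k}, C_t, s_i\right)
```

Under this reading, the loss is the usual token-level cross-entropy of a seq2seq model trained with teacher forcing, summed over all tracked slots.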
We initialize the model parameters with T5 Raffel et al. (2020), an encoder-decoder Transformer with relative position embeddings Shaw et al. (2018) pre-trained on a massive amount of English text. We denote our model as T5DST. To incorporate slot descriptions into T5DST, we replace the slot name with its corresponding slot description as the model input.
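The input serialization described above can be illustrated with a short Python sketch. It is not taken from the released code: the function name `build_encoder_input`, the `(speaker, utterance)` tuple layout, and the example dialogue are hypothetical, and only the "user:" / "system:" prefixes and the `[sep]` separator come from the text.

```python
from typing import List, Tuple


def build_encoder_input(
    turns: List[Tuple[str, str]], slot_text: str, sep: str = "[sep]"
) -> str:
    """Serialize a dialogue history and one slot into a single input string.

    turns: (speaker, utterance) pairs in chronological order, where speaker
        is "user" or "system" (hypothetical layout chosen for this sketch).
    slot_text: the slot name, or the slot description when descriptions
        replace slot names.
    """
    parts = []
    for speaker, utterance in turns:
        if speaker not in ("user", "system"):
            raise ValueError(f"unknown speaker: {speaker!r}")
        # Prefix every utterance with its speaker tag.
        parts.append(f"{speaker}: {utterance.strip()}")
    # Append the separator and the slot text after the full history.
    return " ".join(parts + [sep, slot_text])


history = [
    ("user", "i need a cheap hotel in the north"),
    ("system", "how many nights will you stay?"),
    ("user", "three nights starting friday"),
]
encoder_input = build_encoder_input(history, "hotel-book stay")
```

One such sequence is built per slot, so a turn with $n$ tracked slots yields $n$ encoder inputs that share the same history. Swapping the slot name for a description only changes the `slot_text` argument.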
3.2 Slot Type Informed Descriptions
Although different slots may have distinct names, they can share the same slot type. As shown in Table 1, hotel-stars and restaurant-book people are both number slots, while hotel-internet and hotel-parking are both boolean slots. In light of these observations, we hypothesize that adding slot type information to the slot description facilitates knowledge transfer among different slots. We construct a template for each slot type that follows "[slot type] of [slot] of the [domain]". We denote such a slot description as Slot Type. More details are available in Appendix A.1.
|T5DST w/ Slot Type||58.77||65.72||69.54||43.07||50.71||54.86||57.63||61.86||63.47||70.12||73.67||74.70||70.82||74.18||77.57|
4.1 Dataset and Evaluation
We evaluate the proposed method on the MultiWOZ 2.0 dataset Budzianowski et al. (2018), which has 7 domains. We use the pre-processing and evaluation setup from Wu et al. (2019), where restaurant, train, attraction, hotel, and taxi domains are used for training, as the test set only contains these 5 domains.
In the zero-shot cross-domain experiments, the models are first trained on four domains and then evaluated on the test set of the unseen domain. Joint goal accuracy is used to evaluate the performance of the models. The generated dialogue state is considered correct if and only if all of the predicted values exactly match the oracle values.
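The exact-match criterion can be made concrete with a minimal Python sketch. It assumes each dialogue state is represented as a dictionary from slot name to value, one per turn; this representation and the function name `joint_goal_accuracy` are illustrative choices, not the evaluation script used in the experiments.

```python
from typing import Dict, List


def joint_goal_accuracy(
    predictions: List[Dict[str, str]], references: List[Dict[str, str]]
) -> float:
    """Fraction of turns whose predicted state exactly matches the oracle state."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # A turn counts as correct only if every slot-value pair matches.
    correct = sum(1 for pred, ref in zip(predictions, references) if pred == ref)
    return correct / len(references)


oracle = [
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4", "hotel-parking": "yes"},
]
predicted = [
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "3", "hotel-parking": "yes"},
]
score = joint_goal_accuracy(predicted, oracle)
```

In this toy example the second turn has a single wrong value (hotel-stars), so the whole turn is counted as incorrect even though two of its three slots are right. This is why one hard slot can dominate the metric.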
We implement T5DST (source code is available at https://github.com/facebookresearch/Zero-Shot-DST) based on the T5-small model (60M parameters), which has 6 encoder-decoder layers and a hidden size of 512. All models are trained using an AdamW Loshchilov and Hutter (2018) optimizer with the initial learning rate of
. In all cross-domain zero-shot experiments, we train the models with batch size 128 for 5 epochs. For the few-shot experiments, the models are first trained on 4 domains for 5 epochs and then fine-tuned with 1%, 5%, and 10% of the target domain data for 10 epochs. For full-shot training, we train our model for at most 10 epochs with batch size 64 and apply early stopping based on the validation loss. Other hyper-parameters are the same as in the zero-shot cross-domain setting. We use 8 NVIDIA V100 GPUs for all of our experiments. We use greedy decoding at test time.
TRADE: Transferable dialogue state generator Wu et al. (2019), which utilizes a copy mechanism to facilitate domain knowledge transfer.
4.3.2 Slot Description Variants
Naive: A simple transformation of the slot name from "domain-slot" to "[slot] of the [domain]".
Slot Value: Following recent works Zhang et al. (2019); Rastogi et al. (2020), slots are divided into categorical and non-categorical slots. For categorical slots, we incorporate the candidate values into the slot description, i.e., "[slot] of the [domain] is [value-1] or [value-2]?". The order of values is random. For non-categorical slots, the descriptions are the same as in the aforementioned Naive variant.
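The two template-based variants can be sketched in Python as follows. The function names are hypothetical, and joining more than two candidate values with "or" is an assumption of this sketch, since the template in the text only shows two values.

```python
import random
from typing import Optional, Sequence


def naive_description(slot_name: str) -> str:
    """Turn "domain-slot" into "[slot] of the [domain]"."""
    domain, slot = slot_name.split("-", 1)
    return f"{slot} of the {domain}"


def slot_value_description(
    slot_name: str,
    candidates: Optional[Sequence[str]] = None,
    rng: Optional[random.Random] = None,
) -> str:
    """Append candidate values for categorical slots; fall back to the naive form."""
    base = naive_description(slot_name)
    if not candidates:
        # Non-categorical slot: identical to the naive description.
        return base
    values = list(candidates)
    # The order of values is random; pass a seeded rng for reproducibility.
    (rng or random).shuffle(values)
    return f"{base} is {' or '.join(values)}?"


naive = naive_description("hotel-book stay")
non_categorical = slot_value_description("restaurant-food")
categorical = slot_value_description("hotel-parking", ["yes", "no"], random.Random(0))
```

Here `naive` is "book stay of the hotel", `non_categorical` is "food of the restaurant", and `categorical` is "parking of the hotel is yes or no?" up to the random order of the two values.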
4.4 Results & Discussion
4.4.1 Zero-Shot Cross-Domain
The results of the zero-shot cross domain experiments are shown in Table 2. Overall, T5DST achieves significantly higher performance in terms of averaged joint goal accuracy compared to the three baseline models TRADE, SUMBT, and SimpleTOD++. These results demonstrate that our model can effectively capture the slot-context relation, and thus generalize better in unseen domains.
Replacing slot names with human-annotated slot descriptions does not improve the zero-shot performance. This might be because the human descriptions are diverse and inconsistent among different domains. For example, the human descriptions of attraction-area and restaurant-area are "area to search for attractions" and "area or place of the restaurant" respectively. Such inconsistent descriptions make slot understanding more challenging in the zero-shot learning setting. The model using naive slot descriptions gives similar performance to the one that uses the original slot names. The two approaches lead to similar semantic representations of the slots. In contrast, incorporating slot values hurts the learning, leading to a lower joint goal accuracy in the restaurant domain. We observe that even though adding value candidates improves some of the categorical slots (e.g., restaurant-area 68.35% → 82.25% slot accuracy), it hurts the unseen non-categorical slots (e.g., restaurant-food 40.63% → 26.10% slot accuracy). These non-categorical slots are usually the bottlenecks of joint goal accuracy. Finally, models trained with question-style descriptions improve the performance in some domains but fail in others.
Our proposed slot type informed descriptions consistently improve the zero-shot performance of T5DST in all the domains. They yield an average improvement of 2% in joint goal accuracy compared to the human-labeled and naive description formulations. This result indicates that slot type information may better capture the shared properties (e.g., time, location) among different slots, thus facilitating domain knowledge transfer for DST.
Figures 3 and 4 show the slot accuracy of models using the Naive and Slot Type descriptions. Compared to the naive description, we observe significant gains on time slots (e.g., arrive by and leave at), location slots (e.g., departure and destination), and number slots (e.g., book stay and book people) when slot type information is added. We conjecture that explicit information about the target value (i.e., slot type) is important in the low-resource condition, when the model does not have enough data to capture the semantic meaning of a new slot.
4.4.2 Few-Shot Cross-Domain
We further conduct experiments in few-shot cross-domain settings, as in Wu et al. (2019); Zhou and Small (2019), where the models are first trained on 4 domains and then fine-tuned with 1%, 5%, and 10% of the target domain data. As shown in Table 3, our model outperforms the DSTQA model in 4 out of 5 domains. Moreover, our approach is more practical in a real-world learning scenario, as it does not require the supervision of a full ontology graph. We also conduct full-shot experiments and compare our model with previous methods. The results are reported in Appendix A.2.
In this paper, we propose leveraging large scale pre-trained models with an effective slot description formulation to tackle the zero-shot cross-domain DST challenge. Specifically, we propose T5DST, a novel generative DST model based on the T5 language model, and incorporate Slot Type Informed Descriptions to facilitate cross-domain knowledge transfer. In the evaluation on the MultiWOZ dataset, our approach substantially improves existing results in both the zero-shot and few-shot settings.
References
- Towards zero-shot frame semantic parsing for domain scaling. arXiv preprint arXiv:1707.02363.
- MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 5016–5026.
- Zero-shot transfer learning with synthesized data for multi-domain dialogue state tracking. arXiv preprint arXiv:2005.00891.
- Schema-guided multi-domain dialogue state tracking with graph attention neural networks. In AAAI 2020.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- MultiWOZ 2.1: multi-domain dialogue state corrections and state tracking baselines. arXiv preprint arXiv:1907.01669.
- From machine reading comprehension to dialogue state tracking: bridging the gap. arXiv preprint arXiv:2004.05827.
- Dialog state tracking: a neural reading comprehension approach. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pp. 264–273.
- TripPy: a triple copy strategy for value independent neural dialog state tracking. arXiv preprint arXiv:2005.02877.
- A simple language model for task-oriented dialogue. arXiv preprint arXiv:2005.00796.
- Efficient dialogue state tracking by selectively overwriting memory. arXiv preprint arXiv:1911.03906.
- SUMBT: slot-utterance matching for universal and scalable belief tracking. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5478–5483.
- CoCo: controllable counterfactuals for evaluating dialogue state trackers. arXiv preprint arXiv:2010.12850.
- MinTL: minimalist transfer learning for task-oriented dialogue systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3391–3405.
- Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 8433–8440.
- Decoupled weight decay regularization. In International Conference on Learning Representations.
- Language models as few-shot learner for task-oriented dialogue systems. arXiv preprint arXiv:2008.06239.
- Language model is all you need: natural language understanding as question answering. arXiv preprint arXiv:2011.03023.
- Language models are unsupervised multitask learners. OpenAI blog 1 (8), pp. 9.
- Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21 (140), pp. 1–67.
- Towards scalable multi-domain conversational agents: the schema-guided dialogue dataset. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 8689–8696.
- Robust zero-shot cross-domain slot filling with example values. arXiv preprint arXiv:1906.06870.
- Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 464–468.
- The dialog state tracking challenge series. AI Magazine 35 (4), pp. 121–124.
- Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language 21 (2), pp. 393–422.
- ToD-BERT: pre-trained natural language understanding for task-oriented dialogues. arXiv preprint arXiv:2004.06871.
- Transferable multi-domain state generator for task-oriented dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 808–819.
- Slot self-attentive dialogue state tracking. arXiv preprint arXiv:2101.09374.
- MultiWOZ 2.2: a dialogue dataset with additional annotation corrections and state tracking baselines. arXiv preprint arXiv:2007.12720.
- Find or classify? Dual strategy for slot-value predictions on multi-domain dialog state tracking. arXiv preprint arXiv:1910.03544.
- Multi-domain dialogue state tracking as dynamic knowledge graph enhanced question answering. arXiv preprint arXiv:1911.06192.
Appendix A Appendices
A.1 Slot Type Informed Description Construction
As shown in Table 4, each slot type has one prefix that is prepended to the description. We use three different templates to construct the slot description. For all the booking slots (e.g., book people), we use "[prefix] [slot] for the [domain] booking". For boolean slots, we use "[prefix] [slot] in the [domain]". For all the others, we use "[prefix] [slot] of the [domain]". When a slot name (e.g., train-day) overlaps with the slot type (e.g., day), or a slot does not fall into any slot type category (others), we simply set the prefix to an empty string.
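The three templates can be combined into one small function. The following Python sketch is illustrative only: the names `PREFIXES` and `slot_type_description` are hypothetical, the slot-to-type assignment is supplied by the caller, and dropping the word "book" from booking slots is inferred from the example "number of people for the hotel booking".

```python
# Prefix per slot type; types without an entry get an empty prefix.
PREFIXES = {
    "number": "number of",
    "location": "location of",
    "time": "time of",
    "boolean": "whether have",
}


def slot_type_description(slot_name: str, slot_type: str = "others") -> str:
    """Build a slot type informed description from "domain-slot" and its type."""
    domain, slot = slot_name.split("-", 1)
    prefix = PREFIXES.get(slot_type, "")
    if slot.startswith("book "):
        # Booking slots: "[prefix] [slot] for the [domain] booking".
        body = f"{slot[len('book '):]} for the {domain} booking"
    elif slot_type == "boolean":
        # Boolean slots: "[prefix] [slot] in the [domain]".
        body = f"{slot} in the {domain}"
    else:
        # All other slots: "[prefix] [slot] of the [domain]".
        body = f"{slot} of the {domain}"
    # strip() removes the leading space left by an empty prefix.
    return f"{prefix} {body}".strip()


examples = {
    ("hotel-book people", "number"): slot_type_description("hotel-book people", "number"),
    ("train-destination", "location"): slot_type_description("train-destination", "location"),
    ("train-arrive by", "time"): slot_type_description("train-arrive by", "time"),
    ("hotel-parking", "boolean"): slot_type_description("hotel-parking", "boolean"),
    ("hotel-book day", "day"): slot_type_description("hotel-book day", "day"),
    ("hotel-type", "others"): slot_type_description("hotel-type", "others"),
}
```

With these inputs the function reproduces the corresponding example descriptions in Table 4, such as "time of arrive by of the train" and "day for the hotel booking". Keeping the type assignment outside the function means a new domain only needs a slot-to-type mapping rather than new templates.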
A.2 Full Shot Results
To understand the full-shot performance of our T5DST model, and whether slot descriptions are still helpful when there is enough training data, we also conduct experiments in a full-data setting. As shown in Table 5, using slot descriptions improves the joint goal accuracy by only 0.56% on MultiWOZ 2.0 and 0.30% on MultiWOZ 2.1, which indicates that the description is less effective when there is a large amount of training data.
Compared to prior models with zero-shot capability, T5DST shows promising performance. Compared to other state-of-the-art models that are optimized for full-shot training, our model achieves competitive results on MultiWOZ 2.0 but inferior results on MultiWOZ 2.1. We notice that there are many training strategies (e.g., token masking Kim et al. (2019); Heck et al. (2020)), additional supervision (e.g., full ontology Chen et al. (2020)), and label cleaning strategies Heck et al. (2020) that may impact the final full-shot result. We also expect higher performance with a larger T5 model, such as T5-base or T5-large. However, achieving SOTA in full-scale training is out of the scope of this work.
| Slot Type | Slot Name | Prefix | Examples |
|---|---|---|---|
| Number | | number of | number of people for the hotel booking |
| Location | | location of | location of destination of the train |
| Time | | time of | time of arrive by of the train |
| Boolean | hotel-parking, hotel-internet | whether have | whether have parking in the hotel |
| Name | attraction-name, restaurant-name, hotel-name | - | name of attraction |
| Day | hotel-book day, train-day, restaurant-book day | - | day for the hotel booking |
| Others | | - | type of the hotel |
| Models | #Parameter | Zero-shot Inference | Joint Goal Accuracy (MWoz 2.0) | Joint Goal Accuracy (MWoz 2.1) |
|---|---|---|---|---|
| TRADE Wu et al. (2019) | - | ✓ | 48.62 | 45.6 |
| STARC Gao et al. (2020) | 110M | ✓ | - | 49.48 |
| SUMBT Lee et al. (2019) | | ✓ | 49.06 | - |
| SGD-baseline Rastogi et al. (2020) | 110M | ✓ | - | 43.4 |
| T5DST + Slot Type | 60M | ✓ | 53.42 | 52.21 |
| DSTQA w/o span Zhou and Small (2019) | - | ✗ | 51.44 | 51.17 |
| MinTL (BART) Lin et al. (2020) | 400M | ✗ | 52.10 | 53.67 |
| SOM-DST Kim et al. (2019) | 340M | ✗ | 52.32 | 53.68 |
| SST Chen et al. (2020) | 110M | ✗ | 51.17 | 55.23 |
| TripPy Heck et al. (2020) | 110M | ✗ | - | 55.29 |
| SimpleTOD Hosseini-Asl et al. (2020) | 110M | ✗ | - | 55.76 |