Task-oriented dialog systems provide users with a natural language interface to achieve a goal. Modern dialog systems support complex goals that may span multiple domains. For example, during the dialog the user may ask for a hotel reservation (hotel domain) and also a taxi ride to the hotel (taxi domain), as illustrated in the example of Figure 1. Dialog state tracking (DST) is one of the core components of task-oriented dialog systems. The dialog state can be thought of as the system's belief of the user's goal given the conversation history. For each user turn, the dialog state commonly includes the set of slot-value pairs for all the slots mentioned by the user so far. An example is shown in Figure 1. Accurate DST is critical for task-oriented dialog, as most dialog systems rely on the state to predict the optimal next system action, such as a database query or a natural language generation (NLG) response.
Dialog state tracking requires understanding the semantics of the agent and user utterances so far, a challenging task since a dialog may span multiple domains and may include user or system references to slots mentioned earlier in the dialog. Data scarcity is an additional challenge, because dialog data collection is costly and time consuming [13, 14]. As a result, it is critical to be able to train DST systems for new domains with little or no data.
Previous work formulates DST as a classification task over all possible slot values for each slot, assuming all values are available in advance (e.g., through a pre-defined ontology) [18, 10, 16]. However, DST systems should be able to track the values of even free-form slots, such as restaurant or hotel names, which typically contain out-of-vocabulary words. To overcome the limitations of ontology-based approaches, candidate-set generation based approaches have been proposed. TRADE extends this idea further and proposes a decoder-based approach that uses both generation and a pointing mechanism, taking a weighted sum of a distribution over the vocabulary and a distribution over the words in the conversation history. This enables the model to produce unseen slot values, and it achieves state-of-the-art results on the MultiWOZ public benchmark [3, 8].
We extend TRADE and focus on improving the encoding of dialog context and slot semantics for DST, in order to robustly capture important dependencies between slots and the conversation history, as well as long-range coreferences in the conversation history. For this purpose, we propose a Multi-Attention DST (MA-DST) network. It contains multiple layers of cross-attention between the slot encodings and the conversation history to capture relationships at different levels of granularity, followed by a self-attention layer that helps resolve references to earlier slot mentions in the dialog. We show that the proposed MA-DST leads to an absolute improvement of over 5% in joint goal accuracy over the current state of the art for the MultiWOZ 2.1 dataset in the full-data setting. We also show that MA-DST can be adapted to new domains with no training data in the new domain, achieving up to a 2% absolute joint goal accuracy gain in the zero-shot setting.
2 Related Work
Dialog state tracking (DST) is a core dialog systems problem that is well studied in the literature. Earlier approaches for DST relied on Markov Decision Processes (MDPs) and partially observable MDPs (POMDPs) [28, 23] for estimating the state updates. A review of the DST challenges and earlier related work is available in the literature.
Recent neural state tracking approaches achieve state-of-the-art performance on DST. Some of this work formulates the state tracking problem as a classification task over all possible slot values per slot [18, 27, 16]. This assumes that an ontology containing all slot values per slot is available in advance. In practice, this is a limiting assumption, especially for free-form slots that may contain values not seen during training. To address this limitation, a candidate generation approach based on a bi-GRU network has been proposed, which selects and scores slot values from the conversation history. Pointer networks have also been proposed for extracting slot values from the history. More recently, hybrid approaches that combine the candidate-set and slot-value generation approaches have appeared [11, 31].
Our work is most similar to TRADE, and extends it by proposing self- and cross-attention mechanisms for capturing slot and history correlations. Attention-based architectures like the Transformer, and architectures that extend it, like BERT and RoBERTa, achieve the current state of the art for many NLP tasks. We are also inspired by work in reading comprehension, where cross-attention is used to compute relations between a long passage and a query question [33, 4].
For benchmarking, the DSTC challenges provide a popular experimentation framework and dialog data collected through human-machine interactions. Initially, they focused on single-domain systems such as bus route information. Wizard-of-Oz (WOZ) is also a popular framework used to collect human-human dialogs that reflect the target human-machine behavior [27, 1]. Recently, the MultiWOZ 2.0 dataset, collected through WOZ for multiple domains, was introduced to address the lack of a large multi-domain DST benchmark. An updated version, called MultiWOZ 2.1, was later released; it contains annotation corrections and new benchmark results using current state-of-the-art approaches. Here, we use the MultiWOZ 2.1 dataset as our benchmark.
3 Model Architecture
3.1 Problem Statement
Let C_t = {u_1, a_1, ..., u_t, a_t} denote the conversation history up to turn t, where u_i and a_i represent the user's utterance and the agent's response at the i-th turn. Let S denote the set of all possible slots across all domains. Let D_t denote the dialog state at turn t, which contains all slots and their corresponding values. Slots that are not mentioned in the dialog history take the special value none. DST consists of predicting the values of all slots at each turn t, given the conversation history C_t.
3.2 Model Architecture Overview
Our model encodes both the slot name and the conversation history so far, and then decodes the slot value, outputting words or special symbols for the none and dontcare values. Our proposed model consists of an encoder for the slot name, an encoder for the conversation history, a decoder that generates the slot value, and a three-class "slot gate" classifier that predicts the special symbols, which will be described in detail later on. The model weights are shared between the slots, which makes the model more robust and scalable.
This architecture is similar to TRADE. We propose modifications to the encoders in order to capture more fine-grained dependencies between the slot name and the conversation history. Also, note that the domain and slot names are concatenated into a single slot description, which we refer to as the slot name for simplicity, and encoded via the slot encoder. Figure 2 illustrates the proposed architecture, which we refer to as Multi-Attention DST (MA-DST).
Our proposed slot and conversation history encoders use three stages of attention, specifically low-level cross-attention on the words, higher level cross-attention on the hidden state representations, and self-attention within the dialog history. Below we describe the encoders bottom-up.
Enriched Word Embedding Layer
For both the slot name and the conversation history, we first project each word into a low-dimensional space. We use a 300-dimensional GloVe embedding and a 100-dimensional character embedding, both of which get fine-tuned. For the conversation history, we also add a 5-dimensional POS tag embedding and a 5-dimensional NER tag embedding. We also use the turn index for each word as a feature and initialize it as a 5-dimensional embedding.
To capture the contextual meaning of words, we additionally use contextual ELMo embeddings. We compute 1024-dimensional ELMo embeddings for both inputs by taking a weighted average of the outputs of the different ELMo layers. Instead of fine-tuning the parameters of all the ELMo layers, we just learn these combination weights while training the model. All the word-level embeddings are concatenated to generate an enriched, contextual word-level representation.
Word-Level Cross-Attention Layer
To highlight the words in the conversation history relevant to the slot, we add word-to-word attention from the conversation history to the slot. For computing the attention weights, we use symmetric scaled multiplicative attention with a ReLU non-linearity. The weights are calculated according to equation 2 and used according to equation 3 to obtain the attended vector for each word in the conversation.
The attention scores are computed between the embedding of each word in the conversation and the embedding of each word in the slot name, with a ReLU as the non-linear activation. To get the representation for each word in the conversation history, we concatenate the attended vector with the initial word embedding. For the slot representation, we use the word embedding of each word in the slot name.
Note that symmetric scaled multiplicative attention with a ReLU non-linearity is used in all attention computations of our proposed models, as we empirically found that it gives better performance compared to other attention variants (e.g., multiplicative, scaled multiplicative, additive).
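The exact equations are given in the paper; a minimal numpy sketch of one plausible form of this attention is shown below. The shared projection `W` (which makes the score function symmetric), the scaling by the square root of the dimension, and the concatenation of the attended vector with the original embedding are assumptions consistent with the description above, not the paper's exact parameterization.

```python
import numpy as np

def symmetric_scaled_attention(history, slot, W):
    """Sketch of symmetric scaled multiplicative attention with a ReLU
    non-linearity. history: (J, d) word embeddings of the conversation,
    slot: (K, d) word embeddings of the slot name, W: (d, d) shared
    (hence symmetric) projection. Returns the enriched (J, 2d) history
    representation: each word embedding concatenated with its attended
    vector over the slot words."""
    h = np.maximum(history @ W, 0.0)            # ReLU-projected history words
    s = np.maximum(slot @ W, 0.0)               # ReLU-projected slot words
    scores = h @ s.T / np.sqrt(W.shape[1])      # (J, K) scaled multiplicative scores
    # softmax over slot words for each conversation word
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    attended = weights @ slot                   # (J, d) attended vector per history word
    return np.concatenate([history, attended], axis=1)
```

The same score function can be reused for the higher-level and self-attention layers described below, which is consistent with the note that one attention variant is used throughout.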
First Layer RNN
The computed representations for each word in the slot name and the conversation history are then passed through a Gated Recurrent Unit (GRU) in order to model the temporal interactions between the words and obtain a contextual representation. For both inputs, we use bidirectional GRUs and obtain the hidden contextual representation by averaging the hidden states of the two GRU directions per time step.
The outputs of this first bidirectional GRU layer are the sequences of encoded representations for the conversation history and the slot name, of lengths J and K words respectively.
Higher Level Cross Attention Layer
We add a cross-attention network on top of the base RNN layer to attend over the higher-level representations generated by the previous RNN layer. We use a two-way cross-attention network: one direction from the conversation history to the slot, and the other in the opposite direction. This is inspired by several works in reading comprehension, where cross-attention is used to compute relations between a long passage and a query question [26, 4].
The Slot to Conversation History attention sub-network helps highlight the words in the conversation that are relevant to the slot for which we want to generate the value. Similar to the word-level attention, the attention weights are calculated by equation 7.
We then fuse the attention vector with its corresponding hidden state for each word in the slot name, using an element-wise product in the fusion.
Similarly, the Conversation to Slot attention sub-network computes attention weights to highlight which words in the slot name are most relevant to each word in the conversation history. This enriches each word representation in the conversation history with an attention-based representation, resulting in a new combined representation. All computations are similar to those in the Slot to Conversation History attention, but in the reverse direction.
Second Layer RNN
The enriched slot name and conversation history representations are then passed through a second bidirectional GRU layer. This helps fuse the attended vectors together with the temporal information.
Self Attention Layer
We add a self-attention network on top of the conversation history representation. This layer helps resolve correlations between words across utterances in the conversation history. We introduce this sub-network to address cases where the user refers to slot values that are present in previous utterances, which is a common phenomenon in dialogs, especially multi-domain ones. The self-attention weights are computed with the same symmetric scaled multiplicative attention described above.
The final representation for each word in the conversation is the merged representation of the self-attended vector and the hidden state, merged according to equation 9.
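A hedged numpy sketch of this self-attention layer follows; the shared projection `W`, the scaling, and the linear merge through `W_merge` are illustrative assumptions, since the paper's equations are not reproduced here.

```python
import numpy as np

def self_attention(history, W, W_merge):
    """Sketch of self-attention over the conversation history.
    history: (J, d) hidden states from the second GRU layer,
    W: (d, d) projection for the symmetric scaled attention,
    W_merge: (2d, d) linear merge of hidden state and attended vector.
    Each word attends over all words in the history, so references to
    slot values mentioned in earlier utterances can be picked up."""
    h = np.maximum(history @ W, 0.0)            # ReLU projection
    scores = h @ h.T / np.sqrt(W.shape[1])      # (J, J) pairwise scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    attended = weights @ history                # (J, d) self-attended vectors
    # merge attended vector with the original hidden state
    return np.concatenate([history, attended], axis=1) @ W_merge
```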
Third Layer RNN and Slot Summarization
We use a third-layer RNN to get the final representation for the conversation history.
Since the slot name is much shorter than the conversation history, it can be encoded more compactly. Instead of using an additional RNN, we summarize the slot using a linear transformation that reduces the slot representation into a single vector, whose parameters are learnt during training.
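One common way to realize such a learned summarization is attention pooling with a learned scoring vector; the sketch below is an assumption consistent with, but not necessarily identical to, the paper's linear transformation.

```python
import numpy as np

def summarize_slot(slot_states, w):
    """Hedged sketch of slot summarization: score each slot-name word
    with a learned vector w (shape (d,)), softmax the scores, and take
    the weighted average of the slot states (shape (K, d)) to obtain a
    single (d,) summary vector."""
    scores = slot_states @ w                    # (K,) one score per slot word
    alpha = np.exp(scores - scores.max())       # softmax over slot words
    alpha /= alpha.sum()
    return alpha @ slot_states                  # (d,) summary vector
```

With `w = 0` this degenerates to a plain average of the slot-word states, which is a useful sanity check.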
Finally, we obtain a per-word representation for the conversation history and a summarized slot name representation, both of which are used at the decoding step.
3.4 Decoder and Slot Gate classifier
The decoder network is a GRU that decodes the value for each slot. At each decoding step, which produces one word of the slot value, the network computes two distributions: a distribution over all in-vocabulary words (word generation distribution) and one over all words in the conversation history (word history distribution). This allows the decoder to generate unseen words that appear in the conversation history but are not present in the vocabulary of the training data. This formulation removes the dependency on a predefined ontology that contains all the possible slot values, which is restrictive for free-form slots. Because of the ability to generate unseen slot values, the network is well suited for zero-shot use cases.
We initialize the decoder by combining the last hidden state of the conversation history representation and the summarized slot representation through a learnable linear transformation. At each decoding time-step, the decoder generates a probability distribution over the vocabulary.
The decoder also generates a probability distribution over the words in the conversation history by using a pointer network, i.e., computing attention weights for each word in the conversation history.
To generate the final output distribution, we take a weighted sum of the generation and history distributions, where the mixture weight is the probability of generating a word as opposed to copying it from the history, and is calculated at each decoder time step.
To avoid running the decoder for slots not present in the conversation, we also train a Slot Gate (SG) classifier. This is a 3-way classifier that predicts among the classes generate, none, and dontcare. Only when the classifier predicts generate do we decode the slot value. When the classifier predicts none, we assume that the slot is not present and takes the value none in the state; when it predicts dontcare, we assume the user does not care about the slot value (this appears commonly in dialog and is therefore a special value for DST systems).
The network is trained in a multi-task manner using the standard cross-entropy loss. We combine the losses of the slot value generator (decoder) and the SG classifier with a combination weight that is a hyperparameter optimized empirically.
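The combined objective plausibly takes the following form; this is a sketch consistent with the description above, and the exact form of the paper's equation 18, including the symbol for the combination weight, is an assumption:

```latex
L_{\text{total}} = L_{\text{gen}} + \alpha \, L_{\text{gate}}
```

where L_gen is the decoder's cross-entropy loss over the generated slot-value words, L_gate is the slot gate classifier's cross-entropy loss, and alpha is the empirically tuned weight.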
4 Dataset
We evaluate our approach on MultiWOZ, a multi-domain Wizard-of-Oz dataset. MultiWOZ 2.0 is a recent dataset of labeled human-human written conversations spanning multiple domains and topics. As of now, it is the largest labeled, goal-oriented, human-human conversational dataset, with around 10k dialogs, each with an average of 13.67 turns. The data spans seven domains and 37 slot types. Due to patterns of annotation errors found in MultiWOZ 2.0, the data was re-annotated and released as MultiWOZ 2.1, which corrected a significant number of errors. Table 1 shows the percentage of slots in each domain whose values changed with the MultiWOZ 2.1 re-annotation.
For all our experiments, we use the MultiWOZ 2.1 data, which is shown to be cleaner and more challenging, because many slots are now correctly annotated with their corrected values. We use only five of the available seven domains, namely restaurant, hotel, attraction, taxi, and train, since the other two domains are only present in the training set. We use the provided train/dev/test split for our experiments.
Table 1: Slot values updated in MultiWOZ 2.1.
5 Experiments and Results
In this section we first describe the evaluation metrics and then present the results of our experiments.
5.1 Evaluation Metrics
The following metrics are used to evaluate DST models:
Average Slot Accuracy: The average slot accuracy is defined as the fraction of slots for which the model predicts the correct slot value. For an individual dialog turn t over N tracked slots, it is Acc(t) = (1/N) * sum_j 1[v_j = v'_j], where v_j and v'_j are the ground-truth and predicted values for slot j respectively, N is the total number of slots, and 1[.] is an indicator that is 1 if and only if v_j = v'_j.
Joint Goal Accuracy: The joint goal accuracy is defined as the fraction of dialog turns for which the values of all slots are predicted correctly. If we track N slots in total, the joint goal accuracy for an individual dialog turn t is 1 if and only if the predicted value matches the ground-truth value for every one of the N slots, and 0 otherwise.
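The two metrics above can be sketched directly in Python; slot names and the dict-based representation are illustrative, not the paper's data format.

```python
def average_slot_accuracy(gold, pred):
    """Fraction of slots whose predicted value matches the ground truth
    for one dialog turn. gold and pred map slot name -> value; a slot
    missing from pred counts as incorrect."""
    correct = sum(1 for s in gold if pred.get(s) == gold[s])
    return correct / len(gold)

def joint_goal_accuracy(turns):
    """Fraction of dialog turns for which *all* slot values are correct.
    turns is a list of (gold, pred) dict pairs."""
    exact = sum(1 for gold, pred in turns
                if all(pred.get(s) == gold[s] for s in gold))
    return exact / len(turns)
```

For example, a turn with one of two slots wrong scores 0.5 on average slot accuracy but 0 on joint goal accuracy, which is why joint goal accuracy is the stricter of the two metrics.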
5.2 Experiment Details
We train the encoders to jointly optimize the losses of the slot gate classifier and the slot value generator decoder. The parameters of the model are shared for all (domain, slot) pairs, which makes this model scalable to a large number of domains and slots. We train the model using stochastic gradient descent with the Adam optimizer. We empirically optimized the learning rate for the final model, while keeping the Adam betas and epsilon fixed. We used a batch size of four dialog turns, and for each turn we generate all 30 slot values. We decayed the learning rate at regular intervals (epochs) by an empirically optimized factor. For ELMo, we kept a dropout of 0.5 for the contextual embedding and used regularization for the ELMo weights. We used a dropout of 0.2 for all the other layers. For word embeddings, we used 300-dimensional GloVe embeddings and 100-dimensional character embeddings. For all the GRU and attention layers, the hidden size is kept at 400. The weight for the multi-task loss function in equation 18 was also optimized empirically.
5.3 Results
In this section, we present the results for our model. We measure the quality of the model with joint goal accuracy and average slot accuracy, as described earlier. As our baseline for comparison, we consider the TRADE model, which is the present state of the art for MultiWOZ. For a fair comparison, we report numbers on the corrected MultiWOZ 2.1 dataset for both models.
In Table 2, we present the results for DST on single-domain data. We create the train, dev, and test splits of the data for a particular domain by filtering for dialogs that only contain that domain. As shown in Table 2, MA-DST outperforms TRADE for all five domains, improving the joint goal accuracy by up to 7% absolute as well as the average slot accuracy by up to 5% absolute.
Table 3 shows results for the multi-domain setting, where we combine all available domains during training and evaluation. We compare the accuracy of MA-DST with the TRADE baseline and four additional ablation variants of our model. These four variants capture the contribution of the different sub-networks and layers in MA-DST on top of the base encoder-decoder architecture, which is called "Our Base Model" in Table 3. Our full proposed MA-DST model achieves the highest performance on joint goal accuracy and average slot accuracy, surpassing the current state-of-the-art performance. Each of the additional layers of self- and cross-attention contributes to progressively higher accuracy on both metrics.
|Model|Joint Goal Accuracy|Average Slot Accuracy|
|Baseline (TRADE)|45.6|96.62|
|Our Base Model|44.0|96.15|
|+ Slot Gate + Word-Level Cross-Attention|47.60|97.01|
|+ Higher-Level Cross-Attention|49.56|97.15|
|+ Self-Attention + Slot Summarizer|50.55|97.21|
|+ ELMo (MA-DST)|51.04|97.28|
|Zero Shot Experiment|
In Table 4 we present the zero shot results. For these experiments, the test set contains only dialogs from the target domain while the training set contains only dialogs from the other four domains. As shown in Table 4, MA-DST outperforms TRADE’s state-of-the-art result by up to 2% on the joint goal accuracy metric.
5.4 Error Analysis
|Domain Level Statistics|
In this section we analyze the errors made by the model on the MultiWOZ 2.1 dataset. Table 5 shows the average slot accuracy and F1-score for each domain. In terms of F1-score, the model performs worst on the taxi domain. The average slot accuracy for the taxi domain is nevertheless high, because the vast majority of the taxi domain's slots take the value none (i.e., they are not present in the dialog), which the model easily identifies. Figure 3 shows the per-slot accuracy in the all-domain setting, in descending order of performance. As seen from Figure 3, the MA-DST model tends to make the most errors on open-ended slots such as restaurant-name, attraction-name, hotel-name, and train-leaveat. These slots are difficult to predict because, unlike categorical slots, they can take on a large number of possible values and are more likely to encounter unseen values. On the other end of the spectrum, we have slots like restaurant-bookday, hotel-bookstay, hotel-bookday, and train-day, for which the model achieves the highest average slot accuracy. As expected, most of the top-performing slots are categorical, i.e., they can take only a small number of different values from a pre-defined set.
Figure 4 analyzes the relationship between the depth of the conversation and the accuracy of MA-DST. To calculate this, we first bucket the dialog turns according to their turn index, and calculate the joint goal accuracy and average slot accuracy for each bucket. As shown in Figure 4, both the joint goal accuracy and the average slot accuracy of MA-DST are highest at turn 0 and decrease steadily by turn 10. As expected, the model's performance degrades as the conversation becomes longer. This can be explained by the fact that longer conversations tend to be more complex and can have long-range dependencies. To study the effect of the attention layers, we compare the joint goal accuracy of our base model, which does not have the attention layers, and MA-DST for each turn. As can be seen from Figure 4, MA-DST performs better than our base model for both earlier and later turns by a consistent average margin.
To further analyze the types of errors the model makes, we manually inspected the model's output for 20 randomly selected dialogs. A notable fraction of the errors are due to wrong annotations, i.e., the model predicted the slot value correctly but the target label was wrong. For example, in turn 5 of dialog PMUL3158, the booking time is mislabeled even though the user mentioned 17:45 as the booking time. These kinds of annotation errors are unavoidable. Another common error we observed was the model getting confused among slots of the same type. For example, in turn 3 of dialog PMUL4547, the model populates both the attraction name and the hotel name with "The Junction", as the user did not specify in the utterance whether "The Junction" is an attraction or a hotel. For a similar reason, we also see the model confusing related slots quite a number of times. Another common type of error is generating a slot value that differs from the ground truth by a word or character. For example, for dialog MUL2432, the model generates the time value 15.15 by directly copying it from the user utterance; however, the label is 15:15 according to the ontology. This kind of error could be addressed by fuzzy matching between the ontology and the model's prediction, but that would introduce a dependency on the ontology. We also observed that the model's accuracy for slot values labeled "dontcare" was notably low; there are also many annotation errors for "dontcare" slots in the training set, making these values difficult for the model to learn.
6 Conclusion
We propose a new architecture for dialog state tracking that uses multiple levels of attention to better encode relationships between the conversation history and slot semantics and to resolve long-range cross-domain coreferences. Like TRADE, it does not rely on knowing a complete list of possible values for a slot beforehand, and it both generates values from the vocabulary and copies values from the conversation history. It also shares the same model weights for all (domain, slot) pairs, so it can easily be adapted to new domains and applied in a zero-shot or few-shot setting. We achieve a new state-of-the-art joint goal accuracy of 51% on the updated MultiWOZ 2.1 dataset. In the zero-shot setting, we improve the state of the art by over 2%. In the future, it is worth exploring whether the state can be carried over from the previous turn to predict the state for the current turn, rather than starting from scratch at each turn. Finally, it may be useful to capture dependencies or correlations between slots, rather than independently generating values for each one of them.
References
- (2017) Frames: a corpus for adding memory to goal-oriented dialogue systems. arXiv preprint arXiv:1704.00057.
- (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA.
- (2018) MultiWOZ - a large-scale multi-domain wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5016–5026.
- (2017) Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- (2016) Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 551–561.
- Empirical evaluation of gated recurrent neural networks on sequence modeling.
- (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171–4186.
- (2019) MultiWOZ 2.1: multi-domain dialogue state corrections and state tracking baselines.
- (2018) Neural approaches to conversational AI. ACL and SIGIR tutorial.
- (2019) Dialog state tracking: a neural reading comprehension approach.
- (2019) HyST: a hybrid approach for flexible and accurate dialogue state tracking. arXiv preprint arXiv:1907.00883.
- (2017) FusionNet: fusing via fully-aware attention with application to machine comprehension.
- (2018) Data collection for dialogue system: a startup perspective. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pp. 33–40.
- (2013) Conversations in the crowd: collecting data for task-oriented dialog learning. In First AAAI Conference on Human Computation and Crowdsourcing.
- (2000) A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing 8 (1), pp. 11–23.
- (2017) An end-to-end trainable neural network model with belief tracking for task-oriented dialog. In Proceedings of Interspeech.
- (2019) RoBERTa: a robustly optimized BERT pretraining approach.
- (2016) Neural belief tracker: data-driven dialogue state tracking. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1777–1788.
- (2014) GloVe: global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers).
- (2017) Scalable multi-domain dialogue state tracking. In Proceedings of IEEE ASRU.
- (2017) Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- (2010) Bayesian update of dialogue state: a POMDP framework for spoken dialogue systems. Computer Speech and Language 24 (4), pp. 562–588.
- (2017) Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), pp. 6000–6010.
- (2015) Pointer networks. In Advances in Neural Information Processing Systems 28, pp. 2692–2700.
- (2017) Making neural QA as simple as possible but not simpler. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017).
- (2017) A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1, Long Papers), pp. 438–449.
- (2007) Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language 21 (2), pp. 393–422.
- (2013) The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, pp. 404–413.
- (2016) The dialog state tracking challenge series: a review. Dialogue and Discourse.
- (2019) Transferable multi-domain state generator for task-oriented dialogue systems. In ACL.
- (2018) An end-to-end approach for handling unknown slot values in dialogue state tracking. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, pp. 1448–1457.
- (2018) SDNet: contextualized attention-based deep network for conversational question answering. arXiv.