|NBT-CNN Mrksic et al. (2017)|
|MD-DST Rastogi et al. (2017)|
|GLAD Zhong et al. (2018)|
|StateNet_PSI Ren et al. (2018)|
|TRADE Wu et al. (2019)|
|HyST Goel et al. (2019)|
|DSTRead Gao et al. (2019)|
A Dialogue State Tracker (DST) is a core component of a modular task-oriented dialogue system Young et al. (2013). For each dialogue turn, a DST module takes a user utterance and the dialogue history as input, and outputs a belief estimate of the dialogue state. Then a machine action is decided based on the dialogue state according to a dialogue policy module, after which a machine response is generated.
Traditionally, a dialogue state consists of a set of requests and joint goals, both of which are represented by a set of slot-value pairs (e.g. (request, phone), (area, north), (food, Japanese)) Henderson et al. (2014). In a recently proposed multi-domain dialogue state tracking dataset, MultiWoZ Budzianowski et al. (2018), a representation of the dialogue state that consists of a hierarchical structure of domain, slot, and value is proposed. This is a more practical scenario since a single dialogue often involves multiple domains.
Many recently proposed DSTs Zhong et al. (2018); Ramadan et al. (2018) are based on pre-defined ontology lists that specify all possible slot values in advance. To generate a distribution over the candidate set, previous works often take each of the slot-value pairs as input for scoring. However, in real-world scenarios, it is often not practical to enumerate all possible slot-value pairs and perform scoring over a large, dynamically changing knowledge base Xu and Hu (2018). To tackle this problem, a popular direction is to build a fixed-length candidate set that is dynamically updated as the dialogue develops. Table 1 briefly summarizes the inference time complexity of multiple state-of-the-art DST models following this direction. Since the inference time complexity of all of the previous models is at least proportional to the number of slots, these models will struggle to scale to multi-domain datasets with much larger numbers of pre-defined slots.
In this work, we formulate the dialogue state tracking task as a sequence generation problem, instead of formulating the task as a pair-wise prediction problem as in existing work. We propose the COnditional MEmory Relation Network (COMER), a scalable and accurate dialogue state tracker that has a constant inference time complexity. The code is released at https://github.com/renll/ComerNet.
Specifically, our model consists of an encoder-decoder network with a hierarchically stacked decoder to first generate the slot sequences in the belief state and then for each slot generate the corresponding value sequences. The parameters are shared among all of our decoders for the scalability of the depth of the hierarchical structure of the belief states. COMER applies BERT contextualized word embeddings Devlin et al. (2018) and BPE Sennrich et al. (2016) for sequence encoding to ensure the uniqueness of the representations of the unseen words. The word embeddings for sequence generation are initialized and fixed with the static word embeddings generated from BERT to have the potential of generating unseen words.
Figure 1 shows a multi-domain dialogue in which the user wants the system to first help book a train and then reserve a hotel. For each turn, the DST will need to track the slot-value pairs (e.g. (arrive by, 20:45)) representing the user goals as well as the domain that the slot-value pairs belong to (e.g. train, hotel). Instead of representing the belief state via a hierarchical structure, one can also combine the domain and slot together to form a combined slot-value pair (e.g. (train; arrive by, 20:45) where the combined slot is “train; arrive by”), which ignores the subordination relationship between the domain and the slots.
A typical fallacy in dialogue state tracking datasets is the assumption that a slot in a belief state can only be mapped to a single value in a dialogue turn. We call this the single value assumption. Figure 2 shows an example of this fallacy from the WoZ2.0 dataset: based on the belief state label (food, seafood), it will be impossible for the downstream module in the dialogue system to generate sample responses that return information about Chinese restaurants. A correct representation of the belief state could be (food, seafood > chinese). This would tell the system to first search the database for information about seafood and then Chinese restaurants. The logical operator “>” indicates which retrieved information should have a higher priority to be returned to the user. Thus we are interested in building DST modules capable of generating structured sequences, since this kind of sequence representation of the value is critical for accurately capturing the belief states of a dialogue.
3 Hierarchical Sequence Generation for DST
Given a dialogue which consists of turns of user utterances and system actions, our target is to predict the state at each turn. Different from previous methods which formulate multi-label state prediction as a collection of binary prediction problems, COMER adapts the task into a sequence generation problem via a Seq2Seq framework.
As shown in Figure 3, COMER consists of three encoders and three hierarchically stacked decoders. We propose a novel Conditional Memory Relation Decoder (CMRD) for sequence decoding. Each encoder includes an embedding layer and a BiLSTM. The encoders take in the user utterance, the previous system actions, and the previous belief states at the current turn, and encode them into the embedding space. The user encoder and the system encoder use the fixed BERT model as the embedding layer.
Since the slot-value pairs are unordered set elements of a domain in the belief states, we first order the sequence of domains according to their frequencies in the training set Yang et al. (2018), and then order the slot-value pairs in each domain according to the slots' frequencies within that domain. After sorting the state elements, we represent the belief states following the paradigm: (Domain1- Slot1, Value1; Slot2, Value2; … Domain2- Slot1, Value1; …) for a more concise representation compared with the nested tuple representation.
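The ordering and flattening step can be illustrated with a minimal sketch. The function name `serialize_belief_state`, the frequency tables, the choice of descending frequency order, and the use of "; " between domains are all assumptions made for illustration; they are not taken from the released code.

```python
from collections import Counter


def serialize_belief_state(state, domain_freq, slot_freq):
    """Flatten {domain: {slot: value}} into the 'Domain- Slot, Value; ...' paradigm.

    Assumption: domains and slots are sorted by descending training-set
    frequency, with the name as a tie-breaker.
    """
    domains = sorted(state, key=lambda d: (-domain_freq[d], d))
    parts = []
    for domain in domains:
        slots = sorted(state[domain], key=lambda s: (-slot_freq[(domain, s)], s))
        pairs = "; ".join(f"{slot}, {state[domain][slot]}" for slot in slots)
        parts.append(f"{domain}- {pairs}")
    return "(" + "; ".join(parts) + ")"


# Toy frequencies (hypothetical counts, not dataset statistics).
domain_freq = Counter({"train": 10, "hotel": 7})
slot_freq = Counter({("train", "day"): 6, ("train", "arrive by"): 4, ("hotel", "area"): 5})

state = {
    "hotel": {"area": "north"},
    "train": {"arrive by": "20:45", "day": "monday"},
}
serialized = serialize_belief_state(state, domain_freq, slot_freq)
print(serialized)  # (train- day, monday; arrive by, 20:45; hotel- area, north)
```

Fixing one canonical order is what turns an unordered set of slot-value pairs into a single target sequence that a decoder can be trained on.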
All the CMRDs take the same representations from the system encoder, user encoder and the belief encoder as part of the input. In the procedure of hierarchical sequence generation, the first CMRD takes a zero vector for its condition input $\mathbf{c}$, and generates a sequence of the domains, $D$, as well as the hidden representations of the domains, $H_D$. For each domain $d$ in $D$, the second CMRD then takes the corresponding hidden representation $\mathbf{h}_d$ as the condition input and generates the slot sequence $S_d$ and its representations, $H_{S,d}$. Then for each slot $s$ in $S_d$, the third CMRD generates the value sequence $V_{d,s}$ based on the corresponding $\mathbf{h}_{s,d}$. We update the belief state with the new $(d, (s, V_{d,s}))$ pairs and perform the procedure iteratively until the dialogue is completed. All the CMR decoders share all of their parameters.
Since our model generates domains and slots instead of taking pre-defined slots as inputs, and the number of domains and slots generated each turn is only related to the complexity of the contents covered in a specific dialogue, the inference time complexity of COMER is $O(1)$ with respect to the number of pre-defined slots and values.
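The control flow of the three stacked decoding passes can be sketched as follows. This is only an illustration under stated assumptions: `decode` is a hypothetical stand-in for the shared decoder (here a lookup table), and conditions are plain strings rather than hidden vectors.

```python
def generate_state(decode, zero_condition):
    """Run the domain -> slot -> value passes with one shared decoder.

    `decode(condition)` is a hypothetical callable returning a list of
    (token, hidden) pairs; the hidden item becomes the next condition.
    """
    state = {}
    for domain, h_domain in decode(zero_condition):      # pass 1: domains
        state[domain] = {}
        for slot, h_slot in decode(h_domain):            # pass 2: slots of this domain
            tokens = [tok for tok, _ in decode(h_slot)]  # pass 3: value tokens
            state[domain][slot] = " ".join(tokens)
    return state


# Toy decoder: a lookup table standing in for a trained model.
TOY_OUTPUTS = {
    "ZERO": [("train", "h:train")],
    "h:train": [("day", "h:train/day"), ("arrive by", "h:train/arrive by")],
    "h:train/day": [("monday", None)],
    "h:train/arrive by": [("20:45", None)],
}
calls = []


def toy_decode(condition):
    calls.append(condition)
    return TOY_OUTPUTS[condition]


state = generate_state(toy_decode, "ZERO")
print(state)       # {'train': {'day': 'monday', 'arrive by': '20:45'}}
print(len(calls))  # 4 decoder calls: 1 for domains, 1 for slots, 2 for values
```

The number of decoder calls depends only on how many domains and slots are generated for the turn, not on the size of the ontology, which is the property the complexity argument relies on.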
3.1 Encoding Module
Let $X$ represent a user utterance or system transcript consisting of a sequence of words $\{w_1, \ldots, w_T\}$. The encoder first passes the sequence into a pre-trained BERT model and obtains its contextual embeddings $E_X$. Specifically, we leverage the output of all layers of BERT and take the average to obtain the contextual embeddings.
For each domain/slot that appears in the training set, if it has more than one word, such as ‘price range’, ‘leave at’, etc., we feed it into BERT and take the average of the word vectors to form the extra slot embedding $E_s$. In this way, we map each domain/slot to a fixed embedding, which allows us to generate a domain/slot as a whole instead of a token at each time step of domain/slot sequence decoding. We also construct a static vocabulary embedding $E_v$ by feeding each token in the BERT vocabulary into BERT. The final static word embedding $E$ is the concatenation of $E_v$ and $E_s$.
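The averaging step for multi-word domain/slot names can be sketched as below. The `word_vectors` table is a hypothetical stand-in for the per-word vectors that BERT would produce, and the two-dimensional numbers are toy values.

```python
def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def slot_embedding(name, word_vectors):
    """Map a (possibly multi-word) domain/slot name to one fixed vector."""
    return mean_vector([word_vectors[word] for word in name.split()])


# Toy word vectors (hypothetical values, not BERT outputs).
word_vectors = {"price": [1.0, 3.0], "range": [3.0, 5.0], "area": [0.5, 0.5]}

print(slot_embedding("price range", word_vectors))  # [2.0, 4.0]
print(slot_embedding("area", word_vectors))         # [0.5, 0.5]
```

Because each name collapses to one vector, a slot such as ‘price range’ can be emitted in a single decoding step.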
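After we obtain the contextual embeddings for the user utterance and the system action, and the static embeddings for the previous belief state, we feed each of them into a Bidirectional LSTM Hochreiter and Schmidhuber (1997),
$$\mathbf{h}_{k,t} = \mathrm{BiLSTM}(\mathbf{e}_{k,t}, \mathbf{h}_{k,t-1}), \quad k \in \{a, u, B\},$$
where $\mathbf{e}_{k,t}$ is the embedding of the $t$-th token of the system action ($a$), the user utterance ($u$), or the previous belief state ($B$), and $\mathbf{h}_{k,0}$ is the zero-initialized hidden state for the BiLSTM. The hidden size of the BiLSTM is $d_m/2$. We concatenate the forward and the backward hidden representations of each token from the BiLSTM to obtain the token representation $\mathbf{h}_{k,t} \in \mathbb{R}^{d_m}$ at each time step $t$. The hidden states of all time steps are concatenated to obtain the final representation $H_k$. The parameters are shared between all of the BiLSTMs.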
3.2 Conditional Memory Relation Decoder
Inspired by Residual Dense Networks Zhang et al. (2018), End-to-End Memory Networks Sukhbaatar et al. (2015) and Relation Networks Santoro et al. (2017), we here propose the Conditional Memory Relation Decoder (CMRD). Given a token embedding, $\mathbf{e}_x$, CMRD outputs the next token, $s$, and the hidden representation, $\mathbf{h}_s$, with hierarchical memory access to the different encoded information sources, $H_B$, $H_a$, $H_u$, and relation reasoning under a given condition $\mathbf{c}$,
$$s, \mathbf{h}_s = \mathrm{CMRD}(\mathbf{e}_x, \mathbf{c}, H_B, H_a, H_u).$$
The final output matrices $S$ and $H_s \in \mathbb{R}^{d_m \times l_s}$ are concatenations of all generated $s$ and $\mathbf{h}_s$ (respectively) along the sequence length dimension, where $d_m$ is the model size, and $l_s$ is the generated sequence length. The general structure of the CMR decoder is shown in Figure 4. Note that the CMR decoder can support additional memory sources by adding the residual connection and the attention block, but here we only show the structure with three sources: belief state representation ($H_B$), system transcript representation ($H_a$), and user utterance representation ($H_u$), corresponding to a dialogue state tracking scenario. Since we share the parameters between all of the decoders, CMRD is actually a 2-dimensional auto-regressive model with respect to both the condition generation and the sequence generation task.
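At each time step $t$, the CMR decoder first embeds the token $x_t$ with a fixed token embedding $E \in \mathbb{R}^{d_e \times d_v}$, where $d_e$ is the embedding size and $d_v$ is the vocabulary size. The initial token $x_0$ is “[CLS]”. The embedded vector $\mathbf{e}_{x_t}$ is then encoded with an LSTM, which emits a hidden representation $\mathbf{h}_0 \in \mathbb{R}^{d_m}$,
$$\mathbf{h}_0 = \mathrm{LSTM}(\mathbf{e}_{x_t}, \mathbf{q}_{t-1}),$$
where $\mathbf{q}_t$ is the hidden state of the LSTM. $\mathbf{q}_0$ is initialized with an average of the hidden states of the belief encoder, the system encoder and the user encoder, which produce $H_B$, $H_a$, $H_u$, respectively.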
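$\mathbf{h}_0$ is then summed (element-wise) with the condition representation $\mathbf{c} \in \mathbb{R}^{d_m}$ to produce $\mathbf{h}_1$, which is (1) fed into the attention module; (2) used for residual connection; and (3) concatenated with the other $\mathbf{h}_i$ ($i > 1$) to produce the concatenated working memory, $\mathbf{r}$, for relation reasoning,
$$\begin{aligned}
\mathbf{h}_1 &= \mathbf{h}_0 + \mathbf{c}, \\
\mathbf{h}_2 &= \mathbf{h}_1 + \mathrm{Attn}_{\mathrm{sys}}(\mathbf{h}_1, H_a), \\
\mathbf{h}_3 &= \mathbf{h}_2 + \mathrm{Attn}_{\mathrm{usr}}(\mathbf{h}_2, H_u), \\
\mathbf{h}_4 &= \mathbf{h}_3 + \mathrm{Attn}_{\mathrm{belief}}(\mathbf{h}_3, H_B), \\
\mathbf{r} &= \mathbf{h}_1 \oplus \mathbf{h}_2 \oplus \mathbf{h}_3 \oplus \mathbf{h}_4 \in \mathbb{R}^{4 d_m},
\end{aligned}$$
where $\mathrm{Attn}_k$ ($k \in \{\mathrm{sys}, \mathrm{usr}, \mathrm{belief}\}$) are the attention modules applied respectively to $H_a$, $H_u$, $H_B$, and $\oplus$ means the concatenation operator. The gradients are blocked for $\mathbf{h}_1$, $\mathbf{h}_2$, $\mathbf{h}_3$ during the back-propagation stage, since we only need them to work as supplementary memories for the relation reasoning that follows.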
The attention module takes a vector, $\mathbf{h} \in \mathbb{R}^{d_m}$, and a matrix, $H \in \mathbb{R}^{d_m \times l}$, as input, where $l$ is the sequence length of the representation, and outputs $\mathbf{h}_a$, a weighted sum of the column vectors in $H$. The weights and the biases of the module are learnable parameters.
The order of the attention modules, i.e., first attend to the system and the user and then the belief, is decided empirically. We can interpret this hierarchical structure as the internal order for the memory processing, since from the daily life experience, people tend to attend to the most contemporary memories (system/user utterance) first and then attend to the older history (belief states). All of the parameters are shared between the attention modules.
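The hierarchical memory access can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: attention scores are plain dot products, the learnable projections and the gradient blocking are omitted, and the function names `attend` and `working_memory` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, memory):
    """Weighted sum of the memory's column vectors (dot-product scores assumed)."""
    scores = [sum(q * m for q, m in zip(query, column)) for column in memory]
    weights = softmax(scores)
    return [sum(w * column[i] for w, column in zip(weights, memory))
            for i in range(len(query))]


def working_memory(h0, condition, memories):
    """Residual attention stack; `memories` is ordered system, user, belief."""
    h = [a + b for a, b in zip(h0, condition)]  # add the condition
    stack = [h]
    for memory in memories:
        h = [a + b for a, b in zip(h, attend(h, memory))]  # residual connection
        stack.append(h)
    return [x for vector in stack for x in vector]  # concatenation


system = [[1.0, 0.0], [0.0, 1.0]]  # two tokens, model size 2 (toy values)
user = [[0.5, 0.5]]
belief = [[0.0, 2.0], [2.0, 0.0]]
r = working_memory([0.1, 0.2], [0.0, 0.0], [system, user, belief])
print(len(r))  # 8, i.e. four stacked vectors of size 2
```

Each later attention module receives a query that already includes what the earlier modules retrieved, which is how the access order shapes the final working memory.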
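The concatenated working memory, $\mathbf{r}$, is then fed into a Multi-Layer Perceptron (MLP) with four layers,
$$\mathbf{z}_j = \sigma(W_j^{\top} \mathbf{z}_{j-1} + \mathbf{b}_j), \quad j = 1, \ldots, 4, \quad \mathbf{z}_0 = \mathbf{r}, \quad \mathbf{h}_s = \mathbf{z}_4,$$
where $\sigma$ is a non-linear activation, and the weights $W_j$ and the biases $\mathbf{b}_j$ are learnable parameters. The number of layers for the MLP is decided by grid search.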
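The hidden representation of the next token, $\mathbf{h}_s$, is then (1) emitted out of the decoder as a representation; and (2) fed into a dropout layer with drop rate $p$, and a linear layer to generate the next token,
$$\mathbf{h}_o = W_o^{\top}\,\mathrm{dropout}(\mathbf{h}_s) + \mathbf{b}_o \in \mathbb{R}^{d_e}, \quad \mathbf{p}_s = \mathrm{softmax}(E^{\top} \mathbf{h}_o) \in \mathbb{R}^{d_v}, \quad s = \arg\max(\mathbf{p}_s),$$
where the weight $W_o \in \mathbb{R}^{d_m \times d_e}$ and the bias $\mathbf{b}_o \in \mathbb{R}^{d_e}$ are learnable parameters. Since $d_e$ is the embedding size and the model parameters are independent of the vocabulary size, the CMR decoder can make predictions on a dynamic vocabulary and implicitly supports the generation of unseen words. When training the model, we minimize the cross-entropy loss between the output probabilities, $\mathbf{p}_s$, and the given labels.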
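To make the dynamic-vocabulary argument concrete, the sketch below scores every vocabulary entry by a dot product between the decoder output and that entry's fixed embedding. The dot-product scoring, the function name `next_token`, and the toy vectors are assumptions for illustration only.

```python
import math


def next_token(h_o, embeddings):
    """Pick the token whose fixed embedding best matches the decoder output.

    `embeddings` maps token -> vector of the embedding size; no parameter
    in this function depends on how many tokens the table holds.
    """
    tokens = list(embeddings)
    logits = [sum(a * b for a, b in zip(h_o, embeddings[t])) for t in tokens]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {t: e / total for t, e in zip(tokens, exps)}
    return max(probs, key=probs.get), probs


vocab = {"north": [1.0, 0.0], "south": [-1.0, 0.0]}  # toy embeddings
h_o = [0.2, 0.9]
before, _ = next_token(h_o, vocab)

vocab["centre"] = [0.0, 1.0]  # extend the vocabulary without retraining anything
after, probs = next_token(h_o, vocab)
print(before, after)  # north centre
```

Adding a word only adds a row to the embedding table, so the same projected output can select a token that was absent when the projection was trained.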
4.1 Experimental Setting
Table 2: Statistics of the WoZ2.0 and MultiWoZ datasets.
|Metric||WoZ2.0||MultiWoZ|
|Avg. # turns||7.45||13.68|
|Avg. tokens / turn||11.24||13.18|
|Number of slots||3||35|
|Number of values||99||4510|
|Training set size||600||8438|
|Validation set size||200||1000|
|Test set size||400||1000|
We first test our model on the single domain dataset, WoZ2.0 Wen et al. (2017). It consists of 1,200 dialogues from the restaurant reservation domain with three pre-defined slots: food, price range, and area. Since the name slot rarely occurs in the dataset, it is not included in our experiments, following previous literature Ren et al. (2018); Liu and Perez (2017). Our model is also tested on the multi-domain dataset, MultiWoZ Budzianowski et al. (2018). It has a more complex ontology with 7 domains and 25 predefined slots. Since the combined slot-value pair representation of the belief states has to be applied for models whose ITC grows with the number of slots, the total number of slots is 35. The statistics of these two datasets are shown in Table 2.
Based on the statistics from these two datasets, we can calculate the theoretical Inference Time Multiplier (ITM) as a metric of scalability. Given the inference time complexity, ITM measures how many times slower a model will be when transferred from the WoZ2.0 dataset, $D_W$, to the MultiWoZ dataset, $D_M$,
$$\mathrm{ITM} = \frac{\mathrm{ITC}(D_M)}{\mathrm{ITC}(D_W)},$$
where $\mathrm{ITC}(x)$ means the Inference Time Complexity (ITC) of the variable $x$. For a model having an ITC of $O(1)$ with respect to the number of slots $n$ and values $m$, the ITM will be a multiplier of 2.15x, which is the ratio between the two datasets of the average number of turns times the average number of tokens per turn, $(13.68 \times 13.18)/(7.45 \times 11.24)$. For an ITC of $O(n)$, it will be a multiplier of 25.1x, and 1,143x for $O(mn)$.
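The reported multipliers can be reproduced from the dataset statistics in Table 2. The sketch assumes that the per-dataset cost is the average number of turns times the average number of tokens per turn times the slot/value term; this assumption is inferred from the reported numbers, and the function name `itm` is hypothetical.

```python
WOZ = {"turns": 7.45, "tokens": 11.24, "slots": 3, "values": 99}
MULTIWOZ = {"turns": 13.68, "tokens": 13.18, "slots": 35, "values": 4510}


def itm(itc, source=WOZ, target=MULTIWOZ):
    """Ratio of per-dialogue inference cost between two datasets.

    `itc(n, m)` is the cost term in the number of slots n and values m.
    """
    def cost(stats):
        return stats["turns"] * stats["tokens"] * itc(stats["slots"], stats["values"])
    return cost(target) / cost(source)


constant = itm(lambda n, m: 1)       # O(1)
linear = itm(lambda n, m: n)         # O(n)
bilinear = itm(lambda n, m: m * n)   # O(mn)

print(round(constant, 2))  # 2.15
print(round(linear, 1))    # 25.1
print(round(bilinear))     # 1144
```

The last value comes out near 1,144 rather than 1,143 because no intermediate rounding is applied here; multiplying the rounded 25.1 by 4510/99 gives about 1,143.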
As a convention, the metric of joint goal accuracy is used to compare our model to previous work. The joint goal accuracy only regards the model making a successful belief state prediction if all of the slots and values predicted are exactly matched with the labels provided. This metric gives a strict measurement that tells how often the DST module will not propagate errors to the downstream modules in a dialogue system. In this work, the model with the highest joint accuracy on the validation set is evaluated on the test set for the test joint accuracy measurement.
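A minimal sketch of the joint goal accuracy computation follows. It assumes that each turn's belief state is stored as a set of (domain, slot, value) triples; the function name and the toy turns are hypothetical.

```python
def joint_goal_accuracy(predictions, labels):
    """Fraction of turns whose predicted state matches the label exactly."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must cover the same turns")
    correct = sum(1 for pred, gold in zip(predictions, labels) if pred == gold)
    return correct / len(labels)


gold = [
    {("train", "day", "monday")},
    {("train", "day", "monday"), ("train", "arrive by", "20:45")},
    {("hotel", "area", "north")},
]
pred = [
    {("train", "day", "monday")},
    {("train", "day", "monday")},  # one missing slot makes the whole turn wrong
    {("hotel", "area", "north")},
]
accuracy = joint_goal_accuracy(pred, gold)
print(round(accuracy, 3))  # 0.667
```

A turn with one missing or wrong slot counts as a complete failure, which is why this metric is much stricter than per-slot accuracy.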
4.2 Implementation Details
We use the pre-trained BERT model for both contextual and static embedding generation. All LSTMs in the model are stacked with 2 layers, and only the output of the last layer is taken as a hidden representation. ReLU non-linearity is used for the activation function.
The hyper-parameters of our model (dropout rate, model size, and embedding size) are identical for both the WoZ2.0 and the MultiWoZ datasets. For training on WoZ2.0, the model is trained with a batch size of 32 and the ADAM optimizer Kingma and Ba (2015) for 150 epochs, while for MultiWoZ, the AMSGrad optimizer Reddi et al. (2018) and a batch size of 16 are adopted for 15 epochs of training. For both optimizers, we use a learning rate of 0.0005 with a gradient clip of 2.0. We initialize all weights in our model with Kaiming initialization He et al. (2015) and adopt zero initialization for the biases. All experiments are conducted on a single NVIDIA GTX 1080Ti GPU.
Table 3: Joint goal accuracy on the WoZ2.0 and MultiWoZ test sets.
|DST Models||WoZ2.0||MultiWoZ|
|Baselines Mrksic et al. (2017)||70.8%||25.83%|
|NBT-CNN Mrksic et al. (2017)||84.2%||-|
|StateNet_PSI Ren et al. (2018)||88.9%||-|
|GLAD Nouri and Hosseini-Asl (2018)||88.5%||35.58%|
|HyST (ensemble) Goel et al. (2019)||-||44.22%|
|DSTRead (ensemble) Gao et al. (2019)||-||42.12%|
|TRADE Wu et al. (2019)||-||48.62%|
To measure the actual inference time multiplier of our model, we evaluate the runtime of the best-performing models on the validation sets of both the WoZ2.0 and MultiWoZ datasets. During evaluation, we set the batch size to 1 to avoid the influence of data parallelism and sequence padding. On the validation set of WoZ2.0, we obtain a runtime of 65.6 seconds, while on MultiWoZ, the runtime is 835.2 seconds. Results are averaged across 5 runs. Considering that the validation set of MultiWoZ is 5 times larger than that of WoZ2.0, the actual inference time multiplier is 2.54 for our model. Since the actual inference time multiplier is roughly of the same magnitude as the theoretical value of 2.15, we can confirm empirically that we have the $O(1)$ inference time complexity and thus obtain full scalability to the number of slots and values pre-defined in an ontology.
Table 3 compares our model with the previous state-of-the-art on both the WoZ2.0 test set and the MultiWoZ test set. For the WoZ2.0 dataset, we maintain performance at the level of the state-of-the-art, with a marginal drop of 0.3% compared with previous work. Considering the fact that WoZ2.0 is a relatively small dataset, this small difference does not represent a significant performance drop. On the multi-domain dataset, MultiWoZ, our model achieves a joint goal accuracy of 45.72%, which is significantly better than most of the previous models other than TRADE, which applies the copy mechanism and gains better generalization ability on named entity copying.
4.4 Ablation Study
|Model||JD Acc.||JDS Acc.||JG Acc.|
To prove the effectiveness of our structure of the Conditional Memory Relation Decoder (CMRD), we conduct ablation experiments on the WoZ2.0 dataset. We observe an accuracy drop of 1.95% after removing residual connections and the hierarchical stack of our attention modules. This proves the effectiveness of our hierarchical attention design. After the MLP is replaced with a linear layer of hidden size 512 and the ReLU activation function, the accuracy further drops by 3.45%. This drop is partly due to the reduction of the number of the model parameters, but it also proves that stacking more layers in an MLP can improve the relational reasoning performance given a concatenation of multiple representations from different sources.
We also conduct the ablation study on the MultiWoZ dataset for a more precise analysis of the hierarchical generation process. For joint domain accuracy, we calculate the probability that all domains generated in each turn exactly match the labels provided. The joint domain-slot accuracy further calculates the probability that all domains and slots generated are correct, while the joint goal accuracy requires that all the domains, slots and values generated exactly match the labels. From Table 5, we can further calculate that, given the correct slot prediction, COMER has an 83.52% chance of making the correct value prediction. While COMER does well on domain prediction (95.53%) and value prediction (83.52%), the accuracy of the slot prediction given the correct domain is only 57.30%. We suspect that this is because we only use the previous belief state to represent the dialogue history, so the inter-turn reasoning ability on the slot prediction suffers from the limited context and the accuracy is harmed by the multi-turn mapping problem Wu et al. (2019). We can also see that the JDS Acc. has an absolute boost of 5.48% when we switch from the combined slot representation to the nested tuple representation. This is because the subordinate relationship between the domains and the slots can be captured by the hierarchical sequence generation, while this relationship is missed when generating the domain and slot together via the combined slot representation.
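The conditional accuracies quoted above chain together into the joint accuracies. The short check below treats them as conditional probabilities and multiplies them; the variable names are hypothetical, and the joint domain-slot figure it prints is the value implied by the quoted percentages, not a number read from Table 5.

```python
p_domain = 0.9553             # joint domain accuracy
p_slot_given_domain = 0.5730  # slot accuracy given the correct domains
p_value_given_slot = 0.8352   # value accuracy given the correct slots

joint_domain_slot = p_domain * p_slot_given_domain
joint_goal = joint_domain_slot * p_value_given_slot

print(round(joint_domain_slot * 100, 2))  # 54.74
print(round(joint_goal * 100, 2))         # 45.72
```

The product recovers the 45.72% joint goal accuracy reported on MultiWoZ, and it shows that the slot stage accounts for most of the loss.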
4.5 Qualitative Analysis
Figure 5 shows an example of the belief state prediction result in one turn of a dialogue on the MultiWoZ test set. The visualization includes the CMRD attention scores over the belief states, system transcript and user utterance during the decoding stage of the slot sequence.
From the system attention (top right), since it is the first attention module and no previous context information is given, it can only find the information indicating the slot “departure” from the system utterance under the domain condition, and it attends to the evidence “leaving” correctly during the generation step of “departure”. From the user attention, we can see that it captures the most helpful keywords that are necessary for correct prediction, such as “after” for “day” and “leave at”, and “to” for “destination”. Moreover, during the generation step of “departure”, the user attention successfully discerns that, based on the context, the word “leave” is not evidence that needs to be accumulated and chooses to attend to nothing in this step. For the belief attention, we can see that the belief attention module correctly attends to a previous slot for each generation step of a slot that has been presented in the previous state. For the generation step of the new slot “destination”, since the previous state does not have the “destination” slot, the belief attention module only attends to the ‘-’ mark after the ‘train’ domain to indicate that the generated word should belong to this domain.
5 Related Work
Semi-scalable Belief Tracker. Rastogi et al. (2017) proposed an approach that can generate fixed-length candidate sets for each of the slots from the dialogue history. Although they only need to perform inference for a fixed number of values, they still need to iterate over all slots defined in the ontology to make a prediction for a given dialogue turn. In addition, their method needs an external language understanding module to extract the exact entities from a dialogue to form candidates, which will not work if the label value is an abstraction and does not have an exact match with the words in the dialogue.
StateNet Ren et al. (2018) achieves state-of-the-art performance with the property that its parameters are independent of the number of slot values in the candidate set, and it also supports online training or inference with dynamically changing slots and values. Given a slot that needs tracking, it only needs to perform inference once to make the prediction for a turn, but this also means that its inference time complexity is proportional to the number of slots.
TRADE Wu et al. (2019) achieves state-of-the-art performance on the MultiWoZ dataset by applying the copy mechanism for the value sequence generation. Since TRADE takes combinations of the domains and slots as the input, the inference time complexity of TRADE is $O(n)$. The performance improvement achieved by TRADE is mainly due to the fact that it incorporates the copy mechanism that can boost the accuracy on the ‘name’ slot, which mainly requires the ability to copy names from the dialogue history. However, TRADE does not report its performance on the WoZ2.0 dataset, which does not have the ‘name’ slot.
DSTRead Gao et al. (2019) formulates the dialogue state tracking task as a reading comprehension problem by asking slot-specific questions to the BERT model and finding the answer span in the dialogue history for each of the pre-defined combined slots. Thus its inference time complexity is still $O(n)$. This method suffers from the fact that its generation vocabulary is limited to the words that occur in the dialogue history, and it has to rely on a manual combination strategy with another joint state tracking model on the development set to achieve better performance.
Contextualized Word Embedding (CWE) was first proposed by Peters et al. (2018). Based on the intuition that the meaning of a word is highly correlated with its context, CWE takes the complete context (sentences, passages, etc.) as the input, and outputs the corresponding word vectors that are unique under the given context. Recently, with the success of language models (e.g. Devlin et al. (2018)) that are trained on large-scale data, contextualized word embeddings have been further improved and can achieve the same performance as (less flexible) fine-tuned pipelines.
Sequence Generation Models. Recently, sequence generation models have been successfully applied in the realm of multi-label classification (MLC) (Yang et al., 2018). Different from traditional binary relevance methods, they proposed a sequence generation model for MLC tasks which takes into consideration the correlations between labels. Specifically, the model follows the encoder-decoder structure with an attention mechanism (Cho et al., 2014), where the decoder generates a sequence of labels. Similar to language modeling tasks, the decoder output at each time step will be conditioned on the previous predictions during generation. Therefore the correlation between generated labels is captured by the decoder.
In this work, we proposed the Conditional Memory Relation Network (COMER), the first dialogue state tracking model that has a constant inference time complexity with respect to the number of domains, slots and values pre-defined in an ontology. Besides its scalability, the joint goal accuracy of our model is also comparable to the state-of-the-art on both the MultiWoZ dataset and the WoZ dataset. Thanks to the flexibility of our hierarchical encoder-decoder framework and the CMR decoder, many future research directions remain open, such as applying the transformer structure, incorporating open vocabulary and copy mechanisms for explicit unseen word generation, and inventing better dialogue history access mechanisms to accommodate efficient inter-turn reasoning.
Acknowledgements. This work is partly supported by NSF #1750063. We thank all the reviewers for their constructive suggestions. We also want to thank Zhuowen Tu and Shengnan Zhang for the early discussions of the project.
- Budzianowski et al. (2018). MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In EMNLP.
- Cho et al. (2014). Learning phrase representations using rnn encoder-decoder for statistical machine translation. In EMNLP.
- Devlin et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
- Gao et al. (2019). Dialog state tracking: a neural reading comprehension approach. ArXiv abs/1908.01946.
- Goel et al. (2019). HyST: a hybrid approach for flexible and accurate dialogue state tracking. ArXiv abs/1907.00883.
- He et al. (2015). Delving deep into rectifiers: surpassing human-level performance on imagenet classification. 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034.
- Henderson et al. (2014). The second dialog state tracking challenge. In SIGDIAL Conference.
- Hochreiter and Schmidhuber (1997). Long short-term memory. Neural Computation 9, pp. 1735–1780.
- Kingma and Ba (2015). Adam: a method for stochastic optimization. CoRR.
- Liu and Perez (2017). Dialog state tracking, a machine reading approach using memory network. In EACL.
- Mrksic et al. (2017). Neural belief tracker: data-driven dialogue state tracking. In ACL.
- Nouri and Hosseini-Asl (2018). Toward scalable neural dialogue state tracking model. arXiv preprint arXiv:1812.00899.
- Peters et al. (2018). Deep contextualized word representations. In NAACL-HLT.
- Ramadan et al. (2018). Large-scale multi-domain belief tracking with knowledge sharing. In ACL.
- Rastogi et al. (2017). Scalable multi-domain dialogue state tracking. arXiv preprint arXiv:1712.10224.
- Reddi et al. (2018). On the convergence of adam and beyond. In ICLR.
- Ren et al. (2018). Towards universal dialogue state tracking. In EMNLP.
- Santoro et al. (2017). A simple neural network module for relational reasoning. In NIPS.
- Sennrich et al. (2016). Neural machine translation of rare words with subword units. CoRR abs/1508.07909.
- Sukhbaatar et al. (2015). End-to-end memory networks. In NIPS.
- Wen et al. (2017). A network-based end-to-end trainable task-oriented dialogue system. In EACL.
- Wu et al. (2019). Transferable multi-domain state generator for task-oriented dialogue systems. In ACL.
- Xu and Hu (2018). An end-to-end approach for handling unknown slot values in dialogue state tracking. In ACL.
- Yang et al. (2018). SGM: sequence generation model for multi-label classification. In COLING.
- Young et al. (2013). Pomdp-based statistical spoken dialog systems: a review. Proceedings of the IEEE 101 (5), pp. 1160–1179.
- Zhang et al. (2018). Residual dense network for image super-resolution. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2472–2481.
- Zhong et al. (2018). Global-locally self-attentive dialogue state tracker. CoRR abs/1805.09655.