Building robust task-oriented dialogue systems has gained increasing popularity in both the research and industry communities (chen2017survey). Dialogue state tracking (DST), one of the essential tasks in task-oriented dialogue systems (zhong2018global), keeps track of user goals or intentions throughout a dialogue in the form of a set of slot-value pairs, i.e., the dialogue state. Because the next dialogue system action is selected based on the current dialogue state, accurate prediction of the dialogue state is crucial.
Traditional neural DST approaches assume that all candidate slot-value pairs are given in advance, i.e., they perform predefined ontology-based DST (mrkvsic2016neural; zhong2018global; nouri2018toward; lee2019sumbt). Most previous works that take this approach perform DST by scoring all possible slot-value pairs in the ontology and selecting the value with the highest score as the predicted value of a slot. Such an approach has been widely applied to datasets like DSTC2 and WOZ2.0 that have a small ontology (henderson2014second; wen2016network). Although this approach simplifies the task, it has several inherent limitations: (1) it is often difficult to obtain the ontology in advance, especially in a real scenario (xu2018end), (2) predefined ontology-based DST cannot handle previously unseen slot values, and (3) the approach does not scale well since it has to go over all slot-value candidates at every turn to predict the current dialogue state. Indeed, recent DST datasets often have a large ontology; e.g., the total number of slot-value candidates in MultiWOZ 2.1 is 4510, while the numbers are much smaller in DSTC2 and WOZ2.0, at 212 and 99, respectively (budzianowski2018multiwoz2.0).
To address the issues mentioned above, other previous works take an approach that either directly generates or extracts the value from the dialogue context for every slot, allowing open vocabulary-based DST (gao2019dialog; wu2019transferable; ren2019comer). While this formulation is relatively more scalable and robust to handling unseen slot values, many of the previous works do not efficiently perform DST since they predict the dialogue state from scratch at every dialogue turn.
In this work, we focus on open vocabulary-based DST and propose SOM-DST (Figure 1). Regarding dialogue state as a memory that can be selectively overwritten, SOM-DST solves DST as two decomposed sub-tasks: (1) state operation prediction, which decides the types of the operations to be performed on each memory slot, and (2) slot value generation, which generates the values to be newly written on a subset of memory slots (Figure 2). To the best of our knowledge, we are the first to propose a selectively overwritable memory-like perspective and a discrete two-step approach to DST. This decomposition allows us to efficiently generate the values of only a minimal subset of the slots, while most of the previous works generate or extract the values for all slots at every dialogue turn.
Moreover, this decomposition reduces the difficulty of DST in an open vocabulary-based setting by clearly separating the roles of the encoder and the decoder. Our encoder, i.e., the state operation predictor, can focus on selecting the slots to pass to the decoder, so that the decoder, i.e., the slot value generator, can focus only on generating the values of those selected slots.
Our proposed SOM-DST achieves state-of-the-art joint goal accuracy in an open vocabulary-based DST setting on two of the most actively studied datasets, MultiWOZ 2.0 and MultiWOZ 2.1. An ablation study on each component (Section 5.4) further reveals that improving the performance of state operation prediction can significantly boost the final DST accuracy.
In summary, the contributions of our work built on top of a perspective that considers dialogue state tracking as selectively overwriting memory are as follows:
Enabling computation-efficient DST, generating the values of a minimal subset of the slots by utilizing the previous dialogue state at each dialogue turn.
Achieving state-of-the-art performance on MultiWOZ 2.0 and MultiWOZ 2.1 in an open vocabulary-based DST setting.
Highlighting the potential of improving the state operation prediction accuracy in our proposed framework.
2 Previous Open Vocabulary-based DST
Many works on recent task-oriented dialogue datasets with a large scale ontology, such as MultiWOZ 2.0 and MultiWOZ 2.1, solve DST in an open vocabulary-based setting (gao2019dialog; wu2019transferable; ren2019comer; anonymous2020nonautoregressive; anonymous2020endtoend).
wu2019transferable show the potential of applying the encoder-decoder framework (cho2014learning) to open vocabulary-based DST. However, their method is not computationally efficient because it performs autoregressive generation of the values for all slots at every dialogue turn.
ren2019comer address a drawback of the model of wu2019transferable, namely that it generates the values for all slots at every dialogue turn, by using a hierarchical decoder. They decode the domains, slots, and values hierarchically to generate the current turn dialogue state itself as the target sequence. In addition, they introduce the notion of Inference Time Complexity (ITC) to compare the efficiency of different DST models. ITC is calculated using the number of slots $J$ and the number of corresponding slot values $M$ (ren2019comer use different symbols for these two quantities). Following their work, we also calculate ITC in Section 5 for comparison.
anonymous2020nonautoregressive introduce another work that tackles the efficiency issue. To maximize the computational efficiency, they use a non-autoregressive decoder to generate the slot values of the current dialogue state at once. They encode the slot type information together with the dialogue context and the delexicalized dialogue context and do not use the previous turn dialogue state as the input. Although the models of ren2019comer and anonymous2020nonautoregressive also generate only a subset of the slot values like ours, our model is more efficient since it just carries over the values from the previous dialogue state while their models generate those values again at every dialogue turn.
anonymous2020endtoend process the dialogue context at both the domain level and the slot level and build the final representation using a late fusion approach to generate the values. They show that there is a performance gain when the model is jointly trained with response generation. However, they still generate the values of every slot at each turn, like wu2019transferable.
gao2019dialog formulate DST as a reading comprehension task and propose a model named DST Reader that extracts the values of the slots from the input. Their fully extractive approach is limited since DST requires tracking abstractive values as well as extractive ones. However, they introduce and show the importance of the concept of a slot carryover module, i.e., a component that makes a binary decision whether to carry the value of a slot from the previous turn dialogue state over to the current turn dialogue state.
zhang2019dsdst target the issue of ill-formatted strings that generative models suffer from. To avoid this issue, they take a hybrid approach. For the slots they categorize as picklist-based slots, they use a predefined ontology-based approach as in the work of lee2019sumbt; for the slots they categorize as span-based slots, they use a span extraction-based method like DST Reader (gao2019dialog). However, their hybrid model shows lower performance than when they use only the picklist-based approach. Although their solely picklist-based model achieves state-of-the-art joint accuracy on MultiWOZ 2.1, it does so in a predefined ontology-based setting, and thus cannot avoid the scalability and generalization issues of predefined ontology-based DST.
3 Selectively Overwriting Memory for Dialogue State Tracking
Figure 2 illustrates the overview of SOM-DST. To describe the proposed SOM-DST, we formally define the problem setting in our work.
Dialogue State We define the dialogue state at turn $t$, $\mathcal{B}_t = \{(S^j, V_t^j) \mid 1 \le j \le J\}$, as a fixed-sized memory whose keys are slots $S^j$ and whose values are the corresponding slot values $V_t^j$, where $J$ is the total number of such slots. Following the convention of MultiWOZ 2.0 and MultiWOZ 2.1, we use the term “slot” to refer to the concatenation of a domain name and a slot name.
Special Value There are two special values, NULL and DONTCARE. NULL means that no information is given about the slot up to the turn. For instance, the dialogue state before the beginning of any dialogue has only NULL as the value of all slots. DONTCARE means that the slot neither needs to be tracked nor is considered important in the dialogue at that time. (Such notions of “none value” and “dontcare value” appear in previous works as well: wu2019transferable; gao2019dialog; anonymous2020nonautoregressive; zhang2019dsdst.)
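Operation At every turn $t$, an operation $r_t^j \in \mathcal{O} = \{\text{carryover}, \text{delete}, \text{dontcare}, \text{update}\}$ is chosen by the state operation predictor (Section 3.1) and performed on each slot $S^j$ to set its current turn value $V_t^j$. An operation either keeps the slot value unchanged (carryover) or changes it to a value different from the previous one (delete, dontcare, and update), as follows, where $v$ denotes a newly generated value:
$$V_t^j = \begin{cases} V_{t-1}^j & \text{if } r_t^j = \text{carryover} \\ \text{NULL} & \text{if } r_t^j = \text{delete} \\ \text{DONTCARE} & \text{if } r_t^j = \text{dontcare} \\ v & \text{if } r_t^j = \text{update} \end{cases}$$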
The operations that set the value of a slot to a special value (delete to NULL and dontcare to DONTCARE, respectively) are chosen only when the previous slot value is not the corresponding special value. The update operation requires the generation of a new value $v$ by the slot value generator (Section 3.2).
The state operation predictor performs state operation prediction as a classification task, and the slot value generator performs slot value generation to find the values for the slots on which update should be performed. The two components of SOM-DST are jointly trained to predict the current turn dialogue state.
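The four operations can be read as a small update rule over a slot-keyed memory. The following is a minimal sketch of that rule, not the authors' implementation: the function name `apply_operations`, the dictionary representation of the state, and the example slot names and values are assumptions made for illustration.

```python
NULL = "NULL"
DONTCARE = "DONTCARE"


def apply_operations(prev_state, operations, generated_values):
    """Return the current-turn state from the previous state and one operation per slot.

    prev_state: dict mapping slot -> value (hypothetical representation)
    operations: dict mapping slot -> "carryover" | "delete" | "dontcare" | "update"
    generated_values: dict mapping slot -> new value, needed only for "update" slots
    """
    new_state = {}
    for slot, prev_value in prev_state.items():
        op = operations[slot]
        if op == "carryover":
            new_state[slot] = prev_value  # keep the previous value untouched
        elif op == "delete":
            new_state[slot] = NULL
        elif op == "dontcare":
            new_state[slot] = DONTCARE
        elif op == "update":
            new_state[slot] = generated_values[slot]  # only these slots need generation
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_state


prev = {"hotel-area": "centre", "hotel-stars": NULL, "taxi-departure": "airport"}
ops = {"hotel-area": "carryover", "hotel-stars": "update", "taxi-departure": "delete"}
state = apply_operations(prev, ops, {"hotel-stars": "4"})
```

Only the slot marked update consumes a generated value; every other slot is resolved by a constant-time assignment.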
3.1 State Operation Predictor
Input Representation We denote the representation of the dialogue utterances at turn $t$ as $D_t = A_t \oplus ; \oplus U_t \oplus [SEP]$, where $A_t$ is the system response and $U_t$ is the user utterance. The semicolon ; is a special token used to mark the boundary between $A_t$ and $U_t$, and [SEP] is a special token used to mark the end of a dialogue turn. We denote the representation of the dialogue state at turn $t$ as $B_t = B_t^1 \oplus \dots \oplus B_t^J$, where $B_t^j = [SLOT]^j \oplus S^j \oplus - \oplus V_t^j$ is the representation of the $j$-th slot-value pair. The hyphen - is a special token used to mark the boundary between a slot and a value, and $[SLOT]^j$ is a special token that represents the $j$-th slot-value pair; it is used to aggregate the information of the $j$-th slot-value pair into a single vector, like the [CLS] token in BERT (devlin2019bert). In this work, we use the same special token [SLOT] for all $[SLOT]^j$. Our state operation predictor employs a pretrained BERT (devlin2019bert) encoder. The input tokens to the state operation predictor are the concatenation of the previous turn dialogue utterances, the current turn dialogue utterances, and the previous turn dialogue state:
$$X_t = [CLS] \oplus D_{t-1} \oplus D_t \oplus B_{t-1}$$
where [CLS] is a special token added in front of every input. Using the previous dialogue state as the input serves as an explicit, compact, and informative representation of the dialogue history for the model.
We use only the previous turn dialogue utterances $D_{t-1}$ as the dialogue history, i.e., the size of the dialogue history is 1. This is because we assume that the Markov assumption holds in our model, since a part of the input is the previous turn dialogue state, $B_{t-1}$, which can serve as a compact representation of the whole dialogue history.
When the value of the $j$-th slot at turn $t-1$, i.e., $V_{t-1}^j$, is NULL, we use a special token [NULL] as the input. When the value is DONTCARE, we use the string “dont care” to take advantage of the semantics of the phrase “don’t care” which the pretrained BERT encoder would have already learned.
The input to BERT is the sum of the embeddings of the input tokens $X_t$, segment id embeddings, and position embeddings. In our work, we use 0 as the segment id for the tokens that belong to $D_{t-1}$ (the previous turn utterances) and 1 for the tokens that belong to $D_t$ (the current turn utterances) or $B_{t-1}$ (the previous turn dialogue state). For the position embeddings, we follow the standard choice of BERT.
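The input layout described above can be sketched as a token-level concatenation. This is an illustrative sketch under stated assumptions: whitespace splitting stands in for BERT's WordPiece tokenizer, and the function name `build_input_tokens`, the tuple form of a turn, and the example slots are hypothetical.

```python
def build_input_tokens(prev_turn, curr_turn, prev_state):
    """Concatenate [CLS], previous-turn utterances, current-turn utterances, previous state.

    prev_turn, curr_turn: (system_response, user_utterance) tuples of strings
    prev_state: dict mapping slot -> value, with "NULL" / "DONTCARE" as special values
    """
    def turn_tokens(system_response, user_utterance):
        # ";" separates system and user text; "[SEP]" closes the turn
        return system_response.split() + [";"] + user_utterance.split() + ["[SEP]"]

    def state_tokens(state):
        tokens = []
        for slot, value in state.items():
            if value == "NULL":
                value_tokens = ["[NULL]"]        # special token for an empty slot
            elif value == "DONTCARE":
                value_tokens = ["dont", "care"]  # plain words, so the encoder can reuse their meaning
            else:
                value_tokens = value.split()
            # the same [SLOT] token is reused for every slot-value pair
            tokens += ["[SLOT]"] + slot.split() + ["-"] + value_tokens
        return tokens

    return (["[CLS]"] + turn_tokens(*prev_turn)
            + turn_tokens(*curr_turn) + state_tokens(prev_state))


tokens = build_input_tokens(
    ("", ""),
    ("hello", "i need a cheap hotel"),
    {"hotel price range": "NULL", "hotel parking": "DONTCARE"},
)
```

Each `[SLOT]` position in the resulting sequence is where the encoder output for the corresponding slot would be read off.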
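Encoder Output The output embedding of the encoder is $H_t \in \mathbb{R}^{|X_t| \times d}$. $h_t^{[CLS]}, h_t^{[SLOT]^j} \in \mathbb{R}^d$ are the outputs corresponding to [CLS] and $[SLOT]^j$, respectively. $h_t^X$, the aggregated sequence representation of the entire input $X_t$, is obtained by a feed-forward layer with a learnable parameter $W_{pool} \in \mathbb{R}^{d \times d}$ as:
$$h_t^X = \tanh(W_{pool} h_t^{[CLS]})$$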
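State Operation Prediction State operation prediction is a four-way classification, performed on top of the encoder output for each slot representation $h_t^{[SLOT]^j}$:
$$P_{opr,t}^j = \mathrm{softmax}(W_{opr} h_t^{[SLOT]^j})$$
where $W_{opr} \in \mathbb{R}^{|\mathcal{O}| \times d}$ is a learnable parameter and $P_{opr,t}^j \in \mathbb{R}^{|\mathcal{O}|}$ is the probability distribution over operations for the $j$-th slot at turn $t$. In our formulation, $|\mathcal{O}| = 4$, because $\mathcal{O} = \{\text{carryover}, \text{delete}, \text{dontcare}, \text{update}\}$.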
Then, the operation is determined by $r_t^j = \mathrm{argmax}(P_{opr,t}^j)$, and slot value generation is performed only on the slots whose operation is update. We define the set of the slot indices which require value generation as $\mathbb{U}_t = \{j \mid r_t^j = \text{update}\}$, and its size as $J'_t = |\mathbb{U}_t|$.
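The classification step and the selection of slots to regenerate can be sketched as follows. This is a minimal illustration, assuming the per-slot logits have already been produced by a classifier head; the names `predict_operations` and `OPERATIONS` and the example logits are hypothetical.

```python
import math

OPERATIONS = ["carryover", "delete", "dontcare", "update"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_operations(slot_logits):
    """Pick the most probable operation per slot and collect the slots to update.

    slot_logits: one list of four logits per slot (hypothetical classifier outputs)
    """
    operations = []
    for logits in slot_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)  # argmax over operations
        operations.append(OPERATIONS[best])
    # indices of the slots whose value must be generated by the decoder
    update_indices = [j for j, op in enumerate(operations) if op == "update"]
    return operations, update_indices


logits = [[3.0, 0.1, 0.2, 0.5], [0.2, 0.1, 0.3, 2.5], [0.1, 2.0, 0.3, 0.4]]
operations, update_indices = predict_operations(logits)
```

In this example only one of the three slots is handed to the decoder, which is where the efficiency gain comes from.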
3.2 Slot Value Generator
For each $j$-th slot such that $j \in \mathbb{U}_t$, i.e., each slot whose operation is update, the slot value generator generates a value. Our slot value generator differs from the generators of many of the previous works because it generates the values for only $J'_t$ slots (the number of slots to update), not $J$ (the number of all slots). In most cases, $J'_t \ll J$, so this setup enables an efficient computation where only a small number of slot values are generated.
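We use a Gated Recurrent Unit (GRU) (cho2014properties) decoder like wu2019transferable. The GRU is initialized with $g_t^{j,0} = h_t^X$ and $e_t^{j,0} = h_t^{[SLOT]^j}$, and recursively updates the hidden state $g_t^{j,k} \in \mathbb{R}^d$ by taking a word embedding $e_t^{j,k}$ as the input until the [EOS] token is generated:
$$g_t^{j,k} = \mathrm{GRU}(g_t^{j,k-1}, e_t^{j,k})$$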
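The slot value generator transforms the output of the decoder, i.e., the hidden state $g_t^{j,k}$ for the $j$-th slot, to the probability distribution over the vocabulary at the $k$-th decoding step:
$$P_{vcb,t}^{j,k} = \mathrm{softmax}(E g_t^{j,k}) \in \mathbb{R}^{d_{vcb}}$$
where $E \in \mathbb{R}^{d_{vcb} \times d}$ is the word embedding matrix shared across the encoder and the decoder, such that $d_{vcb}$ is the vocabulary size.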
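Following wu2019transferable, we use the soft copy mechanism (see2017get) to get the final output distribution $P_{val,t}^{j,k}$ over the candidate value tokens, combining the vocabulary distribution $P_{vcb,t}^{j,k}$ with a distribution $P_{ctx,t}^{j,k}$ over the input tokens:
$$P_{ctx,t}^{j,k} = \mathrm{softmax}(H_t g_t^{j,k}) \in \mathbb{R}^{|X_t|}$$
$$P_{val,t}^{j,k} = \alpha P_{vcb,t}^{j,k} + (1 - \alpha) P_{ctx,t}^{j,k}$$
such that $\alpha$ is a scalar value computed as:
$$\alpha = \mathrm{sigmoid}(W_1 [g_t^{j,k}; e_t^{j,k}; c_t^{j,k}])$$
where $W_1 \in \mathbb{R}^{1 \times 3d}$ is a learnable parameter and $c_t^{j,k} = P_{ctx,t}^{j,k} H_t \in \mathbb{R}^d$ is a context vector.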
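The mixing step of a soft copy mechanism can be sketched numerically. The sketch below assumes the common pointer-generator convention in which probability mass over input positions is added onto the vocabulary entries of the tokens at those positions; that mapping, the function name `soft_copy_distribution`, and the toy numbers are assumptions for illustration rather than details taken from the text.

```python
def soft_copy_distribution(p_vocab, p_context, input_token_ids, alpha):
    """Mix a vocabulary distribution with a copy distribution over input positions.

    p_vocab: probabilities over the vocabulary (sums to 1)
    p_context: probabilities over input positions (sums to 1)
    input_token_ids: vocabulary id of the token at each input position
    alpha: scalar gate in [0, 1]; higher values favour generating from the vocabulary
    """
    # start from the gated vocabulary distribution
    final = [alpha * p for p in p_vocab]
    # scatter-add the gated copy probabilities onto the ids of the input tokens
    for position, token_id in enumerate(input_token_ids):
        final[token_id] += (1.0 - alpha) * p_context[position]
    return final


p_vocab = [0.1, 0.2, 0.3, 0.4]
p_context = [0.5, 0.25, 0.25]
input_token_ids = [2, 0, 2]  # token id 2 appears twice in the input
final = soft_copy_distribution(p_vocab, p_context, input_token_ids, alpha=0.6)
```

Because both inputs are proper distributions and the gate is a convex weight, the result is again a distribution; tokens that occur in the dialogue context receive extra mass, which is what allows unseen values to be copied.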
3.3 Objective Function
During training, we jointly optimize both state operation predictor and slot value generator.
State operation predictor
In addition to the state operation classification, we use domain classification as an auxiliary task to force the model to learn the correlation of slot operations and domain transitions in between dialogue turns. Domain classification is done with a softmax layer on top of $h_t^X$, the aggregated representation of the encoder input:
$$P_{dom,t} = \mathrm{softmax}(W_{dom} h_t^X)$$
where $W_{dom} \in \mathbb{R}^{d_{dom} \times d}$ is a learnable parameter and $P_{dom,t} \in \mathbb{R}^{d_{dom}}$ is the probability distribution over domains at turn $t$. $d_{dom}$ is the number of domains defined in the dataset.
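The loss for each of state operation classification and domain classification is the average of the negative log-likelihood, as follows:
$$L_{opr,t} = -\frac{1}{J} \sum_{j=1}^{J} (Y_{opr,t}^j)^\top \log(P_{opr,t}^j)$$
$$L_{dom,t} = -(Y_{dom,t})^\top \log(P_{dom,t})$$
where $Y_{dom,t} \in \mathbb{R}^{d_{dom}}$ is the one-hot vector for the ground truth domain and $Y_{opr,t}^j \in \mathbb{R}^{|\mathcal{O}|}$ is the one-hot vector for the ground truth operation for the $j$-th slot.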
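Slot value generator The objective function to train the slot value generator is also the average of the negative log-likelihood:
$$L_{svg,t} = -\frac{1}{|\mathbb{U}_t|} \sum_{j \in \mathbb{U}_t} \left[ \frac{1}{K_t^j} \sum_{k=1}^{K_t^j} (Y_{val,t}^{j,k})^\top \log(P_{val,t}^{j,k}) \right]$$
where $\mathbb{U}_t$ is the set of slot indices whose operation is update, $K_t^j$ is the number of tokens of the ground truth value that needs to be generated for the $j$-th slot, and $Y_{val,t}^{j,k} \in \mathbb{R}^{d_{vcb}}$ is the one-hot vector for the ground truth token that needs to be generated for the $j$-th slot at the $k$-th decoding step.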
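Therefore, the final joint loss to be minimized at dialogue turn $t$ is the sum of the losses mentioned above, i.e., the state operation classification loss, the domain classification loss, and the slot value generation loss:
$$L_{joint,t} = L_{opr,t} + L_{dom,t} + L_{svg,t}$$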
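As a worked example of how the three terms combine, the sketch below computes each loss from already-normalized probabilities and integer targets. It is a simplified illustration: the function names `nll` and `joint_loss`, the nested-list data layout, and the toy probabilities are assumptions, and an actual implementation would operate on batched tensors.

```python
import math


def nll(probabilities, target_index):
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(probabilities[target_index])


def joint_loss(op_probs, op_targets, dom_probs, dom_target, value_probs, value_targets):
    """Sum of operation, domain, and value-generation losses for one dialogue turn.

    op_probs: per-slot distributions over the four operations
    dom_probs: one distribution over domains
    value_probs: for each slot to update, one vocabulary distribution per decoding step
    """
    # operation loss: averaged over all slots
    loss_opr = sum(nll(p, y) for p, y in zip(op_probs, op_targets)) / len(op_probs)
    # domain loss: a single classification per turn
    loss_dom = nll(dom_probs, dom_target)
    # generation loss: averaged over tokens within a slot, then over updated slots
    per_slot = []
    for step_probs, step_targets in zip(value_probs, value_targets):
        token_losses = [nll(p, y) for p, y in zip(step_probs, step_targets)]
        per_slot.append(sum(token_losses) / len(token_losses))
    loss_svg = sum(per_slot) / len(per_slot) if per_slot else 0.0
    return loss_opr + loss_dom + loss_svg


loss = joint_loss(
    op_probs=[[0.5, 0.25, 0.125, 0.125], [0.25, 0.25, 0.25, 0.25]],
    op_targets=[0, 3],
    dom_probs=[0.5, 0.5],
    dom_target=0,
    value_probs=[[[0.5, 0.5], [0.25, 0.75]]],
    value_targets=[[0, 0]],
)
```

With these numbers each term is a multiple of $\ln 2$ (1.5, 1, and 1.5 times, respectively), so the joint loss equals $4 \ln 2$. When no slot requires an update, the generation term is taken to be zero in this sketch.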
4 Experimental Setup
|Domain|Slots|Number of data (Train)|Number of data (Valid)|Number of data (Test)|
|---|---|---|---|---|
|Hotel|price range, type, parking, book stay, book day, book people, area, stars, internet, name|3,381|416|394|
|Train|destination, day, departure, arrive by, book people, leave at|3,103|484|494|
|Restaurant|food, price range, area, name, book time, book day, book people|3,813|438|437|
|Taxi|leave at, destination, departure, arrive by|1,654|207|195|
|Attraction|area, name, type|2,717|401|395|
We use MultiWOZ 2.0 (budzianowski2018multiwoz2.0) and MultiWOZ 2.1 (eric2019multiwoz2.1) as the datasets in our experiments. MultiWOZ 2.0 and MultiWOZ 2.1 are two of the largest publicly available multi-domain task-oriented dialogue datasets, including about 10,000 dialogues within seven domains.
Note that MultiWOZ 2.1 is a newer version of MultiWOZ 2.0 where the annotation errors in the previous version are corrected. eric2019multiwoz2.1 report that the correction of the annotations changes about 32% of the state annotations, which indicates that MultiWOZ 2.0 contains many annotation errors.
Following wu2019transferable, we use only five domains (restaurant, train, hotel, taxi, attraction), excluding hospital and police, since these two domains take up only a small portion of the dataset and do not even appear in the test set. Therefore, the number of domains is 5 and the number of slots is 30 in our experiments. To preprocess the datasets, we use the preprocessing script provided by wu2019transferable (https://github.com/jasonwu0731/trade-dst). More detailed statistics of MultiWOZ 2.1 are given in Table 1.
We employ the pretrained BERT-base-uncased model (https://github.com/huggingface/transformers) for the encoder of the state operation predictor and one GRU (cho2014properties) for the decoder of the slot value generator. The decoder hidden size is the same as the encoder hidden size $d$, which is 768 for BERT-base-uncased. The token embedding matrix of the slot value generator is shared with that of the state operation predictor. We use BertAdam as our optimizer (kingma2014adam). We use greedy decoding for the slot value generator.
Note that the encoder of state operation predictor makes use of a pretrained model, whereas the decoder of slot value generator needs to be trained from scratch. Therefore, we use different learning rate schemes for the encoder and the decoder. We set the peak learning rate and warmup proportion to 4e-5 and 0.2 for the encoder and 1e-4 and 0.1 for the decoder, respectively.
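To make the effect of the two learning rate schemes concrete, the sketch below evaluates a warmup-then-linear-decay schedule for both components. The exact schedule shape is an assumption (the text gives only the peak learning rates and warmup proportions), and the function name `warmup_linear_lr` and the step counts are hypothetical.

```python
def warmup_linear_lr(step, total_steps, peak_lr, warmup_proportion):
    """Linear warmup to peak_lr, then linear decay to zero (assumed schedule shape)."""
    progress = step / total_steps
    if progress < warmup_proportion:
        # warmup phase: scale linearly from 0 up to the peak
        return peak_lr * progress / warmup_proportion
    # decay phase: scale linearly from the peak down to 0 at the final step
    return peak_lr * max(0.0, (1.0 - progress) / (1.0 - warmup_proportion))


total_steps = 1000  # hypothetical number of optimization steps
encoder_lr = warmup_linear_lr(100, total_steps, peak_lr=4e-5, warmup_proportion=0.2)
decoder_lr = warmup_linear_lr(100, total_steps, peak_lr=1e-4, warmup_proportion=0.1)
```

Under this assumed schedule, at 10% of training the decoder has already reached its peak rate while the pretrained encoder is only halfway through its warmup, which reflects the intent of updating the pretrained weights more cautiously.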
We use a batch size of 64 and set the dropout (srivastava2014dropout) probability to 0.1. The max sequence length for all inputs is fixed to 256.
We train state operation predictor and slot value generator jointly for 40 epochs and choose the model which reports the highest joint goal accuracy on the validation set. During training, we use the ground truth operations and the ground truth previous turn dialogue state instead of the predicted ones. We use teacher forcing 50% of the time to train the decoder.
4.3 Baseline Models
We compare the performance of SOM-DST with both predefined ontology-based models and open vocabulary-based models. A short explanation for each of the models is specified below.
FJST uses a bidirectional LSTM to encode the dialogue history and a feed-forward network to predict the value of each slot (eric2019multiwoz2.1).
HJST is proposed together with FJST; it encodes the dialogue history using an LSTM like FJST, but uses a hierarchical network (eric2019multiwoz2.1).
SUMBT exploits BERT as the encoder for the dialogue context and slot-value pairs. After encoding them, it scores every candidate slot-value pair in a non-parametric manner using a distance measure (lee2019sumbt).
HyST employs a hierarchical encoder and takes a hybrid approach that incorporates both a predefined ontology-based setting and an open vocabulary-based setting (goel2019hyst).
DST Reader formulates the problem of DST as an extractive QA task; it extracts the value for a slot from the input as a span (gao2019dialog).
TRADE encodes the whole dialogue context and decodes the value for every slot using a copy-augmented decoder (wu2019transferable).
COMER uses a hierarchical decoder to generate the current turn dialogue state itself as the target sequence (ren2019comer).
NADST uses a non-autoregressive decoder to generate the current turn dialogue state (anonymous2020nonautoregressive).
ML-BST encodes the dialogue context with domain and slot information and combines them in a late fusion approach to make the final representation of the input. Then, it generates the slot values and the system response jointly (anonymous2020endtoend).
DS-DST takes a hybrid approach of predefined ontology-based DST and open vocabulary-based DST. It defines picklist-based slots for classification, similarly to SUMBT, and span-based slots for span extraction, like DST Reader (zhang2019dsdst).
DST-picklist is proposed together with DS-DST, but this model performs only predefined ontology-based DST considering all slots as picklist-based slots (zhang2019dsdst).
5 Experimental Results
5.1 Joint Goal Accuracy
|Model||MultiWOZ 2.0||MultiWOZ 2.1|
|Open Vocabulary||DST Reader||39.41||36.40|
Table 2 shows the joint goal accuracy of SOM-DST and other baselines on the test sets of MultiWOZ 2.0 and MultiWOZ 2.1. Joint goal accuracy counts a turn as correct only if the predicted values of all slots exactly match those of the ground truth. As shown in the table, SOM-DST achieves state-of-the-art performance in an open vocabulary-based setting. Interestingly, contrary to previous works, our model achieves higher performance on MultiWOZ 2.1 than on MultiWOZ 2.0. This is presumably because our model, which explicitly uses the dialogue state labels as input, benefits more from the error correction on the state annotations done in MultiWOZ 2.1.
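The strictness of this metric is easy to see in code. The following is a minimal sketch, assuming each turn's state is represented as a dictionary from slot to value; the function name `joint_goal_accuracy` and the example states are hypothetical.

```python
def joint_goal_accuracy(predicted_states, gold_states):
    """Fraction of turns whose predicted state matches the gold state on every slot."""
    correct = sum(1 for pred, gold in zip(predicted_states, gold_states) if pred == gold)
    return correct / len(gold_states)


gold = [
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "south", "hotel-stars": "4"},
]
pred = [
    {"hotel-area": "north", "hotel-stars": "4"},
    {"hotel-area": "north", "hotel-stars": "3"},  # a single wrong slot fails the whole turn
    {"hotel-area": "south", "hotel-stars": "4"},
]
accuracy = joint_goal_accuracy(pred, gold)
```

Here five of six slot values are correct, yet only two of three turns count, so the joint goal accuracy is 2/3.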
5.2 Inference Time Complexity (ITC)
|Inference Time Complexity|
In the task of DST, the scalability of the model becomes an issue when the number of slots to be tracked increases. Therefore, performing an efficient computation is considered important and widely studied (ren2019comer; anonymous2020nonautoregressive). For a fair comparison of the efficiency of each of the models, we investigate Inference Time Complexity (ITC), which defines the efficiency of a DST model using $J$, the number of slots, and $M$, the number of values of the slot, following ren2019comer.
Going a step further from the work of ren2019comer, we report ITC of the models in the best case and the worst case. This enables a more precise comparison with the models that ren2019comer do not cover, e.g., the model of anonymous2020nonautoregressive.
Table 3 shows the ITC of several models in their best and worst cases. Since our model generates values for only the slots on which the update operation has to be performed, the best case complexity of our model is $\Omega(1)$, when there is no slot whose operation is update. Even though the worst case is $O(J)$, where $J$ is the number of slots, i.e., when the values of all slots need to be updated, the number of slots to be updated at a turn in the train set of MultiWOZ 2.1 is only 1.12 on average and 9 at maximum. Since the number of all slots is 30, which is more than three times the maximum number of slots to be updated at a turn, our model can still be considered computationally efficient.
5.3 Domain-Specific Accuracy
|Domain||Model||Joint Accuracy||Slot Accuracy|
Table 4 shows the domain-specific results of our model and the concurrent works which report such figures (anonymous2020nonautoregressive; anonymous2020endtoend). Domain-specific accuracy is the accuracy measured on a subset of the predicted dialogue state, where the subset consists of the slots specific to a domain.
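Restricting the evaluation to one domain can be sketched as filtering the slots before comparing. This is an illustrative sketch only: it assumes slots are named `domain-slot` and that every turn is included, whereas the evaluation scripts used for the reported numbers may select turns differently. The function name `domain_accuracy` and the example states are hypothetical.

```python
def domain_accuracy(predicted_states, gold_states, domain):
    """Joint and slot accuracy restricted to the slots of a single domain."""
    prefix = domain + "-"
    joint_hits = slot_hits = slot_total = 0
    for pred, gold in zip(predicted_states, gold_states):
        slots = [s for s in gold if s.startswith(prefix)]
        matches = [pred.get(s) == gold[s] for s in slots]
        joint_hits += all(matches)  # the turn counts only if every domain slot matches
        slot_hits += sum(matches)   # slots are also counted individually
        slot_total += len(slots)
    return joint_hits / len(gold_states), slot_hits / slot_total


gold = [
    {"taxi-departure": "airport", "taxi-destination": "hotel", "hotel-area": "north"},
    {"taxi-departure": "airport", "taxi-destination": "hotel", "hotel-area": "north"},
]
pred = [
    {"taxi-departure": "airport", "taxi-destination": "hotel", "hotel-area": "south"},
    {"taxi-departure": "airport", "taxi-destination": "museum", "hotel-area": "north"},
]
taxi_joint, taxi_slot = domain_accuracy(pred, gold, "taxi")
```

The hotel error in the first turn does not affect the taxi figures, while the wrong taxi destination in the second turn lowers both the joint and the slot accuracy for that domain.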
While the performance is similar to or a little lower than that of other models in other domains, SOM-DST outperforms other models in taxi and train domains. This implies that state-of-the-art joint goal accuracy of our model on the test set comes mainly from these two domains.
A characteristic of the data from these domains is that they consist of challenging conversations where the domain changes more than once. Indeed, among complicated dialogues where the domain switches more than once, i.e., the user changes the topic of the conversation during a dialogue more than once, the number of dialogues that end in the taxi domain is ten times larger than in other cases (more detailed statistics are given in Table 7).
Therefore, we may say that our model outperforms other models in dialogues of such complicated cases. In other words, our model performs relatively more robust DST in challenging conversations. We conjecture that this strength is attributable to the effective utilization of the previous turn dialogue state in its explicit form; the model can explicitly keep even the information mentioned near the beginning of the conversation and make use of it. Figure 1 shows an example of a complicated conversation in MultiWOZ 2.1, where our model indeed correctly performs DST.
5.4 Ablation Study
Table 5 shows the results of the ablation studies. The joint goal accuracy drops when we change the state operation prediction from a four-way classification to a binary classification of whether to (1) carry over the previous slot value to the current turn or (2) change to a new value, as in the work of gao2019dialog. We assume the reason is that it is better to separately model the operations delete, dontcare, and update that correspond to the latter class of the binary classification, since the values of delete and dontcare tend to appear implicitly while the values for update are often explicitly expressed in the dialogue.
Lastly, we observe that the joint accuracy rises from 52.57% to 86.87% when the ground truth state operation is given to the model instead of the predicted one at the time of inference, implying that accurate state operation prediction is the key to boost the performance of DST in our proposed framework. On the other hand, the performance is only 55.17% when the ground truth value token is given at every step of the decoding at the time of inference. This result indicates that a large room for improvement exists in the task of state operation prediction. For instance, the imbalance in the number of operations, such that carryover takes the largest portion as can be seen in Table 6, and the error propagation from the previous dialogue state are some of the issues to be tackled in future research.
|# Operations||F1 score|
We propose SOM-DST, an open vocabulary-based dialogue state tracker that regards dialogue state as an explicit memory that can be selectively overwritten. SOM-DST decomposes dialogue state tracking (DST) into state operation prediction and slot value generation. This setup makes the generation process efficient because the values of only a minimal subset of the slots are generated at each dialogue turn. SOM-DST achieves state-of-the-art joint goal accuracy on both MultiWOZ 2.0 and MultiWOZ 2.1 datasets in an open vocabulary-based setting. SOM-DST effectively makes use of the explicit dialogue state and discrete operations to perform relatively robust DST even in complicated conversations. Further analysis of SOM-DST shows that the key to dramatically improve the performance of DST in the proposed setting exists especially in the task of state operation prediction. From this result, we propose that tackling the problem with our proposed definition of DST is a promising future research direction.
The authors would like to thank the members of Clova AI for proofreading this manuscript.
Appendix A Appendix