Task-oriented dialogue system interacts with users in natural language to accomplish tasks such as restaurant reservation or flight booking. The goal of dialogue state tracking is to provide a compact representation of the conversation at each dialogue turn, called dialogue state, for the system to decide the next action to take. A typical dialogue state is consist of goal of user, action of the current user utterance (inform, request etc.) and dialogue history young2013pomdp. User’s goal is the most important among them, it is a set of slot-value pairs. All of these are defined in a particularly designed ontology that restricts which slots the system can handle, and the range of values each slot can take. The joint goal is a set that accumulates turn goals up to the current turn. Accuracy of the joint goal is a rigorous evaluation, because each error slot-value pair may propagate to the next few turns. In this paper, we focus on tracking the user’s goal and use the joint goal accuracy to evaluate our model.
Considering an example of restaurant reservation, users can inform the system some restrictions of their goals (e.g., inform(food = thai)) or request further information they want (e.g., request(phone number)) at each turn. Figure 1 shows an annotated dialogue in WoZ 2.0 dataset.
In many task-oriented dialogue system applications we have built in the industry, the state tracker faces several challenges. First, with the complexity of tasks increases, more slots and values can be used during the conversation, the state tracker needs to be more accurate otherwise the system may fail to accomplish tasks. Second, the number of slots or tasks in the ontology may be large, the state tracker requires scalability to large-scale multi-domain dialogues. Third, when new values or slots are introduced to the system, it is hard to collect lots of labeled data, so the ability of few-shot or even zero-shot learning on newly added data is required. To tackle these challenges, we propose a scalable multi-domain dialogue state tracker, Point-Or-Generate Dialogue State Tracker (POGD), which could handle unseen values and easy to generalize to new slots.
Let’s discuss the designation of our proposed POGD model. By taking a deep look at the dialogue data examples, we find that the user’s goal can be split into explicit and implicit cases222In this paper, we think a value is explicitly expressed if it has a Levinstein distance less than 3 with it in the ontology.. The explicit case means that the value is expressed the same way as it in the ontology, otherwise, it is the implicit case. The explicit case usually needs information searching and matching, while implicit one requires generation. It seems natural to introduce Pointer-Generator Network see2017get to improve the accuracy. However, the distribution of the above two cases is unbalanced, i.e. matching is used much more than generating. Furthermore, values in the implicit case are also in an imbalanced distribution333In WoZ 2.0 there are 14.65% implicit values, and in MultiWoZ 2.0 the number is 10.50%. Among implicit values, the value don’t care accounts 82.49% in WoZ 2.0 and 16.64% in MultiWoZ 2.0, respectively.. Even in the explicit case, values may not be expressed exactly the same as it in the ontology, we called it rephrasing444In the explicit case, rephrased values take a proportion of 29.32% in WoZ 2.0 and 32.1% in MultiWoZ 2.0, respectively. (e.g., moderately in user’s utterance and moderate in the ontology). For the above reasons, straightforward utilization of PGNet might lead to chaos.
We attempt to think the same way as PGNet—point out slot values from the user’s utterance or generate them based on slot-specific contexts. Figure 2 shows the architecture of POGD. POGD is designed as follows: A Pointer is trained to figure out values explicitly mentioned in utterances, while a Generator is trained to infer implicitly expressed ones. Different from PGNet, the Pointer and Generator are decomposed into two separate modules, and a Switcher—a binary classifier is adapted to enable adaptive switching between them to fit the distribution. To handle rephrasing values in the explicit case and imbalanced distribution of implicit values, a looking up in the value set is performed via attention mechanism in both Pointer and Generator. Finally, a Classifier is applied to determine whether a slot is relevant to an utterance or not. POGD shares all parameters across all slots to be scalable, and to achieve knowledge sharing. The training of the four modules is formulated as a multi-task learning procedure to promote the capability of generalization. POGD achieves a state-of-the-art performance of joint goal accuracy on the small-scale WoZ 2.0 wen2016network dataset. On the large-scale multi-domain dataset MultiWoZ 2.0 budzianowski2018multiwoz, our model obtains 39.15% joint goal accuracy, outperforming prior work by 3.57%. The contributions of this paper are three folds:
We address the DST problem by solving two easier subtasks, 1) point out explicitly expressed slot values in user’s utterance and 2) generate implicitly expressed ones based on slot-specific contexts.
We propose a scalable multi-domain dialogue state tracker POGD, which obtains state-of-the-art results.
To our knowledge, this is the first attempt to evaluate the capability of generalization of a dialogue state tracker on unseen slot values and new slots.
The rest of this paper is structured as follows: In Section 2 we give a brief review of previous works in the field and show their limitations. The architecture and other details of our model are described in Section 3. We perform several experiments to show advantages of POGD in Section 4. Followed by conclusions in Section 5.
2 Related Work
Traditional works on DST such as rule-based models wang2013simple; sun2014generalized, statistical models thomson2010bayesian; young2010hidden; lee2013recipe; lee2013structured; sun2014sjtu; xie2015recurrent failed to produce satisfactory performance and now rarely discussed. Approaches using a separate Spoken Language Understanding (SLU) module henderson2012discriminative; zilka2015incremental; mrkvsic2015multi suffered from error accumulation from the SLU, and relied on hand-crafted semantic dictionaries or delexicalization—replace the values in the utterance with general tags to achieve generalization.
Recently, deep-learning showed its power to the dialogue state tracking challenges (DSTCs)williams2013dialog; henderson2014second; henderson2014third. A variety of DL-based methods were proposed: Neural Belief Tracker (NBT) mrkvsic2016neural applied representation learning to learn features as opposed to hand-crafting features; PtrNet xu2018end aimed to handle unknown values; GLAD zhong2018global and its further improved version GCE nouri2018toward addressed the problem that extraction of rare slot-value pairs; another improvement of GLAD sharma2019improving used the relevant past user’s utterance to get a better performance. But parameters of these models increased with the number of slots. For these scalable approaches, ramadan2018large tried to share parameters across slots, but their model needs to iterate all slots and values in the ontology at each dialogue turn. rastogi2017scalable generated a fix set of candidate values using a separate SLU module, so suffered from error accumulation. StateNet ren2018towards reduced the dependence of ontology but no verification was done in their work.
In this section, we discuss the details of our model. We formulate state tracking in the way that predicts the turn state like GLAD zhong2018global, but we do not use the previous system acts in POGD’s inputs. At each dialogue turn, POGD’s input is consist of utterance , a slot under considering and its value set . For utterance , we simply concatenate user and system utterance ( and ) by a particular symbol usr.
The Pointer is inspired by previous work PtrNet xu2018end, which addressed the DST problem using Pointer Networks vinyals2015pointer. We use a bidirectional LSTM hochreiter1997long to get the encoding of the same way as PtrNet.
Where is number of words in and is the dimension of LSTM hidden states. Different from PtrNet, Pointer module is not designed in an encoder-decoder structure, we use two Linears () and to encode slot information.
Then, the starting position and the ending position of predicted value are computed via in Figure 2 as following,
where and we define a trainable parameter and a in each attention module. Note that there is no guarantee that , when , it indicates that Pointer’s output is not reliable, we will set Switcher’s output to choose Generator’s output as the final predicted value.
We get a by summing the word embeddings between and in2. We find that values and their rephrased have similar embeddings, which is why the Pointer can overcome rephrasing values via attention mechanism.
The cosine similarity between and values in value set is calculate using dot attention luong2015effective,
where is in range of the number of values in , we select the value with max attention score as the output of Pointer. We also calculate a context by following steps for further using in Classifier and Switcher.
The Generator works similar to Pointer, as described in Figure 2, after encoding utterance and slot and computing attention scores the same way as Pointer, a slot-specific context is computed as follows.
Different from Pointer, Generator is designed to infer the implicit values based on slot-specific contexts, so it needs to handle more complex similarity than the word embeddings similarity computed in in Pointer.
is transformed by a multilayer perceptron (MLP)gardner1998artificial
with ReLUnair2010rectified activation, let denote a -layers MLP with ReLU activation at each hidden layer, transformed context is computed as following.
Attention performed in Generator is to deal with the imbalance of implicit values. Traditional sequence to sequence models like PGNet see2017get used a Linear to select final output from vocabulary, which may suffer from imbalanced distribution of values, because each value in vocabulary corresponding to a weight in the Linear. So instead of using a Linear, we use attention mechanism, the trainable parameters and in are shared across all values, which reduces the impact of data imbalance on performance. does’t perform dot attention as in Pointer but perform the same as . We also choose the value that has the max score as the final output of Generator.
3.3 Classifier & Switcher
Classifier and Switcher are two binary classifiers that have the same architecture—a 3-layers perceptron with dropoutsrivastava2014dropout at the hidden layer, and they also have the same inputs except for the encoding of slots. But the goal of the two modules are different: Switcher is trained to choose the final predicted value from Pointer’s output and Generator’s output , while Classifier is trained to determine whether the slot under considering is relevant to the input utterance or not. Let’s discuss how Switcher works, and the Classifier acts the same.
The Switcher’s input is consist of three parts: Pointer’s attention context , Generator’s attention context , and the encoding of slot. As shown in Figure 2, they are simply concatenated together. let denote the input of Switcher, the Switcher works as following.
where is the slot encoder of Switcher as shown in Figure 2. The Switcher let Pointer and Generator learn simplified tasks—Pointer learns to point only on pointable values and Generator learns to generate only on implicit values.
The final output produced by the lecun2012efficient function. In our experiments, the final predicted value is produced as follows.
Here and are the output of Switcher and Classifier.
3.4 Multi-Task Learning
The four modules in our model are jointly trained via multi-task learning. This section we discuss how we separate the tasks and which parameters are shared.
We use 5 loss functions to train our model. In Pointer, we use, as outputs and ground truth starting and ending positions of values as labels to compute two cross entropy losses and . For Switcher and Classifier, they use expected outputs generated from the dataset as labels and perform binary cross entropy losses and . During the training process, we use Switcher’s labels to produce final output , then a cross entropy loss is computed between the ground truth output of the tracker. Generator is a little particular, it doesn’t have its own loss function, as we described in the last subsection, only contains Pointer’s output or Generator’s output , so the error back propagation of will only go through either Pointer or Generator. We use cut values in Pointer and the indexing data doesn’t have gradient so the Pointer needs its own losses.
The Classifier needs negative sampling, which brings more difficulty to the joint training process. Negative samples are randomly selected from unrelated slots with some ratio, we should make them only have effect on the Classifier. When a negative sample go through the model, Switcher’s label will set to its output before the back propagation, and other modules’ labels will set to zero and ignored from cross entropy loss functions.
Classifier and Switcher use Pointer and Generator’s attention contexts as a part of the input, caruana1997multitask showed that multi-task learning can improve generalization, our experiments further verified this argument.
In this section, we perform several experiments to show the advantages of our model. We show joint goal accuracy of POGD on two datasets, and compare to several previous approaches. Then we examine our model’s generalization on unseen values, and the capability of few-shot learning on new slots.
|average tokens per turn||11.24||13.18|
We mainly use two datasets to evaluate our model, one is a small-scale dataset, the second version of Wizard-of-Oz (WoZ 2.0) wen2016network which user’s goal is to find a suitable restaurant around Cambridge. Another one is a large-scale multi-domain dataset, the second version MultiWoZ 2.0 budzianowski2018multiwoz, which contains 6 domains but we only use 5 domains occurred in its test set as ramadan2018large did. Both of them are human-machine conversations. A comparison of two datasets showed in Table 1.
|Belief Tracking: CNNramadan2018large||85.5||25.83|
|Neural Belief Tracker: NBT-DNNmrkvsic2016neural||84.4||/|
|GLAD + RC + FSsharma2019improving||89.2||/|
4.2 Implementation Details
We use semantically specialised Paragram-SL999 vectors wieting2015paraphrase the same as Neural Belief Tracker mrkvsic2016neural and do not fine-tune the embeddings of utterance and values. Slots’ embeddings are random initialized. Model is trained using ADAM kingma2014adam with learning rate 1e-3. We apply dropout srivastava2014dropout with rate 0.3 at word embeddings and hidden layers of in Classifier and Switcher. The position labels used to train Pointer are generated the same way as PtrNet xu2018end
, we use the last occurrence of the reference value in the dialogue history, if a subsequence of an utterance in history has a Levenshtein distance less than 3 with the value, it will be treated as a successful match. Switcher’s labels are generated the same time—matching is successful or not. Labels of Classifier need negative sampling, we randomly choose unrelated slots as negative samples, with a probability of 13/30 in MultiWoZ 2.0, and in the WoZ 2.0 dataset we use all unrelated slots as negative samples. We train the model 400 epochs with L2 regularization 2e-7 on WoZ 2.0 and 50 epochs with L2 1e-7 on MultiWoZ 2.0. Other details of data generation and parameters of POGD are described in the Appendix.
As described in Section 1, the joint goal accuracy is a widely used metric in DST task. Table 2 shows the joint goal accuracy of our model compares against various previous reported baselines on WoZ 2.0 and MultiWoZ 2.0 test set. POGD achieves competitive result 88.70% on the small dataset and gets a state-of-the-art result 39.15% on the large scale dataset.
In this section, we will show the POGD model is generalizable in two aspects: unseen values and new slots. New values or called unseen values means they don’t appear during the training process.
4.4.1 Unseen values
The generalization of unseen values is evaluated at WoZ 2.0 dataset. We randomly choose values at rate 15% to 55% of slot food which contains 76 values in total, and delete data contains these values from the train set. We report Precision, Recall and F1 score of these values in the test set, as shown in Table 3.
Experiments here are not intend to get the best performance but to prove our POGD model can handle unseen values, so we only train 20 epochs to get results in Table 3. It is important to emphasize that our POGD can generalize to unseen values without using SLU module or delexicalization like previous worksrastogi2017scalable; ren2018towards.
4.4.2 New slots
The generalization of new slots is rarely discussed. Recent approaches like NBT mrkvsic2016neural and PtrNet xu2018end trained model separately for each slot type, when facing a new slot, they need to train an entirely new model. As for those scalable approaches zhong2018global; ren2018towards; ramadan2018large; rastogi2017scalable; nouri2018toward, they may generalize to a new slot by fine-tuning their models but no experiments were done in their works.
In our POGD model, parameters are shared across all slots to share knowledge, when a new slot is added to ontology, with a few data and training epochs, the model could get a satisfactory performance. Furthermore, we find that the similarity between embeddings of slots will train to be consistent with the similarity of the slot values’ entity types, we verify this by computing the cosine similarity between slots embeddings. We keep the most similar slot for each slot, and use the similarity between their embeddings as weights, then construct a directed graph. We plot the graph using Gephi555https://gephi.org/ with Fruchterman–Reingold algorithm fruchterman1991graph, as shown in Figure 3. We can find that a pair of slots have a higher similarity when their corresponding values have a more similar entity type. Experiments show this phenomenon can be used to improve the convergence speed of our model.
The following experiments are performed to show the ability of few-shot learning and the convergence speed of POGD. In this subsection, data of 30 slots from 5 domains with 1:10 negative sampling rate are used. We use data of 29 slots except train departure to train a model 10 epochs as a pretrained model, and use train departure’s data to evaluate. We use this slot because it is complex enough and has less overlapping values with its similar slot—restaurant name, which ensures the effectiveness of experiments.
To show the capability of few-shot learning, we randomly choose training examples from data of slot train departure which contains 19651 examples in total (including negative examples), after fine-tuning the pretrained model using these examples 10 epochs, we choose the best performance and get results shown in Figure 4. We find that with only 700 of 19651 (3.56%) training examples we could get nearly the same results as using full data.
We compare the convergence speed between the slot train departure is random initialized and initialized with embeddings of restaurant name, the results shown in Figure 5 indicate that initializing with similar slot embeddings can significantly increase the convergence speed. But if we only use the new slot’s data to fine-tune the pretrained model, after several epochs the pretrained model may be broken, so an efficient way to learn a new slot is using the slot-specific data and full data alternately to fine-tune the pretrained model, while new slot initialized with a similar slot.
4.5 Ablation study
We perform ablation experiments to analyze the effectiveness of different components of POGD. The results of these experiments are shown in Table 4.
Combining Pointer and Generator we can get a strong performance on figuring out values given a correct slot. When Classifier uses the labels, we can get a 94.11% joint goal accuracy on WoZ 2.0 and 58.49% on MultiWoZ 2.0, which outperforms the previous state-of-the-artsharma2019improving; nouri2018toward by 4.91% and 22.91%. It is because the binary Switcher lets them handle simplified cases: Pointer only learns to point out values from the user’s utterance and Generator only learns to infer implicit values.
The Switcher choose the best output from Pointer and Generator. From the results we notice that when only the Switcher uses the generated labels as outputs, the performance of our model is slightly worse, after analyzing the output of the model, we found two reasons. One is the dialogue state doesn’t always update as soon as a value is mentioned by the user, but Switcher’s labels are generated turn by turn, if a value’s first occurrence is missed, it will not be corrected in the followed turns. For example, a user informs the system he wants to find a restaurant that serves Chinese food, but in some data examples the slot-value pair of food=chinese is added to the dialogue state until the booking is completed. In such cases, generated labels cause all of these turn states go wrong, but Switcher and Classifier may decide to output this pair when it is first mentioned, so after turns of wrong state, it may finally get a correct state (e.g., when booking is completed). Another reason is, a pointable value not always produced by Pointer, it may generate by Generator with higher confidence. Because of these two reasons, Switcher produced unexpected results with better performance than the generated labels.
The Classifier works well but suffers a lot from error accumulation. Even though it has high accuracy as illustrated in Table 5, but with the increasing of slots, each percent of classifier errors cause more loss on the joint goal accuracy, because at each turn we iterate all slot and try to figure out their values, but most of them are not relevant to the utterance. This is the biggest problem with our model. We should design it in a more stable way in the future.
We propose the Point-Or-Generate Dialogue State Tracker (POGD)—a scalable multi-domain dialogue state tracker. POGD can handle unseen values and reduce the effort when new slots added to the ontology. Our model obtains 88.7% joint goal accuracy on the WoZ 2.0 dataset, and on the large-scale multi-domain dataset MultiWoZ 2.0, it obtains 39.15% joint goal accuracy, outperforming prior work by 3.57%.
Appendix A Appendix
a.1 Data generation
POGD is designed to predict the turn goals, but in MultiWoZ 2.0, it only contains the full dialogue states, so we need to generate turn goals labels. The turn goals labels are generated as follows, at the first turn of dialogue the state is equal to the turn goals, for the following turns, we simply delete the slot-value pairs that appeared in previous turns. If a value doesn’t occur in the dialogue state on its first occurrence, we will add it to the turn state where it was mentioned. This operation only performed in WoZ 2.0, in MultiWoZ 2.0, a value may use for several slots, this operation may lead to further errors. One important thing, no modification was performed on the test set.
Words are represented using 300 dimensional embeddings, and each slot is mapped to a randomly initialized 100 dimensional embedding. LSTMs: all LSTMs in POGD are one layer BiLSTM and their hidden state size 1s 128. Linears: , and have the same output size of 2 * BiLSTMs’ hidden size to perform attention. The output size of and is equal to their input dimensions. MLPs: and have the same architecture as described in Section 3, they have hidden layer sizes the same as LSTMs’ hidden size. is a and has a hidden size of 2 * dimension of words embeddings.
During the training process on WoZ 2.0 dataset, due to such little data, instead of on MultiWoZ 2.0 dataset, Classifier and Switcher are and the is not used. We set the hidden state of BiLSTMs to 150 to perform without , because 2 * LSTM’s hidden size should equals to the embeddings of values.
During the training process, we used a dynamic batch size to adjust the learning rate. We changed the batch size every 30 epochs on WoZ 2.0 and 5 epochs on MultiWoZ, and each time batch size changed, we will load the best model parameters up to now and continue to training. We found that this trick accelerated the convergence and slightly increased the performance.