Much attention has recently been paid to dialogue state tracking (DST), which aims to extract user goals at each turn of a dialogue and represent them using slot-value pairs. These pairs are later used by the policy learning module to decide the next system action.
Most previous works decompose DST into a series of classification problems: they first score all possible slot-value pairs and then choose the value with the highest probability as the predicted value for that slot. Some more recent work proposes generation-based DST. Based on the encoder-decoder framework, it directly generates a value for each slot. It is further enhanced with a soft copy mechanism, which enables it to generate values using text from the input source.
Despite its remarkable success, this approach simply concatenates different dialogue turns and feeds them to the model as a flat sequence. For example, it would organize the dialogue in Figure 1 as a single concatenated sequence of all turns. Obviously, when the dialogue contains multiple turns, this sequence becomes long, and copying the correct slot values becomes especially difficult.
To this end, we propose a Hierarchical Dynamic Copy Network (HDCN) to facilitate focusing on the most informative turns, making it easier to copy slot values from the dialogue context. We define the information turn of a certain slot as the turn where its corresponding value appears or can be inferred. For example, in Figure 1, the information turn for slot hotel-area is turn 3, where its value east appears. Obviously, to generate the value for a certain slot, we should focus more on its information turn.
To achieve this, we use two levels of attention applied to the word- and turn-level, which are then re-normalized and result in the final copy distribution. We further add a focus loss term to encourage our model to assign higher turn-level attention weight to the information turn. Notably, our turn representation is calculated dynamically during decoding using the word-level attention. This dynamic calculation leads to better turn representations for two reasons: 1) it utilizes all the hidden states instead of only the last one, and 2) it enables our model to form different turn representations for different slots. As shown in Figure 1, when calculating word-level attention over turn 4, the slot hotel-book day would attend more to the word Tuesday, and the slot hotel-book stay would assign more attention to the word 2. In this way, the resulting turn-representations for these two slots also become different.
We conduct experiments on the MultiWOZ 2.1 dataset. Results show that HDCN brings over 1% (absolute) improvement in terms of joint accuracy, with better interpretability. Note that our goal is not to beat state-of-the-art systems such as [13, 12], as they are often based on pretrained language models (e.g., BERT). Rather, we aim to offer a lightweight improvement over the best system in the pre-BERT era. We believe that although BERT-based models can achieve high accuracy, they inevitably cost more inference time, making the DST component impractical in real-world dialogue systems.
2.1 Task Definition
Given a set of machine and user utterance pairs in T turns of dialogue, we aim to generate dialogue states containing a series of slot-value pairs, one for each possible slot. In the following subsections, we first give a brief description of generation-based DST and then introduce our work in detail.
2.2 Generation-Based DST
TRADE [26] was the first model to regard DST as a generation problem, and it achieves the best performance in the pre-BERT era. It follows the encoder-decoder paradigm and adopts a soft-copy mechanism to copy slot values from the input source.
At encoding time, it uses a bi-directional gated recurrent unit (GRU) to encode the dialogue history, which is organized as a flat sequence.
To generate the value for a slot, it supplies the summed embedding of the domain and slot as the first input to the decoder (e.g., for hotel-area, the first input is Emb(hotel) + Emb(area)), and it decodes independently once for each of the possible slots.
At each decoding step, the decoder returns a hidden state, which is later mapped into the vocabulary space using the embedding matrix. Formally,
At the same time, the decoder hidden state is used to calculate attention weights over the encoded dialogue history. The vocabulary distribution and the copy distribution are then weighted and summed to obtain the final word distribution:
where is a trainable matrix, is the input word embedding to the decoder at the current time step, and
is the context vector. Intuitively, the resulting scalar weight can be seen as a soft switch that chooses between generating a word from the vocabulary and copying a word from the dialogue history.
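As a concrete illustration, the soft-switch mixture described above can be sketched in pure Python. This is a minimal sketch under our own naming: the scalar `p_gen` stands in for the learned switch, and `source_ids` maps each source position to its vocabulary index.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def final_word_distribution(p_vocab, copy_attn, source_ids, p_gen):
    """Soft-copy mixture: p_gen weighs generating from the vocabulary,
    (1 - p_gen) weighs copying a source word via the attention weights."""
    p_final = [p_gen * p for p in p_vocab]
    for a, wid in zip(copy_attn, source_ids):
        p_final[wid] += (1.0 - p_gen) * a  # add copy probability mass
    return p_final
```

Because both input distributions sum to 1, the mixture is itself a valid distribution over the vocabulary, and words that appear in the dialogue history receive extra probability mass.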
Although TRADE takes an important first step towards generation-based DST, it simply concatenates different dialogue turns and regards the multi-turn dialogue as a flat sequence. Correspondingly, copying the correct slot values becomes especially hard when the sequence is long. In this paper, we propose the hierarchical dynamic copy network (HDCN) to address this limitation. The overall structure of our model is shown in Figure 2.
We first encode each turn with a bidirectional LSTM (BiLSTM) encoder and summarize the information in the forward and backward BiLSTM hidden states. During encoding, we compare two initialization strategies for the initial hidden state:
zero initialization: the initial hidden state is set to a zero vector, so each turn is encoded independently.
last initialization: the initial hidden state is set to the last hidden state of the previous turn.
After encoding, we aggregate all the encoded information into a fixed-length vector using max-pooling, with which we initialize our LSTM decoder. As in TRADE, our model decodes independently for each slot to generate the dialogue state. For the slot at time step :
where is the word embedding and is the decoder hidden state. Following TRADE, we also supply the summed embedding of the domain and slot as the first input to the decoder.
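The encoding procedure above (last initialization across turns, followed by max-pooling to initialize the decoder) can be sketched as follows. `encode_turn` is a hypothetical stand-in for the BiLSTM turn encoder, and the toy one-dimensional encoder at the end exists only to make the sketch runnable:

```python
def encode_dialogue(turns, encode_turn, hidden_size):
    """Encode each turn, carrying the previous turn's last hidden state
    forward ('last initialization'), then max-pool every hidden state
    into one fixed-length vector for initializing the decoder.

    encode_turn(turn, h0) -> (hidden_states, h_last) is an abstract
    stand-in for the paper's BiLSTM turn encoder."""
    h0 = [0.0] * hidden_size                 # zero state before the first turn
    all_states = []
    for turn in turns:
        states, h0 = encode_turn(turn, h0)   # h0 flows across turns
        all_states.extend(states)
    # element-wise max over all word-level hidden states
    pooled = [max(col) for col in zip(*all_states)]
    return all_states, pooled

# usage with a toy one-dimensional "encoder" that ignores h0's content
def toy_encoder(turn, h0):
    states = [[float(w)] for w in turn]
    return states, states[-1]

demo_states, demo_pooled = encode_dialogue([[1, 2], [3]], toy_encoder, 1)
```

Under zero initialization one would simply reset `h0` to zeros before each turn instead of carrying it forward.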
After obtaining the decoder hidden state, we adopt a hierarchical dynamic copy approach to obtain the copy distribution. Specifically, we first calculate the word-level attention distribution over the dialogue history:
where is a trainable parameter.
Note that the word-level attention is dependent on the target slot (we omit the slot index in the notation for simplicity). Intuitively, the decoder hidden state can be seen as a high-level representation of a fixed query: which words are informative for the slot?
We then aggregate the representations of those informative words to form turn representations:
Traditional hierarchical LSTM methods regard the last hidden state of the word-level LSTM as the turn representation. By contrast, our turn representation considers all the hidden states . More importantly, this dynamic calculation procedure enables our model to form different turn representations for different target slots, leading to more flexible and robust representations. For example, in Figure 1, when calculating over turn 4, the slot hotel-book day would attend more to the word Tuesday, and the slot hotel-book stay would attend more to the word 2. In this way, the turn representations for these two slots also differ.
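The dynamic turn representation can be sketched in pure Python. As a simplifying assumption, we score words with a plain dot product against the decoder state; the paper uses trainable attention parameters instead:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def turn_representation(decoder_state, turn_states):
    """Attend over one turn's word-level hidden states with the current
    decoder state as the query, and return the attention-weighted sum as
    the turn representation. Because the query is the (slot-dependent)
    decoder state, different slots produce different turn vectors."""
    scores = [sum(q * h for q, h in zip(decoder_state, state))
              for state in turn_states]          # dot-product scoring
    alpha = softmax(scores)                      # word-level attention
    dim = len(turn_states[0])
    rep = [sum(a * state[i] for a, state in zip(alpha, turn_states))
           for i in range(dim)]
    return rep, alpha
```

A query aligned with the first word's hidden state pulls the turn representation toward that word, which is exactly the slot-dependent behavior described above.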
Similarly, to identify informative turns for the slot, we calculate turn-level attention weights using the turn representations:
where is a trainable parameter. Intuitively, the weight can be seen as an indicator of how informative turn j is for the slot (again, we omit the slot index for simplicity).
Then, we re-normalize the two levels of attention to obtain the final copy distribution. Formally,
In this way, words from more informative turns get rewarded, and words from less informative turns are discouraged. A more detailed analysis can be found in Subsection 4.3.
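The re-normalization step is a simple product of the two attention levels; a minimal sketch (with our own function name) follows:

```python
def copy_distribution(turn_attn, word_attn_per_turn):
    """Scale each turn's word-level attention by that turn's turn-level
    weight and concatenate. Since each level individually sums to 1, the
    result is a valid distribution over every word in the dialogue."""
    return [beta * a
            for beta, word_attn in zip(turn_attn, word_attn_per_turn)
            for a in word_attn]
```

Note how a word that dominates a low-weight turn still ends up with little final copy probability, which is the suppression effect analyzed in Subsection 4.3.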
Finally, and the vocabulary distribution are weighted and summed to obtain the final word distribution:
2.4 Turn Focus Mechanism
We define the information turn for a certain slot as the turn where its corresponding value appears, or can be inferred. It is essential to assign the highest turn-level attention weight to the information turn in order to capture key information.
To achieve this, we sum and normalize all the turn-level attention weights after decoding, and then calculate the turn focus loss as the negative log probability of the information turn. Formally,
where is the information turn for the slot.
In this way, our model can learn to assign the highest turn-level attention weight to the information turn. Combined with our hierarchical dynamic copy mechanism, words from the information turn are more likely to be copied (see Equation 8).
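The focus loss computation can be sketched as follows; the list-of-lists layout of the attention weights is our own choice for the example:

```python
import math

def turn_focus_loss(turn_attn_per_step, info_turn):
    """Sum the turn-level attention over all decoding steps, renormalize
    into a distribution over turns, and return the negative log
    probability assigned to the information turn."""
    n_turns = len(turn_attn_per_step[0])
    summed = [sum(step[j] for step in turn_attn_per_step)
              for j in range(n_turns)]
    total = sum(summed)
    return -math.log(summed[info_turn] / total)
```

Minimizing this term pushes the summed attention mass toward the information turn, so the loss is small when the model already focuses there and large otherwise.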
Note that some slots are never mentioned in the dialogue, and they thus do not have corresponding information turns. To handle these slots, we pad a sentry turn at the beginning of the dialogue (see Figure 2). The sentry turn contains only one word none and acts as the information turn for these slots.
Besides, we observe that slot values that contain multiple words are mostly named entities mentioned in the dialogue, like holy trinity church or london liverpool street. Obviously, these words are almost always from the same turn. Based on this observation, we further propose two methods to better handle multi-word values: HDCN-freeze and HDCN-cover.
As the name suggests, HDCN-freeze freezes turn-level attention after the first decoding step. That is, we modify Equation 7 into the following:
However, we found that this approach is too rigid. Inspired by the coverage mechanism, we further propose HDCN-cover, a more flexible approach. The basic idea is to encourage rather than force our model to focus on the same turn during decoding. Specifically, we maintain a turn coverage vector during decoding, which is the sum of turn-level attention distributions over all previous decoding steps:
The turn coverage vector is used as extra input to the calculation of turn-level attention, changing Equation 7 to:
where is a learnable parameter, randomly initialized. During training, we find that this parameter steadily increases, suggesting that turns receiving higher attention in previous decoding steps indeed get rewarded.
Note that our coverage vector serves the exact opposite purpose of the original coverage mechanism. The coverage mechanism was first proposed to reduce repetition by discouraging the model from attending to the same place; by contrast, we want to encourage the model to attend to the same turn during decoding.
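The HDCN-cover adjustment can be sketched as below. The raw per-step scores and the scalar `w_cover` (standing in for the learnable parameter) are our own simplifications:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cover_turn_attention(base_scores_per_step, w_cover):
    """HDCN-cover sketch: a coverage vector accumulates the turn-level
    attention of previous decoding steps and is added (scaled by the
    learnable weight w_cover) to the raw scores, so with w_cover > 0 a
    turn attended to earlier is rewarded at later steps."""
    n_turns = len(base_scores_per_step[0])
    coverage = [0.0] * n_turns
    history = []
    for scores in base_scores_per_step:
        adjusted = [s + w_cover * c for s, c in zip(scores, coverage)]
        beta = softmax(adjusted)
        history.append(beta)
        coverage = [c + b for c, b in zip(coverage, beta)]
    return history
```

With `w_cover = 0` this degrades to independent per-step attention, while HDCN-freeze corresponds to reusing the first step's distribution unchanged.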
2.5 Task Learning
During training, the cross-entropy loss for the slot is calculated as:
where is the ground truth at time step and is the total decoding steps.
where the focus ratio is used to adjust the relative importance of each term, and the loss is aggregated over all the possible slots.
Following previous work [26, 20, 6], we train and evaluate our model on the MultiWOZ dataset. It is currently the largest multi-domain conversational corpus, spanning seven domains. The training, validation and test sets contain 8,438, 1,000 and 1,000 multi-turn dialogues, respectively.
The original MultiWOZ dataset was found to contain substantial errors in its state annotations. These errors were later fixed in the MultiWOZ 2.1 dataset; the correction process changed over 32% of state annotations across 40% of the dialogue turns. In this paper, we conduct experiments with the corrected version, MultiWOZ 2.1.
Note that the MultiWOZ dataset does not contain information turn labels, yet they can be easily inferred using a rule-based method. Specifically, if the dialogue state changes after a turn, then that turn is the information turn for those slots whose values differ between the two states. The information turn labels are only used during training.
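A minimal sketch of this rule-based labeling, assuming each state is given as a slot-to-value dictionary (the data layout here is ours, not MultiWOZ's on-disk format):

```python
def label_information_turns(states):
    """states[t] is the dialogue state (slot -> value) after turn t,
    with states[0] the empty state before the dialogue starts. A slot's
    information turn is the latest turn after which its value changed;
    slots never mentioned get no label and default to the sentry turn."""
    labels = {}
    for t in range(1, len(states)):
        prev, cur = states[t - 1], states[t]
        for slot, value in cur.items():
            if prev.get(slot) != value:
                labels[slot] = t  # value appeared or changed at turn t
    return labels
```

Slots absent from the returned dictionary are exactly those never mentioned in the dialogue, and would be assigned the sentry turn described in Subsection 2.4.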
3.2 Training Procedure
Our model was implemented in PyTorch and trained on a single NVIDIA 1080 Ti GPU. The LSTMs and word embeddings are 400-dimensional. Following previous work [29, 26], we initialized the embeddings by concatenating GloVe embeddings and character embeddings. Dropout is used with the rate set to 0.5. The LSTM encoder and decoder each have 2 layers. The batch size is set to 16. The focus ratio in Equation 16 is set to 0.1.
We use the Adam optimizer. We choose our model according to the results on the development set. The reported results are the average of five runs on the test set.
3.3 Systems for Comparison
In this paper, we compare our model against the following baselines:
FJST is a classification-based model. It uses a bidirectional LSTM to encode the full dialogue history.
DST Reader  first proposes to regard the problem of DST as a reading comprehension problem. It uses a simple attention-based neural network to point to the slot values within the dialogue.
HyST combines a hierarchical-encoder, fixed-vocabulary system with an open-vocabulary, n-gram copy-based system.
Hier-LSTM based is our baseline for examining the effectiveness of the hierarchical dynamic copy mechanism. It is based on a hierarchical LSTM encoder, and its decoding process is the same as TRADE's. For a fair comparison, we also add the copy and focus mechanisms to this baseline.
Joint accuracy - Joint accuracy is the most widely used evaluation metric for DST [29, 26, 20]. It compares the predicted dialogue state to the ground truth at each dialogue turn, and the output is considered correct if and only if all the predicted values exactly match the ground truth values. All the baseline results are taken from the official paper of the MultiWOZ 2.1 dataset.
Focus accuracy - In this work, we also consider focus accuracy to examine our claim that the focus mechanism can encourage the model to focus on information turns. For each slot, focus accuracy compares its summed turn-level attention weights with its information turn. The result is considered correct if and only if the information turn receives the highest attention weight.
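Both metrics reduce to simple exact-match counting; a sketch under our own data layout (states as dictionaries, summed turn-level attention as one list per slot):

```python
def joint_accuracy(pred_states, gold_states):
    """A turn counts as correct only if the whole predicted state
    matches the ground truth exactly."""
    correct = sum(1 for p, g in zip(pred_states, gold_states) if p == g)
    return correct / len(gold_states)

def focus_accuracy(summed_turn_attn, info_turns):
    """A slot counts as correct iff its information turn receives the
    highest summed turn-level attention weight."""
    hits = sum(
        1 for attn, t in zip(summed_turn_attn, info_turns)
        if max(range(len(attn)), key=attn.__getitem__) == t)
    return hits / len(info_turns)
```

Joint accuracy is deliberately strict: a single wrong slot value makes the whole turn count as incorrect.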
Table 1 (excerpt): joint accuracy (%) on MultiWOZ 2.1.

| Model | Joint Acc. |
| --- | --- |
| DST Reader | 36.40 |
| Hier-LSTM based (baseline) | 39.39 |
| HDCN-freeze, zero initialized | 45.48 |
| HDCN-freeze, last initialized | 46.11 |
| HDCN-cover, zero initialized | 45.62 |
| HDCN-cover, last initialized | 46.76 |
| Settings | Focus Acc. | Joint Acc. |
4.1 Overall Performance
The joint accuracy results on the MultiWOZ 2.1 dataset are shown in Table 1.
Our first observation is that the Hier-LSTM based baseline gives poor performance, with joint accuracy lower than the flat-structured TRADE. Besides, we can see that HJST (the hierarchical counterpart of FJST) also performs worse than FJST, which further shows that simply using two levels of LSTM to model multi-turn dialogue is ill-suited to DST. We suppose this is because the last hidden state of the word-level LSTM is not sufficient to capture all the key information within a turn. By contrast, our proposed model uses the attention mechanism to dynamically calculate turn representations, selectively using all the word representations within the turn. More importantly, our model allows for different turn representations for different target slots.
Besides, we can see that HDCN-cover generally performs better than HDCN-freeze. Although both aim for consistent focusing on the same turn during decoding, the former approach allows for more flexibility. Another observation is that the last-initialization approach performs notably better than zero-initialization, with about 1% absolute improvement. This is because the former can effectively utilize history information when encoding each turn, while the latter encodes each turn independently and discards the history information entirely.
4.2 Focus Mechanism
One of the main claims of this paper is that we can improve DST performance by encouraging our model to focus on the information turn. To further examine this claim, we conduct experiments with different focus ratios (see Equation 16) on the development set. The results are shown in Table 2.
Firstly, we can see that the focus accuracy indeed increases with the focus ratio. This is understandable because a higher focus ratio means we attach greater importance to the focus loss term.
Besides, the results also buttress our claim that the focus mechanism can improve DST performance: even with a focus ratio of 0.01, the improvement is notable. Our model reaches the best performance with a focus ratio of 0.1, and the reported test set results are based on this setting.
Furthermore, we note that the joint accuracy does not increase monotonically with the focus ratio. After reaching its peak value, the joint accuracy slightly decreases. We suppose this is because focusing too much on a single turn may lead to the neglect of contextual information. Besides, from Equation 16 we can see that an excessively high focus ratio could hinder the decoding process.
4.3 Attention Weights Analysis
As demonstrated in Subgraph (a), all five slots successfully focus on their corresponding information turns. For example, slot hotel-book day assigns the highest attention weight to turn 4, and slot hotel-book people assigns the highest attention weight to turn 5. Besides, since the slot hotel-parking is never mentioned in the dialogue, it assigns the highest attention weight to the sentry turn, which contains only the word none.
Subgraph (b) gives the word-level attention over turn 5 before re-normalization with Equation 8. A crucial observation is that both slot hotel-book stay and slot hotel-book people assign the highest attention weight to the word 3. This is because both slots tend to receive a number as their value, and the number 3 hence becomes confusing. In fact, we find that TRADE, which adopts flat copy and does not involve the re-normalization process, wrongly predicts the value of slot hotel-book stay to be 3.
Subgraph (c) gives the word-level attention over turn 5 after re-normalization. We can see that since slot hotel-book stay barely attends to turn 5 (see Subgraph (a)), its attention on the word 3 almost becomes 0 after re-normalization. By contrast, slot hotel-book people still (rightfully) assigns the highest attention to 3, because turn 5 is its information turn and receives the highest turn-level attention.
From the above analysis, we can see why the proposed model improves over TRADE. TRADE organizes the dialogue as one long sequence, which makes it hard to attend to and copy the gold value, as many confusing words can appear along that sequence, like 3 for slot hotel-book stay in the above example. By contrast, our model adopts a hierarchical copy process and can assign the highest turn-level attention weight to the information turn, thanks to the focus mechanism.
5 Related Work
DST - In recent years, neural network approaches have defined the state-of-the-art in DST research. Most previous works decomposed this task into a series of classification problems [29, 17, 19, 2, 15]. They took each slot-value pair as input for scoring, and output the value with the highest score as the predicted value for a slot. Such models require a predefined ontology listing all possible slot-value pairs, which is not always available in practice. Besides, the number of possible slot-value pairs can be large, and enumerating them all can demand enormous time and resources. For example, the MultiWOZ dataset contains over 4,500 possible slot values in total, meaning that such models have to perform over 4,500 classifications to determine the current dialogue state.
To address these limitations, some recent works discarded the classification framework [27, 26, 2, 6]. One line of work first applied pointer networks to the single-domain DST problem, generating both start and end pointers to perform index-based copying. Another used a simple attention-based neural network to point to slot values within the conversation. TRADE first proposed to regard this problem as a text generation problem: it encodes the dialogue history as a long sequence with a GRU and uses another GRU as the decoder to predict the value for each slot independently. A later model proposed to sequentially decode domain, slot and value, thus maintaining constant computational complexity regardless of the number of predefined slots.
Hierarchical Attention Network - Our hierarchical dynamic copy mechanism draws inspiration from the Hierarchical Attention Network (HAN), which was first proposed for text classification. Similar to HAN, our approach also has two levels of attention applied at the word- and sentence-level. However, our approach differs notably from HAN in two ways. First, when calculating attention scores (Equations 5 and 7), we use the decoder hidden state, while HAN uses a randomly initialized vector; this allows our model to form different representations for different target slots. Second, we re-normalize the two levels of attention to form the final copy distribution, while HAN only uses its attention distributions to form sentence and document representations.
In this work, we propose to tackle the DST problem with HDCN. Instead of viewing the dialogue as a flat sequence, we propose the hierarchical dynamic copy mechanism to facilitate focusing on the information turns, making it easier to copy slot values from the dialogue context. Besides, we further propose two methods to better handle multi-word values. Extensive experiments show that our model improves over TRADE by more than 1% (absolute) joint accuracy.
-  (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §2.2.
-  (2019) Scalable neural dialogue state tracking. ArXiv abs/1910.09942. Cited by: §1, §5, §5.
-  (2018) MultiWOZ - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In EMNLP, Cited by: §3.1.
-  (2014) Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §2.2.
-  (2019) MultiWOZ 2.1: multi-domain dialogue state corrections and state tracking baselines. ArXiv abs/1907.01669. Cited by: §3.1, §3.3, §3.3, §3.4, Table 1.
-  (2019) Dialog state tracking: a neural reading comprehension approach. ArXiv abs/1908.01946. Cited by: §3.1, §3.3, Table 1, §5.
-  (2019) HyST: a hybrid approach for flexible and accurate dialogue state tracking. ArXiv abs/1907.00883. Cited by: §3.3, Table 1.
-  (2016) Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Cited by: §5.
-  (2016) A joint many-task model: growing a neural network for multiple nlp tasks. In EMNLP, Cited by: §3.2.
-  (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580. Cited by: §3.2.
-  (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.2.
-  (2020) A simple language model for task-oriented dialogue. arXiv preprint arXiv:2005.00796. Cited by: §1.
-  (2020) Efficient dialogue state tracking by selectively overwriting memory. In ACL, Cited by: §1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.2.
-  (2019) SUMBT: slot-utterance matching for universal and scalable belief tracking. In ACL, Cited by: §5.
-  (2015) Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. Cited by: §2.3.
-  (2016) Neural belief tracker: data-driven dialogue state tracking. In ACL, Cited by: §1, §5.
-  (2014) Glove: global vectors for word representation. In EMNLP, Cited by: §3.2.
-  (2018) Large-scale multi-domain belief tracking with knowledge sharing. In ACL, Cited by: §5.
-  (2019) Scalable and accurate dialogue state tracking via hierarchical sequence generation. In EMNLP/IJCNLP, Cited by: §3.1, §3.4, §5.
-  (2017) Get to the point: summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368. Cited by: §1, §2.2, §2.2, §2.4, §2.4, §5.
-  (2015) Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, Cited by: item 1, §3.3.
-  (2016) A hierarchical latent variable encoder-decoder model for generating dialogues. In AAAI, Cited by: item 1.
-  (2014) Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems (NIPS), pp. 3104–3112. External Links: Cited by: §1, §2.2.
-  (2015) Pointer networks. In International Conference on Neural Information Processing Systems, Cited by: §5, §5.
-  (2019) Transferable multi-domain state generator for task-oriented dialogue systems. In ACL, Cited by: §1, §1, §1, §2.2, §2.2, §3.1, §3.2, §3.3, §3.4, Table 1, §5, §5.
-  (2018) An end-to-end approach for handling unknown slot values in dialogue state tracking. In ACL, Cited by: §5.
-  (2016) Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489. Cited by: §5.
-  (2018) Global-locally self-attentive dialogue state tracker. ArXiv abs/1805.09655. Cited by: §1, §3.2, §3.4, §5.