Copy-Enhanced Heterogeneous Information Learning for Dialogue State Tracking

by Qingbin Liu, et al.

Dialogue state tracking (DST) is an essential component of task-oriented dialogue systems; it estimates user goals at every dialogue turn. However, most previous approaches suffer from the following problems. Many discriminative models, especially end-to-end (E2E) models, have difficulty extracting unknown values that are not in the candidate ontology. Previous generative models, which can extract unknown values from utterances, perform worse because they ignore the semantic information of the pre-defined ontology. Besides, previous generative models usually need a hand-crafted list to normalize the generated values. How to integrate the semantic information of the pre-defined ontology and the dialogue text (heterogeneous texts) to generate unknown values and improve performance is therefore a serious challenge. In this paper, we propose a Copy-Enhanced Heterogeneous Information Learning model with multiple encoder-decoders for DST (CEDST), which can generate all possible values, including unknown values, by copying values from heterogeneous texts. Meanwhile, CEDST decomposes the large state space into several small state spaces through a multi-encoder and employs a multi-decoder to make full use of the reduced spaces to generate values. The multi-encoder-decoder architecture significantly improves performance. Experiments show that CEDST achieves state-of-the-art results on two public datasets and on our constructed datasets with many unknown values.



1 Introduction

Task-oriented dialogue systems help users to achieve specific goals such as finding restaurants [16] and movie information retrieval [3]. DST is a core component in task-oriented dialogue systems. DST estimates user goals through the dialogue context. Dialogue systems use these user goals to decide the next system actions, which are used to generate natural language responses.

Typically, in DST tasks, all user goals are in a fixed domain ontology and represented by slot-value pairs. Consider the task of finding restaurants in Figure 1 as an example. During each turn, the user informs specific constraints (e.g. inform(price range = moderate)) or requests some information (e.g. request(address)). The constraints informed in one turn are called turn goals. Joint goals are the constraints informed in the current and previous turns. Values vary lexically and morphologically (e.g. moderate: [moderately, mid-priced]), as shown by the red-color word in Figure 1. Such rephrasings need to be converted into the standard form in the fixed ontology. Therefore, traditional DST extracts the possible standard slot-value pairs given the conversation context.

Figure 1: An example of DST in the restaurant-finding task. Given each system response (orange) and the following user utterance (blue), DST estimates joint and request goals. The red-color word shows a rephrasing of a value. The green-color word can be copied directly from the dialogue context as an unknown value.

In the research community, previous models with a fixed ontology are not scalable in many domains, because 1) the entire dataset is usually not exposed to us and we can only get a part of the values from the training set, and 2) such a fixed ontology is not scalable when the value set is dynamic (e.g. new movie added, new restaurant opened, etc.) and unbounded (e.g. date, volume etc.) [13]. The scalability of these models in real-world dialogues remains to be evaluated.

Dealing with values that don’t appear in the ontology is a critical but rarely mentioned capability of DST. Generative models with a pointer or copy mechanism can generate an unknown value by selecting words in the dialogue as the value [15]. However, the performance of common generative models is low because they only weakly consider the semantic information of the ontology. For example, [17] points out text spans from the dialogue as unknown values without referring to the ontology, and therefore cannot learn the non-pointer values. Besides, they used a hand-crafted list to exhaustively enumerate all rephrasings of each value. The non-pointer and non-enumerated values degrade performance.

In this paper, we propose a Copy-Enhanced Heterogeneous Information Learning model (CEDST), which can generate all possible values, both known and unknown, from the heterogeneous texts (dialogue context and ontology). Based on the finding that most values (e.g. 82% in the WoZ2 dataset) can be copied directly from the conversation, as shown by the green-color word in Figure 1, our proposed model is augmented with the copy mechanism [4] to copy only values in the normalized form. Moreover, CEDST maps the heterogeneous known values and dialogues into multiple vector spaces with a multi-encoder and uses a multi-decoder to extract dialogue states from these reduced spaces. The multi-encoder-decoder architecture extracts slot-related and slot-shared features. This architecture can effectively generate a reduced state space for each slot and learn contextual information to generate unknown values. Since there are no public DST datasets with unknown values, we construct datasets containing unknown values in different proportions. CEDST achieves state-of-the-art results on two public datasets and on the constructed datasets.

In summary, this paper makes the following contributions:

  • We apply the copy mechanism to DST tasks. Through this mechanism, CEDST effectively copies unknown values from the conversation for each slot.

  • We propose the multi-encoder-decoder architecture in CEDST to decompose the heterogeneous texts into multiple small spaces. The value for each slot is effectively selected from the corresponding space.

  • We conduct extensive experiments to evaluate the generative capability of models, and we construct DST datasets with unknown values in the development and test sets. CEDST achieves state-of-the-art performance on these datasets.

2 Related Work

Current dialogue state trackers usually employ E2E models. Typically, E2E models estimate the turn-level user goal and request given a system response and the following user utterance at every turn, and then estimate the joint goal by considering all previous dialogue turns.

2.1 Discriminative Models in DST

Typically, discriminative models use multi-class or multiple binary classifications to model the posterior probability of every candidate value.

[6] employs a strategy called delexicalisation, which uses hand-crafted semantic dictionaries to replace slots and values mentioned in dialogues with generic labels. The corpus is thereby converted to a form such as (’want a tagged-value restaurant.’). Many n-gram features are collected from the converted corpus. These features are used by the model to make a binary decision for every slot-value pair [6]. The neural belief tracker [10] automatically learns effective features from word embeddings, without the semantic dictionaries of the delexicalisation strategy. Memory networks have also been applied to automatically learn effective features [12]. The global-locally self-attentive model proposed by [19] extracts local features in each slot and global features between slots. These local and global features enable the model to focus more on rare slot-value pairs.

As far as we know, [13] is the only work using the discriminative model to handle the dynamic and unbounded value set. [13] represents the dialogue states by candidate sets derived from the dialogue and knowledge, then scores values in the candidate set with binary classifications. Although the sophisticated generation strategy of the candidate set allows the model to extract unknown values, [13] needs a separate SLU module and may propagate errors.

2.2 Generative Models in DST

Generative models with a pointer or copy mechanism can generate unknown values. [17] uses the pointer network to select a continuous text span in the system response and the following user utterance as the final dialogue state. However, it cannot deal with discontinuous and unclear values, and it ignores the semantic information in the ontology. In addition, [17] needs an additional post-normalization step to deal with lexical diversity.

3 Copy-Augmented Heterogeneous Information Learning

In this paper, DST is regarded as an E2E generative problem with heterogeneous texts. Given a user utterance and a system response, the model generates possible values from the utterance, the response, or the pre-defined ontology. In this section, we first introduce a general encoder-decoder framework with the copy mechanism [4], and then introduce CEDST.

3.1 Background: Encoder-Decoder with Copy Mechanism

The encoder-decoder framework was first introduced in sequence-to-sequence learning [15] and then quickly extended to many NLP tasks. In this framework, an encoder, usually a recurrent neural network such as long short-term memory (LSTM) [8], transforms the input sequence into a hidden representation, taking the word embedding of each input word as its input.

The decoder uses the word generated in the previous step and a vector calculated by the attention mechanism [1] to update its state. The target word is then sampled from a probability distribution computed from this state.
The copy mechanism allows the encoder-decoder model to copy words directly from the input sequence. The probability of generating a word is the sum of the probabilities of two modes: g, the generation mode, and c, the copy mode.
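For reference, the textbook form of this framework, as it is usually written for attention-based encoder-decoders with a copy mechanism, can be sketched as follows. All symbols here are assumed notation and may differ from the authors' original equations: $x_t$ is the embedding of the $t$-th input word, $h_t$ the encoder state, $s_t$ the decoder state, $c_t$ the attention vector, $y_t$ the target word, $M$ the set of encoder states, and $f_{\mathrm{enc}}$, $f_{\mathrm{dec}}$, $g_{\mathrm{out}}$ unspecified network functions.

```latex
\begin{align}
h_t &= f_{\mathrm{enc}}(x_t, h_{t-1}) \\
s_t &= f_{\mathrm{dec}}(y_{t-1}, s_{t-1}, c_t) \\
p(y_t \mid y_{<t}, X) &= g_{\mathrm{out}}(y_{t-1}, s_t, c_t) \\
p(y_t \mid s_t, y_{t-1}, c_t, M) &= p(y_t, \mathrm{g} \mid s_t, y_{t-1}, c_t, M) + p(y_t, \mathrm{c} \mid s_t, y_{t-1}, c_t, M)
\end{align}
```

The last line is the two-mode decomposition described above: a word's total probability is the sum of its probability under the generation mode and its probability under the copy mode.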

3.2 Overview

The overview of CEDST is presented in Figure 2. The context multi-encoder takes the user utterance and the system response as inputs at every turn and generates a specific hidden representation for each slot (purple strips in Figure 2). This hidden representation is used for language understanding and for the copy mechanism. The known-value multi-encoder encodes the known values in the ontology into the same vector space as the hidden representations of the utterance and the response for each slot. The multi-encoder bridges the representation gap between the words in the dialogue and the known values, which may consist of multiple words. Because the values of each slot are different, we generate a specific space for each slot. These spaces consist of the short-term memories shown in Figure 2. The multi-decoder attentively reads the hidden representations and, using the copy mechanism, selectively generates a known value or a word step by step for every slot.

Figure 2: The overview of CEDST. The context multi-encoder includes the user-utterance multi-encoder and the system-response multi-encoder. The known values are taken from the training set, and a value is generated for every slot. Multiple memories are generated, each according to its corresponding slot.

3.3 Multi-Encoder

The multi-encoder in CEDST is designed with a private and shared architecture, following the multi-encoder in [19], where it is used to extract features for a discriminative model. The private bi-directional LSTM (BiLSTM) focuses on extracting slot-related features. For example, the words ‘Chinese restaurant’ and ‘seafood’ are likely to be related to the ‘food’ slot. The shared BiLSTM focuses on extracting the features shared between slots, such as affirmative and negative words.


The proportion of the private and shared features is learned by a gate with a trainable parameter.


The private self attention learns the related context for each slot.


The shared self attention is computed in the same way as the private self attention, with parameters shared between slots. The final representation integrates the private and shared self attention.

Finally, the multi-encoder transforms each input into a hidden representation and a context vector.
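The gate and the self attention described above can be illustrated with a minimal pure-Python sketch. It follows the general private/shared pattern of [19] under stated assumptions: the gate is a single scalar `beta` in [0, 1], the attention scores come from one linear layer, and the function names (`gated_mix`, `self_attention_pool`) are hypothetical, not taken from the authors' code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_mix(private, shared, beta):
    """Blend slot-private and shared features with a gate beta in [0, 1].

    In a trained model beta would be a learned parameter (assumption);
    here it is passed in explicitly.
    """
    return [beta * p + (1.0 - beta) * s for p, s in zip(private, shared)]


def self_attention_pool(hidden, weights, bias=0.0):
    """Score every time step with a linear layer, normalise the scores with
    softmax, and return the attention-weighted sum of the hidden vectors."""
    scores = [sum(w * h for w, h in zip(weights, vec)) + bias for vec in hidden]
    probs = softmax(scores)
    dim = len(hidden[0])
    return [sum(p * vec[d] for p, vec in zip(probs, hidden)) for d in range(dim)]
```

For example, `gated_mix([1.0, 0.0], [0.0, 1.0], 0.25)` keeps a quarter of the private features and three quarters of the shared ones, and `self_attention_pool` with all-zero weights reduces to a plain average over time steps.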

Multi-encoders encode the known values, the system response, and the user utterance. The resulting hidden representations make up the memories in CEDST for each slot. Special values such as ‘don’t care’ and ‘none’, which indicate that the user doesn’t care about a slot or has no intent for it, are also encoded into hidden representations. By doing so, special cases in DST can be processed in a unified framework. As in previous works, the system response is replaced with system actions, and CEDST models the interaction between the system actions and the user utterance.

The slot-specific user utterance, system response, and known values make up the complete state space; hence the large state space of heterogeneous texts is decomposed into multiple spaces. These spaces are closely related to their corresponding slots. Because the special values are also in these spaces, every training sample can find its target there, which alleviates the inefficient training caused by data sparseness.

3.4 Multi-Decoder and Copy Mechanism

The multi-decoder is designed to effectively use these state spaces to generate values for every slot. As shown in Figure 3, the multi-decoder employs a private decoder for each slot and a shared decoder between slots to decode comprehensive information from the inputs. It updates its hidden state with the last predicted word and an attention vector as below:


where is a trainable parameter. is computed as below:


where is the concatenation of and .

Some slots have only one target value in a training sample (we call these single-value slots). For example, the user informs only one value or ‘none’ for the ‘food’ slot in the restaurant-finding task. The multi-decoder selects a word or a value for a single-value slot with the following distribution:


Here the distribution is computed over the short-term memory of the slot. The copy distribution is defined over the words of the dialogue, and the generation distribution over the values in the ontology. If the model selects a candidate value at the first step, it stops decoding and outputs that value in generation mode. If it copies a word from the utterance at the first step, it continues decoding in copy mode until the entire value has been generated. Therefore, the choice between generation mode and copy mode is made at the first step.
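The first-step decision for a single-value slot can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the first-step distribution is taken to be a single list whose first entries score the ontology values and whose remaining entries score the dialogue words, and `copy_next_word` is a hypothetical callback standing in for the later decoder steps (it returns the next copied word, or `None` to stop).

```python
def decode_single_value(first_step_probs, ontology_values, dialogue_words,
                        copy_next_word, max_len=5):
    """Pick the mode at the first step, then finish decoding accordingly."""
    best = max(range(len(first_step_probs)), key=first_step_probs.__getitem__)

    # Generation mode: an ontology value wins, so decoding stops immediately.
    if best < len(ontology_values):
        return ontology_values[best]

    # Copy mode: a dialogue word wins, so keep copying until the value ends.
    copied = [dialogue_words[best - len(ontology_values)]]
    while len(copied) < max_len:
        next_word = copy_next_word(copied)
        if next_word is None:
            break
        copied.append(next_word)
    return " ".join(copied)
```

With `ontology_values = ["none", "dontcare"]` and `dialogue_words = ["some", "lebanese", "food"]`, a distribution that peaks on the fourth entry selects copy mode and starts from the word "lebanese".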

A slot such as the ‘request’ slot has multiple values (a multi-value slot). It is difficult for a common decoder to handle multi-value slots because of the insufficient coverage problem, which may prevent the model from generating all values of the slot. Therefore, our multi-decoder generates multiple words or values in the first step only, to alleviate the problem, as below:


Words and values are selected in the first step whenever their probability is greater than a threshold (0.5 in our model). For a multi-value slot, the copy mechanism can thus copy all possible words at one time. The predicted values can then be obtained through some simple segmentation rules.
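A minimal sketch of the thresholded first-step selection is given below. The 0.5 threshold comes from the text; the segmentation rule is an assumption, since the paper does not specify it: here, copied words at adjacent positions are grouped into one value. The function names are hypothetical.

```python
def select_first_step(probs, threshold=0.5):
    """Return the positions whose first-step probability exceeds the threshold."""
    return [i for i, p in enumerate(probs) if p > threshold]


def group_contiguous(indices, words):
    """Assumed segmentation rule: adjacent copied positions form one value."""
    values, current, previous = [], [], None
    for i in sorted(indices):
        if previous is not None and i != previous + 1:
            values.append(" ".join(current))
            current = []
        current.append(words[i])
        previous = i
    if current:
        values.append(" ".join(current))
    return values


words = ["the", "phone", "number", "and", "address", "please"]
probs = [0.1, 0.9, 0.8, 0.2, 0.7, 0.1]
requested = group_contiguous(select_first_step(probs), words)
```

In this toy example `requested` holds two values, "phone number" and "address", both obtained in a single decoding step.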

For a copied value, the target is the index of the corresponding words in the dialogue; otherwise, it is the index of the value in the ontology. We use the cross-entropy loss to optimize the probability.
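The target construction can be sketched as follows, assuming (as an illustration only) that the candidate list places the ontology values first and the dialogue words after them, and that a value found in the dialogue is always trained as a copy target. `build_targets` and `cross_entropy` are hypothetical helper names.

```python
import math


def build_targets(value, ontology_values, dialogue_words):
    """Return target indices into the joint [ontology values + dialogue words] list."""
    tokens = value.split()
    n = len(tokens)
    # Prefer copy targets: look for the value as a contiguous span in the dialogue.
    for start in range(len(dialogue_words) - n + 1):
        if dialogue_words[start:start + n] == tokens:
            offset = len(ontology_values)
            return [offset + start + k for k in range(n)]
    # Otherwise fall back to the index of the value in the ontology.
    if value in ontology_values:
        return [ontology_values.index(value)]
    raise ValueError("value is neither in the dialogue nor in the ontology")


def cross_entropy(probs, target):
    """Negative log-likelihood of the target index under the distribution."""
    return -math.log(probs[target])
```

A rephrased value such as "moderate" uttered as "mid-priced" is not found in the dialogue, so its target falls back to the ontology index, which is how the normalized form is learned in this sketch.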

Figure 3: The architecture of the multi-decoder, shown generating the value for the ‘food’ slot. The inputs are the hidden representations of the user utterance and the system response corresponding to this slot, together with the short-term memory belonging to the ‘food’ slot.

4 Experiments

4.1 Dataset

4.1.1 Public Datasets

The Wizard of Oz 2 (WoZ2) dataset covers the task of finding restaurants around Cambridge [16, 10]. WoZ2 contains three single-value informable slots (’area’, ’food’, ’price range’) and a multi-value request slot. The API calls in Task 5 of the bAbI dataset are regarded as the states [2]. The out-of-vocabulary (OOV) test set in bAbI contains only unknown values.

4.1.2 Constructed Datasets with Unknown Values

The standard DST dataset WoZ2 can only evaluate the capability of models to process known values. Therefore, we manually construct datasets based on WoZ2. We randomly select known values in WoZ2 and mask them and their annotations in the training set, at ratios of 20%, 40%, and 60%. The selected values are treated as unknown values.
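The construction procedure can be sketched as below. The data layout (a list of turns with an `utterance` string and a `labels` dict), the `<unk>` mask token, and the function name are all assumptions made for illustration; the paper does not describe how the masking is implemented.

```python
import random


def build_unknown_split(train_turns, values, ratio, seed=0, mask_token="<unk>"):
    """Select a share of the known values and hide them from the training set."""
    rng = random.Random(seed)
    unknown = set(rng.sample(sorted(values), round(len(values) * ratio)))
    masked = []
    for turn in train_turns:
        text = turn["utterance"]
        for value in unknown:
            # Simplification: plain substring replacement, no tokenisation.
            text = text.replace(value, mask_token)
        # Drop the annotations of the selected values as well.
        labels = {s: v for s, v in turn["labels"].items() if v not in unknown}
        masked.append({"utterance": text, "labels": labels})
    return masked, unknown
```

The development and test sets are left untouched, so the selected values appear there without having been seen during training.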

4.2 Implementation Details

GloVe [11] and character n-gram embeddings [5] are used as fixed word embeddings. The loss is optimized by ADAM [9] with a fixed learning rate of 0.001. The parameters of our model are initialized by the Xavier initializer. We apply dropout [14] with a keep rate of 0.8 to the word embeddings and the outputs of the multi-encoder.

4.3 Comparison on Public Datasets

Models                | Joint Goal  | Turn Request
Delex.                | 70.8%       | 87.1%
Delex. + Dictionaries | 83.7%       | 87.6%
NBT - DNN             | 84.4%       | 91.2%
NBT - CNN             | 84.2%       | 91.6%
GLAD                  | 88.1 ± 0.4% | 97.1 ± 0.2%
PtrNet                | 87.5%       | -
CEDST                 | 89.6%       | 97.0%

Table 1: Results on WoZ2. Baselines are the delexicalisation-based model [6], delexicalisation with semantic dictionaries [16], NBT [10], GLAD [19] and PtrNet [17].

Model  | Slot     | Regular Test (p=0) | Regular Test (p=0.1) | OOV Test (p=0) | OOV Test (p=0.1)
PtrNet | Food     | 100%               | 100%                 | 86.2%          | 100%
PtrNet | Location | 100%               | 100%                 | 74.7%          | 99.6%
CEDST  | Food     | 100%               | -                    | 100%           | -
CEDST  | Location | 100%               | -                    | 100%           | -

Table 2: Results on bAbI. Comparison with the generative model PtrNet.

Models | 20% UNK Joint Goal | 20% UNK Turn Request | 40% UNK Joint Goal | 40% UNK Turn Request | 60% UNK Joint Goal | 60% UNK Turn Request
GLAD   | 51.9%              | 91.0%                | 29.2%              | 75.3%                | 11.0%              | 73.9%
CEDST  | 53.6%              | 91.2%                | 32.4%              | 75.6%                | 11.2%              | 73.9%

Table 3: The performance of models on constructed datasets. The ratios of unknown values (UNK) in the three datasets are 20%, 40%, 60%.

Table 1 shows the performance of models on WoZ2. Delex. is the delexicalisation-based model [6]. Delex. + Dictionaries is the delexicalisation-based model with semantic dictionaries to handle rephrasing [16]. [10] provides two neural belief trackers that learn features from word embeddings. GLAD uses a global-locally encoder and self attention to extract comprehensive features [19]. The metrics are joint goal accuracy and turn request accuracy. In CEDST, the joint goal is obtained by a simple rule that integrates the current and previous turn goals: CEDST keeps the value of the previous turn goal if the current turn does not produce a new value, and replaces the old value with the new one if one is obtained in the current turn.
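The joint-goal rule just described can be written as a short sketch. The dict-based state and the convention that a missing value or the special value 'none' means "no new value this turn" are assumptions made for illustration.

```python
def update_joint_goal(previous_joint, turn_goal):
    """Carry over old slot values and overwrite them when the turn gives a new one."""
    joint = dict(previous_joint)  # copy, so the previous state is not mutated
    for slot, value in turn_goal.items():
        if value is not None and value != "none":
            joint[slot] = value
    return joint


joint = {}
for turn in [{"food": "thai"}, {"area": "north", "food": "none"}, {"food": "italian"}]:
    joint = update_joint_goal(joint, turn)
```

After the three turns above, `joint` is `{"food": "italian", "area": "north"}`: the second turn keeps the earlier food value, and the third turn replaces it.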

On WoZ2, CEDST achieves a state-of-the-art joint goal accuracy of 89.6%, with a turn request accuracy of 97.0%. This shows that CEDST is good at generating known values. It also indicates that generating and using the reduced state spaces through the multi-encoder-decoder in CEDST is very effective. This architecture can significantly improve performance.

On the OOV test set of bAbI, CEDST outperforms PtrNet (the p=0 column in Table 2) by 13.8 and 25.3 percentage points on the ‘food’ and ‘location’ slots, respectively. This shows that CEDST is very effective at generating unknown values. Even when PtrNet adopts Targeted Feature Dropout to enhance its generative ability, CEDST still achieves the same or better performance.

4.4 Comparison on Constructed Datasets

Because GLAD [19] achieves high performance on WoZ2, we compare GLAD with CEDST on the constructed datasets. As shown in Table 3, CEDST also achieves state-of-the-art performance. Both models perform much worse on the constructed datasets than on WoZ2, even with only 20% unknown values. This shows that dealing with unknown values is an important capability of DST models. CEDST improves joint goal accuracy over GLAD by 1.7, 3.2 and 0.2 percentage points, respectively. On the dataset with 60% unknown values, both models get very little labeling information from the training set, which makes the performance very low.
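The reported gains can be checked directly against the joint goal numbers in Table 3. The snippet below is only an arithmetic check; the dictionary layout is an illustrative choice.

```python
# Joint goal accuracy (%) from Table 3, keyed by the ratio of unknown values.
joint_goal = {
    "20%": {"GLAD": 51.9, "CEDST": 53.6},
    "40%": {"GLAD": 29.2, "CEDST": 32.4},
    "60%": {"GLAD": 11.0, "CEDST": 11.2},
}

# Gain of CEDST over GLAD in percentage points, rounded to one decimal.
gains = {
    ratio: round(scores["CEDST"] - scores["GLAD"], 1)
    for ratio, scores in joint_goal.items()
}
```

The resulting gains are 1.7, 3.2 and 0.2 points for the 20%, 40% and 60% datasets, which matches the figures quoted in the text.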

Figure 4: The count of generated unknown values. UNK_ALL is the total number of generated unknown values. UNK_Correct is the count of correct unknown values.

To analyze the improvement of CEDST more clearly, we count the generated and the correct unknown values on the test set, as shown in Figure 4. CEDST generates the most correct unknown values on the 40% dataset. Because the 20% dataset contains many more known values than unknown values, CEDST focuses more on the target word itself and less on the context words. The works in [7] and [18] have shown that context words are essential for generating unknown values. On the 60% dataset, CEDST cannot get sufficient information from the training set and generates fewer unknown values. The proportion of correct to generated unknown values is 30.8%, 34.7% and 56.3% on the three datasets, which indicates the good scalability of CEDST.

4.5 Ablation Study

Models          | 40% UNK Joint Goal | 40% UNK Turn Request | WoZ2 Joint Goal | WoZ2 Turn Request
CEDST           | 32.4%              | 75.6%                | 89.6%           | 97.0%
- multi-encoder | 30.6%              | 75.3%                | 85.8%           | 97.0%
- multi-decoder | 31.2%              | 75.3%                | 87.9%           | 97.1%
- copy          | 29.4%              | 75.1%                | 87.1%           | 96.2%
- self att.     | 23.0%              | 75.3%                | 70.0%           | 95.5%
- shared LSTM   | 31.8%              | 75.7%                | 88.6%           | 97.3%

Table 4: Ablation experiments on WoZ2 and the 40% dataset. For ‘- multi-encoder’, we use a single BiLSTM to replace the multi-encoder. For ‘- multi-decoder’, we only use the shared decoder. For ‘- self att.’, we use the last hidden state to replace self-attention.

4.5.1 Multi-Encoder Converts Inputs into Effective Hidden

We use a single BiLSTM to replace the multi-encoder. The single BiLSTM extracts only shared features, which degrades performance. Private features are very important. For example, the word ‘part’ is related only to the ‘area’ slot. The multi-encoder extracts both private and shared features and thus learns very effective representations.

4.5.2 Multi-decoder Derives More Useful Information

The private and shared decoders can derive more information from the state spaces. Since the target value and the state space differ from slot to slot, the private decoder, which is dedicated to decoding the specific information of each slot, is very useful.

4.5.3 Copy Mechanism Generates Unknown Values Effectively

We remove the copy mechanism and the utterance hidden states. This disables the generation of unknown values and significantly reduces performance on the 40% dataset. Because the utterance hidden states are also removed, this model performs worse on WoZ2 as well. This shows that these hidden states promote dialogue understanding.

4.5.4 Self Attention

Self attention in the multi-encoder is used to obtain a context vector. We replace this context with the last hidden state of the encoder. This significantly degrades performance, which shows that self attention obtains the context effectively.

4.5.5 Shared LSTM

We remove the shared LSTM from the multi-encoder-decoder. This reduces joint goal accuracy by only 0.6 percentage points on the 40% dataset and 1.0 on WoZ2, and the turn request accuracy is even slightly higher. This indicates that, on these datasets, global features are less important than private features.

5 Conclusions

We propose the copy-augmented heterogeneous information learning model for dealing with unknown values in DST. The copy mechanism in CEDST can copy words in the dialogue context as unknown values. Meanwhile, the multi-encoder-decoder in CEDST can effectively decompose heterogeneous texts into multiple small spaces corresponding to the slots. CEDST can effectively generate values from these reduced spaces. CEDST achieves state-of-the-art performance on WoZ2, bAbI, and our constructed datasets.


This work is supported by the National Natural Science Foundation of China (No.61533018), the National Key R&D Program of China (No.2018YFC0830101), the National Natural Science Foundation of China (No.61702512, No.61806201) and the independent research project of the National Laboratory of Pattern Recognition. This work was also supported by Alibaba Group through the Alibaba Innovative Research (AIR) Program, the CCF-DiDi BigData Joint Lab and the CCF-Tencent Open Research Fund.