Task-oriented dialogue systems help users achieve specific goals such as finding restaurants and retrieving movie information. Dialogue state tracking (DST) is a core component of task-oriented dialogue systems: it estimates user goals from the dialogue context. Dialogue systems use these goals to decide the next system actions, which in turn drive natural language response generation.
Typically, in DST tasks, all user goals lie in a fixed domain ontology and are represented by slot-value pairs. Consider the task of finding restaurants in Figure 1 as an example. During each turn, the user informs specific constraints (e.g. inform(price range = moderate)) or requests some information (e.g. request(address)). The constraints informed in one turn are called turn goals; joint goals are the constraints informed in the current and all previous turns. Values vary lexically and morphologically (e.g. moderate: [moderately, mid-priced]), as shown by the red words in Figure 1, and such rephrasings must be converted into the standard form in the fixed ontology. Traditional DST therefore extracts the possible standard slot-value pairs given the conversation context.
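As a minimal illustration of this setup, normalizing rephrasings into canonical slot-value pairs can be sketched as follows (the value mapping here is made up for illustration, not the actual WoZ2 ontology):

```python
# Hypothetical sketch: surface rephrasings are mapped back to the single
# canonical form defined in the fixed ontology. The mapping below is
# illustrative only.
CANONICAL = {
    "moderately": "moderate",
    "mid-priced": "moderate",
}

def to_state(inform_acts):
    """Turn inform acts like ("price range", "mid-priced") into a
    slot-value dialogue state with normalized values."""
    return {slot: CANONICAL.get(value, value) for slot, value in inform_acts}

state = to_state([("price range", "mid-priced"), ("food", "chinese")])
# state["price range"] is the canonical form "moderate"
```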
Previous models built on a fixed ontology do not scale to many domains, because 1) the entire value set is usually not exposed to us and we can only observe a part of it in the training set, and 2) a fixed ontology cannot cover value sets that are dynamic (e.g. a new movie added, a new restaurant opened) or unbounded (e.g. date, volume). The scalability of these models in real-world dialogues remains to be evaluated.
Dealing with values that do not appear in the ontology is a critical but rarely discussed capability of DST. Generative models with a pointer or copy mechanism can produce an unknown value by selecting words from the dialogue. However, the performance of common generative models is low because they only weakly consider the semantic information in the ontology. For example, the pointer-network approach points out text spans from the dialogue as unknown values without referring to the ontology, and thus cannot learn non-pointer values. Besides, it uses a hand-crafted list to exhaustively enumerate all rephrasings of each value. The non-pointer and non-enumerated values degrade performance.
In this paper, we propose a Copy-Enhanced heterogeneous information learning model (CEDST), which can generate all possible values, both known and unknown, from heterogeneous texts (the dialogue context and the ontology). Motivated by the finding that most values (e.g. 82% in the WoZ2 dataset) can be copied directly from the conversation, as shown by the green words in Figure 1, our model is augmented with a copy mechanism that copies only values in their normalized form. Moreover, CEDST maps the heterogeneous known values and dialogues into multiple vector spaces with a multi-encoder and uses a multi-decoder to extract dialogue states from these reduced spaces. The multi-encoder-decoder architecture extracts slot-related and slot-shared features; it effectively generates a reduced state space for each slot and learns contextual information to generate unknown values. Since there are no public DST datasets with unknown values, we construct datasets containing unknown values in different proportions. CEDST achieves state-of-the-art results on both the public and the constructed datasets.
In summary, this paper makes the following contributions:
We apply the copy mechanism to DST tasks. Through this mechanism, CEDST effectively copies unknown values from the conversation for each slot.
We propose the multi-encoder-decoder architecture in CEDST to decompose the heterogeneous texts into multiple small spaces. The value for each slot is then effectively selected from the corresponding space.
We conduct extensive experiments to evaluate the generative capability of models, and we construct DST datasets with unknown values in the development and test sets. CEDST achieves state-of-the-art performance on these datasets.
2 Related Work
Current dialogue state trackers usually employ end-to-end (E2E) models. Typically, an E2E model estimates the turn-level user goal and request given the system response and the following user utterance at every turn, and then estimates the joint goal by considering all previous dialogue turns.
2.1 Discriminative Models in DST
Typically, discriminative models perform multi-class or multiple binary classifications to model the posterior probability of every candidate value. One line of work employs a strategy called delexicalisation, which uses hand-crafted semantic dictionaries to replace slots and values mentioned in dialogues with generic labels; the corpus is thereby converted into forms such as 'want a tagged-value restaurant.'. Many n-gram features are collected from the converted corpus, and the model uses these features to make a binary decision for every slot-value pair. The neural belief tracker automatically learns effective features from word embeddings without the semantic dictionaries required by delexicalisation. Memory networks have also been applied to learn effective features automatically. The global-locally self-attentive model (GLAD) extracts local features within each slot and global features shared between slots. These local and global features enable the model to focus more on rare slot-value pairs.
As far as we know, only one prior work uses a discriminative model to handle dynamic and unbounded value sets. It represents dialogue states by candidate sets derived from the dialogue and external knowledge, then scores the values in each candidate set with binary classifications. Although its sophisticated candidate-set generation strategy allows the model to extract unknown values, it needs a separate spoken language understanding (SLU) module and may propagate its errors.
2.2 Generative Models in DST
Generative models with a pointer or copy mechanism can generate unknown values. The pointer network selects a continuous text span from the system response and the following user utterance as the final dialogue state. However, it cannot deal with discontinuous or unclear values and ignores the semantic information in the ontology. In addition, it needs an additional post-normalization step to deal with lexical diversity.
3 Copy-Augmented Heterogeneous Information Learning
In this paper, DST is regarded as an E2E generative problem over heterogeneous texts. Given a user utterance and a system response, the model generates possible values from them or from the pre-defined ontology. In this section, we first introduce a general encoder-decoder framework with a copy mechanism, and then introduce CEDST.
3.1 Background: Encoder-Decoder with Copy Mechanism
The encoder-decoder framework, first introduced for sequence-to-sequence learning, transforms the input sequence x = (x_1, ..., x_T) into hidden states as below:

h_t = f(e(x_t), h_{t-1})

where e(x_t) is the word embedding of x_t and f is a recurrent unit such as an LSTM.

The decoder uses the word y_{t-1} generated in the last step and a vector c_t calculated by the attention mechanism to update its state s_t:

s_t = f(e(y_{t-1}), s_{t-1}, c_t)

The target word y_t is sampled from the following probability distribution:

p(y_t | y_{<t}, x) = softmax(W s_t)

The copy mechanism allows the encoder-decoder model to copy words directly from the input sequence. The generation probability combines the two modes as below:

p(y_t | s_t, y_{t-1}, c_t, x) = p_g(y_t | s_t, y_{t-1}, c_t) + p_c(y_t | s_t, x)

where g represents the generation mode, and c the copy mode.
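A minimal sketch of this mixed generate/copy distribution, assuming the unnormalized generate-mode and copy-mode scores have already been computed by the decoder (a CopyNet-style joint normalization; the score computation itself is omitted):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_copy_generate(gen_scores, copy_scores, vocab, source_words):
    """Normalize generate-mode scores (over the vocabulary) and copy-mode
    scores (over source positions) jointly, then sum probability mass that
    lands on the same surface word (a source word may also be in the vocab)."""
    probs = softmax(gen_scores + copy_scores)
    dist = {}
    for word, p in zip(vocab + source_words, probs):
        dist[word] = dist.get(word, 0.0) + p
    return dist
```

With uniform scores, a word that appears both in the vocabulary and in the source receives mass from both modes, which is exactly how the two modes are merged into one output distribution.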
3.2 Model Overview
The overview of CEDST is presented in Figure 2. The context multi-encoder takes the user utterance and system response as inputs during every turn and generates slot-specific hiddens (purple strips in Figure 2), which are used for language understanding and the copy mechanism. The known-value multi-encoder encodes the known values in the ontology into the same vector space as the hiddens of the utterance and response for each slot; it thus bridges the representation gap between words in the dialogue and known values that may consist of multiple words. Because each slot has different values, we generate a specific space for each slot. These spaces consist of the short-term memories shown in Figure 2. The multi-decoder attentively reads the hiddens and, with the copy mechanism, selectively generates a known value or a word step by step for every slot.
3.3 Multi-Encoder
The multi-encoder in CEDST is designed with a private and shared architecture, following prior multi-encoders used to extract features for a discriminator. A private bi-directional LSTM (BiLSTM) focuses on extracting slot-related features; for example, the words 'Chinese restaurant' and 'seafood' are likely to be related to the 'food' slot. A shared BiLSTM focuses on extracting features shared between slots, such as affirmative and negative words:
The proportion of the private and shared features is learned by a gate with a trainable parameter.
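This gating can be sketched as follows, assuming a single scalar gate parameter w (the actual parameterization in CEDST may differ):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_private_shared(private_h, shared_h, w):
    """Blend slot-private and slot-shared encoder features with a learned
    gate: h = g * private + (1 - g) * shared, where g = sigmoid(w)."""
    g = sigmoid(w)
    return [g * p + (1.0 - g) * s for p, s in zip(private_h, shared_h)]
```

With w = 0 the gate is 0.5 and the two feature vectors are averaged; training moves w toward whichever feature type helps the slot more.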
The private self attention learns the related context for each slot.
The shared self attention is computed in the same way as the private self attention above, with parameters shared between slots. The final representation is the integration of the private and shared self attention:
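The self-attentive pooling itself can be sketched as follows (the per-step scores are assumed to come from a learned scoring layer, which is omitted here):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attentive_context(hiddens, scores):
    """Pool a sequence of hidden vectors into one context vector using
    softmax-normalized attention weights over the time steps."""
    weights = softmax(scores)
    dim = len(hiddens[0])
    return [sum(w * h[i] for w, h in zip(weights, hiddens)) for i in range(dim)]
```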
Finally, the multi-encoder transforms each input into a hidden state and a context vector.
The multi-encoders encode the known values, the system response, and the user utterance. The hiddens make up the per-slot memories in CEDST. Special values such as 'don't care' and 'none', which indicate that the user does not care about or has no intent toward a slot, are also encoded into hiddens, so special cases in DST can be processed in a unified framework. Following previous work, the system response is replaced with system actions. CEDST models the interaction between the system actions and the user utterance as below:
The slot-specific user utterance, system response, and known values make up the complete state space; hence the large state space of heterogeneous texts is decomposed into multiple spaces, each closely related to the corresponding slot. Because the special values are also in these spaces, every training sample can find its target there, which alleviates the inefficient-training problem caused by data sparseness.
3.4 Multi-Decoder and Copy Mechanism
The multi-decoder is designed to effectively utilize these state spaces to generate values for every slot. As shown in Figure 3, it employs a private decoder for each slot and a shared decoder between slots to decode comprehensive information from the inputs. It updates its hidden state with the last predicted word and the attention vector as below:
where the mixing weight is a trainable parameter. The attention vector is computed as below:
where is the concatenation of and .
Some slots have only one target value per training sample (we call these single-value slots). For example, the user informs only one value, or 'none', for the 'food' slot in the restaurant-finding task. The multi-decoder selects a word or a value for a single-value slot with the following distribution:
where the short-term memory is the per-slot memory described above; one term gives the probability distribution over the dialogue and the other the distribution over the ontology. If the model selects a candidate value at the first step, it stops subsequent decoding and generates the value through the ontology distribution. If it copies a word in the utterance at the first step, it continues decoding until the entire value has been generated through the dialogue distribution. The generate and copy modes are therefore determined at the first step.
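A sketch of this first-step mode decision, assuming the two distributions are the decoder's probabilities over dialogue words and over ontology values respectively:

```python
def first_step_decision(dialogue_dist, ontology_dist):
    """At the first decoding step, take the most probable item from the joint
    space of dialogue words (copy mode) and ontology values (generate mode);
    the winning side fixes the mode for the rest of decoding."""
    best_word = max(dialogue_dist, key=dialogue_dist.get)
    best_value = max(ontology_dist, key=ontology_dist.get)
    if ontology_dist[best_value] >= dialogue_dist[best_word]:
        return ("generate", best_value)  # whole value emitted, decoding stops
    return ("copy", best_word)           # keep copying words from the dialogue
```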
Slots such as the 'request' slot can take multiple values (multi-value slots). A common decoder struggles to handle multi-value slots because of the insufficient-coverage problem, which may prevent the model from generating all values of the slot. Therefore, our multi-decoder generates multiple words or values only at the first step to alleviate this problem, as below:
Words and values are selected at the first step as long as their probability exceeds a threshold (0.5 in our model). With the copy mechanism, a multi-value slot can copy all possible words at once, and the predicted values can then be obtained through simple segmentation rules.
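A sketch of the thresholded first-step selection for multi-value slots (the probabilities are assumed to be per-candidate, e.g. from a sigmoid, so several can exceed 0.5 at once):

```python
def select_multi_values(candidate_probs, threshold=0.5):
    """Select every candidate whose probability exceeds the threshold at the
    first decoding step of a multi-value slot."""
    return sorted(c for c, p in candidate_probs.items() if p > threshold)
```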
For copied values, the target is the index of the corresponding words in the dialogue; otherwise, it is the index of the value in the ontology. We use the cross-entropy loss to optimize the probability.
4.1.1 Public Datasets
The Wizard of Oz 2 (WoZ2) dataset is used for finding restaurants around Cambridge [16, 10]. WoZ2 contains three single-value informable slots ('area', 'food', 'price range') and a multi-value request slot. The API calls in Task 5 of the bAbI dataset are regarded as the state. The out-of-vocabulary (OOV) test set in bAbI contains only unknown values.
4.1.2 Constructed Datasets with Unknown Values
The standard DST dataset WoZ2 can only evaluate the capability of models to process known values. Therefore, we manually construct datasets based on WoZ2: we randomly select known values in WoZ2 and mask them and their annotations in the training set at ratios of 20%, 40%, and 60%. The selected values are treated as unknown values.
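The construction procedure can be sketched as follows (a simplification of the procedure above; the sample and label formats are assumptions for illustration):

```python
import random

def build_unk_dataset(train_samples, known_values, unk_ratio, seed=0):
    """Randomly mark a fraction of the known values as unknown and drop the
    training annotations that mention them, so those values never appear
    labeled at training time. Samples are (utterance, [(slot, value), ...])."""
    rng = random.Random(seed)
    n_unk = int(len(known_values) * unk_ratio)
    unk = set(rng.sample(sorted(known_values), n_unk))
    kept = [(utt, labels) for utt, labels in train_samples
            if not any(value in unk for _, value in labels)]
    return kept, unk
```

At test time the masked values still appear with their gold annotations, so the development and test sets contain values never seen labeled during training.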
4.2 Implementation Details
The fixed word representations are pre-trained word embeddings concatenated with character n-gram embeddings. The loss is optimized by ADAM with a fixed learning rate of 0.001. The parameters of our model are initialized with the Xavier initializer. We apply dropout with a keep rate of 0.8 to the word embeddings and the outputs of the multi-encoder.
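For reference, the Xavier (Glorot) uniform initialization used here can be sketched in plain Python, assuming the standard U(-a, a) form with a = sqrt(6 / (fan_in + fan_out)):

```python
import math
import random

def xavier_uniform(fan_in, fan_out, seed=0):
    """Xavier/Glorot uniform initializer: draw each weight from U(-a, a)
    with a = sqrt(6 / (fan_in + fan_out)), which keeps activation variance
    roughly constant across layers."""
    a = math.sqrt(6.0 / (fan_in + fan_out))
    rng = random.Random(seed)
    return [[rng.uniform(-a, a) for _ in range(fan_out)] for _ in range(fan_in)]
```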
4.3 Comparison on Public Datasets
| Models                | Joint Goal  | Turn Request |
|-----------------------|-------------|--------------|
| Delex. + Dictionaries | 83.7%       | 87.6%        |
| NBT - DNN             | 84.4%       | 91.2%        |
| NBT - CNN             | 84.2%       | 91.6%        |
| GLAD                  | 88.1 ± 0.4% | 97.1 ± 0.2%  |
[Table (bAbI): Model | Slot | Regular Test | OOV Test]
[Table 2 (constructed datasets): Models | Joint Goal and Turn Request under 20% UNK | 40% UNK | 60% UNK]
Table 1 shows the performance of models on WoZ2. Delex. is the delexicalisation-based model; Delex. + Dictionaries additionally uses semantic dictionaries to handle rephrasing. NBT-DNN and NBT-CNN are two neural belief trackers that learn features from word embeddings. GLAD utilizes a global-locally self-attentive encoder to extract comprehensive features. The metrics are joint goal accuracy and turn request accuracy. In CEDST, the joint goal is obtained by a simple rule that integrates the current and previous turn goals: CEDST keeps the value from the previous turn goal if the current turn does not yield a new value, and replaces the old value with the new one otherwise.
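The joint-goal update rule described above amounts to the following sketch:

```python
def integrate_turn_goal(prev_joint_goal, turn_goal):
    """Keep each slot's previous value unless the current turn supplies a
    new one, in which case the new value overrides the old."""
    joint = dict(prev_joint_goal)
    for slot, value in turn_goal.items():
        if value is not None:
            joint[slot] = value
    return joint
```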
On WoZ2, CEDST achieves state-of-the-art results: 89.6% joint goal accuracy and 97.0% turn request accuracy. This shows that CEDST is good at generating known values, and that the generation and utilization of the reduced multiple state spaces by the multi-encoder-decoder are very effective.
On bAbI, CEDST outperforms PtrNet by 13.8 and 25.3 points on the 'food' and 'location' slots, which shows that CEDST is very effective at generating unknown values. Even though PtrNet adopts Targeted Feature Dropout to enhance its generative ability, CEDST still achieves equal or better performance.
4.4 Comparison on Constructed Datasets
Since GLAD achieves high performance on WoZ2, we compare GLAD with CEDST on the constructed datasets. As shown in Table 2, CEDST again achieves state-of-the-art performance, improving by 1.7%, 3.2%, and 0.2% on the 20%, 40%, and 60% datasets respectively. Both models achieve much lower performance on the constructed datasets than on WoZ2, even with only 20% unknown values, which shows that dealing with unknown values is an important capability for DST models. On the dataset with 60% unknown values, both models obtain very little labeling information from the training set, which keeps performance very low.
To analyze the improvement of CEDST more clearly, we count the generated and correct unknown values on the test set, as shown in Figure 4. CEDST generates the most correct unknown values on the 40% dataset. Because known values greatly outnumber unknown values in the 20% dataset, CEDST focuses more on the target word itself and less on context words; prior work has shown that context words are essential for generating unknown values. On the 60% dataset, CEDST cannot obtain sufficient information from the training set and generates fewer unknown values. The proportion of correct to generated unknown values is 30.8%, 34.7%, and 56.3% on the 20%, 40%, and 60% datasets respectively, which demonstrates the good scalability of CEDST.
4.5 Ablation Study
| Models        | 40% UNK Joint Goal | 40% UNK Turn Request | WoZ2 Joint Goal | WoZ2 Turn Request |
|---------------|--------------------|----------------------|-----------------|-------------------|
| - self att.   | 23.0%              | 75.3%                | 70.0%           | 95.5%             |
| - shared LSTM | 31.8%              | 75.7%                | 88.6%           | 97.3%             |
4.5.1 Multi-Encoder Converts Inputs into Effective Hidden
We replace the multi-encoder with a single BiLSTM. The single BiLSTM extracts only shared features and degrades performance, showing that private features are very important; for example, the word 'part' is mainly related to the 'area' slot. The multi-encoder extracts both private and shared features and thus learns very effective representations.
4.5.2 Multi-decoder Derives More Useful Information
The private and shared decoders derive more information from the state spaces. Since the target values and the state spaces differ between slots, the private decoder, which is dedicated to decoding slot-specific information, is very useful.
4.5.3 Copy Mechanism Generates Unknown Values Effectively
We remove the copy mechanism and the utterance hiddens from the memory. This disables the generation of unknown values and significantly reduces performance on the 40% dataset. Because the utterance hiddens are also removed, this variant degrades performance on WoZ2 as well, which shows that these hiddens promote dialogue understanding.
4.5.4 Self Attention
Self attention in the multi-encoder is used to obtain the context hidden. Replacing this context with the last hidden of the encoder significantly degrades performance, which shows that self attention obtains the context effectively.
4.5.5 Shared LSTM
We remove the shared LSTM from the multi-encoder-decoder. This reduces joint goal accuracy by only 0.6% on the 40% dataset and 1.0% on WoZ2, and even yields slightly higher request accuracy. This suggests that, on these datasets, the shared (global) features are less important than the private features.
5 Conclusion
We propose the copy-augmented heterogeneous information learning model (CEDST) for dealing with unknown values in DST. Its copy mechanism can copy words from the dialogue context as unknown values, while its multi-encoder-decoder effectively decomposes heterogeneous texts into multiple small spaces corresponding to the slots, from which CEDST generates values. CEDST achieves state-of-the-art performance on WoZ2, bAbI, and our constructed datasets.
This work is supported by the National Natural Science Foundation of China (No.61533018), the National Key R&D Program of China (No.2018YFC0830101), the National Natural Science Foundation of China (No.61702512, No.61806201) and the independent research project of National Laboratory of Pattern Recognition. This work was also supported by Alibaba Group through the Alibaba Innovative Research (AIR) Program, the CCF-DiDi BigData Joint Lab and the CCF-Tencent Open Research Fund.
-  Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473 (2014), http://arxiv.org/abs/1409.0473
-  Bordes, A., Weston, J.: Learning end-to-end goal-oriented dialog. CoRR (2016)
-  Dhingra, B., Li, L., Li, X., Gao, J., Chen, Y.N., Ahmed, F., Deng, L.: Towards end-to-end reinforcement learning of dialogue agents for information access. In: Proceedings of ACL. pp. 484–495 (2017)
-  Gu, J., Lu, Z., Li, H., Li, V.O.: Incorporating copying mechanism in sequence-to-sequence learning. In: Proceedings of ACL. pp. 1631–1640 (2016)
-  Hashimoto, K., Xiong, C., Tsuruoka, Y., Socher, R.: A joint many-task model: Growing a neural network for multiple nlp tasks. arXiv preprint arXiv:1611.01587 (2016)
-  Henderson, M., Thomson, B., Young, S.: Word-based dialog state tracking with recurrent neural networks. In: Proceedings of SIGDIAL. pp. 292–299 (2014)
-  Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
-  Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation pp. 1735–1780 (1997)
-  Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
-  Mrkšić, N., Ó Séaghdha, D., Wen, T.H., Thomson, B., Young, S.: Neural belief tracker: Data-driven dialogue state tracking. In: Proceedings of ACL. pp. 1777–1788 (2017)
-  Pennington, J., Socher, R., Manning, C.: Glove: Global vectors for word representation. In: Proceedings of EMNLP. pp. 1532–1543 (2014)
-  Perez, J., Liu, F.: Dialog state tracking, a machine reading approach using memory network. In: Proceedings of EACL. pp. 305–314 (2017)
-  Rastogi, A., Hakkani-Tür, D., Heck, L.P.: Scalable multi-domain dialogue state tracking. CoRR abs/1712.10224 (2017)
-  Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research pp. 1929–1958 (2014)
-  Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Proceedings of NIPS. pp. 3104–3112 (2014)
-  Wen, T.H., Vandyke, D., Mrkšić, N., Gasic, M., Rojas Barahona, L.M., Su, P.H., Ultes, S., Young, S.: A network-based end-to-end trainable task-oriented dialogue system. In: Proceedings of EACL. pp. 438–449 (2017)
-  Xu, P., Hu, Q.: An end-to-end approach for handling unknown slot values in dialogue state tracking. In: Proceedings of ACL. pp. 1448–1457 (2018)
-  Xu, P., Sarikaya, R.: Targeted feature dropout for robust slot filling in natural language understanding. In: Proceedings of ISCA (2014)
-  Zhong, V., Xiong, C., Socher, R.: Global-locally self-attentive encoder for dialogue state tracking. In: Proceedings of ACL. pp. 1458–1467 (2018)