Toward Scalable Neural Dialogue State Tracking Model

12/03/2018 · by Elnaz Nouri, et al. · Microsoft, Salesforce

The latency of current neural dialogue state tracking models prohibits their efficient deployment in production systems, despite their highly accurate performance. This paper proposes a new scalable and accurate neural dialogue state tracking model, based on the recently proposed Global-Locally Self-Attentive encoder (GLAD) of Zhong et al., which uses global modules to share parameters between the estimators for the different types of dialogue states (called slots), and local modules to learn slot-specific features. By using only one recurrent network with global conditioning, compared to the (1 + # slots) recurrent networks with global and local conditioning used in the GLAD model, our proposed model reduces latency in training and inference by 35% on average, while preserving belief state tracking performance: 97.38% on turn request and 88.51% on joint goal accuracy. Evaluation on a multi-domain dataset (Multi-WoZ) also demonstrates that our model outperforms GLAD on turn inform and joint goal accuracy.




1 Introduction

Dialogue State Tracking (DST) is an important component of task-oriented dialogue systems. DST keeps track of the interaction's goal and of what has happened in the dialogue history. The majority of dialogue systems deployed in commercial settings, such as customer support systems and intelligent assistants like Amazon Alexa, Apple Siri, and Google Assistant, rely on dialogue state tracking. DST uses the information in the user utterance at each turn, the context from previous turns, other external information, and the output of the system at every turn. The decision made by the dialogue state tracker is then used to determine what action the system should take next. This is a critical role in the design of any task-oriented dialogue system.

State-of-the-art approaches to dialogue state tracking rely on deep learning models, which represent the dialogue state as a distribution over all candidate slot values defined in the ontology. Recently, several neural DST systems have been proposed.

Mrksic et al. (2017) proposed the Neural Belief Tracker (NBT), based on a binary decision for each slot-value pair, where representations of the user utterance, the system action, and the candidate pairs are computed from deep distributional word vectors. In their model, they used a deep neural network (DNN) and a convolutional network (CNN) to compute these representation vectors.

Wen et al. (2017) proposed a sequence-to-sequence model for estimating the next dialogue state. In their work, the encoded hidden vector of the user utterance is used to determine the current dialogue state, followed by a policy network that queries a knowledge database. The retrieved information is then used as a conditioning input to the decoder to generate the system response.

Recently, Zhong et al. (2018) proposed the Global-Locally Self-Attentive encoder (GLAD), a model based on training a binary classifier for each slot-value pair. It employs recurrence and self-attention over the utterance and the previous system actions, and measures the similarity of the resulting representations to each slot value. GLAD achieves state-of-the-art results on the WoZ (Wen et al., 2017) and DSTC2 (Williams et al., 2013) datasets.

Although the proposed neural models achieve state-of-the-art results on several benchmarks, they are still inefficient for deployment in production systems, due to the latency that stems from their use of recurrent networks. In this paper, we propose a new encoder that improves on the GLAD architecture (Zhong et al., 2018). The proposed encoder removes the slot-dependent recurrent networks from the utterance and system-action encoders, and instead conditions a single shared encoder globally on the slot-type embedding vector. By removing the slot-dependent recurrent networks, the proposed model preserves belief state tracking performance while reducing computational complexity. The encoder is described in detail in Section 2.

1.1 Related Work

A similar scalable dialogue state tracking model was proposed by Rastogi et al. (2017), based on conditioning the encoder input. They condition the user utterance representation on both the slot values (candidate sets) and the slot type, whereas our proposed model conditions only on the slot type. Our model is therefore simpler, since it contains a single conditioned encoder for the user utterance, while the model of Rastogi et al. (2017) requires two independent conditioned encoders.

Recently, Xu and Hu (2018) proposed a pointer-network model for handling unknown slot values, based on conditioning on the slot-type embedding. Our proposed model likewise relaxes the GLAD architecture toward handling unknown slot types during inference.

Figure 1: Proposed dialogue state tracking model with (a) the Globally-Conditioned Encoder (GCE), and (b) the overall state tracking model.

2 Proposed model

In this section, we describe the proposed model. First, section 2.1 explains the recently proposed GLAD encoder (Zhong et al., 2018) architecture, followed by our proposed encoder in section 2.2.

2.1 Global-Locally Self-Attentive Model

The GLAD model learns a binary classifier for each slot-value pair. In this architecture, separate encoders are used for the utterance, the previous system actions, and the slot values. The outputs of these encoders are then consumed by two scoring models, one over the utterance and one over the previous system actions, to predict the probability of each slot-value pair; that is, each scoring model computes the similarity of each slot value to the utterance representation or to the previous system actions. All encoders share the same global-locally self-attentive (GLAD) architecture. To compute the hidden representation of an input sequence and its summary (context), GLAD uses a combination of a bidirectional LSTM (Hochreiter and Schmidhuber, 1997), which computes the temporal representation, followed by a self-attention layer that extracts the context vector. To incorporate information about each slot, there is a dedicated recurrent and self-attention network per slot. Therefore, to estimate the probability distribution over the values of each slot, the GLAD encoder has to learn different hidden and context vectors for the utterance and the previous system actions.
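To make the global-local mixing concrete, here is a minimal numpy sketch (not the authors' implementation): linear maps stand in for the biLSTM modules to keep the example runnable, and the shapes, variable names, and mixing weight beta are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n_slots = 5, 8, 3          # tokens, hidden size, number of slots
X = rng.normal(size=(T, d))      # input token embeddings

# Stand-ins for the recurrent modules: one global, one per slot.
W_global = rng.normal(size=(d, d))
W_local = rng.normal(size=(n_slots, d, d))
w_attn = rng.normal(size=(n_slots + 1, d))   # self-attention scorers (last one is global)
beta = 0.5                       # learned per-slot mixing weight in GLAD

def self_attention(H, w):
    a = H @ w                                # one score per timestep
    p = np.exp(a - a.max()); p /= p.sum()    # softmax over timesteps
    return p @ H                             # context vector, shape (d,)

def glad_encode(X, s):
    H_g = np.tanh(X @ W_global)              # global module, shared across slots
    H_s = np.tanh(X @ W_local[s])            # local module, specific to slot s
    H = beta * H_s + (1 - beta) * H_g        # global-local mixing of hidden states
    c = (beta * self_attention(H, w_attn[s])
         + (1 - beta) * self_attention(H, w_attn[-1]))
    return H, c

H, c = glad_encode(X, s=0)
```

Note that a full GLAD model repeats the local pieces for every slot, which is exactly the cost the next section removes.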

2.2 Globally-Conditioned Encoder (GCE)

In this section, we describe the proposed globally-conditioned encoder (GCE) model. As in GLAD (Zhong et al., 2018), we learn slot-specific temporal and context representations of the user utterance and system actions. However, we address a limitation of the GLAD encoder: its use of slot-specific recurrent and self-attention layers. Our proposed encoder improves latency and inference speed by removing these inefficient slot-specific recurrent and self-attention layers, without degrading performance.

The proposed model removes the slot-specific recurrent and self-attention layers, and uses only the slot embedding vector s (for the s-th slot) as a conditioning input to the temporal and context extraction layers, as shown in Figure 1.


To compute the s-th slot-based representation H_s, the slot embedding s is concatenated with each sequence token x_i, i.e. of the user utterance or the previous system actions, as input to the recurrent layer, where concatenation is denoted as x_i ⊕ s. Then a slot-based attention score a_i is computed for each token hidden representation H_i, by concatenating it with the slot embedding and passing the result to a linear layer. In this way, the computed attention is conditioned on the slot embedding, so that it attends to the slot-relevant information in the input sequence X.

Therefore, the GCE encoder function can be represented as

H_s = biLSTM([x_1 ⊕ s, ..., x_n ⊕ s])
a_i = W (H_i ⊕ s) + b
p = softmax(a)
c_s = Σ_i p_i H_i

i.e. encode(X; s) = (H_s, c_s).
Encoding Modules:

Based on the definition of the proposed GCE encoder, the representations of the user utterance, the previous system actions, and the current slot-value pair are computed as

H^utt, c^utt = encode(U; s)
H^act_j, c^act_j = encode(A_j; s)
H^val, y^val = encode(V; s)

where U denotes the user utterance word embeddings, A_j is the j-th previous system action, and V is the current slot-value pair to be evaluated (e.g. food=italian).
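As a concrete illustration, here is a minimal numpy sketch of the conditioned encoding (again, not the released implementation): a linear map stands in for the single shared biLSTM, and all shapes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_tok, d_slot, d_hid = 5, 8, 4, 8
X = rng.normal(size=(T, d_tok))             # token embeddings (e.g. the utterance U)
s = rng.normal(size=d_slot)                 # slot-type embedding

# One shared encoder serves every slot; conditioning replaces per-slot networks.
W_enc = rng.normal(size=(d_tok + d_slot, d_hid))
w_attn = rng.normal(size=(d_hid + d_slot,))

def gce_encode(X, s):
    S = np.tile(s, (len(X), 1))
    H = np.tanh(np.concatenate([X, S], axis=1) @ W_enc)   # encode x_i (+) s
    a = np.concatenate([H, S], axis=1) @ w_attn           # slot-conditioned scores
    p = np.exp(a - a.max()); p /= p.sum()                 # softmax over timesteps
    c = p @ H                                             # context vector
    return H, c

H_utt, c_utt = gce_encode(X, s)             # slot-conditioned utterance encoding
```

The same `gce_encode` is reused for the utterance, each system action, and the slot value, which is what makes the parameter count independent of the number of slots.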

Scoring Model:

We follow the architecture proposed in GLAD (Zhong et al., 2018) for computing the score of each slot-value pair against the user utterance and the previous system actions.

To determine whether the user has mentioned a specific value of slot s, we compute slot-conditioned scores of its values against the utterance representation:

a_i^utt = (H_i^utt)^T y^val,  p^utt = softmax(a^utt),  q^utt = Σ_i p_i^utt H_i^utt,  y^utt = W q^utt + b
Similarly, to determine whether any slot-value pair mentioned in the previous system actions is being referred to in the current utterance, we compute slot-conditioned scores over the previous system actions:

a_j^act = (c_j^act)^T y^val,  p^act = softmax(a^act),  q^act = Σ_j p_j^act c_j^act,  y^act = W q^act + b
The final score for slot s is the weighted sum of the utterance-based and action-based scores, y^utt and y^act, normalized by the sigmoid function:

y = σ(y^utt + w · y^act)

where w is a learned parameter.
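The scoring step above can be sketched in a few lines of numpy. This is an illustrative sketch under assumed shapes, not the paper's code; `attend_and_score`, the scoring layer (W, b), and the weight w are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
H_utt = rng.normal(size=(5, d))   # slot-conditioned utterance hidden states
C_act = rng.normal(size=(2, d))   # context vectors of previous system actions
y_val = rng.normal(size=d)        # encoding of the candidate slot value
W, b = rng.normal(size=d), 0.0    # shared scoring layer
w = 0.3                           # learned weight on the action-based score

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attend_and_score(H, query):
    a = H @ query                            # similarity of each row to the value
    p = np.exp(a - a.max()); p /= p.sum()    # attention weights over rows
    return W @ (p @ H) + b                   # attend, then score as a scalar

# Probability that this slot-value pair is part of the current belief state.
y = sigmoid(attend_and_score(H_utt, y_val) + w * attend_and_score(C_act, y_val))
```

Running one such scorer per candidate slot-value pair yields the full belief state distribution.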

3 Experiment

In this section, we evaluate the proposed encoder on single- and multi-domain dialogue state tracking. The Wizard-of-Oz (WoZ) restaurant reservation dataset (Wen et al., 2017) is chosen for the single-domain setting, and performance is compared with recent neural belief tracking models. We also evaluate on the recently proposed multi-domain dataset, Multi-WoZ (Budzianowski et al., 2018), which consists of seven domains: restaurant, hotel, train, attraction, hospital, taxi, and police.

The evaluation metrics are turn-level request accuracy and joint goal accuracy. The joint goal is the accumulation of turn goals, as described in Zhong et al. (2018). Fixed pretrained GloVe embeddings (Pennington et al., 2014) with character n-gram embeddings (Hashimoto et al., 2017) are used in the embedding layer. The implementation details and code of the GCE model are publicly available.
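For clarity, the joint goal accumulation can be sketched as follows; this is a hypothetical helper illustrating the metric, not the evaluation code used in the paper.

```python
def joint_goal_accuracy(dialogues):
    """Each dialogue is a list of turns; each turn pairs the predicted turn-level
    goal with the gold accumulated (joint) goal, both as {slot: value} dicts."""
    correct = total = 0
    for turns in dialogues:
        state = {}
        for pred_turn_goal, gold_joint_goal in turns:
            state.update(pred_turn_goal)        # accumulate turn goals into the joint goal
            correct += int(state == gold_joint_goal)
            total += 1
    return correct / total

# Hypothetical two-turn dialogue: the tracker gets turn 1 right and turn 2 wrong.
dlg = [[({"food": "italian"}, {"food": "italian"}),
        ({"area": "north"}, {"food": "italian", "area": "south"})]]
print(joint_goal_accuracy(dlg))  # → 0.5
```

A turn counts as correct only if the entire accumulated state matches the gold joint goal, which is why joint goal accuracy is much lower than turn-level request accuracy.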


Table 1 shows the evaluation results on the WoZ dataset, indicating that the proposed GCE model performs on par with the GLAD model. To further compare the latency of GCE and GLAD during training and testing, we measure the computation time for a batch of turns and the overall epoch time during training. We also measure the complete test time on the WoZ test set, as shown in Table 2. Computation time is measured in seconds; GCE improves latency in both training and testing by 35% on average.
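The source of the speed-up can be illustrated with a rough count of recurrent-layer parameters. The sizes below are hypothetical (not the paper's hyperparameters), and the biLSTM formula is the standard 4-gate LSTM weight count doubled for two directions; only the qualitative comparison matters.

```python
# Approximate weight count of a biLSTM: 2 directions x 4 gates x ((d + h) * h + h).
def bilstm_params(d, h):
    return 2 * 4 * ((d + h) * h + h)

d_emb, d_slot, h, n_slots = 400, 400, 200, 35   # hypothetical sizes

glad = (1 + n_slots) * bilstm_params(d_emb, h)  # one global + one biLSTM per slot
gce = bilstm_params(d_emb + d_slot, h)          # a single slot-conditioned biLSTM

print(f"GLAD/GCE recurrent-parameter ratio: {glad / gce:.1f}x")
```

Even though GCE's single biLSTM takes a wider input (token embedding concatenated with the slot embedding), it avoids the (1 + # slots) factor, which dominates the recurrent cost as the ontology grows.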


Table 3 shows the evaluation on the Multi-WoZ dataset (Budzianowski et al., 2018). In this setting, we completely ignore the domain information and use the slot names only. The results indicate that the GCE model outperforms GLAD on turn inform and joint goal accuracy.

Model | Joint goal | Turn request
Delex. Model (Mrksic et al., 2017) | 70.8% | 87.1%
Delex. + Semantic Dictionary (Mrksic et al., 2017) | 83.7% | 87.6%
Neural Belief Tracker-DNN (Mrksic et al., 2017) | 84.4% | 91.2%
Neural Belief Tracker-CNN (Wen et al., 2017) | 84.2% | 91.6%
GLAD (Zhong et al., 2018) | 88.1 ± 0.4% | 97.1 ± 0.2%
GCE (Ours) | 88.51% | 97.38%
Table 1: Test accuracy on the WoZ restaurant reservation dataset.
Model | Train turn (sec.) | Train total (sec.) | Test turn (sec.) | Test total (sec.)
GLAD (Zhong et al., 2018) | 1.78 | 89 | 2.32 | 76
GCE (Ours) | 1.16 | 60 | 1.92 | 63
Table 2: Time complexity for each batch of turns, and per train and test epoch, on the WoZ dataset. Each batch contains 50 turns. All numbers are in seconds.
Model | Split | Turn inform | Joint goal
GLAD (Zhong et al., 2018) | Dev | 66.91% | 34.83%
GLAD (Zhong et al., 2018) | Test | 66.89% | 35.57%
GCE (Ours) | Dev | 67.78% | 37.42%
GCE (Ours) | Test | 67.88% | 35.58%
Table 3: Performance on the multi-domain dataset, Multi-WoZ (Budzianowski et al., 2018).

4 Conclusion

In this paper, we proposed a neural model for dialogue state tracking. By globally conditioning the encoder on slot types (GCE), slot-conditioned representations are computed for the user utterance and previous system actions, and used to score the mentioned slot values. By relaxing the GLAD model's slot-specific recurrent networks and self-attentions, our model achieves lower computational complexity with better accuracy. We also showed that the GCE model generalizes to multi-domain dialogue state tracking, via evaluation on the Multi-WoZ dataset.


  • Budzianowski et al. (2018) P. Budzianowski, T.-H. Wen, B.-H. Tseng, I. Casanueva, S. Ultes, O. Ramadan, and M. Gašić. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. 2018.
  • Hashimoto et al. (2017) K. Hashimoto, C. Xiong, Y. Tsuruoka, and R. Socher. A joint many-task model: Growing a neural network for multiple NLP tasks. In EMNLP, 2017.
  • Hochreiter and Schmidhuber (1997) S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
  • Mrksic et al. (2017) N. Mrkšić, D. Ó. Séaghdha, T.-H. Wen, B. Thomson, and S. J. Young. Neural belief tracker: Data-driven dialogue state tracking. In ACL, 2017.
  • Pennington et al. (2014) J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543, 2014.
  • Rastogi et al. (2017) A. Rastogi, D. Hakkani-Tür, and L. Heck. Scalable multi-domain dialogue state tracking. In ASRU, pages 561–568, 2017.
  • Wen et al. (2017) T.-H. Wen, L. M. Rojas-Barahona, M. Gašić, N. Mrkšić, P.-H. Su, S. Ultes, S. J. Young, and D. Vandyke. A network-based end-to-end trainable task-oriented dialogue system. In EACL, 2017.
  • Williams et al. (2013) J. D. Williams, A. Raux, D. Ramachandran, and A. W. Black. The dialog state tracking challenge. In SIGDIAL, 2013.
  • Xu and Hu (2018) P. Xu and Q. Hu. An end-to-end approach for handling unknown slot values in dialogue state tracking. In ACL, 2018.
  • Zhong et al. (2018) V. Zhong, C. Xiong, and R. Socher. Global-locally self-attentive dialogue state tracker. In ACL, 2018.