Slot-filling based spoken language understanding (SLU) systems are often a central component of conversational systems. A major challenge in the slot-filling paradigm is handling conversational context, where a user utterance can refer back to a set of slots implicitly or explicitly.
Traditionally, the contextual interpretation of slots has been cast as a coreference resolution problem. There is a rich body of work on coreference resolution for written text, relying on clustering [2, 3, 4] or on ranking mention pairs [5, 6]. These approaches have been extended to spoken dialog by adding discourse-specific features [7, 8, 9, 10, 11, 12]. However, most of these approaches follow a pipelined model of mention detection followed by coreference resolution, where linguistic, syntactic and discourse features are usually applied. In contrast, our proposed formulation does not rely on explicit linguistic features such as gender and type agreement, which are hard to acquire across languages. Furthermore, we can naturally generalize the solution to sub-tasks such as zero pronouns, as we do not have to explicitly identify the anaphoric mentions.
Another challenge is dealing with multi-domain language understanding systems, where each domain has its own schema to represent slots and intents. Domains are developed mostly independently and hence do not share a common schema. Also, dialog assistants are now being extended by community developers using services such as Google DialogFlow (https://dialogflow.com/, formerly API.AI) or the Alexa Skills Kit. Domains developed by external developers are completely outside any central repository of domains and slots, so no assumption can be made about their schemas. The lack of a shared schema makes it hard to maintain contextual slots across domain boundaries. Table 1 shows an example conversation a user may have with such a dialog system. In this example, the user interacts with three domains: Weather, LocalSearch and Traffic, each with its own schema to represent slots. The user starts off by asking for the weather in a city, follows up with a question about restaurants serving Mexican cuisine, and finally asks for directions to a restaurant. This example showcases the challenges of a multi-domain system. Weather and LocalSearch use different schemas to store information about the location, using the slot keys WeatherLocation and City respectively. To carry the conversation across domains from U1 to U2, we need to be able to transform the slot [WeatherLocation: san francisco] into [City: san francisco], without having access to a common schema. Even within a domain the schemas can diverge: in U2 and V2, the domain chooses different schemas to represent its user and system turns. To make the task more complex, some domains choose to represent all their slots with just a generic label, Entity.
Table 1: An example conversation across multiple domains.

| Domain | Turns | Current Turn Slots | Carried Slots |
|---|---|---|---|
| Weather | U1: weather in san francisco | WeatherLocation: san francisco | |
| Weather | V1: weather is rainy and temperature 42F | Temperature: 42F | |
| LocalSearch | U2: any mexican restaurants nearby | PlaceType: mexican restaurants | City: san francisco |
| LocalSearch | V2: la taqueria is a mile away | Entity: la taqueria | |
| Traffic | U3: thanks, send directions to my phone | | Place: la taqueria; Town: san francisco |
There has been work on improving semantic frame error rate for the current turn by leveraging context turns and encoding dialog states: prior work has compared various approaches to encoding context, described memory network architectures for knowledge carryover, added semantic context from the frame, and used context features for domain classification. Our work differs from this body of literature: we keep the system for semantic frame prediction fixed and explore methods to explicitly add slots from previous turns. While previous work assumes the existence of a dialog manager that can keep track of entities from previous turns, we make no such assumption.
A closely related task is dialog state tracking [11, 18, 19], where the system has to predict the set of slot-value pairs that matches the contents of the current segment. Usually, state trackers produce a distribution over all possible slot-value pairs; this does not scale to open-ended slot values (such as Date or Time), or to slots whose values are constantly being updated (such as Songs or Movies). Our approach avoids this by reformulating the tracking problem as a carryover action for the current turn. More closely related is frame tracking, which was introduced as an extension to state tracking: slots need to be tracked over multiple frames while maintaining references to the original frame. A key difference is that our formulation deals with the issue of disparate labels in a large-scale multi-domain system.
In this paper, we present a neural network architecture that addresses the challenges above. The main contributions of this paper are:
-  We present the task of tracking slots in a conversation as a carryover decision. This allows us to scale to a potentially unbounded set of slot values, and to generalize anaphora resolution to both explicit and implicit references.
-  We address the diverse schema challenge by leveraging label embeddings (see Section 2.2.1) to generate potential candidates to be carried over.
-  We show that our proposed model outperforms a strong rule-based baseline. We also demonstrate via experiments why our task is more complex than dialog state tracking, by benchmarking our approach on DSTC-2 as well as on a dataset collected from a real virtual assistant device.
2.1 Task Definition
We define a dialog turn at time $t$ as the tuple $\{w_t, a_t, S_t\}$, where $w_t$ is the sequence of words $\{w_t^1, \ldots, w_t^{N_t}\}$; $a_t$ is the dialog act; and $S_t$ is a set of slots, where each slot $s$ is a key-value pair $s = (k, v)$, with $k$ being the slot name (or slot key) and $v$ being the slot value. We write $u_t$ for a user-initiated turn and $v_t$ for a system-initiated turn.
Given a sequence of user turns $u_1, \ldots, u_{t-1}$ and their associated system turns $v_1, \ldots, v_{t-1}$ (for simplicity, we assume a turn-taking model in which user and system turns alternate), and the current user turn $u_t$, the task is to predict a carryover decision for each candidate slot $s$ in the dialog history, i.e., we carry over slot $s$ to turn $u_t$ if $P(s) > \theta$, where $\theta$ is a decision threshold to be optimized. This formulation allows us to scale to potentially unbounded slot values and to handle diverse schemas, as we will discuss later.
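The decision rule above can be sketched as follows; the candidate slots, probability values, and threshold here are hypothetical stand-ins for the model's predicted $P(s)$ and the tuned $\theta$:

```python
def carryover_slots(candidates, theta=0.5):
    """Carry over exactly those candidate slots whose predicted carryover
    probability exceeds the decision threshold theta.

    candidates: list of ((key, value), probability) pairs, where the
    probabilities stand in for the model's P(s | dialog history).
    """
    return [slot for slot, p in candidates if p > theta]

# Hypothetical scores for turn U2 of the Table 1 example:
candidates = [
    (("City", "san francisco"), 0.91),   # relevant: carried over
    (("Temperature", "42F"), 0.08),      # irrelevant to the restaurant query
]
carried = carryover_slots(candidates, theta=0.5)
```

Because each candidate is scored independently, the set of carried slots is unbounded in size and requires no enumeration of all possible slot-value pairs.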
2.2 Model Definition
We use an encoder-decoder approach, as shown in Figure 1, to classify each candidate slot as being relevant for the current turn.
2.2.1 Candidate Slot Generation
As shown in Table 1, the schemas associated with each turn can be in completely different label spaces. So, we use slot key embeddings to map the keys of the candidate slots in the dialog history into the schema associated with the current domain. We use pre-trained word embeddings as the source for computing the slot key embeddings.
For each slot name we compute its label embedding by averaging over the associated slot value embeddings. For multi-word slot values, the embedding is constructed by averaging the associated word embeddings.
We now construct the transformed candidate set by mapping each candidate slot key to those slot keys in the current domain's schema whose label embeddings are similar, i.e., $\{k' : e_k \cdot e_{k'} > \delta\}$, where $\cdot$ is the dot product and $\delta$ is a threshold tuned over the development set.
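A minimal sketch of this candidate generation step, with toy two-dimensional vectors standing in for pre-trained word embeddings; the vectors, observed slot values, and threshold value are all illustrative:

```python
import numpy as np

# Toy word vectors standing in for pre-trained embeddings (e.g. GloVe).
word_vecs = {
    "san": np.array([0.9, 0.1]),
    "francisco": np.array([1.0, 0.0]),
    "mexican": np.array([0.0, 1.0]),
    "restaurants": np.array([0.1, 0.9]),
}

def label_embedding(observed_values):
    """Label embedding for a slot key: average over its observed slot-value
    embeddings; each multi-word value is itself an average of word vectors."""
    value_vecs = [np.mean([word_vecs[w] for w in v.split()], axis=0)
                  for v in observed_values]
    return np.mean(value_vecs, axis=0)

# Slot keys from two schemas, each with (illustrative) observed values.
weather_location = label_embedding(["san francisco"])   # source schema
city = label_embedding(["san francisco", "san"])        # target schema
place_type = label_embedding(["mexican restaurants"])   # target schema

def transform_candidates(source_emb, target_schema, delta=0.5):
    """Keep target slot keys whose label embedding is close to the source
    key's embedding (dot product above a dev-tuned threshold delta)."""
    return [k for k, e in target_schema.items() if source_emb @ e > delta]

targets = {"City": city, "PlaceType": place_type}
mapped = transform_candidates(weather_location, targets, delta=0.5)
```

With these toy vectors, WeatherLocation maps to City but not to PlaceType, mirroring the U1-to-U2 transformation in Table 1 without any shared schema.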
2.2.2 Dialog Encoding
We first embed the words in the utterance sequence using word embeddings, and feed the resulting sequence into an LSTM to recursively encode the current turn and the context turns associated with the user and the system, respectively.
The LSTM is stateful, i.e., the LSTM output at the last token of an utterance is fed as the initial state to the LSTM for the next utterance.
2.2.3 Encoding Dialog Act
The dialog act for the current turn is encoded into a fixed-length vector using an intent embedding dictionary. For each dialog act, we compute its embedding by averaging over the embeddings of its associated utterances; each utterance embedding is in turn constructed by averaging the associated word embeddings.
2.2.4 Encoding Candidate Slot
The candidate slot is encoded into a fixed length vector as a concatenation of the slot key embedding and the slot value embedding.
2.2.5 Recency Encoding
The slot distance, defined as the integer offset of the candidate slot from the current turn, is encoded as a one-hot vector. The final distance encoding vector is then constructed by applying an affine transform to this one-hot encoding.
2.2.6 Attention Mechanism
We consider two levels of attention - the word level attention allows the model to focus on individual mentions in the utterance that influence the slot carryover decision, and the stream level attention which allows the model to focus on specific streams (user and system) in the dialog.
Word Attention: For each stream defined in Section 2.2.2, we attend over the words in that stream and compute a per-stream context vector. For a stream's hidden-state sequence $h_1, \ldots, h_T$ and slot embedding $e_s$, we compute the word-level attentional context vector as follows. Here $h_j$ represents the hidden encoding of the input word at position $j$. We compute the importance of word $j$ to the slot as the similarity $e_s \cdot h_j$, obtain the normalized weights $\alpha_j$ via a softmax, and use these to compute the weighted context vector $c = \sum_j \alpha_j h_j$. Similarly, we compute the context vectors for the user and system streams.
Stream Attention: As before, we attend over each individual stream's context vector to obtain the final context vector. (If word attention is turned off, we instead choose the final state from each stream LSTM to construct the final context vector.)
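The word-level attention can be sketched as follows; the hidden states and slot embedding are toy values, and dot-product scoring with softmax normalization matches the description above:

```python
import numpy as np

def word_attention(hidden_states, slot_emb):
    """Per-stream word attention: score each hidden state against the slot
    embedding, softmax-normalize, and return the weighted context vector."""
    scores = hidden_states @ slot_emb       # similarity of each word to the slot
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()       # normalized importance weights a_j
    context = weights @ hidden_states       # c = sum_j a_j * h_j
    return context, weights

h = np.array([[0.1, 0.9],    # hidden state for e.g. "any"
              [0.8, 0.2],    # hidden state for e.g. "mexican"
              [0.7, 0.3]])   # hidden state for e.g. "restaurants"
e_s = np.array([1.0, 0.0])   # candidate slot embedding
context, weights = word_attention(h, e_s)
```

Words whose encodings align with the candidate slot receive higher weights, which is how the model focuses on mentions that influence the carryover decision without hand-crafted similarity features.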
3.1 Data Setup
For the experiments, we use a subset of data collected on a commercial voice assistant. Table 2 summarizes the statistics of the training, development and test sets across different domains. Around 20% of the sessions have utterances from multiple schemas. Also, as expected for a voice assistant, we have a significant class imbalance: the number of positive candidate slots is much smaller than the number of possible candidates for each turn. This is due to cross-domain interactions that follow each other but are not part of the same goal, which is common in a digital assistant. Furthermore, some domains chose not to associate any label with an entity mention, which we represent with the Entity slot; this results in a very large number of potential candidates, as we consider all possible target slots in the current domain for such entities.
To demonstrate the complexity of our data, we also report results on the dataset released in the Dialog State Tracking Challenge. We modified the dataset to fit our carryover task accordingly. We consider only the 1-best ASR and 1-best SLU hypotheses. Unlike our commercial dataset, in DSTC the slots tracked as part of the goal occur only in user turns, so we remove candidates from system turns as a pre-processing step. Also, in the DSTC task dialogs can be system-initiated, but in our task the dialog is always user-initiated, hence we remove the first system turn.
3.2 Training Setup and Evaluation Metrics
We introduce two baselines. The 'Naive Baseline' system carries over all the slots from the most recent turn in the dialogue session. This is because the most recent entities are the most likely to be referred to by users in a spoken dialogue system. We also use a stronger, more elaborate 'Rule Baseline', in which we detect referring expressions and, for each referring expression, use linguistic and semantic features to retain only those antecedent candidate slots that agree in gender, number and type. Algorithm 1 shows an example rule that executes for the use case U2 in Table 1.
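The Naive Baseline can be sketched directly from its description; the turn representation below is a hypothetical structure, not the system's actual data format:

```python
def naive_baseline(context_turns):
    """Carry over all slots from the most recent turn in the session,
    on the assumption that recent entities are most likely referenced."""
    if not context_turns:
        return []
    return list(context_turns[-1]["slots"])

# At turn U3 of Table 1, the most recent context turn is the system turn V2.
context = [
    {"turn": "U2", "slots": [("PlaceType", "mexican restaurants")]},
    {"turn": "V2", "slots": [("Entity", "la taqueria")]},
]
carried = naive_baseline(context)
```

Note that this baseline misses slots from older turns (e.g. the city from U1), which is exactly the kind of error the learned carryover model is meant to fix.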
| Statistic | Train | Dev | Test |
|---|---|---|---|
| Avg. turns per session | 2.2 | 2.14 | 2.18 |
| % of disparate schema sessions | 19.86 | 18.75 | 20.53 |
| Avg. positive carryover candidates per turn | 0.37 | 0.39 | 0.35 |
| Avg. negative carryover candidates per turn | 4.07 | 4.05 | 4.00 |
For the model, we initialize the word embeddings using 300-dimensional pre-trained GloVe vectors. The model is trained using mini-batch SGD with the Adam optimizer with standard parameters to minimize the class-weighted cross-entropy loss. In our experiments, we use 128 dimensions for the LSTM hidden states and 256 dimensions for the hidden state in the decoder. Similar to prior work, we pre-train an LSTM for the named entity recognition task and use this model to initialize the parameters of the LSTM-based encoders. All model setups are trained for 20 epochs with an early stopping criterion optimized on a dev set. We select as the final hypothesis only those slots whose carryover probability exceeds the decision threshold, which was optimized over the dev set. For each utterance, independent carryover decisions are made for each candidate slot. We evaluate the models by comparing the hypothesis and reference slots to measure precision, recall and F1 scores.
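The evaluation reduces to comparing hypothesized and reference slot sets per utterance; a sketch, with toy (key, value) slots as input:

```python
def slot_prf1(hyp_slots, ref_slots):
    """Precision, recall and F1 between hypothesized and reference
    (key, value) carried slots for one utterance."""
    hyp, ref = set(hyp_slots), set(ref_slots)
    tp = len(hyp & ref)                          # correctly carried slots
    p = tp / len(hyp) if hyp else 0.0
    r = tp / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

p, r, f1 = slot_prf1(
    hyp_slots=[("City", "san francisco"), ("Temperature", "42F")],
    ref_slots=[("City", "san francisco")],
)
```

Here the spurious Temperature carryover costs precision while recall stays perfect, which is the trade-off the attention variants in the results below move along.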
3.3 Results and Discussion
The results show the cumulative impact of the various training strategies for our proposed model. Compared to a strong rule-based baseline, our proposed encoder-decoder approach gives significant gains in accuracy. Adding word attention improves precision and F1, but at the cost of recall; the results are significant compared to the system without attention. Intuitively, a slot value matching referring tokens in the dialog turn indicates that it is relevant to the conversation. The word attention model captures this intuition as part of the model learning process. This alleviates the need to explicitly define semantic type similarity features and to detect anaphoric mentions, as we do in the rule-based system. Adding stream attention improves recall, but the overall F1 degrades. Stream attention, which helps isolate user and system turns, did not help; we speculate that the distance feature already captures this, and that there is insufficient data to train it appropriately.
For completeness, we also include performance on the public DSTC2 dataset in Table 4. We do not claim to be solving DSTC2, but only use this dataset as a comparison of task complexity: the DSTC2 task is relatively simple, as evidenced by the naive baseline achieving a high F1 score on it but a very low one on our commercial assistant task.
Commercial assistant dataset:

| Model | Precision | Recall | F1 |
|---|---|---|---|
| + word attention | 75.76 | 94.65 | 84.16 |
| + stream attention | 73.48 | 96.18 | 83.31 |

DSTC2 dataset (Table 4):

| Model | Precision | Recall | F1 |
|---|---|---|---|
| + word attention | 97.20 | 97.65 | 97.42 |
| + stream attention | 97.23 | 95.61 | 96.42 |
In this work, we presented the task of contextual carryover of slots in a multi-domain large-scale dialog system. To address the scalability of the solution over a large set of slot values we re-formulated this as a slot carryover decision to identify the most relevant set of slots at the current turn. Furthermore, we proposed an efficient way to leverage label embeddings to deal with heterogeneous schemas. We presented empirical results demonstrating the efficacy of our neural network formulation over a strong rule-based baseline. We also quantified the gains from various components of the proposed approach.
-  G. Mesnil, X. He, L. Deng, and Y. Bengio, “Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding,” in INTERSPEECH, 2013.
-  B. Wellner and A. McCallum, “Towards conditional models of identity uncertainty with application to proper noun coreference,” in IJCAI Workshop on Information Integration and the Web, 2003.
-  V. Stoyanov and J. Eisner, “Easy-first coreference resolution.” in COLING, 2012, pp. 2519–2534.
-  A. Culotta, M. L. Wick, and A. McCallum, “First-order probabilistic models for coreference resolution.” in HLT-NAACL, 2007, pp. 81–88.
-  G. Durrett and D. Klein, “Easy victories and uphill battles in coreference resolution.” in EMNLP, 2013, pp. 1971–1982.
-  S. J. Wiseman, A. M. Rush, S. M. Shieber, and J. Weston, “Learning anaphoricity and antecedent ranking features for coreference resolution.” Association for Computational Linguistics, 2015.
-  S. Rao, A. Ettinger, H. Daumé III, and P. Resnik, “Dialogue focus tracking for zero pronoun resolution.” in HLT-NAACL, 2015, pp. 494–503.
-  M. Strube and C. Müller, “A machine learning approach to pronoun resolution in spoken dialogue,” in Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1. Association for Computational Linguistics, 2003, pp. 168–175.
-  A. Stent and S. Bangalore, “Interaction between dialog structure and coreference resolution,” in SLT, 2010.
-  C. Liu, P. Xu, and R. Sarikaya, “Deep contextual language understanding in spoken dialogue systems,” in Sixteenth annual conference of the international speech communication association, 2015.
-  J. D. Williams, A. Raux, D. Ramachandran, and A. W. Black, “The dialog state tracking challenge,” in SIGDIAL Conference, 2013.
-  M. Eckert and M. Strube, “Resolving discourse deictic anaphora in dialogues,” in Proceedings of the ninth conference on European chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 1999, pp. 37–44.
-  A. Kumar, A. Gupta, J. Chan, S. Tucker, B. Hoffmeister, M. Dreyer, S. Peshterliev, A. Gandhe, D. Filiminov, A. Rastrow, C. Monson, and A. Kumar, “Just ASK: building an architecture for extensible self-service spoken language understanding,” CoRR, vol. abs/1711.00549, 2017. [Online]. Available: http://arxiv.org/abs/1711.00549
-  A. Bapna, G. Tur, D. Hakkani-Tur, and L. Heck, “Sequential dialogue context modeling for spoken language understanding,” in Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, 2017, pp. 103–114.
-  Y.-N. Chen, D. Hakkani-Tür, G. Tür, J. Gao, and L. Deng, “End-to-end memory networks with knowledge carryover for multi-turn spoken language understanding.” in INTERSPEECH, 2016, pp. 3245–3249.
-  D. Yann, G. Tur, D. Hakkani-Tur, and L. Heck, “Zero-shot learning and clustering for semantic utterance classification using deep learning,” 2014.
-  P. Xu and R. Sarikaya, “Contextual domain classification in spoken language understanding systems using recurrent neural network,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 136–140.
-  T. Hori, H. Wang, C. Hori, S. Watanabe, B. Harsham, J. Le Roux, J. R. Hershey, Y. Koji, Y. Jing, Z. Zhu et al., “Dialog state tracking with attention-based sequence-to-sequence learning,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 552–558.
-  M. Henderson, B. Thomson, and S. J. Young, “Deep neural network approach for the dialog state tracking challenge,” in SIGDIAL Conference, 2013.
-  H. Schulz, J. Zumer, L. E. Asri, and S. Sharma, “A frame tracking model for memory-enhanced dialogue systems,” in ACL REPL4NLP, 2017.
-  M. Henderson, B. Thomson, and J. D. Williams, “The second dialog state tracking challenge,” in Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 2014, pp. 263–272.
-  J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in EMNLP, 2014.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.