Task-oriented dialogue systems assist users with accomplishing tasks, such as making restaurant reservations or booking flights, by interacting with them in natural language. The capability to identify task-specific semantics is a key requirement for these systems. This is accomplished in the spoken language understanding (SLU) module, which typically parses natural language user utterances into semantic frames, composed of user intent, dialogue acts and slots, that can be processed by downstream dialogue system components. An example semantic frame is shown for a restaurant reservation query in Figure 1. It is common to model intent, dialogue act and slot prediction jointly [2, 3, 4, 5], which is a direction we follow.
Much prior research into SLU has focused on single-turn language understanding, where the system receives only the user utterance and, possibly, external contextual features such as knowledge base annotations and semantic context from the frame, as inputs. However, task-oriented dialogue commonly involves the user and the system engaging in multiple turns of back-and-forth conversation in order to achieve the user goal. Multi-turn SLU presents different challenges, since the user and the system may refer to entities introduced in prior dialogue turns, introducing ambiguity. For example, depending on the context, “three” could indicate a date, time, number of tickets or restaurant rating. Context from previous user and system utterances in multi-turn dialogue has been shown to help resolve these ambiguities [8, 9]. While initial work in this direction used only the previous system turn for context, the advent of deep learning techniques, memory networks in particular, facilitated incorporating context from the full dialogue history.
In essence, memory network-based approaches to multi-turn SLU store prior user and system utterances and, at the current turn, encode these into embeddings, using RNNs or otherwise. These memory embeddings are then aggregated to obtain the context vector which is used to condition the SLU output at the current turn. This aggregation step could use an attention mechanism based on cosine similarity with the user utterance embedding. Other approaches account for temporal order of utterances in the memory by using an RNN for aggregation or decaying attention weights with time.
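The cosine-attention aggregation described above can be illustrated with a minimal sketch. It assumes that the cosine scores are normalized with a softmax before the memory embeddings are averaged; the function names and the toy two-dimensional embeddings are hypothetical and are not taken from any of the cited systems.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(memory, query):
    """Return attention weights and the weighted sum of memory embeddings."""
    # Assumption: cosine scores are turned into weights with a softmax.
    weights = softmax([cosine(m, query) for m in memory])
    context = [sum(w * m[i] for w, m in zip(weights, memory))
               for i in range(len(query))]
    return weights, context

# Toy embeddings for three history utterances and the current user utterance.
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
weights, context = attend(memory, query)
```

Note that every history utterance has to be embedded and scored at each turn, which is the per-turn cost discussed in the next paragraph.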
Although they improve accuracy, memory networks are not a computationally efficient way of encoding context, for two reasons. First, at each turn, they process multiple history utterances to obtain the SLU output. Secondly, dialogue context could potentially be gleaned from existing dialogue system components such as the dialogue state tracker (DST) [14, 15, 16]. Using a separate SLU-specific network instead of reusing the context from DST duplicates computation. Furthermore, such approaches work with the natural language representation of the system utterance to have a consistent representation with user turns, while ignoring the system dialogue acts, which contain the same information but are more structured and have a smaller vocabulary.
In this work, we investigate some effective approaches to encoding dialogue context for SLU. Our contributions are two-fold. First, we propose a novel approach to encoding system dialogue acts for SLU, substituting the use of system utterances, which allows reuse of the dialogue policy manager’s output to obtain context. Second, we propose an efficient mechanism for encoding dialogue context using hierarchical recurrent neural networks which processes a single utterance at a time, yielding computational gains without compromising performance. Our representation of dialogue context is similar to those used in dialogue state tracking models [17, 18, 19], thus enabling the sharing of context representation between SLU and DST.
|System:||Which restaurant and for how many?|
|Dialogue Acts:||request(#), request(rest)|
|Dialogue Acts:||inform(#), inform(rest)|
The rest of this paper is organized as follows: Section 2 describes the overall architecture of the model. This section also formally describes the different tasks in SLU and outlines their implementation. Section 3 presents the setup for training and evaluation of our model. We conclude with experimental results and discussion.
Let a dialogue be a sequence of turns, each turn containing a user utterance and a set of dialogue acts corresponding to the preceding system utterance. Figure 2 gives an overview of our model architecture. For a new turn, we use the system act encoder (Section 2.1) to obtain a vector representation of all system dialogue acts in that turn. We also use the utterance encoder (Section 2.2) to generate the user utterance encoding by processing the user utterance token embeddings.
The dialogue encoder (Section 2.3) summarizes the content of the dialogue by using the system act encoding, the user utterance encoding, and its previous hidden state to generate the dialogue context vector and to update its hidden state. The dialogue context vector is then used for user intent classification and dialogue act classification (Section 2.4). The utterance encoder also generates updated token embeddings, which are used by the slot tagger (Section 2.5) to identify the slot values present in the user utterance.
Both the utterance encoder and slot tagger use bidirectional RNNs. In addition to the aforementioned inputs, both RNNs allow for additional inputs (positions A and C in Figure 2) and external initialization of hidden states (positions B and D in Figure 2), to incorporate context in our model. In the following sections, we describe each of these components in detail.
2.1 System Act Encoder
The system act encoder encodes the set of dialogue acts at the current turn into a vector that is invariant to the order in which the acts appear. This contrasts with a system utterance-based representation, which imposes an implicit ordering on the underlying acts.
Each system dialogue act contains an act type and optional slot and value parameters. We categorize the dialogue acts into two broad types: acts with an associated slot and possibly a slot value (e.g. request(time), negate(time=‘6 pm’)), and acts without (e.g. greeting). Note that the same dialogue act can appear in the dialogue with or without an associated slot (negate(time=‘6 pm’) versus negate).
For each slot type in our slot vocabulary, we define a binary vector whose size is the number of system act types, indicating the presence of each system act type with that slot associated, ignoring slot values for tractability. Similarly, we define a binary vector of the same size indicating the presence of each system act without any slot associated. For each slot, we also define an embedding. The final encoding is obtained from these vectors by applying a shared feedforward layer to the slot-associated act features, averaging over the set of slots mentioned so far, concatenating with the no-slot act features and applying a second feedforward layer, as in equations 1 - 4. The feedforward layer parameters and the slot embeddings are trainable.
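The encoding steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the exact formulation of equations 1 - 4: it assumes that the per-slot act bits are concatenated with the slot embedding before the shared layer and that both layers use a ReLU activation. The act types, slot names, layer sizes and random weights are all hypothetical.

```python
import random

ACT_TYPES = ["request", "inform", "negate", "greeting"]  # hypothetical act types
SLOTS = ["time", "rest", "#"]                            # hypothetical slot vocabulary
EMB, HID, OUT = 3, 5, 4                                  # hypothetical layer sizes

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def dense_relu(W, b, x):
    """One feedforward layer with ReLU activation (activation is an assumption)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

def act_bits(acts, slot):
    """Binary vector over ACT_TYPES for acts attached to `slot` (None = no slot)."""
    present = {act for act, s in acts if s == slot}
    return [1.0 if act in present else 0.0 for act in ACT_TYPES]

def init_params(rng):
    n = len(ACT_TYPES)
    return {
        "slot_emb": {s: [rng.uniform(-0.5, 0.5) for _ in range(EMB)] for s in SLOTS},
        "W1": rand_matrix(HID, n + EMB, rng), "b1": [0.0] * HID,
        "W2": rand_matrix(OUT, HID + n, rng), "b2": [0.0] * OUT,
    }

def encode_system_acts(acts, slots_so_far, params):
    """acts: iterable of (act_type, slot_or_None) pairs; slot values are ignored."""
    acts = list(acts)
    hidden = []
    for slot in sorted(slots_so_far):
        # Assumption: act bits for the slot are concatenated with its embedding.
        features = act_bits(acts, slot) + params["slot_emb"][slot]
        hidden.append(dense_relu(params["W1"], params["b1"], features))
    if hidden:
        pooled = [sum(col) / len(hidden) for col in zip(*hidden)]  # average over slots
    else:
        pooled = [0.0] * HID
    # Concatenate with the no-slot act bits and apply the second layer.
    return dense_relu(params["W2"], params["b2"], pooled + act_bits(acts, None))

params = init_params(random.Random(0))
acts = [("request", "#"), ("request", "rest"), ("greeting", None)]
encoding = encode_system_acts(acts, {"#", "rest"}, params)
shuffled = encode_system_acts(list(reversed(acts)), {"rest", "#"}, params)
```

Because the acts are reduced to per-slot binary vectors and then averaged, `encoding` and `shuffled` are identical, which is the order invariance the section asks for.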
2.2 Utterance Encoder
The user utterance encoder takes in the list of user utterance tokens as input. Special tokens SOS and EOS are added at the beginning and end of the token list. We use a single-layer bidirectional RNN with a GRU cell to encode the token embeddings of the user utterance.
The user utterance encoder outputs embedded representations of the user utterance and of the individual utterance tokens, obtained by concatenating the final states and the intermediate outputs of the forward and backward RNNs respectively.
2.3 Dialogue Encoder
The dialogue encoder incrementally generates the embedded representation of the dialogue context at every turn. We implement the dialogue encoder using a unidirectional GRU RNN, with each timestep corresponding to a dialogue turn. As shown in Figure 2, it takes the system act encoding, the user utterance encoding and its previous state as inputs, and outputs the updated state and the encoded representation of the dialogue context (the two are identical for a GRU RNN). This method of encoding context is more efficient than other state-of-the-art approaches like memory networks, which process multiple utterances from the history to process each turn.
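To make the per-turn cost concrete, here is a minimal sketch of a dialogue encoder that performs exactly one GRU step per turn and carries its state forward. It assumes the system act encoding and the utterance encoding are concatenated to form the GRU input, and it uses one common GRU formulation; the dimensions, the random weights and the stand-in input vectors are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def init_gru(input_dim, state_dim, rng):
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(input_dim + state_dim)]
                for _ in range(state_dim)]
    zeros = [0.0] * state_dim
    return {"Wz": mat(), "Wr": mat(), "Wh": mat(),
            "bz": list(zeros), "br": list(zeros), "bh": list(zeros)}

def gru_step(p, x, h):
    """One GRU update; the returned state doubles as the output."""
    z = [sigmoid(v) for v in affine(p["Wz"], p["bz"], x + h)]  # update gate
    r = [sigmoid(v) for v in affine(p["Wr"], p["br"], x + h)]  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(v) for v in affine(p["Wh"], p["bh"], x + gated)]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

rng = random.Random(0)
ACT_DIM, UTT_DIM, STATE_DIM = 4, 6, 3  # hypothetical sizes
params = init_gru(ACT_DIM + UTT_DIM, STATE_DIM, rng)

# Stand-ins for the system act and utterance encodings of a 3-turn dialogue.
turns = [([rng.random() for _ in range(ACT_DIM)],
          [rng.random() for _ in range(UTT_DIM)]) for _ in range(3)]

state = [0.0] * STATE_DIM
contexts = []
for act_enc, utt_enc in turns:
    # Assumption: the two encodings are concatenated to form the GRU input.
    state = gru_step(params, act_enc + utt_enc, state)
    contexts.append(state)
```

Each turn costs a single call to `gru_step` regardless of how long the dialogue history is, in contrast to re-encoding several history utterances per turn.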
2.4 Intent and Dialogue Act Classification
The user intent helps to identify the APIs/databases which the dialogue system should interact with. Intents are predicted at each turn so that a change of intent during the dialogue can be detected. We assume that each user utterance contains a single intent and predict the distribution over all intents at each turn using equation 6. Dialogue act classification, on the other hand, is defined as a multi-label binary classification problem to model the presence of multiple dialogue acts in an utterance. Equation 7 is used to calculate, for each dialogue act, the probability that it is present in the current turn.
The output sizes of the parameters in the above equations are given by the user intent and dialogue act vocabularies respectively. During inference, we predict the most probable intent as the intent label, and all dialogue acts with probability greater than a threshold are associated with the utterance, where the threshold is a hyperparameter tuned using the validation set.
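The inference rule just described can be sketched as below. The sketch assumes a single linear layer followed by a softmax for intents and by element-wise sigmoids for dialogue acts, applied to the dialogue context vector; the label names, the hand-set weights and the threshold value are hypothetical and chosen only to make the example deterministic.

```python
import math

def affine(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

INTENTS = ["reserve_restaurant", "buy_movie_tickets"]  # hypothetical labels
USER_ACTS = ["inform", "negate", "affirm"]             # hypothetical labels

def predict(context, params, threshold):
    """Single-label intent (argmax) and multi-label acts (thresholded)."""
    intent_probs = softmax(affine(params["Wi"], params["bi"], context))
    act_probs = [sigmoid(v) for v in affine(params["Wa"], params["ba"], context)]
    best = max(range(len(INTENTS)), key=lambda k: intent_probs[k])
    acts = [a for a, p in zip(USER_ACTS, act_probs) if p > threshold]
    return INTENTS[best], acts, intent_probs, act_probs

# Hand-set weights standing in for trained parameters.
params = {
    "Wi": [[2.0, 0.0], [0.0, 2.0]], "bi": [0.0, 0.0],
    "Wa": [[3.0, 0.0], [0.0, 3.0], [1.0, 0.0]], "ba": [0.0, 0.0, 0.0],
}
intent, acts, intent_probs, act_probs = predict([1.0, -1.0], params, threshold=0.5)
```

The softmax forces the intent probabilities to compete, matching the single-intent assumption, while the independent sigmoids let any number of dialogue acts pass the threshold.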
2.5 Slot Tagging
Slot tagging is the task of identifying the values for different slots present in the user utterance. We use the IOB (inside-outside-begin) tagging scheme (Figure 1) to assign a label to each token. The slot tagger takes the token embeddings output by the utterance encoder as input and encodes them using a bidirectional RNN with an LSTM cell to generate one token vector per user utterance token. We use an LSTM cell instead of a GRU because it gave better results on the validation set. For each token, we use its token vector to obtain the distribution across all IOB slot labels using equation 8. During inference, we predict the most probable label as the slot label for that token.
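As an illustration of how per-token IOB labels map back to slot values, the following sketch groups a tagged utterance into (slot, value) pairs. The example utterance and its tags are hypothetical, and the choice to treat an I- tag with no matching open span as outside is a design assumption of this sketch.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (slot, value) pairs."""
    spans = []
    current = None  # (slot_name, [tokens]) for the span being built
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            # "O", or an I- tag that does not continue the open span.
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(slot, " ".join(words)) for slot, words in spans]

tokens = ["table", "for", "two", "at", "6", "pm"]
tags = ["O", "O", "B-#", "O", "B-time", "I-time"]
slots = iob_to_spans(tokens, tags)
```

Here `slots` evaluates to `[("#", "two"), ("time", "6 pm")]`.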
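|Model||Config: system act encoding position||Config: dialogue encoding position||Sim-R Intent Acc||Sim-R Act F1||Sim-R Slot F1||Sim-R Frame Acc||Sim-M Intent Acc||Sim-M Act F1||Sim-M Slot F1||Sim-M Frame Acc||Overall Intent Acc||Overall Act F1||Overall Slot F1||Overall Frame Acc|
|7. System act encoding only, No DE||B||-||99.62||93.21||95.53||87.63||99.12||96.00||87.30||75.44||99.48||94.04||93.07||84.17|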
We use two representations of dialogue context: the dialogue encoding vector from the previous turn encodes all turns prior to the current turn, and the system intent vector (the system act encoding of Section 2.1) encodes the current turn system utterance. Thus, the two vectors together encode the entire conversation observed up to the current user utterance. These vectors can be fed as inputs at multiple places in the SLU model. In this work, we identify four positions to feed context, i.e. positions A through D in Figure 2. Positions A and C feed context vectors as additional inputs at each RNN step, whereas positions B and D use the context vectors to initialize the hidden state of the two RNNs after a linear projection to the hidden state dimension. We experiment with the following configurations for integrating dialogue context in our framework:
System act encoding only, No DE: We feed the system act encoding in one of positions A-D, omit the dialogue encoder, and instead use the utterance encoder’s final state for intent and act prediction. The best model for this configuration, as evaluated on the validation set, had the system act encoding fed in position B, and test set results for this model are reported in row 7 of Table 1.
System act encoding only: We feed the system act encoding into the dialogue encoder and to one of the positions A-D. Row 8 of Table 1 contains results for the best model for this configuration, which had the system act encoding fed in position D of the slot tagger.
Previous dialogue encoding only: We feed the system act encoding into the dialogue encoder and the dialogue encoding from the previous turn into the slot tagger at position C or D. Row 9 of Table 1 shows results for the best model, with the dialogue encoding fed in position D.
System act encoding and previous dialogue encoding: We feed the system act encoding into the dialogue encoder, and feed the system act encoding and the previous dialogue encoding independently into one of positions C and D, giving 4 combinations in total. Row 10 of Table 1 shows results for the best model, which had one vector fed in position C and the other in position D.
We obtain dialogues from the Simulated Dialogues dataset (available at http://github.com/google-research-datasets/simulated-dialogue). The dataset has dialogues from restaurant (Sim-R, 11234 turns in 1116 training dialogues) and movie (Sim-M, 3562 turns in 384 training dialogues) domains and a total of three intents. The dialogues in the dataset consist of 12 slot types and 21 user dialogue act types, with 2 slot types and 12 dialogue acts shared between Sim-R and Sim-M. One challenging aspect of this dataset is the prevalence of unseen entities. For instance, only 13% of the movie names in the validation and test sets are also present in the training set.
We compare our models’ performance with the following four baseline models:
NoContext: A two-layer stacked bidirectional RNN using GRU and LSTM cells respectively, and no context.
PrevTurn: This is similar to the NoContext model, but with a separate bidirectional GRU layer encoding the previous system turn; this encoding is input to the slot tagging layer of the encoder, i.e. position C in Figure 2.
MemNet: This is the memory network-based system using cosine attention. For this model, we report metrics for models trained with memory sizes of 6 and 20 turns. A memory size of 20, while making the model slower, enables it to use the entire dialogue history for most of the dialogues.
SDEN: This is the system which uses a bidirectional GRU RNN for combining memory embeddings. We report metrics for models with memory sizes 6 and 20.
3.3 Training and Evaluation
We use sigmoid cross entropy loss for dialogue act classification (since it is modeled as a multilabel binary classification problem) and softmax cross entropy loss for intent classification and slot tagging. During training, we minimize the sum of the three constituent losses using the ADAM optimizer for 150k training steps with a batch size of 10 dialogues.
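The combined objective can be sketched for a single turn as follows. The sketch computes each loss from raw logits and assumes that the slot tagging loss is summed over tokens; how the losses are reduced over tokens and batches is not stated in the text, so that part is an assumption, and the function names are hypothetical.

```python
import math

def softmax_xent(logits, target_index):
    """Softmax cross entropy for one example with an integer target."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_z - logits[target_index]

def sigmoid_xent(logits, targets):
    """Sigmoid cross entropy summed over independent binary labels."""
    # Stable form: max(x, 0) - x * t + log(1 + exp(-|x|)).
    return sum(max(x, 0.0) - x * t + math.log1p(math.exp(-abs(x)))
               for x, t in zip(logits, targets))

def joint_loss(intent_logits, intent_target, act_logits, act_targets,
               slot_logits, slot_targets):
    """Sum of the intent, dialogue act and slot tagging losses for one turn."""
    intent_loss = softmax_xent(intent_logits, intent_target)
    act_loss = sigmoid_xent(act_logits, act_targets)
    # Assumption: token-level slot losses are summed rather than averaged.
    slot_loss = sum(softmax_xent(logits, target)
                    for logits, target in zip(slot_logits, slot_targets))
    return intent_loss + act_loss + slot_loss

# With all-zero logits each term reduces to the log of the number of classes.
total = joint_loss([0.0, 0.0], 0, [0.0], [1.0], [[0.0, 0.0, 0.0]], [2])
```

In this toy case the total is log 2 + log 2 + log 3, which makes the contribution of each task easy to check by hand.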
To improve model performance in the presence of out of vocabulary (OOV) tokens arising from entities not present in the training set, we randomly replace tokens corresponding to slot values in user utterance with a special OOV token with a value dropout probability that linearly increases during training.
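A minimal sketch of this value dropout scheme is given below. It assumes that the dropout probability rises linearly from zero to its maximum over the course of training and that each slot-value token is replaced independently; the OOV token string, the function names and the example schedule values are hypothetical.

```python
import random

def value_dropout_prob(step, total_steps, max_prob):
    """Dropout probability that grows linearly from 0 to max_prob."""
    return max_prob * min(step, total_steps) / total_steps

def apply_value_dropout(tokens, tags, prob, rng, oov="<OOV>"):
    """Replace slot-value tokens (any tag other than "O") with the OOV token."""
    # Assumption: each slot-value token is dropped independently.
    return [oov if tag != "O" and rng.random() < prob else token
            for token, tag in zip(tokens, tags)]

rng = random.Random(0)
tokens = ["book", "a", "table", "at", "6", "pm"]
tags = ["O", "O", "O", "O", "B-time", "I-time"]

start_prob = value_dropout_prob(0, 1000, 0.3)      # start of training
mid_prob = value_dropout_prob(500, 1000, 0.3)      # halfway through
untouched = apply_value_dropout(tokens, tags, start_prob, rng)
all_dropped = apply_value_dropout(tokens, tags, 1.0, rng)
```

Tokens outside slot values are never replaced, so the model still sees the surrounding context words that signal where a slot value occurs.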
To find the best hyperparameter values, we perform grid search over the token embedding size, learning rate, maximum value dropout probability and the intent prediction threshold, for each model configuration listed in Section 3. The utterance encoder and slot tagger layer sizes are set equal to the token embedding dimension, and that of the dialogue encoder to half this dimension. In Table 1, we report intent accuracy, dialogue act F1 score, slot chunk F1 score and frame accuracy on the test set for the best runs for each configuration in Section 3 based on frame accuracy on the combined validation set, to avoid overfitting. A frame is considered correct if its predicted intent, slots and acts are all correct.
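The frame accuracy criterion can be made concrete with a short sketch. The dictionary layout used to represent a frame and the example labels are hypothetical; acts and slots are compared as unordered collections, which is an assumption of this sketch.

```python
def frame_correct(predicted, gold):
    """A frame is correct only if intent, acts and slots all match."""
    return (predicted["intent"] == gold["intent"]
            and set(predicted["acts"]) == set(gold["acts"])
            and sorted(predicted["slots"]) == sorted(gold["slots"]))

def frame_accuracy(predicted_frames, gold_frames):
    """Fraction of turns whose whole frame is predicted correctly."""
    correct = sum(1 for p, g in zip(predicted_frames, gold_frames)
                  if frame_correct(p, g))
    return correct / len(gold_frames)

gold = [
    {"intent": "reserve_restaurant", "acts": ["inform"],
     "slots": [("#", "two"), ("time", "6 pm")]},
    {"intent": "reserve_restaurant", "acts": ["affirm"], "slots": []},
]
predicted = [
    {"intent": "reserve_restaurant", "acts": ["inform"],
     "slots": [("time", "6 pm"), ("#", "two")]},   # same slots, different order
    {"intent": "reserve_restaurant", "acts": ["negate"], "slots": []},  # wrong act
]
accuracy = frame_accuracy(predicted, gold)
```

A single wrong act makes the second frame incorrect even though its intent and slots match, so frame accuracy is never higher than any of the per-task accuracies.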
4 Results and Discussion
Table 1 compares the baseline models with different variants of our model. We observe that our models compare favorably to the state-of-the-art MemNet and SDEN baselines. The use of context plays a crucial role across all datasets and tasks, especially for intent and dialogue act classification, giving an improvement of 15% and 5% respectively across all configurations. For all subsequent discussion, we concentrate on frame accuracy since it summarizes the performance across all tasks.
An important consideration is the computational efficiency of the compared approaches: memory network-based models are expensive to run, since they process multiple utterances from the dialogue history at every turn. In contrast, our approach only adds a two-layer feedforward network (the system act encoder) and one step of a GRU cell (for the dialogue encoder) per turn to encode all context. Empirically, MemNet-6 and MemNet-20 experiments took roughly 4x and 12x more time to train respectively than our slowest model containing both the system act encoder and the dialogue encoder, on our training setup. SDEN runs are slower than their MemNet counterparts since they use RNNs for combining memory embeddings. In addition to being fast, our models generalize better on the smaller Sim-M dataset, suggesting that memory network-based models tend to be more data intensive.
Two interesting experiments to compare are rows 2 and 7, i.e. “PrevTurn” and “system act encoding only, No DE”; they both use context only from the previous system utterance/acts, discarding the remaining turns. Our system act encoder, comprising only a two-layer feedforward network, is in principle faster than the bidirectional GRU that “PrevTurn” uses to encode the system utterance. This notwithstanding, the similar performance of both models suggests that using system dialogue acts for context is a good alternative to using the corresponding system utterance.
Table 1 also lists the best configurations for feeding the two context vectors. In general, we observe that feeding context vectors as initial states to bidirectional RNNs (i.e. position D for slot tagging plus a side input to the dialogue encoder, or position B for all tasks in case of no dialogue encoder) yields better results than feeding them as additional inputs at each RNN step (positions C and A). This may be because our context vectors do not vary with the user utterance tokens, so introducing them repeatedly is likely redundant. For each experiment in Section 3, the differences between the varying context position combinations are statistically significant, as determined by McNemar’s test.
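For readers unfamiliar with McNemar’s test, the sketch below shows one way to compare two models on the same test frames. It implements the exact two-sided binomial variant, which is an assumption, since the text does not say which variant or significance level was used; the per-frame correctness lists are made up for illustration.

```python
import math

def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar p-value from paired per-example correctness."""
    # Only discordant pairs (one model right, the other wrong) matter.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0
    # Under the null hypothesis, b follows Binomial(n, 0.5).
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Made-up example: model A gets 10 frames right that model B gets wrong,
# and both are right on 5 further frames.
model_a = [True] * 10 + [True] * 5
model_b = [False] * 10 + [True] * 5
p_value = mcnemar_exact(model_a, model_b)
```

With 10 discordant pairs all favoring one model, the p-value is 2/1024; frames on which both models agree do not affect the result.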
Another interesting observation is that using the dialogue encoding from the previous turn, as compared to the system act encoding, as additional context for the slot tagger does not improve slot tagging performance. This indicates a strong correspondence between the slots present in the system acts and those mentioned in the user utterance, i.e. the user is often responding directly to the immediately prior system prompt, thereby reducing the dependence on context from the previous turns for the slot tagging task.
To conclude, we present a fast and efficient approach to encoding context for SLU. Avoiding the significant per-turn overhead of memory networks, our method accumulates dialogue context one turn at a time, resulting in a faster and more generalizable model without any loss in accuracy. We also demonstrate that using system dialogue acts is an efficient alternative to using system utterances for context.
-  G. Tur and R. De Mori, Spoken language understanding: Systems for extracting semantic information from speech. John Wiley & Sons, 2011.
-  D. Hakkani-Tür, A. Celikyilmaz, Y.-N. Chen, J. Gao, L. Deng, and Y.-Y. Wang, “Multi-domain joint semantic frame parsing using bi-directional rnn-lstm.” 2016.
-  X. Zhang and H. Wang, “A joint model of intent determination and slot filling for spoken language understanding.”
-  B. Liu and I. Lane, “Attention-based recurrent neural network models for joint intent detection and slot filling,” arXiv preprint arXiv:1609.01454, 2016.
-  P. Xu and R. Sarikaya, “Convolutional neural network based triangular crf for joint intent detection and slot filling,” in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 78–83.
-  B. Dhingra, L. Li, X. Li, J. Gao, Y.-N. Chen, F. Ahmed, and L. Deng, “Towards end-to-end reinforcement learning of dialogue agents for information access,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2017, pp. 484–495.
-  Y. N. Dauphin, G. Tur, D. Hakkani-Tur, and L. Heck, “Zero-shot learning for semantic utterance classification,” arXiv preprint arXiv:1401.0509, 2013.
-  A. Bhargava, A. Celikyilmaz, D. Hakkani-Tür, and R. Sarikaya, “Easy contextual intent prediction and slot detection,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 8337–8341.
-  P. Xu and R. Sarikaya, “Contextual domain classification in spoken language understanding systems using recurrent neural network,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 136–140.
-  S. Sukhbaatar, J. Weston, R. Fergus et al., “End-to-end memory networks,” in Advances in neural information processing systems, 2015, pp. 2440–2448.
-  Y.-N. Chen, D. Hakkani-Tür, G. Tur, J. Gao, and L. Deng, “End-to-end memory networks with knowledge carryover for multi-turn spoken language understanding,” in Proceedings of The 17th Annual Meeting of the International Speech Communication Association (INTERSPEECH). San Francisco, CA: ISCA, 2016.
-  A. Bapna, G. Tur, D. Hakkani-Tur, and L. Heck, “Sequential dialogue context modeling for spoken language understanding,” in Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, 2017, pp. 103–114.
-  S.-Y. Su, P.-C. Yuan, and Y.-N. Chen, “How time matters: Learning time-decay attention for contextual spoken language understanding in dialogues,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, To appear.
-  M. Henderson, B. Thomson, and J. Williams, “The second dialog state tracking challenge,” in 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue, vol. 263, 2014.
-  M. Henderson, B. Thomson, and S. Young, “Word-based dialog state tracking with recurrent neural networks,” in Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 2014, pp. 292–299.
-  N. Mrkšić, D. Ó. Séaghdha, T.-H. Wen, B. Thomson, and S. Young, “Neural belief tracker: Data-driven dialogue state tracking,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2017, pp. 1777–1788.
-  K. Yoshino, T. Hiraoka, G. Neubig, and S. Nakamura, “Dialogue state tracking using long short term memory neural networks,” 2016.
-  B. Liu and I. Lane, “An end-to-end trainable neural network model with belief tracking for task-oriented dialog,” in Proceedings of Interspeech, 2017.
-  J. D. Williams, “Web-style ranking and slu combination for dialog state tracking,” in Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), 2014, pp. 282–291.
-  M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
-  K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
-  E. F. Tjong Kim Sang and S. Buchholz, “Introduction to the conll-2000 shared task: Chunking,” in Proceedings of the 2nd workshop on Learning language in logic and the 4th conference on Computational natural language learning-Volume 7. Association for Computational Linguistics, 2000, pp. 127–132.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  P. Shah, D. Hakkani-Tür, G. Tur, A. Rastogi, A. Bapna, N. Nayak, and L. Heck, “Building a conversational agent overnight with dialogue self-play,” arXiv preprint arXiv:1801.04871, 2017.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.