Memory Consolidation for Contextual Spoken Language Understanding with Dialogue Logistic Inference

by He Bai, et al.

Dialogue contexts have proven helpful in spoken language understanding (SLU) systems, and they are typically encoded with explicit memory representations. However, most previous models learn the context memory with the single objective of maximizing SLU performance, leaving the context memory under-exploited. In this paper, we propose a new dialogue logistic inference (DLI) task to consolidate the context memory jointly with SLU in a multi-task framework. DLI is defined as sorting a shuffled dialogue session into its original logical order, and it shares the same memory encoder and retrieval mechanism as the SLU model. Our experimental results show that various popular contextual SLU models benefit from our approach, with the clearest improvements in slot filling.




1 Introduction

Spoken language understanding (SLU) is a key technique in today’s conversational systems such as Apple Siri, Amazon Alexa, and Microsoft Cortana. A typical SLU pipeline includes domain classification, intent detection, and slot filling Tur and De Mori (2011), which together parse user utterances into semantic frames. Example semantic frames Chen et al. (2018) are shown in Figure 1 for a restaurant reservation.

Figure 1: Example semantic frames of two utterances, with domain (D), intent (I) and semantic slots in IOB format.

Traditionally, domain classification and intent detection are treated as classification tasks with popular classifiers such as support vector machines and deep neural networks Haffner et al. (2003); Sarikaya et al. (2011). They can also be combined into one task if there are not many intents in each domain Bai et al. (2018). Slot filling is usually treated as a sequence labeling task. Popular approaches for slot filling include conditional random fields (CRF) and recurrent neural networks (RNN) Raymond and Riccardi (2007); Yao et al. (2014). Considering that pipeline approaches usually suffer from error propagation, joint models for slot filling and intent detection have been proposed to improve sentence-level semantics via mutual enhancement between the two tasks Xu and Sarikaya (2013); Hakkani-Tür et al. (2016); Zhang and Wang (2016); Goo et al. (2018), which is the direction we follow.

Figure 2: Architecture of our proposed contextual SLU with memory consolidation.

To create a more effective SLU system, contextual information has been shown to be useful Bhargava et al. (2013); Xu and Sarikaya (2014), as natural language utterances are often ambiguous. For example, without considering the context, the number 6 in an utterance in Figure 1 may refer to either B-time or B-people. Popular contextual SLU models Chen et al. (2016); Bapna et al. (2017) exploit the dialogue history with the memory network Weston et al. (2014), which covers all three main stages of the memory process: encoding (write), storage (save) and retrieval (read) Baddeley (1976). With such a memory mechanism, an SLU model can retrieve context knowledge to reduce the ambiguity of the current utterance, contributing to a stronger SLU model. However, memory consolidation, a well-recognized operation for maintaining and updating memory in cognitive psychology Sternberg and Sternberg (2016), is overlooked in previous models. They update memory with the single objective of maximizing SLU performance, leaving the context memory under-exploited.

In this paper, we propose a multi-task learning approach for multi-turn SLU by consolidating context memory with an additional task: dialogue logistic inference (DLI), defined as sorting a shuffled dialogue session into its original logical order. DLI can be trained with contextual SLU jointly if utterances are sorted one by one: selecting the right utterance from remaining candidates based on previously sorted context. In other words, given a response and its context, the DLI task requires our model to infer whether the response is the right one that matches the dialogue context, similar to the next sentence prediction task Logeswaran and Lee (2018). We conduct our experiments on the public multi-turn dialogue dataset KVRET Eric and Manning (2017), with two popular memory based contextual SLU models. According to our experimental results, noticeable improvements are observed, especially on slot filling.

2 Model Architecture

This section first explains the memory mechanism for contextual SLU, including memory encoding and memory retrieval. Then we introduce the SLU tagger with context knowledge, the definition of DLI and how to optimize the SLU and DLI jointly. The overall model architecture is illustrated in Figure 2.

Memory Encoding

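To represent and store the dialogue history $\{x_1, x_2, \dots, x_k\}$, we first encode it into memory embeddings $M = \{m_1, m_2, \dots, m_k\}$ with a BiGRU Chung et al. (2014) layer and then encode the current utterance $x_{k+1}$ into a sentence embedding $c$ with another BiGRU:

$$m_i = \mathrm{BiGRU}_m(x_i), \qquad c = \mathrm{BiGRU}_c(x_{k+1})$$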


Memory Retrieval

Memory retrieval refers to formulating contextual knowledge of the user’s current utterance by recalling dialogue history. There are two popular memory retrieval methods:

The attention based Chen et al. (2016) method first calculates the attention distribution of the sentence embedding $c$ over the memories $M = \{m_1, \dots, m_k\}$ by taking the inner product followed by a softmax function. Then the context can be represented with a weighted sum over $M$ by the attention distribution:

$$p_i = \mathrm{softmax}(c^{\top} m_i), \qquad m_{ws} = \sum_i p_i m_i$$

where $p_i$ is the attention weight of $m_i$. In Chen et al., $m_{ws}$ is summed with the utterance embedding $c$ and then multiplied by a weight matrix $W_o$ to generate an output knowledge encoding vector $h$:

$$h = W_o (c + m_{ws})$$
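The attention based retrieval described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (`softmax`, `dot`, `attention_retrieve`), the toy two-dimensional vectors, and the identity output matrix are all assumptions made for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention_retrieve(c, memories, w_o):
    """Return (attention weights, knowledge vector) for utterance embedding c.

    c        : sentence embedding of the current utterance
    memories : list of memory embeddings, one per history utterance
    w_o      : output weight matrix, given as a list of rows
    """
    # Attention distribution: inner product with each memory, then softmax.
    weights = softmax([dot(c, m) for m in memories])
    # Weighted sum of the memories under the attention distribution.
    m_ws = [sum(p * m[d] for p, m in zip(weights, memories))
            for d in range(len(c))]
    # Sum with the utterance embedding, then project with the weight matrix.
    summed = [ci + mi for ci, mi in zip(c, m_ws)]
    h = [dot(row, summed) for row in w_o]
    return weights, h

# Toy inputs (hypothetical values, chosen only to make the result easy to read).
c = [1.0, 0.0]
memories = [[1.0, 0.0], [0.0, 1.0]]
w_o = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix stands in for a learned matrix
weights, h = attention_retrieve(c, memories, w_o)
```

In this toy case the first memory points in the same direction as the current utterance, so it receives the larger attention weight and dominates the retrieved knowledge vector.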


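The sequential encoder based Bapna et al. (2017) method offers another way to calculate the knowledge encoding vector $h$:

$$g_i = \mathrm{sigmoid}(\mathrm{FF}([c; m_i])), \qquad h = \mathrm{BiGRU}_g([g_1, g_2, \dots, g_k])$$

where $\mathrm{FF}(\cdot)$ is a fully connected feed-forward layer.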

Contextual SLU

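Following Bapna et al., our SLU model is a stacked BiRNN: a BiGRU layer followed by a BiLSTM layer. However, Bapna et al. only initialize the BiLSTM layer’s hidden state with the knowledge encoding vector $h$, so the context knowledge participates only weakly. In this work, we feed $h$ to the second layer at every time step:

$$O_1 = \mathrm{BiGRU}_1(x_{k+1}), \qquad O_2 = \mathrm{BiLSTM}_2([O_1; h])$$

where $O_1 = \{o_1^1, \dots, o_1^m\}$ is the first layer’s output and $m$ is the length of $x_{k+1}$. The second layer encodes $\{[o_1^1; h], \dots, [o_1^m; h]\}$ into the final state $s_2$ and outputs $O_2 = \{o_2^1, \dots, o_2^m\}$, which are used in the following intent detection layer and slot tagger layer, respectively:

$$P^{I} = \mathrm{softmax}(U s_2), \qquad P^{S}_t = \mathrm{softmax}(V o_2^t)$$

where $U$ and $V$ are weight matrices of the output layers and $t$ is the index of each word in utterance $x_{k+1}$.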

Dialogue Logistic Inference

As described above, the memory mechanism holds the key to contextual SLU. However, context memory learned only with SLU objective is under-exploited. Thus, we design a dialogue logistic inference (DLI) task that can consolidate the context memory by sharing encoding and retrieval components with SLU. DLI is introduced below:

Given a dialogue session $X = \{x_1, x_2, \dots, x_n\}$, where $x_i$ is the $i$-th sentence in the conversation, we can shuffle $X$ into a randomly ordered set $X'$. It is not hard for a human to restore $X'$ to $X$ by determining which sentence comes first, then second, and so on. This is the basic idea of DLI: choosing the right response given a context and all candidates. For each integer $j$ in the range $k+1$ to $n$, the training data for DLI can be labelled automatically by:

$$P(x_j \mid x_1, \dots, x_k) = \begin{cases} 1 & j = k+1 \\ 0 & j \neq k+1 \end{cases}$$

where $k+1$ is the index of the current utterance. In this work, we calculate the above probability with a 2-dimensional softmax layer:

$$P(x_j \mid x_1, \dots, x_k) = \mathrm{softmax}(W_d h)$$

where $W_d$ is a weight matrix for dimension transformation.
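The automatic labelling described above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: the helper name `build_dli_examples`, the placeholder utterance strings, and the choice to emit every remaining utterance as a candidate at each step are assumptions.

```python
import random

def build_dli_examples(session, rng):
    """Build (context, candidate, label) triples from one dialogue session.

    For each split point k, the context is the first k utterances in their
    original order and every remaining utterance is a candidate response.
    The label is 1 only for the utterance that really comes next.
    """
    examples = []
    n = len(session)
    for k in range(1, n):
        context = session[:k]
        candidate_ids = list(range(k, n))  # 0-based indices of remaining utterances
        rng.shuffle(candidate_ids)         # candidates are presented in random order
        for j in candidate_ids:
            label = 1 if j == k else 0     # session[k] is the true next utterance
            examples.append((context, session[j], label))
    return examples

# Placeholder utterances (hypothetical), standing in for a real dialogue session.
session = ["utt_1", "utt_2", "utt_3", "utt_4"]
examples = build_dli_examples(session, random.Random(0))
```

For a session of four utterances this yields six examples, three of them positive, and no manual annotation is needed because the labels come from the original order.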

Joint Optimization

As depicted in Figure 2, we train DLI and SLU jointly in order to benefit the memory encoder and memory retrieval components. Loss functions of SLU and DLI are as follows.


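$$L_{SLU} = -\log P(y^{I} \mid x_1, \dots, x_{k+1}) - \sum_{t} \log P(y^{S}_t \mid x_1, \dots, x_{k+1})$$

$$L_{DLI} = -\sum_{x_j} \log P(y^{D} \mid x_j, x_1, \dots, x_k)$$

where $x_j$ is a candidate of the current response, and $y^{I}$, $y^{S}_t$ and $y^{D}$ are the training targets of intent, slot and DLI respectively. Finally, the overall multi-task loss function is formulated as

$$L = (1 - \lambda) L_{SLU} + \lambda L_{DLI}$$

where $\lambda$ is a hyperparameter.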
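As a rough illustration of the joint objective, the sketch below computes cross-entropy losses for the two tasks and combines them with a single weight. The function names are hypothetical, the probabilities are toy values, and the weighting form (1 − λ for SLU, λ for DLI) is an assumption of this sketch.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class under a probability vector."""
    return -math.log(probs[target])

def slu_loss(intent_probs, intent_target, slot_probs, slot_targets):
    """Intent loss plus the sum of per-token slot losses."""
    loss = cross_entropy(intent_probs, intent_target)
    for probs, target in zip(slot_probs, slot_targets):
        loss += cross_entropy(probs, target)
    return loss

def dli_loss(candidate_probs, candidate_targets):
    """Sum of binary (2-way softmax) losses over all response candidates."""
    return sum(cross_entropy(p, t)
               for p, t in zip(candidate_probs, candidate_targets))

def joint_loss(l_slu, l_dli, lam=0.3):
    """Weighted combination of the two task losses (assumed form)."""
    return (1.0 - lam) * l_slu + lam * l_dli

# Toy probabilities (hypothetical): one intent, two tokens, two DLI candidates.
l_slu = slu_loss([0.5, 0.5], 0, [[0.25, 0.75], [0.5, 0.5]], [1, 0])
l_dli = dli_loss([[0.2, 0.8], [0.9, 0.1]], [1, 0])
total = joint_loss(l_slu, l_dli, lam=0.3)
```

Because the memory encoder and retrieval components feed both losses, gradients from the DLI term reach the same parameters that the SLU term trains.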

3 Experiments

In this section, we first introduce datasets we used, then present our experimental setup and results on these datasets.

3.1 Datasets

KVRET Eric and Manning (2017) is a multi-turn task-oriented dialogue dataset for an in-car assistant. This dataset was collected with the Wizard-of-Oz scheme Wen et al. (2017) and consists of 3,031 multi-turn dialogues in three distinct domains: calendar scheduling, weather information retrieval, and point-of-interest navigation. Each domain has only one intent.

However, all dialogue sessions of KVRET are single-domain. Following Bapna et al., we further construct a multi-domain dataset KVRET* by randomly selecting two dialogue sessions from different domains in KVRET and recombining them into one conversation. The recombining probability is set to 50%. Detailed information about these two datasets is shown in Table 1.
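One possible reading of this recombination step is sketched below. The text does not specify how two sessions are merged or how partners are drawn, so the sketch assumes simple concatenation and uses each session at most once; the function name `recombine_sessions` and the toy sessions are hypothetical, and the sketch is not expected to reproduce the dataset sizes in Table 1.

```python
import random

def recombine_sessions(sessions, prob=0.5, rng=None):
    """Merge pairs of sessions from different domains with probability `prob`.

    sessions : list of (domain, utterances) pairs
    returns  : list of utterance lists, some of which span two domains
    """
    rng = rng or random.Random(0)
    pool = list(sessions)
    rng.shuffle(pool)
    combined = []
    while pool:
        domain, utterances = pool.pop()
        # Look for a remaining session whose domain differs from this one.
        partner = next((i for i, (d, _) in enumerate(pool) if d != domain), None)
        if partner is not None and rng.random() < prob:
            _, partner_utterances = pool.pop(partner)
            combined.append(utterances + partner_utterances)  # assumed: concatenation
        else:
            combined.append(utterances)
    return combined

# Toy sessions (hypothetical placeholders for real KVRET dialogues).
toy = [
    ("calendar", ["c1", "c2"]),
    ("weather", ["w1"]),
    ("navigation", ["n1", "n2", "n3"]),
    ("weather", ["w2", "w3"]),
]
result = recombine_sessions(toy, prob=0.5, rng=random.Random(0))
```

Merged conversations are longer than the originals, which matches the higher average turn count reported for KVRET*.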

Datasets Train Dev Test Avg.turns
KVRET 2425 302 304 5.25
KVRET* 1830 224 226 6.88
Table 1: Detailed information of KVRET and KVRET* datasets, including train/dev/test size and average turns per conversation.

                        KVRET                                        KVRET*
                        Slot                          Intent         Slot                          Intent
Model            DLI    P      R      F1              Acc.           P      R      F1              Acc.
NoMem            No     54.8   80.0   56.7            93.4           48.9   81.0   54.7            93.8
MemNet           No     75.8   81.1   75.8            93.9           73.1   81.8   74.5            92.8
MemNet           Yes    76.0   82.3   77.4 (+1.6)     93.9 (+0)      75.8   81.3   76.3 (+1.8)     93.8 (+1.0)
SDEN             No     70.5   80.9   70.1            93.6           56.9   81.3   59.4            93.0
SDEN             Yes    64.9   80.9   70.8 (+0.7)     93.8 (+0.2)    56.5   81.4   60.2 (+0.8)     93.5 (+0.5)
SDEN (modified)  No     71.9   82.2   74.0            93.7           72.7   80.8   74.9            93.2
SDEN (modified)  Yes    75.2   81.4   76.6 (+2.6)     94.3 (+0.6)    78.0   81.4   78.3 (+3.4)     93.2 (+0)

Table 2: SLU results on original KVRET and multi-domain KVRET*, including accuracy of intent detection and average precision, recall and F1 score of slot filling.

Figure 3: (a) Validation loss and slot F1 score of the modified SDEN during training. (b) Slot F1 score and intent accuracy of the modified SDEN with different values of lambda.

3.2 Experimental Setup

We conduct extensive experiments on intent detection and slot filling with datasets described above. The domain classification is skipped because intents and domains are the same for KVRET.

For training, the batch size is 64, and all models are trained with the Adam optimizer with default parameters Kingma and Ba (2014). Each model is trained for up to 30 epochs, with early stopping after five epochs without improvement in validation loss. The word embedding size is 100, and the hidden size of all RNN layers is 64. The loss weight $\lambda$ is set to 0.3. The dropout rate is set to 0.3 to avoid over-fitting.
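The stopping rule in this setup can be illustrated with a small sketch. It assumes that "five epochs' early stop" means a patience of five epochs without a new best validation loss; the function name and the recorded loss values are hypothetical, and a real run would compute each loss by evaluating the model rather than reading it from a list.

```python
def train_with_early_stopping(validation_losses, max_epochs=30, patience=5):
    """Return (best_epoch, last_epoch) for a sequence of validation losses.

    Training stops after `max_epochs`, or earlier once `patience` consecutive
    epochs have passed without improving on the best validation loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    stale = 0
    epoch = 0
    for epoch, loss in enumerate(validation_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, epoch

# Hypothetical validation losses, one per epoch.
losses = [1.0, 0.8, 0.7, 0.75, 0.76, 0.74, 0.72, 0.71, 0.9]
best_epoch, last_epoch = train_with_early_stopping(losses)
```

With these toy losses the best value is reached at epoch 3, and training halts at epoch 8 after five epochs without improvement.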

3.3 Results

The following methods are investigated and their results are shown in Table 2:

NoMem: A single-turn SLU model without memory mechanism.

MemNet: The model described in Chen et al., with attention based memory retrieval.

SDEN: The model described in Bapna et al., with sequential encoder based memory retrieval.

Modified SDEN: Similar to SDEN, but the knowledge encoding vector is used as in Eq. 6, that is, fed to the second layer at every time step.

As we can see from Table 2, all contextual SLU models with a memory mechanism benefit from our dialogue logistic dependent multi-task framework, especially on the slot filling task. We also note that the improvement on intent detection is marginal, as single-turn information already yields satisfactory intent classifiers according to the results of NoMem in Table 2. Thus, we mainly analyze DLI’s impact on the slot filling task, and the primary metric is the F1 score.

In Table 2, the weakest contextual model is SDEN, because its use of the knowledge encoding vector $h$ is too limited: it simply initializes the BiLSTM tagger’s hidden state with $h$, while the other models concatenate $h$ with the BiLSTM’s input at each time step. The more a contextual model depends on $h$, the larger the improvement from the DLI task. Comparing MemNet with the modified SDEN on the two datasets, we find that the modified SDEN becomes stronger than MemNet as the dialogue length increases. Finally, the improvements on KVRET* are higher than on KVRET. This is because retrieving context knowledge from long-distance memory is challenging, and our proposed DLI helps to consolidate the context memory and improve memory retrieval in such a situation.

We further analyze the training process of the modified SDEN on KVRET* to see how DLI training affects our model, as shown in Figure 3. The validation loss of the model trained with DLI falls quickly, and its slot F1 score is higher than that of the model without DLI training, indicating the potential of our proposed method.

To present the influence of the hyperparameter $\lambda$, we show SLU results with $\lambda$ ranging from 0.1 to 0.9 in Figure 3. In this figure, we find that the improvements of our proposed method are relatively steady when $\lambda$ is less than 0.8, and 0.3 is the best value. When $\lambda$ is higher than 0.8, our model tends to pay too much attention to the DLI task and overlook detailed information within sentences, leading the SLU model to perform better on intent detection but worse on slot filling.

4 Conclusions

In this work, we propose a novel dialogue logistic inference task for contextual SLU, with which the memory encoding and retrieval components can be consolidated, further enhancing the SLU model through multi-task learning. The DLI task needs no extra labeled data and adds no inference time. Experiments on two datasets show that various contextual SLU models benefit from our proposed method, with the clearest improvements on the slot filling task. Also, DLI is robust to different loss weights during multi-task training. In future work, we would like to explore more memory consolidation approaches for SLU and other memory related tasks.


The research work described in this paper has been supported by the National Key Research and Development Program of China under Grant No. 2017YFB1002103 and the Natural Science Foundation of China under Grant No. U1836221.


  • Baddeley (1976) Alan D Baddeley. 1976. The psychology of memory. New York.
  • Bai et al. (2018) He Bai, Yu Zhou, Jiajun Zhang, Liang Zhao, Mei-Yuh Hwang, and Chengqing Zong. 2018. Source critical reinforcement learning for transferring spoken language understanding to a new language. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3597–3607.
  • Bapna et al. (2017) Ankur Bapna, Gokhan Tur, Dilek Hakkani-Tur, and Larry Heck. 2017. Sequential Dialogue Context Modeling for Spoken Language Understanding. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue (SIGDIAL).
  • Bhargava et al. (2013) A. Bhargava, A. Celikyilmaz, D. Hakkani-Tur, and R. Sarikaya. 2013. Easy contextual intent prediction and slot detection. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8337–8341. IEEE.
  • Chen et al. (2018) Yun-Nung Chen, Asli Celikyilmaz, and Dilek Hakkani-Tur. 2018. Deep learning for dialogue systems. In Proceedings of the 27th International Conference on Computational Linguistics: Tutorial Abstracts (COLING), pages 25–31. Association for Computational Linguistics.
  • Chen et al. (2016) Yun-Nung Chen, Dilek Hakkani-Tür, Gokhan Tur, Jianfeng Gao, and Li Deng. 2016. End-to-End Memory Networks with Knowledge Carryover for Multi-Turn Spoken Language Understanding. In Interspeech, pages 3245–3249.
  • Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Deep Learning and Representation Learning Workshop.
  • Eric and Manning (2017) Mihail Eric and Christopher D. Manning. 2017. Key-Value Retrieval Networks for Task-Oriented Dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue (SIGDIAL).
  • Goo et al. (2018) Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-Li Huo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen. 2018. Slot-Gated Modeling for Joint Slot Filling and Intent Prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 753–757, New Orleans, Louisiana.
  • Haffner et al. (2003) Patrick Haffner, Gokhan Tur, and Jerry H Wright. 2003. Optimizing svms for complex call classification. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages I–I. IEEE.
  • Hakkani-Tür et al. (2016) Dilek Hakkani-Tür, Gökhan Tür, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frame parsing using bi-directional rnn-lstm. In Interspeech, pages 715–719.
  • Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. Computer Science.
  • Logeswaran and Lee (2018) Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893.
  • Raymond and Riccardi (2007) Christian Raymond and Giuseppe Riccardi. 2007. Generative and discriminative algorithms for spoken language understanding. In Eighth Annual Conference of the International Speech Communication Association.
  • Sarikaya et al. (2011) Ruhi Sarikaya, Geoffrey E Hinton, and Bhuvana Ramabhadran. 2011. Deep belief nets for natural language call-routing. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, pages 5680–5683. IEEE.
  • Sternberg and Sternberg (2016) Robert J Sternberg and Karin Sternberg. 2016. Cognitive psychology. Nelson Education.
  • Tur and De Mori (2011) Gokhan Tur and Renato De Mori. 2011. Spoken language understanding: Systems for extracting semantic information from speech. John Wiley & Sons.
  • Wen et al. (2017) Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A Network-based End-to-End Trainable Task-oriented Dialogue System. In EACL, pages 1–12.
  • Weston et al. (2014) Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. arXiv preprint arXiv:1410.3916.
  • Xu and Sarikaya (2013) Puyang Xu and Ruhi Sarikaya. 2013. Convolutional neural network based triangular crf for joint intent detection and slot filling. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pages 78–83. IEEE.
  • Xu and Sarikaya (2014) Puyang Xu and Ruhi Sarikaya. 2014. Contextual domain classification in spoken language understanding systems using recurrent neural network. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 136–140, Florence, Italy. IEEE.
  • Yao et al. (2014) Kaisheng Yao, Baolin Peng, Yu Zhang, Dong Yu, Geoffrey Zweig, and Yangyang Shi. 2014. Spoken language understanding using long short-term memory neural networks. In Spoken Language Technology Workshop (SLT), 2014 IEEE, pages 189–194. IEEE.
  • Zhang and Wang (2016) Xiaodong Zhang and Houfeng Wang. 2016. A joint model of intent determination and slot filling for spoken language understanding. In IJCAI, pages 2993–2999.