Tensorflow implementation of "Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling" (https://arxiv.org/abs/1609.01454)
Attention-based encoder-decoder neural network models have recently shown promising results in machine translation and speech recognition. In this work, we propose an attention-based neural network model for joint intent detection and slot filling, both of which are critical steps for many speech understanding and dialog systems. Unlike in machine translation and speech recognition, alignment is explicit in slot filling. We explore different strategies for incorporating this alignment information into the encoder-decoder framework. Learning from the attention mechanism in the encoder-decoder model, we further propose introducing attention to the alignment-based RNN models. Such attention provides additional information to the intent classification and slot label prediction. Our independent task models achieve state-of-the-art intent detection error rate and slot filling F1 score on the benchmark ATIS task. Our joint training model further obtains a 0.56% absolute error reduction on intent detection and a 0.23% absolute gain on slot filling over the independent task models.
attention based joint model for intent detection and slot filling
A spoken language understanding (SLU) system is a critical component in spoken dialogue systems. An SLU system typically involves identifying the speaker's intent and extracting semantic constituents from the natural language query, two tasks that are often referred to as intent detection and slot filling.
Intent detection and slot filling are usually processed separately. Intent detection can be treated as a semantic utterance classification problem, to which popular classifiers such as support vector machines (SVMs) and deep neural network methods can be applied. Slot filling can be treated as a sequence labeling task. Popular approaches to solving sequence labeling problems include maximum entropy Markov models (MEMMs), conditional random fields (CRFs), and recurrent neural networks (RNNs) [5, 6, 7]. Joint models for intent detection and slot filling have also been proposed in the literature [8, 9]. Such a joint model simplifies the SLU system, as only one model needs to be trained and fine-tuned for the two tasks.
Recently, encoder-decoder neural network models have been successfully applied to many sequence learning problems such as machine translation and speech recognition. The main idea behind the encoder-decoder model is to encode the input sequence into a dense vector, and then use this vector to generate the corresponding output sequence. The attention mechanism introduced for neural machine translation enables the encoder-decoder architecture to learn to align and decode simultaneously.
In this work, we investigate how an SLU model can benefit from the strong modeling capacity of sequence models. The attention-based encoder-decoder model is capable of mapping sequences of different lengths when no alignment information is given. In slot filling, however, alignment is explicit, and thus alignment-based RNN models typically work well. We would like to investigate the combination of the attention-based and alignment-based methods. Specifically, we want to explore how the alignment information in slot filling can be best utilized in the encoder-decoder models, and on the other hand, whether the alignment-based RNN slot filling models can be further improved with the attention mechanism introduced in the encoder-decoder architecture. Moreover, we want to investigate how slot filling and intent detection can be jointly modeled under such schemes.
The remainder of the paper is organized as follows. In section 2, we introduce the background on using RNN for slot filling and using encoder-decoder models for sequence learning. In section 3, we describe two approaches for jointly modeling intent and slot filling. Section 4 discusses the experiment setup and results on ATIS benchmarking task. Section 5 concludes the work.
Slot filling can be treated as a sequence labeling problem, where we have training examples (x⁽ⁱ⁾, y⁽ⁱ⁾) and we want to learn a function f that maps an input sequence x to the corresponding label sequence y. In slot filling, the input sequence and label sequence are of the same length, and thus the alignment is explicit.
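As a concrete illustration of this explicit alignment, consider a hypothetical ATIS-style query tagged with IOB slot labels (the words and labels below are invented for illustration, not drawn from the corpus):

```python
# Hypothetical ATIS-style example: slot filling assigns exactly one IOB label
# to each input word, so input and output sequences are always the same length.
words = ["flights", "from", "boston", "to", "new", "york", "today"]
slots = ["O", "O", "B-fromloc", "O", "B-toloc", "I-toloc", "B-date"]

# The explicit alignment: label slots[t] belongs to word words[t].
assert len(words) == len(slots)
```

This one-to-one correspondence is exactly what machine translation lacks, and it is why alignment-based RNN models are a natural fit for slot filling.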
RNNs have been widely used in many sequence modeling problems [6, 13]. At each time step of slot filling, the RNN reads a word as input and predicts its corresponding slot label, considering all available information from the input and the previously emitted outputs. The model is trained to find the parameter set θ that maximizes the likelihood:

θ̂ = argmax_θ ∏_{t=1}^{T} P(y_t | y_1, …, y_{t−1}, x; θ)

where x represents the input word sequence and y_1, …, y_{t−1} represents the output label sequence prior to time step t. During inference, we want to find the best label sequence ŷ given an input sequence x such that:

ŷ = argmax_y P(y | x)

where the sequence is decoded greedily, with ŷ_1, …, ŷ_{t−1} representing the predicted output sequence prior to time step t. Compared to an RNN model for sequence labeling, the RNN encoder-decoder model is capable of mapping a source sequence to a target sequence of a different length, with no explicit alignment between the two. The attention mechanism later introduced for machine translation enables the encoder-decoder model to learn a soft alignment and to decode at the same time.
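The greedy inference described above can be sketched as follows. This is a minimal illustration, not the paper's TensorFlow implementation; the scoring function is a stand-in for the trained RNN's per-step label distribution:

```python
import numpy as np

def greedy_decode(score_fn, x_seq, labels):
    # At each step t, pick the label maximizing P(y_t | y_1..y_{t-1}, x).
    # score_fn stands in for the RNN: it sees the input and the labels
    # emitted so far, and returns a score per candidate label.
    y_hat = []
    for t in range(len(x_seq)):
        scores = score_fn(x_seq, y_hat, t)
        y_hat.append(labels[int(np.argmax(scores))])
    return y_hat

# Toy scorer that alternates its preferred label index (purely illustrative).
out = greedy_decode(lambda x, y, t: np.eye(2)[t % 2], [0, 0, 0], ["O", "B-loc"])
assert out == ["O", "B-loc", "O"]
```

Greedy decoding commits to one label per step; a beam search could instead track several candidate label histories, at higher cost.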
In this section, we first describe our approach to integrating alignment information into the encoder-decoder architecture for slot filling and intent detection. Following that, we describe the proposed method for introducing the attention mechanism from the encoder-decoder architecture into the alignment-based RNN models.
The encoder-decoder model for joint intent detection and slot filling is illustrated in Figure 2. On the encoder side, we use a bidirectional RNN. Bidirectional RNNs have been successfully applied in speech recognition and spoken language understanding. We use an LSTM as the basic recurrent network unit for its ability to better model long-term dependencies compared to a simple RNN.
In slot filling, we want to map a word sequence x = (x_1, …, x_T) to its corresponding slot label sequence y = (y_1, …, y_T). The bidirectional RNN encoder reads the source word sequence forward and backward. The forward RNN reads the word sequence in its original order and generates a hidden state fh_t at each time step. Similarly, the backward RNN reads the word sequence in its reverse order and generates a sequence of hidden states bh_t. The final encoder hidden state h_t at each time step is a concatenation of the forward state fh_t and backward state bh_t, i.e. h_t = [fh_t, bh_t].
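The bidirectional encoding can be sketched as below. This is a toy NumPy stand-in, not the paper's model: a simple tanh recurrence with fixed random weights replaces the LSTM, and the shapes are illustrative:

```python
import numpy as np

def birnn_encode(x_seq, cell_size=4):
    # Toy "RNN" cell: tanh recurrence with fixed random weights
    # (a stand-in for the LSTM used in the paper).
    rng = np.random.default_rng(0)
    W = rng.standard_normal((cell_size, cell_size)) * 0.1
    U = rng.standard_normal((x_seq.shape[1], cell_size)) * 0.1

    def run(seq):
        h, out = np.zeros(cell_size), []
        for x in seq:
            h = np.tanh(h @ W + x @ U)
            out.append(h)
        return out

    fwd = run(x_seq)               # forward RNN: reads left to right
    bwd = run(x_seq[::-1])[::-1]   # backward RNN: reads right to left, re-aligned
    # h_t = [fh_t, bh_t]: concatenate forward and backward states per step.
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

T, d = 5, 3
states = birnn_encode(np.ones((T, d)))
assert len(states) == T and states[0].shape == (8,)
```

Note the re-alignment step: the backward pass is reversed again so that both states at index t describe the same input word.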
The last states of the forward and backward encoder RNNs carry information of the entire source sequence. We use the last state of the backward encoder RNN to compute the initial decoder hidden state. The decoder is a unidirectional RNN. Again, we use an LSTM cell as the basic RNN unit. At each decoding step i, the decoder state s_i is calculated as a function of the previous decoder state s_{i−1}, the previously emitted label y_{i−1}, the aligned encoder hidden state h_i, and the context vector c_i:

s_i = f(s_{i−1}, y_{i−1}, h_i, c_i)
where the context vector c_i is computed as a weighted sum of the encoder states h = (h_1, …, h_T):

c_i = Σ_j α_{i,j} h_j,  with  α_{i,j} = exp(e_{i,j}) / Σ_k exp(e_{i,k})  and  e_{i,j} = g(s_{i−1}, h_j)

where g is a feed-forward neural network. At each decoding step, the explicitly aligned input is the encoder state h_i. The context vector c_i provides additional information to the decoder and can be seen as a continuous bag of weighted features (h_1, …, h_T).
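The weighted-sum attention above can be sketched in NumPy. This is an illustrative stand-in: the tiny single-layer scorer below plays the role of the feed-forward network g, with invented weights and shapes:

```python
import numpy as np

def attention_context(s_prev, enc_states):
    # e_{i,j} = g(s_{i-1}, h_j): a tiny one-layer scorer stands in for the
    # feed-forward network g (weights are random and purely illustrative).
    rng = np.random.default_rng(1)
    d = enc_states.shape[1]
    W = rng.standard_normal((s_prev.shape[0] + d, 1)) * 0.1
    e = np.array([np.tanh(np.concatenate([s_prev, h]) @ W)[0]
                  for h in enc_states])
    a = np.exp(e) / np.exp(e).sum()        # softmax attention weights alpha_{i,j}
    c = (a[:, None] * enc_states).sum(0)   # c_i = sum_j alpha_{i,j} h_j
    return c, a

c, a = attention_context(np.zeros(4), np.ones((6, 8)))
assert abs(a.sum() - 1.0) < 1e-9 and c.shape == (8,)
```

The softmax guarantees the weights form a distribution over source positions, so the context vector is a convex combination of encoder states.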
For joint modeling of intent detection and slot filling, we add an additional decoder for the intent detection (or intent classification) task that shares the same encoder with the slot filling decoder. During model training, costs from both decoders are back-propagated to the encoder. The intent decoder generates only one single output, which is the intent class distribution of the sentence, and thus alignment is not required. The intent decoder state is a function of the shared initial decoder state s_0, which encodes information of the entire source sequence, and the context vector c, which indicates the part of the source sequence that the intent decoder pays attention to.
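The shared-encoder training objective can be sketched as a sum of the two decoder losses. This is a minimal NumPy sketch of the idea, not the paper's TensorFlow code; the logits and labels below are placeholders:

```python
import numpy as np

def joint_loss(slot_logits, slot_labels, intent_logits, intent_label):
    # Cross-entropy for each decoder; in the joint model, gradients of both
    # terms flow back into the shared encoder.
    def xent(logits, label):
        p = np.exp(logits - logits.max())
        p /= p.sum()
        return -np.log(p[label])

    slot = sum(xent(l, y) for l, y in zip(slot_logits, slot_labels)) / len(slot_labels)
    intent = xent(intent_logits, intent_label)
    return slot + intent   # combined training objective

# Toy call with one slot position and a 3-way intent distribution.
loss = joint_loss([np.array([2.0, 0.1])], [0], np.array([0.5, 1.5, 0.2]), 1)
assert loss > 0
```

An unweighted sum is shown here for simplicity; in practice the two loss terms could also be weighted differently.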
The attention-based RNN model for joint intent detection and slot filling is illustrated in Figure 3. The idea of introducing attention to the alignment-based RNN sequence labeling model is motivated by the use of the attention mechanism in encoder-decoder models. In a bidirectional RNN for sequence labeling, the hidden state at each time step carries information of the whole sequence, but this information may gradually be lost along the forward and backward propagation. Thus, when making a slot label prediction, instead of only utilizing the aligned hidden state at each step, we would like to see whether the use of a context vector gives us any additional supporting information, especially for longer-term dependencies that are not fully captured by the hidden state.
In the proposed model, a bidirectional RNN (BiRNN) reads the source sequence in both forward and backward directions. We use an LSTM cell as the basic RNN unit. Slot label dependencies are modeled in the forward RNN. Similar to the encoder module in the encoder-decoder architecture described above, the hidden state h_t at each step is a concatenation of the forward state fh_t and backward state bh_t, i.e. h_t = [fh_t, bh_t]. Each hidden state contains information of the whole input word sequence, with a strong focus on the parts surrounding the word at step t. This hidden state is then combined with the context vector c_t to produce the label distribution, where the context vector is calculated as a weighted average of the RNN hidden states.
For joint modeling of intent detection and slot filling, we reuse the pre-computed hidden states of the bidirectional RNN to produce the intent class distribution. If attention is not used, we apply mean-pooling over time on the hidden states, followed by logistic regression to perform the intent classification. If attention is enabled, we instead take the weighted average of the hidden states over time.
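The two pooling options reduce to the same operation when the attention weights are uniform, which makes mean-pooling a natural baseline. A minimal NumPy sketch (shapes and values are illustrative):

```python
import numpy as np

def intent_features(hidden, weights=None):
    # hidden: (T, d) BiRNN states over time.
    # Without attention: mean-pool over time.
    # With attention: weighted average using the given attention weights.
    if weights is None:
        return hidden.mean(axis=0)
    return (weights[:, None] * hidden).sum(axis=0)

h = np.arange(12, dtype=float).reshape(4, 3)   # 4 time steps, 3-dim states
uniform = np.full(4, 0.25)

# With uniform attention weights, the two reductions coincide.
assert np.allclose(intent_features(h), intent_features(h, uniform))
```

The pooled vector then feeds a logistic regression layer to produce the intent class distribution.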
Compared to the attention-based encoder-decoder model that utilizes explicitly aligned inputs, the attention-based RNN model is more computationally efficient. During model training, the encoder-decoder slot filling model reads through the input sequence twice, while the attention-based RNN model reads through it only once.
The ATIS (Airline Travel Information Systems) data set is widely used in SLU research. The data set contains audio recordings of people making flight reservations. In this work, we follow the ATIS corpus setup used in [6, 7, 9, 19] (we thank Gokhan Tur and Puyang Xu for sharing the ATIS data set). The training set contains 4978 utterances from the ATIS-2 and ATIS-3 corpora, and the test set contains 893 utterances from the ATIS-3 NOV93 and DEC94 data sets. There are in total 127 distinct slot labels and 18 different intent types. We evaluate the system performance on slot filling using the F1 score, and the performance on intent detection using the classification error rate.
We obtained another ATIS text corpus that was used in prior SLU evaluations. This corpus contains 5138 utterances with both intent and slot labels annotated. In total there are 110 distinct slot labels and 21 intent types. We use the same 10-fold cross-validation setup as in the prior work.
An LSTM cell is used as the basic RNN unit in the experiments, following a standard LSTM design. Given the size of the data set, we set the number of units in the LSTM cell to 128. The default forget gate bias is set to 1. We use only one layer of LSTM in the proposed models; deeper models built by stacking LSTM layers are left to future work.
Word embeddings of size 128 are randomly initialized and fine-tuned during mini-batch training with a batch size of 16. A dropout rate of 0.5 is applied to the non-recurrent connections during model training for regularization. The maximum norm for gradient clipping is set to 5. We use the Adam optimization method with its suggested parameter setup.
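Gradient clipping by maximum global norm, as used above, rescales all gradients jointly when their combined L2 norm exceeds the threshold. A minimal NumPy sketch of the idea (not the TensorFlow implementation):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    # Rescale all gradients together so their global L2 norm
    # does not exceed max_norm; leave them unchanged otherwise.
    norm = np.sqrt(sum((g ** 2).sum() for g in grads))
    scale = min(1.0, max_norm / norm)
    return [g * scale for g in grads]

# A gradient of [10, 10, 10, 10] has norm 20, so it gets scaled by 0.25.
clipped = clip_by_global_norm([np.full(4, 10.0)], max_norm=5.0)
new_norm = np.sqrt(sum((g ** 2).sum() for g in clipped))
assert new_norm <= 5.0 + 1e-9
```

Clipping the global norm (rather than each gradient independently) preserves the relative direction of the update while bounding its magnitude.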
We first report the results on our independent task training models. Table 1 shows the slot filling F1 scores using our proposed architectures. Table 2 compares our proposed model performance on slot filling to previously reported results.
In Table 1, the first set of results are for variations of the encoder-decoder models described in section 3.1. Not surprisingly, the pure attention-based slot filling model that does not utilize explicit alignment information performs poorly. Letting the model learn the alignment from training data does not seem to be appropriate for the slot filling task. Lines 2 and 3 show the F1 scores of the non-attention and attention-based encoder-decoder models that utilize the aligned inputs. The attention-based model gives a slightly better F1 score than the non-attention-based one, on both the average and best scores. By inspecting the attention learned by the model, we find that the attention weights tend to be evenly distributed across words in the source sequence. There are a few cases where we observe insightful attention (Figure 4) that the decoder pays to the input sequence, and that might partly explain the observed performance gain when attention is enabled.
The second set of results in Table 1 are for the bidirectional RNN models described in section 3.2. Similar to the previous set of results, we observe a slightly improved F1 score for the model that uses attention. The contribution from the context vector for slot filling is not very obvious. It seems that for sequences of this length (the average sentence length is 11 words for this ATIS corpus), the hidden state produced by the bidirectional RNN is capable of encoding most of the information needed to make the slot label prediction.
Table 2 compares our slot filling models to previous approaches. Both of our model architectures improve on the best F1 scores reported previously.
Table 3 compares the intent classification error rate between our intent models and previous approaches. The intent error rates of our proposed models beat the previous state-of-the-art results by a large margin. The attention-based encoder-decoder intent model outperforms the bidirectional RNN model. This might be attributed to the sequence-level information passed from the encoder and the additional layer of non-linearity in the decoder RNN.
Table 4 shows our joint training model performance on intent detection and slot filling compared to previously reported results. As shown in this table, the joint training model using the encoder-decoder architecture achieves a 0.09% absolute gain on slot filling and a 0.45% absolute gain (22.2% relative improvement) on intent detection over the independent training model. For the attention-based bidirectional RNN architecture, the joint training model achieves a 0.23% absolute gain on slot filling and a 0.56% absolute gain (23.8% relative improvement) on intent detection over the independent training models. The attention-based RNN model seems to benefit more from joint training. Results from both of our joint training approaches outperform the best previously reported joint modeling results.
In this paper, we explored strategies in utilizing explicit alignment information in the attention-based encoder-decoder neural network models. We further proposed an attention-based bidirectional RNN model for joint intent detection and slot filling. Using a joint model for the two SLU tasks simplifies the dialog system, as only one model needs to be trained and deployed. Our independent training models achieved state-of-the-art performance for both intent detection and slot filling on the benchmark ATIS task. The proposed joint training models improved the intent detection accuracy and slot filling F1 score further over the independent training models.
K. Yao, B. Peng, Y. Zhang, D. Yu, G. Zweig, and Y. Shi, “Spoken language understanding using long short-term memory neural networks,” in Spoken Language Technology Workshop (SLT), 2014 IEEE. IEEE, 2014, pp. 189–194.
Proc. NIPS Workshop on Machine Learning for Spoken Language Understanding and Interactions, 2015.
P. Xu and R. Sarikaya, “Convolutional neural network based triangular CRF for joint intent detection and slot filling,” in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013, pp. 78–83.