The Slot Filling task is a subtask of Spoken Language Understanding (SLU) and can be treated as a standard sequence labeling or sequence discrimination task Mesnil et al. (2013). Figure 1 shows a typical sentence in the Airline Travel Information System (ATIS) dataset Hemphill et al. (1990) together with its domain, intent, named entity, and slot annotations. Typically, an SLU system first recognizes the domain and intent of a sentence, then relies on a Slot Filling module to extract the additional information needed to determine the appropriate response to the user.
The annotation of slots and named entities follows the IOB (Inside/Outside/Beginning) convention: a B- prefix marks a tag that begins a chunk, an I- prefix marks a tag inside a chunk, and an O tag marks a token that belongs to no chunk. Slot Filling is therefore similar to the Named Entity Recognition (NER) task, although slots are more specific than named entities. For example, the slot tag of "Boston" is B-departure while its named entity tag is B-city.
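The IOB convention above can be made concrete with a short decoding sketch. The slot names and the second example phrase below are illustrative, not the exact ATIS label set:

```python
def iob_chunks(tokens, tags):
    """Collect (slot-name, phrase) pairs from an IOB-tagged sentence."""
    chunks, current_tag, current_toks = [], None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):                 # beginning of a new chunk
            if current_tag:
                chunks.append((current_tag, " ".join(current_toks)))
            current_tag, current_toks = tag[2:], [tok]
        elif tag.startswith("I-") and current_tag == tag[2:]:
            current_toks.append(tok)             # continuation of the open chunk
        else:                                    # an O tag closes any open chunk
            if current_tag:
                chunks.append((current_tag, " ".join(current_toks)))
            current_tag, current_toks = None, []
    if current_tag:                              # flush a chunk that ends the sentence
        chunks.append((current_tag, " ".join(current_toks)))
    return chunks

tokens = ["show", "flights", "from", "Boston", "to", "New", "York"]
tags   = ["O", "O", "O", "B-departure", "O", "B-arrival", "I-arrival"]
print(iob_chunks(tokens, tags))
# [('departure', 'Boston'), ('arrival', 'New York')]
```

Note that an I- tag only extends a chunk when its label matches the open B- chunk; otherwise it is treated as outside, which keeps malformed tag sequences from merging into one slot.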
In recent years, models based on Recurrent Neural Networks (RNNs) have been applied to the Slot Filling problem and achieved state-of-the-art performance Mesnil et al. (2013). However, training typical RNN models is data-demanding, which limits their practical use in the many specific domains where large amounts of labeled training data are not available.
In this paper, we investigate the effect of incorporating pre-trained word embeddings and language models into RNN-based Slot Filling models. Our evaluation on the Airline Travel Information System (ATIS) corpus shows that incorporating an extra language model embedding layer pre-trained on an unlabeled corpus can significantly reduce the amount of labeled training data required without sacrificing Slot Filling performance. Our results also suggest that using the pre-trained GloVe word embeddings and a bidirectional Long Short-Term Memory (bi-LSTM) model achieves better performance on the Slot Filling task.
2 Related Work
Modern methods for the Slot Filling problem include generative models such as the Hidden Markov Model (HMM) Wang et al. (2005) and discriminative models such as the Conditional Random Field (CRF) Lafferty et al. (2001). With the popularity of RNNs in many other natural language processing (NLP) tasks such as language modeling Mikolov et al. (2010) and machine translation Cho et al. (2014), RNNs have also been applied to Slot Filling and achieved state-of-the-art performance Mesnil et al. (2015).
However, RNN models usually need a large amount of labeled training data to reach the expected performance. The work presented in this paper is inspired by previous work that fine-tunes pre-trained word embeddings to improve deep learning models (e.g., Mesnil et al. (2013)), and by Peters et al. (2017), who used a pre-trained language model to encode the surrounding context of each word and improved NER performance.
3 RNN with Language Model Embedding
In light of the success of RNNs in language modeling and many other natural language processing tasks, RNNs were introduced to the slot filling problem and, unsurprisingly, achieved state-of-the-art performance Mesnil et al. (2015). However, RNNs are data-hungry: they need a large amount of labeled data to reach the expected performance. In this section we discuss how to alleviate this shortcoming with a pre-trained language model embedding.
Baseline LSTM Model
Our baseline model has essentially the same structure as that described by Mesnil et al. (2015), except that we replace the simple Jordan and Elman versions of the RNN with a modern two-layer LSTM.
We briefly denote the LSTM transformation as $h_t = \mathrm{LSTM}(x_t, h_{t-1})$ in the subsequent content. Our baseline LSTM model can then be described as

$h_t^{(1)} = \mathrm{LSTM}^{(1)}(x_t, h_{t-1}^{(1)})$
$h_t^{(2)} = \mathrm{LSTM}^{(2)}(h_t^{(1)}, h_{t-1}^{(2)})$
$y_t = \mathrm{softmax}(W h_t^{(2)} + b)$

where $x_t$ represents the word embedding of the $t$-th input word and $y_t$ is the predicted slot tag distribution.
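The LSTM recurrence used in each layer can be sketched in plain numpy. This is a minimal single-step sketch with a stacked gate projection; the weight shapes and dimensions are illustrative, not the paper's actual hyperparameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,) hold the
    input, forget, output, and candidate projections stacked together."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate
    f = sigmoid(z[H:2*H])        # forget gate
    o = sigmoid(z[2*H:3*H])      # output gate
    g = np.tanh(z[3*H:4*H])      # candidate cell state
    c = f * c_prev + i * g       # new cell state
    h = o * np.tanh(c)           # new hidden state
    return h, c

# Tiny smoke run with random weights (dimensions are illustrative).
rng = np.random.default_rng(0)
D, H = 4, 3
x, h0, c0 = rng.normal(size=D), np.zeros(H), np.zeros(H)
W, U, b = rng.normal(size=(4*H, D)), rng.normal(size=(4*H, H)), np.zeros(4*H)
h1, c1 = lstm_step(x, h0, c0, W, U, b)
print(h1.shape)  # (3,)
```

Stacking two such layers, with the first layer's $h_t^{(1)}$ fed as the second layer's input, gives the two-layer baseline.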
LSTM Model with Pre-trained Language Model Embedding
Nowadays, using pre-trained word embeddings such as Word2Vec or GloVe is popular in many NLP tasks. Instead of initializing the word embedding matrix with random values, initializing it with pre-trained embeddings provides useful semantic and syntactic knowledge learned from another, much larger dataset. For the slot filling task, however, it is important to represent not only the meaning of a word but also the word in its context. The state-of-the-art method (our baseline model) relies on the RNN to encode word sequences into a context-sensitive representation, which requires additional labeled data Peters et al. (2017). Inspired by that work, we implemented an LSTM model with language model embeddings pre-trained on the One Billion Word Benchmark Jozefowicz et al. (2016), which contains one billion words and a vocabulary of about 800K words. As illustrated in Figure 2, the input sentence is fed into both the GloVe embedding and the pre-trained language model, and the resulting embeddings are concatenated as the new input to the downstream two-layer LSTM.
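The concatenation step amounts to widening each token's input vector. A minimal sketch, assuming a 50-dimensional GloVe vector and a 1024-dimensional language model embedding (both dimensions are stand-ins, not the paper's reported sizes):

```python
import numpy as np

rng = np.random.default_rng(1)
glove_dim, lm_dim = 50, 1024

glove_vec = rng.normal(size=glove_dim)   # stand-in for a GloVe lookup
lm_vec = rng.normal(size=lm_dim)         # stand-in for the LM encoder output

# The downstream LSTM sees one widened vector per token.
x = np.concatenate([glove_vec, lm_vec])
print(x.shape)  # (1074,)
```

The LSTM's input weight matrix simply grows to match the combined dimension; nothing else in the recurrence changes.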
Other Models Implemented
Besides the models discussed above, we also implemented a bidirectional LSTM with GloVe word embeddings.
We choose a bi-directional recurrent neural network (bi-RNN) because incorporating information from the succeeding words has been shown to be important for the slot filling task Vu et al. (2016). In a bi-directional RNN, words from both previous and future time steps are considered when predicting the semantic tag of the target word. GloVe embeddings also provide useful additional semantic and syntactic information. We experimented with all combinations of these settings and report the results in the next section.
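The bidirectional reading can be sketched with a toy tanh RNN in place of the LSTM cell (a simplification for brevity; all dimensions are illustrative). The forward and backward states are concatenated at each position, so every tag decision sees both left and right context:

```python
import numpy as np

def rnn_pass(xs, W, U, b):
    """Simple tanh RNN over a sequence; returns the hidden state per step."""
    h = np.zeros(U.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(W @ x + U @ h + b)
        hs.append(h)
    return hs

rng = np.random.default_rng(2)
T, D, H = 5, 4, 3
xs = [rng.normal(size=D) for _ in range(T)]
Wf, Uf, bf = rng.normal(size=(H, D)), rng.normal(size=(H, H)), np.zeros(H)
Wb, Ub, bb = rng.normal(size=(H, D)), rng.normal(size=(H, H)), np.zeros(H)

fwd = rnn_pass(xs, Wf, Uf, bf)               # left-to-right pass
bwd = rnn_pass(xs[::-1], Wb, Ub, bb)[::-1]   # right-to-left pass, re-aligned
states = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(len(states), states[0].shape)  # 5 (6,)
```

Each position's concatenated state then feeds the softmax tagger, exactly as the forward-only hidden state does in the unidirectional model.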
4 Experiments
We evaluate our models on the widely used Airline Travel Information System (ATIS) dataset Hemphill et al. (1990). Words in the dataset are labeled in the IOB format as shown in Table 1:
Using TensorFlow, we first implement a two-layer LSTM network as our baseline model. We then compare it with the proposed LSTM with language model embedding and two more complex architectures, and report the F1 score on the test set for each of the following:
Baseline LSTM: A forward RNN with LSTM cells, where the language model embedding and GloVe word embedding are not included.
LSTM + LM: A forward LSTM (baseline) with pre-trained language model embedding, where the GloVe word embedding is not included.
Bi-LSTM + LM: A bi-directional LSTM with pre-trained language model embedding, where the GloVe word embedding is not included.
Bi-LSTM + LM + Glove: A bi-directional LSTM with pre-trained language model embedding and GloVe word embedding.
For the model architectures with GloVe word embeddings, we assign the pre-trained embedding vectors to the words in our dataset that can be found in the GloVe vocabulary and randomly initialize the embeddings of the remaining words. For models without GloVe embeddings, we randomly initialize the whole word embedding matrix. We use the RMSProp optimizer with a constant learning rate to minimize the cross-entropy loss, and apply dropout with a keep probability of 80% to avoid overfitting.
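The two regularization and training pieces above can be sketched in numpy. This is a minimal illustration of inverted dropout with an 80% keep probability and per-token softmax cross-entropy, not the paper's TensorFlow implementation:

```python
import numpy as np

def dropout(h, keep_prob, rng, train=True):
    """Inverted dropout: scale survivors by 1/keep_prob at train time,
    so no rescaling is needed at test time."""
    if not train:
        return h
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single token's slot prediction,
    computed in log space for numerical stability."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

rng = np.random.default_rng(3)
h = rng.normal(size=10)
h_drop = dropout(h, keep_prob=0.8, rng=rng)   # roughly 20% of units zeroed
loss = cross_entropy(rng.normal(size=5), label=2)
print(loss >= 0.0)  # True
```

The sequence loss is simply this per-token loss summed (or averaged) over the sentence.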
We train the four models with training set sizes ranging from 100 to 3,600 sentences. Tracking the loss during training, we find that 40 epochs are sufficient for the optimizer to converge, so we evaluate the four models on a test set of 893 sentences after 40 epochs and plot their F1 scores in Figure 3.
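The F1 metric reported here is chunk-level, in the CoNLL style: a slot counts as correct only if both its boundaries and its label match the gold annotation exactly. A compact sketch (slot names are illustrative):

```python
def chunks(tags):
    """Extract (start, end, label) spans from an IOB tag sequence."""
    out, start, label = set(), None, None
    for i, t in enumerate(tags + ["O"]):       # sentinel O flushes the last chunk
        if t.startswith("B-") or t == "O" or (t.startswith("I-") and t[2:] != label):
            if label is not None:
                out.add((start, i, label))
            start, label = (i, t[2:]) if t.startswith("B-") else (None, None)
    return out

def f1(gold, pred):
    """Chunk-level F1: exact span and label match required."""
    g, p = chunks(gold), chunks(pred)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = ["O", "B-depart", "O", "B-arrive", "I-arrive"]
pred = ["O", "B-depart", "O", "B-arrive", "O"]
print(round(f1(gold, pred), 2))  # 0.5 -- the truncated arrival span gets no credit
```

This strictness is why chunk F1 is a harder metric than per-token accuracy: a partially correct span contributes nothing.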
Observing the trend of each curve, we see that the F1 score increases rapidly with dataset size while the number of sentences is below 600. Increasing the dataset size beyond that point does not improve performance much, since the amount of data is already sufficient to train models of our chosen complexity.
| Data Size | Model | F1 Score |
|-----------|-----------------------|----------|
|           | LSTM + LM             | 85.49 |
|           | Bi-LSTM + LM          | 86.88 |
|           | Bi-LSTM + LM + GloVe  | 90.24 |
|           | LSTM + LM             | 89.24 |
|           | Bi-LSTM + LM          | 89.84 |
|           | Bi-LSTM + LM + GloVe  | 92.81 |
|           | LSTM + LM             | 89.14 |
|           | Bi-LSTM + LM          | 91.65 |
|           | Bi-LSTM + LM + GloVe  | 92.39 |
|           | LSTM + LM             | 90.25 |
|           | Bi-LSTM + LM          | 90.93 |
|           | Bi-LSTM + LM + GloVe  | 93.35 |
|           | LSTM + LM             | 90.43 |
|           | Bi-LSTM + LM          | 91.12 |
|           | Bi-LSTM + LM + GloVe  | 93.20 |
|           | LSTM + LM             | 91.50 |
|           | Bi-LSTM + LM          | 92.53 |
|           | Bi-LSTM + LM + GloVe  | 94.07 |
|           | LSTM + LM             | 91.49 |
|           | Bi-LSTM + LM          | 92.70 |
|           | Bi-LSTM + LM + GloVe  | 94.48 |
|           | LSTM + LM             | 91.13 |
|           | Bi-LSTM + LM          | 93.17 |
|           | Bi-LSTM + LM + GloVe  | 94.09 |
|           | LSTM + LM             | 90.71 |
|           | Bi-LSTM + LM          | 93.41 |
|           | Bi-LSTM + LM + GloVe  | 94.19 |
|           | LSTM + LM             | 91.57 |
|           | Bi-LSTM + LM          | 93.38 |
|           | Bi-LSTM + LM + GloVe  | 94.47 |
For detailed comparison, we also list the F1 scores for the different training data sizes in Table 2. Comparing the F1 scores of the different models, we find that adding the pre-trained language model embedding significantly improves the performance of the LSTM, especially when the training dataset is relatively small. For instance, with only 100 and 200 training examples, our best model (Bi-LSTM+LM+GloVe) outperforms the baseline LSTM by large margins of 18% and 10%, respectively. These results demonstrate the potential of the Bi-LSTM+LM+GloVe model for practical domains where labeled training data is difficult or expensive to obtain.
As the training data size increases, the benefit of the pre-trained language model embedding becomes less significant, since the dataset is then large enough for the baseline LSTM to learn a good context model. Nevertheless, we can still conclude that, besides the proposed language model embedding, GloVe word embeddings and the bi-directional LSTM also help to improve performance on the slot filling task. Even with a sufficient amount of labeled training data, the Bi-LSTM+LM+GloVe model still outperforms the baseline LSTM by nearly 3%.
5 Conclusion and Future Work
In this paper, we proposed a bi-directional LSTM model with pre-trained language model embedding and GloVe word embeddings for the slot filling task. The model significantly improves recognition performance over the baseline LSTM, especially when few labeled sentences are available in a specific domain.
One direction for future work is to explore what other kinds of general knowledge can be learned from public resources and embedded into the model for a specific task domain.
- Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
- Hemphill et al. (1990) Charles T Hemphill, John J Godfrey, and George R Doddington. 1990. The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990.
- Jozefowicz et al. (2016) Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.
- Lafferty et al. (2001) John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML).
- Mesnil et al. (2015) Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, et al. 2015. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):530–539.
- Mesnil et al. (2013) Grégoire Mesnil, Xiaodong He, Li Deng, and Yoshua Bengio. 2013. Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding. In Interspeech, pages 3771–3775.
- Mikolov et al. (2010) Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.
- Peters et al. (2017) Matthew E Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. arXiv preprint arXiv:1705.00108.
- Vu et al. (2016) Ngoc Thang Vu, Pankaj Gupta, Heike Adel, and Hinrich Schütze. 2016. Bi-directional recurrent neural network with ranking loss for spoken language understanding. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 6060–6064. IEEE.
- Wang et al. (2005) Ye-Yi Wang, Li Deng, and Alex Acero. 2005. Spoken language understanding. IEEE Signal Processing Magazine, 22(5):16–31.