Recurrent Neural Networks with Pre-trained Language Model Embedding for Slot Filling Task

12/12/2018, by Liang Qiu et al.

In recent years, Recurrent Neural Networks (RNNs) based models have been applied to the Slot Filling problem of Spoken Language Understanding and achieved the state-of-the-art performances. In this paper, we investigate the effect of incorporating pre-trained language models into RNN based Slot Filling models. Our evaluation on the Airline Travel Information System (ATIS) data corpus shows that we can significantly reduce the size of labeled training data and achieve the same level of Slot Filling performance by incorporating extra word embedding and language model embedding layers pre-trained on unlabeled corpora.


1 Introduction

The Slot Filling task is a subtask of Spoken Language Understanding (SLU) and can be treated as a standard sequence labeling or sequence discrimination task Mesnil et al. (2013). Figure 1 shows a typical sentence in the Airline Travel Information System (ATIS) dataset Hemphill et al. (1990) and its annotations of domain, intent, named entities and slots. Typically, an SLU system first recognizes the domain and intent of a sentence, and then relies on a Slot Filling module to extract the additional information needed to determine the appropriate response to the user.

The annotation of slots and named entities follows the IOB (Inside/Outside/Beginning) convention. The B- prefix indicates that a tag marks the beginning of a chunk, the I- prefix indicates that a tag is inside a chunk, and the O tag indicates that a token belongs to no chunk. Slot Filling is thus similar to the Named Entity Recognition (NER) task, although slots are more specific than named entities. For example, the slot tag of "Boston" is B-departure while its named entity tag is B-city.

Figure 1: ATIS Utterance Example with the IOB Representation Hemphill et al. (1990).
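
To make the IOB convention concrete, the following Python sketch (not from the original paper; the chunk format and helper name are assumptions for illustration) converts chunk-level slot annotations into token-level IOB tags:

    def to_iob(tokens, chunks):
        # chunks: list of (start, end, label) spans, end exclusive.
        tags = ["O"] * len(tokens)
        for start, end, label in chunks:
            tags[start] = "B-" + label          # first token of the chunk
            for i in range(start + 1, end):
                tags[i] = "I-" + label          # tokens inside the chunk
        return tags

    tokens = ["flights", "from", "Boston", "to", "San", "Francisco"]
    chunks = [(2, 3, "fromloc.city_name"), (4, 6, "toloc.city_name")]
    print(list(zip(tokens, to_iob(tokens, chunks))))
    # [('flights', 'O'), ('from', 'O'), ('Boston', 'B-fromloc.city_name'),
    #  ('to', 'O'), ('San', 'B-toloc.city_name'), ('Francisco', 'I-toloc.city_name')]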

In recent years, Recurrent Neural Network (RNN) based models have been applied to the Slot Filling problem and achieved state-of-the-art performance Mesnil et al. (2013). However, training typical RNN models often demands large amounts of data, which limits their practical use in many specific domains where large amounts of labeled training data are not available.

In this paper, we investigate the effect of incorporating pre-trained word embedding and language models into RNN based Slot Filling models. Our evaluation on the Airline Travel Information System (ATIS) data corpus shows that incorporating an extra language model embedding layer pre-trained on an unlabeled corpus can significantly reduce the size of labeled training data without sacrificing the Slot Filling performance. Our results also suggest that using the pre-trained GloVe word embedding model and the bidirectional Long-Short Term Memory (bi-LSTM) model can achieve a better performance for the Slot Filling task.

2 Related Work

Modern methods for the Slot Filling problem include generative models such as the Hidden Markov Model (HMM) Wang et al. (2005) and discriminative models such as the Conditional Random Field (CRF) Lafferty et al. (2001). With the popularity of RNNs in many other natural language processing (NLP) tasks such as language modeling Mikolov et al. (2010) and machine translation Cho et al. (2014), RNNs have also been applied to Slot Filling and achieved state-of-the-art performance Mesnil et al. (2015).

However, RNN models usually need to be trained with a large amount of labeled data to achieve the expected performance. The work presented in this paper was inspired by previous work that fine-tunes pre-trained word embedding models to improve the performance of deep learning based models (e.g., Mesnil et al. (2013)), and by the work of Peters et al. (2017), which used a pre-trained language model to encode the surrounding context of each word and improved NER performance.

3 RNN with Language Model Embedding

Overview

In light of the success of RNNs in language modeling and many other natural language processing tasks, RNNs were introduced to solve the slot filling problem and, unsurprisingly, achieved state-of-the-art performance Mesnil et al. (2015). However, RNNs are data-hungry: they need to be trained with a large amount of labeled data to achieve the expected performance. In this section we discuss how to alleviate this shortcoming with pre-trained language model embeddings.

Baseline LSTM Model

Our baseline model has essentially the same structure as that described in Mesnil et al. (2015), except that we replace the simple Jordan and Elman variants of the RNN with a modern two-layer LSTM.

We will refer to the LSTM cell transition function as LSTM(·) in the subsequent content. Our baseline LSTM model can be described as below:

h^{(1)}_t = LSTM^{(1)}(e(w_t), h^{(1)}_{t-1})
h^{(2)}_t = LSTM^{(2)}(h^{(1)}_t, h^{(2)}_{t-1})
\hat{y}_t = softmax(W h^{(2)}_t + b)

where e(w_t) represents the word embedding of the input word w_t, h^{(1)}_t and h^{(2)}_t are the hidden states of the two LSTM layers, and \hat{y}_t is the predicted distribution over slot tags.
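
As a concrete (though unofficial) illustration, a minimal tf.keras sketch of such a two-layer LSTM tagger is given below; the vocabulary size, embedding dimension, hidden size, and tag count are placeholder assumptions, not values from the paper.

    import tensorflow as tf

    VOCAB_SIZE = 10000   # placeholder vocabulary size
    EMB_DIM = 100        # placeholder word embedding dimension
    HIDDEN = 128         # placeholder LSTM hidden size
    NUM_TAGS = 127       # number of IOB slot tags (dataset-dependent)

    words = tf.keras.Input(shape=(None,), dtype="int32")            # token ids
    x = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM, mask_zero=True)(words)
    x = tf.keras.layers.LSTM(HIDDEN, return_sequences=True)(x)      # first LSTM layer
    x = tf.keras.layers.LSTM(HIDDEN, return_sequences=True)(x)      # second LSTM layer
    tags = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(NUM_TAGS, activation="softmax"))(x)   # per-token slot tag
    baseline = tf.keras.Model(words, tags)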

LSTM Model with Pre-trained Language Model Embedding

Nowadays, using pre-trained word embeddings such as Word2Vec or GloVe is quite popular in NLP tasks. Initializing the word embedding matrix with pre-trained embeddings, instead of random values, provides useful semantic and syntactic knowledge learned from another large dataset. However, for the slot filling task, in addition to the meaning of a word it is also important to represent the word in context. The state-of-the-art method (our baseline model) relies on the RNN model to encode the word sequence into a context-sensitive representation, which requires additional labeled data Peters et al. (2017). Inspired by this work, we implemented an LSTM RNN model with language model embeddings pre-trained on the One Billion Word Benchmark Jozefowicz et al. (2016), which contains one billion words and a vocabulary of about 800K words. As illustrated in Figure 2, the input sentence is fed into both the GloVe embedding layer and the pre-trained language model, and the resulting embeddings are concatenated as the new input embedding for the downstream two-layer LSTM model.

Figure 2: LSTM Model with Pre-trained Language Model Embedding.
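
A hedged sketch of this architecture follows. For simplicity it assumes the language model embeddings have been precomputed for each token by running the pre-trained LM over the data once, and are fed to the tagger as a second, fixed-dimensional input sequence; all sizes are placeholders.

    import tensorflow as tf

    VOCAB_SIZE, EMB_DIM, LM_DIM, HIDDEN, NUM_TAGS = 10000, 100, 1024, 128, 127  # placeholders

    words = tf.keras.Input(shape=(None,), dtype="int32")           # token ids
    lm_emb = tf.keras.Input(shape=(None, LM_DIM))                  # precomputed LM embeddings
    w_emb = tf.keras.layers.Embedding(VOCAB_SIZE, EMB_DIM)(words)  # (GloVe-initialized) lookup
    x = tf.keras.layers.Concatenate(axis=-1)([w_emb, lm_emb])      # concatenated input embedding
    x = tf.keras.layers.LSTM(HIDDEN, return_sequences=True)(x)
    x = tf.keras.layers.LSTM(HIDDEN, return_sequences=True)(x)
    tags = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(NUM_TAGS, activation="softmax"))(x)
    lstm_lm = tf.keras.Model([words, lm_emb], tags)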

Other Models Implemented

Besides the models discussed above, we also implemented a bidirectional LSTM with GloVe word embedding, which can be described as below:

\overrightarrow{h}_t = \overrightarrow{LSTM}(e(w_t), \overrightarrow{h}_{t-1})
\overleftarrow{h}_t = \overleftarrow{LSTM}(e(w_t), \overleftarrow{h}_{t+1})
\hat{y}_t = softmax(W [\overrightarrow{h}_t ; \overleftarrow{h}_t] + b)

We choose a bi-directional recurrent neural network (bi-RNN) because incorporating information from the succeeding words has been shown to be important for the slot filling task Vu et al. (2016). In a bi-directional RNN, words from both previous and future time steps are considered when predicting the semantic tag of the target word, and GloVe additionally provides useful semantic and syntactic information. We experimented with all combinations of these settings and report the experimental results in the next section.
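
In the tf.keras sketch style used above, the bidirectional variant only changes the recurrent layers: each LSTM is wrapped in a Bidirectional layer so that forward and backward hidden states are concatenated at every time step (again an illustrative sketch, not the authors' code).

    import tensorflow as tf

    HIDDEN, NUM_TAGS = 128, 127  # placeholder sizes

    def bi_lstm_tagger_head(x):
        # x: embedded input sequence of shape (batch, time, emb_dim).
        x = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(HIDDEN, return_sequences=True))(x)
        x = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(HIDDEN, return_sequences=True))(x)
        return tf.keras.layers.TimeDistributed(
            tf.keras.layers.Dense(NUM_TAGS, activation="softmax"))(x)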

4 Evaluation

Data

We evaluate our model on the widely used Airline Travel Information System (ATIS) dataset Hemphill et al. (1990). Words in the dataset are all labeled in the IOB format as shown in Table 1:

Word Label
what O
flights O
leave O
at O
about B-depart_time.time_relative
DIGIT B-depart_time.time
in O
the O
afternoon B-depart_time.period_of_day
and O
arrive O
in O
San B-toloc.city_name
Francisco I-toloc.city_name
Table 1: Labeled Sentence Example in IOB Format.

Models

Using TensorFlow, we first implement a two-layer LSTM network as our baseline model. We then compare it with the proposed LSTM with language model embedding and two more complex architectures, and report F1 scores on the test dataset for each of the following:

  • Baseline LSTM: A forward RNN with LSTM cells, where the language model embedding and GloVe word embedding are not included.

  • LSTM + LM: A forward LSTM (baseline) with pre-trained language model embedding, where the GloVe word embedding is not included.

  • Bi-LSTM + LM: A bi-directional LSTM with pre-trained language model embedding, where the GloVe word embedding is not included.

  • Bi-LSTM + LM + GloVe: A bi-directional LSTM with pre-trained language model embedding and GloVe word embedding.

For the model architectures with GloVe word embedding, we assign the pre-trained embedding vectors to the words in our dataset that can be found in the GloVe dictionary and randomly initialize the embeddings of the remaining words. For models without GloVe embedding, we randomly initialize the whole word embedding matrix. We use the RMSProp optimizer with a constant learning rate to minimize the cross-entropy loss, and adopt dropout with an 80% keep probability to avoid overfitting.
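
A minimal sketch of this setup, under the same placeholder assumptions as the earlier sketches, is shown below. The learning rate is a placeholder (the exact value is not stated here), and build_glove_matrix is a hypothetical helper standing in for whatever routine loads the GloVe vectors.

    import numpy as np
    import tensorflow as tf

    def build_glove_matrix(glove_vectors, word_index, emb_dim=100):
        # glove_vectors: dict word -> vector; word_index: dict word -> row id.
        # Rows for words found in GloVe are copied; the rest stay randomly initialized.
        matrix = np.random.uniform(-0.05, 0.05, (len(word_index) + 1, emb_dim))
        for word, idx in word_index.items():
            if word in glove_vectors:
                matrix[idx] = glove_vectors[word]
        return matrix
    # The matrix can be plugged into the Embedding layer via
    # embeddings_initializer=tf.keras.initializers.Constant(matrix).

    # Dropout with an 80% keep probability corresponds to a drop rate of 0.2,
    # e.g. tf.keras.layers.Dropout(0.2) between the recurrent layers.

    model = baseline  # or lstm_lm / a bidirectional variant from the sketches above
    model.compile(
        optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-3),  # placeholder constant lr
        loss="sparse_categorical_crossentropy",                     # per-token cross entropy
        metrics=["accuracy"])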

Results

We train the four models with training set sizes ranging from 100 to 3600 sentences. Tracking the loss during training, we find that 40 epochs are sufficient for the optimizer to converge, so we evaluate the four models on a test set of 893 sentences after 40 epochs and plot their F1 scores in Figure 3.
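
The F1 score is the standard chunk-level (CoNLL-style) slot F1. As a sketch of how it can be computed, assuming the seqeval package (the paper does not state which scorer was used):

    from seqeval.metrics import f1_score

    # Gold and predicted IOB tag sequences, one inner list per test sentence.
    y_true = [["O", "O", "B-toloc.city_name", "I-toloc.city_name"]]
    y_pred = [["O", "O", "B-toloc.city_name", "I-toloc.city_name"]]
    print(f1_score(y_true, y_pred))  # 1.0 when every slot chunk matches exactly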

Observing the trend of each curve, we see that the F1 score increases rapidly with dataset size while the number of training sentences is below 600. Increasing the dataset size beyond that point does not improve performance much, since the amount of data is already sufficient to train models of our defined complexity.

Figure 3: F1 Scores of Different Model Architectures Trained with Variable Dataset Sizes.
Data Size   Baseline LSTM   LSTM + LM   Bi-LSTM + LM   Bi-LSTM + LM + GloVe
100         72.06           85.49       86.88          90.24
200         82.83           89.24       89.84          92.81
300         87.46           89.14       91.65          92.39
400         88.44           90.25       90.93          93.35
500         89.87           90.43       91.12          93.20
1000        91.48           91.50       92.53          94.07
1500        91.09           91.49       92.70          94.48
2000        91.66           91.13       93.17          94.09
2500        91.27           90.71       93.41          94.19
3000        91.60           91.57       93.38          94.47
Table 2: Part of the F1 scores in Figure 3 (one row per training data size, one column per model).

For detailed results and comparison, we also list the F1 scores for different training data sizes in Table 2. Comparing the F1 scores of the different models, we find that adding pre-trained language model embedding significantly improves the performance of the LSTM, especially when the training dataset is relatively small. For instance, with only 100 and 200 training examples, our best model (Bi-LSTM+LM+GloVe) outperforms the baseline LSTM model by large margins of about 18 and 10 F1 points, respectively. These results demonstrate the potential of the Bi-LSTM+LM+GloVe based Slot Filling model in practical domains where labeled training data are difficult or expensive to obtain.

As the training data size increases, the benefit of incorporating pre-trained language model embedding becomes less significant, since the training dataset is then large enough for the baseline LSTM to learn a good context model. Nevertheless, we can still conclude that, besides the proposed language model embedding, the GloVe word embedding and the bi-directional LSTM also help to improve model performance on the slot filling task: even with a sufficient amount of labeled training data, Bi-LSTM+LM+GloVe still outperforms the baseline LSTM model by nearly 3%.

5 Conclusion and Future Work

In this paper, we proposed a bi-directional LSTM model with pre-trained language model embedding and GloVe word embedding for the slot filling task. This model significantly improves the recognition performance compared to the baseline LSTM model, especially when we do not have enough labeled sentences in a specific domain.

One direction for future work is to explore what other kinds of general knowledge can be learned from public resources and embedded into the model for a specific task domain.

References