Joint Slot Filling and Intent Detection via Capsule Neural Networks

12/22/2018 · by Chenwei Zhang, et al. · Tencent; University of Illinois at Chicago

Being able to recognize words as slots and detect the intent of an utterance has been a keen issue in natural language understanding. The existing works either treat slot filling and intent detection separately in a pipeline manner, or adopt joint models which sequentially label slots while summarizing the utterance-level intent without explicitly preserving the hierarchical relationship among words, slots, and intents. To exploit the semantic hierarchy for effective modeling, we propose a capsule-based neural network model which accomplishes slot filling and intent detection via a dynamic routing-by-agreement schema. A re-routing schema is proposed to further synergize the slot filling performance using the inferred intent representation. Experiments on two real-world datasets show the effectiveness of our model when compared with other alternative model architectures, as well as existing natural language understanding services.

1 Introduction

With the ever-increasing accuracy of speech recognition and complexity of user-generated utterances, it becomes critical for mobile phones and smart speaker devices to understand natural language in order to give informative responses. Slot filling and intent detection play important roles in Natural Language Understanding (NLU) systems. For example, given an utterance from the user, slot filling annotates the utterance at the word level, indicating the slot type mentioned by a certain word, such as the slot artist mentioned by the word Sungmin, while intent detection works at the utterance level to assign categorical intent label(s) to the whole utterance. Figure 1 illustrates this idea.

Figure 1: An example of an utterance with BIO-format annotation for slot filling, which indicates the slots artist, playlist owner, and playlist name in an utterance with the intent AddToPlaylist.
Figure 2: Illustration of the proposed Capsule-NLU model for joint slot filling and intent detection. The model performs slot filling by learning to assign each word in the WordCaps to the most appropriate slot in the SlotCaps via dynamic routing. The weights learned via dynamic routing indicate how strongly each word in WordCaps belongs to a certain slot type in SlotCaps. Dynamic routing also learns slot representations from the WordCaps and the learned weights. The learned slot representations in SlotCaps are further aggregated to predict the utterance-level intent. Once the intent label of the utterance is determined, a novel re-routing process helps improve word-level slot filling using the inferred utterance-level intent label. Solid lines indicate the dynamic-routing process and dashed lines indicate the re-routing process.

To deal with diversely expressed utterances without additional feature engineering, deep neural network based user intent detection models (Hu et al., 2009; Xu and Sarikaya, 2013; Zhang et al., 2016; Liu and Lane, 2016; Zhang et al., 2017; Chen et al., 2016; Xia et al., 2018) have been proposed to classify user intents given their utterances in natural language.

Currently, slot filling is usually treated as a sequential labeling task. A neural network such as a recurrent neural network (RNN) or a convolutional neural network (CNN) is used to learn context-aware word representations, along with a sequence tagging method such as a conditional random field (CRF) (Lafferty et al., 2001) that infers the slot type for each word in the utterance.

Word-level slot filling and utterance-level intent detection can be conducted simultaneously to achieve a synergistic effect. The recognized slots, which carry word-level signals, may give clues to the utterance-level intent of an utterance. For example, with the word Sungmin recognized as an artist slot, the utterance is more likely to have the intent AddToPlaylist than other intents such as GetWeather or BookRestaurant.

Some existing works learn to fill slots while detecting the intent of the utterance (Xu and Sarikaya, 2013; Hakkani-Tür et al., 2016; Liu and Lane, 2016; Goo et al., 2018): a convolution layer or a recurrent layer is adopted to sequentially label words with their slot types, and the last hidden state of the recurrent neural network, or an attention-weighted sum of all convolution outputs, is used to train an utterance-level classification module for intent detection. Such approaches achieve decent performance but do not explicitly consider the hierarchical relationship between words, slots, and intents: intents are sequentially summarized from the word sequence. As the sequence becomes longer, it is risky to simply rely on the gate function of the RNN to control the information flow for intent detection given the utterance.

In this work, we make the very first attempt to bridge the gap between word-level slot modeling and utterance-level intent modeling via a hierarchical capsule neural network structure (Hinton et al., 2011; Sabour et al., 2017). A capsule houses a vector representation of a group of neurons. The capsule model learns a hierarchy of feature detectors via a routing-by-agreement mechanism: capsules for detecting low-level features send their outputs to high-level capsules only when there is a strong agreement between their predictions and the high-level capsules.

The aforementioned properties of capsule models are appealing for natural language understanding from a hierarchical perspective: words such as Sungmin are routed to concept-level slots such as artist by learning how each word matches the slot representation. Concept-level slot features such as artist, playlist owner, and playlist collectively contribute to an utterance-level intent AddToPlaylist. The dynamic routing-by-agreement assigns a larger weight from a lower-level capsule to a higher-level one when the low-level feature is more predictive of that high-level feature than of other high-level features. Figure 2 illustrates this idea.

The inferred utterance-level intent is also helpful in refining the slot filling result. For example, once an AddToPlaylist intent representation is learned in IntentCaps, slot filling may capitalize on the inferred intent representation and recognize slots that would otherwise be neglected. To achieve this, we propose a re-routing schema for capsule neural networks, which allows high-level features to be actively engaged in the dynamic routing between WordCaps and SlotCaps and improves the slot filling performance.

To summarize, the contributions of this work are as follows:

  • Encapsulating the hierarchical relationship among word, slot, and intent in an utterance by a hierarchical capsule neural network structure.

  • Proposing a dynamic routing schema with re-routing that achieves synergistic effects for joint slot filling and intent detection.

  • Showing the effectiveness of our model on two real-world datasets, and comparing with existing models as well as commercial NLU services.

2 Approach

We propose to model the hierarchical relationship among each word, the slot it belongs to, and the intent label of the whole utterance by a hierarchical capsule neural network structure called Capsule-NLU. The proposed architecture consists of three types of capsules: 1) WordCaps that learn context-aware word representations, 2) SlotCaps that categorize words by their slot types via dynamic routing and construct a representation for each type of slot by aggregating the words that belong to it, and 3) IntentCaps that determine the intent label of the utterance based on the slot representations as well as the utterance contexts. Once the intent label has been determined by IntentCaps, the inferred utterance-level intent helps re-recognize slots from the utterance via a re-routing schema.

2.1 WordCaps

Given an input utterance $\mathbf{x} = (w_1, w_2, \dots, w_T)$ of $T$ words, each word is initially represented by a vector of dimension $D_W$. Here we simply train the word representations from scratch. Various neural network structures can be used to learn context-aware word representations. For example, a recurrent neural network such as a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) can be applied to learn a representation of each word in the utterance:

$\overrightarrow{\mathbf{h}}_t = \overrightarrow{\mathrm{LSTM}}(w_t, \overrightarrow{\mathbf{h}}_{t-1}), \qquad \overleftarrow{\mathbf{h}}_t = \overleftarrow{\mathrm{LSTM}}(w_t, \overleftarrow{\mathbf{h}}_{t+1})  (1)

For each word $w_t$, we concatenate the forward hidden state $\overrightarrow{\mathbf{h}}_t$ obtained from $\overrightarrow{\mathrm{LSTM}}$ with the backward hidden state $\overleftarrow{\mathbf{h}}_t$ from $\overleftarrow{\mathrm{LSTM}}$ to obtain the hidden state $\mathbf{h}_t$. The whole hidden state matrix is defined as $\mathbf{H} = (\mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_T) \in \mathbb{R}^{T \times 2D_H}$, where $D_H$ is the number of hidden units in each LSTM. More sophisticated approaches such as ELMo (Peters et al., 2018) or BERT (Devlin et al., 2018) could also be adopted.
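For concreteness, the WordCaps encoder described above can be sketched in PyTorch as follows; the module name, embedding/hidden dimensions, and the direct use of nn.LSTM are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class WordCaps(nn.Module):
    """Context-aware word encoder: embeddings + bidirectional LSTM (illustrative sketch)."""
    def __init__(self, vocab_size, d_w=128, d_h=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_w)            # word vectors trained from scratch
        self.bilstm = nn.LSTM(d_w, d_h, batch_first=True,
                              bidirectional=True)             # forward/backward hidden states

    def forward(self, token_ids):
        # token_ids: (batch, T) -> H: (batch, T, 2 * d_h), one hidden state h_t per word
        emb = self.embed(token_ids)
        H, _ = self.bilstm(emb)
        return H
```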

2.2 SlotCaps

Traditionally, the learned hidden state $\mathbf{h}_t$ for each word $w_t$ is used as the logit to predict its slot tag. Once $\mathbf{h}_t$ for all words in the utterance is learned, a sequential tagging method such as the linear-chain CRF models the tag dependencies by assigning a transition score to each transition pattern between adjacent tags, so as to find the best tag sequence of the utterance among all possible tag sequences.

Instead of doing slot filling via sequential labeling, which does not directly consider the dependencies among words, the SlotCaps learn to recognize slots via dynamic routing. The routing-by-agreement explicitly models the hierarchical relationship between capsules. For example, the routing-by-agreement mechanism sends a low-level feature, e.g., a word representation in WordCaps, to high-level capsules, e.g., SlotCaps, only when the word representation has a strong agreement with a slot representation.

The agreement value on a word may vary when it is recognized as different slots. For example, the word three may be recognized as a party_size_number slot or a time slot. The SlotCaps first convert the word representation obtained in WordCaps with respect to each slot type. We denote $\mathbf{p}_{k|t}$ as the resulting prediction vector of the $t$-th word when being recognized as the $k$-th slot:

$\mathbf{p}_{k|t} = \sigma(\mathbf{W}_k \mathbf{h}_t + \mathbf{b}_k)  (2)

where $k \in \{1, 2, \dots, K\}$ denotes the slot type and $t \in \{1, 2, \dots, T\}$. $\sigma$ is an activation function such as tanh. $\mathbf{W}_k \in \mathbb{R}^{D_P \times 2D_H}$ and $\mathbf{b}_k \in \mathbb{R}^{D_P}$ are the weight and bias matrix for the $k$-th capsule in SlotCaps, and $D_P$ is the dimension of the prediction vector.
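As an illustration of Eq. (2), the per-slot prediction vectors can be computed with a single weight tensor that stacks the $K$ slot-specific projections; the tensor shapes and the tanh activation are assumptions of this sketch.

```python
import torch

def slot_prediction_vectors(H, W, b):
    """p_{k|t} = tanh(W_k h_t + b_k) for every word t and slot type k (illustrative).

    H: (batch, T, 2*D_H)  context-aware word representations from WordCaps
    W: (K, D_P, 2*D_H)    one projection matrix per slot type
    b: (K, D_P)           one bias vector per slot type
    returns p: (batch, T, K, D_P)
    """
    p = torch.einsum('kph,bth->btkp', W, H) + b  # bias broadcasts over batch and T
    return torch.tanh(p)
```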

Slot Filling by Dynamic Routing-by-agreement. We propose to determine the slot type for each word by dynamically routing the prediction vectors of each word from WordCaps to SlotCaps. The dynamic routing-by-agreement learns an agreement value $c_{kt}$ that determines how likely the $t$-th word agrees to be routed to the $k$-th slot capsule. $c_{kt}$ is calculated by the dynamic routing-by-agreement algorithm (Sabour et al., 2017), which is briefly recalled in Algorithm 1.

1:  procedure DynamicRouting($\mathbf{p}_{k|t}$, $iter$)
2:      for each WordCaps $t$ and SlotCaps $k$: $b_{kt} \leftarrow 0$.
3:      for $iter$ iterations do
4:          for all WordCaps $t$: $\mathbf{c}_t \leftarrow \mathrm{softmax}(\mathbf{b}_t)$
5:          for all SlotCaps $k$: $\mathbf{s}_k \leftarrow \sum_t c_{kt}\, \mathbf{p}_{k|t}$
6:          for all SlotCaps $k$: $\mathbf{v}_k \leftarrow \mathrm{squash}(\mathbf{s}_k)$
7:          for all WordCaps $t$ and SlotCaps $k$: $b_{kt} \leftarrow b_{kt} + \mathbf{p}_{k|t} \cdot \mathbf{v}_k$
8:      end for
9:      Return $\mathbf{v}_k$
10: end procedure
Algorithm 1 Dynamic routing-by-agreement

The above algorithm determines the agreement value $c_{kt}$ between WordCaps and SlotCaps while learning the slot representations $\mathbf{v}_k$ in an unsupervised, iterative fashion. $\mathbf{c}_t$ is a vector that consists of all agreement values $c_{kt}$ with $k \in \{1, \dots, K\}$. $b_{kt}$ is the logit (initialized as zero) representing the log prior probability that the $t$-th word in WordCaps agrees to be routed to the $k$-th slot capsule in SlotCaps (Line 2). During each iteration (Line 3), each slot representation $\mathbf{v}_k$ is calculated by aggregating all the prediction vectors for that slot type $k$, weighted by the agreement values obtained from $b_{kt}$ (Lines 5-6):

$\mathbf{s}_k = \sum_t c_{kt}\, \mathbf{p}_{k|t}  (3)
$\mathbf{v}_k = \mathrm{squash}(\mathbf{s}_k) = \dfrac{\lVert \mathbf{s}_k \rVert^2}{1 + \lVert \mathbf{s}_k \rVert^2} \dfrac{\mathbf{s}_k}{\lVert \mathbf{s}_k \rVert}  (4)

where a squashing function is applied on the weighted sum $\mathbf{s}_k$ to get $\mathbf{v}_k$ for each slot type. Once the slot representation $\mathbf{v}_k$ is updated in the current iteration, the logit $b_{kt}$ becomes larger when the dot product $\mathbf{p}_{k|t} \cdot \mathbf{v}_k$ is large. That is, when a prediction vector $\mathbf{p}_{k|t}$ is more similar to a slot representation $\mathbf{v}_k$, the dot product is larger, indicating that it is more likely to route this word to the $k$-th slot type (Line 7). An updated, larger $b_{kt}$ leads to a larger agreement value $c_{kt}$ between the $t$-th word and the $k$-th slot in the next iteration. On the other hand, a low $c_{kt}$ is assigned when there is inconsistency between $\mathbf{p}_{k|t}$ and $\mathbf{v}_k$. The agreement values learned via this unsupervised, iterative algorithm ensure that the outputs of the WordCaps are sent to the appropriate subsequent SlotCaps after $iter$ iterations.
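The routing loop of Algorithm 1 can be written compactly as below, reusing the prediction vectors from the previous sketch; the softmax over slot types for each word and the squash form follow Sabour et al. (2017), and all tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-9):
    """Squashing non-linearity: preserves direction, maps the norm into [0, 1)."""
    sq_norm = (s * s).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

def dynamic_routing(p, num_iters=2):
    """Routing-by-agreement between lower and higher capsules (illustrative sketch).

    p: (batch, T, K, D_P) prediction vectors p_{k|t}
    returns: c (batch, T, K) agreement values, v (batch, K, D_P) higher-capsule representations
    """
    b_logits = torch.zeros(p.shape[:3], device=p.device)           # b_{kt}, initialized to zero
    for _ in range(num_iters):
        c = F.softmax(b_logits, dim=2)                             # each word distributes over K slots
        s = torch.einsum('btk,btkp->bkp', c, p)                    # weighted sum per slot type
        v = squash(s)                                              # slot representation v_k
        b_logits = b_logits + torch.einsum('btkp,bkp->btk', p, v)  # agreement update (Line 7)
    return c, v
```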

Cross Entropy Loss for Slot Filling
For the $t$-th word in an utterance, its slot type is determined as follows:

$\hat{y}_t = \arg\max_{k \in \{1,\dots,K\}} c_{kt}  (5)

The slot filling loss is defined over the utterance as the following cross-entropy function:

$\mathcal{L}_{slot} = -\sum_{t}\sum_{k} y_t^{k} \log(\hat{y}_t^{k})  (6)

where $y_t^{k}$ indicates the ground truth slot type for the $t$-th word: $y_t^{k} = 1$ when the $t$-th word belongs to the $k$-th slot type, and $\hat{y}_t^{k}$ denotes the predicted probability that it does.
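A minimal sketch of Eqs. (5)-(6), treating the agreement values $c_{kt}$ as the predicted per-word slot distribution (an assumption of this sketch):

```python
import torch
import torch.nn.functional as F

def slot_loss_and_prediction(c, gold_slots, eps=1e-12):
    """c: (batch, T, K) agreement values; gold_slots: (batch, T) integer slot ids."""
    pred_slots = c.argmax(dim=-1)                       # Eq. (5): pick the slot with largest c_{kt}
    log_p = torch.log(c.clamp_min(eps))                 # avoid log(0)
    loss = F.nll_loss(log_p.reshape(-1, c.size(-1)),    # Eq. (6): cross-entropy over all words
                      gold_slots.reshape(-1))
    return loss, pred_slots
```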

2.3 IntentCaps

The IntentCaps take the output $\mathbf{v}_k$ for each slot $k$ in SlotCaps as the input and determine the utterance-level intent of the whole utterance. The IntentCaps also convert each slot representation in SlotCaps with respect to the intent type:

$\mathbf{q}_{l|k} = \sigma(\mathbf{W}_l \mathbf{v}_k + \mathbf{b}_l)  (7)

where $l \in \{1, 2, \dots, L\}$ and $L$ is the number of intents. $\mathbf{W}_l$ and $\mathbf{b}_l$ are the weight and bias matrix for the $l$-th capsule in IntentCaps.

The IntentCaps adopt the same dynamic routing-by-agreement algorithm, where:

$\mathbf{u}_l = \mathrm{DynamicRouting}(\mathbf{q}_{l|k}, iter)  (8)
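Eqs. (7)-(8) can be sketched by projecting the slot representations into per-intent prediction vectors and reusing the dynamic_routing sketch given earlier; the shapes and names are again illustrative assumptions.

```python
import torch

def intent_capsules(v, W_intent, b_intent, num_iters=2):
    """Sketch of Eqs. (7)-(8): per-intent prediction vectors followed by routing.

    v:        (batch, K, D_P)   slot representations from SlotCaps
    W_intent: (L, D_I, D_P)     one projection matrix per intent type
    b_intent: (L, D_I)          one bias vector per intent type
    returns u: (batch, L, D_I)  intent activation vectors
    """
    q = torch.tanh(torch.einsum('lip,bkp->bkli', W_intent, v) + b_intent)  # q_{l|k}
    _, u = dynamic_routing(q, num_iters=num_iters)  # same routing as between Word/SlotCaps
    return u
```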

Max-margin Loss for Intent Detection
Based on the capsule theory, the orientation of the activation vector $\mathbf{u}_l$ represents intent properties, while its length indicates the activation probability. The loss function considers a max-margin loss on each labeled utterance:

$\mathcal{L}_{intent} = \sum_{l=1}^{L} \big\{ [\![ z = l ]\!] \cdot \max(0,\, m^{+} - \lVert \mathbf{u}_l \rVert)^2 + \lambda\, [\![ z \neq l ]\!] \cdot \max(0,\, \lVert \mathbf{u}_l \rVert - m^{-})^2 \big\}  (9)

where $\lVert \mathbf{u}_l \rVert$ is the norm of $\mathbf{u}_l$, $[\![\cdot]\!]$ is an indicator function, and $z$ is the ground truth intent label for the utterance. $\lambda$ is the down-weighting coefficient, and $m^{+}$ and $m^{-}$ are margins.

The intent of the utterance can then be easily determined by choosing the activation vector with the largest norm, $\hat{z} = \arg\max_{l \in \{1,\dots,L\}} \lVert \mathbf{u}_l \rVert$.
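A hedged sketch of the max-margin loss in Eq. (9) and the norm-based intent prediction; m_plus = 0.95, m_minus = 0.05, and lam = 1.0 mirror the values reported in the implementation details, while the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def intent_margin_loss(u, gold_intent, m_plus=0.95, m_minus=0.05, lam=1.0):
    """u: (batch, L, D_I) intent activation vectors; gold_intent: (batch,) intent ids."""
    norms = u.norm(dim=-1)                                       # ||u_l||, the activation probability
    one_hot = F.one_hot(gold_intent, num_classes=u.size(1)).float()
    pos = one_hot * torch.clamp(m_plus - norms, min=0).pow(2)    # push the true intent's norm up
    neg = (1.0 - one_hot) * torch.clamp(norms - m_minus, min=0).pow(2)  # push other norms down
    loss = (pos + lam * neg).sum(dim=-1).mean()
    pred_intent = norms.argmax(dim=-1)                           # intent with the largest norm
    return loss, pred_intent
```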

2.4 Re-Routing

The IntentCaps not only determine the intent of the utterance by the length of the activation vector, but also learn discriminative intent representations of the utterance through the orientations of the activation vectors. So far, the dynamic routing-by-agreement shows how low-level features such as slots help construct high-level ideas such as intents. However, the high-level features can also work as a guide that helps learn the low-level features. For example, the AddToPlaylist intent activation vector in IntentCaps can help strengthen existing slots such as artist_name during slot filling on the word Sungmin in SlotCaps.

Thus we propose a re-routing schema for SlotCaps, in which the dynamic routing-by-agreement is realized by the following update that replaces Line 7 in Algorithm 1:

$b_{kt} \leftarrow b_{kt} + \mathbf{p}_{k|t} \cdot \mathbf{v}_k + \mathbf{p}_{k|t}^{\top} \mathbf{W}_{RR}\, \hat{\mathbf{u}}_{\hat{z}}  (10)

where $\hat{\mathbf{u}}_{\hat{z}}$ is the intent activation vector with the largest norm and $\mathbf{W}_{RR}$ is a bi-linear weight matrix. The routing information for each word is updated toward the direction in which the prediction vector not only coincides with representative slots, but also agrees with the most likely intent of the utterance. As a result, the re-routing lets SlotCaps obtain updated routing information as well as updated slot representations.
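Assuming the bilinear form reconstructed in Eq. (10), the re-routing logit update could look like the following sketch; W_rr and the absence of an extra scaling coefficient are assumptions of this sketch.

```python
import torch

def rerouting_update(b_logits, p, v, u_hat, W_rr):
    """One re-routing logit update (replaces Line 7 of Algorithm 1 in this sketch).

    b_logits: (batch, T, K)      routing logits b_{kt}
    p:        (batch, T, K, D_P) prediction vectors p_{k|t}
    v:        (batch, K, D_P)    slot representations v_k
    u_hat:    (batch, D_I)       intent activation vector with the largest norm
    W_rr:     (D_P, D_I)         bilinear weight matrix
    """
    agreement = torch.einsum('btkp,bkp->btk', p, v)                 # original p_{k|t} . v_k term
    intent_term = torch.einsum('btkp,pi,bi->btk', p, W_rr, u_hat)   # bilinear guidance from the intent
    return b_logits + agreement + intent_term
```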

3 Experiment Setup

To demonstrate the effectiveness of our proposed models, we compare the proposed model Capsule-NLU with existing alternatives, as well as commercial natural language understanding services.

Datasets. For each task, we evaluate our proposed model by applying it to two real-world datasets: the SNIPS Natural Language Understanding benchmark (SNIPS-NLU, https://github.com/snipsco/nlu-benchmark/) and the Airline Travel Information Systems (ATIS) dataset (Tur et al., 2010). The statistics of the two datasets are shown in Table 1.

Dataset                 | SNIPS-NLU | ATIS
Vocab Size              | 11,241    | 722
Average Sentence Length | 9.05      | 11.28
#Intents                | 7         | 21
#Slots                  | 72        | 120
#Training Samples       | 13,084    | 4,478
#Validation Samples     | 700       | 500
#Test Samples           | 700       | 893
Table 1: Dataset statistics.

SNIPS-NLU contains natural language corpus collected in a crowdsourced fashion to benchmark the performance of voice assistants. ATIS is a widely used dataset in spoken language understanding, where audio recordings of people making flight reservations are collected.

Baselines. We compare the proposed capsule-based model Capsule-NLU with other alternatives: 1) Joint Seq. (Hakkani-Tür et al., 2016) adopts a Recurrent Neural Network (RNN) for slot filling, and the last hidden state of the RNN is used to predict the utterance intent. 2) Attention BiRNN (Liu and Lane, 2016) further introduces an RNN-based encoder-decoder model for joint slot filling and intent detection. An attention-weighted sum of all encoded hidden states is used to predict the utterance intent. 3) Slot-Gated Full Atten. (Goo et al., 2018) utilizes a slot-gated mechanism as a special gate function in a Long Short-Term Memory network (LSTM) to improve slot filling by the learned intent context vector. The intent context vector is used for intent detection. 4) DR-AGG (Gong et al., 2018) aggregates word-level information for text classification via dynamic routing. The high-level capsules after routing are concatenated, followed by a multi-layer perceptron that predicts the utterance label. We use this capsule-based text classification model for intent detection only. 5) IntentCapsNet (Xia et al., 2018) adopts multi-head self-attention to extract intermediate semantic features from the utterances and uses dynamic routing to aggregate the semantic features into intent representations for intent detection. We use this capsule-based model for intent detection only.

We also compare our proposed model Capsule-NLU with existing commercial natural language understanding services, including api.ai (now called DialogFlow, https://dialogflow.com/), Watson Assistant (https://www.ibm.com/cloud/watson-assistant/), Luis (https://www.luis.ai/), wit.ai (https://wit.ai/), snips.ai (https://snips.ai/), recast.ai (https://recast.ai/), and Amazon Lex (https://aws.amazon.com/lex/).

Implementation Details
The hyperparameters used for the experiments are shown in Table 2. We use the validation data to choose hyperparameters. In the loss function, the down-weighting coefficient $\lambda$ is 1.0, and the margins $m^{+}$ and $m^{-}$ are set to 0.95 and 0.05 for all the existing intents. The Adam optimizer (Kingma and Ba, 2014) is used to minimize the loss.

Dataset
SNIPS-NLU 1024 512 512 64 2 2
ATIS 1024 512 512 256 3 3
Table 2: Hyperparameter settings.
Model | SNIPS-NLU: Slot (F1) / Intent (Acc) / Overall (Acc) | ATIS: Slot (F1) / Intent (Acc) / Overall (Acc)
Joint Seq. (Hakkani-Tür et al., 2016) | 0.873 / 0.969 / 0.732 | 0.942 / 0.926 / 0.807
Attention BiRNN (Liu and Lane, 2016) | 0.878 / 0.967 / 0.741 | 0.942 / 0.911 / 0.789
Slot-Gated Full Atten. (Goo et al., 2018) | 0.888 / 0.970 / 0.755 | 0.948 / 0.936 / 0.822
DR-AGG (Gong et al., 2018) | - / 0.966 / - | - / 0.914 / -
IntentCapsNet (Xia et al., 2018) | - / 0.974 / - | - / 0.948 / -
Capsule-NLU | 0.918 / 0.973 / 0.809 | 0.952 / 0.950 / 0.834
Capsule-NLU w/o Intent Detection | 0.902 / - / - | 0.948 / - / -
Capsule-NLU w/o Joint Training | 0.902 / 0.977 / 0.804 | 0.948 / 0.847 / 0.743
Table 3: Slot filling and intent detection results using Capsule-NLU on two datasets.
Figure 3: Stratified 5-fold cross-validation for benchmarking with existing NLU services on the SNIPS-NLU dataset. Black bars indicate the standard deviation.

4 Results

Quantitative Evaluation: The slot filling and intent detection results on the two datasets are reported in Table 3, where the proposed capsule-based model performs consistently better than current learning schemes for joint slot filling and intent detection, as well as capsule-based neural network models that focus on intent detection only. These results demonstrate the effectiveness of the proposed capsule-based model Capsule-NLU in jointly modeling the hierarchical relationships among words, slots, and intents via dynamic routing between capsules.

Also, we benchmark the intent detection performance of the proposed model against existing natural language understanding services (https://www.slideshare.net/KonstantinSavenkov/nlu-intent-detection-benchmark-by-intento-august-2017) in Figure 3. Since the original data split is not available, we report results with stratified 5-fold cross validation. From Figure 3 we can see that the proposed model Capsule-NLU is highly competitive with off-the-shelf systems that are ready to use. Note that our model achieves this performance without using pre-trained word representations: the word embeddings are simply trained from scratch.

Ablation Study. To investigate the effectiveness of Capsule-NLU in joint slot filling and intent detection, we also report ablation test results in Table 3. "w/o Intent Detection" is the model without intent detection: only dynamic routing is performed between WordCaps and SlotCaps for the slot filling task, and we minimize $\mathcal{L}_{slot}$ during training; "w/o Joint Training" adopts two-stage training in which the model is first trained for slot filling by minimizing $\mathcal{L}_{slot}$, and the fixed slot representations are then used to train for the intent detection task by minimizing $\mathcal{L}_{intent}$. From the lower part of Table 3 we can see that, by using capsule-based hierarchical modeling between words and slots, the model Capsule-NLU w/o Intent Detection already outperforms current alternatives for slot filling that adopt a sequential labeling schema. Joint training of slot filling and intent detection gives each subtask further improvements when the model parameters are updated jointly.

5 Related Works

Intent Detection. With recent developments in deep neural networks, user intent detection models (Hu et al., 2009; Xu and Sarikaya, 2013; Zhang et al., 2016; Liu and Lane, 2016; Zhang et al., 2017; Chen et al., 2016; Xia et al., 2018) have been proposed to classify user intents given their diversely expressed utterances in natural language. As a text classification task, decent performance on utterance-level intent detection usually relies on hidden representations that are learned in the intermediate layers via multiple non-linear transformations. Recently, various capsule-based text classification models have been proposed that aggregate word-level features for utterance-level classification via dynamic routing-by-agreement (Gong et al., 2018; Zhao et al., 2018; Xia et al., 2018). Among them, Xia et al. (2018) adopt self-attention to extract intermediate semantic features and use a capsule-based neural network for intent detection. However, existing works do not study word-level supervision for the slot filling task. In this work, we explicitly model the hierarchical relationship between words and slots on the word level, as well as intents on the utterance level, via dynamic routing.

Slot Filling

Slot filling annotates the utterance with finer granularity: it associates certain parts of the utterance, usually named entities, with pre-defined slot tags. Currently, slot filling is usually treated as a sequential labeling task. A recurrent neural network such as a Gated Recurrent Unit (GRU) or a Long Short-Term Memory network (LSTM) is used to learn context-aware word representations, and Conditional Random Fields (CRFs) are used to annotate each word based on its slot type. Recently, Shen et al. (2017) and Tan et al. (2017) introduce the self-attention mechanism for CRF-free sequential labeling.

Joint Modeling via Sequence Labeling. To overcome the error propagation between the word-level slot filling task and the utterance-level intent detection task in a pipeline, joint models are proposed to solve the two tasks simultaneously in a unified framework. Xu and Sarikaya (2013) propose a Convolutional Neural Network (CNN) based sequential labeling model for slot filling. The hidden states corresponding to each word are summed up in a classification module to predict the utterance intent. A Conditional Random Field module ensures the best slot tag sequence of the utterance from all possible tag sequences. Hakkani-Tür et al. (2016) adopt a Recurrent Neural Network (RNN) for slot filling, and the last hidden state of the RNN is used to predict the utterance intent. Liu and Lane (2016) further introduce an RNN-based encoder-decoder model for joint slot filling and intent detection. An attention-weighted sum of all encoded hidden states is used to predict the utterance intent. Some specific mechanisms are designed for RNNs to explicitly encode the slots from the utterance. For example, Goo et al. (2018) utilize a slot-gated mechanism as a special gate function in a Long Short-Term Memory network (LSTM) to improve slot filling by the learned intent context vector. However, as the sequence becomes longer, it is risky to simply rely on the gate function to sequentially summarize and compress all slot and context information in a single vector (Cheng et al., 2016). In this paper, we harness the capsule neural network to learn a hierarchy of feature detectors and explicitly model the hierarchical relationships among word-level slots and the utterance-level intent. Also, instead of doing sequence labeling for slot filling, we use a dynamic routing-by-agreement schema between capsule layers to assign the proper slot type to each word in the utterance.

6 Conclusions

In this paper, a capsule-based model, namely Capsule-NLU, is introduced to harness the hierarchical relationships among words, slots, and intents in the utterance for joint slot filling and intent detection. Unlike models that treat slot filling as a sequential prediction problem, the proposed model assigns each word to its most appropriate slot in SlotCaps by a dynamic routing-by-agreement schema. The learned word-level slot representations are further aggregated to obtain the utterance-level intent representations via dynamic routing-by-agreement. A re-routing schema is proposed to further synergize the slot filling performance using the inferred intent representation. Experiments on two real-world datasets show the effectiveness of the proposed models when compared with other alternatives as well as existing NLU services.

References