1 Introduction
With the ever-increasing accuracy of speech recognition and the growing complexity of user-generated utterances, it has become critical for mobile phones and smart speaker devices to understand natural language in order to give informative responses. Slot filling and intent detection play important roles in Natural Language Understanding (NLU) systems. For example, given an utterance from the user, slot filling annotates the utterance on the word level, indicating the slot type mentioned by a certain word, such as the slot artist mentioned by the word Sungmin, while intent detection works on the utterance level to assign one or more categorical intent labels to the whole utterance. Figure 1 illustrates this idea.


To deal with diversely expressed utterances without additional feature engineering, deep neural network based user intent detection models (Hu et al., 2009; Xu and Sarikaya, 2013; Zhang et al., 2016; Liu and Lane, 2016; Zhang et al., 2017; Chen et al., 2016; Xia et al., 2018) have been proposed to classify user intents given their natural-language utterances.
Currently, slot filling is usually treated as a sequence labeling task. A neural network such as a recurrent neural network (RNN) or a convolutional neural network (CNN) is used to learn context-aware word representations, along with a sequence tagging method such as a conditional random field (CRF) (Lafferty et al., 2001) that infers the slot type for each word in the utterance.

Word-level slot filling and utterance-level intent detection can be conducted simultaneously to achieve a synergistic effect. The recognized slots, which carry word-level signals, may give clues to the utterance-level intent of an utterance. For example, with the word Sungmin recognized as a slot artist, the utterance is more likely to have the intent AddToPlayList than other intents such as GetWeather or BookRestaurant.
Some existing works learn to fill slots while detecting the intent of the utterance (Xu and Sarikaya, 2013; Hakkani-Tür et al., 2016; Liu and Lane, 2016; Goo et al., 2018): a convolutional or recurrent layer is adopted to sequentially label words with their slot types, and the last hidden state of the recurrent neural network, or an attention-weighted sum of all convolution outputs, is used to train an utterance-level classification module for intent detection. Such approaches achieve decent performance but do not explicitly consider the hierarchical relationship between words, slots, and intents: intents are sequentially summarized from the word sequence. As the sequence becomes longer, it is risky to simply rely on the gate function of the RNN to control the information flow for intent detection given the utterance.
In this work, we make the very first attempt to bridge the gap between word-level slot modeling and utterance-level intent modeling via a hierarchical capsule neural network structure (Hinton et al., 2011; Sabour et al., 2017). A capsule houses a vector representation of a group of neurons. The capsule model learns a hierarchy of feature detectors via a routing-by-agreement mechanism: capsules for detecting low-level features send their outputs to high-level capsules only when their predictions strongly agree with those high-level capsules.
The aforementioned properties of capsule models are appealing for natural language understanding from a hierarchical perspective: words such as Sungmin are routed to concept-level slots such as artist by learning how each word matches the slot representation. Concept-level slot features such as artist, playlist owner, and playlist collectively contribute to an utterance-level intent AddToPlaylist. The dynamic routing-by-agreement assigns a larger weight from a lower-level capsule to a higher-level capsule when the low-level feature is more predictive of that high-level feature than of other high-level features. Figure 2 illustrates this idea.
The inferred utterance-level intent is also helpful in refining the slot filling result. For example, once an AddToPlaylist intent representation is learned in IntentCaps, slot filling may capitalize on the inferred intent representation and recognize slots that were previously neglected. To achieve this, we propose a re-routing schema for capsule neural networks, which allows high-level features to be actively engaged in the dynamic routing between WordCaps and SlotCaps and thereby improves slot filling performance.
To summarize, the contributions of this work are as follows:
- Encapsulating the hierarchical relationship among word, slot, and intent in an utterance by a hierarchical capsule neural network structure.
- Proposing a dynamic routing schema with re-routing that achieves synergistic effects for joint slot filling and intent detection.
- Showing the effectiveness of our model on two real-world datasets, and comparing with existing models as well as commercial NLU services.
2 Approach
We propose to model the hierarchical relationship among each word, the slot it belongs to, and the intent label of the whole utterance by a hierarchical capsule neural network structure called Capsule-NLU. The proposed architecture consists of three types of capsules: 1) WordCaps that learn context-aware word representations; 2) SlotCaps that categorize words by their slot types via dynamic routing, and construct a representation for each type of slot by aggregating the words that belong to it; and 3) IntentCaps that determine the intent label of the utterance based on the slot representations as well as the utterance contexts. Once the intent label has been determined by IntentCaps, the inferred utterance-level intent helps re-recognize slots from the utterance via a re-routing schema. The overall data flow is sketched below.
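As a rough illustration of this pipeline (not the authors' released code), the following PyTorch-style sketch wires placeholder components together; `word_caps`, `slot_caps`, and `intent_caps` are hypothetical callables standing in for the modules described in Sections 2.1-2.4:

```python
def capsule_nlu_forward(token_ids, word_caps, slot_caps, intent_caps):
    """Illustrative data flow of Capsule-NLU with placeholder components."""
    h = word_caps(token_ids)       # context-aware word representations (Sec. 2.1)
    v, c = slot_caps(h)            # slot representations + word-to-slot agreements (Sec. 2.2)
    u = intent_caps(v)             # utterance-level intent activation vectors (Sec. 2.3)
    v, c = slot_caps(h, intent=u)  # re-routing: slots refined with the inferred intent (Sec. 2.4)
    return c, u                    # agreements give slot tags; the longest u gives the intent
```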
2.1 WordCaps
Given an input utterance $x = (w_1, w_2, \dots, w_T)$ of $T$ words, each word is initially represented by a vector of dimension $D_W$. Here we simply train the word representations from scratch. Various neural network structures can be used to learn context-aware word representations. For example, a recurrent neural network such as a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) can be applied to learn a representation of each word in the utterance:

$$\overrightarrow{\mathbf{h}}_t = \overrightarrow{\mathrm{LSTM}}(w_t, \overrightarrow{\mathbf{h}}_{t-1}), \qquad \overleftarrow{\mathbf{h}}_t = \overleftarrow{\mathrm{LSTM}}(w_t, \overleftarrow{\mathbf{h}}_{t+1}). \tag{1}$$

For each word $w_t$, we concatenate the forward hidden state $\overrightarrow{\mathbf{h}}_t$ obtained from the forward $\overrightarrow{\mathrm{LSTM}}$ with the backward hidden state $\overleftarrow{\mathbf{h}}_t$ from the backward $\overleftarrow{\mathrm{LSTM}}$ to obtain a hidden state $\mathbf{h}_t$. The whole hidden state matrix is defined as $\mathbf{H} = (\mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_T) \in \mathbb{R}^{T \times 2D_H}$, where $D_H$ is the number of hidden units in each LSTM. More sophisticated approaches such as ELMo (Peters et al., 2018) or BERT (Devlin et al., 2018) could also be adopted.
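A minimal PyTorch sketch of such a WordCaps encoder (module and parameter names are illustrative, and embeddings are trained from scratch as described above):

```python
import torch
import torch.nn as nn

class WordCaps(nn.Module):
    """Context-aware word representations via a bidirectional LSTM."""
    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)   # trained from scratch
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids):             # token_ids: (batch, T)
        embedded = self.embedding(token_ids)  # (batch, T, emb_dim)
        # Each h_t concatenates the forward and backward hidden states.
        h, _ = self.bilstm(embedded)          # (batch, T, 2 * hidden_dim)
        return h
```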
2.2 SlotCaps
Traditionally, the learned hidden state $\mathbf{h}_t$ of each word is used as the logit to predict its slot tag. Once the hidden states of all words in the utterance are learned, sequential tagging methods like the linear-chain CRF model the tag dependencies by assigning a transition score to each transition pattern between adjacent tags, so as to find the best tag sequence of the utterance among all possible tag sequences.

Instead of doing slot filling via sequential labeling, which does not directly consider the dependencies among words, the SlotCaps learn to recognize slots via dynamic routing. The routing-by-agreement explicitly models the hierarchical relationship between capsules. For example, the routing-by-agreement mechanism sends a low-level feature, e.g., a word representation in WordCaps, to high-level capsules, e.g., SlotCaps, only when the word representation has a strong agreement with a slot representation.
The agreement value on a word may vary when it is recognized as different slots. For example, the word three may be recognized as a party_size_number slot or a time slot. The SlotCaps first convert the word representation obtained in WordCaps with respect to each slot type. We denote $\mathbf{p}_{k|t}$ as the resulting prediction vector of the $t$-th word when it is recognized as the $k$-th slot:

$$\mathbf{p}_{k|t} = \sigma(\mathbf{W}_k \mathbf{h}_t + \mathbf{b}_k), \tag{2}$$

where $k \in \{1, 2, \dots, K\}$ denotes the slot type and $t \in \{1, 2, \dots, T\}$. $\sigma$ is an activation function such as $\tanh$. $\mathbf{W}_k$ and $\mathbf{b}_k$ are the weight and bias matrix for the $k$-th capsule in SlotCaps, and $D_P$ is the dimension of the prediction vector $\mathbf{p}_{k|t} \in \mathbb{R}^{D_P}$.

Slot Filling by Dynamic Routing-by-agreement We propose to determine the slot type for each word by dynamically routing the prediction vectors of each word from WordCaps to SlotCaps. The dynamic routing-by-agreement learns an agreement value $c_{kt}$ that determines how likely the $t$-th word agrees to be routed to the $k$-th slot capsule. The agreement value $c_{kt}$ is calculated by the dynamic routing-by-agreement algorithm (Sabour et al., 2017), which is briefly recalled in Algorithm 1.
The above algorithm determines the agreement values $c_{kt}$ between WordCaps and SlotCaps while learning the slot representations $\mathbf{v}_k$ in an unsupervised, iterative fashion. $\mathbf{c}_t$ is a vector that consists of all $c_{kt}$ for $k \in \{1, 2, \dots, K\}$. $b_{kt}$ is the logit (initialized as zero) representing the log prior probability that the $t$-th word in WordCaps agrees to be routed to the $k$-th slot capsule in SlotCaps (Line 2). During each iteration (Line 3), each slot representation $\mathbf{v}_k$ is calculated by aggregating all the prediction vectors $\mathbf{p}_{k|t}$ for that slot type $k$, weighted by the agreement values $c_{kt}$ obtained from $b_{kt}$ (Lines 5-6):

$$\mathbf{s}_k = \sum_{t} c_{kt}\, \mathbf{p}_{k|t}, \tag{3}$$

$$\mathbf{v}_k = \mathrm{squash}(\mathbf{s}_k) = \frac{\|\mathbf{s}_k\|^2}{1+\|\mathbf{s}_k\|^2}\,\frac{\mathbf{s}_k}{\|\mathbf{s}_k\|}, \tag{4}$$

where a squashing function is applied to the weighted sum $\mathbf{s}_k$ to obtain $\mathbf{v}_k$ for each slot type. Once the slot representation $\mathbf{v}_k$ is updated in the current iteration, the logit $b_{kt}$ becomes larger when the dot product $\mathbf{p}_{k|t} \cdot \mathbf{v}_k$ is large. That is, when a prediction vector $\mathbf{p}_{k|t}$ is more similar to a slot representation $\mathbf{v}_k$, the dot product is larger, indicating that it is more likely to route this word to the $k$-th slot type (Line 7). An updated, larger $b_{kt}$ leads to a larger agreement value $c_{kt}$ between the $t$-th word and the $k$-th slot in the next iteration. On the other hand, a low $c_{kt}$ is assigned when there is inconsistency between $\mathbf{p}_{k|t}$ and $\mathbf{v}_k$. The agreement values learned via this unsupervised, iterative algorithm ensure that the outputs of the WordCaps are sent to the appropriate subsequent SlotCaps after the routing iterations.
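The following PyTorch sketch illustrates the per-slot prediction vectors of Eq. (2) and the routing iterations of Eqs. (3)-(4); it is a minimal reading of Algorithm 1, with tensor layouts and the $\tanh$ activation chosen for illustration rather than taken from the authors' implementation:

```python
import torch

def slot_prediction_vectors(h, W, b):
    """Eq. (2): one prediction vector per (word, slot type) pair.
    h: (batch, T, 2*D_H) word representations; W: (K, 2*D_H, D_P); b: (K, D_P)."""
    return torch.tanh(torch.einsum('btd,kdz->btkz', h, W) + b)

def squash(s, dim=-1, eps=1e-8):
    """Eq. (4): keeps the orientation of s while mapping its norm into [0, 1)."""
    sq_norm = (s * s).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

def dynamic_routing(p, num_iters=2):
    """Routing-by-agreement from WordCaps to SlotCaps (Algorithm 1).
    p: (batch, T, K, D_P) prediction vectors.
    Returns slot representations v: (batch, K, D_P) and agreements c: (batch, T, K)."""
    batch, T, K, _ = p.shape
    b = torch.zeros(batch, T, K, device=p.device)    # log priors b_kt, initialized to zero
    for _ in range(num_iters):
        c = torch.softmax(b, dim=-1)                 # each word's agreement over slot types
        s = torch.einsum('btk,btkz->bkz', c, p)      # Eq. (3): weighted sum per slot
        v = squash(s)                                # Eq. (4): slot representations
        b = b + torch.einsum('btkz,bkz->btk', p, v)  # Line 7: reward agreeing predictions
    return v, torch.softmax(b, dim=-1)
```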
Cross Entropy Loss for Slot Filling
For the $t$-th word in an utterance, its slot type is determined as follows:

$$\hat{k}_t = \arg\max_{k} c_{kt}. \tag{5}$$

The slot filling loss is defined over the utterance as the following cross-entropy function:

$$\mathcal{L}_{\mathrm{slot}} = -\sum_{t}\sum_{k} y_{kt} \log c_{kt}, \tag{6}$$

where $y_{kt}$ indicates the ground truth slot type for the $t$-th word: $y_{kt} = 1$ when the $t$-th word belongs to the $k$-th slot type and $y_{kt} = 0$ otherwise.
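A sketch of Eqs. (5)-(6) on top of the agreement values returned by the routing sketch above (naming is illustrative):

```python
import torch
import torch.nn.functional as F

def predict_slots(c):
    """Eq. (5): each word takes the slot type with the largest agreement value.
    c: (batch, T, K) agreement values."""
    return c.argmax(dim=-1)                      # (batch, T) predicted slot indices

def slot_filling_loss(c, slot_labels, eps=1e-8):
    """Eq. (6): cross-entropy over the agreement values.
    slot_labels: (batch, T) ground-truth slot indices."""
    y = F.one_hot(slot_labels, num_classes=c.size(-1)).float()
    return -(y * torch.log(c + eps)).sum(dim=(1, 2)).mean()
```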
2.3 IntentCaps
The IntentCaps take the output $\mathbf{v}_k$ for each slot in SlotCaps as the input and determine the intent of the whole utterance. The IntentCaps also convert each slot representation in SlotCaps with respect to the intent type:

$$\mathbf{q}_{l|k} = \sigma(\mathbf{W}_l \mathbf{v}_k + \mathbf{b}_l), \tag{7}$$

where $l \in \{1, 2, \dots, L\}$ and $L$ is the number of intents. $\mathbf{W}_l$ and $\mathbf{b}_l$ are the weight and bias matrix for the $l$-th capsule in IntentCaps.

IntentCaps adopt the same dynamic routing-by-agreement algorithm, where:

$$\mathbf{u}_l = \mathrm{DynamicRouting}(\mathbf{q}_{l|k}, \mathit{iter}_{\mathrm{intent}}). \tag{8}$$
Max-margin Loss for Intent Detection
Based on the capsule theory, the orientation of the activation vector $\mathbf{u}_l$ represents intent properties, while its length indicates the activation probability. The loss function considers a max-margin loss on each labeled utterance:

$$\mathcal{L}_{\mathrm{intent}} = \sum_{l=1}^{L} \Big\{ [\![ z = l ]\!]\, \max(0, m^{+} - \|\mathbf{u}_l\|)^2 + \lambda\, [\![ z \neq l ]\!]\, \max(0, \|\mathbf{u}_l\| - m^{-})^2 \Big\}, \tag{9}$$

where $\|\mathbf{u}_l\|$ is the norm of $\mathbf{u}_l$, $[\![\cdot]\!]$ is an indicator function, and $z$ is the ground truth intent label for the utterance. $\lambda$ is the weighting coefficient, and $m^{+}$ and $m^{-}$ are margins.

The intent of the utterance can be easily determined by choosing the activation vector with the largest norm, $\hat{z} = \arg\max_{l} \|\mathbf{u}_l\|$.
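A sketch of the max-margin loss of Eq. (9) and the intent decision rule, using the margins and weighting coefficient reported in the implementation details as defaults (tensor names are illustrative):

```python
import torch
import torch.nn.functional as F

def intent_max_margin_loss(u, intent_label, m_pos=0.95, m_neg=0.05, lam=1.0):
    """Eq. (9): max-margin loss on intent activation vectors.
    u: (batch, L, D_I) intent activation vectors; intent_label: (batch,)."""
    norms = u.norm(dim=-1)                                   # activation probabilities
    is_true = F.one_hot(intent_label, num_classes=u.size(1)).float()
    loss_pos = is_true * torch.clamp(m_pos - norms, min=0.0) ** 2
    loss_neg = lam * (1.0 - is_true) * torch.clamp(norms - m_neg, min=0.0) ** 2
    return (loss_pos + loss_neg).sum(dim=-1).mean()

def predict_intent(u):
    """The predicted intent is the capsule whose activation vector has the largest norm."""
    return u.norm(dim=-1).argmax(dim=-1)                     # (batch,)
```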
2.4 Re-Routing
The IntentCaps not only determine the intent of the utterance by the length of the activation vector, but also learn discriminative intent representations of the utterance through the orientations of the activation vectors. The dynamic routing-by-agreement described so far shows how low-level features such as slots help construct high-level concepts such as intents. However, the high-level features can also work as a guide that helps learn the low-level features. For example, the AddToPlaylist intent activation vector in IntentCaps can help strengthen existing slots such as artist_name during slot filling on the word Sungmin in SlotCaps.

Thus we propose a re-routing schema for SlotCaps, where the dynamic routing-by-agreement is realized by the following equation that replaces Line 7 in Algorithm 1:

$$b_{kt} \leftarrow b_{kt} + \mathbf{p}_{k|t} \cdot \mathbf{v}_k + \mathbf{p}_{k|t}^{\top} \mathbf{W}_{RR}\, \hat{\mathbf{u}}_{\hat{z}}, \tag{10}$$

where $\hat{\mathbf{u}}_{\hat{z}}$ is the intent activation vector with the largest norm and $\mathbf{W}_{RR}$ is a bi-linear weight matrix. The routing information for each word is updated toward the direction where the prediction vector not only coincides with representative slots, but also agrees with the most likely intent of the utterance. As a result, the re-routing provides SlotCaps with updated routing information as well as updated slot representations.
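A sketch of the re-routing update of Eq. (10); `u_best` stands for the highest-norm intent activation vector and `W_rr` for the bi-linear weight matrix (both names are illustrative):

```python
import torch

def rerouting_update(b, p, v, u_best, W_rr):
    """Eq. (10): replaces Line 7 of Algorithm 1 during re-routing.
    b: (batch, T, K) routing logits;  p: (batch, T, K, D_P) prediction vectors;
    v: (batch, K, D_P) slot representations;  u_best: (batch, D_I) top intent vector;
    W_rr: (D_P, D_I) bi-linear weight matrix."""
    slot_agreement = torch.einsum('btkz,bkz->btk', p, v)                 # p · v term
    intent_guidance = torch.einsum('btkz,zd,bd->btk', p, W_rr, u_best)   # bi-linear term
    return b + slot_agreement + intent_guidance
```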
3 Experiment Setup
To demonstrate the effectiveness of our proposed model, we compare Capsule-NLU with existing alternatives as well as commercial natural language understanding services.
Datasets We evaluate the proposed model on two real-world datasets: the SNIPS Natural Language Understanding benchmark (SNIPS-NLU, https://github.com/snipsco/nlu-benchmark/) and the Airline Travel Information Systems (ATIS) dataset (Tur et al., 2010). Statistics of the two datasets are shown in Table 1.
Table 1: Statistics of the SNIPS-NLU and ATIS datasets.

| Dataset | SNIPS-NLU | ATIS |
|---|---|---|
| Vocab Size | 11,241 | 722 |
| Average Sentence Length | 9.05 | 11.28 |
| #Intents | 7 | 21 |
| #Slots | 72 | 120 |
| #Training Samples | 13,084 | 4,478 |
| #Validation Samples | 700 | 500 |
| #Test Samples | 700 | 893 |
SNIPS-NLU contains a natural-language corpus collected in a crowdsourced fashion to benchmark the performance of voice assistants. ATIS is a widely used dataset in spoken language understanding, consisting of audio recordings of people making flight reservations.
Baselines We compare the proposed capsule-based model Capsule-NLU with the following alternatives: 1) Joint Seq. (Hakkani-Tür et al., 2016) adopts a Recurrent Neural Network (RNN) for slot filling, and the last hidden state of the RNN is used to predict the utterance intent. 2) Attention BiRNN (Liu and Lane, 2016) further introduces an RNN-based encoder-decoder model for joint slot filling and intent detection. An attention-weighted sum of all encoded hidden states is used to predict the utterance intent. 3) Slot-Gated Full Atten. (Goo et al., 2018) utilizes a slot-gated mechanism as a special gate function in a Long Short-Term Memory network (LSTM) to improve slot filling by the learned intent context vector. The intent context vector is used for intent detection. 4) DR-AGG (Gong et al., 2018) aggregates word-level information for text classification via dynamic routing. The high-level capsules after routing are concatenated, followed by a multi-layer perceptron layer that predicts the utterance label. We use this capsule-based text classification model for intent detection only. 5) IntentCapsNet (Xia et al., 2018) adopts multi-head self-attention to extract intermediate semantic features from the utterances, and uses dynamic routing to aggregate semantic features into intent representations for intent detection. We use this capsule-based model for intent detection only.

We also compare our proposed model Capsule-NLU with existing commercial natural language understanding services, including api.ai (now called DialogFlow, https://dialogflow.com/), Watson Assistant (https://www.ibm.com/cloud/watson-assistant/), Luis (https://www.luis.ai/), wit.ai (https://wit.ai/), snips.ai (https://snips.ai/), recast.ai (https://recast.ai/), and Amazon Lex (https://aws.amazon.com/lex/).
Implementation Details
The hyperparameters used for experiments are shown in Table 2. We use the validation data to choose hyperparameters. In the loss function, the weighting coefficient $\lambda$ is 1.0, and the margins $m^{+}$ and $m^{-}$ are set to 0.95 and 0.05 for all the existing intents. The Adam optimizer (Kingma and Ba, 2014) is used to minimize the loss.

Table 2: Hyperparameter settings.

| Dataset | $D_W$ | $D_H$ | $D_P$ | $D_I$ | $\mathit{iter}_{\mathrm{slot}}$ | $\mathit{iter}_{\mathrm{intent}}$ |
|---|---|---|---|---|---|---|
| SNIPS-NLU | 1024 | 512 | 512 | 64 | 2 | 2 |
| ATIS | 1024 | 512 | 512 | 256 | 3 | 3 |
Table 3: Slot filling and intent detection results on the SNIPS-NLU and ATIS datasets.

| Model | SNIPS-NLU Slot (F1) | SNIPS-NLU Intent (Acc) | SNIPS-NLU Overall (Acc) | ATIS Slot (F1) | ATIS Intent (Acc) | ATIS Overall (Acc) |
|---|---|---|---|---|---|---|
| Joint Seq. (Hakkani-Tür et al., 2016) | 0.873 | 0.969 | 0.732 | 0.942 | 0.926 | 0.807 |
| Attention BiRNN (Liu and Lane, 2016) | 0.878 | 0.967 | 0.741 | 0.942 | 0.911 | 0.789 |
| Slot-Gated Full Atten. (Goo et al., 2018) | 0.888 | 0.970 | 0.755 | 0.948 | 0.936 | 0.822 |
| DR-AGG (Gong et al., 2018) | - | 0.966 | - | - | 0.914 | - |
| IntentCapsNet (Xia et al., 2018) | - | 0.974 | - | - | 0.948 | - |
| Capsule-NLU | 0.918 | 0.973 | 0.809 | 0.952 | 0.950 | 0.834 |
| Capsule-NLU w/o Intent Detection | 0.902 | - | - | 0.948 | - | - |
| Capsule-NLU w/o Joint Training | 0.902 | 0.977 | 0.804 | 0.948 | 0.847 | 0.743 |

Figure 3: Stratified 5-fold cross validation for benchmarking against existing NLU services on the SNIPS-NLU dataset. Black bars indicate the standard deviation.
4 Results
Quantitative Evaluation: The slot filling and intent detection results on the two datasets are reported in Table 3, where the proposed capsule-based model performs consistently better than current learning schemes for joint slot filling and intent detection, as well as capsule-based neural network models that focus on intent detection only. These results demonstrate the effectiveness of the proposed capsule-based model Capsule-NLU in jointly modeling the hierarchical relationships among words, slots, and intents via dynamic routing between capsules.
We also benchmark the intent detection performance of the proposed model against existing natural language understanding services (https://www.slideshare.net/KonstantinSavenkov/nlu-intent-detection-benchmark-by-intento-august-2017) in Figure 3. Since the original data split is not available, we report results with stratified 5-fold cross validation. From Figure 3 we can see that the proposed model Capsule-NLU is highly competitive with off-the-shelf systems that are ready to use. Note that our model achieves this performance without using pre-trained word representations: the word embeddings are simply trained from scratch.
Ablation Study To investigate the effectiveness of Capsule-NLU in joint slot filling and intent detection, we also report ablation test results in Table 3. “w/o Intent Detection” is the model without intent detection: only dynamic routing is performed between WordCaps and SlotCaps for the slot filling task, and we minimize $\mathcal{L}_{\mathrm{slot}}$ during training. “w/o Joint Training” adopts two-stage training, where the model is first trained for slot filling by minimizing $\mathcal{L}_{\mathrm{slot}}$, and the fixed slot representations are then used to train for the intent detection task by minimizing $\mathcal{L}_{\mathrm{intent}}$. From the lower part of Table 3 we can see that, by using capsule-based hierarchical modeling between words and slots, the model Capsule-NLU w/o Intent Detection is already able to outperform current slot filling alternatives that adopt a sequential labeling schema. The joint training of slot filling and intent detection gives each subtask further improvements when the model parameters are updated jointly.
5 Related Works
Intent Detection With recent developments in deep neural networks, user intent detection models (Hu et al., 2009; Xu and Sarikaya, 2013; Zhang et al., 2016; Liu and Lane, 2016; Zhang et al., 2017; Chen et al., 2016; Xia et al., 2018)
have been proposed to classify user intents given their diversely expressed natural-language utterances. As a text classification task, decent performance on utterance-level intent detection usually relies on hidden representations that are learned in the intermediate layers via multiple non-linear transformations. Recently, various capsule-based text classification models have been proposed that aggregate word-level features for utterance-level classification via dynamic routing-by-agreement (Gong et al., 2018; Zhao et al., 2018; Xia et al., 2018). Among them, Xia et al. (2018) adopt self-attention to extract intermediate semantic features and use a capsule-based neural network for intent detection. However, existing works do not study word-level supervision for the slot filling task. In this work, we explicitly model the hierarchical relationship between words and slots on the word level, as well as intents on the utterance level, via dynamic routing.

Slot Filling Slot filling annotates the utterance with finer granularity: it associates certain parts of the utterance, usually named entities, with pre-defined slot tags. Currently, slot filling is usually treated as a sequential labeling task. A recurrent neural network such as a Gated Recurrent Unit (GRU) or Long Short-Term Memory network (LSTM) is used to learn context-aware word representations, and Conditional Random Fields (CRF) are used to annotate each word based on its slot type. Recently, Shen et al. (2017); Tan et al. (2017) introduce the self-attention mechanism for CRF-free sequential labeling.

Joint Modeling via Sequence Labeling To overcome the error propagation between the word-level slot filling task and the utterance-level intent detection task in a pipeline, joint models are proposed to solve the two tasks simultaneously in a unified framework. Xu and Sarikaya (2013) propose a Convolutional Neural Network (CNN) based sequential labeling model for slot filling. The hidden states corresponding to each word are summed up in a classification module to predict the utterance intent. A Conditional Random Field module ensures the best slot tag sequence of the utterance among all possible tag sequences. Hakkani-Tür et al. (2016) adopt a Recurrent Neural Network (RNN) for slot filling, and the last hidden state of the RNN is used to predict the utterance intent. Liu and Lane (2016) further introduce an RNN-based encoder-decoder model for joint slot filling and intent detection. An attention-weighted sum of all encoded hidden states is used to predict the utterance intent. Some specific mechanisms are designed for RNNs to explicitly encode slots from the utterance. For example, Goo et al. (2018) utilize a slot-gated mechanism as a special gate function in an LSTM to improve slot filling by the learned intent context vector. However, as the sequence becomes longer, it is risky to simply rely on the gate function to sequentially summarize and compress all slot and context information in a single vector (Cheng et al., 2016). In this paper, we harness the capsule neural network to learn a hierarchy of feature detectors and explicitly model the hierarchical relationships among word-level slots and the utterance-level intent. Also, instead of doing sequence labeling for slot filling, we use a dynamic routing-by-agreement schema between capsule layers to assign a proper slot type to each word in the utterance.
6 Conclusions
In this paper, a capsule-based model, namely Capsule-NLU, is introduced to harness the hierarchical relationships among words, slots, and intents in an utterance for joint slot filling and intent detection. Unlike approaches that treat slot filling as a sequential prediction problem, the proposed model assigns each word to its most appropriate slot in SlotCaps by a dynamic routing-by-agreement schema. The learned word-level slot representations are further aggregated to obtain the utterance-level intent representations via dynamic routing-by-agreement. A re-routing schema is proposed to further improve the slot filling performance using the inferred intent representation. Experiments on two real-world datasets show the effectiveness of the proposed model when compared with other alternatives as well as existing NLU services.
References
- Chen et al. (2016) Yun-Nung Chen, Dilek Hakkani-Tür, Gökhan Tür, Jianfeng Gao, and Li Deng. 2016. End-to-end memory networks with knowledge carryover for multi-turn spoken language understanding. In INTERSPEECH. pages 3245–3249.
- Cheng et al. (2016) Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. pages 551–561.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Gong et al. (2018) Jingjing Gong, Xipeng Qiu, Shaojing Wang, and Xuanjing Huang. 2018. Information aggregation via dynamic routing for sequence encoding. In Proceedings of the 27th International Conference on Computational Linguistics. pages 2742–2752.
- Goo et al. (2018) Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-Li Huo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen. 2018. Slot-gated modeling for joint slot filling and intent prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). volume 2, pages 753–757.
- Hakkani-Tür et al. (2016) Dilek Hakkani-Tür, Gökhan Tür, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frame parsing using bi-directional rnn-lstm. In Interspeech. pages 715–719.
- Hinton et al. (2011) Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. 2011. Transforming auto-encoders. In International Conference on Artificial Neural Networks. Springer, pages 44–51.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
- Hu et al. (2009) Jian Hu, Gang Wang, Fred Lochovsky, Jian-tao Sun, and Zheng Chen. 2009. Understanding user’s query intent with wikipedia. In WWW. ACM, pages 471–480.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
- Lafferty et al. (2001) John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML. pages 282–289.
- Liu and Lane (2016) Bing Liu and Ian Lane. 2016. Attention-based recurrent neural network models for joint intent detection and slot filling. Interspeech pages 685–689.
- Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365 .
- Sabour et al. (2017) Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. 2017. Dynamic routing between capsules. In NIPS. pages 3859–3869.
- Shen et al. (2017) Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. 2017. Disan: Directional self-attention network for rnn/cnn-free language understanding. arXiv preprint arXiv:1709.04696 .
- Tan et al. (2017) Zhixing Tan, Mingxuan Wang, Jun Xie, Yidong Chen, and Xiaodong Shi. 2017. Deep semantic role labeling with self-attention. arXiv preprint arXiv:1712.01586 .
- Tur et al. (2010) Gokhan Tur, Dilek Hakkani-Tür, and Larry Heck. 2010. What is left to be understood in atis? In Spoken Language Technology Workshop (SLT), 2010 IEEE. IEEE, pages 19–24.
- Xia et al. (2018) Congying Xia, Chenwei Zhang, Xiaohui Yan, Yi Chang, and Philip S Yu. 2018. Zero-shot user intent detection via capsule neural networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pages 3090–3099.
- Xu and Sarikaya (2013) Puyang Xu and Ruhi Sarikaya. 2013. Convolutional neural network based triangular crf for joint intent detection and slot filling. In ASRU. IEEE, pages 78–83.
- Zhang et al. (2017) Chenwei Zhang, Nan Du, Wei Fan, Yaliang Li, Chun-Ta Lu, and Philip S Yu. 2017. Bringing semantic structures to user intent detection in online medical queries. In IEEE Big Data. pages 1019–1026.
- Zhang et al. (2016) Chenwei Zhang, Wei Fan, Nan Du, and Philip S Yu. 2016. Mining user intentions from medical queries: A neural network based heterogeneous jointly modeling approach. In WWW. pages 1373–1384.
- Zhao et al. (2018) Wei Zhao, Jianbo Ye, Min Yang, Zeyang Lei, Suofei Zhang, and Zhou Zhao. 2018. Investigating capsule networks with dynamic routing for text classification. arXiv preprint arXiv:1804.00538 .