Enriched Pre-trained Transformers for Joint Slot Filling and Intent Detection

04/30/2020 ∙ by Momchil Hardalov, et al. ∙ Hamad Bin Khalifa University ∙ Sofia University

Detecting the user's intent and finding the corresponding slots among the utterance's words are important tasks in natural language understanding. Their interconnected nature makes their joint modeling a standard part of training such models. Moreover, data scarceness and specialized vocabularies pose additional challenges. Recently, advances in pre-trained language models, namely contextualized models such as ELMo and BERT, have revolutionized the field by tapping the potential of training very large models with just a few steps of fine-tuning on a task-specific dataset. Here, we leverage such a model, namely BERT, and we design a novel architecture on top of it. Moreover, we propose an intent pooling attention mechanism, and we reinforce the slot filling task by fusing intent distributions, word features, and token representations. The experimental results on standard datasets show that our model outperforms both the current non-BERT state of the art and some stronger BERT-based baselines.




1 Introduction

With the proliferation of portable devices, smart speakers, and the evolution of personal assistants, such as Amazon’s Alexa, Apple’s Siri, Google’s Assistant, and Microsoft’s Cortana, a need for better natural language understanding (NLU) has emerged. The major challenges such systems face are (i) finding the intention behind the user’s request, and (ii) gathering the needed information to complete it via slot filling, while (iii) engaging in a dialogue with the user. Table 1 shows a user request collected from a personal voice assistant. Here, the intent is to play music by the artist Justin Broadrick from the year 2005. The slot filling task naturally arises as a sequence tagging task. Conventional neural network architectures, such as RNNs or CNNs, are appealing approaches to tackle the problem.

Intent  PlayMusic
Words   play  music  from  2005    by  justin    broadrick
Slots   O     O      O     B-year  O   B-artist  I-artist
Table 1: Example from the SNIPS dataset with slots encoded in the BIO format. The utterance’s intent is PlayMusic, and the given slots are year and artist.
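To make the BIO encoding of Table 1 concrete, the following minimal Python sketch groups tagged words back into slot values. The function name bio_to_slots is hypothetical and only illustrates the format; it is not taken from any released code.

```python
from typing import List, Tuple


def bio_to_slots(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged words into (slot_name, slot_value) pairs."""
    slots: List[Tuple[str, str]] = []
    current_name = None
    current_words: List[str] = []
    for word, tag in zip(words, tags):
        # An I- tag that continues the open slot extends its value.
        if tag.startswith("I-") and current_name == tag[2:]:
            current_words.append(word)
            continue
        # Any other tag closes the slot that was open, if there was one.
        if current_name is not None:
            slots.append((current_name, " ".join(current_words)))
            current_name, current_words = None, []
        # A B- tag opens a new slot; O tags and stray I- tags are ignored.
        if tag.startswith("B-"):
            current_name, current_words = tag[2:], [word]
    if current_name is not None:
        slots.append((current_name, " ".join(current_words)))
    return slots


words = "play music from 2005 by justin broadrick".split()
tags = "O O O B-year O B-artist I-artist".split()
print(bio_to_slots(words, tags))
```

For the utterance in Table 1, the two recovered slots are the year 2005 and the artist justin broadrick.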

Various extensions thereof can be found in previous work (Xu and Sarikaya, 2013a; Goo et al., 2018; Hakkani-Tür et al., 2016; Liu and Lane, 2016; E et al., 2019; Gangadharaiah and Narayanaswamy, 2019). Finally, sequence tagging approaches such as the Maximum Entropy Markov Model (MEMM) (Toutanova and Manning, 2000; McCallum et al., 2000) and Conditional Random Fields (CRF) (Lafferty et al., 2001; Jeong and Lee, 2008; Huang et al., 2015) have been added on top to enforce better modeling of the slot filling task. Recently, some interesting work has been done with hierarchical structured capsule networks (Xia et al., 2018; Zhang et al., 2019).

Here, we investigate the usefulness of pre-trained models on the task of NLU. Our approach is based on BERT (Devlin et al., 2019). This model offers two main advantages over previous work (Hakkani-Tür et al., 2016; Xu and Sarikaya, 2013a; Gangadharaiah and Narayanaswamy, 2019; Liu and Lane, 2016; E et al., 2019; Goo et al., 2018): (i) it is based on the Transformer architecture (Vaswani et al., 2017), which allows it to use bi-directional context when encoding the tokens instead of left-to-right context (as in RNNs) or limited windows (as in CNNs), and (ii) it is trained on huge unlabeled text collections, which allows it to leverage relations learned during pre-training, e.g., that Justin Broadrick is connected to music or that San Francisco is a city.

We further adapt the pre-trained models for the NLU tasks. For the intent, we introduce a pooling attention layer, which uses a weighted sum of the token representations from the last language modelling layer. We further reinforce the slot representation with the predicted intent distribution and with word features such as predicted word casing and named entities. To demonstrate the effectiveness of our approach, we evaluate it on two publicly available datasets: ATIS (Hemphill et al., 1990) and SNIPS (Coucke et al., 2018). The source code will be made available.

Our contributions are as follows:

  • We enrich a pre-trained language model, e.g., BERT (Devlin et al., 2019), to jointly solve intent classification and slot filling.

  • We introduce an additional pooling network for the intent classification task, allowing the model to obtain the hidden representation from the entire sequence.

  • We use the predicted user intent as an explicit guide for the slot filling layer rather than just depending on the language model.

  • We reinforce the slot learning with named entity and true case annotations (see http://stanfordnlp.github.io/CoreNLP/truecase.html).

The rest of this paper is organized as follows: Section 2 describes our approach. The datasets, the baselines, and the model details are presented in Section 3. The results of all experiments are discussed in Section 4. Section 5 presents related work. Finally, Section 6 concludes and points to possible directions for future work.

2 Proposed Approach

We propose a joint approach for intent classification and slot filling built on top of a pre-trained language model, i.e., BERT (Devlin et al., 2019). We further improve the base model in three directions: (i) for intent classification, we obtain a pooled representation from the last hidden states for all tokens (Section 2.1), (ii) we obtain predictions for the word case and named entities for each token in the utterance (word features), and (iii) we feed the predicted intent distribution vector, BERT’s last hidden representations, and word features into a slot filling layer (see Section 2.2). The complete architecture of the model is shown in Figure 3(b).

(a) BERT-Joint.
(b) Transformer-NLU (ours).
Figure 3: Model architectures for joint learning of intent and slot filling: (a) classical joint learning with BERT, and (b) the proposed enhanced version of the model.

2.1 Intent Pooling Attention

Traditionally, BERT and subsequent BERT-style models use a special token ([CLS]) to denote the beginning of a sequence. In the original paper (Devlin et al., 2019), the authors attach a binary classification loss to it for predicting whether two sequences follow each other in the text (next sentence prediction, or NSP). Adding such an objective forces the last residual block to pool a contextualized representation for the whole sentence from the penultimate layer, which should have a more semantic, rather than task-specific, meaning. The latter strives to improve downstream sentence-level classification tasks. However, its effectiveness has been recently debated in the literature (Lample and Conneau, 2019; Joshi et al., 2019; Yang et al., 2019; Lan et al., 2019). It has even been argued that it should be removed (Liu et al., 2019).

Here, the task is to jointly learn two strongly correlated tasks, so using representations from the last, task-specific layer is more natural than using the pooled one from the [CLS] token. Therefore, we introduce a pooling attention layer to better model the relationship between the task-specific representations for each token and for the intent. We further adopt a global concat attention (Luong et al., 2015) as a throughput mechanism. Namely, we learn a function to predict the attention weights for each token. We first multiply the outputs from the language model, which hold one vector of the Transformer's hidden size for each token in an example, by a latent weight matrix. This is followed by a non-linear activation. In order to obtain an importance logit for each token, we multiply the latter by a projection vector (shown in Eq. 1). We further normalize and scale (Vaswani et al., 2017) to obtain the attention weights.

Finally, we gather a hidden representation as a weighted sum of all attention inputs, and we pass it through a non-linear activation (see Eq. 3). For the final prediction, we use a linear projection on top of this pooled representation. Finally, we apply dropout on the pooled representation and, as proposed in Vaswani et al. (2017), on the attention weights.
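The pooling steps described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the exact formulation of Eqs. 1 to 3: we assume tanh for both non-linear activations, a softmax for the normalization, and scaling of the logits by the square root of the hidden size; dropout and the final linear projection are omitted. The names intent_pooling, w_enc, and v are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def intent_pooling(hidden: Matrix, w_enc: Matrix, v: Vector) -> Tuple[Vector, Vector]:
    """Return (attention weights, pooled representation) for one utterance.

    hidden: one vector per token (language model outputs).
    w_enc:  latent weight matrix (assumed square, hidden size x hidden size).
    v:      projection vector that turns an encoded token into a logit.
    """
    size = len(hidden[0])
    logits = []
    for token in hidden:
        # Project the token with the latent weight matrix, then apply tanh (assumed).
        encoded = [math.tanh(sum(w * x for w, x in zip(row, token))) for row in w_enc]
        # Importance logit: dot product with v, scaled by sqrt(hidden size) (assumed).
        logits.append(sum(a * b for a, b in zip(v, encoded)) / math.sqrt(size))
    weights = softmax(logits)
    # Weighted sum of the token vectors, followed by tanh (assumed).
    pooled = [
        math.tanh(sum(weight * token[j] for weight, token in zip(weights, hidden)))
        for j in range(size)
    ]
    return weights, pooled


hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
attention, pooled_state = intent_pooling(hidden_states, identity, [1.0, 1.0])
print(attention, pooled_state)
```

In this toy input the third token receives the largest weight, because both of its encoded components contribute to its logit.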


2.2 Slots Modeling

The task of slot filling is closely related to other well-known tasks in the field, e.g., part-of-speech (POS) tagging, named entity recognition (NER), etc. Also, it can benefit from knowing the interesting entities in the text. Therefore, we reinforce the slot filling with tags found by a named entity recognizer (word features). Next, we combine the intent prediction, the language model’s hidden representations, and some extracted word features into a single vector used for token slot attribution. Details about all components are discussed below.

Word Features

A major shortcoming of having free-form text as an input is that it tends not to follow basic grammatical principles or style rules. The casing of words can also guide the models while filling the slots, e.g., upper-case words can refer to names or to abbreviations. Also, knowing the proper casing enables the use of external NERs or other tools that depend on the text quality.

As a first step, we improve the text casing using a True Case model. The model maps the words into the following classes: UPPER, LOWER, INIT_UPPER, and O, where O is for mixed-case words such as McVey. With the text re-cased, we further extract the named entities with a NER annotator. Named entities are recognized using a combination of three CRF sequence taggers trained on various corpora. Numerical entities are recognized using a rule-based system. Both the truecaser and the NER model are part of the Stanford CoreNLP toolkit (Manning et al., 2014).
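The four casing classes can be illustrated with a small Python helper that derives the class of a word whose casing is already correct. This is only a sketch of the label scheme, not of the True Case model itself, which has to predict these classes for text with unreliable casing; the name case_class is hypothetical.

```python
def case_class(word: str) -> str:
    """Map a correctly cased word to one of the four casing classes."""
    if word.isupper():
        return "UPPER"        # e.g., abbreviations
    if word.islower():
        return "LOWER"
    if word[:1].isupper() and word[1:].islower():
        return "INIT_UPPER"   # e.g., names
    return "O"                # mixed case, or tokens without letters


for token in ["NASA", "music", "Justin", "McVey"]:
    print(token, case_class(token))
```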

Finally, we merge some entities ((job) TITLE, IDEOLOGY, CRIMINAL_CHARGE, CAUSE_OF_DEATH) into a special category OTHER, as they do not correlate directly to the domains of either dataset. Moreover, we add a custom regex-matching entry for AIRPORT_CODES, which are three-letter abbreviations of the airports. The latter is specifically designed for the ATIS (Tur et al., 2010) dataset. While marking the proper terms, some of the codes introduce noise, e.g., the preposition for could be marked as an AIRPORT_CODE because of FOR (Aeroporto Internacional Pinto Martins, Fortaleza, CE, Brazil). In order to mitigate this effect, we do a lookup in a dictionary of English words, and if a match is found, we trigger the O class for the token.
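The airport-code rule with the dictionary fallback can be sketched as below. The regular expression, the function name airport_code_tag, and the tiny ENGLISH_WORDS set are assumptions made for illustration; a real setup would use a full English word list and possibly a list of valid codes.

```python
import re

# Assumed pattern: exactly three ASCII letters.
AIRPORT_CODE = re.compile(r"[A-Za-z]{3}")

# Tiny stand-in for a dictionary of English words (hypothetical).
ENGLISH_WORDS = {"for", "the", "and", "are", "any", "one", "two", "fly"}


def airport_code_tag(token: str, english_words=ENGLISH_WORDS) -> str:
    """Return AIRPORT_CODE for three-letter tokens unless they are English words."""
    if AIRPORT_CODE.fullmatch(token) and token.lower() not in english_words:
        return "AIRPORT_CODE"
    return "O"


for token in ["bwi", "for", "dallas"]:
    print(token, airport_code_tag(token))
```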


In order to allow the network to learn better feature representations for the named entities and the casing, we pass them through a two-layer feed-forward network. The first layer (Eq. 5) is followed by a non-linear PReLU activation (Eq. 6). The second one is a linear projection (Eq. 7).

Sub-word Alignment

Modern NLP approaches suggest the use of sub-word units (Sennrich et al., 2016; Wu et al., 2016; Kudo and Richardson, 2018), which mitigate the effects of rare words, while preserving the efficiency of a full-word model. Although they are a flexible framework for tokenization, sub-word units require additional bookkeeping for the models in order to maintain the original alignment between words and their labels.

In our approach, we first split the sentences into the original word-tag pairs, and then we disassemble each word into word pieces. Next, the original slot tag is assigned to the first word piece, while each subsequent one is marked with a special tag (X). Still, the word features from the original token are copied to each unit. To align the predicted labels with the input tags, we keep a binary vector for active positions.
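The bookkeeping described above can be sketched as follows. The splitter toy_tokenize is a hypothetical stand-in for a real word-piece tokenizer (it simply cuts words into four-character chunks), and align_word_pieces is an illustrative name, not the authors' code.

```python
from typing import Callable, List, Tuple


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical splitter: 4-character chunks, continuations prefixed with ##."""
    chunks = [word[i:i + 4] for i in range(0, len(word), 4)]
    return chunks[:1] + ["##" + chunk for chunk in chunks[1:]]


def align_word_pieces(
    words: List[str],
    tags: List[str],
    features: List[str],
    tokenize: Callable[[str], List[str]] = toy_tokenize,
) -> Tuple[List[str], List[str], List[str], List[int]]:
    """Expand word-level tags and features to word pieces."""
    pieces, piece_tags, piece_features, active = [], [], [], []
    for word, tag, feature in zip(words, tags, features):
        for index, piece in enumerate(tokenize(word)):
            pieces.append(piece)
            # Only the first piece keeps the slot tag; the rest get the X tag.
            piece_tags.append(tag if index == 0 else "X")
            # Word features are copied to every piece of the word.
            piece_features.append(feature)
            # The binary vector marks positions used to read off predictions.
            active.append(1 if index == 0 else 0)
    return pieces, piece_tags, piece_features, active


print(align_word_pieces(["justin", "broadrick"],
                        ["B-artist", "I-artist"],
                        ["PERSON", "PERSON"]))
```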

Slot Filling as Token Classification

As in Devlin et al. (2019), we treat the slot filling as a token classification problem, where we apply a shared layer on top of each token’s representations in order to predict tags.

Furthermore, we assemble the feature vector for the slot by concatenating together the predicted intent probabilities, the word features, and the contextual representation from the language model. Afterwards, we add a dropout followed by a linear projection to the proper number of slots.
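As a rough illustration of the concatenation and projection described above, the sketch below builds the per-token feature vector and applies a linear layer. The function name slot_logits and the toy weights are hypothetical, and dropout is left out for brevity.

```python
from typing import List


def slot_logits(
    intent_probs: List[float],
    word_features: List[float],
    token_repr: List[float],
    weight: List[List[float]],
    bias: List[float],
) -> List[float]:
    """Concatenate the three inputs and project them to one logit per slot tag."""
    features = list(intent_probs) + list(word_features) + list(token_repr)
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weight, bias)
    ]


# Toy example: 2 intents, 1 word feature, 2 hidden units, 2 slot tags.
toy_weight = [[1.0, 0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0, 1.0]]
print(slot_logits([0.9, 0.1], [1.0], [0.5, -0.5], toy_weight, [0.0, 0.5]))
```

Because the same intent distribution is appended to every token of the utterance, the slot layer can condition each tag on the predicted intent.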



2.3 Interaction and Learning

To train the model, we use a joint loss function for the intent and for the slots. For both tasks, we apply cross-entropy over a softmax activation layer, except in the case of CRF tagging. In those experiments, the slot loss is the negative log-likelihood (NLL) loss. Moreover, we introduce a new hyper-parameter to balance the objectives of the two tasks (see Eq. 9). Finally, we propagate the loss from all the non-masked positions in the sequence, including word pieces and special tokens ([CLS], <s>, etc.). Note that we do not freeze any weights during fine-tuning.
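A minimal sketch of such a weighted joint objective is shown below. We assume a convex combination, loss = gamma * intent_loss + (1 - gamma) * slot_loss, which is consistent with the later remark that increasing the hyper-parameter reduces the contribution of the slot loss; the symbol name gamma and the exact form of Eq. 9 are assumptions.

```python
import math
from typing import List


def cross_entropy(probs: List[float], gold: int) -> float:
    """Cross-entropy of a softmax output against a single gold class index."""
    return -math.log(probs[gold])


def joint_loss(intent_loss: float, slot_loss: float, gamma: float = 0.6) -> float:
    """Assumed convex combination of the intent and the slot losses."""
    return gamma * intent_loss + (1.0 - gamma) * slot_loss


intent_term = cross_entropy([0.7, 0.2, 0.1], gold=0)
slot_term = cross_entropy([0.1, 0.9], gold=1)
print(joint_loss(intent_term, slot_term))
```

With gamma set to 1.0 the slot loss would be ignored entirely, and with gamma set to 0.0 only the slot loss would be optimized.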


3 Experimental Setup

3.1 Dataset

In our experiments, we use two publicly available datasets, the Airline Travel Information System (ATIS) (Hemphill et al., 1990), and SNIPS (Coucke et al., 2018). The ATIS dataset contains transcripts from audio recordings of flight information requests, while the SNIPS dataset is gathered by a custom-intent-engine for personal voice assistants. Although both are widely used in NLU benchmarks, ATIS is substantially smaller: it has almost three times fewer examples, and it contains fifteen times fewer words. However, it has a richer set of labels, 21 intents and 120 slot categories, as opposed to the 7 intents and 72 slots in SNIPS. Another key difference is the diversity of domains: ATIS has only utterances from the flight domain, while SNIPS covers various subjects, including entertainment, restaurant reservations, weather forecasts, etc. Furthermore, ATIS allows multiple intent labels. As they only form about 2% of the data, we do not extend our model to multi-label classification. Yet, we add a new intent category for combinations seen in the training dataset, e.g., an utterance with the intents flight and airfare would be marked as atis_airfare#atis_flight. A comparison between the two datasets is shown in Table 2.

                         ATIS    SNIPS
Vocab Size               722     11,241
Average Sentence Length  11.28   9.05
#Intents                 21      7
#Slots                   120     72
#Training Samples        4,478   13,084
#Dev Samples             500     700
#Test Samples            893     700
Table 2: Statistics about the ATIS and SNIPS datasets.
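The handling of utterances with several ATIS intents can be sketched as a simple label-joining step. Sorting the labels before joining is an assumption on our side; it reproduces the atis_airfare#atis_flight example and makes the combined class independent of the annotation order. The name combine_intents is hypothetical.

```python
from typing import Iterable


def combine_intents(intents: Iterable[str], separator: str = "#") -> str:
    """Turn one or more intent labels into a single class name."""
    return separator.join(sorted(set(intents)))


print(combine_intents(["atis_flight", "atis_airfare"]))
print(combine_intents(["atis_flight"]))
```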

3.2 Measures

We evaluate our models with three well-established evaluation metrics. The intent detection performance is measured in terms of accuracy. For the slot filling task, we use F1-score. Finally, the joint model is evaluated using sentence-level accuracy, i.e., the proportion of examples in the corpus whose intent and slots are both correctly predicted. Here, we must note that during evaluation we consider only the predictions for aligned words, and we omit special tokens and word pieces.
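The slot F1 and the sentence-level accuracy can be sketched as follows. The sketch computes F1 over whole slot entries (spans) rather than over individual tokens, and it assumes well-formed BIO tags; it is an illustration with hypothetical function names, not the evaluation script used for the reported numbers.

```python
from typing import List, Set, Tuple


def spans(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Extract (start, end, slot_name) entries from a BIO tag sequence."""
    found, start, name = set(), None, None
    for index, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        if tag.startswith("I-") and name == tag[2:]:
            continue
        if name is not None:
            found.add((start, index, name))
            start, name = None, None
        if tag.startswith("B-"):
            start, name = index, tag[2:]
    return found


def slot_f1(gold_seqs: List[List[str]], pred_seqs: List[List[str]]) -> float:
    """Micro-averaged F1 over slot entries: a span counts only if fully correct."""
    true_pos = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        gold_spans, pred_spans = spans(gold), spans(pred)
        true_pos += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if true_pos == 0:
        return 0.0
    precision, recall = true_pos / n_pred, true_pos / n_gold
    return 2 * precision * recall / (precision + recall)


def sentence_accuracy(gold_intents, pred_intents, gold_seqs, pred_seqs) -> float:
    """Share of utterances with both the intent and all slot tags correct."""
    correct = sum(
        gi == pi and gs == ps
        for gi, pi, gs, ps in zip(gold_intents, pred_intents, gold_seqs, pred_seqs)
    )
    return correct / len(gold_intents)
```

A partially recognized entry, such as an artist name with its last word missed, counts as an error under the per-entry F1, whereas a per-token F1 would still give credit for the tokens that were tagged correctly.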

3.3 Baselines

For our baseline models, we chose BERT (Devlin et al., 2019) due to its state-of-the-art performance in various NLP tasks. The model’s architecture is shown in Figure 3(a).


For training the model, we follow the fine-tuning procedure proposed by Devlin et al. (2019). We train a linear layer over the pooled representation of the special [CLS] token to predict the utterance’s intent. The latter is optimized during pre-training using the next sentence prediction (NSP) loss to encode the whole sentence. Moreover, we add a shared layer on top of the last hidden representations of the tokens in order to obtain a slot prediction. Both objectives are optimized using a cross-entropy loss.

State-of-the-art Models

We further compare our approach to some other benchmark models:

  • Joint Seq. (Hakkani-Tür et al., 2016) uses a Recurrent Neural Network (RNN) to obtain hidden states for each token in the sequence for slot filling, and uses the last state to predict the intent.

  • Atten.-Based (Liu and Lane, 2016) treats the slot filling task as a generative one, applying a sequence-to-sequence RNN to label the input. Further, an attention-weighted sum over the encoder’s hidden states is used to detect the intent.

  • Slot-Gated (Goo et al., 2018) introduces a special gating mechanism to an LSTM network, thus reinforcing the slot filling with the hidden representation used for the intent detection.

  • Capsule-NLU (Zhang et al., 2019) adopts Capsule Networks to exploit the semantic hierarchy between words, slots, and intents using a dynamic routing-by-agreement schema.

  • Interrelated (E et al., 2019) uses a Bidirectional LSTM network with attentive sub-networks for the slot and the intent modeling, and an interrelated mechanism to establish a direct connection between the two. SF (slot), and ID (intent) prefixes indicate which sub-network is to be executed first.

  • BERT-Joint (Chen et al., 2019) adapts the standard BERT classification and token classification pipeline to jointly model the slots and the intent; in addition, they experiment with a CRF layer on top. However, they evaluate the slot filling task using per-token F1-score (micro averaging), rather than per-slot entry, as is standard, which leads to higher results.

3.4 Model Details

We use the PyTorch implementation of BERT from the Transformers library (Wolf et al., 2019) as a base for our models. We fine-tune all models for 50 epochs with hyper-parameters set as follows: batch size of 64 examples, maximum sequence length of 50 word pieces, dropout set to 0.1 (for both attentions and hidden layers), and weight decay of 0.01. For optimization, we use Adam with a learning rate of 8e-05, β1 = 0.9, β2 = 0.999, ε = 1e-06, and a warm-up proportion of 0.1. Finally, in order to balance between the intent and the slot losses, we set the balancing hyper-parameter (Eq. 9) to 0.6. In order to tackle the problem with random fluctuations for BERT, we ran the experiments three times and we used the best-performing model on the development set. We define the latter as the one with the highest sum of the three measures described in Section 3.2. All the above-mentioned hyper-parameter values were tuned on the development set, and then used for the final model on the test set.

Model    Intent (Acc)    Sent. (Acc)    Slot (F1)    Intent (Acc)    Sent. (Acc)    Slot (F1)
Joint Seq. (Hakkani-Tür et al., 2016)
Atten.-Based (Liu and Lane, 2016)
Slot-Gated (Goo et al., 2018)
Capsule-NLU (Zhang et al., 2019)
Interrelated SF-First (E et al., 2019)
Interrelated ID-First (E et al., 2019)
BERT-Joint (Chen et al., 2019) *
BERT-Joint Ours (98.1) (97.9)
Transformer-NLU:BERT (98.2) (98.0)
Transformer-NLU:BERT w/o Slot Features
Transformer-NLU:BERT w/ CRF
Table 3: Intent detection and slot filling results measured on the SNIPS and the ATIS dataset. Highest results in each category are written in bold. Models used in the analysis are in italic. * Chen et al. (2019)’s BERT-Joint uses per-token F1 with micro averaging (which is not standard and inflates their score by several points absolute). For comparison, we also report the per-token F1 for our models (the numbers in parentheses).

4 Results

Here, we discuss the results of our model and we compare them to state-of-the-art and to BERT-based baselines. We further present an exhaustive analysis of the model’s components.

Evaluation results

Table 3 presents quantitative evaluation results in terms of (i) intent accuracy, (ii) sentence accuracy, and (iii) slot F1 (see Section 3.2). The first part of the table refers to previous work, whereas the second part presents our experiments and is separated with a double horizontal line. The evaluation results confirm that our model performs consistently better than the current state-of-the-art baselines, which supports the effectiveness of the approach.

As models become more accurate, the absolute difference between two experiments becomes smaller and smaller, and thus a better measurement is needed. Hereby, we introduce a fine-grained measure, i.e., Relative Error Reduction (RER) percentage, which is defined as the proportion of absolute error reduced by a model compared to a baseline.
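The measure can be sketched in a few lines of Python, assuming that the error of a model is 100 minus its score in percent; the function name is hypothetical. As a worked example we take the ATIS slot F1: our model scores 96.25, which is 0.45 points absolute above the best previous result, so the baseline value of 95.80 used below is derived from these two figures.

```python
def relative_error_reduction(baseline_score: float, model_score: float,
                             max_score: float = 100.0) -> float:
    """Percentage of the baseline's error that the model removes."""
    baseline_error = max_score - baseline_score
    model_error = max_score - model_score
    return 100.0 * (baseline_error - model_error) / baseline_error


# ATIS slot F1: 95.80 (derived baseline) vs. 96.25 (our model).
print(round(relative_error_reduction(95.80, 96.25), 2))
```

The baseline error of 4.20 points shrinks by 0.45 points, i.e., by about 10.71%, which matches the value reported for the ATIS slot F1.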


Table 4 shows the error reduction by our model compared to the current state of the art (SOTA), and to a BERT-based baseline. Since there is no single best model from the SOTA, we take the per-column maximum among all, although these values were not recorded in a single run. For the ATIS dataset, we see a reduction of 13.66% (1.79 points absolute) for sentence accuracy, and 10.71% (0.45 points absolute) for slot F1, but just 4.91% for intent accuracy. Such a small improvement can be due to the quality of the dataset and to its size. For the SNIPS dataset, we see a major increase in all measures and over 55% error reduction. In absolute terms, we have 1.57 for intent, 10.96 for sentence, and 4.34 for slots. These effects can be attributed not only to the better model (discussed in the analysis below), but also to the implicit information that BERT learned during its extensive pre-training. This is especially useful in the case of SNIPS, where a fair amount of the slots in categories like SearchCreativeWork, SearchScreeningEvent, AddToPlaylist, and PlayMusic are names of movies, songs, artists, etc.

(a) atis_flight (ATIS).
(b) AddToPlaylist (SNIPS).
Figure 6: Intent pooling attention weights for one sample per dataset. The larger the number, the higher the word’s contribution. The example intent is shown in the sub-figure’s caption.

Transformer-NLU Analysis

We dissect the proposed model by adding or removing prominent components to outline their contributions. The results are shown in the second part of Table 3. First, we compare the results of BERT-Joint and the enriched model Transformer-NLU:BERT. We can see a notable reduction of the intent classification error by 17.44% and 11.63% for the ATIS and the SNIPS dataset, respectively. Furthermore, we see a 19.87% (ATIS) and 17.35% (SNIPS) error reduction in slot’s F1, and 11.43% (ATIS) and 11.63% (SNIPS) for sentence accuracy.

Relative Error Reduction of Transf.-NLU:BERT
Dataset  Metric        vs. SOTA  vs. BERT
ATIS     Intent (Acc)  4.91%     17.44%
ATIS     Sent. (Acc)   13.66%    11.43%
ATIS     Slot (F1)     10.71%    19.87%
SNIPS    Intent (Acc)  55.64%    11.63%
SNIPS    Sent. (Acc)   57.38%    12.38%
SNIPS    Slot (F1)     55.86%    17.35%
Table 4: Relative error reduction (Eq. 10) comparison between the proposed model (Transformer-NLU:BERT) and the two baselines: (i) the current state of the art for each measure, and (ii) the conventionally fine-tuned BERT-Joint without any improvements.

Next, we remove the additional slot features – predicted intent, word casing, and named entities. The results are shown as Transformer-NLU:BERT w/o Slot Features. As expected, the intent accuracy remains unchanged for both datasets, since we retain the pooling attention layer, while the F1-score for the slots decreases. For SNIPS, the model achieved the same score as for BERT-Joint, while for ATIS it was within 0.2 points absolute.

Finally, we added a CRF layer on top of the slot network, since it had shown positive effects in earlier studies (Xu and Sarikaya, 2013a; Huang et al., 2015; Liu and Lane, 2016; E et al., 2019). We denote the experiment as Transformer-NLU:BERT w/ CRF. However, in our case it did not yield the expected improvement. The results for slot filling are close to the highest recorded, while a drastic drop in intent detection accuracy is observed, i.e., -17.44% for ATIS, and -20.28% for SNIPS. We attribute this degradation to the large gradients from the NLL loss. The effect is even stronger in the case of smaller datasets, making the optimization unstable for parameter-rich models such as BERT. We tried to mitigate this issue by increasing the loss-balancing hyper-parameter (Eq. 9), effectively reducing the contribution of the slot loss to the total, which in turn harmed the slot F1.

Intent Pooling Attention Visualization

To demonstrate the effects of the intent pooling attention, we visualize the learned attention weights for two examples from the benchmark datasets. On the left subplot of Figure 6, a request from the ATIS dataset is presented: i want fly from baltimore to dallas round trip. The utterance’s intent is marked as atis_flight, and we can clearly see that the attention has successfully picked the key tokens from the text, i.e., I, want, fly, from, and to, whereas supplementary words such as names, locations, dates, etc. have a lesser contribution to the final sentence-level representation. Moreover, when trained on the ATIS dataset, the layer tends to set the weights to the two extremes: equally high for important tokens, and towards zero for the others. We attribute this behaviour to the limited domain and vocabulary.

Another example, from the SNIPS dataset, is shown in Figure 6(b). Here, the intent is to add a song to a playlist (AddToPlaylist). In this example, we see a more diverse spread of attention weights. The model again assigns the highest weight to the most relevant tokens add, to, the, and play. Also, the model learned that the first wordpiece has the highest contribution, while the subsequent ones are supplementary.

Finally, we let the pooling attention layer take into consideration the special tokens marking the start and the end ([CLS] and [SEP]) of a sequence, since they are expected to learn semantic sentence-level representations from the penultimate layer. Indeed, the model learns to exploit them, as it assigns high attention weights to both.

5 Related Work

Intent Classification

Modeling the user’s intent has been an area of interest in itself. Several approaches exist that focus only on the utterance’s intent and omit slot information, e.g., Hu et al. (2009) map each intent domain and the user’s queries into a Wikipedia representation space, Kim et al. (2017) and Xu and Sarikaya (2013b) use log-linear models with multiple stages and word features, while Ravuri and Stolcke (2015) investigate word and character n-gram language models based on Recurrent Neural Networks and LSTMs (Hochreiter and Schmidhuber, 1997). Xia et al. (2018) proposed zero-shot transfer through Capsule Networks (Sabour et al., 2017) and semantic features for detecting user intents where no labeled data exists. Moreover, some work extends the problem to a multi-class multi-label one (Xu and Sarikaya, 2013b; Kim et al., 2017; Gangadharaiah and Narayanaswamy, 2019).

Slot Filling

Before the rise of deep learning models, sequential ones such as the Maximum Entropy Markov Model (MEMM) (Toutanova and Manning, 2000; McCallum et al., 2000) and Conditional Random Fields (CRF) (Lafferty et al., 2001; Jeong and Lee, 2008) were the state-of-the-art choice. Recently, several combinations of these frameworks with different neural network architectures were proposed (Xu and Sarikaya, 2013a; Huang et al., 2015; E et al., 2019). However, a shift away from sequential models is observed in favour of self-attentive ones such as the Transformer (Devlin et al., 2019; Liu et al., 2019; Radford et al., 2018, 2019). They compose a contextualized representation of both the sentence and each word through a sequence of intermediate non-linear hidden layers, usually followed by a projection layer in order to obtain per-token tags. Such models have been successfully applied to closely related tasks, e.g., named entity recognition (Devlin et al., 2019), POS tagging (Tsai et al., 2019), etc.

Approaches modeling the intent or the slots as independent of each other suffer from uncertainty propagation due to the absence of shared knowledge between the tasks. To overcome this limitation, we learn both tasks using a joint model.

Joint Models

Given the strong correlation between the intent and the word-level slot tags, using a joint model is the natural way to address the error propagation in both tasks. It is no surprise that this has been the trend in natural language understanding. Moreover, deep neural networks with their back-propagation mechanism provide a convenient framework for learning multiple objectives at once, and a variety of state-of-the-art approaches for jointly modeling the intent and the slots build on these fundamentals. Xu and Sarikaya (2013a) introduced a Convolutional Neural Network (CNN) with a shared intent and slot hidden state, followed by a globally normalized CRF (TriCRF) for sequence tagging. Since then, Recurrent Neural Networks have been the dominating force in the field, e.g., Hakkani-Tür et al. (2016) used bidirectional LSTM cells for slot filling and the last hidden state for intent classification, and Liu and Lane (2016) introduced shared attention weights between the slot and the intent layer. Goo et al. (2018) integrated the intent via a gating mechanism into the context vector of LSTM cells used for slot filling.

Recently, researchers started to explore new directions for joint modeling beyond sequential reading models. E et al. (2019) introduced a novel bi-directional interrelated model, utilizing an iterative mechanism to correct the predicted intent and slots by multiple-step refinement. Further, Zhang et al. (2019) tried to exploit the semantic hierarchical relationship between words, slots, and intent via a dynamic routing-by-agreement schema in a Capsule Network (Sabour et al., 2017).

In this work, we utilize a pre-trained Transformer model to bootstrap the learning, when having a small dataset. Also, instead of depending only on the language model’s hidden state to learn the interaction between the slot and the intent, we fuse the two tasks together. Namely, we guide the slot filling with the predicted intent, and use a pooled representation from the task-specific outputs of BERT for intent detection. Finally, unlike others, we leverage additional information from external sources: (i) from explicit NER and true case annotations, (ii) from implicit information learned by the language model during its extensive pre-training.

6 Conclusions and Future Work

We studied the two main challenges in natural language understanding, i.e., intent detection and slot filling. In particular, we proposed an enriched pre-trained language model to jointly model the two tasks, i.e., Transformer-NLU. We designed a pooling attention layer in order to obtain an intent representation beyond just the pooled one from the special start token. Further, we reinforced the slot filling with word-specific features and the predicted intent distribution. We evaluated our approach on two real-world datasets, comparing it to the current state of the art and to BERT-based baselines. Our experiments showed that Transformer-NLU outperforms the alternatives on all standard measures used to evaluate NLU tasks. Surprisingly, neither using the robustly pre-trained version of BERT nor adding a CRF layer on top of the slot filling network brought improvements. Finally, Transformer-NLU:BERT achieved an intent accuracy of 97.87 (ATIS) and 98.86 (SNIPS), which corresponds to a relative error reduction of almost 5% for ATIS and over 55% for SNIPS, compared to the current state of the art. In terms of slot filling F1, our approach scored 96.25 (+10.71%) for ATIS and 96.57 (+55.86%) for SNIPS.

In future work, we plan to investigate the natural hierarchy of the slots, e.g., B-toloc.city can be split into B, toloc, and city. Further, we plan to experiment with better named entity recognition frameworks such as FLAIR (Akbik et al., 2019a, b).


  • A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, and R. Vollgraf (2019a) FLAIR: an easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, Minnesota, pp. 54–59. External Links: Link, Document Cited by: §6.
  • A. Akbik, T. Bergmann, and R. Vollgraf (2019b) Pooled contextualized embeddings for named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 724–728. External Links: Link, Document Cited by: §6.
  • Q. Chen, Z. Zhuo, and W. Wang (2019) BERT for joint intent classification and slot filling. arXiv:1902.10909. External Links: Link Cited by: 6th item, Table 3.
  • A. Coucke, A. Saade, A. Ball, T. Bluche, A. Caulier, D. Leroy, C. Doumouro, T. Gisselbrecht, F. Caltagirone, T. Lavril, et al. (2018) Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv:1805.10190. External Links: Link Cited by: §1, §3.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: 1st item, §1, §2.1, §2.2, §2, §3.3, §3.3, §5.
  • H. E, P. Niu, Z. Chen, and M. Song (2019) A novel bi-directional interrelated model for joint intent detection and slot filling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5467–5471. External Links: Link, Document Cited by: §1, §1, 5th item, Table 3, §4, §5, §5.
  • R. Gangadharaiah and B. Narayanaswamy (2019) Joint multiple intent detection and slot labeling for goal-oriented dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 564–569. External Links: Link, Document Cited by: §1, §1, §5.
  • C. Goo, G. Gao, Y. Hsu, C. Huo, T. Chen, K. Hsu, and Y. Chen (2018) Slot-gated modeling for joint slot filling and intent prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 753–757. External Links: Link, Document Cited by: §1, §1, 3rd item, Table 3, §5.
  • D. Hakkani-Tür, G. Tur, A. Celikyilmaz, Y. V. Chen, J. Gao, L. Deng, and Y. Wang (2016) Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In Proceedings of The 17th Annual Meeting of the International Speech Communication Association, INTERSPEECH ’16. External Links: Link Cited by: §1, §1, 1st item, Table 3, §5.
  • C. T. Hemphill, J. J. Godfrey, and G. R. Doddington (1990) The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990. External Links: Link Cited by: §1, §3.1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. External Links: ISSN 0899-7667 Cited by: §5.
  • J. Hu, G. Wang, F. Lochovsky, J. Sun, and Z. Chen (2009) Understanding user’s query intent with Wikipedia. In Proceedings of the 18th International Conference on World Wide Web, WWW ’09, New York, NY, USA, pp. 471–480. External Links: ISBN 978-1-60558-487-4, Link, Document Cited by: §5.
  • Z. Huang, W. Xu, and K. Yu (2015) Bidirectional LSTM-CRF models for sequence tagging. arXiv:1508.01991. External Links: Link Cited by: §1, §4, §5.
  • M. Jeong and G. G. Lee (2008) Triangular-chain conditional random fields. IEEE Transactions on Audio, Speech, and Language Processing 16 (7), pp. 1287–1302. Cited by: §1, §5.
  • M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy (2019) SpanBERT: improving pre-training by representing and predicting spans. arXiv:1907.10529. External Links: Link Cited by: §2.1.
  • B. Kim, S. Ryu, and G. G. Lee (2017) Two-stage multi-intent detection for spoken language understanding. Multimedia Tools and Applications 76 (9), pp. 11377–11390. Cited by: §5.
  • T. Kudo and J. Richardson (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, pp. 66–71. External Links: Link, Document Cited by: §2.2.
  • J. D. Lafferty, A. McCallum, and F. C. N. Pereira (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, San Francisco, CA, USA, pp. 282–289. External Links: ISBN 1-55860-778-1, Link Cited by: §1, §5.
  • G. Lample and A. Conneau (2019) Cross-lingual language model pretraining. arXiv:1901.07291. External Links: Link Cited by: §2.1.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: a lite BERT for self-supervised learning of language representations. arXiv:1909.11942. External Links: Link Cited by: §2.1.
  • B. Liu and I. Lane (2016) Attention-based recurrent neural network models for joint intent detection and slot filling. In Proceedings of The 17th Annual Meeting of the International Speech Communication Association, INTERSPEECH ’16, pp. 685–689. Cited by: §1, §1, 2nd item, Table 3, §4, §5.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv:1907.11692. External Links: Link Cited by: §2.1, §5.
  • T. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1412–1421. External Links: Link, Document Cited by: §2.1.
  • C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky (2014) The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pp. 55–60. External Links: Link Cited by: §2.2.
  • A. McCallum, D. Freitag, and F. C. N. Pereira (2000) Maximum entropy Markov models for information extraction and segmentation. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML ’00, San Francisco, CA, USA, pp. 591–598. External Links: ISBN 1-55860-707-2, Link Cited by: §1, §5.
  • A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training. External Links: Link Cited by: §5.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog. Cited by: §5.
  • S. Ravuri and A. Stolcke (2015) Recurrent neural network and LSTM models for lexical utterance classification. In Sixteenth Annual Conference of the International Speech Communication Association. Cited by: §5.
  • S. Sabour, N. Frosst, and G. E. Hinton (2017) Dynamic routing between capsules. In Advances in neural information processing systems, pp. 3856–3866. Cited by: §5, §5.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725. External Links: Link, Document Cited by: §2.2.
  • K. Toutanova and C. D. Manning (2000) Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, Hong Kong, China, pp. 63–70. External Links: Link, Document Cited by: §1, §5.
  • H. Tsai, J. Riesa, M. Johnson, N. Arivazhagan, X. Li, and A. Archer (2019) Small and practical BERT models for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3630–3634. External Links: Link, Document Cited by: §5.
  • G. Tur, D. Hakkani-Tür, and L. Heck (2010) What is left to be understood in ATIS?. In 2010 IEEE Spoken Language Technology Workshop, pp. 19–24. External Links: Document Cited by: §2.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proceedings of the Annual Conference on Neural Information Processing Systems, NIPS ’17, Long Beach, CA, USA, pp. 5998–6008. Cited by: §1, §2.1, §2.1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. arXiv:1910.03771. External Links: Link Cited by: §3.4.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv:1609.08144. External Links: Link Cited by: §2.2.
  • C. Xia, C. Zhang, X. Yan, Y. Chang, and P. Yu (2018) Zero-shot user intent detection via capsule neural networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3090–3099. External Links: Link, Document Cited by: §1, §5.
  • P. Xu and R. Sarikaya (2013a) Convolutional neural network based triangular CRF for joint intent detection and slot filling. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 78–83. Cited by: §1, §1, §4, §5, §5.
  • P. Xu and R. Sarikaya (2013b) Exploiting shared information for multi-intent natural language sentence classification. In Interspeech, pp. 3785–3789. Cited by: §5.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv:1906.08237. External Links: Link Cited by: §2.1.
  • C. Zhang, Y. Li, N. Du, W. Fan, and P. Yu (2019) Joint slot filling and intent detection via capsule neural networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5259–5267. External Links: Link, Document Cited by: §1, 4th item, Table 3, §5.