Simple, Fast, Accurate Intent Classification and Slot Labeling

03/19/2019 ∙ by Arshit Gupta, et al. ∙ Amazon, Stanford University

In real-time dialogue systems running at scale, there is a tradeoff between system performance, time taken for training to converge, and time taken to perform inference. In this work, we study modeling tradeoffs for intent classification (IC) and slot labeling (SL), focusing on non-recurrent models. We propose a simple, modular family of neural architectures for joint IC+SL. Using this framework, we explore a number of self-attention, convolutional, and recurrent models, contributing a large-scale analysis of modeling paradigms for IC+SL across two datasets. At the same time, we discuss a class of 'label-recurrent' models, showing that otherwise non-recurrent models with a 10-dimensional representation of the label history provide multi-point SL improvements. As a result of our analysis, we propose a class of label-recurrent, dilated, convolutional IC+SL systems that are accurate, achieving a 30% error reduction in SL over state-of-the-art performance on the Snips dataset, as well as fast, at 2x the inference speed and 2/3 to 1/2 the training time of comparable recurrent models.




1 Introduction

Figure 1: A general framework of joint IC+SL, decoupling modeling tasks to permit the analysis of each component independently.

At the core of task-oriented dialogue systems are spoken language understanding models, tasked with determining the intent of users’ utterances and labeling semantically relevant words. Performance on these tasks, known as intent classification (IC) and slot labeling (SL), upper-bounds the utility of such dialogue systems. A large body of recent research has improved these models through the use of recurrent neural networks, encoder-decoder architectures, and attention mechanisms. However, in dialogue systems in particular, system speed is at a premium, both during training and in real-time inference. In this work, we analyze tradeoffs between accuracy and computational efficiency in spoken language understanding, and to our knowledge are the first to propose fully non-recurrent and label-recurrent model paradigms including self-attention and convolution for comparison to state-of-the-art recurrent models in terms of accuracy and speed.

We present a new modular framework for joint IC+SL models that decomposes these joint models into separate components for word context encoding, summarization of the sentence into a single vector for intent classification, and modeling of dependencies in the output space of slot label sequences, permitting the analysis of each component independently. Using this framework, we identify three distinct model families of interest: fully recurrent, label-recurrent, and non-recurrent.

Recent state-of-the-art models fall into the first category, as encoder-decoder architectures have recurrent encoders to perform word context encoding, and predict slot label sequences using recurrent decoders that use both word and label information as they decode Hakkani-Tür et al. (2016); Liu and Lane (2016). As the name suggests, ‘non-recurrent’ models are networks without any recurrent connections: fully feed-forward, attention-based, or convolutional models, for example. Lastly, we have a class of label-recurrent models, inspired by models that impose structured sequential components like conditional random fields on top of non-recurrent word contextualization components. In this class of models, slot label decoding proceeds such that label sequences are encoded by a recurrent component, but word sequences are not. Using this framework, we demonstrate that label-recurrent convolutional models can achieve state-of-the-art accuracy while maintaining faster training and inference speeds than fully-recurrent models. We evaluate on the ATIS benchmark dataset and the more recent Snips dataset.

By analyzing the value of recurrent connections in the utterance space and label space separately, we demonstrate the value of modeling dependencies in the output space and how it depends on the average length of slot labels in the dataset.

To conclude, we propose a novel class of label-recurrent convolutional architectures that are fast, simple, and work well across datasets. Our study also leads to a strong new state-of-the-art IC accuracy and SL F1 on the Snips dataset.

2 Prior Work

There is a large body of research applying recurrent modeling advances to intent classification and slot labeling (frequently called spoken language understanding). Traditionally, word n-grams were used for intent classification with an SVM classifier Haffner et al. (2003) or Adaboost Schapire and Singer (2000). For the SL task, CRFs Gorin et al. (1997) have been used in the past.

Recently, a larger focus has been on joint modeling of the IC and SL tasks. Long short-term memory (LSTM) recurrent neural networks Hochreiter and Schmidhuber (1997) and Gated Recurrent Unit models Cho et al. (2014) were proposed for slot labeling by Yao et al. (2014) and Zhang and Wang (2016) respectively, while Guo et al. (2014) used recursive neural networks. Subsequent improvements to recurrent neural modeling techniques, like bidirectionality and attention Bahdanau et al. (2014), were incorporated into IC+SL in recent years as well Hakkani-Tür et al. (2016); Liu and Lane (2016). Li et al. (2018) introduced a self-attention-based joint model, combining self-attention and LSTM layers with a gating mechanism for this task.

Non-recurrent modeling for language has been revisited recently, even as recurrent techniques remain dominant. Dilated CNNs Yu and Koltun (2015) with CRF label modeling were applied to named entity recognition by Strubell et al. (2017), and earlier were applied to SL by Xu and Sarikaya (2013). Convolutional and attention-based sentence encoders have been applied to complex tasks including machine translation, natural language inference, and parsing Gehring et al. (2017); Vaswani et al. (2017); Shen et al. (2017); Kitaev and Klein (2018). We draw from both of these bodies of work to propose a simple yet highly effective family of IC+SL models.

3 A general framework of joint IC+SL

Intent classification and slot labeling take as input an utterance $x$, composed of words $x_1, \dots, x_n$ and of length $n$. Models construct a distribution over intents and slot label sequences given the utterance. One intent is assigned per utterance and one slot label is assigned per word:

$$P(i, s_1, \dots, s_n \mid x_1, \dots, x_n)$$

where $i \in I$, a fixed set of intents, and $s_t \in S$, a fixed set of slot labels. Models are trained to minimize the cross-entropy loss between the assigned distribution and the training data. To the end of constructing this distribution, our framework explicitly separates the following components, which are explicitly or implicitly present in all joint IC+SL systems:

3.1 Word contextualization

We first assume words are encoded through an embedding layer, providing context-independent word vectors. Overloading notation, we denote the embedded sequence $x_1, \dots, x_n$, with $x_t \in \mathbb{R}^d$.

In this component, word representations are enriched with sentential context. Each word $x_t$ is assigned a contextualized representation $h_t$. To ease layering these components, we keep the dimensionality the same as the word embeddings; $h_t \in \mathbb{R}^d$.

Our study consists mainly of varying this component across models, which are described in detail in Section 4. In all models, we assume independence of intent classification and slot labeling given the learned representations:

$$P(i, s_{1:n} \mid x_{1:n}) = P(i \mid h_{1:n}) \, P(s_{1:n} \mid h_{1:n})$$
3.2 Sentence representation

In this component, the output of the word contextualization component is summarized in a single vector,

$$c = \sum_{t=1}^{n} \alpha_t h_t, \qquad \alpha = \mathrm{softmax}(w^\top h_1, \dots, w^\top h_n)$$

where $c \in \mathbb{R}^d$ and $w$ is learned. For all our experiments, we keep this component constant, using this simple attention-like pooling: a weighted sum of the contextualized representations at each position in the sentence, with weights computed by a softmax over scores of the individual word contextualizations.

While simple, this model permits word contextualization components freedom in how they encode sentential information; for example, self-attention models may spread full-sentence information across all words, whereas 1-directional LSTMs may focus full-sentence information in the last word’s vector.
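A minimal numpy sketch of this pooling follows; the per-word scoring via a single learned vector `w` is an illustrative assumption, not necessarily the paper's exact parameterization:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_pool(H, w):
    """Summarize contextualized words H (n x d) into one sentence vector:
    weights come from a softmax over per-word scores H @ w."""
    alpha = softmax(H @ w)   # (n,) attention weights, summing to 1
    return alpha @ H         # (d,) weighted sum of word vectors

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))  # 5 words, d = 8
w = rng.normal(size=8)       # hypothetical learned scoring vector
c = attention_pool(H, w)     # sentence representation, shape (8,)
```

The pooled vector `c` then feeds the intent classifier described next.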

3.3 Intent prediction

In this component, the sentence representation is used as features to predict the intent of the utterance. For all experiments, we keep this component fixed as well, using a simple two-layer feed-forward block on top of $c$.

3.4 Slot label prediction

In this component, the output of the word contextualization component is used to construct a distribution over slot label sequences for the utterance. We decompose the joint probability of the label sequence given the contextualized word representations into a left-to-right labeling:

$$P(s_{1:n} \mid h_{1:n}) = \prod_{t=1}^{n} P(s_t \mid s_1, \dots, s_{t-1}, h_{1:n})$$
In our experiments, we explore two models for slot prediction, one fully-parallelizable because of strong independence assumptions, the other permitting a constrained dependence between labeling decisions that we call ‘label-recurrent’.

Independent slot prediction

The first is a non-recurrent model, which assumes independence between all labeling decisions once given $h_{1:n}$, as well as independence from all word representations except that of the word being labeled:

$$P(s_t \mid s_{1:t-1}, h_{1:n}) = P(s_t \mid h_t)$$

This model is fully parallelizable on GPU architectures, and the probability of each labeling decision is modeled according to

$$P(s_t \mid h_t) = \mathrm{softmax}(W_s f_t), \qquad f_t = \mathrm{FF}(h_t)$$

hence, SL prediction features $f_t$ are learned using each contextualized word independently.
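As a sketch of this independent prediction, the following numpy fragment scores every position in parallel with one softmax per word; the single linear layer standing in for the feed-forward block is a simplification:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def independent_slot_probs(H, W, b):
    """P(s_t | h_t): one softmax per word, computed for all positions
    at once with no dependence between labeling decisions."""
    return softmax(H @ W + b)        # (n, |S|)

rng = np.random.default_rng(1)
n, d, num_labels = 6, 8, 4
H = rng.normal(size=(n, d))          # contextualized word vectors
W = rng.normal(size=(d, num_labels)) # hypothetical output weights
b = np.zeros(num_labels)
P = independent_slot_probs(H, W, b)
pred = P.argmax(axis=1)              # greedy, fully parallel labeling
```

Because every row of `P` is computed independently, this is the fully-parallelizable extreme of the framework.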

Label-recurrent slot prediction

The second class of slot prediction models we consider leads to our classification ‘label-recurrent’. (We use this term for clarity in language, not to claim that no such models have been explored in the past.) These models permit dependence of labeling decisions on the sequence of decisions made so far, but keep the independence assumption on the word representations:

$$P(s_t \mid s_{1:t-1}, h_{1:n}) = P(s_t \mid s_{1:t-1}, h_t)$$
Notably, this family of models excludes traditional encoder-decoder models, since the decoder component uses labeling decisions and earlier word representations to influence the predictor features $f_t$. However, it includes models such as the CNN-CRF.

Because the space of label sequences in slot labeling is much smaller than the space of word sequences, this recurrence adds minimal computational burden while still permitting the model to benefit from GPU parallelism during computation.

For our experiments, we propose a single label-recurrent model, which encodes labeling histories using only a 10-dimensional LSTM. First, slot labels are embedded, such that for each $s \in S$, we have $e_s \in \mathbb{R}^{10}$. An initial tag history state, $g_0$, is randomly initialized. Each tag decision is fed along with the previous tag history state to the LSTM, which returns the next tag history state:

$$g_t = \mathrm{LSTM}(e_{s_t}, g_{t-1})$$
We omit a precise description of the LSTM model here for space, deferring the reader to Hochreiter and Schmidhuber (1997).

The tag history is used at each prediction step as an additional input to construct the predictor features $f_t$, replacing the independent predictor features above with:

$$f_t = \mathrm{FF}([h_t ; g_{t-1}])$$

where $[\cdot\,;\cdot]$ denotes concatenation. This model and other label-recurrent models are not only more parallelizable than fully-recurrent models, but also provide an architectural inductive bias, separating the modeling of tag sequences from the modeling of word sequences. In our experiments, we perform greedy decoding to maintain a high decoding speed.
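Greedy label-recurrent decoding can be sketched as below; note this uses a plain tanh recurrent cell in place of the paper's 10-dimensional LSTM as an illustrative simplification, and all weight names are hypothetical:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def greedy_label_recurrent_decode(H, E, Wg, Ug, Wo):
    """Left-to-right greedy decoding: word vectors H are fixed, but each
    prediction sees a small recurrent state g encoding the tags chosen
    so far (a tanh RNN cell stands in for the paper's LSTM)."""
    n, d = H.shape
    g = np.zeros(E.shape[1])                # initial tag-history state
    tags = []
    for t in range(n):
        f = np.concatenate([H[t], g])       # predictor features [h_t; g]
        s_t = int(np.argmax(softmax(Wo @ f)))
        tags.append(s_t)
        g = np.tanh(Wg @ E[s_t] + Ug @ g)   # fold chosen tag into history
    return tags

rng = np.random.default_rng(2)
n, d, num_labels, g_dim = 5, 8, 4, 10       # 10-dim tag history, as in the paper
H = rng.normal(size=(n, d))
E = rng.normal(size=(num_labels, g_dim))    # tag embeddings
Wg = rng.normal(size=(g_dim, g_dim))
Ug = rng.normal(size=(g_dim, g_dim))
Wo = rng.normal(size=(num_labels, d + g_dim))
tags = greedy_label_recurrent_decode(H, E, Wg, Ug, Wo)
```

Only the small tag-history state is sequential; everything touching the word representations remains parallelizable.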

4 Word contextualization models

In this section, we describe word contextualization models with the goal of identifying non-recurrent architectures that achieve high accuracy and faster speed than recurrent models.

4.1 Feed-forward model

In this model, we set $h_t = x_t + p_t$, where $p_t$ is a learned absolute position representation, with one vector learned per absolute position, as used in Gehring et al. (2017). While extremely simple, this model provides a useful baseline as a totally context-free model. It also permits us to analyze the contribution of a label-recurrent component in such a context-deprived scenario.

4.2 Self-attention models

Recent work in non-recurrent modeling has surfaced a number of variants of attention-based word context modeling.

The simplest constructs each $h_t$ by incorporating a weighted average of the rest of the sequence. We use a general bilinear attention mechanism with a residual connection, masking out the identity in the attention weights:

$$\alpha_{tj} = \mathop{\mathrm{softmax}}_{j \ne t}\big(x_t^\top A x_j\big), \qquad h_t = x_t + \sum_{j \ne t} \alpha_{tj} x_j$$
In this and all subsequent models, we optionally stack multiple layers, feeding the word representations from each layer into the next; in this case we denote the models attn-1l, attn-2l, etc.

We also analyze multi-head attention models, drawing from Vaswani et al. (2017). For a model with $k$ heads, we construct one projection matrix $W^{(m)} \in \mathbb{R}^{d \times d/k}$ for each head, and transform each $x_t$ as $x_t^{(m)} = W^{(m)\top} x_t$, for $m = 1, \dots, k$. These are passed into the attention equations above, generating context vectors $h_t^{(m)}$, which are then concatenated to form a vector in $\mathbb{R}^d$. These context vectors are usually sent through a linear transformation to combine features between the heads, but we found that omitting this combination transformation leads to significantly improved results, so we do so in all experiments. We denote these models k-head attn.
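A numpy sketch of this multi-head scheme follows; the bilinear score matrices and head projections are illustrative stand-ins for the learned parameters:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attend(X, A):
    """Bilinear self-attention with the identity masked out and a
    residual connection: each word attends to the *other* words."""
    scores = X @ A @ X.T               # (n, n) bilinear scores
    np.fill_diagonal(scores, -np.inf)  # mask out self-attention
    alpha = softmax(scores, axis=1)
    return X + alpha @ X               # residual connection

def multi_head(X, heads):
    """k-head variant: project per head, attend, then concatenate the
    per-head contexts with no combining transform (as found best here)."""
    return np.concatenate([self_attend(X @ W, A) for W, A in heads], axis=1)

rng = np.random.default_rng(3)
n, d, k = 5, 8, 2
X = rng.normal(size=(n, d))
heads = [(rng.normal(size=(d, d // k)), rng.normal(size=(d // k, d // k)))
         for _ in range(k)]
C = multi_head(X, heads)               # (n, d): k heads of d/k dims each
```

Concatenating the heads directly, rather than mixing them with a final linear layer, matches the omission described above.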

4.2.1 Relative position representations

We found in early experiments that the absolute position embeddings in self-attention models are insufficient for representing order. Hence, in all attention models except when explicitly noted, we use relative position representations as follows. We follow Shaw et al. (2018), who improved the absolute position representations of the Transformer model Vaswani et al. (2017) by learning vector representations of relative positions and incorporating them into the self-attention mechanism as follows:

$$\alpha_{tj} = \mathop{\mathrm{softmax}}_{j}\big(x_t^\top A \,(x_j + r_{f(j-t)}) + b_{f(j-t)}\big)$$

where $r_{f(j-t)}$ is a learned vector representing how the relative positions $t$ and $j$ should be incorporated, and $b_{f(j-t)}$ is a learned bias that determines how the relative position should affect the weight given to position $j$ when contextualizing position $t$. The function $f$ determines which relative positions to group together with a single relative position vector. Given the generally small datasets in IC+SL, we use the following relative position function, which buckets relative positions together in exponentially larger groups as distance increases, following the results of Khandelwal et al. (2018) that LSTMs represent position fuzzily at long relative distances:

$$f(d) = \mathrm{sign}(d)\,\big(\lfloor \log_2 \lvert d \rvert \rfloor + 1\big) \ \text{ for } d \ne 0, \qquad f(0) = 0$$
This is similar to the very recent preprint work of Bilan and Roth (2018), who use linearly increasing bucket sizes; we found exponentially increasing sizes to work well compared to the constant bucket sizes of Shaw et al. (2018).
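One possible exponential bucketing of relative positions, reconstructed from the description above (the exact grouping in the paper may differ), groups distances 1, {2, 3}, {4..7}, {8..15}, ... into single buckets while keeping the sign for direction:

```python
import math

def relative_bucket(d):
    """Map a relative position d to a bucket index, with bucket sizes
    growing exponentially with distance. An illustrative reconstruction
    of the paper's grouping function, not its exact definition."""
    if d == 0:
        return 0
    sign = 1 if d > 0 else -1
    return sign * (math.floor(math.log2(abs(d))) + 1)
```

With this function, nearby words get fine-grained position vectors while distant words share coarse ones, matching the "sharp nearby, fuzzy far away" intuition.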

4.3 Convolutional models

Convolution incorporates local word context into word representations, where the kernel width parameter $k$ specifies the total size (in words) of the local context considered. Each convolutional layer produces a vector representation of each word,

$$h_t^{(l)} = h_t^{(l-1)} + \mathrm{conv}\big(h_{t-\lfloor k/2 \rfloor}^{(l-1)}, \dots, h_{t+\lfloor k/2 \rfloor}^{(l-1)}\big)$$

and includes a residual connection and variance normalization, following Gehring et al. (2017). To maintain the dimensionality of $h_t$ as $d$, we use a filter count of $d$. We vary the number of CNN layers as well as the kernel width, and for all models use a variant known as dilated CNNs. These CNNs incorporate distant context into word representations by skipping an increasing number of nearby words in each subsequent convolutional pass. We use an exponentially increasing dilation size: in the first layer, words of distance 1 are incorporated; at layer two, words of distance 2, then 4, etc. This permits large contexts to be incorporated into word representations while keeping kernel sizes and the number of layers low.
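The following sketch illustrates the mechanics (a scalar 1D dilated convolution rather than the full vector-valued layer) and how the receptive field grows with exponentially increasing dilations:

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """'Same'-padded scalar 1D dilated convolution over a sequence x:
    each output mixes inputs spaced `dilation` positions apart."""
    k = len(kernel)
    half = (k // 2) * dilation
    padded = np.pad(x, half)
    return np.array([
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(x))
    ])

def receptive_field(kernel_size, dilations):
    """Total context (in words) seen after a stack of dilated layers."""
    r = 1
    for dil in dilations:
        r += (kernel_size - 1) * dil
    return r

x = np.arange(16, dtype=float)
y = dilated_conv1d(x, kernel=[1.0, 1.0, 1.0], dilation=2)
# Three layers, kernel 5, dilations 1, 2, 4 (as in cnn-5kernel-3l):
rf = receptive_field(5, [1, 2, 4])  # 29-word receptive field
```

Three kernel-5 layers with dilations 1, 2, 4 thus cover a 29-word window, far wider than the 13 words the same stack would see without dilation.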

4.4 Recurrent models

We also construct a recurrent word contextualization model, more or less identical to the encoders of recent state-of-the-art models. We use a bidirectional LSTM to encode word contexts, $h_t = \mathrm{BiLSTM}(x_{1:n})_t$. As with all other models, we report the performance of this model with feed-forward slot label prediction as well as with label-recurrent slot label prediction. Though similar to earlier work, both models are new; though the latter is recurrent both in word contextualization and slot label prediction, it is distinct from past models in that the two recurrent components are completely decoupled until the prediction step.

5 Datasets

We evaluate our framework and models on the ATIS dataset Hemphill et al. (1990) of spoken airline reservation requests and the Snips NLU Benchmark set. The ATIS training set contains 4978 utterances from the ATIS-2 and ATIS-3 corpora; the test set consists of 893 utterances from the ATIS-3 NOV93 and DEC94 datasets. The number of slot labels is 127, and the number of intent classes is 18. Only the words themselves are used as input; no additional tags are used.

Model | Label-recurrent | IC acc (Snips) | IC acc (ATIS) | SL F1 (Snips) | SL F1 (ATIS) | Inference (ms/utt) | Epochs to converge | s/epoch | # params
feed-forward | No | 98.56 | 97.14 | 53.59 | 69.68 | 0.61 | 48 | 1.82 | 17k
feed-forward | Yes | 98.54 | 97.46 | 75.35 | 88.72 | 1.82 | 83 | 2.52 | 19k
cnn, 5kernel, 1l | No | 98.56 | 98.40 | 85.88 | 94.11 | 0.82 | 23 | 1.90 | 42k
cnn, 5kernel, 3l | No | 99.04 | 98.42 | 92.21 | 96.68 | 1.37 | 55 | 2.16 | 91k
cnn, 3kernel, 4l | No | 98.81 | 98.32 | 91.65 | 96.75 | 1.28 | 57 | 2.29 | 76k
cnn, 5kernel, 1l | Yes | 98.85 | 98.36 | 93.12 | 96.39 | 2.13 | 51 | 2.77 | 43k
cnn, 5kernel, 3l | Yes | 99.10 | 98.36 | 94.22 | 96.95 | 2.68 | 59 | 3.34 | 93k
cnn, 3kernel, 4l | Yes | 98.96 | 98.32 | 93.71 | 96.95 | 2.60 | 53 | 3.43 | 78k
attn, 1head, 1l, no-pos | No | 98.50 | 97.51 | 53.61 | 69.31 | 1.95 | 25 | 1.94 | 22k
attn, 1head, 1l | No | 98.53 | 97.74 | 75.55 | 93.22 | 4.75 | 117 | 4.34 | 23k
attn, 1head, 3l | No | 98.74 | 98.10 | 81.51 | 94.07 | 7.68 | 160 | 4.32 | 33k
attn, 2head, 3l | No | 98.31 | 98.10 | 83.02 | 94.61 | 7.86 | 79 | 4.87 | 47k
attn, 1head, 1l, no-pos | Yes | 98.63 | 97.68 | 74.94 | 88.60 | 3.24 | 60 | 2.66 | 24k
attn, 1head, 1l | Yes | 98.61 | 98.00 | 86.72 | 94.53 | 6.12 | 89 | 5.53 | 24k
attn, 1head, 3l | Yes | 98.51 | 98.26 | 88.04 | 94.99 | 9.03 | 109 | 6.06 | 34k
attn, 2head, 3l | Yes | 98.48 | 98.26 | 89.31 | 95.86 | 9.17 | 93 | 6.54 | 49k
lstm, 1l | No | 98.82 | 98.34 | 91.83 | 97.28 | 2.65 | 45 | 2.91 | 47k
lstm, 2l | No | 98.77 | 98.20 | 93.10 | 97.36 | 4.72 | 58 | 5.09 | 77k
lstm, 1l | Yes | 98.68 | 98.36 | 93.83 | 97.37 | 3.98 | 54 | 4.62 | 49k
lstm, 2l | Yes | 98.71 | 98.30 | 93.88 | 97.28 | 6.03 | 69 | 6.82 | 79k
Table 1: Development results on the Snips 2017 and ATIS datasets, comparing models from feed-forward, convolutional, self-attention, and recurrent paradigms, as well as comparing non-recurrent, label-recurrent, and fully recurrent architectures, on IC, SL, inference speed, and training time. Inference speed, convergence time, and parameter count are drawn from Snips experiments, but the trends hold on ATIS. The best IC and SL results for each dataset within each model paradigm help compare between paradigms.

The Snips 2017 dataset is a collection of 16K crowdsourced queries, with about 2400 utterances for each of 7 intents. These intents range from ‘Play Music’ to ‘Get Weather’. The training data contains 13784 utterances and the test data consists of 700 utterances. The utterance tokens are mixed-case, unlike the ATIS dataset, where all tokens are lowercased. The total number of slot labels is 72. We use IOB tagging, and split 10% of the train set off to form a development set. Utterances in Snips are, on average, short, with 9.15 words per utterance compared to ATIS’ 11.2. However, slot label sequences themselves are longer in Snips, averaging 1.8 tokens per span to ATIS’ 1.2, making span-level slot labeling more difficult. For our development experiments, we use the casing and tokenization provided by Snips, but to compare to prior work, in one test experiment we use the lowercased, tokenized version of Goo et al. (2018).

6 Experiments

We evaluate multiple models from each of our model paradigms to help determine what modeling structures are necessary for NLU, and where the best accuracy-speed tradeoffs are. First, we report extensive evaluation across the Snips and ATIS development sets, tracking inference speed and time to convergence along with the usual IC accuracy and SL F1. Second, we pick a small number of our best-performing models to evaluate on ATIS and Snips test sets, to compare against prior work.

For each experiment below, we train until convergence, where convergence is defined by an early stopping criterion with a patience of 30 epochs, using the average of development-set IC accuracy and token-level SL F1 as the performance metric.
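This early stopping criterion can be sketched as a small loop over per-epoch dev metrics; the function name and interface are hypothetical:

```python
def train_until_converged(dev_scores, patience=30):
    """Early stopping on a stream of per-epoch dev metrics (here, the
    average of IC accuracy and token-level SL F1): stop once `patience`
    epochs pass without a new best. Returns (best_score, best_epoch)."""
    best, best_epoch, waited = float("-inf"), -1, 0
    for epoch, score in enumerate(dev_scores):
        if score > best:
            best, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best, best_epoch

# Toy run with patience 3: training stops after three non-improving epochs.
best, at = train_until_converged([0.5, 0.7, 0.72, 0.71, 0.71, 0.70], patience=3)
```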

6.1 Modeling study experiments

In our first category of experiments, we evaluate variants of each word contextualization paradigm introduced. All model structures are kept fixed except those being tested, including hyperparameters like the learning rate and batch size.
We evaluate one feed-forward word contextualization module (labeled as feed-forward) to provide a baseline performance. As with all subsequent models, we evaluate this word contextualization module with and without our proposed label-recurrent decoder. This baseline should help us determine the extent to which each dataset requires the modeling of context.

We evaluate 3 convolutional word contextualization modules. The first has 1 layer with a kernel size of 5, and is intended to provide intuition as to whether a relatively large local context can sufficiently model SL behavior. We label this model cnn, 5kernel, 1l, and name all other CNN models similarly. The next model has 3 layers with kernel size 5, and is dilated. This model incorporates long-distance context hierarchically, and is shorter and wider-per-layer than the otherwise-similar 3rd CNN model, with 4 layers and kernel size 3.

We evaluate 4 attention-based word contextualization modules. The first is simple, with 1 attention head and 1 layer. Unlike all others we analyze, it does not use relative position embeddings. Thus, this model is word order-invariant except for a simple absolute position embedding. If it improves over feed-forward, it provides strong evidence that semantic information from the context words, irrespective of order, is useful in making tagging decisions. We label this model with the flag no-pos. To evaluate the utility of relative position embeddings, we also compare a model with 1 head and 1 layer, labeled attn, 1head, 1l. We then test two increasingly complex models, the first with 3 layers and 1 head, the second with 3 layers and 2 heads per layer.

We evaluate 2 LSTM-based word contextualization modules; one uses a single LSTM layer, whereas the other stacks a second on top of the first. As with all other models, we test these two models both with independent slot prediction and label-recurrent slot prediction.

6.2 Comparison to prior work

For our second category of experiments, we take a few high-performing models from our analysis and evaluate them on the Snips and ATIS test sets for comparison to prior work. For these models, we report not only the average IC accuracy and SL F1 across random initializations, but also the standard deviation and best model, as most work has not reported average values. We keep all hyperparameters fixed across all experiments, potentially hindering performance but providing a stronger analysis of robustness.

7 Results and discussion

In this section, we draw from results reported in Table 1, on the development sets of Snips and ATIS. Very little in the way of modeling is necessary for the IC task, so we focus our analysis on the SL task. We emphasize that ATIS has shorter spans than Snips, averaging 1.2 and 1.8 tokens respectively, leading to differing modeling requirements.

7.1 Minimal modeling for NLU

By analyzing three simple models - feed-forward, attn-1head-1l-no-pos, and cnn-5kernel-1l - we conclude that explicitly incorporating local features is a useful inductive bias for high SL accuracy. The purely feed-forward model achieves 53.59 SL F1 on Snips, whereas one layer of convolution improves that number to 85.88. The story is similar for ATIS SL. However, a single layer of attention without position information fails to improve over the feed-forward model whatsoever, which we believe is due to the order-invariant nature of self-attention. This also emphasizes that focusing on local context is a useful inductive prior for the SL task.

For each of these simple models, switching from independent slot label prediction to label-recurrent prediction provides large gains on both datasets. We find an approximate 1.3ms/utterance slowdown from using label recurrence across all models. Thus, in terms of accuracy-for-speed, very simple models can achieve much of the results of more expensive models as long as they are label-recurrent and incorporate local context.

Model | Recurrence | IC acc (Mean) | IC acc (Max) | SL F1 (Mean) | SL F1 (Max)
LSTM+attn+gates Goo et al. (2018) | full | 97.0 | - | 88.8 | -
Our CNN, 5kernel, 3l | none | 97.65 ± 0.28 | 97.57 | 89.57 ± 0.54 | 90.66
Our CNN, 5kernel, 3l | label | 97.57 ± 0.41 | 98.29 | 92.30 ± 0.40 | 93.11
Our LSTM, 2l | word | 97.28 ± 0.36 | 97.57 | 90.66 ± 0.55 | 91.53
Our LSTM, 2l | full (decoupled) | 97.22 ± 0.32 | 97.14 | 91.53 ± 0.50 | 92.62
Table 2: Test set results on the Snips dataset as preprocessed by Goo et al. (2018), compared to their recurrent model.
Model | Recurrence | IC acc (Mean) | IC acc (Max) | SL F1 (Mean) | SL F1 (Max)
’13 CNN-CRF Xu and Sarikaya (2013) | label | - | - | - | 94.35
’16 LSTM Hakkani-Tür et al. (2016) | full | - | - | 94.70 | 95.48
’16 seq2seq+attn* Liu and Lane (2016) | full | - | 98.43 | 95.47 ± 0.22 | 95.87
BiRNN+attn* Liu and Lane (2016) | full | - | 98.21 | 95.42 ± 0.18 | 95.75
LSTM+attn+gates Goo et al. (2018) | full | 94.10 | - | 95.20 | -
’18 Two LSTMs Wang et al. (2018) | full | - | 98.99 | - | 96.89
’18 self-attn+LSTM Li et al. (2018) | full | - | 98.77 | - | 96.52
Our CNN, 5kernel, 3l | none | 97.04 ± 0.62 | 97.98 | 94.84 ± 0.22 | 94.95
Our CNN, 5kernel, 3l | label | 97.37 ± 0.57 | 98.10 | 95.27 ± 0.19 | 95.54
Our LSTM, 2l | word | 96.84 ± 0.49 | 97.65 | 95.13 ± 0.29 | 95.41
Our LSTM, 2l | full (decoupled) | 97.00 ± 0.44 | 97.98 | 95.15 ± 0.25 | 95.21
Table 3: Test set results on the ATIS dataset, compared to recent recurrent models. Results using digit masking preprocessing are marked with (*)

7.2 High-performing convolutional models

The larger convolutional models provide very high accuracy while maintaining fast inference and training speeds. In particular, our best CNN model, cnn-5kernel-3l, achieves 94.22 SL F1 on Snips, compared to the two-layer LSTM with label-recurrence, which achieves 93.88. The model achieves this modest improvement with over 2x the inference speed, training in under 1/2 the time, and demonstrating even stronger results on the test sets, discussed below.

On ATIS, where utterances are longer but slot label spans are shorter, LSTMs outperform CNNs on the development sets.

7.3 Issues with self-attention

Our strongest self-attention model underperforms CNNs and LSTMs on both Snips and ATIS, with a maximum SL F1 of 89.31 and 95.86 on the two datasets, respectively. Though self-attention models have seen success in complex tasks with abundant training data, our study suggests that they lack the inductive biases to perform well on these small datasets.

Relative position embeddings go a long way in improving self-attention models; adding them to a 1-layer attentional encoder improves ATIS and Snips SL by approximately 24 and 22 points, respectively. We find that adding attention heads does not add considerably to the computational complexity of attention models, while increasing accuracy; thus in a speed-accuracy tradeoff, it is likely better to add heads rather than layers as each layer adds additional computations.

7.4 Word and label recurrence in LSTMs

Our LSTM word contextualization modules show that with recurrent word context modeling, label recurrence is less important. For instance, the 2-layer LSTM achieves only a 0.78-point increase in SL F1 with label recurrence over independent prediction.

7.5 Best models compared to prior work

We report test set results on Snips and ATIS in Tables 2 and 3. Our best models from our validation study, cnn-5kernel-3l and lstm-2l, outperform the state-of-the-art on the Snips dataset, with label recurrence proving crucial, especially for Snips. In particular, cnn-5kernel-3l with label recurrence achieves an average SL F1 of 92.30, improving over the previous state-of-the-art of 88.8, along with a 0.57-point improvement in IC accuracy.

On ATIS, our label-recurrent models are competitive with the attention-augmented encoder-decoder models of Liu and Lane (2016). We hypothesize that our models perform better on Snips because much of Snips slot labeling depends on consistency within long spans, whereas ATIS slot labels have longer-distance dependencies, for example between to_city and from_city tags.

Wang et al. attribute their outlier result to using IC- and SL-specific LSTMs, and use 300-dimensional LSTMs; but with an ATIS vocabulary of 867 words (suggesting a relatively simple sequence space), we are unable to determine the source of the improvement from a modeling standpoint. In recently published work, Li et al. use 264-dimensional embeddings with self-attention, a BiLSTM, and a gating mechanism to achieve high performance on the SL task.

Figure 2: Visualization of the weight given to each token representation by the attention-based pooling for sentence representation. Lighter colors indicate greater attention.

7.6 Attention Visualization

We note that anecdotally, few words in each utterance are useful in indicating the intent. In the example given in Figure 2, the presence of possible departure and arrival cities may be distracting, but the attention mechanism correctly learns to focus on the words that indicate the atis_aircraft intent.

8 Conclusion

We presented a general family of joint IC+SL neural architectures that decomposes the task into modules for analysis. Using this framework, we conducted an extensive study of non-recurrent word contextualization methods, and separately evaluated the utility of recurrence in the representation space as well as in the structured output space. We determined that label-recurrent models, with parallelizable non-recurrent word representation methods and a recurrent model of slot label dependencies, are a good fit for high performance in both accuracy and speed.

With the results of this study, we proposed a convolution-based joint IC+SL model that achieves new state-of-the-art results on the complex SL labeling of the Snips dataset while maintaining a simple design, shorter training, and faster inference than comparable recurrent methods.

9 Implementation details

For all models, we randomly initialize word embeddings. We optimize using the Adadelta algorithm Zeiler (2012). We clip and pad all training and development sentences to length 30, with clipping affecting a small number of utterances. Dropout Srivastava et al. (2014) is used in all models. We train using a batch size of 128 split across 4 GPUs on a p3.8xlarge EC2 instance, and perform inference using CPUs on the same machine.


  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
  • Bilan and Roth (2018) Ivan Bilan and Benjamin Roth. 2018. Position-aware self-attention with relative positional encodings for slot filling. arXiv preprint arXiv:1807.03052.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
  • Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. In International Conference on Machine Learning, pages 1243–1252.
  • Goo et al. (2018) Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-Li Huo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen. 2018. Slot-gated modeling for joint slot filling and intent prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), volume 2, pages 753–757.
  • Gorin et al. (1997) Allen L Gorin, Giuseppe Riccardi, and Jeremy H Wright. 1997. How may i help you? Speech communication, 23(1-2):113–127.
  • Guo et al. (2014) D. Guo, G. Tur, W.T. Yih, and G. Zweig. 2014. Joint semantic utterance classification and slot filling with recursive neural networks. In Proceedings of Spoken Language Technology Workshop (SLT), pages 554–559.
  • Haffner et al. (2003) Patrick Haffner, Gokhan Tur, and Jerry H Wright. 2003. Optimizing SVMs for complex call classification. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), volume 1, pages I–I. IEEE.
  • Hakkani-Tür et al. (2016) Dilek Hakkani-Tür, Gökhan Tür, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In Interspeech, pages 715–719.
  • Hemphill et al. (1990) C.T. Hemphill, J.J. Godfrey, and G.R. Doddington. 1990. The ATIS spoken language systems pilot corpus. In Proceedings of the DARPA Speech and Natural Language Workshop, pages 96–101.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Khandelwal et al. (2018) Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context. Association for Computational Linguistics (ACL).
  • Kitaev and Klein (2018) Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2676–2686. Association for Computational Linguistics.
  • Li et al. (2018) Changliang Li, Liang Li, and Ji Qi. 2018. A self-attentive model with gate mechanism for spoken language understanding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3824–3833.
  • Liu and Lane (2016) B. Liu and I. Lane. 2016. Attention-based recurrent neural network models for joint intent detection and slot filling. In Proceedings of Interspeech.
  • Schapire and Singer (2000) Robert E Schapire and Yoram Singer. 2000. Boostexter: A boosting-based system for text categorization. Machine learning, 39(2-3):135–168.
  • Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 464–468. Association for Computational Linguistics.
  • Shen et al. (2017) T. Shen, T. Zhou, G. Long, J. Jiang, S. Pan, and C. Zhang. 2017. DiSAN: Directional self-attention network for RNN/CNN-free language understanding.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
  • Strubell et al. (2017) E. Strubell, P. Verga, D. Belanger, and A. McCallum. 2017. Fast and accurate entity recognition with iterated dilated convolutions. In Proceedings of EMNLP.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Wang et al. (2018) Yu Wang, Yilin Shen, and Hongxia Jin. 2018. A bi-model based RNN semantic frame parsing model for intent detection and slot filling. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 309–314.
  • Xu and Sarikaya (2013) P. Xu and R. Sarikaya. 2013. Convolutional neural network based triangular CRF for joint intent detection and slot labeling. In Proceedings of IEEE ASRU Workshop.
  • Yao et al. (2014) Kaisheng Yao, Baolin Peng, Yu Zhang, Dong Yu, Geoffrey Zweig, and Yangyang Shi. 2014. Spoken language understanding using long short-term memory neural networks. In Spoken Language Technology Workshop (SLT), 2014 IEEE, pages 189–194. IEEE.
  • Yu and Koltun (2015) Fisher Yu and Vladlen Koltun. 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
  • Zeiler (2012) Matthew D Zeiler. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
  • Zhang and Wang (2016) Xiaodong Zhang and Houfeng Wang. 2016. A joint model of intent determination and slot filling for spoken language understanding. In IJCAI, pages 2993–2999.