A survey of joint intent detection and slot-filling models in natural language understanding

01/20/2021 ∙ by H. Weld, et al. ∙ The University of Sydney

Intent classification and slot filling are two critical tasks for natural language understanding. Traditionally the two tasks have been deemed to proceed independently. However, more recently, joint models for intent classification and slot filling have achieved state-of-the-art performance, and have proved that there exists a strong relationship between the two tasks. This article is a compilation of past work in natural language understanding, especially joint intent classification and slot filling. We observe three milestones in this research so far: Intent detection to identify the speaker's intention, slot filling to label each word token in the speech/text, and finally, joint intent classification and slot filling tasks. In this article, we describe trends, approaches, issues, data sets, evaluation metrics in intent classification and slot filling. We also discuss representative performance values, describe shared tasks, and provide pointers to future work, as given in prior works. To interpret the state-of-the-art trends, we provide multiple tables that describe and summarise past research along different dimensions, including the types of features, base approaches, and dataset domain used.




1. Introduction

The efficacy of virtual assistants becomes more important as their popularity rises. Central to their performance is the ability for the electronic assistant to understand what the human user is saying, in order to act, or reply, in a way that meaningfully satisfies the requester.

The human-device interface may be text based, but is now most frequently voice, and will probably in the near future include image or video. Spoken language understanding (SLU) is the part of the natural language processing (NLP) stack that provides a framework for understanding human utterances. SLU starts with automatic speech recognition (ASR), the task of taking the sound waves or images of expressed language and transcribing them to text. Natural language understanding (NLU) then takes the text and extracts the semantics for use in further processes - information gathering, question answering, dialogue management, request fulfilment, and so on.

The concept of a hierarchical semantic frame has developed to represent the levels of meaning within spoken utterances. At the highest level is a domain, then intent and then slots. The domain is the area of information the utterance is concerned with. The intent (a.k.a. goal in early papers) is the speaker’s desired outcome from the utterance. The slots are the types of the words or spans of words in the utterance that contain semantic information relevant to the fulfilment of the intent. An example is given in Table 1 for the domain movies. Within this domain the example has intent find_movie and the individual tokens are labelled with their slot tag using the IOB tagging format.

query  | find | recent | comedies | by | james | cameron
slots  | O    | B-date | B-genre  | O  | B-dir | I-dir
intent | find_movie
domain | movies
Table 1. An example of an utterance as semantic frame with domain, intent and IOB slot annotation (from (Hakkani-Tür et al., 2016))

The NLU task is thus the extraction of the semantic frame elements from the utterance. NLU is important - it is central to devices that desire a spoken interface with humans - for example, conversational agents, instruction in vehicles (driverless or otherwise), Internet of Things (IoT), personal assistants, online helpdesks/chatbots, robot instruction, and so on. Improving the quality of the semantic detection will improve the quality of the experience for the user, and from here it draws its importance and popularity as a research topic.

In many data sets, and indeed real world applications, the domain is limited; it is concerned only with hotel bookings, or air flight information, for example. In these cases the domain level is generally not part of the analysis. In wider ranging applications, for example the SNIPS data set discussed later, or the manifold personal voice assistants which are expected to field requests from various domains, inclusion of domain detection in the problem can lead to better results. However, for the purposes of this survey we will treat the domain as ancillary.

This leaves us with intent and slot identification. What does the human user want from the communication, and what semantic entities carry the details? The two sub-tasks are known as intent detection and slot filling. The latter may be a misnomer as the task is more correctly slot labelling, or slot tagging. Slot filling, more precisely, is giving the slot a value of a type matching the label. For example, a slot labelled “B-city” could be filled with the value “Sydney”. Intent detection is usually approached as a supervised classification task, mapping the entire input sentence to an element of a finite set of classes. Slot filling is then a labelling of the sequence of tokens in the utterance, placing it within the sequence-to-sequence (seq2seq) class of problems.
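The distinction between slot labelling and slot filling can be illustrated with a minimal sketch that groups IOB tags into filled slots, using the utterance from Table 1. The function name `iob_to_slots` and the dictionary layout of the frame are illustrative assumptions and are not taken from any of the surveyed papers.

```python
from typing import List, Optional, Tuple


def iob_to_slots(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (slot_label, value) pairs."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    slots: List[Tuple[str, str]] = []
    current_label: Optional[str] = None  # label of the span being built, if any
    current_words: List[str] = []

    def flush() -> None:
        # Close the span in progress (if any) and store it as a filled slot.
        nonlocal current_label, current_words
        if current_label is not None:
            slots.append((current_label, " ".join(current_words)))
        current_label, current_words = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_label, current_words = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_words.append(token)
        else:
            # "O", or an "I-" tag that does not continue the open span
            # (such orphan tokens are dropped in this sketch).
            flush()
    flush()
    return slots


# The utterance and annotation from Table 1.
tokens = "find recent comedies by james cameron".split()
tags = ["O", "B-date", "B-genre", "O", "B-dir", "I-dir"]
frame = {
    "domain": "movies",
    "intent": "find_movie",
    "slots": iob_to_slots(tokens, tags),
}
print(frame["slots"])
# [('date', 'recent'), ('genre', 'comedies'), ('dir', 'james cameron')]
```

The tag sequence is the output of slot labelling; the list of (label, value) pairs is what slot filling, in the stricter sense, produces.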

While early research looked at the tasks separately, or put them in a series pipeline, it was quickly noted that the slot labels present and the intent class should and do influence each other, such that solving the two tasks simultaneously should garner better results for both. This has been at the centre of NLU research over recent years, though work on the single tasks has continued.

A joint model which simultaneously addresses each sub-task must, to be successful, capture the joint distributions of intent and slot labels, with respect also to the words in the utterance, their local context, and the global context in the sentence. A joint model has the advantage over pipeline models that it is less susceptible to error propagation, and over separate models in general that there is only a single model to train and fine tune.
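One simple way to train a single model on both sub-tasks is to add an intent classification loss to a per-token slot labelling loss. The sketch below assumes exactly that: softmax cross-entropy for each term and a hypothetical weighting parameter `slot_weight`. It is an illustration only and does not reproduce the objective of any particular model covered in this survey.

```python
import math
from typing import Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Softmax cross-entropy for one prediction, computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def joint_loss(
    intent_logits: Sequence[float],
    intent_target: int,
    slot_logits: Sequence[Sequence[float]],
    slot_targets: Sequence[int],
    slot_weight: float = 1.0,  # assumption: relative weight of the slot term
) -> float:
    """Intent loss plus the mean per-token slot loss for one utterance."""
    if len(slot_logits) != len(slot_targets) or not slot_targets:
        raise ValueError("need one slot target per token")
    intent_term = cross_entropy(intent_logits, intent_target)
    slot_term = sum(
        cross_entropy(logits, target)
        for logits, target in zip(slot_logits, slot_targets)
    ) / len(slot_targets)
    return intent_term + slot_weight * slot_term


# Two intents and three slot labels with uninformative (all-zero) logits:
# the loss is ln(2) for the intent plus ln(3) for the slots.
loss = joint_loss([0.0, 0.0], 0, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [1, 2])
print(round(loss, 4))
```

Because both terms are computed from the same utterance, gradients from either sub-task would reach any shared encoder, which is the mechanism by which a joint model can capture the dependence between intent and slot labels.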

A drawback is that a large annotated corpus is usually required, though this is also true of separate models. The model may also be relatively complicated and take time to train. It has also been observed that joint models may not generalise well to unseen data, due to the variety of natural language expressions of similar intent. In real world applications the domains and label sets may change over time.

In many ways the development of the field has followed a similar path to other areas of NLP, starting with classical (statistical or probabilistic) models. Neural networks were applied as computing power increased. In particular, due to the sequential nature of the slot labelling sub-task, recurrent neural networks (RNNs) have been a technology frequently used in the field. In more recent years the transformer architecture has debuted to address issues like long range dependency. As a result, attention has increased in importance. As far as feature creation goes, convolution, word embeddings, and pre-trained language models have all been applied, amongst many other methods. The use of external knowledge bases has been observed in more recent papers.

The most regularly used data sets are two freely available sets - ATIS and SNIPS. A common experiment is implied by the literature, from which the reported results are compared in this survey. In addition, one of the aims of this survey is to address further standardisation, in terms of the parameters of the experiment and the evaluation metrics used.

The approaches to the joint task have been manifold and have shown excellent results in standard supervised training/test experiments. As new techniques make what may appear to be incremental increases to the state of the art it is perhaps time to recast the measures of success in the field. Rather than just developing new, more challenging annotated data sets, increasingly important must be the development of unstructured semantic detection in new domains.

The motivation for this survey is to take stock of the state of the field in 2020 following a surge of ideas and approaches over recent years, particularly in the joint task. We collect information on the approaches pursued so far and the issues encountered and addressed. With the survey completed we propose some future directions for the field.

In summary, this survey addresses three major questions:

  • Q1: How do these joint models achieve and balance two aspects, intent classification and slot filling?

  • Q2: Have syntactic clues/features been fully exploited or does semantics override this consideration?

  • Q3: Can successful models in one supervised domain be made more generalisable to new domains or languages or unseen data?

1.1. Scope

The focus of this survey is on extraction of the intent and slots of single utterances. The separate tasks are covered and then the joint task is addressed in detail. Papers on the following aspects will be reviewed but the information considered as ancillary only:

  • Multi-domain data sets with annotated domain. Extraction of the domain is a classification task like intent detection, albeit less granular. Inclusion of the domain identification task may aid the two sub-tasks of interest and this will be mentioned;

  • Dialogue action. Dialogue action is the identification of the next action to be taken by a dialogue management system once intent and slots have been identified. In some cases dialogue action is a direct substitute for intent and in others a mapping is made from intent and slot labels to the action. We review papers that include a strong focus on intent and slot detection;

  • Multi-intent data sets. An utterance may have multiple intents (for example, flight booking and hotel booking). More work has been done in pure intent detection on this aspect than in the joint task. We will consider it in the pure intent task in particular and make comments on expansion to the joint task;

  • Automatic speech recognition (ASR). Some papers begin with the ASR step and look at error propagation from ASR to intent or slot prediction. We don’t consider this aspect.

Year | # papers | Feature engineering | Technologies
2011 | 3 | Dependency parse | SVM, DBN, multi-layer NN, AdaBoost
2012 | 1 | Bag-of-words | C4.5, RF, NB, KNN, Linear SVM
2013 | 1 | n-gram | SVM, SVM-HMMs
2015 | 4 | n-gram, word2vec | LSTM, RNN, ensemble, RF, clustering, SVM, AdaBoost, NN, J48, FFN
2016 | 1 | CNN | RF
2018 | 7 | GloVe, word2vec, character embedding, grammatical features, dependency parse, knowledge base, POS, CRF, Regex, PCFG-ML, fastText | (Bi)CNN, (Bi)LSTM, (Bi)GRU, ensemble, Capsule networks, attention, (Bi)RNN, adversarial networks, gradient reversal layer, SVM, J48, Logistic regression, PPN, RF, Gaussian Naïve Bayes, KNN, NB, softmax regression
2019 | 4 | n-gram, character, word2vec, CNN, BiLSTM | BiLSTM, attention, Ridge, KNN, MLP, passive aggressive, RF, linear SVC, SGD, nearest centroid, multinomial NB, Bernoulli NB, K-means, CNN, BiGRU, density-based novelty detection algorithm, local outlier factor
2020 | 1 | BERT, word2vec, CNN | Siamese, triple loss
Table 2. Historical overview of intent detection papers

1.2. Related surveys

(Tur and De Mori, 2011) is a complete summary of the SLU field at the advent of the neural era (2011). (Wang and Yuan, 2016) concentrate on models that jointly address sub-tasks in dialogue systems, including NLU, dialogue management (DM) and Natural Language Generation (NLG). They cover the early models in the joint task but predate the works explicitly tying NLU to dialogue action covered here.

(Tur et al., 2018) concentrates on goal-oriented conversational language understanding but within that field provides an excellent precursor to this survey, covering the state of the art to 2017 in the two sub-tasks and 2016 in the joint task. (Hou et al., 2019) gives a small overview of the separate and joint tasks. (Liu et al., 2019) provides a good survey of intent detection methods up to 2018 including multi-intent detection and evaluation methods.

Tangentially related surveys include (Serban et al., 2018) which surveys dialogue data sets available for research, and (Deriu et al., 2020) which gives an overview of evaluation methods for dialogue systems.

In this survey we bring the coverage of methods up to mid-2020 including the many applications of deep learning in the field. As well as a technological survey we look at issues addressed in each task and the joint task, and the approaches designed to address these issues. We also supply a summary of reported performance on the standard data sets.

1.3. Structure of the survey

The survey begins with a broad overview of the literature in Section 2. We then give a detailed description of the methods for each sub-task (Sections 3 and 4) and the joint task in Section 5, along with the issues addressed and solutions proposed. In Section 6 a survey of the data sets encountered takes place. In Sections 7 and 8 there is a description of the experiments and evaluation methods applied and a discussion of standardisation of these. A summary of the results achieved over the history of the field is given in Section 9. We finish with a discussion of the challenges and opportunities for research in the field and give concluding remarks in Section 10.

Year | # papers | Feature engineering | Technologies
2011 | 1 | Neural network, observation feature vector | Deep learning, CRF
2012 | 1 | n-gram, K-DCN | Kernel learning, deep learning, DCN, log-linear model
2013 | 3 | Discriminative embedding, named entity, dependency parse, POS, SENNA, RNNLM, bag-of-words | DBN, RNN, RNN-LM
2014 | 2 | RNN, lexicon feature | CRF, LSTM, regression model, deep learning
2015 | 3 | Word embedding, named entity, word embedding | RNN, sampling approach, external memory
2016 | 3 | Word embedding, context window, RNN, CNN | BiRNN, attention, LSTM, encoder-labeler, CNN
2017 | 1 | Word embedding | (Bi)LSTM, encoder-decoder, focus mechanism, entity position-aware attention
2018 | 5 | BiLSTM, word embedding, character, CNN, delexicalisation | CRF, MTL, segment tagging, NER, BiLSTM, attention, delexicalised sentence generation, DNN, reinforcement learning, GRU, pointer network
2019 | 6 | Word embedding, web-data, expert feedback, contextual information, GloVe, POS, character, BERT | BiLSTM, BiGRU, different knowledge sources, context gate, MTL, CNN
2020 | 2 | ResTDNN | Prior knowledge driven label embedding, CRF, TDNN, RNN
Table 3. Historical overview of slot labelling papers

2. Overview of the literature

Intent classification is a form of text classification where the text is a single sentence that comes from a spoken or written utterance. Much effort has been made to construct features which encapsulate the sentence, both semantically and syntactically, and the words within it. These features have been passed to classifiers from the suite of classical and, from 2011, deep learning methods, as outlined in Table 2. Issues around ambiguity, shortness of sentences, treatment of out-of-vocabulary words and emerging label sets are amongst those covered in the literature.

Slot tagging (see Table 3) is framed as a sequence labelling problem and in early years drew from methods for statistically modelling the dependencies within sequences, like conditional random fields (CRFs) and Hidden Markov models (HMMs). Around 2013 the strength of RNNs in this area had been observed and was applied to the task and developed over the ensuing years. Interestingly the use of CRFs returned, often as a post-RNN step, due to their efficacy at handling label dependency issues. As far as feature creation goes the general goal of the task is to use the semantic information within the words and various context windows from small to long-range within the sentence. Attention is used as one approach for eliciting useful context. Slot tagging has experimented with external knowledge bases for extra performance.

Methods used by both sub-tasks to extend their features include looking at meta-data from the data collection. Multi-task learning has also been used by both tasks to look for synergistic learning from other related tasks. Of course the joint task itself is an example of this synergistic approach. Both tasks have also considered methods for transfer learning to other languages and to data with new, unseen tag sets.

The two earliest papers (2008-9) addressing the joint task drew methods from classical NLP. Features were constructed from words, n-grams and suffixes, or from a semantic parsing of the utterances. A CRF or a support vector machine (SVM) was used for the analysis. In 2013 the first neural network was used though it really just constructed convolutional neural network (CNN) features for use in the CRF model from 2008. In 2014 a recursive neural network (RecNN), which works over trees, was applied to the dependency parse of the utterances. In 2015 the first completely neural network was devised, using a recurrent neural network (RNN, different to a recursive neural network) embedding of words, CNN representation of sentences, and a feed forward network (FFN) for the analysis.

By 2016 the RNN encoder-decoder architecture had been found to be useful for seq2seq tasks and started to make its impact in the joint task. Unidirectional and bi-directional Long Short Term Memory (LSTM) and Gated Recurrent Unit (GRU) cells were tested within circuits. Attention made its first appearance. On the input feature side K-SAN graphs were used as a knowledge base.

In 2017 progress in the field appeared to stall, with only character embedding being added to the input features and no improvement of performance results on the major data sets. Perhaps though, researchers were working on the many developments which exploded in 2018. Word embeddings were introduced - word2vec, GloVe and ELMo. The circuits were still largely RNN based. For new architectures a capsule neural network and bidirectional circuits were introduced. Here bidirectional refers to explicit influence paths through the circuit: intent2slot refers to intent information being used as part of slot prediction and slot2intent the opposite, slot information being used as part of intent prediction.

In 2019 BERT debuted as a word embedding technology and ELMo fell away. More knowledge bases were used as input features. Work on pre-processing the data sets included delexicalisation, augmentation, and sparse word embeddings using a lasso method. In architecture RNN and attention continued to be used and CRF made a return to handle label dependency issues. Newly applied architectures included the transformer, and memory neural networks.

The indications from 2020 are that graph embeddings are being used more to capture slot-intent and word-slot-intent relationships.

Year | # papers | Feature engineering | Technologies
2008 | 1 | words/n-grams/suffixes | CRF
2009 | 1 | semantic tree | SVM
2013 | 1 | CNN | CRF
2014 | 1 | dependency parse | RecNN (diff to RNN)
2015 | 1 | RNN words, CNN sentence, Bag of words | MLP
2016 | 6 | RNN, K-SAN | (Bi)LSTM, (Bi)GRU, encoder-decoder RNN, attention
2017 | 4 | character, word, CNN | BiLSTM
2018 | 18 | word2vec, GloVe, ELMo, CNN sentence, attention sentence | BiLSTM, BiGRU, encoder-decoder RNN, Capsule NN, BiDirectional
2019 | 29 | BERT, GloVe, character, knowledge base (tuples), delexicalisation | memory NN, transformer, CRF, attention, BiDirectional
2020 | 10 | BERT, Graph embedding | Graph S-LSTM, BiDirectional, GCN, Capsule
Table 4. Historical overview of joint task papers

Paper | Addressed issue | Approach
(González-Caro and Baeza-Yates, 2011) | Multi-faceted query intent prediction | Combined multifaceted (multi-label) intent classification
(Sarikaya et al., 2011) | Small/lack of labelled training data | Initialise FFN using DBN derived from trained stacked RBMs
(Tur et al., 2011) | Short text query in web search | Simplified sentence structure as additional feature
(Chen et al., 2012) | Contextual/temporal information modeling | Semi-supervised co-training based on two independent features (text/metadata)
(Bhargava et al., 2013) | Small/lack of labelled training data | 1) Incorporating temporal information as additional feature; 2) Modeling temporal/session information as a sequence
(Ravuri and Stolcke, 2015) | OOV issue | Incorporating temporal information using RNN-based models with one-hot word embedding and n-gram hashing
(Hasanuzzaman et al., 2015) | Small/lack of labelled training data | Multi-objective ensemble learning with feature engineering (external resource used)
(Purohit et al., 2015) | Ambiguity in interpretation; Imbalanced data | Hybrid feature representation created by combining top-down processing using knowledge-guided patterns with bottom-up processing using a bag-of-tokens model
(Kanhabua et al., 2015) | Event-based web searching | Time-based and event-based clustering with click-through and standard statistical feature-based classification
(Hashemi et al., 2016) | Complex feature engineering | CNN feature extracted vector representation
(Zhang et al., 2016) | Co-occurrence of words from different intents; word correlations addressing | Heterogeneous features of pairwise word correlation and POS information
(Firdaus et al., 2018b) | Exploring combination of deep learning architectures | Pre-trained embedding with ensemble of deep learning models
(Xia et al., 2018) | Emerging intents detection | Capsule-based architectures with zero-shot learning to discriminate emerging intents via knowledge transfer
(Costello et al., 2018) | Multi-domain/multi-lingual generalisation ability | Multi-layer ensemble models of different deep learning techniques
(Masumura et al., 2018) | Multi-task and multi-lingual joint modelling | Adversarial training method for the multi-task and multi-lingual joint modelling
(Mohasseb et al., 2018) | Grammar feature exploration | Grammar-based framework with 3 main features
(Xie et al., 2018) | Short text; Semantic feature expansion | Semantic Tag-empowered combined features
(Qiu et al., 2018) | Potential consciousness information mining | A similarity calculation method based on LSTM and a traditional machine learning method based on multi-feature extraction
(Kim and Kim, 2018) | OOD utterances | Multi-task learning
(Cohan et al., 2019) | Utilisation of naturally labelled data | Multitask learning based on joint loss
(Shridhar et al., 2019) | OOV issue; Small/lack of labelled training data | Subword semantic hashing
(Wang et al., 2019) | Learning of deep semantic information | Hybrid CNN and bidirectional GRU neural network with pre-trained embeddings (Char-CNN-BGRU)
(Lin and Xu, 2019) | Emerging intents detection | Maximise inter-class variance and minimise intra-class variance to get the discriminative feature
(Ren and Xue, 2020) | Similar utterance with different intent | Triples of samples used for training
(Yilmaz and Toraman, 2020) | OOD utterances | KL divergence vector for classification
Table 5. Intent detection papers reviewed with addressed issue, approach and techniques

3. Intent Detection

Intent detection is typically set up as a sentence classification problem. That is, a feature or features are constructed from the sentence and these are passed through a classification algorithm to predict a class for the sentence from a predefined set of classes. As a classification problem the techniques applied look to discover a well defined decision boundary between the features. Intent classification differs from classification tasks in other fields due to the nature of the data which are text sentences, coming from spoken language utterances. Hence, at least initially, the features should look to capture semantic information in the sentence. Beyond the semantic information within the words many approaches have been made to extend the feature set using internal (syntactic, word context) or external (meta-data, sentence context) information.

3.1. Major areas of research

Research into intent classification in SLU has generally come from four areas: search engines, question answering systems, dialogue systems, and text categorisation. Early search engines applied text similarity to select results for users. More recently, intent classification has been applied to understand the searcher’s intent further and this approach has been proven to give better search results. However, web queries are usually short and informal, causing difficulties in classifying intents because of insufficient information. Similarly, answering questions from users also benefits from understanding the intents of questions to generate better quality responses. In dialogue systems it has been shown to be useful to identify intents of users in order to give appropriate responses to users. Moreover, intent classification can be applied in more general NLP tasks, such as text classification, sentiment analysis and scientific citation.

3.2. Overview of technological approaches

Before 2015, most papers focused on classical machine learning approaches, such as SVM, K-nearest neighbours and random forest. Features used by these models were mainly generated by dependency parsing, word embedding and n-grams. One deep learning method explored early on was deep belief networks (DBN). In more recent years, with the success of deep learning in other areas, neural networks, especially RNNs, started to be widely used for this task. Attention mechanisms have been integrated in models for identifying which parts of sentences should contribute to the classification. Since intent classification is proposed to be integrated with web engineering, which requires the ability to understand short texts that contain less information, features used for training have been enriched by feature engineering using web metadata.

3.3. Issues addressed in intent detection

In this section we survey the issues encountered in the literature around intent detection, and the solutions proposed. The issues may be specific to the task, like ambiguity of semantic intent. They may be general machine learning issues like lack of training data, and dealing with new or adapting domains. They may be issues specific to the available data like imbalanced data, short sentences, or the out-of-vocabulary (OOV) issue. Feature creation to capture information extra to that in the words is considered to boost performance. Extending the range of the task to multiple intents or to identify out-of-domain sentences is covered.

3.3.1. Ambiguity in interpretation

In essence, this issue is at the heart of intent classification; identifying the decision boundary between samples close together in feature space, yet belonging to different classes. This issue may be more prevalent with short texts, since they may include insufficient information and not follow correct grammar.

An early approach from (Purohit et al., 2015) was to propose a rich feature representation with an ensemble learning framework giving different perspectives on the classification. The feature creation is covered further in ensuing sections.

A more recent approach from (Ren and Xue, 2020) proposed training triples of samples - an anchor sample, a positive sample in the same class and a negative sample from a different class. Combining convolutional and BERT encodings of each one and mapping them to Euclidean space with Siamese shared weights, an intermediate loss of the anchor-positive distance minus the anchor-negative distance is minimised. The Euclidean mapping of the anchor is used for classification.

This latter approach feeds into the emerging field of contrastive learning, and methods from there should be deployed in the NLU field.
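The triple-based objective described above can be sketched in a few lines. The version below computes the anchor-positive distance minus the anchor-negative distance in Euclidean space; the hinge at zero and the `margin` parameter are assumptions borrowed from the commonly used triplet loss, not details confirmed by the description of (Ren and Xue, 2020) given here, and the encoders that would produce the vectors are omitted.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two encoded utterances."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.0,  # assumption: extra separation required between classes
) -> float:
    """Anchor-positive distance minus anchor-negative distance, hinged at zero."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


anchor = [0.0, 0.0]
same_intent = [3.0, 4.0]       # distance 5 from the anchor
other_intent_far = [6.0, 8.0]  # distance 10 from the anchor
other_intent_near = [0.0, 1.0] # distance 1 from the anchor

print(triplet_loss(anchor, same_intent, other_intent_far, margin=1.0))   # 0.0
print(triplet_loss(anchor, same_intent, other_intent_near, margin=1.0))  # 5.0
```

The loss is zero once the negative sample is further from the anchor than the positive sample by at least the margin, so only triples that are still confusable contribute to training.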

3.3.2. Lack of labelled training data or small training sets

Collecting and labelling large amounts of data for training can be expensive. With small data sets, models are more likely to be over-trained. Further, the out-of-vocabulary (OOV) issue, where words appear in the test set that are not in the training set, is more likely to occur with them.

(Sarikaya et al., 2011) proposed a DBN-initialised neural network for intent classification to learn from unlabelled data and generate features for a feed-forward network. The feed forward network is then fine-tuned on labelled data which may be small in number but still give reasonable results. A DBN is a stack of Restricted Boltzmann Machines (RBMs). To train a DBN, the RBMs are trained layer-by-layer in sequence using parameters learned by previous layers. After training the stack of RBMs, the weights of the DBN are used to initialise the weights of a feed-forward neural network. This approach performed better than traditional machine learning models, such as maximum entropy and boosting, and similarly to SVM.

(Hasanuzzaman et al., 2015) tried to include temporal query understanding into web search query intent classification. They tackled two major issues: one is the inadequacy of limited training data while the other is the limited literal features able to be extracted from queries of short length (typically 3-4 words). They utilised external resources collected from the web that may help bolster temporal information, such as web snippets for queries and the most relevant year, date etc. Based on this 28 features were designed and extracted. They then proposed an ensemble learning solution framework defined as a multi-objective optimisation problem (MOO) and explored with 28 classifiers using different optimisation strategies. The utilisation of external resources and ensemble learning was intended to reduce bias to better handle the limited training data.

Methods for dealing with new unannotated domains and data sets by transfer of concept from existing data sets or model weights, combined with few-shot methodologies, have been explored in the slot labelling and joint task areas and are discussed later.

Sufficient data is essential for training a model. To work with unlabelled data, unsupervised training methods could be investigated in further research.

3.3.3. Multi-domain/multi-lingual generalisability

Most text classification models focus on only one language, one domain and also one task. Some models have been proposed to have better generalisability.

(Costello et al., 2018) developed a novel multi-layer ensembling approach that ensembles both different model initialisation and different model architectures to determine how multi-layer ensembling improves performance on multilingual intent classification. They constructed a CNN with character-level embedding and a bidirectional CNN with attention mechanism. In addition, they explored LSTM and GRU with or without character-level embedding and attention mechanism. When ensembling models, they use a majority vote with confidence approach.
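The details of the majority vote with confidence are not given here, so the following is only one plausible reading, stated as an assumption: each ensemble member contributes a predicted intent and a confidence score, the intent with the most votes wins, and summed confidence breaks ties. The function name `vote_with_confidence` and the intent labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def vote_with_confidence(predictions: List[Tuple[str, float]]) -> str:
    """Pick the label with the most votes; summed confidence breaks ties."""
    if not predictions:
        raise ValueError("at least one prediction is required")

    votes: Dict[str, int] = defaultdict(int)
    confidence: Dict[str, float] = defaultdict(float)
    for label, score in predictions:
        votes[label] += 1
        confidence[label] += score

    # Compare labels first by vote count, then by total confidence.
    return max(votes, key=lambda label: (votes[label], confidence[label]))


# Three hypothetical ensemble members: two agree, one is confident but alone.
members = [("book_flight", 0.90), ("book_hotel", 0.60), ("book_hotel", 0.55)]
print(vote_with_confidence(members))  # book_hotel

# With a tie in votes, the more confident label is chosen.
print(vote_with_confidence([("book_flight", 0.90), ("book_hotel", 0.60)]))  # book_flight
```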

(Masumura et al., 2018) proposed an adversarial training method for the multi-task and multi-lingual joint modelling to improve performance on minority data. The language-specific network can be shared between multiple tasks, where words in the input utterance are converted into language-specific hidden representations. Next, each word representation is converted into a hidden representation that uses BiLSTMs to take neighbouring word context information into account. Task-specific networks can be shared between multiple languages, where the language-specific hidden representations are converted into task-specific hidden representations. The proposed method combines a language-specific task adversarial network with a task-specific language adversarial network.

3.3.4. Emerging intents detection

In dynamic real world applications the intent set evolves. A method to detect and classify emerging intents is a desirable adjunct task.

(Hashemi et al., 2016) proposed to use a CNN to extract query features for intent classification, which is trained based on word-level embeddings generated by word2vec trained on Google News. Query representations are taken after a max pooling layer. They perform clustering on these representations and observe that new examples far from the clusters could be used to identify emerging intents, though they do not perform that task.

(Xia et al., 2018) proposed two capsule-based architectures to detect emerging intents. They construct three capsules, SemanticCaps, DetectionCaps and Zero-shot DetectionCaps. SemanticCaps is based on a bidirectional RNN with multiple self-attention heads and is used to extract semantic features from utterances. Then, DetectionCaps aggregate the low-level information from SemanticCaps to high-level information in an unsupervised routing-by-agreement approach and obtain intent representations. For detecting emerging intents, the Zero-shot DetectionCap takes information from SemanticCaps and DetectionCaps to calculate vote vectors for information transferral. Then, the vote vectors are multiplied with similarity between embeddings of existing intents and emerging intents and summed to generate representations of emerging intent labels.

3.3.5. Unseen intents

Dealing with intents which are unseen in the training data is a related challenging task. (Lin and Xu, 2019) proposed a two-stage method to detect unseen intent labels. First, they used a Bi-LSTM to extract features of a sentence. Then, the forward output vector and the backward output vector were concatenated and the concatenation result was used as the input of the next stage. The model uses large margin cosine loss (LMCL) as the loss function, instead of softmax loss, which aims to maximise the decision margin. Thus, the inter-class variance is maximised and the intra-class variance is minimised. This is to ensure that the features extracted by Bi-LSTM can be more discriminative. In the second stage, the model takes the concatenation vector and applies a local outlier factor (LOF) to detect unseen intents, which is a density-based detection algorithm.
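The first-stage loss can be illustrated with a minimal pure-Python sketch of the large margin cosine loss for one sample. The scale and margin values, the function names and the toy vectors are assumptions chosen for illustration; the Bi-LSTM feature extractor and the second-stage LOF detector are not shown.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def lmcl_loss(feature, class_weights, target, scale=30.0, margin=0.35):
    """Large margin cosine loss for a single feature vector.

    The margin is subtracted from the cosine of the true class only, so the
    true class must win by at least `margin` before the loss becomes small.
    """
    logits = []
    for j, weight in enumerate(class_weights):
        cos = cosine(feature, weight)
        if j == target:
            cos -= margin
        logits.append(scale * cos)
    # Numerically stable log-sum-exp, then cross-entropy on the adjusted logits.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[target]

feature = [1.0, 0.2]                   # stands in for a Bi-LSTM sentence vector
weights = [[1.0, 0.0], [0.0, 1.0]]     # one weight vector per known intent
with_margin = lmcl_loss(feature, weights, target=0)
without_margin = lmcl_loss(feature, weights, target=0, margin=0.0)
```

With the margin set to zero the function reduces to a softmax loss over scaled cosine similarities, which makes the effect of the margin easy to compare.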

The OOV issue

Most word embedding based approaches are dependent on vocabularies and may suffer to some extent from OOV issues, though small training data sets may be affected more. Using character n-grams is a common approach to handle unseen words based on the idea that similar words may come from a common root. Another is to replace all words below a chosen frequency in the training set with a special token, say UNKNOWN.
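The second approach, replacing rare training words with a special token, can be sketched as below. The threshold value and function name are assumptions for illustration.

```python
from collections import Counter

def replace_rare_words(sentences, min_count=2, unk="UNKNOWN"):
    """Replace every token seen fewer than `min_count` times with `unk`."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return [
        [token if counts[token] >= min_count else unk for token in sentence]
        for sentence in sentences
    ]

train = [["book", "a", "flight"], ["book", "a", "hotel"]]
replaced = replace_rare_words(train)
```

At test time, any word outside the retained vocabulary would be mapped to the same token, so the model has already seen an embedding for it during training.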

(Ravuri and Stolcke, 2015) proposed character n-grams as their word input encoding method with both RNN and LSTM, since they thought the OOV issues became more severe when using RNNs because an unknown word could propagate an effect to the subsequent words. Similarly, (Shridhar et al., 2019) proposed sub-word semantic hashing, inspired by the Deep Semantic Similarity Model, for solving the OOV issue which comes with small training data sets. Before sub-word semantic hashing, sentences are converted to lower case, pronouns in sentences are replaced by ‘-PRON-’, and special characters except stop characters are removed. Then, classes with fewer sentences are oversampled by adding augmented sentences, which are generated using synonym replacement. After this, every token in each sentence is wrapped in two ‘#’ symbols and represented using character trigrams. These sub-tokens are then vectorised using an inverse document frequency vector and the Euclidean norm. In the end, the vectors can be used in any intent classification model.
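The sub-token extraction step can be sketched as follows. This minimal example covers only lower-casing, wrapping in ‘#’ and trigram extraction; pronoun replacement, oversampling and the IDF vectorisation are omitted, and the function name is an assumption.

```python
def subword_trigrams(sentence):
    """Wrap each lower-cased token in '#' and split it into character trigrams."""
    subtokens = []
    for token in sentence.lower().split():
        wrapped = f"#{token}#"
        subtokens.extend(wrapped[i:i + 3] for i in range(len(wrapped) - 2))
    return subtokens

trigrams = subword_trigrams("Book flight")
```

Because an unseen word usually shares several trigrams with words seen in training, its vector is not empty, which is the property that helps with OOV words.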

3.3.6. Short text queries

User queries for search engines are usually short (3-4 words) and lack context, so it is essential to extract more information from queries for successful classification. Syntactic features, such as POS tags, and also external knowledge sources have been used to enrich query features.

(Hasanuzzaman et al., 2015) included temporal query understanding into web search query intent classification and their model works well with the limited literal features for queries of short length. (Purohit et al., 2015) focused on intent understanding of social media text such as tweets. Some of these can be short, leading to ambiguity of interpretation and sparsity of relevant behaviours. They try to improve the expressiveness of data by utilising multiple patterns from knowledge sources and fuse the top-down knowledge-guided patterns with bottom-up frequency-based representation for feature formation. Based on this, they utilise an ensemble learning strategy to reduce the bias.

(Xie et al., 2018) proposed a model called Semantic Tag-empowered User Intent Classification (ST-UIC), based on a constructed semantic tag repository. This model uses a combination of four kinds of features including characters, non-key-noun part-of-speech tags, target words, and semantic tags. After pre-processing, characters and target word features are extracted for maintaining the contextual information. Then, key nouns are expanded using semantic tags and POS tags are used as features if a query does not contain target words. With this approach, representation can be enriched for short queries.

In contrast, (Tur et al., 2011) tried to simplify the query input based on a dependency parser to generate simple and well-formed queries. They were motivated by the performance gain of existing statistical SLU models on simple, well-formed queries as well as the need for handling the increasing number of web search queries formed from key words. They simply kept the top level predicate and its dependants for the query simplification, and combined it with the sentence input for further classification using AdaBoost. The essence here is to provide the extracted key word pieces as auxiliary information, which proved to decrease the intent classification error rate. Because some of the semantic and syntactic information contained in the sentence is filtered out, a decrease in performance was reported when this simplified syntactic structure was used alone as input for classification.

This issue mainly occurs with user queries for web searching. While not widely reported on in the joint task literature, short texts do occur in other data sets and the methods described above can be applied there. Rather than the rule based feature construction approach, knowledge graphs may be further investigated as a method for adding relevant external information.

3.3.7. Other feature engineering

Even when the utterances are longer the challenge is to extract information relevant to the classification task. One early solution generating features using neural networks was by (Sarikaya et al., 2011), who generated features for a feed-forward network using their DBN-initialised neural network. After training, the features generated by DBNs were found to be useful in discriminative classification tasks.

Many papers since have used standard word embeddings, RNN and CNN feature creation. In order to boost semantic understanding (Wang et al., 2019) proposed to use both in a model called Character-CNN-BGRU. Firstly this model uses character embeddings to represent sentences, rather than word embeddings. A CNN takes the character embedding as the input and extracts local features via max pooling after a convolutional layer. Meanwhile, a window feature sequence layer is added on the convolutional layer to obtain temporal information, which is important for the bidirectional gated recurrent unit (BiGRU). Finally, the output of the max pooling layer and the output of BiGRU are concatenated and passed to a softmax layer.

While semantic features are useful, methods to extract other usable information for classification have been developed.

Grammar feature exploration

From the question answering field comes a feature creation technique based on grammar. Question words, such as ”what”, ”how” and ”where”, and also domain specific grammar may indicate the class of a question. Based on this idea, (Mohasseb et al., 2018) proposed a grammar-based framework for question classification, which utilises three features: grammatical features, domain specific grammatical features and grammatical patterns. Grammatical features are used to parse a question into a sequence of grammatical terms. Domain specific features are used to identify the domains which grammatical terms in the sentence correspond to and tag them. After parsing and tagging each term in the question, the pattern is formulated. The classification task is processed using machine learning models, such as SVM or the J48 algorithm.

3.3.8. Imbalanced data

A data set is unlikely to have the same number of samples for each class and sometimes data can be quite imbalanced. Training a model using imbalanced data can cause poor performance on minority classes.

(Purohit et al., 2015) noticed that social media text corpora can have such imbalanced data. They suggested that two of their constructed features aid with imbalanced data in their data set. These are Contrast Patterns features, where they mine sequential patterns within each intent class then contrast them. (Shridhar et al., 2019) dealt with imbalanced data through oversampling, adding augmented sentences for classes with fewer samples during the pre-processing stage of their model.

ATIS, one of the major data sets for the joint task, is very imbalanced in the intent aspect, and yet performance is excellent. The imbalanced data issue is rarely brought up in the joint task literature.

3.3.9. Co-occurrence of words from different intents

This is a particular form of ambiguity. Words important to different intents may co-occur in a query of a particular intent and how these words are positioned may convey crucial information for intent detection of the current query.

Based on this idea, (Zhang et al., 2016) proposed two types of heterogeneous information: (1) pairwise word feature correlations and (2) POS tags of the queries. The pairwise feature correlations are calculated based on cosine similarity between each semantic feature pair and learned by CNNs with pooling layers. Here each dimension of the word vector is deemed to be a semantic feature. The model tries to capture the intent through these feature-level representations. Meanwhile the POS tags provide word-level information about word categories. Their experimental results show that utilising the feature-level semantic representation outperforms the baseline model using only word-level features, and that incorporating POS information with the feature-level representation significantly improves the performance.

3.3.10. Contextual/temporal information modelling

A single sentence in a conversation can be ambiguous, but the ambiguity can be eliminated if previous utterances are considered during intent classification. Meanwhile, web queries are sensitive to time and the intent carried by them may change over time; for example as world events wax and wane in importance. Therefore, some studies incorporated contextual and temporal information in intent classification.

With multi-turn dialogue, (Bhargava et al., 2013) included the context from previous queries for the intent classification and slot filling of the current query. Each sub-task is treated separately, so this is not a joint model. For intent classification, they compared two approaches. The first is to construct a context sessions feature by simply using 1 for the type of intent inferred from the previous utterance set and 0 for all others. This is combined with a basic bag-of-n-gram feature for the current utterance and fed to an SVM for intent classification. The second approach is to treat the query sequence as a sequential tagging problem using SVM-HMMs with a Viterbi algorithm. The incorporation of intent from previous utterances as additional information showed a significant reduction in error rate.
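The first approach can be sketched as a feature vector that concatenates a bag-of-n-grams indicator for the current utterance with a one-hot encoding of the previous intent. The intent set, vocabulary and function names below are hypothetical, and the SVM itself is not shown.

```python
def context_feature(previous_intent, intent_set):
    """1 for the intent inferred from the previous utterance, 0 for all others."""
    return [1 if intent == previous_intent else 0 for intent in intent_set]

def bag_of_ngrams(tokens, vocabulary, n=2):
    """Binary indicator for each vocabulary n-gram (up to length n) in the utterance."""
    grams = set()
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            grams.add(" ".join(tokens[i:i + size]))
    return [1 if gram in grams else 0 for gram in vocabulary]

INTENTS = ["find_restaurant", "book_table", "get_weather"]   # hypothetical intents
VOCAB = ["book", "a table", "weather"]                       # hypothetical n-grams

tokens = "book a table for two".split()
features = bag_of_ngrams(tokens, VOCAB) + context_feature("find_restaurant", INTENTS)
```

The resulting vector would be the input to the classifier, so the previous intent simply acts as a few extra dimensions alongside the lexical features.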

(Hasanuzzaman et al., 2015) utilised external resources collected from the web that may help bolster temporal information, such as web snippets for queries and the most relevant year, date and other time indicators. Based on this, 28 features are designed and extracted. Another solution to generate features is referring to one of a set of contemporaneous events, known as event-based web searching. (Kanhabua et al., 2015) utilised the event-related log patterns that reveal both implicit and explicit temporal information needs, together with general lexical information such as named entities for feature representation. Classification is performed by SVM, AdaBoost, decision tree/J48, and a neural network.

(Chen et al., 2012) included temporal information in their model as well. They proposed a metadata feature for enhancing community query intent classification. The metadata feature included query topic, query time, and user experience indicated by the number of previous queries. They first explored supervised learning using the text and metadata features separately. The results showed that using both features together outperformed using either alone for query intent classification. This drove them to use a co-training procedure, which is a semi-supervised learning framework that can utilise a small number of annotated queries plus a large number of unlabelled ones. During training, two separate and independent classifiers are first trained, one on each feature. Then the prediction with higher confidence from either of the classifiers is used as the label for an unlabelled query in further training, until a stopping criterion is reached. Utilising the predictions of the two separate classifiers as labels for further training requires that the two features are conditionally independent and each sufficient for classification. Experimenting with SVM, higher micro and macro F1 scores were achieved from semi-supervised co-training compared to supervised learning with a combination of the two features.

Rather than temporal information (Qiu et al., 2018) proposed the construction of multiple features from user metadata, regex extraction of named entities, and probabilistic context free grammar of composite entities.

Contextual and temporal information may not be suitable for all data sets but, when available, future research can attempt to use it. The multi-turn dialogue approach will be explored further under the joint task. Graphs may also be explored for integration of contemporaneous topics and events.

3.3.11. Target variation

Typically in intent classification the pre-defined intents are enumerated and a cross-entropy loss is calculated. (Qiu et al., 2018) rather calculated LSTM embeddings of the training sentences and averaged them within samples of the same intent. For test prediction they calculate a similarity measure of the LSTM embedded test sample with the training samples averaged by intent and choose the closest.
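This nearest-average prediction can be sketched as below. The toy two-dimensional vectors stand in for LSTM sentence embeddings, and cosine similarity is an assumed choice, since the similarity measure is not specified above.

```python
import math

def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def build_centroids(embedded_training):
    """Average the training embeddings within each intent."""
    return {intent: mean_vector(vecs) for intent, vecs in embedded_training.items()}

def predict_intent(embedding, centroids):
    """Choose the intent whose averaged embedding is most similar to the test sample."""
    return max(centroids, key=lambda intent: cosine(embedding, centroids[intent]))

training = {
    "play_music": [[1.0, 0.0], [0.8, 0.2]],
    "get_weather": [[0.0, 1.0], [0.1, 0.9]],
}
centroids = build_centroids(training)
predicted = predict_intent([0.9, 0.1], centroids)
```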

A label can be composed from several words; ”play_music” is composed of ”play” and ”music”, for example. Future research may investigate whether some words in the sentence correspond to ”play” and some other words correspond to ”music”.

3.3.12. Generalisability via ensembles

Overfitting of models to the training data distribution leads to poor generalisability. One method to address this is to use ensembles of models to synergistically exploit the benefits of each. Deep learning architectures, such as LSTM, GRU and CNN, have been used frequently in intent classification. (Firdaus et al., 2018b) proposed to combine those deep learning architectures. They used GloVe and word2vec for word embedding. To start exploring combinations of different models, they constructed CNN, LSTM and GRU models individually. Then, four ensemble models were built: CNN-LSTM, CNN-GRU, LSTM-GRU and CNN-LSTM-GRU. In these models, predictions from individual models are combined using an MLP model. (Qiu et al., 2018) also used an ensemble of classical methods (random forest, SVM, Naïve Bayes and softmax regression), with the ensemble outperforming the components.

Out-of-domain utterances

Not all utterances made to SLU devices contain an intent related to the purpose of the device. They may be incidental conversation or have intent that the device is not meant to fulfil.

(Kim and Kim, 2018) augmented a dataset with such utterances and performed a multi-task learning approach to perform classification of in domain utterances and detection of out-of-domain (OOD) utterances simultaneously. A loss function which maximises the intent accuracy while accepting a value of false acceptance rate (1-recall) for OOD utterances below a threshold is back-propagated. Including the second task boosts the intent detection performance.

(Yilmaz and Toraman, 2020) also constructed augmented data sets with OOD utterances. They constructed a vector of KL divergence values for subsequent pairs of intent probabilities determined from hidden states of a unidirectional LSTM fed with word embeddings. The KL vectors are fed to an SVM, Naïve Bayes and Logistic Regression for OOD classification. Logistic regression gives the best results.
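The KL vector can be sketched as follows, given one intent probability distribution per time step. The direction of the divergence (step t against step t+1), the smoothing constant and the toy probabilities are assumptions; the LSTM and the downstream classifiers are not shown.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions, smoothed to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def kl_vector(step_probs):
    """KL divergence between the intent distributions of consecutive time steps."""
    return [
        kl_divergence(step_probs[t], step_probs[t + 1])
        for t in range(len(step_probs) - 1)
    ]

# Hypothetical intent probabilities after each of three tokens.
step_probs = [[0.5, 0.5], [0.9, 0.1], [0.9, 0.1]]
vector = kl_vector(step_probs)
```

Intuitively, an in-domain utterance should settle on an intent as more tokens are read, so the pattern of divergences along the sentence carries a signal that a simple classifier can use.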

3.3.13. Multifaceted query intent prediction

Most annotated queries in training data sets express only one intent. In real life however, queries may contain more than one intent. For example, the query “find Beyonce’s movie and music” has two intents, ‘find_movie’ and ‘find_music’. An NLU system should be able to handle multi-intent queries.

One straightforward solution is to use the top few predictions from existing single label classifiers. Another solution is to have a binary classifier for every label. (González-Caro and Baeza-Yates, 2011) used this approach. In their model, each utterance has multiple facets, each a class with at least two categorical labels. For instance, in the Task facet the intent can be ’Informational’, ’Not Informational’, or ’Ambiguous’. They experimented with linear SVMs for intent classification of different combinations of the facets. Compared with the corresponding single facet classification, additional supervision from multiple labels led to improvement of the overall performance. This additional information was found to be beneficial to small categories (classes with few samples in the corpus), with the recall of those small categories improving.

Treating multi-labels as atomic labels has been explored in some studies. This approach may suffer from data sparsity, but has good classification accuracy. Based on this, (Xu and Sarikaya, 2013) proposed two approaches to exploit the information shared among different intent combinations. The first is adding class features: for a combined intent, n-gram features are also added for each of its separate intents. For example, the multi-intent label ’buy_game#play_game’ should have two added features for the embedded intents ’buy_game’ and ’play_game’. The other is adding hidden variables to identify segments belonging to each intent. Instead of using existing segmentation algorithms, they add a layer of hidden states corresponding to each word. This can indicate the embedded intent which a word most strongly aligns to. Following that, a perceptron layer performs the classification.
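One possible reading of the class features idea is sketched below: each n-gram is paired with the atomic combined label and, additionally, with every embedded intent, so that evidence is shared between combinations. The function name, the (n-gram, label) feature format and the ’#’ separator handling are assumptions for illustration, not the authors' implementation.

```python
def class_features(ngrams, label, separator="#"):
    """Pair each n-gram with the combined label and with each embedded intent."""
    features = [(gram, label) for gram in ngrams]           # atomic-label features
    parts = label.split(separator)
    if len(parts) > 1:
        # Added features: the same n-grams tied to each embedded single intent.
        features += [(gram, part) for gram in ngrams for part in parts]
    return features

combined = class_features(["buy"], "buy_game#play_game")
single = class_features(["buy"], "buy_game")
```

Under this reading, a weight learnt for ("buy", "buy_game") from single-intent queries also contributes when scoring the combined label, which eases the sparsity of rare combinations.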

Understanding queries with multiple intents can make conversations with dialogue systems more natural and smooth, but there is not much work in this area. Further research could explore more approaches on modelling relations among label combinations.

4. Slot filling

Slot filling is the second critical task in natural language understanding. It is the attachment of a label to each token in an utterance. The label describes the type of semantic information contained in the word represented by the token. A span is a contiguous set of words which together make up a semantic unit, for example “new york city” is a single span represented in BIO notation with the labels B-city I-city I-city. Slot filling is treated as a sequence labelling task. The task is learning not just slot label distributions for words but also what slot labels typically co-occur in utterances (label dependency), and in what order. Reference to context in both directions of a word should be included to maximise performance.
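The relationship between per-token BIO labels and spans can be sketched with a small decoder. The function name and the handling of malformed sequences (an I- label that does not continue an open span is dropped) are assumptions for illustration.

```python
def bio_to_spans(tokens, labels):
    """Group BIO-labelled tokens into (slot_type, text) spans."""
    spans = []
    current_tokens, current_type = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- label that does not continue the open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = "flights to new york city".split()
labels = ["O", "O", "B-city", "I-city", "I-city"]
spans = bio_to_spans(tokens, labels)
```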

4.1. Major areas of research

Slot filling is a key task in dialogue systems, to interpret natural language from users, from which the system can judge what information to retrieve or what task to complete for the user. Slot filling models are also integrated with online shopping websites, whose core is a task-oriented dialog system. Product search queries can be better understood and shopping assistance can be provided to customers. The field of question answering has provided research to extract the semantic features within queries.

4.2. Overview of technological approaches

Traditionally, generative models (for example, HMMs (Wang et al., 2005)), which capture the joint probability distribution of the utterance tokens and their slot labels, and discriminative models (like CRFs), which estimate slot label conditional probabilities given the observed token sequence, were used to address this problem. With the success of deep learning, researchers experimented with putting components, for example CRFs and belief networks, within a deep structure.

Since 2013, RNNs have been increasingly popular in this field. In RNNs each word can access information from the previous words. Later, bi-directional RNNs were applied to utilise both the past and future context. However, the distance between words is linear in RNNs, and the vanishing gradient problem may occur, meaning that long-term dependencies cannot be learnt by the model. As a result, LSTM cells began to be used more frequently because of their ability to forget unimportant information and more successfully model longer dependencies. However, even LSTMs can underperform with very long sentences. Other models are thus incorporated to capture label dependency, which refers to the situation that some slots appear in context with some other slots more frequently, and perform sequence-level optimisation. The combination of CRF layers attached to RNNs is often seen.

Another issue is that RNNs are typically not able to process multiple words simultaneously due to their sequential nature. Thus, attention mechanisms which can take more tokens into account simultaneously than RNNs are used. Additional features are also incorporated to improve the performance of RNNs, such as named entity features, segment features and external memory. Integrating extra knowledge from different sources is also considered an effective approach. Some recent slot filling models also attempt to handle unseen semantic labels and multiple domain tasks by adapting neural CRFs and label embedding.

4.2.1. Exploring new models in detail

Models that have proved reliable for sequence labelling in other fields have been adopted in addressing the slot filling problem. In particular, deep learning models have been applied in slot filling.

(Deng et al., 2012) made an early deep model by constructing a stacked model, in their case a deep convex network (DCN), and extended it to the kernel version (K-DCN) for domain and intent classification tasks. With the kernel approach, the number of layers can be increased. Later, they attempted to solve slot filling problems using K-DCN as feature extractors.

RNNs with different architectures have been explored in many studies, considering their promising performance in sequence modeling elsewhere. In 2013, (Mesnil et al., 2013) compared recurrent neural networks, including Elman-type and Jordan-type networks and bi-directional Jordan-type RNNs. Two years later, (Mesnil et al., 2015) implemented Elman-type and Jordan-type networks and also their variations. Both Elman and Jordan-type networks are constructed with a 3-word context window. Moreover, a bi-directional Jordan-type network was implemented which takes both past and future information into account.

(Yao et al., 2013) adopted Recurrent Neural Network Language Models (RNN-LMs) to predict slot labels rather than words. This model used an Elman architecture RNN which can remember past words. Originally, the output of the training model is exactly the input word sequence, but in the new model, the outputs are sequences of labels instead. Future words, named entities, syntactic features and word-class information were also integrated in the analysis.

In the first models using RNNs in NLU, only words preceding the current word were considered. However, words occurring after the current word can also provide useful information. Therefore, (Vu et al., 2016) use bi-directional Elman-type networks with a 3-word context window. In their network, the bi-directional RNN generates a forward output and a backward output, which are combined for making predictions. The model adopts the ranking loss function, so the model is not forced to learn a pattern for the artificial class O. In an extension, (Vu, 2016) proposed a bi-directional sequential CNN for slot labelling which considers previous contextual words with preserved order, the surrounding context, and both past and future information. To label a word, two matrices are separately generated for the previous and future context words of the word. These are combined to form a matrix for the current word. Then, the two matrices for the past and future information are passed to corresponding vanilla sequential CNNs. The outputs of the networks are concatenated with the matrix for the current word. After that, the two matrices can be combined using a weighted sum of the forward and the backward hidden layers or by concatenation. For training, the ranking loss function is again applied.

(Korpusik et al., 2019) perform a useful comparison of BiGRU, CNN and BERT based models with the then new BERT outperforming the other candidates.

The incorporation of newer technologies for slot labelling, particularly seemingly suitable attention based methods like the Transformer, has been subsumed by the joint task in more recent years as will be explored in Section 5.

Paper | Addressed issue | Approach
(Yu et al., 2011) | Long-range state dependency | Deep learning, CRF
(Deng et al., 2012) | To extend DCN | Kernel learning, deep learning, DCN, log-linear model
(Deoras and Sarikaya, 2013) | Data sparsity problem (CRF) | Deep belief network (DBN)
(Mesnil et al., 2013) | To explore RNN | RNN
(Yao et al., 2013) | To explore RNN-LM | RNN-LM
(Yao et al., 2014) | Label dependencies, label bias problem | RNN, CRF
(Yao et al., 2014) | Gradient diminishing and exploding problem, label dependencies, label bias problem | LSTM, regression model, deep learning
(Liu and Lane, 2015) | Label dependencies | RNN, sampling approach
(Mesnil et al., 2015) | To explore RNN | RNN
(Peng and Yao, 2015) | Vanishing and exploding gradient | RNN, external memory
(Kurata et al., 2016) | Label dependencies | LSTM, encoder-labeler
(Vu et al., 2016) | To explore past and future information | Bi-directional RNN, ranking loss function
(Vu, 2016) | To explore CNN | CNN
(Zhu and Yu, 2017) | To explore attention mechanism | Bi-directional LSTM, LSTM, encoder-decoder, focus mechanism
(Dai et al., 2018) | Unseen slots | CRF
(Louvan and Magnini, 2018) | To explore MTL | MTL, NER, bi-LSTM, CRF
(Shin et al., 2018) | To better label common words | Encoder-decoder attention, delexicalised sentence generation
(Wang et al., 2018) | Imbalanced data | DNN, reinforcement learning
(Zhao and Feng, 2018) | OOV | GRU, attention, pointer network
(Gong et al., 2019) | To explore MTL | MTL, segment tagging, NER
(Kim et al., 2019) | To extend original SLU to H2H conversations | Bi-LSTM, different knowledge sources
(Shen et al., 2019) | Continual learning | Bi-LSTM, context gate
(Veyseh et al., 2019) | Restricted use of contextual information | MTL, Bi-LSTM
(Korpusik et al., 2019) | Architecture comparison | BiGRU, CNN, BERT
(Louvan and Magnini, 2019) | Low resource data set | MTL
(Zhu et al., 2020) | Data sparsity problem | Prior knowledge driven label embedding, CRF
(Zhang et al., 2020) | Long range dependency, vanishing gradient | TDNN
Table 6. Slot filling papers reviewed with addressed issue and approach

4.3. Issues addressed by slot labelling papers

Like intent detection, issues found in slot filling can be task specific, more general machine learning issues, data set issues, or may involve feature engineering or methodological approaches for better results. Slot filling introduces issues typical to seq2seq tasks like label dependency and long distance dependency.

4.3.1. Label dependency

There are dependencies between slot labels, meaning that some slots appear more commonly with some other slots in the same utterance. For example, in a travel data set it is highly probable that B-FromCity and B-ToCity are in the same sentence. Capturing such label dependencies would help find the best slot combinations and generate better prediction results.

One approach is to integrate CRFs and regression models. (Yao et al., 2014) applied an LSTM for the slot labelling task and included a regression model to capture label dependencies. (Yao et al., 2014) proposed a recurrent conditional random field (R-CRF) which integrates recurrent neural networks and a CRF to explicitly model the dependencies between semantic labels and achieve sequence-level optimisation. The R-CRF can use the RNN activations as features of the CRF. Similar to CRFs, this model can use sequence-level optimisation as well.
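Sequence-level optimisation with label transitions can be illustrated by Viterbi decoding over per-token emission scores (standing in for RNN activations) plus label-to-label transition scores. This is a generic linear-chain sketch with hypothetical scores, not the R-CRF of (Yao et al., 2014).

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions: one dict per token mapping label -> score.
    transitions: dict mapping (previous_label, label) -> score (default 0.0).
    """
    best = [{label: emissions[0][label] for label in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda label: best[-1][label])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

LABELS = ["O", "B-city", "I-city"]
# Hypothetical scores for the tokens "to new york".
emissions = [
    {"O": 2.0, "B-city": 0.0, "I-city": 0.0},
    {"O": 0.0, "B-city": 1.0, "I-city": 1.2},
    {"O": 0.0, "B-city": 0.5, "I-city": 1.5},
]
transitions = {("O", "I-city"): -10.0, ("B-city", "I-city"): 1.0}
path = viterbi(emissions, transitions, LABELS)
greedy = [max(LABELS, key=lambda label: e[label]) for e in emissions]
```

Choosing the best label per token independently would start the span with I-city; the transition scores make the decoder prefer a sequence in which I-city follows B-city.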

Another solution utilises the encoder-decoder architecture, which had been applied for machine translation previously and is able to encode the global information of the input sentence. (Kurata et al., 2016) proposed the encoder-labeler LSTM. This architecture encodes the sentence to a fixed length vector. Then, the encoding vectors are used as the input of another LSTM, the labeler LSTM, which considers label dependencies. The labeler LSTM can predict the slot label conditioned on the encoded sentence information. With such a method, the model is able to label slots utilising the whole sentence information. (Zhu and Yu, 2017) proposed a BiLSTM-LSTM encoder-decoder model, where a sentence is encoded using a bi-directional LSTM and the encoded sentence is then decoded using a uni-directional LSTM. At the same time, they developed a focus mechanism for this model because of the alignment limitation of attention mechanisms.

Moreover, (Liu and Lane, 2015) used continuous vectors to represent possible output labels, which are fed to recurrent connections. These continuous vectors are also fed to hidden layers, so every hidden layer can utilise word input, previous hidden states and predicted output labels. Moreover, both true labels and the predicted output labels can be fed to some layers in a fashion decided by a sampling approach for robustness. Because the previous predicted labels are used as input to the next step, error propagation should be studied further.

4.3.2. Long range dependency

Long-range dependency (LRD), also called long memory or long-range persistence, is a phenomenon that may arise in the analysis of spatial or time series data. It relates to the rate of decay of statistical dependence of two points with increasing time interval or spatial distance between the points. While the short text queries described in the intent section should not suffer from this, utterances in SLU data sets can be much longer.

The approach from (Yu et al., 2011) is a deep structured CRF which is made up of several simple CRFs. Lower layers can generate frame-level marginal posterior probabilities. Then the higher layer takes these probabilities along with the observation sequence of the previous layer. In the end, the highest level will make the final predictions. In the training process, layers are trained separately for efficiency. For each layer, the parameters are determined once the layer is trained.

(Zhang et al., 2020) addressed two shortcomings of RNNs: the long range dependency issue and vanishing or exploding gradients. They used a deep stack of time delay neural networks (TDNNs) for feature creation. These create convolutional features from context windows of varying size and varying step sizes through the sentences. The features are then passed through a final RNN and classified.

The Transformer architecture, which utilises self attention across utterances, and memory networks which can store longer range information, are now studied in the joint task, as described in the next section. Focused research using these methods for only the slot task may uncover useful methods for use in the joint task.

4.3.3. The label bias problem

Label bias is a seq2seq issue related to maximum entropy Markov models (MEMM). In MEMM, the states with a single outgoing transition can ignore their observation, meaning that the model can tend to stay in a state which is unlikely to happen. Most approaches to this problem combine CRFs with RNNs.
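A toy numerical illustration of the effect follows. The two-state setup and the scores are invented for illustration and are not taken from any of the cited papers. Because an MEMM normalises scores per source state, a state with a single permitted successor assigns that successor probability 1 whatever the observation says.

```python
import numpy as np

def memm_next_state_probs(scores, allowed):
    """Locally normalised transition distribution for one source state.

    scores  : unnormalised score of each candidate next state given the
              current observation (hypothetical values)
    allowed : boolean mask of next states reachable from the source state
    """
    masked = np.where(allowed, scores, -np.inf)
    exp = np.exp(masked - masked[allowed].max())
    return exp / exp.sum()

# The source state may only move to state 1.
allowed = np.array([False, True])

# The observation either strongly supports (+5) or contradicts (-5) state 1.
p_support = memm_next_state_probs(np.array([0.0, 5.0]), allowed)
p_contradict = memm_next_state_probs(np.array([0.0, -5.0]), allowed)

print(p_support, p_contradict)  # identical distributions: the observation is ignored
```

A CRF avoids this by normalising once over whole label sequences, so the unnormalised score of a poorly supported transition still lowers the probability of every path that uses it.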

(Yao et al., 2014) applied LSTM cells, which contain a gate which can forget unimportant information, and also incorporated a regression model to model label dependencies. In order to avoid the label bias problem, the regression model took non-normalised scores before softmax. The R-CRF of (Yao et al., 2014) also addresses label bias. Similarly, (Mesnil et al., 2015) explored RNNs with multiple different architectures and proposed to apply Viterbi decoding and recurrent CRFs to eliminate the label bias problem.
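For reference, Viterbi decoding over non-normalised scores can be sketched as below. This is the generic textbook algorithm, not the specific implementation of any cited paper; the emission and transition scores in the example are made-up values, with labels indexed as 0 = O, 1 = B-City, 2 = I-City.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence.

    emissions   : (T, K) array, score of label k at position t
    transitions : (K, K) array, score of moving from label i to label j
    """
    T, K = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # candidate[i, j]: best path ending in label i at t-1, then label j at t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    # Follow the back-pointers from the best final label.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

emissions = np.array([[2.0, 0.0, 0.0],
                      [0.0, 1.0, 1.5],
                      [0.0, 0.0, 2.0]])
transitions = np.zeros((3, 3))
transitions[0, 2] = -10.0  # heavily penalise the invalid move O -> I-City

print(viterbi(emissions, transitions))  # [0, 1, 2]
```

Taking the best label at each position independently would give O, I-City, I-City here; the transition penalty lets the sequence-level decoder recover O, B-City, I-City instead.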

4.3.4. Learning common words

The surrounding words of one slot in different sentences are usually similar. For example, the word “to” is highly likely to lie between B-FromCity and B-ToCity. Therefore, (Kurata et al., 2016) thought that learning the common words around slots in different ways may be helpful for slot filling.

(Shin et al., 2018) introduced a model which can jointly generate delexicalised sentences and predict labels using the encoder-decoder framework with input alignment. In the delexicalised sentences, words are replaced by their corresponding slot labels. For example, to delexicalise the sentence “i want to fly from baltimore to dallas”, the words ‘baltimore’ and ‘dallas’ should be replaced by B-FromCity and B-ToCity. This approach is based on the fact that different words that correspond to the same slot usually play a similar semantic and syntactic role in the sentence, which allows the model to learn the common words surrounding the slots.
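The delexicalised target for the example sentence can be produced with a few lines of code. This sketch only shows the transformation itself, given gold labels; in the model of (Shin et al., 2018) the delexicalised sentence is generated by the decoder rather than looked up. The function name is our own.

```python
def delexicalise(tokens, labels):
    """Replace every slot-bearing token with its slot label; keep 'O' tokens."""
    return [tok if lab == "O" else lab for tok, lab in zip(tokens, labels)]

tokens = "i want to fly from baltimore to dallas".split()
labels = ["O", "O", "O", "O", "O", "B-FromCity", "O", "B-ToCity"]

print(" ".join(delexicalise(tokens, labels)))
# i want to fly from B-FromCity to B-ToCity
```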

4.3.5. Low resource data sets

A more general machine learning issue is that methods generally rely on the presence of annotated training data which is costly to produce. (Louvan and Magnini, 2019) tested models which also trained on a large freely available annotated data set for a similar task (NER for example) in a multi-task learning environment (see Section 4.3.11). They then used just 10% of the available training data from slot tagging data sets to train the slot labelling task. They varied this proportion and observed that once it reached 40% the auxiliary task stopped having much effect.

4.3.6. Diminishing and exploding gradients

In neural networks with multiple layers, or long RNN sequences, the gradient may become vanishingly small, preventing the weights from changing value during back-propagation. Alternatively, large gradients may self-propagate and lead to unstable networks.
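A scalar example makes both effects concrete. Assuming the simplest possible recurrence, h_t = w * h_{t-1} with no non-linearity (a deliberate simplification), the gradient of the last state with respect to the first is w raised to the number of time steps.

```python
def gradient_through_time(recurrent_weight, steps):
    """d h_T / d h_0 for the scalar linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= recurrent_weight  # chain rule: one factor of w per time step
    return grad

vanishing = gradient_through_time(0.5, 50)   # |w| < 1: shrinks towards zero
exploding = gradient_through_time(1.5, 50)   # |w| > 1: grows without bound

print(vanishing, exploding)
```

With 50 steps the first value is far below any useful learning signal and the second is in the hundreds of millions, which is why the remedies below focus on how information is carried between time steps.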

(Yao et al., 2014) combined an LSTM and a regression model for slot filling. LSTMs are designed in part to address these issues. To avoid the gradient diminishing and exploding problem, the memory cells within them are linearly activated and propagated between different time steps.

(Peng and Yao, 2015) introduced an external memory to overcome the limitations on memory capacity of simple RNNs and therefore the diminishing and exploding gradient problems can be addressed. The new model is called RNN-EM, where the hidden layer has an additional input that comes from the external memory. A weight vector is used to retrieve the content from the external memory, which is determined according to the similarity between the contents and the hidden layer.
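A similarity-based read of this kind can be sketched as follows. The use of cosine similarity followed by a softmax is an assumption made for illustration, as are the function and variable names; the exact formulation in RNN-EM may differ.

```python
import numpy as np

def read_external_memory(memory, key):
    """Content-based read: weight memory slots by similarity to a key.

    memory : (d, n) array holding n memory slots of dimension d
    key    : (d,) array derived from the hidden layer (assumed given here)
    """
    eps = 1e-8  # guards against division by zero
    sims = (memory.T @ key) / (np.linalg.norm(memory, axis=0) * np.linalg.norm(key) + eps)
    weights = np.exp(sims - sims.max())
    weights /= weights.sum()          # weight vector over memory slots
    return memory @ weights, weights  # retrieved content, read weights

rng = np.random.default_rng(0)
memory = rng.normal(size=(4, 3))
key = memory[:, 1].copy()             # a key identical to slot 1

content, weights = read_external_memory(memory, key)
print(weights)  # slot 1 receives the largest weight
```

The retrieved content would then be supplied to the hidden layer as the additional input described above.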

4.3.7. Data sparsity problem

Data sparsity can be an issue when there are a large number of discrete features. If those discrete features are represented using matrices, those matrices can be large and sparse, which may lead to a model ignoring the relations among features.

(Deoras and Sarikaya, 2013) applied deep belief networks (DBN) for semantic tagging integrated with lexical, named entity, dependency parser based syntactic features and part of speech (POS) tags. A DBN is a stack of Restricted Boltzmann Machines (RBMs), where the input of one layer is the output of the previous one and each layer applies a sigmoid activation function to its inputs. In this model, features are embedded into vectors at the first layer and passed to the next layer due to the large input layer. During the pre-training process, the parameters of neurons are learnt using the online version of conjugate gradient (CG) optimisation on several small batches. Compared to CRF, DBN is more general.

(Zhu et al., 2020) noticed that the data sparsity problem can also occur with labels because labels are usually encoded using one-hot vectors, so they also proposed a label embedding which is constructed using prior knowledge including atomic concepts, slot descriptions, and slot exemplars. An atomic concept assumes that each slot can be represented as a set of atoms. Slot descriptions are the textual descriptions of slots in natural language. Slot exemplars are used to extract label embeddings for slot labels from examples which contain the values of each slot and their neighbouring contexts.

4.3.8. Continuous learning

With new data becoming available quickly, it is desirable to have a retrained model incorporating the new data. However, the training process can be time-consuming and costly, and keeping all the existing data can introduce redundancy. Thus, there is a problem of how to continuously learn from new data.

(Shen et al., 2019) proposed a ProgModel, consisting of a context gate. This gate aims to transfer previously learned knowledge to a small expanded component, which is placed after the hidden state of the new model. The training procedure is progressively conducted at each batch. Therefore, the model can learn faster from the new training data without forgetting the previous expressions.

4.3.9. Imbalanced data

Training data sets can be imbalanced, dominated by some tags while only containing a small number of examples of other tags. This can lead to poorer performance in the minority tags. One solution from the slot labelling literature comes from (Wang et al., 2018) who design a deep reinforcement learning (DRL) based augmented tagger with a deep neural network, which includes a training part and an inference part. While the whole data set is used in the training part, only partial data with unsatisfactory performance will be evaluated by the augmented tagger.

Similarly to intent classification, augmentation via generating sentences which contain the minor tags could be researched.

4.3.10. Unseen labels

As with intent classification models, slot labelling models can rely heavily on the training data and will struggle to correctly assign a label which does not appear in the training data.

(Zhao and Feng, 2018) proposed a seq2seq model together with a pointer network to solve this problem. This model predicts slot values by jointly learning to copy a word which may be out-of-vocabulary (OOV) from an input utterance through a pointer network, or generate a word within the vocabulary through a seq2seq model with attention.

(Dai et al., 2018) proposed an elastic conditional random field (eCRF) which can utilise semantic meaning in slot embedding for open-ontology slot filling. The model has a slot description encoder which takes all slot descriptions as input, and outputs distributed representations for slots. Meanwhile, a BiLSTM is used to extract features from utterances. Then, the eCRF labeller, which is a potential function containing two terms for semantic similarity of the slot descriptions and the extracted contextual features and interactions between the slot labels, is applied.

4.3.11. Exploring multi-task learning

Multi-task learning (MTL) is the idea that similar auxiliary tasks can assist a main task. Some papers have used this idea setting slot labelling as the main task.

(Louvan and Magnini, 2018) attempted to perform slot filling and named entity recognition (NER) jointly in a multi-task framework considering that most slot values are also named entities and NER has high state-of-the-art performance. They mentioned that a better slot tagging result can be achieved if NER is at a lower level. Similarly, (Gong et al., 2019) investigated hierarchical multi-task learning to perform low-level tasks first, namely named entity tagging and segment tagging, and then the high-level task, that is slot labelling, can make use of the results from the low levels with cascade and residual connections.

(Louvan and Magnini, 2019) then performed a more wide-ranging experiment with two auxiliary tasks, NER and semantic tagging (SemTag). They compared two approaches: training on the auxiliary task(s) followed by fine-tuning on the main task, and the previously explored hierarchical approach. They found that the hierarchical approach generally gave better results and that parsimoniously using only one auxiliary task (NER) worked better.

Most models utilise contextual information, but they use it in a restricted manner, for example, self-attention. Therefore, (Veyseh et al., 2019) proposed a multi-task setting to train a model to incorporate the contextual information at two different levels, the representation level and the task-specific level. This multi-task setting includes slot filling as the main task and two auxiliary tasks. The first is to increase consistency between the word representation and its context, and the second is to enhance task-specific information in the contextual information.

4.3.12. Extending to human-to-human conversations

As the main thrust of this survey shows, task oriented language understanding in human-to-machine (H2M) conversations has been extensively studied. An interesting twist is to perform slot tagging in human-to-human (H2H) conversations. Here the agent is a third party listening in, not being directly asked to perform a task.

(Kim et al., 2019) focused on slot filling in H2H conversations and explored LSTMs with different knowledge sources. First, the character embedding and the word embedding are concatenated for each word. Then, the embeddings are passed to a bi-directional LSTM, which will be used for making final predictions. Furthermore, there is an additional model which can utilise knowledge from multiple sources, including sentence embeddings for H2H conversation, contextual information and H2M expert feedback. The sentence embeddings are generated by a sentence level embedding model trained using tweets with URL and web search queries to the same URL. The contextual information is extracted from previous utterances in the conversation. Also, this model uses pre-trained slot filling models for H2M conversations on similar domains as the expert model. The knowledge from three sources is encoded into vectors which are then combined and aggregated with the output of the bi-directional LSTM.

Slot tagging introduces the problem of generating a sequence of hidden labels to a sequence of word tokens, qualitatively different to the intent classification task. Together, in SLU, the two sub-tasks contribute to a better representation of the semantics of an utterance than each one separately. In the next section we consider the joint task where both are addressed in one model.

5. Joint intent and slot models

The joint task marries the objectives of the two sub-tasks. As most papers point out there is a relationship between the slot labels we should expect to see conditional on the intent, and vice versa. A statistical view of this is that a model needs to learn the joint distributions of intent and slot labels. The model should also pay regard to the distributions of slot labels within utterances, and one would expect to inherit approaches to label dependency from the slot-labelling sub-task. Approaches to the joint task range from implicit learning of the distribution, through explicit learning of the conditional distribution of slot labels over the intent label, and vice versa, to fully explicit learning of the full joint distribution.

The joint task should also expect to inherit most of the issues of each sub-task and we find this is true. The mass of research now appears in the joint task and as such newer methods tried there have not been tried previously in the sub-tasks.

5.1. Major areas of research

Research in the joint task has largely come from the personal assistant or chatbot fields. The chatbot is usually task-oriented within a single domain, while the personal assistant may be single or multi-domain.

Other areas to contribute papers are IoT instruction, robotic instruction (there is also a different concept of intent in robotics to describe what action the robot is attempting), and in-vehicle dialogue for driverless vehicles. These areas also need to filter out utterances not directed at the device.

Researchers have also drawn data from question answering systems, for example (Zhang and Wang, 2016) who annotated a Chinese question dataset from Baidu Knows.

5.2. Overview of technological approaches

In this section we give an overview of papers which focused on the joint task itself, measuring their efficacy largely on performance against state-of-the-art. Generally they are using new technologies as they became available, or new architectures to make the learning more explicit.

5.2.1. Classical methods

The earliest work on the joint task used a tri-level CRF with the three layers being token features, slot labels and intent labels. (Jeong and Lee, 2008) showed this architecture performed better than performing the two sub-tasks in a pipeline. Other early statistical models used a maximum entropy model (MEM) for intent and a CRF for slot labelling (Wang, 2010), and a multilayer HMM (Celikyilmaz and Hakkani-Tur, 2012).

5.2.2. Recursive neural networks

The earliest attempt at a neural model to address the joint task was in (Guo et al., 2014) which used recursive neural networks (RecNNs) (different to recurrent neural networks (RNN)). RecNNs work over trees, in this case the constituency parse tree of the utterance, with leaves corresponding to the words (represented by word vectors). A neural network is applied at each node of the tree, recursively upwards to the root, computing a state for each node. At each node the states from children nodes are combined with a weight vector representing the node’s syntactic type. Individual slot label classifiers are applied to each leaf using a combination of the word vectors of itself and its neighbours and the state vectors along the path from the leaf to the root. The state at the root is passed to an intent classifier. A combined loss over the slots and intent (and domain) is back-propagated. An optional post-processing step, a Viterbi-decoded Markov layer, is applied to the slots. Results were close to, but below, the then state-of-the-art for the tasks treated separately.

5.2.3. Recurrent neural networks

In 2016 the power of the RNN circuit for seq2seq tasks was explored in multiple papers. Features representing tokens are passed in temporal sequence to RNN units which have a hidden state. Intermediate hidden states may be used for slot labelling. The final hidden state is an embedding of the entire utterance and may be used for intent prediction. The classic encoder-decoder, which produces a sequential output, is the most commonly used architecture (Liu and Lane, 2016a). Issues with the original RNN cells are addressed by LSTM and GRU cells. Bidirectional RNNs, where the input sequence is passed in both forward and backward directions, address issues with unidirectional capturing of context.

Other architectures include (a joint loss is back-propagated unless mentioned):

  • a two layer LSTM with the top layer hidden states informing slot labelling and the first layer final state informing intent classification (Zhou et al., 2016);

  • the slot tagging task is softmax classifiers applied to the output of a simple BiLSTM using the concatenated hidden states. A special token is added to encapsulate the whole utterance for use in intent classification (Hakkani-Tür et al., 2016);

  • a Bi-LSTM encoder decoder but with separate losses for intent and slot prediction (Zheng et al., 2017);

  • rather than seq2seq (Kim et al., 2017) perform a global slot prediction (learning the joint distribution) from a matrix of the hidden states to a matrix of slot tag probabilities for each word, intent is predicted from a sum of hidden states;

  • (Wen et al., 2018) propose to use both a hierarchical (multi-layer) and a contextual (BiLSTM or LSTM) approach, investigating various combinations and using differing layers for intent and slot prediction;

  • an ensemble using both BiLSTM and BiGRU fed to separate MLPs whose outputs are fused then projected and a softmax applied to predict intent and slots concurrently is proposed by (Firdaus et al., 2018a).

For RNNs the input is typically token feature by token feature in temporal sequence. (Hakkani-Tür et al., 2016) compared that to using context windows with superior results. However (Zheng et al., 2017) showed inferior results with context windows.

One critical observation made of many purely recurrent models is that the sharing of the information between the two sub-tasks is implicit. That is, while the sub-tasks are addressed jointly, it is often only through back-propagation of a joint loss.
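The implicit sharing described here can be sketched with a plain RNN in NumPy. This is a minimal illustration under our own assumptions (a single-layer tanh RNN, random weights, made-up dimensions and gold labels), not the architecture of any particular paper. Intermediate states label the tokens, the final state predicts the intent, and the only coupling between the two sub-tasks is the summed loss flowing back into the shared recurrent weights.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def joint_forward(x, params):
    """Run a simple RNN; return per-token slot and utterance intent probabilities."""
    W_xh, W_hh, W_slot, W_intent = params
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
        states.append(h)
    states = np.stack(states)
    slot_probs = softmax(states @ W_slot)          # each intermediate state labels its token
    intent_probs = softmax(states[-1] @ W_intent)  # final state summarises the utterance
    return slot_probs, intent_probs

def joint_loss(slot_probs, intent_probs, slot_gold, intent_gold):
    """Sum of slot and intent negative log-likelihoods."""
    slot_nll = -np.log(slot_probs[np.arange(len(slot_gold)), slot_gold]).sum()
    intent_nll = -np.log(intent_probs[intent_gold])
    return slot_nll + intent_nll

rng = np.random.default_rng(0)
T, d_in, d_h, n_slots, n_intents = 6, 5, 8, 4, 3  # hypothetical sizes
params = (rng.normal(size=(d_in, d_h)) * 0.1,
          rng.normal(size=(d_h, d_h)) * 0.1,
          rng.normal(size=(d_h, n_slots)) * 0.1,
          rng.normal(size=(d_h, n_intents)) * 0.1)
x = rng.normal(size=(T, d_in))
slot_gold = np.array([0, 0, 1, 0, 2, 3])
intent_gold = 1

slot_probs, intent_probs = joint_forward(x, params)
loss = joint_loss(slot_probs, intent_probs, slot_gold, intent_gold)
print(loss)
```

Nothing in the forward pass lets the intent prediction see the slot predictions or vice versa, which is the limitation that the attention-based and bi-directional models in the following sections try to remove.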

5.2.4. Attention

Attention is an obvious technique for forcing an interaction between information from the two sub-tasks, in a learned way. Some attention constructions may still be seen as an implicit way of sharing information, but stronger methods start to force explicit learning.

In early papers a basic concept of attention used was the weighted sum of Bi-RNN hidden states as an input to slot and intent prediction (Liu and Lane, 2016a). Then (Goo et al., 2018) used a stronger, more explicit attention. The base circuit is a BiLSTM taking word vectors in sequence and using a different learned weighted sum of the intermediate states of the BiLSTM for each slot prediction (the slot attention) and the final state for intent detection. The new addition is a slot gate which takes the current slot attention vector and combines it with the current intent vector in an attention operation. The output of the slot gate feeds the slot prediction. This circuit is an early example of intent2slot, a path through the circuit where intent prediction information is also fed explicitly to the slot prediction element. Another variation on intent2slot is provided in (Li et al., 2019).
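A gate of this kind can be sketched as below. The functional form used here, a per-token scalar g_i = sum(v * tanh(c_i^S + W c^I)) that rescales the slot context before it is added to the hidden state, is our reading of such a gate and should be checked against the original paper. All weights, names and dimensions are hypothetical.

```python
import numpy as np

def slot_gate(slot_context, intent_context, W, v):
    """Per-token scalar gate combining slot and intent context vectors.

    slot_context   : (T, d) slot attention vector for each token
    intent_context : (d,)   intent vector for the whole utterance
    W, v           : (d, d) and (d,) trainable parameters (random here)
    """
    return np.sum(v * np.tanh(slot_context + intent_context @ W.T), axis=-1)

rng = np.random.default_rng(0)
T, d = 5, 4
hidden = rng.normal(size=(T, d))
slot_context = rng.normal(size=(T, d))
intent_context = rng.normal(size=d)
W = rng.normal(size=(d, d))
v = rng.normal(size=d)

g = slot_gate(slot_context, intent_context, W, v)
gated = hidden + slot_context * g[:, None]  # features passed on to slot prediction
print(g.shape, gated.shape)
```

The point of the construction is that the intent vector enters the slot path directly, rather than influencing it only through a shared loss.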

(Qin et al., 2019) also use an intent2slot architecture but with BERT encoding and stack propagation. Rather than a gate like (Goo et al., 2018), the intent detection itself directly feeds the slot filling. Also the intent detection is performed at token level and the final intent is taken by vote. (Wang et al., 2020a) too use an intent2slot gate with BERT embeddings.
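The voting step can be illustrated in a few lines. The intent names are invented for the example, and resolving ties in favour of the intent seen first is our assumption rather than a detail from the paper.

```python
from collections import Counter

def vote_intent(token_intents):
    """Return the intent predicted for the most tokens (ties: first seen)."""
    return Counter(token_intents).most_common(1)[0][0]

token_intents = ["BookFlight", "BookFlight", "GetWeather", "BookFlight"]
print(vote_intent(token_intents))  # BookFlight
```

A single mispredicted token, such as the third one here, is outvoted by the rest of the utterance.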

(Yu et al., 2018) in a sense provide the dual approach to that of (Goo et al., 2018), providing an attended slot prediction as the main input into intent prediction. The attention is additive on the weighted hidden states of a BiLSTM encoder and the weighted sum of the predicted slot labels. We call this explicit feed of slot information to the intent slot2intent.

(Zhang et al., 2019a) extend the intent2slot gate of (Goo et al., 2018) with a pair of slot gates, one carrying the global intent information to the slot task, and one taking it to each slot location individually. (Zhang et al., 2019b) also apply intent2slot but only to tokens determined to be not labelled ‘O’.

(Li et al., 2018) introduce self-attention to the BiLSTM architecture to force a stronger learning “at the semantic level” between the slots and the intent. A first self-attention layer performs attention on word and convolutional character embeddings. This is concatenated with the word embeddings and fed to a BiLSTM layer. The final state informs intent detection. Self attention is performed between the intermediate states of the BiLSTM. This self attention is combined with the intent prediction which is then combined with the intermediate states to perform slot tagging. (Chen et al., 2019a) start with word and character embeddings from a BiLSTM layer, then perform multi-head self-attention on these, followed by a BiLSTM encoder whose final state informs intent prediction. Another multi-head self-attention on the second BiLSTM hidden states, combined with the masked intent prediction, feeds a CRF for slot prediction.

In (Chen and Yu, 2019) a BiLSTM layer takes the word inputs. State attention is performed as follows. For slots each hidden state is combined with the softmax of a weighted sum of all the hidden states passed through a feed forward network. Intent detection takes the last hidden state in combination with a similar weighted sum of the intermediate states. A similar formulation is used for word attention by weighting sums of word vectors rather than hidden states. All these features are combined in a fusion layer to inform the two tasks.

(Xu et al., 2020) use a standard encoder-decoder LSTM which incorporates a length variable attention, that is attention of a sub-sequence of learned width over the hidden states.


The transformer architecture (Vaswani et al., 2017), a non-recurrent model useful for capturing global dependencies via multi-head self-attention (among other strengths) appears in (Thi Do and Gaspers, 2019) to construct contextual embeddings of the word tokens. In their model attention is applied between all these to inform the intent prediction sub-task where it gives a superior result.
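The core operation, scaled dot-product self-attention, is sketched below for a single head. Multi-head attention runs several such heads in parallel and concatenates their outputs. The projection matrices are random and the dimensions are arbitrary choices for illustration.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over token vectors X (T, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # (T, T) token-to-token scores
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # each row is a distribution over tokens
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d, d_k = 6, 8, 4
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))

context, weights = self_attention(X, W_q, W_k, W_v)
print(context.shape, weights.shape)
```

Every output row mixes information from all tokens in one step, regardless of their distance, which is what makes the architecture attractive for capturing global dependencies.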

(Zhang and Wang, 2019) also use the transformer architecture. They pass word embeddings to a 3 level transformer layer, then extract a global output to inform intent detection and token level output to pass to a CRF for slot detection. Differently to (Thi Do and Gaspers, 2019), a special token is added to represent the whole utterance. Both these models only use the bidirectional encoder of (Vaswani et al., 2017).

5.2.5. Hierarchical models

A hierarchical model passes information learned to be relevant through ordered levels. While this flow is explicit it is often unidirectional. For example, (Lee et al., 2018) supply a hierarchical approach with slot, intent and domain levels. Each element of the intent level is represented as the vector sum of the components in the slot layer coming from the same utterance.

(Zhang et al., 2019) provided relevant feedback from the highest level back to the lowest in their capsule network solution, a novel approach that sought to explicitly capture the words-slots-intent hierarchy. A capsule represents a group of neurons whose output can be used for predictions at the next level; word capsules can be used to make slot label predictions, and so on. The hierarchy is learned using a routing-by-agreement mechanism: the prediction is only endorsed when there is strong agreement from the incoming capsule. The authors also propose a mechanism whereby a strong intent message at the highest level can be fed back to the earlier levels to help them in their task. This explicit and direct feedback is stronger than the implicit or indirect joint learning typically found in RNN models. (Staliūnaitė and Iacobacci, 2020) extended this work to a multi-task setting with extra mid-level capsules for NER and POS labels, with mixed results.

Paper | Addressed issue | Approach
(Jeong and Lee, 2008) | Joint solution of related tasks | Tri-layer CRF, extra layer for classification
(Wang, 2010) | Small training sets | MEM and CRF, joint task versus pipeline
(Celikyilmaz and Hakkani-Tur, 2012) | Small training sets | Tri-level HMM, bolstered features
(Xu and Sarikaya, 2013) | Automated feature creation | CNN features into TriCRF
(Guo et al., 2014) | Incorporate discrete constituency parse of utterance | RecNN on word vecs and parse tree
(Shi et al., 2015) | Context from multi-turn dialogue | RNN (token) and CNN (sentence) features, MLP
(Zhou et al., 2016) | Hierarchical task relationship | RNN, LSTM
(Hakkani-Tür et al., 2016) | Seq2seq, joint model, architectures | BiLSTM
(Chen et al., 2016) | Incorporate language knowledge | K-SAN attention network, GRU
(Zhang and Wang, 2016) | Apply RNN to intent | GRU
(Liu and Lane, 2016a) | Employ encoder-decoder with attention | Encoder-decoder with attention
(Liu and Lane, 2016b) | Real time analysis | LSTM, MLP
(Zheng et al., 2017) | NLP in navigation dialogue | BiLSTM encoder decoder, seq2seq
(Ma et al., 2017) | No long term memory, linearity | LSTM, sparse attention
(Kim et al., 2017) | Error propagation, information sharing between tasks | Word and character RNN embedding
(Yang et al., 2017) | Noisy NLU outputs | Dialogue act unit after NLU
(Goo et al., 2018) | Learn relationship between slot and intent attention vectors | Slot gate, BiLSTM
(Pan et al., 2018) | Multiple utterance dialogue | Utterance to utterance attention
(Wen et al., 2018) | Using hierarchy and context | Two layer (Bi)LSTM
(Wang et al., 2018b) | Capturing local semantic information | CNN, BiLSTM encoder decoder
(Firdaus et al., 2018a) | Domain dependence | Ensemble model, GRU
(Shen et al., 2018) | Slow training time | Progressive multi-task model using user information
(Li et al., 2018) | Correlation of different tasks | Multi-task model incl. POS tag
(Li et al., 2018) | Sharing semantic information | Self-attention
(Zhang et al., 2018) | Tagging strategy | Token tags include intent and slot
(Zhang et al., 2019) | Hierarchical structure | Capsule network with rerouting (feedback)
(Zhao et al., 2018) | Spatial (context) and serial (order) information | Encoder-decoder, CNN
(Wang et al., 2018a) | slot2intent and intent2slot | Bi-directional architecture
(Siddhant et al., 2019) | Unsupervised learning | ELMo on unused utterances, BiLSTM
(Yu et al., 2018) | Use sequence labelling output for intent | Cross attention, BiLSTM, CRF
(Lee et al., 2018) | Hierarchical vector approach | Learn vectors representing elements of frame
(Jung et al., 2018) | Model relationship between text and its semantic frame | Vector representation of frame
(Ray et al., 2018) | Rare, OOV words | Paraphrasing input utterances
(Liu et al., 2019a) | Unidirectional information flow | Memory network
(Shen et al., 2019) | Poor generalisation in deployment | Sparse word embedding (prune useless words)
(Ray et al., 2019) | Slots which take many values perform poorly | Delexicalisation
(Wang et al., 2019) | Language knowledge base, history context | Attention over external knowledge base, multiturn history
(Li et al., 2019) | Implicit knowledge sharing between tasks | BiLSTM, multi-task (DA)
(Gupta et al., 2019) | Speed | Non-recurrent and label recurrent networks
(Gupta et al., 2019) | Multi-turn dialogue, using context | Token attention, previous history
(Chen et al., 2019a) | Capturing intent-slot correlation | Multi-head self attention, masked intent
(Chen et al., 2019b) | Poor generalisation | BERT
(Bhasin et al., 2019) | Learning joint distribution | CNN, BiLSTM, cross-fusion, masking
(Thi Do and Gaspers, 2019) | Lack of annotated data, flexibility | Language transfer, multitasking, modularisation
(Zhang et al., 2019a) | Key verb-slot correlation | Key verb in features, BiLSTM, attention
(Zhang and Wang, 2019) | Learning joint distribution | Transformer architecture
(Daha and Hewavitharana, 2019) | Efficient modelling of temporal dependency | Character embedding and RNN
(Dadas et al., 2019) | Lack of annotated data, small data sets | Augmented data set
(Chen and Yu, 2019) | Learning joint distribution | Word embedding attention
(E et al., 2019) | Learning joint distribution | Bidirectional architecture, feedback
(Zhang et al., 2019b) | Poor generalisation | BERT encoding, multi-head self attention
(Qin et al., 2019) | Weak influence of intent on slot | Use intent prediction instead of summarised intent info in slot tagging
(Gangadharaiah and Narayanaswamy, 2019) | Multi-intent samples | Multi-label classification methods
(Firdaus et al., 2019) | Multi-turn dialogue history, learning joint distribution | RNN, CRF
(Pentyala et al., 2019) | Optimal architecture | BiLSTM, different architectures
(Castellucci et al., 2019) | Non-recurrent model, transfer learning | BERT, language transfer
(Schuster et al., 2019) | Low resource languages | Transfer methods with SLU test case
(Okur et al., 2019) | Natural language | Locate intent keywords, non-other slots
(Xu et al., 2020) | Only good performance in one sub-task | Joint intent/slot tagging, length variable attention
(Bhasin et al., 2020) | Learning joint distribution | Multimodal Low-rank Bilinear Attention Network
(Firdaus et al., 2020) | Learning joint distribution | Stacked BiLSTM
(Zhang et al., 2020) | Limitations of sequential analysis | Graph representation of text
(Wang et al., 2020b) | Non-convex optimisation | Convex combination of ensemble of models
(Wang et al., 2020a) | BERT issues with logical dependency (I before B) | CRF and self attention over BERT
(Ni et al., 2020) | Model transfer, IoT | Pipeline structure from medical analogue
(Krone et al., 2020) | Unseen labels | Few-shot meta-learning
(Bhathiya and Thayasivam, 2020) | Unseen labels, language transfer | Few-shot meta-learning
(Tang et al., 2020) | Linear chain CRF limitations | GCN based CRF
(Staliūnaitė and Iacobacci, 2020) | Extend capsule network | Capsule network with MTL
(Han et al., 2020) | Explicit interaction, word-label information | Bidirection, attention, GCN graph
Table 7. Joint task papers reviewed with addressed issue, approach and techniques

5.2.6. Bi-directional models

A model where there is a pipeline from one sub-task to the other may be seen as unidirectional. A bi-directional model, different to the bi-directionality seen in RNNs, has an explicit path from slot processing into intent prediction and also from intent processing into slot prediction. This can form two parallel paths through the circuit, often with a fusion layer or a joint loss.

(Wang et al., 2018a) propose the first such bi-directional circuit. In this paper each path is a BiLSTM and the hidden states from each path are shared with the other, another form of explicit influence between the tasks. An optional LSTM decoder is supplied on each side. Interestingly the loss is not a joint loss but the circuit alternates between predicting intent for a batch, and back-propagating intent loss, then predicting slots for the same batch and back-propagating slot loss. They call this asynchronous training.

(Bhasin et al., 2019) also use bi-directional paths. Starting with GloVe word embeddings, an intent path converts them to convolutional features which are concatenated then projected. The slot path passes the word vectors through a BiLSTM with a CRF on top with the results also projected. Three types of fusion of the paths (after reshaping/broadcasting) were tested: addition, average or concatenation.
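The three fusion options can be sketched as follows. The shapes are assumptions made for illustration: a single utterance-level intent feature vector is broadcast across the token positions of the slot features before the two are combined. The function and argument names are ours.

```python
import numpy as np

def fuse(intent_feat, slot_feat, mode):
    """Fuse an utterance-level vector (d,) with token-level features (T, d)."""
    intent_b = np.broadcast_to(intent_feat, slot_feat.shape)  # repeat over tokens
    if mode == "add":
        return intent_b + slot_feat
    if mode == "average":
        return (intent_b + slot_feat) / 2.0
    if mode == "concat":
        return np.concatenate([intent_b, slot_feat], axis=-1)
    raise ValueError(f"unknown fusion mode: {mode}")

rng = np.random.default_rng(0)
T, d = 5, 4
intent_feat = rng.normal(size=d)
slot_feat = rng.normal(size=(T, d))

for mode in ("add", "average", "concat"):
    print(mode, fuse(intent_feat, slot_feat, mode).shape)
```

Addition and averaging keep the feature width fixed, while concatenation doubles it and leaves it to later layers to learn how the two paths should interact.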

(E et al., 2019) also consider bi-directionality. They start with a BiLSTM encoder. A weighted sum of intermediate states for each step (the slot contexts) feeds a slot sub-net, while the weighted final hidden state (the intent context) feeds an intent sub-net. These two interact in either a slot2intent fashion (slot affects intent) or intent2slot. The outputs then feed a softmax intent classifier and a CRF respectively. In slot2intent mode a learned combination of the slot contexts and intent context then feed the intent sub-net, where they are combined with the intent context for prediction. In intent2slot mode the intent context is combined with the slot contexts to form a slot informed intent context. This is then fed to the slot sub-net where it is combined with the slot contexts to feed the CRF for prediction. As may be expected intent2slot gave better slot results and slot2intent gave better intent results. That only one can be applied is a weakness of the architecture.

(Han et al., 2020) (submitted for publication) use strong bi-directionality with explicit intent2slot and slot2intent paths. The intent2slot path uses attention between an initial intent prediction on a BERT sentence embedding and the BERT word token embeddings. Dually, slot2intent uses attention between initial slot predictions on the BERT word embeddings and the BERT utterance embedding.

5.2.7. Memory networks

(Liu et al., 2019a) consider that, even with the inclusion of feedback, the circuit of (Zhang et al., 2019) is still overly unidirectional. To overcome this they apply memory networks to the joint task. As they see the typical interaction as a pipeline from words to slots to intent, they alternate interaction from slots to intent and vice versa via multiple blocks of memory nets. The network begins with GloVe word embeddings and max pooled convolutional character embeddings. These feed the first memory block, which constructs slot features, intent features and hidden states. Further memory blocks in the stack take the previous block’s hidden states as inputs. The memory blocks perform three operations, which also strive to capture local context and global sequential patterns:

  • Deliberate Attention: a slot memory (with number of cells equal to number of slot labels) and intent memory (ditto for number of intent labels) are randomly initialised then updated. At each word position each memory is updated as a weighted sum of the other memory and of the block hidden states for the current word. Diffusion of influence between slots and intents thus takes place and can inform the hidden states for the next word.

  • Local Calculation: this is a recurrent process receiving the input embeddings or previous block’s hidden states. It calculates slot representation and intent representations as interactions between its inputs and the slot and intent memories. It is an LSTM network.

  • Global Recurrence: a BiLSTM layer on top which encodes global sequential interactions.

After the stacked blocks a final prediction takes place. Slots are labelled via a CRF on the final hidden states and slot representations. Intent is predicted from an average of the final hidden states and intent representations.

5.2.8. Meta-studies of flow architectures

In an approach which considers both feature creation and architecture (Pentyala et al., 2019) give an interesting generalisation of multi-task learning architectures then apply it to the joint task. For example a three sub-task parallel architecture would take samples with training labels for each sub-task, develop universal features, task specific features, and grouped features, concatenate them and then feed them to task specific decoders. Series architectures are also given. Their base circuit uses word and character embeddings and is a standard BiLSTM encoder feeding an LSTM decoder for slots and a softmax classifier on the final hidden states for intent. No attention or slot gating occur. The base circuit is then adjusted to match some of the series and parallel architectures. (Firdaus et al., 2020) also look at varieties of series architectures from multi-task circuit design.

5.2.9. Graph networks

Graph networks can be used to address shortcomings of limited context windows suffered by RNNs and CRFs, as they can learn global relationships between words and labels.

(Zhang et al., 2020) use a graph S-LSTM network to overcome perceived shortcomings of RNNs, being lack of parallelisation (due to sequential nature), weak local context use, and lack of long range detection. The graph has as nodes the word representations and sentence representation from an LSTM, hence the network simultaneously works on the whole sentence. Only word nodes within a context window are connected by edges. The sentence node is connected to all word nodes. Messages are passed between the nodes to enable global coordination. The final node states for each slot go through a convolution unit and self attention before being used for slot filling. The final sentence node state is used directly for intent detection.

(Tang et al., 2020) see shortcomings of linear chain CRFs as being limited context and only applicable to the slot sequence. They construct a graph-based CRF graph convolutional network (GCN) which learns relationships between words, slot labels and intent labels. BERT embeddings are passed through a BiLSTM which feeds the GCN for prediction. A weighted joint loss is back-propagated.

5.2.10. Importing methods from analogous fields

(Bhasin et al., 2020) propose an interesting analogy; that the relationship between intent and slots is similar to that between the query and image in visual question answering. Thus they borrow an idea from the latter field - Multimodal Low Rank Bilinear (MLB) fusion, between the features of each part.

(Ni et al., 2020) also propose an analogy with the joint task of clinical domain detection and entity recognition in medical literature. Coming from the IoT field they also propose a pipeline structure where intent is detected first and then slots determined in a closed domain setting.

5.3. Feature creation and enhancement

As discussed in the sub-task sections, feature creation is a critical part of the design of circuits in NLU as it ideally should capture, at least, semantic information of the individual tokens, their context, and of the entire sentence. Then, any other information that may be used to enhance the result may be considered, including meta-data and syntactic information.

5.3.1. Token embedding

The earliest models used features familiar from methods like POS tagging, including one-hot word embeddings, n-grams, affixes etc. (Jeong and Lee, 2008). (Celikyilmaz and Hakkani-Tur, 2012) incorporated entity lists from sites such as IMDB (movie titles) or Trip Advisor (hotel names).

Neural models enable the embedding of diverse natural language without such feature engineering. The first neural features were convolutional embeddings of the utterance words in (Xu and Sarikaya, 2013), which were fed to a statistical model following (Jeong and Lee, 2008). (Shi et al., 2015) was the first to use RNN based token embeddings but also combined those into a CNN based sentence embedding.

(Ma et al., 2017) used for input at each step a convolution of the current word and the previously predicted slot labels. (Wang et al., 2018b) used multiple convolutional features of the embedding words but also maintained the order of the words within the convolutions. These were then fed to an RNN layer.

The gamut of word embedding methods has been used, including word2vec (Pan et al., 2018; Wang et al., 2018b); fastText (Firdaus et al., 2020); GloVe (Zhang and Wang, 2016; Liu et al., 2019a; Dadas et al., 2019; Okur et al., 2019; Bhasin et al., 2019; Thi Do and Gaspers, 2019; Pentyala et al., 2019; Bhasin et al., 2020); ELMo (Zhang et al., 2020) and (Krone et al., 2020) (pre-print only); and BERT (Zhang et al., 2019b; Qin et al., 2019; Ni et al., 2020), (Chen et al., 2019b; Castellucci et al., 2019; Krone et al., 2020) (pre-print only) and (Han et al., 2020) (submitted for publication). (Firdaus et al., 2018a) and (Firdaus et al., 2019) used concatenated GloVe and word2vec embeddings to capture more word information.

While BERT displays impressive performance, (Wang et al., 2020a) identify a limitation (logical dependency for slot filling) and counter it by feeding it to an intent2slot gate, an attention layer and a CRF.

(Gupta et al., 2019) tested ten different word contextualisation embeddings from four different method groups (feed forward, CNN, attention, LSTM) with different depths.

(Kim et al., 2017) were the first to use a combination of character and word embedding. Others also used this (Liu et al., 2019a; Chen et al., 2019a; Firdaus et al., 2019; Pentyala et al., 2019). On the other hand, (Daha and Hewavitharana, 2019) use only character embedding.

Pre-computed syntactic features, for example POS tags for each token using the nltk library (Firdaus et al., 2018a), have been included with word embeddings.

(Zhang et al., 2019a) take from the service robotics field the importance of a key verb in an instruction in informing the slot labels. The key verb is deduced from a dependency parse. A feature is constructed from the training data to encode a priori dependencies between words and key verbs. The circuit takes the key verb feature and concatenates it with each word’s one hot encoding. These are passed to a BiLSTM layer to produce token embeddings.

5.3.2. Sentence embedding

The final hidden state of an RNN was frequently used as the sentence embedding (Zhou et al., 2016; Liu and Lane, 2016a; Wang et al., 2018b). Sentences were also embedded by using a special token for the whole sentence in (Hakkani-Tür et al., 2016; Zhang and Wang, 2019), as a max pooling of the RNN hidden states (Zhang and Wang, 2016), as a learned weighted sum of Bi-RNN hidden states (Liu and Lane, 2016a), as an average pooling of RNN hidden states (Ma et al., 2017), as a convolutional combination of the input word vectors (Zhao et al., 2018; Bhasin et al., 2019), and as self-attention over BERT word embeddings (Zhang et al., 2019b).
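Several of these pooling options differ only in how the per-token hidden states are reduced to one vector. The sketch below works on a toy list of hidden states. The function names are hypothetical, and the weights passed to `weighted_sum` stand in for values that a real model would learn, for example through attention.

```python
from typing import List

Matrix = List[List[float]]  # one hidden-state vector per token


def last_state(hidden: Matrix) -> List[float]:
    """Use the final hidden state as the sentence embedding."""
    return list(hidden[-1])


def max_pool(hidden: Matrix) -> List[float]:
    """Element-wise maximum over all token positions."""
    return [max(column) for column in zip(*hidden)]


def mean_pool(hidden: Matrix) -> List[float]:
    """Element-wise average over all token positions."""
    return [sum(column) / len(hidden) for column in zip(*hidden)]


def weighted_sum(hidden: Matrix, weights: List[float]) -> List[float]:
    """Weighted sum over token positions; weights are fixed here, learned in practice."""
    return [sum(w * x for w, x in zip(weights, column)) for column in zip(*hidden)]


states = [[1.0, 4.0], [3.0, 2.0]]  # two tokens, hidden size two (toy values)
print(last_state(states), max_pool(states), mean_pool(states))
print(weighted_sum(states, [0.25, 0.75]))
```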

(Ma et al., 2017) also apply a sparse attention mechanism which evaluates word importance over a batch and applies weights within each sample utterance for the intent detection.

(Daha and Hewavitharana, 2019) used an extra <TAGG> token after the end-of-sentence <EOS> token for sentence encapsulation and saw better intent prediction performance. (Okur et al., 2019) encode both a <BOU> and <EOU> token at the beginning and end of the utterance in their BiLSTMs.

5.4. Target variations

The targets are typically the annotated intent and slot labels. (Zhang et al., 2018) construct a single tag for each token which incorporates the slot tag and the sentence intent. Their circuit then just performs a single seq2seq task and the sentence intent is deduced by a majority vote of the intent portion of the predicted tags. (Xu et al., 2020) use the same single tag set. (Qin et al., 2019) perform the intent detection at token level though separate to the slot prediction, and the final intent is taken by vote.
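The single-tag target and the majority vote can be sketched as follows. The separator character and the function names are assumptions made for illustration; the papers may encode the combined tag differently.

```python
from collections import Counter
from typing import List, Tuple

SEP = "@"  # hypothetical separator between the slot tag and the intent


def to_joint_tags(slot_tags: List[str], intent: str) -> List[str]:
    """Attach the sentence intent to every token's slot tag."""
    return [f"{tag}{SEP}{intent}" for tag in slot_tags]


def from_joint_tags(joint_tags: List[str]) -> Tuple[List[str], str]:
    """Split predicted joint tags and pick the sentence intent by majority vote."""
    slots, votes = [], []
    for joint in joint_tags:
        slot, _, intent = joint.rpartition(SEP)
        slots.append(slot)
        votes.append(intent)
    # On a tie, Counter.most_common returns the intent seen first.
    sentence_intent = Counter(votes).most_common(1)[0][0]
    return slots, sentence_intent


gold = to_joint_tags(["O", "B-artist", "O"], "PlayMusic")
noisy_prediction = ["O@PlayMusic", "B-artist@PlayMusic", "O@GetWeather"]
print(gold)
print(from_joint_tags(noisy_prediction))
```

The vote makes the sentence-level intent robust to a minority of token-level errors, at the cost of a tag set whose size is roughly the product of the slot and intent label counts.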

(Lee et al., 2018) works with learned embeddings of slot labels, intents and domains where the sum of slot label embeddings for an utterance is close to the intent embedding in vector space. A network can then be trained to map tokens to vectors close to the slot labels and intent for the utterance.

(Jung et al., 2018) proposes a vector embedding of the entire semantic frame (intent, slot labels, slot values) as the target. In training the utterance and the semantic frame are input and vectorised. A semantic frame vector is output. The distance between the output vector and input frame vector is minimised. In testing the text is input and a vector is output and the nearest semantic frame vector is chosen.
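The test-time step of such a frame-embedding approach reduces to a nearest-neighbour lookup. The sketch below assumes Euclidean distance and uses toy two-dimensional vectors; the frame names, the vectors and the choice of distance are all illustrative assumptions, not details from the paper.

```python
import math
from typing import Dict, List


def euclidean(p: List[float], q: List[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_frame(output_vec: List[float], frame_vectors: Dict[str, List[float]]) -> str:
    """Return the name of the semantic frame whose vector is closest to the output."""
    return min(frame_vectors, key=lambda name: euclidean(output_vec, frame_vectors[name]))


# Toy frame vectors; a real system would produce these with a trained frame encoder.
frames = {
    "PlayMusic(artist)": [1.0, 0.0],
    "GetWeather(city)": [0.0, 1.0],
}
print(nearest_frame([0.9, 0.2], frames))
```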

(Okur et al., 2019) proposed an extra token tag for intent keywords, for example the word “play” in an utterance with intent PlayMusic. In one of their models only intent keywords and non-Other slot tokens contribute to intent detection.

5.5. Issues addressed and solutions proposed

5.5.1. Narrowness of approach

The use of features constructed only from the tokens in the sentences may be too narrow an approach. External knowledge about the words’ places in the language, or the syntactic structure of the sentence, or of co-occurrence statistics amongst word and labels may aid the task. Methods to incorporate extra elements have been developed.

Knowledge bases

Knowledge bases are constructs containing information or statistical priors that may be useful to the task at hand. They may be constructed independent of the task, or as a preliminary step using information from the training data. They have been used for feature construction, as features themselves, and to be consulted via attention.

(Chen et al., 2016) was the first to use an extra knowledge base to inform the joint task. They use a K-SAN input, being a structured knowledge network. Two K-SANs are constructed, one taking a dependency parse of the utterance (syntactic), and the other an Abstract Meaning Representation (AMR) graph (semantic). Each representation is tested separately. A CNN encodes the representation into a vector, while a separate CNN encodes the sentence itself into another vector. Attention is applied between the two vectors and the results combined to give a “knowledge guided representation” of the utterance. This is included as an input to a GRU RNN cell along with the word encodings in sequence. A second RNN just takes the utterance words as input. A weighted sum of the hidden states of the two RNNs is used for prediction.

(Wang et al., 2019) incorporate the ConceptNet (http://conceptnet.io) framework as a knowledge base source. (Head, Relation, Tail) triples are extracted for each word in the utterance. The TransE model (Bordes et al., 2013) for embedding multi-relational data is used to encode the knowledge. Attention is applied between words and the knowledge base encoding.

(Han et al., 2020) (submitted for publication) use Graph Convolutional Networks (GCNs) pre-trained on the training data to enable embeddings of the utterance and words which contain knowledge of related intent and slot labels. These embeddings are one input to their circuit.

(Qin et al., 2020) capture the interaction between multiple intents, and slots, with a graph representation. For multi-intent a score is calculated for each intent and those above a threshold are returned. The graphs use graph attention networks. Tokens are encoded by a BiLSTM and then multiple intents are predicted. The slot path takes the token embeddings through an LSTM which provides a feature for each token which interacts with the intent predictions and the slot-intent graph to make slot predictions.

The inclusion of knowledge embedded in graph representations, or networks that perform tasks on such graphs has borne fruit in the very recent literature. Further research in this area could include other types of such graphical representations and incorporate information not just from the current training set or external knowledge bases but some combination of the two, or data from several training sets.

5.5.2. Multi-turn dialogue

Typically in NLU only the current single utterance is analysed. Temporal information or previous utterance context or previous dialogue action are not considered. However as noted in the intent and slot sections using such information in the model can lead to better performance.

There are multiple data sets available which contain a multi-turn dialogue around a single intent or set of related intents. In these cases the history from previous turns can be incorporated. (Shi et al., 2015) fed a sentence embedding along with the predicted intent and domain labels of previous turns into the intent prediction for the current turn. (Pan et al., 2018) calculate attention between the BiGRU embeddings of successive utterances which make up a single sample and contribute to a single intent. (Wang et al., 2019) similarly use attention between the BiLSTM encoding of each utterance to the previous utterances in the history.

(Gupta et al., 2019) look at multiple contextual inputs in multi-turn dialogues for the current utterance. For the current utterance they apply token2token attention and sentence2token attention at the input. Information from previous turns, including intents, slots and dialogue actions can then be attached.

While it is sensible for the research to focus on single utterance analysis it should be noted that SLU devices are often listening to all dialogue, filtering out-of-domain utterances using methods discussed in Section 3.3.4, and that incorporating lead in dialogue can be useful to the joint task.

Multi-task learning

Looking for synergies with related tasks has been an approach in the two sub-tasks and has been actively applied in the joint task. As described earlier the full semantic frame contains three levels - domain, intent and slots. Simultaneously solving the domain with the other layers has been explored (Shi et al., 2015; Hakkani-Tür et al., 2016).

(Shen et al., 2018) introduced an extra task to predict tags for known user information from metadata (for example location, timestamp). The metadata task is preliminary and thus informs the BiLSTM word embedding. The results of the preliminary task feed the regular joint task training and the BiLSTM word embeddings are updated.

(Li et al., 2018) works on the theory that adding a further sequential task (POS tagging) will aid the joint tasks. A single LSTM layer takes word embeddings and performs an intent and slot prediction at each step, feeding those predictions with the LSTM hidden state to a next-word POS tagger. A joint loss across all tasks is calculated. The results show that the extra task helps improve intent detection.

(Yang et al., 2017) claim that noisy SLU output can be mitigated by making it part of an end-to-end network including dialogue action prediction in the dialogue manager, with errors back-propagating from the dialogue manager refining the NLU prediction. The hidden states of a BiLSTM SLU model also feed a second BiLSTM which performs the dialogue action prediction. A joint loss across all tasks is back-propagated. In related work, (Li et al., 2019) also tied together an SLU network and a network to predict the next dialogue action. They use a stronger NLU segment to improve overall results. A joint loss across intent, slots and actions was back-propagated and performance exceeded the SLU model alone. (Gupta et al., 2019) use dialogue action in a multi-turn data set. (Firdaus et al., 2020) incorporate dialogue action, typically as the first task in a multi-task pipeline, rather than the last.

(Staliūnaitė and Iacobacci, 2020) incorporated POS and NER tagging simultaneously with slot tagging and intent detection using a capsule network, however the results were generally poorer when both NER and POS were included rather than just one, and mixed for different data sets indicating a generalisability issue.

The method of using SLU as fine-tuning with pre-training on another task, or vice versa, has shown improvements in the SLU performance. However the results of (Staliūnaitė and Iacobacci, 2020), echoing those of (Louvan and Magnini, 2019) on slot tagging, indicate a parsimonious approach to adding extra tasks simultaneously more often yields a better result.

5.5.3. Generalisability

Domain dependence

An issue found is that a model trained successfully on one domain or data set does not perform as well on a different domain or data set, implying it has simply learned statistical properties of the training data set. One issue suggested by (Firdaus et al., 2018a) is that the language in the data sets is not particularly “natural”. Though their ensemble model with syntactic POS features performed well on ATIS it is unclear whether it generalised to a second data set.

(Firdaus et al., 2018a) propose to design a domain invariant model by using an ensemble of word embeddings in an ensemble circuit with a BiGRU unit and a BiLSTM unit. While together they outperform each unit used alone, the circuit did not transfer well to a new dataset. This approach of using multiple methods in one circuit for generalisability appears to rely more on chance than on good design.

(Shen et al., 2019) looked at the drop off of performance of state-of-the-art architectures when deployed. Some issues that cause drop off in performance are personalised language of users not matching the training data, and the cost of annotated training sets (and hence their limited size and spread). Focusing on the vocabulary they propose a sparse vocabulary embedding which they apply to two existing architectures and show improved results. The embedding uses lasso regularisation to penalise words useless to the tasks. They apply the method to the networks of (Liu and Lane, 2016a) and (Goo et al., 2018) and find that with the sparse vocabulary intent accuracy increases but slot f1 decreases. They qualitatively discuss these results with observations on what words/structures help the two sub-tasks and the joint task.

(Zhang et al., 2019b) use BERT encoding, claiming that a pre-trained model should address the poor generalisability of models that perform their own embedding. They use a two step decoder where the first step decodes intent which feeds the intent classifier and also the second decoder which works on slot labelling. The intent decoder performs multi-head self attention on the BERT encodings. In the slot decoder a BERT embedding for a word is concatenated with the attended intent in training only if it is a “real slot”, i.e. non-’O’; otherwise it is concatenated with a random vector. Each concatenation feeds a softmax classifier for the token. A joint loss is back-propagated. The results are good for both ATIS and SNIPS.

Non-English data and transfer learning

NLU is eventually required in many languages, most of which do not have the large annotated training datasets required. An aspect of generalisability of models is thus whether they can be used outside the language on which they are trained.

Papers have used the same architecture for both English and non-English data sets to give comparative studies across languages. (Jeong and Lee, 2008) used ATIS and a Korean banking dataset. (Zhang and Wang, 2016) used ATIS and Chinese questions collected from Baidu Knows. (Pan et al., 2018) work only with a Chinese data set where word boundaries are not clearly identified.

Other papers considered the transfer of the model from English to other languages to address lack of annotated data in those languages. (Thi Do and Gaspers, 2019) consider a simple weight transfer from an English model for use in German. (Castellucci et al., 2019) (pre-print only) consider transfer learning from English to Italian.

(Schuster et al., 2019) study transfer to low resource languages, in this case from English to Spanish and Thai. The circuit is a basic BiLSTM with CRF. They evaluate three different cross-lingual transfer methods: (1) translating the training data, (2) using cross-lingual pre-trained embeddings (CoVE), and (3) using a multilingual machine translation encoder as contextual word representations. They find that using cross-lingual transfer well outperforms training on limited data from the low resource language. The work is extended by (Liu et al., 2019b), (Bhathiya and Thayasivam, 2020) and (Qin et al., 2020) but moves into cross-lingual transfer theory and out of the scope of this survey.

This issue of generalisability is still very much open, and solutions are in demand from end users. Methods discussed in Section 5.5.4 for using few-shot methods to boost performance of existing models in new domains or data sets warrant further investigation.

5.5.4. Limited training data

Annotated training data is costly in time and resources to produce. With new domains and applications for SLU appearing, with existing domains changing, and with colloquial language shifting, there is a need for methods to perform well with limited training data.

Small data sets

In an early statistical model (Wang, 2010) test two-pass (pipeline, intent then slot) versus one-pass (simultaneous solving) for a small training set. They show that intent classification is much better in the two pass model while token level slot f1 suffers slightly. (Tam et al., 2015) proposed using an RNN network to learn the word/label dependency distributions from available training data. For intent, the intent label is attached to each word in an utterance. Synthetic samples are then generated for use in training. They showed that this can lead to better results for slot tagging using a CRF on three data sets but that the results for intent were inconsistent.

(Dadas et al., 2019) propose a data set augmentation scheme which generates new training samples from existing ones via three methods: labelled word replacement from an external synonym lexicon; random replacement of outside words with a synonym; and “sequence order mutation” - change of order of spans for utterances with one labelled span. They showed that augmentation can improve the slot f1 result, more so for smaller data sets, but has little effect on intent accuracy. There is a further literature on data set augmentation for SLU which we will not cover here.
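The first of these augmentation methods, replacement of labelled words from a synonym lexicon, can be sketched as below. The lexicon, the single-token treatment of spans and the function name are simplifying assumptions; the original scheme may handle multi-token spans and sampling differently.

```python
import random
from typing import Dict, List, Tuple


def replace_labelled_words(
    tokens: List[str],
    tags: List[str],
    lexicon: Dict[str, List[str]],
    rng: random.Random,
) -> Tuple[List[str], List[str]]:
    """Swap slot-labelled tokens for a synonym, keeping the slot tags unchanged."""
    new_tokens = []
    for token, tag in zip(tokens, tags):
        candidates = lexicon.get(token, [])
        if tag != "O" and candidates:
            new_tokens.append(rng.choice(candidates))
        else:
            # Outside ("O") tokens are left alone in this particular method.
            new_tokens.append(token)
    return new_tokens, list(tags)


toy_lexicon = {"cheap": ["inexpensive"], "show": ["display"]}  # hypothetical lexicon
sample_tokens = ["show", "cheap", "flights"]
sample_tags = ["O", "B-cost_relative", "O"]
print(replace_labelled_words(sample_tokens, sample_tags, toy_lexicon, random.Random(0)))
```

Because the tag sequence is copied unchanged, each generated sample remains a valid annotated utterance for slot training.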

Lack of annotated data

As new domains appear it takes time and cost to develop annotated data sets for training. (Shen et al., 2018) address this by training on user metadata as a preliminary step. They show they can achieve higher slot f1 scores on smaller training sets and with fewer epochs than when using only the intent and slot annotations.

(Siddhant et al., 2019) construct an unlabelled utterance data set collected from ASR interactions with their agent. They train an ELMo style word embedding on this data set. For the joint task they find their embedding outperforms fastText. As well as language transfer, (Bhathiya and Thayasivam, 2020) also address the transfer to new label domains with minimal samples available via a few-shot meta-learning approach.

Unseen labels

(Krone et al., 2020) (pre-print only) address the issue of unseen test classes by applying two few-shot algorithms: model agnostic meta-learning (MAML) and prototypical networks, in combination with three word embeddings - GloVe, BERT and ELMo. They find the prototypical network algorithm performs best, that joint training significantly improves slot filling span based F1, and that ELMo and BERT share the spoils from the word embeddings.
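The core of the prototypical network idea is that each class is represented by the mean of its few support embeddings, and a query is assigned to the nearest prototype. The sketch below uses toy vectors in place of learned embeddings and Euclidean distance; it omits the encoder and the episodic training, and its function names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Vector = List[float]


def build_prototypes(support: List[Tuple[Vector, str]]) -> Dict[str, Vector]:
    """Average the support embeddings of each class into one prototype."""
    grouped = defaultdict(list)
    for vector, label in support:
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(query: Vector, prototypes: Dict[str, Vector]) -> str:
    """Assign the query to the class with the closest prototype."""
    def distance(label: str) -> float:
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(query, prototypes[label])))
    return min(prototypes, key=distance)


# Toy support set: two examples of class "A", one of class "B".
support_set = [([0.0, 0.0], "A"), ([2.0, 0.0], "A"), ([10.0, 10.0], "B")]
protos = build_prototypes(support_set)
print(protos, classify([1.5, 0.5], protos))
```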

5.5.5. The OOV issue

Out-of-vocabulary words in the test set, that is words that do not appear in the training set, may lead to lower test performance. Similarly the use of rare words in the training set may introduce unwanted bias. This issue is related to generalisability and also to changing vocabulary from user to user, or over time.

(Zhang and Wang, 2016) set all words that only appear once in the training set to an unknown UNK token. Then new words in the test set are also set to the UNK token. They also replace all numbers with a generic DIGIT token. This is also applied by (Li et al., 2018) and (Zhang et al., 2020). (Ray et al., 2018) perform a paraphrasing of input utterances to cater for rare or OOV words, or for unusually phrased requests. The paraphrasing is performed by an encoder-decoder RNN and is performing a kind of translation. The paraphrase can be applied to any downstream model. (Chen et al., 2019b) propose BERT embeddings as a sop to rare or OOV words. BERT uses word-piece encoding to provide a meaningful embedding for all words.
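The UNK and DIGIT pre-processing can be sketched as follows. The sketch assumes that a "number" is a whole token made of digits and that the DIGIT token is always kept in the vocabulary; both are assumptions, as are the function names.

```python
from collections import Counter
from typing import List, Set

UNK, DIGIT = "UNK", "DIGIT"


def normalise_digits(tokens: List[str]) -> List[str]:
    """Replace every all-digit token with the generic DIGIT token."""
    return [DIGIT if token.isdigit() else token for token in tokens]


def build_vocab(train_utterances: List[List[str]]) -> Set[str]:
    """Keep only words seen more than once in training; singletons become UNK."""
    counts = Counter(
        token for utterance in train_utterances for token in normalise_digits(utterance)
    )
    vocab = {token for token, n in counts.items() if n > 1}
    vocab.add(DIGIT)  # assumption: DIGIT is never mapped to UNK
    return vocab


def encode(tokens: List[str], vocab: Set[str]) -> List[str]:
    """Map rare training words and unseen test words to UNK."""
    return [token if token in vocab else UNK for token in normalise_digits(tokens)]


train = [
    ["flights", "to", "boston", "at", "9"],
    ["flights", "to", "denver", "at", "10"],
]
vocabulary = build_vocab(train)
print(encode(["flights", "to", "dallas", "at", "7"], vocabulary))
```

Mapping training singletons to UNK gives the model examples of the unknown token during training, so its embedding is not left untrained when unseen words arrive at test time.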

(Ray et al., 2019) address the issue of networks having trouble with slots with large semantic variability - that is, there are many values the slot can take during training and many unseen values during testing/deployment. They call these out-of-distribution (OOD) slots. They propose a new delexicalisation method. This replaces values in OOD slot locations with default values in pre-processing.

5.5.6. Obfuscation and speed

Taking a contrary view, (Gupta et al., 2019) consider how joint modelling may obfuscate, or hide, information and may also be unnecessarily slow. They propose a modularised network with separated tasks after a common word contextualisation pre-processing. The modularisation enables easier analysis of results. They perform speed analysis within their model suite.

(Wang et al., 2020b) propose a convex combined multiple model approach to counter limitations of non-convex optimisation, one of which is slow speed of convergence due to being stuck near non-optimal solutions. Each network in the circuit has the same structure but different initialised weights. A convex combination of label predictions from each network is used as the label prediction for each slot and the intent. Both a local loss function for each network and a global loss function on the combination are back-propagated. The networks are BiLSTMs with a context layer. The convex combination outperforms single classifiers. The speed improvements are significant.
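A convex combination of predictions is a weighted average whose weights are non-negative and sum to one. The sketch below combines the label distributions of two hypothetical networks; the weights are fixed by hand here, whereas how they are set or learned in the paper is not described in this survey.

```python
from typing import List


def convex_combination(distributions: List[List[float]], weights: List[float]) -> List[float]:
    """Combine per-network label distributions with non-negative weights summing to one."""
    if len(distributions) != len(weights):
        raise ValueError("need one weight per network")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to one")
    return [
        sum(w * p for w, p in zip(weights, column))
        for column in zip(*distributions)
    ]


# Toy label distributions from two networks over the same two labels.
network_outputs = [[0.75, 0.25], [0.25, 0.75]]
combined = convex_combination(network_outputs, [0.25, 0.75])
predicted_label = max(range(len(combined)), key=combined.__getitem__)
print(combined, predicted_label)
```

Because the weights form a convex combination, the result is itself a valid probability distribution whenever each input is one.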

5.5.7. Real time learning

In (Liu and Lane, 2016b), the authors consider real time analysis where the whole utterance isn’t analysed but a prediction is made at each time step. In this RNN the intent is predicted at each step and used as context to the slot prediction (as well as a next word language model). Thus the current slot prediction is conditional on the input words to that point, the previous slot predictions and the previous intent predictions. The recurrent unit is an LSTM but the current intent and slot predictions use MLPs on the current hidden state.

5.5.8. Label dependency

This is an issue covered in the slot filling section and the methods used there including CRFs and encoder-decoder seq2seq models have been used in the joint task. We note further that the use of CRFs after a deep learning solution became popular again from 2018 (see Table 4) to counter this issue (Yu et al., 2018; Zhang and Wang, 2019; E et al., 2019; Firdaus et al., 2019). (Wang et al., 2020a) use a CRF to counter a label dependency limitation for slot-filling in using BERT due to its non-recurrent nature.

(Chen et al., 2019a) claim that earlier models neither perform slot filling realistically enough (so as to reflect the language priors) nor explore intent-slot correlation well. They propose to use a CRF for the former and a masked intent prediction as an input to the CRF for the latter. The mask is “a conditional probability distribution of slot given intent, obtained from training data”.

(Bhasin et al., 2019) also use a CRF with masking for slot prediction, where the mask consists of prior conditional probabilities of slot/intent co-occurrence obtained from training data.
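A prior of this kind can be estimated by counting. The sketch below computes a token-level estimate of P(slot | intent) from annotated samples; counting at token level and the function name are assumptions, and how the papers apply the resulting mask inside the CRF is not reproduced here.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def slot_given_intent(samples: List[Tuple[List[str], str]]) -> Dict[str, Dict[str, float]]:
    """Estimate P(slot | intent) from (slot tag sequence, intent) training pairs."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for slot_tags, intent in samples:
        counts[intent].update(slot_tags)
    return {
        intent: {slot: n / sum(counter.values()) for slot, n in counter.items()}
        for intent, counter in counts.items()
    }


training_samples = [
    (["O", "B-city"], "GetWeather"),
    (["O", "B-artist", "O", "O"], "PlayMusic"),
]
mask = slot_given_intent(training_samples)
print(mask["PlayMusic"])
# A slot never seen with an intent gets probability zero under this estimate.
print(mask["GetWeather"].get("B-artist", 0.0))
```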

5.5.9. Handling multi-labels

The multi-label issue was addressed for intent classification in Section 3.3.13. With the move into neural models similar methods have been applied.

(Wen et al., 2018) simply removed multi-label samples from their data set. (Dadas et al., 2019) tried using both the first label as the only label, and merging labels to a compound label. (Qin et al., 2020) consider multi-intent data sets, including their own extension of SNIPS to multi-intent. For multi-intent a score is calculated for each intent and those above a threshold are returned.
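Threshold-based multi-intent prediction can be sketched in a few lines. The threshold value of 0.5 and the intent names are illustrative assumptions; the scores would come from a trained model, typically one independent score per intent.

```python
from typing import Dict, List


def select_intents(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return every intent whose score exceeds the threshold, in alphabetical order."""
    return sorted(intent for intent, score in scores.items() if score > threshold)


toy_scores = {"PlayMusic": 0.9, "GetWeather": 0.7, "BookRestaurant": 0.1}
print(select_intents(toy_scores))
```

Unlike a softmax with an arg-max, this decision rule can return zero, one or several intents for the same utterance.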

(Gangadharaiah and Narayanaswamy, 2019) studied both sentence level and token level multi-intent detection. For ATIS, they split the compound multi-labels, giving about 2% of the data set with multi-labels. They also use an internal data set with 52% of the samples having multi-labels. Although the assignment method is unclear a sentence may be assigned multi-labels during prediction, and these are then assigned to individual tokens in the sentence to aid with slot filling.

6. Data sets

Name              | Public | Train-Val-Test   | Num Intents | Num Slots | Domain, Notes
ATIS              | Y      | 4478/500/893     | 21          | 128       | air travel
SNIPS-NLU         | Y      | 13084/700/700    | 7           | 72        | personal assist.
FRAMES            | Y      | 20006/-/6598     | 24          | 136       | hotel, multiturn
CQUD              | N      | 3286             | 43          | 20        | question answering
TREC              | Y      | 5500/-/500       | 6(50)       | -         | question classification
TRAINS            | N      | 5355/-/1336      | 12          | 32        | problem solving, multiturn
Microsoft Cortana | N      | 10k/1k/15k       | 10-20       | 15-63     | personal assist., multi-domain
Facebook          | Y      | 30521/4181/8621  | 12          | 11        | multi-lingual task oriented
SRTS FrameNet     | N      | 2803/-/312       | 12          | 61        | robotics
Alexa             | N      | 264000/-/-       | 246         | 3409      | 17 domains
DSTC2             | Y      | 4790/1579/4485   | 13          | 9         | multi-turn, restaurant search
DSTC4             | Y      | 5648/1939/3178   | 87          | 68        | multi-turn, Skype tour guide dialogues
DSTC5             | Y      | 27528/3441/3447  | 84          | 533       | dialogue with social robots
CMRS              | N      | 2901/969/967     | 5           | 11        | Chinese, meeting room reservations
CU-Move           | N      | 57584/-/-        | 5           | 38        | in-vehicle dialogue
AMIE              | N      | 3418/-/-         | 10          | 7         | in-vehicle dialogue
TeleBank          | N      | 2238/-/-         | 25          | 17        | Korean, banking
MIT MOVIE_ENG     | Y      | 8798/97/2443     | -           | 25        | movies, slot only
MIT RESTAURANT    | Y      | 6894/766/1521    | -           | 17        | restaurants, slot only

Table 8. Major data sets used in the literature. Data sets are single turn and in English unless noted; Train-Val-Test gives the number of utterances.

6.1. Introduction

A summary of the major data sets is presented in Table 8. Here we first cover the two most commonly used, ATIS and SNIPS, which are popular because they are easily available and because their ubiquity allows comparison between models. We then briefly cover the other data sets.

6.2. The Air Travel Information System (ATIS)

ATIS was introduced by (Hemphill et al., 1990) and its history is instructive in understanding some of the conventions of the field. The domain is air travel information including “information about flights, fares, airlines, cities, airports, and ground services”. The first release, ATIS-0, collected 740 evaluable samples. Each sample contained a sound file of a single-utterance question, a transcription of the question, a set of tuples constituting the answer, and the SQL query that produced the tuples.

Tokens were generated according to Standard Normal Orthographic Representation (SNOR) rules: whitespace-separated lexical tokens, case insensitive alphabetic text, spelled letters are represented with the letter followed by a fullstop (e.g., “a. b. c.”), no non-alphabetic characters (except apostrophes for contractions and possessives and hyphens for hyphenated words and fragments). The average length of the SNOR translated utterances was 11.3 tokens.

Extensions to the data set were made available in subsequent years: ATIS-1 (Pallett et al., 1992), ATIS-2 (Hirschman, 1992) and ATIS-3 (Dahl et al., 1994) in late 1993 to mid-1994.

The set which evolved to become the standard ATIS for NLU analysis was drawn from the annotated samples in ATIS-2 and 3. (Yulan He and Young, 2003) were the first to use the combined set for language understanding. (Raymond and Riccardi, 2007) used the same set but tweaked the annotation to something more closely resembling the ATIS set used today. The set contains 4978 training samples and 893 test samples. In more recent years with the advent of neural net models, 500 of the training samples are set aside as a validation set.

(Tur et al., 2010) work towards formalising the ATIS data set, using the same samples as (Yulan He and Young, 2003) and (Raymond and Riccardi, 2007). The intents listed by (Tur et al., 2010) are not the current ones: they list 17 intents, each of which has non-zero frequency in both the train and test sets.

In later releases some joint intents are included to give 21 intents. Also in later releases the SNOR rules are relaxed. For example, punctuation is allowed (“st. louis”), utterances are all lowercase, and numbers are allowed for times and years but not dates. We note that in the version of the data set used today the intents are highly imbalanced, with 75% of the samples in a single intent.

(Tur et al., 2010) perform an AdaBoost classification on word n-gram features for intent classification and, separately, use a CRF method to label slots. They then classify the errors into 6 types for intent and 5 types for slots, and suggest research directions based on these errors: use of parsers to identify head words or clauses; a priori information (knowledge bases); and methods to enable long distance pattern identification, as opposed to more local, shorter patterns. They also measure the high mis-annotation rate (2.5% for intent and 8.4% for slots).

In 2018 (Béchet and Raymond, 2018) performed the next analysis specifically to question the usefulness of ATIS. They ran a set of different methods from a boosted tree ensemble to a BiLSTM net on ATIS slot tagging with and without named entity tag labelling. They use the same data as (Raymond and Riccardi, 2007), which removes issues with position labels (B,I,O) by collapsing semantic spans as single tokens. For example, ‘san jose’ is a single token not two. While this weakens their approach the results are worth looking at.

They chose their five best models and clustered the predicted slots according to:

  • agree/correct (AC) - all models get the slot correct and agree on the answer;

  • non-agreement/error (NE) - all models get the slot wrong but there is no agreement on the errors;

  • agree/error (AE) - all models got the wrong slot but they all made the same error;

  • non-agreement/correct (NC) - models don’t agree on the solution but at least one is correct.

These clusters suggest future directions for research. While AC is ‘solved’, AE and NE are open problems (aspects of the data set not captured by the models), and NC are useful for model comparison between those that got them right and those that did not.
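The four clusters can be expressed as a simple decision over the gold label and the set of model predictions for one slot. The following is a minimal sketch, not the authors' procedure; the function name and the example labels are hypothetical, and it assumes that a slot where only some models share the same wrong label still counts as non-agreement.

```python
def cluster_slot(gold, predictions):
    """Assign one slot to AC, AE, NC or NE from the gold label and each model's prediction."""
    all_agree = len(set(predictions)) == 1
    any_correct = any(p == gold for p in predictions)
    if all_agree and any_correct:
        return "AC"  # every model predicts the gold label
    if all_agree:
        return "AE"  # every model makes the same error
    if any_correct:
        return "NC"  # models disagree, at least one is right
    return "NE"      # models disagree and none is right


# Hypothetical predictions from five models for slots whose gold label is "toloc".
print(cluster_slot("toloc", ["toloc"] * 5))
print(cluster_slot("toloc", ["fromloc"] * 5))
print(cluster_slot("toloc", ["toloc", "fromloc", "toloc", "stoploc", "toloc"]))
print(cluster_slot("toloc", ["fromloc", "stoploc", "fromloc", "O", "O"]))
```

Counting the slots that fall into each cluster then gives the breakdown used to separate solved cases from open problems.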

They also highlight issues with the data set - bad annotations, ambiguity “where slots could be labelled with different labels”, and repetition errors where “only the first mention of an entity is labelled”, e.g. in “show flight and prices Kansas city to Chicago on next Wednesday arriving in Chicago by 7pm” Chicago is only labelled once.

They estimate that about 2.5% of the utterances are erroneously slot-tagged and conclude that ATIS is at the end of its useful life for analysis.

(Niu and Penn, 2019) performed the next deep analysis of the ATIS data set and extensively reviewed the shortcomings of the data set. They have subsequently re-annotated the data set fixing what they deem errors.

Even without this re-annotated version of ATIS, results reported in the literature show that the test intent accuracy being achieved is now above 99% and slot f1 above 98%. It appears that the models to date have successfully captured the joint distributions of words, slots and intents in the data set. Further models may only make improvements at the edges and, while useful, these may be hidden by what appear to be non-significant increases in the evaluation measures.

6.3. Snips

The SNIPS Natural Language Understanding data set and its creation are fully described in (Coucke et al., 2018). It contains 14484 utterances (train 13084, development 700, test 700) in 7 balanced intent classes. In training there are 72 slot labels and a vocabulary size of 11241 words. The average sequence length is 9.05. Unlike ATIS, SNIPS covers different domains - weather, restaurants and entertainment. (Liu et al., 2019a) show an interesting visualisation that the slot labels used for different domains form largely disjoint sets. These differences have made it a useful counterpoint for experimentation in NLU, and models addressing both ATIS and SNIPS successfully show they can handle both imbalanced and balanced data. However, the reported test results for SNIPS too are excellent - intent accuracy above 99% and slot f1 around 98%.

6.4. Other data sets

Microsoft have several non-publicly available sets which have been used by Microsoft researchers. For example FRAMES is multi-turn dialogues around hotel bookings. The Microsoft Cortana personal voice assistant data sets have at least six domains - weather, calendar, communication, reminder, alarm, places. Other software houses with data sets include Facebook (public) and Alexa (private).

Some competitions have applicable data sets; for example, the DSTC 2, 4 and 5 competition data sets have been used in papers. These often contain multi-turn dialogues. The Chinese competition-based CCKS data set has also been used for research.

The TRAINS data set, a collection of problem-solving dialogues, has been used in four papers. Data sets from diverse but relevant fields include FrameNet from robotics, CU-Move and AMIE from in-vehicle communication, and, from question answering, CQUD (from Baidu Knows), Yahoo and TREC (only intent annotated).

Non-English data sets have been generated, for example (Bellomaria et al., 2019) derived an Italian data set starting by translating SNIPS and then using Italian words for tokens like cities or movie names.

6.5. Discussion

It is argued that ATIS and SNIPS are near the end of their useful lives as benchmarks for the joint task. Excellent test results show that the methods covered in this survey can successfully learn the joint distributions of intent and slot labels, and of slot labels with each other, in a supervised learning setting. They appear set to continue as the benchmarks because they allow a new approach to be compared with previous ones, though this should be tempered by the use of non-standard experimental setups discussed in Section 8.

They are useful for study because they are single utterance, have reasonable numbers of intents and slots, are task focused (so have a clear intent), and have reasonable utterance lengths. Challenges include mis-annotation, OOV issues and perhaps the level of unnaturalness of the language.

In their defence they provide differences - class imbalance versus balance, single versus multi-domain - and a model that scores well on both can claim to have some generalised ability. However, as noted in the literature, the greater generalisability of such supervised learning models to new domains is in question.

It is probable that more naturally conversational data should be tested. To avoid costly annotation this should be largely unannotated, encouraging research in zero or few shot methods. Such methods can still be tested on ATIS and SNIPS, as in (Krone et al., 2020) (pre-print only). Metrics for measuring the efficacy of such models in the absence of annotation need to be considered.

We further note that all the few and zero shot papers reviewed use annotated datasets for evaluation, hence still need to be transferred to new unseen datasets.

7. Evaluation metrics

7.1. Intent classification

7.1.1. Intent accuracy

For intent classification, accuracy is the most commonly used evaluation metric. Accuracy is the ratio of the number of correct predictions of intent to the total number of sentences.

Some utterances in the ATIS data set have more than one intent label. Most researchers, since they are not doing multiple label detection, consider the combined label as a new label type, e.g. atis_airfare#atis_flight_time. (Zhang et al., 2020) note “some researchers ((Liu and Lane, 2016a); (Li et al., 2018)) count an utterance as a correct classification if any ground truth label is predicted. Others ((Goo et al., 2018); (E et al., 2019)) require that all of these intent labels have to be correctly predicted if an utterance is to be counted as a correct classification.”
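The difference between the two counting conventions quoted above can be made concrete with a short sketch. The function names and the example predictions are hypothetical; compound labels are assumed to be joined with "#" as in the ATIS example, and the "all" policy is assumed to require the predicted label set to equal the gold set.

```python
def split_labels(label):
    """Split a compound label such as 'atis_airfare#atis_flight_time' into a set."""
    return set(label.split("#"))


def intent_accuracy(gold, pred, policy="all"):
    """Intent accuracy under the 'any' or 'all' convention for multi-label utterances."""
    if policy not in ("any", "all"):
        raise ValueError("policy must be 'any' or 'all'")
    correct = 0
    for g, p in zip(gold, pred):
        g_set, p_set = split_labels(g), split_labels(p)
        if policy == "any":
            correct += bool(g_set & p_set)  # one shared label is enough
        else:
            correct += g_set == p_set       # every label must match
    return correct / len(gold)


gold = ["atis_flight", "atis_airfare#atis_flight_time", "atis_airfare"]
pred = ["atis_flight", "atis_airfare", "atis_flight"]
print(intent_accuracy(gold, pred, policy="any"))
print(intent_accuracy(gold, pred, policy="all"))
```

On this toy input the lenient convention scores two of three utterances as correct and the strict one scores only one, which is why results reported under different conventions are not directly comparable.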

7.2. Error rate

Some papers, (Mohasseb et al., 2018) for example, instead use error rate, the ratio of wrongly classified samples to the total number of samples or 100% - accuracy, to measure intent classification performance.

7.2.1. Intent precision, recall and F1

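Less frequently (e.g. (Li et al., 2019)) precision, recall and F1 are used to evaluate intent prediction. For an intent class $c$,

  • $TP_c$ is the number of true positives, intents which are correctly classified as of class $c$.

  • $FP_c$ is the number of false positives, intents which belong to other classes but are incorrectly classified as class $c$.

  • $FN_c$ is the number of false negatives, intents of class $c$ which are incorrectly classified as other classes.

Two approaches are used: micro-averaged and macro-averaged. In the micro-averaged approach, the TP, FP and FN are summed across all classes in the set of classes $C$:

$$P_{micro} = \frac{\sum_{c \in C} TP_c}{\sum_{c \in C} (TP_c + FP_c)}, \qquad R_{micro} = \frac{\sum_{c \in C} TP_c}{\sum_{c \in C} (TP_c + FN_c)}$$

In the macro-averaged approach, the precision and recall are computed for each class first, then the average across all classes is reported:

$$P_{macro} = \frac{1}{|C|} \sum_{c \in C} \frac{TP_c}{TP_c + FP_c}, \qquad R_{macro} = \frac{1}{|C|} \sum_{c \in C} \frac{TP_c}{TP_c + FN_c}$$

For both approaches, F1 is computed as:

$$F1 = \frac{2 \cdot P \cdot R}{P + R}$$
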
A variation used for multi-label identification is precision and recall at the top-k predictions. Here precision is the number of correct labels in the top k predictions divided by k, and recall is the number of correct labels in the top k predictions divided by the total number of correct labels. This is used by (Zhang et al., 2016).
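The micro- and macro-averaged measures and the top-k variation described in this section can be sketched in a few lines. This is an illustrative implementation under stated assumptions: the function names are hypothetical, undefined ratios (zero denominators) are set to 0, and macro F1 is computed from the macro-averaged precision and recall rather than by averaging per-class F1.

```python
from collections import Counter


def prf(tp, fp, fn):
    """Precision, recall and F1 from counts; undefined ratios are returned as 0."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def intent_prf(gold, pred, average="micro"):
    """Micro- or macro-averaged precision, recall and F1 for single-label intents."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    if average == "micro":
        return prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    classes = sorted(set(gold) | set(pred))
    per_class = [prf(tp[c], fp[c], fn[c]) for c in classes]
    p = sum(x[0] for x in per_class) / len(classes)
    r = sum(x[1] for x in per_class) / len(classes)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def precision_recall_at_k(ranked, gold_labels, k):
    """Precision and recall over the k highest-ranked predicted labels."""
    hits = len(set(ranked[:k]) & set(gold_labels))
    return hits / k, hits / len(gold_labels)


gold = ["a", "a", "a", "b"]
pred = ["a", "a", "b", "b"]
print(intent_prf(gold, pred, average="micro"))
print(intent_prf(gold, pred, average="macro"))
print(precision_recall_at_k(["a", "b", "c"], {"a", "c"}, k=2))
```

For single-label classification every error is both a false positive and a false negative, so micro-averaged precision, recall and F1 all equal accuracy; macro-averaging is the variant that exposes weak performance on rare classes.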

7.2.2. Tests for significance

Standard tests for the significance of the difference between two models are used. Welch’s t-test tests the hypothesis that two populations have the same mean.

(Firdaus et al., 2018b) and (Firdaus et al., 2019) used this with the p-value threshold set to 0.05. Other papers use Student’s t-test for similar purposes.

McNemar’s test is applied to paired binary classified data to evaluate how well two tests agree with each other. (Jeong and Lee, 2008) used this for a classification of ATIS intents into two domains.
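Both tests are simple to compute. The sketch below uses only the Python standard library; the function names and the example numbers are hypothetical. It returns the Welch statistic with its Welch-Satterthwaite degrees of freedom (the p-value would then be read from a t distribution), and it assumes the continuity-corrected chi-square form of McNemar's test with one degree of freedom.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom for two samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def mcnemar(b, c):
    """McNemar chi-square statistic (continuity corrected) and p-value.

    b: samples model A gets right and model B gets wrong; c: the reverse.
    """
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square survival function, 1 dof
    return stat, p_value


# Hypothetical intent accuracies from four runs of each of two models.
print(welch_t([0.97, 0.98, 0.96, 0.97], [0.95, 0.94, 0.96, 0.95]))
# Hypothetical discordant counts from two models on the same test set.
print(mcnemar(10, 25))
```

Welch's test compares scores from repeated runs of each model, whereas McNemar's test compares two single runs on the same test items, so the choice depends on what the experiment recorded.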

7.3. Slot labelling evaluation

7.3.1. Span slot precision, recall and F1

A span (sometimes called a chunk) refers to a sequence of words with the same class. For example the labelling B-MISC I-MISC I-MISC is a span of class MISC.

For a class $c$ we can thus define at the span level:

  • $TP_c$ is the number of true positives, the number of spans of class $c$ which are wholly correctly predicted.

  • $FP_c$ is the number of false positives, the number of spans of a different class which are incorrectly predicted as of class $c$.

  • $FN_c$ is the number of false negatives, the number of spans of class $c$ which are incorrectly predicted, partially or wholly, to another class.

Micro-averaged and macro-averaged precision and recall and F1 can then be calculated, similarly to the previous intent section.

In most papers slot F1 is reported as the span based micro-averaged F1 over all classes excluding O. The conlleval.py script (https://github.com/sighsmile/conlleval) is regularly used (Deoras and Sarikaya, 2013; Liu et al., 2019a; Daha and Hewavitharana, 2019) to calculate F1 score, precision and recall with micro-averaging.
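To make the span-level counting concrete, the following sketch extracts spans from BIO tag sequences and computes micro-averaged precision, recall and F1. It is not the conlleval script and is not guaranteed to match it on malformed tag sequences; the function names and the example tags are hypothetical, and an I- tag that does not continue the current span is assumed to start a new span.

```python
def extract_spans(tags):
    """Return the set of (start, end, label) spans in one BIO-tagged sentence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and tag[2:] == label
        if not inside:
            if label is not None:
                spans.add((start, i, label))  # end index is exclusive
            start, label = (None, None) if tag == "O" else (i, tag[2:])
    return spans


def span_f1(gold_sentences, pred_sentences):
    """Micro-averaged span precision, recall and F1; O tokens never form spans."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)   # spans with identical boundaries and label
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["O", "B-fromloc", "I-fromloc", "O", "B-toloc"]]
pred = [["O", "B-fromloc", "O", "O", "B-toloc"]]
print(span_f1(gold, pred))
```

In the example the truncated fromloc span earns no credit even though its first token is labelled correctly, which is the main way span-based scores differ from token-based ones.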

7.3.2. Token-based slot precision, recall and F1

In this evaluation metric, TP, FP and FN are calculated at the token level. For a slot label $l$ (e.g. B-MISC):

  • $TP_l$ is the number of true positives, the number of tokens which are correctly predicted as label $l$.

  • $FP_l$ is the number of false positives, the number of tokens which are from another label but incorrectly predicted as label $l$.

  • $FN_l$ is the number of false negatives, the number of tokens of label $l$ which are incorrectly labelled.

The formulas for precision, recall and F1 are the same as for the span-based measures. (Li et al., 2019) use token based slot measures.
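A token-level count for a single label can be sketched as follows. The function name and the example tags are hypothetical, and undefined ratios are returned as 0.

```python
def token_prf(gold_tags, pred_tags, label):
    """Token-level precision, recall and F1 for one slot label such as 'B-fromloc'."""
    pairs = list(zip(gold_tags, pred_tags))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["O", "B-fromloc", "I-fromloc", "O", "B-toloc"]
pred = ["O", "B-fromloc", "O", "O", "B-toloc"]
print(token_prf(gold, pred, "B-fromloc"))
print(token_prf(gold, pred, "I-fromloc"))
```

Here the B-fromloc token is counted as correct although the span it begins is incomplete, so token-based scores tend to be more forgiving than span-based scores on the same predictions.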

7.3.3. Slot accuracy

Slot accuracy is the ratio of the number of correctly labelled slots to the total number of slots. This is used in (Yu et al., 2011), where it is referred to as word labelling accuracy (WLA).

7.4. Semantic accuracy

A sentence is correctly analysed if both the intent is correctly predicted and all the slots (including O labels) are correctly predicted. Semantic accuracy is then the number of correctly analysed sentences divided by the number of sentences.
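Semantic accuracy follows directly from this definition. The sketch below is illustrative; the function name and the example data are hypothetical.

```python
def semantic_accuracy(gold_intents, pred_intents, gold_slots, pred_slots):
    """Fraction of sentences with the correct intent and every slot tag (including O) correct."""
    correct = sum(
        gi == pi and list(gs) == list(ps)
        for gi, pi, gs, ps in zip(gold_intents, pred_intents, gold_slots, pred_slots)
    )
    return correct / len(gold_intents)


gold_intents = ["atis_flight", "atis_airfare", "atis_flight"]
pred_intents = ["atis_flight", "atis_airfare", "atis_airline"]
gold_slots = [["O", "B-toloc"], ["O", "B-fromloc"], ["O", "O"]]
pred_slots = [["O", "B-toloc"], ["O", "O"], ["O", "O"]]
print(semantic_accuracy(gold_intents, pred_intents, gold_slots, pred_slots))
```

Only the first sentence counts as correct: the second has a slot error and the third an intent error, which shows why semantic accuracy is always at most the intent accuracy.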

7.5. Other accuracy measures

Other classifications performed alongside the joint task are outside the scope of this paper. For example, if a domain is predicted a domain accuracy is measured, and if a dialog act is predicted a dialog act accuracy is measured ((Celikyilmaz and Hakkani-Tur, 2012) used this to evaluate their dialogue model).

7.6. Qualitative evaluation

(Jeong and Lee, 2008) used Hinton diagrams to visualise relationships between intent and slot labels arising from weights in their tri-CRF. Attention heat maps are used for similar purposes by (Ma et al., 2017; Qin et al., 2020).

(Firdaus et al., 2019) provided t-SNE plots of their intent features to illustrate their effectiveness in prediction.

8. Experimental Setup

The standard experiment trains on annotated utterances, creates features, and learns to predict an intent and slot labels for each utterance. A held out, unseen test set is used for evaluating performance.

The experimental setup varies for different papers. Further, many do not clearly state their setup with respect to data sets and hyper-parameters. In those papers which do specify the setup for data sets, most utilised the train-test split, where usually 80% of observations were treated as training data and the remainder were for testing. The training data may be split further to make a validation set. Alternatively, papers use 5-fold or 10-fold cross-validation for evaluation. Results are sometimes reported to be averaged over a number of runs.

For parameter tuning, dropout rates ranged from 0.003 to 0.5 and the size of hidden states was normally between 100 and 200, with as low as 64. Some models indicate the use of the Adam optimisation method with a learning rate between 0.0001 and 0.01. (Vu et al., 2016) used 0.02 as their initial learning rate in the first ten epochs and then halved it for the last 15 epochs. Similarly, (Ravuri and Stolcke, 2015) halved the learning rate once the cross entropy loss decreased less than 0.01 per example on the held out set. Moreover, a few papers mentioned that they set parameters randomly in the beginning, and then apply 5-fold validation when tuning parameters. Word embedding dimensions vary from 64 to 1024.

The same problem of lack of reporting occurred with the number of epochs. In papers which stated the number of epochs, most models were trained for fewer than 50 epochs, with some of these using early stopping. (Gupta et al., 2019) allowed an unlimited number of epochs with a stopping criterion, while (Lin and Xu, 2019) specified a maximum of 200 epochs and applied early stopping as well. (Qin et al., 2019) used 300 epochs with no early stopping. Further, some models report the final epoch results while others report the best results. In the joint task there is an effort to standardise the number of epochs for the benchmark data sets to 10 for ATIS and 20 for SNIPS (with an early stopping strategy permitted), to allow for comparison between models. This was initiated by (Goo et al., 2018), who also reran the experiments of (Hakkani-Tür et al., 2016) and (Liu and Lane, 2016a) under that regime. Table 10 contains the results for experiments using this number of epochs.
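The distinction between reporting the final epoch and the best epoch, under a fixed epoch budget with early stopping, can be illustrated with a small sketch. It is not taken from any of the surveyed papers: the function name, the patience rule and the validation curve are all assumptions made for illustration.

```python
def train_with_early_stopping(evaluate_epoch, max_epochs, patience):
    """Run up to max_epochs and stop once `patience` epochs pass without improvement.

    evaluate_epoch(epoch) is a hypothetical callback that trains one epoch and
    returns a validation score (higher is better).
    Returns (best_epoch, best_score, last_epoch_run).
    """
    best_score, best_epoch = float("-inf"), 0
    for epoch in range(1, max_epochs + 1):
        score = evaluate_epoch(epoch)
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_score, epoch


# Simulated validation curve standing in for real training.
curve = [0.90, 0.93, 0.95, 0.94, 0.94, 0.96, 0.96, 0.96, 0.96, 0.96]
result = train_with_early_stopping(lambda e: curve[e - 1], max_epochs=10, patience=2)
print(result)
```

In this simulated run training stops at epoch 5, where the score is 0.94, while the best score of 0.95 was reached at epoch 3, so a paper's headline number depends on which of the two it reports.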

Because many papers did not clearly state their experimental setup, replicating the models and obtaining results similar to those shown in the papers may be difficult. Therefore, it is recommended that papers include detailed information about setup in the experiments section.

Furthermore, standardisation of the experiment is worth consideration. As discussed in Section 9, a standard number of epochs on the standard data sets allows for a level of comparison between models. Results should be reported at this level. This should not prevent results for different experimental setups from also being reported.

9. Performance summary

To summarise performance in the joint task we list the models and their reported test results for the ATIS and SNIPS data sets for the three standard evaluation metrics (if available) in Table 9. Papers are included in this table if at least one of their results is better than the benchmarks reported in papers from the previous calendar year. Several interesting patterns can be observed based on the results available: (1) The overall improvement in Slot F1 and Semantic accuracy over time is much larger for SNIPS (from 87.3 in 2016 to 98.78 in 2019 for Slot F1 and from 73.2 in 2016 to 93.6 in 2020 for Semantic accuracy) than for ATIS (from 93.96 in 2014 to 98.75 in 2019 for Slot F1 and from 78.9 in 2016 to 91.6 in 2020 for Semantic accuracy), while Intent accuracy behaves in the opposite way (from 91.1 in 2016 to 99.76 in 2019 for ATIS and from 96.7 in 2016 to 99.98 in 2019 for SNIPS). (2) For those models that reported Slot F1 and Intent accuracy on both data sets, 17 out of 26 perform better in Slot F1 for ATIS and in Intent accuracy for SNIPS. (3) Before 2019, the best Semantic accuracy results all come from ATIS; a shift to SNIPS starts in 2019, ending with all the best Semantic accuracy results on SNIPS.

However, as mentioned in the previous section, there is a wide variety in the number of epochs for which neural models are allowed to run. In order to make a fair comparison, we further extracted those models that use 10 epochs for ATIS and 20 epochs for SNIPS and provide the test results in Table 10. These results are either from papers that follow this convention, or from the reproduction by (Goo et al., 2018), or are replicated by us using the GitHub code supplied by the authors when indicated by the ? marker. In the latter case we also confirm the consistent calculation of intent and semantic accuracy as well as span-based slot f1. A similar pattern to before can be observed: a significant improvement in Slot F1 and Semantic accuracy for SNIPS has been made since 2019, and most of the models perform better in Slot F1 for ATIS and Intent accuracy for SNIPS. These patterns could be related to the differing distributions of slot and intent labels and the different nature of the domains of the two data sets, with regard to the different architectures of the models.

In summary, we note that the results are now excellent for the two most commonly used data sets, and any fruitful newer developments may be lost in results that do not appear to significantly improve on them. Just as SNIPS grew to become standard, and offered different aspects to ATIS (balanced data, multi-domain), it is probable that a new data set should become part of the SLU reporting canon. It should address the issues of unlabelled data and emerging domains, as these problems should be addressed by newer models.

Model | ATIS: Slot f1, Intent acc, Semantic acc | SNIPS: Slot f1, Intent acc, Semantic acc
(Jeong and Lee, 2008) Joint 2 94.42 93.07
(Xu and Sarikaya, 2013) CNN TriCRF 95.42 94.09
(Guo et al., 2014) RecNN+Viterbi 93.96 95.4
(Shi et al., 2015) RNN Joint + NE 96.83 95.4
(Hakkani-Tür et al., 2016) in (Goo et al., 2018)* 94.3 92.6 80.7 87.3 96.9 73.2
(Chen et al., 2016) K-SAN Syntax 95.38 84.32
(Zhang and Wang, 2016) W+N 96.89 98.32
(Liu and Lane, 2016a) in (Goo et al., 2018)* 94.2 91.1 78.9 87.8 96.7 74.1
(Goo et al., 2018) Slot-Gated (Full Atten.)* 94.8 93.6 82.2 88.8 97.0 75.5
(Goo et al., 2018) Slot-Gated (Intent Atten.)* 95.2 94.1 82.6 88.3 96.8 74.6
(Wang et al., 2018b) Attention and aligned 97.76 97.17
(Firdaus et al., 2018a) 98.02 98.43
(Li et al., 2018) 96.52 98.77
(Li et al., 2018) 94.81 98.54
(Wang et al., 2018a) 96.89 98.99
(Yu et al., 2018) ACJIS Model 96.43 98.57
(Siddhant et al., 2019) ELMo 95.62 97.42 87.35 93.9 99.29 85.43
(E et al., 2019) SF First (with CRF) *? 95.8 97.8 86.8 91.4 97.4 80.6
(Zhang et al., 2019) Capsule *? 95.2 95.0 83.4 91.8 97.3 80.9
(Gupta et al., 2019) CNN 3L, 5 kern., label recur. 96.95 98.36 94.22 99.1
(Gupta et al., 2019) LSTM 1L, label recur. 97.37 98.36 93.83 98.68
(Gupta et al., 2019) CNN 3L, 5 kern., label recur.* 95.27 97.37 92.3 97.57
(Chen et al., 2019a) 96.54 98.91 93.94 99.71
(Zhang and Wang, 2019) 95.1 97.2 93.3 98.9
(Daha and Hewavitharana, 2019) BiLSTM-CRF 95.6 96.6 86.2 94.6 97.4 87.2
(Liu et al., 2019a) CM-Net with GloVe 96.2 99.1 97.15 99.29
(Liu et al., 2019a) CM-Net with BERT 97.31 99.32
(Qin et al., 2019) Our Model 95.9 96.9 86.5 94.2 98 86.9
(Qin et al., 2019) Model+BERT 96.1 97.5 88.6 97 99 92.9
(Firdaus et al., 2019) HCNN+CRF, word+char embed’s 97.32 99.09 94.38 98.24
(Castellucci et al., 2019) 95.7 97.8 88.2 96.2 99 91.6
(Zhang et al., 2019b) 98.75 99.76 98.78 99.98
(Pentyala et al., 2019) Base 95.4 96.1 94.8 98
(Pentyala et al., 2019) Base+BERT 95.8 96.6 94.5 97.6
(Chen et al., 2019b) BERT 96.1 97.5 88.2 97 98.6 92.8
(Chen et al., 2019b) BERT+CRF 96 97.9 88.6 96.7 98.4 92.6
(Firdaus et al., 2020) BLSTM+atten+Multi:DAC+ID+SF 98.11 99.06
(Wang et al., 2020a) SASGBC 96.69 98.21 91.6 96.43 98.86 92.57
(Wang et al., 2020b) CMA-BLSTMS n-128 96.89 98.88
(Tang et al., 2020) fully-E@EMG-CRF 96.4 99.0 89.6 97.2 99.7 93.6
(Han et al., 2020) Bi-flow, atten., graph* 96.3 98.6 88.2 96.1 99.2 89.8
Table 9. Natural language understanding (NLU) performance on ATIS and SNIPS-NLU data sets (%). * denotes ATIS 10 epoch, SNIPS 20 epoch; ? indicates GitHub available

Model | ATIS Slot f1 | ATIS Intent acc | ATIS Semantic acc | SNIPS Slot f1 | SNIPS Intent acc | SNIPS Semantic acc
(Hakkani-Tür et al., 2016) in (Goo et al., 2018)* | 94.3 | 92.6 | 80.7 | 87.3 | 96.9 | 73.2
(Liu and Lane, 2016a) in (Goo et al., 2018)* | 94.2 | 91.1 | 78.9 | 87.8 | 96.7 | 74.1
(Goo et al., 2018) Slot-Gated (Full Atten.)* | 94.8 | 93.6 | 82.2 | 88.8 | 97.0 | 75.5
(Goo et al., 2018) Slot-Gated (Intent Atten.)* | 95.2 | 94.1 | 82.6 | 88.3 | 96.8 | 74.6
(Li et al., 2018) | 94.82 | 97.00 | 84.00 | 88.93 | 97.71 | 76.43
(Wang et al., 2018a) | 95.14 | 96.08 | 84.87 | 88.46 | 96.71 | 75.39
(E et al., 2019) SF First (with CRF) *? | 95.8 | 97.8 | 86.8 | 91.4 | 97.4 | 80.6
(Zhang et al., 2019) Capsule *? | 95.2 | 95.0 | 83.4 | 91.8 | 97.3 | 80.9
(Gupta et al., 2019) CNN 3L, 5 kern., label recur.* | 95.27 | 97.37 | - | 92.3 | 97.57 | -
(Qin et al., 2019) Our Model | 93.15 | 95.9 | 80.4 | 90.88 | 97.14 | 79.71
(Chen et al., 2019b) BERT | 95.54 | 97.54 | 87.35 | 96.91 | 98.43 | 92.43
(Chen et al., 2019b) BERT+CRF | 96.03 | 97.76 | 88.47 | 96.60 | 98.57 | 92.14
(Han et al., 2020) Bi-flow, atten., graph* | 96.3 | 98.6 | 88.2 | 96.1 | 99.2 | 89.8
Table 10. Natural language understanding (NLU) performance on ATIS and SNIPS-NLU data sets (%) using ATIS 10 epoch, SNIPS 20 epoch. ? reproduced by this paper

Model | ATIS Slot F1
(Deng et al., 2012) Log-linear K-DCN | 91.88
(Deoras and Sarikaya, 2013) DBN + Sntc | 96.0
(Mesnil et al., 2013) Bi-directional Jordan-RNN | 93.98
(Yao et al., 2013) RNN + Lex + NE | 96.6
(Yao et al., 2014) R-CRF Model 2 | 96.65
(Yao et al., 2014) Deep LSTM | 95.08
(Liu and Lane, 2015) RNN trained with sampled label linearly decreasing | 97.87
(Mesnil et al., 2015) Hybrid | 95.06
(Peng and Yao, 2015) RNN-EM | 95.25
(Liu and Lane, 2016a) Attention Encoder-Decoder NN (with aligned inputs) | 95.78
(Kurata et al., 2016) Encoder-labeler Deep LSTM (W) | 95.47
(Vu et al., 2016) 5xR-biRNN | 95.56
(Vu, 2016) R-bi-sCNN | 95.61
(Zhu and Yu, 2017) BLSTM-LSTM (Focus) | 95.79
(Gong et al., 2019) DCMTL | 95.83
(Louvan and Magnini, 2018) MTL, different supervision level | 95.94
(Wang et al., 2018) DRL based Augmented Tagging System () | 97.86
(Shen et al., 2019) c-ProgModel | 93.91
(Zhang et al., 2020) SC-TDNN-C | 95.73
Table 11. Slot F1 scores of slot filling models on ATIS

10. Critical Discussion and Conclusions

Section 1, Introduction, presented the following three questions concerning joint intent detection and slot filling based on writing style:

  • Q1: How do these joint models achieve and balance two aspects, intent classification and slot filling?

  • Q2: Have syntactic clues/features been fully exploited or does semantics override this consideration?

  • Q3: Can successful models in one supervised domain be made more generalisable to new domains or languages or unseen data?

Based on the literature review conducted in


  • F. Béchet and C. Raymond (2018) Is ATIS too shallow to go deeper for benchmarking Spoken Language Understanding models?. In InterSpeech 2018, Hyderabad, India, pp. 1–5. External Links: Link Cited by: §6.2.
  • V. Bellomaria, G. Castellucci, A. Favalli, and R. Romagnoli (2019) Almawave-slu: A new dataset for SLU in italian. Vol. abs/1907.07526. External Links: Link, 1907.07526 Cited by: §6.4.
  • A. Bhargava, A. Celikyilmaz, D. Hakkani-Tür, and R. Sarikaya (2013) Easy contextual intent prediction and slot detection. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, Canada, pp. 8337–8341. Cited by: Table 5, §3.3.10.
  • A. Bhasin, B. Natarajan, G. Mathur, J. H. Jeon, and J. Kim (2019) Unified parallel intent and slot prediction with cross fusion and slot masking. In Natural Language Processing and Information Systems, E. Métais, F. Meziane, S. Vadera, V. Sugumaran, and M. Saraee (Eds.), Cham, pp. 277–285. External Links: ISBN 978-3-030-23281-8 Cited by: §5.2.6, §5.3.1, §5.3.2, §5.5.8, Table 7.
  • A. Bhasin, B. Natarajan, G. Mathur, and H. Mangla (2020) Parallel intent and slot prediction using mlb fusion. In 2020 IEEE 14th International Conference on Semantic Computing (ICSC), San Diego, USA, pp. 217–220. Cited by: §5.2.10, §5.3.1, Table 7.
  • H. S. Bhathiya and U. Thayasivam (2020) Meta learning for few-shot joint intent detection and slot-filling. In Proceedings of the 2020 5th International Conference on Machine Learning Technologies, ICMLT 2020, New York, NY, USA, pp. 86–92. External Links: ISBN 9781450377645, Link, Document Cited by: §5.5.3, §5.5.4, Table 7.
  • A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 2787–2795. External Links: Link Cited by: §5.5.1.
  • G. Castellucci, V. Bellomaria, A. Favalli, and R. Romagnoli (2019) Multi-lingual intent detection and slot filling in a joint bert-based model. Vol. abs/1907.02884. External Links: Link, 1907.02884 Cited by: §5.3.1, §5.5.3, Table 7, Table 9.
  • A. Celikyilmaz and D. Hakkani-Tur (2012) A joint model for discovery of aspects in utterances. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jeju Island, Korea, pp. 330–338. External Links: Link Cited by: §5.2.1, §5.3.1, Table 7, §7.5.
  • L. Chen, D. Zhang, and L. Mark (2012) Understanding user intent in community question answering. In Proceedings of the 21st International Conference Companion on World Wide Web - WWW '12 Companion, Lyon, France, pp. 823–828. External Links: Document, Link Cited by: Table 5, §3.3.10.
  • M. Chen, J. Zeng, and J. Lou (2019a) A self-attention joint model for spoken language understanding in situational dialog applications. Vol. abs/1905.11393. External Links: Link, 1905.11393 Cited by: §5.2.4, §5.3.1, §5.5.8, Table 7, Table 9.
  • Q. Chen, Z. Zhuo, and W. Wang (2019b) BERT for joint intent classification and slot filling. Vol. abs/1902.10909. External Links: Link, 1902.10909 Cited by: §5.3.1, §5.5.5, Table 7, Table 10, Table 9.
  • S. Chen and S. Yu (2019) WAIS: word attention for joint intent detection and slot filling. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, Honolulu, USA, pp. 9927–9928. External Links: Document Cited by: §5.2.4, Table 7.
  • Y. Chen, D. Hakkani-Tür, G. Tur, A. Celikyilmaz, J. Guo, and L. Deng (2016) Syntax or semantics? knowledge-guided joint semantic frame parsing. In 2016 IEEE Spoken Language Technology Workshop (SLT), San Diego, USA, pp. 348–355. Cited by: §5.5.1, Table 7, Table 9.
  • A. Cohan, W. Ammar, M. van Zuylen, and F. Cady (2019) Structural scaffolds for citation intent classification in scientific publications. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 3586–3596. External Links: Link, Document Cited by: Table 5.
  • C. Costello, R. Lin, V. Mruthyunjaya, B. Bolla, and C. Jankowski (2018) Multi-layer ensembling techniques for multilingual intent classification. External Links: 1806.07914 Cited by: Table 5, §3.3.3.
  • A. Coucke, A. Saade, A. Ball, T. Bluche, A. Caulier, D. Leroy, C. Doumouro, T. Gisselbrecht, F. Caltagirone, T. Lavril, M. Primet, and J. Dureau (2018) Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. Vol. abs/1805.10190. External Links: Link, 1805.10190 Cited by: §6.3.
  • S. Dadas, J. Protasiewicz, and W. Pedrycz (2019) A deep learning model with data enrichment for intent detection and slot filling. In 2019 IEEE International Conference on Systems, Man and Cybernetics, October 6-9, 2019, Bari, Italy, pp. 3012–3018. External Links: Link, Document Cited by: §5.3.1, §5.5.4, §5.5.9, Table 7.
  • F. Daha and S. Hewavitharana (2019) Deep neural architecture with character embedding for semantic frame detection. In 2019 IEEE 13th International Conference on Semantic Computing (ICSC), Newport Beach, USA, pp. 302–307. Cited by: §5.3.1, §5.3.2, Table 7, §7.3.1, Table 9.
  • D. A. Dahl, M. Bates, M. Brown, W. Fisher, K. Hunicke-Smith, D. Pallett, C. Pao, A. Rudnicky, and E. Shriberg (1994) Expanding the scope of the ATIS task: the ATIS-3 corpus. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994, Plainsboro, USA, pp. 43–48. External Links: Link Cited by: §6.2.
  • Y. Dai, Y. Zhang, Z. Ou, Y. Wang, and J. Feng (2018) Elastic crfs for open-ontology slot filling. External Links: 1811.01331 Cited by: §4.3.10, Table 6.
  • L. Deng, G. Tur, X. He, and D. Hakkani-Tür (2012) Use of kernel deep convex networks and end-to-end learning for spoken language understanding. In 2012 IEEE Spoken Language Technology Workshop (SLT), Miami, USA, pp. 210–215. Cited by: §4.2.1, Table 6, Table 11.
  • A. Deoras and R. Sarikaya (2013) Deep belief network based semantic taggers for spoken language understanding. In Interspeech, Lyon, France, pp. 2713–2717. Cited by: §4.3.7, Table 6, §7.3.1, Table 11.
  • J. Deriu, A. Rodrigo, A. Otegi, G. Echegoyen, S. Rosset, E. Agirre, and M. Cieliebak (2020) Survey on evaluation methods for dialogue systems. Artificial Intelligence Review 53. External Links: Document, ISBN 1573-7462, Link Cited by: §1.2.
  • H. E, P. Niu, Z. Chen, and M. Song (2019) A novel bi-directional interrelated model for joint intent detection and slot filling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5467–5471. External Links: Link, Document Cited by: §5.2.6, §5.5.8, Table 7, §7.1.1, Table 10, Table 9.
  • M. Firdaus, S. Bhatnagar, A. Ekbal, and P. Bhattacharyya (2018a) A deep learning based multi-task ensemble model for intent detection and slot filling in spoken language understanding. In Neural Information Processing, L. Cheng, A. C. S. Leung, and S. Ozawa (Eds.), Cham, pp. 647–658. External Links: ISBN 978-3-030-04212-7 Cited by: 6th item, §5.3.1, §5.3.1, §5.5.3, §5.5.3, Table 7, Table 9.
  • M. Firdaus, S. Bhatnagar, A. Ekbal, and P. Bhattacharyya (2018b) Intent detection for spoken language understanding using a deep ensemble model. In Lecture Notes in Computer Science, pp. 629–642. External Links: Document, Link Cited by: Table 5, §3.3.12, §7.2.2.
  • M. Firdaus, H. Golchha, A. Ekbal, and P. Bhattacharyya (2020) A deep multi-task model for dialogue act classification, intent detection and slot filling. Cognitive Computation 12. External Links: Document, ISBN 1866-9964, Link Cited by: §5.2.8, §5.3.1, §5.5.2, Table 7, Table 9.
  • M. Firdaus, A. Kumar, A. Ekbal, and P. Bhattacharyya (2019) A multi-task hierarchical approach for intent detection and slot filling. Knowledge-Based Systems 183, pp. 104846. External Links: ISSN 0950-7051, Document, Link Cited by: §5.3.1, §5.3.1, §5.5.8, Table 7, §7.2.2, §7.6, Table 9.
  • R. Gangadharaiah and B. Narayanaswamy (2019) Joint multiple intent detection and slot labeling for goal-oriented dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 564–569. External Links: Link, Document Cited by: §5.5.9, Table 7.
  • Y. Gong, X. Luo, Y. Zhu, W. Ou, Z. Li, M. Zhu, K. Q. Zhu, L. Duan, and X. Chen (2019) Deep cascade multi-task learning for slot filling in online shopping assistant. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, Honolulu, USA, pp. 6465–6472. External Links: Document Cited by: §4.3.11, Table 6, Table 11.
  • C. González-Caro and R. Baeza-Yates (2011) A multi-faceted approach to query intent classification. In String Processing and Information Retrieval, R. Grossi, F. Sebastiani, and F. Silvestri (Eds.), Berlin, Heidelberg, pp. 368–379. External Links: ISBN 978-3-642-24583-1 Cited by: Table 5, §3.3.13.
  • C. Goo, G. Gao, Y. Hsu, C. Huo, T. Chen, K. Hsu, and Y. Chen (2018) Slot-gated modeling for joint slot filling and intent prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, USA, pp. 753–757. Cited by: §5.2.4, §5.2.4, §5.2.4, §5.2.4, §5.5.3, Table 7, §7.1.1, §8, Table 10, Table 9, §9.
  • D. Guo, G. Tur, W. Yih, and G. Zweig (2014) Joint semantic utterance classification and slot filling with recursive neural networks. In 2014 IEEE Spoken Language Technology Workshop (SLT), South Lake Tahoe, USA, pp. 554–559. Cited by: §5.2.2, Table 7, Table 9.
  • A. Gupta, J. Hewitt, and K. Kirchhoff (2019) Simple, fast, accurate intent classification and slot labeling for goal-oriented dialogue systems. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, Stockholm, Sweden, pp. 46–55. External Links: Link, Document Cited by: §5.3.1, §5.5.6, Table 7, §8, Table 10, Table 9.
  • A. Gupta, P. Zhang, G. Lalwani, and M. Diab (2019) CASA-NLU: context-aware self-attentive natural language understanding for task-oriented chatbots. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 1285–1290. External Links: Link, Document Cited by: §5.5.2, §5.5.2, Table 7.
  • D. Hakkani-Tür, G. Tür, A. Celikyilmaz, Y. Chen, J. Gao, L. Deng, and Y. Wang (2016) Multi-domain joint semantic frame parsing using bi-directional rnn-lstm. In Interspeech, San Francisco, USA, pp. 715–719. Cited by: Table 1, 2nd item, §5.2.3, §5.3.2, §5.5.2, Table 7, §8, Table 10, Table 9.
  • S. Han, S. Long, H. Weld, H. Li, and J. Poon (2020) BANG!: bi-directional attentional nlu with graph neural networks for joint intent classification and slot filling. Cited by: §5.2.6, §5.3.1, §5.5.1, Table 7, Table 10, Table 9.
  • M. Hasanuzzaman, S. Saha, G. Dias, and S. Ferrari (2015) Understanding temporal query intent. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '15, Santiago, Chile, pp. 823–826. External Links: Document, Link Cited by: Table 5, §3.3.10, §3.3.2, §3.3.6.
  • H. B. Hashemi, A. Asiaee, and R. Kraft (2016) Query intent detection using convolutional neural networks. In International Conference on Web Search and Data Mining, Workshop on Query Understanding, San Francisco, USA. Cited by: Table 5, §3.3.4.
  • C. T. Hemphill, J. J. Godfrey, and G. R. Doddington (1990) The atis spoken language systems pilot corpus. In Proceedings of the Workshop on Speech and Natural Language, HLT ’90, Hidden Valley, Pennsylvania, USA, pp. 96–101. External Links: Link, Document Cited by: §6.2.
  • L. Hirschman (1992) Multi-site data collection for a spoken language corpus - mad cow. In Second International Conference on Spoken Language Processing (ICSLP’92), Banff, Canada, pp. 903–906. Cited by: §6.2.
  • L. Hou, Y. Li, C. Li, and M. Lin (2019) Review of research on task-oriented spoken language understanding. Journal of Physics: Conference Series 1267, pp. 012023. External Links: Document, Link Cited by: §1.2.
  • M. Jeong and G. G. Lee (2008) Triangular-chain conditional random fields. IEEE Transactions on Audio, Speech, and Language Processing 16 (7), pp. 1287–1302. Cited by: §5.2.1, §5.3.1, §5.3.1, §5.5.3, Table 7, §7.2.2, §7.6, Table 9.
  • S. Jung, J. Lee, and J. Kim (2018) Learning to embed semantic correspondence for natural language understanding. In Proceedings of the 22nd Conference on Computational Natural Language Learning, Brussels, Belgium, pp. 131–140. External Links: Link, Document Cited by: §5.4, Table 7.
  • N. Kanhabua, T. Ngoc Nguyen, and W. Nejdl (2015) Learning to detect event-related queries for web search. In Proceedings of the 24th International Conference on World Wide Web, WWW ’15 Companion, New York, NY, USA, pp. 1339–1344. External Links: ISBN 9781450334730, Link, Document Cited by: Table 5, §3.3.10.
  • J. Kim and Y. Kim (2018) Joint learning of domain classification and out-of-domain detection with dynamic class weighting for satisficing false acceptance rates. In Proc. Interspeech 2018, Hyderabad, India, pp. 556–560. External Links: Document, Link Cited by: Table 5, §3.3.12.
  • K. Kim, R. Jha, K. Williams, A. Marin, and I. Zitouni (2019) Slot tagging for task oriented spoken language understanding in human-to-human conversation scenarios. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), Hong Kong, China, pp. 757–767. External Links: Link, Document Cited by: §4.3.12, Table 6.
  • Y. Kim, S. Lee, and K. Stratos (2017) ONENET: joint domain, intent, slot prediction for spoken language understanding. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan, pp. 547–553. Cited by: 4th item, §5.3.1, Table 7.
  • M. Korpusik, Z. Liu, and J. Glass (2019) A Comparison of Deep Learning Methods for Language Understanding. In Proc. Interspeech 2019, Graz, Austria, pp. 849–853. External Links: Document, Link Cited by: §4.2.1, Table 6.
  • J. Krone, Y. Zhang, and M. Diab (2020) Learning to classify intents and slot labels given a handful of examples. External Links: 2004.10793 Cited by: §5.3.1, §5.5.4, Table 7, §6.5.
  • G. Kurata, B. Xiang, B. Zhou, and M. Yu (2016) Leveraging sentence-level information with encoder LSTM for semantic slot filling. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2077–2083. External Links: Link, Document Cited by: §4.3.1, §4.3.4, Table 6, Table 11.
  • J. Lee, D. Kim, R. Sarikaya, and Y. Kim (2018) Coupled representation learning for domains, intents and slots in spoken language understanding. In 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, pp. 714–719. Cited by: §5.2.5, §5.4, Table 7.
  • C. Li, C. Kong, and Y. Zhao (2018) A joint multi-task learning framework for spoken language understanding. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, pp. 6054–6058. Cited by: §5.5.2, Table 7, Table 9.
  • C. Li, L. Li, and J. Qi (2018) A self-attentive model with gate mechanism for spoken language understanding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3824–3833. External Links: Link, Document Cited by: §5.2.4, §5.5.5, Table 7, §7.1.1, Table 10, Table 9.
  • C. Li, Y. Zhao, and D. Yu (2019) Conditional joint model for spoken dialogue system. In Cognitive Computing – ICCC 2019, R. Xu, J. Wang, and L. Zhang (Eds.), Cham, pp. 26–36. External Links: ISBN 978-3-030-23407-2 Cited by: §5.2.4, §5.5.2, Table 7, §7.2.1, §7.3.2.
  • T. Lin and H. Xu (2019) Deep unknown intent detection with margin loss. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5491–5496. External Links: Link, Document Cited by: Table 5, §3.3.5, §8.
  • B. Liu and I. Lane (2015) Recurrent neural network structured output prediction for spoken language understanding. In Proceedings of NIPS Workshop on Machine Learning for Spoken Language Understanding and Interactions, Montreal, Canada. Cited by: §4.3.1, Table 6, Table 11.
  • B. Liu and I. Lane (2016a) Attention-based recurrent neural network models for joint intent detection and slot filling. In Interspeech 2016, San Francisco, USA, pp. 685–689. External Links: Document, Link Cited by: §5.2.3, §5.2.4, §5.3.2, §5.5.3, Table 7, §7.1.1, §8, Table 10, Table 11, Table 9.
  • B. Liu and I. Lane (2016b) Joint online spoken language understanding and language modeling with recurrent neural networks. In Proceedings of the SIGDIAL 2016 Conference, Los Angeles, USA, pp. 22–30. External Links: Link, 1609.01462 Cited by: §5.5.7, Table 7.
  • J. Liu, Y. Li, and M. Lin (2019) Review of intent detection methods in the human-machine dialogue system. Journal of Physics: Conference Series 1267, pp. 012059. External Links: Document, Link Cited by: §1.2.
  • Y. Liu, F. Meng, J. Zhang, J. Zhou, Y. Chen, and J. Xu (2019a) CM-net: a novel collaborative memory network for spoken language understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 1051–1060. External Links: Link, Document Cited by: §5.2.7, §5.3.1, §5.3.1, Table 7, §6.3, §7.3.1, Table 9.
  • Z. Liu, J. Shin, Y. Xu, G. I. Winata, P. Xu, A. Madotto, and P. Fung (2019b) Zero-shot cross-lingual dialogue systems with transferable latent variables. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 1297–1303. External Links: Link, Document Cited by: §5.5.3.
  • S. Louvan and B. Magnini (2018) Exploring named entity recognition as an auxiliary task for slot filling in conversational language understanding. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, Brussels, Belgium, pp. 74–80. External Links: Link, Document Cited by: §4.3.11, Table 6, Table 11.
  • S. Louvan and B. Magnini (2019) Leveraging non-conversational tasks for low resource slot filling: does it help?. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, Stockholm, Sweden, pp. 85–91. External Links: Link, Document Cited by: §4.3.11, §4.3.5, Table 6, §5.5.2.
  • M. Ma, K. Zhao, L. Huang, B. Xiang, and B. Zhou (2017) Jointly trained sequential labeling and classification by sparse attention neural networks. In Interspeech, Stockholm, Sweden, pp. 3334–3338. Cited by: §5.3.1, §5.3.2, §5.3.2, Table 7, §7.6.
  • R. Masumura, Y. Shinohara, R. Higashinaka, and Y. Aono (2018) Adversarial training for multi-task and multi-lingual joint modeling of utterance intent classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 633–639. External Links: Document, Link Cited by: Table 5, §3.3.3.
  • G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu, and G. Zweig (2015) Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Trans. Audio, Speech and Language Processing 23 (3), pp. 530–539. External Links: ISSN 2329-9290, Document Cited by: §4.2.1, §4.3.3, Table 6, Table 11.
  • G. Mesnil, X. He, L. Deng, and Y. Bengio (2013) Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding. In Interspeech, Lyon, France, pp. 3771–3775. Cited by: §4.2.1, Table 6, Table 11.
  • A. Mohasseb, M. Bader-El-Den, and M. Cocea (2018) Classification of factoid questions intent using grammatical features. ICT Express 4 (4), pp. 239–242. External Links: Document, Link Cited by: Table 5, §3.3.7, §7.2.
  • P. Ni, Y. Li, G. Li, and V. Chang (2020) Natural language understanding approaches based on joint task of intent detection and slot filling for iot voice interaction. Neural Computing and Applications 32. External Links: Document, ISBN 1433-3058, Link Cited by: §5.2.10, §5.3.1, Table 7.
  • J. Niu and G. Penn (2019) Rationally reappraising ATIS-based dialogue systems. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5503–5507. Cited by: §6.2.
  • E. Okur, S. H. Kumar, S. Sahay, A. A. Esme, and L. Nachman (2019) Natural language interactions in autonomous vehicles: intent detection and slot filling from passenger utterances. External Links: 1904.10500 Cited by: §5.3.1, §5.3.2, §5.4, Table 7.
  • D. S. Pallett, N. L. Dahlgren, J. G. Fiscus, W. M. Fisher, J. S. Garofolo, and B. C. Tjaden (1992) DARPA february 1992 atis benchmark test results. In Proceedings of the Workshop on Speech and Natural Language, HLT ’91, USA, pp. 15–27. External Links: ISBN 1558602720, Link, Document Cited by: §6.2.
  • L. Pan, Y. Zhang, F. Ren, Y. Hou, Y. Li, X. Liang, and Y. Liu (2018) A multiple utterances based neural network model for joint intent detection and slot filling. In Proceedings of the Evaluation Tasks at the China Conference on Knowledge Graph and Semantic Computing (CCKS 2018), Tianjin, China, pp. 25–33. Cited by: §5.3.1, §5.5.2, §5.5.3, Table 7.
  • B. Peng and K. Yao (2015) Recurrent neural networks with external memory for language understanding. External Links: 1506.00195 Cited by: §4.3.6, Table 6, Table 11.
  • S. Pentyala, M. Liu, and M. Dreyer (2019) Multi-task networks with universe, group, and task feature learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 820–830. External Links: Link, Document Cited by: §5.2.8, §5.3.1, §5.3.1, Table 7, Table 9.
  • H. Purohit, G. Dong, V. Shalin, K. Thirunarayan, and A. Sheth (2015) Intent classification of short-text on social media. In 2015 IEEE International Conference on Smart City/SocialCom/SustainCom (SmartCity), Chengdu, China, pp. 222–228. External Links: Document, Link Cited by: Table 5, §3.3.1, §3.3.6, §3.3.8.
  • L. Qin, W. Che, Y. Li, H. Wen, and T. Liu (2019) A stack-propagation framework with token-level intent detection for spoken language understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Hong Kong, pp. 2078–2087. External Links: Link, Document Cited by: §5.2.4, §5.3.1, §5.4, Table 7, §8, Table 10, Table 9.
  • L. Qin, M. Ni, Y. Zhang, and W. Che (2020) CoSDA-ml: multi-lingual code-switching data augmentation for zero-shot cross-lingual nlp. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, C. Bessiere (Ed.), pp. 3853–3860. Note: Main track External Links: Document, Link Cited by: §5.5.3.
  • L. Qin, X. Xu, W. Che, and T. Liu (2020) AGIF: an adaptive graph-interactive framework for joint multiple intent detection and slot filling. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 1807–1816. External Links: Link, Document Cited by: §5.5.1, §5.5.9, §7.6.
  • L. Qiu, Y. Chen, H. Jia, and Z. Zhang (2018) Query intent recognition based on multi-class features. IEEE Access 6, pp. 52195–52204. External Links: Document, Link Cited by: Table 5, §3.3.10, §3.3.11, §3.3.12.
  • S. Ravuri and A. Stolcke (2015) Recurrent neural network and lstm models for lexical utterance classification. In Proc. Interspeech, Dresden, Germany, pp. 135–139. External Links: Link Cited by: Table 5, §3.3.5, §8.
  • A. Ray, Y. Shen, and H. Jin (2018) Robust spoken language understanding via paraphrasing. In Proc. Interspeech 2018, Hyderabad, India, pp. 3454–3458. External Links: Document, Link Cited by: §5.5.5, Table 7.
  • A. Ray, Y. Shen, and H. Jin (2019) Iterative delexicalization for improved spoken language understanding. In Interspeech, Graz, Austria, pp. 1183–1187. External Links: Document Cited by: §5.5.5, Table 7.
  • C. Raymond and G. Riccardi (2007) Generative and discriminative algorithms for spoken language understanding. In Interspeech, Antwerp, Belgium, pp. 1605–1608. Cited by: §6.2, §6.2, §6.2.
  • F. Ren and S. Xue (2020) Intention detection based on siamese neural network with triplet loss. IEEE Access 8, pp. 82242–82254. Cited by: Table 5, §3.3.1.
  • R. Sarikaya, G. E. Hinton, and B. Ramabhadran (2011) Deep belief nets for natural language call-routing. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, pp. 5680–5683. External Links: Document, Link Cited by: Table 5, §3.3.2, §3.3.7.
  • S. Schuster, S. Gupta, R. Shah, and M. Lewis (2019) Cross-lingual transfer learning for multilingual task oriented dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 3795–3805. External Links: Link, Document Cited by: §5.5.3, Table 7.
  • I. V. Serban, R. Lowe, P. Henderson, L. Charlin, and J. Pineau (2018) A survey of available corpora for building data-driven dialogue systems: the journal version. Dialogue & Discourse 9 (1), pp. 1–49. Cited by: §1.2.
  • Y. Shen, W. Chen, and H. Jin (2019) Interpreting and improving deep neural slu models via vocabulary importance. In Interspeech, Graz, Austria, pp. 1328–1332. External Links: Document Cited by: §5.5.3, Table 7.
  • Y. Shen, X. Zeng, and H. Jin (2019) A progressive model to enable continual learning for semantic slot filling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 1279–1284. External Links: Link, Document Cited by: §4.3.8, Table 6, Table 11.
  • Y. Shen, X. Zeng, Y. Wang, and H. Jin (2018) User information augmented semantic frame parsing using progressive neural networks. In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, B. Yegnanarayana (Ed.), pp. 3464–3468. External Links: Link, Document Cited by: §5.5.2, §5.5.4, Table 7.
  • Y. Shi, K. Yao, H. Chen, Y. Pan, M. Hwang, and B. Peng (2015) Contextual spoken language understanding using recurrent neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, Australia, pp. 5271–5275. Cited by: §5.3.1, §5.5.2, §5.5.2, Table 7, Table 9.
  • Y. Shin, K. Yoo, and S. Lee (2018) Slot filling with delexicalized sentence generation. In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, pp. 2082–2086. External Links: Document Cited by: §4.3.4, Table 6.
  • K. Shridhar, A. Dash, A. Sahu, G. G. Pihlgren, P. Alonso, V. Pondenkandath, G. Kovacs, F. Simistira, and M. Liwicki (2019) Subword semantic hashing for intent classification on small datasets. In 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, pp. 1–6. External Links: Document, Link Cited by: Table 5, §3.3.5, §3.3.8.
  • A. Siddhant, A. K. Goyal, and A. Metallinou (2019) Unsupervised transfer learning for spoken language understanding in intelligent agents. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, Honolulu, USA, pp. 4959–4966. External Links: Document Cited by: §5.5.4, Table 7, Table 9.
  • I. Staliūnaitė and I. Iacobacci (2020) Auxiliary capsules for natural language understanding. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, pp. 8154–8158. Cited by: §5.2.5, §5.5.2, §5.5.2, Table 7.
  • Y. Tam, Y. Shi, H. Chen, and M. Hwang (2015) RNN-based labeled data generation for spoken language understanding. In Interspeech, Dresden, Germany, pp. 125–129. Cited by: §5.5.4.
  • H. Tang, D. Ji, and Q. Zhou (2020) End-to-end masked graph-based crf for joint slot filling and intent detection. Neurocomputing 413, pp. 348–359. External Links: Document, ISBN 0925-2312 Cited by: §5.2.9, Table 7, Table 9.
  • Q. N. Thi Do and J. Gaspers (2019) Cross-lingual transfer learning for spoken language understanding. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, pp. 5956–5960. Cited by: §5.2.4, §5.2.4, §5.3.1, §5.5.3, Table 7.
  • G. Tur, A. Celikyilmaz, X. He, D. Hakkani-Tür, and L. Deng (2018) Deep learning in conversational language understanding. In Deep Learning in Natural Language Processing, L. Deng and Y. Liu (Eds.), pp. 23–48. External Links: ISBN 978-981-10-5209-5, Document, Link Cited by: §1.2.
  • G. Tur and R. De Mori (2011) Spoken language understanding: systems for extracting semantic information from speech. John Wiley and Sons, New York, USA. External Links: Link Cited by: §1.2.
  • G. Tur, D. Hakkani-Tur, L. Heck, and S. Parthasarathy (2011) Sentence simplification for spoken language understanding. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, pp. 5628–5631. Cited by: Table 5, §3.3.6.
  • G. Tur, D. Hakkani-Tür, and L. Heck (2010) What is left to be understood in atis?. In 2010 IEEE Spoken Language Technology Workshop, Berkeley, USA, pp. 19–24. Cited by: §6.2, §6.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems 30, pp. 5998–6008. External Links: Link Cited by: §5.2.4, §5.2.4.
  • A. P. B. Veyseh, F. Dernoncourt, and T. H. Nguyen (2019) Improving slot filling by utilizing contextual information. External Links: 1911.01680 Cited by: §4.3.11, Table 6.
  • T. Vu, P. Gupta, H. Adel, and H. Schütze (2016) Bi-directional recurrent neural network with ranking loss for spoken language understanding. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, pp. 6060–6064. External Links: Document Cited by: §4.2.1, Table 6, §8, Table 11.
  • T. Vu (2016) Sequential convolutional neural networks for slot filling in spoken language understanding. In Interspeech, San Francisco, USA, pp. 3250–3254. External Links: Document Cited by: §4.2.1, Table 6, Table 11.
  • C. Wang, Z. Huang, and M. Hu (2020a) SASGBC: improving sequence labeling performance for joint learning of slot filling and intent detection. In Proceedings of 2020 the 6th International Conference on Computing and Data Engineering, ICCDE 2020, New York, NY, USA, pp. 29–33. External Links: ISBN 9781450376730, Link, Document Cited by: §5.2.4, §5.3.1, §5.5.8, Table 7, Table 9.
  • X. Wang and C. Yuan (2016) Recent advances on human-computer dialogue. CAAI Transactions on Intelligence Technology 1 (4), pp. 303–312. External Links: Link Cited by: §1.2.
  • Y. Wang, L. Deng, and A. Acero (2005) Spoken language understanding: an introduction to the statistical framework. IEEE Signal Processing Magazine 22 (5), pp. 16–31. External Links: Link Cited by: §4.2.
  • Y. Wang (2010) Strategies for statistical spoken language understanding with small amount of data - an empirical study. In Interspeech, Makuhari, Japan, pp. 2498–2501. Cited by: §5.2.1, §5.5.4, Table 7.
  • Y. Wang, Y. Deng, Y. Shen, and H. Jin (2020b) A new concept of multiple neural networks structure using convex combination. IEEE Transactions on Neural Networks and Learning Systems 31, pp. 1–12. Cited by: §5.5.6, Table 7, Table 9.
  • Y. Wang, A. Patel, and H. Jin (2018) A new concept of deep reinforcement learning based augmented general tagging system. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 1683–1693. External Links: Link Cited by: §4.3.9, Table 6, Table 11.
  • Y. Wang, Y. Shen, and H. Jin (2018a) A bi-model based rnn semantic frame parsing model for intent detection and slot filling. External Links: 1812.10235 Cited by: §5.2.6, Table 7, Table 10, Table 9.
  • Y. Wang, T. He, R. Fan, W. Zhou, and X. Tu (2019) Effective utilization of external knowledge and history context in multi-turn spoken language understanding model. In 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, USA, pp. 960–967. Cited by: §5.5.1, §5.5.2, Table 7.
  • Y. Wang, J. Huang, T. He, and X. Tu (2019) Dialogue intent classification with character-CNN-BGRU networks. Multimedia Tools and Applications 79 (7-8), pp. 4553–4572. External Links: Document, Link Cited by: Table 5, §3.3.7.
  • Y. Wang, L. Tang, and T. He (2018b) Attention-based cnn-blstm networks for joint intent detection and slot filling. In Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, M. Sun, T. Liu, X. Wang, Z. Liu, and Y. Liu (Eds.), Cham, pp. 250–261. External Links: ISBN 978-3-030-01716-3 Cited by: §5.3.1, §5.3.1, §5.3.2, Table 7, Table 9.
  • L. Wen, X. Wang, Z. Dong, and H. Chen (2018) Jointly modeling intent identification and slot filling with contextual and hierarchical information. In Natural Language Processing and Chinese Computing, X. Huang, J. Jiang, D. Zhao, Y. Feng, and Y. Hong (Eds.), Cham, pp. 3–15. External Links: ISBN 978-3-319-73618-1 Cited by: 5th item, §5.5.9, Table 7.
  • C. Xia, C. Zhang, X. Yan, Y. Chang, and P. Yu (2018) Zero-shot user intent detection via capsule neural networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3090–3099. External Links: Document, Link Cited by: Table 5, §3.3.4.
  • W. Xie, D. Gao, R. Ding, and T. Hao (2018) A feature-enriched method for user intent classification by leveraging semantic tag expansion. In Natural Language Processing and Chinese Computing, pp. 224–234. External Links: Document, Link Cited by: Table 5, §3.3.6.
  • C. Xu, Q. Li, D. Zhang, J. Cui, Z. Sun, and H. Zhou (2020) A model with length-variable attention for spoken language understanding. Neurocomputing 379, pp. 197–202. External Links: Document, ISBN 0925-2312 Cited by: §5.2.4, §5.4, Table 7.
  • P. Xu and R. Sarikaya (2013) Exploiting shared information for multi-intent natural language sentence classification. In Interspeech, Lyon, France, pp. 3785–3789. Cited by: §3.3.13.
  • P. Xu and R. Sarikaya (2013) Convolutional neural network based triangular crf for joint intent detection and slot filling. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Vancouver, Canada, pp. 78–83. Cited by: §5.3.1, Table 7, Table 9.
  • X. Yang, Y. Chen, D. Hakkani-Tür, P. Crook, X. Li, J. Gao, and L. Deng (2017) End-to-end joint learning of natural language understanding and dialogue manager. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, pp. 5690–5694. Cited by: §5.5.2, Table 7.
  • K. Yao, B. Peng, Y. Zhang, D. Yu, G. Zweig, and Y. Shi (2014) Spoken language understanding using long short-term memory neural networks. In 2014 IEEE Spoken Language Technology Workshop (SLT), South Lake Tahoe, USA, pp. 189–194. External Links: Document Cited by: §4.3.1, §4.3.3, §4.3.6, Table 6, Table 11.
  • K. Yao, B. Peng, G. Zweig, D. Yu, X. Li, and F. Gao (2014) Recurrent conditional random field for language understanding. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, pp. 4077–4081. Cited by: §4.3.1, §4.3.3, Table 6, Table 11.
  • K. Yao, G. Zweig, M. Hwang, Y. Shi, and D. Yu (2013) Recurrent neural networks for language understanding. In Interspeech, Lyon, France, pp. 2524–2528. External Links: Document Cited by: §4.2.1, Table 6, Table 11.
  • E. H. Yilmaz and C. Toraman (2020) KLOOS: kl divergence-based out-of-scope intent detection in human-to-machine conversations. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, New York, NY, USA, pp. 2105–2108. External Links: ISBN 9781450380164, Link, Document Cited by: Table 5, §3.3.12.
  • D. Yu, S. Wang, and L. Deng (2011) Sequential labeling using deep-structured conditional random fields. Selected Topics in Signal Processing, IEEE Journal of 4, pp. 965–973. External Links: Document Cited by: §4.3.2, Table 6, §7.3.3.
  • S. Yu, L. Shen, P. Zhu, and J. Chen (2018) ACJIS: a novel attentive cross approach for joint intent detection and slot filling. In 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, pp. 1–7. Cited by: §5.2.4, §5.5.8, Table 7, Table 9.
  • Y. He and S. Young (2003) A data-driven spoken language understanding system. In 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721), Piscataway, USA, pp. 583–588. Cited by: §6.2, §6.2.
  • C. Zhang, W. Fan, N. Du, and P. S. Yu (2016) Mining user intentions from medical queries: a neural network based heterogeneous jointly modeling approach. In Proceedings of the 25th International Conference on World Wide Web, WWW ’16, Republic and Canton of Geneva, CHE, pp. 1373–1384. External Links: ISBN 9781450341431, Link, Document Cited by: Table 5, §3.3.9, §7.2.1.
  • C. Zhang, Y. Li, N. Du, W. Fan, and P. Yu (2019) Joint slot filling and intent detection via capsule neural networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5259–5267. External Links: Link, Document Cited by: §5.2.5, §5.2.7, Table 7, Table 10, Table 9.
  • D. Zhang, Z. Fang, Y. Cao, Y. Liu, X. Chen, and J. Tan (2018) Attention-based RNN model for joint extraction of intent and word slot based on a tagging strategy. In Artificial Neural Networks and Machine Learning – ICANN 2018, V. Kůrková, Y. Manolopoulos, B. Hammer, L. Iliadis, and I. Maglogiannis (Eds.), Cham, pp. 178–188. External Links: ISBN 978-3-030-01424-7 Cited by: §5.4, Table 7.
  • L. Zhang, D. Ma, X. Zhang, X. Yan, and H. Wang (2020) Graph LSTM with context-gated mechanism for spoken language understanding. In AAAI 2020, New York, USA. Cited by: §5.2.9, §5.3.1, §5.5.5, Table 7, §7.1.1.
  • L. Zhang and H. Wang (2019) Using bidirectional Transformer-CRF for spoken language understanding. In Natural Language Processing and Chinese Computing, J. Tang, M. Kan, D. Zhao, S. Li, and H. Zan (Eds.), Cham, pp. 130–141. External Links: ISBN 978-3-030-32233-5 Cited by: §5.2.4, §5.3.2, §5.5.8, Table 7, Table 9.
  • S. Zhang, J. Jiang, Z. He, X. Zhao, and J. Fang (2019a) A novel slot-gated model combined with a key verb context feature for task request understanding by service robots. IEEE Access 7, pp. 105937–105947. Cited by: §5.2.4, §5.3.1, Table 7.
  • X. Zhang and H. Wang (2016) A joint model of intent determination and slot filling for spoken language understanding. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16), New York, USA, pp. 2993–2999. Cited by: §5.1, §5.3.1, §5.3.2, §5.5.3, §5.5.5, Table 7, Table 9.
  • Z. Zhang, H. Huang, and K. Wang (2020) Using deep time delay neural network for slot filling in spoken language understanding. Symmetry 12 (6). Cited by: §4.3.2, Table 6, Table 11.
  • Z. Zhang, Z. Zhang, H. Chen, and Z. Zhang (2019b) A joint learning framework with BERT for spoken language understanding. IEEE Access 7, pp. 168849–168858. Cited by: §5.2.4, §5.3.1, §5.3.2, §5.5.3, Table 7, Table 9.
  • L. Zhao and Z. Feng (2018) Improving slot filling in spoken language understanding with joint pointer and attention. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 426–431. External Links: Link, Document Cited by: §4.3.10, Table 6.
  • X. Zhao, E. Haihong, and M. Song (2018) A joint model based on CNN-LSTMs in dialogue understanding. In 2018 International Conference on Information Systems and Computer Aided Education (ICISCAE), Piscataway, USA, pp. 471–475. Cited by: §5.3.2, Table 7.
  • Y. Zheng, Y. Liu, and J. H. L. Hansen (2017) Intent detection and semantic parsing for navigation dialogue language processing. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), Yokohama, Japan, pp. 1–6. Cited by: 3rd item, §5.2.3, Table 7.
  • Q. Zhou, L. Wen, X. Wang, L. Ma, and Y. Wang (2016) A hierarchical LSTM model for joint tasks. In Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, M. Sun, X. Huang, H. Lin, Z. Liu, and Y. Liu (Eds.), Cham, pp. 324–335. External Links: ISBN 978-3-319-47674-2 Cited by: 1st item, §5.3.2, Table 7.
  • S. Zhu and K. Yu (2017) Encoder-decoder with focus-mechanism for sequence labelling based spoken language understanding. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, USA, pp. 5675–5679. External Links: Document Cited by: §4.3.1, Table 6, Table 11.
  • S. Zhu, Z. Zhao, R. Ma, and K. Yu (2020) Prior knowledge driven label embedding for slot filling in natural language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing PP, pp. 1–1. External Links: Document Cited by: §4.3.7, Table 6.