Meta learning to classify intent and slot labels with noisy few shot examples

11/30/2020 ∙ by Shang-Wen Li, et al. ∙ Amazon

Recently, deep learning has dominated many machine learning areas, including spoken language understanding (SLU). However, deep learning models are notorious for being data-hungry, and the heavily optimized models are usually sensitive to the quality of the training examples provided and to the consistency between training and inference conditions. To improve the performance of SLU models on tasks with noisy and low training resources, we propose a new SLU benchmarking task: few-shot robust SLU, where SLU comprises two core problems, intent classification (IC) and slot labeling (SL). We establish the task by defining few-shot splits on three public IC/SL datasets, ATIS, SNIPS, and TOP, and adding two types of natural noise (adaptation example missing/replacing and modality mismatch) to the splits. We further propose a novel noise-robust few-shot SLU model based on prototypical networks. We show that the model consistently outperforms the conventional fine-tuning baseline and another popular meta-learning method, Model-Agnostic Meta-Learning (MAML), in terms of achieving better IC accuracy and SL F1 and yielding smaller performance variation when noise is present.






1 Introduction and Related Works

Goal-oriented dialogue systems are a hot topic in machine learning research. These systems have widespread applications in industry and are the foundation of many successful products, including Alexa, Siri, Google Assistant, and Cortana. One core component of a dialog system is spoken language understanding (SLU), which consists of two main problems, intent classification (IC) and slot labeling (SL) Tür et al. (2002); Hakkani-Tür et al. (2006). In IC, we attempt to classify the goal of a user query, usually input as text or transcribed from audio by an automatic speech recognition (ASR) system. SL, similar to the named-entity recognition (NER) problem, aims to label each token in a query with an entity type; the main difference is that entity types in SL are domain-specific and based upon the dialog ontology. Recent advances in neural models have greatly improved SLU Yao et al. (2014); Guo et al. (2014); Mesnil et al. (2014); Cao et al. (2020); Lai et al. (2020a).

However, two significant challenges hinder the broad application and expansion of SLU models in industrial settings. First, neural methods require a large amount of labeled data for training Liu et al. (2020). SLU is often coupled with the ontology of the underlying dialog system and is thus domain-dependent; collecting a large amount of in-domain labeled data for neural models is prohibitively expensive and time-consuming. Second, the performance of SLU models in practice often fluctuates due to various types of noise. One common noise is adaptation data perturbation. In many industrial applications such as cloud services (e.g., Alexa ASK and Google DialogFlow), the SLU model is built by fine-tuning (or adapting) a pre-trained, shared network to the target domain with data provided by developers. The developers often have limited background in SLU and machine learning, so the data provided varies in quality and is subject to different types of perturbation, such as missing or replaced data samples (e.g., a subset of an optimal example set is missing, or is replaced by redundant examples, in the adaptation data) and typos. Another common noise comes from the mismatch of input modalities between the adaptation and inference stages. For instance, the model may be adapted on human transcription yet deployed to understand ASR-decoded text, or the input at the adaptation and inference stages may rely on different versions of the ASR model. Given that most neural methods comprise a large number of parameters and are heavily optimized for the training (i.e., adaptation, in the context of cloud services) data provided, the resulting model is usually sensitive to these noises. The requirement of noise-free adaptation and inference conditions also prohibits the use of neural SLU techniques, because it is often infeasible to achieve such conditions.

Transfer learning and meta-learning are two conventional techniques that have been applied to address the challenge of data scarcity. Transfer learning usually refers to pre-training initial models on mismatched domains with rich human annotations and then adapting the models with limited labels in the targeted domains. Previous works Yang et al. (2017); Kumar et al. (2017); Schuster et al. (2019); Chen et al. (2019); Lai et al. (2020b) have shown promising results in applying transfer learning to SLU. Note that pre-training as discussed here covers both directly using a pre-trained language model like BERT Devlin et al. (2018) and further training the pre-trained model on downstream-task data in mismatched domains. In the following, we focus on the latter, because it utilizes data from other domains better and yields higher accuracy. In recent years, meta-learning has gained growing interest in the machine learning community for tackling few-shot learning (i.e., data scarcity) scenarios. Model-Agnostic Meta-Learning (MAML) Finn et al. (2017) focuses on learning a parameter initialization from multiple subtasks, such that the initialization can be fine-tuned with few labels and yield good performance on targeted tasks. Metric-based meta-learning, including prototypical networks (ProtoNets) Snell et al. (2017); Luo et al. (2020) and matching networks Vinyals et al. (2016), aims to learn an embedding or metric space that generalizes to domains unseen in the training set after adaptation with a small number of examples from the unseen domains. Recent work unveils excellent potential in applying meta-learning techniques to SLU in the few-shot learning context Krone et al. (2020).

Compared to data scarcity, another challenge for SLU, robustness against noise, is also gaining attention. Simulated ASR errors have been used to augment training data for SLU models Simonnet et al. (2018). Researchers also leverage information from confusion networks or lattices Hakkani-Tür et al. (2006); Tür et al. (2013); Ladhak et al. (2016); Huang and Chen (2019); Masumura et al. (2018); Yaman et al. (2008), and adversarial training techniques Zhu et al. (2018); Lee et al. (2019), so that models learn query embeddings that are robust against ASR errors. For text input, methods have also been explored for model robustness against noise from misspellings and acronyms Li and Liu (2015). In contrast to these noise types, to the best of our knowledge, there is no prior work investigating the impact of missing or replaced examples in adaptation data. Moreover, the intersection of data scarcity and noise robustness is unexplored. Since the scarcity of labeled data and data noisiness usually co-occur in SLU applications (both reflect the difficulty of acquiring annotated data), the lack of studies in this intersectional area hinders the use of neural SLU models and their expansion to broader use cases.

Given this deficiency, we establish a novel few-shot noisy SLU task by introducing two common types of natural noise, adaptation example missing/replacing and modality mismatch, to the previously defined few-shot IC/SL splits Krone et al. (2020). The task is built upon three public datasets, ATIS Hemphill et al. (1990), SNIPS Coucke et al. (2018), and TOP Gupta et al. (2018). We further propose a noise-robust few-shot SLU model based on ProtoNets for the established task. In summary, our primary contributions are threefold: 1) formulating the first few-shot noisy SLU task and evaluation framework, 2) proposing the first working solution for few-shot noisy SLU based on the existing ProtoNet algorithm, and 3) comparing, in the context of noisy and scarce learning examples, the performance of the proposed method with conventional techniques, including MAML and fine-tuning based adaptation.

2 Approaches

In this section, we detail the formulation of the few-shot noisy SLU task and elaborate on the method we propose to tackle it.

2.1 Problem formulation

2.1.1 Few-shot SLU

The goal of few-shot SLU is to adapt an IC/SL classifier pre-trained on data D_pre, drawn from mismatched domains with rich annotation, to new domains using data D_test, which comprises a few labeled examples per class. In this setting, the pre-training and test splits use two disjoint class sets, C_pre and C_test, from D_pre and D_test respectively. Classes in C_pre are used for model pre-training, while those in C_test are held out until test time to adapt the pre-trained model and evaluate generalizability. The test is done episodically, where each episode is a mini adaptation task containing a support set S and a query set Q. Adaptation to new domains is achieved with the few labeled examples in the support set, whereas model performance is evaluated by averaging metrics measured on each episode's query set. In our implementation, we also pre-train meta-learning methods on episodes, since a previous study showed that pre-training with a matched condition yields better performance Krone et al. (2020).

We follow the setup in Krone et al. (2020) to build support and query sets. We define S_t = {(x_1^c, y_1^c), ..., (x_k^c, y_k^c) : c ∈ C_t}, where t is either the pre-train or test split. Here x_i^c is the input utterance of the i-th example for intent (i.e., class) c, y_i^c is the slot label sequence of x_i^c, and k is the number of examples per class in the support set. The query set Q_t is defined similarly, with k_q examples per class and S_t ∩ Q_t = ∅. In this way, we construct an episode by sampling k and k_q examples per intent from D_t for S_t and Q_t respectively. Note that in practice D_test consists of more than a few examples per class, and thus we downsample to resemble a few-shot learning task in each episode. Besides, for fair evaluation, we map slot labels that appear in the query set but not in the support set to Other, which is excluded when we evaluate SL performance. The mapping also guarantees that no gradient is updated for slot labels unseen in the support set during episodic pre-training.
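As a rough sketch of this episode construction (with hypothetical data structures and function names; k and k_q defaults, and the mapping of query-only slot labels to Other, follow the setup described above), one might write:

```python
import random

OTHER = "Other"

def build_episode(data, intents, k=10, k_q=10, seed=0):
    """Sample a few-shot IC/SL episode.

    `data` maps intent -> list of (utterance_tokens, slot_labels) pairs.
    Returns disjoint support and query sets; slot labels appearing only
    in the query set are mapped to Other so they are excluded from SL
    evaluation.
    """
    rng = random.Random(seed)
    support, query = [], []
    for intent in intents:
        # Downsample k + k_q distinct examples so support and query are disjoint.
        examples = rng.sample(data[intent], k + k_q)
        support += [(x, y, intent) for x, y in examples[:k]]
        query += [(x, y, intent) for x, y in examples[k:]]
    # Map slot labels unseen in the support set to Other.
    support_slots = {s for _, y, _ in support for s in y}
    query = [(x, [s if s in support_slots else OTHER for s in y], c)
             for x, y, c in query]
    return support, query
```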

2.1.2 Few-shot noisy SLU

Noise usually co-occurs with data scarcity in industrial SLU settings. Thus, we further formulate a few-shot noisy SLU task for benchmarking and model development. The task is built upon the few-shot SLU described above, and the goal is to adapt the IC/SL classifier with few examples such that the resulting model performs well and robustly in new domains when noise exists. Specifically, we investigate two types of noise: adaptation example missing/replacing and modality mismatch. We select these two types of noise because they are common in cloud services, where the input modality at deployment can differ from that at development, and the provided adaptation data and its quality can fluctuate due to developers' limited background or deletions made per user privacy concerns. We plan to explore other prevalent noises, including typos and acronyms, in future work.

For adaptation example missing/replacing, D_pre, C_pre, and C_test are kept identical to the ones in few-shot SLU, while the support set in the test split, S'_test, is a perturbed version of S_test. In the following experiment, we perturb S_test by either removing p examples per intent, or replacing p examples per intent with ones sampled from D_test but excluded from S_test and Q_test. We choose to study these two types of perturbation because they are the basic operations on the example set at the utterance level; a more complicated variation can be built by combining the two. To quantify model robustness against the missing and replacing operations, in each episode at the test stage we adapt the pre-trained classifier with S_test and S'_test separately, and evaluate on Q_test to measure the performance difference between models adapted with the original and perturbed support sets. With this setup, we estimate how well and how robustly classifiers can perform given a network pre-trained on mismatched but richly annotated domains and a small, perturbed adaptation set. Good and robust performance in such a setting is especially useful in the context of cloud services.
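The remove/replace perturbation described above can be sketched as follows (an illustrative helper with hypothetical data structures, not the paper's implementation; `pool` stands in for examples from the same split that are in neither the support nor the query set):

```python
import random

def perturb_support(support, pool, p=1, mode="remove", seed=0):
    """Perturb a few-shot support set, as in the missing/replacing noise.

    `support` maps intent -> list of examples; `pool` maps intent ->
    held-out examples from the same split. Removes p examples per intent,
    or replaces p of them with examples drawn from `pool`.
    """
    rng = random.Random(seed)
    perturbed = {}
    for intent, examples in support.items():
        keep = rng.sample(examples, len(examples) - p)  # drop p examples
        if mode == "replace":
            keep += rng.sample(pool[intent], p)  # swap in other examples
        perturbed[intent] = keep
    return perturbed
```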

In modality mismatch, the goal is to benchmark the performance impact when the data preprocessing pipeline used in pre-training and adaptation differs from the one used at inference. We simulate this noise type by replacing the examples in Q_test with ASR hypotheses and the corresponding IC/SL annotation. Human transcription is still used in the pre-training split and in S_test for adaptation. Since most SLU benchmarking datasets only provide IC/SL annotation on human transcription, further data processing is required. Here we adopt a common technique, noise-corrupted synthesized speech Wang et al. (2018); Fazel-Zarandi et al. (2017), to obtain ASR hypotheses and the corresponding annotation. We first apply TTS to the text input and feed the synthesized audio into ASR to generate the ASR hypothesis. Levenshtein alignment between tokens in the hypothesis and the original text is then used to project the SL annotation onto the hypothesis (note that intent labels are not affected by ASR results). The projected SL labels are reviewed and corrected by human annotators for the experiment. In this manner, we generate examples to measure model performance and robustness when the data preprocessing pipeline, i.e., text input from humans vs. audio input recognized by ASR, is mismatched.
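A simplified sketch of projecting slot labels onto an ASR hypothesis might look like the following. Note the assumptions: the paper uses Levenshtein alignment, while this sketch uses `difflib`'s matching blocks as a stand-in, and inserted or substituted tokens simply fall back to a default label; the function name and the `"O"` default are illustrative.

```python
from difflib import SequenceMatcher

def project_slots(ref_tokens, ref_slots, hyp_tokens, other="O"):
    """Project slot labels from reference tokens onto ASR hypothesis tokens.

    Tokens that align exactly to a reference token inherit its slot label;
    inserted or substituted tokens fall back to `other`.
    """
    hyp_slots = [other] * len(hyp_tokens)
    matcher = SequenceMatcher(a=ref_tokens, b=hyp_tokens, autojunk=False)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            hyp_slots[block.b + offset] = ref_slots[block.a + offset]
    return hyp_slots
```

In the paper's pipeline, such automatically projected labels are then reviewed and corrected by human annotators.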

2.2 Learning frameworks

Figure 1: The architectures of the IC-SL joint classifiers built with the various learning frameworks, using three popular SLU encoder configurations: ELMo, GloVe, and BERT.

To build robust models for few-shot noisy SLU, we propose ProtoNets Snell et al. (2017) based SLU. ProtoNets is a popular meta-learning framework for the few-shot learning scenario. We apply ProtoNets to the IC and SL problems by first representing the input utterance and its tokens with an encoder f. We then compute prototypes, i.e., centroids of examples, of each intent and slot class with the embeddings. That is,

p_c = (1/|S_c|) Σ_{x ∈ S_c} f(x),    p_s = (1/|T_s|) Σ_{(x, j) ∈ T_s} f_j(x).

Here p_c and p_s are the prototypes for intent c and slot s; f(x) and f_j(x) are the embeddings of utterance x and of token j in utterance x respectively; j is the token index; S_c is the set of support utterances with intent c; and T_s = {(x, j) : token j of utterance x is labeled with slot s} is the token-level example set for slot s. Given an example x, we predict the intent and each token's slot by computing the softmax of the negative distance from the example to the prototypes. Specifically,

P(y = c | x) = exp(−d(f(x), p_c)) / Σ_{c'} exp(−d(f(x), p_{c'})),    P(y_j = s | x) = exp(−d(f_j(x), p_s)) / Σ_{s'} exp(−d(f_j(x), p_{s'})).
We denote the approach of building IC and SL classifiers with this framework as Proto.
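As an illustrative sketch of the prototype computation and distance-softmax prediction (not the paper's implementation; squared Euclidean distance is assumed, which is standard for ProtoNets, and the function names are my own):

```python
import numpy as np

def prototypes(embeddings, labels):
    """Compute class prototypes as centroids of support embeddings."""
    classes = sorted(set(labels))
    protos = np.stack([embeddings[np.array(labels) == c].mean(axis=0)
                       for c in classes])
    return classes, protos

def proto_predict(query_emb, protos):
    """Class probabilities: softmax over negative squared distances."""
    dists = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    logits = -dists
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)
```

The same two functions serve both IC (utterance embeddings) and SL (token embeddings), matching the shared formulation above.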

In our implementation, we jointly pre-train IC and SL. Thus the loss function is defined as the sum of the IC and SL negative log-likelihoods averaged over instances in Q, given prototypes computed from S. We back-propagate the gradient of this loss episodically at pre-training to tune the encoder. At test time, the learned encoder, along with S_test, is used to calculate the class prototypes and predict the examples in Q_test.

For comparison, we also build two baselines, one based on MAML (denoted as MAML) and another based on fine-tuning (denoted as Finetune). MAML is another popular meta-learning framework. We utilize MAML to optimize the parameters θ of the IC-SL classifier on S and Q, such that after θ is adapted with the support set S for ν steps (i.e., gradients are back-propagated for ν epochs), the resulting classifier generalizes well to Q. Concretely, we perform the following two-step optimization at pre-training:

θ' = θ − α ∇_θ L_S(θ)    (repeated for ν inner steps),    θ ← θ − β ∇_θ L_Q(θ'),

where L_S and L_Q are the IC-SL joint losses on the support and query sets, and α and β are the inner and outer learning rates. The learned classifier is then adapted and evaluated with S_test and Q_test respectively. For computational efficiency, our implementation adopts the first-order approximation of MAML, foMAML, which has been shown to achieve similar performance with less computation Finn et al. (2017); Krone et al. (2020). Finetune, on the other hand, is a common supervised-learning framework for low-resource SLU Goyal et al. (2018). In Finetune, we pre-train models with examples from the pre-training split in batches. Adaptation and evaluation are also conducted in episodes, where the output layers of the pre-trained models are first adapted with S_test, and the resulting IC-SL models are then evaluated with Q_test.
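A toy sketch of the first-order MAML meta-update may help make the two-step optimization concrete. All names and the deliberately simple 1-D model with quadratic task loss (θ − t)² are illustrative assumptions, not the paper's setup; the key foMAML idea shown is that the outer gradient is evaluated at the adapted parameters and applied directly to the initial parameters.

```python
import numpy as np

def fomaml_step(theta, tasks, inner_lr=0.1, outer_lr=0.05, inner_steps=3):
    """One foMAML meta-update for a toy 1-D model with loss (theta - t)^2.

    Each task is a (support_target, query_target) pair standing in for
    the support/query sets; the gradient of (theta - t)^2 is 2*(theta - t).
    """
    outer_grads = []
    for t_support, t_query in tasks:
        adapted = theta
        for _ in range(inner_steps):                 # inner-loop adaptation on S
            adapted -= inner_lr * 2 * (adapted - t_support)
        outer_grads.append(2 * (adapted - t_query))  # query grad at adapted params
    return theta - outer_lr * np.mean(outer_grads)   # first-order meta-update

# Meta-train the initialization over two toy tasks.
theta = 0.0
for _ in range(200):
    theta = fomaml_step(theta, tasks=[(1.0, 1.2), (3.0, 2.8)])
```

The learned initialization settles between the tasks, so a few inner steps on either task's support target move it close to that task's optimum.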

2.3 Model architecture

Figure 1 visualizes the architectures of the IC-SL joint classifiers built with these frameworks. Three popular architectures for SLU, ELMo, GloVe, and BERT, are investigated Krone et al. (2020). The architectures share the same design of using a bi-LSTM Hochreiter and Schmidhuber (1997) layer to encode embeddings, followed by fully connected IC and SL prediction layers that take the bi-LSTM hidden states as input. The classifiers differ in the embeddings they use: ELMo adopts a concatenation of GloVe Pennington et al. (2014) and ELMo Peters et al. (2018) embeddings, GloVe uses GloVe only, and BERT employs a pre-trained BERT to encode tokens. In our experiments, we keep these pre-trained encoders frozen during adaptation, since our previous study showed that few-shot examples are insufficient for adapting them Krone et al. (2020). When applying these architectures to the Proto learning framework, the distance between the output and the prototypes is further computed, and the probability of an IC or SL class is the softmax of the negative distance. For MAML and Finetune, the output of the IC and SL layers is used directly as the prediction logits. Note that the learning frameworks discussed above are architecture-agnostic; for the generalizability of our experiments, we choose three popular SLU architectures to explore, but other model backbones are also viable.

3 Experiments

3.1 Datasets and experiment setup

We build the few-shot learning task to evaluate the proposed approach based on three public SLU datasets: ATIS Hemphill et al. (1990), SNIPS Coucke et al. (2018), and TOP Gupta et al. (2018). ATIS is a dataset in the airline domain, while SNIPS comprises utterances inquiring about music, media, and weather. TOP pertains to navigation and event search, with nested and flat intent labels. As in previous work Krone et al. (2020), we only utilize the non-hierarchical intents in experiments for comparable results. In the context of few-shot learning, data for pre-training and adaptation is often in mismatched domains. Hence, we build the few-shot learning datasets, SNIPS-fs, ATIS-fs, and TOP-fs, by manually selecting intents from SNIPS, ATIS, and TOP respectively for the test split, and intents from the remaining two datasets for the pre-training and validation splits. Table 1 provides detailed statistics of these few-shot datasets. The splits of intents are selected by maximizing the distance between intents belonging to different splits, where each intent is represented by the average of the BERT-CLS embeddings of its utterances; we also investigated different split methods, such as random sampling, and observed no significant difference. We will release the splits and the resulting datasets for reproducibility.

Task\Splits   Pre-train                           Validation                       Test
SNIPS-fs      (20345, 7, TOP), (4373, 5, ATIS)    (4333, 5, TOP), (662, 7, ATIS)   (6254, 3, SNIPS)
ATIS-fs       (20345, 7, TOP), (8230, 5, SNIPS)   (4333, 5, TOP)                   (829, 7, ATIS)
TOP-fs        (4373, 5, ATIS), (8230, 5, SNIPS)   (662, 7, ATIS)                   (4426, 6, TOP)
Table 1: Statistics of the pre-train, validation, and test splits for the established few-shot datasets, shown in the form (utterance count, intent count, dataset from which the intents were selected).

With the established splits, we extract episodes. At the pre-training stage, the number of intents in each episode, i.e., the N-way, is sampled uniformly from [3, |C_pre|]. After that, we sample N intents from C_pre, and sample k and k_q utterances for each sampled intent as the support and query set respectively. At the validation and test stages, N is set to |C_val| and |C_test| respectively, and the remaining settings are the same. Additionally, following the same rationale as many meta-learning studies where k and k_q are set to some small numbers arbitrarily Finn et al. (2017); Snell et al. (2017), in our experiments we set both k and k_q to 10.

Additional steps are required to introduce noise into these few-shot datasets. For adaptation example missing/replacing, we keep the pre-training and validation episodes the same, while perturbing the test episodes by sampling p (set to arbitrarily small numbers, 1 to 5, in the experiment) utterances from each intent in the support set, and either removing the sampled utterances or replacing them with others. For modality mismatch, since SNIPS and TOP only contain human text input, we use commercial TTS (Amazon Polly) and ASR (Amazon Transcribe) services to synthesize audio and decode it back to text. Slot labels on the human text are projected onto the ASR hypotheses with Levenshtein alignment. We adopt a similar process for ATIS but skip TTS, since audio recordings are available in ATIS. The word error rate (WER) of the decoded audio for ATIS, SNIPS, and TOP is 18.4%, 16.2%, and 14.7%, respectively.

In the experiments, we pre-train and adapt all models using the Adam optimizer. The learning rate is set to 0.001 for Finetune and Proto. For MAML, the rate is 0.003 at the pre-training step and 0.01 at adaptation, with the number of adaptation steps set to 8. We pre-train models for 30 epochs, either episodically (MAML and Proto) or in batches of size 512 (Finetune). At test time, we adapt these models for ten steps on the support set. We select all hyperparameters for each approach separately, such that they yield the minimum IC and SL joint loss averaged over the three validation sets without perturbation. As for the model architecture, we use a 2-layer bi-LSTM with 256 hidden units for contextual encoding. GloVe with 300 dimensions, ELMo with 512 dimensions, and BERT-medium with 512 dimensions are selected for the token embeddings, as these settings are commonly adopted in SLU experiments. To assess model performance, we report the average IC accuracy and SL F1 over 100 episodes and three random seeds, a typical setting in few-shot learning to avoid performance fluctuation Krone et al. (2020).

3.2 Experiment results

We evaluate the performance and robustness of the proposed method (Proto) as well as the baselines (MAML and Finetune) on the established few-shot noisy tasks. Table 2 shows the IC accuracy and SL F1 for the methods under three settings of adaptation example perturbation (i.e., no perturbation, removing, or replacing utterances). Here, we start from the minimal perturbation, p = 1. To evaluate robustness, we also report in parentheses the absolute accuracy and F1 differences between each perturbed setting and its no-perturbation counterpart. Note that for measuring the absolute difference, we first calculate the difference in IC/SL performance between the noisy and no-perturbation settings for each episode; the average and standard deviation of this difference over episodes are reported in the table. These metrics differ from those obtained by directly computing the difference between the averaged performance reported in the table for the noisy and no-perturbation settings. Similar to previous observations Krone et al. (2020), we find that Proto outperforms MAML and Finetune consistently on the IC and SL problems in all the few-shot datasets; the relatively low SL F1 of the baselines also suggests that few-shot SL is challenging. In addition to the three learning frameworks, the impact of the model architecture is investigated as well. We observe that BERT consistently yields better performance than GloVe and ELMo (cf. rows 1 to 9), presumably because BERT encodes more knowledge useful for SLU. Thus, in the following, we choose BERT as the backbone model architecture.
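The per-episode difference metric used for the parenthesized numbers can be sketched as follows (an illustrative helper; the point is that differences are computed per episode before averaging, which is not the same as differencing the two averaged scores):

```python
import numpy as np

def robustness_stats(clean_scores, noisy_scores):
    """Per-episode robustness metric.

    Computes the absolute score difference per episode first, then
    reports the mean and standard deviation over episodes.
    """
    diffs = np.abs(np.asarray(clean_scores) - np.asarray(noisy_scores))
    return diffs.mean(), diffs.std()
```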

Results in Table 2 also suggest that Proto yields the most robust performance against adaptation example missing/replacing, with absolute differences in IC accuracy and SL F1 between 0.2 and 1.7. We surmise that this robustness results from the model's ensemble nature, where inference can be viewed as aggregating classifier predictions based on distances to examples. Finetune is comparably robust for IC under replacement but worse under removal, presumably because removal leaves fewer adaptation utterances, which is consequential in few-shot learning. The SL performance of Finetune is too low for its robustness to be meaningfully measured. MAML, on the other hand, exhibits large variation in performance. We believe the reason is that the adaptation step in MAML, which decides where the gradient is evaluated, amplifies the perturbation.

                              IC acc. (absolute acc. difference)             SL F1 (absolute F1 difference)
Setting       Method          SNIPS-fs    ATIS-fs     TOP-fs                 SNIPS-fs    ATIS-fs     TOP-fs
No            FINETUNE        83.6±0.8    69.9±1.6    57.7±0.7               19.6±0.7    20.1±0.6    15.7±0.6
perturbation  FINETUNE-GloVe  81.0±0.7    65.6±1.1    53.1±0.6               18.0±0.4    18.2±0.5    14.7±0.4
              FINETUNE-ELMo   82.1±1.2    68.7±1.9    58.0±1.0               19.3±0.6    19.4±0.8    14.9±0.4
              MAML            87.4±0.8    71.1±1.1    57.6±0.5               33.0±1.3    26.3±0.9    22.9±0.8
              MAML-GloVe      79.4±0.9    65.9±0.9    53.2±0.6               30.1±1.0    24.9±0.8    21.3±0.6
              MAML-ELMo       83.5±0.9    69.6±1.3    54.5±0.8               31.9±1.1    25.4±1.2    23.2±0.9
              Proto           90.9±0.3    75.3±0.7    61.9±1.1               45.4±0.5    42.7±1.6    35.4±0.5
              Proto-GloVe     75.3±0.4    70.1±0.6    52.7±1.3               32.2±0.7    41.5±1.5    30.6±0.6
              Proto-ELMo      87.1±0.5    76.0±0.8    59.8±1.2               43.1±0.6    41.2±1.8    35.7±0.8
Remove 1      FINETUNE        83.4±0.6    69.3±1.4    57.4±0.6               18.9±0.7    19.8±0.7    15.3±0.4
utterance                     (3.1±3.0)   (4.1±3.3)   (4.0±2.9)              (1.1±1.8)   (1.1±1.4)   (0.9±0.6)
per intent    MAML            87.3±0.5    71.3±1.6    58.8±1.4               32.6±1.4    25.9±0.8    22.3±0.7
                              (4.3±4.2)   (11.5±8.3)  (10.7±8.8)             (5.2±4.0)   (3.8±2.8)   (4.4±3.5)
              Proto           90.7±0.3    75.0±0.6    61.3±0.9               44.8±0.6    42.5±1.5    35.2±0.5
                              (0.8±1.6)   (0.3±0.6)   (1.3±1.4)              (1.5±1.5)   (0.4±0.5)   (0.2±0.3)
Replace 1     FINETUNE        83.2±0.9    69.4±1.5    57.1±0.8               18.8±0.5    20.0±0.6    15.5±0.5
utterance                     (1.9±2.4)   (0.6±1.0)   (1.6±1.5)              (1.7±1.5)   (1.1±0.5)   (1.0±0.2)
per intent    MAML            87.5±0.7    71.0±1.7    57.5±1.3               32.8±1.3    25.6±0.1    22.8±0.6
                              (2.0±2.7)   (5.3±5.0)   (4.9±5.4)              (3.2±2.6)   (1.8±1.7)   (1.5±2.0)
              Proto           90.9±0.4    75.2±0.6    62.1±1.0               45.5±0.6    42.6±1.5    35.1±0.5
                              (0.9±1.8)   (0.4±0.7)   (1.7±1.7)              (1.7±1.6)   (0.6±0.6)   (0.3±0.4)

Table 2: Average IC accuracy and SL F1 over 100 test episodes for the three few-shot datasets, in the form mean ± standard deviation computed over three random seeds. We show results for the three methods and perturbation settings. In parentheses, the absolute accuracy and F1 differences on test episodes between the remove or replace setting and its no-perturbation counterpart are reported. The BERT model architecture is used by default if not specified.

We further vary p, the number of perturbed examples, and quantify its performance impact. In Figure 2, we present the absolute difference in IC accuracy between the perturbed and no-perturbation settings over various p (we do not examine SL robustness, since only Proto yields satisfactory SL results even before perturbation). Remove and replace perturbations are shown in the left and right panels respectively. Again, Proto yields the most robust results (in terms of the smallest difference) compared to the baselines across the explored perturbation settings. Also, utterance removal is more challenging for model robustness. Both findings strengthen our arguments above.

Figure 2: The absolute IC accuracy difference between the no-perturbation and perturbed settings with various p, the number of utterances removed or replaced from the adaptation set.

Lastly, we measure model robustness against modality mismatch. In Table 3, we report the IC accuracy and SL F1 when models are pre-trained and adapted on human transcription but evaluated on ASR hypotheses. Similar to the findings above, Proto yields the best performance in both IC and SL. Comparing the mismatched-modality results reported here with their matched-modality counterparts (i.e., no perturbation in Table 2), we observe that Proto is again the most robust approach in IC (accuracy drops ranging from 0.3 to 2.0 for Proto, 3.0 to 4.4 for Finetune, and 3.5 to 4.3 for MAML). The SL results of Proto are stable as well (2.2 to 4.3 F1 drop), while Finetune and MAML yield relatively low F1 scores. These findings agree with the observations made above for adaptation example missing/replacing, and further support our discussion of the robustness of the different learning frameworks.

         Method     SNIPS-fs   ATIS-fs    TOP-fs
IC acc.  FINETUNE   79.4±1.8   66.9±1.9   53.3±1.1
         MAML       83.3±0.9   68.1±0.9   54.0±1.0
         Proto      89.0±0.7   75.0±0.7   59.9±0.6
SL F1    FINETUNE   17.5±0.4   19.2±0.2   14.5±0.3
         MAML       26.7±1.6   20.7±0.5   18.8±0.5
         Proto      41.1±0.6   39.7±1.3   33.2±0.6
Table 3: IC accuracy and SL F1 from models pre-trained and adapted on human transcription while evaluated with ASR hypotheses.

4 Conclusions

In this paper, we establish a novel SLU task, few-shot noisy SLU, with existing public datasets. We further propose a ProtoNets based approach, Proto, to build IC and SL classifiers with few noisy examples. When there is no noise in the few-shot examples, Proto yields better performance than the other approaches utilizing the MAML and fine-tuning frameworks. Proto also achieves the highest and most robust IC accuracy and SL F1 when two types of noise, adaptation example missing/replacing and modality mismatch, are injected into the adaptation and evaluation sets respectively. We believe the ensemble nature of ProtoNets benefits model robustness, and the simplicity of Proto's model architecture is also helpful in the few-shot noisy scenario. Our contribution here is a step toward the efficient and robust deployment of SLU models. While our results are promising, there is still substantial work, from the creation of few-shot SLU datasets covering more noise types to studies of faster and more stable learning algorithms, in pursuit of this goal.


  • J. Cao, J. Wang, W. Hamza, K. Vanee, and S. Li (2020) Style attuned pre-training and parameter efficient fine-tuning for spoken language understanding. Proc. Interspeech 2020, pp. 1570–1574. Cited by: §1.
  • Q. Chen, Z. Zhuo, and W. Wang (2019) BERT for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909. Cited by: §1.
  • A. Coucke, A. Saade, A. Ball, T. Bluche, A. Caulier, D. Leroy, C. Doumouro, T. Gisselbrecht, F. Caltagirone, T. Lavril, et al. (2018) Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190. Cited by: §1, §3.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
  • M. Fazel-Zarandi, S. Li, J. Cao, J. Casale, P. Henderson, D. Whitney, and A. Geramifard (2017) Learning robust dialog policies in noisy environments. In Workshop on Conversational AI: Today’s Practice and Tomorrow’s Potential, NeurIPS, Cited by: §2.1.2.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 1126–1135. Cited by: §1, §2.2, §3.1.
  • A. Goyal, A. Metallinou, and S. Matsoukas (2018) Fast and scalable expansion of natural language understanding functionality for intelligent agents. arXiv preprint arXiv:1805.01542. Cited by: §2.2.
  • D. Guo, G. Tür, W. Yih, and G. Zweig (2014) Joint semantic utterance classification and slot filling with recursive neural networks. In 2014 IEEE Spoken Language Technology Workshop (SLT), pp. 554–559. Cited by: §1.
  • S. Gupta, R. Shah, M. Mohit, A. Kumar, and M. Lewis (2018) Semantic parsing for task oriented dialog using hierarchical representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2787–2792. Cited by: §1, §3.1.
  • D. Hakkani-Tür, F. Béchet, G. Riccardi, and G. Tür (2006) Beyond asr 1-best: using word confusion networks in spoken language understanding. Computer Speech & Language 20 (4), pp. 495–514. Cited by: §1, §1.
  • C. T. Hemphill, J. J. Godfrey, and G. R. Doddington (1990) The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990. Cited by: §1, §3.1.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.3.
  • C. Huang and Y. Chen (2019) Adapting pretrained transformer to lattices for spoken language understanding. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Cited by: §1.
  • J. Krone, Y. Zhang, and M. Diab (2020) Learning to classify intents and slot labels given a handful of examples. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, ACL, Cited by: §1, §1, §2.1.1, §2.1.1, §2.2, §2.3, §3.1, §3.1, §3.2.
  • A. Kumar, A. Gupta, J. Chan, S. Tucker, B. Hoffmeister, M. Dreyer, S. Peshterliev, A. Gandhe, D. Filiminov, A. Rastrow, et al. (2017) Just ask: building an architecture for extensible self-service spoken language understanding. arXiv preprint arXiv:1711.00549. Cited by: §1.
  • F. Ladhak, A. Gandhe, M. Dreyer, L. Mathias, A. Rastrow, and B. Hoffmeister (2016) LatticeRnn: recurrent neural networks over lattices. In Interspeech, Cited by: §1.
  • C. Lai, J. Cao, S. Bodapati, and S. Li (2020a) Towards semi-supervised semantics understanding from speech. In Workshop on Self-Supervised Learning for Speech and Audio Processing, NeurIPS, Cited by: §1.
  • C. Lai, Y. Chuang, H. Lee, S. Li, and J. Glass (2020b) Semi-supervised spoken language understanding via self-supervised speech and language model pretraining. arXiv preprint arXiv:2010.13826. Cited by: §1.
  • C. Lee, Y. Chen, and H. Lee (2019) Mitigating the impact of speech recognition errors on spoken question answering by adversarial domain adaptation. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7300–7304. Cited by: §1.
  • C. Li and Y. Liu (2015) Improving named entity recognition in tweets via detecting non-standard words. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 929–938. Cited by: §1.
  • A. T. Liu, S. Li, and H. Lee (2020) Tera: self-supervised learning of transformer encoder representation for speech. arXiv preprint arXiv:2007.06028. Cited by: §1.
  • H. Luo, S. Li, and J. Glass (2020) Prototypical Q networks for automatic conversational diagnosis and few-shot new disease adaption. In Proc. Interspeech 2020, pp. 3895–3899. Cited by: §1.
  • R. Masumura, Y. Ijima, T. Asami, H. Masataki, and R. Higashinaka (2018) Neural confnet classification: fully neural network based spoken utterance classification using word confusion networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6039–6043. Cited by: §1.
  • G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tür, X. He, L. Heck, G. Tür, D. Yu, et al. (2014) Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (3), pp. 530–539. Cited by: §1.
  • J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Cited by: §2.3.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of NAACL-HLT, pp. 2227–2237. Cited by: §2.3.
  • S. Schuster, S. Gupta, R. Shah, and M. Lewis (2019) Cross-lingual transfer learning for multilingual task oriented dialog. In Proceedings of NAACL-HLT, pp. 3795–3805. Cited by: §1.
  • E. Simonnet, S. Ghannay, N. Camelin, and Y. Estève (2018) Simulating ASR errors for training SLU systems. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC), Cited by: §1.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in neural information processing systems, pp. 4077–4087. Cited by: §1, §2.2, §3.1.
  • G. Tür, A. Deoras, and D. Hakkani-Tür (2013) Semantic parsing using word confusion networks with conditional random fields. In INTERSPEECH, Cited by: §1.
  • G. Tür, J. Wright, A. Gorin, G. Riccardi, and D. Hakkani-Tür (2002) Improving spoken language understanding using word confusion networks. In Seventh International Conference on Spoken Language Processing, Cited by: §1.
  • O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in neural information processing systems, pp. 3630–3638. Cited by: §1.
  • S. Wang, T. Gunter, and D. VanDyke (2018) On modelling uncertainty in neural language generation for policy optimisation in voice-triggered dialog assistants. In 2nd Workshop on Conversational AI: Today’s Practice and Tomorrow’s Potential, NeurIPS, Cited by: §2.1.2.
  • S. Yaman, L. Deng, D. Yu, Y. Wang, and A. Acero (2008) An integrative and discriminative technique for spoken utterance classification. IEEE Transactions on Audio, Speech, and Language Processing 16 (6), pp. 1207–1214. Cited by: §1.
  • Z. Yang, R. Salakhutdinov, and W. W. Cohen (2017) Transfer learning for sequence tagging with hierarchical recurrent networks. arXiv preprint arXiv:1703.06345. Cited by: §1.
  • K. Yao, B. Peng, Y. Zhang, D. Yu, G. Zweig, and Y. Shi (2014) Spoken language understanding using long short-term memory neural networks. In 2014 IEEE Spoken Language Technology Workshop (SLT), pp. 189–194. Cited by: §1.
  • S. Zhu, O. Lan, and K. Yu (2018) Robust spoken language understanding with unsupervised asr-error adaptation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Cited by: §1.