Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining

10/26/2020 ∙ by Cheng-I Lai, et al. ∙ 0

Much recent work on Spoken Language Understanding (SLU) is limited in at least one of three ways: models were trained on oracle text input and neglected ASR errors, models were trained to predict only intents without the slot values, or models were trained on a large amount of in-house data. In this paper, we propose a clean and general framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech to address these issues. Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT, and fine-tuned on a limited amount of target SLU data. We study two semi-supervised settings for the ASR component: supervised pretraining on transcribed speech, and unsupervised pretraining by replacing the ASR encoder with self-supervised speech representations, such as wav2vec. In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation. Experiments on ATIS show that our SLU framework with speech as input can perform on par with those using oracle text as input in semantics understanding, even though environmental noise is present and a limited amount of labeled semantics data is available for training.






1 Introduction

Spoken Language Understanding (SLU)¹ is at the front-end of many modern intelligent home devices, virtual assistants, and socialbots [33, 9]: given a spoken command, an SLU engine should extract the relevant semantics² for the appropriate downstream tasks. Since early SLU tasks such as the Airline Travel Information System (ATIS) [14], the field has progressed from knowledge-based [27] to data-driven approaches, notably those based on neural networks. In the seminal paper on ATIS by Tur et al. [30], incorporating linguistically motivated features for NLU and improving ASR robustness were underscored as the research emphases for the coming years. Now, a decade later, we should ask ourselves again: how much has the field progressed, and what is left to be done?

¹SLU typically consists of Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU). ASR maps audio to text, and NLU maps text to semantics. Here, we are interested in learning a mapping directly from raw audio to semantics.
²Semantic acquisition is commonly framed as Intent Classification (IC) and Slot Labeling/Filling (SL); see [33, 9, 30].

Figure 1: Comparison of input/output pairs of our proposed framework with past work, each of which falls into one of: (A) NLU, which assumes oracle text as input instead of speech; (B) predicting intent only from speech, ignoring slot values; and (C) predicting text, intent, and slots from speech. (D) Our work predicts text, intent, and slots from speech while taking advantage of unlabeled data.

Self-supervised language models (LMs), such as BERT [10], and end-to-end SLU [13, 28, 19] appear to have addressed the problems posed in [30]. As shown in Figure 1, we can examine past SLU work from the angle of how they constructed their input/output pairs. In [5], Intent Classification (IC) and Slot Labeling (SL) are jointly predicted on top of BERT, removing the need for a Conditional Random Field (CRF) [34]. However, these NLU works [5, 35, 15] usually ignore ASR or require an off-the-shelf ASR during testing. One line of E2E SLU work does take speech as input, yet it frames slots as intents, and therefore these SLU models are really designed for IC only [28, 19, 31, 6, 23]. Another line of E2E SLU work jointly predicts text and IC/SL from speech, yet it either requires large amounts of in-house data or restricts the pretraining scheme to ASR subword prediction [13, 24, 29, 11]. In contrast, we desire a framework that predicts text, intents, and slots from speech, while learning with limited semantics labels by pretraining on unlabeled data.

Figure 2: Our proposed semi-supervised learning framework with ASR and BERT for joint intent classification (IC) and slot labeling (SL) directly from speech. (A) shows the end-to-end approach, in which E2E ASR and BERT are trained jointly by predicting text and IC/SL. (B) shows the 2-Stage baseline, where text and IC/SL are obtained successively. (C) shows the SpeechBERT baseline, where BERT is adapted to take audio as input by first pretraining with an Audio MLM loss and then fine-tuning for IC/SL. A separate pretrained ASR is still needed for (B) and (C). (D) shows the ASR ($\theta$) and NLU ($\phi$) building blocks used in (A)-(C). Note that $\theta$ and $\phi$ have different subword tokenizations: SentencePiece (BPE) [16] and BertToken. Dotted shapes are pretrained. Figure best viewed in color.

The case for semi-supervised SLU. Neural networks benefit from large quantities of labeled training data, and one can train end-to-end SLU models with them [9, 13, 28, 24]. However, curating labeled IC/SL data is expensive, and often only a limited amount of labels is available. Semi-supervised learning is a useful paradigm for training SLU models across domains, whereby model components are pretrained on large amounts of unlabeled data and then fine-tuned with target semantic labels. While [19, 31, 24, 29] have explored this pretrain-then-fine-tune scheme, they did not take advantage of the generalization capacity of contextualized LMs, such as BERT, for learning semantics from speech. Notably, self-supervised speech representation learning [8, 20, 1, 26, 18] provides a clean and general learning mechanism for downstream speech tasks, yet the semantic transferability of these representations is unclear. Our focus is on designing a better learning framework distinctly for semantic understanding under limited semantic labels, on top of ASR and BERT. We investigate two learning settings for the ASR component: (1) pretraining on transcribed speech with ASR subword prediction, and (2) pretraining on untranscribed speech with contrastive losses [1, 26].

The key contributions of this paper are summarized as follows:

  • We introduce a semi-supervised SLU framework for learning semantics from speech that alleviates: (1) the need for a large amount of in-house, homogeneous data [9, 13, 28, 24], (2) the limitation to intent classification only [28, 19, 15], by predicting text, slots, and intents, and (3) any additional manipulation of labels or losses, such as label projection [3], output serialization [13, 29, 11], ASR n-best hypotheses, or ASR-robust training losses [15, 17]. Figure 2 illustrates our approach.

  • We investigate two learning settings for our framework, supervised pretraining and unsupervised pretraining (Figure 3), and evaluate it with a new metric, the slot edit $F_1$ score, for end-to-end semantic evaluation. Our framework improves upon previous work in Word Error Rate (WER) and IC/SL on ATIS, and even rivals its NLU counterpart with oracle text input [5]. In addition, it is trained with noise augmentation so that it is robust to real environmental noise.

2 Proposed Learning Framework

We now formulate the mapping from speech to text, intents, and slots. Consider a target SLU dataset $\mathcal{D} = \{(\mathbf{x}^{(i)}, \mathbf{w}^{(i)}, \mathbf{s}^{(i)}, o^{(i)})\}_{i=1}^{N}$ consisting of $N$ i.i.d. sequences, where $\mathbf{x}^{(i)}$, $\mathbf{w}^{(i)}$, and $\mathbf{s}^{(i)}$ are the audio, word, and slot sequences, and $o^{(i)}$ is their corresponding intent label. Note that $\mathbf{w}^{(i)}$ and $\mathbf{s}^{(i)}$ are of the same length, and $o^{(i)}$ is a one-hot vector. We are interested in finding the model $\Theta$ that minimizes the loss

$$\mathcal{L} = \mathcal{L}_{\mathrm{ASR}} + \mathcal{L}_{\mathrm{IC/SL}}.$$

We proceed to describe an end-to-end implementation of $\Theta$.³

³We abuse some notation by representing models by their parameters, e.g. $\theta$ for the ASR model and $\phi$ for BERT.

2.1 End-to-End: Joint E2E ASR and BERT Fine-Tuning.

As illustrated in Figure 2, $\Theta$ consists of a pretrained E2E ASR $\theta$ and a pretrained deep contextualized LM $\phi$, such as BERT, and is fine-tuned jointly for $\mathbf{w}$, $\mathbf{s}$, and $o$ on $\mathcal{D}$. We choose E2E ASR over hybrid ASR because the errors from $\mathbf{s}$ and $o$ can be back-propagated through $\theta$. Following [5], $\mathbf{s}$ is predicted via an additional CRF/linear layer on top of BERT, and $o$ is predicted on top of the BERT output of the [CLS] token. The additional model parameters for predicting SL and IC are $\phi_{\mathrm{SL}}$ and $\phi_{\mathrm{IC}}$, respectively, and we have $\Theta = \{\theta, \phi, \phi_{\mathrm{SL}}, \phi_{\mathrm{IC}}\}$. During end-to-end fine-tuning, outputs from $\theta$ and $\phi$ are concatenated to predict $\mathbf{s}$ and $o$ with loss $\mathcal{L}_{\mathrm{IC/SL}}$, while $\mathbf{w}$ is predicted with loss $\mathcal{L}_{\mathrm{ASR}}$. The main benefit of this formulation is that $\mathbf{s}$ and $o$ no longer depend solely on an ASR top-1 hypothesis during training, and the end-to-end objective is thus

$$\mathcal{L}_{\mathrm{E2E}} = \mathcal{L}_{\mathrm{ASR}} + \mathcal{L}_{\mathrm{IC/SL}}.$$

The ASR objective maximizes the sequence-level log-likelihood, $\mathcal{L}_{\mathrm{ASR}} = -\log P(\mathbf{w} \mid \mathbf{x}; \theta)$. Before writing down $\mathcal{L}_{\mathrm{IC/SL}}$, we describe a masking operation, needed because ASR and BERT typically employ different subword tokenization methods.

Differentiating Through Subword Tokenizations To concatenate $\theta$ and $\phi$ outputs along the hidden dimension, we need to make sure they have the same length along the token dimension. We store the first indices at which the words of $\mathbf{w}$ are broken into subword tokens in masking matrices: $M_{\theta} \in \mathbb{R}^{T \times T_{\theta}}$ for $\theta$ and $M_{\phi} \in \mathbb{R}^{T \times T_{\phi}}$ for $\phi$, where $T$ is the number of tokens of $\mathbf{w}$ and $\mathbf{s}$, $T_{\theta}$ is the number of ASR subword tokens, and $T_{\phi}$ is that for BERT. Let $H_{\theta} \in \mathbb{R}^{T_{\theta} \times d_{\theta}}$ be the $\theta$ output matrix before the softmax, and similarly $H_{\phi} \in \mathbb{R}^{T_{\phi} \times d_{\phi}}$ for $\phi$. The concatenated matrix is given as $H = [M_{\theta} H_{\theta};\, M_{\phi} H_{\phi}] \in \mathbb{R}^{T \times (d_{\theta} + d_{\phi})}$, where $d_{\theta}$ and $d_{\phi}$ are the hidden dimensions of $\theta$ and $\phi$. $\mathcal{L}_{\mathrm{IC/SL}}$ is then

$$\mathcal{L}_{\mathrm{IC/SL}} = \mathrm{CE}(o, \hat{o}) + \sum_{t=1}^{T} \mathrm{CE}(s_t, \hat{s}_t),$$

where the sum of cross-entropy losses for IC and SL is minimized, and $\theta$, $\phi$, $\phi_{\mathrm{SL}}$, $\phi_{\mathrm{IC}}$ are updated through $H$. The ground-truth $\mathbf{w}$ is used as input to $\phi$ instead of the ASR hypothesis due to teacher forcing.
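Concretely, the masking operation amounts to gathering each word's first-subword row from both output matrices and concatenating along the hidden dimension. A minimal NumPy sketch (shapes, example sentence, and function name are illustrative, not the paper's implementation):

```python
import numpy as np

def first_subword_indices(subword_lengths):
    """Index of the first subword token of each word.

    subword_lengths[i] = number of subword pieces word i was split into.
    """
    starts, pos = [], 0
    for n in subword_lengths:
        starts.append(pos)
        pos += n
    return np.array(starts)

# "new york flights" -> BPE: [new, york, fli, ghts]; BertToken: [new, york, flights]
bpe_lens, bert_lens = [1, 1, 2], [1, 1, 1]
h_asr = np.random.randn(4, 256)   # ASR decoder outputs, one row per BPE token
h_bert = np.random.randn(3, 768)  # BERT outputs, one row per WordPiece token

idx_asr = first_subword_indices(bpe_lens)    # rows 0, 1, 2 of h_asr
idx_bert = first_subword_indices(bert_lens)  # rows 0, 1, 2 of h_bert

# Gather the first-subword rows so both streams have one vector per word,
# then concatenate along the hidden dimension for joint IC/SL prediction.
h_joint = np.concatenate([h_asr[idx_asr], h_bert[idx_bert]], axis=1)
assert h_joint.shape == (3, 256 + 768)
```

Selecting rows with an index array is the dense-gather equivalent of multiplying by the 0/1 masking matrices $M_{\theta}$ and $M_{\phi}$, and gradients flow through the gathered rows in the same way.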

2.2 Inference

At test time, an input audio sequence $\mathbf{x}$ and the sets of all possible word tokens $\mathcal{V}$, slots $\mathcal{S}$, and intents $\mathcal{O}$ are given. We are then interested in decoding $\mathbf{x}$ for its target word sequence $\hat{\mathbf{w}}$, slot sequence $\hat{\mathbf{s}}$, and intent label $\hat{o}$. Having obtained $\Theta^{*}$, the decoding procedure for the end-to-end approach is

$$\hat{\mathbf{w}} = \arg\max_{\mathbf{w}} P(\mathbf{w} \mid \mathbf{x}; \theta), \qquad \hat{\mathbf{s}}, \hat{o} = \arg\max_{\mathbf{s},\, o} P(\mathbf{s}, o \mid \hat{\mathbf{w}}, \mathbf{x}; \Theta^{*}).$$

This two-step decoding procedure, first $\hat{\mathbf{w}}$ and then $\hat{\mathbf{s}}$ and $\hat{o}$, is necessary given that no explicit serialization of $\mathbf{w}$ and $\mathbf{s}$ is imposed, as in [13, 29]. While decoding for $\hat{\mathbf{s}}$, the additional input $\hat{\mathbf{w}}$ is given, and each slot is conditioned on the full sequence $\hat{\mathbf{w}}$ rather than on a single word, given the context from self-attention in BERT. Note that here and throughout this work, we only take the top-1 hypothesis (instead of the top-N) to decode for $\hat{\mathbf{s}}$ and $\hat{o}$.

3 Learning with Less Supervision

Our semi-supervised framework relies on pretrained ASR and NLU components. Depending on the accessibility of the data, we explore two levels of supervision.⁴ In the first setting, an external transcribed corpus is available, and we utilize transfer learning to initialize the ASR. In the second setting, external audio is available but without transcriptions; in this case, the ASR is initialized with self-supervised learning. In both settings, BERT is pretrained with MLM and NSP as described in [10]. Figure 3 distinguishes the two learning settings.

⁴In either setting, the amount of IC/SL annotations remains the same.

Figure 3: Two semi-supervised settings: (A) additional transcribed speech is available, and $\theta$ is pretrained and fine-tuned for ASR. (B) additional audio is available but without transcriptions, and the $\theta$ encoder is replaced with a pretrained wav2vec [26, 1] before fine-tuning.

3.1 Transfer Learning from a Pretrained ASR

Following [19, 24, 29], $\theta$ is pretrained on an external transcribed speech corpus before fine-tuning on the target SLU dataset.

3.2 Unsupervised ASR Pretraining with wav2vec

According to UNESCO, 43% of the languages in the world are endangered. Supervised pretraining is not possible for many languages, as transcribing a language requires expert knowledge in phonetics, morphology, syntax, and so on. This partially motivates the line of self-supervised learning work in speech: powerful learned representations should require little fine-tuning data. Returning to our topic, we ask: how does self-supervised learning help with learning semantics?

Among many others, wav2vec 1.0 [26] and 2.0 [1] have demonstrated the effectiveness of self-supervised representations for ASR. They are pretrained with contrastive losses [20] and differ mainly in their architectures. We replace the $\theta$ encoder with these wav2vec features and append the $\theta$ decoder for fine-tuning on SLU.

4 Experiments

Datasets ATIS [14] contains 8 hours of audio recordings of people making flight reservations, with corresponding human transcripts. A total of 5.2k utterances from more than 600 speakers are present. Note that ATIS is considerably smaller than the in-house SLU datasets used in [9, 13, 28, 24], justifying our limited-semantics-labels setup. Waveforms are sampled at 16kHz. For pretraining on data unlabeled in semantics, we selected LibriSpeech 960hr (LS-960) [21]. Besides the original ATIS, models are evaluated on a noisy copy of it (augmented with MS-SNSD [25]). We made sure the noisy train and test splits in MS-SNSD do not overlap. Text normalization is applied to the ATIS transcriptions with an open-source tool.⁵

⁵Utterances are ignored if they contain words with multiple possible slot tags.

Hyperparameters All speech is represented as sequences of 83-dimensional Mel-scale filter bank features, computed every 10ms. Global mean normalization is applied. The E2E ASR is implemented in ESPnet, with 12 Transformer encoder layers and 6 decoder layers; the choice of the Transformer is similar to [23]. The E2E ASR is optimized with hybrid CTC/attention losses [32] with label smoothing. The decoding beam size is set to 5 throughout this work, and we do not use an external LM during decoding. SpecAugment [22] is used as the default data augmentation. The SentencePiece (BPE) vocabulary size is set to 1k. BERT is bert-base-uncased from HuggingFace. Code will be made available.⁶

⁶Code: Semi-Supervsied-Spoken-Language-Understanding-PyTorch.
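As a rough illustration of what SpecAugment does to these filter bank inputs, the sketch below zeroes out random frequency bands and time spans. Mask counts and widths are illustrative defaults of ours, not the exact policy of [22] or of our training recipe:

```python
import numpy as np

def spec_augment(feats, num_freq_masks=2, max_f=10, num_time_masks=2, max_t=40, seed=0):
    """SpecAugment-style masking on a (frames, mel_bins) feature matrix:
    zero a few random frequency bands and time spans."""
    rng = np.random.default_rng(seed)
    out = feats.copy()
    n_frames, n_bins = out.shape
    for _ in range(num_freq_masks):
        f = int(rng.integers(0, max_f + 1))        # band width in bins
        f0 = int(rng.integers(0, n_bins - f + 1))  # band start
        out[:, f0:f0 + f] = 0.0
    for _ in range(num_time_masks):
        t = int(rng.integers(0, max_t + 1))        # span length in frames
        t0 = int(rng.integers(0, n_frames - t + 1))
        out[t0:t0 + t, :] = 0.0
    return out

feats = np.abs(np.random.randn(300, 83)) + 1.0  # 3 s of 83-dim features (dummy)
masked = spec_augment(feats)
```

Because masking happens on the features rather than the waveform, it is cheap to apply on the fly during training; the environmental-noise augmentation in Section 4.5 is complementary, operating on the audio itself.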

4.1 E2E Evaluation with Slot Edit F1 Score.

Our framework is evaluated with an end-to-end evaluation metric, termed the slot edit $F_1$. Unlike the standard slot $F_1$, the slot edit $F_1$ accounts for instances where the predicted sequence has a different length from the ground truth. It bears similarity to the E2E metrics proposed in [13, 24]. To calculate the score, the predicted text and the oracle text are first aligned. For each slot label $s \in \mathcal{S} \setminus \{\mathrm{O}\}$, where $\mathcal{S}$ is the set of all possible slot labels and "O" is the null tag, we count the insertions (false positives, FP), deletions (false negatives, FN), and substitutions (one FN plus one FP) of its slot value. The slot edit $F_1$ is the harmonic mean of precision and recall over all slots:

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \qquad \mathrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}.$$
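A minimal Python sketch of this metric follows. It uses `difflib` for the text alignment and takes (word, slot) pairs as input; it illustrates the counting scheme rather than reproducing our exact scorer:

```python
from difflib import SequenceMatcher

def slot_edit_f1(pred_pairs, gold_pairs):
    """Slot edit F1 over one utterance (or a concatenation of utterances).

    Aligns predicted and oracle word sequences, then counts TP, FP
    (insertions), and FN (deletions) over non-'O' slot values; a
    substitution contributes one FP and one FN."""
    pred_words = [w for w, _ in pred_pairs]
    gold_words = [w for w, _ in gold_pairs]
    tp = fp = fn = 0
    sm = SequenceMatcher(a=gold_words, b=pred_words, autojunk=False)
    for op, g1, g2, p1, p2 in sm.get_opcodes():
        if op == "equal":
            for g, p in zip(range(g1, g2), range(p1, p2)):
                gs, ps = gold_pairs[g][1], pred_pairs[p][1]
                if gs == ps != "O":
                    tp += 1
                else:
                    if gs != "O": fn += 1
                    if ps != "O": fp += 1
        else:  # replace / delete / insert in the aligned text
            fn += sum(1 for _, s in gold_pairs[g1:g2] if s != "O")
            fp += sum(1 for _, s in pred_pairs[p1:p2] if s != "O")
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

For example, a slot value carried by a mis-recognized word counts as one FN plus one FP, dragging down both precision and recall, which is exactly why ASR errors hurt this metric while leaving the standard slot F1 on oracle text untouched.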
4.2 End-to-End 2-Stage Fine-tuning

An observation from our experiments was that ASR is much harder to learn than IC/SL. Therefore, we adjusted our end-to-end training to a two-stage fine-tuning: pretrain the ASR on LS-960, then fine-tune the ASR on ATIS, and lastly jointly fine-tune for ASR and IC/SL on ATIS.

Figure 4: Our SpeechBERT [7] pretraining and fine-tuning setup.

4.3 Baselines: Alternative Formulations

Two variations for constructing $\Theta$ are presented (refer to Figure 2). They serve as baselines to the end-to-end approach.

2-Stage: Cascading ASR to BERT A natural alternative to the E2E approach is to separately pretrain and fine-tune ASR and BERT. In this case, errors from $\mathbf{s}$ and $o$ cannot be back-propagated to $\theta$.

SpeechBERT: BERT in a Speech-Text Embedding Space Another sensible way to construct $\Theta$ is to "adapt" BERT such that it takes audio as input and outputs IC/SL, without compromising its original semantic learning capacity. SpeechBERT [7] was initially proposed for spoken question answering, but we found its core idea of training BERT with audio-text pairs fitting as another baseline for our end-to-end approach. We modified the pretraining and fine-tuning setup described in [7] for SLU. Audio MLM (cf. MLM in BERT [10]) pretrains $\phi$ by mapping masked audio segments to text. This pretraining step gradually adapts the original BERT to a phonetic-semantic joint embedding space. Then, $\phi$ is fine-tuned by mapping unmasked audio segments to IC/SL. Figure 4 illustrates the audio-text and audio-IC/SL pairs for SpeechBERT. Unlike in the end-to-end approach, $\theta$ is kept frozen throughout SpeechBERT pretraining and fine-tuning.

4.4 Main Results on Clean ATIS

We benchmarked our proposed framework against several prior works; Table 1 presents their WER, slot edit F1, and intent F1 results. JointBERT [5] is our NLU baseline, where BERT is jointly fine-tuned for IC/SL; it attains around 95% slot edit F1 and over 98% intent F1. Since JointBERT has access to the oracle text, it is the upper bound for our SLU models with speech as input. CLM-BERT [3] explored using an in-house conversational LM for NLU. We replicated [29], where an LAS model [4] directly predicts interleaving word and slot tokens (serialized output) and is optimized with CTC over words and slots. We also experimented with a Kaldi hybrid ASR.

Both our proposed end-to-end and baseline approaches surpass prior SLU work. We hypothesize that the performance gain originates from our choices of (1) adopting pretrained E2E ASR and BERT, (2) applying text normalization to the target transcriptions for training the ASR, and (3) fine-tuning for text and IC/SL end-to-end.

Frameworks               | Unlabeled Semantics Data | WER   | slot edit F1 | intent F1
NLU with Oracle Text
JointBERT [5]            | -                        | -     | 95.64        | 98.99
Proposed
End-to-End w/ 2-Stage    | LS-960                   | 2.18  | 95.88        | 97.26
2-Stage Baseline         | LS-960                   | 1.38  | 93.69        | 97.01
SpeechBERT Baseline      | LS-960                   | 1.4   | 92.36        | 97.4
Prior Work
ASR-Robust Embed [15]    | WSJ                      | 15.55 | -            | 95.65
Kaldi Hybrid ASR+BERT    | LS-960                   | 13.31 | 85.13        | 94.56
ASR+CLM-BERT [3]         | in-house                 | 18.4  | 93.8*        | 97.1
LAS+CTC [29]             | LS-460                   | 8.32  | 86.85        | -

*For [3], model predictions are evaluated only if the ASR hypothesis and the human transcription have the same number of tokens.

Table 1: WER, slot edit F1, and intent F1 on the ATIS clean test set. Our ASR is pretrained on LibriSpeech 960hr (LS-960). Results indicate our semi-supervised framework is effective in the data-scarcity setting, exceeding prior work in WER and IC/SL while approaching the NLU upper bound.

4.5 Environmental Noise Augmentation

A common scenario for users uttering spoken commands to SLU engines is with environmental noise present in the background. Nonetheless, common SLU benchmarking datasets such as ATIS, SNIPS [9], and FSC [19] are very clean. To quantify model robustness under noisy settings, we augmented ATIS with environmental noise from MS-SNSD. Table 2 reveals that models which work well on clean ATIS may break under realistic noise: although our models are trained with SpecAugment, there is still a 4-27% performance drop from the clean test set.

We followed the noise augmentation protocol in [25], where for each sample, five noise files are sampled and added to the clean file at different SNR levels, resulting in a five-fold augmentation. We observe that augmenting the training data with a diverse set of environmental noises works well, and there is now minimal model degradation. Our end-to-end approach reaches 95.46% slot edit F1 and 97.4% intent F1, merely a 1-2% drop from the clean test set, and almost a 40% improvement over the hybrid ASR+BERT.
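The per-sample mixing step can be sketched as follows: scale a noise clip so that the clean-to-noise power ratio matches a target SNR, then add it. This is a simplified stand-in for the MS-SNSD tooling; the function name and details are ours, not those of [25]:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` at the requested signal-to-noise ratio (dB)."""
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]  # loop/trim noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(p_clean / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)  # 1 s of dummy audio at 16 kHz
noise = rng.standard_normal(4000)   # shorter noise clip, will be looped
noisy = mix_at_snr(clean, noise, snr_db=10.0)
```

Repeating this with several noise files and several SNR levels per utterance yields the five-fold augmented training set described above.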

Frameworks              | WER   | slot edit F1 | intent F1
Kaldi Hybrid ASR+BERT   | 44.72 | 69.55        | 88.94
Proposed w/ Noise Aug.
End-to-End w/ 2-Stage   | 3.6   | 95.46        | 97.40
2-Stage Baseline        | 3.5   | 92.52        | 96.49
SpeechBERT Baseline     | 3.6   | 88.7         | 96.15
Proposed w/o Noise Aug.
End-to-End w/ 2-Stage   | 9.62  | 91.54        | 96.14
2-Stage Baseline        | 8.98  | 90.09        | 95.74
SpeechBERT Baseline     | 9.0   | 81.72        | 94.05

Table 2: WER, slot edit F1, and intent F1 on the ATIS noisy test set. Noise augmentation effectively reduces model degradation.

4.6 Effectiveness of Unsupervised Pretraining with wav2vec

Table 3 shows the results for different ASR pretraining strategies: unsupervised pretraining with wav2vec, transfer learning from ASR, and no pretraining at all. We extracted both the latent vectors $\mathbf{z}$ and the context vectors $\mathbf{c}$ from wav2vec 1.0. To simplify the pipeline, and in contrast to [26], we pre-extracted the wav2vec features and did not fine-tune wav2vec with $\theta$ on ATIS. We also chose not to decode with an LM, to be consistent with prior SLU work. We first observe the high WER for the latent vectors $\mathbf{z}$, indicating they are sub-optimal and only slightly better than training from scratch. Nonetheless, encouragingly, the context vectors $\mathbf{c}$ reach 67% slot edit F1 and 90% intent F1.

To improve the results, we added subsampling layers [4] on top of the wav2vec features to downsample the sequence length with convolutions. The motivation here is that $\mathbf{z}$ and $\mathbf{c}$ are considerably longer than typical ASR encoder outputs. With subsampling, $\mathbf{c}$ from wav2vec 1.0 now achieves 85.64% slot edit F1 and 95.67% intent F1, a huge relative improvement over training the ASR from scratch, closing the gap between unsupervised and supervised pretraining for SLU.
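The subsampling layers can be pictured as strided convolutions over the feature sequence. The sketch below uses random (untrained) kernels and made-up dimensions purely to show the length reduction; in the real model the kernels are learned jointly with the rest of the network:

```python
import numpy as np

def conv_subsample(x, width=3, stride=2, seed=0):
    """One strided 1-D conv layer with ReLU; roughly halves the sequence
    length. Kernels are random here, standing in for learned weights."""
    rng = np.random.default_rng(seed)
    T, D = x.shape
    kernel = rng.standard_normal((width, D, D)) / np.sqrt(width * D)
    out = [np.maximum(np.einsum("wd,wde->e", x[t:t + width], kernel), 0.0)
           for t in range(0, T - width + 1, stride)]
    return np.stack(out)

feats = np.random.randn(400, 16)           # hypothetical wav2vec feature sequence
h = conv_subsample(conv_subsample(feats))  # two stride-2 layers -> ~4x shorter
```

Stacking two stride-2 layers brings the wav2vec sequence length close to that of a standard ASR encoder output, which is what makes the downstream attention decoder tractable.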

Frameworks                                         | WER  | slot edit F1 | intent F1
Proposed 2-Stage w/o ASR Pretraining
2-Stage Baseline                                   | 58.7 | 29.22        | 82.08
Proposed 2-Stage w/ Transfer Learning from ASR
2-Stage Baseline                                   | 1.38 | 93.69        | 97.01
Proposed 2-Stage w/ Unsupervised Pretraining
wav2vec 1.0 [26] (latent z) + 2-Stage              | 54.2 | 35.04        | 83.68
wav2vec 1.0 [26] (context c) + 2-Stage             | 30.4 | 67.33        | 89.86
wav2vec 1.0 [26] (context c) + subsample + 2-Stage | 13.2 | 85.64        | 95.67

Table 3: Effectiveness of different ASR pretraining strategies for our 2-Stage baseline, on the ATIS clean test set. Results with wav2vec 2.0 [1] are omitted since they were not much better. The setup is visualized in Figure 3.

5 Conclusions and Future Work

This work attempts to respond to a classic paper, "What is left to be understood in ATIS?" [30], by assessing the advances that contextualized LMs and end-to-end SLU have brought to semantic understanding. We showed for the first time that an SLU model with speech as input can perform on par with NLU models on ATIS, entering the 5% "corpus errors" range [30, 2]. However, we believe unsolved questions remain, such as the prospect of building a single framework for multi-lingual SLU [12], or the need for a more spontaneous SLU corpus that is not limited to short segments of spoken commands.

Acknowledgments We thank Nanxin Chen, Erica Cooper, Alexander H. Liu, Wei Fang, and Fan-Keng Sun for their comments on this work.


  • [1] A. Baevski, H. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477. Cited by: §1, Figure 3, §3.2, Table 3.
  • [2] F. Béchet and C. Raymond (2018) Is atis too shallow to go deeper for benchmarking spoken language understanding models?. Cited by: §5.
  • [3] J. Cao, J. Wang, W. Hamza, K. Vanee, and S. Li (2020) Style attuned pre-training and parameter efficient fine-tuning for spoken language understanding. arXiv preprint arXiv:2010.04355. Cited by: 1st item, §4.4, Table 1, footnote 7.
  • [4] W. Chan, N. Jaitly, Q. Le, and O. Vinyals (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. Cited by: §4.4, §4.6.
  • [5] Q. Chen, Z. Zhuo, and W. Wang (2019) Bert for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909. Cited by: 2nd item, §1, §2.1, §4.4, Table 1.
  • [6] W. I. Cho, D. Kwak, J. Yoon, and N. S. Kim (2020) Speech to text adaptation: towards an efficient cross-modal distillation. arXiv preprint arXiv:2005.08213. Cited by: §1.
  • [7] Y. Chuang, C. Liu, and H. Lee (2019) SpeechBERT: cross-modal pre-trained language model for end-to-end spoken question answering. arXiv preprint arXiv:1910.11559. Cited by: Figure 4, §4.3.
  • [8] Y. Chung, W. Hsu, H. Tang, and J. Glass (2019) An unsupervised autoregressive model for speech representation learning. arXiv preprint arXiv:1904.03240. Cited by: §1.
  • [9] A. Coucke, A. Saade, A. Ball, T. Bluche, A. Caulier, D. Leroy, C. Doumouro, T. Gisselbrecht, F. Caltagirone, T. Lavril, et al. (2018) Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190. Cited by: 1st item, §1, §1, §4.5, §4, footnote 2.
  • [10] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §3, §4.3.
  • [11] S. Ghannay, A. Caubriere, Y. Esteve, A. Laurent, and E. Morin (2018) End-to-end named entity extraction from speech. arXiv preprint arXiv:1805.12045. Cited by: 1st item, §1.
  • [12] J. Glass, G. Flammia, D. Goodine, M. Phillips, J. Polifroni, S. Sakai, S. Seneff, and V. Zue (1995) Multilingual spoken-language understanding in the mit voyager system. Speech communication 17 (1-2), pp. 1–18. Cited by: §5.
  • [13] P. Haghani, A. Narayanan, M. Bacchiani, G. Chuang, N. Gaur, P. Moreno, R. Prabhavalkar, Z. Qu, and A. Waters (2018) From audio to semantics: approaches to end-to-end spoken language understanding. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 720–726. Cited by: 1st item, §1, §1, §2.2, §4.1, §4.
  • [14] C. T. Hemphill, J. J. Godfrey, and G. R. Doddington (1990) The atis spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990, Cited by: §1, §4.
  • [15] C. Huang and Y. Chen (2020) Learning asr-robust contextualized embeddings for spoken language understanding. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8009–8013. Cited by: 1st item, §1, Table 1.
  • [16] T. Kudo and J. Richardson (2018) Sentencepiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226. Cited by: Figure 2.
  • [17] C. Lee, Y. Chen, and H. Lee (2019) Mitigating the impact of speech recognition errors on spoken question answering by adversarial domain adaptation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7300–7304. Cited by: 1st item.
  • [18] A. T. Liu, S. Yang, P. Chi, P. Hsu, and H. Lee (2020) Mockingjay: unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6419–6423. Cited by: §1.
  • [19] L. Lugosch, M. Ravanelli, P. Ignoto, V. S. Tomar, and Y. Bengio (2019) Speech model pre-training for end-to-end spoken language understanding. arXiv preprint arXiv:1904.03670. Cited by: 1st item, §1, §1, §3.1, §4.5.
  • [20] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §1, §3.2.
  • [21] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. Cited by: §4.
  • [22] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) Specaugment: a simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779. Cited by: §4.
  • [23] M. Radfar, A. Mouchtaris, and S. Kunzmann (2020) End-to-end neural transformer based spoken language understanding. arXiv preprint arXiv:2008.10984. Cited by: §1, §4.
  • [24] M. Rao, A. Raju, P. Dheram, B. Bui, and A. Rastrow (2020) Speech to semantics: improve asr and nlu jointly via all-neural interfaces. arXiv preprint arXiv:2008.06173. Cited by: 1st item, §1, §1, §3.1, §4.1, §4.
  • [25] C. K. Reddy, E. Beyrami, J. Pool, R. Cutler, S. Srinivasan, and J. Gehrke (2019) A scalable noisy speech dataset and online subjective test framework. arXiv preprint arXiv:1909.08050. Cited by: §4.5, §4.
  • [26] S. Schneider, A. Baevski, R. Collobert, and M. Auli (2019) Wav2vec: unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862. Cited by: §1, Figure 3, §3.2, §4.6, Table 3.
  • [27] S. Seneff (1992) TINA: a natural language system for spoken language applications. Computational linguistics 18 (1), pp. 61–86. Cited by: §1.
  • [28] D. Serdyuk, Y. Wang, C. Fuegen, A. Kumar, B. Liu, and Y. Bengio (2018) Towards end-to-end spoken language understanding. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5754–5758. Cited by: 1st item, §1, §1, §4.
  • [29] N. Tomashenko, A. Caubrière, Y. Estève, A. Laurent, and E. Morin (2019) Recent advances in end-to-end spoken language understanding. In International Conference on Statistical Language and Speech Processing, pp. 44–55. Cited by: 1st item, §1, §1, §2.2, §3.1, §4.4, Table 1.
  • [30] G. Tur, D. Hakkani-Tür, and L. Heck (2010) What is left to be understood in atis?. In 2010 IEEE Spoken Language Technology Workshop, pp. 19–24. Cited by: §1, §1, §5, footnote 2.
  • [31] P. Wang, L. Wei, Y. Cao, J. Xie, and Z. Nie (2020) Large-scale unsupervised pre-training for end-to-end spoken language understanding. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7999–8003. Cited by: §1, §1.
  • [32] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi (2017) Hybrid ctc/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1240–1253. Cited by: §4.
  • [33] D. Yu, M. Cohn, Y. M. Yang, C. Chen, W. Wen, J. Zhang, M. Zhou, K. Jesse, A. Chau, A. Bhowmick, et al. (2019) Gunrock: a social bot for complex and engaging long conversations. arXiv preprint arXiv:1910.03042. Cited by: §1, footnote 2.
  • [34] J. Zhou and W. Xu (2015) End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1127–1137. Cited by: §1.
  • [35] S. Zhu and K. Yu (2017) Encoder-decoder with focus-mechanism for sequence labelling based spoken language understanding. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5675–5679. Cited by: §1.