Spoken Language Understanding (SLU) is at the front-end of many modern intelligent home devices, virtual assistants, and socialbots [33, 9]: given a spoken command, an SLU engine should extract the relevant semantics for the appropriate downstream tasks. (SLU typically consists of Automatic Speech Recognition (ASR), which maps audio to text, and Natural Language Understanding (NLU), which maps text to semantics; here, we are interested in learning a mapping directly from raw audio to semantics. Semantics acquisition is commonly framed as Intent Classification (IC) and Slot Labeling/Filling (SL), see [33, 9, 30].) Since early SLU tasks such as the Airline Travel Information System (ATIS), the field has progressed from knowledge-based
to data-driven approaches, notably those based on neural networks. In the seminal paper on ATIS by Tur et al., incorporating linguistically motivated features for NLU and improving ASR robustness were underscored as the research emphases for the coming years. Now, a decade later, we should ask ourselves again: how much has the field progressed, and what is left to be done?
Self-supervised language models (LMs), such as BERT, and end-to-end SLU [13, 28, 19] appear to have addressed the problems posed in that work. As shown in Figure 1, we can examine past SLU work from the angle of how the input/output pairs are constructed. In [5], Intent Classification (IC) and Slot Labeling (SL) are jointly predicted on top of BERT, discarding the need for a Conditional Random Field (CRF). However, these NLU works [5, 35, 15] usually ignore ASR or require an off-the-shelf ASR during testing. One line of E2E SLU work does take speech as input, yet it frames slots as intents, and therefore its SLU models are really designed for IC only [28, 19, 31, 6, 23]. Another line of E2E SLU work jointly predicts text and IC/SL from speech, yet it either requires large amounts of in-house data or restricts the pretraining scheme to ASR subword prediction [13, 24, 29, 11]. In contrast, we desire a framework that predicts text, intents, and slots from speech, while learning from limited semantic labels by pretraining on unlabeled data.
The case for semi-supervised SLU. Neural networks benefit from large quantities of labeled training data, and one can train end-to-end SLU models with them [9, 13, 28, 24]. However, curating labeled IC/SL data is expensive, and often only a limited amount of labels is available. Semi-supervised learning is thus a useful paradigm for training SLU models across domains, whereby model components are pretrained on large amounts of unlabeled data and then fine-tuned with target semantic labels. While [19, 31, 24, 29] have explored this pretrain-then-fine-tune scheme, they did not take advantage of the generalization capacity of contextualized LMs, such as BERT, for learning semantics from speech. Notably, self-supervised speech representation learning [8, 20, 1, 26, 18] provides a clean and general learning mechanism for downstream speech tasks, yet the semantic transferability of these representations is unclear. Our focus is on designing a better learning framework specifically for semantic understanding under limited semantic labels, on top of ASR and BERT. We investigate two learning settings for the ASR component: (1) pretraining on transcribed speech with ASR subword prediction, and (2) pretraining on untranscribed speech with contrastive losses [1, 26].
The key contributions of this paper are summarized as follows:
We introduce a semi-supervised SLU framework for learning semantics from speech, alleviating: (1) the need for large amounts of in-house, homogeneous data [9, 13, 28, 24], (2) the limitation to intent classification only [28, 19, 15], by predicting text, slots, and intents, and (3) any additional manipulation of labels or losses, such as label projection, output serialization [13, 29, 11], ASR n-best hypotheses, or ASR-robust training losses [15, 17]. Figure 2 illustrates our approach.
We investigate two learning settings for our framework: supervised pretraining and unsupervised pretraining (Figure 3), and evaluate our framework with a new metric, the slots edit F1 score, for end-to-end semantic evaluation. Our framework improves upon previous work in Word Error Rate (WER) and IC/SL on ATIS, and even rivals its NLU counterpart, which has access to the oracle text input. In addition, it is trained with noise augmentation such that it is robust to real environmental noise.
2 Proposed Learning Framework
We now formulate the mapping from speech to text, intents, and slots. Consider a target SLU dataset D = {(x_i, w_i, s_i, y_i)}_{i=1}^{N} of i.i.d. sequences, where x_i, w_i, and s_i are the audio, word, and slot sequences, and y_i is their corresponding intent label. Note that w_i and s_i are of the same length, and y_i is a one-hot vector. We are interested in finding the model f_θ : x ↦ (w, s, y) maximizing the objective L(θ) = ∑_{i=1}^{N} log p_θ(w_i, s_i, y_i | x_i).
We proceed to describe an end-to-end implementation of f_θ. (We abuse notation by representing models by their parameters, e.g., θ_ASR for the ASR model and θ_BERT for BERT.)
2.1 End-to-End: Joint E2E ASR and BERT Fine-Tuning
As illustrated in Figure 2, f_θ consists of a pretrained E2E ASR model θ_ASR and a pretrained deep contextualized LM θ_BERT, such as BERT, and is fine-tuned jointly for w, s, and y on D. We choose E2E ASR over hybrid ASR because the errors from the SL and IC predictions can be back-propagated through θ_ASR. Following [5], slots s are predicted via an additional CRF/linear layer on top of BERT, and the intent y is predicted on top of the BERT output of the [CLS] token. The additional model parameters for predicting SL and IC are θ_SL and θ_IC, respectively, and we have θ = (θ_ASR, θ_BERT, θ_SL, θ_IC). During end-to-end fine-tuning, outputs from θ_ASR and θ_BERT are concatenated to predict s and y with objective L_SLU, while w is predicted with objective L_ASR. The main benefit of this formulation is that s and y no longer depend solely on an ASR top-1 hypothesis during training, and the end-to-end objective is thus L(θ) = L_ASR + L_SLU.
The ASR objective L_ASR is formulated to maximize the sequence-level log-likelihood of w given x. Before writing down L_SLU, we describe a masking operation, because ASR and BERT typically employ different subword tokenization methods.
Differentiating Through Subword Tokenizations. To concatenate θ_ASR and θ_BERT outputs along the hidden dimension, we need to make sure they have the same length along the token dimension. For each tokenizer, we store the indices of the first subword token of every word in a matrix: M_ASR for θ_ASR and M_BERT for θ_BERT, where the rows index the T word tokens of w, and the columns index the T_ASR ASR subword tokens and T_BERT BERT subword tokens, respectively. Let H_ASR be the θ_ASR output matrix before softmax, and similarly H_BERT for θ_BERT. The concatenated matrix is H = [M_ASR H_ASR ; M_BERT H_BERT], where d_ASR and d_BERT are the hidden dimensions of θ_ASR and θ_BERT. L_SLU is then,
where the sum of the IC and SL log-likelihoods is maximized, and θ_ASR and θ_BERT are updated through the alignment matrices. The ground-truth word sequence w is used as input to θ_BERT instead of the ASR hypothesis, following teacher forcing.
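As a concrete (simplified) sketch of this alignment, the first-subword indices can be used to gather one hidden vector per word from each model before concatenating along the hidden dimension. The toy tokenizers and array shapes below are illustrative assumptions, not the paper's actual BPE/WordPiece vocabularies:

```python
import numpy as np

def first_subword_indices(words, tokenize):
    """Index of the first subword of each word in the flattened token stream."""
    idx, pos = [], 0
    for w in words:
        idx.append(pos)
        pos += len(tokenize(w))  # skip past this word's subword tokens
    return np.array(idx)

def concat_aligned(h_asr, h_bert, words, tok_asr, tok_bert):
    """Gather per-word representations from both models and concatenate them.

    h_asr: (n_asr_subwords, d_asr), h_bert: (n_bert_subwords, d_bert).
    Returns an array of shape (len(words), d_asr + d_bert).
    """
    ia = first_subword_indices(words, tok_asr)
    ib = first_subword_indices(words, tok_bert)
    return np.concatenate([h_asr[ia], h_bert[ib]], axis=-1)
```

In the paper this gathering is expressed as a matrix multiplication with the masking matrices so that gradients flow through both encoders; the index-based gather above is the same operation in a more explicit form.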
At test time, an input audio sequence x and the sets of all possible word tokens, slots, and intents are given. We are then interested in decoding x for its target word sequence ŵ, slot sequence ŝ, and intent label ŷ. Having obtained ŵ from ASR beam search, the decoding procedure for the end-to-end approach predicts ŝ and ŷ on top of the concatenated θ_ASR and θ_BERT outputs.
This two-step decoding procedure, first ŵ and then (ŝ, ŷ), is necessary given that no explicit serialization of w and s is imposed, as in [13, 29]. While decoding for ŝ, the additional input ŵ is given, and each slot prediction conditions on the full hypothesis through self-attention in BERT. Note that here and throughout this work, we only take the top-1 hypothesis (instead of top-N) to decode for ŝ and ŷ.
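The two-step decoding above can be sketched as follows; `asr_top1` and `bert_heads` are hypothetical stand-ins for ASR beam-search decoding and the BERT-based IC/SL heads:

```python
import numpy as np

def decode_two_stage(audio, asr_top1, bert_heads):
    """Two-step decoding: first the ASR top-1 hypothesis, then IC/SL on top.

    asr_top1: audio -> word sequence (beam-search top-1 in the real system).
    bert_heads: words -> (slot_logits of shape (T, n_slots),
                          intent_logits of shape (n_intents,)).
    """
    words = asr_top1(audio)                     # step 1: decode the text
    slot_logits, intent_logits = bert_heads(words)
    slots = slot_logits.argmax(axis=-1)         # step 2: per-token slot labels
    intent = int(intent_logits.argmax())        #         and a single intent
    return words, slots, intent
```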
3 Learning with Less Supervision
Our semi-supervised framework relies on pretrained ASR and NLU components. Depending on the accessibility of the data, we explored two levels of supervision (in either setting, the amount of IC/SL annotations remains the same). The first setting is where an external transcribed corpus is available, and we utilized transfer learning for initializing the ASR. The second setting is where external audio is available but not its transcriptions; in this case, the ASR is initialized with self-supervised learning. In both settings, BERT is pretrained with MLM and NSP as described in the original BERT work. Figure 3 distinguishes the two learning settings.
3.1 Transfer Learning from a Pretrained ASR
3.2 Unsupervised ASR Pretraining with wav2vec
According to UNESCO, 43% of the languages in the world are endangered. Supervised pretraining is not possible for many languages, as transcribing a language requires expert knowledge in phonetics, morphology, syntax, and so on. This partially motivates the line of self-supervised learning work in speech: powerful learned representations should require little fine-tuning data. Returning to our topic, we ask: how does self-supervised learning help with learning semantics?
Among many others, wav2vec 1.0 and 2.0 demonstrated the effectiveness of self-supervised representations for ASR. They are pretrained with contrastive losses and differ mainly in their architectures. We replaced the ASR encoder with these wav2vec features and appended the decoder for fine-tuning on SLU.
4 Experiments

Datasets. ATIS contains 8hr of audio recordings of people making flight reservations, with corresponding human transcripts. A total of 5.2k utterances from more than 600 speakers are present. Note that ATIS is considerably smaller than the in-house SLU data used in [9, 13, 28, 24], justifying our limited-semantic-labels setup. Waveforms are sampled at 16kHz. For pretraining without semantic labels, we selected Librispeech 960hr (LS-960). Besides the original ATIS, models are evaluated on its noisy copy (augmented with MS-SNSD). We made sure the noisy train and test splits in MS-SNSD do not overlap. Text normalization is applied to the ATIS transcriptions with an open-source tool (https://github.com/EFord36/normalise). Utterances are ignored if they contain words with multiple possible slot tags.
Hyperparameters. All speech is represented as sequences of 83-dimensional Mel-scale filter-bank features, computed every 10ms. Global mean normalization is applied. The E2E ASR is implemented in ESPnet, with 12 Transformer encoder layers and 6 decoder layers; the Transformer configuration follows prior work. The E2E ASR is optimized with hybrid CTC/attention losses with label smoothing. The decoding beam size is set to 5 throughout this work, and we do not use an external LM during decoding. SpecAugment is used as the default data augmentation. The SentencePiece (BPE) vocabulary size is set to 1k. BERT is bert-base-uncased from HuggingFace. Code will be made available (Semi-Supervsied-Spoken-Language-Understanding-PyTorch).
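The global mean normalization mentioned above can be sketched as follows; in practice the mean is estimated over the training corpus, while here it defaults to the given feature matrix for illustration:

```python
import numpy as np

def global_mean_normalize(feats, mean=None):
    """Subtract a global mean from filter-bank features.

    feats: (T, 83) Mel filter-bank frames. If no corpus-level mean is
    provided, the mean of the given frames is used as a stand-in.
    """
    if mean is None:
        mean = feats.mean(axis=0)
    return feats - mean
```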
4.1 E2E Evaluation with Slots Edit F1
Our framework is evaluated with an end-to-end evaluation metric, termed the slots edit F1. Unlike the standard slots F1 score, slots edit F1 accounts for instances where predicted sequences have different lengths than the ground truth. It bears similarity with the E2E metrics proposed in [13, 24]. To calculate the score, the predicted text and oracle text are first aligned. For each slot label in the set of all possible slot labels except the "O" tag, we calculate the insertions (false positives, FP), deletions (false negatives, FN), and substitutions (FN and FP) of its slot values. Slots edit F1 is then computed from these counts.
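A minimal sketch of the slots edit F1 computation, assuming a plain Levenshtein alignment over words and micro-averaged counts; details such as tie-breaking in the alignment are illustrative choices, not the paper's exact implementation:

```python
def align(ref, hyp):
    """Levenshtein alignment; returns (ref_idx or None, hyp_idx or None) pairs."""
    n, m = len(ref), len(hyp)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + cost)
    pairs, i, j = [], n, m
    while i > 0 or j > 0:  # backtrace
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1):
            pairs.append((i - 1, j - 1)); i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            pairs.append((i - 1, None)); i -= 1      # deletion in hypothesis
        else:
            pairs.append((None, j - 1)); j -= 1      # insertion in hypothesis
    return list(reversed(pairs))

def slot_edit_f1(ref_words, ref_slots, hyp_words, hyp_slots):
    """Micro-averaged F1 over non-"O" slots after aligning predicted text."""
    tp = fp = fn = 0
    for ri, hi in align(ref_words, hyp_words):
        r = ref_slots[ri] if ri is not None else "O"
        h = hyp_slots[hi] if hi is not None else "O"
        rw = ref_words[ri] if ri is not None else None
        hw = hyp_words[hi] if hi is not None else None
        if r == "O" and h == "O":
            continue
        if r == h and rw == hw:
            tp += 1                 # correct slot value
        else:
            if r != "O": fn += 1    # deletion or substitution
            if h != "O": fp += 1    # insertion or substitution
    p = tp / (tp + fp) if tp + fp else 0.0
    rcl = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * rcl / (p + rcl) if p + rcl else 0.0
```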
4.2 End-to-End 2-Stage Fine-tuning
An observation from our experiments was that ASR is much harder than IC/SL. Therefore, we adjusted our end-to-end training to a two-stage fine-tuning: pretrain ASR on LS-960, then fine-tune ASR on ATIS, and lastly jointly fine-tune for ASR and IC/SL on ATIS.
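The two-stage fine-tuning recipe can be written down as a simple plan; the component names are illustrative, not the codebase's actual module names:

```python
def training_schedule():
    """The staged recipe: ASR pretraining, ASR fine-tuning, joint fine-tuning.

    Each entry lists the data used and which components receive gradients.
    """
    return [
        {"stage": "pretrain",       "data": "LS-960", "train": ["asr"]},
        {"stage": "asr-finetune",   "data": "ATIS",   "train": ["asr"]},
        {"stage": "joint-finetune", "data": "ATIS",   "train": ["asr", "bert", "ic_sl_heads"]},
    ]
```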
4.3 Baselines: Alternative Formulations
Two variations for constructing f_θ are presented (refer to Figure 2). They serve as baselines to the end-to-end approach.
2-Stage: Cascade ASR to BERT. A natural counterpart to the E2E approach is to separately pretrain and fine-tune ASR and BERT. In this case, errors from the SL and IC predictions cannot be back-propagated to θ_ASR.
SpeechBERT: BERT in a Speech-Text Embedding Space. Another sensible way to construct f_θ is to "adapt" BERT such that it takes audio as input and outputs IC/SL, while not compromising its original semantic learning capacity. SpeechBERT was initially proposed for spoken question answering, but we found the core idea of training BERT with audio-text pairs fitting as another baseline for our end-to-end approach. We modified its pretraining and fine-tuning setup for SLU. Audio MLM (cf. MLM in BERT) pretrains θ_BERT by mapping masked audio segments to text. This pretraining step gradually adapts the original BERT to a phonetic-semantic joint embedding space. Then, θ_BERT is fine-tuned by mapping unmasked audio segments to IC/SL. Figure 4 illustrates the audio-text and audio-IC/SL pairs for SpeechBERT. Unlike the end-to-end approach, the ASR component is kept frozen throughout SpeechBERT pretraining and fine-tuning.
4.4 Main Results on Clean ATIS
We benchmarked our proposed framework against several prior works; Table 1 presents their WER, slots edit F1, and intent F1 results. JointBERT is our NLU baseline, where BERT is jointly fine-tuned for IC/SL; it reaches around 95% slots edit F1 and over 98% intent F1. Since JointBERT has access to the oracle text, it serves as the upper bound for our SLU models with speech as input. CLM-BERT explored using an in-house conversational LM for NLU. We also replicated a serialized-output baseline, where an LAS model directly predicts interleaving word and slot tokens and is optimized with CTC over words and slots, and we experimented with a Kaldi hybrid ASR.
Both our proposed end-to-end and baseline approaches surpassed prior SLU work. We hypothesize that the performance gain originates from our choices of (1) adopting pretrained E2E ASR and BERT, (2) applying text normalization to target transcriptions when training the ASR, and (3) fine-tuning text and IC/SL end-to-end.
Table 1 (ATIS clean test):

| Frameworks | Unlabeled Semantics Data | WER | slots edit F1 | intent F1 |
| --- | --- | --- | --- | --- |
| NLU with Oracle Text | | | | |
| End-to-End w/ 2-Stage | LS-960 | 2.18 | 95.88 | 97.26 |
| ASR-Robust Embed | WSJ | 15.55 | - | 95.65 |
| Kaldi Hybrid ASR+BERT | LS-960 | 13.31 | 85.13 | 94.56 |
| ASR+CLM-BERT | in-house | 18.4 | 93.8* | 97.1 |

*For ASR+CLM-BERT, model predictions are evaluated only if the ASR hypothesis and the human transcription have the same number of tokens.
4.5 Environmental Noise Augmentation
A common scenario in which users utter spoken commands to SLU engines is with environmental noise in the background. Nonetheless, common SLU benchmarking datasets such as ATIS, SNIPS, or FSC are very clean. To quantify model robustness under noisy settings, we augmented ATIS with environmental noise from MS-SNSD. Table 2 reveals that models that work well on clean ATIS may break under realistic noise; although our models are trained with SpecAugment, there is still a 4-27% performance drop from the clean test set.
We followed the MS-SNSD noise augmentation protocol, where for each sample, five noise files are sampled and added to the clean file at several SNR levels (in dB), resulting in a five-fold augmentation. We observe that augmenting the training data with a diverse set of environmental noises works well, and there is now minimal model degradation. Our end-to-end approach reaches 95.46% for SL and 97.40% for IC, merely a 1-2% drop from the clean test set and almost a 40% improvement over hybrid ASR+BERT.
Table 2 (ATIS noisy test):

| Frameworks | WER | slots edit F1 | intent F1 |
| --- | --- | --- | --- |
| Kaldi Hybrid ASR+BERT | 44.72 | 69.55 | 88.94 |
| Proposed w/ Noise Aug.: End-to-End w/ 2-Stage | 3.6 | 95.46 | 97.40 |
| Proposed w/o Noise Aug.: End-to-End w/ 2-Stage | 9.62 | 91.54 | 96.14 |
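The core of the augmentation, scaling a noise file so that it mixes with the clean waveform at a requested SNR, can be sketched as follows (a minimal version; a fuller pipeline would also randomly crop the noise):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale noise to the requested SNR (in dB) and add it to the clean signal.

    Assumes noise is at least as long as clean; the scale is chosen so that
    clean power / scaled-noise power equals 10**(snr_db / 10).
    """
    noise = noise[:len(clean)]
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12  # guard against silent noise files
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise
```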
4.6 Effectiveness of Unsupervised Pretraining with wav2vec
Table 3 shows results for different ASR pretraining strategies: unsupervised pretraining with wav2vec, transfer learning from ASR, and no pretraining at all. We extracted both the latent vectors and the context vectors from wav2vec 1.0. To simplify the pipeline, and in contrast to the original wav2vec recipe, we pre-extracted the wav2vec features and did not fine-tune wav2vec on ATIS. We also chose not to decode with an LM, to be consistent with prior SLU work. We first observe the high WER for the latent vectors from wav2vec 1.0, indicating they are sub-optimal and only marginally better than training from scratch. Nonetheless, encouragingly, the context vectors from wav2vec 1.0 reach 67% slots edit F1 and 90% intent F1.
To improve these results, we added subsampling layers on top of the wav2vec features to downsample the sequence length with convolutions. The motivation is that the latent and context vectors are considerably longer than typical ASR encoder outputs. With subsampling, the context vectors from wav2vec 1.0 now achieve 85.64% for SL and 95.67% for IC, a huge relative improvement over training ASR from scratch, closing the gap between unsupervised and supervised pretraining for SLU.
Table 3 (ATIS clean test):

| Frameworks | WER | slots edit F1 | intent F1 |
| --- | --- | --- | --- |
| Proposed 2-Stage w/o ASR Pretraining | | | |
| Proposed 2-Stage w/ Transfer Learning from ASR | | | |
| Proposed 2-Stage w/ Unsupervised Pretraining: | | | |
| wav2vec 1.0 (latent) + 2-Stage | 54.2 | 35.04 | 83.68 |
| wav2vec 1.0 (context) + 2-Stage | 30.4 | 67.33 | 89.86 |
| wav2vec 1.0 (context) + subsample + 2-Stage | 13.2 | 85.64 | 95.67 |
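The subsampling idea can be illustrated with parameter-free average pooling along time; the actual subsampling layers are learned strided convolutions, so this is only a stand-in showing the length reduction:

```python
import numpy as np

def subsample(feats, factor=4):
    """Downsample a (T, d) feature sequence along time by average pooling.

    Frames beyond the last full window are dropped, mirroring a strided
    convolution without padding.
    """
    T = (len(feats) // factor) * factor
    return feats[:T].reshape(-1, factor, feats.shape[1]).mean(axis=1)
```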
5 Conclusions and Future Work
This work attempts to respond to the classic paper "What is left to be understood in ATIS?" in light of the advances put forward by contextualized LMs and end-to-end SLU for semantic understanding. We showed for the first time that an SLU model with speech as input can perform on par with NLU models on ATIS, entering the 5% "corpus errors" range [30, 2]. However, we believe unsolved questions remain, such as the prospect of building a single framework for multi-lingual SLU, or the need for a more spontaneous SLU corpus that is not limited to short segments of spoken commands.
Acknowledgments We thank Nanxin Chen, Erica Cooper, Alexander H. Liu, Wei Fang, and Fan-Keng Sun for their comments on this work.
-  (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477. Cited by: §1, Figure 3, §3.2, Table 3.
-  (2018) Is ATIS too shallow to go deeper for benchmarking spoken language understanding models? Cited by: §5.
-  (2020) Style attuned pre-training and parameter efficient fine-tuning for spoken language understanding. arXiv preprint arXiv:2010.04355. Cited by: 1st item, §4.4, Table 1, footnote 7.
-  (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. Cited by: §4.4, §4.6.
-  (2019) Bert for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909. Cited by: 2nd item, §1, §2.1, §4.4, Table 1.
-  (2020) Speech to text adaptation: towards an efficient cross-modal distillation. arXiv preprint arXiv:2005.08213. Cited by: §1.
-  (2019) SpeechBERT: cross-modal pre-trained language model for end-to-end spoken question answering. arXiv preprint arXiv:1910.11559. Cited by: Figure 4, §4.3.
-  (2019) An unsupervised autoregressive model for speech representation learning. arXiv preprint arXiv:1904.03240. Cited by: §1.
-  (2018) Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190. Cited by: 1st item, §1, §1, §4.5, §4, footnote 2.
-  (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §3, §4.3.
-  (2018) End-to-end named entity extraction from speech. arXiv preprint arXiv:1805.12045. Cited by: 1st item, §1.
-  (1995) Multilingual spoken-language understanding in the mit voyager system. Speech communication 17 (1-2), pp. 1–18. Cited by: §5.
-  (2018) From audio to semantics: approaches to end-to-end spoken language understanding. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 720–726. Cited by: 1st item, §1, §1, §2.2, §4.1, §4.
-  (1990) The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990. Cited by: §1, §4.
-  (2020) Learning asr-robust contextualized embeddings for spoken language understanding. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8009–8013. Cited by: 1st item, §1, Table 1.
-  (2018) Sentencepiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226. Cited by: Figure 2.
-  (2019) Mitigating the impact of speech recognition errors on spoken question answering by adversarial domain adaptation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7300–7304. Cited by: 1st item.
-  (2020) Mockingjay: unsupervised speech representation learning with deep bidirectional transformer encoders. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6419–6423. Cited by: §1.
-  (2019) Speech model pre-training for end-to-end spoken language understanding. arXiv preprint arXiv:1904.03670. Cited by: 1st item, §1, §1, §3.1, §4.5.
-  (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §1, §3.2.
-  (2015) Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. Cited by: §4.
-  (2019) Specaugment: a simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779. Cited by: §4.
-  (2020) End-to-end neural transformer based spoken language understanding. arXiv preprint arXiv:2008.10984. Cited by: §1, §4.
-  (2020) Speech to semantics: improve asr and nlu jointly via all-neural interfaces. arXiv preprint arXiv:2008.06173. Cited by: 1st item, §1, §1, §3.1, §4.1, §4.
-  (2019) A scalable noisy speech dataset and online subjective test framework. arXiv preprint arXiv:1909.08050. Cited by: §4.5, §4.
-  (2019) Wav2vec: unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862. Cited by: §1, Figure 3, §3.2, §4.6, Table 3.
-  (1992) TINA: a natural language system for spoken language applications. Computational linguistics 18 (1), pp. 61–86. Cited by: §1.
-  (2018) Towards end-to-end spoken language understanding. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5754–5758. Cited by: 1st item, §1, §1, §4.
-  (2019) Recent advances in end-to-end spoken language understanding. In International Conference on Statistical Language and Speech Processing, pp. 44–55. Cited by: 1st item, §1, §1, §2.2, §3.1, §4.4, Table 1.
-  (2010) What is left to be understood in atis?. In 2010 IEEE Spoken Language Technology Workshop, pp. 19–24. Cited by: §1, §1, §5, footnote 2.
-  (2020) Large-scale unsupervised pre-training for end-to-end spoken language understanding. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7999–8003. Cited by: §1, §1.
-  (2017) Hybrid ctc/attention architecture for end-to-end speech recognition. IEEE Journal of Selected Topics in Signal Processing 11 (8), pp. 1240–1253. Cited by: §4.
-  (2019) Gunrock: a social bot for complex and engaging long conversations. arXiv preprint arXiv:1910.03042. Cited by: §1, footnote 2.
-  (2015) End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1127–1137. Cited by: §1.
-  (2017) Encoder-decoder with focus-mechanism for sequence labelling based spoken language understanding. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5675–5679. Cited by: §1.