Two-stage Textual Knowledge Distillation to Speech Encoder for Spoken Language Understanding

10/25/2020
by Seongbin Kim et al.

End-to-end approaches open a new way for more accurate and efficient spoken language understanding (SLU) systems by alleviating the drawbacks of traditional pipeline systems. Previous works exploit textual information for an SLU model via pre-training with automatic speech recognition or fine-tuning with knowledge distillation. To utilize textual information more effectively, this work proposes a two-stage textual knowledge distillation method that matches utterance-level representations and predicted logits of the two modalities during pre-training and fine-tuning, sequentially. We use vq-wav2vec BERT as a speech encoder because it captures general and rich features. Furthermore, we improve the performance, especially in a low-resource scenario, with data augmentation methods that randomly mask spans of discrete audio tokens and contextualized hidden representations. Consequently, we push the state of the art on Fluent Speech Commands, achieving 99.7% test accuracy in the full dataset setting and 99.5% in the 10% subset setting. Through ablation studies, we empirically verify that all the methods used are crucial to the final performance, providing the best practice for spoken language understanding. Code to reproduce our results will be available upon publication.
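The two distillation objectives can be made concrete with a short sketch. The PyTorch code below is a minimal illustration, not the authors' released implementation: the function names, the MSE choice for representation matching, and the hyperparameters (temperature T, mixing weight alpha) are assumptions in line with standard knowledge distillation practice.

    import torch
    import torch.nn.functional as F

    def pretrain_distill_loss(speech_utt_repr, text_utt_repr):
        # Stage 1 (pre-training): match the utterance-level representation of the
        # speech encoder to that of a frozen text encoder (e.g. BERT). An MSE loss
        # is one plausible choice; the paper's exact matching loss may differ.
        return F.mse_loss(speech_utt_repr, text_utt_repr.detach())

    def finetune_distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Stage 2 (fine-tuning): match the predicted logits of the two modalities
        # via temperature-scaled soft targets, mixed with the task cross-entropy.
        kd = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits.detach() / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1.0 - alpha) * ce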

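The span-masking augmentation admits a similarly compact sketch. The masking probability and span length below are placeholder values rather than the paper's hyperparameters; the same routine carries over to contextualized hidden representations by masking time steps of the hidden-state tensor instead of replacing token ids.

    import torch

    def mask_token_spans(tokens, mask_id, mask_prob=0.065, span_len=10):
        # tokens: (batch, seq_len) LongTensor of vq-wav2vec codebook indices.
        # Each position is independently chosen as a span start with probability
        # mask_prob; span_len consecutive tokens from each start are replaced
        # with mask_id (overlapping spans simply merge, and spans are clipped
        # at the end of the sequence by slicing).
        masked = tokens.clone()
        starts = torch.rand(tokens.shape) < mask_prob
        for b in range(tokens.size(0)):
            for s in starts[b].nonzero(as_tuple=True)[0].tolist():
                masked[b, s : s + span_len] = mask_id
        return masked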
