Speech-language Pre-training for End-to-end Spoken Language Understanding

by Yao Qian, et al.

End-to-end (E2E) spoken language understanding (SLU) can infer semantics directly from the speech signal without cascading an automatic speech recognizer (ASR) with a natural language understanding (NLU) module. However, paired utterance recordings and corresponding semantics may not always be available or sufficient to train an E2E SLU model in a real production environment. In this paper, we propose to unify a well-optimized E2E ASR encoder (speech) and a pre-trained language model encoder (language) into a transformer decoder. The unified speech-language pre-trained model (SLP) is continually enhanced on limited labeled data from a target domain by using a conditional masked language model (MLM) objective, and thus can effectively generate a sequence of intent, slot type, and slot value for given input speech at inference time. The experimental results on two public corpora show that our approach to E2E SLU is superior to the conventional cascaded method. It also outperforms the current state-of-the-art approaches to E2E SLU while using much less paired data.
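
The abstract describes the output as a sequence of intent, slot type, and slot value, trained with a conditional MLM objective. The following is only a minimal sketch of what preparing such a training target might look like, not the authors' implementation: the flat serialization order, the `[MASK]` symbol, the masking ratio, and the names `linearize` and `mask_targets` are all assumptions made for illustration. In the actual model the masked positions would be predicted conditioned on the speech encoder output, which is omitted here.

```python
import random

MASK = "[MASK]"  # assumed mask symbol; the paper's vocabulary may differ


def linearize(intent, slots):
    """Flatten an intent and (slot_type, slot_value) pairs into one token list.

    The ordering (intent first, then each slot type followed by its value
    tokens) is an assumed serialization, chosen only for illustration.
    """
    tokens = [intent]
    for slot_type, slot_value in slots:
        tokens.append(slot_type)
        tokens.extend(slot_value.split())
    return tokens


def mask_targets(tokens, mask_ratio, rng):
    """Replace a random subset of target tokens with MASK.

    Returns (masked_tokens, labels). labels holds the original token at each
    masked position and None elsewhere, so a loss would be computed only on
    the masked positions.
    """
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels


# Hypothetical utterance: "set an alarm for seven am tomorrow"
target = linearize("set_alarm", [("time", "seven am"), ("day", "tomorrow")])
masked, labels = mask_targets(target, mask_ratio=0.5, rng=random.Random(0))
```

Here `target` is the full sequence the decoder should produce, `masked` is the corrupted copy a decoder would see during training, and `labels` marks which positions contribute to the loss.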





Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning

In the traditional cascading architecture for spoken language understand...

End-to-End Spoken Language Understanding for Generalized Voice Assistants

End-to-end (E2E) spoken language understanding (SLU) systems predict utt...

FANS: Fusing ASR and NLU for on-device SLU

Spoken language understanding (SLU) systems translate voice input comman...

Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs

A major focus of recent research in spoken language understanding (SLU) ...

Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining

Much recent work on Spoken Language Understanding (SLU) is limited in at...

Speech To Semantics: Improve ASR and NLU Jointly via All-Neural Interfaces

We consider the problem of spoken language understanding (SLU) of extrac...

From Audio to Semantics: Approaches to end-to-end spoken language understanding

Conventional spoken language understanding systems consist of two main c...