Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages

05/02/2022
by Felix Wu, et al.
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task – transcribing audio inputs into pseudo subword sequences. This process stands on its own, or can be applied as low-cost second-stage pre-training. We experiment with automatic speech recognition (ASR), spoken named entity recognition, and speech-to-text translation. We set new state-of-the-art results for end-to-end spoken named entity recognition, and show consistent improvements on 20 language pairs for speech-to-text translation, even when competing methods use additional text data for training. Finally, on ASR, our approach enables encoder-decoder methods to benefit from pre-training for all parts of the network, and shows comparable performance to highly optimized recent methods.
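The abstract does not spell out how the pseudo language is induced, so the following is only a minimal sketch under stated assumptions. It assumes each utterance has already been mapped to a sequence of frame-level discrete unit IDs (for example, by clustering acoustic features), that runs of repeated units are collapsed, and that a byte-pair-encoding style merge procedure turns the unit sequences into pseudo subwords. All function names here (`collapse_repeats`, `learn_bpe`, `apply_merge`, `encode`) are hypothetical and written for illustration; they are not taken from the authors' code.

```python
from collections import Counter
from itertools import groupby


def collapse_repeats(units):
    """Merge each run of identical consecutive unit IDs into one ID."""
    return [u for u, _ in groupby(units)]


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])  # concatenate the two tuples
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(sequences, num_merges):
    """Learn up to `num_merges` merges; each symbol is a tuple of unit IDs."""
    seqs = [[(u,) for u in seq] for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in seqs:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:  # stop when no pair repeats
            break
        merges.append(best)
        seqs = [apply_merge(seq, best) for seq in seqs]
    return merges


def encode(units, merges):
    """Turn frame-level unit IDs into a pseudo subword sequence."""
    seq = [(u,) for u in collapse_repeats(units)]
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


# Toy frame-level unit IDs for two utterances (made-up values, assumption).
frame_units = [
    [3, 3, 3, 7, 7, 1, 1, 1, 3, 7],
    [3, 3, 7, 7, 7, 1, 3, 3, 7, 7],
]
merges = learn_bpe([collapse_repeats(s) for s in frame_units], num_merges=1)
print(encode(frame_units[0], merges))  # [(3, 7), (1,), (3, 7)]
```

In a setup like this, the encoded sequences would serve as the target "transcripts" for the pseudo speech recognition task, so an encoder-decoder model could be trained on audio alone before fine-tuning on real text.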

research
11/29/2022

MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition

In this paper, we propose a novel multi-modal multi-task encoder-decoder...
research
03/31/2022

Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data

This paper studies a novel pre-training technique with unpaired speech d...
research
12/14/2021

On the Use of External Data for Spoken Named Entity Recognition

Spoken language understanding (SLU) tasks involve mapping from speech au...
research
04/11/2022

Unified Speech-Text Pre-training for Speech Translation and Recognition

We describe a method to jointly pre-train speech and text in an encoder-...
research
02/12/2022

USTED: Improving ASR with a Unified Speech and Text Encoder-Decoder

Improving end-to-end speech recognition by incorporating external text d...
research
03/30/2021

Pre-training for low resource speech-to-intent applications

Designing a speech-to-intent (S2I) agent which maps the users' spoken co...
research
04/02/2022

End-to-end model for named entity recognition from speech without paired training data

Recent works showed that end-to-end neural approaches tend to become ver...
