TESSP: Text-Enhanced Self-Supervised Speech Pre-training

11/24/2022
by Zhuoyuan Yao, et al.

Self-supervised speech pre-training empowers a model with the contextual structure inherent in the speech signal, while self-supervised text pre-training empowers it with linguistic information. Both are beneficial for downstream speech tasks such as ASR. However, the distinct pre-training objectives make it challenging to jointly optimize speech and text representations in the same model. To solve this problem, we propose Text-Enhanced Self-Supervised Speech Pre-training (TESSP), which aims to incorporate linguistic information into speech pre-training. Our model consists of three parts: a speech encoder, a text encoder, and a shared encoder. The model takes unsupervised speech and text data as input and applies the standard HuBERT and MLM losses, respectively. We also propose phoneme up-sampling and representation swapping to enable joint modeling of speech and text information. Specifically, to address the length mismatch between speech and text data, we phonemize the text sequence and up-sample the phonemes using alignment information extracted from a small set of supervised data. Moreover, to close the gap between the learned speech and text representations, we swap the text representation with the speech representation extracted by the respective private encoders according to the alignment information. Experiments on the Librispeech dataset show that the proposed TESSP model achieves more than a 10% improvement compared with WavLM on the test-clean and test-other sets. We also evaluate our model on the SUPERB benchmark, where it outperforms WavLM on Phoneme Recognition, Automatic Speech Recognition, and Speech Translation.
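
The abstract mentions two mechanisms: phoneme up-sampling to match text length to speech length, and representation swapping to mix aligned speech and text frames before the shared encoder. The authors' implementation is not reproduced here; the PyTorch-style sketch below only illustrates how these two operations might look. All names (up_sample_phonemes, swap_representations), tensor shapes, and the random swap policy are illustrative assumptions, not the paper's released code.

```python
# Illustrative sketch only -- function names, shapes, and the swap policy are
# assumptions about TESSP's up-sampling and swapping, not the authors' code.
import torch


def up_sample_phonemes(phoneme_emb: torch.Tensor,
                       durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme embedding by its frame duration.

    phoneme_emb: (num_phonemes, dim) text-encoder outputs for one utterance.
    durations:   (num_phonemes,) frame counts from forced alignment on the
                 small supervised set, so the result matches the speech length.
    """
    return torch.repeat_interleave(phoneme_emb, durations, dim=0)


def swap_representations(speech_repr: torch.Tensor,
                         text_repr: torch.Tensor,
                         swap_mask: torch.Tensor) -> torch.Tensor:
    """Replace up-sampled text frames with the aligned speech frames where
    swap_mask is True, producing a mixed sequence for the shared encoder.

    speech_repr: (num_frames, dim) speech-encoder output.
    text_repr:   (num_frames, dim) up-sampled text-encoder output.
    swap_mask:   (num_frames,) bool, True where the speech frame is kept.
    """
    return torch.where(swap_mask.unsqueeze(-1), speech_repr, text_repr)


if __name__ == "__main__":
    dim = 8
    phonemes = torch.randn(3, dim)             # 3 phoneme embeddings
    durations = torch.tensor([2, 4, 3])        # frames per phoneme (9 total)
    text_frames = up_sample_phonemes(phonemes, durations)   # (9, dim)

    speech_frames = torch.randn(9, dim)        # aligned speech-encoder output
    mask = torch.rand(9) < 0.5                 # swap roughly half the frames
    mixed = swap_representations(speech_frames, text_frames, mask)
    print(text_frames.shape, mixed.shape)      # torch.Size([9, 8]) twice
```

Duration-based repetition keeps the text branch frame-synchronous with the speech branch, which is what makes frame-wise swapping between the two representations possible at all.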


Related research

research · 04/11/2022
Unified Speech-Text Pre-training for Speech Translation and Recognition
We describe a method to jointly pre-train speech and text in an encoder-...

research · 10/30/2022
token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text
Self-supervised pre-training has been successful in both text and speech...

research · 04/23/2023
SAR: Self-Supervised Anti-Distortion Representation for End-To-End Speech Model
In recent Text-to-Speech (TTS) systems, a neural vocoder often generates...

research · 10/20/2021
SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training
Unsupervised pre-training is now the predominant approach for both text ...

research · 08/11/2023
Improving Joint Speech-Text Representations Without Alignment
The last year has seen astonishing progress in text-prompted image gener...

research · 11/02/2022
Phoneme Segmentation Using Self-Supervised Speech Models
We apply transfer learning to the task of phoneme segmentation and demon...

research · 10/05/2020
Semi-Supervised Speech-Language Joint Pre-Training for Spoken Language Understanding
Spoken language understanding (SLU) requires a model to analyze input ac...
