SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

09/30/2022
by Ziqiang Zhang, et al.

How to boost speech pre-training with textual data remains an open problem, because speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) that explicitly aligns speech and text pre-training through a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities: a phoneme-unit tokenizer and a hidden-unit tokenizer, both of which can be trained with a small amount of paired speech-text data. Using the trained tokenizers, we convert unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective unifies speech and text in the same discrete semantic space with a single shared Transformer network. Leveraging only 10K text sentences, SpeechLM achieves a 16% relative WER reduction over the best base-model baseline (from 6.8 to 5.7) on the public LibriSpeech ASR benchmark. Moreover, SpeechLM outperforms previous SOTA models on the CoVoST-2 speech translation tasks with fewer parameters. We also evaluate SpeechLM on various spoken language processing tasks under the universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Our code and models are available at https://aka.ms/SpeechLM.
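The central mechanism, mapping speech and text into one shared discrete-unit vocabulary and encoding both with a single Transformer, can be sketched in a few lines of PyTorch. The sketch below is a minimal illustration, not the authors' implementation: the class name UnifiedUnitEncoder, the toy vocabulary of 500 units, the small model dimensions, and the masked-unit prediction head are all assumptions, and in SpeechLM the unit IDs would come from the trained phoneme-unit or hidden-unit tokenizers rather than from random tensors.

```python
import torch
import torch.nn as nn

class UnifiedUnitEncoder(nn.Module):
    """Hypothetical sketch: one Transformer over a unified unit vocabulary.

    A speech tokenizer (phoneme-unit or hidden-unit) and a text tokenizer
    both emit IDs in the same vocabulary, so the two modalities share one
    embedding table and one encoder, mirroring the unified discrete
    semantic space described in the abstract.
    """

    def __init__(self, num_units: int, d_model: int = 256,
                 nhead: int = 4, num_layers: int = 4):
        super().__init__()
        self.unit_embedding = nn.Embedding(num_units, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Prediction head for a masked-unit style objective: recover the
        # unit ID at each position from the shared contextual representation.
        self.head = nn.Linear(d_model, num_units)

    def forward(self, unit_ids: torch.Tensor) -> torch.Tensor:
        x = self.unit_embedding(unit_ids)   # (batch, time, d_model)
        h = self.encoder(x)                 # same weights for both modalities
        return self.head(h)                 # (batch, time, num_units)

# Toy usage: random IDs stand in for real tokenizer outputs.
model = UnifiedUnitEncoder(num_units=500)
speech_units = torch.randint(0, 500, (2, 100))  # e.g. hidden-unit tokenizer output
text_units = torch.randint(0, 500, (2, 40))     # e.g. phoneme-unit tokenizer output
print(model(speech_units).shape)  # torch.Size([2, 100, 500])
print(model(text_units).shape)    # torch.Size([2, 40, 500])
```

Because both modalities index into the same embedding table, whatever the encoder learns from abundant unlabeled text is directly available when it sees tokenized speech, which is the cross-modal alignment the abstract describes.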



Related research:

- 10/07/2022 · SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training
  The rapid development of single-modal pre-training has prompted research...

- 10/30/2022 · token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text
  Self-supervised pre-training has been successful in both text and speech...

- 11/21/2022 · VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning
  Although speech is a simple and effective way for humans to communicate...

- 02/27/2023 · Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model
  Automatic Speech Recognition (ASR) is a technology that converts spoken...

- 06/15/2023 · Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation
  The excellent generalization ability of self-supervised learning (SSL) f...

- 09/07/2021 · Text-Free Prosody-Aware Generative Spoken Language Modeling
  Speech pre-training has primarily demonstrated efficacy on classificatio...

- 10/26/2022 · IMU2CLIP: Multimodal Contrastive Learning for IMU Motion Sensors from Egocentric Videos and Text
  We present IMU2CLIP, a novel pre-training approach to align Inertial Mea...
