MAESTRO: Matched Speech Text Representations through Modality Matching

04/07/2022
by Zhehuai Chen, et al.

We present Maestro, a self-supervised training method to unify representations learnt from speech and text modalities. Self-supervised learning from speech signals aims to learn the latent structure inherent in the signal, while self-supervised learning from text attempts to capture lexical information. Learning aligned representations from unpaired speech and text sequences is a challenging task. Previous work either implicitly enforced the representations learnt from these two modalities to be aligned in the latent space through multitasking and parameter sharing, or explicitly through conversion of modalities via speech synthesis. While the former suffers from interference between the two modalities, the latter introduces additional complexity. In this paper, we propose Maestro, a novel algorithm to learn unified representations from both these modalities simultaneously that can transfer to diverse downstream tasks such as Automated Speech Recognition (ASR) and Speech Translation (ST). Maestro learns unified representations through sequence alignment, duration prediction, and matching embeddings in the learned space through an aligned masked-language model loss. We establish a new state-of-the-art (SOTA) on VoxPopuli multilingual ASR with an 11% relative reduction in Word Error Rate (WER), on multidomain SpeechStew ASR (3.7% relative), and on 21-languages-to-English multilingual ST on CoVoST 2 with an improvement of 2.8 BLEU averaged over 21 languages.
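The core mechanism described above, aligning text tokens to speech frames via predicted durations and then pulling the two embedding sequences together, can be sketched compactly. The following is a minimal, hypothetical sketch only: it assumes duration-based upsampling of token embeddings and a simple mean-squared-error matching loss, and the function names, tensor shapes, and use of PyTorch are illustrative assumptions rather than the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def upsample_by_duration(text_emb, durations):
    """Repeat each text-token embedding by its predicted frame duration.

    text_emb:  (T_text, D) token embeddings from a text encoder
    durations: (T_text,)   integer frame counts per token
    returns:   (sum(durations), D) frame-level text embeddings
    """
    return torch.repeat_interleave(text_emb, durations, dim=0)

def modality_matching_loss(speech_emb, text_emb, durations):
    """L2 loss pulling frame-level text embeddings toward the
    speech-encoder embeddings at the aligned frame positions.
    (Illustrative stand-in for Maestro's matching objective.)"""
    upsampled = upsample_by_duration(text_emb, durations)
    num_frames = min(upsampled.size(0), speech_emb.size(0))  # guard against length mismatch
    return F.mse_loss(upsampled[:num_frames], speech_emb[:num_frames])

# Toy usage with random tensors (D = 8 embedding dimensions).
speech = torch.randn(20, 8)            # 20 speech-encoder frames
text = torch.randn(5, 8)               # 5 text tokens
durs = torch.tensor([4, 4, 4, 4, 4])   # predicted durations summing to 20
print(modality_matching_loss(speech, text, durs))
```

In the toy example, the random tensors stand in for real encoder outputs; in a full system, the per-token frame counts would come from a learned duration predictor so that unpaired text can be matched against speech-frame-rate representations.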


Related research

10/18/2022 · Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR
Training state-of-the-art Automated Speech Recognition (ASR) models typi...

08/27/2021 · Injecting Text in Self-Supervised Speech Pretraining
Self-supervised pretraining for Automated Speech Recognition (ASR) has s...

07/02/2023 · Don't Stop Self-Supervision: Accent Adaptation of Speech Representations via Residual Adapters
Speech representations learned in a self-supervised fashion from massive...

08/26/2018 · Analyzing Learned Representations of a Deep ASR Performance Prediction Model
This paper addresses a relatively new task: prediction of ASR performanc...

08/11/2023 · Improving Joint Speech-Text Representations Without Alignment
The last year has seen astonishing progress in text-prompted image gener...

04/30/2019 · Self-supervised Sequence-to-sequence ASR using Unpaired Speech and Text
Sequence-to-sequence ASR models require large quantities of data to atta...

03/05/2023 · A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS
Recent work has explored using self-supervised learning (SSL) speech rep...
