Maestro-U: Leveraging joint speech-text representation learning for zero supervised speech ASR

by Zhehuai Chen, et al.

Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages. This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero-supervised-speech, real-world setting to expand the set of languages covered by ASR using only unlabeled speech and text in the target languages. Using the FLEURS dataset, we define the task to cover 102 languages, where transcribed speech is available in 52 of these languages and can be used to improve end-to-end ASR quality on the remaining 50. First, we show that by combining speech representations with byte-level text representations and language embeddings, we can dramatically reduce the Character Error Rate (CER) on languages with no supervised speech from 64.8% to 30.8%, a relative reduction of 53%. Second, using a subset of South Asian languages, we show that Maestro-U can promote knowledge transfer from languages with supervised speech even when there is little to no graphemic overlap. Overall, Maestro-U closes the gap to oracle performance by 68.5% relative and reduces the CER of 19 languages below 15%.
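The quantities in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names (`edit_distance`, `cer`, `relative_reduction`, `to_byte_ids`) are hypothetical, CER is assumed to be the character-level Levenshtein distance divided by the reference length, and "byte-level text representation" is assumed to mean the UTF-8 byte sequence of the text. Any text normalization the paper applies before scoring is not specified here and is left out.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate in percent (assumed definition: edits / reference length)."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction in percent, going from `baseline` to `improved`."""
    return 100.0 * (baseline - improved) / baseline


def to_byte_ids(text: str) -> list[int]:
    """Assumed byte-level representation: UTF-8 bytes, so every id is in 0..255."""
    return list(text.encode("utf-8"))


if __name__ == "__main__":
    # CER figures reported in the abstract for languages with no supervised speech.
    print(round(relative_reduction(64.8, 30.8), 1))
    # Characters from different scripts map into the same 256-symbol vocabulary.
    print(to_byte_ids("a"), to_byte_ids("न"))
```

With the reported figures, `relative_reduction(64.8, 30.8)` evaluates to roughly 52.5, consistent with the rounded 53% quoted above. The byte-level view also suggests why transfer is possible without graphemic overlap: every script maps into one shared vocabulary of 256 byte values, so a new writing system adds no output symbols.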




