Joint Unsupervised and Supervised Training for Multilingual ASR

11/15/2021
by Junwen Bai et al.

Self-supervised training has shown promising gains in pretraining models and facilitating downstream finetuning for speech recognition, such as multilingual ASR. Most existing methods adopt a 2-stage scheme where the self-supervised loss is optimized in the first pretraining stage, and standard supervised finetuning resumes in the second stage. In this paper, we propose an end-to-end (E2E) Joint Unsupervised and Supervised Training (JUST) method to combine the supervised RNN-T loss and the self-supervised contrastive and masked language modeling (MLM) losses. We validate its performance on the public dataset Multilingual LibriSpeech (MLS), which includes 8 languages and is extremely imbalanced. On MLS, we explore (1) JUST trained from scratch, and (2) JUST finetuned from a pretrained checkpoint. Experiments show that JUST consistently outperforms other existing state-of-the-art methods and beats the monolingual baseline by a significant margin, demonstrating JUST's capability of handling low-resource languages in multilingual ASR. Our average WER across all languages outperforms the average monolingual baseline by 33.3% and the state-of-the-art 2-stage XLSR by 32%. On low-resource languages, our WER is less than half of the monolingual baseline and even beats the supervised transfer learning method, which uses external supervision.
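To illustrate the joint objective described above, here is a minimal sketch (not the authors' implementation) of how the supervised RNN-T loss and the two self-supervised losses might be combined into a single end-to-end training loss. The function name and the weighting coefficients w_contrastive and w_mlm are hypothetical and not taken from the paper.

```python
def just_loss(rnnt_loss, contrastive_loss, mlm_loss,
              w_contrastive=0.1, w_mlm=0.1):
    """Sketch of a joint objective: the supervised RNN-T loss plus
    weighted self-supervised contrastive and MLM losses, optimized
    together in one stage rather than in a pretrain-then-finetune
    2-stage scheme. Weights are illustrative hyperparameters."""
    return rnnt_loss + w_contrastive * contrastive_loss + w_mlm * mlm_loss
```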
