Large scale weakly and semi-supervised learning for low-resource video ASR

05/16/2020
by Kritika Singh, et al.

Many semi- and weakly-supervised approaches have been investigated for overcoming the labeling cost of building high-quality speech recognition systems. On the challenging task of transcribing social media videos in low-resource conditions, we conduct a large-scale systematic comparison between two self-labeling methods on the one hand and weakly-supervised pretraining using contextual metadata on the other. We investigate distillation methods at the frame level and the sequence level for hybrid, encoder-only CTC-based, and encoder-decoder speech recognition systems on Dutch and Romanian, using 27,000 and 58,000 hours of unlabeled audio respectively. Although all approaches improved upon their respective baseline WERs by more than 8%, sequence-level distillation for encoder-decoder models provided the largest relative WER reduction, 20% over the supervised baseline.
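The improvements above are reported as relative WER reductions. As a minimal sketch of that arithmetic, assuming the standard word-level edit-distance definition of WER, the Python below computes WER for one utterance and the relative reduction between two systems. The function names and all numbers are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline WER removed by the new system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, for illustration only.
example_wer = word_error_rate("the cat sat down", "the cat sat town")
example_reduction = relative_wer_reduction(baseline_wer=25.0, new_wer=20.0)
```

Under these made-up values, a drop from 25.0 to 20.0 absolute WER is a 20% relative reduction, even though the absolute change is only 5 points.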
