Effectiveness of self-supervised pre-training for speech recognition

11/10/2019
by Alexei Baevski, et al.

We present pre-training approaches for self-supervised representation learning of speech data. A BERT-style masked language model loss on discrete features is compared with an InfoNCE-based contrastive loss on continuous speech features. The pre-trained models are then fine-tuned with a Connectionist Temporal Classification (CTC) loss to predict target character sequences. To study the impact of stacking multiple feature-learning modules trained with different self-supervised loss functions, we test the discrete and continuous BERT pre-training approaches on spectral features and on learned acoustic representations, showing synergistic behaviour between acoustically motivated and masked language model loss functions. In low-resource conditions using only 10 hours of labeled data, we achieve Word Error Rates (WER) of 10.2% and 23.5% on the standard "clean" and "other" test benchmarks of the Librispeech dataset, which is almost on par with previously published work that uses 10 times more labeled data. Moreover, compared to previous work that uses two models in tandem, using one model for both BERT pre-training and fine-tuning provides an average relative WER reduction of 9%.
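
As a concrete illustration of the InfoNCE-based contrastive objective mentioned in the abstract, the sketch below scores each masked context vector against the true latent feature at that timestep plus a handful of sampled distractors, and trains the model to pick out the true one. This is a minimal PyTorch sketch, not the paper's implementation: the cosine-similarity scoring, the temperature value, the uniform distractor sampling, and the helper name info_nce_loss are all illustrative assumptions.

    # Minimal InfoNCE sketch (illustrative assumption, not the authors' code).
    import torch
    import torch.nn.functional as F

    def info_nce_loss(context, targets, num_distractors=10, temperature=0.1):
        """context: model outputs at masked positions, shape (N, D).
        targets: true latent features at those positions, shape (N, D).
        Returns a scalar contrastive loss."""
        n, _ = targets.shape
        # Sample distractors uniformly from the other masked positions
        # (real systems typically draw them from the same utterance).
        distractor_idx = torch.randint(0, n, (n, num_distractors))
        distractors = targets[distractor_idx]                               # (N, K, D)
        candidates = torch.cat([targets.unsqueeze(1), distractors], dim=1)  # (N, K+1, D)
        # Cosine similarity between each context vector and its candidates.
        logits = F.cosine_similarity(context.unsqueeze(1), candidates, dim=-1) / temperature
        # The positive candidate always sits at index 0.
        labels = torch.zeros(n, dtype=torch.long)
        return F.cross_entropy(logits, labels)

    # Toy usage: 32 masked positions with 256-dimensional features.
    print(info_nce_loss(torch.randn(32, 256), torch.randn(32, 256)))

The discrete variant described in the abstract would instead replace this contrastive term with a standard masked-token cross-entropy over quantized targets, with CTC fine-tuning applied afterwards in either case.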
