W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training

08/07/2021
by Yu-An Chung, et al.

Motivated by the success of masked language modeling (MLM) in pre-training natural language processing models, we propose w2v-BERT, which explores MLM for self-supervised speech representation learning. w2v-BERT is a framework that combines contrastive learning and MLM: the former trains the model to discretize continuous input speech signals into a finite set of discriminative speech tokens, while the latter trains the model to learn contextualized speech representations by solving a masked prediction task that consumes the discretized tokens. In contrast to existing MLM-based speech pre-training frameworks such as HuBERT, which relies on an iterative re-clustering and re-training process, or vq-wav2vec, which concatenates two separately trained modules, w2v-BERT can be optimized in an end-to-end fashion by solving the two self-supervised tasks (the contrastive task and MLM) simultaneously. Our experiments show that w2v-BERT achieves competitive results compared to current state-of-the-art pre-trained models on the LibriSpeech benchmarks when using the Libri-Light 60k corpus as the unsupervised data. In particular, when compared to published models such as conformer-based wav2vec 2.0 and HuBERT, our model shows 5% to 10% relative WER reduction on the test-clean and test-other subsets. When applied to Google's Voice Search traffic dataset, w2v-BERT outperforms our internal conformer-based wav2vec 2.0 by more than 30% relative.
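To make the joint objective concrete, below is a minimal PyTorch sketch of the idea: a contrastive module quantizes masked speech features against a learned codebook, and a deeper MLM module predicts the resulting discrete token ids at the masked positions, with both losses summed and optimized end to end. All module names, dimensions, and the simplified codebook-classification contrastive loss are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
# Illustrative sketch only; not the authors' implementation of w2v-BERT.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyW2vBERT(nn.Module):
    def __init__(self, dim=256, codebook_size=320, mask_prob=0.15):
        super().__init__()
        self.mask_prob = mask_prob
        # Contrastive branch: a shallow context encoder plus a learned codebook
        # that turns continuous frames into discrete token ids (assumed sizes).
        self.contrastive_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.codebook = nn.Parameter(torch.randn(codebook_size, dim))
        # MLM branch: a deeper encoder that predicts those token ids at masked frames.
        self.mlm_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=4)
        self.mlm_head = nn.Linear(dim, codebook_size)
        self.mask_emb = nn.Parameter(torch.zeros(dim))

    def forward(self, features):                       # features: (B, T, dim)
        B, T, _ = features.shape
        mask = torch.rand(B, T, device=features.device) < self.mask_prob
        masked = torch.where(mask.unsqueeze(-1), self.mask_emb, features)

        # Contrastive task (simplified): pull context vectors at masked positions
        # toward the codeword chosen for the corresponding unmasked frame.
        context = self.contrastive_encoder(masked)
        logits = context @ self.codebook.t()            # (B, T, codebook_size)
        with torch.no_grad():
            targets = (features @ self.codebook.t()).argmax(-1)  # discrete token ids
        contrastive_loss = F.cross_entropy(logits[mask], targets[mask])

        # MLM task: the deeper encoder consumes the contrastive module's output
        # and predicts the same discrete token ids at the masked positions.
        mlm_out = self.mlm_encoder(context)
        mlm_loss = F.cross_entropy(self.mlm_head(mlm_out)[mask], targets[mask])

        # Both self-supervised losses are optimized together, end to end.
        return contrastive_loss + mlm_loss

if __name__ == "__main__":
    model = ToyW2vBERT()
    feats = torch.randn(2, 100, 256)   # stand-in for projected acoustic features
    loss = model(feats)
    loss.backward()
    print(float(loss))
```

The point of the sketch is the training setup rather than the architecture details: a single backward pass propagates through both the MLM head and the quantizing contrastive module, which is what distinguishes the end-to-end recipe from iterative re-clustering (HuBERT) or separately trained modules (vq-wav2vec).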

Related research

Effectiveness of self-supervised pre-training for speech recognition (11/10/2019)
We present pre-training approaches for self-supervised representation le...

CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations (10/13/2020)
Pre-trained self-supervised models such as BERT have achieved striking s...

Self-Supervised Query Reformulation for Code Search (07/01/2023)
Automatic query reformulation is a widely utilized technology for enrich...

DistilHuBERT: Speech Representation Learning by Layer-wise Distillation of Hidden-unit BERT (10/05/2021)
Self-supervised speech representation learning methods like wav2vec 2.0 ...

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units (06/14/2021)
Self-supervised approaches for speech representation learning are challe...

On Effectively Learning of Knowledge in Continual Pre-training (04/17/2022)
Pre-trained language models (PLMs) like BERT have made significant progr...

Emergent Properties of Finetuned Language Representation Models (10/23/2019)
Large, self-supervised transformer-based language representation models ...
