wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

06/20/2020
by Alexei Baevski, et al.

We show for the first time that learning powerful representations from speech audio alone, followed by fine-tuning on transcribed speech, can outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and solves a contrastive task defined over a quantization of the latent representations, which are jointly learned. We set a new state of the art on both the 100-hour subset of Librispeech and on TIMIT phoneme recognition. When lowering the amount of labeled data to one hour, our model outperforms the previous state of the art on the 100-hour subset while using 100 times less labeled data. Using just ten minutes of labeled data and pre-training on 53k hours of unlabeled data still achieves 5.7/10.1 WER on the noisy/clean test sets of Librispeech. This demonstrates the feasibility of speech recognition with limited amounts of labeled data. Fine-tuning on all of Librispeech achieves 1.9/3.5 WER using a simple baseline model architecture. We will release code and models.
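The contrastive task mentioned in the abstract can be sketched as follows: at each masked time step, the model must identify the true quantized latent among a set of distractors, using a temperature-scaled cosine-similarity softmax. The snippet below is a minimal NumPy illustration of that objective for a single masked position; the function names, the temperature value, and the use of plain NumPy vectors are illustrative assumptions, not the paper's implementation (which operates on transformer context outputs and Gumbel-softmax quantized targets).

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two vectors.
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def contrastive_loss(context, target, distractors, temperature=0.1):
    """Sketch of a wav2vec 2.0-style contrastive loss for one masked time step.

    context:     context-network output c_t at a masked position, shape (d,)
    target:      true quantized latent q_t for that position, shape (d,)
    distractors: K quantized latents sampled from other masked positions, shape (K, d)
    """
    # Candidate set: the true target (index 0) plus K distractors.
    candidates = np.vstack([target[None, :], distractors])
    sims = np.array([cosine_sim(context, q) for q in candidates]) / temperature
    # Log-softmax over candidates; the loss is the negative log-likelihood
    # of picking the true quantized target.
    log_probs = sims - np.log(np.sum(np.exp(sims)))
    return -log_probs[0]
```

With a context vector that matches its quantized target, the loss falls well below chance level, which for K distractors is log(K + 1); this is the signal that drives the jointly learned quantization toward discriminable targets.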


Related research

10/22/2020
Self-training and Pre-training are Complementary for Speech Recognition
Self-training and unsupervised pre-training have emerged as effective ap...

02/01/2021
On Scaling Contrastive Representations for Low-Resource Speech Recognition
Recent advances in self-supervised learning through contrastive training...

08/23/2023
KinSPEAK: Improving speech recognition for Kinyarwanda via semi-supervised learning methods
Despite recent availability of large transcribed Kinyarwanda speech data...

03/28/2022
Robust Speaker Recognition with Transformers Using wav2vec 2.0
Recent advances in unsupervised speech representation learning discover ...

09/15/2021
Improving Streaming Transformer Based ASR Under a Framework of Self-supervised Learning
Recently self-supervised learning has emerged as an effective approach t...

11/15/2020
Unsupervised Contrastive Learning of Sound Event Representations
Self-supervised representation learning can mitigate the limitations in ...

01/07/2022
Improved Input Reprogramming for GAN Conditioning
We study the GAN conditioning problem, whose goal is to convert a pretra...
