vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations

10/12/2019
by Alexei Baevski, et al.

We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task. The algorithm uses either a Gumbel-Softmax or online k-means clustering to quantize the dense representations. Discretization enables the direct application of algorithms from the NLP community that require discrete inputs. Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.
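To make the Gumbel-Softmax quantization step concrete, below is a minimal PyTorch sketch, not the authors' implementation: a dense feature vector is projected to logits over a learnable codebook, and a straight-through Gumbel-Softmax draws a differentiable one-hot code assignment. Names such as `num_codes`, `code_dim`, and `temperature` are illustrative assumptions, not values from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelVectorQuantizer(nn.Module):
    """Sketch of Gumbel-Softmax vector quantization (assumed shapes/names)."""

    def __init__(self, input_dim=512, num_codes=320, code_dim=512, temperature=2.0):
        super().__init__()
        self.to_logits = nn.Linear(input_dim, num_codes)   # logits over codebook entries
        self.codebook = nn.Embedding(num_codes, code_dim)  # learnable discrete codes
        self.temperature = temperature

    def forward(self, dense):
        # dense: (batch, time, input_dim) dense features from the encoder
        logits = self.to_logits(dense)                     # (batch, time, num_codes)
        if self.training:
            # Straight-through Gumbel-Softmax: hard one-hot in the forward
            # pass, soft gradients in the backward pass.
            one_hot = F.gumbel_softmax(logits, tau=self.temperature, hard=True)
        else:
            # At inference, pick the most likely code deterministically.
            one_hot = F.one_hot(logits.argmax(-1), logits.size(-1)).float()
        quantized = one_hot @ self.codebook.weight          # (batch, time, code_dim)
        indices = one_hot.argmax(-1)                        # discrete token ids
        return quantized, indices

The returned `indices` are what make the approach compatible with NLP-style models: they can be fed to BERT pre-training as if they were word-piece tokens, which is how the abstract's downstream results are obtained.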


Related research

01/13/2020
Visually Guided Self Supervised Learning of Speech Representations
Self supervised representation learning has recently attracted a lot of ...

02/03/2022
Self-supervised Learning with Random-projection Quantizer for Speech Recognition
We present a simple and effective self-supervised learning approach for ...

02/27/2020
Learning Representations by Predicting Bags of Visual Words
Self-supervised representation learning targets to learn convnet-based i...

12/14/2022
Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language
Current self-supervised learning algorithms are often modality-specific ...

04/06/2022
Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation
Direct speech-to-speech translation (S2ST) models suffer from data scarc...

06/15/2023
Pushing the Limits of Unsupervised Unit Discovery for SSL Speech Representation
The excellent generalization ability of self-supervised learning (SSL) f...

10/27/2022
Evaluating context-invariance in unsupervised speech representations
Unsupervised speech representations have taken off, with benchmarks (SUP...
