
Wav2vec-C: A Self-supervised Model for Speech Representation Learning

by   Samik Sadhu, et al.

Wav2vec-C introduces a novel representation learning technique combining elements from wav2vec 2.0 and VQ-VAE. Our model learns to reproduce quantized representations from partially masked speech encodings using a contrastive loss, in a way similar to wav2vec 2.0. However, the quantization process is regularized by an additional consistency network that learns to reconstruct the input features to the wav2vec 2.0 network from the quantized representations, in a way similar to a VQ-VAE model. The proposed self-supervised model is trained on 10k hours of unlabeled data and subsequently used as the speech encoder in an RNN-T ASR model, which is fine-tuned with 1k hours of labeled data. This work is one of only a few studies of self-supervised learning on speech tasks with a large volume of real far-field labeled data. The Wav2vec-C encoded representations achieve, on average, twice the error reduction over the baseline and higher codebook utilization compared with wav2vec 2.0.
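The two-part objective described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the function names, the nearest-neighbor quantizer, the single linear map `recon_weights` standing in for the consistency network, the weight `gamma`, and the choice to pass the context vectors in as inputs rather than compute them with a Transformer over masked encodings.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity along the last axis, with broadcasting.
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return np.sum(a * b, axis=-1)

def quantize(encoded, codebook):
    # Assumption: nearest-neighbor lookup as a simple stand-in quantizer.
    dists = ((encoded[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    idx = dists.argmin(axis=1)
    return codebook[idx], idx

def contrastive_loss(context, targets, masked_idx, num_distractors, temperature, rng):
    # For each masked frame, pick out the true quantized target among
    # distractors sampled from the other masked frames.
    losses = []
    for t in masked_idx:
        others = np.asarray([i for i in masked_idx if i != t], dtype=int)
        n_neg = min(num_distractors, len(others))
        neg = rng.choice(others, size=n_neg, replace=False)
        candidates = np.concatenate([targets[t:t + 1], targets[neg]], axis=0)
        logits = cosine_sim(context[t][None, :], candidates) / temperature
        logits = logits - logits.max()  # numerical stability
        log_prob_true = logits[0] - np.log(np.exp(logits).sum())
        losses.append(-log_prob_true)
    return float(np.mean(losses))

def wav2vec_c_loss(features, encoded, context, codebook, recon_weights,
                   masked_idx, gamma=1.0, num_distractors=2,
                   temperature=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    quantized, _ = quantize(encoded, codebook)
    # Contrastive term (wav2vec 2.0 style) on the masked positions.
    l_contrast = contrastive_loss(context, quantized, masked_idx,
                                  num_distractors, temperature, rng)
    # Consistency term (VQ-VAE style): reconstruct the input features from
    # the quantized codes. A linear map is a hypothetical placeholder here.
    recon = quantized @ recon_weights
    l_consistency = float(np.mean((recon - features) ** 2))
    return l_contrast + gamma * l_consistency, l_contrast, l_consistency
```

In this sketch the reconstruction term is what ties the codebook to the input features: a codebook that ignores the input can still be scored by the contrastive term, but it would incur a large consistency loss.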




DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization

Recent success in speech representation learning enables a new way to le...

data2vec-aqc: Search for the right Teaching Assistant in the Teacher-Student training setup

In this paper, we propose a new Self-Supervised Learning (SSL) algorithm...

Exploring Representation Learning for Small-Footprint Keyword Spotting

In this paper, we investigate representation learning for low-resource k...

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

We show for the first time that learning powerful representations from s...

Metagenome2Vec: Building Contextualized Representations for Scalable Metagenome Analysis

Advances in next-generation metagenome sequencing have the potential to ...

Self-supervised speech unit discovery from articulatory and acoustic features using VQ-VAE

The human perception system is often assumed to recruit motor knowledge ...

A vector quantized masked autoencoder for speech emotion recognition

Recent years have seen remarkable progress in speech emotion recognition...