
On Scaling Contrastive Representations for Low-Resource Speech Recognition

02/01/2021
by Lasse Borgholt, et al.

Recent advances in self-supervised learning through contrastive training have shown that it is possible to learn a competitive speech recognition system with as little as 10 minutes of labeled data. However, these systems are computationally expensive, since they require pre-training followed by fine-tuning in a large parameter space. We explore the performance of such systems without fine-tuning by training a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework. We find that performance degrades without fine-tuning and, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor. In addition, we find that wav2vec 2.0 representations live in a low-dimensional subspace and that decorrelating the features of the representations can stabilize training of the automatic speech recognizer. Finally, we propose a bidirectional extension to the original wav2vec framework that consistently improves performance.
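The decorrelation idea mentioned in the abstract can be illustrated with a minimal PCA-whitening sketch (an assumption about the exact procedure; the paper's own recipe may differ). Given a matrix of fixed representations with one row per frame, we remove the mean, diagonalize the sample covariance, and rescale each principal direction to unit variance, so the resulting feature dimensions are uncorrelated:

```python
import numpy as np

def decorrelate(features, eps=1e-5):
    """PCA-whiten a (frames x dims) matrix of fixed speech
    representations so that the feature dimensions become
    decorrelated with approximately unit variance."""
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (len(centered) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # project onto the eigenbasis and rescale by 1/sqrt(eigenvalue);
    # eps guards against division by near-zero eigenvalues
    return centered @ eigvecs / np.sqrt(eigvals + eps)

# toy example: 1000 "frames" of 16-dim correlated features
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 16)) @ rng.normal(size=(16, 16))
z = decorrelate(x)
# the covariance of z is now close to the identity matrix
cov_z = np.cov(z, rowvar=False)
```

If the representations do occupy a low-dimensional subspace, some eigenvalues will be near zero; the `eps` term keeps the transform numerically stable in exactly that case.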
