On Scaling Contrastive Representations for Low-Resource Speech Recognition

02/01/2021
by Lasse Borgholt, et al.

Recent advances in self-supervised learning through contrastive training have shown that it is possible to learn a competitive speech recognition system with as little as 10 minutes of labeled data. However, these systems are computationally expensive, since they require pre-training followed by fine-tuning in a large parameter space. We explore the performance of such systems without fine-tuning by training a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework. We find that performance decreases without fine-tuning and that, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor. In addition, we find that wav2vec 2.0 representations live in a low-dimensional subspace and that decorrelating the features of the representations can stabilize training of the automatic speech recognizer. Finally, we propose a bidirectional extension to the original wav2vec framework that consistently improves performance.
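To make the decorrelation idea concrete, here is a minimal sketch of one standard way to decorrelate fixed representations: PCA whitening of the feature matrix before feeding it to the recognizer. This is an illustration only; the names, the toy data, and the specific whitening variant are assumptions, and the paper's exact procedure may differ.

```python
import numpy as np

def whiten(features, eps=1e-5):
    """Decorrelate a (frames, dim) feature matrix via PCA whitening,
    so that its empirical covariance is approximately the identity
    (hypothetical helper; not the paper's exact method)."""
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (centered.shape[0] - 1)
    # Eigendecomposition of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Rotate into the eigenbasis and rescale each direction to unit
    # variance; eps guards against near-zero eigenvalues, which occur
    # when the features occupy a low-dimensional subspace.
    return centered @ eigvecs / np.sqrt(eigvals + eps)

rng = np.random.default_rng(0)
# Toy correlated "representations": 1000 frames whose 8 feature
# dimensions span only a rank-3 subspace, mimicking the observation
# that wav2vec 2.0 features live in a low-dimensional subspace.
feats = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 8))
white = whiten(feats)
```

After whitening, the off-diagonal entries of the feature covariance are near zero, i.e. the individual feature dimensions are decorrelated, which is the property the abstract links to more stable ASR training.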

