More Speaking or More Speakers?

11/02/2022
by Dan Berrebbi, et al.

Self-training (ST) and self-supervised learning (SSL) methods have demonstrated strong improvements in automatic speech recognition (ASR). Despite these advances, to the best of our knowledge there is no analysis of how the composition of the labeled and unlabeled datasets used in these methods affects the results. In this work we analyze the effect of the number of speakers in the training data on a recent SSL algorithm (wav2vec 2.0) and a recent ST algorithm (slimIPL). We perform a systematic analysis on both labeled and unlabeled data by varying the number of speakers while keeping the number of hours fixed, and vice versa. Our findings suggest that SSL requires a large amount of unlabeled data to produce high-accuracy results, while ST requires a sufficient number of speakers in the labeled data, especially in the low-data regime. The two approaches therefore improve on supervised learning in different regimes of dataset composition.
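The "vary the speakers, fix the hours" design can be illustrated with a minimal sketch. It assumes each utterance is available as a (utterance id, speaker id, duration in seconds) tuple and splits the hour budget equally across the chosen speakers. The helper name `sample_subset` and the equal-share rule are assumptions made for illustration, not the sampling procedure used in the paper.

```python
import random
from collections import defaultdict


def sample_subset(utterances, num_speakers, target_hours, seed=0):
    """Pick utterances from `num_speakers` speakers totalling at most `target_hours`.

    `utterances` is a list of (utt_id, speaker_id, duration_seconds) tuples.
    Hypothetical helper for illustration; not the authors' sampling procedure.
    """
    rng = random.Random(seed)

    # Group utterances by speaker.
    by_speaker = defaultdict(list)
    for utt in utterances:
        by_speaker[utt[1]].append(utt)

    if num_speakers > len(by_speaker):
        raise ValueError("not enough speakers in the pool")

    # Choose the speakers, then give each an equal share of the hour budget
    # (an assumption of this sketch).
    speakers = rng.sample(sorted(by_speaker), num_speakers)
    per_speaker_budget = target_hours * 3600.0 / num_speakers

    subset = []
    for spk in speakers:
        pool = by_speaker[spk][:]
        rng.shuffle(pool)
        used = 0.0
        for utt in pool:
            if used + utt[2] > per_speaker_budget:
                continue  # skip utterances that would overshoot this speaker's share
            subset.append(utt)
            used += utt[2]
    return subset
```

Calling `sample_subset(pool, 2, 1.0)` and `sample_subset(pool, 5, 1.0)` on the same pool yields two subsets of roughly one hour each that differ only in speaker count; holding `num_speakers` fixed and changing `target_hours` gives the converse comparison.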


