Deploying self-supervised learning in the wild for hybrid automatic speech recognition

05/17/2022
by   Mostafa Karimi, et al.
0

Self-supervised learning (SSL) methods have proven to be very successful in automatic speech recognition (ASR). These great improvements have been reported mostly based on highly curated datasets such as LibriSpeech for non-streaming End-to-End ASR models. However, the pivotal characteristics of SSL is to be utilized for any untranscribed audio data. In this paper, we provide a full exploration on how to utilize uncurated audio data in SSL from data pre-processing to deploying an streaming hybrid ASR model. More specifically, we present (1) the effect of Audio Event Detection (AED) model in data pre-processing pipeline (2) analysis on choosing optimizer and learning rate scheduling (3) comparison of recently developed contrastive losses, (4) comparison of various pre-training strategies such as utilization of in-domain versus out-domain pre-training data, monolingual versus multilingual pre-training data, multi-head multilingual SSL versus single-head multilingual SSL and supervised pre-training versus SSL. The experimental results show that SSL pre-training with in-domain uncurated data can achieve better performance in comparison to all the alternative out-domain pre-training strategies.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
12/07/2022

Improved Self-Supervised Multilingual Speech Representation Learning Combined with Auxiliary Language Information

Multilingual end-to-end models have shown great improvement over monolin...
research
10/11/2021

K-Wav2vec 2.0: Automatic Speech Recognition based on Joint Decoding of Graphemes and Syllables

Wav2vec 2.0 is an end-to-end framework of self-supervised learning for s...
research
11/04/2022

Biased Self-supervised learning for ASR

Self-supervised learning via masked prediction pre-training (MPPT) has s...
research
03/01/2022

Measuring the Impact of Individual Domain Factors in Self-Supervised Pre-Training

Human speech data comprises a rich set of domain factors such as accent,...
research
10/15/2021

Multilingual Speech Recognition using Knowledge Transfer across Learning Processes

Multilingual end-to-end(E2E) models have shown a great potential in the ...
research
04/04/2022

A Study of Gender Impact in Self-supervised Models for Speech-to-Text Systems

Self-supervised models for speech processing emerged recently as popular...
research
02/12/2021

Bi-APC: Bidirectional Autoregressive Predictive Coding for Unsupervised Pre-training and Its Application to Children's ASR

We present a bidirectional unsupervised model pre-training (UPT) method ...

Please sign up or login with your details

Forgot password? Click here to reset