Joint Masked CPC and CTC Training for ASR

10/30/2020
by Chaitanya Talnikar, et al.

Self-supervised learning (SSL) has shown promise in learning audio representations that are useful for automatic speech recognition (ASR). However, training SSL models such as wav2vec 2.0 requires a two-stage pipeline: self-supervised pre-training followed by supervised fine-tuning. In this paper, we demonstrate single-stage training of ASR models that can use both unlabeled and labeled data. During training, we alternately minimize two losses: an unsupervised masked Contrastive Predictive Coding (CPC) loss and a supervised audio-to-text alignment loss, Connectionist Temporal Classification (CTC). We show that this joint training method directly optimizes performance on the downstream ASR task using unsupervised data, while achieving word error rates similar to those of wav2vec 2.0 on the Librispeech 100-hour dataset. Finally, we postulate that solving the contrastive task acts as a regularizer for the supervised CTC loss.
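
The alternating minimization described in the abstract can be sketched as a simple training loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `alternating_schedule`, `train`, and `unsup_per_sup` are hypothetical, the update ratio is an illustrative parameter, and the two quadratic gradients are toy stand-ins for the masked CPC and CTC losses acting on a single shared parameter.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def alternating_schedule(num_steps: int, unsup_per_sup: int = 1) -> Iterator[str]:
    """Yield which loss to minimize at each step.

    Each cycle has `unsup_per_sup` unsupervised steps followed by one
    supervised step. The ratio is a hypothetical knob, not a value
    taken from the paper.
    """
    cycle = unsup_per_sup + 1
    for step in range(num_steps):
        yield "supervised" if step % cycle == unsup_per_sup else "unsupervised"


def train(
    w: float,
    grads: Dict[str, Callable[[float], float]],
    num_steps: int,
    lr: float = 0.1,
    unsup_per_sup: int = 1,
) -> Tuple[float, List[str]]:
    """Run plain gradient descent on one shared parameter, switching
    between the two loss gradients according to the schedule."""
    history: List[str] = []
    for kind in alternating_schedule(num_steps, unsup_per_sup):
        w -= lr * grads[kind](w)  # both losses update the same parameter
        history.append(kind)
    return w, history


# Toy stand-ins (assumptions): gradients of (w - 1)^2 and (w - 3)^2 play the
# roles of the masked CPC loss and the CTC loss, respectively.
toy_grads = {
    "unsupervised": lambda w: 2.0 * (w - 1.0),
    "supervised": lambda w: 2.0 * (w - 3.0),
}

final_w, history = train(0.0, toy_grads, num_steps=200)
```

In this toy setting the shared parameter settles between the two minimizers, which loosely mirrors the idea that one objective constrains the other. In a real system each step would instead draw a batch of unlabeled or labeled audio and backpropagate the corresponding loss through a shared encoder.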


