SCaLa: Supervised Contrastive Learning for End-to-End Automatic Speech Recognition

10/08/2021
by Li Fu, et al.

End-to-end Automatic Speech Recognition (ASR) models are usually trained to minimize a loss over whole token sequences, with no explicit supervision at phoneme granularity. This can lead to recognition errors caused by confusion between similar phonemes or by phoneme reduction. To alleviate this problem, this paper proposes Supervised Contrastive Learning (SCaLa), a novel framework that enhances phonemic information learning for end-to-end ASR systems. Specifically, we bring the self-supervised Masked Contrastive Predictive Coding (MCPC) into the fully supervised setting. To supervise phoneme learning explicitly, SCaLa first masks the variable-length spans of encoder features that correspond to phonemes, using a phoneme forced alignment extracted from a pre-trained acoustic model, and then predicts the masked phonemes via contrastive learning. The phoneme forced alignment mitigates the noise in positive-negative pairs that arises in self-supervised MCPC. Experimental results on reading and spontaneous speech datasets show that the proposed approach achieves 2.84 Character Error Rate (CER) reductions compared to the baseline, respectively.
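The two steps described in the abstract, masking encoder features per aligned phoneme and then predicting the masked frames contrastively, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the zero mask vector, the cosine-similarity InfoNCE form, the temperature value, and the rule of dropping same-phoneme frames from the negatives are all assumptions made for illustration. The encoder is left out, so `context` stands in for whatever the model outputs on the masked input.

```python
import numpy as np


def frame_labels_from_alignment(alignment, num_frames):
    """Expand (start, end, phoneme) segments (end exclusive) into integer frame labels.

    Frames not covered by any segment keep the label -1.
    """
    ids = {}
    labels = np.full(num_frames, -1, dtype=int)
    for start, end, phoneme in alignment:
        labels[start:end] = ids.setdefault(phoneme, len(ids))
    return labels


def mask_phoneme_segments(features, alignment, segment_indices, mask_vector=None):
    """Overwrite every frame of the chosen aligned segments with a mask vector.

    Assumption: a zero vector is used when no mask vector is given.
    Returns the masked copy and a boolean per-frame mask.
    """
    masked = features.copy()
    mask = np.zeros(len(features), dtype=bool)
    if mask_vector is None:
        mask_vector = np.zeros(features.shape[1])
    for idx in segment_indices:
        start, end, _ = alignment[idx]
        masked[start:end] = mask_vector
        mask[start:end] = True
    return masked, mask


def supervised_masked_contrastive_loss(context, targets, labels, mask, temperature=0.1):
    """InfoNCE-style loss averaged over masked frames (hypothetical formulation).

    For a masked frame t, the positive is the unmasked target at t. Negatives
    are targets of frames whose aligned phoneme differs from that of t; other
    frames of the same phoneme are excluded so they do not act as noisy
    negatives. Assumes at least one masked frame and non-zero vectors.
    """
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    z = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = (c @ z.T) / temperature                      # (frames, frames)
    same_label = labels[:, None] == labels[None, :]
    candidates = np.eye(len(labels), dtype=bool) | ~same_label
    logits = np.where(candidates, logits, -np.inf)
    # Numerically stable log-sum-exp over the candidate set of each frame.
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    log_prob_positive = np.diag(logits) - log_denom
    return -log_prob_positive[mask].mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy forced alignment over 8 frames: phoneme "a", then "b", then "a" again.
    alignment = [(0, 3, "a"), (3, 5, "b"), (5, 8, "a")]
    targets = rng.normal(size=(8, 16))                    # unmasked encoder features
    labels = frame_labels_from_alignment(alignment, len(targets))
    masked_input, mask = mask_phoneme_segments(targets, alignment, [1])
    # A real model would compute `context` from `masked_input`; noise stands in here.
    context = rng.normal(size=targets.shape)
    print(supervised_masked_contrastive_loss(context, targets, labels, mask))
```

The point of the label-based candidate set is that, without an alignment, a frame of the same phoneme elsewhere in the utterance could be sampled as a negative, which is the kind of noisy positive-negative pair the abstract says forced alignment helps avoid.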


