Contrastive Siamese Network for Semi-supervised Speech Recognition

05/27/2022
by Soheil Khorram, et al.

This paper introduces the contrastive siamese (c-siam) network, an architecture for leveraging unlabeled acoustic data in speech recognition. c-siam is the first network that extracts high-level linguistic information from speech by matching the outputs of two identical transformer encoders. It contains an augmented branch and a target branch, which are trained by (1) masking inputs and matching outputs with a contrastive loss, (2) applying a stop-gradient operation on the target branch, (3) using an extra learnable transformation on the augmented branch, and (4) introducing new temporal augmentation functions to prevent the shortcut learning problem. We use the Libri-light 60k unsupervised data and the LibriSpeech 100hrs/960hrs supervised data to compare c-siam with other best-performing systems. Our experiments show that c-siam provides a 20% relative word error rate improvement over wav2vec baselines. A c-siam network with 450M parameters achieves competitive results compared to state-of-the-art networks with 600M parameters.
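
The abstract does not give the exact form of the contrastive loss, so the following is only a minimal sketch of the output-matching idea, assuming an InfoNCE-style loss with cosine similarity over time frames. The names `cosine`, `contrastive_loss`, and `temperature` are hypothetical and not taken from the paper. Masking, the temporal augmentation functions, and the encoders themselves are omitted; the stop gradient is only noted in a comment because plain Python has no automatic differentiation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def contrastive_loss(aug_out, tgt_out, temperature=0.1):
    """InfoNCE-style loss over time frames (forward pass only, an assumption).

    aug_out[t]: augmented-branch output at frame t, assumed to have already
                passed through the extra learnable transformation.
    tgt_out[t]: target-branch output at frame t. During training this would
                be treated as a constant (stop gradient).
    """
    total = 0.0
    for t, query in enumerate(aug_out):
        # Similarity of this frame to every target frame; frame t is the
        # positive, all other frames act as negatives.
        logits = [cosine(query, key) / temperature for key in tgt_out]
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        total += log_denominator - logits[t]
    return total / len(aug_out)

# Toy example with three frames of 2-dimensional outputs (made-up values).
target = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
matched = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [matched[1], matched[2], matched[0]]

loss_matched = contrastive_loss(matched, target)
loss_shuffled = contrastive_loss(shuffled, target)
```

In this toy setup the loss is lower when each augmented-branch frame lines up with its own target frame than when the frames are shuffled in time, which is the behavior a frame-level matching objective is meant to reward.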

research
10/27/2020

Speech SIMCLR: Combining Contrastive and Reconstruction Objective for Self-supervised Speech Representation Learning

Self-supervised visual pretraining has shown significant progress recent...
research
07/25/2022

Unsupervised data selection for Speech Recognition with contrastive loss ratios

This paper proposes an unsupervised data selection method by using a sub...
research
10/08/2021

Improving Pseudo-label Training For End-to-end Speech Recognition Using Gradient Mask

In the recent trend of semi-supervised speech recognition, both self-sup...
research
06/16/2021

Collaborative Training of Acoustic Encoders for Speech Recognition

On-device speech recognition requires training models of different sizes...
research
08/02/2021

Forward-Looking Sonar Patch Matching: Modern CNNs, Ensembling, and Uncertainty

Application of underwater robots are on the rise, most of them are depen...
research
11/25/2020

SAR-Net: A End-to-End Deep Speech Accent Recognition Network

This paper proposes a end-to-end deep network to recognize kinds of acce...
