UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data

by Chengyi Wang et al.

In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner. The resultant representations can capture information more correlated with phonetic structures and improve generalization across languages and domains. We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on the public CommonVoice corpus. The results show that UniSpeech outperforms self-supervised pretraining and supervised transfer learning for speech recognition by a maximum of 13.4% and 17.8% relative phone error rate reductions respectively (averaged over all testing languages). The transferability of UniSpeech is also demonstrated on a domain-shift speech recognition task, i.e., a relative word error rate reduction of 6% against the previous approach.
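The abstract describes training with a weighted combination of a supervised phonetic CTC loss and a contrastive self-supervised loss. A minimal sketch of that multi-task objective, in pure Python, is shown below; the function names, the InfoNCE form of the contrastive term, and the interpolation weight `alpha` are illustrative assumptions, not details taken from the paper:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (assumed non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss: the anchor should be more similar
    to the positive (e.g. the quantized target of the masked frame)
    than to any of the negative (distractor) representations."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract the max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

def unispeech_loss(ctc_loss, contrastive, alpha=0.5):
    """Multi-task objective: interpolate the supervised CTC loss on
    labeled data with the contrastive loss (alpha is a hypothetical
    weighting hyperparameter)."""
    return alpha * ctc_loss + (1.0 - alpha) * contrastive
```

For example, an anchor that matches its positive while differing from the negatives yields a much smaller contrastive loss than one whose positive and negative roles are swapped, which is the signal that pushes representations toward phonetically meaningful structure.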


