DeepAI AI Chat
Log In Sign Up

An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition

by   Xuankai Chang, et al.

Self-supervised pretraining on speech data has achieved a lot of progress. High-fidelity representation of the speech signal is learned from a lot of untranscribed data and shows promising performance. Recently, there are several works focusing on evaluating the quality of self-supervised pretrained representations on various tasks without domain restriction, e.g. SUPERB. However, such evaluations do not provide a comprehensive comparison among many ASR benchmark corpora. In this paper, we focus on the general applications of pretrained speech representations, on advanced end-to-end automatic speech recognition (E2E-ASR) models. We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR. Without any modification of the back-end model architectures or training strategy, some of the experiments with pretrained representations, e.g., WSJ, WSJ0-2mix with HuBERT, reach or outperform current state-of-the-art (SOTA) recognition performance. Moreover, we further explore more scenarios for whether the pretraining representations are effective, such as the cross-language or overlapped speech. The scripts, configuratons and the trained models have been released in ESPnet to let the community reproduce our experiments and improve them.


page 1

page 2

page 3

page 4


Improving Hybrid CTC/Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model

Recently, self-supervised pretraining has achieved impressive results in...

Can Self-Supervised Learning solve the problem of child speech recognition?

Despite recent advancements in deep learning technologies, Child Speech ...

DRAFT: A Novel Framework to Reduce Domain Shifting in Self-supervised Learning and Its Application to Children's ASR

Self-supervised learning (SSL) in the pretraining stage using un-annotat...

Self-supervised Sequence-to-sequence ASR using Unpaired Speech and Text

Sequence-to-sequence ASR models require large quantities of data to atta...

Transformer-based Automatic Speech Recognition of Formal and Colloquial Czech in MALACH Project

Czech is a very specific language due to its large differences between t...

Blockwise Self-Supervised Learning at Scale

Current state-of-the-art deep networks are all powered by backpropagatio...

Pretraining Techniques for Sequence-to-Sequence Voice Conversion

Sequence-to-sequence (seq2seq) voice conversion (VC) models are attracti...