
An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition

10/09/2021
by Xuankai Chang, et al.

Self-supervised pretraining on speech data has made substantial progress: high-fidelity representations of the speech signal are learned from large amounts of untranscribed data and show promising performance on downstream tasks. Recently, several works have focused on evaluating the quality of self-supervised pretrained representations across various tasks without domain restriction, e.g., SUPERB. However, such evaluations do not provide a comprehensive comparison across many ASR benchmark corpora. In this paper, we focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models. We select several pretrained speech representations and present experimental results on various open-source and publicly available corpora for E2E-ASR. Without any modification of the back-end model architectures or training strategy, some of the experiments with pretrained representations, e.g., WSJ and WSJ0-2mix with HuBERT, reach or outperform the current state-of-the-art (SOTA) recognition performance. Moreover, we further explore scenarios in which the pretrained representations may or may not be effective, such as cross-language transfer and overlapped speech. The scripts, configurations, and trained models have been released in ESPnet so that the community can reproduce and improve upon our experiments.
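The abstract describes replacing a conventional front-end with frozen self-supervised representations (e.g., HuBERT) feeding an unmodified E2E-ASR back-end. The released recipes live in ESPnet; the snippet below is only a minimal sketch of the idea using torchaudio's public HuBERT bundle, with an illustrative projection layer standing in for the actual ESPnet encoder (the file name, projection size, and layer choice are assumptions, not taken from the paper).

```python
# Minimal sketch (not the paper's ESPnet recipe): a frozen HuBERT model from
# torchaudio serves as the ASR front-end in place of filterbank features.
# "speech.wav", the projection size, and the layer selection are placeholders.
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE       # pretrained self-supervised model
hubert = bundle.get_model().eval()              # frozen: used only as a feature extractor

waveform, sr = torchaudio.load("speech.wav")    # mono utterance, shape (1, num_samples)
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.no_grad():
    # extract_features returns one tensor per transformer layer: (batch, frames, 768)
    features, _ = hubert.extract_features(waveform)
    speech_repr = features[-1]                  # e.g., last layer (a weighted sum also works)

# Hand the representation to an unmodified E2E-ASR back-end (placeholder projection here).
proj = torch.nn.Linear(speech_repr.size(-1), 256)
encoder_input = proj(speech_repr)
print(encoder_input.shape)                      # (1, num_frames, 256)
```

In the paper's setting, only this front-end changes; the back-end architecture and training strategy stay as in the standard ESPnet recipes.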


Related research:

12/14/2021 - Improving Hybrid CTC/Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model
  Recently, self-supervised pretraining has achieved impressive results in...

04/06/2022 - Can Self-Supervised Learning solve the problem of child speech recognition?
  Despite recent advancements in deep learning technologies, Child Speech ...

06/16/2022 - DRAFT: A Novel Framework to Reduce Domain Shifting in Self-supervised Learning and Its Application to Children's ASR
  Self-supervised learning (SSL) in the pretraining stage using un-annotat...

04/30/2019 - Self-supervised Sequence-to-sequence ASR using Unpaired Speech and Text
  Sequence-to-sequence ASR models require large quantities of data to atta...

06/15/2022 - Transformer-based Automatic Speech Recognition of Formal and Colloquial Czech in MALACH Project
  Czech is a very specific language due to its large differences between t...

02/03/2023 - Blockwise Self-Supervised Learning at Scale
  Current state-of-the-art deep networks are all powered by backpropagatio...

08/07/2020 - Pretraining Techniques for Sequence-to-Sequence Voice Conversion
  Sequence-to-sequence (seq2seq) voice conversion (VC) models are attracti...