Improving Hybrid CTC/Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model

12/14/2021
by Keqi Deng, et al.

Recently, self-supervised pretraining has achieved impressive results in end-to-end (E2E) automatic speech recognition (ASR). However, the dominant sequence-to-sequence (S2S) E2E model still struggles to fully exploit self-supervised pretraining because its decoder is conditioned on acoustic representations and therefore cannot be pretrained separately. In this paper, we propose a pretrained Transformer (Preformer) S2S ASR architecture based on hybrid CTC/attention E2E models to fully utilize pretrained acoustic models (AMs) and language models (LMs). In our framework, the encoder is initialized with a pretrained AM (wav2vec2.0). The Preformer leverages CTC as an auxiliary task during both training and inference. Furthermore, we design a one-cross decoder (OCD), which relaxes the dependence on acoustic representations so that it can be initialized with a pretrained LM (DistilGPT2). Experiments are conducted on the AISHELL-1 corpus, where the proposed model achieves a 4.6% character error rate (CER) on the test set. Compared with our vanilla hybrid CTC/attention Transformer baseline, the proposed CTC/attention-based Preformer yields a 27% relative CER reduction. To the best of our knowledge, this is the first work to utilize both a pretrained AM and a pretrained LM in an S2S ASR system.
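To make the described architecture concrete, below is a minimal PyTorch sketch (not the authors' released code) of a hybrid CTC/attention model whose encoder is initialized from wav2vec2.0 and whose decoder body reuses DistilGPT2, with a single cross-attention block loosely standing in for the one-cross decoder idea. The Hugging Face checkpoint names, the `HybridCTCAttention` class, the interpolation weight, and the loss bookkeeping are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only: hybrid CTC/attention model with a pretrained AM encoder
# (wav2vec2.0) and a pretrained LM (DistilGPT2) reused as the decoder body.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, GPT2Model

class HybridCTCAttention(nn.Module):
    def __init__(self, vocab_size: int, ctc_weight: float = 0.3):
        super().__init__()
        # Pretrained acoustic model (AM) as the encoder.
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        enc_dim = self.encoder.config.hidden_size          # 768 for the base model
        # Auxiliary CTC branch on top of the encoder output.
        self.ctc_head = nn.Linear(enc_dim, vocab_size)
        self.ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
        self.ctc_weight = ctc_weight
        # Pretrained language model (LM) reused as the decoder body.
        self.decoder = GPT2Model.from_pretrained("distilgpt2")
        dec_dim = self.decoder.config.hidden_size           # 768 for DistilGPT2
        # A single cross-attention block injects acoustic context after the LM,
        # loosely mimicking the idea of conditioning on acoustics only once so
        # that the LM weights remain reusable (an assumption, not the paper's OCD).
        self.cross_attn = nn.MultiheadAttention(dec_dim, num_heads=8, batch_first=True)
        self.enc_proj = nn.Linear(enc_dim, dec_dim)
        self.output_head = nn.Linear(dec_dim, vocab_size)

    def forward(self, speech, tokens, targets, target_lens):
        # speech: raw waveforms (B, samples); tokens: decoder input ids (B, U);
        # targets: label ids (B, U); target_lens: label lengths (B,).
        enc_out = self.encoder(speech).last_hidden_state     # (B, T, enc_dim)
        # CTC branch: frame-level log-probabilities over the vocabulary.
        ctc_logp = self.ctc_head(enc_out).log_softmax(-1)    # (B, T, V)
        enc_frames = torch.full((speech.size(0),), enc_out.size(1), dtype=torch.long)
        loss_ctc = self.ctc_loss(ctc_logp.transpose(0, 1), targets, enc_frames, target_lens)
        # Attention branch: LM-style decoding followed by one cross-attention step.
        # (In practice CTC and attention targets differ, e.g. shifted/eos-appended;
        # reusing one tensor here is a simplification for brevity.)
        dec_states = self.decoder(input_ids=tokens).last_hidden_state
        acoustic = self.enc_proj(enc_out)
        fused, _ = self.cross_attn(dec_states, acoustic, acoustic)
        logits = self.output_head(fused)                      # (B, U, V)
        loss_att = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1), ignore_index=-100)
        # Hybrid CTC/attention objective: weighted interpolation of the two losses.
        return self.ctc_weight * loss_ctc + (1 - self.ctc_weight) * loss_att
```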


Related research

09/15/2021 · Improving Accent Identification and Accented Speech Recognition Under a Framework of Self-supervised Learning
Recently, self-supervised pre-training has gained success in automatic s...

10/09/2021 · An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition
Self-supervised pretraining on speech data has achieved a lot of progres...

05/08/2019 · RWTH ASR Systems for LibriSpeech: Hybrid vs Attention - w/o Data Augmentation
We present state-of-the-art automatic speech recognition (ASR) systems e...

02/16/2022 · Knowledge Transfer from Large-scale Pretrained Language Models to End-to-end Speech Recognizers
End-to-end speech recognition is a promising technology for enabling com...

10/08/2021 · SCaLa: Supervised Contrastive Learning for End-to-End Automatic Speech Recognition
End-to-end Automatic Speech Recognition (ASR) models are usually trained...

10/25/2022 · Linguistic-Enhanced Transformer with CTC Embedding for Speech Recognition
The recent emergence of joint CTC-Attention model shows significant impr...

12/28/2020 · Lattice-Free MMI Adaptation Of Self-Supervised Pretrained Acoustic Models
In this work, we propose lattice-free MMI (LFMMI) for supervised adaptat...
