Improving non-autoregressive end-to-end speech recognition with pre-trained acoustic and language models

01/25/2022
by Keqi Deng, et al.

While Transformers have achieved promising results in end-to-end (E2E) automatic speech recognition (ASR), their autoregressive (AR) structure is a bottleneck for fast decoding. For real-world deployment, ASR systems should be highly accurate while achieving fast inference. Non-autoregressive (NAR) models have become a popular alternative thanks to their fast inference speed, but they still lag behind AR systems in recognition accuracy. To meet both demands, this paper proposes a NAR CTC/attention model that utilizes pre-trained acoustic and language models: wav2vec2.0 and BERT. To bridge the modality gap between the speech and text representations obtained from the pre-trained models, we design a novel modality conversion mechanism, which is more suitable for logographic languages. During inference, a CTC branch generates a target length, which enables BERT to predict tokens in parallel. We also design a cache-based CTC/attention joint decoding method that improves recognition accuracy while keeping decoding fast. Experimental results show that the proposed NAR model greatly outperforms our strong wav2vec2.0 CTC baseline (15.1% relative CER reduction on the AISHELL-1 dev set), significantly surpasses previous NAR systems on the AISHELL-1 benchmark, and shows potential for English tasks.
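A minimal sketch of the length-prediction idea described above, not the authors' implementation: the CTC best path is collapsed (merge repeats, drop blanks), and the length of the collapsed draft tells a NAR decoder how many token positions to fill in one parallel pass. `BLANK`, `ctc_collapse`, `nar_decode`, and `predict_fn` are illustrative names; a real system would plug a wav2vec2.0 encoder and a BERT-style masked-LM head into `predict_fn`.

```python
BLANK = 0  # assumed blank token id for the CTC branch

def ctc_collapse(best_path):
    """Collapse a framewise CTC best path: merge repeats, remove blanks."""
    out = []
    prev = None
    for tok in best_path:
        if tok != BLANK and tok != prev:
            out.append(tok)
        prev = tok
    return out

def nar_decode(best_path, predict_fn):
    """Derive the target length from the CTC branch, then let `predict_fn`
    (e.g. a BERT-style masked-LM head) fill all positions in one parallel
    step, instead of generating tokens one by one autoregressively."""
    draft = ctc_collapse(best_path)
    length = len(draft)               # target length from the CTC branch
    return predict_fn(draft, length)  # single parallel prediction pass

# Toy usage: the stand-in "language model" keeps the CTC draft unchanged.
path = [0, 3, 3, 0, 0, 5, 5, 5, 0, 7]
print(nar_decode(path, lambda draft, n: draft))  # -> [3, 5, 7]
```

Because every position is predicted simultaneously, decoding cost no longer grows with one forward pass per output token, which is the speed advantage the abstract claims for NAR inference.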


Related research:

- Improved Mask-CTC for Non-Autoregressive End-to-End ASR (10/26/2020): For real-world deployment of automatic speech recognition (ASR), the sys...
- BECTRA: Transducer-based End-to-End ASR with BERT-Enhanced Encoder (11/02/2022): We present BERT-CTC-Transducer (BECTRA), a novel end-to-end automatic sp...
- Non-autoregressive Transformer-based End-to-end ASR using BERT (04/10/2021): Transformer-based models have led to a significant innovation in various...
- Non-autoregressive End-to-end Approaches for Joint Automatic Speech Recognition and Spoken Language Understanding (04/21/2023): This paper presents the use of non-autoregressive (NAR) approaches for j...
- Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input (10/28/2020): Non-autoregressive (NAR) transformer models have achieved significantly ...
- A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation (10/11/2021): Non-autoregressive (NAR) models simultaneously generate multiple outputs...
- BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model (10/29/2022): This paper presents BERT-CTC, a novel formulation of end-to-end speech r...
