4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

12/21/2022
by Yui Sudo, et al.

End-to-end (E2E) automatic speech recognition (ASR) architectures can be classified into several model types, including connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based, and non-autoregressive mask-predict models. Since each of these architectures has pros and cons, a typical approach is to switch between separate models depending on the application requirement, which increases the overhead of maintaining all of them. Several methods for integrating two of these complementary models have been proposed to mitigate this overhead; however, integrating more models would yield further benefit from their complementary properties and enable broader applications with a single system. This paper proposes four-decoder joint modeling (4D) of CTC, attention, RNN-T, and mask-predict, which has the following three advantages: 1) The four decoders are jointly trained so that they can be easily switched depending on the application scenario. 2) Joint training may regularize the model and improve its robustness thanks to the decoders' complementary properties. 3) Novel one-pass joint decoding methods using CTC, attention, and RNN-T further improve performance. The experimental results showed that the proposed model consistently reduced the word error rate (WER).
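The abstract does not say how the four decoder losses or the decoding scores are combined, so the following is only a minimal sketch under stated assumptions: that joint training uses a weighted sum of per-decoder losses, and that one-pass joint decoding ranks hypotheses by a weighted sum of CTC, attention, and RNN-T log-probabilities. The function names, the decoder keys, and all weights are hypothetical and are not taken from the paper.

```python
from typing import Dict


def joint_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-decoder training losses.

    Assumption: the joint objective is a simple weighted sum; the
    weights here are illustrative, not values from the paper.
    """
    if set(losses) != set(weights):
        raise ValueError("losses and weights must cover the same decoders")
    return sum(weights[name] * losses[name] for name in losses)


def joint_score(log_probs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-decoder log-probabilities for one hypothesis."""
    return sum(weights[name] * log_probs[name] for name in weights)


def pick_best(hyps: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> str:
    """Return the hypothesis text with the highest joint score."""
    return max(hyps, key=lambda text: joint_score(hyps[text], weights))


# Hypothetical per-decoder losses for one training batch.
train_losses = {"ctc": 2.0, "attention": 1.0, "rnnt": 3.0, "mask_predict": 4.0}
train_weights = {"ctc": 0.25, "attention": 0.25, "rnnt": 0.25, "mask_predict": 0.25}
total = joint_loss(train_losses, train_weights)

# Hypothetical log-probabilities assigned to two candidate transcripts.
candidates = {
    "hello world": {"ctc": -2.0, "attention": -1.0, "rnnt": -1.5},
    "hollow world": {"ctc": -1.5, "attention": -3.0, "rnnt": -2.5},
}
decode_weights = {"ctc": 0.5, "attention": 0.25, "rnnt": 0.25}
best = pick_best(candidates, decode_weights)
```

In this toy example the CTC score alone would prefer the second candidate, while the combined score prefers the first, which illustrates the kind of complementarity the abstract refers to.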


research
10/26/2020

Improved Mask-CTC for Non-Autoregressive End-to-End ASR

For real-world deployment of automatic speech recognition (ASR), the sys...
research
05/27/2020

Insertion-Based Modeling for End-to-End Automatic Speech Recognition

End-to-end (E2E) models have gained attention in the research field of a...
research
07/20/2021

Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models

Non-autoregressive (NAR) modeling has gained more and more attention in ...
research
06/17/2018

Extending Recurrent Neural Aligner for Streaming End-to-End Speech Recognition in Mandarin

End-to-end models have been showing superiority in Automatic Speech Reco...
research
11/04/2022

Multi-blank Transducers for Speech Recognition

This paper proposes a modification to RNN-Transducer (RNN-T) models for ...
research
11/06/2019

A comparison of end-to-end models for long-form speech recognition

End-to-end automatic speech recognition (ASR) models, including both att...
research
06/17/2019

Multi-Stream End-to-End Speech Recognition

Attention-based methods and Connectionist Temporal Classification (CTC) ...
