An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition

04/19/2022
by   Niko Moritz, et al.

The two most popular loss functions for streaming end-to-end automatic speech recognition (ASR) are the RNN-Transducer (RNN-T) and the connectionist temporal classification (CTC) objectives. Both perform alignment-free training by marginalizing over all possible alignments, but they use different transition rules. Between these two loss types lie the monotonic RNN-T (MonoRNN-T) and the recently proposed CTC-like Transducer (CTC-T), both of which can be realized with the graph temporal classification-transducer (GTC-T) loss function. Monotonic transducers have a few advantages. First, RNN-T can suffer from runaway hallucination, where a model keeps emitting non-blank symbols without advancing in time, often in an infinite loop, a failure mode that monotonic transducers avoid by construction. Second, monotonic transducers consume exactly one model score per time step and are therefore more compatible with, and easier to unify with, traditional FST-based hybrid ASR decoders. So far, however, MonoRNN-T has been found to be less accurate than RNN-T. This does not have to be the case: by regularizing the training, via joint LAS training or parameter initialization from RNN-T, both MonoRNN-T and CTC-T perform as well as, or better than, RNN-T. This is demonstrated on LibriSpeech and on a large-scale in-house data set.
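To make the transition-rule difference concrete, the following is a minimal greedy-decoding sketch, not the paper's GTC-T implementation. The `score` callable (a stand-in for an encoder plus joint network), the `BLANK` index, and the `max_symbols_per_frame` cap are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

BLANK = 0  # index of the blank symbol (an assumption for this sketch)


def greedy_rnnt(score, num_frames, max_symbols_per_frame=10):
    """Greedy RNN-T decoding sketch.

    score(t, prev_label) returns a log-probability vector over the
    vocabulary for frame t given the last emitted label. RNN-T may emit
    several non-blank labels per frame; without the max_symbols_per_frame
    cap, a model that never predicts blank would loop forever (the
    "runaway hallucination" failure mode described above).
    """
    hyp, prev = [], BLANK
    for t in range(num_frames):
        for _ in range(max_symbols_per_frame):
            k = int(np.argmax(score(t, prev)))
            if k == BLANK:
                break          # blank: advance to the next frame
            hyp.append(k)      # non-blank: emit and stay on the same frame
            prev = k
    return hyp


def greedy_monotonic(score, num_frames):
    """Greedy decoding sketch for a monotonic transducer (MonoRNN-T / CTC-T style).

    Exactly one model score is consumed per time step, so at most one label
    is emitted per frame, the loop cannot run away, and the scores line up
    one-to-one with frames, as an FST-based hybrid decoder expects.
    """
    hyp, prev = [], BLANK
    for t in range(num_frames):
        k = int(np.argmax(score(t, prev)))
        if k != BLANK:
            hyp.append(k)
            prev = k
    return hyp
```

The only structural difference is the inner loop: removing it is what makes the transducer monotonic, which is also why its per-frame scores can be consumed directly by a conventional frame-synchronous decoder.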


