A comparison of streaming models and data augmentation methods for robust speech recognition

11/19/2021
by Jiyeon Kim, et al.

In this paper, we present a comparative study of the robustness of two online streaming speech recognition models: Monotonic Chunkwise Attention (MoChA) and the Recurrent Neural Network-Transducer (RNN-T). We explore three recently proposed data augmentation techniques: multi-conditioned training using an acoustic simulator, Vocal Tract Length Perturbation (VTLP) for speaker variability, and SpecAugment. Experimental results show that unidirectional models are in general more sensitive to noisy examples in the training set, and that the final performance of a model depends on the proportion of training examples processed by data augmentation techniques. MoChA models generally achieve better recognition accuracy than RNN-T models. However, training of MoChA models appears to be more sensitive to factors such as the characteristics of the training sets and the incorporation of additional augmentation techniques. By contrast, RNN-T models perform better in terms of latency, inference time, and training stability, and are generally more robust against noise and reverberation. These advantages make RNN-T models a better choice than MoChA models for streaming on-device speech recognition.
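Of the three augmentation techniques named above, SpecAugment is the most self-contained to illustrate: it masks random frequency bands and time spans of a log-mel spectrogram during training. The sketch below is a minimal NumPy implementation in the style of Park et al. (2019); the mask counts and widths are illustrative defaults, not the values used in this paper.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_mask_width=27,
                 num_time_masks=2, time_mask_width=100, rng=None):
    """SpecAugment-style frequency and time masking.

    spec: log-mel spectrogram of shape (num_mel_bins, num_frames).
    Returns a masked copy; the input array is left unmodified.
    """
    rng = rng if rng is not None else np.random.default_rng()
    spec = spec.copy()
    num_bins, num_frames = spec.shape

    # Zero out `num_freq_masks` random horizontal (frequency) bands.
    for _ in range(num_freq_masks):
        f = rng.integers(0, freq_mask_width + 1)          # band height
        f0 = rng.integers(0, max(1, num_bins - f + 1))    # band start
        spec[f0:f0 + f, :] = 0.0

    # Zero out `num_time_masks` random vertical (time) spans.
    for _ in range(num_time_masks):
        t = rng.integers(0, min(time_mask_width, num_frames) + 1)
        t0 = rng.integers(0, max(1, num_frames - t + 1))
        spec[:, t0:t0 + t] = 0.0

    return spec
```

In practice this transform is applied on the fly to each training utterance, so the model never sees the same masking twice; the paper's observation that performance depends on the proportion of augmented training examples would correspond to applying such a transform to only a subset of batches.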


research
01/14/2022

Investigation of Data Augmentation Techniques for Disordered Speech Recognition

Disordered speech recognition is a highly challenging task. The underlyi...
research
12/22/2019

End-to-End Training of a Large Vocabulary End-to-End Speech Recognition System

In this paper, we present an end-to-end training framework for building ...
research
06/07/2018

Training Augmentation with Adversarial Examples for Robust Speech Recognition

This paper explores the use of adversarial examples in training speech r...
research
02/27/2023

A Comparison of Speech Data Augmentation Methods Using S3PRL Toolkit

Data augmentations are known to improve robustness in speech-processing ...
research
04/19/2022

An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition

The two most popular loss functions for streaming end-to-end automatic s...
research
05/04/2021

Streaming end-to-end speech recognition with jointly trained neural feature enhancement

In this paper, we present a streaming end-to-end speech recognition mode...
research
12/14/2021

ImportantAug: a data augmentation agent for speech

We introduce ImportantAug, a technique to augment training data for spee...
