
A comparison of streaming models and data augmentation methods for robust speech recognition

by Jiyeon Kim, et al.

In this paper, we present a comparative study on the robustness of two online streaming speech recognition models: Monotonic Chunkwise Attention (MoChA) and Recurrent Neural Network-Transducer (RNN-T). We explore three recently proposed data augmentation techniques: multi-conditioned training using an acoustic simulator, Vocal Tract Length Perturbation (VTLP) for speaker variability, and SpecAugment. Experimental results show that unidirectional models are in general more sensitive to noisy examples in the training set, and that the final performance of a model depends on the proportion of training examples processed by data augmentation techniques. MoChA models generally achieve better recognition accuracy than RNN-T models. However, training of MoChA models appears to be more sensitive to factors such as the characteristics of the training sets and the incorporation of additional augmentation techniques. On the other hand, RNN-T models perform better than MoChA models in terms of latency, inference time, and training stability, and are generally more robust against noise and reverberation. These advantages make RNN-T models a better choice than MoChA models for streaming on-device speech recognition.
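Of the three augmentation techniques compared, SpecAugment is the simplest to illustrate: it masks random frequency bands and time spans of the input spectrogram during training. The following is a minimal sketch of the masking step (time warping omitted), assuming the input is a log-mel spectrogram as a NumPy array; the function and parameter names are illustrative, not from the paper.

```python
import numpy as np

def spec_augment(spec, num_freq_masks=2, freq_mask_width=10,
                 num_time_masks=2, time_mask_width=20, rng=None):
    """SpecAugment-style masking on a spectrogram of shape
    (num_mel_bins, num_frames). Masked regions are zeroed out."""
    rng = rng or np.random.default_rng()
    out = spec.copy()
    n_mels, n_frames = out.shape
    # Frequency masking: zero out a random band of mel channels.
    for _ in range(num_freq_masks):
        f = int(rng.integers(0, freq_mask_width + 1))
        f0 = int(rng.integers(0, max(1, n_mels - f)))
        out[f0:f0 + f, :] = 0.0
    # Time masking: zero out a random span of frames.
    for _ in range(num_time_masks):
        t = int(rng.integers(0, time_mask_width + 1))
        t0 = int(rng.integers(0, max(1, n_frames - t)))
        out[:, t0:t0 + t] = 0.0
    return out
```

In practice such masking is applied on the fly to each training example, so the model sees a different corruption of the same utterance every epoch; the abstract's observation that final performance depends on the proportion of augmented training examples corresponds to tuning how often (and how aggressively) such masks are applied.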


