A comparison of streaming models and data augmentation methods for robust speech recognition

by   Jiyeon Kim, et al.

In this paper, we present a comparative study on the robustness of two different online streaming speech recognition models: Monotonic Chunkwise Attention (MoChA) and Recurrent Neural Network-Transducer (RNN-T). We explore three recently proposed data augmentation techniques, namely, multi-conditioned training using an acoustic simulator, Vocal Tract Length Perturbation (VTLP) for speaker variability, and SpecAugment. Experimental results show that unidirectional models are in general more sensitive to noisy examples in the training set. The final performance of a model depends on the proportion of training examples processed by data augmentation techniques. MoChA models generally perform better than RNN-T models. However, training of MoChA models appears to be more sensitive to factors such as the characteristics of the training sets and the incorporation of additional augmentation techniques. On the other hand, RNN-T models perform better than MoChA models in terms of latency, inference time, and the stability of training. Additionally, RNN-T models are generally more robust against noise and reverberation. All these advantages make RNN-T models a better choice for streaming on-device speech recognition compared to MoChA models.




1 Introduction

Recent dramatic improvements in End-to-End (E2E) Automatic Speech Recognition (ASR) systems [kim2020review] have been achieved thanks to advances in deep neural networks. These end-to-end speech recognition systems mainly consist of a single neural network component that performs all the tasks that used to be performed by many discrete components in conventional speech recognition systems, such as the Acoustic Model (AM), the Language Model (LM), and the Pronunciation Model (PM). Connectionist Temporal Classification (CTC) [a_graves_icml_2006_00], the Attention-based Encoder-Decoder (AED) model [44926, gowda2020utterance], and the Recurrent Neural Network-Transducer (RNN-T) [graves2012sequence, li2019improving] are often used in recent speech recognition systems. These model architectures have a simpler training pipeline and better modeling capabilities than conventional architectures such as DNN-HMM systems.

There have been increased efforts towards ASR systems that can process streaming input in real time, preferably on-device [he2018streaming, garg2020streaming]. The transition from server-based ASR to on-device ASR systems reduces the cost of maintenance for service providers and is more relevant for tasks where privacy, accessibility, or lower latency is required. Furthermore, an on-device streaming model can now surpass a server-side conventional model in accuracy [sainath2020streaming, garg2020hierarchical].

Research on data augmentation methods has also become relevant because training these ASR systems requires more representative datasets. [Park_2019] introduces time warping, frequency masking, and time masking for data augmentation. [C_Kim_INTERSPEECH_2017_1] introduces room simulation with different reverberation times, Signal-to-Noise Ratios (SNRs), microphone locations, and sound source locations for far-field speech recognition. [park2019specaugment] introduces SpecAugment and compares it with room simulation in different combinations. By randomly generating a warping factor, robustness to speaker variability [Kim2019, n_jaitly_icml_workshop_2013_00] can also be improved. There has also been effort towards alleviating the impact of noise and reverberation while training an ASR model, such as [pncc_chanwoo].

Other relevant research concerns the CTC architecture, which predicts a character for each audio frame and can be used with both the attention-based E2E model and RNN-T [graves2012sequence]. CTC can be jointly attached to an attention-based E2E model [hori2017advances] and can be combined with a prediction network in the RNN-T model. [raffel2017online] introduces hard monotonic attention, and [c_chiu_iclr_2018_00] combines hard monotonic attention with soft attention. In order to build one on-device model that is robust for both near-field and far-field speech data, we explore streaming models with data augmentation methods. In related work, [he2018streaming] uses a TTS approach and [moritz2020streaming] selects SpecAugment as an augmentation method.

In this paper, we compare two streaming speech recognition models: RNN-T and MoChA, that are trained using the same training dataset with the same on-the-fly data augmentation methods [c_kim_asru_2019_01].

The main contribution of this paper is that by comparing two on-line streaming models, one can gain an insight into a streaming ASR model architecture suitable for both near-field and far-field scenarios.

The rest of this paper is structured as follows: Sec. 2 describes streaming model architecture and augmentation techniques that we have used for our experiments. The experimental setup, along with details about the model composition, training strategies, hyperparameter selection, and other empirical information will be discussed in Sec. 3. Sec. 4 describes the results of our experiments, specifically the comparison between two streaming models. We conclude the paper in Sec. 5.

2 Related research

In this section, we give an overview of related research including streaming end-to-end speech recognition models such as MoChA and RNN-T and various data-augmentation techniques. Fig. 1 shows the structures of the RNN-T and MoChA models that were employed in our experiments.

Figure 1: Comparison of RNN-T and MoChA-based models with different data augmentation methods.

2.1 Monotonic Chunkwise Attention

Monotonic Chunkwise Attention (MoChA) [c_chiu_iclr_2018_00] is a unidirectional monotonic soft attention mechanism, derived as a hybrid of hard monotonic attention [aharoni2016sequence] with local soft attention.

The input feature at time step t is fed to the encoder, which outputs a hidden vector. For the attention, MoChA combines two commonly used attention mechanisms. A selection probability, calculated as the sigmoid of a global hard monotonic energy function, defines the boundaries of the significant locations in the encoder sequence. A second energy function, known as the chunk energy, is evaluated on a constant number (the chunk size) of the most recent encoder outputs. The chunk energy is then used to calculate the context vectors used by the decoder. Given the target label, the decoder combines the previous output and the previous context vector and outputs the current decoder state. The model outputs the probability of the label from this decoder state, the previous output, and the context vector through a softmax layer.

As an online attention mechanism, MoChA holds distinct advantages both in terms of speed and accuracy over full attention mechanisms. Also, it has a linear time complexity at the inference time and very few context vectors are calculated - proportional to the length of the output sequence, which increases the efficiency of the online model. Nevertheless, an online attention mechanism’s performance suffers from the lack of complete information about the future states of the input sequence.
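The test-time behaviour described above can be illustrated with a minimal sketch. It assumes that the monotonic and chunk energies have already been computed for the current output step (in the real model they come from learned energy functions), and the function name `mocha_step` and its arguments are hypothetical. The 0.5 selection threshold follows the description in Sec. 4.2.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mocha_step(enc, mono_energy, chunk_energy, start, chunk_size=4):
    """One output step of test-time MoChA attention (illustrative sketch).

    enc: list of encoder hidden vectors (each a list of floats).
    mono_energy, chunk_energy: per-frame energies for this output step
        (assumed to be given; the model learns the functions producing them).
    start: index of the frame attended at the previous output step.
    Returns (context, attended_index), or (None, start) when no frame
    reaches a selection probability of 0.5.
    """
    for j in range(start, len(enc)):
        # Hard monotonic attention: scan left to right and stop at the
        # first frame whose selection probability reaches 0.5.
        if sigmoid(mono_energy[j]) >= 0.5:
            lo = max(0, j - chunk_size + 1)
            window = chunk_energy[lo:j + 1]
            # Soft attention (softmax) over the chunk ending at frame j.
            peak = max(window)
            exps = [math.exp(e - peak) for e in window]
            total = sum(exps)
            weights = [e / total for e in exps]
            dim = len(enc[0])
            context = [
                sum(w * enc[lo + k][d] for k, w in enumerate(weights))
                for d in range(dim)
            ]
            return context, j
    return None, start
```

Each output step resumes the scan from the previously attended frame, which is what keeps the total cost linear in the input length.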

Figure 2: Attention heatmap derived from MoChA with a chunk size of 4 (axes: encoder step, output step).

2.2 RNN Transducer

The RNN-T model introduced in [graves2012sequence] has been successfully employed for on-device speech recognition applications in [he2018streaming]. As shown in Fig. 1, an RNN-T model consists of an encoder, a prediction network, and a joint network that combines the outputs of the encoder and the prediction network. The encoder consists of six LSTM layers. The input feature at time step t is fed to the encoder, which outputs a hidden vector. The prediction network acts as a language model: it takes the previous label output as input to predict the next label. The joint network combines the hidden vector from the prediction network with the hidden vector from the encoder and outputs logits. After a softmax layer, the model outputs the probability of a label. Like CTC, and unlike our attention model, RNN-T has a blank label, and only the embedding of a non-blank output label becomes an input of the prediction network. The RNN-T logit is defined as

$$h^{joint}_{t,u} = f^{joint}\left(f^{enc}(x_t),\ f^{pred}(y_{u-1})\right) \tag{1}$$

where $f^{joint}$, $f^{enc}$, and $f^{pred}$ are the joint, encoder, and prediction networks, respectively, $x_t$ is the input feature at time step t, and $y_{u-1}$ is the previous non-blank label.
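As a rough illustration of how the joint network and the blank label interact, the sketch below uses greedy decoding rather than the beam search used in the experiments. The additive tanh combination in `joint_logits`, the `predict` callable, and all parameter names are assumptions made for this example and are not taken from the paper.

```python
import math

BLANK = 0  # index of the blank label (assumed to be 0 in this sketch)


def joint_logits(h_enc, h_pred, w_out):
    """Combine encoder and prediction vectors and project to label logits.

    The additive tanh combination is a common choice and an assumption here.
    """
    hidden = [math.tanh(a + b) for a, b in zip(h_enc, h_pred)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(enc_frames, predict, w_out, max_symbols_per_frame=3):
    """Greedy RNN-T decoding over a sequence of encoder vectors.

    predict(label) is a hypothetical stand-in for the prediction network:
    it returns a hidden vector given the last non-blank label (None at start).
    """
    labels = []
    h_pred = predict(None)
    for h_enc in enc_frames:
        for _ in range(max_symbols_per_frame):
            probs = softmax(joint_logits(h_enc, h_pred, w_out))
            best = max(range(len(probs)), key=probs.__getitem__)
            if best == BLANK:
                break  # blank: advance to the next encoder frame
            labels.append(best)
            # Only non-blank labels are fed back to the prediction network.
            h_pred = predict(best)
    return labels
```

Because each decoding step needs only the current encoder vector and the prediction state, no attention over past encoder outputs is computed, which is consistent with the lower decoding latency reported for RNN-T in Sec. 4.4.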

2.3 Data Augmentation

To increase the diversity of data for training models and make a model robust, we compare streaming models with several data augmentation methods - Acoustic Simulator (AS), Vocal Tract Length Perturbation followed by the Acoustic Simulator, and SpecAugment.

2.3.1 Room acoustics simulation

To enhance the robustness of models in noisy and far-field environments, we apply on-the-fly data augmentation using the acoustic simulator described in [C_Kim_INTERSPEECH_2017_1, c_kim_interspeech_2018_00, B_Li_INTERSPEECH_2017_1]. This acoustic simulator artificially adds noise and reverberation to training utterances. It emulates an environment for far-field speech recognition in which the parameters (room dimensions, microphone and sound source locations, reverberation time, and SNR) are randomly picked from specific ranges [C_Kim_INTERSPEECH_2017_1]. Because simulated utterances are generated on-the-fly, the training dataset is virtually infinite in its diversity.

2.3.2 Vocal Tract Length Perturbation

Vocal Tract Length Perturbation (VTLP) [Kim2019, n_jaitly_icml_workshop_2013_00, x_cui_taslp_2015_00] is a technique that generates a random warping factor to simulate a change in the relative length of a speaker's vocal tract. This allows us to freely change the voice characteristics and harmonics of the input audio and thereby bypass the restriction of having a limited number of speakers. It also helps avoid over-training biases that originate from having limited utterances from one type of speaker. In our experiments, we combine VTLP with AS, expecting the combined data augmentation to be effective on noisy test sets.
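To make the idea of a warping factor concrete, the following sketch implements the piecewise-linear frequency warping commonly associated with VTLP. It is an illustration under stated assumptions and not necessarily the implementation in the cited works: the boundary frequency `f_hi` of 4800 Hz and the function name `vtlp_warp` are hypothetical choices for this example, while the 16 kHz sampling rate matches the data used in this paper.

```python
def vtlp_warp(freq_hz, alpha, sample_rate=16000, f_hi=4800.0):
    """Map a frequency (Hz) to its warped value for warping factor alpha.

    Frequencies below a boundary are scaled by alpha; above it, a second
    linear segment brings the mapping back to the Nyquist frequency so that
    the warped axis still covers the range 0 to sample_rate / 2.
    """
    nyquist = sample_rate / 2.0
    scaled_hi = f_hi * min(alpha, 1.0)
    boundary = scaled_hi / alpha
    if freq_hz <= boundary:
        return freq_hz * alpha
    slope = (nyquist - scaled_hi) / (nyquist - boundary)
    return nyquist - slope * (nyquist - freq_hz)
```

With `alpha` equal to 1.0 the mapping is the identity. Values above 1.0 shift spectral content upwards and values below 1.0 shift it downwards, which is how a single utterance can be made to resemble different speakers.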

2.3.3 SpecAugment

SpecAugment was introduced in [Park_2019] and is widely used due to its simplicity and effectiveness. SpecAugment warps and masks the features of a spectrogram along the time and frequency axes. Time warping shifts the spectrogram along the time axis, time masking masks all frequency channels over a range of time steps, and frequency masking masks a range of frequency channels over all time steps.

3 Experiments

3.1 Experimental Setup

For all the experiments, we employ power-mel filterbank coefficients, with the power coefficient used in [Kim2019, c_kim_asru_2019_00], to extract 40-dimensional features. We use a window size of 25 ms with an interval of 10 ms between successive frames. As the output label, we use Byte Pair Encoding (BPE), which splits the training words into 10025 BPE units. We enable dropout for the encoder layers of both MoChA and RNN-T, and 10% label smoothing is applied to the output probability of the MoChA model. For a fair comparison, we use the same encoder structure, a 6-layer LSTM encoder with 1024 Long Short-Term Memory (LSTM) cells, and a default beam size of 12 for beam-search decoding during inference. We train for 13 to 15 epochs, which is sufficient for the models to converge and for the performance to stop fluctuating. All the training and test data is 16 kHz audio. For the MoChA model, we use a chunk size of 4 and an LSTM unit size of 1000. For better convergence, we train the model with a joint CTC and Cross-Entropy (CE) loss function [suyounctc], defined as

$$L_{total} = L_{CE} + L_{CTC} \tag{2}$$

where $L_{total}$, $L_{CE}$, and $L_{CTC}$ are the total, CE, and CTC losses, respectively.

For the RNN-T model, we use a single-layer prediction network with 1024 units. We use the same training methodology and encoder parameters for a fair comparison. The RNN-T is trained with a CTC loss on the encoder. Combined with the prediction network, the RNN-T loss is applied to the softmax output of the joint network. The RNN-T loss is the negative natural logarithm of the output label probability:

$$L_{RNN\text{-}T} = -\ln P(y \mid x) \tag{3}$$

where $x$ is the input feature sequence and $y$ is the output label sequence.


We use a linear learning rate warm-up strategy while pre-training all the encoder layers of the MoChA and RNN-T models. We found that model convergence becomes unstable as the reduction factor decreases during the pre-training stage. We train the first layer for 0.5 epochs and then add one LSTM layer to the encoder every 0.25 epochs. The learning rate is reset and then linearly increased each time a new layer is added, as shown in Fig. 3. After all the LSTM layers have been added, we continue the learning rate warm-up from 1.75 to 2.5 epochs.
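The schedule described above can be sketched as a function of the (fractional) training epoch. The stage boundaries follow the text for a 6-layer encoder; the learning rate values `stage_lr` and `peak_lr`, and the assumption that each stage ramps from zero to `stage_lr`, are hypothetical because the paper does not state them.

```python
def num_active_layers(epoch, total_layers=6, first_stage=0.5, stage_len=0.25):
    """Number of encoder LSTM layers being trained at a given epoch."""
    if epoch < first_stage:
        return 1
    added = int((epoch - first_stage) // stage_len) + 1
    return min(total_layers, 1 + added)


def learning_rate(epoch, stage_lr=0.5, peak_lr=1.0, total_layers=6,
                  first_stage=0.5, stage_len=0.25, final_end=2.5):
    """Learning rate with a reset and linear ramp whenever a layer is added.

    stage_lr and peak_lr are placeholder values, not taken from the paper.
    """
    pretrain_end = first_stage + (total_layers - 1) * stage_len  # 1.75 here
    if epoch < first_stage:
        return stage_lr * epoch / first_stage
    if epoch < pretrain_end:
        # Reset to zero at the start of each stage, then ramp linearly.
        into_stage = (epoch - first_stage) % stage_len
        return stage_lr * into_stage / stage_len
    if epoch < final_end:
        # Final warm-up once all layers are in place (1.75 to 2.5 epochs).
        frac = (epoch - pretrain_end) / (final_end - pretrain_end)
        return stage_lr + (peak_lr - stage_lr) * frac
    return peak_lr
```

With these defaults, layers are added at epochs 0.5, 0.75, 1.0, 1.25, and 1.5, so the sixth layer has been trained for 0.25 epochs when the final warm-up starts at 1.75.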


Figure 3: Learning rate warm-up schedule (axis label: learning rate).

3.2 Data Augmentation

To perform data-augmentation using an AS and VTLP, we employ an example server architecture described in [c_kim_asru_2019_01]. We simulate input audio on-the-fly with 4 CPUs operating on input for each GPU.

In our experiments using an acoustic simulator, SNR values are sampled between 0 dB and 30 dB from a distribution similar to that described in [C_Kim_INTERSPEECH_2017_1]. Similarly, the reverberation time is sampled from 0.0 s to 1.0 s from a distribution described in [C_Kim_INTERSPEECH_2017_1]. Babble, music, and TV noises are used as noise sources. Each utterance is corrupted by one to three noise sources randomly located inside a room using the acoustic simulator. The selection probability of a noise file depends on the type of noise. Room dimensions and the locations of the microphone, the noise sources, and the sound source are randomly picked. To see the effect of data augmentation on model convergence, we apply the AS to different percentages of the input data. Detailed results are presented in Sec. 4. During the experiments, a distinct effect of acoustic simulation on some models is observed. For the following discussion in this paper, we define the AS proportion $r_{AS}$, which is the percentage of utterances processed by the acoustic simulator:

$$r_{AS} = \frac{N_{AS}}{N_{total}} \times 100\ (\%) \tag{4}$$

where $N_{AS}$ is the number of utterances processed by the acoustic simulator and $N_{total}$ is the total number of utterances.


The remaining 'clean' utterances, which are not processed by the acoustic simulator, still have other augmentation, namely VTLP, applied to them, but they do not contain artificial noise or reverberation. For the VTLP configuration, we randomly choose a warping factor between 0.8 and 1.2 and set the oversized FFT factor to 16, the same parameter setup as in [Kim2019].
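A minimal sketch of how the per-utterance augmentation settings could be drawn is given below. The ranges come from the text above, but the uniform distributions are an assumption: the paper samples SNR and reverberation time from the distributions of the cited work, and noise types are selected with type-dependent probabilities. The function name `sample_augmentation` and the dictionary keys are hypothetical.

```python
import random

NOISE_TYPES = ["babble", "music", "tv"]


def sample_augmentation(as_percent, rng):
    """Draw augmentation settings for one training utterance.

    as_percent: percentage of utterances sent through the acoustic simulator.
    rng: a random.Random instance, so that sampling is reproducible.
    """
    config = {
        # VTLP is applied to every utterance, including the 'clean' ones.
        "vtlp_warp": rng.uniform(0.8, 1.2),
        "acoustic_sim": None,
    }
    if rng.random() * 100.0 < as_percent:
        num_sources = rng.randint(1, 3)
        config["acoustic_sim"] = {
            "snr_db": rng.uniform(0.0, 30.0),          # assumed uniform
            "reverb_time_s": rng.uniform(0.0, 1.0),    # assumed uniform
            "noise_sources": [rng.choice(NOISE_TYPES)  # assumed uniform
                              for _ in range(num_sources)],
        }
    return config
```

Calling `sample_augmentation(70.0, random.Random(0))` for every utterance corresponds to the 70% setting used for the acoustic simulator results in Table 4.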

For the SpecAugment experiment, we randomly mask the input features in time and frequency. Along the time axis, we mask a random number of time frames, ranging from 1 to 20. Along the frequency axis, we randomly mask up to two bands, each at most 8 channels wide, of the 40-dimensional features.
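The masking configuration above can be illustrated with the sketch below, which operates on a feature matrix stored as a list of frames. It covers only time and frequency masking, applies a single time mask, and fills masked regions with zero; these choices, along with the function name `spec_augment`, are assumptions for this example.

```python
import random


def spec_augment(features, rng, max_time_width=20, max_freq_width=8,
                 max_freq_masks=2, mask_value=0.0):
    """Return a masked copy of features, a list of frames of equal length."""
    num_frames = len(features)
    num_bins = len(features[0])
    out = [row[:] for row in features]  # leave the input untouched

    # Time masking: blank 1 to max_time_width consecutive frames.
    t_width = min(rng.randint(1, max_time_width), num_frames)
    t_start = rng.randint(0, num_frames - t_width)
    for t in range(t_start, t_start + t_width):
        out[t] = [mask_value] * num_bins

    # Frequency masking: up to max_freq_masks bands of limited width.
    for _ in range(rng.randint(0, max_freq_masks)):
        f_width = min(rng.randint(1, max_freq_width), num_bins)
        f_start = rng.randint(0, num_bins - f_width)
        for row in out:
            for f in range(f_start, f_start + f_width):
                row[f] = mask_value
    return out
```

For example, `spec_augment(feats, random.Random(0))` with 40-dimensional frames masks at most 20 frames and at most 16 of the 40 channels.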

3.3 Dataset

3.3.1 LibriSpeech Corpus

LibriSpeech [v_panayotov_icassp_2015_00] is a large corpus of 16 kHz English speech. Each of the models presented here is trained on the full 960 hours of training data available in the LibriSpeech corpus. Both test-clean and test-other are used for performance evaluation.

3.3.2 Test set - LibriSpeech clean with noise

To evaluate the performance in noisy environments, we use the noisy LibriSpeech test set and the VOiCES test set. We synthetically add babble, music, and TV noises to the clean LibriSpeech test set using the AS, with the same distribution of AS parameters as during training.

3.3.3 Test set - VOiCES

The VOiCES [DBLP:journals/corr/abs-1804-05053] evaluation set, introduced at the 2019 Interspeech VOiCES ASR Challenge, consists of 4600 utterances derived from the LibriSpeech dataset. This dataset targets acoustically challenging environments, such as noisy backgrounds, reverberation, and secondary speakers.

4 Experimental results

We compare MoChA and RNN-T after applying various augmentation methods - an Acoustic Simulator, VTLP followed by an Acoustic Simulator, and SpecAugment, in terms of the speech recognition accuracy, the number of parameters, the model size, the latency, and the inference time. For an extended comparison, we also compare them to Bi-directional LSTM with Full Attention (BFA) models and Uni-directional LSTM with Full Attention (UFA) models to provide a contrast with offline ASR models. In obtaining these results, a Language Model (LM) is not employed. For all the experimental results, we use the same beam size of 12.

Table 1: Overview of model comparison. Criteria compared: acoustic simulator (clean performance), acoustic simulator (noisy performance), SpecAugment performance, inference time, and parameter and model size.
MoChA        without warm-up   with warm-up
test-clean   6.41              6.18
+ noise      65.95             64.32
test-other   18.82             18.53
voices       75.31             73.51
Table 2: WER (%) of the MoChA model with learning rate warm-up disabled and enabled.
Test set     0%      10%     30%     50%     70%     90%     100%
MoChA-based model
test-clean   6.18    6.58    7.03    7.06    7.35    7.37    7.75
+ noise      64.32   36.82   32.10   30.07   28.06   27.45   26.46
test-other   18.53   18.30   18.43   18.43   17.81   18.41   18.07
Voices       73.51   39.51   33.59   30.67   27.99   26.52   25.20
RNN-T-based model
test-clean   7.86    7.83    8.11    8.55    9.55    10.42   11.78
+ noise      63.03   33.97   29.81   28.32   26.70   27.01   27.92
test-other   20.89   20.45   20.54   20.70   21.74   23.44   25.69
Voices       71.04   36.43   30.42   27.81   24.55   24.90   25.63
Table 3: Word Error Rates (WERs) (%) of streaming Monotonic Chunkwise Attention (MoChA) and Recurrent Neural Network-Transducer (RNN-T) models trained with Vocal Tract Length Perturbation (VTLP) and an Acoustic Simulator (AS). Columns correspond to different percentages of utterances processed by the AS, as defined in (4).
Method     Test set     BFA     UFA     MoChA   RNN-T
Baseline   test-clean   4.38    6.86    6.18    7.86
           + noise      59.67   69.79   64.32   63.03
           test-other   14.52   19.27   18.53   20.89
           voices       66.82   79.43   73.51   71.04
           avg.         40.52   48.73   45.17   44.91
AS         test-clean   4.15    6.98    8.33    9.44
           + noise      18.10   24.24   25.37   25.42
           test-other   11.83   17.44   18.78   21.68
           voices       17.45   24.30   25.01   23.05
           avg.         13.56   19.15   20.23   20.43
VTLP+AS    test-clean   4.29    7.76    7.35    9.55
           + noise      17.05   26.31   28.06   26.70
           test-other   11.58   18.62   17.81   21.74
           voices       15.26   25.52   27.99   24.55
           avg.         12.53   20.45   21.43   21.26
SpecAug.   test-clean   3.62    6.26    6.93    7.26
           + noise      51.98   58.22   58.43   59.06
           test-other   11.20   16.41   17.47   18.68
           voices       48.93   57.85   57.55   60.48
           avg.         31.58   37.81   38.13   39.66
Table 4: Word Error Rates (WERs) (%) obtained with different models and different data augmentation techniques. BFA: bi-directional LSTM with full attention; UFA: uni-directional LSTM with full attention; MoChA and RNN-T: uni-directional LSTM. For the percentage of utterances processed by the acoustic simulator, defined in (4), we use a value of 70% in experiments with an acoustic simulator.

4.1 Effects of the learning rate warm-up

We compare the performance with learning rate warm-up enabled and disabled. The warm-up strategy yields an absolute WER difference of 0.23% to 1.8% on our test sets, and the gap is larger on the noisy test sets. The results are shown in Table 2.

4.2 Effects of the AS proportion for data augmentation using an acoustic simulator

We refer to the noise-added LibriSpeech test set and the VOiCES test set as the noisy test sets. We observe a trade-off between clean and noisy performance for the streaming models. As the percentage of utterances processed by the AS increases, the Word Error Rate (WER) on the clean test sets increases while the WER on the noisy test sets decreases. Unlike the streaming models, the BFA model trained with AS improves on both the clean and the noisy test sets, as shown in Table 4.

Table 3 shows that MoChA is stronger on clean speech, whereas RNN-T is stronger on noisy speech. This is due to the attention mechanism of MoChA: if the monotonic attention probability is less than 0.5, the encoder embedding is not attended, and the model did not output a label in some test cases. To check the encoder performance, we also compare the CTC performance of the models and find that MoChA performs better on relatively clean sets in the BPE label unit.

4.3 Comparison of different augmentation methods

We compare streaming MoChA and RNN-T models with different augmentation methods. The experimental results are summarized in Table 4. The MoChA-based model generally performs better than RNN-T when combined with the different augmentation methods, especially on the test-clean and test-other test sets. When the streaming models are compared with the non-streaming BFA and UFA models, the BFA model performs best on all test sets, as expected. Notably, the BFA model performs better when VTLP is applied before the AS. We also conclude that applying the AS is critical for performance on the noisy test sets, although it degrades the performance of the streaming models on the clean test set. Models trained with SpecAugment perform worse than AS-trained models on the noisy test sets.

Model   Measure         beam 1   beam 4   beam 8   beam 12
MoChA   latency (ms)    14.99    41.32    78.24    119.1
        inference (s)   1.45     1.73     1.98     2.25
RNN-T   latency (ms)    4.38     5.34     6.26     7.27
        inference (s)   0.56     0.71     0.83     0.92
Table 5: Average latency and inference time of streaming models with different beam sizes. Note that the time for obtaining the encoder output is excluded.
Measurement                      BFA    UFA    MoChA   RNN-T
Number of parameters (million)   187    83     85      81
Model size (MB)                  717    318    326     313
Table 6: Number of parameters and model size of the different model architectures. BFA: bi-directional LSTM with full attention; UFA: uni-directional LSTM with full attention; MoChA and RNN-T: uni-directional LSTM.

4.4 Latency and Inference Time

Latency and inference time are measured on an Nvidia Tesla P100 GPU using model inference code written in Python 3.5.4. We use 100 utterances sampled from the LibriSpeech test-clean set and calculate the average latency and inference time per sentence. The sampled utterances consist of 2346 words and 10427 characters. Because both models have an identical encoder architecture, we compare latency and inference time excluding the encoder computation time. As shown in Table 5, RNN-T has lower latency and inference time because its decoding process is simpler than that of the attention-based model [li2019improving]. As the beam size increases, the latency of the MoChA model increases drastically due to its attention mechanism.
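The scaling behaviour can be quantified directly from the latency values reported in Table 5. The short sketch below only re-expresses those reported numbers as ratios; the helper names are hypothetical.

```python
# Average latency in milliseconds per beam size, copied from Table 5.
latency_ms = {
    "MoChA": {1: 14.99, 4: 41.32, 8: 78.24, 12: 119.1},
    "RNN-T": {1: 4.38, 4: 5.34, 8: 6.26, 12: 7.27},
}


def latency_ratio(beam):
    """How many times slower MoChA is than RNN-T at a given beam size."""
    return latency_ms["MoChA"][beam] / latency_ms["RNN-T"][beam]


def latency_growth(model):
    """Latency at beam size 12 relative to beam size 1 for one model."""
    return latency_ms[model][12] / latency_ms[model][1]


for beam in (1, 4, 8, 12):
    print(f"beam {beam:2d}: MoChA/RNN-T latency ratio = {latency_ratio(beam):.2f}")
for model in ("MoChA", "RNN-T"):
    print(f"{model}: beam 12 vs. beam 1 latency = {latency_growth(model):.2f}x")
```

Going from a beam size of 1 to 12, the MoChA latency grows by a factor of roughly 8, whereas the RNN-T latency grows by less than a factor of 2.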

4.5 Parameter & Model size

We compare the number of parameters and the model size of the different model architectures. Both are calculated without model compression. As expected, the capacity of the BFA model is larger due to its bidirectional encoding. We find that the number of parameters and the model size of MoChA are larger than those of the UFA and RNN-T models, mainly because of the additional computation needed to obtain attention weights, such as monotonic attention. RNN-T has the fewest parameters and the smallest model size, as well as the lowest latency and inference time.

5 Conclusions

We compare two streaming model architectures, MoChA and RNN-T, in terms of inference time, latency, and speech recognition accuracy when combined with different data augmentation techniques. We observe that, compared to non-streaming models, streaming models are more sensitive to the proportion of noisy examples in the training set. The final performance of a model depends on the proportion of noisy examples generated by data augmentation techniques. MoChA models perform slightly better than RNN-T models in terms of speech recognition accuracy, especially on clean test sets, while RNN-T models generally perform better in terms of latency, inference time, and the stability of parameter convergence during training. These advantages make RNN-T models a better choice for streaming on-device speech recognition compared to MoChA models.