Recent dramatic improvements in End-to-End (E2E) Automatic Speech Recognition (ASR) systems [kim2020review] have been achieved thanks to advances in deep neural networks. These end-to-end systems mainly consist of a single neural network that performs all the tasks previously handled by several discrete components in conventional speech recognition systems, such as the Acoustic Model (AM), the Language Model (LM), and the Pronunciation Model (PM). Connectionist Temporal Classification (CTC) [a_graves_icml_2006_00], Attention-based Encoder-Decoder (AED) models [44926, gowda2020utterance], and the Recurrent Neural Network-Transducer (RNN-T) [graves2012sequence, li2019improving] are often used in recent speech recognition systems. These architectures have a simpler training pipeline and better modeling capabilities than conventional architectures such as DNN-HMM systems.
There have been increasing efforts towards ASR systems that can process streaming input in real time, preferably on-device [he2018streaming, garg2020streaming]. The transition from server-based to on-device ASR reduces maintenance costs for service providers and is more suitable for tasks where privacy, accessibility, or low latency are required. Furthermore, an on-device streaming model can now surpass a server-side conventional model in accuracy [sainath2020streaming, garg2020hierarchical].
Research on various data augmentation methods has also become relevant to the need for more representative training datasets for these ASR systems. [Park_2019] introduces time warping, frequency masking, and time masking for data augmentation. [C_Kim_INTERSPEECH_2017_1] introduces room simulation with different reverberation times, Signal-to-Noise Ratios (SNRs), and microphone and sound source locations for far-field speech recognition. [park2019specaugment] introduces SpecAugment and compares different combinations of it with room simulation. By randomly generating a warping factor, speaker variability [Kim2019, n_jaitly_icml_workshop_2013_00] can also be increased. There have also been efforts towards alleviating the impact of noise and reverberation while training an ASR model, such as [pncc_chanwoo].
Other relevant research includes using the CTC architecture to predict a character for each audio frame; CTC can be used with both the attention-based E2E model and the RNN-T [graves2012sequence]. CTC can be jointly attached to an attention-based E2E model [hori2017advances], and it can be combined with a prediction network in the RNN-T model. [raffel2017online] introduces hard monotonic attention, and [c_chiu_iclr_2018_00] combines hard monotonic attention with soft attention. To build a single on-device model that is robust to both near-field and far-field speech, we explore streaming models combined with data augmentation. In related work, [he2018streaming] uses a TTS-based approach and [moritz2020streaming] selects SpecAugment as the augmentation method.
In this paper, we compare two streaming speech recognition models, RNN-T and MoChA, trained on the same training dataset with the same on-the-fly data augmentation methods [c_kim_asru_2019_01].
The main contribution of this paper is that, by comparing two online streaming models, one can gain insight into which streaming ASR model architecture is suitable for both near-field and far-field scenarios.
The rest of this paper is structured as follows: Sec. 2 describes the streaming model architectures and augmentation techniques used in our experiments. The experimental setup, along with details about the model composition, training strategies, hyperparameter selection, and other empirical information, is discussed in Sec. 3. Sec. 4 describes the results of our experiments, specifically the comparison between the two streaming models. We conclude the paper in Sec. 5.
2 Related research
In this section, we give an overview of related research including streaming end-to-end speech recognition models such as MoChA and RNN-T and various data-augmentation techniques. Fig. 1 shows the structures of the RNN-T and MoChA models that were employed in our experiments.
2.1 Monotonic Chunkwise Attention
Monotonic Chunkwise Attention (MoChA) [c_chiu_iclr_2018_00] is a unidirectional monotonic soft attention mechanism, derived as a hybrid of hard monotonic attention [aharoni2016sequence] with local soft attention.
The input feature $\mathbf{x}_t$ at timestamp $t$ is fed to the encoder, which outputs the hidden vector $\mathbf{h}_t$. For the attention, MoChA combines two commonly used attention mechanisms: a probability value, calculated as the sigmoid of a global hard monotonic energy function, is used to decide the boundaries of the significant locations in the encoder sequence; another energy function, known as the chunk energy, is evaluated on a constant number $w$ (the chunk size) of the most recent encoder outputs. The chunk energy is then used to calculate the context vector $\mathbf{c}_i$ to be used by the decoder. Given the target label $y_i$, the decoder combines the previous output $y_{i-1}$ and the previous context vector $\mathbf{c}_{i-1}$ and outputs the current decoder state $\mathbf{s}_i$. The model outputs the probability of the label $y_i$ from this decoder state $\mathbf{s}_i$, the previous output $y_{i-1}$, and the context vector $\mathbf{c}_i$ with a softmax layer.
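The inference-time behavior described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: a plain dot product stands in for the learned monotonic and chunk energy functions, and the greedy hard decision (select the first frame whose selection probability reaches 0.5) replaces the expected-attention computation used during training.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mocha_attend(enc, query, w=4, prev_boundary=0):
    """One inference-time MoChA step: hard monotonic selection + chunkwise
    soft attention.

    enc:           (T, D) encoder outputs
    query:         (D,) decoder-state vector used to score encoder frames
    w:             chunk size (number of recent frames softly attended)
    prev_boundary: frame index where the previous output's boundary stopped
    Returns (context, boundary), or (None, prev_boundary) if no frame is
    selected.
    """
    T = enc.shape[0]
    for j in range(prev_boundary, T):
        # Monotonic energy: a dot product stands in for the learned energy
        # function; its sigmoid gives the selection probability p_j.
        p = sigmoid(enc[j] @ query)
        if p >= 0.5:  # greedy hard decision at inference time
            lo = max(0, j - w + 1)
            chunk = enc[lo:j + 1]
            # Chunk energy -> softmax weights over the last w frames
            e = chunk @ query
            a = np.exp(e - e.max())
            a /= a.sum()
            return a @ chunk, j  # context vector and new boundary
    return None, prev_boundary  # attention never triggered; no label emitted
```

If no frame reaches the 0.5 threshold, no context vector is produced, which mirrors the case discussed in Sec. 4 where MoChA fails to output a label.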
As an online attention mechanism, MoChA holds distinct advantages in both speed and accuracy over full attention mechanisms. It has linear time complexity at inference time, and very few context vectors are calculated (proportional to the length of the output sequence), which increases the efficiency of the online model. Nevertheless, an online attention mechanism's performance suffers from the lack of complete information about the future states of the input sequence.
2.2 RNN Transducer
An RNN-T model, introduced in [graves2012sequence], has been successfully employed for on-device speech recognition applications in [he2018streaming]. As shown in Fig. 1, an RNN-T model consists of an encoder, a prediction network, and a joint network that combines the outputs of the encoder and the prediction network. The encoder consists of six layers of LSTMs. The input feature $\mathbf{x}_t$ at timestamp $t$ is fed to the encoder, which outputs the hidden vector $\mathbf{h}_t^{\text{enc}}$. The prediction network acts as a language model: it is fed the previous label output to predict the next label. The joint network combines the two, the hidden vector $\mathbf{h}_u^{\text{pred}}$ from the prediction network and $\mathbf{h}_t^{\text{enc}}$ from the encoder, and outputs logits. After a softmax layer, the model outputs the probability of the label $y_u$. As with CTC, and unlike our attention model, RNN-T has a blank label, and only the non-blank outputs' embeddings become inputs of the prediction network. The RNN-T logits are computed as
$$\mathbf{z}_{t,u} = f^{\text{joint}}\big(\mathbf{h}_t^{\text{enc}}, \mathbf{h}_u^{\text{pred}}\big),$$
where $f^{\text{joint}}$, $\mathbf{h}_t^{\text{enc}}$, and $\mathbf{h}_u^{\text{pred}}$ are the joint-network function and the logits of the encoder and prediction networks, respectively.
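The joint computation can be sketched as follows, using the common additive-tanh parameterization; the exact joint-network form used in our models may differ, and all weight names and dimensions are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rnnt_joint(h_enc, h_pred, W_enc, W_pred, W_out, b):
    """One RNN-T joint-network evaluation at grid point (t, u).

    h_enc:  (De,) encoder hidden vector at frame t
    h_pred: (Dp,) prediction-network hidden vector after label u
    Returns P(k | t, u) over the vocabulary, including the blank label.
    """
    z = np.tanh(W_enc @ h_enc + W_pred @ h_pred)  # joint hidden vector
    logits = W_out @ z + b                        # vocab + blank logits
    return softmax(logits)
```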
2.3 Data Augmentation
To increase the diversity of the training data and make the models robust, we compare streaming models trained with several data augmentation methods: the Acoustic Simulator (AS), Vocal Tract Length Perturbation followed by the Acoustic Simulator, and SpecAugment.
2.3.1 Room acoustics simulation
To enhance the robustness of the models in noisy and far-field environments, we apply on-the-fly data augmentation using the acoustic simulator of [C_Kim_INTERSPEECH_2017_1, c_kim_interspeech_2018_00, B_Li_INTERSPEECH_2017_1]. This acoustic simulator artificially adds noise and reverberation to training utterances. It emulates a far-field speech recognition environment in which parameters such as the room dimensions, the microphone and sound source locations, the reverberation time, and the SNR are randomly picked from specific ranges [C_Kim_INTERSPEECH_2017_1]. By doing this, we generate simulated utterances on-the-fly, ensuring that the training dataset is virtually infinite in its diversity.
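The noise-addition step of such a simulator can be sketched as follows; reverberation (convolution with a room impulse response) is omitted, and the function name is illustrative rather than part of the simulator's actual API.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Corrupt `speech` with `noise` scaled to a target SNR in dB.

    Both arguments are float waveforms of the same length. This is only the
    noise-addition part of an acoustic simulator; reverberation is omitted.
    """
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise
```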
2.3.2 Vocal Tract Length Perturbation
Vocal Tract Length Perturbation (VTLP) [Kim2019, n_jaitly_icml_workshop_2013_00, x_cui_taslp_2015_00] is a technique that generates a random warping factor to simulate changes in the relative length of a person's vocal tract. This allows us to freely change the voice characteristics and harmonics of the input audio, thereby bypassing the restriction of having a limited number of speakers. It also helps avoid biases caused by over-training on limited utterances from one type of speaker. In our experiments, we combine VTLP with the AS, expecting the combined data augmentation to be effective on the noisy test sets.
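A common piecewise-linear form of this frequency warping (following Jaitly and Hinton's formulation) can be sketched as follows; the boundary frequency and the exact functional form used in our setup are assumptions for illustration.

```python
import numpy as np

def vtlp_warp(freqs, alpha, f_hi=4800.0, fs=16000.0):
    """Piecewise-linear VTLP frequency warping.

    freqs: array of frequencies in Hz (e.g. mel filterbank center freqs)
    alpha: warping factor, sampled uniformly from e.g. [0.8, 1.2]
    f_hi:  boundary frequency controlling where the warp bends (assumed)
    fs:    sampling rate; the warp maps [0, fs/2] onto itself
    """
    nyq = fs / 2.0
    cut = f_hi * min(alpha, 1.0) / alpha
    return np.where(
        freqs <= cut,
        alpha * freqs,                 # linear scaling below the cut
        # second segment, pinned so that fs/2 maps to fs/2
        nyq - (nyq - f_hi * min(alpha, 1.0)) * (nyq - freqs) / (nyq - cut),
    )
```

With `alpha = 1.0` the warp is the identity, and for any `alpha` the Nyquist frequency stays fixed, so the warped axis still covers the full band.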
2.3.3 SpecAugment
SpecAugment was introduced in [Park_2019] and is widely used due to its simplicity and effectiveness. SpecAugment warps and masks features along the time and frequency axes of a spectrogram. Time warping shifts the spectrogram along the time axis, time masking masks all frequencies at certain time steps, and frequency masking masks certain frequency ranges across all time steps.
3 Experiments
3.1 Experimental Setup
For all experiments, we employ power-mel filterbank coefficients with the power coefficient used in [Kim2019, c_kim_asru_2019_00] to extract 40-dimensional features. We use a window size of 25 ms with an interval of 10 ms between successive frames. As the output labels, Byte Pair Encoding (BPE) is used, splitting the training words into 10025 BPE units. We apply dropout to the encoder layers, both for MoChA and RNN-T, and 10% label smoothing is applied to the output probability of the MoChA model. For a fair comparison, we use the same encoder structure, a 6-layer LSTM encoder with 1024 Long Short-Term Memory (LSTM) cells per layer, and we use a beam size of 12 as the default for beam-search decoding during inference. We trained for a sufficient number of epochs, 13 to 15, until the model converged well and the performance stopped fluctuating. All training and testing data is 16 kHz audio. For the MoChA model, we use a chunk size of 4 and an LSTM unit size of 1000. For better convergence, we train the model with a joint CTC and Cross-Entropy (CE) loss [suyounctc] defined as
$$\mathcal{L}_{\text{total}} = (1 - \lambda)\,\mathcal{L}_{\text{CE}} + \lambda\,\mathcal{L}_{\text{CTC}},$$
where $\mathcal{L}_{\text{total}}$, $\mathcal{L}_{\text{CE}}$, and $\mathcal{L}_{\text{CTC}}$ are the total, CE, and CTC losses, respectively, and $\lambda$ is a mixing weight.
For the RNN-T model, we use a prediction network with a single layer of 1024 units. We use the same training methodology and encoder parameters for a fair comparison. The RNN-T encoder is trained with a CTC loss. Combined with the prediction network, the RNN-T loss is applied to the softmax output of the joint network; as the RNN-T loss, the negative natural log of the output label probability is used.
We use a linear learning-rate warm-up strategy while pre-training all encoder layers for the MoChA and RNN-T models. We found that model convergence is unstable as the reduction factor decreases during the pre-training stage. We trained the first layer for 0.5 epochs, then added one LSTM layer to the encoder every 0.25 epochs. The learning rate was reset and then linearly increased each time a new layer was added, as shown in Fig. 3. After all LSTM layers were added, we continued the learning-rate warm-up from 1.75 to 2.5 epochs.
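This schedule can be sketched as a function of the epoch count. The peak learning rate and the assumption that each stage ramps linearly from zero to the same peak are illustrative placeholders, since the text does not state them.

```python
def lr_at(epoch, peak_lr=1e-3, n_layers=6,
          first_stage=0.5, stage_len=0.25, final_warmup_end=2.5):
    """Learning rate under the layer-wise pretraining schedule sketched in
    the text: the first encoder layer trains for `first_stage` epochs, one
    layer is added every `stage_len` epochs (the LR is reset and ramped
    linearly within each stage), and a final linear warm-up runs from the
    end of layer addition (1.75 epochs for 6 layers) to `final_warmup_end`.
    `peak_lr` is an assumed placeholder, not a value from the paper.
    """
    add_end = first_stage + (n_layers - 1) * stage_len  # 1.75 for defaults
    if epoch < first_stage:
        return peak_lr * epoch / first_stage            # initial ramp
    if epoch < add_end:
        # position within the current layer-addition stage; resets to 0
        # at each stage boundary, i.e. whenever a new layer is added
        frac = ((epoch - first_stage) % stage_len) / stage_len
        return peak_lr * frac
    if epoch < final_warmup_end:
        return peak_lr * (epoch - add_end) / (final_warmup_end - add_end)
    return peak_lr
```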
3.2 Data Augmentation
To perform data augmentation using the AS and VTLP, we employ the example-server architecture described in [c_kim_asru_2019_01]. We simulate input audio on-the-fly with 4 CPUs processing input for each GPU.
In our experiments using the acoustic simulator, SNR values are sampled between 0 dB and 30 dB from a distribution similar to that described in [C_Kim_INTERSPEECH_2017_1]. Similarly, the reverberation time ($T_{60}$) is sampled from 0.0 s to 1.0 s from a distribution described in [C_Kim_INTERSPEECH_2017_1]. Babble, music, and TV noises are used as noise sources. Each utterance is corrupted by one to three noise sources randomly located inside a simulated room. The selection probability of a noise file depends on the type of noise. The room dimensions and the microphone, noise, and sound source locations are randomly picked. To see the effect of data augmentation on model convergence, we apply the AS to different percentages of the input data; detailed results are shown in Sec. 4. During the experiments, a distinct effect of acoustic simulation on some models is observed. For the following discussion, we define the AS ratio as the percentage of utterances processed by the acoustic simulator.
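The per-utterance parameter sampling can be sketched as follows. Uniform distributions are a simplifying assumption, since the cited work uses its own empirically chosen distributions over the same ranges, and the room-dimension ranges are purely illustrative.

```python
import random

def sample_room_config(rng=random):
    """Sample one acoustic-simulator configuration per utterance.

    Uniform sampling and the room-dimension ranges are illustrative
    assumptions; the SNR, T60, and noise-source ranges come from the text.
    """
    return {
        "snr_db": rng.uniform(0.0, 30.0),      # SNR sampled in 0-30 dB
        "t60_s": rng.uniform(0.0, 1.0),        # reverberation time T60
        "n_noise_sources": rng.randint(1, 3),  # 1-3 sources per utterance
        "noise_type": rng.choice(["babble", "music", "tv"]),
        # assumed width/depth/height ranges in meters, for illustration only
        "room_dim_m": [rng.uniform(3.0, 10.0),
                       rng.uniform(3.0, 10.0),
                       rng.uniform(2.5, 4.0)],
    }
```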
These ‘clean’ audio streams still have other augmentation methods such as VTLP applied to them, but they do not contain artificial noise or reverberation. For the VTLP configuration, we randomly choose a warping factor between 0.8 and 1.2 and use an oversampled FFT factor of 16, a parameter setup identical to [Kim2019].
For the SpecAugment experiment, we randomly mask time and frequency regions of the input features. Along the time axis, we randomly mask a span of 1 to 20 time steps. Along the frequency axis, we randomly mask up to two sections, each with a maximum width of 8, out of the 40 feature dimensions.
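This masking configuration can be sketched as follows; zero-filling of the masked regions is an assumption (mean-filling is also common), and the random-number interface is illustrative.

```python
import numpy as np

def spec_augment(feat, rng, max_t=20, n_freq_masks=2, max_f=8):
    """Apply the time/frequency masking configuration described in the text:
    one time mask of 1-20 frames, and up to two frequency masks of width up
    to 8 on 40-dimensional features. Masked regions are zeroed (assumed).

    feat: (T, F) feature matrix; a masked copy is returned.
    """
    feat = feat.copy()
    T, F = feat.shape
    # time mask: span of 1..max_t frames at a random position
    t = rng.integers(1, max_t + 1)
    t0 = rng.integers(0, max(1, T - t + 1))
    feat[t0:t0 + t, :] = 0.0
    # frequency masks: up to n_freq_masks bands of width 0..max_f
    for _ in range(n_freq_masks):
        f = rng.integers(0, max_f + 1)  # width 0 means no mask this round
        f0 = rng.integers(0, max(1, F - f + 1))
        feat[:, f0:f0 + f] = 0.0
    return feat
```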
3.3 Test sets
3.3.1 LibriSpeech Corpus
LibriSpeech [v_panayotov_icassp_2015_00] is a large corpus of 16 kHz English speech. Each of the models presented here is trained on the full 960 hours of training data available in the LibriSpeech corpus. Both the test-clean and test-other sets are used for performance evaluation.
3.3.2 Test set - LibriSpeech clean with noise
To evaluate performance in noisy environments, we use the noisy LibriSpeech test set and the VOiCES test set. We synthetically add babble, music, and TV noises to the clean LibriSpeech test set through the AS, using the same distribution of AS parameters as during training.
3.3.3 Test set - VOiCES
The VOiCES evaluation set, introduced at the Interspeech 2019 VOiCES ASR Challenge, consists of 4600 utterances derived from the LibriSpeech dataset. This dataset targets acoustically challenging environments, such as noisy backgrounds, reverberation, and secondary speakers.
4 Experimental results
We compare MoChA and RNN-T after applying various augmentation methods (an Acoustic Simulator, VTLP followed by an Acoustic Simulator, and SpecAugment) in terms of speech recognition accuracy, the number of parameters, model size, latency, and inference time. For an extended comparison, we also include Bidirectional LSTM with Full Attention (BFA) and Unidirectional LSTM with Full Attention (UFA) models as a contrast with offline ASR models. In obtaining these results, no Language Model (LM) is employed. For all experimental results, we use the same beam size of 12.
[Tables: Acoustic Simulator clean performance; Acoustic Simulator noise performance; parameter and model size]
4.1 Effects of the learning rate warm-up
We compare performance with and without learning-rate warm-up. The learning-rate warm-up strategy yields a 0.23% to 1.8% absolute WER difference on our test sets, with the gap being larger on the noisy test sets. The results are shown in Table 2.
4.2 Effects of the AS ratio for data augmentation using an acoustic simulator
We define the noise-added LibriSpeech and VOiCES test sets as the noisy test sets. We observe a trade-off between clean and noisy performance with the streaming models: as the percentage of AS-processed data increases, the Word Error Rate (WER) on the clean test sets increases while the WER on the noisy test sets decreases. Unlike the streaming models, the BFA model trained with the AS improves on both the clean and noisy test sets, as shown in Table 4.
Table 3 shows that MoChA is stronger on clean speech, whereas RNN-T is stronger on noisy speech. Due to the attention mechanism of MoChA, if the probability of monotonic attention is less than 0.5, the encoder embedding is not attended, and the model did not output a label in some test cases. We compared the CTC performance of both models to check encoder performance and found that MoChA performs better on relatively clean sets at the BPE label unit.
4.3 Comparison of different augmentation methods
We compare streaming MoChA and RNN-T models with different augmentation methods. The experimental results are summarized in Table 4. The MoChA-based model generally shows better performance than RNN-T when combined with different augmentation methods, especially on the test-clean and test-other sets. Comparing the streaming models to the non-streaming BFA and UFA models, the BFA model shows the best performance on all test sets, as expected. Notably, the BFA model performs better when VTLP is added before the AS. We also conclude that applying the AS is critical for performance on the noisy test sets, although it degrades the clean-test performance of the streaming models. The SpecAugment-trained model underperforms the AS-trained model on the noisy test sets.
4.4 Latency and Inference Time
Latency and inference time are measured on an Nvidia Tesla P100 GPU with Python 3.5.4, using Python model-inference code. We use 100 utterances sampled from the LibriSpeech test-clean set and calculate the average latency and inference time per sentence. The sampled utterances consist of 2346 words and 10427 characters. Because the two models share an identical encoder architecture, we exclude encoder computation time from the latency and inference-time comparison. As shown in Table 5, with a simpler decoding process than the attention-based model [li2019improving], RNN-T requires less time in both latency and inference time. As the beam size increases, the latency of the MoChA model increases drastically due to its attention mechanism.
4.5 Parameter & Model size
We compare the parameter counts and model sizes of the different architectures, calculated without model compression. As expected, the BFA model's capacity is larger due to its bidirectional encoding. We find that the parameter count and model size of MoChA are larger than those of the UFA and RNN-T models, mainly because of the additional computations required to obtain the attention weights, such as the monotonic attention. RNN-T has the smallest parameter count and model size, as well as the lowest latency and inference time.
5 Conclusions
We compared two streaming model architectures, MoChA and RNN-T, in terms of inference time, latency, and speech recognition accuracy when combined with different data augmentation techniques. We observe that, compared to non-streaming models, streaming models are more sensitive to the proportion of noisy examples in the training set; the final performance of the model depends on the proportion of noisy examples generated by data augmentation. MoChA models perform slightly better than RNN-T models in terms of speech recognition accuracy, especially on clean test sets, while RNN-T models perform generally better in terms of latency, inference time, and the stability of parameter convergence during training. These advantages make RNN-T models a better choice than MoChA models for streaming on-device speech recognition.