In recent years, we have witnessed significant progress in automatic speech recognition (ASR), mainly due to the use of deep learning algorithms [17, 28]. Deep-model-based ASR systems have mainly focused on the hybrid framework and consist of many components, including an acoustic model (AM), a pronunciation model, and a language model (LM). These components are trained separately with different objective functions, and extra expert linguistic knowledge may be needed. Recently, an emerging trend in the ASR community is to rectify this disjoint training issue by replacing hybrid systems with end-to-end (E2E) systems [26, 21, 5, 23, 7, 3, 27, 15, 25, 19]. The three major E2E approaches are built on Connectionist Temporal Classification (CTC) [11, 12, 10], the Attention-based Encoder-Decoder (AED) [8, 1, 2, 9, 5], and the recurrent neural network transducer (RNN-T) [13, 24, 20]. Unlike conventional hybrid models, E2E models do not require token alignment information between the input acoustic frames and the output token sequence during training.
CTC maps the input speech frames to the target label sequence by marginalizing over all possible alignments. A dynamic-programming-based forward-backward algorithm is usually used to train the model. An advantage of the CTC approach is that it performs frame-level decoding, as the conventional hybrid model does, and hence can be applied to online speech recognition. However, one disadvantage is its conditional independence assumption given the input acoustic frames. AED, on the other hand, makes no such assumption and is presumably more powerful than CTC for the speech recognition task. However, one drawback of the AED model is that the entire input sequence is required to start the decoding process due to the global attention mechanism, which makes real-time streaming ASR challenging, despite some recent attempts in this direction [6, 22].
RNN-T is an extension to CTC, which consists of three components: an encoder, a prediction network, and a joint network that integrates the outputs of the encoder and prediction networks to predict the target labels. RNN-T overcomes the conditional independence assumption of CTC with its prediction network; moreover, it allows streaming ASR because it still performs frame-level monotonic decoding. Hence, there has been a significant research effort to promote this approach in the ASR community [20, 24, 18, 4, 14], and RNN-T has recently been successfully deployed on embedded devices.
However, compared to CTC or AED, RNN-T is much more difficult to train due to its model structure and the synchronous decoding constraint. Moreover, its training is very memory-demanding because of the 3-dimensional output tensor [24, 20]; an approach that reduces this memory cost and enables large mini-batch training has been proposed. To tackle the training difficulty, initializing the encoder and prediction networks of an RNN-T with a CTC model and an RNN language model (RNNLM), respectively, has proven beneficial [24, 14]. In this paper, we explore other model initialization approaches to overcome the training difficulty of RNN-T models. Specifically, we propose to utilize external token alignment information to pre-train RNN-T. Two types of pre-training methods are investigated, referred to as encoder pre-training and whole-network pre-training, respectively. Encoder pre-training initializes only the encoder of the RNN-T, while the other components are trained from random initialization. Whole-network pre-training, as its name suggests, pre-trains the whole network with an auxiliary objective function instead of the RNN-T loss. The proposed methods are evaluated on 3,400 hours of voice assistant data and 65,000 hours of production data. The experimental results show that the accuracy of the RNN-T model can be significantly improved with our proposed pre-training methods, with up to 28% relative word error rate (WER) reduction.
The rest of this paper is organized as follows: Section 2 briefly introduces the basic RNN-T model, including model training and decoding. The two proposed pre-training methods are described in Section 3 and Section 4, respectively. Next, Section 5 presents the experimental results and analysis. Section 6 gives the conclusions.
2 RNN Transducer Model
The RNN-T model was proposed as an extension to the CTC model. A typical RNN-T model has three components, as shown in Figure 1: an encoder, a prediction network, and a joint network. Compared with CTC, RNN-T does not have the conditional independence assumption, because the prediction network emits output tokens conditioned on the previous prediction results.
To be more precise, the encoder in an RNN-T model is an RNN that maps each acoustic frame x_t to a high-level feature representation h_t^enc, where t is the time index:

h_t^enc = f^enc(x_t)
The prediction network, which is also based on RNNs, converts the previous non-blank output token y_{u-1} to a high-level representation h_u^pre, where u is the label index of each output token:

h_u^pre = f^pre(y_{u-1})
Given the hidden representations of both acoustic features and labels from the encoder and prediction network, the joint network integrates the information using a feed-forward network:

z_{t,u} = f^joint(h_t^enc, h_u^pre)
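As an illustration, the joint-network combination above can be sketched as follows; the element-wise addition used to combine the two representations, the toy dimensions, and the weight matrix are illustrative assumptions, not the exact configuration used in this paper.

```python
import math

def joint(h_enc, h_pred, W_out):
    """Hypothetical joint network: combine the encoder and prediction
    representations (here by element-wise addition, one common choice),
    project to the output vocabulary, and apply a softmax."""
    combined = [e + p for e, p in zip(h_enc, h_pred)]
    logits = [sum(w * c for w, c in zip(row, combined)) for row in W_out]
    m = max(logits)                       # stabilize the softmax
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Toy example: 2-dim representations, 3 output tokens (incl. blank).
probs = joint([0.5, -0.2], [0.1, 0.3], [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

The output is a probability distribution over the target tokens plus blank for one (t, u) grid point of the 3-dimensional output tensor.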
The decoding of RNN-T operates frame by frame. Starting from the first frame fed to the encoder, if the current output is not blank, then the prediction network is updated with that output token; otherwise, if the output is blank, the encoder is updated with the next frame. Decoding terminates when the last frame of the input sequence is consumed. In this way, real-time streaming is satisfied. Either greedy search or beam search can be used in the decoding stage; they store different numbers of intermediate states.
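The frame-by-frame decoding loop described above can be sketched as follows, using toy stand-ins for the three networks; the stub functions (`step_encoder`, `step_predictor`, `joint`) are hypothetical and only mimic the control flow of greedy search.

```python
def rnnt_greedy_decode(frames, step_encoder, step_predictor, joint, blank=0):
    """Sketch of frame-synchronous RNN-T greedy decoding: advance the
    encoder on blank, advance the prediction network on non-blank output."""
    hyp = []
    pred_state = step_predictor(None)      # initial prediction-network state
    for x in frames:
        h_enc = step_encoder(x)
        while True:
            token = joint(h_enc, pred_state)   # argmax token for this (t, u)
            if token == blank:
                break                          # consume the next frame
            hyp.append(token)
            pred_state = step_predictor(token) # feed back the non-blank token
    return hyp

# Toy stand-ins (assumptions): the "encoder" echoes the frame, the
# "predictor" returns the last token, and the "joint" emits the frame's
# label once per frame and blank otherwise.
frames = [(1,), (0,), (2,)]
def step_encoder(x): return x
def step_predictor(tok): return tok
def joint(h_enc, pred_state):
    return h_enc[0] if h_enc[0] != 0 and pred_state != h_enc[0] else 0

print(rnnt_greedy_decode(frames, step_encoder, step_predictor, joint))  # → [1, 2]
```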
3 Encoder Pre-training
In an RNN-T model, the encoder and prediction network usually have different model structures, which makes it difficult to train them well at the same time. Directly training an RNN-T from random initialization may yield a model biased toward one of the components, i.e., dominated by either the acoustic or the language input. Most groups adopt an initialization strategy that initializes the encoder with a CTC model and the prediction network with an RNNLM [14, 24, 30]. However, the output sequence of CTC is a series of spikes separated by blanks. Thus, after CTC-based pre-training, most encoder outputs tend to generate blank, which results in incorrect inference for the RNN-T model.
In our work, we propose to utilize external alignments to pre-train the encoder with the cross-entropy (CE) criterion. The encoder is regarded as a token classification model rather than a CTC model. As shown in the right part of Figure 2, an RNN-based token classification model is first trained with the CE loss. In this paper, we use "CE loss" to denote the cross-entropy loss function, "CTC loss" to denote the CTC forward-backward-algorithm-based loss function, and "RNN-T loss" to denote the RNN-T loss function.
In our experiments, we use word piece units as target tokens; these have been explored in the context of machine translation and successfully applied to E2E ASR [24, 10]. With word-level alignments, we can obtain the boundary frame index of each word. For a word divided into more than one word piece, we equally allocate the total frames inside the word boundary to its word pieces. There is a marginal case in which a word contains more word pieces than frames, which prevents us from generating token alignments. Such cases account for less than 0.01% of all training utterances, so we simply remove those utterances in the pre-training stage. In this way, we obtain hard alignments of target tokens for all frames.
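The equal-allocation step described above might look like the following sketch; the `##`-prefixed word-piece naming and the choice to spread any remainder frames over the first pieces are our assumptions.

```python
def allocate_frames(word_start, word_end, word_pieces):
    """Equally allocate the frames inside a word boundary to its word
    pieces; returns (piece, start_frame, end_frame) spans. Returns None
    when the word has more pieces than frames (the rare case that is
    dropped from pre-training)."""
    n_frames = word_end - word_start
    n_pieces = len(word_pieces)
    if n_pieces > n_frames:
        return None
    spans, start = [], word_start
    for i, piece in enumerate(word_pieces):
        # base share plus one extra frame for the first (n_frames % n_pieces) pieces
        length = n_frames // n_pieces + (1 if i < n_frames % n_pieces else 0)
        spans.append((piece, start, start + length))
        start += length
    return spans

# "playing" → ["play", "##ing"] over frames [10, 17): 7 frames → 4 + 3
print(allocate_frames(10, 17, ["play", "##ing"]))
```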
Based on the encoder structure, one extra fully connected (FC) layer is added on top of the encoder, whose output is used for token classification. The objective is

L_CE = − Σ_t log P(a_t | x_t), with P(· | x_t) = Softmax(FC(h_t^enc)),

where FC(·) represents the fully connected layer, t is the frame index, and the output dimension of FC(·) is the target dimension, which is also the dimension of the label. Here a_t is the word piece label for each input frame x_t. After encoder pre-training, each output h_t^enc, the high-level representation of the input acoustic features, is expected to contain the alignment information.
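A minimal numeric sketch of the frame-level CE objective, assuming per-frame posteriors have already been produced by the FC layer and a softmax (the toy posteriors and the per-frame averaging are illustrative assumptions):

```python
import math

def frame_ce_loss(frame_probs, alignment):
    """Frame-level cross-entropy used for encoder pre-training: average
    negative log-probability of each frame's aligned token."""
    return -sum(math.log(p[a]) for p, a in zip(frame_probs, alignment)) / len(alignment)

# Toy posteriors over 3 tokens for 2 frames, aligned to tokens 0 and 2.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.1, 0.8]]
loss = frame_ce_loss(probs, [0, 2])
```

Minimizing this loss pushes each frame's posterior toward its hard alignment label, which is the sense in which the encoder becomes a token classification model.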
4 Whole-network Pre-training
In encoder pre-training, the encoder is regarded as performing token mapping (CTC-loss pre-training) or token aligning (CE-loss pre-training). However, these pre-training methods only consider part of the RNN-T model. In this paper, we also explore a whole-network pre-training method that uses external token alignment information. Unlike other models, the output of the RNN-T is three-dimensional. Thus, the key challenge for whole-network pre-training is the design of the label tensor: the optimizer reduces the CE loss between the output of the model and a crafted three-dimensional label tensor.
Three different types of label tensor are explored in this study; examples are given in Figure 3. In each label tensor, the horizontal axis represents the time dimension and the vertical axis the output token dimension. We represent blank as an additional class, shown as a one-hot vector in the label tensor. In the first label tensor, we set all the output target grids of each frame to the one-hot vector corresponding to its alignment label. The last row of the label tensor is set to all blanks, which indicates the end of the utterance. Thus, after pre-training, the encoder output is supposed to contain the alignment information. However, this design only considers the frame-by-frame alignment and ignores the output-token-level information. If we directly perform RNN-T decoding on it, we cannot obtain the correct inference sequence.
Thus, taking the decoding process into consideration, we design a second label tensor. Each frame is assigned its token alignment, and each target token's position is determined by its sequence order. When performing pre-training, we only compute the CE over the non-empty part of the label tensor. A blank token is inserted under each target token to ensure correct decoding results. If we directly perform the RNN-T decoding algorithm on this label tensor, the correct results should be obtained; the decoding path is illustrated by the red arrow on the label tensor in Figure 3. In the given example, directly performing decoding on it yields, after removing the blank tokens, the final result 'A B s C', which is the same as the alignment of this utterance.
However, in this design, almost half of the valid part is blank, so blank tokens dominate the pre-training process. Therefore, we design a third label tensor that keeps only the non-blank part of the second one: it retains only one grid, with its corresponding alignment label, for each frame. In order to provide blank information during the pre-training stage, we set the short pauses (space tokens shorter than 3 frames) of each utterance to blank; that is, some grids in the valid part of the label tensor become blank. After pre-training is done, we replace the CE loss with the RNN-T loss and proceed to standard RNN-T training.
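The alignment-path label tensor can be sketched as follows: each frame keeps a single labeled grid, and the row index advances when a new output token starts. Treating consecutive identical labels as one token and the exact row-indexing convention are simplifying assumptions on our part.

```python
def build_align_path(frame_labels, blank="<b>"):
    """For each frame, keep one target grid: the frame's aligned label,
    placed at the row of the current output token; blank frames stay at
    the current row. Returns a list of (frame, row, label) triples."""
    path, row, prev = [], 0, None
    for t, lab in enumerate(frame_labels):
        if lab != blank and lab != prev:
            row += 1              # a new output token starts a new row
        path.append((t, row, lab))
        prev = lab
    return path

# Frames aligned to "A A <pause> B"; the pause keeps the blank label.
print(build_align_path(["A", "A", "<b>", "B"]))
```

During pre-training, the CE would then be computed only on these (frame, row) grids rather than over the whole three-dimensional tensor.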
5 Experiments and Analysis
5.1 Experimental setup
The proposed methods are evaluated on 3,400 hours of Cortana voice assistant data and 65,000 hours of Microsoft production data. For the Cortana data, the training and test sets consist of approximately 3,400 hours and 6 hours of English audio, respectively. The 65,000 hours of production data are transcribed data from a wide range of Microsoft products. The test sets cover 13 application scenarios, such as Cortana and far-field speakers, totaling 1.8 million (M) words. Training and test material is anonymized, with personally identifiable information removed. In this work, we first evaluate the methods on the Cortana data and then evaluate the best method on the very large scale 65,000-hour production data.
The input feature is an 80-dimension log Mel filter bank vector for every 10 milliseconds (ms) of speech. Eight vectors are stacked together to form an input frame to the encoder, and the frame shift is 30 ms. All RNN-T models adopt the configuration recommended in [20, 16]. All encoders (Enc.) have 6 hidden LSTM layers, and all prediction networks (Pred.) have 2 hidden LSTM layers. The joint network has two linear layers without any activation functions. Layer normalization is used in all LSTM layers, and the hidden dimension is 1280 with the projection dimension equal to 640. The output layer models 4,000 word piece units together with the blank token. The word piece units are generated by running byte pair encoding on the acoustic training texts.
5.2 Evaluation on Cortana data
Experimental results of whole-network pre-training are shown in Table 1. The RNN-T baseline is trained from random initialization. For the pre-trained models, the whole network is first pre-trained with the CE loss and then trained with the RNN-T loss. Using the pre-trained network as a seed model, the final word error rate (WER) can be significantly reduced: all designed label tensors improve RNN-T training, achieving 10% to 12% relative WER reduction.
Model                                  WER (%)
+ Pre-train (all align)                13.53
+ Pre-train (correct decoding)         13.66
+ Pre-train (align path - sp blank)    13.23
In Table 2, we evaluate the encoder pre-training methods on the 3,400-hour Cortana data. Using a pre-trained CTC model to initialize the encoder does not improve accuracy. This is because the output of CTC is a sequence of spikes containing many blank tokens without any meaning; hence, if we use the pre-trained CTC model as the seed for the encoder of the RNN-T, most encoder outputs will generate blank, which does not help RNN-T training. When we use a CE-loss pre-trained encoder to initialize the encoder of the RNN-T, we achieve a significant improvement over training from random initialization: a 28% relative WER reduction compared with both the RNN-T baseline and CTC-based encoder pre-training.
Model            Enc. Pre-train       WER (%)
In all the encoder pre-training experiments in Table 2, the prediction network and joint network are trained from random initialization; the only difference is the parameter seed of the encoder. Comparing the CTC-loss and CE-loss based encoder pre-training methods, there is a large WER gap between the two approaches: initializing the encoder as a token-aligning model rather than a sequence-mapping model yields much better accuracy. This is because the RNN-T encoder performs frame-to-token aligning, extracting the high-level features of each input frame.
5.3 Evaluation on very large scale data
From our experiments, both encoder pre-training and whole-network pre-training can improve the performance of the RNN-T model. To obtain more convincing results, we evaluate our proposed methods on very large scale data, using the 65,000-hour Microsoft production data set. The results are shown in Table 3. Due to the very large resource requirements and computation cost, we only evaluate the CE-based encoder pre-training method, which obtained the best accuracy in the Cortana experiments. All results are obtained using beam search with a beam width of 5.
Besides our proposed methods, we evaluate the widely used CTC+RNNLM pre-training strategy [24, 30, 14] for comparison: a well-trained CTC model initializes the encoder, and a well-trained RNNLM initializes the prediction network. This CTC+RNNLM initialization reduces the average WER from 12.63 to 12.29 over the 13 test scenarios with 1.8 M words. In contrast, our proposed approach, which pre-trains the encoder with alignments using the CE loss, significantly outperforms the other methods, achieving an average WER of 11.34. Compared with training from random initialization, our proposed method obtains a 10% relative WER reduction on this very large scale task.
Model                               WER (%)
+ Pre-train (Enc. CTC, Pred. LM)    12.29
+ Pre-train (Enc. CE)               11.34
5.4 Output time delay comparison
Although RNN-T is a natural streaming model, it still has latency compared to hybrid models. With the help of alignments for model initialization, we hope to reduce the latency of RNN-T. To better understand the advantages of our proposed pre-training methods, we compare the gap between the ground-truth word alignment and the word alignment generated by greedy decoding from different RNN-T models. The visualization is performed on the test set of the Cortana data. As shown in Figure 4, the central axis represents the ground-truth word alignment. The output alignment distributions are normalized to a normal distribution. The horizontal axis represents the number of frames away from the ground-truth word alignment, and the vertical axis represents the ratio of words.
As Figure 4 shows, different RNN-T models exhibit different time delays relative to the ground truth. This is because the RNN-T model tends to look at several future frames, which provide more information for token recognition. The baseline RNN-T model has an average delay of around 10 frames. In contrast, with the proposed pre-training methods, the average delay is significantly reduced: using a CE pre-trained encoder to initialize the RNN-T reduces the average delay to 6 frames, and whole-network pre-training reduces it to 5 frames. The reason is that pre-training provides alignment information to the RNN-T model, which guides the model to make decisions earlier. This shows the advantage of our proposed pre-training methods in terms of time delay during the decoding stage.
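The average emission delay discussed above can be measured with a simple helper, assuming per-word start frames are available for both the ground-truth and the decoded alignments (the helper and its inputs are illustrative, not the exact measurement script used here):

```python
def average_delay(ref_starts, hyp_starts):
    """Average per-word emission delay (in frames) of the hypothesis
    word alignment relative to the ground-truth word alignment."""
    deltas = [h - r for r, h in zip(ref_starts, hyp_starts)]
    return sum(deltas) / len(deltas)

# Ground-truth word start frames vs. frames where a model emitted each word.
print(average_delay([3, 10, 22], [13, 20, 32]))  # → 10.0
```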
6 Conclusions
In this work, we explore the training strategy of the RNN-T model and propose two pre-training approaches that use external alignment information: encoder pre-training and whole-network pre-training. Encoder pre-training uses the CE loss to pre-train only the encoder of the RNN-T, while whole-network pre-training pre-trains the whole RNN-T model with the CE loss, using three kinds of designed label tensors. The proposed methods are evaluated on 3,400 hours of Cortana data and 65,000 hours of production data. Compared with training from random initialization, whole-network pre-training obtains a 12% relative WER reduction, and encoder pre-training obtains a 28% and a 10% relative WER reduction on the 3,400-hour Cortana and 65,000-hour production data, respectively. Compared to the widely used CTC+RNNLM initialization strategy on very large scale data, encoder pre-training still outperforms it by an 8% relative WER reduction. Our proposed methods also significantly reduce the time delay of the RNN-T model.
-  (2015) Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR.
-  (2016) End-to-end attention-based large vocabulary speech recognition. In Proc. ICASSP, pp. 4945–4949.
-  (2017) Exploring Neural Transducers for End-to-End Speech Recognition. In Proc. ASRU.
-  (2017) Exploring neural transducers for end-to-end speech recognition. In Proc. ASRU, pp. 206–213.
-  (2016) Listen, Attend and Spell: a neural network for large vocabulary conversational speech recognition. In Proc. ICASSP, pp. 4960–4964.
-  (2017) Monotonic chunkwise attention. arXiv preprint arXiv:1712.05382.
-  (2018) State-of-the-art speech recognition with sequence-to-sequence models. In Proc. ICASSP.
-  (2014) Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP.
-  (2015) Attention-Based Models for Speech Recognition. In NIPS.
-  (2019) Advancing acoustic-to-word CTC model with attention and mixed-units. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (12), pp. 1880–1892.
-  (2006) Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In ICML, pp. 369–376.
-  (2014) Towards End-to-End Speech Recognition with Recurrent Neural Networks. In ICML (PMLR), pp. 1764–1772.
-  (2012) Sequence Transduction with Recurrent Neural Networks. CoRR abs/1211.3711.
-  (2013) Speech recognition with deep recurrent neural networks. In Proc. ICASSP, pp. 6645–6649.
-  (2018) Towards Discriminatively-trained HMM-based End-to-end models for Automatic Speech Recognition. In Proc. ICASSP.
-  (2019) Streaming end-to-end speech recognition for mobile devices. In Proc. ICASSP, pp. 6381–6385.
-  (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine 29 (6), pp. 82–97.
-  (2019) Large-scale multilingual speech recognition with a streaming end-to-end model. arXiv preprint arXiv:1909.05330.
-  (2018) Advancing acoustic-to-word CTC model. In Proc. ICASSP, pp. 5794–5798.
-  (2019) Improving RNN Transducer Modeling for End-to-End Speech Recognition. In Proc. ASRU.
-  (2015) EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding. In Proc. ASRU, pp. 167–174.
-  (2019) Triggered attention for end-to-end speech recognition. In Proc. ICASSP, pp. 5666–5670.
-  (2017) A Comparison of Sequence-to-Sequence Models for Speech Recognition. In Proc. Interspeech, pp. 939–943.
-  (2017) Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer. In Proc. ASRU, pp. 193–199.
-  (2018) Improving the performance of online neural transducer models. In Proc. ICASSP, pp. 5864–5868.
-  (2015) Learning Acoustic Frame Labeling for Speech Recognition with Recurrent Neural Networks. In Proc. ICASSP, pp. 4280–4284.
-  (2017) Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping. In Proc. Interspeech.
-  (2011) Conversational speech transcription using context-dependent deep neural networks. In Proc. INTERSPEECH.
-  (2015) Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
-  (2018) Exploring RNN-transducer for Chinese speech recognition. CoRR abs/1811.05097.
-  (2016) Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.