Improving RNN Transducer Modeling for End-to-End Speech Recognition

09/26/2019
by   Jinyu Li, et al.

In the last few years, an emerging trend in automatic speech recognition research is the study of end-to-end (E2E) systems. Connectionist Temporal Classification (CTC), Attention Encoder-Decoder (AED), and RNN Transducer (RNN-T) are the three most popular methods. Among them, RNN-T has the advantage of supporting online streaming, which is challenging for AED, and it does not have CTC's frame-independence assumption. In this paper, we improve RNN-T training in two aspects. First, we optimize the training algorithm of RNN-T to reduce memory consumption so that we can use a larger training minibatch for faster training. Second, we propose better model structures so that we obtain RNN-T models with very good accuracy but a small footprint. Trained with 30 thousand hours of anonymized and transcribed Microsoft production data, the best RNN-T model, with an even smaller model size (216 Megabytes), achieves up to 11.8% relative word error rate (WER) reduction from the baseline RNN-T model. This best RNN-T model is significantly better than the device hybrid model of similar size, achieving up to 15.0% relative WER reduction, and obtains WERs similar to those of the server hybrid model of 5120 Megabytes in size.

1 Introduction

Recent advances in automatic speech recognition (ASR) have been mostly due to the advent of using deep learning algorithms to build hybrid ASR systems with deep acoustic models like Deep Neural Networks (DNNs), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs). However, one problem with these hybrid systems is that there are several intermediate models (acoustic model, language model, lexicon model, decision tree, etc.) which either need expert linguistic knowledge or need to be trained separately. In the last few years, an emerging trend in ASR is the study of end-to-end (E2E) systems [38, 28, 6, 32, 5, 39, 17, 8, 35, 27]. The E2E ASR system directly transduces an input sequence of acoustic features to an output sequence of tokens (phonemes, characters, words etc). This reconciles well with the notion that ASR is inherently a sequence-to-sequence task mapping input waveforms to output token sequences. Some widely used contemporary E2E approaches for sequence-to-sequence transduction are: (a) Connectionist Temporal Classification (CTC) [14, 15], (b) Attention Encoder-Decoder (AED) [9, 3, 4, 10, 6], and (c) RNN Transducer (RNN-T) [16]. These approaches have been successfully applied to large-scale ASR systems [38, 28, 6, 32, 5, 8, 41, 34, 18].

Among these three approaches, AED (or LAS: Listen, Attend and Spell [6, 8]) is the most popular one. It contains three components: an encoder, which is similar to a conventional acoustic model; an attender, which works as an alignment model; and a decoder, which is analogous to a conventional language model. However, it is very challenging for AED to do online streaming, which is an important requirement for ASR services. Although there are several studies in that direction, such as monotonic chunkwise attention [7] and triggered attention [30], it is still a big challenge. CTC, in contrast, enjoys simplicity by using only an encoder to map the speech signal to target labels. However, its frame-independence assumption is its most criticized aspect. There have been several attempts to improve CTC modeling by relaxing or removing this assumption. In [12, 13], attention modeling was directly integrated into the CTC framework to relax the frame-independence assumption by working on model hidden layers without changing the CTC objective function and training process, hence retaining the simplicity of CTC training.

A more elegant solution is RNN-T [16], which extends CTC modeling by incorporating the acoustic model with its encoder, the language model with its prediction network, and the decoding process with its joint network. There is no longer a frame-independence assumption, and online streaming comes naturally. As a result, RNN-T instead of AED was successfully deployed to Google’s devices [18] with great impact. In spite of its recent success in industry, there is less research on RNN-T than on the popular AED or CTC, possibly due to the complexity of RNN-T training [2]. For example, the encoder and prediction network compose a grid of alignments, and the posteriors need to be calculated at each point in the grid to perform the forward-backward training of RNN-T. This three-dimension tensor requires much more memory than what is required in AED or CTC training. Given that there are many network structure studies in hybrid systems (e.g., [37, 22, 23, 36, 19, 43, 20, 42, 33]), it is also desirable to explore advanced network structures so that we can put an RNN-T model into devices with both good accuracy and a small footprint.

In this paper, we use the methods presented in the latest work [18] as the baseline and improve the RNN-T modeling in two aspects. First, we optimize the training algorithm of RNN-T to reduce the memory consumption so that we can have a larger training minibatch for faster training speed. Second, we propose better model structures than the layer-normalized long short-term memory (LSTM) used in [18] so that we obtain RNN-T models with better accuracy but smaller footprint.

This paper is organized as follows. In Section 2, we briefly describe the basic RNN-T method. Then, in Section 3, we propose how to improve the RNN-T training by reducing the memory consumption which constrains the training minibatch size because of the large-size tensors in RNN-T. In Section 4, we propose several model structures which improve the baseline LSTM model used in [18] for better accuracy and smaller model size. We evaluate the proposed models in Section 5 by training them with 30 thousand hours anonymized and transcribed production data, and evaluating with several ASR tasks. Finally, we conclude the study in Section 6.

2 RNN-T

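Figure 1 shows the diagram of the RNN-T model, which consists of encoder, prediction, and joint networks. The encoder network is analogous to the acoustic model, which converts the acoustic feature $x_t$ into a high-level representation $h_t^{enc}$, where $t$ is the time index.

$$h_t^{enc} = f^{enc}(x_t) \tag{1}$$

The prediction network works like an RNN language model, which produces a high-level representation $h_u^{pre}$ by conditioning on the previous non-blank target $y_{u-1}$ predicted by the RNN-T model, where $u$ is the output label index.

$$h_u^{pre} = f^{pre}(y_{u-1}) \tag{2}$$

The joint network is a feed-forward network that combines the encoder network output $h_t^{enc}$ and the prediction network output $h_u^{pre}$ as

$$z_{t,u} = f^{joint}(h_t^{enc}, h_u^{pre}) \tag{3}$$
$$\phantom{z_{t,u}} = \psi(\mathbf{U} h_t^{enc} + \mathbf{V} h_u^{pre} + b_z) \tag{4}$$

where $\mathbf{U}$ and $\mathbf{V}$ are weight matrices, $b_z$ is a bias vector, and $\psi$ is a non-linear function, e.g. Tanh or ReLU [31].

The $z_{t,u}$ is connected to the output layer with a linear transform

$$h_{t,u} = W_y z_{t,u} + b_y \tag{5}$$

The final posterior for each output token $k$ is obtained after applying the softmax operation

$$P(k \mid t, u) = \mathrm{softmax}(h_{t,u}^{k}) \tag{6}$$

Figure 1: Diagram of RNN-Transducer.

The loss function of RNN-T is the negative log posterior of the output label sequence $y$ given the input acoustic feature $x$,

$$L = -\ln P(y \mid x) \tag{7}$$

which is calculated based on the forward-backward algorithm described in [16]. The derivative of the loss $L$ with respect to $P(k \mid t, u)$ is

$$\frac{\partial L}{\partial P(k \mid t, u)} = -\frac{\alpha(t,u)}{P(y \mid x)} \begin{cases} \beta(t, u+1) & \text{if } k = y_{u+1} \\ \beta(t+1, u) & \text{if } k = \varnothing \\ 0 & \text{otherwise} \end{cases} \tag{8}$$

where $\varnothing$ denotes the blank label. $\alpha(t,u)$ is the probability of outputting $y_{1:u}$ during $x_{1:t}$, while $\beta(t,u)$ is the probability of outputting $y_{u+1:U}$ during $x_{t:T}$, assuming one sequence with acoustic feature length $T$ and label sequence length $U$.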
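The joint network and the loss described in this section can be sketched for a single utterance as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the argument names `W_enc`, `W_pre`, `W_y`, the use of Tanh, and the choice of index 0 for the blank label are assumptions made for the example. The loss is evaluated with the forward recursion of [16] in the log domain.

```python
import numpy as np

def log_softmax(h):
    """Numerically stable log-softmax over the last axis."""
    h = h - h.max(axis=-1, keepdims=True)
    return h - np.log(np.exp(h).sum(axis=-1, keepdims=True))

def joint_log_probs(h_enc, h_pre, W_enc, W_pre, b_z, W_y, b_y):
    """h_enc: (T, D_enc), h_pre: (U+1, D_pre) -> log P(k | t, u): (T, U+1, K)."""
    # Project both inputs to the joint dimension D, then add over the (t, u) grid.
    z = np.tanh((h_enc @ W_enc.T)[:, None, :] + (h_pre @ W_pre.T)[None, :, :] + b_z)
    h = z @ W_y.T + b_y                      # linear output layer
    return log_softmax(h)                    # posterior over K labels per grid point

def rnnt_loss(log_p, labels, blank=0):
    """Negative log posterior of `labels` via the forward (alpha) recursion."""
    T, U1, _ = log_p.shape                   # U1 = U + 1 prediction-network states
    log_alpha = np.full((T, U1), -np.inf)
    log_alpha[0, 0] = 0.0
    for t in range(T):
        for u in range(U1):
            if t > 0:                        # arrive by emitting blank at (t-1, u)
                log_alpha[t, u] = np.logaddexp(
                    log_alpha[t, u], log_alpha[t - 1, u] + log_p[t - 1, u, blank])
            if u > 0:                        # arrive by emitting label u at (t, u-1)
                log_alpha[t, u] = np.logaddexp(
                    log_alpha[t, u], log_alpha[t, u - 1] + log_p[t, u - 1, labels[u - 1]])
    # The sequence ends with a final blank emitted at the last grid point.
    return -(log_alpha[T - 1, U1 - 1] + log_p[T - 1, U1 - 1, blank])
```

The double loop is written for readability; practical implementations vectorize the recursion along anti-diagonals of the grid.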

3 Training Improvement

We use $K$ to denote the number of all output labels (including blank), and $D$ to denote the dimension of $z_{t,u}$. From Eq. (4) and (5), we can see the size of $z$ is $T \times U \times D$, and the size of $h$ is $T \times U \times K$. Compared with other E2E models, such as CTC or AED, the RNN-T model consumes much more memory during training. This restricts the training minibatch size, especially when training on GPUs with limited memory capacity. Training would be slow if a small minibatch is used, therefore reducing memory cost is important for fast RNN-T training. In this paper, we use two methods to reduce memory cost for RNN-T model training.

3.1 Efficient encoder and prediction output combination

To improve the training efficiency of RNNs, several sequences are usually processed in parallel in one minibatch. Given $N$ sequences in one minibatch with acoustic feature lengths $(T_1, T_2, \ldots, T_N)$, the dimension of $h^{enc}$ would be $N \times \max(T_1, \ldots, T_N) \times D_{enc}$ ($D_{enc}$ is the size of vector $h_t^{enc}$) when they are paralleled in one minibatch. Because the actual element number in $h^{enc}$ is $\sum_{n=1}^{N} T_n \times D_{enc}$, some of the memory is wasted. Such waste is worse for RNN-T when combining encoder and prediction network output with the broadcasting method, which is a popular implementation for dealing with different-size tensors in neural network training tools such as PyTorch.

Suppose the label lengths of the $N$ sequences are $(U_1, U_2, \ldots, U_N)$. The dimension of $h^{pre}$ would then be $N \times \max(U_1, \ldots, U_N) \times D_{pre}$. To combine $h^{enc}$ and $h^{pre}$ with the broadcasting method, the dimensions of $h^{enc}$ and $h^{pre}$ are expanded to

$$N \times \max(T_1, \ldots, T_N) \times \max(U_1, \ldots, U_N) \times D_{enc} \tag{9}$$

and

$$N \times \max(T_1, \ldots, T_N) \times \max(U_1, \ldots, U_N) \times D_{pre} \tag{10}$$

respectively. Then, they are combined according to Eq. (4) and the output dimension becomes

$$N \times \max(T_1, \ldots, T_N) \times \max(U_1, \ldots, U_N) \times D \tag{11}$$

Note the last dimension is $D$ instead of $D_{enc}$ or $D_{pre}$ because of the projection matrices in Eq. (4). Hence, $z$ becomes a four-dimension tensor, which requires very large memory to accommodate.

To eliminate such memory waste, we did not use the broadcasting method to combine $h^{enc}$ and $h^{pre}$. Instead, we implement the combination sequence by sequence. Hence, the size of $z_n$ for utterance $n$ is $T_n \times U_n \times D$ instead of $\max(T_1, \ldots, T_N) \times \max(U_1, \ldots, U_N) \times D$. Then we concatenate all $z_n$ instead of paralleling them, which means we convert $z$ into a two-dimension tensor of size $\left(\sum_{n=1}^{N} T_n U_n\right) \times D$. In this way, the total memory cost for $z$ is $\sum_{n=1}^{N} T_n \times U_n \times D$. This significantly reduces the memory cost, compared to the broadcasting method. There is no recurrent operation after the combination, hence such conversion will not affect the training speed for the operations following it. Of course, we need to pass the sequence information, like $(T_1, \ldots, T_N)$ and $(U_1, \ldots, U_N)$, to any later operations so that they can process the sequences correctly based on such information.
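The difference between the two combination strategies can be illustrated with a small NumPy sketch. It is only an illustration under simplifying assumptions: the encoder and prediction outputs are taken to be already projected to a common dimension, Tanh is used as the non-linearity, and the function names and toy lengths are hypothetical.

```python
import numpy as np

def combine_broadcast(enc, pre):
    """enc: (N, T_max, D), pre: (N, U_max, D) -> padded 4-D tensor (N, T_max, U_max, D)."""
    return np.tanh(enc[:, :, None, :] + pre[:, None, :, :])

def combine_packed(enc, pre, T_lens, U_lens):
    """Combine utterance by utterance and concatenate into (sum_n T_n * U_n, D)."""
    chunks = []
    for n, (T_n, U_n) in enumerate(zip(T_lens, U_lens)):
        # Only the valid frames and labels of utterance n are combined.
        z_n = np.tanh(enc[n, :T_n, None, :] + pre[n, None, :U_n, :])  # (T_n, U_n, D)
        chunks.append(z_n.reshape(T_n * U_n, -1))
    return np.concatenate(chunks, axis=0)

rng = np.random.default_rng(0)
T_lens, U_lens, D = [50, 20, 35], [10, 4, 7], 8
enc = rng.standard_normal((len(T_lens), max(T_lens), D))
pre = rng.standard_normal((len(U_lens), max(U_lens), D))

z_broadcast = combine_broadcast(enc, pre)
z_packed = combine_packed(enc, pre, T_lens, U_lens)
print(z_broadcast.size, z_packed.size)  # prints: 12000 6600
```

In this toy minibatch the packed tensor holds 45% fewer elements than the padded one. Later operations need `T_lens` and `U_lens` to locate each utterance inside the packed tensor.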

Another way to reduce memory waste is to sort the sequences with respect to both acoustic feature and label length, and then perform the training with sorted sequences. However, in our experience, this gives worse accuracy than presenting randomized utterances to the training. In the Deep Speech 2 work [1], SortaGrad was used to present the sorted utterances to training in the first epoch for better initialization of CTC training. After the first epoch, the training reverted to a random order of data over minibatches, which indicates the importance of data shuffling.

3.2 Function merging

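Most neural network training tools (e.g., PyTorch) are modular. Although this makes it convenient for users to try various model structures, the memory usage is not efficient. For the RNN-T model, if we define the softmax and the loss function as two separate modules, then to get the derivative of the loss with respect to $h_{t,u}^{k}$, i.e., $\frac{\partial L}{\partial h_{t,u}^{k}}$, we may use the chain rule

$$\frac{\partial L}{\partial h_{t,u}^{k}} = \sum_{k'} \frac{\partial L}{\partial P(k' \mid t, u)} \frac{\partial P(k' \mid t, u)}{\partial h_{t,u}^{k}} \tag{12}$$

to first calculate $\frac{\partial L}{\partial P(k \mid t, u)}$ with Eq. (8) and then $\frac{\partial L}{\partial h_{t,u}^{k}}$. However, calculating and storing $\frac{\partial L}{\partial P(k \mid t, u)}$ is not necessary. Instead, we directly derive the formulation of $\frac{\partial L}{\partial h_{t,u}^{k}}$ as

$$\frac{\partial L}{\partial h_{t,u}^{k}} = \frac{P(k \mid t, u)\,\alpha(t,u)}{P(y \mid x)} \left[ \beta(t,u) - \begin{cases} \beta(t, u+1) & \text{if } k = y_{u+1} \\ \beta(t+1, u) & \text{if } k = \varnothing \\ 0 & \text{otherwise} \end{cases} \right] \tag{13}$$

The tensor size of $\frac{\partial L}{\partial P(k \mid t, u)}$ for one minibatch is very big, hence Eq. (13) saves a large amount of memory compared with Eq. (12). To further optimize this, we calculate the softmax in-place (i.e., with an operation that directly changes the content of a given tensor without making a copy) for Eq. (6) and let $\frac{\partial L}{\partial h_{t,u}^{k}}$ take the memory storage place of $P(k \mid t, u)$ in Eq. (13). In this way, for this part we only have one very large tensor in the memory (either $P(k \mid t, u)$ or $\frac{\partial L}{\partial h_{t,u}^{k}}$), compared to the standard chain rule implementation, which needs to store three large tensors in the memory.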

With the above two methods, memory cost is reduced significantly, which helps to increase the training minibatch size. For an RNN-T model with around 4,000 output labels, the minibatch size could be increased from 2000 to 4000 when it is trained on a V100 GPU with 16 Gigabytes of memory. The improvement is even larger for an RNN-T model with 36,000 output labels, for which the minibatch size could be increased from 500 to 2000.
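The function-merging idea can be sketched in NumPy for a single utterance. This is a hedged illustration rather than the authors' code: the gradient with respect to the softmax inputs is obtained by combining the derivative in Eq. (8) with the softmax Jacobian, and it is written into the same buffer that first held the logits and then the posteriors. The function names, the blank index 0, and the probability-domain recursions (adequate only for toy sizes) are assumptions.

```python
import numpy as np

def forward_backward(P, labels, blank=0):
    """P: (T, U+1, K) posteriors. Returns alpha, beta and P(y|x)."""
    T, U1, _ = P.shape
    alpha, beta = np.zeros((T, U1)), np.zeros((T, U1))
    alpha[0, 0] = 1.0
    for t in range(T):
        for u in range(U1):
            if t > 0:
                alpha[t, u] += alpha[t - 1, u] * P[t - 1, u, blank]
            if u > 0:
                alpha[t, u] += alpha[t, u - 1] * P[t, u - 1, labels[u - 1]]
    beta[T - 1, U1 - 1] = P[T - 1, U1 - 1, blank]   # final blank ends the sequence
    for t in reversed(range(T)):
        for u in reversed(range(U1)):
            if t < T - 1:
                beta[t, u] += beta[t + 1, u] * P[t, u, blank]
            if u < U1 - 1:
                beta[t, u] += beta[t, u + 1] * P[t, u, labels[u]]
    return alpha, beta, beta[0, 0]

def fused_grad_inplace(h, labels, blank=0):
    """Overwrite logits h (T, U+1, K) with dL/dh; return the loss -ln P(y|x)."""
    labels = np.asarray(labels, dtype=int)
    h -= h.max(axis=-1, keepdims=True)              # in-place softmax: h becomes P
    np.exp(h, out=h)
    h /= h.sum(axis=-1, keepdims=True)
    alpha, beta, p_yx = forward_backward(h, labels, blank)
    T, U1, _ = h.shape
    coef = alpha / p_yx                             # (T, U+1), small 2-D helper
    u_idx = np.arange(U1 - 1)
    p_blank = h[:, :, blank].copy()                 # 2-D slices saved before overwrite
    p_label = h[:, u_idx, labels].copy()            # (T, U)
    beta_blank = np.zeros((T, U1))                  # beta reached after emitting blank
    beta_blank[:-1] = beta[1:]
    beta_blank[-1, -1] = 1.0
    h *= (coef * beta)[:, :, None]                  # +beta(t,u) term for every k
    h[:, :, blank] -= coef * beta_blank * p_blank   # blank transitions
    h[:, u_idx, labels] -= coef[:, :-1] * beta[:, 1:] * p_label  # label transitions
    return -np.log(p_yx)
```

Only two-dimensional helper arrays are allocated besides the single large tensor. A finite-difference check of the loss on a small random grid is a convenient way to validate such a fused gradient before optimizing it further.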

4 Model Structure Exploration

In the latest successful RNN-T work [18], the layer-normalized LSTMs with projection layer were used in both encoder and prediction networks. We describe it in Section 4.1 and use it as the baseline structure in this study. From Section 4.2 to Section 4.6, we propose different structures for RNN-T. These new structures achieve more modeling power by 1) decoupling the target classification task and temporal modeling task with separate modeling units; 2) exploring future context frames to generate more informative encoder output.

4.1 Layer normalized LSTM

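In [18], layer normalization and the projection layer for LSTM were reported to be important to the success of RNN-T modeling. Following [21], we define the layer normalization function for vector $v$ given adaptive gain $g$ and bias $b$ as

$$\mu = \frac{1}{n} \sum_{i=1}^{n} v_i \tag{14}$$
$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (v_i - \mu)^2} \tag{15}$$
$$\mathrm{LN}(v) = \frac{g}{\sigma} \odot (v - \mu) + b \tag{16}$$

where $n$ is the dimension of $v$, and $\odot$ is the element-wise product.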
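A minimal NumPy sketch of layer normalization over a single vector, in the spirit of [21], is given below. The argument names and the small constant `eps`, added for numerical stability, are assumptions of this example.

```python
import numpy as np

def layer_norm(v, gain, bias, eps=1e-5):
    """Normalize vector v to zero mean and unit variance, then scale and shift."""
    mu = v.mean()                                   # mean over the vector elements
    sigma = np.sqrt(((v - mu) ** 2).mean() + eps)   # standard deviation (stabilized)
    return gain * (v - mu) / sigma + bias           # element-wise gain and bias

v = np.array([1.0, 2.0, 3.0, 4.0])
out = layer_norm(v, gain=np.ones(4), bias=np.zeros(4))
```

Unlike batch normalization, the statistics are computed per vector, so the result does not depend on the other sequences in the minibatch.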

Then, we define the layer normalized LSTM function with projection layer as

(17)
(18)
(19)
(20)
(21)
(22)

The vectors , , , are the activations of the input, output, forget gates, and memory cells, respectively. is the output of the LSTM. and are the weight matrices for the inputs and the recurrent inputs , respectively. are bias vectors. The functions and are the logistic sigmoid and hyperbolic tangent nonlinearity, respectively.

For the multi-layer LSTM, , then we have the hidden output of the th layer at time as

(23)

and we use the last hidden layer output and of the encoder and prediction networks as and , where and denote the number of layers in encoder and prediction networks respectively.

4.2 Layer trajectory LSTM

Recently, significant accuracy improvement was reported with a layer trajectory LSTM (ltLSTM) model [24, 25] in which depth-LSTMs are used to scan the outputs of multi-layer time-LSTMs to extract information for label classification. By decoupling the temporal modeling and senone classification tasks, ltLSTM allows the time and depth LSTMs to focus on their individual tasks. The model structure is illustrated in Figure 2.

Figure 2: Diagram of layer trajectory LSTM (ltLSTM). Depth-LSTM (D-LSTM) is used to scan the outputs of time-LSTM (T-LSTM) across all layers at the current time step to get summarized layer trajectory information for senone classification. Note that there is no time recurrence in D-LSTMs; time recurrence only occurs in T-LSTMs.

Following [24], we define the layer-normalized ltLSTM formulation using

(24)
(25)

with the LSTM function defined from Eq. (17) to Eq. (23), where and are the memory cells from time-LSTM (time and layer ) and depth-LSTM (time and layer ), respectively. The time-LSTM formulation in Eq. (24) performs temporal modeling, while the depth-LSTM formulation in Eq. (25) works across layers for the target classification. is the output of the depth-LSTM at time and layer . Similarly, we can use the last hidden layer output and of the encoder and prediction networks as and for RNN-T.

4.3 Contextual layer trajectory LSTM

In [26], the contextual layer trajectory LSTM (cltLSTM) was proposed to further improve the performance of ltLSTM models by using context frames to capture future information. In cltLSTM, used in Eq. (25) is replaced by the lookahead embedding vector in order to incorporate the future context information. The embedding vector is computed from the depth-LSTM using Eq. (26)

(26)
(27)
(28)

where denotes the weight matrix applied to depth-LSTM output . Note it is only meaningful to have future context lookahead for the encoder network. Therefore, of the encoder network can be used as for RNN-T. In Eq. (26), every layer has frames lookahead. To generate , we have frames lookahead in total.

4.4 Layer normalized gated recurrent units

Similar to [18], our target is to deploy RNN-T for device ASR. Therefore, footprint is a major factor when developing models. As the gated recurrent unit (GRU) [11] is usually lighter than LSTM, we use it as the building block for RNN-T and define the layer normalized GRU function as

(29)
(30)
(31)
(32)

Again, and are the weight matrices for the inputs and the recurrent inputs , respectively. are bias vectors. For the multi-layer GRU, , then we have the hidden output of the th layer at time as

(33)

Note that all the networks in this study are uni-directional and our layer normalized GRU is different from the bi-directional GRU with batch normalization used in [5]. Layer normalization is better than batch normalization for recurrent neural networks [21].

4.5 Layer trajectory GRU

Similar to the layer normalized ltLSTM presented in Section 4.2, we propose layer trajectory GRU (ltGRU) with layer normalization in this study as

(34)
(35)

The time-GRU formulation in Eq. (34) performs temporal modeling, while the depth-GRU formulation in Eq. (35) works across layer for target classification. is the output of the depth-GRU at time and layer . We can use the last hidden layer output and of the encoder and prediction networks as and for RNN-T.

4.6 Contextual layer trajectory GRU

To further improve the performance of ltGRU models, we extend ltGRU in Section 4.5 by utilizing the lookahead embedding vector in order to incorporate the future context information. The embedding vector is computed from the depth-GRU using Eq. (36)

(36)
(37)
(38)

Different from Eq. (26), Eq. (36) uses vectors instead of matrices to integrate future context from the depth-GRU. This further reduces the model footprint. Hence we call this model the elementwise contextual layer trajectory GRU (ecltGRU).

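The lookahead embedding vector at the last hidden layer of the encoder network can then be used as the encoder output for RNN-T. In Eq. (36), every layer looks ahead by the same number of future frames, so the total lookahead needed to generate this vector is the per-layer lookahead multiplied by the number of encoder layers.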
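To make the footprint argument concrete, the following NumPy sketch contrasts matrix-weighted and elementwise (vector-weighted) lookahead. It is a simplified illustration under stated assumptions: one weight per frame offset including the current frame, zero padding beyond the end of the utterance, and hypothetical function names. The dimension 800 and the 4 future frames are taken from the experimental setup in Section 5.

```python
import numpy as np

def lookahead_elementwise(g, w):
    """g: (T, d) depth-GRU outputs; w: (n_future + 1, d) per-offset weight vectors.

    Returns (T, d) where row t is sum_delta w[delta] * g[t + delta],
    with zeros assumed beyond the end of the utterance.
    """
    T, d = g.shape
    n_future = w.shape[0] - 1
    padded = np.vstack([g, np.zeros((n_future, d))])
    return sum(w[delta] * padded[delta:delta + T] for delta in range(n_future + 1))

def lookahead_param_count(d, n_future, elementwise):
    """Parameters of one layer's lookahead weights (vectors vs. d x d matrices)."""
    per_offset = d if elementwise else d * d
    return (n_future + 1) * per_offset

print(lookahead_param_count(800, 4, elementwise=False))  # prints: 3200000
print(lookahead_param_count(800, 4, elementwise=True))   # prints: 4000
```

Under these assumptions the elementwise variant adds only a few thousand parameters per layer, which is consistent with the observation in Section 5.1 that ecltGRU and ltGRU encoders have almost the same model size.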

5 Experiments

In this section, we evaluate the effectiveness of the proposed models. In our experiments, all models are uni-directional and were trained with 30 thousand (k) hours of anonymized and transcribed Microsoft production data, including Cortana and Conversation data, recorded in both close-talk and far-field conditions. We evaluated all models on the Cortana and Conversation test sets. Both sets contain mixed close-talk and far-field recordings, with 439k and 111k words, respectively. The Cortana set has shorter utterances related to voice search and commands, while the Conversation set has longer utterances from conversations. We also evaluate the models on a third test set named DMA, with 29k words, which is in neither the Cortana nor the Conversation domain. The DMA domain was unseen during model training and serves to evaluate the generalization capacity of the models.

5.1 RNN-T models with greedy search

| enc. | pred. | size (Mb) | Cortana (%) | Conv. (%) | DMA (%) |
|---------|--------|-----|-------|-------|-------|
| LSTM | LSTM | 255 | 12.03 | 22.18 | 26.78 |
| ltLSTM | LSTM | 422 | 11.11 | 22.41 | 24.70 |
| ltLSTM | ltLSTM | 482 | 10.91 | 22.51 | 25.15 |
| cltLSTM | LSTM | 469 | 9.92 | 19.86 | 23.03 |
| GRU | GRU | 139 | 12.30 | 22.90 | 26.50 |
| ltGRU | GRU | 216 | 11.60 | 21.74 | 24.99 |
| ltGRU | ltGRU | 235 | 11.58 | 21.69 | 26.00 |
| ecltGRU | GRU | 216 | 10.68 | 19.76 | 23.22 |

Table 1: WERs and sizes of all RNN-T models on Cortana, Conversation (Conv.), and DMA test sets. All test sets contain a mix of close-talk and far-field recordings. All encoder (enc.) networks have 6 hidden layers, and all prediction (pred.) networks have 2 hidden layers. The decoding results are generated by greedy search. The cltLSTM and ecltGRU use 4 future context frames at each layer.

In Table 1, we compare different structures of RNN-T models in terms of size and word error rate (WER) using greedy search. The computational cost of RNN-T models during runtime is proportional to the model size. The feature is 80-dimension log Mel filter bank for every 10 milliseconds (ms) speech. Three of them are stacked together to form a frame of 240-dimension input acoustic feature to the encoder network. All encoder (enc.) networks have 6 hidden layers, and all prediction (pred.) networks have 2 hidden layers. The joint network always outputs a vector with dimension 640. The output layer models 4096 word piece units together with blank label. The word piece units are generated by running byte pair encoding [40] on the acoustic training texts. Similar to the latest successful RNN-T model on device [18], our first model uses LSTM for both encoder and prediction networks. The encoder network has a 6 layer layer-normalized LSTM with 1280 hidden units at each layer and the output size then is reduced to 640 using a linear projection layer. We denote this layer-normalized LSTM structure as 1280p640. The prediction network has 2 hidden layers, and each layer is with the 1280p640 LSTM structure. This model has total size of 255 megabytes (Mb). We use this RNN-T model as the baseline model.
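The frame stacking described above, three 80-dimension frames forming one 240-dimension encoder input, can be sketched as follows. The function name is hypothetical, and stacking non-overlapping groups while dropping incomplete trailing frames is an assumption of this example.

```python
import numpy as np

def stack_frames(feats, n=3):
    """feats: (num_frames, dim) -> (num_frames // n, dim * n), non-overlapping groups."""
    num_groups = feats.shape[0] // n
    usable = num_groups * n                      # drop frames that do not fill a group
    return feats[:usable].reshape(num_groups, feats.shape[1] * n)

# 100 frames of 10 ms each (1 second of speech), 80 log Mel filter bank values per frame
feats = np.random.default_rng(0).standard_normal((100, 80))
stacked = stack_frames(feats)
print(stacked.shape)  # prints: (33, 240)
```

Each stacked input then covers 30 ms of speech, which is the frame duration used later when alignment delay is converted into latency.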

Then, we explore ltLSTM structures described in Section 4.2 for encoder and prediction networks. All the LSTM components in ltLSTM use the 1280p640 LSTM structure. Simply using ltLSTM in the encoder significantly reduces the WERs on Cortana and DMA test sets, but increases the model size from 255 Mb to 422 Mb. Further using ltLSTM in prediction network increases the model size to 482 Mb, without benefiting the WER clearly. Hence, using ltLSTM only in the encoder is a good setup.

Using the cltLSTM described in Section 4.3 in the encoder network with 4 future context frames at each layer improves the WERs significantly, with 17.5%, 10.5%, and 14.0% relative WER reduction on the Cortana, Conversation, and DMA test sets, respectively, from the baseline RNN-T model, which uses layer-normalized LSTM in both the encoder and prediction networks. This clearly shows the advantage of using future context for the encoder network. However, this model also has a very large size of 469 Mb, which makes deployment to devices especially challenging.

Next, layer-normalized LSTM is replaced with the layer-normalized GRU described in Section 4.4. The GRU in this study has 800 units at each layer without any projection layer. When using layer-normalized GRU in both encoder and prediction networks, we significantly reduce the RNN-T model size to 139 Mb and obtain slightly worse WERs than the baseline RNN-T model using LSTM.

The RNN-T model using ltGRU proposed in Section 4.5 in the encoder network has the model size 216 Mb, and significantly improves the RNN-T model with GRU in the encoder network. Again, applying ltGRU to prediction network doesn’t bring any additional benefits but increases the model size.

Finally, ecltGRU, which uses 4 future context frames at each layer, has almost the same size as ltGRU in the encoder network due to the use of vectors instead of matrices when incorporating future context, but significantly improves the WERs because of the future context access. The size of this model (216 Mb) is smaller than that of the baseline RNN-T model (255 Mb), which uses LSTM in the encoder and prediction networks. At the same time, it outperforms the baseline RNN-T model with 11.2%, 10.9% and 13.3% relative WER reduction on Cortana, Conversation, and DMA test sets, respectively.
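The relative WER reductions quoted in this section follow directly from the numbers in Table 1. The short Python check below reproduces them; the function and variable names are chosen for this example only.

```python
def relative_wer_reduction(baseline, new):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline - new) / baseline

# WERs (%) from Table 1, greedy search.
baseline = {"Cortana": 12.03, "Conv.": 22.18, "DMA": 26.78}   # LSTM / LSTM
cltlstm = {"Cortana": 9.92, "Conv.": 19.86, "DMA": 23.03}     # cltLSTM / LSTM
ecltgru = {"Cortana": 10.68, "Conv.": 19.76, "DMA": 23.22}    # ecltGRU / GRU

for name, model in [("cltLSTM", cltlstm), ("ecltGRU", ecltgru)]:
    reductions = {k: round(relative_wer_reduction(baseline[k], model[k]), 1)
                  for k in baseline}
    print(name, reductions)
# cltLSTM {'Cortana': 17.5, 'Conv.': 10.5, 'DMA': 14.0}
# ecltGRU {'Cortana': 11.2, 'Conv.': 10.9, 'DMA': 13.3}
```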

Given the good performance of ecltGRU, we evaluate the impact of future context frames at each layer in Figure 3 by varying the number of lookahead frames per layer. When no future frames are used, the ecltGRU model reduces to the ltGRU model. With more future frames, the WERs drop monotonically. Using 3 or 4 future frames per layer makes little difference in WER, while the smaller value reduces the model latency because less future context is accessed.

Figure 3: The WERs of the ecltGRU model with respect to future context frames at each layer.

5.2 Comparison with hybrid models

| model | AM / enc. | LM / pred. | size (Mb) | Cortana (%) | Conv. (%) | DMA (%) |
|--------|---------|--------------|------|-------|-------|-------|
| hybrid | LSTM | 5gram | 5120 | 9.35 | 18.82 | 20.18 |
| hybrid | LSTM | pruned 5gram | 218 | 10.92 | 20.22 | 23.05 |
| RNN-T | LSTM | LSTM | 255 | 9.94 | 19.70 | 23.19 |
| RNN-T | ecltGRU | GRU | 216 | 9.28 | 18.22 | 20.45 |

Table 2: Comparison of hybrid models with RNN-T models on Cortana, Conversation (Conv.), and DMA test sets. The RNN-T decoding results are generated by beam search with beam width 10. The ecltGRU uses 4 future context frames at each layer.

In Table 2, we compare RNN-T models decoded using beam search with hybrid models trained with the cross entropy criterion on Cortana, Conversation, and DMA test sets. Note that all models, no matter whether RNN-T or hybrid, can be improved by sequence discriminative training. The acoustic model of the hybrid model is a 6 layer LSTM with 1024 hidden units at each layer, and the output size is then reduced to 512 using a linear projection layer. The softmax layer has a 9404-dimension output to model the senone labels. The input feature is also an 80-dimension log Mel filter bank for every 10 milliseconds (ms) of speech. We applied frame skipping by a factor of 2 [29] to reduce the runtime cost, which corresponds to 20ms per frame. This acoustic model (AM) has a size of 120 Mb. The language model (LM) is a 5-gram with around 100 million (M) ngrams, which is compiled into a graph of 5 gigabytes (Gb) in size. We refer to this configuration as the server setup. We also prune the LM for the purpose of device usage, resulting in a graph of 98 Mb in size. The decoding of these two hybrid models uses our production setup, differing only in the LM size. We list the WERs of this AM combined with both the large LM and the pruned LM in Table 2. Clearly, the device setup with the small LM gives worse WERs than the server setup using the large LM.

The RNN-T decoding results are generated by beam search with beam width 10. Comparing the results in Table 2 and Table 1, we can see that beam search significantly improves the WERs over greedy search. Using ecltGRU in the encoder network and GRU in the prediction network for RNN-T outperforms the baseline RNN-T which uses LSTM everywhere, with 6.6%, 7.5%, and 11.8% relative WER reduction on Cortana, Conversation, and DMA test sets, respectively. The size of this best RNN-T model is 216 Mb, less than the baseline RNN-T model with LSTM units. Moreover, the WERs of this best RNN-T model match the WERs of the hybrid model with the server setup, which has a 5 Gb LM. Compared to the device hybrid setup, which has a similar size (218 Mb), this best RNN-T obtains 15.0%, 9.9%, and 11.3% relative WER reduction on Cortana, Conversation, and DMA test sets, respectively. With its high recognition accuracy and small footprint, this RNN-T model is well suited for device ASR.

Finally, we look at the gap between the ground-truth word alignment obtained by forced alignment with a hybrid model and the word alignment generated by greedy decoding from the two RNN-T models in Table 2. As shown in Figure 4, the baseline RNN-T with LSTM in the encoder network has a larger delay relative to the ground-truth alignment: the average delay is about 10 input frames. In contrast, the RNN-T model with ecltGRU has less alignment discrepancy, with an average delay of 2 input frames. This is because the ecltGRU encoder has a total lookahead of 24 frames, which provides more information to RNN-T so that it makes decisions earlier than the baseline model. Because the input to the RNN-T models spans 30ms by stacking three 10ms frames together, the average latency of RNN-T with the ecltGRU encoder is (24+2)*30ms = 780ms, while the average latency of RNN-T with the LSTM encoder is 10*30ms = 300ms. Hence, part of the accuracy advantage of the ecltGRU encoder comes at the cost of higher latency.
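The latency arithmetic above can be written out as a short Python check. The constant and function names are chosen for this example; the values (6 encoder layers, 4 future frames per layer, 3 stacked 10ms frames, and the average alignment delays) are the ones stated in the text.

```python
FRAME_MS = 10          # one filter bank frame covers 10 ms
STACKED_FRAMES = 3     # three frames are stacked into one encoder input
ENCODER_LAYERS = 6
FUTURE_FRAMES_PER_LAYER = 4

def average_latency_ms(lookahead_inputs, alignment_delay_inputs):
    """Latency = (encoder lookahead + average alignment delay) in encoder inputs, times 30 ms."""
    input_ms = FRAME_MS * STACKED_FRAMES
    return (lookahead_inputs + alignment_delay_inputs) * input_ms

total_lookahead = ENCODER_LAYERS * FUTURE_FRAMES_PER_LAYER   # 24 input frames
print(average_latency_ms(total_lookahead, 2))   # ecltGRU encoder, prints: 780
print(average_latency_ms(0, 10))                # LSTM encoder, prints: 300
```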

Figure 4: The gap between the ground-truth word alignment and the word alignment from the two RNN-T models in Table 2. The ecltGRU uses 4 future context frames at each layer.

6 Conclusions

In this paper, we improve the RNN-T training for end-to-end ASR from two aspects. First, we reduce the memory consumption of RNN-T training with efficient combination of the encoder and prediction network outputs, and by reformulating the gradient calculation to avoid storing multiple large tensors in the memory. In this way, we significantly increase the minibatch size (otherwise constrained by the memory usage) during training. Second, we improve the baseline RNN-T model structure which uses LSTM units by proposing several new structures. All the proposed structures use the concept of layer trajectory which decouples the classification task and temporal modeling task by using depth LSTM / GRU units and time LSTM / GRU units, respectively. The best tradeoff between model size and accuracy is obtained by the RNN-T model with ecltGRU in the encoder network and GRU in the prediction network. The future context lookahead at each layer helps to build accurate models with better performance.

Trained with 30 thousand hours of anonymized and transcribed Microsoft production data, this best RNN-T model achieves 6.6%, 7.5%, and 11.8% relative WER reduction on Cortana, Conversation, and DMA test sets, respectively, from the baseline RNN-T model, but with a smaller model size (216 Megabytes), which is well suited for deployment to devices. This best RNN-T model is significantly better than the hybrid model of similar size, with 15.0%, 9.9%, and 11.3% relative WER reduction on Cortana, Conversation, and DMA test sets, respectively. It also obtains WERs similar to those of the server-size hybrid model of 5120 Megabytes in size.

References

  • [1] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. (2016) Deep speech 2: end-to-end speech recognition in English and Mandarin. In International conference on machine learning, pp. 173–182. Cited by: §3.1.
  • [2] T. Bagby, K. Rao, and K. C. Sim (2018) Efficient implementation of recurrent neural network transducer in tensorflow. In Proc. SLT, pp. 506–512. Cited by: §1.
  • [3] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR, Cited by: §1.
  • [4] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio (2016) End-to-end attention-based large vocabulary speech recognition. In Proc. ICASSP, pp. 4945–4949. Cited by: §1.
  • [5] E. Battenberg, J. Chen, R. Child, A. Coates, Y. Gaur, Y. Li, H. Liu, S. Satheesh, D. Seetapun, A. Sriram, et al. (2017) Exploring Neural Transducers for End-to-End Speech Recognition. In Proc. ASRU, Cited by: §1, §4.4.
  • [6] W. Chan, N. Jaitly, Q. Le, and O. Vinyals (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In Proc. ICASSP, pp. 4960–4964. Cited by: §1, §1.
  • [7] C. Chiu and C. Raffel (2017) Monotonic chunkwise attention. arXiv preprint arXiv:1712.05382. Cited by: §1.
  • [8] C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, K. Gonina, et al. (2018) State-of-the-art speech recognition with sequence-to-sequence models. In Proc. ICASSP, Cited by: §1, §1.
  • [9] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP, Cited by: §1.
  • [10] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio (2015) Attention-Based Models for Speech Recognition. In NIPS, Cited by: §1.
  • [11] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §4.4.
  • [12] A. Das, J. Li, R. Zhao, and Y. Gong (2018) Advancing connectionist temporal classification with attention modeling. In Proc. ICASSP, Cited by: §1.
  • [13] A. Das, J. Li, G. Ye, R. Zhao, and Y. Gong (2019) Advancing acoustic-to-word CTC model with attention and mixed-units. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (12), pp. 1880–1892. Cited by: §1.
  • [14] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In ICML, pp. 369–376. Cited by: §1.
  • [15] A. Graves and N. Jaitly (2014) Towards End-to-End Speech Recognition with Recurrent Neural Networks. In PMLR, pp. 1764–1772. Cited by: §1.
  • [16] A. Graves (2012) Sequence Transduction with Recurrent Neural Networks. CoRR abs/1211.3711. Cited by: §1, §1, §2.
  • [17] H. Hadian, H. Sameti, D. Povey, and S. Khudanpur (2018) Towards Discriminatively-trained HMM-based End-to-end models for Automatic Speech Recognition. In Proc. ICASSP, Cited by: §1.
  • [18] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, et al. (2019) Streaming end-to-end speech recognition for mobile devices. In Proc. ICASSP, pp. 6381–6385. Cited by: §1, §1, §1, §1, §4.1, §4.4, §4, §5.1.
  • [19] W. Hsu, Y. Zhang, and J. Glass (2016) A prioritized grid long short-term memory RNN for speech recognition. In Proc. SLT, pp. 467–473. Cited by: §1.
  • [20] J. Kim, M. El-Khamy, and J. Lee (2017) Residual LSTM: design of a deep recurrent architecture for distant speech recognition. arXiv preprint arXiv:1701.03360. Cited by: §1.
  • [21] J. Lei Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §4.1, §4.4.
  • [22] J. Li, A. Mohamed, G. Zweig, and Y. Gong (2015) LSTM time and frequency recurrence for automatic speech recognition. In Proc. ASRU, Cited by: §1.
  • [23] J. Li, A. Mohamed, G. Zweig, and Y. Gong (2016) Exploring multidimensional LSTMs for large vocabulary ASR. In Proc. ICASSP, Cited by: §1.
  • [24] J. Li, C. Liu, and Y. Gong (2018) Layer trajectory LSTM. In Proc. Interspeech, Cited by: §4.2, §4.2.
  • [25] J. Li, L. Lu, C. Liu, and Y. Gong (2018) Exploring layer trajectory LSTM with depth processing units and attention. In Proc. SLT, Cited by: §4.2.
  • [26] J. Li, L. Lu, C. Liu, and Y. Gong (2019) Improving layer trajectory LSTM with future context frames. In Proc. ICASSP, Cited by: §4.3.
  • [27] J. Li, G. Ye, A. Das, R. Zhao, and Y. Gong (2018) Advancing acoustic-to-word CTC model. In Proc. ICASSP, pp. 5794–5798. Cited by: §1.
  • [28] Y. Miao, M. Gowayyed, and F. Metze (2015) EESEN: End-to-End Speech Recognition using Deep RNN Models and WFST-based Decoding. In Proc. ASRU, pp. 167–174. Cited by: §1.
  • [29] Y. Miao, J. Li, Y. Wang, S. Zhang, and Y. Gong (2016) Simplifying long short-term memory acoustic models for fast training and decoding. In Proc. ICASSP, Cited by: §5.2.
  • [30] N. Moritz, T. Hori, and J. Le Roux (2019) Triggered attention for end-to-end speech recognition. In Proc. ICASSP, pp. 5666–5670. Cited by: §1.
  • [31] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814. Cited by: §2.
  • [32] R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly (2017) A Comparison of Sequence-to-Sequence Models for Speech Recognition. In Proc. Interspeech, pp. 939–943. Cited by: §1.
  • [33] G. Pundak and T. N. Sainath (2017) Highway-LSTM and recurrent highway networks for speech recognition. In Proc. Interspeech, Cited by: §1.
  • [34] K. Rao, H. Sak, and R. Prabhavalkar (2017) Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer. In Proc. ASRU, Cited by: §1.
  • [35] T. N. Sainath, C. Chiu, R. Prabhavalkar, A. Kannan, Y. Wu, P. Nguyen, and Z. Chen (2018) Improving the performance of online neural transducer models. In Proc. ICASSP, pp. 5864–5868. Cited by: §1.
  • [36] T. N. Sainath and B. Li (2016) Modeling time-frequency patterns with LSTM vs. convolutional architectures for LVCSR tasks. In Proc. Interspeech, Cited by: §1.
  • [37] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak (2015) Convolutional, long short-term memory, fully connected deep neural networks. In Proc. ICASSP, pp. 4580–4584. Cited by: §1.
  • [38] H. Sak, A. Senior, K. Rao, O. Irsoy, A. Graves, F. Beaufays, and J. Schalkwyk (2015) Learning Acoustic Frame Labeling for Speech Recognition with Recurrent Neural Networks. In Proc. ICASSP, pp. 4280–4284. Cited by: §1.
  • [39] H. Sak, M. Shannon, K. Rao, and F. Beaufays (2017) Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping. In Proc. Interspeech, Cited by: §1.
  • [40] R. Sennrich, B. Haddow, and A. Birch (2015) Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909. Cited by: §5.1.
  • [41] H. Soltau, H. Liao, and H. Sak (2016) Neural Speech Recognizer: Acoustic-to-word LSTM Model for Large Vocabulary Speech Recognition. arXiv preprint arXiv:1610.09975. Cited by: §1.
  • [42] Y. Zhang, G. Chen, D. Yu, K. Yao, S. Khudanpur, and J. Glass (2016) Highway long short-term memory RNNs for distant speech recognition. In Proc. ICASSP, Cited by: §1.
  • [43] Y. Zhao, S. Xu, and B. Xu (2016) Multidimensional residual learning based on recurrent neural networks for acoustic modeling. In Proc. Interspeech, pp. 3419–3423. Cited by: §1.