1 Introduction
Recent advances in automatic speech recognition (ASR) have been largely driven by deep learning, first through hybrid ASR systems with deep acoustic models such as Deep Neural Networks (DNNs), Convolutional Neural Networks (CNNs), and Recurrent Neural Networks (RNNs). However, one problem with these hybrid systems is that they contain several component models (acoustic model, language model, lexicon model, decision tree, etc.) which either need expert linguistic knowledge or need to be trained separately. In the last few years, an emerging trend in ASR is the study of end-to-end (E2E) systems [38, 28, 6, 32, 5, 39, 17, 8, 35, 27]. An E2E ASR system directly transduces an input sequence of acoustic features into an output sequence of tokens (phonemes, characters, words, etc.). This reconciles well with the notion that ASR is inherently a sequence-to-sequence task mapping input waveforms to output token sequences. Widely used contemporary E2E approaches for sequence-to-sequence transduction are: (a) Connectionist Temporal Classification (CTC) [14, 15], (b) Attention Encoder-Decoder (AED) [9, 3, 4, 10, 6], and (c) the RNN Transducer (RNN-T) [16]. These approaches have been successfully applied to large-scale ASR systems [38, 28, 6, 32, 5, 8, 41, 34, 18].

Among these three approaches, AED (or LAS: Listen, Attend and Spell [6, 8]) is the most popular. It contains three components: an encoder, which is similar to a conventional acoustic model; an attender, which works as an alignment model; and a decoder, which is analogous to a conventional language model. However, it is very challenging for AED to perform online streaming, which is an important requirement for ASR services. Although there are several studies in that direction, such as monotonic chunkwise attention [7] and triggered attention [30], streaming AED remains a significant challenge. CTC, in contrast, enjoys simplicity by using only an encoder to map the speech signal to target labels, but its frame-independence assumption is its most criticized aspect. There have been several attempts to improve CTC modeling by relaxing or removing this assumption. In [12, 13], attention modeling was directly integrated into the CTC framework by working on the model's hidden layers without changing the CTC objective function and training process, hence retaining the simplicity of CTC training.

A more elegant solution is RNN-T [16], which extends CTC modeling by incorporating an acoustic model with its encoder, a language model with its prediction network, and the decoding process with its joint network. There is no frame-independence assumption anymore, and online streaming is natural. As a result, RNN-T, instead of AED, was successfully deployed to Google's devices [18] with great impact. In spite of its recent success in industry, there is less research on RNN-T than on the popular AED or CTC models, possibly due to the complexity of RNN-T training [2]. For example, the encoder and prediction network compose a grid of alignments, and posteriors need to be calculated at each point in the grid to perform the forward-backward training of RNN-T. The resulting three-dimension tensor requires much more memory than what is needed in AED or CTC training. Given the many network structure studies in hybrid systems (e.g., [37, 22, 23, 36, 19, 43, 20, 42, 33]), it is also desirable to explore advanced network structures so that we can put an RNN-T model on devices with both good accuracy and a small footprint.

In this paper, we use the methods presented in the latest work [18] as the baseline and improve RNN-T modeling in two aspects. First, we optimize the training algorithm of RNN-T to reduce memory consumption, so that we can use a larger training minibatch for faster training. Second, we propose better model structures than the layer-normalized long short-term memory (LSTM) networks used in [18], obtaining RNN-T models with better accuracy but a smaller footprint.

This paper is organized as follows. In Section 2, we briefly describe the basic RNN-T method. In Section 3, we show how to improve RNN-T training by reducing the memory consumption that constrains the training minibatch size due to the large tensors in RNN-T. In Section 4, we propose several model structures that improve on the baseline LSTM model used in [18] for better accuracy and a smaller model size. We evaluate the proposed models in Section 5 by training them with 30 thousand hours of anonymized and transcribed production data and evaluating them on several ASR tasks. Finally, we conclude the study in Section 6.
2 RNN-T
Figure 1 shows the diagram of the RNN-T model, which consists of encoder, prediction, and joint networks. The encoder network is analogous to the acoustic model; it converts the acoustic feature x_t into a high-level representation h_t^enc, where t is the time index:

h_t^enc = f^enc(x_t)    (1)

The prediction network works like an RNN language model; it produces a high-level representation h_u^pre by conditioning on the previous non-blank target y_{u-1} predicted by the RNN-T model, where u is the output label index:

h_u^pre = f^pre(y_{u-1})    (2)

The joint network is a feed-forward network that combines the encoder network output h_t^enc and the prediction network output h_u^pre as

z_{t,u} = f^joint(h_t^enc, h_u^pre)    (3)
        = ψ(W_e h_t^enc + W_p h_u^pre + b_z)    (4)

where W_e and W_p are weight matrices, b_z is a bias vector, and ψ is a nonlinear function, e.g., Tanh or ReLU [31]. z_{t,u} is connected to the output layer with a linear transform

h_{t,u} = W_y z_{t,u} + b_y    (5)

The final posterior for each output token k is obtained after applying the softmax operation

P(k|t,u) = softmax(h_{t,u})    (6)
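As a concrete illustration of Eqs. (3)-(6), here is a minimal NumPy sketch of the joint network at a single grid point (t, u); the dimensions and weight names (W_e, W_p, W_y) are illustrative, not the configuration used in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)
A_h, A_z, V = 4, 3, 5  # illustrative sizes: encoder/prediction output, joint hidden, vocab (incl. blank)

W_e = rng.standard_normal((A_z, A_h))   # projection of the encoder output, Eq. (4)
W_p = rng.standard_normal((A_z, A_h))   # projection of the prediction output, Eq. (4)
b_z = np.zeros(A_z)
W_y = rng.standard_normal((V, A_z))     # output layer, Eq. (5)
b_y = np.zeros(V)

h_enc_t = rng.standard_normal(A_h)      # encoder output at time t
h_pre_u = rng.standard_normal(A_h)      # prediction output at label index u

z_tu = np.tanh(W_e @ h_enc_t + W_p @ h_pre_u + b_z)   # Eqs. (3)-(4), with psi = tanh
h_tu = W_y @ z_tu + b_y                               # Eq. (5): logits over V labels
y_tu = np.exp(h_tu - h_tu.max())
y_tu /= y_tu.sum()                                    # Eq. (6): softmax posterior
```

Note that this computation is repeated for every (t, u) pair, which is the root of the memory problem addressed in Section 3.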
The loss function of RNN-T is the negative log posterior of the output label sequence y given the input acoustic feature sequence x,

J = -ln P(y|x)    (7)

which is calculated based on the forward-backward algorithm described in [16]. Denoting the forward and backward variables at grid point (t,u) by α(t,u) and β(t,u), the derivative of the loss with respect to the posterior is

∂J/∂P(k|t,u) = -(α(t,u)/P(y|x)) × { β(t,u+1) if k = y_{u+1};  β(t+1,u) if k = ∅;  0 otherwise }    (8)

where ∅ denotes the blank label, y(t,u) = P(y_{u+1}|t,u) is the probability of outputting the next label y_{u+1} at grid point (t,u), and ∅(t,u) = P(∅|t,u) is the probability of outputting blank at (t,u), assuming one sequence with acoustic feature length T and label sequence length U.

3 Training Improvement
We use V to denote the number of all output labels (including blank), and A_z to denote the dimension of z_{t,u}. From Eq. (4) and (5), we can see that for one sequence the size of z is T × (U+1) × A_z, and the size of h is T × (U+1) × V. Compared with other E2E models such as CTC or AED, the RNN-T model therefore consumes much more memory during training. This restricts the training minibatch size, especially when training on GPUs with limited memory capacity. Training is slow if a small minibatch is used, so reducing memory cost is important for fast RNN-T training. In this paper, we use two methods to reduce the memory cost of RNN-T model training.
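To make the memory pressure concrete, a back-of-the-envelope calculation for the pre-softmax tensor h of one padded minibatch (illustrative sizes chosen by us, not measurements from the paper):

```python
# Memory of the padded pre-softmax tensor h from Eq. (5), in float32.
# Illustrative sizes: batch of 32, ~10 s utterances at 30 ms per frame, ~20 labels.
N, T_max, U_max, V = 32, 333, 20, 4096
elements = N * T_max * (U_max + 1) * V   # four-dimension padded tensor
bytes_fp32 = elements * 4
print(f"h tensor alone: {bytes_fp32 / 2**30:.1f} GiB")  # about 3.4 GiB
```

A single activation tensor of several GiB, before counting its gradient or any other layer, explains why the minibatch size is memory-bound.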
3.1 Efficient encoder and prediction output combination
To improve the training efficiency of RNNs, several sequences are usually processed in parallel in one minibatch. Given N sequences in one minibatch with acoustic feature lengths T_1, T_2, ..., T_N, the dimension of the padded encoder output H^enc is N × T_max × A_h, where A_h is the size of the vector h_t^enc and T_max = max_i T_i. Because the actual number of elements in H^enc is only (Σ_i T_i) × A_h, some of the memory is wasted. Such waste is worse for RNN-T when combining the encoder and prediction network outputs with the broadcasting method, which is a popular implementation for dealing with different-size tensors in neural network training tools such as PyTorch.

Suppose the label lengths of the N sequences are U_1, U_2, ..., U_N, with U_max = max_i U_i. The dimension of the padded prediction output H^pre is then N × (U_max+1) × A_h. To combine H^enc and H^pre with the broadcasting method, the dimensions of H^enc and H^pre are expanded to

N × T_max × (U_max+1) × A_h    (9)

and

N × T_max × (U_max+1) × A_h    (10)

respectively. Then, they are combined according to Eq. (4) and the output dimension becomes

N × T_max × (U_max+1) × A_z    (11)

Note the last dimension is A_z instead of A_h because of the projection matrices in Eq. (4). Hence, Z becomes a four-dimension tensor, which requires very large memory to accommodate.
To eliminate such memory waste, we do not use the broadcasting method to combine H^enc and H^pre. Instead, we implement the combination sequence by sequence. Hence, the size of z for utterance i is T_i × (U_i+1) × A_z instead of T_max × (U_max+1) × A_z. Then we concatenate all the z_i instead of paralleling them, which converts Z into a two-dimension tensor of shape (Σ_i T_i (U_i+1)) × A_z. In this way, the total memory cost for Z is Σ_i T_i (U_i+1) A_z, which is significantly less than that of the broadcasting method. There is no recurrent operation after the combination, so this conversion does not affect the training speed of the operations following it. Of course, we need to pass the sequence information, i.e., T_i and U_i, to the later operations so that they can process the sequences correctly.
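The sequence-by-sequence combination can be sketched as follows, with NumPy standing in for the framework ops; the helper name and the tiny sizes are ours:

```python
import numpy as np

def combine_packed(enc_outs, pre_outs, W_e, W_p, b_z):
    """Combine encoder and prediction outputs sequence by sequence, without padding.

    enc_outs[i]: (T_i, A_h) encoder outputs; pre_outs[i]: (U_i+1, A_h) prediction outputs.
    Returns a packed 2-D tensor of shape (sum_i T_i*(U_i+1), A_z), plus the per-utterance
    grid shapes that downstream operations need to locate each sequence.
    """
    rows, shapes = [], []
    for e, p in zip(enc_outs, pre_outs):
        e_proj = e @ W_e.T                 # (T_i, A_z)
        p_proj = p @ W_p.T                 # (U_i+1, A_z)
        grid = np.tanh(e_proj[:, None, :] + p_proj[None, :, :] + b_z)  # Eq. (4) per utterance
        shapes.append(grid.shape[:2])
        rows.append(grid.reshape(-1, grid.shape[-1]))
    return np.concatenate(rows, axis=0), shapes

rng = np.random.default_rng(0)
A_h, A_z = 4, 3
W_e, W_p = rng.standard_normal((A_z, A_h)), rng.standard_normal((A_z, A_h))
b_z = np.zeros(A_z)
enc_outs = [rng.standard_normal((3, A_h)), rng.standard_normal((5, A_h))]  # T_i = 3, 5
pre_outs = [rng.standard_normal((3, A_h)), rng.standard_normal((5, A_h))]  # U_i+1 = 3, 5
Z, shapes = combine_packed(enc_outs, pre_outs, W_e, W_p, b_z)
# packed: 3*3 + 5*5 = 34 rows, versus 2 * 5 * 5 = 50 rows with padded broadcasting
```

The saving grows with length variance inside the minibatch: the more the T_i and U_i differ, the more padding the broadcasting method wastes.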
Another way to reduce memory waste is to sort the sequences with respect to both acoustic feature length and label length, and then perform training on the sorted sequences. However, in our experience this yields worse accuracy than presenting utterances in randomized order. In the Deep Speech 2 work [1], SortaGrad presented sorted utterances only in the first epoch, for better initialization of CTC training; after the first epoch, training reverted to a random order of data over minibatches, which indicates the importance of data shuffling.
3.2 Function merging
Most neural network training tools (e.g., PyTorch) are module based. Although this is convenient for trying various model structures, the memory usage is not efficient. For the RNN-T model, if we define the softmax and the loss function as two separate modules, then to get the derivative of the loss with respect to the logits h_{t,u}, i.e., ∂J/∂h_{t,u}, we may use the chain rule

∂J/∂h_{t,u} = (∂J/∂y_{t,u}) (∂y_{t,u}/∂h_{t,u})    (12)

to first calculate ∂J/∂y_{t,u} with Eq. (8) and then ∂y_{t,u}/∂h_{t,u}. However, calculating and storing ∂J/∂y_{t,u} is not necessary. Instead, we directly derive the formulation of ∂J/∂h_{t,u}^k, the derivative with respect to the logit of label k, as

∂J/∂h_{t,u}^k = (α(t,u) y_{t,u}^k / P(y|x)) [ β(t,u) - β(t,u+1) δ(k = y_{u+1}) - β(t+1,u) δ(k = ∅) ]    (13)

where y_{t,u}^k = P(k|t,u) and δ(·) is the indicator function. The tensor ∂J/∂y_{t,u} of one minibatch is very big, hence Eq. (13) saves a large amount of memory compared with Eq. (12). To further optimize this, we calculate the softmax in-place¹ for Eq. (6) and let ∂J/∂h_{t,u} take the memory storage of y_{t,u} in Eq. (13). In this way, for this part we only keep one very large tensor in memory (either h_{t,u} or y_{t,u}), compared to the standard chain-rule implementation, which needs to store three large tensors (y_{t,u}, ∂J/∂y_{t,u}, and ∂J/∂h_{t,u}) in memory.

¹ An in-place operation directly changes the content of a given tensor without making a copy.
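A sketch of the fused gradient at a single (t, u) grid point in the spirit of Eq. (13); it overwrites the softmax output with the gradient, so the intermediate ∂J/∂y of Eq. (12) is never materialized. Function and variable names are ours, and we work in the probability domain for clarity; production implementations work in log space for numerical stability:

```python
import numpy as np

def fused_grad_point(h, alpha, beta, beta_right, beta_down, label, blank, p_total):
    """dJ/dh at one (t, u) grid point, fused softmax + loss gradient.

    h: (V,) logits; alpha = alpha(t,u); beta = beta(t,u);
    beta_right = beta(t, u+1); beta_down = beta(t+1, u);
    label = index of y_{u+1}; blank = blank index; p_total = P(y|x).
    The softmax output y is overwritten with the gradient (no dJ/dy tensor).
    """
    y = np.exp(h - h.max())
    y /= y.sum()                       # softmax, Eq. (6)
    scale = np.full_like(y, beta)
    scale[label] -= beta_right         # emitting y_{u+1} moves to grid point (t, u+1)
    scale[blank] -= beta_down          # emitting blank moves to grid point (t+1, u)
    y *= (alpha / p_total) * scale     # y now holds dJ/dh for this grid point
    return y

# Consistency check: when beta(t,u) satisfies the backward recursion
# beta(t,u) = y(t,u) beta(t,u+1) + blank(t,u) beta(t+1,u),
# the gradient sums to zero (softmax is invariant to a constant logit shift).
rng = np.random.default_rng(0)
h = rng.standard_normal(6)
probs = np.exp(h - h.max()); probs /= probs.sum()
label, blank = 2, 0
beta_right, beta_down = 0.3, 0.5
beta = probs[label] * beta_right + probs[blank] * beta_down
g = fused_grad_point(h, 0.7, beta, beta_right, beta_down, label, blank, 0.1)
```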
With the above two methods, the memory cost is reduced significantly, which allows a larger training minibatch. For an RNN-T model with around 4,000 output labels, the minibatch size could be increased from 2,000 to 4,000 when trained on a V100 GPU with 16 gigabytes of memory. The improvement is even larger for an RNN-T model with 36,000 output labels, whose minibatch size could be increased from 500 to 2,000.
4 Model Structure Exploration
In the latest successful RNN-T work [18], layer-normalized LSTMs with a projection layer were used in both the encoder and prediction networks. We describe this structure in Section 4.1 and use it as the baseline in this study. From Section 4.2 to Section 4.6, we propose different structures for RNN-T. These new structures achieve more modeling power by 1) decoupling the target classification task and the temporal modeling task with separate modeling units, and 2) exploring future context frames to generate a more informative encoder output.
4.1 Layer normalized LSTM
In [18], layer normalization and a projection layer for the LSTM were reported to be important to the success of RNN-T modeling. Following [21], we define the layer normalization function LN(v; γ, β) for a vector v, given an adaptive gain γ and bias β, as

μ = (1/D) Σ_{d=1}^{D} v_d    (14)

σ² = (1/D) Σ_{d=1}^{D} (v_d - μ)²    (15)

LN(v; γ, β) = γ ⊙ (v - μ)/σ + β    (16)

where D is the dimension of v, and ⊙ is the element-wise product.
Then, we define the layer-normalized LSTM function with a projection layer as

i_t = σ(LN(W_i x_t + R_i h_{t-1} + b_i))    (17)
f_t = σ(LN(W_f x_t + R_f h_{t-1} + b_f))    (18)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ φ(LN(W_c x_t + R_c h_{t-1} + b_c))    (19)
o_t = σ(LN(W_o x_t + R_o h_{t-1} + b_o))    (20)
m_t = o_t ⊙ φ(LN(c_t))    (21)
h_t = W_proj m_t    (22)

where each LN has its own gain and bias, omitted for brevity. The vectors i_t, o_t, f_t, and c_t are the activations of the input, output, and forget gates and the memory cells, respectively, and h_t is the output of the LSTM. W_• and R_• are the weight matrices for the inputs x_t and the recurrent inputs h_{t-1}, respectively, b_• are bias vectors, and W_proj is the projection matrix. The functions σ and φ are the logistic sigmoid and hyperbolic tangent nonlinearity, respectively.
For the multi-layer LSTM, with h_t^0 = x_t, the hidden output of the l-th layer at time t is

h_t^l = LSTM(h_t^{l-1}, h_{t-1}^l)    (23)

and we use the last hidden layer outputs h_t^{L_e} and h_u^{L_p} of the encoder and prediction networks as h_t^enc and h_u^pre, where L_e and L_p denote the numbers of layers in the encoder and prediction networks, respectively.
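A minimal NumPy sketch of one step of Eqs. (14)-(22); the parameter layout and the small sizes are our assumptions for illustration (not the 1280p640 configuration of Section 5), and each pre-activation gets its own layer-norm gain and bias:

```python
import numpy as np

def layer_norm(v, gain, bias, eps=1e-5):
    # Eqs. (14)-(16): normalize by the vector's own mean and std, then scale and shift
    mu = v.mean()
    sigma = np.sqrt(((v - mu) ** 2).mean() + eps)
    return gain * (v - mu) / sigma + bias

def ln_lstm_step(x, h_prev, c_prev, p):
    """One step of a layer-normalized LSTM with projection, Eqs. (17)-(22)."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    def pre(g):  # layer-normalized pre-activation for gate/candidate g
        return layer_norm(p["W_" + g] @ x + p["R_" + g] @ h_prev + p["b_" + g],
                          p["gain_" + g], p["bias_" + g])
    i = sig(pre("i"))                          # input gate,  Eq. (17)
    f = sig(pre("f"))                          # forget gate, Eq. (18)
    c = f * c_prev + i * np.tanh(pre("c"))     # memory cell, Eq. (19)
    o = sig(pre("o"))                          # output gate, Eq. (20)
    m = o * np.tanh(layer_norm(c, p["gain_cell"], p["bias_cell"]))  # Eq. (21)
    return p["W_proj"] @ m, c                  # projected output, Eq. (22)

rng = np.random.default_rng(0)
n_in, n_hid, n_proj = 6, 8, 4                  # illustrative sizes
p = {"W_proj": rng.standard_normal((n_proj, n_hid)) * 0.1,
     "gain_cell": np.ones(n_hid), "bias_cell": np.zeros(n_hid)}
for g in "ifco":
    p["W_" + g] = rng.standard_normal((n_hid, n_in)) * 0.1
    p["R_" + g] = rng.standard_normal((n_hid, n_proj)) * 0.1
    p["b_" + g] = np.zeros(n_hid)
    p["gain_" + g] = np.ones(n_hid)
    p["bias_" + g] = np.zeros(n_hid)
h, c = ln_lstm_step(rng.standard_normal(n_in), np.zeros(n_proj), np.zeros(n_hid), p)
```

Note the recurrent input has the projected size, which is what shrinks both the recurrent matrices and the model footprint.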
4.2 Layer trajectory LSTM
Recently, significant accuracy improvement was reported with the layer trajectory LSTM (ltLSTM) model [24, 25], in which depth-LSTMs scan the outputs of multi-layer time-LSTMs to extract information for label classification. By decoupling the temporal modeling and senone classification tasks, the ltLSTM allows the time- and depth-LSTMs to focus on their individual tasks. The model structure is illustrated in Figure 2.
Following [24], we define the layer-normalized ltLSTM formulation as

h_t^l = LSTM_T(h_t^{l-1}, h_{t-1}^l, c_{t-1}^{T,l})    (24)
g_t^l = LSTM_D(h_t^l, g_t^{l-1}, c_t^{D,l-1})    (25)

with the LSTM function defined from Eq. (17) to Eq. (23), where c_{t-1}^{T,l} and c_t^{D,l-1} are the memory cells of the time-LSTM (time t-1, layer l) and the depth-LSTM (time t, layer l-1), respectively. The time-LSTM formulation in Eq. (24) performs temporal modeling, while the depth-LSTM formulation in Eq. (25) works across layers for target classification. g_t^l is the output of the depth-LSTM at time t and layer l. Similarly, we can use the last hidden layer outputs g_t^{L_e} and g_u^{L_p} of the encoder and prediction networks as h_t^enc and h_u^pre for RNN-T.
4.3 Contextual layer trajectory LSTM
In [26], the contextual layer trajectory LSTM (cltLSTM) was proposed to further improve the performance of ltLSTM models by using context frames to capture future information. In the cltLSTM, the depth-LSTM output g_t^l used in Eq. (25) is replaced by the lookahead embedding vector ζ_t^l in order to incorporate future context information. The embedding vector is computed from the depth-LSTM outputs as

ζ_t^l = Σ_{δ=0}^{τ} W_δ^l g_{t+δ}^l    (26)
h_t^{l+1} = LSTM_T(ζ_t^l, h_{t-1}^{l+1}, c_{t-1}^{T,l+1})    (27)
g_t^{l+1} = LSTM_D(h_t^{l+1}, ζ_t^l, c_t^{D,l})    (28)

where W_δ^l denotes the weight matrix applied to the depth-LSTM output g_{t+δ}^l. Note that it is only meaningful to have future context lookahead for the encoder network; therefore, ζ_t^{L_e} of the encoder network can be used as h_t^enc for RNN-T. In Eq. (26), every layer has τ frames of lookahead. To generate ζ_t^{L_e}, we have L_e τ frames of lookahead in total.
4.4 Layer normalized gated recurrent units
Similar to [18], our target is to deploy RNN-T for device ASR; therefore, footprint is a major factor when developing models. As the gated recurrent unit (GRU) [11] is usually lighter weight than the LSTM, we also use it as a building block for RNN-T and define the layer-normalized GRU function as

z_t = σ(LN(W_z x_t + R_z h_{t-1} + b_z))    (29)
r_t = σ(LN(W_r x_t + R_r h_{t-1} + b_r))    (30)
n_t = φ(LN(W_n x_t + r_t ⊙ R_n h_{t-1} + b_n))    (31)
h_t = (1 - z_t) ⊙ n_t + z_t ⊙ h_{t-1}    (32)

where z_t, r_t, and n_t are the update gate, reset gate, and candidate activation, respectively.
Again, W_• and R_• are the weight matrices for the inputs x_t and the recurrent inputs h_{t-1}, respectively, and b_• are bias vectors. For the multi-layer GRU, with h_t^0 = x_t, the hidden output of the l-th layer at time t is

h_t^l = GRU(h_t^{l-1}, h_{t-1}^l)    (33)
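A corresponding sketch of one layer-normalized GRU step, Eqs. (29)-(32); parameter names and the small sizes are illustrative:

```python
import numpy as np

def ln(v, gain, bias, eps=1e-5):
    # Eqs. (14)-(16)
    mu = v.mean()
    sigma = np.sqrt(((v - mu) ** 2).mean() + eps)
    return gain * (v - mu) / sigma + bias

def ln_gru_step(x, h_prev, p):
    """One step of a layer-normalized GRU, Eqs. (29)-(32)."""
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sig(ln(p["W_z"] @ x + p["R_z"] @ h_prev + p["b_z"], p["g_z"], p["s_z"]))  # update gate
    r = sig(ln(p["W_r"] @ x + p["R_r"] @ h_prev + p["b_r"], p["g_r"], p["s_r"]))  # reset gate
    n = np.tanh(ln(p["W_n"] @ x + r * (p["R_n"] @ h_prev) + p["b_n"],
                   p["g_n"], p["s_n"]))                                           # candidate
    return (1 - z) * n + z * h_prev   # Eq. (32): interpolate candidate and previous state

rng = np.random.default_rng(0)
n_in, n_hid = 6, 8
p = {}
for g in "zrn":
    p["W_" + g] = rng.standard_normal((n_hid, n_in)) * 0.1
    p["R_" + g] = rng.standard_normal((n_hid, n_hid)) * 0.1
    p["b_" + g] = np.zeros(n_hid)
    p["g_" + g] = np.ones(n_hid)
    p["s_" + g] = np.zeros(n_hid)
h = ln_gru_step(rng.standard_normal(n_in), np.zeros(n_hid), p)
```

The GRU has three gate blocks instead of four and no separate memory cell or projection matrix, which is where its smaller footprint comes from.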
Note that all the networks in this study are unidirectional, and our layer-normalized GRU is different from the bidirectional GRU with batch normalization used in [5]. Layer normalization works better than batch normalization for recurrent neural networks [21].

4.5 Layer trajectory GRU
Similar to the layer-normalized ltLSTM presented in Section 4.2, we propose the layer trajectory GRU (ltGRU) with layer normalization in this study as

h_t^l = GRU_T(h_t^{l-1}, h_{t-1}^l)    (34)
g_t^l = GRU_D(h_t^l, g_t^{l-1})    (35)

The time-GRU formulation in Eq. (34) performs temporal modeling, while the depth-GRU formulation in Eq. (35) works across layers for target classification. g_t^l is the output of the depth-GRU at time t and layer l. We can use the last hidden layer outputs g_t^{L_e} and g_u^{L_p} of the encoder and prediction networks as h_t^enc and h_u^pre for RNN-T.
4.6 Contextual layer trajectory GRU
To further improve the performance of ltGRU models, we extend the ltGRU of Section 4.5 with a lookahead embedding vector ζ_t^l that incorporates future context information. The embedding vector is computed from the depth-GRU outputs as

ζ_t^l = Σ_{δ=0}^{τ} v_δ^l ⊙ g_{t+δ}^l    (36)
h_t^{l+1} = GRU_T(ζ_t^l, h_{t-1}^{l+1})    (37)
g_t^{l+1} = GRU_D(h_t^{l+1}, ζ_t^l)    (38)

Different from Eq. (26), Eq. (36) uses vectors v_δ^l instead of matrices to integrate the future context from the depth-GRU. This further reduces the model footprint; hence we call this model the element-wise contextual layer trajectory GRU (ecltGRU). ζ_t^{L_e} of the encoder network can then be used as h_t^enc for RNN-T. In Eq. (36), every layer has τ frames of lookahead, and we have L_e τ frames of lookahead in total to generate ζ_t^{L_e}.
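The footprint difference between the matrix-based lookahead of Eq. (26) and the element-wise lookahead of Eq. (36) is easy to quantify; the hidden size and τ below are illustrative, and variable names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, tau = 640, 4                        # illustrative hidden size and per-layer lookahead
g = rng.standard_normal((tau + 1, dim))  # depth outputs g_t, ..., g_{t+tau} at one layer

# Eq. (26) style (cltLSTM): one matrix per lookahead offset
W = rng.standard_normal((tau + 1, dim, dim)) * 0.01
zeta_mat = sum(W[d] @ g[d] for d in range(tau + 1))

# Eq. (36) style (ecltGRU): one vector per offset, element-wise product
v = rng.standard_normal((tau + 1, dim))
zeta_elem = (v * g).sum(axis=0)

params_mat = (tau + 1) * dim * dim       # 2,048,000 parameters per layer
params_elem = (tau + 1) * dim            # 3,200 parameters per layer
```

The element-wise variant costs three orders of magnitude fewer lookahead parameters per layer, which is why the ecltGRU keeps almost the same size as the ltGRU in Section 5.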
5 Experiments
In this section, we evaluate the effectiveness of the proposed models. In our experiments, all models are unidirectional and were trained with 30 thousand (k) hours of anonymized and transcribed Microsoft production data, including Cortana and Conversation data recorded in both close-talk and far-field conditions. We evaluated all models on Cortana and Conversation test sets. Both sets contain mixed close-talk and far-field recordings, with 439k and 111k words, respectively. The Cortana set has shorter utterances related to voice search and commands, while the Conversation set has longer utterances from conversations. We also evaluate the models on a third test set, named DMA, with 29k words, which is in neither the Cortana nor the Conversation domain. The DMA domain was unseen during model training and serves to evaluate the generalization capability of the models.
5.1 RNN-T models with greedy search
Table 1: Size and WER (%) of RNN-T models with different encoder (enc.) and prediction (pred.) network structures, evaluated with greedy search.

enc.     pred.    size (MB)  Cortana  Conv.  DMA
LSTM     LSTM     255        12.03    22.18  26.78
ltLSTM   LSTM     422        11.11    22.41  24.70
ltLSTM   ltLSTM   482        10.91    22.51  25.15
cltLSTM  LSTM     469         9.92    19.86  23.03
GRU      GRU      139        12.30    22.90  26.50
ltGRU    GRU      216        11.60    21.74  24.99
ltGRU    ltGRU    235        11.58    21.69  26.00
ecltGRU  GRU      216        10.68    19.76  23.22
In Table 1, we compare different structures of RNN-T models in terms of model size and word error rate (WER) using greedy search. The computational cost of RNN-T models at runtime is proportional to the model size. The input feature is an 80-dimension log Mel filter bank computed every 10 milliseconds (ms). Three of these are stacked together to form a frame of 240-dimension input acoustic feature to the encoder network. All encoder (enc.) networks have 6 hidden layers, and all prediction (pred.) networks have 2 hidden layers. The joint network always outputs a vector with dimension 640. The output layer models 4,096 word-piece units together with the blank label. The word-piece units are generated by running byte pair encoding [40] on the acoustic training texts. Similar to the latest successful RNN-T model on device [18], our first model uses LSTMs for both the encoder and prediction networks. The encoder network has 6 layers of layer-normalized LSTMs with 1,280 hidden units at each layer, whose output size is then reduced to 640 using a linear projection layer. We denote this layer-normalized LSTM structure as 1280p640. The prediction network has 2 hidden layers, each with the 1280p640 LSTM structure. This model has a total size of 255 megabytes (MB) and serves as our baseline RNN-T model.

Then, we explore the ltLSTM structure described in Section 4.2 for the encoder and prediction networks. All LSTM components in the ltLSTM use the 1280p640 structure. Simply using the ltLSTM in the encoder significantly reduces the WERs on the Cortana and DMA test sets, but increases the model size from 255 MB to 422 MB. Further using the ltLSTM in the prediction network increases the model size to 482 MB without clearly benefiting the WER. Hence, using the ltLSTM only in the encoder is a good setup.

Using the cltLSTM described in Section 4.3 in the encoder network, with τ = 4 at each layer, improves the WERs significantly: 17.5%, 10.5%, and 14.0% relative WER reduction on the Cortana, Conversation, and DMA test sets, respectively, over the baseline RNN-T model, which uses layer-normalized LSTMs in both the encoder and prediction networks. This clearly shows the advantage of using future context in the encoder network. However, this model also has a very large size, 469 MB, which especially brings challenges to deployment on devices.

Next, the layer-normalized LSTM is replaced with the layer-normalized GRU described in Section 4.4. The GRU in this study has 800 units at each layer without any projection layer. When using the layer-normalized GRU in both the encoder and prediction networks, we significantly reduce the RNN-T model size to 139 MB while obtaining only slightly worse WERs than the baseline RNN-T model using LSTMs.

The RNN-T model using the ltGRU proposed in Section 4.5 in the encoder network has a model size of 216 MB and significantly improves over the RNN-T model with the GRU encoder. Again, applying the ltGRU to the prediction network does not bring any additional benefit but increases the model size.

Finally, the ecltGRU, with τ = 4 at each layer, has almost the same size as the ltGRU encoder model, due to the use of vectors instead of matrices when incorporating future context, but significantly improves the WERs thanks to the future context access. This model (216 MB) is smaller than the baseline RNN-T model (255 MB), which uses LSTMs in the encoder and prediction networks. At the same time, it outperforms the baseline RNN-T model with 11.2%, 10.9%, and 13.3% relative WER reduction on the Cortana, Conversation, and DMA test sets, respectively.

Given the good performance of the ecltGRU, we evaluate the impact of the per-layer future context in Figure 3 by varying the value of τ. When τ = 0, the ecltGRU model reduces to the ltGRU model. With larger τ, the WERs drop monotonically. Setting τ to 3 or 4 does not make much WER difference, while a smaller τ does reduce the model latency through less future context access.
5.2 Comparison with hybrid models

Table 2: Size and WER (%) of hybrid models and RNN-T models, evaluated with beam search.

model   AM / enc.  LM / pred.     size (MB)  Cortana  Conv.  DMA
hybrid  LSTM       5-gram         5120        9.35    18.82  20.18
hybrid  LSTM       pruned 5-gram   218       10.92    20.22  23.05
RNN-T   LSTM       LSTM            255        9.94    19.70  23.19
RNN-T   ecltGRU    GRU             216        9.28    18.22  20.45
In Table 2, we compare RNN-T models decoded with beam search against hybrid models trained with the cross-entropy criterion on the Cortana, Conversation, and DMA test sets. Note that all models, whether RNN-T or hybrid, can be further improved by sequence discriminative training. The acoustic model (AM) of the hybrid system is a 6-layer LSTM with 1,024 hidden units at each layer, whose output size is reduced to 512 using a linear projection layer. The softmax layer has a 9,404-dimension output to model the senone labels. The input feature is also an 80-dimension log Mel filter bank computed every 10 ms. We applied frame skipping by a factor of 2 [29] to reduce the runtime cost, which corresponds to 20 ms per frame. This AM has a size of 120 MB. The language model (LM) is a 5-gram with around 100 million (M) n-grams, which is compiled into a graph of 5 gigabytes (GB); we refer to this configuration as the server setup. We also prune the LM for device usage, resulting in a graph of 98 MB. The decoding of these two hybrid systems uses our production setup, differing only in the LM size. We list the WERs of this AM combined with both the large LM and the pruned LM in Table 2. Clearly, the device setup with the small LM gives worse WERs than the server setup with the large LM.

The RNN-T decoding results are generated by beam search with beam width 10. Comparing the results in Table 2 and Table 1, we can see that beam search significantly improves the WERs over greedy search. Using the ecltGRU in the encoder network and the GRU in the prediction network outperforms the baseline RNN-T, which uses LSTMs everywhere, with 6.6%, 7.5%, and 11.8% relative WER reduction on the Cortana, Conversation, and DMA test sets, respectively. This best RNN-T model is 216 MB, smaller than the baseline RNN-T model with LSTM units. Moreover, its WERs match those of the hybrid model in the server setup, which uses a 5 GB LM. Compared to the device hybrid setup of similar size (218 MB), this best RNN-T model obtains 15.0%, 9.9%, and 11.3% relative WER reduction on the Cortana, Conversation, and DMA test sets, respectively. With its high recognition accuracy and small footprint, this RNN-T model is ideal for device ASR.
Finally, we examine the gap between the ground-truth word alignment, obtained by forced alignment with a hybrid model, and the word alignments generated by greedy decoding from the two RNN-T models in Table 2. As shown in Figure 4, the baseline RNN-T with the LSTM encoder has a larger delay with respect to the ground-truth alignment: the average delay is about 10 input frames. In contrast, the RNN-T model with the ecltGRU encoder has less alignment discrepancy, with an average delay of 2 input frames. This is because the ecltGRU encoder has 24 frames of lookahead in total, which provides more information to the RNN-T model so that it makes decisions earlier than the baseline. Because each input to the RNN-T models spans 30 ms (3 stacked 10 ms frames), the average latency of the RNN-T with the ecltGRU encoder is (24+2) × 30 ms = 780 ms, while the average latency of the RNN-T with the LSTM encoder is 10 × 30 ms = 300 ms. Hence, part of the accuracy advantage of the ecltGRU encoder comes from trading off latency.
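The latency bookkeeping in this comparison can be written out explicitly (variable names are ours; the numbers are taken from the text):

```python
# Latency accounting for the two encoders, at 30 ms per stacked input frame.
ms_per_input = 30                # each RNN-T input stacks three 10 ms feature frames
lookahead_frames = 6 * 4         # ecltGRU encoder: tau = 4 lookahead frames per layer, 6 layers
ecltgru_delay_frames = 2         # measured average emission delay of the ecltGRU model
lstm_delay_frames = 10           # measured average emission delay of the LSTM baseline

latency_ecltgru = (lookahead_frames + ecltgru_delay_frames) * ms_per_input  # 780 ms
latency_lstm = lstm_delay_frames * ms_per_input                             # 300 ms
```

The LSTM baseline has no lookahead, so its latency is pure emission delay; the ecltGRU pays its lookahead up front and then emits almost on time.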
6 Conclusions
In this paper, we improve RNN-T training for end-to-end ASR in two aspects. First, we reduce the memory consumption of RNN-T training with an efficient combination of the encoder and prediction network outputs, and by reformulating the gradient calculation to avoid storing multiple large tensors in memory. In this way, we significantly increase the minibatch size (otherwise constrained by memory usage) during training. Second, we improve upon the baseline RNN-T model structure, which uses LSTM units, by proposing several new structures. All proposed structures use the concept of layer trajectory, which decouples the classification task from the temporal modeling task by using depth LSTM/GRU units and time LSTM/GRU units, respectively. The best tradeoff between model size and accuracy is obtained by the RNN-T model with the ecltGRU in the encoder network and the GRU in the prediction network. The future context lookahead at each layer helps to build more accurate models.
Trained with 30 thousand hours of anonymized and transcribed Microsoft production data, this best RNN-T model achieves 6.6%, 7.5%, and 11.8% relative WER reduction on the Cortana, Conversation, and DMA test sets, respectively, over the baseline RNN-T model, while having a smaller model size (216 megabytes) that is ideal for deployment to devices. This best RNN-T model is also significantly better than the hybrid model of similar size, reducing relative WER by 15.0%, 9.9%, and 11.3% on Cortana, Conversation, and DMA, respectively, and it obtains WERs similar to those of the server-size hybrid setup of 5,120 megabytes.
References

[1] (2016) Deep Speech 2: end-to-end speech recognition in English and Mandarin. In ICML, pp. 173–182.
[2] (2018) Efficient implementation of recurrent neural network transducer in TensorFlow. In Proc. SLT, pp. 506–512.
[3] (2015) Neural machine translation by jointly learning to align and translate. In ICLR.
[4] (2016) End-to-end attention-based large vocabulary speech recognition. In Proc. ICASSP, pp. 4945–4949.
[5] (2017) Exploring neural transducers for end-to-end speech recognition. In Proc. ASRU.
[6] (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In Proc. ICASSP, pp. 4960–4964.
[7] (2017) Monotonic chunkwise attention. arXiv preprint arXiv:1712.05382.
[8] (2018) State-of-the-art speech recognition with sequence-to-sequence models. In Proc. ICASSP.
[9] (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP.
[10] (2015) Attention-based models for speech recognition. In NIPS.
[11] (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
[12] (2018) Advancing connectionist temporal classification with attention modeling. In Proc. ICASSP.
[13] (2019) Advancing acoustic-to-word CTC model with attention and mixed-units. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (12), pp. 1880–1892.
[14] (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In ICML, pp. 369–376.
[15] (2014) Towards end-to-end speech recognition with recurrent neural networks. In PMLR, pp. 1764–1772.
[16] (2012) Sequence transduction with recurrent neural networks. CoRR abs/1211.3711.
[17] (2018) Towards discriminatively-trained HMM-based end-to-end models for automatic speech recognition. In Proc. ICASSP.
[18] (2019) Streaming end-to-end speech recognition for mobile devices. In Proc. ICASSP, pp. 6381–6385.
[19] (2016) A prioritized grid long short-term memory RNN for speech recognition. In Proc. SLT, pp. 467–473.
[20] (2017) Residual LSTM: design of a deep recurrent architecture for distant speech recognition. arXiv preprint arXiv:1701.03360.
[21] (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
[22] (2015) LSTM time and frequency recurrence for automatic speech recognition. In Proc. ASRU.
[23] (2016) Exploring multidimensional LSTMs for large vocabulary ASR. In Proc. ICASSP.
[24] (2018) Layer trajectory LSTM. In Proc. Interspeech.
[25] (2018) Exploring layer trajectory LSTM with depth processing units and attention. In Proc. IEEE SLT.
[26] (2019) Improving layer trajectory LSTM with future context frames. In Proc. ICASSP.
[27] (2018) Advancing acoustic-to-word CTC model. In Proc. ICASSP, pp. 5794–5798.
[28] (2015) EESEN: end-to-end speech recognition using deep RNN models and WFST-based decoding. In Proc. ASRU, pp. 167–174.
[29] (2016) Simplifying long short-term memory acoustic models for fast training and decoding. In Proc. ICASSP.
[30] (2019) Triggered attention for end-to-end speech recognition. In Proc. ICASSP, pp. 5666–5670.
[31] (2010) Rectified linear units improve restricted Boltzmann machines. In Proc. ICML, pp. 807–814.
[32] (2017) A comparison of sequence-to-sequence models for speech recognition. In Proc. Interspeech, pp. 939–943.
[33] (2017) Highway-LSTM and recurrent highway networks for speech recognition. In Proc. Interspeech.
[34] (2017) Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer. In Proc. ASRU.
[35] (2018) Improving the performance of online neural transducer models. In Proc. ICASSP, pp. 5864–5868.
[36] (2016) Modeling time-frequency patterns with LSTM vs. convolutional architectures for LVCSR tasks. In Proc. Interspeech.
[37] (2015) Convolutional, long short-term memory, fully connected deep neural networks. In Proc. ICASSP, pp. 4580–4584.
[38] (2015) Learning acoustic frame labeling for speech recognition with recurrent neural networks. In Proc. ICASSP, pp. 4280–4284.
[39] (2017) Recurrent neural aligner: an encoder-decoder neural network model for sequence to sequence mapping. In Proc. Interspeech.
[40] (2015) Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
[41] (2016) Neural speech recognizer: acoustic-to-word LSTM model for large vocabulary speech recognition. arXiv preprint arXiv:1610.09975.
[42] (2016) Highway long short-term memory RNNs for distant speech recognition. In Proc. ICASSP.
[43] (2016) Multidimensional residual learning based on recurrent neural networks for acoustic modeling. In Proc. Interspeech, pp. 3419–3423.