Model Unit Exploration for Sequence-to-Sequence Speech Recognition

02/05/2019
by Kazuki Irie, et al.
Google
RWTH Aachen University

We evaluate attention-based encoder-decoder models along two dimensions: the choice of target unit (phoneme, grapheme, or word-piece), and the amount of available training data. We conduct experiments on the LibriSpeech 100hr, 460hr, and 960hr tasks; across all tasks, we find that grapheme or word-piece models consistently outperform phoneme-based models, even though they are evaluated without a lexicon or an external language model. On the 960hr task the word-piece model achieves a word error rate (WER) of 4.7% on test-clean and 13.4% on test-other without an external language model, and 3.6% and 10.3%, respectively, when decoded with an LSTM LM: the lowest reported numbers using sequence-to-sequence models. We also conduct a detailed analysis of the various models and investigate their complementarity: we find that we can improve WERs by up to 9% relative by rescoring the N-best list generated by the word-piece model with either the phoneme or the grapheme model. Rescoring an N-best list generated by the phonemic system, however, provides limited improvements. Further analysis shows that the word-piece-based models produce more diverse N-best hypotheses, and therefore lower oracle WERs, than the phonemic system.


1 Introduction

Sequence-to-sequence learning [1] based on encoder-decoder attention models [2] has become popular for both machine translation [3] and speech recognition [4, 5, 6, 7, 8]. Such models are typically trained to output character-based units: graphemes, byte-pair encodings [9], or word-pieces [10], which allow the model to directly map the frame-level input audio features to the output word sequence, without using a hand-crafted pronunciation lexicon. Thus, when using such character-based output units, end-to-end speech recognition models [11] jointly learn the acoustic model, pronunciation model, and language model within a single neural network. In fact, such models outperform the conventional approach [12] when trained on sufficiently large amounts of data [8].

While such a character-based sequence-to-sequence modeling approach certainly simplifies training and decoding, attention-based encoder-decoder models can also be trained with phoneme-based output targets. The use of phonemes as the output unit [13] makes it possible to integrate available pronunciation lexica into the sequence-to-sequence approach. However, it has been shown empirically in [13, 14] that grapheme-based models outperform the phoneme-based approach, while [13] finds that the lexicon is still useful for recognizing rare words such as named entities.

We revisit the comparison between sequence-to-sequence models with various output units with two motivations. First, we investigate whether the previous result [13], which establishes the dominance of lexicon-free graphemic models over phoneme-based models, also holds on tasks with smaller amounts of training data. We carry out evaluations on the three subsets of the LibriSpeech task [15]: 100hr, 460hr, and 960hr, where we find that grapheme or word-piece models do indeed consistently outperform phoneme-based models. Second, we investigate the complementarity of, and the potential for combining, models based on different units. In experimental evaluations, we find that simple N-best list rescoring results in large improvements in WER. Finally, we conduct a detailed analysis of the differences between the hypotheses produced by the models with various output units, in terms of the quality of the top hypothesis as well as the oracle error rate of the N-best list.

2 Sequence-to-sequence speech models

2.1 Listen, Attend, and Spell (LAS) Model

All our models are Listen, Attend, and Spell (LAS) [11] speech models. The LAS model has encoder, attention, and decoder modules, as depicted in Figure 1(a). The encoder transforms the input frame-level audio feature sequence into a sequence of hidden activations. The attention module summarizes the encoder sequence into a single vector for each prediction step, and finally, the decoder models the distribution of the output sequence conditioned on the history of previously predicted labels. Both the encoder and the decoder are modeled using recurrent neural networks, and thus the entire model can be jointly optimized. We refer the interested reader to [4, 16] for more details of attention-based models.

Figure 1: (a) LAS model (figure from [4]), (b) LAS with an auxiliary decoder: main decoder operates on graphemes and the auxiliary decoder predicts phonemes; dashed lines represent state copying for initialization at each word boundary.
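To make the encoder, attention, and decoder interaction concrete, the following is a minimal PyTorch-style sketch of a single decoder step with additive attention. It is illustrative only: the module names, dimensions, and interfaces are our own assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Additive attention [16]: score(h, s) = v^T tanh(W_h h + W_s s)."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.proj_enc = nn.Linear(enc_dim, attn_dim, bias=False)
        self.proj_dec = nn.Linear(dec_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, enc_out, dec_state):
        # enc_out: (B, T, enc_dim), dec_state: (B, dec_dim)
        scores = self.v(torch.tanh(
            self.proj_enc(enc_out) + self.proj_dec(dec_state).unsqueeze(1)))  # (B, T, 1)
        weights = torch.softmax(scores, dim=1)
        context = (weights * enc_out).sum(dim=1)  # single summary vector per step
        return context, weights

class LASDecoderStep(nn.Module):
    """One prediction step: embed previous label, attend, update the LSTM, predict."""
    def __init__(self, vocab_size, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dec_dim)
        self.attention = AdditiveAttention(enc_dim, dec_dim, attn_dim)
        self.rnn = nn.LSTMCell(dec_dim + enc_dim, dec_dim)
        self.output = nn.Linear(dec_dim, vocab_size)

    def forward(self, prev_label, enc_out, state):
        h, c = state
        context, _ = self.attention(enc_out, h)
        h, c = self.rnn(torch.cat([self.embed(prev_label), context], dim=-1), (h, c))
        logits = self.output(h)  # distribution over output units (phonemes/graphemes/word-pieces)
        return logits, (h, c)
```

A full model would wrap a bi-directional LSTM encoder around this step and loop it over the target sequence during training, or over beam-search hypotheses during decoding.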

2.2 LAS with an auxiliary decoder

While our work first focuses on the comparison of models with different output units, our ultimate goal is to benefit from combining different model units. For model combination (e.g., rescoring), a single neural network with multiple decoders operating on different modeling units can be convenient. We consider a two-decoder model with decoders operating on different units. The design of the model is illustrated in Figure 1(b). The main decoder (graphemes, in the example) works exactly as in the baseline LAS model. The auxiliary decoder (phonemes, in the example) is designed to predict only the next word as a sequence of the auxiliary units. We use separate parameters for the attention module and initialize all recurrent states of the auxiliary decoder at each word boundary with those of the main decoder. The model is trained in two stages: the main decoder and the encoder are trained first, and their parameters are kept fixed during the training of the auxiliary components. In experiments, we use word-pieces for the main decoder and phonemes for the auxiliary decoder.
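A minimal sketch of the state-copying idea, reusing the hypothetical LASDecoderStep sketched above: at every word boundary the auxiliary decoder's recurrent state is re-initialized from the main decoder's state, and gradients do not flow back into the frozen main decoder. The helper name and boundary handling are illustrative assumptions.

```python
def auxiliary_decoder_step(main_state, aux_decoder, aux_state,
                           prev_aux_label, enc_out, at_word_boundary):
    """One auxiliary (e.g. phoneme) decoder step with state copying.

    main_state: (h, c) of the main (word-piece) decoder, treated as constant.
    aux_state:  (h, c) of the auxiliary decoder, carried within a word.
    """
    if at_word_boundary:
        # Initialize the auxiliary recurrent state from the (frozen) main decoder.
        aux_state = tuple(s.detach() for s in main_state)
    logits, aux_state = aux_decoder(prev_aux_label, enc_out, aux_state)
    return logits, aux_state
```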

3 Phonemic sequence-to-sequence model

We investigate phoneme-based models, as in [13], to further explore the complementarity between graphemic and phonemic models. Also, conceptually, the use of a pronunciation lexicon can ease the integration of completely new words or named entities (this might not be as relevant for the LibriSpeech evaluation, since we use the official LibriSpeech lexicon without modification). However, by using the pronunciation lexicon to define the output units, we give up the end-to-end approach, which introduces complications for training and decoding.

For training, words with multiple pronunciation variants pose a problem, since there is no unique mapping from such a word to its corresponding phoneme sequence. While we could obtain the correct pronunciation variant by generating alignments, we skip this extra effort and instead randomly choose one of the pronunciations for each word to define a unique mapping. In addition, we include an unknown token UNK in the phoneme vocabulary and use it to represent words which are not included in the lexicon. We use a dedicated end-of-word token EOW (as part of the phoneme inventory) to model word boundaries, as in [13], which we find improves performance.
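As a concrete illustration of this target preparation, here is a small sketch assuming a lexicon that maps each word to a list of pronunciations; the lexicon format, the fixed random seed, and the UNK/EOW spellings are our own assumptions.

```python
import random

UNK, EOW = "<unk>", "<eow>"

def build_phoneme_targets(words, lexicon, seed=0):
    """Map a word sequence to phoneme targets: pick one pronunciation per word
    (chosen at random once per word type, for a unique mapping), use UNK for
    out-of-lexicon words, and append EOW at every word boundary."""
    rng = random.Random(seed)
    chosen = {}  # word -> single pronunciation
    targets = []
    for word in words:
        if word not in lexicon:
            targets.extend([UNK, EOW])
            continue
        if word not in chosen:
            chosen[word] = rng.choice(lexicon[word])
        targets.extend(list(chosen[word]) + [EOW])
    return targets

# Example (toy lexicon):
lexicon = {"the": [["DH", "AH"], ["DH", "IY"]], "cat": [["K", "AE", "T"]]}
print(build_phoneme_targets(["the", "cat", "bozzle"], lexicon))
# e.g. ['DH', 'AH', '<eow>', 'K', 'AE', 'T', '<eow>', '<unk>', '<eow>']
```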

To deal with the ambiguity of homophones during decoding, we incorporate a (word-based) n-gram language model. We use a general weighted finite-state transducer (WFST) decoder to perform a beam search. The lexicon and language model (LM) are represented as WFSTs L and G, respectively, and combined into the search network L ∘ G by means of FST composition [17]. The search process then explores partial path hypotheses which are constrained by the search network and scored by both the LAS model and the n-gram LM.
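The scoring of a partial hypothesis in this search amounts to a log-linear combination of the LAS score and a weighted n-gram LM score. The sketch below, with a hypothetical hypothesis structure and a single LM weight, is meant only to illustrate that combination, not the actual WFST decoder implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PartialHypothesis:
    phonemes: List[str] = field(default_factory=list)  # LAS output units so far
    words: List[str] = field(default_factory=list)     # words emitted along L o G arcs
    las_logprob: float = 0.0                            # sum of LAS log P(unit | history)
    lm_logprob: float = 0.0                             # sum of n-gram log P(word | history)

def combined_score(hyp: PartialHypothesis, lm_weight: float) -> float:
    """Score used to rank beam entries: LAS score plus a single, globally
    tuned weight on the external word-level n-gram LM score."""
    return hyp.las_logprob + lm_weight * hyp.lm_logprob

# During beam search, each arc taken in the L o G search network extends
# hyp.phonemes (scored by the LAS decoder) and, at word-final arcs,
# hyp.words (scored by the n-gram LM); the beam keeps the hypotheses with
# the highest combined_score(hyp, lm_weight).
```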

4 LibriSpeech Experimental Setup

4.1 Dataset

The LibriSpeech task [15] has three subsets with different amounts of transcribed training data: 100hr, 460hr, and 960hr. A lexicon with pronunciations for 200K words is officially distributed. The development and test data are both split into clean and other subsets, each consisting of about 5 to 6 hours of audio. The number of unique words observed in each subset, as well as the out-of-vocabulary (unseen in training data) rate, is summarized in Table 1. For language modeling, extra text-only data of about 800M words is also available, along with an officially distributed 3-gram word LM; we use the unpruned 3-gram LM for decoding the phonemic LAS models. In contrast, the grapheme and word-piece models are evaluated without a lexicon or an LM (unless otherwise indicated). We train word-piece models [10] of size 16K (16,384) on each training subset.

Training data (h)   Vocab. size   dev-clean   dev-other   test-clean   test-other
100                 34 K          2.5         2.5         2.4          2.8
460                 66 K          0.9         1.2         1.0          1.3
960                 89 K          0.6         0.8         0.6          0.8
Lexicon             200 K         0.3         0.6         0.4          0.5
Table 1: Out-of-vocabulary (OOV) rates (%) with respect to the vocabulary (unique word list) in different data scenarios, and with respect to the pronunciation lexicon.
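The 16K word-piece inventories mentioned above can be reproduced approximately with an open-source subword tool; the sketch below uses SentencePiece as a stand-in for the word-piece model of [10] (which uses a different, likelihood-based procedure), and the file names are placeholders, not the authors' setup.

```python
import sentencepiece as spm

# Train a 16,384-entry subword model on the training transcripts of one subset.
spm.SentencePieceTrainer.train(
    input="librispeech_960hr_transcripts.txt",  # one transcript per line (placeholder path)
    model_prefix="wpm_960hr_16k",
    vocab_size=16384,
    model_type="bpe",  # or "unigram"; only an approximation of the word-piece model of [10]
)

sp = spm.SentencePieceProcessor(model_file="wpm_960hr_16k.model")
print(sp.encode("the cat sat on the mat", out_type=str))
```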

4.2 Models and training

We use 80-dimensional log-mel features with deltas and accelerations as the frame-level audio input features. Reducing the input frame rate in the encoder is important for successfully training sequence-to-sequence speech models, especially for tasks such as LibriSpeech which feature long utterances (15 s). Thus, following [18], our encoder includes two layers of 3×3 convolution with 32 channels and a stride of 2, which results in a total time reduction factor of 4. We consider three model configurations (small, medium, and large) which differ in the sizes of the model components. On top of the convolutional layers, the encoder contains 3 (small) or 4 (medium and large) layers of bi-directional LSTMs [19], with either 256 (small), 512 (medium), or 1024 (large) LSTM [20] cells in each layer. A projection layer and batch normalization are applied after each encoder layer [18]. The decoder consists of 1 (small) or 2 (medium and large) LSTM layers, and uses additive attention as described in [16].
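The three configurations can be summarized in a small config structure; the field names below are our own, while the numbers are taken from the description above.

```python
from dataclasses import dataclass

@dataclass
class LASConfig:
    enc_conv_layers: int = 2      # 3x3 conv, 32 channels, stride 2 -> total 4x time reduction
    enc_lstm_layers: int = 4      # bi-directional LSTM layers in the encoder
    enc_lstm_cells: int = 1024    # LSTM cells in each encoder layer
    dec_lstm_layers: int = 2      # decoder LSTM layers, with additive attention [16]
    # A projection layer and batch normalization follow each encoder layer [18].

SMALL  = LASConfig(enc_lstm_layers=3, enc_lstm_cells=256,  dec_lstm_layers=1)
MEDIUM = LASConfig(enc_lstm_layers=4, enc_lstm_cells=512,  dec_lstm_layers=2)
LARGE  = LASConfig(enc_lstm_layers=4, enc_lstm_cells=1024, dec_lstm_layers=2)
```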

We train all models using 16 GPUs by asynchronous stochastic gradient descent with the Adam optimizer [21], from random initialization and without any special pre-training method (we find training to be stable across repeated runs; we also find that our models achieve their best WER on the dev-clean set earlier than on the dev-other set).

5 Standalone Performance Results

5.1 Baseline model performance on 960hr

The WER performance of grapheme and word-piece based models is summarized in Table 2. For both graphemes and word-pieces, we present the performance for small, medium, and large model sizes (indicated by the different numbers of parameters), as described in Sec. 4.2. The difference in the number of parameters between different units comes only from the unit-level vocabulary size. As can be seen in Table 2, the models benefit from the larger numbers of parameters, and the best WERs are obtained with the large word-piece model.

Unit         Param.   dev-clean   dev-other   test-clean   test-other
Grapheme     7 M      7.6         20.5        7.9          21.3
             35 M     5.3         15.6        5.6          15.8
             130 M    5.3         15.2        5.5          15.3
Word-Piece   20 M     5.8         16.0        6.1          16.4
             60 M     4.9         14.0        5.0          14.1
             180 M    4.4         13.2        4.7          13.4
Table 2: WERs (%) for grapheme and word-piece models.

For phoneme-based models, we first check the phoneme error rates (PER) in order to make sure that the models are reasonable and that a larger number of parameters improves performance: by increasing the model size from 7 M to 35 M, then to 130 M parameters, the PERs (%) on the (dev-clean, dev-other, test-clean, test-other) sets improve from (3.2, 9.7, 3.2, 9.9) to (2.8, 8.9, 3.0, 9.1), then to (2.4, 7.9, 2.5, 7.7). The WER results for decoding with the lexicon and the 3-gram word LM (88M n-grams) are shown in Table 3. We observe that, despite the use of an external LM which is trained on much more data than the transcription text, the phonemic system performs worse than the best graphemic model. This is similar to what is reported in [13]. (We note that we get about 2% absolute degradation in WER with a model trained without EOW, compared with the model with EOW.) It is nevertheless interesting to look at examples where the phonemic model outperforms the best word-piece model; Table 4 presents some illustrative examples.

In Table 3, we also include WERs from previous work on LibriSpeech 960hr. Our word-piece model performs better than the previously reported best sequence-to-sequence model [22], while its performance is behind that of the conventional hybrid system with an n-gram LM [23]. It can be noted that our word-piece model, trained simply with the cross-entropy criterion (without, e.g., minimum word error rate training [24]), is competitive with Sabour et al.'s model trained with optimal completion distillation [25], which is reported to give 4.5% and 13.3% on the test-clean and test-other sets. For further comparison, we also report the WERs of our best word-piece model combined with an LSTM LM [26] by shallow fusion [27, 28]. We obtain relative improvements similar to those reported in [22] and achieve WERs of 3.6% on test-clean and 10.3% on test-other, which largely reduces the performance gap to the best hybrid system reported in [23].

Unit                              LM       dev-clean   dev-other   test-clean   test-other
Phoneme                           3-gram   5.6         15.8        6.2          15.8
Grapheme                          None     5.3         15.2        5.5          15.3
Word-Piece 16K                    None     4.4         13.2        4.7          13.4
Word-Piece 16K                    LSTM     3.3         10.3        3.6          10.3
BPE 10K (Zeyer et al. [22])       None     4.9         14.4        4.9          15.4
                                  LSTM     3.5         11.5        3.8          12.8
Hybrid system (Han et al. [23])   N-gram   3.4         8.8         3.6          8.9
                                  LSTM     3.1         8.3         3.5          8.6
Table 3: WERs (%) for the 960hr dataset.
Phoneme                         Word-Piece
when did you come bartley       when did you come partly
kirkland jumped for the jetty   kerklin jumped for the jetty
man’s eyes remained fixed       man’s eyes were made fixed
Table 4: Examples where the phonemic system’s 1-best wins against the word-piece model’s 1-best.

5.2 Results on 100hr and 460hr tasks

We conduct the same experiments in the 100hr and 460hr conditions. For each unit, we obtain the best performance with the large models in the 460hr scenario, whereas in the 100hr case the medium models perform best. The results are summarized in Table 5. We find that even in the smaller-data scenarios with higher OOV rates, graphemic and word-piece based models outperform the phonemic system. We also note that the performance of attention-based models degrades dramatically when the amount of training data is reduced, unlike the conventional hybrid approach [15].

Train data   Unit         dev-clean   dev-other   test-clean   test-other
460hr        Phoneme      7.6         27.3        8.5          27.8
             Grapheme     6.4         23.5        6.8          24.1
             Word-Piece   5.7         21.8        6.5          22.5
100hr        Phoneme      13.8        38.9        14.3         40.9
             Grapheme     11.6        36.1        12.0         38.0
             Word-Piece   12.7        33.9        12.9         35.5
Table 5: WERs (%) for the 460hr and 100hr scenarios.

6 Rescoring Experiments

We consider two methods for combining LAS models with different output units. The first approach is simple N-best list rescoring: we generate an N-best list from one LAS model, convert the corresponding word sequences into the rescoring LAS model's units, score them, and combine the scores by log-linear interpolation to obtain new scores. However, rescoring is limited to the hypotheses generated by one LAS model. Therefore, we also carry out a union of N-best lists with cross-rescoring: we independently generate N-best lists from two LAS models, rescore the hypotheses generated by one model with the other model and vice versa, and select the 1-best from the union of the (up to) 2N rescored hypotheses.
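A minimal sketch of both procedures, assuming each model exposes a hypothetical `score(words)` method that converts the word sequence to its own units and returns a log-probability; the interpolation weight `lam` corresponds to the weights tuned on dev-clean as described in the next subsection.

```python
def rescore_nbest(nbest, primary, rescorer, lam):
    """Simple N-best rescoring: log-linear interpolation of two model scores.
    `nbest` is a list of word sequences from the primary model's beam search."""
    return max(nbest,
               key=lambda words: primary.score(words) + lam * rescorer.score(words))

def union_cross_rescore(nbest_a, nbest_b, model_a, model_b, lam):
    """Union of N-best lists with cross-rescoring: every hypothesis from either
    list is scored by both models; the best of up to 2N hypotheses is returned."""
    union = {tuple(words) for words in nbest_a + nbest_b}
    best = max(union,
               key=lambda words: model_a.score(list(words)) + lam * model_b.score(list(words)))
    return list(best)
```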

6.1 N-best Rescoring results

We carry out N-best rescoring of our best word-piece based model in the 960hr scenario with a graphemic and with a phonemic model. In all of the following experiments, the interpolation weights are optimized to obtain the best dev-clean WER (which typically also gives the best dev-other WER). The WERs are presented in the upper part of Table 6. We obtain improvements of 9% relative in both cases on the test-clean set, and 8% relative with the phonemic model and 9% relative with the graphemic model on the test-other set. Thus, rescoring is a simple method for making use of a phonemic model without an additional language model. To determine whether the gains from the graphemic and phonemic models are additive, we combine the scores from all models, which yields only slight further improvements of up to 0.1% absolute, as shown in Table 6 (+ Both). In Table 7, we show some illustrative examples on which the addition of the phonemic model gives lower WERs than the combination of the word-piece and grapheme based models alone.

In the other direction, we also rescore the N-best list generated by the phonemic system with the word-piece model. The results are shown in the lower part of Table 6. We find that the improvements are limited (only up to 4% relative). In fact, the 30-best list generated by the phonemic system has much higher oracle WERs than the 8-best list of the word-piece model.

Model          dev-clean   dev-other     test-clean   test-other
Word-Piece     4.4 (2.4)   13.2 (9.2)    4.7 (2.6)    13.4 (9.1)
+ Phoneme      4.1         12.4          4.3          12.4
+ Grapheme     4.0         12.3          4.3          12.3
+ Both         3.9         12.2          4.3          12.2
Phoneme        5.6 (4.9)   15.8 (14.4)   6.2 (5.5)    15.8 (14.7)
+ Word-Piece   5.4         15.5          6.0          15.5
Table 6: WER (%) results for N-best list rescoring. Oracle WERs are shown in parentheses.
WP+G+P                            WP+G
oh bartley did you write to me    oh bartly did you write to me
… lettuce leaf with mayonnaise    … lettuce leaf with mayonna is …
the manager fell to his musings   the manager felt of his musings
what a fuss is made about you     what are fusses made about you
… eyes blazed with indignation    … eyes blaze of indignation
Table 7: Examples where Word-Piece+Grapheme+Phoneme (WP+G+P) wins over Word-Piece+Grapheme (WP+G).

Finally, Table 8 shows the improvements from rescoring with an auxiliary phoneme decoder, using the two-decoder model described in Sec. 2.2. We obtain improvements despite the small number of additional parameters (30 M), corresponding to the phonemic 2-layer LSTM decoder and the attention layer; however, rescoring with an independent phoneme model gives larger improvements.

Model                    dev-clean   dev-other   test-clean   test-other   Total Param.
Word-Piece (WP)          4.4         13.2        4.7          13.4         180 M
WP + Auxiliary phoneme   4.3         13.0        4.6          13.1         210 M
WP + Phoneme             4.1         12.4        4.3          12.4         310 M
Table 8: WERs (%) for rescoring with an auxiliary decoder.

6.2 Union of N-best lists with cross-rescoring results

The examples in Table 4 show some complementarity between the word-piece 1-best hypotheses and the phonemic ones. To evaluate the potential value of the hypotheses generated by the phonemic model, we decode N-best lists from the word-piece based and phoneme based models independently, rescore the respective hypotheses with the other model (cross-rescoring), and take the 1-best from the 2N hypotheses (union). In Table 9, we observe that we obtain only marginal improvements on the test-other set compared with rescoring the 8-best word-piece hypotheses. For a fairer comparison, we also rescore 16-best lists generated by the word-piece model with the phonemic model. We find that this approach is slightly better than the union, which suggests that decoding from the phonemic model has limited benefits for the LibriSpeech task.

Model        Num hyp   dev-clean   dev-other    test-clean   test-other
Word-Piece   8         4.4 (2.4)   13.2 (9.2)   4.7 (2.6)    13.4 (9.1)
+ Phoneme              4.1         12.4         4.3          12.4
Union        16        4.1         12.4         4.3          12.3
Word-Piece   16        4.4 (2.0)   13.2 (8.3)   4.7 (2.2)    13.4 (8.1)
+ Phoneme              4.0         12.3         4.3          12.2
Table 9: WER (%) results for the union of N-best lists with cross-rescoring. Oracle WERs are shown in parentheses.

6.3 Why is Oracle WER So High for Phonemic System?

The oracle WERs are much worse for the phonemic system than for the word-piece model (Table 6). We observe that the phonemic system spends much of its search space on keeping hypotheses with different homophones, rather than allocating alternatives for difficult words. For example, on the reference utterance “bozzle had always waited upon him with a decent coat and a well brushed hat and clean shoes”, where bozzle is an OOV word, the word-piece based model fills the 8-best beam with different spellings of bozzle, such as {basil, bazil, basle, bosel, bosal, bosell, bossel}, which is a reasonable use of the search space. The phonemic system, instead, only produces {bazil, basil} as substitutions for bozzle, and lists homophones for shoes, {shoes, shews, shoos, shues, shooes}, instead. Homophone distinction is also inherently inefficient for a phonemic system, since the phonemic LAS model assigns all homophones the same score and a single parameter is used to weight the external LM for the entire search. Addressing this issue might be crucial for improving the phonemic system.
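For reference, the oracle WER reported in the tables is the WER obtained by picking the best hypothesis from each N-best list; a small sketch of its computation is below, using a standard word-level Levenshtein distance (the helper names are our own).

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,             # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution (0 if tokens match)
        prev = curr
    return prev[-1]

def oracle_wer(references, nbest_lists):
    """WER (as a fraction) obtained by picking, for each utterance, the N-best
    entry with the fewest errors against its reference."""
    errors = sum(min(edit_distance(ref, hyp) for hyp in nbest)
                 for ref, nbest in zip(references, nbest_lists))
    total_words = sum(len(ref) for ref in references)
    return errors / total_words
```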

7 Conclusion

Our experiments on the LibriSpeech dataset show that word-piece and grapheme based models consistently outperform phoneme based models. We find that word-piece based attention models achieve a relatively low oracle WER with only 8-best hypotheses, and that rescoring these N-best hypotheses with graphemic or phonemic models gives good improvements. Our experiments with phoneme-model decoding are, however, limited to composition with a 3-gram word-level LM. In future work, we would like to investigate the performance of the phonemic system with better language models.

References

  • [1] Ilya Sutskever, Oriol Vinyals, and Quoc V Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems (NIPS), Montreal, Canada, Dec. 2014, pp. 3104–3112.
  • [2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. Int. Conf. on Learning Representations (ICLR), San Diego, CA, USA, May 2015.
  • [3] Yonghui Wu, Mike Schuster, Zhifeng Chen, et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
  • [4] Rohit Prabhavalkar, Kanishka Rao, Tara Sainath, Bo Li, Leif Johnson, and Navdeep Jaitly, “A comparison of sequence-to-sequence models for speech recognition,” in Proc. Interspeech, Stockholm, Sweden, Aug. 2017, pp. 939–943.
  • [5] Suyoun Kim, Takaaki Hori, and Shinji Watanabe, “Joint ctc-attention based end-to-end speech recognition using multi-task learning,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, Mar. 2017, pp. 4835–4839.
  • [6] Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur, Yi Li, Hairong Liu, Sanjeev Satheesh, Anuroop Sriram, and Zhenyao Zhu, “Exploring neural transducers for end-to-end speech recognition,” in Proc. ASRU, Okinawa, Japan, Dec. 2017, pp. 206–213.
  • [7] Chao Weng, Jia Cui, Guangsen Wang, Jun Wang, Chengzhu Yu, Dan Su, and Dong Yu, “Improving attention based sequence-to-sequence models for end-to-end english conversational speech recognition,” in Proc. Interspeech, Hyderabad, India, Sept. 2018, pp. 761–765.
  • [8] Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, and Michiel Bacchiani, “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. ICASSP, Calgary, Canada, Apr. 2018, pp. 4774–4778.
  • [9] Rico Sennrich, Barry Haddow, and Alexandra Birch, “Neural machine translation of rare words with subword units,” in ACL, Berlin, Germany, August 2016, pp. 1715–1725.
  • [10] Mike Schuster and Kaisuke Nakajima, “Japanese and korean voice search,” in Proc. ICASSP, Kyoto, Japan, Mar. 2012, pp. 5149–5152.
  • [11] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: a neural network for large vocabulary conversational speech recognition,” in ICASSP, Shanghai, China, Mar. 2016, pp. 4960–4964.
  • [12] Hervé A. Bourlard and Nelson Morgan, Connectionist Speech Recognition: A Hybrid Approach, Kluwer Academic Publishers, Norwell, MA, USA, 1993.
  • [13] Tara N Sainath, Rohit Prabhavalkar, Shankar Kumar, Seungji Lee, Anjuli Kannan, David Rybach, Vlad Schogol, Patrick Nguyen, Bo Li, and Yonghui Wu, “No need for a lexicon? evaluating the value of the pronunciation lexica in end-to-end models,” in Proc. ICASSP, Calgary, Canada, Apr. 2018, pp. 5859–5863.
  • [14] Shiyu Zhou, Linhao Dong, Shuang Xu, and Bo Xu, “A comparison of modeling units in sequence-to-sequence speech recognition with the Transformer on Mandarin Chinese,” arXiv preprint arXiv:1805.06239, 2018.
  • [15] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “LibriSpeech: an ASR corpus based on public domain audio books,” in ICASSP, South Brisbane, Queensland, Australia, Apr. 2015, pp. 5206–5210.
  • [16] Ron J. Weiss, Jan Chorowski, Navdeep Jaitly, Yonghui Wu, and Zhifeng Chen, “Sequence-to-sequence models can directly translate foreign speech,” in Interspeech, Stockholm, Sweden, Aug. 2017, pp. 2625–2629.
  • [17] Mehryar Mohri, Fernando Pereira, and Michael Riley, “Speech recognition with weighted finite-state transducers,” in Handbook of Speech Processing, chapter 28, pp. 559–582. Springer, 2008.
  • [18] Yu Zhang, William Chan, and Navdeep Jaitly, “Very deep convolutional networks for end-to-end speech recognition,” in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, Mar. 2017, pp. 4845–4849.
  • [19] Mike Schuster and Kuldip K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
  • [20] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [21] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in ICLR, San Diego, CA, USA, May 2015.
  • [22] Albert Zeyer, Kazuki Irie, Ralf Schlüter, and Hermann Ney, “Improved training of end-to-end attention models for speech recognition,” in Interspeech, Hyderabad, India, Sept. 2018, pp. 7–11.
  • [23] Kyu J Han, Akshay Chandrashekaran, Jungsuk Kim, and Ian Lane, “The CAPIO 2017 conversational speech recognition system,” arXiv preprint arXiv:1801.00059, 2018.
  • [24] Rohit Prabhavalkar, Tara N. Sainath, Yonghui Wu, Patrick Nguyen, Zhifeng Chen, Chung-Cheng Chiu, and Anjuli Kannan, “Minimum word error rate training for attention-based sequence-to-sequence models,” in Proc. ICASSP, Calgary, Canada, Apr. 2018, pp. 4839–4843.
  • [25] Sara Sabour, William Chan, and Mohammad Norouzi, “Optimal completion distillation for sequence learning,” arXiv preprint arXiv:1810.01398, 2018.
  • [26] Martin Sundermeyer, Ralf Schlüter, and Hermann Ney, “LSTM neural networks for language modeling,” in Proc. Interspeech, Portland, OR, USA, Sept. 2012, pp. 194–197.
  • [27] Jan Chorowski and Navdeep Jaitly, “Towards better decoding and language model integration in sequence to sequence models,” in Proc. Interspeech, Stockholm, Sweden, Aug. 2017, pp. 523–527.
  • [28] Shubham Toshniwal, Anjuli Kannan, Chung-Cheng Chiu, Yonghui Wu, Tara N Sainath, and Karen Livescu, “A comparison of techniques for language model integration in encoder-decoder speech recognition,” in Proc. IEEE Workshop on Spoken Language Technology (SLT), Athens, Greece, Dec. 2018.