
Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems

by Nick Rossenbach, et al.

Recent advances in text-to-speech (TTS) led to the development of flexible multi-speaker end-to-end TTS systems. We extend state-of-the-art attention-based automatic speech recognition (ASR) systems with synthetic audio generated by a TTS system trained only on the ASR corpora itself. ASR and TTS systems are built separately to show that text-only data can be used to enhance existing end-to-end ASR systems without the necessity of parameter or architecture changes. We compare our method with language model integration of the same text data and with simple data augmentation methods like SpecAugment and show that performance improvements are mostly independent. We achieve improvements of up to 33% relative word error rate (WER) over a strong baseline with data augmentation in a low-resource environment (LibriSpeech-100h), closing the gap to a comparable oracle experiment by more than 50%. We also show improvements of up to 5% relative WER over a recent ASR baseline on LibriSpeech-960h.



1 Introduction & Related Work

Recently published automatic speech recognition (ASR) systems are based on deep neural network approaches, either in combination with hidden Markov models (hybrid approach) or as a standalone end-to-end system. While hybrid deep neural network architectures provide state-of-the-art performance, recent results using end-to-end architectures show competitive performance on large resource tasks [12]. Improvements were achieved by using new data augmentation methods such as SpecAugment [15] or using advanced pre-training schemes [26]. For medium to low resource tasks, hybrid architectures are still superior to end-to-end approaches [12]. To further increase the performance of end-to-end systems in low resource conditions, untranscribed speech or text can be used as additional training data. A previously published approach is the text-to-encoder (TTE) model, which can integrate additional text [6] or untranscribed speech [7] into ASR training. Another method is the joint training of ASR and text-to-speech (TTS) systems such as the Speech Chain approach [21, 22, 20] or variants of it [1]. Training TTS systems on external data to create audio features for ASR has also been investigated in [13, 11]. The usage of TTS in the context of ASR training was inspired by recent advances in end-to-end TTS systems with multi-speaker capabilities such as Tacotron [23], Tacotron-2 [19] and Deep-Voice [17].

Most of the previously presented approaches require that the ASR and TTS systems share at least a common feature processing pipeline and operate on the same kind of audio features. Especially for approaches where ASR and TTS are trained jointly, this is a strict requirement. In contrast to that, our approach includes a completely separate end-to-end TTS system with a Griffin & Lim (G&L) vocoder [5] for synthetic waveform generation instead of synthetic feature generation. While related work covering independent TTS systems [13, 11] uses additional data, our TTS system is only trained on the ASR corpus itself. The synthetic data is stored as compressed audio and can be used for any kind of speech recognition system, with no relation to the TTS system. For adaptive speaker embeddings, we compare global-style-tokens (GST) [24] and i-vector representations [9]. To the best of our knowledge, previous work on integrating synthesized data from TTS systems into ASR did not include a comparison with other data-augmentation techniques. Thus, we compare our TTS approach to SpecAugment [15] and speed-perturbation [10]. We also include a direct comparison with language-model integration of the same text data as used to generate synthetic speech. The experiments in this work were managed with Sisyphus [16] and the ASR and TTS systems were implemented in RETURNN [3]. The RETURNN configs will be made publicly available.

2 Attention-based Speech Recognition

2.1 Model

Our baseline model for LibriSpeech-100h consists of 6 bi-directional long short-term memory (LSTM) encoder layers and a single LSTM decoder layer, following a previously published setup. The encoder layers have a dimension of 1024 each. The decoder layer has a dimension of 1000. The time dimension is reduced by max-pooling with a factor of 2 between the first three layers, resulting in a total time reduction of 8. The attention mechanism is an MLP-style attention with weight feedback [26]. The dimension of the attention-MLP combination is 1024. CTC loss [4] is used as an additional training criterion besides cross-entropy (CE), based on label prediction with a softmax layer taking the encoder states as input.

As input we use 40-dimensional MFCC features. Each feature dimension is normalized to zero mean and variance of one by estimating statistics on the training data. The training includes a pre-training scheme as in [26]. For the text labels we use byte-pair-encoding [18] with 10k merge operations.
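The per-dimension normalization described above can be sketched as follows. This is a minimal illustration with NumPy, not the RETURNN implementation; the function names `estimate_stats` and `normalize` and the small variance floor are assumptions made for this example.

```python
import numpy as np


def estimate_stats(train_features):
    """Estimate per-dimension mean and standard deviation.

    train_features: list of arrays of shape [frames, dims],
    one array per training utterance.
    """
    stacked = np.concatenate(train_features, axis=0)
    mean = stacked.mean(axis=0)
    std = stacked.std(axis=0)
    # Floor the deviation to avoid division by zero (assumption).
    return mean, np.maximum(std, 1e-8)


def normalize(features, mean, std):
    """Map each feature dimension to zero mean and unit variance."""
    return (features - mean) / std
```

The statistics are estimated once on the training data and then reused unchanged for development and test data.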

2.2 Data Augmentation

For each experiment we include a variant of SpecAugment [15] as data augmentation method. The spectral axes of the features are masked randomly at 1 to 4 positions spanning between 1 and 8 features. On the time axis, we mask between 1 and of the number of frames positions, for a maximum of 20 consecutive frames. We also compare SpecAugment to speed-perturbation [10], using the perturbation factors 0.9, 0.95, 1.05 and 1.1 to add 4 times the original data.
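A rough sketch of this masking scheme is given below. The number of frequency masks (1 to 4), their width (1 to 8 features) and the maximum time-mask length (20 frames) come from the text. The upper bound on the number of time masks is not recoverable from the text, so `max_time_masks` is a hypothetical parameter, and masking with zeros is likewise an assumption.

```python
import numpy as np


def spec_augment(features, rng, max_time_masks=2):
    """Return a masked copy of a [frames, dims] feature matrix.

    max_time_masks is a hypothetical parameter; masked entries
    are set to zero, which is an assumption of this sketch.
    """
    out = features.copy()
    num_frames, num_dims = out.shape

    # Feature axis: 1 to 4 masks, each covering 1 to 8 features.
    for _ in range(int(rng.integers(1, 5))):
        width = min(int(rng.integers(1, 9)), num_dims)
        start = int(rng.integers(0, num_dims - width + 1))
        out[:, start:start + width] = 0.0

    # Time axis: at least one mask of at most 20 consecutive frames.
    for _ in range(int(rng.integers(1, max_time_masks + 1))):
        width = min(int(rng.integers(1, 21)), num_frames)
        start = int(rng.integers(0, num_frames - width + 1))
        out[start:start + width, :] = 0.0

    return out
```

A call such as `spec_augment(x, np.random.default_rng(0))` leaves the input `x` untouched and returns the augmented copy.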

2.3 Language Model

For some experiments we use two additional language models trained on either the text data used for synthesis (LM-small) or all of the additional data for language modeling (LM-large). The language models are based on the transformer architecture, similar to work presented in [8]. LM-small is a 30-layer transformer architecture with a feed-forward dimension of 2048 and an attention dimension of 512. For LM-large we use a 32-layer transformer architecture with a feed-forward dimension of 4096 and an attention dimension of 1024. The language model is added to the ASR system with log-linear combination.
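Log-linear combination adds the LM log-probability, weighted by a scale, to the ASR log-probability. The sketch below applies this to complete hypotheses for simplicity. In an actual decoder the combination would typically be applied inside the beam search, and the function names and the example scores are made up for illustration.

```python
def combined_score(log_p_asr, log_p_lm, lm_scale):
    """Log-linear combination of ASR and LM log-probabilities."""
    return log_p_asr + lm_scale * log_p_lm


def best_hypothesis(hypotheses, lm_scale):
    """Pick the best hypothesis from (text, log_p_asr, log_p_lm) tuples."""
    best = max(hypotheses, key=lambda h: combined_score(h[1], h[2], lm_scale))
    return best[0]


# Hypothetical scores: the LM prefers the second hypothesis.
hyps = [("a", -1.0, -5.0), ("b", -1.2, -2.0)]
without_lm = best_hypothesis(hyps, lm_scale=0.0)
with_lm = best_hypothesis(hyps, lm_scale=0.5)
```

With a scale of zero the ASR score alone decides; with a positive scale the LM can change the ranking, which is why the scale has to be tuned on development data.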

3 Speech Synthesis System

3.1 Synthesis Network

The synthesis network is inspired by the Tacotron-2 architecture [19], but contains some modifications. We use 80 dimensional log-mel features with a preemphasis factor of , a window size of and a window shift of . The features are globally normalized to zero mean and variance of one by extracting feature statistics from the training data. We set a bound to the mel-scale at 60 Hz, removing lower frequency bands. The input symbols are lowercased characters, and we add an additional end-of-sequence token ””.

The encoder consists of 3 1-D convolutional layers with 128 filters of size 5, and a bi-directional LSTM with 128 hidden units each forming the encoder states for a character sequence . The attention mechanism is an MLP-attention with convolutional weight feedback [2] using the sum of attention weights. The decoder consists of two stacked LSTM layers, from which we only use the second layer as input to the attention. In addition to the encoder state, a 64 dimensional positional encoding is used in the attention. The attention energy is computed as:


The convolutional weight feedback is computed with 32 1-D convolutional filters of size 31, applied on the sum of previous alignments. Instead of performing zero padding, we pad the positions before the first encoder state with ones, to indicate that positions before the start are already "attended".
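The padding scheme can be illustrated with a small NumPy sketch. It assumes centered filter windows and zero padding on the right side, neither of which is stated in the text; `location_features` is a hypothetical helper name.

```python
import numpy as np


def location_features(cum_alignments, filters):
    """Convolve accumulated attention weights with 1-D filters.

    cum_alignments: [T] sum of all previous attention weights.
    filters: [num_filters, width] with odd width (e.g. 32 x 31).
    Returns an array of shape [T, num_filters].
    """
    num_filters, width = filters.shape
    half = width // 2
    # Positions before the first encoder state count as attended (ones);
    # positions after the last state are zero-padded (assumption).
    padded = np.concatenate([np.ones(half), cum_alignments, np.zeros(half)])
    num_steps = cum_alignments.shape[0]
    windows = np.stack([padded[t:t + width] for t in range(num_steps)])
    return windows @ filters.T
```

With an all-zero alignment history, the first encoder positions still receive non-zero feedback from the left padding, which marks everything before the start of the sequence as already covered.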

The output is a single linear layer, transforming the decoder state and the current context vector into the shape of the output features. We stack 3 frames per single output step to reduce the sequence length and increase the attention stability. The input for the stop token is a linear transformation of the same inputs, predicting a single scalar with an applied sigmoid that indicates a finished sequence. We use L1 loss for the spectral features and binary cross-entropy loss for the stop token. The stop token target is a ”ramp” of length 5, meaning that the target values for the binary CE loss are

. To prohibit early stopping during decoding, we continue for 5 decoder steps (15 frames) after the stop token value exceeds a threshold of .
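The stopping rule can be sketched on a precomputed sequence of stop-token values. The threshold value is not recoverable from the text, so the default of 0.5 below is purely an assumption, and `num_decoder_steps` is a hypothetical helper; a real decoder would apply the same rule step by step.

```python
FRAMES_PER_STEP = 3  # three frames are stacked per decoder step


def num_decoder_steps(stop_values, threshold=0.5, extra_steps=5):
    """Number of decoder steps to keep for one utterance.

    Decoding continues for extra_steps steps after the stop-token
    value first exceeds the threshold (threshold is an assumption).
    """
    for step, value in enumerate(stop_values):
        if value > threshold:
            return min(step + 1 + extra_steps, len(stop_values))
    # The stop token never fired: keep everything that was decoded.
    return len(stop_values)


stop_values = [0.1, 0.2, 0.9] + [0.9] * 10
steps = num_decoder_steps(stop_values)
frames = steps * FRAMES_PER_STEP
```

In this example the threshold is exceeded at the third step, so eight steps (24 frames) are kept.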

3.2 Speaker Modeling

We use two different methods to enable speaker adaptation in the TTS system. One is a GST [24] based embedding, which is an unsupervised method of adaptation. The other uses i-vector representations computed as in [9].

For the GST speaker embedding we use 6 2-D convolutional layers with stride 2 to extract a short feature sequence from the target audio. A single feature vector is computed by applying a single forward LSTM on the sequence and taking the last state. This feature vector is used to select a mixture from the style token feature bank containing 100 entries of size 128 via attention. The mixture of style tokens or an i-vector representation is concatenated to the LSTM encoder states to form the speaker adapted encoder representation.
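The token selection and the concatenation step can be sketched as below. The text only states that the mixture is selected via attention, so the scaled dot-product attention, the separate key matrix and the function names are assumptions of this example. The encoder dimension of 256 follows from the bi-directional LSTM with 128 units per direction.

```python
import numpy as np


def style_embedding(query, token_keys, tokens):
    """Mix style tokens with softmax attention weights.

    query: [d] feature vector from the reference audio.
    token_keys: [num_tokens, d] keys (assumption of this sketch).
    tokens: [num_tokens, token_dim] style token bank, e.g. [100, 128].
    """
    scores = token_keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ tokens


def adapt_encoder(encoder_states, speaker_vector):
    """Concatenate the same speaker vector to every encoder state."""
    tiled = np.tile(speaker_vector, (encoder_states.shape[0], 1))
    return np.concatenate([encoder_states, tiled], axis=-1)
```

The same `adapt_encoder` step applies when an i-vector is used instead of the style token mixture.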

3.3 Data Preprocessing

We found that the audio data of the LibriSpeech corpus is designed in a way that is not beneficial for TTS systems. The speech utterances are not based on full sentences, meaning that there can be 2 sentences in one utterance, with style changes or unnatural pauses in between. Some utterances start or end in the middle of a sentence, leading to unnatural pronunciation at the beginning and end of utterances. These problems were also addressed in [25]. To remove unnatural pauses and long pauses in general, we apply the FFmpeg silenceremove filter with a threshold of -40 dB.
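One way to invoke the filter is sketched below. Only the -40 dB threshold is taken from the text. The remaining options (`start_periods`, `stop_periods`, `stop_duration`) are existing options of the silenceremove filter, but their values here are assumptions, as are the helper name and the file names.

```python
def silenceremove_command(src, dst, threshold_db=-40, min_silence_sec=0.5):
    """Build an ffmpeg command that strips silence from an audio file.

    Only the threshold comes from the paper; trimming leading silence
    and the minimum silence duration are assumptions of this sketch.
    """
    audio_filter = (
        "silenceremove="
        f"start_periods=1:start_threshold={threshold_db}dB:"
        f"stop_periods=-1:stop_duration={min_silence_sec}:"
        f"stop_threshold={threshold_db}dB"
    )
    return ["ffmpeg", "-i", src, "-af", audio_filter, dst]


cmd = silenceremove_command("utterance.flac", "utterance_trimmed.flac")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where ffmpeg is installed.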

3.4 Generation

We train a separate network that converts the log-mel features into linear features, which are necessary for direct G&L conversion. In the Tacotron architecture [23], the network converting log-mel features to linear spectrograms is part of the same training. We chose to train the two parts separately, so the linear network needs to be trained only once on the available audio data.

                                    WER [%]
SpecAugment  Speed perturbation  dev-clean  dev-other  test-clean  test-other
No           No                  12.8       36.8       12.8        38.7
No           Yes                 11.8       34.6       11.8        37.1
Yes          No                  10.5       27.7       10.8        28.8
Yes          Yes                 10.6       27.2       10.9        28.8
Table 1: Results on LibriSpeech-100 with SpecAugment and Speed Perturbation.

The mel-to-linear network consists of 2 stacked BLSTM layers with residual connections. The outputs are 512-dimensional linear spectrograms (the DC-part is excluded). When generating the audio data we run the feature network on the text data and apply the mel-to-linear network on the resulting features. The linear features are used as input to a G&L vocoder. We only use a single iteration for phase reconstruction as there is no need to reconstruct the phase when using MFCC features in the ASR system. Because both neural models perform regression tasks without search, and G&L synthesis with a single iteration is computationally cheap, we can generate large amounts of training data in a short time period. Generating 30,000 utterances with 50 hours of speech takes 60 minutes for the feature generation, 10 minutes for the conversion and 60 minutes for G&L synthesis and file encoding using a machine with 4 CPU cores and a single GPU.
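The timings above translate into a real-time factor as in the following short calculation. The helper name is made up for this example, and the three stages are assumed to run one after another.

```python
def real_time_factor(stage_minutes, audio_hours):
    """Processing time divided by the duration of the generated audio."""
    return sum(stage_minutes) / (audio_hours * 60.0)


# Feature generation, mel-to-linear conversion, G&L synthesis and encoding.
rtf = real_time_factor([60, 10, 60], audio_hours=50)
speedup = 1.0 / rtf
```

Under this assumption, 130 minutes of processing for 50 hours of audio gives a real-time factor of about 0.043, or roughly 23 times faster than real time.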

4 Experiments

                                   WER [%]
Paper  SpecAugment  TTS     LM  dev-clean  dev-other  test-clean  test-other
*      No           No      N   12.8       36.8       12.8        38.7
*      No           No      Y   11.4       34.5       11.4        36.4
*      No           GST     N   10.2       34.9       10.6        36.9
*      No           GST     Y    8.9       33.0        9.3        34.8
*      No           oracle       5.9       22.6        6.2        23.5
*      Yes          No      N   10.5       27.7       10.8        28.8
*      Yes          No      Y    9.9       27.0       10.3        28.1
*      Yes          GST     N    8.2       27.4        8.7        28.4
*      Yes          GST     Y    7.4       25.7        7.9        26.7
*      Yes          oracle       5.4       17.9        5.6        18.4
[1] No - N 21.9
x-vec 17.9
Y 17.0
+CCT 16.6
oracle N 11.8
Table 2: Results on LibriSpeech-100 comparing synthetic audio data against LM-combination based on the same additional text data (LM-small) and SpecAugment. We compare the results of our paper (*) with the results presented in [1]. The model CCT also uses untranscribed speech.

We performed our simulated low-resource experiments on LibriSpeech [14], similar to previous work (e.g. [1]). We use LibriSpeech-100h as training data for the ASR baseline and the TTS system and the transcriptions of LibriSpeech-360h as text-only data. We trained the baseline ASR system for 80 checkpoints and 4 data epochs to reach an initially converged state. This reduces the variance in the resulting performance, as all comparable training runs start from the same checkpoint. We then reset the learning rate and continue the training for an additional 170 checkpoints including data-augmentation methods and/or synthetic data. We set the epoch partitioning factor for each checkpoint in a way that for each experiment roughly the same amount of audio data was seen during training, independent of the amount of synthetic data generated. The parameter for LM score combination is optimized on dev-clean for test-clean, and on dev-other for test-other.

4.1 Results for LibriSpeech-100

Before adding synthetic data, we compared the effects of speed-perturbation and SpecAugment. The results can be seen in Table 1. In our setting, SpecAugment and speed-perturbation both show improvements over the baseline, but the improvement of speed-perturbation vanishes when it is combined with SpecAugment. For the following experiments we only use SpecAugment as data-augmentation method. In a next step, we used the text of the LibriSpeech-360h corpus to generate synthetic audio with the GST-TTS model (Table 2). To be able to compare the effect of additional audio to the effect of additional text, we included the LM-small language model. In direct comparison, adding synthetic data showed a better performance on test-clean, while adding an LM showed better performance on test-other. By combining both, we achieved a relative improvement of 27% over the baseline on test-clean. The same relative improvement was observed over a stronger baseline including SpecAugment. The oracle experiment includes the text and audio of LibriSpeech-360h, but uses the same initial checkpoint as all other experiments (only LibriSpeech-100h for the first 4 epochs) to be comparable. While on test-clean we can reduce the gap to the oracle performance by more than 50%, we only see a small improvement on test-other. We compare our results with the x-vector (x-vec) TTS system and the cycle-consistent (joint) training (CCT) of ASR and TTS models presented in [1]. With their joint model, they achieved a relative improvement of 21% on test-clean over a weaker baseline. Scores on test-other were not reported.
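The relative improvement and the reduction of the gap to the oracle can be checked against the test-clean values in Table 2 with a short calculation. The helper names are made up for this illustration.

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction with respect to the baseline."""
    return (baseline_wer - new_wer) / baseline_wer


def gap_closed(baseline_wer, new_wer, oracle_wer):
    """Fraction of the baseline-to-oracle gap that is closed."""
    return (baseline_wer - new_wer) / (baseline_wer - oracle_wer)


# test-clean, without SpecAugment: baseline, GST + LM-small, oracle.
rel_plain = relative_improvement(12.8, 9.3)
gap_plain = gap_closed(12.8, 9.3, 6.2)

# test-clean, with SpecAugment: baseline, GST + LM-small.
rel_specaug = relative_improvement(10.8, 7.9)
```

Both relative improvements round to 27%, and the gap to the oracle on test-clean shrinks by about 53%.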

                          WER [%]
SpecAugment  TTS       dev-clean  dev-other  test-clean  test-other
No           -         8.5        30.7       8.8         32.5
No           GST       6.1        29.7       6.5         30.8
No           i-vector  7.2        30.8       7.4         32.7
No           oracle    3.9        18.7       4.2         19.2
Yes          -         7.3        23.3       8.1         24.5
Yes          GST       5.0        21.7       5.4         22.2
Yes          i-vector  5.6        23.5       6.0         24.5
Yes          oracle    3.7        15.4       4.2         15.7
Table 3: Results on LibriSpeech-100 comparing synthetic data generated with a TTS system using GST or i-vector embeddings against the oracle data. All results are with an additional LM (LM-large).

In Table 3 we compare the use of synthetic data generated with two different speaker embedding methods against an oracle experiment. The GST-based system clearly outperforms the TTS system using i-vector representations. We see that the relative improvement of using synthetic data gets even larger when using LM-large and SpecAugment, up to a relative WER reduction of 33% on test-clean (8.1 to 5.4). The amount of audio presented during training is about 4000 hours, corresponding to 12.5 epochs of the original data (100h) and 8.5 epochs of the synthetic data (330h).
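The stated amount of training audio follows directly from the epoch counts and corpus sizes, as the following small calculation shows. The helper name is made up for this example.

```python
def presented_hours(epochs_and_hours):
    """Total audio hours seen in training from (epochs, corpus_hours) pairs."""
    return sum(epochs * hours for epochs, hours in epochs_and_hours)


# 12.5 epochs of the original 100h and 8.5 epochs of 330h synthetic data.
total_hours = presented_hours([(12.5, 100), (8.5, 330)])
```

This gives 4055 hours, consistent with the approximate figure of 4000 hours.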

4.2 Results for LibriSpeech-960

In preliminary experiments we also tested whether we can improve our currently best baseline setups for LibriSpeech-960h using the TTS models trained on LibriSpeech-100h. For the i-vector model we used representations computed on the full corpus, but the GST model is exactly the same as in the 100h case. We generated 2000 hours of additional data using the language model text data, and trained the models for 12500 hours of data. The results for using synthetic data with TTS together with our best baseline using SpecAugment and LM-large can be found in Table 4. As the models converged more slowly and showed less overfitting when using synthetic data, we used an original to synthetic data ratio of 3 to 2 and extended the training time by re-training each model for another 12500 hours of training data (15 epochs on the original data and 5 epochs on the synthetic data). Unfortunately, the GST results were not finished at the time of the submission. The relative WER improvements do not exceed 5%, and we assume further investigation is needed to balance the regularization effects of Dropout, SpecAugment and synthetic data. We compare our results to the improvement achieved by training a separate TTS on the 3-speaker M-AILABS corpus as performed in [11].

                                  WER [%]
Paper     Retrain  TTS       dev-clean  dev-other  test-clean  test-other
our work  No       -         2.61       7.36       2.77        7.88
our work  No       GST       2.57       7.43       2.72        7.82
our work  No       i-vector  2.63       7.63       2.79        7.89
our work  Yes      -         2.43       7.03       2.66        7.37
our work  Yes      GST       2.35       7.05       2.50        7.29
our work  Yes      i-vector  2.32       6.72       2.53        7.19
[11] - - 5.10 16.21
GST 4.66 15.47
Table 4: Results on LibriSpeech-960 comparing synthetic data generated with a TTS system using GST or i-vector embeddings against the oracle data. In contrast to [11], our results include SpecAugment and an LM (LM-large) in decoding.

5 Conclusion

We presented a straightforward approach to generate and add synthetic audio data to state-of-the-art end-to-end ASR systems. We showed that we can improve a strong low-resource baseline system that already uses data augmentation and an additional language model by up to 33% in relative WER on LibriSpeech test-clean, and by 9% on test-other. The improvements by using synthetic data were larger when used together with SpecAugment and LM combination. Our TTS system uses global-style-tokens for unsupervised speaker embeddings, thus removing the need for speaker-labeled training data. By using Griffin & Lim synthesis as vocoder approach, synthesizing data is computationally inexpensive compared to the ASR training itself. Although we observed large improvements when using TTS data, manual evaluation revealed that the TTS outputs are still poor in stability and speaker adaptation capabilities. Preliminary results on the full LibriSpeech-960h corpus show only minor improvements. In future work, we will try to build stronger and more stable TTS systems that include all of the LibriSpeech data, and investigate possible underfitting that occurs in a large resource environment.

6 Acknowledgements

This work has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 694537, project "SEQCLAS") and from a Google Focused Award. The work reflects only the authors’ views and none of the funding parties is responsible for any use that may be made of the information it contains. Simulations were partially performed with computing resources granted by RWTH Aachen University under project nova0003.


  • [1] M. K. Baskar, S. Watanabe, R. F. Astudillo, T. Hori, L. Burget, and J. Cernocký (2019) Self-supervised sequence-to-sequence ASR using unpaired speech and text. CoRR abs/1905.01152.
  • [2] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio (2015) Attention-based models for speech recognition. CoRR abs/1506.07503.
  • [3] P. Doetsch, A. Zeyer, P. Voigtlaender, I. Kulikov, R. Schlüter, and H. Ney. Returnn: the RWTH extensible training framework for universal recurrent neural networks. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017, pp. 5345–5349.
  • [4] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, New York, NY, USA, pp. 369–376. ISBN 1-59593-383-2.
  • [5] D. W. Griffin, D. S. Deadrick, and J. S. Lim. Speech synthesis from short-time fourier transform magnitude and its application to speech processing. In IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP ’84, San Diego, California, USA, March 19-21, 1984, pp. 61–64.
  • [6] T. Hayashi, S. Watanabe, Y. Zhang, T. Toda, T. Hori, R. F. Astudillo, and K. Takeda. Back-translation-style data augmentation for end-to-end ASR. In 2018 IEEE Spoken Language Technology Workshop, SLT 2018, Athens, Greece, December 18-21, 2018, pp. 426–433.
  • [7] T. Hori, R. F. Astudillo, T. Hayashi, Y. Zhang, S. Watanabe, and J. L. Roux (2018) Cycle-consistency training for end-to-end speech recognition. CoRR abs/1811.01690.
  • [8] K. Irie, A. Zeyer, R. Schlüter, and H. Ney (2019) Language modeling with deep transformers. CoRR abs/1905.04226.
  • [9] M. Kitza, P. Golik, R. Schlüter, and H. Ney (2019) Cumulative adaptation for BLSTM acoustic models. CoRR abs/1906.06207.
  • [10] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur (2015) Audio augmentation for speech recognition. In INTERSPEECH, pp. 3586–3589.
  • [11] J. Li, R. Gadde, B. Ginsburg, and V. Lavrukhin (2018) Training neural speech recognition systems with synthetic speech augmentation. CoRR abs/1811.00707.
  • [12] C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, and H. Ney (2019) RWTH ASR systems for librispeech: hybrid vs attention - w/o data augmentation. CoRR abs/1905.03072.
  • [13] M. Mimura, S. Ueno, H. Inaguma, S. Sakai, and T. Kawahara (2018) Leveraging sequence-to-sequence speech synthesis for enhancing acoustic-to-word speech recognition. In 2018 IEEE Spoken Language Technology Workshop, SLT 2018, Athens, Greece, December 18-21, 2018, pp. 477–484.
  • [14] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an ASR corpus based on public domain audio books. pp. 5206–5210.
  • [15] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) SpecAugment: A simple data augmentation method for automatic speech recognition. CoRR abs/1904.08779.
  • [16] J. Peter, E. Beck, and H. Ney. Sisyphus, a workflow manager designed for machine translation and automatic speech recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018, pp. 84–89.
  • [17] W. Ping, K. Peng, A. Gibiansky, S. Ö. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller (2017) Deep voice 3: 2000-speaker neural text-to-speech. CoRR abs/1710.07654.
  • [18] R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers.
  • [19] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu (2017) Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions. CoRR abs/1712.05884.
  • [20] A. Tjandra, S. Sakti, and S. Nakamura. End-to-end feedback loss in speech chain framework via straight-through estimator. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, United Kingdom, May 12-17, 2019, pp. 6281–6285.
  • [21] A. Tjandra, S. Sakti, and S. Nakamura. Listening while speaking: speech chain by deep learning. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017, Okinawa, Japan, December 16-20, 2017, pp. 301–308.
  • [22] A. Tjandra, S. Sakti, and S. Nakamura. Machine speech chain with one-shot speaker adaptation. In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September, pp. 887–891.
  • [23] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous (2017) Tacotron: towards end-to-end speech synthesis. In Proc. Interspeech 2017, pp. 4006–4010.
  • [24] Y. Wang, D. Stanton, Y. Zhang, R. J. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, Y. Jia, F. Ren, and R. A. Saurous. Style tokens: unsupervised style modeling, control and transfer in end-to-end speech synthesis. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholm, Sweden, July 10-15, pp. 5167–5176.
  • [25]
  • [26] A. Zeyer, K. Irie, R. Schlüter, and H. Ney. Improved training of end-to-end attention models for speech recognition. In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, pp. 7–11.