: an automatic speech recognition (ASR) system creates a transcript of the speech signal, which is then translated by a machine translation (MT) system. Recent approaches use end-to-end neural network models [3, 31], which are directly inspired by end-to-end models for ASR [8, 7]. Cascade approaches can benefit from large amounts of training data available to their components for certain language pairs. For example, an English ASR model can be trained on 960 hours of speech, and an English–French translation model can be trained on about 40 million sentence pairs.
End-to-end approaches have very limited data available [14, 22, 10]; nevertheless, they present several benefits. First, end-to-end models can enable lower inference latency since they involve only one prediction. Second, it may be easier to reduce the model size for a single integrated model. Finally, end-to-end approaches avoid compounding errors from the ASR and MT models.
End-to-end models for AST have been shown to perform better than or on par with cascade models [31, 2] when both are trained only on speech translation parallel corpora. However, when additional data are used to train its ASR and MT subsystems, the cascade outperforms the vanilla end-to-end approach [3, 27, 12]. In this work, we explore several techniques that leverage the wealth of ASR and MT data to aid end-to-end systems, by means of data augmentation.
The major contributions of this paper are:
We confirm that end-to-end models underperform cascade models by a large margin, when the components can be trained on additional ASR and MT training data while the end-to-end model is constrained to be trained on AST training data. In particular, we build a very strong cascade model that outperforms a previously reported system  by 5.5 BLEU.
We investigate the strategies that can improve end-to-end AST models. We augment the data by leveraging ASR training data with MT and MT training data with text-to-speech synthesis (TTS). We also study the effect of pretraining the ASR encoder as well as how to better utilize out-of-domain augmented data with fine-tuning. In the case of TTS-augmented data, we analyze the effect of the amount of data added, which TTS engine is used, and whether one speaker or multiple speakers are used to generate the data.
We benchmark the performance of several architectures on the AST task on publicly available datasets. We first propose an extension to the Bérard model  that increases its capacity for training on larger data settings. We also benchmark models on the AST task that have previously been applied only to the ASR task: VGG LSTM  and VGG Transformer . To our knowledge, this is the first time the VGG Transformer architecture has been applied to the AST task. For better reproducibility, experiments are conducted on two publicly available datasets, AST Librispeech  and MuST-C . With data augmentation, pretraining, fine-tuning and careful architecture selection, we obtain competitive end-to-end models on the corresponding English–French (En–Fr) and English–Romanian (En–Ro) tasks.
For certain language pairs, cascade models can access large amounts of training data. In this section, we present our strategies to leverage this additional data for end-to-end models.
The first prong of our approach involves generating synthetic data to augment the existing AST data, and it is summarised in Figure 1. MT models can be used to generate synthetic AST training data by automatically translating the transcript portion of ASR training data. (For our languages of interest, the MT model can be trained on a large amount of data and will be able to generate high quality translations.) In addition, the MT model itself would not be directly used in training the end-to-end AST model, which will avoid reinforcing errors produced by that model.
Similarly, we can generate additional synthetic AST training data by generating speech from the source side of an MT parallel corpus. This technique is similar to backtranslation  and has been previously applied to end-to-end ASR training .
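Both augmentation directions reduce to mapping one side of an existing corpus through a pretrained model. The sketch below is a minimal Python version of the two prongs, where `translate` and `synthesize` are illustrative stand-ins for arbitrary pretrained MT and TTS models (the names and call signatures are assumptions, not part of any released toolkit):

```python
def augment_asr_with_mt(asr_corpus, translate):
    """(audio, transcript) ASR pairs -> synthetic (audio, translation) AST pairs."""
    return [(audio, translate(transcript)) for audio, transcript in asr_corpus]

def augment_mt_with_tts(mt_corpus, synthesize):
    """(source, target) MT pairs -> synthetic (audio, translation) AST pairs."""
    return [(synthesize(source), target) for source, target in mt_corpus]
```

The synthetic pairs are then simply concatenated with the gold AST training data.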
Another important aspect to consider is that the additional synthetic data we generate may not be in-domain. This can be problematic when there is a large gap between the amount of available gold data compared to the amount of synthetic data. We investigate fine-tuning techniques to address this issue.
As a second prong to the approach, we examine pretraining: we can use the large ASR corpus to pretrain the speech encoder of an AST model; this is displayed in Figure 2.
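Concretely, encoder pretraining amounts to overwriting the AST model's encoder parameters with those of a converged ASR model before AST training starts. A minimal sketch over plain name-to-weight dictionaries (PyTorch state_dicts behave the same way; the `encoder.` prefix is an assumption about the parameter naming):

```python
def load_pretrained_encoder(ast_params, asr_params, prefix="encoder."):
    """Copy encoder weights from a pretrained ASR model into an AST model
    with the same encoder architecture; decoder weights keep their fresh
    initialization and are trained from scratch on the AST task."""
    for name, weight in asr_params.items():
        if name.startswith(prefix):
            ast_params[name] = weight
    return ast_params
```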
In this section, we describe the various model architectures used for ASR and AST experimentation.
3.1 Bérard Model and Extension
We use a similar architecture to  with a speech encoder consisting of two non-linear layers followed by two convolutional layers and three bidirectional LSTM layers, and a custom LSTM decoder. In addition, the input state to an LSTM layer is the state emitted at the current timestep by the layer beneath. The input state to the bottom layer is the state emitted by the top LSTM layer at the previous timestep. Preliminary experiments showed that passing the state from the previous timestep to the LSTM layer one level above at the current timestep was not as effective. Finally, we extend the architecture to an arbitrary number of decoder layers as shown in Equation 1:
$$o_t^{1}, s_t^{1} = \mathrm{LSTM}^{1}\!\left(s_{t-1}^{L},\, [E(y_{t-1}); c_t]\right), \qquad o_t^{l}, s_t^{l} = \mathrm{LSTM}^{l}\!\left(s_t^{l-1},\, o_t^{l-1}\right) \quad (1 < l \le L) \tag{1}$$

where the subscript $t$ indicates the timestep, the superscript $l$ indicates the position in the stack of $L$ LSTMs, $s$ is the state, $o$ is the output, $E$ is the embedding function, $y$ is a target token and $c$ is a context vector.
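The state-passing scheme above can be sketched as a single decoder timestep, with each element of `cells` standing in for one LSTM layer's step function (the per-layer call signature is an assumption for illustration):

```python
def decoder_step(cells, top_state_prev, x):
    """One timestep of the extended decoder stack.

    The bottom layer's input state is the state emitted by the *top* layer at
    the previous timestep; every higher layer receives the state just emitted
    by the layer beneath it at the current timestep.
    cells[l]: (input, input_state) -> (output, new_state).
    x: the embedded previous target token concatenated with the context vector.
    """
    out, state = x, top_state_prev
    for cell in cells:
        out, state = cell(out, state)
    return out, state  # top-layer output and state; the state feeds timestep t+1
```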
3.2 VGG LSTM
We investigate the performance of ASR and AST with a model similar to the ESPnet implementation (https://github.com/espnet/espnet). The encoder is composed of two blocks of VGG layers  followed by bidirectional LSTM layers. We use a hybrid attention mechanism from  that takes into account both location and content information. The decoder is an LSTM, following . This model will subsequently be called vgglstm.
3.3 VGG Transformer
We also investigate the performance of a Transformer model. To our knowledge, this is the first time this type of model is applied to the AST task. We use a variant of the Transformer network that has been shown to perform well on the ASR task  by replacing the sinusoidal positional embedding with input representations learned by convolutions. This model will subsequently be called vggtransformer.
4 Experimental Setup
For both the En–Fr and En–Ro language pairs, we use three datasets corresponding to the AST, ASR and MT tasks. We choose some of the largest publicly available datasets for ASR and MT in order to have the ASR and MT models in an unconstrained-like setting and make the comparison between end-to-end and cascade models more realistic. Dataset statistics are summarized in Table 1.
For the En–Fr AST task, we use the publicly available augmented Librispeech corpus  (AST Librispeech), which is the second largest dataset publicly available for this task. For En–Ro AST, we use the recently released MuST-C corpus , which is the largest publicly available dataset.
For the En ASR task, we use Librispeech  (ASR Librispeech) which is in the same domain as AST Librispeech and also the largest publicly available dataset for ASR. Since the validation and test set from AST Librispeech come from the training portion of ASR Librispeech, we filter all training utterances from ASR Librispeech that contain any of the validation or test utterances from AST Librispeech with at least two words. We allow ASR Librispeech training utterances to contain validation and test utterances from AST Librispeech with only one word, otherwise we would for example remove all training utterances containing the word “no”. The ASR Librispeech corpus is used for the ASR task for both En–Fr and En–Ro experiments.
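A minimal version of this filtering step (string containment over normalized transcripts is an assumption about how overlap was detected):

```python
def filter_asr_training(train_utts, heldout_utts, min_words=2):
    """Remove ASR training utterances that contain any held-out dev/test
    utterance of at least `min_words` words; single-word held-out utterances
    (e.g. "no") are ignored so common words do not wipe out the training set."""
    held = [h for h in heldout_utts if len(h.split()) >= min_words]
    return [t for t in train_utts if not any(h in t for h in held)]
```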
For the En–Fr MT task, we use the En–Fr parallel data available as part of the WMT14 competition (http://www.statmt.org/wmt14/translation-task.html). For En–Ro, we use the En–Ro parallel data available for the WMT16 competition (http://statmt.org/wmt16/translation-task.html).
[Table 1: number of utterances and hours for each dataset]
4.2 Preprocessing Settings
For the En–Fr task, we follow the same preprocessing as . The English text is simply lowercased as it does not contain punctuation. The French text is punctuation-normalized, tokenized and lowercased. For the En–Ro task, the English text is tokenized, punctuation-stripped and lowercased. The Romanian text is punctuation-normalized and tokenized but the casing is preserved, following . We do not limit the number of frames in the training data except to avoid GPU out-of-memory errors.
When training on AST Librispeech only, we use a character-level decoder. Otherwise, we use a unigram model with size 10,000 using the SentencePiece implementation  as training on larger datasets with a character-level decoder would be prohibitively slow.
4.3 Model Settings
We use two sets of hyperparameters for the Bérard architecture. When training on AST Librispeech (En–Fr), we reuse the same hyperparameters as the original Bérard model. In all other settings—for En–Fr with additional data and En–Ro—we use 3 decoder layers based on the extended model presented in § 3 in order to give more capacity to the model.
The vgglstm encoder uses 80 log-scaled mel spectrogram features; 2 VGG blocks with 64 and 128 channels, filter size 3, pooling size 2, 2 convolutional layers each, and layer normalization; and 5 bidirectional LSTM layers of size 1024. The decoder uses embeddings of size 1024 and 2 LSTM layers of size 1024. The attention has dimension 1024 and 10 channels with filter size 201. vgglstm uses no dropout.
The vggtransformer architecture also uses 80 features and the same VGG block configuration as vgglstm, with 14 transformer encoder layers and 4 transformer decoder layers of size 1024, 16 heads, a feed-forward network of size 4096 and dropout with probability 0.15. The vggtransformer decoder uses target embeddings of size 128 and 4 convolutional layers with 256 channels, filter size 3 and layer normalization.
Table 2 summarizes the number of parameters for the models presented in this section as well as the Transformer model used for MT.
[Table 2: number of parameters per model; e.g., Bérard with 3 decoder layers has 13.5M parameters]
4.4 Training Settings
For the Bérard architecture, we use the Adam optimizer  with a learning rate of 0.001. For the smaller AST Librispeech task, we use a minibatch size of 16,000 frames to help convergence. For other tasks, we use a minibatch size of 96,000 frames, except for the vggtransformer architecture, where we use 72,000 frames to avoid memory issues. We also use delayed updates  in order to keep the same effective batch size while avoiding GPU out-of-memory errors. All experiments are conducted on 8 GPUs. For architectures other than Bérard, we use ADADELTA  with a learning rate of 1 and normalize the loss per utterance instead of per token. These hyperparameters were chosen based on preliminary experimentation on the ASR Librispeech task.
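Delayed updates are plain gradient accumulation; a framework-agnostic sketch (the function names are illustrative):

```python
def train_with_delayed_updates(batches, grad_fn, apply_update, accumulate=4):
    """Sum gradients over `accumulate` minibatches, then apply one optimizer
    update with the averaged gradient, preserving the effective batch size
    while keeping each forward/backward pass small enough for GPU memory."""
    pending, n = None, 0
    for batch in batches:
        g = grad_fn(batch)
        pending = g if pending is None else [a + b for a, b in zip(pending, g)]
        n += 1
        if n == accumulate:
            apply_update([x / n for x in pending])
            pending, n = None, 0
    if pending:  # flush leftover gradients at the end of the epoch
        apply_update([x / n for x in pending])
```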
5.1 Cascade Baselines
The baseline approach, cascade, involves a two-step process: first, obtain a transcript from input speech with an ASR model, then translate the transcript with an MT model. Both models are trained separately on large training datasets.
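The cascade reduces to function composition; a sketch where `asr_model` and `mt_model` are any pretrained models exposed as callables (illustrative names):

```python
def cascade_translate(audio, asr_model, mt_model):
    """Two-step speech translation: transcribe, then translate the transcript.
    Errors made by the ASR step propagate into the MT step, which is the main
    weakness of the cascade compared to end-to-end models."""
    transcript = asr_model(audio)
    return mt_model(transcript)
```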
The ASR models for En–Fr use the same architectures described in § 3 and are trained on the full Librispeech corpus, which is much larger than the available AST data. For the En–Ro task, ASR models are trained on the MuST-C and the Librispeech datasets. We use a Transformer  as the basic MT architecture. More precisely, for En–Fr, we first pretrain a large Transformer model (transformer big) over the entire WMT14 corpus, then fine-tune this model on the AST Librispeech data. For En–Ro experiments, we merge the MuST-C and the WMT16 corpus and train a smaller Transformer (transformer base) on the joint corpus.
5.2 Data Augmentation
MT: Producing AST data from ASR data. The MT models described in § 5.1 are used to automatically translate the English transcript of ASR Librispeech into French and Romanian. The resulting synthetic data can directly be used as additional training data for end-to-end AST models.
TTS: Producing AST data from MT data. We also explore augmenting the MT training data with TTS. This technique is similar to backtranslation  and has been previously applied to end-to-end ASR training . We use two pretrained TTS engines. The first engine, TTS1, uses the OpenSeq2Seq framework  to generate speech samples in five different voices. The TTS model is based on an extension of the Tacotron 2 model  with Global Style Tokens . The second TTS engine, TTS2, is trained on about 15 hours of single-speaker data. The text data comes from several domains such as Wikipedia, news articles, parliament speech, and novels. We use TTS1 to generate speech from a random sample of WMT14 with the same size as ASR Librispeech (265,754 utterances) and TTS2 to generate speech from WMT16.
5.3 Speech Encoder Pretraining
Speech encoder pretraining is an alternative way to make use of the full ASR Librispeech dataset. We first pretrain an English ASR model on ASR Librispeech plus the TTS1 corpus generated in § 5.2 – the parallel corpus built from the generated TTS and WMT14 English text. We then take the encoder of the ASR model to initialize the encoder of an AST model with the same architecture.
On the En–Fr task, the TTS data is generated from WMT14, which is out-of-domain with respect to Librispeech. On the En–Ro task, both the TTS and the MT data are out-of-domain. We investigate fine-tuning as a technique to mitigate the domain shift. We fine-tune by continuing training on only AST Librispeech or MuST-C, starting from the best checkpoint on development data after convergence.
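The fine-tuning recipe is: load the best dev checkpoint of the model trained on the augmented mix, then keep training on the gold in-domain corpus only. A sketch (the callables are illustrative stand-ins for a real checkpoint loader and training loop):

```python
def fine_tune(best_checkpoint, in_domain_batches, load_model, train_step):
    """Continue training from the best converged checkpoint on gold in-domain
    AST data only (AST Librispeech or MuST-C), to counter the domain shift
    introduced by out-of-domain synthetic data."""
    model = load_model(best_checkpoint)
    for batch in in_domain_batches:
        model = train_step(model, batch)
    return model
```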
6 Results and Analysis
We first investigate techniques to improve end-to-end AST with the Bérard model. Results for the En–Fr and En–Ro tasks are summarized in Table 3. The ASR and MT components we trained with additional data are very strong: the cascade model outperforms the vanilla end-to-end model by 8.2 BLEU on the AST task. It is also important to note that our cascade baseline outperforms the best previously reported result on this task by 5.5 BLEU and even exceeds the previously reported oracle score of 19.3 BLEU by 2 BLEU.
6.1 Effect of Data Augmentation
By augmenting ASR Librispeech with automatic translations (AST + MT), we show an improvement of 6.8 BLEU for En–Fr and 3.0 BLEU for En–Ro. Under this setting, we observe that pretraining is not beneficial for En–Fr but is for En–Ro. (Data setting experiments were conducted on the Bérard architecture; we do not expect the conclusions on data augmentation, pretraining and fine-tuning to change on the VGG architectures.)
Additional TTS-augmented data (AST + TTS) initially hurts performance for both language pairs. With pretraining and fine-tuning, TTS data provides a gain of 3.3 BLEU over the vanilla end-to-end baseline. However, even with pretraining and fine-tuning, the end-to-end model with additional TTS still underperforms the model using MT-augmented data only.
We also augmented the data with MT and TTS (AST + MT + TTS) at the same time. We find that it does not provide additional gain over using MT data only. In general, MT data can efficiently help the model, while TTS data is less efficient and can be hurtful. We analyze TTS in more detail in § 6.4.
We then investigate different architectures on the higher-resource En–Fr task and summarize the results in Table 4. On the En–Fr AST + MT setting, we obtain 19.87 BLEU with Bérard and 19.84 BLEU with vgglstm, showing that these architectures are effective with additional data; the vgglstm result is only 0.36 BLEU behind the corresponding vgglstm cascade performance of 21.79 BLEU and equivalent (+0.1 BLEU) to the Bérard cascade. With vggtransformer, we obtain the best end-to-end AST score of 21.65 BLEU, which is equivalent to the scores of the vggtransformer (-0.01 BLEU) and vgglstm (-0.1 BLEU) cascades.
6.2 Effect of Encoder Pretraining
Results are summarized in Table 5. Pretraining improves the AST + TTS system by +2.6 BLEU (+2.0 BLEU with fine-tuning), the MuST-C + TTS system by +2.7 BLEU (+2.2 BLEU with fine-tuning), and the MuST-C + MT + TTS system by +4.2 BLEU (+1.4 BLEU with fine-tuning).
However, gains from pretraining ASR on the full Librispeech dataset do not compound with gains from MT augmentation. Pretraining does not help as much in the AST + MT + TTS setup, showing a negligible change in BLEU score. Pretraining has mixed results for the MuST-C + MT case, with -0.2 BLEU and +0.8 BLEU with pretraining.
Pretraining on in-domain ASR data is not a good substitute for MT-augmenting the ASR data. However, we note that using a pretrained speech encoder speeds up convergence of the AST model. Thus, pretraining could be used in experiments with the same architecture and provides a good starting point for more rapid iteration.
[Table 5: BLEU with and without encoder pretraining for the AST + TTS, AST + MT + TTS, MuST-C + TTS, MuST-C + MT and MuST-C + MT + TTS settings]
6.3 Effect of Fine-tuning
Table 6 summarizes the fine-tuning results. We apply fine-tuning whenever TTS-augmentation is used. Fine-tuning mitigates the effect of the domain shift introduced by the additional out-of-domain TTS data: in the Librispeech AST + TTS setup, fine-tuning improves performance by +1.7 BLEU (+1.2 BLEU on top of pretraining). For the MuST-C + TTS setup, we see +2.3 BLEU (+1.4 BLEU on top of pretraining).
Fine-tuning does not improve the AST Librispeech model on top of MT-augmentation, however, likely because the MT-augmented data is already in-domain. For the AST + MT + TTS setup, we see neutral results: +0.3 BLEU and no effect on top of pretraining. However, for the MuST-C + MT setup, we see a gain of +0.3–0.9 BLEU because the MT-augmented data is out-of-domain for the MuST-C dataset.
[Table 6: BLEU with and without fine-tuning for the AST + TTS, AST + MT + TTS, MuST-C + TTS, MuST-C + MT and MuST-C + MT + TTS settings]
6.4 TTS Data: Quantity, Quality and Diversity
We now analyze the effect of augmenting the AST training data with TTS on the En–Fr task. First, as shown in Figure 3, adding TTS data improves performance up to 100,000 utterances, but beyond that performance degrades. We hypothesize that this is because the additional TTS data is out of domain. With fine-tuning, adding up to 300,000 utterances improves performance, which supports our hypothesis; however, adding 1M utterances again degrades performance. In the future, we will investigate how to make more effective use of larger quantities of TTS-generated data.
We also study the effect of using single-speaker or multi-speaker TTS. We use a sample of 300,000 utterances from WMT14 and generate speech with the TTS1 engine using the first speaker (Speaker 0), the second speaker (Speaker 1), and all five speakers in a round-robin fashion. Finally, we investigate whether the quality of the TTS engine matters. For the same sample of 300,000 utterances, we generate speech using the TTS2 engine, both from the English text and from the corresponding French translations; the latter is analogous to copying the target to the source in machine translation . Results are reported in Table 7. Comparing the first two rows, we conclude that performance varies depending on the speaker. The third row shows that using multiple speakers performs on par with choosing Speaker 0 and outperforms choosing Speaker 1; we therefore recommend using multiple speakers by default. Finally, comparing rows 1 and 4, we conclude that the quality of the TTS engine matters only marginally, with TTS2 slightly outperforming TTS1. The last row shows that the analogue of copying the target to the source in machine translation is also effective for the AST task.
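Round-robin speaker assignment can be sketched as follows (`synthesize` is an illustrative stand-in for a multi-speaker TTS call):

```python
from itertools import cycle

def synthesize_round_robin(sentences, speakers, synthesize):
    """Cycle through the available TTS voices so consecutive sentences are
    spoken by different speakers, increasing acoustic diversity."""
    return [synthesize(s, spk) for s, spk in zip(sentences, cycle(speakers))]
```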
[Table 7: BLEU per configuration (row #, configuration, TTS engine)]
7 Related Work
Initial attempts at speech translation  involved incorporating lattices from ASR systems as inputs to statistical MT models [18, 6]. More recent approaches have focused on end-to-end models.  demonstrate the viability of this approach on a small synthetic corpus.  outperform a cascade model using a similar architecture to an attention-based ASR model.  and  show that multi-task learning can further improve an end-to-end model. Pretraining has also been shown to improve end-to-end models [2, 1].
 note that cascaded models are at a disadvantage when constrained to be trained on speech translation data only. They demonstrate how to leverage additional ASR and MT training data with an attention-passing mechanism.  demonstrate improvements to an end-to-end model using MT-augmented and TTS-augmented data but do so on a proprietary dataset. In contrast, our experiments are conducted on two public datasets where we obtain new state-of-the-art performance and we provide additional analyses on network architectures and recommendations on how to better leverage TTS-augmented data.
We have demonstrated that cascade models are very competitive when they are not constrained to be trained on AST data only. We have also demonstrated a number of techniques aimed at bridging the gap between end-to-end models and cascade models. With data augmentation, pretraining, fine-tuning and architecture selection, we trained end-to-end models that show competitive performance compared to the cascade approach. We also analyzed the effect of TTS data in terms of quality, quantity and the use of a single speaker vs. multiple speakers, and provided recommendations on how to leverage this type of data. In the future, we would like to investigate how to make more effective use of larger-scale TTS-generated data.
-  (2018) Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. arXiv preprint arXiv:1809.01431. Cited by: §7.
-  (2018) End-to-end automatic speech translation of audiobooks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6224–6228. Cited by: item 1, item 3, §1, §3.1, §4.2, §4.3, Table 3, §7.
-  (2016) Listen and translate: a proof of concept for end-to-end speech-to-text translation. In NIPS Workshop on end-to-end learning for speech and audio processing, Cited by: §1, §1, §7.
-  (2014) Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, Maryland, USA, pp. 12–58. Cited by: §1, §4.1.
-  (2016) Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation, Berlin, Germany, pp. 131–198. Cited by: §4.1.
-  (2008) Recent efforts in spoken language translation. IEEE Signal Processing Magazine 25 (3), pp. 80–88. Cited by: §7.
-  (2016) Listen, attend and spell: a neural network for large vocabulary conversational speech recognition. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960–4964. Cited by: §1.
-  (2015) Attention-based models for speech recognition. In Advances in neural information processing systems, pp. 577–585. Cited by: §1, §3.2.
-  (2017) Copied monolingual data improves low-resource neural machine translation. In Proceedings of the Second Conference on Machine Translation, Copenhagen, Denmark, pp. 148–156. Cited by: §6.4.
-  (2019) MuST-C: a multilingual speech translation corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Minneapolis, MN, USA. Cited by: item 3, §1, §4.1, §4.2.
-  (2018) Back-translation-style data augmentation for end-to-end ASR. In 2018 IEEE Spoken Language Technology Workshop (SLT), pp. 426–433. Cited by: §2, §5.2.
-  (2019) Leveraging weakly supervised data to improve end-to-end speech-to-text translation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7180–7184. Cited by: §1, §7.
-  (2015) Adam: a method for stochastic optimization. Cited by: §4.4.
-  (2018) Augmenting Librispeech with French translations: a multimodal corpus for direct speech translation evaluation. In LREC (Language Resources and Evaluation Conference), Cited by: item 3, §1, §4.1.
-  (2018) Mixed-precision training for NLP and speech recognition with OpenSeq2Seq. Cited by: §5.2.
-  (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, pp. 66–71. Cited by: §4.2.
-  (2015) Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1412–1421. Cited by: §3.2.
-  (2005) On the integration of speech recognition and statistical machine translation. In Ninth European Conference on Speech Communication and Technology, Cited by: §7.
-  (2019) Transformers with convolutional context for ASR. arXiv preprint arXiv:1904.11660. Cited by: item 3, §3.3.
-  (1999) Speech translation: coupling of recognition and translation. In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No. 99CH36258), Vol. 1, pp. 517–520. Cited by: §1, §7.
-  (2015) Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210. Cited by: §1, §4.1.
-  (2013) Improved speech-to-text translation with the Fisher and Callhome Spanish–English speech translation corpus. In Proc. IWSLT, Cited by: §1, §1.
-  (2018) Multi-representation ensembles and delayed SGD updates improve syntax-based NMT. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 319–325. Cited by: §4.4.
-  (2015) Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709. Cited by: §2, §5.2.
-  (2018) Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4779–4783. Cited by: §5.2.
-  (2015) Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. Cited by: §3.2.
-  (2019) Attention-passing models for robust and data-efficient end-to-end speech translation. arXiv preprint arXiv:1904.07209. Cited by: §1, §7.
-  (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §3.3, §5.1.
-  (2018) Style tokens: unsupervised style modeling, control and transfer in end-to-end speech synthesis. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, Stockholm, Sweden, pp. 5180–5189. Cited by: §5.2.
-  (2018) ESPnet: end-to-end speech processing toolkit. In Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, pp. 2207–2211. Cited by: item 3, §3.2.
-  (2017) Sequence-to-sequence models can directly translate foreign speech. In Proc. Interspeech 2017, pp. 2625–2629. External Links: Cited by: §1, §1, §7.
-  (2012) ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §4.4.