Building a successful speech translation (ST) system from scratch is not always possible because of limitations in computational and data resources. Recent work indicates that increasing pre-trained model size still yields performance improvements on downstream NLP tasks Sanh et al. (2020). Consequently, pre-trained models have grown ever larger, sometimes making it impractical to fine-tune them. Collecting end-to-end data is also expensive: it requires finding high-quality data, aligning audio, transcript, and translation, and filtering out wrong or poorly aligned samples. To address these challenges, this work focuses on improving the computational and data efficiency of speech translation built on pre-trained models.
Our first contribution is a comparison between a cascaded system and an end-to-end system, both built from pre-trained models. End-to-end systems have been developed recently and shown to reach performance comparable to cascaded systems Niehues et al. (2018); Ansari et al. (2020); Bentivogli et al. (2021); however, there is no clear evidence that either approach has a performance advantage. This work compares the two systems by directly combining pre-trained models without architecture modification. Our results show that the end-to-end system outperforms the cascaded system in both fine-tuning efficiency and accuracy. As the second contribution, we propose two fine-tuning strategies to improve computational efficiency. Rather than fine-tuning the entire ST model, we present two effective approaches; in particular, the adapter approach fine-tunes fewer than one-tenth of the parameters and achieves performance comparable to the cascaded model. The third contribution is a novel similarity loss that mitigates the data scarcity issue. Unlike end-to-end data, which are challenging to acquire, speech-to-transcript data are more accessible. The similarity loss measures the difference between the latent representations of the audio and the transcript. Our results show that adding the similarity loss improves data efficiency and boosts model performance.
2 Related work
The computational limitation is one major obstacle for end-to-end ST systems. Li et al. (2021) show that fine-tuning only the layer-normalization and multi-head attention parameters is effective for improving computational efficiency. Le et al. (2021) show that fine-tuning residual adapter modules inserted between the encoders and decoders is a promising approach. These studies demonstrate the viability of fine-tuning only components of pre-trained models and motivate this work to explore other efficient fine-tuning approaches.
The lack of end-to-end training data is another obstacle for end-to-end ST systems. Recent work addresses it by leveraging available data resources through multi-task learning Weiss et al. (2017); Anastasopoulos and Chiang (2018); Bérard et al. (2018); Gaido et al. (2020); Liu et al. (2019) and by generating synthetic data Jia et al. (2019); Pino et al. (2020); Lam et al. (2022). This work extends this idea with a novel loss function to use the available data efficiently.
3 Speech translation using pre-trained models
In this work, we propose cascaded and end-to-end combinations of pre-trained models to build ST systems. A first baseline approach is to combine the two models in a cascaded manner. However, the cascaded approach has several drawbacks, e.g., error propagation and computational complexity. Therefore, we also investigate combining the two pre-trained models into one end-to-end speech translation model.
Cascaded system In the ASR stage, the module takes acoustic data as input and outputs the transcript. In the following MT stage, the transcript is first segmented into sub-words according to the vocabulary of the translation system. The input is then fed into the MT module to generate the translation.
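The cascaded pipeline can be sketched as three stages chained together; in this minimal sketch the three callables are stand-ins for the fine-tuned wav2vec2.0 ASR model, the MT system's subword segmenter, and the MBart50 translator, not the actual APIs:

```python
def cascaded_translate(audio, asr, segment, mt):
    """Hypothetical cascaded ST pipeline: ASR -> subword segmentation -> MT.

    All three arguments are placeholder callables standing in for the
    real fine-tuned modules described in the text.
    """
    transcript = asr(audio)          # acoustic input -> transcript
    subwords = segment(transcript)   # transcript -> subword units
    return mt(subwords)              # subwords -> translation
```

Any concrete ASR and MT implementations with these interfaces can be plugged in without changing the pipeline itself.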
End-to-end system Instead of generating an intermediate transcript, we combine the two pre-trained models by feeding the hidden state representations produced by the ASR module into the MT module. The output of the ASR module is character-based, whereas the input of the MT module is subword-based, so the lengths of the speech and text sequences differ greatly for the same segment. This length inconsistency is hard for the ST model to learn and harms performance. We therefore insert a compression layer between the two modules based on the Connectionist Temporal Classification (CTC) algorithm Graves et al. (2006); Gaido et al. (2021). The layer averages adjacent speech representations aligned to the same character, compressing redundant and uninformative vectors.
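The compression step above can be sketched as follows: consecutive frames whose CTC argmax prediction is the same character are collapsed into their mean vector. This is a minimal sketch of the averaging idea, not the exact implementation (e.g., it does not treat CTC blank frames specially):

```python
import torch

def ctc_compress(hidden, ctc_logits):
    """Average consecutive frame vectors that share the same CTC argmax label.

    hidden:     [T, D] frame-level speech representations
    ctc_logits: [T, V] per-frame CTC scores over the character vocabulary
    returns:    [T', D] compressed sequence with T' <= T
    """
    preds = ctc_logits.argmax(dim=-1)  # [T] predicted character per frame
    segments, start = [], 0
    for t in range(1, preds.size(0) + 1):
        # close a segment when the label changes or the sequence ends
        if t == preds.size(0) or preds[t] != preds[start]:
            segments.append(hidden[start:t].mean(dim=0))
            start = t
    return torch.stack(segments)
```

The output length then approximates the number of characters rather than the number of acoustic frames, narrowing the length gap to the subword-based MT input.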
To address the high memory demand, we propose a two-stage training scheme for the end-to-end system that enables training and inference on a single GPU. In the first stage, we fine-tune the pre-trained models on the individual speech recognition and text translation tasks; here, all parameters of each model are updated. In the second stage, we jointly train the entire model on the end-to-end task but update only part of the parameters to improve computational efficiency. First, we propose fine-tuning only the encoder of the MT module and freezing the rest; the motivation is to resolve the discrepancy between the speech representations from the ASR module and the text representations expected by the decoder. Second, inspired by adapters investigated in MT Bapna et al. (2019) and multilingual ST Le et al. (2021), we propose a simple adapter of three BLSTM layers (Figure 2). The adapter is inserted between the ASR and MT modules to integrate the semantic information of the two.
To tackle the lack of end-to-end data, inspired by Pham et al. (2019), we propose a similarity loss function (Figure 2). The motivation is that a speech translation model should produce similar hidden state representations for aligned audio and transcript; minimizing the similarity loss should therefore improve speech translation performance. The last hidden states of the MT encoder for the audio and the aligned transcript are averaged over time steps to produce representation vectors, and the Mean Squared Error between these vectors is used as the loss. Since we propose using only speech-to-transcript training data, the model does not know the target language. Therefore, we implement a target forcing mechanism: a target-language-specific embedding is generated with the pre-trained MT embedding layer and prepended to the speech representations Ha et al. (2016); Johnson et al. (2017); Gangi et al. (2019). Besides, we combine the similarity loss with the end-to-end system using the adapter, fine-tuning only the adapter and freezing the rest; this prevents the parameters from being pushed toward zero merely to drive the similarity loss to its optimum of zero.
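The similarity loss and the target forcing step can be sketched as below; the factor of 100 mirrors the scaling mentioned in Section 4, and the function names are illustrative, not the paper's own:

```python
import torch
import torch.nn.functional as F

def similarity_loss(audio_states, text_states, scale=100.0):
    """MSE between time-averaged MT-encoder states for audio and transcript.

    audio_states, text_states: [batch, time, dim] last encoder hidden states
    (time dimensions may differ; averaging removes them).
    """
    a = audio_states.mean(dim=1)  # [batch, dim]
    t = text_states.mean(dim=1)   # [batch, dim]
    return scale * F.mse_loss(a, t)

def target_force(speech_repr, lang_embed):
    """Prepend a target-language embedding along the time axis.

    speech_repr: [batch, time, dim], lang_embed: [batch, dim]
    returns:     [batch, time + 1, dim]
    """
    return torch.cat([lang_embed.unsqueeze(1), speech_repr], dim=1)
```

Averaging over time makes the loss independent of the (different) audio and text sequence lengths, which is what allows the two modalities to be compared directly.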
4 Experiments & Results
The proposed approaches are evaluated on the English-German speech translation task of the CoVoST2 Wang et al. (2020) dataset. In pre-processing, we remove incomplete samples with no transcript or translation, as well as the double quotes at the beginning and end of transcripts and translations. Besides, we build a custom ASR vocabulary consisting of all distinct characters. We use the pre-trained wav2vec2.0 Baevski et al. (2020) with the large architecture for the speech recognition task and a pre-trained MBart50 Liu et al. (2020); Tang et al. (2020) for the machine translation task.
Cascaded system In the cascaded combination, we explore the efficiency of fine-tuning each component. First, we experiment with initializing directly from the pre-trained parameters to provide reference scores. Then, we fine-tune the pre-trained wav2vec2.0 and MBart50 models with speech-to-transcript and transcript-to-translation training data, respectively, and evaluate different combinations of pre-trained and fine-tuned parameters. As Table 1 shows, fine-tuning both modules yields the best result: an improvement of 4.9 BLEU points over no fine-tuning and of 3.3 BLEU points over the CoVoST2 cascaded baseline. We also find that fine-tuning only the encoder of the MT model slightly improves performance: with 25% of the parameters, it achieves 41% of the improvement obtained by fine-tuning the entire MBart50.
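Partial fine-tuning such as "encoder only" comes down to toggling `requires_grad` on parameter groups. A small helper sketch, assuming parameters are addressed by name prefixes (the prefix strings here are hypothetical, not the paper's module names):

```python
import torch.nn as nn

def freeze_except(model, trainable_prefixes):
    """Freeze every parameter except those whose name starts with a prefix.

    Example (hypothetical names): freeze_except(st_model, ["mt.encoder"])
    would leave only the MT encoder trainable.
    """
    for name, p in model.named_parameters():
        p.requires_grad = any(name.startswith(pre) for pre in trainable_prefixes)
```

The optimizer should then be built over `filter(lambda p: p.requires_grad, model.parameters())` so frozen weights receive no updates.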
End-to-end system In the end-to-end combination, we apply the two-stage training scheme (Section 3): the first stage fine-tunes the pre-trained models individually, and the second stage fine-tunes the end-to-end model with end-to-end data. Analogous to building the cascaded model directly from the ASR and MT modules, we first experiment with pure initialization of the end-to-end system. As Table 2 shows (experiments 1, 2 and 3), the end-to-end model does not work without end-to-end training data. Next, we vary which modules are fine-tuned in the first stage and fine-tune either the MT encoder or the adapter in the second stage to address computational efficiency. As experiments 4, 5 and 6 show, training the ASR component independently in the first stage and the MT encoder on the end-to-end data in the second stage is a promising way to build end-to-end ST systems. With this configuration (E4), we achieve better translation quality than the cascaded system. Furthermore, the end-to-end combination improves over the CoVoST2 end-to-end baseline by 4.5 BLEU points.
A second approach is to integrate the additional adapter layer (E7), where we only need to train 67M instead of 150M parameters. The performance is 2 BLEU points worse; however, the pre-trained MT model remains unchanged and can, for example, still be used for text translation in parallel.
In a second series of experiments, we evaluated the data efficiency of the end-to-end model with respect to end-to-end training data. To this end, we investigated the effect of the similarity loss on the adapter-based system (E7). To reuse the hyperparameters of the previous experiments, we scale the similarity loss by a factor of 100 so that it matches the scale of the original loss. We then evaluate model performance when training on different portions of the training data.
In a first experiment, we evaluated the model under the zero-shot condition, where no end-to-end training data is available. As Table 3 shows, this approach fails to enable speech translation: the output stays in the source language despite the target forcing mechanism. We therefore expect that a small amount of end-to-end data would solve the issue. Continuing from the model trained with the similarity loss, we train with the original loss on different amounts of data. With 10% of the training data, the model translates into the correct target language, though poorly. With 20% of the data, adding the similarity loss improves performance by 51% compared with training without it; the score reaches 17.8 BLEU points, 85% of the best end-to-end model's performance. Moreover, compared with the learning curve of the original loss, the similarity loss enables the model to perform speech translation with less training data, demonstrating a promising approach to improving data efficiency. Finally, with all training data, training with the similarity loss still gains 0.7 BLEU points. We therefore conclude that the similarity loss increases data efficiency and also benefits model performance.
5 Conclusion
In this work, we proposed using pre-trained speech recognition and text translation models to build a state-of-the-art speech translation system with limited resources. While a cascaded combination directly achieves relatively good performance, we developed several techniques that enable an end-to-end system to use these models and to handle the different word representations of the pre-trained models. Second, we proposed two training strategies that allow training and inference on a single GPU. Finally, we presented an additional training loss that reduces the need for end-to-end training data. With all these techniques, the proposed end-to-end model outperforms the cascaded model.
References
- Anastasopoulos and Chiang (2018). Tied multitask learning for neural speech translation. arXiv preprint arXiv:1802.06655.
- Ansari et al. (2020). Findings of the IWSLT 2020 Evaluation Campaign. In Proceedings of the 17th International Conference on Spoken Language Translation, Online, pp. 1–34.
- Baevski et al. (2020). wav2vec 2.0: a framework for self-supervised learning of speech representations.
- Bapna et al. (2019). Simple, scalable adaptation for neural machine translation. arXiv preprint arXiv:1909.08478.
- Bentivogli et al. (2021). Cascade versus direct speech translation: do the differences still make a difference?
- Bérard et al. (2018). End-to-end automatic speech translation of audiobooks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6224–6228.
- Gaido et al. (2021). CTC-based compression for direct speech translation.
- Gaido et al. (2020). End-to-end speech-translation with knowledge distillation: FBK@IWSLT2020.
- Gangi et al. (2019). One-to-many multilingual end-to-end speech translation.
- Graves et al. (2006). Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376.
- Ha et al. (2016). Toward multilingual neural machine translation with universal encoder and decoder. arXiv preprint arXiv:1611.04798.
- Jia et al. (2019). Leveraging weakly supervised data to improve end-to-end speech-to-text translation. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7180–7184.
- Johnson et al. (2017). Google's multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, pp. 339–351.
- Lam et al. (2022). Sample, translate, recombine: leveraging audio alignments for data augmentation in end-to-end speech translation. arXiv preprint arXiv:2203.08757.
- Le et al. (2021). Lightweight adapter tuning for multilingual speech translation.
- Li et al. (2021). Multilingual speech translation with efficient finetuning of pretrained models.
- Liu et al. (2020). Multilingual denoising pre-training for neural machine translation.
- Liu et al. (2019). End-to-end speech translation with knowledge distillation. arXiv preprint arXiv:1904.08075.
- Niehues et al. (2018). The IWSLT 2018 evaluation campaign.
- Pham et al. (2019). Improving zero-shot translation with language-independent constraints.
- Pino et al. (2020). Self-training for end-to-end speech translation. arXiv preprint arXiv:2006.02490.
- Sanh et al. (2020). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.
- Tang et al. (2020). Multilingual translation with extensible multilingual pretraining and finetuning.
- Wang et al. (2020). CoVoST 2 and massively multilingual speech-to-text translation.
- Weiss et al. (2017). Sequence-to-sequence models can directly translate foreign speech.