
Simple and Effective Unsupervised Speech Translation

by Changhan Wang et al.

The amount of labeled data available to train models for speech tasks is limited for most languages, and this data scarcity is exacerbated for speech translation, which requires labeled data covering two different languages. To address this issue, we study a simple and effective approach to building speech translation systems without labeled data by leveraging recent advances in unsupervised speech recognition, machine translation and speech synthesis, either in a pipeline approach, or to generate pseudo-labels for training end-to-end speech translation models. Furthermore, we present an unsupervised domain adaptation technique for pre-trained speech models which improves the performance of downstream unsupervised speech recognition, especially in low-resource settings. Experiments show that unsupervised speech-to-text translation outperforms the previous unsupervised state of the art by 3.2 BLEU on the Libri-Trans benchmark. On CoVoST 2, our best systems outperform the best supervised end-to-end models (without pre-training) from only two years ago by an average of 5.0 BLEU over five X-En directions. We also report competitive results on the MuST-C and CVSS benchmarks.





1 Introduction

Training supervised speech systems requires large amounts of labeled data, which is available for only a small fraction of the over 7,000 languages spoken around the world (Lewis et al., 2022). Despite much recent effort in creating speech translation corpora (Di Gangi et al., 2019a; Wang et al., 2021b), only a few dozen language directions are covered. The lack of labeled training data is even more acute for speech translation because it requires aligned labels in two languages, which increases the effort of creating such datasets. This raises the question of whether speech translation systems can be built using little labeled data, or none at all.

Recent work on unsupervised speech recognition has achieved performance that can enable useful systems using no labeled data (Yeh et al., 2019; Liu et al., 2018; Chen et al., 2019; Baevski et al., 2021; Liu et al., 2022a), enabled in large part by the advances in self-supervised speech representation learning (Schneider et al., 2019; Baevski et al., 2020). These techniques were also used to build unsupervised text-to-speech systems (Liu et al., 2022b). Similarly, unsupervised text-to-text machine translation has shown great promise for certain language directions (Conneau et al., 2018; Lample et al., 2018; Artetxe et al., 2018).

In this paper, we study a method to build end-to-end unsupervised speech-to-text and speech-to-speech translation systems trained on synthetic training data obtained by cascading existing unsupervised techniques: we first transcribe speech utterances in the source language using unsupervised speech recognition (Baevski et al., 2021; Liu et al., 2022a), then translate the resulting transcription using unsupervised machine translation (Lample et al., 2018; Artetxe et al., 2018; Liu et al., 2020), and finally synthesize the translation into a target language speech utterance using unsupervised speech synthesis (Liu et al., 2022b). We also consider applying the pipeline directly at inference time. Our approach benefits from the use of self-supervised speech models (Baevski et al., 2020; Liu et al., 2020) and to further improve performance, we present a technique to adapt existing self-supervised models to the target domain.

Figure 1: Overview of the proposed approach to unsupervised speech-to-text translation (S2TT) and speech-to-speech translation (S2ST). We first adapt a pre-trained speech model (wav2vec 2.0) to the input language and domain of interest, and then cascade unsupervised speech recognition (ASR), unsupervised text de-normalization (TDN), unsupervised machine translation (MT) and unsupervised speech synthesis (TTS) models to produce pseudo-labels for end-to-end S2TT and S2ST model training. Our models rely only on unlabeled speech data and unpaired text data, without the need for any human annotation.

2 Background

Unsupervised speech recognition.

Liu et al. (2018) present some of the earliest work on unsupervised phoneme recognition, based on adversarial training. Wav2vec-U (Baevski et al., 2021) effectively applied self-supervised speech representations, introduced a new evaluation metric, and compared to state-of-the-art supervised systems trained on large amounts of labeled data. Wav2vec-U 2.0 (Liu et al., 2022a) simplifies audio-side pre-processing and improves accuracy through a better architecture and training objective. Lin et al. (2022) show that out-of-domain speech pre-training or out-of-domain text data hurts the training robustness of wav2vec-U models, especially in low-resource settings.

Unsupervised speech synthesis.

Recent work has demonstrated that unsupervised speech synthesis systems can achieve performance comparable to supervised systems (Liu et al., 2022b; Ni et al., 2022). These systems label speech audio with unsupervised speech recognition models and train text-to-speech models on the resulting pseudo-labeled data.

Unsupervised machine translation.

Lample et al. (2018) and Artetxe et al. (2018) built the first fully unsupervised machine translation (MT) systems by exploiting the cross-lingual similarity of representations in multilingual sequence-to-sequence models, as well as back-translation to further refine the initial models. mBART (Liu et al., 2020) used a similar model architecture and training process to build unsupervised MT models, but utilized a larger-scale multilingual text corpus (Conneau et al., 2020) and an updated noising strategy for pre-training with a denoising autoencoder objective.

End-to-end speech translation.

End-to-end sequence-to-sequence modeling has seen increased application in speech-to-text translation (Duong et al., 2016; Bérard et al., 2016; Weiss et al., 2017; Bansal et al., 2017; Vila et al., 2018; Di Gangi et al., 2019b; Ren et al., 2020; Li et al., 2021) and speech-to-speech translation (Jia et al., 2019; Kano et al., 2021; Jia et al., 2022). Compared to cascaded systems, end-to-end speech translation models have a simpler pipeline and lower inference latency. Recent end-to-end speech-to-text translation (S2TT) models have been shown to perform comparably to their cascaded counterparts on the well-established MuST-C benchmark (Bentivogli et al., 2021). Given the scarcity of speech translation corpora, there have been recent attempts at building end-to-end S2TT models under low-resource settings (Bansal et al., 2018, 2019; Cheng et al., 2021) or unsupervised settings (Chung et al., 2019).

3 Methods

Figure 1 provides an overview of our proposed approach to unsupervised speech-to-text translation (S2TT) and speech-to-speech translation (S2ST). We leverage a cascade of unsupervised models to produce pseudo-labels for end-to-end S2TT and S2ST model training. To mitigate language and domain mismatch in speech pre-training (wav2vec 2.0), we finetune wav2vec 2.0 models using unlabeled in-domain speech data, and then use the adapted models to build downstream speech recognition models.

3.1 Unsupervised Cascaded Pseudo-Labeling

We cascade unsupervised speech recognition (ASR), unsupervised text de-normalization (TDN) and unsupervised machine translation (MT) models to produce pseudo-labels for S2TT. For S2ST, we additionally apply unsupervised speech synthesis (TTS) models to MT model outputs to obtain synthesized target speech.
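As a sketch, the pseudo-labeling cascade amounts to a function composition over the four stages. The stand-in functions below are illustrative toys (our own names and outputs, not the paper's models); each stage stands in for a separately trained unsupervised model:

```python
# Toy sketch of the pseudo-labeling cascade from Figure 1.
# Each stage is a stand-in for a trained unsupervised model.

def asr(audio):
    """Unsupervised ASR: audio -> normalized spoken-form transcript."""
    return "how are you"                  # stand-in output

def tdn(text):
    """Text de-normalization: restore case and punctuation."""
    return text.capitalize() + "?"

def mt(text):
    """Unsupervised MT: source-language text -> target-language text."""
    return {"How are you?": "Comment allez-vous ?"}[text]   # toy lookup

def tts(text):
    """Unsupervised TTS: target text -> synthesized target speech."""
    return [0.0] * 16000                  # stand-in waveform

def pseudo_label(audio):
    # S2TT labels come from ASR -> TDN -> MT; S2ST labels add TTS on top.
    translation = mt(tdn(asr(audio)))
    return {"s2tt_label": translation, "s2st_label": tts(translation)}

labels = pseudo_label(audio=[0.0] * 16000)
```

In practice, `asr`, `tdn`, `mt` and `tts` would each wrap a trained wav2vec-U 2.0, de-normalizer, mBART-OBT and TTS model, respectively.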

Unsupervised ASR.

We adopt wav2vec-U 2.0 (Liu et al., 2022a), which learns a mapping from self-supervised speech representations to phonemes via adversarial training, and decodes phonemes into words via a weighted finite-state transducer (Mohri, 1997). To improve adversarial training stability and suppress overfitting in low-resource settings, we add Gaussian noise to the frozen input features, as well as R-Drop regularization (Wu et al., 2021) on the logit outputs of the generator:

    L_rd = 1/2 (D_KL(G_1 || G_2) + D_KL(G_2 || G_1))

where G_1 and G_2 are two generator instances with different dropout masks, and D_KL is the Kullback-Leibler (KL) divergence. We add L_rd, weighted by a hyper-parameter, to the wav2vec-U 2.0 objective function. After adversarial learning, we follow Baevski et al. (2021) to perform self-training with a Hidden Markov Model (HMM), and fine-tune the adapted wav2vec 2.0 model again with the CTC objective on the HMM labels. We denote the final ASR model as “w2vu2-CTC”.
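A minimal numpy sketch of this symmetric-KL consistency term (function names are ours; the batch-mean reduction is an assumption):

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax over the class dimension.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def kl(p, q, eps=1e-12):
    # KL(p || q) per example, summed over classes.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def r_drop_loss(logits_a, logits_b):
    """Symmetric KL between two forward passes of the same input
    with different dropout masks (R-Drop style consistency)."""
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl(p, q) + kl(q, p)).mean()
```

This term is zero when both passes agree exactly, and its weighted value would be added to the overall training objective.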

Unsupervised MT.

We adopt mBART (Liu et al., 2020), which has a Transformer architecture (Vaswani et al., 2017) with model parameters shared across all training languages. It first obtains initial cross-lingual alignments for all languages via a denoising autoencoder objective (Vincent et al., 2010), and then refines the alignments for one specific language pair via bidirectional online back-translation on that pair of languages. We denote this model as “mBART-OBT”.

Unsupervised TDN.

ASR models decode normalized spoken-form texts, which have no case or punctuation (except hyphens and apostrophes). MT models, however, encode unnormalized written-form texts that have case and punctuation. This discrepancy leads to quality degradation when we cascade the two models directly for pseudo-labeling. To mitigate the mismatch, we de-normalize ASR model outputs into their unnormalized written form before feeding them into MT models. The text de-normalizer is an mBART model pre-trained with a denoising autoencoder objective and fine-tuned on paired data of raw text (output) and its normalized version (input).
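The fine-tuning pairs can be generated from raw text alone by applying the normalization described above. The sketch below assumes English text and uses a simple regex-based normalizer of our own (real spoken-form normalization would also handle numbers, abbreviations, etc.):

```python
import re

def normalize(text):
    """Map written-form text to spoken form: lowercase, strip all
    punctuation except hyphen and apostrophe (as described above)."""
    text = text.lower()
    text = re.sub(r"[^a-z' \-]", " ", text)   # English-only assumption
    return " ".join(text.split())

def make_tdn_pairs(raw_sentences):
    """(input, output) pairs for de-normalizer fine-tuning:
    normalized version as input, raw written form as target."""
    return [(normalize(s), s) for s in raw_sentences]

pairs = make_tdn_pairs(["Hello, world!", "It's a well-known fact."])
# pairs[0] == ("hello world", "Hello, world!")
```

Because the normalized side is derived deterministically from unpaired raw text, no human annotation is needed to train the de-normalizer.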

Unsupervised TTS.

We follow Liu et al. (2022b) to produce phoneme labels for unlabeled speech data with wav2vec-U 2.0, and then train an autoregressive Transformer TTS model (Li et al., 2019) on the pseudo-labeled data. For wav2vec-U 2.0, we perform HMM-based self-training and fine-tune the pre-trained wav2vec 2.0 model with HMM phoneme labels. To alleviate under-generation and over-generation issues in autoregressive models, we add an R-Drop style consistency loss

    L_eos = 1/2 (D_KL(p_1 || p_2) + D_KL(p_2 || p_1))

to the objective function (weighted by a hyper-parameter) for better end-of-sentence (EOS) predictions, where p_1 and p_2 are two EOS predictions on the same input with different dropout masks.

3.2 Unsupervised Adaptation of wav2vec 2.0 Pre-trained Models

Next, we present a method to improve performance when the domain of the data used for self-supervised pre-training differs from the downstream task domain which is often the case for low-resource languages. Specifically, we adapt out-of-domain or out-of-language wav2vec 2.0 models to the domain and language of interest by fine-tuning the entire wav2vec 2.0 models on discrete labels obtained from unlabeled in-domain data using the CTC objective (Graves et al., 2006).

To obtain discrete labels, we first collect all the wav2vec 2.0 speech representations for the training data, and perform k-means clustering to identify K clusters. Then, for each utterance with T speech representation frames, we label each frame with its corresponding cluster id c_t, where t = 1, ..., T and c_t ∈ {1, ..., K}. Finally, we merge identical consecutive ids to obtain the final labels c'_1, ..., c'_{T'}, where T' ≤ T and c'_t ≠ c'_{t+1}.

After unsupervised fine-tuning with discrete labels, we discard the output projection layer used for the CTC objective, and use the resulting wav2vec 2.0 trunk instead of the original wav2vec 2.0 model in the downstream tasks. The adapted models are used to extract speech representations for wav2vec-U 2.0 models, as well as pre-train encoders of the CTC models in wav2vec-U self-training.
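The labeling and merging steps can be sketched as follows (numpy only; `assign_clusters` stands in for the k-means assignment step, with centroids assumed to come from a prior k-means run):

```python
import numpy as np

def assign_clusters(frames, centroids):
    """Label each speech-representation frame (T x D) with the id of
    its nearest k-means centroid (K x D)."""
    d = np.linalg.norm(frames[:, None, :] - centroids[None, :, :], axis=-1)
    return d.argmin(axis=1)

def merge_consecutive(ids):
    """Collapse runs of identical consecutive cluster ids, so the
    resulting label sequence has no repeated adjacent symbols."""
    out = [ids[0]]
    for c in ids[1:]:
        if c != out[-1]:
            out.append(c)
    return out
```

The merged sequence is what the CTC fine-tuning described above would consume as its target labels.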

3.3 End-to-end Model Training with Pseudo-labels

After obtaining pseudo-labels from the cascade of unsupervised models, we train end-to-end S2TT and S2ST models with supervised objectives on these pseudo-labels. For end-to-end S2TT, we adopt the model architecture of Li et al. (2021), which we denote as “w2v2-mBART”. We pre-train its encoder with the unsupervised ASR model, w2vu2-CTC, and its decoder with the unsupervised MT model, mBART-OBT. For end-to-end S2ST, we adopt a variant of Translatotron 2 (Jia et al., 2022), Spec-T2, which adds an additional encoder in between Translatotron 2’s two decoders and replaces Translatotron 2’s second decoder with an autoregressive Transformer decoder (Li et al., 2019). Similar to w2v2-mBART, we pre-train Spec-T2’s first encoder and first decoder with w2vu2-CTC and mBART-OBT, respectively.

                                                    Fr-En  Es-En  Ru-En  Et-En  Lv-En  Avg.
Duration (hrs)                                        264    113     16      3      2
Bilingual setup
Supervised learning + pre-training
  End-to-end (w2v2-mBART)                            35.7   36.2   39.4    5.7   13.5   26.1
Supervised learning
  End-to-end (S2T Transformer; Wang et al., 2020)    26.3   23.0   14.8    0.1    2.5   13.3
Unsupervised learning
  Cascaded (ASR → TDN → MT)                          24.4   23.4   27.8    8.5    7.6   18.3
  End-to-end (w2v2-mBART)                            24.2   24.0   25.6    3.9    2.8   16.1
Multilingual setup
Supervised learning + pre-training
  End-to-end (w2v2-mBART), 21 langs. → En
    (Babu et al., 2021)                              32.9   34.1   26.4    3.5    6.0   20.6
Supervised learning
  End-to-end (S2T Transformer), 21 langs. → En
    (Wang et al., 2020)                              26.9   26.3    9.6    0.4    0.6   12.8
Unsupervised learning
  End-to-end (w2v2-mBART)                            24.3   24.0   22.8    3.1    1.0   15.0
Table 1: Bilingual and multilingual X-En speech-to-text translation results: test BLEU on CoVoST 2. Et-En and Lv-En are low-resource, with only 3h and 2h of training data, respectively. End-to-end modeling on these two directions suffers from overfitting.
                                    En-Es  En-Ru  En-Fr
Duration (hrs)                        504    489    100
Supervised learning + pre-training
  End-to-end (w2v2-mBART)            32.4   20.0   23.1
Supervised learning
  End-to-end (S2T Transformer)       27.2   15.3   11.4
Unsupervised learning
  Chung et al. (2019)                 N/A    N/A   12.2
  Cascaded (ASR → TDN → MT)          22.0   10.0   15.4
  End-to-end (w2v2-mBART)            23.8    9.8   15.3
Table 2: Bilingual En-X speech-to-text translation results: test BLEU on MuST-C (En-Es and En-Ru) and Libri-Trans (En-Fr). Our best system outperforms the previous state of the art (Chung et al., 2019) on Libri-Trans by 3.2 BLEU. The S2T Transformer baselines are from Wang et al. (2020); we report the configuration with the best result selected supervisedly out of 10 runs.

4 Experimental Setup

We evaluate our translation models on 5 directions into English (Fr-En, Es-En, Ru-En, Et-En and Lv-En) and 3 directions out of English (En-Es, En-Ru and En-Fr). The 5 non-English languages are from 4 different Indo-European language family sub-groups: Romance (Fr and Es), Slavic (Ru), Uralic (Et) and Baltic (Lv). For the X-En directions, we evaluate S2TT models on CoVoST 2 (Wang et al., 2021b) and evaluate S2ST models on CVSS-C (Jia et al., 2022), which adds synthetic target speech to CoVoST 2 with a single canonical speaker voice. For the En-X directions, we only evaluate S2TT models. We use MuST-C (Di Gangi et al., 2019a) for En-Es and En-Ru, as well as Libri-Trans (Kocabiyikoglu et al., 2018) for En-Fr. For Libri-Trans, we follow Chung et al. (2019) to combine validation set and test set for evaluation.

Speech pre-training.

We use robust wav2vec 2.0 (Hsu et al., 2021) for English speech, which is trained on datasets from multiple domains. For non-English speech, we adapt open-source VoxPopuli models (Wang et al., 2021a) by CTC fine-tuning with 1024 discrete labels (Fr, Es and Ru) or 128 discrete labels (Et and Lv). We use monolingual VoxPopuli models for Fr and Es, and multilingual models of similar languages for Ru, Et and Lv (Slavic, Uralic and Baltic languages, respectively). We extract speech representations from the 15th layer of the original wav2vec 2.0 models for computing discrete labels.

Speech recognition.

For wav2vec-U 2.0 models, we extract speech representations from the 19th (15th) layer of the adapted (original) wav2vec 2.0 models. We increase the dropout on the batch-normalized input features to 0.2. We set the standard deviation of the input Gaussian noise and the weight of the R-Drop regularization as fixed hyper-parameters, and choose the wav2vec-U 2.0 loss weights from 1.0 / 1.5, 1.5 / 2.5 and 0.3 / 0.5, respectively. For text data, we use an open web-crawled corpus, CC-100 (Conneau et al., 2020), which is created with little curation and has large language coverage. For supervised baselines, we fine-tune adapted wav2vec 2.0 models with the CTC objective on labeled data, which we denote as “w2v2-CTC”.

Machine translation.

We use CC-100 (Conneau et al., 2020) to train bilingual mBART large models for each language pair. For bidirectional online back-translation, we use the same CC-100 data and follow Liu et al. (2020) in applying 99% vocabulary masking for the first 500 updates. For supervised baselines, we fine-tune mBART models on labeled data, which we denote as “mBART-FT”.

Speech synthesis.

We train Transformer TTS models (with the EOS consistency loss described in §3.1) on CVSS-C target speech from the It-En direction to avoid content overlap with the five selected directions. For grapheme-to-phoneme conversion, we employ g2pE (Park, 2019) for English texts and Phonemizer (Bernard, 2015) with the espeak-ng backend for texts in other languages. We resample audio to 22,050 Hz and extract log-Mel spectrograms with FFT size 1024, window length 1024 and hop length 256.
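With the stated parameters (22,050 Hz, FFT/window 1024, hop 256), log-Mel extraction can be sketched in plain numpy as below. This is a from-scratch approximation of standard log-Mel features, not necessarily the exact feature pipeline used in the paper (the mel count of 80 is our assumption):

```python
import numpy as np

SR, N_FFT, WIN, HOP, N_MELS = 22050, 1024, 1024, 256, 80  # N_MELS assumed

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sr=SR):
    # Triangular filters spaced evenly on the mel scale, 0 .. sr/2.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    hz = 700.0 * (10 ** (mels / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fb[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[m - 1, k] = (hi - k) / max(hi - c, 1)
    return fb

def log_mel(wave):
    # Frame, window, magnitude STFT, mel projection, log compression.
    n_frames = 1 + (len(wave) - WIN) // HOP
    window = np.hanning(WIN)
    frames = np.stack([wave[i * HOP : i * HOP + WIN] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1))   # (T, 513)
    return np.log(mel_filterbank() @ mag.T + 1e-6)        # (80, T)
```

Library implementations (e.g. librosa or torchaudio) would normally be used in practice; the sketch only makes the stated parameters concrete.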

End-to-end speech translation.

For bilingual S2TT models, we pre-train the encoder/decoder with w2vu2-CTC/mBART-OBT for unsupervised models, or with w2v2-CTC/mBART-FT for supervised models that leverage pre-training. To alleviate overfitting in low-resource settings (Ru-En, Et-En and Lv-En), we duplicate training examples and equip them with 2 different pseudo-labels from mBART-OBT beam search decoding. For multilingual S2TT and S2ST, we pre-train the speech encoder with XLS-R 0.3B (Babu et al., 2021), and pre-train the text decoder with mBART-OBT from the En-Fr direction.

Checkpoint selection and averaging.

For unsupervised ASR, we adopt the unsupervised metric in Baevski et al. (2021) and average the best 2 checkpoints in the same run. For unsupervised MT and unsupervised TTS, we average the last 5 checkpoints. For end-to-end S2TT/S2ST, we sort checkpoints by losses on the pseudo-labeled validation set and average the best 5 checkpoints.
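Checkpoint averaging is a simple element-wise mean over parameter tensors; a sketch follows (assuming checkpoints are dicts of numpy arrays, whereas real implementations operate on framework state dicts):

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Element-wise average of parameter dicts, e.g. the best-5
    checkpoints selected by validation loss as described above."""
    keys = checkpoints[0].keys()
    return {k: np.mean([c[k] for c in checkpoints], axis=0) for k in keys}
```

The averaged parameters replace any single checkpoint at evaluation time.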

Automatic evaluation of speech outputs.

Following common practice, we first transcribe English speech outputs from the TTS or S2ST model with an open-source English ASR model (examples/wav2vec, “Wav2Vec 2.0 Large (LV-60) + Self Training”), and then calculate WER or BLEU on the ASR transcription for automatic evaluation scores.

                                              Fr-En  Es-En  Ru-En  Et-En  Lv-En  Avg.
Source duration (hrs)                           264    113     16      3      2
Supervised learning + pre-training
  End-to-end (Spec-T2)                         31.8   32.3   32.9    5.2    7.5   21.9
Supervised learning
  End-to-end (Spec-T2)                         27.4   27.7   25.4    4.1    2.5   17.4
Unsupervised learning
  Cascaded (ASR → TDN → MT → TTS), bilingual   21.6   21.2   25.3    7.2    7.7   16.6
  End-to-end (Spec-T2)                         21.2   20.1   19.9    3.2    2.8   13.4
Table 3: Multilingual X-En speech-to-speech translation results: test BLEU on CVSS-C. Our multilingual model is trained on a subset of 5 of the 21 available directions. Appendix A.1 compares our supervised model to Jia et al. (2022) in the 21-direction setting, where the two perform similarly.
wav2vec 2.0 features                   Domain  Hours    Fine-tuning  Fr (264h)  Es (113h)  Ru (16h)  Et (3h)  Lv (2h)
VoxPopuli (Wang et al., 2021a)         out     21K-89K  none         26.7       21.4
                                                        unsup.       21.4       18.3       25.6      22.4     27.8
XLS-R (Babu et al., 2021)              in+out  436K     none         26.1       21.9       32.8
                                                        unsup.       23.4       19.0       28.3      26.4
Robust wav2vec 2.0 (Hsu et al., 2021)  out     63K      none         29.3
                                                        unsup.       31.5       22.7       35.2      35.1
Table 4: Different wav2vec 2.0 features for non-English unsupervised ASR (wav2vec-U 2.0) training: validation PER on CoVoST 2 with Viterbi decoding. All models use the wav2vec 2.0 large configuration. We unsupervisedly fine-tune wav2vec 2.0 models to the language and domain of interest (“unsup.” rows); blank entries correspond to runs without a reported PER (see §5.4 on training convergence failures). Monolingual VoxPopuli models are used for Fr and Es; multilingual models of similar languages for Ru, Et and Lv (trained on the Slavic, Uralic and Baltic languages in VoxPopuli, respectively).

5 Results

5.1 X-En Speech-to-Text Translation

For X-En S2TT, we consider models trained on a single language direction (bilingual) and models covering multiple directions (multilingual). Results are reported on five translation directions into English of the CoVoST 2 benchmark; we focus on end-to-end systems, but also consider a cascade of unsupervised models. Supervised models are trained purely on labeled data without pre-training; supervised models with pre-training additionally use wav2vec and mBART models; unsupervised models also use pre-trained models but no labeled data.

Table 1 shows that our unsupervised end-to-end models outperform the supervised baselines by 5.0 BLEU on average over the five translation directions of the bilingual setup. The supervised baselines represent the best supervised end-to-end models from two years ago. These improvements are due to advances in unsupervised modeling as well as self-supervised pre-training. The supervised models with pre-training generally perform far above the unsupervised models, which shows that there is potential to further improve unsupervised speech translation.

The cascaded unsupervised setup performs better than the end-to-end approach for directions with little synthetic training data, such as Ru-En, Et-En and Lv-En. This is because the end-to-end models are trained on datasets comprising as little as two hours of synthetic speech translation data, on which they overfit. Cascaded unsupervised models do not suffer from this issue because they exploit more text data for unsupervised machine translation (Table 7).

Supervised learning with pre-training performs better in the bilingual setup than in the multilingual setup, both because only a single translation direction needs to be modeled and because the mBART model was pre-trained on 50 languages whereas only a single language is used in each bilingual X-En setup.

5.2 En-X Speech-to-Text Translation

For bilingual En-X S2TT, we compare our unsupervised models to the previous state of the art (Chung et al., 2019) on Libri-Trans (En-Fr), and also evaluate them on the MuST-C benchmark for the En-Es and En-Ru directions. Table 2 shows the test BLEU of our models and the baselines on both benchmarks. On Libri-Trans, our best system outperforms the previous state of the art, an alignment-based cascaded system (Chung et al., 2019), by 3.2 BLEU. On MuST-C, our models also achieve competitive results in this high-resource setting of around 500 hours of training data, coming within 3.4 BLEU and 5.5 BLEU of the supervised baselines on En-Es and En-Ru, respectively.

5.3 X-En Speech-to-Speech Translation

To train a multilingual X-En speech-to-speech translation model, we combine pseudo-labeled bilingual data for multiple translation directions and use the Spec-T2 architecture, a variant of Translatotron 2. We build supervised Spec-T2 baselines with and without pre-training, and evaluate on the CVSS-C benchmark. Table 3 shows that the best unsupervised system is on average only 0.8 BLEU below the supervised baseline. We believe that the unsupervised approach is less effective for speech-to-speech translation than for speech-to-text translation because of increased error accumulation in the synthetic data creation process: the added unsupervised speech synthesis component takes unsupervised translation output as input, which in turn is based on unsupervised speech recognition transcriptions. As with speech-to-text translation, the cascaded unsupervised model performs better than the end-to-end approach, most prominently for the low-resource directions.

                  Fr     Es     Ru     Et     Lv     En     Avg.
Duration (hrs)    264    113     16      3      2    504
Supervised learning + pre-training
  w2v2-CTC       15.7    7.0    7.1   11.1    5.9    6.3    8.9
Supervised learning
  Transformer    18.3   16.0   31.4   65.7   51.8   12.1   32.6
Unsupervised learning
  w2vu2-CTC      23.2   10.3   15.7   17.6   14.8   12.7   15.7
Table 5: Speech recognition results: test WER on CoVoST 2 and MuST-C (En-Es). Pre-trained supervised and unsupervised models are decoded with a 4-gram language model. The Transformer baselines are from Wang et al. (2020).
CVSS Libri-Trans MuST-C
JS Divergence 0.207 0.376 0.369
Supervised learning
  Transformer 12.8 15.0 16.8
Unsupervised learning
  Transformer 15.2 17.1 20.1
Table 6: Speech synthesis results: validation WER for re-synthesis on CVSS-C, Libri-Trans and MuST-C. To quantify training-inference time domain similarity, we follow Lin et al. (2022) to compute Jensen–Shannon divergence (“JSD”) on 4-gram phoneme distributions. Low JSD suggests high similarity.
               Fr-En  Es-En  Ru-En  Et-En  Lv-En  En-Es  En-Ru  En-Fr  Avg.
Non-En text     428M   379M   849M    46M    68M   379M   849M   428M
En text         2.1B (shared across all directions)
Bitext          207K    79K    12K   1.8K   2.3K   259K   259K    47K
Supervised learning + pre-training
  mBART-FT      46.7   46.0   48.4   23.3   29.6   38.7   23.1   21.5   34.6
Supervised learning
  Transformer   37.9   36.3   19.8    0.3    0.2   33.8   15.8   17.9   20.3
Unsupervised learning
  mBART-OBT     40.1   43.8   48.6   19.0   25.0   38.5   22.2   22.1   32.4
Table 7: Machine translation results: test BLEU on CoVoST 2 (X-En), MuST-C (En-Es and En-Ru) and Libri-Trans (En-Fr). We fine-tune the mBART model with bitext data for supervised learning and with unpaired pre-training data for unsupervised learning. The Transformer baselines are from Wang et al. (2020).

5.4 Speech Pre-training

We evaluate the effectiveness of the unsupervised adaptation technique for wav2vec 2.0 models (§3.2) on the five non-English languages, which have less training data than English. We train wav2vec-U 2.0 models on CoVoST 2 with features extracted from three different wav2vec 2.0 models and their adapted versions: 1) out-of-domain models, “VoxPopuli” (Wang et al., 2021a), trained on data in the same language (for Fr and Es) or similar languages (for Ru, Et and Lv) from the same language family subgroup; 2) a massively multilingual model for 128 languages, “XLS-R” (Babu et al., 2021), whose training data contains CoVoST 2; and 3) a multi-domain English model, “robust wav2vec 2.0” (Hsu et al., 2021), for which the target languages are unseen. We report validation PER on Viterbi predictions in Table 4. Speech pre-training on mismatched domains or languages (“VoxPopuli” and “robust wav2vec 2.0”) leads to training convergence failures on the three low-resource languages (Ru, Et and Lv). The two languages with the least data, Et and Lv, fail even with in-domain multilingual pre-training. Unsupervised adaptation significantly improves training convergence and model performance in all three speech pre-training scenarios. In the worst-case scenario, the Et-En wav2vec-U 2.0 model is successfully trained with only 3 hours of Et speech data and features from an adapted out-of-language, out-of-domain wav2vec 2.0 model (“robust wav2vec 2.0”).

                  Fr-En  Es-En  Ru-En  Et-En  Lv-En  En-Es  En-Ru  En-Fr  Avg.
BLEU on raw text
  ASR → TDN → MT   24.4   23.4   27.8    8.5    7.6   22.0   10.0   15.4   17.4
  Remove TDN       17.2   18.3   20.7    5.7    7.8   17.2    8.9   10.4   13.3
BLEU on normalized text (case and punctuation removed)
  ASR → TDN → MT   25.0   23.9   28.7    7.9    9.5   23.7    9.4   15.5   18.0
  Remove TDN       23.1   24.1   26.9    7.2    9.4   23.1    9.4   15.1   17.3
Table 8: Effectiveness of text de-normalization in the unsupervised pipeline, evaluated in terms of speech-to-text translation on CoVoST 2 (X-En), MuST-C (En-Es and En-Ru) and Libri-Trans (En-Fr). We report test BLEU on either raw text or normalized text. TDN not only recovers case and punctuation, but also leads to better translation of content.

5.5 Speech Recognition

Next, we evaluate the performance of unsupervised speech recognition in our setting. We decode our pre-trained supervised baselines (“w2v2-CTC”) and unsupervised models (“w2vu2-CTC”) with a 4-gram language model, and compare them with the previous supervised baselines without pre-training (Wang et al., 2020) on CoVoST 2 and MuST-C (for En); results (test WER) are shown in Table 5. Our unsupervised end-to-end models outperform the un-pre-trained supervised baselines on all six languages, with an average WER reduction of 16.9. Unsupervised ASR works best for languages with little labeled data, due to the use of pre-trained features and advances in unsupervised algorithms.

5.6 Speech Synthesis

In our unsupervised setting, the target speech data does not share the same domain as the source speech data. This realistic setting leads to a training-inference time domain mismatch for TTS models. We evaluate the effect of this mismatch with a re-synthesis task on 3 different datasets: CVSS-C (from It-En), Libri-Trans and MuST-C. We synthesize speech from validation texts and report WER on the ASR transcription of the synthesized speech. To quantify domain similarity, we follow Lin et al. (2022) and compute the Jensen–Shannon divergence (“JSD”) on 4-gram phoneme distributions, where low JSD suggests high similarity. Table 6 shows the results: both supervised and unsupervised models have higher WER on the less similar domains (Libri-Trans and MuST-C).
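The JSD computation can be sketched as follows (our own implementation, base-2 so values lie in [0, 1]; phoneme sequences are assumed to be lists of symbols):

```python
import math
from collections import Counter

def ngram_dist(phonemes, n=4):
    """Empirical distribution over phoneme n-grams of a sequence."""
    counts = Counter(tuple(phonemes[i:i + n])
                     for i in range(len(phonemes) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2) between two n-gram
    distributions given as {ngram: probability} dicts."""
    support = set(p) | set(q)
    m = {g: 0.5 * (p.get(g, 0.0) + q.get(g, 0.0)) for g in support}
    def kl(a, b):
        return sum(pa * math.log2(pa / b[g]) for g, pa in a.items() if pa > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With base-2 logarithms, identical distributions give 0 and fully disjoint n-gram supports give 1, which matches the reading that low JSD indicates high domain similarity.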

5.7 Machine Translation

We evaluate our unsupervised models (“mBART-OBT”) on the CoVoST 2, MuST-C and Libri-Trans benchmarks with test BLEU. For comparison, we also build supervised Transformer baselines (“Transformer”) and supervised mBART baselines (“mBART-FT”). Results are shown in Table 7. Our unsupervised models outperform the supervised Transformer baselines by 12.1 BLEU on average over the eight considered translation directions, and are behind the pre-trained supervised baselines (mBART-FT) by only 2.2 BLEU on average. In contrast to the supervised baselines, which leverage in-domain paired data, the unsupervised models use unpaired CC-100 web data.

5.8 Text De-normalization

We verify the effectiveness of text de-normalization (TDN) by ablating it in the unsupervised cascaded pipeline. Table 8 shows test BLEU calculated on either raw text or normalized text for the ablation. TDN improves raw-text BLEU by 4.1 on average over all directions. From the improvements on normalized text, we conclude that TDN not only recovers case and punctuation, but also improves translation of the content.

6 Conclusion

In this paper, we present a simple and effective approach to unsupervised speech-to-text translation (S2TT) and speech-to-speech translation (S2ST). Our S2TT systems outperform the previous state of the art on Libri-Trans by 3.2 BLEU as well as the best supervised end-to-end models (without pre-training) on CoVoST 2 from only two years ago by an average of 5.0 BLEU over five translation directions into English. Our S2TT and S2ST systems also perform competitively on the MuST-C and CVSS-C benchmarks.


We thank Alexei Baevski, Andy Chung, Alexis Conneau, Hongyu Gong, Jiatao Gu and Sravya Popuri for helpful discussions.


  • M. Artetxe, G. Labaka, E. Agirre, and K. Cho (2018)

    Unsupervised neural machine translation

    In International Conference on Learning Representations, Cited by: §1, §1, §2.
  • A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, et al. (2021) XLS-r: self-supervised cross-lingual speech representation learning at scale. arXiv preprint arXiv:2111.09296. Cited by: Table 1, §4, Table 4, §5.4.
  • A. Baevski, W. Hsu, A. Conneau, and M. Auli (2021) Unsupervised speech recognition. Advances in Neural Information Processing Systems 34, pp. 27826–27839. Cited by: §1, §1, §2, §3.1, §4.
  • A. Baevski, Y. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: A framework for self-supervised learning of speech representations. In Proc. of NeurIPS, Cited by: §1, §1.
  • S. Bansal, H. Kamper, K. Livescu, A. Lopez, and S. Goldwater (2018) Low-resource speech-to-text translation. Proc. Interspeech 2018, pp. 1298–1302. Cited by: §2.
  • S. Bansal, H. Kamper, K. Livescu, A. Lopez, and S. Goldwater (2019) Pre-training on high-resource speech recognition improves low-resource speech-to-text translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 58–68. Cited by: §2.
  • S. Bansal, H. Kamper, A. Lopez, and S. Goldwater (2017) Towards speech-to-text translation without speech recognition. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 474–479. Cited by: §2.
  • L. Bentivogli, M. Cettolo, M. Gaido, A. Karakanta, A. Martinelli, M. Negri, and M. Turchi (2021) Cascade versus direct speech translation: do the differences still make a difference? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 2873–2887. Cited by: §2.
  • A. Bérard, O. Pietquin, L. Besacier, and C. Servan (2016) Listen and translate: a proof of concept for end-to-end speech-to-text translation. In NIPS Workshop on end-to-end learning for speech and audio processing, Cited by: §2.
  • M. Bernard (2015) Phonemizer. GitHub repository. Cited by: §4.
  • K. Chen, C. Tsai, D. Liu, H. Lee, and L. Lee (2019) Completely unsupervised speech recognition by a generative adversarial network harmonized with iteratively refined hidden Markov models. In Proc. of Interspeech, Cited by: §1.
  • Y. Cheng, H. Lee, and H. Wang (2021) AlloST: Low-Resource Speech Translation Without Source Transcription. In Proc. Interspeech 2021, pp. 2252–2256. External Links: Document Cited by: §2.
  • Y. Chung, W. Weng, S. Tong, and J. Glass (2019) Towards unsupervised speech-to-text translation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7170–7174. Cited by: §2, Table 2, §4, §5.2.
  • A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov (2020) Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 8440–8451. Cited by: §2, §4, §4.
  • A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou (2018) Word translation without parallel data. Proc. of ICLR. Cited by: §1.
  • M. A. Di Gangi, R. Cattoni, L. Bentivogli, M. Negri, and M. Turchi (2019a) MuST-C: a multilingual speech translation corpus. In 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2012–2017. Cited by: §1, §4.
  • M. A. Di Gangi, M. Negri, and M. Turchi (2019b) Adapting transformer to end-to-end spoken language translation. In INTERSPEECH 2019, pp. 1133–1137. Cited by: §2.
  • L. Duong, A. Anastasopoulos, D. Chiang, S. Bird, and T. Cohn (2016) An attentional model for speech translation without transcription. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 949–959. Cited by: §2.
  • A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376. Cited by: §3.2.
  • W. Hsu, A. Sriram, A. Baevski, T. Likhomanenko, Q. Xu, V. Pratap, J. Kahn, A. Lee, R. Collobert, G. Synnaeve, and M. Auli (2021) Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training. In Proc. Interspeech 2021, pp. 721–725. Cited by: §4, Table 4, §5.4.
  • Y. Jia, M. T. Ramanovich, T. Remez, and R. Pomerantz (2022) Translatotron 2: high-quality direct speech-to-speech translation with voice preservation. In Proceedings of the 39th International Conference on Machine Learning, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, and S. Sabato (Eds.), Proceedings of Machine Learning Research, Vol. 162, pp. 10120–10134. Cited by: §2, §3.3.
  • Y. Jia, M. T. Ramanovich, Q. Wang, and H. Zen (2022) CVSS corpus and massively multilingual speech-to-speech translation. arXiv preprint arXiv:2201.03713. Cited by: §A.1, Table 9, Table 3, §4.
  • Y. Jia, R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson, Z. Chen, and Y. Wu (2019) Direct speech-to-speech translation with a sequence-to-sequence model. In INTERSPEECH, Cited by: §2.
  • T. Kano, S. Sakti, and S. Nakamura (2021) Transformer-based direct speech-to-speech translation with transcoder. In 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 958–965. Cited by: §2.
  • A. C. Kocabiyikoglu, L. Besacier, and O. Kraif (2018) Augmenting librispeech with french translations: a multimodal corpus for direct speech translation evaluation. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Cited by: §4.
  • G. Lample, A. Conneau, L. Denoyer, and M. Ranzato (2018) Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations, Cited by: §1, §1, §2.
  • M. P. Lewis, G. F. Simon, and C. D. Fennig (2022) Ethnologue: languages of the world, 25th edition. SIL International. Cited by: §1.
  • N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu (2019) Neural speech synthesis with transformer network. Proceedings of the AAAI Conference on Artificial Intelligence 33 (01), pp. 6706–6713. Cited by: §3.1, §3.3.
  • X. Li, C. Wang, Y. Tang, C. Tran, Y. Tang, J. Pino, A. Baevski, A. Conneau, and M. Auli (2021) Multilingual speech translation from efficient finetuning of pretrained models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 827–838. Cited by: §2, §3.3.
  • G. Lin, C. Hsu, D. Liu, H. Lee, and Y. Tsao (2022) Analyzing the robustness of unsupervised speech recognition. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8202–8206. Cited by: §2, §5.6, Table 6.
  • A. H. Liu, W. Hsu, M. Auli, and A. Baevski (2022a) Towards end-to-end unsupervised speech recognition. arXiv preprint arXiv:2204.02492. Cited by: §1, §1, §2, §3.1.
  • A. H. Liu, C. J. Lai, W. Hsu, M. Auli, A. Baevski, and J. Glass (2022b) Simple and effective unsupervised speech synthesis. arXiv preprint arXiv:2204.02524. Cited by: §1, §1, §2, §3.1.
  • D. Liu, K. Chen, H. Lee, and L. Lee (2018) Completely unsupervised phoneme recognition by adversarially learning mapping relationships from audio embeddings. Proc. of Interspeech. Cited by: §1, §2.
  • Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer (2020) Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, pp. 726–742. Cited by: §1, §2, §3.1, §4.
  • M. Mohri (1997) Finite-state transducers in language and speech processing. Computational linguistics 23 (2), pp. 269–311. Cited by: §3.1.
  • J. Ni, L. Wang, H. Gao, K. Qian, Y. Zhang, S. Chang, and M. Hasegawa-Johnson (2022) Unsupervised text-to-speech synthesis by unsupervised automatic speech recognition. arXiv preprint arXiv:2203.15796. Cited by: §2.
  • J. Park (2019) G2pE. GitHub repository. Cited by: §4.
  • Y. Ren, J. Liu, X. Tan, C. Zhang, T. Qin, Z. Zhao, and T. Liu (2020) SimulSpeech: end-to-end simultaneous speech to text translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3787–3796. Cited by: §2.
  • S. Schneider, A. Baevski, R. Collobert, and M. Auli (2019) Wav2vec: unsupervised pre-training for speech recognition. In Proc. of Interspeech, Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §3.1.
  • L. C. Vila, C. Escolano, J. A. Fonollosa, and M. R. Costa-Jussa (2018) End-to-end speech translation with the transformer.. In IberSPEECH, pp. 60–63. Cited by: §2.
  • P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P. Manzagol, and L. Bottou (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion.. Journal of machine learning research 11 (12). Cited by: §3.1.
  • C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux (2021a) VoxPopuli: a large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 993–1003. Cited by: §4, Table 4, §5.4.
  • C. Wang, Y. Tang, X. Ma, A. Wu, D. Okhonko, and J. Pino (2020) Fairseq S2T: fast speech-to-text modeling with fairseq. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: System Demonstrations, Suzhou, China, pp. 33–39. Cited by: Table 1, Table 2, §5.5, Table 5, Table 7.
  • C. Wang, A. Wu, J. Gu, and J. Pino (2021b) CoVoST 2 and massively multilingual speech translation.. In Interspeech, pp. 2247–2251. Cited by: §1, §4.
  • R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu, and Z. Chen (2017) Sequence-to-sequence models can directly translate foreign speech. Proc. Interspeech 2017, pp. 2625–2629. Cited by: §2.
  • L. Wu, J. Li, Y. Wang, Q. Meng, T. Qin, W. Chen, M. Zhang, T. Liu, et al. (2021) R-Drop: regularized dropout for neural networks. Advances in Neural Information Processing Systems 34, pp. 10890–10905. Cited by: §3.1.
  • C. Yeh, J. Chen, C. Yu, and D. Yu (2019) Unsupervised speech recognition via segmental empirical output distribution matching. In Proc. of ICLR, Cited by: §1.

Appendix A

A.1 Comparison of our CVSS-C supervised baseline to previous work

X-En direction       Fr    Es    Ru    Et    Lv   Avg.
Evaluated by a proprietary ASR
  Jia et al. (2022)  32.4  33.4  23.2   3.2   2.8  19.0
Evaluated by an open-source ASR
  Ours               33.8  34.6  29.4   3.1   3.2  20.8
Table 9: Multilingual supervised baselines on CVSS-C for translating 21 languages into English. We report test BLEU on ASR transcriptions of the translated speech.

For evaluation of CVSS-C models, we use an open-source English ASR model (examples/wav2vec, “Wav2Vec 2.0 Large (LV-60) + Self Training”) to transcribe translated speech for BLEU calculation. The previous work (Jia et al., 2022), however, used transcriptions from a proprietary ASR model to which we do not have access. As a result, the BLEU numbers reported for our model and for the previous work are not directly comparable, but the small difference suggests that the two models perform roughly on par.
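The evaluation protocol itself is straightforward: transcribe the translated speech with an ASR model, then score the transcription against the reference translation with BLEU. The sketch below illustrates the protocol with a stub transcriber and a simplified, stdlib-only sentence-level BLEU; the actual evaluation uses wav2vec 2.0 decoding and corpus-level scoring, so this is illustrative only.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU (uniform weights, brevity penalty).

    Real evaluation uses corpus-level scoring (e.g. sacreBLEU); this
    stdlib-only version just illustrates the metric's shape.
    """
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        total = max(len(hyp) - n + 1, 0)
        if total == 0:  # hypothesis too short for this n-gram order
            return 0.0
        overlap = sum((ngram_counts(hyp, n) & ngram_counts(ref, n)).values())
        precisions.append((overlap or 0.5) / total)  # smooth zero matches
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


def asr_bleu(transcribe, audio, reference: str) -> float:
    """ASR-BLEU protocol: transcribe translated speech, then score it."""
    return bleu(transcribe(audio), reference)


# Stub standing in for the open-source wav2vec 2.0 ASR model.
fake_asr = lambda audio: "the cat sat on the mat"
score = asr_bleu(fake_asr, None, "the cat sat on the mat")
```

Because the metric is computed on ASR transcriptions, any systematic difference between the two ASR models (error rate, normalization) shifts the absolute BLEU numbers, which is the source of the comparability caveat above.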

A.2 Data Overview for Supervised Learning and Unsupervised Learning

                     Fr-En  Es-En  Ru-En  Et-En  Lv-En
Supervised learning
  Src. paired speech   264    113     16      3      2
  Src. paired text    207K    79K    12K   1.8K   2.3K
  Tgt. paired speech   174     70     13      3      1
  Tgt. paired text    207K    79K    12K   1.8K   2.3K
Unsupervised learning
  Src. speech          23K    21K    89K    43K    28K
  Src. text           428M   379M   849M    46M    68M
  Tgt. speech           29     29     29     29     29
  Tgt. text           2.1B   2.1B   2.1B   2.1B   2.1B

                     En-Es  En-Ru  En-Fr
Supervised learning
  Src. paired speech   504    489    100
  Src. paired text    259K   259K    47K
  Tgt. paired text    259K   259K    47K
Unsupervised learning
  Src. speech          63K    63K    63K
  Src. text           2.1B   2.1B   2.1B
  Tgt. text           379M   849M   428M
Table 10: Overview of the speech data (hours) and text data (sentences) used in supervised learning and unsupervised learning.

Table 10 provides an overview for the speech and text data used in supervised learning and unsupervised learning.