
Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

by   Kun Wei, et al.

Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST. However, direct S2ST suffers from data scarcity because corpora pairing speech in the source language with speech in the target language are very rare. To address this issue, we propose in this paper a Speech2S model, which is jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks. By effectively leveraging the paired text data, Speech2S is capable of modeling the cross-lingual speech conversion from source to target language. We verify the performance of the proposed Speech2S on the Europarl-ST and VoxPopuli datasets. Experimental results demonstrate that Speech2S improves by about 5 BLEU points over encoder-only pre-training models, and achieves competitive or even better performance than existing state-of-the-art models.


1 Introduction

Direct speech-to-speech translation (S2ST) has gained increasing attention from the research and industry communities in recent years [11, 15, 21]. Traditionally, cascaded speech-to-speech translation consists of automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS) tasks. Direct S2ST aims at integrating these three tasks into an end-to-end model, which translates the speech of one language into the speech of another language directly. Compared to cascaded S2ST, direct S2ST has the following advantages: (1) it alleviates the error propagation problem of pipeline systems; (2) it can retain the emotion, pitch, and prosody information of the speaker to the greatest extent; (3) it has faster inference speed and takes up fewer storage resources.

However, data scarcity is the biggest problem for direct speech-to-speech translation [25]. At present, very little parallel S2ST data is available despite considerable efforts to build such corpora [26, 10, 5]. To alleviate this problem, one line of work leverages pseudo data to improve direct S2ST [4, 21]. These methods usually convert ASR data into speech-to-text translation data using an MT system, and then generate the target audio from the target text with a TTS system. Unfortunately, they cannot guarantee the accuracy of the generated pseudo S2ST data. Another line of work boosts the performance of direct S2ST through pre-training [21, 8]. For example, the work in [8] explores pre-training the encoder with the mSLAM objective [2], and pre-training the decoder of Translatotron 2 [9] with an MT task to generate phonemes. The authors of [21] combine a wav2vec 2.0 [1] encoder and an mBART [17] decoder into a speech-to-unit translation (S2UT) model, which can be further boosted by data augmentation techniques.

Although the self-supervised pre-training method in [21] can initialize the direct S2ST model with a pre-trained wav2vec 2.0 encoder and an mBART decoder, the latter trained on discrete units extracted from unlabeled speech data with the HuBERT [6] model, it still lacks an effective connection between encoder and decoder and ignores cross-lingual modeling capacity in pre-training. In the real world, speech data, ASR data, and MT data are far more abundant than direct speech-to-speech corpora, and MT data can be used to learn the transformation from source text to target text. How to build a cross-lingual bridge between the speech encoder and the unit decoder of direct S2ST with bilingual text in the pre-training stage has not been well explored.

In this paper, we propose a Speech2S model, which aims at modeling cross-lingual information and alleviating data scarcity by jointly pre-training with unpaired speech and bilingual MT text for the direct speech-to-speech translation task. More specifically, Speech2S consists of a speech encoder, a unit encoder, and a unit decoder. We propose two pre-training tasks to pre-train the three modules, with the unit encoder as the bridge between source speech and target units. Like HuBERT [6], the first pre-training objective is to predict the clustered units based on the outputs of both the speech encoder and the unit encoder, using unlabeled speech data. To take advantage of bilingual machine translation corpora, we first leverage two text-to-unit models to convert source/target text into source/target units, with which the cross-lingual unit encoder and decoder can be pre-trained through a cross-entropy loss.

We evaluate the proposed model on the Europarl-ST [7] and VoxPopuli [26] S2ST datasets. Our contributions can be summarized as follows. (1) We propose a jointly pre-trained Speech2S model, which can take advantage of bilingual text data to boost bilingual speech conversion. (2) The proposed model achieves a significant improvement of about 5 BLEU points compared to the pre-trained model without MT data. (3) Furthermore, we conduct a detailed analysis of the effect of parallel data size, data augmentation for different domains, and subjective evaluation.

2 Related Work

Conventional speech to speech translation is usually composed of cascaded ASR, MT and TTS modules [20, 14]. On this basis, to avoid error transmission caused by cascade models, researchers explore the combination of ASR and MT modules [19, 3], as well as TTS modules [11, 23], namely direct S2ST. This paper focuses on exploring direct S2ST with improved pre-training methods.

2.1 Direct Speech to Speech Translation

S2ST, which directly translates the source speech to the target speech, has attracted a lot of attention recently [11, 23, 15, 12, 28]. Translatotron [11] is the first work to achieve direct speech-to-speech translation by using a sequence-to-sequence model. This system uses an encoder to model the log-mel spectrogram and a decoder to predict the target spectrogram, combined with the speaker information. Then, a vocoder is used to convert the spectrogram into a waveform. The work in [9] improves the Translatotron system by utilizing a duration-based spectrogram synthesizer enhanced with target phonemes from the decoder. Unlike Translatotron, the authors of [15] propose a novel direct speech-to-speech translation system, which employs discrete hidden units instead of spectrograms as the model target before the vocoder. They also extend it to real-world S2ST tasks without using any text data [16]. However, real speech-to-speech translation data is very limited due to the high cost of obtaining such data [26, 10]. Our work leverages a pre-training approach to reduce the dependence on direct S2ST datasets.

2.2 Pre-Training for Direct S2ST

Recent years have witnessed great progress in pre-training techniques for direct S2ST tasks [21, 8]. The work in [8] employs the speech-text joint model from mSLAM as the encoder, generating a phoneme sequence with the MT task and a spectrogram with the S2ST task. The work most related to our paper is [21], which enhances the speech-to-unit translation (S2UT) model with a wav2vec 2.0 [1] encoder and a decoder from a pre-trained unit mBART [18]. In this S2UT model, wav2vec 2.0 is pre-trained on unlabeled audio data, mBART is trained as a denoising encoder-decoder model on reduced discrete units tokenized from unlabeled audio data, and the mBART decoder is finally used to initialize the S2UT decoder. However, the simple combination of a wav2vec 2.0 encoder and an mBART decoder lacks cross-language modeling capability, which is particularly important for translation tasks. Motivated by this, we propose to bridge the language gap by utilizing machine translation corpora to improve model pre-training for direct speech-to-speech translation.

3 The Proposed Method

Our goal is to leverage paired machine translation corpora to bridge the semantic gap between source speech and target speech. In this section, we will first introduce the model architecture of Speech2S, and the details of the model pre-training and fine-tuning methods.

3.1 Structure of Speech2S

As shown in Figure 1, Speech2S consists of a speech encoder, a unit encoder, and a unit decoder. The speech encoder and the unit encoder employ the standard Transformer network [24] with the same Transformer layers, except that a 5-layer CNN in the speech encoder is used to pre-process the raw audio signal. The unit decoder is a multi-layer Transformer decoder, each layer of which is composed of a multi-head self-attention mechanism, a cross-attention mechanism, and a feed-forward network (FFN).

Formally, the pre-training data consist of unpaired speech and bilingual text. After applying the speech and text discretization modules (introduced in Section 3.2.1), we obtain speech units from the unpaired speech and bilingual units from the bilingual text. Briefly, the speech encoder encodes the source audio sequence into a sequence of vector representations. Following the mixing mechanism proposed in [29], we adopt it to improve alignment learning by randomly replacing part of the speech representations with the corresponding unit embeddings. The unit encoder can transform the speech representations into final hidden states, or transform a source unit sequence into unit hidden states. Finally, the unit decoder reads the encoder representations and generates a target unit sequence.
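The mixing step described above can be illustrated with a minimal sketch. It is a simplified, frame-level version written under stated assumptions: the function name, the `mix_prob` value, the fixed seed, and the assumption that each frame already has an aligned unit id are all hypothetical, and the actual mechanism in [29] may select positions differently.

```python
import random


def mix_with_unit_embeddings(speech_states, units, unit_embeddings,
                             mix_prob=0.3, seed=0):
    """Randomly swap frame-level speech states for unit embeddings.

    speech_states: list of T vectors produced by the speech encoder.
    units: list of T unit ids aligned with the frames (assumed alignment).
    unit_embeddings: list of vectors, one per unit id.
    mix_prob: assumed replacement probability, not taken from the paper.
    """
    if len(speech_states) != len(units):
        raise ValueError("speech states and units must be frame-aligned")
    rng = random.Random(seed)
    mixed, replaced = [], []
    for state, unit in zip(speech_states, units):
        use_unit = rng.random() < mix_prob
        # Copy the chosen vector so callers can modify the result safely.
        mixed.append(list(unit_embeddings[unit]) if use_unit else list(state))
        replaced.append(use_unit)
    return mixed, replaced


states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
units = [2, 2, 0]
table = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
mixed, replaced = mix_with_unit_embeddings(states, units, table, mix_prob=0.5)
print(replaced)
```

The mixed sequence is what the unit encoder would consume, so the same encoder sees speech-derived and unit-derived vectors in the same positions.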

Figure 1: The overall framework of the proposed Speech2S.

3.2 Model Pre-Training

Before pre-training, we first use two discretization modules to tokenize speech and text into shared discrete tokens. Then the model can be optimized by two pre-training objectives, including speech to units task using speech encoder and unit encoder, and source units to target units task using unit encoder and unit decoder.

3.2.1 Speech/text discretization

We use a HuBERT k-means cluster model as the speech discretization module, which is learned from the HuBERT iter-1 hidden states and can tokenize unlabeled speech into discrete hidden units. To tokenize text into the same space as speech, we introduce two text-to-unit models as in [29], which are trained on two small ASR corpora with paired speech and transcriptions. More specifically, we first use speech discretization to convert the paired speech into hidden units, and obtain (text, unit) pairs by combining the units with the paired text. Then we train a sequence-to-sequence text-to-unit model on these (text, unit) pairs. With the discretization models in place, we can tokenize unlabeled speech into hidden units and bilingual text into bilingual units, all of which are used to optimize the model in the pre-training stage.
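As a rough illustration of the discretization described above, the sketch below assigns each feature frame to its nearest k-means centroid and pairs the resulting units with transcripts to form training data for a text-to-unit model. The function names and the toy 2-dimensional features are assumptions for illustration; real centroids would come from a trained HuBERT k-means model.

```python
def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared L2)."""
    distances = [sum((a - b) ** 2 for a, b in zip(frame, c)) for c in centroids]
    return distances.index(min(distances))


def speech_to_units(frames, centroids):
    """Tokenize a sequence of feature frames into discrete unit ids."""
    return [nearest_centroid(frame, centroids) for frame in frames]


def build_text_unit_pairs(asr_corpus, centroids):
    """Turn (frames, transcript) pairs into (transcript, units) pairs.

    These pairs are what a sequence-to-sequence text-to-unit model
    would be trained on.
    """
    return [(text, speech_to_units(frames, centroids))
            for frames, text in asr_corpus]


# Toy centroids and a one-utterance "ASR corpus" (hypothetical values).
centroids = [[0.0, 0.0], [1.0, 1.0]]
corpus = [([[0.1, 0.0], [0.9, 1.2], [0.8, 0.9]], "hola")]
pairs = build_text_unit_pairs(corpus, centroids)
print(pairs)  # [('hola', [0, 1, 1])]
```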

3.2.2 Pre-training objectives

When the input audio is fed into the speech encoder, it is partially masked and encoded into intermediate hidden states, which are also sent to the unit encoder to obtain the final hidden states. Based on both the intermediate and the final hidden states, the speech pre-training objective is defined on the masked positions as the prediction of the target hidden units, where the prediction distribution is parameterized in the same way as in HuBERT [6].

The unit encoder also takes source units as input in the pre-training stage and outputs the encoded unit hidden states. The unit decoder then generates a series of hidden states according to the encoder representation of the source units. The unit pre-training objective is the cross-entropy of the target units, where the output distribution is obtained by applying a softmax layer to the decoder hidden states. Finally, we pre-train Speech2S under a multi-task learning framework in which the total loss is the sum of the speech pre-training loss and the unit pre-training loss.
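The two pre-training objectives and their sum can be sketched numerically as follows. This is an illustrative approximation, not the authors' implementation: it assumes plain softmax classifiers over the unit vocabulary and averages over positions, whereas the paper parameterizes the speech-side distribution as in HuBERT [6]. All function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def mean_nll(logits_seq, targets, keep=None):
    """Mean negative log-likelihood over positions where `keep` is True."""
    if keep is None:
        keep = [True] * len(targets)
    terms = [-log_softmax(logits)[target]
             for logits, target, flag in zip(logits_seq, targets, keep) if flag]
    return sum(terms) / len(terms) if terms else 0.0


def speech_loss(speech_enc_logits, unit_enc_logits, target_units, mask):
    """Masked unit prediction from both encoder outputs (masked frames only)."""
    return (mean_nll(speech_enc_logits, target_units, mask)
            + mean_nll(unit_enc_logits, target_units, mask))


def unit_loss(decoder_logits, target_units):
    """Cross-entropy of target units given source units."""
    return mean_nll(decoder_logits, target_units)


def total_loss(speech_enc_logits, unit_enc_logits, speech_targets, mask,
               decoder_logits, unit_targets):
    """Multi-task loss: speech pre-training loss plus unit pre-training loss."""
    return (speech_loss(speech_enc_logits, unit_enc_logits, speech_targets, mask)
            + unit_loss(decoder_logits, unit_targets))


# Uniform logits over 4 units give log(4) per predicted position.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
loss = total_loss(uniform, uniform, [1, 2, 3], [True, False, True],
                  uniform, [0, 1, 2])
print(loss)
```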

3.3 Speech2S Fine-Tuning

In the fine-tuning stage, we fine-tune Speech2S, with its speech encoder, unit encoder, and unit decoder, into a direct speech-to-speech translation model. We simply use a direct S2ST corpus as the fine-tuning dataset and optimize the model with the cross-entropy loss, where the target speech must first be converted into target units using the speech discretization module. Finally, we utilize a unit-based HiFi-GAN [16] to generate the target waveform from the target units.

4 Experiments

4.1 Datasets

We conduct our experiments on two directions of the same language pair: Spanish-English (es-en) and English-Spanish (en-es). For pre-training, we use the VoxPopuli dataset, a large-scale multilingual corpus providing 100K hours of unlabelled speech data in 23 languages, as speech pre-training data. The ASR subset of VoxPopuli (VoxPopuli-ASR) in each language is used to train the text discretization module, namely the sequence-to-sequence text-to-unit model. We use machine translation data between English and Spanish from Europarl v10 [13] as the bilingual text data to generate paired text units for textual unit pre-training. The speech-to-speech paired data VoxPopuli-S2S is used for the S2ST fine-tuning stage. We use the dev set split from VoxPopuli and the dev/test sets of the Europarl-ST dataset to evaluate the speech-to-speech translation models. To avoid overlap with the test-set corpus, we removed the data from 2012 and earlier from the VoxPopuli training set. To avoid errors caused by the audio itself, all audio is converted to the 16 kHz ogg format. In addition, we use the training-set text of the CoVoST-2 and Europarl-ST datasets for additional analysis experiments on data augmentation for different domains. Data details are shown in Table 1.

4.2 Implementation Details

Discretization. We use the released k-means cluster model from multilingual HuBERT (mHuBERT), which was trained with the VoxPopuli 100k subset [16], to extract units from speech data. For text discretization, we first extract the units of VoxPopuli-ASR speech using the mHuBERT cluster model and normalize the units using the same 1h English or Spanish speech normalizer as [16]. Then we train the text-to-unit discretization model using the normalized units and the transcripts of the corresponding speech. The text-to-unit model has 6 Transformer layers as the encoder and 6 layers as the decoder, each with 512 nodes and 4 attention heads. Units for the translation text pairs in Europarl v10 are pre-extracted offline using this discretization model and applied in the pre-training stage.

Pre-training. Our Speech2S is composed of a 6-layer Transformer speech encoder, a 6-layer Transformer unit encoder, a 6-layer Transformer decoder, and an output FFN layer of 1024 units. Each Transformer layer has 768 nodes with 4 attention heads and relative positional attention bias [27]. We pre-train all models for the same 400k training steps.
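For quick reference, the hyperparameters stated above can be collected in a small configuration object. This is only a sketch: the class and field names are hypothetical, and reading "768 nodes" as the hidden size of each layer is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speech2SConfig:
    """Hyperparameters as stated in the text; field names are illustrative."""
    cnn_layers: int = 5             # CNN front-end in the speech encoder
    speech_encoder_layers: int = 6
    unit_encoder_layers: int = 6
    unit_decoder_layers: int = 6
    hidden_size: int = 768          # assumed meaning of "768 nodes"
    attention_heads: int = 4
    output_ffn_units: int = 1024
    pretrain_steps: int = 400_000

    def head_dim(self) -> int:
        """Per-head dimension; the hidden size must divide evenly."""
        if self.hidden_size % self.attention_heads != 0:
            raise ValueError("hidden_size must be divisible by attention_heads")
        return self.hidden_size // self.attention_heads


config = Speech2SConfig()
print(config.head_dim())  # 192
```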

Fine-tuning. The fine-tuning model structure is basically the same as the pre-training model structure. The normalized target-language units used in the fine-tuning stage are extracted using the same extractor as for the text-to-unit model. After generating units, we use a unit-based HiFi-GAN [16] to generate the target speech. To compute BLEU, the generated English and Spanish speech is transcribed into text with a wav2vec recognition model and the Microsoft speech-to-text toolkit, respectively. The SacreBLEU toolkit [22] is used to calculate the final BLEU score.

Baselines. For comparison, we design two strong baselines. The first employs a HuBERT encoder to initialize the encoder of the speech-to-unit translation model, and the other is the existing S2UT model [21], which is initialized with a HuBERT encoder plus a 6-layer unit-level mBART decoder. The two models use the same speech data as our model for pre-training and fine-tuning. The S2UT base model and our Speech2S model have almost the same number of parameters.

data | samples | source (hrs) | target (hrs)
pre-train, en-es
VoxPopuli | 1.8M | 14k | -
Europarl v10 | 1.9M | - | -
pre-train, es-en
VoxPopuli | 2.0M | 16k | -
Europarl v10 | 1.9M | - | -
fine-tune, en-es
VoxPopuli-S2S | 120k/6k/- | 394/20/- | 403/21/-
fine-tune, es-en
VoxPopuli-S2S | 153k/6k/- | 513/19/- | 495/18/-
Europarl-ST | 31.6k/1.3k/1.3k | 75.6/3.0/2.9 | 76.5/3.0/-
CoVoST-2 | 78.9k/13.3k/13.2k | 112.0/22.0/22.7 | 81.0/14.4/-
tokenize, en
VoxPopuli-ASR | - | 1.3k | -
tokenize, es
VoxPopuli-ASR | - | 261 | -
Table 1: Statistics of datasets (train/dev/test splits), including pre-training, fine-tuning, and tokenizing datasets.
# | System | Pre-trained Model | Parameters | VoxPopuli dev | Europarl-ST dev/test | VoxPopuli dev | Europarl-ST dev/test
1 | S2UT [21] | w/o pre-training | Large (827M) | - | -/21.8 | - | -/18.8
2 | | wav2vec 2.0+mBART | | 24.3 | 25.7/26.0 | 21.4 | 25.7/23.8
3 | Ours | HuBERT | Base (157M) | 20.5 | 20.2/19.1 | 18.7 | 21.1/19.2
4 | | HuBERT+mBART [21] | | 22.5 | 21.8/20.9 | 20.1 | 23.2/21.1
5 | | Speech2S | | 24.6 | 25.3/25.6 | 23.3 | 26.8/24.4
Table 2: Speech to speech translation performance (BLEU) on VoxPopuli dev set and Europarl-ST dev/test sets. For the S2UT systems, the results on VoxPopuli are reproduced by ourselves, and the results of Europarl-ST are reported in the paper.

4.3 Experimental Results

Table 2 shows the BLEU scores of the S2UT systems [21] and our Speech2S systems. Comparing the model fine-tuned from HuBERT with our proposed model, Speech2S gains more than 4 BLEU points on the S2ST tasks in both directions (#5 vs. #3). Compared to the S2UT base model fine-tuned from the HuBERT encoder and mBART decoder, the proposed Speech2S model still improves by more than 3 BLEU points on the Europarl-ST dev/test sets and by 2.1 to 3.2 points on the VoxPopuli dev sets (#5 vs. #4). This result suggests that our model can better incorporate text information into the language model through pre-training, and learn the correspondence between source-language speech and target-language units through the shared unit encoder. Furthermore, we compare our model with the S2UT Large model from their paper (#5 vs. #2). Our method achieves almost the same results as S2UT Large on the English-Spanish task with fewer parameters, while on the Spanish-English test set it exceeds the larger model, which further supports the above conclusion.
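The gains discussed above can be recomputed directly from rows #3 to #5 of Table 2. The short script below is only a worked arithmetic check; the column order follows the table (VoxPopuli dev, Europarl-ST dev, Europarl-ST test for the first column group, then the same three for the second), and the dictionary keys are labels chosen here for readability.

```python
# BLEU scores copied from Table 2, rows #3, #4 and #5.
BLEU = {
    "HuBERT (#3)": [20.5, 20.2, 19.1, 18.7, 21.1, 19.2],
    "HuBERT+mBART (#4)": [22.5, 21.8, 20.9, 20.1, 23.2, 21.1],
    "Speech2S (#5)": [24.6, 25.3, 25.6, 23.3, 26.8, 24.4],
}


def gains(system, baseline):
    """Per-column BLEU difference, rounded to one decimal place."""
    return [round(s - b, 1) for s, b in zip(BLEU[system], BLEU[baseline])]


vs_hubert = gains("Speech2S (#5)", "HuBERT (#3)")
vs_mbart = gains("Speech2S (#5)", "HuBERT+mBART (#4)")
print(vs_hubert)  # [4.1, 5.1, 6.5, 4.6, 5.7, 5.2]
print(vs_mbart)   # [2.1, 3.5, 4.7, 3.2, 3.6, 3.3]
```

The smallest gain over the HuBERT baseline is 4.1 BLEU, while the gain over HuBERT+mBART ranges from 2.1 on one VoxPopuli dev set to 4.7 on one Europarl-ST test set.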

4.4 Analysis

4.4.1 Effect of Parallel Data Size

An interesting question is how well the model performs when only very little fine-tuning data is available. Here, we examine the effect of varying the parallel data size for Speech2S and the baselines. We evaluate the proposed Speech2S and the baseline fine-tuned from HuBERT on 10-hour, 50-hour, and 100-hour supervised datasets. These training subsets are randomly sampled from the full VoxPopuli-S2S data.

[Table 3 body not recoverable. Columns: pre-trained model, hours, and dev/test BLEU (two column groups); rows include Speech2S (Ours).]
Table 3: BLEU scores for Speech2S and baseline trained with 10-hr, 50-hr, and 100-hr subsets.

Table 3 shows that even with only 10 hours of supervised data, our joint pre-training with speech and bilingual text allows the BLEU score to exceed 10. With 100 hours of supervised data, the fine-tuning results are close to those obtained with several hundred hours of supervised data. From these low-resource results, we conclude that the Speech2S model can learn the unified mapping between speech and units well through pre-training, thus reducing the dependence on supervised S2ST data.

4.4.2 Effect of Data Augmentation

In this section, we explore the effect of data augmentation for datasets from different domains. As shown in Table 4, we first evaluate the performance on the CoVoST-2 dev/test sets using the model trained with the VoxPopuli training set. In terms of absolute performance, the BLEU scores on CoVoST-2 are significantly lower than those on Europarl-ST (#3 vs. #1). A potential reason is that the pre-training and fine-tuning data domains are consistent with the Europarl-ST test set, whereas there is a domain mismatch between VoxPopuli and CoVoST-2.

# | Fine-tuning Data | Evaluation Data | dev | test
Table 4: BLEU scores with data augmentation for different domain datasets. vp_train means the VoxPopuli training set, Eur_train means the Europarl-ST training set, and Cov_train means the CoVoST-2 training set.

We conduct data augmentation experiments by adding paired source speech and target unit data derived from the Europarl-ST and CoVoST-2 speech-to-text translation datasets. Starting from this training data, which consists of source speech and target text, we use the text-to-unit model trained on VoxPopuli-ASR data to convert the target-language text into units, and then enlarge the training set with the speech and the generated target units, as shown in lines 2 and 4 of Table 4. With data augmentation, Speech2S achieves bigger improvements on CoVoST-2 than on Europarl-ST, which is consistent with the domain-mismatch explanation. Experimental results also demonstrate that this data augmentation method is very effective for domain adaptation.
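The augmentation procedure described above amounts to converting the target text of each speech-to-text example into units and appending the result to the S2ST training set. A minimal sketch follows; `augment_with_s2t` and `toy_text_to_unit` are hypothetical names, and the toy converter is only a placeholder for the trained sequence-to-sequence text-to-unit model.

```python
def augment_with_s2t(s2st_pairs, s2t_pairs, text_to_unit):
    """Extend (speech, target_units) data with converted S2T examples.

    s2st_pairs: list of (source_speech, target_units) tuples.
    s2t_pairs: list of (source_speech, target_text) tuples.
    text_to_unit: callable mapping target text to a list of unit ids.
    """
    augmented = list(s2st_pairs)
    for speech, target_text in s2t_pairs:
        augmented.append((speech, text_to_unit(target_text)))
    return augmented


def toy_text_to_unit(text):
    """Placeholder for the trained text-to-unit model (illustration only)."""
    return [ord(ch) % 100 for ch in text]


s2st = [("vp_utt_0.ogg", [12, 7, 7, 31])]
s2t = [("covost_utt_0.ogg", "hi")]
train_set = augment_with_s2t(s2st, s2t, toy_text_to_unit)
print(len(train_set))  # 2
```

Because only the target side is synthesized as units, no TTS audio is needed for the added examples.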

4.4.3 Subjective Evaluation

To further compare the speech quality generated by different models, we select 50 samples from the Europarl-ST dev set and test the naturalness score of these samples. Table 5 lists the naturalness score of different models, including S2UT model and our Speech2S models without and with data augmentation. The results show that our proposed Speech2S achieves the naturalness score of 4.1, outperforming S2UT model fine-tuned from HuBERT and mBART. With data augmentation, the Speech2S model obtains the best naturalness score of 4.3. Experiments demonstrate that our proposed method not only significantly improves the translation quality of S2ST tasks, but also enhances the naturalness of generated speech. In addition, we can find from this experiment that more accurate units will also help to improve the quality of the final synthesized speech.

Model | S2UT | Speech2S | Speech2S+DAT
naturalness score | 4.0 ± 0.1 | 4.1 | 4.3
Table 5: The naturalness score for different models. DAT means data augmentation method.

5 Conclusion

This paper proposes a novel pre-training method with unlabeled speech and paired text data for direct speech to speech translation. The core of the proposed Speech2S is to enhance the cross-lingual speech conversion capability by modeling the transformation from source units to target units, which are extracted from bilingual text data using a discrete tokenizer. Experimental results and analyses on common VoxPopuli and Europarl-ST speech-to-speech translation tasks demonstrate the effectiveness and superiority of the proposed Speech2S model.


  • [1] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33, pp. 12449–12460.
  • [2] A. Bapna, C. Cherry, Y. Zhang, Y. Jia, M. Johnson, Y. Cheng, S. Khanuja, J. Riesa, and A. Conneau (2022) MSLAM: massively multilingual joint pre-training for speech and text. arXiv preprint arXiv:2202.01374.
  • [3] A. Bérard, O. Pietquin, C. Servan, and L. Besacier (2016) Listen and translate: a proof of concept for end-to-end speech-to-text translation. arXiv preprint arXiv:1612.01744.
  • [4] Q. Dong, F. Yue, T. Ko, M. Wang, Q. Bai, and Y. Zhang (2022) Leveraging pseudo-labeled data to improve direct speech-to-speech translation. arXiv preprint arXiv:2205.08993.
  • [5] P. Duquenne, H. Gong, N. Dong, J. Du, A. Lee, V. Goswani, et al. SpeechMatrix: a large-scale mined corpus of multilingual speech-to-speech translations.
  • [6] W. Hsu, B. Bolte, Y. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed (2021) Hubert: self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29, pp. 3451–3460.
  • [7] J. Iranzo-Sánchez, J. A. Silvestre-Cerda, J. Jorge, N. Roselló, A. Giménez, A. Sanchis, J. Civera, and A. Juan (2020) Europarl-st: a multilingual corpus for speech translation of parliamentary debates. In ICASSP, pp. 8229–8233.
  • [8] Y. Jia, Y. Ding, A. Bapna, C. Cherry, Y. Zhang, A. Conneau, and N. Morioka (2022) Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation. arXiv preprint arXiv:2203.13339.
  • [9] Y. Jia, M. T. Ramanovich, T. Remez, and R. Pomerantz (2021) Translatotron 2: robust direct speech-to-speech translation. arXiv preprint arXiv:2107.08661.
  • [10] Y. Jia, M. T. Ramanovich, Q. Wang, and H. Zen (2022) CVSS corpus and massively multilingual speech-to-speech translation. arXiv preprint arXiv:2201.03713.
  • [11] Y. Jia, R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson, Z. Chen, and Y. Wu (2019) Direct speech-to-speech translation with a sequence-to-sequence model. arXiv preprint arXiv:1904.06037.
  • [12] T. Kano, S. Sakti, and S. Nakamura (2021) Transformer-based direct speech-to-speech translation with transcoder. In SLT, pp. 958–965.
  • [13] P. Koehn (2005) Europarl: a parallel corpus for statistical machine translation. In Proceedings of machine translation summit x: papers, pp. 79–86.
  • [14] A. Lavie, A. Waibel, L. Levin, M. Finke, D. Gates, M. Gavalda, T. Zeppenfeld, and P. Zhan (1997) JANUS-iii: speech-to-speech translation in multiple languages. In ICASSP, Vol. 1, pp. 99–102.
  • [15] A. Lee, P. Chen, C. Wang, J. Gu, X. Ma, A. Polyak, Y. Adi, Q. He, Y. Tang, J. Pino, et al. (2021) Direct speech-to-speech translation with discrete units. arXiv preprint arXiv:2107.05604.
  • [16] A. Lee, H. Gong, P. Duquenne, H. Schwenk, P. Chen, C. Wang, S. Popuri, J. Pino, J. Gu, and W. Hsu (2021) Textless speech-to-speech translation on real data. arXiv preprint arXiv:2112.08352.
  • [17] X. Li, C. Wang, Y. Tang, C. Tran, Y. Tang, J. Pino, A. Baevski, A. Conneau, and M. Auli (2020) Multilingual speech translation with efficient finetuning of pretrained models. arXiv preprint arXiv:2010.12829.
  • [18] Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer (2020) Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, pp. 726–742.
  • [19] E. Matusov, S. Kanthak, and H. Ney (2005) On the integration of speech recognition and statistical machine translation. In Ninth European Conference on Speech Communication and Technology.
  • [20] S. Nakamura, K. Markov, H. Nakaiwa, G. Kikui, H. Kawai, T. Jitsuhiro, J. Zhang, H. Yamamoto, E. Sumita, and S. Yamamoto (2006) The atr multilingual speech-to-speech translation system. IEEE Transactions on Audio, Speech, and Language Processing 14 (2), pp. 365–376.
  • [21] S. Popuri, P. Chen, C. Wang, J. Pino, Y. Adi, J. Gu, W. Hsu, and A. Lee (2022) Enhanced direct speech-to-speech translation using self-supervised pre-training and data augmentation. arXiv preprint arXiv:2204.02967.
  • [22] M. Post (2018) A call for clarity in reporting bleu scores. arXiv preprint arXiv:1804.08771.
  • [23] A. Tjandra, S. Sakti, and S. Nakamura (2019) Speech-to-speech translation between untranscribed unknown languages. In ASRU, pp. 593–600.
  • [24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS.
  • [25] C. Wang, H. Inaguma, P. Chen, I. Kulikov, Y. Tang, W. Hsu, M. Auli, and J. Pino (2022) Simple and effective unsupervised speech translation. arXiv preprint arXiv:2210.10191.
  • [26] C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, et al. (2021) VoxPopuli: a large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In ACL, pp. 993–1003.
  • [27] X. Wang, Z. Tu, L. Wang, and S. Shi (2019) Self-attention with structural position representations. arXiv preprint arXiv:1909.00383.
  • [28] C. Zhang, X. Tan, Y. Ren, T. Qin, K. Zhang, and T. Liu (2021) Uwspeech: speech to speech translation for unwritten languages. In AAAI, Vol. 35, pp. 14319–14327.
  • [29] Z. Zhang, L. Zhou, J. Ao, S. Liu, L. Dai, J. Li, and F. Wei (2022) SpeechUT: bridging speech and text with hidden-unit for encoder-decoder based speech-text pre-training. arXiv preprint arXiv:2210.03730.