Speech-to-speech translation (S2ST) helps oral communication between people speaking different languages and aims to break such communication barriers. Conventionally, S2ST systems are built with a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech synthesis (TTS) sub-systems, all of which rely on intermediate text representations. However, the vast majority of the approximately 7,000 languages in the world do not have speech recognition systems or even acknowledged written forms [schultz2006multilingual, nettle2000vanishing, li22aa_interspeech]. For several widely spoken languages, there are also many regional dialects used for everyday oral communication that differ significantly from formal or standard written forms, such as colloquial Arabic and regional Chinese. S2ST systems relying on intermediate text representations cannot support such languages well. Additionally, conventional S2ST systems often rely on phoneme representations in TTS and/or ASR. However, many low-resource languages with written forms lack accurate grapheme-to-phoneme conversion tools for building such systems [li2022zero].
Recently, there has been progress towards developing S2ST systems without relying on intermediate text representations. Such approaches fall into two categories: 1) End-to-end direct S2ST models, which use a single model to directly translate speech from one language to another [jia2019direct, kano2021transformer, jia2021translatotron, jia2022cvss, jia2022leveraging, dong2022leveraging, shankarappa2022faster]; 2) Cascaded S2ST based on discrete speech representations instead of text [tjandra2019speech, zhang2020uwspeech, lee2021direct, ma2021direct, lee2021textless, huang2022transpeech, popuri2022enhanced]. Although these approaches do not rely on textual representations at inference time, many of them still need textual supervision at training time to obtain the best performance. Although a few works have demonstrated the feasibility of training S2ST systems without textual supervision, they significantly underperform similar approaches that use textual supervision [tjandra2019speech, zhang2020uwspeech, lee2021direct].
In this paper, we propose a novel approach for training an end-to-end direct S2ST model without textual supervision. The proposed model, Textless Translatotron, is based on Translatotron 2 [jia2021translatotron], which is an end-to-end direct S2ST model composed of a speech encoder, a linguistic decoder, and an acoustic synthesizer. Instead of predicting the target phonemes as an auxiliary task from the linguistic decoder as in Translatotron 2, the linguistic decoder in the proposed model predicts discrete representations of the target speech, which are obtained from a speech quantizer based on VQ-VAE [oord2017neural]. Such discrete speech representations are expected to capture phoneme-like information without explicitly depending on phonemes. Unlike previous works using discrete representations [lee2021textless, zhang2020uwspeech], which require multiple separately trained models (e.g., a translation model and a vocoder) to be cascaded, our proposed model can be trained end-to-end.
Experiments on two datasets, a bilingual dataset and a multilingual dataset, show that our proposed model obtains translation quality very close to that of the original Translatotron 2, despite not using textual supervision. These results significantly outperform the prior state-of-the-art textless model [lee2021textless] on the Fisher Spanish-English dataset by 18.5 BLEU (or 58% relatively).
2 Related works
Conventional cascade S2ST systems relying on intermediate text representations are unable to support languages without written forms, or written languages whose datasets lack textual labels. Recent research on S2ST without intermediate text representations has started to explore such scenarios.
The first proposed direct S2ST model, Translatotron [jia2019direct], used a sequence-to-sequence model with attention to directly translate speech spectrogram in one language to speech spectrogram in a different language. Although it did not rely on textual intermediate representation at inference time, it required auxiliary objectives based on text at training time to obtain reasonable quality.
Tjandra et al. [tjandra2019speech] first demonstrated non-trivial results on training S2ST models without any textual supervision at training time, by using a learned discrete speech representation instead of text in a cascade system. It first trained a speech quantizer based on VQ-VAE with a speech spectrogram reconstruction task in a self-supervised manner. The trained VQ-VAE encoder was used for converting the S2ST target speech into a discrete representation. It then trained a second model for translating the S2ST source speech into the discrete representation corresponding to the target speech. The VQ-VAE decoder was used for converting the predicted discrete speech representation into speech spectrograms. The resulting system showed reasonable translation quality, but significantly underperformed baseline systems using text as the intermediate representation.
Improvements on top of [tjandra2019speech] have primarily focused on learning better discrete speech representations, such as utilizing more training data including labeled data [zhang2020uwspeech], using a different learning objective [lee2021direct], and adopting data augmentation to remove non-linguistic variance such as speaker identity from the learned representation [lee2021textless, huang2022transpeech].
Besides discrete speech representation, continuous speech representations learned on unsupervised data [oord2018representation, baevski2020wav2vec, hsu2021hubert, chung2021w2v, chiu2022self] have also been shown effective in S2ST [jia2022leveraging, popuri2022enhanced]. Such approaches can be naturally adopted for textless S2ST as well.
Our work combines end-to-end direct S2ST with discrete speech representation to benefit from both approaches, and utilizes self-supervised learned continuous speech representation for obtaining best performance.
3 Textless Translatotron
The proposed model, Textless Translatotron, follows the architecture of Translatotron 2 [jia2021translatotron], which is an end-to-end direct S2ST model. The main components of Translatotron 2 are a speech encoder, a linguistic decoder and an acoustic synthesizer. In addition to them, we introduce a speech quantizer based on VQ-VAE, to extract discrete representation from the target speech, which is used for guiding the training of the linguistic decoder.
We improve the Translatotron 2 model by addressing two data scarcity issues: 1) Both the encoder and the decoder were learned from scratch, which is ineffective when training data is scarce. 2) The written form of the target language might be unavailable, therefore supervision using text information becomes impossible. To alleviate those scarcity issues, we improve each of the three components and further add a discrete speech quantizer.
3.1 Speech encoder
Instead of training a speech encoder from scratch, we initialize our encoder from a pre-trained multilingual w2v-BERT model [chung2021w2v] as described in [bapna2022mslam]. w2v-BERT is a self-supervised model which combines contrastive learning and masked language modeling (MLM) [chung2021w2v]. The w2v-BERT model is pre-trained on a large collection of multilingual speech datasets. It is used to initialize our encoder, which extracts continuous speech representations and is fine-tuned during training.
3.2 Speech quantizer
To explore discrete speech representations, we consider using VQ-VAE-based speech quantizers, as shown on the right side of Figure 1. The motivation for choosing VQ-VAE as the quantizer is that discrete representations learned by VQ-VAE are directly optimized for speech spectrogram reconstruction, which matches how such discrete representations are used for generating translated speech in S2ST models. In contrast, discrete representations obtained from other models (e.g., HuBERT [hsu2021hubert]) may not be optimized for spectrogram reconstruction. Additionally, because of this match, the decoder of the VQ-VAE quantizer can be directly used as the synthesizer in the S2ST models.
The speech quantizers are pre-trained using only the S2ST speech data of the target language (e.g., English only). Let $x$ denote the speech input; the encoder projects it into a latent space, $z = \mathrm{Enc}(x)$. Each encoder has a stride hyperparameter which controls the number of frames encoded into each latent vector. We vary the stride from 2 to 16 in our experiments. The model then maps each latent vector to a discrete ID by finding the nearest vector in a codebook $C = \{c_1, \dots, c_V\}$, where $V$ is the codebook size:
$$q = \arg\min_{i \in \{1, \dots, V\}} \lVert \mathrm{norm}(z) - \mathrm{norm}(c_i) \rVert_2$$
Both the codebook and the projected vector are normalized. The decoder takes the discretized representations and attempts to reconstruct the speech input, $\hat{x} = \mathrm{Dec}(c_q)$. The reconstruction loss is the absolute difference between the speech input and the reconstructed input. Combined with the quantization loss, the total training objective is defined as follows:
$$\mathcal{L} = \lVert x - \hat{x} \rVert_1 + \alpha \lVert \mathrm{sg}(z) - c_q \rVert_2^2 + \beta \lVert z - \mathrm{sg}(c_q) \rVert_2^2$$
where $\mathrm{sg}(\cdot)$ denotes the stop-gradient operator, and $\alpha$ and $\beta$ are hyperparameters which we fix to 1.0 and 0.25, respectively. In our experiments, we adopt a stack of non-causal Transformer layers as the decoder, following [dosovitskiy2020image]. We consider two groups of quantizers, described in the following subsections.
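The nearest-neighbour lookup and the training objective described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the reconstruction term is summed rather than averaged, and the quantization terms are computed on whatever vectors are passed in (whether the loss uses normalized vectors is an assumption left open here). Because the stop-gradient operator only changes gradients, the two quantization terms have the same forward value and differ only in their weights.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

def quantize(latent, codebook):
    """Return the index of the codebook entry nearest to the latent vector.

    Both the latent vector and each codebook entry are L2-normalized first.
    """
    z = l2_normalize(latent)
    best_id, best_dist = -1, float("inf")
    for idx, code in enumerate(codebook):
        c = l2_normalize(code)
        dist = sum((a - b) ** 2 for a, b in zip(z, c))
        if dist < best_dist:
            best_id, best_dist = idx, dist
    return best_id

def vq_vae_loss(x, x_hat, latent, code, alpha=1.0, beta=0.25):
    """Forward value of the objective for one frame.

    recon: absolute difference between input and reconstruction (summed).
    The codebook term (weight alpha) and the commitment term (weight beta)
    share the same value here; stop-gradient only decides which side
    receives gradients, which plain Python does not model.
    """
    recon = sum(abs(a - b) for a, b in zip(x, x_hat))
    squared_dist = sum((a - b) ** 2 for a, b in zip(latent, code))
    return recon + alpha * squared_dist + beta * squared_dist
```

For example, with a toy codebook `[[1, 0], [0, 1], [-1, 0]]`, the latent vector `[0.1, 5.0]` is assigned to index 1 because only its direction matters after normalization.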
3.2.1 Random quantizer
The random quantization is inspired by BEST-RQ [chiu2022self]. Suppose $x = (x_1, \dots, x_T)$ denotes the speech input frames, where $x_t \in \mathbb{R}^{d}$ represents the input feature of frame $t$. A stacking process is first applied by combining every $s$ frames together without overlapping, where $s$ is the stride hyperparameter. The speech input becomes $x' = (x'_1, \dots, x'_{T'})$, where $T' = T/s$ and $x'_j \in \mathbb{R}^{sd}$. The stacked spectrogram is then mapped into a latent space with a projection matrix $A$, i.e., $z_j = A x'_j$. Both the projection matrix and the codebook are randomly initialized and fixed during training; only the decoder is optimized. The speech spectrogram is channel-normalized to a Gaussian distribution based on global statistics. The projection matrix $A$ is Xavier-initialized [glorot2010understanding], and the codebook $C$ is Gaussian-initialized. Such initialization ensures a uniform distribution of the projected code IDs.
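The random quantizer's encoder path (frame stacking, a fixed random projection, and a fixed random codebook) might look like the following sketch. All names are hypothetical. The sketch assumes the uniform variant of Xavier initialization, drops trailing frames that do not fill a whole group, and normalizes both the projected vector and the codebook entries before the nearest-neighbour search; none of these details is confirmed by the text beyond "Xavier" and "Gaussian" initialization.

```python
import math
import random

def stack_frames(frames, stride):
    """Concatenate every `stride` consecutive frames without overlap.

    Assumption: trailing frames that do not fill a group are dropped.
    """
    n_groups = len(frames) // stride
    return [
        [value for frame in frames[g * stride:(g + 1) * stride] for value in frame]
        for g in range(n_groups)
    ]

def unit(vector):
    """L2-normalize a vector (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

def make_random_quantizer(input_dim, latent_dim, codebook_size, seed=0):
    """Create a frozen projection matrix and codebook.

    Projection: Xavier (Glorot) uniform, U(-a, a), a = sqrt(6 / (fan_in + fan_out)).
    Codebook: standard Gaussian entries, then L2-normalized.
    """
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (input_dim + latent_dim))
    projection = [
        [rng.uniform(-limit, limit) for _ in range(input_dim)]
        for _ in range(latent_dim)
    ]
    codebook = [
        unit([rng.gauss(0.0, 1.0) for _ in range(latent_dim)])
        for _ in range(codebook_size)
    ]
    return projection, codebook

def random_quantize(frames, stride, projection, codebook):
    """Map spectrogram frames to one code ID per group of `stride` frames."""
    ids = []
    for stacked in stack_frames(frames, stride):
        z = unit([sum(w * x for w, x in zip(row, stacked)) for row in projection])
        distances = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook]
        ids.append(distances.index(min(distances)))
    return ids
```

Since neither the projection nor the codebook is trained, the same seed always yields the same code IDs for a given input, and only the decoder that reconstructs spectrograms from these IDs needs optimization.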
3.2.2 Learned quantizer
To compare the random projection encoder with learned encoders, we explore two more quantizers: a linear quantizer and a Transformer quantizer. The linear quantizer shares the same encoder architecture as the random quantizer, except that the projection matrix and the codebook are learned. The encoder of the Transformer quantizer is similar to its decoder: it has a stack of non-causal Transformer layers and also stacks frames before the Transformer layers. These learned quantizers are pre-trained on the target speech in the S2ST datasets and frozen during the training of the translation models.
3.3 Linguistic decoder
Instead of using textual supervision, we use the discrete speech representations from the speech quantizer to guide the training of the linguistic decoder. The linguistic decoder autoregressively predicts the discrete code IDs generated from the speech quantizer.
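As a rough illustration of what autoregressive prediction of code IDs involves during training, the sketch below builds teacher-forced decoder inputs and computes a per-step cross-entropy. It is a generic sequence-modeling sketch under stated assumptions, not the paper's training code: the begin-of-sequence ID and the function names are hypothetical, and the source does not say how sequence boundaries are handled.

```python
import math

def teacher_forcing_pairs(code_ids, bos_id):
    """Build decoder inputs by shifting the target code IDs right by one step.

    `bos_id` is a hypothetical begin-of-sequence ID outside the codebook range.
    """
    if not code_ids:
        return [], []
    return [bos_id] + list(code_ids[:-1]), list(code_ids)

def cross_entropy(logits, target_id):
    """Negative log-probability of `target_id` under softmax(logits)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_normalizer = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_normalizer - logits[target_id]
```

At each step the decoder sees the previous code ID (or the begin-of-sequence ID) and is penalized by the cross-entropy between its predicted distribution over the codebook and the code ID produced by the frozen speech quantizer.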
3.4 Acoustic synthesizer
The synthesizer of Textless Translatotron is also simplified compared to that of Translatotron 2. In Translatotron 2, a duration predictor and a Gaussian-weighted upsampler are used to increase the time rate of the linguistic representation from the linguistic decoder to match that of the target speech spectrogram. In Textless Translatotron, because the discrete speech representations have a fixed time rate, a duration predictor is no longer needed, and a simple transposed convolution is used to match the length of the linguistic representation sequence to that of the spectrogram frame sequence. Additionally, the autoregressive LSTM stack in the synthesizer of Translatotron 2 is replaced by a non-autoregressive Transformer stack, optionally initialized from the decoder of the learned VQ-VAE quantizers (Sec. 3.2).
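The length matching performed by a transposed convolution can be seen in a single-channel toy version. This is only a sketch: the function name is hypothetical, and the choice of a kernel as long as the stride (so that each code expands to exactly `stride` output steps) is an assumption, since the text does not give the kernel size.

```python
def transposed_conv1d(sequence, kernel, stride):
    """Single-channel 1-D transposed convolution without padding.

    Each input value is scaled by the kernel and added into the output
    starting at position i * stride, so the output length is
    (len(sequence) - 1) * stride + len(kernel).
    """
    if not sequence:
        return []
    output = [0.0] * ((len(sequence) - 1) * stride + len(kernel))
    for i, value in enumerate(sequence):
        for k, weight in enumerate(kernel):
            output[i * stride + k] += value * weight
    return output
```

With `stride = 4` and a kernel of length 4, three input steps yield twelve output steps, which is how a code sequence produced at a fixed stride can be brought back to the spectrogram frame rate without predicting durations.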
4 Experiments
To evaluate the effectiveness of the proposed model and the variations described in Sec. 3, we conducted comparative experiments on the multilingual CVSS-C corpus [jia2022cvss] and the bilingual Fisher Spanish-English corpus [post2013improved]. The CVSS-C corpus contains sentence-level paired S2ST data in 21 X→English language pairs. The source speech in the corpus is 1,153 hours of human read speech collected via crowdsourcing; the target speech in the corpus is 719 hours of high-quality TTS synthetic speech in a single speaker’s voice, with speech naturalness on par with human recordings. The target speech is shorter than the source speech because of better fluency in the TTS synthetic speech. The Fisher Spanish-English corpus contains 127 hours of Spanish telephone conversations and 96 hours of synthetic English translation speech in a single speaker’s voice.
All the models are implemented in the Lingvo framework [shen2019lingvo]. Unless specified otherwise, all the S2ST models followed the hyper-parameters from [jia2022leveraging]. The speech quantizer used a 64-dimensional latent space with a codebook size of 512. A 256×12 non-causal Transformer stack is used as the VQ-VAE decoder and the S2ST acoustic synthesizer. The linear-based speech quantizer used a single-layer linear projection, and the Transformer-based speech quantizer used a 256×12 Transformer stack, the same as its decoder. The pre-trained speech encoder is the same 0.6B-parameter w2v-BERT model from [jia2022leveraging, bapna2022mslam], which was pre-trained on 492k hours of unlabeled speech in 51 languages.
Following [jia2019direct, jia2022cvss], the translation quality of S2ST is evaluated by BLEU on ASR transcription from the translation speech (in lowercase, excluding punctuation marks). We used an ASR model from [park2020improved] for evaluation, which is the same as used in [jia2022cvss, jia2022leveraging, jia2021translatotron], therefore the results are comparable to these works. The results on CVSS-C are grouped into high/mid/low-resource language pair groups based on the amount of data available in the CVSS-C corpus, following [babu2021xlsr, bapna2022mslam].
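To make the metric concrete, the following sketch normalizes transcripts (lowercase, punctuation removed) and computes a single-reference corpus-level BLEU with up to 4-grams and a brevity penalty. It is a simplified, unsmoothed illustration with hypothetical function names, not the evaluation script behind the reported numbers, and its scores should not be expected to match them exactly.

```python
import math
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(references, hypotheses, max_n=4):
    """Unsmoothed corpus BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref_text, hyp_text in zip(references, hypotheses):
        ref, hyp = normalize(ref_text), normalize(hyp_text)
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            ref_counts, hyp_counts = ngrams(ref, n), ngrams(hyp, n)
            # Clipped matches: a hypothesis n-gram counts at most as often
            # as it appears in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any empty n-gram order gives zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    log_brevity = min(0.0, 1.0 - ref_len / hyp_len)
    return 100.0 * math.exp(log_brevity + log_precision)
```

In the setting described above, the hypotheses would be ASR transcripts of the translated speech, so recognition errors also lower the score.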
Two groups of baseline models are used for comparison: text-supervised models and textless models. For the text-supervised models, we refer to the Translatotron 2 models described in [jia2021translatotron, jia2022cvss] and its improved version with the pre-trained encoder [jia2022leveraging], which are the state-of-the-art models. For the textless models, we compare our results with UWSpeech [zhang2020uwspeech] and the prior state-of-the-art in [lee2021direct].
4.1 Fisher Spanish-English
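Table 1: Translation quality (BLEU) on the Fisher Spanish-English corpus.
|Training supervision|Model|BLEU|
|---|---|---|
|Textless|UWSpeech VQ-VAE [zhang2020uwspeech]|3.4|
|Textless|UWSpeech XL-VAE [zhang2020uwspeech]|9.4|
|Textless|S2U + U2S [lee2021direct]|31.8|
|Textless|Textless Translatotron (this work)|50.3|
|Text-supervised|S2U + U2S [lee2021direct]|39.9|
|Text-supervised|Translatotron 2 [jia2021translatotron]|42.4|
|Text-supervised|Translatotron 2 w/ pre-trained encoder|52.2|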
The experimental results on the Fisher Spanish-English corpus are shown in Table 1, compared to multiple baseline models. Textless Translatotron obtained translation quality approaching that of the state-of-the-art text-supervised S2ST model, Translatotron 2 with a pre-trained encoder, with a difference of merely 1.9 BLEU. It outperformed the prior state-of-the-art textless S2ST model [lee2021direct] by 18.5 BLEU (or 58% relatively).
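The quoted differences follow directly from the Table 1 scores (50.3 for Textless Translatotron, 31.8 for the textless S2U + U2S baseline, 52.2 for Translatotron 2 with a pre-trained encoder):

```latex
\begin{aligned}
\Delta_{\text{textless}} &= 50.3 - 31.8 = 18.5 \ \text{BLEU}, &
\frac{18.5}{31.8} &\approx 0.58 \ (58\%\ \text{relative}) \\
\Delta_{\text{text-supervised}} &= 52.2 - 50.3 = 1.9 \ \text{BLEU}
\end{aligned}
```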
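4.2 CVSS-C
Table 2: Translation quality (BLEU) on the CVSS-C corpus, over all X→En language pairs and by high/mid/low-resource group.
|Model|All|High|Mid|Low|
|---|---|---|---|---|
|Textless Translatotron (this work)|17.7|33.5|22.8|10.2|
|Translatotron 2 [jia2022leveraging]|10.1|26.9|14.2|2.8|
|Translatotron 2 w/ pre-trained encoder [jia2022leveraging]|17.9|32.5|22.9|10.9|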
The Fisher Spanish-English corpus contains translation between two closely related languages and cannot assess more complicated translation scenarios, such as those involving heavy reordering between languages. To further evaluate the performance of the proposed model, we conducted experiments on the multilingual CVSS-C corpus, as shown in Table 2. Textless Translatotron obtained translation quality nearly on par with the Translatotron 2 model with a pre-trained encoder, with a difference of merely 0.2 BLEU.
4.3 Ablation studies
4.3.1 Linguistic training targets
Table 3 shows the impact of the different training target choices for the linguistic decoder. When Textless Translatotron was trained without a pre-trained speech encoder, it significantly underperformed Translatotron 2, which used phoneme-based textual supervision. There was no significant performance difference among the quantizer choices, including the random quantizer and the learned quantizers. However, when a powerful large pre-trained speech encoder was used, a learned quantizer, especially one with a larger capacity, showed significant advantages over a random quantizer or a tiny learned quantizer. With a relatively small Transformer quantizer (256×12), the performance of Textless Translatotron is nearly on par with Translatotron 2. It is important to note that no extra data other than CVSS-C was used for training the speech quantizer of Textless Translatotron. These results suggest that one major difficulty in training end-to-end direct S2ST models lies in speech understanding, which can be overcome either by introducing extra explicit supervision, as in Translatotron 1 and 2, or by leveraging self-supervised speech representation learning, as in Textless Translatotron.
|Encoder|Textless Translatotron|Translatotron 2|
Table 3: … X→En language pairs in CVSS-C. (Random: random speech quantizer; Linear/Transformer: learned linear or Transformer speech quantizer.)
4.3.2 Speech quantizer stride and codebook size
The stride and the codebook size are two critical hyperparameters of the speech quantizer. Using a smaller codebook brings stronger supervision to the linguistic decoder, but suffers from larger information loss. Similarly, using a larger stride makes the discrete representation shorter and training and inference faster, but also suffers from larger information loss. These two hyperparameters need to be chosen carefully to balance quality, convergence, and efficiency.
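The trade-off can be quantified with a simple counting argument: a code sequence has one code per `stride` frames, and each code can carry at most log2(codebook size) bits. The sketch below computes both quantities; the function name and the 800-frame utterance in the usage note are hypothetical, and the bit count is only an upper bound on the information the codes can carry, not a measurement of what a trained quantizer preserves.

```python
import math

def code_sequence_stats(num_frames, stride, codebook_size):
    """Return (number of codes, upper bound on total bits) for one utterance.

    Assumption: trailing frames that do not fill a whole group are dropped.
    """
    num_codes = num_frames // stride
    bits_per_code = math.log2(codebook_size)
    return num_codes, num_codes * bits_per_code
```

For a hypothetical 800-frame utterance and a codebook of 512 entries (9 bits per code), a stride of 4 gives 200 codes and at most 1800 bits, while a stride of 16 gives 50 codes and at most 450 bits: the decoder has a quarter as many steps to predict, but the representation can carry only a quarter of the information.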
4.4 Sample analysis
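Table 6: Examples counted as failures in the BLEU evaluation (REF: reference translation; HYP: ASR transcript of the predicted speech).
|Example|Type|Transcript|
|---|---|---|
|1|REF|everyone knows mount fuji.|
|1|HYP|the fuji is long to mina all words.|
|2|REF|a man and a white dog are looking at a postcard exhibit|
|2|HYP|a man in a white dog is looking at a postcards exhibit|
|3|REF|after a year spent in the kibbutz his family arrived in paris|
|3|HYP|after a year in the cabot his family arrived in paris|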
To understand the failure patterns, we manually analyzed samples of failure cases in the BLEU evaluation. Table 6 shows a few hand-picked examples that were counted as failures in this evaluation. One common pattern is that the model does not translate part of the source speech but copies its pronunciation into the prediction. Such direct copying can be desirable for words that do not need to be translated, such as names and proper nouns (e.g., “kibbutz” in the fr→en example; transcribing it as “cabot” is an ASR error in the evaluation), as also pointed out in [jia2019direct, jia2022cvss]. However, for low-resource languages, such copying is often a real failure (e.g., in the ja→en example, “mina” means “everyone” in Japanese). Such failure cases can likely be reduced with more training data.
5 Conclusions
We proposed Textless Translatotron, a novel end-to-end S2ST model that can be trained without any textual labels and therefore supports languages without written forms. The proposed model is based on Translatotron 2, but uses discrete speech representations obtained from a VQ-VAE quantizer instead of phonemes to guide the training of the linguistic decoder. When a large pre-trained speech encoder is used in both the proposed model and the baselines, Textless Translatotron demonstrated performance nearly on par with the state-of-the-art direct S2ST model with textual supervision on the bilingual Fisher Spanish-English corpus and the multilingual CVSS-C corpus, and outperformed the prior state-of-the-art textless S2ST model on Fisher Spanish-English by 18.5 BLEU (or 58% relatively).
The authors thank Ankur Bapna, James Qin and Yonghui Wu for helpful discussion and feedback.