Direct speech-to-speech translation with discrete units

07/12/2021 ∙ by Ann Lee, et al.

We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. Previous work addresses the problem by training an attention-based sequence-to-sequence model that maps source speech spectrograms into target spectrograms. To tackle the challenge of modeling continuous spectrogram features of the target speech, we propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead. When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass. Experiments on the Fisher Spanish-English dataset show that predicting discrete units and joint speech and text training improve model performance by 11 BLEU compared with a baseline that predicts spectrograms and bridge 83% of the performance gap towards a cascaded system. When trained without any text transcripts, our model achieves performance similar to that of a baseline that predicts spectrograms and is trained with text data.




1 Introduction

Speech translation aims at converting speech input from one language into speech or text in another language. The technology helps bridge the communication barriers between people speaking different languages and can provide access to multimedia content in different languages. Conventional speech-to-text translation (S2T) systems take a cascaded approach by concatenating automatic speech recognition (ASR) and machine translation (MT). In recent years, end-to-end S2T [4] has been proposed to alleviate the error propagation issue between ASR and MT. These S2T models can be further combined with text-to-speech (TTS) synthesis to provide both speech and text translation, which allows the technology to be adopted in a wider range of applications.

More recently, researchers have started exploring building direct speech-to-speech translation (S2ST) models without relying on text generation as an intermediate step [9]. Direct S2ST has the benefits of lower computational costs and inference latency as fewer decoding steps are needed compared to cascaded systems. In addition, direct S2ST is a natural approach for supporting translation for languages that do not have a writing system [29, 37]. However, training an S2ST system that directly maps input spectrogram features from one language to output spectrogram features of another language is challenging as it requires the model to jointly learn not only the alignment between two languages (as in MT) but also the acoustic and linguistic characteristics of both languages (as in ASR and TTS).

The recent success in self-supervised learning for speech has demonstrated that speech representations learned from a large unlabelled speech corpus can lead to impressive performance on a variety of downstream tasks [36] including ASR [3, 8], speaker and language identification [6], emotion recognition [23], etc. Moreover, discretized speech units obtained from the clustering of self-supervised speech representations are shown to be effective in conditional and unconditional spoken generative language modeling tasks [14] and speech synthesis without mel-spectrogram estimation [24].


In this work, we tackle the challenge of modeling target speech in direct S2ST by predicting self-supervised discrete representations of the target speech instead of mel-spectrogram features. We investigate speech translation with discrete units in the scenarios where the source and target transcripts may or may not be available, the latter case being representative of unwritten languages. For the written languages, we present a framework that jointly generates speech and text output by combining S2ST and S2T tasks through a shared encoder and a partially shared decoder. We resolve the length mismatch issue between the speech and text output during decoding with connectionist temporal classification (CTC) [7] and show that joint training allows our proposed framework to achieve performance close to a cascaded S2T+TTS system. For the unwritten target languages, we first extend the use of discrete units to text-to-speech translation (T2ST) [37] when there are source text transcripts available. Then we show that with multitask learning using both discrete representations for the source and the target speech, it is possible to train a direct S2ST system without the use of any text transcripts.

The rest of this paper is organized as follows. After introducing background and related work in the next section, we describe our system in detail in Sec. 3. Following this, we present experimental results in Sec. 4, and Sec. 5 concludes with a discussion of potential future work.

2 Related work

Conventional S2ST systems are built by combining either cascaded or end-to-end S2T models with TTS [15, 19]. The majority of speech translation research has focused on the S2T setup. Studies on ASR+MT systems explore better ways to integrate the ASR output lattice into MT models [18] in order to alleviate the error propagation issue between the two. End-to-end S2T [4] has the potential to resolve the issue, as long as it is properly trained with multitask learning [34] or model pre-training [17] to overcome the data scarcity problem. Studies on TTS for S2ST focus more on synthesizing the para-linguistic information transferred from the source speech, such as prosody [1, 2] and word-level emphasis [5].

On the other hand, Translatotron [9] is an attention-based sequence-to-sequence framework that directly translates the mel-spectrogram of the source speech into spectrogram features of the target speech. Multitask learning has been shown to be essential in helping the model converge, though there is still a performance gap towards an S2T+TTS cascaded system. The authors in [11] propose to build a single deep-learning framework step-by-step by pre-training ASR, MT and TTS models separately and connecting them with Transcoder layers. However, the inference process requires the ASR and MT decoders to complete decoding a full sequence, and thus it loses the latency advantage of a direct S2ST system.

[29, 37] both investigate direct S2ST models under the unwritten language setup by transforming the target speech into discrete representations through a Variational Auto-Encoder (VAE), training a sequence-to-sequence model for translation into target discrete units, and training an inverter for converting the units to speech.

In this work, we propose to train a transformer-based speech-to-discrete unit model for direct S2ST based on the idea of Translatotron [9]. We design a new text decoding task conditioned on the intermediate representation of the decoder in addition to the auxiliary tasks proposed in [9]. We choose to use self-supervised representations from HuBERT [8] to generate the target discrete units for our task, since [36, 14, 24] have shown its superior performance across ASR, spoken language modeling and speech synthesis, compared to other unsupervised representations, including VAE-based representations used in [29, 37]. Overall, there exists little work on direct S2ST due to the lack of parallel S2ST training data. While [9] performs one set of experiments on in-house S2ST data, [9, 29, 37, 11] all take advantage of TTS services to produce synthetic target speech for model training. We follow the same approach and perform our studies with single-speaker synthetic target speech.

3 Model

Our proposed system (Fig. 1) is a transformer-based sequence-to-sequence model with a speech encoder and a discrete unit decoder and incorporates auxiliary tasks similar to [9] during training to facilitate model learning (shown in dashed lines). For written target languages, we further apply target text CTC decoding conditioned on the intermediate representations from the discrete unit decoder for joint speech and text training and generation. Finally, a vocoder is separately trained to convert discrete units into waveform.

Figure 1: An illustration of the direct S2ST model with discrete units. The model can be decomposed into: (1) a transformer-based speech-to-unit model with a speech encoder and a discrete unit decoder, (2) auxiliary tasks conditioned on the speech encoder, (3) text CTC decoder conditioned on the discrete unit decoder, and (4) a vocoder that transforms discrete units into waveform.

3.1 Speech-to-unit (S2U) model

A HuBERT model trained on an unlabelled speech corpus of the target language can encode the target speech into continuous representations at every 20-ms frame. A k-means algorithm is applied on the learned representations of the unlabelled speech to generate $K$ cluster centroids [14, 24], which are used to encode target utterances into sequences of cluster indices at every 20-ms. In the end, a target utterance is represented as $[z_1, z_2, \dots, z_T]$, where each $z_i \in \{0, 1, \dots, K-1\}$ is a cluster index and $T$ is the number of frames.
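The unit extraction step described above amounts to nearest-centroid assignment per frame. The following is a minimal pure-Python sketch of that idea, not the authors' implementation: the function names, the toy 2-dimensional "features" and the three centroids are all hypothetical, and real HuBERT features are high-dimensional vectors produced by the pre-trained model.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_frames(frames: List[Sequence[float]],
                    centroids: List[Sequence[float]]) -> List[int]:
    """Map each frame to the index of its nearest k-means centroid."""
    units = []
    for frame in frames:
        distances = [squared_distance(frame, c) for c in centroids]
        units.append(distances.index(min(distances)))
    return units


# Hypothetical toy data: K = 3 centroids, T = 4 frames (one per 20 ms).
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [3.7, 0.2]]
units = quantize_frames(frames, centroids)
print(units)  # [0, 1, 1, 2]
```

Each utterance thus becomes a sequence of T integers, which is what the unit decoder is trained to predict.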

We build the S2U model by adapting from the transformer model for MT [30]. A stack of 1D-convolutional layers, each with stride 2 and followed by a gated linear unit activation function, is prepended to the transformer layers in the encoder for downsampling the speech input [28]. As the target sequence is discrete, we train the S2U model with cross-entropy loss with label smoothing. We explore two strategies for predicting the discrete unit sequence. In the first strategy (Fig. 2(a), dubbed as “stacked”), we apply the concept of reduction factor, $r$, from TTS [33] and generate a $K \times r$ vector (with $K$ the number of unit clusters) at every decoding step for predicting $r$ consecutive discrete units. In the second strategy (Fig. 2(b), dubbed as “reduced”), we collapse each run of consecutive identical units into one single unit, resulting in a sequence in which no two adjacent units are the same. Both strategies help speed up training and inference, and experimental results show that the model performance also improves.

(a) stacked (b) reduced
Figure 2: An illustration of the two designs for unit sequence generation from the decoder. In the stacked design (a), each decoding step predicts $r$ units by producing a $K \times r$ vector for $r$ softmax computations, where $r$ is the reduction factor and $K$ is the number of unit clusters. In the reduced design (b), the target unit sequence is reduced to a sequence of unique units with consecutive duplicate units removed for training.
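The two target-sequence designs can be illustrated on a toy unit sequence. This is a sketch under assumptions: the helper names, the padding symbol used to fill the last stacked step, and the example sequence are hypothetical and not taken from the paper.

```python
from itertools import groupby
from typing import List, Tuple


def reduce_units(units: List[int]) -> Tuple[List[int], List[int]]:
    """'Reduced' design: collapse runs of identical units, keeping run lengths."""
    reduced, durations = [], []
    for unit, run in groupby(units):
        reduced.append(unit)
        durations.append(len(list(run)))
    return reduced, durations


def stack_units(units: List[int], r: int, pad: int = -1) -> List[List[int]]:
    """'Stacked' design: group r consecutive units per decoding step.

    The pad value for an incomplete final step is a hypothetical choice.
    """
    padded = units + [pad] * (-len(units) % r)
    return [padded[i:i + r] for i in range(0, len(padded), r)]


units = [5, 5, 5, 12, 12, 7, 7, 7, 7, 5]
reduced, durations = reduce_units(units)
stacked = stack_units(units, r=3)
print(reduced)    # [5, 12, 7, 5]
print(durations)  # [3, 2, 4, 1]
print(stacked)    # [[5, 5, 5], [12, 12, 7], [7, 7, 7], [5, -1, -1]]
```

Note that in the reduced sequence a unit may reappear later (unit 5 above); only adjacent duplicates are removed. The run lengths are what a duration predictor would later need to recover.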

3.2 Multitask learning

We follow the design in [9] to incorporate auxiliary tasks with additional attention and decoder modules conditioned on the intermediate layers of the encoder. The target output of the auxiliary tasks can be either phonemes, characters, subword units or any discrete representations of the source or target utterances. These auxiliary tasks are only used during training and not in inference.

For written target languages, we add target text CTC decoding conditioned on an intermediate layer from the discrete unit decoder for the model to generate dual mode output. The use of CTC can mitigate the length mismatch between the speech and text output. However, since it only allows monotonic alignment, we rely on the transformer layers that the CTC decoder is conditioned on to take care of the reordering from source to target. During training, we apply teacher forcing with the ground truth target discrete unit sequence and compute the CTC loss using the teacher-forced intermediate representations from the decoder. During inference, we perform discrete unit decoding and CTC decoding for text simultaneously at each decoding step.
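To see how CTC absorbs the length mismatch, recall the standard CTC collapsing rule: merge repeated labels, then drop blanks. The sketch below applies that rule to a per-step best-label path; the blank index of 0 and the toy label ids are assumptions made for illustration only.

```python
from typing import List

BLANK = 0  # assumed blank index; the actual index depends on the vocabulary


def ctc_collapse(step_labels: List[int], blank: int = BLANK) -> List[int]:
    """Standard CTC collapsing: merge repeated labels, then remove blanks."""
    output = []
    previous = None
    for label in step_labels:
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return output


# Nine decoder steps collapse to three text tokens; a blank between the two
# 3s keeps them as separate tokens.
path = [0, 3, 3, 0, 0, 3, 7, 7, 0]
tokens = ctc_collapse(path)
print(tokens)  # [3, 3, 7]
```

Because the collapsed output can never be longer than the number of decoding steps, the text sequence has to be no longer than the unit sequence it is aligned to.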

3.3 Vocoder

We adopt the modified version of the HiFi-GAN neural vocoder [12] proposed in [24] for unit-to-waveform conversion. For the stacked discrete unit output, we train the vocoder with only the discrete unit sequence, without extra pitch information, as the input. For the reduced discrete unit output, we enhance the vocoder with a lightweight duration prediction module from FastSpeech 2 [27], which consists of two 1D-convolutional layers, each with ReLU activation and followed by layer normalization and dropout, and a linear layer. We train the enhanced vocoder by minimizing the mean square error (MSE) between the module prediction and the ground truth duration of each unit segment in the logarithmic domain, together with the generator-discriminator loss from HiFi-GAN.
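The duration objective and its use at synthesis time can be sketched as follows. This is an illustration under assumptions, not the authors' code: it uses a plain natural logarithm of the duration as the target (some FastSpeech 2 implementations use log(d + 1) instead), and the rounding and minimum-of-one-frame rule in the expansion step are hypothetical choices.

```python
import math
from typing import List


def log_duration_mse(predicted_log: List[float], durations: List[int]) -> float:
    """MSE between predicted log-durations and log of ground-truth durations."""
    targets = [math.log(d) for d in durations]
    squared = [(p - t) ** 2 for p, t in zip(predicted_log, targets)]
    return sum(squared) / len(squared)


def expand_units(units: List[int], predicted_log: List[float]) -> List[int]:
    """Repeat each reduced unit for its predicted number of frames."""
    expanded = []
    for unit, log_d in zip(units, predicted_log):
        frames = max(1, round(math.exp(log_d)))  # assumed: at least one frame
        expanded.extend([unit] * frames)
    return expanded


reduced_units = [5, 12, 7]
true_durations = [3, 2, 4]
perfect_prediction = [math.log(d) for d in true_durations]

loss = log_duration_mse(perfect_prediction, true_durations)
frames = expand_units(reduced_units, perfect_prediction)
print(loss)    # 0.0
print(frames)  # [5, 5, 5, 12, 12, 7, 7, 7, 7]
```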

4 Experiments

4.1 Data

We perform our experiments using the Fisher Spanish-English speech translation corpus [25] as in [9, 37]. The dataset consists of 139k sentences from telephone conversations in Spanish, the corresponding Spanish text transcriptions and their English text translation. As in previous studies on direct S2ST [9, 37], we use a high-quality in-house TTS engine to prepare synthetic target speech with a single female voice as the training targets. We perform all experiments, including the baselines, with the synthetic target speech and do not rely on the TTS engine for other uses. We apply the ASR model described in Sec. 4.3 on the synthetic speech and filter out samples with word error rate (WER) greater than 80. Table 1 lists the statistics of the resulting training set, as well as the two development sets and the test set.
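The filtering step can be sketched as a word-level edit distance followed by a threshold. The sketch assumes the threshold of 80 is a percentage and that a sample is a (reference text, ASR transcript) pair; the function names and example sentences are hypothetical.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent; the reference is assumed non-empty."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def filter_by_wer(samples: List[Tuple[str, str]],
                  max_wer: float = 80.0) -> List[Tuple[str, str]]:
    """Keep samples whose ASR transcript has WER at most max_wer."""
    return [s for s in samples if wer(s[0], s[1]) <= max_wer]


samples = [
    ("i have been living here", "i have been living here"),       # WER 0
    ("what country would you like", "but can tree wood chew bike"),  # WER 120
]
kept = filter_by_wer(samples)
print(len(kept))  # 1
```

WER can exceed 100 when the hypothesis contains insertions, which is why a threshold as high as 80 still removes some samples.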

| | train | dev | dev2 | test |
|---|---|---|---|---|
| # samples | 126k | 4k | 4k | 3.6k |
| source duration (hrs) | 162.5 | 4.6 | 4.7 | 4.5 |
| target duration (hrs) | 139.3 | 4.0 | 3.8 | 3.9 |

Table 1: Statistics of the Fisher Spanish-English dataset [25] after pre-processing

4.2 Model setup

We use the pre-trained HuBERT model trained on LibriSpeech [21] for two iterations and follow [8, 14] to perform k-means clustering on representations from the sixth layer of the model for extracting discrete units for all target English speech. We compute 80-dimensional mel-filterbank features at every 10-ms for the source speech as input to the speech encoder and apply SpecAugment [22] with the LibriSpeech basic policy. The downsampling stack in the speech encoder contains two 1D-convolutional layers with kernel size 5 and 1024 channels, resulting in a downsampling factor of 4 on the input speech. The encoder contains 12 transformer layers with embedding size 256, feed-forward network (FFN) embedding size 2048 and 4 attention heads. The decoder consists of 6 transformer layers with the same embedding size and FFN embedding size as the encoder and 8 attention heads.
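The downsampling factor of 4 follows from stacking two stride-2 convolutions. The sketch below works through the sequence-length arithmetic; the padding of kernel_size // 2 is an assumption (padding is not stated in the text), and the function name is hypothetical.

```python
def subsampled_length(num_frames: int, num_layers: int = 2,
                      kernel_size: int = 5, stride: int = 2) -> int:
    """Output length after a stack of strided 1D convolutions.

    Uses the usual convolution length formula with an assumed padding of
    kernel_size // 2 on each side.
    """
    padding = kernel_size // 2
    length = num_frames
    for _ in range(num_layers):
        length = (length + 2 * padding - kernel_size) // stride + 1
    return length


# A 10-second utterance has 1000 filterbank frames at a 10-ms frame shift.
encoder_steps = subsampled_length(1000)
print(encoder_steps)          # 250
print(1000 / encoder_steps)   # 4.0, i.e. one encoder step per 40 ms
```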

We explore four targets for the auxiliary tasks: source phonemes (sp), target phonemes (tp), source characters (sc) and target characters (tc). For sp or sc, based on preliminary experimentation, we attach a multihead attention module with 4 heads and a decoder with 2 transformer layers and the same embedding size as the discrete unit decoder to the sixth layer of the encoder. For tp or tc, we attach the attention and the decoder to the eighth layer of the encoder. Each auxiliary loss has a constant weight of 8.0 during training. For written target languages, we condition the CTC decoding on the third layer of the discrete unit decoder. The target text for CTC is encoded as 1k unigram subword units [13] to guarantee that the text sequence length is shorter than the length of the stacked or reduced discrete unit sequence. The weight on the CTC loss is set to 1.6 during training. We train the models for 400k steps using Adam and apply an inverse square root learning rate decay schedule with 10k warmup steps. All hyper-parameters, such as dropout, are tuned on the development set. All models are implemented using fairseq S2T [20, 32].
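The loss weighting and the learning rate schedule can be written out as a short sketch. Two assumptions are made here: the weighted losses are combined by simple summation, and the schedule takes the common form of linear warmup followed by decay proportional to the inverse square root of the step. The schedule returns a multiplier of the peak learning rate rather than an absolute value, and the function names are hypothetical.

```python
from typing import List, Optional


def total_loss(unit_loss: float, aux_losses: List[float],
               ctc_loss: Optional[float] = None,
               aux_weight: float = 8.0, ctc_weight: float = 1.6) -> float:
    """Weighted sum of the unit loss, auxiliary losses and optional CTC loss."""
    loss = unit_loss + aux_weight * sum(aux_losses)
    if ctc_loss is not None:
        loss += ctc_weight * ctc_loss
    return loss


def lr_scale(step: int, warmup_steps: int = 10_000) -> float:
    """Fraction of the peak learning rate at a given update step."""
    if step < warmup_steps:
        return step / warmup_steps          # linear warmup
    return (warmup_steps / step) ** 0.5     # inverse square root decay


print(lr_scale(5_000))    # 0.5 (halfway through warmup)
print(lr_scale(10_000))   # 1.0 (peak)
print(lr_scale(40_000))   # 0.5
print(lr_scale(400_000))  # about 0.158 at the final step
```

With this form, the learning rate at the last of the 400k steps is roughly 16% of its peak.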

| BLEU | dev speech | dev text | dev2 speech | dev2 text | test speech | test text |
|---|---|---|---|---|---|---|
| Translatotron [9] | 24.8 | - | 26.5 | - | 25.6 | - |
| + pre-trained encoder [9] | 30.1 | - | 31.5 | - | 31.1 | - |
| Cascaded (S2T+TTS) | 38.1 | 40.8 | 39.6 | 42.1 | 39.5 | 41.5 |
| Ground truth + TTS | 87.9 | 100.0 | 88.9 | 100.0 | 89.6 | 100.0 |
| Transformer-based Translatotron (r = 5, w/ sp, tp) | 25.0 | - | 26.3 | - | 26.2 | - |
| Transformer-based Translatotron (r = 5, w/ sc, tc) | 32.9 | - | 34.1 | - | 33.2 | - |
| S2U, no reduction (r = 1, w/ sc, tc) | 32.8 | - | 34.2 | - | 34.1 | - |
| S2U stacked (r = 5, w/ sc, tc) | 34.0 | - | 34.5 | - | 34.4 | - |
| S2U stacked + CTC (r = 5, w/ sc, tc) | 34.4 | 36.4 | 36.4 | 37.9 | 34.4 | 35.8 |
| S2U reduced + CTC (w/ sc, tc) | 36.6 | 39.5 | 38.3 | 40.8 | 37.2 | 39.4 |

Table 2: BLEU scores evaluated with respect to four references from the Fisher Spanish-English dataset. Here r denotes the reduction factor. For systems generating dual mode output (cascaded and S2U + CTC), we evaluate both the text output directly from the system and the ASR decoded text from the speech output. We only evaluate the latter for systems generating speech-only output.

4.3 Baselines and Evaluation

We build an S2T+TTS cascaded baseline by exploring various S2T and TTS architectures and find the best combination with an LSTM-based S2T model [34] and a transformer-based TTS model [16]. For the S2T model, we use 8 bidirectional LSTM layers for the encoder and 4 LSTM layers for the decoder, and embedding and hidden state sizes are all 256. The model is trained on Fisher Spanish speech and English text data, represented in characters, without pre-training or multitask learning. The TTS model has 6 transformer layers, 4 attention heads, embedding size 512 and FFN embedding size 2048 for both the encoder and the decoder. We use a 32-dimensional layer for the decoder prenet. The model is trained on the English text and the synthetic target speech with a reduction factor of 2 on the output feature frames. The vocoder is a HiFi-GAN model [12] fine-tuned on the mel-spectrogram features from teacher-forcing.

In addition, we implement a transformer-based Translatotron [9] that predicts mel-spectrogram features of the target speech. We apply transformer instead of the LSTM architecture originally proposed in [9] in order to speed up model training. The model consists of the same speech encoder design as in the S2U model, the same transformer-based speech decoder design as in the TTS model for the cascaded baseline, and a fine-tuned HiFi-GAN vocoder [12]. We use the same auxiliary task setup described in the previous section but with a constant weight of 0.1 on each auxiliary loss. We apply a reduction factor of 5 on the output feature frames and tune the hyper-parameters on the development sets. Preliminary studies show no performance degradation for the transformer-based Translatotron compared with our implementation of the LSTM version of the model.

We adopt an open-sourced English ASR model built with the combination of wav2vec 2.0 pre-training and self-training [35] for evaluating the speech output. The model, which is pre-trained on Libri-Light [10] and fine-tuned on full LibriSpeech [21], achieves WER of 1.5% and 3.1% on the LibriSpeech test-clean and test-other sets, respectively. As the ASR decoded text is in lowercase and without punctuation except apostrophes, we normalize both the reference text and the text output from the S2ST model before computing BLEU using SacreBLEU [26].
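The normalization described above (lowercase, no punctuation except apostrophes) might look like the following sketch. The exact rules used for the paper are not given, so the regular expression, the handling of curly apostrophes and the function name are assumptions; the BLEU computation itself is left out.

```python
import re


def normalize_text(text: str) -> str:
    """Lowercase and strip punctuation except apostrophes (English text)."""
    text = text.lower().replace("’", "'")      # unify curly apostrophes
    text = re.sub(r"[^a-z0-9' ]+", " ", text)  # drop other punctuation
    return " ".join(text.split())              # squeeze whitespace


normalized = normalize_text("I’m twenty-six years, living here!")
print(normalized)  # i'm twenty six years living here
```

Applying the same function to both references and system output keeps the comparison on equal footing with the ASR transcripts.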

4.4 Results

We explore model training under both written and unwritten language scenarios. For the former, we take advantage of text transcriptions of source and target speech during S2U model training. For the latter, we focus on the cases where the source is in either a written or unwritten language, while the target language is without a writing system. Thus, the translation system can only be trained on speech targets.

Source & Target Written   Table 2 summarizes the experimental results under the written language setup. We include the results from [9] as a reference. However, as different ASR models are used for evaluation, we should not directly compare the BLEU scores with our experiments. On the other hand, we use an open-sourced ASR model for evaluation, so our results should be comparable with all future research in the field that follows the same evaluation protocol. We also list the BLEU scores from applying the TTS model trained on synthetic target speech on the ground truth English text. The gap between the BLEU scores evaluated on the speech and text output indicates the effect of the ASR errors on the evaluation metric.

First, we explore using different targets for the auxiliary tasks with transformer-based Translatotron [9] and see that using characters as targets for the auxiliary tasks (sc, tc) gives a gain of at least 7 BLEU compared to phonemes (sp, tp), indicating that the choice of auxiliary task target labels has a large impact on the model performance. In all following experiments, we use characters as the targets for the auxiliary tasks.

Second, we compare the proposed S2U model against transformer-based Translatotron. We start with the stacked strategy as both models are required to generate the same number of frames. We can see that “S2U stacked” outperforms the transformer-based Translatotron by 0.4-1.2 BLEU, indicating that discrete units are easier to model than continuous-valued mel-spectrogram features. In addition, we experiment with S2U training using the full discrete unit sequence (reduction factor $r = 1$) and see that a larger reduction factor can not only speed up the training and inference processes but also lead to higher performance (0.3-1.2 BLEU differences).

Third, we incorporate target text CTC decoding into the S2U model and evaluate both speech and text output. Joint training with discrete unit loss and text CTC loss brings a gain of 1.9 BLEU on dev2 for “S2U stacked”. Moreover, we see that the reduced strategy is more effective than stacked, with the former bringing 1.9-2.8 BLEU improvement on speech output and 2.6-3.1 BLEU gain on text output.

Finally, the best setup we find, “S2U reduced” with joint speech and text training and auxiliary tasks, has bridged 83% of the gap between transformer-based Translatotron and the S2T+TTS cascaded baseline. However, compared with the cascaded system, the proposed framework has the advantage of being able to generate consistent speech and text output in one inference pass. Table 3 shows examples of the ASR decoded text on the speech output and the text output from CTC decoding. We also examine the output from the tc auxiliary task, which can serve as another way to generate translated text. By using ASR decoded text as reference, we see a character error rate (CER) of 6.1 for the CTC decoded text and 31.6 for the tc decoded text on the dev set, indicating that the former is much more aligned with the generated audio. As shown in Table 3, the CTC decoded results with high CER are due to a combination of ASR errors and misspelling from CTC.
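The 83% figure can be checked against Table 2. The text does not say which Translatotron row or which data split it refers to; one reading that reproduces both the 11 BLEU gain quoted in the abstract and the 83% figure uses the test-set speech BLEU with the transformer-based Translatotron trained with phoneme targets (w/ sp, tp) as the baseline. That choice of row and split is an inference, and the function name below is hypothetical.

```python
def gap_bridged(baseline: float, system: float, topline: float) -> float:
    """Fraction of the baseline-to-topline gap closed by the system."""
    return (system - baseline) / (topline - baseline)


# Test-set speech BLEU values copied from Table 2.
translatotron_sp_tp = 26.2   # Transformer-based Translatotron (w/ sp, tp)
s2u_reduced_ctc = 37.2       # S2U reduced + CTC (w/ sc, tc)
cascaded = 39.5              # Cascaded (S2T+TTS)

gain = s2u_reduced_ctc - translatotron_sp_tp
fraction = gap_bridged(translatotron_sp_tp, s2u_reduced_ctc, cascaded)
print(round(gain, 1))         # 11.0
print(round(100 * fraction))  # 83
```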

| Example | Source | Text |
|---|---|---|
| 1 | human | i’m twenty six years living here |
| 1 | ASR | i’m twenty six years living here |
| 1 | CTC | i’m twentysix years living here |
| 1 | tc | i’ve been * * living here |
| 1 | ref | i’ve been living here twenty six years |
| 2 | human | what was i going to say eh you were born in puerto rico |
| 2 | ASR | what was i going to say h you were born in porto rico |
| 2 | CTC | what was i going to say eh you were born in puerto rico |
| 2 | tc | what was i was * * * * * born in puerto rico |
| 2 | ref | i was going to say were you sir born in puerto rico |
| 3 | human | what country would you like to travel if you could |
| 3 | ASR | what country would you like to travel if you could |
| 3 | CTC | what country which you like to travel if you could |
| 3 | tc | what country what country do you like to travel if you could you could like to |
| 3 | ref | what country would you like to travel if you could |
| 4 | human | i mean that sometimes people who their mind don’t get distort |
| 4 | ASR | i mean that sometimes people who their mind don get distort |
| 4 | CTC | i mean that’ sometimes people who their’re mind ont get distor |
| 4 | tc | i mean that there are people like the mind they get distored |
| 4 | ref | sometimes there is people with their minds distorted |

Table 3: Examples of output from our best model under the written language setup, “S2U reduced + CTC (w/ sc, tc)”. We compare text from (1) human: human transcription of the generated audio, (2) ASR: ASR decoded text on the generated audio, (3) CTC: the model’s text output from CTC decoding, (4) tc: output from the model’s auxiliary task trained with target characters as targets, and (5) ref: ground truth reference translation. The differences with respect to human are highlighted in bold for the text from ASR, CTC and tc, and * denotes word deletion.

| BLEU | dev | dev2 | test |
|---|---|---|---|
| source written, target unwritten | | | |
| Translatotron, w/ sp [9] | 7.4 | 8.0 | 7.2 |
| ASR+T2S (reduction factor r = 2) | 25.3 | 25.5 | 25.9 |
| ASR+T2U | 39.9 | 40.6 | 41.0 |
| S2U reduced (w/ sc) | 32.8 | 34.0 | 33.8 |
| source & target unwritten | | | |
| Translatotron, no auxiliary tasks [9] | 0.4 | 0.6 | 0.6 |
| UWSpeech [37] | - | - | 9.4 |
| S2U reduced, no auxiliary tasks | 6.6 | 7.0 | 6.7 |
| S2U reduced (w/ su) | 26.2 | 27.4 | 27.1 |

Table 4: BLEU scores evaluated on the ASR decoded text of the speech output with respect to four references from the Fisher Spanish-English dataset. All models are trained without using any target text transcripts.

Source Written, Target Unwritten   We explore the unwritten target language setup by starting from the scenario where the source speech has a text writing system. Table 4 summarizes the results. Note that results from [9, 37] are listed as references, but the BLEU scores are not directly comparable due to the difference in ASR models used during evaluation.

We first build cascaded systems by combining ASR and T2ST. We train the transformer-based ASR system with the default hyper-parameters and s2t_transformer_s architecture in fairseq S2T [32] on Fisher Spanish speech and Spanish transcriptions. The T2ST system can be built by either training a Translatotron-style model with a text encoder and speech decoder that predicts spectrograms (T2S), or a text-to-unit (T2U) model that predicts discrete units as in S2U models. We train the T2ST systems on the Spanish transcription and the corresponding synthetic English speech without any auxiliary task. For T2S, we use the same setup as for the transformer TTS model. For T2U, we treat it as a standard MT problem using a base transformer to translate source characters to target discrete units. We see that the T2U model outperforms T2S by 15.1 BLEU, which is further evidence that discrete units are easier to model as translation targets than continuous spectrogram features.

We focus on “S2U reduced” based on the findings from the written language setup for direct S2ST. We find that training an S2U model with the sc auxiliary task can already achieve 91% of the performance of a system trained with both source and target text (Table 2). This is contrary to the findings in [9], where training Translatotron with only source transcripts attains 28% of the performance of a system trained with both source and target text.

Source & Target Unwritten   Next, we extend our experiments to a fully unwritten language setup by training models without using any text transcripts (Table 4). [9] has pointed out that the sequence-to-sequence model has difficulty in learning to attend to the input speech when trained without auxiliary tasks, resulting in close-to-zero BLEU scores. [37] addresses the challenge by training with discrete unit targets and shows potential, while it uses labelled speech from languages other than the source or the target to guide the VAE learning for the discrete units.

When “S2U reduced” is trained without auxiliary tasks, the performance greatly deteriorates. We notice that the model can still generate meaningful text. However, the generated speech does not reflect the content in the source speech, and the 6.7 BLEU score is mostly contributed by the function words. This shows that the discrete unit decoder can learn a language model over the unit sequence, while the challenge in model training is in the attention on the encoder output.

To facilitate the S2U model training, we apply the HuBERT model pre-trained on English to extract discrete representations for the source Spanish speech, and the source units (su) are used as the target for an auxiliary task. The resulting S2U model achieves a 0.9-1.2 BLEU gain compared with transformer-based Translatotron trained with phoneme targets (Table 2). This shows that source units are effective in guiding the model to properly learn the attention, and the discrete representations learned in a self-supervised manner can capture basic pronunciations that are transferable across languages.

5 Conclusion

We investigate training direct S2ST models with the use of self-supervised discrete representations as targets. We examine model training under both the written and unwritten language scenarios. For the former, we propose a framework that performs close to the cascaded baseline, yet it can generate translation in both speech and text within one inference pass. We demonstrate the possibility of translating between two unwritten languages by taking advantage of discrete representations of both the source and the target speech for model training.

Our studies are performed under a controlled setup where the target is clean, synthetic speech from a single female speaker, which is a common setup in previous work in the field. With the recent release of a large-scale S2S dataset [31], we plan to investigate the proposed framework with real data in the future. Another important aspect of generating speech output is the voice and prosody. In our work, we focus on content translation and leave the para-linguistic aspect of speech translation to future work.

We use an open-sourced ASR model for evaluation, so the results will be comparable with all future research in the field. We will also release the code for reproducing the experiments.

6 Acknowledgement

We would like to thank Jade Copet, Emmanuel Dupoux, Evgeny Kharitonov, Kushal Lakhotia, Abdelrahman Mohamed, Tu Anh Nguyen and Morgane Rivière for helpful discussions on discrete representations, and Sravya Popuri for technical discussions.


  • [1] P. Aguero, J. Adell, and A. Bonafonte (2006) Prosody generation for speech-to-speech translation. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, Vol. 1, pp. I–I. Cited by: §2.
  • [2] G. K. Anumanchipalli, L. C. Oliveira, and A. W. Black (2012) Intent transfer in speech-to-speech machine translation. In 2012 IEEE Spoken Language Technology Workshop (SLT), pp. 153–158. Cited by: §2.
  • [3] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33. Cited by: §1.
  • [4] A. Bérard, O. Pietquin, C. Servan, and L. Besacier (2016) Listen and translate: a proof of concept for end-to-end speech-to-text translation. arXiv preprint arXiv:1612.01744. Cited by: §1, §2.
  • [5] Q. T. Do, S. Sakti, and S. Nakamura (2017) Toward expressive speech translation: a unified sequence-to-sequence LSTMs approach for translating words and emphasis.. In INTERSPEECH, pp. 2640–2644. Cited by: §2.
  • [6] Z. Fan, M. Li, S. Zhou, and B. Xu (2020) Exploring wav2vec 2.0 on speaker verification and language identification. arXiv preprint arXiv:2012.06185. Cited by: §1.
  • [7] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on Machine learning, pp. 369–376. Cited by: §1.
  • [8] W. Hsu, B. Bolte, Y. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed (2021) HuBERT: self-supervised speech representation learning by masked prediction of hidden units. arXiv preprint arXiv:2106.07447. Cited by: §1, §2, §4.2.
  • [9] Y. Jia, R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson, Z. Chen, and Y. Wu (2019) Direct speech-to-speech translation with a sequence-to-sequence model. Proc. Interspeech 2019, pp. 1123–1127. Cited by: Direct speech-to-speech translation with discrete units, §1, §2, §2, §3.2, §3, §4.1, §4.3, §4.4, §4.4, §4.4, §4.4, §4.4, Table 2, Table 4.
  • [10] J. Kahn, M. Rivière, W. Zheng, E. Kharitonov, Q. Xu, P. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen, et al. (2020) Libri-light: a benchmark for asr with limited or no supervision. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7669–7673. Cited by: §4.3.
  • [11] T. Kano, S. Sakti, and S. Nakamura (2021) Transformer-based direct speech-to-speech translation with transcoder. In 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 958–965. Cited by: §2, §2.
  • [12] J. Kong, J. Kim, and J. Bae (2020) HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems 33. Cited by: §3.3, §4.3, §4.3.
  • [13] T. Kudo (2018) Subword regularization: improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 66–75. Cited by: §4.2.
  • [14] K. Lakhotia, E. Kharitonov, W. Hsu, Y. Adi, A. Polyak, B. Bolte, T. Nguyen, J. Copet, A. Baevski, A. Mohamed, et al. (2021) Generative spoken language modeling from raw audio. arXiv preprint arXiv:2102.01192. Cited by: §1, §2, §3.1, §4.2.
  • [15] A. Lavie, A. Waibel, L. Levin, M. Finke, D. Gates, M. Gavalda, T. Zeppenfeld, and P. Zhan (1997) JANUS-III: speech-to-speech translation in multiple languages. In 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 99–102. Cited by: §2.
  • [16] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu (2019) Neural speech synthesis with transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6706–6713. Cited by: §4.3.
  • [17] X. Li, C. Wang, Y. Tang, C. Tran, Y. Tang, J. Pino, A. Baevski, A. Conneau, and M. Auli (2020) Multilingual speech translation with efficient finetuning of pretrained models. arXiv e-prints, pp. arXiv–2010. Cited by: §2.
  • [18] E. Matusov, S. Kanthak, and H. Ney (2005) On the integration of speech recognition and statistical machine translation. In Ninth European Conference on Speech Communication and Technology. Cited by: §2.
  • [19] S. Nakamura, K. Markov, H. Nakaiwa, G. Kikui, H. Kawai, T. Jitsuhiro, J. Zhang, H. Yamamoto, E. Sumita, and S. Yamamoto (2006) The ATR multilingual speech-to-speech translation system. IEEE Transactions on Audio, Speech, and Language Processing 14 (2), pp. 365–376. Cited by: §2.
  • [20] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations. Cited by: §4.2.
  • [21] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur (2015) Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 5206–5210. Cited by: §4.2, §4.3.
  • [22] D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. Proc. Interspeech 2019, pp. 2613–2617. Cited by: §4.2.
  • [23] S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio (2019) Learning problem-agnostic speech representations from multiple self-supervised tasks. Proc. Interspeech 2019, pp. 161–165. Cited by: §1.
  • [24] A. Polyak, Y. Adi, J. Copet, E. Kharitonov, K. Lakhotia, W. Hsu, A. Mohamed, and E. Dupoux (2021) Speech resynthesis from discrete disentangled self-supervised representations. arXiv preprint arXiv:2104.00355. Cited by: §1, §2, §3.1, §3.3.
  • [25] M. Post, G. Kumar, A. Lopez, D. Karakos, C. Callison-Burch, and S. Khudanpur (2014) Fisher and CALLHOME Spanish–English speech translation. Note: LDC2014T23. Web Download. Philadelphia: Linguistic Data Consortium Cited by: §4.1, Table 1.
  • [26] M. Post (2018) A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186–191. Cited by: §4.3.
  • [27] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu (2020) Fastspeech 2: fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558. Cited by: §3.3.
  • [28] G. Synnaeve, Q. Xu, J. Kahn, T. Likhomanenko, E. Grave, V. Pratap, A. Sriram, V. Liptchinsky, and R. Collobert (2019) End-to-end ASR: from supervised to semi-supervised learning with modern architectures. arXiv preprint arXiv:1911.08460. Cited by: §3.1.
  • [29] A. Tjandra, S. Sakti, and S. Nakamura (2019) Speech-to-speech translation between untranscribed unknown languages. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 593–600. Cited by: §1, §2, §2.
  • [30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3.1.
  • [31] C. Wang, M. Rivière, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux (2021) VoxPopuli: a large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. arXiv preprint arXiv:2101.00390. Cited by: §5.
  • [32] C. Wang, Y. Tang, X. Ma, A. Wu, D. Okhonko, and J. Pino (2020) Fairseq s2t: fast speech-to-text modeling with fairseq. In Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations. Cited by: §4.2, §4.4.
  • [33] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, et al. (2017) Tacotron: towards end-to-end speech synthesis. Proc. Interspeech 2017, pp. 4006–4010. Cited by: §3.1.
  • [34] R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu, and Z. Chen (2017) Sequence-to-sequence models can directly translate foreign speech. Proc. Interspeech 2017, pp. 2625–2629. Cited by: §2, §4.3.
  • [35] Q. Xu, A. Baevski, T. Likhomanenko, P. Tomasello, A. Conneau, R. Collobert, G. Synnaeve, and M. Auli (2021) Self-training and pre-training are complementary for speech recognition. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3030–3034. Cited by: §4.3.
  • [36] S. Yang, P. Chi, Y. Chuang, C. J. Lai, K. Lakhotia, Y. Y. Lin, A. T. Liu, J. Shi, X. Chang, G. Lin, et al. (2021) SUPERB: speech processing universal performance benchmark. arXiv preprint arXiv:2105.01051. Cited by: §1, §2.
  • [37] C. Zhang, X. Tan, Y. Ren, T. Qin, K. Zhang, and T. Liu (2020) UWSpeech: speech to speech translation for unwritten languages. arXiv preprint arXiv:2006.07926. Cited by: §1, §1, §2, §2, §4.1, §4.4, §4.4, Table 4.