1 Introduction

Speech translation aims to convert speech input in one language into speech or text in another language. The technology helps bridge communication barriers between people speaking different languages and can provide access to multimedia content across languages. Conventional speech-to-text translation (S2T) systems take a cascaded approach that concatenates automatic speech recognition (ASR) and machine translation (MT). In recent years, end-to-end S2T has been proposed to alleviate the error propagation issue between ASR and MT. These S2T models can be further combined with text-to-speech (TTS) synthesis to provide both speech and text translation, which allows the technology to be adopted in a wider range of applications.
More recently, researchers have started exploring direct speech-to-speech translation (S2ST) models that do not rely on text generation as an intermediate step. Direct S2ST has the benefits of lower computational cost and inference latency, as fewer decoding steps are needed compared with cascaded systems. In addition, direct S2ST is a natural approach for supporting translation for languages that do not have a writing system [29, 37]. However, training an S2ST system that directly maps input spectrogram features in one language to output spectrogram features in another is challenging, as it requires the model to jointly learn not only the alignment between two languages (as in MT) but also the acoustic and linguistic characteristics of both languages (as in ASR and TTS).
The recent success of self-supervised learning for speech has demonstrated that speech representations learned from a large unlabelled speech corpus can lead to impressive performance on a variety of downstream tasks, including ASR [3, 8], speaker and language identification [6], emotion recognition, etc. Moreover, discrete speech units obtained by clustering self-supervised speech representations have been shown to be effective in conditional and unconditional spoken generative language modeling [14] and in speech synthesis without mel-spectrogram estimation [24].
In this work, we tackle the challenge of modeling target speech in direct S2ST by predicting self-supervised discrete representations of the target speech instead of mel-spectrogram features. We investigate speech translation with discrete units in scenarios where the source and target transcripts may or may not be available, the latter case being representative of unwritten languages. For written languages, we present a framework that jointly generates speech and text output by combining the S2ST and S2T tasks through a shared encoder and a partially shared decoder. We resolve the length mismatch between the speech and text output during decoding with connectionist temporal classification (CTC) [7] and show that joint training allows our proposed framework to achieve performance close to a cascaded S2T+TTS system. For unwritten target languages, we first extend the use of discrete units to text-to-speech translation (T2ST) when source text transcripts are available. Then we show that with multitask learning using discrete representations of both the source and the target speech, it is possible to train a direct S2ST system without the use of any text transcripts.
2 Related work
Conventional S2ST systems are built by combining either cascaded or end-to-end S2T models with TTS [15, 19]. The majority of speech translation research has focused on the S2T setup. Studies on ASR+MT systems explore better ways to integrate the ASR output lattice into MT models [18] in order to alleviate the error propagation issue between the two. End-to-end S2T [4] has the potential to resolve this issue, as long as it is properly trained with multitask learning [34] or model pre-training [17] to overcome the data scarcity problem. Studies on TTS for S2ST focus more on synthesizing the para-linguistic information transferred from the source speech, such as prosody [1, 2] and word-level emphasis [5].
On the other hand, Translatotron [9] is an attention-based sequence-to-sequence framework that directly translates the mel-spectrogram of the source speech into spectrogram features of the target speech. Multitask learning has been shown to be essential for the model to converge, though there is still a performance gap relative to an S2T+TTS cascaded system. The authors of [11] propose to build a single deep-learning framework step by step by pre-training ASR, MT and TTS models separately and connecting them with Transcoder layers. However, the inference process requires the ASR and MT decoders to complete decoding a full sequence, and it thus loses the latency advantage of a direct S2ST system. [29, 37] both investigate direct S2ST models under the unwritten language setup by transforming the target speech into discrete representations through a Variational Auto-Encoder (VAE), training a sequence-to-sequence model for translation into target discrete units, and training an inverter to convert the units back to speech.
In this work, we propose to train a transformer-based speech-to-discrete-unit model for direct S2ST based on the idea of Translatotron [9]. In addition to the auxiliary tasks proposed in [9], we design a new text decoding task conditioned on the intermediate representation of the decoder. We choose self-supervised representations from HuBERT [8] to generate the target discrete units for our task, since [36, 14, 24] have shown its superior performance across ASR, spoken language modeling and speech synthesis, compared with other unsupervised representations, including the VAE-based representations used in [29, 37]. Overall, there exists little work on direct S2ST due to the lack of parallel S2ST training data. While [9] performs one set of experiments on in-house S2ST data, [9, 29, 37, 11] all take advantage of TTS services to produce synthetic target speech for model training. We follow the same approach and perform our studies with single-speaker synthetic target speech.
3 System

Our proposed system (Fig. 1) is a transformer-based sequence-to-sequence model with a speech encoder and a discrete unit decoder, and it incorporates auxiliary tasks similar to [9] during training to facilitate model learning (shown in dashed lines in the figure). For written target languages, we further apply target text CTC decoding conditioned on intermediate representations from the discrete unit decoder for joint speech and text training and generation. Finally, a vocoder is trained separately to convert discrete units into waveform.
3.1 Speech-to-unit (S2U) model
A HuBERT model trained on an unlabelled speech corpus of the target language encodes the target speech into continuous representations at every 20-ms frame. A k-means algorithm is applied to the learned representations of the unlabelled speech to generate K cluster centroids [14, 24], which are used to encode target utterances into sequences of cluster indices at every 20 ms. In the end, a target utterance is represented as [z_1, z_2, ..., z_T], z_i ∈ {0, 1, ..., K-1}, where T is the number of frames.
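As an illustration, the discretization step amounts to nearest-centroid assignment of frame-level features. The sketch below uses random vectors as stand-ins for HuBERT features and centroids; the dimensions and the choice of K=10 clusters are illustrative assumptions, not the values from our experiments.

```python
import numpy as np

# Stand-ins for HuBERT features: one 768-dim vector per 20-ms frame.
# (Random values here; in practice these come from an intermediate HuBERT layer.)
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 768))   # T = 50 frames

# Centroids learned offline by running k-means on a large unlabelled corpus;
# random stand-ins here, with K = 10 clusters for illustration.
centroids = rng.normal(size=(10, 768))

# Discretize: assign each frame to the index of its nearest centroid.
dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
units = dists.argmin(axis=1)            # shape (T,), values in [0, K)
print(units.shape)
```

The resulting integer sequence replaces the continuous features as the prediction target of the translation model.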
We build the S2U model by adapting the transformer model for MT [30]. A stack of 1D-convolutional layers, each with stride 2 and followed by a gated linear unit activation function, is prepended to the transformer layers in the encoder to downsample the speech input. As the target sequence is discrete, we train the S2U model with cross-entropy loss with label smoothing. We explore two strategies for predicting the discrete unit sequence. In the first strategy (Fig. 2(a), dubbed “stacked”), we apply the concept of the reduction factor r from TTS [33] and generate a vector at every decoding step for predicting r discrete units. In the second strategy (Fig. 2(b), dubbed “reduced”), we collapse each consecutive run of the same unit into one single unit, resulting in a sequence of unique consecutive discrete units. Both strategies help speed up training and inference, and experimental results show that model performance also improves.
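The two target-preparation strategies can be sketched as plain list operations (the unit values here are arbitrary examples):

```python
def reduced_targets(units):
    """Collapse consecutive repeated units into one ("reduced" strategy)."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out

def stacked_targets(units, r):
    """Group every r consecutive units into one decoding step ("stacked"
    strategy): the decoder emits one vector per step covering r units."""
    return [units[i:i + r] for i in range(0, len(units), r)]

seq = [5, 5, 5, 9, 9, 2, 2, 2, 2]
print(reduced_targets(seq))     # [5, 9, 2]
print(stacked_targets(seq, 3))  # [[5, 5, 5], [9, 9, 2], [2, 2, 2]]
```

Both transformations shorten the target sequence the decoder must generate, which is where the speed-up comes from.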
3.2 Multitask learning
We follow the design in [9] to incorporate auxiliary tasks with additional attention and decoder modules conditioned on the intermediate layers of the encoder. The target output of the auxiliary tasks can be phonemes, characters, subword units or any discrete representations of the source or target utterances. These auxiliary tasks are used only during training and not during inference.
For written target languages, we add target text CTC decoding conditioned on an intermediate layer of the discrete unit decoder so that the model generates dual-mode output. The use of CTC mitigates the length mismatch between the speech and text output. However, since CTC allows only monotonic alignment, we rely on the transformer layers that the CTC decoder conditions on to take care of the reordering from source to target. During training, we apply teacher-forcing with the ground truth target discrete unit sequence and compute the CTC loss using the teacher-forced intermediate representations from the decoder. During inference, we perform discrete unit decoding and CTC decoding for text simultaneously at each decoding step.
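For reference, the standard CTC collapsing rule that maps per-step text predictions to the final text output can be sketched as:

```python
def ctc_collapse(frame_labels, blank="_"):
    """Standard CTC post-processing: merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for x in frame_labels:
        if x != prev and x != blank:
            out.append(x)
        prev = x
    return out

# One text token may span several decoder steps; blanks separate true repeats.
print("".join(ctc_collapse(list("hh_ee_lll_lo__"))))  # "hello"
```

Because the decoder runs over the (longer) discrete unit sequence, the CTC output naturally contains repeats and blanks that this rule removes.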
3.3 Unit-based vocoder

We adopt the modified version of the HiFi-GAN neural vocoder [12] proposed in [24] for unit-to-waveform conversion. For the stacked discrete unit output, we train the vocoder with only the discrete unit sequence as input, without extra pitch information. For the reduced discrete unit output, we enhance the vocoder with a lightweight duration prediction module from FastSpeech 2 [27], which consists of two 1D-convolutional layers, each with ReLU activation followed by layer normalization and dropout, and a linear layer. We train the enhanced vocoder by minimizing the mean square error (MSE) between the module's prediction and the ground truth duration of each unit segment in the logarithmic domain, together with the generator-discriminator loss from HiFi-GAN.
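A minimal sketch of the duration loss, assuming per-unit frame counts as ground truth; the function name and data layout are illustrative, not the actual implementation:

```python
import math

def duration_loss(pred_log_durations, true_durations):
    """MSE between predicted and ground-truth unit durations in the log domain."""
    assert len(pred_log_durations) == len(true_durations)
    se = [(p - math.log(d)) ** 2
          for p, d in zip(pred_log_durations, true_durations)]
    return sum(se) / len(se)

# Each reduced unit carries the number of 20-ms frames it originally spanned.
true_dur = [3, 1, 4]                     # frames per unit segment
pred = [math.log(3), 0.5, math.log(4)]   # exact on the 1st and 3rd units
print(round(duration_loss(pred, true_dur), 4))
```

At synthesis time the predicted durations determine how many frames each reduced unit is repeated for before waveform generation.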
4 Experiments

4.1 Data

We perform our experiments using the Fisher Spanish-English speech translation corpus [25] as in [9, 37]. The dataset consists of 139k sentences from telephone conversations in Spanish, the corresponding Spanish text transcriptions and their English text translations. As in previous studies on direct S2ST [9, 37], we use a high-quality in-house TTS engine to prepare synthetic target speech with a single female voice as the training targets. We perform all experiments, including the baselines, with the synthetic target speech and do not rely on the TTS engine for any other purpose. We apply the ASR model described in Sec. 4.3 to the synthetic speech and filter out samples with a word error rate (WER) greater than 80. Table 1 lists the statistics of the resulting training set, the two development sets and the test set.
| |train|dev|dev2|test|
|source duration (hrs)|162.5|4.6|4.7|4.5|
|target duration (hrs)|139.3|4.0|3.8|3.9|
4.2 Model setup
We use the pre-trained HuBERT model (https://github.com/pytorch/fairseq/tree/master/examples/hubert) trained on Librispeech [21] for two iterations and follow [8, 14] to perform k-means on representations from the sixth layer of the model to extract discrete units for all target English speech. We compute 80-dimensional mel-filterbank features at every 10 ms for the source speech as input to the speech encoder and apply SpecAugment [22] with the LibriSpeech basic policy. The downsampling stack in the speech encoder contains two 1D-convolutional layers with kernel size 5 and 1024 channels, resulting in a downsampling factor of 4 on the input speech. The encoder contains 12 transformer layers with embedding size 256, feed-forward network (FFN) embedding size 2048 and 4 attention heads. The decoder consists of 6 transformer layers with the same embedding size and FFN embedding size as the encoder and 8 attention heads.
We explore four targets for the auxiliary tasks: source phonemes (sp), target phonemes (tp), source characters (sc) and target characters (tc). For sp or sc, we append a multihead attention module with 4 heads and a decoder with 2 transformer layers, with the same embedding size as the discrete unit decoder, to the sixth layer of the encoder based on preliminary experimentation. For tp or tc, we attach the attention and the decoder to the eighth layer of the encoder. Each auxiliary loss has a constant weight of 8.0 during training. For written target languages, we condition the CTC decoding on the third layer of the discrete unit decoder. The target text for CTC is encoded as 1k unigram subword units [13] to guarantee that the text sequence is shorter than the stacked or reduced discrete unit sequence. The weight on the CTC loss is set to 1.6 during training. We train the models for 400k steps using Adam and apply an inverse square root learning rate decay schedule with 10k warmup steps. All hyper-parameters, such as dropout, are tuned on the development set. All models are implemented using fairseq S2T [20, 32] (https://github.com/pytorch/fairseq/tree/master/examples/speech_to_text).
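The inverse square root schedule with linear warmup can be sketched as follows; this is the standard transformer formulation, and the exact fairseq implementation may differ in details:

```python
def inverse_sqrt_lr(step, base_lr, warmup_steps):
    """Linear warmup to base_lr, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (warmup_steps ** 0.5) / (step ** 0.5)

lrs = [inverse_sqrt_lr(s, base_lr=1e-3, warmup_steps=10_000)
       for s in (1, 5_000, 10_000, 40_000)]
# warmup ramps linearly; at 4x the warmup steps the LR has halved
print(lrs)
```

The base learning rate of 1e-3 here is only a placeholder for the tuned value.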
| |dev (speech)|dev (text)|dev2 (speech)|dev2 (text)|test (speech)|test (text)|
|+ pre-trained encoder|30.1|-|31.5|-|31.1|-|
|Ground truth + TTS|87.9|100.0|88.9|100.0|89.6|100.0|
|Transformer-based Translatotron (w/ sp, tp)|25.0|-|26.3|-|26.2|-|
|Transformer-based Translatotron (w/ sc, tc)|32.9|-|34.1|-|33.2|-|
|S2U, no reduction (w/ sc, tc)|32.8|-|34.2|-|34.1|-|
|S2U stacked (w/ sc, tc)|34.0|-|34.5|-|34.4|-|
|S2U stacked + CTC (w/ sc, tc)|34.4|36.4|36.4|37.9|34.4|35.8|
|S2U reduced + CTC (w/ sc, tc)|36.6|39.5|38.3|40.8|37.2|39.4|
4.3 Baselines and evaluation
We build an S2T+TTS cascaded baseline by exploring various S2T and TTS architectures and find the best combination to be an LSTM-based S2T model [34] with a transformer-based TTS model [16]. The S2T model uses 8 bidirectional LSTM layers for the encoder and 4 LSTM layers for the decoder, with embedding and hidden state sizes all set to 256. The model is trained on Fisher Spanish speech and English text data, represented in characters, without pre-training or multitask learning. The TTS model has 6 transformer layers, 4 attention heads, embedding size 512 and FFN embedding size 2048 for both the encoder and the decoder. We use a 32-dimensional layer for the decoder prenet. The model is trained on the English text and the synthetic target speech with a reduction factor of 2 on the output feature frames. The vocoder is a HiFi-GAN model [12] fine-tuned on the mel-spectrogram features from teacher-forcing.
In addition, we implement a transformer-based Translatotron [9] that predicts mel-spectrogram features of the target speech. We apply a transformer instead of the LSTM architecture originally proposed in [9] in order to speed up model training. The model consists of the same speech encoder design as in the S2U model, the same transformer-based speech decoder design as in the TTS model of the cascaded baseline, and a fine-tuned HiFi-GAN vocoder [12]. We use the same auxiliary task setup described in the previous section but with a constant weight of 0.1 on each auxiliary loss. We apply a reduction factor of 5 on the output feature frames and tune the hyper-parameters on the development sets. Preliminary studies show no performance degradation for the transformer-based Translatotron compared with our implementation of the LSTM version of the model.
We adopt an open-sourced English ASR model (https://github.com/pytorch/fairseq/tree/master/examples/wav2vec) built with the combination of wav2vec 2.0 pre-training and self-training [35] for evaluating the speech output. The model, which is pre-trained on Libri-Light [10] and fine-tuned on full Librispeech [21], achieves WERs of 1.5% and 3.1% on the Librispeech test-clean and test-other sets, respectively. As the ASR decoded text is in lowercase and without punctuation except apostrophes, we normalize both the reference text and the text output from the S2ST model before computing BLEU using SacreBLEU [26].
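A sketch of the kind of normalization involved, assuming text is lowercased and all punctuation except apostrophes is removed; the exact normalization in our pipeline may differ:

```python
import re

def normalize_for_bleu(text):
    """Lowercase and strip punctuation except apostrophes, mirroring the
    ASR output format before scoring (an illustrative normalization)."""
    text = text.lower()
    # Replace any run of characters outside [a-z, apostrophe, space] with a space.
    text = re.sub(r"[^a-z' ]+", " ", text)
    return " ".join(text.split())

print(normalize_for_bleu("I'm twenty-six, OK?"))  # "i'm twenty six ok"
```

Applying the same normalization to references and hypotheses keeps the BLEU comparison fair despite the ASR output format.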
4.4 Results

We explore model training under both written and unwritten language scenarios. For the former, we take advantage of the text transcripts of the source and target speech during S2U model training. For the latter, we focus on the cases where the source is in either a written or an unwritten language, while the target language has no writing system, so the translation system can only be trained with speech targets.

Written Languages Table 2 summarizes the results. Results from previous work are listed as a reference; however, as different ASR models are used for evaluation, those BLEU scores should not be directly compared with our experiments. On the other hand, we use an open-sourced ASR model for evaluation, so our results should be comparable with all future research in the field that follows the same evaluation protocol. We also list the BLEU scores from applying the TTS model trained on synthetic target speech to the ground truth English text. The gap between the BLEU scores evaluated on the speech and text output indicates the effect of ASR errors on the evaluation metric.
First, we explore using different targets for the auxiliary tasks with the transformer-based Translatotron [9] and see that using characters as the auxiliary targets (sc, tc) gives a 7 BLEU gain compared with phonemes (sp, tp), indicating that the choice of auxiliary task target labels has a large impact on model performance. In all following experiments, we use characters as the targets for the auxiliary tasks.
Second, we compare the proposed S2U model against the transformer-based Translatotron. We start with the stacked strategy, as both models are then required to generate the same number of frames. We can see that “S2U stacked” outperforms the transformer-based Translatotron by 0.4-1.2 BLEU, indicating that discrete units are easier to model than continuous-valued mel-spectrogram features. In addition, we experiment with S2U training on the full discrete unit sequence without reduction and see that a larger reduction factor not only speeds up training and inference but also leads to higher performance (0.3-1.2 BLEU differences).
Third, we incorporate target text CTC decoding into the S2U model and evaluate both speech and text output. Joint training with the discrete unit loss and the text CTC loss brings a gain of 1.9 BLEU on dev2 for “S2U stacked”. Moreover, we see that the reduced strategy is more effective than the stacked one, with the former bringing a 1.9-2.8 BLEU improvement on speech output and a 2.6-3.1 BLEU gain on text output.
Finally, the best setup we find, “S2U reduced” with joint speech and text training and auxiliary tasks, bridges 83% of the gap between the transformer-based Translatotron and the S2T+TTS cascaded baseline. Moreover, compared with the cascaded system, the proposed framework has the advantage of generating consistent speech and text output in one inference pass. Table 3 shows examples of the ASR decoded text of the speech output and the text output from CTC decoding. We also examine the output from the tc auxiliary task, which can serve as another way to generate translated text. Using the ASR decoded text as the reference, we see a character error rate (CER) of 6.1 for the CTC decoded text and 31.6 for the tc decoded text on the dev set, indicating that the former is much better aligned with the generated audio. As shown in Table 3, the CTC decoded results with high CER are due to a combination of ASR errors and misspellings from CTC.
|human||i’m twenty six years living here|
|ASR||i’m twenty six years living here|
|CTC||i’m twentysix years living here|
|tc||i’ve been * * living here|
|ref||i’ve been living here twenty six years|
|human||what was i going to say eh you were born in puerto rico|
|ASR||what was i going to say h you were born in porto rico|
|CTC||what was i going to say eh you were born in puerto rico|
|tc||what was i was * * * * * born in puerto rico|
|ref||i was going to say were you sir born in puerto rico|
|human||what country would you like to travel if you could|
|ASR||what country would you like to travel if you could|
|CTC||what country which you like to travel if you could|
|tc||what country what country do you like to travel if you could you could like to|
|ref||what country would you like to travel if you could|
|human||i mean that sometimes people who their mind don’t get distort|
|ASR||i mean that sometimes people who their mind don get distort|
|CTC||i mean that’ sometimes people who their’re mind ont get distor|
|tc||i mean that there are people like the mind they get distored|
|ref||sometimes there is people with their minds distorted|
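The CER values reported above can be computed with a standard Levenshtein distance over characters; a minimal sketch:

```python
def cer(hyp, ref):
    """Character error rate: Levenshtein distance / reference length, in %."""
    h, r = list(hyp), list(ref)
    d = list(range(len(h) + 1))          # DP row: distance to empty reference
    for j, rc in enumerate(r, 1):
        prev, d[0] = d[0], j
        for i, hc in enumerate(h, 1):
            # min of: deletion, insertion, substitution/match
            prev, d[i] = d[i], min(d[i] + 1, d[i - 1] + 1, prev + (hc != rc))
    return 100.0 * d[len(h)] / len(r)

# The CTC output "twentysix years" differs from the ASR reference by one space.
print(round(cer("twentysix years", "twenty six years"), 1))
```

Note that this treats the ASR decoded text as the reference, matching how the CER numbers above are defined.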
| |dev|dev2|test|
|source written, target unwritten| | | |
|Translatotron (w/ sp)|7.4|8.0|7.2|
|S2U reduced (w/ sc)|32.8|34.0|33.8|
|source & target unwritten| | | |
|Translatotron, no auxiliary tasks|0.4|0.6|0.6|
|S2U reduced, no auxiliary tasks|6.6|7.0|6.7|
|S2U reduced (w/ su)|26.2|27.4|27.1|
Source Written, Target Unwritten We explore the unwritten target language setup, starting from the scenario where the source language has a writing system. Table 4 summarizes the results. Note that results from [29, 37] are listed as references, but the BLEU scores are not directly comparable due to differences in the ASR models used during evaluation.
We first build cascaded systems by combining ASR and T2ST. We train the transformer-based ASR system with the default hyper-parameters of the s2t_transformer_s architecture in fairseq S2T [32] on Fisher Spanish speech and Spanish transcriptions. The T2ST system can be built either by training a Translatotron-style model with a text encoder and a speech decoder that predicts spectrograms (T2S), or by training a text-to-unit (T2U) model that predicts discrete units as in the S2U models. We train the T2ST systems on the Spanish transcriptions and the corresponding synthetic English speech without any auxiliary task. For T2S, we use the same setup as the transformer TTS model. For T2U, we treat the problem as standard MT and use a base transformer to translate source characters into target discrete units. We see that the T2U model outperforms T2S by 15.1 BLEU, further evidence that discrete units are easier to model as translation targets than continuous spectrogram features.
We focus on “S2U reduced” based on the findings from the written language setup. We find that training an S2U model with the sc auxiliary task alone already achieves 91% of the performance of a system trained with both source and target text (Table 2). This is contrary to the findings in [9], where training Translatotron with only source transcripts attains 28% of the performance of a system trained with both source and target text.
Source & Target Unwritten Next, we extend our experiments to a fully unwritten language setup by training models without using any text transcripts (Table 4). It has been pointed out in [9] that the sequence-to-sequence model has difficulty learning to attend to the input speech when trained without auxiliary tasks, resulting in close-to-zero BLEU scores. [37] addresses the challenge by training with discrete unit targets and shows potential, though it uses labelled speech from languages other than the source and target to guide the VAE learning of the discrete units.
When “S2U reduced” is trained without auxiliary tasks, performance deteriorates greatly. We notice that the model can still generate meaningful text. However, the generated speech does not reflect the content of the source speech, and the 6.7 BLEU score is mostly contributed by function words. This shows that the discrete unit decoder can learn a language model over the unit sequence, while the challenge in model training lies in the attention on the encoder output.
To facilitate S2U model training, we apply the HuBERT model pre-trained on English to extract discrete representations of the source Spanish speech, and these source units (su) are used as the target of an auxiliary task. The resulting S2U model achieves a 0.9-1.2 BLEU gain compared with the transformer-based Translatotron trained with phoneme targets (Table 2). This shows that source units are effective in guiding the model to properly learn the attention, and that discrete representations learned in a self-supervised manner can capture basic pronunciations that are transferable across languages.
5 Conclusion

We investigate training direct S2ST models with self-supervised discrete representations as targets. We examine model training under both written and unwritten language scenarios. For the former, we propose a framework that performs close to the cascaded baseline, yet can generate translation in both speech and text within one inference pass. We demonstrate the possibility of translating between two unwritten languages by taking advantage of discrete representations of both the source and the target speech for model training.
Our studies are performed under a controlled setup where the target is clean synthetic speech from a single female speaker, a common setup in previous work in the field. With the recent release of large-scale S2S data [31], we plan to investigate the proposed framework with real data in the future. Another important aspect of generating speech output is voice and prosody. In this work, we focus on content translation and leave the para-linguistic aspects of speech translation to future work.
We use an open-sourced ASR model for evaluation, so the results will be comparable with all future research in the field. We will also release the code for reproducing the experiments.
Acknowledgements

We would like to thank Jade Copet, Emmanuel Dupoux, Evgeny Kharitonov, Kushal Lakhotia, Abdelrahman Mohamed, Tu Anh Nguyen and Morgane Rivière for helpful discussions on discrete representations, and Sravya Popuri for technical discussions.
References

- [1] (2006) Prosody generation for speech-to-speech translation. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, Vol. 1, pp. I-I.
- [2] (2012) Intent transfer in speech-to-speech machine translation. In 2012 IEEE Spoken Language Technology Workshop (SLT), pp. 153-158.
- [3] (2020) wav2vec 2.0: a framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33.
- [4] (2016) Listen and translate: a proof of concept for end-to-end speech-to-text translation. arXiv preprint arXiv:1612.01744.
- [5] (2017) Toward expressive speech translation: a unified sequence-to-sequence LSTMs approach for translating words and emphasis. In INTERSPEECH, pp. 2640-2644.
- [6] (2020) Exploring wav2vec 2.0 on speaker verification and language identification. arXiv preprint arXiv:2012.06185.
- [7] (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369-376.
- [8] (2021) HuBERT: self-supervised speech representation learning by masked prediction of hidden units. arXiv preprint arXiv:2106.07447.
- [9] (2019) Direct speech-to-speech translation with a sequence-to-sequence model. Proc. Interspeech 2019, pp. 1123-1127.
- [10] (2020) Libri-Light: a benchmark for ASR with limited or no supervision. In ICASSP 2020, pp. 7669-7673.
- [11] (2021) Transformer-based direct speech-to-speech translation with transcoder. In 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 958-965.
- [12] (2020) HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems 33.
- [13] (2018) Subword regularization: improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 66-75.
- [14] (2021) Generative spoken language modeling from raw audio. arXiv preprint arXiv:2102.01192.
- [15] (1997) JANUS-III: speech-to-speech translation in multiple languages. In 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 99-102.
- [16] (2019) Neural speech synthesis with transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6706-6713.
- [17] (2020) Multilingual speech translation with efficient finetuning of pretrained models. arXiv e-prints.
- [18] (2005) On the integration of speech recognition and statistical machine translation. In Ninth European Conference on Speech Communication and Technology.
- [19] (2006) The ATR multilingual speech-to-speech translation system. IEEE Transactions on Audio, Speech, and Language Processing 14 (2), pp. 365-376.
- [20] (2019) fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
- [21] (2015) Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206-5210.
- [22] (2019) SpecAugment: a simple data augmentation method for automatic speech recognition. Proc. Interspeech 2019, pp. 2613-2617.
- [23] (2019) Learning problem-agnostic speech representations from multiple self-supervised tasks. Proc. Interspeech 2019, pp. 161-165.
- [24] (2021) Speech resynthesis from discrete disentangled self-supervised representations. arXiv preprint arXiv:2104.00355.
- [25] (2014) Fisher and CALLHOME Spanish-English speech translation. LDC2014T23. Philadelphia: Linguistic Data Consortium.
- [26] (2018) A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186-191.
- [27] (2020) FastSpeech 2: fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558.
- [28] (2019) End-to-end ASR: from supervised to semi-supervised learning with modern architectures. arXiv preprint arXiv:1911.08460.
- [29] (2019) Speech-to-speech translation between untranscribed unknown languages. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 593-600.
- [30] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008.
- [31] (2021) VoxPopuli: a large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. arXiv preprint arXiv:2101.00390.
- [32] (2020) fairseq S2T: fast speech-to-text modeling with fairseq. In Proceedings of the 2020 Conference of the Asian Chapter of the Association for Computational Linguistics (AACL): System Demonstrations.
- [33] (2017) Tacotron: towards end-to-end speech synthesis. Proc. Interspeech 2017, pp. 4006-4010.
- [34] (2017) Sequence-to-sequence models can directly translate foreign speech. Proc. Interspeech 2017, pp. 2625-2629.
- [35] (2021) Self-training and pre-training are complementary for speech recognition. In ICASSP 2021, pp. 3030-3034.
- [36] (2021) SUPERB: speech processing universal performance benchmark. arXiv preprint arXiv:2105.01051.
- [37] (2020) UWSpeech: speech to speech translation for unwritten languages. arXiv preprint arXiv:2006.07926.