UWSpeech: Speech to Speech Translation for Unwritten Languages

06/14/2020 ∙ by Chen Zhang, et al. ∙ Zhejiang University and Microsoft

Existing speech to speech translation systems heavily rely on the text of the target language: they usually translate the source language either into target text and then synthesize target speech from that text, or directly into target speech with target text for auxiliary training. However, those methods cannot be applied to unwritten target languages, for which no written text or phoneme inventory is available. In this paper, we develop a translation system for unwritten languages, named UWSpeech, which converts target unwritten speech into discrete tokens with a converter, then translates source-language speech into target discrete tokens with a translator, and finally synthesizes target speech from the target discrete tokens with an inverter. We propose a method called XL-VAE, which enhances the vector quantized variational autoencoder (VQ-VAE) with cross-lingual (XL) speech recognition, to train the converter and inverter of UWSpeech jointly. Experiments on the Fisher Spanish-English conversation translation dataset show that UWSpeech outperforms direct translation and the VQ-VAE baseline by about 16 and 10 BLEU points respectively, which demonstrates the advantages and potential of UWSpeech.


1 Introduction

Speech to speech translation [19, 26, 42, 15] is important for understanding cross-lingual spoken conversations and lectures, and has been used in scenarios such as international travel and conferences. Existing speech to speech translation systems either rely on target text as a pivot (they first translate source speech into target text and then synthesize target speech from the translated text [19, 26, 42]), or directly translate source speech into target speech [15]. In these translation systems, the text corresponding to the target speech is leveraged either as a pivot or as auxiliary training data; otherwise, the translation would not be possible or the translation accuracy would drop dramatically [15].

However, there are thousands of unwritten languages in the world [20], which are purely spoken and have no written text. It is challenging to build speech translation systems for these unwritten languages without text as a pivot or as auxiliary training data: continuous speech (which usually carries content, context, speaking style, etc.) is much more flexible in representing semantic meanings than discrete symbols (text) [36, 40], which makes translation into speech harder than translation into text. Therefore, the key to easing speech translation for unwritten languages is to reduce the flexible continuous space of speech to a more restricted discrete space.

A variety of previous works [25, 7, 44, 1, 16, 11, 12, 35] have investigated the conversion between speech and its corresponding phonetic categories (discrete tokens) in an unsupervised manner, which mimics the way human infants learn acoustic models of their mother tongue during their early years of life [39]. Among these works, the vector quantized variational autoencoder (VQ-VAE) [36, 10, 35, 8, 22, 34, 3] has been widely adopted and has shown advantages over other methods. However, VQ-VAE is purely unsupervised and cannot ensure the quality of the learned discrete representations. Therefore, although VQ-VAE performs very well on relatively easy tasks like speech synthesis [10], it cannot achieve good accuracy on more complicated speech to speech translation, where semantic representations of speech are important and more accurate phonetic representations are required. Few works tackle speech to speech translation for unwritten languages [34] since it is extremely challenging.

In this paper, we develop UWSpeech (UW is short for UnWritten), a translation system for unwritten languages with three key components: 1) a converter that transforms unwritten target speech into discrete tokens, 2) a translator that translates source-language speech into target-language discrete tokens, and 3) an inverter that converts the translated discrete tokens back into unwritten target speech. As can be seen, the discretization (transforming speech into discrete tokens with the converter) and reconstruction (synthesizing speech from discrete tokens with the inverter) steps in UWSpeech are important to ensure translation accuracy.

To this end, we propose XL-VAE, which improves the discretization and reconstruction capability of VQ-VAE. Different from VQ-VAE, which relies purely on unsupervised methods for discrete representation learning, XL-VAE leverages written languages with phonetic labels to improve the vector quantization (discrete representation learning) of unwritten languages through cross-lingual (XL) transfer. As human beings share similar vocal organs and pronunciations [45], no matter which spoken languages they use, the phonetic representations learned in one language can more or less (depending on the language similarity) help the learning of phonetic representations in another language [46, 17]. Therefore, XL-VAE can benefit from other written languages and outperform the purely unsupervised VQ-VAE on discretizing speech into discrete tokens and synthesizing speech from discrete tokens, which enables UWSpeech to achieve better translation accuracy.

Our contributions can be summarized as follows:

  • We develop UWSpeech, a speech to speech translation system for unwritten languages, and design a novel XL-VAE to train the converter and inverter in UWSpeech for discrete speech representations.

  • We conduct experiments on the Fisher Spanish-English speech conversation dataset, assuming the target language is unwritten. The experimental results show that UWSpeech equipped with XL-VAE achieves 16 and 10 BLEU points of improvement over direct translation and the VQ-VAE baseline respectively, which demonstrates the advantages and potential of UWSpeech on speech to speech translation for unwritten target languages.

  • We further apply UWSpeech to text to speech translation and speech to text translation for unwritten languages. The improvements over direct translation and VQ-VAE baseline demonstrate the general applicability of UWSpeech beyond speech to speech translation.

2 Background

A Taxonomy of Speech Translation and Our Focused Setting

Building on the success of text to text translation [4, 23, 38], speech translation [5, 43, 15] has been developed to handle speech as translation input and/or output. Previous works on speech translation have evolved from cascaded models [27, 24, 19, 26, 42] to end-to-end models [5, 43, 41, 33, 15], where the text corresponding to the speech is leveraged as auxiliary training data for better accuracy. Depending on whether speech is on the source and/or target side, speech translation can be divided into three categories: speech to text translation, text to speech translation and speech to speech translation. In this paper, we focus on the most difficult setting: speech to speech translation for unwritten target languages. Furthermore, we also extend UWSpeech to text to speech translation with unwritten target languages and speech to text translation with unwritten source languages to demonstrate the generalization ability of our method. Besides, our method can also be applied to written target languages whose text or phonetic transcripts are not available in the training data.

Discrete Speech Representations

Learning discrete representations of speech has long been studied for better speech understanding and modeling. Previous works on discrete speech representations include k-means clustering [16, 11], Gaussian mixture model clustering [7], tree-based clustering [25], binarization with straight-through estimation [12], categorical VAE [12] and the more advanced vector quantized VAE (VQ-VAE) [36, 10, 35, 8, 22, 34, 3]. VQ-VAE has been widely used to cluster/quantize speech representations and discretize them into codebook sequences, and has achieved good results on tasks such as subword unit discovery from speech and text to speech synthesis [10]. However, VQ-VAE is a purely unsupervised clustering method for discrete speech representations, which limits its effectiveness on harder tasks like speech translation. In this paper, we improve VQ-VAE with cross-lingual (XL) speech recognition and propose XL-VAE to achieve better discrete speech representations.

3 UWSpeech

In this section, we introduce the design of our proposed UWSpeech: a speech to speech translation system for unwritten target languages built with the help of a cross-lingual vector quantized variational autoencoder (XL-VAE). We first describe the overall pipeline of UWSpeech, and then introduce the detailed design of XL-VAE.

Figure 1: The training and inference pipeline of UWSpeech.

3.1 Pipeline Overview

For speech to speech translation where the target language is unwritten, UWSpeech consists of three components as shown in Figure 1: 1) a converter to transform the target-language speech into discrete tokens; 2) a translator to translate the source speech into target discrete tokens; 3) an inverter to convert the target discrete tokens back to target speech. We introduce each component in the following subsections.

Translator

Denote the training corpus as $\mathcal{D} = \{(x, y)\}$, where $x$ and $y$ are the source and target speech sequences. According to the pipeline of UWSpeech, we convert the target unwritten speech sequence $y$ into a discrete token sequence $z$ to form a triple corpus $\mathcal{D}' = \{(x, y, z)\}$. We train a machine translator $\theta_{st}$ by minimizing the negative log-likelihood loss

$$\mathcal{L}_{ST}(\theta_{st}; \mathcal{D}') = - \sum_{(x, z) \in \mathcal{D}'} \log P(z \mid x; \theta_{st}), \qquad (1)$$

where $P(z \mid x; \theta_{st})$ can be implemented as a standard encoder-attention-decoder model [38] with several convolution layers in the encoder to handle the speech input, as described in the experimental settings.
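As a concrete illustration of Equation (1), the sketch below computes the negative log-likelihood of a batch of target discrete token sequences given translator decoder logits. It is a minimal PyTorch example with illustrative shapes and a hypothetical padding convention, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def translator_nll(logits, targets, pad_id=0):
    """Negative log-likelihood of target discrete tokens given decoder logits.

    logits:  [batch, target_len, vocab_size] scores from the translator decoder
    targets: [batch, target_len] discrete token ids z extracted by the converter
    pad_id:  assumed padding token id, excluded from the loss
    """
    # cross_entropy over the token dimension equals -log P(z | x) summed over steps
    return F.cross_entropy(
        logits.transpose(1, 2),   # [batch, vocab, target_len], as F.cross_entropy expects
        targets,
        ignore_index=pad_id,
        reduction="sum",
    )

# usage sketch with random tensors
logits = torch.randn(2, 7, 100)           # e.g. a vocabulary of 100 discrete tokens
targets = torch.randint(1, 100, (2, 7))   # token ids produced by the converter
print(translator_nll(logits, targets))
```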

Converter and Inverter

The converter and inverter transform the speech sequence into a discrete token sequence and transform it back into a speech sequence respectively, following the form of an autoencoder, where the converter acts as the encoder and the inverter acts as the decoder. Inspired by VQ-VAE, we propose a novel XL-VAE to better train the converter and inverter for speech translation.

3.2 Xl-Vae

XL-VAE first encodes the speech sequence into hidden representations with a converter to extract discrete tokens, and then reconstructs the original speech sequence from the representations of the discrete tokens with an inverter. Different from VQ-VAE [36], XL-VAE extracts discrete representations not by unsupervised vector clustering, but by speech/phoneme recognition, where the recognition capability is transferred from other popular written languages. We train the phoneme recognition on written languages with paired speech and phonemes on top of the converter. We illustrate XL-VAE in Figure 2 and formulate each module in XL-VAE as follows.

Converter

The converter of XL-VAE takes the speech sequence $y$ as input and generates continuous hidden representations $h$:

$$h = \mathrm{Converter}(y; \theta_{con}). \qquad (2)$$

$h$ is further converted into discrete latent variables $z$ through nearest neighbour search based on the dot product (we use the dot product here instead of the Euclidean distance used in VQ-VAE, in order to be consistent with the speech recognition, where the hidden representations are multiplied with the matrix $e$ and then transformed through a softmax function to obtain the probability of each phoneme category, as described later in this subsection):

$$q(z_t = k \mid y) = \begin{cases} 1, & k = \arg\max_{j} \; h_t \cdot e_j \\ 0, & \text{otherwise,} \end{cases} \qquad (3)$$

where $q(z_t \mid y)$ denotes the categorical distribution of the discrete variable $z_t$, $e \in \mathbb{R}^{K \times D}$ denotes the embedding space of the discrete tokens, $K$ denotes the number of discrete tokens and $D$ denotes the size of each embedding vector in $e$.
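The dot-product nearest-neighbour search of Equations (2) and (3) can be written compactly as below. This is a minimal sketch with made-up tensor shapes, assuming the codebook is the discrete token embedding matrix $e$.

```python
import torch

def quantize_dot_product(h, codebook):
    """Nearest-neighbour quantization by dot product, as in Equation (3).

    h:        [batch, time, D] continuous converter outputs
    codebook: [K, D] discrete token embeddings e (the IPA embedding table in XL-VAE)
    Returns token ids z ([batch, time]) and their embeddings e_z ([batch, time, D]).
    """
    scores = torch.einsum("btd,kd->btk", h, codebook)  # dot product with every codebook entry
    z = scores.argmax(dim=-1)                          # pick the highest-scoring token
    e_z = codebook[z]                                  # look up the chosen embeddings
    return z, e_z

# usage sketch
h = torch.randn(2, 50, 256)
codebook = torch.randn(100, 256)   # e.g. K = 100 IPA symbols, D = 256 (illustrative sizes)
z, e_z = quantize_dot_product(h, codebook)
```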

As shown in Figure 2, the converter takes a speech (mel-spectrogram) sequence as input and uses several convolution layers with strides to reduce the length of the speech sequence by a down-sampling ratio $r$. It then stacks Transformer blocks [38], where each block contains a self-attention layer and a feed-forward layer, with layer normalization and a residual connection on top of each layer. For a speech sequence of length $T$, the generated discrete token sequence has length $T/r$.
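A possible shape of such a converter, sketched in PyTorch, is shown below. The layer counts and dimensions are illustrative placeholders rather than the configuration in Appendix B, and a down-sampling ratio of 4 is assumed.

```python
import torch
import torch.nn as nn

class Converter(nn.Module):
    """Strided convolutions (down-sampling by 4 here) followed by Transformer blocks."""
    def __init__(self, n_mels=80, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        # two stride-2 convolutions reduce the mel-spectrogram length by a factor of 4
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=1024,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, mels):              # mels: [batch, frames, n_mels]
        x = self.conv(mels.transpose(1, 2)).transpose(1, 2)  # [batch, frames/4, d_model]
        return self.encoder(x)            # continuous representations h

h = Converter()(torch.randn(2, 200, 80))  # -> [2, 50, 256]
```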

Figure 2: The model structure of XL-VAE.

Inverter

The inverter of XL-VAE takes the discrete tokens $z$ as input and converts them into embedding vectors $\bar{e}$ with the discrete token look-up table $e$ (the same as used in the converter). Then $\bar{e}$ is used to reconstruct the original speech sequence $y$:

$$\hat{y} = \mathrm{Inverter}(\bar{e}; \theta_{inv}). \qquad (4)$$

As shown in Figure 2, the inverter leverages several transposed convolution layers [9] to increase the length of $\bar{e}$ by the ratio $r$ (opposed to the down-sampling by $r$ in the converter), to match the length of the original mel-spectrogram sequence. It then stacks Transformer blocks [38] as used in the converter. The inverter reconstructs the speech sequence in parallel. Therefore, different from the conventional self-attention in the Transformer decoder, which cannot see information at future positions, the self-attention in the inverter can attend to all positions, just like the converter. A vocoder [14, 28] is leveraged to further convert the mel-spectrogram into an audio waveform.
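A matching inverter sketch, again with placeholder sizes and an assumed up-sampling ratio of 4, is shown below; its Transformer blocks are non-causal and attend to all positions, as described above.

```python
import torch
import torch.nn as nn

class Inverter(nn.Module):
    """Transposed convolutions up-sample token embeddings, then Transformer blocks refine them."""
    def __init__(self, n_mels=80, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        # two stride-2 transposed convolutions undo the converter's down-sampling by 4
        self.upsample = nn.Sequential(
            nn.ConvTranspose1d(d_model, d_model, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose1d(d_model, d_model, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=1024,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)  # non-causal self-attention
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, e_z):  # e_z: [batch, tokens, d_model] looked-up token embeddings
        x = self.upsample(e_z.transpose(1, 2)).transpose(1, 2)  # [batch, tokens*4, d_model]
        return self.to_mel(self.blocks(x))                      # reconstructed mel-spectrogram

mels_hat = Inverter()(torch.randn(2, 50, 256))  # -> [2, 200, 80]
```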

Cross-Lingual (XL) Speech Recognition

Instead of the unsupervised quantization in VQ-VAE, XL-VAE introduces speech recognition on other written languages to help learn the discrete representations, as shown in Figure 2. Given the speech and phoneme sequence pairs $(s, w)$ of the written languages, we use the converter to transform the speech $s$ into hidden representations $h$, multiply $h$ with the discrete token embedding matrix $e$ (defined in Equation 3), and obtain the probability distribution over phoneme categories with a softmax operation, where $K$ is the size of the phoneme vocabulary of the written languages and also the number of discrete tokens in $e$, similar to [21]. We train the phoneme recognition with the connectionist temporal classification (CTC) loss [13]. The formulation of the cross-lingual speech recognition is as follows:

$$\mathcal{L}_{CTC}(\theta_{con}, e) = - \sum_{(s, w)} \log \sum_{\pi \in \Phi(w)} P(\pi \mid s), \qquad P(\pi \mid s) = \prod_{t=1}^{T} P(\pi_t \mid s), \qquad (5)$$

where $\Phi(w)$ denotes the set of valid CTC paths for the phoneme sequence $w$, $P(\pi \mid s)$ denotes the probability of the CTC path $\pi$, $P(\pi_t \mid s)$ denotes the probability of observing label $\pi_t$ under the softmax function, and $T$ denotes the length of the hidden sequence. The loss function aims to minimize the negative log-likelihood of all the valid CTC paths in the training set. For more details on CTC, we refer the reader to [13]; it is not the focus of this work.
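The cross-lingual recognition branch can be sketched with the standard CTC loss in PyTorch: the converter outputs are projected onto the phoneme categories by a dot product with the discrete token embedding matrix, so the codebook itself receives gradients from recognition. Tensor shapes, variable names and the blank index below are assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def xl_ctc_loss(h, codebook, phonemes, input_lengths, target_lengths, blank_id=0):
    """CTC loss for cross-lingual phoneme recognition, as in Equation (5).

    h:        [batch, time, D] converter outputs for written-language speech
    codebook: [K, D] discrete token (IPA) embedding matrix e; its dot product with h
              gives the phoneme logits, so the codebook is trained by recognition
    phonemes: [batch, max_target_len] IPA label ids (blank_id is an assumed convention)
    """
    logits = torch.einsum("btd,kd->btk", h, codebook)           # [batch, time, K]
    log_probs = F.log_softmax(logits, dim=-1).transpose(0, 1)   # CTC expects [time, batch, K]
    return F.ctc_loss(log_probs, phonemes, input_lengths, target_lengths,
                      blank=blank_id, reduction="mean")

# usage sketch
h = torch.randn(2, 50, 256)
codebook = torch.randn(100, 256, requires_grad=True)
phonemes = torch.randint(1, 100, (2, 20))
loss = xl_ctc_loss(h, codebook, phonemes,
                   input_lengths=torch.tensor([50, 50]),
                   target_lengths=torch.tensor([20, 18]))
```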

Discrete Representation

We choose the international phonetic alphabet (IPA) [2] as the phoneme set of the written languages. In this way, the discrete token embeddings $e \in \mathbb{R}^{K \times D}$ are exactly the embeddings of IPA symbols, where $K$ is the size of the IPA set and $D$ is the dimension of the embedding vector. The unwritten speech is converted into discrete tokens that fall into the IPA set of the written languages. The discrete tokens $z$ as well as the corresponding embedding vectors in $e$ are taken as the discrete representations of the speech $y$.

Loss Function of XL-VAE

Putting Equations (2), (3), (4) and (5) together, we have the loss function of XL-VAE:

$$\mathcal{L}_{XL\text{-}VAE}(\theta_{con}, \theta_{inv}, e) = \mathcal{L}_{REC}(\theta_{con}, \theta_{inv}, e) + \alpha \, \mathcal{L}_{CTC}(\theta_{con}, e), \qquad (6)$$

where $\mathcal{L}_{REC}$ is the reconstruction loss between the reconstructed speech $\hat{y}$ in Equation (4) and the original speech $y$, $\mathcal{L}_{CTC}$ is the cross-lingual recognition loss in Equation (5), and $\alpha$ is a hyperparameter to trade off the two loss terms.
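Putting the pieces together, the overall objective might be computed as in the short sketch below; the mean-squared error on mel-spectrograms is an assumed form of the reconstruction term, which the text above does not spell out.

```python
import torch.nn.functional as F

def xl_vae_loss(y, y_hat, ctc_loss, alpha=1.0):
    """Total XL-VAE objective of Equation (6): reconstruction + alpha * CTC.

    The reconstruction term is assumed to be a mean-squared error between the
    reconstructed and original mel-spectrograms; alpha is the trade-off weight
    tuned on the validation set.
    """
    rec_loss = F.mse_loss(y_hat, y)   # assumed form of the reconstruction loss
    return rec_loss + alpha * ctc_loss
```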

3.3 Training and Inference

Finally, we describe the training and inference procedure of UWSpeech according to the formulations in the previous two subsections. The detailed procedure is shown in Algorithm 1.

  Training:
  Input: Speech to speech translation corpus $\mathcal{D} = \{(x, y)\}$, where $y$ represents the target unwritten speech; paired speech and phoneme corpus $\mathcal{D}_{WL} = \{(s, w)\}$ in written languages, where $w$ uses IPA as the phoneme set.
  Step 1: Train the XL-VAE model on the corpora $\mathcal{D}$ and $\mathcal{D}_{WL}$ using the loss in Equation (6) to obtain the converter $\theta_{con}$, the inverter $\theta_{inv}$ and the discrete token look-up table $e$.
  Step 2: Convert the unwritten speech corpus $\{y\}$ into the discrete sequence corpus $\{z\}$ following Equations (2) and (3). Train the machine translator $\theta_{st}$ on the corpus $\{(x, z)\}$ using the loss in Equation (1).
  Inference:
  Input: Source speech corpus $\{x\}$, translator $\theta_{st}$, discrete token look-up table $e$ and inverter $\theta_{inv}$.
  Step 1: For each source speech sequence $x$, generate the target discrete tokens $z = \arg\max_{z} P(z \mid x; \theta_{st})$.
  Step 2: Convert $z$ into $\bar{e}$ through the discrete token look-up table $e$, and synthesize the target speech $\hat{y} = \mathrm{Inverter}(\bar{e}; \theta_{inv})$.
Algorithm 1 UWSpeech Training and Inference
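The two training stages and the inference path of Algorithm 1 can be tied together as in the sketch below. The objects refer to the illustrative modules sketched in Section 3.2 (not released code), and `translator.generate` stands in for a hypothetical beam-search decoder.

```python
import torch

# Stage 1 (Algorithm 1, Training Step 1): train XL-VAE (converter, inverter, codebook)
# on unwritten-language speech plus written-language (speech, IPA) pairs with Equation (6).
# Stage 2 (Training Step 2): freeze XL-VAE, label the target speech with discrete tokens
# via the converter, then train the translator on (source speech, tokens) with Equation (1).

@torch.no_grad()
def uwspeech_translate(source_mels, translator, codebook, inverter):
    """Inference path of Algorithm 1: source speech -> discrete tokens -> target speech."""
    z = translator.generate(source_mels)   # hypothetical beam-search decode to token ids
    e_z = codebook[z]                      # look up the token embeddings in e
    return inverter(e_z)                   # reconstruct the target mel-spectrogram
```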

4 Experiments and Results

In this section, we first introduce the experimental setup and then report the results of UWSpeech for speech to speech translation. We further conduct some analyses of UWSpeech. Finally, we also apply UWSpeech to text to speech translation and speech to text translation settings.

4.1 Experimental Setup

Following the common practice in low-resource and unsupervised speech and translation works [18, 32, 31], we conduct experiments on popular written languages but remove the text of the target speech to simulate unwritten languages. We choose the Fisher Spanish-English dataset [30] for translation. Considering that 1) translation into unwritten languages is difficult and 2) the most useful translation scenarios for unwritten languages are daily communication, travel translation, etc., where high-frequency and simple words/sentences are usually used, we choose some common sentences from the original full test set to form our test set. The experimental results on the original full test set are listed in Appendix A. For the written languages used in XL-VAE, we choose French, German and Chinese with speech data and the corresponding phoneme sequences. Both the German and French datasets are from Common Voice (https://voice.mozilla.org/); for the Chinese dataset, we use AIShell [6]. More details about the datasets are given in Appendix A.

We choose Transformer [38] as the basic model structure for the converter, inverter and translator, since it achieves good results on machine translation, speech recognition and speech synthesis tasks. The detailed model configurations and hyperparameters are explained in the Appendix B.

To evaluate the accuracy of the speech translation, we pre-train an automatic speech recognition model (which achieves 85.62 BLEU points on our test set and is comparable with [15]) to generate the corresponding text of the translated speech, and then calculate the BLEU score [29] between the generated text and the reference text. More running details are given in Appendix C.

4.2 Results

In this subsection, we report the experiment results of UWSpeech. We compare UWSpeech mainly with two baselines: 1) Direct Translation, which directly translates the source speech into target speech with an encoder-attention-decoder model, without any text as auxiliary training data or pivots; 2) Discretization with VQ-VAE (denoted as VQ-VAE), which follows the translation pipeline of UWSpeech but replaces XL-VAE with the original VQ-VAE for speech discretization.

Method Direct Translation VQ-VAE UWSpeech
BLEU 1.45 7.17 17.33
Table 1: The BLEU scores of Spanish to English speech to speech translation, where English is taken as the unwritten language.

The speech to speech translation results on Spanish to English are shown in Table 1. As can be seen, Direct Translation achieves a very low BLEU score, which is consistent with the findings in [15] and demonstrates the difficulty of direct speech to speech translation. VQ-VAE achieves a slightly better BLEU score than Direct Translation, but the accuracy is still poor, which demonstrates the limitations of a purely unsupervised method for speech discretization when handling speech translation. UWSpeech achieves 17.33 BLEU points, about 10 points higher than VQ-VAE and 16 points higher than Direct Translation. We also find that the inverter in XL-VAE obtains a lower reconstruction loss than VQ-VAE on the validation set, demonstrating that the discrete tokens extracted by XL-VAE not only help the discrete token translation in the translator but also benefit the speech reconstruction in the inverter, which together contribute to the better accuracy in speech translation. The above results demonstrate the advantages of XL-VAE in leveraging cross-lingual speech recognition for speech discretization and the effectiveness of UWSpeech for unwritten speech translation.

We further show the experiment results on English to Spanish translation in Table 2. Similar to the results on Spanish to English translation, Direct Translation achieves very low BLEU score and UWSpeech achieves about 8 points higher than VQ-VAE, demonstrating the effectiveness of UWSpeech.

Method Direct Translation VQ-VAE UWSpeech
BLEU 0.80 3.12 11.13
Table 2: The BLEU scores of English to Spanish speech to speech translation, where Spanish is taken as the unwritten language.

4.3 Method Analyses

We conduct several experimental analyses of the proposed UWSpeech; some are listed below, and others, such as the combination with multi-task training, are shown in Appendix D.

Analyses of Written Languages in XL-VAE

We study the influence of the written languages used in XL-VAE on the translation accuracy, mainly from two perspectives: 1) the data amount of the written languages, and 2) the similarity between the written and unwritten languages. To this end, we design several different experimental settings for this study, as shown in Table 3. (One may wonder whether the acoustic conditions of the speech in the different written languages influence the comparison. We listened to and compared the acoustic conditions of their speech data and found only small differences. Therefore, we can focus on the data amount and language similarity instead of the acoustic conditions, considering the good robustness of the ASR model.)

Setting Configuration BLEU
#1 De (80h) 10.58
#2 De (160h) 12.12
#3 De (320h) 15.20
#4 De (320h) + Fr (160h) + Zh (160h) 17.33
#5 Fr (160h) 11.79
#6 Zh (160h) 9.38
Table 3: The BLEU scores of Spanish to English speech to speech translation with different written languages as well as different data amounts for XL-VAE. We denote German as De, French as Fr and Chinese as Zh.

From settings #1, #2 and #3, it can be seen that increasing the data amount of the written language (German) improves the speech translation accuracy. Comparing setting #4 with #3, we find that further adding other languages (French and Chinese) to increase the total data amount also improves the translation accuracy. Comparing settings #2, #5 and #6, we find that German helps the discretization of English more than French, and both German and French help more than Chinese, which is consistent with the language similarity. According to the language families [20], German and English belong to the same Germanic branch of the Indo-European family, while French and English belong to the same Indo-European family although not to the same branch. Chinese and English belong to different families and are far apart from each other. Even using the distant Chinese as the written language, our method still achieves higher accuracy than VQ-VAE (9.38 vs 7.17).

Varying Embedding Size and Down-Sampling Ratio in XL-VAE

We further evaluate how the discrete token embedding size and the speech down-sampling ratio in XL-VAE influence the translation accuracy. According to preliminary experiments, we fix the down-sampling ratio to 4 when varying the embedding size and fix the embedding size to 256 when varying the down-sampling ratio. As shown in Table 4, a discrete token embedding size of 256 and a down-sampling ratio of 4 perform best.

Embedding Size 64 128 256 512
BLEU 13.85 15.20 17.33 17.13
Down-Sampling Ratio 1 2 4 8
BLEU 10.05 13.27 17.33 16.85
Table 4: The BLEU scores of Spanish to English translation with different discrete token embedding sizes and down-sampling ratios.

The Advantage of Training the Converter and Inverter Jointly

To study the benefit of jointly training the converter and inverter in XL-VAE, we separately train the converter with speech recognition on the written languages and the inverter by reconstructing speech from the discrete tokens. Separate training achieves 13.51 BLEU points on Spanish to English translation, which is much lower than joint training in XL-VAE (17.33), demonstrating the effectiveness of joint training in XL-VAE.

Discretization of Source Speech

In the previous experiments, we only discretize the target speech for speech to speech translation. Here we study the translation accuracy when we also discretize the source speech into discrete tokens. We conduct experiments on the Spanish to English translation direction and achieve 17.45 BLEU points, which is only slightly better than discretizing the target speech alone (17.33, as shown in Table 1). The results indicate that direct translation from continuous source speech is not as difficult as direct translation into continuous target speech.

Case Analyses

We further analyze some translation cases produced by our UWSpeech system and the baseline methods on Spanish to English translation. As shown in Table 5, we list the source (Spanish) and target (English) reference text corresponding to the speech, and convert the translated English speech into text with the pre-trained automatic speech recognition model used in evaluation. For the first case, both Direct Translation and VQ-VAE miss the meaning of “what she said”, while UWSpeech translates it. For the second case, only UWSpeech translates the meaning of “How’s it going, where are you from?” correctly. We also show the discrete token sequence (IPA) produced by the translator (denoted as IPA (UWSpeech)) as well as the discrete token sequence extracted from the target speech (denoted as IPA (Target)) in Table 5. It can be seen that the IPA translated by UWSpeech is close to the target IPA, and both are close to the pronunciation of the English speech, which demonstrates the good accuracy of the IPA extracted by XL-VAE and translated by the translator. We attach the corresponding speech and more cases in Appendix E.

Case 1
Spanish (Source) Yo no entendí lo que ella dijo.
English (Target) I didn’t understand what she said.
Direct Translation I don’t know.
VQ-VAE I didn’t understand.
UWSpeech I didn’t understand what she say.
IPA (Target) ai ai n | d I g n n z E | 5 n Y s t t @ l 5 n n t t | v O t | t i: s | E E: n
IPA (UWSpeech) ai ai | d e n n n | a n n v y: s s t e n n n t | v O 5 t | t d i: | E E l
Case 2
Spanish (Source) Qué tal, ¿de dónde eres?
English (Target) How’s it going, where are you from?
Direct Translation Had a price.
VQ-VAE Like are you there are you from?
UWSpeech How are you, where are you from?
IPA (Target) h h a: s | b I t t | Oy n n | O | K | j | v a: K m
IPA (UWSpeech) h h au | 5 | j ø: | 5 | j e | v a: K m
Table 5: Some translation cases in Spanish to English speech to speech translation.

4.4 Extension of UWSpeech

Although UWSpeech is designed for speech to speech translation, it can also be applied to two other speech translation settings for unwritten languages: text to speech translation and speech to text translation. We conduct experiments on these two settings for Spanish to English translation to verify the broad applicability of UWSpeech for unwritten speech translation, and show the results in Table 6.

In the text to speech setting, Direct Translation still achieves very poor translation accuracy and UWSpeech achieves about 14 BLEU points improvements over VQ-VAE baseline, demonstrating the effectiveness of UWSpeech on text to speech translation for unwritten languages.

In the speech to text setting, UWSpeech achieves much higher accuracy than VQ-VAE and slightly better accuracy than Direct Translation. While verifying the effectiveness of UWSpeech, these results also show that it is not really necessary to discretize the source speech in speech translation, which is consistent with our findings in Section 4.3 and with the results in [43], where even leveraging the ground-truth text corresponding to the source speech yields a BLEU gain of less than 2 points.

Method Direct Translation VQ-VAE UWSpeech
Text to Speech 5.47 8.02 22.03
Speech to Text 33.87 29.98 34.05
Table 6: The BLEU scores of the text to speech and speech to text setting on Spanish to English translation, where English is taken as the unwritten target language in the text to speech setting, and Spanish is taken as the unwritten source language in the speech to text setting.

5 Conclusion

In this paper, we developed UWSpeech, a speech to speech translation system for unwritten target languages, and designed XL-VAE, an enhanced version of VQ-VAE based on cross-lingual speech recognition, to jointly train the converter and inverter to discretize and reconstruct the unwritten speech in UWSpeech. Experiments on Fisher Spanish-English dataset demonstrate that UWSpeech equipped with XL-VAE achieves significant improvements in translation accuracy over direct translation and VQ-VAE baseline.

In the future, we will enhance XL-VAE with domain adversarial training to better transfer the speech recognition ability from written languages to unwritten languages. We will test UWSpeech on more complicated sentences and language pairs. Furthermore, going beyond the proof-of-concept experiments in this work (we assumed English or Spanish is unwritten), we will apply UWSpeech on truly unwritten languages for speech to speech translation.

Appendix A Datasets Details

We choose the Fisher Spanish-English dataset [30] for the experiments, which contains telephone conversation speech and text in Spanish and the corresponding text translations in English, with K parallel training samples in total. Following [15], we synthesize the English speech from the text using a commercial text to speech system with a female speaker. We use the original training and development sets of the dataset. We only consider the most useful translation scenarios for unwritten languages (e.g. daily communication, travel translation, etc.), in which high-frequency and simple words/sentences are usually used, so we obtain some common sentences from the full test set to form our test set by filtering sentences with a word-frequency threshold. We set the threshold to K for English and K for Spanish and finally get about K out of 3641 sentence pairs as the test set. We also conduct experiments on English to Spanish translation, where the target Spanish speech is also synthesized using the commercial text to speech system with another female speaker.

We also provide the BLEU scores of our experiments on the original Fisher test set in Table 7.

Method Direct Translation VQ-VAE UWSpeech
Speech to Speech 0.80 3.42 9.35
Table 7: The BLEU scores of Spanish to English speech to speech translation on the original full Fisher test set, where English is taken as the unwritten language.

For the written languages used in XL-VAE, we choose French, German and Chinese with speech and the corresponding phoneme sequences. Both the German and French datasets are from Common Voice (https://voice.mozilla.org/), where the German corpus contains about 280K training examples (325 hours) from 5007 different speakers and the French corpus contains 150K training examples (173 hours) from 3005 different speakers. For the Chinese dataset, we use AIShell [6], which contains about 140K training examples (178 hours) from 400 different speakers. We choose 320 hours, 160 hours and 160 hours from the German, French and Chinese corpora respectively for training, and use the rest for development. We first convert the text in the German, French and Chinese corpora into phonemes with our internal grapheme-to-phoneme conversion tool, and then map the phonemes to IPA [2] according to our internal phoneme-to-IPA mapping table. We convert all the speech waveforms in our experiments into mel-spectrograms following [31], with a frame size of 50 ms and a hop size of 12.5 ms.

Appendix B Model Configuration Details

For the convolution and transposed convolution layers in the converter and inverter, the kernel size, stride size and filter size are set to , and respectively. We stack convolution layers to set the down/up-sampling ratio to 4 according to the validation performance. We stack Transformer blocks in both the converter and inverter, with the hidden size of both the self-attention and feed-forward layers set to , and the filter size of the feed-forward layer set to . The size of the IPA dictionary is set to and the dimension of the discrete token embeddings is set to . We simply choose the Griffin-Lim algorithm [14] as the vocoder to synthesize the final speech waveform.

The translator performs speech to discrete tokens (IPA) translation, which follows the basic encoder-attention-decoder model structure in Transformer [38]. The encoder has several additional convolution layers to transform the speech input, which follows the same configuration of the convolution layers in the converter (with a down-sampling ratio). The discrete token embedding size, hidden size, filter size, number of encoder and decoder layers of the translator are set to , , , , respectively.

Appendix C Pipeline Details

Training Details

We first train the converter, inverter and discrete token embeddings in XL-VAE. We up-sample the speech data of each written language (German, French, Chinese) to the same amount, and then up-sample the speech data of unwritten language (English or Spanish) to match to the total amount of written languages. We ensure there are an equal amount of data in written and unwritten languages in each mini-batch. We choose the in Equation 6 according to the validation performance and set to . The batch size is set to K frames for each GPU and the XL-VAE training takes K steps on Tesla V100 GPUs.

After the training of XL-VAE, the phoneme error rates (PER) of three written languages (German, French and Chinese) on the development set are , and respectively. We convert the target unwritten speech into discrete token sequence and keep the output discrete token sequence as it is, without removing any special or repeated tokens. We use the discrete token sequence generated by XL-VAE to train translator, with batch size of K frames on each GPU and K training steps on Tesla V100 GPUs.

Our code is implemented based on the tensor2tensor library [37] (https://github.com/tensorflow/tensor2tensor).

Inference and Evaluation

During inference, we use the translator to generate discrete token sequence from source speech with beam search. We set beam size to and length penalty to . We then directly use the inverter to transform the discrete token sequence back to target speech.

To evaluate the accuracy of the speech translation, following the practice in [15], we pre-train an automatic speech recognition model (which achieves 85.62 BLEU points on our test set and is comparable with [15]) to generate the corresponding text of the translated speech, and then calculate the BLEU score [29] between the generated text and the reference text. We report case-insensitive BLEU using the Moses tokenizer (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl) and multi-bleu.perl (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl). Since the Fisher corpus provides multiple English references for the test set, we report 4-reference BLEU for the Spanish to English setting, and single-reference BLEU for the English to Spanish setting.
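As a rough equivalent of this evaluation step, the snippet below scores ASR transcripts of the translated speech against the English references with the sacrebleu Python package instead of the paper's multi-bleu.perl script; scores may differ slightly because of tokenization, and the sentences shown are only illustrative.

```python
import sacrebleu

# hypotheses: ASR transcripts of the translated speech (illustrative examples)
hypotheses = ["i didn't understand what she say", "how are you, where are you from"]
# one list per reference set; Fisher provides multiple English reference sets
references = [
    ["i didn't understand what she said", "how's it going, where are you from"],
]
bleu = sacrebleu.corpus_bleu(hypotheses, references, lowercase=True)
print(bleu.score)
```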

Appendix D More Method Analyses

UWSpeech with Multi-task Training

[15] proposes a direct speech to speech translation model, which improves translation accuracy through multi-task training (source speech to source text, i.e., automatic speech recognition, and source speech to target text, i.e., speech to text translation). Originally, due to the lack of text in both the source and target languages, speech to speech translation for unwritten languages could not take advantage of this multi-task training mechanism. However, our proposed XL-VAE can discretize speech into discrete tokens, which can be regarded as text for multi-task training. Therefore, we study how UWSpeech performs when combined with multi-task training.

We combine UWSpeech with multi-task training in two ways:

  • SL ASR (Source Language ASR): training a model that has a shared speech encoder and two decoders: one for speech recognition on the source unwritten language (source speech to the corresponding discrete tokens), and the other for speech translation from the source unwritten language (source speech to the discrete tokens of the target language). Both the source and target discrete tokens are generated by XL-VAE. In this way, we leverage automatic speech recognition of the source unwritten language (with discrete token sequences as targets) as an auxiliary loss in our translator.

  • WL ASR (Written Languages ASR): training a model that has a shared speech encoder and two decoders: one for phone-level automatic speech recognition on the auxiliary written languages (e.g., German, French and Chinese in this paper), and the other for speech to speech translation on the unwritten languages (e.g., translating Spanish speech to English speech directly) at the same time, in the hope that the ASR task helps train the speech encoder better.

As shown in Table 8, the SL ASR setting only improves the BLEU score slightly, from 17.33 to 17.41, which again suggests that the discretization of the source speech is not really necessary. The BLEU score of the WL ASR setting is very low (2.36), which indicates that the Direct Translation model cannot make full use of the written languages, whereas XL-VAE can.

Method UWSpeech SL ASR WL ASR
BLEU 17.33 17.41 2.36
Table 8: The BLEU scores of Spanish to English speech to speech translation, combined with multi-task training in different ways.

Appendix E Case Analyses and Demo Audios

Spanish (Source) Yo no entendí lo que ella dijo.
English (Target) I didn’t understand what she said.
Direct Translation I don’t know.
VQ-VAE I didn’t understand.
UWSpeech I didn’t understand what she say.
IPA (Target) ai ai n | d I g n n z E | 5 n Y s t t @ l 5 n n
t t | v O t | t i: s | E E: n
IPA (UWSpeech) ai ai | d e n n n | a n n v y: s s t e n n n t
| v O 5 t | t d i: | E E l
Table 9: Case 1
Spanish (Source) Qué tal, ¿de dónde eres?
English (Target) How’s it going, where are you from?
Direct Translation Had a price.
VQ-VAE Like are you there are you from?
UWSpeech How are you, where are you from?
IPA (Target) h h a: s | b I t t | Oy n n | O | K | j | v a: K m
IPA (UWSpeech) h h au | 5 | j ø: | 5 | j e | v a: K m
Table 10: Case 2
Spanish (Source) Yo soy puertoriqueña.
English (Target) I am Puerto Rican.
Direct Translation Ah.
VQ-VAE I’m from.
UWSpeech I’m from Puerto Rico.
IPA (Target) ai n m | p o d d @ | v i: i: g e E n |
IPA (UWSpeech) ai ai n n | f a ŋ | p o 5 d @ @ | v v e k k 5 |
Table 11: Case 3
Spanish (Source) Halo, buenas noches.
English (Target) Hello good evening.
Direct Translation And a.
VQ-VAE Hello video.
UWSpeech Hello good evening.
IPA (Target) h a n l O | g l t t t | g I b n I I ŋ |
IPA (UWSpeech) h a n l O | g l l d | i: i: v n ŋ ŋ
Table 12: Case 4

All the corresponding audios are available at https://speechresearch.github.io/uwspeech/.

References

  • [1] O. Adams (2017) Automatic understanding of unwritten languages. Ph.D. Thesis. Cited by: §1.
  • [2] I. P. Association, I. P. A. Staff, et al. (1999) Handbook of the international phonetic association: a guide to the use of the international phonetic alphabet. Cambridge University Press. Cited by: Appendix A, §3.2.
  • [3] A. Baevski, S. Schneider, and M. Auli (2019) vq-wav2vec: self-supervised learning of discrete speech representations. arXiv preprint arXiv:1910.05453. Cited by: §1, §2.
  • [4] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §2.
  • [5] A. Bérard, O. Pietquin, C. Servan, and L. Besacier (2016) Listen and translate: a proof of concept for end-to-end speech-to-text translation. arXiv preprint arXiv:1612.01744. Cited by: §2.
  • [6] H. Bu (2017) AIShell-1: an open-source Mandarin speech corpus and a speech recognition baseline. In Oriental COCOSDA 2017. Cited by: Appendix A, §4.1.
  • [7] H. Chen, C. Leung, L. Xie, B. Ma, and H. Li (2015) Parallel inference of dirichlet process gaussian mixture models for unsupervised acoustic modeling: a feasibility study. In Sixteenth Annual Conference of the International Speech Communication Association, Cited by: §1, §2.
  • [8] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord (2019) Unsupervised speech representation learning using wavenet autoencoders. IEEE/ACM transactions on audio, speech, and language processing 27 (12), pp. 2041–2053. Cited by: §1, §2.
  • [9] V. Dumoulin and F. Visin (2016) A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285. Cited by: §3.2.
  • [10] E. Dunbar, R. Algayres, J. Karadayi, M. Bernard, J. Benjumea, X. Cao, L. Miskic, C. Dugrain, L. Ondel, A. W. Black, and et al. (2019-09) The zero resource speech challenge 2019: tts without t. Interspeech 2019. External Links: Link, Document Cited by: §1, §2.
  • [11] E. Dunbar, X. N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier, X. Anguera, and E. Dupoux (2017) The zero resource speech challenge 2017. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 323–330. Cited by: §1, §2.
  • [12] R. Eloff, A. Nortje, B. van Niekerk, A. Govender, L. Nortje, A. Pretorius, E. van Biljon, E. van der Westhuizen, L. van Staden, and H. Kamper (2019) Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks. Cited by: §1, §2.
  • [13] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber (2006) Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376. Cited by: §3.2.
  • [14] D. Griffin and J. Lim (1984) Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32 (2), pp. 236–243. Cited by: Appendix B, §3.2.
  • [15] Y. Jia, R. J. Weiss, F. Biadsy, W. Macherey, M. Johnson, Z. Chen, and Y. Wu (2019) Direct speech-to-speech translation with a sequence-to-sequence model. Cited by: Appendix A, Appendix C, Appendix D, §1, §2, §4.1, §4.2.
  • [16] H. Kamper, K. Livescu, and S. Goldwater (2017) An embedded segmental k-means model for unsupervised segmentation and clustering of speech. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 719–726. Cited by: §1, §2.
  • [17] P. K. Kuhl, B. T. Conboy, S. Coffey-Corina, D. Padden, M. Rivera-Gaxiola, and T. Nelson (2008) Phonetic learning as a pathway to language: new data and native language magnet theory expanded (nlm-e). Philosophical Transactions of the Royal Society B: Biological Sciences 363 (1493), pp. 979–1000. Cited by: §1.
  • [18] G. Lample, A. Conneau, L. Denoyer, and M. Ranzato (2018) Unsupervised machine translation using monolingual corpora only. Cited by: §4.1.
  • [19] A. Lavie, A. Waibel, L. Levin, M. Finke, D. Gates, M. Gavalda, T. Zeppenfeld, and P. Zhan (1997) JANUS-iii: speech-to-speech translation in multiple languages. In 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 99–102. Cited by: §1, §2.
  • [20] M. P. Lewis, G. F. Simons, and C. D. Fennig (eds.) (2015) Ethnologue: Languages of the World. SIL International, Dallas, Texas. Online version: http://www.ethnologue.com. Cited by: §1, §4.3.
  • [21] X. Li, S. Dalmia, D. R. Mortensen, J. Li, A. W. Black, and F. Metze (2020) Towards zero-shot learning for automatic phonemic transcription. In Thirty-Fourth AAAI Conference on Artificial Intelligence. Cited by: §3.2.
  • [22] A. H. Liu, T. Tu, H. Lee, and L. Lee (2019) Towards unsupervised speech recognition and synthesis with quantized speech representation learning. arXiv preprint arXiv:1910.12729. Cited by: §1, §2.
  • [23] M. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421. Cited by: §2.
  • [24] E. Matusov, S. Kanthak, and H. Ney (2005) On the integration of speech recognition and statistical machine translation. In Ninth European Conference on Speech Communication and Technology, Cited by: §2.
  • [25] P. K. Muthukumar and A. W. Black (2014) Automatic discovery of a phonetic inventory for unwritten languages for statistical speech synthesis. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2594–2598. Cited by: §1, §2.
  • [26] S. Nakamura, K. Markov, H. Nakaiwa, G. Kikui, H. Kawai, T. Jitsuhiro, J. Zhang, H. Yamamoto, E. Sumita, and S. Yamamoto (2006) The atr multilingual speech-to-speech translation system. IEEE Transactions on Audio, Speech, and Language Processing 14 (2), pp. 365–376. Cited by: §1, §2.
  • [27] H. Ney (1999) Speech translation: coupling of recognition and translation. In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No. 99CH36258), Vol. 1, pp. 517–520. Cited by: §2.
  • [28] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu (2016) Wavenet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. Cited by: §3.2.
  • [29] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: Appendix C, §4.1.
  • [30] M. Post, G. Kumar, A. Lopez, D. Karakos, C. Callison-Burch, and S. Khudanpur (2013) Improved speech-to-text translation with the fisher and callhome spanish–english speech translation corpus. In Proc. IWSLT, Cited by: Appendix A, §4.1.
  • [31] Y. Ren, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T. Liu (2019) Almost unsupervised text to speech and automatic speech recognition. In International Conference on Machine Learning, pp. 5410–5419. Cited by: Appendix A, §4.1.
  • [32] K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) MASS: masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450. Cited by: §4.1.
  • [33] M. Sperber, G. Neubig, J. Niehues, and A. Waibel (2019) Attention-passing models for robust and data-efficient end-to-end speech translation. TACL 7, pp. 313–325. Cited by: §2.
  • [34] A. Tjandra, S. Sakti, and S. Nakamura (2019) Speech-to-speech translation between untranscribed unknown languages. arXiv preprint arXiv:1910.00795. Cited by: §1, §2.
  • [35] A. Tjandra, B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura (2019) VQVAE unsupervised unit discovery and multi-scale code2spec inverter for zerospeech challenge 2019. Cited by: §1, §2.
  • [36] A. van den Oord, O. Vinyals, et al. (2017) Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315. Cited by: §1, §1, §2, §3.2.
  • [37] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, A. Gomez, S. Gouws, L. Jones, Ł. Kaiser, N. Kalchbrenner, N. Parmar, et al. (2018) Tensor2Tensor for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Papers), pp. 193–199. Cited by: Appendix C.
  • [38] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, pp. 5998–6008. Cited by: Appendix B, §2, §3.1, §3.2, §3.2, §4.1.
  • [39] M. Versteegh, X. Anguera, A. Jansen, and E. Dupoux (2016) The zero resource speech challenge 2015: proposed approaches and results. Procedia Computer Science 81, pp. 67–72. Cited by: §1.
  • [40] G. Vigliocco, D. P. Vinson, W. Lewis, and M. F. Garrett (2004) Representing the meanings of object and action words: the featural and unitary semantic space hypothesis. Cognitive psychology 48 (4), pp. 422–488. Cited by: §1.
  • [41] L. C. Vila, C. Escolano, J. A. Fonollosa, and M. R. Costa-jussà (2018) End-to-end speech translation with the transformer.. In IberSPEECH, pp. 60–63. Cited by: §2.
  • [42] W. Wahlster (2013) Verbmobil: foundations of speech-to-speech translation. Springer Science & Business Media. Cited by: §1, §2.
  • [43] R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu, and Z. Chen (2017) Sequence-to-sequence models can directly translate foreign speech. Cited by: §2, §4.4.
  • [44] A. Wilkinson, T. Zhao, and A. W. Black (2016) Deriving phonetic transcriptions and discovering word segmentations for speech-to-speech translation in low-resource settings.. In INTERSPEECH, pp. 3086–3090. Cited by: §1.
  • [45] J. Wind (1989) The evolutionary history of the human speech organs. Studies in language origins 1, pp. 173–197. Cited by: §1.
  • [46] C. Yallop and J. Fletcher (2007) An introduction to phonetics and phonology. Cited by: §1.