Although human speech inherently carries linguistic features that represent textual information, modern text-to-speech training pipelines still require parallel pairs of speech and text transcriptions. Parallel speech and transcripts are not always available and are costly to acquire, whereas human speech alone can be gathered easily. In this work, we exploit this advantage of unlabeled speech to discover discrete linguistic units: a machine learns to uncover the linguistic features hidden in human utterances without any supervision. We use these discovered linguistic units for voice conversion (VC) and achieve strong results.
Embedding audio signals into latent representations is a well-studied practice. Previous studies have also attempted to encode speech content into various representations for voice conversion, including the use of disentangled autoencoders, VAEs, or GANs. Continuous vectors are the most common approach; however, they may not be the best choice for encoding speech content, since they are unlike the discrete phonemes that we often use to represent human language. Previous works have also attempted to learn discrete representations from audio, but they did not apply the learned units to VC.
In VC, the state-of-the-art approach requires an additional loss in a GAN framework to guarantee the disentanglement of the learned encodings, whereas the proposed approach naturally possesses the ability to separate speaker style from speech content. In , VC is achieved through multiple losses together with a K-way quantized embedding space and an autoregressive WaveNet; the proposed encoding space, on the other hand, does not require additional training constraints.
In this work, we use an ASR-TTS autoencoder to discover discrete linguistic units from speech, without any alignment, text labels, or parallel data. The ASR-Encoder learns to encode speech from different speakers into a common, small set of linguistic symbols. This finite set of linguistic symbols is represented by Multilabel-Binary Vectors (MBV): vectors consisting of an arbitrary number of zeros and ones. The proposed MBV method is differentiable, hence allowing backpropagation of gradients and end-to-end training in an autoencoder reconstruction setting. While the ASR-Encoder learns a many-to-one mapping from speech to discrete subword units, a TTS-Decoder learns a one-to-many mapping from discrete subword units back to speech. The discrete nature of MBV allows it to innately separate linguistic content from speech, removing speaker characteristics. Given an utterance from a source speaker, we can encode its speech content using the ASR-Encoder and perform voice conversion with the TTS-Decoder, generating speech with the same linguistic content but the style of a target speaker.
In training, the TTS-Decoder is always given an encoding from an arbitrary speaker and is trained to decode it back to the original speech of that particular speaker. At inference time, the TTS-Decoder has to take encodings from a source speaker and decode them to a target speaker, an encoding-speaker pair that it never observed during training. Although voice conversion is already feasible at this point, we propose additional adversarial training to compensate for this training-testing inconsistency. Under the WGAN setting, we train a TTS-Patcher in place of a generator. The TTS-Patcher learns to generate a mask that augments the output of the TTS-Decoder. Furthermore, we use a target-driven reconstruction loss to guide the generator's updates. As a result, the quality of voice conversion is improved. In the ZeroSpeech 2019 Challenge, we achieve a high placement in terms of low bitrate under a strong dimension constraint on the Surprise Language leaderboard. We further show that the proposed method is capable of generating high-quality, intelligible speech when the dimension constraint is removed.
2 Proposed method
2.1 Discrete linguistic units discovery
We present an unsupervised end-to-end learning framework for discrete linguistic unit discovery. In this stage we learn a common set of discrete encodings for all speakers. Let x ∈ X be an acoustic feature sequence, where X is the set of all such sequences from all source and target speakers, and x is a fixed-length segment randomly sampled from X. Let y ∈ Y be a speaker, where Y is the group of all source and target speakers who produce the sequence collection X. In a training pair (x, y), the sequence x is produced by speaker y.
2.1.1 Learning Multilabel-Binary Vectors
In Figure 1, an ASR-TTS autoencoder framework is used to learn discrete linguistic units. The ASR-Encoder is trained to map an input acoustic feature sequence x to a latent discrete encoding representation:

Enc(x) = e,     (1)
where we propose to use Multilabel-Binary Vectors (MBV) to represent the encoding e generated in (1); the discrete encoding e is designed to represent the linguistic content of the input speech x. We define MBV as:

e ∈ {0, 1}^D,     (2)
where e is a D-dimensional binary vector consisting of an arbitrary number of zeros and ones. The proposed MBV encoding method is differentiable, and allows end-to-end training with direct gradient backpropagation in an autoencoder framework. To obtain the binarized differentiable vector, we linearly project a continuous output vector into a D × 2 space, where D is the dimension of the desired MBV. The dual-channel projection allows each dimension of an MBV to symbolize an arbitrary attribute's presence. We then apply the categorical reparameterization trick with Gumbel-Softmax on the D-dimensional dual channels, which is equivalent to asking the model to predict whether an attribute is observed in a given input. Since the two channels are linked together by Gumbel-Softmax, picking one of them is sufficient; hence we select the first channel as our MBV encoding e. We illustrate the process of MBV encoding in Figure 1 (b). Former approaches also use Gumbel-Softmax on the output layer to obtain one-hot vectors. However, the resulting one-hot vectors are insufficient to represent content in speech, as the one-hot encoding space is too sparse for machines to learn linguistic meanings, which we verified in our experiments.
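The dual-channel Gumbel-Softmax binarization can be sketched as follows. This is a minimal single-sample sketch in plain Python; the actual model operates on batched, learned projections and uses a straight-through gradient, which plain Python cannot express.

```python
import math
import random

def gumbel_softmax_binarize(logits, temperature=1.0, seed=None):
    """Sample a Multilabel-Binary Vector (MBV) from dual-channel logits.

    `logits` is a list of D (on, off) logit pairs, one pair per MBV
    dimension.  Each pair is relaxed with Gumbel-Softmax, and the first
    channel's probability is hard-thresholded to produce a 0/1 entry.
    """
    rng = random.Random(seed)
    mbv = []
    for on_logit, off_logit in logits:
        # Gumbel(0, 1) noise for each channel.
        g_on = -math.log(-math.log(rng.random()))
        g_off = -math.log(-math.log(rng.random()))
        # 2-way softmax over the perturbed, temperature-scaled logits.
        a = (on_logit + g_on) / temperature
        b = (off_logit + g_off) / temperature
        m = max(a, b)
        p_on = math.exp(a - m) / (math.exp(a - m) + math.exp(b - m))
        # Hard decision (forward pass); training would propagate the
        # soft p_on gradient backward (straight-through estimator).
        mbv.append(1 if p_on > 0.5 else 0)
    return mbv

# Strongly "on" and strongly "off" dimensions are recovered reliably.
vec = gumbel_softmax_binarize([(9.0, -9.0), (-9.0, 9.0)], seed=0)
```

Each call thus answers D independent "is this attribute present?" questions, which is what makes the MBV space denser than one-hot codes of the same dimension.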
2.1.2 Voice reconstruction and conversion
Given the ASR-Encoder's output, the TTS-Decoder is trained to generate an output acoustic feature sequence defined as:

Dec(e, y) = x̂,     (3)

where x̂ is a reconstruction of x from e given the speaker identity y. Mean Absolute Error (MAE) is employed as the ASR-TTS autoencoder reconstruction loss, as MAE is reported to generate sharper outputs than Mean Square Error. The reconstruction loss is given by:

L_rec(θ_enc, θ_dec) = E_(x,y) [ ‖x − Dec(Enc(x), y)‖_1 ],     (4)
where θ_enc and θ_dec are the parameters of the ASR-Encoder and TTS-Decoder, respectively. We uniformly sample pairs (x, y) for training in (4). Because the speaker identity y is provided to the TTS-Decoder, the proposed MBV encodings are able to learn an abstract space that is invariant to speaker identity and only encodes the content of speech, without using any form of linguistic supervision.
At inference time, given source speech x_s and a target speaker y_t, the TTS-Decoder can generate the voice of the target speaker using the linguistic content of x_s:

Dec(Enc(x_s), y_t) = x̂_(s→t).     (5)

x̂_(s→t) is the output of the TTS-Decoder, which has the linguistic content of x_s but the style of speaker y_t.
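The training-time reconstruction and inference-time conversion described above can be sketched as follows, with `enc` and `dec` as hypothetical stand-ins for the trained ASR-Encoder and TTS-Decoder (the real models are neural networks; here they are just callables):

```python
def mae(x, x_hat):
    """Mean Absolute Error over flattened spectrogram values; this is
    the reconstruction criterion used by the ASR-TTS autoencoder."""
    assert len(x) == len(x_hat)
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)

def reconstruction_loss(enc, dec, x, speaker):
    """Training: decode the MBV encoding back to the *same* speaker."""
    return mae(x, dec(enc(x), speaker))

def convert(enc, dec, x_src, target_speaker):
    """Inference: same linguistic content, target speaker's style."""
    return dec(enc(x_src), target_speaker)
```

The asymmetry is the whole point: training only ever pairs an encoding with its own speaker, while conversion pairs it with an unseen one.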
2.2 Target guided adversarial learning
With the learned speaker-invariant ASR-Encoder mapping in (1), we successfully represent speech content with discrete binary vectors (MBVs). In this section, we describe how adversarial training is used to boost VC quality based on the discrete linguistic units learned in Section 2.1. We train a TTS-Patcher in an unsupervised manner: the TTS-Patcher generates a spectrum mask that residually augments the output of equation (5), resulting in a more precise voice conversion result.
We define two sets of speakers, a source speaker set and a target speaker set, where we aim to convert the speech of a source speaker into a target speaker's style while preserving its content. Let x_s ∈ X_s and x_t ∈ X_t be sequences, where X_s and X_t are the sets of sequences from source speakers and target speakers, respectively. Let y_s ∈ Y_s be a speaker from the set of all source speakers who produce X_s, and let y_t ∈ Y_t be a speaker from the set of all target speakers who produce X_t.
2.2.1 Adversarial learning step
In Figure 2, a TTS-Patcher is trained as a generator under an adversarial learning setting. The TTS-Patcher takes the MBV encoding and the target speaker identity as input, and generates a spectrum mask m with values ranging from zero to one, which modifies the conversion output through a residual augmentation:

x̃ = x̂ ⊕ (m ⊗ x̂).     (6)

The ⊕ and ⊗ symbols indicate element-wise addition and multiplication, respectively. In (6), the input speech segment is randomly sampled from a source speaker, and the target speaker is randomly sampled as well; x̂ is the voice conversion utterance obtained in (5), m is the output of the TTS-Patcher, and x̃ is the final augmented spectrum.
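As a minimal sketch, the residual augmentation can be read as out = converted + mask * converted, element-wise. The exact operand grouping is an assumption of this sketch; the text only states that the mask modifies the decoder output residually.

```python
def residual_augment(converted, mask):
    """Residually augment a converted spectrum with a TTS-Patcher mask.

    `converted` is the flattened TTS-Decoder output and `mask` the
    TTS-Patcher output with values in [0, 1].  The grouping
    out = converted + mask * converted is an assumption of this sketch.
    """
    assert len(converted) == len(mask)
    return [c + m * c for c, m in zip(converted, mask)]

# A mask entry of 0 leaves the decoder output untouched; 1 doubles it,
# so the patcher can only *amplify* parts of the spectrum, not erase them.
```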
A discriminator is trained to distinguish whether an input acoustic feature sequence is real or reconstructed by a machine. Since the naive GAN is notoriously hard to train, we instead minimize the Wasserstein-1 distance between the real and fake distributions, as proposed in the WGAN formulation:

W(P_r, P_g) = sup_(‖D‖_L ≤ 1) E_(x∼P_r)[D(x)] − E_(x̃∼P_g)[D(x̃)].     (7)
The discriminator computes the Wasserstein-1 distance between two distributions: real data sampled from the target speaker set, and the augmented voice conversion outputs from (6). We use the WGAN-GP alternative, in which weight clipping is replaced with a gradient penalty, to enforce the 1-Lipschitz constraint required by the Wasserstein formulation. On the last layer of the discriminator, we attach an additional layer that learns a classifier to predict the speaker of a given speech segment. This allows the discriminator to consider the input spectrum's fidelity and speaker identity at the same time.
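As a sketch of the critic objective under these choices: the gradient penalty must be computed with an autograd framework in practice, so this function only combines precomputed pieces; the default weight of 10 follows the WGAN-GP paper and is an assumption about this model's setting.

```python
def wgan_gp_critic_loss(real_scores, fake_scores, gradient_penalty,
                        gp_weight=10.0):
    """Critic loss to *minimize*: E[D(fake)] - E[D(real)] + lambda * GP.

    `real_scores` are critic outputs on target-speaker speech,
    `fake_scores` are critic outputs on augmented conversion outputs,
    and `gradient_penalty` is the precomputed WGAN-GP penalty term
    (computing it requires autograd, so it is passed in here).
    """
    mean = lambda xs: sum(xs) / len(xs)
    return mean(fake_scores) - mean(real_scores) + gp_weight * gradient_penalty
```

Minimizing this loss drives the critic's real/fake score gap apart, which is exactly the Wasserstein-1 estimate the generator then tries to shrink.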
2.2.2 Target guided training step
The decoupled learning of the ASR-TTS autoencoder and the TTS-Patcher stabilizes the GAN training process. However, we found that under the adversarial learning scheme, the TTS-Patcher can easily learn to deceive the discriminator by over-adding style, which greatly compromises the original speech content. This is caused by the discriminator's inability to discriminate utterances with incorrect or ambiguous content; the discriminator only learns to focus on speaker style. As a solution, we add a target guided training step: we apply an additional reconstruction loss after every adversarial step, as shown in Figure 2. Instead of converting source speech, the ASR-Encoder now takes a segment of target speech x_t as input, and equation (6) then becomes:

x̃_t = x̂_t ⊕ (m ⊗ x̂_t),     (8)

where x̂_t is the TTS-Decoder output for x_t given the target speaker identity, and we minimize the MAE between x̃_t and x_t:

L_patch(θ_patch) = E[ ‖x_t − x̃_t‖_1 ],     (9)

where θ_patch denotes the parameters of the TTS-Patcher. This loss effectively guides the TTS-Patcher's updates under the adversarial setting, as the added style is regularized to preserve intelligibility.
The ASR-Encoder is inspired by the CBHG module, where the linear output of the ASR-Encoder is fed to the MBV encoding module. We add noise during training through dropout layers in the ASR-Encoder, as suggested in . The TTS-Decoder and TTS-Patcher have identical model architectures, in which we use pixel shuffle layers to generate high-resolution spectra. We add a speaker embedding to the feature maps of all layers, where a distinct embedding is learned for each layer, as different information may be needed at each layer. The discriminator consists of 2D-convolution blocks for temporal texture capturing, and convolutional projection layers followed by fully-connected output layers. We trained the network using the Adam optimizer with a batch size of 16. In the discrete linguistic unit discovery stage, we train the ASR-TTS autoencoder for 200k mini-batches. In the target guided adversarial learning stage, we train the model for 50k mini-batches; in each batch we run one adversarial learning step, consisting of 5 iterations of discriminator updates and 1 iteration of generator update, followed by a target guided reconstruction step.
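The adversarial-stage schedule described above can be sketched as a plain loop; the step callables are hypothetical stand-ins for the actual parameter updates.

```python
def adversarial_stage(n_batches, d_step, g_step, recon_step,
                      d_iters=5, g_iters=1):
    """Per mini-batch: 5 discriminator updates and 1 generator
    (TTS-Patcher) update, followed by one target guided
    reconstruction step, as described in the text."""
    for _ in range(n_batches):
        for _ in range(d_iters):
            d_step()
        for _ in range(g_iters):
            g_step()
        recon_step()
```

The 5:1 discriminator-to-generator ratio mirrors the usual WGAN recipe of training the critic closer to optimality before each generator step.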
We train and evaluate our model on the ZeroSpeech 2019 English dataset. In particular, we use the "Train Unit Dataset" as our source speaker set, the "Train Voice Dataset" as our target speaker set, and we evaluate models on the "Test Dataset". We use log-magnitude spectrograms as acoustic features; the detailed settings are given in Table 1. During training, our model processes 128 consecutive overlapping frames of spectrogram, uniformly sampled from the dataset. At inference time, for a given input with more than 128 frames, the model processes them as segments and concatenates the outputs along the time axis. Source code is publicly available at https://github.com/andi611/ZeroSpeech-TTS-without-T.
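The inference-time segmentation can be sketched as follows. How the final partial segment is handled is an assumption of this sketch; the text does not specify a padding strategy.

```python
SEGMENT_LEN = 128  # frames per segment, as in the training setup

def segment_frames(frames, segment_len=SEGMENT_LEN):
    """Split a spectrogram (a list of frames) into consecutive
    segments of `segment_len` frames; the final short remainder is
    kept as-is in this sketch."""
    return [frames[i:i + segment_len]
            for i in range(0, len(frames), segment_len)]

def convert_long(frames, convert_segment):
    """Convert each segment independently and concatenate the
    outputs along the time axis."""
    out = []
    for seg in segment_frames(frames):
        out.extend(convert_segment(seg))
    return out
```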
Types of encodings           | Dim  | Acc
continuous (with add'l loss) | 1024 | 78%
continuous (with add'l loss) | 128  | 81.3%
We compare the proposed MBV encodings with one-hot encodings, continuous encodings, and continuous encodings with an additional loss , all trained under the same autoencoder setting described in Section 3.
4.1 Degree of disentanglement
To evaluate the degree of disentanglement of the proposed MBV with respect to speaker characteristics, we trained a speaker verification classifier that takes a spectrum as input and predicts speaker identity. We encode speech from a source speaker and convert it to a target speaker; with the pre-trained classifier, we then measure target speaker classification accuracy on the converted results. A disentangled representation should produce voice similar to the target speaker and thus lead to higher classification accuracy. The results are shown in Table 2, where the Dim column indicates the dimension of the encoding vectors. One-hot encodings are insufficient to encode speech, resulting in poor conversion results. Continuous encodings are incapable of disentangling content from style, resulting in lower classification accuracy, whereas the proposed MBV encodings are able to preserve speech content while removing speaker style. Although both one-hot vectors and MBV are discrete, each dimension of a one-hot vector corresponds to a linguistic unit, while each dimension of an MBV may correspond to a pronunciation attribute. This makes MBV more data-efficient than one-hot vectors.
4.2 Subjective and objective evaluation
We perform subjective human evaluation on the converted voices. Twenty subjects graded each method on a 1-to-5 scale under two measures: the naturalness of the speech and the similarity of speaker characteristics to the target speaker. Table 3 shows the results: the proposed method yields a significant increase in target similarity with a slight degradation in naturalness. We achieve speech intelligibility comparable to ordinary continuous methods, while achieving better voice conversion quality with more disentanglement (Table 2). The subjective and objective evaluations suggest that the proposed MBV method eliminates speaker identity while preserving content within speech, and is thus suitable for voice conversion.
Types of encodings           | naturalness | similarity
continuous (with add'l loss) | 3.21        | 2.58
Ours (MBV with dim 6)        | 1.61        | 1.51
Ours (with adv. training)    | 2.57        | 3.15
4.3 Encoding dimension analysis
We use several objective measures to determine the quality of an encoding; these measurements are shown in Table 4. The CER column is the output Character Error Rate from a pre-trained ASR, where we use the ASR results on the real input voice as ground truth; this measures the intelligibility of the converted speech. The bitrate column measures the average amount of information that the encodings carry with respect to the test set, as suggested in . The ABX column measures the machine ABX score, which indicates the quality of the encoding. The final column indicates the number of unique symbols used to encode speech in the test set. Lower values indicate better performance for all of the measures described above. In Table 4, we compare the proposed method with other approaches alongside the baseline model demonstrated in . Compared to other approaches, the proposed method achieves a lower bitrate and fewer unique symbols with comparable ABX scores. Due to the discrete and differentiable nature of MBV, the proposed method can also be used in other unsupervised end-to-end clustering or classification tasks, where other approaches may fail to generalize.
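As an illustration of the bitrate measure, here is a simplified estimate in the spirit of the ZeroSpeech definition, namely symbol rate times empirical symbol entropy; the official evaluation tool computes this over the full test set, so treat the function below as a sketch only.

```python
import math
from collections import Counter

def bitrate(symbols, duration_seconds):
    """Estimate encoding bitrate as (symbols per second) times the
    empirical symbol entropy in bits."""
    counts = Counter(symbols)
    n = len(symbols)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return (n / duration_seconds) * entropy
```

This makes the trade-off in Table 4 concrete: fewer distinct symbols or a more skewed symbol distribution both lower the entropy term, and hence the bitrate.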
4.4 The zero resource speech challenge competition
We competed with other teams in the ZeroSpeech 2019 Challenge at a global scale. We use the proposed method with a small encoding dimension to achieve an extremely low bitrate, and are able to encode a whole language with fewer than 64 distinct units; human evaluation suggests that the produced speech remains acceptable (Table 3). On the Surprise dataset leaderboard, the proposed method places near the top in terms of low bitrate, while achieving a higher Mean Opinion Score (MOS) and a lower CER than the team ranked above it, as shown in Figure 3. Although the proposed approach does not achieve a high MOS, this is an inevitable trade-off with the extremely low bitrate; in Table 3 we have shown that with a larger encoding dimension we are able to generate outstanding voice-converted speech.
We proposed using multilabel-binary vectors to represent the content of human speech, as their discrete nature offers a strong extraction of speaker-independent representations. We showed that these discrete units naturally possess the ability to disentangle speech content and style, which makes them extremely suitable for voice conversion tasks. We also showed that these discrete units indeed produce better style disentanglement than ordinary settings, and finally we improved voice conversion results through the addition of residually augmented signals.
-  Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: Towards end-to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017.
-  J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan et al., “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.
-  W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep voice 3: Scaling text-to-speech with convolutional sequence learning,” arXiv preprint arXiv:1710.07654, 2017.
-  S.-F. Huang, Y.-C. Chen, H. yi Lee, and L.-S. Lee, “Improved audio embeddings by adjacency-based clustering with applications in spoken term detection,” CoRR, vol. abs/1811.02775, 2018.
-  W. He, W. Wang, and K. Livescu, “Multi-view recurrent neural acoustic word embeddings,” arXiv preprint arXiv:1611.04496, 2016.
-  S. Settle and K. Livescu, "Discriminative acoustic word embeddings: Recurrent neural network-based approaches," 2016 IEEE Spoken Language Technology Workshop (SLT), pp. 503–510, 2016.
-  Y.-A. Chung, C.-C. Wu, C.-H. Shen, H. yi Lee, and L.-S. Lee, "Audio word2vec: Unsupervised learning of audio segment representations using sequence-to-sequence autoencoder," in INTERSPEECH, 2016.
-  K. Levin, K. Henry, A. Jansen, and K. Livescu, “Fixed-dimensional acoustic embeddings of variable-length segments in low-resource settings,” in 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. IEEE, 2013, pp. 410–415.
-  W.-N. Hsu, S. Zhang, and J. R. Glass, “Unsupervised learning of disentangled and interpretable representations from sequential data,” in NIPS, 2017.
-  A. Jansen, M. Plakal, R. Pandya, D. P. W. Ellis, S. Hershey, J. Liu, R. C. Moore, and R. A. Saurous, “Unsupervised learning of semantic audio representations,” 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 126–130, 2018.
-  J.-c. Chou, C.-c. Yeh, H.-y. Lee, and L.-s. Lee, “Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations,” arXiv preprint arXiv:1804.02812, 2018.
-  C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” in 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA). IEEE, 2016, pp. 1–6.
-  Y. Gao, R. Singh, and B. Raj, “Voice impersonation using generative adversarial networks,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 2506–2510.
-  L. Badino, C. Canevari, L. Fadiga, and G. Metta, “An auto-encoder based approach to unsupervised learning of subword units,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 7634–7638.
-  C.-T. Chung, C.-Y. Tsai, H.-H. Lu, C.-H. Liu, H. yi Lee, and L.-S. Lee, "An iterative deep learning framework for unsupervised discovery of speech features and linguistic units with applications on spoken term detection," 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 245–251, 2015.
-  J. Chorowski, R. J. Weiss, S. Bengio, and A. v. d. Oord, “Unsupervised speech representation learning using wavenet autoencoders,” arXiv preprint arXiv:1901.08810, 2019.
-  A. van den Oord, O. Vinyals et al., “Neural discrete representation learning,” in Advances in Neural Information Processing Systems, 2017, pp. 6306–6315.
-  M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint arXiv:1701.07875, 2017.
-  E. Dunbar, R. Algayres, J. Karadayi, M. Bernard, J. Benjumea, X.-N. Cao, L. Miskic, C. Dugrain, L. Ondel, A. W. Black, L. Besacier, S. Sakti, and E. Dupoux, “The Zero Resource Speech Challenge 2019: TTS without T,” in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH 2019): Crossroads of Speech and Language, 2019, submitted. [Online]. Available: https://zerospeech.com/2019/
-  S. Sakti, R. Maia, S. Sakai, T. Shimizu, and S. Nakamura, “Development of hmm-based indonesian speech synthesis,” 2008 Proc. Oriental COCOSDA, pp. 215–220, November 2008.
-  S. Sakti, E. Kelana, H. Riza, S. Sakai, K. Markov, and S. Nakamura, “Development of indonesian large vocabulary continuous speech recognition system within a-star project,” 2008 Proc. Technologies and Corpora for Asia-Pacific Speech Translation (TCAST), pp. 19–24, January 2008.
-  E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,” arXiv preprint arXiv:1611.01144, 2016.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2672–2680.
-  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, “Improved training of wasserstein gans,” in Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
-  A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis with auxiliary classifier gans,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 2642–2651.
-  Z. Zhang, Y. Song, and H. Qi, “Decoupled learning for conditional adversarial networks,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 700–708.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134.
-  W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1874–1883.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  T. Schatz, V. Peddinti, F. Bach, A. Jansen, H. Hermansky, and E. Dupoux, “Evaluating speech features with the Minimal-Pair ABX task: Analysis of the classical MFC/PLP pipeline,” in INTERSPEECH 2013 : 14th Annual Conference of the International Speech Communication Association, Lyon, France, Aug. 2013, pp. 1–5.
-  L. Ondel, L. Burget, and J. Černockỳ, “Variational inference for acoustic unit discovery,” Procedia Computer Science, vol. 81, pp. 80–86, 2016.
-  Z. Wu, O. Watts, and S. King, “Merlin: An open source neural network speech synthesis system,” in 9th ISCA Speech Synthesis Workshop (2016), Sep. 2016, pp. 218–223.