Gaussian Mixture Models (GMM) are among the first successful voice conversion techniques [1, 3]. These models are trained on parallel datasets, which contain the same utterances (e.g., words or sentences) from source and target speakers. Parameter adaptation techniques can extend parallel-data trained GMMs to speakers with non-parallel data. A Long Short-Term Memory (LSTM) based many-to-one voice conversion method trained on non-parallel data has been proposed, but this method relies on an automatic speech recognition (ASR) system that is trained on labeled data.
More recently, deep-learning based methods such as Variational Auto-Encoders (VAE) and Cycle-Consistent Generative Adversarial Networks (Cycle-GAN) have been used to perform voice conversion using solely non-parallel data [6, 7, 8, 9, 10, 11]. Typically, source speaker features are extracted using a vocoder such as WORLD or STRAIGHT, selected features are converted to the target speaker’s features, and finally the target voice is built in the audio domain using the converted features. Alternatively, linear spectrograms can be used as features for voice conversion, since the converted spectrogram can be mapped back to the waveform domain using the Griffin-Lim algorithm.
In this paper, we propose a many-to-many voice conversion framework that uses speaker embeddings and a Cycle-GAN. Many-to-many conversion is defined as converting the voice of any desired speaker in a source set of speakers to the style of a target speaker (Fig. 1). A convolutional neural network (CNN) based generator is used for the conversion (Sec. 2.1). A CNN-based feature extractor, jointly trained with the Cycle-GAN, learns speaker-specific embeddings (features). Embeddings are subsequently used to condition the output of the generator to the style of the desired speaker.
One unique attribute that sets the proposed methodology apart from prior work is the ability to perform conversion between an out-of-dataset speaker and any in-dataset speaker in either direction without re-training (Fig. 1). This property is enabled by the learned speaker embeddings. Our results demonstrate that feature extractors trained on a diverse set of source speakers can generalize well and generate embeddings for speakers that were not in the training set, enabling voice conversion for out-of-dataset speakers (Sec. 4.2). In subjective tests with human listeners, the in-dataset style conversion quality of the proposed model is rated comparably to the state-of-the-art baseline model (Sec. 4.1).
2.1 Proposed Methodology
Cycle-GANs were originally proposed as a domain conversion technique for images where collecting parallel data between the desired domains is either expensive or impossible, such as converting from photographs to paintings. As an appealing parallel-data-free method, the Cycle-GAN was adapted for one-to-one and many-to-many voice conversion, where collecting time-aligned parallel data between speakers is costly [6, 7]. The proposed methodology in this paper builds on a prior Cycle-GAN-based method by adding a feature extractor block to enable out-of-dataset speaker conversion and by combining the two separate discriminators into a single block.
Fig. 2 shows the basic components of the proposed many-to-many voice converter. The converter takes two inputs: the mel-spectrogram of a genuine utterance from the source speaker, and the mel-spectrogram of a genuine utterance from the target speaker. The two utterances are unaligned and do not contain the same sentence, and potentially do not even share a single word. The feature extractor produces a speaker embedding, a latent representation of the style of the target speaker. The generator converts the source utterance using this embedding, trying to mimic the style of the target speaker while keeping the content of the source utterance. The discriminator takes in an utterance and classifies it into one of a fixed number of outputs, which is determined by the number of speakers in the training set. The generated mel-spectrogram is converted to the audio domain using the Griffin-Lim algorithm.
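The discriminator is trained to classify a genuine utterance as coming from its true speaker. In other words, the output corresponding to that speaker is maximized, and all other outputs of the discriminator are minimized (Fig. 2). Similarly, when presented with a generated (fake) utterance, the output that marks a generated utterance in the style of the target speaker is maximized. This is achieved by adjusting the parameters of the discriminator to minimize the discriminator loss.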
The discriminator loss uses two kinds of indicator variables. The first is a binary indicator that is set to one only when the input of the discriminator is a real utterance from a given speaker. The second denotes that the input is a generated utterance in the style of a given speaker. During training, the discriminator is presented with real and fake utterances for all speakers in the training set.
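The discriminator objective described above can be sketched as a softmax cross-entropy over speaker classes. The sketch below is illustrative only and rests on an assumption that the text does not confirm: that the discriminator has two outputs per training speaker, one for real and one for generated utterances. The function names and the class layout are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discriminator_target(speaker, is_real, num_speakers):
    """Index of the output that should be maximized.

    ASSUMPTION: outputs [0, num_speakers) mean "real utterance from speaker i"
    and outputs [num_speakers, 2 * num_speakers) mean "generated utterance in
    the style of speaker i". This layout is hypothetical.
    """
    return speaker if is_real else num_speakers + speaker


def discriminator_loss(logits, speaker, is_real, num_speakers):
    """Cross-entropy between the softmax output and the indicator target."""
    probs = softmax(logits)
    return -math.log(probs[discriminator_target(speaker, is_real, num_speakers)])
```

Minimizing this value raises the softmax output of the target class and, because softmax outputs sum to one, lowers all other outputs.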
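There is a single generator and a single feature extractor; together they perform all conversions. They are trained to generate utterances such that the discriminator will be fooled into labeling the generated utterance as a genuine utterance from the target speaker, i.e., the output for the genuine target speaker is maximized. In practice, this is achieved by tuning the parameters of the generator and the feature extractor to minimize the adversarial loss.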
The generator and the feature extractor are trained for all in-dataset speakers jointly and are shared across all speakers. Initially, fake utterances generated by the generator with the help of the feature extractor are of low quality, and the discriminator can easily tell the fakes from genuine utterances. As training progresses, the generator learns to mimic the target speaker better, and the discriminator learns better ways to distinguish fakes from real utterances.
Although the generator and the feature extractor might eventually learn to generate utterances that sound very much like the target speaker, the discriminator and adversarial losses do not guarantee that the content of the input utterance and the converted utterance will match. For example, the generator might generate an utterance that says “horse” and sounds very much like the target speaker, but keeps generating “horse” even if the input contains “cat” or “chicken”.
In Cycle-GAN based methods, content preservation is enforced by a cycle consistency loss. The input utterance is first converted to an utterance in the style of the target speaker. This utterance is passed through the generator a second time to generate a third utterance, which has the style of the source speaker. The output of the cycle should ideally be the same as the input utterance. This is enforced through an L1 loss between the two.
The generator and the feature extractor are trained to minimize the cycle consistency loss, with source and target speakers randomly chosen during training.
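The cycle described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the training code: `generator` and `embed` are hypothetical callables standing in for the generator and feature extractor, spectrograms are nested lists, and the source embedding is assumed to be computed from the input utterance itself.

```python
def l1_loss(a, b):
    """Mean absolute difference between two equally shaped 2-D lists."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for va, vb in zip(row_a, row_b):
            total += abs(va - vb)
            count += 1
    return total / count


def cycle_consistency_loss(generator, embed, source_utt, target_utt):
    """Convert source -> target style -> back to source style, then compare.

    ASSUMPTION: generator(spectrogram, embedding) and embed(spectrogram)
    are hypothetical stand-ins for the two trained networks.
    """
    target_embedding = embed(target_utt)
    source_embedding = embed(source_utt)
    converted = generator(source_utt, target_embedding)   # style of target
    cycled = generator(converted, source_embedding)       # back to source style
    return l1_loss(source_utt, cycled)
```

A generator that ignores the input content (the “horse” failure case) cannot reproduce the input after the second pass, so it incurs a large loss under this term.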
2.2 Comparison to Prior Work
The ability to convert between a source speaker and an out-of-dataset speaker without re-training is a key difference between the proposed methodology and all prior work in this domain [6, 7, 8, 9, 10, 11]. Converters described in [6, 9, 11] can only perform one-to-one conversion, and another prior method only performs domain conversion between female and male speakers, rather than conversion between specific speakers.
A Cycle-GAN based voice conversion method from prior work is the most closely related to our model. The proposed feature extractor block, which generates speaker embeddings in place of the one-hot training attribute vector used in that work, is the key differentiator that enables out-of-dataset speaker support. Additionally, we demonstrate voice conversion between more than 290 speakers (Sec. 4.2), a significantly more complex task than the four speakers presented in that work. Our model has a single discriminator as opposed to two separate discriminators, enabling a slightly simpler architecture. We use mel-spectrograms for conversion instead of features generated by a separate vocoder (Sec. 3.1).
3 Implementation Details
3.1 Data Preprocessing
Our model uses mel-spectrograms computed with Librosa, with parameter values of 512, 32, 128, 40 and 7900. Per-frequency scaling is performed for each speaker in five steps: (1) take a random subset of the audio files for the speaker and clip silence from the selected files; (2) compute the log-magnitude mel-spectrograms; (3) compute the histogram for each frequency bin; (4) take a fixed percentile value for each frequency bin as the maximum allowed value for that bin, and choose a corresponding minimum allowed value; (5) clip all values in the spectrogram to this range and scale them to a fixed interval. During training, the pre-computed maximum and minimum values for each speaker are used to scale the log-magnitude mel-spectrograms.
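The per-speaker scaling steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' preprocessing code: the percentile `q`, the offset `dynamic_range` used to derive the minimum from the maximum, and the output interval [0, 1] are all hypothetical choices, and the percentile is read directly from sorted values rather than from an explicit histogram.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a list, with q given in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def fit_bin_limits(log_mels, q, dynamic_range):
    """Per-frequency-bin (minimum, maximum) limits for one speaker.

    log_mels: list of spectrograms, each indexed as [freq_bin][frame].
    ASSUMPTION: minimum = maximum - dynamic_range (hypothetical rule).
    """
    limits = []
    for bin_index in range(len(log_mels[0])):
        values = [v for spec in log_mels for v in spec[bin_index]]
        upper = percentile(values, q)
        limits.append((upper - dynamic_range, upper))
    return limits


def scale_spectrogram(spec, limits):
    """Clip each bin to its limits and map it to [0, 1] (assumed interval)."""
    scaled = []
    for row, (lower, upper) in zip(spec, limits):
        scaled.append([(min(max(v, lower), upper) - lower) / (upper - lower)
                       for v in row])
    return scaled
```

Computing the limits once per speaker and reusing them during training keeps the scaling consistent across all utterances of that speaker.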
3.2 Datasets
In Sec. 4.1, we use four speakers from the Voice Conversion Challenge 2018 dataset to compare to the baseline for in-dataset speakers. In Sec. 4.2, training is performed on the Librispeech train-clean-100 dataset for 251 speakers, with approximately 25 minutes of total utterances per speaker. Out-of-dataset speakers are chosen among the 40 speakers in the dev-clean dataset. Transcriptions are not used in either case.
3.3 Network Architecture
The architecture of the many-to-many conversion path is given in Fig. 3. Both the feature extractor and the generator are CNNs with 2D convolutions. Each unit (e.g., L1) consists of two convolutional layers, with the desired downsampling performed at the second layer of the unit. The feature extractor converts the mel-spectrogram of an utterance from a speaker to the embedding specific to that speaker. An input receptive field with dimensions ordered as (channels, frequency bins, time steps) is mapped to an embedding in the same (CHW) ordering. Downsampling is performed by strided convolutions, and instance normalization layers are used to aid training. Layers F1, F2, L1 and L2 use gated convolutional units.
The downsampling path of the generator shares the same CNN architecture as the feature extractor. The generator combines the mel-spectrogram of an utterance from the source speaker with the embedding of the target speaker to generate the mel-spectrogram of an utterance with the style of the target speaker and the content of the source utterance. The embedding is concatenated to the latent representation of the source utterance in the bottleneck layer. The F3 layer output from the feature extractor is mean pooled along the time axis, repeated to match the dimension of the first upsampling layer (L5) output, and then concatenated to the output of L5, resembling a U-Net architecture. Similarly, the F4 output is mean pooled and repeated to match the output of L4. Upsampling is performed by transposed and sub-pixel convolutions.
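The pool, repeat and concatenate step for the skip connections can be sketched on nested lists. This is only an illustration of the data flow: tensors are indexed as [channel][frequency][time], the frequency sizes of both inputs are assumed to match already, and the function names are hypothetical.

```python
def mean_pool_time(features):
    """Average over the time axis: [channel][freq][time] -> [channel][freq]."""
    return [[sum(row) / len(row) for row in channel] for channel in features]


def repeat_time(pooled, steps):
    """Repeat pooled values along a new time axis of length `steps`."""
    return [[[value] * steps for value in channel] for channel in pooled]


def concat_channels(a, b):
    """Concatenate two [channel][freq][time] tensors along the channel axis."""
    return a + b


def style_skip_connection(upsampled, extractor_features):
    """Attach time-pooled extractor features to an upsampling layer output.

    ASSUMPTION: both inputs share the same number of frequency rows.
    """
    steps = len(upsampled[0][0])
    style = repeat_time(mean_pool_time(extractor_features), steps)
    return concat_channels(upsampled, style)
```

Pooling over time removes the time-varying content of the reference utterance, so only a time-invariant summary of the style is passed to the upsampling path.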
The discriminator is implemented as a collection of three CNNs. All three look at 128 frequency bins of the input spectrogram with progressively wider time patches (32, 64 and 128 timesteps). Each has a linear layer at the end, followed by a softmax activation over the outputs described in Sec. 2.1. Since silence does not contain speaker-identifying information, patches with signal power below a threshold are not sent to the discriminators.
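The silence gating for discriminator patches can be sketched as below. This is a hypothetical illustration: the text does not say how patches are positioned or how power is measured, so the sketch assumes non-overlapping patches and uses the mean squared spectrogram value as a stand-in for signal power.

```python
def patch_power(spec, start, width):
    """Mean squared value of a time patch; spec is indexed [freq_bin][frame]."""
    values = [v * v for row in spec for v in row[start:start + width]]
    return sum(values) / len(values)


def select_patches(spec, width, threshold):
    """Start frames of non-overlapping patches that are loud enough.

    ASSUMPTION: non-overlapping patches and a mean-square power measure are
    illustrative choices, not details taken from the paper.
    """
    num_frames = len(spec[0])
    kept = []
    for start in range(0, num_frames - width + 1, width):
        if patch_power(spec, start, width) >= threshold:
            kept.append(start)
    return kept
```

In the setting described above, this selection would run once per patch width (32, 64 and 128 timesteps), with each resulting patch sent to the CNN that handles that width.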
4.1 Comparison to Baseline
The in-dataset conversion quality of the proposed methodology is compared to the most closely related prior work, a state-of-the-art non-parallel converter based on Cycle-GANs. We were unable to locate the source code of this work, and open-source versions from third parties [23, 24] did not match the quality of the 12 conversion samples given by the original authors. The published samples are from four unique (source, target) speaker pairs, with three audio clips from each pair. For a fair comparison, we use these published samples to evaluate the naturalness and style conversion quality of our method using subjective tests on human listeners. The four chosen pairs are (SM1→SF1), (SM2→SM1), (SF1→SF2), and (SF2→SM2).
4.1.1 Mean Opinion Score
We evaluate the naturalness of the converted samples with the Mean Opinion Score (MOS) test. In each test, listeners are given a single audio file from one of three categories: (1) a ground truth audio file from one of the four speakers in the dataset (SF1, SF2, SM1, SM2), (2) an audio conversion output by the baseline, or (3) an audio conversion output by the proposed model. Listeners are asked “How natural is the speech in this audio clip?” and given five choices on a Likert scale: unnatural, somewhat unnatural, indifferent, somewhat natural, natural. Choices are mapped to integers from 1 (unnatural) to 5 (natural). The test is repeated using audio files from all four speakers and conversions. Fig. 4 compares the MOS ratings of the baseline and proposed conversions to no conversion (ground truth audio files). Listeners rated the naturalness of the ground truth audio files almost at the “natural” level. Both conversion models are rated substantially below perfect conversion: the proposed model is rated slightly below “somewhat unnatural”, whereas the baseline is slightly above. We surmise that the lower MOS of the proposed model is due to the artefacts introduced by the Griffin-Lim algorithm when mel-spectrograms are rebuilt in the raw audio domain. The baseline model uses a vocoder for audio reconstruction, which might introduce fewer artefacts.
4.1.2 Style Conversion Quality
We evaluate the style conversion quality by comparing the perceived similarity of conversion outputs to ground truth audio files from the intended target. Listeners are given two utterances: (1) a ground truth utterance from one of the four speakers in the dataset (e.g., SF1), and (2) either another ground truth sample from this speaker (SF1), an utterance converted to the style of this speaker by the baseline (output of the SM1→SF1 conversion), or the same conversion performed by the proposed model. Listeners are asked “How likely is it that these two utterances are from the same speaker?” and given five choices on a Likert scale: unlikely, somewhat unlikely, neutral, somewhat likely, likely. Choices are mapped to integers from 1 (unlikely) to 5 (likely). The test is repeated with all four speaker pairs and with both conversion models. Comparing the perceived similarity of conversions to ground truth targets on a Likert scale gives more insight into the conversion models’ capability when both models perform similarly well or poorly in the conversion. Fig. 5 shows the style conversion comparison results. The proposed model performs similarly to the baseline in style conversions. The biggest differences between the baseline and the proposed model are in cases where even perfect conversion (a second ground truth sample from the target speaker) is rated relatively low. This might indicate inherent style variability for the affected speakers in the dataset, resulting in lower scores for style conversion.
4.2 Out-of-Dataset Conversion Quality
The results of Sec. 4.1 demonstrate that the proposed model performs reasonably well in both naturalness and style conversion for in-dataset speakers. To evaluate out-of-dataset conversion quality, the proposed model is trained on the 251 speakers in train-clean-100 (Sec. 3.2). Speakers in dev-clean (40 in total) are not used for training and are set aside to evaluate the out-of-dataset conversion capability of the trained model. We are unable to compare out-of-dataset conversions to the baseline since the baseline cannot perform such conversions.
Subjective evaluation of conversions between 291 speakers presents a scalability problem. One could pick a few utterances from a source speaker and convert them to an in-dataset and an out-of-dataset target speaker, but subjective test results on such a small set would be highly skewed by the choice of source and target speakers. For a more robust evaluation, one could pick a set of source speakers, with several utterances from each source, and convert to a set of both in-dataset and out-of-dataset targets. This can easily lead to thousands of conversions, each evaluated by several human listeners. Since no prior baseline can perform out-of-dataset conversions, the test would need to be repeated with ground truth utterances for comparison. The time and monetary costs of such testing are prohibitive.
Given these challenges, we opted to use a more scalable method to compare the out-of-dataset target conversion quality to in-dataset target conversions. We trained an i-Vector based speaker identification (SID) model as described in [26, 27]. MFCC parameter values are set as follows: 16 kHz, 25 ms, 10 ms, 40, 7800, 20, 60, 7200. The SID model is trained on all 291 speakers from the train-clean-100 and dev-clean datasets (Sec. 3.2), with utterances split into training and evaluation sets. The SID model reaches 89.5% top-1 accuracy for utterances in the evaluation set (second to last row in Table 1). For reference, the last row in Table 1 shows the accuracy if the SID model performed random guesses. The conversion model has no knowledge of the SID model and is not specifically trained to perform adversarial attacks on it.
We randomly picked 8 source speakers from train-clean-100 (in-dataset) and 8 from dev-clean (out-of-dataset). Ten utterances per source speaker are converted to the style of 32 target speakers: 16 in-dataset and 16 out-of-dataset (a total of 5120 conversions). Target speakers are split equally between female and male.
Table 1 reports the average top-K accuracy for a converted audio clip to be classified as uttered by the conversion target. The conversion output is rebuilt in the raw audio domain using Griffin-Lim and presented to the SID model, and the prediction scores for the 291 speakers are sorted. Top-K accuracy measures whether the conversion target’s score is among the K highest-ranked scores.
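The top-K measure can be sketched as follows. The function names and the score layout (one score per speaker, higher meaning more likely) are assumptions made for illustration, and ties are broken by speaker index.

```python
def top_k_hit(scores, target, k):
    """True if `target` is among the k highest-scoring speaker indices."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return target in ranked[:k]


def top_k_accuracy(all_scores, targets, k):
    """Fraction of clips whose intended target is ranked within the top k."""
    hits = sum(1 for scores, target in zip(all_scores, targets)
               if top_k_hit(scores, target, k))
    return hits / len(targets)
```

Here `all_scores` would hold one list of 291 SID scores per converted clip, and `targets` the index of the intended conversion target for each clip.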
The first row in Table 1 shows the average classification accuracy of conversions for the 16 in-dataset target speakers. When presented to the SID system, 9.7% of the converted utterances are predicted to come from the intended target (among 291 potential target speakers). The average accuracy for out-of-dataset targets is lower at 4.8%, but significantly higher than random chance (0.3%), demonstrating that the feature extractor can generate reasonable embeddings (style) for new speakers.
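The chance level and the size of the evaluation set follow from simple arithmetic, sketched below. The helper names are hypothetical, and the chance level assumes that a random guess ranks all speakers uniformly at random.

```python
NUM_TRAIN_SPEAKERS = 251     # train-clean-100
NUM_HELD_OUT_SPEAKERS = 40   # dev-clean
NUM_SID_SPEAKERS = NUM_TRAIN_SPEAKERS + NUM_HELD_OUT_SPEAKERS


def chance_top_k(k, num_speakers):
    """Top-k accuracy of a uniformly random ranking over num_speakers."""
    return min(k, num_speakers) / num_speakers


def num_conversions(num_sources, utterances_per_source, num_targets):
    """Every utterance of every source is converted to every target."""
    return num_sources * utterances_per_source * num_targets


chance_top1_percent = 100 * chance_top_k(1, NUM_SID_SPEAKERS)
total_conversions = num_conversions(8 + 8, 10, 16 + 16)
```

With 291 speakers the top-1 chance level is 1/291, which rounds to 0.3%, and 16 sources with 10 utterances each converted to 32 targets gives the 5120 conversions reported above.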
Table 2 reports more granular results based on the in-dataset or out-of-dataset status of speakers, as well as the target speakers’ gender. Out-of-dataset source speakers have only slightly lower accuracy. This is likely because the downsampling path of the generator learns to discard the style of the source speaker while keeping the content; hence out-of-dataset source speakers’ styles do not significantly impact the conversion.
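Table 1 (excerpt): top-K accuracy in %, with K increasing from left to right.
| SID Eval. Set | 89.5 | 95.3 | 96.5 | 98.0 | 98.9 |
Table 2 (excerpt): source speaker status, target speaker gender and status, and top-K accuracy in %.
| In Dataset | Female, In | 10.0 | 20.6 | 28.1 | 40.8 | 57.5 |
| Out of Dataset | Female, In | 7.2 | 17.2 | 25.2 | 36.6 | 52.2 |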
In this paper, we describe a non-parallel, almost unsupervised, Cycle-GAN based voice conversion method that can perform conversions between speakers that the model was never trained on. This is enabled by a unique feature extractor block that produces speaker embeddings for new speakers. Subjective tests show that style conversion quality is comparable to the state of the art, which can only perform in-dataset conversions. Out-of-dataset conversion quality of the proposed model is compared to the in-dataset quality using a quantitative method based on an independently trained speaker identification model. Future work includes improving the naturalness of the converted speech by employing a vocoder, and improving out-of-dataset conversion quality by training on a larger set of speakers to enhance the generalization capability of the feature extractor.
- [1] Y. Stylianou, O. Cappé, and E. Moulines, “Continuous probabilistic transform for voice conversion,” IEEE Trans. Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, 1998.
- [2] Q. Jin, A. R. Toth, T. Schultz, and A. W. Black, “Voice convergin: Speaker de-identification by voice transformation,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2009, 19-24 April 2009, Taipei, Taiwan. IEEE, 2009, pp. 3909–3912.
- [3] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Trans. Audio, Speech & Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007.
- [4] A. Mouchtaris, J. V. der Spiegel, and P. Mueller, “Nonparallel training for voice conversion based on a parameter adaptation approach,” IEEE Trans. Audio, Speech & Language Processing, vol. 14, no. 3, pp. 952–963, 2006.
- [5] L. Sun, K. Li, H. Wang, S. Kang, and H. M. Meng, “Phonetic posteriorgrams for many-to-one voice conversion without parallel data training,” in IEEE International Conference on Multimedia and Expo, ICME 2016, Seattle, WA, USA, July 11-15, 2016. IEEE Computer Society, 2016, pp. 1–6.
- [6] T. Kaneko and H. Kameoka, “Parallel-data-free voice conversion using cycle-consistent adversarial networks,” 2017.
- [7] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “Stargan-vc: Non-parallel many-to-many voice conversion with star generative adversarial networks,” 2018.
- [8] E. Hosseini-Asl, Y. Zhou, C. Xiong, and R. Socher, “A multi-discriminator cyclegan for unsupervised non-parallel speech domain adaptation,” in Interspeech 2018, 19th Annual Conference of the International Speech Communication Association, Hyderabad, India, 2-6 September 2018, B. Yegnanarayana, Ed. ISCA, 2018, pp. 3758–3762.
- [9] F. Fang, J. Yamagishi, I. Echizen, and J. Lorenzo-Trueba, “High-quality nonparallel voice conversion based on cycle-consistent adversarial network,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15-20, 2018. IEEE, 2018, pp. 5279–5283.
- [10] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “Acvae-vc: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder,” 2018.
- [11] C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, “Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks,” in Interspeech 2017, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, August 20-24, 2017, F. Lacerda, Ed. ISCA, 2017, pp. 3364–3368.
- [12] M. Morise, F. Yokomori, and K. Ozawa, “World: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. E99.D, no. 7, pp. 1877–1884, 2016.
- [13] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based f0 extraction: Possible role of a repetitive structure in sounds,” Speech Commun., vol. 27, no. 3-4, pp. 187–207, Apr. 1999.
- [14] D. W. Griffin and J. S. Lim, “Signal estimation from modified short-time fourier transform,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP ’83, Boston, Massachusetts, USA, April 14-16, 1983. IEEE, 1983, pp. 804–807.
- [15] J. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017. IEEE Computer Society, 2017, pp. 2242–2251.
- [16] “Librosa: Python library for audio and music analysis (version 0.5.1),” https://github.com/librosa/librosa.
- [17] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, “The voice conversion challenge 2018: database and results.” [Online]. Available: https://doi.org/10.7488/ds/2337
- [18] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015. IEEE, 2015, pp. 5206–5210.
- [19] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” 2016.
- [20] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017, pp. 933–941.
- [21] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015 - 18th International Conference Munich, Germany, October 5 - 9, 2015, Proceedings, Part III, ser. Lecture Notes in Computer Science, N. Navab, J. Hornegger, W. M. W. III, and A. F. Frangi, Eds., vol. 9351. Springer, 2015, pp. 234–241.
- [22] A. Aitken, C. Ledig, L. Theis, J. Caballero, Z. Wang, and W. Shi, “Checkerboard artifact free sub-pixel convolution: A note on sub-pixel convolution, resize convolution and convolution resize,” 2017.
- [23] S. Liu, “Stargan voice conversion.” GitHub. [Online]. Available: https://github.com/liusongxiang/StarGAN-Voice-Conversion
- [24] H. Sen, “Pytorch stargan vc.” GitHub. [Online]. Available: https://github.com/hujinsen/pytorch-StarGAN-VC
- [25] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “Stargan-vc: Non-parallel many-to-many voice conversion with star generative adversarial networks.” [Online]. Available: http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/Demos/stargan-vc/
- [26] J. Meyer, “Josh’s speaker id challenge.” GitHub. [Online]. Available: http://jrmeyer.github.io/asr/2017/09/29/challenge.html
- [27] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The kaldi speech recognition toolkit,” in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011, IEEE Catalog No.: CFP11SRW-USB.