Voice conversion (VC) is a technique for modifying non/para-linguistic information of speech while preserving linguistic information. This technique can be applied to various tasks such as speaker-identity modification for text-to-speech (TTS) systems, speaking assistance [2, 3], speech enhancement [4, 5, 6], and pronunciation conversion.
Voice conversion can be formulated as a regression problem of estimating a mapping function from source to target speech. One successful approach involves statistical methods using a Gaussian mixture model (GMM) [8, 9, 10]. Neural network (NN)-based methods, such as a restricted Boltzmann machine (RBM) [11, 12], feed forward NN [13, 14], recurrent NN (RNN) [15, 16], and convolutional NN (CNN), and exemplar-based methods, such as non-negative matrix factorization (NMF) [17, 18], have also recently been proposed.
Many VC methods, including those mentioned above, typically use temporally aligned parallel data of source and target speech as training data. If perfectly aligned parallel data are available, obtaining the mapping function becomes relatively simple; however, collecting such data can be a painstaking process in real application scenarios. Even if we could collect such data, we would need to perform automatic time alignment, which may occasionally fail. This can be problematic since misalignment in parallel data can cause speech-quality degradation; thus, careful pre-screening and manual correction may be required.
These facts motivated us to consider a VC problem that is free from parallel data. In this paper, we propose a parallel-data-free VC method, which is particularly noteworthy in that it (1) does not require any extra data, such as transcripts and reference speech, or extra modules, such as an automatic speech-recognition (ASR) module, (2) is not prone to over-smoothing, which is known to be one of the main factors leading to speech-quality degradation, and (3) captures a spectrotemporal structure without any alignment procedure.
To satisfy these requirements, our method, called CycleGAN-VC, uses a cycle-consistent adversarial network (CycleGAN) (i.e., DiscoGAN or DualGAN) with gated CNNs and an identity-mapping loss. The CycleGAN was originally proposed for unpaired image-to-image translation. With this model, forward and inverse mappings are simultaneously learned using an adversarial loss and cycle-consistency loss. This makes it possible to find an optimal pseudo pair from unpaired data. Furthermore, the adversarial loss does not require explicit density estimation, which reduces the over-smoothing effect [7, 27, 28, 29]. To use a CycleGAN for parallel-data-free VC, we configure a network using gated CNNs and train it with an identity-mapping loss. This allows the mapping function to capture sequential and hierarchical structures while preserving linguistic information.
We evaluated our method on a parallel-data-free VC task using the Voice Conversion Challenge 2016 (VCC 2016) dataset. An objective evaluation showed that the converted feature sequence was reasonably good in terms of global variance (GV) and modulation spectra (MS). A subjective evaluation showed that the speech quality was comparable to that obtained with a GMM-based method trained using parallel data and twice the amount of data. This is noteworthy since our method was trained under a disadvantageous condition.
2 Related work
Recently, several approaches for parallel-data-free VC have been proposed. One approach involves using an ASR module to find a pair of corresponding frames [32, 33]. This may work well if ASR performs robustly and accurately enough, but it requires a large amount of transcripts to train the ASR module. It would also be inherently difficult to capture nonverbal information with such a module. These requirements may limit the applicability of this approach in general situations. Other approaches involve methods using an adaptation technique [34, 35] or incorporating a pre-constructed speaker space [36, 37]. These methods do not require parallel data between source and target speakers but require parallel data among reference speakers. A few attempts [38, 39, 40, 41] have recently been made to develop methods that are completely free from parallel data and extra modules. With these methods, it is assumed that source and target speech lie in the same low-dimensional embeddings. This would not only limit applications but also cause difficulty in modeling complex structures, e.g., detailed spectrotemporal structure. In contrast, we learn a mapping function directly without embedding. We expect that this would make it possible to apply our method to various applications where complex structure modeling needs to be considered.
3 Parallel-data-free VC using CycleGAN
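3.1 CycleGAN
Our goal is to learn a mapping from source $x \in X$ to target $y \in Y$ without relying on parallel data. We solve this problem based on a CycleGAN. In this subsection, we briefly review the concept of CycleGAN and in the next subsection, we explain our proposed method for parallel-data-free VC.
Adversarial loss: An adversarial loss measures how distinguishable converted data $G_{X \rightarrow Y}(x)$ are from target data $y$. Hence, the closer the distribution of converted data $P_{G_{X \rightarrow Y}}(x)$ becomes to that of target data $P_{Data}(y)$, the smaller this loss becomes. This objective is written as
$$\mathcal{L}_{adv}(G_{X \rightarrow Y}, D_Y) = \mathbb{E}_{y \sim P_{Data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim P_{Data}(x)}[\log(1 - D_Y(G_{X \rightarrow Y}(x)))]. \tag{1}$$
The generator $G_{X \rightarrow Y}$ attempts to generate data indistinguishable from target data $y$ by the discriminator $D_Y$ by minimizing this loss, whereas $D_Y$ attempts not to be deceived by $G_{X \rightarrow Y}$ by maximizing this loss.
Cycle-consistency loss: Optimizing only the adversarial loss would not necessarily guarantee that the contextual information of $x$ and $G_{X \rightarrow Y}(x)$ will be consistent. This is because the adversarial loss only tells us whether $G_{X \rightarrow Y}(x)$ follows the target-data distribution and does not help preserve the contextual information of $x$. The idea of CycleGAN is to introduce two additional terms. One is an adversarial loss $\mathcal{L}_{adv}(G_{Y \rightarrow X}, D_X)$ for an inverse mapping $G_{Y \rightarrow X}$ and the other is a cycle-consistency loss, given as
$$\mathcal{L}_{cyc}(G_{X \rightarrow Y}, G_{Y \rightarrow X}) = \mathbb{E}_{x \sim P_{Data}(x)}[\| G_{Y \rightarrow X}(G_{X \rightarrow Y}(x)) - x \|_1] + \mathbb{E}_{y \sim P_{Data}(y)}[\| G_{X \rightarrow Y}(G_{Y \rightarrow X}(y)) - y \|_1]. \tag{2}$$
These additional terms encourage $G_{X \rightarrow Y}$ and $G_{Y \rightarrow X}$ to find $(x, y)$ pairs with the same contextual information.
The full objective is written with trade-off parameter $\lambda_{cyc}$:
$$\mathcal{L}_{full} = \mathcal{L}_{adv}(G_{X \rightarrow Y}, D_Y) + \mathcal{L}_{adv}(G_{Y \rightarrow X}, D_X) + \lambda_{cyc} \mathcal{L}_{cyc}(G_{X \rightarrow Y}, G_{Y \rightarrow X}). \tag{3}$$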
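The way the adversarial and cycle-consistency terms combine can be illustrated with a minimal sketch. It assumes the standard CycleGAN formulation, with generators and discriminators passed in as plain Python callables on lists of floats; the function names, the toy generators, and the value of `lambda_cyc` below are illustrative assumptions, not the networks or settings used in the experiments.

```python
import math

def l1_norm_diff(a, b):
    """L1 norm of the difference between two equal-length feature vectors."""
    return sum(abs(p - q) for p, q in zip(a, b))

def adversarial_loss(disc, gen, sources, targets):
    """Sample estimate of E[log D(target)] + E[log(1 - D(G(source)))]."""
    real = sum(math.log(disc(t)) for t in targets) / len(targets)
    fake = sum(math.log(1.0 - disc(gen(s))) for s in sources) / len(sources)
    return real + fake

def cycle_consistency_loss(g_xy, g_yx, xs, ys):
    """L1 error after mapping forward then back, in both directions."""
    forward = sum(l1_norm_diff(g_yx(g_xy(x)), x) for x in xs) / len(xs)
    backward = sum(l1_norm_diff(g_xy(g_yx(y)), y) for y in ys) / len(ys)
    return forward + backward

def full_objective(g_xy, g_yx, d_x, d_y, xs, ys, lambda_cyc):
    """Two adversarial terms plus the weighted cycle-consistency term."""
    return (adversarial_loss(d_y, g_xy, xs, ys)
            + adversarial_loss(d_x, g_yx, ys, xs)
            + lambda_cyc * cycle_consistency_loss(g_xy, g_yx, xs, ys))

# Toy stand-ins (assumptions): the forward generator shifts features by +1,
# the inverse generator shifts them back, and the discriminators are undecided.
g_xy = lambda x: [v + 1.0 for v in x]
g_yx = lambda y: [v - 1.0 for v in y]
undecided = lambda v: 0.5

xs = [[0.0, 1.0], [2.0, 3.0]]
ys = [[1.0, 2.0]]
cyc = cycle_consistency_loss(g_xy, g_yx, xs, ys)
total = full_objective(g_xy, g_yx, undecided, undecided, xs, ys, lambda_cyc=10.0)
```

Because the toy inverse generator exactly undoes the forward one, the cycle-consistency term vanishes here and only the two adversarial terms contribute to the total.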
3.2 CycleGAN for parallel-data-free VC: CycleGAN-VC
Gated CNN: One of the characteristics of speech is that it has sequential and hierarchical structures, e.g., voiced/unvoiced segments and phonemes/morphemes. An effective way to represent such structures would be to use an RNN, but it is computationally demanding due to the difficulty of parallel implementations. Instead, we configure a CycleGAN using gated CNNs, which not only allow parallelization over sequential data but also achieve state-of-the-art performance in language modeling and speech modeling. In a gated CNN, gated linear units (GLUs) are used as an activation function. A GLU is a data-driven activation function, and the $(l+1)$-th layer output $H_{l+1}$ is calculated using the $l$-th layer output $H_l$ and model parameters $W_l$, $V_l$, $b_l$, and $c_l$,
$$H_{l+1} = (H_l * W_l + b_l) \otimes \sigma(H_l * V_l + c_l), \tag{4}$$
where $\otimes$ is the element-wise product and $\sigma$ is the sigmoid function. This gated mechanism allows the information to be selectively propagated depending on the previous layer states.
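The gating mechanism can be sketched for the simplest possible case: a single-channel sequence and a small 1D kernel. The helper names and the toy kernels are assumptions for illustration; a real gated CNN layer operates on multi-channel feature maps.

```python
import math

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))

def conv1d(seq, kernel, bias):
    """Valid 1D cross-correlation of a scalar sequence with a kernel, plus a bias."""
    k = len(kernel)
    return [sum(seq[t + i] * kernel[i] for i in range(k)) + bias
            for t in range(len(seq) - k + 1)]

def glu_layer(h, w, b, v, c):
    """Gated linear unit: a linear path multiplied element-wise by a sigmoid gate."""
    linear = conv1d(h, w, b)
    gate = [sigmoid(g) for g in conv1d(h, v, c)]
    return [a * g for a, g in zip(linear, gate)]

# With a zero gate kernel and zero gate bias, every gate value is sigmoid(0) = 0.5,
# so the layer passes exactly half of the linear path.
out = glu_layer([1.0, 2.0, 3.0], w=[1.0, 1.0], b=0.0, v=[0.0, 0.0], c=0.0)
```

Unlike a ReLU, whose on/off decision depends only on the sign of its own input, the gate here is computed by a second learned convolution, which is what makes the activation data-driven.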
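Identity-mapping loss: A cycle-consistency loss provides constraints on a structure; however, it would not suffice to guarantee that the mappings always preserve linguistic information. To encourage linguistic-information preservation without relying on extra modules, we incorporate an identity-mapping loss,
$$\mathcal{L}_{id}(G_{X \rightarrow Y}, G_{Y \rightarrow X}) = \mathbb{E}_{y \sim P_{Data}(y)}[\| G_{X \rightarrow Y}(y) - y \|_1] + \mathbb{E}_{x \sim P_{Data}(x)}[\| G_{Y \rightarrow X}(x) - x \|_1], \tag{5}$$
where $G_{X \rightarrow Y}$ and $G_{Y \rightarrow X}$ are the forward (source-to-target) and inverse generators. This loss encourages each generator to find the mapping that preserves composition between the input and output. In practice, weighted loss $\mathcal{L}_{id}$ with trade-off parameter $\lambda_{id}$ is added to the full objective (Eq. 3). Note that the original study on CycleGANs showed the effectiveness of this loss for color preservation.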
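As a minimal sketch of the identity-mapping loss, each generator can be fed samples that already belong to its output domain and penalized for changing them. The generators are plain callables on lists of floats, and all names below are illustrative assumptions.

```python
def l1_norm_diff(a, b):
    """L1 norm of the difference between two equal-length feature vectors."""
    return sum(abs(p - q) for p, q in zip(a, b))

def identity_mapping_loss(g_xy, g_yx, xs, ys):
    """Penalize each generator for altering inputs already in its output domain."""
    # The source-to-target generator receives target samples ys.
    target_term = sum(l1_norm_diff(g_xy(y), y) for y in ys) / len(ys)
    # The target-to-source generator receives source samples xs.
    source_term = sum(l1_norm_diff(g_yx(x), x) for x in xs) / len(xs)
    return target_term + source_term

# Toy generators (assumptions): one shifts every feature by +1, one is the identity.
shift = lambda v: [e + 1.0 for e in v]
identity = lambda v: list(v)

xs = [[0.0, 0.0]]
ys = [[1.0, 2.0], [3.0, 4.0]]
loss_with_shift = identity_mapping_loss(shift, identity, xs, ys)
loss_all_identity = identity_mapping_loss(identity, identity, xs, ys)
```

The shifting generator changes each two-dimensional target vector by 1 per dimension, so it is penalized, whereas a generator that leaves in-domain inputs untouched incurs no loss.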
4.1 Experimental conditions
We conducted experiments to evaluate our method on a parallel-data-free VC task. We used the VCC 2016 dataset, which was recorded by professional US English speakers, including five females and five males. Following a previous study, we used a subset of speakers for evaluation. A pair of female (SF1) and male (SM1) speakers were selected as sources and another pair (TF2 and TM3) were selected as targets. The audio files for each speaker were manually segmented into 216 short parallel sentences (about 13 minutes). Among them, 162 and 54 sentences were provided as training and evaluation sets, respectively. To evaluate our method under a parallel-data-free condition, we divided the training set into two subsets without overlap. The first 81 sentences were used for the source and the other 81 sentences were used for the target. The speech data were downsampled to 16 kHz, and 24 Mel-cepstral coefficients (MCEPs), logarithmic fundamental frequency (log F0), and aperiodicities (APs) were then extracted every 5 ms using the WORLD analysis system. Among these features, we learned a mapping in the MCEP domain using our method. The F0 was converted using logarithm Gaussian normalized transformation. Aperiodicities were directly used without modification because a previous study showed that converting APs does not significantly affect speech quality.
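The logarithm Gaussian normalized transformation of F0 is commonly implemented by matching the mean and standard deviation of log F0 between speakers. The following is a minimal sketch of that common form, assuming unvoiced frames are marked with an F0 of zero and are passed through unchanged; the function names and the toy F0 values are illustrative, not taken from the experiments.

```python
import math

def log_f0_stats(f0_values):
    """Mean and standard deviation of log F0 over voiced frames (F0 > 0)."""
    logs = [math.log(f) for f in f0_values if f > 0]
    mean = sum(logs) / len(logs)
    variance = sum((v - mean) ** 2 for v in logs) / len(logs)
    return mean, math.sqrt(variance)

def convert_f0(f0_values, src_stats, tgt_stats):
    """Shift and scale log F0 from source statistics to target statistics."""
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    converted = []
    for f in f0_values:
        if f > 0:
            converted.append(math.exp((math.log(f) - mu_s) / sd_s * sd_t + mu_t))
        else:
            converted.append(0.0)  # assumption: zero marks an unvoiced frame
    return converted

# Toy contours in Hz (assumptions): the target speaker sits one octave higher.
source_f0 = [100.0, 200.0, 0.0]
src_stats = log_f0_stats(source_f0)
tgt_stats = log_f0_stats([200.0, 400.0])
converted_f0 = convert_f0(source_f0, src_stats, tgt_stats)
```

Since both toy speakers have the same log-F0 spread and differ only by an octave in mean, the conversion simply doubles every voiced F0 value in this example.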
Implementation details: We designed a network based on the recent success in image modeling [20, 46, 47] and speech modeling [7, 27]. The network architecture is illustrated in Fig. 2. We designed the generator using a one-dimensional (1D) CNN to capture the relationship among the overall features while preserving the temporal structure. Inspired by a previous study for neural style transfer and super-resolution, we used a network that included downsampling, residual, and upsampling layers, and that incorporated instance normalization. We used a pixel shuffler for upsampling, which is effective for high-resolution image generation. We designed the discriminator using a 2D CNN to focus on a 2D spectral texture.
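A pixel shuffler upsamples by rearranging channels into resolution rather than by interpolating. The sketch below shows a 1D analogue on nested lists of shape (channels, time); the channel-to-time ordering follows the usual sub-pixel convolution convention and is an assumption about the layout, as is the function name.

```python
def pixel_shuffle_1d(x, upscale):
    """Rearrange a (C * r, T) feature map into a (C, T * r) feature map."""
    channels_in = len(x)
    length = len(x[0])
    if channels_in % upscale != 0:
        raise ValueError("channel count must be divisible by the upscale factor")
    channels_out = channels_in // upscale
    out = [[0.0] * (length * upscale) for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(upscale):
            for t in range(length):
                # Channel c * r + i supplies the i-th sub-step of each time step.
                out[c][t * upscale + i] = x[c * upscale + i][t]
    return out

# Two input channels of length 2 become one output channel of length 4.
shuffled = pixel_shuffle_1d([[1.0, 2.0], [3.0, 4.0]], upscale=2)
```

The preceding convolution is expected to produce `upscale` times as many channels as needed, so that the rearrangement doubles (or more) the temporal resolution without discarding any values.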
Table 1: RMSE between target and converted logarithmic MS.
| CycleGAN-VC w/ GLU | 1.98 | 2.69 | 1.93 | 2.14 |
| CycleGAN-VC w/o GLU | 3.34 | 2.99 | 3.17 | 2.94 |
| GMM-VC w/ GV | 7.59 | 9.41 | 8.69 | 9.67 |
| GMM-VC w/o GV | 13.56 | 14.90 | 14.17 | 14.53 |
Training details: As a pre-process, we normalized the source and target MCEPs per dimension. To stabilize training, we used a least squares GAN, which replaces the negative log likelihood objective in the adversarial loss by a least squares loss. We set . We used the identity-mapping loss only for the first iterations with to guide the learning process. To increase the randomness of each batch, we did not use a sequence directly and instead cropped a fixed-length segment (128 frames) randomly from a randomly selected audio file. We trained the network using the Adam optimizer with a batch size of . We set the initial learning rates to for the generator and for the discriminator. We kept the same learning rate for the first iterations and linearly decayed it over the next iterations. We set the momentum term to .
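Three of the training ingredients described here (random fixed-length cropping, a least squares adversarial loss, and a constant-then-linearly-decaying learning rate) can be sketched as follows. The 0/1 target coding and the 1/2 scaling follow a common least squares GAN convention and are assumptions, as are all function names and the numeric settings in the usage lines; only the 128-frame segment length comes from the text.

```python
import random

def random_crop(frames, segment_length=128, rng=random):
    """Crop a fixed-length segment at a random offset from a frame sequence."""
    if len(frames) < segment_length:
        raise ValueError("sequence is shorter than the segment length")
    start = rng.randrange(len(frames) - segment_length + 1)
    return frames[start:start + segment_length]

def lsgan_discriminator_loss(d_real, d_fake):
    """Push discriminator outputs toward 1 on real data and 0 on converted data."""
    real = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake = sum(d ** 2 for d in d_fake) / len(d_fake)
    return 0.5 * real + 0.5 * fake

def lsgan_generator_loss(d_fake):
    """Push discriminator outputs on converted data toward 1."""
    return 0.5 * sum((d - 1.0) ** 2 for d in d_fake) / len(d_fake)

def learning_rate(iteration, initial_lr, constant_iters, decay_iters):
    """Hold the rate constant, then decay it linearly to zero."""
    if iteration < constant_iters:
        return initial_lr
    remaining = max(0, constant_iters + decay_iters - iteration)
    return initial_lr * remaining / decay_iters

# Hypothetical usage: 300 dummy frames and an illustrative schedule.
segment = random_crop(list(range(300)), 128, random.Random(0))
halfway = learning_rate(15, initial_lr=1.0, constant_iters=10, decay_iters=10)
```

Cropping at a fresh random offset each time means the same utterance contributes a different segment in different batches, which is the source of the added randomness mentioned above.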
4.2 Objective evaluation
In these experiments, we focused on the conversion of MCEPs; therefore, we evaluated the quality of converted MCEPs. We compared our method (CycleGAN-VC) with a GMM-based method (GMM-VC) because it is still comparable to a DNN-based method on a relatively small dataset. Since this method requires parallel data, all the training data (162 sentences) for both source and target were used. This means that the amount of training data was twice that of ours. As an ablation study, we examined our method without any GLUs. Instead of GLUs, we used typical GAN activation functions, i.e., rectified linear units (ReLUs) for the generator and leaky ReLUs [54, 55] for the discriminator. In a preliminary experiment, we also examined our method without an identity-mapping loss. This revealed that the lack of this loss tends to cause significant degradation, e.g., collapse of the linguistic structure; thus, we did not examine this further.
indicate the limitation of this measure: it tends to prefer over-smoothing because it internally assumes Gaussian distribution. Therefore, as alternatives, we used two structural indicators highly correlated with subjective evaluation: GV and MS. We show the comparison of GV in Fig. 3. We list the comparison of root mean squared error (RMSE) between target and converted logarithmic MS in Table 1. We also show the comparison of MS per modulation frequency in Fig. 4. These results indicate that the MCEP sequences obtained with our method (CycleGAN-VC w/ GLU) are closest to the target in terms of GV and MS. We expect this is because (1) the adversarial loss does not require explicit density estimation and thus avoids over-smoothing, and (2) the GLU is a data-driven activation function; therefore, it can represent sequential and hierarchical structures better than the ReLU and leaky ReLU. We show sample MCEP trajectories in Fig. 5. The trajectories of CycleGAN-VC w/ GLU have a similar global structure to those of GMM-VC w/ GV while preserving similar complexity to the source.
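The two structural indicators can be sketched for a single feature dimension. The sketch assumes that GV is the variance of a feature trajectory over time and that MS is the log power spectrum of that trajectory; the exact definitions, scaling, and units used in the evaluation may differ, and the function names and toy trajectories are illustrative.

```python
import cmath
import math

def global_variance(sequence):
    """Variance over time of one feature dimension's trajectory."""
    mean = sum(sequence) / len(sequence)
    return sum((v - mean) ** 2 for v in sequence) / len(sequence)

def log_modulation_spectrum(sequence, floor=1e-10):
    """Log power spectrum of a trajectory, computed with a direct DFT."""
    n = len(sequence)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(sequence))
        spectrum.append(math.log(abs(coeff) ** 2 + floor))
    return spectrum

def rmse(a, b):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))

# Toy trajectories (assumptions): an over-smoothed copy fluctuates less than the target.
target = [0.0, 2.0, 0.0, 2.0]
smoothed = [0.9, 1.1, 0.9, 1.1]
gv_target = global_variance(target)
gv_smoothed = global_variance(smoothed)
ms_error = rmse(log_modulation_spectrum(target), log_modulation_spectrum(smoothed))
```

An over-smoothed trajectory shows up as both a reduced GV and a nonzero MS error at the fluctuating modulation frequency, which is why these indicators are sensitive to the degradation that a plain distortion measure tends to overlook.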
4.3 Subjective evaluation
We conducted listening tests to evaluate the performance of converted speech (Footnote 2: We provide the converted speech samples at cyclegan-vc). By referring to the VCC 2016, we evaluated the naturalness and speaker similarity of the converted samples. We compared our method with the baseline of the VCC 2016 (Footnote 3: We used data at http://dx.doi.org/10.7488/ds/1575), which is a GMM-based method using parallel data and twice the amount of data. To measure naturalness, we conducted a mean opinion score (MOS) test. As references, we used the original and synthesized-and-analyzed (upper bound of our method) speech of target speakers. Twenty sentences longer than 2 s and shorter than 5 s were randomly selected from the evaluation sets. To measure speaker similarity, we used the same/different paradigm. Ten sample pairs were randomly selected from the evaluation sets. There were nine participants, all well-educated English speakers. Following a previous study, we evaluated two subsets: intra-gender VC (SF1–TF2) and inter-gender VC (SF1–TM3).
We show the MOS for naturalness in Fig. 6. The results indicate that the proposed method significantly outperformed the baseline. We show the similarity to a source speaker and to a target speaker in Fig. 7. The results indicate that our method was slightly inferior to the baseline in SF1–TM3 VC but superior in SF1–TF2 VC. Overall, our method is comparable to the baseline. This is noteworthy since our method was trained under disadvantageous conditions: it used half the amount of data, and the data were non-parallel.
5 Discussion and conclusions
We proposed a parallel-data-free VC method called CycleGAN-VC, which uses a CycleGAN with gated CNNs and an identity-mapping loss. This method can learn a sequence-based mapping function without any extra data, modules, or time-alignment procedure. An objective evaluation showed that the MCEP sequences obtained with our method are close to the target in terms of GV and MS. A subjective evaluation showed that the quality of converted speech was comparable to that obtained with the GMM-based method, which was trained under advantageous conditions with parallel data and twice the amount of data. However, there is still a gap between original and converted speech. To close this gap, we plan to apply our method to other features, such as STFT spectrograms, and other speech-synthesis frameworks, such as vocoder-free VC. Furthermore, our proposed method is a general framework, and possible future work includes applying the method to other VC applications [1, 2, 3, 4, 5, 6, 7].
Acknowledgements: This work was supported by JSPS KAKENHI Grant Number 17H01763.
-  A. Kain and M. W. Macon, “Spectral voice conversion for text-to-speech synthesis,” in Proc. ICASSP, 1998, pp. 285–288.
-  A. B. Kain, J.-P. Hosom, X. Niu, J. P. van Santen, M. Fried-Oken, and J. Staehely, “Improving the intelligibility of dysarthric speech,” Speech Commun., vol. 49, no. 9, pp. 743–759, 2007.
-  K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, “Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech,” Speech Commun., vol. 54, no. 1, pp. 134–146, 2012.
-  Z. Inanoglu and S. Young, “Data-driven emotion conversion in spoken English,” Speech Commun., vol. 51, no. 3, pp. 268–283, 2009.
-  O. Türk and M. Schröder, “Evaluation of expressive speech synthesis with voice conversion and copy resynthesis techniques,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 18, no. 5, pp. 965–973, 2010.
-  T. Toda, M. Nakagiri, and K. Shikano, “Statistical voice conversion techniques for body-conducted unvoiced speech enhancement,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 20, no. 9, pp. 2505–2517, 2012.
-  T. Kaneko, H. Kameoka, K. Hiramatsu, and K. Kashino, “Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks,” in Proc. INTERSPEECH, 2017, pp. 1283–1287.
-  Y. Stylianou, O. Cappé, and E. Moulines, “Continuous probabilistic transform for voice conversion,” IEEE Trans. Speech and Audio Process., vol. 6, no. 2, pp. 131–142, 1998.
-  T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 15, no. 8, pp. 2222–2235, 2007.
-  E. Helander, T. Virtanen, J. Nurminen, and M. Gabbouj, “Voice conversion using partial least squares regression,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 18, no. 5, pp. 912–921, 2010.
-  L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, “Voice conversion using deep neural networks with layer-wise generative training,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 22, no. 12, pp. 1859–1872, 2014.
-  T. Nakashika, T. Takiguchi, and Y. Ariki, “Voice conversion based on speaker-dependent restricted Boltzmann machines,” IEICE Trans. Inf. Syst., vol. 97, no. 6, pp. 1403–1410, 2014.
-  S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, “Spectral mapping using artificial neural networks for voice conversion,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 18, no. 5, pp. 954–964, 2010.
-  S. H. Mohammadi and A. Kain, “Voice conversion using deep neural networks with speaker-independent pre-training,” in Proc. SLT, 2014, pp. 19–23.
-  T. Nakashika, T. Takiguchi, and Y. Ariki, “High-order sequence modeling using speaker-dependent recurrent temporal restricted Boltzmann machines for voice conversion,” in Proc. INTERSPEECH, 2014, pp. 2278–2282.
-  L. Sun, S. Kang, K. Li, and H. Meng, in Proc. ICASSP, 2015, pp. 4869–4873.
-  R. Takashima, T. Takiguchi, and Y. Ariki, “Exemplar-based voice conversion using sparse representation in noisy environments,” IEICE Trans. Inf. Syst., vol. E96-A, no. 10, pp. 1946–1953, 2013.
-  Z. Wu, T. Virtanen, E. S. Chng, and H. Li, “Exemplar-based sparse representation with residual compensation for voice conversion,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 22, no. 10, pp. 1506–1521, 2014.
-  E. Helander, J. Schwarz, J. Nurminen, H. Silen, and M. Gabbouj, “On the impact of alignment on voice conversion performance,” in Proc. INTERSPEECH, 2008, pp. 1453–1456.
-  J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. ICCV, 2017, pp. 2223–2232.
-  T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, “Learning to discover cross-domain relations with generative adversarial networks,” in Proc. ICML, 2017, pp. 1857–1865.
-  Z. Yi, H. Zhang, P. Tan, and M. Gong, “DualGAN: Unsupervised dual learning for image-to-image translation,” in Proc. ICCV, 2017, pp. 2849–2857.
-  Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in Proc. ICML, 2017, pp. 933–941.
-  Y. Taigman, A. Polyak, and L. Wolf, “Unsupervised cross-domain image generation,” in ICLR, 2017.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Proc. NIPS, 2014, pp. 2672–2680.
-  T. Zhou, P. Krahenbuhl, M. Aubry, Q. Huang, and A. A. Efros, “Learning dense correspondence via 3D-guided cycle consistency,” in Proc. CVPR, 2016, pp. 117–126.
-  T. Kaneko, H. Kameoka, N. Hojo, Y. Ijima, K. Hiramatsu, and K. Kashino, “Generative adversarial network-based postfilter for statistical parametric speech synthesis,” in Proc. ICASSP, 2017, pp. 4910–4914.
-  T. Kaneko, S. Takaki, H. Kameoka, and J. Yamagishi, “Generative adversarial network-based postfilter for STFT spectrograms,” in Proc. INTERSPEECH, 2017, pp. 3389–3393.
-  Y. Saito, S. Takamichi, and H. Saruwatari, “Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis,” in Proc. ICASSP, 2017, pp. 4900–4904.
-  T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, and J. Yamagishi, “The Voice Conversion Challenge 2016,” in Proc. INTERSPEECH, 2016, pp. 1632–1636.
-  S. Takamichi, T. Toda, G. Neubig, S. Sakti, and S. Nakamura, “A postfilter to modify the modulation spectrum in HMM-based speech synthesis,” in Proc. ICASSP, 2014, pp. 290–294.
-  M. Zhang, J. Tao, J. Tian, and X. Wang, “Text-independent voice conversion based on state mapped codebook,” in Proc. ICASSP, 2008, pp. 4605–4608.
-  F.-L. Xie, F. K. Soong, and H. Li, “A KL divergence and DNN-based approach to voice conversion without parallel training sentences,” in Proc. INTERSPEECH, 2016, pp. 287–291.
-  A. Mouchtaris, J. Van der Spiegel, and P. Mueller, “Nonparallel training for voice conversion based on a parameter adaptation approach,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 14, no. 3, pp. 952–963, 2006.
-  C.-H. Lee and C.-H. Wu, “MAP-based adaptation for speech conversion using adaptation data selection and non-parallel training,” in Proc. ICSLP, 2006, pp. 2254–2257.
-  T. Toda, Y. Ohtani, and K. Shikano, “Eigenvoice conversion based on Gaussian mixture model,” in Proc. INTERSPEECH, 2006, pp. 2446–2449.
-  D. Saito, K. Yamamoto, N. Minematsu, and K. Hirose, “One-to-many voice conversion based on tensor representation of speaker space,” in Proc. INTERSPEECH, 2011.
-  T. Nakashika, T. Takiguchi, and Y. Minami, “Non-parallel training in voice conversion using an adaptive restricted Boltzmann machine,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 24, no. 11, pp. 2032–2045, 2016.
-  C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” in Proc. APSIPA, 2016, pp. 1–6.
-  T. Kinnunen, L. Juvela, P. Alku, and J. Yamagishi, “Non-parallel voice conversion using i-vector PLDA: Towards unifying speaker verification and transformation,” in Proc. ICASSP, 2017, pp. 5535–5539.
-  C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” in Proc. INTERSPEECH, 2017, pp. 3364–3368.
-  J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. CVPR, 2015, pp. 3431–3440.
-  M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Trans. Inf. Syst., vol. 99, no. 7, pp. 1877–1884, 2016.
-  K. Liu, J. Zhang, and Y. Yan, “High quality voice conversion through phoneme-based linear mapping functions with STRAIGHT for Mandarin,” in Proc. FSKD, 2007, pp. 410–414.
-  Y. Ohtani, T. Toda, H. Saruwatari, and K. Shikano, “Maximum likelihood voice conversion based on GMM with STRAIGHT mixed excitation,” in Proc. INTERSPEECH, 2006, pp. 2266–2269.
-  W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proc. CVPR, 2016, pp. 1874–1883.
-  J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Proc. ECCV, 2016, pp. 694–711.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016, pp. 770–778.
-  D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv preprint arXiv:1607.08022, 2016.
-  X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley, “Least squares generative adversarial networks,” arXiv preprint arXiv:1611.04076, 2016.
-  D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015.
-  M. Wester, Z. Wu, and J. Yamagishi, “Multidimensional scaling of systems in the Voice Conversion Challenge 2016,” in Proc. SSW, 2016.
-  V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. ICML, 2010, pp. 807–814.
-  A. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Proc. ICML Workshop, 2013.
-  B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” in Proc. ICML Workshop, 2015.
-  M. Wester, Z. Wu, and J. Yamagishi, “Analysis of the Voice Conversion Challenge 2016 evaluation results,” in Proc. INTERSPEECH, 2016, pp. 1637–1641.
-  K. Kobayashi, T. Toda, and S. Nakamura, “F0 transformation techniques for statistical voice conversion with direct waveform modification with spectral differential,” in Proc. SLT, 2016, pp. 693–700.