1 Introduction
Voice conversion (VC) is a technique for converting para-/non-linguistic information contained in a given utterance without changing the linguistic information. This technique can be applied to various tasks such as speaker-identity modification for text-to-speech (TTS) systems [1], speaking assistance [2, 3], speech enhancement [4, 5, 6], and pronunciation conversion [7].
One widely studied VC framework involves Gaussian mixture model (GMM)-based approaches [8, 9, 10]. Recently, neural network (NN)-based frameworks based on restricted Boltzmann machines [11, 12], feed-forward deep NNs [13, 14], recurrent NNs [15, 16], variational autoencoders (VAEs) [17, 18, 19] and generative adversarial nets (GANs) [7], and an exemplar-based framework based on non-negative matrix factorization (NMF) [20, 21] have also attracted particular attention. While many VC methods, including those mentioned above, require accurately aligned parallel data of source and target speech, collecting parallel utterances in general scenarios can be a costly and time-consuming process. Even if we were able to collect parallel utterances, we would typically need to perform automatic time alignment procedures, which become relatively difficult when there is a large acoustic gap between the source and target speech. Since many frameworks are weak with respect to the misalignment found in parallel data, careful pre-screening and manual correction are often required to make these frameworks work reliably. To sidestep these issues, this paper aims to develop a non-parallel VC method that requires no parallel utterances, transcriptions, or time alignment procedures.

The quality and conversion effect obtained with non-parallel methods are generally poorer than with methods using parallel data, since the training condition puts them at a disadvantage. Thus, it is challenging to achieve as high a quality and conversion effect with non-parallel methods as with parallel methods. Several non-parallel methods have already been proposed [18, 19, 22, 23]. For example, a method using automatic speech recognition (ASR) was proposed in [22]
where the idea is to convert input speech under the restriction that the posterior state probability of the acoustic model of an ASR system is preserved. Since the performance of this method depends heavily on the quality of the acoustic model, it can fail to work if the ASR system does not function reliably. A method using i-vectors [24], which are known to be a powerful feature for speaker verification, was proposed in [23], where the idea is to shift the acoustic features of input speech towards target speech in the i-vector space so that the converted speech is likely to be recognized as the target speaker by a speaker recognizer. While this method is also free of parallel data, one limitation is that it is applicable only to speaker identity conversion tasks.

Recently, a framework based on conditional variational autoencoders (CVAEs) [25, 26] was proposed in [18, 27]. As the name implies, VAEs are a probabilistic counterpart of autoencoders (AEs), consisting of encoder and decoder networks. CVAEs [26] are an extended version of VAEs, with the only difference being that the encoder and decoder networks take an attribute class label $c$ as an additional input. By using acoustic features associated with attribute labels as the training examples, the networks learn how to convert an attribute of source speech to a target attribute according to the attribute label fed into the decoder. While this VAE-based VC approach is notable in that it is completely free of parallel data and works even with unaligned corpora, there are three major drawbacks. Firstly, the devised networks are designed to produce acoustic features frame-by-frame, which makes it difficult to learn time dependencies in the acoustic feature sequences of source and target speech. Secondly, one well-known problem with VAEs is that outputs from the decoder tend to be over-smoothed. This can be problematic for VC applications since it usually results in poor-quality, buzzy-sounding speech. One natural way of alleviating the over-smoothing effect in VAEs would be to use the VAE-GAN framework [28]. A non-parallel VC method based on this framework has already been proposed in [19]. With this method, an adversarial loss derived using a GAN discriminator is incorporated into the training loss to make the decoder outputs of a CVAE indistinguishable from real speech features. While this method is able to produce more realistic-sounding speech than the regular VAE-based method [18], as will be shown in Section 4, the audio quality and conversion effect are still limited. Thirdly, in the regular CVAEs, the encoder and decoder are free to ignore the additional input $c$ by finding networks that can reconstruct any data without using $c$. In such a situation, the attribute class label will have little effect on controlling the voice characteristics of the input speech.
To overcome these drawbacks and limitations, in this paper we describe three modifications to the conventional VAE-based approach. First, we adopt fully convolutional architectures to design the encoder and decoder networks so that the networks can learn conversion rules that capture short- and long-term dependencies in the acoustic feature sequences of source and target speech. Secondly, we propose simply transplanting the spectral details of input speech into its converted version at test time to avoid producing buzzy-sounding speech. We will show in Section 4 that this simple method works considerably better than the VAE-GAN framework [19] in terms of audio quality. Thirdly, we propose using an information-theoretic regularization for the model training to ensure that the attribute class information will not be lost in the conversion process. This can be done by introducing an auxiliary classifier whose role is to predict to which attribute class an input acoustic feature sequence belongs, and by training the encoder and decoder so that the attribute classes of the decoder outputs are correctly predicted by the classifier. We call the present VAE variant an auxiliary classifier VAE (ACVAE).
2 VAE voice conversion
2.1 Variational Autoencoder (VAE)
VAEs [25, 26] are stochastic neural network models consisting of encoder and decoder networks. The encoder network generates a set of parameters for the conditional distribution $q_\phi(z|x)$ of a latent space variable $z$ given input data $x$, whereas the decoder network generates a set of parameters for the conditional distribution $p_\theta(x|z)$ of the data $x$ given the latent space variable $z$. Given a training dataset $\{x_n\}_{n=1}^N$, VAEs learn the parameters of the entire network so that the encoder distribution $q_\phi(z|x)$ becomes consistent with the posterior $p_\theta(z|x) \propto p_\theta(x|z)p(z)$. By using Jensen's inequality, the log marginal distribution of data $x$ can be lower-bounded by

$\log p_\theta(x) \ge \mathbb{E}_{z\sim q_\phi(z|x)}[\log p_\theta(x|z)] - \mathrm{KL}[q_\phi(z|x)\,\|\,p(z)],$  (1)

where the difference between the left- and right-hand sides of this inequality is equal to the Kullback-Leibler divergence $\mathrm{KL}[q_\phi(z|x)\,\|\,p_\theta(z|x)]$, which is minimized when

$q_\phi(z|x) = p_\theta(z|x).$  (2)
This means we can make $q_\phi(z|x)$ and $p_\theta(z|x)$ consistent by maximizing the lower bound in (1). One typical way of modeling $q_\phi(z|x)$, $p_\theta(x|z)$ and $p(z)$ is to assume Gaussian distributions

$q_\phi(z|x) = \mathcal{N}(z;\, \mu_\phi(x),\, \mathrm{diag}(\sigma_\phi^2(x))),$  (3)
$p_\theta(x|z) = \mathcal{N}(x;\, \mu_\theta(z),\, \mathrm{diag}(\sigma_\theta^2(z))),$  (4)
$p(z) = \mathcal{N}(z;\, 0,\, I),$  (5)

where $\mu_\phi(x)$ and $\sigma_\phi^2(x)$ are the outputs of an encoder network with parameter $\phi$, and $\mu_\theta(z)$ and $\sigma_\theta^2(z)$ are the outputs of a decoder network with parameter $\theta$. The first term of the lower bound can be interpreted as an autoencoder reconstruction error. By using the reparameterization $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, sampling from $q_\phi(z|x)$ can be replaced by sampling from the distribution of $\epsilon$, which is independent of $\phi$. This allows us to compute the gradient of the lower bound with respect to $\phi$ by using a Monte Carlo approximation of the expectation $\mathbb{E}_{z\sim q_\phi(z|x)}[\cdot]$. The second term is given as the negative KL divergence between $q_\phi(z|x)$ and $p(z)$. This term can be interpreted as a regularization term that forces each element of the encoder output to be uncorrelated and normally distributed.
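To make this concrete, the following is a minimal NumPy sketch of the lower bound (1) under the Gaussian assumptions (3)-(5). The `decode` callable stands in for an actual decoder network and is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl(mu, log_var):
    # Closed-form KL[N(mu, diag(exp(log_var))) || N(0, I)]:
    # the regularization term of the lower bound
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def elbo(x, enc_mu, enc_log_var, decode):
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I),
    # so the sampling noise is independent of the encoder parameters
    eps = rng.standard_normal(enc_mu.shape)
    z = enc_mu + np.exp(0.5 * enc_log_var) * eps
    dec_mu, dec_log_var = decode(z)
    # Gaussian log-likelihood log p(x|z): the reconstruction term
    log_px = -0.5 * np.sum(
        dec_log_var + (x - dec_mu) ** 2 / np.exp(dec_log_var) + np.log(2 * np.pi)
    )
    return log_px - gaussian_kl(enc_mu, enc_log_var)
```

In practice both terms are averaged over a minibatch and maximized by gradient ascent with respect to the encoder and decoder parameters.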
Conditional VAEs (CVAEs) [26] are an extended version of VAEs, with the only difference being that the encoder and decoder networks can take an auxiliary variable $c$ as an additional input. With CVAEs, (3) and (4) are replaced with

$q_\phi(z|x,c) = \mathcal{N}(z;\, \mu_\phi(x,c),\, \mathrm{diag}(\sigma_\phi^2(x,c))),$  (6)
$p_\theta(x|z,c) = \mathcal{N}(x;\, \mu_\theta(z,c),\, \mathrm{diag}(\sigma_\theta^2(z,c))),$  (7)

and the variational lower bound to be maximized becomes

$\mathcal{J}(\phi,\theta) = \mathbb{E}_{(x,c)\sim p_D(x,c)}\big[\mathbb{E}_{z\sim q_\phi(z|x,c)}[\log p_\theta(x|z,c)] - \mathrm{KL}[q_\phi(z|x,c)\,\|\,p(z)]\big],$  (8)

where $\mathbb{E}_{(x,c)\sim p_D(x,c)}[\cdot]$ denotes the sample mean over the training examples $\{x_n, c_n\}_{n=1}^N$.
2.2 Non-parallel voice conversion using CVAE
By letting $x$ and $c$ be an acoustic feature vector and an attribute class label, a non-parallel VC problem can be formulated using the CVAE [18, 19]. Given a training set of acoustic features $\{x_n\}_{n=1}^N$ with attribute class labels $\{c_n\}_{n=1}^N$, the encoder learns to map an input acoustic feature $x$ and an attribute class label $c$ to a latent space variable $z$ (expected to represent phonetic information), and the decoder then reconstructs an acoustic feature conditioned on the encoded latent space variable $z$ and the attribute class label $c$. At test time, we can generate a converted feature by feeding an acoustic feature of the input speech into the encoder and a target attribute class label into the decoder.
3 Proposed method
3.1 Fully Convolutional VAE
While the model in [18, 19] is designed to convert acoustic features frame-by-frame and fails to learn conversion rules that reflect time dependencies in acoustic feature sequences, we propose extending it to a sequential version to overcome this limitation. Namely, we devise a CVAE that takes an acoustic feature sequence instead of a single-frame acoustic feature as an input and outputs an acoustic feature sequence of the same length. Hence, in the following we assume that $x \in \mathbb{R}^{Q \times T}$ is an acoustic feature sequence of length $T$, where $Q$ is the feature dimension. While RNN-based architectures are a natural choice for modeling time series data, we use fully convolutional networks to design $q_\phi(z|x,c)$ and $p_\theta(x|z,c)$, as detailed in 3.4.
3.2 Auxiliary Classifier VAE
We hereafter assume that a class label comprises one or more categories, each consisting of multiple classes. We thus represent $c$ as a concatenation of one-hot vectors, each of which is filled with 1 at the index of a class in a certain category and with 0 everywhere else. For example, if we consider speaker identities as the only class category, $c$ will be represented as a single one-hot vector, where each element is associated with a different speaker.
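As an illustration, a label of this form can be assembled as follows; the function name and category layout are chosen only for this sketch.

```python
import numpy as np

def attribute_label(class_indices, category_sizes):
    """Concatenate one one-hot vector per category into a single label c."""
    parts = []
    for idx, size in zip(class_indices, category_sizes):
        onehot = np.zeros(size)
        onehot[idx] = 1.0  # 1 at the index of the class within this category
        parts.append(onehot)
    return np.concatenate(parts)

# With speaker identity as the only category (4 speakers), c is one one-hot vector
c = attribute_label([2], [4])  # third of four speakers
```

With several categories (e.g. speaker and emotion), the per-category one-hot vectors are simply stacked end to end.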
The regular CVAEs impose no restrictions on the manner in which the encoder and decoder may use the attribute class label $c$. Hence, the encoder and decoder are free to ignore $c$ by finding distributions satisfying $q_\phi(z|x,c) = q_\phi(z|x)$ and $p_\theta(x|z,c) = p_\theta(x|z)$. This can occur, for instance, when the encoder and decoder have sufficient capacity to reconstruct any data without using $c$. In such a situation, $c$ will have little effect on controlling the voice characteristics of input speech. To avoid such situations, we introduce an information-theoretic regularization [29] to encourage the decoder output to be correlated as far as possible with $c$.
The mutual information between $c$ and the decoder output $x \sim p_\theta(x|z,c)$ conditioned on $z$ can be written as

$I(c; x|z) = \mathbb{E}_{c\sim p(c),\, x\sim p_\theta(x|z,c)}\big[\mathbb{E}_{c'\sim p(c|x)}[\log p(c'|x)]\big] + H(c),$  (9)
where $H(c)$ represents the entropy of $c$, which can be considered a constant term. In practice, $I(c; x|z)$ is hard to optimize directly since it requires access to the posterior $p(c|x)$. Fortunately, we can obtain a lower bound of the first term of $I(c; x|z)$ by introducing an auxiliary distribution $r(c|x)$:

$\mathbb{E}_{c\sim p(c),\, x\sim p_\theta(x|z,c)}\big[\mathbb{E}_{c'\sim p(c|x)}[\log p(c'|x)]\big]$
$\; = \mathbb{E}_{c\sim p(c),\, x\sim p_\theta(x|z,c)}\big[\mathrm{KL}[p(c'|x)\,\|\,r(c'|x)] + \mathbb{E}_{c'\sim p(c|x)}[\log r(c'|x)]\big]$
$\; \ge \mathbb{E}_{c\sim p(c),\, x\sim p_\theta(x|z,c)}\big[\mathbb{E}_{c'\sim p(c|x)}[\log r(c'|x)]\big]$
$\; = \mathbb{E}_{c\sim p(c),\, x\sim p_\theta(x|z,c)}[\log r(c|x)].$  (10)
This technique of lower bounding mutual information is known as variational information maximization [30]. The last line of (10) follows from the lemma presented in [29]. The equality in (10) holds when $r(c|x) = p(c|x)$. Hence, maximizing the lower bound (10) with respect to $r$ corresponds to approximating $p(c|x)$ by $r(c|x)$ as well as approximating $I(c; x|z)$ by this lower bound. We can therefore indirectly increase $I(c; x|z)$ by increasing the lower bound with respect to $p_\theta$ and $r$. One way to do this involves expressing $r(c|x)$ using an NN and training it along with $q_\phi(z|x,c)$ and $p_\theta(x|z,c)$. Hereafter, we use $r_\psi(c|x)$ to denote the auxiliary classifier NN with parameter $\psi$. As detailed in 3.4, we also design the auxiliary classifier using a fully convolutional network, which takes an acoustic feature sequence as the input and generates a sequence of class probabilities. The regularization term that we would like to maximize with respect to $\phi$, $\theta$ and $\psi$ becomes
$\mathcal{L}(\phi,\theta,\psi) = \mathbb{E}_{(x,c)\sim p_D(x,c)}\big[\mathbb{E}_{z\sim q_\phi(z|x,c),\, \tilde{x}\sim p_\theta(\tilde{x}|z,c)}[\log r_\psi(c|\tilde{x})]\big],$  (11)

where $\mathbb{E}_{(x,c)\sim p_D(x,c)}[\cdot]$ denotes the sample mean over the training examples $\{x_n, c_n\}_{n=1}^N$. Fortunately, we can use the same reparameterization trick as in 2.1 to compute the gradients of $\mathcal{L}(\phi,\theta,\psi)$ with respect to $\phi$, $\theta$ and $\psi$. Since we can also use the training examples to train the auxiliary classifier $r_\psi(c|x)$ directly, we include the cross-entropy

$\mathcal{I}(\psi) = \mathbb{E}_{(x,c)\sim p_D(x,c)}[\log r_\psi(c|x)]$  (12)
in our training criterion. The entire training criterion is thus given by
$\mathcal{J}(\phi,\theta) + \lambda_{\mathcal{L}}\, \mathcal{L}(\phi,\theta,\psi) + \lambda_{\mathcal{I}}\, \mathcal{I}(\psi),$  (13)

where $\lambda_{\mathcal{L}} \ge 0$ and $\lambda_{\mathcal{I}} \ge 0$ are regularization parameters, which weigh the importance of the regularization terms relative to the VAE training criterion $\mathcal{J}(\phi,\theta)$.
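The composition of the full criterion can be sketched as follows. The logits are assumed to come from the decoder output and from a real training example respectively, and the default weights below are placeholders rather than the values used in the experiments.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log of the softmax class probabilities
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def acvae_criterion(vae_lower_bound, logits_on_decoded, logits_on_real,
                    class_idx, lam_L=1.0, lam_I=1.0):
    # L: classifier log-probability of the correct class, evaluated on the
    # decoder output, pushing the decoder to keep the label recoverable
    L = log_softmax(logits_on_decoded)[class_idx]
    # I: classifier log-likelihood (cross-entropy term) on real training features
    I = log_softmax(logits_on_real)[class_idx]
    # The full criterion to be maximized
    return vae_lower_bound + lam_L * L + lam_I * I
```

Gradients of all three terms flow through the classifier, while the reparameterization trick lets the first two terms also update the encoder and decoder.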
While the idea of using an auxiliary classifier for GAN-based image synthesis [31, 32] and voice conversion [33] has already been proposed, to the best of our knowledge, it has yet to be proposed for use with the VAE framework. We call the present VAE variant an auxiliary classifier VAE (ACVAE).
3.3 Conversion Process
Although it would be interesting to develop an end-to-end model by directly using a time-domain signal or a magnitude spectrogram as $x$, in this paper we use a sequence of mel-cepstral coefficients [34] computed from a spectral envelope sequence obtained using WORLD [35].
After training $\phi$ and $\theta$, we can convert an input feature sequence $x$ with

$\hat{x} = \mu_\theta(\mu_\phi(x, c), \hat{c}),$  (14)

where $c$ and $\hat{c}$ denote the source and target attribute class labels, respectively. A naive way of obtaining a time-domain signal is to simply use $\hat{x}$ to reconstruct a signal with a vocoder. However, the converted feature sequence obtained with this procedure tended to be over-smoothed, as with other conventional VC methods, resulting in buzzy-sounding synthetic speech. This was also the case with the reconstructed feature sequence

$\bar{x} = \mu_\theta(\mu_\phi(x, c), c).$  (15)

This over-smoothing effect was caused by the Gaussian assumptions on the encoder and decoder distributions: under these assumptions, the encoder and decoder networks learn to fit the decoder outputs to the inputs in an expectation sense. Instead of directly using $\hat{x}$ to reconstruct a signal, a reasonable way of avoiding this over-smoothing effect is to transplant the spectral details of the input speech into its converted version. By using $\hat{x}$ and $\bar{x}$, we can obtain a sequence of spectral gain functions by dividing $F(\hat{x})$ by $F(\bar{x})$, where $F$ denotes a transformation from an acoustic feature sequence to a spectral envelope sequence. Once we obtain the spectral gain functions, we can reconstruct a time-domain signal by multiplying the spectral envelope of the input speech by the spectral gain function frame-by-frame and resynthesizing the signal using a WORLD vocoder. Alternatively, we can adopt the vocoder-free direct waveform modification method [36], which consists of transforming the spectral gain functions into time-domain impulse responses and convolving the input signal with the obtained filters.
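The frame-by-frame gain computation described above can be sketched as follows. The arrays stand for the spectral envelope sequences of the converted, reconstructed, and input speech, and the small floor constant is an assumption added to avoid division by zero.

```python
import numpy as np

def transplant_spectral_details(env_input, env_converted, env_reconstructed,
                                floor=1e-10):
    """Apply per-bin spectral gains (converted / reconstructed) to the input envelope."""
    # Spectral gain function per time-frequency bin
    gains = env_converted / np.maximum(env_reconstructed, floor)
    # Multiplying the *input* envelope keeps its fine spectral details
    return env_input * gains
```

The resulting envelope sequence can then be handed to a WORLD-style synthesizer together with the converted F0 and the aperiodicities.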
3.4 Network Architectures
Encoder/Decoder: We use 2D CNNs to design the encoder and decoder networks and the auxiliary classifier network by treating $x$ as an image of size $Q \times T$ with 1 channel. Specifically, we use a gated CNN [37], which was originally introduced to model word sequences for language modeling and was shown to outperform long short-term memory (LSTM) language models trained in a similar setting. We previously employed gated CNN architectures for voice conversion
[7, 38, 33] and monaural audio source separation [39], and their effectiveness has already been confirmed. In the encoder, the output of the $l$-th hidden layer, $h_l$, is described as a linear projection modulated by an output gate

$h'_{l-1} = [h_{l-1}; C_l],$  (16)
$h_l = (W_l * h'_{l-1} + b_l) \odot \sigma(V_l * h'_{l-1} + d_l),$  (17)

where $W_l$, $V_l$, $b_l$ and $d_l$ are the encoder network parameters $\phi$, and $\sigma$ denotes the element-wise sigmoid function. Similar to LSTMs, the output gate $\sigma(V_l * h'_{l-1} + d_l)$ multiplies each element of $W_l * h'_{l-1} + b_l$ and controls what information should be propagated through the hierarchy of layers. This gating mechanism is called a gated linear unit (GLU). Here, $[h_{l-1}; C_l]$ means the concatenation of $h_{l-1}$ and $C_l$ along the channel dimension, and $C_l$ is a 3D array consisting of a tiling of copies of $c$ in the time dimension. The input into the 1st layer of the encoder is $h_0 = x$. The outputs of the final layer are given as regular linear projections

$\mu_\phi(x, c) = W_L * h'_{L-1} + b_L,$  (18)
$\log \sigma_\phi^2(x, c) = V_L * h'_{L-1} + d_L.$  (19)
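A minimal sketch of one such gated layer follows, with the two convolutions of the GLU abstracted as callables; `proj` and `gate_proj` are placeholder names for the learned linear projection and gate projection.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gated_layer(h_prev, c, proj, gate_proj):
    """One GLU block: tile the label over time, concat on channels, then gate."""
    T = h_prev.shape[-1]
    # Copies of the label c tiled along the time dimension
    c_tiled = np.repeat(c[:, None], T, axis=1)      # shape (len(c), T)
    h = np.concatenate([h_prev, c_tiled], axis=0)   # channel-wise concatenation
    # Linear projection modulated element-wise by a sigmoid output gate
    return proj(h) * sigmoid(gate_proj(h))
```

Because every operation here is convolutional or element-wise along time, the layer accepts sequences of arbitrary length, which is what makes the fully convolutional design possible.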
The decoder network is constructed in the same manner as the encoder, with $W_l$, $V_l$, $b_l$ and $d_l$ here denoting the decoder network parameters $\theta$. See Section 4 for more details. It should be noted that since the entire architecture is fully convolutional with no fully-connected layers, it can take an entire sequence with an arbitrary length as an input and generate an acoustic feature sequence of the same length.
Auxiliary Classifier: We also design the auxiliary classifier using a gated CNN, which takes an acoustic feature sequence $x$ and produces a sequence of class probability distributions that shows how likely each segment of $x$ is to belong to attribute $c$. The output of the $l$-th layer of the classifier is given as

$h_l = (W_l * h_{l-1} + b_l) \odot \sigma(V_l * h_{l-1} + d_l),$  (20)

where $W_l$, $V_l$, $b_l$ and $d_l$ are the auxiliary classifier network parameters $\psi$. The final output $r_\psi(c|x)$ is given by the product of all the elements of $h_L$. See Section 4 for more details.
4 Experiments
To confirm the performance of our proposed method, we conducted subjective evaluation experiments involving a non-parallel many-to-many speaker identity conversion task. We used the Voice Conversion Challenge (VCC) 2018 dataset [40], which consists of recordings of six female and six male US English speakers. We used a subset of speakers for training and evaluation. Specifically, we selected two female speakers, ‘VCC2SF1’ and ‘VCC2SF2’, and two male speakers, ‘VCC2SM1’ and ‘VCC2SM2’. Thus, $c$ is represented as a four-dimensional one-hot vector, and in total there were twelve different combinations of source and target speakers. The audio files for each speaker were manually segmented into 116 short sentences (about 7 minutes in total), of which 81 and 35 sentences (about 5 and 2 minutes, respectively) were provided as training and evaluation sets. All the speech signals were sampled at 22050 Hz. For each utterance, a spectral envelope, a logarithmic fundamental frequency (log $F_0$), and aperiodicities (APs) were extracted every 5 ms using the WORLD analyzer [35]. 36 mel-cepstral coefficients (MCCs) were then extracted from each spectral envelope. The log $F_0$ contours were converted using the logarithm Gaussian normalized transformation described in [41]. The aperiodicities were used directly without modification. The network configuration is shown in detail in Fig. 2. The signals of the converted speech were obtained using the method described in 3.3.
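The logarithm Gaussian normalized transformation [41] matches the mean and variance of the source speaker's log $F_0$ to those of the target speaker. A sketch, assuming unvoiced frames are marked with zeros:

```python
import numpy as np

def convert_f0(f0, src_mean, src_std, tgt_mean, tgt_std):
    """Log-Gaussian normalized F0 transformation (statistics in the log domain)."""
    out = np.zeros_like(f0)
    voiced = f0 > 0  # unvoiced frames (F0 == 0) are left untouched
    # Normalize the source log-F0, then denormalize with the target statistics
    normalized = (np.log(f0[voiced]) - src_mean) / src_std
    out[voiced] = np.exp(normalized * tgt_std + tgt_mean)
    return out
```

The per-speaker means and standard deviations of log $F_0$ are estimated once from the voiced frames of the training data.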
We chose the VAE-GAN-based approach [19] for comparison in our experiments. Although we would have liked to replicate the implementation of this method exactly, we made our own design choices because certain details of the network configuration and hyperparameters were missing. We conducted an AB test to compare the sound quality of the converted speech samples and an ABX test to compare the similarity to the target speaker of the converted speech samples, where “A” and “B” were converted speech samples obtained with the proposed and baseline methods and “X” was a real speech sample obtained from a target speaker. In these listening tests, “A” and “B” were presented in random order to eliminate bias in the order of stimuli. Eight listeners participated in our listening tests. For the AB test of sound quality, each listener was presented with 20 utterance pairs {“A”, “B”}, and for the ABX test of speaker similarity, each listener was presented with 24 utterance triples {“A”, “B”, “X”}. Each listener was then asked to select “A”, “B” or “fair” for each utterance. The results are shown in Fig. 3. As the results reveal, the proposed method significantly outperformed the baseline method in terms of both sound quality and speaker similarity. Audio samples are provided at http://www.kecl.ntt.co.jp/people/kameoka.hirokazu/Demos/acvaevc/.

5 Conclusions
This paper proposed a non-parallel many-to-many VC method using a VAE variant called an auxiliary classifier VAE (ACVAE). The proposed method has three key features. First, we adopted fully convolutional architectures to construct the encoder and decoder networks so that the networks could learn conversion rules that capture time dependencies in the acoustic feature sequences of source and target speech. Second, we proposed using an information-theoretic regularization for the model training to ensure that the information in the latent attribute label would not be lost in the generation process. With regular CVAEs, the encoder and decoder are free to ignore the attribute class label input. This can be problematic since, in such a situation, the attribute class label input will have little effect on controlling the voice characteristics of the input speech. To avoid such situations, we proposed introducing an auxiliary classifier and training the encoder and decoder so that the attribute classes of the decoder outputs are correctly predicted by the classifier. Third, to avoid producing buzzy-sounding speech at test time, we proposed simply transplanting the spectral details of the input speech into its converted version. Subjective evaluation experiments on a non-parallel many-to-many speaker identity conversion task revealed that the proposed method obtained higher sound quality and speaker similarity than the VAE-GAN-based method.
References
 [1] A. Kain and M. W. Macon, “Spectral voice conversion for text-to-speech synthesis,” in Proc. ICASSP, 1998, pp. 285–288.
 [2] A. B. Kain, J.-P. Hosom, X. Niu, J. P. van Santen, M. Fried-Oken, and J. Staehely, “Improving the intelligibility of dysarthric speech,” Speech Commun., vol. 49, no. 9, pp. 743–759, 2007.
 [3] K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, “Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech,” Speech Commun., vol. 54, no. 1, pp. 134–146, 2012.
 [4] Z. Inanoglu and S. Young, “Data-driven emotion conversion in spoken English,” Speech Commun., vol. 51, no. 3, pp. 268–283, 2009.
 [5] O. Türk and M. Schröder, “Evaluation of expressive speech synthesis with voice conversion and copy resynthesis techniques,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 18, no. 5, pp. 965–973, 2010.
 [6] T. Toda, M. Nakagiri, and K. Shikano, “Statistical voice conversion techniques for body-conducted unvoiced speech enhancement,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 20, no. 9, pp. 2505–2517, 2012.
 [7] T. Kaneko, H. Kameoka, K. Hiramatsu, and K. Kashino, “Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks,” in Proc. Interspeech, 2017, pp. 1283–1287.
 [8] Y. Stylianou, O. Cappé, and E. Moulines, “Continuous probabilistic transform for voice conversion,” IEEE Trans. Speech and Audio Process., vol. 6, no. 2, pp. 131–142, 1998.

 [9] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 15, no. 8, pp. 2222–2235, 2007.
 [10] E. Helander, T. Virtanen, J. Nurminen, and M. Gabbouj, “Voice conversion using partial least squares regression,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 18, no. 5, pp. 912–921, 2010.
 [11] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, “Voice conversion using deep neural networks with layer-wise generative training,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 22, no. 12, pp. 1859–1872, 2014.
 [12] T. Nakashika, T. Takiguchi, and Y. Ariki, “Voice conversion based on speaker-dependent restricted Boltzmann machines,” IEICE Trans. Inf. Syst., vol. 97, no. 6, pp. 1403–1410, 2014.
 [13] S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, “Spectral mapping using artificial neural networks for voice conversion,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 18, no. 5, pp. 954–964, 2010.
 [14] S. H. Mohammadi and A. Kain, “Voice conversion using deep neural networks with speaker-independent pre-training,” in Proc. SLT, 2014, pp. 19–23.
 [15] T. Nakashika, T. Takiguchi, and Y. Ariki, “High-order sequence modeling using speaker-dependent recurrent temporal restricted Boltzmann machines for voice conversion,” in Proc. Interspeech, 2014, pp. 2278–2282.

 [16] L. Sun, S. Kang, K. Li, and H. Meng, “Voice conversion using deep bidirectional long short-term memory based recurrent neural networks,” in Proc. ICASSP, 2015, pp. 4869–4873.
 [17] M. Blaauw and J. Bonada, “Modeling and transforming speech using variational autoencoders,” in Proc. Interspeech, 2016, pp. 1770–1774.
 [18] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from non-parallel corpora using variational auto-encoder,” in Proc. APSIPA, 2016, pp. 1–6.
 [19] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” in Proc. Interspeech, 2017, pp. 3364–3368.
 [20] R. Takashima, T. Takiguchi, and Y. Ariki, “Exemplar-based voice conversion using sparse representation in noisy environments,” IEICE Trans. Inf. Syst., vol. E96-A, no. 10, pp. 1946–1953, 2013.
 [21] Z. Wu, T. Virtanen, E. S. Chng, and H. Li, “Exemplar-based sparse representation with residual compensation for voice conversion,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 22, no. 10, pp. 1506–1521, 2014.
 [22] F.-L. Xie, F. K. Soong, and H. Li, “A KL divergence and DNN-based approach to voice conversion without parallel training sentences,” in Proc. Interspeech, 2016, pp. 287–291.
 [23] T. Kinnunen, L. Juvela, P. Alku, and J. Yamagishi, “Non-parallel voice conversion using i-vector PLDA: Towards unifying speaker verification and transformation,” in Proc. ICASSP, 2017, pp. 5535–5539.
 [24] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Trans. Audio Speech Lang. Process., vol. 19, no. 4, pp. 788–798, 2011.
 [25] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. ICLR, 2014.

 [26] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, “Semi-supervised learning with deep generative models,” in Adv. Neural Information Processing Systems (NIPS), 2014, pp. 3581–3589.
 [27] Y. Saito, Y. Ijima, K. Nishida, and S. Takamichi, “Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors,” in Proc. ICASSP, 2018, pp. 5274–5278.
 [28] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” arXiv:1512.09300 [cs.LG], Dec. 2015.
 [29] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets,” in Proc. NIPS, 2016.
 [30] D. Barber and F. V. Agakov, “The IM algorithm: A variational approach to information maximization,” in Proc. NIPS, 2003.
 [31] A. Odena, C. Olah, and J. Shlens, “Conditional image synthesis with auxiliary classifier GANs,” in Proc. ICML, 2017, vol. PMLR 70, pp. 2642–2651.
 [32] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation,” arXiv:1711.09020 [cs.CV], Nov. 2017.
 [33] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo, “StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks,” arXiv:1806.02169 [cs.SD], June 2018.
 [34] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, “An adaptive algorithm for mel-cepstral analysis of speech,” in Proc. ICASSP, 1992, pp. 137–140.
 [35] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Trans. Inf. Syst., vol. E99-D, no. 7, pp. 1877–1884, 2016.
 [36] K. Kobayashi, T. Toda, and S. Nakamura, “F0 transformation techniques for statistical voice conversion with direct waveform modification with spectral differential,” in Proc. SLT, 2016, pp. 693–700.
 [37] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in Proc. ICML, 2017, pp. 933–941.
 [38] T. Kaneko and H. Kameoka, “Parallel-data-free voice conversion using cycle-consistent adversarial networks,” arXiv:1711.11293 [stat.ML], Nov. 2017.
 [39] L. Li and H. Kameoka, “Deep clustering with gated convolutional networks,” in Proc. ICASSP, 2018, pp. 16–20.
 [40] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling, “The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,” arXiv:1804.04262 [eess.AS], Apr. 2018.
 [41] K. Liu, J. Zhang, and Y. Yan, “High quality voice conversion through phoneme-based linear mapping functions with STRAIGHT for mandarin,” in Proc. FSKD, 2007, pp. 410–414.