Voice conversion (VC) is a technique for transforming the non/para-linguistic information of given speech while preserving the linguistic information. VC has great potential for application to various tasks, such as speaking aids [1, 2] and the conversion of style [3, 4] and pronunciation [5].
Successful approaches to VC include statistical methods based on Gaussian mixture models (GMMs) [6, 7, 8]; neural network (NN)-based methods using restricted Boltzmann machines (RBMs) [9, 10], feed-forward NNs (FNNs) [11, 12, 13], recurrent NNs (RNNs) [14, 15], convolutional NNs (CNNs) [5], attention networks [16, 17], and generative adversarial networks (GANs) [5]; and exemplar-based methods using non-negative matrix factorization (NMF) [18, 19].
Many VC methods (including the above-mentioned) are categorized as parallel VC, which relies on the availability of parallel utterance pairs of the source and target speakers. However, collecting such data is often laborious or impractical. Even if obtaining such data is feasible, many VC methods require a time alignment procedure as a pre-process, which may occasionally fail and requires careful pre-screening or manual correction. To overcome these restrictions, this paper focuses on non-parallel VC, which does not rely on parallel utterances, transcriptions, or time alignment procedures.
In general, non-parallel VC is quite challenging and is inferior to parallel VC in terms of quality due to the disadvantages of the training conditions. To alleviate these severe conditions, several studies have incorporated an extra module (e.g., an automatic speech recognition (ASR) module [20, 21]) or extra data (e.g., parallel utterance pairs among reference speakers [22, 23, 24, 25]). Although these additional modules or data are helpful for training, preparing them imposes other costs and thus limits application. To avoid such additional costs, recent studies have examined the use of probabilistic NNs (e.g., an RBM [26] and variational autoencoders (VAEs) [27, 28]), which embed the acoustic features into a common low-dimensional space with the supervision of speaker identification. It is noteworthy that they are free from extra data, modules, and time alignment procedures. However, one limitation is that they need to approximate the data distribution explicitly (e.g., a Gaussian is typically used), which tends to cause over-smoothing through statistical averaging.
To overcome these limitations, recent studies [27, 29, 30] have incorporated GANs [31], which can learn a generative distribution close to the target without explicit approximation, thus avoiding the over-smoothing caused by statistical averaging. Among these, in contrast to some of the frame-by-frame methods [27, 30], which have difficulty in learning time dependencies, CycleGAN-VC [29] (published in [32]) makes it possible to learn a sequence-based mapping function by using CycleGAN [33, 34, 35] with a gated CNN [36] and identity-mapping loss [37]. This allows sequential and hierarchical structures to be captured while preserving linguistic information. With this improvement, CycleGAN-VC performed comparably to a parallel VC method.
However, even with CycleGAN-VC, there is still a challenging gap to bridge between the real target and converted speech. To reduce this gap, we propose CycleGAN-VC2, which is an improved version of CycleGAN-VC incorporating three new techniques: an improved objective (two-step adversarial losses), improved generator (2-1-2D CNN), and improved discriminator (PatchGAN). We analyzed the effect of each technique on the Spoke (i.e., non-parallel VC) task of the Voice Conversion Challenge 2018 (VCC 2018) [38]. An objective evaluation showed that the proposed techniques help bring the converted acoustic feature sequence closer to the target in terms of global and local structures, which we assess by using Mel-cepstral distortion and modulation spectra distance, respectively. A subjective evaluation showed that CycleGAN-VC2 outperforms CycleGAN-VC in terms of naturalness and similarity for every speaker pair, including intra-gender and inter-gender pairs.
In Section 2 of this paper, we review the conventional CycleGAN-VC. In Section 3, we describe CycleGAN-VC2, which is an improved version of CycleGAN-VC incorporating three new techniques. In Section 4, we report the experimental results. We conclude in Section 5 with a brief summary and mention future work.
2 Conventional CycleGAN-VC
2.1 Objective: One-Step Adversarial Loss
Let $x \in \mathbb{R}^{Q \times T_x}$ and $y \in \mathbb{R}^{Q \times T_y}$ be acoustic feature sequences belonging to source $X$ and target $Y$, respectively, where $Q$ is the feature dimension and $T_x$ and $T_y$ are the sequence lengths. The goal of CycleGAN-VC is to learn a mapping $G_{X \rightarrow Y}$, which converts $x \in X$ into $y \in Y$, without relying on parallel data. Inspired by CycleGAN [33], CycleGAN-VC uses an adversarial loss [31] and a cycle-consistency loss [39]. Additionally, to encourage the preservation of linguistic information, CycleGAN-VC also uses an identity-mapping loss [37].
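Adversarial loss: To make a converted feature $G_{X \rightarrow Y}(x)$ indistinguishable from a target $y$, an adversarial loss is used:
$$\mathcal{L}_{adv}(G_{X \rightarrow Y}, D_Y) = \mathbb{E}_{y \sim P_Y(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim P_X(x)}[\log(1 - D_Y(G_{X \rightarrow Y}(x)))],$$
where discriminator $D_Y$ attempts to find the best decision boundary between real and converted features by maximizing this loss, and $G_{X \rightarrow Y}$ attempts to generate a feature that can deceive $D_Y$ by minimizing this loss.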
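As a rough illustration of how this loss behaves, the following minimal sketch estimates it from batches of discriminator outputs. The function name and the toy probabilities are hypothetical and are not taken from the paper; the discriminator is assumed to output probabilities in (0, 1).

```python
import math

def adversarial_loss(d_real, d_fake, eps=1e-12):
    """Batch estimate of E[log D(y)] + E[log(1 - D(G(x)))].

    d_real: discriminator probabilities for real target features.
    d_fake: discriminator probabilities for converted features.
    eps guards against log(0); it is an implementation detail of this sketch.
    """
    real_term = sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real from converted features scores close
# to 0 (the maximum), whereas a deceived discriminator scores lower.
confident = adversarial_loss([0.9, 0.95], [0.1, 0.05])
fooled = adversarial_loss([0.5, 0.5], [0.5, 0.5])
```

The discriminator is trained to push this value up, while the generator is trained to push it down.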
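Cycle-consistency loss: The adversarial loss only restricts $G_{X \rightarrow Y}(x)$ to follow the target distribution and does not guarantee the linguistic consistency between input and output features. To further regularize the mapping, a cycle-consistency loss is used:
$$\mathcal{L}_{cyc}(G_{X \rightarrow Y}, G_{Y \rightarrow X}) = \mathbb{E}_{x \sim P_X(x)}[\| G_{Y \rightarrow X}(G_{X \rightarrow Y}(x)) - x \|_1] + \mathbb{E}_{y \sim P_Y(y)}[\| G_{X \rightarrow Y}(G_{Y \rightarrow X}(y)) - y \|_1],$$
where forward-inverse and inverse-forward mappings are simultaneously learned to stabilize training. This loss encourages $G_{X \rightarrow Y}$ and $G_{Y \rightarrow X}$ to find an optimal pseudo pair of $(x, y)$ through circular conversion, as shown in Fig. 1(a).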
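The circular conversion can be sketched for a single pair of sequences as follows. This is only an illustration under simplifying assumptions: the two "generators" are hand-written toy functions rather than trained networks, and all names are hypothetical.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two sequences of frames."""
    return sum(abs(p - q) for fa, fb in zip(a, b) for p, q in zip(fa, fb))

def cycle_consistency_loss(x, y, g_xy, g_yx):
    """L1 error after forward-inverse (x) and inverse-forward (y) conversion."""
    forward_inverse = l1_distance(g_yx(g_xy(x)), x)
    inverse_forward = l1_distance(g_xy(g_yx(y)), y)
    return forward_inverse + inverse_forward

# Toy stand-ins for the two generators (assumptions for this sketch).
shift_up = lambda seq: [[v + 1.0 for v in frame] for frame in seq]
shift_down = lambda seq: [[v - 1.0 for v in frame] for frame in seq]
collapse = lambda seq: [[0.0 for _ in frame] for frame in seq]

x = [[0.1, 0.2], [0.3, 0.4]]
y = [[1.1, 1.2], [1.3, 1.4]]
consistent = cycle_consistency_loss(x, y, shift_up, shift_down)  # close to 0
inconsistent = cycle_consistency_loss(x, y, shift_up, collapse)  # clearly > 0
```

A pair of mappings that invert each other incurs almost no penalty, whereas a mapping that discards the input content is penalized.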
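Identity-mapping loss: To further encourage the input preservation, an identity-mapping loss is used:
$$\mathcal{L}_{id}(G_{X \rightarrow Y}, G_{Y \rightarrow X}) = \mathbb{E}_{y \sim P_Y(y)}[\| G_{X \rightarrow Y}(y) - y \|_1] + \mathbb{E}_{x \sim P_X(x)}[\| G_{Y \rightarrow X}(x) - x \|_1].$$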
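Full objective: The full objective is written as
$$\mathcal{L}_{full} = \mathcal{L}_{adv}(G_{X \rightarrow Y}, D_Y) + \mathcal{L}_{adv}(G_{Y \rightarrow X}, D_X) + \lambda_{cyc} \mathcal{L}_{cyc}(G_{X \rightarrow Y}, G_{Y \rightarrow X}) + \lambda_{id} \mathcal{L}_{id}(G_{X \rightarrow Y}, G_{Y \rightarrow X}),$$
where $\lambda_{cyc}$ and $\lambda_{id}$ are trade-off parameters. In this formulation, an adversarial loss is used once for each cycle, as shown in Fig. 1(a). Hence, we call it a one-step adversarial loss.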
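The way the individual terms are combined can be sketched as a weighted sum. The function name, argument names, and all numeric values below are placeholders chosen for illustration, not settings from the paper.

```python
def full_objective(adv_forward, adv_inverse, cyc, idt, weight_cyc, weight_id):
    """Two adversarial terms plus weighted cycle-consistency and identity terms."""
    return adv_forward + adv_inverse + weight_cyc * cyc + weight_id * idt

# Placeholder loss values and trade-off weights (assumptions for this sketch).
total = full_objective(adv_forward=-0.2, adv_inverse=-0.3,
                       cyc=0.5, idt=0.1,
                       weight_cyc=2.0, weight_id=1.0)
```

Larger trade-off weights make the generators favour preserving the input over deceiving the discriminators.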
2.2 Generator: 1D CNN
CycleGAN-VC uses a one-dimensional (1D) CNN [5] for the generator to capture the overall relationship along the feature direction while preserving the temporal structure. This can be viewed as the direct temporal extension of a frame-by-frame model, which captures such feature relationships only within each frame. To capture the wide-range temporal structure efficiently while preserving the input structure, the generator is composed of downsampling, residual [40], and upsampling layers, as shown in Fig. 2(a). The other notable point is that CycleGAN-VC uses a gated CNN [36] to capture the sequential and hierarchical structures of acoustic features.
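The gating mechanism of a gated CNN can be sketched with a gated linear unit, in which one path is modulated element-wise by the sigmoid of a second path. In a real network both paths are convolution outputs; here they are plain lists, and all names are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_linear_unit(linear_out, gate_out):
    """Element-wise product of a linear path and a sigmoid gate."""
    return [h * sigmoid(g) for h, g in zip(linear_out, gate_out)]

features = [2.0, -1.0, 0.5]
# Large positive gate activations pass the features almost unchanged,
# whereas large negative ones suppress them.
open_gate = gated_linear_unit(features, [10.0, 10.0, 10.0])
closed_gate = gated_linear_unit(features, [-10.0, -10.0, -10.0])
```

Because the gate is computed from the data, the network can decide per element how much information to pass to the next layer.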
2.3 Discriminator: FullGAN
CycleGAN-VC uses a 2D CNN for the discriminator to focus on a 2D structure (i.e., 2D spectral texture). More precisely, as shown in Fig. 3(a), a fully connected layer is used at the last layer to determine the realness considering the input’s overall structure. Such a model is called FullGAN.
3 CycleGAN-VC2
3.1 Improved Objective: Two-Step Adversarial Losses
One well-known problem for statistical models is the over-smoothing caused by statistical averaging. The adversarial loss used in the full objective of Section 2.1 helps to alleviate this degradation, but the cycle-consistency loss, which is formulated as an L1 distance, still causes over-smoothing. To mitigate this negative effect, we introduce an additional discriminator $D'_X$ and impose an adversarial loss on the circularly converted feature, as
$$\mathcal{L}_{adv2}(G_{X \rightarrow Y}, G_{Y \rightarrow X}, D'_X) = \mathbb{E}_{x \sim P_X(x)}[\log D'_X(x)] + \mathbb{E}_{x \sim P_X(x)}[\log(1 - D'_X(G_{Y \rightarrow X}(G_{X \rightarrow Y}(x))))].$$
Similarly, we introduce $D'_Y$ and impose an adversarial loss for the inverse-forward mapping. We add these two adversarial losses to the full objective of Section 2.1. In this improved objective, we use adversarial losses twice for each cycle, as shown in Fig. 1(b). Hence, we call them two-step adversarial losses.
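The following sketch shows where the two adversarial losses are applied within one cycle: the first on the converted feature and the second on the circularly converted feature. Scalars stand in for feature sequences, and the generators and discriminators are hand-written toy functions (all names are hypothetical).

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def adversarial_term(disc, real_batch, fake_batch, eps=1e-12):
    """Batch estimate of E[log D(real)] + E[log(1 - D(fake))]."""
    real = sum(math.log(disc(r) + eps) for r in real_batch) / len(real_batch)
    fake = sum(math.log(1.0 - disc(f) + eps) for f in fake_batch) / len(fake_batch)
    return real + fake

def two_step_adversarial_losses(x_batch, y_batch, g_xy, g_yx, d_y, d_x_second):
    converted = [g_xy(x) for x in x_batch]  # first step: source -> target
    cycled = [g_yx(c) for c in converted]   # second step: back to the source
    first = adversarial_term(d_y, y_batch, converted)
    second = adversarial_term(d_x_second, x_batch, cycled)
    return first, second

# Toy stand-ins for the networks (assumptions for this sketch).
g_xy = lambda v: v + 1.0
g_yx = lambda v: v - 1.0
d_y = lambda v: sigmoid(v - 1.5)
d_x_second = lambda v: sigmoid(0.5 - v)
first, second = two_step_adversarial_losses(
    [0.0, 1.0], [1.0, 2.0], g_xy, g_yx, d_y, d_x_second)
```

The second term gives the cycled feature an adversarial training signal in addition to the L1 cycle-consistency penalty.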
3.2 Improved Generator: 2-1-2D CNN
In a VC framework [5, 29] (including CycleGAN-VC), a 1D CNN (Fig. 2(a)) is commonly used as a generator, whereas in a postfilter framework [41, 42], a 2D CNN (Fig. 2(b)) is preferred. These choices are related to the pros and cons of each network. A 1D CNN is better suited to capturing dynamical change, as it can capture the overall relationship along the feature dimension. In contrast, a 2D CNN is better suited for converting features while preserving the original structures, as it restricts the converted region to a local one. Even with a 1D CNN, residual blocks [40] can mitigate the loss of the original structure, but we find that downsampling and upsampling (which are necessary for effectively capturing the wide-range structures) become a severe cause of this degradation. To alleviate it, we have developed a network architecture called a 2-1-2D CNN, shown in Fig. 2(c). In this network, 2D convolution is used for downsampling and upsampling, and 1D convolution is used for the main conversion process (i.e., residual blocks). To adjust the channel dimension, we apply convolution before or after reshaping the feature map.
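The reshaping between the 2D and 1D parts can be sketched as follows. This is a minimal illustration with nested lists, assuming a (channels, features, time) layout; the convolution layers themselves are omitted, and the function names are hypothetical.

```python
def reshape_2d_to_1d(feature_map):
    """(channels, features, time) -> (channels * features, time)."""
    return [row for channel in feature_map for row in channel]

def reshape_1d_to_2d(rows, channels):
    """(channels * features, time) -> (channels, features, time)."""
    per_channel = len(rows) // channels
    return [rows[c * per_channel:(c + 1) * per_channel] for c in range(channels)]

# Toy feature map: 2 channels, 3 feature bins, 4 frames.
fmap = [[[c * 100 + q * 10 + t for t in range(4)] for q in range(3)]
        for c in range(2)]
flat = reshape_2d_to_1d(fmap)                   # input to the 1D blocks
restored = reshape_1d_to_2d(flat, channels=2)   # input to 2D upsampling
```

After flattening, a 1D convolution over time sees every channel and feature bin at once, which is what lets the 1D part model the overall relationship along the feature dimension.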
3.3 Improved Discriminator: PatchGAN
In previous GAN-based speech processing models [41, 42, 5, 29], FullGAN (Fig. 3(a)) has been extensively used. However, recent studies in computer vision [43, 44] indicate that the wide-range receptive fields of the discriminator require more parameters, which causes difficulty in training. Inspired by this, we replace FullGAN with PatchGAN [45, 43, 44] (Fig. 3(b)), which uses convolution at the last layer and determines the realness on the basis of the patch. We experimentally examine its effect for non-parallel VC in Section 4.2.
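The idea of judging realness per patch rather than for the whole input can be illustrated as follows. This is a simplified sketch: a real PatchGAN obtains its patch-wise outputs from a stack of convolutions, whereas here a hypothetical scoring function is applied to explicit windows.

```python
def patch_scores(feature_map, patch, stride, score_fn):
    """Apply score_fn to each patch and return a grid of patch-wise scores."""
    rows, cols = len(feature_map), len(feature_map[0])
    grid = []
    for r in range(0, rows - patch + 1, stride):
        grid_row = []
        for c in range(0, cols - patch + 1, stride):
            window = [feature_map[r + i][c:c + patch] for i in range(patch)]
            grid_row.append(score_fn(window))
        grid.append(grid_row)
    return grid

def mean_score(window):
    """Toy stand-in for a learned realness score."""
    return sum(map(sum, window)) / (len(window) * len(window[0]))

# Toy 6 x 8 input with a checkerboard pattern.
fmap = [[float((r + c) % 2) for c in range(8)] for r in range(6)]
grid = patch_scores(fmap, patch=2, stride=2, score_fn=mean_score)
```

A FullGAN-style discriminator would reduce the same input to a single score, whereas the patch-wise variant returns one decision per local region.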
4 Experiments
4.1 Experimental Conditions
Dataset: We evaluated our method on the Spoke (i.e., non-parallel VC) task of the VCC 2018 [38], which includes recordings of professional US English speakers. We selected a subset of speakers so as to cover all inter-gender and intra-gender conversions: VCC2SF3 (SF), VCC2SM3 (SM), VCC2TF1 (TF), and VCC2TM1 (TM), where S, T, F, and M indicate source, target, female, and male, respectively. In the following, we use the abbreviations in parentheses (e.g., SF). Combinations of 2 sources (SF or SM) $\times$ 2 targets (TF or TM) were used for evaluation. Each speaker has sets of 81 (about 5 minutes; relatively few for VC) and 35 sentences for training and evaluation, respectively. In the Spoke task, the source and target speakers have a different set of sentences (no overlap) so as to evaluate in a non-parallel setting. The recordings were downsampled to 22.05 kHz for this challenge. We extracted 34 Mel-cepstral coefficients (MCEPs), logarithmic fundamental frequency ($\log F_0$), and aperiodicities (APs) every 5 ms by using the WORLD analyzer [46].
Conversion process: The proposed method was used to convert MCEPs ($Q = 34 + 1$ dimensions including the 0th coefficient). The objective of these experiments was to analyze the quality of the converted MCEPs; therefore, for the other parts, we used typical methods similar to the baseline of the VCC 2018 [38]. Specifically, in inter-gender conversion, a vocoder-based VC method was used: $F_0$ was converted by using a logarithm Gaussian normalized transformation [47], APs were directly used without modification, and the WORLD vocoder [46] was used to synthesize speech. In intra-gender conversion, we used a vocoder-free VC method [48]. More precisely, we calculated differential MCEPs by taking the difference between the source and converted MCEPs. For a similar reason, we did not use any postfilter [41, 42, 49] or powerful vocoder such as the WaveNet vocoder [50, 51]. Incorporating them is one possible direction of future work.
Footnote 2: For reference, the converted speech samples, in which the proposed method was used to convert all acoustic features (namely, MCEPs, band APs, continuous $\log F_0$, and voiced/unvoiced indicator), are provided at http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/cyclegan-vc2/index.html. Even in this challenging setting, CycleGAN-VC2 works reasonably well.
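The logarithm Gaussian normalized transformation mentioned above is commonly implemented by matching the mean and standard deviation of the log-scaled fundamental frequency between speakers. The sketch below follows that common form; the function name, the statistics, and the convention that unvoiced frames are stored as 0 are assumptions of this example rather than details given in the paper.

```python
import math

def convert_f0(f0_src, mean_src, std_src, mean_tgt, std_tgt):
    """Map source F0 values (Hz) to the target speaker's log-F0 statistics.

    mean_* and std_* are statistics of log F0 over voiced training frames.
    Frames with F0 <= 0 are treated as unvoiced and left at 0.
    """
    converted = []
    for f0 in f0_src:
        if f0 <= 0.0:
            converted.append(0.0)
        else:
            log_f0 = (math.log(f0) - mean_src) / std_src * std_tgt + mean_tgt
            converted.append(math.exp(log_f0))
    return converted

# A frame at the source mean (120 Hz) is mapped to the target mean (220 Hz).
result = convert_f0([0.0, 120.0],
                    mean_src=math.log(120.0), std_src=0.2,
                    mean_tgt=math.log(220.0), std_tgt=0.25)
```

In practice the four statistics would be estimated from the training utterances of each speaker.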
Training details: The implementation was almost the same as that of CycleGAN-VC except that the improved techniques were incorporated. The details of the network architectures are given in Fig. 4. For a pre-process, we normalized the source and target MCEPs to zero-mean and unit-variance by using the statistics of the training sets. To stabilize training, we used a least squares GAN (LSGAN) [52]. To increase the randomness of training data, we randomly cropped a segment (128 frames) from a randomly selected sentence instead of using an overall sentence directly. We used the Adam optimizer [53] with a batch size of 1. We trained the networks for iterations with learning rates of 0.0002 for the generator and 0.0001 for the discriminator and with momentum term of 0.5. We set and . We used only for the first iterations to guide the learning direction. Note that we did not use any extra data, modules, or time alignment procedures for training.
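The pre-processing described above (normalization with training-set statistics followed by a random 128-frame crop) can be sketched as follows. The helper names and the toy utterance are hypothetical, and the sketch assumes every utterance has at least 128 frames.

```python
import math
import random

def mean_and_std(frames):
    """Per-dimension mean and standard deviation over all frames."""
    dims, n = len(frames[0]), len(frames)
    mean = [sum(f[d] for f in frames) / n for d in range(dims)]
    std = [math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / n)
           for d in range(dims)]
    return mean, std

def normalize(frames, mean, std):
    """Scale each dimension to zero mean and unit variance."""
    return [[(v - m) / s for v, m, s in zip(frame, mean, std)]
            for frame in frames]

def random_crop(frames, length=128, rng=random):
    """Pick a random contiguous segment of the given number of frames."""
    start = rng.randint(0, len(frames) - length)  # both bounds inclusive
    return frames[start:start + length]

# Toy utterance: 300 frames with 2 feature dimensions.
utterance = [[float(t), float(2 * t)] for t in range(300)]
mean, std = mean_and_std(utterance)
normalized = normalize(utterance, mean, std)
segment = random_crop(normalized, length=128, rng=random.Random(0))
```

Cropping a different segment at every iteration exposes the networks to many more distinct training examples than the number of sentences alone would provide.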
4.2 Objective Evaluation
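MCD (mean ± standard deviation for the four speaker pairs):
|7|Frame-based CycleGAN [30]|8.85 ± .07|7.27 ± .11|8.86 ± .27|8.51 ± .36|
MSD (mean ± standard deviation for the four speaker pairs):
|7|Frame-based CycleGAN [30]|3.78 ± .26|2.77 ± .10|3.32 ± .06|3.61 ± .15|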
As discussed in previous studies [7, 41], it is fairly complex to design a single metric that can assess the quality of converted MCEPs comprehensively. Alternatively, we used two metrics to assess the local and global structures. To measure global structural differences, we used the Mel-cepstral distortion (MCD), which measures the distance between the target and converted MCEP sequences. To measure the local structural differences, we used the modulation spectra distance (MSD), which is defined as the root mean square error between the target and converted logarithmic modulation spectra of MCEPs averaged over all MCEP dimensions and modulation frequencies. For both metrics, smaller values indicate that target and converted MCEPs are more similar.
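For concreteness, the MCD can be computed as in the following sketch, which uses the commonly adopted definition in decibels. The paper does not spell out its exact formula, so the scaling, the exclusion of the 0th coefficient, and the requirement that both sequences are already time-aligned are assumptions of this example.

```python
import math

def mel_cepstral_distortion(target, converted):
    """Mean MCD in dB over frames.

    Both inputs are lists of frames with the same length; each frame holds
    the MCEPs without the 0th (power) coefficient.
    """
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    per_frame = [
        scale * math.sqrt(sum((t - c) ** 2 for t, c in zip(tf, cf)))
        for tf, cf in zip(target, converted)
    ]
    return sum(per_frame) / len(per_frame)

same = mel_cepstral_distortion([[1.0, 0.5]], [[1.0, 0.5]])
unit_error = mel_cepstral_distortion([[1.0, 0.0]], [[0.0, 0.0]])
```

Identical sequences give a distortion of zero, and the value grows with the Euclidean distance between corresponding frames.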
The MCD and MSD results are listed in the two tables above, respectively. To eliminate the effect of initialization, we report the average and standard deviation scores over three random initializations. To analyze the effect of each technique, we conducted ablation studies on CycleGAN-VC2 (no. 5 is the full model). We also compared CycleGAN-VC2 with two state-of-the-art methods: CycleGAN-VC [29] and frame-based CycleGAN [30] (our reimplementation; we additionally used for stabilizing training). The comparison of one-step and two-step adversarial losses (nos. 1, 5) indicates that this technique is particularly effective for improving MSD. The comparisons of generator (nos. 2, 3, 5) and discriminator (nos. 4, 5) network architectures indicate that they contribute to improving both MCD and MSD. Finally, the comparison to the baselines (nos. 5, 6, 7) verifies that by incorporating the three proposed techniques, we achieve state-of-the-art performance in terms of MCD and MSD for every speaker pair.
4.3 Subjective Evaluation
We conducted listening tests to evaluate the quality of converted speech. CycleGAN-VC [29] was used as the baseline. To measure naturalness, we conducted a mean opinion score (MOS) test (5: excellent to 1: bad), in which we included the target speech as a reference (the MOS for both TF and TM is 4.8). Ten sentences were randomly selected from the evaluation sets. To measure speaker similarity, we conducted an XAB test, where “A” and “B” were speech converted by the baseline and proposed methods, and “X” was target speech. We selected ten sentence pairs randomly from the evaluation sets and presented all pairs in both orders (AB and BA) to eliminate bias in the order of stimuli. For each sentence pair, the listeners were asked to select their preferred one (“A” or “B”) or to opt for “Fair.” Ten listeners participated in these listening tests. Figs. 5 and 6 show the MOS for naturalness and the preference scores for speaker similarity, respectively. These results confirm that CycleGAN-VC2 outperforms CycleGAN-VC in terms of both naturalness and similarity for every speaker pair. In particular, CycleGAN-VC is difficult to apply to a vocoder-free VC framework [48] (used in SF-TF and SM-TM), as this framework is sensitive to conversion error due to its use of differential MCEPs. However, the MOS indicates that CycleGAN-VC2 works relatively well in such a difficult setting.
5 Conclusions
To advance the research on non-parallel VC, we have proposed CycleGAN-VC2, which is an improved version of CycleGAN-VC incorporating three new techniques: an improved objective (two-step adversarial losses), improved generator (2-1-2D CNN), and improved discriminator (PatchGAN). The experimental results demonstrate that CycleGAN-VC2 outperforms CycleGAN-VC in both objective and subjective measures for every speaker pair. Our proposed techniques are not limited to one-to-one VC, and adapting them to other settings (e.g., multi-domain VC [56]) and other applications [1, 2, 3, 4, 5] remains an interesting future direction.
Acknowledgements: This work was supported by JSPS KAKENHI 17H01763.
References
[1] Alexander B Kain, John-Paul Hosom, Xiaochuan Niu, Jan P. H. van Santen, Melanie Fried-Oken, and Janice Staehely, “Improving the intelligibility of dysarthric speech,” Speech Commun., vol. 49, no. 9, pp. 743–759, 2007.
[2] Keigo Nakamura, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano, “Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech,” Speech Commun., vol. 54, no. 1, pp. 134–146, 2012.
[3] Zeynep Inanoglu and Steve Young, “Data-driven emotion conversion in spoken English,” Speech Commun., vol. 51, no. 3, pp. 268–283, 2009.
[4] Tomoki Toda, Mikihiro Nakagiri, and Kiyohiro Shikano, “Statistical voice conversion techniques for body-conducted unvoiced speech enhancement,” IEEE Trans. Audio Speech Lang. Process., vol. 20, no. 9, pp. 2505–2517, 2012.
[5] Takuhiro Kaneko, Hirokazu Kameoka, Kaoru Hiramatsu, and Kunio Kashino, “Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks,” in Proc. Interspeech, 2017, pp. 1283–1287.
[6] Yannis Stylianou, Olivier Cappé, and Eric Moulines, “Continuous probabilistic transform for voice conversion,” IEEE Trans. Speech and Audio Process., vol. 6, no. 2, pp. 131–142, 1998.
[7] Tomoki Toda, Alan W Black, and Keiichi Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 8, pp. 2222–2235, 2007.
[8] Elina Helander, Tuomas Virtanen, Jani Nurminen, and Moncef Gabbouj, “Voice conversion using partial least squares regression,” IEEE Trans. Audio Speech Lang. Process., vol. 18, no. 5, pp. 912–921, 2010.
[9] Ling-Hui Chen, Zhen-Hua Ling, Li-Juan Liu, and Li-Rong Dai, “Voice conversion using deep neural networks with layer-wise generative training,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 22, no. 12, pp. 1859–1872, 2014.
[10] Toru Nakashika, Tetsuya Takiguchi, and Yasuo Ariki, “Voice conversion based on speaker-dependent restricted Boltzmann machines,” IEICE Trans. Inf. Syst., vol. 97, no. 6, pp. 1403–1410, 2014.
[11] Srinivas Desai, Alan W Black, B Yegnanarayana, and Kishore Prahallad, “Spectral mapping using artificial neural networks for voice conversion,” IEEE Trans. Audio Speech Lang. Process., vol. 18, no. 5, pp. 954–964, 2010.
[12] Seyed Hamidreza Mohammadi and Alexander Kain, “Voice conversion using deep neural networks with speaker-independent pre-training,” in Proc. SLT, 2014, pp. 19–23.
[13] Keisuke Oyamada, Hirokazu Kameoka, Takuhiro Kaneko, Hiroyasu Ando, Kaoru Hiramatsu, and Kunio Kashino, “Non-native speech conversion with consistency-aware recursive network and generative adversarial network,” in Proc. APSIPA ASC, 2017, pp. 182–188.
[14] Toru Nakashika, Tetsuya Takiguchi, and Yasuo Ariki, “High-order sequence modeling using speaker-dependent recurrent temporal restricted Boltzmann machines for voice conversion,” in Proc. Interspeech, 2014, pp. 2278–2282.
[15] Lifa Sun, Shiyin Kang, Kun Li, and Helen Meng, “Voice conversion using deep bidirectional long short-term memory based recurrent neural networks,” in Proc. ICASSP, 2015, pp. 4869–4873.
[16] Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, and Nobukatsu Hojo, “AttS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms,” in Proc. ICASSP, 2019.
[17] Hirokazu Kameoka, Kou Tanaka, Takuhiro Kaneko, and Nobukatsu Hojo, “ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion,” arXiv preprint arXiv:1811.01609, Nov. 2018.
[18] Ryoichi Takashima, Tetsuya Takiguchi, and Yasuo Ariki, “Exemplar-based voice conversion using sparse representation in noisy environments,” IEICE Trans. Inf. Syst., vol. E96-A, no. 10, pp. 1946–1953, 2013.
[19] Zhizheng Wu, Tuomas Virtanen, Eng Siong Chng, and Haizhou Li, “Exemplar-based sparse representation with residual compensation for voice conversion,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 22, no. 10, pp. 1506–1521, 2014.
[20] Feng-Long Xie, Frank K Soong, and Haifeng Li, “A KL divergence and DNN-based approach to voice conversion without parallel training sentences,” in Proc. Interspeech, 2016, pp. 287–291.
[21] Yuki Saito, Yusuke Ijima, Kyosuke Nishida, and Shinnosuke Takamichi, “Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors,” in Proc. ICASSP, 2018, pp. 5274–5278.
[22] Athanasios Mouchtaris, Jan Van der Spiegel, and Paul Mueller, “Nonparallel training for voice conversion based on a parameter adaptation approach,” IEEE Trans. Audio Speech Lang. Process., vol. 14, no. 3, pp. 952–963, 2006.
[23] Chung-Han Lee and Chung-Hsien Wu, “MAP-based adaptation for speech conversion using adaptation data selection and non-parallel training,” in Proc. ICSLP, 2006, pp. 2254–2257.
[24] Tomoki Toda, Yamato Ohtani, and Kiyohiro Shikano, “Eigenvoice conversion based on Gaussian mixture model,” in Proc. Interspeech, 2006, pp. 2446–2449.
[25] Daisuke Saito, Keisuke Yamamoto, Nobuaki Minematsu, and Keikichi Hirose, “One-to-many voice conversion based on tensor representation of speaker space,” in Proc. Interspeech, 2011, pp. 653–656.
[26] Toru Nakashika, Tetsuya Takiguchi, and Yasuhiro Minami, “Non-parallel training in voice conversion using an adaptive restricted Boltzmann machine,” IEEE/ACM Trans. Audio Speech Lang. Process., vol. 24, no. 11, pp. 2032–2045, 2016.
[27] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” in Proc. Interspeech, 2017, pp. 3364–3368.
[28] Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, “ACVAE-VC: Non-parallel many-to-many voice conversion with auxiliary classifier variational autoencoder,” arXiv preprint arXiv:1808.05092, Aug. 2018.
[29] Takuhiro Kaneko and Hirokazu Kameoka, “Parallel-data-free voice conversion using cycle-consistent adversarial networks,” arXiv preprint arXiv:1711.11293, Nov. 2017.
[30] Fuming Fang, Junichi Yamagishi, Isao Echizen, and Jaime Lorenzo-Trueba, “High-quality nonparallel voice conversion based on cycle-consistent adversarial network,” in Proc. ICASSP, 2018, pp. 5279–5283.
[31] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial nets,” in Proc. NIPS, 2014, pp. 2672–2680.
[32] Takuhiro Kaneko and Hirokazu Kameoka, “CycleGAN-VC: Non-parallel voice conversion using cycle-consistent adversarial networks,” in Proc. EUSIPCO, 2018, pp. 2114–2118.
[33] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. ICCV, 2017, pp. 2223–2232.
[34] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong, “DualGAN: Unsupervised dual learning for image-to-image translation,” in Proc. ICCV, 2017, pp. 2849–2857.
[35] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim, “Learning to discover cross-domain relations with generative adversarial networks,” in Proc. ICML, 2017, pp. 1857–1865.
[36] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier, “Language modeling with gated convolutional networks,” in Proc. ICML, 2017, pp. 933–941.
[37] Yaniv Taigman, Adam Polyak, and Lior Wolf, “Unsupervised cross-domain image generation,” in Proc. ICLR, 2017.
[38] Jaime Lorenzo-Trueba, Junichi Yamagishi, Tomoki Toda, Daisuke Saito, Fernando Villavicencio, Tomi Kinnunen, and Zhenhua Ling, “The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,” in Proc. Speaker Odyssey, 2018, pp. 195–202.
[39] Tinghui Zhou, Philipp Krähenbühl, Mathieu Aubry, Qixing Huang, and Alexei A Efros, “Learning dense correspondence via 3D-guided cycle consistency,” in Proc. CVPR, 2016, pp. 117–126.
[40] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proc. CVPR, 2016, pp. 770–778.
[41] Takuhiro Kaneko, Hirokazu Kameoka, Nobukatsu Hojo, Yusuke Ijima, Kaoru Hiramatsu, and Kunio Kashino, “Generative adversarial network-based postfilter for statistical parametric speech synthesis,” in Proc. ICASSP, 2017, pp. 4910–4914.
[42] Takuhiro Kaneko, Shinji Takaki, Hirokazu Kameoka, and Junichi Yamagishi, “Generative adversarial network-based postfilter for STFT spectrograms,” in Proc. Interspeech, 2017, pp. 3389–3393.
[43] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros, “Image-to-image translation with conditional adversarial networks,” in Proc. CVPR, 2017, pp. 5967–5976.
[44] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proc. CVPR, 2016, pp. 1874–1883.
[45] Chuan Li and Michael Wand, “Precomputed real-time texture synthesis with Markovian generative adversarial networks,” in Proc. ECCV, 2016, pp. 702–716.
[46] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Trans. Inf. Syst., vol. 99, no. 7, pp. 1877–1884, 2016.
[47] Kun Liu, Jianping Zhang, and Yonghong Yan, “High quality voice conversion through phoneme-based linear mapping functions with STRAIGHT for Mandarin,” in Proc. FSKD, 2007, pp. 410–414.
[48] Kazuhiro Kobayashi, Tomoki Toda, and Satoshi Nakamura, “F0 transformation techniques for statistical voice conversion with direct waveform modification with spectral differential,” in Proc. SLT, 2016, pp. 693–700.
[49] Kou Tanaka, Takuhiro Kaneko, Nobukatsu Hojo, and Hirokazu Kameoka, “Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks,” in Proc. SLT, 2018, pp. 632–639.
[50] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, Sep. 2016.
[51] Akira Tamamori, Tomoki Hayashi, Kazuhiro Kobayashi, Kazuya Takeda, and Tomoki Toda, “Speaker-dependent WaveNet vocoder,” in Proc. Interspeech, 2017, pp. 1118–1122.
[52] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley, “Least squares generative adversarial networks,” in Proc. ICCV, 2017, pp. 2794–2802.
[53] Diederik Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in Proc. ICLR, 2015.
[54] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky, “Instance normalization: The missing ingredient for fast stylization,” arXiv preprint arXiv:1607.08022, July 2016.
[55] Jonathan Long, Evan Shelhamer, and Trevor Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. CVPR, 2015, pp. 3431–3440.
[56] Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo, “StarGAN-VC: Non-parallel many-to-many voice conversion using star generative adversarial networks,” in Proc. SLT, 2018, pp. 266–273.