A PyTorch implementation of StarGAN-VC2
Non-parallel multi-domain voice conversion (VC) is a technique for learning mappings among multiple domains without relying on parallel data. This is important but challenging owing to the requirement of learning multiple mappings and the non-availability of explicit supervision. Recently, StarGAN-VC has garnered attention owing to its ability to solve this problem using only a single generator. However, there is still a gap between real and converted speech. To bridge this gap, we rethink conditional methods of StarGAN-VC, which are key components for achieving non-parallel multi-domain VC in a single model, and propose an improved variant called StarGAN-VC2. Particularly, we rethink conditional methods in two aspects: training objectives and network architectures. For the former, we propose a source-and-target conditional adversarial loss that allows all source domain data to be convertible to the target domain data. For the latter, we introduce a modulation-based conditional method that can transform the modulation of the acoustic feature in a domain-specific manner. We evaluated our methods on non-parallel multi-speaker VC. An objective evaluation demonstrates that our proposed methods improve speech quality in terms of both global and local structure measures. Furthermore, a subjective evaluation shows that StarGAN-VC2 outperforms StarGAN-VC in terms of naturalness and speaker similarity. The converted speech samples are provided at http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/stargan-vc2/index.html.
Voice conversion (VC) is a technique for converting the non/para-linguistic information between source and target speech while preserving the linguistic information. VC has been studied intensively owing to its high potential for various applications, such as speaking aids [1, 2], style conversion [3, 4], and pronunciation conversion.
One well-established approach to VC involves statistical methods based on Gaussian mixture models (GMMs) [6, 7, 8], neural networks (NNs) (including restricted Boltzmann machines (RBMs) [9, 10], feed forward NNs (FNNs) [11, 12, 13], recurrent NNs (RNNs) [14, 15], convolutional NNs (CNNs), attention networks [16, 17], and generative adversarial networks (GANs)), and exemplar-based methods using non-negative matrix factorization (NMF) [18, 19].
Many VC methods (including the above-mentioned) are categorized as parallel VC, which learns a mapping using the training data of parallel utterance pairs. However, obtaining such data is often time-consuming or impractical. Moreover, even if such data are obtained, most VC methods rely on a time alignment procedure, which occasionally fails and requires other painstaking processes, i.e., careful pre-screening or manual correction.
As a solution, non-parallel VC has begun to be studied. Achieving non-parallel VC that performs comparably to parallel VC is generally quite challenging owing to its disadvantageous training conditions. To mitigate this difficulty, several studies have used additional data (e.g., parallel utterance pairs among reference speakers [20, 21, 22, 23]) or extra modules (e.g., automatic speech recognition (ASR) modules [24, 25]). These additional data and extra modules are useful for simplifying training but incur extra preparation costs. To avoid such additional costs, recent studies have introduced probabilistic deep generative models, such as an RBM, variational autoencoders (VAEs) [27, 28], and GANs [27, 29]. Among them, CycleGAN-VC (first published and later further improved) shows promising results by configuring CycleGAN [32, 33, 34] with a gated CNN and an identity-mapping loss. This makes it possible to learn a sequence-based mapping function without relying on parallel data. With this improvement, CycleGAN-VC performs comparably to parallel VC.
Along with non-parallel VC, another practically important issue is non-parallel multi-domain VC, i.e., learning mappings among multiple domains (e.g., multiple speakers) without relying on parallel data. This problem is challenging in terms of scalability because typical VC methods (including CycleGAN-VC) are designed to learn a one-to-one mapping; therefore, they require the learning of multiple generators to achieve multi-domain VC. For this problem, StarGAN-VC provides a promising solution by extending CycleGAN-VC to a conditional setting and incorporating domain codes. Through this extension, StarGAN-VC makes it possible to achieve non-parallel multi-domain VC using only a single generator while maintaining the advantage of CycleGAN-VC. The subjective evaluation demonstrates that StarGAN-VC outperforms another state-of-the-art method, i.e., VAE/GAN-VC.
However, even using StarGAN-VC, there is still a gap between real and converted speech. To bridge this gap, we rethink conditional methods of StarGAN-VC, which are key components for solving non-parallel multi-domain VC using a single generator, and propose an improved variant called StarGAN-VC2. In particular, we rethink conditional methods in two aspects: training objectives and network architectures. For the former, we propose a source-and-target conditional adversarial loss, which encourages all source domain data to be converted into the target data. For the latter, we introduce a modulation-based conditional method that can transform the modulation of acoustic features in a domain-dependent manner. We examined the performance of the proposed methods on the multi-speaker VC task using the Voice Conversion Challenge 2018 (VCC 2018) dataset. An objective evaluation demonstrates that the proposed methods effectively bring the converted acoustic feature sequence close to the target one in terms of both global and local structure measures. A subjective evaluation shows that StarGAN-VC2 outperforms StarGAN-VC in terms of both naturalness and speaker similarity.
The aim of StarGAN-VC is to obtain a single generator that learns mappings among multiple domains (e.g., multiple speakers). To achieve this, StarGAN-VC extends CycleGAN-VC to a conditional setting with a domain code (e.g., a speaker identifier). More precisely, StarGAN-VC learns a generator $G$ that converts an input acoustic feature $x$ into an output feature $x'$ conditioned on the target domain code $c'$, i.e., $G(x, c') \rightarrow x'$. Here, let $x \in \mathbb{R}^{Q \times T}$ be an acoustic feature sequence where $Q$ is the feature dimension and $T$ is the sequence length, and let $c \in \{1, \dots, N\}$ be the corresponding domain code where $N$ is the number of domains. Inspired by StarGAN, StarGAN-VC is trained with an adversarial loss [40], a classification loss, and a cycle-consistency loss. Additionally, inspired by CycleGAN-VC, StarGAN-VC also uses an identity-mapping loss to preserve linguistic composition.
Adversarial loss: The adversarial loss is used to render the converted feature indistinguishable from the real target feature:
$$\mathcal{L}_{t\text{-}adv} = \mathbb{E}_{(x, c) \sim P(x, c)}[\log D(x, c)] + \mathbb{E}_{x \sim P(x),\, c' \sim P(c')}[\log(1 - D(G(x, c'), c'))], \tag{1}$$
where $D$ is a target conditional discriminator. By maximizing this loss, $D$ attempts to learn the best decision boundary between the converted and real acoustic features conditioned on the target domain codes ($c$ and $c'$). In contrast, the generator $G$ attempts to render the converted feature indistinguishable from real acoustic features conditioned on $c'$ by minimizing this loss.
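To make the min-max roles concrete, the following is a minimal sketch that estimates an adversarial loss of this form from discriminator outputs. It assumes the discriminator returns probabilities in (0, 1); the function name and the sample values are hypothetical and not taken from the paper.

```python
import math

def target_adversarial_loss(d_real, d_fake):
    """Monte Carlo estimate of a target conditional adversarial loss.

    d_real: discriminator outputs for real features, each in (0, 1).
    d_fake: discriminator outputs for converted features, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical discriminator outputs.
confident = target_adversarial_loss([0.9, 0.8], [0.1, 0.2])  # D separates real and converted
fooled = target_adversarial_loss([0.9, 0.8], [0.6, 0.7])     # G fools D more often
```

The discriminator benefits from the larger value (`confident`), whereas the generator benefits from the smaller one (`fooled`), which matches the maximize/minimize roles described above.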
Classification loss: The aim of StarGAN-VC is to synthesize the acoustic feature that belongs to the target domain. To achieve this, the classification loss is used. First, the classifier $C$ is trained for real acoustic features:
$$\mathcal{L}_{cls}^{r} = \mathbb{E}_{(x, c) \sim P(x, c)}[-\log C(c \mid x)], \tag{2}$$
where $C$ attempts to classify a real acoustic feature to the corresponding domain by minimizing this loss. Subsequently, the generator $G$ is optimized for $C$:
$$\mathcal{L}_{cls}^{f} = \mathbb{E}_{x \sim P(x),\, c' \sim P(c')}[-\log C(c' \mid G(x, c'))], \tag{3}$$
where $G$ attempts to generate an acoustic feature that is classified to the target domain by minimizing this loss.
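Both classification terms are negative log-likelihoods of a domain code under the classifier. The sketch below assumes the classifier outputs one logit per domain and applies a softmax; the function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classification_loss(logits_batch, domain_codes):
    """Average negative log-likelihood of the given domain codes.

    For the real-data term, pass logits of real features with their true codes.
    For the generator term, pass logits of converted features with the target codes.
    """
    losses = [-math.log(softmax(logits)[code])
              for logits, code in zip(logits_batch, domain_codes)]
    return sum(losses) / len(losses)

# Hypothetical logits for two features and four domains.
logits = [[4.0, 0.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0]]
loss_correct = classification_loss(logits, [0, 2])
loss_wrong = classification_loss(logits, [1, 3])
```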
Cycle-consistency loss: Although the adversarial loss and classification loss encourage a converted acoustic feature to become realistic and classifiable, respectively, they do not guarantee that the converted feature will preserve the input composition. To mitigate this problem, the cycle-consistency loss is used:
$$\mathcal{L}_{cyc} = \mathbb{E}_{(x, c) \sim P(x, c),\, c' \sim P(c')}[\| x - G(G(x, c'), c) \|_1]. \tag{4}$$
This cyclic constraint encourages the generator $G$ to find an optimal source and target pair that does not compromise the composition.
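The cyclic constraint can be illustrated with a toy example: convert a feature to the target domain, convert it back to the source domain, and measure the L1 distance to the input. The generator below is a hypothetical stand-in that only shifts the mean of a 1D feature; it is not the network used in the paper.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equally long sequences."""
    return sum(abs(u - v) for u, v in zip(a, b))

# Hypothetical per-domain feature means used by the toy generator.
DOMAIN_MEANS = {0: -1.0, 1: 2.0}

def toy_generator(x, target_code):
    """Toy stand-in for G: shifts x so that its mean matches the target domain."""
    mean = sum(x) / len(x)
    return [v - mean + DOMAIN_MEANS[target_code] for v in x]

def cycle_consistency_loss(x, source_code, target_code, generator=toy_generator):
    """L1 distance between x and its source -> target -> source reconstruction."""
    converted = generator(x, target_code)
    reconstructed = generator(converted, source_code)
    return l1_distance(x, reconstructed)

x = [-1.5, -1.0, -0.5]  # mean is -1.0, i.e., it belongs to domain 0
loss_preserving = cycle_consistency_loss(x, 0, 1)

# A generator that ignores its input loses the composition of x.
collapsing = lambda feat, code: [DOMAIN_MEANS[code]] * len(feat)
loss_collapsing = cycle_consistency_loss(x, 0, 1, generator=collapsing)
```

The shift-only generator reconstructs the input exactly, so its loss is zero, whereas the collapsing generator is penalized because the frame-to-frame structure of the input is lost.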
Identity-mapping loss: To impose a further constraint on the input preservation, the identity-mapping loss is used:
$$\mathcal{L}_{id} = \mathbb{E}_{(x, c) \sim P(x, c)}[\| G(x, c) - x \|_1]. \tag{5}$$
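Full objective: The full objective is written as
$$\mathcal{L}_{G} = \mathcal{L}_{t\text{-}adv} + \lambda_{cls} \mathcal{L}_{cls}^{f} + \lambda_{cyc} \mathcal{L}_{cyc} + \lambda_{id} \mathcal{L}_{id}, \quad \mathcal{L}_{D} = -\mathcal{L}_{t\text{-}adv}, \quad \mathcal{L}_{C} = \mathcal{L}_{cls}^{r}, \tag{6}$$
where $G$, $D$, and $C$ are optimized by minimizing $\mathcal{L}_{G}$, $\mathcal{L}_{D}$, and $\mathcal{L}_{C}$, respectively, and $\lambda_{cls}$, $\lambda_{cyc}$, and $\lambda_{id}$ are trade-off parameters.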
Regarding the network architectures, this study focuses on the conditional method in the generator. Hence, here we review the StarGAN-VC generator network architecture. As shown in Figure 2(a), StarGAN-VC incorporates conditional information into the generator in a channel-wise manner: it first creates the one-hot vector indicating the domain code, subsequently expands the one-hot vector to the feature map size, and finally concatenates it to the feature map. The concatenated features are convolved together and propagated to the next layer.
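The three steps of channel-wise conditioning (create a one-hot vector, expand it to the feature map size, concatenate along the channel axis) can be sketched as follows. For brevity, the sketch assumes a feature map of shape (channels, frames) stored as nested lists; the helper names are hypothetical.

```python
def one_hot(code, num_domains):
    """One-hot vector indicating the domain code."""
    return [1.0 if i == code else 0.0 for i in range(num_domains)]

def concat_domain_code(feature_map, code, num_domains):
    """Append one constant channel per domain to the feature map.

    feature_map: list of channels, each a list of T values.
    """
    num_frames = len(feature_map[0])
    # Expand each one-hot entry along the time axis.
    code_channels = [[v] * num_frames for v in one_hot(code, num_domains)]
    # Concatenate along the channel axis without modifying the input.
    return feature_map + code_channels

features = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # 2 channels, 3 frames
conditioned = concat_domain_code(features, code=2, num_domains=4)
```

A subsequent convolution sees the code channels only as additional inputs that are summed with the other channels, so the code can shift activations but cannot scale them.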
We first rethink the conditional method in the training objectives. As described in Section 2.1, StarGAN-VC uses two conditional methods to make the converted feature belong to the target domain: the classification loss (Equations 2 and 3) and the target conditional adversarial loss (Equation 1). We illustrate their training strategies in Figure 1(a) and (b), respectively.
In the classification loss (Figure 1(a)), via Equation 2, the decision boundary (black line) is learned among real-data domains (e.g., between the real data of two different domains in Figure 1(a)). For this decision boundary, the generator $G$ attempts to generate easily “classifiable” data via Equation 3. This means that $G$ prefers to generate data that are far from the decision boundary even when the real data exist around the decision boundary. As discussed elsewhere [41, 44, 45], this prevents $G$ from covering the whole real data distribution. In VC, this may result in a partial conversion.
Meanwhile, the target conditional adversarial loss (Figure 1(b)) encourages the generated data to be close to the real data conditioned on the target domain code. As discussed in the previous study, this objective prevents the generator $G$ from leaning towards generating only classifiable data. However, a possible difficulty is that this loss needs to simultaneously handle diverse data, including hard negative samples (e.g., conversion between the same speaker in Figure 1(b)) and easy negative samples (e.g., conversion between completely different speakers in Figure 1(b)). This unfair condition makes it difficult to bring all the converted data close to real target data.
To solve this problem, we develop a source-and-target conditional adversarial loss defined as
$$\mathcal{L}_{st\text{-}adv} = \mathbb{E}_{(x, c) \sim P(x, c),\, c' \sim P(c')}[\log D(x, c', c)] + \mathbb{E}_{(x, c) \sim P(x, c),\, c' \sim P(c')}[\log(1 - D(G(x, c, c'), c, c'))], \tag{7}$$
where $c' \sim P(c')$ is randomly sampled independently of real data. Differently from Equation 1, both the generator $G$ and the discriminator $D$ are conditioned on the source code $c$ in addition to the target code $c'$. We call such $G$ and $D$ a source-and-target conditional generator and discriminator, respectively. As shown in Figure 1(c), by using both source and target domain codes as conditional information, this loss encourages all the converted data to be close to real data in both source-wise and target-wise manners. This resolves the unfair training condition in the target conditional adversarial loss (Figure 1(b)) and allows all the source domain data to be converted into the target domain data.
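One simple way to feed a source-and-target pair to a network is to map the pair to a single index, for example for an embedding lookup. The sketch below is an illustration under assumptions: it assumes that a real feature of domain c is judged under the pair (c', c), as if it had been converted from a randomly sampled c', and that a converted feature is judged under (c, c'). All function names are hypothetical.

```python
import random

def pair_index(source_code, target_code, num_domains):
    """Map a (source, target) code pair to a unique index in [0, N * N)."""
    return source_code * num_domains + target_code

def discriminator_conditions(source_code, num_domains, rng):
    """Sample a target code and build the two discriminator conditions.

    Returns (target_code, real_condition, fake_condition).
    """
    # The target code is sampled independently of the real data.
    target_code = rng.randrange(num_domains)
    # Assumption: real x of domain c is treated as a conversion from c' to c.
    real_condition = pair_index(target_code, source_code, num_domains)
    # The converted feature is treated as a conversion from c to c'.
    fake_condition = pair_index(source_code, target_code, num_domains)
    return target_code, real_condition, fake_condition

rng = random.Random(0)
target, real_cond, fake_cond = discriminator_conditions(1, 4, rng)
```

Because every pair receives its own index, same-speaker and different-speaker conversions are compared against separate references instead of sharing a single target-only condition.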
One possible disadvantage of the source-and-target conditional generator is that it requires the source code to be available at inference time, which is not required in the conventional StarGAN-VC. However, note that speaker recognition has been actively studied, and this problem can be alleviated by using it as a pre-process.
As indicated by previous studies on VC postfilters (e.g., global variance and modulation spectrum postfilters), accurate modulation translation is important to achieve high-quality VC. Particularly, to achieve multi-domain VC using only a single generator, a framework must be incorporated that can conduct diverse domain-specific modulations effectively. For this challenge, a channel-wise conditional method (Figure 2(a)) is not effective because the concatenated conditional information can be additively used in a convolutional procedure but cannot be multiplicatively used to modulate features. To alleviate this problem, we introduce a modulation-based conditional method, which can directly modulate features in a domain-dependent manner. In particular, we introduce conditional instance normalization (CIN), which was originally proposed in computer vision for style transfer. As shown in Figure 2(b), given the feature $f$, CIN conducts the following procedure:
$$\mathrm{CIN}(f; c') = \gamma_{c'} \left( \frac{f - \mu(f)}{\sigma(f)} \right) + \beta_{c'}, \tag{8}$$
where $\mu(f)$ and $\sigma(f)$ are the average and standard deviation of $f$ that are calculated for each instance. $\gamma_{c'}$ and $\beta_{c'}$ are domain-specific scale and bias parameters that allow the modulation to be transformed in a domain-specific manner. These parameters are learnable and optimized through training.
In the above, we explain the case in which the generator is conditioned on the target domain code $c'$. When using the source-and-target conditional generator introduced with Equation 7, we replace the scale and bias parameters $\gamma_{c'}$ and $\beta_{c'}$ with $\gamma_{c, c'}$ and $\beta_{c, c'}$, respectively, which are selected depending on both the source code $c$ and the target code $c'$.
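The normalization and domain-specific modulation can be sketched in a few lines. This is a minimal illustration, assuming that the statistics are computed per channel over the time axis of a single instance and that a small epsilon is added for numerical stability; the parameter tables and their values are hypothetical.

```python
import math

def conditional_instance_norm(feature, gamma, beta, eps=1e-5):
    """Normalize each channel of one instance, then apply a scale and bias.

    feature: list of channels, each a list of T values.
    gamma, beta: per-channel scale and bias selected by the domain code.
    """
    output = []
    for channel, g, b in zip(feature, gamma, beta):
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        std = math.sqrt(var + eps)
        output.append([g * (v - mean) / std + b for v in channel])
    return output

# Hypothetical learned parameters, one (gamma, beta) entry per target code.
GAMMA = {0: [1.0, 1.0], 1: [2.0, 0.5]}
BETA = {0: [0.0, 0.0], 1: [1.0, -1.0]}

feature = [[1.0, 2.0, 3.0], [10.0, 10.0, 40.0]]  # 2 channels, 3 frames
out = conditional_instance_norm(feature, GAMMA[1], BETA[1])
```

For a source-and-target conditional generator, the same function can be reused by keying the parameter tables with the pair of source and target codes, e.g., `GAMMA[(c, c_prime)]`, instead of the target code alone.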
Dataset: We evaluated our method on the multi-speaker VC task using VCC 2018, which contains recordings of professional US English speakers. Following the StarGAN-VC study, we selected a subset of speakers covering all inter- and intra-gender conversions: VCC2SF1, VCC2SF2, VCC2SM1, and VCC2SM2, where F and M indicate female and male speakers, respectively. Thus, the number of domains $N$ is set to 4. Our goal is to learn $4 \times 3 = 12$ different source-and-target mappings in a single model. Each speaker has sets of 81 and 35 sentences for training and evaluation, respectively. The recordings were downsampled to 22.05 kHz for this challenge. We extracted 34 Mel-cepstral coefficients (MCEPs), logarithmic fundamental frequency ($\log F_0$), and aperiodicities (APs) every 5 ms by using the WORLD analyzer.
Conversion process: In these experiments, we focused on analyzing the performance in MCEP conversion. Hence, we applied the proposed method only to MCEP conversion, and for the other parts, we used typical methods, i.e., we converted $\log F_0$ using the logarithm Gaussian normalized transformation, directly used APs, and synthesized speech using the WORLD vocoder. To examine the pure effect of the proposed methods, we did not use any postfilter [54, 55, 56] or powerful vocoder such as the WaveNet vocoder [57, 58]. Incorporating them remains possible future work. (For reference, the converted speech samples, in which the proposed method was applied to convert all acoustic features (namely, MCEPs, band APs, continuous $\log F_0$, and voice/unvoice indicator), are provided at http://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/stargan-vc2/index.html. Even in this challenging setting, StarGAN-VC2 works reasonably well.)
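The logarithm Gaussian normalized transformation matches the mean and standard deviation of the log-scaled fundamental frequency to those of the target speaker. The following is a minimal sketch of the commonly used form of this transformation, assuming that unvoiced frames are marked with a value of 0 and that the statistics are computed over voiced frames only; the function name and the numbers in the example are hypothetical.

```python
import math

def convert_f0(f0, src_mean, src_std, tgt_mean, tgt_std):
    """Convert an F0 contour with a log-domain mean/variance transformation.

    The means and standard deviations are statistics of log F0 over the
    voiced frames of the source and target speakers.
    """
    converted = []
    for value in f0:
        if value <= 0.0:
            # Unvoiced frame: keep it unvoiced.
            converted.append(0.0)
        else:
            log_f0 = (math.log(value) - src_mean) / src_std * tgt_std + tgt_mean
            converted.append(math.exp(log_f0))
    return converted

# Hypothetical speaker statistics (log Hz).
src_mean, src_std = math.log(120.0), 0.2
tgt_mean, tgt_std = math.log(220.0), 0.3
contour = convert_f0([0.0, 120.0, 120.0 * math.exp(0.2)],
                     src_mean, src_std, tgt_mean, tgt_std)
```

A frame at the source speaker's mean is mapped to the target speaker's mean, and a frame one standard deviation above the source mean is mapped to one standard deviation above the target mean.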
Implementation details: We designed the network architectures on the basis of CycleGAN-VC2, i.e., we used a 2-1-2D CNN in the generator $G$ and a 2D CNN in the discriminator $D$. We formulated $D$ using the projection discriminator. In the pre-experiment, we found that skip connections in residual blocks result in partial conversion. Thus, we removed them in $G$. The details of the network architectures are given in Figure 3. For a GAN objective, we used a least squares GAN. We conducted speaker-wise normalization as a pre-process. We trained the networks using the Adam optimizer with a batch size of 8, in which we used a randomly cropped segment (128 frames) as one instance. The number of iterations was set to , learning rates for $G$ and $D$ were set to 0.0002 and 0.0001, respectively, and the momentum term was set to 0.5. We set , , and . We used only for the first iterations to stabilize the training at the beginning.
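The two data-preparation steps mentioned above, speaker-wise normalization and random cropping of 128-frame segments, can be sketched as follows. The sketch assumes that normalization means a per-dimension zero-mean, unit-variance transformation using statistics pooled over all frames of one speaker; the function names and the synthetic data are hypothetical.

```python
import math
import random

def speaker_statistics(frames):
    """Per-dimension mean and standard deviation over all frames of a speaker."""
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / len(frames) for d in range(dims)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / len(frames))
            for d in range(dims)]
    return means, stds

def normalize(frames, means, stds, eps=1e-8):
    """Normalize every frame with the speaker's statistics."""
    return [[(v - m) / (s + eps) for v, m, s in zip(f, means, stds)]
            for f in frames]

def random_crop(frames, length=128, rng=random):
    """Pick a random contiguous segment; raises ValueError if too short."""
    start = rng.randrange(len(frames) - length + 1)
    return frames[start:start + length]

# Synthetic 2-dimensional features for one speaker (300 frames).
frames = [[float(t), 2.0 * t + 1.0] for t in range(300)]
means, stds = speaker_statistics(frames)
normalized = normalize(frames, means, stds)
segment = random_crop(normalized, length=128, rng=random.Random(0))
```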
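Table 1: Comparison of training objectives.

| Objective | MCD [dB] | MSD [dB] |
|---|---|---|
| | 7.73 ± .07 | 1.96 ± .03 |
| | 7.21 ± .16 | 2.87 ± .51 |
| (StarGAN-VC) | 7.11 ± .10 | 2.41 ± .13 |
| (StarGAN-VC2) | 6.90 ± .07 | 1.89 ± .03 |

Table 2: Comparison of network architectures.

| Network | MCD [dB] | MSD [dB] |
|---|---|---|
| Channel-wise (StarGAN-VC) | 6.90 ± .08 | 2.55 ± .20 |
| Modulation-based (StarGAN-VC2) | 6.90 ± .07 | 1.89 ± .03 |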
We conducted an objective evaluation to validate the advantages of the proposed conditional methods over other conditional methods. As in the previous study, we used two evaluation metrics for comprehensive analysis: the Mel-cepstral distortion (MCD), which measures the global structural differences by calculating the distance between the target and converted MCEPs, and the modulation spectra distance (MSD), which measures the local structural differences by computing the distance between the target and converted logarithmic modulation spectra of MCEPs. For both metrics, a smaller value indicates that the target and converted features are more similar.
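As a point of reference, the MCD is commonly computed per frame as a scaled Euclidean distance between Mel-cepstral vectors and then averaged over frames. The sketch below uses this common definition and assumes that the two sequences are already time-aligned and that the 0th (energy) coefficient is excluded; the exact settings used in the evaluation above are not stated in the text, so these are assumptions.

```python
import math

def mel_cepstral_distortion(target, converted):
    """Average MCD in dB between two time-aligned MCEP sequences.

    target, converted: lists of frames, each a list of coefficients.
    The 0th coefficient of every frame is skipped.
    """
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    per_frame = [
        scale * math.sqrt(sum((t - c) ** 2 for t, c in zip(tf[1:], cf[1:])))
        for tf, cf in zip(target, converted)
    ]
    return sum(per_frame) / len(per_frame)

# Hypothetical three-coefficient frames.
identical = mel_cepstral_distortion([[0.0, 1.0, 0.0]], [[0.0, 1.0, 0.0]])
unit_gap = mel_cepstral_distortion([[0.0, 1.0, 0.0]], [[5.0, 0.0, 0.0]])
```

In the second call, the frames differ by 1 in a single compared coefficient, so the result equals the scale factor itself; the difference in the 0th coefficient is ignored.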
We conducted comparative studies in two aspects: training objectives and network architectures, which are listed in Tables 1 and 2, respectively. We have calculated the scores averaged over three models trained with different initializations to eliminate the effect of initialization. In Table 1, the proposed source-and-target conditional loss outperforms the other losses in terms of both the MCD and MSD. This indicates that the proposed loss is useful for improving the feature quality in terms of both the global and local structure measures. In Table 2, the proposed modulation-based conditional method outperforms the conventional channel-wise conditional method in terms of the MSD. This indicates that the proposed architecture is particularly useful for improving the local structure. Through these experiments, we empirically confirm that the proposed conditional methods in objectives and networks effectively bring the converted acoustic feature sequence close to the target one.
We conducted listening tests to analyze the performance compared with StarGAN-VC, which is a state-of-the-art multi-domain non-parallel VC method. To measure naturalness, we conducted a mean opinion score (MOS) test (5: excellent to 1: bad), in which we included the analysis-synthesized speech (which is the upper limit of the converted speech) as a reference (MOS: 4.2). For each model, we generated 36 sentences (12 source-target combinations × 3 sentences). We conducted an XAB test to measure speaker similarity. Here, “X” was target speech and “A” and “B” were speech converted by StarGAN-VC and StarGAN-VC2, respectively. For each model, we generated 24 sentences (12 source-target combinations × 2 sentences). To eliminate bias in the order of stimuli, we presented all pairs in both orders (AB and BA). For each sentence pair, the listeners were asked to select their preferred one (“A” or “B”) or “Fair.” Twelve well-educated English speakers participated in the tests.
Figures 4 and 5 show the MOS for naturalness and the preference scores for speaker similarity, respectively. We summarized the results on the basis of three categories: all conversion, inter-gender conversion, and intra-gender conversion. These results empirically demonstrate that StarGAN-VC2 outperforms StarGAN-VC in terms of both naturalness and speaker similarity for every category.
To advance the research on multi-domain non-parallel VC, we have rethought conditional methods in StarGAN-VC in two aspects: training objectives and network architectures. We developed a source-and-target conditional adversarial loss for the former and a modulation-based conditional method for the latter and have proposed StarGAN-VC2 incorporating them. The empirical studies on non-parallel multi-speaker VC demonstrate that StarGAN-VC2 outperforms StarGAN-VC in both objective and subjective measures. StarGAN-VC2 is a general model for multi-domain VC and is not limited to multi-speaker VC. Adapting it to other tasks (e.g., multi-emotion VC and multi-pronunciation VC) remains a promising future direction.
Acknowledgements: This work was supported by JSPS KAKENHI 17H01763.
T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Trans. Audio Speech Lang. Process., vol. 15, no. 8, pp. 2222–2235, 2007.
D. Saito, K. Yamamoto, N. Minematsu, and K. Hirose, “One-to-many voice conversion based on tensor representation of speaker space,” in Proc. Interspeech, 2011, pp. 653–656.
W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in Proc. CVPR, 2016, pp. 1874–1883.