Unsupervised Representation Disentanglement using Cross Domain Features and Adversarial Learning in Variational Autoencoder based Voice Conversion

by Wen-Chin Huang, et al.

An effective approach for voice conversion (VC) is to disentangle linguistic content from other components in the speech signal. The effectiveness of variational autoencoder (VAE) based VC (VAE-VC), for instance, strongly relies on this principle. In our prior work, we proposed a cross-domain VAE-VC (CDVAE-VC) framework, which utilized acoustic features of different properties, to improve the performance of VAE-VC. We believed that the success came from more disentangled latent representations. In this paper, we extend the CDVAE-VC framework by incorporating the concept of adversarial learning, in order to further increase the degree of disentanglement, thereby improving the quality and similarity of converted speech. More specifically, we first investigate the effectiveness of incorporating generative adversarial networks (GANs) into CDVAE-VC. Then, we consider the concept of domain adversarial training and add an explicit constraint to the latent representation, realized by a speaker classifier, to explicitly eliminate the speaker information that resides in the latent code. Experimental results confirm that the degree of disentanglement of the learned latent representation can be enhanced by both GANs and the speaker classifier. Meanwhile, subjective evaluation results in terms of quality and similarity scores demonstrate the effectiveness of our proposed methods.



I Introduction

Voice conversion (VC) aims to convert the speech of a source into that of a target without changing the linguistic content [47]. Speaker voice conversion [8] is a typical type of VC and refers to the process of converting speech from a source speaker to a target speaker. In addition, VC can benefit a wide variety of applications, such as accent conversion [3], personalized speech synthesis [31, 41], and speaking-aid device support [49, 72, 69]. Since spectral properties play an important role in characterizing speaker individuality, spectral conversion has been intensively studied in VC. In this work, we focus on spectral mapping in speaker voice conversion.

Numerous VC approaches have been proposed. The Gaussian mixture model (GMM)-based method [60, 73] has been a popular statistical approach that estimates the joint density of the source-target feature vectors; it requires a parallel training procedure and has the well-known disadvantage that the converted outputs generally suffer from an over-smoothing issue. Frequency warping methods, such as vocal tract length normalization [65], weighted frequency warping [16], and dynamic frequency warping [19], are able to keep spectral details but provide inferior speaker identity conversion compared with statistical approaches. Exemplar-based methods [67, 75, 74, 58, 70] require much less training data and are capable of modeling high-dimensional spectra. In recent years, deep neural networks (DNNs) have established supremacy in a wide range of research fields, including VC [50, 13, 6, 61]. DNNs have been utilized not only for spectral mapping but also for neural vocoding [52, 68, 22]. It has been shown that employing neural vocoders as the waveform generation module can greatly improve the performance of VC systems [39, 57, 42, 71, 29, 59]. It has also been shown that VC systems, whether implemented with high-dimensional or low-dimensional features, benefit from spectral detail compensation [75, 70, 53].

Nonetheless, most of the approaches described above rely on the availability of parallel training data, which is often inaccessible in real-world scenarios. Thus, the development of non-parallel VC methods has been gaining attention [44]. One approach is to construct a pseudo-parallel dataset from a non-parallel corpus [15]. Another family of approaches utilizes a pre-trained automatic speech recognition model to compute the phonetic posteriorgram (PPG) as a speaker-independent linguistic feature, followed by a PPG-to-acoustic mapping to generate converted features [62, 55]. A recently popular approach is to use DNNs to model the probability distribution of the target features; state-of-the-art models such as variational autoencoders (VAEs) [38] and generative adversarial networks (GANs) [20] have been successfully applied to non-parallel VC [23, 55, 32, 24, 36, 34, 33, 11, 45].

In this work, we focus on VAE-based VC (VAE-VC) [23]. Specifically, the spectral conversion function is composed of an encoder-decoder pair. The encoder encodes the input spectral feature into a latent code; the decoder mixes the latent code and a specified target speaker code to generate the converted feature. The encoder-decoder network and the speaker codes are trained by back-propagation of the reconstruction error, along with a Kullback-Leibler (KL)-divergence loss that regularizes the distribution of the latent code.

Fig. 1: Illustration of how entangled latent representation affects the conversion performance in a general VAE-VC framework. The residual source speaker information in the latent code will be mixed with the given target speaker code, resulting in a mixed speaker identity in the converted feature. Thus, the performance might be harmed.
Fig. 2: Illustration of the conversion phase of the VAE-VC [23] framework. Following traditional VC systems, a vocoder first parameterizes the waveform into acoustic features, which are then converted in different streams, and finally the converted features are used to synthesize the converted waveform by a vocoder.

The degree of disentanglement of the latent representation is crucial to the success of many speech processing frameworks [25, 27, 26, 12, 9], including VAE-VC. Since we focus on the task of speaker voice conversion, the degree of disentanglement is defined as the amount of (source) speaker information residing in the latent code, i.e., the independence of the latent code and the speaker code [1]. An illustration is given in Figure 1. If the latent code is entangled with multiple components (e.g., in the VC task, the source speaker information remains in the latent code), then during conversion, the decoder will draw speaker information from both the given target speaker code and the residual source speaker information in the latent code, which harms the conversion performance. From the success of VAE-VC, we can infer that, at least to some extent, the decoder is trained to rely more on the given speaker code than on the speaker characteristics remaining in the latent code; otherwise, conversion by changing the speaker code would not work. Although this success may be a natural result of model optimization, we doubt whether the performance is robust enough. For instance, in [54], it was demonstrated that the performance of autoencoder-based VC models is sensitive to the latent space dimension. This raises the need to design better schemes for making the latent code more independent of the speaker.

In our prior work [28], we proposed a cross-domain VAE-based VC framework (referred to as CDVAE-VC in the following discussion). The motivations of CDVAE-VC are: (1) although the effectiveness of VAE-VC using vocoder spectra (e.g., the STRAIGHT spectra, SPs [37]) has been confirmed, the use of other types of spectral features, such as mel-cepstral coefficients (MCCs) [17], which are related to human perception and have been widely used in VC, had not been properly investigated; and (2) since modeling low- and high-dimensional features alone has respective shortcomings, it is believed, based on multi-target/task learning [76, 5], that a model capable of simultaneously modeling two types of spectral features can yield better performance, even if they are derived from the same feature domain. To this end, CDVAE-VC [28] extended the VAE-VC framework to jointly consider two kinds of spectral features, namely SPs and MCCs. By introducing two additional cross-domain reconstruction losses and a latent similarity constraint into the training objective, the latent representations encoded from the input SPs and MCCs are biased toward each other and capable of self- or cross-reconstructing the input features. We speculated that the success of CDVAE-VC came from the fact that a more disentangled latent representation was learned. Furthermore, we observed a positive correlation between the conversion performance and the extent to which the latent code was disentangled.

In this work, we extend the CDVAE-VC framework by incorporating the concept of adversarial training to improve the degree of disentanglement as well as the conversion performance. First, we directly combine CDVAE-VC with GANs. GANs have shown the ability to enhance the output of the decoder in encoder-decoder based VC frameworks [11]; therefore, it is expected that such a combination can improve the quality of converted speech. Second, inspired by the idea of domain adversarial training (DAT) [18], we add a speaker classification training objective to the latent variables in order to explicitly project away speaker-related information. A similar idea has been applied to several speech processing tasks, such as speech recognition [46, 64, 63], speech enhancement [18], VC [11, 51], and singing VC [48]. Here, we utilize DAT with cross-domain features to further facilitate a more disentangled latent representation.

Designing a clear evaluation metric for the degree of disentanglement has long been an open problem in machine learning. In image modeling, visual inspection has been a standard and intuitive approach [7, 14]. However, visual inspection is not fully feasible for speech processing tasks, since it is hard to quantify the difference in voices as a specific latent variable changes. In previous works [11, 10, 54], a classifier-based metric has been proposed. Since that metric depends on a trained classifier, it has limitations when comparing the disentanglement of latent codes obtained by different models, due to different training conditions and dynamics. Following [30], we utilize the parallel data that exist in most benchmark VC datasets and derive a novel metric for measuring disentanglement. The key assumption is that an ideal encoder should encode a pair of parallel sentences uttered by two different speakers into similar latent codes. We measure the cosine similarity between such latent codes to evaluate how well the encoder disentangles the latent codes.
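The metric described above can be sketched in a few lines of numpy. The function names, and the assumption that the two utterances' frames have already been time-aligned (e.g., by DTW), are ours, not the paper's:

```python
import numpy as np

def cosine_similarity(z1, z2):
    """Cosine similarity between two latent code vectors."""
    return float(np.dot(z1, z2) / (np.linalg.norm(z1) * np.linalg.norm(z2)))

def disentanglement_score(z_src, z_tgt):
    """Average frame-wise cosine similarity between the latent codes of a
    pair of time-aligned parallel utterances (frames x latent_dim arrays).
    Higher scores indicate less residual speaker information."""
    sims = [cosine_similarity(a, b) for a, b in zip(z_src, z_tgt)]
    return float(np.mean(sims))
```

A score near 1 means the two speakers' aligned frames map to nearly identical latent codes, which is the behavior an ideal speaker-independent encoder should exhibit.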

The remainder of this paper is organized as follows. In Section II, we first review the VAE-VC and its extended version, CDVAE-VC. Section III introduces how to combine GANs with CDVAE-VC. Then, we describe how to add an adversarial speaker classifier objective to the latent code in Section IV. In Section V, we first examine our proposed mechanisms one by one, using conventional objective and subjective evaluation metrics adopted in VC. Disentanglement measurements of our proposed methods and how they are related to the VC performance are presented afterwards. Finally, we conclude the paper with discussions in Section VI.

II Background

In conventional VC frameworks, the acoustic features of the source speaker are converted to those of the target speaker in different feature streams. Many studies focus on the conversion of spectral features [73] and thus formulate VC as follows. Given the source speaker's spectral frames $\{\mathbf{x}_{s,n}\}_{n=1}^{N}$, the goal is to find a conversion function $f$ such that

$$\hat{\mathbf{x}}_{t,n} = f(\mathbf{x}_{s,n}). \qquad (1)$$

Note that the second subindices on both sides of the equation are both $n$, which means that the converted spectral feature sequence has the same length as that of the source. In the rest of the article, we drop the frame and speaker indices for simplicity.

In the following subsections, we describe two VAE based VC frameworks. Throughout the paper, we use “bar” to indicate the reconstructed features, and “hat” to indicate the converted features.

II-A VAE-VC

Figure 2 depicts the conversion process of a typical VAE-VC system [23]. The core of VAE-VC is an encoder-decoder network. During training, given an observed (source or target) spectral frame $\mathbf{x}$, a speaker-independent encoder $E_\theta$ with parameter set $\theta$ encodes $\mathbf{x}$ into a latent code $\mathbf{z} = E_\theta(\mathbf{x})$. The speaker code $\mathbf{y}$ of the input frame is then concatenated with the latent code and passed to a conditional decoder $G_\phi$ with parameter set $\phi$ to reconstruct the input. This reconstruction process can be expressed as:

$$\bar{\mathbf{x}} = G_\phi(\mathbf{z}, \mathbf{y}), \quad \mathbf{z} = E_\theta(\mathbf{x}). \qquad (2)$$

The model parameters can be obtained by maximizing the variational lower bound:

$$\mathcal{L}_{vae}(\theta, \phi; \mathbf{x}, \mathbf{y}) = -D_{KL}\big(q_\theta(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\big) + \mathbb{E}_{q_\theta(\mathbf{z}|\mathbf{x})}\big[\log p_\phi(\mathbf{x}|\mathbf{z}, \mathbf{y})\big], \qquad (3)$$

where $q_\theta(\mathbf{z}|\mathbf{x})$ is the approximate posterior, $p_\phi(\mathbf{x}|\mathbf{z}, \mathbf{y})$ is the data likelihood, and $p(\mathbf{z})$ is the prior distribution of the latent space. The second term is simply a reconstruction term, as in any vanilla autoencoder, whereas the first term regularizes the encoder to align the approximate posterior with the prior distribution.

In the conversion phase, one could use (2) to formulate the conversion function $f$:

$$\hat{\mathbf{x}} = f(\mathbf{x}, \hat{\mathbf{y}}) = G_\phi(E_\theta(\mathbf{x}), \hat{\mathbf{y}}), \qquad (4)$$

where $\hat{\mathbf{y}}$ is the target speaker code.
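Conversion thus amounts to re-running the autoencoding path with the speaker code swapped. As a toy sketch (the linear stand-ins for the encoder and decoder, and the function names, are ours):

```python
import numpy as np

def convert(x, y_target, encode, decode):
    """VAE-VC conversion: encode the source frame, then decode it with
    the *target* speaker code instead of the source one."""
    z = encode(x)          # speaker-independent latent code (ideally)
    return decode(z, y_target)
```

The same decoder that reconstructs the input during training is reused here; only the speaker code changes.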

The VAE framework makes several assumptions. First, $p_\phi(\mathbf{x}|\mathbf{z}, \mathbf{y})$ is assumed to follow a normal distribution whose covariance is an identity matrix. Second, $p(\mathbf{z})$ is set to be a standard normal distribution. Third, the expectation over $q_\theta(\mathbf{z}|\mathbf{x})$ is approximated by sampling via a linear-transformation based re-parameterization trick [38]. With these simplifications, we can avoid intractability and optimize the autoencoder parameter sets $\{\theta, \phi\}$ and the speaker codes via back-propagation.
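As a concrete illustration of these simplifications, the re-parameterization trick and the closed-form KL term for a diagonal-Gaussian posterior against a standard-normal prior can be sketched as follows (a minimal numpy sketch, not the paper's implementation):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), so the sampling
    step stays differentiable w.r.t. mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return float(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))
```

When the posterior equals the prior (zero mean, unit variance), the KL term vanishes, which is exactly the regularization pressure the first term of (3) applies.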

Fig. 3: Illustration of the training phase of the CDVAE-VC [28] framework. In this framework, each feature has its own encoder and decoder. During training, by minimizing the losses derived from the within- and cross-domain reconstruction paths, the latent codes $\mathbf{z}_s$ and $\mathbf{z}_m$ learn to reconstruct not only the corresponding input features but also the cross-domain features.

II-B CDVAE-VC

In [28], we proposed the CDVAE-VC framework to utilize spectral features of different properties extracted from the same observed speech frame. As depicted in Figure 3, the CDVAE framework is formed by a collection of encoder-decoder pairs, one for each kind of spectral feature. Considering SPs and MCCs as the two kinds of spectral features (denoted as $\mathbf{x}_s$ and $\mathbf{x}_m$), the following losses are defined:

$$\bar{\mathbf{x}}_{s \to s} = G_s(E_s(\mathbf{x}_s), \mathbf{y}), \quad \bar{\mathbf{x}}_{m \to m} = G_m(E_m(\mathbf{x}_m), \mathbf{y}),$$
$$\bar{\mathbf{x}}_{m \to s} = G_s(E_m(\mathbf{x}_m), \mathbf{y}), \quad \bar{\mathbf{x}}_{s \to m} = G_m(E_s(\mathbf{x}_s), \mathbf{y}),$$
$$\mathcal{L}_{in} = \mathcal{L}_{vae}(\bar{\mathbf{x}}_{s \to s}) + \mathcal{L}_{vae}(\bar{\mathbf{x}}_{m \to m}), \quad \mathcal{L}_{cross} = \mathcal{L}_{vae}(\bar{\mathbf{x}}_{m \to s}) + \mathcal{L}_{vae}(\bar{\mathbf{x}}_{s \to m}),$$

where $E_s$ and $G_s$ are the encoder and decoder for SPs, and $E_m$ and $G_m$ are the encoder and decoder for MCCs; $\bar{\mathbf{x}}_{s \to s}$ and $\bar{\mathbf{x}}_{m \to m}$, respectively, denote the generated SPs and MCCs from the within-domain reconstruction paths; $\bar{\mathbf{x}}_{m \to s}$ and $\bar{\mathbf{x}}_{s \to m}$, respectively, denote the generated SPs and MCCs from the cross-domain reconstruction paths. Note that $\mathcal{L}_{vae}(\cdot)$ calculates the reconstruction loss between its argument and the corresponding input feature.

In short, we introduce two extra reconstruction streams. By minimizing the cross-domain reconstruction loss, we enforce to contain enough information to reconstruct , and vice versa. As a result, the behavior of the encoders for both feature domains are constrained to be the same, i.e., they are expected to extract similar latent information from different types of input spectral features. To explicitly reinforce this constraint, a latent similarity L1 loss defined as


can be included in the final objective expressed as:


The model parameters can be learned by maximizing (14). In the conversion phase, there are four conversion paths (i.e., two within-domain and two cross-domain paths). As reported in [28], the CDVAE MCC-MCC path gave the best performance in terms of subjective evaluation, which matched the assumption that MCCs are more related to human perception.
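The two within-domain paths, the two cross-domain paths, and the latent similarity constraint can be summarized in a toy numpy sketch. The callables standing in for the real encoder/decoder networks, and the plain L1 reconstruction loss standing in for the full VAE loss, are our simplifications:

```python
import numpy as np

def l1(a, b):
    return float(np.mean(np.abs(a - b)))

def cdvae_losses(x_sp, x_mcc, enc_sp, enc_mcc, dec_sp, dec_mcc):
    """Within-domain, cross-domain, and latent-similarity terms of CDVAE."""
    z_sp, z_mcc = enc_sp(x_sp), enc_mcc(x_mcc)
    return {
        "sp_to_sp":   l1(dec_sp(z_sp), x_sp),     # within-domain
        "mcc_to_mcc": l1(dec_mcc(z_mcc), x_mcc),  # within-domain
        "sp_to_mcc":  l1(dec_mcc(z_sp), x_mcc),   # cross-domain
        "mcc_to_sp":  l1(dec_sp(z_mcc), x_sp),    # cross-domain
        "latent_sim": l1(z_sp, z_mcc),            # similarity constraint
    }
```

The cross-domain terms are what tie the two encoders together: each latent code must carry enough information for the *other* domain's decoder.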

Fig. 4: Illustration of the training procedure of our proposed CDVAE-CLS-GAN model. Phase 1: a CDVAE is trained. Phase 2: the latent codes are used to pre-train the CLS. Phases 3-A and 3-B: the {encoders, decoders} and the {discriminators, CLS} are trained in an alternating order.

III Incorporating CDVAE-VC with GANs

Minimizing the reconstruction loss in VAE-VC and CDVAE-VC tends to result in blurry spectra, similar to the over-smoothing effects in other VC frameworks. It is expected that introducing a GAN objective [20] can guide the output spectra to be more realistic. In this section, we present the main concepts and system architectures of the combination of GANs and the VAE-VC and CDVAE-VC frameworks.

III-A The GAN objective in the general VAE-VC

We follow [40] and incorporate a GAN objective into the decoder of the original VAE-VC. Assume that the real data distribution of any spectral frame admits a density $p^*(\mathbf{x})$, and that the autoencoding process defined in (2) induces a conditional distribution $p_{\theta,\phi}(\bar{\mathbf{x}}|\mathbf{x})$. From the data-distribution point of view, the goal is to enhance the decoder network in (2) such that the induced distribution $p_{\theta,\phi}(\bar{\mathbf{x}})$ best approximates the real data distribution:

$$p_{\theta,\phi}(\bar{\mathbf{x}}) \approx p^*(\mathbf{x}). \qquad (15)$$

A typical GAN [20] realizes the above-mentioned probability approximation by introducing a discriminator $D_\psi$ with parameter set $\psi$ that judges whether an input follows a true, natural probability distribution or an artificial one. Together with a generator that tries to produce realistic output features, these two components play a min-max game and seek an equilibrium with the Jensen-Shannon divergence as the objective, which is defined as follows:

$$\mathcal{J}_{gan} = \mathbb{E}_{\mathbf{x} \sim p^*}\big[\log D_\psi(\mathbf{x})\big] + \mathbb{E}_{\bar{\mathbf{x}} \sim p_{\theta,\phi}}\big[\log\big(1 - D_\psi(\bar{\mathbf{x}})\big)\big]. \qquad (16)$$
To facilitate stable training, in this work we adopt a Wasserstein GAN (WGAN) [2, 21], which is based on the following Wasserstein distance:

$$W(p^*, p_{\theta,\phi}) = \sup_{\|D\|_L \le 1} \mathbb{E}_{\mathbf{x} \sim p^*}\big[D(\mathbf{x})\big] - \mathbb{E}_{\bar{\mathbf{x}} \sim p_{\theta,\phi}}\big[D(\bar{\mathbf{x}})\big], \qquad (17)$$

where the supremum is over all 1-Lipschitz functions $D$. Based on the above distance, the following WGAN loss can be defined:

$$\mathcal{L}_{wgan} = \mathbb{E}_{\mathbf{x} \sim p^*}\big[D_\psi(\mathbf{x})\big] - \mathbb{E}_{\bar{\mathbf{x}} \sim p_{\theta,\phi}}\big[D_\psi(\bar{\mathbf{x}})\big], \qquad (18)$$

where $D_\psi$ is now a 1-Lipschitz discriminator. Finally, we can combine the objectives of the VAE and the WGAN by assigning the decoder of the VAE as the generator of the WGAN. Combining the WGAN loss (18) and the VAE loss (3) results in a VAE-GAN objective:

$$\mathcal{L}_{vae\text{-}gan} = -\mathcal{L}_{vae} + \alpha_{wgan}\,\mathcal{L}_{wgan}, \qquad (19)$$

where $\alpha_{wgan}$ is the weight of the WGAN loss. This objective is shared across the encoder, decoder, and discriminator. As in standard GAN training, the discriminator is first updated by maximizing this objective, and the encoder and decoder are then updated by minimizing it; the components are thus optimized in an alternating order. GANs produce more realistic (in our case, sharper) outputs because they optimize a loss function between two distributions in a more direct fashion.
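A shape-only numpy sketch of this alternating WGAN update may help (the scalar critic scores and the 5:1 update ratio are illustrative; the real model updates convolutional networks by back-propagation):

```python
import numpy as np

def wgan_loss(d_real, d_fake):
    """Empirical estimate of (18): E[D(real)] - E[D(reconstructed)].
    The critic ascends this quantity; the encoder/decoder descend it."""
    return float(np.mean(d_real) - np.mean(d_fake))

def alternating_step(critic_update, generator_update, n_critic=5):
    """One outer iteration: several critic steps per generator step."""
    for _ in range(n_critic):
        critic_update()      # maximize wgan_loss (+ gradient penalty)
    generator_update()       # minimize reconstruction + weighted wgan_loss
```

The inner critic loop is what keeps the Wasserstein estimate in (17) meaningful before each generator step.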

The VAW-GAN-VC method in [24] has a similar motivation: to better model spectral features and thereby improve feature generation. However, there is a fundamental difference between the training procedure of VAW-GAN-VC and ours. In VAW-GAN-VC, the objective of the WGAN is to minimize the Wasserstein distance between the distributions of the converted features and the real target features. Although this is a strong objective, it also brings some limitations. The original VAE-VC and CDVAE-VC consider only auto-encoding in the training phase, and perform conversion by changing the speaker code in the conversion phase. In other words, multiple conversion pairs are integrated into one model, sometimes referred to as "multi-target" training in VC. VAW-GAN-VC, in contrast, needs to consider not only auto-encoding but also conversion in the training phase, since the discriminator needs to discriminate between the real target features and the converted features in order to align the distribution of the latter to that of the former. As a result, VAW-GAN-VC is trained to convert from one source to one target, which limits the flexibility of the model. In this work, we intend to maintain the multi-target flexibility of CDVAE-VC and thus design the WGAN objective to match the distributions of the real features and the reconstructed features. Considering this fundamental difference, and to avoid confusion, we focus on multi-target VC and do not take VAW-GAN-VC into discussion and comparison in this paper.


III-B CDVAE-GAN

Now we can combine the GAN objective with CDVAE-VC, which we will refer to as CDVAE-GAN. The derivation of the objective is as simple as replacing the VAE loss in (19) with the CDVAE objective defined in (14). However, in practice, combining CDVAE-VC with GANs is not as trivial as replacing the encoder and decoder in VAE-GAN with those of the CDVAE. For each kind of feature, a separate discriminator should be trained, i.e., both an SP discriminator $D_s$ and an MCC discriminator $D_m$ should be considered. It seems natural to train the two discriminators jointly with the whole network. However, as mentioned above, the MCC-MCC path performs best among the four conversion paths of CDVAE-VC, and introducing a discriminator for SPs might not necessarily benefit the quality of the output MCCs. To determine the best architecture, we examine three settings: combining the CDVAE with only $D_s$, with only $D_m$, and with both $D_s$ and $D_m$. Detailed experimental results will be shown in Sections V-C and V-D.

IV Adversarial speaker classifier (CLS)

As discussed above, the viability of the family of VAE-VC frameworks relies on the decomposition of the input, which is assumed to be composed of a phonetic representation and speaker information. Ideally, the latent code extracted by the encoder should contain solely phonetic information and be free from any speaker information. However, this decomposition is not explicitly guaranteed. To this end, we investigate the effect of an adversarial speaker classifier that explicitly forces the latent code to be speaker independent.

IV-A The classifier loss

An adversarial speaker classifier $C_\omega$ with parameter set $\omega$ tries to classify which speaker a latent code comes from. We will refer to this classifier as CLS. Specifically, given a latent code $\mathbf{z}$, the CLS predicts a posterior probability $p(\mathbf{y}|\mathbf{z})$, i.e., the probability that $\mathbf{z}$ was extracted from an input frame produced by the speaker with one-hot code $\mathbf{y}$. We can then define the CLS loss as the cross-entropy between the predicted posterior and the one-hot ground-truth vector:

$$\mathcal{L}_{cls} = -\mathbb{E}_{\mathbf{z}}\Big[\sum\nolimits_{k} y_k \log p(y_k|\mathbf{z})\Big]. \qquad (20)$$

We now augment the CDVAE-GAN framework with the adversarial speaker classifier, which we will refer to as CDVAE-CLS-GAN. Adding the CLS loss (20) to the CDVAE-GAN loss, we obtain the final objective:

$$\mathcal{L}_{cdvae\text{-}cls\text{-}gan} = \mathcal{L}_{cdvae\text{-}gan} - \alpha_{cls}\,\mathcal{L}_{cls}, \qquad (21)$$

where $\alpha_{cls}$ is the weight of the classifier loss. This objective is, again, shared across the encoder, decoder, discriminator, and classifier.
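The classifier term can be sketched as a standard softmax cross-entropy over speakers; during adversarial training the CLS descends this loss while the encoders are updated with the opposite sign to strip speaker information (the logit-based interface below is our simplification):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def cls_loss(logits, speaker_id):
    """Cross-entropy between the predicted speaker posterior and the
    one-hot ground truth. The CLS minimizes this; the encoders are
    trained adversarially to maximize it."""
    return float(-np.log(softmax(logits)[speaker_id]))
```

An encoder that has successfully projected away speaker information leaves the CLS no better than chance, i.e., a loss near log of the number of speakers.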

The training process is divided into three phases, as depicted in Figure 4. Phase one involves the training of the VAE. In phase two, to pre-train the classifier, we first use the trained VAE obtained in phase one to extract latent codes from the same training set. The classifier is then trained on these latent codes to minimize (20). In the third phase, we train the whole network using an alternating update schedule, similar to the one described in Section III-A. Specifically, the encoders and decoders are first frozen, and the discriminator and classifier are trained to maximize $\mathcal{L}_{wgan}$ in (18) and minimize $\mathcal{L}_{cls}$ in (20), respectively, so that they can discriminate self-reconstructed features and classify latent codes correctly. Then, we freeze these modules and train the encoders and decoders to optimize not only the CDVAE objective in (14) but also the adversarial terms $\mathcal{L}_{wgan}$ and $\mathcal{L}_{cls}$, so that they can fool the frozen components.

The described training scheme thus plays a min-max game between the {encoders, decoders} and the {discriminator, classifier}. An ideally trained model contains encoders that learn to project away as much speaker information as possible, and decoders that can generate realistic and natural output spectra given an inferred latent code and a specified speaker code. Algorithm 1 summarizes the training procedure of CDVAE-CLS-GAN.

function autoencode(x, y)
      z ← sample from q_θ(z|x) using the encoder
      return G_φ(z, y)
// Phase 1: train the VAE
while not converged do
      draw a mini-batch of samples from the training set
      update the encoders and decoders with the CDVAE objective (14)
// Phase 2: train the CLS
while not converged do
      draw a mini-batch of samples from the training set
      extract latent codes with the trained encoders and update the CLS to minimize (20)
// Phase 3: train the whole network
while not converged do
      draw a mini-batch of samples from the training set
      // Update the discriminator and classifier
      while not converged do
            update the discriminator and the CLS with the encoders and decoders frozen
      // Update the encoder and generator
      while not converged do
            update the encoders and decoders with the discriminator and CLS frozen
Algorithm 1 Training procedure of CDVAE-CLS-GAN
ConvLReLU block: Conv-3x1-n, LN, LReLU
ConvGLU block: elementwise product of (Conv-3x1-n, LN, sigmoid) and (Conv-3x1-n, LN, tanh)

SP encoder: ConvLReLU ×5 (n = 1024, 512, 256, 64, 32), FC-16 (mean), FC-16 (log-variance)
MCC encoder: ConvLReLU ×5 (n = 512, 256, 128, 64, 32), FC-16 (mean), FC-16 (log-variance)
SP decoder: (Concat with y, ConvGLU) ×4 (n = 128, 256, 512, 1024), Concat with y, Conv-3x1-513
MCC decoder: (Concat with y, ConvGLU) ×4 (n = 64, 128, 256, 512), Concat with y, Conv-3x1-513
SP discriminator: ConvLReLU ×5 (n = 1024, 512, 256, 64, 32), FC-1
MCC discriminator: ConvLReLU ×5 (n = 512, 256, 128, 64, 32), FC-1

TABLE I: Model architectures. Conv-h×w-n denotes a convolutional layer with kernel size h×w and n output channels; LReLU denotes the leaky ReLU activation function; FC-n denotes a fully-connected linear layer with n output units; LN denotes layer normalization.

V Experimental evaluations

V-A Experimental settings

We conducted all experiments on the Voice Conversion Challenge (VCC) 2018 dataset, which contains recordings of 12 professional US English speakers with a sampling rate of 22050 Hz. The training and testing sets consisted of 81 and 35 utterances per speaker, respectively. We further divided the training utterances into 70/11 training/validation sets. The WORLD vocoder was used to extract acoustic features, including 513-dimensional SPs, 513-dimensional aperiodicity signals (APs), and the fundamental frequency (F0). 35-dimensional MCCs were then extracted from the SPs, which were normalized to unit sum, with the normalizing factor used as the energy of the SPs. The 0-th coefficient of the MCCs was taken out as the energy of the MCCs. We further applied min-max normalization to the SPs and MCCs. In the conversion phase, the converted SPs were obtained in the VAE systems and the converted MCCs in the CDVAE systems (excluding CDVAE-GAN with $D_s$). The energy and APs were kept unmodified, and F0 was converted using a linear mean-variance transformation in the log-F0 domain.
The detailed network architectures are shown in Table I. We adopted the fully convolutional network (FCN) [43] based CDVAE-VC as our baseline system [30], which consumes continuous spectral frames extracted from the whole utterance and outputs a sequence of converted frames of the same length. This model has been confirmed to outperform the frame-wise CDVAE-VC counterpart. We also adopted a gradient penalty regularization [21] in the WGAN objective to stabilize the training. Layer normalization [4], the gated linear units activation function, and skip connections were also used to more effectively propagate the conditional information.

Following [30], the latent space and speaker representations were set to be 16-dimensional. We used a mini-batch size of 16 and the Adam optimizer with a fixed learning rate of 0.0001. The hyper-parameters $\alpha_{wgan}$ and $\alpha_{cls}$ were set to 50 and 1000, respectively, according to a held-out validation set. For CDVAE-GAN, we first pre-trained the CDVAE for 100,000 steps. Then, we adversarially trained the discriminator(s) with the whole network for 10,000 steps, following a common WGAN training scheme [2, 21] in which the discriminator(s) were updated for 5 iterations followed by 1 iteration of encoder and decoder updates. For CDVAE-CLS-GAN, after training the CDVAE for 100,000 steps, we pre-trained the classifier on the latent codes extracted by the encoders for 30,000 steps. Then, we trained the whole network for 10,000 steps. After experimenting with different training schemes, here we updated the discriminator and the classifier for 1 iteration followed by 5 iterations of encoder and decoder updates.

The following models are compared in order to examine the effectiveness of our proposed methods.

  • VAE: The FCN version of the VAE-VC model introduced in [23]. This model is only used to evaluate the impact of cross-domain features on the degree of disentanglement.

  • CDVAE: The FCN model in [30], which is the baseline model in our experiments.

  • CDVAE-GANSP: The CDVAE with the SP discriminator $D_s$.

  • CDVAE-GANMCC: The CDVAE with the MCC discriminator $D_m$.

  • CDVAE-GANBOTH: The CDVAE with both $D_s$ and $D_m$.

  • CDVAE-CLS: The CDVAE with the CLS.

  • CDVAE-CLS-GANSP: The CDVAE with $D_s$ and the CLS.

  • CDVAE-CLS-GANMCC: The CDVAE with $D_m$ and the CLS.

  • CDVAE-CLS-GANBOTH: The CDVAE with $D_s$, $D_m$, and the CLS.


For simplicity, in the rest of the paper, we use brackets to indicate the type of feature used during conversion; in CDVAE-based methods, the corresponding conversion path is used. For instance, CDVAE-GANMCC [MCC] uses the MCCs and the MCC-MCC path. In addition, when MCCs are used in CDVAE and CDVAE-CLS, we additionally compare systems incorporating the global variance (GV) post-filter [56] to enhance the output, as in the original CDVAE [28].

V-B Evaluation methodology

V-B1 Objective evaluation metrics

  • Mel-cepstral distortion (MCD): MCD measures the spectral distortion in the MCC domain and is a commonly adopted objective metric in the field of VC. It is calculated as:

    $$\text{MCD [dB]} = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{D} \big(m_d^{c} - m_d^{t}\big)^2},$$

    where $D$ is the dimension of the MCCs, and $m_d^{c}$ and $m_d^{t}$ represent the $d$-th dimensional coefficients of the converted and target MCCs, respectively. In practice, MCD is calculated in an utterance-wise manner. A dynamic time warping (DTW) based alignment is performed beforehand to find the corresponding frame pairs between the non-silent converted and target MCC sequences.

  • Global variance (GV): GV serves as a metric for the over-smoothness of the output features. GV is usually calculated dimension-wise over all non-silent frames in the evaluation set. The $d$-th dimensional GV value is calculated as follows:

    $$\text{GV}[d] = \frac{1}{N} \sum_{n=1}^{N} \big(m_{d,n} - \bar{m}_d\big)^2,$$

    where $\bar{m}_d$ is the mean of all converted $d$-th dimensional MCC coefficients and $N$ is the number of frames.

  • Modulation spectrum (MS): MS [66] is defined as the log-scaled power spectrum of a given feature sequence. The temporal fluctuation of the sequence is first decomposed into individual modulation-frequency components, whose power values form the MS. In this work, we measure the MS of the MCCs. Different from previous works that measured the MS of a specific dimension of the MCC sequence, here we report the average over all dimensions. We also measure the MS distortion (MSD), where the MSD for the $d$-th dimension is calculated as:

    $$\text{MSD}[d] = \sqrt{\frac{1}{F} \sum_{f=1}^{F} \big(\text{MS}^{t}(d, f) - \text{MS}^{c}(d, f)\big)^2},$$

    where $F$ is the number of modulation-frequency bins, and $\text{MS}^{t}$ and $\text{MS}^{c}$ denote the MS of the target and converted MCC sequences, respectively.
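Under the assumption that the converted and target MCC sequences have already been DTW-aligned and silence-trimmed, the three metrics can be sketched in numpy as follows (the constants and dimension handling follow common VC practice and our reading of the definitions above, not the paper's exact code):

```python
import numpy as np

def mcd(mc_conv, mc_tgt):
    """Frame-averaged mel-cepstral distortion in dB over aligned
    (frames x dims) MCC matrices, 0th coefficient excluded upstream."""
    diff2 = np.sum((mc_conv - mc_tgt) ** 2, axis=1)
    return float(np.mean((10.0 / np.log(10.0)) * np.sqrt(2.0 * diff2)))

def global_variance(mc):
    """Per-dimension variance of the MCC trajectories."""
    return np.var(mc, axis=0)

def modulation_spectrum(traj):
    """Log-scaled power spectrum of one dimension's trajectory."""
    return np.log(np.abs(np.fft.rfft(traj)) ** 2 + 1e-12)

def msd(mc_conv, mc_tgt, d):
    """RMS difference between target and converted MS for dimension d."""
    ms_c = modulation_spectrum(mc_conv[:, d])
    ms_t = modulation_spectrum(mc_tgt[:, d])
    return float(np.sqrt(np.mean((ms_t - ms_c) ** 2)))
```

Note the complementary nature of the metrics: MCD compares means frame by frame, while GV and MSD compare the variance and temporal fluctuation of the trajectories, which is why they often disagree.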
V-B2 Subjective evaluation methods

We recruited 14 participants for the following two subjective evaluations.¹

¹A demo web page with samples used for subjective evaluation is available at https://unilight.github.io/CDVAE-GAN-CLS-Demo/.

  • The mean opinion score (MOS) test on naturalness: Subjects were asked to evaluate the naturalness of the converted and natural speech samples on a scale from 1 (completely unnatural) to 5 (completely natural).

  • The VCC [44] style test on similarity: This paradigm was adopted by the VCC organizing committee. Listeners were given a pair of speech utterances consisting of a natural speech sample from a target speaker and a converted speech sample. They were then asked to determine whether the pair of utterances could have been produced by the same speaker, using a four-level scale combining the decision with their confidence in it (same/different, sure/not sure).

Model F-F M-M M-F F-M Avg.
CDVAE [MCC] 6.56 5.76 6.96 6.27 6.39
CDVAE-GANSP [SP] 7.09 6.38 7.38 6.93 6.94
CDVAE-GANMCC [MCC] 7.52 6.65 7.87 7.27 7.33
CDVAE-GANBOTH [MCC] 7.44 6.85 7.90 7.30 7.37
CDVAE-CLS [MCC] 6.65 6.29 7.02 6.49 6.61
CDVAE-CLS-GANSP [SP] 7.23 6.91 7.64 7.08 7.21
CDVAE-CLS-GANMCC [MCC] 7.71 6.62 7.76 7.05 7.29
CDVAE-CLS-GANBOTH [MCC] 7.57 7.13 8.12 7.30 7.53
TABLE II: Mean Mel-cepstral distortions [dB] of all non-silent frames in the evaluation set for the compared models.
Fig. 5: Global variance curves of all non-silent frames averaged over all conversion pairs for the compared models.
Fig. 6: Average modulation spectrum curves over all dimensions of all non-silent frames over all conversion pairs for the compared models.

V-C Applying GANs to different features

We first compare CDVAE-GANSP, CDVAE-GANMCC, and CDVAE-GANBOTH, as well as CDVAE-CLS-GANSP, CDVAE-CLS-GANMCC, and CDVAE-CLS-GANBOTH. As shown in Table II, CDVAE-GANBOTH and CDVAE-CLS-GANBOTH gave the highest MCDs, while Figures 4(b), 4(c), 5(b) and 5(c) show that, in terms of GV and MS, CDVAE-GANMCC and CDVAE-CLS-GANMCC yielded curves closer to the target, whereas the curves of the other models deviated more. This is consistent with a common observation in the VC literature that MCD, which measures the sample mean, often yields results opposite to GV and MS, both of which reflect the sample variance [73, 35]. This result suggests that modeling both feature domains simultaneously does not always yield better results. As for perceptual performance, our internal listening tests revealed that CDVAE-GANMCC gave the best results among the three models; this is reasonable since the MCC-MCC path is used when performing conversion. Note that although CDVAE-GANSP and CDVAE-CLS-GANSP gave the lowest MCDs of the three variants, they did not necessarily outperform their MCC counterparts in listening tests. We speculate that fitting the SP domain tends to produce more over-smoothed output features, which lowers MCD but does not improve perceptual performance.

V-D Effectiveness of GANs

Next, we examine the effectiveness of combining GANs with CDVAE and CDVAE-CLS. Based on the discussion in the previous subsection, we focus on CDVAE-GANMCC and CDVAE-CLS-GANMCC here. As shown in Figures 4(a), 5(a) and 7, CDVAE-GANMCC and CDVAE-CLS-GANMCC fit the GV and MS statistics of the target much better than CDVAE and CDVAE-CLS, respectively. Models with GANs also yield very small MSDs compared with the rest of the models. This confirms that including a GAN objective in training indeed improves the modeling of the statistics, especially the variance, of real speech data. Table III and Figure 8 show the subjective evaluation results. The t-test showed that CDVAE-GANMCC significantly outperformed CDVAE, and that CDVAE-CLS-GANMCC significantly outperformed CDVAE-CLS. These results confirm the effectiveness of GANs. On the other hand, CDVAE-GANMCC performed comparably with CDVAE with GV post-processing, and CDVAE-CLS-GANMCC performed comparably with CDVAE-CLS with GV post-processing. These results are consistent with our findings in the objective evaluations, suggesting that GANs enhance the variance of the output features and thus have the potential to replace the GV post-filtering process commonly used in traditional MCC-based VC systems [73]. This is advantageous since the model can then be freed from post-filtering in the online conversion phase, which may benefit real-time applications.

Model F-F M-M M-F F-M Avg.
CDVAE [MCC] 2.50 ± 0.27 2.42 ± 0.21 2.31 ± 0.25 2.28 ± 0.31 2.40 ± 0.28
CDVAE-CLS [MCC] 2.72 ± 0.37 2.61 ± 0.48 2.44 ± 0.30 2.17 ± 0.29 2.55 ± 0.39
CDVAE w/ GV [MCC] 3.36 ± 0.44 3.36 ± 0.26 2.94 ± 0.30 2.89 ± 0.50 3.12 ± 0.30
CDVAE-CLS w/ GV [MCC] 3.50 ± 0.53 3.53 ± 0.49 3.06 ± 0.31 3.17 ± 0.38 3.30 ± 0.38
CDVAE-GANMCC [MCC] 3.19 ± 0.40 3.06 ± 0.38 2.61 ± 0.36 2.58 ± 0.27 2.95 ± 0.30
CDVAE-CLS-GANMCC [MCC] 2.94 ± 0.37 3.58 ± 0.48 3.06 ± 0.26 3.11 ± 0.57 3.15 ± 0.36
Target - - - - 4.75 ± 0.25

TABLE III: Mean opinion scores on naturalness for the compared models and the natural target voice, with 95% confidence intervals.

Fig. 7: Modulation spectrum distortion curves of all non-silent frames over all conversion pairs for the compared models.

V-E Effectiveness of CLS

Next, we evaluate the effectiveness of the adversarial speaker classifier. Looking at the CDVAE and CDVAE-GAN models and their counterparts with CLS, a trend of increasing MCD values can be observed in Table II. On the other hand, Figures 4(a), 5(a) and 7 show that applying CLS to CDVAE and CDVAE-GANMCC yields similar GV values but MS values closer to those of the target, as well as smaller MSDs. These results imply that CLS can improve the objective statistics.

Table III and Figure 8 show the subjective evaluation results. The effectiveness of CLS can be confirmed by the following observations. First, speech naturalness was improved in all conversion pairs by adding CLS to CDVAE, CDVAE w/ GV, and CDVAE-GANMCC, which is consistent with our aforementioned findings from the objective evaluations. Furthermore, the conversion similarity was greatly improved when incorporating CLS into CDVAE and CDVAE w/ GV, and slightly improved when adding it to CDVAE-GANMCC. This confirms our initial motivation for CLS, namely to increase speaker similarity by eliminating the source speaker identity from the latent code.

Fig. 8: Similarity results over all speaker pairs for the compared models.

V-F Disentanglement Measure

In this section, we investigate the degree of disentanglement of the VC models involved in this study. We use a metric recently proposed in [30], termed DEM, as the disentanglement measure. The main design concept of DEM is that a pair of sentences with the same content uttered by the source and target speakers should have similar latent codes, since the phonetic contents are the same. Therefore, we can use the cosine similarity to measure the distance between the latent codes obtained from the paired utterances. Specifically, the procedure to calculate DEM is as follows:

  1. extracting the latent codes of a pair of parallel utterances spoken by the source and target speakers;

  2. aligning the frame sequences of the pair of utterances using DTW;

  3. calculating the frame-wise cosine similarity, and then taking the average of the entire sequence.

As with other popular evaluation metrics, e.g., MCD and MSD, computing DEM requires parallel data. Since parallel data are usually available in standardized VC datasets, DEM is a simple but effective measure of the degree of disentanglement of the latent codes.
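The three-step DEM procedure above can be sketched as follows, assuming the latent codes have already been extracted as (frames x latent_dim) arrays; the DTW under cosine distance is one reasonable realization of the alignment step, not necessarily the authors' exact implementation:

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-10):
    """Cosine similarity between two latent vectors (eps guards against zero norms)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def dem(src_latents, trg_latents):
    """DEM score for one parallel utterance pair.

    Steps: (1) latent codes are assumed extracted already; (2) DTW-align the
    two latent sequences under cosine distance (1 - similarity); (3) average
    the frame-wise cosine similarity along the alignment path.
    """
    n, m = len(src_latents), len(trg_latents)
    sim = np.array([[cosine_similarity(s, t) for t in trg_latents]
                    for s in src_latents])
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = (1.0 - sim[i - 1, j - 1]) + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # backtrack to recover the aligned frame pairs
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return float(np.mean([sim[i, j] for i, j in path]))
```

The per-pair scores would then be averaged over all parallel utterance pairs of a source-target speaker pair to obtain the entries of Table IV.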

Model F-F M-M M-F F-M Avg.
VAE [SP] .568 .633 .534 .552 .571
CDVAE [SP] .597 .658 .557 .577 .597
CDVAE-GANSP [SP] .605 .677 .565 .594 .610
CDVAE-CLS [SP] .573 .582 .508 .535 .550
CDVAE-CLS-GANSP [SP] .629 .638 .573 .602 .610
CDVAE [MCC] .530 .588 .476 .502 .524
CDVAE-GANMCC [MCC] .559 .609 .502 .534 .551
CDVAE-CLS [MCC] .575 .581 .507 .533 .549
CDVAE-CLS-GANMCC [MCC] .583 .621 .561 .563 .584
TABLE IV: The results of DEM: the cosine similarity of the latent codes extracted from non-silent frames of parallel utterances of source-target pairs.

Table IV shows the evaluation results of DEM. First, we observe that CDVAE [SP] yields higher DEM scores than VAE [SP]. This confirms that introducing cross-domain features indeed increases the degree of disentanglement. Next, comparing the corresponding methods in the upper and lower halves of the table, which use SP and MCC as input features respectively, the DEM scores of the former are consistently higher than those of the latter. This result is somewhat reasonable because SPs (513-dimensional) are of much higher dimensionality than MCCs (35-dimensional) and carry far more detailed information. As a result, in terms of the cosine similarity measure, higher DEM scores are observed for the upper-half methods than for the lower-half ones.

One interesting finding is that incorporating GANs into the CDVAE and CDVAE-CLS models consistently and significantly improves the DEM scores. This result indicates that during the training of CDVAE-GANMCC, although not in our original expectations, the discriminator not only benefits the decoders but also indirectly guides the latent codes to be better disentangled.

As for CLS, we first observe that including CLS in CDVAE improves the DEM score when using MCCs yet degrades it when using SPs. Although this somewhat weakens the case for CLS, we note that CDVAE-CLS [SP] and CDVAE-CLS [MCC] have nearly identical DEM scores. This interesting finding shows that the CLS forces the encoders to encode different features into similar latent contents. On the other hand, including CLS in the CDVAE-GAN models boosts the DEM scores of the cross-gender pairs, which confirms that CLS can help the encoders eliminate speaker-dependent information, such as gender.

Finally, we compare the results of similarity tests of CDVAE [MCC], CDVAE-GANMCC, and CDVAE-CLS-GANMCC in Figure 8 and the DEM results in Table IV. CDVAE-CLS-GANMCC achieves the highest similarity scores in Figure 8 and gives the highest DEM scores in Table IV. The result verifies the positive correlation between the conversion performance and the degree of disentanglement of the latent codes.

VI Conclusions

In this paper, we have extended the cross-domain VAE based VC framework by integrating GANs and CLS into the training phase. The GAN objective was used to better approximate the distribution of real speech signals. The CLS, on the other hand, was applied to the latent code as an explicit constraint to eliminate speaker-dependent factors. Objective and subjective evaluations confirmed the effectiveness of the GAN and CLS objectives. We have also investigated the correlation between the degree of disentanglement and the conversion performance. A novel evaluation metric, DEM, that measures the degree of disentanglement in VC was derived. Experimental results confirmed a positive correlation between the degree of disentanglement and the conversion performance.

In the future, we will exploit more acoustic features in the CDVAE system, including lower-level features, such as the magnitude spectrum, and hand-crafted features, such as line spectral pairs. An effective algorithm that can optimally determine the latent space dimension is also worthy of study. Finally, it is worthwhile to generalize this disentanglement framework to extract speaker-invariant latent representations from unknown source speakers in order to achieve many-to-one VC.

We have made the source code publicly accessible so that readers can reproduce our results.222https://github.com/unilight/cdvae-vc


  • [1] A. Achille and S. Soatto (2018) Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research 19 (1), pp. 1947–1980. Cited by: §I.
  • [2] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §III-A, §V-A.
  • [3] S. Aryal, D. Felps, and R. Gutierrez-Osuna (2013) Foreign accent conversion through voice morphing.. In Proc. Interspeech, pp. 3077–3081. Cited by: §I.
  • [4] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §V-A.
  • [5] R. Caruana (1997) Multitask learning. Machine Learning 28 (1), pp. 41–75. Cited by: §I.
  • [6] L. H. Chen, Z. H. Ling, L. J. Liu, and L. R. Dai (2014) Voice conversion using deep neural networks with layer-wise generative training. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (12), pp. 1859–1872. Cited by: §I.
  • [7] T. Q. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud (2018) Isolating sources of disentanglement in variational autoencoders. pp. 2610–2620. Cited by: §I.
  • [8] D. Childers, B. Yegnanarayana, and K. Wu (1985) Voice conversion: factors responsible for quality. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 10, pp. 748–751. Cited by: §I.
  • [9] J. Chorowski, R. J. Weiss, S. Bengio, and A. v. d. Oord (2019) Unsupervised speech representation learning using wavenet autoencoders. arXiv preprint arXiv:1901.08810. Cited by: §I.
  • [10] J. Chou and H. Lee (2019) One-Shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization. pp. 664–668. Cited by: §I.
  • [11] J. Chou, C. Yeh, H. Lee, and L. Lee (2018) Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations. In Proc. Interspeech, pp. 501–505. Cited by: §I, §I, §I.
  • [12] Y. Chung and J. Glass (2018) Speech2Vec: a sequence-to-sequence framework for learning word embeddings from speech. In Proc. Interspeech, pp. 811–815. Cited by: §I.
  • [13] S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad (2010) Spectral mapping using artificial neural networks for voice conversion. IEEE Transactions on Audio, Speech, and Language Processing 18 (5), pp. 954–964. Cited by: §I.
  • [14] C. Eastwood and C. K. I. Williams (2018) A framework for the quantitative evaluation of disentangled representations. In ICLR, Cited by: §I.
  • [15] D. Erro, A. Moreno, and A. Bonafonte (2009) INCA algorithm for training voice conversion systems from nonparallel corpora. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 18 (5), pp. 944–953. Cited by: §I.
  • [16] D. Erro, A. Moreno, and A. Bonafonte (2009) Voice conversion based on weighted frequency warping. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 18 (5), pp. 922–931. Cited by: §I.
  • [17] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai (1992) An adaptive algorithm for mel-cepstral analysis of speech. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 137–140. Cited by: §I.
  • [18] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §I.
  • [19] E. Godoy, O. Rosec, and T. Chonavel (2011) Voice conversion using dynamic frequency warping with amplitude scaling, for parallel or nonparallel corpora. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 20 (4), pp. 1313–1323. Cited by: §I.
  • [20] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pp. 2672–2680. Cited by: §I, §III-A, §III.
  • [21] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in neural information processing systems, pp. 5767–5777. Cited by: §III-A, §V-A, §V-A.
  • [22] T. Hayashi, A. Tamamori, K. Kobayashi, K. Takeda, and T. Toda (2017) An investigation of multi-speaker training for wavenet vocoder. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 712–718. Cited by: §I.
  • [23] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang (2016) Voice conversion from non-parallel corpora using variational auto-encoder. In Proc. APSIPA ASC, pp. 1–6. Cited by: Fig. 2, §I, §I, §II-A, 1st item.
  • [24] C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang (2017) Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks. In Proc. Interspeech, pp. 3364–3368. Cited by: §I, §III-A.
  • [25] W. Hsu, Y. Zhang, and J. Glass (2017) Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in neural information processing systems, pp. 1878–1889. Cited by: §I.
  • [26] W. Hsu, Y. Zhang, R. J. Weiss, Y. Chung, Y. Wang, Y. Wu, and J. Glass (2019) Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5901–5905. Cited by: §I.
  • [27] W. Hsu, Y. Zhang, R. Weiss, H. Zen, Y. Wu, Y. Cao, and Y. Wang (2019) Hierarchical generative modeling for controllable speech synthesis. In International Conference on Learning Representations, Cited by: §I.
  • [28] W. Huang, H. Hwang, Y. Peng, Y. Tsao, and H. Wang (2018) Voice conversion based on cross-domain features using variational auto encoders. In 11th International Symposium on Chinese Spoken Language Processing (ISCSLP), pp. 51–55. Cited by: §I, Fig. 3, §II-B, §II-B, §V-A.
  • [29] W. Huang, Y. Wu, H. Hwang, P. L. Tobing, T. Hayashi, K. Kobayashi, T. Toda, Y. Tsao, and H. Wang (2019) Refined wavenet vocoder for variational autoencoder based voice conversion. In 27th European Signal Processing Conference (EUSIPCO), Cited by: §I.
  • [30] W. Huang, Y. Wu, C. Lo, P. L. Tobing, T. Hayashi, K. Kobayashi, T. Toda, Y. Tsao, and H. Wang (2019) Investigation of f0 conditioning and fully convolutional networks in variational autoencoder based voice conversion. pp. 709–713. Cited by: §I, 2nd item, §V-A, §V-A, §V-F.
  • [31] A. Kain and M. W. Macon (1998) Spectral voice conversion for text-to-speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vol. 1, pp. 285–288. Cited by: §I.
  • [32] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo (2019) ACVAE-vc: non-parallel voice conversion with auxiliary classifier variational autoencoder. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP) 27 (9), pp. 1432–1443. Cited by: §I.
  • [33] H. Kameoka, T. Kaneko, K. Tanaka, and N. Hojo (2018) StarGAN-vc: non-parallel many-to-many voice conversion using star generative adversarial networks. In IEEE Spoken Language Technology Workshop (SLT), pp. 266–273. Cited by: §I.
  • [34] T. Kaneko, H. Kameoka, K. Tanaka, and N. Hojo (2019) Cyclegan-vc2: improved cyclegan-based non-parallel voice conversion. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6820–6824. Cited by: §I.
  • [35] T. Kaneko, H. Kameoka, N. Hojo, Y. Ijima, K. Hiramatsu, and K. Kashino (2017) Generative adversarial network-based postfilter for statistical parametric speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4910–4914. Cited by: §V-C.
  • [36] T. Kaneko and H. Kameoka (2017) Parallel-data-free voice conversion using cycle-consistent adversarial networks. arXiv preprint arXiv:1711.11293. Cited by: §I.
  • [37] H. Kawahara, I. Masuda-Katsuse, and A. De Cheveigne (1999) Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based f0 extraction: possible role of a repetitive structure in sounds. Speech communication 27 (3-4), pp. 187–207. Cited by: §I.
  • [38] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §I, §II-A.
  • [39] K. Kobayashi, T. Hayashi, A. Tamamori, and T. Toda (2017) Statistical voice conversion with wavenet-based waveform generation. In Proc. Interspeech, pp. 1138–1142. Cited by: §I.
  • [40] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther (2016) Autoencoding beyond pixels using a learned similarity metric. In Proceedings of The 33rd International Conference on Machine Learning, Vol. 48, pp. 1558–1566. Cited by: §III-A.
  • [41] J. Latorre, V. Wan, and K. Yanagisawa (2014) Voice expression conversion with factorised hmm-tts models. In Proc. Interspeech, pp. 1514–1518. Cited by: §I.
  • [42] L.-J. Liu, Z.-H. Ling, Y. Jiang, M. Zhou, and L.-R. Dai (2018) WaveNet vocoder with limited training data for voice conversion. In Proc. Interspeech, pp. 1983–1987. Cited by: §I.
  • [43] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440. Cited by: §V-A.
  • [44] J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavicencio, T. Kinnunen, and Z. Ling (2018) The voice conversion challenge 2018: promoting development of parallel and nonparallel methods. In Proc. Odyssey The Speaker and Language Recognition Workshop, pp. 195–202. Cited by: §I, 2nd item.
  • [45] J. Lorenzo-Trueba, F. Fang, X. Wang, I. Echizen, J. Yamagishi, and T. Kinnunen (2018) Can we steal your vocal identity from the internet?: initial investigation of cloning obama’s voice using gan, wavenet and low-quality found data. In Proc. Odyssey The Speaker and Language Recognition Workshop, pp. 240–247. Cited by: §I.
  • [46] Z. Meng, J. Li, Z. Chen, Y. Zhao, V. Mazalov, Y. Gang, and B. Juang (2018) Speaker-invariant training via adversarial learning. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5969–5973. Cited by: §I.
  • [47] S. H. Mohammadi and A. Kain (2017) An overview of voice conversion systems. Speech Communication 88, pp. 65–82. Cited by: §I.
  • [48] E. Nachmani and L. Wolf (2019) Unsupervised Singing Voice Conversion. pp. 2583–2587. Cited by: §I.
  • [49] K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano (2012) Speaking-aid systems using gmm-based voice conversion for electrolaryngeal speech. Speech Communication 54 (1), pp. 134–146. Cited by: §I.
  • [50] M. Narendranath, H. A. Murthy, S. Rajendran, and B. Yegnanarayana (1995) Transformation of formants for voice conversion using artificial neural networks. Speech communication 16 (2), pp. 207–216. Cited by: §I.
  • [51] O. Ocal, O. H. Elibol, G. Keskin, C. Stephenson, A. Thomas, and K. Ramchandran (2019) Adversarially trained autoencoders for parallel-data-free voice conversion. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2777–2781. Cited by: §I.
  • [52] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu (2016) Wavenet: a generative model for raw audio. arXiv preprint arXiv:1609.03499. Cited by: §I.
  • [53] Y. Peng, H. Hwang, Y. Wu, Y. Tsao, and H. Wang (2018) Exemplar-based spectral detail compensation for voice conversion. In Proc. Interspeech, pp. 486–490. Cited by: §I.
  • [54] K. Qian, Y. Zhang, S. Chang, X. Yang, and M. Hasegawa-Johnson (2019) AutoVC: zero-shot voice style transfer with only autoencoder loss. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, pp. 5210–5219. Cited by: §I, §I.
  • [55] Y. Saito, Y. Ijima, K. Nishida, and S. Takamichi (2018) Non-parallel voice conversion using variational autoencoders conditioned by phonetic posteriorgrams and d-vectors. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5274–5278. Cited by: §I.
  • [56] H. Silén, E. Helander, J. Nurminen, and M. Gabbouj (2012) Ways to implement global variance in statistical speech synthesis. In Proc. Interspeech, pp. 1436–1439. Cited by: §V-A.
  • [57] B. Sisman, M. Zhang, and H. Li (2018) A voice conversion framework with tandem feature sparse representation and speaker-adapted ”wavenet” vocoder. In Proc. Interspeech, pp. 1978–1982. Cited by: §I.
  • [58] B. Sisman, H. Li, and K. C. Tan (2017) Sparse representation of phonetic features for voice conversion with and without parallel data. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 677–684. Cited by: §I.
  • [59] B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura (2018) Adaptive wavenet vocoder for residual compensation in gan-based voice conversion. In IEEE Spoken Language Technology Workshop (SLT), pp. 282–289. Cited by: §I.
  • [60] Y. Stylianou, O. Cappe, and E. Moulines (1998) Continuous probabilistic transform for voice conversion. IEEE Transactions on Speech and Audio Processing 6 (2), pp. 131–142. Cited by: §I.
  • [61] L. Sun, S. Kang, K. Li, and H. Meng (2015) Voice conversion using deep bidirectional long short-term memory based recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4869–4873. Cited by: §I.
  • [62] L. Sun, K. Li, H. Wang, S. Kang, and H. Meng (2016) Phonetic posteriorgrams for many-to-one voice conversion without parallel data training. In IEEE International Conference on Multimedia and Expo (ICME), pp. 1–6. Cited by: §I.
  • [63] S. Sun, C. Yeh, M. Hwang, M. Ostendorf, and L. Xie (2018) Domain adversarial training for accented speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4854–4858. Cited by: §I.
  • [64] S. Sun, B. Zhang, L. Xie, and Y. Zhang (2017) An unsupervised deep domain adaptation approach for robust speech recognition. Neurocomputing 257, pp. 79–87. Cited by: §I.
  • [65] D. Sundermann and H. Ney (2003) VTLN-based voice conversion. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 556–559. Cited by: §I.
  • [66] S. Takamichi, T. Toda, A. W. Black, G. Neubig, S. Sakti, and S. Nakamura (2016) Postfilters to modify the modulation spectrum for statistical parametric speech synthesis. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP) 24, pp. 755–767. Cited by: 3rd item.
  • [67] R. Takashima, T. Takiguchi, and Y. Ariki (2012) Exemplar-based voice conversion in noisy environment. In IEEE Spoken Language Technology Workshop (SLT), pp. 313–317. Cited by: §I.
  • [68] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda (2017) Speaker-dependent wavenet vocoder. In Proc. Interspeech, pp. 1118–1122. Cited by: §I.
  • [69] K. Tanaka, T. Toda, G. Neubig, S. Sakti, and S. Nakamura (2013) A hybrid approach to electrolaryngeal speech enhancement based on spectral subtraction and statistical voice conversion.. In Proc. Interspeech, pp. 3067–3071. Cited by: §I.
  • [70] X. Tian, S. W. Lee, Z. Wu, E. S. Chng, and H. Li (2017) An exemplar-based approach to frequency warping for voice conversion. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP) 25 (10), pp. 1863–1876. Cited by: §I.
  • [71] P. L. Tobing, T. Hayashi, Y. Wu, K. Kobayashi, and T. Toda (2018) An evaluation of deep spectral mappings and wavenet vocoder for voice conversion. In IEEE Spoken Language Technology Workshop (SLT), pp. 297–303. Cited by: §I.
  • [72] T. Toda, K. Nakamura, H. Saruwatari, K. Shikano, et al. (2014) Alaryngeal speech enhancement based on one-to-many eigenvoice conversion. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 22 (1), pp. 172–183. Cited by: §I.
  • [73] T. Toda, A. W. Black, and K. Tokuda (2007) Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory. IEEE Transactions on Audio, Speech, and Language Processing 15 (8), pp. 2222–2235. Cited by: §I, §II, §V-C, §V-D.
  • [74] Y.-C. Wu, H.-T. Hwang, C.-C. Hsu, Y. Tsao, and H.-M. Wang (2016) Locally linear embedding for exemplar-based spectral conversion. In Proc. Interspeech, pp. 1652–1656. Cited by: §I.
  • [75] Z. Wu, T. Virtanen, E. S. Chng, and H. Li (2014) Exemplar-based sparse representation with residual compensation for voice conversion. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (10), pp. 1506–1521. Cited by: §I.
  • [76] Y. Xu, J. Du, Z. Huang, L. Dai, and C. Lee (2015) Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement. In Proc. Interspeech, pp. 1508–1512. Cited by: §I.