Emotional voice conversion (EVC) is a type of voice conversion (VC) that converts the emotional state of speech from one to another while preserving the linguistic content and speaker identity. It has various applications in expressive speech synthesis [liu2019teacher] and intelligent dialogue systems, such as voice assistants and conversational agents.
In general, voice conversion refers to the conversion of speaker identity while preserving the linguistic information. Earlier VC studies include Gaussian mixture model (GMM) [toda2007voice], partial least square regression (PLSR) [helander2010voice], exemplar-based sparse representation with non-negative matrix factorization (NMF) [aihara2014exemplar, sisman2018voice] and group sparse representation [ccicsman2017sparse]. Deep learning approaches [hinton2006fast] and variational autoencoder (VAE) [huang2019investigation, qian2019zero, Qian_2020] have greatly improved the voice conversion quality.
Spectral mapping has been the main focus of conventional voice conversion, whereas prosody mapping has not been given the same level of attention. We note that emotion is inherently supra-segmental and hierarchical in nature, and that it is manifested in both the spectrum and the prosody [xu2011speech, latorre2008multilevel]. Therefore, it is insufficient for emotional voice conversion to convert only the spectral features frame by frame.
Statistical modelling for prosody conversion represents one of the successful attempts. In [tao2006prosody], the pitch contour was decomposed into a hierarchical structure with a classification-regression tree, then converted by GMM and regression-based clustering methods. A GMM-based model [aihara2012gmm] was proposed to handle both spectrum and prosody conversion. Another strategy is to create a source and target dictionary and estimate a sparse mapping using exemplar-based techniques with NMF [aihara2014exemplar]. Moreover, an emotional voice conversion model combining hidden Markov model (HMM), GMM, and F0 (fundamental frequency) segment selection method was proposed in [inanoglu2009data], which can convert pitch, duration and spectrum. The prior studies serve as a source of inspiration for this work.
There have been studies on deep learning approaches for emotional voice conversion with parallel training data, such as deep neural network [lorenzo2018investigating, luo2016emotional] and deep bi-directional long short-term memory [ming2016deep]. More recently, other methods, such as sequence-to-sequence model [robinson2019sequence] and rule-based model [xue2018voice], were also proven to be effective. To eliminate the need for parallel training data, emotional voice conversion frameworks based on autoencoders [gao2018nonparallel] and cycle-consistent generative adversarial networks (CycleGAN) [zhou2020transforming] were proposed and have shown remarkable performance. We note that these frameworks are designed for a specific speaker; therefore, they are called speaker-dependent frameworks.
It is believed that emotional expression and perception present individual variations influenced by personalities, languages and cultures [kotti2012speaker, fersini2009audio, arnold1960emotion, dai2009comparing]. At the same time, they share some common cues across different individuals regardless of their identities and backgrounds [arnold1960emotion, dai2009comparing, schuller2005speaker]. In the field of emotion recognition, speaker-independent emotion recognition is more robust and stable, and generalizes better, than speaker-dependent recognition [kotti2012speaker]. However, so far, few researchers have explored speaker-independent emotional voice conversion. Most related studies, such as [shankar2019multi], have only dealt with a multi-speaker model at most.
CycleGAN is an effective solution for voice conversion without parallel training data; however, it is more suitable for pair-wise conversion. An encoder-decoder structure, such as variational autoencoding Wasserstein generative adversarial network (VAW-GAN) [hsu2017voice], would be more suitable for learning emotion-independent representations. The main contributions of this paper include: 1) we study emotion from a speaker-independent perspective for voice conversion; 2) we propose a VAW-GAN architecture and its training framework that does not require parallel training data; and 3) we study prosody modelling, and propose F0 conditioning for emotion-independent encoder training.
The paper is organized as follows: In Section 2, we motivate the perspective of speaker-independent emotion. In Section 3, we introduce the proposed emotional voice conversion framework. In Section 4, we report the experiments. Section 5 concludes the study.
2 Speaker-Independent Perspective on Emotion
It is well known that speech is more than just words, and it carries the emotions of the speaker. Emotion reflects the intent, mood and temperament of the speaker, and plays an important role in decision making and opinion expression [arnold1960emotion]. Emotion is highly complex with multiple signal attributes regarding the spectrum and prosody, which makes it difficult to disentangle and synthesize [xu2011speech].
Previous studies have revealed that basic emotions can be expressed and recognized through universal principles which are innately shared across human cultures [ekman1992argument]. It is also commonly believed that the emotional state in speech has an impact on the speech production mechanism across the glottal source and vocal tract of the individuals [kane2014phonetic]. These studies prompt us to investigate speech emotions from a speaker-independent perspective [schuller2005speaker]. Studies have also shown possible ways of speaker-independent emotion representation for both seen and unseen speakers over a large multi-speaker emotional corpus [kotti2012speaker, akccay2020speech].
To validate the idea of speaker-independent emotion elements across speakers [schuller2005speaker, kotti2012speaker], we conduct a preliminary study using a CycleGAN-based emotional voice conversion framework [zhou2020transforming], which is designed for speaker-dependent EVC. In this study, we train a network with two conversion pipelines for the mapping of spectrum and prosody (F0 features based on the continuous wavelet transform, CWT) respectively. We train the network on one specific speaker and test it on both the intended (seen) speaker and an unseen speaker. In Table 1, we report the performance of spectrum conversion in terms of Mel-cepstral distortion (MCD) and log-spectral distortion (LSD) [huang2019investigation, qian2019zero, Qian_2020, sisman2019group], and that of prosody conversion in terms of Pearson correlation coefficient (PCC) and root mean square error (RMSE) of F0 contours [sisman2019group]. Zero effort represents the cases where we directly compare the speech of source and target emotions without any conversion.
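The four objective measures named above can be sketched as follows. This is a minimal illustration under common textbook definitions, not the evaluation code used in the study: it assumes the reference and converted features are already time-aligned frame by frame, that the F0 contours are compared on matching frames, and that the function names are hypothetical. LSD in particular has several variants in the literature.

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep, conv_mcep):
    """Mean MCD in dB between two aligned (frames, dims) Mel-cepstral arrays."""
    diff = np.asarray(ref_mcep, dtype=float) - np.asarray(conv_mcep, dtype=float)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))

def log_spectral_distortion(ref_sp, conv_sp, eps=1e-10):
    """Mean LSD in dB between two aligned (frames, bins) power spectra."""
    ref_sp = np.asarray(ref_sp, dtype=float)
    conv_sp = np.asarray(conv_sp, dtype=float)
    log_ratio = 10.0 * np.log10((ref_sp + eps) / (conv_sp + eps))
    return float(np.mean(np.sqrt(np.mean(log_ratio ** 2, axis=1))))

def f0_pcc(ref_f0, conv_f0):
    """Pearson correlation coefficient between two aligned F0 contours."""
    return float(np.corrcoef(ref_f0, conv_f0)[0, 1])

def f0_rmse(ref_f0, conv_f0):
    """Root mean square error between two aligned F0 contours."""
    diff = np.asarray(ref_f0, dtype=float) - np.asarray(conv_f0, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy data (hypothetical values, for illustration only).
ref_f0 = np.array([100.0, 120.0, 140.0, 160.0])
conv_f0 = np.array([105.0, 118.0, 150.0, 155.0])
print(f0_pcc(ref_f0, conv_f0), f0_rmse(ref_f0, conv_f0))
```

Lower MCD, LSD and RMSE indicate a closer match to the target, while a higher PCC indicates that the converted F0 contour follows the shape of the target contour more closely.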
We observe that the speaker-dependent CycleGAN system performs well for the unseen speaker, which is encouraging. As shown in Table 1, the results for the unseen speaker are clearly better than those for Zero Effort in terms of MCD and LSD for spectrum, and PCC and RMSE for prosody, even though the framework [zhou2020transforming] has no information about the unseen speaker in advance. Encouraged by this observation, we propose a speaker-independent emotional voice conversion framework that converts anyone’s emotion.
|Speaker (condition)||MCD [dB]||LSD [dB]||PCC||RMSE|
|Seen (Zero effort)||5.210||7.383||0.571||62.242|
|Unseen (Zero effort)||5.296||7.400||0.440||66.646|
3 Speaker-Independent Emotional Voice Conversion
An encoder-decoder structure, such as VAW-GAN [hsu2017voice], allows us to learn the emotion-independent representations using the encoder. Instead of CycleGAN, we propose to take advantage of the encoder-decoder structure of VAW-GAN to formulate a speaker-independent emotional voice conversion framework.
We first extract spectral (SP) and F0 features using the WORLD vocoder. It is believed that F0 is hierarchical in nature. Thus, it is insufficient to use a Logarithm Gaussian (LG)-based linear transformation of F0 to describe the prosody [sisman2019group, csicsman2017transformation, luo2016emotional]. We perform CWT decomposition of F0 to describe the prosody from the micro-prosody level to the whole-utterance level, in a way similar to that reported in [suni2013wavelets]. We believe that F0 contains speaker-dependent and speaker-independent components [sisman2018wavelet], and that the CWT decomposition of F0 allows the encoder to learn the speaker-independent emotion pattern across different speakers [csicsman2017transformation]. As CWT is sensitive to the discontinuities in F0, the following preprocessing steps are needed: 1) linear interpolation over unvoiced regions, 2) transformation of F0 from a linear to a logarithmic scale, and 3) normalization of the resulting F0 to zero mean and unit variance [zhou2020transforming, sisman2019group, csicsman2017transformation].
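The three preprocessing steps and the subsequent multi-scale decomposition can be sketched as below. This is an illustrative sketch rather than the authors' implementation: the Mexican hat mother wavelet, the ten dyadic scales, the base scale of two frames, and the function names are all assumptions made for the example, since none of them is specified in the text above. Unvoiced frames are assumed to be marked with F0 = 0.

```python
import numpy as np

def preprocess_f0(f0):
    """Interpolate unvoiced frames, take the log, and z-normalize the contour."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    if not voiced.any():
        raise ValueError("F0 contour has no voiced frames")
    frames = np.arange(len(f0))
    # 1) Linear interpolation over unvoiced regions (edges are held constant).
    continuous = np.interp(frames, frames[voiced], f0[voiced])
    # 2) Linear to logarithmic scale.
    log_f0 = np.log(continuous)
    # 3) Zero mean and unit variance; keep the statistics for later de-normalization.
    mean, std = log_f0.mean(), log_f0.std()
    std = std if std > 0 else 1.0
    return (log_f0 - mean) / std, mean, std

def mexican_hat(t):
    """Mexican hat (Ricker) mother wavelet, an assumed choice for this sketch."""
    return (2.0 / (np.sqrt(3.0) * np.pi ** 0.25)) * (1.0 - t ** 2) * np.exp(-0.5 * t ** 2)

def cwt_decompose(signal, num_scales=10, base_scale=2.0):
    """Return a (frames, num_scales) array of wavelet coefficients.

    Scale i spans base_scale * 2**i frames, so small i captures fast
    (micro-prosody) variation and large i captures utterance-level trends.
    """
    signal = np.asarray(signal, dtype=float)
    frames = np.arange(len(signal))
    offsets = frames[None, :] - frames[:, None]
    components = np.empty((len(signal), num_scales))
    for i in range(num_scales):
        scale = base_scale * 2.0 ** i
        kernel = mexican_hat(offsets / scale) / np.sqrt(scale)
        components[:, i] = kernel @ signal
    return components

# Toy contour in Hz with unvoiced frames marked as 0 (hypothetical values).
f0 = np.array([0.0, 110.0, 0.0, 0.0, 140.0, 150.0, 0.0])
norm_f0, mean, std = preprocess_f0(f0)
features = cwt_decompose(norm_f0)
print(features.shape)  # (7, 10)
```

The direct matrix form of the transform is quadratic in the number of frames, which is acceptable for a sketch; a practical system would normally use a convolution-based or library implementation.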
The proposed VAW-GAN and its training procedure are shown in Figure 1. It was found that separate training of spectrum and prosody achieves better performance than joint training for emotion conversion [zhou2020transforming]. Following this idea, we propose to train two networks separately: 1) a VAW-GAN model conditioned on F0 for spectrum conversion, denoted as VAW-GAN for Spectrum, and 2) a VAW-GAN model for prosody conversion denoted as VAW-GAN for Prosody.
Both networks consist of three components: 1) an encoder, 2) a generator/decoder, and 3) a discriminator. During the training of VAW-GAN for Spectrum and VAW-GAN for Prosody, the encoder is exposed to input frames from multiple speakers with different emotions. The encoder learns the emotion-independent patterns among multiple speakers and transforms the input features into a latent code. In this case, we assume that the latent code is emotion-independent and only contains the information of speaker identity and phonetic content. We use a one-hot vector as the emotion ID to represent different emotions and provide the emotion information to the generator/decoder. Since the spectral features in VAW-GAN for Spectrum training highly depend on F0 and contain prosodic information, we propose to use F0 as an additional condition to the decoder, to disentangle the spectral features from the prosody, as shown in Figure 1.
We then train the generative model based on adversarial learning for both spectrum and prosody, to find an optimal solution through a min-max game: the discriminator tries to maximize the loss between the real and reconstructed features, while the generator tries to minimize it [ak2019attribute, ak2020semantically]. It allows us to achieve high-quality converted speech with the target emotion that is defined by the emotion ID.
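The min-max game can be illustrated with the generic Wasserstein GAN form of the adversarial term. The sketch below is an assumption about the shape of that term only: it takes discriminator scores as given arrays, uses hypothetical function names, and leaves out the variational autoencoder terms that an encoder-decoder model would also optimize.

```python
import numpy as np

def wasserstein_gap(scores_real, scores_generated):
    """Estimate E[D(real)] - E[D(generated)] from discriminator scores."""
    return float(np.mean(scores_real) - np.mean(scores_generated))

def discriminator_loss(scores_real, scores_generated):
    """The discriminator maximizes the gap, i.e. minimizes its negative."""
    return -wasserstein_gap(scores_real, scores_generated)

def generator_loss(scores_generated):
    """The generator minimizes the gap; only the generated term depends on it."""
    return -float(np.mean(scores_generated))

# Hypothetical discriminator scores for a small batch.
scores_real = np.array([1.0, 3.0])
scores_generated = np.array([0.0, 1.0])
print(discriminator_loss(scores_real, scores_generated))  # -1.5
print(generator_loss(scores_generated))                   # -0.5
```

As the generator improves, the scores of its outputs rise toward those of real features and the gap shrinks, which is the sense in which the two players pull the same quantity in opposite directions.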
3.2 Run-time Conversion
The run-time conversion phase is given in Figure 2. As in the training phase, for prosody conversion, we perform CWT on F0 to decompose prosody into different time scales, and the CWT-based F0 features are then taken by the trained VAW-GAN for Prosody to generate the converted F0 with the designated emotion ID. As for spectrum conversion, we propose to condition the generator/decoder on the converted CWT-based F0 features together with the emotion ID. The converted spectral features are then obtained through the trained VAW-GAN for Spectrum. Finally, we use the WORLD vocoder to synthesize the converted emotional speech. We note that aperiodicity (AP) is directly copied from the source speech.
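The order of operations at run-time can be summarized as a small data-flow sketch. The two model arguments are hypothetical callables standing in for the trained networks, the toy models and array sizes (513 spectral bins, 10 CWT scales) are placeholders, and feature extraction and waveform synthesis with the vocoder are left out.

```python
import numpy as np

def convert_emotion(sp, cwt_f0, ap, emotion_id, prosody_model, spectrum_model):
    """Run-time data flow: prosody first, then spectrum conditioned on it."""
    # 1) Convert the CWT-based F0 features toward the target emotion.
    converted_cwt_f0 = prosody_model(cwt_f0, emotion_id)
    # 2) Convert the spectrum, conditioned on the *converted* F0 and the emotion ID.
    converted_sp = spectrum_model(sp, converted_cwt_f0, emotion_id)
    # 3) Aperiodicity is passed through unchanged from the source speech.
    return converted_sp, converted_cwt_f0, ap

# Toy stand-ins for the trained networks (not real models).
def toy_prosody_model(cwt_f0, emotion_id):
    return cwt_f0 + 0.5

def toy_spectrum_model(sp, cwt_f0, emotion_id):
    return sp * (1.0 + cwt_f0.mean())

frames = 4
sp = np.ones((frames, 513))
cwt_f0 = np.zeros((frames, 10))
ap = np.full((frames, 513), 0.1)
emotion_id = np.eye(10)[1]  # hypothetical slot for the target emotion

out_sp, out_cwt_f0, out_ap = convert_emotion(
    sp, cwt_f0, ap, emotion_id, toy_prosody_model, toy_spectrum_model
)
print(out_sp.shape, out_cwt_f0.shape, out_ap is ap)
```

The point of the ordering is that the spectrum model never sees the source prosody as its condition: it is driven by the F0 features already converted to the target emotion.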
3.3 Effect of F0 Conditioning
We note that, for both spectrum and prosody conversion, the encoder is trained with input features from multiple speakers with different emotions to generate the emotion-independent latent code. The decoder is conditioned on the target emotion ID to generate the speech. Since the spectral features are highly dependent on F0 features and also carry prosodic information, it is insufficient to train an emotion-independent encoder using only the one-hot emotion ID. Thus, we propose to add F0 as an additional input to the generator (decoder), which aims to force the encoder to learn only an emotion-independent representation. F0 conditioning provides remarkable improvement over the baseline in both objective and subjective evaluation, as will be reported in Section 4.
3.4 Comparison with Related Work
The proposed EVC framework is unique in the sense that it eliminates the need for: 1) parallel training data, 2) any alignment technique, 3) any speaker embedding, such as i-vector, and 4) external modules such as a speech recognizer. It shares a similar motivation with other VC frameworks based on conditional VAE [huang2019investigation, Qian_2020] regarding F0 conditioning, but differs in several ways: 1) we study emotion conversion, while [huang2019investigation, Qian_2020] focus on speaker identity conversion; 2) through the F0 conditioning mechanism, we eliminate the residual prosodic information in the latent code to make it emotion-independent, while [huang2019investigation, Qian_2020] focus on the generation of a speaker-independent latent code and do not study the emotion perspective; 3) we propose to condition the generator/decoder on the converted CWT-based F0 features at run-time, which has not been studied before.
4 Experiments
We note that a large emotional multi-speaker voice conversion dataset is not publicly available. Therefore, we combine three different emotional speech corpora to conduct experiments, namely: 1) an English emotional speech corpus [liu2014emotional], 2) EmoV-DB [adigwe2018emotional], and 3) JL-Corpus [james2018open]. We train the networks using the speech data of three female speakers from the first two datasets, and conduct emotion conversion on these three speakers and on another two female speakers randomly chosen from JL-Corpus for evaluation. We refer to these two speakers from JL-Corpus as unseen speakers, since the framework has no prior information about them during training. Those involved in both the training and conversion phases are denoted as seen speakers.
We choose the two emotions common to these three datasets, namely 1) neutral and 2) angry. In all experiments, we conduct emotion conversion from neutral to angry. We conduct both objective and subjective experiments with 2 minutes of evaluation data to assess the system performance in a comparative study.
4.1 Experimental Setup
As illustrated in Figure 1, we train two similar VAW-GAN pipelines for both spectrum and prosody conversion. The encoder for both frameworks is a 5-layer 1D convolutional neural network (CNN) with a kernel size of 7 and a stride of 3 followed by a fully connected layer. Its output channel is
. The latent code is 64-dimensional and assumed to have a standard normal distribution.
In the prosody conversion pipeline, the emotion ID is a 10-dimensional one-hot vector that is concatenated with the latent code to generate a 74-dimensional vector, which is then merged by a fully connected layer. In the spectrum conversion pipeline, a one-dimensional F0 is concatenated together with the latent code and emotion ID into a 75-dimensional vector. For the GAN, the generator is a 4-layer 1D CNN with kernel sizes of and strides of , and the output channel is . The discriminator is a 3-layer 1D CNN with kernel sizes of and a stride of followed by a fully connected layer. Its output channels are
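The dimensions of the conditioned decoder input follow directly from the sizes stated above: 64 + 10 = 74 for prosody, and 64 + 10 + 1 = 75 for spectrum. A minimal sketch of this concatenation is given below; the ordering of the parts, the emotion-to-index mapping, and the function names are assumptions made for illustration.

```python
import numpy as np

LATENT_DIM = 64    # latent code size stated in the text
EMOTION_DIM = 10   # one-hot emotion ID size stated in the text

def one_hot(index, size=EMOTION_DIM):
    """Return a one-hot vector with a 1 at the given index."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def decoder_input(latent, emotion_index, f0=None):
    """Concatenate latent code, emotion ID and (optionally) a scalar F0 value."""
    parts = [np.asarray(latent, dtype=float), one_hot(emotion_index)]
    if f0 is not None:
        parts.append(np.atleast_1d(float(f0)))
    return np.concatenate(parts)

rng = np.random.default_rng(0)
latent = rng.standard_normal(LATENT_DIM)  # latent code assumed standard normal
ANGRY = 1                                 # hypothetical index for the target emotion

prosody_in = decoder_input(latent, ANGRY)
spectrum_in = decoder_input(latent, ANGRY, f0=0.3)
print(prosody_in.shape, spectrum_in.shape)  # (74,) (75,)
```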
4.2 Objective Evaluation
We perform objective evaluation to assess the performance of both spectrum and prosody conversion. We use MCD and LSD for spectrum conversion evaluation, while PCC is used for prosody conversion. In this section, the proposed VAW-GAN-based EVC framework given in Figure 1 is denoted as CWT-C-VAWGAN. The baseline framework, denoted as C-VAWGAN, converts spectrum with VAW-GAN conditioned on LG-based F0 without CWT decomposition, where F0 is converted in a traditional manner with LG-based linear transformation [sisman2019group]. In Table 2, we report comprehensive experimental results for both seen and unseen speakers.
|Framework||MCD [dB]||LSD [dB]||PCC|
Firstly, we observe that the proposed CWT-C-VAWGAN framework outperforms the baseline in terms of spectrum conversion by achieving consistently lower LSD and MCD values for both seen and unseen speakers. This shows that, in terms of F0 conditioning, CWT-based converted F0 features are more effective than the LG-based F0 features. We also note that CWT-C-VAWGAN achieves comparable results between seen and unseen speakers, which we believe is remarkable. The results validate the idea of speaker-independent EVC in spectrum conversion.
Secondly, we compare the CWT-C-VAWGAN framework with the baseline in terms of prosody conversion. We note that CWT-C-VAWGAN consistently outperforms the baseline by achieving higher PCC for both seen and unseen speakers. Moreover, CWT-C-VAWGAN reports a closer PCC between seen and unseen speakers (0.776 vs 0.691) than C-VAWGAN (0.750 vs 0.630). These results validate the idea of speaker-independent EVC in prosody conversion.
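The "closer PCC" claim can be checked with simple arithmetic on the four values quoted above. The snippet below only restates those reported numbers and computes the drop from seen to unseen speakers for each framework.

```python
# PCC values as reported in the text: (seen speakers, unseen speakers).
pcc = {
    "CWT-C-VAWGAN": (0.776, 0.691),
    "C-VAWGAN": (0.750, 0.630),
}

# Drop in PCC when moving from seen to unseen speakers.
gaps = {name: round(seen - unseen, 3) for name, (seen, unseen) in pcc.items()}
print(gaps)  # {'CWT-C-VAWGAN': 0.085, 'C-VAWGAN': 0.12}
```

The proposed framework loses 0.085 in PCC on unseen speakers, against 0.120 for the baseline, so its prosody conversion degrades less when the speaker was not seen in training.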
4.3 Subjective Evaluation
We further conduct four listening experiments to assess the proposed CWT-C-VAWGAN in terms of speech quality, emotion similarity and speaker similarity. 15 subjects participated in all the experiments, each listening to 110 converted utterances. As a reference baseline, CWT-VAWGAN denotes the VAW-GAN system that converts spectrum and CWT-based F0 without conditioning the generator on F0.
We first report the mean opinion score (MOS) of the proposed CWT-C-VAWGAN and baseline CWT-VAWGAN for seen speakers. We note that both frameworks are based on VAW-GAN for spectrum and prosody conversion, but CWT-C-VAWGAN conditions the generator on additional CWT-based F0 features. As shown in Table 3, CWT-C-VAWGAN outperforms CWT-VAWGAN with a higher MOS score of . The results confirm the effectiveness of the proposed CWT-based F0 conditioning.
Table 3: MOS results with 95% confidence interval of the baseline (CWT-VAWGAN) and the proposed framework (CWT-C-VAWGAN).
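A mean opinion score with a 95% confidence interval, as reported in Table 3, is commonly computed from the individual listener ratings as sketched below. The normal approximation (z = 1.96) is an assumption, since the text does not state how the intervals were obtained, and the ratings in the example are hypothetical.

```python
import math
import statistics

def mos_with_ci(ratings, z=1.96):
    """Return (mean, half-width of the 95% CI) under a normal approximation."""
    mean = statistics.mean(ratings)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width

# Hypothetical 1-to-5 ratings, not data from the study.
ratings = [3, 4, 4, 5, 4]
mean, half_width = mos_with_ci(ratings)
print(f"MOS = {mean:.2f} +/- {half_width:.2f}")
```

With few ratings, a Student's t quantile would give a wider and more conservative interval than the normal approximation used here.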
We further conduct an XAB emotion similarity test to assess the emotion conversion performance, where the subjects are asked to choose the speech sample that sounds closer to the reference in terms of emotional expression. As shown in Fig. 3 (a), we observe that the proposed CWT-C-VAWGAN clearly outperforms the baseline CWT-VAWGAN in terms of emotion similarity for seen speakers. Once again, we confirm that conditioning on the converted CWT-F0 features further improves the emotional expression.
We also conduct an XAB emotion similarity test to compare the speaker-independent CWT-C-VAWGAN with its speaker-dependent counterpart, SD-CWT-C-VAWGAN, as reported in Fig. 3 (b) for seen speakers and in Fig. 3 (c) for unseen speakers. SD-CWT-C-VAWGAN is trained with data from only one specific speaker. We train the baseline separately for each of the three specific speakers and perform speaker-dependent tests. We are glad to see that speaker-independent training outperforms speaker-dependent training for both seen and unseen speakers, which is very encouraging. We believe that speaker-independent training benefits from a multi-speaker database and learns the speaker-independent emotion mapping effectively. We also observe that the listeners strongly favor speaker-independent training over speaker-dependent training for unseen speakers (Preference Score: 68.7% vs 31.3%).
Lastly, we conduct an XAB speaker similarity test to compare the proposed speaker-independent CWT-C-VAWGAN with the speaker-dependent SD-CWT-C-VAWGAN. We note that SD-CWT-C-VAWGAN is trained only with the specific speaker, and is hence expected to perform better in speaker similarity. As reported in Fig. 4, we observe that the proposed CWT-C-VAWGAN achieves results comparable to those of SD-CWT-C-VAWGAN, which we believe is an encouraging outcome. The results indicate that the proposed CWT-C-VAWGAN framework does not convert the emotion at the expense of speaker similarity, and that it preserves the speaker identity remarkably well while performing speaker-independent emotion conversion.
The experiments suggest that: (1) the proposed framework learns the speaker-independent emotional expression pattern for both spectrum and prosody across speakers; (2) in listening experiments, speaker-independent training outperforms speaker-dependent training, while successfully preserving the identity of the source speaker; (3) the proposed CWT-based F0 conditioning improves spectrum conversion; and (4) the proposed framework is capable of converting anyone’s emotion, as reported in both objective and subjective evaluation. To the best of our knowledge, this paper is the first to provide a speaker-independent perspective on emotion conversion.
5 Conclusion
In this paper, we propose a speaker-independent emotional voice conversion framework that converts anyone’s emotion without the need for parallel training data. We perform both spectrum and prosody conversion based on VAW-GAN. We provide CWT modelling of F0 to describe the prosody in different time resolutions. Moreover, we study the use of CWT-based F0 as an additional input to the decoder to improve the spectrum conversion performance. Experimental results validate the idea of speaker-independent emotion conversion by showing remarkable performance for both seen and unseen speakers.
This study highlights the need for a large emotional voice conversion corpus, which will be our future focus.