Emotion, as an essential component of human communication, can be conveyed by various prosodic features, such as pitch, intensity, and speaking rate [scherer1991vocal]. It plays an important role as a manifestation of spoken language at the semantic and pragmatic levels. An adequate rendering of emotion in speech is critically important for expressive text-to-speech, personalized speech synthesis, and intelligent dialogue systems such as social robots and conversational agents.
Emotional voice conversion is a voice conversion (VC) technique that converts the emotion of a source utterance to that of a target utterance, while preserving the linguistic information and the speaker identity, as illustrated in Figure 1. It shares many similarities with conventional voice conversion: both aim to convert non-linguistic information by mapping features from source to target. They differ in that conventional voice conversion techniques treat prosody-related features as speaker-independent. Since speaker identity is thought to be characterized by the physical attributes of the speaker, which are reflected in the spectrum and determine the voice quality of the individual [ramakrishnan2012speech], conventional VC studies mainly focus on spectrum conversion. Emotion, on the other hand, is inherently supra-segmental and hierarchical in nature [xu2011speech, latorre2008multilevel], and is manifested in both spectrum and prosody. Therefore, emotion cannot be handled simply at the frame level: it is insufficient to convert only the spectral features frame-by-frame.
Early studies of VC achieved success by training the spectral mapping on parallel speech data between source and target speakers [abe1990voice, shikano1991speaker]. Many statistical approaches have been proposed over the past decades, such as the Gaussian Mixture Model (GMM) [toda2007voice] and Partial Least Squares Regression (PLSR) [helander2010voice]. Other VC methods, such as Non-negative Matrix Factorization (NMF) [lee2001algorithms] and exemplar-based sparse representation schemes [wu2014exemplar, ccicsman2017sparse, sisman2018voice], were designed to address the over-smoothing problem in VC.
Deep learning approaches, such as the Recurrent Neural Network (RNN) [nakashika2014high], have helped VC systems reach a higher level in modeling the relationship between source and target features. More recently, several approaches have been proposed to eliminate the need for parallel data in VC, such as Deep Bidirectional Long Short-Term Memory (DBLSTM) with i-vectors [wu2016use], the variational auto-encoder [hsu2016voice], DBLSTM using Phonetic PosteriorGrams (PPGs) [sun2016personalized, sun2016phonetic], and GANs [hsu2017voice, kaneko2017parallel, kaneko2018cyclegan, sismanstudy]. The successful practice of these deep learning methods is the source of inspiration for this study.
Early studies on emotional VC [tao2006prosody, wu2009hierarchical] focused only on prosody conversion, using a classification and regression tree to decompose the pitch contour of the source speech into a hierarchical structure, followed by GMM- and regression-based clustering methods. One attempt to handle both spectrum and prosody conversion [schroder2001emotional, iida2003corpus, an2017emotional] was the GMM-based technique [aihara2012gmm]. Another approach, proposed in [inanoglu2009data], combines HMM, GMM, and an F0 segment selection method to transform F0, duration, and spectrum. More recently, an exemplar-based emotional VC approach based on NMF [aihara2014exemplar] and other NN-based models, such as the DNN [lorenzo2018investigating], Deep Belief Network (DBN) [luo2016emotional], and DBLSTM [ming2016deep], were proposed to perform spectrum and prosody mapping. Inspired by the success of sequence-to-sequence models in text-to-speech synthesis, a sequence-to-sequence encoder-decoder based model [robinson2019sequence] was also investigated to transform the intonation of a human voice, and can effectively convert the emotion of neutral utterances. Rule-based emotional VC approaches such as [xue2018voice] are capable of controlling the degree of emotion in a dimensional space.
We note that the training of most emotional VC systems relies on parallel training data, which is not practical in real-life applications. Motivated by this, a style transfer auto-encoder [gao2018nonparallel] that can learn from non-parallel training data was recently proposed. A source-target pair in a non-parallel dataset represents the source and target emotions but, unlike pairs in a parallel dataset, may carry different linguistic content, which makes data collection much easier.
Prosody conveys linguistic, para-linguistic, and various types of non-linguistic information, such as speaker identity, emotion, intention, attitude, and mood. It is observed that prosody is influenced by short-term as well as long-term dependencies [sanchez2014hierarchical, sisman2019group]. We note that F0 is an essential prosodic factor with respect to the intonation in speech, describing the variation of the vocal pitch over different time domains, from the syllable to the entire utterance. Therefore, it should be represented with hierarchical modeling [ming2015fundamental, ming2016exemplar, csicsman2017transformation], for example at multiple time scales. Early studies on emotional voice conversion used a linear transformation method [tao2006prosody, wu2009hierarchical, aihara2012gmm, aihara2014exemplar, gao2018nonparallel] to convert F0. Such a single-pitch-value representation of F0 does not characterize speech prosody well [xu2011speech, latorre2008multilevel, sisman2019group]. The Continuous Wavelet Transform (CWT), which decomposes a signal into frequency components and represents it at different temporal scales, is an excellent instrument for this purpose. CWT has already been applied to voice conversion frameworks such as DKPLS [sanchez2014hierarchical] and exemplar-based conversion [sisman2018wavelet, csicsman2017transformation]. It has also been shown to be effective for emotional voice conversion, for example in NMF-based [ming2016exemplar, ming2015fundamental] and DBLSTM-based [ming2016deep] approaches. Other adaptations of CWT for emotional speech synthesis have been investigated in [luo2017emotional, luo2017emotion, luo2019emotional].
In this paper, we propose an emotional VC framework with CycleGAN that is trained on non-parallel data to map a speaker’s speech from one emotion to another. We use mel-spectrum to represent the acoustic features and CWT coefficients for prosody features. Our framework does not rely on either parallel training data or any other extra modules such as speech recognition or time alignment procedures.
The main contributions of this paper include: 1) we propose a parallel-data-free emotional voice conversion framework; 2) we show the effect of prosody on emotional voice conversion; 3) we effectively convert spectral and prosody features with CycleGAN; 4) we investigate different training strategies for spectrum and prosody conversion, namely separate training and joint training; and 5) we outperform the baseline approaches and achieve high quality in the converted voice.
This paper is organized as follows: In Section 2, we describe the details of CycleGAN and the CWT decomposition of F0. In Section 3, we explain our proposed spectrum and prosody conversion framework for emotional voice conversion. Section 4 reports the experimental results. The conclusion is given in Section 5.
2 Related Work
2.1 CycleGAN

A CycleGAN learns forward and inverse mappings between source and target using three losses: adversarial loss, cycle-consistency loss, and identity-mapping loss. The adversarial loss measures how distinguishable the distribution of converted data is from that of the source or target data. For the forward mapping $G_{X \to Y}$, it is defined as:

$$\mathcal{L}_{adv}(G_{X \to Y}, D_Y) = \mathbb{E}_{y \sim P(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim P(x)}[\log (1 - D_Y(G_{X \to Y}(x)))] \quad (1)$$
The closer the distribution of the converted data is to that of the target data, the smaller the loss in Eq. (1) becomes. The adversarial loss only tells us whether $G_{X \to Y}(x)$ follows the distribution of the target data; it does not help preserve the contextual information. To guarantee that the contextual information of $x$ and $G_{X \to Y}(x)$ remains consistent, the cycle-consistency loss is given as:

$$\mathcal{L}_{cyc}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim P(x)}[\lVert G_{Y \to X}(G_{X \to Y}(x)) - x \rVert_1] + \mathbb{E}_{y \sim P(y)}[\lVert G_{X \to Y}(G_{Y \to X}(y)) - y \rVert_1] \quad (2)$$
This loss encourages $G_{X \to Y}$ and $G_{Y \to X}$ to find an optimal pseudo pair of $(x, y)$ through circular conversion. To preserve the linguistic information without any external processes, an identity-mapping loss is introduced as below:

$$\mathcal{L}_{id}(G_{X \to Y}, G_{Y \to X}) = \mathbb{E}_{x \sim P(x)}[\lVert G_{Y \to X}(x) - x \rVert_1] + \mathbb{E}_{y \sim P(y)}[\lVert G_{X \to Y}(y) - y \rVert_1] \quad (3)$$
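For illustration, the three losses can be sketched in numpy. The names `d_real`/`d_fake` stand for discriminator outputs on real and converted samples, and the `*_cycled` arrays for round-trip conversions; these are illustrative placeholders, not the paper's implementation, which trains CNN generators and discriminators in a deep learning framework.

```python
import numpy as np

def adversarial_loss(d_real, d_fake):
    # discriminator objective from Eq. (1): maximize log D(y) + log(1 - D(G(x)));
    # returned as a value to minimize, with D outputting probabilities in (0, 1)
    eps = 1e-12
    return -(np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps)))

def cycle_consistency_loss(x, x_cycled, y, y_cycled):
    # Eq. (2): L1 distance between inputs and their round-trip reconstructions
    return np.mean(np.abs(x_cycled - x)) + np.mean(np.abs(y_cycled - y))

def identity_mapping_loss(x, g_yx_of_x, y, g_xy_of_y):
    # Eq. (3): feeding a sample to the generator of its own domain direction
    # should leave it unchanged
    return np.mean(np.abs(g_yx_of_x - x)) + np.mean(np.abs(g_xy_of_y - y))
```

With identity generators the cycle-consistency and identity-mapping losses are exactly zero, which is the behavior the losses reward.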
We note that CycleGAN is well-known for achieving remarkable results without parallel training data in many fields from computer vision to speech information processing. In this paper, we propose to use CycleGAN for spectrum and prosody conversion for emotional voice conversion with non-parallel training data.
2.2 Continuous Wavelet Transform (CWT)
It is well known that emotion can be conveyed by various prosodic features, such as pitch, intensity, and speaking rate. F0 is an essential prosodic factor with respect to intonation. We note that modeling F0 is a challenging task, as F0 is discontinuous due to the unvoiced parts of speech and hierarchical in nature. As a multi-scale modeling method, CWT makes it possible to decompose F0 into different variations over multiple time scales.
Wavelet transform provides an easily interpretable visual representation of signals. Using CWT, a signal can be decomposed into different temporal scales. We note that CWT has been successfully used in speech synthesis [kruschke2003estimation, mishra2006decomposition] and voice conversion [sisman2019group, sisman2018wavelet].
Given a bounded, continuous signal $f_0(x)$, its CWT representation $W(\tau, t)$ can be written as:

$$W(\tau, t) = \tau^{-1/2} \int_{-\infty}^{+\infty} f_0(x)\, \psi\!\left(\frac{x - t}{\tau}\right) dx \quad (4)$$
where $\psi$ is the Mexican hat mother wavelet. The original signal can be recovered from the wavelet representation by the inverse transform, given as:

$$f_0(t) = \int_{-\infty}^{+\infty} \int_{0}^{\infty} W(\tau, x)\, \tau^{-5/2}\, \psi\!\left(\frac{t - x}{\tau}\right) d\tau\, dx \quad (5)$$
However, if all information on $W(\tau, x)$ is not available, the reconstruction is incomplete. In this study, we fix the analysis at ten discrete scales, one octave apart. The decomposition is given as:

$$W_i(f_0)(t) = W_i(f_0)(2^{i+1}\tau_0, t)\,(i + 2.5)^{-5/2} \quad (6)$$
where $i = 1, \ldots, 10$ and $\tau_0$ is the shortest timing scale. These timing scales were originally proposed in [suni2013wavelets] and adopted in the prosody models of [vainio2013continuous, sisman2018phonetically]. We believe that the prosody of emotion is expressed differently at different time scales. In the multi-scale representation, lower scales capture the short-term variations and higher scales capture the long-term variations. In this way, we are able to model and transfer the F0 variants from the micro-prosody level to the whole-utterance level for emotion pairs. In Figure 2, we use an example to compare two utterances with the same content but different emotions across time scales.
The reconstructed $f_0$ is approximated as:

$$f_0(t) \approx \sum_{i=1}^{10} W_i(f_0)(t)\,(i + 2.5)^{-5/2} \quad (7)$$
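A minimal numpy sketch of the ten-scale analysis and the approximate synthesis of Eq. (7) is given below. The discrete convolution, the truncation of the wavelet support, and the default base scale `tau0` are assumptions of this sketch, not the paper's exact implementation.

```python
import numpy as np

def mexican_hat(t):
    # Mexican hat (Ricker) mother wavelet: second derivative of a Gaussian
    return (2.0 / (np.sqrt(3.0) * np.pi ** 0.25)) * (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)

def cwt_decompose(f0, num_scales=10, tau0=1.0):
    # W_i(f0)(t) = W(f0)(2^(i+1) * tau0, t) * (i + 2.5)^(-5/2), i = 1..10
    n = len(f0)
    coeffs = np.zeros((num_scales, n))
    for i in range(num_scales):
        k = i + 1                                  # scale index as in Eq. (6)
        tau = tau0 * 2.0 ** (k + 1)                # scales one octave apart
        half = int(min(5 * tau, (n - 1) // 2))     # truncate wavelet support
        psi = mexican_hat(np.arange(-half, half + 1) / tau)
        # discrete version of tau^(-1/2) * integral f0(x) psi((x - t)/tau) dx
        coeffs[i] = np.convolve(f0, psi, mode="same") * tau ** -0.5 * (k + 2.5) ** -2.5
    return coeffs

def cwt_reconstruct(coeffs):
    # approximate inverse: re-weighted sum over the scales, as in Eq. (7)
    weights = (np.arange(1, coeffs.shape[0] + 1) + 2.5) ** -2.5
    return (coeffs * weights[:, None]).sum(axis=0)
```

In practice the reconstruction is only an approximation, since the analysis keeps ten discrete scales rather than the full continuous representation of Eq. (5).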
3 Spectrum and Prosody Conversion for Emotional Voice Conversion
In this section, we propose an emotional VC framework that performs both spectrum and prosody conversion using Cycle-Consistent Adversarial Networks. As F0 is an essential component of prosody, we propose to use CWT to decompose the one-dimensional F0 into 10 time scales. The proposed framework is trained on non-parallel speech data, eliminating the need for parallel training data, and effectively converts the emotion of the source speaker from one state to another.
The training phase of the proposed framework is given in Figure 3. We first extract spectral and F0 features from both source and target utterances using the WORLD vocoder [morise2016world]. The F0 features extracted by the WORLD vocoder are discontinuous due to the voiced/unvoiced parts within an utterance. Since CWT is sensitive to discontinuities in F0, we perform the following pre-processing steps on F0: 1) linear interpolation over unvoiced regions, 2) transformation of F0 from the linear to the logarithmic scale, and 3) normalization of the resulting F0 to zero mean and unit variance. We then perform the CWT decomposition of F0 as given in Eq. (6) and Algorithm 1.
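The three pre-processing steps can be sketched in a few lines of numpy; the function name and the returned statistics are illustrative, but the steps match the ones listed above.

```python
import numpy as np

def preprocess_f0(f0):
    """Interpolate unvoiced frames (F0 == 0), take the log, and z-normalize.

    Returns the normalized log-F0 along with the mean and standard deviation
    needed to undo the normalization after conversion.
    """
    f0 = np.asarray(f0, dtype=float).copy()
    voiced = f0 > 0
    if not voiced.any():
        raise ValueError("utterance contains no voiced frames")
    idx = np.arange(len(f0))
    # 1) linear interpolation over unvoiced regions (endpoints are clamped)
    f0[~voiced] = np.interp(idx[~voiced], idx[voiced], f0[voiced])
    # 2) linear-to-logarithmic scale
    log_f0 = np.log(f0)
    # 3) zero mean, unit variance
    mean, std = log_f0.mean(), log_f0.std()
    return (log_f0 - mean) / std, mean, std
```

The returned mean and standard deviation are kept so that the converted log-F0 can be de-normalized before waveform synthesis.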
We train CycleGAN for spectrum conversion with 24-dimensional Mel-cepstral coefficients (MCEPs), and for prosody conversion with 10-dimensional F0 features for each speech frame. We note that the source and target training data are from the same speaker, but consist of different linguistic content and different emotions. By learning forward and inverse mappings simultaneously using adversarial and cycle-consistency losses, we encourage CycleGAN to find an optimal mapping between source and target spectrum and prosody features.
The run-time conversion phase is shown in Figure 4. We first use the WORLD vocoder to extract spectral features, F0, and APs from a given source utterance. As in the training phase, we encode the spectral features as 24-dimensional MCEPs and obtain 10-scale F0 features through the CWT decomposition of F0, also given in Algorithm 1. The 24-dimensional MCEPs and 10-scale F0 features are fed into the corresponding trained CycleGAN models to perform spectrum and prosody conversion separately. We reconstruct the converted F0 with the CWT synthesis approximation, given in Eq. (7) and Algorithm 2. Finally, we use the WORLD vocoder to synthesize the converted emotional speech.
4 Experiments

We conduct both objective and subjective experiments to assess the performance of our proposed parallel-data-free emotional VC framework. We use the emotional speech corpus of [liu2014emotional], recorded by a professional American actress speaking English utterances with the same content in seven different emotions. We randomly choose four emotions: 1) neutral, 2) angry, 3) sad, and 4) surprise.
We perform CWT to decompose F0 into 10 different scales and train CycleGAN on non-parallel training data to learn the relationship between the spectral and prosody features of different emotions of the same speaker. A CycleGAN-based spectrum conversion framework, denoted as the baseline, is used as the reference framework; in this framework, F0 is transformed with the LG-based linear transformation method.
We are also interested in the effect of joint and separate training of the spectrum and prosody features. In joint training, we concatenate the 24 MCEPs and 10 CWT coefficients into a single vector for each frame to train a joint spectrum-prosody CycleGAN. In separate training, we train a spectrum CycleGAN on the MCEP features and a prosody CycleGAN on the CWT coefficients independently. Hereafter, we denote separate training as CycleGAN-Separate and joint training as CycleGAN-Joint. The comparison of the frameworks can also be seen in Table 1.
| Framework | Spectrum Conversion | Prosody Conversion (F0) |
| --- | --- | --- |
| Baseline | Spectrum CycleGAN | LG-based F0 linear transformation |
| CycleGAN-Joint | Joint Spectrum-Prosody CycleGAN | Joint Spectrum-Prosody CycleGAN |
| CycleGAN-Separate (proposed) | Spectrum CycleGAN | Prosody CycleGAN |
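The difference between the two training strategies comes down to how the per-frame features are assembled. The sketch below illustrates this with random numpy arrays; the array names and the frame count are illustrative, not taken from the paper.

```python
import numpy as np

T = 100                            # number of frames (illustrative)
mceps = np.random.randn(T, 24)     # 24-dimensional MCEPs per frame
cwt_f0 = np.random.randn(T, 10)    # 10-scale CWT coefficients of F0 per frame

# CycleGAN-Joint: concatenate both streams into one 34-dim vector per frame
# and train a single spectrum-prosody CycleGAN on it
joint_features = np.concatenate([mceps, cwt_f0], axis=1)

# CycleGAN-Separate: keep the streams apart and train two CycleGANs,
# one on the MCEPs and one on the CWT coefficients
spectrum_input, prosody_input = mceps, cwt_f0
```

Joint training thus forces the two feature streams to share one frame-level mapping, while separate training lets each stream be modeled on its own.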
4.1 Experimental Setup
The speech data in [liu2014emotional] is sampled at 16 kHz with 16 bits per sample. The audio files for each emotion are manually segmented into 100 short parallel sentences (approximately 3 minutes of speech). Among them, 90 and 10 sentences form the training and evaluation sets, respectively. To ensure that our proposed model is trained under a non-parallel condition, the first 45 training utterances are used as the source and the other 45 as the target. 24 Mel-cepstral coefficients (MCEPs), fundamental frequency (F0), and aperiodicities (APs) are then extracted every 5 ms using the WORLD vocoder [morise2016world]. As a pre-processing step, we normalize the source and target MCEPs per dimension.
We report the performance of three frameworks that use CycleGAN: 1) baseline, 2) CycleGAN-Joint, and 3) CycleGAN-Separate. For the baseline, we extract 24-dimensional MCEPs and one-dimensional F0 features for each frame. For both CycleGAN-Separate and CycleGAN-Joint, each speech frame is represented with 24-dimensional MCEPs and 10-dimensional F0 features. We adopt the same network structure for all frameworks. We design the generators using a one-dimensional (1D) CNN to capture the relationships among the overall features while preserving the temporal structure; the 1D CNN incorporates down-sampling, residual, and up-sampling layers. As for the discriminator, a 2D CNN is employed. For all frameworks, we use a fixed weight for the cycle-consistency loss, and the identity-mapping loss is only applied during the early training iterations to guide the learning process.
We train the networks using the Adam optimizer [kingma2014adam] with a batch size of 1. We set the initial learning rates to 0.0002 for the generators and 0.0001 for the discriminators; the learning rates are kept constant for an initial number of iterations and then decay linearly. The momentum term is set to 0.5. As CycleGAN does not require the source-target pair to be of the same length, time alignment is not necessary.
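The learning-rate policy can be sketched as a simple schedule function. Here `n_const` and `n_decay` are hypothetical placeholders for the iteration counts, which are not restated in this section.

```python
def linear_decay_lr(step, base_lr, n_const, n_decay):
    """Keep the learning rate at base_lr for the first n_const steps,
    then decay it linearly to zero over the next n_decay steps.
    n_const and n_decay are hypothetical hyper-parameters."""
    if step < n_const:
        return base_lr
    frac = min(1.0, (step - n_const) / float(n_decay))
    return base_lr * (1.0 - frac)
```

For example, with `base_lr=0.0002` the generator's rate stays at 0.0002 during the constant phase and reaches zero at the end of the decay phase.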
4.2 Objective Evaluation
We perform objective evaluation to assess the performance of both spectrum and prosody conversion. In all experiments, we use 45-45 non-parallel utterances during training.
4.2.1 Spectrum Conversion
We employ the Mel-cepstral distortion (MCD) [mcd1] between the converted and target Mel-cepstra to measure spectrum conversion, given as follows:

$$MCD\,[\mathrm{dB}] = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{24} \left(mc_d^{c} - mc_d^{t}\right)^2} \quad (8)$$

where $mc_d^{c}$ and $mc_d^{t}$ represent the $d$-th coefficient of the converted and target MCEPs, respectively. A lower MCD indicates better performance.
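The MCD computation translates directly into a few lines of numpy. This sketch assumes the converted and target MCEP sequences are already time-aligned and that the energy coefficient is excluded.

```python
import numpy as np

def mel_cepstral_distortion(mc_converted, mc_target):
    """Frame-averaged MCD in dB between two aligned MCEP sequences
    of shape (T, D)."""
    diff = mc_converted - mc_target
    # per-frame distortion: (10 / ln 10) * sqrt(2 * sum_d diff_d^2)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return per_frame.mean()
```

The frame-level values are averaged over the utterance, and lower values indicate a converted spectrum closer to the target.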
Table 2 reports the MCD values for a number of settings in a comparative study. The MCD values are calculated for both joint and separate training of the spectrum and prosody features. We conducted experiments for three emotion combinations: 1) neutral-to-angry, 2) neutral-to-sad, and 3) neutral-to-surprise. We observe that all separate training settings consistently outperform the corresponding joint training settings by achieving lower MCD values. For example, the overall MCD of separate training is 8.71, while it is 10.23 for joint training.
We note that the baseline trains CycleGAN only with spectral features; its spectral distortion is therefore identical to that of CycleGAN-Separate. For this reason, we do not report MCD results for the baseline separately.
4.2.2 Prosody Conversion
We use the Pearson Correlation Coefficient (PCC) and Root Mean Squared Error (RMSE) to report the performance of prosody conversion. The RMSE between the converted F0 and the corresponding target F0 is defined as:

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(F0_i^{c} - F0_i^{t}\right)^2} \quad (9)$$

where $F0_i^{c}$ and $F0_i^{t}$ denote the converted and target interpolated F0 features of the $i$-th frame, respectively, and $N$ is the length of the F0 sequence. A lower RMSE value represents better F0 conversion performance.
The PCC between the converted and target F0 sequences is given as:

$$\rho\left(F0^{c}, F0^{t}\right) = \frac{\mathrm{cov}\left(F0^{c}, F0^{t}\right)}{\sigma_{F0^{c}}\, \sigma_{F0^{t}}} \quad (10)$$

where $\sigma_{F0^{c}}$ and $\sigma_{F0^{t}}$ are the standard deviations of the converted F0 sequence $F0^{c}$ and the target F0 sequence $F0^{t}$, respectively. A higher PCC value represents better F0 conversion performance.
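Both metrics are straightforward to compute with numpy; the sketch below assumes equal-length, interpolated F0 contours.

```python
import numpy as np

def f0_rmse(f0_converted, f0_target):
    # root mean squared error between two F0 contours of equal length
    diff = np.asarray(f0_converted, float) - np.asarray(f0_target, float)
    return np.sqrt(np.mean(diff ** 2))

def f0_pcc(f0_converted, f0_target):
    # Pearson correlation: covariance normalized by the two standard deviations
    a = np.asarray(f0_converted, float)
    b = np.asarray(f0_target, float)
    return np.cov(a, b, bias=True)[0, 1] / (a.std() * b.std())
```

An identical pair of contours yields RMSE 0 and PCC 1, the best possible scores for both metrics.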
Table 3 reports the RMSE and PCC values of F0 conversion for a number of settings in a comparative study. In this experiment, we conducted three emotional conversion settings: 1) neutral-to-angry, 2) neutral-to-sad, and 3) neutral-to-surprise, and we also report the overall performance. As for the RMSE results, we first observe that the proposed prosody conversion, based on CycleGAN with CWT-based F0 decomposition, outperforms the traditional baseline, where F0 is converted with the LG-based linear transformation method. Secondly, separate training of CycleGAN for spectrum and CWT-based prosody conversion achieves a better overall result (RMSE: 63.03) than joint training (RMSE: 65.05), which is consistent with the MCD results. The PCC results suggest that joint and separate training of CWT-based F0 features achieve similar performance.
We would like to highlight that the proposed CWT-based modeling for F0 always outperforms the baseline framework that uses LG-based linear transformation method.
4.3 Subjective Evaluation
We further conduct two listening experiments to assess the proposed frameworks in terms of emotion similarity. We perform an XAB test, asking listeners to choose between A and B the one that sounds more similar to the reference in terms of emotional expression. The XAB test has been widely used in speech synthesis, for example in voice conversion [sisman2019group], singing voice conversion [singan-2019], and emotional voice conversion [luo2019emotional]. In both experiments, 45-45 non-parallel utterances are used during training. We selected two emotion combinations for the listening experiments: 1) neutral-to-angry (N2A) and 2) neutral-to-surprise (N2S). 13 subjects participated in all the listening tests, each listening to 80 converted utterances in total.
We first conduct an XAB test between the baseline and our proposed method to show the effect of separate CycleGAN-based spectrum conversion combined with CWT-based F0 modeling. Consistent with the previous experiments, our proposed framework is again denoted as CycleGAN-Separate. Listeners are asked to listen to the source utterances, the baseline, our proposed method, and the reference utterances, and then to choose the one that sounds more similar to the reference in terms of emotional expression. We note that both frameworks perform spectral conversion in the same way, while our proposed framework performs a more sophisticated F0 conversion: modeling F0 with CWT and then converting it with CycleGAN. The results are reported in Figure 5 for the two emotional conversion scenarios, N2A and N2S. We observe that the proposed CycleGAN-Separate outperforms the baseline framework in both experiments, which shows the effectiveness of prosody modeling and conversion for emotional voice conversion.
We then conduct an XAB test between joint and separate training to assess the two training strategies for spectrum and prosody conversion. The results are reported in Figure 6 for the two emotional conversion scenarios, N2A and N2S. We observe that separate training (denoted as CycleGAN-Separate) performs much better than joint training (denoted as CycleGAN-Joint). Our proposed method achieves a preference score of 93.6% on N2A and 96.5% on N2S, which we consider remarkable.
4.4 Joint vs. Separate Training of Spectrum and Prosody
We observe that the listeners strongly prefer separate training over joint training. We attribute this to the fact that prosody is manifested at different time scales and consists of both content-dependent and content-independent elements.
Joint training ties the CWT coefficients of F0 to the spectral features at the frame level, which implicitly assumes that prosody is content-dependent. With the limited number of training samples (45 pairs, around 3 minutes of speech), the CycleGAN model resulting from joint training does not generalize the emotional mapping well to unseen content at run-time. With separate training, a CycleGAN model is trained for spectrum and for prosody independently; the prosody CycleGAN can thus learn sufficiently well from the limited number of training samples between the emotion pairs in a content-independent manner. Therefore, separate training outperforms joint training in terms of emotion similarity.
5 Conclusion

In this paper, we propose a high-quality parallel-data-free emotional voice conversion framework that performs both spectrum and prosody conversion based on CycleGAN. We provide a non-linear method that uses CWT to decompose F0 into different timing scales. Moreover, we study joint and separate training of CycleGAN for spectrum and prosody conversion, and observe that separate training achieves better performance than joint training in terms of emotion similarity. Experimental results show that our proposed emotional voice conversion framework outperforms the baseline without the need for parallel training data.