Voice transformation (VT) is a technique to modify some properties of human speech while preserving its linguistic information. VT can be applied to change the speaker identity, i.e., voice conversion (VC) [1], or to transform the speaking style of a speaker, such as emotion and accent conversion [2]. In this work, we focus on emotional voice transformation. The goal is to change emotion-related characteristics of a speech signal while preserving its linguistic content and speaker identity. Emotion conversion techniques can be applied to various tasks, such as hiding negative emotions for customer service agents, helping film dubbing, and creating more expressive voice messages on social media.
Traditional VC approaches cannot be applied directly because they change speaker identity by assuming pronunciation and intonation to be a part of the speaker-independent information. Since the speaker’s emotion is mainly conveyed by prosodic aspects, some studies have focused on modelling prosodic features such as pitch, tempo, and volume [3, 4]. In [5], a rule-based emotional voice conversion system was proposed. It modifies prosody-related acoustic features of neutral speech to generate different types of emotions. A speech analysis-synthesis tool, STRAIGHT [6], was used to extract the fundamental frequency ($F_0$) and power envelope from raw audio. These features were parameterized and modified based on the Fujisaki model [7] and the target prediction model [8]. The converted features were then fed back into STRAIGHT to re-synthesize speech waveforms with the desired emotions. However, this method requires temporally aligned parallel data, which is difficult to obtain in real applications, and accurate time alignment needs manual segmentation of the speech signal at the phoneme level, which is very time-consuming.
To address these issues, we propose a nonparallel training method. Instead of learning a one-to-one mapping between paired emotional utterances $(x_1, x_2)$, we switch to training a conversion model between two emotional domains $(X_1, X_2)$.
Inspired by disentangled representation learning in image style transfer [9, 10], we assume that each speech signal $x_i \in X_i$ can be decomposed into a content code $c \in C$ that represents emotion-invariant information and a style code $s_i \in S_i$ that represents emotion-dependent information. $C$ is shared across domains and contains the information we want to preserve. $S_i$ is domain-specific and contains the information we want to change. In the conversion stage, we extract the content code of the source speech and recombine it with the style code of the target emotion. A generative adversarial network (GAN) [11] is added to improve the quality of the converted speech. Our approach is nonparallel, text-independent, and does not rely on any manual operation.
We evaluated our approach on IEMOCAP [12] for four emotions: angry, happy, neutral, and sad, which are widely studied in the emotional speech analysis literature. An objective evaluation showed that our model can modify the speech to significantly increase the percentage of desired emotions. A subjective evaluation on Amazon MTurk showed that the converted speech had good quality and preserved the speaker identity.
2 Related Work
2.1 Emotion-related features
Previous emotion conversion methods directly modify parameterized prosody-related features that convey emotions. Kawanami et al. [13] first proposed to use Gaussian mixture models (GMM) for spectrum transformation. A recent work explored four types of acoustic features: $F_0$ contour, spectral sequence, duration and power envelope, and investigated their impact on emotional speech synthesis. The authors found that $F_0$ and spectral sequence are the dominant factors in emotion conversion, while power envelope and duration alone have little influence. They further claimed that all emotions can be synthesized by modifying the spectral sequence, but did not provide a method to do it. In this paper, we focus on learning the conversion models for $F_0$ and spectral sequence.
2.2 Nonparallel training approaches
Parallel data means utterances with the same linguistic content but varying in the aspects to be studied. Since parallel data is hard to collect, nonparallel approaches have been developed. Some borrow ideas from image-to-image translation and create GAN models [14] suitable for speech, such as VC-VAW-GAN [15], SVC-GAN [16] and VC-CycleGAN [17]. Another trend is based on WaveNet [18, 19]. Although it can train directly on raw audio without feature extraction, the huge amount of computation resources and training data required is not affordable for most users.
2.3 Disentangled representation learning
Our work draws inspiration from recent studies in image style transfer. A basic idea is to find disentangled representations that can independently model image content and style. It is claimed in [9] that a Convolutional Neural Network (CNN) is an ideal representation to factorize semantic content and artistic style. The authors introduced a method to separate and recombine the content and style of natural images by matching feature correlations in different convolutional layers. For us, the task is to find disentangled representations of the speech signal that can separate emotion from speaker identity and linguistic content.
Research on human emotion expression and perception has reached two major conclusions. First, human emotion perception is a multi-layered process. Huang and Akagi [20] found that humans do not perceive emotion directly from acoustic features, but through an intermediate layer of semantic primitives. They introduced a three-layered model and learnt the connections with a fuzzy inference system. Some researchers found that adding middle layers can improve emotion recognition accuracy. Based on this finding, we suggest the use of multilayer perceptrons (MLP) to extract emotion-related information in speech signals.
Second, the emotion generation process of human speech follows the opposite direction of emotion perception. This means the encoding process of the speaker is the inverse operation of the decoding process of the listener. We assume that emotional speech generation and perception share the same representation methodology. This means the encoder and decoder are inverse operations with mirror structures.
Let $x_1 \in X_1$ and $x_2 \in X_2$ be utterances drawn from two different emotional categories. Our goal is to learn a mapping between two distributions $p(x_1)$ and $p(x_2)$. Since the joint distribution $p(x_1, x_2)$ is unknown for nonparallel data, the conversion models $p(x_1 \mid x_2)$ and $p(x_2 \mid x_1)$ cannot be directly estimated. To solve this problem, we make two assumptions:
(i). The speech signal can be decomposed into an emotion-invariant content code and an emotion-dependent style code;
(ii). The encoder and decoder are inverse functions.
Fig. 1 shows the generative model of speech with a partially shared latent space. A pair of corresponding speech $(x_1, x_2)$ is assumed to have a shared latent code $c \in C$ and emotion-related style codes $s_1 \in S_1$, $s_2 \in S_2$. For any emotional speech $x_i$, we have a deterministic decoder $x_i = G_i(c_i, s_i)$ and its inverse encoders $c_i = E_i^c(x_i)$, $s_i = E_i^s(x_i)$. To convert emotion, we just extract and recombine the content code of the source speech with the style code of the target emotion:
$$x_{1\to 2} = G_2(c_1, s_2) = G_2(E_1^c(x_1), s_2) \tag{1}$$
It should be noted that the style code is not inferred from one utterance, but learnt from the entire emotion domain. This is because the emotion style from a single utterance is ambiguous and may not capture the general characteristics of the target emotion. This makes our assumption slightly different from the cycle consistency constraint [21], which assumes that an example converted to another domain and converted back should remain the same as the original, i.e., $x_{1\to 2\to 1} = x_1$. Instead, we apply a semi-cycle consistency in the latent space by assuming that $E_2^c(x_{1\to 2}) = c_1$ and $E_2^s(x_{1\to 2}) = s_2$.
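The recombination of content and style codes, and the semi-cycle consistency in the latent space, can be illustrated with a deliberately simple toy model. The sketch below is not the gated-CNN autoencoder of this paper: it assumes, purely for illustration, that the style of a feature sequence is its per-channel mean and the content is the mean-removed residual. All function names and the synthetic data are hypothetical.

```python
import numpy as np

# Toy stand-ins (assumptions, not the paper's networks): a feature sequence
# has shape (channels, frames); "style" is the per-channel mean and
# "content" is the mean-removed residual.

def encode_content(x):
    """Toy E^c: remove the per-channel offset, keep the temporal pattern."""
    return x - x.mean(axis=1, keepdims=True)

def encode_style(x):
    """Toy E^s: per-channel offset of a single utterance."""
    return x.mean(axis=1, keepdims=True)

def decode(c, s):
    """Toy G: recombine a content code with a style code."""
    return c + s

def domain_style(utterances):
    """Style code estimated from a whole emotion domain, not one utterance."""
    return np.mean([encode_style(u) for u in utterances], axis=0)

rng = np.random.default_rng(0)
# Hypothetical source and target domains with utterances of varying length.
domain_1 = [rng.normal(0.0, 1.0, size=(4, 30 + k)) for k in range(5)]
domain_2 = [rng.normal(2.0, 1.0, size=(4, 40 + k)) for k in range(5)]

x1 = domain_1[0]
c1 = encode_content(x1)
s2 = domain_style(domain_2)

# Conversion: content of the source utterance, style of the target domain.
x12 = decode(c1, s2)

# Encoder and decoder are inverses: encoding then decoding returns x1.
recon_ok = np.allclose(decode(encode_content(x1), encode_style(x1)), x1)
# Semi-cycle consistency: re-encoding the converted sample recovers (c1, s2).
content_ok = np.allclose(encode_content(x12), c1)
style_ok = np.allclose(encode_style(x12), s2)
print(recon_ok, content_ok, style_ok)
```

In this toy setting the converted sample is not mapped back to the source domain at all; only its latent codes are checked, which is the distinction between semi-cycle and full cycle consistency drawn above.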
Traditional emotional speech analysis mainly focuses on four types of acoustic features: fundamental frequency ($F_0$), spectral sequence, time duration and energy envelope. Prior work (see Section 2.1) found that only $F_0$ and spectral sequence have significant influence, while the other two require manual segmentation and have little impact on changing emotions. Therefore we focus on learning the conversion model for $F_0$ and spectral sequence. Fig. 2 shows an overview of our nonparallel emotional speech conversion system. The features are extracted and recombined by WORLD [22] and converted separately. We modify $F_0$ by a linear transform to match the statistics of the fundamental frequencies in the target emotion domain. The conversion is performed by log Gaussian normalization
$$f_2 = \exp\left( (\log f_1 - \mu_1) \cdot \frac{\sigma_2}{\sigma_1} + \mu_2 \right) \tag{2}$$
where $f_1$ and $f_2$ are the source and converted $F_0$ values, and $\mu_i$, $\sigma_i^2$ are the mean and variance of $\log F_0$ obtained from the source and target emotion sets. Aperiodicity (AP) is mapped directly since it does not contain emotion-related information.
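As a minimal sketch of the log Gaussian normalized $F_0$ transformation described above, the following NumPy code shifts and scales log-$F_0$ so that its statistics match those of the target emotion set. It uses the standard deviation (the square root of the variance) as the scale. Leaving unvoiced frames at zero is an assumption about how the $F_0$ contour is represented, and the $F_0$ values are made up for the example.

```python
import numpy as np

def log_stats(f0_frames):
    """Mean and standard deviation of log-F0 over voiced frames (F0 > 0)."""
    log_f0 = np.log(f0_frames[f0_frames > 0])
    return log_f0.mean(), log_f0.std()

def convert_f0(f0, src_stats, tgt_stats):
    """Log Gaussian normalized transformation of an F0 contour.

    Unvoiced frames (F0 == 0) are left at zero; this handling is an
    assumption about the contour representation.
    """
    mu_src, sigma_src = src_stats
    mu_tgt, sigma_tgt = tgt_stats
    f0 = np.asarray(f0, dtype=float)
    converted = np.zeros_like(f0)
    voiced = f0 > 0
    converted[voiced] = np.exp(
        (np.log(f0[voiced]) - mu_src) * sigma_tgt / sigma_src + mu_tgt
    )
    return converted

# Hypothetical F0 values (Hz) pooled over each emotion set.
src_f0 = np.array([110.0, 120.0, 0.0, 130.0, 125.0, 0.0, 115.0])
tgt_f0 = np.array([180.0, 0.0, 220.0, 260.0, 200.0, 240.0])

src_stats = log_stats(src_f0)
tgt_stats = log_stats(tgt_f0)
converted = convert_f0(src_f0, src_stats, tgt_stats)
print(np.round(converted, 1))
```

Because the transform is linear in the log domain, the converted voiced frames have exactly the target log-$F_0$ mean and standard deviation when the source statistics are computed from the same frames.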
For the spectral sequence, we use a low-dimensional representation in the mel-cepstrum domain to reduce complexity. Prior work shows that 50 MCEP coefficients are enough to synthesize full-band speech without quality degradation. Spectral conversion is learnt by the autoencoder model in Fig. 1. The encoders and decoders are implemented with gated CNNs [23]. In addition, a GAN module is added to produce realistic spectral frames. Our model has 4 subnetworks, $E^c$, $E^s$, $G$ and $D$, in which $D$ is the GAN discriminator that distinguishes between real samples and machine-generated samples.
3.3 Loss functions
We jointly train the encoders, decoders and GAN discriminators with the multiple losses displayed in Fig. 3. To keep the encoder and decoder as inverse operations, we apply a reconstruction loss in the direction $x_i \to (c_i, s_i) \to x_i'$. The spectral sequence should not change after encoding and decoding.
In our model, the latent space is partially shared. Thus the cycle consistency constraint [21] is not preserved, i.e., $x_{1\to 2\to 1} \neq x_1$. We apply a semi-cycle loss in the coding directions $c_1 \to x_{1\to 2} \to c_{1\to 2}$ and $s_2 \to x_{1\to 2} \to s_{1\to 2}$.
Moreover, we add a GAN module to improve the speech quality. The converted samples should be indistinguishable from the real samples in the target emotion domain. The GAN loss is computed between the converted samples $x_{1\to 2}$ and the real target samples $x_2$.
The full loss is the weighted sum of the reconstruction loss $\mathcal{L}_{recon}$, the semi-cycle loss $\mathcal{L}_{cycle}$ and the GAN loss $\mathcal{L}_{GAN}$, where scalar coefficients control the weights of the components.
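The way the three loss components combine can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the training objective used here: the L1 distance for the reconstruction and semi-cycle terms, the least-squares form of the GAN loss, and the weight names `w_recon`, `w_cycle`, `w_gan` are all choices made for the example, and the arrays are hand-made stand-ins for network outputs.

```python
import numpy as np

def l1(a, b):
    """Mean absolute error between two arrays."""
    return float(np.mean(np.abs(a - b)))

def reconstruction_loss(x, x_rec):
    # x -> (c, s) -> x': encoding then decoding should return the input.
    return l1(x, x_rec)

def semi_cycle_loss(c_src, c_rec, s_tgt, s_rec):
    # c1 -> x_{1->2} -> c': the content should survive conversion.
    # s2 -> x_{1->2} -> s': the converted sample should carry the target style.
    return l1(c_src, c_rec) + l1(s_tgt, s_rec)

def lsgan_generator_loss(d_fake):
    # Least-squares GAN form (an assumption): push D(x_{1->2}) toward 1.
    return float(np.mean((d_fake - 1.0) ** 2))

def lsgan_discriminator_loss(d_real, d_fake):
    # Real samples are scored toward 1, converted samples toward 0.
    return float(np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def full_loss(l_recon, l_cycle, l_gan, w_recon=1.0, w_cycle=1.0, w_gan=1.0):
    # Weighted sum of the three components; the weights are placeholders.
    return w_recon * l_recon + w_cycle * l_cycle + w_gan * l_gan

# Tiny hand-made arrays standing in for network outputs.
x = np.array([[1.0, 2.0], [3.0, 4.0]])
x_rec = np.array([[1.0, 2.5], [3.0, 3.5]])
c1, c_rec = np.array([0.2, -0.1]), np.array([0.2, 0.1])
s2, s_rec = np.array([1.0, 1.0]), np.array([1.0, 0.5])
d_fake = np.array([0.5, 0.5])

l_recon = reconstruction_loss(x, x_rec)
l_cycle = semi_cycle_loss(c1, c_rec, s2, s_rec)
l_gan = lsgan_generator_loss(d_fake)
total = full_loss(l_recon, l_cycle, l_gan, w_recon=10.0)
print(l_recon, l_cycle, l_gan, total)
```

Note that the semi-cycle term compares latent codes only; no sample is converted back to the source domain, in line with the partially shared latent space.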
We test the proposed method on IEMOCAP [12], which is a widely used corpus for emotion recognition. To our knowledge, this is the first work to use it for emotion conversion. IEMOCAP contains scripted and improvised dialogs in five sessions; each has labeled emotional sentences spoken by two English speakers. The emotions in scripted dialogs have a strong correlation with the linguistic content. Since our task is to change emotion but keep the speaker identity and linguistic content, we only use the improvised dialogs of the same speaker. We train the conversion model on four emotions: angry, happy, neutral, sad. The acoustic features $F_0$, spectral sequence and AP are extracted by WORLD [22] every ms, then encoded to -dimension mel-cepstral vectors of temporal size as the autoencoder’s input.
4.2 Network Structure
Our network structure is illustrated in Fig. 4. The encoders and decoders are implemented with 1-dimensional CNNs to capture the temporal dependencies; the GAN discriminators are implemented with 2-dimensional CNNs to capture the spectro-temporal patterns. All networks are equipped with gated linear units (GLU) [23] as activation functions. The emotion style is learnt by a -layer MLP that outputs channel-wise mean and variance. These are then fed into the decoder by adding an adaptive instance normalization (AdaIN) layer before activation. This mechanism is similar to the conversion model of $F_0$ in eq. (2).
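To make the role of AdaIN and GLU concrete, here is a minimal NumPy sketch of the two operations applied in the order described above (AdaIN first, then the gated activation). It is not the network of Fig. 4: the feature shapes are hypothetical, random vectors stand in for the style MLP outputs, and the style scale is expressed as a standard deviation rather than a variance.

```python
import numpy as np

def adain(h, style_mean, style_std, eps=1e-5):
    """Adaptive instance normalization on features of shape (channels, frames).

    Each channel is normalized over time, then rescaled and shifted with the
    channel-wise statistics supplied by the style branch.
    """
    mu = h.mean(axis=1, keepdims=True)
    sigma = h.std(axis=1, keepdims=True)
    normalized = (h - mu) / (sigma + eps)
    return style_std[:, None] * normalized + style_mean[:, None]

def glu(h):
    """Gated linear unit: the first half of the channels is gated by the second."""
    a, b = np.split(h, 2, axis=0)
    return a * (1.0 / (1.0 + np.exp(-b)))

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 50))                  # hypothetical decoder feature map
style_mean = rng.normal(size=8)               # stand-in for style MLP outputs
style_std = np.abs(rng.normal(size=8)) + 0.5

h_styled = adain(h, style_mean, style_std)    # AdaIN is applied before activation
out = glu(h_styled)                           # GLU halves the channel count
print(h_styled.shape, out.shape)
```

The parallel with eq. (2) is that AdaIN also removes the source statistics and imposes the target ones, here per channel on the hidden features instead of on log-$F_0$.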
We use the Adam optimizer and set . The learning rate is initialized as and linearly decayed to from the -th iteration. We set and . For training, we randomly sample fixed-length segments of 128 frames from the input audio with kHz frequency. Conversion was conducted on speech sequences of arbitrary length.
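The two training details described above, random sampling of fixed-length 128-frame segments and a learning rate that is held constant and then linearly decayed, can be sketched as follows. The feature dimension, learning rates and iteration counts in this example are placeholders chosen only to make the code concrete; they are not the settings used in our experiments.

```python
import numpy as np

def sample_segment(features, n_frames=128, rng=None):
    """Randomly crop a fixed-length window from features of shape (dims, frames)."""
    rng = np.random.default_rng() if rng is None else rng
    total = features.shape[1]
    if total < n_frames:
        raise ValueError("utterance is shorter than the training segment")
    start = int(rng.integers(0, total - n_frames + 1))
    return features[:, start:start + n_frames]

def learning_rate(step, base_lr, final_lr, decay_start, total_steps):
    """Constant until decay_start, then linear decay to final_lr at total_steps."""
    if step < decay_start:
        return base_lr
    progress = min(1.0, (step - decay_start) / (total_steps - decay_start))
    return base_lr + progress * (final_lr - base_lr)

rng = np.random.default_rng(0)
utterance = rng.normal(size=(24, 300))   # hypothetical (dims, frames) features
segment = sample_segment(utterance, rng=rng)
print(segment.shape)

# Placeholder schedule values, used only for illustration.
for step in (0, 1000, 1500, 2000):
    print(step, learning_rate(step, 1e-3, 0.0, 1000, 2000))
```

Cropping is needed only during training, where batches must have a common length; at conversion time the convolutional networks can process sequences of arbitrary length.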
The results were evaluated on three metrics: emotion correctness, voice quality and the ability to retain speaker identity.
Subjective evaluation: We conducted listening tests on Amazon MTurk to evaluate the converted speech (converted samples are available at https://www.jian-gao.org/emovc). Each example was listened to by random evaluators. They were asked to manually classify the emotion and to give -to- opinion scores on voice quality and similarity with the original speaker. The mean opinion scores (MOS) for the latter two metrics are shown in Fig. 5. For subjective emotion classification, the results were consistent with the objective evaluation and are therefore omitted due to space constraints.
Objective evaluation: We applied a state-of-the-art speech emotion classifier [24] for objective evaluation. The results in Table 1 show that our model can effectively increase the proportion of the desired emotion and reduce that of the original emotion. Note that neutral and sad speech are often confused even by humans.
Table 1: Emotion classification results in percentage (%) for original speech, with converted speech in parentheses.
5 Conclusion and Future work
We presented a nonparallel emotional speech conversion system. Objective and subjective evaluations showed that our model can successfully manipulate emotions to fool the emotion classifier as well as human listeners. As our approach does not require any paired data, transcripts or time alignment, it is easy to apply in real-world situations. To our knowledge, this is the first work on nonparallel emotion conversion using style transfer. Future work will develop a multi-domain emotion conversion model for unseen speakers.
Acknowledgements: This research was supported by Signify Research and U.S. Air Force under grant FA9550-17-1-0259.
- [1] S.H. Mohammadi and A. Kain, “An overview of voice conversion systems,” Speech Communication, vol. 88, pp. 65–82, 2017.
- [2] G. Zhao, S. Sonsaat, J. Levis, E. Chukharev-Hudilainen, and R. Gutierrez-Osuna, “Accent conversion using phonetic posteriorgrams,” in ICASSP. IEEE, 2018, pp. 5314–5318.
- [3] M. Wang, M. Wen, K. Hirose, and N. Minematsu, “Emotional voice conversion for Mandarin using tone nucleus model–small corpus and high efficiency,” in Speech Prosody 2012, 2012.
- [4] Z. Wang and Y. Yu, “Multi-level prosody and spectrum conversion for emotional speech synthesis,” in Signal Processing (ICSP). IEEE, 2014, pp. 588–593.
- [5] Y. Xue, Y. Hamada, and M. Akagi, “Voice conversion for emotional speech: Rule-based synthesis with degree of emotion controllable in dimensional space,” Speech Communication, vol. 102, pp. 54–67, 2018.
- [6] H. Kawahara, I. Masuda-Katsuse, and A. De Cheveigne, “Restructuring speech representations using a pitch-adaptive time–frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds,” Speech Communication, vol. 27, no. 3-4, pp. 187–207, 1999.
- [7] H. Fujisaki and K. Hirose, “Analysis of voice fundamental frequency contours for declarative sentences of Japanese,” Journal of the Acoustical Society of Japan (E), vol. 5, no. 4, pp. 233–242, 1984.
- [8] Y. Xue and M. Akagi, “A study on applying target prediction model to parameterize power envelope of emotional speech,” in RISP workshop NCSP’16. 信号処理学会, 2016.
- [9] L.A. Gatys, A.S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in CVPR. IEEE, 2016, pp. 2414–2423.
- [10] X. Huang, M.Y. Liu, S. Belongie, and J. Kautz, “Multimodal unsupervised image-to-image translation,” in The European Conference on Computer Vision (ECCV), September 2018.
- [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in NIPS, 2014, pp. 2672–2680.
- [12] C. Busso, M. Bulut, C.C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J.N. Chang, S. Lee, and S.S. Narayanan, “IEMOCAP: Interactive emotional dyadic motion capture database,” Language Resources and Evaluation, vol. 42, no. 4, p. 335, 2008.
- [13] H. Kawanami, Y. Iwami, T. Toda, H. Saruwatari, and K. Shikano, “GMM-based voice conversion applied to emotional speech synthesis,” in Eurospeech, 2003.
- [14] M.Y. Liu, T. Breuel, and J. Kautz, “Unsupervised image-to-image translation networks,” in Advances in Neural Information Processing Systems (NIPS), pp. 700–708. Curran Associates, Inc., 2017.
- [15] C.-C. Hsu, H.-T. Hwang, Y.-C. Wu, Y. Tsao, and H.-M. Wang, “Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks,” in Proc. Interspeech 2017, 2017, pp. 3364–3368.
- [16] T. Kaneko, H. Kameoka, K. Hiramatsu, and K. Kashino, “Sequence-to-sequence voice conversion with similarity metric learned using generative adversarial networks,” in Proc. Interspeech 2017, 2017, pp. 1283–1287.
- [17] F. Fang, J. Yamagishi, I. Echizen, and J. Lorenzo-Trueba, “High-quality nonparallel voice conversion based on cycle-consistent adversarial network,” arXiv preprint arXiv:1804.00425, 2018.
- [18] A. Van Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A.W. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” in SSW, 2016, p. 125.
- [19] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, “Speaker-dependent WaveNet vocoder,” in Proc. Interspeech, 2017, vol. 2017, pp. 1118–1122.
- [20] C.F. Huang and M. Akagi, “A three-layered model for expressive speech perception,” Speech Communication, vol. 50, no. 10, pp. 810–828, 2008.
- [21] J.Y. Zhu, T. Park, P. Isola, and A.A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in IEEE ICCV, Oct 2017.
- [22] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: a vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Trans. on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.
- [23] Y.N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in ICML, 2017, pp. 933–941.
- [24] S. Mirsamadi, E. Barsoum, and C. Zhang, “Automatic speech emotion recognition using recurrent neural networks with local attention,” in ICASSP. IEEE, 2017, pp. 2227–2231.