Voice can be regarded as the composition of what to deliver (linguistic contents) and how to deliver it (style). Voice conversion (VC) is the task of changing speech style while keeping the linguistic contents. It is challenging because linguistic contents may be lost, or style information may remain unchanged, during conversion.
Previously, VC was conducted using frame-based approaches. Given source and target speech, an alignment between the two is obtained, and the acoustic features of the source speech are then converted to those of the target speech. Various methods have been applied to model the acoustic features, such as the Gaussian mixture model (GMM) [1, 2], deep neural network (DNN), and recurrent neural network (RNN). Recently, the sequence-to-sequence (seq2seq) model with attention has also been applied to VC. Problems such as mispronunciation and training instability have been observed while training seq2seq VC models. To improve the quality of VC, textual supervision has been added to each time step of the decoder output. However, this approach is limited because it requires explicit alignment, either by a human or by dynamic time warping (DTW).
To overcome the drawbacks of previous work, we make use of text-to-speech (TTS). TTS is the task of transforming textual or phonetic information into a speech waveform, and seq2seq-based TTS has been studied actively [8, 9, 10]. TTS is closely related to VC: only the input domains differ, and in both tasks the role of the decoder is the same, namely to convert phonetic information into acoustic features. The embedding space of TTS is highly correlated with phonetic information, and through multitask learning VC is expected to learn an embedding space close to that of TTS. In this paper, TTS is used to supply phonetic information to VC and thereby improve its performance.
Furthermore, we extend this work to emotional voice conversion. Given a style reference speech, the style encoder extracts only the emotion information and discards the linguistic contents. Since the style encoder is designed to extract emotion regardless of linguistic contents, it can handle multiple input style domains. The extracted emotion is injected into the decoder, which can then generate various emotions. Thus, the proposed model can handle many-to-many emotional voice conversion.
The contributions of the proposed model are as follows:
- Multitask learning with TTS improves the performance of VC.
- Many-to-many emotional voice conversion is conducted with a seq2seq model for the first time.
- A style reference speech determines the target domain of voice conversion.
2 Related Work
Traditionally, frame-based models have been used for VC. Given source and target speech, their alignment is found by DTW, and the conversion between the acoustic features of aligned frames is then modeled. Recently, seq2seq models have been proposed for VC [6, 7]. These models jointly learn alignment and frame conversion through an attention mechanism, without explicit temporal alignment. However, their performance is not sufficient, since the converted voice may lose linguistic information: for example, the same words may be generated repeatedly, some words may be dropped, or words may be mispronounced. To prevent these phenomena, textual supervision has been added to VC, but it requires explicit alignment. Unlike previous work, we use TTS to guide linguistic information in VC, an approach that does not need any explicit alignment.
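For reference, the explicit alignment step that frame-based VC relies on can be sketched as follows. This is a minimal illustration of DTW, assuming scalar features and an absolute-difference local cost; the function name `dtw_align` is hypothetical, and the code is not taken from any of the cited systems.

```python
def dtw_align(source, target):
    """Align two sequences of scalar features with dynamic time warping.

    Returns (total_cost, path), where path is a list of (source_index,
    target_index) pairs. Scalar features and an absolute-difference cost
    are simplifying assumptions made for this sketch.
    """
    n, m = len(source), len(target)
    inf = float("inf")
    # cost[i][j]: minimal accumulated cost of aligning source[:i] with target[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(source[i - 1] - target[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # both sequences advance
                cost[i - 1][j],      # only the source advances
                cost[i][j - 1],      # only the target advances
            )

    # Backtrack from the end to recover the frame-to-frame alignment.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


# The second source frame is matched to two target frames.
total_cost, path = dtw_align([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
```

In a frame-based pipeline, the conversion model would then be trained on the frame pairs listed in `path`, which is why alignment errors propagate directly into the converted speech.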
Meanwhile, emotional voice conversion has mainly been done with frame-based conversion [11, 12] or a rule-based approach. These have limitations: DTW does not guarantee exact alignment, and a rule-based approach is limited in its capacity to model voice conversion. Emotional VC could therefore be improved by a model that does not rely on explicit alignment.
Many-to-many VC refers to the setting in which there are multiple source domains and multiple target domains. A Cycle-GAN has been used to convert an out-of-dataset speaker to a target speaker, and vice versa. An i-vector-based VC system has been proposed to generate the linguistic features of speakers that are not in the training set. Compared with other many-to-many VC methods [14, 15], our proposed model can transfer the emotional knowledge of a speaker and enables conversion among different emotions.
TTS is one of the most active research areas in the speech domain. Various seq2seq models have been proposed [8, 9, 10], and expressive synthesis has also been studied [16, 17]. In the most relevant of these works, a style vector, extracted by a style encoder that takes a one-hot emotion vector as input, is injected into the network to generate emotional speech. In our work, by contrast, emotion labels are not used during training, and no emotion labels are explicitly given as inputs to our network.
3 Proposed Model
The proposed model performs both VC and TTS in a single model. The network acts as a VC system when its input pair consists of the log Mel spectrogram of the speech carrying the linguistic contents and the log Mel spectrogram of the style reference speech, and as a TTS system when its input pair consists of the one-hot represented text and the log Mel spectrogram of the style reference speech. Both the contents speech and the text are mapped to the same space, which is then decoded into the Mel spectrogram m. At each decoding step, the style vector extracted from the style reference speech is concatenated to the attention RNN and the decoder RNN. The linear spectrogram l is obtained by the post processor. The detailed network architecture is depicted in Fig. 1 and the equations below.
The quantities in these equations are the embeddings of the text encoder, contents encoder, and style encoder; the hidden representations of the attention RNN and decoder RNN at each time step; the log Mel spectrogram and log linear spectrogram of the target; the context vector obtained by the attention mechanism; and the output of the attention RNN at each time step. XOR denotes the exclusive OR operator.
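One plausible reading of the XOR in the description above is an input-routing rule: exactly one of the text or the contents speech is supplied, and it decides whether the network runs as TTS or VC. The sketch below illustrates that reading only. The function names, the `encoders` dictionary, and the callables inside it are hypothetical placeholders, not the authors' implementation.

```python
def select_linguistic_input(text=None, contents_speech=None):
    """Return (task, input) when exactly one linguistic input is given."""
    has_text = text is not None
    has_speech = contents_speech is not None
    if has_text == has_speech:
        # Both given, or both missing, violates the exclusive OR.
        raise ValueError("provide exactly one of text or contents_speech")
    return ("TTS", text) if has_text else ("VC", contents_speech)


def encode_inputs(style_reference, encoders, text=None, contents_speech=None):
    """Route the inputs through hypothetical encoder callables.

    `encoders` maps "text", "contents", and "style" to callables. Whichever
    linguistic encoder is selected, its output plays the same role for the
    shared decoder.
    """
    task, linguistic_input = select_linguistic_input(text, contents_speech)
    linguistic_encoder = encoders["text"] if task == "TTS" else encoders["contents"]
    return {
        "task": task,
        "linguistic_embedding": linguistic_encoder(linguistic_input),
        "style_vector": encoders["style"](style_reference),
    }


# Stand-in encoders that only tag their input, to show the routing.
toy_encoders = {
    "text": lambda x: ("text_embedding", x),
    "contents": lambda x: ("contents_embedding", x),
    "style": lambda x: ("style_vector", x),
}
vc_inputs = encode_inputs("style_mel", toy_encoders, contents_speech="contents_mel")
```

Because both encoders feed the same decoder, the routing is the only place where the two tasks differ in this sketch.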
For the TTS part, which includes the text encoder, decoder, attention, and post processor, the overall architecture is based on Tacotron with some modifications: the context vector is used at every iteration of the attention RNN, and a residual connection is added to the Convolution Bank + Highway + bi-GRU (CBHG) module, following prior work.
The text encoder is composed of a character embedding layer, a prenet, and a CBHG module, where the prenet is composed of two FC-ReLU-Dropout layers. A stack of LSTMs is used for the contents encoder, with no reduction of temporal resolution, since temporal reduction may lose local temporal information. This means that the output of the contents encoder has the same length as its input. For the style encoder, the style reference spectrogram is mapped to a fixed-dimensional style vector by taking the embedding of the last LSTM step, followed by a fully connected layer that reduces the dimension. The attention module is composed of an attention RNN followed by an attention mechanism: the output of the attention RNN is used by the attention mechanism to generate the context vector.
The training loss is defined with respect to the ground truth log Mel spectrogram and the ground truth log linear spectrogram.
4 Experimental Results
This section describes our dataset and implementation details. The experiments are designed to verify that multitask learning enhances speech quality. We also present an analysis of the style encoder and the performance of many-to-many emotional VC.
4.1 Dataset
We constructed a male Korean Emotional Text-to-speech (mKETTS) dataset. One 30-year-old male speaker read 3,000 sentences in seven different emotions (neutral, happiness, sadness, anger, fear, surprise, and disgust), so the total number of utterances is 21,000. The text of all sentences is the same across emotions. All recordings were made in a quiet studio at a sampling rate of 44.1 kHz. The total duration after trimming silence is 29.2 hours.
4.2 Implementation details
For preprocessing, the silence at the beginning and end of each waveform is trimmed using a voice activity detection (VAD) algorithm (https://github.com/wiseman/py-webrtcvad), and the waveform is resampled to 16 kHz. The log Mel spectrogram is then extracted with a window size of 50 ms, a shift of 12.5 ms, an FFT size of 2048, 80 Mel bins, and a Hanning window. The magnitude is normalized to [0, 1]. Since the text encoder uses a character-based representation, each Korean character is decomposed into onset, nucleus, and coda, and then converted into a one-hot representation.
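Two pieces of arithmetic are implicit in this preprocessing: the frame sizes in samples at 16 kHz, and the decomposition of a precomposed Hangul syllable into its jamo. A minimal sketch of both follows, using the standard Unicode layout of Hangul syllables (19 onsets, 21 nuclei, 28 codas including the empty coda). The function names, the choice to skip an empty coda, and the pass-through of non-Hangul characters are assumptions for illustration; the paper does not specify these details.

```python
SAMPLE_RATE = 16000  # Hz, after resampling


def ms_to_samples(milliseconds, sample_rate=SAMPLE_RATE):
    """Convert a duration in milliseconds to a whole number of samples."""
    return round(milliseconds * sample_rate / 1000)


WIN_LENGTH = ms_to_samples(50)    # 50 ms window  -> 800 samples
HOP_LENGTH = ms_to_samples(12.5)  # 12.5 ms shift -> 200 samples
N_FFT = 2048                      # yields N_FFT // 2 + 1 frequency bins

# Precomposed Hangul syllables occupy U+AC00..U+D7A3 in this fixed order.
HANGUL_BASE, HANGUL_LAST = 0xAC00, 0xD7A3
ONSETS = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
NUCLEI = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
CODAS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def decompose_hangul(text):
    """Split each Hangul syllable into onset, nucleus, and (optional) coda."""
    symbols = []
    for ch in text:
        code = ord(ch)
        if HANGUL_BASE <= code <= HANGUL_LAST:
            index = code - HANGUL_BASE
            onset, rest = divmod(index, len(NUCLEI) * len(CODAS))
            nucleus, coda = divmod(rest, len(CODAS))
            symbols.append(ONSETS[onset])
            symbols.append(NUCLEI[nucleus])
            if coda:  # index 0 means the syllable has no coda
                symbols.append(CODAS[coda])
        else:
            symbols.append(ch)  # spaces, punctuation, etc. pass through
    return symbols


def one_hot(symbols, vocabulary):
    """Encode each symbol as a one-hot vector over the given vocabulary."""
    index = {symbol: i for i, symbol in enumerate(vocabulary)}
    return [
        [1 if index[symbol] == i else 0 for i in range(len(vocabulary))]
        for symbol in symbols
    ]
```

For example, `decompose_hangul("한글")` yields the six jamo ㅎ, ㅏ, ㄴ, ㄱ, ㅡ, ㄹ, which `one_hot` can then encode over a vocabulary built from the training text.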
The detailed parameter settings are as follows. We used a 256-dimensional character embedding and 32 dimensions for the style vector. The initial learning rate was 1e-3 with the Adam optimizer. Gradient clipping was applied with a threshold of 1, and m was generated with a reduction factor of 5. The batch size was 32, and Bahdanau attention was used. The contents encoder consists of two layers of bidirectional LSTM, and the style encoder consists of two layers of unidirectional LSTM. For the style encoder, only the output of the last time step is used. The parameters of the contents encoder and the style encoder are not shared. In the training stage, teacher forcing is used to prevent errors in the predicted output from accumulating. When creating mini-batches, each sample is zero-padded to the length of the longest sample, and the losses on the zero-padded regions are also backpropagated for inference stability. At every iteration, the task of the network is randomly chosen to be either VC or TTS. For each sample, the source and target emotions are selected to be different.
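The per-iteration task selection and the mini-batch padding described above can be sketched as follows. This is an illustration under stated assumptions, not the training code: the function names are hypothetical, the VC/TTS choice is assumed to be uniform, and for TTS (which has no source speech) only a target emotion is drawn.

```python
import random

EMOTIONS = ["neutral", "happiness", "sadness", "anger", "fear", "surprise", "disgust"]


def sample_training_setup(rng):
    """Pick the task for one iteration and the emotions for one sample."""
    task = rng.choice(["VC", "TTS"])  # assumed uniform; the paper says "randomly"
    if task == "VC":
        # Two distinct emotions: source and target always differ.
        source_emotion, target_emotion = rng.sample(EMOTIONS, 2)
    else:
        # Assumption: TTS has text input, so only a target emotion is needed.
        source_emotion, target_emotion = None, rng.choice(EMOTIONS)
    return {
        "task": task,
        "source_emotion": source_emotion,
        "target_emotion": target_emotion,
    }


def zero_pad(batch, feature_dim):
    """Zero-pad each sequence of frames to the longest length in the batch."""
    max_len = max(len(sequence) for sequence in batch)
    padded = []
    for sequence in batch:
        padding = [[0.0] * feature_dim for _ in range(max_len - len(sequence))]
        padded.append([list(frame) for frame in sequence] + padding)
    return padded


rng = random.Random(0)
setup = sample_training_setup(rng)
batch = zero_pad([[[0.1, 0.2]], [[0.3, 0.4], [0.5, 0.6]]], feature_dim=2)
```

Since the loss is also computed on the padded frames, no mask is applied to `batch` in this sketch; the model is simply trained to emit zeros after the end of the utterance.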
Table 2: MOS and ABX preference scores.

| | VC | VCTTS-V |
|---|---|---|
| MOS | 4.34 ± 0.27 | 4.61 ± 0.28 |
| ABX | 0.25 ± 0.14 | 0.59 ± 0.11 |
4.3 Linguistic consistency
To verify the linguistic consistency of the proposed model, three different models were trained and evaluated. VCTTS is the model that combines TTS and VC, as explained in Section 3. VC refers to the voice converter without any TTS path, and TTS refers to the TTS model without any VC path. VCTTS-V denotes VCTTS used for voice conversion at inference, and VCTTS-T denotes VCTTS with its TTS path activated.
Word error rate (WER) was computed to measure how much our proposed model improves the linguistic consistency of the converted speech. In practice, morphemes were used instead of words, since morphemes are considered the recognition units of Korean speech [18, 19, 20]. The Google Cloud Speech-to-Text API transcribed the converted speech, and the transcripts were divided into sequences of morphemes by the Komoran morphological analyzer in KoNLPy. The average WER between the morpheme sequences of the true transcripts and those of the automatically recognized transcripts was then calculated; the results are shown in Table 1. They show that VCTTS-V outperforms VC, whereas the WER of VCTTS-T is worse than that of TTS.
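The morpheme-level WER described above reduces to an edit distance between two token sequences. The following is a minimal sketch, assuming the usual Levenshtein formulation (substitutions, deletions, and insertions each cost 1) normalized by the reference length, and assuming the average is taken per utterance. The function names are hypothetical, and the tokens in the usage line are placeholders rather than real Komoran output.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            deletion = previous[j] + 1      # reference token missing
            insertion = current[j - 1] + 1  # extra hypothesis token
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis):
    """Edit distance normalized by the number of reference tokens."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    return edit_distance(reference, hypothesis) / len(reference)


def average_error_rate(pairs):
    """Mean error rate over (reference, hypothesis) pairs, one per utterance."""
    rates = [error_rate(reference, hypothesis) for reference, hypothesis in pairs]
    return sum(rates) / len(rates)


# Placeholder morpheme tokens: one substitution and one deletion out of four.
rate = error_rate(["m1", "m2", "m3", "m4"], ["m1", "mX", "m3"])
```

Because the rate is normalized by the reference length, it can exceed 1 when the recognizer inserts many spurious tokens.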
After training, eight native Korean speakers participated in a subjective evaluation. Twenty sentences were generated by the VC and VCTTS-V models. The emotion of the contents speech was set to neutral and the target emotion to happiness, while the sentence of the style reference speech was fixed. The test was blind: subjects never knew which model generated which speech. Subjects were asked to rate the clarity of each sample from 1 to 5. In addition, an ABX preference test between the two models was conducted. Given two speech samples, without information about which model generated which, subjects were asked to choose the clearer one; they could choose neither if the two samples were perceived as similar. The results are shown in Table 2. VCTTS-V has a higher MOS and ABX preference score, which indicates that multitask learning of VC and TTS helps to preserve linguistic information.
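The two subjective scores can be computed along the following lines. This is a sketch only: it assumes the spread reported next to each score is a sample standard deviation (the text does not say whether it is a standard deviation or a confidence interval), the function names are hypothetical, and the ratings and choices in the usage lines are made-up toy values, not data from the evaluation.

```python
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (mean, sample standard deviation) of 1-to-5 clarity ratings."""
    for rating in ratings:
        if not 1 <= rating <= 5:
            raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings), stdev(ratings)


def abx_preference(choices, systems):
    """Fraction of trials in which each system, or neither, was preferred."""
    options = list(systems) + ["none"]
    counts = {option: 0 for option in options}
    for choice in choices:
        if choice not in counts:
            raise ValueError(f"unknown choice: {choice!r}")
        counts[choice] += 1
    return {option: counts[option] / len(choices) for option in options}


# Toy values for illustration only.
toy_mos = mean_opinion_score([4, 5, 4, 5])
toy_abx = abx_preference(["VC", "VCTTS-V", "VCTTS-V", "none"], ["VC", "VCTTS-V"])
```

Since subjects may choose neither sample, the two preference fractions do not have to sum to 1; the remainder is the "none" share.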
4.4 Emotional voice conversion
To investigate emotional voice conversion, the VCTTS model described above is used for inference. After training, we randomly chose 20 samples per emotion and fed them to the model. A style vector is then obtained for each sample, and the cosine similarity between each pair of samples is measured and illustrated in Fig. 2. The mean cosine similarity for every emotion pair is also shown.
The figure shows that samples of the same emotion have very high cosine similarity, while off-diagonal emotion pairs have low similarity. This implies that the style encoder is able to extract emotion style regardless of linguistic contents. Apart from the diagonal pairs, the (Disgust-Anger) pair also shows relatively high similarity, which means that the embeddings of these two emotions are closer to each other than to those of the other emotions. The same phenomenon is observed for the (Sadness-Fear) pair.
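The similarity analysis above amounts to a cosine similarity between style vectors, averaged over all sample pairs of each emotion pair. A minimal sketch follows; the function names and the two-dimensional toy vectors are hypothetical, and whether a sample is compared with itself on the diagonal is an assumption (here it is).

```python
import math
from itertools import product


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(vectors_by_emotion):
    """Mean cosine similarity for every ordered pair of emotions.

    `vectors_by_emotion` maps an emotion name to a list of style vectors.
    Diagonal entries include each sample's similarity with itself.
    """
    result = {}
    for first, second in product(vectors_by_emotion, repeat=2):
        similarities = [
            cosine_similarity(u, v)
            for u in vectors_by_emotion[first]
            for v in vectors_by_emotion[second]
        ]
        result[(first, second)] = sum(similarities) / len(similarities)
    return result


# Toy two-dimensional vectors; real style vectors come from the style encoder.
toy = {"anger": [[1.0, 0.0], [2.0, 0.0]], "sadness": [[0.0, 3.0]]}
matrix = mean_pairwise_similarity(toy)
```

With real style vectors, a high value on the diagonal and a low value elsewhere would correspond to the pattern reported for Fig. 2.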
On the other hand, emotional voice conversion should reflect the target emotion while keeping the linguistic contents. Voice conversion examples are displayed in Fig. 3. A neutral speech is transformed into six different emotions according to the given style reference speech. The contents of the style reference speech are fixed for this experiment. The top row of Fig. 3 is the log Mel spectrogram of the input speech, and the second to seventh rows are the log Mel spectrograms of the converted speech. The overall shape of each spectrogram is similar to that of the input, while changes such as temporal shifts, frequency shifts, and pause durations can be observed. A single model can thus generate multiple domains of speech.
5 Conclusion
In this paper, we presented emotional VC using multitask learning with TTS. Although there has been abundant research on VC, its performance is still lacking in terms of preserving linguistic information, emotional VC, and many-to-many VC. Unlike previous methods, the proposed model preserves the linguistic contents of VC through multitask learning with TTS. A single model is trained to optimize both VC and TTS, and the embedding space is trained to capture linguistic information through TTS. The results show that multitask learning substantially reduces WER, and the subjective evaluation supports this. For emotional VC, we collected a Korean parallel database of seven distinct emotions, and the model is trained to generate speech depending on the style reference input. In addition, the style encoder is devised to extract style information while removing linguistic information. Without explicit emotion labels as input, the style encoder successfully disentangles emotions.
This research can be extended in many directions. First, we only showed that TTS can help VC, but TTS could also be improved by VC, since some languages have a highly nonlinear relationship between text and its pronunciation. Second, since the contents encoder is trained to extract only linguistic information, it could be extended to improve speech recognition. Third, a more explicit loss, such as a domain adversarial loss, could be added to minimize the difference between the embeddings of VC and TTS in the linguistic space.
References
-  Masanobu Abe, Satoshi Nakamura, Kiyohiro Shikano, and Hisao Kuwabara, “Voice conversion through vector quantization,” Journal of the Acoustical Society of Japan (E), vol. 11, no. 2, pp. 71–76, 1990.
-  Tomoki Toda, Alan W Black, and Keiichi Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007.
-  Srinivas Desai, Alan W Black, B Yegnanarayana, and Kishore Prahallad, “Spectral mapping using artificial neural networks for voice conversion,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 5, pp. 954–964, 2010.
-  Lifa Sun, Shiyin Kang, Kun Li, and Helen Meng, “Voice conversion using deep bidirectional long short-term memory based recurrent neural networks,” in ICASSP 2015. IEEE, 2015, pp. 4869–4873.
-  Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
-  Jing-Xuan Zhang, Zhen-Hua Ling, Li-Juan Liu, Yuan Jiang, and Li-Rong Dai, “Sequence-to-sequence acoustic modeling for voice conversion,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 27, no. 3, pp. 631–644, 2019.
-  Jing-Xuan Zhang, Zhen-Hua Ling, Yuan Jiang, Li-Juan Liu, Chen Liang, and Li-Rong Dai, “Improving sequence-to-sequence voice conversion by adding text-supervision,” in ICASSP 2019. IEEE, 2019, pp. 6785–6789.
-  Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al., “Tacotron: Towards end-to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017.
-  Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al., “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions,” in ICASSP 2018. IEEE, 2018, pp. 4779–4783.
-  Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, and Ming Zhou, “Close to human quality tts with transformer,” arXiv preprint arXiv:1809.08895, 2018.
-  Zhaojie Luo, Jinhui Chen, Tetsuya Takiguchi, and Yasuo Ariki, “Emotional voice conversion using neural networks with arbitrary scales F0 based on wavelet transform,” EURASIP J. Audio, Speech and Music Processing, vol. 2017, pp. 18, 2017.
-  Zhaojie Luo, Jinhui Chen, Tetsuya Takiguchi, and Yasuo Ariki, “Emotional voice conversion using dual supervised adversarial networks with continuous wavelet transform F0 features,” IEEE/ACM Trans. Audio, Speech & Language Processing, vol. 27, no. 10, pp. 1535–1548, 2019.
-  Yawen Xue, Yasuhiro Hamada, and Masato Akagi, “Voice conversion for emotional speech: Rule-based synthesis with degree of emotion controllable in dimensional space,” Speech Communication, vol. 102, pp. 54–67, 2018.
-  Gokce Keskin, Tyler Lee, Cory Stephenson, and Oguz H Elibol, “Many-to-many voice conversion with out-of-dataset speaker support,” arXiv preprint arXiv:1905.02525, 2019.
-  Songxiang Liu, Jinghua Zhong, Lifa Sun, Xixin Wu, Xunying Liu, and Helen Meng, “Voice conversion across arbitrary speakers based on a single target-speaker utterance,” in Interspeech, 2018, pp. 496–500.
-  RJ Skerry-Ryan, Eric Battenberg, Ying Xiao, Yuxuan Wang, Daisy Stanton, Joel Shor, Ron J Weiss, Rob Clark, and Rif A Saurous, “Towards end-to-end prosody transfer for expressive speech synthesis with tacotron,” arXiv preprint arXiv:1803.09047, 2018.
-  Younggun Lee, Azam Rabiee, and Soo-Young Lee, “Emotional end-to-end neural speech synthesizer,” arXiv preprint arXiv:1711.05447, 2017.
-  Oh-Wook Kwon and Jun Park, “Korean large vocabulary continuous speech recognition with morpheme-based recognition units,” Speech Communication, vol. 39, no. 3-4, pp. 287–300, 2003.
-  Kyong-Nim Lee and Minhwa Chung, “Pronunciation lexicon modeling and design for korean large vocabulary continuous speech recognition,” in INTERSPEECH 2004, 2004.
-  Jeong-Uk Bang, Sang-Hun Kim, and Oh-Wook Kwon, “Performance of speech recognition unit considering morphological pronunciation variation,” Phonetics and Speech Sciences, vol. 10, no. 4, pp. 111–119, 2018.
-  Eunjeong L. Park and Sungzoon Cho, “Konlpy: Korean natural language processing in python,” in Proceedings of the 26th Annual Conference on Human & Cognitive Language Technology, 2014.