
EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance

by   Yiwei Guo, et al.
Shanghai Jiao Tong University

Although current neural text-to-speech (TTS) models are able to generate high-quality speech, intensity controllable emotional TTS is still a challenging task. Most existing methods need external optimizations for intensity calculation, leading to suboptimal results or degraded quality. In this paper, we propose EmoDiff, a diffusion-based TTS model where emotion intensity can be manipulated by a proposed soft-label guidance technique derived from classifier guidance. Specifically, instead of being guided with a one-hot vector for the specified emotion, EmoDiff is guided with a soft label where the value of the specified emotion and Neutral is set to α and 1-α respectively. The α here represents the emotion intensity and can be chosen from 0 to 1. Our experiments show that EmoDiff can precisely control the emotion intensity while maintaining high voice quality. Moreover, diverse speech with specified emotion intensity can be generated by sampling in the reverse denoising process.


1 Introduction

Although current neural text-to-speech (TTS) models, such as Grad-TTS [gradtts], VITS [vits] and VQTTS [VQTTS], are able to generate high-quality speech, intensity controllable emotional TTS is still a challenging task. Unlike prosody modelling in recent literature [phone_level_3dim, du2021phone, guo2022unsupervised], where no specific label is provided in advance, emotional TTS typically utilizes datasets with categorical emotion labels. Mainstream emotional TTS models [tag1, tag2] can only synthesize emotional speech given the emotion label, without intensity controllability.

In intensity controllable TTS models, efforts have been made to properly define and calculate emotion intensity values for training. The most widely used method to define and obtain emotion intensity is relative attributes rank (RAR) [relative-attributes], which is used in [zhu2019controlling, lei2021fine, schnell2021improving, lei2022msemotts, MixedEmotion]. RAR seeks a ranking matrix through a max-margin optimization problem, which is solved by support vector machines. The solution is then fed to the model for training. As this is a manually constructed and separate stage, it might lead to suboptimal results that bring bias into training. In addition to RAR, operations on the emotion embedding space have also been explored. One approach designs an algorithm to maximize the distance between emotion embeddings, and interpolates the embedding space to control emotion intensity. [im2022emoq] quantizes the distance of emotion embeddings to obtain emotion intensities. However, the structure of the embedding space also greatly influences the performance of these models, resulting in the need for careful extra constraints. Intensity control for emotion conversion is investigated in [choi2021sequence, Emovox], with similar methods. Some of the mentioned works also suffer from degraded speech quality. As an example, [MixedEmotion] (which we refer to as “MixedEmotion” later) is an autoregressive model that uses intensity values from RAR to weight the emotion embeddings. It adopts pretraining to improve synthesis quality, but still shows obvious quality degradation.

To overcome these issues, we need a conditional sampling method that can directly control emotions weighted with intensity. In this work, we propose a soft-label guidance technique, based on the classifier guidance technique [beat-gan, liu2019more] in denoising diffusion models [DDPM, song-sde]. Classifier guidance is an efficient sampling technique that uses the gradient of a classifier to guide the sampling trajectory given a one-hot class label.

In this paper, based on the extended soft-label guidance, we propose EmoDiff, an emotional TTS model with sufficient intensity controllability. Specifically, we first train an emotion-unconditional acoustic model. Then an emotion classifier is trained on any $x_t$ on the diffusion process trajectory, where $t$ is the diffusion timestamp. In inference, we guide the reverse denoising process with the classifier and a soft emotion label where the values of the specified emotion and Neutral are set to $\alpha$ and $1-\alpha$ respectively, instead of a one-hot distribution where only the specified emotion is 1 while all others are 0. Here $\alpha \in [0, 1]$ represents the emotion intensity. Our experiments show that EmoDiff can precisely control the emotion intensity while maintaining high voice quality. Moreover, as a strength of diffusion models [beat-gan], it also generates diverse speech samples even with the same emotion.

In short, the main advantages of EmoDiff are:

  1. We define the emotion intensity as the weight for classifier guidance when using soft labels. This achieves precise intensity control in terms of classifier probability, without the need for extra optimization. It thus enables us to generate speech with any specified emotion intensity effectively.

  2. It does not harm the synthesized speech. The generated samples have good quality and naturalness.

  3. It also generates diverse samples even within the same emotion.

2 Diffusion Models with Classifier Guidance

2.1 Denoising Diffusion Models and TTS Applications

Denoising diffusion probabilistic models [DDPM, song-sde] have proven successful in many generative tasks. In the score-based interpretation [song-score-matching, song-sde], diffusion models construct a forward stochastic differential equation (SDE) to transform the data distribution $p_0(x_0)$ into a known distribution $p_T(x_T)$, and use a corresponding reverse-time SDE to generate realistic samples starting from noise. Thus, the reverse process is also called the “denoising” process. Neural networks are then trained to estimate the score function $\nabla_x \log p_t(x_t)$ for any $t \in [0, T]$ on the SDE trajectory, with score-matching objectives [song-score-matching, song-sde]. In applications, diffusion models bypass the training instability and mode collapse problems of GANs, and outperform previous methods on sample quality and diversity [beat-gan].

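Denoising diffusion models have also been used in TTS [difftts, gradtts, diffsinger, fastdiff, lam2022bddm] and vocoding [wavegrad, diffwave] tasks, with remarkable results. In this paper, we build EmoDiff on the design of GradTTS [gradtts]. Denote by $x \in \mathbb{R}^d$ a frame of mel-spectrogram. GradTTS constructs a forward SDE:

$$\mathrm{d}x_t = \frac{1}{2}\Sigma^{-1}(\mu - x_t)\,\beta_t\,\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}B_t \tag{1}$$

where $B_t$ is a standard Brownian motion and $t \in [0, 1]$ is the SDE time index. $\beta_t$ is referred to as the noise schedule, such that $\beta_t$ is increasing and $\exp\{-\int_0^1 \beta_s\,\mathrm{d}s\} \approx 0$. Then we have $p_1(x_1) \approx \mathcal{N}(x; \mu, \Sigma)$. This SDE also indicates the conditional distribution $x_t \mid x_0 \sim \mathcal{N}(\rho(x_0, \Sigma, \mu, t), \lambda(\Sigma, t))$, where $\rho(\cdot)$ and $\lambda(\cdot)$ both have closed forms. Thus we can directly sample $x_t$ from $x_0$. In practice, we set $\Sigma$ to the identity matrix, and $\lambda(\Sigma, t)$ therefore becomes $\lambda_t I$, where $\lambda_t$ is a scalar with known closed form. Meanwhile, we condition the terminal distribution $p_1(x_1)$ on text, i.e. let $\mu = \mu_\theta(y)$, where $y$ is the aligned phoneme representation of that frame.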
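As an illustration of sampling $x_t$ directly from $x_0$, the following is a minimal NumPy sketch of the closed-form conditional for the case $\Sigma = I$. The linear schedule $\beta_t = \beta_0 + (\beta_1 - \beta_0)t$, the endpoint values, and all function names are assumptions made for this example; they are not stated in this paper.

```python
import numpy as np

# Assumed linear noise schedule beta_t = BETA0 + (BETA1 - BETA0) * t on t in [0, 1].
# The endpoint values are illustrative and not taken from this paper.
BETA0, BETA1 = 0.05, 20.0


def beta_integral(t):
    """Closed form of int_0^t beta_s ds for the assumed linear schedule."""
    return BETA0 * t + 0.5 * (BETA1 - BETA0) * t ** 2


def forward_mean_var(x0, mu, t):
    """Mean rho and scalar variance lambda_t of x_t | x_0 when Sigma = I."""
    decay = np.exp(-0.5 * beta_integral(t))
    rho = decay * x0 + (1.0 - decay) * mu
    lam = 1.0 - np.exp(-beta_integral(t))
    return rho, lam


def sample_xt(x0, mu, t, rng):
    """Draw x_t ~ N(rho, lambda_t * I) from x_0 without simulating the SDE."""
    rho, lam = forward_mean_var(x0, mu, t)
    eps = rng.standard_normal(x0.shape)
    return rho + np.sqrt(lam) * eps, eps
```

At $t = 0$ the sketch returns $x_0$ unchanged, and at $t = 1$ the mean is close to $\mu$ with variance close to 1, matching $p_1(x_1) \approx \mathcal{N}(\mu, I)$.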

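The SDE of Eq.(1) has a reverse-time counterpart:

$$\mathrm{d}x_t = \left(\frac{1}{2}\Sigma^{-1}(\mu - x_t) - \nabla_x \log p_t(x_t)\right)\beta_t\,\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}\tilde{B}_t \tag{2}$$

where $\nabla_x \log p_t(x_t)$ is the score function that is to be estimated, and $\tilde{B}_t$ is a reverse-time Brownian motion. It shares the trajectory of distributions $p_t(x_t)$ with the forward SDE in Eq.(1). So, solving it from $x_1 \sim \mathcal{N}(\mu, \Sigma)$, we end up with a realistic sample $x_0 \sim p(x_0 \mid y)$. A neural network $s_\theta(x_t, y, t)$ is trained to estimate the score function, with the following score-matching [song-score-matching] objective:

$$\min_\theta \mathcal{L} = \mathbb{E}_{x_0, y, t}\left[\lambda_t \left\| s_\theta(x_t, y, t) - \nabla_{x_t} \log p(x_t \mid x_0) \right\|^2\right] \tag{3}$$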
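To make the regression target of the score-matching objective concrete, here is a small NumPy sketch. It assumes the Gaussian conditional $x_t \mid x_0 \sim \mathcal{N}(\rho, \lambda_t I)$ described above, whose score is $-(x_t - \rho)/\lambda_t$. The function names are hypothetical, and the score estimate is passed in as a plain array rather than produced by a trained network.

```python
import numpy as np


def target_score(xt, rho, lam):
    """Score of N(rho, lam * I) evaluated at xt, i.e. grad_x log p(x_t | x_0)."""
    return -(xt - rho) / lam


def score_matching_loss(score_estimate, xt, rho, lam):
    """lambda_t-weighted squared error to the target score, averaged over a batch."""
    diff = score_estimate - target_score(xt, rho, lam)
    return float(np.mean(lam * np.sum(diff ** 2, axis=-1)))
```

When $x_t = \rho + \sqrt{\lambda_t}\,\epsilon$, the target equals $-\epsilon/\sqrt{\lambda_t}$, so the network is effectively trained to predict the injected noise up to a scale.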


2.2 Conditional Sampling Based on Classifier Guidance

Figure 1: Training and sampling diagrams of EmoDiff. In training, $x_t$ is directly sampled from the known distribution $p(x_t \mid x_0)$. When sampling with a certain emotion intensity $\alpha$, the score function is estimated by the score estimator. “SG” means the stop-gradient operation.

Denoising diffusion models provide a new way of modeling conditional probabilities $p(x \mid c)$, where $c$ is a class label. Suppose we now have an unconditional generative model $p(x)$ and a classifier $p(c \mid x)$. By Bayes' formula, we have

$$\nabla_x \log p(x \mid c) = \nabla_x \log p(c \mid x) + \nabla_x \log p(x) \tag{4}$$

In the diffusion framework, to sample from the conditional distribution $p(x \mid c)$, we need to estimate the score function $\nabla_x \log p(x_t \mid c)$. By Eq.(4), we only need to add the gradient from a classifier to the unconditional model. This conditional sampling method is named classifier guidance [beat-gan, liu2019more], and is also used in unsupervised TTS [guidedtts].

In practice, classifier gradients are often scaled [beat-gan, guidedtts] to control the strength of guidance. Instead of the original $\nabla_x \log p(c \mid x)$ in Eq.(4), we now use $\gamma \nabla_x \log p(c \mid x)$, where $\gamma \geq 0$ is called the guidance level. A larger $\gamma$ results in highly class-correlated samples, while a smaller one encourages sample variability [beat-gan].
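The scaled guidance rule can be sketched as follows. This is a toy example under stated assumptions: the classifier is a linear-softmax model with logits $Wx + b$ (not the CNN used later in this paper), for which $\nabla_x \log p(c \mid x) = W_c - \sum_j p_j W_j$ in closed form, and all function names are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax of a 1-D logit vector."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))


def classifier_log_prob_grad(x, W, b, c):
    """grad_x log p(c | x) for a toy linear-softmax classifier with logits W @ x + b."""
    probs = np.exp(log_softmax(W @ x + b))
    return W[c] - probs @ W


def guided_score(uncond_score, x, W, b, c, gamma):
    """Unconditional score plus the classifier gradient scaled by guidance level gamma."""
    return uncond_score + gamma * classifier_log_prob_grad(x, W, b, c)
```

Setting `gamma = 0` recovers the unconditional score, and `gamma = 1` recovers Eq.(4) exactly.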

Different from ordinary classifiers, the input to the classifier used here is any $x_t$ along the trajectory of the SDE in Eq.(1), instead of the clean $x_0$ only. The time index $t$ can be anything in $[0, 1]$. Thus, the classifier can also be denoted as $p(c \mid x_t, t)$.

While Eq.(4) can effectively control sampling on a class label $c$, it cannot be directly applied to soft labels, i.e. labels weighted with intensity, as the guidance term $\nabla_x \log p(c \mid x)$ is not well-defined in that case. Therefore, we extend this technique for emotion intensity control in Section 3.2.

3 EmoDiff

3.1 Unconditional Acoustic Model and Classifier Training

The training of EmoDiff mainly includes the training of the unconditional acoustic model and of the emotion classifier. We first train a diffusion-based acoustic model on emotional data, but do not provide it with emotion conditions. This is referred to as “unconditional acoustic model training”, as in Figure 1(a). This model is based on GradTTS [gradtts], except that we provide an explicit duration sequence from forced aligners to ease duration modeling. In this stage, the training objective is $\mathcal{L}_{\text{dur}} + \mathcal{L}_{\text{diff}}$, where $\mathcal{L}_{\text{dur}}$ is the loss on logarithmic duration, and $\mathcal{L}_{\text{diff}}$ is the diffusion loss as in Eq.(3). In practice, following GradTTS, we also adopt the prior loss $\mathcal{L}_{\text{prior}}$ to encourage convergence. For notational simplicity, we use $\mathcal{L}_{\text{diff}}$ to denote the diffusion and prior losses together in Figure 1(a).

After training, the acoustic model can estimate the score function of the noisy mel-spectrogram $x_t$ given the input phoneme sequence $y$, i.e. $\nabla_x \log p(x_t \mid y)$, which is unconditional on emotion labels. Following Section 2.2, we then need an emotion classifier to distinguish emotion categories $e$ from noisy mel-spectrograms $x_t$. Meanwhile, as we always have a text condition $y$, the classifier is formulated as $p(e \mid x_t, y, t)$. As shown in Figure 1(b), the input to the classifier consists of three components: the SDE timestamp $t$, the noisy mel-spectrogram $x_t$ and the phoneme-dependent Gaussian mean $\mu$. This classifier is trained with the standard cross-entropy loss $\mathcal{L}_{\text{CE}}$. Note that we freeze the acoustic model parameters in this stage, and only update the weights of the emotion classifier.

As text is always needed as a condition throughout the paper, we omit it and denote this classifier as $p(e \mid x)$ in later sections to simplify the notation, where no ambiguity is caused.

3.2 Intensity Controllable Sampling with Soft-Label Guidance

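In this section, we extend classifier guidance to soft-label guidance, which can control emotion weighted with intensity. Suppose the number of basic emotions is $m$, and every basic emotion $e_i$ has a one-hot vector form $e_i \in \mathbb{R}^m$, $i \in \{0, 1, \ldots, m-1\}$. For each $e_i$, only the $i$-th dimension is 1. We specially use $e_0$ to denote Neutral. For an emotion weighted with intensity $\alpha$ on $e_i$, we define it to be $d = \alpha e_i + (1-\alpha) e_0$. Then the gradient of the log-probability of the classifier $p(d \mid x)$ w.r.t. $x$ can be defined as

$$\nabla_x \log p(d \mid x) \triangleq \alpha \nabla_x \log p(e_i \mid x) + (1-\alpha) \nabla_x \log p(e_0 \mid x) \tag{5}$$

The intuition behind this definition is that the intensity $\alpha$ stands for the contribution of emotion $e_i$ to the sampling trajectory of $x$. A larger $\alpha$ means we sample along a trajectory with a large “force” towards emotion $e_i$, and a smaller one means a larger force towards $e_0$. Thus we can extend Eq.(4) to

$$\nabla_x \log p(x \mid d) = \alpha \nabla_x \log p(e_i \mid x) + (1-\alpha) \nabla_x \log p(e_0 \mid x) + \nabla_x \log p(x) \tag{6}$$

When the intensity $\alpha$ is 1.0 (100% emotion $e_i$) or 0.0 (100% Neutral), the above operation reduces to the standard classifier guidance form of Eq.(4). Hence we can use the soft-label guidance of Eq.(5) in the sampling process, and generate a realistic sample with the specified emotion $d$ with intensity $\alpha$.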
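The soft-label gradient can be sketched in a few lines of NumPy. As an assumption for illustration only, the classifier here is a toy linear-softmax model with logits $Wx + b$, so that every per-class gradient is available in closed form; the function names and the convention that index 0 is Neutral are choices made for this example.

```python
import numpy as np


def class_grads(x, W, b):
    """Rows are grad_x log p(e_i | x) for each class of a toy linear-softmax classifier."""
    logits = W @ x + b
    probs = np.exp(logits - np.max(logits))
    probs /= probs.sum()
    # Broadcasting: row i is W[i] - sum_j p_j W[j].
    return W - probs @ W


def soft_label_grad(x, W, b, emotion, alpha, neutral=0):
    """alpha * grad log p(e_i | x) + (1 - alpha) * grad log p(e_0 | x)."""
    grads = class_grads(x, W, b)
    return alpha * grads[emotion] + (1.0 - alpha) * grads[neutral]
```

With `alpha=1.0` the result is the ordinary classifier gradient for the chosen emotion, and with `alpha=0.0` it is the gradient for Neutral, which mirrors the reduction to Eq.(4) described above.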

Figure 1(c) illustrates the intensity controllable sampling process. After feeding the acoustic model and obtaining the phoneme-dependent $\mu$ sequence, we sample $x_1 \sim \mathcal{N}(\mu, I)$ and simulate the reverse-time SDE from $t = 1$ to $t = 0$ with a numerical simulator. In each simulator update, we feed the classifier with the current $x_t$ and get the output probabilities $p_t(\cdot \mid x_t)$. Eq.(6) is then used to calculate the guidance term. As in Section 2.2, we also scale the guidance term with the guidance level $\gamma$. At the end, we obtain $\hat{x}_0$, which is not only intelligible with respect to the input text, but also corresponds to the target emotion $d$ with intensity $\alpha$. This leads to precise intensity control that correlates well with classifier probability.
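The sampling loop can be sketched on a toy problem as follows. Everything model-specific here is an assumption made for illustration: the trained score estimator is replaced by the analytic score $-(x - \mu)$ of $\mathcal{N}(\mu, I)$, the classifier is a time-independent linear-softmax model with weight rows `W`, the noise schedule is linear with illustrative endpoints, and the simulator is a plain Euler-Maruyama discretization of Eq.(2). It is not the authors' implementation.

```python
import numpy as np

BETA0, BETA1 = 0.05, 20.0  # assumed linear noise schedule, illustrative values


def beta(t):
    return BETA0 + (BETA1 - BETA0) * t


def soft_label_grad(x, W, alpha, emotion, neutral=0):
    """Soft-label guidance term of Eq.(5), batched over the rows of x."""
    logits = x @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    mean_w = probs @ W  # sum_j p_j W_j for every sample
    return alpha * (W[emotion] - mean_w) + (1.0 - alpha) * (W[neutral] - mean_w)


def sample(mu, W, alpha, emotion, gamma=1.0, n_steps=100, seed=0):
    """Simulate the reverse-time SDE from t = 1 to t = 0 with soft-label guidance."""
    rng = np.random.default_rng(seed)
    x = mu + rng.standard_normal(mu.shape)  # x_1 ~ N(mu, I)
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = 1.0 - k * h
        score = -(x - mu)  # toy stand-in for the trained score estimator
        score = score + gamma * soft_label_grad(x, W, alpha, emotion)
        drift = (0.5 * (mu - x) - score) * beta(t)
        noise = rng.standard_normal(x.shape)
        x = x - h * drift + np.sqrt(beta(t) * h) * noise  # one step backwards in time
    return x
```

In this toy setting, with two classes whose weight rows point in opposite directions, raising `alpha` from 0 to 1 moves the samples from the Neutral side of the decision boundary towards the emotion side, which is the qualitative behaviour the paragraph above describes.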

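More generally, in addition to intensity control, our soft-label guidance is capable of more complicated control over mixed emotions [MixedEmotion]. Denote by $d = \sum_{i=0}^{m-1} w_i e_i$ a combination of all emotions, where $w_i \in [0, 1]$ and $\sum_{i=0}^{m-1} w_i = 1$. Eq.(5) can be generalized to

$$\nabla_x \log p(d \mid x) \triangleq \sum_{i=0}^{m-1} w_i \nabla_x \log p(e_i \mid x) \tag{7}$$

Then Eq.(6) can also be expressed in this generalized form. This extension can also be interpreted from a probabilistic view. As the combination weights $\{w_i\}$ can be viewed as a categorical distribution $p_d(\cdot)$ over the basic emotions $\{e_i\}$, Eq.(7) is equivalent to

$$\begin{aligned} \nabla_x \log p(d \mid x) &= \nabla_x \sum_{i=0}^{m-1} w_i \log p(e_i \mid x) && (8) \\ &= -\nabla_x \mathrm{CE}\left[p_d(\cdot),\, p(\cdot \mid x)\right] && (9) \end{aligned}$$

where $\mathrm{CE}$ is the cross-entropy function. Eq.(9) implies that we are actually decreasing the cross-entropy between the target emotion distribution $p_d$ and the classifier output $p(\cdot \mid x)$ when sampling along the gradient $\nabla_x \log p(d \mid x)$. The gradient of the cross-entropy w.r.t. $x$ can thus guide the sampling process. Hence, this soft-label guidance technique can generally be used to control an arbitrarily complex emotion expressed as a weighted combination of several basic emotions.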
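The equivalence between the weighted sum of per-class gradients and the negative cross-entropy gradient can be checked numerically. The sketch below again assumes a toy linear-softmax classifier with logits $Wx + b$; the function names are hypothetical.

```python
import numpy as np


def log_probs(x, W, b):
    """log p(e_i | x) for all classes of a toy linear-softmax classifier."""
    logits = W @ x + b
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))


def mixed_emotion_grad(x, W, b, weights):
    """sum_i w_i * grad_x log p(e_i | x), as in Eq.(7)."""
    probs = np.exp(log_probs(x, W, b))
    return weights @ (W - probs @ W)


def cross_entropy(x, W, b, weights):
    """CE[p_d, p(. | x)] with the target distribution given by the weights."""
    return -float(weights @ log_probs(x, W, b))
```

A central finite difference of `cross_entropy` with respect to `x` should match the negative of `mixed_emotion_grad` up to numerical error, for any weight vector on the probability simplex.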

In Figure 1(c), we use cross-entropy as a concise notation for soft-label guidance term. In our intensity control scenario, it reduces to Eq.(5) mentioned before.

4 Experiments and Results

Figure 2: Classification probabilities when controlling on intensity $\alpha$. Error bars represent standard deviation.

4.1 Experimental Setup

We used the English part of the Emotional Speech Dataset (ESD) [zhoukun-ESD] to perform all the experiments. It has 10 speakers, each with 4 emotional categories (Angry, Happy, Sad, Surprise) together with a Neutral category. There are 350 parallel utterances per speaker and emotion category, amounting to about 1.2 hours per speaker. Mel-spectrograms and forced alignments were extracted by Kaldi [povey2011kaldi] with a 12.5 ms frame shift and a 50 ms frame length, followed by cepstral normalization. Audio samples from these experiments are available online.

In this paper, we only consider the single-speaker emotional TTS problem. Throughout the following sections, we trained an unconditional GradTTS acoustic model on all 10 English speakers for reasonable data coverage, and a classifier on one female speaker (ID:0015) only. The unconditional GradTTS model was trained with Adam optimizer at learning rate for 11M steps. We used an exponential moving average of the model weights, as it is reported to improve the performance of diffusion models [song-sde]. The classifier is a 4-layer 1D CNN, with BatchNorm and Dropout in each block. In the inference stage, the guidance level $\gamma$ was fixed to 100.

We chose HifiGAN [hifigan] trained on all the English speakers here as a vocoder for all the following experiments.

4.2 Emotional TTS Quality

Table 1: MOS and MCD comparisons. MOS is presented with 95% confidence interval. Note that “GradTTS w/ emo label” cannot control emotion intensity.

System                         MOS            MCD
GT                             4.73 ± 0.09    -
GT (voc.)                      4.69 ± 0.10    2.96
MixedEmotion [MixedEmotion]    3.43 ± 0.12    6.62
GradTTS w/ emo label           4.16 ± 0.10    5.75
EmoDiff (ours)                 4.13 ± 0.10    5.98

We first measure the speech quality, which covers audio quality and speech naturalness. We compared the proposed EmoDiff with the following systems:

  1. GT and GT (voc.): ground truth recording and analysis-synthesis result (vocoded from the GT mel-spectrogram).

  2. MixedEmotion: proposed in [MixedEmotion]; we used the official implementation. It is an autoregressive model that relies on relative attributes rank to pre-calculate intensity values for training. It closely resembles Emovox [Emovox] for intensity controllable emotion conversion.

  3. GradTTS w/ emo label: a conditional GradTTS model with hard emotion labels as input. It therefore does not have intensity controllability, but should have good sample quality, as a proven acoustic model.

Note that in this experiment, samples from EmoDiff and MixedEmotion were controlled with intensity weight $\alpha = 1.0$, so that they are directly comparable with the others.

Table 1 presents the mean opinion score (MOS) and mel cepstral distortion (MCD) evaluations. The vocoder causes little deterioration in sample quality, and EmoDiff outperforms the MixedEmotion baseline by a large margin (MOS 4.13 vs. 3.43). Meanwhile, EmoDiff and the hard-conditioned GradTTS have decent and very close MOS results (4.13 vs. 4.16), and their MCD results differ only slightly (5.98 vs. 5.75). This means that, unlike MixedEmotion, EmoDiff does not sacrifice sample quality for intensity controllability.

4.3 Controllability of Emotion Intensity

To evaluate the controllability of emotion intensity, we used our trained classifier to classify the samples synthesized under each controlled intensity. The time index input to the classifier was now set to $t = 0$. The average classification probability on the target emotion class was used as the evaluation metric. Larger values indicate larger discriminative confidence. For both EmoDiff and MixedEmotion on each emotion, we varied the intensity from $\alpha = 0.0$ to $1.0$. When the intensity is 0.0, it is equivalent to synthesizing 100% Neutral samples. Larger intensity should result in larger probability.
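The evaluation metric amounts to a mean and a spread over the target-class column of the classifier outputs. A minimal sketch is given below; the function name is hypothetical, and the use of the population standard deviation (NumPy's default) is an assumption, since the text does not say which estimator was used for the error bars.

```python
import numpy as np


def target_probability_stats(probs, target):
    """Mean and standard deviation of the classifier probability on the target class.

    probs: array of shape (n_samples, n_classes), one row of class
    probabilities per synthesized utterance.
    """
    probs = np.asarray(probs, dtype=float)
    target_probs = probs[:, target]
    return float(target_probs.mean()), float(target_probs.std())
```

Calling this once per intensity value and per emotion would give the points and error bars of a plot like Figure 2.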

Figure 2 presents the results. To demonstrate the capability of this classifier, we plotted the classification probability on ground truth data. To show the performance of the hard-conditioned GradTTS model, we also plotted the probability on its synthesized samples. As it does not have intensity controllability, we only plotted its values at intensity 1.0. Standard deviations are shown as error bars for each experiment.

The figure first shows that the trained classifier performs reasonably on ground truth data. As a remark, its classification accuracy on the validation set is 93.1%. Samples from GradTTS w/ emo label have somewhat lower classification probabilities. Most importantly, the proposed EmoDiff always covers a larger range from intensity 0.0 to 1.0 than the baseline. The error range of EmoDiff is also always smaller than that of the baseline, meaning that our control is more stable. This demonstrates the effectiveness of our proposed soft-label guidance technique. We also notice that EmoDiff sometimes reaches a higher classification probability than the hard-conditioned GradTTS at intensity 1.0. This is also reasonable, as conditioning on emotion labels during training is not guaranteed to achieve better class correlation than classifier guidance with a strong classifier and a sufficient guidance level.

4.4 Diversity of Emotional Samples

Figure 3: Diversity preference test of each emotion.

Besides generating high-quality and intensity controllable emotional samples, EmoDiff also has good sample diversity even within the same emotion, benefiting from the powerful generative ability of diffusion models. To evaluate the diversity of emotional samples, we conducted a subjective preference test for each emotion between EmoDiff and MixedEmotion. Listeners were asked to choose the more diverse one, or “Cannot Decide”. Note that the test was done for each emotion with intensity weight $\alpha = 1.0$.

Figure 3 shows the preference results. For each of the three emotion categories Angry, Happy and Surprise, EmoDiff is preferred by a large margin in terms of diversity. Only for Sad does EmoDiff outperform the baseline by a small margin. This is mainly because MixedEmotion is autoregressive, and we found that its variation in duration contributes considerably, especially for Sad samples.

5 Conclusion

In this paper, we investigated the intensity control problem in emotional TTS systems. We defined an emotion with intensity to be the weighted sum of a specific emotion and Neutral, with the weight being the intensity value. Under this modeling, we extended the classifier guidance technique to soft-label guidance, which enables us to directly control an arbitrary emotion with intensity instead of a one-hot class label. With this technique, the proposed EmoDiff achieves simple but effective control over emotion intensity, using an unconditional acoustic model and an emotion classifier. Subjective and objective evaluations demonstrated that EmoDiff outperforms the baseline in terms of TTS quality, intensity controllability and sample diversity. The proposed soft-label guidance can also be applied more generally to control more complicated natural emotions, which we leave as future work.

6 Acknowledgements

This study is supported by the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102) and the Jiangsu Technology Project (No.BE2022059-2).