1 Introduction
Although current neural text-to-speech (TTS) models such as GradTTS [gradtts], VITS [vits] and VQTTS [VQTTS] are able to generate high-quality speech, intensity-controllable emotional TTS remains a challenging task. Unlike prosody modelling in the recent literature [phone_level_3dim, du2021phone, guo2022unsupervised], where no specific label is provided in advance, emotional TTS typically utilizes datasets with categorical emotion labels. Mainstream emotional TTS models [tag1, tag2] can only synthesize emotional speech given an emotion label, without intensity controllability.
In intensity-controllable TTS models, efforts have been made to properly define and compute emotion intensity values for training. The most common method to define and obtain emotion intensity is the relative attributes rank (RAR) [relativeattributes], which is used in [zhu2019controlling, lei2021fine, schnell2021improving, lei2022msemotts, MixedEmotion]. RAR seeks a ranking matrix via a max-margin optimization problem solved by support vector machines. The solution is then fed to the model for training. As this is a separate, manually constructed stage, it may yield suboptimal results that introduce bias into training. In addition to RAR, operations on the emotion embedding space have also been explored.
[um2020emotional] designs an algorithm to maximize the distance between emotion embeddings, and interpolates the embedding space to control emotion intensity.
[im2022emoq] quantizes the distance between emotion embeddings to obtain emotion intensities. However, the structure of the embedding space greatly influences the performance of these models, so careful extra constraints are needed. Intensity control for emotion conversion is investigated in [choi2021sequence, Emovox] with similar methods. Some of the aforementioned works also suffer from degraded speech quality. As an example, [MixedEmotion] (which we refer to as "MixedEmotion" later) is an autoregressive model that uses intensity values from RAR to weight the emotion embeddings. It adopts pretraining to improve synthesis quality, but the quality degradation remains obvious.
To overcome these issues, we need a conditional sampling method that can directly control emotions weighted by intensity. In this work, we propose a soft-label guidance technique based on the classifier guidance technique [beatgan, liu2019more] in denoising diffusion models [DDPM, songsde]. Classifier guidance is an efficient sampling technique that uses the gradient of a classifier to guide the sampling trajectory given a one-hot class label.
In this paper, based on the extended soft-label guidance, we propose EmoDiff, an emotional TTS model with effective intensity controllability. Specifically, we first train an emotion-unconditional acoustic model. Then an emotion classifier is trained on any $X_t$ on the diffusion process trajectory, where $t$ is the diffusion timestamp. In inference, we guide the reverse denoising process with this classifier and a soft emotion label in which the value of the specified emotion and Neutral is set to $\gamma$ and $1-\gamma$ respectively, instead of a one-hot distribution where only the specified emotion is 1 while all others are 0. Here $\gamma$ represents the emotion intensity. Our experiments show that EmoDiff can precisely control the emotion intensity while maintaining high voice quality. Moreover, it also generates diverse speech samples even with the same emotion, which is a known strength of diffusion models [beatgan].
In short, the main advantages of EmoDiff are:

We define the emotion intensity as the weight for classifier guidance when using soft labels. This achieves precise intensity control in terms of classifier probability, without any extra optimization stage. It thus enables us to generate speech with any specified emotion intensity effectively.

It does no harm to the synthesized speech. The generated samples have good quality and naturalness.

It also generates diverse samples even within the same emotion.
2 Diffusion Models with Classifier Guidance
2.1 Denoising Diffusion Models and TTS Applications
Denoising diffusion probabilistic models [DDPM, songsde] have proven successful in many generative tasks. In the score-based interpretation [songscorematching, songsde], diffusion models construct a forward stochastic differential equation (SDE) to transform the data distribution into a known terminal distribution, and use a corresponding reverse-time SDE to generate realistic samples starting from noise. Thus, the reverse process is also called the "denoising" process. Neural networks are then trained to estimate the score function $\nabla_{X_t}\log p_t(X_t)$ for any $X_t$ on the SDE trajectory, with score-matching objectives [songscorematching, songsde]. In applications, diffusion models bypass the training instability and mode collapse problems of GANs, and outperform previous methods on sample quality and diversity [beatgan].
Denoising diffusion models have also been used in TTS [difftts, gradtts, diffsinger, fastdiff, lam2022bddm] and vocoding [wavegrad, diffwave] tasks, with remarkable results. In this paper, we build EmoDiff on the design of GradTTS [gradtts]. Let $X_0$ denote a frame of mel-spectrogram; GradTTS constructs a forward SDE:
$\mathrm{d}X_t = \frac{1}{2}\Sigma^{-1}(\mu - X_t)\beta_t\,\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}W_t, \quad t \in [0,1]$   (1)
where $W_t$ is a standard Brownian motion and $t \in [0,1]$ is the SDE time index. $\beta_t$ is referred to as the noise schedule, such that $\int_0^t \beta_s\,\mathrm{d}s$ is increasing and $e^{-\int_0^1 \beta_s\,\mathrm{d}s} \approx 0$. Then we have $X_1 \sim \mathcal{N}(\mu, \Sigma)$ approximately. This SDE also induces the conditional distribution $p(X_t \mid X_0) = \mathcal{N}(\rho(X_0, \Sigma, \mu, t), \lambda(\Sigma, t))$, where both $\rho$ and $\lambda$ have closed forms. Thus we can directly sample $X_t$ from $X_0$. In practice, we set $\Sigma$ to the identity matrix, and $\lambda(\Sigma, t)$ therefore becomes $\lambda_t I$, where $\lambda_t$ is a scalar with a known closed form. Meanwhile, we condition the terminal distribution on text, i.e. let $X_1 \sim \mathcal{N}(\mu, I)$, where $\mu$ is the aligned phoneme representation of that frame.
The SDE of Eq.(1) has a reverse-time counterpart:
$\mathrm{d}X_t = \left(\frac{1}{2}\Sigma^{-1}(\mu - X_t) - \nabla_{X_t}\log p_t(X_t)\right)\beta_t\,\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}\widetilde{W}_t$   (2)
where $\nabla_{X_t}\log p_t(X_t)$ is the score function to be estimated, and $\widetilde{W}_t$ is a reverse-time Brownian motion. It shares the trajectory of distributions with the forward SDE in Eq.(1). So, solving it from $t=1$ to $t=0$, we end up with a realistic sample $X_0$. A neural network $s_\theta(X_t, t)$ is trained to estimate the score function with the following score-matching [songscorematching] objective:
$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{X_0, t}\,\lambda_t\,\mathbb{E}_{X_t \mid X_0}\left\|s_\theta(X_t, t) - \nabla_{X_t}\log p(X_t \mid X_0)\right\|_2^2$   (3)
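To make Eqs.(1)-(3) concrete, the sketch below samples $X_t$ from the closed-form conditional $p(X_t \mid X_0)$ with $\Sigma = I$ and computes the score-matching target of Eq.(3). The linear noise schedule and its values are illustrative assumptions, not the exact hyperparameters of GradTTS or EmoDiff.

```python
import numpy as np

def int_beta(t, beta0=0.05, beta1=20.0):
    # \int_0^t beta_s ds for an assumed linear noise schedule
    return beta0 * t + 0.5 * (beta1 - beta0) * t ** 2

def sample_xt(x0, mu, t, rng):
    # Closed-form p(X_t | X_0) = N(rho_t, lambda_t I) when Sigma = I
    a = np.exp(-0.5 * int_beta(t))
    rho = a * x0 + (1.0 - a) * mu        # mean drifts from X_0 toward mu
    lam = 1.0 - np.exp(-int_beta(t))     # scalar variance lambda_t
    xt = rho + np.sqrt(lam) * rng.standard_normal(x0.shape)
    return xt, rho, lam

def score_target(xt, rho, lam):
    # Ground-truth score grad_x log N(x; rho, lam I): the target in Eq.(3)
    return -(xt - rho) / lam

def diffusion_loss(s_theta, xt, t, rho, lam):
    # lambda_t-weighted denoising score-matching loss for one (X_0, t) pair
    return lam * np.mean((s_theta(xt, t) - score_target(xt, rho, lam)) ** 2)
```

A perfect score estimator attains zero loss here, which is a quick sanity check for an implementation.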
2.2 Conditional Sampling Based on Classifier Guidance
Denoising diffusion models provide a new way of modeling the conditional probability $p(X_t \mid c)$, where $c$ is a class label. Suppose we have an unconditional generative model $p(X_t)$ and a classifier $p(c \mid X_t)$. By Bayes' formula, we have
$\nabla_{X_t}\log p(X_t \mid c) = \nabla_{X_t}\log p(X_t) + \nabla_{X_t}\log p(c \mid X_t)$   (4)
In the diffusion framework, to sample from the conditional distribution $p(X_0 \mid c)$, we need to estimate the score function $\nabla_{X_t}\log p(X_t \mid c)$. By Eq.(4), we only need to add the gradient from a classifier to the unconditional model. This conditional sampling method is named classifier guidance [beatgan, liu2019more], and is also used in unsupervised TTS [guidedtts].
In practice, classifier gradients are often scaled [beatgan, guidedtts] to control the strength of guidance. Instead of the original $\nabla_{X_t}\log p(c \mid X_t)$ in Eq.(4), we use $s\nabla_{X_t}\log p(c \mid X_t)$, where $s$ is called the guidance level. A larger $s$ results in highly class-correlated samples, while a smaller one encourages sample variability [beatgan].
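As an illustration of the scaled guidance term, consider a toy linear-softmax classifier in place of the neural classifier (the stand-in model and all names below are assumptions for the sketch): $\nabla_{X_t}\log p(c \mid X_t)$ then has a closed form, and the guided score simply adds it, scaled by $s$, to the unconditional score.

```python
import numpy as np

def log_p_class(x, W, c):
    # log p(c | x) under a toy linear-softmax classifier with weights W
    logits = W @ x
    m = logits.max()
    return logits[c] - m - np.log(np.sum(np.exp(logits - m)))

def classifier_grad(x, W, c):
    # grad_x log p(c | x) = W[c] - W^T softmax(W x)
    logits = W @ x
    p = np.exp(logits - logits.max())
    p = p / p.sum()
    return W[c] - W.T @ p

def guided_score(uncond_score, x, W, c, s=1.0):
    # Classifier guidance with guidance level s (cf. Eq.(4))
    return uncond_score + s * classifier_grad(x, W, c)
```

With $s=1$ this recovers Eq.(4); larger $s$ simply amplifies the pull toward class $c$.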
Different from ordinary classifiers, the input to the classifier used here is any $X_t$ along the trajectory of the SDE in Eq.(1), instead of the clean $X_0$ only. The time index $t$ can be any value in $[0,1]$. Thus, the classifier can also be denoted as $p(c \mid X_t, t)$.
3 EmoDiff
3.1 Unconditional Acoustic Model and Classifier Training
The training of EmoDiff mainly includes the training of the unconditional acoustic model and the emotion classifier. We first train a diffusion-based acoustic model on emotional data, but do not provide it with emotion conditions. This is referred to as "unconditional acoustic model training", as in Figure 1(a). This model is based on GradTTS [gradtts], except that we provide an explicit duration sequence from forced aligners to ease duration modeling. In this stage, the training objective is $\mathcal{L} = \mathcal{L}_{\mathrm{dur}} + \mathcal{L}_{\mathrm{diff}}$, where $\mathcal{L}_{\mathrm{dur}}$ is the loss on logarithmic durations and $\mathcal{L}_{\mathrm{diff}}$ is the diffusion loss of Eq.(3). In practice, following GradTTS, we also adopt a prior loss to encourage convergence. For notational simplicity, we use $\mathcal{L}_{\mathrm{diff}}$ to denote the diffusion and prior losses together in Figure 1(a).
After training, the acoustic model can estimate the score function of a noisy mel-spectrogram $X_t$ given the input phoneme sequence $y$, i.e. $\nabla_{X_t}\log p(X_t \mid y)$, which is unconditional on emotion labels. Following Section 2.2, we then need an emotion classifier to distinguish emotion categories from noisy mel-spectrograms $X_t$. Meanwhile, as we always have a text condition $y$, the classifier is formulated as $p(e \mid X_t, t, y)$. As shown in Figure 1(b), the input to the classifier consists of three components: the SDE timestamp $t$, the noisy mel-spectrogram $X_t$ and the phoneme-dependent Gaussian mean $\mu$. This classifier is trained with the standard cross-entropy loss $\mathcal{L}_{\mathrm{CE}}$. Note that we freeze the acoustic model parameters in this stage, and only update the weights of the emotion classifier.
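A single classifier update can be sketched as follows, with a toy linear classifier standing in for the CNN over $(t, X_t, \mu)$ (the flattened feature vector and learning rate are assumptions); only the classifier weights are updated, mirroring the frozen acoustic model.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classifier_ce_step(W, feats, label, lr=0.1):
    # One cross-entropy SGD step for a toy linear emotion classifier.
    # `feats` stands in for the classifier input (t, noisy mel X_t, mu);
    # only the classifier weights W are updated -- the acoustic model is frozen.
    p = softmax(W @ feats)
    onehot = np.zeros_like(p)
    onehot[label] = 1.0
    grad_W = np.outer(p - onehot, feats)   # gradient of CE w.r.t. W
    return W - lr * grad_W
```

Repeating this step drives the predicted probability of the labeled emotion upward, which is all the guidance in Section 3.2 needs from the classifier.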
As we always need the text condition throughout the paper, we omit it and denote this classifier as $p(e \mid X_t, t)$ in later sections to simplify the notation, if no ambiguity is caused.
3.2 Intensity Controllable Sampling with Soft-Label Guidance
In this section, we extend classifier guidance to soft-label guidance, which can control an emotion weighted by intensity. Suppose the number of basic emotions is $M$, and every basic emotion has a one-hot vector form $e_d \in \{0,1\}^M$. For each $e_d$, only the $d$-th dimension is 1. We specially use $e_N$ to denote Neutral. For emotion $d$ weighted with intensity $\gamma \in [0,1]$, we define its soft label to be $\gamma e_d + (1-\gamma)e_N$. Then the gradient of the log-probability of the classifier w.r.t. $X_t$ can be defined as
$\nabla_{X_t}\log p(\gamma e_d + (1-\gamma)e_N \mid X_t) \triangleq \gamma\nabla_{X_t}\log p(e_d \mid X_t) + (1-\gamma)\nabla_{X_t}\log p(e_N \mid X_t)$   (5)
The intuition of this definition is that intensity $\gamma$ stands for the contribution of emotion $d$ on the sampling trajectory of $X_t$. A larger $\gamma$ means we sample along a trajectory with a larger "force" towards emotion $d$, and vice versa. Thus we can extend Eq.(4) to
$\nabla_{X_t}\log p(X_t \mid \gamma e_d + (1-\gamma)e_N) = \nabla_{X_t}\log p(X_t) + \gamma\nabla_{X_t}\log p(e_d \mid X_t) + (1-\gamma)\nabla_{X_t}\log p(e_N \mid X_t)$   (6)
When the intensity $\gamma$ is 1 (100% emotion $d$) or 0 (100% Neutral), the above operation reduces to the standard classifier guidance form of Eq.(4). Hence we can use the soft-label guidance of Eq.(5) in the sampling process, and generate a realistic sample with the specified emotion $d$ at intensity $\gamma$.
Figure 1(c) illustrates the intensity-controllable sampling process. After feeding text to the acoustic model and obtaining the phoneme-dependent sequence $\mu$, we sample $X_1 \sim \mathcal{N}(\mu, I)$ and simulate the reverse-time SDE from $t=1$ to $t=0$ through a numerical simulator. In each simulator update, we feed the classifier with the current $X_t$ and get the output probabilities $p(e \mid X_t)$. Eq.(6) is then used to calculate the guidance term. Similar to Section 2.2, we also scale the guidance term with the guidance level $s$. At the end, we obtain $X_0$, which is not only intelligible given the input text, but also expresses the target emotion $d$ with intensity $\gamma$. This leads to precise intensity control that correlates well with classifier probability.
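The sampling loop can be sketched as below, with an Euler-Maruyama discretization of Eq.(2) plus the guidance of Eqs.(5)-(6). The step form, schedule values and function names are assumptions following the GradTTS-style SDE, not the paper's exact solver.

```python
import numpy as np

def int_beta(t, beta0=0.05, beta1=20.0):
    # \int_0^t beta_s ds for an assumed linear noise schedule
    return beta0 * t + 0.5 * (beta1 - beta0) * t ** 2

def reverse_sample(mu, score_fn, grad_e_fns, gamma, d, s=1.0,
                   n_steps=400, seed=0, beta0=0.05, beta1=20.0):
    # Euler-Maruyama simulation of the reverse-time SDE (Eq.(2)) with
    # soft-label guidance (Eqs.(5)-(6)).  grad_e_fns[k](x, t) approximates
    # grad_x log p(e_k | x); index 0 plays the role of Neutral.
    rng = np.random.default_rng(seed)
    h = 1.0 / n_steps
    x = mu + rng.standard_normal(mu.shape)        # X_1 ~ N(mu, I)
    for i in range(n_steps, 0, -1):
        t = i * h
        beta = beta0 + (beta1 - beta0) * t
        # Eq.(5): intensity-weighted mix of emotion d and Neutral gradients
        guide = gamma * grad_e_fns[d](x, t) + (1.0 - gamma) * grad_e_fns[0](x, t)
        total_score = score_fn(x, t) + s * guide  # Eq.(6), scaled by s
        drift = 0.5 * (mu - x) - total_score      # reverse drift of Eq.(2)
        x = x - h * beta * drift + np.sqrt(beta * h) * rng.standard_normal(x.shape)
    return x
```

With the guidance terms set to zero, the loop reduces to plain unconditional reverse-SDE sampling.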
Generally, in addition to intensity control, our soft-label guidance is capable of more complicated control over mixed emotions [MixedEmotion]. Denote a combination of all basic emotions as $\sum_d \gamma_d e_d$ where $\sum_d \gamma_d = 1$; Eq.(5) can then be generalized to
$\nabla_{X_t}\log p\left(\sum_d \gamma_d e_d \mid X_t\right) \triangleq \sum_d \gamma_d \nabla_{X_t}\log p(e_d \mid X_t)$   (7)
Then Eq.(6) can also be expressed in this generalized form. This extension can also be interpreted from a probabilistic view. As the combination weights $\gamma = [\gamma_1, \ldots, \gamma_M]$ can be viewed as a categorical distribution over the basic emotions, Eq.(7) is equivalent to
$\sum_d \gamma_d \nabla_{X_t}\log p(e_d \mid X_t) = \nabla_{X_t}\sum_d \gamma_d \log p(e_d \mid X_t)$   (8)
$= -\nabla_{X_t}\,\mathrm{CE}(\gamma, p(e \mid X_t))$   (9)
where $\mathrm{CE}(\cdot,\cdot)$ is the cross-entropy function. Eq.(9) implies that we are actually decreasing the cross-entropy between the target emotion distribution $\gamma$ and the classifier output $p(e \mid X_t)$ when sampling along the gradient. In other words, the gradient of the cross-entropy w.r.t. $X_t$ can guide the sampling process. Hence, this soft-label guidance technique can generally be used to control any complex emotion expressed as a weighted combination of several basic emotions.
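The equivalence in Eqs.(7)-(9) can be checked numerically with a toy linear-softmax classifier (an assumed stand-in for illustration): the $\gamma$-weighted sum of classifier gradients coincides with the negative gradient of the cross-entropy.

```python
import numpy as np

def log_probs(x, W):
    # log p(e | x) for a toy linear-softmax classifier (an assumed stand-in)
    logits = W @ x
    m = logits.max()
    return logits - m - np.log(np.sum(np.exp(logits - m)))

def soft_label_grad(x, W, gamma):
    # RHS of Eq.(7): sum_d gamma_d * grad_x log p(e_d | x) = W^T (gamma - p),
    # using sum_d gamma_d = 1
    p = np.exp(log_probs(x, W))
    return W.T @ (gamma - p)

def cross_entropy(x, W, gamma):
    # CE(gamma, p(e | x)) = -sum_d gamma_d log p(e_d | x), as in Eq.(9)
    return -np.dot(gamma, log_probs(x, W))
```

Following `soft_label_grad` is thus exactly gradient descent on the cross-entropy of Eq.(9).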
4 Experiments and Results
4.1 Experimental Setup
We used the English part of the Emotional Speech Dataset (ESD) [zhoukunESD] for all experiments. It has 10 speakers, each with 4 emotional categories (Angry, Happy, Sad, Surprise) together with a Neutral category. There are 350 parallel utterances per speaker and emotion category, amounting to about 1.2 hours per speaker. Mel-spectrograms and forced alignments were extracted by Kaldi [povey2011kaldi] with a 12.5 ms frame shift and 50 ms frame length, followed by cepstral normalization. Audio samples for these experiments are available at https://cantabilekwok.github.io/EmoDiffintensityctrl/.
In this paper, we only consider the single-speaker emotional TTS problem. Throughout the following sections, we trained an unconditional GradTTS acoustic model on all 10 English speakers for reasonable data coverage, and a classifier on a female speaker (ID: 0015) only. The unconditional GradTTS model was trained with the Adam optimizer for 11M steps. We used an exponential moving average of model weights, as it is reported to improve diffusion models' performance [songsde]. The classifier is a 4-layer 1D CNN with BatchNorm and Dropout in each block. In the inference stage, the guidance level $s$ was fixed to 100.
We used HifiGAN [hifigan], trained on all the English speakers, as the vocoder for all the following experiments.
4.2 Emotional TTS Quality
Model                         MOS          MCD
GT                            4.73±0.09    –
GT (voc.)                     4.69±0.10    2.96
MixedEmotion [MixedEmotion]   3.43±0.12    6.62
GradTTS w/ emo label          4.16±0.10    5.75
EmoDiff (ours)                4.13±0.10    5.98

Table 1: MOS and MCD comparisons. MOS is presented with 95% confidence intervals. Note that "GradTTS w/ emo label" cannot control emotion intensity.
We first measure speech quality, which covers both audio quality and speech naturalness. We compared the proposed EmoDiff with the following systems:

GT and GT (voc.): ground-truth recordings and analysis-synthesis results (vocoded from GT mel-spectrograms).

MixedEmotion (we used the official implementation at https://github.com/KunZhou9646/Mixed_Emotions): proposed in [MixedEmotion]. It is an autoregressive model that relies on relative attributes rank to precalculate intensity values for training. It closely resembles Emovox [Emovox] for intensity-controllable emotion conversion.

GradTTS w/ emo label: a conditional GradTTS model with hard emotion labels as input. It therefore has no intensity controllability, but should have good sample quality as a well-established acoustic model.
Note that in this experiment, samples from EmoDiff and MixedEmotion were generated with intensity weight 1.0, so that they are directly comparable with the others.
Table 1 presents the mean opinion score (MOS) and mel-cepstral distortion (MCD) evaluations. The vocoder causes little deterioration in sample quality, and our EmoDiff outperforms the MixedEmotion baseline by a large margin. Meanwhile, EmoDiff and the hard-conditioned GradTTS have decent and very close MOS results, and their MCD results differ only slightly. This means EmoDiff does not sacrifice sample quality for intensity controllability, unlike MixedEmotion.
4.3 Controllability of Emotion Intensity
To evaluate the controllability of emotion intensity, we used our trained classifier to classify the synthesized samples under each controlled intensity. The input to the classifier was now the clean sample $X_0$, i.e. $t = 0$. The average classification probability on the target emotion class was used as the evaluation metric; larger values indicate larger discriminative confidence. For both EmoDiff and MixedEmotion on each emotion, we varied the intensity from 0 to 1. An intensity of 0 is equivalent to synthesizing 100% Neutral samples, and a larger intensity should result in a larger probability.
Figure 2 presents the results. To demonstrate the capability of the classifier, we plotted the classification probability on ground-truth data. To show the performance of the hard-conditioned GradTTS model, we also plotted the probability on its synthesized samples; as it has no intensity controllability, we only plotted the values at intensity 1. Standard deviations are presented as error bars for each experiment.
The figure shows that the trained classifier performs reasonably on ground-truth data; as a remark, its classification accuracy on the validation set is 93.1%. Samples from GradTTS w/ emo label have somewhat lower classification probabilities. Most importantly, the proposed EmoDiff always covers a larger range from intensity 0 to 1 than the baseline. The error range of EmoDiff is also consistently lower than the baseline's, meaning that our control is more stable. This proves the effectiveness of our proposed soft-label guidance technique. We also notice that EmoDiff sometimes reaches a higher classification probability than the hard-conditioned GradTTS at intensity 1. This is reasonable, as conditioning on emotion labels during training is not guaranteed to achieve better class correlation than classifier guidance with a strong classifier and a sufficient guidance level.
4.4 Diversity of Emotional Samples
Besides generating high-quality and intensity-controllable emotional samples, EmoDiff also has good sample diversity even within the same emotion, benefiting from the powerful generative ability of diffusion models. To evaluate the diversity of emotional samples, we conducted a subjective preference test for each emotion between EmoDiff and MixedEmotion. Listeners were asked to choose the more diverse one, or "Cannot Decide". Note that the test was done for each emotion at intensity weight 1.0.
Figure 3 shows the preference results. For each of the three emotion categories Angry, Happy and Surprise, EmoDiff holds a large advantage in diversity preference. Only for Sad does EmoDiff outperform the baseline by a small margin. This is mainly because MixedEmotion is autoregressive, and we found that its variation in duration contributes much to perceived diversity, especially for Sad samples.
5 Conclusion
In this paper, we investigated the intensity control problem in emotional TTS. We defined an emotion with intensity as the weighted sum of a specific emotion and Neutral, with the weight being the intensity value. Under this modeling, we extended the classifier guidance technique to soft-label guidance, which enables us to directly control any emotion with intensity instead of a one-hot class label. With this technique, the proposed EmoDiff achieves simple but effective control of emotion intensity, using an unconditional acoustic model and an emotion classifier. Subjective and objective evaluations demonstrated that EmoDiff outperforms the baseline in terms of TTS quality, intensity controllability and sample diversity. Moreover, the proposed soft-label guidance can generally be applied to control more complicated natural emotions, which we leave as future work.
6 Acknowledgements
This study is supported by Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102) and Jiangsu Technology Project (No. BE20220592).