Toward Dimensional Emotion Detection from Categorical Emotion Annotations

11/06/2019 ∙ by Sungjoon Park, et al.

We propose a framework in which a model trained on a corpus annotated with coarse-grained categorical emotions predicts fine-grained dimensional emotions (valence-arousal-dominance, VAD). We train the model by minimizing the Earth Mover's Distance (EMD) between the predicted VAD score distributions and the categorical emotion distributions sorted by VAD score, which serve as a proxy for the target VAD score distributions. With our model, we can simultaneously classify a given sentence into categorical emotions and predict its VAD scores. We fine-tune pre-trained BERT-Large on the SemEval dataset (11 categorical emotions) and evaluate on EmoBank (VAD dimensional emotions) to show that our approach reaches performance comparable to that of state-of-the-art classifiers on the categorical emotion classification task and yields significant positive correlations with ground-truth VAD scores. Moreover, if training continues with supervision from VAD labels, the model outperforms state-of-the-art VAD regression models. We further present examples showing that our model can annotate a given text with suitable emotional words even when those words were not seen as categorical labels during training.




1 Introduction

Figure 1: Overview of our approach. Our model predicts VAD distributions conditioned on an input sentence through supervised training with categorical emotion annotations (sub-fig. a). Specifically, one-hot categorical labels are sorted by their V, A, and D scores, respectively, to serve as (sparse) label VAD distributions during training (sub-fig. b). At inference, a categorical emotion class can be predicted by picking the one with the maximum probability under the product of the distributions (sub-fig. c), and continuous VAD scores can be predicted by computing the expectation of each distribution (sub-fig. d).

Humans can feel and express complex emotions beyond the basic emotions Ekman (1992); Plutchik (2001) on a daily basis. To represent these various emotions systematically, a dimensional emotion model such as the Valence-Arousal-Dominance (VAD) model Russell and Mehrabian (1977) is commonly used. This model maps emotional states to an orthogonal VAD space, so that various emotions can be projected into the space with measurable distances from one another. Since dimensional models represent an emotion as a real-valued vector in the space, they are more likely to account for subtle emotional expressions than categorical models, which employ a finite number of basic emotions. With dimensional VAD models, capturing fine-grained emotions could benefit clinical natural language processing (NLP) research Desmet and Hoste (2013); Sahana and Girish (2015), emotion regulation in psychotherapy research Torre and Lieberman (2018), and other work in computational social science dealing with subtle emotion recognition Buechel and Hahn (2016).

Therefore, building a dimensional emotion detection model from an annotated corpus would be highly useful. However, such annotated resources are surprisingly scarce. Few corpora have full VAD annotations Buechel and Hahn (2017), and others have only VA annotations Preoţiuc-Pietro et al. (2016); Yu et al. (2016). One could build such a resource by labeling a corpus with best-worst scaling Kiritchenko and Mohammad (2017). Instead, we examine a novel way to predict dimensional emotion (VAD) scores from relatively common resources: corpora annotated with coarse-grained basic categorical emotions Scherer and Wallbott (1994); Alm et al. (2005); Aman and Szpakowicz (2007); Mohammad (2012); Sintsovaa and Musata (2013); Li et al. (2017); Schuff et al. (2017); Shahraki and Zaiane (2017); Mohammad et al. (2018).

In this paper, we propose a framework to learn dimensional VAD scores from a corpus with categorical emotion labels. We demonstrate our idea by fine-tuning the pre-trained language model BERT Devlin et al. (2018) with our approach. In detail, our model learns conditional VAD distributions through supervision from categorical emotion labels and uses them to compute both VAD scores and categorical emotion labels for a given sentence.

In summary, our contributions are as follows:

  • We propose a framework that enables learning to predict VAD scores from a corpus with categorical emotion annotations.

  • Our model trained only with categorical emotion labels can predict VAD scores that show significant positive correlations with the corresponding ground-truth VAD scores.

  • Our model can be fine-tuned once more with supervision from VAD scores to outperform state-of-the-art dimensional emotion detection models.

2 Approach

Here we describe how we predict VAD scores for a given text from a model trained on a dataset with categorical emotion annotations.

Overview. The key idea is to train an emotion detection model to predict each of the VAD distributions conditioned on a given text, rather than to predict categorical emotion labels directly as conventional emotion classifiers do. This is possible even when only categorical emotion labels are available, because those labels themselves have VAD scores. One can therefore sort the labels along each VAD dimension to obtain (sparse) ground-truth conditional VAD distributions for a given text (Fig. 1a, 1b). A model can then be trained to predict VAD distributions by minimizing the distance between the predicted and ground-truth distributions. This allows the model not only to predict VAD scores for regression (expectations of the predicted distributions, Fig. 1d) but also to pick an emotion label from a given set of categorical labels for classification (argmax over emotion labels, Fig. 1c).

Model Architecture. (Fig. 1a) Formally, an emotion detection model is $P(e \mid X)$, where $e$ is an emotion drawn from a set of pre-defined categorical emotions $E$ and $X$ is a sequence of symbols representing an input text. Usually, $e$ is represented as a one-hot vector in emotion classification tasks.

Unlike classification models that directly learn $P(e \mid X)$, the probability of an emotion label $e$ given an input text $X$, we aim to learn the distributions of V, A, and D from pairs of input text and categorical labels. To this end, we map each categorical emotion label to the three-dimensional VAD space, $e \mapsto (v, a, d)$, using the NRC-VAD Lexicon Mohammad (2018). For example, the emotion label “joy” is mapped to (0.980, 0.824, 0.794) and “sad” to (0.225, 0.333, 0.149) in the VAD space. Using these coordinates, our model predicts the following distribution:

$$P(v, a, d \mid X) \tag{1}$$

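Furthermore, since the dimensions of the VAD space are nearly independent Russell and Mehrabian (1977), we assume that they are mutually independent. The joint distribution over valence $v$, arousal $a$, and dominance $d$ given an input text $X$ can then be decomposed into a product of three conditional distributions:

$$P(v, a, d \mid X) = P(v \mid X)\, P(a \mid X)\, P(d \mid X) \tag{2}$$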

For each decomposed conditional distribution, any trainable function with sufficient capacity to capture linguistic patterns in the input could be used. As a demonstration, we use the pre-trained bidirectional language model BERT Devlin et al. (2018), which shows state-of-the-art performance on natural language understanding tasks when fine-tuned on task-specific datasets. For each conditional distribution, we stack a softmax or sigmoid activation layer on the hidden state corresponding to the [CLS] token in BERT.

Model Training. (Fig. 1b) To train our model, we need target conditionals for each of V, A, and D derived from the categorical emotion labels. We simply sort the categorical emotions in the label set $E$ by their V, A, and D scores, respectively, based on the mapped VAD coordinates. For example, if the categorical labels contain four emotions (joy, sad, happy, anger) with corresponding valence scores (0.980, 0.225, 1.000, 0.167) in NRC-VAD Mohammad (2018), we sort the labels into the order (anger, sad, joy, happy) and rearrange the corresponding one-hot labels accordingly to obtain the target conditional for valence. In other words, by rearranging label positions in ascending order of valence score, the sorted one-hot labels can be treated as a proxy for the target conditional. We sort the labels by A and D to obtain the other conditionals in the same way. Note that these conditionals are sparse because we have only $|E|$ points for each VAD dimension.
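The sorting step can be illustrated with a minimal NumPy sketch. It uses only the four valence scores quoted in the example above; the helper name `sorted_target` is hypothetical, and this is an illustration of the idea rather than the authors' code.

```python
import numpy as np

# Valence scores for four labels, as quoted in the text from NRC-VAD.
labels = ["joy", "sad", "happy", "anger"]
valence = np.array([0.980, 0.225, 1.000, 0.167])


def sorted_target(one_hot, scores):
    """Reorder a label vector so that positions follow ascending scores."""
    order = np.argsort(scores)  # label indices from lowest to highest score
    return np.asarray(one_hot)[order], order


# A sentence annotated with "joy" (index 0 in the original label order).
one_hot = np.array([1, 0, 0, 0])
target_v, order = sorted_target(one_hot, valence)

sorted_labels = [labels[i] for i in order]
print(sorted_labels)  # ['anger', 'sad', 'joy', 'happy']
print(target_v)       # [0 0 1 0]
```

The same helper would be applied with arousal and dominance scores to build the other two targets, so one annotated label yields three differently ordered sparse vectors.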

Next, we minimize the distance between the true and predicted conditionals. Since the labels are sorted, the classes are ordered, and this order should be taken into account during optimization. We therefore minimize the squared Earth Mover’s Distance (EMD) loss Hou et al. (2017) between the true and predicted conditionals. The EMD loss is as follows:

$$\mathrm{EMD}^2(p, \hat{p}) = \sum_{i=1}^{|E|} \left( \mathrm{CDF}_i(p) - \mathrm{CDF}_i(\hat{p}) \right)^2 \tag{3}$$

where $p$ is a true conditional, $\hat{p}$ is a predicted conditional, and the sum runs over the $|E|$ sorted emotion classes. This loss is designed to account for the distance between classes in an ordered classification problem, penalizing a model more heavily when it chooses a class far from the correct one. It computes the squared difference between the cumulative distribution function of $\hat{p}$ and that of the corresponding $p$.
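A minimal NumPy sketch of a squared EMD loss of this kind is given below. It assumes the loss is the sum, over sorted class positions, of squared differences between the two cumulative distributions, as the paragraph describes; the function name `squared_emd` and the toy vectors are illustrative assumptions.

```python
import numpy as np


def squared_emd(p_true, p_pred):
    """Sum of squared differences between the two cumulative distributions."""
    cdf_true = np.cumsum(p_true)
    cdf_pred = np.cumsum(p_pred)
    return float(np.sum((cdf_true - cdf_pred) ** 2))


target = np.array([0.0, 0.0, 1.0, 0.0])  # true class at sorted position 2
near = np.array([0.0, 0.0, 0.0, 1.0])    # all mass one position away
far = np.array([1.0, 0.0, 0.0, 0.0])     # all mass two positions away

loss_near = squared_emd(target, near)
loss_far = squared_emd(target, far)
print(loss_near, loss_far)  # 1.0 2.0
```

Both wrong predictions would receive the same penalty under a loss that ignores class order, whereas here the prediction that lands farther from the true position costs more.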

Note that Eq. 3 assumes that the true conditional $p$ and the predicted conditional $\hat{p}$ have the same total probability mass. In the single-label case, i.e., when each text has exactly one annotated categorical emotion label, this is satisfied because $p$ is a one-hot vector and $\hat{p}$ is the output of a softmax layer, which always sums to one. In the multi-label case, however, the assumption is violated because a sigmoid activation layer is generally used to represent the probability of each class independently. We therefore slightly modify Eq. 3 to satisfy the assumption, defining the interclass EMD loss as follows:

$$\mathrm{EMD}^2_{\mathrm{inter}}(p, \hat{p}) = \sum_{i=1}^{|E|} \left( \mathrm{CDF}_i(\bar{p}) - \mathrm{CDF}_i(\bar{\hat{p}}) \right)^2 \tag{4}$$

where $\bar{p}$ and $\bar{\hat{p}}$ are $p$ and $\hat{p}$ normalized by dividing each by its sum of probabilities, and the sum runs over the $|E|$ sorted emotion classes. We also introduce an intraclass EMD loss:


where $p_c$ is the true and $\hat{p}_c$ the predicted probability for class $c$. Finally, we use the following EMD loss for the multi-label case:


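We then minimize the sum of the three squared EMD losses between the target and predicted distributions, one for each VAD dimension:

$$\mathcal{L} = \mathrm{EMD}^2(p_v, \hat{p}_v) + \mathrm{EMD}^2(p_a, \hat{p}_a) + \mathrm{EMD}^2(p_d, \hat{p}_d) \tag{7}$$

where $p_v$, $p_a$, $p_d$ denote the target and $\hat{p}_v$, $\hat{p}_a$, $\hat{p}_d$ the predicted conditional distributions.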
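For the multi-label case, the normalization step and the per-dimension sum described above might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: only the normalized (interclass) term is shown, the intraclass term is omitted, and the names `interclass_emd`, `vad_loss`, and the `eps` guard against empty label sets are hypothetical.

```python
import numpy as np


def squared_emd(p_true, p_pred):
    """Sum of squared differences between the two cumulative distributions."""
    return float(np.sum((np.cumsum(p_true) - np.cumsum(p_pred)) ** 2))


def interclass_emd(p_true, p_pred, eps=1e-8):
    """Normalize both vectors to (approximately) unit mass, then compare CDFs."""
    p_true = np.asarray(p_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return squared_emd(p_true / (p_true.sum() + eps),
                       p_pred / (p_pred.sum() + eps))


def vad_loss(targets, preds):
    """Sum the per-dimension losses over the V, A and D distributions."""
    return sum(interclass_emd(targets[k], preds[k]) for k in ("v", "a", "d"))


# Multi-hot target (two labels present) and made-up sigmoid outputs.
multi_hot = np.array([0.0, 1.0, 1.0, 0.0])
sigmoid_out = np.array([0.1, 0.8, 0.6, 0.1])

loss_single_dim = interclass_emd(multi_hot, sigmoid_out)
# For brevity the same vectors are reused for all three dimensions here;
# in practice each dimension has its own label ordering.
loss_total = vad_loss({k: multi_hot for k in "vad"},
                      {k: sigmoid_out for k in "vad"})
```

Because both vectors are rescaled before their cumulative sums are compared, multiplying all sigmoid outputs by a constant leaves this term essentially unchanged, which is the property the equal-mass assumption requires.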

Predicting Categorical Emotion Labels. (Fig. 1c) Based on the model’s predicted VAD distributions, we can pick an emotion label from a given set $E$, as conventional emotion classifiers do. By computing the product of the predicted $P(v \mid X)$, $P(a \mid X)$, and $P(d \mid X)$ for an input text $X$, we obtain the predicted $P(v, a, d \mid X)$, assuming conditional independence. We can then pick an emotion label as follows:

$$\hat{e} = \arg\max_{e \in E} P(v_e \mid X)\, P(a_e \mid X)\, P(d_e \mid X) \tag{8}$$

where $(v_e, a_e, d_e)$ are the VAD coordinates of label $e$. Since only $|E|$ emotion labels are given, we compare their joint probabilities and pick the label with the maximum probability (single-label case, Eq. 8), or all labels whose probability exceeds a certain threshold (multi-label case). The threshold is a hyperparameter of the model, set to 0.125 (=


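Predicting Continuous VAD Scores. (Fig. 1d) We can further compute the expectations of the predicted conditionals $P(v \mid X)$, $P(a \mid X)$, and $P(d \mid X)$ to predict continuous VAD scores:

$$\mathbb{E}[v \mid X] = \sum_{e \in E} v_e\, P(v_e \mid X), \quad \mathbb{E}[a \mid X] = \sum_{e \in E} a_e\, P(a_e \mid X), \quad \mathbb{E}[d \mid X] = \sum_{e \in E} d_e\, P(d_e \mid X) \tag{9}$$

where $(v_e, a_e, d_e)$ are the VAD coordinates of label $e$ in the label set $E$. Once again, we use the VAD scores in Mohammad (2018) for each dimension when computing the expectations. This allows us to predict continuous VAD scores from a model trained on categorical emotion annotations.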
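The two inference modes can be sketched together in a few lines of NumPy. The (V, A, D) coordinates for "joy" and "sad" and the 0.125 threshold are taken from the text; the probability vectors are made-up model outputs, and the variable names are illustrative assumptions.

```python
import numpy as np

labels = ["joy", "sad"]
# (V, A, D) coordinates quoted in the text for these two labels.
vad = np.array([[0.980, 0.824, 0.794],
                [0.225, 0.333, 0.149]])

# Hypothetical model outputs: the probability assigned to each label's
# position in the V, A and D distributions, re-indexed to the label order.
p_v = np.array([0.7, 0.3])
p_a = np.array([0.6, 0.4])
p_d = np.array([0.8, 0.2])

# Classification: product of the three conditionals (independence assumption).
joint = p_v * p_a * p_d
predicted_label = labels[int(np.argmax(joint))]                      # single-label case
multi_labels = [l for l, p in zip(labels, joint) if p > 0.125]       # multi-label case

# Regression: probability-weighted sum of label coordinates per dimension.
scores = np.array([p_v @ vad[:, 0], p_a @ vad[:, 1], p_d @ vad[:, 2]])
print(predicted_label, multi_labels, scores)
```

With sigmoid outputs the probabilities need not sum to one, so the weighted sums in the last step are not bounded by the largest label coordinate.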

3 Experiments

In this section, we describe our experimental setup. Throughout these experiments, we focus on demonstrating that our approach can effectively predict continuous emotional dimensions (VAD scores) using only categorical emotion labels.

3.1 Dataset

We use three datasets consisting of text and corresponding emotion annotations. Two of them have categorical emotion labels, and the other is a VAD-annotated corpus.

SemEval 2018 E-c (SemEval). A multi-label corpus annotated with categorical emotions, containing 10,983 tweets with labels for the presence or absence of 11 emotions Mohammad et al. (2018). We refer to this dataset as SemEval hereafter.

ISEAR. A single-label corpus annotated with categorical emotions, containing 7,666 sentences. Each sentence has exactly one label among 7 categorical emotions Scherer and Wallbott (1994).

EmoBank. Sentences paired with continuous VAD scores as labels. This corpus contains 10,062 sentences collected across 6 domains and 2 perspectives. Each sentence has three scores representing V, A, and D in the range of 1 to 5. Unless otherwise noted, we use the weighted average of VAD scores as ground-truth scores, as recommended by the EmoBank authors Buechel and Hahn (2017).

3.2 Predicting Categorical Emotion Labels.

We examine the classification performance of our approach and compare it with that of state-of-the-art emotion classification models. We use accuracy, macro F1 score, and micro F1 score as evaluation metrics.


MT-CNN. A convolutional neural network for text classification trained by multi-task learning Zhang et al. (2018). The model jointly learns classification labels and the emotion distribution of a given text. The emotion distribution represents multiple emotions in a given sentence and is computed as normalized counts of affective terms extracted with emotion lexicons. The model reaches state-of-the-art classification accuracy and F1 score on ISEAR.

NTUA-SLP. A classification model using deep self-attention layers over Bi-LSTM hidden states Baziotis et al. (2018). The model is pre-trained on general tweets and ‘SemEval 2017 task 4A’, then fine-tuned on all ‘SemEval 2018 subtasks’ in order to transfer the learned knowledge to each subtask. The model took first place in the multi-label emotion classification task on the SemEval dataset.

BERT-Large (Classification). A pre-trained bidirectional language model based on stacked Transformer layers Vaswani et al. (2017). The model shows state-of-the-art performance on various natural language understanding tasks after fine-tuning on task-specific datasets Devlin et al. (2018). We add a linear transformation layer on top of BERT, with sigmoid activation for training on the multi-label dataset (SemEval) or softmax activation for the single-label dataset (ISEAR). Like conventional text classifiers, these models are optimized by minimizing the cross-entropy loss between predicted distributions and one-hot labels.

BERT-Large (Ours, SemEval). We use BERT again and fine-tune the model with our objective functions. For the multi-label dataset (SemEval), we minimize Eq. 7 with Eq. 6 for each VAD dimension. This model can choose an emotion label from the label set $E$ by Eq. 8.

BERT-Large (Ours, ISEAR). We fine-tune another BERT with our approach on ISEAR. This model is optimized by minimizing Eq. 7 with Eq. 3 for each VAD dimension. Like the model above, it can predict an emotion label by Eq. 8.

3.3 Predicting Continuous VAD scores.

Next, we investigate the VAD score prediction performance of our approach and compare it with that of state-of-the-art VAD regression models. Since the training objectives of the models vary, we use Pearson’s correlation coefficient between a model’s VAD predictions and the ground-truth scores as the evaluation metric.
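One reason a correlation coefficient suits this comparison is that it is unaffected by the scale and offset of the predictions, so scores on a 0 to 1 lexicon scale can be compared with gold scores on a 1 to 5 scale without rescaling. The sketch below illustrates this with made-up numbers; `pearson_r` is a hypothetical helper written out for clarity.

```python
import numpy as np


def pearson_r(pred, gold):
    """Pearson's correlation coefficient between two equal-length sequences."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    pc = pred - pred.mean()
    gc = gold - gold.mean()
    return float((pc @ gc) / np.sqrt((pc @ pc) * (gc @ gc)))


# Made-up numbers for illustration only.
pred = [0.2, 0.5, 0.9, 0.4]   # predictions on a 0-1 scale
gold = [1.5, 3.0, 4.5, 2.5]   # gold scores on a 1-5 scale

r = pearson_r(pred, gold)
# A positive linear rescaling of the predictions leaves r unchanged.
r_rescaled = pearson_r([10 * p + 3 for p in pred], gold)
```

The same value can be obtained from `np.corrcoef(pred, gold)[0, 1]`.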

3.3.1 Zero-shot Predictions

We refer to the following two results as zero-shot prediction performance because these models are not trained on EmoBank, that is, they are trained without supervision from any VAD score labels. These models use the entire EmoBank as an evaluation set. We focus on these results since we aim to predict VAD scores with a model trained on a corpus annotated with categorical emotion labels.

BERT-Large (Ours, SemEval). We compute VAD score predictions by using Eq. 9 from our model trained on SemEval, which is the same model used in predicting categorical emotion labels.

BERT-Large (Ours, ISEAR). Like the model above, we also compute VAD scores from our model trained on ISEAR.

3.3.2 Predictions after Supervised Learning

Unlike the previous models, the following are trained by supervised learning on the VAD score labels in EmoBank. These results allow us to put the zero-shot prediction performance in context and to see how much the zero-shot prediction model could be improved if VAD annotations are available.

AAN. An Adversarial Attention Network for dimensional emotion regression that learns to discriminate VAD dimension scores Zhu et al. (2019). Pearson correlations between predicted and ground-truth VAD scores in EmoBank are reported. Note that the scores are reported separately for the 2 perspectives and 6 domains, so we use the highest VAD correlations among the perspectives and domains for comparison.

Ensemble. Multi-task ensemble neural networks that learn to predict VAD scores, sentiment, and their intensity simultaneously Akhtar et al. (2019). The model was recently shown to be effective on VAD regression.


SRV-SLSTM. A model that predicts VAD scores through variational autoencoders trained by semi-supervised learning, which shows state-of-the-art performance on the VAD score prediction task Wu et al. (2019). The model performs best when using 40% of the labeled EmoBank data, so we compare our model’s performance with those scores.

BERT-Large (Ours, EBSemEval). We fine-tune our BERT-Large (Ours, SemEval) model once more on the EmoBank dataset. We split EmoBank into training, validation, and test sets with a ratio of 6:2:2, then train the model and report the correlation between predicted and ground-truth VAD scores on the test set. Specifically, we remove the final linear layer with softmax or sigmoid activation used for training with categorical labels and add a new linear layer with ReLU activation for VAD score prediction. All parameters are then fine-tuned again by minimizing the mean squared error (MSE) loss between predicted and ground-truth VAD scores. With this model, we investigate the effectiveness of our approach as a parameter initialization strategy for VAD regression when VAD annotations are available.

3.4 Experimental Details.

In all experiments, we use the BERT-Large uncased model. We set the learning rate to 2e-5 with a warm-up period of 3 epochs. The batch size is 64, and we stop fine-tuning all of the layers when the validation loss is minimized. We use a single TPU for optimization, and all fine-tuning runs converged within 10 epochs, taking an hour.

4 Results

Dataset                                            EmoBank                 SemEval 2018 E-c            ISEAR
Task                                               Regression              Classification              Classification
Model                             Scheme           V (r)   A (r)   D (r)   Macro F1  Micro F1  Acc.
MT-CNN Zhang et al. (2018)        -                -       -       -       -         -         -       -       0.668
NTUA-SLP Baziotis et al. (2018)   -                -       -       -       0.528     0.701     0.588   -       -
BERT-Large (Classification, ep3)  -                -       -       -       0.534     0.697     0.572   0.704   0.700
BERT-Large (Ours, SemEval)        Zero-shot        0.659   0.327   0.287   0.500     0.695     0.572   -       -
BERT-Large (Ours, ISEAR)          Zero-shot        0.502   0.069   0.236   -         -         -       0.695   0.688
AAN Zhu et al. (2019)             Supervised       0.424   0.352   0.265   -         -         -       -       -
Ensemble Akhtar et al. (2019)     Supervised       0.635   0.375   0.277   -         -         -       -       -
SRV-SLSTM Wu et al. (2019)        Semi-supervised  0.620   0.508   0.333   -         -         -       -       -
BERT-Large (Ours, EBSemEval)      Supervised       0.765   0.583   0.416   -         -         -       -       -
Table 1: Performance of VAD score prediction and categorical emotion class prediction. With fine-tuning pre-trained BERT-Large, we show comparable performance to state-of-the-art models in classification and significant positive correlations with VAD scores using only the categorical emotion annotations. If our model trained on SemEval is fine-tuned on EmoBank, it outperforms all the state-of-the-art VAD regression models.

We now present our experimental results. First, we elaborate on the zero-shot VAD score prediction results of our models and then compare them with those of supervised models. We also report the classification performance of our model and the comparison models.

Zero-Shot VAD Score Predictions. The results are shown in Table 1. When our model is trained on SemEval and tested on EmoBank, the predicted VAD scores show significant positive Pearson’s correlation coefficients with the target VAD scores in EmoBank. The correlation in valence (V) is the highest among the dimensions (r=.659, p<.001), followed by arousal (A) (r=.327, p<.001) and dominance (D) (r=.287, p<.001). For our model trained on ISEAR, the scores also show significant positive Pearson’s r. The correlation is highest in the V dimension (r=.502, p<.001), followed by D (r=.236, p<.001) and A (r=.069, p<.001).

The correlations for SemEval are higher than those for ISEAR in all dimensions. This is likely because the emotion labels in SemEval carry more information than those in ISEAR. First, SemEval has 11 categorical emotion labels, whereas ISEAR has 7. A larger number of labels leads to less sparse VAD target distributions, so our model can distinguish degrees of VAD more easily. Second, a sentence in SemEval can have multiple emotion labels, whereas a sentence in ISEAR has only one. Multiple labels make the possible range of expected VAD scores much wider than a single label does. If a sentence always has a single label, the predicted VAD distribution must sum to one. With multiple labels, the distribution can sum to a much larger value, which widens the range of expected values and helps the model distinguish the degree of each VAD dimension for a given sentence.

Note that the correlation in the A dimension for ISEAR is low. The standard deviation of the arousal scores of the ISEAR labels (‘anger’, ‘disgust’, ‘fear’, ‘sadness’, ‘shame’, ‘joy’, ‘guilt’) is lower (.191) than that of the other dimensions (V: .328, D: .237), and it drops much further, to .105, when the single label ‘sadness’ is removed. This makes it difficult for the model to differentiate labels by degree of arousal, leading to a lower correlation with the target scores in the A dimension.

Comparison to VAD Predictions of Supervised Models. The three comparison models (AAN, Ensemble, SRV-SLSTM) in Table 1 are trained with supervision from VAD scores. Since our model trained on SemEval performs better than the one trained on ISEAR, we hereafter compare the SemEval scores with those of the comparison models.

Among those models, Ensemble shows the highest correlation in the V dimension (.635), and SRV-SLSTM reaches the highest correlations in the A (.508) and D (.333) dimensions. Notably, our model trained on SemEval shows an even higher correlation in the V dimension (.659) without any supervision from VAD score labels. Its correlation in A (.327) is lower than those of the supervised models (.352 to .508), and its correlation in D (.287) is comparable to that of Ensemble (.277). Overall, zero-shot prediction performance is fairly comparable with that of state-of-the-art models.

Furthermore, we present the results of another of our models, which is trained on SemEval and then fine-tuned on the training set of the EmoBank corpus with VAD score labels. If we continue training our model with supervision from VAD labels, it outperforms all of the state-of-the-art models by a large margin. The VAD fine-tuned model shows significant correlations in the V (r=.765, p<.001), A (r=.583, p<.001), and D (r=.416, p<.001) dimensions. These are improvements of +.130, +.075, and +.083 over the best state-of-the-art correlations in the V, A, and D dimensions, respectively.
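The reported margins can be checked directly against the Table 1 values. The short script below takes the best supervised or semi-supervised baseline per dimension and subtracts it from the fine-tuned model's correlation; the variable names are illustrative.

```python
# (V, A, D) Pearson correlations copied from Table 1.
baselines = {
    "AAN": (0.424, 0.352, 0.265),
    "Ensemble": (0.635, 0.375, 0.277),
    "SRV-SLSTM": (0.620, 0.508, 0.333),
}
ours_finetuned = (0.765, 0.583, 0.416)

# Best baseline correlation in each dimension.
best = [max(scores[i] for scores in baselines.values()) for i in range(3)]

# Margin of the fine-tuned model over the best baseline, per dimension.
margins = [round(o - b, 3) for o, b in zip(ours_finetuned, best)]

print(best)     # [0.635, 0.508, 0.333]
print(margins)  # [0.13, 0.075, 0.083]
```

The best baseline differs by dimension: Ensemble for V, and SRV-SLSTM for A and D.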

Categorical Label Classification. Next, we report the classification performance of our model and the comparison models. On SemEval, fine-tuning BERT as a conventional classifier (BERT-Large, Classification) yields a higher macro F1 score (.534) than NTUA-SLP, with comparable micro F1 score (.697) and multi-label accuracy (.572). Fine-tuning BERT on ISEAR shows similar results: the BERT classifier outperforms MT-CNN with a higher micro F1 score (.700).

Our model also shows classification performance comparable to that of the comparison models. On ISEAR, our model reaches a macro F1 score of .688, which is higher than that of MT-CNN. On SemEval, however, our model performs slightly below NTUA-SLP.

5 Ablation Study

Model                          V (r)   A (r)   D (r)
1. BERT (Ours, SemEval)        0.659   0.327   0.287
2. BERT (Random Init., EB)     0.600   0.536   0.344
3. BERT (Ours, EBSemEval)      0.765   0.583   0.416
4. BERT (Regression, EB)       0.787   0.632   0.498
Table 2: Ablation Study results of our models. Given that the model architecture is the same (BERT-Large), the architecture is effective for the VAD regression task, and initialization with our model trained on categorical emotion annotation helps to improve the performance as well. Using pre-trained BERT-Large shows slightly better results.

We further conduct an ablation study to investigate our model’s VAD prediction performance. Since we use pre-trained BERT and fine-tune it on different datasets, the effects of pre-training and fine-tuning should be separated to understand the source of the improvements.

In Table 2, we present four models for the ablation study, all of which have the same neural network architecture (BERT-Large) to control for the size and structure of the model. Model 1 is our model trained on SemEval, and Model 3 is fine-tuned on EmoBank after being initialized with the trained weights of Model 1. This is equivalent to continuing to train Model 1 with supervision from EmoBank labels. Model 2 uses BERT with all weights randomly initialized, which means it does not use pre-trained language model weights, and is then trained on EmoBank. Lastly, Model 4 fine-tunes BERT directly on EmoBank VAD labels, starting from the pre-trained language model weights.

As shown in Table 2, Model 2 is already comparable to the state-of-the-art VAD prediction models in Table 1. Specifically, Model 2 outperforms SRV-SLSTM in the A and D dimensions. In the V dimension, Model 2 underperforms Model 1 and SRV-SLSTM. Overall, this indicates that the multi-layer Transformer architecture is effective for VAD score regression even without any pre-trained knowledge. We also see further improvement with Model 3, which means that initializing the model with our approach is better than starting training from random weights.

Note that Model 4 shows the best performance in all of the V (r=.787, p<.001), A (r=.632, p<.001), and D (r=.498, p<.001) dimensions. This indicates that the pre-trained bidirectional language model weights are a better initialization than our model. A likely reason is that Model 1 has already been fine-tuned once to predict VAD distributions from categorical emotion labels, which results in forgetting some of the general linguistic representation of text from pre-trained BERT. It therefore seems that starting from a general representation of text allows better VAD score prediction than starting from representations trained on categorical emotion labels. This might be partially due to a suboptimal strategy for fine-tuning an already fine-tuned model. That question is beyond the scope of this work, and we plan to investigate how to fine-tune a fine-tuned model effectively in future work.

6 Qualitative Examples

Tweet: Gooood morning it is such a #blessing to see another day all that Read this I hope have a great morning
Categorical label: joy, optimism
Nearest neighbors from VAD scores: reaffirm, shimmer, brighten, affections, mythological

Tweet: Happy Winning Wednesday!! Each day is a day of new possibilities. Keep pushing and keep your head up. #live #love #laugh #reachforthestars
Categorical label: joy, love,
Nearest neighbors from VAD scores: incentive, alive, reborn, radiance, lavish

Tweet: Not only was and responsible for the unnecessary outrage of this movie, but made the director look bad
Categorical label: anger, disgust
Nearest neighbors from VAD scores: refusal, liar, falsified, disrespect, unsavory

Tweet: you begin to irritate me, primitive
Categorical label: anger, disgust
Nearest neighbors from VAD scores: negativity, abandon, dontlikeyou, depression, morgue

Tweet: Mentally suffered #iwanttodie #worthless #lifewithoutcolor #pain #suicidal
Categorical label: disgust, pessimism,
Nearest neighbors from VAD scores: orphaned, wasting, decomposed, hopelessness, dead
Table 3: Qualitative examples of predictions from our model trained on SemEval. Example tweets are from the test set of SemEval. We present the predicted categorical emotion labels and the corresponding top 5 nearest-neighbor words in the NRC-VAD Lexicon with respect to the model’s predicted VAD scores.

In Table 3, we show example predictions from our model trained on SemEval. The table presents annotated tweets from the SemEval test set with the corresponding predicted categorical labels and the top 5 nearest-neighbor emotional words with respect to the predicted VAD scores. For these 5 tweets, our model correctly predicted the categorical emotion labels. Below, we describe how we find the nearest-neighbor words from the VAD scores.

Given the VAD scores predicted by our model, we find nearest-neighbor words for those scores using the NRC-VAD Lexicon Mohammad (2018). We first rescale the model’s predicted VAD scores to the range 0 to 1 in each VAD dimension, since the lexicon values range from 0 to 1. To do this, we predict VAD scores for every sentence in the SemEval test set and then rescale the scores $s$ in each dimension by $(s - s_{\min}) / (s_{\max} - s_{\min})$, which makes all dimensions have scores from 0 to 1.

Next, we find nearest-neighbor words using the rescaled VAD values. We compute the Euclidean distances between these values and all words in the NRC-VAD Lexicon and pick the 5 words with the smallest distances. We present these words in the right column of Table 3. They help us understand the VAD scores more intuitively, and they could further be regarded as automatically generated emotional annotations for a given sentence. In other words, our model can suggest emotion labels that were not seen at training time by finding nearest-neighbor words in the VAD space.

The five examples in Table 3 show that our model can predict categorical emotion labels and also find suitable emotional words for a given sentence. In particular, for the fifth tweet, our model annotated the sentence with depressive words (hopelessness, dead), so it might be extended to detect warning signs from people in need on social media.
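The lookup might be sketched as follows. Several things here are assumptions made for illustration: min-max scaling is used as the rescaling step, the toy lexicon contains only the two entries quoted earlier in the text ("joy" and "sad") plus two made-up placeholder entries that are not real NRC-VAD values, and the raw predictions and function names are hypothetical.

```python
import numpy as np

# Toy lexicon. "joy" and "sad" use the coordinates quoted in the text;
# the two placeholder entries are invented and are NOT real NRC-VAD values.
lexicon = {
    "joy": (0.980, 0.824, 0.794),
    "sad": (0.225, 0.333, 0.149),
    "calm_placeholder": (0.800, 0.200, 0.600),
    "tense_placeholder": (0.300, 0.900, 0.400),
}


def min_max_rescale(scores):
    """Rescale each column (V, A, D) of an (n, 3) array to the range [0, 1].

    Assumes every column contains at least two distinct values.
    """
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(axis=0), scores.max(axis=0)
    return (scores - lo) / (hi - lo)


def nearest_words(point, lexicon, k=5):
    """Return the k lexicon words closest to `point` in Euclidean distance."""
    words = list(lexicon)
    coords = np.array([lexicon[w] for w in words])
    dist = np.linalg.norm(coords - np.asarray(point), axis=1)
    return [words[i] for i in np.argsort(dist)[:k]]


# Hypothetical raw VAD predictions for three sentences.
raw = np.array([[2.0, 1.0, 1.5],
                [0.5, 0.4, 0.2],
                [1.0, 0.9, 0.8]])
rescaled = min_max_rescale(raw)
neighbors = nearest_words(rescaled[0], lexicon, k=2)
print(neighbors)  # ['joy', 'calm_placeholder']
```

With a full lexicon, the same brute-force distance computation scales linearly in the number of entries, which is typically fast enough for a one-off annotation pass.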

7 Related Work

VAD Dimensions of Emotions. Research on models of emotion representation has a long history in psychology. The categorical model of emotion assumes that categories represented by emotion words form the building blocks of human emotion. Supporting evidence includes the six basic emotions Ekman (1992) and findings of universally adaptive emotions Plutchik (1980). Alternatively, the dimensional model of emotion seeks to explain how people conceptualize emotional feelings. Osgood et al. (1957) suggested initial ideas of emotion coordinates. Russell and Mehrabian (1977) further constructed the Pleasure- or Valence-Arousal-Dominance (PAD, VAD) model, a semantic scale model for rating emotional states, which represents an emotional state as a set of orthogonal coordinates on the V, A, and D dimensions. The absolute values of the intercorrelations among the three scales show considerable independence among them Russell and Mehrabian (1977). Categorical emotion states can thus be represented in a three-dimensional (VAD) emotion space. Based on these emotional dimensions, word-level VAD annotations of English words have been created Bradley and Lang (1999); Warriner et al. (2013). Recently, a large-scale VAD annotation of English words was developed Mohammad (2018), and we leverage these annotation scores to predict sentence-level VAD scores when training on datasets with categorical emotion annotations.

Emotional Distribution Learning. Instead of predicting multiple emotion labels from text, learning the emotion distribution itself from text has been proposed Deyu et al. (2016). This approach maps text to an emotion distribution and the respective intensities, incorporating Plutchik’s wheel of emotions. Furthermore, distribution learning can be extended to emotion ranking Zhou et al. (2018). Unlike these previous approaches, our model learns decomposed emotional distributions, namely the valence, arousal, and dominance distributions of emotions.

8 Discussion and Conclusions

We propose learning to predict VAD scores from the text with categorical emotion annotations. Our framework predicts VAD score distributions for a given text rather can classification probabilities for each class, by minimizing the EMD distances between predicts VAD distributions and sorted label distributions as a proxy of target VAD distributions.

Learning conditional VAD distributions enables predicting categorical emotion classes and continuous VAD scores simultaneously. By fine-tuning pre-trained BERT-Large on SemEval, our approach shows performance comparable to state-of-the-art classifiers on the categorical emotion classification task and significant positive correlations with target VAD scores, even without supervision from VAD scores. If training continues with supervision from VAD labels, our model outperforms state-of-the-art VAD regression models. The ablation study attributes this to the superiority of the multi-layer Transformer architecture as well as to the effective initialization strategy of starting fine-tuning for VAD score prediction from our model. We further retrieve the nearest-neighbor words of the VAD scores predicted by our model, which suggests that the model can automatically generate categorical emotion labels for an input sentence that were not seen at training time.
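The three inference uses described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' code: the function names, the data layout (one distribution per dimension over score-sorted class positions), the Euclidean distance for the nearest-neighbor lookup, and all numeric values are hypothetical.

```python
import math


def expected_score(distribution, sorted_scores):
    """Continuous score on one dimension: expectation of the sorted class
    scores under the predicted distribution."""
    return sum(p * s for p, s in zip(distribution, sorted_scores))


def predict_class(distributions, orders):
    """Pick the class with the largest product of per-dimension probabilities.

    distributions[d][k] is the probability at sorted position k on dimension d;
    orders[d][k] is the index of the class that sits at that position.
    """
    n_classes = len(orders[0])
    joint = [1.0] * n_classes
    for distribution, order in zip(distributions, orders):
        for position, class_index in enumerate(order):
            joint[class_index] *= distribution[position]
    return max(range(n_classes), key=lambda c: joint[c])


def nearest_word(vad, lexicon):
    """Return the lexicon word whose (V, A, D) entry is closest to `vad`.
    Euclidean distance is an assumption made for this sketch."""
    return min(lexicon, key=lambda word: math.dist(vad, lexicon[word]))


# Hypothetical two-dimension example with three classes (indices 0, 1, 2).
distributions = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
orders = [[2, 0, 1], [2, 1, 0]]
print(predict_class(distributions, orders))

# Hypothetical word-level VAD entries (made up, not from a real lexicon).
lexicon = {"calm": (0.8, 0.2, 0.6), "furious": (0.1, 0.9, 0.6)}
print(nearest_word((0.7, 0.3, 0.5), lexicon))
```

Because the lookup in `nearest_word` ranges over the whole lexicon rather than over the training classes, it can return emotion words that never appeared as categorical labels.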

We hope our framework will help researchers build human-annotated sentence-level VAD emotion datasets by providing machine-annotated VAD scores as a starting point, or serve directly as a VAD score prediction model. Most languages other than English lack corpora with VAD annotations, so our model could help build multilingual resources from multilingual corpora with categorical emotion labels Öhman et al. (2018). Future work will focus on developing a model that gives more sensible VAD scores without VAD annotations.

References


  • S. Akhtar, D. Ghosal, A. Ekbal, P. Bhattacharyya, and S. Kurohashi (2019) All-in-one: emotion, sentiment and intensity prediction using a multi-task ensemble framework. IEEE Transactions on Affective Computing. Cited by: §3.3.2, Table 1.
  • C. O. Alm, D. Roth, and R. Sproat (2005) Emotions from text: machine learning for text-based emotion prediction. In EMNLP, Cited by: §1.
  • S. Aman and S. Szpakowicz (2007) Identifying expressions of emotion in text. In TSD, Cited by: §1.
  • C. Baziotis, N. Athanasiou, A. Chronopoulou, A. Kolovou, G. Paraskevopoulos, N. Ellinas, S. Narayanan, and A. Potamianos (2018) NTUA-slp at semeval-2018 task 1: predicting affective content in tweets with deep attentive rnns and transfer learning. In SemEval, Cited by: §3.2, Table 1.
  • M. M. Bradley and P. J. Lang (1999) Affective norms for english words (anew): instruction manual and affective ratings. Technical report Citeseer. Cited by: §7.
  • S. Buechel and U. Hahn (2016) Emotion analysis as a regression problem—dimensional models and their implications on emotion representation and metrical evaluation. In Proceedings of the Twenty-second European Conference on Artificial Intelligence, pp. 1114–1122. Cited by: §1.
  • S. Buechel and U. Hahn (2017) EMOBANK: studying the impact of annotation perspective and representation format on dimensional emotion analysis. In EACL, Cited by: §1, §3.1.
  • B. Desmet and V. Hoste (2013) Emotion detection in suicide notes. Expert Syst. Appl. 40 (16), pp. 6351–6358. External Links: Document Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §2, §3.2.
  • Z. Deyu, X. Zhang, Y. Zhou, Q. Zhao, and X. Geng (2016) Emotion distribution learning from texts. In EMNLP, Cited by: §7.
  • P. Ekman (1992) An argument for basic emotions. Cognition & emotion 6 (3-4), pp. 169–200. Cited by: §1, §7.
  • L. Hou, C. Yu, and D. Samaras (2017) Squared earth movers distance loss for training deep neural networks on ordered-classes. In NIPS, Cited by: §2.
  • S. Kiritchenko and S. Mohammad (2017) Best-worst scaling more reliable than rating scales: a case study on sentiment intensity annotation. In Proceedings of the 55th Annual Meeting of the ACL, External Links: Document Cited by: §1.
  • Y. Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu (2017) Dailydialog: a manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957. Cited by: §1.
  • S. M. Mohammad, F. Bravo-Marquez, M. Salameh, and S. Kiritchenko (2018) SemEval-2018 Task 1: Affect in tweets. In SemEval, Cited by: §1, §3.1.
  • S. M. Mohammad (2012) # emotional tweets. In SemEval, Cited by: §1.
  • S. Mohammad (2018) Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the ACL, Cited by: §2, §2, §2, §6, §7.
  • E. Öhman, K. Kajava, J. Tiedemann, and T. Honkela (2018) Creating a dataset for multilingual fine-grained emotion-detection using gamification-based annotation. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Cited by: §8.
  • C. E. Osgood, G. J. Suci, and P. H. Tannenbaum (1957) The measurement of meaning. University of Illinois press. Cited by: §7.
  • R. Plutchik (1980) A general psychoevolutionary theory of emotion. In Theories of emotion, pp. 3–33. Cited by: §7.
  • R. Plutchik (2001) The nature of emotions: human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. American scientist 89 (4), pp. 344–350. Cited by: §1.
  • D. Preoţiuc-Pietro, H. A. Schwartz, G. Park, J. Eichstaedt, M. Kern, L. Ungar, and E. Shulman (2016) Modelling valence and arousal in Facebook posts. In Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 9–15. Cited by: §1.
  • J. A. Russell and A. Mehrabian (1977) Evidence for a three-factor theory of emotions. Journal of research in Personality 11 (3), pp. 273–294. Cited by: §1, §2, §7.
  • D. Sahana and L. Girish (2015) Automatic drug reaction detection using sentimental analysis. International Journal of Advanced Research in Computer Engineering & Technology (IJARCET) 4 (5). Cited by: §1.
  • K. R. Scherer and H. G. Wallbott (1994) Evidence for universality and cultural variation of differential emotion response patterning.. Journal of personality and social psychology 66 (2), pp. 310. Cited by: §1, §3.1.
  • H. Schuff, J. Barnes, J. Mohme, S. Padó, and R. Klinger (2017) Annotation, modelling and analysis of fine-grained emotions on a stance and sentiment detection corpus. In WASSA, pp. 13–23. Cited by: §1.
  • A. G. Shahraki and O. R. Zaiane (2017) Lexical and learning-based emotion mining from text. In CICLing, Cited by: §1.
  • V. Sintsovaa and C. Musata (2013) Fine-grained emotion recognition in olympic tweets based on human computation. In WASSA, Cited by: §1.
  • J. B. Torre and M. D. Lieberman (2018) Putting feelings into words: affect labeling as implicit emotion regulation. Emotion Review 10 (2), pp. 116–124. External Links: Document Cited by: §1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3.2.
  • A. B. Warriner, V. Kuperman, and M. Brysbaert (2013) Norms of valence, arousal, and dominance for 13,915 english lemmas. Behavior Research Methods 45 (4), pp. 1191–1207. Cited by: §7.
  • C. Wu, F. Wu, S. Wu, Z. Yuan, J. Liu, and Y. Huang (2019) Semi-supervised dimensional sentiment analysis with variational autoencoder. Knowledge-Based Systems 165, pp. 30–39. Cited by: §3.3.2, Table 1.
  • L. Yu, L. Lee, S. Hao, J. Wang, Y. He, J. Hu, K. R. Lai, and X. Zhang (2016) Building Chinese affective resources in valence-arousal dimensions. In Proceedings of the 2016 Conference of the NAACL, Cited by: §1.
  • Y. Zhang, J. Fu, D. She, Y. Zhang, S. Wang, and J. Yang (2018) Text emotion distribution learning via multi-task convolutional neural network. In IJCAI, Cited by: §3.2, Table 1.
  • D. Zhou, Y. Yang, and Y. He (2018) Relevant emotion ranking from text constrained with emotion relationships. In NAACL, Cited by: §7.
  • S. Zhu, S. Li, and G. Zhou (2019) Adversarial attention modeling for multi-dimensional emotion regression. In Proceedings of the 57th Annual Meeting of the ACL, Florence, Italy. Cited by: §3.3.2, Table 1.