1 Introduction
Humans feel and express complex emotions beyond the basic emotions Ekman (1992); Plutchik (2001) on a daily basis. To represent these various emotions systematically, a dimensional emotion model such as the Valence-Arousal-Dominance (VAD) model Russell and Mehrabian (1977) is commonly used. This model maps emotional states to an orthogonal VAD space, so that various emotions can be projected into the space with measurable distances from one another. Since dimensional models represent an emotion as a real-valued vector in this space, they are more likely to account for subtle emotional expressions than categorical models, which employ a finite number of basic emotions. With dimensional VAD models, capturing fine-grained emotions could benefit clinical natural language processing (NLP) research Desmet and Hoste (2013); Sahana and Girish (2015), emotion regulation in psychotherapy research Torre and Lieberman (2018), and other work in computational social science that deals with subtle emotion recognition Buechel and Hahn (2016).
Therefore, building a dimensional emotion detection model from an annotated corpus would be highly useful. However, such annotated resources are surprisingly scarce. Few corpora have full VAD annotations Buechel and Hahn (2017), and others have only VA annotations Preoţiuc-Pietro et al. (2016); Yu et al. (2016). One could build such a resource by labeling a corpus with best-worst scaling Kiritchenko and Mohammad (2017). Instead, we examine a novel way to predict dimensional emotion (VAD) scores from relatively common resources: corpora annotated with coarse-grained basic categorical emotions Scherer and Wallbott (1994); Alm et al. (2005); Aman and Szpakowicz (2007); Mohammad (2012); Sintsovaa and Musata (2013); Li et al. (2017); Schuff et al. (2017); Shahraki and Zaiane (2017); Mohammad et al. (2018).
In this paper, we propose a framework to learn dimensional VAD scores from corpora with categorical emotion labels. We demonstrate our idea with the pretrained language model BERT Devlin et al. (2018), which we finetune through our approach. In detail, our model learns conditional VAD distributions under the supervision of categorical emotion labels, and uses them to compute both VAD scores and categorical emotion labels for a given sentence.
In summary, our contributions are as follows:

We propose a framework that learns to predict VAD scores from a corpus with categorical emotion annotations.

Our model, trained only with categorical emotion labels, predicts VAD scores that show significant positive correlations with the corresponding ground truth VAD scores.

Our model can be further finetuned with supervision of VAD scores to outperform state-of-the-art dimensional emotion detection models.
2 Approach
Here we describe how we predict VAD scores for a given text from a model trained on a dataset with categorical emotion annotations.
Overview. The key idea is to train an emotion detection model to predict each of the VAD distributions conditioned on a given text, rather than to predict categorical emotion labels directly as conventional emotion classifiers do. This is possible even with only categorical emotion labels, because each categorical label also has VAD scores. Thus one can sort the labels along each VAD dimension to obtain (sparse) ground truth conditional VAD distributions for a given text (Fig. 1a, 1b). A model can then be trained to predict VAD distributions by minimizing the distance between the predicted and ground truth distributions. This allows the model not only to predict VAD scores for regression (expectations of the predicted distributions, Fig. 1d) but also to pick an emotion label from a given set of categorical labels for classification (argmax over emotion labels, Fig. 1c).
Model Architecture. (Fig. 1a) Formally, an emotion detection model is $P(e \mid X)$, where $e$ is an emotion drawn from a set of predefined categorical emotions $E$ and $X$ is a sequence of symbols representing an input text. Usually, $e$ is represented as a one-hot vector in the emotion classification task.
Unlike classification models that directly learn $P(e \mid X)$, we aim to learn each distribution of V, A, and D from pairs of input text and categorical labels. To this end, we map categorical emotion labels to the three-dimensional VAD space, $e \mapsto (v, a, d)$, using the NRC-VAD Lexicon Mohammad (2018). For example, the emotion label "joy" is mapped to (0.980, 0.824, 0.794) and "sad" to (0.225, 0.333, 0.149) in the VAD space. Using these coordinates, our model predicts the following distribution:
$$P(v, a, d \mid X) \qquad (1)$$
Furthermore, since the dimensions of the VAD space are nearly independent Russell and Mehrabian (1977), we assume that they are mutually independent. The joint distribution can then be decomposed into the product of three conditional distributions:
$$P(v, a, d \mid X) = P(v \mid X)\, P(a \mid X)\, P(d \mid X) \qquad (2)$$
For each decomposed conditional distribution, any trainable function with sufficient capacity to capture linguistic patterns in the input could be used. As a demonstration, we use the pretrained bidirectional language model BERT Devlin et al. (2018), which shows state-of-the-art performance on natural language understanding tasks when finetuned on task-specific datasets. For each conditional distribution, we stack a softmax or sigmoid activation layer over the hidden state corresponding to the [CLS] token in BERT.
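The following is a minimal sketch of the three output heads, not the implementation used in the paper. It assumes one linear layer per dimension applied to a given [CLS] hidden vector, followed by softmax (single-label) or sigmoid (multi-label). The hidden size, the number of labels, the random weights, and the names `vad_heads` and `params` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax; the outputs sum to one."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Element-wise sigmoid; each output is an independent probability."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def linear(hidden, weights, bias):
    """Matrix-vector product producing one logit per emotion label."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]


def vad_heads(cls_hidden, params, multi_label=False):
    """Apply one head per VAD dimension to the [CLS] hidden state."""
    activation = sigmoid if multi_label else softmax
    return {dim: activation(linear(cls_hidden, w, b))
            for dim, (w, b) in params.items()}


# Hypothetical sizes: a 4-dimensional hidden state and 3 emotion labels.
rng = random.Random(0)
hidden_size, num_labels = 4, 3
params = {
    dim: ([[rng.uniform(-1, 1) for _ in range(hidden_size)]
           for _ in range(num_labels)],
          [0.0] * num_labels)
    for dim in ("V", "A", "D")
}
cls_hidden = [0.5, -1.0, 0.25, 2.0]  # stand-in for the BERT [CLS] vector

single = vad_heads(cls_hidden, params)                   # softmax heads
multi = vad_heads(cls_hidden, params, multi_label=True)  # sigmoid heads
```

With softmax each head yields a distribution that sums to one, whereas the sigmoid heads score every label independently, which is why the two cases need different treatment in the loss.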
Model Training. (Fig. 1b) To train our model, we need target conditionals for each of V, A, and D, derived from the categorical emotion labels. We simply sort the categorical emotions in the label set $E$ by their V, A, and D scores respectively, based on the mapped VAD coordinates. For example, if we have four emotions (joy, sad, happy, anger) in the categorical labels and their corresponding valence scores in NRC-VAD Mohammad (2018) are (0.980, 0.225, 1.000, 0.167), then we reorder the labels to (anger, sad, joy, happy) and reorder the corresponding one-hot labels accordingly to obtain the target conditional $P(v \mid X)$. In other words, by rearranging label positions in ascending order of valence scores, the sorted one-hot labels can be treated as a proxy for the target conditional. We sort labels by A and D to obtain the other conditionals as well. Note that these conditionals are sparse because we only have $|E|$ points (one per emotion label) for each VAD dimension.
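The sorting step can be sketched in a few lines. This is an illustration only: the function name `sorted_target` is hypothetical, and the pairing of the four valence scores with the names joy, sad, happy, and anger is inferred from the sorted order given in the example.

```python
def sorted_target(labels, scores, gold):
    """Sort labels by ascending score and build the target vector.

    gold is the set of labels annotated for one text: a single label in
    the single-label case, possibly several in the multi-label case.
    """
    order = sorted(labels, key=lambda label: scores[label])
    target = [1.0 if label in gold else 0.0 for label in order]
    return order, target


# Valence scores from the example in the text.
valence = {"joy": 0.980, "sad": 0.225, "happy": 1.000, "anger": 0.167}
labels = ["joy", "sad", "happy", "anger"]

order, target = sorted_target(labels, valence, gold={"joy"})
print(order)   # ['anger', 'sad', 'joy', 'happy']
print(target)  # [0.0, 0.0, 1.0, 0.0]
```

Repeating the call with arousal and dominance scores would give the other two target conditionals for the same text.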
Next, we minimize the distances between the true and predicted conditionals. Since we sorted the labels, the classes are ordered. This order should be taken into account during optimization, so we minimize the squared Earth Mover's Distance (EMD) loss Hou et al. (2017) between the true and predicted conditionals. The EMD loss is as follows:
$$\mathrm{EMD}^2(p, \hat{p}) = \sum_{c=1}^{C} \left( \sum_{i=1}^{c} p_i - \sum_{i=1}^{c} \hat{p}_i \right)^2 \qquad (3)$$
where $p$ is a true conditional, $\hat{p}$ is a predicted conditional, and $C$ is the number of classes. This loss is designed to consider the distance between classes in an ordered classification problem, penalizing the model more when it chooses a class far from the correct one. It computes the squared difference between the cumulative distribution function of $p$ and that of the corresponding $\hat{p}$.
Note that Eq. 3 assumes that $p$ and $\hat{p}$ have the same total probability mass. In the single-label case, i.e., if each text has exactly one annotated categorical emotion label, the assumption is satisfied, since the one-hot $p$ sums to one and $\hat{p}$ is the output of a softmax layer, which always sums to one. In the multi-label case, however, this assumption is violated, because a sigmoid activation layer is generally used to represent the probability of each class independently. We therefore slightly change Eq. 3 to satisfy the assumption, defining the inter-class EMD loss as follows:
$$\mathrm{EMD}^2_{\mathrm{inter}}(p, \hat{p}) = \sum_{c=1}^{C} \left( \sum_{i=1}^{c} \bar{p}_i - \sum_{i=1}^{c} \bar{\hat{p}}_i \right)^2 \qquad (4)$$
where $\bar{p}$ and $\bar{\hat{p}}$ are $p$ and $\hat{p}$ normalized by dividing each by its sum of probabilities. We also introduce an intra-class EMD loss:
(5) 
where $p_c$ is the true and $\hat{p}_c$ the predicted probability for class $c$. Finally, we use the following EMD loss for the multi-label case:
(6) 
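Overall, we minimize the sum of the three squared EMD losses between the target and predicted distributions, one for each VAD dimension:
$$\mathcal{L} = \mathrm{EMD}^2\big(P(v \mid X), \hat{P}(v \mid X)\big) + \mathrm{EMD}^2\big(P(a \mid X), \hat{P}(a \mid X)\big) + \mathrm{EMD}^2\big(P(d \mid X), \hat{P}(d \mid X)\big) \qquad (7)$$
where $P(v \mid X)$, $P(a \mid X)$, $P(d \mid X)$ denote the target and $\hat{P}(v \mid X)$, $\hat{P}(a \mid X)$, $\hat{P}(d \mid X)$ the predicted conditional distributions.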
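As a rough illustration of the losses described in this section, the sketch below computes the squared EMD from cumulative sums, a normalized variant for the multi-label case, and the sum over the three dimensions. It is written in plain Python rather than a differentiable framework, it does not cover the intra-class term, and the function names are hypothetical.

```python
from itertools import accumulate


def squared_emd(target, predicted):
    """Sum of squared differences between the two cumulative distributions."""
    if len(target) != len(predicted):
        raise ValueError("distributions must have the same number of classes")
    return sum((t - p) ** 2
               for t, p in zip(accumulate(target), accumulate(predicted)))


def normalize(probs):
    """Divide by the total mass so that the values sum to one."""
    total = sum(probs)
    if total == 0:
        raise ValueError("cannot normalize a distribution with zero mass")
    return [p / total for p in probs]


def inter_class_emd(target, predicted):
    """Squared EMD after normalizing both sides (multi-label case)."""
    return squared_emd(normalize(target), normalize(predicted))


def vad_loss(targets, predictions, emd=squared_emd):
    """Sum of the EMD losses over the V, A and D dimensions."""
    return sum(emd(targets[d], predictions[d]) for d in ("V", "A", "D"))


target = [0.0, 0.0, 1.0, 0.0]  # gold label in the third sorted position
near = [0.0, 0.1, 0.8, 0.1]    # most mass next to the gold label
far = [0.8, 0.1, 0.0, 0.1]     # most mass far from the gold label
print(squared_emd(target, near) < squared_emd(target, far))  # True
```

The comparison at the end shows the property the text relies on: a prediction that puts its mass far from the gold position in the sorted order is penalized more than one that puts it nearby, which a plain cross-entropy loss would not distinguish by distance.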
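Predicting Categorical Emotion Labels. (Fig. 1c) Based on the model's predicted VAD distributions, we can pick an emotion label from a given set, as conventional emotion classifiers do. By computing the product of the predicted $\hat{P}(v \mid X)$, $\hat{P}(a \mid X)$, and $\hat{P}(d \mid X)$, we obtain the predicted $\hat{P}(v, a, d \mid X)$ under the conditional independence assumption. Then we can pick an emotion label as follows:
$$\hat{e} = \arg\max_{e \in E} \hat{P}(v_e \mid X)\, \hat{P}(a_e \mid X)\, \hat{P}(d_e \mid X) \qquad (8)$$
where $(v_e, a_e, d_e)$ are the VAD coordinates of label $e$. Since we only have the given emotion labels, we compare the joint probabilities of these labels and pick either the one emotion label with the maximum probability (single-label case, Eq. 8) or multiple labels with probability over a certain threshold (multi-label case). The threshold is a hyperparameter of the model, set to 0.125 (= $0.5^3$).
Predicting Continuous VAD Scores. (Fig. 1d) We can further compute the expectations of the predicted conditionals $\hat{P}(v \mid X)$, $\hat{P}(a \mid X)$, and $\hat{P}(d \mid X)$ to predict continuous VAD scores:
$$\hat{v} = \sum_{e \in E} v_e\, \hat{P}(v_e \mid X), \quad \hat{a} = \sum_{e \in E} a_e\, \hat{P}(a_e \mid X), \quad \hat{d} = \sum_{e \in E} d_e\, \hat{P}(d_e \mid X) \qquad (9)$$
Once again, we use the VAD scores in Mohammad (2018) for each dimension when computing the expectations. This allows us to predict continuous VAD scores from a model trained on categorical emotion annotations.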
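Both prediction modes can be sketched from the same three predicted distributions. The example below is an assumption-laden illustration: the probabilities are made up, the function names are hypothetical, and only the two lexicon entries quoted earlier in the text (joy and sad) are used.

```python
def predict_labels(p_v, p_a, p_d, threshold=None):
    """Score each label by the product of its three conditional probabilities.

    p_v, p_a and p_d map each label to its predicted probability in the
    valence, arousal and dominance distribution. Without a threshold the
    single best label is returned; with one, every label above it.
    """
    joint = {label: p_v[label] * p_a[label] * p_d[label] for label in p_v}
    if threshold is None:
        return [max(joint, key=joint.get)]
    return [label for label, p in joint.items() if p > threshold]


def expected_score(probs, scores):
    """Expectation of the lexicon scores under a predicted distribution."""
    return sum(probs[label] * scores[label] for label in probs)


# NRC-VAD coordinates quoted in the text: (valence, arousal, dominance).
lexicon = {"joy": (0.980, 0.824, 0.794), "sad": (0.225, 0.333, 0.149)}

# Hypothetical softmax outputs for one sentence (single-label case).
p_v = {"joy": 0.9, "sad": 0.1}
p_a = {"joy": 0.7, "sad": 0.3}
p_d = {"joy": 0.8, "sad": 0.2}

label = predict_labels(p_v, p_a, p_d)
vad = tuple(
    expected_score(p, {name: xyz[i] for name, xyz in lexicon.items()})
    for i, p in enumerate((p_v, p_a, p_d))
)

# Hypothetical sigmoid outputs (multi-label case) with the 0.125 threshold.
multi = predict_labels({"joy": 0.9, "sad": 0.6},
                       {"joy": 0.8, "sad": 0.6},
                       {"joy": 0.7, "sad": 0.5},
                       threshold=0.125)
```

Here `label` is the single most probable emotion, `vad` holds the three expected scores, and `multi` lists every label whose joint probability exceeds the threshold.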
3 Experiments
In this section, we describe our experimental setup. Throughout these experiments, we focus on demonstrating that our approach can effectively predict continuous emotional dimensions (VAD scores) using only categorical emotion labels.
3.1 Dataset
We use three datasets consisting of text and corresponding emotion annotations. Two of them have categorical emotion labels, and the other is a VAD-annotated corpus.
SemEval 2018 E-c (SemEval). A multi-labeled categorical emotion corpus that contains 10,983 tweets with labels for the presence or absence of 11 emotions Mohammad et al. (2018). We hereafter refer to this dataset as SemEval.
ISEAR. A single-labeled categorical emotion corpus that contains 7,666 sentences. Each sentence is labeled with exactly one of 7 categorical emotions Scherer and Wallbott (1994).
EmoBank. Sentences paired with continuous VAD scores as labels. This corpus contains 10,062 sentences collected across 6 domains and 2 perspectives. Each sentence has three scores representing V, A, and D in the range of 1 to 5. Unless otherwise noted, we use the weighted average of VAD scores as ground truth scores, as recommended by the EmoBank authors Buechel and Hahn (2017).
3.2 Predicting Categorical Emotion Labels.
We examine the classification performance of our approach and compare it to state-of-the-art emotion classification models. We use accuracy, macro F1 score, and micro F1 score as evaluation metrics.
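For readers less familiar with the two F1 variants, the sketch below computes them for multi-label predictions. It is not the evaluation script used in the paper. The toy labels and predictions are made up, and the convention of scoring a label with no gold and no predicted instances as 0 is an assumption.

```python
def f1(tp, fp, fn):
    """F1 from counts; taken to be 0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(gold, pred, labels):
    """gold and pred are lists of label sets, one set per text."""
    counts = {label: [0, 0, 0] for label in labels}  # tp, fp, fn per label
    for g, p in zip(gold, pred):
        for label in labels:
            if label in p and label in g:
                counts[label][0] += 1
            elif label in p:
                counts[label][1] += 1
            elif label in g:
                counts[label][2] += 1
    # Macro: average the per-label F1 scores, so rare labels weigh equally.
    macro = sum(f1(*c) for c in counts.values()) / len(labels)
    # Micro: pool the counts over all labels before computing F1.
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = f1(*totals)
    return macro, micro


labels = ["joy", "anger", "fear"]
gold = [{"joy"}, {"anger", "fear"}, {"joy"}, {"anger"}]
pred = [{"joy"}, {"anger"}, {"joy", "anger"}, {"anger"}]
macro, micro = macro_micro_f1(gold, pred, labels)
print(round(macro, 3), round(micro, 3))  # 0.6 0.8
```

In this toy case the missed "fear" label pulls the macro score well below the micro score, which is why the two are reported side by side.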
MT-CNN. A convolutional neural network for text classification trained by multi-task learning Zhang et al. (2018). The model jointly learns classification labels and emotion distributions of a given text. The emotion distribution represents multiple emotions in a given sentence as normalized counts of affective terms extracted with emotion lexicons. The model reaches state-of-the-art classification accuracy and F1 score on ISEAR.
NTUA-SLP. A classification model using deep self-attention layers over BiLSTM hidden states Baziotis et al. (2018). The model is pretrained on general tweets and 'SemEval 2017 task 4A', then finetuned on all 'SemEval 2018 subtasks' in order to transfer the learned knowledge to each subtask. The model took first place in the multi-labeled emotion classification task on the SemEval dataset.
BERT-Large (Classification). A pretrained bidirectional language model based on stacked Transformers Vaswani et al. (2017). The model shows state-of-the-art performance on various natural language understanding tasks after being finetuned on task-specific datasets Devlin et al. (2018). We add a linear transformation layer on top of BERT, with sigmoid activation for training on the multi-labeled dataset (SemEval) or softmax activation for the single-labeled dataset (ISEAR). Like conventional text classifiers, these models are optimized by minimizing the cross-entropy loss between predicted distributions and one-hot labels.
3.3 Predicting Continuous VAD scores.
Next, we investigate the VAD score prediction performance of our approach and compare it to state-of-the-art VAD regression models. Since the training objectives of the models vary, we use Pearson's correlation coefficient between a model's VAD predictions and the ground truth scores as the evaluation metric.
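A minimal sketch of the metric follows, written in plain Python with a hypothetical function name and made-up scores. It illustrates why a correlation is convenient here: it is unaffected by the scale and offset of the predictions, so scores expressed on a 0 to 1 lexicon scale can be compared with EmoBank scores on a 1 to 5 scale.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up predictions on a 0-1 scale and gold scores on a 1-5 scale.
predicted = [0.2, 0.4, 0.6, 0.8]
gold = [1.0, 2.0, 3.0, 4.0]
r_linear = pearson_r(predicted, gold)            # perfectly linear relation
r_reversed = pearson_r(predicted, gold[::-1])    # perfectly reversed relation
```

In this toy case `r_linear` is 1 and `r_reversed` is -1 up to floating point error, even though the two lists live on different scales.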
3.3.1 Zero-shot Predictions
We refer to the following two results as zero-shot prediction performance because these models are not trained on EmoBank, i.e., they are trained without supervision from any VAD score labels. These models use the entire EmoBank as an evaluation set. We focus on these results since we aim to predict VAD scores from a model trained on corpora annotated with categorical emotion labels.
BERT-Large (Ours, SemEval). We compute VAD score predictions using Eq. 9 from our model trained on SemEval, which is the same model used for predicting categorical emotion labels.
BERT-Large (Ours, ISEAR). As with the model above, we compute VAD scores from our model trained on ISEAR.
3.3.2 Predictions after Supervised Learning
Unlike the previous models, the following are trained by supervised learning on the VAD score labels in EmoBank. These results let us put the zero-shot prediction performance in context and show how much the zero-shot prediction model can be improved when VAD annotations are available.
AAN. An Adversarial Attention Network for dimensional emotion regression that learns to discriminate VAD dimension scores Zhu et al. (2019). Pearson correlations between predicted and ground truth VAD scores in EmoBank are reported. Note that the scores are reported separately for each of the 2 perspectives and 6 domains, so we use the highest VAD correlations among the perspectives and domains for comparison.
Ensemble. A multi-task ensemble neural network that learns to predict VAD scores, sentiment, and their intensity simultaneously Akhtar et al. (2019). The model was recently shown to be effective for VAD regression.
SRV-SLSTM. A model that predicts VAD scores through variational autoencoders trained by semi-supervised learning, and that shows state-of-the-art performance on the VAD score prediction task Wu et al. (2019). The model performs best when using 40% of the labeled EmoBank data, so we compare our model's performance to those scores.
BERT-Large (Ours, EBSemEval). We further finetune our BERT-Large (Ours, SemEval) model on the EmoBank dataset. We split EmoBank into training, validation, and test sets with a ratio of 6:2:2, then train the model and report the correlation between predicted and ground truth VAD scores on the test set. Specifically, we remove the final linear layer with softmax or sigmoid activations used for training with categorical labels, and add a new linear layer with ReLU activations for VAD score prediction. All parameters are then finetuned by minimizing the mean squared error (MSE) loss between predicted and ground truth VAD scores. With this model, we investigate the effectiveness of our approach as a parameter initialization strategy for VAD regression when VAD annotations are available.
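The 6:2:2 split can be sketched as follows. The text does not say how sentences were assigned to the three sets, so the random shuffle, the fixed seed, the rounding of set sizes, and the function name `split_622` are all assumptions made for illustration.

```python
import random


def split_622(items, seed=0):
    """Shuffle items and split them into train/valid/test at a 6:2:2 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_valid = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test


# Sentence indices standing in for the 10,062 EmoBank sentences.
train, valid, test = split_622(range(10062))
print(len(train), len(valid), len(test))  # 6037 2012 2013
```

Under these assumptions the test set holds about two thousand sentences, which is the portion on which the supervised correlations would be computed.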
3.4 Experimental Details.
In all experiments, we use the BERT-Large uncased model (https://tfhub.dev/google/bert_uncased_L24_H1024_A16/1). We set the learning rate to 2e-5 with a warmup period of 3 epochs. The batch size is set to 64, and we finetune all layers, stopping when the validation loss is minimized. We use a single TPU for optimization, and all finetuning runs converged within 10 epochs, taking an hour.
4 Results
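Table 1. VAD regression results on EmoBank (Pearson's r) and classification results on SemEval 2018 E-c and ISEAR.
| Dataset | | EmoBank | | | SemEval 2018 E-c | | | ISEAR | |
| Task | | Regression | | | | | | | |
| Model | Scheme | V (r) | A (r) | D (r) | Macro F1 | Micro F1 | Acc. | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MT-CNN Zhang et al. (2018) | | | | | | | | | 0.668 |
| NTUA-SLP Baziotis et al. (2018) | | | | | 0.528 | 0.701 | 0.588 | | |
| BERT-Large (Classification, ep3) | | | | | 0.534 | 0.697 | 0.572 | 0.704 | 0.700 |
| BERT-Large (Ours, SemEval) | Zero-shot | 0.659 | 0.327 | 0.287 | 0.500 | 0.695 | 0.572 | | |
| BERT-Large (Ours, ISEAR) | Zero-shot | 0.502 | 0.069 | 0.236 | | | | 0.695 | 0.688 |
| AAN Zhu et al. (2019) | Supervised | 0.424 | 0.352 | 0.265 | | | | | |
| Ensemble Akhtar et al. (2019) | Supervised | 0.635 | 0.375 | 0.277 | | | | | |
| SRV-SLSTM Wu et al. (2019) | Semi-supervised | 0.620 | 0.508 | 0.333 | | | | | |
| BERT-Large (Ours, EBSemEval) | Supervised | 0.765 | 0.583 | 0.416 | | | | | |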
We now present our experimental results. First, we discuss the zero-shot VAD score prediction results of our models, and then compare them to those of supervised models. We also report the classification performance of our model and the comparison models.
Zero-Shot VAD Score Predictions. The results are shown in Table 1. When our model is trained on SemEval and tested on EmoBank, the predicted VAD scores show significant positive Pearson's correlation coefficients with the target VAD scores in EmoBank. The correlation is highest for valence (V) (r=.659, p<.001), followed by arousal (A) (r=.327, p<.001) and dominance (D) (r=.287, p<.001). For our model trained on ISEAR, the scores also show significant positive Pearson's correlations. The correlation is highest for V (r=.502, p<.001), followed by D (r=.236, p<.001) and A (r=.069, p<.001).
The correlations for SemEval are higher than those for ISEAR in all dimensions. We attribute this to the emotion labels in SemEval carrying more information than those in ISEAR. First, SemEval has 11 categorical emotion labels whereas ISEAR has 7. A larger number of labels leads to less sparse VAD target distributions, so our model can distinguish the degree of each VAD dimension more easily. Second, each sentence in SemEval can have multiple emotion labels, whereas each sentence in ISEAR has only one. These multiple emotion labels make the possible range of the expected VAD scores much wider than with single emotion labels. If a sentence must always have a single label, the predicted VAD distribution must sum to one. With multiple labels, in contrast, the distributions can sum to a much larger value, which leads to a wider range of expected values and helps the model distinguish the degree of each VAD dimension for a given sentence.
Note that the correlation in the A dimension for ISEAR is low. The standard deviation of the arousal scores of the ISEAR labels 'anger', 'disgust', 'fear', 'sadness', 'shame', 'joy', and 'guilt' is lower (.191) than that of the other dimensions (V: .328, D: .237), and it drops further to .105 when only the label 'sadness' is removed. This makes it difficult for the model to differentiate labels by degree of arousal, leading to a lower correlation with the target scores in the A dimension.
Comparison to VAD Predictions of Supervised Models. Three comparison models (AAN, Ensemble, SRV-SLSTM) in Table 1 are trained with supervision of VAD scores. Since our model trained on SemEval performs better than the one trained on ISEAR, we hereafter compare the SemEval scores to those of the comparison models.
Among those models, Ensemble shows the highest correlation on the V dimension (.635), and SRV-SLSTM reaches the highest correlations on the A (.508) and D (.333) dimensions. Notably, our model trained on SemEval shows an even better correlation on the V dimension (.659) without any supervision of VAD score labels. Its correlation on A (.327) is lower than those of the supervised models, and its correlation on D (.287) is comparable to that of Ensemble (.277). Overall, zero-shot prediction performance is fairly comparable to that of state-of-the-art models.
Furthermore, we present the results of another model of ours, which is trained on SemEval and then finetuned on the training set of the EmoBank corpus with VAD score labels. If we continue training our model with supervision of VAD labels, it outperforms all of the state-of-the-art models by a large margin. The VAD-finetuned model shows significant correlations in all of the V (r=.765, p<.001), A (r=.583, p<.001), and D (r=.416, p<.001) dimensions. These are improvements of (+.130, +.075, +.083) in correlation over the best previous results for the V, A, and D dimensions, respectively.
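The reported margins can be checked directly from the correlations in Table 1. The short sketch below does only that arithmetic: for each dimension it picks the best of the three comparison models and subtracts its score from that of the finetuned model. The variable names are illustrative.

```python
# Pearson correlations on EmoBank, copied from Table 1.
ours = {"V": 0.765, "A": 0.583, "D": 0.416}
comparison = {
    "AAN": {"V": 0.424, "A": 0.352, "D": 0.265},
    "Ensemble": {"V": 0.635, "A": 0.375, "D": 0.277},
    "SRV-SLSTM": {"V": 0.620, "A": 0.508, "D": 0.333},
}

gains = {}
for dim in ("V", "A", "D"):
    # Best comparison model for this dimension.
    best = max(comparison, key=lambda name: comparison[name][dim])
    gains[dim] = (best, round(ours[dim] - comparison[best][dim], 3))

print(gains)
```

The result attributes the best previous V score to Ensemble and the best A and D scores to SRV-SLSTM, with gains matching the (+.130, +.075, +.083) quoted in the text.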
Categorical Label Classification. Next, we report the classification performance of our model and the comparison models. On SemEval, finetuning BERT as a conventional classifier (BERT-Large, Classification) yields a higher macro F1 score (.534) than NTUA-SLP, with a comparable micro F1 score (.697) and multi-label accuracy (.572). Finetuning BERT on ISEAR shows similar results: the BERT classifier outperforms MT-CNN with a higher micro F1 score (.700).
Our model also shows classification performance comparable to the comparison models. On ISEAR, our model reaches a macro F1 score of .688, which is higher than that of MT-CNN. On SemEval, however, our model performs slightly below NTUA-SLP.
5 Ablation Study
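Table 2. Ablation results for VAD score prediction (Pearson's r).
| Model | V (r) | A (r) | D (r) |
| --- | --- | --- | --- |
| Zero-shot | | | |
| 1. BERT (Ours, SemEval) | 0.659 | 0.327 | 0.287 |
| Supervised | | | |
| 2. BERT (Random Init., EB) | 0.600 | 0.536 | 0.344 |
| 3. BERT (Ours, EBSemEval) | 0.765 | 0.583 | 0.416 |
| 4. BERT (Regression, EB) | 0.787 | 0.632 | 0.498 |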
We further conduct an ablation study to investigate our model's VAD prediction performance. Since we use pretrained BERT and finetune it on different datasets, the effects of pretraining and finetuning should be separated to understand the source of the improvements.
In Table 2, we present four models for the ablation study, all of which share the same neural network architecture (BERT-Large) to control for the size and structure of the model. Model 1 is our model trained on SemEval, and Model 3 is finetuned on EmoBank starting from the trained weights of Model 1. This is equivalent to continuing to train Model 1 with supervision of EmoBank labels. Model 2 uses the BERT architecture with all weights randomly initialized, i.e., without pretrained language model weights, and is trained on EmoBank. Lastly, Model 4 directly finetunes BERT on EmoBank VAD labels, starting from the pretrained language model weights.
As shown in Table 2, Model 2 is already comparable to the state-of-the-art VAD prediction models in Table 1. Specifically, Model 2 outperforms SRV-SLSTM in the A and D dimensions. In the V dimension, Model 2 underperforms Model 1 and SRV-SLSTM. Overall, this indicates that the multi-layer Transformer architecture is effective for VAD score regression even without any pretrained knowledge. We also see further improvement with Model 3, which means that initializing the model with our approach is better than starting training from random weights.
Model 4 performs best in all of the V (r=.787, p<.001), A (r=.632, p<.001), and D (r=.498, p<.001) dimensions. This indicates that pretrained bidirectional language model weights are a better initialization than our model. A likely reason is that Model 1 has already been finetuned once to predict VAD distributions from categorical emotion labels, which may cause it to forget part of the general linguistic representation of pretrained BERT. Thus, starting training from a general representation of text appears to predict VAD scores better than starting from representations trained on categorical emotion labels. This might be partially due to a suboptimal finetuning strategy for an already finetuned model. However, this is beyond the scope of this work, and we plan to investigate how to finetune a finetuned model effectively in future work.
6 Qualitative Examples
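Table 3. Example predictions from our model trained on SemEval.
| Tweet | Categorical label | Nearest neighbors from VAD scores |
| --- | --- | --- |
| | joy, optimism | |
| | anger, disgust | |
| you begin to irritate me, primitive | anger, disgust | |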
In Table 3, we show example predictions from our model trained on SemEval. The table presents annotated tweets from the SemEval test set with the corresponding predicted categorical labels and the top 5 nearest neighbor emotional words with respect to the predicted VAD scores. For these 5 tweets, our model correctly predicted the categorical emotion labels. Below, we describe how we find the nearest neighbor words from the VAD scores.
Given the VAD scores predicted by our model, we find nearest neighbor words for those scores using the NRC-VAD Lexicon Mohammad (2018). We first rescale our model's predicted VAD scores to the range from 0 to 1 in each VAD dimension, since the lexicon values range from 0 to 1. To do this, we predict VAD scores for every sentence in the SemEval test set and then rescale the scores in each dimension as $(x - \min(x)) / (\max(x) - \min(x))$, which makes all dimensions have scores from 0 to 1.
Next, we find nearest neighbor words using the rescaled VAD values. We compute the Euclidean distances between these values and all words in the NRC-VAD Lexicon, and pick the 5 words with the smallest distances. We present the words in the right column of Table 3. These words help us understand the VAD scores more intuitively, and they could further be regarded as automatically generated emotional annotations for a given sentence. In other words, by finding nearest neighbor words in the VAD space, our model can predict emotion labels that were not seen during training.
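The lookup can be sketched as below. This is an illustration under stated assumptions: the rescaling is taken to be per-dimension min-max normalization over all predictions, the raw prediction values are made up, and the lexicon holds only the two entries quoted earlier in the text (joy and sad) rather than the full NRC-VAD Lexicon. The function names are hypothetical.

```python
import math


def min_max_rescale(rows):
    """Rescale each column of rows to [0, 1] using its own min and max."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((x - lo) / (hi - lo) if hi > lo else 0.0
              for x, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def nearest_words(point, lexicon, k=5):
    """Return the k lexicon words closest to point in Euclidean distance."""
    return sorted(lexicon, key=lambda word: math.dist(point, lexicon[word]))[:k]


# Two lexicon entries quoted in the text: (valence, arousal, dominance).
lexicon = {"joy": (0.980, 0.824, 0.794), "sad": (0.225, 0.333, 0.149)}

# Made-up raw VAD predictions for three sentences.
raw = [(0.9, 0.7, 0.7), (0.3, 0.4, 0.2), (0.6, 0.5, 0.45)]
rescaled = min_max_rescale(raw)

print(nearest_words(rescaled[0], lexicon, k=1))  # ['joy']
print(nearest_words(rescaled[1], lexicon, k=1))  # ['sad']
```

With the full lexicon, the same call with `k=5` would return the five candidate words per sentence that the table reports.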
The five examples in Table 3 show that our model can predict categorical emotion labels and also find suitable emotional words for a given sentence. In particular, for the fifth tweet, our model assigned depressive words (hopelessness, dead) to the given sentence, so the approach might be extended to detect signs of risk in social media posts from people in need.
7 Related Work
VAD Dimensions of Emotions. Research on models of emotion representation has a long history in psychology. The categorical model of emotion assumes that discrete categories represented by emotion words compose the building blocks of human emotion. Supporting evidence includes the six basic emotions Ekman (1992) and findings of universally adaptive emotions Plutchik (1980). Alternatively, the dimensional model of emotion aims to explain how people conceptualize emotional feelings. Osgood et al. (1957) suggested initial ideas of emotion coordinates. Russell and Mehrabian (1977) further constructed the Pleasure (or Valence)-Arousal-Dominance (PAD, VAD) model, a semantic scale model for rating emotional states that represents an emotional state as a set of orthogonal coordinates on the VAD dimensions. The absolute values of the intercorrelations among the three scales show considerable independence among the scales Russell and Mehrabian (1977). Categorical emotion states can be represented in the three-dimensional (VAD) emotion space. Based on these emotional dimensions, word-level VAD annotations of English words have been created Bradley and Lang (1999); Warriner et al. (2013). Recently, a large-scale VAD annotation of English words was developed Mohammad (2018), and we leverage these annotation scores to predict sentence-level VAD scores when training on datasets with categorical emotion annotations.
Emotional Distribution Learning. Instead of predicting multiple emotion labels from text, learning the emotion distribution itself from text has been proposed Deyu et al. (2016). This approach maps text to an emotion distribution and the respective intensities, incorporating Plutchik's wheel of emotions. Furthermore, distribution learning can be extended to emotion ranking Zhou et al. (2018). Unlike previous approaches, our model learns decomposed emotional distributions, namely the valence, arousal, and dominance distributions of emotions.
8 Discussion and Conclusions
We propose learning to predict VAD scores from text with categorical emotion annotations. Our framework predicts VAD score distributions for a given text rather than classification probabilities for each class, by minimizing the EMD between the predicted VAD distributions and the sorted label distributions, which serve as a proxy for the target VAD distributions.
Learning conditional VAD distributions enables predicting categorical emotion classes and continuous VAD scores simultaneously. By finetuning pretrained BERT-Large on SemEval, our approach shows comparable performance on the categorical emotion classification task and significant positive correlations with target VAD scores, even without supervision of VAD scores. When our model continues supervised training on VAD labels, it outperforms state-of-the-art VAD regression models. The ablation study shows that this is due both to the strength of the multi-layer Transformer architecture and to the effective initialization provided by starting finetuning from our model for VAD score prediction. We further find nearest neighbor words from the predicted VAD scores of our model, which suggests that our model can automatically generate, for an input sentence, emotion labels that were not seen during training.
We hope our framework will help researchers build human-annotated sentence-level VAD emotion datasets by providing machine-annotated VAD scores as a starting point, or serve directly as a VAD score prediction model. Most languages other than English lack corpora with VAD annotations, so our model could help build multilingual resources from multilingual corpora with categorical emotion labels Öhman et al. (2018). Future work will also focus on developing a model that gives more sensible VAD scores without VAD annotations.
References
Akhtar et al. (2019). All-in-one: emotion, sentiment and intensity prediction using a multi-task ensemble framework. IEEE Transactions on Affective Computing.
Alm et al. (2005). Emotions from text: machine learning for text-based emotion prediction. In EMNLP.
Aman and Szpakowicz (2007). Identifying expressions of emotion in text. In TSD.
Baziotis et al. (2018). NTUA-SLP at SemEval-2018 task 1: predicting affective content in tweets with deep attentive RNNs and transfer learning. In SemEval.
Bradley and Lang (1999). Affective norms for English words (ANEW): instruction manual and affective ratings. Technical report, Citeseer.
Buechel and Hahn (2016). Emotion analysis as a regression problem: dimensional models and their implications on emotion representation and metrical evaluation. In Proceedings of the Twenty-second European Conference on Artificial Intelligence, pp. 1114–1122.
Buechel and Hahn (2017). EMOBANK: studying the impact of annotation perspective and representation format on dimensional emotion analysis. In EACL.
Desmet and Hoste (2013). Emotion detection in suicide notes. Expert Syst. Appl. 40 (16), pp. 6351–6358.
Devlin et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. In WMT.
Deyu et al. (2016). Emotion distribution learning from texts. In EMNLP.
Ekman (1992). An argument for basic emotions. Cognition & Emotion 6 (3-4), pp. 169–200.
Hou et al. (2017). Squared earth mover's distance loss for training deep neural networks on ordered-classes. In NIPS.
Kiritchenko and Mohammad (2017). Best-worst scaling more reliable than rating scales: a case study on sentiment intensity annotation. In Proceedings of the 55th Annual Meeting of the ACL.
Li et al. (2017). DailyDialog: a manually labelled multi-turn dialogue dataset. arXiv preprint arXiv:1710.03957.
Mohammad et al. (2018). SemEval-2018 Task 1: Affect in tweets. In SemEval.
Mohammad (2012). #Emotional tweets. In SemEval.
Mohammad (2018). Obtaining reliable human ratings of valence, arousal, and dominance for 20,000 English words. In Proceedings of the 56th Annual Meeting of the ACL.
Öhman et al. (2018). Creating a dataset for multilingual fine-grained emotion-detection using gamification-based annotation. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis.
Osgood et al. (1957). The measurement of meaning. University of Illinois Press.
Plutchik (1980). A general psychoevolutionary theory of emotion. In Theories of emotion, pp. 3–33.
Plutchik (2001). The nature of emotions: human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. American Scientist 89 (4), pp. 344–350.
Preoţiuc-Pietro et al. (2016). Modelling valence and arousal in Facebook posts. In Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 9–15.
Russell and Mehrabian (1977). Evidence for a three-factor theory of emotions. Journal of Research in Personality 11 (3), pp. 273–294.
Sahana and Girish (2015). Automatic drug reaction detection using sentimental analysis. International Journal of Advanced Research in Computer Engineering & Technology (IJARCET) 4 (5).
Scherer and Wallbott (1994). Evidence for universality and cultural variation of differential emotion response patterning. Journal of Personality and Social Psychology 66 (2), pp. 310.
Schuff et al. (2017). Annotation, modelling and analysis of fine-grained emotions on a stance and sentiment detection corpus. In WASSA, pp. 13–23.
Shahraki and Zaiane (2017). Lexical and learning-based emotion mining from text. In CICLing.
Sintsovaa and Musata (2013). Fine-grained emotion recognition in olympic tweets based on human computation. In WASSA.
Torre and Lieberman (2018). Putting feelings into words: affect labeling as implicit emotion regulation. Emotion Review 10 (2), pp. 116–124.
Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
Warriner et al. (2013). Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods 45 (4), pp. 1191–1207.
Wu et al. (2019). Semi-supervised dimensional sentiment analysis with variational autoencoder. Knowledge-Based Systems 165, pp. 30–39.
Yu et al. (2016). Building Chinese affective resources in valence-arousal dimensions. In Proceedings of the 2016 Conference of the NAACL.
Zhang et al. (2018). Text emotion distribution learning via multi-task convolutional neural network. In IJCAI.
Zhou et al. (2018). Relevant emotion ranking from text constrained with emotion relationships. In NAACL.
Zhu et al. (2019). Adversarial attention modeling for multi-dimensional emotion regression. In Proceedings of the 57th Annual Meeting of the ACL, Florence, Italy.