Social media have strongly changed the way we interact with each other and how we share content. Many people use social networks to publish details of their daily lives, their opinions and their thoughts. These data represent a precious source of information for building large datasets of annotated multimedia content, or for mining users’ behaviours and other user-related information.
A valuable feature for every modern system that interacts with humans is understanding the emotional state of users. Conversational systems can adapt their language according to the perceived user emotions, digital marketing platforms can customize recommendations, and social media marketing strategies can be changed according to the estimated emotions triggered when posting content. If we restrict our attention to the case of text, emotion detection is a widely studied and still challenging task [19, 10, 1, 12, 8]. In the case of categorical emotion detection, sentences are usually classified into the six universal emotions defined by Ekman, namely anger, disgust, fear, happiness, sadness, and surprise.
This paper is rooted in the connections between the task of emotion detection and social media data. There is an intrinsic link between certain categories of tags attached to user posts and the emotional state of the users that participate in the tagging process. We focus on the case of Facebook, where users can express their feelings about a post through the so-called “reactions”, namely LOVE, HAHA, WOW, SAD, ANGRY, together with the widely known LIKE. While LIKE represents a universal and generic expression of positive feedback, the other reactions are more fine-grained, and somewhat related to the aforementioned categories of emotions. However, this relationship is weak and distant, since some reactions can only be loosely associated with emotional categories, sometimes with large ambiguity. For example, WOW expresses “surprise” but it can also be used to describe content where the astonishment is accompanied by “fear”. Moreover, Facebook reactions are the outcome of a tagging process where users might follow superficial and strongly subjective criteria to react.
Recently, Facebook reactions have been studied in the context of emotion detection. Some authors trained emotion classification models using Facebook reactions, while others tried to learn to predict Facebook reactions in a given domain, bootstrapping the system with the outcome of emotion mining. Reactions are usually manually mapped to (a subset of) the aforementioned universal emotions, providing a form of distant supervision. Differently, the task of emotion detection from text has been the subject of a large number of studies, mostly distinguished into lexicon-based and machine learning-based approaches (or hybrid solutions). Lexicon-based approaches employ linguistic models or prior knowledge for the classification task, and they essentially give a score to a sentence using a predefined sentiment lexicon, without using labeled data. One such work proposes an unsupervised context-based emotion detection method that does not rely on any affect dictionaries or annotated training data, while another presents a constraint optimization framework based on a lexicon. Machine learning-based methods usually exploit supervised learning algorithms trained on annotated corpora. One approach focuses on Twitter data, while another uses a heterogeneous emotion-annotated dataset to recognize the six basic emotions. Finally, a further approach relies on an ensemble model, strongly exploiting pre-trained, dense word-embedding representations.
In this paper we propose a neural network-based model to jointly learn the task of emotion detection and the task of predicting Facebook reactions. Our model consists of a bidirectional Long Short-Term Memory (LSTM) recurrent neural network [17, 9] to encode the input sentence, and two predictors associated with the considered tasks. Predictors are not independent, but are linked by prior knowledge on the relationships between the tasks. Such knowledge is represented by First-Order Logic (FOL) formulas, which allow us to naturally express how reactions are connected to emotion classes and vice versa. Following the framework of Learning from Constraints, FOL formulas are converted into polynomial constraints and softly enforced in the learning problem, thus tolerating some violations. The system automatically learns “how” to fulfil the FOL formulas according to the way the data are distributed. Our model is trained using a heterogeneous dataset composed of data labeled with emotion classes, Facebook posts that include user reactions, and a large collection of unlabeled posts. We do not use any external lexical resources, and an extended experimental analysis shows that the tasks of emotion classification and reaction prediction can both benefit from their interaction. The resulting emotion detector is competitive with some models that exploit lexical resources or ad-hoc features, and we also investigate the role of pre-trained word embeddings.
2 Model and Data Organization
We consider a multi-task setting where two predictors operate on the same data, that is, a short input text. These predictors are associated with the tasks of reaction classification and emotion classification, respectively. In the context of this paper, both tasks consist in predicting the most dominant reaction/emotion when processing a text. (Some approaches consider these tasks as multi-label prediction problems [10, 19], while other authors focus on the most dominant response, as we do in this paper. What we propose can be adapted to the case of multi-label prediction.) In detail, the reaction predictor outputs a probability distribution over the five reaction classes and, analogously, the emotion predictor outputs a probability distribution over the six classes of emotions. We select the emotion-reaction pair associated with the largest probabilities.
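The selection of the dominant emotion-reaction pair can be illustrated with a minimal sketch. The raw scores, the function names and the ordering of the class lists are assumptions made for this example only; the two score vectors stand in for the outputs of the two predictors before the softmax.

```python
import math

# Class lists taken from Section 1; their ordering here is an assumption.
REACTIONS = ["LOVE", "HAHA", "WOW", "SAD", "ANGRY"]
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]


def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dominant_pair(reaction_scores, emotion_scores):
    """Return the (reaction, emotion) pair with the largest probabilities."""
    p_r = softmax(reaction_scores)
    p_e = softmax(emotion_scores)
    reaction = REACTIONS[max(range(len(p_r)), key=p_r.__getitem__)]
    emotion = EMOTIONS[max(range(len(p_e)), key=p_e.__getitem__)]
    return reaction, emotion
```

Since the softmax is monotonic, the same pair would be obtained by taking the argmax of the raw scores; the probabilities are kept here because they are what the logic constraints of Section 3 operate on.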
Following the classical pipeline of several machine learning-based approaches in Natural Language Processing, the input text is tokenized into words belonging to a fixed-size vocabulary. Each word is embedded into a learnable latent dense representation, also known as “word embedding”, and a Long Short-Term Memory (LSTM) recurrent neural network processes the sequence of word embeddings in both directions (Bidirectional Recurrent Neural Network (BRNN)). The forward and backward states are then concatenated, producing a latent representation of the input text that feeds the two predictors, each of them implemented by a Multi-Layer Perceptron (MLP). The choice of sharing the same latent representation of the input between both predictors is due to the fact that the two prediction tasks are certainly correlated. Finally, during the training stage, the MLPs are connected by constraints that are derived from FOL rules, and that will be described in Section 3. The whole architecture is reported in Fig. 1.
Our model is trained using a heterogeneous collection of partially labeled and unlabeled text, composed of the union of three disjoint sets that, in turn, consist of pairs of a text and a label, where the label is either a reaction label, an emotion label, or a dummy placeholder (i.e., unlabeled data), respectively. The first set is a collection of Facebook posts, each of them labeled with one of the reaction classes (listed in Section 1), encoded with a one-hot vector. We did not consider the class LIKE, since it is too generic, and we selected the most frequent reaction class in each post. Moreover, this set is composed only of those posts with at least a minimum number of reaction hits in total, and where the most frequent reaction has a number of hits that is greater than the number of hits of all the other reactions scaled by a fixed factor. The second set is a collection of sentences, each of them labeled with one of the universal emotions (see Section 1), encoded with a one-hot vector. (In our experiments we did not consider the neutral class, which, however, could be easily introduced in the proposed model.) We exploited existing databases to build this set (see Section 4), keeping only the most dominant emotion in the case of multi-labeled data. Finally, the third set is a collection of unlabeled text that, in our experiments, consists of a large collection of Facebook posts without reactions. Each sample is paired with a dummy label vector. This set is exploited to enforce the logic constraints (Section 3) in space regions that are not covered by the labeled portion(s) of the training set. This allows the model to learn predictors that better generalize the information associated with the logic formulas. A sketch that summarizes the types of training data used in this paper is reported in Fig. 2.
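The labeling rule for the reaction-labeled posts can be sketched as follows. This is only an illustration under stated assumptions: `min_hits` and `factor` are placeholder parameters (no specific values are assumed here), the total is computed over the five non-LIKE reactions, and "all the other reactions" is read as the sum of their hits.

```python
def dominant_reaction(hits, min_hits, factor):
    """Return the reaction label of a post, or None if the post is discarded.

    hits     : dict mapping reaction name -> number of hits
    min_hits : hypothetical minimum number of (non-LIKE) reaction hits
    factor   : hypothetical scaling factor applied to the other reactions
    """
    # LIKE is ignored because it is too generic.
    counts = {r: c for r, c in hits.items() if r != "LIKE"}
    total = sum(counts.values())
    if not counts or total < min_hits:
        return None
    top = max(counts, key=counts.get)
    others = total - counts[top]
    # Keep the post only if the top reaction clearly dominates
    # (assumption: compared against the summed hits of the other reactions).
    if counts[top] > factor * others:
        return top
    return None
```

For instance, with `min_hits=30` and `factor=2.0`, a post with 40 LOVE, 5 HAHA and 5 WOW hits would be labeled LOVE, while a post with 10 LOVE and 9 HAHA hits would be discarded.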
3 Multi-Task Learning with Constraints
Before introducing the approach that we propose in this paper, we mention that the simplest way to bridge the tasks of emotion and reaction classification is to generate artificial labels, i.e., to define a fixed mapping between emotions and reactions and augment the training data with these new labels (a strategy already adopted in the literature). Considering the emotion/reaction classes of Section 2, a reasonable mapping from reactions to emotions, represented with the notation “ground truth” → “new label”, is the following one: LOVE → happiness, WOW → surprise, HAHA → happiness, SAD → sadness, ANGRY → anger. Similarly, we can map emotions to reactions: anger → ANGRY, disgust → ANGRY, fear → WOW, happiness → HAHA, sadness → SAD, surprise → WOW. However, this manual conversion is rigid and sometimes ambiguous. For example, no reactions are converted into labels of the classes fear and disgust, and no emotions are mapped into the reaction LOVE.
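The fixed mapping above can be written down directly, which also makes its blind spots easy to check. The sketch below is illustrative: the dictionaries reproduce the mapping stated in the text, while the sample format (dicts with `text`, `reaction` and `emotion` keys) and the function names are assumptions.

```python
REACTION_TO_EMOTION = {
    "LOVE": "happiness", "WOW": "surprise", "HAHA": "happiness",
    "SAD": "sadness", "ANGRY": "anger",
}
EMOTION_TO_REACTION = {
    "anger": "ANGRY", "disgust": "ANGRY", "fear": "WOW",
    "happiness": "HAHA", "sadness": "SAD", "surprise": "WOW",
}


def add_artificial_labels(samples):
    """Add the missing label of each sample using the fixed mapping."""
    augmented = []
    for sample in samples:
        sample = dict(sample)  # do not modify the caller's data
        if "reaction" in sample and "emotion" not in sample:
            sample["emotion"] = REACTION_TO_EMOTION[sample["reaction"]]
        elif "emotion" in sample and "reaction" not in sample:
            sample["reaction"] = EMOTION_TO_REACTION[sample["emotion"]]
        augmented.append(sample)
    return augmented


def uncovered(mapping, classes):
    """Classes that are never produced as an artificial label."""
    return sorted(set(classes) - set(mapping.values()))
```

Calling `uncovered(REACTION_TO_EMOTION, EMOTION_TO_REACTION.keys())` returns the emotions that never receive an artificial label, and `uncovered(EMOTION_TO_REACTION, REACTION_TO_EMOTION.keys())` does the same for reactions, which mirrors the rigidity discussed above.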
We propose to describe the mappings between emotion and reaction classes using FOL formulas and to develop a multi-task system that learns from them, following the framework of Learning from Constraints [5, 6, 7]. Each class is associated with a predicate, whose truth degree is computed using a function that, for simplicity, we indicate with the name of the class itself. These predicates can be seen as the components of the vectorial functions implemented by the two predictors. We define a set of rules, each being an implication between reaction and emotion predicates.
Notice that these rules do not include negations, due to the probabilistic relationship (softmax) that we introduced in the output of the predictors (if a function goes toward 1, all the others will automatically go toward 0). (We did not write the rules in a more compact form using the double implication $\Leftrightarrow$, since we weigh the impact of some of them differently, as will be clear shortly.)
We defined our FOL formulas after having analyzed the content of various Facebook posts and the associated reactions. Implications 3, 5 and 9 include an ambiguous mapping, modeled with the operator $\lor$ (disjunction). The second predicate that we reported in each disjunction corresponds to a less trivial mapping that, at first glance, might not always seem obvious. However, in our experience, we found these cases to be more frequent than expected. We report an example for each of them: WOW could be fear instead of surprise (Eq. 5),
Snake on a plane: Frightening moment on an Aeromexico flight when a large snake fell from overhead mid-flight. The flight made a quick landing and animal control took the stowaway into custody.
Emotion happiness could be converted into LOVE instead of HAHA (Eq. 9),
When I got a wedding ring of diamond from the boy I loved.
The reaction ANGRY could also be mapped into disgust (Eq. 3),
The San Antonio police chief said that former officer Matthew Luckhurst committed a vile and disgusting act that violates our guiding principles.
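The three ambiguous implications illustrated by the examples above can be written compactly as follows. This is a sketch based only on the textual description: the equation numbers are those used in the text, while the input symbol $x$ and the order of the disjuncts are assumptions.

```latex
\begin{align}
\mathrm{ANGRY}(x) &\Rightarrow \mathrm{anger}(x) \lor \mathrm{disgust}(x) \tag{3}\\
\mathrm{WOW}(x) &\Rightarrow \mathrm{surprise}(x) \lor \mathrm{fear}(x) \tag{5}\\
\mathrm{happiness}(x) &\Rightarrow \mathrm{HAHA}(x) \lor \mathrm{LOVE}(x) \tag{9}
\end{align}
```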
Our rules are converted into real-valued polynomials by means of T-Norms [5, 6], which are functions modeling the logical AND whose output is in $[0, 1]$. We used the Product T-Norm, where the logical AND is simply the product of the involved arguments. In turn, this choice transforms an implication $a \Rightarrow b$ into the polynomial $1 - a(1 - b)$ (see [5, 6] for further details). Constraining the FOL formula to hold true leads to enforcing the T-Norm-based polynomials to be $1$, so we get equality constraints, e.g., $1 - a(1 - b) = 1$ in the previous example. We introduce these constraints into the learning problem in a soft manner using penalty functions, so that the system might decide to violate some of them for some inputs.
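A minimal sketch of this conversion is given below, assuming the standard negation $1 - a$ and rewriting an implication $a \Rightarrow b$ as $\lnot(a \land \lnot b)$. The function names are hypothetical, and the linear penalty `1 - truth` is an illustrative assumption, not necessarily the penalty adopted in our implementation.

```python
import math


def t_and(*args):
    """Product T-Norm: the logical AND is the product of its arguments."""
    return math.prod(args)


def t_not(a):
    """Negation of a truth degree in [0, 1]."""
    return 1.0 - a


def t_or(*args):
    """Disjunction obtained from AND and NOT via De Morgan's law."""
    return t_not(t_and(*(t_not(a) for a in args)))


def implies(a, b):
    """a => b rewritten as not(a and not b), i.e. 1 - a * (1 - b)."""
    return t_not(t_and(a, t_not(b)))


def penalty(truth):
    """Illustrative soft penalty: zero when the formula is fully true."""
    return 1.0 - truth


def wow_rule(wow, surprise, fear):
    """Truth degree of WOW => surprise OR fear for one input text."""
    return implies(wow, t_or(surprise, fear))
```

For example, `implies(0.8, 0.5)` evaluates to about 0.6, and a disjunctive rule such as `wow_rule(0.9, 0.5, 0.5)` evaluates to about 0.775, i.e., $1 - a(1 - b)(1 - c)$; the rule is fully satisfied (penalty 0) whenever the premise is 0 or one of the conclusions is 1.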
Formally, the multi-task function that we minimize to learn the model is the sum of the cross-entropy losses of the two classification tasks, computed on the respective labeled data, and of the penalty terms associated with the FOL formulas, each weighed by its own scalar (we avoid reporting the scaling factors in front of each term of the summation, to keep the notation simpler). Each penalty term might only consider some of the output components of the two predictors, depending on the FOL formula that it implements. Notice that FOL formulas are constrained to hold true on all the available training data, including the large collection of unlabeled text. This allows the system to learn predictors that fulfil the FOL rules in regions of the input space that might not be covered by the labeled data, thus increasing the information transfer between the two tasks (as typically done in the framework of Learning from Constraints [5, 6]). Thanks to this formulation, we can weigh the impact of each constraint differently according to the confidence we have in it, by tuning the corresponding weights. For example, the constraints associated with Formulas 4, 7 and 8 are weaker than the other ones, and we decided to keep their weight small.
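One way to write down an objective with the structure described above is sketched below. All symbol names are assumptions introduced for illustration: $\mathcal{R}$, $\mathcal{E}$ and $\mathcal{U}$ denote the reaction-labeled, emotion-labeled and unlabeled sets, $p_r$ and $p_e$ the reaction and emotion predictors, $L$ the cross-entropy loss, $\phi_k$ the penalty of the $k$-th FOL formula and $\lambda_k$ its weight; scaling factors are omitted.

```latex
\begin{equation}
J(p_r, p_e) =
\sum_{(x,y)\in\mathcal{R}} L\big(p_r(x), y\big)
+ \sum_{(x,y)\in\mathcal{E}} L\big(p_e(x), y\big)
+ \sum_{k} \lambda_k \sum_{x \in \mathcal{R}\cup\mathcal{E}\cup\mathcal{U}}
  \phi_k\big(p_r(x), p_e(x)\big)
\end{equation}
```

In this form, setting every $\lambda_k$ to zero decouples the two tasks except for the shared encoder, while the last term is the only one that involves the unlabeled data.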
4 Experimental Results
In order to evaluate the proposed model, we created a heterogeneous data collection that follows the organization described in Section 2. In particular, we considered a large public dataset of Facebook posts scraped from the Facebook pages of newspapers (https://data.world/martinchek/2012-2016-facebook-posts). Data was filtered according to what we described in Section 2, and a portion of the resulting posts was left unlabeled. Then, we collected the most popular datasets containing text labeled with emotions, namely Affective Text, ISEAR, and Fairy Tales. Affective Text (SemEval-2007) contains 1,250 short newspaper headlines. Sentences are labeled with the six basic emotions, and each of them is scored in a range from 0 to 100. For the purpose of this experimentation, we took the emotion with the highest score. ISEAR (International Survey on Emotion Antecedents and Reactions) contains 7,666 sentences from questionnaires about emotional experiences covering anger, disgust, fear, joy, sadness, shame, guilt. We discarded the last two classes since they are not part of the universal emotions, and mapped “joy” to “happiness” (the class “surprise” is missing). Fairy Tales contains sentences belonging to short stories, annotated with multiple labels. We discarded the neutral class and we kept only sentences with four identical labels (three for the class “disgust”, due to the small number of samples). In Table 1 we report the details of the data exploited in this paper.
We evenly divided our heterogeneous datasets into 3 splits, keeping the original data distribution among classes. Each split is further divided into training, validation and test sets, with special attention in preparing the test data. In particular, the test set is composed of 15% of the labeled Facebook posts, merged with one of ISEAR, Fairy Tales, Affective Text. As a matter of fact, each of these emotional datasets is small (considering the number of classes and the intrinsic difficulty of the learning task), and it has different properties with respect to the other two. We observed that training and testing on subportions of the same emotional dataset leads to performance that does not reflect the actual quality of the system when it is deployed and tested in a generic context. Differently, training and testing on different emotional datasets offers a more realistic perspective of the generalization quality of the resulting system. The training set includes 70% of the labeled Facebook posts and 80% of the two emotional datasets that are not present in the test set, plus the unlabeled Facebook posts. The validation set is composed of the remaining data, that is, 15% of the labeled posts and 20% of the two emotional datasets that are not used as test set. We preprocessed all the data by converting text to lowercase, removing URLs, standardizing numbers with a special token, removing brackets, and separating punctuation and hashtags. Then, we created a vocabulary composed of the most frequent words and we truncated sentences longer than 30 words, to make them more easily manageable by the BRNN.
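The preprocessing steps listed above can be sketched with a few regular expressions. This is an illustrative approximation, not the exact pipeline: the number token `<num>`, the set of punctuation marks, and the interpretation of "separating hashtags" as splitting the `#` sign from the following word are all assumptions.

```python
import re

NUM_TOKEN = "<num>"  # hypothetical special token for numbers
MAX_WORDS = 30       # truncation length stated in the text


def preprocess(text, max_words=MAX_WORDS):
    """Lowercase, clean and tokenize a post, then truncate it."""
    text = text.lower()
    # Remove URLs.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Standardize numbers (including forms such as 1,250 or 3.5).
    text = re.sub(r"\d+(?:[.,]\d+)*", f" {NUM_TOKEN} ", text)
    # Remove brackets.
    text = re.sub(r"[()\[\]{}]", " ", text)
    # Separate the hashtag sign from the word that follows it.
    text = re.sub(r"#(\w+)", r" # \1 ", text)
    # Separate common punctuation marks from words.
    text = re.sub(r"([.,!?;:\"])", r" \1 ", text)
    return text.split()[:max_words]
```

The order of the substitutions matters in this sketch: URLs are removed before numbers and punctuation are handled, so that digits and dots inside a link do not leave stray tokens behind.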
We evaluated architectures with differently sized word embeddings (from 50 to 300 units each), states of the BRNN, and hidden layers (and number of units) of the final MLPs. After a first exploratory experimentation, we focused on models with word embeddings of size 100, a BRNN with a hidden state composed of 100 units and final predictors with no hidden layers, which provided the best results on the validation data. Then, we validated in more detail all the other model parameters (learning rate, the possibility of introducing drop-out right after the BRNN, weight of the logic constraints). We considered the (macro) F1 scores on each task to evaluate the quality of our models, and we stopped the training procedure early whenever the average F1 score on the validation data did not increase for a fixed number of epochs (keeping the model associated with the best F1 score found so far).
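The evaluation measure and the stopping rule can be sketched in plain Python. The function names and the `patience` parameter are assumptions for illustration; in the setting described above, the monitored validation score would be the average of the macro F1 of the two tasks.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def best_epoch(val_scores, patience):
    """Index of the epoch whose model is kept under early stopping.

    Training stops once the validation score has not improved for
    `patience` consecutive epochs (hypothetical parameter).
    """
    best, best_idx = float("-inf"), -1
    for i, score in enumerate(val_scores):
        if score > best:
            best, best_idx = score, i
        elif i - best_idx >= patience:
            break
    return best_idx
```

With `patience=2`, the validation sequence `[0.40, 0.45, 0.44, 0.43, 0.50]` stops after the fourth epoch and keeps the model of the second one (index 1), so the later peak is never reached; a larger patience trades training time for a lower risk of stopping too early.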
We compared the following models:
- Plain: the model of Fig. 1, without logic constraints.
- Constr: the same as Plain, but including logic constraints.
- Artificial: the same as Plain, where the training data is augmented with artificially mapped classes as described at the beginning of Section 3.
- +Emb: variant of the models above, based on pre-trained word embeddings (the popular Google word2vec model). In this case, after our initial exploratory experimentation, we selected a different BRNN state size and a reaction predictor with a hidden layer.
We first evaluate the quality of the system in the task of reaction prediction. In Table 2, we can appreciate how introducing logic constraints consistently improves the quality of the predictor in all the reaction classes. Using artificial labels from emotional data is far from giving the same benefits as logic constraints, and we did not observe advantages in using pre-trained word embeddings, which might be due to the inherent noise in the reaction prediction task.
Table 2: F1 scores on reaction prediction. The first five columns are the per-class scores and the last column is their average (macro F1).
| Plain | 0.630 (0.009) | 0.354 (0.008) | 0.440 (0.009) | 0.532 (0.014) | 0.329 (0.012) | 0.457 (0.007) |
| Constr | 0.639 (0.162) | 0.371 (0.013) | 0.443 (0.003) | 0.535 (0.005) | 0.347 (0.007) | 0.467 (0.007) |
| Artificial | 0.596 (0.051) | 0.324 (0.015) | 0.393 (0.028) | 0.451 (0.077) | 0.303 (0.030) | 0.413 (0.038) |
| Plain+Emb | 0.614 (0.019) | 0.343 (0.014) | 0.425 (0.012) | 0.531 (0.007) | 0.345 (0.013) | 0.452 (0.006) |
| Constr+Emb | 0.638 (0.007) | 0.347 (0.003) | 0.437 (0.005) | 0.538 (0.012) | 0.356 (0.009) | 0.463 (0.003) |
| Artif.+Emb | 0.608 (0.031) | 0.323 (0.006) | 0.375 (0.031) | 0.446 (0.070) | 0.311 (0.002) | 0.412 (0.030) |
Moving to the task of emotion classification, we report the results we obtained on the previously described test sets, which correspond to three different emotional datasets. In Table 3 we focus on testing on the ISEAR data. Logical rules always allow the model to improve the macro-averaged F1 scores. We notice that the F1 score on the “disgust” and “fear” classes is considerably better than when not using constraints. In fact, without exploiting the logical rules of Eq. 3 and 5 there is no transfer of information from reaction data, and the supervised portion of the training set is not enough to learn good predictors. Interestingly, this consideration does not hold when using pre-trained embeddings, where the performance of the unconstrained model is already close to that of the constrained one. In this case, all the other classes are improved instead. Finally, artificial labels do not seem to be a promising solution.
The results on the Fairy Tales test data are shown in Table 4, still confirming the improvements introduced by constraints in the average case. Since “surprise” is poorly represented in the labeled portion of the training set (it is not included in the ISEAR data), results in this class are rather low. While artificial labels help in “surprise”, they sometimes lead to very bad results. This is even more evident when using pre-trained embeddings, where the system consistently overfits the training data. Notice that the F1 scores on the validation splits were very promising when using such embeddings, but, as we mentioned when describing the experimental setting, the system generalizes badly to out-of-sample data that is related to, but not fully coherent with, the training (validation) sets.
In the case of the Affective Text test data (Table 5) constraints still increase the macro F1, but not when using pre-trained embeddings. We observe a less coherent behaviour with respect to the previous test sets, and this is due to the fact that Affective Text is composed of sentences that are significantly shorter than the ones of the other datasets, and that are evocative of multiple emotions among which it is harder to distinguish the most dominant one.
In Fig. 3 we report precision and recall (averaged on the test splits, when needed) associated with the results of Tables 2, 3, 4 and 5. When predicting reactions and using constraints, we observe improvements in both precision and recall in the case of 3 out of 5 classes. When predicting emotions, improvements are usually either in terms of precision or in terms of recall (we count a similar number of cases in which precision is improved and cases in which recall is improved).
Comparing our experimental analysis with the existing literature on emotion detection is not straightforward. Existing approaches make use of lexical resources or focus on settings that are rather different from the one we selected (they test on splits that are taken from the same emotional dataset, thus providing better results [20, 12]). However, we found that, in some cases, our model is competitive with popular algorithms. Table 6 reports the F1 scores of existing models, emphasizing the cases in which our Constr+Emb outperforms them. In Affective Text, we compared with the WN-AFFECT system (based on WordNet Affect), and a model based on LSA to compute representations of emotion words (even if they considered a multi-label learning problem). On the same data, as well as in ISEAR, we also considered the CNMF model, based on non-negative matrix factorization, which was evaluated on a subset of the emotions we considered in this paper. Finally, we compared with (what we refer to as) the Wikipedia model, which was trained on texts taken from Wikipedia and tested on the ISEAR data (and other datasets). (We did not consider Fairy Tales since existing approaches usually merge “anger” and “disgust”, and also because the sentence truncation strongly affected this dataset.)
In this paper we proposed to jointly learn the tasks of emotion classification and prediction of Facebook reactions, when processing raw text. While these tasks share several analogies, mapping emotion classes to Facebook reactions (and vice versa) can easily become ambiguous. Our system exploits First-Order Logic formulas to model the task relationships, and it learns from such formulas, also exploiting large collections of unlabeled training data. The provided experimental analysis has shown that bridging these two tasks by means of FOL-based constraints leads to improvements in the prediction quality that clearly go beyond more naive approaches in which artificial labels are generated in the data preprocessing stage. Our future work will focus on the introduction of lexical resources in our system.
This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 825619.
References
[1] (2012) Unsupervised emotion detection from text using semantic and syntactic relations. In Proceedings of the International Joint Conferences on Web Intelligence and Intelligent Agent Technology, pp. 346–353.
[2] (2008) Affect in text and speech. Ph.D. Thesis, University of Illinois.
[3] Using a heterogeneous dataset for emotion analysis in text. Canadian Conference on Artificial Intelligence, pp. 62–67.
[4] (1971) Constants across cultures in the face and emotion. Journal of personality and social psychology 17 (2), pp. 124.
[5] (2015) Foundations of support constraint machines. Neural computation 27 (2), pp. 388–480.
[6] (2013) Constraint verification with kernel machines. IEEE transactions on neural networks and learning systems 24 (5), pp. 825–831.
[7] (2018) The role of coherence in facial expression recognition. In International Conference of the Italian Association for Artificial Intelligence, pp. 320–333.
[8] (2017) Emotion detection from text via ensemble classification using word embeddings. In Proceedings of the International Conference on Theory of Information Retrieval, pp. 269–272.
[9] (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780.
[10] (2010) Evaluation of unsupervised emotion models to textual affect recognition. In NAACL HLT Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pp. 62–70.
[11] (2018) Social emotion mining techniques for facebook posts reaction prediction. In Proceedings of the 10th International Conference on Agents and Artificial Intelligence.
[12] (2012) Portable features for classifying emotional text. In Proceedings of the Conference of the NAACL HLT, pp. 587–591.
[13] (2016) Distant supervision for emotion detection using facebook reactions. In Proceedings of the Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media, pp. 30–39.
[14] (2014) Learning emotion indicators from tweets: hashtags, hashtag patterns, and phrases. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1203–1209.
[15] ASEDS: towards automatic social emotion detection system using facebook reactions. International Conference on High Performance Computing and Communications; on Smart City; on Data Science and Systems (HPCC/SmartCity/DSS), pp. 860–866.
[16] (1994) Evidence for universality and cultural variation of differential emotion response patterning. Journal of personality and social psychology 66 (2), pp. 310.
[17] (1997) Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45 (11), pp. 2673–2681.
[18] (2007) Semeval-2007 task 14: affective text. In Proceedings of the 4th International Workshop on Semantic Evaluations, pp. 70–74.
[19] (2008) Learning to identify emotions in text. In Proceedings of the ACM symposium on Applied computing, pp. 1556–1560.
[20] (2015) Detecting emotions in social media: a constrained optimization approach. In IJCAI, pp. 996–1002.