Deep learning, applied to ever larger datasets, has led to large improvements in performance in sentiment and emotion analysis. In light of this development, lexica, i.e., lists of words with associated weights for a particular affective variable, which used to be a key component for feature extraction [Mohammad17starsem], may seem obsolete. However, this is far from the truth.

(* These authors contributed equally to this work.)
Lexica can be used as features to improve performance for sentence-level emotion prediction even in advanced neural architectures [Mohammad17wassa, DeBruyne19arxiv]. Word ratings are also often used to refine pre-trained embedding models for specific tasks [Yu17emnlp, Khosla18coling]. More importantly, word ratings are relatively cheap to acquire and have been found to be robust across domains and even across languages, with respect to their translational equivalents [Leveau12, Warriner13]. This gives lexica a pivotal role in processing under-resourced languages. Perhaps most importantly, using lexica allows for interpretable models, since the resulting document-level predictions can easily be broken down to the words within a document. This gives lexica an important role in building justifiable AI and addressing related ethical challenges [clos2017towards]. Interpretability is also crucial for NLP use in other academic disciplines such as psychology, social science, discourse linguistics, and the digital humanities, where understanding the nature of “constructs” (as psychologists call them) such as emotions is far more important than making accurate predictions [schwartz2013toward, eichstaedt2015psychological, pennebaker2011, liu2016analyzing].
While lexica for many kinds of emotion already exist (see Section 2.1), there is no such resource for empathy despite its growing popularity in the NLP community [Khanpour17, Buechel18emnlp]. In fact, psychologists have tried to gather word ratings for empathy, but this task has proven difficult.
Hand-curated lexica for empathy are difficult to create in part because there is no clear set of words that can accurately distinguish empathy from self-focused distress. The gold standard for discerning these is an emotion rating scale by batson1987distress. This scale is a collection of emotion words (e.g., compassionate, tender, warm) that could serve as a rudimentary lexicon, but it contains many words that are rarely used (e.g., “perturbed”) and many words that can take on meanings far from empathy (e.g., “warm”, “tender”). These word-based scales have shown good reliability for self-report of the corresponding emotional states, but would make poor guides for a proper empathy lexicon.
In this paper, we construct the first empathy lexicon. Specifically, we learn ratings for two kinds of empathy—empathic concern (feeling for someone) and personal distress (suffering with someone)—for words, given existing document-level ratings from the recently published Empathic Reactions dataset [Buechel18emnlp]. We first train a model to predict document-level empathy in a regular supervised set-up and then "invert" the resulting model to derive word ratings. We conclude with an in-depth analysis of the resulting resource. (The empathy lexicon will be made publicly available upon acceptance of this paper.)
2 Related Work
2.1 Lexica for Psychological Quantities
The notion of describing (part of) a word’s meaning, such as the emotion typically associated with it, in terms of numerical ratings has a long tradition in psychology, dating back at least to osgood_measurement_1957. Today, many sets of word ratings exist, covering numerous constructs and languages, particularly relating to sentiment and emotion. Early work in NLP was mostly focused on positive-vs.-negative resources such as SentiWordNet and VADER [baccianella_sentiwordnet_2010, hutto_vader_2014]. In contrast, resources from psychologists tend to focus on valence and arousal (or other representations of affective states [Ekman92]). In particular, this includes the Affective Norms for English Words (ANEW; [Bradley99anew]), which have been adapted to many languages [Redondo07, Montefinese14], and their extension by Warriner13. Such lexica have recently become popular in NLP [Wang16acl, sedoc2017predicting, mohammad_obtaining_2018]. Lexica also exist for many other constructs, including concreteness/abstractness, familiarity, imageability, and humor [brysbaert_concreteness_2014, yee_valence_2017, engelthaler_humor_2017]. Yet, notably, an empathy lexicon is missing.
Psychologists use such lexica either for content analysis, most notably using the Linguistic Inquiry and Word Count (LIWC) lexica [tausczik2010psychological], or as controlled stimuli for experiments, e.g., on language processing and memory [hofmann_affective_2009, monnier_semantic_2008]. Applications of lexica in NLP have been discussed in Section 1.
Whereas most lexica are created manually, there is an extensive body of work on learning such ratings automatically (see kulkarni2019depth for a survey). Early work focused on deriving scores through linguistic patterns or statistical association with a small set of seed words [Hatzivassiloglou97, Turney03]. More recent approaches almost always rely on word embeddings [hamilton_inducing_2016, li_inferring_2017, Buechel18naacl]. This line of work is predominantly based on word-level supervision. In contrast, we learn word ratings from document-level ratings.
2.2 Empathy and Distress in Psychology
Empathy, the “reactions of one individual to the observed experiences of another” [davis1983], often in response to their need or suffering, is complex and controversial, with luminary scientists both arguing for the benefits of empathy [dewaal2009] and “against empathy” [bloom2014]. Empathy has been linked to a multitude of positive outcomes, from volunteering [batson1997], to charitable giving [pavey2012], and even longevity [poulin2018], but it can also cause the empathizing person increased stress [buffone2017] and emotional pain [chikovani2015]. In this paper, we build lexica of state (momentary) empathy and distress based on the subscales of the Interpersonal Reactivity Index (IRI) questionnaire [davis1980]. The scale's creator defines these as follows: Empathic Concern assesses “other-oriented” feelings of sympathy and concern for unfortunate others, and Personal Distress measures “self-oriented” feelings of personal anxiety and unease in tense interpersonal settings.
2.3 Empathy and Distress in AI
Most previous work in language-centered AI for empathy has focused on speech, especially spoken dialogue; conversational agents, psychological interventions, and call center applications have been addressed particularly often [Mcquiggan2007, Fung16, Perez-Rosas17, Alam17]. In contrast, studies addressing empathy in written language are surprisingly rare. One exception is Abdul2017icwsm, who focus on trait empathy, a temporally more stable personal attribute. In particular, they studied the detection of “pathogenic empathy”, marked by self-focused distress, a potentially detrimental form of empathy associated with health risks, in social media language, using a wide array of features including $n$-grams and demographic information.
Khanpour17 present a corpus of messages from online health communities with binary empathy annotations at the sentence level. They report an F-score of .78 using a CNN-LSTM. The corpus, however, is not publicly available. In contrast, Buechel18emnlp recently presented the first publicly available gold standard dataset supported by proper psychological theories. The dataset consists of responses to news articles, with empathy and distress rated on scales from 1 to 7; the ratings were collected from the writer of each informal message using a sophisticated annotation methodology borrowed from psychology. In this contribution, we build upon their work by using their document-level ratings to predict word labels.
2.4 Lexicon Learning from Document Labels
Few studies address learning word ratings based on document-level supervision. However, those studies (described in detail below) focus on their particular application rather than addressing the underlying, abstract learning problem (formalized in Section 3). As a result, previously proposed methods have not been quantitatively compared.
In an early study, mihalcea2006corpus computed the happiness factor
of a word type as the ratio of documents labeled “happy” to all blog posts it occurs in. Labels were given by the blog users. The resulting lexicon was used to estimate user happiness over the course of an average 24-hour day as well as a seven-day week. Rill12 independently came up with a very similar approach for identifying the evaluative meaning of adjectives and adjective phrases (absolutely fantastic vs. just awful) based on a corpus of online product reviews. Since the individual reviews come with a one-to-five star rating, the evaluative meaning of an adjective or phrase was computed as the average rating of all reviews it occurs in (Mean Star Rating, see Section 3
). This approach was later adopted by Ruppenhofer14 who found that it works quite well for classifying quality and intelligence adjectives into intensity classes (excellent vs. mediocre and brilliant vs. dim, respectively). Another related approach was proposed by Mohammad12, who used hashtags in Twitter posts as distant supervision labels of emotion categories, e.g., #sadness. Word ratings were then computed based on pointwise mutual information between word types and emotion labels.
The above methods all derive word labels directly using relatively simple statistical operations. From this group, we selected the Mean Star Rating approach for experimental comparison (Section 4), as it expects numerical document labels, in line with the empathy gold standard employed later (Section 5).
Note that these contributions are distinct from pattern-based approaches, e.g., presented by Hatzivassiloglou97, who distinguish positive and negative words based on their usage pattern with particular conjunctions: “A and B” implies that A and B have the same polarity whereas “A but B” implies opposing polarity. Such approaches are not considered here because they base lexicon learning on linguistic usage patterns instead of document-level supervision and hence rely on large quantities of raw text.
In another line of work, Sap14emnlp address the task of modeling user age and gender in social media. They showed that by training a linear model with Bag-of-Words (BoW) unigram features, the resulting feature weights can effectively be interpreted as word-level ratings. In a later study, Preotiuc16 employed the same method to create a valence and arousal lexicon based on annotated Facebook posts. This is the second baseline method we use in our evaluation; technical details are given in Section 3 (Regression Weights).
In a recent study, Wang17 present a three step approach to infer word polarity. Based on a Twitter corpus with hashtag-derived polarity labels, they (1) apply the method of Mohammad12 to generate a first set of word labels (see above). Those ratings are used (2) to train sentiment-aware word embeddings. The embeddings are then used (3) as input to a classifier which is trained on a set of seed words to predict the final word ratings. In essence, this is a semi-supervised approach because the last step requires word-level gold data and does not address the problem at hand.
3 Methods

This section formalizes the learning problem we address, describes the baseline methods we compare against as well as the Mixed-Level Feed Forward Network, and concludes with a brief discussion. Section 5 then describes signed spectral clustering, which we use for qualitative interpretation of our new empathy lexicon.
We address the problem of learning word ratings for an arbitrary lexical characteristic based on gold labels of the same characteristic but for a higher linguistic level (see Figure 1). For example, how can one learn word-level polarity ratings based on document-level polarity gold labels? More formally, let $W = \{w_1, \dots, w_n\}$ denote a set of words with corresponding gold labels $l_{w_1}, \dots, l_{w_n}$. Let $D = \{d_1, \dots, d_m\}$ be a set of higher-level linguistic units with corresponding gold labels $l_{d_1}, \dots, l_{d_m}$. Those linguistic units can be anything from phrases over paragraphs to whole books, yet for conciseness we will refer to those units as documents. Our problem is to predict $l_{w_1}, \dots, l_{w_n}$ given $W$, $D$, and $l_{d_1}, \dots, l_{d_m}$. To the best of our knowledge, this is the first contribution ever formulating this as an abstract learning problem (rather than looking at concrete applications in isolation) and studying it in a systematic manner—the baseline methods have so far not been compared against each other. We now proceed by introducing methods for solving this problem.
Mean Star Rating.
Following Rill12, we predict the rating of a word $w$ by averaging the gold labels of the documents in which $w$ occurs. Denoting the set of documents containing $w$ as $D_w := \{d \in D : w \in d\}$, this baseline method can be described as follows:

$$\hat{l}_w = \frac{1}{|D_w|} \sum_{d \in D_w} l_d. \qquad (1)$$
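The Mean Star Rating computation can be sketched in a few lines of Python (a minimal illustration; the function name and data layout are our own):

```python
from collections import defaultdict

def mean_star_rating(documents, labels):
    """Mean Star Rating baseline: a word's predicted rating is the
    average gold label of all documents the word occurs in.

    documents: list of token lists; labels: parallel list of numeric gold labels.
    Returns a dict mapping word -> predicted rating.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for tokens, label in zip(documents, labels):
        for w in set(tokens):  # count each document at most once per word type
            sums[w] += label
            counts[w] += 1
    return {w: sums[w] / counts[w] for w in sums}

# e.g., "movie" occurs in documents rated 5.0 and 1.0, so it receives 3.0
lexicon = mean_star_rating([["great", "movie"], ["bad", "movie"]], [5.0, 1.0])
```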
Mean Binary Rating.
As previously mentioned, mihalcea2006corpus created lexica for happiness and sadness from binary labels. To apply this method to numerical document labels (as present in the Empathic Reactions dataset; see Section 5), we first apply a median split: document labels below the median are recoded as 0, those above as 1. Subsequently, we calculate the Mean Binary Rating using the same equation as for the Mean Star Rating (Equation 1), thus showing the resemblance between mihalcea2006corpus and Rill12.
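A minimal, self-contained sketch of the Mean Binary Rating, combining the median split with the per-word averaging (names and data layout are ours; labels equal to the median are recoded as 1 here, a detail the original description leaves open):

```python
import statistics
from collections import defaultdict

def mean_binary_rating(documents, labels):
    """Mean Binary Rating: median-split the numeric document labels into
    {0, 1}, then average the binary labels over the documents each word
    occurs in (same averaging as Mean Star Rating)."""
    med = statistics.median(labels)
    binary = [0 if l < med else 1 for l in labels]
    sums, counts = defaultdict(float), defaultdict(int)
    for tokens, b in zip(documents, binary):
        for w in set(tokens):
            sums[w] += b
            counts[w] += 1
    return {w: sums[w] / counts[w] for w in sums}
```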
Regression Weights. Following Sap14emnlp, this baseline method learns word ratings by fitting a linear regression model with Bag-of-Words (BoW) features.
First, consider a multivariate linear regression model for predicting the document ratings $l_d$. In general, such a model is given by

$$\hat{l}_d = \beta_0 + \sum_{i=1}^{n} \beta_i x_i, \qquad (2)$$

where $\beta_0$ denotes the intercept and $\beta_i$ and $x_i$ represent the weight and value of feature $i$, respectively.
Using a BoW approach, the relative frequency of a word in a document is typically used as its feature value. Except for the intercept, the linear model can then be interpreted as computing a weighted average of the feature weights $\beta_i$, the relative term frequency being the weighting factor. With this interpretation in mind, a linear BoW model aligns perfectly with a lexicon-based approach to document-level prediction, with feature weights corresponding to word ratings (see Sap14emnlp for a more detailed explanation). Hence, Equation (2) can be rewritten as

$$\hat{l}_d = \beta_0 + \sum_{i=1}^{n} \beta_{w_i} \, \mathrm{freq}(w_i, d), \qquad (3)$$

where $\mathrm{freq}(w_i, d)$ denotes the relative frequency of word $w_i$ in document $d$. Thus, by fitting the model to predict document ratings, we simultaneously learn word ratings $\hat{l}_{w_i} := \beta_{w_i}$. In practice, ridge regression is used for fitting the model parameters (word ratings), thus introducing $\ell_2$ penalization to avoid overfitting.
Mixed-Level Feed Forward Network.
We learn a Feed Forward Network (FFN; illustrated in Figure 1) at the document level using a neural BoW approach with an external, pre-trained embedding model. By training the FFN on this task, it implicitly learns to map points of the embedding space to gold labels, which we then exploit for predicting word-level ratings.
In general, a Feed Forward Network consists of an input layer $x^{(0)}$ followed by multiple hidden layers with activation

$$x^{(k)} = \sigma\left(\Theta^{(k)} x^{(k-1)} + b^{(k)}\right),$$

where $\Theta^{(k)}$ and $b^{(k)}$ denote the weights and biases of layer $k$, respectively, and $\sigma$ is a nonlinear function. Since we predict numerical values (document-level ratings), the activation of the output layer $x^{(K)}$, where $K$ is the number of non-input layers, is given by the affine transformation

$$x^{(K)} = \Theta^{(K)} x^{(K-1)} + b^{(K)}.$$
For fitting the model parameters, consider a pre-trained embedding model $e$ such that $e(w)$ denotes the vector representation of a word $w$. This is either the learned representation of $w$ or a zero vector if $w$ is not in the embedding model. We can now train the model to predict the document gold ratings $l_d$ using a gradient descent-based method. For a document $d$, the embedding centroid of the tokens present in $d$ is used as input $x^{(0)}$. That is,

$$x^{(0)} = \frac{1}{|d|} \sum_{t \in d} e(t),$$

where $|d|$ is the number of tokens in $d$. Embeddings are not updated during training.
Until this point, the described model is quite similar to the deep averaging network (DAN) proposed by Iyyer15acl in that it is a Feed Forward Network that predicts document labels from embedding centroid features. What differs is that we use the model to predict word labels once it has been fit to predict the document labels $l_d$. Conceptually, by fitting the model parameters, the FFN learns to map points of the (pre-trained) embedding space to points in the label space. But using the same embedding model, we can also represent the words $w$ for which we want to predict labels within the same feature space. Moreover, note that per our problem definition, word and document labels populate the same label space. Hence, we can predict $\hat{l}_w$ by feeding $e(w)$ into the FFN without any further adjustments. Since the FFN can predict both word and document labels, we call this model the Mixed-Level Feed Forward Network (MLFFN). For the MLFFN, SHapley Additive exPlanations (SHAP) [lundberg2017unified] are mathematically equivalent to our method of deriving ratings from individual words.
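The core MLFFN idea (train on document embedding centroids, then feed single word embeddings through the same network) can be sketched as follows. This is a deliberately simplified version with one hidden layer and plain full-batch gradient descent rather than Adam; all names and hyperparameter values are illustrative:

```python
import numpy as np

def centroid(tokens, emb, dim):
    """Document representation: mean of the available word embeddings
    (out-of-vocabulary tokens contribute zero vectors)."""
    return np.mean([emb.get(t, np.zeros(dim)) for t in tokens], axis=0)

def train_mlffn(doc_vecs, labels, hidden=16, lr=0.05, epochs=3000, seed=0):
    """Simplified Mixed-Level FFN: one hidden ReLU layer trained on MSE
    (the paper's model has two hidden layers and uses Adam). Returns a
    predict() closure usable for document centroids and single words alike."""
    rng = np.random.default_rng(seed)
    X = np.asarray(doc_vecs, float)
    y = np.asarray(labels, float)
    n, dim = X.shape
    W1 = rng.normal(0.0, 0.1, (dim, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, hidden);        b2 = 0.0
    for _ in range(epochs):
        h = np.maximum(X @ W1 + b1, 0.0)   # hidden layer, ReLU
        pred = h @ W2 + b2                 # affine output layer
        err = pred - y                     # gradient of 0.5*MSE w.r.t. pred
        dh = np.outer(err, W2) * (h > 0)   # backprop through the ReLU
        W2 -= lr * (h.T @ err) / n; b2 -= lr * err.mean()
        W1 -= lr * (X.T @ dh) / n;  b1 -= lr * dh.mean(axis=0)

    def predict(vec):
        """Same network scores centroids and word vectors: the MLFFN idea."""
        h = np.maximum(np.asarray(vec, float) @ W1 + b1, 0.0)
        return float(h @ W2 + b2)
    return predict
```

Once the network is fit on centroids, calling `predict(e(w))` on a single word embedding yields that word's rating, which is exactly the "inversion" described above.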
Discussion of Model Properties.
Mean Star Rating, Mean Binary Rating, and Regression Weights learn exclusively from the available document-level gold data. In contrast, one of the major advantages of the MLFFN is that it builds on pre-trained word embeddings, thus implicitly leveraging vast amounts of unlabeled text data. For our experiments, we use publicly available embeddings trained on hundreds of billions of tokens. The MLFFN is also more flexible than Regression Weights, since it can learn nonlinear dependencies between the relative word frequencies of a document and its gold label.
Another major advantage of the MLFFN model relates to the set of words that gold labels can be predicted for. Whereas Mean Star Rating, Mean Binary Rating, and Regression Weights are conceptually limited to words which occur in the gold data, MLFFN can predict ratings for any word for which embeddings are known. In practice, this implies that with our approach empathy ratings for millions of word types can be induced.
Hyperparameters and Implementation.
The implementation of Mean Star Rating and Mean Binary Rating is straightforward and requires no further details. For Regression Weights we used the same setup as Sap14emnlp, as implemented in the Differential Language Analysis Toolkit (DLATK, [Schwartz17emnlp]).
For the MLFFN, we follow the hyperparameter choices Buechel18emnlp used for the Empathic Reactions dataset. Thus, the MLFFN has two hidden layers (256 and 128 units, respectively) with ReLU activation. The model was trained using the Adam optimizer [Kingma15] with a batch size of 32. We trained for a maximum of 200 epochs and applied early stopping if performance on the validation set did not improve for 20 consecutive epochs. We applied dropout to the input and dense layers, and $\ell_2$ regularization to the weights of the dense layers. Keras [chollet2015keras] was used for implementation.
Table 1 (excerpt):

| Mean Binary Rating | .31 | .18 | .11 | .07 |
| Mean Star Rating   | .39 | .22 | .14 | .09 |
4 Evaluation

We next conduct a systematic comparison of the above approaches. The ideal evaluation strategy would require both document-level and word-level ratings for empathy: one could then train the models on the former and test their performance in predicting the latter, possibly using resampling to get a distribution of scores. However, this option is not available, since the difficulty of acquiring empathy word ratings is exactly the point of this paper. Furthermore, cross-validation results for predicting document-level ratings from derived word-level lexica on the Empathic Reactions dataset lack the statistical power to distinguish between methods due to an insufficient number of examples. We therefore require an alternative approach.
We adopt two alternative approaches: First, in place of empathy, we rely on other affective variables, namely valence, arousal, and dominance (VAD), for which both document and word ratings are available. The assumption here is that performance results for VAD transfer to empathy. Second, we use the Empathic Reactions dataset (https://github.com/wwbp/empathic_reactions) to create one lexicon for each method under consideration. We then use it to predict trait-level empathy ratings, thus testing the generalizability of the resulting lexica to other domains as well as from state to trait empathy (see Section 2).
4.1 Intrinsic Evaluation with Emotion Data
We use the following gold standards for evaluation: Document-level supervision is provided by EmoBank [buechel2017emobank] (https://github.com/JULIELab/EmoBank), a large-scale corpus manually annotated with emotion according to the psychological Valence-Arousal-Dominance scheme. EmoBank contains ten thousand sentences across multiple genres and carries annotations for both writer and reader emotion. Word-level supervision to test against comes from the well-known affective norms (psychological valence, arousal, and dominance) dataset collected by Warriner13, containing 13,915 English word types.
We fit all four models on EmoBank and evaluate against the word ratings by Warriner13 using 10-fold cross-validation. For word embeddings, we used off-the-shelf fastText subword embeddings [mikolov2018advances] (https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M-subword.zip), trained with subword information on Common Crawl (600B tokens). Performance is measured in terms of the correlation between predicted and gold labels. As shown in Table 1, the MLFFN by far outperforms Mean Binary Rating, Mean Star Rating, and Regression Weights, the latter three being roughly equal. This is most probably because the MLFFN builds on a pre-trained embedding model, thus leveraging vast amounts of unlabeled data in addition to the document-level supervision.
4.2 Extrinsic Evaluation
To evaluate the lexica created using each of the underlying methods, we applied them to a trait-level empathy dataset. We validate our empathy lexica by showing that they predict person-level empathy traits on another dataset, which pairs trait-level empathy questionnaires with users’ Facebook posts. Abdul2017icwsm used this dataset to predict trait-based pathogenic empathy. Here, instead, we aggregate the empathy survey results of Facebook users recruited via Qualtrics. We filtered users to include those who had posted at least five times in the last 30 days and had at least 100 lifetime posts. The survey included an integrated app collecting participants’ Facebook posts. In total, there are 2,405 users with 1,835,884 Facebook posts after filtering non-English posts (see Abdul2017icwsm for further dataset details).
The lexica were employed in a very simple fashion: for each user, we computed the weighted average of the empathy scores of the words they used across all posts, with relative frequency as the weighting factor. As shown in Table 1, performance is generally much poorer than in the intrinsic evaluation, reflecting the change of domain, the more difficult task of inferring trait-level empathy from state-level ratings, and the overall reduced performance of purely lexicon-based approaches. (While 0.18 may seem low, our results are similar to those of Abdul2017icwsm, who use a regression model with LDA topics trained on Facebook posts.) Nevertheless, the results are consistent with the intrinsic evaluation: the MLFFN widely outperforms the other approaches (with roughly twice their performance figures), the latter being roughly similar to one another.
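The user-level scoring step can be sketched as follows. Pooling all of a user's posts and averaging the ratings of in-lexicon tokens is equivalent to a relative-frequency-weighted average of word ratings; the function name and data layout are ours:

```python
def score_user(posts, lexicon):
    """Score a user as the mean lexicon rating of every in-lexicon token
    across all of their posts. With tokens pooled, this equals the
    relative-frequency-weighted average of word ratings described above."""
    scores = [lexicon[w] for post in posts for w in post if w in lexicon]
    return sum(scores) / len(scores) if scores else None
```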
5 The Empathy Dictionary
Table 3: Example word clusters for empathy and distress obtained via signed spectral clustering.

High Empathy:
  grieve, grieving, loss, prayers, grief, heartbroken, losses, deppression, condolences, widowed
  wounds, wounded, scars, heal, blisters, trauma, wound, heals, bleeding, fasciitis
  duckworth, salama, mansour, santiago, gilbert, fernandez, braves, vaughn, colonialism, crowe
  minneapolis, neighborhoods, detroit, chicago, charlotte, cincinnati, brisbane, angeles, atlanta, drescher

Low Empathy:
  fool, clueless, dumbass, idiotic, lazy, stupidity, morons, idiot, idiots, dumb
  bother, slightest, anything, else, nobody, anybody, any, nothing, anyone, himself
  loser, bs, moron, dingus, maniac, buffoon, ffs, loon, crap, psycho
  wacky, bizarre, odd, creepy, weird, unnerving, masochistic, freaks, unusual, strange

High Distress:
  homicide, killings, murdered, massacre, murdering, homicides, genocide, murder, murderers, killed
  brutalized, assaulted, raped, bullied, tormented, harassed, detained, molested, reprimanded, beaten
  horrific, witnessed, retched, wretched, atrocious, awful, horrid, foul, shoddy, unpleasant
  horrifying, terrifying, harrowing, overdoses, suicides, deaths, suicide, gruesome, devastating, tragedy

Low Distress:
  dunno, guessing, guess, gues, probably, assuming, maybe, clue, bet, assume
  wont, knowlegde, alot, doesnt, isnt, wasnt, ahve, dont, didnt, exempt
  sort, lot, bunch, sorts, type, whatever, plenty, depending, types, range
  intact, stays, rememeber, keeping, keeps, always, kept, vague, rember, stay
The final empathy dictionary consists of the predictions of the MLFFN from the last experiment, which we then adjusted using a log min-max rescaling into the interval [1, 7] for consistency with Buechel18emnlp. We restricted ourselves to words which appear in the Empathic Reactions dataset and did not make use of the MLFFN's ability to predict ratings for all words of the embedding model (left for future work). This was done to ensure the interpretability of word ratings relative to their usage in the corpus (achieved via the clustering analysis in Section 5.2).
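One plausible reading of the log min-max rescaling is sketched below; the exact transform used in the paper may differ in detail (in particular, the positive shift applied before taking logs is an assumption of ours):

```python
import math

def log_minmax_rescale(scores, lo=1.0, hi=7.0):
    """Shift scores to be strictly positive, take logs, then min-max
    rescale the logged values into [lo, hi]. The target interval [1, 7]
    matches the Empathic Reactions document-level scales."""
    shifted = [s - min(scores) + 1.0 for s in scores]  # ensure > 0 before log
    logged = [math.log(s) for s in shifted]
    mn, mx = min(logged), max(logged)
    return [lo + (hi - lo) * (l - mn) / (mx - mn) for l in logged]
```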
5.1 Dataset Description
Our final lexicon consists of 9,356 word types (lower-cased, non-lemmatized, including named entities and spelling errors), each with associated empathy and distress ratings. For illustration, Table 2 lists the highest- and lowest-ranking words for each construct (empathy and distress). An extended list is given in the appendix. High-empathy words contain many named entities that experience or cause suffering, making a reader feel empathic (lukemia, lakota). This is likely because the Empathic Reactions corpus used news stories to evoke empathy in subjects, who then referred to those named entities when expressing their feelings. Low-empathy words, on the other hand, are often ones used for ridiculing, hence expressing a lack of empathy (joke, wacky). High-distress words contain predominantly adjectives, nouns, and participles which can be used to characterize abusive behavior (inhumane, mistreating), thus causing personal distress in readers when they take the perspective of the affected entity. Interestingly, low-distress words do not seem to display any clear pattern, making us suspect that personal distress should be addressed in terms of a unipolar rather than a bipolar scale.
The uni- and bivariate distributions of empathy and distress scores are displayed in Figure 2. As can be seen, both sets of labels are fairly close to a normal distribution. Empathy and distress word-level ratings display only a moderate Pearson correlation, which confirms that the two are distinct constructs, as already indicated by the qualitative analysis above. This is also highly consistent with Empathic Reactions, where Buechel18emnlp found a comparable correlation between the two sets of document-level ratings.
5.2 Qualitative Clustering Analysis
To assess the face validity of the lexica, we partition them into groups (clusters) of words that are semantically similar and simultaneously have similar ratings. Straightforward clustering does not take the ratings into account and is less interpretable. We therefore use the Signed Spectral Clustering (SSC) algorithm [sedoc2017predicting] to cluster words that are similar both semantically and in their ratings. Weighted edges are added between words such that words of similar empathy have positive connections and those of differing empathy have negative ones (see the appendix and sedoc2017semantic for the precise mathematical formulation). SSC minimizes the cumulative weight of positive edges cut between clusters while simultaneously minimizing the negative edge weights within clusters, thus pulling words of similar empathy or distress into the same cluster and pushing apart those that differ. We follow the method used by sedoc2017predicting.
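For intuition, a toy two-cluster variant of signed spectral clustering can be sketched as follows; the actual SSC formulation of sedoc2017predicting differs in detail (e.g., in how edge weights are constructed and how more than two clusters are obtained):

```python
import numpy as np

def signed_spectral_bipartition(embeddings, ratings, sim_threshold=0.5):
    """Toy k=2 sketch: edges between semantically similar words carry the
    sign of their rating agreement; we split on the signs of the smallest
    eigenvector of the signed Laplacian L = D_bar - A, where
    D_bar = diag(sum_j |A_ij|)."""
    X = np.asarray(embeddings, float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T                                    # cosine similarity
    r = np.asarray(ratings, float)
    # +1 for word pairs on the same side of the mean rating, -1 otherwise
    agree = np.sign(np.outer(r - r.mean(), r - r.mean()))
    A = np.where(np.abs(sim) >= sim_threshold, sim * agree, 0.0)
    np.fill_diagonal(A, 0.0)
    L = np.diag(np.abs(A).sum(axis=1)) - A           # signed Laplacian
    vals, vecs = np.linalg.eigh(L)                   # ascending eigenvalues
    return (vecs[:, 0] >= 0).astype(int)             # split on the smallest one
```

On a balanced signed graph (positive edges within groups, negative edges between them), the smallest eigenvector of the signed Laplacian recovers the group assignment, which is the effect SSC exploits.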
As seen in Table 3, the clusters of words for high and low empathy and for high distress are strikingly coherent. There are many clusters of topics around situations where people feel empathy, as well as lists of different negative emotions. The clusters consisting entirely of place and person names are obviously less useful for psychological analysis; however, these are places where bad things happen and people to whom bad things happen, which is useful for predictive models. Usable lexica must be interpretable: SSC allows us to provide not only words and ratings but also groups of high-magnitude words, which domain experts can then analyze and possibly modify.
6 Conclusion

The Mixed-Level Feed Forward Network (MLFFN) successfully learns word ratings from document-level ratings by backing word ratings out of a trained neural network, performing substantially better than methods that others have used for lexicon creation. The empathy and personal distress lexica we learn using the MLFFN look sensible; we look forward to further validating them by using them in predictive models and psychological experiments, and to exploring the extent to which SHapley Additive exPlanations (SHAP) [lundberg2017unified] calculations of feature importance for CNNs or RNNs improve lexicon quality over the simple neural nets we used.