The CARE Dataset for Affective Response Detection

by Jane A. Yu et al.

Social media plays an increasing role in our communication with friends and family, and our consumption of information and entertainment. Hence, to design effective ranking functions for posts on social media, it would be useful to predict the affective response to a post (e.g., whether the user is likely to be humored, inspired, angered, informed). Similar to work on emotion recognition (which focuses on the affect of the publisher of the post), the traditional approach to recognizing affective response would involve an expensive investment in human annotation of training data. We introduce CARE_db, a dataset of 230k social media posts annotated according to 7 affective responses using the Common Affective Response Expression (CARE) method. The CARE method is a means of leveraging the signal that is present in comments that are posted in response to a post, providing high-precision evidence about the affective response of the readers to the post without human annotation. Unlike human annotation, the annotation process we describe here can be iterated upon to expand the coverage of the method, particularly for new affective responses. We present experiments that demonstrate that the CARE annotations compare favorably with crowd-sourced annotations. Finally, we use CARE_db to train competitive BERT-based models for predicting affective response as well as emotion detection, demonstrating the utility of the dataset for related tasks.



1 Introduction

Social media and other online media platforms have become a common means of both interacting and connecting with others as well as finding interesting, informative, and entertaining content. Users of those platforms depend on the platforms' ranking and recommendation systems to show them the information they will be most interested in and to safeguard them against unfavorable experiences. Towards this end, a key technical problem is to predict the affective response that a user may have when they see a post. Some affective responses can be described by emotions (e.g., angry, joyful), and others may be described more as experiences (e.g., entertained, inspired). Predicting affective response differs from emotion detection in that the latter focuses on the emotions expressed by the publisher of the post (referred to as the publisher affect in Chen et al. (2014)) and not on the viewer of the content. While the publisher’s emotion may be relevant to the affective response, it only provides a partial signal.

Current approaches to predicting affective response require obtaining training data from human annotators who try to classify content into classes of a given taxonomy. However, obtaining enough training data can be expensive, and moreover, due to the subjective nature of the problem, achieving consensus among annotators can be challenging.

This paper introduces the Common Affective Response Expression method (CARE for short), a means of obtaining labels for affective response in an unsupervised way from the comments written in response to online posts. CARE uses patterns and a keyword-affect mapping to identify expressions in comments that provide high-precision evidence about the affective response of the readers to the post. For example, the expression “What a hilarious story” may indicate that a post is humorous and “This is so cute” may indicate that a post is adorable. We seed the system with a small number of high-precision patterns and mappings. We then iteratively expand on the initial set by considering frequent patterns and keywords in unlabeled comments on posts labeled by the previous iteration.

Using CARE, we create the largest dataset to date for affective response, CARE_db, which contains 230k posts annotated according to 7 affective responses. We validate the effectiveness of CARE by comparing the CARE annotations with crowd-sourced annotations. Our experiments show that there is a high degree of agreement between the annotators and the labels proposed by CARE (e.g., in 90% of the cases, at least two out of three annotators agree with all the CARE labels). Furthermore, we show that the CARE patterns/lexicon have greater accuracy than applying SOTA emotion recognition techniques to the comments. Using CARE_db, we train care-bert, a BERT-based model that can predict affective response without relying on comments. care-bert provides strong baseline performance for the task of predicting affective response, on par with SOTA models for emotion recognition. Furthermore, we show that care-bert can be used for transfer learning to a different emotion-recognition task, achieving performance similar to that of Demszky et al. (2020), which relied on manually-labeled training data.

2 Related work

We first situate our work with respect to previous research on related tasks.

2.1 Emotion detection in text

Approaches to emotion detection can be broadly categorized into three groups: lexicon-based approaches, machine learning approaches, and hybrid approaches that combine ideas from the first two. The lexicon-based approach typically leverages lexical resources such as lexicons, bags of words, or ontologies, and often uses grammatical and logical rules to guide emotion prediction (Tao, 2004; Ma et al., 2005; Asghar et al., 2017). Though these rule-based methods are fast and interpretable, they are often less robust and flexible because they rely on the presence of keywords in the lexicon (Alswaidan and Menai, 2020; Acheampong et al., 2020). Additionally, the scope of emotions predicted by these works is usually fairly small, ranging from two to five classes, and most of the datasets used are smaller than 10k examples (making it unclear whether the methods extrapolate well). Among the ML approaches, many SOTA works employ deep learning methods such as LSTMs, RNNs, and DNNs (Demszky et al., 2020; Felbo et al., 2017; Barbieri et al., 2018; Huang et al., 2019a; Baziotis et al., 2017; Huang et al., 2019b). However, while many of the deep learning models have shown significant improvement over prior techniques, their results are difficult to interpret and they typically require prohibitively large human-labeled datasets. In both the lexicon-based and ML approaches, the classes of emotions predicted are usually not extendable, or extending them requires additional labeled data.

While there are some commonalities between works in emotion detection and affective response detection, the problems are distinct enough that we cannot simply apply emotion recognition techniques to our setting. Emotion recognition focuses on the publisher affect (the affect of the person writing the text). The publisher affect may provide a signal about the affective response of the reader, but there is no simple mapping from one to the other. For example, being ‘angered’ is an affective response that does not only result from reading an angry post – it can result from a multitude of different publisher affects (e.g. excited, angry, sympathetic, embarrassed, or arrogant). For some affective responses, such as being ‘grateful’ or feeling ‘connected’ to a community, the corresponding publisher affect is highly unclear.

2.2 Affective response detection

Some works address affective response in limited settings, such as understanding reader responses to online news (Katz et al., 2007; Strapparava and Mihalcea, 2007; Lin et al., 2008; Lei et al., 2014). In contrast, our goal is to address the breadth of content on social media. Other works use Facebook reactions as a proxy for affective response, but these are constrained by the pre-defined set of reactions (Clos et al., 2017; Raad et al., 2018; Pool and Nissim, 2016; Graziani et al., 2019; Krebs et al., 2017). Rao et al. (2014) and Bao et al. (2012) attempt to associate emotions with topics, but a single topic can elicit a large variety of affective responses on social media, so their model does not apply to our case. Some works in the computer vision community study affective response to images (Chen et al., 2014; Jou et al., 2014); as they note, most of the work in that community also focuses on publisher affect.

2.3 Methods for unsupervised labeling

A major bottleneck in developing models for emotion and affective response detection is the need for large amounts of training data. As an alternative to manually-labeled data, many works utilize metadata such as hashtags, emoticons, and Facebook reactions as pseudo-labels (Wang et al., 2012; Suttles and Ide, 2013; Hasan et al., 2014; Mohammad and Kiritchenko, 2015). However, these can be highly noisy and limited in scope. For example, there exist only seven Facebook reactions, and they do not necessarily correspond to distinct affective responses. Additionally, for abstract concepts like emotions, hashtagged content may only capture a superficial interpretation of the concept. For example, #inspiring on Instagram will give many photos featuring selfies or obvious inspirational quotes, which do not sufficiently represent inspiration. The work we present here extracts labels from free-form text in comments rather than metadata. The work done in Sintsova and Pu (2016) is similar to our work in that it pseudo-labels tweets and extends its lexicon, but the classifier itself is a keyword, rule-based approach and is heavily reliant on the capacity of these lexicons. In contrast, our work leverages the high precision of CARE and uses these results to train a model, which is not constrained by the lexicon size in its predictions. Our method also employs bootstrapping to expand the set of patterns and lexicon, similar to Agichtein and Gravano (2000) and Jones et al. (1999) but focuses on extracting affect rather than relation tuples.

3 The CARE Method

We start with a formal description in Section 3.1 of the major components of CARE: CARE patterns, regular expressions used to extract information from the comments of a post, and the CARE lexicon, a keyword-affect dictionary used to map the comment to an affect. Section 3.2 describes how we use these components at the comment-level to predict affective responses to a social media post, and in turn, how to use the predictions to expand the CARE patterns and lexicon.

3.1 CARE patterns and the CARE lexicon

Before we proceed, we discuss two aspects of affective responses. First, there is no formal definition of what qualifies as an affective response. In practice, we use affective responses to understand the experience that the user has when seeing a piece of content, and these responses may be both emotional and cognitive. Second, the response a user may have to a particular piece of content is clearly a very personal one. Our goal here is to predict whether a piece of content is generally likely to elicit a particular affective response. In practice, if the recommendation system has models of user interests and behavior, these would need to be combined with the affect predictions.

The CARE lexicon is a dictionary that maps a word or phrase to a particular affect. The CARE lexicon contains 163 indicators for the 7 classes we consider (123 of which were identified in the expansion process described in the next section). We also considered using other lexicons (Strapparava and Valitutti, 2004; Poria et al., 2014; Staiano and Guerini, 2014; Esuli and Sebastiani, 2006; Mohammad et al., 2013), but we found that they lacked the application context needed to be useful in our setting. Table 1 shows the affects in the CARE lexicon with corresponding definitions and example comments that would fall under each affect (or class). The classes excited, angered, saddened, and scared were chosen because they are often proposed as the four basic emotions (Wang et al., 2011; Jack et al., 2014; Gu et al., 2016; Zheng et al., 2016). The classes adoring, amused, and approving were established because they are particularly important in the context of social media for identifying positive content that users enjoy. Overall, a qualitative inspection indicated that these seven have minimal conceptual overlap and sufficiently broad coverage. We note, however, that one of the benefits of our method is that it is relatively easy to build a model for a new class of interest, compared to the process of human annotation.

AR Definition Example comment Size
Adoring Finding someone or something cute, adorable, or attractive. He is the cutest thing ever. 36
Amused Finding something funny, entertaining, or interesting. That was soooo funny. 30
Approving Expressing support, praise, admiration, or pride. This is really fantastic! 102
Excited Expressing joy, zeal, eagerness, or looking forward to something. Really looking forward to this! 41
Angered Expressing anger, revulsion, or annoyance. I’m so frustrated to see this. 26
Saddened Expressing sadness, sympathy, or disappointment. So sad from reading this. 34
Scared Expressing worry, concern, stress, anxiety, or fear. Extremely worried about finals. 2
Table 1: Definition of affective responses (AR), examples of comments which would map to each affective response, and the number of posts (in thousands) per class in CARE_db. The portion of each example that would match a CARE pattern in a reg-ex search is italicized.

In contrast to the CARE lexicon, CARE patterns are not class-specific; they leverage common structure present in comments to extract affective responses. There is an unlimited number of possible CARE patterns, but some are much more prevalent than others. Below we describe the six CARE patterns that were initially given to the system manually. In total, there are 23 patterns; the remaining 17 were discovered using the expansion method. In the same spirit as Hearst Patterns (Hearst, 1992), CARE patterns are tailored to extract specific relationships. CARE patterns rely on two sets of sub-patterns:

  • Exaggerators : words that intensify or exaggerate a statement, e.g., so, very, or really.

  • Indicators : words (up to 3) that exist in the CARE lexicon, which maps the indicator to a particular class. For example, ‘funny’ in “This is so funny” would map to amused.

We present below the six CARE patterns that were used to seed the system, where {exaggerator} and {indicator} refer to the sub-patterns above. (The symbol * after a sub-pattern indicates that zero or more matches are required, and + that one or more are required.)

  • Demonstrative Pronouns:
    Example: This is so amazing!

  • Subjective Self Pronouns:
    Example: I am really inspired by this recipe.

  • Subjective Non-self Pronouns:
    Example: They really make me mad.

  • Collective Nouns:
    {some people | humans | society} {exaggerator}* {indicator}
    Example: Some people are so dumb.

  • Leading Exaggerators:
    {exaggerator}+ {indicator}
    Example: So sad to see this still happens.

  • Exclamatory Interrogatives:
    {what a | how} {exaggerator}* {indicator}
    Example: What a beautiful baby!
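As a rough illustration, two of the seed patterns above can be written as regular expressions built from the exaggerator and indicator sub-pattern sets. The word lists below are toy stand-ins, not the actual CARE sets (which contain 37 exaggerators and 163 indicators):

```python
import re

# Toy sub-pattern sets; the real CARE sets are much larger.
EXAGGERATORS = ["so", "very", "really", "extremely"]
INDICATORS = {"funny": "amused", "hilarious": "amused",
              "cute": "adoring", "beautiful": "adoring",
              "sad": "saddened"}

exag = r"(?:%s)" % "|".join(EXAGGERATORS)
ind = r"(%s)" % "|".join(INDICATORS)

# Two seed patterns as regexes: an {exaggerator}* sub-pattern may match
# zero or more times before the indicator.
PATTERNS = [
    re.compile(r"\b(?:this|that) is (?:%s )*%s\b" % (exag, ind)),  # demonstrative pronoun
    re.compile(r"\b(?:what a|how) (?:%s )*%s\b" % (exag, ind)),    # exclamatory interrogative
]

def comment_affects(comment):
    """Return the affects indicated by a single comment (steps 1-2)."""
    found = set()
    for pat in PATTERNS:
        for m in pat.finditer(comment.lower()):
            found.add(INDICATORS[m.group(1)])
    return found
```

For example, `comment_affects("What a hilarious story!")` maps the indicator ‘hilarious’ to amused via the lexicon.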

3.2 Labeling posts

The pipeline for labeling posts is shown in steps 1–3 of Figure 1 and described in detail in the algorithm in the Appendix. We begin with reg-ex matching of CARE patterns against individual sentences of the comments. We truncate the front half of a sentence if it contains a word like ‘but’ or ‘however’, because the latter half usually carries the predominant sentiment. We also reject indicators that are accompanied by negation words such as ‘never’, ‘not’, or ‘cannot’ (although one could theoretically map these to the opposite affective response using Plutchik’s Wheel of Emotions (Plutchik, 1980)). Note that, contrary to traditional rule-based or machine-learning methods, we do not strip stop words like ‘this’ and ‘very’, because they are often crucial to the regular expression matching, and this specificity has a direct impact on the precision of the pipeline.
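The sentence preprocessing described above (keeping the clause after a contrast word, rejecting negated indicators) can be sketched as follows; the word lists are illustrative, not the paper's exact lists:

```python
import re

CONTRAST_WORDS = ("but", "however")
NEGATIONS = ("never", "not", "cannot", "can't")

def preprocess_sentence(sentence):
    """Keep only the clause after a contrast word; drop negated sentences.

    Returns None if the sentence should be skipped entirely.
    """
    s = sentence.lower()
    # Keep the latter half: it usually carries the predominant sentiment.
    for w in CONTRAST_WORDS:
        parts = re.split(r"\b%s\b" % w, s, maxsplit=1)
        if len(parts) == 2:
            s = parts[1]
    # Reject sentences whose indicator would be negated.
    if any(re.search(r"\b%s\b" % re.escape(n), s) for n in NEGATIONS):
        return None
    return s.strip()
```

For instance, "I was skeptical, but this is so funny" is reduced to "this is so funny" before pattern matching, while "This is not funny" is discarded.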

Figure 1: Overview of the CARE method (pseudo-code in the Appendix). The top part of the figure shows the process of labeling a post, while the bottom shows how we expand the set of patterns and the lexicon. In step (1), we apply CARE patterns to the comment and extract the indicator. In step (2), we map each comment to the corresponding affective response using the CARE lexicon. Step (3) aggregates the comment-level labels into a post-level label. In step (A), we collect all comments of all posts corresponding to a particular affective response and analyze their most frequent n-grams. N-grams common to multiple classes are added to the CARE patterns (B1), while frequent n-grams specific to a class are added to the lexicon (B2). This process can then be repeated.

Given the reg-ex matches, we use the lexicon to map the indicators to the publisher affect of each comment (e.g., excited). Because the publisher affect of the comments indicates the affective response to a post, we obtain a post-level affective response label by aggregating the comment-level labels and filtering out labels whose support falls below a threshold t. Specifically, a post p is labeled with the affective response a if at least t of the comments in response to p were labeled with a. In our experiments, we chose the value of t after a qualitative inspection of CARE_db, discussed in Section 4.
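The aggregation step can be sketched in a few lines; the default threshold below is a placeholder, not the value used in the paper:

```python
from collections import Counter

def label_post(comment_labels, t=5):
    """Aggregate comment-level affects into post-level labels (step 3).

    comment_labels: one set of affects per comment on the post.
    A post receives affect a if at least t comments were labeled a.
    t=5 here is an arbitrary placeholder threshold.
    """
    support = Counter(a for labels in comment_labels for a in set(labels))
    return {a for a, n in support.items() if n >= t}
```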

Expanding CARE patterns/lexicon:

We seeded our patterns and lexicon with a small intuitive set and then expanded them by looking at common n-grams that appear across posts with the same label (steps A, B1, and B2 of Figure 1 and the algorithm in the Appendix). At a high level, for a given affect a, consider the set S_a of all the comments on posts that were labeled a but did not match any CARE pattern. From these comments, we extract new keywords for the CARE lexicon (e.g., ‘dope’ for approving, as in ‘This is so dope.’) by taking the n-grams that are frequent in S_a but infrequent in S_b for every other class b. On the other hand, the most common n-grams co-occurring with multiple classes were converted to regular expressions and added as new CARE patterns (see Table 5 for a few examples). We added CARE patterns in order of their frequency and stopped when we had sufficient data to train our models. After two expansion rounds, the number of patterns increased from 6 to 23 and the number of indicators from 40 to 163. Counting the possible combinations of patterns and indicators, there are roughly 3500 distinct expressions; further accounting for the 37 exaggerators, there are a total of 130k possible instantiations of a matching comment.
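A minimal sketch of the lexicon-expansion heuristic, assuming a simple unigram frequency-ratio criterion (the paper's exact scoring function is not specified in this section):

```python
from collections import Counter

def candidate_keywords(comments_by_class, target, top_k=5, ratio=3.0):
    """Frequent unigrams in the target class's unmatched comments that are
    comparatively rare in every other class: candidate new indicators.

    `ratio` is an assumed tunable knob, not a value from the paper.
    """
    def counts(comments):
        c = Counter()
        for text in comments:
            c.update(text.lower().split())
        return c

    in_class = counts(comments_by_class[target])
    out_class = counts(
        text for cls, comments in comments_by_class.items()
        if cls != target for text in comments)

    # Keep words that occur `ratio` times more often in-class than out-of-class
    # (add-one smoothing on the out-of-class count), most frequent first.
    scored = [(w, n) for w, n in in_class.items()
              if n >= ratio * (out_class[w] + 1)]
    scored.sort(key=lambda x: -x[1])
    return [w for w, _ in scored[:top_k]]
```

Candidates surfaced this way (e.g. ‘dope’ for approving) would still be manually vetted before entering the lexicon.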

4 Evaluating CARE

In this section we apply our method to social media posts and validate the resulting annotations using human evaluation (Section 4.1). Section 4.2 discusses class-wise error analysis, and in Section 4.3 we explore the alternative of creating CARE_db using a SOTA publisher-affect classifier (Demszky et al., 2020) to label the comments.


Our experiments use a dataset created from Reddit posts and comments written between 2011 and 2019. We create our dataset, CARE_db, as follows. We used CARE patterns and the CARE lexicon to annotate 34 million comments from 24 million distinct posts. After aggregating and filtering with the support threshold of Section 3.2, we obtained annotations for 400k posts (the total number of posts that have at least 5 comments was 150 million). The low recall is expected given the specificity of the CARE patterns/lexicon. We also filtered out posts that have fewer than 10 characters, resulting in a total of 230k posts in CARE_db. Table 1 shows the breakdown of cardinality per affective response: 195k of the posts were assigned a single label, whereas 26k (resp. 8k) were assigned two (resp. three) labels. Note that the distribution of examples per class in CARE_db is not reflective of the distribution in the original data, because CARE matches classes with different frequencies. The CARE_db dataset features the id and text of each post as well as its CARE annotations.

4.1 Human evaluation

In our next experiment, we evaluate the labels predicted by CARE with the help of human annotators on Amazon Mechanical Turk (AMT), restricting to annotators who qualify as AMT Masters and have a lifetime approval rating greater than 80%. The dataset for annotation was created as follows. We sub-sampled a set of 6000 posts from CARE_db, ensuring that we had at least 700 samples from each class, and asked annotators to label the affective response of each post. Annotators were encouraged to select as many labels as appropriate and were also permitted to choose ‘None of the above’, as shown in Figure 4. In addition to the post, we showed annotators up to 10 sampled comments from the post in order to provide more context. Every post was shown to three of the 91 distinct annotators. For quality control, we also verified that no individual annotator disagreed with the CARE labels more than 50% of the time over more than 100 posts.

Table 2 shows that the rate of agreement between the annotators and the labels proposed by the CARE method is high. For example, 94% of posts had at least one label proposed by CARE that was confirmed by 2 or more annotators, and 90% had all the labels confirmed. The last column measures the agreement among annotators on labels that were not suggested by CARE, which was 53% when confirmed by 2 or more annotators. We expected this value to be reasonably large because the CARE patterns/lexicon were designed to generate a highly precise set of labels rather than a highly comprehensive one. However, the value is still much smaller than the agreement rate for the CARE labels. On average, each annotation answer contained around 1.8 labels per post (with a standard deviation of 0.9). We note that ‘None of the above’ was chosen less than 0.2% of the time. Table 6 and Figure 5 present annotator agreement statistics and label prevalence, respectively, broken down by class. Figure 6 shows the Spearman correlation between each class and a hierarchical clustering.

# Agree  Any CARE  All CARE  Other
≥ 1      98        96        82
≥ 2      94        90        53
= 3      80        76        24
Table 2: The rate of agreement (in %) between the annotators and the labels proposed by CARE. The first column specifies the number of annotators required for consensus. The remaining columns show, over all posts, the average rate at which the human labels include at least one CARE label, all CARE labels, and any label that is not a CARE label.
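One plausible reading of the Table 2 statistics (a label counts as confirmed when at least k of the three annotators selected it) can be computed as follows; the data layout is an assumption for illustration:

```python
def agreement_rates(posts, k):
    """Fraction of posts where at least one CARE label ('Any') and where
    every CARE label ('All') is confirmed by >= k annotators.

    posts: list of (care_labels, annotator_answers) pairs, where
    care_labels is a set and annotator_answers a list of label sets.
    """
    def confirmed(label, answers):
        return sum(label in ans for ans in answers) >= k

    any_hits = all_hits = 0
    for care_labels, answers in posts:
        conf = [confirmed(lab, answers) for lab in care_labels]
        any_hits += any(conf)
        all_hits += all(conf)
    n = len(posts)
    return any_hits / n, all_hits / n
```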

4.2 Error Analysis

Evaluating CARE in the previous sections (Figure 6, Table 3) revealed that the accuracy of CARE varies by class and, in particular, is lower for amused and excited. To better understand whether certain pattern or indicator matches are at fault (i.e., more erroneous than others), we investigate precision and recall at the pattern and lexicon level.

Recall that instantiating a match for a comment involves choosing a (pattern, keyword) combination. Separating the lexicon from the patterns enables us to encode a large number of instantiated patterns parsimoniously, but some pair combinations provide a much weaker signal than others, particularly for the class amused (see Figure 9). Hence, for future iterations of CARE, we have implemented a mechanism to exclude certain pattern and keyword combinations and a means for using different thresholds for each class.

Another mechanism for accommodating these class-wise discrepancies in performance is to tune, for each class, an optimal threshold (i.e., the number of matched comments needed to reliably predict a label). Figure 2 shows how the precision and recall of each class vary with the threshold. To achieve precision and recall greater than 0.7, a threshold of 1 suffices for most classes, while amused and excited require a threshold of at least 3. In fact, for most classes, thresholds larger than 3 have negligible impact on precision but do reduce recall.
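Sweeping the per-class threshold, as in Figure 2, amounts to recomputing precision and recall at each support cutoff; a sketch over (support, ground-truth) pairs for a single class:

```python
def sweep_thresholds(posts, thresholds=range(1, 10)):
    """Precision/recall of predicting a class at each support threshold t.

    posts: (support, is_positive) pairs, where support is the number of
    matched comments for the class and is_positive the ground-truth label.
    """
    out = {}
    for t in thresholds:
        tp = sum(1 for s, y in posts if s >= t and y)
        fp = sum(1 for s, y in posts if s >= t and not y)
        fn = sum(1 for s, y in posts if s < t and y)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        out[t] = (prec, rec)
    return out
```

Raising t trades recall for precision: posts with few matched comments are no longer labeled, but the surviving labels are more reliable.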

Figure 2: Precision versus recall of each class at varying thresholds (t = 0 to 9). The ground-truth labels used are those with at least 2-out-of-3 annotator agreement. For clarity, only odd values of t are labeled.

4.3 Can we leverage emotion classification?

Given the potential shortcomings of certain (pattern, keyword) instantiations, a natural question to ask is whether other SOTA publisher affect classification techniques could be used instead of CARE patterns and the CARE lexicon, which together label the publisher affect of comments (steps 1 and 2 of Figure 1). As we explained earlier, there is a difference between predicting publisher affect and predicting affective response. The differences vary depending on the class considered, and the topic deserves a full investigation in itself. Section C presents initial experiments that indicate that affective response and publisher affect labels intersect 44% of the time.

Here we show that using CARE patterns and the CARE lexicon performs well compared to using SOTA emotion classifiers. Let us define CARE_GE, a modified version of CARE in which steps 1 and 2 are replaced with labels from the GoEmotions classifier, a BERT-based model trained on the GoEmotions dataset (Demszky et al., 2020). After applying the GoEmotions classifier to all comments of the posts with human annotations in CARE_db (Section 4.1), we convert the predicted GoEmotions labels using the mapping between the two taxonomies shown in Table 3. As a sanity check, we applied steps 1 and 2 of CARE to the comments in the GoEmotions dataset (see Section D), which shows that the rate of agreement among the labels in the mapping is high (87.3% overall). We then aggregate and filter post labels according to a support threshold, as before.

CARE_GE (Table 8) shows a relative decrease of 12.9% and 18.0% in the rate of annotator agreement with any and all labels, respectively, compared to CARE. These decreases hold even when partitioning on each individual class. The comparatively lower performance of CARE_GE is most likely due to the low F1-scores (below 0.4) of the GoEmotions classifier for nearly half of the 28 classes, as reported in the original work (Demszky et al., 2020, Table 4). It is important to note that in addition to demonstrating higher precision, CARE patterns and the CARE lexicon are valuable because they do not require human-annotated data, unlike GoEmotions. It may, however, be useful to leverage multiple emotion detection approaches, and Section E discusses a potential ensembling strategy for this.

AR GoEmotion label % agree
Amused Amusement 79.8
Approving Admiration, approval 89.3
Excited Joy 81.3
Angered Anger, Annoyance, Disgust 93.3
Saddened Disappointment, sadness 90.9
Scared Fear, nervousness 84.9
Table 3: CARE to GoEmotions mapping. The last column summarizes the rate at which a human-annotated comment in GoEmotions is also labeled the equivalent class using steps 1 and 2 of CARE, if a label is present. The average across all datapoints was 87.3%.
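The Table 3 mapping can be applied directly to classifier output; the lowercase label spellings below are an assumption about the GoEmotions label format:

```python
# Mapping transcribed from Table 3 (GoEmotions label -> CARE class).
GOEMOTIONS_TO_CARE = {
    "amusement": "amused",
    "admiration": "approving", "approval": "approving",
    "joy": "excited",
    "anger": "angered", "annoyance": "angered", "disgust": "angered",
    "disappointment": "saddened", "sadness": "saddened",
    "fear": "scared", "nervousness": "scared",
}

def to_care_labels(goemotions_labels):
    """Translate a comment's GoEmotions labels into CARE classes,
    dropping labels with no equivalent in the mapping."""
    return {GOEMOTIONS_TO_CARE[g] for g in goemotions_labels
            if g in GOEMOTIONS_TO_CARE}
```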

5 Fine-tuning with CARE

In this section we describe care-bert, a multi-label BERT-based model fine-tuned on the post-level text and annotations in CARE_db. Note that the model does not use the comment text and is therefore not restricted to the simple semantics captured by CARE patterns. Such a model is important for making predictions early in the life of a post and in cases where the comments match no CARE patterns or lexicon indicators. In Section 5.2, we describe how care-bert can be further fine-tuned for related tasks like emotion detection.

5.1 Creating and evaluating CARE-BERT

We train care-bert on the CARE labels in CARE_db, starting from the pre-trained bert-base-uncased model in the Huggingface library (Wolf et al., 2020). We use a maximum sequence length of 512 and add a dropout layer with a rate of 0.3 followed by a dense layer to allow for multi-label classification. We used the Adam optimizer with a learning rate of 5e-5, a batch size of 16, and 5 epochs, with a train/validation/test split of 80/10/10%. See Section H for other settings we explored and Table 9 for the results of a binary model (positive vs. negative affects).
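The multi-label setup means the dense layer produces one logit per class, and each class is decided independently through a sigmoid rather than jointly through a softmax. A framework-free sketch of the inference-time decision, with an assumed 0.5 cutoff:

```python
import math

CLASSES = ["adoring", "amused", "approving", "excited",
           "angered", "saddened", "scared"]

def predict_labels(logits, cutoff=0.5):
    """Multi-label decision: sigmoid each class logit independently and
    keep every class whose probability clears the cutoff. A post can thus
    receive zero, one, or several labels."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [c for c, p in zip(CLASSES, probs) if p >= cutoff]
```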

The evaluation results on the test set are shown in Table 4. The classes of lowest prevalence, such as scared, had the poorest results, while the more frequent classes, such as adoring, approving, and saddened, had the best results. To put these results in perspective, we note that they are on par with applying BERT-based models to the related task of emotion detection in Demszky et al. (2020). Specifically, using similar hyper-parameters, that work achieved a macro-averaged F1-score of 0.64 for a taxonomy of 6 labels.

AR Precision Recall F1
Adoring 0.78 0.67 0.72
Amused 0.73 0.55 0.63
Approving 0.81 0.74 0.77
Excited 0.70 0.50 0.59
Angered 0.75 0.65 0.70
Saddened 0.84 0.71 0.77
Scared 0.67 0.29 0.41
micro-avg 0.78 0.65 0.71
macro-avg 0.76 0.59 0.66
stdev 0.06 0.16 0.13
Table 4: Accuracy of care-bert on the CARE_db test set.
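For reference, the micro- and macro-averages in Table 4 aggregate per-class performance differently; a sketch with synthetic counts (not the paper's data):

```python
def micro_macro_f1(per_class):
    """per_class: dict mapping class -> (tp, fp, fn) counts."""
    def f1(tp, fp, fn):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        return 2 * p * r / (p + r) if p + r else 0.0

    # Macro: average of per-class F1 (each class weighted equally, so rare
    # classes like scared pull the average down).
    macro = sum(f1(*c) for c in per_class.values()) / len(per_class)
    # Micro: pool the counts over classes (each decision weighted equally).
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    return f1(tp, fp, fn), macro
```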

5.2 Transfer learning to emotion detection

We now demonstrate that care-bert is also useful as a pre-trained model for a related task in a setting with limited annotated data. We consider transfer learning to the ISEAR dataset (Scherer and Wallbott, 1994), a collection of 7666 statements from a diverse set of 3000 individuals labeled according to seven categories (anger, disgust, fear, guilt, joy, sadness, and shame). The labels pertain to the publisher affect, not the affective response considered in this work. Our experiment explores transfer learning to predict the labels in the ISEAR dataset, using an additional dropout layer (rate 0.3) and a dense layer.

Our experiments closely follow those of Demszky et al. (2020) and use training sets of different sizes (500, 1000, 2000, 4000, and 6000). To account for potentially noisy performance at small sample sizes, we use 10 different train-test splits and plot the mean and standard deviation of the F1-scores across these splits in Figure 3. We compare four fine-tuning setups: the first two start from care-bert and are then fine-tuned on the benchmark dataset, one with no parameter freezing (no_freeze) and one with all layers but the last two frozen (freeze). The third setup is similar to care-bert (no_freeze) but is trained on GoEmotions rather than CARE_db. The last setup is the bert-base-uncased model trained only on ISEAR. All setups use the same architecture and hyper-parameters as discussed above.

Our values differ slightly from those cited in Demszky et al. (2020) due to small differences in architecture and hyper-parameters. However, the overall results corroborate those of Demszky et al. (2020) in that models with additional pre-training perform better than the baseline (no additional pre-training) for limited sample sizes. From Figure 3, it is apparent that care-bert and the model built from GoEmotions perform essentially on par in these transfer-learning experiments, despite the fact that care-bert does not use human annotations. It is also worth noting that GoEmotions and the ISEAR dataset address the same task (emotion detection), while care-bert predicts affective response. The comparable performance of care-bert with the GoEmotions model demonstrates the utility of care-bert for other tasks with limited data and the promise of CARE as a means of reliable unsupervised labeling.

Figure 3: The F1-score of each model using varying training set sizes of the ISEAR dataset. The light blue line refers to using care-bert, but with freezing all parameters except in the last layer. The dark blue refers to the same but without any freezing. Lastly, the purple line refers to the same architecture as care-bert (no freezing) but trained on GoEmotions instead of CARE, and the red line is trained only on the ISEAR dataset using a bert-base-uncased model with the same hyperparameters.

6 Conclusion

We described a method for extracting training data for models that predict the affective response to a post on social media. CARE is an efficient, accurate, and scalable way of collecting unsupervised labels and can be extended to new classes. Using CARE, we created CARE_db, a large dataset that can be used for affective response detection and other related tasks, as demonstrated by the competitive performance of care-bert relative to similar BERT-based models on emotion detection. The CARE patterns, lexicon, CARE_db, and care-bert are made available on GitHub.

There are two main cases in which CARE does not provide as much benefit: (1) when there does not exist a set of common phrases that are indicative of an affect, and (2) when an indicator maps to multiple affects. In the latter case, there is still partial information that can be gleaned from the labels. In addition to developing methods for the above cases, future work also includes incorporating emojis, negations, and punctuation, and extending to new classes. Finally, we also plan to investigate the use of CARE for predicting the affective response to images as well as multi-modal content such as memes.

7 Broader Impact

Any work that touches upon emotion recognition or recognizing affective response needs to ensure that it is sensitive to the various ways of expressing affect across cultures and individuals. Clearly, applying the ideas described in this paper in a production setting would first have to test for cultural biases. To make “broad assumptions about emotional universalism [would be] not just unwise, but actively deleterious” to the general community (Stark and Hoey, 2021). We also note that emotion recognition methods belong to a taxonomy of conceptual models for emotion (such as that of Stark and Hoey (2021)), and these “paradigms for human emotions […] should [not] be taken naively as ground truth.”

Before being put into production, the method would also need to be re-evaluated when applied to a new domain to ensure reliable performance and prevent unintended consequences. Additionally, our work on detecting affective response is intended for understanding content, not the emotional state of individuals. This work is intended to identify or recommend content that aligns with the user’s preferences. It should not be used for ill-intended purposes, such as purposefully recommending particular content to manipulate a user’s perception or preferences.


Appendix A Details on expanding CARE patterns

Algorithm LABEL:alg:expand presents pseudo-code for the process of labeling posts and expanding CARE patterns and the CARE lexicon. Table 5 presents example results from the expansion process.

n-gram frequency class
adorable 9000 Adoring
gorgeous 8422 Adoring
fantastic 7796 Approving
interesting 5742 Amused
sorry for your 5202 Saddened
brilliant 4205 Approving
fake 2568 Angered
sorry to hear 2323 Saddened
why i hate 1125 Angered
i feel like 293 pattern
you are a 207 pattern
this is the 173 pattern
this made me 110 pattern
he is so 102 pattern
Table 5: Examples of n-grams resulting from GetNgrams in Algorithm LABEL:alg:expand and steps B1 and B2 of Figure 1. The n-grams above the middle line are added to the lexicon under the specific class listed while the n-grams below are used for further expansion of CARE patterns after translating to reg-ex format manually.
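The counting step behind GetNgrams can be illustrated roughly as follows. This is our own sketch under simplifying assumptions: the paper's exact tokenization, n-gram sizes, and frequency cutoff are not specified here, so `n_values` and `min_count` are illustrative parameters.

```python
from collections import Counter

def get_ngrams(comments, n_values=(1, 2, 3), min_count=100):
    """Count word n-grams across a corpus of comments.

    Frequent n-grams are candidates for the lexicon (when they map to a
    class) or for new CARE patterns (cf. Table 5), subject to manual review.
    """
    counts = Counter()
    for text in comments:
        tokens = text.lower().split()
        for n in n_values:
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}
```

In the actual pipeline the surviving n-grams are translated to reg-ex format manually, as noted in the Table 5 caption.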


Appendix B Annotation details

Figure 4: Interface for crowd-sourcing process using Amazon Mechanical Turk. Three distinct annotators were used to annotate each post. Annotators were told an affective response is an emotion or cognitive response to the post and the definitions and examples in Table 1 were shown to them.

Figure 4 shows the interface used for crowd-sourcing human annotations for evaluating CARE patterns. To better understand annotation results for each class, we present Table 6, which shows annotator agreement statistics broken down by class. We also computed Fleiss’ kappa for each class, where a value between 0.41-0.60 is generally considered moderate agreement and a value between 0.61-0.80 is substantial agreement. As can be seen, classes such as adoring have high average annotator support and Fleiss’ kappa while others like amused have low average annotator support and Fleiss’ kappa, an observation that aligns with the findings in Section 4.2.

AR  |  % w/ support  |  Avg support  |  Fleiss’ kappa
Adoring 99.2 2.8 0.78
Amused 93.2 2.1 0.43
Approving 98.8 2.8 0.51
Excited 83.6 2.1 0.58
Angered 99.4 2.8 0.59
Saddened 99.6 2.9 0.61
Scared 98.8 2.6 0.64
Average 96.1 2.6 0.59
Table 6: The percent of CARE-labeled examples (maximum of 100) with agreement from at least one labeler by class and of those examples, the average number of annotator agreement (maximum of 3). The third column shows the Fleiss’ kappa, which was computed for class based on the presence and absence of label by each annotator for a given post. The bottom row is the average over all classes.
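For a given class, Fleiss' kappa over the presence/absence judgments of the three annotators can be computed as below. This is a generic sketch of the standard formula, not the paper's code; each row holds the counts of annotators who marked the label present and absent for one post.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for N items rated by n annotators into k categories.

    ratings: list of rows, one per item; each row holds per-category
    rater counts and sums to n (here k=2: label present / absent, n=3).
    """
    n = sum(ratings[0])                     # raters per item
    N = len(ratings)                        # number of items
    k = len(ratings[0])                     # number of categories
    # marginal proportion of assignments to each category
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    # per-item observed agreement
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N                    # mean observed agreement
    P_e = sum(p * p for p in p_j)           # chance agreement
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement across items yields kappa = 1, while systematic disagreement drives it negative.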
Figure 5: Prevalence of class labels according to annotations from AMT on which at least two annotators agree (blue) and according to CARE (orange). The prevalence of approving was much higher from AMT, likely due to a large perceived overlap in the definitions of approving and other classes such as excited.
Figure 6: Pairwise Spearman correlation between each pair of classes, computed using the degree of annotator support for each class given a post. The dendrogram represents a hierarchical clustering of the data, correctly capturing the distinction between positive and negative classes.

Appendix C Affective response and publisher affect

The GoEmotions dataset and classifier target the publisher affect (of comments), whereas care-bert and CARE target the affective response (of posts). To study the correlation between affective response and publisher affect, we compare the following sets of labels: 1) the human annotations in GoEmotions and the affective responses predicted by care-bert applied to GoEmotions, and 2) the CARE labels for posts in CARE_db and the publisher affects predicted by the GoEmotions classifier applied to CARE_db. Specifically, for every annotated label (i.e., not from a classifier), we count the percentage of the time it intersects with the set of predicted labels (i.e., from a classifier).

The results of these experiments are shown in Table 7, broken down according to the class of the annotated label. Overall, the rate of agreement between affective response and publisher affect labels (44%) is moderate, suggesting that affective response detection and emotion detection are not necessarily the same problem, in particular for scared and approving. The classes approving, excited, and angered show large variance between the two datasets, where the first (Table 7, second column) uses comments and the second (Table 7, third column) uses posts. This could be due to classification errors (either by the GoEmotions classifier or by care-bert) or to the type of text (comment versus post). More research and data collection is needed to understand the relationship between affective response and publisher affect.

AR GoEmotions CARE Average
Amused 63 54 59
Approving 8 47 28
Excited 52 24 38
Angered 4 74 39
Saddened 60 62 61
Scared 44 34 39
Average 39 49 44
Table 7: Rate of intersection between affective response and publisher affect labels. The first column denotes the class. The second column denotes the percent of the time an annotated label in GoEmotions exists in the set of labels predicted by care-bert when applied to the GoEmotions dataset. The third column denotes the percent of the time a CARE label in CARE_db exists in the set of labels predicted by the GoEmotions classifier when applied to CARE_db. The last column is the row-wise average.
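The intersection-rate metric of Table 7 can be sketched as a small helper (our own illustration; the function name and data layout are assumptions). Each item pairs a set of gold (annotated) labels with a set of classifier-predicted labels.

```python
def intersection_rate(annotated, predicted):
    """Fraction of annotated labels that also appear in the predicted
    label set for the same item (cf. Table 7)."""
    hits = total = 0
    for gold_labels, pred_labels in zip(annotated, predicted):
        for label in gold_labels:
            total += 1
            hits += label in pred_labels
    return hits / total if total else 0.0
```

Grouping the gold labels by class before calling this helper yields the per-class breakdown in the table.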

Appendix D Evaluating publisher affect labels of comments

The GoEmotions dataset (Demszky et al., 2020) is a collection of 58k Reddit comments labeled according to the publisher affect from a taxonomy of 28 emotions. There exists a natural mapping from 6 of our classes to those of GoEmotions (the exception being adoring) based on the definitions alone. Hence, applying the CARE patterns and lexicon to the GoEmotions dataset presents another way of validating the quality of steps 1 and 2 of CARE. The number of examples in GoEmotions with labels belonging to these 6 classes was 21.0k, and the number of comments labeled by the CARE patterns and lexicon was 1259. Table 3 compares the human annotations in the GoEmotions dataset with the labels that the CARE patterns and lexicon assigned to the comments and shows a high degree of agreement.

While the low recall is certainly a limitation of the CARE patterns and lexicon when applied to a specific small dataset, we emphasize that the primary intention of the CARE patterns is to generate a labeled dataset in an unsupervised manner so one can start training classifiers for that affective response. Given the abundance of freely available unlabeled data (e.g., on Reddit, Twitter), recall is not a problem in practice. In the next section and in Section 4.3, however, we discuss how existing emotion classifiers, such as the GoEmotions classifier (Demszky et al., 2020), can also be leveraged in the CARE method.

Appendix E Evaluating CARE and its GoEmotions-based variant

Threshold Any CARE All CARE Other
95 34 25
91 61 42
87 71 51
81 73 57
73 67 62
58 56 70
47 45 76
38 37 81
30 29 84
24 23 88
max 89 89 60
CARE 93 89 54
ensemble 94 83 49
Table 8: The rate of intersection between labels agreed upon by at least two annotators and the labels proposed by CARE. The first column indicates the threshold used. Using annotations agreed upon by at least two annotators, the remaining columns show the rate of agreement with at least one predicted label, with all predicted labels, and with any human-annotated label that was not predicted. The row labeled ‘max’ refers to choosing the comment-level label with the highest frequency for each post. For context, the results for CARE at its optimal threshold are shown in the penultimate row. The last row presents results from combining the CARE pattern labels and the GoEmotions labels.

Here we consider a variant of the CARE method in which steps 1 and 2 of Figure 1 use the GoEmotions classifier instead of the CARE patterns. To evaluate how this variant compares with CARE, we use the same human-labeled dataset described in Section 4.1 and apply the GoEmotions classifier to all the comments belonging to these posts (72k comments). We then map the predicted GoEmotions labels to CARE pattern labels using the mapping in Table 3. GoEmotions and CARE labels not in the mapping are excluded from this analysis.

The same metrics for annotator agreement in Table 2 are shown in Table 8 for multiple thresholds and for all classes, excluding adoring. The CARE pattern labels consistently demonstrate higher agreement with human annotations than those of the GoEmotions-based variant. The last row of Table 8 shows results for an ensembling approach where steps 1 and 2 use labels from both the CARE patterns and the GoEmotions classifier, each with its own optimal threshold in step 3. This ensembling approach does reasonably well and can be used to include classes in the GoEmotions taxonomy that do not exist in the taxonomy of Table 1. Given other emotion classifiers, one could potentially include those as well.
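The ensembling step can be sketched as a union over sources, each gated by its own threshold on comment-level support. This is our own illustration: the function name is hypothetical and the threshold values, which the paper tunes per source, are placeholders.

```python
def ensemble_post_labels(pattern_counts, ge_counts, t_pattern, t_ge):
    """Combine CARE-pattern and GoEmotions-classifier evidence for one post.

    Each dict maps a class to the number of supporting comments; a class
    is kept if it meets the threshold of either source (t_pattern and
    t_ge are each source's own tuned threshold).
    """
    labels = {c for c, n in pattern_counts.items() if n >= t_pattern}
    labels |= {c for c, n in ge_counts.items() if n >= t_ge}
    return labels
```

Because the union is taken per class, classes covered only by the GoEmotions taxonomy can enter through the second source.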

Appendix F Multi-dimensional scaling pairwise plots

We visualize the degree of overlap between the sentence embeddings (using Sentence-BERT (Reimers and Gurevych, 2019)) of 100 comments in CARE_db for each class. We then use multi-dimensional scaling (MDS; Cox and Cox, 2008) to map the embeddings to the same two-dimensional space using Euclidean distance as the similarity metric, as shown in Figure 7 and Figure 8. Note that the MDS process does not use the class labels. As can be seen, there is substantial overlap between amused and other classes, as well as between excited and approving. Given that the average number of human annotations per post was 1.8 (Section 4.1), it is likely that a portion of this overlap can be attributed to the multi-label nature of the problem, as well as to the moderate correlation between certain classes such as excited and approving (Figure 6). See Figure 8 for plots of multi-dimensional scaling for every pair of classes, as referenced in Section 4.2.

Figure 7: The two-dimensional projection (using MDS) of sentence embeddings of comments suggests that the CARE-based predictions correspond to similarity in the embedding space. Colors correspond to the labels given by CARE labeling, which were not given to the embedding model or the MDS.
Figure 8: Subplots of plotting the multi-dimensional scaling from Figure 7 for each pairwise comparison of the 7 classes. The rows and columns follow in the order adoring, amused, approving, excited, angered, saddened, and scared. The entire grid is symmetric for ease of exploration.
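For concreteness, classical (Torgerson) MDS with Euclidean distances can be implemented in a few lines of NumPy. This is a generic sketch of the technique, not the paper's code, which may use a different MDS implementation (e.g., scikit-learn's iterative solver).

```python
import numpy as np

def classical_mds(X, dim=2):
    """Project rows of X to `dim` dimensions, preserving pairwise
    Euclidean distances as well as possible (classical MDS via
    double-centering of the squared-distance matrix)."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n                  # centering matrix
    B = -0.5 * J @ D2 @ J                                # Gram matrix
    w, v = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]                      # top eigenpairs
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0))
```

When the data are intrinsically two-dimensional, the projection reproduces the pairwise distances exactly (up to rotation and reflection).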

Appendix G Pattern match analysis

To investigate why higher thresholds would be needed for certain classes, we analyze the CARE patterns and lexicon at the class level.

Figure 9: Scatter plot of the total frequency of a match versus its false positive rate. Ground truth labels used here are those from AMT and agreed upon by at least 2 annotators. For clarity, a match is shown only if its total count was 10 or more and if it belongs to one of the three classes (adoring, amused, and excited). Only those which contain the keywords ‘sweet’ (adoring), ‘funny’ (amused), and ‘happy’ (excited) are labeled.

Let us define a match as a tuple containing the pattern name and the word or phrase that maps the comment to an affect according to the CARE lexicon. We could also consider exaggerators in our analysis, but here we assume they have a negligible effect on differentiating reliability. We previously assumed that each instantiated match carries the same weight of 1, but this may not be appropriate given that some patterns or words are more reliable than others.

As can be seen in Figure 9, there are some cases in which the keyword in general seems to have a high false positive rate (e.g., happy), and in other cases the erroneous combination of a particular pattern and keyword can lead to a high false positive rate. For example, while the match ‘(so very, funny)’ has a low false positive rate of 0.2, ‘(I, funny)’ has a much higher false positive rate of 0.57, which intuitively makes sense, since ‘I’m funny’ does not indicate being amused. We also investigated whether individual patterns are prone to higher false positive rates, which does not seem to be the case. For future iterations of CARE, one could also use the true positive rate as the weight of a match to obtain a weighted sum when aggregating over comments to label a post.
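Estimating the per-match false positive rate against the AMT ground truth can be sketched as below (our own helper; the record format, pairing each match with whether its label agreed with the annotators, is an assumption).

```python
from collections import defaultdict

def match_false_positive_rate(match_records):
    """Per-match false positive rate from labeled occurrences.

    match_records: iterable of ((pattern, keyword), was_correct) pairs,
    where was_correct reflects agreement with the AMT ground truth.
    Returns {match: fraction of occurrences that were false positives};
    1 minus this rate could serve as the match's weight.
    """
    stats = defaultdict(lambda: [0, 0])  # match -> [false positives, total]
    for match, correct in match_records:
        stats[match][1] += 1
        stats[match][0] += not correct
    return {m: fp / total for m, (fp, total) in stats.items()}
```

The resulting weights would replace the uniform weight of 1 when aggregating comment-level matches into a post label.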

Appendix H Modeling details

We began with the hyper-parameter settings in Demszky et al. (2020) and explored other settings (batch sizes [16, 32, 64], max length [64, 256, 512], dropout rate [0.3, 0.5, 0.7], epochs [2-10]) but found minimal improvements in the F1-score, as computed by the scikit-learn package in Python. Training on two Tesla P100-SXM2-16GB GPUs took roughly 19 hours. We also experimented with higher thresholds for the parameter (see Section 3.2) but saw marginal improvements, if any.

We developed two versions of care-bert: one using the classes in Table 1, and a simpler one using only the classes positive and negative. The first four rows in Table 1 are considered positive while the last three are negative; the results are shown in Table 9. Naturally, the two-class model, which blurs the differences between classes of the same valence, achieves higher scores.
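The class-to-valence collapse used for the two-class model amounts to a small mapping (the dict below is our own sketch, built from the row order of Table 1 described above).

```python
# First four rows of Table 1 -> positive; last three -> negative.
VALENCE = {
    "adoring": "positive", "amused": "positive",
    "approving": "positive", "excited": "positive",
    "angered": "negative", "saddened": "negative", "scared": "negative",
}

def to_valence(labels):
    """Collapse a set of 7-class labels to the binary scheme of Table 9."""
    return {VALENCE[label] for label in labels}
```

Training the two-class model then simply reuses the original labels after passing them through `to_valence`.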

AR Precision Recall F1
Positive 0.95 0.95 0.94
Negative 0.77 0.77 0.78
micro-avg 0.89 0.91 0.90
macro-avg 0.86 0.86 0.86
stdev 0.10 0.13 0.11
Table 9: Accuracy of care-bert for the two-class case: positive versus negative. Note that amused, excited, adoring, and approving were mapped to positive and angered, saddened, and scared were mapped to negative.