Perceived and Intended Sarcasm Detection with Graph Attention Networks

by Joan Plepi, et al.

Existing sarcasm detection systems focus on exploiting linguistic markers, context, or user-level priors. However, social studies suggest that the relationship between the author and the audience can be equally relevant for the sarcasm usage and interpretation. In this work, we propose a framework jointly leveraging (1) a user context from their historical tweets together with (2) the social information from a user's conversational neighborhood in an interaction graph, to contextualize the interpretation of the post. We use graph attention networks (GAT) over users and tweets in a conversation thread, combined with dense user history representations. Apart from achieving state-of-the-art results on the recently published dataset of 19k Twitter users with 30K labeled tweets, adding 10M unlabeled tweets as context, our results indicate that the model contributes to interpreting the sarcastic intentions of an author more than to predicting the sarcasm perception by others.




1 Introduction

Sarcasm is a form of non-literal language in which the intended meaning of the utterance differs from the literal meaning, fulfilling a social function in a discourse (doi:10.1080/01638539509544922; riloff-etal-2013-sarcasm). Sarcasm detection poses a challenge for numerous NLP tasks, such as sentiment or stance prediction (maynard-greenwood-2014-cares).

Early sarcasm detection systems are based on lexical and syntactic cues (10.1145/1651461.1651471; davidov-etal-2010-semi; Tsur_Davidov_Rappoport_2010; gonzalez-ibanez-etal-2011-identifying; 10.1007/s10579-012-9196-x; ghosh-etal-2015-sarcastic). However, sarcasm interpretation requires context, even for humans (wallace-etal-2014-humans). More recent works hence incorporate discourse information such as contrast (riloff-etal-2013-sarcasm; khattri-etal-2015-sentiment; joshi-etal-2015-harnessing; 10.1145/2684822.2685316; tay-etal-2018-reasoning), and contextualize the post by using features from user history (bamman2015contextualized; amir-etal-2016-modelling; oprea-magdy-2019-exploring; hazarika-etal-2018-cascade). The relationship between an author and the audience has been given comparably less attention, despite its relevance for sarcasm interpretation (doi:10.1080/08824090109384781; doi:10.1080/10926488.2000.9678862; doi:10.1177/0261927X07309512; doi:10.1177/1461444810365313; bamman2015contextualized). In this work, we propose a graph neural network framework jointly leveraging a user context from their historical tweets together with the social information from a user’s neighborhood modeled by heterogeneous graph structures.

The key contributions of this paper are:

(1) We present the first graph attention-based model to identify sarcasm on social media by explicitly modeling users’ social and historical context jointly, capturing complex relations between a sarcastic tweet and its conversational context.

(2) We demonstrate that exploiting these relationships increases performance in the sarcasm detection task, reaching state-of-the-art results on the recent SPIRS dataset (shmueli-etal-2020-reactive), which we expand with user history. We examine the impact of different parts of the context, captured by attention weights, in modeling sarcastic utterances.

(3) We find that even with user-based models, detecting sarcastic intentions of the author is easier than identifying the sarcasm perception by others.

2 Related Work

Leveraging user history

Several previous works contextualize a sarcastic post by using features from user history: employing past tweets to identify a user’s behavioral traits (10.1145/2684822.2685316), encoding user sentiment priors over different entities (khattri-etal-2015-sentiment), or manually crafting user interaction features (bamman2015contextualized). amir-etal-2016-modelling introduce the user2vec model, applying paragraph2vec (10.5555/3044805.3045025) over user history. hazarika-etal-2018-cascade propose an alternative user embedding approach, encoding style and personality features.

Leveraging user network

An emerging line of research makes use of social interactions to encode information about the user induced by a neural architecture (10.1145/2939672.2939754; Qiu2018). Network information improves performance on detecting cyberbullying (10.1145/3292522.3326034), abusive language use (qian-etal-2018-leveraging), suicide ideation (mishra-etal-2019-snap) or fake news (chandra2020graph). To the best of our knowledge, graph network based approaches have not been used in the sarcasm detection task so far.

Perceived and intended sarcasm

Perceiving sarcasm in text is not trivial even for humans, not only due to the lack of acoustic markers (BANZIGER2005252; doi:10.1080/10926488.2011.583197) but also due to sociocultural diversity (doi:10.1080/08824090109384781; doi:10.1177/0261927X07309512), where in many cases the audience may misinterpret a sarcastic statement as sincere. This has only recently been reflected in sarcasm detection models (hazarika-etal-2018-cascade; shmueli-etal-2020-reactive).

3 Proposed Approach

3.1 Tweet Embeddings

We denote the current tweet to be assessed by $t_i$, $i \in \{1, \dots, N\}$, where $N$ is the total number of tweets. We utilize SentenceBERT embeddings (reimers-gurevych-2019-sentence) to encode the tweets. Formally, $s_i = \mathrm{SBERT}(t_i)$, where $s_i$ is a fixed-size vector and SentenceBERT computes the mean of all tokens’ representations. We forward this representation into a linear layer to transform it into dimension $d$, obtaining the tweet embedding $e_{t_i} \in \mathbb{R}^{d}$.
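The two steps of this section (mean pooling over token representations, then a linear projection) can be illustrated with a minimal pure-Python sketch. The toy vectors, the projection matrix `W`, the bias `b`, and the helper names `mean_pool` and `linear` are hypothetical and only stand in for the SentenceBERT output and the learned layer.

```python
from typing import List

Vector = List[float]

def mean_pool(token_vectors: List[Vector]) -> Vector:
    """Average token vectors element-wise into one sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(tok[k] for tok in token_vectors) / n for k in range(dim)]

def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Apply y = W x + b, with W given as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

# Toy token representations for one tweet (3 tokens, 4 dimensions).
tokens = [[1.0, 0.0, 2.0, 4.0],
          [3.0, 2.0, 0.0, 0.0],
          [2.0, 4.0, 1.0, 2.0]]
sentence_vec = mean_pool(tokens)  # [2.0, 2.0, 1.0, 2.0]

# Hypothetical projection from 4 dimensions down to d = 2.
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 0.5, 0.0, 0.5]]
b = [0.0, 1.0]
tweet_embedding = linear(sentence_vec, W, b)  # [2.0, 3.0]
```

In practice the token vectors would come from a pre-trained encoder and `W`, `b` would be trained with the rest of the model.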

3.2 User Embeddings (Historical Context)

Let $u_{t_i}$ be the author of tweet $t_i$; from now on we keep only the index $i$ for brevity. Each user $u_i$ is associated with a set of historical tweets $H_i = \{p_i^{\tau}\}$, where $p_i^{\tau}$ is a historical tweet posted at time $\tau$ by the user $u_i$. We adopt user2vec (amir-etal-2016-modelling) to compute the initial user representation $e_{u_i}$ of user $u_i$ based on their corresponding historical tweets $H_i$, optimizing the conditional probability of texts given the author.
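To make the idea of "conditional probability of texts given the author" concrete, the sketch below scores a user history under a plain softmax over dot products between a user vector and word vectors. This is a simplified stand-in chosen for illustration, not the exact user2vec objective; the vocabulary, vectors, and function names are all hypothetical.

```python
import math
from typing import Dict, List

def log_prob_word(word: str, user_vec: List[float],
                  word_vecs: Dict[str, List[float]]) -> float:
    """log P(word | user) under a softmax over dot-product scores."""
    scores = {w: sum(a * b for a, b in zip(user_vec, v))
              for w, v in word_vecs.items()}
    max_s = max(scores.values())  # subtract the max for numerical stability
    log_norm = max_s + math.log(sum(math.exp(s - max_s) for s in scores.values()))
    return scores[word] - log_norm

def history_log_likelihood(history: List[str], user_vec: List[float],
                           word_vecs: Dict[str, List[float]]) -> float:
    """Sum of log P(word | user) over every word of every historical tweet."""
    return sum(log_prob_word(w, user_vec, word_vecs)
               for tweet in history for w in tweet.split())

# Toy vocabulary and two candidate user vectors (all values are made up).
word_vecs = {"great": [1.0, 0.0], "sure": [0.0, 1.0], "thanks": [-1.0, 0.0]}
history = ["great great sure"]
ll_a = history_log_likelihood(history, [2.0, 0.0], word_vecs)
ll_b = history_log_likelihood(history, [-2.0, 0.0], word_vecs)
```

Training would adjust each user vector to increase this likelihood on the user's own history, so that `ll_a` (a vector aligned with the words the user writes) ends up higher than `ll_b`.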

Figure 1: An example of a heterogeneous user and tweet social graph extracted from one conversation.

3.3 Social Graph (Network Context)

Apart from the importance of surrounding context to understand sarcasm (wallace-etal-2014-humans), a certain understanding is needed between the audience and the author (doi:10.1080/10926488.2000.9678862; doi:10.1177/0261927X07309512). Our goal is to model relations between users and their past tweets, interactions between users, and relations between tweets in one conversation. We model these relationships as a graph $G = (V, E)$, where $V$ contains two types of nodes, Users and Tweets (Figure 1). We use three edge types $E = \{E_{uu}, E_{tt}, E_{ut}\}$, where $E_{uu}$ represents the social interaction between users: quotes, mentions, or replies in the user history. $E_{tt}$ denotes the edges between tweets that are involved in one discussion thread, with all tweets connected with each other, and $E_{ut}$ is the relation between a tweet and its author.
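The graph described above can be sketched as follows for a single conversation thread. The input format (a list of tweet/author pairs plus a list of user interaction pairs), the function name `build_graph`, and the edge-type labels are assumptions made for this illustration.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Node = Tuple[str, str]  # (node type, identifier)

def build_graph(thread: List[Tuple[str, str]],
                interactions: List[Tuple[str, str]]):
    """thread: (tweet_id, author_id) pairs of one conversation.
    interactions: (user, user) pairs from quotes, mentions, or replies."""
    nodes: Set[Node] = set()
    edges: Dict[str, Set[Tuple[Node, Node]]] = {
        "user-user": set(), "tweet-tweet": set(), "tweet-author": set()}
    for tweet_id, author in thread:
        nodes.add(("tweet", tweet_id))
        nodes.add(("user", author))
        edges["tweet-author"].add((("tweet", tweet_id), ("user", author)))
    # All tweets of one discussion thread are connected with each other.
    for (t1, _), (t2, _) in combinations(thread, 2):
        edges["tweet-tweet"].add((("tweet", t1), ("tweet", t2)))
    for u, v in interactions:
        nodes.update({("user", u), ("user", v)})
        edges["user-user"].add((("user", u), ("user", v)))
    return nodes, edges

thread = [("t1", "alice"), ("t2", "bob"), ("t3", "alice")]
nodes, edges = build_graph(thread, [("alice", "bob")])
```

For this toy thread the graph has five nodes (three tweets, two users), three tweet-tweet edges, three tweet-author edges, and one user-user edge.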

Representation Learning:

We use Graph Attention Networks (GATs, velickovic2018graph) to exploit the neighborhood of each node to compute the final representations. (We ran early experiments with Graph Convolutional Networks as well, obtaining inferior and less interpretable results.) GAT uses a self-attention mechanism (bahdanau2015neural; 10.5555/3295222.3295349) to assign an importance score to the connections that contribute more to the detection of sarcastic or non-sarcastic tweets. We initialize the user and the tweet nodes of the GAT with their corresponding embeddings $e_{u_i}$ and $e_{t_i}$. The initial node representation $h_i$ of each node $i$ is linearly transformed by a weight matrix $W$ into a vector $W h_i$. Next, the attention weights of each node are computed as:

$$e_{ij} = a(W h_i, W h_j), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{l \in \mathcal{N}_i} \exp(e_{il})}$$

where $j$ is a node in the neighborhood $\mathcal{N}_i$ of $i$ and $a$ is the attention mechanism function, which is a single-layer feedforward neural network parameterized by a weight vector $\mathbf{a}$ with a LeakyReLU nonlinearity.

The final node representation is computed as:

$$h_i' = \sigma\left(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} h_j\right)$$

where $K$ is the number of attention heads, $\sigma$ is the ReLU nonlinear function, $W^{k}$ a weight matrix and $\alpha_{ij}^{k}$ the normalized attention weights from the $k$-th attention mechanism.
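The attention and aggregation steps can be traced on a toy neighborhood with the pure-Python sketch below. It assumes the standard GAT formulation (score from a weight vector applied to the concatenated transformed vectors, LeakyReLU with slope 0.2, softmax over the neighborhood) and averages the heads before the ReLU; the slope value, the helper names, and the toy numbers are assumptions, not details taken from the paper.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]

def matvec(W: Matrix, h: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def leaky_relu(x: float, slope: float = 0.2) -> float:  # slope is an assumption
    return x if x > 0 else slope * x

def attention_weights(h_i: Vector, neighbors: List[Vector],
                      W: Matrix, a: Vector) -> List[float]:
    """Softmax-normalized attention of node i over its neighborhood."""
    wh_i = matvec(W, h_i)
    scores = [leaky_relu(sum(p * q for p, q in zip(a, wh_i + matvec(W, h_j))))
              for h_j in neighbors]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gat_update(h_i: Vector, neighbors: List[Vector],
               heads: List[Tuple[Matrix, Vector]]) -> Vector:
    """Average the attention-weighted neighbor sums over heads, then ReLU."""
    acc = [0.0] * len(heads[0][0])
    for W, a in heads:
        alphas = attention_weights(h_i, neighbors, W, a)
        for alpha, h_j in zip(alphas, neighbors):
            for k, v in enumerate(matvec(W, h_j)):
                acc[k] += alpha * v
    return [max(0.0, v / len(heads)) for v in acc]

# Toy example: one head, identity transform, two neighbors.
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.0, 0.0, 1.0, 0.0]  # scores depend on the first neighbor feature only
h_i = [1.0, 1.0]
neighbors = [[1.0, 0.0], [0.0, 0.0]]
alphas = attention_weights(h_i, neighbors, W, a)
h_new = gat_update(h_i, neighbors, [(W, a)])
```

Whether the node itself is part of its neighborhood is left to the caller here; pass it in `neighbors` to model a self-loop.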

3.4 Classification model

The user and tweet representations learned by the GAT are concatenated and forwarded through a two-layer feed-forward network parameterized by weight matrices $W_1$ and $W_2$, whose sizes are set by $d$, the dimension of the projected embeddings, and $c$, the number of classes. The final prediction of the model is the output of this network for the concatenated representation.
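A minimal sketch of such a classification head is given below. The ReLU hidden activation, the softmax output, the absence of bias terms, and the toy weights are assumptions made for illustration; only the concatenation and the two weight matrices come from the description above.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]

def classify(user_vec: Vector, tweet_vec: Vector,
             W1: Matrix, W2: Matrix) -> List[float]:
    """Concatenate user and tweet vectors, apply two layers, return class probabilities."""
    x = user_vec + tweet_vec  # concatenation
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in W1]
    logits = [sum(w * v for w, v in zip(row, hidden)) for row in W2]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy weights: d = 2, c = 2 classes (non-sarcastic, sarcastic).
W1 = [[1.0, 0.0, 0.0, 0.0],
      [0.0, 0.0, 0.0, 1.0]]
W2 = [[1.0, 0.0],
      [0.0, 1.0]]
probs = classify([1.0, 0.0], [0.0, 2.0], W1, W2)
predicted_class = probs.index(max(probs))
```

With these toy weights the hidden layer is `[1.0, 2.0]`, so the second class receives the higher probability.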

Figure 2: The social graph is initialized with user and tweet embeddings (user2vec and sentence-BERT), and tuned by GAT to take into account relationships between them. The output representations are then fed into the classification layer.

4 Experimental Setup

4.1 Dataset

For our experiments, we use the recently published SPIRS sarcasm dataset (shmueli-etal-2020-reactive). It utilizes cue tweets, i.e., conversation replies which point out the sarcastic nature of a previous post. In addition, the dataset also provides oblivious tweets, questioning the sarcastic nature of a given example, and elicit tweets, being the original start of the conversation. Non-sarcastic posts were collected randomly in equal numbers. The labeled dataset contains in total 15,000 sarcastic tweets (10,000 self-reported and 5,000 perceived cues), 15,000 non-sarcastic, 10,000 oblivious and 9,156 elicit tweets.

User context

We extend SPIRS with over 10 million past tweets of the authors in the dataset in order to compute the user embeddings.

Social network

Our graph consists of the three types of connections described in Sec. 3.3. To avoid the bias coming from cue tweets, we exclude these from our graph. Our final social network consists of 108K nodes with a density of 0.00002 and 32% homophily, defined as the percentage of connections between authors of tweets with the same label.
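The two graph statistics can be computed as in the sketch below. It assumes an undirected simple graph for the density formula, which the text does not state explicitly, and it applies the homophily definition given above to a list of edges with one label per endpoint; the function names and toy data are hypothetical.

```python
from typing import Dict, List, Tuple

def density(num_nodes: int, num_edges: int) -> float:
    """Fraction of possible undirected edges that are present."""
    return 2 * num_edges / (num_nodes * (num_nodes - 1))

def homophily(edges: List[Tuple[str, str]], label: Dict[str, int]) -> float:
    """Share of edges whose two endpoints carry the same label."""
    same = sum(1 for u, v in edges if label[u] == label[v])
    return same / len(edges)

# Toy graph: a path over four users with sarcastic (1) / non-sarcastic (0) labels.
edges = [("a", "b"), ("b", "c"), ("c", "d")]
label = {"a": 1, "b": 1, "c": 0, "d": 1}
toy_density = density(4, len(edges))     # 0.5
toy_homophily = homophily(edges, label)  # one of three edges joins same-label users
```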

4.2 Comparison Baselines

The baselines introduced by shmueli-etal-2020-reactive are a Convolutional Neural Network, a Bidirectional LSTM, and a fine-tuned pre-trained BERT model. We compare our model with BERT, which performs the best of these. We add two baselines which incorporate user information. First, we extend BERT by simply concatenating the tweet embeddings with their respective user2vec author representation (‘BERT + user2vec’). As a second baseline (‘BERT + user-only GAT’), we build a social graph with only user nodes and their interactions (quotes, mentions, or replies) as edges, and apply the GAT initialized with user2vec embeddings. The implementation of the models and the results are made publicly available to facilitate reproducibility and reuse.

5 Results and Analysis

Our proposed GAT-based model significantly outperforms all the baselines (Table 1) despite having fewer trainable parameters (500K) than the BERT model (110M). First, by simply concatenating the user2vec embeddings to BERT, we obtain a 3.4% F1 score improvement over the BERT model, indicating the importance of user context in sarcasm detection. Next, we introduce the GAT module into the model. We first experiment with only tweet-to-tweet connections in the graph, based on the conversations on Twitter and trained on top of the fine-tuned BERT. In this case, the GAT layer only brings a 0.2% improvement due to the sparse and disconnected nature of the constructed graph. We then replace user2vec with GAT embeddings tuned on the user-only social graph, and achieve a 6.1% improvement over BERT and 3% over ‘BERT + user2vec’, presumably thanks to exploiting the homophily relations between users. Finally, applying GAT on the full heterogeneous user and tweet graph (as per Figure 2) provides a large performance boost thanks to incorporating the conversational thread context between tweets.

Sarcasm Detection
Model                           P        R        F1
BERT                            70.1%    69.7%    69.9%
BERT + user2vec                 73.6%    73.2%    73.4%
BERT + tweet-tweet GAT          70.4%    69.9%    70.1%
BERT + user-only GAT            74.2%    78.1%    76.1%
User+tweet GAT (no cues)        84.7%    83.7%    84.2%
User+tweet GAT, no elicit       83.2%    80.8%    82.0%
User+tweet GAT, no oblivious    82.4%    80.4%    81.4%
User+tweet GAT + cue tweets     94.7%    94.3%    94.5%
Table 1: Mean overall precision (P), recall (R), and F1 score (F1) of each model over 10 runs with varying seeds, detecting sarcasm on the SPIRS dataset.

User representation

We compare the initial user embeddings produced by user2vec with the final representations computed by the GAT. The representations are projected into 2-dimensional space using t-SNE (van2008visualizing). In Figures 3 and 4 we visualize the initial user2vec representations and the representations computed by the GAT layer, respectively. While in the user2vec representations sarcastic users cannot be distinguished from non-sarcastic ones, in the GAT representations we can observe communities of users sharing the same sarcastic tendency.

Figure 3: Initial representations of users (user2vec) projected in 2D space with t-SNE. Red color denotes sarcastic users, blue non-sarcastic.
Figure 4: Learned representations by our social network module (GAT) projected in 2D space with t-SNE. Red color denotes sarcastic users, blue non-sarcastic.

Conversation context

For comparison, we construct three more social graphs where: 1) we remove the elicit tweets which triggered the sarcastic comment (GAT - elicit tweets), 2) we remove the oblivious tweets which interpreted the comment as serious (GAT - oblivious tweets), 3) we add the original cue tweets, revealing that the post was sarcastic (GAT + cue tweets). As expected, adding the cue tweets to the social graph leads to an almost perfect F1 score of 94.5%. Removing oblivious and elicit tweets causes just a small performance drop (2-3%). In the way the SPIRS dataset is annotated, an oblivious tweet typically triggers a cue tweet (“c’mon, dude, it was just sarcasm”). We hypothesize that even with the cue tweets removed, the model is able to learn the predictive relation between oblivious and sarcastic tweets. This is in line with the original paper (i.e. without user context), where a 3.4% drop in prediction accuracy was observed when the oblivious tweets were removed.

Attention weights

The attention mechanism of GAT is able to assign varied weights to different nodes in the neighborhood, dynamically encoding the user by their homophily relations, which boosts the effect of authors in tweet representations (flek-2020-returning). We confirm this by examining users with a larger number of tweets in the dataset. When users tend to be sarcastic in most of their posts, the attention weight of their non-sarcastic tweets is smaller. In these cases, the attention weights give more importance to the surrounding user context than to the conversation thread. Overall, the largest sources of information for the model are the user nodes and the tweet that is being classified. We note that the normalized attention weights are smaller for the oblivious and elicit tweet edges, and higher for the edges that connect tweets with their respective author. In other words, the conversational context only plays a decisive role in case of insufficient or inconsistent user-level priors.

Sarcasm Perception

Cue tweets can be either authored by the same user as the sarcastic post (intended sarcasm) or a different one (perceived sarcasm). We observe that in the sarcasm detection task, the error rate on perceived sarcasm is 20%, while on self-reported sarcasm it is only around 15%. We therefore test our model on distinguishing between perceived and self-reported sarcasm. Our GAT model brings an improvement of 2.2% over the BERT baseline, with perceived sarcasm being harder to detect (F1 56%) than self-reported sarcasm (F1 84%). These results are aligned with the conclusions from oprea-magdy-2019-exploring. In most cases, perceived sarcasm is misclassified as self-reported, which is present more often (70%) in the data. Perceived sarcasm depends on the readers rather than the author of the tweet; we therefore hypothesize that modeling the authors’ context is less useful. It could be of benefit to model more robust recipient user profiles as well, to better predict how each individual will react.

Sarcasm Perception
Model                           P        R        F1
BERT                            73.2%    68.0%    69.0%
User+tweet GAT (no cues)        75.0%    67.7%    71.2%
Table 2: Mean overall precision (P), recall (R), and F1 score (F1) over 10 runs classifying self-reported (intended) and perceived sarcasm on the SPIRS dataset.


Modeling the social networks with GAT is affected by several factors. First, the graph density is low: the original dataset wasn’t collected by following relationships between users, hence many users across different conversation threads are not related to each other. Second, the homophily degree is only 32%, meaning that users with a sarcastic tendency have few connections among them.

6 Conclusions

In this work, we explore social networks of user interactions, and contextual information to interpret sarcastic intentions in social media. We propose a graph attention-based model, which combines contextual information of users, linguistic features, and social networks. The heterogeneous social network modeling dynamically exploits relationships between users and tweets in a conversation and significantly improves the state-of-the-art results.


This work has been supported by the German Federal Ministry of Education and Research (BMBF) as a part of the Junior AI Scientists program under the reference 01-S20060. We would like to thank Flora Sakketou and all the anonymous reviewers for their valuable input.

Ethical Considerations

The ability to automatically approximate personal characteristics of online users in order to improve natural language classification algorithms requires us to consider a range of ethical concerns, including: (1) privacy and user consent, (2) representativeness of the data for generalization, and (3) user vulnerability to a potential model or data misuse or misinterpretation.

Use of any user data for personalization shall be transparent and limited to the given purpose, and no individual posts shall be republished (hewson2013ethics). Researchers are advised to take account of users’ expectations (williams2017towards; shilton2016we; townsend2016social) when collecting public data such as Twitter. In this case, when we expand the original dataset with more extensive user history, we utilize publicly available Twitter data in a purely observational (DBLP:journals/corr/NorvalH17) and non-intrusive manner. All user data is kept separately on protected servers, linked to the raw text and network data only through anonymous IDs.

shah-etal-2020-predictive identify four different sources of bias in NLP models: selection bias, label bias, model overamplification, and semantic bias. While we can’t exclude any of those, the selection bias in particular should be kept in mind when reusing the presented model, as it is unclear to what extent the SPIRS dataset augmented with user history represents a sample of the overall population on Twitter. The user selection was based solely on the available sarcasm annotations, and doesn’t include any sociodemographic information.

In addition, any user-augmented classification efforts risk invoking stereotyping and essentialism, as the algorithm may lean towards labeling people rather than posts (e.g. “this is a sarcastic person”). Such stereotypes can cause harm even if they accurately reflect average differences (rudman2008social). These can be emphasized by the semblance of objectivity created by the use of a computer algorithm (koolen-van-cranenburgh-2017-stereotypes). It is important to be mindful of these effects when interpreting the model results in one’s own end-application context.



Appendix A Configurations

We perform a stratified 90/10 train-test split and sample a part of the training data for validation. All splits have the same class distribution and different sets of tweet authors. We use 3 GAT layers with multiple attention heads. To train our model we use a dropout of 0.4 (srivastava2014dropout) and the Adam optimization algorithm (kingma2015adam) for 500 training epochs with early stopping. For the GAT layers, we compute the mean of the outputs from each attention head instead of concatenating them. All experiments are run on Nvidia A100 40 GB GPUs.
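One way to obtain a split that keeps authors disjoint while staying close to the class distribution is sketched below. The greedy procedure, the function name `author_disjoint_split`, and the toy data are assumptions for illustration; the exact splitting procedure used for the experiments is not described in the text.

```python
import random
from collections import Counter, defaultdict
from typing import List, Tuple

Example = Tuple[int, str, int]  # (tweet id, author, label)

def author_disjoint_split(examples: List[Example],
                          test_frac: float = 0.1, seed: int = 0):
    """Greedy split: whole authors go to test until per-class quotas are met."""
    by_author = defaultdict(list)
    for ex in examples:
        by_author[ex[1]].append(ex)
    class_counts = Counter(ex[2] for ex in examples)
    quota = {lab: round(test_frac * n) for lab, n in class_counts.items()}
    used: Counter = Counter()
    authors = sorted(by_author)
    random.Random(seed).shuffle(authors)
    train, test = [], []
    for author in authors:
        need = Counter(ex[2] for ex in by_author[author])
        # Move the author to test only if no class quota would be exceeded.
        if all(used[lab] + n <= quota[lab] for lab, n in need.items()):
            test.extend(by_author[author])
            used.update(need)
        else:
            train.extend(by_author[author])
    return train, test

# Toy data: 20 authors with one tweet each, balanced over two classes.
examples = [(i, f"u{i}", i % 2) for i in range(20)]
train, test = author_disjoint_split(examples)
```

Because authors are assigned as a whole, no author appears in both splits; with real data the class proportions are matched only approximately.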

Appendix B User Context

To incorporate user context, we first extract all user IDs for all the tweets in the dataset. Because the dataset contains different tweet types with different users, we get 57K users in total. We fetch the tweet timeline of each user and end up with a total of 104M tweets, on average 1,800 posts per user. For user2vec training, we take into account only the users with a minimum of 50 posts in their timeline, and we limit the number of posts per user to 1,000. After filtering, the amount of tweets in the context is 10M. Every tweet is pre-processed by removing all links, replacing user mentions with "@user", and clearing emojis and hashtags. We train user2vec (amir-etal-2016-modelling) for 12 epochs with learning rate 1e-4. Users who are filtered out, or whose history we cannot extract, are initialized with the mean representation of their user neighbors in the social network. We used the history tweets only for creating the user-to-user edges; these tweets are not present in the constructed graph, but are already encoded in the initial user representation. We experimented with various history length settings, and found almost no difference in performance between using interactions throughout all history and interactions during the last year. Hence, we omitted older interactions to ease the computations.
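The pre-processing and filtering steps above can be sketched with the standard library only. Several details are assumptions: hashtags are removed entirely (the text only says they are "cleared"), the emoji pattern covers a few common Unicode ranges rather than every emoji, the timeline is assumed to be ordered so that the first 1,000 posts are the ones to keep, and all function names are hypothetical.

```python
import re
from typing import List, Optional

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
# Approximate emoji coverage (assumption): pictographs plus misc symbols.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def preprocess(text: str) -> str:
    """Remove links, normalize mentions to @user, drop hashtags and emojis."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("@user", text)
    text = HASHTAG_RE.sub("", text)
    text = EMOJI_RE.sub("", text)
    return " ".join(text.split())  # collapse leftover whitespace

def select_history(timeline: List[str], min_posts: int = 50,
                   max_posts: int = 1000) -> Optional[List[str]]:
    """Keep users with at least min_posts tweets, capped at max_posts."""
    if len(timeline) < min_posts:
        return None
    return timeline[:max_posts]

def mean_of_neighbors(neighbor_vecs: List[List[float]]) -> List[float]:
    """Fallback user vector: element-wise mean of neighboring user vectors."""
    n = len(neighbor_vecs)
    return [sum(col) / n for col in zip(*neighbor_vecs)]

cleaned = preprocess("Oh great, another Monday \U0001F644 #blessed @boss https://t.co/xyz")
fallback = mean_of_neighbors([[1.0, 2.0], [3.0, 4.0]])  # [2.0, 3.0]
```

For the sample tweet, `cleaned` is `"Oh great, another Monday @user"`.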