Sarcasm is a form of figurative language that implies a negative sentiment while displaying a positive sentiment on the surface Joshi et al. (2017). Because of its conflicting nature and subtlety in language, sarcasm detection has been considered one of the most challenging tasks in natural language processing. Furthermore, when sarcasm is used on social media platforms such as Twitter or Reddit to express users' nuanced intents, the language is often full of spelling errors, acronyms, slang, emojis, and special characters, which adds another level of difficulty to this task.
Despite its challenges, sarcasm detection has recently gained substantial attention because it adds a crucial layer of deep contextual understanding to various applications such as author profiling, harassment detection, and irony detection Van Hee et al. (2018). Many computational approaches have been proposed to detect sarcasm in conversations Ghosh et al. (2015); Joshi et al. (2015, 2016). However, most of the previous studies use the utterances in isolation, which makes it hard even for humans to detect sarcasm without the contexts. Thus, it is essential to interpret the target utterances along with contextual information comprising textual features from the conversation thread, metadata about the conversation from external sources, or visual context Bamman and Smith (2015); Ghosh et al. (2017); Ghosh and Veale (2017); Ghosh et al. (2018).
This paper presents a transformer-based sarcasm detection model that takes both the target utterance and its context and predicts if the target utterance involves sarcasm. Our model uses a transformer encoder to coherently generate the embedding representation for the target utterance and the context by performing multi-head attentions (Section 4). This approach is evaluated on two types of datasets collected from Twitter and Reddit (Section 3), and depicts significant improvement over the baseline using only the target utterance as input (Section 5). Our error analysis illustrates that the context-aware model can catch subtle nuance that cannot be captured by the target-oriented model (Section 6).
2 Related Work
Like most other types of figurative language, sarcasm is not necessarily complicated to express, but it requires a comprehensive understanding of the context as well as commonsense knowledge, rather than just its literal sense Van Hee et al. (2018). Various approaches have been presented for this task.
Most earlier works took the target utterance alone as input, without context. Both explicit and implicit incongruity features were explored in these works Joshi et al. (2015). To detect whether certain words in the target utterance involve sarcasm, several approaches based on distributional semantics were proposed Ghosh et al. (2015). Additionally, word embedding-based features such as distance-weighted similarities were adapted to capture the subtle forms of context incongruity Joshi et al. (2016). Nonetheless, it is difficult to detect sarcasm by considering only the target utterances in isolation.
Non-textual features such as the properties of the author, audience, and environment were also taken into account Bamman and Smith (2015). Both linguistic and context features were used to distinguish between information-seeking and rhetorical questions in forums and tweets Oraby et al. (2017). Traditional machine learning methods such as Support Vector Machines were used to model sarcasm detection as a sequential classification task over the target utterance and its surrounding utterances Wang et al. (2015). Recently, deep learning methods using LSTMs were introduced, considering the prior turns Ghosh et al. (2017) as well as the succeeding turns Ghosh et al. (2018).
3 Data Description
Given a conversation thread, either from Twitter or Reddit, a target utterance is the turn to be predicted as to whether or not it involves sarcasm, and the context is an ordered list of the other utterances in the thread. Table 1 shows examples of conversation threads where the target utterances involve sarcasm. Note that the target utterance can appear at any position within the context, although its exact position is not provided in this year's shared task data.
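To make the data format concrete, the following is a minimal sketch of how one conversation thread can be split into a (context, target, label) triple. The field names and the example thread are assumptions for illustration; the shared task's actual record schema may differ.

```python
def make_example(thread):
    """Split one thread dict into (context, target, label)."""
    context = thread["context"]   # ordered list of the other utterances
    target = thread["response"]   # the target utterance to classify
    label = thread["label"]       # "SARCASM" or "NOT_SARCASM"
    return context, target, label

# Hypothetical thread for illustration only.
thread = {
    "context": ["Trains are delayed again today.", "Third time this week!"],
    "response": "Wow, what a reliable service.",
    "label": "SARCASM",
}
context, target, label = make_example(thread)
```

Note that the target utterance alone ("Wow, what a reliable service.") reads as sarcastic only once the context is known, which is the motivation for the context-aware model below.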
The Twitter data is collected by using the hashtags #sarcasm and #sarcastic. The Reddit data is a subset of the Self-Annotated Reddit Corpus, which consists of 1.3 million sarcastic and non-sarcastic posts Khodak et al. (2017). Every target utterance is annotated with one of two labels, SARCASM and NOT_SARCASM. Table 2 shows the statistics of the two datasets provided by this shared task.
Notice the huge variance in utterance lengths for both the Twitter and the Reddit datasets. For the Reddit dataset, the average lengths of both conversations and utterances are significantly greater in the test set than in the training set, which potentially makes model development more challenging.
Two types of transformer-based sarcasm detection models are used for our experiments: the target-oriented model (Section 4.1) and the context-aware model (Section 4.2).
These two models are coupled with the latest transformer encoders, e.g., BERT Devlin et al. (2019), RoBERTa Liu et al. (2020), and ALBERT Lan et al. (2019), and are compared to evaluate how much impact the context makes on predicting whether or not the target utterance involves sarcasm.
4.1 Target-oriented Model
Figure 1(a) shows the overview of the target-oriented model. Let $U = \{u_1, \ldots, u_n\}$ be the input target utterance, where $u_i$ is the $i$'th token in $U$ and $n$ is the max number of tokens in any target utterance. $U$ is first prepended with the special token $c$ representing the entire target utterance, which creates the input sequence $I_t = \{c, u_1, \ldots, u_n\}$. $I_t$ is then fed into the transformer encoder, which generates the sequence of embeddings $E_t = \{e_c, e_1, \ldots, e_n\}$, where $E_t$ is the embedding list for $I_t$, $e_c$ is the embedding of $c$, and $e_1, \ldots, e_n$ are the embeddings of $u_1, \ldots, u_n$, respectively. Finally, $e_c$ is fed into the linear decoder to generate the output vector $o$ that makes the binary decision of whether or not $U$ involves sarcasm.
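The input construction for the target-oriented model can be sketched as follows. The BERT-style special-token name `[CLS]` and the truncation length of 128 (from Section 5.3) are assumptions; the actual special token depends on the encoder's tokenizer.

```python
MAX_LEN_TARGET = 128  # max sequence length for the target-oriented model

def build_target_input(target_tokens, max_len=MAX_LEN_TARGET):
    """Prepend the sequence-level special token, then truncate to max_len."""
    seq = ["[CLS]"] + target_tokens
    return seq[:max_len]

seq = build_target_input(["great", ",", "another", "monday"])
# seq[0] is the special token whose embedding feeds the linear decoder
```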
4.2 Context-aware Model
Figure 1(b) shows the overview of the context-aware model. Let $c_j$ be the $j$'th utterance in the context. Then, $T = \{w_1, \ldots, w_k\}$ is the concatenated list of tokens in all context utterances, where $m$ is the number of utterances in the context, $w_1$ is the first token in $c_1$, and $w_k$ is the last token in $c_m$. The input sequence $I_t$ from Section 4.1 is appended with the special token $s$ representing the separator between the target utterance and the context, followed by $T$, which creates the input sequence $I_c = \{c, u_1, \ldots, u_n, s, w_1, \ldots, w_k\}$. Then, $I_c$ is fed into the transformer encoder, which generates a sequence of embeddings $E_c = \{e_c, e_1, \ldots, e_n, e_s, e'_1, \ldots, e'_k\}$, where $E_c$ is the embedding list for $I_c$, and $e_s, e'_1, \ldots, e'_k$ are the embeddings of $s, w_1, \ldots, w_k$, respectively. Finally, $e_c$ is fed into the linear decoder to generate the output vector that makes the same binary decision to detect sarcasm.
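Analogously, the context-aware input appends a separator and the concatenated context tokens to the target sequence. As above, the `[CLS]`/`[SEP]` token names and the 256-token limit (from Section 5.3) are assumptions for illustration.

```python
MAX_LEN_CONTEXT = 256  # max sequence length for the context-aware model

def build_context_input(target_tokens, context_utterances, max_len=MAX_LEN_CONTEXT):
    """Target tokens, then a separator, then all context tokens concatenated."""
    seq = ["[CLS]"] + target_tokens + ["[SEP]"]
    for utterance in context_utterances:
        seq.extend(utterance)
    return seq[:max_len]

seq = build_context_input(
    ["what", "a", "reliable", "service"],
    [["trains", "delayed", "again"], ["third", "time", "this", "week"]],
)
```

The embedding of the leading special token still drives the classification, so the decoder is identical to the target-oriented model; only the input sequence differs.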
5.1 Data Split
For all our experiments, a mixture of the Twitter and the Reddit datasets is used. The Twitter training set provided by the shared task consists of 5,000 tweets, where the labels are equally balanced between SARCASM and NOT_SARCASM (Table 2). We find, however, that 4.82% of them are duplicates, which are removed before data splitting. As a result, 4,759 tweets are used for our experiments. Labels in the Reddit training set are also equally balanced, and no duplicate is found in this dataset.
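The duplicate removal can be sketched as a simple first-occurrence filter; the exact matching criterion used (e.g., whether texts are normalized before comparison) is an assumption here.

```python
def deduplicate(utterances):
    """Remove duplicate examples, keeping the first occurrence of each."""
    seen, unique = set(), []
    for text in utterances:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique

# Toy illustration of the filter.
tweets = ["a", "b", "a", "c", "b"]
assert deduplicate(tweets) == ["a", "b", "c"]
# On the shared task data: 5,000 tweets - 241 duplicates = 4,759 kept (4.82% removed)
```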
BERT, RoBERTa, and ALBERT are used as the transformer encoders to compare the performance among the current state-of-the-art encoders. Every model is run three times, and the average scores as well as standard deviations are reported. All models are trained on the combined Twitter + Reddit training set and evaluated on the combined development set (Table 3).
5.3 Experimental Setup
After an extensive hyper-parameter search, we set the learning rate to 3e-5 and the number of epochs to 30, and use a different seed value, 21, 42, or 63, for each of the three runs. Additionally, based on the statistics of each dataset, we set the maximum sequence length to 128 for the target-oriented models and to 256 for the context-aware models, reflecting the different input-sequence lengths required by the two approaches.
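The setup above can be summarized as a configuration sketch; the structure and names here are ours, only the values come from the paper.

```python
# Hyper-parameter configuration used across the three runs (values from the paper).
CONFIG = {
    "learning_rate": 3e-5,
    "epochs": 30,
    "seeds": [21, 42, 63],
    "max_seq_len": {"target_oriented": 128, "context_aware": 256},
}

def runs(config):
    """One run per seed; reported scores are averaged over the three runs."""
    return [{"seed": s, **{k: v for k, v in config.items() if k != "seeds"}}
            for s in config["seeds"]]
```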
The baseline scores, provided by the organizers, are 60.0% for Reddit and 67.0% for Twitter, using a single-layer LSTM attention model Ghosh et al. (2018). Table 4 shows the results achieved by our target-oriented (Section 4.1) and context-aware (Section 4.2) models on the combined development set. The RoBERTa-Large model gives the highest F1 scores for both the target-oriented and context-aware models. The context-aware model using RoBERTa-Large shows an improvement of 1.1% over its target-oriented counterpart, so this model is used for our final submission to the shared task. Note that it may be possible to achieve higher performance by fine-tuning hyperparameters for the Twitter and Reddit datasets separately, which we will explore in the future.
Table 5 shows the results by the RoBERTa-Large models on the test sets. The scores are retrieved by submitting the system outputs to the shared task's CodaLab page (https://competitions.codalab.org/competitions/22247). The context-aware models significantly outperform the target-oriented models on the test sets, showing F1 improvements of 3.1% and 7.0% for the Twitter and the Reddit datasets, respectively. The improvement on Reddit is particularly substantial, given the much greater lengths of the conversation threads and utterances in the test set compared to those in the training set (Table 2). As the final results, we achieve 79.0% and 75.0% for the Twitter and Reddit datasets, respectively, which marked second place for both datasets at the time of submission.
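For clarity, the reported F1 score for the positive class can be computed as follows; this is a standard definition, not code from our system.

```python
def f1_binary(gold, pred, positive="SARCASM"):
    """F1 for the positive class: harmonic mean of precision and recall."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold = ["SARCASM", "SARCASM", "NOT_SARCASM", "NOT_SARCASM"]
pred = ["SARCASM", "NOT_SARCASM", "SARCASM", "NOT_SARCASM"]
score = f1_binary(gold, pred)  # tp=1, fp=1, fn=1 -> precision=recall=0.5, F1=0.5
```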
For a better understanding of our final model, errors from the following three situations are analyzed (TO: target-oriented, CA: context-aware):
TwCc: TO is wrong and CA is correct.
TcCw: TO is correct and CA is wrong.
TwCw: Both TO and CA are wrong.
Table 6 shows an example for each error situation. For TwCc, TO predicts NOT_SARCASM; in this example, it is difficult to tell whether the target utterance involves sarcasm without the context. For TcCw, CA predicts NOT_SARCASM; it appears that the target utterance is long enough to provide sufficient features for TO to make the correct prediction, whereas the extra context may introduce noise that leads CA to the incorrect decision. For TwCw, both TO and CA predict NOT_SARCASM; this example seems to require deeper reasoning to make the correct prediction.
This paper explores the benefit of considering relevant contexts for the task of sarcasm detection. Three types of state-of-the-art transformer encoders are adapted to establish strong baselines with the target-oriented models. Compared against these baselines, the context-aware models show significant improvements on both the Twitter and Reddit datasets and rank among the highest performing models in this shared task.
All our resources are publicly available at Emory NLP's open source repository: https://github.com/emorynlp/figlang-shared-task-2020
We gratefully acknowledge the support of the AWS Machine Learning Research Awards (MLRA). Any contents in this material are those of the authors and do not necessarily reflect the views of AWS.
References

- Bamman and Smith (2015). Contextualized Sarcasm Detection on Twitter. In International AAAI Conference on Web and Social Media, pp. 574–577.
- Devlin et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186.
- Ghosh and Veale (2017). Magnets for Sarcasm: Making Sarcasm Detection Timely, Contextual and Very Personal. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 482–491.
- Ghosh et al. (2017). The Role of Conversation Context for Sarcasm Detection in Online Interactions. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pp. 186–196.
- Ghosh et al. (2018). Sarcasm Analysis using Conversation Context. Computational Linguistics 44(4), pp. 755–792.
- Ghosh et al. (2015). Sarcastic or Not: Word Embeddings to Predict the Literal or Sarcastic Meaning of Words. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1003–1012.
- Joshi et al. (2017). Automatic Sarcasm Detection: A Survey. ACM Computing Surveys 50(5), pp. 1–22.
- Joshi et al. (2015). Harnessing Context Incongruity for Sarcasm Detection. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Beijing, China, pp. 757–762.
- Joshi et al. (2016). Are Word Embedding-based Features Useful for Sarcasm Detection? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1006–1011.
- Khodak et al. (2017). A Large Self-Annotated Corpus for Sarcasm. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018).
- Lan et al. (2019). ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv:1909.11942.
- Liu et al. (2020). RoBERTa: A Robustly Optimized BERT Pretraining Approach. In Proceedings of the International Conference on Learning Representations.
- Oraby et al. (2017). Are You Serious?: Rhetorical Questions and Sarcasm in Social Media Dialog. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, Saarbrücken, Germany, pp. 310–319.
- Van Hee et al. (2018). SemEval-2018 Task 3: Irony Detection in English Tweets. In Proceedings of the 12th International Workshop on Semantic Evaluation, New Orleans, Louisiana, pp. 39–50.
- Wang et al. (2015). Twitter Sarcasm Detection Exploiting a Context-Based Model. In WISE.