Improving Social Meaning Detection with Pragmatic Masking and Surrogate Fine-Tuning

by Chiyu Zhang, et al.
The University of British Columbia

Masked language models (MLMs) are pretrained with a denoising objective that, while useful, is mismatched with the objective of downstream fine-tuning. We propose pragmatic masking and surrogate fine-tuning as two strategies that exploit social cues to drive pre-trained representations toward a broad set of concepts useful for a wide class of social meaning tasks. To test our methods, we introduce a new benchmark of 15 different Twitter datasets for social meaning detection. Our methods achieve a 2.34 average F1 improvement while outperforming other transfer learning methods such as multi-task learning and domain-specific language models pretrained on large datasets. With only 5% of training data (severely few-shot), our methods enable an impressive 68.74 average F1, and we observe promising results in a zero-shot setting involving six datasets from three different languages.






1 Introduction

Masked language models (MLMs) such as BERT Devlin et al. (2019) have revolutionized natural language processing (NLP). These models exploit the idea of self-supervision, where sequences of unlabeled text are masked and the model is tasked to reconstruct them. Knowledge acquired during this stage of denoising (called pre-training) can then be transferred to downstream tasks through a second stage (called fine-tuning). Although pre-training is general, does not require labeled data, and is task-agnostic, fine-tuning is narrow, requires labeled data, and is task-specific. For a class of tasks, some of which we may not know at present but which may become desirable in the future, it is unclear how we can bridge the learning-objective mismatch between these two stages. In particular, how can we (i) make pre-training more tightly related to the downstream learning objective; and (ii) focus the pre-trained representation on an all-encompassing range of concepts of general affinity to a wide host of downstream tasks?

(1) Just got chased through my house with a bowl of tuna [Disgust]
(2) I love waiting 2 hours to see 2 min. of a loved family member's part in a dance show #sarcasm [Sarcastic]
(3) LOST GET OVER IT [Offensive]
Table 1: Samples from our social meaning benchmark.

We raise these questions in the context of learning a cluster of tasks to which we collectively refer as social meaning. We loosely define social meaning as meaning emerging through human interaction such as on social media. Example social meaning tasks include emotion, humor, irony, and sentiment detection. We propose two main solutions that we hypothesize can bring pre-training and fine-tuning closer in the context of learning social meaning: First, we propose a particular type of guided masking that prioritizes learning contexts of tokens crucially relevant to social meaning in interactive discourse. Since the type of “meaning in interaction” we are interested in is the domain of linguistic pragmatics Thomas (2014), we will refer to our proposed masking mechanism as pragmatic masking. We explain pragmatic masking in Section 3.

Second, we propose an additional novel stage of fine-tuning that does not depend on gold labels but instead exploits general data cues possibly relevant to all social meaning tasks. More precisely, we leverage proposition-level, user-assigned tags for intermediate fine-tuning of pre-trained language models. In the case of Twitter, for example, hashtags naturally assigned by users at the end of posts can carry discriminative power that is by and large relevant to a wide host of tasks. Cues such as hashtags and emojis have previously been used as pseudo or surrogate labels for particular tasks; we adopt a similar strategy, but not to learn a particular narrow task with a handful of cues. Rather, our goal is to learn extensive concepts carried by tens of thousands of key-phrase hashtags and emojis. A model endowed with such a knowledge base of social concepts can then be further fine-tuned on any narrower task in the ordinary way. We refer to this method as surrogate fine-tuning (Section 4).

In order to evaluate our methods, we present a social meaning benchmark composed of different datasets crawled from previous research sources. Since we face challenges in acquiring the different datasets, with only part of the data remaining accessible, we paraphrase the benchmark and make it available to the community to facilitate future research. We refer to our new benchmark as the Social Meaning Paraphrase Benchmark (SMPB) (Section 2). We perform an extensive series of methodical experiments directly targeting our proposed methods. Our experiments set new SOTA in the supervised setting across our different tasks and compare favorably to a multi-task learning setting (Section 5). Motivated by occasional data inaccessibility for social media research, we also report successful models fine-tuned exclusively on paraphrase data (Section 6). Finally, our experiments reveal a striking capacity of our models to improve downstream task performance in severely few-shot settings (i.e., with very little gold data) and even in the zero-shot setting on six different datasets from three languages (Section 8).

To summarize, we make the following contributions: (1) We propose a novel pragmatic masking strategy that makes use of social media cues to improve social meaning detection. (2) We introduce a new, effective surrogate fine-tuning method suited to social meaning that exploits the same simple cues as our pragmatic masking strategy. (3) We introduce SMPB, a new social meaning detection benchmark comprising 15 different datasets whose accessibility is enhanced via paraphrase technology. (4) We report new SOTA on supervised datasets and offer remarkably effective models for zero- and few-shot learning.

The rest of the paper is organized as follows: We introduce our benchmark in Section 2. Sections 3 and 4 present our pragmatic masking and surrogate fine-tuning methods, respectively. Sections 5 and 6 present our experiments with multi-task learning then paraphrase-based models. We compare our models to SOTA in Section 7 and show their effectiveness in zero- and few-shot settings in Section 8. We review related work in Section 9 and conclude in Section 10.

2 Social Meaning Benchmark

We collect datasets representing eight different social meaning tasks to evaluate our models, as follows (to facilitate reference, we give each dataset a name):

Crisis awareness. We use CrisisOltea  Olteanu et al. (2014), a corpus for identifying whether a tweet is related to a given disaster or not.

Emotion. We utilize EmoMoham, introduced by Mohammad et al. (2018), for emotion recognition. We use the version adapted in Barbieri et al. (2020).

Hateful and offensive language. We use HateWaseem Waseem and Hovy (2016), HateDavid Davidson et al. (2017), and OffenseZamp Zampieri et al. (2019a).

Humor. We use the humor detection datasets HumorPotash Potash et al. (2017) and HumorMeaney Meaney et al. (2021).

Irony. We utilize IronyHee-A and IronyHee-B from Van Hee et al. (2018).

Sarcasm. We use four sarcasm datasets: SarcRiloff Riloff et al. (2013), SarcPtacek Ptáček et al. (2014), SarcRajad Rajadesingan et al. (2015), and SarcBam Bamman and Smith (2015).


Sentiment. We employ the three-way sentiment analysis dataset SentiRosen Rosenthal et al. (2017).

Stance. We use StanceMoham, a stance detection dataset from Mohammad et al. (2016). The task is to identify the position of a given tweet towards a target of interest.

Data Crawling and Preparation. We use the Twitter API to crawl datasets that are available only in tweet ID form. We note that we could not download all tweets, since some tweets get deleted by users or become inaccessible for other reasons. Since some datasets are old (dating back to 2013), we are only able to retrieve part of the tweets on average (i.e., across the different datasets). This inaccessibility of the data motivates us to paraphrase the datasets, as we explain in Section 6. Before we paraphrase the data or use it in our various experiments, we normalize each tweet by replacing user names and hyperlinks with the special tokens ‘USER’ and ‘URL’, respectively. This ensures our paraphrased dataset will never contain actual usernames or hyperlinks, thereby protecting user identity. For datasets collected based on hashtags by the original authors (i.e., distant supervision), we also remove the seed hashtags from the original tweets. For datasets originally used in cross-validation, we acquire Train, Dev, and Test via random splits. For datasets that had training and test splits but no development data, we split off part of the training data as Dev. The data splits of each dataset are presented in Appendix Table A.1. We now introduce our pragmatic masking method.
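A minimal sketch of this normalization, assuming simple regexes for mentions and hyperlinks (the paper does not publish its exact preprocessing rules):

```python
import re

def normalize_tweet(text, seed_hashtags=()):
    """Anonymize a tweet: map @-mentions to 'USER' and hyperlinks to
    'URL', then drop any seed hashtags used for distant-supervision
    collection. Illustrative sketch; the authors' exact regexes are
    not given in the paper."""
    text = re.sub(r"https?://\S+", "URL", text)  # hyperlinks -> URL
    text = re.sub(r"@\w+", "USER", text)         # user names -> USER
    for tag in seed_hashtags:                    # e.g. '#sarcasm'
        text = text.replace(tag, "")
    return " ".join(text.split())                # squeeze whitespace
```

For example, `normalize_tweet("@bob see https://t.co/xyz #sarcasm", ["#sarcasm"])` yields an anonymized body with the seed hashtag removed.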

3 Pragmatic Masking

MLMs employ random masking and so are not guided to learn any specific types of information during pre-training. Several attempts have been made to employ task-specific masking, where the objective is to predict information relevant to a given downstream task. Task-relevant information is usually identified based on world knowledge (e.g., a sentiment lexicon Gu et al. (2020); Ke et al. (2020), a knowledge graph Zhang et al. (2019), part-of-speech (POS) tags Zhou et al. (2020)) or based on some other type of modeling, such as pointwise mutual information Tian et al. (2020) with supervised data. Although task-specific masking is useful, it is desirable to identify a more general masking strategy that does not depend on external information, since such information may be unavailable or hard to acquire. For example, there are no POS taggers for some languages, and so methods based on POS tags would not be applicable there. Motivated by the fact that random masking is intrinsically sub-optimal Tian et al. (2020); Ke et al. (2020); Gu et al. (2020) and by this need for a more general and dependency-free masking method, we introduce our novel pragmatic masking mechanism suited to a wide range of social meaning tasks.

To illustrate, consider the tweet samples in Table 1: In example (1), the emoji combined with the suffix “-ing” is a clear signal indicating the disgust emotion. In example (2), the emoji and the hashtag “#sarcasm” communicate sarcasm. In example (3), the combination of emojis accompanies ‘hard’ emotions characteristic of offensive language. We hypothesize that by simply masking cues such as emojis and hashtags, we can bias the model to learn about the different shades of social meaning expression. This masking can be performed in a self-supervised fashion since hashtags and emojis can be automatically identified. We call the resulting language model a pragmatically masked language model (PMLM). Specifically, when we choose tokens for masking, we prioritize hashtags and emojis. The pragmatic masking strategy follows a number of steps:

Pragmatic token selection. We randomly select up to a fixed percentage of the input sequence for masking, giving priority to hashtags and emojis. Tokens are selected by whole-word masking (i.e., a whole hashtag or emoji).

Regular token selection. If the pragmatic tokens fall short of this percentage, we randomly select regular BPE tokens to complete the masking budget.

Masking. This is the same as the RoBERTa MLM objective where we replace 80% of selected tokens with the [MASK] token, 10% with random tokens, and we keep 10% unchanged.
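The selection and corruption steps above can be sketched as follows. This is a toy whole-token sketch: the 15% budget and the tiny emoji inventory are assumptions, and the paper operates on BPE sequences with whole-word masking.

```python
import random

EMOJIS = {"😂", "🙄", "😡"}  # toy emoji inventory for illustration

def is_pragmatic(tok):
    return tok.startswith("#") or tok in EMOJIS

def select_for_masking(tokens, mask_frac=0.15, rng=random):
    """Pragmatic token selection: fill the masking budget with
    hashtag/emoji positions first, then top up with ordinary tokens
    chosen at random."""
    budget = max(1, int(len(tokens) * mask_frac))
    pragmatic = [i for i, t in enumerate(tokens) if is_pragmatic(t)]
    regular = [i for i, t in enumerate(tokens) if not is_pragmatic(t)]
    rng.shuffle(pragmatic)
    rng.shuffle(regular)
    chosen = pragmatic[:budget]
    chosen += regular[:max(0, budget - len(chosen))]
    return sorted(chosen)

def corrupt(tokens, chosen, vocab, rng=random):
    """RoBERTa-style corruption of the selected positions:
    80% [MASK], 10% random token, 10% left unchanged."""
    out = list(tokens)
    for i in chosen:
        r = rng.random()
        if r < 0.8:
            out[i] = "[MASK]"
        elif r < 0.9:
            out[i] = rng.choice(vocab)
    return out
```

With a 10-token tweet and `mask_frac=0.3`, the three-position budget is filled by the hashtag and emoji first, plus one random ordinary token.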

3.1 TweetEng Dataset

We extract a large set of English tweets (selected based on the Twitter language tag) from a larger in-house dataset. We lightly normalize tweets by removing usernames and hyperlinks, and we add white space between emojis to help our model identify individual emojis. We keep all tweets, retweets, and replies but remove the ‘RT USER:’ string in front of retweets. To ensure each tweet contains sufficient context for modeling, we filter out tweets shorter than a minimum number of English words (not counting the special tokens hashtag, emoji, USER, URL, and RT). We call this dataset TweetEng. Exploring the distribution of hashtags and emojis within TweetEng, we find tweets that include at least one hashtag but no emoji, tweets that have at least one emoji but no hashtag, and tweets with both at least one hashtag and at least one emoji. Investigating hashtag and emoji location, we observe that many tweets use a hashtag as the last term, and that many tweets end with an emoji. We will use TweetEng as a general pool of data from which we derive datasets for both our pragmatic masking and surrogate fine-tuning methods.
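The length filter described above might look like the following sketch, where the minimum word count of 5 is an assumed stand-in for the paper's unstated threshold:

```python
import re

SPECIAL = {"USER", "URL", "RT"}

def keep_tweet(text, min_words=5):
    """TweetEng-style filter sketch: strip the retweet prefix, then
    keep the tweet only if it has at least `min_words` ordinary
    English words (hashtags, emojis, and special tokens do not count).
    Returns (keep?, cleaned_text)."""
    text = re.sub(r"^RT USER:\s*", "", text)
    words = [w for w in text.split()
             if w not in SPECIAL and not w.startswith("#") and w.isascii()]
    return len(words) >= min_words, text
```

The `isascii()` check is a crude proxy for "not an emoji"; a real pipeline would use a proper emoji inventory.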

3.2 PM Datasets

For pragmatic masking, we extract five different subsets from TweetEng to explore the utility of our proposed masking method. Each of these five datasets comprises the same number of tweets, as follows:

Naive. A randomly selected tweet set. Based on the distribution of hashtags and emojis in TweetEng, each sample in Naive still has some likelihood of including one or more hashtags and/or emojis. We are thus still able to perform our PM method on Naive.

Hashtag_any. Tweets with at least one hashtag anywhere but no emojis.

Emoji_any. Tweets with at least one emoji anywhere but no hashtags.

Hashtag_end. Tweets with a hashtag as the last term but no emojis.

Emoji_end. Tweets with an emoji at the end of the tweet but no hashtags. (In an analysis based on two 10M random samples of tweets from Hashtag_any and Emoji_any, respectively, we find on average 1.83 hashtags per tweet in Hashtag_any and 1.88 emojis per tweet in Emoji_any.)
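The five subsets can be illustrated with a small routing function. This sketch sends each tweet to the most specific bucket it fits, which is one plausible reading of the definitions above; the `is_emoji` predicate and its emoji inventory are assumptions.

```python
EMOJIS = {"😂", "🙄"}  # toy inventory; a real pipeline needs a full set

def pm_bucket(tokens, is_emoji=lambda t: t in EMOJIS):
    """Route a tokenized tweet to a PM subset of Section 3.2."""
    has_tag = any(t.startswith("#") for t in tokens)
    has_emo = any(is_emoji(t) for t in tokens)
    if has_tag and not has_emo:
        return "Hashtag_end" if tokens[-1].startswith("#") else "Hashtag_any"
    if has_emo and not has_tag:
        return "Emoji_end" if is_emoji(tokens[-1]) else "Emoji_any"
    return "Naive_only"  # both or neither: eligible only for Naive
```

Note that in the paper a Hashtag_end tweet also satisfies the Hashtag_any condition; routing to the most specific bucket is an illustrative choice.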

3.3 Baseline and Hyper-Parameters

For both our experiments on pragmatic masking (current section) and surrogate fine-tuning (Section 4), we use the pre-trained RoBERTaBase Liu et al. (2019) model as the initial checkpoint baseline model. We use this model, rather than a larger language model, since we run a large number of experiments and need to be efficient with GPUs. We use the RoBERTaBase tokenizer to process each input sequence and pad or truncate each sequence to a fixed maximal length in BPE tokens. We train with a fixed batch size. Further details about how we optimize our hyper-parameters for all our upcoming experiments (i.e., pragmatic masking, surrogate fine-tuning, downstream fine-tuning, and multi-task fine-tuning) are in Appendix Section B.

3.4 PM Experiments

Task Baseline RM-N ΔPM-N RM-HA ΔPM-HA RM-HE ΔPM-HE RM-EA ΔPM-EA RM-EE ΔPM-EE
CrisisOltea 95.95 95.78 +0.14 95.75 +0.10 95.85 +0.02 95.91 +0.07 95.95 -0.18
EmoMoham 77.99 79.43 +1.30 80.31 -1.75 79.51 +0.64 80.03 +1.06 81.28 +0.90
HateWaseem 57.34 56.75 -0.41 57.16 +0.35 56.97 +0.16 57.00 +0.01 57.08 -0.39
HateDavid 77.71 77.47 +0.81 76.87 +0.59 77.55 -0.33 78.13 +0.13 78.16 -0.23
HumorPotash 54.40 55.45 -0.19 55.32 -2.83 50.06 +4.54 57.14 -2.04 55.25 +0.32
HumorMeaney 92.37 93.24 +0.45 93.58 -0.10 92.85 +1.67 93.55 +0.95 93.19 -0.50
IronyHee-A 73.93 74.52 +0.45 74.50 +0.66 73.97 +2.27 75.34 +2.59 74.40 +1.22
IronyHee-B 52.30 52.91 +0.88 51.43 -2.14 50.41 +4.35 54.94 +1.15 54.73 -2.26
OffenseZamp 80.13 79.97 +0.27 79.74 -0.40 79.95 -1.08 80.18 +0.96 80.18 +0.47
SarcRiloff 73.85 72.02 +3.22 71.42 +3.30 74.16 +1.72 76.52 +1.41 76.30 +3.80
SarcPtacek 95.09 95.81 -0.17 95.50 +0.12 95.24 +0.57 95.81 +0.25 95.67 +0.34
SarcRajad 85.07 86.18 +0.05 85.04 +0.51 85.20 +0.73 86.14 +0.51 86.02 +0.92
SarcBam 79.08 80.03 +0.10 80.22 -0.06 79.83 +0.48 80.73 +0.39 81.13 +0.60
SentiRosen 71.08 72.03 +0.62 72.10 -0.11 71.84 -0.02 72.24 -0.26 72.27 -0.71
StanceMoham 70.41 67.14 +2.80 69.51 -1.38 69.23 +0.45 70.20 -1.58 70.04 -1.56
Average 75.78 75.92 +0.69 75.90 -0.21 75.51 +1.08 76.92 +0.38 76.78 +0.18
Table 2: Pragmatic masking results. Baseline: RoBERTaBASE without further pre-training. Light green indicates models outperforming the baseline. The best model of each task is in bold font. Masking: RM: Random masking, PM: Pragmatic masking. Datasets: N: Naive, HA: Hashtag_any, HE: Hashtag_end, EA: Emoji_any, EE: Emoji_end. For each dataset, we report the RM score followed by the improvement (Δ) of PM over RM.

PM on Naive.   We further pre-train RoBERTa on the Naive dataset with our pragmatic masking strategy (PM) and compare to a model pre-trained on the same dataset with random masking (RM). As Table 2 shows, PM outperforms RM with an average improvement of 0.69 macro F1 points across the tasks. (RM does not improve over the baseline RoBERTa, which is not further pre-trained, as Table 2 shows.) We also observe that PM improves over RM in 12 out of the 15 tasks, thus reflecting the utility of our PM strategy even when working with a dataset such as Naive, where it is not guaranteed (although likely) that a tweet will have hashtags and/or emojis.

PM of Hashtags.   To study the effect of PM in the controlled setting where we guarantee that each sample has at least one hashtag anywhere, we further pre-train RoBERTa on the Hashtag_any dataset with PM (PM-HA in Table 2) and compare to a model further pre-trained on the same dataset with RM (RM-HA). As Table 2 shows, PM does not improve over RM here; rather, PM results are marginally lower than those of RM.

Effect of Hashtag Location.   We also investigate whether it is more helpful to further pre-train on tweets with a hashtag at the end. Hence, we further pre-train RoBERTa on the Hashtag_end dataset with PM and RM, respectively. As Table 2 shows, PM exploiting hashtags at the end (PM-HE) outperforms random masking (RM-HE) with an average improvement of 1.08 macro F1 across the tasks. It is noteworthy that PM-HE shows improvements over RM-HE in the majority of tasks (12 of the 15), and both of them outperform the baseline.

PM of Emojis.   Again, in order to study the impact of PM of emojis under a controlled condition where we guarantee that each sample has at least one emoji, we further pre-train RoBERTa on the Emoji_any dataset with PM and RM, respectively. As Table 2 shows, both methods result in sizable improvements on most tasks. PM-EA outperforms the random masking method (RM-EA) by 0.38 average macro F1 and also exceeds the baseline RoBERTa model (not further pre-trained). PM-EA thus obtains the best overall performance (77.30 average macro F1) across all settings of our pragmatic masking. PM-EA also achieves the best performance on CrisisOltea, the two irony detection tasks, OffenseZamp, and SarcPtacek. This indicates that emojis carry important knowledge for social meaning tasks and demonstrates the effectiveness of our PM mechanism in distilling and transferring this knowledge to a wide range of downstream social tasks.

Effect of Emoji Location.   To investigate the effect of emoji location, we further pre-train RoBERTa on the Emoji_end dataset with PM and RM, respectively. We refer to these two models as PM-EE and RM-EE. Both models perform better than our baseline, and PM-EE achieves the best performance on four tasks. Unlike the case of hashtags, learning is not as sensitive to the location of the masked emoji.

Overall, results show the effectiveness of our proposed pragmatic masking method in improving self-supervised LMs. All models pre-trained with PM on emoji data obtain better performance than those pre-trained on hashtag data. This suggests that emoji cues are somewhat more helpful than hashtag cues for this type of guided model pre-training in the context of social meaning tasks.

4 Surrogate Fine-tuning

The current transfer learning paradigm of first pre-training and then fine-tuning on particular tasks is limited by how much labeled data is available for downstream tasks; in other words, this existing setup works well only given large amounts of labeled data. We propose surrogate fine-tuning, where we perform intermediate fine-tuning of pre-trained LMs to predict thousands of example-level cues (i.e., hashtags occurring at the end of tweets). This method is inspired by previous work that exploited hashtags Riloff et al. (2013); Ptáček et al. (2014); Rajadesingan et al. (2015); Sintsova and Pu (2016); Abdul-Mageed and Ungar (2017); Felbo et al. (2017); Barbieri et al. (2018) or emojis Hu et al. (2017); Wood and Ruder (2016); Wiegand and Ruppenhofer (2021) as proxies for labels in a number of social meaning tasks. However, instead of identifying a specific set of hashtags or emojis for a single task and using them to collect a dataset of distant labels, we diverge from the literature in proposing to use data with any hashtag or emoji as a surrogate labeling approach suited for any (or at least most) social meaning task. As explained, we refer to our method as surrogate fine-tuning (SFT). Again, once we perform SFT, a model can be further fine-tuned on a specific downstream task to predict task labels. We now introduce our SFT datasets.

4.1 SFT Datasets

We experiment with two SFT settings, one based on hashtags (SFT-H) and another based on emojis (SFT-E). For SFT-H, we use the Hashtag_end dataset which we used for pragmatic masking (described in Section 3.2). The dataset includes a large number of unique hashtags (all occurring at the end of tweets), but the majority of these are low frequency. We remove hashtags occurring fewer than a minimum number of times, which gives us a final set of hashtags and their tweets. We split the tweets into Train, Dev, and Test. For each sample, we use the end hashtag as the sample label (if a tweet ends with more than one hashtag, we use the last one as the label). We refer to this resulting dataset as Hashtag_pred.

For emoji SFT, we use the Emoji_end dataset described in Section 3.2. Similar to SFT-H, we remove low-frequency emojis, extract the same number of tweets as in Hashtag_pred, and follow the same data splitting method. We acquire a set of unique emojis in final positions, which we assign as class labels and remove from the original tweet body. We refer to this dataset as Emoji_pred.
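Both surrogate datasets follow the same recipe: treat the end-position cue as the class label and strip it from the input. A stdlib sketch, with an assumed frequency cutoff (the paper removes low-frequency hashtags and emojis but the exact threshold is not reproduced here):

```python
from collections import Counter

def surrogate_example(tokens):
    """SFT-H labeling sketch: peel hashtags off the end of a tweet,
    use the final one as the surrogate class label, and return the
    body with the end-position hashtags removed (None if the tweet
    has no trailing hashtag)."""
    body = list(tokens)
    trailing = []
    while body and body[-1].startswith("#"):
        trailing.append(body.pop())
    if not trailing:
        return None
    return " ".join(body), trailing[0]  # trailing[0] = last hashtag

def drop_rare_labels(examples, min_count=10):
    """Remove examples whose label occurs fewer than min_count times
    (assumed cutoff)."""
    counts = Counter(label for _, label in examples)
    return [ex for ex in examples if counts[ex[1]] >= min_count]
```

The same two helpers would serve Emoji_pred with an emoji test in place of the `#` prefix check.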

4.2 SFT Experiments

Surrogate fine-tuning with emojis. We conduct SFT using emojis. For this, we have two settings: (i) SFT-E: We fine-tune the pre-trained RoBERTa model on the Emoji_pred dataset and then further fine-tune it on the downstream tasks. (ii) X1+SFT-E: We fine-tune the RoBERTa model further pre-trained with pragmatic masking of hashtags at the end position (dubbed X1) on emojis. We refer to this last setting as X1+SFT-E in Table 3. As Table 3 shows, SFT-E outperforms the RoBERTa model baseline (BL), and X1+SFT-E outperforms both, reaching an average of 77.43 vs. 75.78 for BL and 76.94 for SFT-E.

Task Baseline SFT-E X1+SFT-E SFT-H X2+SFT-H
CrisisOltea 95.95 95.76 96.02 95.87 95.68
EmoMoham 77.99 79.69 82.04 78.69 80.50
HateWaseem 57.34 56.47 60.92 63.97 60.25
HateDavid 77.71 76.45 77.00 77.29 76.93
HumorPotash 54.40 54.75 54.93 55.51 53.83
HumorMeaney 92.37 93.82 93.68 93.74 94.49
IronyHee-A 73.93 76.63 72.73 76.22 79.89
IronyHee-B 52.30 57.59 56.11 60.14 61.67
OffenseZamp 80.13 80.18 81.34 79.82 79.50
SarcRiloff 73.85 78.34 78.74 80.50 80.49
SarcPtacek 95.09 95.88 96.16 96.01 96.24
SarcRajad 85.07 86.80 87.48 87.56 88.92
SarcBam 79.08 81.48 82.53 81.19 81.53
SentiRosen 71.08 71.27 72.07 71.83 71.08
StanceMoham 70.41 69.06 69.65 71.27 70.77
Average 75.78 76.94 77.43 77.97 78.12
Table 3: Surrogate fine-tuning (SFT). Baseline: RoBERTaBASE without further pre-training. SFT-H: SFT with hashtags. SFT-E: SFT with emojis. X1: Pragmatic masking with hashtag in end position (best hashtag PM condition). X2: Pragmatic masking with emoji anywhere (best emoji PM condition).

Surrogate fine-tuning with hashtags. We also perform SFT exploiting hashtags. For this, we again have two settings: (i) SFT-H: We use the pre-trained RoBERTa model as our initial checkpoint and surrogate fine-tune it on Hashtag_pred. We then continue fine-tuning this model on each of the downstream tasks. (ii) X2+SFT-H: We surrogate fine-tune the best emoji-based model (PM with emojis anywhere, dubbed X2 here) on the hashtag task (using the Hashtag_pred dataset) and continue fine-tuning on the downstream tasks. This last setting is referred to as X2+SFT-H. As Table 3 shows, our proposed SFT-H method alone is also highly effective. On average, SFT-H achieves a 2.19 macro F1 improvement over our baseline model, which is directly fine-tuned on the downstream tasks (i.e., without SFT-H). SFT-H also yields sizeable improvements on tasks with smaller training samples, such as IronyHee-B (improvement of 7.84) and SarcRiloff (improvement of 6.65). Our best result, however, is achieved with a combination of pragmatic masking with emojis and surrogate fine-tuning on hashtags (the X2+SFT-H condition). This last model achieves an average macro F1 of 78.12 and is 2.34 points higher than the baseline.

5 Multi-Task Learning

We directly compare our proposed methods to multi-task learning (MTL), a popular transfer learning method that can transfer knowledge across different tasks. This is particularly motivated by the fact that there is some affinity between the different social meaning tasks we consider. We perform MTL starting from the original RoBERTa model without any further pre-training or fine-tuning. For this iteration, we use the same batch size for all the tasks and train for a fixed number of epochs without early stopping. We implement MTL using hard sharing Caruana (1997); that is, we share all the Transformer encoder layers across all the tasks. (In the future, we plan to explore soft sharing and investigate different model architectures for MTL.) As Table 4 shows, multi-task RoBERTa yields an improvement in average Test macro F1 compared to single-task models. Table 4 also shows that models trained with any of our methods outperform MTL. (We compare to the best setting for each of the PM and SFT methods, and their combination, in Table 4.) For example, our single-task model combining the best PM and SFT methods is better than the MTL model. One or another of our single-task models is also better than the MTL model on every single task, as Table 4 shows. This further demonstrates the utility and effectiveness of our proposed methods as compared to MTL. We now describe our models exploiting paraphrase data as a proposed alternative to gold social data, which can be challenging to acquire.

Task Baseline MT SGPM-EA SGSFT-H SGBest
CrisisOltea 95.95 95.88 95.98 95.87 95.68
EmoMoham 77.99 75.79 81.09 78.69 80.50
HateWaseem 57.34 56.52 57.01 63.97 60.25
HateDavid 77.71 77.07 78.26 77.29 76.93
HumorPotash 54.40 53.33 55.10 55.51 53.83
HumorMeaney 92.37 91.46 94.50 93.74 94.49
IronyHee-A 73.93 77.57 77.93 76.22 79.89
IronyHee-B 52.30 54.37 56.09 60.14 61.67
OffenseZamp 80.13 79.65 81.14 79.82 79.50
SarcRiloff 73.85 77.41 77.93 80.50 80.49
SarcPtacek 95.09 95.72 96.06 96.01 96.24
SarcRajad 85.07 85.99 86.65 87.56 88.92
SarcBam 79.08 80.70 81.12 81.19 81.53
SentiRosen 71.08 70.66 71.98 71.83 71.08
StanceMoham 70.41 69.11 68.62 71.27 70.77
Average 75.78 76.08 77.30 77.97 78.12
Table 4: Multi-task learning. Bold font indicates the best result for each task. RoBERTaBASE is the baseline without further pre-training. SFT: Surrogate fine-tuning, SG: Single task, MT: Multi-task. SGPM-EA: Single-task PM with emoji anywhere, SGSFT-H: Single-task SFT with hashtags, SGBest: Single-task PM with emoji anywhere surrogate fine-tuned on hashtags (setting X2+SFT-H in Table 3, which is our best model).

6 Paraphrase-Based Models

We use an in-house model based on T5Base Raffel et al. (2020) to paraphrase our social datasets and fine-tune models exploiting exclusively the paraphrased data. We provide full details about our procedure, data splits, and models in Appendix D.1. Our results, based on average Test macro F1 over three runs, are in Table D.3 in the Appendix. They show that one of our models achieves almost the same results as a model trained on gold data, and that an ensemble of our paraphrase models is slightly better than the gold RoBERTa baseline. These findings demonstrate that (i) we can replace social gold data (which can become increasingly inaccessible over time) with paraphrase data (ii) without sacrificing performance.
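As an illustration of the ensembling idea, a simple majority vote over per-model label predictions could look like the sketch below. Majority voting is an assumption here; the paper does not specify its ensembling scheme.

```python
from collections import Counter

def ensemble_predict(model_preds):
    """Majority-vote ensemble sketch: model_preds is a list of
    per-model prediction lists (one label per test example); we
    return the most common label at each position."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*model_preds)]
```

With three paraphrase models, each example's final label is whichever class at least two of them agree on.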

7 Model Comparisons

We compare our best model on each dataset to the SOTA of that particular dataset. In addition, we fine-tune BERTweet Nguyen et al. (2020), a SOTA pre-trained language model for the English Twitter domain, on each task and compare it against our models. Finally, we compare to published results on a Twitter evaluation benchmark Barbieri et al. (2020). All our reported results are an average of three runs, and we report using the same respective metric adopted by the original authors of each dataset. As Table 5 shows, our models outperform all other models on all tasks (a single exception is one of the sarcasm tasks). On average, our models are 1.68 points higher than the closest models (BERTweet Nguyen et al. (2020)). As such, we set new SOTA on all tasks, thereby further demonstrating the effectiveness of our proposed methods.

Task Metric SOTA TwE BTw Best Setting
CrisisOltea M-F1 95.60 - 95.88 96.02 X1+SFT-E
EmoMoham M-F1 - 78.50 80.14 82.18 PM-EE
HateWaseem W-F1 73.62 - 88.00 88.02 SFT-H
HateDavid W-F1 90.00 - 91.27 91.50 PM-N
HumorPotash M-F1 - - 52.77 57.14 RM-EA
HumorMeaney M-F1 - - 94.46 94.52 PM-HE
IronyHee-A F1(irony) 70.50 65.40 71.49 76.47 X2+SFT-H
IronyHee-B M-F1 50.70 - 58.67 61.67 X2+SFT-H
OffenseZamp M-F1 82.90 80.50 78.49 81.34 X1+SFT-E
SarcRiloff F1(sarcasm) 51.00 - 66.35 70.02 SFT-H
SarcPtacek M-F1 92.37 - 96.65 96.24 X2+SFT-H
SarcRajad Acc 92.94 - 95.29 95.66 X2+SFT-H
SarcBam Acc 85.10 - 82.28 82.55 X1+SFT-E
SentiRosen M-Rec 68.50 72.60 72.90 73.63 PM-N
StanceMoham Avg(a,f) 71.00 69.30 69.79 72.70 SFT-H
Average - 77.02 73.26 79.63 81.31 -
Table 5: Model comparisons. SOTA: Best performance on each respective dataset. TwE: TweetEval Barbieri et al. (2020), a benchmark for tweet classification evaluation. BTw: BERTweet Nguyen et al. (2020), a SOTA Transformer-based pre-trained language model for English tweets. We compare using the same metrics employed on each dataset. Metrics: M-F1: macro F1, W-F1: weighted F1, F1(irony): F1 of the irony class, F1(sarcasm): F1 of the sarcasm class, M-Rec: macro recall, Avg(a,f): average of the against and in-favor classes (three-way dataset). SOTA results are from Liu et al. (2020), Waseem and Hovy (2016), Davidson et al. (2017), Van Hee et al. (2018), Zampieri et al. (2019b), Riloff et al. (2013), Ptáček et al. (2014), Rajadesingan et al. (2015), Bamman and Smith (2015), Rosenthal et al. (2017), and Mohammad et al. (2016).

8 Zero- and Few-Shot Learning

Since our methods exploit general cues in the data for pragmatic masking and learn a broad range of social meaning concepts, we hypothesize they should be particularly effective in few-shot learning. To test this hypothesis, we fine-tune our best models (acquired with a combination of PM and SFT) on varying percentages of the Train set of each task. (For this analysis, we run each combination once, except for the models trained on the full Train set, which come from Table 2.) As Figure 1 shows, our models always achieve better average macro F1 scores than the baseline across all data-size settings. In fact, this is consistently true even when our models exploit only a small fraction of the labeled data used by the baseline. Strikingly, our models outperform the RoBERTa baseline with an impressive 68.74 average macro F1 when we fine-tune on only 5% of the downstream gold data. With even less gold data, our models still improve over the baseline. This demonstrates that our proposed methods can most effectively alleviate the challenge of labeled data even under severe few-shot settings.

Figure 1: Few-shot learning with our best setting (X2+SFT-H: PM with emojis anywhere, followed by SFT with end-position hashtags). The y-axis indicates the average Test macro F1 across the tasks. The x-axis indicates the percentage of the Train set used to fine-tune the model.

Our proposed methods are language agnostic and may fare well on languages other than English. Although we do not test this claim directly in this work, we do score our best English-language models on six datasets from three other languages (a zero-shot setting). Namely, we test our models on data from Arabic: EmoMageed Abdul-Mageed et al. (2020) and IronyGhan Ghanem et al. (2019); Italian: EmoBian Bianchi et al. (2021) and HateBosco Bosco et al. (2018); and Spanish: EmoMoham Mohammad et al. (2018) and HateBas Basile et al. (2019). For the emotion, irony, and hate speech tasks, we test our best model (i.e., X2+SFT-H in Table 3) fine-tuned on the English datasets EmoMoham, IronyHee-A, and HateDavid, respectively. We compare these models against a RoBERTa baseline fine-tuned on the same English data. As Table 6 shows, our models outperform the baseline in the zero-shot setting on all but one dataset, with an average improvement of 5.96 points. These results emphasize the effectiveness of our methods, even in the zero-shot setting, across different languages and tasks.

Task RB X2+SFT-H
Arabic EmoMageed 29.81 40.37
IronyGhan 31.53 44.40
Italian EmoBian 27.22 26.40
HateBosco 40.59 47.04
Spanish EmoMoham 30.58 35.09
HateBas 41.43 43.66
Average 33.53 39.49
Table 6: Zero-shot performance. RB: RoBERTa, X2+SFT-H: our model combining the best settings of PM and SFT.

9 Related Work

Masked language models. Devlin et al. (2019) introduced BERT, a language representation model pre-trained by jointly conditioning on both left and right context in all layers of a Transformer encoder Vaswani et al. (2017). BERT's pre-training introduces a self-supervised objective, masked language modeling (MLM), in which the model predicts masked tokens in the input sequence from their bi-directional context. For pre-training, BERT randomly selects 15% of tokens to be replaced with [MASK]. RoBERTa Liu et al. (2019) improves on BERT by removing the next sentence prediction task and pre-training on a larger corpus with larger batch sizes. In the last few years, several LM variants with different masking methods have been proposed, including XLNet Yang et al. (2019), MASS Song et al. (2019), and MARGE Lewis et al. (2020).
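The standard BERT-style input corruption can be sketched as follows (the 80/10/10 replacement split follows Devlin et al. (2019); the toy vocabulary and function name are ours):

```python
import random

MASK = "[MASK]"
VOCAB = ["cat", "dog", "runs", "sleeps"]  # toy vocabulary for random replacement

def mlm_corrupt(tokens, mask_prob=0.15, seed=0):
    """BERT-style corruption: each position is selected with probability
    `mask_prob`; a selected token becomes [MASK] 80% of the time, a random
    vocabulary token 10% of the time, and is left unchanged 10% of the time."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)       # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)
            elif r < 0.9:
                corrupted.append(rng.choice(VOCAB))
            else:
                corrupted.append(tok)
        else:
            targets.append(None)      # position not scored by the MLM loss
            corrupted.append(tok)
    return corrupted, targets
```

The MLM loss is then computed only over positions with a non-`None` target.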

To incorporate more domain-specific knowledge into LMs, some recent works introduce knowledge-enabled masking strategies. ERNIE-Baidu Sun et al. (2019) masks the tokens of named entities. ERNIE-Tsinghua Zhang et al. (2019) introduces a knowledge graph method to select words for masking. Tian et al. (2020) and Ke et al. (2020) select sentiment-related words to mask during pre-training. Gu et al. (2020) propose a selective masking strategy that masks the tokens most important for downstream tasks. However, these masking strategies depend on external resources and/or annotations (e.g., a knowledge graph, a lexicon, labeled corpora). Corazza et al. (2020) investigate the utility of hybrid emoji-based masking for enhancing abusive language detection, which is task-specific.
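For illustration only (this is not any of the cited implementations), a cue-prioritized selective masking step in the spirit of these strategies can be sketched as:

```python
import random

def pragmatic_mask(tokens, cues, mask_prob=0.15, seed=0):
    """Cue-prioritized masking sketch: positions holding a social cue
    (e.g., an emoji or a hashtag) are masked first; any remaining masking
    budget is spent on randomly chosen positions."""
    rng = random.Random(seed)
    budget = max(1, round(len(tokens) * mask_prob))
    cue_pos = [i for i, t in enumerate(tokens) if t in cues]
    other_pos = [i for i, t in enumerate(tokens) if t not in cues]
    rng.shuffle(cue_pos)
    rng.shuffle(other_pos)
    chosen = set((cue_pos + other_pos)[:budget])  # cues consume the budget first
    return ["[MASK]" if i in chosen else t for i, t in enumerate(tokens)]
```

Unlike knowledge-graph or lexicon-based selection, a cue list like this requires no external annotation.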

10 Conclusion

We proposed two novel methods for improving transfer learning with language models: pragmatic masking and surrogate fine-tuning. We demonstrated the effectiveness of these methods on a new social meaning benchmark that we also introduced (SMPB). Our models establish new SOTA on several datasets and exhibit strikingly robust performance in severely few-shot settings. Our proposed methods are language independent and show encouraging performance when applied in zero-shot settings to data from three different languages. In future research, we plan to further test this language-independence claim. Our methods also promise to enable bringing further knowledge into pre-trained language models without the use of labeled data.

Ethical Considerations

SMPB is collected from publicly available sources and aims to avail resources for training NLP models without the need to share original user data, which could be a step toward protecting user privacy. Following Twitter policy, all the data we use for model pre-training and fine-tuning are anonymized. We will accompany our data and model release with model cards, detailed ethical considerations, and best practices.


  • M. Abdul-Mageed and L. Ungar (2017) EmoNet: fine-grained emotion detection with gated recurrent neural networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 718–728. External Links: Link, Document Cited by: §4.
  • M. Abdul-Mageed, C. Zhang, A. Hashemi, and E. M. B. Nagoudi (2020) AraNet: a deep learning toolkit for Arabic social media. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, Marseille, France, pp. 16–23 (English). External Links: ISBN 979-10-95546-51-1 Cited by: §8.
  • D. Bamman and N. A. Smith (2015) Contextualized sarcasm detection on twitter. In Ninth International AAAI Conference on Web and Social Media, Cited by: §2, Table 5.
  • F. Barbieri, J. Camacho-Collados, L. Espinosa Anke, and L. Neves (2020) TweetEval: unified benchmark and comparative evaluation for tweet classification. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 1644–1650. External Links: Link, Document Cited by: §2, Table 5, §7.
  • F. Barbieri, J. Camacho-Collados, F. Ronzano, L. Espinosa-Anke, M. Ballesteros, V. Basile, V. Patti, and H. Saggion (2018) SemEval 2018 task 2: multilingual emoji prediction. In Proceedings of The 12th International Workshop on Semantic Evaluation, New Orleans, Louisiana, pp. 24–33. External Links: Link, Document Cited by: §4.
  • V. Basile, C. Bosco, E. Fersini, D. Nozza, V. Patti, F. M. Rangel Pardo, P. Rosso, and M. Sanguinetti (2019) SemEval-2019 task 5: multilingual detection of hate speech against immigrants and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, Minnesota, USA, pp. 54–63. External Links: Link, Document Cited by: §8.
  • F. Bianchi, D. Nozza, and D. Hovy (2021) Feel-it: emotion and sentiment classification for the italian language. In Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 76–83. Cited by: §8.
  • C. Bosco, D. Felice, F. Poletto, M. Sanguinetti, and T. Maurizio (2018) Overview of the evalita 2018 hate speech detection task. In EVALITA 2018-Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Vol. 2263, pp. 1–9. Cited by: §8.
  • R. Caruana (1997) Multitask learning. Machine learning 28 (1), pp. 41–75. Cited by: Appendix B, §5.
  • M. Corazza, S. Menini, E. Cabrio, S. Tonelli, and S. Villata (2020) Hybrid emoji-based masked language models for zero-shot abusive language detection. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 943–949. External Links: Link, Document Cited by: §9.
  • M. Creutz (2018) Open subtitles paraphrase corpus for six languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. External Links: Link Cited by: §D.1.
  • T. Davidson, D. Warmsley, M. Macy, and I. Weber (2017) Automated Hate Speech Detection and the Problem of Offensive Language. In Proceedings of the 11th International AAAI Conference on Web and Social Media, ICWSM ’17, pp. 512–515. Cited by: §2, Table 5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §1, §9.
  • B. Felbo, A. Mislove, A. Søgaard, I. Rahwan, and S. Lehmann (2017) Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 1615–1625. External Links: Link, Document Cited by: §4.
  • B. Ghanem, J. Karoui, F. Benamara, V. Moriceau, and P. Rosso (2019) IDAT@FIRE2019: Overview of the Track on Irony Detection in Arabic Tweets. . In Mehta P., Rosso P., Majumder P., Mitra M. (Eds.) Working Notes of the Forum for Information Retrieval Evaluation (FIRE 2019). CEUR Workshop Proceedings. In:, Kolkata, India, December 12-15, Cited by: §8.
  • Y. Gu, Z. Zhang, X. Wang, Z. Liu, and M. Sun (2020) Train no evil: selective masking for task-guided pre-training. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 6966–6974. External Links: Link, Document Cited by: §3, §9.
  • A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi (2020) The curious case of neural text degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Link Cited by: §D.2.
  • T. Hu, H. Guo, H. Sun, T. Nguyen, and J. Luo (2017) Spice up your chat: the intentions and sentiment effects of using emojis. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 11. Cited by: §4.
  • P. Ke, H. Ji, S. Liu, X. Zhu, and M. Huang (2020) SentiLARE: sentiment-aware language representation learning with linguistic knowledge. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 6975–6988. External Links: Link, Document Cited by: §3, §9.
  • W. Lan, S. Qiu, H. He, and W. Xu (2017) A continuously growing dataset of sentential paraphrases. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 1224–1234. External Links: Link, Document Cited by: §D.1.
  • M. Lewis, M. Ghazvininejad, G. Ghosh, A. Aghajanyan, S. Wang, and L. Zettlemoyer (2020) Pre-training via paraphrasing. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: §9.
  • J. Liu, T. S. L. Blessing, K. L. Wood, and K. H. Lim (2020) CrisisBERT: robust transformer for crisis classification and contextual crisis embedding. arXiv preprint arXiv:2005.06627. Cited by: Table 5.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §3.3, §9.
  • I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Link Cited by: Appendix B.
  • J.A. Meaney, S. R. Wilson, L. Chiruzzo, and W. Magdy (2021) HaHackathon: detecting and rating humor and offense. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Cited by: §2.
  • S. Mohammad, F. Bravo-Marquez, M. Salameh, and S. Kiritchenko (2018) SemEval-2018 task 1: affect in tweets. In Proceedings of The 12th International Workshop on Semantic Evaluation, New Orleans, Louisiana, pp. 1–17. External Links: Link, Document Cited by: §2, §8.
  • S. Mohammad, S. Kiritchenko, P. Sobhani, X. Zhu, and C. Cherry (2016) SemEval-2016 task 6: detecting stance in tweets. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, California, pp. 31–41. External Links: Link, Document Cited by: §2, Table 5.
  • D. Q. Nguyen, T. Vu, and A. Tuan Nguyen (2020) BERTweet: a pre-trained language model for English tweets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 9–14. External Links: Link, Document Cited by: Table C.1, Table 5, §7.
  • A. Olteanu, C. Castillo, F. Diaz, and S. Vieweg (2014) Crisislex: A lexicon for collecting and filtering microblogged communications in crises. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 8. Cited by: §2.
  • P. Potash, A. Romanov, and A. Rumshisky (2017) SemEval-2017 task 6: #HashtagWars: learning a sense of humor. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, pp. 49–57. External Links: Link, Document Cited by: §2.
  • T. Ptáček, I. Habernal, and J. Hong (2014) Sarcasm detection on Czech and English Twitter. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin, Ireland, pp. 213–223. External Links: Link Cited by: §2, §4, Table 5.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21, pp. 1–67. Cited by: §D.1, §6.
  • A. Rajadesingan, R. Zafarani, and H. Liu (2015) Sarcasm detection on twitter: A behavioral modeling approach. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, WSDM 2015, Shanghai, China, February 2-6, 2015, X. Cheng, H. Li, E. Gabrilovich, and J. Tang (Eds.), pp. 97–106. External Links: Link, Document Cited by: §2, §4, Table 5.
  • E. Riloff, A. Qadir, P. Surve, L. De Silva, N. Gilbert, and R. Huang (2013) Sarcasm as contrast between a positive sentiment and negative situation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 704–714. External Links: Link Cited by: §2, §4, Table 5.
  • S. Rosenthal, N. Farra, and P. Nakov (2017) SemEval-2017 task 4: sentiment analysis in Twitter. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, pp. 502–518. External Links: Link, Document Cited by: §2, Table 5.
  • V. Sintsova and P. Pu (2016) Dystemo: distant supervision method for multi-category emotion recognition in tweets. ACM Trans. Intell. Syst. Technol. 8 (1). External Links: ISSN 2157-6904, Link, Document Cited by: §4.
  • K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2019) MASS: masked sequence to sequence pre-training for language generation. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 5926–5936. External Links: Link Cited by: §9.
  • Y. Sun, S. Wang, Y. Li, S. Feng, X. Chen, H. Zhang, X. Tian, D. Zhu, H. Tian, and H. Wu (2019) Ernie: enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223. Cited by: §9.
  • J. A. Thomas (2014) Meaning in interaction: an introduction to pragmatics. Routledge. Cited by: §1.
  • H. Tian, C. Gao, X. Xiao, H. Liu, B. He, H. Wu, H. Wang, and F. Wu (2020) SKEP: sentiment knowledge enhanced pre-training for sentiment analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 4067–4076. External Links: Link, Document Cited by: §3, §9.
  • C. Van Hee, E. Lefever, and V. Hoste (2018) SemEval-2018 task 3: irony detection in English tweets. In Proceedings of The 12th International Workshop on Semantic Evaluation, New Orleans, Louisiana, pp. 39–50. External Links: Link, Document Cited by: §2, Table 5.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. External Links: Link Cited by: §9.
  • Z. Waseem and D. Hovy (2016) Hateful symbols or hateful people? predictive features for hate speech detection on Twitter. In Proceedings of the NAACL Student Research Workshop, San Diego, California, pp. 88–93. External Links: Link, Document Cited by: §2, Table 5.
  • M. Wiegand and J. Ruppenhofer (2021) Exploiting emojis for abusive language detection. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, pp. 369–380. External Links: Link Cited by: §4.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. External Links: Link, Document Cited by: Appendix B.
  • I. Wood and S. Ruder (2016) Emoji as emotion tags for tweets. In Proceedings of the Emotion and Sentiment Analysis Workshop LREC2016, Portorož, Slovenia, pp. 76–79. Cited by: §4.
  • W. Xu, C. Callison-Burch, and B. Dolan (2015) SemEval-2015 task 1: paraphrase and semantic similarity in Twitter (PIT). In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado, pp. 1–11. External Links: Link, Document Cited by: §D.1.
  • Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 5754–5764. External Links: Link Cited by: §9.
  • M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, and R. Kumar (2019a) Predicting the type and target of offensive posts in social media. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1415–1420. External Links: Link, Document Cited by: §2.
  • M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, and R. Kumar (2019b) SemEval-2019 task 6: identifying and categorizing offensive language in social media (OffensEval). In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, Minnesota, USA, pp. 75–86. External Links: Link, Document Cited by: Table 5.
  • Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu (2019) ERNIE: enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1441–1451. External Links: Link, Document Cited by: §3, §9.
  • J. Zhou, Z. Zhang, H. Zhao, and S. Zhang (2020) LIMIT-BERT : linguistics informed multi-task BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 4450–4461. External Links: Link, Document Cited by: §3.

Appendix A Dataset

Table A.1 presents the distribution of 15 social meaning datasets.

Task Classes Train Dev Test Total
CrisisOltea {on-topic, off-topic} K K K K
EmoMoham {anger, joy, opt., sad.} K K K
HateWaseem {racism, sexism, none} K K K K
HateDavid {hate, offensive, neither} K K K K
HumorPotash {humor, not humor} K K
HumorMeaney {humor, not humor} K K K K
IronyHee-A {ironic, not ironic} K K
IronyHee-B {IC, SI, OI, NI} K K
OffenseZamp {offensive, not offensive} K K K
SarcRiloff {sarcastic, non-sarcastic} K K
SarcPtacek {sarcastic, non-sarcastic} K K K K
SarcRajad {sarcastic, non-sarcastic} K K K K
SarcBam {sarcastic, non-sarcastic} K K K K
SentiRosen {neg., neu., pos.} K K K K
StanceMoham {against, favor, none} K K K
Table A.1: Social meaning data. opt.: Optimism, sad.: Sadness, IC: Ironic by clash, SI: Situational irony, OI: Other irony, NI: Non-ironic, neg.: Negative, neu.: Neutral, pos.: Positive.

Appendix B Hyper-parameters and Procedure

Pragmatic masking. For pragmatic masking, we use the Adam optimizer with decoupled weight decay Loshchilov and Hutter (2019) and a fixed peak learning rate, and we pre-train for a fixed number of epochs.

Surrogate fine-tuning. For surrogate fine-tuning, we fine-tune RoBERTa on the surrogate classification tasks with the same Adam optimizer but a different peak learning rate.

The pragmatic masking and surrogate fine-tuning models are trained on eight Nvidia V100 GPUs. All models are implemented with the Hugging Face Transformers library Wolf et al. (2020).

Downstream fine-tuning. We evaluate the models further pre-trained with pragmatic masking, as well as the surrogate fine-tuned models, on the downstream tasks in Table A.1. We cap the maximal sequence length for the text classification tasks. For CrisisOltea and StanceMoham, we append the topic term after the post content, separated by the [SEP] token, and allow a longer maximal sequence length. For all tasks, we pass the hidden state of the [CLS] token from the last Transformer encoder layer through a non-linear layer to predict the label; the training loss is cross-entropy. We then use Adam with weight decay to optimize the model and fine-tune on each task for a fixed number of epochs with early stopping. We tune the peak learning rate and batch size over small grids, selecting the learning rate that performs best across all tasks; downstream tasks with small Train sets use a smaller mini-batch size, and larger tasks use a larger one.
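The per-example training loss is standard cross-entropy over the class logits, i.e., the negative log-softmax of the gold class; a minimal, numerically stable version:

```python
import math

def cross_entropy(logits, gold):
    """Cross-entropy of one example: -log softmax(logits)[gold].
    The max is subtracted before exponentiating (the log-sum-exp trick)
    to avoid overflow for large logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[gold]
```

In practice this is averaged over the mini-batch and minimized with Adam.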

Multi-task fine-tuning. For multi-task fine-tuning, we use the same batch size for all tasks and train for a fixed number of epochs without early stopping. We implement multi-task learning with a hard parameter sharing model Caruana (1997), sending the final state of the [CLS] token to the non-linear output layer of the corresponding task.
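Hard parameter sharing amounts to routing one shared representation through per-task output heads; a minimal sketch (all names are illustrative):

```python
def multitask_forward(encode, heads, task, text):
    """Hard parameter sharing: a single shared encoder feeds one output
    head per task. `encode` maps an input to a shared representation
    (e.g., the [CLS] state); `heads` maps a task name to that task's
    output layer."""
    shared = encode(text)       # parameters shared across all tasks
    return heads[task](shared)  # task-specific head
```

During training, batches from different tasks update the shared encoder jointly, while each head is updated only by its own task.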

We use the same hyperparameters and run each downstream fine-tuning experiment several times with different random seeds (unless otherwise indicated). All downstream task models are fine-tuned on Nvidia V100 GPUs. At the end of each epoch, we evaluate the model on the Dev set, identify the checkpoint with the highest Dev performance as our best model, and then test that best model on the Test set. In order to compute a model's overall performance across tasks, we use the same evaluation metric (i.e., macro F1) for all tasks. We report the average Test macro F1 of the best model over the runs, and we also average the macro F1 scores across tasks to present the model's overall performance.
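The macro F1 we average across tasks is the unweighted mean of per-class F1 scores; a reference implementation:

```python
def macro_f1(gold, pred):
    """Macro-averaged F1: per-class F1 scores weighted equally,
    regardless of class frequency."""
    labels = sorted(set(gold) | set(pred))
    f1s = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Because every class counts equally, macro F1 is a stricter metric than accuracy on the class-imbalanced datasets in Table A.1.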

Appendix C Summary of Model Performance

Table C.1 summarizes performance of our models across different settings.

CrisisOltea-14 95.95 95.88 95.78 95.92 95.75 95.85 95.85 95.87 95.91 95.98 95.95 95.77 95.76 96.02 95.87 95.68 95.88 95.70
EmoMoham-18 77.99 80.14 79.43 80.73 80.31 78.56 79.51 80.15 80.03 81.09 81.28 82.18 79.69 82.04 78.69 80.50 75.79 75.85
HateWaseem-16 57.34 57.47 56.75 56.34 57.16 57.51 56.97 57.13 57.00 57.01 57.08 56.69 56.47 60.92 63.97 60.25 56.52 55.79
HateDavid-17 77.71 77.15 77.47 78.28 76.87 77.46 77.55 77.22 78.13 78.26 78.16 77.93 76.45 77.00 77.29 76.93 77.07 77.13
HumorPotash-17 54.40 52.77 55.45 55.26 55.32 52.49 50.06 54.60 57.14 55.10 55.25 55.57 54.75 54.93 55.51 53.83 53.33 50.07
HumorMeaney-21 92.37 94.46 93.24 93.69 93.58 93.48 92.85 94.52 93.55 94.50 93.19 92.69 93.82 93.68 93.74 94.49 91.46 92.51
IronyHee-18A 73.93 77.35 74.52 74.97 74.50 75.16 73.97 76.24 75.34 77.93 74.40 75.62 76.63 72.73 76.22 79.89 77.57 78.64
IronyHee-18B 52.30 58.67 52.91 53.79 51.43 49.29 50.41 54.76 54.94 56.09 54.73 52.47 57.59 56.11 60.14 61.67 54.37 55.20
Offense-Zamp-19 80.13 78.49 79.97 80.24 79.74 79.34 79.95 78.87 80.18 81.14 80.18 80.65 80.18 81.34 79.82 79.50 79.65 79.26
SarcRiloff-13 73.85 78.81 72.02 75.24 71.42 74.72 74.16 75.88 76.52 77.93 76.30 80.10 78.34 78.74 80.50 80.49 77.41 78.09
SarcPtacek-14 95.09 96.65 95.81 95.64 95.50 95.62 95.24 95.81 95.81 96.06 95.67 96.01 95.88 96.16 96.01 96.24 95.72 96.26
SarcRajad-15 85.07 87.58 86.18 86.23 85.04 85.55 85.20 85.93 86.14 86.65 86.02 86.94 86.80 87.48 87.56 88.92 85.99 88.10
SarcBam-15 79.08 82.08 80.03 80.13 80.22 80.16 79.83 80.31 80.73 81.12 81.13 81.73 81.48 82.53 81.19 81.53 80.70 82.64
SentiRosen-17 71.08 71.83 72.03 72.65 72.10 71.99 71.84 71.82 72.24 71.98 72.27 71.56 71.27 72.07 71.83 71.08 70.66 69.48
StanceMoham-16 70.41 67.41 67.14 69.94 69.51 68.13 69.23 69.68 70.20 68.62 70.04 68.48 69.06 69.65 71.27 70.77 69.11 69.22
Average 75.78 77.12 75.92 76.60 75.90 75.69 75.51 76.59 76.92 77.30 76.78 76.96 76.94 77.43 77.97 78.12 76.08 76.26
Table C.1: Summary performance of our models across different (i) PM and SFT settings, (ii) single- and multi-task settings, and (iii) in comparison with BERTweet Nguyen et al. (2020), a SOTA Twitter pre-trained language model. The metric is macro F1.

Appendix D Paraphrasing Model and SMPB

D.1 Paraphrase Model

In order to paraphrase our datasets, we fine-tune the T5BASE model Raffel et al. (2020) on paraphrase datasets from PIT-2015 (tweets) Xu et al. (2015), LanguageNet (tweets) Lan et al. (2017), Opusparcus (video subtitles) Creutz (2018), and Quora Question Pairs (Q&A website). For PIT-2015, LanguageNet, and Opusparcus, we only keep sentence pairs above a semantic similarity threshold. We then merge all datasets and shuffle them. Next, we split the data into Train, Dev, and Test splits and fine-tune T5 on the Train split with a constant learning rate.

D.2 SMPB

We fine-tune the paraphrase model on the Train split of each of our datasets, using top-p (nucleus) sampling Holtzman et al. (2020) to generate paraphrases for each gold sample. To remove any 'paraphrases' that are just copies of the original tweet or near-duplicates, we use a simple tri-gram similarity method: we remove any sequences whose similarity with the original tweet exceeds an upper threshold, and we likewise remove paraphrases of the same tweet that are too similar to one another. To ensure that paraphrases are not very different from the original tweets, we also remove any sequences whose similarity with the original tweet falls below a lower threshold. This process results in a paraphrase dataset we call Para-Clean. We present an example paraphrase in Figure D.1, and Table D.2 provides more paraphrase examples. We observe that the paraphrase model fails to generate emojis, since emojis are out-of-vocabulary for the original T5 model, but it does preserve the overall semantics of the original tweet.
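A minimal version of this tri-gram filter can be sketched as follows (the Jaccard formulation and the thresholds shown are illustrative; our actual thresholds are not reproduced here):

```python
def trigram_set(text):
    """Set of word tri-grams of a lowercased string."""
    toks = text.lower().split()
    return {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}

def trigram_overlap(a, b):
    """Jaccard overlap between the word tri-gram sets of two strings."""
    ta, tb = trigram_set(a), trigram_set(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def keep_paraphrase(original, candidate, low=0.1, high=0.9):
    """Drop near-copies (overlap above `high`) and off-topic generations
    (overlap below `low`); keep everything in between."""
    sim = trigram_overlap(original, candidate)
    return low <= sim <= high
```

The same overlap function can also be applied pairwise among the paraphrases of one tweet to discard mutual near-duplicates.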

Figure D.1: Paraphrase example.

We use the Para-Clean dataset we derive from our SMPB benchmark to investigate the viability of using paraphrase data instead of gold training data. We perform our experiments by fine-tuning our best model combining SFT and PM (i.e., X2+SFT-H in Table 3) and compare it to a model initialized with the RoBERTa baseline. We evaluate our models on the gold Dev and Test sets of the individual tasks (details about the datasets are in Section 2). To explore the effect of paraphrase data size on the downstream tasks, we extract 1, 2, 4, and 5 paraphrases from Para-Clean for each gold Train sample in each of our tasks. We refer to the resulting datasets as Para1, Para2, Para4, and Para5. Table D.1 shows the distribution of the resulting paraphrase datasets.

Task Para1 Para2 Para4 Para5
CrisisOltea K K K K
EmoMoham K K K K
HateWaseem K K K K
HateDavid K K K K
HumorPotash K K K K
HumorMeaney K K K K
IronyHee-A K K K K
IronyHee-B K K K K
OffenseZamp K K K K
SarcRiloff K K K K
SarcPtacek K K K K
SarcRajad K K K K
SarcBam K K K K
SentiRosen K K K K
StanceMoham K K K K
Table D.1: Distribution of SMPB Train set.
Original Tweet Paraphrase Label
You guys are horrible, avoid MMT
what I am doing is in my control, #AvoidMMT, you guys are terrifying
You guys are #terrorist. I have used everything I have to do.
USER but what I am doing is in my control,
#AvoidMMT , you guys are #terrible
You guys are awful, but I am going to stop doing it. anger
I hate when people say ’I need to talk to you or we need to talk to you’. I guess that’s the problem.
I hate when people tell me ’I need to talk to you or we need to speak’ my anxiety immediately goes up.
I hate when people say ’I need to talk to you or we need to talk.’
My anxiety immediately goes up…
Why am I afraid when people say ’I need to talk to you or we need to talk?’ anger
The 46th wedding I’ve ruined. When I hit 50 I can retire. It’s nice to see yo
Here’s the 47th wedding I’ve ruined. If I’m old enough to go on the 40s I can get married.
This is the 47th wedding I’ve ruined. When I hit 50 I can retire. After a single wedding, I drew 47 weddings, and before I hit 50 I can retire” humor
Sorry to disturb you. I have absolutely no idea what time I’ll be on cam tomorrow.
Sorry guys I have absolutely no idea
what time i’ll be on cam tomorrow but will keep you posted.
I have absolutely no idea what time I’ll be on camera tomorrow but I’ll keep you posted sadness
”I’ll buy you Dunkin’ Donuts for $5.
Who wants to go with me for my tattoo tomorrow? I’ll buy you a Dunkin’ Donuts.
Who wants to go with me to get my tattoo tomorrow?
I’ll buy you Dunkin doughnuts
Who wants to go with me to get my tattoo tomorrow? neutral
The day before class please eat beans, onions and garlic. Also see the videos
”The Day Before Class. You should make that meal, (do you think).
USER May I suggest, that you have a meal that is made with
beans, onions & garlic, the day before class.
If you can eat just the day before class, make a wonderful meal with garlic, onions and beans. joy
Table D.2: Examples of paraphrases in SMPB

D.3 Paraphrase-Based Methods

We present the average Test macro F1 over three runs in Table D.3. As Table D.3 shows, although none of our paraphrase-based models (Para-Models) exceed the gold RoBERTa baseline, our P4 model comes very close. In addition, an ensemble of our Para-Models is slightly better than the gold RoBERTa baseline.

Task Baseline (RoBERTaBASE) Our Best Method (X2+SFT-H)
P1 P2 P4 P5 Best Gold P1 P2 P4 P5 Best Gold
CrisisOltea 95.54 95.20 95.23 95.39 95.54 95.95 94.98 94.68 95.05 95.09 95.09 95.68
EmoMoham 77.06 77.43 77.31 76.99 77.43 77.99 77.81 78.38 77.61 77.84 78.38 80.50
HateWaseem 54.58 54.92 54.83 54.63 54.92 57.34 53.48 54.39 53.94 54.11 54.39 60.25
HateDavid 74.04 75.26 75.80 74.97 75.80 77.71 73.12 74.48 74.08 74.66 74.66 76.93
HumorPotash 49.26 53.16 53.54 53.34 53.54 54.40 49.61 54.39 52.59 51.90 54.39 53.83
HumorMeaney 91.71 91.66 92.28 91.77 92.28 92.37 94.13 94.32 93.95 94.16 94.32 94.49
IronyHee-A 71.00 71.53 71.06 71.53 71.53 73.93 74.41 75.53 75.40 75.67 75.67 79.89
IronyHee-B 43.24 45.69 48.02 48.33 48.33 52.30 48.88 51.71 50.56 49.69 51.71 61.67
OffenseZamp 79.90 79.95 80.15 79.57 80.15 80.13 75.96 78.22 78.23 78.80 78.80 79.50
SarcRiloff 71.40 72.74 72.45 72.26 72.74 73.85 76.91 77.59 79.41 79.24 79.41 80.49
SarcPtacek 90.54 92.58 93.70 93.62 93.70 95.09 91.65 93.31 94.10 93.98 94.10 96.24
SarcRajad 81.00 81.94 82.53 82.64 82.64 85.07 87.34 86.88 87.59 87.38 87.59 88.92
SarcBam 77.60 76.93 77.92 77.76 77.92 79.08 80.86 80.06 81.53 80.22 81.53 81.53
SentiRosen 70.50 70.95 71.51 71.52 71.52 71.08 66.79 67.31 68.87 68.29 68.87 71.08
StanceMoham 67.75 69.07 69.30 67.34 69.30 70.41 68.39 68.60 68.58 67.03 68.60 70.77
Average 73.01 73.93 74.38 74.11 74.49 75.78 74.29 75.32 75.43 75.20 75.83 78.12
Table D.3: Results of paraphrase-based models. Gold denotes a model fine-tuned on the original downstream Train data. Pn indicates that the model is trained on the Paran training set. Bold denotes the best paraphrase result for each task within each of the two settings (Baseline vs. X2+SFT-H). X2+SFT-H: our model combining the best settings of PM and SFT (see Table 3). Best: performance of the best paraphrasing setting. An ensemble of our best paraphrase-based models is slightly better than the gold-initialized baseline model.