Social Bias Frames: Reasoning about Social and Power Implications of Language

by   Maarten Sap, et al.

Language has the power to reinforce stereotypes and project social biases onto others. At the core of the challenge is that it is rarely what is stated explicitly, but all the implied meanings that frame people's judgements about others. For example, given a seemingly innocuous statement "we shouldn't lower our standards to hire more women," most listeners will infer the implicature intended by the speaker - that "women (candidates) are less qualified." Most frame semantic formalisms, to date, do not capture such pragmatic frames in which people express social biases and power differentials in language. We introduce Social Bias Frames, a new conceptual formalism that aims to model the pragmatic frames in which people project social biases and stereotypes on others. In addition, we introduce the Social Bias Inference Corpus, to support large-scale modelling and evaluation with 100k structured annotations of social media posts, covering over 26k implications about a thousand demographic groups. We then establish baseline approaches that learn to recover Social Bias Frames from unstructured text. We find that while state-of-the-art neural models are effective at high-level categorization of whether a given statement projects unwanted social bias (86% F1), they are not effective at spelling out more detailed explanations by accurately decoding out Social Bias Frames. Our study motivates future research that combines structured pragmatic inference with commonsense reasoning on social implications.



1 Introduction

Figure 1: Understanding and explaining why a seemingly innocuous statement is potentially unjust requires reasoning about the conversational implicatures and commonsense implications with respect to the overall offensiveness, intent, and power differentials among different social groups. Social Bias Frames aims to represent the various pragmatic meanings related to social bias implications, by combining categorical and free-text annotations, e.g., that “women are less qualified” is implied by the phrase “lowering our standards to hire more women.”
post off? int? lewd? group? targeted group implied statement min. speaker
Kendrick just dropped the coldest video of all fucking time no no no
What a fucking stupid bitch shut the fuck up already yes yes no
need to fuck sum bad , I got some pressure built up :P no no yes
I take a bitch home give her long dick yes yes yes pers.
You annoyed me every day for months you’re a fucking moron yes yes no no
My problem with Korean artists: I dont know how to pronounce your name I can’t hanggul to save my life yes no no yes Koreans have weird names no
I’ll continue this thread till non-lifting fags stop lifting like idiots at the gym… Which is never. yes yes no yes gay men are weak and not manly no
I thought drugs were the only things black people could shoot up Boy was I wrong yes yes no yes Black folks do drugs no
kill people
commit shootings
Table 1: Examples of inference tuples in SBIC. The types of inferences captured by Social Bias Frames cover (potentially subtle) offensive implications about various demographic groups.

Language has enormous power to project social biases and reinforce stereotypes on people Fiske (1993). The way such biases are projected is rarely in what is stated explicitly, but in all the implied layers of meanings that frame and influence people’s judgements about others. For example, even with a seemingly innocuous statement that an all-Muslim movie was a “box office bomb”, most people can instantly recognize the implied demonizing stereotype that “Muslims are terrorists” (Figure 1). Understanding these biases with accurate underlying explanations is necessary for AI systems to adequately interact in the social world Pereira et al. (2016), and failure to do so can result in the deployment of harmful technologies (e.g., conversational AI systems turning sexist and racist; Vincent, 2016).

Most previous approaches to understanding the implied harm in statements have cast this task as a simple toxicity classification (e.g., Waseem and Hovy, 2016; Founta et al., 2018; Davidson et al., 2017). However, simple classifications run the risk of discriminating against minority groups, due to high variation and identity-based biases in annotations (e.g., which cause models to learn associations between dialect and toxicity; Sap et al., 2019a; Davidson et al., 2019). In addition, it is the detailed explanations that are much more informative for people to understand and reason about why a statement is potentially harmful against other people Ross et al. (2017).

Thus, we propose Social Bias Frames, a novel conceptual formalism that aims to model pragmatic frames in which people project social biases and stereotypes on others. Compared to semantic frames, the meanings projected by pragmatic frames are richer and thus cannot be easily formalized using only categorical labels. Therefore, as illustrated in Figure 1, our formalism combines hierarchical categories of biased implications such as intent and offensiveness with implicatures described in free-form text such as groups referenced and implied statements. In addition, we introduce SBIC (the Social Bias Inference Corpus), a new corpus collected using a novel crowdsourcing framework. SBIC supports large-scale learning and evaluation with over 100k structured annotations of social media posts, spanning over 26k implications about a thousand demographic groups.

We then establish baseline approaches that learn to recover Social Bias Frames from unstructured text. We find that while state-of-the-art neural models are effective at making high-level categorization of whether a given statement projects unwanted social bias (86% F1), they are not effective at spelling out more detailed explanations by accurately decoding out Social Bias Frames. Our study motivates future research that combines structured pragmatic inference with commonsense reasoning on social implications.

Important Implications of Our Study

We recognize that studying Social Bias Frames necessarily requires us to confront online content that may be offensive or disturbing. However, deliberate avoidance does not make the problem go away. Therefore, the important premise we take in this study is that assessing social media content through the lens of Social Bias Frames is important for automatic flagging or AI-augmented writing interfaces, where potentially harmful online content can be analyzed with detailed explanations for users to consider and verify. In addition, the collective analysis over large corpora can also be insightful for educating people to put more conscious effort into reducing unconscious biases that they repeatedly project in their language use.

2 Social Bias Frames Definition

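| platform | source | # posts |
|---|---|---|
| Reddit | r/darkJokes | 10,096 |
| Reddit | r/meanJokes | 3,497 |
| Reddit | r/offensiveJokes | 356 |
| Reddit | total | 13,949 |
| Twitter | Founta et al. (2018) | 11,865 |
| Twitter | Davidson et al. (2017) | 3,008 |
| Twitter | Waseem and Hovy (2016) | 1,816 |
| Twitter | total | 16,689 |

Table 2: Breakdown of origins of posts in SBIC.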

To better enable models to account for socially biased implications of language, we design a new pragmatic formalism that distinguishes several related but distinct inferences, shown in Figure 1. (In this work, we employ the U.S. socio-cultural lens when discussing bias and power dynamics among demographic groups.) Given a natural language utterance, henceforth post, we collect both categorical and free-text inferences (described below), inspired by recent efforts in knowledge graph creation (e.g., Speer and Havasi, 2012; Sap et al., 2019b). The free-text explanations are crucial to our formalism, as they can both increase trust in predictions made by the machine Kulesza et al. (2012); Bussone et al. (2015); Nguyen et al. (2018) and encourage a poster’s empathy towards the targeted group, thereby combating potential biases Cohen-Almagor (2014).

Figure 2: Snippet of the annotation task used to collect SBIC. Lewdness, group implication, and speaker minority questions are omitted for brevity but shown in larger format in Figure 5 (Appendix).


Offensiveness

denotes the overall rudeness, disrespect, or toxicity of a post. We define it formally as whether it could be considered “offensive to anyone”, as previous work has shown this to have higher recall of offensive content Sap et al. (2019a). This is a categorical variable with three possible answers (yes, maybe, no).

Intent to offend

captures whether the perceived motivation of the author was to offend, which is key to understanding how it is received Kasper (1990); Dynel (2015). This is a categorical variable with four possible answers (yes, probably, probably not, no).


Lewd

or sexual references are a key subcategory of what constitutes potentially offensive material in many cultures, especially in the United States Strub (2008). This is a categorical variable with three possible answers (yes, maybe, no).

Group implications

are distinguished from individual-only attacks or insults that do not invoke power dynamics between groups (e.g., “F*ck you” vs. “F*ck you, f*ggot”). This is a categorical variable with two possible answers.

Targeted group

describes the social or demographic group that is referenced or targeted by the post. Here we collect free-text answers, but provide a seed list of demographic or social groups to encourage consistency.

Implied statement

represents the power dynamic or stereotype that is referenced in the post. We collect free-text answers in the form of simple Hearst-like patterns (e.g., “women are ADJ”, “gay men VBP”; Hearst, 1992).

Minority speaker

aims to flag posts for which the speaker may be part of the same social group referenced. This is motivated by previous work on how speaker identity influences how a statement is received Greengross and Miller (2008); Sap et al. (2019a).
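The seven variables above can be gathered into a single record. The following is a minimal sketch of such a record, not the released SBIC schema: the class name `SocialBiasFrame`, its field names, and the use of booleans for the group-implication and minority-speaker answers are all assumptions made for illustration. The validation encodes only what the text states: the allowed answers for the three graded variables, that the group question is asked only for potentially offensive posts, and that the free-text fields are tied to a group implication.

```python
from dataclasses import dataclass
from typing import Optional

# Answer sets as described in the text for the graded categorical variables.
OFFENSIVE_ANSWERS = ("yes", "maybe", "no")
INTENT_ANSWERS = ("yes", "probably", "probably not", "no")
LEWD_ANSWERS = ("yes", "maybe", "no")


@dataclass
class SocialBiasFrame:
    """One annotated frame for a post (hypothetical container, for illustration)."""

    post: str
    offensive: str
    intent: str
    lewd: str
    group_implied: Optional[bool] = None      # only asked if potentially offensive
    targeted_group: Optional[str] = None      # free text
    implied_statement: Optional[str] = None   # free text
    minority_speaker: Optional[bool] = None   # answer encoding is an assumption

    def __post_init__(self) -> None:
        if self.offensive not in OFFENSIVE_ANSWERS:
            raise ValueError(f"invalid offensiveness answer: {self.offensive!r}")
        if self.intent not in INTENT_ANSWERS:
            raise ValueError(f"invalid intent answer: {self.intent!r}")
        if self.lewd not in LEWD_ANSWERS:
            raise ValueError(f"invalid lewdness answer: {self.lewd!r}")
        # The group question is skipped for posts judged not offensive.
        if self.offensive == "no" and self.group_implied is not None:
            raise ValueError("group implication is only asked for offensive posts")
        # Free-text inferences only make sense when a group is implicated.
        has_free_text = (
            self.targeted_group is not None or self.implied_statement is not None
        )
        if has_free_text and not self.group_implied:
            raise ValueError("free-text fields require a group implication")


# The categorical answers below are illustrative, not taken from the corpus.
example = SocialBiasFrame(
    post="we shouldn't lower our standards to hire more women",
    offensive="yes",
    intent="probably",
    lewd="no",
    group_implied=True,
    targeted_group="women",
    implied_statement="women are less qualified",
)
```

Keeping the categorical and free-text parts in one record mirrors the formalism's combination of both kinds of annotation, and the checks make the hierarchical dependency between the questions explicit.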

3 Collecting nuanced annotations

To create SBIC, we design a crowdsourcing framework to seamlessly distill the biased implications of posts at a large scale.

3.1 Data Selection

We draw from two sources of online content, namely Reddit and Twitter, to select posts to annotate. To mitigate the scarcity of toxic content online, whose prevalence Founta et al. (2018) find to be 4%, we start by annotating posts made in three intentionally offensive English subreddits (see Table 2). By nature, these are very likely to have harmful implications as they are often posted with the intent to deride adversity or social inequality Bicknell (2007). Additionally, we include posts from three existing English datasets annotated for toxic or abusive language, filtering out @-replies, retweets, and links. We mainly annotate tweets released by Founta et al. (2018), who use a bootstrapping approach to sample potentially offensive tweets. We also include tweets from Waseem and Hovy (2016) and Davidson et al. (2017), who collect datasets of tweets containing racist or sexist hashtags and slurs, respectively.

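| statistic | count |
|---|---|
| total # tuples | 103,173 |
| # unique posts | 30,638 |
| # unique groups | 1,125 |
| # unique implications | 24,004 |
| # unique post-group | 34,204 |
| # unique post-group-implication | 64,833 |
| # unique group-implication | 25,866 |

Table 3: Statistics of the SBIC dataset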

3.2 Annotation Task Design

We design a hierarchical Amazon Mechanical Turk (MTurk) framework to collect biased implications of a given post (snippet shown in Figure 2). The full task is shown in the supplementary material (Figure 5).

For each post, workers indicate whether the post is offensive, whether the intent was to offend, and whether it contains lewd or sexual content. Only if annotators indicate potential offensiveness do they answer the group implication question. If the post targets or references a group or demographic, workers select or write which one(s); per selected group, they then write two to four stereotypes. Finally, workers are asked whether they think the speaker is part of one of the groups referenced by the post. Optionally, we ask workers for coarse-grained demographic information. (This study was approved by the University of Washington IRB.)

We collect three annotations per post, and restrict our worker pool to the U.S. and Canada.

Figure 3: Breakdown of targeted group categories by domains. We show percentages within domains for the top 3 most represented categories, namely gender/sexuality (e.g., women, LGBTQ folks), race/ethnicity (e.g., Black, Latinx, and Asian folks), and culture/origin (e.g., Muslim, Jewish folks).

Annotator demographics

In our final annotations, our worker pool was relatively gender-balanced and age-balanced (55% women, 42% men, 1% non-binary; 36±10 years old), but racially skewed (82% White, 4% Asian, 4% Hispanic, 4% Black).

Figure 4: Overall architecture of our full multi-task model, which combines five classification tasks for categorical variables (in yellow; L1) with a generation task for the free-text variables (in dark blue; L2).

Annotator agreement

We compute how well annotators agreed on categorical questions, showing moderate agreement on average. Workers agreed on a post being offensive at a rate of 77% (Cohen’s κ=0.53), its intent being to offend at 76% (κ=0.48), and it having group implications at 76% (κ=0.51). Workers marked posts as lewd with substantial agreement (94%, κ=0.66), but agreed less when marking the speaker a minority (94%, κ=0.18). Low κ values are expected for highly skewed categories such as minority speaker (only 4% “yes”).
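To see why a skewed category can combine high raw agreement with a low chance-corrected score, the sketch below computes Cohen's kappa for two annotators from scratch. It is an illustration only: the function name `cohen_kappa` and the toy labels are assumptions, and it covers the two-annotator case rather than whatever aggregation over three annotators the study used.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labelled independently,
    # each following their own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both annotators use a single label")
    return (observed - expected) / (1.0 - expected)


# Toy data: 100 posts, each annotator says "yes" for only 4 of them.
# They overlap on one "yes", disagree on 6 posts, and agree on 93 "no" posts.
annotator_a = ["yes"] * 4 + ["no"] * 96
annotator_b = ["no"] * 3 + ["yes"] * 4 + ["no"] * 93

raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / 100
kappa = cohen_kappa(annotator_a, annotator_b)
print(f"agreement={raw_agreement:.2f}, kappa={kappa:.3f}")
```

With a 4% positive rate, two annotators who always answered "no" would already agree on most items, so the expected agreement is about 0.92 and only a small part of the 94% raw agreement counts as agreement beyond chance.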

3.3 SBIC description

After data collection, SBIC contains 100k structured inference tuples, covering 25k free text group-implication pairs (see Table 3). We show example inference tuples in Table 1.

Additionally, we show a breakdown of the types of targeted groups in Figure 3. While SBIC covers a variety of types of biases, gender-based, race-based, and culture-based biases are the most represented, which parallels the types of discrimination happening in the real world RWJF (2017).

4 Social Bias Inference

Given a post, our model aims to generate the implied power dynamics in textual form, as well as classify the post’s offensiveness and other categorical variables. We show a general overview of the full model in Figure 4.

As input, our model takes a post $w_{1:n}$, defined as a sequence of tokens $w_1, \ldots, w_n$ delimited by a start token ([STR]) and a classifier token ([CLF]). Our encoder model then yields a contextualized representation $\mathbf{h}_i \in \mathbb{R}^H$ of each token $w_i$, where $H$ is the hidden size of the encoder.


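For predicting the categorical variables ($y_c$, with $c$ ranging over offensiveness, intent, lewdness, group implication, and minority speaker), our model combines five logistic classifiers that use the representation at the classifier token, $\mathbf{h}_{\texttt{[CLF]}} \in \mathbb{R}^H$, as input. The final predictions are computed through a projection and a sigmoid layer:

$$\hat{y}_c = \sigma\left(\mathbf{W}_c \mathbf{h}_{\texttt{[CLF]}} + b_c\right)$$

where $\mathbf{W}_c \in \mathbb{R}^{1 \times H}$ and $b_c \in \mathbb{R}$. During training, we minimize the negative log-likelihood of the predictions:

$$\mathcal{L}_1 = -\sum_{c} \left[ y_c \log \hat{y}_c + (1 - y_c) \log\left(1 - \hat{y}_c\right) \right]$$

During inference, we simply predict the classes which have the highest probability.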
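The classification head described above (one projection and sigmoid per categorical variable, trained with negative log-likelihood) can be sketched in plain Python. This is a toy illustration under stated assumptions: each variable is treated as binary, the category keys, function names, and the tiny hidden size are hypothetical, and the weights are hand-set rather than learned.

```python
import math
from typing import Dict, List

# Hypothetical keys for the five categorical variables.
CATEGORIES = ("offensive", "intent", "lewd", "group", "minority_speaker")


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(
    h_clf: List[float],
    weights: Dict[str, List[float]],
    biases: Dict[str, float],
) -> Dict[str, float]:
    """Probability of the positive class for each variable: sigmoid(w . h + b)."""
    return {
        c: sigmoid(sum(w * h for w, h in zip(weights[c], h_clf)) + biases[c])
        for c in CATEGORIES
    }


def negative_log_likelihood(probs: Dict[str, float], gold: Dict[str, int]) -> float:
    """Sum of binary cross-entropy terms over the five classifiers."""
    return -sum(
        math.log(probs[c]) if gold[c] == 1 else math.log(1.0 - probs[c])
        for c in CATEGORIES
    )


# Toy usage with a 3-dimensional classifier-token representation.
h = [0.2, -1.0, 0.5]
zero_weights = {c: [0.0, 0.0, 0.0] for c in CATEGORIES}
zero_biases = {c: 0.0 for c in CATEGORIES}
gold_labels = {"offensive": 1, "intent": 1, "lewd": 0, "group": 1, "minority_speaker": 0}

probs = predict(h, zero_weights, zero_biases)      # every probability is 0.5
loss = negative_log_likelihood(probs, gold_labels)  # 5 * log(2)
labels = {c: int(p >= 0.5) for c, p in probs.items()}
```

An untrained head with zero weights assigns probability 0.5 everywhere, so its loss is five times log 2; training lowers this by moving each probability towards its gold label.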


For the free-text variables, we take inspiration from recent generative commonsense modelling Bosselut et al. (2019). Specifically, we frame the inference as a conditional language modelling task, by appending the linearized targeted group ($g$) and implied statement ($s$) to the post (using the SEP delimiter token; see Figure 4). During training, we minimize the cross-entropy of the linearized triple $x = (w_{1:n}, g, s)$ using a language modelling objective:

$$\mathcal{L}_2 = -\sum_{t} \log p\left(x_t \mid x_{<t}\right)$$

During inference, we generate the group and statement conditioned on the post $w_{1:n}$, using greedy (argmax) or sampling decoding.
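The linearization step can be illustrated with a short sketch. The exact token layout is given in Figure 4 and is not reproduced here, so the ordering below (start token, post, delimiter, group, delimiter, statement), the whitespace tokenization, and the helper names are all assumptions. The loss helper simply sums negative log-probabilities, as a language modelling objective would over the linearized sequence.

```python
import math
from typing import List, Tuple

STR, SEP = "[STR]", "[SEP]"  # special tokens named in the text


def linearize(post: str, group: str, statement: str) -> List[str]:
    """Build one training sequence from a post-group-statement triple."""
    return [STR] + post.split() + [SEP] + group.split() + [SEP] + statement.split()


def split_generation(tokens: List[str]) -> Tuple[str, str]:
    """Recover (group, statement) from a generated continuation of the post."""
    if SEP in tokens:
        i = tokens.index(SEP)
        return " ".join(tokens[:i]), " ".join(tokens[i + 1:])
    return " ".join(tokens), ""


def lm_loss(token_probs: List[float]) -> float:
    """Cross-entropy of a sequence given the model probability of each gold token."""
    return -sum(math.log(p) for p in token_probs)


sequence = linearize(
    "we shouldn't lower our standards to hire more women",
    "women",
    "women are less qualified",
)
# Everything after the first delimiter is what the model must generate.
continuation = sequence[sequence.index(SEP) + 1:]
group, statement = split_generation(continuation)
```

At inference time only the post would be given as the prefix, and the decoded continuation would be split on the delimiter in the same way.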

Positive class rate (dev): offensive 54.8%, intent 57.5%, lewd 9.3%, group 68.2%, min. speaker 2.0%.

| split | model | offensive F1 | offensive pr | offensive rec | intent F1 | intent pr | intent rec | lewd F1 | lewd pr | lewd rec | group F1 | group pr | group rec | min. speaker F1 | min. speaker pr | min. speaker rec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dev | GPT L1+L2 | 85.9 | 79.8 | 92.9 | 87.1 | 86.9 | 87.4 | 18.8 | 49.2 | 11.6 | 22.1 | 13.2 | 69.1 | 5.4 | 12.5 | 3.4 |
| dev | GPT L1 | 77.9 | 73.2 | 83.2 | 79.1 | 82.0 | 76.5 | 20.5 | 49.5 | 12.9 | 25.7 | 15.1 | 85.2 | 6.3 | 33.3 | 3.4 |
| dev | GPT-rnd L1+L2 | 75.5 | 77.6 | 73.6 | 72.4 | 89.8 | 60.6 | 12.9 | 55.4 | 7.3 | 15.5 | 9.6 | 40.8 | 0.0 | 0.0 | 0.0 |
| tst | GPT L1+L2 | 86.6 | 80.6 | 93.5 | 87.1 | 86.2 | 88.0 | 20.8 | 12.2 | 70.0 | 18.6 | 51.0 | 11.4 | 0.0 | 0.0 | 0.0 |

Table 4: Experimental results (%-ages) of various models on the classification tasks. L1 corresponds to the multitask classification model, L1+L2 the full multitask model, and *-rnd the full multitask but randomly initialized model. We bold the best results. For easier interpretation, we also report the %-age of instances in the positive class in the dev set.

4.1 Experimental setup

In this work, we build on the pretrained OpenAI-GPT model by Radford et al. (2018) as our encoder, which has yielded impressive classification and generation results Radford et al. (2018); Gabriel et al. (2019). This model is a uni-directional language model, which means encoded token representations are only conditioned on past tokens. OpenAI-GPT was trained on English fiction (Toronto Book Corpus; Zhu et al., 2015).

For baseline comparison, we consider a multitask classification-only model (L1). We also compare the full multitask model to a baseline generative inference model trained only on the language modelling loss (L2). Finally, we consider a model variant that uses a randomly initialized GPT model to observe the effect of pretraining.

4.2 Evaluation

We evaluate performance of our models in the following ways. For classification, we report precision, recall, and F1 scores of the positive class. Following previous generative inference work Sap et al. (2019b), we use automated metrics to evaluate model generations. We use BLEU-2 and Rouge-L scores to capture word overlap between the generated inference and the references, which captures quality of generation Galley et al. (2015); Hashimoto et al. (2019). We additionally compute word mover’s distance (WMD; Kusner et al., 2015), which uses distributed word representations to measure similarity between the generated and target text.
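For reference, precision, recall, and F1 of the positive class can be computed as in the sketch below. The function names and the convention of returning 0.0 when a denominator is zero are assumptions for illustration, not a description of the evaluation script used in the study.

```python
from typing import Sequence, Tuple


def harmonic_mean(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(
    gold: Sequence[int], pred: Sequence[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Scores for the positive class only, from parallel gold and predicted labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, harmonic_mean(precision, recall)


# Toy example: 2 true positives, 1 false positive, 1 false negative.
p, r, f1 = precision_recall_f1(gold=[1, 1, 1, 0, 0], pred=[1, 1, 0, 1, 0])

# Consistency check against a reported row of Table 4 (dev, offensive):
# precision 79.8 and recall 92.9 give an F1 of about 85.9.
table_f1 = harmonic_mean(79.8, 92.9)
```

Reporting only the positive class matters for the skewed categories: a model that rarely predicts "yes" can have high accuracy while its positive-class recall, and therefore its F1, stays low.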

4.3 Training details

As each post can contain multiple annotations, we define a training instance as containing one post-group-statement triple (along with the five categorical annotations). We then split our dataset into train/dev./test (75:12.5:12.5), ensuring that no post is present in multiple splits. For evaluation (dev., test), we combine the categorical variables by averaging, and compare the generated inferences (hypotheses) to all targeted groups and implied statements (references).
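Because several training instances share the same post, the split has to be made over posts rather than over instances. The sketch below shows one way to do this; the `post_id` key, the function name, the fixed seed, and applying the 75:12.5:12.5 ratio to the number of posts are assumptions for illustration.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_by_post(
    instances: Sequence[dict],
    ratios: Tuple[float, float, float] = (0.75, 0.125, 0.125),
    seed: int = 0,
) -> Dict[str, List[dict]]:
    """Assign every instance of a post to the same split (train, dev, or test)."""
    post_ids = sorted({inst["post_id"] for inst in instances})
    random.Random(seed).shuffle(post_ids)
    n_train = int(len(post_ids) * ratios[0])
    n_dev = int(len(post_ids) * ratios[1])
    assignment = {}
    for i, post_id in enumerate(post_ids):
        if i < n_train:
            assignment[post_id] = "train"
        elif i < n_train + n_dev:
            assignment[post_id] = "dev"
        else:
            assignment[post_id] = "test"
    splits: Dict[str, List[dict]] = {"train": [], "dev": [], "test": []}
    for inst in instances:
        splits[assignment[inst["post_id"]]].append(inst)
    return splits


# Toy data: 16 posts with 3 post-group-statement triples each.
toy = [{"post_id": p, "triple": t} for p in range(16) for t in range(3)]
splits = split_by_post(toy)
posts_in = {name: {inst["post_id"] for inst in part} for name, part in splits.items()}
```

Splitting at the instance level instead would let the same post appear in both training and evaluation data, which would inflate scores for any model that memorizes posts.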

All experiments are carried out using HuggingFace’s Transformers library.

We tune hyperparameters on the dev. set, and report performance for the best performing setting (according to average F1). We train or finetune our models using a batch size of 4, a learning rate of 5e-5 (with linear warm up), and consider training for


5 Results

| split | model | group targeted Bl | group targeted Rg-L | group targeted WMD | implied statement Bl | implied statement Rg-L | implied statement WMD |
|---|---|---|---|---|---|---|---|
| dev | GPT L1+L2 | 73.8 | 68.1 | 2.35 | 50.2 | 44.4 | 6.05 |
| dev | GPT L2 | 74.2 | 68.8 | 2.29 | 51.0 | 44.9 | 6.04 |
| dev | GPT-rnd L1+L2 | 76.5 | 66.8 | 2.15 | 48.2 | 41.2 | 6.12 |
| tst | GPT L1+L2 | 76.7 | 72.1 | 2.26 | 52.3 | 47.3 | 6.04 |

Table 5: Automatic evaluation of various models on the generation task. GPT L2 is the text-only model, GPT L1+L2 is the full multitask model. Bl: BLEU-2, Rg-L: Rouge-L, WMD: Word Mover’s Distance. Higher is better for BLEU and ROUGE scores, and lower is better for WMD.

post predicted group predicted implication reference groups reference implications
(a) All-female casts bother me but I have literally never noticed the hundreds of movies with 0 female characters I have already watched women women aren’t good actors; women can’t be in leading roles
(b) As expected, when the terrorist group Hamas won the election in Gaza it was the usual one man, one vote, one time, Islamist election. #Islam muslim folks muslims are terrorists muslim folks; arabic folks [”bad people.”, ”islam promotes controlling governments”, ”muslims are authoritarians”, ”not fair.”]
(c) ”Black guy in class: *attempts to throw a paper ball into the trash* *misses* Mrs. Mundy: ”You’re a disgrace to your race, Marcus” black folks black people are worthless black folks are good at basketball; black men are defined by athletic skill.; black men should be good at basketball.; do not take things seriously
(d) A Jew runs into a wall with a boner. What breaks first? His nose. jewish folks jewish people have big noses jewish folks jewish folks have large noses; jewish folks have the same features; jews are fun to joke about; makes fun of there looks; stereotype about there nose length; they have big noses
(e) I hate fat bitches overweight/fat folks fat people are unattractive overweight/fat folks are not pleasant people; fat folks are all the same; fat folks are less than others; not likable
Table 6: Examples of GPT L1+L2 model predictions. The model struggles to pick up on subtle biases (a), and tends to generate generic stereotypes rather than implications that are entailed by the post (b, c).

Our modelling results, listed in Tables 4 and 5, indicate that making inferences about social biases in language remains challenging for models.


Most notably for classification, the full multitask model outperforms other variants substantially when predicting a post’s offensiveness and intent to offend (+8% F1 on both). The classification-only model slightly outperforms the full multitask model on other categories. We hypothesize that correctly predicting those might require more lexical matching (e.g., detecting sexual words for the lewd category). In contrast, the offensiveness and intent gains from full multitasking suggest that for those more subtle semantic categories, more in-domain language model finetuning helps. Highly skewed categories pose a challenge for all models, due to the lack of positive instances. As expected, the randomly initialized model performs significantly worse than the pretrained version.


When we evaluate on our generation tasks, we find that model performance is comparable across automatic metrics between the full multitask variant (GPT L1+L2) and the free-text only generation model (GPT L2). Surprisingly, the randomly initialized multitask variant performs better on BLEU and WMD on the group target inference, which is likely due to the small and constrained generation space (there are only 1.1k different groups in our corpus; see Table 3). When the generation space is larger (for the implied statement), pretrained variants perform better.

Error analysis

Since small differences in automated evaluation metrics for text generation sometimes only weakly correlate with human judgements Liu et al. (2016), we manually perform an error analysis on a select set of generated dev examples from the full multitask model (Table 6). Overall, the model seems to struggle with generating textual implications that are relevant to the post, instead generating very generic stereotypes about the demographic groups (e.g., in examples b, c). The model generates the correct stereotypes when there is high lexical overlap with the post (e.g., examples d, e). This is in line with previous research showing that large language models rely on correlational patterns in data Sakaguchi et al. (2019).

6 Related Work

Bias and Toxicity Detection

Detection of hateful, abusive, or otherwise toxic language has received increased attention recently Schmidt and Wiegand (2017). Most dataset creation work has cast this detection problem as binary classification Waseem and Hovy (2016); Wulczyn et al. (2017); Davidson et al. (2017); Founta et al. (2018). Recently, Zampieri et al. (2019) collected a dataset of tweets with hierarchical categorical annotations of offensiveness and whether a group or individual is targeted. In contrast, Social Bias Frames covers both hierarchical categorical and free-text annotations.

Similar in spirit to our work, recent work has tackled more subtle bias in language, such as microaggressions Breitfeller et al. (2019) and condescension Wang and Potts (2019). These types of biases are in line with but more narrowly scoped than biases covered by Social Bias Frames.

Inference about Social Dynamics

Various work has tackled the task of making inferences about power and social dynamics. Particularly, previous work has analyzed power dynamics about specific entities, either in conversation settings (Prabhakaran et al., 2014; Danescu-Niculescu-Mizil et al., 2012) or in narrative text Sap et al. (2017); Field et al. (2019); Antoniak et al. (2019). Additionally, recent work in commonsense inference has focused on mental states of participants of a situation (e.g., Rashkin et al., 2018; Sap et al., 2019b). In contrast to reasoning about particular individuals, our work focuses on biased implications of social and demographic groups as a whole.

7 Ethical considerations

Risks in deployment

Determining offensiveness and reasoning about harmful implications of language should be done with care. When deploying such algorithms, several ethical aspects should be considered including the fairness of the model on speech by different demographic groups or in different varieties of English Mitchell et al. (2019). Additionally, practitioners should discuss potential nefarious side effects of deploying such technology, such as censorship Ullmann and Tomalin (2019) and dialect-based racial bias Sap et al. (2019a); Davidson et al. (2019). Finally, inferences about offensiveness could be paired with promotions of positive online interactions, such as emphasis of community standards Does et al. (2011) or counter-speech Chung et al. (2019); Qian et al. (2019).

Risks in annotation

Recent work has highlighted various negative side effects caused by annotating potentially abusive or harmful content (e.g., acute stress; Roberts, 2016). We mitigate these by limiting the number of posts that one worker can annotate in one day, paying workers above minimum wage ($7-$12), and providing crisis management resources to our annotators (we direct workers to the Crisis Text Line).

8 Conclusion

To help machines reason about and account for societal biases, we introduce Social Bias Frames, a new structured commonsense formalism that distills knowledge about the biased implications of language. Our frames combine categorical knowledge about the offensiveness, intent, and targets of statements, as well as free-text inferences about which groups are targeted and biased implications or stereotypes. We collect a new dataset of 100k annotations on social media posts using a novel crowdsourcing framework. We establish baseline performance of models built on top of a large pretrained language model. We show that while classifying the intent or offensiveness of statements is easier, models struggle to generate relevant inferences about social biases, especially when implications have low lexical overlap with posts. This indicates that more sophisticated models are required for Social Bias Frames inferences.

9 Acknowledgements

Authors would like to thank Hannah Rashkin and Lucy Lin for their helpful comments on the paper. This research was supported in part by NSF (IIS-1524371, IIS-1714566), DARPA under the CwC program through the ARO (W911NF-15-1-0543), DARPA under the MCS program through NIWC Pacific (N66001-19-2-4031), and Samsung Research.

References


  • Antoniak et al. (2019) Maria Antoniak, David Mimno, and Karen Levy. 2019. Narrative paths and negotiation of power in birth stories. In CSCW.
  • Bicknell (2007) Jeanette Bicknell. 2007. What is offensive about offensive jokes? Philosophy Today, 51(4):458–465.
  • Bosselut et al. (2019) Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. Comet: Commonsense transformers for automatic knowledge graph construction. In ACL.
  • Breitfeller et al. (2019) Luke M Breitfeller, Emily Ahn, David Jurgens, and Yulia Tsvetkov. 2019. Finding microaggressions in the wild: A case for locating elusive phenomena in social media posts. In EMNLP.
  • Bussone et al. (2015) Adrian Bussone, Simone Stumpf, and Dympna O’Sullivan. 2015. The role of explanations on trust and reliance in clinical decision support systems. In 2015 International Conference on Healthcare Informatics, pages 160–169. IEEE.
  • Chung et al. (2019) Yi-Ling Chung, Elizaveta Kuzmenko, Serra Sinem Tekiroglu, and Marco Guerini. 2019. CONAN - COunter NArratives through nichesourcing: a multilingual dataset of responses to fight online hate speech. In ACL, pages 2819–2829, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Cohen-Almagor (2014) Raphael Cohen-Almagor. 2014. Countering hate on the internet. Annual review of law and ethics, 22:431–443.
  • Danescu-Niculescu-Mizil et al. (2012) Cristian Danescu-Niculescu-Mizil, Lillian Lee, Bo Pang, and Jon Kleinberg. 2012. Echoes of power: language effects and power differences in social interaction. In WWW, page 699, New York, New York, USA. ACM Press.
  • Davidson et al. (2019) Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. 2019. Racial bias in hate speech and abusive language detection datasets. In Abusive Language Workshop.
  • Davidson et al. (2017) Thomas Davidson, Dana Warmsley, Michael W Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In ICWSM.
  • Does et al. (2011) Serena Does, Belle Derks, and Naomi Ellemers. 2011. Thou shalt not discriminate: How emphasizing moral ideals rather than obligations increases whites’ support for social equality. J. Exp. Soc. Psychol., 47(3):562–571.
  • Dynel (2015) Marta Dynel. 2015. The landscape of impoliteness research. Journal of Politeness Research, 11(2):383.
  • Field et al. (2019) Anjalie Field, Gayatri Bhat, and Yulia Tsvetkov. 2019. Contextual affective analysis: A case study of people portrayals in online #MeToo stories. In ICWSM, volume 13, pages 158–169.
  • Fiske (1993) S T Fiske. 1993. Controlling other people. the impact of power on stereotyping. Am. Psychol., 48(6):621–628.
  • Founta et al. (2018) Antigoni-Maria Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, and Nicolas Kourtellis. 2018. Large scale crowdsourcing and characterization of twitter abusive behavior. In ICWSM.
  • Gabriel et al. (2019) Saadia Gabriel, Antoine Bosselut, Ari Holtzman, Kyle Lo, Asli Çelikyilmaz, and Yejin Choi. 2019. Cooperative generator-discriminator networks for abstractive summarization with narrative flow. ArXiv, abs/1907.01272.
  • Galley et al. (2015) Michel Galley, Chris Brockett, Alessandro Sordoni, Yangfeng Ji, Michael Auli, Chris Quirk, Margaret Mitchell, Jianfeng Gao, and William B. Dolan. 2015. deltableu: A discriminative metric for generation tasks with intrinsically diverse targets. In ACL.
  • Greengross and Miller (2008) Gil Greengross and Geoffrey F Miller. 2008. Dissing oneself versus dissing rivals: Effects of status, personality, and sex on the Short-Term and Long-Term attractiveness of Self-Deprecating and Other-Deprecating humor. Evol. Psychol., 6(3):147470490800600303.
  • Hashimoto et al. (2019) Tatsunori B. Hashimoto, Hugh Zhang, and Percy Liang. 2019. Unifying human and statistical evaluation for natural language generation. In NAACL-HLT.
  • Hearst (1992) Marti A Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In ACL, COLING ’92, pages 539–545, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Kasper (1990) Gabriele Kasper. 1990. Linguistic politeness: Current research issues. J. Pragmat., 14(2):193–218.
  • Kulesza et al. (2012) Todd Kulesza, Simone Stumpf, Margaret Burnett, and Irwin Kwan. 2012. Tell me more?: the effects of mental model soundness on personalizing an intelligent agent. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1–10. ACM.
  • Kusner et al. (2015) Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In ICML, pages 957–966.
  • Liu et al. (2016) Chia-Wei Liu, Ryan Lowe, Iulian V Serban, Michael Noseworthy, Laurent Charlin, and Joelle Pineau. 2016. How NOT to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation. In ACL.
  • Mitchell et al. (2019) Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model cards for model reporting. In FAT.
  • Nguyen et al. (2018) An T Nguyen, Aditya Kharosekar, Saumyaa Krishnan, Siddhesh Krishnan, Elizabeth Tate, Byron C Wallace, and Matthew Lease. 2018. Believe it or not: Designing a human-ai partnership for mixed-initiative fact-checking. In The 31st Annual ACM Symposium on User Interface Software and Technology, pages 189–199. ACM.
  • Pereira et al. (2016) Gonçalo Pereira, Rui Prada, and Pedro A Santos. 2016. Integrating social power into the decision-making of cognitive agents. Artif. Intell., 241:1–44.
  • Prabhakaran et al. (2014) Vinodkumar Prabhakaran and Owen Rambow. 2014. Predicting power relations between participants in written dialog from a single thread. In ACL.
  • Qian et al. (2019) Jing Qian, Anna Bethke, Yinyin Liu, Elizabeth Belding, and William Yang Wang. 2019. A benchmark dataset for learning to intervene in online hate speech. In EMNLP.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.
  • Rashkin et al. (2018) Hannah Rashkin, Maarten Sap, Emily Allaway, Noah A. Smith, and Yejin Choi. 2018. Event2mind: Commonsense inference on events, intents, and reactions. In ACL.
  • Roberts (2016) Sarah T Roberts. 2016. Commercial content moderation: Digital laborers’ dirty work. In Safiya Umoja Noble and Brendesha M Tynes, editors, The Intersectional Internet: Race, Sex, Class and Culture Online, Media Studies Publications. Peter Lang Publishing.
  • Ross et al. (2017) Björn Ross, Michael Rist, Guillermo Carbonell, Benjamin Cabrera, Nils Kurowsky, and Michael Wojatzki. 2017. Measuring the reliability of hate speech annotations: The case of the european refugee crisis. In NLP 4 CMC Workshop.
  • RWJF (2017) RWJF. 2017. Discrimination in america: Experiences and views. Accessed: 2019-11-5.
  • Sakaguchi et al. (2019) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Winogrande: An adversarial winograd schema challenge at scale. ArXiv, abs/1907.10641.
  • Sap et al. (2019a) Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A Smith. 2019a. The risk of racial bias in hate speech detection. In ACL.
  • Sap et al. (2019b) Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A Smith, and Yejin Choi. 2019b. Atomic: An atlas of machine commonsense for if-then reasoning. In AAAI.
  • Sap et al. (2017) Maarten Sap, Marcella Cindy Prasetio, Ari Holtzman, Hannah Rashkin, and Yejin Choi. 2017. Connotation frames of power and agency in modern films. In EMNLP.
  • Schmidt and Wiegand (2017) Anna Schmidt and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Workshop on NLP for Social Media.
  • Speer and Havasi (2012) Robyn Speer and Catherine Havasi. 2012. Representing general relational knowledge in conceptnet 5. In LREC.
  • Strub (2008) Whitney Strub. 2008. The clearly obscene and the queerly obscene: Heteronormativity and obscenity in cold war los angeles. Am. Q., 60(2):373–398.
  • Ullmann and Tomalin (2019) Stefanie Ullmann and Marcus Tomalin. 2019. Quarantining online hate speech: technical and ethical perspectives. Ethics Inf. Technol.
  • Vincent (2016) James Vincent. 2016. Twitter taught microsoft’s AI chatbot to be a racist asshole in less than a day. Accessed: 2019-10-26.
  • Wang and Potts (2019) Zijian Wang and Christopher Potts. 2019. TalkDown: A corpus for condescension detection in context. In EMNLP.
  • Waseem and Hovy (2016) Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? predictive features for hate speech detection on twitter. In NAACL Student Research Workshop.
  • Wulczyn et al. (2017) Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex machina: Personal attacks seen at scale. In WWW.
  • Zampieri et al. (2019) Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019. Predicting the type and target of offensive posts in social media. In NAACL.
  • Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan R. Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 19–27.