"Unsex me here": Revisiting Sexism Detection Using Psychological Scales and Adversarial Samples

04/27/2020 · by Mattia Samory, et al.

To effectively tackle sexism online, research has focused on automated methods for detecting sexism. In this paper, we use items from psychological scales and adversarial sample generation to 1) provide a codebook for different types of sexism in theory-driven scales and in social media text; 2) test the performance of different sexism detection methods across multiple data sets; 3) provide an overview of strategies employed by humans to remove sexism through minimal changes. Results highlight that current methods seem inadequate in detecting all but the most blatant forms of sexism and do not generalize well to out-of-domain examples. By providing a scale-based codebook for sexism and insights into what makes a statement sexist, we hope to contribute to the development of better and broader models for sexism detection, including reflections on theory-driven approaches to data collection.




1 Introduction

Sexism is a complex phenomenon broadly defined as “prejudice, stereotyping, or discrimination, typically against women, on the basis of sex” (Oxford English Dictionary). Not exclusively, but distinctively, sexist expressions run rampant in online interactions: one in ten U.S. adults reported being harassed because of their gender (https://www.pewresearch.org/internet/2017/07/11/online-harassment-2017/), and being the target of sexism can have a measurable negative impact [Swim et al.2001]. Given the scale, reach, and influence of online platforms, detecting sexism at scale is crucial to ensure a fair and inclusive online environment (https://www.theguardian.com/commentisfree/2015/dec/16/online-sexism-social-media-debate-abuse).

Therefore, the research community has been actively developing machine learning approaches to automatically detect sexism in online interactions. Such approaches, on the one hand, provide the foundations for building automated tools to assist humans in content moderation. On the other hand, computational approaches allow at-scale understanding of the properties of sexist language.

While detecting stark examples of sexism seems relatively straightforward, operationalizing and measuring the construct of sexism in its nuances has proven difficult in the past. (Constructs are abstract “elements of information” [Groves et al.2011] that a scientist attempts to quantify through definition and translation into measurement, followed by recording signals through an instrument and, finally, analysis of the collected signals. Constructs may have multiple sub-constructs or dimensions. For example, according to Ambivalent Sexism theory [Glick and Fiske1996], sexism can have two dimensions: benevolent and hostile sexism.)

Previous work in automated sexism detection focused on specific aspects of sexism, such as hate speech towards gender identity or sex [Waseem and Hovy2016, Garibo i Orts2019], misogyny [Anzovino, Fersini, and Rosso2018], benevolent vs. hostile sexism [Jha and Mamidi2017], gender-neutral vs. -biased language [Menegatti and Rubini2017], gender-based violence [ElSherief, Belding, and Nguyen2017], and directed vs. generalized and explicit vs. implicit abuse [Waseem et al.2017]. The multiple definitions of sexism applied in this related research – sometimes referring to a sub-dimension of the broader construct – make it difficult to compare the proposed methods. In particular, a lack of definitional clarity about what aspect a method aims to concretely measure with respect to theory, together with ad-hoc operationalizations, may cause severe measurement errors in sexism detection models. With few exceptions (e.g., [Jha and Mamidi2017]), previous work either neglects or retrofits the link between sexism as a theoretical construct and the data used to train the machine learning models. This impedes assessing which aspects of sexism are contained in the training data, and consequently raises issues of construct validity [Sen et al.2019, Salganik2017]. Finally, there have been no detailed investigations into models capturing spurious artifacts of the datasets they were trained on, instead of picking up on more essential syntactic or semantic traits of sexism. This ultimately may give rise to issues of reliability [Lazer2015, Salganik2017], particularly parallel-forms reliability, which assesses the degree to which measurements generalize to variations [Knapp and Mueller2010].

In this work we aim to improve construct validity and reliability in sexism detection. To achieve this, we build upon the literature in social psychology to derive a comprehensive codebook for sexism (see table 2), covering a wide variety of dimensions of sexism and related constructs (see table 1). The primary purpose of this codebook is to annotate previously-used as well as novel datasets of online messages in a unified, scale-based and more nuanced fashion; and to propose our own gold-standard dataset of scale items. In social psychology, constructs are mostly measured through the individual application of meticulously constructed scales that ensure construct validity [Clark and Watson1995]. For this reason, we argue that scale items are an excellent benchmark to assess whether machine learning approaches are capturing the constructs they are intended to capture.

Moreover, we draw inspiration from recent advances in machine learning and leverage the input of a large set of crowdworkers to generate adversarial examples. Adversarial examples have been gaining traction for testing the reliability of a wide range of models, especially those considered to have a “black-box” nature [Ilyas et al.2019, Kaushik, Hovy, and Lipton2019]. (Adversarial examples can also be ‘counterfactual’, where minimal edits are made to flip the label of a datapoint.)

In particular, we introduce an approach to generate hard-to-classify examples which have minimal lexical differences from sexist messages, but that are not sexist. This allows us to assess the reliability of existing models.

Our results show that state-of-the-art models fail to generalize beyond their training domain and to broader conceptualizations of sexism. Even in their training domain, the performance of the models severely drops when faced with adversarial examples. Through an analysis of the errors that the machine learning models make, and of the strategies employed by humans to generate adversarial examples, we detail insights on how to build better sexism detection models.

2 Data and Methods

In this section we present our approach to assessing and improving the validity and reliability of automated methods that aim to detect sexism. We first describe how we manually collected, pruned, and verified items from psychological scales (cf. table 1) designed to measure sexism and related constructs. Through a qualitative review of those scales, we developed a sexism codebook. We use the codebook, together with an adversarial example generation process, to annotate and expand several existing and novel datasets. We conclude by detailing our benchmark of sexism detection methods.

2.1 Survey Scales for Measuring Sexism

First, we manually collected survey scales that were explicitly designed to measure the construct of sexism or are frequently used to measure sexism in the social psychology literature (see first part of table 1). In an additional step, we expanded our literature search to also include scales that (i) were mentioned in the original selection of sexism scales as related work or (ii) were designed to measure related constructs such as general attitudes towards men or women, egalitarianism, gender and sex role beliefs, stereotypical beliefs about men or women, attitudes towards feminism or gendered norms (cf. table 1). To be included in our list, the scales had to be publicly available and in English. In addition, items had to be formulated as a subjective, full sentence statement that participants could (dis)agree with.

As we were aiming for a scale dataset that is as broad and generalizable as possible with respect to gender, cultural, and religious context, we removed items that refer to specific countries (e.g., “It is easy to understand the anger of women’s groups in America”), items that were overly specific to a western cultural or religious context (e.g., “It is insulting to women to have the “obey” clause remain in the marriage service”), items that only apply to one gender or where the gender of the reader influences the perspective on the item (e.g., “I don’t think I should work after my husband is employed full-time”), and items with an overly sexual context (e.g., “Without an erection, a man is sexually lost”). Due to these criteria, we removed 105 out of 941 items (11%). An additional 72 items (8%) were removed because they were exact or almost exact duplicates of other scale items. In total, we were thus left with 764 items. Of those items, 536 (70%) measure the construct directly (i.e., via a sexist statement) and 228 (30%) via a reverse-coded, non-sexist statement (cf. table 3).

We confirmed that the selected items are indeed perceived as sexist by annotating each scale item as either “sexist” or “non-sexist” using five crowdworkers (cf. section 2.4 for details). Overall, the majority verdict of MTurkers corresponded with the ground-truth label from the scales in 743 out of 764 cases (97.25%), showcasing that our selection indeed constitutes an accurately labelled dataset of sexist and non-sexist statements. (The 21 items (2.75%) where the majority verdict did not correspond to the ground-truth labels largely comprised items constituting benevolent sexism against women or sexism against men. See also the complementary material [removed for blind review].)
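The majority-verdict check described above can be sketched in plain Python (function names and the toy items below are our own illustration, not the paper's code):

```python
from collections import Counter

def majority_label(annotations):
    """Return the label chosen by most of the five annotators."""
    return Counter(annotations).most_common(1)[0][0]

def agreement_with_ground_truth(items):
    """items: list of (annotation_list, ground_truth_label) pairs."""
    hits = sum(majority_label(anns) == truth for anns, truth in items)
    return hits / len(items)

# Toy example with three scale items, five annotations each:
items = [
    (["sexist"] * 4 + ["non-sexist"], "sexist"),
    (["non-sexist"] * 5, "non-sexist"),
    (["sexist", "sexist", "non-sexist", "non-sexist", "non-sexist"], "sexist"),
]
print(agreement_with_ground_truth(items))  # 2 of 3 majorities match
```

On the real data, the same computation over all 764 scale items yields the 97.25% agreement reported above.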

We manually collected, annotated, and verified items from psychological scales to (i) define a test dataset that allows us to assess the validity of automated methods that aim to detect sexism (cf. section 2.3) and (ii) derive a codebook for sexism that can be used to annotate, e.g., social media data (cf. section 2.2). Table 1 gives an overview of all scales we used.

Scale: Construct
Initial Selection:
- Attitudes towards Women Scale [Spence and Helmreich1972]: Attitudes towards the role of women in society
- Sex Role Scale [Rombough and Ventimiglia1981]: Attitudes towards sex roles
- Modern Sexism Scale [Swim et al.1995]: Old-fashioned and modern sexism
- Neosexism Scale [Tougas et al.1995]: Egalitarian values vs. negative feelings towards women
- Ambivalent Sexism Inventory [Glick and Fiske1996]: Hostile & benevolent sexism
- Sex-Role Egalitarianism Scale [King and King1997]: Egalitarian sex-role attitudes (example items)
- Gender-Role Attitudes Scale [García-Cueto et al.2015]: Attitudes towards gender roles
Additional Scales:
- Belief-pattern Scale for Attitudes towards Feminism [Kirkpatrick1936]: Attitudes towards feminism
- Role Conception Inventory [Bender Motz1952]: Subjectively conceived husband and wife roles
- Traditional Family Ideology Scale [Levinson and Huffman1955]: Ideological orientations towards family structure
- Authoritarian Attitudes Towards Women Scale [Nadler and Morrow1959]: Authoritarian attitudes towards women
- Sex-Role Ideology Questionnaire [Mason and Bumpass1975]: Women’s sex-role ideology
- Short Attitudes towards Feminism Scale [Smith, Ferree, and Miller1975]: Acceptance or rejection of central beliefs of feminism
- Sex-Role Orientation Scale [Brogan and Kutner1976]: Normative appropriateness of gendered behavior
- Sex-Role Survey [MacDonald1976]: Support for equality between the sexes
- The Macho Scale [Villemez and Touhey1977]: Expressions of sexist and egalitarian beliefs
- Sex-Role Ideology Scale [Kalin and Tilby1978]: Traditional vs. feminist sex-role ideology
- Sexist Attitudes towards Women Scale [Benson and Vincent1980]: Seven components of sexism towards women
- Index of Sex-Role Orientation Scale [Dreyer, Woods, and James1981]: Women’s sex-role orientation
- Traditional-Liberated Content Scale [Fiebert1983]: Traditional and liberated male attitudes towards men
- Beliefs about Women Scale [Belk and Snell1986]: Stereotypic beliefs about women
- Attitudes towards Sex-Roles Scale [Larsen and Long1988]: Attitudes towards egalitarian vs. traditional sex roles
- Male Role Norm Inventory [Levant and Hirsch1992]: Norms for the male sex role
- Attitudes towards Feminism & Women’s Movement Scale [Fassinger1994]: Affective attitudes toward the feminist movement
- Male Role Attitude Scale [Pleck, Sonenstein, and Ku1994]: Attitudes towards male gender roles in adolescent men
- Gender Attitudes Inventory [Ashmore, Del Boca, and Bilder1995]: Multiple dimensions of gender attitudes
- Gender-Role Belief Scale [Kerr and Holden1996]: Self-report measure of gender-role ideology
- Stereotypes About Male Sexuality Scale [Snell, Belk, and Hawkins1998]: Stereotypes about male sexuality
- Ambivalence toward Men Inventory [Glick and Fiske1999]: Hostile and benevolent stereotypes towards men
Table 1: Overview of psychological scales measuring sexism and related constructs. The top seven scales represent our initial selection, that we derived our codebook from (cf. section 2.2). Most scales have multiple subscales that all contribute to operationalizing the respective construct. Items from all scales were selected based on several criteria and the selected ones tested to be perceived as sexist by crowdworkers (cf. section 2.1). More extensive information about the scales, the items, and our annotations can be found in the supplementary online material [removed for blind review].
Behavioral Expectations: Items formulating a prescriptive set of behaviors or qualities that women (and men) are supposed to exhibit in order to conform to traditional gender roles. Scale item example: “A woman should be careful not to appear smarter than the man she is dating.” Tweet example: “Girls shouldn’t be allowed to be commentators for football games”

Stereotypes & Comparisons: Items formulating a descriptive set of properties that supposedly differentiates men and women. Those supposed differences are expressed through explicit comparisons and stereotypes. Scale item example: “Men are better leaders than women.” Tweet example: “*yawn* Im sorry but women cannot drive, call me sexist or whatever but it is true.”

Endorsements of Inequality: Items acknowledging inequalities between men and women but justifying or endorsing these inequalities. Scale item example: “There are many jobs in which men should be given preference over women in being hired or promoted.” Tweet example: “I think the whole equality thing is getting out of hand. We are different, thats how were made!”

Denying Inequality & Rejection of Feminism: Items stating that there are no inequalities between men and women (any more) and/or opposing feminism. Scale item example: “Many women seek special favors, such as hiring policies that favor them over men, under the guise of asking for ’equality’.” Tweet example: “OK. Whew, that’s good. Get a real degree and forget this poison of victimhood known as feminism.”

Table 2: Semantic Sexism Codebook: we developed this annotation schema, which captures semantic dimensions of sexism, by manually inspecting items from multiple sexism scales. Each entry gives the category, its definition, a scale item example, and a tweet example. Note that messages can also be sexist because of style rather than content, as discussed in section 2.2. All examples of tweets have been editorialized to preserve the privacy of their authors.

2.2 Sexism Codebook

To develop a finer-grained picture of what makes a statement sexist, we systematically reviewed the items of our initial selection of survey scales (see top part of table 1). We determined four major, non-overlapping and coherent content themes and defined them in a codebook (cf. table 2). Note that the determination of these categories is based on multiple scales that were designed to capture different aspects of sexism or closely related constructs at different points in time and tested on different audiences. This enables us to cover a wide range of aspects of sexism, based on established measurement tools.

While constructing the codebook, and to assess its applicability beyond scale items, we used it to annotate a random sample of tweets that previous researchers labeled as sexist [Jha and Mamidi2017, Waseem and Hovy2016]. Tweets that we could not map to any of the codebook categories received a binary label: sexist or non-sexist. The coding process made salient that a message can be sexist not only because of what is said (cf. our four content categories). Often, how it is said (phrasing) can equally convey sexist intent. Only the first type of sexism is covered by scale items, because they are formulated neutrally and do not contain profanity. To account for purely stylistic markers that would appear in online messages in the wild, we added the following phrasing-based categories to our codebook (be advised that these examples contain offensive language):

Uncivil and sexist:

attacks, foul language, or derogatory depictions directed towards individuals because of their sex. The messages humiliate individuals because of their sex or gender identity, for example by means of name-calling (e.g., “…bitch”), attacks (e.g., “I’m going to kick her …”), and inflammatory messages (e.g., “Burn all women!”).

Uncivil but non-sexist:

messages that are offensive but not because of the target’s sex or gender (e.g.“Are you fucking stupid?”).

Civil and sexist:

neutral phrasing that does not contain offenses or slurs: messages without clear incivility that still convey a sexist message through their content (e.g., “In my opinion, women are just inferior to men”).

Civil and non-sexist:

neutral phrasing that does not contain offenses or slurs and no sexist content (e.g. “I love pizza but I hate Hotdogs”).

For sexism detection, we consider a message as sexist if the content and/or the phrasing of the message is sexist.

2.3 Datasets

Next, we describe the different data sets that we use for annotation, adversarial sample generation, and ultimately sexism detection.

Sexism Scales (Abbreviation: s)

This set contains the scale items described in section 2.1.

Hostile Sexism Tweets (Abbreviation: h)

Waseem and Hovy (2016) used various self-defined keywords to collect tweets that are potentially sexist or racist, filtering the Twitter stream for two months. The two authors labelled this data with the help of one outside annotator. In our work we use the portion of the dataset labeled as sexist, and denote it hostile, since, according to Jha and Mamidi (2017), it contains examples of hostile sexism.

Other (non-sexist) Tweets (Abbreviation: o)

Waseem and Hovy (2016) also annotated tweets that contained neither sexism nor racism. We denote this dataset as other.

Benevolent Sexism Tweets (Abbreviation: b)

Jha and Mamidi (2017) extended Waseem and Hovy’s (2016) dataset to include instances of benevolent sexism: subjectively positive statements that imply that women should receive special treatment and protection from men, and thus further the stereotype of women as less capable [Glick and Fiske1996]. The authors collected data using terms, phrases, and hashtags that are “generally used when exhibiting benevolent sexism.” They then asked three external annotators to cross-validate the tweets to remove annotator bias.

Call Me Sexist Tweets (Abbreviation: c)

To add another social media dataset from a different time period using a different data collection strategy, we gathered data via the Twitter Search API with the phrase “call me sexist but.” The resulting data spans from 2008 to 2019. For annotation via crowdsourcing, we ran a pilot study and found a statistically significant priming effect of the “call me sexist but” introduction on the annotators: if interpreted as a disclaimer, annotators would assume that whatever follows is sexist by default, more so than if the phrase were not there. For this reason, for all annotation tasks producing the output for our eventual analysis, we removed the phrase and labelled only the remainder of each tweet (e.g., “Call me sexist, but please tell me why all women suck at driving.” becomes “please tell me why all women suck at driving”).
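Stripping the collection phrase can be sketched with a small regular expression. The exact pattern the authors used is not given; this is an illustrative reconstruction that tolerates the optional comma and case differences:

```python
import re

# Match the collection phrase, with optional comma, at the start of a tweet.
PREFIX = re.compile(r"^\s*call me sexist,?\s*but\s*", flags=re.IGNORECASE)

def strip_prefix(tweet: str) -> str:
    """Remove the 'call me sexist but' introduction, keeping the remainder."""
    return PREFIX.sub("", tweet).strip()

print(strip_prefix("Call me sexist, but please tell me why all women suck at driving."))
# -> "please tell me why all women suck at driving."
```

Tweets without the prefix pass through unchanged, so the same function can be applied uniformly to the whole dataset.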

2.4 Annotation and Adversarial Examples

We rely on existing Twitter datasets that were annotated for specific aspects of sexism. To bring them all under a well-defined and consistent scheme we re-annotated them, and the two datasets that we introduce, using our sexism codebook (cf. section 2.2).

Then, we challenged crowdworkers to remove as much of the sexism as possible from the sexist examples, so as to produce adversarial examples which have minimal lexical changes, but that are non-sexist (e.g. “Women should not be allowed to drive trucks hauling a trailer” was modified into “People without proper license should not be allowed to drive trucks hauling a trailer”). We describe the two crowdsourcing tasks next.

For both tasks, we recruited annotators from Mechanical Turk (MTurkers). We only accepted annotators located in the US, with over 10,000 HITs approved and over a 99% HIT approval rate. Workers also had to pass a qualification test that ensured that they understood the construct of sexism as defined by our codebook. To pass the test they had to correctly annotate 4 out of 5 ground-truth sentences. (The test sentences are contained in the supplementary online material [removed for blind review].)

Sexism Annotation Crowdsourcing Task

We asked MTurkers to annotate whether a sentence is sexist because of its content (opinions and beliefs expressed by the speaker) or because of its phrasing (the speaker’s choice of words). We referred them to our codebook, together with explanations and examples for each encoding. We paid MTurkers 6 cents per annotation, resulting in a fair hourly wage [Salehi et al.2015]. We ran the task on a sample of each dataset (cf. table 3). Five raters annotated each sentence. We marked sentences as sexist if at least 3 raters found them sexist, because of either content or style (inter-rater agreement measured with Randolph’s kappa).
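The agreement statistic referenced above, Randolph's free-marginal multirater kappa, can be computed as follows. This is a minimal sketch from the standard definition, not the authors' code:

```python
def randolph_kappa(ratings, n_categories=2):
    """Randolph's free-marginal multirater kappa.

    ratings: list of per-item label lists, all of the same length
    (here: 5 annotators, binary sexist / non-sexist labels).
    """
    n = len(ratings[0])  # raters per item

    def item_agreement(labels):
        counts = {}
        for lab in labels:
            counts[lab] = counts.get(lab, 0) + 1
        # Proportion of agreeing rater pairs on this item.
        return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))

    p_o = sum(item_agreement(r) for r in ratings) / len(ratings)
    p_e = 1.0 / n_categories  # chance agreement under free marginals
    return (p_o - p_e) / (1 - p_e)

# Perfect agreement on every item gives kappa = 1:
print(randolph_kappa([["sexist"] * 5, ["non-sexist"] * 5]))  # -> 1.0
```

Unlike Fleiss' kappa, the free-marginal variant does not penalize skewed label distributions, which suits tasks where annotators are not constrained to produce a fixed proportion of each label.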

Adversarial Example Generation Crowdsourcing Task

Adversarial examples are inputs to machine learning models that are intentionally designed to cause the model to make a mistake, as they are hardest to distinguish from their counterparts. In the task, MTurkers were presented with sexist messages and were instructed to minimally modify each tweet so that the message is (a) no longer sexist and (b) still coherent. For example, the message “women are usually not as smart as men” could be amended to “women are rarely not as smart as men” or “women are usually not as tall as men”. This task allows us to produce adversarial pairs of texts which are lexically similar but carry opposite labels for sexism. We paid MTurkers 20 cents per modification. We screened them for task understanding after showing them examples and guidelines on how to produce meaningful, non-sexist modifications. We asked them not to modify words needlessly (e.g., to leave extra non-sexist sentences untouched), unless crucial to make the message coherent. MTurkers produced one or more modifications for each sexist message (cf. table 3). As for the original tweets, we also annotated the modifications for sexism; we discarded all those which were judged sexist.

dataset      #      w/annot.  sexist  w/mod.  #mod.
benevolent   678    678       183     183     396
hostile      2866   678       278     275     567
other        7960   678       8       8       21
callme       3826   1280      773     765     1134
scales       764    764       536     72      129
Table 3: From each dataset we annotate a sub-sample with our sexism codebook (w/annot.). The sexist samples from the annotated corpora are then modified. The column w/mod. indicates for how many originally sexist samples we were able to obtain modifications, while #mod. reveals how many modifications were produced in total.

2.5 Experimental Setup for Sexism Detection

We use the annotated tweets and their adversarial counterparts for training and testing sexism detection models. We now detail the experimental setup for the task, before introducing the sexism detection models.

Experimental Setup

We focus on four dataset combinations. 1) We replicate the setup of [Jha and Mamidi2017] by combining the benevolent, hostile, and other datasets (collectively called bho). (We do not exactly replicate Jha and Mamidi’s (2017) findings since we re-annotate the data.) 2) We reproduce the setup using the similar, novel dataset of callme tweets (c). 3) We use the scales (s) dataset as a gold-standard test for sexism. 4) To measure the effects of incorporating multiple aspects of sexism, we combine all of the previous datasets into an omnibus dataset (bhocs). We balance classes for training and testing.

To see if modifications aid or hinder sexism detection, we create two types of training data. First we consider the original Twitter datasets bho, c, and bhocs as labeled by the MTurkers. Second, we modify the datasets by injecting adversarial examples (while maintaining equal size). In particular, we sample a fraction of the sexist examples, retrieve their modified non-sexist versions, and discard a corresponding number of non-sexist examples from the original datasets. We inject adversarial examples so that they make up half of the non-sexist class: this way we ensure that models do not learn artifacts only present in the modifications (such as specific wording used by the MTurkers). We call these modified training sets bho-M, c-M, and bhocs-M respectively.
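The injection procedure can be sketched as follows. Function names and data structures are our own illustration of the sampling logic described above, not the authors' implementation:

```python
import random

def inject_adversarial(sexist, non_sexist, modifications, seed=0):
    """Build a training set where adversarial modifications make up half of
    the non-sexist class, keeping overall size and class balance unchanged.

    modifications: dict mapping a sexist text to its non-sexist rewrite.
    Returns (text, label) pairs with label 1 = sexist, 0 = non-sexist.
    """
    rng = random.Random(seed)
    n_inject = len(non_sexist) // 2
    # Sample sexist examples that have a modification, take their rewrites...
    sources = rng.sample([s for s in sexist if s in modifications], n_inject)
    injected = [modifications[s] for s in sources]
    # ...and discard a corresponding number of original non-sexist examples.
    kept = rng.sample(non_sexist, len(non_sexist) - n_inject)
    return [(s, 1) for s in sexist] + [(t, 0) for t in kept + injected]

sexist = ["tweet a", "tweet b"]
non_sexist = ["tweet c", "tweet d"]
mods = {"tweet a": "tweet a'", "tweet b": "tweet b'"}
data = inject_adversarial(sexist, non_sexist, mods)
print(len(data))  # 4: size and class balance are unchanged
```

Capping the injected rewrites at half of the non-sexist class, as in the text, keeps models from learning artifacts specific to the crowdworkers' wording.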

We construct the bho, c, and bhocs test sets analogously; the only difference is that in the bho-M, c-M, and bhocs-M test sets we substitute the entire non-sexist class with adversarial examples. In addition to the tweet datasets, we also test on the scales (s) dataset. To assess the validity of the sexism detection models, we iteratively sample, shuffle, and split the datasets 5 times in 70%/30% train/test proportions.

Next, we introduce the models that we use for sexism detection. We rely on two baselines (gender-word and toxicity) and three models: logit, CNN, and BERT.

Gender-Word Baseline

We use the gender-definition word lists from [Zhao et al.2018] to create a simple baseline. The lists contain words that are associated with gender by definition (like “mother” and “waitress”). (The word lists can be found at https://github.com/uclanlp/gn_glove/tree/master/wordlist.) Our baseline labels a sentence or tweet as sexist if it includes at least one gender-definition word. We opt for this method since, intuitively, sexist statements have a high probability of containing gender-definition words because of their topical and stereotypical nature. However, the opposite is not true: the presence of gender-related words does not imply sexism.
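A minimal sketch of this baseline, with a tiny stand-in word list rather than the full gender-definition lists of Zhao et al. (2018):

```python
# Tiny illustrative subset; the real baseline uses the full gn_glove lists.
GENDER_WORDS = {"woman", "women", "man", "men", "mother", "father",
                "waitress", "waiter", "she", "he", "her", "his", "girl", "boy"}

def gender_word_baseline(text: str) -> bool:
    """Predict 'sexist' iff the text contains a gender-definition word."""
    tokens = text.lower().split()
    return any(tok.strip(".,!?\"'") in GENDER_WORDS for tok in tokens)

print(gender_word_baseline("Women can't drive."))  # True
print(gender_word_baseline("I love pizza."))       # False
```

As the text notes, this baseline has high recall on topical sexism but necessarily mislabels any non-sexist message that merely mentions gender.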

Toxicity Baseline

We use Jigsaw’s Perspective API [Jigsaw and Google2017] – specifically its toxicity rating – as another baseline since it is widely used for hate-speech detection (see http://www.nytco.com/the-times-is-partnering-with-jigsaw-to-expand-comment-capabilities/). The Perspective API uses machine learning to measure text toxicity, defined as “a rude, disrespectful, or unreasonable comment that is likely to make one leave a discussion.” We determine the optimal threshold differentiating sexist from non-sexist examples during training. We predict as sexist those test examples that score above the threshold.
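The threshold search can be sketched as follows. The exact criterion optimized during training is not specified in the text; we use accuracy for illustration, and the toy scores stand in for Perspective API toxicity ratings:

```python
def best_threshold(scores_and_labels):
    """Pick the toxicity cut-off that maximizes training accuracy.

    scores_and_labels: list of (toxicity_score, is_sexist) pairs.
    """
    candidates = sorted({s for s, _ in scores_and_labels})

    def accuracy(t):
        return sum((s >= t) == y for s, y in scores_and_labels) / len(scores_and_labels)

    return max(candidates, key=accuracy)

train = [(0.9, True), (0.8, True), (0.3, False), (0.2, False)]
t = best_threshold(train)
print(t)  # 0.8 separates this toy data perfectly
```

At test time, any example scoring at or above the chosen threshold is predicted sexist.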


Logistic Regression

Our first model is a unigram-based TF-IDF logistic regression model with L2 regularization (C = 1). We preprocess text following the procedure detailed in Jha and Mamidi (2017). While this model is simple, it is interpretable and sheds light on which unigrams contribute to a sentence being classified as sexist.


CNN

Our next model consists of a word-level CNN [Kim2014], which is a standard baseline for text classification tasks. Each sentence is padded to the maximum sentence length. We use an embedding layer of dimension 128, followed by convolutional filters of sizes 3, 4, and 5. Next, we max-pool the result of the convolutional layer into a long feature vector, add dropout regularization, and classify the result using a softmax layer. We adopt the default hyperparameters of 0.5 dropout and no L2 regularization.
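The forward pass of such a CNN can be sketched in NumPy with toy dimensions. The real model uses embedding size 128 and is trained end-to-end with dropout; this untrained sketch only illustrates the convolution and max-over-time pooling architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; the paper uses embedding size 128 and filter sizes 3, 4, 5.
vocab, emb_dim, n_filters = 50, 8, 4
filter_sizes = (3, 4, 5)
max_len = 10

E = rng.normal(size=(vocab, emb_dim))                        # embedding table
filters = {h: rng.normal(size=(n_filters, h * emb_dim)) for h in filter_sizes}
W_out = rng.normal(size=(n_filters * len(filter_sizes), 2))  # softmax layer

def forward(token_ids):
    """Forward pass of a word-level CNN (Kim 2014), illustrative only."""
    ids = (token_ids + [0] * max_len)[:max_len]   # pad / truncate with id 0
    X = E[ids]                                    # (max_len, emb_dim)
    pooled = []
    for h, F in filters.items():
        # Slide a width-h window over the sentence and apply each filter.
        windows = np.stack([X[i:i + h].ravel() for i in range(max_len - h + 1)])
        conv = np.maximum(windows @ F.T, 0)       # ReLU feature maps
        pooled.append(conv.max(axis=0))           # max-over-time pooling
    z = np.concatenate(pooled) @ W_out
    p = np.exp(z - z.max())
    return p / p.sum()                            # softmax over the two classes

probs = forward([3, 17, 42, 5])
print(probs.shape)  # (2,)
```

Max-over-time pooling makes the feature vector length independent of sentence length, which is why a single softmax layer suffices on top.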

BERT finetuned

BERT is a recent transformer-based, pre-trained contextualized embedding model, extendable to a classification model with an additional output layer [Devlin et al.2019]. It achieves state-of-the-art performance in text classification, question answering, and language inference without substantial task-specific modifications. One of the practical advantages of BERT is that its creators make its parameters available after training on a large corpus. BERT can also be adapted for end-to-end text classification tasks through fine-tuning. We replace the output layer of a pre-trained BERT model with a new output layer to adapt it for sexism detection. For preprocessing, all sentences are converted to sequences of 128 tokens, with shorter sentences padded. We use default hyperparameters (batch size = 32, learning rate = 2e-5, epochs = 3.0, warmup proportion = 0.1).

3 Results

First, we discuss which aspects of sexism are covered in existing and in our novel social media datasets. Second, we discuss insights into how crowdworkers generated adversarial examples for sexism. Finally, we compare various sexism detection methods on different datasets, and quantify the usefulness of adversarial examples.

Figure 1: Dataset annotation. We show how the datasets were annotated for different categories of sexist content (left plot) and phrasing (center). We then label messages as sexist if they fall under any sexist content or phrasing category (right plot). We only report messages where at least 3 out of 5 annotators agree on the label (e.g., for the callme dataset, annotators agreed less on content labels than on phrasing labels). The benevolent and hostile datasets are Twitter datasets that were collected and annotated as sexist in previous work [Jha and Mamidi2017, Waseem and Hovy2016]. Our re-annotation effort shows that the majority of these tweets were mislabeled and are actually non-sexist. We find that the other dataset, which previous work considered non-sexist, does indeed contain largely non-sexist tweets. We expected the majority of tweets in the callme dataset to be sexist, which proved true (~60%). Finally, 30% of the items in our scales dataset are labeled as non-sexist—in line with ground truth.

3.1 Annotating Sexism Datasets

Figure 1 shows which aspects of sexism are covered by the different datasets. As described in section 2.4, we collected at least 5 annotations per sentence and marked a sentence as sexist if at least 3 raters labeled it sexist because of either its content or style. The first subplot shows the number of messages in each dataset for which our annotators were able to agree, based on the content of the message, whether it is sexist or not, and if so, which semantic dimension of sexism is covered. With respect to the semantic aspects of sexism, we observe in all three Twitter datasets (b, h, and c) that the largest fraction of sexist tweets expresses sexist stereotypes, followed by sexist expectations. Endorsement of inequality and rejection of inequality are less common on Twitter.

The second subplot shows the number of messages for which annotators were able to decide, based on style alone, whether the message is sexist or not. In this work we distinguish between general incivility (like using impolite or aggressive language) and incivility that entails sexism (like name-calling targeting individuals’ gender). We find incivility and sexism to be often conflated in previous data annotation efforts. For example, focusing on uncivil tweets in the hostile dataset, we find that the numbers of sexist and non-sexist examples are comparable, although the entire dataset was previously labeled as sexist (e.g., “I really hope Kat gets hit by a bus, than reversed over, than driven over again, than reversed than.. #mkr” or “I’m sick of you useless ass people in my culture. stfu im sick off you useless ass people in my mentions”). Yet, in general, we observe that the majority of tweets across all datasets are civil and are therefore either non-sexist or sexist because of their content.

The last subplot combines the prior two and shows the number of sexist and non-sexist messages in each dataset. Recall that a message is considered sexist if it is sexist because of either its content or its phrasing. The benevolent and hostile datasets were collected and labeled as sexist in previous work [Jha and Mamidi2017, Waseem and Hovy2016]. Our re-annotation effort with more annotators, who were primed to understand the different aspects of sexism, shows that the majority of these tweets are non-sexist according to our codebook. Interestingly, the callme dataset, which we compiled with a very simple heuristic, contains the largest fraction of sexist messages of all Twitter datasets according to our evaluation. The other dataset contains tweets that Waseem and Hovy [Waseem and Hovy2016] labeled as not sexist. We annotated a random sample of 700 of them and found that only 8 were sexist. Thus, we consider the entire dataset to be non-sexist.

3.2 Computational Methods for Sexism Detection

Figure 2: Classification performance of different models. The x-axis shows the test data domain and the y-axis depicts the macro-F1 score. Deep hues correspond to models trained on modified data, faint hues to models trained on original data. Colors correspond to the data domain used for training (e.g., blue corresponds to bho training sets). The rectangle to the right of each plot highlights the average performance across datasets and compares it to two baselines: the presence of gender words (dotted line) and toxicity score (dashed line). Our results show that the fine-tuned BERT model that incorporates the adversarial examples into the training process achieves on average the best performance across all datasets. The lowest performance is achieved on the scales dataset (s), where the best average F1 score was achieved by the BERT models trained on bhocs-M() and bhocs ().

For the sake of brevity, we call “modified” the versions of the training and test sets that include adversarial examples, and “original” their counterparts without adversarial examples. We built the modified datasets by randomly substituting 50% of the non-sexist tweets in the original data with adversarial examples. Therefore, the size of the dataset and the class balance (sexist versus non-sexist) are identical in the original and modified data.
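The substitution procedure can be sketched as below; the record layout and label strings are assumptions for illustration, not our actual data format:

```python
import random

def build_modified(original, adversarial, frac=0.5, seed=0):
    """Swap `frac` of the non-sexist examples for adversarial ones
    (minimally edited, non-sexist counterparts of sexist messages).
    Dataset size and class balance stay identical to the original."""
    rng = random.Random(seed)
    nonsexist = [i for i, ex in enumerate(original)
                 if ex["label"] == "non-sexist"]
    swap = rng.sample(nonsexist, int(frac * len(nonsexist)))
    modified = list(original)
    for i, adv in zip(swap, adversarial):
        modified[i] = {"text": adv, "label": "non-sexist"}
    return modified

original = (
    [{"text": f"sexist tweet {i}", "label": "sexist"} for i in range(4)]
    + [{"text": f"other tweet {i}", "label": "non-sexist"} for i in range(4)]
)
adversarial = ["minimally edited tweet 0", "minimally edited tweet 1"]
modified = build_modified(original, adversarial)
print(len(modified), len(original))  # sizes match by construction
```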

Baseline performance

We compare models against two baselines: a simplistic one, which considers sexist any mention of a gendered word, and one of practical interest, the toxicity score from the Perspective API, which is actively used for moderating online content. Our expectations about their performance were overturned. Toxicity scores perform the worst across all social media datasets. By analyzing their distribution, we found that the poor performance is due to data topicality. For example, sexist tweets from the benevolent dataset show higher toxicity scores than non-sexist tweets from the same dataset, but lower than non-sexist tweets from the hostile dataset. In other words, toxicity scores may help to correctly classify sexist messages when they are phrased aggressively, but not necessarily when the sexism is expressed in a neutral or positive tone. That is, toxicity might adequately capture sexism that arises through the phrasing of a message, but not sexism that arises from its content.
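In sketch form, the toxicity baseline reduces to thresholding precomputed API scores; the threshold value and scores below are illustrative assumptions, not the API's recommended settings:

```python
def toxicity_baseline(scores, threshold=0.5):
    """Classify a message as sexist when its (precomputed)
    Perspective toxicity score exceeds the threshold."""
    return [int(s > threshold) for s in scores]

# hypothetical scores already retrieved from the API
scores = [0.92, 0.15, 0.81, 0.08]
print(toxicity_baseline(scores))  # [1, 0, 1, 0]
```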

Conversely, the gender word baseline performed surprisingly well. This mirrors our analysis of the strategies that humans use to remove sexism from offending sentences: when gender is not mentioned, a sentence is less likely to be sexist. In fact, these results help us better understand the potential and limitations of the state of the art in sexism detection. Even using the bho data and annotations from [Jha and Mamidi2017] (that is, disregarding our own re-annotation efforts), the gender word baseline distinguishes between sexist and non-sexist tweets with 73% macro F1. This derives from an artifact of data collection and annotation: the two classes are topically distinct and a few words easily discern between them, not because sexism detection is an easy task. Therefore, the performance of a model trained on such data may not reflect its real ability to identify sexist statements in general. This highlights the need for more rigorous model evaluations.
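The gender word baseline is trivial to implement, which is exactly what makes its 73% macro F1 alarming; the lexicon below is a small illustrative sample, not our full word list:

```python
GENDER_WORDS = {"woman", "women", "girl", "girls", "she", "her",
                "man", "men", "boy", "boys", "he", "his",
                "female", "male"}  # illustrative subset of the lexicon

def gender_word_baseline(text):
    """Label a message sexist iff it mentions any gendered word."""
    return int(bool(set(text.lower().split()) & GENDER_WORDS))

print(gender_word_baseline("women can't drive"))   # 1
print(gender_word_baseline("this show is awful"))  # 0
```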

Overall performance patterns

Independently of the type of model employed or the domain of origin of the training data, several patterns emerge. First, the performance of models trained on original datasets drops drastically when classifying their modified versions (cf. Figure 2). For example, Logit classifiers trained on original bho data score around 90% F1 when classifying original bho data, whereas performance drops below 70% when classifying modified data (faint blue circles, first two rows of the top left plot). This confirms the effectiveness of adversarial examples in challenging the generalizability of the models. In line with our hypotheses, incorporating even just 25% of adversarial examples into training improves the robustness of the models, whose performance remains stable or improves when classifying modified data (deep blue circles, first two rows of the top left plot). In fact, models trained on modified data outperform models trained on original data even when classifying original data, albeit within statistical error bounds. One reason may be that adversarial examples provide classifiers with more difficult examples and therefore the information gain is higher. Two post-hoc analyses partially support this interpretation. First, by stratifying the misclassification rates, we find that models trained on modified data are more accurate not only on non-sexist adversarial examples, but also on sexist original examples (1–10% true positive rate increase depending on the dataset). Second, when classifiers are trained on modified data, the most predictive features for the non-sexist class appear qualitatively better aligned with gender-neutral language. The original data contain several artifacts that show up as top features in the model, such as “Kat,” the name of a female TV host: the original data contain personal attacks against her, which although toxic are often non-sexist. The model trained on modified data instead relies on more general features, like “sexist” and “girl” as positive and “people” and “kid” as negative predictors. Thus, adversarial examples boost model robustness and, partially, performance.
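The stratified misclassification analysis amounts to computing error rates per (gold label, adversarial) group; this is a minimal sketch with hypothetical records, not our evaluation harness:

```python
from collections import defaultdict

def error_rates(examples, predictions):
    """Misclassification rate per (gold label, adversarial?) stratum."""
    errors, totals = defaultdict(int), defaultdict(int)
    for ex, pred in zip(examples, predictions):
        key = (ex["label"], ex["adversarial"])
        totals[key] += 1
        errors[key] += int(pred != ex["label"])
    return {k: errors[k] / totals[k] for k in totals}

examples = [
    {"label": "sexist", "adversarial": False},
    {"label": "non-sexist", "adversarial": True},
    {"label": "non-sexist", "adversarial": True},
    {"label": "non-sexist", "adversarial": False},
]
preds = ["sexist", "sexist", "non-sexist", "non-sexist"]
print(error_rates(examples, preds))
```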

Best performance

We next aggregate performance based on the type of model and the data domain used for training. We find that training on the modified version of the omnibus dataset yields the highest average performance across all tests. This suggests that incorporating different aspects of sexism in the training set and challenging models with difficult examples helps models generalize better. Furthermore, the family of models that performs best across tests is BERT. It is the latest and most complex of the models in this study, so it is unsurprising that it achieves peak performance. However, its better average performance may stem from two factors. First, BERT produces document-level representations, thus capturing long-range dependencies in the data. For example, it can summarize a comparison between two genders, regardless of where they appear in a long sentence. Second, BERT’s word-level internal embeddings help deal with lexical sparsity and variety. For example, this allows associating different expressions of the same stereotype, like “fragile” and “delicate”.

Performance on gold standard data

Strikingly, though, all models perform poorly on the dataset of psychological scales. One potential cause is the different nature of the dataset: survey items, neutral in tone and crafted by academics, are phrased very differently from the social media parlance of the other datasets. However, we find that scales are inherently hard to classify, even when ruling out phrasing issues. The best performance when training classifiers in-domain on the scales themselves, evaluated in cross-validation, is 72% F1 (compared to over 90% for bho). On the one hand, this speaks to the complexity of the sexism detection task. Scale items set a hard benchmark to beat for future machine learning approaches. On the other hand, we find that our best out-of-domain model, BERT trained on the modified omnibus dataset, comes close to in-domain performance at 63% F1. Intuitively, including data from a variety of sources helps capture the multiple facets of sexism in psychological scale items.
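All scores reported here are macro-averaged F1, which weighs the sexist and non-sexist classes equally regardless of class imbalance; a minimal reference implementation makes the metric explicit:

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in set(gold):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

print(macro_f1([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
```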

Error Analysis

When do computational models of sexism fail? Are there expressions of sexism that are inherently more difficult than others to identify computationally? Do humans and computational models fail on the same cases? We turn to these questions next.

Several characteristics of a message may challenge sexism detection models; we include the following in our analyses. Sexist and non-sexist messages may be qualitatively different, for example discussing different topics, one of them being gender. Thus, we analyze whether models perform equally well on sexist and non-sexist messages. Beyond this, certain types of sexism may be harder to identify than others. The models may distinguish blatant sexism from civil messages, but would they distinguish equally well between sexist messages and messages that are “just” uncivil? In a similar vein, would models identify sexism in messages phrased positively, such as those conveying benevolent stereotypes? To test this, we include in the analysis the fine-grained labels for sexist content and phrasing obtained through our annotation effort, as well as the dataset of origin of the messages. We also analyze the role of adversarial examples. We showed that models trained on adversarial examples are more robust, but it is important that they remain accurate across the various types of messages, and that their performance gain does not stem exclusively from being able to recognize adversarial examples in the test set. Finally, expressions of sexism can be subtle, and even human annotators disagree on ambiguous examples. We check whether the degree of disagreement among human annotators matches model performance.

To better understand how these aspects affect model performance, we unpack model errors. We focus on the best performing model, BERT fine-tuned on the omnibus dataset. We look at both versions of the model: the one trained on the original messages (bhocs) and the one trained on the adversarial examples (bhocs-M). Then, we assess under which conditions the models misclassify messages. We collect the predictions made for each message in the omnibus test set across validation folds (we use the bhocs-M test set because it exhibits the most message variety). Then, we record whether the message is sexist (binary), its fine-grained annotations for content and phrasing (categorical, dummy-coded with “nonsexist” as the reference class), its dataset of origin (dummy-coded categorical, “callme” as reference), whether it is an adversarial example (binary), and the number of crowdworkers agreeing on the majority label (ordinal). Via a logistic regression, we predict whether the message was correctly classified depending on all of the above factors. We control for whether the prediction was made by a model trained on original or on adversarial data via an interaction term. In compact form, the regression is as follows:

correct ∼ (sexist + content + phrasing + dataset + adversarial + agreement) × trained-on-modified


Overall, we find that model accuracy correlates with agreement between annotators: messages that are ambiguous for humans are also difficult to classify for computational models. Somewhat intuitively, sexist messages expressing positive sentiment seem to challenge the models. In fact, the models commit more errors on messages from the benevolent sexism dataset, as well as on messages that fall under the “endorsement of inequality” class of sexist content. We also find that training on adversarial examples improves model performance beyond sheer accuracy. Models trained on adversarial data, unsurprisingly, classify adversarial test examples better than models trained only on original data. However, the former also classify harder examples better. In particular, although it is in general difficult to distinguish between sexist and non-sexist messages with uncivil phrasing, models trained on adversarial examples distinguish better between the two.
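The shape of this error-analysis regression can be sketched on synthetic data; all values, the feature names, and the coefficients are illustrative, and the content, phrasing, and dataset dummies are omitted for brevity:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400
sexist = rng.integers(0, 2, n)       # gold label (binary)
adversarial = rng.integers(0, 2, n)  # adversarial example?
agreement = rng.integers(3, 6, n)    # annotators on the majority label
modified = rng.integers(0, 2, n)     # model trained on modified data?

# main effects plus interactions with the training-data indicator
X = np.column_stack([
    sexist, adversarial, agreement, modified,
    sexist * modified, adversarial * modified, agreement * modified,
])
# synthetic outcome: higher agreement -> prediction more often correct
p = 1 / (1 + np.exp(-(0.8 * (agreement - 4) + 0.5 * adversarial * modified)))
correct = (rng.random(n) < p).astype(int)

coefs = LogisticRegression(max_iter=1000).fit(X, correct).coef_[0]
names = ["sexist", "adversarial", "agreement", "modified",
         "sexist:mod", "adversarial:mod", "agreement:mod"]
print(dict(zip(names, coefs.round(2))))
```

In the paper's setting, the fitted coefficients indicate which message characteristics predict a correct classification, with the interaction terms capturing how training on adversarial data changes each effect.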


Limitations

First, sexism scale items are crafted to measure constructs via designed statements. In contrast, social media content does not necessarily reflect the opinion or attitude of its author, evidenced through differences in mode (sarcasm, retweeting, quoting, joking) and context (intent, recipient, previous interactions). Removing these cues makes it difficult even for humans to decide whether content is sexist, as we have asked our raters to do. Second, treating gender identity and biological sex as the same binary category likely constitutes a reductionist approach to a multifaceted concept, to be addressed by further research. Third, our codebook's content categories could be devised in even more nuanced ways and include more dimensions of sexism. However, in their current form, they already provide us with valuable insights into the composition of training sets and the “blind spots” of previously applied state-of-the-art sexism detection. In the future, we will refine this codebook further. Fourth, we use a binary classification for sexism in our models despite having a codebook and annotated data that would allow us to classify it in a more fine-grained manner. While our binary approach ensures comparability with previous studies and previously labeled datasets, the alternative remains an interesting prospect for future studies.

4 Conclusions

This paper contributes to sexism detection in natural language in several ways:

First, we proposed a theory-driven data annotation approach that relies on psychological definitions of sexism to uncover the multiple dimensions of the construct. We developed a codebook for sexist expressions on social media and used it to annotate three existing datasets and one novel dataset of tweets. Further, we compiled a gold standard dataset of validated items from psychological scales of sexism. Furthermore, we introduced an approach to generate adversarial examples that have minimal lexical differences from sexist messages. We shared insights on the strategies employed by humans to make sentences not sexist. Finally, we compared the validity and reliability of state-of-the-art methods for sexism detection. We gave empirical evidence that incorporating adversarial examples as well as multiple data sources boosts model performance across dimensions of sexism. Importantly, we showed that two approaches improve model evaluation in sexism detection: 1) challenging model reliability through adversarial examples; and 2) confronting the model with the various aspects that comprise sexism as a construct to improve the validity of the model.

As a benchmark for future approaches to sexism detection, we make available to the research community our data, annotations, and adversarial examples.


  • [Anzovino, Fersini, and Rosso2018] Anzovino, M.; Fersini, E.; and Rosso, P. 2018. Automatic Identification and Classification of Misogynistic Language on Twitter. In Natural Language Processing and Information Systems, 57–64.
  • [Ashmore, Del Boca, and Bilder1995] Ashmore, R. D.; Del Boca, F. K.; and Bilder, S. M. 1995. Construction and validation of the Gender Attitude Inventory, a structured inventory to assess multiple dimensions of gender attitudes. Sex Roles 32(11-12):753–785.
  • [Belk and Snell1986] Belk, S. S., and Snell, W. E. 1986. The Beliefs about Women Scale:(BAWS); Scale Development and Validation. Personality and Social Psychology Bulletin 12(4):403–413.
  • [Bender Motz1952] Bender Motz, A. 1952. The role conception inventory: A tool for research in social psychology. American Sociological Review 17(4):465–471.
  • [Benson and Vincent1980] Benson, P. L., and Vincent, S. 1980. Development and validation of the sexist attitudes toward women scale (satws). Psychology of Women Quarterly 5(2):276–291.
  • [Brogan and Kutner1976] Brogan, D., and Kutner, N. G. 1976. Measuring Sex-Role Orientation: A Normative Approach. Journal of Marriage and the Family 38(1):31–40.
  • [Clark and Watson1995] Clark, L. A., and Watson, D. 1995. Constructing validity: Basic issues in objective scale development. Psychological assessment 7(3):309.
  • [Devlin et al.2019] Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. Minneapolis, Minnesota: Association for Computational Linguistics.
  • [Dreyer, Woods, and James1981] Dreyer, N. A.; Woods, N. F.; and James, S. A. 1981. ISRO: A scale to measure sex-role orientation. Sex Roles 7(2):173–182.
  • [ElSherief, Belding, and Nguyen2017] ElSherief, M.; Belding, E.; and Nguyen, D. 2017. # notokay: Understanding gender-based violence in social media. In Eleventh International AAAI Conference on Web and Social Media.
  • [Gardner et al.2020] Gardner, M., et al. 2020. Evaluating NLP models via contrast sets. arXiv preprint arXiv:2004.02709.
  • [Fassinger1994] Fassinger, R. E. 1994. Development and testing of the attitudes toward feminism and the women’s movement (FWM) scale. Psychology of Women Quarterly 18(3):389–402.
  • [Fiebert1983] Fiebert, M. S. 1983. Measuring Traditional and Liberated Males’ Attitudes. Perceptual and Motor Skills 56(1):83–86.
  • [García-Cueto et al.2015] García-Cueto, E.; Rodríguez-Díaz, F. J.; Bringas-Molleda, C.; López-Cepero, J.; Paíno-Quesada, S.; and Rodríguez-Franco, L. 2015. Development of the gender role attitudes scale (gras) amongst young spanish people. International Journal of Clinical and Health Psychology 15(1):61–68.
  • [Garibo i Orts2019] Garibo i Orts, Ò. 2019. Multilingual detection of hate speech against immigrants and women in Twitter at SemEval-2019 task 5: Frequency analysis interpolation for hate in speech detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, 460–463.
  • [Glick and Fiske1996] Glick, P., and Fiske, S. T. 1996. The Ambivalent Sexism Inventory: Differentiating Hostile and Benevolent Sexism. Journal of Personality and Social Psychology 70(3):491–512.
  • [Glick and Fiske1999] Glick, P., and Fiske, S. T. 1999. The ambivalence toward men inventory: Differentiating Hostile and Benevolent Beliefs about Men. Psychology of Women Quarterly 23(3):519–536.
  • [Groves et al.2011] Groves, R. M.; Fowler Jr, F. J.; Couper, M. P.; Lepkowski, J. M.; Singer, E.; and Tourangeau, R. 2011. Survey methodology, volume 561. John Wiley & Sons.
  • [Ilyas et al.2019] Ilyas, A.; Santurkar, S.; Tsipras, D.; Engstrom, L.; Tran, B.; and Madry, A. 2019. Adversarial examples are not bugs, they are features. arXiv preprint arXiv:1905.02175.
  • [Jha and Mamidi2017] Jha, A., and Mamidi, R. 2017. When does a compliment become sexist? Analysis and classification of ambivalent sexism using twitter data. Proceedings of the Second Workshop on NLP and Computational Social Science 7–16.
  • [Jigsaw and Google2017] Jigsaw, and Google. 2017. Perspective API. https://www.perspectiveapi.com/. Accessed: 2019-12-13.
  • [Kalin and Tilby1978] Kalin, R., and Tilby, P. J. 1978. Development and validation of a sex-role ideology scale. Psychological Reports 42(3 PT 1):731–738.
  • [Kaushik, Hovy, and Lipton2019] Kaushik, D.; Hovy, E.; and Lipton, Z. C. 2019. Learning the difference that makes a difference with counterfactually-augmented data. arXiv preprint arXiv:1909.12434.
  • [Kerr and Holden1996] Kerr, P. S., and Holden, R. R. 1996. Development of the gender role beliefs scale (GRBS). Journal of Social Behavior and Personality 11(5):3–26.
  • [Kim2014] Kim, Y. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1746–1751. Doha, Qatar: Association for Computational Linguistics.
  • [King and King1997] King, L. A., and King, D. W. 1997. Sex-role egalitarianism scale: Development, psychometric properties, and recommendations for future research. Psychology of Women Quarterly 21(1):71–87.
  • [Kirkpatrick1936] Kirkpatrick, C. 1936. The Construction of a Belief-Pattern Scale for Measuring Attitudes toward Feminism. Journal of Social Psychology 7(4):421–437.
  • [Knapp and Mueller2010] Knapp, T. R., and Mueller, R. O. 2010. Reliability and validity of instruments. The reviewer’s guide to quantitative methods in the social sciences 337–342.
  • [Larsen and Long1988] Larsen, K. S., and Long, E. 1988. Attitudes toward sex-roles: Traditional or egalitarian? Sex Roles 19(1-2):1–12.
  • [Lazer2015] Lazer, D. 2015. Issues of construct validity and reliability in massive, passive data collections. The City Papers: An Essay Collection from The Decent City Initiative.
  • [Levant and Hirsch1992] Levant, R. F., and Hirsch, L. S. 1992. The male role: An investigation of contemporary norms. Journal of Mental Health Counseling.
  • [Levinson and Huffman1955] Levinson, D. J., and Huffman, P. E. 1955. Traditional Family Ideology and Its Relation to Personality. Journal of Personality 23(3):251–273.
  • [MacDonald1976] MacDonald, A. P. 1976. Identification and Measurement of Multidimensional Attitudes Toward Equality Between the Sexes. Journal of Homosexuality 1(2):165–182.
  • [Mason and Bumpass1975] Mason, K. O., and Bumpass, L. L. 1975. U. S. Women’s Sex-Role Ideology, 1970. American Journal of Sociology 80(5):1212–1219.
  • [Menegatti and Rubini2017] Menegatti, M., and Rubini, M. 2017. Gender Bias and Sexism in Language, volume 1. Oxford University Press.
  • [Nadler and Morrow1959] Nadler, E. B., and Morrow, W. R. 1959. Authoritarian Attitudes toward Women, and their Correlates. Journal of Social Psychology 49(1):113–123.
  • [Pleck, Sonenstein, and Ku1994] Pleck, J. H.; Sonenstein, F. L.; and Ku, L. C. 1994. Attitudes toward male roles among adolescent males: A discriminant validity analysis. Sex Roles 30(7-8):481–501.
  • [Rombough and Ventimiglia1981] Rombough, S., and Ventimiglia, J. C. 1981. Sexism: A tri-dimensional phenomenon. Sex Roles 7(7):747–755.
  • [Salehi et al.2015] Salehi, N.; Irani, L. C.; Bernstein, M. S.; Alkhatib, A.; Ogbe, E.; Milland, K.; and Clickhappier. 2015. We Are Dynamo: Overcoming Stalling and Friction in Collective Action for Crowd Workers. In Proceedings of the Conference on Human Factors in Computing Systems, 1621–1630.
  • [Salganik2017] Salganik, M. J. 2017. Bit by bit: social research in the digital age. Princeton University Press.
  • [Sen et al.2019] Sen, I.; Floeck, F.; Weller, K.; Weiss, B.; and Wagner, C. 2019. A total error framework for digital traces of humans. arXiv preprint arXiv:1907.08228.
  • [Smith, Ferree, and Miller1975] Smith, E. R.; Ferree, M. M.; and Miller, F. D. 1975. A short scale of attitudes toward feminism. Representative Research in Social Psychology 6(1):51–56.
  • [Snell, Belk, and Hawkins1998] Snell, W. E.; Belk, S.; and Hawkins, R. 1998. Stereotypes about Male Sexuality Scale (SAMSS). In Fisher, T. D.; Davis, C. M.; Yarber, W. L.; and Davis, S. L., eds., Handbook of Sexuality-related Measures: A Compendium. Routledge. 463–465.
  • [Spence and Helmreich1972] Spence, J. T., and Helmreich, R. L. 1972. The Attitudes Toward Women Scale: An objective instrument to measure attitudes toward the rights and roles of women in contemporary society. American Psychological Association.
  • [Swim et al.1995] Swim, J. K.; Aikin, K. J.; Hall, W. S.; and Hunter, B. A. 1995. Sexism and racism: Old-fashioned and modern prejudices. Journal of personality and social psychology 68(2):199.
  • [Swim et al.2001] Swim, J. K.; Hyers, L. L.; Cohen, L. L.; and Ferguson, M. J. 2001. Everyday sexism: Evidence for its incidence, nature, and psychological impact from three daily diary studies. Journal of Social Issues 57(1):31–53.
  • [Tougas et al.1995] Tougas, F.; Brown, R.; Beaton, A. M.; and Joly, S. 1995. Neosexism: Plus ça change, plus c’est pareil. Personality and Social Psychology Bulletin 21(8):842–849.
  • [Villemez and Touhey1977] Villemez, W. J., and Touhey, J. C. 1977. A Measure of Individual Differences in Sex Stereotyping and Sex Discrimination: The “Macho” Scale. Psychological Reports 41(2):411–415.
  • [Waseem and Hovy2016] Waseem, Z., and Hovy, D. 2016. Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter. Proceedings of the NAACL Student Research Workshop 88–93.
  • [Waseem et al.2017] Waseem, Z.; Davidson, T.; Warmsley, D.; and Weber, I. 2017. Understanding Abuse: A Typology of Abusive Language Detection Subtasks. In Proceedings of the First Workshop on Abusive Language Online, 78–84.
  • [Zhao et al.2018] Zhao, J.; Zhou, Y.; Li, Z.; Wang, W.; and Chang, K.-W. 2018. Learning gender-neutral word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 4847–4853.