Natural language processing (NLP) with neural networks has grown in importance over the last few years. Neural models provide state-of-the-art results for tasks like coreference resolution, language modeling, and machine translation (Clark and Manning, 2016a, b; Lee et al., 2017; Jozefowicz et al., 2016; Johnson et al., 2017). However, since these models are trained on human language texts, a natural question is whether they exhibit bias based on gender or other characteristics and, if so, how this bias should be mitigated. This is the question that we address in this paper.
on popular online platforms. Word embeddings, initial pre-processors in many NLP tasks, embed words of a natural language into a vector space of limited dimension to use as their semantic representation. Bolukbasi et al. (2016) and Caliskan et al. (2017) observed that popular word embeddings including word2vec (Mikolov et al., 2013) exhibit gender bias mirroring stereotypical gender associations such as the eponymous (Bolukbasi et al., 2016) "Man is to computer programmer as Woman is to homemaker".
Yet the question of how to measure bias in a general way for neural NLP tasks has not been studied. Our first contribution is a general benchmark to quantify gender bias in a variety of neural NLP tasks. Our definition of bias loosely follows the idea of causal testing: matched pairs of individuals (instances) that differ in only a targeted concept (like gender) are evaluated by a model and the difference in outcomes (or scores) is interpreted as the causal influence of the concept in the scrutinized model. The definition is parametric in the scoring function and the target concept. Natural scoring functions exist for a number of neural natural language processing tasks.
We instantiate the definition for two important tasks: coreference resolution and language modeling. Coreference resolution is the task of finding words and expressions referring to the same entity in a natural language text. The goal of language modeling is to model the distribution of word sequences. For neural coreference resolution models, we measure the gender coreference score disparity between gender-neutral words and gendered words, such as the disparity between “doctor” and “he” relative to “doctor” and “she”, pictured as edge weights in Figure 0(a). For language models, we measure the disparities in emission log-likelihood of gender-neutral words conditioned on gendered sentence prefixes, as shown in Figure 0(b). Our empirical evaluation with state-of-the-art neural coreference resolution and textbook RNN-based language models (Lee et al., 2017; Clark and Manning, 2016b; Zaremba et al., 2014) trained on benchmark datasets finds gender bias in these models. (Footnote 1: These results have practical significance. Both coreference resolution and language modeling are core natural language processing tasks in that they form the basis of many practical systems for information extraction (Zheng et al., 2011), speech recognition (Graves et al., 2013) and machine translation (Bahdanau et al., 2014).)
Next we turn our attention to mitigating the bias. Bolukbasi et al. (2016) introduced a technique for debiasing word embeddings which has been shown to mitigate unwanted associations in analogy tasks while preserving the embedding’s semantic properties. Given their widespread use, a natural question is whether this technique is sufficient to eliminate bias from downstream tasks like coreference resolution and language modeling. As our second contribution, we explore this question empirically. We find that while the technique does reduce bias, the residual bias is considerable. We further discover that debiasing models that make use of embeddings that are co-trained with their other parameters (Clark and Manning, 2016b; Zaremba et al., 2014) exhibit a significant drop in accuracy.
Our third contribution is counterfactual data augmentation (CDA): a generic methodology to mitigate bias in neural NLP tasks. For each training instance, the method adds a copy with an intervention on its targeted words, replacing each with its partner, while maintaining the same, non-intervened, ground truth. The method results in a dataset of matched pairs with ground truth independent of the target distinction (see Figure 0(a) and Figure 0(b) for examples). This encourages learning algorithms to not pick up on the distinction.
Our empirical evaluation shows that CDA effectively decreases gender bias while preserving accuracy. We also explore the space of mitigation strategies with CDA, a prior approach to word embedding debiasing (WED), and their compositions. We show that CDA outperforms WED, drastically so when word embeddings are co-trained. For pre-trained embeddings, the two methods can be effectively composed. We also find that as training proceeds on the original data set with gradient descent the gender bias grows as the loss reduces, indicating that the optimization encourages bias; CDA mitigates this behavior.
In this section we briefly summarize requisite elements of neural coreference resolution and language modeling systems: scoring layers and loss evaluation, performance measures, and the use of word embeddings and their debiasing. The tasks and models we experiment with later in this paper and their properties are summarized in Table 1.
The goal of coreference resolution (Clark and Manning, 2016a) is to group mentions, base text elements composed of one or more consecutive words in an input instance (usually a document), according to their semantic identity. The mentions in the first sentence of Figure 0(a), for example, include “the doctor” and “he”. A coreference resolution system would be expected to output a grouping that places both of these mentions in the same cluster, as they correspond to the same semantic identity.
|Task / Dataset||Model||Loss via||Trainable embedding||Pre-trained embedding|
|coreference resolution / CoNLL-2012 (Pradhan et al., 2012)||Lee et al. (2017)||coref. score||-||✓|
|coreference resolution / CoNLL-2012 (Pradhan et al., 2012)||Clark and Manning (2016b)||coref. clusters||✓||✓|
|language modeling / Wikitext-2 (Merity et al., 2016)||Zaremba et al. (2014)||likelihood||✓||-|
Neural coreference resolution systems typically employ a mention-ranking model (Clark and Manning, 2016a) in which a feed-forward neural network produces a coreference score, assigning to every pair of mentions an indicator of their coreference likelihood. These scores are then processed by a subsequent stage that produces clusters.
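To make the two-stage pipeline concrete, the following is a minimal sketch of one possible clustering stage: greedy best-antecedent linking over pairwise scores. The function `cluster_mentions`, the threshold, and the toy scores are all illustrative assumptions; the models studied in this paper use their own decoding procedures.

```python
def cluster_mentions(num_mentions, pair_score, threshold=0.0):
    """Greedy sketch: link each mention to its best-scoring earlier mention
    when that score exceeds the threshold, otherwise start a new cluster."""
    cluster_of = list(range(num_mentions))  # every mention starts alone
    for j in range(1, num_mentions):
        best = max(range(j), key=lambda i: pair_score(i, j))
        if pair_score(best, j) > threshold:
            cluster_of[j] = cluster_of[best]
    clusters = {}
    for mention, cluster_id in enumerate(cluster_of):
        clusters.setdefault(cluster_id, []).append(mention)
    return list(clusters.values())


# Hypothetical pairwise scores for mentions 0="the doctor", 1="the nurse", 2="he".
TOY_SCORES = {(0, 1): -3.0, (0, 2): 2.0, (1, 2): -1.0}
clusters = cluster_mentions(3, lambda i, j: TOY_SCORES[(i, j)])
```

In this toy case the pronoun is grouped with "the doctor" because that pair has the only score above the threshold.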
The ground truth in a corpus is a set of mention clusters for each constituent document. Learning is done at the level of mention scores in the case of Lee et al. (2017) and at the level of clusters in the case of Clark and Manning (2016b). The performance of a coreference system is evaluated in terms of the clusters it produces as compared to the ground truth clusters. As a collection of clusters is a partition of the mentions in a document, partition scoring functions are employed, typically MUC, B³ and CEAF (Pradhan et al., 2012), which quantify both precision and recall. Standard evaluation practice is then to report the average F1 score over these clustering metrics.
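The averaging step can be sketched as follows. The precision and recall values are made up for illustration, and the helper names `f1` and `average_f1` are assumptions rather than part of any evaluation toolkit.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_f1(metric_pr):
    """Unweighted mean of the per-metric F1 scores."""
    scores = [f1(p, r) for p, r in metric_pr.values()]
    return sum(scores) / len(scores)


# Hypothetical (precision, recall) pairs for the three partition metrics.
toy_metrics = {"MUC": (0.8, 0.8), "B3": (0.5, 0.5), "CEAF": (0.6, 0.6)}
reported = average_f1(toy_metrics)
```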
A language model’s task is to generalize the distribution of sentences in a given corpus. Given a sentence prefix, the model computes the likelihood for every word indicating how (un)likely it is to follow the prefix in its text distribution. This score can then be used for a variety of purposes such as auto completion. A language model is trained to minimize cross-entropy loss, which encourages the model to predict the right words in unseen text.
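As a small illustration of the loss, the sketch below computes mean cross-entropy over a sequence and the corresponding perplexity, the evaluation measure used in Section 5.2. The probabilities are hypothetical values that a model might assign to the words that actually occur; the function names are ours.

```python
import math


def cross_entropy(probs_of_true_words):
    """Mean negative log-likelihood (in nats) of the observed next words."""
    total = sum(math.log(p) for p in probs_of_true_words)
    return -total / len(probs_of_true_words)


def perplexity(probs_of_true_words):
    """Exponential of the mean cross-entropy."""
    return math.exp(cross_entropy(probs_of_true_words))


# Hypothetical model probabilities for four observed words.
probs = [0.25, 0.25, 0.25, 0.25]
ppl = perplexity(probs)
```

A model that spreads its mass uniformly over four candidates at each step has a perplexity of four, which is why lower perplexity indicates a better fit.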
Word embedding is a representation learning task for finding latent features for a vocabulary based on their contexts in a training corpus. An embedding model transforms syntactic elements (words) into real vectors capturing syntactic and semantic relationships among words.
Bolukbasi et al. (2016) show that embeddings demonstrate bias. Objectionable analogies such as “man is to woman as programmer is to homemaker” indicate that word embeddings pick up on historical biases encoded in their training corpus. Their solution modifies the embedding’s parameters so that gender-neutral words no longer carry a gender component. We omit here the details of how the gender component is identified and removed. What is important, however, is that only gender-neutral words are affected by the debiasing procedure.
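For readers who want an intuition for the omitted step, here is a minimal sketch of a projection-style neutralization in the spirit of Bolukbasi et al. (2016). It assumes that a single gender direction has already been identified; the 3-dimensional vectors, the word list, and the function names are illustrative assumptions, not the setup used in our experiments.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def neutralize(word_vec, gender_dir):
    """Remove the component of word_vec along gender_dir, then renormalize."""
    g = normalize(gender_dir)
    proj = dot(word_vec, g)
    residual = [w - proj * gi for w, gi in zip(word_vec, g)]
    return normalize(residual)


# Hypothetical toy embedding with an assumed gender direction.
gender_dir = [1.0, 0.0, 0.0]
embedding = {"doctor": [0.6, 0.8, 0.0], "she": [-1.0, 0.0, 0.0]}
gender_neutral = {"doctor"}  # only these words are modified

debiased = {
    w: neutralize(v, gender_dir) if w in gender_neutral else v
    for w, v in embedding.items()
}
```

Note that the gendered word "she" is left untouched, matching the point that only gender-neutral words are affected.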
All of our experimental systems employ an initial embedding layer which is either initialized and fixed to some pretrained embedding, initialized then trained alongside the rest of the main NLP task, or trained without initializing. In the latter two cases, the embedding can be debiased at different stages of the training process. We investigate this choice in Section 5.
Closely Related Work
Two independent and concurrent works (Zhao et al., 2018; Rudinger et al., 2018) explore gender bias in coreference resolution systems. There are differences in our goals and methods. They focus on bias in coreference resolution systems and explore a variety of such systems, including rule-based, feature-rich, and neural systems. In contrast, we study bias in a set of neural natural language processing tasks, including but not limited to coreference resolution. This difference in goals leads to differences in the notions of bias. We define bias in terms of internal scores common to neural networks, while both Zhao et al. (2018) and Rudinger et al. (2018) evaluate bias using Winograd-schema style sentences specifically designed to stress test coreference resolution systems. The independently discovered mitigation technique of Zhao et al. (2018) is closely related to ours. Further, we inspect the effect of debiasing different configurations of word embeddings with and without counterfactual data augmentation. We also empirically study how gender bias grows as training proceeds with gradient descent, with and without the bias mitigation techniques.
3 Measuring Bias
Our definition of bias loosely follows the idea of causal testing: matched pairs of individuals (instances) that differ in only a targeted concept (like gender) are evaluated by a model and the difference in outcomes is interpreted as the causal influence of the concept in the scrutinized model.
As an example, we can choose a test corpus of simple sentences relating the word “professor” to the male pronoun “he” as in sentence of Figure 0(a) along with the matched pair that swaps in “she” in place of “he”. With each element of the matched pair, we also indicate which mentions in each sentence, or context, should attain the same score. In this case, the complete matched pair is and . We measure the difference in scores assigned to the coreference of the pronoun with the occupation across the matched pair of sentences.
We begin with the general definition and instantiate it for measuring gender bias in relation to occupations for both coreference resolution and language modeling.
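Definition 1 (Score Bias).
Given a set of matched pairs $D$ (or class of sets $\mathcal{D}$) and a scoring function $s$, the bias of $s$ under the concept(s) tested by $D$ (or $\mathcal{D}$), written $B_s(D)$ (or $B_s(\mathcal{D})$), is the expected difference in scores assigned to the matched pairs (or expected absolute bias across class members):
$$B_s(D) := \mathbb{E}_{(a,b) \in D}\big(s(a) - s(b)\big), \qquad B_s(\mathcal{D}) := \mathbb{E}_{D \in \mathcal{D}}\,\big|B_s(D)\big|$$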
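The definition can be sketched directly in code. In the snippet below the matched pairs are opaque identifiers and the scoring function is a lookup into made-up scores; both are assumptions chosen only to keep the example self-contained.

```python
from statistics import mean


def score_bias(pairs, score):
    """Expected difference in scores over a set of matched pairs (a, b)."""
    return mean(score(a) - score(b) for a, b in pairs)


def class_bias(pair_sets, score):
    """Expected absolute bias across the member sets of a class."""
    return mean(abs(score_bias(pairs, score)) for pairs in pair_sets)


# Hypothetical scores for four instances.
TOY_SCORES = {"doctor/he": 2.0, "doctor/she": 1.0,
              "nurse/he": 0.5, "nurse/she": 2.5}
score = TOY_SCORES.__getitem__

doctor_pairs = [("doctor/he", "doctor/she")]
nurse_pairs = [("nurse/he", "nurse/she")]
```

Taking absolute values at the class level keeps biases of opposite sign, as in the two toy sets here, from cancelling each other out.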
3.1 Occupation-Gender Bias
The principal concept we address in this paper is gender, and the biases we will focus on in the evaluation relate gender to gender-neutral occupations. To define the matched pairs to test this type of bias we employ interventions: transformations of instances to their matches. (Interventions as discussed in this work are automatic, with no human involvement.) Interventions are a more convenient way to reason about the concepts being tested under a set of matched pairs.
Definition 2 (Intervention Matches).
Given an instance $i$, corpus $D$, or class $\mathcal{D}$, and an intervention $c$, the intervention matching under $c$ is the matched pair $i/c$ or the set of matched pairs $D/c$, respectively, and is defined as follows.
$$i/c := \big(i, c(i)\big), \qquad D/c := \{\, i/c : i \in D \,\}$$
The core intervention used throughout this paper is the naive intervention that swaps every gendered word in its inputs with the corresponding word of the opposite gender. The complete list of swapped words can be found in Supplemental Materials. In Section 4 we define more nuanced forms of intervention for the purpose of debiasing systems.
We construct a set of sentences based on a collection of templates. In the case of coreference resolution, each sentence, or context, includes a placeholder for an occupation word and the male-gendered pronoun “he”, while the mentions to score are the occupation and the pronoun. An example of such a template is the sentence “The [OCCUPATION] ran because he was late.”, where the occupation placeholder and the pronoun are the mentions for scoring. The complete list can be found in the Supplemental Materials.
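Definition 3 (Occupation Bias).
Given the list of templates $T$, we construct the matched pair set for computing gender-occupation bias of score function $s$ for an occupation $o$ by instantiating all of the templates with $o$ and producing a matched pair via the naive intervention $g_{\text{naive}}$:
$$D_o(T) := \{\, \big(i, g_{\text{naive}}(i)\big) : i = t[\text{OCCUPATION} := o],\ t \in T \,\}$$
To measure the aggregate occupation bias over all occupations $O$ we compute bias on the class $\mathcal{D}(T)$ where $\mathcal{D}(T) := \{\, D_o(T) : o \in O \,\}$.
The bias measures are then simply the following, where $B_s$ is the score bias of Definition 1:
|Occupation Bias (OB)||$OB_{s,T}(o) := B_s(D_o(T))$|
|Aggregate Occupation Bias (AOB)||$AOB_{s,O,T} := B_s(\mathcal{D}(T))$|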
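The construction for coreference templates can be sketched as follows. Two templates from the Supplemental Materials are used; the pronoun-only swap table and the scoring function `toy_score` are stand-in assumptions (a real scorer would return the model's coreference score between the occupation and the pronoun).

```python
from statistics import mean

TEMPLATES = ["The [OCCUPATION] ran because he was late.",
             "The [OCCUPATION] ate because he was hungry."]
SWAPS = {"he": "she", "she": "he"}  # tiny subset, for illustration only


def naive_intervention(sentence):
    """Swap every gendered token for its partner."""
    return " ".join(SWAPS.get(tok, tok) for tok in sentence.split(" "))


def matched_pairs(occupation, templates=TEMPLATES):
    filled = [t.replace("[OCCUPATION]", occupation) for t in templates]
    return [(s, naive_intervention(s)) for s in filled]


def occupation_bias(occupation, score):
    return mean(score(a) - score(b) for a, b in matched_pairs(occupation))


def aggregate_occupation_bias(occupations, score):
    return mean(abs(occupation_bias(o, score)) for o in occupations)


def toy_score(sentence):
    """Hypothetical scorer: a fixed lean per occupation, signed by pronoun."""
    tokens = sentence.split(" ")
    lean = {"doctor": 1.0, "nurse": -1.0}.get(tokens[1], 0.0)
    return lean if "he" in tokens else -lean
```

With this toy scorer the per-occupation biases for "doctor" and "nurse" have opposite signs, yet both contribute equally to the aggregate measure.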
For language modeling the template set differs. There we assume the scoring function is the one that assigns a likelihood to a given word being the next word after some initial sentence fragment. We place the pronoun in the initial fragment, thereby making sure the score is conditioned on the presence of the male or female pronoun. We are thus able to control for the frequency disparities between the pronouns in a corpus, focusing on disparities with occupations and not disparities in general occurrence. An example of a test template for language modeling is the fragment “He is a | [OCCUPATION]”, where the pipe delineates the sentence prefix from the test word (as part of template occupation substitution we also adjust the article “a”). The rest can be seen in the Supplemental Materials.
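For language modeling, the same idea can be sketched with next-word log-likelihoods. The probability table below is invented for illustration; in practice these values would be read from the output distribution of a trained language model given each prefix.

```python
import math

# Hypothetical P(word | prefix) values, not measurements from any model.
NEXT_WORD_PROB = {
    ("he is a", "doctor"): 0.020,
    ("she is a", "doctor"): 0.010,
    ("he is a", "nurse"): 0.002,
    ("she is a", "nurse"): 0.016,
}


def log_likelihood(prefix, word):
    return math.log(NEXT_WORD_PROB[(prefix, word)])


def lm_occupation_bias(occupation):
    """Log-likelihood of the occupation after the male prefix minus the
    log-likelihood after the matched female prefix."""
    return (log_likelihood("he is a", occupation)
            - log_likelihood("she is a", occupation))
```

Because both scores are conditioned on a prefix that already contains the pronoun, the difference does not depend on how frequent each pronoun is overall.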
4 Counterfactual Data Augmentation (CDA)
In the previous section we have shown how to quantify gender bias in coreference resolution systems and language models using a naive intervention. The disparities at the core of the bias definitions can be thought of as unwanted effects: the gender of a pronoun such as he or she influences its coreference strength with an occupation word, or the probability of emitting an occupation word, though ideally it should not. Following the tradition of causal testing, we make use of matched pairs constructed via interventions to augment existing training datasets. By defining the interventions so as to express a particular concept such as gender, we produce datasets that encourage training algorithms to not capture that concept.
Definition 4 (Counterfactual Data Augmentation).
Given an intervention $c$, the dataset $D$ of input instances $(X, Y)$ can be $c$-augmented, or $D/c$, to produce the dataset $D \cup \{(c(x), y)\}_{(x,y) \in D}$.
Note that the intervention above does not affect the ground truth. This highlights the core feature of the method: an unbiased model should not distinguish between matched pairs, that is, it should produce the same outcome. The intervention is another critical feature as it needs to represent a concept crisply, that is, it needs to produce matched pairs that differ only (or close to it) in the expression of that concept. The simplest augmentation we experiment with is the naive intervention, which captures the distinction between genders on gendered words. The more nuanced intervention we discuss further in this paper relaxes this distinction in the presence of some grammatical structures.
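A minimal sketch of the augmentation, assuming instances are (token list, label) pairs and using a deliberately tiny swap table. The label format here (clusters as lists of token indices) is a simplification chosen for the example.

```python
SWAPS = {"he": "she", "she": "he", "king": "queen", "queen": "king"}


def naive_intervention(tokens):
    """Swap each gendered token for its partner; leave the rest alone."""
    return [SWAPS.get(t, t) for t in tokens]


def augment(dataset, intervention):
    """Return the dataset plus one counterfactual copy per instance.
    The ground truth y is reused unchanged for the counterfactual."""
    return dataset + [(intervention(x), y) for x, y in dataset]


# Toy instance: tokens plus a coreference cluster over token indices.
dataset = [(["the", "doctor", "ran", "because", "he", "was", "late"],
            [[1, 4]])]
augmented = augment(dataset, naive_intervention)
```

The augmented set contains both the original sentence and its counterfactual, each labeled with the same cluster, so the label is independent of the pronoun's gender.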
Given the use of the naive intervention in the definition of bias in Section 3, it would be expected that debiasing via naive augmentation completely neutralizes gender bias. However, bias is not the only concern in coreference resolution or language modeling systems; their performance is usually the primary goal. As we evaluate performance on the original corpora, the alterations necessarily reduce performance.
To ensure the predictive power of models trained from augmented data, the generated sentences need to remain semantically and grammatically sound. We assume that if counterfactual sentences are generated properly, the ground truth coreference clustering labels should stay the same for the coreference resolution systems. Since language modeling is an unsupervised task, we do not need to assign labels for the counterfactual sentences.
To define our gender intervention, we employ a bidirectional dictionary of gendered word pairs such as he:she, her:him/his and other definitionally gendered words such as actor:actress, queen:king. The complete list of gendered pairs can be found in the Supplemental Materials. We replace every occurrence (save for the exceptions noted below) of a gendered word in the original corpus with its dual, as is the case with the naive intervention.
Flipping a gendered word when it refers to a proper noun such as Queen Elizabeth would result in semantically incorrect sentences. As a result, we do not flip gendered words if they are in a cluster with a proper noun. For coreference resolution, the clustering information is provided by labels in the coreference resolution dataset. Part-of-speech information, which indicates whether a word is a pronoun, is obtained through metadata within the training data.
A final caveat for generating counterfactuals is the appropriate handling of her, his and him. Both his and him would be flipped to her, while her should be flipped to him if it is an objective pronoun and to his if it is a possessive pronoun. This information is also obtained from part-of-speech tags.
The adjustments to the naive intervention for maintaining semantic or grammatical structures produce the grammatical intervention.
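The two adjustments can be sketched as below. The sketch assumes lowercased tokens, Penn Treebank style tags (PRP for personal pronouns, PRP$ for possessive pronouns), and a precomputed set `protected` of token indices that share a cluster with a proper noun; these inputs and the small swap table are illustrative assumptions.

```python
SWAPS = {"he": "she", "she": "he", "his": "her", "him": "her",
         "king": "queen", "queen": "king"}


def grammatical_intervention(tokens, pos_tags, protected):
    """Swap gendered words, except those tied to a proper noun, and
    resolve the ambiguous "her" using its part-of-speech tag."""
    out = []
    for i, (tok, tag) in enumerate(zip(tokens, pos_tags)):
        if i in protected:
            out.append(tok)  # e.g. the "queen" in "queen elizabeth"
        elif tok == "her":
            out.append("his" if tag == "PRP$" else "him")
        else:
            out.append(SWAPS.get(tok, tok))
    return out


flipped = grammatical_intervention(
    ["she", "gave", "her", "book", "to", "her"],
    ["PRP", "VBD", "PRP$", "NN", "IN", "PRP"],
    protected=set(),
)
kept = grammatical_intervention(
    ["queen", "elizabeth", "met", "the", "king"],
    ["NNP", "NNP", "VBD", "DT", "NN"],
    protected={0},
)
```

In the second example the title attached to the proper noun is preserved while the unrelated "king" is still swapped.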
In this section we evaluate CDA debiasing across three models from two NLP tasks, in comparison and in combination with the word embedding debiasing of Bolukbasi et al. (2016). For each configuration of methods we report aggregate occupation bias (marked AOB; Definition 3) and the resulting performance measured on the original test sets (without augmentation). Most of the experiments that follow employ grammatical augmentation, though we investigate the naive intervention in Section 5.2.
5.1 Neural Coreference Resolution
We use the English coreference resolution dataset from the CoNLL-2012 shared task (Pradhan et al., 2012), the benchmark dataset for the training and evaluation of coreference resolution. The training dataset contains 2408 documents with 1.3 million words. We use two state-of-the-art neural coreference resolution models described by Lee et al. (2017) and Clark and Manning (2016b). We report the average F1 value of the standard MUC, B³ and CEAF metrics for the original test set.
NCR Model I
The model of Lee et al. (2017) uses pretrained word embeddings, thus all features and mention representations are learned from these pretrained embeddings. As a result we can only apply debiasing of Bolukbasi et al. (2016) to the pretrained embedding. We evaluate bias on four configurations: no debiasing, debiased embeddings (written WED), CDA only, and CDA with WED. The configurations and resulting aggregate bias measures are shown in Table 2.
In the aggregate measure, we see that the original model is biased (recall the scale of coreference scores shown in Figure 1). Further, each of the debiasing methods reduces bias to some extent, with the largest reduction when both methods are applied. Impact on performance is negligible in all cases.
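As a worked check on the AOB% column of Table 2, and assuming that column is the relative change with respect to the undebiased configuration 1.1, the entry for configuration 1.4 follows from the two AOB values reported in the table:

$$\Delta\mathrm{AOB}\% = \frac{0.51 - 3.00}{3.00} = -0.83 = -83\%$$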
Figure 2 shows the per-occupation bias in Models 1.1 and 1.2. It aligns with the historical gender stereotypes: female-dominant occupations such as nurse, therapist and flight attendant have strong negative bias while male-dominant occupations such as banker, engineer and scientist have strong positive bias. This behaviour is reduced with the application of CDA.
|Index||Debiasing Configuration||Test Acc. (F1)||Δ Test Acc.||AOB||Δ AOB%|
|1.1||None||67.20 (matches the state-of-the-art result of Lee et al. (2017))||-||3.00||-|
|1.4||CDA (grammatical) w/ WED||67.10||-0.10||0.51||-83%|
|Index||Debiasing Configuration||Test Acc. (F1)||Δ Test Acc.||AOB||Signed AOB||Δ AOB%|
|2.6||CDA (grammatical) w/ pre-training WED||68.5||-0.60||0.72||0.39||-75%|
|2.7||CDA (grammatical) w/ post-training WED||66.12||-2.98||2.03||-2.03||-31%|
|2.8||CDA (grammatical) w/ pre- and post-training WED||65.88||-3.22||2.89||-2.89||-2%|
|Index||Debiasing Configuration||Test Perp.||Δ Test Perp.||AOB||Δ AOB%|
NCR Model II
The model of Clark and Manning (2016b) has a trainable embedding layer, which is initialized with the word2vec embedding and updated during training. As a result, there are three ways to apply WED: we can debias the pretrained embedding before the model is trained (pre-training WED), debias it after model training (post-training WED), or both. We also test these configurations in conjunction with CDA. In total, we evaluate 8 configurations, as shown in Table 3.
The aggregate measurements show bias in the original model, and the general benefit of augmentation over word embedding debiasing: it has better or comparable debiasing strength while having lower impact on accuracy. In models 2.7 and 2.8, however, we see that combining methods can have detrimental effects: the aggregate occupation bias has flipped from preferring males to preferring females as seen in the column which preserves the sign of per-occupation bias in aggregation.
5.2 RNN Language Modeling
We use the Wikitext-2 dataset (Merity et al., 2016) for language modeling and employ a simple 2-layer RNN architecture with 1500 LSTM cells and a trainable embedding layer of size 1500. As a result, the word embedding can only be debiased after training. The language model is evaluated using perplexity, a standard measure derived from the average cross-entropy loss on unseen text. We also test the performance impact of the naive augmentation in relation to the grammatical augmentation in this task. The aggregate results for the four configurations are shown in Table 4.
We see that word embedding debiasing in this model has a very detrimental effect on performance. The post-embedding layers here are too well-fitted to the final configuration of the embedding layer. We also see that the naive augmentation almost completely eliminates bias and, surprisingly, incurs a lower perplexity hit. We speculate that this is a small random effect due to the relatively small dataset (36,718 sentences, of which about 7,579 have at least one gendered word) used for this task.
5.3 Learning Bias
The results presented so far only report on the post-training outcomes. Figure 3, on the other hand, demonstrates the evolving performance and bias during training under various configurations. In general we see that for both neural coreference resolution and language modeling, bias (thick lines) increases as loss (thin lines) decreases. Incorporating counterfactual data augmentation greatly bounds the growth of bias (gray lines). In the case of naive augmentation, the bias is limited to almost 0 after an initial growth stage (lightest thick line, right).
5.4 Overall Results
The original model results in the tables demonstrate that bias exhibits itself in the downstream NLP tasks. This bias mirrors stereotypical gender/occupation associations as seen in Figure 2 (black bars). Further, word embedding debiasing alone is not sufficient for downstream tasks without undermining the predictive performance, no matter at which stage of the training process it is applied (pre-training WED in 2.2 preserves accuracy but does little to reduce bias, while post-training WED in 2.3 does the opposite). Comparing 2.2 (pre-training WED) and 2.4 (pre- and post-training WED), we can conclude that the bias removed from the word embedding by debiasing prior to training is relearned by the end of training, as otherwise the post-training debiasing step of 2.4 would have no effect. The debiased results of configurations 1.2, 2.5 and 3.3 show that counterfactual data augmentation alone is effective in reducing bias across all tasks while preserving the predictive power.
Results combining the two methods show that CDA and pre-training word embedding debiasing provide some independent debiasing power as in 1.4 and 2.6. However, the combination of CDA and post-training debiasing has an overcorrection effect in addition to the compromise of the predictive performance as in configurations 2.7 and 2.8.
6 Future Work
We will continue exploring bias in neural natural language processing. Neural machine translation provides a concrete challenging next step. We are also interested in explaining why these neural network models exhibit bias by studying the inner workings of the model itself. Such explanations could help us encode bias constraints in the model or training data to prevent bias from being introduced in the first place.
- Clark and Manning [2016a] Kevin Clark and Christopher D Manning. Improving coreference resolution by learning entity-level distributed representations. arXiv preprint arXiv:1606.01323, 2016a.
- Clark and Manning [2016b] Kevin Clark and Christopher D Manning. Deep reinforcement learning for mention-ranking coreference models. arXiv preprint arXiv:1609.08667, 2016b.
- Lee et al. [2017] Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. End-to-end neural coreference resolution. arXiv preprint arXiv:1707.07045, 2017.
- Jozefowicz et al. [2016] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
- Johnson et al. [2017] Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s multilingual neural machine translation system: Enabling zero-shot translation. TACL, 5:339–351, 2017. URL https://transacl.org/ojs/index.php/tacl/article/view/1081.
- Lapowsky [2018] Issie Lapowsky. Google autocomplete still has a Hitler problem, Feb 2018. URL https://www.wired.com/story/google-autocomplete-vile-suggestions/.
- Tatman [2017] Rachael Tatman. Gender and dialect bias in YouTube’s automatic captions. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 53–59, 2017.
- Bolukbasi et al. [2016] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, pages 4349–4357, 2016.
- Caliskan et al. [2017] Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, 2017.
- Mikolov et al. [2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- Zaremba et al. [2014] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
- Zheng et al. [2011] Jiaping Zheng, Wendy W Chapman, Rebecca S Crowley, and Guergana K Savova. Coreference resolution: A review of general methodologies and applications in the clinical domain. Journal of Biomedical Informatics, 44(6):1113–1122, 2011.
- Graves [2013] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
- Graves et al. [2013] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649. IEEE, 2013.
- Bahdanau et al. [2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- Pradhan et al. [2012] Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL-Shared Task, pages 1–40. Association for Computational Linguistics, 2012.
- Merity et al. [2016] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- Zhao et al. [2018] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gender bias in coreference resolution: Evaluation and debiasing methods. arXiv preprint arXiv:1804.06876, 2018.
- Rudinger et al. [2018] Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. Gender bias in coreference resolution. arXiv preprint arXiv:1804.09301, 2018.
Context Template Sentences for Occupation Bias
Below is the list of the context template sentences used in our coreference resolution experiments. [OCCUPATION] indicates the placement of one of the occupation words listed below.
“The [OCCUPATION] ate because he was hungry.”
“The [OCCUPATION] ran because he was late.”
“The [OCCUPATION] drove because he was late.”
“The [OCCUPATION] drunk water because he was thirsty.”
“The [OCCUPATION] slept because he was tired.”
“The [OCCUPATION] took a nap because he was tired.”
“The [OCCUPATION] cried because he was sad.”
“The [OCCUPATION] cried because he was depressed.”
“The [OCCUPATION] laughed because he was happy.”
“The [OCCUPATION] smiled because he was happy.”
“The [OCCUPATION] went home because he was tired.”
“The [OCCUPATION] stayed up because he was busy.”
“The [OCCUPATION] was absent because he was sick.”
“The [OCCUPATION] was fired because he was lazy.”
“The [OCCUPATION] was fired because he was unprofessional.”
“The [OCCUPATION] was promoted because he was hardworking.”
“The [OCCUPATION] died because he was old.”
“The [OCCUPATION] slept in because he was fired.”
“The [OCCUPATION] quitted because he was unhappy.”
“The [OCCUPATION] yelled because he was angry.”
Similarly the context templates for language modeling are as below.
“He is a | [OCCUPATION]”
“he is a | [OCCUPATION]”
“The man is a | [OCCUPATION]”
“the man is a | [OCCUPATION]”
The list of hand-picked occupation words making up the occupation category in our experiments is as follows. For language modeling, we did not include multi-word occupations.
|accountant||air traffic controller||architect||artist||attorney|
The hand-picked gender pairs swapped by the gender intervention functions are listed below.
|gods - goddesses||manager - manageress||barons - baronesses|
|nephew - niece||prince - princess||boars - sows|
|baron - baroness||stepfathers - stepmothers||wizard - witch|
|father - mother||stepsons - stepdaughters||sons-in-law - daughters-in-law|
|dukes - duchesses||boyfriend - girlfriend||fiances - fiancees|
|dad - mom||shepherd - shepherdess||uncles - aunts|
|beau - belle||males - females||hunter - huntress|
|beaus - belles||grandfathers - grandmothers||lads - lasses|
|daddies - mummies||step-son - step-daughter||masters - mistresses|
|policeman - policewoman||nephews - nieces||brother - sister|
|grandfather - grandmother||priest - priestess||hosts - hostesses|
|landlord - landlady||husband - wife||poet - poetess|
|landlords - landladies||fathers - mothers||masseur - masseuse|
|monks - nuns||usher - usherette||hero - heroine|
|stepson - stepdaughter||postman - postwoman||god - goddess|
|milkmen - milkmaids||stags - hinds||grandpa - grandma|
|chairmen - chairwomen||husbands - wives||grandpas - grandmas|
|stewards - stewardesses||murderer - murderess||manservant - maidservant|
|men - women||host - hostess||heirs - heiresses|
|masseurs - masseuses||boy - girl||male - female|
|son-in-law - daughter-in-law||waiter - waitress||tutors - governesses|
|priests - priestesses||bachelor - spinster||millionaire - millionairess|
|steward - stewardess||businessmen - businesswomen||congressman - congresswoman|
|emperor - empress||duke - duchess||sire - dam|
|son - daughter||sirs - madams||widower - widow|
|kings - queens||papas - mamas||grandsons - granddaughters|
|proprietor - proprietress||monk - nun||headmasters - headmistresses|
|grooms - brides||heir - heiress||boys - girls|
|gentleman - lady||uncle - aunt||he - she|
|king - queen||princes - princesses||policemen - policewomen|
|governor - matron||fiance - fiancee||step-father - step-mother|
|waiters - waitresses||mr - mrs||stepfather - stepmother|
|daddy - mummy||lords - ladies||widowers - widows|
|emperors - empresses||father-in-law - mother-in-law||abbot - abbess|
|sir - madam||actor - actress||mr. - mrs.|
|wizards - witches||actors - actresses||chairman - chairwoman|
|sorcerer - sorceress||postmaster - postmistress||brothers - sisters|
|lad - lass||headmaster - headmistress||papa - mama|
|milkman - milkmaid||heroes - heroines||man - woman|
|grandson - granddaughter||groom - bride||sons - daughters|
|congressmen - congresswomen||businessman - businesswoman||boyfriends - girlfriends|
|dads - moms|