Queens are Powerful too: Mitigating Gender Bias in Dialogue Generation

by Emily Dinan, et al.

Models often easily learn biases present in the training data, and their predictions directly reflect this bias. We analyze the presence of gender bias in dialogue and examine the subsequent effect on generative chitchat dialogue models. Based on this analysis, we propose a combination of three techniques to mitigate bias: counterfactual data augmentation, targeted data collection, and conditional training. We focus on the multi-player text-based fantasy adventure dataset LIGHT as a testbed for our work. LIGHT contains gender imbalance between male and female characters with around 1.6 times as many male characters, likely because it is entirely collected by crowdworkers and reflects common biases that exist in fantasy or medieval settings. We show that (i) our proposed techniques mitigate gender bias by balancing the genderedness of generated dialogue utterances; and (ii) they work particularly well in combination. Further, we show through various metrics—such as quantity of gendered words, a dialogue safety classifier, and human evaluation—that our models generate less gendered, but still engaging chitchat responses.


1 Introduction

Since machine learning algorithms learn to model patterns present in training datasets, what they learn is affected by data quality. Analysis has found that model predictions directly reflect the biases found in training datasets, such as image classifiers learning to associate ethnicity with specific activities Stock and Cisse (2017). Recent work in natural language processing has found similar biases, such as in word embeddings Bolukbasi et al. (2016); Brunet et al. (2018); Zhao et al. (2019), object classification Zhao et al. (2017), natural language inference He et al. (2019), and coreference resolution Zhao et al. (2018a). Less work has focused on the biases present in dialogue utterances Liu et al. (2019); Henderson et al. (2018), despite bias being clearly present in human interactions, and the rapid development of dialogue agents for real-world use-cases, such as interactive assistants. In this work we aim to address this by focusing on mitigating gender bias.

LIGHT Persona Examples
daughter: I spend most of my time doing household chores. I want to find meaning in life. I am energetic and happy.
chief wife: I am the king’s chief wife. Of all the women that he has married, or who are his concubines, I am the principal one. I represent the kingdom of my father, who is the king’s biggest ally. My sons are the ones who will most likely become the king after the death of my husband.
women: I live with my husband and 4 children in the village. I spend my days washing clothing and cleaning our home. My husband works for the royal army defending out town.
farmer Bob’s wife: I am farmer Bob’s wife. I like to take care of all our animals. I help Farmer Bob everyday on the farm.

I am a mother of eight children. I live with my family in a cottage in the countryside. I spend every day tending to the needs of all of my little ones which can be overwhelming, but I always manage to maintain a pleasing disposition and a happy smile.
wife: I am the wife of a farmer. While I may not be the most attractive woman ever, I am loyal and loving. My husband is a good man, but only seems to stay with me out of duty.
shady lady: I am a shady lady. I work in a tavern, and I am willing to trade sexual favors for money. I have to split the money with the tavernkeeper, so that he will offer me a room to work in. I am beginning to get sick from the “king’s evil”, which doctors call syphilis. My future is bleak: madness and death. But this is the only way that I can support myself, so I continue.
Table 1: Character persona examples from the LIGHT dataset. While there are relatively few examples of female-gendered personas, many of the existing ones exhibit bias. None of these personas were flagged by annotators during a review for offensive content.

We use the dialogue dataset from the LIGHT text adventure world Urbanek et al. (2019) as a testbed for our investigation into de-biasing dialogues. The dataset consists of a set of crowd-sourced locations, characters, and objects, which form the backdrop for the dialogues between characters. In the dialogue creation phase, crowdworkers are presented with personas for characters—which themselves were written by other crowdworkers—that they should enact; the dialogues the crowdworkers generate from these personas form the dialogue dataset. Dialogue datasets are susceptible to reflecting the biases of the crowdworkers as they are often collected solely via crowdsourcing. Further, the game’s medieval setting may encourage crowdworkers to generate text which accentuates the historical biases and inequalities of that time period Bowman (2010); Garcia (2017). However, despite the fact that the dialogues take place in a fantasy adventure world, LIGHT is a game and thus we are under no obligation to recreate historical biases in this environment, and can instead use creative license to shape it into a fun world with gender parity.

Dialogue Example
wife: I was married off by my family about five years ago. I spend my days cooking and cleaning so my husband will have something to eat when he returns from his work and can enjoy a clean home. I love my husband dearly because he works very hard to provide for us.
merchant: What a great day for more money.
wife: Oh my. That is some thick dust!
merchant: Indeed, it is very old.
wife: This room is going to take a while to clean. You might want to come back later.
merchant: It is fine I can set my booth up here.
wife: With all the foot traffic?
merchant: Yes it should be ok.
wife: It doesn’t appear that anyone ever comes up here!
merchant: Well they will when they know I am here.
wife: I have my doubts but I’ll just go about my cleaning.
merchant: Yea sounds like a good idea.
wife: What is that supposed to mean?
merchant: I am saying we should both do our jobs.
wife: Don’t take that tone with me!
Table 2: An example dialogue from the LIGHT dataset, with the persona for the wife character provided. Bias from the persona informs and affects the dialogue task.

We use the dialogues in LIGHT because we find that it is highly imbalanced with respect to gender: there are over 60% more male-gendered characters than female. We primarily address the discrepancy in the representation of male and female genders, although there are many characters that are gender neutral (like “trees”) or for which the gender could not be determined. We did not find any explicitly identified non-binary characters. We note that this is a bias in and of itself, and should be addressed in future work. We show that training on gender biased data leads existing generative dialogue models to amplify gender bias further. To offset this, we collect additional in-domain personas and dialogues to balance gender and increase the diversity of personas in the dataset. Next, we combine this approach with Counterfactual Data Augmentation and methods for controllable text generation to mitigate the bias in dialogue generation. Our proposed techniques create models that produce engaging responses with less gender bias.

2 Sources of Bias in Dialogue Datasets

2.1 Bias in Character Personas

Recent work in dialogue incorporates personas, or personality descriptions that ground a speaker’s chat, such as I love fishing Zhang et al. (2018); Shuster et al. (2018); Mazaré et al. (2018); Olabiyi et al. (2018); Li et al. (2016). Personas have been shown to increase engagingness and improve consistency. However, they can be a starting point for bias Shankar et al. (2017); Clark et al. (2019); Henderson et al. (2018), as bias in the personas propagates to subsequent conversations.

Qualitative Examination.

Analyzing the personas in LIGHT qualitatively, we find many examples of bias. For example, the character girl contains the line I regularly clean and cook dinner. Further examples are given in Table 1.

                    # Characters               # References
                    Female   Male   Neutral    Female   Male
Original Dataset       159    258      1460       439   1238
Gender Swap            336    230       694      1419   1030
New Characters         151    120      1448       357    275
Total                  646    608      3602      2215   2543
Table 3: Analysis of gender in LIGHT Characters: the original dataset contains about 1.6 times as many male-gendered characters as female-gendered characters. New characters are collected to offset this imbalance.
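As a quick arithmetic check on Table 3, the following sketch recomputes the column totals and the male-to-female ratios in the original data. The dictionary layout and key names are illustrative assumptions; only the counts are taken from the table.

```python
# Counts copied from Table 3; the key names are hypothetical labels.
TABLE3 = {
    "Original Dataset": {"female": 159, "male": 258, "neutral": 1460,
                         "female_refs": 439, "male_refs": 1238},
    "Gender Swap": {"female": 336, "male": 230, "neutral": 694,
                    "female_refs": 1419, "male_refs": 1030},
    "New Characters": {"female": 151, "male": 120, "neutral": 1448,
                       "female_refs": 357, "male_refs": 275},
}


def column_totals(rows):
    """Sum every column over all rows of the table."""
    totals = {}
    for counts in rows.values():
        for column, value in counts.items():
            totals[column] = totals.get(column, 0) + value
    return totals


totals = column_totals(TABLE3)
original = TABLE3["Original Dataset"]

# Male-to-female ratios in the original data.
character_ratio = original["male"] / original["female"]          # about 1.62
reference_ratio = original["male_refs"] / original["female_refs"]  # about 2.82

print(totals)
print(round(character_ratio, 2), round(reference_ratio, 2))
```

The recomputed totals match the Total row, and the two ratios correspond to the "around 1.6 times" and "nearly 3x" figures quoted in the text.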

Quantitative Examination.

We quantitatively analyze bias by first examining whether the existing personas are offensive, and second, evaluating their gender balance. To assess the pervasiveness of unsafe content present in personas, we asked three independent annotators to examine each character’s persona for potentially offensive content. If annotators selected that the content was offensive or maybe offensive, they were asked to place it in one of four categories – racist, sexist, classist, other – and to provide a reason for their response. Just over 2% of personas were flagged by at least one annotator, and these personas are removed from the dataset.

We further examined gender bias in personas. Annotators were asked to label the gender of each character based on their persona description (choosing “neutral” if it was not explicit in the persona). This annotation is possible because some personas include lines such as I am a young woman, although the majority of personas do not mention an explicit gender. Annotators found over 60% more male-gendered characters than female-gendered characters (258 versus 159; Table 3).¹
¹ Note that this difference could be exacerbated by annotators assigning gender to technically ungendered personas because of their own biases.

While annotators labeled personas as explicitly male, female, or gender-neutral, gender bias may still exist in personas beyond explicit sentences such as I am a young man. For example, personas can contain gendered references such as I want to follow in my father’s footsteps rather than mother’s footsteps. These relational nouns (Barker, 1992; Williams, 2018) such as father encode a specific relationship that can be gender biased. In this example, that relationship would be between the character and a man, rather than a woman. We analyzed the frequency of references to other gendered characters in the personas by counting the appearance of gendered words using the list compiled by Zhao et al. (2018c) (for example he vs. she), and find that men are disproportionately referred to in the personas: there are nearly 3x as many mentions of men as of women.
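The reference count described above can be sketched as follows. This is a minimal illustration, not the authors' code: the two word sets are tiny hypothetical stand-ins for the list of Zhao et al. (2018c), and the tokenizer is a simple assumption.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists for illustration only.
MALE_WORDS = {"he", "him", "his", "man", "men", "father", "king",
              "husband", "son", "sons"}
FEMALE_WORDS = {"she", "her", "hers", "woman", "women", "mother", "queen",
                "wife", "daughter", "daughters"}


def count_gendered_references(personas):
    """Count male- and female-gendered word mentions across persona texts."""
    counts = Counter(male=0, female=0)
    for text in personas:
        # Lowercase and keep alphabetic runs, so "father's" yields "father".
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in MALE_WORDS:
                counts["male"] += 1
            elif token in FEMALE_WORDS:
                counts["female"] += 1
    return counts


example_personas = [
    "I want to follow in my father's footsteps.",
    "I am the king's chief wife. My sons will become king after my husband.",
]
counts = count_gendered_references(example_personas)
print(counts["male"], counts["female"])
```

On real data the ratio `counts["male"] / counts["female"]` would give the kind of disparity reported in the text, subject to the coverage of whichever word list is used.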

2.2 Bias in Dialogue Utterances

After analyzing the bias in LIGHT personas, we go on to analyze the bias in dialogues created from those personas and how to quantify it.

Qualitative Examination.

In our analysis, we found many examples of biased utterances in the data used to train dialogue agents. For example, the character with a queen persona utters the line I spend my days embroidery and having a talk with the ladies. Another character in a dialogue admires a sultry wench with fire in her eyes. An example of persona bias propagating to the dialogue can be found in Table 2.

Measuring Bias.

Sexism is clearly present in many datasets Henderson et al. (2018), but finding a good way to measure sexism, especially at scale, can be challenging. A simple answer would be to rely on crowdworkers operating under their own notions of “sexism” to annotate the dialogues. However, in our experience, crowdworkers hold a range of views, often different from ours, as to what counts as sexism, making mere human evaluation far from sufficient. Note that the original LIGHT personas and dialogues were generated by crowdworkers, leaving little reason to believe that crowdworkers will be proficient at spotting the sexism that they themselves imbued the dataset with in the first place. Therefore, we supplement our crowdworker-collected human annotations of gender bias with additional quantitative measurements: we measure the ratio of gendered words (taken from the union of several existing gendered word lists that were each created through either automatic means, or by experts Zhao et al. (2018c, b); Hoyle et al. (2019)), and we run an existing dialogue safety classifier Dinan et al. (2019) to measure offensiveness of the dialogues.

Data split |     $F^{0}M^{0}$       |     $F^{0}M^{+}$       |     $F^{+}M^{0}$       |     $F^{+}M^{+}$       |     All
Model      | % gend.  % male      F1 | % gend.  % male      F1 | % gend.  % male      F1 | % gend.  % male      F1 |      F1
           |   words    bias   score |   words    bias   score |   words    bias   score |   words    bias   score |   score
Gold Lbl   |       0       0       - |    4.11     100       - |    4.03       0       - |    6.67   50.71       - |       -
Baseline   |    2.37   88.39   11.24 |    3.66   90.26   11.77 |    2.44   77.99   11.54 |    3.05   80.05   11.43 |   11.42
CDA        |    0.88   71.03   11.63 |    1.38   68.57    11.7 |     1.2   56.18   11.43 |    1.17   58.01   11.12 |   11.62
Pos. Data  |    2.76   82.44   10.46 |    3.68   86.43   10.07 |    4.59    72.1   10.07 |    4.43    86.5    9.88 |   10.44
CT         |    0.14   68.75   10.72 |    5.83   98.08   13.01 |     4.8    2.69   10.84 |    4.05   45.86   11.35 |   11.38
ALL        |    0.14   64.19   11.72 |    6.59   97.94   12.77 |    5.84    7.13   11.28 |    8.81   50.94   12.22 |   11.99
Table 4: We compare the performance of various bias mitigation methods – Counterfactual Data Augmentation (CDA), Positive-Bias Data Collection (Pos. Data), Conditional Training (CT), and combining these methods (ALL) – on the LIGHT test set, splitting the test set across the four genderedness bins: $F^{0/+}M^{0/+}$. $X^{0}$ indicates there are no X-gendered words in the gold response, while $X^{+}$ indicates that there is at least one. We measure the percent of gendered words in the generated utterances (% gend. words) and the percent of male bias (% male bias), i.e. the percent of male-gendered words among all gendered words generated. While each of these methods yields some improvement, combining all of these methods in one yields the best control over the genderedness of the utterances while still maintaining a good F1-score.
Figure 1 (panels: (a) F1 score; (b) percent of gendered words; (c) percent of male-gendered words): Comparing the performance of the ALL de-bias model when we fix the conditioning to a specific bin for all examples at test time. We report results for each possible conditioning bin choice. Across bins, the model maintains performance whilst radically changing the genderedness of the language generated.
               Gold Labels   Baseline     ALL
% Offensive           13.0      14.25   10.37
Table 5: Offensive language classification of model responses on the LIGHT dialogue test set.
Figure 2: Human Evaluation of ALL model compared to baseline Transformer generative model. The control bins in ALL are set to the gender-neutral bin $F^{0}M^{0}$ to reduce gendered words. Evaluators find it harder to predict the speaker gender when using our proposed techniques, while model engagingness is not affected by the method.

3 Methodology: Mitigating Bias in Generative Dialogue

We explore both data augmentation and algorithmic methods to mitigate bias in generative Transformer dialogue models. We describe first our modeling setting and then the three proposed techniques for mitigating bias. Using (i) counterfactual data augmentation Maudslay et al. (2019) to swap gendered words and (ii) additional data collection with crowdworkers, we create a gender-balanced dataset. Further, (iii) we describe a controllable generation method which moderates the male and female gendered words it produces.

3.1 Models

Following Urbanek et al. (2019), in all of our experiments we fine-tune a large, pre-trained Transformer encoder-decoder neural network on the dialogues in the LIGHT dataset. The model was pre-trained on Reddit conversations, using a previously existing Reddit dataset extracted and obtained by a third party and made available on pushshift.io. During pre-training, models were trained to generate a comment conditioned on the full thread leading up to the comment. Comments containing URLs or that were under 5 characters in length were removed from the corpus, as were all child comments, resulting in approximately […] million training examples. The model has an 8-layer encoder and an 8-layer decoder with 512-dimensional embeddings and 16 attention heads, and is based on the ParlAI implementation of Miller et al. (2017). For generation, we decode sequences with beam search with beam size 5.

3.2 Counterfactual Data Augmentation

One of the solutions that has been proposed for mitigating gender bias on the word embedding level is Counterfactual Data Augmentation (CDA) Maudslay et al. (2019). We apply this method by augmenting our dataset with a copy of every dialogue with gendered words swapped using the gendered word pair list provided by Zhao et al. (2018c). For example, all instances of grandmother are swapped with grandfather.
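A minimal sketch of this augmentation step is shown below. It is an illustration under stated assumptions rather than the authors' implementation: `GENDER_PAIRS` is a small hypothetical subset standing in for the pair list of Zhao et al. (2018c), and the function names are invented for this example.

```python
import re

# Hypothetical subset of gendered word pairs, for illustration only.
GENDER_PAIRS = [
    ("grandmother", "grandfather"),
    ("queen", "king"),
    ("woman", "man"),
    ("daughter", "son"),
    ("she", "he"),
]

# Build a symmetric lookup so each word maps to its counterpart.
SWAP = {}
for female_word, male_word in GENDER_PAIRS:
    SWAP[female_word] = male_word
    SWAP[male_word] = female_word

# Match whole words only, so "he" is not replaced inside "the".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in SWAP) + r")\b",
    re.IGNORECASE,
)


def swap_gendered_words(utterance):
    """Replace every listed gendered word with its counterpart."""
    def replace(match):
        word = match.group(0)
        swapped = SWAP[word.lower()]
        # Preserve an initial capital letter.
        return swapped.capitalize() if word[0].isupper() else swapped
    return _PATTERN.sub(replace, utterance)


def augment(dialogues):
    """Return the original dialogues plus one gender-swapped copy of each."""
    swapped = [[swap_gendered_words(u) for u in dialogue] for dialogue in dialogues]
    return dialogues + swapped


print(swap_gendered_words("My grandmother was the queen."))
```

A plain dictionary swap like this cannot resolve ambiguous words: "her", for instance, corresponds to both "his" and "him", which is why such pairs are left out of the sketch.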

3.3 Positive-Bias Data Collection

To create a more gender-balanced dataset, we collect additional data using a Positive-Bias Data Collection (Pos. Data) strategy.

Gender-swapping Existing Personas

There are a larger number of male-gendered character personas than female-gendered character personas (see Section 2), so we balance existing personas using gender-swapping. For every gendered character in the dataset, we ask annotators to create a new character with a persona of the opposite gender that is otherwise identical except for referring nouns or pronouns. Additionally, we ask annotators to swap the gender of any characters that are referred to in the persona text for a given character.

New and Diverse characters

As discussed in Section 2, it is insufficient to simply balance references to men and women in the dataset, as there may be bias in the form of sexism. While it is challenging to detect sexism, we attempt to offset this type of bias by collecting a set of interesting and independent characters. We do this by seeding workers with examples like adventurer with the persona I am a woman passionate about exploring a world I have not yet seen. I embark on ambitious adventures. We give the additional instruction to attempt to create diverse characters. Even with this instruction, crowdworkers still created roughly 3x as many male-gendered characters as female-gendered characters. We exclude male-gendered characters created in this fashion.

In combination with the gender swapped personas above, this yields a new set of 2,676 character personas (compared to 1,877 from the original dataset), for which the number of men and women and the number of references to male or female gendered words is roughly balanced: see Table 3.

New dialogues

Finally, we collect additional dialogues with these newly created gender balanced character personas, favoring conversations that feature female gendered characters to offset the imbalance in the original data. We added further instructions for annotators to be mindful of gender bias during their conversations, and in particular to assume equality between genders – social, economic, political, or otherwise – in this fantasy setting. In total, we collect 507 new dialogues containing 6,658 new dialogue utterances (about 6% of the size of the full LIGHT dataset).

3.4 Conditional Training

Bias in dialogue can manifest itself in various forms, but one form is the imbalanced use of gendered words. For example, LIGHT contains far more male-gendered words than female-gendered words rather than an even split between words of both genders. To create models that can generate a gender-balanced number of gendered words, we propose Conditional Training (CT) for controlling generative model output Kikuchi et al. (2016); Fan et al. (2017); Oraby et al. (2018); See et al. (2019). Previous work proposed a mechanism to train models with specific control tokens so models learn to associate the control token with the desired text properties Fan et al. (2017), then modifying the control tokens during inference to produce the desired result.

Prior to training, each dialogue response is binned into one of four bins – $F^{0/+}M^{0/+}$ – where $F^{0}$ indicates that there are zero female gendered words in the response and $F^{+}$ indicates the presence of at least one female gendered word; $M^{0}$ and $M^{+}$ are defined likewise for male gendered words. The gendered words are determined via an aggregation of existing lists of gendered nouns and adjectives from Zhao et al. (2018c, b); Hoyle et al. (2019). The bins are used to train a conditional model by appending a special token (indicating the bin for the target response) to the end of the input which is given to the encoder. At inference time, the bins can be manipulated to produce dialogue outputs with various quantities of gendered words.
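The binning and control-token steps can be sketched as follows. This is a hedged illustration, not the authors' code: the word sets are small hypothetical stand-ins for the aggregated lists, and the string labels (`F0M0`, `F0M+`, `F+M0`, `F+M+`) and function names are assumptions made for this example.

```python
import re

# Hypothetical, tiny word lists standing in for the aggregated gendered lists.
MALE_WORDS = {"he", "him", "his", "man", "king", "father", "son"}
FEMALE_WORDS = {"she", "her", "woman", "queen", "mother", "daughter"}


def genderedness_bin(response):
    """Return the bin label for a response: F0/F+ combined with M0/M+."""
    tokens = set(re.findall(r"[a-z]+", response.lower()))
    female = "+" if tokens & FEMALE_WORDS else "0"
    male = "+" if tokens & MALE_WORDS else "0"
    return f"F{female}M{male}"


def add_control_token(context, target_response):
    """Training time: append the bin of the gold response to the encoder input."""
    return f"{context} {genderedness_bin(target_response)}"


def force_bin(context, bin_label):
    """Inference time: append a chosen bin to steer the generated output."""
    return f"{context} {bin_label}"


print(add_control_token("My name is Abigail!", "i am the queen's daughter !"))
print(force_bin("My name is Abigail!", "F0M0"))
```

At training time the token is derived from the gold response; at inference time no gold response exists, so the caller picks the bin, for example `F0M0` to discourage gendered words altogether.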

4 Results

We train generative Transformer models using each of these methods – Counterfactual Data Augmentation that augments with swaps of gendered words (CDA, §3.2), adding new dialogues (Positive-Bias Data Collection, §3.3), and controllable generation to control the quantity of gendered words (CT, §3.4) – and finally combine all of these methods together (ALL).

Bias is Amplified in Generation

Existing Transformer generative dialogue models Serban et al. (2016); Yang et al. (2018); Urbanek et al. (2019) are trained to take as input the dialogue context and generate the next utterance. Previous work has shown that machine learning models reflect the biases present in data (Zhao et al., 2019; Brunet et al., 2018), and that these biases can be easy to learn compared to more challenging reasoning (Bolukbasi et al., 2016; Lewis and Fan, 2018). Generative models often use beam search or top-k sampling (Fan et al., 2018) to decode, and these methods are well-known to produce generic text Li et al. (2015), which makes them susceptible to statistical biases present in datasets.

As shown in Table 4, we find that existing models actually amplify bias. When the trained model generates gendered words (i.e., words from our gendered word list), it generates male-gendered words the vast majority of the time. Even on utterances for which it is supposed to generate only female-gendered words (i.e., the gold label only contains female-gendered words), it generates male-gendered words nearly 78% of the time (77.99% male bias for the baseline in Table 4).

Additionally, following Liu et al. (2019), we run an offensive language classifier on the gold responses and the model generated utterances (Table 5) and find that the model produces more offensive utterances than exist in the dataset.²
² One caveat is that the classifier was trained on human generated utterances not model generated utterances, so this may require further examination.

Genderedness of Generated Text

We analyze the performance of the various techniques by dividing the test set using the four genderedness bins – $F^{0}M^{0}$, $F^{0}M^{+}$, $F^{+}M^{0}$, and $F^{+}M^{+}$ – and calculate the F1 word overlap with the gold response, the percentage of gendered words generated (% gend. words), and the percentage of male-gendered words generated (relative to the sum total of gendered words generated by the model). We compare to the gold labels from the test set and a baseline model that does not use any of the bias mitigation techniques. Results for all methods are displayed in Table 4.
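The three quantities named above can be computed along the following lines. This is a sketch under assumptions: the word sets are hypothetical stand-ins for the aggregated gendered word lists, the tokenizer is deliberately simple, and the F1 shown is a plain unigram-overlap F1, which may differ in normalization details from the implementation used for the reported numbers.

```python
import re
from collections import Counter

# Hypothetical, tiny word lists for illustration only.
MALE_WORDS = {"he", "him", "his", "man", "king", "father", "son"}
FEMALE_WORDS = {"she", "her", "woman", "queen", "mother", "daughter"}


def tokenize(text):
    """Lowercase the text and keep alphabetic runs."""
    return re.findall(r"[a-z]+", text.lower())


def genderedness_stats(utterances):
    """Return (% gendered words, % male bias) over a set of utterances."""
    total = male = female = 0
    for utterance in utterances:
        for token in tokenize(utterance):
            total += 1
            male += token in MALE_WORDS
            female += token in FEMALE_WORDS
    gendered = male + female
    pct_gendered = 100.0 * gendered / total if total else 0.0
    # Male bias: share of male-gendered words among all gendered words.
    pct_male = 100.0 * male / gendered if gendered else 0.0
    return pct_gendered, pct_male


def unigram_f1(generated, gold):
    """Harmonic mean of unigram precision and recall against the gold response."""
    gen_tokens, gold_tokens = tokenize(generated), tokenize(gold)
    overlap = sum((Counter(gen_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


pct_gendered, pct_male = genderedness_stats(["the king and the queen", "he is a king"])
f1 = unigram_f1("i am the queen", "i am the king")
print(round(pct_gendered, 2), pct_male, f1)
```

Under this definition a model that generates male- and female-gendered words equally often scores 50% male bias regardless of how many gendered words it produces, which is why the two percentages are reported separately.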

Each of the methods we explore improves in % gendered words, % male bias, and F1 over the baseline Transformer generation model, but we find that combining all methods in one – the ALL model – is the most advantageous. While ALL has more data than CDA and CT, more data alone is not enough: the Positive-Bias Data Collection model does not achieve as good results. Both the CT and ALL models benefit from knowing the data split (that is, the genderedness bin of each example), and both models yield a genderedness ratio closest to ground truth.

Conditional Training Controls Gendered Words

Our proposed CT method can be used to control the use of gendered words in generated dialogues. We examine the effect of such training by generating responses on the test set while conditioning the ALL model on a single bin for all examples. Results are shown in Figure 1. Changing the bin radically changes the genderedness of generated text without significant changes to F1.

Examples of generated text from both the baseline and the ALL model are shown in Table 6. The baseline model generates male-gendered words even when the gold response contains no gendered words or only female-gendered words, even generating unlikely sequences such as “my name is abigail. i am the king of this kingdom.”.

Safety of Generated Text

Using a dialogue safety classifier Dinan et al. (2019), we find that our proposed de-biased models are rated as less offensive compared to the baseline generative Transformer and the LIGHT data (see Table 5).

Human Evaluation

Finally, we use human evaluation to compare the quality of our de-biasing methods. We use the dialogue evaluation system Acute-Eval Li et al. (2019) to ask human evaluators to compare two conversations from different models and decide which model is more biased and which model is more engaging. Following Acute-Eval, we collect 100 human and model paired chats. Conversations from a human and baseline model are compared to conversations from a human and the ALL model with all generations set to the gender-neutral control bin. Evaluators are asked which model is more engaging and for which model they find it more difficult to predict the gender of the speaker. We found that asking about difficulty of predicting a speaker’s gender was much more effective than asking evaluators to evaluate sexism or gender bias. Figure 2 shows that evaluators find it harder to predict the speaker’s gender for the ALL model (a statistically significant difference), while engagingness does not change. Our proposed methods are able to mitigate gender bias without degrading dialogue quality.

5 Conclusion

We analyze gender bias in dialogue and propose a general purpose method for understanding and mitigating bias in character personas and their associated dialogues. We present techniques using data augmentation and controllable generation to reduce gender bias in neural language generation for dialogue. We use the dataset LIGHT as a testbed for this work. By integrating these methods together, our models provide control over how gendered dialogue is and decrease the offensiveness of the generated utterances. Overall, our proposed methodology reduces the effect of bias while maintaining dialogue engagingness.

Generation Examples
Context:
[P1:] Owl can you find out how I died?
[P2:] I can look around the forest, but I need more information to help. Tell me what you remember about your past life.
[P1:] I don’t remember anything I was hoping you could find out.
[P2:] Your form is very hazy. Do you remember if you are a man or woman?
Baseline: i am not a man. i am a man of the forest .
ALL: no, i don’t remember.
Gold: I don’t know what’s wrong with me!

Context:
[P1:] I do not believe my eyes, for an angel is upon me! Angel, please tell me your name.
[P2:] My name is Abigail!
Baseline: my name is abigail. i am the king of this kingdom.
ALL: i am the queen’s daughter !
Gold: Abigail! Such a beautiful name. To what do I owe the pleasure of meeting you?
Table 6: Example generations from the baseline model and the proposed de-biased models. In these examples, the gold truth either contains no gendered words or only female-gendered words, but the baseline model generates male-gendered words.