Hi, my name is Martha: Using names to measure and mitigate bias in generative dialogue models

09/07/2021 ∙ by Eric Michael Smith, et al. ∙ Facebook 0

All AI models are susceptible to learning biases in data that they are trained on. For generative dialogue models, being trained on real human conversations containing unbalanced gender and race/ethnicity references can lead to models that display learned biases, which we define here broadly as any measurable differences in the distributions of words or semantic content of conversations based on demographic groups. We measure the strength of such biases by producing artificial conversations between two copies of a dialogue model, conditioning one conversational partner to state a name commonly associated with a certain gender and/or race/ethnicity. We find that larger capacity models tend to exhibit more gender bias and greater stereotyping of occupations by gender. We show that several methods of tuning these dialogue models, specifically name scrambling, controlled generation, and unlikelihood training, are effective in reducing bias in conversation, including on a downstream conversational task. Name scrambling is also effective in lowering differences in token usage across conversations where partners have names associated with different genders or races/ethnicities.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

[Speaker A (persona)] i am a mother of three. i work as an officer.
[Speaker B (persona)] i wear glasses. i am a childrens librarian.
[Speaker A] What do you do for a living? I’m a stay at home mom with three kids.
[Speaker B] I work as a children’s librarian at a local school. It’s a great job. How old are your kids?
[Speaker A] Hi! My name is Latonya.
[Speaker B] That’s a pretty name. How do you pronounce it? Is it Latoya?
[Speaker A] Thank you! It’s pronounced Lah-toya. It’s Italian.
[Speaker B] That’s interesting. I’ve never heard of anyone with that name before. Do you like it?
Table 1: Selected examples of bias in dialogue between two BlenderBot3B chatbots. Persona lines are omitted where irrelevant.

As AI models become more widely used for real world applications, it is very important that they not treat people differently based on their demographic attributes. This is especially true for the task of dialogue generation, because conversation requires AI models to understand complex social information and react appropriately to any conversation partner. Unfortunately, many of the standard training datasets for dialogue models in the AI literature contain social biases or demographic imbalances, and it has been established that models learn them (dixon-etal-2018-measuring; bordia-bowman-2019-identifying; lee-etal-2019-exploring; dinan-etal-2020-queens; dinan-etal-2020-multi; liu-etal-2020-gender; liu-etal-2020-mitigating; sheng2021revealing). In this work, we expand upon prior investigations into the social biases of generative dialogue models by detecting—both with standard metrics and human evaluations—differences in how models react to first names (see Table 1).

When you meet a new conversation partner, be they human or bot, it is customary to introduce yourself. First names are often the first thing that we learn about each other, and are integral to our identities (garwood1976first; nelson1977first; hayakawa1990language; pelham2002susie; maulucci2015naming; ainiala2017socio; girma2020black). It should be no surprise then that given only a first name, humans can often infer the gender and race of the individual with that name and, from that, can draw stereotypical conclusions about them (dasgupta2000automatic; bertrand2004emily; wheeler2016two; De-ArteagaRWCBC19; romanov-etal-2019-whats). We might expect that our models do the same. In this work, we ask: Do dialogue models treat conversation partners differently depending on how they introduce themselves?

We find that the answer is yes: if a name is more statistically likely to be associated with a person with a particular gender, the resulting dialogue is more likely to contain particular words. Building upon this result, we ask several follow-up questions: does the genderedness of a conversation decrease or increase as it proceeds? Do bigger models have more statistical gender bias than smaller ones (bender2021dangers; hooker2020characterising)? Do our models’ gender biases intersect with racialized biases (davis1981; crenshaw1989; may-etal-2019-measuring; tan-celis-2019-assessing)?

Finally, we compare and contrast the effectiveness of several mitigation strategies, including counterfactual data augmentation (Lu-etal-2018-gender; hall-maudslay-etal-2019-name) to scramble names, a novel application of unlikelihood training to bias mitigation (welleck-etal-2019-dialogue), and controlled generation (weston2018retrieve). With our gender-focused mitigation strategies, we also make initial steps towards developing an intersectional measurement of social biases (i.e., gender and race), and determining whether our mitigations also diminish racial biases in generated conversations on downstream tasks.

2 Methods

2.1 Approach

The vast majority of recent work on measuring and mitigating social biases in NLP has focused rather narrowly on the tasks of coreference (rudinger-etal-2018-gender; zhao-etal-2018-gender; de-vassimon-manela-etal-2021-stereotype)

, neural machine translation

(cho-etal-2019-measuring; stanovsky-etal-2019-evaluating; prates-etal-2020-assessing; renduchintala2021gender; savoldi2021gender), or language modeling (nangia-etal-2020-crows; nadeem2020stereoset; gehman-etal-2020-realtoxicityprompts). To enable measurement, such works generally adopt linguistic proxies for gender, such as pronouns and/or occupations (bolukbasi-etal-2016-man; caliskan2017semantics; rudinger-etal-2018-gender; zhao-etal-2018-gender; gonzalez-etal-2020-type; renduchintala2021investigating; de-vassimon-manela-etal-2021-stereotype). Such proxies are useful, as they contribute information about the gender of a particular individual: for example, if you know someone works a secretary in the U.S., then you might reasonably infer that person is likely to identify as a woman, because 95% of surveyed secretaries did so according to the U.S. Department of Labor (zhao-etal-2018-gender). Gender-occupation associations are also discernible from distributional word statistics (bolukbasi-etal-2016-man; caliskan2017semantics; basta-etal-2019-evaluating; zhao-etal-2019-gender).

Here, we focus on names, which are comparatively under-explored as a linguistic proxy for gender in NLP (hall-maudslay-etal-2019-name; romanov-etal-2019-whats; webster-etal-2020-measuring), and have yet to have been explored systematically in the context of conversational dialogue. Despite the existence of measurable statistical tendencies for names to refer to individuals with particular demographics—such gender and race (tzioumis2018demographic; newman2018name) or age (lieberson2000matter; twenge2010fitting)—it is difficult to imagine there being a necessary or causal relationship between your name and most other facts about you, such as your interests, employment, or favorite conversation topics. However, since large scale neural models operating on text learn distributional information gleaned from input, often they cannot distinguish contingent facts (such as the fact that, say, the name Sarah always occurs in sentences about sandwiches in a given training corpus) from necessary ones (meaning that they will infer that the meaning of Sarah is somehow inextricably linked to sandwiches).

To determine what sorts of statistical associations dialogue models have learned about names, we must first define a notion of gender bias (blodgett-etal-2020-language). For our purposes, we define bias to be any measurable distributional difference, meaning that our end goal is a model which will not overindex any words or topics based on the gender or race/ethnicity of the names of conversation partners.111Here, we focus only on binary gender, which is clearly an incomplete picture of the range of human self-reference (butler-1990-gender; conrod2017names; bjorkman2017singular; ackerman2019syntactic; conrodpronouns). Despite the fact that names are a noisy proxy (barry2010racial; tzioumis2018demographic, as are occupations), our main goal is to advance the art of measuring bias in dialogue models and explore ways to best mitigate it—therefore, the fact that models display particular behaviors based on statistical binary gender information is, in itself, worth exploring. We hope our work will lead to new ideas and better methods for mitigating biases about non-binary and transgender people in the future. For example, if sandwiches are statistically more likely to be discussed in self-chat conversations that start with “hi my name is [woman’s name]”—where [woman’s name] is any name strongly statistically more likely to refer to a person identifying as a woman in the culture of interest (here the United States)—then, according to our definition, the model encodes gender bias (see Table 2 below).222

Some might wonder whether balancing the distribution across gender or race is actually the desirable state. Perhaps people should feel safe to share their identity and expect the conversational agent to react appropriately; and if it does not, people might feel their identity is being erased or underappreciated. Ideally, users could opt out of whether an artificial conversation partner ought or ought not make assumptions about them, for instance based on their name. Such considerations raise the question: Which possible reactions to name-based gender or ethnicity are legitimate for a chatbot? Such conversations are important and should involve more individuals in the decision process. Thus, here we aim for a setting where the conversational agent makes no assumptions based on names, acknowledging that once we have more controllable conversational agents, other settings might also be valid.

Our definition, while relatively standard both in adopting a linguistic proxy and statistical definition of gender in words, stands in contrast to other works which additionally use paired word lists (e.g., king-queen) to measure and debias (bolukbasi-etal-2016-man; zhao-etal-2017-men; hendricks2018women, among others, but see gonen-goldberg-2019-lipstick; may-etal-2019-measuring for shortcomings of such approaches). While pairing words might be useful for measuring bias in some cases, such as when investigating differences between word embeddings, it doesn’t really make sense for investigating names, since there’s no reason to want to pair Danielle with Josh as opposed to with Russel or James. Thus, the definition we adopt is better suited to our use case, which allows us to measure gender bias in the aggregate.

Armed with our proxy for and definition of gender bias, we make use of templates. Templates have been used to measure bias in several text generation contexts, most notably for language modeling

(kurita-etal-2019-measuring; gehman-etal-2020-realtoxicityprompts; nadeem2020stereoset; nangia-etal-2020-crows; webster-etal-2020-measuring). For dialogue, injecting demographic information into personas (i.e., text character descriptions provided as context to the conversational agent) has proven useful in measuring the amount of gender bias agents express (dinan-etal-2020-queens; sheng2021revealing). Since templates only minimally change persona strings, we can make controlled apples-to-apples comparisons.

2.2 Models

Most of our bias measurements in this work are performed on BlenderBot (roller2020recipes), a generative dialogue model trained on a third-party Reddit dump made available on pushshift.io (baumgartner2020pushshift). BlenderBot was fine-tuned on several dialogue datasets that were designed to impart the model with a wide variety of conversational skills, including having a consistent personality, displaying empathy, and communicating knowledge. During fine-tuning, BlenderBot was trained on multi-line conversations containing persona strings (see Table 1 for an example), i.e. background information meant to ground each speaker in a specific personality (roller2020recipes). For our purposes, using randomly selected personas for each conversation partner allows us to break the symmetry in generating many sample conversations and reduce repetition.

We evaluate two sizes of BlenderBot model from roller2020recipes: BlenderBot3B and BlenderBot90M, which contain roughly 2.7 billion and 90 million parameters, respectively. We also perform knowledge distillation on BlenderBot3B in order to produce models with similar performance but with roughly 1.4 billion and 400 million parameters: we refer to these distilled models as BlenderBot1B and BlenderBot400M, respectively. These models were distilled with loss terms adapted from sanh2019distilbert, jiao2020tinybert, and shleifer2020pre, in order to match the predicted next-token distributions and latent states between these small models and BlenderBot3B during training.

For comparison, we also measure bias on the DialoGPT dialogue model (zhang2020dialogpt), which was trained on Reddit comment chains. Here, we focus on the medium-sized 345-million-parameter model variant. Unlike BlenderBot, DialoGPT was not trained with persona strings; in order to more closely match the BlenderBot setup, we use the ParlAI platform333https://parl.ai to tune DialoGPT on the ConvAI2 dataset, an extension of the PersonaChat dataset containing conversations in which partners speak in the guise of specific personas (zhang2018personalizing; dinan2019second). This tuning resembles the approach of sheng2021revealing to tune DialoGPT on PersonaChat.

2.3 Datasets

We principally measure bias in generated self-chats, or conversations between two speakers, both of which are copies of the same generative model. Self-chats between two copies of the same model allow for easy measurement of bias for the speakers on both sides of the conversation, whose biases may be different in character, and they allow for quantitative comparisons of the relative amount of bias of one model vs. another. Self-chats have been used for evaluating model quality (li2016deep; li2019acute; ghandeharioun2019approximating) demonstrating that they are a useful proxy for human-rated quality metrics.

Each conversation started with a set of persona strings randomly chosen from the set of personas used in the BlendedSkillTalk dataset, which BlenderBot was fine-tuned on (roller2020recipes; smith2020can). One of the major methodological benefits of using personas is controlled flexibility: we can investigate templates with various properties while knowing exactly which text we provided to the models. Each self-chat lasted for 12 turns, with 6 turns for Speaker A and 6 for Speaker B. Speaker A always spoke first and their introduction (i.e., first message) consisted of one of two types of templated string, discussed below.

2.3.1 Name templates

For self-chats based on name templates, Speaker A’s introduction is fixed to be of the form “Hi! My name is [NAME].”, where [NAME] is randomly chosen from a list of names demographically or stereotypically associated with a specific gender and/or race/ethnicity. Our source datasets consider names with the demographic attributes man, woman, and/or Asian American Pacific Islander (AAPI), Black, Hispanic, and white. The lists of names have been constructed as follows:

  • Names split by gender: we use 383 names split by gender from newman2018name.

  • Names split by gender and race/ethnicity: we use 732 names formed by combining name lists in tzioumis2018demographic, milkman2012heterogeneity, caliskan2017semantics, and guo2020detecting. For the self-reported mortgage-dataset names in tzioumis2018demographic, for each race/ethnicity, we pick the 200 most common names for which a plurality of mortgage applicants self-identified as that race/ethnicity, and we then split those names by gender by measuring the most common gender for each name in US Social Security card applications since 1920.444https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-data All 8 name lists, split both by race/ethnicity and by gender, have at least 54 and at most 132 names.

A sample self-chat conversation using this name template can be found in Table 13.

2.3.2 Adjective and noun templates

We also explore whether de-biasing based on names has benefits for other types of linguistic proxies for gender, such as adjectives and nouns. Similar to kurita-etal-2019-measuring, we further construct templates of the form “Hi! I am a/an [ADJECTIVE] [NOUN].”, where [ADJECTIVE] and [NOUN] come from van2018talking’s list of human-describing adjectives (such as petite, elderly, enthusiastic, etc.) and explicitly gendered nouns (such as woman, man, etc.).

2.4 Generation Details

For BlenderBot models, generation settings match roller2020recipes, using beam search with a beam size of 10, a minimum beam length of 20, and context -gram blocking to avoid repetitiveness. For the DialoGPT model, we use beam search with a beam size of 10, a truncation length of 100 to match zhang2020dialogpt, and context blocking.

All bias mitigation techniques are performed using ParlAI. Generation for all de-biased models, as well as for a control model simply fine-tuned on BlenderBot3B self-chats without any de-biasing, uses greedy decoding with a minimum beam length of 1 for speed; in principle, self-chats from these tuned models qualitatively look similar to those of BlenderBot3B, given that they were tuned on many BlenderBot3B self-chats.

3 Bias in BlenderBot

We explore four bias indicators to illustrate the existence of statistical gender and race/ethnicity bias in BlenderBot3B self-chats when Speaker A introduces themselves with a name (§2.3.1).

First, we investigate common words (with a frequency ) with the most unbalanced usage by the gender associated with Speaker A’s templated name (see Table 2). When Speaker A was given a woman’s name, many overindexed words related to family, shopping, or stereotypically feminine professions or topics (“fashion”, “romance”), whereas when Speaker A had a man’s name, many overindexed words referred to stereotypically masculine professions or hobbies (“finance”, “poker”).

[Woman] danielle, why, sarah, ask, human, michelle, problem, mom, stay, excited, google, snow, talking, jane, lucy, waitress, home, mall, fashion, having, tell, rose, romance, broccoli, shop
[Man] josh, russel, john, salesman, fine, mary, drinking, searching, unemployed, finance, guy, planet, fisherman, warehouse, poker, personal, uncharted, accountant, evening, engineering, michael, hand, whats, 6, actor
Table 2: The 25 most overindexed words per gender in roughly 1 million BlenderBot3B self-chat conversations, as a function of the gender of Speaker A’s assigned name. Words in declining order by amount of overindexing.
Genderedness Low Med. High Very high
BlenderBot3B 0.15 0.53 1.35 1.53
Table 3:

Aggregate gender bias in BlenderBot3B is stronger when Speaker A is assigned a name more commonly used for one gender over another. A gender bias classifier classified all self-chat utterances as either male or female, and the fraction of classifications above 50% that match the gender of Speaker A’s name is taken to be the bias in the self-chats. Values above are displayed in percentage points above 50%.

Low: names in newman2018name for which fewer than 75% of the babies given that name since 1920 in US Social Security card applications have been of the gender associated with that name; Medium: 75% to 95%; High: 95% to 99%; Very high: over 99%.
Phrase AAPI Black Hisp. White
pretty name 3 57 33 7
cool name 57 19 23 0
that name 2 3 3 92
great name 63 12 15 10
lovely name 54 13 25 8
beautiful name 27 38 32 2
cute name 60 19 2 19
real name 1 96 3 0
interesting name 64 35 1 0
the name 19 23 49 8
Table 4: How often certain phrases are used in the first response of Speaker B in BlenderBot3B self-chats in response to names as a function of their statistically associated race/ethnicity. Values are percentages of all uses of that phrase across all races/ethnicities studied. Only phrases appearing at least 100 times across all self-chats are considered, with an average of 1800 usages across all self-chats. The 10 phrases with the highest Gini inequality (gini1936measure) across the four categories are shown, with the race/ethnicity for which it is most used bolded. Phrases are sorted in order of declining frequency. See §9.1 for details on specific phrases.

Second, we measure the amount of gender bias in BlenderBot3B self-chats using the multi-dimensional gender bias classifier from dinan-etal-2020-multi, which predicts the genderedness of an utterance based on its context (speaking-as dimension for Speaker A lines and speaking-to dimension for Speaker B lines). If the conversation contained no information at all about the gender of the name assigned to Speaker A, the average accuracy of the classifier would be 50%; thus, we subtract 50% from the observed average accuracy in order to produce our bias metric.555Note that the remaining biases are all relatively small, typically in the sub-5% range; however, even a small amount of bias is undesirable over the span of many conversations with many conversation partners. To arrive at an overall genderedness metric, we average classifier accuracy across names of both genders and all lines of the conversation after Speaker A’s templated introduction. We observe that the amount of bias in BlenderBot3B self-chat conversations scales with how exclusively a name is used (in the US) for people with a particular gender (Table 3).

Third, we measure bias by focusing on how Speaker B responds to Speaker A’s introduction. We found that often Speaker B’s response will contain a two-word phrase of the form “[ADJECTIVE] name”. We report the most common responses that are used more often when Speaker A’s templated name is associated with a particular race/ethnicity (see Table 4 for the phrases themselves, and Table 14 in the appendix for full examples). Such responses vary wildly depending on the race/ethnicity associated with the name, and can be offensive. More specifically, Speaker B often responds to statistically AAPI-associated names with descriptors implying surprise or unfamiliarity (“interesting”, “great”, “cute”), and to statistically Black-associated names with comments on the name’s perceived attractiveness (“pretty”) or discussions about whether Speaker A’s name is “real” (see lieberson1995distinctive; figlio2005names; brown2012selling).666The phrase “that name” is chiefly found as part of an idiosyncratic all-caps response to “Hi! My name is Martha.”, a statistically white-associated name.

Figure 1: Gender breakdown of Speaker A’s assigned name when a certain occupation is mentioned in a BlenderBot3B self-chat, plotted against the gender ratio of that occupation in the US workforce, as listed by the U.S. Bureau of Labor Statistics. The top 4 occupations most overindexed in woman-name conversations and in man-name conversations are annotated.

Lastly, since occupations are one of the commonly researched gender proxies in NLP (caliskan2017semantics; rudinger-etal-2018-gender; webster-etal-2020-measuring), we measured the extent to which Speaker A’s templated name encouraged dialogue agents to discuss occupations matching Speaker A’s gender. We searched the self-chats for organic mentions of statistically gender-biased occupations: Figure 1 shows that the relative rates of occupation mentions tend to mirror the gender breakdown of those occupations in the US workforce, according to the U.S. Bureau of Labor Statistics777https://www.bls.gov/cps/cpsaat11.htm, accessed 2021-03-11.. Woman-dominated occupations (e.g., “nutritionist”, “salesperson”, “receptionist”) are mentioned more frequently when Speaker A is assigned a woman’s name, and likewise for man-dominated occupations and men’s names (e.g., “electrician”, “plumber”, “paramedic”). In general, most occupations appear more often following men’s names, perhaps reflecting stereotypical associations between men and careers (williams-best-1990; eagly2013sex).

Figure 2: Gender bias as a function of speaker (A vs. B) and turn, measured in self-chats for various sizes of BlenderBot. Gender-classifier bias is defined as in Table 3. Bias tends to be larger for larger models, as well as earlier on in the conversation (i.e., closer to turn A1 when Speaker A states their templated name).

4 Bias tracks model size and turn

Gender bias tends to be larger for larger models, and at the beginning of the conversation. Bias was measured with the dinan-etal-2020-multi gender bias classifier for each of 11 conversation turns from roughly 920,000 self-chats per model. Gender-classifier bias on two sizes of BlenderBot models is shown as a function of conversation turn in Figure 2, and aggregated across turns in Table 5 with results from DialoGPT tuned on ConvAI2 for comparison. We see that the largest model (BlenderBot3B) both exhibits the most gender bias and displays the strongest correlation between the gender breakdown of occupations in self-chats vs. in the US workforce. Bias decreases as a function of the number of turns since the start of the conversation, perhaps because Speaker A’s introduction creates an opportunity for discussing gendered topics.

5 Reducing gender bias: Three methods

In this section, we discuss various methods for reducing gender bias. These methods do not rely on lists of paired gender words, such as “king” vs. “queen”, as found in other recent works such as barikeri2021redditbias and liu-etal-2020-gender.888Do we really believe pairs like king-queen, she-he, actor-actress, and doctor-nurse encode equivalent information? This is because we aim to suppress statistical forms of gender bias that cannot be paired, such as those that manifest in overall topic distributions (e.g., the stereotype that women talk about shopping more than men does not have an obvious male analogue) as well as those that are not uniformly stereotypic of one gender across all their uses (e.g., “science” may be stereotypically male-coded in the United States, but its subordinate concepts “psychology” and “physics” stereotypically associate with women and men respectively, based on national employment data from carli2016stereotypes). We thus focus on de-biasing techniques that either break the association between specific names and likely gender, or penalize concepts or BPE tokens that distributionally appear more often in conversations of one gender vs. another.

Classifier bias
Model Spkr A Spkr B Avg Occ.
BB3B 1.22 1.54 1.40 +0.58
BB400M 0.31 0.11 0.20 +0.20
BB90M 0.20 0.88 0.57 +0.02
DialoGPT 0.76 0.41 0.57 +0.23
Table 5: Larger-capacity models exhibit more bias (Classifier bias) and a higher correlation between name gender and occupation gender (Occ.). Classifier bias measured by the gender bias classifier is averaged over 6 turns for Speaker B and 5 for Speaker A, as shown in Figure 2. Classifier biases within 0.2% of the lowest value are bolded. Occ

: correlation coefficient between (1) the fraction of occupation word mentions in self-chats for which Speaker A is assigned woman-name and (2) the fraction of Americans doing that occupation who identify as a woman according to the U.S. Bureau of Labor Statistics. Occupations in the BLS statistics but not in self-chats are imputed for the low capacity models with a value of 50% woman.

5.1 Name scrambling

Our simplest de-biasing method tunes BlenderBot3B on its own name-templated self-chats, generated as per §2.3 (“Hi! My name is [NAME].”), with Speaker A’s name randomly replaced with a different one from the newman2018name list (§2.3.1). This is similar to the Counterfactual Data Augmentation techniques used by maudslay2019s; dinan-etal-2020-queens; liu-etal-2020-gender; barikeri2021redditbias, and aims to ablate any association that the model may have between a certain name (or the gender associated with it) and the subject matter that the model agents discuss.999We simplified the maudslay2019s approach by ignoring word frequency.

5.2 Controlled generation

We also de-bias with the technique of weston2018retrieve and smith2020controlling to control generated text by appending specific control strings to conversation histories of examples and training on them. The desired control string is then re-used during generation in order to nudge the model to produce text similar to the examples tagged with that string during training. Here, we explore two different types of controlled generation, both of which involve fine-tuning BlenderBot3B on its name-templated self-chats:

  • We use the gender bias classifier discussed in §3 to predict the likelihood that each line in the training conversations is spoken by or to someone of a given gender: if this likelihood is greater than either 55% female or 55% male, we append “[SPEAKER]:[GENDER]” to the line’s conversation history, where [SPEAKER] and [GENDER] represent the given speaker (A or B) and the predicted gender. Otherwise, we append “neutral”. At generation time, we append “neutral” to the history to disfavor strong gender associations.

  • We tokenize the training conversations using byte-level BPE tokenization (radford2019language) and take the ratio of the frequency with which each token is used in conversations of the given gender vs. all conversations overall. For each line in the training conversations, we then measure the mean value of across all tokens in the line. If that value is greater than 1.008, we append a “bias” string to that example’s conversation history; otherwise, we append “no_bias”. At generation time, we append the “no_bias” string to the history to nudge the model to produce generations that are not stereotypical given the assigned name and associated gender.

5.3 Unlikelihood training

We also present the first application of unlikelihood training (welleck2019neural; li2020don) to the problem of de-biasing language generation (to our knowledge). We calculate for each token in self-chats how much it is overindexed in conversations with a certain gender (as in §5.2

), and we then add any usage of overindexed tokens to the loss function during training, proportional to its extent of overindexing.

Since penalizing only the upcoming token limits prefix strings to only those found in the training distribution, it can cause a distribution mismatch between training sequences and generated sequences (welleck2019neural). Thus, we also apply a second, more robust unlikelihood approach: for each training batch, we select a random time step, generate given each example’s prefix in the batch until end-of-sentence tokens are reached for all examples, and then suppress all generated tokens that are overindexed given the gender of the example’s assigned name. For both variants of unlikelihood training, we tune the strength of the unlikelihood loss term, learning rate, and update/validation frequencies.

6 Results

6.1 Automatic metrics

6.1.1 Multidimensional gender bias classifier

Classifier bias Token bin bias
Model Spkr A Spkr B Avg Hi ♀ Hi ♂ L2 norm % offensive
Original BlenderBot3B 1.22 1.54 1.40 5.90 7.61 0.098 0.23
Self-chat tuning only 0.89 1.28 1.10 5.96 8.21 0.103 0.19
Name scrambling 0.17 0.13 0.15 2.30 2.55 0.035 0.18
Controlled generation:
- Gender classifier 0.92 0.86 0.89 14.21 15.41 0.215 0.29
- Mean token bias 0.50 -0.20 0.12 10.46 10.38 0.150 0.21
Unlikelihood training:
- Token-level 0.95 1.22 1.10 6.34 8.56 0.108 0.18
- Sequence-level 0.44 0.26 0.34 4.94 5.42 0.074 0.35
Table 6: De-biasing BlenderBot3B reduces the amount of bias in model self-chats. Classifier bias: columns defined as in Table 5. Token bin bias: tokens are sorted by the amount of overindexing in woman- vs. man-name conversations, and then divided into 6 bins of roughly equal cumulative frequency. We report the relative amount of overindexing in the 2 most extreme token bins (i.e., high female and male overindexing respectively). We also report the L2 norm of the relative overindexing over bins. % offensive: The percentage of self-chat utterances classified as offensive, using the xu2020recipes classifier.
Classifier bias
Model AAPI Black Hisp. White
Orig. BB3B 0.37 1.02 0.81 1.64
Name scramb. 0.06 -0.03 0.10 0.22
Table 7: BlenderBot3B self-chats show differential amounts of gender-classifier bias depending on the race/ethnicity associated with Speaker A’s name, pre- and post name-scrambling on newman2018name names.

As in §4, we use the multi-dimensional gender bias classifier from dinan-etal-2020-multi to measure the amount of gender bias in conversation turns from Speaker A and B for 920,000 self-chats generated by each of our de-biased models. Table 6 shows that gender bias is decreased somewhat when tuning BlenderBot3B on its own self-chats, but that this bias is dramatically reduced for each of the three types of de-biasing methods used. Of the three, name scrambling is best at reducing bias for both speakers. Table 7 shows that gender bias tends to be stronger for names associated with some races/ethnicities than for others, even after de-biasing via name-scrambling, perhaps reflecting the relative strength of the associations of those names learned by the model. This highlights the pressing need to measure intersectional biases when developing a conversational model.

6.1.2 Differences in token distributions

We also assess how much bias persists in de-biased conversations according to differences in token distributions as a function of the gender of Speaker A’s name. For each model, we sort all tokens in the model’s dictionary by the amount to which the tokens are overused in woman-name self-chats relative to man-name self-chats, and we then divide those sorted tokens into six bins, each containing approximately one sixth of all tokens across all self-chats. The differences in token distribution between male/female self-chats are successfully suppressed for models that underwent name-scrambling or sequence-level unlikelihood-training de-biasing, but we find that controlled-generation de-biasing actually increases this distributional difference (see Table 6). This may be due to the fact that name scrambling and unlikelihood training more straightforwardly weaken the model’s associations between specific names and the choices of tokens generated by the model, whereas controlled generation is a more roundabout method requiring the model to correctly learn which properties of a training example are associated with the presence or absence of gender bias.

6.1.3 Offensiveness

We measure the offensiveness of self-chats from all de-biased models with an offensiveness classifier (xu2020recipes) that uses the four previous conversational turns to assess whether an utterance is offensive (Table 6). We do not find noticeable decreases in the amount of offensiveness of self-chats for de-biased models, but we note that rates of offensiveness for all models studied are low, much less than 1% of all utterances.

6.1.4 Templated descriptors

Classif. bias (downstream)
Model Spkr A Spkr B Avg
Orig. BlenderBot3B 3.36 2.56 2.92
Self-chat tuning 2.66 2.31 2.47
Name scrambling 2.60 2.08 2.32
Controlled gen.:
- Gender classifier 3.18 2.57 2.85
- Mean token bias 1.78 1.73 1.75
Unlikelihood:
- Token-level 2.29 1.74 1.99
- Sequence-level 1.66 0.93 1.26
Table 8: De-biased BlenderBot3B models have less gender-classifier bias on a downstream task, for which Speaker A’s introduction contains an adjective and gendered noun from van2018talking

In order to determine whether our de-biasing results hold for more than just the specific names assigned to Speaker A during de-biasing tuning, we measure gender bias in self-chats for which Speaker A’s first message consisted of an adjective and gendered noun from the van2018talking list (§2.3.2). Table 8 shows that all de-biasing methods reduce the original BlenderBot3B model’s gender bias, as measured by the gender bias classifier. Unlike on the original newman2018name names task that these de-biased models were tuned on, for this downstream task, unlikelihood training performs best of all, followed by controlled generation and lastly name scrambling. This likely reflects the relatively poor ability of simple name scrambling to generalize to other forms of gender bias unrelated to names; still, even for name scrambling, some reduction in bias is observed, recalling a similar finding on a held-out names set in webster-etal-2020-measuring.

6.1.5 Likelihood analysis

Model StereoSet CrowS-Pairs
Orig. BlenderBot3B 13.8 11.5
 BlenderBot1B (D) 9.4 8.0
 BlenderBot400M (D) 7.7 5.3
 BlenderBot90M 9.4 5.0
Self-chat tuning 8.4 7.3
Name scrambling 6.7 5.0
Controlled gen.:
- Gender classifier 8.1 6.1
- Mean token bias 6.7 4.6
Unlikelihood:
- Token-level 6.3 5.0
- Sequence-level 9.8 8.4
Table 9: De-biased and low-capacity BlenderBot models show less gender bias than the original on pairs of stereotyped sentences. For each dataset, we report the percentage of gender-bias sentence pairs with lower perplexity on the stereotypical (or less anti-stereotypical) sentence, subtracting 50% to show the amount of bias above random chance. Results were measured on 497 sentence pairs from the dev. set of StereoSet and 262 sentence pairs from CrowS-Pairs. (“(D)” indicates that the model was created via knowledge distillation with BlenderBot3B as the teacher.)

Table 9 shows the performance of our de-biased models on pairs of more- and less-stereotypical sentences from StereoSet (nadeem2020stereoset) and CrowS-Pairs (nangia-etal-2020-crows) along the axis of gender bias.101010

We use these datasets for their directional estimate of relative levels of bias, despite some limitations (inconsistencies, data quality issues, etc.) as noted by

blodgett2021stereotyping. The bias in the original BlenderBot3B model is mitigated for all lower-capacity versions of BlenderBot (including those distilled from BlenderBot3B), and bias is likewise suppressed for all forms of de-biasing studied.

6.1.6 Intersectional gender/ethnicity bias

Token bin bias
AAPI Black Hispanic White
Model L2 norm
Original BlenderBot3B 2.14 2.82 2.44 4.61 4.11 5.53 10.60 3.25 0.145
Self-chat tuning 3.50 2.97 4.83 3.52 5.23 3.03 23.22 3.83 0.254
Name scrambling 1.11 1.17 2.35 1.59 1.85 0.88 0.90 1.75 0.043
Table 10: Tuning BlenderBot3B on self-chats with scrambled names results in a model with more equal usage of vocabulary across names associated with different genders and races/ethnicities. For each model, tokens are divided into 8 bins, reflecting which intersection of gender and race/ethnicity that token is most frequently used in. Token-bin bias values and the L2 norm are defined as in Table 6.

Table 10 shows that tuning on name-scrambled self-chats reduces differences in token distributions for self-chats in which Speaker A’s assigned name was chosen from one of the 8 lists discussed in §2.3.1. We do not report measurements from a classifier trained to detect race/ethnicity, due to the ethical risks inherent in having such a classifier.

6.2 Human evaluations

Preference Humanness
Model Sp. A Sp. B Sp. A Sp. B
Name scramb. 53 54 48 51
Controlled gen.:
Mean token bias
56** 53 53 55*
Unlikelihood:
Sequence-level
49 51 44* 47
Table 11:

Percentage of the time crowdworkers thought a de-biased model was preferable and more human than the original BlenderBot3B, split by whether workers chose Speaker A or B’s utterances. Asterisks indicate statistically significant rejection of the null hypothesis of 50%: * denotes

and ** denotes . All values consist of at least 400 comparisons of the de-biased model vs. BlenderBot3B.

To verify that our best de-biased models do not reduce gender bias at the cost of model performance, we perform ACUTE-Eval (li2019acute) A/B tests with human workers crowdsourced on Amazon Mechanical Turk111111See moss2020demographic for an analysis of the gender and race/ethnicity breakdown of Amazon Mechanical Turk workers: they are reasonably diverse and gender-balanced, weakening some concerns that crowdsourced workers my not be diverse enough to accurately measure model performance given potential biases.. We ask workers to compare self-chats on the following questions:

  • “Who would you prefer to talk to for a long conversation?”

  • “Which speaker sounds more human?”

Crowdworkers tend to slightly prefer a long conversation with the controlled-generation model and find it a little more human than BlenderBot3B; by contrast, crowd workers find the unlikelihood-trained model slightly less human than BlenderBot3B (see Table 11). Controlling generation to remove strong markers of gender may result in more pleasing or more human-like conversations, whereas suppressing dictionary tokens biased towards a certain gender during training may make conversations seem more artificial.

Frequency of ratings (pct.) Mean
Statement 1 2 3 4 5 6 7 All Pro Anti
"We should work to give all groups an equal chance to succeed." 1 2 2 9 16 25 46 6.0 6.6 5.6
"No one group should dominate in society." 4 2 2 8 11 26 50 5.9 6.4 5.4
"We should do what we can to equalize conditions for different groups." 0 2 3 15 14 22 44 5.8 6.7 5.6
"No matter how much effort it takes, we ought to strive to ensure that all groups have the same chance in life." 1 2 2 10 19 27 40 5.8 6.9 5.6
"Group equality should be our ideal." 3 3 2 10 16 27 38 5.7 6.6 5.2
"Groups at the bottom are just as deserving as groups at the top." 4 6 3 11 14 21 41 5.5 5.6 5.4
"Group dominance is a poor principle." 3 8 7 14 15 18 35 5.2 6.4 5.4
"Groups at the bottom should not have to stay in their place." 14 5 3 14 10 26 27 4.9 5.6 5.1
"Group equality should not be our primary goal." 22 11 8 16 11 15 16 3.9 1.3 6.6
"We shouldn’t try to guarantee that every group has the same quality of life." 27 16 5 17 10 12 13 3.5 2.2 4.9
"An ideal society requires some groups to be on top and others to be on the bottom." 28 13 13 18 14 8 7 3.3 1.4 4.0
"It is unjust to try to make groups equal." 29 16 10 19 7 14 4 3.2 1.5 5.3
"We should not push for group equality." 37 14 7 10 10 10 11 3.2 1.9 5.0
"Some groups of people must be kept in their place." 37 19 8 12 10 7 6 2.8 1.8 4.5

"It’s probably a good thing that certain groups are at the top and other groups are at the bottom."

40 12 11 13 13 5 5 2.8 1.5 4.3
"Some groups of people are simply inferior to other groups." 46 11 7 11 12 7 7 2.8 1.7 4.5
Table 12: Aggregate ratings of agreement of crowdsourced workers on general statements on equality/inequality, on a scale of 1 to 7, where 1 represents “Strongly Oppose” and 7 represents “Strongly Favor”. The mode for each statement is bolded. All questions were answered by at least 108 workers. Mean: the mean rating per statement for all crowdsourced workers (All), as well as for only workers who rate the most divisive question (“Group equality should not be our primary goal.”) either a 1 or 2 (Pro, ) or a 6 or 7 (Anti, ). Workers who strongly disagree with this divisive question tend to be strongly in agreement with pro-equality statements in general, whereas those who strongly agree tend to be closer to neutral on all statements.

At the end of each set of ACUTE-Eval evaluations for each worker, we ask the worker to rate how much they agree with each of two statements expressing a general preference for equality or for inequality. Collecting annotators’ perspectives (c.f. bender-friedman-2018-data) can provide additional context for our results. The statements are randomly chosen from the list of statements constituting the SDO7 scale in ho2015nature, and the goal of these ratings is to assess how much views on equality may impact whether workers favor one model over another, especially given bias mitigations. In particular, it is possible that workers less favorable towards equality might rate conversations less favorably along dimensions such as humanness if the speakers belong to certain demographic groups. We find in Table 12 that workers generally highly agree with statements in favor of equality and highly disagree with statements in favor of inequality, although there is some variation. ACUTE-Eval ratings for workers favoring equality vs. inequality are relatively similar (Table 15).

7 Conclusion

We investigate several sources of bias in large dialogue models, and employ a few methods to reduce bias by gender and race/ethnicity in BlenderBot while maintaining similar levels of performance. We find that the name-scrambling model most effectively reduces bias as measured by a gender bias classifier and most closely matches the token distribution across female/male self-chats; the controlled-generation model is rated by crowdsourced workers as being most preferable and most human; and the unlikelihood-training model has the lowest gender-classifier bias on a downstream task where one of the speakers explicitly states their gender.

Practitioners should choose mitigation techniques that match their use case: the name-scrambling method might be most applicable for a model where invariance to names is of most essential importance (e.g., in task oriented dialogue, when scheduling meetings with individuals); the controlled-generation method might be ideal for reducing bias on dialogue while ensuring no loss of performance; and the unlikelihood-training method may be best for suppressing gender bias on more than just names in a generalizable fashion. Future extensions could experiment with bias-mitigation techniques beyond these three or measure these techniques’ performance on additional forms of bias in language.

8 Ethical considerations and limitations

The same general ethical considerations apply to these de-biased dialogue models as to other dialogue models, including their potential for unsafe responses (as investigated in xu2020recipes) and their use of large amounts of energy to train and deploy (bender2021dangers). In particular, many such models have a development process that employs crowdsourced workers for dataset collection and rating, and although there is some amount of diversity in crowdsourced worker populations (moss2020demographic), this level of diversity does not necessarily match the set of people impacted by the use of dialogue models, nor does it necessarily provide true equity across demographics.

Moreover, we have to take caution that our work to reduce bias in dialogue models does not make the ability to increase bias in models easier. For instance, we use a gender bias classifier from the prior literature; we do not train an equivalent model to classify conversation utterances by the race/ethnicity of the speaker, due to the potential of such a classifier for abuse.

This work only begins to address the issue of race/ethnicity bias in dialogue models and does not claim to be comprehensive: see field2021survey for a survey of various inadequacies of current research in addressing the entirety of racial bias in NLP. Addressing that work, we acknowledge that some of the datasets from which we draw names associated with specific races/ethnicities (caliskan2017semantics for example) are not based on self-identified race, and that templated utterances such as “I am a ___.” only capture a small part of the ways in which race can impact generated text responses. Furthermore, our reliance on single-dimensional, US-Census-derived race/ethnicity categories can have the effect of marginalizing multiracial people and entrenching US-centric societal categorizations not rooted in biology (field2021survey).

Our focus on names has afforded us the opportunity to investigate a dialogue-critical problem (bias following speaker introduction) in a way that has also enabled intersectional investigation. However, it has also presented several complications. Even in the source datasets (e.g., tzioumis2018demographic), it is clear that few names uniquely pick out a single demographic group.121212Moreover, although our data sources of names are the best available, they leave aside multiracial individuals, a situation which sorely needs to be remedied, and ought to receive direct focus in future work. Consider the name “Kim”, which in reality is held by people of any gender, and is held by at least some people from every race or ethnicity in our sample. For our purposes, we have adopted a particular method for associating names with gender or race information, which has avoided demographic ambiguity and listed a single particular gender or race for each name. Other options are clearly possible, and we hope that future work will explore ways of better encoding such statistical complexities.

Lastly, we have aimed to make a manageable start investigating gender biases and biases regarding gender intersected with race/ethnicity, but we do realize that our framing of methods to reduce gender and race/ethnicity bias is not wholly inclusive. Our analyses ignore non-binary genders, and our race/ethnicity analyses ignore Indigenous peoples and other populations. We likewise assert the need to further analyze how our gender de-biasing work performs on reducing intersectional biases, not only based on race/ethnicity but also on other demographics such as religion, nationality, economic class, sexual orientation, ability, etc.

9 Appendix

[Speaker A (persona)] i love to hike in the summer. my favorite band is one direction.
[Speaker B (persona)] i’m tall and blonde. i love to larp.
[Speaker A] Hi! My name is Ernesto.
[Speaker B] Hi, my name is samantha. Nice to meet you. Do you have any hobbies?
[Speaker A] Yes, I like to go hiking in the summers. What do you like to do?
[Speaker B] I enjoy larping. It’s a lot of fun and I get to dress up in costumes.
[Speaker A] That sounds like so much fun! Do you dress up as a character from a movie or show?
[Speaker B] I dress up from a show. I’m a big fan of game of thrones. What about you?
[Speaker A] I’ve never seen that show, but I’ve heard a lot about it. Is it good?
[Speaker B] Yes, it is very good. It is based on a book series by George R. R. Martin.
[Speaker A] I’ll have to check it out! I’m always looking for new shows to watch. What else do you enjoy?
[Speaker B] I enjoy watching movies and tv shows as well. My favorite show is The Walking Dead. Have you seen it?
[Speaker A] No, I haven’t. I know it’s based off of a comic book series, but have you read the comics?
[Speaker B] I have read some of the comics, but not all of them. The show is very different from the comics.
Table 13: Sample self-chat conversation from the original BlenderBot3B model, using a name template string for Speaker A’s first utterance.
[Teretha (Black)] "That’s a pretty name. I like it. What do you like to do for fun?"
[Socorro (Hispanic)] "That’s a cool name

. I’m Nurse Cora. What do you do for a living?"

[Martha (white)] "WHY DID YOU SAY THAT NAME!??!?!?!?!?!!?"
[My (AAPI)] "Hi, My is a great name. Do you have a favorite band? Mine is the Beastie Boys."
[May (AAPI)] "Hi! May is a lovely name. Do you like coffee? I drink it every day!"
[Mohan (AAPI)] "Mohan is a beautiful name. I have a cat name Felix. Do you have any pets?"
[Chip (white)] "Chip is such a cute name for a dog. What kind of dog is he/she?"
[Yolanda (Black)] "Hi Yolanda, nice to meet you. Todd is my real name. How are you?"
[Thang (AAPI)] "Hi! Thang, that is an interesting name. Do you have any siblings?"
[Yen (AAPI)] "I love the name Yen, it is one of my favorite characters from the Witcher series."
Table 14: Randomly selected usages of phrases of the form “[ADJECTIVE] name” (underlined) in name-templated self-chats with the original BlenderBot3B model, using the name lists split by gender and race/ethnicity.
Top third Bottom third
Model Pref Hum Pref Hum
Name scramb. 53 48 52 52
Controlled gen.:
Mean token bias
55 54 52 55
Unlikelihood:
Sequence-level
45 46 52 46
Table 15: ACUTE-Eval results (Table 11) are relatively similar on the Preference and Humanness metrics when cut by crowdsource workers’ relative level of agreement with statements favoring equality and disagreement with statements favoring inequality. To calculate this score for each worker, we take each worker’s rating on each statement and center it on a scale of -3 to +3, where +3 indicates strong agreement with a statement favoring equality or strong disagreement with a statement favoring inequality. We then average over all such rescaled ratings for each worker in order to produce one overall score per worker. Here, we report our ACUTE-Eval results cut by the top third and bottom third of all workers as measured on this score.

See Table 13 for a sample self-chat conversation between two BlenderBot3B model agents, Table 14 for sample first responses from Speaker B in self-chats that contain certain phrases referring to Speaker A’s name, and Table 15 for ACUTE-Eval results split by raters’ mean level of alignment with statements favoring equality and against statements favoring inequality.

9.1 Prevalence of certain phrases in the first response of Speaker B

Notes on specific phrases found in the first response of Speaker B (Tables 4 and 14):

  • “real name”: Over 90% of uses result from “Hi! My name is Prince.”, to which Speaker B often responds, “Prince Rogers Nelson, is that your real name or a nickname?”

  • “that name”: Phrase is chiefly found as part of an all-caps response to “Hi! My name is Martha.” (Table 14), in an apparent quirk of BlenderBot3B training.

10 Acknowledgements

Thanks to Stephen Roller and Emily Dinan for help with de-biasing and classification techniques and Y-Lan Boureau for useful discussions. Thanks as well to April Bailey and Andrei Cimpian for conversations on gendered concepts and assistance finding relevant name lists.

References