On Measuring and Mitigating Biased Inferences of Word Embeddings

08/25/2019 ∙ by Sunipa Dev, et al. ∙ 0

Word embeddings carry stereotypical connotations from the text they are trained on, which can lead to invalid inferences. We use this observation to design a mechanism for measuring stereotypes using the task of natural language inference. We demonstrate a reduction in invalid inferences via bias mitigation strategies on static word embeddings (GloVe), and explore adapting them to contextual embeddings (ELMo).




1 Introduction

Word embeddings have become the de facto feature representation across NLP (e.g., parikh2016decomposable; seo2016bidirectional). Their usefulness stems from their ability to capture background information about words from large corpora, either as static vector embeddings (e.g., word2vec, GloVe; Mik1; Art3) or as contextual encoders that produce embeddings (e.g., ELMo, BERT; Peters:2018; bert).

However, besides capturing lexical semantics, word embeddings can also encode real-world biases about gender, age, ethnicity, etc. To probe for such biases, several lines of existing work (e.g., debias; Caliskan183; ZhaoWYOC17; Bias1) rely on measurements intrinsic to the vector representations, which, despite their usefulness, have two key problems. First, there is a mismatch between what they measure (vector distances or similarities) and how embeddings are actually used (as features for deep neural networks). Second, while today’s state-of-the-art NLP systems are built using contextual word embeddings like ELMo or BERT, the tests for bias are designed for word types, and do not easily generalize to word token embeddings.

In this paper, we present a strategy for probing word embeddings for biases. We argue that biased representations lead to invalid inferences, and that the number of invalid inferences supported by word embeddings (static or contextual) measures their bias. To concretize this intuition, we use the task of natural language inference (NLI), where the goal is to ascertain if one sentence (the premise) entails or contradicts another (the hypothesis), or if neither conclusion holds (i.e., they are neutral with respect to each other).

As an illustration, consider the sentences: The rude person visited the bishop. The Uzbekistani person visited the bishop. Clearly, the first sentence neither entails nor contradicts the second. Yet, the popular decomposable attention model parikh2016decomposable built with GloVe embeddings predicts that the first sentence entails the second with a high probability! Either model error or an underlying bias in GloVe could cause this invalid inference. To study the latter, we develop a systematic probe over millions of such sentence pairs that target specific word types like polarized adjectives (e.g., rude) and demonyms (e.g., Uzbekistani).

A second focus of this paper is bias attenuation. As a representative of several lines of work in this direction, we use the recently proposed projection method of Bias1, which simply identifies the dominant direction defining a potential bias (e.g., gender) and removes it from all embedded vectors. Specifically, we ask: does debiasing help attenuate bias in static embeddings (GloVe) and contextual ones (ELMo)?

Our contributions.

Our primary contribution is the design of natural language inference-driven probes to measure the effect of specific biases. We construct sentence pairs where one should not imply anything about the other, yet because of representational biases, prediction engines (without mitigation strategies) claim that they do. To quantify this, we use model probabilities for entailment (E), contradiction (C) or neutral association (N) for pairs of sentences. Consider, for example, the three sentences: The driver owns a cabinet. The man owns a cabinet. The woman owns a cabinet. The first sentence neither entails nor contradicts the second or the third. Yet, with the first sentence as premise and the second as hypothesis, the decomposable attention model predicts probabilities E: 0.497, N: 0.238, C: 0.264; the model predicts entailment. In contrast, with the first sentence as premise and the third as hypothesis, we get probabilities E: 0.040, N: 0.306, C: 0.654; the model predicts contradiction. The two premise-hypothesis pairs differ only by a gendered word. We expand on this idea to build a suite of tasks that also capture representational bias in nationality and religion.

We also define aggregate measures that quantify bias effects over a large number of predictions. We discover substantial bias in both GloVe and ELMo embeddings. In addition to the now commonly reported gender bias (e.g., debias), we also show that both embeddings encode polarized information about demonyms and religions. To our knowledge, this is the first demonstration of national or religious bias in word embeddings.

Our second contribution is to show that simple mechanisms for removing bias from static word embeddings (in particular GloVe) work. The projection approach of Bias1 has been shown to be effective on intrinsic measures; we show that it is also effective on the new measures based on the NLI task. In particular, it reduces gender’s effects on occupations. We also show similar effects by removing subspaces associated with demonyms and religions.

Our third contribution is mainly a negative result for the same procedure on contextual embeddings, specifically ELMo. We show that many general approaches fail to reduce bias as measured by NLI. For tasks involving gender bias, we show that learning and removing a gender direction on all three layers of an ELMo embedding does not reduce bias. However, surprisingly, removing it from only the first layer (effectively a static word embedding) is effective for reducing gender bias. Yet, this approach is ineffective for religion or nationality.

On a positive note, the amount of gender bias ELMo encodes is roughly the same as that of GloVe after attenuating bias in both. In the case of biases associated with religions or nationality, less bias is reported using a bias-attenuated GloVe than with ELMo either before or after attempts at reducing bias. As such, it seems contextual embeddings may offer better predictive performance, but also encode more biases that are harder to remove.

2 Measuring Bias with Inference

In this section, we will describe our construction of a bias measure using the NLI task, which has been widely studied in NLP, starting with the PASCAL RTE challenges dagan2006the-pascal; dagan2013recognizing. More recently, research in this task has been revitalized by large labeled corpora such as the Stanford NLI corpus (SNLI, snli) and its extension, MultiNLI williams2018broad.

The underlying motivation of NLI is that inferring relationships between sentences is a surrogate for the ability to reason about text. Using the intuition that systematically invalid inferences about sentence relationships can expose an underlying bias, we extend this principle to assess bias. We will describe this process using the example of how gender biases affect inferences related to occupations. Afterwards, we will extend the approach to polarized inferences related to nationalities and religions.

2.1 Experimental Setup

Before describing our exploration of bias in representations, let us look at the word embeddings we study and the NLI systems trained over them.

We use GloVe to study static word embeddings and ELMo for contextual ones. Our NLI models are based on the decomposable attention model parikh2016decomposable, where we replaced the projective encoder with a BiLSTM. For ELMo, as is standard, we first linearly interpolate the three layers of embeddings. Our models are trained on the SNLI training set. The Supplementary lists hyper-parameters and network details.

Footnote 1: We will release our code with the final paper for reproduction and exploration.

2.2 Occupations and Genders

Consider the following three sentences: The accountant ate a bagel. The man ate a bagel. The woman ate a bagel. The first sentence should neither entail nor contradict the second or the third: we do not know the gender of the accountant. That is, for this and many other sentence pairs, the correct inference should be neutral, with prediction probabilities E: 0, N: 1, C: 0. But a gender-biased representation of the word accountant may lead to a non-neutral prediction. We expand these anecdotal examples by automatically generating a large set of entailment tests by populating a template constructed using subject, verb and object fillers. All our templates are of the form:

The <subject> <verb> a/an <object>.

Here, we use a set of common activities for the verb and object slots, such as ate a bagel, bought a car, etc. For the same verb and object, we construct an entailment pair using subject fillers drawn from different sets of words. For example, to assess gender bias associated with occupations, the subject of the premise would be an occupation word, while the subject of the hypothesis would be a gendered word. The Supplementary has all the word lists we use.

For any premise-hypothesis pair, the only difference between the sentences is the subject. Since we seek to construct entailment pairs with the expectation that the unbiased label is neutral, we removed all gendered words from the occupations list (e.g., nun, salesman and saleswoman). The resulting set has occupations, verbs, objects, and gendered word pairs (man-woman, guy-girl, gentleman-lady). Expanding these templates gives us entailment pairs, all of which we expect to be neutral.
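The template expansion described above can be sketched in a few lines of Python. The word lists below are tiny illustrative stand-ins (the actual lists are in the Supplementary), and the helper names `article`, `make_sentence` and `occupation_gender_pairs`, as well as the vowel-based choice between a and an, are assumptions made for this sketch rather than the authors' code.

```python
from itertools import product

def article(noun):
    # Assumption: choose "an" before a vowel letter, "a" otherwise.
    return "an" if noun[0].lower() in "aeiou" else "a"

def make_sentence(subject, verb, obj):
    # Fill the template "The <subject> <verb> a/an <object>."
    return f"The {subject} {verb} {article(obj)} {obj}."

def occupation_gender_pairs(occupations, verbs, objects, gendered_words):
    # Premise and hypothesis share verb and object; only the subject differs.
    for occ, verb, obj, gen in product(occupations, verbs, objects, gendered_words):
        yield make_sentence(occ, verb, obj), make_sentence(gen, verb, obj)

# Tiny illustrative word lists (hypothetical subsets).
occupations = ["accountant", "driver"]
verbs = ["ate", "bought"]
objects = ["bagel", "car"]
gendered_words = ["man", "woman"]

pairs = list(occupation_gender_pairs(occupations, verbs, objects, gendered_words))
for premise, hypothesis in pairs[:2]:
    print(premise, "|", hypothesis)
```

Every generated pair is expected to be neutral, so the number of pairs is simply the product of the list sizes.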

2.3 Measuring Bias via Invalid Inferences

Suppose we have a large collection of $M$ entailment pairs that we have constructed by populating templates as described earlier. Since each sentence pair should be inherently neutral, we can define bias as deviation from neutral. Suppose the model probabilities for the entail, neutral and contradiction labels of the $i$-th pair are denoted by E: $e_i$, N: $n_i$, C: $c_i$. We define three different measures for how far the model predictions are from neutral:

  1. Net Neutral (NN): Computes the average probability of the neutral label across all sentence pairs. That is, $\mathrm{NN} = \frac{1}{M} \sum_{i=1}^{M} n_i$.

  2. Fraction Neutral (FN): Computes the fraction of sentence pairs that are labeled as neutral. That is, $\mathrm{FN} = \frac{1}{M} \sum_{i=1}^{M} \mathbf{1}[n_i = \max\{e_i, n_i, c_i\}]$, where $\mathbf{1}[\cdot]$ is an indicator variable.

  3. Threshold (T:$\tau$): A parameterized measure that reports the fraction of examples whose probability of neutral is above $\tau$; we report this for $\tau = 0.5$ and $\tau = 0.7$.

In the ideal (i.e., bias-free) case, all three measures will take the value 1.
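As a minimal sketch, the three aggregate measures can be computed from an array of per-pair label probabilities. The function name `neutrality_scores` and the column order (entail, neutral, contradiction) are assumptions made for illustration; the first two rows reuse the driver example from the introduction, and the last two rows are made up.

```python
import numpy as np

def neutrality_scores(probs, thresholds=(0.5, 0.7)):
    """Aggregate neutrality measures over sentence pairs.

    probs: array of shape (num_pairs, 3) with columns
           (entail, neutral, contradiction); the order is an assumption.
    """
    probs = np.asarray(probs, dtype=float)
    neutral = probs[:, 1]
    scores = {
        # Net Neutral: mean probability assigned to the neutral label.
        "NN": float(neutral.mean()),
        # Fraction Neutral: share of pairs whose top label is neutral.
        "FN": float((probs.argmax(axis=1) == 1).mean()),
    }
    for tau in thresholds:
        # Threshold: share of pairs with neutral probability above tau.
        scores[f"T:{tau}"] = float((neutral > tau).mean())
    return scores

example = [
    [0.497, 0.238, 0.264],  # driver -> man (from the introduction)
    [0.040, 0.306, 0.654],  # driver -> woman (from the introduction)
    [0.100, 0.800, 0.100],  # made-up, clearly neutral prediction
    [0.200, 0.600, 0.200],  # made-up, mildly neutral prediction
]
scores = neutrality_scores(example)
print(scores)
```

A bias-free model would score 1 on every measure, since each pair would receive a neutral probability of 1.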

Embedding   NN      FN      T:0.5   T:0.7
GloVe       0.387   0.394   0.324   0.114
ELMo        0.417   0.391   0.303   0.063
Table 1: Gender-occupation neutrality scores, for models using GloVe and ELMo embeddings.

Table 1 shows the scores for models built with GloVe and ELMo embeddings. These numbers are roughly similar between GloVe- and ELMo-based models, and are far from the desired value of 1, with only the Net Neutral score for ELMo (0.417) rising significantly above random. This table demonstrates gender bias not only in GloVe, as expected, but also in the contextual embedding. Table 2 shows template fillers with the largest non-neutral probabilities for GloVe.

occ. verb obj. gen. ent. cont.
banker spoke to crew man
nurse can afford wagon lady
librarian spoke to consul woman 0.00
secretary budgeted for laptop gentleman
violinist budgeted for meal gentleman
mechanic can afford pig lady
Table 2: Gendered template parameters with largest entailment and contradiction values with GloVe model.

2.4 Nationality and Religion

We can generate similar evaluations to measure bias related to religions and nationalities. There do appear to be subspaces within word embeddings that capture both religions and nationalities (as discussed in §3.1), but they are not well-represented by term pairs (e.g., man and woman). Rather, we use a class of words (e.g., French) to represent people from various nationalities. Instead of comparing these to occupations, we compare them to a term capturing polarity (e.g., evil, good) from a Polarity set of words, again listed in the Appendix.

Using the verb-object fillers as before (e.g., crashed a car), we create sentence pairs such as The evil person crashed a car. The French person crashed a car. For a demonym $d$, a polarity term $p$, a verb $v$ and an object $o$, we generate a sentence pair of the form The $p$ person $v$ a/an $o$. The $d$ person $v$ a/an $o$. We then use one of the embedding-based models to generate the associated label probabilities, and compute the aggregate measures as before.

Embedding   NN      FN      T:0.5   T:0.7
GloVe       0.713   0.760   0.776   0.654
ELMo        0.698   0.776   0.757   0.597
Table 3: Demonym-polarity neutrality scores, for models using GloVe and ELMo embeddings.

Expanding all nationality templates provides entailment pairs. Table 3 shows that with either GloVe or ELMo, the Net Neutral, Fraction Neutral, and Threshold (at 0.5 or 0.7) scores lie between about 0.6 and 0.8. While these scores are not 1, they do not numerically exhibit as much inherent bias as in the gender example; we note that these two tests are not strictly comparable, as the word sets are quite different. Moreover, there is still some apparent bias: for roughly a quarter of the sentence pairs (Fraction Neutral of about 0.76 to 0.78), something other than neutral was the most likely prediction. The ones with the largest non-neutral probabilities are shown in Table 4.

polar verb obj. dem. ent. cont.
unprofessional traded brownie ukrainian
great can afford wagon qatari
professional budgeted auto qatari
evil owns oven canadian
evil owns phone canadian
smart loved urchin canadian
Table 4: Nationality template parameters with largest entailment and contradiction values with GloVe model.

A similar setup is used to measure the bias associated with religions. We use a word list of 17 adherents to religions, Adherent, such as Catholic, to create sentences like The Catholic person crashed a car. to serve as the hypothesis paired with a polarity premise such as The evil person crashed a car. For each adherent $a$, a polarity term $p$, verb $v$ and object $o$, we generate a sentence pair of the form The $p$ person $v$ a/an $o$. The $a$ person $v$ a/an $o$. We aggregate the predictions under our measures as before. Expanding all religious templates provides entailment pairs. The results for GloVe- and ELMo-based inference are shown in Table 5. We observe a similar pattern as with nationality, with roughly a quarter of the sentence pairs being inferred as non-neutral by the GloVe-based model; the largest non-neutral template expansions are in Table 6. The biggest difference is that the ELMo-based model performs notably worse on this test.

Embedding   NN      FN      T:0.5   T:0.7
GloVe       0.710   0.765   0.785   0.636
ELMo        0.635   0.651   0.700   0.524
Table 5: Religion-polarity neutrality scores, for models using GloVe and ELMo embeddings.

polar verb obj. adh. ent. cont.
dishonest sold calf satanist 0.01
dishonest swapped cap muslim 0.01
ignorant hated owner muslim 0.00
smart saved dresser sunni 0.98
humorless saved potato rastafarian 0.97
terrible saved lunch scientologist 0.97
Table 6: Religion template parameters with largest entailment and contradiction values with GloVe model.

3 Attenuating Bias in Static Embeddings

In §2, we saw that several kinds of biases exist in static embeddings (specifically GloVe). We can, to some extent, attenuate them. For the case of gender, this comports with the effectiveness of debiasing on previously studied intrinsic measures of bias (e.g., debias; Bias1). We focus on the simple projection operator of Bias1, which identifies a subspace associated with a concept hypothesized to carry bias, and then removes that subspace from all word representations. This approach is simple and outperforms other approaches on intrinsic measures Bias1; moreover, unlike hard debiasing debias, it does not have the potential to leave residual information among associated words gonen2019lipstick. There are also retraining-based mechanisms (e.g., gn-glove), but given that building word embeddings can be prohibitively expensive, we focus on the much simpler post-hoc modifications.

3.1 Bias Subspace

For the gender direction, we identify a bias subspace using only the embeddings of the words he and she. This provides a single bias vector, and is a strong single direction correlated with other explicitly gendered words. Its cosine similarity with the two-means vector from Names used in Bias1 is and with Gendered word pairs from debias is .
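A minimal sketch of the projection step follows, assuming the bias direction $b$ is the normalized difference of two word vectors and that projection means subtracting each vector's component along it, $v' = v - \langle v, b \rangle b$. The function names are hypothetical and the toy vectors are random stand-ins, not GloVe embeddings.

```python
import numpy as np

def bias_direction(vec_a, vec_b):
    """Unit vector along the difference of two word vectors (e.g., he - she)."""
    diff = np.asarray(vec_a, dtype=float) - np.asarray(vec_b, dtype=float)
    return diff / np.linalg.norm(diff)

def project_out(embeddings, direction):
    """Remove the component along `direction` from every row of `embeddings`."""
    direction = direction / np.linalg.norm(direction)
    return embeddings - np.outer(embeddings @ direction, direction)

# Toy vocabulary with random vectors (illustrative only).
rng = np.random.default_rng(0)
vocab = ["he", "she", "accountant", "nurse"]
emb = rng.normal(size=(len(vocab), 8))

b = bias_direction(emb[0], emb[1])
debiased = project_out(emb, b)

# After projection, no vector has a component along the bias direction.
print(np.abs(debiased @ b).max())
```

One consequence of defining the direction from a single word pair is that the two defining words end up with identical vectors after projection, since their difference lies entirely along the removed direction.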

For nationality and religion, the associated directions are present and have similar traits to the gendered one (Table 7), but are not quite as simple to work with. For nationalities, we identify a set of demonyms, separate from those used to create sentence pairs, as Demonym, and use their first principal component to define a 1-dimensional demonym subspace. For religions, we similarly use an Adherent set, again of size , but use the first principal components to define a -dimensional religion subspace. In both cases, these sets were randomly split off from the full sets Demonym and Adherent. Also, the cosine similarity of the top singular vector from the full sets with that derived from the training set was and for demonyms and adherents, respectively. Again, there is a clear correlation, but perhaps slightly less definitive than for gender.

Embedding 2nd 3rd 4th cosine
Gendered 0.57 0.39 0.24 0.76
Demonyms 0.45 0.39 0.30 0.56
Adherents 0.71 0.59 0.4 0.72
Table 7: Ratio of the k-th principal value (k = 2, 3, 4) to the top principal value with the GloVe embedding for the Gendered, Demonym, and Adherent datasets. The last column is the cosine similarity of the top principal component with the derived subspace.
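The quantities reported in Table 7 can be sketched as follows, assuming that "principal values" are the singular values of the mean-centered matrix of word vectors (the centering step and all function names are assumptions). The data is synthetic, with one planted dominant direction, so the numbers it prints are unrelated to those in the table.

```python
import numpy as np

def principal_values_and_directions(vectors):
    """Singular values and right singular vectors of the centered word vectors."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)  # assumption: center before the decomposition
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    return s, vt

def decay_ratios(singular_values, upto=4):
    """Ratios of the 2nd..upto-th principal value to the top one."""
    return singular_values[1:upto] / singular_values[0]

def abs_cosine(u, v):
    """Absolute cosine similarity (the sign of a principal direction is arbitrary)."""
    return float(abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic word set: one strong planted direction plus small noise.
rng = np.random.default_rng(0)
planted = np.zeros(10)
planted[0] = 1.0
coeffs = rng.normal(scale=5.0, size=(40, 1))
toy_vectors = coeffs * planted + rng.normal(scale=0.1, size=(40, 10))

s, vt = principal_values_and_directions(toy_vectors)
ratios = decay_ratios(s)
cos = abs_cosine(vt[0], planted)
print(ratios, cos)
```

Small ratios indicate that a single direction dominates the word set, which is the sense in which a subspace is "strong" in the table.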

3.2 Results of Bias Projection

By removing these derived subspaces from GloVe, we demonstrate a significant decrease in bias. Let us start with gender, where we removed the he-she direction and then recomputed the various bias scores. Table 8 shows these results, as well as the effect of projecting out a random vector (averaged over 8 such vectors), along with the percent change from the original GloVe scores. We see that the scores increase by between 24.7% and 160.5%, which is quite significant compared to the effect of random vectors, which ranges from a decrease of 6.0% to an increase of 3.5%.

        NN       FN       T:0.5    T:0.7
proj    0.480    0.519    0.474    0.297
%       +24.7%   +31.7%   +41.9%   +160.5%
rand    0.362    0.405    0.323    0.118
%       -6.0%    +2.8%    -0.3%    +3.5%
Table 8: Effect of attenuating gender bias using projection with he-she vector, and random vectors. Percentages compared to the results without attenuation.

For the learned demonym subspace, the effects are shown in Table 9. Again, all the neutrality measures increase, but more mildly. The percentage increases range from 13.3% to 19.9%, which is expected since the starting values were already larger, at roughly 0.65 to 0.78 (Table 3); they now lie between 0.78 and 0.91.

        NN       FN       T:0.5    T:0.7
proj    0.808    0.887    0.910    0.784
%       +13.3%   +16.7%   +17.3%   +19.9%
Table 9: Effect of attenuating nationality bias using projection with the Demonym-derived vector. Percentages compared to the results without attenuation.

The results after removing the learned adherent subspace, shown in Table 10, are quite similar to those for demonyms. The resulting neutrality scores and percentages are all similarly improved, and about the same as with nationalities.

        NN       FN       T:0.5    T:0.7
proj    0.794    0.894    0.913    0.771
%       +11.8%   +16.8%   +16.3%   +21.2%
Table 10: Effect of attenuating religious bias using projection with the Adherent-derived vector. Percentages compared to the results without attenuation.

Moreover, the dev and test scores (Table 11) on the SNLI benchmark are 87.81 and 86.98 before, and 88.14 and 87.20 after the gender projection. So the scores actually improve slightly after this bias attenuation! For demonyms and religion, the dev and test scores show very little change.

orig -gen -nat -rel
Dev 87.81 88.14 87.76 87.95
Test 86.98 87.20 86.87 87.18
Table 11: Dev and Test scores on the SNLI task with the original GloVe embedding (orig) and after debiasing with respect to gender, nationality, and religion.

4 Attenuating Bias in Contextual Word Embeddings

In this section, with less success, we attempt to attenuate bias in contextual word vector embeddings (specifically ELMo).

Unlike GloVe, ELMo is not a static embedding of words, but a context-aware dynamic embedding that is computed using two layers of BiLSTMs operating over the sentence. This results in three embeddings, each 1024-dimensional, which we call layers 1, 2 and 3. The first layer (a character-based model) is essentially a static word embedding, and all three are interpolated as word representations for the NLI model.
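The interpolation of the three layers can be sketched as a weighted sum. This sketch assumes the usual ELMo scalar mix, in which softmax-normalized per-layer scalars are applied and the result is multiplied by a global scale; the text only says the layers are linearly interpolated, so the normalization, the `gamma` parameter and the function name are assumptions. The layer activations here are random placeholders.

```python
import numpy as np

def interpolate_layers(layers, scalars, gamma=1.0):
    """Weighted sum of ELMo-style layers.

    layers:  array of shape (num_layers, num_tokens, dim)
    scalars: one unnormalized weight per layer (learned in a real model)
    gamma:   global scale (assumption, following the usual scalar mix)
    """
    layers = np.asarray(layers, dtype=float)
    scalars = np.asarray(scalars, dtype=float)
    # Softmax over the per-layer scalars, shifted for numerical stability.
    weights = np.exp(scalars - scalars.max())
    weights /= weights.sum()
    # Contract the layer axis: the result has shape (num_tokens, dim).
    return gamma * np.tensordot(weights, layers, axes=1)

# Placeholder activations for a 5-token sentence, three 1024-dim layers.
rng = np.random.default_rng(0)
layers = rng.normal(size=(3, 5, 1024))
mixed = interpolate_layers(layers, scalars=[0.0, 0.0, 0.0])
print(mixed.shape)
```

With equal scalars the mix reduces to a plain average of the three layers.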

4.1 All Layer Projection: Gender

Our first attempt at attenuating bias is to directly replicate the projection procedure, where we learn a bias subspace and remove it from the embedding. The first challenge is that each time a word appears, the context is different, and thus its embedding in each layer of ELMo is different.

However, we can embed the 1M sentences in a representative training corpus, WikiSplit (https://github.com/google-research-datasets/wiki-split), and average the embeddings of word types. This averages out contextual information and incorrectly blends senses, but it does not reposition these words. This process can be used to learn a subspace, say one encoding gender, and is successful at this task by intrinsic measures: the second singular value of the full Gendered set is for layer 1, for layer 2, and for layer 3, all sharp drops.
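The averaging of contextual embeddings into one vector per word type can be sketched as below. The encoder is abstracted as a callable `embed_fn` that maps a token list to one vector per token; `toy_embed` is a hypothetical stand-in for a real contextual encoder, and lowercasing word types is an assumption of this sketch.

```python
import numpy as np

def average_type_embeddings(sentences, embed_fn):
    """Average token-level (contextual) vectors into one vector per word type."""
    sums, counts = {}, {}
    for tokens in sentences:
        vectors = embed_fn(tokens)  # shape (len(tokens), dim)
        for token, vec in zip(tokens, vectors):
            key = token.lower()  # assumption: case-insensitive word types
            sums[key] = sums.get(key, 0.0) + vec
            counts[key] = counts.get(key, 0) + 1
    return {word: sums[word] / counts[word] for word in sums}

def toy_embed(tokens):
    # Hypothetical stand-in for a contextual encoder: each vector depends on
    # the token (its length) and on its context (its position).
    return np.array([[len(tok), i] for i, tok in enumerate(tokens)], dtype=float)

sentences = [["he", "ran"], ["then", "he", "ran"]]
type_vectors = average_type_embeddings(sentences, toy_embed)
print(type_vectors["he"])
```

The resulting per-type vectors can then be fed to the same subspace-learning step used for static embeddings.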

Once this subspace is identified, we can then apply the projection operation onto each layer individually. Even though the embedding is contextual, this operation makes sense since it is applied to all words; it just modifies the ELMo embedding of any word (even ones unseen before or in new context) by first applying the original ELMo mechanism, and then projecting afterwards.

However, this does not significantly change the neutrality on the gender-specific inference task. Compared to the original results in Table 1, the change, as shown in Table 12, is no more than, and often less than, that from projecting along a random direction (averaged over 4 random directions). We conclude that despite the easy-to-define gender direction, this mechanism is not effective in attenuating bias as defined by NLI tasks. We hypothesize that the random directions work surprisingly well because they destroy some inherent structure in the ELMo process, so that the prediction reverts to neutral.

        NN       FN       T:0.5    T:0.7
proj    0.423    0.419    0.363    0.079
%       +1.6%    +7.2%    +19.8%   +25.4%
rand    0.428    0.412    0.372    0.115
%       +2.9%    +5.4%    +22.8%   +82.5%
Table 12: Effect of attenuating gender bias using a projection operation on all layers of ELMo with learned gender direction, and with random vectors. Percentages compared to the results without attenuation.

4.2 Layer 1 Projection: Gender

Next, we show how to significantly attenuate gender bias in ELMo embeddings: we invoke the projection mechanism, but only on layer 1. This layer is a static embedding of each word, essentially a look-up table for words independent of context. Thus, as with GloVe, we can find a strong subspace for gender using only the he-she vector. Table 13 shows the stability of the subspaces on the ELMo layer 1 embedding for Gendered and also Demonyms and Adherents; note that this fairly closely matches the table for GloVe, with some minor trade-offs between decay and cosine values.

Embedding 2nd 3rd 4th cosine
Gendered 0.46 0.32 0.29 0.60
Demonyms 0.72 0.61 0.59 0.67
Adherents 0.63 0.61 0.58 0.41
Table 13: Ratio of the k-th principal value (k = 2, 3, 4) to the top principal value with the ELMo layer 1 embedding for the Gendered, Demonym, and Adherent datasets. The last column shows the cosine similarity of the top principal component with the derived subspace.

Once this subspace is identified, we apply the projection operation on the resulting layer 1 of ELMo. We do this before the BiLSTMs in ELMo generate layers 2 and 3. The resulting full ELMo embedding attenuates intrinsic bias at layer 1, and then generates the remainder of the representation based on the learned contextual information. We find, perhaps surprisingly, that when applied to the gender-specific inference tasks, this indeed increases neutrality in the predictions, and hence attenuates bias.

Table 14 shows that each measure of neutrality is significantly increased by this operation, whereas the projection on a random vector (averaged over 8 trials) changes each measure by less than 3%, some negative, some positive. For instance, the Net Neutral score is now 0.488, an increase of 17.3%, and the fraction of examples with neutral probability above 0.7 increased from 0.063 (in Table 1) to 0.364 (an increase of nearly 480%).

        NN       FN       T:0.5    T:0.7
proj    0.488    0.502    0.479    0.364
%       +17.3%   +28.4%   +58.1%   +477.8%
rand    0.414    0.402    0.309    0.062
%       -0.5%    +2.8%    +2.0%    -2.6%
Table 14: Effect of attenuating gender bias using a projection operation on layer 1 of ELMo with he-she gender direction, and with random vectors. Percentages compared to the results without attenuation.

4.3 Layer 1 Projection: Nationality and Religion

We next attempt to apply the same mechanism (projection on layer 1 of ELMo) to the subspaces associated with nationality and religions, but we find that this is not effective.

The results of the aggregate neutrality of the nationality and religion specific inference tasks are shown in Tables 15 and 16, respectively. The neutrality actually decreases when this mechanism is used. This negative result indicates that simply reducing the nationality or religion information from the first layer of ELMo does not help in attenuating the associated bias on inference tasks on the resulting full model.

        NN       FN       T:0.5    T:0.7
proj    0.624    0.745    0.697    0.484
%       -10.7%   -4.0%    -7.9%    -18.9%
Table 15: The effect of attenuating nationality bias using a projection operation on layer 1 of ELMo with the learned demonym direction. Percentages compared to the results without attenuation.

        NN       FN       T:0.5    T:0.7
proj    0.551    0.572    0.590    0.391
%       -13.2%   -12.1%   -15.7%   -25.4%
Table 16: The effect of attenuating religion bias using a projection operation on layer 1 of ELMo with the learned adherents direction. Percentages compared to the results without attenuation.

We have several hypotheses of why this does not work. Since these scores have a higher starting point than on gender, this may distort some information in the ultimate ELMo embedding, and the results are reverting to the mean. Alternatively, it is possible that the layers 2 and 3 of ELMo (re-)introduce bias into the final word representations from the context, and this effect is more pronounced for nationality and religions than gender.

We also considered the possibility that the demonym or adherent subspace learned on the training set is not good enough for the projection operation, as compared to the gender variant. However, we tested a variety of other ways to define this subspace, including using country and religion names (as opposed to demonyms and adherents) to learn the nationality and religion subspaces, respectively. This method is supported by the linear relationships between analogies encoded by static word embeddings Mik1. While on a subset of measures this did slightly better than using a separate training and test set for just the demonyms and adherents, it does not yield more neutrality than the original embedding. Even training the subspace and evaluating on the full set of Demonyms and Adherents does not increase the measured aggregate neutrality scores.

5 Discussion, Related work & Next Steps

GloVe vs. ELMo

While the mechanisms for attenuating bias in ELMo (measured by NLI) were not universally successful (and, in general, unsuccessful), they were successful on GloVe. Moreover, the overall neutrality scores are higher on (almost) all tasks on the debiased GloVe embeddings. Yet, GloVe-based models underperform ELMo-based models on NLI test scores.

Table 17 summarizes the dev and test scores for the various ELMo configurations. We see that the effect of debiasing on the original inference objective is fairly minor, and these scores remain slightly larger than those of the models based on GloVe, both before and after debiasing. These observations suggest that while ELMo offers better and more stable predictive accuracy, it is also harder to debias than simple static embeddings.

orig -gen(all) -gen(1) -nat(1) -rel(1)
dev 89.03 88.77 89.01 89.04
test 88.37 88.04 87.99 88.30
Table 17: Dev and Test scores on the SNLI task with the original ELMo embedding (orig) and after debiasing with respect to gender on all layers, and gender, nationality, and religion on layer 1.

Extending to BERT.

Our method for measuring biases is agnostic to the actual representation and is applicable to BERT bert. In principle, the projection method for debiasing contextual word vector embeddings should also apply to BERT. For any contextualized subword, the projection method invokes a linear projection. What is required is learning a subspace which captures a bias. Our initial experiments in averaging over relevant subwords revealed some success in identifying a gender subspace, but as with the ELMo all-layer debiasing, the initial evaluations of the effect on bias as measured via NLI were negative. Given the recent successes of BERT-based models in NLP, studying mitigation strategies for BERT is an important next step.

Further resolution of models and examples.

Beyond simply measuring the error in aggregate over all templates, and listing individual examples, there are various interesting intermediate resolutions of bias that can be measured. We can, for instance, restrict to all nationality templates which involve rude Polarity and iraqi Demonym, and measure their average entailment: in the GloVe model it starts as average entailment, and drops to entailment after the projection of the demonym subspace.
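Such an intermediate resolution amounts to filtering the per-pair predictions by template fillers before averaging. The sketch below assumes each prediction is stored as a dictionary with hypothetical keys `polarity`, `demonym` and `entail`; the probabilities are made up for illustration and are not results from the paper.

```python
def mean_entailment(records, **filters):
    """Average entailment probability over records matching all filters."""
    selected = [r for r in records
                if all(r.get(key) == value for key, value in filters.items())]
    if not selected:
        raise ValueError("no records match the given filters")
    return sum(r["entail"] for r in selected) / len(selected)

# Made-up predictions for a handful of template expansions.
records = [
    {"polarity": "rude", "demonym": "iraqi", "entail": 0.6},
    {"polarity": "rude", "demonym": "iraqi", "entail": 0.4},
    {"polarity": "rude", "demonym": "french", "entail": 0.1},
    {"polarity": "good", "demonym": "iraqi", "entail": 0.2},
]

subset_mean = mean_entailment(records, polarity="rude", demonym="iraqi")
print(subset_mean)
```

Running the same query on predictions from the original and the debiased model gives a before-and-after comparison for that specific polarity-demonym combination.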

Sources of bias.

Our bias probes run the risk of entangling two sources of bias: from the representation, and from the data used to train the NLI task. rudinger2017social; gururangan2018annotation and references therein point out that the mechanism for gathering the SNLI data allows various stereotypes (gender, age, race, etc.) and annotation artifacts to seep into the data. What is the source of the non-neutral inferences? The observation from GloVe that the three bias measures can be increased by attenuation strategies that only transform the word embeddings indicates that any bias that may have been removed came from the word embeddings. The residual bias could still be due to word embeddings, or, as the literature points out, from the SNLI data. Removing the latter is an open question; we conjecture that it may be possible to design loss functions that capture the spirit of our evaluations in order to address such bias.

Relation to error in models.

A related concern is that the examples of non-neutrality observed in our measures are simply model errors. We argue this is not so for several reasons. First, the probability of predicting neutral is below (and in the case of gendered examples, far below) the scores on the test sets (roughly 87% to 88%; Tables 11 and 17), indicating that these examples pose problems beyond the normal error. Also, through the projection of random directions in the embedding models, we are essentially measuring a type of random perturbation to the models themselves; the result of this perturbation is fairly insignificant, indicating that these effects are real.

Biases as Invalid Inferences.

We use the NLI task to measure bias in word embeddings. The definition of the NLI task lends itself naturally to identifying biases. Indeed, the ease with which we can reduce other reasoning tasks to textual entailment was a key motivation for the various PASCAL entailment challenges (dagan2006the-pascal, inter alia). In this paper, we explored three kinds of biases that have important societal impacts, but the mechanism is easily extensible to other types of biases. Recent NLP literature uses other tasks, especially coreference resolution, to quantify biases (e.g., rudinger2018gender; zhao2019gender). The interplay of biases as coreference errors and invalid inferences is an avenue for future work.

6 Conclusion

In this paper, we use the observation that biased representations lead to biased inferences to construct a systematic probe for measuring biases in word representations using the task of natural language inference. Our experiments using this probe reveal that both GloVe and ELMo embeddings encode gender, religion and nationality biases. We explore the use of a projection-based method for attenuating biases, and show that while the method works for the static GloVe embeddings, contextual embeddings encode harder-to-remove biases.

7 Experiment Setup

Our models for the NLI task are built on top of the decomposable attention network (DAN) parikh2016decomposable. We experiment separately with static word embeddings, i.e., GloVe Art3, and context-dependent embeddings, i.e., ELMo Peters:2018.

Static embeddings.

For static embeddings, we adopted the original DAN architecture but replaced the projective encoder with a bidirectional LSTM encoder cheng2016long. We used the GloVe embeddings pretrained on the Common Crawl dataset with dimension . Across the network, the dimensions of the hidden layers are all set to . That is, word embeddings get downsampled to by the LSTM encoder. Models are trained on the SNLI dataset for epochs, and the best-performing model on the development set is kept for evaluation.

Context-dependent embeddings.

For ELMo, we used the same architecture, except that the static embeddings are replaced with a weighted sum of the three layers of ELMo embeddings, each 1024-dimensional. At the encoder stage, the ELMo embeddings are first linearly interpolated and then passed to the LSTM encoder. The encoder output is then concatenated with another, independently interpolated version of the ELMo embeddings. The LSTM encoder still uses hidden size . The attention layers are widened to dimensions because of the concatenated ELMo embeddings. For the classification layers, we extend the dimension to . Models are trained on the SNLI dataset for epochs.
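The weighted summation of ELMo layers can be sketched as follows. This is a minimal illustration that assumes the common scalar-mix parametrisation (softmax-normalised layer weights and a global scale `gamma`); whether the authors use exactly this form is an assumption, and `mix_layers`, `softmax`, and the toy vectors are hypothetical names and values chosen for the example.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a short list of scalars."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layers: List[List[float]],
               scalars: List[float],
               gamma: float = 1.0) -> List[float]:
    """Collapse per-layer vectors for one token into a single vector.

    layers  : one vector per ELMo layer (three layers in the text above)
    scalars : one unnormalised, learnable weight per layer (assumed)
    gamma   : learnable global scale (assumed)
    """
    weights = softmax(scalars)
    dim = len(layers[0])
    return [
        gamma * sum(w * layer[i] for w, layer in zip(weights, layers))
        for i in range(dim)
    ]


# Toy 2-dimensional "layers" for one token; real ELMo layers are
# 1024-dimensional.
toy_layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Two independent mixes, mirroring the description that one interpolated
# version feeds the LSTM encoder and another is concatenated afterwards.
mix_for_encoder = mix_layers(toy_layers, [0.0, 0.0, 0.0])
mix_for_concat = mix_layers(toy_layers, [1.0, 0.0, -1.0])
print(mix_for_encoder, mix_for_concat)
```

With equal scalars the mix reduces to the plain average of the layers; unequal scalars let the model weight lower or higher layers more heavily.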

Debiasing & retraining.

To debias GloVe, we remove the corresponding components from the static embeddings of all words, using the specified projection mechanism. The resulting embeddings are then used for (re)training. To debias ELMo, we apply the same removal method to the input character embeddings, and then embed as usual. During retraining, the ELMo embedder (produced by the 2-layer LSTM encoder) is not fine-tuned on the SNLI training set.
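Removing a component by projection can be sketched as below. This is a minimal illustration assuming a single bias direction `v` and the standard linear projection `w' = w - <w, v> v` with `v` normalised to unit length; the function names, the toy three-dimensional vectors, and the direction are hypothetical and not taken from the paper's code.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def normalize(v: Vector) -> Vector:
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def remove_component(w: Vector, direction: Vector) -> Vector:
    """Subtract from w its component along the (normalised) direction."""
    v = normalize(direction)
    coeff = dot(w, v)
    return [wi - coeff * vi for wi, vi in zip(w, v)]


def debias_all(embeddings: Dict[str, Vector],
               direction: Vector) -> Dict[str, Vector]:
    """Apply the same projection to every word in the vocabulary."""
    return {word: remove_component(vec, direction)
            for word, vec in embeddings.items()}


# Toy embeddings and a toy bias direction (illustrative values only).
toy_embeddings = {
    "doctor": [0.6, 0.8, 0.1],
    "nurse": [-0.5, 0.7, 0.2],
}
toy_direction = [2.0, 0.0, 0.0]

debiased = debias_all(toy_embeddings, toy_direction)
for word, vec in debiased.items():
    # After projection, each vector is orthogonal to the direction.
    print(word, vec, dot(vec, normalize(toy_direction)))
```

The same function applied with a randomly drawn direction gives the kind of random-perturbation control discussed under "Relation to error in models".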

On reproducibility.

Our code is deterministic and the results in the paper should be reproducible. We fixed all random seeds in our code except those buried deep inside the learning libraries. In our preliminary experiments, we found the models to be only slightly sensitive to this remaining randomness: across different random runs, the differences in test accuracy are often in range on a 100-point scale. Thus, we believe our results are reproducible, even though there might be subtle variation.
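The effect of fixing seeds can be illustrated with a small sketch. It uses only Python's standard `random` module and a hypothetical `run_trial` function standing in for any randomised step (initialisation, shuffling, and so on); the same idea would have to be applied separately to whichever numerical libraries a real training run relies on, which this sketch does not cover.

```python
import random
from typing import List


def run_trial(seed: int, n: int = 5) -> List[float]:
    """Hypothetical stand-in for a randomised step in training.

    A dedicated generator seeded with a fixed value produces the same
    sequence on every call, independent of global random state.
    """
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


first = run_trial(seed=13)
second = run_trial(seed=13)
other = run_trial(seed=14)

print(first == second)  # same seed: identical draws
print(first == other)   # different seed: different draws
```

Sources of randomness that are not reachable from user code, as noted above, remain unseeded and account for the small run-to-run variation.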

8 Word Lists

The word lists marked with are the ones used to populate the templates. The lists marked with are the ones used to learn a subspace in the embeddings. These two types do not intersect, and when that subscript is omitted, it implies that it is the union of the two lists – the full list. These full lists are used to assess the stability of the associated subspaces by considering the principal values.


atheist, baptist, catholic, christian, hindu, methodist, protestant, shia


adventist, anabaptist, anglican, buddhist, confucian, jain, jew, lutheran, mormon, muslim, rastafarian, satanist, scientologist, shinto, sikh, sunni, taoist


america, belarus, brazil, britain, canada, china, denmark, egypt, emirates, france, georgia, germany, greece, india, iran, iraq, ireland, italy, japan, korea, libya, morocco, netherlands, nigeria, pakistan, peru, qatar, russia, scotland, spain, switzerland, thailand, turkey, ukraine, uzbekistan, vietnam, wales, yemen, zambia


american, chinese, egyptian, french, german, korean, pakistani, spanish


belarusian, brazilian, british, canadian, danish, dutch, emirati, georgian, greek, indian, iranian, iraqi, irish, italian, japanese, libyan, moroccan, nigerian, peruvian, qatari, russian, saudi, scottish, swiss, thai, turkish, ukrainian, uzbekistani, vietnamese, welsh, yemeni, zambian


man, woman, guy, girl, gentleman, lady


man, woman, himself, herself, john, mary, father, mother, boy, girl, son, daughter, his, her, guy, gal, male, female


accountant, actuary, administrator, advisor, aide, ambassador, architect, artist, astronaut, astronomer, athlete, attendant, attorney, author, babysitter, baker, banker, biologist, broker, builder, butcher, butler, captain, cardiologist, caregiver, carpenter, cashier, caterer, chauffeur, chef, chemist, clerk, coach, contractor, cook, cop, cryptographer, dancer, dentist, detective, dictator, director, doctor, driver, ecologist, economist, editor, educator, electrician, engineer, entrepreneur, executive, farmer, financier, firefighter, gardener, general, geneticist, geologist, golfer, governor, grocer, guard, hairdresser, housekeeper, hunter, inspector, instructor, intern, interpreter, inventor, investigator, janitor, jester, journalist, judge, laborer, landlord, lawyer, lecturer, librarian, lifeguard, linguist, lobbyist, magician, manager, manufacturer, marine, marketer, mason, mathematician, mayor, mechanic, messenger, miner, model, musician, novelist, nurse, official, operator, optician, painter, paralegal, pathologist, pediatrician, pharmacist, philosopher, photographer, physician, physicist, pianist, pilot, plumber, poet, politician, postmaster, president, principal, producer, professor, programmer, psychiatrist, psychologist, publisher, radiologist, receptionist, reporter, representative, researcher, retailer, sailor, salesperson, scholar, scientist, secretary, senator, sheriff, singer, soldier, spy, statistician, stockbroker, supervisor, surgeon, surveyor, tailor, teacher, technician, trader, translator, tutor, undertaker, valet, veterinarian, violinist, warden, warrior, watchmaker, writer, zookeeper, zoologist


apple, apron, armchair, auto, bagel, banana, bed, bench, beret, blender, blouse, bookshelf, breakfast, brownie, buffalo, burger, bus, cabinet, cake, calculator, calf, camera, cap, cape, car, cart, cat, chair, chicken, clock, coat, computer, costume, cot, couch, cow, cupboard, dinner, dog, donkey, donut, dress, dresser, duck, goat, headphones, heater, helmet, hen, horse, jacket, jeep, lamb, lamp, lantern, laptop, lunch, mango, meal, muffin, mule, oven, ox, pancake, peach, phone, pig, pizza, potato, printer, pudding, rabbit, radio, recliner, refrigerator, ring, roll, rug, salad, sandwich, shirt, shoe, sofa, soup, stapler, SUV, table, television, toaster, train, tux, TV, van, wagon, watch


awful, dishonest, dumb, evil, great, greedy, hateful, honest, humorless, ignorant, intelligent, intolerant, neat, nice, professional, rude, smart, strong, stupid, terrible, ugly, unclean, unprofessional, weak, wise


adventism, anabaptism, anglicism, atheism, baptism, buddhism, catholicism, christianity, confucianism, hinduism, islam, jainism, judaism, lutheranism, methodism, mormonism, protestantism, rastafarianism, satanism, scientology, sikhism, sunnism, taoism


ate, befriended, bought, budgeted for, called, can afford, consumed, cooked, crashed, donated, drove, finished, hated, identified, interrupted, liked, loved, met, owns, paid for, prepared, saved, sold, spoke to, swapped, traded, visited
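As an illustration of how word lists like those above could populate premise/hypothesis templates, here is a minimal sketch. The template wording ("The <subject> <verb> a/an <object>."), the pairing of an occupation in the premise with a gendered word in the hypothesis, and the helper names are assumptions made for this example; the words themselves are small subsets of the lists above.

```python
from itertools import product
from typing import List, Tuple


def article(noun: str) -> str:
    """Crude a/an choice based on the first letter (a heuristic only)."""
    return "an" if noun[0].lower() in "aeiou" else "a"


def make_pairs(occupations: List[str],
               gendered: List[str],
               verbs: List[str],
               objects: List[str]) -> List[Tuple[str, str]]:
    """Build (premise, hypothesis) pairs from every word combination.

    The two sentences differ only in their subject, so nothing in the
    premise licenses entailment or contradiction of the hypothesis.
    """
    pairs = []
    for occ, gen, verb, obj in product(occupations, gendered, verbs, objects):
        premise = f"The {occ} {verb} {article(obj)} {obj}."
        hypothesis = f"The {gen} {verb} {article(obj)} {obj}."
        pairs.append((premise, hypothesis))
    return pairs


pairs = make_pairs(
    occupations=["accountant", "doctor"],
    gendered=["man", "woman"],
    verbs=["ate", "bought"],
    objects=["apple", "bagel"],
)
print(len(pairs))
print(pairs[0])
```

Because the number of pairs is the product of the list sizes, even modest lists yield a large set of probe sentences.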