Understanding the Origins of Bias in Word Embeddings

10/08/2018 ∙ by Marc-Etienne Brunet, et al. ∙ University of Toronto

The power of machine learning systems not only promises great technical progress, but risks societal harm. As a recent example, researchers have shown that popular word embedding algorithms exhibit stereotypical biases, such as gender bias. The widespread use of these algorithms in machine learning systems, from automated translation services to curriculum vitae scanners, can amplify stereotypes in important contexts. Although methods have been developed to measure these biases and alter word embeddings to mitigate their biased representations, there is a lack of understanding in how word embedding bias depends on the training data. In this work, we develop a technique for understanding the origins of bias in word embeddings. Given a word embedding trained on a corpus, our method identifies how perturbing the corpus will affect the bias of the resulting embedding. This can be used to trace the origins of word embedding bias back to the original training documents. Using our method, one can investigate trends in the bias of the underlying corpus and identify subsets of documents whose removal would most reduce bias. We demonstrate our techniques on both a New York Times and Wikipedia corpus and find that our influence function-based approximations are extremely accurate.




1 Introduction

As machine learning algorithms play ever-increasing roles in our lives, there are ever-increasing risks for these algorithms to be systematically biased [22, 21, 11, 6, 8]. An ongoing research effort is showing that machine learning systems can not only reflect human biases in the data they learn from, but also magnify these biases when deployed in practice [19]. With algorithms aiding critical decisions ranging from medical diagnoses to hiring decisions, it is important to understand the nature and sources of these biases.

In recent work, researchers have uncovered an illuminating example of bias in machine learning systems: Popular word embedding methods such as word2vec [14] and GloVe [16] acquire stereotypical human biases from the text data they are trained on. For example, they disproportionately associate male terms with science terms, and female terms with art terms [1, 4]. Deploying these word embedding algorithms in practice, for example in automated translation systems or as hiring aids, runs the serious risk of perpetuating problematic biases in important societal contexts. Furthermore, this problem is especially pernicious because these biases can be difficult to detect. For example, word embeddings were in broad industrial use before these stereotypical biases were found.

In this work, we develop a technique for understanding the origins of bias in word embeddings. Given a bias metric and a word embedding trained on some corpus, our method identifies how perturbing the training corpus will affect the resulting bias. This naturally applies at the document level: given any document in the training corpus, our method can accurately approximate how its removal would affect the bias of the embedding. Naively, this can be done directly by removing the document and re-training an embedding on the perturbed corpus. But this approach comes at a prohibitive computational cost, limiting the number of perturbations that can be studied. Our method provides a highly efficient alternative, enabling the impact of every document in the corpus to be analyzed.

By calculating the document-level differences in bias across interesting subsets of documents, an analyst can better understand the origins of bias in word embeddings. These insights could be useful for a variety of important tasks. In some applications, finding the most influential documents on word embedding bias could surface potentially problematic portions of the training corpus. Using our technique, one could also find examples of documents that are extremely bias-correcting. Also, our method enables the study of how bias varies across various dimensions of a corpus — for example, how bias changes over time.

The main idea behind our method is to predict how perturbing the input corpus changes the bias in the resulting word embedding. To do this, we decompose the problem into two main subproblems: first, understanding how perturbing the training data changes the learned word embedding; and second, how changing the word embedding affects its bias. The latter subproblem is straightforward, and our central technical contributions solve the former. We calculate an approximation to the GloVe algorithm’s loss function, and then apply influence functions from robust statistics to this approximation. Our modification ensures that the Hessian in the influence function calculation is positive definite and block diagonal, which in turn allows us to accurately and efficiently compute how perturbations in the input corpus affect the bias metric using influence functions. We can then identify co-occurrences that most increase embedding bias, as well as approximate the effect of removing documents from (or adding documents to) the input corpus.

We demonstrate our technique through experimental results, initially on a simplified corpus of Wikipedia articles in broad use [20], and then on a New York Times corpus [18]. We use a previously proposed measure of word embedding bias as our bias metric, but our technique is generalizable to other metrics. Across a range of experiments, we find that our method’s predictions of how perturbing the input corpus will affect the bias of the embedding are extremely accurate.

2 Related work

Word embeddings are compact vector representations of words learned from a training corpus, and are actively deployed in a number of domains. They not only preserve statistical relationships present in the training data, generally placing commonly co-occurring words close to each other, but they also preserve higher-order syntactic and semantic structure, capturing relationships such as Madrid : Spain :: Paris : France and Man : King :: Woman : Queen [15]. However, they have been shown to also preserve problematic relationships in the training data, such as Man : Computer Programmer :: Woman : Homemaker [3].

A recent line of work has begun to develop measures to document these biases as well as algorithms to correct for them. Caliskan et al. introduced a measure of bias in word embeddings and used it to show that word embeddings trained on large public corpora (e.g., Wikipedia, Google News) consistently replicate the known human biases measured by the Implicit Association Test [7, 4]. For example, female terms (e.g., “her”, “she”, “woman”) are closer to family and arts terms than they are to career and math terms, whereas the reverse is true for male terms. Bolukbasi et al. developed algorithms to de-bias word embeddings so that problematic relationships are no longer preserved, but unproblematic relationships remain [3]. We build upon this line of work by developing methodology to understand the sources of these biases in word embeddings.

Stereotypical biases have been found in other machine learning settings as well. Common training datasets for multilabel object classification and visual semantic role labeling contain gender bias and, moreover, models trained on these biased datasets exhibit greater gender bias than the training datasets [21]. Other types of bias, such as racial bias, have also been shown to exist in machine learning applications [1].

Recently, Koh and Liang proposed a methodology for using influence functions, a technique from robust statistics, to explain the predictions of a black-box model by tracing the learned state of a model back to individual training examples [12]. Influence functions allow us to efficiently approximate the effect on model parameters of perturbing a training data point. Other efforts to increase the explainability of machine learning models have largely focused on providing visual or textual information to the user as justification for classification or reinforcement learning decisions [17, 9, 13].

3 Background

Here we describe the definitions and methods that are prerequisite to our work.

3.1 The GloVe word embedding algorithm

A word embedding is a set of vectors in $D$-dimensional space where each word is represented by a vector. Two embedding algorithms, word2vec and GloVe, are in widespread use today, and are trained on the co-occurrences between pairs of words in a large corpus. We use the GloVe word embedding algorithm because its global co-occurrence matrix facilitates approximating the effect of removing documents from the corpus.

Learning a GloVe embedding from a tokenized corpus and a fixed vocabulary of size $V$ is done in two steps. First, a co-occurrence matrix $X$ is extracted from the corpus, where each entry $X_{ij}$ represents a weighted count of the number of times word $j$ occurs in the context of word $i$. Word pairs that are $d$ words apart contribute $1/d$ to the total co-occurrence count, up to a maximum distance defined by the context window (typically 5-15 words). Note that $V$ can far exceed 1 million words, and while $X \in \mathbb{R}^{V \times V}$, it is extremely sparse.
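The extraction step can be illustrated with a minimal sketch. It assumes a symmetric window and a $1/d$ distance weighting; the function name `cooccurrence` and the toy vocabulary are hypothetical, and, as a simplification, out-of-vocabulary tokens still occupy a position in the window here, which may differ from the reference GloVe implementation.

```python
from collections import defaultdict

def cooccurrence(tokens, vocab, window=8):
    """Symmetric, distance-weighted co-occurrence counts for one token list.

    Returns a dict mapping (i, j) vocabulary-index pairs to weighted counts,
    where two in-vocabulary words d positions apart contribute 1/d.
    """
    index = {word: i for i, word in enumerate(vocab)}
    ids = [index.get(tok) for tok in tokens]  # None marks out-of-vocabulary tokens
    counts = defaultdict(float)
    for pos, i in enumerate(ids):
        if i is None:
            continue
        for d in range(1, window + 1):
            if pos + d >= len(ids):
                break
            j = ids[pos + d]
            if j is None:
                continue
            counts[(i, j)] += 1.0 / d  # word j in the context of word i
            counts[(j, i)] += 1.0 / d  # symmetric window
    return dict(counts)

# Toy usage with a hypothetical four-word vocabulary.
vocab = ["she", "is", "a", "doctor"]
X = cooccurrence("she is a doctor".split(), vocab, window=2)
```

Storing only the non-zero entries in a dictionary reflects the sparsity noted above; a dense $V \times V$ array would be impractical for realistic vocabularies.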

Gradient-based optimization is then used to learn the optimal embedding parameters $w^*$, $u^*$, $b^*$, and $c^*$ which minimize the following loss:

$$J(X, w, u, b, c) = \sum_{i=1}^{V} \sum_{j=1}^{V} f(X_{ij}) \left( w_i^T u_j + b_i + c_j - \log X_{ij} \right)^2 \qquad (1)$$

where $w_i \in \mathbb{R}^D$ is the vector representation (embedding) of the $i$th word in the vocabulary, $1 \le i \le V$. The embedding dimension $D$ is commonly chosen to be between 100 and 300. The set of $u_j \in \mathbb{R}^D$ represent the “context” word vectors. They are not our target vectors; however, when $X$ is symmetric (as is the case when the context window is symmetric) the two sets of vectors are equivalent and differ only based on their initializations. Parameters $b_i$ and $c_j$ represent the bias terms for $w_i$ and $u_j$, respectively.

The weighting function $f$ in the above loss is used to attribute more importance to common word co-occurrences while ensuring that extremely common co-occurrences are not overly influential. It is defined as:

$$f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases} \qquad (2)$$

where $x_{\max}$ and $\alpha$ are hyperparameters. The original authors of GloVe used $x_{\max} = 100$ and found good performance with $\alpha = 0.75$.

We refer to the final learned embedding as $w^* = \{w_i^*\}$ throughout.
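As a rough illustration of the loss and weighting function described in this section, the following NumPy sketch evaluates both on a small dense matrix. The function names `weight` and `glove_loss` are hypothetical, the defaults $x_{\max} = 100$ and $\alpha = 0.75$ follow the values quoted above, and zero co-occurrences are skipped on the assumption that they contribute nothing to the sum.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): (x / x_max)^alpha below x_max, 1 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, w, u, b, c):
    """Weighted squared error summed over the non-zero entries of a dense X.

    X: (V, V) co-occurrences, w and u: (V, D) vectors, b and c: (V,) biases.
    """
    mask = X > 0
    safe_X = np.where(mask, X, 1.0)  # placeholder avoids log(0); masked out below
    resid = w @ u.T + b[:, None] + c[None, :] - np.log(safe_X)
    return float(np.sum(np.where(mask, weight(X) * resid ** 2, 0.0)))
```

A real implementation would iterate over a sparse representation rather than a dense array; the dense form is used here only to keep the correspondence with the double sum visible.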

3.2 Influence functions

In our work, we make extensive use of influence functions. Influence functions offer a way to approximate how a model’s learned optimal parameters will change if the training data is perturbed. It was recently shown that this classic statistics technique could be applied to modern machine learning algorithms [12]. Here we present some of the mathematical details of influence functions relevant to our work. For a more thorough treatment of the subject see [12, 5].

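Let $R(z, \theta)$ be a convex scalar loss function for a learning task, with optimal model parameters $\theta^*$ of the form in Equation 3 below, where $\{z_1, \dots, z_n\}$ are the training data points and $L(z_i, \theta)$ is the point-wise loss.

$$R(z, \theta) = \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta), \qquad \theta^* = \operatorname*{argmin}_{\theta} R(z, \theta) \qquad (3)$$

We would like to determine how the optimal parameters $\theta^*$ would change if we perturbed the $k$th point in the training set; i.e., $z_k \rightarrow \tilde{z}_k$. The optimal parameters under perturbation can be written as:

$$\tilde{\theta}(\epsilon) = \operatorname*{argmin}_{\theta} \left\{ R(z, \theta) + \epsilon L(\tilde{z}_k, \theta) - \epsilon L(z_k, \theta) \right\} \qquad (4)$$

where we seek $\tilde{\theta}|_{\epsilon = 1/n}$, noting that $\tilde{\theta}|_{\epsilon = 0} = \theta^*$. Since $\tilde{\theta}$ minimizes Equation 4, we must have

$$0 = \nabla_\theta R(z, \tilde{\theta}) + \epsilon \nabla_\theta L(\tilde{z}_k, \tilde{\theta}) - \epsilon \nabla_\theta L(z_k, \tilde{\theta})$$

for which we can compute the first order Taylor series expansion (with respect to $\theta$) around $\theta^*$. This gives:

$$0 \approx \nabla_\theta R(z, \theta^*) + \epsilon \nabla_\theta L(\tilde{z}_k, \theta^*) - \epsilon \nabla_\theta L(z_k, \theta^*) + \left[ \nabla_\theta^2 R(z, \theta^*) + \epsilon \nabla_\theta^2 L(\tilde{z}_k, \theta^*) - \epsilon \nabla_\theta^2 L(z_k, \theta^*) \right] (\tilde{\theta} - \theta^*)$$

Noting $\nabla_\theta R(z, \theta^*) = 0$, then dropping $o(\epsilon)$ terms, solving for $\tilde{\theta}$, and evaluating at $\epsilon = 1/n$ we obtain:

$$\tilde{\theta} \approx \theta^* - \frac{1}{n} H_{\theta^*}^{-1} \left[ \nabla_\theta L(\tilde{z}_k, \theta^*) - \nabla_\theta L(z_k, \theta^*) \right] \qquad (5)$$

where $H_{\theta^*} = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta^2 L(z_i, \theta^*)$ is the Hessian of the total loss.

Finally, in a more general case where several training points are perturbed, we have, by the linearity of the gradient operator:

$$\tilde{\theta} \approx \theta^* - \frac{1}{n} H_{\theta^*}^{-1} \sum_{k \in \delta} \left[ \nabla_\theta L(\tilde{z}_k, \theta^*) - \nabla_\theta L(z_k, \theta^*) \right] \qquad (6)$$

where $\delta$ is the set of indices of perturbed points, and it is assumed $|\delta| \ll n$.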
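To make the single-point approximation concrete, here is a small sketch on ordinary least squares rather than on GloVe. The model, the synthetic data, and all variable names are assumptions chosen for illustration; least squares is used only because the perturbed problem can be re-solved exactly, which gives a ground truth to compare against.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

def fit(X, y):
    """Exact minimizer of R = (1/n) * sum_i 0.5 * (x_i . theta - y_i)^2."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def grad_point(x, y_i, theta):
    """Gradient of the point-wise loss 0.5 * (x . theta - y_i)^2 w.r.t. theta."""
    return x * (x @ theta - y_i)

theta_star = fit(X, y)
H = X.T @ X / n  # Hessian of the total loss (constant for a quadratic loss)

# Perturb training point k: z_k -> z~_k.
k = 7
x_new, y_new = X[k] + 0.5, y[k] + 1.0

# Influence approximation: theta* - (1/n) H^{-1} [grad L(z~_k) - grad L(z_k)]
delta = grad_point(x_new, y_new, theta_star) - grad_point(X[k], y[k], theta_star)
theta_approx = theta_star - np.linalg.solve(H, delta) / n

# Ground truth: re-fit on the perturbed training set.
X_pert, y_pert = X.copy(), y.copy()
X_pert[k], y_pert[k] = x_new, y_new
theta_exact = fit(X_pert, y_pert)

approx_error = np.linalg.norm(theta_approx - theta_exact)
true_change = np.linalg.norm(theta_exact - theta_star)
```

Comparing `approx_error` with `true_change` shows how much of the actual parameter shift the first-order estimate captures without re-fitting.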

3.3 The Word Embedding Association Test

To analyze how changes to the corpus affect bias, we need a metric for measuring the bias of a given word embedding. We use the Word Embedding Association Test (WEAT), first introduced in [4].

Using the fact that semantically similar words in an embedding are “closer” to each other than dissimilar words, with cosine similarity as a distance metric, Caliskan et al. [4] adapted the Implicit Association Test (IAT) [7] to word embeddings. The theory behind the IAT is that people are faster at drawing connections between concepts that are more closely related in their minds; for example, most people are quicker to make a connection between “woman” and family-related terms and between “man” and career-related terms compared to the opposite pairing.

The WEAT adapts this principle to word embeddings. Biases such as gender bias that have been found in studies using the IAT also show up in an embedding created on a web-crawl corpus. For example, female terms (e.g., “her”, “she”, “woman”) are closer to family and arts terms than to career terms and math or science terms, and the opposite is true for male terms [4].

We use the WEAT because it is a simple and natural measure that, as explained above, has been shown to replicate the biases and associations that people exhibit.

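The WEAT considers two equal-sized sets $S$, $T$ of target words, such as $S = \{$math, algebra, geometry, calculus$\}$ and $T = \{$poetry, literature, symphony, sculpture$\}$, and two sets $A$, $B$ of attribute words, such as $A = \{$male, man, boy, brother, he$\}$ and $B = \{$female, woman, girl, sister, she$\}$.

Let $\cos(a, b)$ denote the cosine similarity of vectors $a$ and $b$, defined as $\frac{a \cdot b}{\|a\| \, \|b\|}$. Let

$$s(c, A, B) = \operatorname{mean}_{a \in A} \cos(c, a) - \operatorname{mean}_{b \in B} \cos(c, b)$$

denote the differential association of word $c$ with the attribute sets $A$ and $B$.

The test statistic is then defined as:

$$s(S, T, A, B) = \sum_{x \in S} s(x, A, B) - \sum_{y \in T} s(y, A, B)$$

and it measures the differential association of the target word sets with the attribute sets.

The effect size that we are interested in is:

$$\frac{\operatorname{mean}_{x \in S} s(x, A, B) - \operatorname{mean}_{y \in T} s(y, A, B)}{\operatorname{std\,dev}_{c \in S \cup T} \, s(c, A, B)}$$

Note that because $S$ and $T$ are of equal size, this effect size is bound to lie in the interval $[-2, 2]$.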
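The effect size can be sketched in a few lines of NumPy. This is an illustrative implementation, not the authors' code: the function names, the toy two-dimensional embedding, and the use of the population standard deviation (NumPy's default) are all assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def association(c, A, B, emb):
    """Differential association of word c with attribute sets A and B."""
    return (np.mean([cosine(emb[c], emb[a]) for a in A])
            - np.mean([cosine(emb[c], emb[b]) for b in B]))

def weat_effect_size(S, T, A, B, emb):
    """WEAT effect size for target sets S, T and attribute sets A, B."""
    s_assoc = [association(s, A, B, emb) for s in S]
    t_assoc = [association(t, A, B, emb) for t in T]
    pooled_std = np.std(s_assoc + t_assoc)  # population std over S union T (assumption)
    return float((np.mean(s_assoc) - np.mean(t_assoc)) / pooled_std)

# Hypothetical toy embedding: dict from word to vector.
emb = {
    "math": np.array([1.0, 0.1]), "algebra": np.array([1.0, 0.3]),
    "poetry": np.array([0.1, 1.0]), "art": np.array([0.3, 1.0]),
    "he": np.array([1.0, 0.0]), "she": np.array([0.0, 1.0]),
}
effect = weat_effect_size(["math", "algebra"], ["poetry", "art"], ["he"], ["she"], emb)
```

Swapping the two target sets flips the sign of the result, which is one quick sanity check on an implementation.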

4 Methodology

Our theoretical contributions are threefold. First, we formalize the problem of understanding bias in word embeddings and introduce the concepts of differential bias and bias gradient. Second, we show how the differential bias can be approximated in embeddings trained using the GloVe algorithm. Finally, we address how to approximate the bias gradient in GloVe.

4.1 Formalizing the Problem

Differential Bias.

Let $B(w)$ denote any bias metric that takes as input a word embedding $w$ and outputs a scalar. The word embedding is a function of the corpus $C$ it was learned from, and so we have that $B(w) = B(w(C))$. Let $\tilde{C}$ denote a small perturbation of the corpus resulting from the removal of some portion of its contents. Correspondingly, let $\tilde{w}$ be the word embedding learned from $\tilde{C}$.

We can view $C$ as a collection of individual documents, and think of a perturbation of the corpus as the removal of a single document. Since word embeddings are generally trained on a large corpus that consists of a large set of individual documents (e.g., websites, newspaper articles, Wikipedia entries), we use this framing throughout our analysis. Nonetheless, we note that the unit of analysis can take an arbitrary size (e.g., paragraphs, sets of documents), provided that only a relatively small portion of the corpus is removed. Thus our methodology allows an analyst to study how bias varies across documents, groups of documents, or whichever grouping is best suited to the domain.

We define the differential bias of document $k$ to be:

$$\Delta_k B = B(\tilde{w}) - B(w^*) \qquad (7)$$

which is the change in bias due to removing document $k$ from the corpus. This value decomposes the aggregate bias measured in an embedding to an incremental quantity, enabling a wide range of analyses (e.g., studying bias across document metadata categories like author, year, etc.).

Co-occurrence perturbations.

Several word embedding algorithms, including GloVe, operate on a co-occurrence matrix rather than directly on the corpus. The co-occurrence matrix $X$ is a function of the corpus $C$, and can be viewed as being constructed additively from the co-occurrence matrices of the individual documents in the corpus, where $X^{(k)}$ is the co-occurrence matrix for document $k$. In this manner, we can view $X$ as $\sum_k X^{(k)}$.

We then define $\tilde{X}$ as the co-occurrence matrix constructed from the perturbed corpus $\tilde{C}$. Noting that here we are framing perturbations as occurring at a document level, the perturbed corpus is obtained by omitting document $k$, and thus $\tilde{X} = X - X^{(k)}$.
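The additive view of the co-occurrence matrix can be sketched as follows. The helper names, the three toy documents, and the window of 2 are hypothetical; counts are keyed by word pairs in a `Counter` rather than stored in a $V \times V$ array.

```python
from collections import Counter

def doc_cooccurrence(tokens, window=2):
    """Symmetric co-occurrence counts for one document, weighting pairs by 1/d."""
    counts = Counter()
    for pos, wi in enumerate(tokens):
        for d in range(1, window + 1):
            if pos + d < len(tokens):
                wj = tokens[pos + d]
                counts[(wi, wj)] += 1.0 / d
                counts[(wj, wi)] += 1.0 / d
    return counts

def remove_document(X, X_k):
    """Perturbed counts after subtracting one document's counts, dropping zeros."""
    X_tilde = Counter(X)
    X_tilde.subtract(X_k)
    return Counter({pair: v for pair, v in X_tilde.items() if v > 1e-12})

docs = ["he is a doctor", "she is a nurse", "he is a nurse"]
per_doc = [doc_cooccurrence(doc.split()) for doc in docs]

X = Counter()
for X_k in per_doc:
    X.update(X_k)  # the corpus-level counts are the sum over documents

X_tilde = remove_document(X, per_doc[1])  # omit the second document
```

Because the corpus-level counts are a plain sum, removing a document never requires re-scanning the rest of the corpus.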

Bias Gradient.

When the loss function of the embedding algorithm is a differentiable function of the co-occurrence matrix, and the bias metric is also differentiable, we can do further analysis. Let $\tilde{w} = w(\tilde{X})$, where $\tilde{X}$ is a variable co-occurrence matrix. Since we are interested in document removals, we require $\tilde{X}_{ij} \le X_{ij}$. We can determine the co-occurrences most responsible for bias by considering the gradient of the differential bias with respect to $\tilde{X}$. We define the bias gradient as:

$$\nabla_X B(w(X)) = \nabla_w B(w) \, \nabla_X w(X) \qquad (8)$$

where the above expression is obtained using the chain rule.

This bias gradient has the same dimension as the co-occurrence matrix $X$, and can be thought of as a $V \times V$ matrix. The gradient “points” in the direction of maximal bias increase. That is, if we wished to construct a document (with a fixed word count) that most affected the bias of the overall word embedding, we would create a document with a co-occurrence matrix as closely resembling $\lambda \nabla_X B(w(X))$ as possible, where $\lambda$ is scalar.

While $V \times V$ is a daunting size, if the bias metric is only affected by a small subset of the words in the vocabulary, as is the case with WEAT, it may nonetheless be feasible to compute the bias gradient.

The bias gradient can also be used to linearly approximate the differential bias of document $k$. They are related by the following equation:

$$\Delta_k B \approx -\nabla_X B(w(X)) \cdot X^{(k)} \qquad (9)$$
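The linear relation between the bias gradient and the differential bias can be motivated by a first-order Taylor expansion. The short derivation below is a sketch under two assumptions: the differential bias is taken as the bias after removal minus the bias before removal, and the product $\cdot$ is read as the sum of element-wise products of the two matrices.

```latex
\begin{aligned}
B\big(w(\tilde{X})\big) &\approx B\big(w(X)\big) + \nabla_X B\big(w(X)\big) \cdot (\tilde{X} - X) \\
\tilde{X} = X - X^{(k)} \;\;\Longrightarrow\;\;
B\big(w(\tilde{X})\big) - B\big(w(X)\big) &\approx -\nabla_X B\big(w(X)\big) \cdot X^{(k)}
\end{aligned}
```

Under the opposite sign convention for the differential bias, the right-hand side would simply change sign.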


4.2 Computing the Differential Bias for GloVe

In order to compute the differential bias for a document, we must have the perturbed word embedding $\tilde{w}$. The naive way to achieve this is simply to remove the document from the corpus and retrain the embedding. However, if we wish to learn the differential bias of every document in the corpus, this approach is clearly computationally infeasible.

One of our contributions is a method to overcome this barrier. Instead of computing $\tilde{w}$ directly, we approximate it using influence functions. Influence functions generally require the use of $H_{\theta^*}$, as in Equation 5. In the case of GloVe this would be a $2V(D+1) \times 2V(D+1)$ matrix, which would be completely intractable to work with. We make a simplifying assumption about the behavior of the GloVe loss function around $w^*$ which dramatically reduces the computational requirements and renders the computation of $\tilde{w}$ for every document in a corpus possible. The details follow.

Approximating leave-one-out retraining.

To approximate $\tilde{w}$ using influence functions, we must apply Equation 5 to the GloVe loss function from Equation 1. In doing so, we make a simplifying assumption, treating the GloVe parameters $u$, $b$, and $c$ as constants throughout the analysis. As a result, the parameters $\theta$ consist only of $w$ (i.e., $u$, $b$, and $c$ are excluded from $\theta$). The number of points $n$ is $V$, and the training points $z = \{z_i\}$ are in our case $X = \{X_i\}$, where $X_i$ refers to the $i$th row of the co-occurrence matrix (not to be confused with the co-occurrence matrix of the $k$th document, denoted as $X^{(k)}$). With these variables mapped over, the point-wise loss function for GloVe becomes:

$$L(X_i, w) = V \sum_{j=1}^{V} f(X_{ij}) \left( w_i^T u_j + b_i + c_j - \log X_{ij} \right)^2 \qquad (10)$$

and the total loss is then $J(X, w) = \frac{1}{V} \sum_{i=1}^{V} L(X_i, w)$, now in the form of Equation 3.

Note that $w^*$ is still learned through dynamic updates of all of the parameters. It is only in deriving this influence function-based approximation for $\tilde{w}$ that we treat $u$, $b$, and $c$ as constants.

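In order to use Equation 6 to approximate $\tilde{w}$ we need an expression for the gradient with respect to $w$ of the point-wise loss, $\nabla_w L(X_i, w)$, as well as the Hessian of the total loss, $H_w$. We derive these here, starting with the gradient.

Recall that $w = \{w_1, w_2, \dots, w_V\}$, where each $w_i \in \mathbb{R}^D$. We observe that $L(X_i, w)$ depends only on $w_i$, $u$, $b_i$, and $c$; no word vector $w_k$ with $k \neq i$ is needed to compute the point-wise loss at $X_i$. Because of this, $\nabla_w L(X_i, w)$, the gradient with respect to $w$ (a vector in $\mathbb{R}^{VD}$), will have only $D$ non-zero entries. These non-zero entries are the entries in $\nabla_{w_i} L(X_i, w)$, the gradient of the point-wise loss function at $X_i$ with respect to only word vector $w_i$. Visually, this is as follows:

$$\nabla_w L(X_i, w) = \big( \underbrace{0, \dots, 0}_{D(i-1)}, \; \nabla_{w_i} L(X_i, w)^T, \; \underbrace{0, \dots, 0}_{D(V-i)} \big)^T \qquad (11)$$

where the $D$-dimensional vector given by $\nabla_{w_i} L(X_i, w)$ is:

$$\nabla_{w_i} L(X_i, w) = 2V \sum_{j=1}^{V} f(X_{ij}) \left( w_i^T u_j + b_i + c_j - \log X_{ij} \right) u_j \qquad (12)$$

From Equation 11, we see that the Hessian of the point-wise loss with respect to $w$, $\nabla_w^2 L(X_i, w)$ (a $VD \times VD$-dimensional matrix), is extremely sparse, consisting of only a single $D \times D$ block in the $i$th diagonal block position. As a result, the Hessian of the total loss, $H_w = \frac{1}{V} \sum_{i=1}^{V} \nabla_w^2 L(X_i, w)$ (also a $VD \times VD$ matrix), is block diagonal, with $V$ blocks of dimension $D \times D$. Each $D \times D$ diagonal block is given by:

$$H_{w_i} = \frac{1}{V} \nabla_{w_i}^2 L(X_i, w) = 2 \sum_{j=1}^{V} f(X_{ij}) \, u_j u_j^T \qquad (13)$$

which is the Hessian with respect to only word vector $w_i$ of the point-wise loss at $X_i$.

This block-diagonal structure allows us to solve for each $\tilde{w}_i$ independently. Moreover, $\tilde{w}_i$ will only differ from $w_i^*$ for the tiny fraction of words whose co-occurrences are affected by the removal of the selected document for the corpus perturbation. Putting all of this together, we obtain:

$$\tilde{w}_i \approx w_i^* - \frac{1}{V} H_{w_i}^{-1} \left[ \nabla_{w_i} L(\tilde{X}_i, w^*) - \nabla_{w_i} L(X_i, w^*) \right] \qquad (14)$$

This formula can be used to approximate how a set of word vectors will change due to a given corpus perturbation. Notice that for all $i$ where $\tilde{X}_i = X_i$, $\tilde{w}_i = w_i^*$.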
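The per-word update described above can be sketched in NumPy as follows. This is an illustrative reading of the derivation, not the released implementation: the function names are hypothetical, rows are dense vectors, $u$, $b$, and $c$ are held fixed, and the Hessian block is evaluated on the unperturbed row.

```python
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x)."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def grad_wi(X_i, w_i, u, b_i, c):
    """Gradient of the point-wise loss at row X_i w.r.t. w_i (non-zero entries only)."""
    V = len(X_i)
    nz = X_i > 0
    resid = u[nz] @ w_i + b_i + c[nz] - np.log(X_i[nz])
    return 2.0 * V * (weight(X_i[nz]) * resid) @ u[nz]

def hessian_wi(X_i, u):
    """Diagonal block 2 * sum_j f(X_ij) u_j u_j^T, a (D, D) matrix."""
    nz = X_i > 0
    return 2.0 * (u[nz].T * weight(X_i[nz])) @ u[nz]

def approx_perturbed_vector(X_i, X_i_tilde, w_i, u, b_i, c):
    """Influence-function estimate of word i's vector after row X_i becomes X_i_tilde."""
    V = len(X_i)
    delta = grad_wi(X_i_tilde, w_i, u, b_i, c) - grad_wi(X_i, w_i, u, b_i, c)
    return w_i - np.linalg.solve(hessian_wi(X_i, u), delta) / V
```

Only a $D \times D$ linear system is solved per affected word, which is what makes the approximation cheap compared with retraining; words whose rows are unchanged keep their original vectors.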

4.3 Computing the Bias Gradient for GloVe

The bias gradient presented in Equation 8 can be thought of as a $V \times V$ matrix indicating the direction of perturbation of the corpus that will result in the maximal change in bias.

We approximate this gradient for GloVe based on Equation 14. As long as the bias metric is a function of only a small subset of the words in the vocabulary (as is the case with WEAT), our simplifying assumption (i.e., treating $u$, $b$, and $c$ as constant in the influence function approximation) causes the gradient to be very sparse, and thus easily computable.

Note that since the GloVe loss function is not differentiable with respect to $X_{ij}$ when $X_{ij} = 0$, we can only consider non-zero co-occurrences. However, our main motivation is to understand the sources of bias present in a word embedding by calculating how the bias would change by removing parts of the training corpus. Since we are interested in understanding perturbations stemming from document removals, we do not require the derivatives with respect to zero-valued co-occurrences (since removing a document cannot affect a zero-valued co-occurrence). Of course, nothing limits us from using the bias gradient to consider the addition of new documents that do not change the set of zero co-occurrences. We can also study the effect of adding documents that do change the set of zero co-occurrences with the methods of Section 4.2. The mathematical details follow; some familiarity with tensor algebra is required.

Assuming the bias metric is only a function of a small subset of the words in the vocabulary, combining Equation 8 with Equation 14 gives that the bias gradient for GloVe is:

$$\nabla_X B(w(X)) \approx -\frac{1}{V} \sum_{i \in E} \nabla_{w_i} B(w) \, H_{w_i}^{-1} \, \nabla_{\tilde{X}} \big[ \nabla_{w_i} L(\tilde{X}_i, w) \big] \qquad (15)$$

where $E$ is the set of indices of words affecting the bias metric. Recall that $\tilde{w} = w(\tilde{X})$, where $\tilde{X}$ is a variable co-occurrence matrix.

We now turn our attention to the factors in the summation over $i$ in Equation 15 for the bias gradient and explain why each factor can be efficiently computed for the GloVe algorithm. Recall that the bias gradient “points” in the direction of maximal bias increase, and it can be used to linearly approximate the differential bias. In order to use the bias gradient, however, it is essential that it can be efficiently computed.

The first factor, $\nabla_{w_i} B(w)$, is a vector in $\mathbb{R}^D$; when the bias metric used is the WEAT, this factor is easy to compute. The second factor, $H_{w_i}^{-1}$, is the inverse Hessian examined previously; since $D$ is small (commonly between 100 and 300) and each entry in this $D \times D$ matrix is easily computable, this factor can be quickly computed.

The third factor, $\nabla_{\tilde{X}} [\nabla_{w_i} L(\tilde{X}_i, w)]$, requires the most attention. It is a $D \times V \times V$ tensor. However, because $L(\tilde{X}_i, w)$ is only a function of $\tilde{X}_i$ (the $i$th row of the perturbation co-occurrences), this tensor is $0$ in all but the $i$th position along one of the axes. The “matrix” in that non-zero position can be found by computing:

$$\nabla_{\tilde{X}_i} \big[ \nabla_{w_i} L(\tilde{X}_i, w) \big]$$

evaluated at $\tilde{X} = X$. Therefore, this last factor is also easily computed. In practice, we can compute this either directly or using automatic differentiation. Note that the partial derivatives are only defined for non-zero terms, as explained above.

We therefore see that the bias gradient can be efficiently computed. We now turn to validating our methodology.
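One way to compute a single row of the bias gradient directly is sketched below. It is a hedged illustration derived from the three factors discussed above, not the authors' code: the function names are hypothetical, every entry of the row is assumed to be non-zero, the weighting function is differentiated analytically below $x_{\max}$, and the factor $2V$ from the point-wise gradient is combined with the $1/V$ prefactor.

```python
import numpy as np

X_MAX, ALPHA = 100.0, 0.75  # GloVe weighting hyperparameters

def weight(x):
    """GloVe weighting f(x)."""
    return np.where(x < X_MAX, (x / X_MAX) ** ALPHA, 1.0)

def weight_prime(x):
    """Derivative of f: alpha * f(x) / x below x_max, zero above it."""
    return np.where(x < X_MAX, ALPHA * weight(x) / x, 0.0)

def bias_gradient_row(grad_B_wi, X_i, w_i, u, b_i, c):
    """Row i of the approximate bias gradient (assumes every X_ij > 0).

    grad_B_wi: gradient of the bias metric w.r.t. w_i, shape (D,).
    Returns a length-V vector whose j-th entry estimates dB / dX_ij.
    """
    resid = u @ w_i + b_i + c - np.log(X_i)
    H = 2.0 * (u.T * weight(X_i)) @ u  # Hessian block for word i, shape (D, D)
    # d/dX_ij of the point-wise gradient is 2V * coeff_j * u_j; the 2V and the
    # 1/V prefactor combine into the factor 2 folded into coeff.
    coeff = 2.0 * (weight_prime(X_i) * resid - weight(X_i) / X_i)
    third = u.T * coeff  # (D, V); column j is coeff_j * u_j
    return -(grad_B_wi @ np.linalg.solve(H, third))
```

Automatic differentiation would avoid the hand-derived `coeff` term, at the cost of an extra dependency.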

5 Experimentation

The principal objective of our experimentation is to validate our methodology. In particular, we show that we can obtain very accurate approximations of the differential bias resulting from perturbations that exclude small sets of documents from the original training corpus. We accomplish this by first assessing the baseline WEAT bias in an unperturbed corpus, then using our methodology to predict how that bias will change after perturbing the corpus, and finally validating the prediction by comparing it with how the bias actually changes when we train new embeddings on the perturbed corpus.

We ran these experiments on two corpora and hyperparameter configurations, considering two different WEAT biases in each. Our approximations are highly correlated with our validation data, demonstrating that our technique can be used in real-world applications to understand the origins of bias in word embeddings. Our results shed light on how bias is distributed throughout the documents in the training corpora.

5.1 Experimental Setup

Here we discuss our choice of corpora, GloVe hyperparameters and bias metrics. These were used throughout our experimentation.

Choice of corpus and hyperparameters.

We use two corpora in our experiments, each with a different set of GloVe hyperparameters. Since empirical validation requires training numerous word embeddings from scratch (i.e., re-running the entire GloVe algorithm), it was helpful to tune our experimentation with a smaller and more controlled setup. This first setup consists of a corpus constructed from a Simple English Wikipedia dump (2017-11-03) [20] using 75-dimensional word vectors. These dimensions are small by the standards of a typical word embedding, but sufficient to start capturing syntactic and semantic meaning. Performance on the standard TOP-1 analogies test (shipped with the GloVe code base) was around 35%, lower than state-of-the-art performance but still clearly capturing significant meaning.

Our second setup is more representative of the academic and commercial contexts in which our technique could be applied. The corpus is constructed from 20 years of New York Times (NYT) articles [18], using 200-dimensional vectors. The TOP-1 analogy performance is approximately 54%. The details of these two configurations are tabulated in Table 1.

                      Wiki          NYT
Min. doc. length      200           100
Max. doc. length      10,000        30,000
Num. documents        29,344        1,412,846
Num. tokens           17,033,637    975,624,317
Token min. count      15            15
Vocabulary size       44,806        213,687
Context window        symmetric     symmetric
Window size           8             8
$\alpha$              0.75          0.75
$x_{\max}$            100           100
Vector Dimension      75            200
Training epochs       300           150
TOP-1 Analogy         ≈35%          ≈54%
Table 1: Experimental Setups

Choice of experimental bias metric.

Throughout our experiments, for our bias metric we used the effect size of two different WEAT biases as presented by [4]. Recall that the WEAT uses two equal-sized sets of target words and two sets of attribute words. In the first, the target word sets are science and arts terms, while the attribute word sets are male and female terms. In the second, the target word sets are musical instruments and weapons, while the attribute word sets are pleasant and unpleasant terms.

These metrics have been shown to correlate with known human biases as measured by the Implicit Association Test [7]. A full list of the words in these sets can be found in the Appendix. They are summarized in Table 2.

         Target Sets             Attribute Sets
WEAT1    science, arts           male, female
WEAT2    instruments, weapons    pleasant, unpleasant
Table 2: WEAT Target and Attribute Sets

5.2 Experimental Methodology

To test our methodology, ideally we would simply remove a single document from a word embedding’s corpus, train a new embedding, and compare the change in bias with our differential bias approximation. However, directly validating our approximation for a single document is challenging. Two GloVe word embeddings trained on the same corpus, with identical hyperparameters, will still differ due to their random initializations and the stochastic nature of GloVe’s optimization. As a result, the WEAT bias measured in these two embeddings will vary due to this retraining noise. The effect of removing a single document, which is near zero for a typical document, is hidden in this noise. In order to obtain measurable changes, we consider removing sets of documents, which results in larger perturbations. We also predict and validate the effect of removing these sets using several embeddings, each trained with the same parameters but differing in their random seeds. Our detailed methodology is as follows.

I - Train a baseline.

We start by training 10 word embeddings using the aforementioned parameters, but using different random seeds. These embeddings create a baseline for the unperturbed bias $B(w^*)$.

II - Approximate the differential bias of each document.

For each WEAT test, we approximate the differential bias of every document in the corpus. We do so by first using the influence function based methods from Equation 14 to approximate the perturbed word embedding $\tilde{w}$. We then use Equation 7 to calculate the resulting change in bias. We make the approximation several times, using the learned parameters $w^*$, $u^*$, $b^*$, and $c^*$ from different baseline embeddings in our different approximations. We then average these approximations for each document, and construct a histogram.

III - Construct perturbation sets.

We perturb the corpus by removing sets of documents. We construct two types of perturbation sets: targeted and random. The targeted perturbation sets are constructed from the documents whose removals were predicted to cause the greatest differential bias (in absolute value), i.e., the documents located in the tails of the histograms. For the Wiki setup we consider the 10, 30, 100, 300, and 1000 most influential documents for each bias, while for the NYT setup we consider the 100, 300, 1000, 3000, and 10,000 most influential. This results in 10 perturbation sets per corpus per bias, for a total of 40.

The random sets are, as their name suggests, drawn uniformly at random from the entire set of documents used in the training corpus. For the Wiki setup we consider 6 sets of 10, 30, 100, 300, and 1000 documents (30 total). Because training times are much longer, we limit this to 6 sets of 10,000 documents for the NYT setup. Therefore we consider a total of 36 random sets.
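The construction of targeted and random perturbation sets can be sketched as below. The function names and the dictionary layout are hypothetical, and the labels "increase" and "decrease" assume that the differential bias is signed as the change in bias caused by removal.

```python
import numpy as np

def targeted_sets(diff_bias, sizes):
    """Document indices from both tails of the approximated differential bias.

    diff_bias: one approximated differential bias per document.
    Returns a dict keyed by (direction, size), giving two sets per size.
    """
    order = np.argsort(diff_bias)  # ascending
    sets = {}
    for n in sizes:
        sets[("decrease", n)] = order[:n]   # most negative tail
        sets[("increase", n)] = order[-n:]  # most positive tail
    return sets

def random_sets(num_docs, sizes, repeats, seed=0):
    """Uniformly random document sets, `repeats` sets for each requested size."""
    rng = np.random.default_rng(seed)
    return [rng.choice(num_docs, size=n, replace=False)
            for n in sizes for _ in range(repeats)]
```

With five sizes, `targeted_sets` yields ten sets per bias, and `random_sets` with `repeats=6` yields thirty sets for five sizes, matching the counts described above.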

IV - Approximate the differential bias of each perturbation set.

We then approximate the differential bias of each perturbation set. Note that $B(\tilde{w})$ is not linear in $\tilde{X}$. Therefore determining the differential bias of a perturbation set does not amount to simply summing the differential bias of each document in the set (although in practice we find it to be close). Here we make 10 approximations, one with each of the different baseline embeddings.

V - Validate the differential bias of each perturbation set.

Finally, for each perturbation set, we remove the target documents from the corpus, and train 5 new embeddings on this perturbed corpus. We use the same hyperparameters, again varying only the random seed. We then calculate the “true” bias of the perturbed corpus, and this serves as our validation.

VI - Compare.

We then compare the true change in bias with our predictions. The results are discussed and plotted in Section 5.3 below.
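The comparison step amounts to summarizing each perturbation set across random seeds and correlating the approximated means with the validated means. The sketch below assumes the Pearson correlation is the intended measure and reports its square; the function names and array layout are hypothetical.

```python
import numpy as np

def summarize(runs):
    """Mean and standard deviation of effect sizes across seeds.

    runs: array-like of shape (num_perturbation_sets, num_seeds).
    """
    runs = np.asarray(runs, dtype=float)
    return runs.mean(axis=1), runs.std(axis=1)

def r_squared(approx_means, validated_means):
    """Squared Pearson correlation between approximated and validated means."""
    r = np.corrcoef(approx_means, validated_means)[0, 1]
    return float(r ** 2)
```

In this layout, `summarize` would be applied once to the approximations and once to the retrained embeddings, and `r_squared` to the two resulting vectors of means.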

Most of the code used in the experimentation has been made available online (https://github.com/mebrunet/understanding-bias). It is written principally in Julia [2]. Figures were generated using Matplotlib [10].

5.3 Experimental Results

The baseline WEAT effect sizes are shown in Table 3

. It is worth noting that the WEAT2 (weapons vs. instruments) bias was not significant in our Wiki setup. However, our analysis does not require that the bias under consideration fall within any particular range of values. Overall, the baseline bias showed a lot of variance due only to changes in the random seed.

        WEAT1                   WEAT2
Wiki    μ: 0.957, σ: 0.150      μ: 0.108, σ: 0.213
NYT     μ: 1.14, σ: 0.124       μ: 1.32, σ: 0.056
Table 3: Baseline WEAT Effect Sizes
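For reference, the effect size reported in Table 3 can be sketched in a few lines of Python. The sketch follows the WEAT effect size as defined by Caliskan et al. [4]: the difference between the mean associations of the two target sets, divided by the standard deviation of the associations over both target sets. The use of the population standard deviation is an assumption of this sketch, the function names are hypothetical, and the 2-D vectors are toy values rather than trained embeddings.

```python
import math
from statistics import mean, pstdev

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(w, A, B):
    """Mean similarity of w to attribute set A minus its mean similarity to B."""
    return mean(cosine(w, a) for a in A) - mean(cosine(w, b) for b in B)

def weat_effect_size(S, T, A, B):
    """WEAT effect size for target sets S, T and attribute sets A, B."""
    s_assoc = [association(s, A, B) for s in S]
    t_assoc = [association(t, A, B) for t in T]
    # Assumption: population standard deviation over the union of both target sets.
    return (mean(s_assoc) - mean(t_assoc)) / pstdev(s_assoc + t_assoc)

# Toy 2-D "embeddings": S leans toward A, T leans toward B.
S = [[1.0, 0.1], [1.0, 0.2]]
T = [[0.1, 1.0], [0.2, 1.0]]
A = [[1.0, 0.0]]
B = [[0.0, 1.0]]

effect_size = weat_effect_size(S, T, A, B)
print(f"effect size: {effect_size:.3f}")
```

Under these assumptions, and with target sets of equal size, the magnitude of the effect size cannot exceed 2; the toy example sits close to that upper limit because the two target sets are almost perfectly separated.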

A histogram of the differential bias of removal for each setup and WEAT test can be seen in Figure 1. Notice the log scale on the vertical axis, and how the vast majority of documents are predicted to have a very small impact on the differential bias.

Figure 1: Histogram of the approximated differential bias of removal for every document in our Wiki setup (top) and NYT setup (bottom), considering WEAT1 (left) and WEAT2 (right), measured in percent change from the corresponding baseline means in Table 3.

We assess the accuracy of our approximations by measuring their correlation with our validation data in Figure 2. We find extremely strong correlations between the means of our approximations and the validations, tabulated in Table 4. All correlations are above 0.985.

        WEAT1    WEAT2
Wiki    0.986    0.993
NYT     0.995    0.997
Table 4: Correlation of Approximated and Validated Mean Biases

Figure 2: Approximated vs. validated WEAT bias effect size due to the removal of each perturbation set in Wiki setup (top) and NYT setup (bottom), considering WEAT1 (left) and WEAT2 (right); points plot the means; error bars depict one standard deviation; dashed line shows least squares fit; the baseline means are shown with vertical dotted lines.
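The comparison summarized in Table 4 amounts to correlating two lists of means, one entry per perturbation set. A minimal sketch follows, assuming the Pearson correlation coefficient is the statistic of interest; the function name `pearson_r` and the numbers are hypothetical and are not taken from our experiments.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical mean effect sizes, one pair per perturbation set.
approx_means = [0.60, 0.80, 0.95, 1.10, 1.25]   # mean over baseline-based approximations
valid_means = [0.45, 0.74, 0.96, 1.17, 1.40]    # mean over retrained embeddings

r = pearson_r(approx_means, valid_means)
print(f"r = {r:.4f}")
```

A high correlation indicates that the approximations preserve the ordering and relative spacing of the perturbation sets, even if they are off by a scale factor.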

We further compare our approximations to our validations in Figure 3. We see that while our approximations tend to underestimate the magnitude of the change in effect size due to a perturbation, relative ranking is preserved. Because the WEAT effect size is bounded in [-2, 2], there is a “squashing” effect at the boundaries, and uncertainty decreases as the magnitude of the effect size grows. It is also worth noting that there was no apparent change in the analogy performance of the perturbed embeddings.

Figure 3: Approximated and validated differential bias of removal for every perturbation set in Wiki setup (top) and NYT setup (bottom), considering WEAT1 (left) and WEAT2 (right); the baseline means are shown with vertical dotted lines.

We ran a Welch’s t-test to compare the validation biases with the baseline biases. Of the 36 random perturbation sets, only 2 differed significantly from the baseline. Both of these sets were Wiki perturbations, and they only showed a significant difference for WEAT2. This is in strong contrast to the 40 targeted perturbation sets, where only 2 did not significantly differ from their respective baselines. In this case, both were from the smallest (10 document) Wiki perturbation sets.
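The test statistic behind this comparison can be sketched with the standard library alone. The sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom for one perturbation set, assuming its 5 validation effect sizes are compared against the 10 baseline effect sizes. The function name `welch_t` is hypothetical and the effect sizes are made-up illustrative values.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard errors of the two sample means (unbiased sample variances).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical WEAT effect sizes (not experimental results).
baseline = [0.95, 1.10, 0.80, 1.02, 0.91, 1.15, 0.76, 0.99, 1.08, 0.85]  # 10 seeds
perturbed = [0.55, 0.62, 0.48, 0.70, 0.58]                               # 5 seeds

t_stat, dof = welch_t(perturbed, baseline)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof:.1f}")
```

Turning the statistic into a p-value requires the CDF of Student's t distribution, which the standard library does not provide; if SciPy is available, `scipy.stats.ttest_ind(perturbed, baseline, equal_var=False)` performs the same test and returns the p-value directly.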

A clear advantage of using our methodology to approximate the differential bias is computational efficiency. The speedups are between 3 and 4 orders of magnitude. Approximating the differential bias for each of the 29,344 Wikipedia documents took roughly two hours on a 16-core desktop workstation. We could retrain fewer than a dozen embeddings of the same size during that time.
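The stated range can be checked with a back-of-the-envelope calculation. The sketch assumes that measuring the effect of removing one document by retraining would take at least one retraining per document, and it uses 12 as an upper bound on the number of retrainings that fit in the same two hours; the variable names are hypothetical.

```python
import math

documents_approximated = 29_344   # Wikipedia documents covered in roughly two hours
retrainings_in_same_time = 12     # upper bound: "fewer than a dozen"

# Lower bound on the speedup, since fewer than 12 retrainings fit in that time.
speedup_lower_bound = documents_approximated / retrainings_in_same_time
orders_of_magnitude = math.log10(speedup_lower_bound)

print(f"speedup of at least {speedup_lower_bound:.0f}x")
print(f"about {orders_of_magnitude:.1f} orders of magnitude")
```

The result falls between 3 and 4 orders of magnitude; averaging over several random seeds per document, as done in the validation, would push the cost of the retraining approach higher still.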

With our methodology, many large-scale analyses of the origins of bias in a word embedding become computationally feasible. There are many possible applications. For example, our technique could be used to create a “bias-checker” for newspaper articles, wherein articles submitted to the bias-checker would be categorized as more or less biased than the rest of the newspaper on average, using a given bias metric. Moreover, based on the change in bias when small sets of documents are removed from the training corpus, it would be possible to identify which sections of a corpus (or authors) are more or less biased than average.
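As a rough illustration of the section-level analysis suggested above, the per-document approximations can be averaged within groups and compared with the corpus-wide average. Everything in this sketch is hypothetical: the document ids, section labels, values, and the function name `mean_delta_by_group`. Averaging single-document effects is also only a first-order summary, since the effect of removing a whole group is not exactly the sum of its parts.

```python
from collections import defaultdict
from statistics import mean

def mean_delta_by_group(doc_deltas, doc_groups):
    """Average the per-document differential bias within each group."""
    grouped = defaultdict(list)
    for doc_id, delta in doc_deltas.items():
        grouped[doc_groups[doc_id]].append(delta)
    return {group: mean(values) for group, values in grouped.items()}

# Hypothetical per-document differential biases and section labels.
doc_deltas = {"a1": -0.004, "a2": -0.002, "b1": 0.001, "b2": 0.003, "c1": 0.000}
doc_groups = {"a1": "science", "a2": "science", "b1": "arts", "b2": "arts", "c1": "sports"}

overall = mean(doc_deltas.values())
by_section = mean_delta_by_group(doc_deltas, doc_groups)

for section, value in sorted(by_section.items()):
    relation = "above" if value > overall else "at or below"
    print(f"{section}: {value:+.4f} ({relation} the corpus average of {overall:+.4f})")
```

The same grouping could be keyed on author or publication year instead of section.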

5.4 Qualitative results

We have demonstrated that removing articles identified by our methodology impacts the WEAT metric in a predictable, quantifiable, and significant way. As previously mentioned, this metric has been shown to correlate with known human biases.

We intentionally limit our discussion of qualitative results (e.g., by not including lists of bias-affecting documents or co-occurrences) as they are inherently subjective, prone to confirmation bias, and risk perpetuating stereotypical biases. The current work is dedicated to presenting our method to predict how corpus perturbations affect the resulting embedding’s bias. Here, we give a brief comment on qualitative findings using our method, and defer a full, proper treatment to future work.

We inspected the 20 most bias-correcting and bias-aggravating documents in the New York Times corpus on the WEAT1 bias metric (male vs. female terms and science vs. arts). Many of these documents can be readily understood to be correcting or aggravating the bias. For example, one of the bias-correcting documents is an article entitled “For Women in Astronomy, a Glass Ceiling in the Sky”; the bias-correcting documents also included interviews with female scientists. The most bias-aggravating documents mainly consisted of articles describing the work of male engineers and scientists, and also included an article referencing a female author’s essay about Judith Farr, a female literary critic and professor of English.

Many of the remaining documents could be similarly understood to be correcting or aggravating the bias simply through their impact on overall co-occurrence statistics. For example, a few of the correcting documents report on science or medicine for women (e.g., hormone treatment for menopause). However, some documents, despite containing many words from the WEAT sets, were not intuitively related to the bias.

6 Conclusion

In this work, we developed and experimentally validated a technique to trace the origins of bias in word embeddings. In particular, we measured the change in bias resulting from removing small subsets of the training corpus. As doing this manually for each small subset would be computationally infeasible, we developed a methodology to approximate this change using influence functions, and applied it to the GloVe word embedding algorithm. We experimentally validated our results on Simple Wikipedia and the New York Times using the WEAT metric and considering two different biases. We perturbed the corpora and compared the influence function approximation of the change in bias to the “true” change in bias obtained by retraining the GloVe algorithm on the perturbed corpora.

Our method accurately approximates the true changes in bias resulting from removing small groups of training documents (all correlations between approximated and validated mean biases are above 0.985). We use our methodology to compute the change in bias that would result from the removal of every individual document in both the Simple Wikipedia and the New York Times corpora.

Our work represents an important contribution to the understanding of bias in machine learning algorithms. There are many directions for future work. We are currently developing a similar influence function technique for the word2vec algorithm [14] which, unlike GloVe, is highly local and uses downsampling and negative sampling. We are also currently working on improvements to the WEAT metric to facilitate analyses based on corpus changes, such as the one proposed here. Another interesting direction is to consider bias metrics other than WEAT, such as the direct bias metric from [3].

There is also much to be gained from applying our technique to other corpora and bias metrics. For example, one could apply our technique to measure how sections of a newspaper or online social media platform contribute to overall bias. It would also be interesting to scrape all of the comments from a news website (e.g., The New York Times) and train an embedding on a combined corpus of news articles and comments. It would then be possible to measure how removing the comments from the corpus affected the bias of the embedding; this analysis could even be done on a section-by-section or theme-by-theme basis.

Our methodology could also be applied to assess how the bias of a set of texts has evolved over time. For example, using publicly available newspaper archives or e-books with time period metadata, one could measure how the gender bias of the newspaper or of novels has evolved over time.

More broadly, our methodology of applying influence functions to trace how perturbations in training data affect changes in the bias of the output is a general idea that could be applied in many other contexts.


  • [1] J. Angwin, J. Larson, S. Mattu, and L. Kirchner. Machine bias: There’s software used across the country to predict future criminals. and it’s biased against blacks. ProPublica, 2016.
  • [2] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah. Julia: A fresh approach to numerical computing. CoRR, abs/1411.1607, 2014.
  • [3] T. Bolukbasi, K.-W. Chang, J. Zou, V. Saligrama, and A. Kalai. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In 30th Conference on Neural Information Processing Systems (NIPS), 2016.
  • [4] A. Caliskan, J. J. Bryson, and A. Narayanan. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, 2017.
  • [5] R. Cook and S. Weisberg. Characterizations of an empirical influence function for detecting influential cases in regression. 22:495–508, 11 1980.
  • [6] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226. ACM, 2012.
  • [7] A. G. Greenwald, D. E. McGhee, and J. L. K. Schwartz. Measuring individual differences in implicit cognition: The implicit association test. Journal of Personality and Social Psychology, 74(6):1464–1480, 1998.
  • [8] M. Hardt, E. Price, N. Srebro, et al. Equality of opportunity in supervised learning. In Advances in neural information processing systems, pages 3315–3323, 2016.
  • [9] L. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, and T. Darrell. Generating visual explanations. In European Conference on Computer Vision, pages 3–19. Springer, 2016.
  • [10] J. D. Hunter. Matplotlib: A 2d graphics environment. Computing In Science & Engineering, 9(3):90–95, 2007.
  • [11] J. Kleinberg, S. Mullainathan, and M. Raghavan. Inherent trade-offs in the fair determination of risk scores. arXiv preprint arXiv:1609.05807, 2016.
  • [12] P. W. Koh and P. Liang. Understanding Black-box Predictions via Influence Functions. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1885–1894, 2017.
  • [13] M. Lomas, R. Chevalier, E. C. II, R. Garrett, J. Hoare, and M. Kopack. Explaining robot actions. In Proceedings of the seventh annual ACM/IEEE international conference on Human-Robot Interaction, pages 187–188. ACM, 2012.
  • [14] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In International Conference on Learning Representations, 2013.
  • [15] T. Mikolov, W. t. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT 2013, 2013.
  • [16] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 01 2014.
  • [17] M. T. Ribeiro, S. Singh, and C. Guestrin. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144. ACM, 2016.
  • [18] E. Sandhaus. The new york times annotated corpus, 2008.
  • [19] L. Sweeney. Discrimination in online ad delivery. Queue, 11(3):10, 2013.
  • [20] Wikimedia. Simplewiki:database download, 2018.
  • [21] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K.-W. Chang. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2979–2989, 2017.
  • [22] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K.-W. Chang. Gender bias in coreference resolution: Evaluation and debiasing methods. In NAACL, 2018.


S (instruments): bagpipe, cello, guitar, lute, trombone, banjo, clarinet, harmonica, mandolin, trumpet, bassoon, drum, harp, oboe, tuba, bell, fiddle, harpsichord, piano, viola, bongo, flute, horn, saxophone, violin
T (weapons): arrow, club, gun, missile, spear, axe, dagger, harpoon, pistol, sword, blade, dynamite, hatchet, rifle, tank, bomb, firearm, knife, shotgun, teargas, cannon, grenade, mace, slingshot, whip
A (pleasant): caress, freedom, health, love, peace, cheer, friend, heaven, loyal, pleasure, diamond, gentle, honest, lucky, rainbow, diploma, gift, honor, miracle, sunrise, family, happy, laughter, paradise, vacation
B (unpleasant): abuse, crash, filth, murder, sickness, accident, death, grief, poison, stink, assault, disaster, hatred, pollute, tragedy, divorce, jail, poverty, ugly, cancer, kill, rotten, vomit, agony, prison

S (science): science, technology, physics, chemistry, einstein, nasa, experiment, astronomy
T (arts): poetry, art, shakespeare, dance, literature, novel, symphony, drama
A (male): male, man, boy, brother, he, him, his, son
B (female): female, woman, girl, sister, she, her, hers, daughter