Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection

by   Shauli Ravfogel, et al.

The ability to control for the kinds of information encoded in neural representation has a variety of use cases, especially in light of the challenge of interpreting these models. We present Iterative Null-space Projection (INLP), a novel method for removing information from neural representations. Our method is based on repeated training of linear classifiers that predict a certain property we aim to remove, followed by projection of the representations on their null-space. By doing so, the classifiers become oblivious to that target property, making it hard to linearly separate the data according to it. While applicable for general scenarios, we evaluate our method on bias and fairness use-cases, and show that our method is able to mitigate bias in word embeddings, as well as to increase fairness in a setting of multi-class classification.



1 Introduction

What is encoded in vector representations of textual data, and can we control it? Word embeddings, pre-trained language models, and more generally deep learning methods emerge as very effective techniques for text classification. Accordingly, they are increasingly being used for predictions in real-world situations. A large part of the success is due to the models’ ability to perform representation learning, coming up with effective feature representations for the prediction task at hand. However, these learned representations, while effective, are also notoriously opaque: we do not know what is encoded in them. Indeed, there is an emerging line of work on probing deep-learning derived representations for syntactic (Linzen et al., 2016; Hewitt and Manning, 2019; Goldberg, 2019), semantic (Tenney et al., 2019) and factual knowledge (Petroni et al., 2019). There is also evidence that they capture a lot of information regarding the demographics of the author of the text (Blodgett et al., 2016; Elazar and Goldberg, 2018).

Figure 1: t-SNE projection of GloVe vectors of the most gender-biased words after t=0, 3, 18, and 35 iterations of INLP. Words are colored according to being male-biased or female-biased.

What can we do in situations where we do not want our representations to encode certain kinds of information? For example, we may want a word representation that does not take tense into account, or that does not encode part-of-speech distinctions. We may want a classifier that judges the formality of the text, but which is also oblivious to the topic the text was taken from. Finally, and also our empirical focus in this work, this situation often arises when considering fairness and bias of language-based classification. We may not want our word-embeddings to encode gender stereotypes, and we do not want sensitive decisions on hiring or loan approvals to condition on the race, gender or age of the applicant.

We present a novel method for actively removing certain kinds of information from a representation. Previous methods are either based on projection on a pre-specified, user-provided direction Bolukbasi et al. (2016), or on adding an adversarial objective to an end-to-end training process Xie et al. (2017). Both of these have benefits and limitations, as we discuss in the related work section (§2). Our proposed method, Iterative Null-space Projection (INLP), presented in section 4, can be seen as a combination of these approaches, capitalizing on the benefits of both. Like the projection methods, it is also based on the mathematical notion of linear projection, a commonly used deterministic operator. Like the adversarial methods, it is data-driven in the directions it removes: we do not presuppose specific directions in the latent space that correspond to the protected attribute, but rather learn those directions, and remove them. Empirically, we find it to work well. We evaluate the method on the challenging task of removing gender signals from word embeddings Bolukbasi et al. (2016); Zhao et al. (2018). Recently, Gonen and Goldberg (2019) showed several limitations of current methods for this task. We show that our method is effective in reducing many, but not all, of these (§4).

We also consider the context of fair classification, where we want to ensure that a classifier’s decision is oblivious to a protected attribute such as race, gender or age. There, we need to integrate the projection-based method within a pre-trained classifier. We propose a method to do so in section 5, and demonstrate its effectiveness in a controlled setup (§6.2) as well as in a real-world one (§6.3).

Finally, while we propose a general purpose information-removal method, our main evaluation is in the realm of bias and fairness applications. We stress that this calls for some stricter scrutiny, as the effects of blindly trusting strong claims can have severe real-world consequences on individuals. We discuss the limitations of our model in the context of such applications in section 7.

2 Related Work

The objective of controlled removal of specific types of information from neural representation is tightly related to the task of disentanglement of the representations (Bengio et al., 2013; Mathieu et al., 2016), that is, controlling and separating the different kinds of information encoded in them. In the context of transfer learning, previous methods have pursued representations which are invariant to some properties of the input, such as genre or topic, in order to ease domain transfer (Ganin and Lempitsky, 2015). Those methods mostly rely on adding an adversarial component (Goodfellow et al., 2014; Ganin and Lempitsky, 2015; Xie et al., 2017; Zhang et al., 2018) to the main task objective: the representation is regularized by an adversary network that competes against the encoder, trying to extract the protected information from its representation.

While adversarial methods showed impressive performance in various machine learning tasks, and were applied for the goal of removal of sensitive information (Elazar and Goldberg, 2018; Coavoux et al., 2018; Resheff et al., 2019; Barrett et al., 2019), they are notoriously hard to train. Elazar and Goldberg (2018) have evaluated adversarial methods for the removal of demographic information from representations. They showed that the complete removal of the protected information is not trivial: even when the attribute seems protected, different classifiers of the same architecture can often still succeed in extracting it. Another drawback of these methods is their reliance on a main-task loss in addition to the adversarial loss, making them less suitable for tasks such as debiasing pre-trained word embeddings.

Xu et al. (2017) utilized a “nullspace cleaning” operator for increasing privacy preservation in classifiers. Given a pre-trained main-task model, they remove from the input a subspace that contains the nullspace (but is not limited to it). By doing so, they aim to remove from the representations information that is not used for the main task (and can be protected), while minimally impairing the main-task performance. While similar in spirit to our method, several key differences exist. As the complementary setting (removing the nullspace of the main-task classifier vs. projection onto the nullspace of protected attribute classifiers) aims to achieve a distinct goal (privacy preservation), there is no notion of exhaustive, iterative cleaning. Furthermore, since de-biasing is not a goal, they do not remove protected attributes that are used by the pre-trained main-task classifier (for example, where the main task classifier conditions on gender).

A recent line of work focused on projecting the representation to a subspace which does not encode the protected attributes. Under this methodology, one identifies specific directions in the latent space that correspond to the protected attribute, and removes them. In a seminal work, Bolukbasi et al. (2016) aimed to identify a “gender subspace” in word-embedding space by calculating the main directions in a subspace spanned by the differences between gendered word pairs, such as the $\vec{he}-\vec{she}$ direction. They suggested zeroing out the components of neutral words in the direction of the first principal components of the “gender subspace”, and actively pushed neutral words to be equally distant from male and female-gendered words. However, Gonen and Goldberg (2019) have recently shown that these methods only cover up these biases and that in fact, the information is deeply ingrained in the representations. A key drawback of this approach is that it relies on an intuitive selection of a few (or a single) gender directions, while, as we reveal in our experiments, the gender subspace is actually spanned by dozens to hundreds of orthogonal directions in the latent space, which are not necessarily as interpretable as the $\vec{he}-\vec{she}$ direction. This observation aligns with the analysis of Ethayarajh et al. (2019), who demonstrated that debiasing by projection is theoretically effective provided that one removes all directions in the latent space, and not only the first principal component.

Figure 2: Nullspace projection for a 2-dimensional binary classifier. The decision boundary of $W$ is $W$’s null-space.

3 Objective and Definitions

Our main goal is to “guard” sensitive information, so that it will not be encoded in a representation. Given a set of vectors $x_i \in \mathbb{R}^d$, and corresponding discrete attributes $Z$, $z_i \in \{1, \dots, k\}$ (e.g. race or gender), we aim to learn a transformation $g: \mathbb{R}^d \to \mathbb{R}^d$, such that $z_i$ cannot be predicted from $g(x_i)$. In this work we are concerned with “linear guarding”: we seek a guard $g$ such that no linear classifier $w(\cdot)$ can predict $z_i$ from $g(x_i)$ with an accuracy greater than that of a decision rule that considers only the proportion of labels in $Z$. We also wish for $g(x_i)$ to stay informative: when the vectors are used for some end task, we want $g$ to have as minimal influence as possible on the end task performance, provided that $z$ remains guarded. We use the following definitions:

Guarded w.r.t. a hypothesis class

Let $X = x_1, \dots, x_m \in \mathcal{X}$ be a set of vectors, with corresponding discrete attributes $Z$, $z_i \in \{1, \dots, k\}$. We say the set $X$ is guarded for $Z$ with respect to hypothesis class $\mathcal{H}$ (conversely $Z$ is guarded in $X$) if there is no classifier $W \in \mathcal{H}$ that can predict $z_i$ from $x_i$ at better than guessing the majority class.

Guarding function

A function $g: \mathbb{R}^n \to \mathbb{R}^m$ is said to be guarding $X$ for $Z$ (w.r.t. class $\mathcal{H}$) if the set $\{g(x) \mid x \in X\}$ is guarded for $Z$ w.r.t. $\mathcal{H}$.

We use the term linearly guarded to indicate guarding w.r.t. the class of all linear classifiers.

4 Iterative Nullspace Projection

Given a set of vectors $x_i \in \mathbb{R}^d$ and a set of corresponding discrete protected attributes $z_i \in \mathcal{Z}$, we seek a linear guarding function $g$ that removes the linear dependence between $Z$ and $X$. (While this work focuses on the discrete case, the extension to a linear regression setting is straightforward: a projection to the nullspace of a linear regressor $w$ enforces $wx = 0$ for every $x$, i.e., each input is regressed to the non-informative value of zero.)

We begin with a high-level description of our approach. Let $c$ be a trained linear classifier, parameterized by a matrix $W \in \mathbb{R}^{k \times d}$, that predicts a property $z$ with some accuracy. We can construct a projection matrix $P$ such that $W(Px) = 0$ for all $x$, rendering $W$ useless on dataset $X$. We then iteratively train additional classifiers $W'$ and perform the same procedure, until no more linear information regarding $Z$ remains in $X$. Constructing $P$ is achieved via nullspace projection, as described below. This method is the core of the INLP algorithm (Algorithm 1).

Nullspace Projection

The linear interaction between $W$ and a new test point $x$ has a simple geometric interpretation: $x$ is projected on the subspace spanned by $W$’s rows, and is classified according to the dot product between $x$ and $W$’s rows, which is proportional to the components of $x$ in the direction of $W$’s row-space. Therefore, if we zero all components of $x$ in the direction of $W$’s row-space, we remove all information used by $W$ for prediction: the decision boundary found by the classifier is no longer useful. As the orthogonal complement of the row-space is the nullspace, zeroing those components of $x$ is equivalent to projecting $x$ on $W$’s nullspace. Figure 2 illustrates the idea for the 2-dimensional binary-classification setting, in which $W$ is just a 2-dimensional vector.

For an algebraic interpretation, recall that the null-space of a matrix $W$ is defined as the space $N(W) = \{x \mid Wx = 0\}$. Given the basis vectors of $N(W)$ we can construct a projection matrix $P_{N(W)}$ into $N(W)$, yielding $W(P_{N(W)} x) = 0 \ \forall x$.

This suggests a simple method for rendering $z$ linearly guarded for a set of vectors $X$: training a linear classifier that is parameterized by $W_1$ to predict $Z$ from $X$, calculating its nullspace, finding the orthogonal projection matrix $P_{N(W_1)}$ onto the nullspace, and using it to remove from $X$ those components that were used by the classifier for predicting $Z$.

Note that the orthogonal projection $P_{N(W_1)}$ is the least harming linear operation to remove the linear information captured by $W_1$ from $X$, in the sense that among all maximum rank (which is not full, as such transformations are invertible, hence not linearly guarding) projections onto the nullspace of $W_1$, it carries the least impact on distances. This is so since the image under an orthogonal projection into a subspace is by definition the closest vector in that subspace.
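The single-classifier projection described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' released code: the helper names `rowspace_projection` and `nullspace_projection`, the relative tolerance, and the random toy data are all assumptions made for the example. The nullspace projector is obtained as the identity minus the orthogonal projector onto the row-space.

```python
import numpy as np


def rowspace_projection(W, rel_tol=1e-10):
    """Orthogonal projection matrix onto the row-space of W (shape k x d)."""
    W = np.atleast_2d(np.asarray(W, dtype=float))
    # Right singular vectors with non-negligible singular values form an
    # orthonormal basis of the row-space.
    _, s, Vt = np.linalg.svd(W, full_matrices=False)
    tol = rel_tol * (s.max() if s.size else 0.0)
    basis = Vt[s > tol]                      # shape (rank, d)
    return basis.T @ basis


def nullspace_projection(W):
    """Orthogonal projection matrix onto N(W) = {x : Wx = 0}."""
    d = np.atleast_2d(W).shape[1]
    return np.eye(d) - rowspace_projection(W)


# Toy data (hypothetical): a random 3-row "classifier" over 10-dim inputs.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 10))
X = rng.normal(size=(100, 10))               # one example per row

P = nullspace_projection(W)
X_proj = X @ P                               # P is symmetric, so this projects each row
print(np.abs(W @ X_proj.T).max())            # zero up to floating-point error
```

Because `P` is an orthogonal projector it is symmetric and idempotent, so applying it a second time leaves the projected vectors unchanged.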

Iterative Projection

Projecting the inputs $X$ on the nullspace of a single linear classifier does not suffice for making $Z$ linearly guarded: classifiers can often still be trained to recover $z$ from the projected $x$ with above chance accuracy, as there are often multiple linear directions (hyperplanes) that can partially capture a relation in multidimensional space. This can be remedied with an iterative process: after obtaining $P_{N(W_1)}$, we train classifier $W_2$ on $P_{N(W_1)} X$, obtain a projection matrix $P_{N(W_2)}$, train a classifier $W_3$ on $P_{N(W_2)} P_{N(W_1)} X$ and so on, until no classifier $W_{m+1}$ can be trained. We return the guarding projection matrix $P = P_{N(W_m)} P_{N(W_{m-1})} \cdots P_{N(W_1)}$, with the guarding function $g(x) = Px$. Crucially, the $i$th classifier $W_i$ is trained on the data $X$ after the projection on the nullspaces of classifiers $W_1$, …, $W_{i-1}$ and is therefore trained to find separating planes that are independent of the separating planes found by previous classifiers.

In Appendix §A.1 we prove three desired properties of INLP: (1) any two protected-attribute classifiers found in INLP are orthogonal (Lemma A.1); (2) while in general the product of projection matrices is not a projection, the product calculated in INLP is a valid projection (Corollary A.1.2); and (3) it projects any vector to the intersection of the nullspaces of each of the classifiers found in INLP, that is, after $n$ INLP iterations, $P$ is a projection to $N(W_1) \cap N(W_2) \cap \dots \cap N(W_n)$ (Corollary A.1.3). We further bound the damage caused to the structure of the space (Lemma A.2). INLP can thus be seen as a linear dimensionality-reduction method, which keeps only those directions in the latent space which are not indicative of the protected attribute.

During iterative nullspace projection, the property $Z$ becomes increasingly linearly-guarded in $PX$. For binary protected attributes, each intermediate $W_i$ is a vector, and the nullspace rank is $d-1$. Therefore, after $n$ iterations, if the original rank of $X$ was $r$, the rank of the projected input $g(X)$ is at least $r-n$.

The entire process is formalized in Algorithm 1.

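Input: $(X, Z)$: a training set of vectors and protected attributes
       $n$: Number of rounds
Result: A projection matrix $P$
Function GetProjectionMatrix($X$, $Z$):
    $X_{projected} \leftarrow X$
    $P \leftarrow I$
    for $i \leftarrow 1$ to $n$ do
        $W_i \leftarrow$ TrainClassifier($X_{projected}$, $Z$)
        $B_i \leftarrow$ GetNullSpaceBasis($W_i$)
        $P_{N(W_i)} \leftarrow B_i B_i^T$
        $P \leftarrow P_{N(W_i)} P$
        $X_{projected} \leftarrow P_{N(W_i)} X_{projected}$
    end for
    return $P$
Algorithm 1 Iterative Nullspace Projection (INLP)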
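
The loop of Algorithm 1 can be sketched in Python roughly as follows. This is an illustrative reading of the procedure under stated assumptions, not the authors' implementation: it uses scikit-learn's `LinearSVC` as the linear classifier, runs a fixed number of rounds, accumulates the projection by naive matrix multiplication, and uses synthetic data in which the binary attribute depends on a single linear direction. The function names are hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC


def nullspace_projection(W, tol=1e-10):
    """Orthogonal projector onto the nullspace of W (shape k x d)."""
    W = np.atleast_2d(W)
    _, s, Vt = np.linalg.svd(W, full_matrices=False)
    basis = Vt[s > tol]                       # orthonormal basis of the row-space
    return np.eye(W.shape[1]) - basis.T @ basis


def get_projection_matrix(X, Z, n_rounds):
    """Naive INLP loop: train, project onto the classifier's nullspace, repeat."""
    P = np.eye(X.shape[1])
    X_projected = X.copy()
    for _ in range(n_rounds):
        clf = LinearSVC(dual=False, max_iter=5000).fit(X_projected, Z)
        P_i = nullspace_projection(clf.coef_)
        P = P_i @ P                           # accumulate the guarding projection
        X_projected = X_projected @ P_i       # P_i is symmetric: projects each row
    return P


# Synthetic example (assumption): Z is a linear function of two coordinates.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
Z = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

P = get_projection_matrix(X, Z, n_rounds=5)
X_guarded = X @ P.T                           # apply g(x) = Px to every row

acc_before = LinearSVC(dual=False, max_iter=5000).fit(X, Z).score(X, Z)
acc_after = LinearSVC(dual=False, max_iter=5000).fit(X_guarded, Z).score(X_guarded, Z)
print(acc_before, acc_after)
```

On this toy data a fresh linear classifier trained on the projected vectors should fall to roughly chance accuracy, and each round removes one direction, so five rounds leave a projection of rank 15 in a 20-dimensional space.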
Implementation Details

A naive implementation of Algorithm 1 is prone to accumulating numerical errors. Those stem mainly from the accumulative projection-matrices multiplication $P \leftarrow P_{N(W_i)} P$. To mitigate this problem, we use the formula of Ben-Israel (2015), which connects the intersection of nullspaces with the projection matrices to the corresponding rowspaces:

$$N(W_1) \cap \dots \cap N(W_n) = N\big(P_{R(W_1)} + \dots + P_{R(W_n)}\big) \qquad (1)$$

Where $P_{R(W_i)}$ is the orthogonal projection matrix to the row-space of a classifier $W_i$. Accordingly, in practice, we do not multiply $P \leftarrow P_{N(W_i)} P$ but rather collect rowspace projection matrices $P_{R(W_i)}$ for each classifier $W_i$. In place of each input projection $X_{projected} \leftarrow P_{N(W_i)} X_{projected}$, we recalculate $P := P_{N(W_1) \cap \dots \cap N(W_i)}$ according to (1), and perform a projection $X_{projected} \leftarrow P X$. Upon termination, we once again apply (1) to get the final nullspace projection matrix $P_{N(W_1) \cap \dots \cap N(W_n)}$, and return it. We make our code publicly available.
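
As a rough sketch of this implementation detail: the intersection of the classifiers' nullspaces equals the nullspace of the sum of their row-space projectors, so the final projector can be computed as the identity minus the projector onto the row-space of that sum, without multiplying projection matrices. The helper names and random toy classifiers below are assumptions for illustration; the check compares the result against the nullspace projector of the stacked classifiers, which spans the same intersection.

```python
import numpy as np


def rowspace_projection(W, rel_tol=1e-10):
    """Orthogonal projector onto the row-space of W."""
    W = np.atleast_2d(np.asarray(W, dtype=float))
    _, s, Vt = np.linalg.svd(W, full_matrices=False)
    tol = rel_tol * (s.max() if s.size else 0.0)
    basis = Vt[s > tol]
    return basis.T @ basis


def intersection_nullspace_projection(rowspace_projections, d):
    """Projector onto the intersection of nullspaces, given each classifier's
    row-space projector: I - P_R(sum of row-space projectors)."""
    Q = np.sum(rowspace_projections, axis=0) if rowspace_projections else np.zeros((d, d))
    return np.eye(d) - rowspace_projection(Q)


# Toy classifiers (hypothetical): three random, non-orthogonal weight vectors.
rng = np.random.default_rng(0)
d = 8
ws = [rng.normal(size=(1, d)) for _ in range(3)]

P = intersection_nullspace_projection([rowspace_projection(w) for w in ws], d)

# Reference: N(w1) ∩ N(w2) ∩ N(w3) is the nullspace of the stacked matrix.
P_reference = np.eye(d) - rowspace_projection(np.vstack(ws))
print(np.abs(P - P_reference).max())          # zero up to floating-point error
```

Unlike the naive product, this construction yields a valid orthogonal projector even when the collected classifiers are not exactly orthogonal to one another.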

5 Application to Fair Classification

The previous section described the INLP method for producing a linearly guarding function $g$ for a set of vectors. We now turn to describe its usage in the context of providing fair classification by a (possibly deep) neural network classifier.

In this setup, we are given, in addition to $X$ and $Z$, also labels $Y$, and wish to construct a classifier $f: X \to Y$, while being fair with respect to $Z$. Fairness in classification can be defined in many ways (Hardt et al., 2016; Madras et al., 2019; Zhang et al., 2018). We focus on a notion of fairness by which the predictor is oblivious to $Z$ when making predictions about $Y$.

To use linear guardedness in the context of a deep network, recall that a classification network $f(x)$ can be decomposed into an encoder $enc$ followed by a linear layer $W$: $f(x) = W \cdot enc(x)$, where $W$ is the last layer of the network and $enc$ is the rest of the network. If we can make sure that $Z$ is linearly guarded in the inputs to $W$, then $W$ will have no knowledge of $Z$ when making its prediction about $Y$, making the decision process oblivious to $Z$. Adversarial training methods attempt to achieve such obliviousness by adding an adversarial objective to make $enc(x)$ itself guarding. We take a different approach and add a guarding function on top of an already trained $enc$.

We propose the following procedure. Given a training set $X$, $Y$ and protected attribute $Z$, we first train a neural network $f = W \cdot enc(X)$ to best predict $Y$. This results in an encoder that extracts effective features from $X$ for predicting $Y$. We then consider the vectors $enc(X)$, and use the INLP method to produce a linear guarding function $g$ that guards $Z$ in $enc(X)$. At this point, we can use the classifier $W \cdot g(enc(x))$ to produce oblivious decisions; however, by introducing $g$ (which is lower rank than $enc(x)$) we may have harmed $W$’s performance. We therefore freeze the network and fine-tune only $W$ to predict $Y$ from $g(enc(x))$, producing the final fair classifier $f'(x) = W' \cdot g(enc(x))$. Notice that $W'$ only sees vectors which are linearly guarded for $Z$ during its training, and therefore cannot take $Z$ into account when making its predictions, ensuring fair classification.

We note that our notion of fairness by obliviousness does not, in the general case, correspond to other fairness metrics, such as equality of odds or of opportunity. It does, however, correlate with fairness metrics, as we demonstrate empirically.
Further refinement. Guardedness is a property that holds in expectation over an entire dataset. For example, when considering a dataset of individuals from certain professions (as we do in §6.3), it is possible that the entire dataset is guarded for gender, yet if we consider only a subset of individuals (say, only those who work as nurses), we may still be able to recover gender with above majority accuracy within that sub-population. As fairness metrics are often concerned with classification behavior also within groups, we propose the following refinement to the algorithm, which we use in the experiments in §6.2 and §6.3: in each iteration, we train a classifier to predict the protected attribute not on the entire training set, but only on the training examples belonging to a single (randomly chosen) main-task class (e.g. profession). By doing so, we push the protected attribute to be linearly guarded in the examples belonging to each of the main-task labels.
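
The within-class refinement can be sketched as a small change to the INLP loop: in each round, the protected-attribute classifier is fit only on the examples of one randomly chosen main-task class. The code below is an assumption-laden illustration rather than the authors' implementation: `inlp_within_class`, the use of `LinearSVC`, the choice to skip a round when the sampled class contains a single protected value, and the synthetic encoder outputs `H` are all hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC


def nullspace_projection(W, tol=1e-10):
    """Orthogonal projector onto the nullspace of W (shape k x d)."""
    W = np.atleast_2d(W)
    _, s, Vt = np.linalg.svd(W, full_matrices=False)
    basis = Vt[s > tol]
    return np.eye(W.shape[1]) - basis.T @ basis


def inlp_within_class(H, Z, Y, n_rounds, seed=0):
    """INLP where each round trains on a single random main-task class."""
    rng = np.random.default_rng(seed)
    classes = np.unique(Y)
    P = np.eye(H.shape[1])
    H_projected = H.copy()
    for _ in range(n_rounds):
        mask = Y == rng.choice(classes)       # examples of one main-task class
        if np.unique(Z[mask]).size < 2:
            continue                          # nothing to separate in this class
        clf = LinearSVC(dual=False, max_iter=5000).fit(H_projected[mask], Z[mask])
        P_i = nullspace_projection(clf.coef_)
        P = P_i @ P
        H_projected = H_projected @ P_i       # project ALL examples, not just the class
    return P


# Synthetic stand-in (assumption) for encoder outputs enc(X), labels Y, attribute Z.
rng = np.random.default_rng(1)
H = rng.normal(size=(900, 10))
Y = rng.integers(0, 3, size=900)
Z = (H[:, 0] + 0.3 * rng.normal(size=900) > 0).astype(int)

P = inlp_within_class(H, Z, Y, n_rounds=4)
H_guarded = H @ P.T                           # inputs on which the last layer is re-trained
```

The final linear layer would then be re-fit on `H_guarded`, as described in the procedure above.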

6 Experiments and Analysis

6.1 “Debiasing” Word Embeddings

In the first set of experiments, we evaluate the INLP method in its ability to debias word embeddings Bolukbasi et al. (2016). After “debiasing” the embeddings, we repeat the set of diagnostic experiments of Gonen and Goldberg (2019).


Our debiasing targets are the uncased version of GloVe word embeddings (Zhao et al., 2018), after limiting the vocabulary to the 150,000 most common words. To obtain labeled data $X$, $Z$ for this classifier, we use the 7,500 most male-biased and 7,500 most female-biased words (as measured by the projection on the $\vec{he}-\vec{she}$ direction), as well as 7,500 neutral vectors, with a small component (smaller than 0.03) in the gender direction. The data is randomly divided into a test set (30%), and training and development sets (70%, further divided into 70% training and 30% development examples).


We use an $L_2$-regularized SVM classifier (Hearst et al., 1998) trained to discriminate between the 3 classes: male-biased, female-biased and neutral. We run Algorithm 1 for 35 iterations.

6.1.1 Results

Classification. Initially, a linear SVM classifier perfectly discriminates between the two genders (100% accuracy). The accuracy drops to 49.3% following INLP. To measure to what extent gender is still encoded in a nonlinear way, we train a 1-layer ReLU-activation MLP. The MLP recovers gender with accuracy of 85.0%. This is expected, as the INLP method is only meant to achieve linear guarding. (Interestingly, nonlinear SVMs with different kernels, such as RBF, all achieve random accuracy.)

Human-selected vs. Learned Directions. Our method differs from the common projection-based approach by two main factors: the number of directions we remove, and the fact that those directions are learned iteratively from data. Perhaps the benefit is purely due to removing more directions? We compare the ability to linearly classify words by gender bias after removing 10 directions by our method (i.e., running Algorithm 1 for 10 iterations) with the ability to do so after removing (i.e., projecting to the intersection of nullspaces) 10 manually-chosen directions defined by the difference vectors between gendered pairs. (We use the following pairs, taken from Bolukbasi et al. (2016): (“woman”, “man”), (“girl”, “boy”), (“she”, “he”), (“mother”, “father”), (“daughter”, “son”), (“gal”, “guy”), (“female”, “male”), (“her”, “his”), (“herself”, “himself”), (“mary”, “john”).) INLP-based debiasing results in a very substantial drop in classification accuracy (54.4%), while the removal of the predefined directions only moderately decreases classification accuracy (80.7%). This shows that data-driven identification of gender-directions outperforms manually selected directions: there are many subtle ways in which gender is encoded, which are hard for people to imagine.


Both the previous method and our method start with the main gender-direction of $\vec{he}-\vec{she}$. However, while previous attempts take this direction as the information that needs to be neutralized, our method instead considers the labeling induced by this gender direction, and then iteratively finds and neutralizes directions that correlate with this labeling. It is likely that the $\vec{he}-\vec{she}$ direction is one of the first to be removed, but we then go on and learn a set of other directions that correlate with the same labeling and which are predictive of it to some degree, neutralizing each of them in turn. Compared to the 10 manually identified gender-directions from Bolukbasi et al. (2016), it is likely that our learned directions capture a much more diverse and subtle set of gender clues in the embedding space.

Effect of debiasing on the embedding space. In appendix §A.2 we provide a list of 40 random words and their closest neighbors, before and after INLP, showing that INLP doesn’t significantly damage the representation space that encodes lexical semantics. We also include a short analysis of the influence on a specific subset of inherently gendered words: gendered surnames (Appendix §A.4).

Additionally, we perform a semantic evaluation of the debiased embeddings on multiple word-similarity datasets (e.g. SimLex-999; Hill et al., 2015). We find large improvements in the quality of the embeddings after the projection (e.g. on SimLex-999 the correlation improves from 0.373 to 0.489), and we elaborate on these findings in Appendix A.3.

Clustering. Figure 1 shows t-SNE (Maaten and Hinton, 2008) projections of the 2,000 most female-biased and 2,000 most male-biased words, before the projection and after $t=3$, $t=18$, and $t=35$ projection steps. The results clearly demonstrate that the classes are no longer linearly separable: this behavior is qualitatively different from previous word vector debiasing methods, which were shown to maintain much of the proximity between female and male-biased vectors (Gonen and Goldberg, 2019). To quantify the difference, we perform K-means clustering to $K=2$ clusters on the vectors, and calculate the V-measure (Rosenberg and Hirschberg, 2007), which assesses the degree of overlap between the two clusters found in K-means and the binary gender bias of the words. For the t-SNE projected vectors, the measure drops from 83.88% overlap before the debiasing-projection, to 0.44% following the projection; and for the original space, the measure drops from 100% to 0.31%.
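
The clustering diagnostic can be illustrated with scikit-learn's `KMeans` and `v_measure_score`. This is a sketch on synthetic vectors, not a reproduction of the reported numbers: the helper name `cluster_label_overlap`, the K-means settings, and the two toy datasets (one where the binary label is linearly encoded, one where it is not) are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score


def cluster_label_overlap(vectors, bias_labels, seed=0):
    """V-measure between a 2-cluster K-means solution and binary bias labels."""
    clusters = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(vectors)
    return v_measure_score(bias_labels, clusters)


# Toy vectors (hypothetical): 200 "female-biased" and 200 "male-biased" words.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 200)
separated = rng.normal(size=(400, 5)) + 10.0 * labels[:, None]   # label strongly encoded
mixed = rng.normal(size=(400, 5))                                # label not encoded

score_separated = cluster_label_overlap(separated, labels)
score_mixed = cluster_label_overlap(mixed, labels)
print(score_separated, score_mixed)
```

A V-measure near 1 means the clusters coincide with the bias labels, while a value near 0 means the clustering carries almost no information about them.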
WEAT. While our method does not guarantee attenuating the bias-by-neighbors phenomenon that is discussed in Gonen and Goldberg (2019), it is still valuable to quantify to what extent it does mitigate this phenomenon. We repeat the Word Embedding Association Test (WEAT) from Caliskan et al. (2017), which aims to measure the association in vector space between male and female concepts and stereotypically male or female professions. Following Gonen and Goldberg (2019), we represent the male and female groups with common names of males and females, rather than with explicitly gendered words (e.g. pronouns). Three tests evaluate the association between a group of male names and a group of female names to (1) career and family-related words; (2) art and mathematics words; and (3) artistic and scientific fields. In all three tests, we find that the strong association between the groups no longer exists after the projection (non-significant p-values of 0.855, 0.302 and 0.761, respectively).
Bias-by-Neighbors. To measure bias-by-neighbors as discussed in Gonen and Goldberg (2019), we consider the list of professions provided in Bolukbasi et al. (2016) and measure the correlation between bias-by-projection and bias-by-neighbors, quantified as the percentage of the top 100 neighbors of each profession which were originally biased-by-projection towards either of the genders. We find a strong correlation of 0.734 (compared with 0.852 before), indicating that much of the bias-by-neighbors remains. Note that if, for example, STEM-related words are originally biased towards men, the word “chemist” after the projection may still be regarded as male-biased by neighbors, not because of an inherent bias but due to its proximity to other originally biased words (e.g. other STEM professions).

6.2 Fair Classification: Controlled Setup

We now evaluate using INLP with a deeper classifier, with the goal of achieving fair classification.
Classifier bias measure: TPR-GAP. To measure the bias in a classifier, we follow De-Arteaga et al. (2019) and use the TPR-GAP measure. This measure quantifies the bias in a classifier by considering the difference (GAP) in the True Positive Rate (TPR) between individuals with different protected attributes (e.g. gender/race). The TPR-GAP is tightly related to the notion of fairness by equal opportunity (Hardt et al., 2016): a fair classifier is expected to show similar success in predicting the task label for the two populations, when conditioned on the true class. Formally, for a binary protected attribute $z$ and a true class $y$, define:

$$TPR_{z,y} = P[\hat{Y} = y \mid Z = z, Y = y]$$

$$GAP^{TPR}_{z,y} = TPR_{z,y} - TPR_{z',y}$$

where $Z$ is a random variable denoting the binary protected attribute, $z$ and $z'$ denote its two values, and $Y$, $\hat{Y}$ are random variables denoting the correct class and the predicted class, respectively.
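
As a small worked example of this measure, the per-class TPR and its gap between two protected groups can be computed from arrays of true labels, predictions and protected attributes. The function names and the six-example toy arrays below are hypothetical; returning NaN when a group has no examples of the class is a choice made for this sketch.

```python
import numpy as np


def tpr(y_true, y_pred, z, cls, group):
    """P[prediction = cls | protected attribute = group, true class = cls]."""
    mask = (y_true == cls) & (z == group)
    if not mask.any():
        return float("nan")                   # group has no examples of this class
    return float(np.mean(y_pred[mask] == cls))


def tpr_gap(y_true, y_pred, z, cls, group_a, group_b):
    """TPR-GAP for class cls: TPR of group_a minus TPR of group_b."""
    return tpr(y_true, y_pred, z, cls, group_a) - tpr(y_true, y_pred, z, cls, group_b)


# Toy data (hypothetical): six examples, binary task label and protected attribute.
y_true = np.array([1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1])
z = np.array([0, 0, 1, 1, 0, 1])

# Class 1: group 0 is always predicted correctly (TPR 1.0), group 1 half the time (0.5).
gap_class1 = tpr_gap(y_true, y_pred, z, cls=1, group_a=0, group_b=1)
print(gap_class1)                             # 0.5
```

A gap of zero for every class corresponds to equal true positive rates across the two groups.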
Experiment setup. We begin by experimenting with a controlled setup, where we control for the proportion of the protected attributes within each main-task class. We follow the setup of Elazar and Goldberg (2018) which used a twitter dataset, collected by Blodgett et al. (2016), where each tweet is associated with “race” information and a sentiment which was determined by their belonging to some emoji group.

Naturally, the correlation between the protected class labels and the main-class labels may influence the fairness of the model, as high correlation can encourage the model to condition on the protected attributes. We measure the TPR-GAP on predicting sentiment for the different race groups (African American English (AAE) speakers and Standard American English (SAE) speakers), with different imbalanced conditions, with and without application of our “classifier debiasing” procedure.

In all experiments, the dataset is balanced overall with respect to both sentiment and race (50k instances for each). We change only the proportion of each race within each sentiment class (e.g., in the 0.7 condition, the “happy” sentiment class is composed of 70% AAE / 30% SAE, while the “sad” class is composed of 30% AAE / 70% SAE).

         Sentiment            TPR-Gap
Ratio    Original   INLP      Original   INLP
0.5      0.76       0.75      0.19       0.16
0.6      0.78       0.74      0.29       0.22
0.7      0.81       0.66      0.38       0.24
0.8      0.84       0.67      0.45       0.15
Table 1: The Sentiment scores (in accuracy, higher is better) and TPR differences (lower is better) as a function of the ratio of tweets written by black individuals in the positive-sentiment class.

Our classifier is based on the DeepMoji encoder (Felbo et al., 2017), followed by a 1-hidden-layer MLP. The DeepMoji model was trained on millions of tweets in order to predict their emojis; it was shown to perform well on different classification tasks (Felbo et al., 2017), but also to encode demographic information (Elazar and Goldberg, 2018). We train this classifier to predict sentiment. We then follow the procedure in §5: training a guarding function on the hidden layer of the MLP, and re-training the final linear layer on the guarded vectors. Table 1 presents the results.

As expected, the TPR-GAP grows as we increase the correlation between class labels and protected attributes. The accuracy grows as well. Applying our debiasing technique significantly reduced the TPR gap in all settings, although it hurts the main-task accuracy more in the highly imbalanced settings. In Appendix A.5, we give some more analysis on the balance between performance and TPR-Gap and show that one can control this trade-off by using more iterations of INLP.

6.3 Fair Classification: In the Wild

We now evaluate the fair classification approach in a less artificial setting, measuring gender bias in biography classification, following the setup of De-Arteaga et al. (2019).

                                    BoW      FastText   BERT
Accuracy (profession)   Original    78.2     78.1       80.9
                        +INLP       80.1     73.0       75.2
TPR-GAP (RMS)           Original    0.203    0.184      0.184
                        +INLP       0.124    0.089      0.095
Table 2: Fair classification on the Biographies corpus.
Figure 3: t-SNE projection of BERT representations for the profession “professor” (left) and for a random sample of all professions (right), before and after the projection.
Figure 6: The correlation between the TPR-GAP and the relative proportion of women in profession $y$, for BERT representation, before (left; R=0.883) and after (right; R=0.470) the projection.

De-Arteaga et al. (2019) scraped the web and collected a dataset of short biographies, annotated by gender and profession. They trained logistic regression classifiers to predict the profession of the biography’s subject based on three different input representations: bag-of-words (BOW), bag of word-vectors (BWV), and an RNN-based representation. We repeat their experiments, using INLP to render the classifier oblivious to gender.

Setup. Our data contains 393,423 biographies (footnote 6: the original dataset had 399,000 examples, but 5,557 biographies were no longer available on the web). We follow the train:dev:test split of De-Arteaga et al. (2019), resulting in 255,710 training examples (65%), 39,369 development examples (10%) and 98,344 test examples (25%). The dataset has 28 classes (professions), which we predict using a multiclass logistic classifier (in a one-vs-all setting). We consider three input representations: BOW, BWV and BERT (Devlin et al., 2019) based classification. In BOW, we represent each biography as the sum of one-hot vectors, each representing one word in the vocabulary. In the BWV representation, we sum the FastText token representations (Joulin et al., 2016) of the words in the biography. In the BERT representation, we represent each biography as the last hidden state of BERT over the [CLS] token. Each of these representations is then fed into the logistic classifier to get the final prediction. We do not finetune FastText or BERT.

We run INLP with scikit-learn Pedregosa et al. (2011) linear classifiers. We use 100 logistic classifiers for BOW, 150 linear SVM classifiers for BWV, and 300 linear SVM classifiers for BERT.
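The iterative procedure can be sketched with NumPy alone. This is a simplified illustration under stated assumptions, not the authors' implementation: the paper trains scikit-learn logistic and linear SVM classifiers, whereas the sketch below swaps in a least-squares linear probe so that it has no dependency beyond NumPy. All function names are hypothetical.

```python
import numpy as np


def nullspace_projection(W):
    """Orthogonal projection onto the nullspace of W (rows = classifier directions)."""
    # Orthonormal basis of the row space via SVD; near-zero directions are dropped.
    _, s, vt = np.linalg.svd(W, full_matrices=False)
    basis = vt[s > 1e-10]
    return np.eye(W.shape[1]) - basis.T @ basis


def fit_linear_probe(X, z):
    """Stand-in linear classifier (assumption): least-squares fit to +/-1 targets."""
    targets = 2.0 * z - 1.0
    w, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return w[None, :]  # shape (1, d)


def inlp(X, z, n_iters):
    """Repeatedly fit a probe on the projected data and project out its direction."""
    P = np.eye(X.shape[1])
    directions = []
    for _ in range(n_iters):
        w = fit_linear_probe(X @ P, z)  # P is symmetric, so X @ P projects each row
        directions.append(w)
        # Project onto the intersection of the nullspaces of all probes so far.
        P = nullspace_projection(np.vstack(directions))
    return P, np.vstack(directions)


# Toy data (hypothetical): the protected attribute z is encoded in coordinate 0.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 5))
X[:, 0] += 3.0 * (2 * z - 1)

P, W = inlp(X, z, n_iters=2)
```

Recomputing the projection from all stacked directions at every step keeps `P` a valid projection even when the learned directions are not exactly orthogonal; swapping `fit_linear_probe` for a scikit-learn classifier's weight vector would bring the sketch closer to the setup described above.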
Bias measure. We use the TPR-GAP measure for each profession. Following Romanov et al. (2019), we also calculate the root-mean square of $GAP^{TPR}_{g,y}$ over all professions $y$, to get a single per-gender bias score:

$$GAP^{RMS}_{g} = \sqrt{\frac{1}{|C|} \sum_{y \in C} \left( GAP^{TPR}_{g,y} \right)^2}$$

where $C$ is the set of all labels (professions).
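The root-mean-square aggregation described above is simple to compute once the per-profession gaps are available. A minimal sketch follows; the function name and the toy gap values are hypothetical and are not taken from the paper.

```python
import math


def gap_rms(gaps_by_profession):
    """Root-mean-square of per-profession TPR gaps for one gender."""
    values = list(gaps_by_profession.values())
    return math.sqrt(sum(g * g for g in values) / len(values))


# Toy per-profession gaps (hypothetical values, for illustration only).
toy_gaps = {"professor": 0.3, "nurse": -0.4, "surgeon": 0.0}
score = gap_rms(toy_gaps)
```

Because the gaps are squared before averaging, positive and negative gaps do not cancel, and professions with large gaps dominate the score.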

De-Arteaga et al. (2019) have shown that $GAP^{TPR}_{g,y}$ strongly correlates with the percentage of women in profession $y$, indicating that the true positive rate of the model is influenced by gender.

6.3.1 Results

Main results

The results are summarized in Table 2. INLP moderately changes main-task accuracy, with a 1.9% increase in BOW, a 5.1% decrease in BWV and a 5.51% decrease in BERT. $GAP^{RMS}_{g}$ is significantly decreased, indicating that, on average, the true positive rates of the classifiers for male and female become closer: in BOW representation, from 0.203 to 0.124 (a 38.91% decrease); in BWV, from 0.184 to 0.089 (a 51.6% decrease); and in BERT, from 0.184 to 0.095 (a 48.36% decrease). We measure the correlation between $GAP^{TPR}_{g,y}$ for each profession $y$ and the percentage of biographies of women in that profession. In BOW representation, the correlation decreases from -0.894 prior to INLP to -0.670 after it (a 33.4% decrease). In BWV representation, the correlation decreases from 0.896 prior to INLP to 0.425 after it (a 52.5% decrease). In BERT representation, the correlation decreases from 0.883 prior to INLP to 0.470 following it (a 46.7% decrease; Figure 6). De-Arteaga et al. (2019) report a correlation of 0.71 for BWV representations when using a “scrubbed” version of the biographies, with all pronouns and names removed. INLP significantly outperforms this baseline, while maintaining all explicit gender markers in the input.
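The relative decreases quoted for the RMS TPR-GAP can be recomputed directly from the Table 2 values. A small sketch; the helper name `relative_decrease` is hypothetical.

```python
def relative_decrease(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# RMS TPR-GAP values from Table 2: (Original, +INLP).
table2_gap_rms = {
    "BOW": (0.203, 0.124),
    "BWV": (0.184, 0.089),
    "BERT": (0.184, 0.095),
}

decreases = {name: relative_decrease(b, a) for name, (b, a) in table2_gap_rms.items()}
```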
Analysis. How does imposing fairness influence the importance the logistic classifier attributes to different words in the biography? We take advantage of the BOW representation and visualize which features (words) influence each prediction (profession), before and after the projection. According to Algorithm 1, to debias an input $x$, we multiply $W(Px)$. Equivalently, we can first multiply $W$ by $P$ to get a “debiased” weight matrix $W'$. We begin by testing how much the debiased weights of words that are considered to be biased were changed during the debiasing, compared to random vocabulary words. We compare the relative change before and after the projection of these words, for every occupation. Biased words undergo an average relative change 1.23 times larger than the average change of the entire vocabulary, demonstrating that biased words indeed change more. The per-profession breakdown is available in Figure 2 in Appendix §A.6.1.
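The equivalence between projecting the input and projecting the classifier weights follows from associativity of matrix multiplication, and can be checked numerically. The sketch below uses random toy matrices and a single removed direction; all names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_classes = 6, 3

W = rng.normal(size=(n_classes, d))           # toy classifier weights, one row per class
w_gender = rng.normal(size=d)
w_gender /= np.linalg.norm(w_gender)          # toy unit direction to remove
P = np.eye(d) - np.outer(w_gender, w_gender)  # nullspace projection for that direction

x = rng.normal(size=d)

logits_project_input = W @ (P @ x)            # debias the input, then classify
W_debiased = W @ P                            # fold the projection into the weights
logits_project_weights = W_debiased @ x
```

Folding `P` into the weights makes it possible to inspect, feature by feature, how each weight changed, which is what the word-level analysis relies on.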
Next, we examine the words that changed the most during the INLP process. We compare the weight difference before and after the projection. For each profession, we sort the words by this difference, and then average each word's location index over all professions. Many words indeed seem gender-specific (e.g. ms., mr., his, her, which appear in locations 1, 2, 3 and 4, respectively), but some seem unrelated, perhaps due to spurious correlations in the data. The complete list is available in Table 4 in Appendix §A.6.1; an analogous analysis for the FastText representation is available in Appendix §A.6.2.

7 Limitations

When dealing with bias and fairness, it is important to also disclose the limitations. The main limitation of our method when used in the context of fairness is that, like other learning approaches, it depends on the data $X, Z$ that is fed to it, and works under the assumption that the training data is sufficiently large and is sampled i.i.d. from the same distribution as the test data. This condition is hard to achieve in practice, and failure to provide sufficiently representative training data may lead to biased classifications even after its application. Like other methods, there are no magic guarantees, and the burden of verification remains on the user. It is also important to remember that the method is designed to achieve a very specific sense of protection: removal of linear information regarding a protected attribute. While it may correlate with fairness measures such as demographic parity, it is not designed to ensure them. Finally, it is designed to be fed to a linear decoder, and the attributes are not protected under non-linear classifiers.

8 Conclusion

We present a novel method for removing linearly-represented information from neural representations. We focus on bias and fairness as case studies, and demonstrate that our method is capable of attenuating societal biases that are expressed in representations learned from data. Our method is also shown to be applicable for increasing fairness in a multi-class classification setting: predicting a profession from a biography of a person. We demonstrate that across three increasingly complicated architectures (bag of words, word embeddings and BERT representations) INLP is robust in decreasing bias. We also perform a controlled experiment on a DeepMoji model, to assess the influence of an uneven division of protected attributes among main-task labels, as is commonly the case in many real-world applications.

While this work focuses on societal bias and fairness, Iterative Nullspace Projection has broader possible use-cases, and can be utilized to remove specific components from a representation, in a controlled and deterministic manner. This method may be applicable to other end goals, such as style transfer, disentanglement of neural representations and increasing their interpretability. We aim to explore those directions in future work.


Acknowledgements

We thank Jacob Goldberger and Jonathan Berant for fruitful discussions. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, grant agreement No. 802774 (iEXTRACT).


  • E. Agirre, E. Alfonseca, K. Hall, J. Kravalová, M. Pasca, and A. Soroa (2009) A study on similarity and relatedness using distributional and wordnet-based approaches. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 19–27. Cited by: §A.3.
  • M. Barrett, Y. Kementchedjhieva, Y. Elazar, D. Elliott, and A. Søgaard (2019) Adversarial removal of demographic attributes revisited. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 6331–6336. Cited by: §2.
  • A. Ben-Israel (2015) Projectors on intersections of subspaces. Contemporary Mathematics, pp. 41–50. Cited by: §4.
  • Y. Bengio, A. C. Courville, and P. Vincent (2013) Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 35 (8), pp. 1798–1828. Cited by: §2.
  • S. L. Blodgett, L. Green, and B. O’Connor (2016) Demographic dialectal variation in social media: a case study of african-american english. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1119–1130. Cited by: §1, §6.2.
  • T. Bolukbasi, K. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai (2016) Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances in neural information processing systems, pp. 4349–4357. Cited by: §1, §2, §6.1.1, §6.1.1, §6.1, footnote 4.
  • A. Caliskan, J. J. Bryson, and A. Narayanan (2017) Semantics derived automatically from language corpora contain human-like biases. Science 356 (6334), pp. 183–186. Cited by: §6.1.1.
  • M. Coavoux, S. Narayan, and S. B. Cohen (2018) Privacy-preserving neural representations of text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1–10. Cited by: §2.
  • M. De-Arteaga, A. Romanov, H. M. Wallach, J. T. Chayes, C. Borgs, A. Chouldechova, S. C. Geyik, K. Kenthapadi, and A. T. Kalai (2019) Bias in bios: A case study of semantic representation bias in a high-stakes setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* 2019, Atlanta, GA, USA, January 29-31, 2019, pp. 120–128. External Links: Link, Document Cited by: §6.2, §6.3.1, §6.3, §6.3, §6.3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4171–4186. External Links: Link, Document Cited by: §6.3.
  • Y. Elazar and Y. Goldberg (2018) Adversarial removal of demographic attributes from text data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 11–21. External Links: Link Cited by: §1, §2, §6.2, §6.2.
  • K. Ethayarajh, D. Duvenaud, and G. Hirst (2019) Understanding undesirable word embedding associations. arXiv preprint arXiv:1908.06361. Cited by: §2.
  • B. Felbo, A. Mislove, A. Søgaard, I. Rahwan, and S. Lehmann (2017) Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Conference on Empirical Methods in Natural Language Processing (EMNLP), Cited by: §6.2.
  • Y. Ganin and V. S. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pp. 1180–1189. External Links: Link Cited by: §2.
  • Y. Goldberg (2019) Assessing bert’s syntactic abilities. arXiv preprint arXiv:1901.05287. Cited by: §1.
  • H. Gonen and Y. Goldberg (2019) Lipstick on a pig: debiasing methods cover up systematic gender biases in word embeddings but do not remove them. arXiv preprint arXiv:1903.03862. Cited by: §1, §2, §6.1.1, §6.1.
  • I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2672–2680. External Links: Link Cited by: §2.
  • G. Halawi, G. Dror, E. Gabrilovich, and Y. Koren (2012) Large-scale learning of word relatedness with constraints. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1406–1414. Cited by: §A.3.
  • M. Hardt, E. Price, and N. Srebro (2016) Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pp. 3315–3323. External Links: Link Cited by: §5, §6.2.
  • M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf (1998) Support vector machines. IEEE Intelligent Systems and their applications 13 (4), pp. 18–28. Cited by: §6.1.
  • J. Hewitt and C. D. Manning (2019) A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4129–4138. External Links: Link, Document Cited by: §1.
  • F. Hill, R. Reichart, and A. Korhonen (2015) Simlex-999: evaluating semantic models with (genuine) similarity estimation. Computational Linguistics 41 (4), pp. 665–695. Cited by: §A.3, §6.1.1.
  • A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2016) Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759. Cited by: §6.3.
  • T. Linzen, E. Dupoux, and Y. Goldberg (2016) Assessing the ability of LSTMs to learn syntax-sensitive dependencies. TACL 4, pp. 521–535. External Links: Link Cited by: §1.
  • L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605. Cited by: §6.1.1.
  • D. Madras, E. Creager, T. Pitassi, and R. S. Zemel (2019) Fairness through causal awareness: learning causal latent-variable models for biased data. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* 2019, Atlanta, GA, USA, January 29-31, 2019, pp. 349–358. External Links: Link, Document Cited by: §5.
  • M. Mathieu, J. J. Zhao, P. Sprechmann, A. Ramesh, and Y. LeCun (2016) Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 5041–5049. External Links: Link Cited by: §2.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §6.3.
  • F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller, and S. Riedel (2019) Language models as knowledge bases?. arXiv preprint arXiv:1909.01066. Cited by: §1.
  • Y. Resheff, Y. Elazar, M. Shahar, and O. Shalom (2019) Privacy and fairness in recommender systems via adversarial training of user representations. In Proceedings of the 8th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, pp. 476–482. External Links: Document, ISBN 978-989-758-351-3 Cited by: §2.
  • A. Romanov, M. De-Arteaga, H. M. Wallach, J. T. Chayes, C. Borgs, A. Chouldechova, S. C. Geyik, K. Kenthapadi, A. Rumshisky, and A. Kalai (2019) What’s in a name? reducing bias in bios without access to protected attributes. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4187–4195. External Links: Link, Document Cited by: §6.3.
  • A. Rosenberg and J. Hirschberg (2007) V-measure: A conditional entropy-based external cluster evaluation measure. In EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, J. Eisner (Ed.), pp. 410–420. External Links: Link Cited by: §6.1.1.
  • I. Tenney, D. Das, and E. Pavlick (2019) Bert rediscovers the classical nlp pipeline. arXiv preprint arXiv:1905.05950. Cited by: §1.
  • Q. Xie, Z. Dai, Y. Du, E. Hovy, and G. Neubig (2017) Controllable invariance through adversarial feature learning. In Advances in Neural Information Processing Systems, pp. 585–596. Cited by: §1, §2.
  • K. Xu, T. Cao, S. Shah, C. Maung, and H. Schweitzer (2017) Cleaning the null space: A privacy mechanism for predictors. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, S. P. Singh and S. Markovitch (Eds.), pp. 2789–2795. External Links: Link Cited by: §2.
  • B. H. Zhang, B. Lemoine, and M. Mitchell (2018) Mitigating unwanted biases with adversarial learning. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 335–340. Cited by: §2, §5.
  • J. Zhao, Y. Zhou, Z. Li, W. Wang, and K. Chang (2018) Learning gender-neutral word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii (Eds.), pp. 4847–4853. External Links: Link Cited by: §1, §6.1.

Appendix A Appendix

A.1 INLP Guarantees

In this section, we prove, for the binary case, an orthogonality property for INLP classifiers: any two classifiers $w_i$ and $w_j$ from two iteration steps $i$ and $j$ are orthogonal (Lemma A.1). Several useful properties of the matrix $P$ that is returned from INLP emerge as a direct result of orthogonality: the product of the projection matrices calculated in the different INLP steps is commutative (Corollary A.1.1); $P$ is a valid projection (Corollary A.1.2); and $P$ projects to a subspace which is the intersection of the nullspaces of all INLP classifiers (Corollary A.1.3). Furthermore, we bound the influence of $P$ on the structure of the representation space, demonstrating that its impact is limited only to those parts of the vectors that encode the protected attribute (Lemma A.2).

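We prove those properties for two consecutive projection matrices $P_1$ and $P_2$ from two consecutive iterations of Algorithm 1, presented below in (5). The general property follows by induction.

  1. $w_1 = \arg\min_{w} L(w; X; Z)$

  2. $P_1$ = GetProjectionMatrix($N(w_1)$)

  3. $X' = P_1 X$

  4. $w_2 = \arg\min_{w} L(w; X' = P_1 X; Z)$

  5. $P_2$ = GetProjectionMatrix($N(w_2)$)    (5)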

INLP Projects to the Intersection of Nullspaces.

Lemma A.1.

If $w_2$ is initialized as the zero vector and trained with SGD, and the loss $L$ is convex, then $w_2$ is orthogonal to $w_1$, that is, $w_1 \cdot w_2 = 0$.

Proof. In line 4 of the algorithm, we calculate $w_2 = \arg\min_{w} L(w; X' = P_1 X; Z)$. For a convex $L$ and a linear model $w$, it holds that the gradient with respect to $w$ is a linear function of the input $x_i$: $\nabla_w L(x_i) = \alpha_i x_i$ for some scalar $\alpha_i$. It follows that after $t$ SGD steps, $w^{(t)}$ is a linear combination of the (projected) input vectors $x_1, \dots, x_t$. Since we constrain the optimization to $x_i \in N(w_1)$, and considering the fact that the nullspace is closed under addition, at each step of the optimization it holds that $w^{(t)} \in N(w_1)$. In particular, this also holds for the optimal $w_2$. ∎

(Footnote 7: If we performed proper dimensionality reduction at stage 3, i.e., not only zeroing some directions but completely removing them, the optimization in 4 would have a unique solution, as the input would not be rank-deficient. Then, we could use an alternative construction that relies on the Representer theorem, which allows expressing $w_2$ as a weighted sum of the inputs: $w_2 = \sum_i \alpha_i x_i$, for some scalars $\alpha_i$. As each $x_i$ is inside the nullspace, so is any linear combination of them, and in particular $w_2$.)
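The orthogonality argument can be illustrated numerically. The sketch below is an assumption-laden toy, not the authors' code: it trains a logistic-regression direction with full-batch gradient descent (rather than SGD) from a zero initialisation, projects the data onto the nullspace of that direction, and trains a second direction on the projected data. All names are hypothetical.

```python
import numpy as np


def train_logreg_gd(X, z, steps=200, lr=0.1):
    """Logistic regression without bias, full-batch gradient descent, zero init."""
    w = np.zeros(X.shape[1])  # zero initialisation, as the lemma requires
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        # The gradient is a linear combination of the rows of X.
        w -= lr * X.T @ (p - z) / len(z)
    return w


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
z = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)  # toy protected attribute

w1 = train_logreg_gd(X, z)
u1 = w1 / np.linalg.norm(w1)
P1 = np.eye(4) - np.outer(u1, u1)                # projection onto N(w1)

w2 = train_logreg_gd(X @ P1, z)                  # second probe, trained on projected data
dot = float(w1 @ w2)
```

Every update to the second probe is a combination of projected rows, so it never leaves the nullspace of the first direction, and `dot` is zero up to floating-point error.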

We proceed to prove commutativity based on this property.

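Corollary A.1.1.

$P_1 P_2 = P_2 P_1$.

Proof. By Lemma A.1, $w_1 \cdot w_2 = 0$, so $P_{R(w_1)} P_{R(w_2)} = P_{R(w_2)} P_{R(w_1)} = 0$, where $P_{R(w_i)}$ is the projection matrix on the row-space of $w_i$. We rely on the relation $P_i = P_{N(w_i)} = I - P_{R(w_i)}$ and write:

$$P_1 P_2 = (I - P_{R(w_1)})(I - P_{R(w_2)}) = I - P_{R(w_1)} - P_{R(w_2)} + P_{R(w_1)} P_{R(w_2)} = I - P_{R(w_1)} - P_{R(w_2)}$$

$$P_2 P_1 = (I - P_{R(w_2)})(I - P_{R(w_1)}) = I - P_{R(w_2)} - P_{R(w_1)} + P_{R(w_2)} P_{R(w_1)} = I - P_{R(w_1)} - P_{R(w_2)}$$

so $P_1 P_2 = P_2 P_1$, which completes the proof. ∎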

Corollary A.1.2.

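$P = P_2 P_1$ is a projection, that is, $P^2 = P$.

Proof. $P^2 = (P_2 P_1)^2 = P_2 P_1 P_2 P_1 \overset{*}{=} P_2 P_2 P_1 P_1 \overset{**}{=} P_2 P_1 = P$, where $*$ follows from Corollary A.1.1 and $**$ follows from $P_1$ and $P_2$ being projections. ∎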

Corollary A.1.3.

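$P = P_2 P_1$ is a projection onto $N(w_1) \cap N(w_2)$.

Proof. Let $x \in N(w_1) \cap N(w_2)$. Then $P_1 x = x$, as $P_1$ is the projection matrix to $N(w_1)$. Similarly, $P_2 x = x$, so $P x = P_2 P_1 x = x$. Conversely, let $x$ be an arbitrary vector. Then $w_2 \cdot (P x) = w_2 \cdot (P_2 P_1 x) = 0$ and, by Corollary A.1.1, $w_1 \cdot (P x) = w_1 \cdot (P_1 P_2 x) = 0$, so $P x \in N(w_1)$ and $P x \in N(w_2)$, so $x$ is mapped by $P$ to $N(w_1) \cap N(w_2)$. ∎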

Note that in practice, we enforce Corollary A.1.3 by using the projection in Equation 1 (Section 4). As such, the matrix $P$ that is returned from Algorithm 1 is a valid projection matrix to the intersection of the nullspaces even if the conditions in Lemma A.1 do not hold, e.g. when $L$ is nonconvex or $w$ is not initialized as the zero vector.
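The three corollaries can be sanity-checked on a small example with two hand-picked orthogonal directions. The sketch below is illustrative only; the vectors and the helper name `nullspace_projector` are hypothetical.

```python
import numpy as np


def nullspace_projector(w):
    """Projection onto the nullspace of a single direction w."""
    u = w / np.linalg.norm(w)
    return np.eye(len(w)) - np.outer(u, u)


# Two orthogonal toy directions: w1 . w2 = 2 - 2 + 0 + 0 = 0.
w1 = np.array([1.0, 2.0, 0.0, -1.0])
w2 = np.array([2.0, -1.0, 3.0, 0.0])

P1 = nullspace_projector(w1)
P2 = nullspace_projector(w2)
P = P2 @ P1

x = np.array([1.0, 1.0, 1.0, 1.0])
v = P @ x  # image of an arbitrary vector under the combined projection
```

Here `P1 @ P2` equals `P2 @ P1`, `P @ P` equals `P`, and `v` is orthogonal to both `w1` and `w2` and is left unchanged by a further application of `P`.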

INLP Approximately Preserves Distances.

While the projection operation removes the protected information from the representations, ostensibly it could have had a detrimental impact on the structure of the representation space: as a trivial example, the zero matrix $O$ is another operator that removes the protected information, but at the price of collapsing the entire space into the zero vector. The following lemma demonstrates that this is not the case. The projection minimally damages the structure of the representation space, as measured by distances between arbitrary vectors: the change in (squared) distance between $x, x'$ is bounded by the difference between the “gender components” of $x$ and $x'$.

Lemma A.2.

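Let $w$ be a unit gender direction found in one INLP iteration, and let $x, x' \in \mathbb{R}^n$ be arbitrary input vectors. Let $P$ be the nullspace projection matrix corresponding to $w$. Let $d(x,x') = \|x - x'\|$ and $d_P(x,x') = \|Px - Px'\|$ be the distances between $x, x'$ before and after the projection, respectively. Then the following holds:

$$d(x,x') - |w \cdot x - w \cdot x'| \;\le\; d_P(x,x') \;\le\; d(x,x')$$

Proof. Notation: we denote the $i$th entry of a vector $x$ by $x_i$.

Since $w$ is the parameter vector of a gender classifier, a point $x$ can be classified to a gender according to the sign of the dot product $w \cdot x$. Note that in the binary case, the nullspace projection matrix $P$ is given by

$$P = I - w w^\top \qquad (6)$$

where $w w^\top$ is the outer product. By definition, if $w$ is in the direction of one of the axes, say without loss of generality the first axis, such that $w = [1, 0, \dots, 0]$, then the following holds:

$$w w^\top = \mathrm{diag}(1, 0, \dots, 0)$$

such that $w w^\top$ is the zero matrix except for its $(1,1)$ entry, and then $P$ is simplified to

$$P = \mathrm{diag}(0, 1, \dots, 1) \qquad (7)$$

i.e., the unit matrix, except for a zero in the $(1,1)$ position. Hence, the projection operator $P$ keeps $x$ intact, apart from zeroing the first coordinate $x_1$. We will take advantage of this property, and rotate the axes such that $w$ is the direction of the first axis. We will show that the results we derive this way still apply to the original axes system.

Let $R$ be a rotation matrix, such that after the rotation, the first coordinate of $Rx$ is aligned with $w$:

$$[Rx]_1 = w \cdot x \qquad (8)$$

One can always find such a rotation of the axes. Let $x'$ be another point in the same space. Given the original squared distance between $x$ and $x'$:

$$d(x,x')^2 = \|x - x'\|^2 \qquad (9)$$

our goal is to bound the squared distance between the projected points in the new coordinate system:

$$d_{P'}(x,x')^2 = \|P'Rx - P'Rx'\|^2 \qquad (10)$$

where $P'$ denotes the projection matrix in the rotated coordinate system, which takes the form 7.

Note that $R$, being a rotation matrix, is orthogonal. By a known result in linear algebra, multiplication by orthogonal matrices preserves dot products and distances. That means that the distance is the same before and after the rotation: $\|Rx - Rx'\| = \|x - x'\| = d(x,x')$, so we can safely bound $d_{P'}$ and the same bound would hold in the original coordinate system.

By 7,

$$d_{P'}(x,x')^2 = \sum_{i=2}^{n} \left([Rx]_i - [Rx']_i\right)^2 = d(x,x')^2 - \left([Rx]_1 - [Rx']_1\right)^2 \qquad (11)$$

Note that in general it holds that for any $a \ge b \ge 0$

$$\sqrt{a^2 - b^2} \ge a - b \qquad (12)$$

Combining 12 with 11 when taking $a = d(x,x')$ and $b = |[Rx]_1 - [Rx']_1|$ we get:

$$d_{P'}(x,x') \ge d(x,x') - |[Rx]_1 - [Rx']_1| \qquad (13)$$

From 11 one can also trivially get

$$d_{P'}(x,x') \le d(x,x') \qquad (14)$$

Combining 14 and 13, and returning to the original coordinate system (where $[Rx]_1 - [Rx']_1 = w \cdot x - w \cdot x'$ by 8), we finally get:

$$d(x,x') - |w \cdot x - w \cdot x'| \;\le\; d_P(x,x') \;\le\; d(x,x') \qquad (15)$$

Or, equivalently, after subtracting $d(x,x')$ from all elements and multiplying by -1:

$$0 \;\le\; d(x,x') - d_P(x,x') \;\le\; |w \cdot x - w \cdot x'| \qquad (16)$$

Note that this result has a clear interpretation: the difference between the distance of the projected $x, x'$ and the distance of the original $x, x'$ is bounded by the difference of $x$ and $x'$ in the gender direction $w$. In particular, if $x$ and $x'$ are equally male-biased, their distance would not change at all; if $x$ is very male-biased and $x'$ is very female-biased, the projection would significantly alter the distance between them. ∎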
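A quick numerical check of this kind of bound: removing a single unit direction $w$ can shrink the distance between two vectors by at most the difference of their components along $w$. The sketch uses random toy vectors; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10

w = rng.normal(size=n)
w /= np.linalg.norm(w)              # unit direction to remove
P = np.eye(n) - np.outer(w, w)      # nullspace projection for w

x = rng.normal(size=n)
x_prime = rng.normal(size=n)

d = np.linalg.norm(x - x_prime)               # distance before the projection
d_p = np.linalg.norm(P @ x - P @ x_prime)     # distance after the projection
component_diff = abs(w @ x - w @ x_prime)     # difference along the removed direction
```

In this setting the squared distances also satisfy `d_p**2 == d**2 - component_diff**2` up to floating-point error, since the projection only removes the component along `w`.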

A.2 Influence on Local Neighbors in Glove Space

Word | Neighbors before | Neighbors after
order | orders, ordering, purchase | orders, ordering, ordered
crack | keygen, cracks, torrent | keygen, cracks, warez
craigslist | ebay, craiglist, ads | ebay, craiglist, freecycle
populations | population, species, communities | population, species, habitats
epub | ebook, mobi, pdf | mobi, ebook, kindle
finals | semifinals, playoffs, championship | semifinals, semifinal, quarterfinals
installed | install, installing, installation | install, installing, installs
identifiable | disclose, identify, identifying | disclose, pii, distinguishable
photographs | photograph, photos, images | photograph, images, photos
ta | si, tu, ti | que, bien, ele
couch | sofa, sitting, bed | sofa, couches, loveseat
cooler | coolers, cooling, warmer | coolers, cooling, warmer
becky | debbie, kathy, julie | debbie, steph, jen
appreciated | appreciate, greatly, thanks | appreciate, muchly, thanks
negotiation | negotiating, negotiations, mediation | negotiating, negotiations, mediation
initial | subsequent, prior, following | intial, inital, subsequent
chloe | chanel, emma, lauren | chloé, chanel, handbags
filipino | pinoy, filipinos, philippine | filipinos, pinoy, tagalog
relying | rely, relied, relies | rely, relied, relies
perpetual | eternal, continual, irrevocable | irrevocable, datejust, perpetuity
himself | him, herself, his | herself, oneself, he
seaside | beach, beachside, picturesque | beachside, idyllic, seafront
measure | measures, measuring, measured | measures, measuring, measured
yorkshire | staffordshire, leeds, lancashire | staffordshire, dales, lancashire
merchandise | goods, items, apparel | goods, items, merchandize
sub | subs, k, def | subs, subbed, svs
tones | tone, hues, muted | tone, polyphonic, muted
therapist | therapists, psychologist, therapy | therapists, physiotherapist, psychologist
leaned | sighed, smiled, glanced | leant, leaning, sighed
tho | nnd, cuz, tlie | nnd, tlio, tlie
lawyers | attorneys, lawyer, attorney | attorneys, lawyer, attorney
compile | compiling, compiler, compiles | compiling, compiler, compiles
chord | chords, progressions, guitar | chords, progressions, voicings
aims | aim, aimed, aiming | aim, aimed, aiming
ensure | ensuring, assure, ensures | ensuring, ensures, assure
aerospace | aviation, engineering, automotive | aeronautics, aviation, aeronautical
clubhouse | pool, playground, amenities | clubhouses, pool, playground
locking | lock, locks, latch | lock, locks, latch
reign | reigns, emperor, throne | reigns, reigned, emperor
vulnerable | susceptible, fragile, affected | susceptible, vunerable, fragile
Table 1: 3-nearest words before and after the INLP projection

Table 1 above presents the results of the word-embedding similarity test mentioned in Section 6.1. The table lists the 3 nearest neighbors of words sampled from GloVe, before and after the INLP process. It is evident that INLP does not alter the neighbors of the random sample in a detrimental way.
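Nearest neighbors of this kind are typically retrieved by cosine similarity. A minimal sketch, assuming an embedding matrix with one row per vocabulary word; the toy 2-dimensional vectors and the function name are hypothetical, not GloVe values.

```python
import numpy as np


def nearest_neighbors(word, vocab, E, k=3):
    """Return the k words whose embeddings have the highest cosine similarity to `word`."""
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E_norm @ E_norm[vocab.index(word)]
    order = np.argsort(-sims)  # most similar first
    return [vocab[i] for i in order if vocab[i] != word][:k]


# Toy embeddings (hypothetical), one row per word.
vocab = ["order", "orders", "ordering", "couch", "sofa"]
E = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.3],
    [0.0, 1.0],
    [0.1, 0.9],
])

neighbors_of_order = nearest_neighbors("order", vocab, E, k=2)
neighbors_of_couch = nearest_neighbors("couch", vocab, E, k=1)
```

Running the same function on `E @ P` for a projection matrix `P` would give the "after" neighbors for comparison.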

A.3 Quantitative Influence of Gender Debiasing on Glove Embeddings

In Appendix A.2 we provide a sample of words to qualitatively evaluate the influence of INLP on semantic similarity in Glove word embeddings (Section 6.1). We observe minimal change to the nearest neighbors. To complement this measure, we use a quantitative measure: measuring performance on established word-similarity tests, for the original Glove embeddings and for the debiased ones. Those tests measure the correlation between cosine similarity in embedding space and human judgements of similarity. Concretely, we test the embedding similarities using three datasets, which contain four similarity tests that measure similarity or relatedness between words. We use the following datasets: SimLex999 Hill et al. (2015); WordSim353 Agirre et al. (2009), which contains two evaluations, on word similarity and on relatedness; and Mturk-771 Halawi et al. (2012).

The test sets are composed of word pairs, where each pair was annotated by humans with a similarity or relatedness score. To evaluate a model against such data, each pair is given a model score (in the case of word embeddings, cosine similarity), and we then calculate the Spearman correlation between the model scores and the human scores. The results on the regular Glove embeddings before and after the gender debiasing are presented in Table 3. We observe a major improvement across all evaluation sets after the projection: between 0.044 and 0.116 points.
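The evaluation step can be sketched as a Spearman rank correlation between human scores and model scores. The implementation below is a simplified illustration that assumes no tied scores; the function names and the toy scores are hypothetical.

```python
def ranks(values):
    """Rank positions (1 = smallest), assuming all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(a, b):
    """Spearman rank correlation for two score lists without ties."""
    n = len(a)
    ra, rb = ranks(a), ranks(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Toy scores for four word pairs (hypothetical values).
human_scores = [9.0, 7.5, 3.0, 1.0]
cosine_scores = [0.80, 0.85, 0.30, 0.10]

rho = spearman(human_scores, cosine_scores)
```

In the toy example only the two top-ranked pairs are swapped between the lists, so the correlation stays high. Real similarity datasets contain ties, for which a tie-aware implementation should be used instead.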

This major difference in performance is rather surprising. It is not clear how to interpret the positive influence on correlation with human judgements. This puzzle is further compounded by the fact that the projection reduces the rank of the embedding space, and by definition induces loss of information. We hypothesize that many of the words in the embedding space contain a significant gender component, which is not correlated with human judgements of similarity. While intriguing, testing this hypothesis is beyond the scope of this work, and we leave a more rigorous answer to future work.

A.4 Influence on Local Neighbors of First-Name Representations in Glove Space

Word | Neighbors before | Neighbors after
ruth | helen, esther, margaret | etting, esther, gehrig
charlotte | raleigh, nc, atlanta | raleigh, greensboro, nc
abigail | hannah, lydia, eliza | hannah, phebe, josiah
sophie | julia, marie, lucy | moone, bextor, marceau
nichole | nicole, kimberly, kayla | nicole, mya, heiress
emma | emily, lucy, sarah | grint, frain, watson
olivia | emma, rachel, kate | munn, thirlby, wilde
ava | devine, zoe, isabella | viticultural, devine, appellation
isabella | sophia, josephine, isabel | rossellini, beeton, ferdinand
sophia | anna, lydia, julia | hagia, antipolis, topkapi
mia | bella, mamma, mama | bangg, mamma, culpa
amelia | earhart, louisa, caroline | earhart, fernandina, bedelia
james | john, william, thomas | jassie, nightfire, perse
john | james, william, paul | deere, scatman, betjeman
robert | richard, william, james | pattinson, mccammon, blacksportsonline
michael | david, mike, brian | micheal, franti, moorcock
william | henry, edward, james | edward, henry, sir
david | stephen, richard, michael | bisbal, magen, sylvian
richard | robert, william, david | clayderman, brautigan, rorty
joseph | francis, charles, thomas | joesph, dreamcoat, abboud
thomas | james, william, john | szasz, deshaun, tomy
ariel | sharon, alexis, hanna | peterpan, mermaid, cinderella
mike | brian, chris, dave | mignola, birbiglia, dave
Table 2: 3-nearest words before and after the INLP projection, for first names

The results in Table 1 suggest that, as expected, the projection has little influence on the lexical semantics of unbiased words, as measured by their closest neighbors in embedding space. But how does the projection influence inherently gendered words? Table 2 contains the closest neighbors of the Glove representations of gendered first names, before and after the projection. We observe an interesting tendency to move from neighbors which are other gendered first names, towards family names, which are by definition gender-neutral (for instance, the closest neighbor of “Robert” changes from “Richard” to “Pattinson”). Another interesting tendency is to move towards place names bearing a connection to those names (for instance, the closest neighbor of “Sophia” changes to “Hagia”). At the same time, some gendered first names remain close neighbors even after the projection.

A.5 Performance and “Fair Classification” as a Function of INLP Iterations

Eval | Before | After
SimLex999 | 0.373 | 0.489
WordSim353 - Sim | 0.695 | 0.799
WordSim353 - Rel | 0.599 | 0.698
Mturk-771 | 0.684 | 0.728
Table 3: Word similarity scores on Glove embeddings, before and after INLP. The scores are the Spearman correlation coefficient between the similarity scores.

In Section 6.2 we compare the accuracy and TPR-Gap before and after applying INLP for a certain number of iterations. The number of iterations chosen is somewhat arbitrary, but we emphasize that the trade-off can be controlled through the number of INLP iterations: by sacrificing main-task performance, one can improve the TPR-Gap of the model. In Figure 1 we detail these trade-offs for the 0.8 ratio, where the original TPR-Gap is the highest.

We note that performance is minimally damaged for the first 180 iterations, while the TPR-Gap improves greatly; after that point, both metrics drop more sharply. Using this trade-off, one can decide how much performance to sacrifice in order to get a less biased model.
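The trade-off can be made concrete with a small sketch. It assumes one common formulation of the TPR-Gap: the per-class difference in true-positive rate between two gender groups, aggregated by root-mean-square over classes. The exact definition used in the experiments is the one given in Section 6.2. The labels, predictions, `history` values, and the `pick_iterations` helper are all hypothetical.

```python
import math
from collections import defaultdict

def tpr_by_group(y_true, y_pred, groups, label):
    """True-positive rate for `label`, computed separately for each group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        if t == label:
            totals[g] += 1
            hits[g] += int(p == label)
    return {g: hits[g] / totals[g] for g in totals}

def tpr_gap_rms(y_true, y_pred, groups, group_a="f", group_b="m"):
    """RMS over labels of TPR(group_a) - TPR(group_b).
    Labels that lack examples from either group are skipped."""
    gaps = []
    for label in sorted(set(y_true)):
        tpr = tpr_by_group(y_true, y_pred, groups, label)
        if group_a in tpr and group_b in tpr:
            gaps.append(tpr[group_a] - tpr[group_b])
    return math.sqrt(sum(g * g for g in gaps) / len(gaps)) if gaps else 0.0

def pick_iterations(history, max_accuracy_drop):
    """history: (n_iterations, accuracy, tpr_gap) tuples, baseline first.
    Returns the entry with the smallest gap within the accuracy budget."""
    baseline_accuracy = history[0][1]
    allowed = [h for h in history if baseline_accuracy - h[1] <= max_accuracy_drop]
    return min(allowed, key=lambda h: h[2])

# Hypothetical predictions for two professions and two gender groups.
y_true = ["nurse"] * 4 + ["surgeon"] * 4
groups = ["f", "f", "m", "m", "f", "f", "m", "m"]
y_pred = ["nurse", "nurse", "nurse", "surgeon",
          "surgeon", "nurse", "surgeon", "surgeon"]
gap = tpr_gap_rms(y_true, y_pred, groups)

# Hypothetical measurements taken after different numbers of INLP iterations.
history = [(0, 0.80, 0.20), (100, 0.79, 0.12), (200, 0.76, 0.09), (300, 0.70, 0.06)]
chosen = pick_iterations(history, max_accuracy_drop=0.05)
```

Here `pick_iterations` simply encodes the decision described above: fix an accuracy budget, then take the iteration count that gives the smallest gap within it.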

Figure 1:

a.6 Biographies dataset: Words Most-Associated with Gender

a.6.1 Bag-of-Words Model

In this section, we present the raw results of the experiment that assesses the influence of INLP on specific words, under the bag-of-words model, for the biographies experiments (Section 6.3.1).

Table 4 lists the words most influenced by INLP projection (on average over all professions) after the debiasing procedure explained in Section 6.3.

Figure 2 presents the relative change of biased words for each profession, compared to a random sample of words.

Most Changed Words
ms., mr., his, her, he, she, mrs., specializes,
english, practices, ’,’, him, spanish,
speaks, with, affiliated, and, medicine, ms,
state, #, the, medical, michael, in,
residency, at, of, psychology, dr., ’s,
law, research, practice, about, where,
business, education, 5, -, is, first,
women, america, insurance, more, john,
university, location, ph.d., surgery, (,
mental, ), that, engineering, graduated,
language, bs, litigation, collection,
united, 1, graduate, humana, cpas,
cancer, npi, completed, 10, book, hospital, c,
out, family, or, when, oklahoma, certified,
ohio, number, training, for, like, a,
than, be, nursing, ], _, can, writing,
patients, no, orthopaedic, attorney,
over, ny, mr, “,

Table 4: Top 100 words influenced by INLP projection (BOW representation, biographies dataset).
Figure 2: The relative change of biased vs. random words, per profession.

a.6.2 Bag-of-Word-Vectors Model

Gender direction | Male-biased | Female-biased
0 | his, he, His, himself | herself, she, She, her
1 | himself, him, Gavin, His | Hatha, midwifery, Midwifery, feminist
2 | Mark, Jon, Darren, Luke | Actress, Zumba, Diana, woman
3 | Gordon, he, wind, charge | hers, recipe, Challenge, cookbooks
4 | 1935, 1955, namely, 1958 | Roots, Issue, FHM, yoga
5 | M, Mickey, KS, Bethesda | Vietnam, Subject, Elle, Ecuador
6 | Keys, correct, address, fuel | leap, Embedded, textile, femininity
7 | Papers, Categories, wherein, Newark | Botox, LASIK, periodontal, UnityPoint
8 | binding, closely, MT, command | Aventura, brunette, HTML, Disclosure
9 | 82, 92, 91, 86 | ASP.NET, committer, Twilight, Seth
10 | t, Cisco, Philips, Sharp | preschool, caregivers, homeowners, Preschool
11 | Toulouse, Aviv, scored, commended | intersectional, Equality, equality, ASME
12 | addressing, segment, inequalities, segments | Wire, loose, anything, Vincents
13 | comparison, Hart, 480, refereed | Matthew, independence, couples, LGBTQ
14 | manufacturer, organizers, scope, specifications | homeschooling, ligament, loyalty, graduating
Table 5: Words closest to the top 15 INLP gender directions (FastText representation, biographies dataset).

In this section, we present an analysis of the influence of the INLP projection on the FastText representation of individual words, under the bag-of-word-vectors model, for the biographies experiments (Section 6.3.1). We begin by ordering the vocabulary items by their cosine similarity to each of the top 15 gender directions found by INLP (i.e., their similarity to the weight vector of each classifier). For each gender direction $w$, we focus on the 20,000 most common vocabulary items, and calculate the closest words to $w$ (to get male-biased words) as well as the closest words to $-w$ (to get female-biased words). The results are presented in Table 5.
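The ranking procedure can be sketched as follows. The vectors, word counts, and function names are hypothetical, and a single hand-picked direction stands in for a classifier weight vector. The sketch relies on the fact that the cosine similarity to the negated direction is the negated cosine similarity, so one sort yields both lists.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_common(counts, n):
    """Restrict the analysis to the n most frequent vocabulary items."""
    return sorted(counts, key=counts.get, reverse=True)[:n]

def closest_words(direction, vectors, vocab, k=4):
    """Return (k words closest to direction, k words closest to -direction)."""
    ordered = sorted(vocab, key=lambda w: cosine(vectors[w], direction), reverse=True)
    # cos(x, -d) = -cos(x, d): the words closest to -d are at the end of the list.
    return ordered[:k], ordered[::-1][:k]

# Hypothetical 2-d vectors and corpus counts, for illustration only.
toy_vectors = {
    "he": [0.9, 0.1], "his": [0.8, 0.3], "she": [-0.9, 0.1],
    "herself": [-0.7, 0.4], "the": [0.0, 1.0], "zumba": [-1.0, 0.0],
}
toy_counts = {"the": 100, "he": 50, "she": 45, "his": 40, "herself": 10, "zumba": 1}
direction = [1.0, 0.0]  # stand-in for one classifier weight vector

vocab = most_common(toy_counts, n=5)  # drops the rare word "zumba"
male_biased, female_biased = closest_words(direction, toy_vectors, vocab, k=2)
```

The frequency cut-off matters in practice: without it, rare words with noisy vectors can dominate both ends of the ranking.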

The first gender direction (direction 0) seems to capture pronouns. Other gender directions capture socially biased terms, such as “preschool” (direction 10) and “cookbooks” (direction 3), or other gender-related terms, such as “LGBTQ” (direction 13) and “femininity” (direction 6). Interestingly, those are mostly female-biased terms. As for the male-biased words, some directions capture names, such as “Gordon” and “Aviv”. Other words found to be male-biased are less interpretable, such as words specifying years (direction 4), organizational terms such as “organizers” and “specifications” (direction 14), or the words “Papers” and “Categories” (direction 7). It is not clear whether those are the result of spurious correlations or noise, or whether they reflect actual subtle differences in the way the biographies of men and women are written.

Gender rowspace

The above analysis focuses on the information conveyed by individual gender directions. Next, we aim to demonstrate the influence of the final INLP projection on the representation of words. To this end, we rely on the rowspace of the INLP matrix. Recall that the rowspace is the orthogonal complement of the nullspace. As the INLP matrix projects onto the intersection of the nullspaces of the gender directions, the complementary projection maps onto the sum of the rowspaces of the individual gender directions. This is the subspace spanned by all gender directions, and thus can be thought of as an empirical gender subspace within the representation space.
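In symbols, the relation between the two projections can be written as follows. The notation is assumed here for illustration: $w_1, \dots, w_n$ denote the gender directions, $N(w_i)$ the nullspace of $w_i$, $P$ the final INLP projection (taken to be the orthogonal projection onto the intersection of the nullspaces), and $P_R$ the projection onto the rowspace $R$.

```latex
% Assumed notation; P is taken to be an orthogonal projection.
\begin{align*}
  \operatorname{range}(P) &= \bigcap_{i=1}^{n} N(w_i), \\
  R &= \Big(\bigcap_{i=1}^{n} N(w_i)\Big)^{\perp}
     = \operatorname{span}\{w_1, \dots, w_n\}, \\
  P_R &= I - P .
\end{align*}
```

Under these assumptions every vector $x$ decomposes as $x = Px + P_R x$, with the two parts orthogonal to each other.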

For a given word vector, the “gender norm” (the norm of its projection onto the rowspace) is a scalar quantity that can serve as a measure of the gender bias of the word. We sort the vocabulary by the ratio between the gender norm and the original norm, and present the 200 most gendered words (Table 6).
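A minimal sketch of this ranking is given below. It builds an orthonormal basis of the span of the gender directions with Gram-Schmidt and then measures how much of each word vector lies in that span. The directions and word vectors are hypothetical toy values, and the helper names are ours.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def orthonormal_basis(directions, tol=1e-10):
    """Gram-Schmidt: orthonormal basis of the span of `directions`."""
    basis = []
    for d in directions:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [x - c * y for x, y in zip(r, b)]
        n = norm(r)
        if n > tol:  # skip directions already in the span
            basis.append([x / n for x in r])
    return basis

def gender_ratio(x, basis):
    """Norm of the projection of x onto span(basis), divided by the norm of x."""
    coeffs = [dot(x, b) for b in basis]
    return math.sqrt(sum(c * c for c in coeffs)) / norm(x)

# Hypothetical gender directions and word vectors, for illustration only.
directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
toy_vectors = {
    "motherhood": [3.0, 4.0, 0.0],  # lies entirely in the gender subspace
    "teacher":    [1.0, 0.0, 1.0],  # partly in the gender subspace
    "table":      [0.0, 0.0, 2.0],  # orthogonal to the gender subspace
}

basis = orthonormal_basis(directions)
ratios = {w: gender_ratio(v, basis) for w, v in toy_vectors.items()}
ranking = sorted(ratios, key=ratios.get, reverse=True)
```

Dividing by the original norm keeps the score between 0 and 1, so that long vectors do not rank as more gendered merely because of their length.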

Top words by component on the gender subspace
motherhood, SSHRC,
microfinance, preschool, genocide, IFP,
CSE, intersectional, student,
homeschooling, photoshoot,
intersectionality, 920, breastfeeding, STEM,
photojournalistic, haiku, kindergarten,
FreeOnes, UNESCO, menstrual,
turbulence, NTR, ASME, HFN, ECE, IEEE,
feminism, noir, Jadavpur, Motherhood,
reportage, Contra, TU, WebSphere,
counsellor, photovoltaic, J2EE,
contraception, university, PEN,
masculinities, parenting, EAP,
Politecnico, Feminism, trauma,
Universiti, counselling, curriculum,
Kanpur, women, edits, Pune, Nanjing,
ethnographic, Pinterest, surrealist,
taught, Hindustan, students, CNRS,
Bangalore, Mumbai, consortium, tooth,
Vitae, Kindergarten, nanoscale,
school, ACL, scholarships, cloud,
Goa, NIJC, Montessori, JSPS,
scholarship, Neha, DAAD, endometriosis,
carrier, UCI, activism, Ambedkar,
EECS, semiconductor, scholar,
microfluidic, bikini, Raising, teacher,
Feminist, vinyasa, NBER, ethnography,
Twilight, Sunil, Shankar, viral,
earthquake, semiconductors,
historiography, vampire, HMO, PSU, bioenergy,
historian, Ravi, Breastfeeding, Raman,
resettlement, Shweta, ICTs, UNDP, NVIDIA,
HIV, Counselling, HEC, KDD,
Hyderabad, contraceptive, macro,
Ghaziabad, sexuality, CAS,
documentary, mic, biography, postdoc,
transnationalism, AMD, CFD, B.Tech, physicist,
LGBT, parenthood, HKU, HIP,
internationalization, M.Tech, BDS, acne, theorist,
HPV, Meerut, ageing, smile,
Rajesh, psychoeducational, PUNE,
grief, AHA, Essays, discourses,
secrets, Swati, EPFL, coaching, IIE,
Manoj, BIDMC, infertility,
fashion, Chicana, Vaishali,
Graduation, sociologist, Gender, EA, MIT,
teach, gift, IETF, NPPA, counselor,
JPL, gender, menopause, LGBTQ,
Waseda, perceptions, praxis,
birthday, Jawaharlal, fertility,
gendered, coverage, stills, PIH,
Balaji, Tagged, baking, USM,
postpartum, Goenka, Pooja, forgiveness

Table 6: Words by gender norm.

As before, we see a combination of inherently gendered words (“motherhood”, “women”, “gender”, “masculinities”), socially biased terms (“teacher”, “Raising”, “semiconductors”, “B.Tech”, “IEEE”, “STEM”, “fashion”) and other words whose connection to gender is less interpretable and which potentially represent spurious correlations (“trauma”, “Vitae”, “smile”, “920”, “forgiveness”).