Word embeddings learnt from massive text collections have demonstrated significant levels of discriminative biases such as gender, racial or ethnic biases, which in turn bias the down-stream NLP applications that use those word embeddings. Taking gender-bias as a working example, we propose a debiasing method that preserves non-discriminative gender-related information, while removing stereotypical discriminative gender biases from pre-trained word embeddings. Specifically, we consider four types of information: feminine, masculine, gender-neutral and stereotypical, which represent the relationship between gender vs. bias, and propose a debiasing method that (a) preserves the gender-related information in feminine and masculine words, (b) preserves the neutrality in gender-neutral words, and (c) removes the biases from stereotypical words. Experimental results on several previously proposed benchmark datasets show that our proposed method can debias pre-trained word embeddings better than existing SoTA methods proposed for debiasing word embeddings while preserving gender-related but non-discriminative information.
Despite the impressive success stories behind word representation learning Devlin et al. (2018); Peters et al. (2018); Pennington et al. (2014); Mikolov et al. (2013c, a), further investigations into the learnt representations have revealed several worrying issues. The semantic representations learnt, in particular from social media, have been shown to encode significant levels of racist, offensive and discriminative language usage Bolukbasi et al. (2016); Zhao et al. (2018b); Elazar and Goldberg (2018); Rudinger et al. (2018); Zhao et al. (2018a). For example, Bolukbasi et al. (2016) showed that word representations learnt from a large (300GB) news corpus amplify unfair gender biases. Microsoft’s AI chat bot Tay learnt abusive language from Twitter within the first 24 hours of its release, which forced Microsoft to shut down the bot The Telegraph (2016). Caliskan et al. (2017) conducted an implicit association test (IAT) Greenwald et al. (1998) using the cosine similarity measured from word representations, and showed that word representations computed from a large Web crawl contain human-like biases with respect to gender, profession and ethnicity.
Given the broad applications of pre-trained word embeddings in various down-stream NLP tasks such as machine translation Zou et al. (2013), dialogue generation Zhang et al. (2018) etc., it is important to debias word embeddings before they are applied in NLP systems that interact with and/or make decisions that affect humans. We believe that no human should be discriminated against on the basis of demographic attributes by an NLP system, and there exist clear legal European Union (1997), business and ethical obligations to make NLP systems unbiased Holstein et al. (2018).
Despite the growing need for unbiased word embeddings, debiasing pre-trained word embeddings is a challenging task that requires a fine balance between removing information related to discriminative biases and retaining information that is necessary for the target NLP task. For example, profession-related nouns such as professor, doctor and programmer have been shown to be stereotypically male-biased, whereas nurse and homemaker are stereotypically female-biased, and a debiasing method must remove such biases. On the other hand, one would expect (and this is indeed the case for pre-trained GloVe embeddings) beard to be associated with male nouns and bikini to be associated with female nouns, and preserving such gender biases would be useful, for example, for a recommendation system Garimella et al. (2017). As detailed later in section 2, existing debiasing methods can be seen as projecting word embeddings into a subspace that is approximately orthogonal to a gender subspace spanned by gender-specific word embeddings. Although unsupervised, weakly-supervised and adversarially trained models have been used for learning such embeddings, they primarily focus on the male-female gender direction and ignore the effect of words that have a gender orientation but are not necessarily unfairly biased.
To perform an extensive treatment of the gender debiasing problem, we split a given vocabulary into four mutually exclusive sets of word categories:
(a) words that are female-biased but non-discriminative (V_f),
(b) words that are male-biased but non-discriminative (V_m),
(c) words that are gender-neutral (V_n), and
(d) words that are stereotypical, i.e., unfairly gender-biased (V_s); we use the term unfair as used in fairness-aware machine learning.
Given a large set of pre-trained word embeddings and small seed example sets for each of those four categories, we learn an embedding that (i) preserves the feminine information for the words in V_f, (ii) preserves the masculine information for the words in V_m, (iii) protects the neutrality of the gender-neutral words in V_n, while (iv) removing the gender-related biases from the stereotypical words in V_s.
The embedding is learnt using the encoder of a denoising autoencoder, while the decoder is trained to reconstruct the original word embeddings from the debiased embeddings, which do not contain unfair gender biases. The overall model is trained end-to-end to dynamically balance the competing criteria (i)-(iv).
We evaluate the bias and accuracy of the word embeddings debiased by the proposed method on multiple benchmark datasets.
On the SemBias Zhao et al. (2018b) gender relational analogy dataset, our proposed method outperforms previously proposed hard-debiasing Bolukbasi et al. (2016) and Gender-Neutral Global Vectors (GN-GloVe) Zhao et al. (2018b) by correctly debiasing stereotypical analogies. Following prior work, we evaluate the loss of information due to debiasing on benchmark datasets for semantic similarity and word analogy. Experimental results show that the proposed method can preserve the semantics of the original word embeddings, while removing gender biases. This shows that the debiased word embeddings can be used as drop-in replacements for word embeddings used in NLP applications. Moreover, experimental results show that our proposed method can also debias word embeddings that are already debiased using previously proposed debiasing methods such as GN-GloVe to filter out any remaining gender biases, while preserving semantic information useful for downstream NLP applications. This enables us to use the proposed method in conjunction with existing debiasing methods.
To reduce the gender stereotypes embedded inside pre-trained word representations, Bolukbasi et al. (2016) proposed a post-processing approach that projects gender-neutral words into a subspace that is orthogonal to the gender dimension defined by a list of gender-definitional words. They refer to words associated with gender (e.g., she, actor) as gender-definitional words, and to the remainder as gender-neutral. They proposed a hard-debiasing method, where the gender direction is computed as the vector difference between the embeddings of corresponding gender-definitional words, and a soft-debiasing
method, which balances the objective of preserving the inner-products between the original word embeddings against projecting the word embeddings into a subspace orthogonal to the gender-definitional words. They use a seed set of gender-definitional words to train a support vector machine classifier, which they use to expand the initial set of gender-definitional words. Both hard and soft debiasing methods ignore gender-definitional words during the subsequent debiasing process, and focus only on words that are not predicted as gender-definitional by the classifier. Therefore, if the classifier erroneously predicts a stereotypical word as a gender-definitional word, that word will not get debiased.
Zhao et al. (2018b) proposed Gender-Neutral Global Vectors (GN-GloVe) by adding a constraint to the Global Vectors (GloVe) Pennington et al. (2014) objective such that gender-related information is confined to a sub-vector. During optimisation, the squared distance between the gender-related sub-vectors is maximised, while simultaneously minimising the GloVe objective. GN-GloVe learns gender-debiased word embeddings from scratch from a given corpus, and cannot be used to debias pre-trained word embeddings. Moreover, similar to the hard and soft debiasing methods described above, GN-GloVe uses pre-defined lists of feminine, masculine and gender-neutral words, and does not debias words in these lists.
Debiasing can be seen as a problem of hiding information related to a protected attribute such as gender, for which adversarial learning methods Xie et al. (2017); Elazar and Goldberg (2018); Li et al. (2018) have been proposed in the fairness-aware machine learning community Kamiran and Calders (2009). In these approaches, inputs are first encoded, and then two classifiers are trained: a target-task predictor that uses the encoded input to perform the target NLP task, and a protected-attribute predictor that uses the encoded input to predict the protected attribute. The two classifiers and the encoder are learnt jointly such that the accuracy of the target-task predictor is maximised, while the accuracy of the protected-attribute predictor is minimised. However, Elazar and Goldberg (2018) showed that although it is possible to obtain chance-level development-set accuracy for the protected attribute during training, a post-hoc classifier trained on the encoded inputs can still reach substantially high accuracies for the protected attributes. They conclude that adversarial learning alone does not guarantee invariant representations for the protected attributes.
Gender biases have been identified in several NLP tasks such as coreference resolution Rudinger et al. (2018); Zhao et al. (2018a) and machine translation Prates et al. (2018). For example, rule-based, feature-based as well as neural coreference resolution methods trained on biased resources have been shown to reflect those biases in their predictions Rudinger et al. (2018). Google Machine Translation, for example, provides male and female versions of the translations (https://bit.ly/2B0nVHZ) when the gender in the source language is ambiguous.
Given a pre-trained set of n-dimensional word embeddings over a vocabulary V, we consider the problem of learning a map E that projects the original pre-trained word embeddings to a debiased m-dimensional space. We do not assume any knowledge about the word embedding learning algorithm that was used to produce the pre-trained word embeddings given to us. Moreover, we do not assume availability of or access to the language resources, such as corpora or lexicons, that might have been used by the word embedding learning algorithm. Decoupling the debiasing method from the word embedding learning algorithm and its resources increases the applicability of the proposed method, enabling us to debias pre-trained word embeddings produced using different word embedding learning algorithms and different types of resources.
We propose a debiasing method that models the interaction between the values of the protected attribute (in the case of gender we consider male, female and neutral as possible attribute values), and whether there is a stereotypical bias or not. Given four sets of words: masculine (V_m), feminine (V_f), neutral (V_n) and stereotypical (V_s), our proposed method learns a projection that satisfies the following four criteria:
for each word in the feminine set V_f, we protect its feminine properties,
for each word in the masculine set V_m, we protect its masculine properties,
for each word in the gender-neutral set V_n, we protect its gender neutrality, and
for each word in the stereotypical set V_s, we remove its gender biases.
By definition, the four word categories are mutually exclusive and the total vocabulary is their disjoint union V = V_f ∪ V_m ∪ V_n ∪ V_s. A key feature of the proposed method that distinguishes it from prior work on debiasing word embeddings is its ability to differentiate undesirable (stereotypical) biases from the desirable (expected) gender information in words. The procedure we follow to compile the four word-sets is described later in subsection 4.1, and the words that belong to each of the four categories are shown in the supplementary material.
To explain the proposed gender debiasing method, let us first consider a feminine regressor f, parametrised by θ_f, that predicts the degree of feminineness of a word w; highly feminine words are assigned values close to 1. Likewise, let us consider a masculine regressor g, parametrised by θ_g, that predicts the degree of masculinity of w. We then learn the debiasing function E as the encoder of an autoencoder, parametrised by θ_e, where the corresponding decoder D, parametrised by θ_d, reconstructs the original embedding from E(w).
For feminine and masculine words, we require the encoded space to retain the gender-related information. The squared losses L_f and L_m, given respectively by (1) and (2), express the extent to which this constraint is satisfied.
Here, for notational simplicity, we drop the dependence on parameters.
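The bodies of (1) and (2) are not reproduced here. As a hedged illustration only, assuming each loss penalises the squared deviation of a regressor's output from 1 on words of its own gender category, such a loss could be sketched as follows (the regressor and embeddings are toy stand-ins, not the paper's trained models):

```python
import numpy as np

def squared_loss(regressor, encoded_words, target):
    """Mean squared deviation of the regressor's output from `target`
    over a set of encoded word embeddings -- a hedged stand-in for the
    paper's losses (1) and (2)."""
    preds = np.array([regressor(e) for e in encoded_words])
    return float(np.mean((preds - target) ** 2))

# Toy regressor: squash the first coordinate into (0, 1).
f = lambda e: 1.0 / (1.0 + np.exp(-e[0]))

encoded_feminine = [np.array([4.0, 0.0]), np.array([5.0, 0.0])]
L_f = squared_loss(f, encoded_feminine, target=1.0)  # small: f scores these near 1
```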
For the stereotypical and gender-neutral words, we require that they are embedded into a subspace that is orthogonal to a gender directional vector v_g, computed using a set Ω of feminine and masculine word-pairs as given by (3).
Prior work on gender debiasing Bolukbasi et al. (2016); Zhao et al. (2018b) showed that the vector difference between the embeddings of male-female word-pairs such as he and she accurately represents the gender direction. During training, we keep v_g fixed between epochs. We consider the squared inner-product between v_g and the debiased stereotypical or gender-neutral words as the loss L_g, as given by (4).
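As a rough sketch of this step (the exact forms of (3) and (4) are assumptions here, not the paper's equations): v_g is taken as the average difference over masculine-feminine embedding pairs, and the loss sums squared inner-products between v_g and the debiased embeddings:

```python
import numpy as np

def gender_direction(pairs):
    """Average difference vector over (feminine, masculine) embedding
    pairs -- one plausible reading of how v_g in (3) is computed."""
    return np.mean([m - f for f, m in pairs], axis=0)

def orthogonality_loss(v_g, encoded):
    """Sum of squared inner products between v_g and debiased
    embeddings, mirroring the loss L_g referenced in (4)."""
    return float(sum(np.dot(v_g, e) ** 2 for e in encoded))

# Toy 2-d embeddings: axis 0 carries gender, axis 1 is gender-neutral.
pairs = [(np.array([1.0, 0.0]), np.array([-1.0, 0.0])),   # (she, he)
         (np.array([0.9, 0.1]), np.array([-0.9, 0.1]))]   # (woman, man)
v_g = gender_direction(pairs)
loss_neutral = orthogonality_loss(v_g, [np.array([0.0, 1.0])])  # orthogonal: 0
loss_biased = orthogonality_loss(v_g, [np.array([0.5, 0.5])])   # not orthogonal: > 0
```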
It is important that we preserve the semantic information encoded in the word embeddings as much as possible when we perform debiasing. If too much information, not limited to gender-biases, is removed from the word embeddings, then the debiased word embeddings might not be sufficiently accurate to be used in downstream NLP applications. For this purpose, we minimise the reconstruction loss L_r for the autoencoder, given by (5).
Finally, we define the total objective as the linearly-weighted sum of the above-defined losses as given by (6).
Here, the coefficients λ_f, λ_m, λ_g and λ_r are nonnegative hyper-parameters that add to 1. They determine the relative importance of the different constraints we consider, and can be learnt using training data or determined via cross-validation over a dedicated validation dataset. In our experiments, we use the latter approach.
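A minimal numeric sketch of the linearly-weighted objective in (6); the coefficient and loss values below are purely illustrative, not those estimated in the paper:

```python
# A sketch of the linearly-weighted objective in (6); the lambda values
# here are illustrative, not those estimated in the paper.
lambdas = {"f": 0.25, "m": 0.25, "g": 0.25, "r": 0.25}
assert abs(sum(lambdas.values()) - 1.0) < 1e-9  # coefficients add to 1

losses = {"f": 0.1, "m": 0.2, "g": 0.05, "r": 0.3}  # toy loss values
total = sum(lambdas[k] * losses[k] for k in lambdas)
```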
Both the encoder E and the decoder D of the autoencoder are implemented as feed-forward neural networks with two hidden layers; deeper networks for E and D did not result in a significant increase in accuracy. Hyperbolic tangent is used as the activation function throughout the autoencoder.
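The architecture described above can be sketched as follows; only the two tanh hidden layers come from the text, while the hidden-layer width, initialisation scale and toy dimensionalities are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(dims):
    """Initialise weights for a feed-forward net applied with tanh activations."""
    return [(rng.standard_normal((i, o)) * 0.1, np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def forward(params, x):
    # tanh is used throughout the autoencoder, as in the paper.
    for W, b in params:
        x = np.tanh(x @ W + b)
    return x

n = m = 8                      # embedding dimensionalities (illustrative)
encoder = mlp([n, 16, 16, m])  # two hidden layers, as described in the text
decoder = mlp([m, 16, 16, n])

w = rng.standard_normal(n)                  # a toy pre-trained embedding
debiased = forward(encoder, w)              # E(w): the debiased embedding
reconstructed = forward(decoder, debiased)  # D(E(w)): the reconstruction
```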
The objective (6) is minimised w.r.t. the parameters θ_f, θ_g, θ_e and θ_d for a given pre-trained set of word embeddings. During optimisation, we used dropout regularisation and stochastic gradient descent. The hyper-parameters are estimated using a separate validation dataset, as described later in subsection 4.1.
Note that it is possible to pre-train f and g separately using L_f and L_m prior to training with the full objective (6). In our preliminary experiments, we found initialising f and g to their pre-trained versions to be helpful for the optimisation process, resulting in early convergence to better solutions compared to starting from random initialisations. For pre-training f and g we used the Adam optimiser Kingma and Ba (2015) with the initial learning rate set to 0.0002 and a mini-batch size of 512. The autoencoder is also pre-trained, using 5000 randomly selected word embeddings, and dropout regularisation is applied with probability 0.05.
We note that V_f and V_m are separate word sets that do not necessarily contain corresponding feminine-masculine pairs as in the set Ω used in (3). It is of course possible to re-use the words in Ω in V_f and V_m, and we follow this approach in our experiments, which helps to decrease the number of seed words required to train the proposed method. Moreover, the numbers of training examples across the four categories were significantly different, which resulted in an imbalanced learning setting. We conduct one-sided undersampling Kubat and Matwin (1997) to overcome this data imbalance issue. The code and the debiased embeddings are publicly available at https://github.com/kanekomasahiro/gp_debias.
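For illustration, a simplified random undersampling (not the one-sided selection procedure of Kubat and Matwin (1997) actually used in the paper) that balances the four word categories might look like:

```python
import random

def undersample(categories, seed=0):
    """Randomly downsample each word category to the size of the smallest
    one. This is a simplified stand-in for the one-sided selection of
    Kubat and Matwin (1997) used in the paper."""
    rng = random.Random(seed)
    smallest = min(len(words) for words in categories.values())
    return {name: rng.sample(words, smallest)
            for name, words in categories.items()}

# Toy category sets; the real lists contain hundreds of words each.
sets = {"feminine": ["aunt", "queen"], "masculine": ["uncle", "king"],
        "neutral": ["doctor", "teacher", "pilot", "chef"],
        "stereotypical": ["nurse", "carpenter", "receptionist"]}
balanced = undersample(sets)
```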
We use the feminine and masculine word lists (223 words each) created by Zhao et al. (2018b) as V_f and V_m, respectively. To create a gender-neutral word list V_n, we select gender-neutral words from a list of the 3000 most frequent words in English (https://bit.ly/2SvBINY). Two annotators independently selected words, which were subsequently verified for gender neutrality. The final set V_n contains 1031 gender-neutral words. We use the stereotypical word list compiled by Bolukbasi et al. (2016) as V_s, which contains 166 professions that are stereotypically associated with one gender. The four sets of words used in the experiments are shown in the supplementary material.
We train GloVe Pennington et al. (2014) on the 2017 January dump of English Wikipedia to obtain pre-trained word embeddings for 322,636 unique words. In our experiments, we set m = n, so that the debiased word embeddings have the same dimensionality as the pre-trained embeddings. We randomly selected 20 words from each of the four sets V_f, V_m, V_n and V_s, and used them as a development set for pre-training f and g and for estimating the hyperparameters in (6). In our preliminary experiments we observed that increasing the weights of the debiasing-related losses relative to λ_r results in higher reconstruction losses in the autoencoder. This shows that the ability to accurately reconstruct the original word embeddings is an important requirement during debiasing.
We compare our proposed method against several baselines.
GloVe: the pre-trained GloVe embeddings described in subsection 4.1. This baseline denotes a non-debiased version of the word embeddings.
Hard-GloVe: We use the original authors' implementation of the hard-debiasing method Bolukbasi et al. (2016) (https://github.com/tolga-b/debiaswe) to produce a debiased version of the pre-trained GloVe embeddings. (Bolukbasi et al. (2016) released debiased embeddings for word2vec only; for comparison purposes with GN-GloVe, we use GloVe as the pre-trained word embedding and apply hard-debiasing to it.)
GN-GloVe: We use the debiased GN-GloVe embeddings released by the original authors (https://github.com/uclanlp/gn_glove) as a baseline, without retraining them ourselves.
AE (GloVe): We train an autoencoder by minimising the reconstruction loss defined in (5) and encode the pre-trained GloVe embeddings into a vector space with the same dimensionality. This baseline can be seen as a variant of the proposed method with λ_f = λ_m = λ_g = 0. AE (GloVe) does not perform debiasing and shows the amount of semantic information that can be preserved by autoencoding the input embeddings.
AE (GN-GloVe): Similar to AE (GloVe), this method autoencodes the debiased word embeddings produced by GN-GloVe.
GP (GloVe): We apply the proposed gender-preserving (GP) debiasing method to the pre-trained GloVe embeddings.
GP (GN-GloVe): To test whether the proposed method can further debias word embeddings that have already been debiased using other methods, we apply it to GN-GloVe.
We use the SemBias dataset created by Zhao et al. (2018b) to evaluate the level of gender bias in word embeddings. Each instance in SemBias consists of four word pairs: a gender-definition word pair (Definition; e.g., “waiter - waitress”), a gender-stereotype word pair (Stereotype; e.g., “doctor - nurse”) and two other word-pairs that have similar meanings but no gender relation (None; e.g., “dog - cat”, “cup - lid”). SemBias contains 20 gender-stereotype word pairs and 22 gender-definitional word pairs, and their Cartesian product generates 440 instances. Among the 22 gender-definitional word pairs, 2 word-pairs are not used as seeds for training. Following Zhao et al. (2018b), to test the generalisability of a debiasing method, we use the subset (SemBias-subset) of 40 instances associated with these 2 pairs. We measure the relational similarity between the he-she word-pair and each word-pair (a, b) in SemBias using the cosine similarity between the gender directional vector he − she and a − b, computed from the word embeddings under evaluation. For the four word-pairs in each SemBias instance, we select the word-pair with the highest cosine similarity as the predicted answer. In Table 1, we show the percentages where a word-pair is correctly classified as Definition, Stereotype, or None. If the word embeddings are correctly debiased, we would expect a high accuracy for Definitions and low accuracies for Stereotypes and Nones.
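Our reading of this selection protocol can be sketched with toy embeddings (the vectors and helper names below are illustrative, not from the dataset):

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_pair(emb, candidates):
    """Select the candidate word-pair whose difference vector is most
    similar to the he-she gender direction (our reading of the SemBias
    selection protocol)."""
    v_g = emb["he"] - emb["she"]
    return max(candidates, key=lambda p: cos(emb[p[0]] - emb[p[1]], v_g))

# Toy 2-d embeddings: axis 0 carries gender.
emb = {"he": np.array([1.0, 0.0]), "she": np.array([-1.0, 0.0]),
       "waiter": np.array([0.9, 0.5]), "waitress": np.array([-0.9, 0.5]),
       "dog": np.array([0.1, 1.0]), "cat": np.array([0.0, 0.9])}
answer = pick_pair(emb, [("waiter", "waitress"), ("dog", "cat")])
```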
From Table 1, we see that the best performances (highest accuracy on Definition and lowest accuracy on Stereotype) are reported by GP (GN-GloVe), which is the application of the proposed method to debias word embeddings learnt by GN-GloVe. In particular, on both SemBias and SemBias-subset, GP (GN-GloVe) statistically significantly outperforms GloVe and Hard-GloVe according to Clopper-Pearson confidence intervals Clopper and Pearson (1934). Although GN-GloVe obtains high performance on SemBias, it does not generalise well to SemBias-subset. However, by applying the proposed method, we can further remove residual gender biases from GN-GloVe, which shows that the proposed method can be applied in conjunction with GN-GloVe. We see that GloVe contains a high percentage of stereotypical gender biases, which justifies the need for debiasing methods. By applying the proposed method to GloVe (GP (GloVe)), we can decrease the gender biases in GloVe, while preserving useful gender-related information for detecting definitional word-pairs. Comparing the corresponding AE and GP versions of GloVe and GN-GloVe, we see that autoencoding alone is insufficient to consistently preserve gender-related information.
It is important that the debiasing process removes only gender biases and preserves other information unrelated to gender biases in the original word embeddings. If too much information is removed during the debiasing process, then the debiased embeddings might not carry adequate information for the downstream NLP tasks that use them.
To evaluate the semantic accuracy of the debiased word embeddings, following prior work on debiasing Bolukbasi et al. (2016); Zhao et al. (2018a), we use them in two popular tasks: semantic similarity measurement and analogy detection. We recall that we do not propose novel word embedding learning methods in this paper, and what is important here is whether the debiasing process preserves as much information as possible in the original word embeddings.
In analogy detection, given three words a, b and c, we must predict a word d that completes the analogy “a is to b as c is to d”. We use the CosAdd method Levy and Goldberg (2014), which finds the word d that has the maximum cosine similarity with b − a + c. We use the semantic (sem) and syntactic (syn) analogies in the Google analogy dataset Mikolov et al. (2013b) (19,556 questions in total), the MSR dataset (7,999 syntactic questions) Mikolov et al. (2013d) and the SemEval dataset (SE, 79 paradigms) Jurgens et al. (2012) as benchmark datasets. The percentage of correctly solved analogy questions is reported in Table 2. We see that there is no significant degradation of performance due to debiasing using the proposed method.
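The CosAdd scoring can be sketched as follows with a toy vocabulary (the embeddings are illustrative):

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cos_add(emb, a, b, c):
    """CosAdd (Levy and Goldberg, 2014): return the word d maximising
    cos(d, b - a + c), excluding the three query words."""
    target = emb[b] - emb[a] + emb[c]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(emb[w], target))

# Toy 2-d embeddings: axis 0 carries gender, axis 1 carries royalty.
emb = {"man":   np.array([1.0, 0.0]), "woman": np.array([-1.0, 0.0]),
       "king":  np.array([1.0, 1.0]), "queen": np.array([-1.0, 1.0]),
       "apple": np.array([0.0, -1.0])}
d = cos_add(emb, "man", "woman", "king")  # "man is to woman as king is to ?"
```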
The correlation between the human ratings and similarity scores computed using word embeddings for pairs of words has been used as a measure of the quality of the word embeddings Mikolov et al. (2013d). We compute cosine similarity between word embeddings and measure Spearman correlation against human ratings for the word-pairs in the following benchmark datasets: Word Similarity 353 dataset (WS) Finkelstein et al. (2001), Rubenstein-Goodenough dataset (RG) Rubenstein and Goodenough (1965), MTurk Halawi et al. (2012), rare words dataset (RW) Luong et al. (2013), MEN dataset Bruni et al. (2012) and SimLex dataset Hill et al. (2015).
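This evaluation can be sketched with a tie-free Spearman correlation computed from ranks (the scores and ratings below are toy values, not from the benchmarks):

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation, computed as the Pearson correlation of
    the ranks (assumes no ties, which suffices for this sketch)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

# Toy word-pair similarity scores: embedding cosines vs. human ratings.
cosine_scores = [0.9, 0.7, 0.4, 0.1]
human_ratings = [9.0, 8.0, 3.0, 1.0]
rho = spearman(cosine_scores, human_ratings)  # perfectly rank-correlated
```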
Unfortunately, existing benchmark datasets for semantic similarity were not created with gender-biases in mind and contain many stereotypical examples. For example, in MEN, the word sexy has high human similarity ratings with lady and girl compared to man and guy. Furthermore, the masculine word soldier is included in multiple datasets with high human similarity ratings, whereas it is not compared with feminine words in any of the datasets. Although prior work studying gender bias has used these datasets for evaluation purposes Bolukbasi et al. (2016); Zhao et al. (2018a), we note that high correlation with human ratings can be achieved even with biased word embeddings.
To address this issue, we balance the original datasets with respect to gender by including extra word pairs generated from the opposite sex with the same human ratings. For instance, if the word-pair (baby, mother) exists in the dataset, we add a new pair (baby, father) to the dataset. Ideally, we should re-annotate this balanced version of the dataset to obtain human similarity ratings. However, such a re-annotation exercise would be costly and inconsistent with the original ratings. Therefore, we resort to a proxy where we reassign the human rating for the original word-pair to its derived opposite gender version. Table 3 shows the number of word-pairs in the original (Orig) and balanced (Bal) similarity benchmarks.
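The balancing procedure can be sketched as follows; the swap list here is a small illustrative subset, not the full list used to build the balanced benchmarks:

```python
# Gender-balancing a similarity dataset: for every pair containing a
# gendered word, add the opposite-gender pair with the same rating.
SWAP = {"mother": "father", "father": "mother",
        "girl": "boy", "boy": "girl",
        "man": "woman", "woman": "man"}

def balance(pairs):
    """pairs: list of ((w1, w2), rating). Returns the original pairs plus
    their opposite-gender versions, reusing the original human ratings."""
    out = list(pairs)
    for (w1, w2), rating in pairs:
        swapped = (SWAP.get(w1, w1), SWAP.get(w2, w2))
        if swapped != (w1, w2) and swapped not in {p for p, _ in out}:
            out.append((swapped, rating))
    return out

data = [(("baby", "mother"), 8.5), (("cup", "lid"), 6.0)]
balanced = balance(data)  # adds ("baby", "father") with rating 8.5
```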
As shown in Table 4, GP (GloVe) and GP (GN-GloVe) obtain the best performance on the balanced versions of all benchmark datasets. Moreover, the performance of GP (GloVe) on both the original and balanced datasets is comparable to that of GloVe, which indicates that the information encoded in the GloVe embeddings is preserved in the debiased embeddings, while stereotypical gender biases are removed. The autoencoded versions report performance similar to the original input embeddings.
Overall, the results on the analogy detection and semantic similarity measurement tasks show that our proposed method removes only gender-biases and preserves other useful gender-related information.
To visualise the effect of debiasing on different word categories, we compute the cosine similarity between the gender directional vector v_g and selected gender-oriented (female or male), gender-neutral and stereotypical words. In Figure 1, the horizontal axes show the cosine similarity with the gender directional vector (positive scores for masculine words), and the words are alphabetically sorted within each category.
From Figure 1, we see that the original GloVe embeddings show a similar spread of cosine similarity scores for gender-oriented as well as stereotypical words. When debiased by hard-debiasing (Hard-GloVe) and GN-GloVe, stereotypical and gender-neutral words have their gender similarity scores reduced by similar amounts. Interestingly, Hard-GloVe shifts even gender-oriented words towards the masculine direction. On the other hand, GP (GloVe) decreases the gender bias of the stereotypical words, while largely preserving the scores of gender-neutral and gender-oriented words as in GloVe.
Considering that a significant number of words in English are gender-neutral, it is essential that debiasing methods do not adversely change their orientation. In particular, the proposed method's ability to debias stereotypical words that carry unfair gender-biases, while preserving the gender-orientation of feminine, masculine and neutral words, is important when applying the debiased word embeddings in NLP applications that depend on word embeddings for representing the input texts.
We proposed a method to remove gender-specific biases from pre-trained word embeddings. Experimental results on multiple benchmark datasets demonstrate that the proposed method can accurately debias pre-trained word embeddings, outperforming previously proposed debiasing methods, while preserving useful semantic information. In future, we plan to extend the proposed method to debias other types of demographic biases such as ethnic, age or religious biases.