
Towards Understanding Gender Bias in Relation Extraction

by Andrew Gaut, et al.

Recent developments in Neural Relation Extraction (NRE) have made significant strides towards Automated Knowledge Base Construction (AKBC). While much attention has been dedicated towards improvements in accuracy, there have been no attempts in the literature to our knowledge to evaluate social biases in NRE systems. We create WikiGenderBias, a distantly supervised dataset with a human annotated test set. WikiGenderBias has sentences specifically curated to analyze gender bias in relation extraction systems. We use WikiGenderBias to evaluate systems for bias and find that NRE systems exhibit gender biased predictions and lay groundwork for future evaluation of bias in NRE. We also analyze how name anonymization, hard debiasing for word embeddings, and counterfactual data augmentation affect gender bias in predictions and performance.



1 Introduction

* Equal Contribution.

With the wealth of information being posted online daily, Relation Extraction (RE) has become increasingly important. RE aims specifically to extract relations from raw sentences and represent them as succinct relation tuples of the form (head, relation, tail). An example is (Barack Obama, spouse, Michelle Obama).

The concise representations provided by RE models have been used to extend Knowledge Bases (KBs) Subasic et al. (2019); Trisedya et al. (2019). These KBs are then used heavily in NLP systems, such as Task-Based Dialogue Systems. In recent years, much focus in the NRE community has been centered on improvements in model precision and the reduction of noise Lin et al. (2016); Liu et al. (2017); Wu et al. (2017); Feng et al. (2018); Vashishth et al. (2018). Yet, little attention has been devoted towards the fairness of such systems.

In this paper, we take the first step at understanding and evaluating gender bias in NRE systems. We analyze gender bias by measuring the differences in model performance when extracting relations from sentences written about females versus sentences written about males. Significant discrepancies in performance between genders could diminish the fairness of systems and distort outcomes in applications that use them. For example, if a model predicts the occupation relation with higher recall for male entities, this could lead to KBs having more occupation information for males. Downstream search tasks using that KB could produce biased predictions, such as ranking articles about female computer scientists below articles about their male peers.

We provide the first evaluation of social bias in NRE models; specifically, we evaluate gender bias in English language predictions of a collection of popularly used and open-source NRE models Lin et al. (2016); Wu et al. (2017); Liu et al. (2017); Feng et al. (2018). We evaluate OpenNRE on two fronts: (1) examining Equality of Opportunity Hardt et al. (2016) when OpenNRE is trained on an unmodified dataset and (2) examining the effect that various debiasing options Bolukbasi et al. (2016); Rudinger et al. (2018); Zhao et al. (2018a); Lu et al. (2018); Kiritchenko and Mohammad (2018) have on both absolute F1 score and the difference in F1 scores on male and female datapoints.

However, carrying out such an evaluation is difficult with existing NRE datasets, such as the NYT dataset from Riedel et al. (2010), because there is no reliable way to obtain gender information about the entities. Thus, we create a new dataset specifically aimed at evaluating gender bias for NRE, just as prior work has done for other tasks like Coreference Resolution Zhao et al. (2018b); Rudinger et al. (2018). We call our dataset WikiGenderBias and make it publicly available. Our contributions are as follows:

  • WikiGenderBias is the first dataset aimed at training and evaluating NRE systems for gender bias. It contains ground truth labels for the test set and about 45,000 sentences in total.

  • We provide the first evaluation of NRE systems for gender bias and find that they exhibit gender bias.

  • We demonstrate that using both gender-swapping and debiased embeddings effectively mitigates bias in the model’s predictions and that using gender-swapping improves the model’s performance when the training data contains contextual biases.

2 Related Work

The study of gender bias in NLP is still nascent; gender bias has not been studied in many NLP tasks. Typically, prior work first observes the gender bias, then attempts to mitigate it Sun et al. (2019). In this paper, we undertake that first step of observation for the task of RE.

Since a form of measurement is required for observation, prior work has created methods for measuring gender bias Zhao et al. (2017); Rudinger et al. (2018); Zhao et al. (2018a); Dixon et al. (2018); Lu et al. (2018); Kiritchenko and Mohammad (2018); Romanov et al. (2019). Gender bias has been measured mainly in training sets and in predictions. Measuring the latter is simple: measure the difference in performance of the model on male and female datapoints, with the definition of the gender of a datapoint being domain-dependent Lu et al. (2018); Kiritchenko and Mohammad (2018). Other metrics have been proposed to evaluate fairness of predictors and allocative bias Dwork et al. (2012); Hardt et al. (2016), such as Equality of Opportunity. We use both of these methods to evaluate NRE models.

After discovering that gender bias exists, prior work has developed methods to mitigate that bias. Debiasing methods can debias the training set, word embeddings, or the prediction or training algorithms. In the case of training set or training algorithm debiasing, the model must be retrained. We use two training set debiasing methods (Counterfactual Data Augmentation Zhao et al. (2018a) and Name Anonymization Zhao et al. (2018a)) and a word embedding debiasing method (Hard Debiasing Bolukbasi et al. (2016)) and analyze their effect on bias in predictions of NRE models.

In RE, using supervised machine learning models has become popular. Training data for these models is typically obtained using Distant Supervision or a variation: for a given relation (e1, r, e2) in a KB, assume any sentence that contains both e1 and e2 expresses r Mintz et al. (2009). Many NRE models focus on mitigating the effects of noise in the training data introduced by Distant Supervision to increase performance Hoffmann et al. (2011); Surdeanu et al. (2012); Lin et al. (2016); Liu et al. (2017); Feng et al. (2018). Recent work uses KBs to further increase NRE performance Vashishth et al. (2018); Han et al. (2018). Despite these significant efforts towards improving NRE performance, there are no studies on bias or ethics in NRE to our knowledge. We provide such a study.

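| Dataset | Entity Pairs (Male) | Entity Pairs (Female) | Instances (Male) | Instances (Female) |
|---|---|---|---|---|
| Train | 15075 | 5543 | 27048 | 9391 |
| Dev | 1970 | 670 | 3416 | 1144 |
| Test | 1275 | 1340 | 2320 | 2284 |
| Total | 18320 | 7553 | 32784 | 12819 |

Table 1: WikiGenderBias’s Dataset Splits. Entity Pairs means distinct pairs (e1, e2) such that (e1, r, e2) is a relation in WikiGenderBias. Instances are the total number of (e1, r, e2, s) tuples in WikiGenderBias, where s is a distantly supervised sentence.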

3 WikiGenderBias

To evaluate gender bias in RE models, we need some measure of how gender affects their predictions, which in turn requires a way to identify gender in test instances. Current datasets for RE lack gender information for entities, and obtaining it for those datasets could be costly or impossible. Thus, we elected to create WikiGenderBias with this gender information. Specifically, we wanted to measure how predictions differed on sentences from Wikipedia articles about male entities versus those about female entities. Since most data about an entity in a KB is generated from that entity’s page, if an NRE model performed better on male articles, then male entities would likely have more information in a KB. This bias could propagate to downstream predictions for models using the KB, which is why evaluating performance differences on articles about entities of different genders is useful. WikiGenderBias’s splits are given in Table 1.

3.1 Dataset Creation

To generate WikiGenderBias, we use a variant of the Distant Supervision assumption: for a given relation between two entities, assume that any sentence from an article written about one of those entities that mentions the other entity expresses the relation. For instance, if we know (Barack, spouse, Michelle) is a relation and we find the sentence "He and Michelle were married" in Barack’s Wikipedia article, then we assume that sentence expresses the (Barack, spouse, Michelle) relation. This assumption is similar to that made by Mintz et al. (2009) and allows us to scalably create the dataset.
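The labeling rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `Relation` type, the `distant_label` function, the dictionary layout of `articles`, and the naive substring match for entity mentions are all assumptions made for the example.

```python
from typing import Dict, List, NamedTuple, Tuple


class Relation(NamedTuple):
    head: str
    relation: str
    tail: str


def distant_label(
    articles: Dict[str, List[str]], kb_relations: List[Relation]
) -> List[Tuple[Relation, str]]:
    """Pair each KB relation with sentences from the head entity's article
    that mention the tail entity (hypothetical helper)."""
    instances = []
    for rel in kb_relations:
        for sentence in articles.get(rel.head, []):
            # Naive substring match stands in for real entity linking.
            if rel.tail in sentence:
                instances.append((rel, sentence))
    return instances


articles = {
    "Barack": ["He and Michelle were married.", "He was born in Honolulu."]
}
kb = [Relation("Barack", "spouse", "Michelle")]
labeled = distant_label(articles, kb)
```

Only the first sentence is labeled with the spouse relation; the second is skipped because it does not mention the tail entity. A sentence that mentions the tail without expressing the relation would still be labeled, which is the source of the noise discussed in Section 3.2.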

We use Wikipedia because many entities on Wikipedia have gender information and because Wikipedia contains articles written about these entities. This, combined with relation information about these entities obtainable from DBPedia, Wikipedia’s KB, allowed us to create WikiGenderBias using our variant of the Distant Supervision assumption.

In WikiGenderBias, we use four relations: spouse, hypernym, birthDate, and birthPlace. We chose from a given set of relations stored in DBPedia. We hypothesized that models might use gender as a proxy to influence predictions for spouse and hypernym relations, since words pertaining to marriage are more often mentioned in female articles and words pertaining to hypernym (which is similar to occupation) are more often mentioned in articles about males Wagner et al. (2015); Graells-Garrido et al. (2015). We hypothesized that birthDate and birthPlace would operate like control groups and believed gender would correlate with neither relation. We also generate negative examples for these four relations by obtaining datapoints for three unrelated relations: parents, deathDate, and almaMater.

We use entities for which we could obtain data for all four relations. We set up our experiment such that head entities are not repeated across the train, dev, and test sets so that the model will see only new head entities at testing time. Since we obtain the distantly supervised sentences for a relation from the head entity’s article, this guarantees the model will not see sentences from the same article across datasets. However, it is possible that a head entity will appear as a tail entity in other relations, so entities could appear in multiple datasets.
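A split that keeps head entities disjoint can be sketched as follows. The function name `split_by_head`, the dictionary-based instance format, and the split fractions are assumptions for illustration; the paper does not state how the partition was implemented.

```python
import random
from typing import Dict, List


def split_by_head(
    instances: List[dict], dev_frac: float = 0.1, test_frac: float = 0.1, seed: int = 0
) -> Dict[str, List[dict]]:
    """Assign whole head entities (not individual sentences) to splits,
    so no head entity appears in more than one split."""
    heads = sorted({inst["head"] for inst in instances})
    random.Random(seed).shuffle(heads)
    n_test = max(1, int(len(heads) * test_frac))
    n_dev = max(1, int(len(heads) * dev_frac))
    test_heads = set(heads[:n_test])
    dev_heads = set(heads[n_test:n_test + n_dev])

    splits: Dict[str, List[dict]] = {"train": [], "dev": [], "test": []}
    for inst in instances:
        if inst["head"] in test_heads:
            splits["test"].append(inst)
        elif inst["head"] in dev_heads:
            splits["dev"].append(inst)
        else:
            splits["train"].append(inst)
    return splits


# Toy data: 10 head entities with 2 sentences each.
data = [{"head": f"h{i}", "sentence": f"s{i}_{j}"} for i in range(10) for j in range(2)]
splits = split_by_head(data)
```

Because the assignment is made per head entity, all sentences drawn from one article land in the same split. Tail entities are not constrained, which matches the caveat above.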

WikiGenderBias’s gender splits are given in Table 1. We first train OpenNRE on the raw, gender-imbalanced training data to reflect model performance without modification. We then introduce bias mitigation methods such as gender-swapping, name anonymization, and hard debiasing from prior work to evaluate the tradeoff between model performance and gender parity.

3.2 Test Sets

We partition the test set into two subsets: one with sentences from female articles, and one with sentences from male articles (see Table 1). We collect data using our variant of the distant supervision assumption (see Section 3.1). However, as noted earlier, some sentences can be noisy. Evaluating models on noisy data is unfair since a model could be penalized for correctly predicting the relation is not expressed in the sentence. Thus, we had to obtain ground truth labels.

To find the ground truth, we collected annotations from AMT workers. We asked these workers to determine whether or not a given sentence expressed a given relation. If the majority answer was no, then we labeled that sentence as expressing no relation. (We denote no relation as NA in WikiGenderBias.) Each sentence was annotated by three different workers, and each worker was paid 15 cents per annotation. We only accepted workers from England, the US, or Australia who met minimum thresholds for HIT Approval Rate and Number of HITs. We found the inter-annotator agreement as measured by Fleiss’ Kappa Fleiss (1971) to be 0.44, which is consistent across both genders and signals moderate agreement. We note that our value is affected by asking workers to make binary classifications, which limits the degree of agreement that is attainable above chance. We also found the pairwise inter-annotator agreement to be 84%.
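The aggregation and agreement computations described above can be sketched as follows. The function names and the toy vote counts are illustrative assumptions; `fleiss_kappa` follows the standard definition from Fleiss (1971) for a fixed number of raters per item.

```python
from typing import List


def majority_label(votes: List[bool], relation: str) -> str:
    """Keep the distantly supervised relation only if most workers said
    the sentence expresses it; otherwise label the sentence NA."""
    return relation if sum(votes) > len(votes) / 2 else "NA"


def fleiss_kappa(rating_counts: List[List[int]]) -> float:
    """rating_counts[i][j] = number of raters assigning item i to category j.
    Every item must have the same number of raters."""
    n_items = len(rating_counts)
    n_raters = sum(rating_counts[0])
    n_cats = len(rating_counts[0])
    # Proportion of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in rating_counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Per-item agreement among rater pairs.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in rating_counts
    ]
    p_bar = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)


# Three sentences, three workers each; columns are (yes, no) counts.
counts = [[3, 0], [0, 3], [2, 1]]
kappa = fleiss_kappa(counts)
label = majority_label([True, True, False], "spouse")
```

With only two categories, chance agreement is already high, which is why kappa can be moderate even when raw pairwise agreement is high.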

Figure 1: Proportion of sentences corresponding to a given relation over total sentences extracted to WikiGenderBias for each entity. This demonstrates that, of the entities in WikiGenderBias, there are many more sentences expressing the spouse relation for females than males.

3.3 Further Analysis

In our creation of WikiGenderBias, we performed some statistical analysis on the Wikipedia data we obtained. We build on the work of Graells-Garrido et al. (2015), who discover that a higher proportion of Wikipedia Infoboxes on Wikipedia pages of female entities have spouse information than Wikipedia Infoboxes on Wikipedia pages of male entities. However, Figure 1 demonstrates a further discrepancy: that amongst articles for females and males which contain spouse information, articles written about females mention females’ spouses far more often than articles written about men. Additionally, we show that amongst female and male articles we sampled, hypernyms are mentioned far more often in male than female articles.

That female articles mention the females’ spouses more often than male articles indicates gender bias in Wikipedia’s composition; authors do not write about the two genders equally.

| Encoder | Selector | Spouse M | Spouse F | Spouse Diff | Birth Date M | Birth Date F | Birth Date Diff | Birth Place M | Birth Place F | Birth Place Diff | Hypernym M | Hypernym F | Hypernym Diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PCNN | ATT | .898 | .836 | .062 | .804 | .854 | -.050 | .776 | .690 | .086 | .886 | .892 | -.006 |
| PCNN | AVE | .914 | .866 | .048 | .843 | .869 | -.026 | .863 | .776 | .087 | .843 | .888 | -.045 |
| CNN | ATT | .902 | .832 | .070 | .694 | .784 | -.090 | .831 | .716 | .115 | .875 | .892 | -.017 |
| CNN | AVE | .906 | .869 | .037 | .843 | .877 | -.034 | .820 | .735 | .085 | .855 | .888 | -.033 |
| RNN | ATT | .890 | .836 | .054 | .827 | .866 | -.039 | .753 | .705 | .048 | .835 | .862 | -.027 |
| RNN | AVE | .925 | .888 | .037 | .729 | .780 | -.051 | .875 | .821 | .054 | .847 | .892 | -.045 |
| BIRNN | ATT | .898 | .858 | .040 | .820 | .817 | .003 | .722 | .672 | .050 | .859 | .892 | -.033 |
| BIRNN | AVE | .882 | .881 | .001 | .780 | .799 | -.019 | .878 | .817 | .061 | .839 | .881 | -.042 |

Table 2: Equality of Opportunity results from running combinations of encoders and selectors of the OpenNRE model for the male (M) and female (F) genders of each relation. A positive difference means a higher prediction recall for male entities. A predictor would satisfy Equality of Opportunity if and only if the difference were 0, and a fair predictor should have close to 0 difference.

4 Experimental Setup

We evaluate NRE models from a popular open-source code repository called OpenNRE Han et al. (2019). OpenNRE models combine methods including usage of selective attention to add weight to sentences with relevant information Lin et al. (2016) as well as methods to reduce noise at an entity-pair level Lin et al. (2016) and innovations in adversarial training of NRE models Wu et al. (2017). OpenNRE allows users to choose a selector (Attention or Average) and an encoder (PCNN, CNN, RNN, or Bi-RNN) for each model. Each of these models requires word embeddings to create distributed representations of sentences. It should be noted that a PCNN is simply a CNN which has a piecewise max-pooling operation, where the sentence is split into three sections based on the positions of the head and tail entities Zeng et al. (2015).
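To make the piecewise max-pooling step concrete, here is a small pure-Python sketch. It assumes the convolution has already produced one feature vector per token; the function name and the exact segment boundaries (each entity token closes its segment) are assumptions for illustration and may differ from the OpenNRE implementation.

```python
from typing import List


def piecewise_max_pool(
    features: List[List[float]], head_pos: int, tail_pos: int
) -> List[float]:
    """Max-pool each channel separately over three segments of the sentence,
    split at the two entity positions, then concatenate the results."""
    first, second = sorted((head_pos, tail_pos))
    segments = [
        features[: first + 1],             # start .. first entity
        features[first + 1 : second + 1],  # after first entity .. second entity
        features[second + 1 :],            # after second entity .. end
    ]
    n_channels = len(features[0])
    pooled = []
    for segment in segments:
        for channel in range(n_channels):
            # Empty segments (entity at sentence end) pool to 0.0.
            pooled.append(max((tok[channel] for tok in segment), default=0.0))
    return pooled


# Five tokens, two channels; entities at token positions 1 and 3.
feats = [[1.0, 0.0], [3.0, 2.0], [0.0, 5.0], [4.0, 1.0], [2.0, 2.0]]
pooled = piecewise_max_pool(feats, head_pos=1, tail_pos=3)
```

A plain CNN would pool each channel over the whole sentence and return two values here; the piecewise variant returns six, keeping coarse information about where features occur relative to the entities.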

As mentioned in Section 1, we evaluate the OpenNRE models using Equality of Opportunity and differences in performance between genders.

4.1 Parameters for Equality of Opportunity

We train every encoder-selector combination on the WikiGenderBias training set, use Word2Vec embeddings Mikolov et al. (2013) also trained on WikiGenderBias, and test each combination on the WikiGenderBias test set. We performed grid search to determine the optimal hyperparameters, including the batch size and the sliding window size (for CNN and PCNN).

We also utilize Equality of Opportunity (EOP) Hardt et al. (2016). A predictor $\hat{Y}$ satisfies EOP with respect to a protected attribute $A$ and true label $Y$ if $P(\hat{Y}=1 \mid A=a, Y=1)$ is equal for all values $a$ of $A$. In our case $A \in \{\text{male}, \text{female}\}$, because gender is our protected attribute and we assume it to be binary. We evaluate EOP on a per-relation, one-versus-rest basis. Thus, we calculate one EOP where spouse is the positive class and all other classes are negative; in this case, $Y=1$ corresponds to the true label being spouse and $Y=0$ corresponds to the true label being hypernym, birthDate, birthPlace, or NA. We then do another calculation for each relation where $Y=1$ corresponds to that relation being expressed and $Y=0$ corresponds to any other relation being expressed. Note that this is equivalent to measuring per-relation recall for each gender.
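Since the one-versus-rest EOP check reduces to comparing per-relation recall across genders, it can be sketched directly. The tuple format of the examples and the function names are assumptions for illustration; the toy predictions are not results from the paper.

```python
from typing import Dict, Iterable, Tuple

Example = Tuple[str, str, str]  # (gender, true_label, predicted_label)


def recall_by_gender(examples: Iterable[Example], relation: str) -> Dict[str, float]:
    """Recall for `relation` (treated as the positive class), per gender."""
    hits = {"male": 0, "female": 0}
    totals = {"male": 0, "female": 0}
    for gender, true_label, predicted in examples:
        if true_label == relation:
            totals[gender] += 1
            hits[gender] += int(predicted == relation)
    return {g: hits[g] / totals[g] for g in totals if totals[g] > 0}


def eop_gap(examples: Iterable[Example], relation: str) -> float:
    """Male recall minus female recall; 0 means EOP is satisfied."""
    recalls = recall_by_gender(list(examples), relation)
    return recalls["male"] - recalls["female"]


toy = (
    [("male", "spouse", "spouse")] * 3 + [("male", "spouse", "NA")]
    + [("female", "spouse", "spouse")] * 2 + [("female", "spouse", "NA")] * 2
    + [("female", "birthDate", "spouse")]  # ignored: true label is not spouse
)
gap = eop_gap(toy, "spouse")
```

A positive gap corresponds to the sign convention used in Table 2, where a positive difference means higher recall for male entities.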

4.2 Bias Mitigation Methods

We then evaluate the PCNN with Attention model using the debiasing methods described below.

The contexts in which males and females are written about can differ; for instance, on Wikipedia women are more often written about with words related to sexuality than men Graells-Garrido et al. (2015). Counterfactual Data Augmentation (CDA) mitigates these contextual biases. CDA consists of replacing masculine words in a sentence with their corresponding feminine words and vice versa for all sentences in a corpus, then training on the union of the original and augmented corpora; we perform the gender-swapping using a fixed list of word pairs. This equalizes the contexts for feminine and masculine words; if previously 100 doctors were referred to as "he" and 50 as "she", in the new training set "he" and "she" will refer to doctor 150 times each.
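A minimal sketch of CDA follows. The word-pair list here is a tiny hypothetical stand-in for the list the authors used, and it deliberately omits ambiguous words such as "her" (which can map to "his" or "him"); a real implementation needs a fuller list and a policy for such cases.

```python
import re
from typing import List

# Hypothetical, deliberately small list of (masculine, feminine) pairs.
PAIRS = [("he", "she"), ("man", "woman"), ("husband", "wife"), ("father", "mother")]
SWAP = {}
for masculine, feminine in PAIRS:
    SWAP[masculine] = feminine
    SWAP[feminine] = masculine


def gender_swap(sentence: str) -> str:
    """Replace each listed gendered word with its counterpart,
    preserving an initial capital letter."""
    def replace(match: re.Match) -> str:
        word = match.group(0)
        swapped = SWAP.get(word.lower())
        if swapped is None:
            return word  # leave all other words untouched
        return swapped.capitalize() if word[0].isupper() else swapped

    return re.sub(r"[A-Za-z]+", replace, sentence)


def augment(corpus: List[str]) -> List[str]:
    """Union of the original corpus and its gender-swapped copy."""
    return corpus + [gender_swap(s) for s in corpus]


corpus = ["He is a doctor.", "She is a nurse."]
augmented = augment(corpus)
```

After augmentation, each occupation word co-occurs with "he" and "she" equally often, which is the equalizing effect described above.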

Sometimes, models use entity names as a proxy for gender; if a model associates females with politician and John with males, then it might be less likely to predict that "John is a politician" expresses (John, hypernym, politician) than it would if it associated John with females. Name Anonymization (NA) mitigates this. NA consists of finding all person entities with a Named Entity Recognition system Finkel et al. (2005), then replacing the names of these entities with corresponding anonymizations. For instance, the earlier example might become "E1 is a politician", thereby preventing the model from using names as a proxy for gender.
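The replacement step can be sketched as below. To keep the example self-contained, the person names are passed in directly; in the described setup they would come from a Named Entity Recognition system. The `anonymize` function and the `E1`, `E2`, ... numbering scheme beyond the single `E1` example in the text are assumptions.

```python
import re
from typing import List


def anonymize(sentence: str, person_names: List[str]) -> str:
    """Replace each person name with a placeholder E1, E2, ...
    numbered in the order the names are given."""
    placeholders = {name: f"E{i}" for i, name in enumerate(person_names, start=1)}
    # Replace longer names first so "John Smith" is not split by "John".
    for name in sorted(person_names, key=len, reverse=True):
        sentence = re.sub(rf"\b{re.escape(name)}\b", placeholders[name], sentence)
    return sentence


anonymized = anonymize("John is a politician", ["John"])
```

The placeholder carries no gender signal, so any remaining gender cue must come from the sentence context (for example pronouns), which the other two methods target.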

Word embeddings can encode gender biases Bolukbasi et al. (2016); Caliskan et al. (2017); Garg et al. (2018) and this can affect bias in downstream predictions for models using the embeddings Zhao et al. (2018a). Hard-Debiasing mitigates gender bias in embeddings. Hard-Debiasing involves finding a direction representing gender in the vector space, then removing the component on that direction for all gender-neutral words, then equalizing the distance from that direction for all (masculine, feminine) word pairs Bolukbasi et al. (2016). We applied hard-debiasing to Word2Vec embeddings Mikolov et al. (2013) we trained on the sentences in WikiGenderBias. Every time we applied CDA or NA or some combination of the two, we trained a new embedding model on that debiased dataset as well.
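The neutralize and equalize steps can be illustrated on toy vectors. This sketch makes two simplifying assumptions: the gender direction is taken from a single (he, she) pair rather than estimated from many pairs, and all vectors are unit length. The three-dimensional vectors are invented for the example.

```python
import math
from typing import List, Tuple

Vec = List[float]


def dot(u: Vec, v: Vec) -> float:
    return sum(a * b for a, b in zip(u, v))


def add(u: Vec, v: Vec) -> Vec:
    return [a + b for a, b in zip(u, v)]


def sub(u: Vec, v: Vec) -> Vec:
    return [a - b for a, b in zip(u, v)]


def scale(u: Vec, c: float) -> Vec:
    return [a * c for a in u]


def unit(u: Vec) -> Vec:
    return scale(u, 1.0 / math.sqrt(dot(u, u)))


def neutralize(word: Vec, g: Vec) -> Vec:
    """Remove the component along the gender direction g, then renormalize."""
    return unit(sub(word, scale(g, dot(word, g))))


def equalize(a: Vec, b: Vec, g: Vec) -> Tuple[Vec, Vec]:
    """Make a gendered pair differ only along g, symmetric about their mean."""
    mu = scale(add(a, b), 0.5)
    mu_g = scale(g, dot(mu, g))          # part of the mean along g
    nu = sub(mu, mu_g)                   # part of the mean orthogonal to g
    coef = math.sqrt(max(0.0, 1.0 - dot(nu, nu)))
    out = []
    for w in (a, b):
        w_g = scale(g, dot(w, g))
        out.append(add(nu, scale(unit(sub(w_g, mu_g)), coef)))
    return out[0], out[1]


he = unit([1.0, 0.2, 0.1])
she = unit([-1.0, 0.3, 0.1])
doctor = unit([0.3, 0.5, 0.8])
g = unit(sub(he, she))                   # single-pair gender direction

doctor_n = neutralize(doctor, g)
he_e, she_e = equalize(he, she, g)
```

After both steps, the neutral word has no component along the gender direction and is equally similar to both members of the equalized pair.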

Figure 2: Trade-off between relation extraction model performance as measured by aggregate F1 score over both male and female genders (left) and F1 score gender gap (right). This is evaluated on the model with No Debiasing (ND) and three bias mitigation methods: Name Anonymization (NA), Debiased Embeddings (DE) and Gender Swapping (GS). An ideal algorithm maximizes aggregate F1 score while minimizing the gender gap.

As mentioned in Section 2, gender bias can be measured as the difference in a performance metric for a model when evaluated on male and female datapoints. We use this measure to evaluate the effect these methods have on NRE models. We define male (female) datapoints to be relations for which the head entity is male (female), which means the distantly supervised sentence is taken from a male (female) article. Prior work has used area under the precision-recall curve and F1 score to measure NRE model performance Gupta et al. (2019); Han et al. (2019); Kuang et al. (2019); following prior work, we use F1 score as our performance metric.
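The gap measure can be sketched as below. The paper does not spell out the averaging used for F1, so this example assumes a per-relation, one-versus-rest F1; the function names and toy predictions are illustrative only.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]  # (true_label, predicted_label)


def f1_for_relation(examples: Iterable[Pair], relation: str) -> float:
    """One-versus-rest F1 with `relation` as the positive class."""
    tp = fp = fn = 0
    for true_label, predicted in examples:
        if predicted == relation and true_label == relation:
            tp += 1
        elif predicted == relation:
            fp += 1
        elif true_label == relation:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f1_gap(male: Iterable[Pair], female: Iterable[Pair], relation: str) -> float:
    """F1 on male datapoints minus F1 on female datapoints."""
    return f1_for_relation(list(male), relation) - f1_for_relation(list(female), relation)


male = [("spouse", "spouse"), ("spouse", "spouse"), ("NA", "spouse"), ("spouse", "NA")]
female = [("spouse", "spouse"), ("spouse", "NA"), ("spouse", "NA"), ("NA", "NA")]
gap = f1_gap(male, female, "spouse")
```

Unlike the recall-only EOP check, this gap also reflects false positives, so the two measures can rank models differently.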

4.3 Evaluation of Equality of Opportunity

OpenNRE models do not satisfy Equality of Opportunity, although they get close (see Table 2). Predictions on birthPlace satisfy Equality of Opportunity the least in all cases except RNN with Attention, where predictions on spouse were the most biased. Notably, Bi-RNN with Average almost perfectly satisfies Equality of Opportunity for spouse.

We also find that Average selectors do slightly better than Attention selectors at preventing bias. For every encoder except PCNN, architectures using the Average selector exhibited significantly less gender bias than models using the same encoder and the Attention selector. In the case of Bi-RNN, the F1 gap for predictions on spouse with the Average selector was less than half the gap for the Attention selector. Average selectors do not provide as dramatic an improvement for Equality of Opportunity and actually increase gender bias for hypernym. All architectures have similar levels of bias, but by the F1 metric CNN with the Average selector seems to have mitigated bias in the spouse relation the best, while Bi-RNN with the Average selector does best by the Equality of Opportunity metric.

It is also worth noting that the average selector performed slightly better than the attention selector across the board, which is intriguing considering that the average selector is used as a baseline since it weights sentences in the training data equally for each relation.
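The difference between the two selectors can be illustrated with a small sketch. The attention scoring here is a plain dot product with a relation query vector, which is a simplification of the selective attention of Lin et al. (2016); the function names and toy vectors are assumptions.

```python
import math
from typing import List, Tuple

Vec = List[float]


def average_selector(bag: List[Vec]) -> Vec:
    """Weight every sentence encoding in the bag equally."""
    n = len(bag)
    return [sum(column) / n for column in zip(*bag)]


def attention_selector(bag: List[Vec], query: Vec) -> Tuple[Vec, List[float]]:
    """Weight sentence encodings by a softmax over their match with the query."""
    scores = [sum(x * q for x, q in zip(sent, query)) for sent in bag]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(bag[0])
    pooled = [sum(w * sent[d] for w, sent in zip(weights, bag)) for d in range(dim)]
    return pooled, weights


bag = [[1.0, 0.0], [0.0, 1.0]]           # two sentence encodings for one entity pair
avg = average_selector(bag)
att, att_weights = attention_selector(bag, query=[10.0, 0.0])
```

With a zero query, the attention weights are uniform and the two selectors coincide; with an informative query, attention concentrates on the sentences that match it.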

4.4 Evaluating Bias Mitigation Methods

F1 scores for predictions on male and female sentences differ on all relations for every encoder-selector combination, although the difference is relatively small (see the leftmost column in Figure 2). We find that predictions on spouse typically exhibit the highest difference in F1 score, as we predicted. However, surprisingly, predictions on hypernym exhibit the least gender bias, and predictions on birthPlace exhibit more significant gender bias than predictions on spouse in some cases. Predictions on birthDate exhibited very little gender bias, as predicted.

Name Anonymization, surprisingly, substantially increases the F1 score gap for the hypernym relation, but slightly decreases the F1 score gap for all other relations. Name Anonymization appears to be effective at debiasing all relations aside from hypernym, though not as effective as either Gender-Swapping or using Debiased Embeddings. These results indicate that entity bias likely does not contribute very much to the gender bias in the models’ original predictions.

Hard-Debiased Word Embeddings were also extremely effective at mitigating the difference in F1 scores for all relations. While gender-swapping did slightly better at decreasing that difference for the spouse relation, debiased embeddings mitigated bias better for the birthDate and hypernym relations. We note that using debiased embeddings increases absolute scores just like gender-swapping, though it increases them slightly less.

Gender-Swapping substantially decreases F1 score gap for the spouse relation as well as for all other relations (see Figure 2). Interestingly, the absolute F1 scores for both male and female sentences for all relations increased when Gender-Swapping was applied (see Figure 3). Thus, gender-swapping is extremely effective not only for mitigating bias but also for improving performance. This is likely due to two things: 1) Wikipedia data is rife with contextual gender bias and 2) gender-swapping successfully removes those biases. Prior work has shown that many corpora contain similar biases, including even news articles like those from Google News Bolukbasi et al. (2016). Our results show that gender-swapping may be an effective tool to combat this context bias in the domain of NRE.

Figure 3: OpenNRE results on WikiGenderBias with different input combinations to the model. The y-axis indicates the difference in F1 score between genders (male-female).

Combinations. Combining debiased embeddings and gender-swapping turned out to have the highest relative difference in F1 score between male and female sentences for spouse while also reducing bias in other relations (see Figure 3). All models which use name anonymization (Models 1-4) have significantly higher F1 score gaps for the hypernym relation. While all combinations reduced gender bias to varying extents, gender bias in the spouse relation was mitigated to a similar extent by all combinations. Surprisingly, applying gender-swapping on its own reduces gender bias as well as or better than any combination of methods.

Aggregate Results. Throughout all combinations of debiasing options, the PCNN with Attention model attains a better F1 score for the spouse relation when predicting on male sentences than on female sentences. For birthDate, the F1 score gap is far lower, as we predicted. To our surprise, the F1 score gap was lowest for hypernym, which we predicted would have a higher gap like that for spouse. Also surprisingly, the F1 gap for birthPlace was almost as high as that for spouse. While the model exhibited bias in predictions for all relations, we note that gender-swapping and debiased embeddings were able to significantly mitigate the gap in F1 scores for the model’s predictions on male and female sentences. However, while the F1 score gap for birthPlace responded strongly to debiasing methods, spouse did not respond as strongly. Gender-swapping was able to bolster the model’s absolute F1 scores as well. Thus, we note that mitigating context bias worked extremely well in this case. Name anonymization was not as effective and actually increased gender bias for hypernym; it seems removing entity bias increased the F1 score gap for hypernym. We note that the best combination for both bias mitigation and absolute model performance was using gender-swapping on its own.

5 Conclusion

In our study, we create WikiGenderBias: the largest dataset for gender bias evaluation to date across all NLP tasks to our knowledge. We train OpenNRE models on the WikiGenderBias dataset and test them on gender-separated test sets. We find a substantial difference in F1 scores for the spouse relation between predictions on male sentences and female sentences for all OpenNRE model architectures. We find that this gender bias can be substantially mitigated merely by doing pre-processing on the dataset and the word embeddings utilized by the models, and find that the best debiasing combination was gender-swapping paired with debiased embeddings. We also note that this combination significantly increases the model performance in general as well. Finally, we build on Graells-Garrido et al. (2015)’s work and find further context bias latent in Wikipedia.

While these findings will help future work avoid gender biases, this study is preliminary. We only consider binary gender, but future work should consider non-binary genders. Additionally, future work should further probe the source of gender bias in the model’s predictions, perhaps by visualizing attention or looking more closely at the model’s outputs.


References

  • Bolukbasi et al. (2016) Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man Is to Computer Programmer As Woman Is to Homemaker? Debiasing Word Embeddings. In Neural Information Processing Systems (NIPS ‘16).
  • Caliskan et al. (2017) Aylin Caliskan, Joanna J Bryson, and Arvind Narayanan. 2017. Semantics Derived Automatically from Language Corpora Contain Human-Like Biases. Science, 356(6334):183–186.
  • Dixon et al. (2018) Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and Mitigating Unintended Bias in Text Classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society (AAAI‘18), pages 67–73. ACM.
  • Dwork et al. (2012) Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. 2012. Fairness Through Awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, pages 214–226. ACM.
  • Feng et al. (2018) Jun Feng, Minlie Huang, Li Zhao, Yang Yang, and Xiaoyan Zhu. 2018. Reinforcement Learning for Relation Classification from Noisy Data. In Thirty-Second Conference on Advancement of Artificial Intelligence (AAAI ‘18).
  • Finkel et al. (2005) Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL ‘05), pages 363–370. Association for Computational Linguistics.
  • Fleiss (1971) Joseph L Fleiss. 1971. Measuring Nominal Scale Agreement Among Many Raters. Psychological bulletin, 76(5):378.
  • Garg et al. (2018) Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes. Proceedings of the National Academy of Sciences, 115(16):E3635–E3644.
  • Graells-Garrido et al. (2015) Eduardo Graells-Garrido, Mounia Lalmas, and Filippo Menczer. 2015. First Women, Second sex: Gender Bias in Wikipedia. In Proceedings of the 26th ACM Conference (ACM ‘15), pages 165–174. ACM.
  • Gupta et al. (2019) Pankaj Gupta, Subburam Rajaram, Hinrich Schütze, and Thomas Runkler. 2019. Neural Relation Extraction Within and Across Sentence Boundaries. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI ‘19), volume 33, pages 6513–6520.
  • Han et al. (2019) Xu Han, Tianyu Gao, Yuan Yao, Demin Ye, Zhiyuan Liu, and Maosong Sun. 2019. OpenNRE: An Open and Extensible Toolkit for Neural Relation Extraction. arXiv preprint arXiv:1909.13078.
  • Han et al. (2018) Xu Han, Zhiyuan Liu, and Maosong Sun. 2018. Neural Knowledge Acquisition via Mutual Attention Between Knowledge Graph and Text. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI ‘18).
  • Hardt et al. (2016) Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of Opportunity in Supervised Learning. In Advances in Neural Information Processing Systems (NIPS ‘16), pages 3315–3323.
  • Hoffmann et al. (2011) Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S Weld. 2011. Knowledge-based Weak Supervision for Information Extraction of Overlapping Relations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL ‘11), pages 541–550. Association for Computational Linguistics.
  • Kiritchenko and Mohammad (2018) Svetlana Kiritchenko and Saif M Mohammad. 2018. Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems. arXiv preprint arXiv:1805.04508.
  • Kuang et al. (2019) Jun Kuang, Yixin Cao, Jianbing Zheng, Xiangnan He, Ming Gao, and Aoying Zhou. 2019. Improving Neural Relation Extraction with Implicit Mutual Relations. arXiv preprint arXiv:1907.05333.
  • Lin et al. (2016) Yankai Lin, Shiqi Shen, Zhiyuan Liu, Huanbo Luan, and Maosong Sun. 2016. Neural Relation Extraction with Selective Attention Over Instances. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL ‘16), volume 1, pages 2124–2133.
  • Liu et al. (2017) Tianyu Liu, Kexiang Wang, Baobao Chang, and Zhifang Sui. 2017. A Soft-Label Method for Noise-Tolerant Distantly Supervised Relation Extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP ‘17), pages 1790–1795.
  • Lu et al. (2018) Kaiji Lu, Piotr Mardziel, Fangjing Wu, Preetam Amancharla, and Anupam Datta. 2018. Gender Bias in Neural Natural Language Processing.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and Their Compositionality. In Advances in Neural Information Processing Systems (NIPS ‘13), pages 3111–3119.
  • Mintz et al. (2009) Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant Supervision for Relation Extraction Without Labeled Data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association of Computational Linguistics (ACL‘09), pages 1003–1011. Association for Computational Linguistics.
  • Riedel et al. (2010) Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling Relations and Their Mentions Without Labeled Text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD ‘10), pages 148–163. Springer.
  • Romanov et al. (2019) Alexey Romanov, Maria De-Arteaga, Hanna Wallach, Jennifer Chayes, Christian Borgs, Alexandra Chouldechova, Sahin Geyik, Krishnaram Kenthapadi, Anna Rumshisky, and Adam Tauman Kalai. 2019. What’s in a Name? Reducing Bias in Bios without Access to Protected Attributes. arXiv preprint arXiv:1904.05233.
  • Rudinger et al. (2018) Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender Bias in Coreference Resolution. In North American Chapter of the Association for Computational Linguistics (NAACL‘18).
  • Subasic et al. (2019) Pero Subasic, Hongfeng Yin, and Xiao Lin. 2019. Building Knowledge Base through Deep Learning Relation Extraction and Wikidata. In AAAI Spring Symposium: Combining Machine Learning with Knowledge Engineering.
  • Sun et al. (2019) Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, and William Yang Wang. 2019. Mitigating Gender Bias in Natural Language Processing: Literature Review.
  • Surdeanu et al. (2012) Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D Manning. 2012. Multi-instance Multi-label Learning for Relation Extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing (EMNLP ’12), pages 455–465. Association for Computational Linguistics.
  • Trisedya et al. (2019) Bayu Distiawan Trisedya, Gerhard Weikum, Jianzhong Qi, and Rui Zhang. 2019. Neural Relation Extraction for Knowledge Base Enrichment. ACL.
  • Vashishth et al. (2018) Shikhar Vashishth, Rishabh Joshi, Sai Suman Prayaga, Chiranjib Bhattacharyya, and Partha Talukdar. 2018. Reside: Improving Distantly-Supervised Neural Relation Extraction Using Side Information.
  • Wagner et al. (2015) Claudia Wagner, David Garcia, Mohsen Jadidi, and Markus Strohmaier. 2015. It’s a Man’s Wikipedia? Assessing Gender Inequality in an Online Encyclopedia. In Ninth International AAAI Conference on Web and Social Media (AAAI‘15).
  • Wu et al. (2017) Yi Wu, David Bamman, and Stuart Russell. 2017. Adversarial Training for Relation Extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP ‘17), pages 1778–1783.
  • Zeng et al. (2015) Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant Supervision for Relation Extraction Via Piecewise Convolutional Neural Networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP ‘15), pages 1753–1762.
  • Zhao et al. (2017) Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men Also Like Shopping: Reducing Gender Bias Amplification Using Corpus-Level Constraints. In Empirical Methods of Natural Language Processing (EMNLP‘17).
  • Zhao et al. (2018a) Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018a. Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. In North American Chapter of the Association for Computational Linguistics (NAACL‘18).
  • Zhao et al. (2018b) Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai-Wei Chang. 2018b. Learning Gender-Neutral Word Embeddings. In Proceedings of the 2018 Conference on Empirical Methods of Natural Language Processing (EMNLP‘18).