Toward Gender-Inclusive Coreference Resolution

10/30/2019 ∙ by Yang Trista Cao, et al.

Correctly resolving textual mentions of people fundamentally entails making inferences about those people. Such inferences raise the risk of systemic biases in coreference resolution systems, including biases that reinforce cis-normativity and can harm binary and non-binary trans (and cis) stakeholders. To better understand such biases, we foreground nuanced conceptualizations of gender from sociology and sociolinguistics, and investigate where in the machine learning pipeline such biases can enter a system. We inspect many existing datasets for trans-exclusionary biases, and develop two new datasets for interrogating bias in crowd annotations and in existing coreference resolution systems. Through these studies, conducted on English text, we confirm that without acknowledging and building systems that recognize the complexity of gender, we will build systems that fail for: quality of service, stereotyping, and over- or under-representation.





The authors are grateful to a number of people who have provided pointers, edits, and suggestions to improve this work: Cassidy Henry, Marion Zepf, and Os Keyes all contributed to various aspects of this work, including suggestions for data sources for the GI Coref dataset. We also thank the CLIP lab at the University of Maryland for comments on previous drafts.

