"I'm" Lost in Translation: Pronoun Missteps in Crowdsourced Data Sets

04/22/2023
by Katie Seaborn, et al.

As virtual assistants are adopted globally, there is a growing need for these speech-based systems to communicate naturally in a variety of languages. Crowdsourcing initiatives have focused on multilingual translation of large, open data sets for use in natural language processing (NLP). Yet language translation is often not one-to-one, and biases can creep in. In this late-breaking work, we focus on the case of pronouns translated between English and Japanese in the crowdsourced Tatoeba database. We found that masculine pronoun biases were present overall, even though plurality in language was accounted for in other ways. Importantly, we detected biases in the translation process that reflect nuanced reactions to the presence of feminine, neutral, and/or non-binary pronouns. We raise the issue of translation bias for pronouns and offer a practical solution to embed plurality in NLP data sets.

research
04/22/2023

Transcending the "Male Code": Implicit Masculine Biases in NLP Contexts

Critical scholarship has elevated the problem of gender bias in data set...
research
07/10/2020

Pragmatic information in translation: a corpus-based study of tense and mood in English and German

Grammatical tense and mood are important linguistic phenomena to conside...
research
04/07/2022

Mapping the Multilingual Margins: Intersectional Biases of Sentiment Analysis Systems in English, Spanish, and Arabic

As natural language processing systems become more widespread, it is nec...
research
02/01/2023

User Study for Improving Tools for Bible Translation

Technology has increasingly become an integral part of the Bible transla...
research
03/04/2019

Polylingual Wordnet

Princeton WordNet is one of the most important resources for natural lan...
research
10/06/2020

Is the Best Better? Bayesian Statistical Model Comparison for Natural Language Processing

Recent work raises concerns about the use of standard splits to compare ...
research
05/29/2019

Word-order biases in deep-agent emergent communication

Sequence-processing neural networks led to remarkable progress on many N...
