The Arabic Parallel Gender Corpus 2.0: Extensions and Analyses

10/18/2021
by Bashar Alhafni, et al.

Gender bias in natural language processing (NLP) applications, particularly machine translation, has received increasing attention. Much of the research on this issue has focused on mitigating gender bias in English NLP models and systems. Work on poorly resourced and/or morphologically rich languages has lagged behind, largely because datasets and resources are lacking. In this paper, we introduce a new corpus for gender identification and rewriting in contexts involving one or two target users (I and/or You), that is, first and second grammatical persons with independent grammatical gender preferences. We focus on Arabic, a gender-marking, morphologically rich language. The corpus has multiple parallel components: four combinations of 1st and 2nd person in feminine and masculine grammatical genders, as well as English, and English-to-Arabic machine translation output. This corpus expands on the Arabic Parallel Gender Corpus (APGC v1.0) of Habash et al. (2019) by adding second person targets and by increasing the total number of sentences by a factor of over 6.5, reaching over 590K words. Our new dataset will aid the research and development of gender identification, controlled text generation, and post-editing rewrite systems that could be used to personalize NLP applications and provide users with the correct outputs based on their grammatical gender preferences. We make the Arabic Parallel Gender Corpus (APGC v2.0) publicly available.
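To make the parallel structure described in the abstract concrete, the following is a minimal sketch of how one corpus entry could be represented in Python. It is an illustration only: the class `ParallelEntry`, its field names, the helper `make_placeholder_entry`, and the placeholder strings are all hypothetical and do not reflect the actual APGC v2.0 file format or any released tooling.

```python
from dataclasses import dataclass, field
from itertools import product

# Grammatical genders covered by the corpus: feminine and masculine.
GENDERS = ("F", "M")

# The four (1st-person gender, 2nd-person gender) target combinations.
TARGET_COMBINATIONS = list(product(GENDERS, repeat=2))


@dataclass
class ParallelEntry:
    """Hypothetical container for one parallel entry (not the real schema)."""
    english: str                # English side of the parallel data
    mt_arabic: str              # English-to-Arabic machine translation output
    variants: dict = field(default_factory=dict)  # (1st, 2nd) -> Arabic text

    def rewrite_for(self, first: str, second: str) -> str:
        """Return the Arabic variant matching the users' gender preferences."""
        return self.variants[(first, second)]


def make_placeholder_entry(english: str) -> ParallelEntry:
    """Build an entry with placeholder text in place of real Arabic sentences."""
    variants = {
        (first, second): f"<Arabic, 1st={first}, 2nd={second}>"
        for first, second in TARGET_COMBINATIONS
    }
    return ParallelEntry(english=english, mt_arabic="<MT output>", variants=variants)


if __name__ == "__main__":
    entry = make_placeholder_entry("example sentence")
    for first, second in TARGET_COMBINATIONS:
        print(first, second, entry.rewrite_for(first, second))
```

Keying the variants by a (first person, second person) pair mirrors the abstract's point that the two target users have independent grammatical gender preferences, so a rewriting system would pick one of four outputs rather than one of two.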


research
10/14/2022

The User-Aware Arabic Gender Rewriter

We introduce the User-Aware Arabic Gender Rewriter, a user-centric web-b...
research
06/11/2019

Counterfactual Data Augmentation for Mitigating Gender Stereotypes in Languages with Rich Morphology

Gender stereotypes are manifest in most of the world's languages and are...
research
06/20/2018

TxPI-u: A Resource for Personality Identification of Undergraduates

Resources such as labeled corpora are necessary to train automatic model...
research
05/04/2022

User-Centric Gender Rewriting

In this paper, we define the task of gender rewriting in contexts involv...
research
02/12/2021

They, Them, Theirs: Rewriting with Gender-Neutral English

Responsible development of technology involves applications being inclus...
research
05/25/2023

What about em? How Commercial Machine Translation Fails to Handle (Neo-)Pronouns

As 3rd-person pronoun usage shifts to include novel forms, e.g., neopron...
research
09/24/2020

Type B Reflexivization as an Unambiguous Testbed for Multilingual Multi-Task Gender Bias

The one-sided focus on English in previous studies of gender bias in NLP...