GeBioToolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies

12/10/2019
by Marta R. Costa-Jussà, et al.

We introduce GeBioToolkit, a tool for extracting multilingual parallel corpora at sentence level, with document and gender information, from Wikipedia biographies. Despite the gender inequalities present in Wikipedia, the toolkit has been designed to extract a gender-balanced corpus. While our toolkit is customizable to any number of languages (and different domains), in this work we present a corpus of 2,000 sentences in English, Spanish and Catalan, which has been post-edited by native speakers to become a high-quality dataset for machine translation evaluation. While GeBioCorpus aims at being one of the first non-synthetic gender-balanced test datasets, GeBioToolkit aims at paving the path to standardize procedures to produce gender-balanced datasets.

research
07/10/2019

WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia

We present an approach based on multilingual sentence embeddings to auto...
research
05/03/2020

Tailoring and Evaluating the Wikipedia for in-Domain Comparable Corpora Extraction

We propose an automatic language-independent graph-based method to build...
research
11/14/2022

Wikigender: A Machine Learning Model to Detect Gender Bias in Wikipedia

The way Wikipedia's contributors think can influence how they describe i...
research
04/11/2019

A high quality and phonetic balanced speech corpus for Vietnamese

This paper presents a high quality Vietnamese speech corpus that can be ...
research
10/03/2022

Russian Web Tables: A Public Corpus of Web Tables for Russian Language Based on Wikipedia

Corpora that contain tabular data such as WebTables are a vital resource...
research
07/13/2021

Generating Gender Augmented Data for NLP

Gender bias is a frequent occurrence in NLP-based applications, especial...
research
12/31/2020

Controlled Analyses of Social Biases in Wikipedia Bios

Social biases on Wikipedia, a widely-read global platform, could greatly...