Considerations for Multilingual Wikipedia Research

04/05/2022
by Isaac Johnson, et al.

English Wikipedia has long been an important data source for research and for natural language machine learning models. The growth of non-English language editions of Wikipedia, greater computational resources, and calls for equity in the performance of language and multimodal models have led to the inclusion of many more language editions of Wikipedia in datasets and models. Building better multilingual and multimodal models requires more than just access to expanded datasets; it also requires a better understanding of what is in the data and how this content was generated. This paper seeks to provide some background to help researchers think about what differences might arise between different language editions of Wikipedia and how those differences might affect their models. It details three major ways in which content differences between language editions arise (local context, community and governance, and technology) and offers recommendations for good practices when using multilingual and multimodal data for research and modeling.

research
08/28/2015

Understanding Editing Behaviors in Multilingual Wikipedia

Multilingualism is common offline, but we have a more limited understand...
research
03/06/2013

Japanese-Spanish Thesaurus Construction Using English as a Pivot

We present the results of research with the goal of automatically creati...
research
04/02/2019

The Tower of Babel Meets Web 2.0: User-Generated Content and its Applications in a Multilingual Context

This study explores language's fragmenting effect on user-generated cont...
research
06/02/2023

Fair multilingual vandalism detection system for Wikipedia

This paper presents a novel design of the system aimed at supporting the...
research
08/01/2023

CoSMo: A constructor specification language for Abstract Wikipedia's content selection process

Representing snippets of information abstractly is a task that needs to ...
research
10/21/2020

Multilingual Contextual Affective Analysis of LGBT People Portrayals in Wikipedia

Specific lexical choices in how people are portrayed both reflect the wr...
research
01/23/2020

Uneven Coverage of Natural Disasters in Wikipedia: the Case of Flood

The usage of non-authoritative data for disaster management presents the...
