Dataset Geography: Mapping Language Data to Language Users

12/07/2021
by Fahim Faisal, et al.

As language technologies become more ubiquitous, there are increasing efforts to expand the language diversity and coverage of natural language processing (NLP) systems. Arguably, the most important factor influencing the quality of modern NLP systems is data availability. In this work, we study the geographical representativeness of NLP datasets, aiming to quantify whether, and by how much, NLP datasets match the expected needs of the language speakers. To do so, we use entity recognition and linking systems, making important observations about their cross-lingual consistency and offering suggestions for more robust evaluation. Finally, we explore some geographical and economic factors that may explain the observed dataset distributions. Code and data are available here: https://github.com/ffaisal93/dataset_geography. Additional visualizations are available here: https://nlp.cs.gmu.edu/project/datasetmaps/.


