Mapping Languages and Demographics with Georeferenced Corpora

04/02/2020
by Jonathan Dunn, et al.

This paper evaluates large georeferenced corpora, taken from both web-crawled and social media sources, against ground-truth population and language-census datasets. The goal is to determine (i) which dataset best represents population demographics; (ii) in what parts of the world the datasets are most representative of actual populations; and (iii) how to weight the datasets to provide more accurate representations of underlying populations. The paper finds that the two datasets represent very different populations and that they correlate with actual populations with values of r=0.60 (social media) and r=0.49 (web-crawled). Further, Twitter data makes better predictions about the inventory of languages used in each country.
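The comparison described in the abstract can be illustrated with a minimal sketch. It assumes the reported r values are Pearson correlations between per-country corpus size and per-country population, and it assumes a simple ratio-based weighting scheme; neither detail is confirmed by the abstract. The function names (`pearson_r`, `population_weights`) and the country figures are hypothetical and are not taken from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def population_weights(corpus_counts, populations):
    """Per-country weight = population share / corpus share.

    This is an assumed weighting scheme for illustration only.
    Multiplying each country's corpus count by its weight makes the
    weighted corpus shares equal to the population shares.
    """
    total_corpus = sum(corpus_counts.values())
    total_pop = sum(populations[c] for c in corpus_counts)
    return {
        c: (populations[c] / total_pop) / (corpus_counts[c] / total_corpus)
        for c in corpus_counts
    }


# Toy figures for three hypothetical countries (not data from the paper).
corpus_counts = {"A": 500, "B": 300, "C": 200}
populations = {"A": 100, "B": 300, "C": 600}

countries = sorted(corpus_counts)
r = pearson_r(
    [corpus_counts[c] for c in countries],
    [populations[c] for c in countries],
)
weights = population_weights(corpus_counts, populations)
print(f"r = {r:.2f}")
print(weights)
```

A weight above 1 marks a country that is under-represented in the corpus relative to its population, and a weight below 1 marks one that is over-represented.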


research
04/02/2020

Mapping Languages: The Corpus of Global Language Use

This paper describes a web-based corpus of global language use with a fo...
research
03/24/2022

Automatic User Profiling in Darknet Markets: a Scalability Study

In this study, we investigate the scalability of state-of-the-art user p...
research
04/03/2021

Measuring Linguistic Diversity During COVID-19

Computational measures of linguistic diversity help us understand the li...
research
05/15/2019

Demographic Inference and Representative Population Estimates from Multilingual Social Media Data

Social media provide access to behavioural data at an unprecedented scal...
research
04/11/2019

Modeling Global Syntactic Variation in English Using Dialect Classification

This paper evaluates global-scale dialect identification for 14 national...
research
11/20/2017

#Halal Culture on Instagram

Halal is a notion that applies to both objects and actions, and means pe...
research
05/11/2019

Mining Hidden Populations through Attributed Search

Researchers often query online social platforms through their applicatio...
