Representations of Language Varieties Are Reliable Given Corpus Similarity Measures

04/03/2021
by Jonathan Dunn, et al.

This paper measures similarity both within and between 84 language varieties across nine languages. The corpora are drawn from digital sources (the web and tweets), allowing us to evaluate whether such geo-referenced corpora are reliable for modelling linguistic variation. The basic idea is that, if each source adequately represents a single underlying language variety, then the similarity between these sources should be stable across all languages and countries. Using frequency-based corpus similarity measures, the paper shows consistent agreement between these sources. This provides further evidence that digital geo-referenced corpora consistently represent local language varieties.
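To make the idea of a frequency-based corpus similarity measure concrete, here is a minimal sketch. The abstract does not say which measure the paper uses, so this example assumes one common choice: Spearman rank correlation over the relative frequencies of the most frequent words in two corpora. The function names (`corpus_similarity`, `average_ranks`) and the `n_features` parameter are hypothetical and chosen only for illustration.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its share of all tokens in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def average_ranks(values):
    """Rank values from smallest (rank 1) upward, averaging ranks for ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def corpus_similarity(tokens_a, tokens_b, n_features=500):
    """Spearman correlation of word frequencies (an assumed measure).

    Features are the n_features words with the highest combined
    relative frequency across both corpora.
    """
    freq_a = relative_frequencies(tokens_a)
    freq_b = relative_frequencies(tokens_b)
    combined = Counter()
    for word in set(freq_a) | set(freq_b):
        combined[word] = freq_a.get(word, 0.0) + freq_b.get(word, 0.0)
    features = [word for word, _ in combined.most_common(n_features)]
    vec_a = [freq_a.get(word, 0.0) for word in features]
    vec_b = [freq_b.get(word, 0.0) for word in features]
    # Spearman's rho is Pearson's r computed on the ranks.
    return pearson(average_ranks(vec_a), average_ranks(vec_b))
```

Under this sketch, a web corpus and a tweet corpus from the same country would each be tokenized and passed to `corpus_similarity`. The reliability argument then amounts to checking that such scores behave consistently across languages and countries.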


04/02/2020

Mapping Languages: The Corpus of Global Language Use

This paper describes a web-based corpus of global language use with a fo...
04/29/2020

Basic Linguistic Resources and Baselines for Bhojpuri, Magahi and Maithili for Natural Language Processing

Corpus preparation for low-resource languages and for development of hum...
04/03/2021

Measuring Linguistic Diversity During COVID-19

Computational measures of linguistic diversity help us understand the li...
03/13/2020

Know thy corpus! Robust methods for digital curation of Web corpora

This paper proposes a novel framework for digital curation of Web corpor...
08/10/2016

An assessment of orthographic similarity measures for several African languages

Natural Language Interfaces and tools such as spellcheckers and Web sear...
01/09/2019

What do Language Representations Really Represent?

A neural language model trained on a text corpus can be used to induce d...
02/22/2017

Dialectometric analysis of language variation in Twitter

In the last few years, microblogging platforms such as Twitter have give...