Freshman or Fresher? Quantifying the Geographic Variation of Internet Language

10/22/2015
by Vivek Kulkarni, et al.

We present a new computational technique to detect and analyze statistically significant geographic variation in language. Our meta-analysis approach captures statistical properties of word usage across geographical regions and uses statistical methods to identify significant changes specific to regions. While previous approaches have primarily focused on lexical variation between regions, our method identifies words that demonstrate semantic and syntactic variation as well. We extend recently developed techniques for neural language models to learn word representations which capture differing semantics across geographical regions. In order to quantify this variation and ensure robust detection of true regional differences, we formulate a null model to determine whether observed changes are statistically significant. Our method is the first such approach to explicitly account for random variation due to chance while detecting regional variation in word meaning. To validate our model, we study and analyze two different massive online data sets: millions of tweets from Twitter spanning not only four different countries but also fifty states, as well as millions of phrases contained in the Google Book Ngrams. Our analysis reveals interesting facets of language change at multiple scales of geographic resolution -- from neighboring states to distant continents. Finally, using our model, we propose a measure of semantic distance between languages. Our analysis of British and American English over a period of 100 years reveals that semantic variation between these dialects is shrinking.
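
To make the idea of a null model concrete, here is a minimal sketch of a label-shuffling (permutation) test. It is not the method from the paper: it swaps in simple word frequency for the learned word representations, and the function names, toy data, and parameters (`usage_rate`, `permutation_p_value`, `n_shuffles`, `seed`) are all hypothetical. It only illustrates how one might ask whether a regional difference is larger than chance alone would produce.

```python
import random


def usage_rate(docs, word):
    """Fraction of documents (lists of tokens) that contain `word`."""
    return sum(word in doc for doc in docs) / len(docs)


def permutation_p_value(region_a, region_b, word, n_shuffles=1000, seed=0):
    """Estimate how often shuffled region labels yield a usage gap
    at least as large as the one observed between the two regions."""
    rng = random.Random(seed)
    observed = abs(usage_rate(region_a, word) - usage_rate(region_b, word))
    pooled = list(region_a) + list(region_b)
    n_a = len(region_a)
    hits = 0
    for _ in range(n_shuffles):
        # Null model: region labels carry no information, so reassign them at random.
        rng.shuffle(pooled)
        gap = abs(usage_rate(pooled[:n_a], word) - usage_rate(pooled[n_a:], word))
        if gap >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (n_shuffles + 1)


# Toy data (invented for illustration): each document is a list of tokens.
us = [["freshman", "year"]] * 30 + [["first", "year"]] * 10
uk = [["fresher", "week"]] * 30 + [["first", "year"]] * 10

print(permutation_p_value(us, uk, "freshman"))  # small: the gap is unlikely under the null
print(permutation_p_value(us, uk, "first"))     # large: usage is identical in both regions
```

A small value suggests the regional gap is unlikely to arise from random variation; a value near 1 suggests no detectable regional difference for that word.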

research
10/04/2019

DialectGram: Detecting Dialectal Variation at Multiple Geographic Resolutions

Several computational models have been developed to detect and analyze d...
research
01/30/2021

Fake it Till You Make it: Self-Supervised Semantic Shifts for Monolingual Word Embedding Tasks

The use of language is subject to variation over time as well as across ...
research
11/16/2020

A Probabilistic Approach in Historical Linguistics Word Order Change in Infinitival Clauses: from Latin to Old French

This research offers a new interdisciplinary approach to the field of Li...
research
10/12/2021

A large scale lexical and semantic analysis of Spanish language variations in Twitter

Dialectometry is a discipline devoted to studying the variations of a la...
research
03/18/2021

Phylogenetic typology

In this article we propose a novel method to estimate the frequency dist...
research
08/23/2022

Computational valency lexica and Homeric formularity

Distributional semantics, the quantitative study of meaning variation an...
research
10/04/2019

DialectGram: Automatic Detection of Dialectal Variation at Multiple Geographic Resolutions

We propose DialectGram, a method to detect dialectical variation across ...
