[SCiL 2020] DialectGram: Automatic Detection of Dialectal Changes with Multi-geographic Resolution Analysis
Several computational models have been developed to detect and analyze dialect variation in recent years. Most of these models assume a predefined set of geographical regions over which they detect and analyze dialectal variation. However, dialect variation occurs at multiple levels of geographic resolution, ranging from cities within a state and states within a country to countries across continents. In this work, we propose a model that enables detection of dialectal variation at multiple levels of geographic resolution, obviating the need for an a priori definition of the resolution level. Our method, DialectGram, learns dialect-sensitive word embeddings while being agnostic of the geographic resolution. Specifically, it requires only one-time training and enables analysis of dialectal variation at a chosen resolution post-hoc, a significant departure from prior models, which need to be re-trained whenever the pre-defined set of regions changes. Furthermore, DialectGram explicitly models senses, thus enabling one to estimate the proportion of each sense usage in any given region. Finally, we quantitatively evaluate our model against other baselines on a new evaluation dataset, DialectSim (in English), and show that DialectGram can effectively model linguistic variation.
Studying regional variation of language is central to the field of sociolinguistics. Traditional approaches Labov (1980); Milroy (1992); Tagliamonte (2006); Wolfram and Schilling (2015) focus on rigorous manual analysis of linguistic data collected through time-consuming and expensive surveys and questionnaires. The evolution of the Internet and social media now enables the study of linguistic variation at scale, overcoming some of the scalability challenges faced by survey-based methods. Consequently, computational methods to detect and analyze geographic variation in language have been proposed Eisenstein et al. (2010, 2011, 2014); Bamman et al. (2014); Kulkarni et al. (2015b).
However, most prior work suffers from two limitations. First, previous models Kulkarni et al. (2015b) such as the Frequency Model, the Syntactic Model, and GEODIST all rely on pre-defined regional classes to model linguistic changes (an exception is Eisenstein et al. (2010), which focuses on lexical variation). The use of pre-defined regional classes limits the flexibility of these baseline models because dialect changes can be observed at various geographic resolutions. Second, previous models do not explicitly model the sense distribution of each word. In this work, we address these limitations by proposing a model, DialectGram, that enables analysis at multiple geographic resolutions while explicitly modeling word senses (see Figure 4). Given a corpus that can be associated with geographical regions, DialectGram first induces the number of senses for each word using a non-parametric Bayesian model Bartunov et al. (2016). This step requires no a priori knowledge of the geographic resolution (the only requirement is that the corpus be geo-tagged so that analysis can be conducted post-hoc at any desired resolution). Having inferred the senses of each word, we show how to detect and analyze dialectal variation at any chosen geographic resolution by clustering usages in any given region based on their sense usage.
To summarize, our contributions are:
Multi-resolution Model: We introduce DialectGram, a method to study the geographic variation in language across multiple levels of resolution without assuming a priori knowledge of the geographical resolution.
Explicit Sense Modeling: DialectGram predicts how likely each sense of a word is to be used in a given context, thus enabling more precise modeling of linguistic change.
Corpus and Validation Set: We build a new English Twitter corpus, Geo-Tweets2019, for training dialect-sensitive word embeddings. Furthermore, we construct a new validation set, DialectSim, for evaluating the quality of English region-specific word embeddings between the UK and the USA.
Linguistic variation. Sociologists and linguists have traditionally studied linguistic change by designing experiments to manually collect data Labov (1980); Milroy (1992) and conducting variation analysis Tagliamonte (2006). Several works Eisenstein et al. (2010); Gulordava and Baroni (2011); Kim et al. (2014); Jatowt and Duh (2014); Kulkarni et al. (2015a, b); Kenter et al. (2015); Gonçalves and Sánchez (2016); Donoso and Sanchez (2017); Lucy and Mendelsohn (2018); Shoemark et al. (2019) have used different computational models to study dialect variation with respect to geography, gender, and time.
Eisenstein et al. (2010) is one of the first works to tackle the linguistic variation problem with computational models. They design a multi-level generative model that uses latent topic and geographic variables to analyze lexical variation in English. This latent variable model is able to predict an author's geographic location based on the author's text. To quantitatively evaluate the models, they compute the physical distance between the predicted and the true location. Similarly, Gonçalves and Sánchez (2016) apply the k-means method to cluster geographic lexical superdialects, assuming a pre-defined list of words that are known to demonstrate lexical variation. This was followed by Gonçalves and Sánchez (2016), who propose two metrics to calculate the linguistic distance between geographic regions. That is, instead of using the physical distance between the predicted and the true location, they compute cosine similarity or Jensen-Shannon Divergence (JSD) to evaluate the model quantitatively.
Recently, Kulkarni et al. (2015b), building on the work of Bamman et al. (2014), propose GEODIST, a word-embedding-based model for robustly modeling dialectal variation that focuses on capturing semantic changes between dialects. Nevertheless, a pre-defined set of regions is required for the model to update region-specific embeddings. For instance, Kulkarni et al. (2015b) assume that English exhibits dialectal variation between the US and the UK, and train the network to learn two sets of word embeddings for the two regions. However, a model trained on this data cannot be used to analyze dialectal variation across states, or at any other level of resolution, without re-training from scratch. To learn how English changes within each state, Kulkarni et al. (2015b) would need to tag each US tweet with a state name and train the model again. Moreover, the model does not explicitly capture the senses of a word but only learns region-specific embeddings.
Word Sense Disambiguation. The problem of detecting dialectal variants of a word can be viewed broadly in terms of word sense induction, where the different word senses roughly correspond to usages in different regions. For instance, the word pants usually refers to trousers in the US versus underwear in the UK, suggesting two senses for pants. Consequently, we also discuss the most relevant work on word sense induction. Reisinger and Mooney (2010) is the first paper to modify the single-prototype vector space model to obtain multi-sense word embeddings, with average cluster vectors as prototypes. Many later works Huang et al. (2012); Neelakantan et al. (2014); Tian et al. (2014); Chen et al. (2014) combine Skip-gram, clustering algorithms, and linguistic knowledge to learn word senses and embeddings jointly. Bartunov et al. (2016) adopt a non-parametric Bayesian approach and propose the Adaptive Skip-gram (AdaGram) model, which is able to induce word senses without assuming any fixed number of prototypes. As we will see in the following sections, we build on precisely this approach to model regional variation.
| Word | US Meaning | UK Meaning |
|---|---|---|
| flat | smooth and even; without marked lumps or indentations | apartment |
| flyover | flypast, ceremonial aircraft flight | elevated road section |
We create a new corpus, Geo-Tweets2019, which consists of English tweets posted during April and May 2019 from the United States and the United Kingdom (collected with the Tweepy toolkit). Each tweet includes the user ID, the publication time, the geographic location, and the tweet text. We have around 2M tweets from the US and 1M from the UK. We preprocessed the tweets with the tweet tokenizer from Eisenstein et al. (2010) and regular expressions. Finally, we filtered out URLs, emojis, and other irregular uses of English to shrink the vocabulary and to facilitate the training of word vectors. Statistics can be seen in Table 2.
To evaluate the models, we construct a new validation set, DialectSim, which comprises words with the same or shifted meanings in the US and the UK. To build this validation set, we first crawled a list of words that have different meanings from the Wikipedia page https://en.wikipedia.org/wiki/Lists_of_words_having_different_meanings_in_American_and_British_English and picked 341 words that appear more than 20 times in our corpus in both the UK and the US. Table 1 presents three examples in the dataset. In order to generate balanced positive and negative samples, we sampled another 341 negative examples randomly from our Geo-Tweets2019 dataset. A minimum frequency of 20 is also used for negative sampling. These negative cases were manually verified by each of the three authors independently. Finally, we split the dataset into a training set with 511 samples (75%) and a testing set with 171 samples (25%).
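The construction steps above (frequency filtering, balanced negative sampling, and a 75/25 split) can be illustrated with a minimal sketch. The function names, the use of per-country `Counter` objects, and the fixed random seed are assumptions made for illustration and are not taken from the released code; the manual verification of negatives by the authors is not modeled here.

```python
import random
from collections import Counter

MIN_FREQ = 20  # the text requires more than 20 occurrences in each country


def frequent_in_both(words, us_counts, uk_counts, min_freq=MIN_FREQ):
    """Keep words that occur more than `min_freq` times in both corpora."""
    return [w for w in words
            if us_counts[w] > min_freq and uk_counts[w] > min_freq]


def build_dataset(positive_candidates, us_counts, uk_counts,
                  seed=0, train_frac=0.75):
    """Return (train, test) lists of (word, label) pairs.

    Label 1 marks a word from the candidate list of dialect-varying words,
    label 0 marks a randomly sampled word (hypothetical stand-in for the
    manually verified negatives).
    """
    rng = random.Random(seed)
    positives = frequent_in_both(positive_candidates, us_counts, uk_counts)
    positive_set = set(positives)

    # Negative pool: frequent shared vocabulary, excluding the positives.
    shared_vocab = sorted(set(us_counts) & set(uk_counts))
    pool = [w for w in frequent_in_both(shared_vocab, us_counts, uk_counts)
            if w not in positive_set]
    negatives = rng.sample(pool, len(positives))  # balanced classes

    examples = [(w, 1) for w in positives] + [(w, 0) for w in negatives]
    rng.shuffle(examples)
    n_train = int(train_frac * len(examples))
    return examples[:n_train], examples[n_train:]
```

With 341 positives and 341 negatives, `int(0.75 * 682)` gives 511 training examples and leaves 171 for testing, which matches the split sizes reported above.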
Frequency Model. One baseline method for detecting significant changes in usage between two regions is to compare how often a word occurs in the US and the UK tweets. We have implemented this Frequency Model as described in Kulkarni et al. (2015b).
Syntactic Model. A more nuanced approach than the frequency-based one is to detect changes in syntactic roles across regions. The Syntactic Model Kulkarni et al. (2015b) also takes Part-of-Speech (POS) tags into consideration. More specifically, if a word is used equally frequently in both countries but its POS usage differs, then we consider the meaning of the word to be different between the two countries. We use the CMU ARK Twitter Part-of-Speech Tagger (http://www.cs.cmu.edu/~ark/TweetNLP/) for POS tagging.
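One way to quantify a difference in POS usage is to compare the word's POS tag distributions in the two regions. The sketch below uses the Jensen-Shannon divergence for this purpose; the choice of divergence, the function names, and the input format (a `Counter` of tags per region) are assumptions for illustration, and the tagging step itself is not shown.

```python
import math
from collections import Counter


def _normalize(tag_counts, tags):
    """Turn raw tag counts into a probability vector over `tags`."""
    total = sum(tag_counts.values())
    return [tag_counts.get(t, 0) / total for t in tags]


def _kl(p, q):
    """KL divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def pos_divergence(us_tags, uk_tags):
    """Jensen-Shannon divergence between two POS tag count distributions.

    Returns 0 for identical distributions and 1 for disjoint ones.
    Both inputs are assumed to be non-empty.
    """
    tags = sorted(set(us_tags) | set(uk_tags))
    p = _normalize(us_tags, tags)
    q = _normalize(uk_tags, tags)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)
```

A word that is a noun in both regions yields a divergence of 0 under this score, regardless of whether its meaning differs.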
GEODIST (Skip-gram) Model. The main idea of the GEODIST model Kulkarni et al. (2015b), which can detect semantic changes, is to learn region-specific word embeddings and use bootstrapping to estimate confidence scores on detected changes. Instead of learning a single vector to represent a word, this model jointly learns a global embedding as well as (multiple) differential embeddings, one per geographical region, for each word in the vocabulary, exactly as described in Bamman et al. (2014). In particular, the region-specific embedding is defined as the sum of the global embedding and the differential embedding for that region: writing $\delta_{\mathrm{MAIN}}(w)$ for the global embedding of word $w$ and $\delta_r(w)$ for its differential embedding in region $r$, the region-specific embedding is $\phi_r(w) = \delta_{\mathrm{MAIN}}(w) + \delta_r(w)$. The objective is to minimize the negative log-likelihood of the context word given the center word, conditioned on the region. We use stochastic gradient descent Bottou (1991) to update the model parameters. We implement our own GEODIST model in PyTorch.
We construct a new model for detecting dialectal changes, which we call DialectGram (Dialectal Adaptive Skip-gram). The model first learns multi-sense word embeddings using AdaGram Bartunov et al. (2016) by training on the region-agnostic corpus. Once sense-specific embeddings are obtained, the model composes region-specific word embeddings at the chosen resolution by taking a weighted average of the sense embeddings. Finally, the model calculates the distance between the region-specific embeddings of the same word to determine whether a significant change exists. Our method is described succinctly in Algorithm 1.
Compared to the GEODIST model, which needs predefined geographic labels to update the region-specific embeddings, DialectGram learns multi-sense word embeddings on our dataset without any knowledge of the underlying regions. For instance, DialectGram automatically induces and learns the two senses of the word flat, which could mean an apartment or level land, corresponding to usages in the UK and the US respectively.
We train our model on our Geo-Tweets2019 corpus to learn word sense embeddings using the Julia implementation of AdaGram (https://github.com/sbos/AdaGram.jl) and then implement the inference algorithm in Python. To obtain a word's region-specific embedding for a region, we first use DialectGram to predict the dominant sense of the word in each tweet from that region and then use the weighted average of the sense embeddings as the region-specific word embedding. We use the following hyper-parameter settings: , , , , , , . It is worth noting that a large $\alpha$ (the parameter of the underlying Dirichlet process) may lead to too many senses for some words, whereas a small $\alpha$ results in too few senses.
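The composition step described above can be sketched as follows. The sketch assumes that the per-tweet dominant senses have already been predicted (for example, by the trained AdaGram model) and that the weights of the average are the fractions of tweets assigned to each sense; the function names and the NumPy representation are illustrative assumptions, not the released implementation.

```python
import numpy as np


def sense_proportions(dominant_senses, n_senses):
    """Fraction of a region's tweets whose dominant sense is each sense index."""
    counts = np.bincount(np.asarray(dominant_senses, dtype=int),
                         minlength=n_senses)
    return counts / counts.sum()


def region_embedding(sense_vectors, dominant_senses):
    """Weighted average of a word's sense embeddings for one region.

    sense_vectors:   array of shape (n_senses, dim), one row per sense.
    dominant_senses: predicted sense index for each tweet in the region
                     that contains the word (assumed to be non-empty).
    """
    sense_vectors = np.asarray(sense_vectors, dtype=float)
    weights = sense_proportions(dominant_senses, sense_vectors.shape[0])
    return weights @ sense_vectors
```

Because the sense vectors are learned once on the region-agnostic corpus, only `dominant_senses` changes when the set of regions changes, so the same function can be reused for countries, states, or any other grouping of tweets.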
To measure the significance of a dialectal change, Kulkarni et al. (2015b) propose an unsupervised method to detect words with statistically significant meaning changes. However, given that we have access to the human-curated DialectSim dataset, we evaluate the models on the list of annotated words using a simple thresholding model (where the threshold parameter is learned from training data). Specifically, we evaluate both Skip-gram models (i.e., GEODIST and DialectGram) by calculating the Manhattan distance between a word's region-specific embeddings. We tried Euclidean and cosine distance as well, but use Manhattan distance since it yielded the best results of the three metrics. Our models, validation set, and code are available at https://github.com/yuxingch/DialectGram.
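For concreteness, the distance computation can be sketched as below. The dictionary-based representation of region-specific embeddings and the helper names are assumptions for illustration.

```python
import numpy as np


def manhattan_distance(u, v):
    """L1 distance between two embedding vectors of the same dimension."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.abs(u - v).sum())


def rank_by_distance(us_embeddings, uk_embeddings):
    """Sort the shared vocabulary by US/UK embedding distance, largest first.

    Both arguments map a word to its region-specific embedding.
    """
    shared = set(us_embeddings) & set(uk_embeddings)
    scored = [(w, manhattan_distance(us_embeddings[w], uk_embeddings[w]))
              for w in shared]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

The resulting score for each word is the real number that the thresholding model consumes.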
| Word | Sense 1 neighbors | Sense 2 neighbors |
|---|---|---|
| gas | industrial, masks, electric | car, station, bus |
| flat | kitchen, shower, window | shoes, problems, temperatures |
| buffalo | syracuse, hutchinson | chicken, fries, seafood |
| subway | starbucks, restaurant, mcdonalds | 1mph, commercial, 5kmh |
We investigate the words that the GEODIST model predicts to have a significant dialectal change between the two regions. For example, the word mate is among the top 20 words in our vocabulary when the vocabulary is sorted by the Manhattan distance between the US and the UK embeddings from high to low. However, words like draft are predicted to have different regional meanings but are not labelled as "significant" in DialectSim. We further discuss this issue in Section 5.2.3.
We select some words with significantly different meanings between the UK and the US. In our DialectGram model, we select the two most frequent senses, which usually account for more than 99% of a word's usage, and plot a heat map on a world map.
The word maps in Figure 4 suggest that the usage of gas and flat differs between the UK and the US. Gas is commonly used to mean petrol and is related to gas station in the US, but in the UK, gas usually refers to air and natural gas. Flat could refer to an apartment, but in the US this meaning is not as common as in the UK. The same model can also be used at a different resolution level (across US states). For example, given the word buffalo, we show the two most dominant senses: the Buffalo City sense (in blue) and the buffalo sauce sense (in white). Similarly, for the word pop, we observe that the Midwest area and the Pacific Northwest are more reddish, indicating that people there are more likely to use the word for a soft drink (soda), while people in other areas tend to use it to describe a certain type of music, pop music. We normalized the data points by filtering out states where the number of tweets is less than 15, since a small number of data points can suffer from high variance.
Our training corpus, Geo-Tweets2019, has over three million tweets from the US and the UK. However, we still observed that micro-level analyses at a resolution finer than the state level required more data samples. Therefore, we only present the country-level and state-level analyses here (note that we do not need to train the model to learn embeddings again when we change resolutions for our analyses).
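The post-hoc change of resolution can be illustrated with a small aggregation sketch: the same per-tweet sense predictions are grouped by country or by state without any re-training. The record layout (dictionaries with `country`, `state`, and `sense` keys) and the function name are assumptions for illustration; the minimum of 15 tweets follows the filtering described for the state-level maps.

```python
from collections import Counter, defaultdict

MIN_TWEETS = 15  # regions with fewer tweets are dropped, as in the state maps


def sense_usage_by_region(records, region_key, min_tweets=MIN_TWEETS):
    """Proportion of each sense of a word per region, at a chosen resolution.

    records:    one dict per tweet containing the word, with the tweet's
                region fields and its predicted dominant sense.
    region_key: field to group by, e.g. "country" or "state".
    """
    counts = defaultdict(Counter)
    for record in records:
        counts[record[region_key]][record["sense"]] += 1

    usage = {}
    for region, sense_counts in counts.items():
        total = sum(sense_counts.values())
        if total < min_tweets:
            continue  # too few data points, high variance
        usage[region] = {s: c / total for s, c in sense_counts.items()}
    return usage
```

Calling the function once with `"country"` and once with `"state"` on the same records yields the inputs for the country-level and state-level heat maps respectively.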
For each model, we define a score function that takes in one word and returns a real number denoting its difference in meaning between the UK and the US. We fit a simple threshold model that maximizes accuracy on the training set, and then test the model's performance on the testing set. The results are shown in Table 4.
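A threshold model of this kind can be sketched in a few lines. The exhaustive search over observed scores and the function names are assumptions for illustration; the sketch predicts a dialectal change whenever a word's score is at least the threshold.

```python
def accuracy(scores, labels, threshold):
    """Fraction of words for which (score >= threshold) matches the label."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def fit_threshold(scores, labels):
    """Pick the threshold with the highest training accuracy.

    Candidates are the distinct training scores plus +inf (predict no
    change for every word). Ties are broken in favor of the smallest
    candidate.
    """
    candidates = sorted(set(scores)) + [float("inf")]
    return max(candidates, key=lambda t: accuracy(scores, labels, t))
```

The fitted threshold is then applied unchanged to the test scores, so each model is compared on the same footing regardless of the scale of its score function.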
We observed that the Frequency Model is sensitive to lexical differences between the two countries: football in the UK refers to the sport called soccer in the US, which causes an imbalanced frequency of the term football between the two countries. However, it cannot detect semantic changes that preserve a word's frequency in both countries: flat has a similar frequency in both countries, despite the fact that flat could mean apartment in the UK, whereas this usage is uncommon in the US. This model does not suffer from over-fitting, because the model is fairly simple and the parameter space is quite small. However, the Frequency Model is susceptible to a high false positive rate.
The Syntactic Model performs the worst among all the models. It still achieves slightly higher precision than the Frequency Model on the test set because it gets some dialectal syntactic changes correct. There are two reasons for its poor performance. First, it is limited by the performance of the POS tagger. Second, many word sense changes do not alter POS tags. For example, pants refers to underwear in the UK while it refers to trousers in the US, and both usages are nouns.
As mentioned in Section 5.1, the GEODIST model is able to detect dialect changes. Its accuracy on the test set beats the previous two baseline models (0.6432 versus 0.5600 and 0.5263), as shown in Table 4. It also outperforms the baseline models in terms of precision and F1 score. In fact, the GEODIST model has the highest precision among all models, including the DialectGram model that will be discussed in the next section. We also notice that its recall on the test set is the lowest. The high precision with low recall indicates that the GEODIST model is very conservative and misses some words that actually have significant dialectal changes. For example, the difference between the two region-specific embeddings of the word pants is predicted to be not significant, while pants does have different meanings in the UK and the US (Table 1).
DialectGram outperforms the GEODIST model in accuracy, recall, and F1 score, although its precision is lower than that of the GEODIST model and the Frequency Model. This result is nevertheless notable given that DialectGram does not require pre-determined geographic labels and enables analysis at different geographic resolutions post-hoc (after the model is trained).
One reason for DialectGram's lower precision compared to the GEODIST model is that it over-estimates the number of senses (learning senses that overlap). For example, for the word gas in Table 3, we sometimes obtain an additional sense characterized by words such as air, house, and pipe. This sense seems to be a mix of sense 1, gaseous substance, and sense 2, gasoline. The average number of senses is controlled by the hyper-parameter $\alpha$, which we pick based on the model's performance on the training set, but we acknowledge that smarter search strategies for $\alpha$ could be employed.
In this work, we proposed a novel method to detect linguistic variation at multiple resolution levels. In our new approach, we use DialectGram to train multi-sense embeddings on region-agnostic data, compose region-specific word embeddings, and determine whether there is a significant dialectal variation across regions for a word. In contrast to baseline models, DialectGram does not rely on region labels for training multi-sense word embeddings. The use of region-agnostic data allows DialectGram to conduct multi-resolution analysis with one-time training. We also construct Geo-Tweets2019, a new corpus of tweets from Twitter users in the UK and the US, for training word embeddings. To validate our work, we also contribute a new validation set, DialectSim, for explicitly measuring the performance of our models in detecting linguistic variation between the US and the UK. This validation set allows for a more precise comparison between our method (DialectGram) and previous methods, including the Frequency Model, the Syntactic Model, and the GEODIST model. On DialectSim, our method achieves better performance than the previous models in accuracy, recall, and F1 score. Through linguistic analysis, we also found that the DialectGram model learns rich linguistic changes between British and American English.
Finally, we conclude by noting that the method can be easily extended to temporal analysis of language at multiple levels of resolution.
We would like to thank Cindy Wang, Christopher Potts, and the anonymous reviewers, who gave valuable advice and comments on our paper. We would also like to thank the Symbolic Systems Program at Stanford University for funding our research through Grants for Education And Research (GEAR).
Bottou (1991). Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes 91 (8), pp. 12.
Eisenstein et al. (2010). A latent variable model for geographic lexical variation. Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 1277–1287.