Detecting and analyzing regional variation in language is central to the field of socio-variational linguistics and dialectology (e.g., [40, 25, 31, 41]). Since online content is an agglomeration of material originating from all over the world, language on the Internet demonstrates geographic variation. The abundance of geo-tagged online text enables the study of geographic linguistic variation at scales that are unattainable using classical methods like surveys and questionnaires.
Characterizing and detecting such variation is challenging since it takes different forms: lexical, syntactic and semantic. Most existing work has focused on detecting lexical variation prevalent in geographic regions [4, 13, 15, 16]. However, regional linguistic variation is not limited to lexical variation.
In this paper we address this gap. Our method, geodist, is the first computational approach for tracking and detecting statistically significant linguistic shifts of words across geographical regions. geodist detects syntactic and semantic variation in word usage across regions, in addition to purely lexical differences. geodist builds on recently introduced neural language models that learn word representations (word embeddings), extending them to capture region-specific semantics. Since observed regional variation could be due to chance, geodist explicitly introduces a null model to ensure detection of only statistically significant differences between regions.
Figure 1 presents a visualization of the semantic variation captured by geodist for the word test between the United States, the United Kingdom, Canada, and India. In the majority of English-speaking countries, test almost always means an exam, but in India (where cricket is a popular sport) test almost always refers to a lengthy form of cricket match. One might argue that simple baseline methods (such as analyzing part of speech) might be sufficient to identify regional variation. However, because these methods capture different modalities, they detect different types of changes, as we illustrate in Figure 2.
We use our method in two novel ways. First, we evaluate our methods on several large datasets at multiple geographic resolutions. We investigate linguistic variation across Twitter at multiple scales: (a) between four English-speaking countries and (b) between the fifty states of the USA. We also investigate regional variation in the Google Books Ngram Corpus data. Our methods detect a variety of changes, including regional dialectal variations, region-specific usages, words incorporated due to code mixing, and differing semantics.
Second, we apply our method to analyze distances between language dialects. In order to do this, we propose a measure of semantic distance between languages. Our analysis of British and American English over a period of years reveals that semantic variation between these dialects is shrinking, potentially due to cultural mixing and globalization (see Figure 3).
Specifically, our contributions are as follows:
Models and Methods: We present our new method geodist which extends recently proposed neural language models to capture semantic differences between regions (Section 3.2). geodist is a new statistical method that explicitly incorporates a null model to ascertain statistical significance of observed semantic changes.
Multi-Resolution Analysis: We apply our method on multiple domains (Books and Tweets) across geographic scales (States and Countries). Our analysis of these large corpora (containing billions of words) reveals interesting facets of language change at multiple scales of geographic resolution – from neighboring states to distant continents (Section 5).
Semantic Distance: We propose a new measure of semantic distance between languages which we use to characterize distances between various dialects of English and analyze their convergent and divergent patterns over time (Section 6).
2 Problem Definition
We seek to quantify shift in word meaning (usage) across different geographic regions. Specifically, we are given a corpus $\mathcal{C}$ that spans $R$ regions, where $\mathcal{C}_r$ corresponds to the corpus specific to region $r$. We denote the vocabulary of the corpus by $\mathcal{V}$. We want to detect words in $\mathcal{V}$ that have region-specific semantics (not including trivial instances of words exclusively used in one region). For each region $r$, we capture statistical properties of a word $w$'s usage in that region. Given a pair of regions $(r_i, r_j)$, we then reduce the problem of detecting words that are used differently across these regions to an outlier detection problem using the statistical properties captured.
In summary, we answer the following questions:
In which regions does the word usage drastically differ from other regions?
How statistically significant is the difference observed across regions?
Given two regions, how close are their corresponding dialects semantically?
3 Methods
In this section we discuss methods to model regional word usage.
3.1 Baseline Methods
Frequency Method. One standard method to detect which words vary across geographical regions is to track their frequency of usage. Formally, we track the change in probability of a word across regions, as described in prior work. To characterize the difference in frequency usage of a word $w$ between a region pair $(r_i, r_j)$, we compute the ratio $\mathrm{Score}(w) = \frac{P_{r_i}(w)}{P_{r_j}(w)}$, where $P_{r_i}(w)$ is the probability of $w$ occurring in region $r_i$. An example of the information we capture by tracking word frequencies over regions is shown in Figure 4. Observe that touchdown (an American football term) is used much more frequently in the US than in the UK. While this naive method is easy to implement and identifies words which differ in their usage patterns, one limitation is an overemphasis on rare words. Furthermore, frequency-based methods overlook the fact that changes in word usage or meaning are not exclusively associated with a change in frequency.
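A minimal sketch of the frequency baseline, assuming maximum-likelihood unigram probabilities per region. The function names, the toy sentences, and the `eps` smoothing constant are illustrative assumptions, not part of the original method.

```python
from collections import Counter


def word_probabilities(tokens):
    """Maximum-likelihood unigram probabilities for one region's tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def frequency_score(word, probs_i, probs_j, eps=1e-9):
    """Ratio of the word's probability in region i to that in region j.

    eps is an assumed smoothing constant that avoids division by zero
    when the word is absent from one region.
    """
    return (probs_i.get(word, 0.0) + eps) / (probs_j.get(word, 0.0) + eps)


# Toy, made-up corpora standing in for two regions.
us = "the touchdown won the game after a late touchdown".split()
uk = "the match ended after a late goal in the game".split()
p_us, p_uk = word_probabilities(us), word_probabilities(uk)

score_touchdown = frequency_score("touchdown", p_us, p_uk)
score_the = frequency_score("the", p_us, p_uk)
```

In this toy example the ratio for touchdown is enormous only because the word is missing from the second corpus, which illustrates the overemphasis on rare words noted above.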
Syntactic Method. A method to capture syntactic variation in word usage through time was proposed in prior work. Along similar lines, we can capture regional syntactic variation of words. The word lift is a striking example of such variation: in the US, lift is dominantly used as a verb (in the sense “to lift an object”), whereas in the UK lift also refers to an elevator and is thus predominantly used as a common noun. Given a word $w$ and a pair of regions $(r_i, r_j)$, we adapt that method and compute the Jensen-Shannon divergence between the part-of-speech distributions for word $w$ corresponding to the two regions.
Figure 5 shows the part-of-speech distribution for a few words that differ in syntactic usage between the US and UK. In the US, remit is used primarily as a verb (as in “to remit a payment”). However, in the UK, remit can refer “to an area of activity over which a particular person or group has authority, control or influence” (used as “A remit to report on medical services”; see http://www.oxfordlearnersdictionaries.com/us/definition/english/remit_1). The word curb is used mostly as a noun (as in “I should put a curb on my drinking habits.”) in the UK, but it is used dominantly as a verb in the US (as in “We must curb the rebellion.”).
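The Jensen-Shannon divergence between two part-of-speech distributions can be sketched as follows. This is an illustrative sketch only: the tag names and the probabilities for lift are made up, and the base-2 logarithm (which bounds the divergence by 1) is an assumption.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; tags with p[t] == 0 contribute nothing."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p if p[t] > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two tag distributions (dicts)."""
    tags = set(p) | set(q)
    p = {t: p.get(t, 0.0) for t in tags}
    q = {t: q.get(t, 0.0) for t in tags}
    # Mixture distribution; it is positive wherever p or q is positive.
    m = {t: 0.5 * (p[t] + q[t]) for t in tags}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical part-of-speech distributions for "lift" in two regions.
pos_us = {"VERB": 0.8, "NOUN": 0.2}
pos_uk = {"VERB": 0.3, "NOUN": 0.7}
js_lift = js_divergence(pos_us, pos_uk)
```

Identical distributions give a divergence of 0, and distributions with no tags in common give 1, so words can be ranked by this score within a region pair.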
Whereas the Syntactic method captures a deeper variation than the frequency methods, it is important to observe that semantic changes in word usage are not limited to syntactic variation as we illustrated before in Figure 2.
3.2 Distributional Method: geodist
As we noted in the previous section, linguistic variation is not restricted only to syntactic variation. In order to detect subtle semantic changes, we need to infer cues based on the contextual usage of a word. To do so, we use distributional methods which learn a latent semantic space that maps each word $w \in \mathcal{V}$ to a continuous vector space $\mathbb{R}^d$.
We differentiate ourselves from the work most closely related to our method (Bamman, Dyer, and Smith; see Section 7) by explicitly accounting for random variation between regions and proposing a method to detect statistically significant changes.
Learning region specific word embeddings
Given a corpus $\mathcal{C}$ with $R$ regions, we seek to learn a region-specific word embedding $\phi_r : \mathcal{V} \to \mathbb{R}^d$ for each region $r$ using a neural language model. For each word $w \in \mathcal{V}$ the neural language model learns:
A global embedding $\delta_{\mathrm{MAIN}}(w)$ for the word, ignoring all region-specific cues.
A differential embedding $\delta_r(w)$ that encodes differences from the global embedding specific to region $r$.
The region-specific embedding $\phi_r(w)$ is computed as $\phi_r(w) = \delta_{\mathrm{MAIN}}(w) + \delta_r(w)$. Before training, the global word embeddings are randomly initialized while the differential word embeddings are initialized to $\mathbf{0}$. During each training step, the model is presented with a set of words and the region $r$ they are drawn from. Given a word $w_i$, the context words are the words appearing to the left or right of $w_i$ within a window of size $m$. We define the set of active regions $A = \{r, \mathrm{MAIN}\}$, where MAIN is a placeholder location corresponding to the global embedding and is always included in the set of active regions. The training objective then is to maximize the probability of words appearing in the context of word $w_i$ conditioned on the active set of regions $A$. Specifically, we model the probability of a context word $w_j$ given $w_i$ as:
$$\Pr(w_j \mid w_i) = \frac{\exp\left(\mathbf{w}_j^{\top} \mathbf{w}_i\right)}{\sum_{w_k \in \mathcal{V}} \exp\left(\mathbf{w}_k^{\top} \mathbf{w}_i\right)} \tag{1}$$
where $\mathbf{w}_i$ is defined as $\mathbf{w}_i = \sum_{a \in A} \delta_a(w_i)$.
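The composition of a region-specific vector from a global and a differential embedding, and the softmax over context words, can be sketched in plain Python. This is a toy illustration under stated assumptions: the vocabulary, the region names, the dimension `DIM`, and the separate `context_vec` table of output vectors are all hypothetical choices, and no training is performed.

```python
import math
import random

random.seed(0)
VOCAB = ["test", "exam", "cricket", "match"]
REGIONS = ["US", "IN"]
DIM = 4  # illustrative embedding size, not the paper's setting


def random_vector():
    return [random.uniform(-0.5, 0.5) for _ in range(DIM)]


# Global (MAIN) embeddings start random; differential embeddings start at zero.
delta_main = {w: random_vector() for w in VOCAB}
delta_region = {r: {w: [0.0] * DIM for w in VOCAB} for r in REGIONS}
# Context ("output") vectors used by the softmax: an assumption of this sketch.
context_vec = {w: random_vector() for w in VOCAB}


def region_embedding(word, region):
    """Sum of the embeddings for the active regions {region, MAIN}."""
    return [g + d for g, d in zip(delta_main[word], delta_region[region][word])]


def context_probabilities(word, region):
    """Softmax over the vocabulary of dot products with the region vector."""
    center = region_embedding(word, region)
    logits = {u: sum(a * b for a, b in zip(context_vec[u], center)) for u in VOCAB}
    top = max(logits.values())  # subtract the max for numerical stability
    weights = {u: math.exp(x - top) for u, x in logits.items()}
    total = sum(weights.values())
    return {u: wgt / total for u, wgt in weights.items()}


probs = context_probabilities("test", "IN")
```

Because the differential embeddings start at zero, every region initially shares the global vector; regional differences only emerge as training updates the differential part.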
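During training, we iterate over each word occurrence in $\mathcal{C}$ to minimize the negative log-likelihood of the context words. Our objective function $J$ is thus given by:
$$J = \sum_{w_i \in \mathcal{C}} \; \sum_{\substack{j = i - m \\ j \neq i}}^{i + m} -\log \Pr(w_j \mid w_i) \tag{2}$$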
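The objective sums the negative log-probability of every context word within the window around every word occurrence. A minimal sketch, assuming a caller-supplied probability function `prob(center, context)` (hypothetical; in the model it would be the softmax over region-specific vectors):

```python
import math


def context_pairs(tokens, window):
    """Yield (center, context) pairs within `window` positions on either side."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


def negative_log_likelihood(tokens, window, prob):
    """Sum of -log prob(center, context) over all context pairs."""
    return sum(-math.log(prob(c, u)) for c, u in context_pairs(tokens, window))


# Toy check with a uniform model over a four-word vocabulary.
tokens = "a b c d".split()
nll = negative_log_likelihood(tokens, 1, lambda center, ctx: 0.25)
```

With a window of one, the four tokens produce six (center, context) pairs, so the uniform model scores six times the log of the vocabulary size.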
When $|\mathcal{V}|$ is large, it is computationally expensive to compute the normalization factor in Equation 1 exactly. Therefore, we approximate this probability by using hierarchical soft-max [34, 32], which reduces the cost of computing the normalization factor from $O(|\mathcal{V}|)$ to $O(\log |\mathcal{V}|)$. We optimize the model parameters $\theta$ using stochastic gradient descent, as $\theta \leftarrow \theta - \alpha \frac{\partial J}{\partial \theta}$, where $\alpha$ is the learning rate. We calculate the derivatives using the back-propagation algorithm. Unless stated otherwise, the learning rate $\alpha$, the context window size $m$, and the size $d$ of the word embedding are held fixed across experiments.
Distance Computation between regional embeddings
After learning word embeddings for each word $w \in \mathcal{V}$, we then compute the distance of a word between any two regions $(r_i, r_j)$ as $\mathrm{Score}(w) = \mathrm{CosineDistance}(\phi_{r_i}(w), \phi_{r_j}(w))$, where $\mathrm{CosineDistance}(u, v)$ is defined by $1 - \frac{u^{\top} v}{\|u\|_2 \, \|v\|_2}$.
Figure 6 illustrates the information captured by our geodist method as a two-dimensional projection of the latent semantic space learned, for the word theatre. In the US, the British spelling theatre is typically used only to refer to the performing arts. Observe how the word theatre in the US is close to other subjects of study (sciences, literature, anthropology), whereas theatre as used in the UK is close to places showcasing performances (like opera, studio, etc.). We emphasize that these regional differences detected by geodist are inherently semantic, the result of a level of language understanding unattainable by methods which focus solely on lexical variation.
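A minimal sketch of the per-word distance between two regions, assuming the distance is the cosine distance between the two region-specific vectors. The two-dimensional vectors below are made-up stand-ins for learned embeddings.

```python
import math


def cosine_distance(u, v):
    """1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def geodist_score(word, emb_i, emb_j):
    """Distance between a word's embeddings in two regions."""
    return cosine_distance(emb_i[word], emb_j[word])


# Hypothetical region-specific embeddings (word -> vector).
emb_us = {"theatre": [1.0, 0.0], "hand": [0.6, 0.8]}
emb_uk = {"theatre": [0.0, 1.0], "hand": [0.6, 0.8]}

score_theatre = geodist_score("theatre", emb_us, emb_uk)
score_hand = geodist_score("hand", emb_us, emb_uk)
```

A word whose regional vectors coincide scores near 0, while orthogonal vectors score 1.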
3.3 Statistical Significance of Changes
In this section, we outline our method to quantify whether an observed change given by $\mathrm{Score}(w)$ is significant. When one is operating on an entire population (or in the absence of stochastic processes), one fairly standard method to identify outliers is the $Z$-value test (obtained by standardizing the raw scores), marking samples whose $Z$-value exceeds a threshold (typically set to a fixed percentile) as outliers.
However, since in our method $\mathrm{Score}(w)$ could vary due to random stochastic processes (possibly even pure chance), whether an observed score is significant or not depends on two factors: (a) the magnitude of the observed score (effect size) and (b) the probability of obtaining a score more extreme than the observed score, even in the absence of a true effect.
Specifically, given a word $w$ with a score $\mathrm{Score}(w)$ between regions $(r_i, r_j)$, we ask the question: “What is the chance of observing $\mathrm{Score}(w)$ or a more extreme value assuming the absence of an effect?”
First our method explicitly models the scenario when there is no effect, which we term as the null model. Next we characterize the distribution of scores under the null model. Our method then compares the observed score with this distribution of scores to ascertain the significance of the observed score. The details of our method are described in Algorithm 1 and below.
We simulate the null model by observing that under the null model, the labels of the text are exchangeable. Therefore, we generate a corpus $\mathcal{C}'$ by a random assignment of the labels (regions) of the given corpus $\mathcal{C}$. We then learn a model using $\mathcal{C}'$ and estimate $\mathrm{Score}(w)$ under this model. By repeating this procedure $B$ times, we estimate the distribution of scores for each word under the null model (Lines 4 to 13).
After we estimate the distribution of scores, we then compute the confidence interval on $\mathrm{Score}(w)$ under the null model. Thus for each word $w$, we specify two measures: (a) the observed effect size and (b) the confidence interval corresponding to the null distribution, computed at a fixed confidence level (Lines 20-21). When the observed effect is not contained in the confidence interval obtained for the null distribution, the effect is statistically significant at the corresponding significance level.
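The label-shuffling procedure can be sketched as below. This is a simplified illustration, not the authors' Algorithm 1: to keep it runnable, a cheap relative-frequency gap stands in for the embedding-based score (which would require retraining a model on every shuffled corpus), and the 0.95 default confidence level, the number of rounds, and the toy documents are assumptions.

```python
import random
from collections import Counter


def frequency_gap(docs, word, region_i, region_j):
    """Stand-in score: absolute difference of the word's relative frequency."""
    def rel_freq(region):
        tokens = [t for r, text in docs if r == region for t in text.split()]
        return Counter(tokens)[word] / max(len(tokens), 1)
    return abs(rel_freq(region_i) - rel_freq(region_j))


def null_scores(docs, word, region_i, region_j, score_fn, n_rounds, rng):
    """Scores obtained after randomly reassigning region labels to documents."""
    labels = [r for r, _ in docs]
    texts = [text for _, text in docs]
    scores = []
    for _ in range(n_rounds):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        relabeled = list(zip(shuffled, texts))
        scores.append(score_fn(relabeled, word, region_i, region_j))
    return scores


def confidence_interval(samples, level=0.95):
    """Empirical percentile interval of the null scores (level is assumed)."""
    ordered = sorted(samples)
    tail = (1.0 - level) / 2.0
    low = ordered[int(tail * (len(ordered) - 1))]
    high = ordered[int(round((1.0 - tail) * (len(ordered) - 1)))]
    return low, high


rng = random.Random(0)
docs = [("NY", "buffalo is my city")] * 20 + [("CA", "the weather is nice")] * 20
observed = frequency_gap(docs, "buffalo", "NY", "CA")
null = null_scores(docs, "buffalo", "NY", "CA", frequency_gap, 200, rng)
low, high = confidence_interval(null)
significant = not (low <= observed <= high)
```

Passing the score as a function keeps the shuffling logic independent of how the score is computed, so an embedding-based score could be substituted without changing the rest.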
Even though $p$-values have been traditionally used to report significance, recently researchers have argued against their use, as $p$-values themselves do not indicate what the observed effect size was and hence even very small effects can be deemed statistically significant [39, 14]. In contrast, reporting effect sizes and confidence intervals enables us to factor in the magnitude of the effect size while interpreting significance. In a nutshell, therefore, we deem a change observed for $w$ as statistically significant when:
The effect size exceeds a threshold $\beta$, which ensures the effect size is large enough. One typically standardizes the effect size and sets $\beta$ to a fixed percentile of the standardized scores.
It is rare to observe this effect as a result of pure chance. This is captured by our comparison to the null model and the confidence intervals computed.
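The two criteria above can be combined into a single filter. The sketch below is illustrative only: the scores, the null intervals, and the threshold value are made-up numbers, and `significant_words` is a hypothetical helper name.

```python
import statistics


def standardize(scores):
    """Z-scores of the raw effect sizes across all words."""
    values = list(scores.values())
    mean, std = statistics.mean(values), statistics.pstdev(values)
    return {w: (s - mean) / std for w, s in scores.items()}


def significant_words(scores, null_intervals, beta):
    """Words whose standardized effect exceeds beta AND whose raw score
    falls outside the confidence interval of the null distribution."""
    z = standardize(scores)
    selected = []
    for word, score in scores.items():
        low, high = null_intervals[word]
        if z[word] > beta and not (low <= score <= high):
            selected.append(word)
    return sorted(selected)


# Made-up scores and null-model intervals for illustration.
scores = {"hand": 0.85, "buffalo": 0.90, "the": 0.12,
          "house": 0.11, "walk": 0.09, "city": 0.30}
null_intervals = {"hand": (0.10, 0.95), "buffalo": (0.05, 0.40),
                  "the": (0.0, 0.5), "house": (0.0, 0.5),
                  "walk": (0.0, 0.5), "city": (0.0, 0.5)}
detected = significant_words(scores, null_intervals, beta=1.0)
```

In this toy setup both hand and buffalo have large standardized scores, but only buffalo falls outside its null interval, so only buffalo is reported.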
Figure 7 illustrates this for two words: hand and buffalo. Observe that for hand, the observed score is smaller than the upper bound of the confidence interval, indicating that hand has not changed significantly. In contrast, buffalo, which is used differently in New York (since Buffalo refers to a place in New York), has a score well above the upper bound of the confidence interval under the null model.
As we will also see in Section 5, the incorporation of the null model and obtaining confidence estimates enables our method to efficaciously tease out effects arising due to random chance from statistically significant effects.
4 Datasets
Here we outline the details of the two online datasets that we consider: Tweets from various geographic locations on Twitter and the Google Books Ngram Corpus.
The Google Books Ngram Corpus
The Google Books Ngram Corpus contains frequencies of short phrases of text (ngrams) which were taken from books spanning eight languages over five centuries. While these ngrams vary in size, we use the 5-grams in our experiments. Specifically, we use the Google Books Ngram Corpus corpora for American English and British English, and draw a random sample of ngrams (numbering in the millions) for our experiments. Here, we show a sample of 5-grams along with their region:
Table 1: Examples of words detected by the Frequency method.

| Word | Explanation |
|---|---|
| zucchini | “zucchinis” are known as “courgettes” in the UK |
| touchdown | “touchdown” is a term in American football |
| bartender | “bartender” is a very recent addition to the pub language in the UK |
| freshman | “freshmen” are referred to as “freshers” in the UK |
| hmu | “hit me up”, a slang phrase which is popular in the USA |
| maccas | McDonald’s in Australia is called “maccas” |
| wickets | “wickets” is a term in cricket, a popular game in Australia |
| heaps | Australian colloquialism for “a lot” |
Table 2: Examples of words detected by the Syntactic method.

| Word | JS | US Usage | UK Usage |
|---|---|---|---|
| remit | | remit the loan | The jury investigated issues within its remit (an assigned area). |
| oracle | | Oracle the company | a person who is omniscient |
| wad | | a wad of cotton | Wad the paper towel and throw it! (used as “to compress”) |
| sort | | He’s not a bad sort | sort it out |
| lift | | lift the bag | I am stuck in the lift (elevator) |
| ring | | ring on my finger | give him a ring (call) |
| cracking | | The ice is cracking | The girl is cracking (beautiful) |
| cuddle | | Let her cuddle the baby (verb) | Come here and give me a cuddle (noun) |
| dear | | dear relatives | Something is dear (expensive) |

| Word | US Usage | AU Usage |
|---|---|---|
| kisses | hugs and kisses (as a noun) | He kisses them (verb) |
| claim | He made an insurance claim (noun) | I claim … (almost always used as a verb) |
Twitter Data
This dataset consists of a sample of Tweets spanning 24 months starting from September 2011 to October 2013. Each Tweet includes the Tweet ID, the Tweet text, and the geo-location if available. We partition these Tweets by their location in two ways:
States in the USA: We consider Tweets originating in the United States and group them by the state they originated from. The joint corpus consists of millions of Tweets.
Countries: We consider millions of Tweets originating from the USA, UK, India (IN) and Australia (AU), and partition the Tweets among these four countries.
Some sample Tweet text is shown below:
In order to obtain part-of-speech tags for the Tweets, we use the TweetNLP POS Tagger.
5 Results and Analysis
In this section, we apply our methods to various data sets described above to identify words that are used differently across various geographic regions. We describe the results of our experiments below.
5.1 Geographical Variation Analysis
Table 1 shows words which are detected by the Frequency method. Note that zucchini is used rarely in the UK because a zucchini is referred to as a courgette in the UK. Yet another example is the word freshman which refers to a student in their first year at college in the US. However in the UK a freshman is known as a fresher. The Frequency method also detects terms that are specific to regional cultures like touchdown, an American football term and hence used very frequently in the US.
As we noted in Section 3.1, the Syntactic method detects words which differ in their syntactic roles. Table 2 shows words like lift and cuddle, which are used as verbs in the US but predominantly as nouns in the UK. In particular, lift in the UK also refers to an elevator. While in the USA the word cracking is typically used as a verb (as in “the ice is cracking”), in the UK cracking is also used as an adjective and means “stunningly beautiful”. The Frequency method, in contrast, would not be able to detect such syntactic variation since it focuses only on usage counts and not on syntax.
In Tables 2(a) and 2(b) we show several words identified by our geodist method. While theatre refers primarily to a building (where events are held) in the UK, in the US theatre refers primarily to the study of the performing arts. The word extract is yet another example: extract in the US refers to food extracts but is used primarily as a verb in the UK. While in the US the word test almost always refers to an exam, in India test has an additional meaning of a cricket match that is typically played over five days. An example usage of this meaning is “We are going to see the test match between India and Australia” or “The test was drawn.” We reiterate here that the geodist method picks up on finer distributional cues that the Syntactic and Frequency methods cannot detect. To illustrate this, observe that theatre is still used predominantly as a noun in both the UK and the USA, but the usages differ in semantics, which the Syntactic method fails to detect.
Table 4 (column headers: Naive Distances, Null Model, geodist (Our Method)).
Another clear pattern that emerges is “code-mixed words”, which are regional language words that are incorporated into the variant of English (while still retaining their meaning in the regional language). Examples of such words include main and hum, which in India also mean “I” and “We” respectively, in addition to their standard meanings. In Indian English, one can use main as in “the main job is done” as well as “main free at noon. what about you?”. In the second sentence main refers to “I”, and the sentence means “I am free at noon. what about you?”.
Furthermore, we demonstrate that our method is capable of detecting changes in word meaning (usage) at finer scales (within states in a country). Table 4 shows a sample of the words in states of the USA which differ in semantic usage markedly from their overall semantics globally across the country.
Note that the usage of buffalo significantly differs in New York as compared to the rest of the USA. buffalo typically would refer to an animal in the rest of USA, but it refers to a place named Buffalo in New York. The word queens is yet another example where people in New York almost always refer to it as a place.
Other clear trends are words that are typically associated with particular states. Examples of such words include golden, space and twins. The word golden in California almost always refers to the Golden Gate Bridge, and space in Washington refers to the Space Needle. While twins in the rest of the country is dominantly associated with twin babies (or twin brothers), in the state of Minnesota twins also refers to the state’s baseball team, the Minnesota Twins.
Table 4 also illustrates the significance of incorporating the null model to detect which changes are significant. Observe how incorporating the null model renders several observed changes as being not significant thus highlighting statistically significant changes. Without incorporating the null model, one would erroneously conclude that hand has different semantic usage in several states. However on incorporating the null model, we notice that these are very likely due to random chance thus enabling us to reject this as signifying a true change.
These examples demonstrate the capability of our method to detect a wide variety of variation across different scales of geography, spanning regional differences to code-mixed words.
6 Semantic Distance
In this section we investigate the following question: Are British and American English converging or diverging over time semantically?
In order to measure semantic distance between languages through time, we propose a measure of semantic distance between two variants of the language at a given point in time $t$. Specifically, at a given time $t$, we are given a corpus $\mathcal{C}_t$ and a pair of regions $(r_i, r_j)$. Using our method (see Section 3.2) we compute the standardized distance $\mathrm{Score}_t(w)$ for each word $w$ between the regions at time point $t$. Then, we construct the intersection $W$ of the sets of words that have been deemed to have changed significantly at each time point $t$. We do this so that (a) we focus on only the words that were significantly different between the language dialects at each time point and (b) the words identified as different are stable across time, allowing us to track the usage of the same set of divergent words over time. Our measure of the semantic distance between the two language dialects at time $t$ is then $\frac{1}{|W|} \sum_{w \in W} \mathrm{Score}_t(w)$, the mean of the distances of words in $W$.
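The measure described above can be sketched as follows: intersect the sets of significantly different words across all time points, then average the standardized distances of those words at each time point. The years, words, and scores are made-up toy values, and `semantic_distance` is a hypothetical helper name.

```python
def semantic_distance(scores_by_time, significant_by_time):
    """Mean standardized distance, per time point, over the words that are
    significantly different at every time point."""
    stable_words = set.intersection(*significant_by_time.values())
    if not stable_words:
        raise ValueError("no word is significant at every time point")
    return {
        t: sum(scores[w] for w in stable_words) / len(stable_words)
        for t, scores in scores_by_time.items()
    }


# Toy standardized distances between two dialects at three time points.
scores_by_time = {
    1900: {"acts": 3.0, "lift": 2.0, "hand": 0.1},
    1950: {"acts": 2.0, "lift": 2.0, "hand": 0.2},
    2000: {"acts": 1.0, "lift": 2.0, "hand": 2.5},
}
significant_by_time = {
    1900: {"acts", "lift"},
    1950: {"acts", "lift"},
    2000: {"acts", "lift", "hand"},
}
distance = semantic_distance(scores_by_time, significant_by_time)
```

Here hand is significant only at the last time point, so it is excluded from the stable set and the measure tracks acts and lift throughout.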
In our experiment, we considered the Google Books Ngram Corpus for UK English and US English over a multi-year time span, divided into fixed windows of years. We computed the semantic distance between these dialects as described above, which we present in Figure 3. We clearly observe a trend showing that British and American English are converging. Figure 8 shows one such word, acts, where the usage in the UK starts converging to the usage in the US. Before the 1950s, acts in British English was primarily used as a legal term (with ordinances, enactments, laws, etc.). American English, on the other hand, used acts to refer to actions (as in acts of vandalism, acts of sabotage). Subsequently, British English started adopting the American usage.
We hypothesize that this effect is observed due to globalization (the invention of radio, TV and the Internet), but leave a rigorous investigation of this phenomenon to future work.
While our measure of semantic distance between languages does not capture lexical variation, introduction of new words etc, our work opens the door for future research to design better metrics for measuring semantic distances while also accounting for other forms of variation.
7 Related Work
Most of the related work can be organized into two areas: (a) Socio-variational linguistics (b) Word embeddings
While previous work like [20, 7, 23, 22, 9] focuses on temporal analysis of language variation, our work centers on methods to detect and analyze linguistic variation according to geography. A majority of these works also either restrict themselves to two time periods or do not outline methods to detect when changes are significant. Recently, Kulkarni et al. proposed methods to detect statistically significant linguistic change over time that hinge on time series analysis. Since their methods explicitly model word evolution as a time series, they cannot be trivially applied to detect geographical variation.
Several works on geographic variation [4, 15, 35, 13] focus on lexical variation. Bamman and others study lexical variation in social media like Twitter based on gender identity. Eisenstein et al. describe a latent variable model to capture geographic lexical variation. Eisenstein et al. outline a model to capture diffusion of lexical variation in social media. Different from these studies, our work seeks to identify semantic changes in word meaning (usage) not limited to lexical variation. The work that is most closely related to ours is that of Bamman, Dyer, and Smith. They propose a method to obtain geographically situated word embeddings and evaluate them on a semantic similarity task that seeks to identify words accounting for geographical location. Their evaluation typically focuses on named entities that are specific to geographic regions. Our work differs in several aspects: unlike their work, which does not explicitly seek to identify which words vary in semantics across regions, we propose methods to detect and identify which words vary across regions. While our work builds on their work to learn region-specific word embeddings, we differentiate our work by proposing an appropriate null model, quantifying the change and assessing its significance. Furthermore, our work is unique in the fact that we evaluate our method comprehensively on multiple web-scale datasets at different scales (both at a country level and state level).
Measures of semantic distance have been developed for units of language (words, concepts, etc.), for which Mohammad and Hirst provide an excellent survey. Cooper studies the problem of measuring semantic distance between languages by attempting to capture the relative difficulty of translating various pairs of languages using bilingual dictionaries. Different from that work, we measure semantic distance between language dialects in an unsupervised manner (using word embeddings) and also analyze convergence patterns of language dialects over time.
The concept of using distributed representations to learn a mapping from symbolic data to continuous space dates back to Hinton. In a landmark paper, Bengio et al. proposed a neural language model to learn word embeddings and demonstrated that they outperform traditional n-gram based models. Mikolov et al. proposed Skipgram models for learning word embeddings and demonstrated that they capture fine-grained structures and linguistic regularities [28, 30]. Perozzi et al. also induce language networks over word embeddings to reveal rich but varied community structure. Finally, these embeddings have been demonstrated to be useful features for several NLP tasks [11, 3, 2, 10].
8 Conclusions
In this work, we proposed a new method to detect linguistic change across geographic regions. Our method explicitly accounts for random variation, quantifying not only the change but also its significance. This allows for more precise detection than previous methods.
We comprehensively evaluate our method on large datasets at different levels of granularity, from states in a country to countries spread across continents. Our methods are capable of detecting a rich set of changes attributed to word semantics, syntax, and code-mixing. Using our method, we are able to characterize the semantic distances between dialectal variants over time. Specifically, we are able to observe the semantic convergence between British and American English over time, potentially an effect of globalization. This promising (although preliminary) result points to exciting research directions for future work.
Acknowledgments
We thank David Bamman for sharing the code for training situated word embeddings. We thank Yingtao Tian for valuable comments.
References
- [1] Aggarwal, C. C. 2013. Outlier analysis. Springer Science & Business Media.
- [2] Al-Rfou, R.; Kulkarni, V.; Perozzi, B.; and Skiena, S. Polyglot-ner: Massive multilingual named entity recognition. In SDM.
- [3] Al-Rfou, R.; Perozzi, B.; and Skiena, S. 2013. Polyglot: Distributed word representations for multilingual nlp. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning.
- [4] Bamman, D., et al. 2014. Gender identity and lexical variation in social media. Journal of Sociolinguistics.
- [5] Bamman, D.; Dyer, C.; and Smith, N. A. 2014. Distributed representations of geographically situated language. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) 828–834.
- [6] Bengio, Y.; Schwenk, H.; Senecal, J.-S.; Morin, F.; and Gauvain, J.-L. Neural probabilistic language models. Innovations in Machine Learning.
- [7] Berners-Lee, T.; Hendler, J.; Lassila, O.; et al. 2001. The Semantic Web. Scientific American.
- [8] Stochastic gradient learning in neural networks. In Proceedings of Neuro-Nîmes 91.
- [9] Brigadir, I.; Greene, D.; and Cunningham, P. 2015. Analyzing discourse communities with distributional semantic models. In ACM Web Science 2015 Conference. ACM.
- [10] Chen, Y.; Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2013. The expressive power of word embeddings. arXiv preprint arXiv:1301.3226.
- [11] Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. 2011. Natural language processing (almost) from scratch. JMLR.
- [12] Cooper, M. C. 2008. Measuring the semantic distance between languages from a statistical analysis of bilingual dictionaries. Journal of Quantitative Linguistics.
- [13] Doyle, G. 2014. Mapping dialectal variation by querying social media. In EACL.
- [14] du Prel, J.-B.; Hommel, G.; Röhrig, B.; and Blettner, M. 2009. Confidence interval or p-value?: part 4 of a series on evaluation of scientific publications. Deutsches Ärzteblatt International.
- [15] Eisenstein, J.; O’Connor, B.; Smith, N. A.; and Xing, E. P. 2010. A latent variable model for geographic lexical variation. In EMNLP.
- [16] Eisenstein, J.; O’Connor, B.; Smith, N. A.; and Xing, E. P. 2014. Diffusion of lexical change in social media. PLoS ONE.
- [17] Eisenstein, J.; Smith, N. A.; et al. 2011. Discovering sociolinguistic associations with structured sparsity. In ACL-HLT.
- [18] Goldberg, Y., and Orwant, J. 2013. A dataset of syntactic-ngrams over time from a very large corpus of english books. In *SEM.
- [19] Gonçalves, B., and Sánchez, D. 2014. Crowdsourcing dialect characterization through twitter.
- [20] Gulordava, K., and Baroni, M. 2011. A distributional similarity approach to the detection of semantic change in the google books ngram corpus. In GEMS.
- [21] Hinton, G. E. 1986. Learning distributed representations of concepts. In Proceedings of the eighth annual conference of the cognitive science society.
- [22] Kenter, T.; Wevers, M.; Huijnen, P.; et al. 2015. Ad hoc monitoring of vocabulary shifts over time. In CIKM. ACM.
- [23] Kim, Y.; Chiu, Y.-I.; Hanaki, K.; Hegde, D.; and Petrov, S. 2014. Temporal analysis of language through neural language models. In ACL.
- [24] Kulkarni, V.; Al-Rfou, R.; Perozzi, B.; and Skiena, S. 2015. Statistically significant detection of linguistic change. In WWW.
- [25] Labov, W. 1980. Locating language in time and space / edited by William Labov. Academic Press New York.
- [26] Lin, Y.; Michel, J.-B.; Aiden, E. L.; et al. 2012. Syntactic annotations for the google books ngram corpus. In Proceedings of the ACL 2012 system demonstrations.
- [27] Michel, J.-B., et al. 2011. Quantitative analysis of culture using millions of digitized books. Science 331(6014):176–182.
- [28] Mikolov, T., et al. 2013. Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT.
- [29] Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- [30] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013b. Distributed representations of words and phrases and their compositionality. In NIPS.
- [31] Milroy, J. 1992. Linguistic variation and change: on the historical sociolinguistics of English. B. Blackwell.
- [32] Mnih, A., and Hinton, G. E. 2009. A scalable hierarchical distributed language model. NIPS.
- [33] Mohammad, S. M., and Hirst, G. 2012. Distributional measures of semantic distance: A survey. arXiv preprint arXiv:1203.1858.
- [34] Morin, F., and Bengio, Y. Hierarchical probabilistic neural network language model. Proceedings of the international workshop on artificial intelligence and statistics.
- [35] O’Connor, B.; Eisenstein, J.; Xing, E. P.; and Smith, N. A. 2010. Discovering demographic language variation. In Proc. of NIPS Workshop on Machine Learning for Social Computing.
- [36] Owoputi, O.; O’Connor, B.; et al. 2013. Improved part-of-speech tagging for online conversational text with word clusters. Association for Computational Linguistics.
- [37] Perozzi, B.; Al-Rfou, R.; et al. 2014. Inducing language networks from continuous space word representations. In Complex Networks V.
- [38] Rumelhart, D. E.; Hinton, G. E.; and Williams, R. J. 2002. Learning representations by back-propagating errors. Cognitive modeling 1:213.
- [39] Sullivan, G. M., and Feinn, R. 2012. Using effect size-or why the p value is not enough. Journal of graduate medical education.
- [40] Tagliamonte, S. A. 2006. Analysing Sociolinguistic Variation. Cambridge University Press.
- [41] Wolfram, W., and Schilling-Estes, N. 2005. American English: dialects and variation, volume 20. Wiley-Blackwell.