Exploiting user-frequency information for mining regionalisms from Social Media texts

07/10/2019 ∙ by Juan Manuel Pérez, et al. ∙ University of Buenos Aires

The task of detecting regionalisms (expressions or words used in certain regions) has traditionally relied on the use of questionnaires and surveys, and has also heavily depended on the expertise and intuition of the surveyor. The irruption of Social Media and its microblogging services has produced an unprecedented wealth of content, mainly informal text generated by users, opening new opportunities for linguists to extend their studies of language variation. Previous work on automatic detection of regionalisms depended mostly on word frequencies. In this work, we present a novel metric based on Information Theory that incorporates user frequency. We tested this metric on a corpus of Argentinian Spanish tweets in two ways: via manual annotation of the relevance of the retrieved terms, and also as a feature selection method for geolocation of users. In both cases, our metric outperformed other techniques based solely on word frequency, suggesting that measuring the number of users who produce a word is informative. This tool has helped lexicographers discover several unregistered words of Argentinian Spanish, as well as different meanings assigned to registered words.


1 Introduction

Lexicography is the art of writing (designing, compiling, editing) dictionaries: that is, the description of the vocabulary used by members of a speech community Atkins and Rundell (2008). In the last 30 years, tools coming from Computational Linguistics have helped with this kind of work, mainly in the form of corpora of selected texts. Statistical analysis of corpora results in evidence to support the addition or removal of a word from a dictionary, its marking as dated or unused, as regional, etc., depending on different criteria.

In the process of compiling dictionaries, differences emerge between dialects, where frequently certain words or meanings do not span across all speakers. Since languages are ideal constructs based on the observation of dialects, it is of paramount importance to establish which words are most likely to be shared by an entire linguistic community and which are only used by a smaller group. In the latter case, the description profits greatly from information that is as precise as possible about geographical extension (region, province, district, city, even neighborhood), register (colloquial, neutral, formal), frequency (current, past, or a combination of both, depending on the chronological span of the corpus), or any other variable.

Words that are used exclusively or mainly in a particular subregion of the territory occupied by a linguistic community, or that are used there with a different meaning, are called regionalisms, localisms or dialectal words. For example, the words “che” (an interjection used to get the interlocutor’s attention) and “metegol” (a mechanical game that emulates football, i.e. futbolín; Academia Argentina de Letras (2008)) are used more frequently in Argentina than in Spain. Such words are commonly detected through surveys Almeida and Vidal (1995); Labov et al. (2005) or transcriptions, using methods that depend more or less on the intuition and expertise of linguists. The results of this methodology are of great value to lexicographers, who need evidence to support either the addition of a word into a regional dictionary or the indication of where it is used. Information gathered with these traditional methods has been used as lexical variables to computationally calculate similarities in dialects Kessler (1995); Nerbonne et al. (1996).

The irruption of Social Media and its microblogging services has produced an unprecedented wealth of content, with a clear tendency towards informal or colloquial text generated by users. This opens many opportunities for linguists due to the possibility of accessing geotagged content, which provides valuable information about the origin of users. Social media texts have been used to study dialects and establish “continuous” isoglosses Gonçalves and Sánchez (2014); Huang et al. (2016), to study language diffusion Eisenstein et al. (2014), and for other linguistic studies.

A problem intimately related to lexical dialectology is that of geolocation. These can be seen as inverse problems: one maps regions into dialectal words; the other maps words to regions (locations) Eisenstein (2014). Thus, a way to assess dialectometric models is to use them in geolocation algorithms. In fact, regionalisms can be seen as location-indicative words Han et al. (2012).

Most previous work on word-centric geolocation algorithms (and lexical dialectology) relies on the observed frequency of a certain word, ignoring the number of users producing it. Also, very little work has been performed on Spanish.

In this work, we present an information-theoretic measure to detect regionalisms in Social Media texts, particularly on Twitter. Our contributions are twofold: a) we introduce a new metric based on Information Theory which can be seen as a mixture of TF-IDF and Information Gain; and b) we show that measuring the dispersion of users is a strong indicator of relevance, both for lexical dialectology and for geolocation. We conduct our experiments on a dataset of 81M tweets in Argentinian Spanish from 56K users, balanced across the country’s 23 provinces.

2 Previous Work

Most of the previous work in lexical dialectometry consists in measuring words known a priori to be regional variants. These works typically use features gathered from sources such as web searches Grieve et al. (2013) and manually-collected regionalisms Ueda and Ruiz Tinoco (2003); Kessler (1995). Even works analyzing data from Twitter Huang et al. (2016); Gonçalves and Sánchez (2014) still rely on words known a priori to discover dialectal patterns.

Language evolves so quickly that it is important to detect these contrastive words automatically – or at least to reduce the manual effort needed to detect them. Two types of approaches exist for this problem: model-based and metric-based Rahimi et al. (2017a).

Model-based approaches use generative models to detect topics and regional variants Eisenstein et al. (2010); Ahmed et al. (2013). Topic-modelling approaches such as these suffer from high algorithmic complexity, which limits the amount of data they can process.

Metric-based approaches Cook et al. (2014); Chang et al. (2012); Jimenez et al. (2018); Monroe et al. (2008) compute a statistic for each word or expression and then rank expressions accordingly. The resulting word lists can be evaluated against an external source of regionalisms, such as a thesaurus or dictionary. These methods are usually faster and more scalable, but their rankings may be contaminated by topical rather than dialectal words.

In particular, we compare our metrics with those of Han et al. (2012): Term-Frequency Inverse Location Frequency (TF-ILF) and Information-Gain Ratio. We refer to them in the following section.

Text-based geolocation can be seen as the inverse problem of dialectology: while dialectology maps regions to text, geolocation maps text to regions Eisenstein (2014). Thus, a reasonable way of assessing a method for discovering regional words is to use it as a feature-selection method for a geolocation classifier, as performed in Han et al. (2012). In this work, we used provinces as our unit of study, but finer-grained geolocation could be performed by using an adaptive grid Roller et al. (2012).

Rahimi et al. (2017b) propose a different approach to this problem: the authors train a multilayer perceptron with bag-of-words input to geolocate users. Intermediate layers serve as vector representations that allow lexical analysis by inspecting proximity in the embedding space.

Information Theory is at the basis of many of these methods Han et al. (2012); Roller et al. (2012); Chang et al. (2012). Other uses of information-theoretic measures include telling whether a hashtag is promoted by spammers by analyzing its dispersion in time and across users Cui et al. (2012); Ghosh et al. (2011), and discovering valuable features from user messages on Twitter for sentiment analysis and opinion mining Pak and Paroubek (2010). The metrics in the next section use this concept of measuring the entropy over the users of a word.

3 Method and Materials

Data

Figure 1: Distributional figures of the dataset. Left: Distribution of number of tweets per user. Right: Distribution of length (in words) of tweets.

To gather our data, information about the departments of Argentina (the second-level administrative division of the country, after provinces) was collected from the 2010 National Census (https://www.indec.gov.ar). Next, a lookup was made through the Twitter API for users whose location field matched those departments.

Although location fields in Twitter are often not to be trusted Hecht et al. (2011), restricting them to a fixed set of department names removes most of the noise. The Python library tweepy was used to interact with the Twitter API.

For each of these users, we retrieved their entire timelines. Tweets were tokenized using NLTK Bird et al. (2009). Hashtags and mentions of users were removed; the remaining words were downcased; and identical consecutive vowels were normalized to at most three repetitions (“woaaa” instead of “woaaaaaa”). Table 1 lists the figures for the collected dataset, and Figure 1 displays the distributions of tweets per user and of tweet lengths.

Total Mean SD
Words 647M 28.14M 6.64M
Tweets 80.9M 3.51M 0.91M
Users 56.2K 2.44K 0.04K
Vocabulary 7.5M 0.32M 0.04M
Table 1: Dataset summary. Total figures are provided, along with province-level mean and standard deviation.
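As an illustration of the preprocessing just described, the following is a minimal Python sketch, assuming tweets are available as plain strings; the tokenizer choice and the exact vowel-normalization rule are stand-ins rather than the authors' exact pipeline.

```python
import re
from nltk.tokenize import TweetTokenizer  # NLTK's Twitter-aware tokenizer

tokenizer = TweetTokenizer()

def preprocess(tweet: str) -> list:
    """Tokenize a tweet, drop hashtags and mentions, lowercase,
    and cap runs of identical vowels at three repetitions."""
    words = []
    for tok in tokenizer.tokenize(tweet):
        if tok.startswith("#") or tok.startswith("@"):
            continue  # hashtags and user mentions are removed
        tok = tok.lower()
        # collapse 4+ identical consecutive vowels down to 3 ("woaaaaaa" -> "woaaa")
        tok = re.sub(r"([aeiou])\1{3,}", r"\1\1\1", tok)
        words.append(tok)
    return words

print(preprocess("Che @amigo qué buen metegol woaaaaaa #finde"))
# e.g. ['che', 'qué', 'buen', 'metegol', 'woaaa']
```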

It is well known that Twitter vocabulary tends to be very noisy Kaufmann and Kalita (2010), with many contractions, non-standard spellings (e.g., vocalizations), typos, etc. Consequently, only words occurring more than 40 times and used by more than 25 users were taken into account. This removes about 1% of the total words and reduces the vocabulary from 2.3 million words to around 135 thousand words.
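A sketch of this thresholding step, assuming per-word occurrence counters and per-word user sets have been accumulated from the preprocessed corpus; all names here are illustrative.

```python
from collections import Counter, defaultdict

word_counts = Counter()          # word -> total occurrences in the corpus
word_users = defaultdict(set)    # word -> set of user ids that produced it

def accumulate(user_id, tokens):
    """Update the counters with one preprocessed tweet."""
    for tok in tokens:
        word_counts[tok] += 1
        word_users[tok].add(user_id)

MIN_OCCURRENCES = 40   # keep words occurring more than 40 times...
MIN_USERS = 25         # ...and used by more than 25 distinct users

def filtered_vocabulary():
    return {
        w for w, c in word_counts.items()
        if c > MIN_OCCURRENCES and len(word_users[w]) > MIN_USERS
    }
```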

Method

We can think of a regionalism as a word whose usage is not uniform across all the studied territory – i.e., whose concentration is high in a specific region of the country. We are trying, in fact, to measure the disorder in the usage of a word, and there exists a specific information-theoretic tool for this: entropy.

It is known that entropy holds information about the semantic role played by a word. Given a text, high-entropy words are more likely to be pronouns, connectors and other closed-class words, whereas low-entropy words are usually nouns and adjectives with fuller semantic content Montemurro and Zanette (2002, 2010).

Considering how their occurrences are distributed across locations, words with high entropy (i.e., high disorder) can be regarded as being used evenly all across the country. On the other hand, low-entropy words are used with higher frequency in a few specific locations.

Let $L = \{l_1, \ldots, l_m\}$ be our locations, and $V$ our vocabulary. If $O_w$ refers to the event of occurrence of word $w$, then $P(l \mid O_w)$ denotes the probability that word $w$ occurred in location $l$.

We next define the word-count entropy as

$$H_{words}(w) = - \sum_{l \in L} P(l \mid O_w) \log P(l \mid O_w) \qquad (1)$$

Note that this measure does not take into account the actual frequency of words. For instance, if two words $w_1$ and $w_2$ occur only in one particular location, but $w_1$ is much more frequent than $w_2$, both words will still have the same entropy according to Equation 1.

In a similar fashion to tf-idf and inspired by Montemurro and Zanette (2010) and Han et al. (2012), we define measure $I_{words}(w)$ for word $w$ as follows:

$$I_{words}(w) = \left( H_{max} - H_{words}(w) \right) f_w \qquad (2)$$

where $H_{max} = \log m$ is the maximum possible value of the entropy Shannon (2001), and $f_w$ is the relative frequency of $w$ in the corpus ($f_w = count(w) / \sum_{v \in V} count(v)$). In this way, $I_{words}(w)$ will be high for frequent words that accumulate in just a few locations.

Another important aspect of a word is the amount of people using it on Twitter Cui et al. (2012). Assuming we are now sampling users, let $U_w$ be the event that a particular user uses word $w$. Then $P(l \mid U_w)$ denotes the probability that the location of a user is $l$ given the fact that s/he uses word $w$. We define the user-count entropy as

$$H_{users}(w) = - \sum_{l \in L} P(l \mid U_w) \log P(l \mid U_w) \qquad (3)$$

and the following metric of $w$,

$$I_{users}(w) = \left( H_{max} - H_{users}(w) \right) u_w \qquad (4)$$

where $u_w$ is the proportion of users who mentioned $w$ in the corpus ($u_w = users(w) / |U|$, with $users(w)$ the number of distinct users of $w$ and $U$ the set of all users). Note that $I_{users}(w)$ will be high for words mentioned by several users who accumulate in just a few locations.

According to Zipf’s Law, the frequencies of top-used words are many orders of magnitude higher than others – a phenomenon also true when counting users of words. So the $f_w$ and $u_w$ terms in equations (2) and (4) become a problem as words with high frequencies overcome their low entropies. To alleviate this, we performed a normalization on the word frequency as follows. Let $w^*$ be the most-frequent word, that is,

$$w^* = \operatorname{argmax}_{v \in V} \, count(v) \qquad (5)$$

where $count(v)$ denotes the total number of occurrences of $v$ in our dataset. Then, the Normalized log-frequency of word occurrences is defined as

$$nlf(w) = \frac{\log count(w)}{\log count(w^*)} \qquad (6)$$

Words with very high frequency differ little on their values of $nlf$. We define analogously the Normalized log-frequency of user mentions, $nlu(w)$. Hence, we redefine our two metrics as

$$\text{LTF-IG}(w) = \left( H_{max} - H_{words}(w) \right) nlf(w) \qquad (7)$$
$$\text{LUF-IG}(w) = \left( H_{max} - H_{users}(w) \right) nlu(w) \qquad (8)$$
We call the first metric Log-Term Frequency Information Gain (LTF-IG) and the second one Log-User Frequency Information Gain (LUF-IG).

A word having a high value for the metrics just defined may be regarded as being more present in a certain region than in the rest of the country. We subsequently sort all words in our dataset relative to these metrics, thus obtaining two word rankings: Word-Count Ranking and User-Count Ranking. The words that appear in the first positions of a ranking are those with high values for the metric, and thus more likely to be regionalisms.
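To make the computation concrete, the following sketch puts the definitions of equations (1) to (8) together and produces the two rankings; it assumes per-province occurrence and distinct-user counts have already been accumulated for each word, and the nested-dictionary layout and function names are ours rather than the authors'.

```python
import math

def entropy(counts_by_location):
    """Shannon entropy of the location distribution implied by the counts."""
    total = sum(counts_by_location.values())
    return -sum((c / total) * math.log(c / total)
                for c in counts_by_location.values() if c > 0)

def rankings(word_loc_counts, user_loc_counts, num_locations):
    """word_loc_counts[w][l]: occurrences of word w in location l.
       user_loc_counts[w][l]: distinct users in location l who used w.
       Returns the Word-Count and User-Count Rankings (best first)."""
    h_max = math.log(num_locations)          # maximum possible entropy

    total_count = {w: sum(d.values()) for w, d in word_loc_counts.items()}
    total_users = {w: sum(d.values()) for w, d in user_loc_counts.items()}
    log_max_count = math.log(max(total_count.values()))
    log_max_users = math.log(max(total_users.values()))

    ltf_ig, luf_ig = {}, {}
    for w in word_loc_counts:
        nlf = math.log(total_count[w]) / log_max_count   # eq. (6)
        nlu = math.log(total_users[w]) / log_max_users   # user analogue of eq. (6)
        ltf_ig[w] = (h_max - entropy(word_loc_counts[w])) * nlf   # eq. (7)
        luf_ig[w] = (h_max - entropy(user_loc_counts[w])) * nlu   # eq. (8)

    word_ranking = sorted(ltf_ig, key=ltf_ig.get, reverse=True)
    user_ranking = sorted(luf_ig, key=luf_ig.get, reverse=True)
    return word_ranking, user_ranking
```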

3.1 Lexicographic Validation

With these rankings, a team of lexicographers performed a linguistic validation of the first thousand words according to each metric. This qualitative analysis consisted in a detailed study, word by word, to determine whether the word in question is part of the lexical repertoire of a community of speakers. Proper and place names (toponyms) were excluded – as is traditional in lexicography – although many words in this class had high values for our metrics. To facilitate this exclusion, words suspected of being toponyms were automatically highlighted for the lexicographers.

To perform the linguistic validation, lexicographers were provided with tables containing figures for each word and province: number of users, number of occurrences and normalized frequency (occurrences per million words). Also, samples of tweets containing these words were provided when necessary.

As a result of this process, every word in the top-1000 of each ranking was annotated with ‘1’ if it had lexical relevance as a regionalism, or ‘0’ if it had not. Lastly, lexicographers performed a characterization of the words marked as regionalisms, according to the linguistic phenomenon they represent. The outcome of these procedures is described in the following sections.

3.2 Feature Selection Methods for Geolocation

To indirectly assess the pertinence of our metrics, we used each as a feature-selection method to train geolocation classifiers. This means that, instead of using the entire bag-of-words as input for a geolocation algorithm, we consider a smaller subset of the vocabulary. This dimensionality reduction of the feature space is aimed at boosting the classifier performance.

This approach to geolocation can be classified as “word-centric”, as it uses lexical information from tweets to predict a location Zheng et al. (2018). We are concerned with user geolocation – i.e., not tweet geolocation. Thus, each unit or document is the concatenation of all tweets from a single user. From the collected dataset, we randomly selected 10,000 users: 7,500 for the training set and 2,500 for testing purposes.

For reference, we compare our results to those obtained using the Information Gain Ratio (IGR) metric as described in Han et al. (2012); Cook et al. (2014): if $l$ is a random variable denoting the location of a given occurrence of a word $w$, then the Information Gain of $w$ is

$$IG(w) = H(l) - P(w)\,H(l \mid w) - P(\bar{w})\,H(l \mid \bar{w})$$

where $P(\bar{w})$ denotes the probability that $w$ does not occur. Then, $IGR(w)$ is defined as

$$IGR(w) = \frac{IG(w)}{IV(w)} \qquad (9)$$

where $IG(w)$ is normalized by the intrinsic value $IV(w) = -P(w) \log P(w) - P(\bar{w}) \log P(\bar{w})$.

We also calculate IGR using user frequencies, in a similar way to Equation 4. As a baseline for our feature selection methods, we also calculate Term-Frequency Inverse Location Frequency (TF-ILF), which consists in sorting our terms first by Location Frequency (in ascending order) and then by Term-Frequency (in descending order).
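A sketch of the IGR baseline under the formulation given above; the user-frequency variant can be obtained by feeding distinct-user counts instead of occurrence counts. The dictionary layout is again illustrative.

```python
import math

def _entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def igr(word_loc_counts, total_loc_counts):
    """word_loc_counts[l]  = occurrences of the word in location l
       total_loc_counts[l] = total word occurrences in location l
       Returns the Information Gain Ratio of the word."""
    n = sum(total_loc_counts.values())       # all tokens in the corpus
    n_w = sum(word_loc_counts.values())      # tokens equal to the word
    p_w = n_w / n
    p_not_w = 1.0 - p_w

    # prior location distribution H(l)
    h_l = _entropy([c / n for c in total_loc_counts.values()])
    # location distribution given that the word occurs, H(l | w)
    h_l_w = _entropy([c / n_w for c in word_loc_counts.values()])
    # location distribution given that the word does not occur, H(l | not w)
    not_w_counts = {l: total_loc_counts[l] - word_loc_counts.get(l, 0)
                    for l in total_loc_counts}
    h_l_not_w = _entropy([c / (n - n_w) for c in not_w_counts.values()])

    ig = h_l - (p_w * h_l_w + p_not_w * h_l_not_w)
    iv = _entropy([p_w, p_not_w])            # intrinsic value of the split
    return ig / iv if iv > 0 else 0.0
```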

Summing up, five methods are tested as feature selection for geolocation: TF-ILF, LTF-IG, LUF-IG, basic IGR, and User IGR. We train Multinomial Logistic Regressions using the top-ranked words of each method as features, varying how many words are selected, and test against the 2.5K held-out users. Performance is assessed using accuracy and the mean distance between the capital cities of the predicted and actual provinces – a fairly good estimation, since most of the population concentrates around those cities.
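One possible way to run this evaluation with scikit-learn, assuming each user's tweets have been concatenated into a single document and each metric provides a ranked vocabulary list; the scikit-learn classes are real, while the surrounding variable names are illustrative placeholders.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def evaluate_ranking(ranked_vocab, train_docs, train_provinces,
                     test_docs, test_provinces, top_n=5000):
    """Train a (multinomial) logistic regression restricted to the top_n
    words of a ranking and return its geolocation accuracy."""
    vectorizer = CountVectorizer(vocabulary=ranked_vocab[:top_n])
    X_train = vectorizer.fit_transform(train_docs)   # restricted bag-of-words
    X_test = vectorizer.transform(test_docs)

    # with the lbfgs solver and more than two classes this fits a softmax model
    clf = LogisticRegression(solver="lbfgs", max_iter=1000)
    clf.fit(X_train, train_provinces)
    return accuracy_score(test_provinces, clf.predict(X_test))
```

The mean-distance measure can be computed analogously by mapping predicted and true provinces to their capital-city coordinates and averaging the great-circle distances.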

4 Results

Rank Word-Count User-Count
1 ushuaia chivil
2 rioja ush
3 chivilcoy poec
4 bragado malpegue
5 viedma aijue
6 logroño tolhuin
7 chepes vallerga
8 oberá yarca
9 cldo blv
10 tdf portho
11 riojanos jumeal
12 breñas sinf
13 choele plottier
14 gallegos kraka
15 tiemposur fsa
16 fueguinos bombola
17 chilecito yarco
18 blv sanagasta
19 ush wika
20 merlo obera
Table 2: Top 20 words for the two metrics. Words in bold have lexicographic interest as regionalisms.

Table 2 shows the top-20 words calculated with each metric. Many are toponyms: chivil, ush, blv, tolhuin, kraka, sanagasta, wika refer to towns, cities and local clubs. Also, some words refer to gentilics (riojanos, fueguinos), or local institutions (POEC). Some of these words emerge as regionalisms: yarca/yarco, aijue, sinf, cldo, bombola, malpegue. We can observe that many words are shared among the rankings. User-Count and Word-Count have an overlap of 63% in the top thousand words.

(a)–(b) Color scale: Word-Count Ranking. (c)–(d) Color scale: User-Count Ranking.
Figure 2: Scatter plots showing words (dots) along three dimensions. Horizontal axes: word-count entropy (left plots) or user-count entropy (right plots). Vertical axes: normalized log word frequencies (left plots) or user frequencies (right plots). Color: log word rank according to Word-Count (top plots) or to User-Count (bottom plots); lighter color means higher rank.

Figure 2 shows four three-dimensional scatter plots. A dot in these plots corresponds to an individual word in our corpus, and is placed along the horizontal axes according to its word- or user-count entropy ($H_{words}$ and $H_{users}$, respectively). Along the vertical axes, each dot is located following its corresponding normalized log word or user frequency ($nlf$ and $nlu$). Additionally, each dot is colored according to the position of the word in one of our rankings using a chromatic scale, such that the lighter the dot, the higher the word’s rank. For clearer visualization, word rankings are also shown in logarithmic scale.

Figure 2(a) shows that words that rank high in the Word-Count Ranking (in lighter color) tend to appear closer to the upper-left corner of the plot – that is, such words are more frequent and their mentions are concentrated in fewer regions. Figure 2(d) shows a very similar pattern, now with respect to the number of users that mention the words: words high in the User-Count Ranking are mentioned by a larger number of users from fewer regions. These two figures display a gradient from the upper-left corner (words ranked higher, in lighter color) to the lower-right corner (words ranked lower, in darker color).

Figure 2(b) uses horizontal and vertical axes corresponding to users ($H_{users}$ and $nlu$), but colors each word with respect to the Word-Count Ranking. Here we can observe a slight perturbation in the gradient: there are words far from the upper-left corner that have light colors. From this, we understand that there are words that rank high in the Word-Count Ranking but low in the User-Count Ranking.

Likewise, Figure 2(c) uses the User-Count Ranking to color the points, and the word axes $H_{words}$ and $nlf$. The perturbation in the gradient is even clearer in this plot: there are many words that appear high in the Word-Count Ranking (closer to the upper-left corner, see Figure 2(a)) but low in the User-Count Ranking (darker color).

To further inspect this phenomenon, we searched for words with large differences between the logarithms of their Word-Count and User-Count Ranking positions. The logarithm downweights differences between words that sit deep in the rankings (e.g., between the word at position 10,000 and another at position 20,000) and amplifies the difference when one of the ranks is near the top and the other is not. A close examination of these words and the tweets they were used in showed that they belong to the vocabulary of bots (news and meteorological accounts, or accounts using applications to get more followers) or of small niches of fans of a certain celebrity. Among the top-100 words sorted by this difference, only one has a higher ranking by users than by words.
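A small sketch of this rank comparison, assuming both rankings are available as ordered lists of words; this helper is ours, not part of the paper's materials.

```python
import math

def log_rank_differences(word_ranking, user_ranking):
    """Return words sorted by how much higher they sit in the
    Word-Count Ranking than in the User-Count Ranking (log scale)."""
    word_pos = {w: i + 1 for i, w in enumerate(word_ranking)}
    user_pos = {w: i + 1 for i, w in enumerate(user_ranking)}
    diffs = {
        w: math.log(user_pos[w]) - math.log(word_pos[w])
        for w in word_pos if w in user_pos
    }
    # large positive values: ranked high by word counts but low by user counts
    return sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)
```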

Summing up, when a word has a high User-Count Ranking, it also tends to have a high Word-Count Ranking. The reverse is not true, however, as words produced by a small number of accounts would not rank well with respect to users. Thus, the User-Count Ranking successfully discards words coming from automatic agents, as already done in Cui et al. (2012).

Word Word Rank User Rank
rioja 2 2499
vto 27 28179
hoa 81 83717
contextos 88 71290
cardi 32 23756
agraden 107 75042
hemmings 59 40227
ushuaia 1 565
tweeted 43 21342
precipitación 66 31042
Table 3: Top 10 words with largest difference between their log word rank and their log user rank.

The first thousand words in the Word-Count Ranking were manually analyzed by the lexicographers, who marked 21.9% as likely regionalisms. Analogously, of the first thousand words in the User-Count Ranking, 30.2% were marked as lexicographically interesting. This validation suggests that observing user-frequency dispersion is more relevant when assessing whether a word is a regionalism.

The lexical characterization is displayed in Table 4, which lists some of the groups found among the regionalisms in the analyzed words, with examples. A special note is reserved for the group of Indigenisms, where a number of words were found coming from Guaraní (for instance, mitaí, angá, angaú, nderakore) and also from Quechua (ura). It is worth mentioning that words coming from Guaraní (a language spoken in Northeastern Argentina, Paraguay, Bolivia and Southwestern Brazil) coincide with the region delimited by Vidal de Battini (1964).

Word Region Meaning

Colloquialisms
culiado Córdoba asshole
chombi Mendoza poor in quality
carnasas Neuquén not classy, inelegant
bolasear Cuyo to bullshit
aprontar E. Ríos to get ready

Indigenisms
ura Northwest vagina (Quechua)
mitaí Guaranitic boy
angá Guaranitic unfortunate

Regional realities
piadinas San Juan roll (food)
tarefero Misiones yerba mate worker
POEC Neuquén high school exam

Interjections
aijue Formosa surprise
yirr Corrientes joy
aiss Formosa annoyance
jiaa Corrientes yeehaw

Orthographic variations
pesao Northwest pesado
ql Northwest culiado
uaso Córdoba guaso

Regional Morpheme
raraso Córdoba very strange (raro)
tardaso Córdoba very late (tarde)

Table 4: Characterization of some of the regionalisms found in the analysis. Each group corresponds to a subjective category found by the lexicographers during the annotation process.

4.1 Geolocation of users

Figure 3: Comparison of the metrics when used as feature selection methods for geolocation (panels: mean distance error in user geolocation; accuracy in user geolocation). Horizontal axes show the percentage of top-ranked words used as features to train a Multinomial Logistic Regression, and vertical axes display the performance of each respective classifier: mean distance error in one panel (less is better) and accuracy in the other (more is better).
Features Accuracy Mean Distance
All 0.383 599.8
TF-ILF 0.654 363.3
IGR-Words 0.736 214.2
IGR-Users 0.748 234.7
LTF-IG 0.737 227.9
LUF-IG 0.784 164.9
Table 5: Performance of the different feature selection methods when using the top-5000 words.

Figure 3 displays the performance of the different feature selection methods when used to train our discriminative classifier. Horizontal axes represent the percentage of top-ranked words selected, and the vertical axes represent accuracy in one panel and mean distance error in the other.

Comparing the two versions of each metric, those using user frequencies obtain better performance than their word-frequency counterparts. This is clearest for LTF-IG versus LUF-IG, but it can also be observed for the two IGR metrics.

Log User Frequency-Information Gain (LUF-IG) obtains the best performance when geolocating users, reaching a plateau at about 3.75% of the top-ranked words. It outperforms its word-frequency version LTF-IG and both IGR metrics. Table 5 shows the results of using the full bag of words (baseline) versus the different feature selection methods restricted to the top 5,000 words.

5 Discussion

Of the proposed metrics, the User-Count Metric proved to be the more interesting. It removed from the top positions of the ranking words likely to come from automatic agents or from small niches of users, and lexicographic validation confirmed that this ranking contained more regionalisms than the Word-Count Metric. Further, using this metric as a feature selection method for geolocating users also showed a significant improvement over the other metrics – both its word-frequency counterpart and the IGR metrics from Han et al. (2012). This suggests that measuring the dispersion of the users of a certain word is a very informative indicator – both in lexicographic and in geolocation terms – backing what was already found in previous work on detecting spam on Twitter Cui et al. (2012).

The proposed metric was developed in the context of analyzing regional colloquialisms. This area of the lexicon is the most elusive, since its impact on any printed medium arrives noticeably late – and in many cases never at all. Colloquialisms are a class of words hardly found in any other media. Our best-performing metric marked as relevant several words that were already listed in the Diccionario del Habla de los Argentinos Academia Argentina de Letras (2008), a fact that confirms the usefulness of both our metric and Social Media data in general for this task.

An outstanding subgroup found in the analysis are words coming from the Guaranitic region, in Northeastern Argentina. In particular, three words have been proposed for addition to the aforementioned dictionary: angá, angaú, mitaí. This case is emblematic because it shows how this type of approach can help overcome the intrinsic limitations of doing regional lexicography. When lexicographers are native to only one of the dialects of the region covered by a projected dictionary, the probability of properly detecting and defining words of the other dialects is slim or depends on mere chance. As the team of lexicographers noted when confronted with these three words of Guaraní heritage, their very robust normalized frequencies across a significant portion of the territory of Argentina would otherwise have remained unknown: instead of being included in the next edition of a dictionary that attempts to describe all regional lexical items in the country, they would have remained unregistered, perpetuating a very serious omission.

As the focus was on detecting lexical variation at the province level, we paid no attention to finer spatial granularity. If finer granularity were necessary, adaptive partitioning could be used Roller et al. (2012) to improve geolocation and to find localisms within provinces. Although previous work Vidal de Battini (1964) indicates that most provinces do not have large dialectal variations within them, this is something that would need to be explored and confirmed in future work.

Also, these techniques should be tested against other datasets (such as those used in Roller et al. (2012); Han et al. (2012)) to further confirm that they outperform other feature selection methods.

6 Conclusions

In this work, we developed and compared two metrics to detect regionalisms on Twitter based on Information Theory. One was based on the word frequency (Log Term Frequency-Information Gain, LTF-IG) and the other on the user frequency of a word (Log User Frequency-Information Gain, LUF-IG). These metrics may be seen as a mixture of previous information-theoretic measures and classic TF-IDF.

We compared their performance in two ways. First, a team of lexicographers manually assessed the presence of regionalisms in the first thousand words as ranked by each of these metrics. Second, we tested the metrics as feature-selection methods for geolocation algorithms, comparing them also against metrics from previous works Han et al. (2012); Cook et al. (2014). In both evaluations, the metric built upon user frequencies (LUF-IG) yielded better results, suggesting that the number of users of a word is very informative – perhaps even more than simple word frequency.

This method has aided lexicographers in their task, letting them propose the addition of a number of words into the Diccionario del Habla de los Argentinos. In the case of this particular dictionary, the work relies on a collaborative effort based on the intuition of academics and lexicographers, who identify regionalisms used mainly (seldom exclusively) within Argentina’s borders by carefully parsing a diversity of sources. Using Social Media to automatically detect regionalisms therefore does more than avoid most of this manual work, which would in itself already be a sizeable contribution. Since a considerable portion of the lexical repertoire of a community never makes its way into published materials (which make up most of the 300 million words included to date in, for example, CORPES XXI Real Academia Española), the possibility of creating lists of words that are likely to be regional, based on actual utterances written by users, opens a way of shedding light on entire pockets of lexical items that would otherwise remain chronically underrepresented in dictionaries. Even when a regional word is published, and then included in corpora, the task of appropriately isolating it remains largely unchanged, given that the word has to be identified beforehand in order to take advantage of the available statistical information.

A further challenge triggered by this work is the detection of regions with different dialectal uses Gonçalves and Sánchez (2014), but using features obtained in a semi-supervised fashion with these metrics. This would make it possible to assess the validity of the dialectal regions of Argentina proposed by Vidal de Battini in 1964 Vidal de Battini (1964). Spatial and temporal information could also be explored, particularly finer-grained locations. Regarding geolocation, the proposed metrics should also be tested against other datasets to evaluate their performance as feature selection methods.

References

  • Academia Argentina de Letras (2008) Academia Argentina de Letras. 2008. Diccionario del habla de los argentinos. Emecé Editores.
  • Ahmed et al. (2013) Amr Ahmed, Liangjie Hong, and Alexander J Smola. 2013. Hierarchical geographical modeling of user locations from social media posts. In Proceedings of the 22nd international conference on World Wide Web, pages 25–36. ACM.
  • Almeida and Vidal (1995) Manuel Almeida and Carmelo Vidal. 1995. Variación socioestilística del léxico: un estudio contrastivo. Boletín de filología, 35(1):50.
  • Atkins and Rundell (2008) BT Sue Atkins and Michael Rundell. 2008. The Oxford guide to practical lexicography. Oxford University Press.
  • Vidal de Battini (1964) Berta Elena Vidal de Battini. 1964. El español en la argentina. Technical report, Argentina.
  • Bird et al. (2009) Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural language processing with Python: Analyzing text with the natural language toolkit. O’Reilly Media, Inc.
  • Chang et al. (2012) Hau-wen Chang, Dongwon Lee, Mohammed Eltaher, and Jeongkyu Lee. 2012. @ phillies tweeting from philly? predicting twitter user locations with spatial word usage. In Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), pages 111–118. IEEE Computer Society.
  • Cook et al. (2014) Paul Cook, Bo Han, and Timothy Baldwin. 2014. Statistical methods for identifying local dialectal terms from gps-tagged documents. Dictionaries: Journal of the Dictionary Society of North America, 35(35):248–271.
  • Cui et al. (2012) Anqi Cui, Min Zhang, Yiqun Liu, Shaoping Ma, and Kuo Zhang. 2012. Discover breaking events with popular hashtags in twitter. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM 12, pages 1794–1798, New York, NY, USA. ACM.
  • Eisenstein (2014) Jacob Eisenstein. 2014. Identifying regional dialects in online social media. Georgia Institute of Technology.
  • Eisenstein et al. (2010) Jacob Eisenstein, Brendan O’Connor, Noah A Smith, and Eric P Xing. 2010. A latent variable model for geographic lexical variation. In Proceedings of the 2010 conference on empirical methods in natural language processing, pages 1277–1287. Association for Computational Linguistics.
  • Eisenstein et al. (2014) Jacob Eisenstein, Brendan O’Connor, Noah A Smith, and Eric P Xing. 2014. Diffusion of lexical change in social media. PloS one, 9(11):e113114.
  • Ghosh et al. (2011) Rumi Ghosh, Tawan Surachawala, and Kristina Lerman. 2011. Entropy-based classification of retweeting activity on Twitter. arXiv preprint arXiv:1106.0346.
  • Gonçalves and Sánchez (2014) Bruno Gonçalves and David Sánchez. 2014. Crowdsourcing dialect characterization through twitter. PloS one, 9(11):e112074.
  • Grieve et al. (2013) Jack Grieve, Costanza Asnaghi, and Tom Ruette. 2013. Site-restricted web searches for data collection in regional dialectology. American speech, 88(4):413–440.
  • Han et al. (2012) Bo Han, Paul Cook, and Timothy Baldwin. 2012. Geolocation prediction in social media data by finding location indicative words. Proceedings of COLING 2012, pages 1045–1062.
  • Hecht et al. (2011) Brent Hecht, Lichan Hong, Bongwon Suh, and Ed H Chi. 2011. Tweets from justin bieber’s heart: the dynamics of the location field in user profiles. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 237–246. ACM.
  • Huang et al. (2016) Yuan Huang, Diansheng Guo, Alice Kasakoff, and Jack Grieve. 2016. Understanding us regional linguistic variation with twitter data analysis. Computers, Environment and Urban Systems, 59:244–255.
  • Jimenez et al. (2018) Sergio Jimenez, George Dueñas, Alexander Gelbukh, Carlos A Rodriguez-Diaz, and Sergio Mancera. 2018. Automatic detection of regional words for pan-hispanic spanish on twitter. In Ibero-American Conference on Artificial Intelligence, pages 404–416. Springer.
  • Kaufmann and Kalita (2010) Max Kaufmann and Jugal Kalita. 2010. Syntactic normalization of twitter messages. In International conference on natural language processing, Kharagpur, India.
  • Kessler (1995) Brett Kessler. 1995. Computational dialectology in irish gaelic. In Proceedings of the seventh conference on European chapter of the Association for Computational Linguistics, pages 60–66. Morgan Kaufmann Publishers Inc.
  • Labov et al. (2005) William Labov, Sharon Ash, and Charles Boberg. 2005. The atlas of North American English: Phonetics, phonology and sound change. Walter de Gruyter.
  • Monroe et al. (2008) Burt L Monroe, Michael P Colaresi, and Kevin M Quinn. 2008. Fightin’words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis, 16(4):372–403.
  • Montemurro and Zanette (2002) Marcelo A Montemurro and Damián H Zanette. 2002. Entropic analysis of the role of words in literary texts. Advances in complex systems, 5(01):7–17.
  • Montemurro and Zanette (2010) Marcelo A Montemurro and Damián H Zanette. 2010. Towards the quantification of the semantic information encoded in written language. Advances in Complex Systems, 13(02):135–153.
  • Nerbonne et al. (1996) John Nerbonne, Wilbert Heeringa, Erik Van den Hout, Peter Van der Kooi, Simone Otten, Willem Van de Vis, et al. 1996. Phonetic distance between dutch dialects. In CLIN VI: proceedings of the sixth CLIN meeting, pages 185–202.
  • Pak and Paroubek (2010) Alexander Pak and Patrick Paroubek. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In LREc, volume 10, pages 1320–1326.
  • Rahimi et al. (2017a) Afshin Rahimi, Timothy Baldwin, and Trevor Cohn. 2017a. Continuous representation of location for geolocation and lexical dialectology using mixture density networks. arXiv preprint arXiv:1708.04358.
  • Rahimi et al. (2017b) Afshin Rahimi, Trevor Cohn, and Timothy Baldwin. 2017b. A neural model for user geolocation and lexical dialectology. arXiv preprint arXiv:1704.04008.
  • Real Academia Española. Banco de datos (CORPES XXI) [online]. Corpus del español del siglo XXI (CORPES).
  • Roller et al. (2012) Stephen Roller, Michael Speriosu, Sarat Rallapalli, Benjamin Wing, and Jason Baldridge. 2012. Supervised text-based geolocation using language models on an adaptive grid. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1500–1510. Association for Computational Linguistics.
  • Shannon (2001) Claude Elwood Shannon. 2001. A mathematical theory of communication. ACM SIGMOBILE mobile computing and communications review, 5(1):3–55.
  • Ueda and Ruiz Tinoco (2003) Hiroto Ueda and Antonio Ruiz Tinoco. 2003. Varilex, variación léxica del español en el mundo: Proyecto internacional de investigación léxica. In Pautas y pistas en el análisis del léxico hispano (americano), pages 141–278. Iberoamericana Vervuert.
  • Zheng et al. (2018) Xin Zheng, Jialong Han, and Aixin Sun. 2018. A survey of location prediction on twitter. IEEE Transactions on Knowledge and Data Engineering, 30(9):1652–1671.