A large scale lexical and semantic analysis of Spanish language variations in Twitter

10/12/2021, by Eric S. Tellez, et al.

Dialectometry is a discipline devoted to studying the variations of a language across a geographical region. One of its goals is the creation of linguistic atlases capturing the similarities and differences of the language under study in the area in question. For instance, Spanish is one of the most spoken languages across the world, but it is not necessarily written and spoken in the same way in different countries. This manuscript presents a broad analysis describing lexical and semantic relationships among 26 Spanish-speaking countries around the globe. For this study, we analyze four years of the Twitter geotagged public stream to provide an extensive survey of the Spanish language vocabularies of different countries, their distributions, the semantic usage of terms, and emojis. We also offer open regional word-embedding resources for Spanish Twitter to help other researchers and practitioners take advantage of regionalized models.

1 Introduction

Communication is, at its core, an understanding task. Understanding a message implies that peers share the vocabulary and structure; i.e., the receiver grasps what the sender intended to say. Language is a determinant factor in any communication. Even people who speak the same language can find it difficult to communicate information due to slight language variations arising from regional differences, language evolution, cultural influences, informality, and many other factors.

A dialect is a variation of a language that diverges from its origin due to several circumstances. Dialects can differ in their vocabulary, grammar, or even semantics. The same sentence can be semantically different among dialects; conversely, speakers of different dialects may not understand sentences that carry the same meaning. This effect is notoriously complex for figurative language since it is full of cultural and ideological references. Studying these dialects can help us understand the cultural aspects of each population and the closeness between them. In this sense, dialectometry is defined as the study of the regional distribution of dialects. Similarly, dialectometric analysis aims, through a computational approach, to analyze this distribution and provide a quantitative linguistic distance between dialects or regions Donoso and Sánchez (2017).

Hence, research in dialectology tries to understand language differences, innovations, or variations not only in space but also in time, through several phenomena. These studies are traditionally carried out using interviews and surveys, which are naturally limited by small sample sizes Huang et al. (2016). When communication happens through written messages, i.e., text in some human language, we can automatically analyze large amounts of messages using text-based Natural Language Processing (NLP) tools. Even with NLP techniques, the support for regional language varieties is still in its early stages, particularly for languages other than English.

On the other side, social media is a crucial component of our lives; Facebook, Twitter, Instagram, and Youtube are among the most used social networks that allow interaction among users in written form and other media. In particular, Twitter is a micro-blogging platform where most messages are intentionally publicly available, and developers and researchers can access these messages through an API (Application Programming Interface). Each message on Twitter is called a tweet. Each tweet contains text and additional metadata, like the user that wrote it and the geographic location where it was published. When the source of the messages is informal, like social media, errors are another source of variability in the language; this kind of messaging may impose extra difficulties compared to formal written documents. Twitter and Big Data processing techniques have demonstrated their utility in many research areas and applications over the years. For instance, several works analyze and study subjects such as marketing communication analysis Arrigo et al. (2021), stock market sentiment analysis Corea (2016), and science evolution Li et al. (2021), among others.

Nonetheless, the quality and socio-demographic representativeness of Twitter messages have been continuously questioned Crampton et al. (2013). Some authors have shown that, despite the possible over-representation of some social groups, the usage of social media can still be of enormous usefulness and quality Huang et al. (2016). Both language and geographical information are crucial to understanding the geographies of this online data and the way information related to economic, social, political, and environmental trends could be used Graham et al. (2014). In contrast, it is not easy to accurately analyze variations of language using only classical dialectometry Rodriguez-Diaz et al. (2018); therefore, we aim for approaches closer to machine learning and natural language processing to be able to handle very large datasets in our study.

Our contribution

This manuscript presents a broad comparison of a large collection of Twitter text messages written in the Spanish language in 26 countries: 21 countries where Spanish is accepted as a main language, and five more where the number of Spanish speakers is high. Our comparison encompasses lexical and semantic characteristics, like the determination of the coefficients of Heaps' and Zipf's laws. The number of users, tweets, tokens, and emojis for each country is also presented. We also provide a thorough study of similarities and dissimilarities among language variations linked to different countries. We document our methodology so it can be reproduced for other languages. Finally, we provide our regional models as resources that any interested researcher or practitioner can access.

Tools

Our analysis and visualization rely heavily on the Uniform Manifold Approximation and Projection (UMAP) algorithm McInnes et al. (2020), which is used to project high-dimensional data into two and three dimensions. UMAP is a dimension reduction method used for visualization and general non-linear dimension reduction; see McInnes et al. (2020) for more details. We chose UMAP since it can be tuned to preserve both the local and the global structure of the input data, and it is well known for its competitive quality on real data and its good computational performance. In particular, we use the Julia implementation (https://github.com/dillondaudert/UMAP.jl). Our data analysis and data processing are mainly done using the Julia programming language Bezanson et al. (2017), using its native multi-threading and multi-processing computing capabilities. We ran our analysis on a workstation with two Intel(R) Xeon(R) Silver 4216 CPUs @ 2.10GHz (64 threads) and 256GB of RAM running CentOS 8. We handle text preprocessing and normalization with our TextSearch.jl package (https://github.com/sadit/TextSearch.jl) and use fastText (https://fasttext.cc/) to construct the regional embeddings.

The rest of the manuscript is organized as follows. Section 2 analyzes the related work. Section 3 describes our Twitter Spanish Corpora (TSC), a list of 26 corpora collected from Twitter that correspond to 26 different countries with a broad population of Spanish speakers on the social network. Section 4 presents lexical aspects of the studied corpora; similarly, Section 5 is dedicated to the semantic analysis of our TSC. Finally, findings and conclusions are summarized and discussed in Section 6.

2 Related work

Social media platforms have shown their potential in many research areas, such as health, environmental issues, emotion, mental health, gender, and misogyny. In particular, knowing the particularities of languages in a specific region is helpful for social and regional studies.

For instance, Huang et al. Huang et al. (2016) used a set of geotagged tweets collected over a year to understand regional linguistic variation in the United States of America (USA). The authors try to answer two questions: first, how do linguistic styles vary from place to place, and second, what are the linguistic regions and sub-regions in the US. To do this, they used a set of 211 lexical alternatives, i.e., groups of two or more different words with the same meaning, e.g., dad and father. Over these variations, the whole country was regionalized based on the number of users using the available alternatives. As a result, a map of the USA is presented, highlighting the language variations across it.

Hovy et al. Hovy et al. (2020) analyze the variation of the language over Europe using geotagged Twitter data. They used word tri-grams to capture the differences between countries and Principal Component Analysis (PCA) to visualize these differences on a map, encoding each cell as a combination of colors. In Gonçalves and Sánchez (2014), the authors present a crowdsourced study of diatopic language variation using geolocated Twitter data, employing tweet messages in Spanish collected worldwide for more than two years. The analysis was done using a set of pre-established words, counting each one and its variations worldwide. To know what regions are close to each other, the authors used the k-means clustering algorithm over these word frequencies and PCA to reduce the data into a two-dimensional space for visualization. The clustering approach identifies large macro-regions sharing language characteristics.

On the other hand, Graham et al. discuss how good methods for identifying location and language in Twitter data are Graham et al. (2014), which is an essential task for many researchers using Twitter data. To answer this question, the authors collected tweets between 10 November and 16 December 2011 and randomly selected a small subset of 1000 tweets from four metropolitan areas. A labeled corpus was created by humans determining the primary language of the tweets, and these labels were compared with the results obtained by several language detection tools. The same comparison was applied to location, contrasting the geotagged data provided by Twitter against the locations predicted by these tools. The authors conclude that no tool approximates the results achieved by humans in the language recognition task. In the case of geolocation data, the authors determined that there is usually no correspondence between the geolocation provided by the device (mainly smartphones) and the one declared in the user's profile; therefore, the user-provided location is not trustworthy.

Dunn Dunn (2019) analyzes seven languages: Arabic, English, French, German, Portuguese, Russian, and Spanish. The author crawled a set of web pages and tweets to compose the dataset and employs a Support Vector Machine (SVM) classifier on different features: lexical unigrams, bi-grams, and tri-grams. The recall, precision, and f1-score metrics were calculated for all the languages and regional variations, showing outstanding results. In Dunn (2020), the author describes a web-based corpus of global language use for data-driven language mapping. An analysis of the relationship between two different sources, web pages and Twitter, is also presented. The study used a language classification model with good performance over multiple languages. In Rodriguez-Diaz et al. (2018), the authors study Spanish language variations in Colombia. The analysis used uni-gram features, and the authors stated that it was challenging to compare Spanish variations against regions identified by other authors using classical dialectometry. Hence, in conclusion, the authors said that automatic detection of dialectones is an adequate alternative to classical methods in dialectometry for automated language applications.

Mocanu et al. Mocanu et al. (2013) survey the linguistic landscape of the world using Twitter. This landscape includes the linguistic homogeneity or variation over countries and the touristic seasons that can be seen in them. The method employed to identify the language is the Chromium Compact Language Detector by Google, and for location, only the data provided by devices was utilized. As a result, the authors presented the distribution of languages on Twitter over several countries by month of the year, where the touristic seasons can be observed. Kejriwal et al. Kejriwal et al. (2021) study the use of emojis across languages and countries. The authors collected tweets from 30 languages and countries and found that emoji usage correlates strongly at both the language and country level, which means that emojis are used according to language and region. Another example of studying emoji usage is presented in Li et al. (2019).

A lot of research has been done on analyzing language variations and their regionalization. This manuscript presents an analysis of Spanish variations over 26 countries using a vast amount of Twitter data collected over four years. Unlike the works described above, this analysis considers both lexical and semantic aspects.

3 Twitter corpora of the Spanish language

With 489 million native speakers in 2020 (https://blogs.cervantes.es/londres/2020/10/15/spanish-a-language-spoken-by-585-million-people-and-489-million-of-them-native), the Spanish language has one of the largest native-speaker bases, ranked just behind Mandarin Chinese in number of native speakers. There are 21 countries that have Spanish as an official language, by law or de facto (https://en.wikipedia.org/wiki/List_of_countries_where_Spanish_is_an_official_language). In addition, we considered the following countries in this study: United States (US), Canada (CA), Great Britain (GB), France (FR), and Brazil (BR). These countries have a high number of collected tweets in our collection and well-known migration, business, and tourism activity involving Spanish speakers. The number of Twitter users varies across countries, and they represent only a part of society. Each country has different social, political, security, health, and economic conditions, so we will try to avoid any kind of generalization.

We collected publicly published tweets from 2016 to 2019 using the Twitter stream API, limiting our collection to geotagged messages marked as written in the Spanish language. We left the tweets of 2020 out of the study to avoid the disturbances in social media related to the COVID-19 pandemic, since our objective is the language itself. Our strategy was to ask the API for messages containing at least one Spanish stopword (stopwords are words so common in a language that many NLP modeling schemes remove them: articles, prepositions, interjections, and auxiliary verbs, among other typical words; we used a stopword list with 400 common words to stay within the API's limits) that were also labeled as written in Spanish by Twitter, i.e., lang=es in Twitter's API (note that any language misclassification made by Twitter is preserved). After this filtering procedure, we retained hundreds of millions of tweets; Table 1 shows this data and its distribution over countries.

To ensure a minimum amount of information in each tweet, we discard those tweets with fewer than five tokens, i.e., words, emojis, or punctuation symbols. We also removed all retweets to avoid duplicated messages and to reduce foreign messages commented on by Spanish speakers.
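To make the filter concrete, the following is a minimal Julia sketch of the post-collection filtering; it is not the authors' exact pipeline, the JSON field names (lang, retweeted_status, place, text) follow Twitter's v1.1 stream payloads, and the JSON3 package and the whitespace tokenizer are assumptions made for illustration.

```julia
using JSON3   # any JSON parser works; JSON3 is an assumed choice

# Keep geotagged Spanish tweets with at least five tokens and drop retweets.
function keep_tweet(tw)
    get(tw, :lang, "") == "es" || return false            # labeled as Spanish by Twitter
    haskey(tw, :retweeted_status) && return false          # drop retweets
    get(tw, :place, nothing) === nothing && return false   # keep only geotagged messages
    length(split(get(tw, :text, ""))) >= 5                 # crude token count; the paper tokenizes after normalization
end

tweets = [JSON3.read(line) for line in eachline("tweets.jsonl")]
kept = filter(keep_tweet, tweets)
```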

country code Heaps' β Zipf's α number of users number of tweets number of tokens
Argentina AR 0.7563 1.8594 1,376K 234.22M 2,887.92M
Bolivia BO 0.7509 1.8913 36K 1.15M 20.99M
Chile CL 0.7555 1.8874 415K 45.29M 719.24M
Colombia CO 0.7562 1.8993 701K 61.54M 918.51M
Costa Rica CR 0.7447 1.8595 79K 7.51M 101.67M
Cuba CU 0.7640 1.8677 32K 0.37M 6.30M
Dominican Republic DO 0.7544 1.8832 112K 7.65M 122.06M
Ecuador EC 0.7538 1.8968 207K 13.76M 226.03M
El Salvador SV 0.7494 1.9066 49K 2.71M 44.46M
Equatorial Guinea GQ - - 1K 8.93K 0.14M
Guatemala GT 0.7498 1.9175 74K 5.22M 75.79M
Honduras HN 0.7486 1.8941 35K 2.14M 31.26M
Mexico MX 0.7557 1.8895 1,517K 115.53M 1,635.69M
Nicaragua NI 0.7445 1.8535 35K 3.34M 42.47M
Panama PA 0.7559 1.8952 83K 6.62M 108.74M
Paraguay PY 0.7511 1.8815 106K 10.28M 141.75M
Peru PE 0.7583 1.8966 271K 15.38M 241.60M
Puerto Rico PR 0.7498 1.8433 18K 0.58M 7.64M
Spain ES 0.7648 1.9036 1,278K 121.42M 1,908.07M
Uruguay UY 0.7516 1.8346 157K 30.83M 351.81M
Venezuela VE 0.7614 1.8959 421K 35.48M 556.12M
Brazil BR 0.7681 1.9389 1,604K 27.20M 142.22M
Canada CA 0.7652 1.9331 149K 1.55M 21.58M
France FR 0.9372 1.9324 292K 2.43M 27.73M
Great Britain GB 0.7687 1.9129 380K 2.68M 34.62M
United States of America US 0.7666 1.8929 2,652K 40.83M 501.86M
Total 12M 795.74M 10,876.25M
Table 1: Statistics of our datasets after filtering out retweets and keeping only tweets with at least five tokens; it shows the country of origin, the country code in ISO 3166-1 alpha-2 format as reported by Twitter's API, the coefficients of Heaps' and Zipf's laws, and the number of users, tweets, and tokens in the collected period.

Table 1 shows statistics about our corpora, describing aspects such as country, number of users, number of tweets, and number of tokens. The table shows that Spain, the USA, Mexico, and Argentina are the countries with the most users. They are also those with the most tweets in the Spanish language, although the USA falls considerably in this aspect. A similar proportion can be seen in the number-of-tokens column; nevertheless, Argentina is the country with the highest number of tokens, significantly above Mexico and Spain.

The table also lists the coefficients of Heaps' law and Zipf's law, two well-known laws describing the vocabulary of text collections written in non-severely-agglutinative languages; both are properties of a corpus in a particular language. Heaps' law describes the sub-linear growth of the vocabulary on a growing collection: the number of distinct terms after n tokens is v(n) = K n^β, with 0 < β < 1. Zipf's law represents a power-law distribution where a few terms have very high frequencies and many words occur with a very low frequency in the collection; the expression that describes Zipf's law is f(r) ∝ 1/r^α, where r is the rank of the term's frequency.
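As an illustration, the exponents can be estimated with ordinary least squares on log-log data; the following Julia sketch shows one way to do it, where the sampling step and the absence of frequency cut-offs are illustrative choices and not necessarily those used to produce Table 1.

```julia
using LinearAlgebra

# Heaps' law: fit log v = log K + β log n over the vocabulary-growth curve.
function heaps_exponent(tokens)
    seen = Set{String}(); n = Int[]; v = Int[]
    for (i, tok) in enumerate(tokens)
        push!(seen, tok)
        i % 10_000 == 0 && (push!(n, i); push!(v, length(seen)))  # sample every 10k tokens
    end
    coef = hcat(ones(length(n)), log.(n)) \ log.(v)
    coef[2]                                    # β, the sub-linear growth exponent
end

# Zipf's law: fit log f = c − α log r over the rank-frequency curve.
function zipf_exponent(tokens)
    freq = Dict{String,Int}()
    for tok in tokens; freq[tok] = get(freq, tok, 0) + 1; end
    f = sort(collect(values(freq)); rev=true)  # frequencies ordered by rank
    coef = hcat(ones(length(f)), log.(1:length(f))) \ log.(f)
    -coef[2]                                   # α, the Zipf exponent
end
```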

Figure 1(a) illustrates Heaps' law on a small sample of regions of interest. We can observe the predicted sub-linearity and that Mexico has the lowest vocabulary growth with respect to the number of tokens. On the contrary, the US corpus shows faster vocabulary growth, possibly explained by the mix of languages in many messages.

Figure 1(b) shows Zipf's law under a log-log scale and, therefore, its quasi-linear shape. We can see slight differences among curves, more noticeable on both the left and right parts of the plot. The left part of the curves corresponds to terms with very high frequency, and the right side to terms that are rare in the collection. Notice that all these curves are similar but slightly different; this is not a surprise since we analyze different dialects of the same language, i.e., Spanish. We need more detailed tools to characterize these similarities and differences, which we develop and describe in the rest of the document.

(a) Heaps' law. Vocabulary size with respect to the number of tokens.
(b) Zipf’s law. Frequency of tokens.
Figure 1: The vocabulary growth and distribution of frequencies of tokens over a sample of our Twitter’s Spanish language corpora.

3.1 Geographic distribution

Figure 2(a) illustrates the number of collected Spanish language tweets over the world, even though the rest of the analysis in this manuscript is limited to 26 countries or regions. The color intensity is on a logarithmic scale, which means that slight variations in color imply significant changes in the number of messages; countries with the darkest blue have the highest number of tweets in Spanish. This figure shows how American countries (in the south, central, and north) have, as expected, more tweets in the Spanish language than the rest of the world.

Figure 2(b) shows the distribution of tweeters (users) per country. As in the previous image, the color intensity is on a logarithmic scale to represent the number of users. The differences between this figure and Figure 2(a) are small; as expected, both follow the same distribution. Note that the intensity in American countries is higher.

(a) Distribution of tweets tagged as Spanish by Twitter.
(b) Distribution of the number of tweeters with at least one tweet tagged as Spanish language.
Figure 2: Distribution of tweets and tweeters labeled as Spanish-speaking around the world. Colors are related to the logarithmic frequencies in data collected from 2016 to 2019 with the public Twitter API stream. Darker colors indicate higher counts; the logarithmic scale implies that visible color changes correspond to large frequency differences.

4 Lexical analysis

This section analyzes our Twitter Spanish Corpora (TSC) from the lexical perspective, specifically regarding vocabulary usage. This analysis complements the Heaps' and Zipf's law analysis and the information given in Table 1.

Figure 3 describes the procedure applied to obtain an affinity matrix, i.e., a matrix that describes the similarities among our Spanish corpora. For this purpose, we extracted the vocabulary of each corpus. The vocabulary was computed on the entire corpus after the text normalizations described in the diagram. We also counted the frequencies of all terms to obtain a Zipf-like representation. We removed the one hundred most frequent words, as well as those terms with fewer than ten occurrences in the corpus. Therefore, we kept the portion of the vocabulary illustrated in the figure. The remaining terms and their frequencies are used to create a vector that represents each regional corpus.

The affinity matrix is then computed using the cosine distance, as described in the flow diagram. The heatmap represents the actual values in the matrix. This matrix is crucial for the rest of the analysis since it contains distances (dissimilarities) among all pairs of our Spanish corpora. Values close to zero (darker colors) imply that those regions are quite similar, and lighter ones (close to one) correspond to regions with larger differences in their everyday vocabularies. For instance, the affinity matrix shows that Mexico (MX) is most similar to Honduras (HN), Nicaragua (NI), Peru (PE), and the USA (US). This behavior could be explained by the geographical proximity of these countries and, consequently, by large migration flows and cultural interchange. On the other hand, Brazil (BR) and Equatorial Guinea (GQ) are among the most atypical countries, with low similarities to the other countries.
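A minimal Julia sketch of this construction follows; it assumes a dictionary freqs mapping each region code to its term-frequency table (computed after the text normalization of the diagram), and everything apart from the 100-term and 10-occurrence cut-offs described above is an illustrative choice.

```julia
# Build one frequency vector per region after dropping the 100 most frequent
# terms and the terms with fewer than 10 occurrences.
function region_vector(freq::Dict{String,Int})
    ranked = sort(collect(freq); by=last, rev=true)          # Zipf-like ranking
    Dict(t => f for (t, f) in ranked[101:end] if f >= 10)
end

# Cosine distance between two sparse frequency vectors represented as Dicts.
function cosine_distance(a::Dict, b::Dict)
    num = sum(Float64(a[k]) * b[k] for k in intersect(keys(a), keys(b)); init=0.0)
    num == 0 && return 1.0
    1.0 - num / (sqrt(sum(abs2, values(a))) * sqrt(sum(abs2, values(b))))
end

codes   = collect(keys(freqs))
vectors = [region_vector(freqs[c]) for c in codes]
# affinity[i, j] holds the cosine distance between regions i and j
affinity = [cosine_distance(vectors[i], vectors[j]) for i in eachindex(vectors), j in eachindex(vectors)]
```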

Figure 4 shows a projection onto a 2D space using the UMAP method with our affinity matrix as pre-computed input. The figure shows how close or far each Spanish variation is from the rest of the corpora. UMAP is parameterized by the number of nearest neighbors (k) taken from the affinity matrix, and thus it can favor local or global structure depending on the number of neighbors. We also colorized the figures using a 3D UMAP projection (with the specified number of neighbors), composing an RGB color from the three components after a plain normalization of their ranges.
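The projections can be sketched with UMAP.jl along the following lines; this assumes that the installed UMAP.jl version accepts a precomputed distance matrix through metric = :precomputed (please check the package documentation), and the normalization of the 3D components into RGB values is our own illustrative choice.

```julia
using UMAP

# `affinity` is the 26×26 cosine-distance matrix from the lexical step.
xy  = umap(affinity, 2; n_neighbors=3, metric=:precomputed)   # 2D layout, as in Figure 4(a)
rgb = umap(affinity, 3; n_neighbors=3, metric=:precomputed)   # 3D layout used to derive colors

# Plain normalization of each component into [0, 1] to compose an RGB color per region.
scale01(x) = (x .- minimum(x)) ./ (maximum(x) - minimum(x))
colors = mapslices(scale01, rgb; dims=2)
```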

Figure 4(a) shows the projection using k = 3; here, we can see four well-defined clusters. Figures 4(b) and 4(c) show the result of considering 7 and 12 neighbors to produce the UMAP mappings. In these figures, we can expect the global structure of the affinity matrix to be preserved better than for k = 3. For instance, Uruguay (UY) is very close to Argentina (AR) in the three figures, and this is also the case for other countries, like Mexico (MX), Colombia (CO), and the United States (US); or Venezuela (VE) and Ecuador (EC).

Figure 3: Affinity matrix among Spanish regions’ vocabularies.

(a) k = 3

(b) k = 7

(c) k = 12
Figure 4: Two-dimensional projection of the Spanish-speaking countries using the cosine distance among vocabularies. We present three figures, for k = 3, 7, and 12 neighbors in the UMAP projection.
Figure 5: Regional vocabulary in RGB representation.

For better geographical visualization and comparison, Figure 5 shows this result on a map. Here, we use the same RGB colors produced by reducing the affinity matrix to 3D using UMAP with k = 3. Note that some countries in South America and Europe are grouped (pink color). North America is also grouped (brown color). Some countries in South America, such as Peru (PE), Venezuela (VE), and Bolivia (BO), are also similar (green color).

The most frequent words in the vectors that represent each corpus are shown in the word cloud of Figure 6(a). While this is an illustrative figure, it is possible to observe what kind of terms are used. We observe numerous types of words, like verbs, adverbs, adjectives, and nouns; for instance, terms such as personas, mundo, mañana (people, world, tomorrow), and tiempo (time) are among the most used terms in the collection. Recall that we removed the one hundred most frequent terms of each vocabulary.

(a) Frequent tokens in Spanish messages in Twitter.
(b) Most popular emojis per Spanish speaking country.
Figure 6: Word clouds of frequent tokens (a) and most used emojis in Spanish (b)

Emojis are graphical symbols that can express an emotion or a popular concept. They are a lexical resource that can also carry an emotional charge. Emojis were created to compensate for the lack of facial expressions and other expressive cues people use in face-to-face conversations. Therefore, emojis are popular on social networks like Twitter since they are concise and friendly ways to communicate Dresner and Herring (2010). The use of emojis is also region dependent, as illustrated in Figure 6(b). The figure shows the 32 most used emojis in each country; skin-tone markers were separated from composed emojis and counted in an aggregated way. Note that there is consensus on the most popular emojis in almost all regions. At the top ranks, we find the laughing face, the face with heart-eyes, and the heart (love). Another element that deserves attention is the skin-tone mark, which gives emojis a skin hue. Light skin-tone marks are the most used; on the contrary, those related to dark skin are less used. This observation could have different explanations, e.g., people using Twitter in these countries identify as white, or perhaps it is not easy to select the proper tone in some Twitter clients. The real reason behind this finding is beyond the scope of this manuscript but deserves attention.
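As an illustration of how skin-tone marks can be separated from the emojis they modify, the following Julia sketch counts base emojis and Fitzpatrick skin-tone modifiers (U+1F3FB–U+1F3FF) independently; the set of tracked emojis and the simple character-level scan (which ignores ZWJ-composed sequences) are simplifying assumptions.

```julia
# Fitzpatrick skin-tone modifiers: light (U+1F3FB) … dark (U+1F3FF).
const SKIN_TONES = Set('\U1F3FB':'\U1F3FF')

function count_emojis!(emoji_freq, tone_freq, text, tracked)
    for ch in text
        if ch in SKIN_TONES
            tone_freq[ch] = get(tone_freq, ch, 0) + 1     # tones are aggregated separately
        elseif ch in tracked                              # `tracked`: the emojis of interest
            emoji_freq[ch] = get(emoji_freq, ch, 0) + 1
        end
    end
    emoji_freq, tone_freq
end
```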

5 Semantic analysis and regional word embeddings

This section explores the semantics of the Spanish language corpora. Each region has particular uses for some terms, and this effect induces differences when people communicate under a global perspective of a language; understanding a dialect may need regionalized resources. Therefore, our semantic approach is based on two main components: the creation of a local word embedding for each regional corpus and a popular emotion lexicon used to define anchors that allow a comparison among word embeddings. Both resources are used to create an affinity matrix to analyze the semantic similarities among our corpora.

While our procedure limits the semantic comparison to terms that express some emotion, the approach also produces valuable information for a broad list of tasks that can take advantage of emotional information, such as opinion mining, emotion identification, hate speech identification, and bot identification, among many others. The limitation is necessary since the embedding construction procedure does not produce compatible vector spaces across regional corpora; this effect occurs even for different runs on the same corpus. This characteristic is detailed below.

5.1 Semantic representations with word embeddings

Word embeddings are vector representations of a vocabulary; each term is represented by a high-dimensional vector that captures the semantics of the corresponding word in the collection. The construction of the embedding learns semantics using the so-called distributional-semantics hypothesis: if two terms tend to be used in similar contexts, then they will exhibit similar semantics. This statement is used to create sophisticated strategies to learn word meanings. If two vectors are close under some given distance, they will be semantically related, and vice versa. In summary, word embeddings have become a popular and effective way to capture semantics from a corpus Yang et al. (2018).

There exist several techniques to learn word embeddings, for instance Word2Vec Mikolov et al. (2013a), FastText Joulin et al. (2017), and GloVe Pennington et al. (2014). In this work, we used FastText to create our word-embedding models. FastText is an open-source and free library for text classification and word vector representation; see Mikolov et al. (2013b). It uses a shallow neural network to learn the embeddings with two main strategies: skip-gram and cbow. A neural network is a machine learning algorithm that learns from a dataset by minimizing a loss function with some optimization algorithm. The training process can be roughly described as follows. The network consists of a set of layers that can be seen as stages of computation; each layer is composed of matrices, often called parameters, that describe the core of the model, followed by a non-linear output function. The learning process begins with random parameters and then iteratively evaluates the loss function, computes an error, and adjusts the network's parameters with this information. Due to this non-deterministic process, two runs of the same network on the same dataset can produce two incompatible spaces. FastText's cbow model learns the semantics of a word using its context (a window of words around it in the text); the idea is to take the context words and try to predict the target word. The skip-gram model uses a similar window but predicts the target word using a random word in that window. FastText also learns sub-words, which are character q-grams that compose a word; this is used to handle out-of-vocabulary terms. The latter is of particular use on social networks like Twitter, where we can find many lexical variations due to typos, hashtags, deformations, newly generated words, etc. As an insight into its robustness, FastText has been used in a variety of text classification tasks; for instance, in Karpathy and Fei-Fei (2015), an image description approach is presented with the purpose of learning correspondences between language and visual data. It is also used in more classical tasks such as sentiment analysis Wan et al. (2018); Alessa et al. (2018), fake news detection De Sarkar et al. (2018), etc. This shows the importance of this tool in the natural language processing community.

As commented, we created 26 word-embedding models, one per country. We learn 300-dimensional vectors, which is almost a standard for pre-trained embeddings. For the rest of FastText's hyper-parameters, we select the default values. We also apply the text preprocessing described in §4. In addition, for comparison purposes, we use the entire corpora as a single corpus to create a global word embedding. This is the strategy of most pre-trained word embeddings, and it is used as a control to check whether regional word embeddings are equivalent to a global one. These 27 embeddings are available at https://ingeotec.github.io/regional-spanish-models/.
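A minimal sketch of how one regional embedding can be built from Julia by driving the fastText command-line tool follows; the skipgram subcommand and the file naming are assumptions (the text above only fixes the 300 dimensions and the default hyper-parameters), and fastText can equally be used through its Python bindings.

```julia
# One preprocessed, normalized text file per region, e.g. AR.txt, MX.txt, ...
for cc in ["AR", "MX", "ES", "US"]
    run(`fasttext skipgram -input $(cc).txt -output es-$(cc) -dim 300`)
    # fastText writes es-<cc>.bin (full model) and es-<cc>.vec (word vectors)
end
```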

5.2 Emotion lexicon

We use the NRC Emotion Lexicon Mohammad and Turney (2013, 2010), EmoLex, as a source of meaningful words. We use these words as anchors or references, required by our comparison methodology. The EmoLex lexicon comprises a list of more than 14 thousand English words labeled with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). A word can have none, one, or more than one associated emotion. The annotations were made by human crowdsourcing using the Mechanical Turk platform (https://www.mturk.com). The authors also provide automatic translations into a large number of languages, including Spanish (http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm).
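For illustration, a small Julia sketch of loading the lexicon follows, assuming the word-level tab-separated distribution format (word, emotion, 0/1 flag); the file name is an assumption and may differ in the downloaded package.

```julia
# emolex[word] -> set of emotions/sentiments associated with that word
function load_emolex(path="NRC-Emotion-Lexicon-Wordlevel-v0.92.txt")
    emolex = Dict{String,Set{String}}()
    for line in eachline(path)
        word, emotion, flag = split(line, '\t')
        flag == "1" && push!(get!(emolex, word, Set{String}()), emotion)
    end
    emolex
end
```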

5.3 Affinity matrix and semantic comparison

Figure 7: Affinity matrix visualization of the regional embeddings based on the semantic emotional projection.

As before, we start by constructing an affinity matrix to compare the semantics found in our corpora. Figure 7 shows the flow diagram used to compute this matrix. As commented, the learning procedure of word embeddings does not allow comparing vectors from different runs, even when they are created from messages written in the same language. Therefore, it is necessary to use a different approach to characterize semantic similarities among Spanish dialects, and we elaborate a different strategy for comparison. We used a subset of EmoLex: we selected the words labeled with at least one emotion that remain unique after our normalization procedure; this reduces the lexicon from more than 14 thousand words to 3821 unique labeled words. Note that the number of words drops significantly since the Spanish translation was performed with automatic tools; also, several words in Spanish are distinguished only by diacritic symbols, which were removed. We already expect these lexical variations to occur in informal texts like those found in social networks.
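The normalization and deduplication step might look like the following sketch, where diacritics are stripped and case is folded with Julia's Unicode standard library; the paper's actual normalization pipeline (§4) may include further steps.

```julia
using Unicode

norm_word(w) = Unicode.normalize(w; stripmark=true, casefold=true)  # drop diacritics, lowercase

# Keep only the lexicon entries whose normalized form is unique.
function unique_anchors(emolex)
    counts = Dict{String,Int}()
    for w in keys(emolex)
        nw = norm_word(w)
        counts[nw] = get(counts, nw, 0) + 1
    end
    [w for w in keys(emolex) if counts[norm_word(w)] == 1]
end
```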

We call these reduced lexicons references or anchors. They are used to induce a projection of each corpus so that word embeddings of different regions can be compared. Given a distance function d and the set of anchors, each word in each region is represented by the order in which it perceives the references. Both anchors and terms must be taken from the same word embedding for this representation to be valid. Let w be a word and R = {r_1, …, r_m} the set of references of size m; then the rank of w is defined by the permutation π that sorts (d(w, r_1), …, d(w, r_m)). Two ranks can be compared using the Spearman distance, which can be efficiently computed as the Euclidean distance between the inverse permutations π⁻¹. We also fix the same set of references for every regionalized embedding; this simplifies and accelerates the entire process. The procedure is detailed in Figure 7.
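A minimal Julia sketch of the rank representation and the Spearman distance follows; vec_of stands for a lookup into one regional embedding (e.g., a Dict from word to vector) and is an assumed helper, and the use of the cosine distance as d is also an assumption.

```julia
using LinearAlgebra

cosine_dist(u, v) = 1.0 - dot(u, v) / (norm(u) * norm(v))

# Permutation π that sorts the distances from word w to the anchors r_1, …, r_m.
rank_of(w, anchors, vec_of) = sortperm([cosine_dist(vec_of(w), vec_of(r)) for r in anchors])

# Spearman distance between two ranks: the Euclidean distance of the inverse permutations.
spearman(p, q) = norm(invperm(p) .- invperm(q))
```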

The core idea behind the procedure is that anchors are common to all language variants, but their precise usage and meaning may vary. Therefore, we used EmoLex to select a number of well-known terms with high semantic value and let each corpus define them in its own terms, as shown in the flow diagram. The resulting rank matrices are compared with a distance function, also indicated in the diagram, and the outcome is used to create our main semantic affinity matrix.

This affinity matrix, depicted as a heatmap, is shown at the bottom of Fig. 7; the lower the distance, the darker the hue. Here we can observe the semantic similarities among the Spanish-speaking countries. For instance, we can see that Mexico is most similar to Argentina, Colombia, Chile, Spain, Peru, and the US. The case of the US is expected due to the number of Mexican migrants in that country. Another aspect seen here is that Cuba and Puerto Rico are the most distant countries from the rest. The ALL embedding, created using the entire corpora as input data, is also included. We can observe that it is among the closest semantic neighbors of Argentina, Chile, Colombia, Spain, and Mexico, while the rest of the countries are farther from it. This finding may indicate that pre-trained embeddings built on a general language corpus may not be suitable for some regions since their concepts vary from the mainstream definitions. Our case applies to emotions, but our methodology can be extended to other domains.

(a) k = 3

(b) k = 7

(c) k = 12
Figure 8: A two-dimensional reduction with UMAP to visualize the regional embeddings based on the semantic emotional projection. Colors are created by reducing the same data to three dimensions.

Figure 8 shows three projections of our semantic affinity matrix in 2D using UMAP. The figures correspond to three, seven, and twelve neighbors in the dimensionality reduction procedure; few neighbors capture local structure while many capture the global one. The colors are computed by applying a 3D UMAP reduction to the same affinity matrix; the resulting components are translated and scaled into valid RGB values. For instance, Figure 8(a) shows the UMAP reduction for k = 3. Both distances and colors describe a few well-defined groups. The largest cluster is colorized with dark tones, capturing the differences between South American countries and Central and North American countries. The ALL embedding is central to the largest cluster and distant from the other clusters. With more neighbors, as in Figures 8(b) and 8(c), the projections show a more complex relation between regions, forming a connected cloud in which colors reinforce the closeness notion.

Figure 9: Geographical location using RGB representation of each region.

Figure 9 illustrates the groups geographically, colorizing regions as obtained with the 3D reduction (using k = 3). Geographic neighborhoods are sometimes preserved, but it is not the rule. It is interesting to note that Canada, Brazil, France, and Great Britain are grouped since these are countries where native Spanish speakers are foreign. The United States of America has a similar profile, yet it is closer to countries with a large base of native Spanish speakers. We believe that the population of native Spanish speakers in the US is so large that it looks like a country with de facto acceptance of the Spanish language.

5.4 Concept dissimilarities

(a) Concepts with high consensus among regions
(b) Concepts with low consensus among regions
Figure 10: Consensus in the NRC lexicon on words labeled with at least one emotion; the weight is related to the amount of consensus, normalized per figure to improve the visualization of the clouds.

Our semantic representation allows comparing regional word embeddings for a language. In addition, it is also possible to mine which concepts do not change across embeddings and which change the most. For this task, we accumulate the Spearman distances among all permutations describing the same word across different word embeddings. Figure 10(a) shows a word cloud with high-consensus definitions; here, we can see words with a strong emotional charge, linked somehow to different human conditions. For instance, we can see desigualdad (inequality), discriminación (discrimination), condenar (condemn), criminalidad (criminality), justificar (justify), establecer (determine), intolerancia (bigotry), among others. Figure 10(b) shows the words with low consensus among regions; note that we normalize the consensus notion to improve visualization. Among the words with low consensus we find terms like estrangular (throttle), urna (urn), cosmopolita (cosmopolitan), filántropo (philanthropist), blanco (white), superestrella (superstar), and brújula (compass). Determining the words with high and low consensus among regions is essential for constructing messages that a sender can effectively deliver to audiences across large regions.
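A sketch of this accumulation follows, assuming ranks[region][word] holds the permutation of each EmoLex word in each regional embedding, as in the sketch of §5.3.

```julia
spearman(p, q) = sqrt(sum(abs2, invperm(p) .- invperm(q)))   # as in §5.3

function consensus(ranks, words, regions)
    total = Dict(w => 0.0 for w in words)
    for w in words, i in 1:length(regions)-1, j in i+1:length(regions)
        total[w] += spearman(ranks[regions[i]][w], ranks[regions[j]][w])
    end
    sort(collect(total); by=last)   # low totals: high consensus; high totals: low consensus
end
```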

6 Conclusions

This manuscript analyzes Spanish language variations using lexical and semantic approaches. For our analysis, we collected Twitter messages for four years, from 2016 to 2019. The messages had to be geotagged, written in Spanish, and located in one of the 26 selected countries or regions. We then collected messages from these 26 regions to create our Twitter Spanish Corpora.

Regarding our lexical approach, we characterize each corpus using several traditional tools, like Heaps' and Zipf's laws and its vocabulary, and other less traditional ones, like the distribution of emojis. We created an affinity matrix and produced visualizations using the UMAP dimension reduction algorithm that help us understand how dialects group, share properties, and distribute geographically. We counted and ranked emojis per corpus, separating the skin-tone marks and finding a kind of self-reported skin-tone distribution. Our current research is limited to language, but the study of the self-reported skin tone, its context, and its implications through Big Data analysis is part of our future research.

On the other hand, the semantic analysis considers the same corpora but uses the well-known FastText tool to create regional word embeddings. We use a subset of the NRC emotion lexicon (EmoLex) to enable the comparison of regional word embeddings. We create an affinity matrix of the corpora that captures the semantic similarities of the regional embeddings. Using this method, we produce different visualizations to understand the dialects' semantic similarity and geographical distribution.

As a result of this lexical and semantic comparison, we can conclude that although some countries behave similarly under both views, the overall clustering of countries differs between the semantic and the lexical spaces. This gives us the insight that both aspects must be considered in dialectometric analysis, at least when using Twitter data.

References

  • A. Alessa, M. Faezipour, and Z. Alhassan (2018) Text classification of flu-related tweets using fasttext with sentiment and keyword features. In 2018 IEEE International Conference on Healthcare Informatics (ICHI), pp. 366–367. External Links: Document, ISSN 2575-2634 Cited by: §5.1.
  • E. Arrigo, C. Liberati, and P. Mariani (2021) Social media data and users’ preferences: a statistical analysis to support marketing communication. Big Data Research 24, pp. 100189. Cited by: §1.
  • J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah (2017) Julia: a fresh approach to numerical computing. SIAM review 59 (1), pp. 65–98. Cited by: §1.
  • F. Corea (2016) Can twitter proxy the investors’ sentiment? the case for the technology sector. Big Data Research 4, pp. 70–74. Cited by: §1.
  • J. W. Crampton, M. Graham, A. Poorthuis, T. Shelton, M. Stephens, M. W. Wilson, and M. Zook (2013) Beyond the geotag: situating "big data" and leveraging the potential of the geoweb. Cartography and Geographic Information Science 40 (2), pp. 130–139. External Links: Document, Link, https://doi.org/10.1080/15230406.2013.777137 Cited by: §1.
  • S. De Sarkar, F. Yang, and A. Mukherjee (2018) Attending sentences to detect satirical fake news. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 3371–3380. External Links: Link Cited by: §5.1.
  • G. Donoso and D. Sánchez (2017) Dialectometric analysis of language variation in Twitter. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), Valencia, Spain, pp. 16–25. External Links: Link, Document Cited by: §1.
  • E. Dresner and S. C. Herring (2010) Functions of the nonverbal in cmc: emoticons and illocutionary force. Communication theory 20 (3), pp. 249–268. Cited by: §4.
  • J. Dunn (2019) Global syntactic variation in seven languages: toward a computational dialectology. Frontiers in Artificial Intelligence 2, pp. 15. Cited by: §2.
  • J. Dunn (2020) Mapping languages: the corpus of global language use. Language Resources and Evaluation 54 (4), pp. 999–1018. Cited by: §2.
  • B. Gonçalves and D. Sánchez (2014) Crowdsourcing dialect characterization through twitter. PloS one 9 (11), pp. e112074. Cited by: §2.
  • M. Graham, S. A. Hale, and D. Gaffney (2014) Where in the world are you? geolocation and language identification in twitter. The Professional Geographer 66 (4), pp. 568–578. Cited by: §1, §2.
  • D. Hovy, A. Rahimi, T. Baldwin, and J. Brooke (2020) Visualizing regional language variation across europe on twitter. Handbook of the Changing World Language Map, pp. 3719–3742. Cited by: §2.
  • Y. Huang, D. Guo, A. Kasakoff, and J. Grieve (2016) Understanding us regional linguistic variation with twitter data analysis. Computers, environment and urban systems 59, pp. 244–255. Cited by: §1, §1, §2.
  • A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2017) Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427–431. Cited by: §5.1.
  • A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137. Cited by: §5.1.
  • M. Kejriwal, Q. Wang, H. Li, and L. Wang (2021) An empirical study of emoji usage on twitter in linguistic and national contexts. Online Social Networks and Media 24, pp. 100149. Cited by: §2.
  • K. Li, H. Naacke, and B. Amann (2021) An analytic graph data model and query language for exploring the evolution of science. Big Data Research 26, pp. 100247. Cited by: §1.
  • M. Li, E. Chng, A. Y. L. Chong, and S. See (2019) An empirical analysis of emoji usage on twitter. Industrial Management & Data Systems. Cited by: §2.
  • L. McInnes, J. Healy, and J. Melville (2020) UMAP: uniform manifold approximation and projection for dimension reduction. External Links: 1802.03426 Cited by: §1.
  • T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013a) Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, USA, pp. 3111–3119. External Links: Link Cited by: §5.1.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013b) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §5.1.
  • D. Mocanu, A. Baronchelli, N. Perra, B. Gonçalves, Q. Zhang, and A. Vespignani (2013) The twitter of babel: mapping world languages through microblogging platforms. PloS one 8 (4), pp. e61981. Cited by: §2.
  • S. M. Mohammad and P. D. Turney (2013) Crowdsourcing a word-emotion association lexicon. Computational Intelligence 29 (3), pp. 436–465. Cited by: §5.2.
  • S. Mohammad and P. Turney (2010) Emotions evoked by common words and phrases: using Mechanical Turk to create an emotion lexicon. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, Los Angeles, CA, pp. 26–34. External Links: Link Cited by: §5.2.
  • J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §5.1.
  • C. A. Rodriguez-Diaz, S. Jimenez, G. Dueñas, J. E. Bonilla, and A. Gelbukh (2018) Dialectones: finding statistically significant dialectal boundaries using twitter data. Computación y Sistemas 22 (4), pp. 1213–1222. Cited by: §1, §2.
  • S. Wan, B. Li, A. Zhang, K. Wang, and X. Li (2018) Vertical and sequential sentiment analysis of micro-blog topic. In Advanced Data Mining and Applications, G. Gan, B. Li, X. Li, and S. Wang (Eds.), Cham, pp. 353–363. Cited by: §5.1.
  • X. Yang, C. Macdonald, and I. Ounis (2018) Using word embeddings in twitter election classification. Information Retrieval Journal 21 (2), pp. 183–207. External Links: ISSN 1573-7659, Document, Link Cited by: §5.1.