Analyzing the Surprising Variability in Word Embedding Stability Across Languages

04/30/2020 ∙ by Laura Burdick, et al. ∙ University of Michigan

Word embeddings are powerful representations that form the foundation of many natural language processing architectures and tasks, both in English and in other languages. To gain further insight into word embeddings in multiple languages, we explore their stability, defined as the overlap between the nearest neighbors of a word in different embedding spaces. We discuss linguistic properties that are related to stability, drawing out insights about how morphological and other features relate to stability. This has implications for the usage of embeddings, particularly in research that uses embeddings to study language trends.


1 Introduction

Word embeddings have become an established part of natural language processing (NLP) architectures Collobert et al. (2011); Dos Santos and Gatti (2014). The metric of stability, defined as the overlap between the nearest neighbors of a word in different embedding spaces, was introduced to measure variations in local embedding neighborhoods across changes in data, algorithms, and word properties Antoniak and Mimno (2018); Wendlandt et al. (2018). These studies found that many common English embedding spaces are surprisingly unstable, which has implications for work that uses embeddings as features in downstream tasks, and work that uses embeddings to study specific properties of language.

However, this analysis on English is most likely not representative of all languages. Since word embeddings rely only on text (rather than annotated data), they are broadly applicable, even in languages that have few linguistic resources available Adams et al. (2017). In this work, we explore stability of word embeddings in 111 languages across two different corpora. Having a better understanding of the differences caused by diverse languages will provide a foundation for building embeddings and NLP tools in all languages.

In English and, increasingly, other languages, it has become common to use contextualized word embeddings, such as BERT Devlin et al. (2019) and XLNet Yang et al. (2019). These contextualized embedding algorithms require huge amounts of computational resources and data. For example, it takes 2.5 days to train XLNet with 512 TPU v3 chips. In addition to requiring heavy computational resources, most contextualized embedding algorithms need large amounts of data; BERT is trained on 3.3 billion words. In contrast to these large corpora, many datasets from low-resource languages are fairly small Maxwell and Hughes (2006). In scenarios where huge amounts of data and computational resources are not feasible, it is worthwhile to continue developing our knowledge of context-independent word embeddings, such as word2vec Mikolov et al. (2013) and GloVe Pennington et al. (2014). These algorithms continue to be used in a wide variety of situations, including the computational humanities Abdulrahim (2019); Hellrich et al. (2019) and languages where only small corpora are available Joshi et al. (2019). According to Google Scholar, in 2019 alone, GloVe was cited approximately 4,700 times and word2vec was cited approximately 5,740 times.

In this work, we consider how stability (percent overlap between nearest neighbors in an embedding space) varies for different languages. Specifically, we explore how linguistic properties are related to stability, a previously understudied relationship. Using regression modeling, we are able to capture relationships between linguistic properties and average stability of a language, and we draw out insights about how morphological and other features relate to stability. For instance, we find that languages with more complex morphology tend to be less stable than languages with simpler morphology. Our findings provide crucial context for research that uses word embeddings to study language properties (e.g., Hellrich et al., 2019; Heyman and Heyman, 2019). Often research that uses embeddings to study language trends relies on raw embeddings created by GloVe or word2vec Abdulrahim (2019); Hellrich et al. (2019). If these embeddings are unstable, then research using them needs to take this into account (in terms of methodologies, error analysis, etc.).

2 Related Work

Word embeddings are low-dimensional vectors used to represent words, normally in downstream tasks such as named entity recognition Collobert et al. (2011) and sentiment analysis Dos Santos and Gatti (2014). They have been shown to capture both syntactic and semantic properties of words, making them useful in a wide range of NLP tasks Mikolov et al. (2013). In this work, we explore word embeddings that generate one embedding per word, regardless of the word's context. We consider two widely used algorithms: word2vec Mikolov et al. (2013) and GloVe Pennington et al. (2014).

Embeddings in Many Languages. In this work, we analyze embeddings in multiple languages. Our analysis is important because word embeddings are in common usage in many languages. Even when large corpora of data are not available, there has been interest in how word embeddings can be leveraged for low-resource languages Adams et al. (2017); Jiang et al. (2018). To build embeddings across many different languages efficiently, some recent research has focused on building cross-lingual embedding spaces, where words from different languages are embedded in the same vector space Chen and Cardie (2018); Ruder et al. (2019). There has also been recent interest in using word embeddings to create methods that work across a wide range of languages, for instance, in sentiment analysis Zhao and Schütze (2019).

Intrinsic Evaluation and Analysis of Embeddings. There has been much interest in evaluating the quality of different word embedding algorithms. This is typically done extrinsically, by measuring performance on a downstream task, but there have also been efforts to build reliable intrinsic evaluation techniques (e.g., Gladkova and Drozd 2016).

Intrinsic methods have also been proposed to explore the properties and limitations of word embeddings. Similar to the work we present here on stability, there is other research on how nearest neighbors vary as properties of the embedding spaces change Pierrejean and Tanguy (2018). Additional work has looked at how semantic and syntactic properties of words change with different embedding algorithm and parameter choices Artetxe et al. (2018); Yaghoobzadeh and Schütze (2016). This previous work only evaluates on English, unlike our current research.

3 Data

Building on this definition of stability, we explore the stability of word embeddings in different languages. We work with two datasets, Wikipedia and the Bible. Wikipedia has more data, but covers fewer languages. The Bible is smaller, but covers more languages. Wikipedia is a comparable corpus, whereas the Bible is a parallel corpus.

3.1 Wikipedia Corpus

We use pre-processed Wikipedia dumps in 40 languages taken from Al-Rfou et al. (2013), available online at https://sites.google.com/site/rmyeid/projects/polyglot. These texts have been previously segmented using an OpenNLP probabilistic tokenizer whenever possible (for Danish, German, English, French, Dutch, and Portuguese), and with a Unicode text segmentation algorithm offered by Lucene (see http://www.unicode.org/reports/tr29/) when no language-specific model is available Al-Rfou et al. (2013).

In order to verify that this word segmentation is reasonable, we asked speakers of several of the languages (Finnish, German, Romanian, Italian, French, English, and Arabic) to look over a subset of the data and describe any errors that they saw. All languages that we checked were confirmed to have reasonable word segmentation, though a few small inconsistencies were observed. In Finnish, several word cases were handled inconsistently, and in Italian and French, determiners followed by words beginning with a vowel were not segmented correctly. However, speakers of these languages confirmed that these inconsistencies were relatively minor and most of the text is well-segmented.

3.2 Bible Corpus

The Bible corpus contains 1,821 full and partial Bibles in 1,104 languages McCarthy et al. (2020). In order to have enough data to train word embeddings, we work with Bibles that are at least 75% complete (to work with a maximum number of languages, we only consider the complete Protestant Bible, e.g., all of the verses that appear in the English King James Version). This leaves us with 97 languages (many of the Bibles in the corpus only have parts of the New Testament translated, which is why this number is substantially smaller than the number of total languages represented in the corpus).

We consider two sets of languages with the Bible corpus: the languages that overlap with the set of Wikipedia languages, and languages that are not covered in Wikipedia. Twenty-six languages exist in both the Wikipedia and Bible corpora.

3.3 WALS

To further explore these languages, we use information from the World Atlas of Language Structures (WALS), available online at https://wals.info, a database consisting of phonological, lexical, and grammatical properties of languages Dryer and Haspelmath (2013). This resource is hand-curated by experts, and contains 192 language features, each of which has between two and twenty-eight categorical values. Over two thousand languages have WALS entries, and each language is annotated for a subset of the features.

For example, WALS provides a number of word order features describing subject (S), object (O), and verb (V) order in various languages. These features include SOV order, languages with two dominant SOV orders, SV order, and OV order. A second example is the phonological features included in WALS such as the absence of common consonants. WALS also has features covering the areas of morphology, nominal categories, nominal syntax, verbal categories, simple clauses, complex sentences, lexicon, and sign language.

4 Calculating Stability in Many Languages

GloVe 1 GloVe 2 GloVe 3
indie punk punk
punk indie pop
progressive alternative indie
pop progressive alternative
roll band band
band sedimentary roll
blues bands progressive
brass psychedelic folk
classic climbing climbing
alternative pop metal
Table 1: Ten words most similar to rock in three GloVe models trained on subsets of Wikipedia. Only words present in all three subsets are considered. Words in all lists are in bold; words in only two lists are italicized.

While our work is an analysis across many languages, stability has previously been explored for English word embeddings Antoniak and Mimno (2018); Wendlandt et al. (2018); Pierrejean and Tanguy (2018).

4.1 Defining Stability

Stability is defined as the percent overlap between nearest neighbors in an embedding space. To calculate stability, given a word W and two embedding spaces X and Y, take the ten nearest neighbors (measured using cosine similarity) of W in both X and Y. The stability of W is the percent overlap between these two lists of nearest neighbors. 100% stability indicates perfect agreement between the two embedding spaces, while 0% stability indicates complete disagreement. This definition of stability can be generalized to more than two embedding spaces by considering the average overlap between pairs of embedding spaces. Let A and B be two sets of embedding spaces. Then, for every pair of embedding spaces (x, y), where x is in A and y is in B, take the ten nearest neighbors of W in both x and y and calculate the percent overlap. Let the stability be the average percent overlap over every pair of embedding spaces (x, y). Sets of nearest neighbors smaller and larger than ten have been tried previously, with comparable results Wendlandt et al. (2018).

Table 1 shows the ten nearest neighbors for the word rock in three GloVe models trained on different subsets of English Wikipedia. Models 1 and 2 have 6 words (60%) in common, models 1 and 3 have 7 words (70%) in common, and models 2 and 3 have 7 words (70%) in common. Therefore, this word has a stability of 66.7%, the average word overlap between the three models. (In this simple example, the two sets of embedding spaces A and B are identical. It is also possible to calculate stability when A and B differ.)
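As a concrete illustration (not the authors' released code), the following is a minimal Python sketch of this computation for the common case where the two sets of embedding spaces are identical, assuming each embedding space is given as a word-to-vector dictionary; the function names are ours.

```python
from itertools import combinations
import numpy as np

def nearest_neighbors(space, word, k=10):
    """k nearest neighbors of `word` in `space` (a word -> vector dict), by cosine similarity."""
    target = space[word] / np.linalg.norm(space[word])
    scores = {w: float(np.dot(target, v / np.linalg.norm(v)))
              for w, v in space.items() if w != word}
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def stability(word, spaces, k=10):
    """Average percent overlap of the k nearest neighbors of `word`,
    taken over all unordered pairs of embedding spaces (the A == B case)."""
    overlaps = [100.0 * len(nearest_neighbors(x, word, k) & nearest_neighbors(y, word, k)) / k
                for x, y in combinations(spaces, 2)]
    return float(np.mean(overlaps))
```

For the Table 1 example, the three pairwise overlaps are 60%, 70%, and 70%, so a call like stability("rock", [glove1, glove2, glove3]) would return 66.7.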

Previous work identified factors that play a role in the stability of word embeddings. For instance, it was found that the presence of certain documents in the training corpus affects stability Antoniak and Mimno (2018), and that training and evaluating embeddings on separate domains is less stable than training and evaluating on the same domain Wendlandt et al. (2018).

Throughout this work, we group stability into buckets to visualize our results (e.g., Figure 1). We use buckets of 5% (e.g., 0-5% stability, 5.1-10% stability). Doing this allows us to see patterns in stability for a corpus that are not visible from a single overall average.
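For illustration only, per-word stabilities could be binned into these 5% buckets with numpy along the following lines (the exact bucket edges are our assumption):

```python
import numpy as np

def bucket_fractions(stabilities, width=5.0):
    """Fraction of words whose stability falls in each `width`-percent bucket (0-5%, 5-10%, ..., 95-100%)."""
    edges = np.arange(0.0, 100.0 + width, width)
    counts, _ = np.histogram(stabilities, bins=edges)
    return counts / max(len(stabilities), 1)
```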

4.2 Effect of Downsampling on Stability

Stability measures how making changes to the input data or algorithm affects embeddings. We expect some changes to cause instability, such as changing the embedding size by a factor of ten or greater. For other variations, instability is surprising, such as changing the random seed for the algorithm. Deciding what variation to introduce has an effect on the stability that is measured. For our experiments, we consider a previously unstudied source of instability: different data samples from the same distribution.

One way to generate data samples is to downsample (with or without replacement) an existing corpus to create multiple smaller corpora. Then, stability can be measured across these downsamples. Here, we consider whether studying stability across downsamples produces consistent results that we can compare across languages. This is a subtle methodological choice that, if made poorly, could lead to incorrect conclusions.
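The two sampling schemes can be sketched as follows (a simplified illustration; the with-replacement setup in the paper additionally controls the exact overlap percentage between downsamples, which this sketch does not):

```python
import random

def downsample_with_replacement(sentences, n_samples=5, sample_size=500_000, seed=0):
    """Each downsample draws `sample_size` sentences independently, so downsamples can overlap."""
    rng = random.Random(seed)
    return [rng.choices(sentences, k=sample_size) for _ in range(n_samples)]

def downsample_without_replacement(sentences, n_samples=5, sample_size=100_000, seed=0):
    """Shuffle once and cut disjoint slices, so no sentence is shared between downsamples."""
    assert n_samples * sample_size <= len(sentences), "corpus too small for disjoint downsamples"
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    return [shuffled[i * sample_size:(i + 1) * sample_size] for i in range(n_samples)]
```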

First, we consider downsampling with replacement. We sample five sets of 500,000 sentences multiple times, controlling the amount of overlap between downsamples (from 10% to 60%); the data are drawn from an English Wikipedia corpus of 5,269,686 sentences (denoted "Large English Wikipedia"). Stability is calculated using GloVe embeddings and the words that occur in every downsample for every overlap percentage. Figure 1 shows the results. While stability trends are similar for different overlap amounts, we see that stability is consistently higher as the overlap amount increases. Using this method for multiple corpora of different sizes would mean that stability on the downsampled corpora could not be reliably compared, because the overlap amount would vary depending on the size of the original corpus. In our case, the Bible corpus is substantially smaller than the Wikipedia corpus, so if we used this downsampling method, we would not be able to accurately compare stability between the Bible and Wikipedia. For this reason, we do not use downsampling with replacement in our experiments.

We instead use downsampling without replacement. Figure 2 shows stability after downsampling without replacement for different downsampling sizes. We see that varying the size of the downsample does not have a large effect on the patterns of stability. Particularly when looking at lower stability (towards the left side of the graph), the trends are remarkably consistent, even when the downsample size varies from 50,000 sentences to 500,000 sentences. The pattern grows less consistent when looking at higher stability (towards the right side of the graph), particularly with smaller downsample sizes.

This shows us that downsampling without replacement produces more consistent (and thus comparable) stability results than downsampling with replacement.

Figure 1: Percentage of words that occur in each stability bucket when varying the five Large English Wikipedia downsamples (with replacement) to have different amounts of overlap. Each line shows results for a different percentage of overlap.
Figure 2: Percentage of words that occur in each stability bucket after downsampling without replacement on Large English Wikipedia. Each line shows results for different downsample sizes (measured by number of sentences). For each size, five downsamples are taken and stability is calculated across GloVe embeddings for every word that appears more than five times across all downsamples.

4.3 Stability for Wikipedia

Because we see that downsampling with replacement is unreliable, all Wikipedia corpora are downsampled without replacement. In order to determine the best way to calculate stability, for each language in our Wikipedia corpus, we experiment with three settings: (1) Stability with GloVe embeddings across five downsampled corpora, (2) Stability with word2vec embeddings across five downsampled corpora, and (3) Stability with word2vec using five different random seeds across one downsampled corpus.

Figure 3: Percentage of words that occur in each stability bucket for four different methods, three on Wikipedia and one on the Bible. The 26 languages in common are shown here. The average stability for each method is shown on the individual graphs.
(a) German
(b) French
Figure 4: Percentage of words that occur in each stability bucket for different Bible translations in French and German.

Each downsampled corpus is 100,000 sentences, and words that occur with a frequency less than five are ignored. Previous work Pierrejean and Tanguy (2018); Wendlandt et al. (2018) has indicated that words that appear this infrequently will be very unstable. We use standard parameters for both embedding algorithms. For GloVe Pennington et al. (2014), we use 100 iterations, 300 dimensions, a window size of 5, and a minimum word count of 5; these parameters led to good performance in Wendlandt et al. (2018). For word2vec (w2v) Mikolov et al. (2013), we use 300 dimensions, a window size of 5, and a minimum word count of 5. For each embedding, we calculate the ten nearest neighbors of every word using FAISS Johnson et al. (2019) (we use exact, not approximate, search). Finally, for each language, we calculate the stability for every word in that language across all five embedding spaces.
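A rough sketch of this pipeline for the word2vec setting, assuming gensim 4.x for training and the faiss library for exact nearest-neighbor search (GloVe, which has its own reference implementation, is not shown); the parameter values follow the text above, while the function names and other details are our assumptions.

```python
import faiss
from gensim.models import Word2Vec

def train_w2v(sentences, seed):
    """Train word2vec with 300 dimensions, window size 5, minimum count 5 (as in the text)."""
    model = Word2Vec(sentences, vector_size=300, window=5, min_count=5,
                     seed=seed, workers=1)  # workers=1 improves reproducibility for a fixed seed
    return model.wv

def top_neighbors(wv, k=10):
    """Exact top-k cosine neighbors for every vocabulary word, using a flat FAISS index."""
    vectors = wv.vectors.astype("float32")
    faiss.normalize_L2(vectors)                  # cosine similarity == inner product on unit vectors
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    _, idx = index.search(vectors, k + 1)        # k+1 because each word retrieves itself
    words = wv.index_to_key
    return {words[i]: [words[j] for j in row if j != i][:k] for i, row in enumerate(idx)}
```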

Figure 3 shows bucketed stability for all three methods. Visualizing stability in buckets allows us to see patterns in stability that are obscured in a single overall average. We see that these methods generally show similar behavior. For the following experiments on Wikipedia, we use stability with GloVe embeddings across five downsampled corpora.

Figure 3 also reveals a subtle issue in prior work stemming from the use of downsampling to measure stability. Previous work claimed that GloVe was more stable than w2v Wendlandt et al. (2018), but this claim was based on numbers calculated from overlapping corpora. Figure 3 gives us a more accurate comparison of GloVe and w2v. In English, we see that GloVe on Wikipedia has an average stability of 0.84, while w2v on Wikipedia with downsampling has an average stability of 0.79. This is not a substantial difference, and it casts doubt on the claim that GloVe is more stable than w2v.

4.4 Stability for the Bible

The Bible corpus is substantially smaller than the Wikipedia corpus, so downsampling to calculate stability is not a feasible option. Given that Figure 3 shows that word2vec with a single downsample and five different random seeds gives comparable stability results to using GloVe across five downsamples, we choose to use this method (we use the same w2v parameters as for Wikipedia). By comparing to GloVe in Figure 3, we confirm that this method for measuring stability is reasonable and will produce intuitive results.

Several languages have multiple Bible translations. As a sanity check, we verify that stability is consistent across multiple translations of the same language. Figure 4 shows that stability patterns are very consistent. The French Parole de Vie translation (yellow line in Figure 4(b)) intentionally uses simpler, everyday language, which could explain why this line follows a different pattern than the other French translations. For further experiments on languages with multiple Bible translations, we choose the Bible translation with the highest average stability.

In this section, we have considered the best way to measure stability. For Wikipedia, we measure stability with GloVe embeddings across five downsampled corpora, while for the Bible (a much smaller corpus), we measure it with w2v embeddings across five random seeds. We have shown that these methods produce consistent results, allowing us to compare across languages.

5 Regression Modeling

For both of these corpora, we are interested in which language properties are related to stability. To tease out various linguistic factors, we use a ridge regression model Hoerl and Kennard (1970) that predicts the average stability of all words in a language using features reflecting language properties. Ridge regression regularizes the magnitude of the model weights, producing a more interpretable model than non-regularized linear regression. This regularization also mitigates the effects of multicollinearity (when two features are highly correlated). Regression models have previously been used to measure the impact of individual features Singh et al. (2016). We choose to use a linear model here because of its interpretability. While more complicated models might yield additional insight, we show that there are interesting connections to be drawn from a linear model.

Since we are using regression models to learn associations between certain features and stability, no test data are necessary. The emphasis is on the model itself and the feature weights it learns, not on the model’s performance on a task.

Though we do not use test data, we do want to know how well our model fits the training data that we give it. For each model, we measure goodness of fit using the coefficient of determination R². The R² score measures how much variance in the dependent variable y is captured by the independent variables X. A model that always predicts the expected value of y, regardless of the input features, will have an R² score of 0. The highest possible score is 1, and the score can be negative.

We use the R² score to understand how our model is performing overall, and we use the individual weights of features to measure how much a particular feature contributes to the overall model. We experiment with two regression models: a full model and a targeted morphology model.

For all models, the inputs are linguistic features of a language. Since WALS properties are categorical, we turn each property into a set of binary features. We also include an "Unknown" value, which we use when a feature is not defined for a language. Note that because all of our input features are binary, all weights are easily comparable. The output of each model is the average stability of a language, which is calculated by averaging together the stability of all of the words in a language. For each model, we bootstrap over the input features, allowing us to calculate standard error for both the R² score and the model weights. Calculating significance for each feature allows us to discard highly variable weights and focus on features that consistently contribute to the regression model, giving us more confidence in the results.
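A hedged sketch of this setup using pandas and scikit-learn: categorical WALS properties are expanded into binary indicators (with "Unknown" filling gaps), ridge regression predicts average stability, and bootstrap resampling gives standard errors for the R² score and the weights. Resampling languages as the bootstrap unit, the alpha value, and the number of bootstrap rounds are our assumptions, not details taken from the paper.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

def fit_with_bootstrap(wals_df, avg_stability, n_boot=1000, seed=0):
    """wals_df: one categorical WALS column per feature (gaps filled with 'Unknown').
    avg_stability: average stability per language (same row order as wals_df)."""
    X = pd.get_dummies(wals_df.fillna("Unknown")).astype(float)  # binary indicator features
    y = np.asarray(avg_stability, dtype=float)
    rng = np.random.default_rng(seed)
    weights, scores = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))      # resample languages with replacement
        model = Ridge(alpha=1.0).fit(X.iloc[idx], y[idx])
        weights.append(model.coef_)
        scores.append(r2_score(y, model.predict(X)))    # goodness of fit on the full data
    weights = np.asarray(weights)
    return (pd.Series(weights.mean(axis=0), index=X.columns),   # mean weight per feature
            pd.Series(weights.std(axis=0), index=X.columns),    # bootstrap standard error
            float(np.mean(scores)), float(np.std(scores)))      # R^2 mean and standard error
```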

WALS Attribute Weight
Tone: Complex tone system
Suppletion according to tense and aspect: None
Numeral classifiers: Optional
Preverbal negative morphemes: Unknown
Minor morphological means of signaling negation: Unknown
Indefinite articles: Unknown
Sex-based and non-sex-based gender systems: No gender
Tone: Unknown
Suppletion according to tense and aspect: Unknown
Purpose clauses: Balanced
Table 2: Weights with the highest magnitude in the full regression model on all languages. Negative weights correspond with low stability, and positive weights correspond with high stability.

5.1 Full Model

First, we train a full model considering a large set of WALS features. This allows us to see which language features (represented in WALS) correlate strongly with stability. To confirm that our two corpora correlate reasonably well, we first train two regression models on all languages that are covered by both Wikipedia and the Bible (26 languages), using all WALS features that cover at least 25% of these languages. One model is trained on Wikipedia, and one model is trained on the Bible (both use the same WALS features as input, but may differ in the average stability that is being predicted). Both achieve high R² scores, and the significant weights of the two models also correlate well (measured with the Pearson correlation coefficient). This is intuitive, because these models cover the same languages. This also gives us confidence that the models are not overfitting to a specific set of languages.

Since we see that our two corpora correlate reasonably well, we combine both corpora to build a regression model that includes all of the languages that we have (111 languages). Combining corpora allows us to cover a larger number of languages, and it allows us to generalize across both datasets. If a language is present in both the Wikipedia and the Bible corpus, we average the stabilities from both corpora. We filter out all WALS features that are covered by less than 1% of our languages, leaving us with 35 WALS features. This model has a reasonably high R² score, indicating that it fits the data well. The significant weights with the highest magnitude are shown in Table 2. (We will discuss these results more thoroughly in Section 6.)

WALS Attribute Weight
Poss. classification: None
Fusion: Exclusively concatenative
Fusion: Ablaut / concatenative
Fusion: Isolating / concatenative
Exponence: Monoexponential case
Poss. classification: Unknown
Exponence: Case + number
Exponence: Unknown
Fusion: Unknown
Exponence: No case
Poss. classification: Two classes
Fusion: Exclusively isolating
Table 3: Significant weights in the morphology regression model. Negative weights correspond with low stability, and positive weights correspond with high stability.

5.2 Morphology Model

Next, we specifically consider the role of morphology in stability. We look at three WALS features: fusion, exponence, and possessive classification. Taken together, these three features capture the differences between isolating, agglutinative, and fusional languages Bickel and Nichols (2013b). Fusion refers to how grammatical markers (formatives) connect to a word or stem Bickel and Nichols (2013b). If a single formative forms a single word, then it is an isolating formative (e.g., Indonesian). Concatenative formatives form a single phonological word, along with a host word (e.g., Turkish). These formatives can still be clearly separated into morphemes. Formatives that cannot be clearly separated are called nonlinear (e.g., Hebrew), of which there are two types: ablaut and tonal. Languages can have a combination of these types of fusion. Exponence refers to the number of categories (e.g., number, case) that go into a single formative Bickel and Nichols (2013a). Possessive classification quantifies the number of ways to form a possessive noun phrase in a language Nichols and Bickel (2013).

WALS Attribute English Vietnamese Mandarin
Tone: No tones -0.37 - -
Sex-based and non-sex-based gender systems: No gender - 0.47 0.47
Nominal and verbal conjunction: Identity -0.3 -0.3 -
Suppletion according to tense and aspect: None - -0.37 -0.37
Order of subject and verb: SV -0.58 -0.58 -0.58
Zero copula for predicate nominals: Impossible -0.35 - -0.35
Numeral bases: Decimal 0.32 0.32 0.32
Preverbal negative morphemes: NegV 0.32 0.32 0.32
Predicted value 1.48 3.42 1.90
Ground truth: average stability 1.74 3.46 1.19
Table 4: Weights for three languages (English, Vietnamese, and Mandarin) where the weight has a large magnitude. Zero weights are shown as dashes. The predicted and ground truth values for each language are shown at the bottom of the table. Negative weights correspond with low stability, and positive weights correspond with high stability.
(a) Tone
(b) Suppletion
Figure 5: Violin plots comparing average stability distributions for languages with different properties.

We train a regression model taking as input these properties for all languages, and predicting the average stability of a language. This allows us to see how these specific properties of interest relate to stability. While this model has a lower R² score than the full model, it shows that there is still a connection between morphology and stability. The weights of the model are shown in Table 3.

6 Discussion

From these models, we draw out a few key points.

The full model captures relationships between linguistic properties and average stability of a language. Our full model achieves a reasonably high R² value, indicating that the model fits the input data well. To illustrate this, consider two WALS features, tone and suppletion, that appear among both the five highest and the five lowest weights. Tone describes how pitch patterns are used to distinguish different words and meanings. Complex tonal systems tend to be associated with other measures of phonological complexity, such as syllable complexity and number of consonants Maddieson (2013). Suppletion happens when normal semantic patterns are encoded in irregular ways (e.g., English buy vs. bought) Veselinova (2013). Languages can be categorized by where this suppletion occurs: in verb tense changes and/or in verb aspect changes.

Figure 5 shows the distribution of average stability for languages with different tonal and suppletion properties. For tone (Figure 5(a)), the largest category of languages has an unknown tonal system. In our dataset, five languages have a complex tonal system, which tends to contribute to lower average stability. The stability distribution of these languages is the widest. Suppletion (Figure 5(b)) shows a similar pattern. Unknown suppletion, the largest category, has the widest and highest distribution, and the regression model captures this by indicating that unknown suppletion is related to higher stability.

The full model has explanatory power to differentiate languages. Looking at the full model, we are able to compare languages and understand what contributes to differences in stability. Consider three languages: English, Vietnamese, and Mandarin. Table 4 shows the weights that contribute most to the model for these languages for one particular regression model. We see that while English has no tones, this negative weight is offset by other positive weights, giving English an overall high predicted stability.

Morphology is related to average stability of a language. While the morphology model does not perform as well as the full model, it does indicate that morphological features are related to stability. One of the most important features of the model is fusion (Table 3). Perhaps intuitively, more concatenative languages tend to be less stable than more isolating languages.

7 Conclusion

In this paper, we have considered how stability varies across different languages. This work is important because algorithms such as GloVe and word2vec continue to be used in a wide variety of scenarios, including the computational humanities and languages where large corpora are not available. We study the relationship between linguistic properties and stability, something that has been previously understudied. We draw out several aspects of this relationship, including that languages with more complex morphology tend to be less stable than languages with simpler morphology. These insights can be used in future work to inform the design of embeddings in many languages.

References

  • A. Z. Abdulrahim (2019) Ideological drifts in the US Constitution: detecting areas of contention with models of semantic change. In NeurIPS Joint Workshop on AI for Social Good.
  • O. Adams, A. Makarucha, G. Neubig, S. Bird, and T. Cohn (2017) Cross-lingual word embeddings for low-resource language modeling. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pp. 937–947.
  • R. Al-Rfou, B. Perozzi, and S. Skiena (2013) Polyglot: distributed word representations for multilingual NLP. In Proceedings of the 17th Conference on Computational Natural Language Learning, pp. 183–192.
  • M. Antoniak and D. Mimno (2018) Evaluating the stability of embedding-based word similarities. Transactions of the Association for Computational Linguistics 6, pp. 107–119.
  • M. Artetxe, G. Labaka, I. Lopez-Gazpio, and E. Agirre (2018) Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pp. 282–291.
  • B. Bickel and J. Nichols (2013a) Exponence of selected inflectional formatives. In The World Atlas of Language Structures Online, M. S. Dryer and M. Haspelmath (Eds.).
  • B. Bickel and J. Nichols (2013b) Fusion of selected inflectional formatives. In The World Atlas of Language Structures Online, M. S. Dryer and M. Haspelmath (Eds.).
  • X. Chen and C. Cardie (2018) Unsupervised multilingual word embeddings. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 261–270.
  • R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa (2011) Natural language processing (almost) from scratch. Journal of Machine Learning Research 12 (Aug), pp. 2493–2537.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186.
  • C. Dos Santos and M. Gatti (2014) Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of the 25th International Conference on Computational Linguistics, pp. 69–78.
  • M. S. Dryer and M. Haspelmath (Eds.) (2013) WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.
  • A. Gladkova and A. Drozd (2016) Intrinsic evaluations of word embeddings: what can we do better? In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pp. 36–42.
  • J. Hellrich, S. Buechel, and U. Hahn (2019) Modeling word emotion in historical language: quantity beats supposed stability in seed word selection. In Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pp. 1–11.
  • T. Heyman and G. Heyman (2019) Can prediction-based distributional semantic models predict typicality? Quarterly Journal of Experimental Psychology, pp. 1747021819830949.
  • A. E. Hoerl and R. W. Kennard (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12 (1), pp. 55–67.
  • C. Jiang, H. Yu, C. Hsieh, and K. Chang (2018) Learning word embeddings for low-resource languages by PU learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1024–1034.
  • J. Johnson, M. Douze, and H. Jégou (2019) Billion-scale similarity search with GPUs. IEEE Transactions on Big Data.
  • I. Joshi, P. Koringa, and S. Mitra (2019) Word embeddings in low resource Gujarati language. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), Vol. 5, pp. 110–115.
  • I. Maddieson (2013) Tone. In The World Atlas of Language Structures Online, M. S. Dryer and M. Haspelmath (Eds.).
  • M. Maxwell and B. Hughes (2006) Frontiers in linguistic annotation for lower-density languages. In Proceedings of the Workshop on Frontiers in Linguistically Annotated Corpora, pp. 29–37.
  • A. D. McCarthy, R. Wicks, D. Lewis, A. Mueller, W. Wu, O. Adams, G. Nicolai, M. Post, and D. Yarowsky (2020) The Johns Hopkins University Bible Corpus: 1600+ tongues for typological exploration. In Proceedings of the Twelfth International Conference on Language Resources and Evaluation (LREC 2020).
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
  • J. Nichols and B. Bickel (2013) Possessive classification. In The World Atlas of Language Structures Online, M. S. Dryer and M. Haspelmath (Eds.).
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543.
  • B. Pierrejean and L. Tanguy (2018) Towards qualitative word embeddings evaluation: measuring neighbors variation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pp. 32–39.
  • S. Ruder, I. Vulić, and A. Søgaard (2019) A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research 65, pp. 569–631.
  • A. D. Singh, P. Mehta, S. Husain, and R. Rajakrishnan (2016) Quantifying sentence complexity based on eye-tracking measures. In Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity, pp. 202–212.
  • L. N. Veselinova (2013) Suppletion according to tense and aspect. In The World Atlas of Language Structures Online, M. S. Dryer and M. Haspelmath (Eds.).
  • L. Wendlandt, J. K. Kummerfeld, and R. Mihalcea (2018) Factors influencing the surprising instability of word embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2092–2102.
  • Y. Yaghoobzadeh and H. Schütze (2016) Intrinsic subspace evaluation of word embedding representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 236–246.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
  • M. Zhao and H. Schütze (2019) A multilingual BPE embedding space for universal sentiment lexicon induction. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pp. 3506–3517.