LINSPECTOR: Multilingual Probing Tasks for Word Representations

Despite an ever growing number of word representation models introduced for a large number of languages, there is a lack of a standardized technique to provide insights into what is captured by these models. Such insights would help the community to get an estimate of the downstream task performance, as well as to design more informed neural architectures, while avoiding extensive experimentation which requires substantial computational resources not all researchers have access to. A recent development in NLP is to use simple classification tasks, also called probing tasks, that test for a single linguistic feature such as part-of-speech. Existing studies mostly focus on exploring the information encoded by the sentence-level representations for English. However, from a typological perspective the morphologically poor English is rather an outlier: the information encoded by the word order and function words in English is often stored on a subword, morphological level in other languages. To address this, we introduce 15 word-level probing tasks such as case marking, possession, word length, morphological tag count and pseudoword identification for 24 languages. We present experiments on several state of the art word embedding models, in which we relate the probing task performance for a diverse set of languages to a range of classic NLP tasks such as semantic role labeling and natural language inference. We find that a number of probing tests have significantly high positive correlation to the downstream tasks, especially for morphologically rich languages. We show that our tests can be used to explore word embeddings or black-box neural models for linguistic cues in a multilingual setting. We release the probing datasets and the evaluation suite with




1 Introduction

The field of natural language processing (NLP) has seen great development after replacing the traditional discrete word representations with continuous ones. Representing text with dense, low-dimensional vectors, or embeddings, has become the de facto approach, since these representations can encode complex relationships between the units of language and can be learned from unlabeled data, thus eliminating the need for expensive manual feature engineering. The initial success of dense representations in NLP applications has led to the development of a multitude of embedding models, which differ in terms of design objective (monolingual (Mikolov et al., 2013b), cross-lingual (Ruder, Vulić, and Søgaard, 2017), contextualized (Peters et al., 2018), retrofitted (Faruqui et al., 2015), multi-sense (Pilehvar et al., 2017), cross-domain (Yang, Lu, and Zheng, 2017), dependency-based (Levy and Goldberg, 2014)), encoding architecture (e.g., convolution (Kim et al., 2016), linear vector operations (Bojanowski et al., 2017), bidirectional LSTM (Ling et al., 2015)), as well as in terms of the target units (e.g., words, characters, character n-grams, morphemes, phonemes).


While offering substantial benefits over the feature-based representations of language, the performance of unsupervised embeddings may differ considerably with the language and the downstream task. For instance, early embedding models such as word2vec (Mikolov et al., 2013b) and GloVe (Pennington, Socher, and Manning, 2014) have been shown to suffer from out-of-vocabulary (OOV) issues for agglutinative languages like Turkish and Finnish (Kim et al., 2016; Şahin and Steedman, 2018), while performing relatively well on analytic and fusional languages like English. Furthermore, there is no guarantee that a representation well-suited for one task would perform similarly well on other tasks, even for the same language, due to the differences in the information required to solve the tasks, as demonstrated by Rogers, Ananthakrishna, and Rumshisky (2018).

Given the variety of word representations and parameter options, searching for the right word embedding/encoder model for a specific language and a certain task is not trivial. Scanning the large parameter space may be extremely time consuming and computationally expensive, which poses significant challenges, especially in the lower-resource non-English academic NLP communities. To simplify the search for a good representation, and to estimate the “quality” of the representations, intrinsic evaluation via similarity and analogy tasks has been proposed. While these tasks seem to be intuitive, there are concerns regarding their consistency and correlation with downstream task performance (Linzen, 2016; Schnabel et al., 2015). Furthermore, such evaluation requires manually created test sets and is only available for a limited number of languages. Another popular technique is extrinsic evaluation, where the vectors are used directly in downstream tasks such as syntactic parsing, machine translation (MT) (Ataman and Federico, 2018), semantic role labeling (SRL) (Şahin and Steedman, 2018) or language modeling (LM) (Vania and Lopez, 2017). Although this method provides more insightful information about the end task performance, it requires expensive human annotations and computational resources, and the results are sensitive to hyperparameters. This evaluation also typically employs well-known annotated datasets such as Penn TreeBank (PTB) and is therefore bound to a limited number of downstream tasks and domains. This limits our ability to draw conclusions about the generalization power of the representations, i.e., the predicted performance on the other downstream tasks.

To address the problems with extrinsic evaluation, Shi, Padhi, and Knight (2016); Adi et al. (2017); Veldhoen, Hupkes, and Zuidema (2016); and Conneau et al. (2018a) have introduced probing tasks (we use the terms probing tasks and probing tests interchangeably throughout the paper), which are a set of simple multi-label classification problems where each test probes the input representation for one linguistic property, such as part-of-speech (POS). This technique has gained a lot of attention (Belinkov et al., 2017; Bisazza and Tump, 2018) due to its simplicity, low computational cost and ability to provide more complex insights into the linguistic properties of the input representations. The majority of the probing tests proposed so far operate on the sentence level (e.g., tree depth, word count, top constituent by Conneau et al. (2018a)), and are designed for English only. Although sentence-level probing provides valuable insights for English sentence-level representations, we hypothesize that probing token-level representations would be beneficial in a multilingual setup for several reasons. The first reason is that the information encoded by the word order and function words in English is encoded on a morphological, subword level in many other languages. Consider the Turkish word katılamayanlardan, which means “he/she is one of the folks who can not participate”. In morphologically complex languages like Turkish, single tokens might already communicate a lot of information such as the event, its participants, tense, person, number and polarity, which would be encoded as a full multi-word clause in analytic languages. The second reason is the confusion of the signals, as pointed out by Tenney et al. (2019): “operating on full sentence encodings introduces confounds into the analysis, since sentence representation models must pool word-level representations over the entire sequence”. Finally, token-level probing is considerably simpler than its sentence-level counterpart in terms of the required amount of computational resources. In this work,

  • We extend the line of work by Conneau et al. (2018a) and Tenney et al. (2019) and introduce 15 token-level probing tasks for 24 languages by taking language properties into account. Our probing tasks cover a range of features: from superficial ones such as word length, to morphosyntactic features such as case marker, gender, and number; and psycholinguistic ones like pseudowords (artificial words that are phonologically well-formed but have no meaning). Although languages share a large set of common probing tasks, each has a list of its own, e.g., Russian and Spanish are probed for gender, while Turkish is probed for polarity and possession;

  • We introduce a reusable, systematic methodology for creation and evaluation of such tests by utilizing the existing resources such as UniMorph (Sylak-Glassman et al., 2015; Sylak-Glassman, 2016), Wikipedia and Wuggy (Keuleers and Brysbaert, 2010);

  • We then use the proposed probing tasks to evaluate the most commonly used multilingual word embedding models and to diagnose a neural end-to-end semantic role labeling model as a case study. We statistically assess the correlation between probing and downstream task performance for a variety of downstream tasks (POS tagging, dependency parsing (DEP), semantic role labeling (SRL), named entity recognition (NER) and natural language inference (NLI)) for a set of typologically diverse languages and find that a number of probing tests have significantly high positive correlation to syntactic and semantic downstream tasks, especially for morphologically rich languages;

  • We release the LINSPECTOR framework that contains the probing datasets along with the intrinsic and extrinsic evaluation suite:

We believe our evaluation suite together with probing datasets could be of great use for comparing various multilingual word representations such as automatically created cross-lingual embeddings, exploring the linguistic features captured by word encoding layers of black-box neural models, systematic searching of model/architecture parameters, or comparing transfer learning techniques by probing for the transferred linguistic knowledge from high-resource to low-resource languages.

2 Related Work

2.1 Intrinsic evaluation

A standard way to assess and compare continuous word representations is pairwise similarity benchmarking: a set of word pairs is manually annotated with respect to some notion of similarity, and the similarity scores produced by the word embedding model are compared to the human judgments in terms of correlation. The most commonly used word similarity datasets are WordSim-353 (Finkelstein et al., 2001), MC (Miller and Charles, 1991), SCWS (Huang et al., 2012), the rare words dataset (RW) (Luong, Socher, and Manning, 2013), MEN (Bruni et al., 2012), and SimLex-999 (Hill, Reichart, and Korhonen, 2015). The size of these datasets ranges from 30 to 999 word pairs with a focus on English.
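To make the benchmarking procedure concrete, here is a minimal sketch in Python. It assumes cosine similarity as the model score and Spearman rank correlation as the agreement measure; both are common choices but are assumptions here, and the vectors and human scores are invented toy values rather than entries from any of the datasets named above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy embedding table and toy human similarity judgments (hypothetical).
vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.2],
    "truck": [0.0, 0.8, 0.3],
}
human_scores = [
    ("cat", "dog", 8.5),
    ("car", "truck", 8.0),
    ("cat", "car", 2.0),
    ("dog", "truck", 1.5),
]

model_scores = [cosine(vectors[a], vectors[b]) for a, b, _ in human_scores]
gold_scores = [score for _, _, score in human_scores]
rho = spearman(model_scores, gold_scores)
```

The resulting `rho` is the single number such a benchmark would report for the model; with only a handful of pairs, as here, it is very sensitive to individual judgments.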

Another popular evaluation strategy for word representations is the word analogy task, which aims to determine how well the lexical relationships between words are reflected in the vector space of a particular embedding model. Given a pair of words, e.g., man and woman, the task is to find a target word which shares the same relation with a given source word. For example, given the word king, one expected target word would be queen. The task was introduced by Mikolov, Yih, and Zweig (2013) and has gained a lot of attention mainly because it demonstrates how “linguistic regularities” are captured by word representation models. The analogy dataset which comes with the original paper consists of 14 categories involving syntactic and semantic regularities. While the analogy task has become a standard evaluation benchmark, Rogers, Drozd, and Li (2017) and Linzen (2016) have identified certain theoretical and practical drawbacks of this approach, mainly related to the consistency of the vector offset and the structure of the vector space model. Pairwise similarity benchmarks and word analogy tasks only offer a first approximation of the word embedding properties and provide limited insights into the downstream task performance. As one solution, Tsvetkov et al. (2015) proposed QVEC, an intrinsic word evaluation method which aligns word vector representations with hand-crafted features extracted from lexical resources, focusing on the semantic content. They showed that their evaluation score correlates strongly with performance in downstream tasks.

More recently, Rogers, Ananthakrishna, and Rumshisky (2018) proposed a comprehensive list of scores, so-called linguistic diagnostics factors, with an analysis of their relation to a set of downstream tasks such as Chunking, NER, and Subjectivity/Relation/Sentiment classification, demonstrated with word2vec (Mikolov et al., 2013a) and GloVe (Pennington, Socher, and Manning, 2014) vectors. They extend the traditional intrinsic evaluation (e.g., word similarity and analogy) with semantics extracted from existing resources such as WordNet, and basic morphological information like shared lemma and affixes. Their findings support the previous studies that observe low correlation between word similarity/analogy and sequence-labeling downstream task performance. In addition, they observe high correlation between morphology-level intrinsic tests and such downstream tasks even for English, one of the morphologically poorest languages. Unlike probing studies that train classifiers, they rely on the nearest neighbor relation as a proxy to predict the performance of word vectors, similar to early word analogy works.

2.2 Extrinsic evaluation

In general, evaluating word vectors on downstream NLP tasks is more challenging because of the time and resources needed for implementation. The two most common approaches are to test a single representation model on several downstream tasks (Ling et al., 2015; Pennington, Socher, and Manning, 2014; Bojanowski et al., 2017), or to test a number of representation models on a single task (Vania and Lopez, 2017; Ataman and Federico, 2018; Şahin and Steedman, 2018). For more general extrinsic evaluation, we note the work of Nayak, Angeli, and Manning (2016), which introduces an evaluation suite of six downstream tasks: two tasks to assess the syntactic properties of the representations and four tasks to assess the semantic properties. Specific to machine translation, Burlot et al. (2018) propose a morphology-oriented evaluation suite for four language pairs with different morphological properties.

2.3 Evaluation via probing task

The rise of deep learning based methods in NLP has stimulated research on the interpretability of neural models. In particular, several recent studies analyze representations generated by neural models to get insights into what kind of linguistic information is learned by the models. This has been one of the emerging trends in NLP, as hinted by the ongoing Representation Evaluation (RepEval) (Nangia et al., 2017) and BlackBoxNLP Workshop series (Linzen, Chrupała, and Alishahi, 2018) organized in popular conference venues. The most common approach is to associate some linguistic properties, such as POS, with various representations from the model, like the encoding layer or the output of the activation layer. This method, which is called probing task or diagnostic classifier (Shi, Padhi, and Knight, 2016; Adi et al., 2017; Veldhoen, Hupkes, and Zuidema, 2016), uses representations generated from a fully-trained model with frozen weights to train a linear classifier predicting a particular linguistic property. The performance of this classifier is then used to measure how well the model has “learned” this particular property. One of the earliest studies (Köhn, 2015) proposed training such classifiers for predicting word features such as gender and tense, extracted from annotated dependency treebanks. Due to the unavailability of subword or contextualized embeddings at that time, the authors only experiment with static word-level embeddings and find that they all perform similarly. Moreover, the study assumes that the performance of these targeted word feature classifiers would be directly related to the parser performance, but does not validate this assumption empirically.
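The diagnostic-classifier idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the linear classifier is a plain multi-class perceptron (the text above only says "linear classifier"), the "frozen" embeddings are a hand-made toy table in which the second dimension happens to encode grammatical number, and all names are hypothetical.

```python
def train_linear_probe(features, labels, epochs=20):
    """Train a multi-class perceptron on fixed (frozen) input vectors.

    Only the probe's weights are updated; the input vectors never change.
    Returns a function mapping a vector to a predicted label.
    """
    classes = sorted(set(labels))
    dim = len(features[0])
    # One weight vector per class; the last entry is the bias term.
    weights = {c: [0.0] * (dim + 1) for c in classes}

    def score(c, x):
        w = weights[c]
        return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]

    for _ in range(epochs):
        for x, y in zip(features, labels):
            pred = max(classes, key=lambda c: score(c, x))
            if pred != y:
                # Standard perceptron update: reward gold, penalize prediction.
                for i, xi in enumerate(x):
                    weights[y][i] += xi
                    weights[pred][i] -= xi
                weights[y][-1] += 1.0
                weights[pred][-1] -= 1.0

    return lambda x: max(classes, key=lambda c: score(c, x))


# Toy frozen embeddings (hypothetical values) with number labels.
train = {
    "cat": ([0.2, -1.0], "Sg"),
    "cats": ([0.3, 1.0], "Pl"),
    "dog": ([-0.4, -0.8], "Sg"),
    "dogs": ([-0.5, 0.9], "Pl"),
}
held_out = {
    "bird": ([0.1, -0.9], "Sg"),
    "birds": ([0.0, 1.1], "Pl"),
}

probe = train_linear_probe(
    [vec for vec, _ in train.values()],
    [label for _, label in train.values()],
)
correct = sum(probe(vec) == label for vec, label in held_out.values())
accuracy = correct / len(held_out)
```

The held-out accuracy is then read as a measure of how linearly recoverable the probed property is from the representation.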

Recent works on probing have focused on analyzing the representations produced by the downstream tasks (e.g., machine translation (Belinkov et al., 2017; Belinkov and Bisk, 2018; Bisazza and Tump, 2018)), as well as on the evaluation of sentence-level representations (Conneau et al., 2018a; Tenney et al., 2019). We build our methodology upon this recent work. Unlike these methods, our work is multilingual and takes language-specific features into account. Moreover, our tests are word-level, rather than sentence-level (Conneau et al., 2018a) or sub-sentence-level (Tenney et al., 2019).

3 Probing Tasks

With our probing tasks we aim to cover properties ranging from shallow ones such as word length (Conneau et al., 2018a) to deeper ones such as distinguishing pseudowords from in-vocabulary words. First, we probe for morphosyntactic and morphosemantic features such as case marking, gender, tense and number. Most probing tasks, such as POS and number, are defined for all languages; however, some features are only defined for a subset of languages, e.g., polarity for Portuguese and Turkish, and gender for Arabic and Russian. To maintain consistency, we base the majority of our tasks on the universal grammatical classes introduced by the UniMorph project (Sylak-Glassman et al., 2015). Second, we propose tasks to evaluate a more general syntactic/semantic capability of the model, such as predicting the number of morphological tags and detecting the common/shared and odd linguistic feature between two word forms. Finally, inspired by cognitive linguistics, we generate pseudowords, i.e., words that are phonetically similar to an existing word but have no meaning. The definitions of our probing tasks are given in Section 3.1, and the creation process is described in Section 3.2.

3.1 Task Definitions

Case Marking

A substantial number of languages express the syntactic and semantic relationship between the nominal constituents and the verbs via morphological case markers. casemarking report that 161 out of 261 languages have at least two case markers as shown in Table 1.

#Case Categories | #Languages | Example
0                | 100        | English, Spanish
2                | 23         | Romanian, Persian
3                | 9          | Greek
4                | 9          | Icelandic, German, Albanian
5                | 12         | Armenian, Serbo-Croatian, Latvian
6-7              | 37         | Turkish, Polish, Russian, Georgian
8-9              | 23         | Japanese
10 or more       | 24         | Estonian, Finnish, Basque
Table 1: Languages with case marking

Although cases may be of different types and undertake different roles across languages, one type of case marking, referred to as core, non-local, nuclear or grammatical case, is the most common. This category contains case markers that are used to mark the arguments of verbs such as subjects, objects and indirect objects (Blake, 2001; Comrie and Polinsky, 1998). Moreover, in languages with rich case marking systems, case markers are commonly used to mark roles such as “location” and “instrument”. Below are Russian and Turkish example sentences that use the Acc and Inst cases to mark the patient (the object affected by the action) and the instrument roles.

Mark- razbi-l- okn-o molotk-om
Mark-Nom.Sg break-Pst-Sg.M window-Acc.Sg hammer-Inst.Sg

Mark- pencere-yi çekiç-le kır-dı
Mark-Nom.Sg window-Acc.Sg hammer-Inst.Sg break-Pst.3.Sg

The relation between case markers and NLP tasks such as semantic role labeling, dependency parsing and question answering has been heavily investigated, and using case marking as a feature has been shown to be beneficial for numerous languages and tasks (Isgüder and Adali, 2014; Eryigit, Nivre, and Oflazer, 2008).


Gender

According to gender, more than half of the world's languages do not have a gender system. The majority of the languages with a gender system, such as Spanish, French, German, and Russian, define either two (feminine, masculine) or three (feminine, masculine, neuter) classes. Gender is a grammatical category and participates in agreement: if a language has a gender system, the gender of a noun or pronoun influences the form of its syntactic neighbors, which could be a verb, adjective, determiner, numeral or focus particle, depending on the language. The Russian example sentence below shows the gender agreement between the subject, its adjective modifier and the verb.

Gosudarstvenn-aya duma sdela-l-a zayavlenie
State-Nom.Sg.F parliament.F make-Pst-Sg.F announcement

Agreement features such as gender and number are crucial for structured grammatical analysis such as dependency parsing and co-reference resolution, as well as for grammatical correction and automatic essay evaluation.


Mood

Many languages mark the modality of the verb with morphological affixes. The modality of the verb, i.e., the grammatical mood, is used to define the status of the proposition from the speaker’s point of view. Some common mood categories are Indicative, Conditional, Subjunctive, Imperative-Jussive and Potential. German and Russian example sentences with the Imperative mood feature are given below.

Bring-e mir das Buch
Bring-2Sg.Imp me the book

Prines-i mne knigu
Bring-2Sg.Imp me book

Mood is an essential feature for dialogue systems, where the speaker’s point of view is important for deciding on the next action.


Number

The category of number is usually expressed by nouns, adjectives and verbs. Similar to gender, number is a common agreement feature and has similar potential applications. Number generally takes two values, Singular and Plural, which are marked by morphological affixes.


POS

We use the following eight categories defined by the UniMorph Schema: nouns, adpositions, adjectives, verbs, masdars, participles, converbs, and adverbs. Detailed information on the classes can be found in Sylak-Glassman (2016). Part-of-speech has been one of the most prominent features of all high-level NLP tasks for decades. It is important to note that language-specific POS tags often encode additional information such as person and number. Throughout this work, however, we refer to coarse POS, a strictly grammatical category.


Person

We use the traditional six person categories that are commonly marked by morphological markers: 1st, 2nd and 3rd person, either singular or plural. This feature has strategic importance for dependency parsing and co-reference resolution, as well as for high-level tasks that involve natural language understanding such as conversational agents, question answering or multi-modal applications such as generating images from sentences.

Ja vizh-u ptitsu
I.1Sg see-1Sg.Prs bird

On vid-it ptitsu
He.3Sg see-3Sg.Prs bird


Polarity

Some languages mark the verb to indicate whether a statement is negative or positive. Generally, markers are used to specify the negative polarity, with positive polarity assumed by default. The verb “go” is marked with a negative marker in the Turkish sentence given below. Although this feature is not notably common across languages, it has immediate use cases such as sentiment analysis and natural language inference.

Dün okul-a git-me-di-m
yesterday school-Dat.Sg go-Neg-Pst-Sg


Possession

Although the majority of languages use adjectives such as his/her/my, some languages such as Turkish and Arabic mark the noun for possession. The number of values for the feature depends on the gender system of the language. For instance, while Arabic separately marks the possession by 3rd person singular for feminine and masculine, Turkish uses only one marker for the possession by the 3rd person singular. A Turkish example sentence meaning “he/she will wear his/her shoes” is given below.

Ayakkabı-(s)ı-(n)ı giy-ecek
shoe-Poss.3Sg-Acc wear-3Sg.Fut

As can be seen in the sentence, possession implicitly acts as an agreement feature, i.e., the possession of the object and the person of the verb must match.


Tense

We use the simplified universal definition of tense, which is the encoding of the event time. Similar to the previous categories, we only account for the categories and the languages that have morphological markers for tense. The most common values for tense across languages in our dataset are Past, Present and Future. Russian and German examples with Past tense marking are given below for reference.

On kupi-l- etot dom
He buy-Pst-Sg.M this house

Auf dem Tisch lag- ein Buch
On the.Dat table lay.Pst-Sg a book

The tense feature is especially important for applications where resolving the event order is crucial.


Voice

This study is only concerned with the frequently occurring Active and Passive voice features that have separate morphological markers on the verb. A Spanish example using passive voice is given below.

La puerta ha sido cerra-do
The door is be close-Pass

Character Bin

We simply create a test set consisting of pairs of randomly picked surface forms and the number of Unicode characters they contain. For convenience, we use bins instead of real values, as in senteval18. Although this task is considered shallow, as we show later in the paper, it captures relevant information about morphology for agglutinative languages.
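As a rough illustration of the labeling step, the sketch below maps a surface form to a length bin. The bin edges are an assumption made for the example; the text does not state the actual boundaries. Note that Python's `len` counts Unicode code points, so forms written with combining characters would count differently from precomposed ones.

```python
def char_bin(word, edges=(4, 8, 12, 16)):
    """Return the index of the length bin that `word` falls into.

    `edges` holds inclusive upper bounds for each bin (assumed values);
    anything longer than the last edge goes into a final overflow bin.
    """
    n = len(word)  # number of Unicode code points
    for label, upper in enumerate(edges):
        if n <= upper:
            return label
    return len(edges)


# Surface forms taken from examples in this paper.
examples = {w: char_bin(w) for w in ["dom", "pencere", "katılamayanlardan"]}
```

Each `(form, bin)` pair then serves as one instance of the classification task.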

Tag Count

Similar to character bin, we create a test that contains tuples of surface forms and number of morphological tags (annotated according to UniMorph schema) for the token. It can be considered a simplistic approximation of the morphological information encoded in a word and is expected to cover a mixture of linguistic aspects explained in previous test definitions.
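A minimal sketch of how the label could be computed from a tag string. It assumes tags are separated either by dots, as in the notation used in this paper (e.g., N.Sg.Poss1Pl.Abl), or by semicolons; the helper name is hypothetical.

```python
import re


def tag_count(tag_string):
    """Count the morphological tags in a dot- or semicolon-separated string."""
    return len([tag for tag in re.split(r"[.;]", tag_string) if tag])


# Tag strings in the dotted notation used in the examples of this paper.
counts = {t: tag_count(t) for t in ["N.Sg.Poss1Pl.Abl", "V.3Sg.Fut", "N.Nom.Sg"]}
```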


Pseudowords

Pseudowords, or nonwords, are commonly used in psycholinguistics to study lexical choices or different aspects of language acquisition. There are various ways to generate pseudowords, e.g., randomly swapping two letters, randomly adding/deleting letters to/from a word, or concatenating high-frequency bigrams or trigrams. These methods have limitations when it comes to multilingual studies, such as computational time, availability of resources and researcher’s bias, as explained in detail by Keuleers and Brysbaert (2010). In this study, we use the “Wuggy” algorithm (Keuleers and Brysbaert, 2010), which is the most commonly used and freely available system for multilingual pseudoword generation. It builds a grammar of bigram chains from the syllabified lexicon and generates all possible words with the grammar, both words and nonwords. It is available for German, Dutch, English, Basque, French, Spanish and Vietnamese by default, and has been extended for Turkish (Erten, Bozsahin, and Zeyrek, 2014). Some examples of generated pseudowords from our dataset are given in Table 2.

Language Pseudo Words
English atlinsive, delilottent, foiry
French souvuille, faicha, blêlament
Basque zende, kontsiskio, anazkile, kaukasun, kaldretu
Dutch nerstbare, openkialig, inwrannees, tedenjaaigige, wuitje
Serbian aćejujelu, benkrilno, knjivule, haknjskim, znamaketi
German Anstiffung, hefumtechen, Schlauben, Scheckmal, spüßten
Spanish vuera, espisia, supencinzado, lungar, disciscir
Turkish ular, pesteklelik, çanar, tatsazı, yalsanla
Table 2: Examples of generated pseudowords
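The bigram-chain idea behind this kind of generation can be illustrated with a heavily simplified sketch. This is not the Wuggy implementation: it chains whole syllables from a tiny hypothetical lexicon, enumerates every path up to a length limit, and keeps the strings that are not already in the lexicon. All function names and the toy lexicon are assumptions.

```python
def build_chain(syllabified_lexicon):
    """Map each syllable to the set of syllables observed right after it."""
    chain = {}
    for syllables in syllabified_lexicon:
        path = ["<s>"] + list(syllables) + ["</s>"]
        for current, following in zip(path, path[1:]):
            chain.setdefault(current, set()).add(following)
    return chain


def generate(chain, max_syllables=3):
    """Enumerate every string the chain licenses, up to `max_syllables`."""
    results = set()
    stack = [("<s>", [])]
    while stack:
        state, prefix = stack.pop()
        for following in chain.get(state, ()):
            if following == "</s>":
                if prefix:
                    results.add("".join(prefix))
            elif len(prefix) < max_syllables:
                stack.append((following, prefix + [following]))
    return results


# Toy syllabified lexicon (hypothetical).
lexicon = [("ba", "na", "na"), ("ca", "na", "da")]
real_words = {"".join(syllables) for syllables in lexicon}

candidates = generate(build_chain(lexicon))
pseudowords = sorted(candidates - real_words)
```

Because every transition was observed in a real word, the surviving candidates look word-like while carrying no meaning, which is the property the pseudoword test relies on.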

SameFeat

We choose two surface forms which share only one feature and label this form pair with the shared (same) feature. Some example data is given in Table 3. Since features depend on the language, the number of labels and the statistics of the dataset differ per language.

L  | form1                                      | form2                       | SameFeat
TR | yalvaracaksınız, beg (V.2Pl.Fut)           | approve (V.3Sg.Pst)         |
TR | yolculuklarına, travel (N.Poss3Pl.Dat)     | enemy (N.Poss1Sg.Dat)       |
TR | taşımam, carry (V.1Sg.Prs.Neg)             | carry (V.3Pl.Pst)           |
TR | sarımsaklarım, garlic (N.Pl.Poss1Sg.Nom)   | sentence (N.Sg.Poss1Sg.Loc) |
RU | pantera, panther (N.Nom.Sg)                | optimistic (Adj.Gen.Sg)     |
DE | Stofftiere, stuffed_animal (N.Nom.Pl)      | tennis_court (N.Nom.Sg)     |
Table 3: Examples of form pairs with only one shared feature. Poss3Pl: possession by 3rd plural person, Poss1Sg: possession by 1st singular person. Shared features shown in bold. Turkish positive polarity is not explicitly tagged by Unimorph. TR: Turkish, RU: Russian, DE: German
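The pair-labeling rule can be sketched as follows. The sketch assumes each form's annotation has already been converted into a mapping from feature name to value; that representation, the function name, and the choice of which features to include in the mapping are assumptions for illustration.

```python
def same_feat_label(feats1, feats2):
    """Return the single feature on which both forms agree, else None.

    A pair qualifies only if exactly one feature has the same value
    in both annotations.
    """
    shared = [
        name
        for name, value in feats1.items()
        if name in feats2 and feats2[name] == value
    ]
    return shared[0] if len(shared) == 1 else None


# Russian pair from Table 3: N.Nom.Sg vs. Adj.Gen.Sg.
panther = {"POS": "N", "Case": "Nom", "Number": "Sg"}
optimistic = {"POS": "Adj", "Case": "Gen", "Number": "Sg"}
label = same_feat_label(panther, optimistic)

# A hypothetical pair that agrees on two features is rejected.
rejected = same_feat_label(panther, {"POS": "N", "Case": "Dat", "Number": "Sg"})
```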

OddFeat

This test is the opposite of the shared feature test. We prepare pairs of surface forms which differ by only one feature value and label them with this odd feature. Some examples are given in Table 4.

L  | form1                                         | form2                           | OddFeat
TR | istemeyecek, want (V.3Sg.Fut.NEG)             | want (V.3Sg.Fut)                |
TR | seçenekler, option (N.Nom.Pl)                 | option (N.Dat.Pl)               |
TR | iyileşiyorlardı, heal (V.3Pl.Pst.Prog)        | travel (V.3Pl.Pst.Prog)         |
TR | deneyimlerine, experience (N.Dat.Pl.Poss2Sg)  | experience (N.Dat.Pl.Poss1Sg)   |
RU | zashitu, defence (N.Acc.Sg)                   | defence (N.Dat.Sg)              |
ES | legalisada, legalized (V.Sg.Ptcp.F)           | legalized (V.Sg.Ptcp.M)         |
DE | integriert, integrate (V.3Sg.Ind.Prs)         | count (V.3Sg.Ind.Prs)           |
Table 4: Examples of form pairs with only one different feature. Odd features shown in bold. Turkish positive polarity is not explicitly tagged by Unimorph. Poss2Sg: possession by 2nd singular person, Poss1Sg: possession by 1st singular person. TR: Turkish, RU: Russian, ES: Spanish, DE: German
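A matching sketch for the odd-feature rule, again assuming annotations are given as feature-to-value mappings (an assumption for illustration, with a hypothetical function name). A feature that is present in only one of the two forms counts as differing, which is how an untagged positive polarity can still contrast with an explicit Neg tag.

```python
def odd_feat_label(feats1, feats2):
    """Return the single feature whose value differs, else None."""
    names = set(feats1) | set(feats2)
    differing = sorted(
        name for name in names if feats1.get(name) != feats2.get(name)
    )
    return differing[0] if len(differing) == 1 else None


# Turkish pairs from Table 4.
case_pair = odd_feat_label(
    {"POS": "N", "Case": "Nom", "Number": "Pl"},
    {"POS": "N", "Case": "Dat", "Number": "Pl"},
)
polarity_pair = odd_feat_label(
    {"POS": "V", "Person": "3Sg", "Tense": "Fut", "Polarity": "Neg"},
    {"POS": "V", "Person": "3Sg", "Tense": "Fut"},
)
```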

3.2 Dataset Creation

One of the goals of this study is to provide downstream performance prediction for as many languages as possible. Hence, while searching for a dataset to source the probing tests from, the number of languages this dataset covers is of key importance. Although there is only a small number of annotated, truly multilingual datasets such as Universal Dependencies, unlabeled datasets such as Wikipedia and Wiktionary are more abundant. In this work, we use UniMorph (Sylak-Glassman et al., 2015; Kirov et al., 2016), which provides a dataset of inflection paradigms with universal morphology features mapped from Wiktionary for many of the world’s languages. In addition to UniMorph, we use the lexicon and the software provided by Wuggy to generate pseudowords (Keuleers and Brysbaert, 2010; Erten, Bozsahin, and Zeyrek, 2014). Finally, we use the word frequency lists extracted from Wikipedia. We follow different procedures to create datasets for each test type. Here, we briefly explain the creation process of single form feature tests such as Tense, Voice and Mood; paired form feature tests (OddFeat and SameFeat); followed by Character Bin and pseudoword generation via Wuggy.

Single Form Feature Tests

A word annotated with UniMorph features can be used in several probing tests. For instance, the Turkish word “grubumuzdan” (from our group) is marked with the N.Sg.Poss1Pl.Abl tag and can be used to probe the POS, Case marking, Number and Possession features, since it has the N (Noun), Abl (Ablative), Sg (Singular) and Poss1Pl (Possession by first person plural) tags. While generating the tests, we check if the following conditions for a language and target feature are satisfied:

  • Since we need to train classifiers for the probing tests, we need enough training data. We eliminate the language/feature pair if the total number of samples for that feature is less than 10K.

  • If a feature, e.g. case marker, does not have more than one unique value for a given language-feature pair, it is excluded from the tests.
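The two conditions above can be expressed as a simple filter. The sketch assumes the annotated data has been flattened into `(language, feature, value)` triples; that layout and the function name are hypothetical, while the 10K threshold comes from the text.

```python
from collections import defaultdict

MIN_SAMPLES = 10_000  # threshold stated in the text


def eligible_tests(samples, min_samples=MIN_SAMPLES):
    """Return the (language, feature) pairs that pass both conditions.

    `samples` is an iterable of (language, feature, value) triples.
    A pair is kept if it has at least `min_samples` samples and
    more than one distinct value.
    """
    counts = defaultdict(int)
    values = defaultdict(set)
    for language, feature, value in samples:
        counts[(language, feature)] += 1
        values[(language, feature)].add(value)
    return sorted(
        key
        for key in counts
        if counts[key] >= min_samples and len(values[key]) > 1
    )


# Toy data with a lowered threshold so the example stays small.
toy = (
    [("tr", "Case", "Abl")] * 3
    + [("tr", "Case", "Dat")] * 3
    + [("en", "Case", "Nom")] * 6          # only one value: excluded
    + [("de", "Gender", "F"), ("de", "Gender", "M")]  # too few samples
)
kept = eligible_tests(toy, min_samples=5)
```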

In addition, we perform two preprocessing steps: (1) removal of forms that are ambiguous with respect to the probed linguistic feature, and (2) partial filtering of infrequent words. Ambiguity is one of the core properties of natural language, and a single word form can have multiple morphological interpretations. For instance, the German lemma “Teilnehmerin” is inflected as “Teilnehmerinnen” as a plural noun marked with either the accusative, dative or genitive case. We remove such words with multiple interpretations for the same feature. This is a deliberate design choice which, while potentially causing some systematic removals, substantially simplifies the task architecture and guarantees fair testing. In Fig. 1, we show the ratio of removed instances to all instances separately for each language and test. As can be seen, the highest ratio is for German, at around 30%, followed by the Gender and the Case tests for Russian. Secondly, the UniMorph dataset contains many grammatically correct but infrequent word forms such as the English “transglycosylating” or the Turkish “satrançlarımızda” (in our chesses). To make sure that our probing tests are representative of language use, we utilize the frequent word statistics extracted from the Wikipedia dump of the corresponding language. For each probing test, the dataset is compiled such that 80% of the forms are frequently encountered words. We keep a portion of “rare” words and use a considerably large proportion of the frequency dictionary (e.g., we keep the first 1M words for Russian) to identify frequent words, in order to keep our tests domain-independent and hence not give any unfair advantage to embedding models trained on Wikipedia. Finally, to have a dataset with feasible baseline scores, we introduce surface forms which do not contain the probed feature. For instance, if the Tense feature is probed, 30% of the probing dataset consists of nominal forms that belong to the “None” class.
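The ambiguity filter can be illustrated with a small sketch. The function name and the `(form, features)` input format are hypothetical, and the second German form in the usage note is an added illustrative entry; the rule itself (drop any surface form with more than one value for the probed feature) is the one described above.

```python
from collections import defaultdict

def remove_ambiguous(entries, feature):
    """Keep only surface forms that have exactly one value for `feature`.

    `entries` is an iterable of (form, {feature: value}) analyses; a form may
    occur several times, once per morphological interpretation.
    """
    values_by_form = defaultdict(set)
    for form, features in entries:
        if feature in features:
            values_by_form[form].add(features[feature])
    return {
        form: next(iter(values))
        for form, values in values_by_form.items()
        if len(values) == 1
    }
```

For example, given three analyses of “Teilnehmerinnen” (accusative, dative, genitive) and a single genitive analysis of “Hauses”, only “Hauses” survives the Case filter.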

Figure 1: Ratio of ambiguous forms to all instances with the given feature such as Mood. For convenience, only a subset of languages is shown.

Paired Feature Tests

Unlike for single features, we did not remove ambiguous forms for the paired feature tests, i.e., OddFeat and SameFeat, due to the restrictive nature of the tests. For instance, while probing for the OddFeat between two forms, we assume that there exists a word pair differing only by one feature. Therefore, here we only consider one certain interpretation of the word form, which would share n-1 features with the interpretation of the other form, where n is the total number of UniMorph features in both words.

The dataset for this test type is created in two separate steps: (1) for UniMorph tags and (2) for lemmas. For the SameFeat, we first group the words that contain the feature of interest together for step (1). Then we split each feature group into two and sample words from both groups. These word pairs are compared against each other, and included in the test set if they share the same value for the feature of interest, but differ in all other features. Since some features are tagged by default, e.g., POS, we exclude these features from the comparison process. Otherwise our dataset would have no instances, since all nouns share the “N” tag. In addition to POS tags, we exclude the Mood feature from Finnish and Turkish, and the Interrogativity feature from Turkish, since all verbs have the same tag for those features. For (2), we follow the same steps, but check if the lemma values are the same and the others are different. We perform similar preprocessing and dataset balancing for all languages. The details of parameter values can be found on the project website.
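The pair check for step (1) of SameFeat can be sketched as follows. The function name, the dictionary representation of a form and the `ignored` parameter are assumptions made for illustration; the acceptance rule follows the description above (same value for the feature of interest, different values for every other compared feature, default features such as POS left out).

```python
def same_feat_pair(feats_a, feats_b, target, ignored=("POS",)):
    """True if both forms share the target value and differ in every other compared feature."""
    if target not in feats_a or target not in feats_b:
        return False
    if feats_a[target] != feats_b[target]:
        return False
    # A feature tagged on only one form counts as differing (None vs. a value).
    others = (set(feats_a) | set(feats_b)) - {target} - set(ignored)
    return all(feats_a.get(f) != feats_b.get(f) for f in others)
```

Note that in this sketch a pair with no remaining features to compare is accepted; whether the original pipeline does the same is not stated in the text.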

While preparing the dataset for the OddFeat, we first group the words by lemma tagged with the target feature for step (1). Then we randomly sample elements from each lemma group, and perform pairwise comparison. If two sampled forms have different values for the feature (e.g., Ablative and Locative) but have the same set of values for the other features (e.g., Singular), then they are assigned this feature as the label. In addition to the features with different values, we also consider the features that are not explicitly tagged. For instance, if only one of the forms has the Possession feature, but all features except Possession are shared among these two forms, then we create a test pair with the value Possession. To generate the test pairs for step (2), we group the words by their feature sets, i.e., different forms with the exact same set of feature values are clustered together. Then we split each group into two, and sample a number of forms from both halves. The procedure described above results in unbalanced datasets, usually dominated by the Number feature. In order to avoid this, we sample proportionally from such overly sized feature test pairs.
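The labelling rule for an OddFeat pair can be sketched as below. The function name and the dictionary representation are hypothetical; the logic mirrors the description above, including the case where a feature is tagged on only one of the two forms.

```python
def odd_feature(feats_a, feats_b):
    """Return the single feature on which two forms differ, or None otherwise."""
    all_features = set(feats_a) | set(feats_b)
    # A feature missing on one side (e.g. Possession) is treated as differing.
    differing = [f for f in sorted(all_features) if feats_a.get(f) != feats_b.get(f)]
    return differing[0] if len(differing) == 1 else None
```

Pairs that differ in zero features or in more than one feature return `None` and would simply not be turned into test instances.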

Character Bin

After removing the ambiguous forms, we created bins for character counts since the variation in word length was high. We used the following bins: [0-4, 5-8, 9-12, 13-16, 17-20, >20], and applied the same bins to all languages. The bins are chosen such that the distribution over bins is balanced.
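As a small illustration, the mapping from a word to its character bin can be written with a sorted list of upper bounds. The names `UPPER_BOUNDS`, `BIN_LABELS` and `character_bin` are made up for this sketch; the bin edges are the ones listed above.

```python
import bisect

UPPER_BOUNDS = [4, 8, 12, 16, 20]  # inclusive upper edge of each closed bin
BIN_LABELS = ["0-4", "5-8", "9-12", "13-16", "17-20", ">20"]

def character_bin(word):
    """Return the label of the length bin that `word` falls into."""
    return BIN_LABELS[bisect.bisect_left(UPPER_BOUNDS, len(word))]
```

Under this sketch, the 11-character form “grubumuzdan” falls into the 9-12 bin.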

Pseudo Word Test

Finally, we have generated pseudo words for 9 languages. To do so, we first sampled 10K in-vocabulary words from the lexical resources provided by Wuggy. We then use those words as seeds to the Wuggy pseudoword generator. We finally generate pseudowords by setting the maximum number of candidates per word to 5, maximal search time per word to 5 seconds; and restricting the output to match the length of sub-syllabic segments, match the letter length, match transition frequencies and match 2/3 of sub-syllabic segments.

The set of languages for each probing test introduced in Section 3.1 are given in Table 5. To conclude, we have created 15 probing tests for a total of 24 languages, each containing 7K training, 2K development and 1K test instances.

4 Evaluation Methodology

In this section, we discuss our probing task evaluation methodology. First of all, due to the large number of languages and embedding models available, we choose a subset of each. We describe how we decide on the languages to test in Sec. 4.1, and the set of pretrained embeddings we have used is detailed in Sec. 4.2. In order to investigate the relation between probing and downstream tasks, we evaluate the same set of embeddings intrinsically via our probing tasks as explained in Sec. 4.3 and extrinsically on several downstream tasks discussed in Sec. 4.4. Furthermore, we show how the proposed tests can be used as a diagnostic tool for black box NLP systems in a case study explained in Sec. 4.5.

4.1 Languages

We have identified a list of languages to test our hypotheses on various research questions such as the relation between downstream and probing tasks or the information encoded in layers of black box models. We considered the following criteria:

  • Chosen languages should have relatively broad resource coverage e.g., annotated data for a variety of downstream tasks,

  • The set of chosen languages should have a high coverage of probing tests; and the number of chosen languages should be in proportion to the number of languages that are probed for a certain test,

  • The languages should be as typologically diverse as possible in terms of linguistic properties we are probing for.

Considering the above, we have identified the following 5 languages: German, Finnish, Turkish, Spanish and Russian, which are shown in colors in Table 5. The majority of them have annotated resources in addition to Universal Dependencies treebanks, e.g., datasets created for named entity recognition (NER) and semantic role labeling (SRL). As can be seen from Table 5, all probing tests are covered and their ratio to other languages is well proportioned for each test. Furthermore, our selected languages belong to diverse language families, namely Germanic, Uralic, Turkic, Romance and Slavic.

arabic arabic armenian arabic arabic portuguese armenian armenian arabic
armenian armenian catalan armenian armenian turkish bulgarian czech armenian
bulgarian catalan finnish bulgarian catalan POSSESSION catalan finnish bulgarian
czech finnish french catalan finnish armenian finnish german catalan
estonian french german czech french quechua french hungarian czech
finnish german hungarian danish german turkish german macedonian danish
german hungarian italian estonian hungarian VOICE hungarian greek dutch
hungarian italian macedonian finnish italian arabic italian polish estonian
macedonian polish polish french macedonian bulgarian macedonian portuguese finnish
greek portuguese portuguese german greek finnish greek quechua french
polish romanian russian hungarian polish russian polish romanian german
quechua serbian spanish italian portuguese serbian portuguese russian italian
russian spanish swedish macedonian quechua swedish quechua serbian macedonian
serbian GENDER greek romanian PSEUDO romanian spanish greek
swedish arabic polish russian basque russian swedish polish
turkish bulgarian portuguese serbian dutch serbian turkish portuguese
macedonian quechua spanish english spanish quechua
greek romanian turkish french turkish romanian
polish russian german russian
portuguese serbian serbian serbian
russian spanish spanish spanish
serbo swedish turkish swedish
spanish turkish vietnamese turkish
Table 5: List of languages for each probing task. Languages shown in colored cells are the languages we experiment on. General refers to POS, Tag Count and Character Bin. Some of the tests with fewer languages are concatenated vertically for convenience.

4.2 Multilingual Embeddings

We choose the most commonly used pretrained word embedding models that are available for a large number of languages, since they are the ones that will most likely be used by a researcher working on multilingual problems. Namely, in this work we report the performance of the following word embedding models: word2vec (Mikolov et al., 2013a); fastText (Bojanowski et al., 2017); Byte Pair Encoding (BPE; Sennrich, Haddow, and Birch, 2016); supervised MUSE (Conneau et al., 2017); and ELMo (Peters et al., 2018).

word2vec Among the selected representations, only word2vec uses the word as the basic unit. We have trained a word2vec model for each of the selected languages on the latest preprocessed (tokenized, lowercased) Wikipedia dump using 300-dimensional CBOW, a window of size 10 and a minimum target count of 5. We have used the implementation provided by the authors.

fastText provides word representations that have subword-level information learned from character n-grams. In simple terms, a word is represented as a linear combination of the embeddings of its character n-grams. We use the embeddings distributed via fastText, which are trained on preprocessed Wikipedia using CBOW with position-weights, in dimension 300, with character n-grams of length 5 and a window of size 5.

BPE is another type of subword-level embedding that uses unsupervised morphological segments generated by a compression algorithm inspired by Gage (1994). We use the pretrained embeddings by Heinzerling and Strube (2018), which are trained on preprocessed Wikipedia using GloVe (Pennington, Socher, and Manning, 2014). We use the Python wrapper open-sourced by the authors with the default dictionary size and dimension 300. Since the tool provides embeddings for each segment, in case of multiple segments per token, we use the averaged vector as the word representation.
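The averaging step for tokens split into several BPE segments can be sketched without any dependency on the actual wrapper. The function below is illustrative and takes plain lists of floats; it does not reproduce the API of the tool used in the paper.

```python
def average_segments(segment_vectors):
    """Average equal-length segment vectors into a single word vector."""
    if not segment_vectors:
        raise ValueError("at least one segment vector is required")
    n = len(segment_vectors)
    # zip(*...) iterates over dimensions, so each output entry is a per-dimension mean.
    return [sum(dim) / n for dim in zip(*segment_vectors)]
```

A token that maps to a single segment keeps that segment's vector unchanged.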

MUSE-supervised embeddings are crosslingual fastText embeddings, i.e., monolingual fastText embeddings aligned in a common space using ground-truth bilingual dictionaries. We used the pre-aligned vectors distributed by the authors. The monolingual embeddings have the same properties as the fastText vectors described above. Since the authors only release the static embedding vectors without the model, we could not generate embeddings for OOV words.


ELMo embeddings are computed on top of two-layer bidirectional language models which use character representations composed using convolutional neural networks (CNN). Unlike previously introduced embeddings, ELMo provides contextualized embeddings, i.e., the same words would have different representations when used in different contexts. However, our probing tests are type-level (as opposed to token-level), thus we only use the representations generated independently for each token, both for the intrinsic and extrinsic experiments. In the scope of this study, ELMo embeddings are treated as powerful pretrained character-level decontextualized vectors. To highlight this important detail, we further refer to our ELMo-derived embeddings as Decontextualized ELMo (D-ELMo). We use the multilingual pretrained ELMo embeddings distributed by the authors (Che et al., 2018; Fares et al., 2017), which are trained with the same hyperparameter settings as the original (Peters et al., 2018) for the bidirectional language model and the character CNN. They are trained on 20 million randomly sampled words from the Wikipedia dump and Common Crawl datasets and have a dimensionality of 1024. We use the 3-layer averaged ELMo representation for each word.

For all the experiments described in Sec. 4.3 and Sec. 4.4, we first created the vocabulary for all intrinsic and extrinsic datasets per language. Then, we generated the vectors using the embeddings that can handle OOV words, namely fastText, BPE and D-ELMo, for each language-intrinsic and language-extrinsic pair. The static embeddings, word2vec and MUSE, are used as provided. Hence, for the models using these embeddings, each unknown (OOV) word is replaced by the UNK token and the same vector is used for all UNK words.
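The OOV handling for the static embeddings amounts to a lookup with a shared fallback vector. The sketch below assumes the embedding table is a plain dictionary that contains an entry for the `UNK` token; both the function name and this representation are illustrative.

```python
def lookup(word, table, unk_token="UNK"):
    """Return the vector for `word`, or the shared UNK vector if the word is out of vocabulary."""
    return table.get(word, table[unk_token])
```

Because every OOV word receives the very same vector, out-of-vocabulary inputs become indistinguishable from each other but easy to tell apart from in-vocabulary words.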

4.3 Intrinsic evaluation: probing tasks

Following Conneau et al. (2018a), we use diagnostic classifiers (Shi, Padhi, and Knight, 2016; Adi et al., 2017) for our main probing tests. Our diagnostic classifier is a feedforward neural network with one hidden layer, followed by a ReLU non-linearity. The classifier takes as input a fixed, trained word vector and predicts a label specific to the probing test. For OddFeat and SameFeat, since the input consists of two words, we first concatenate both word vectors before feeding them into the feedforward network. For all tests, we use the same hyperparameters: a hidden dimension of 300 and a dropout rate of 0.5. We train each model for 20 epochs with early stopping (patience=5). The input dimension depends on the type of pretrained word vectors that are evaluated. Our evaluation suite is implemented using the AllenNLP library (Gardner et al., 2017).
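To make the shape of the diagnostic classifier concrete, here is a dependency-free sketch of its forward pass only. It is not the AllenNLP implementation: the weights are random, dropout and training are omitted, and the helper names and the number of labels are assumptions chosen for illustration.

```python
import random

def make_layer(n_in, n_out, rng):
    """Randomly initialised weight matrix (n_out x n_in) with zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def affine(weights, biases, x):
    """Compute W x + b for a list-based weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]

def probe_forward(vectors, hidden, output):
    """Concatenate the input word vectors, apply one ReLU hidden layer, return label scores."""
    x = [v for vec in vectors for v in vec]        # one vector, or two for OddFeat/SameFeat
    h = [max(0.0, a) for a in affine(*hidden, x)]  # hidden layer followed by ReLU
    return affine(*output, h)                      # unnormalised scores, one per label

rng = random.Random(0)
dim, n_hidden, n_labels = 300, 300, 4              # n_labels is illustrative
hidden = make_layer(2 * dim, n_hidden, rng)        # paired test: input is 2 * dim
output = make_layer(n_hidden, n_labels, rng)
scores = probe_forward([[0.1] * dim, [0.2] * dim], hidden, output)
predicted = max(range(n_labels), key=scores.__getitem__)
```

For single-word tests the hidden layer would take `dim` inputs instead of `2 * dim`; everything else stays the same.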

4.4 Extrinsic evaluation: downstream tasks

We consider five tasks for our extrinsic evaluation: universal POS-tagging (POS), dependency parsing (DEP), named entity recognition (NER), semantic role labeling (SRL), and cross-lingual natural language inference (XNLI). The former two tasks are useful to measure the correlation of our probing test sets to downstream syntactic tasks, while the latter three provide insight into the performance on more semantic tasks. Since our main goal is to evaluate the quality of the pretrained word embedding spaces, we neither update the word vectors during training nor use extra character-level information. Except for SRL, all tasks described below are trained using the models implemented in the AllenNLP library.

POS Tagging

This is a classic sequence tagging task, where the goal is to assign a sequence of POS tags given the input sentence. We use data from the Universal Dependencies (UD) project version 2.3 (Nivre, 2018), and adopt universal POS tags as our target labels. It should be noted that the POS-tagging downstream task differs from the POS probing task with respect to the number of POS labels and the applied annotation scheme. For the tagging model, we use a bidirectional LSTM encoder with 300 hidden units and 0.5 dropout. We use the Adam optimizer with an initial learning rate of 0.001. We train each model with a mini-batch size of 32 for 40 epochs, with early stopping (patience=10). We use accuracy as our performance metric.

Dependency Parsing

The aim of dependency parsing is to predict syntactic dependencies between words in a sentence in the form of a tree structure. This task is especially interesting because of its deep interaction with morphology, which we evaluate in our probing tests. We employ the deep biaffine parser of Dozat and Manning (2016), which is a variant of the graph-based dependency parser of McDonald, Crammer, and Pereira (2005). The parsing model takes as input a sequence of token embeddings concatenated with the corresponding universal POS embeddings. The input is then processed by a multi-layer biLSTM. The output state of the final LSTM layer is then fed into four separate ReLU layers to produce four specific word representations: two for predicting the arcs (head prediction) and another two for predicting the dependency label (label prediction). The resulting four representations are used in two biaffine classifiers, one predicting the arc and another one predicting the dependency label, given a dependent/head word pair. For our experiments, we use a 2-layer biLSTM with 250 hidden units, a POS embedding dimension of 100, and ReLU layers (for arc and label representations) with dimension 200. We train the model with a mini-batch size of 128 for 30 epochs, and perform early stopping when the Labeled Attachment Score (LAS) on the development set does not improve for 5 epochs.

Named entity recognition

The goal of this task is to label the spans of input text with entity labels, e.g., Person, Organization, or Location. Unlike POS tagging, NER annotates text spans and not individual tokens; this is usually represented via a (Begin, Inside, Outside) BIO-like encoding. We employ a standard NER architecture, a BiLSTM-CRF model where the output of BiLSTM is processed by a conditional random field to enforce global sequence-level constraints Huang, Xu, and Yu (2015). We use a 2-layer BiLSTM with 200 hidden units and 0.5 dropout trained for 20 epochs with patience 10, the performance is measured via span-based F1 score.
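The BIO encoding mentioned above turns span annotations into one tag per token. The following sketch assumes spans are given as `(start, end_exclusive, label)` triples over token indices; the function name and this input format are illustrative and not taken from any particular dataset reader.

```python
def spans_to_bio(n_tokens, spans):
    """Convert (start, end_exclusive, label) spans into per-token BIO tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags
```

For a four-token sentence with a two-token person mention at the start and a one-token location at the end, this yields the tags B-PER, I-PER, O, B-LOC.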

Semantic Role Labeling

The automatic process of identifying predicate-argument structures and assigning meaningful labels to them is named Semantic Role Labeling. An annotated sentence with the predicate sense “buy.01: purchase” is shown below.

[Mark]Arg0: Buyer [bought]buy.01 [a car]Arg1: Thing bought from [a retailer store]Arg2: Seller

We consider the dependency-based, i.e., CoNLL-09 style, PropBank SRL, where the goal is to label the argument heads with semantic roles rather than the full span. We use the subword-level end-to-end biLSTM-based sequence tagging SRL model introduced by Şahin and Steedman (2018). It can either use pretrained embeddings as word representations, or learn task-specific subword-level (character, character n-gram, morphology) representations by composing word vectors via a separate bi-LSTM network. Here, we only used pretrained word embeddings concatenated with a binary predicate flag (1 if the token is the predicate, 0 otherwise) and 2 layers of bi-LSTMs with 200 hidden dimensions on top of these representations. Finally, tokens are assigned the most probable semantic role calculated via the final softmax layer. Weight parameters are initialized orthogonally, the batch size is chosen as 32, and the model is optimized with stochastic gradient descent with an adaptive learning rate initialized as 1. Gradient clipping and early stopping with patience 3 are used. We use the standard data splits and evaluate the results with the official evaluation script provided by the CoNLL-09 shared task. We report the role labeling F1 scores.

Natural Language Inference

The NLI task aims to extract the relations such as Entailment, Neutral, and Contradiction between a pair of sentences. This task has been first introduced with the Stanford Natural Language Inference (SNLI; Bowman et al., 2015) dataset, which has been compiled from English image caption corpora and labeled via crowdsourcing. Some example pairs of sentences are shown in Table 6.

Premise | Hypothesis | Label
Met my first girlfriend that way. | I didn’t meet my first girlfriend until later. | Contradiction
I am a lacto-vegetarian. | I enjoy eating cheese too much to abstain from dairy. | Neutral
At 8:34, the Boston Center controller received a third transmission from American 11 | The Boston Center controller got a third transmission from American 11 | Entailment
Table 6: Example sentence pairs taken from (Williams, Nangia, and Bowman, 2018)

As stated by Bowman et al. (2015), and as can also be seen from Table 6, a high-performing NLI model should handle phenomena like tense, belief, modality, negation and ambiguity, which are mostly covered by our probing tasks.

Later, the dataset was extended with a more diverse variety of text styles and topics, and named MultiGenre NLI (MultiNLI; Williams, Nangia, and Bowman, 2018). Recently, the cross-lingual NLI (XNLI; Conneau et al., 2018b) dataset, derived from MultiNLI, has been introduced as a benchmark for evaluating cross-lingual sentence representations. The benchmark is originally aimed at testing models trained on the source language (English) data on the target languages (15 languages including Spanish, Turkish, Russian and German). It should be noted that the development and test splits for each language in XNLI have been translated by professional translators. The authors also release the automatic translation of the MultiNLI training split, which they use to align the cross-lingual sentence embeddings. Since the multilingual embeddings used in this study are not all cross-lingual, we train a separate monolingual NLI model for each language by using the automatic translation data. We use the enhanced LSTM model (ESIM; Chen et al., 2017) with the default parameters provided by the AllenNLP framework.

4.5 Diagnostic evaluation: a case study on SRL

Another proposed application of our probing tests is to diagnose the layers of a black-box NLP model. In order to do so, we used the SRL model described in extrinsic evaluation (see Sec. 4.4). This time, instead of using pretrained embeddings, we used randomly initialized character trigram embeddings. The model generates intermediate word representations by summing the weighted forward and backward hidden states from the character trigram bi-LSTM network. As the model is trained with a negative log likelihood loss for semantic roles, it is expected to learn character trigram embeddings and other model parameters that are better suited for SRL. In order to diagnose whether it does learn any useful information, we save the model states for different epochs and generate the word representations via the aforementioned internal biLSTM layer and use our intrinsic evaluation tool from Sec. 4.3, to evaluate these representations. As preprocessing, all tokens are lowercased and marked with start and end characters. One layer of bi-LSTMs both for subword composition and argument labeling with hidden size of 200 are used. Character trigrams are randomly initialized as 200-dim vectors. The other hyperparameters are kept the same as Sec. 4.4.

5 Experiments and Results

In this section, we first discuss the datasets used for our intrinsic and extrinsic experiments. We then provide the results and briefly discuss the general patterns and exceptions observed in both experiments.

5.1 Dataset

For intrinsic evaluation, we use the probing datasets that have been described in Sec 3.2, and experiment on the five languages: Finnish (Uralic), German (Germanic), Spanish (Romance), Russian (Slavic), and Turkish (Turkic) as discussed in Sec.4.1. The statistics of our extrinsic datasets are presented in Table 7. For NER dataset, the Turkish and Russian data are substantially larger than the other languages. For practical reasons and fair comparison, we randomly sample 5-8% subsets of the original datasets and split them into train/dev/test sets.

Task Language Source Number of tokens
train dev test
POS and DEP Finnish UD version 2.3 (Nivre, 2018) 199K 22K 26K
German 310K 15K 19K
Spanish 503K 59K 59K
Russian 1M 138K 137K
Turkish 50K 13K 13K
NER Finnish FinNER (described in the article “A Finnish News Corpus for Named Entity Recognition”, which is reported to be under review) 194K 14K 50K
German Germeval-2014 (Benikova et al., ) 477K 44K 102K
Spanish CoNLL-2002 (Sang, 2002) 273K 55K 53K
Russian WikiNER (Ghaddar and Langlais, 2017) 22M 738K 744K
Turkish TWNERTC (Sahin et al., 2017) 5.7M 1.9M 1.9M
SRL Finnish Finnish PropBank (Haverinen et al., 2015) 188K 11K 10K
German CoNLL-09 (Hajič et al., 2009) 685K 34K 34K
Spanish CoNLL-09 442K 52K 52K
Turkish Turkish PropBank (Şahin and Adalı, 2018) 48K 10K 10K
XNLI German XNLI (Conneau et al., 2018b) 400K 2.5K 5K
Table 7: Statistics of our extrinsic dataset.

5.2 Results

We present the probing test results of the multilingual embeddings introduced in Sec. 4.2 for each language/test pair in Table 9. In addition, we report the baseline scores calculated with a majority voting baseline for each language/test pair. According to Table 9, the majority of the tests had a baseline score under 50%, although some language/test pairs had higher baselines due to dataset properties such as lacking annotations for certain tags. These tests are POS for Finnish, Spanish and Turkish and TagCount for Finnish. In addition, SameFeat has one of the lowest baseline scores consistently across languages, generally followed by Case and OddFeat. Table 9 shows that all embeddings achieved their lowest score for CharacterBin, followed by OddFeat. In order to assess the difficulty of the tests, we calculated the gap between the average performance of the embeddings and the baseline scores. A small gap can point to a “hard-to-beat” majority vote baseline, which suggests that the task is relatively easy. It can also point to a probing test that would be difficult to master. After eliminating the tests with high baseline scores, we observe that the gap is particularly small for CharacterBin for all languages, as shown in Fig. 2. However, all tests except CharacterBin have seen improvements ranging between 50% and 200%, despite their low baseline scores.
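The baseline and the gap discussed above can be sketched as follows. The text does not give the exact formula for the improvement, so the relative definition used here (mean embedding score minus baseline, divided by the baseline) is an assumption, as are the function names.

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Accuracy (%) of always predicting the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    hits = sum(label == majority for label in test_labels)
    return 100.0 * hits / len(test_labels)

def relative_improvement(scores, baseline):
    """Relative gain (%) of the mean embedding score over the baseline (assumed definition)."""
    mean_score = sum(scores) / len(scores)
    return 100.0 * (mean_score - baseline) / baseline
```

Applied to the Case row of the first block of Table 9 (baseline 30.0; scores 49.3, 59.9, 83.4, 86.6, 96.7), this definition gives a relative improvement of about 150%, while the CharacterBin row of the same block (baseline 44.2) stays below 20%.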

Figure 2: Improvement over baselines given per each language/test pair.

First of all, for probing tests we observe that all embeddings outperform the baseline for all tasks and languages. Apart from a few cases, we see that D-ELMo achieves the highest scores for all language-test pairs, generally followed by fastText and BPE. For Finnish, Russian and Turkish, D-ELMo outperforms the other embeddings by a larger margin than for Spanish and German. fastText and BPE perform similarly, except for Russian, where BPE achieves significantly higher scores than fastText in almost all tests. We observe that the static embeddings, word2vec and MUSE, which cannot handle OOV words, performed similarly and had lower scores than the other embeddings for most of the tests, except for Pseudo. MUSE in particular has an outstanding performance on the Pseudo tests compared to its performance on other tests. This is due to having an internal vocabulary and treating all words outside of its dictionary the same, e.g., assigning them the same random vector, which leads to easier classification of non-vocabulary words.

Task MUSE word2vec BPE fastText D-ELMo
SRL 62.30 57.68 60.41 64.19 72.26
DEP 79.62 79.84 80.6 82.45 87.78
POS 89.56 89.86 89.88 92.55 96.56
NER 72.96 71.17 75.69 80.54 78.45
Task MUSE word2vec BPE fastText D-ELMo
SRL 55.25 60.60 57.11 61.75 61.85
DEP 82.43 82.78 82.32 83.20 83.46
POS 91.82 92.14 90.59 92.66 93.57
NER 74.32 76.13 71.43 78.35 71.81
XNLI 44.03 40.08 43.55 44.69 44.05
Task MUSE word2vec BPE fastText D-ELMo
SRL 64.49 62.78 62.34 66.39 70.03
DEP 90.17 90.26 89.99 90.55 91.09
POS 96.07 96.58 95.66 96.49 97.43
NER 77.48 79.31 77.36 78.96 77.75
XNLI 46.75 41.28 45.17 46.80 45.07
Task MUSE word2vec BPE fastText D-ELMo
DEP 90.13 90.54 90.16 87.41 92.26
POS 95.62 96.11 95.91 92.61 97.84
NER 78.38 79.92 75.84 64.20 79.71
XNLI 43.43 39.80 43.53 41.64 45.05
Task MUSE word2vec BPE fastText D-ELMo
SRL 53.29 46.35 53.51 53.14 63.38
DEP 57.82 56.67 55.92 57.70 62.97
POS 86.52 87.35 87.57 86.80 94.48
NER 48.87 52.21 51.75 52.52 49.22
XNLI 42.79 42.93 45.17 44.25 43.81
Table 8: Downstream tasks results for all languages. Bold represents the best score, while italics is the second best.

We present the results for extrinsic experiments in Table 8. The general performance ordering of the embeddings: D-ELMo, fastText/BPE, word2vec/MUSE repeats for syntactic (POS, DEP) and shallow semantic tasks (SRL) for all languages. However, for NER and XNLI tasks, we do not observe a repeating pattern. It should be noted that, for majority of the languages, POS, DEP and SRL are derived from the same treebanks and have gold (expert) annotations. On the other hand, NER and XNLI datasets are generally compiled from a more diverse set of resources, and of silver quality (automatic translation or annotation).

We observe that the static embeddings, namely word2vec and MUSE, perform relatively better on downstream tasks than on probing tasks for the fusional languages (German, Spanish and Russian). The performance difference may be due to the different rates of rare words in both datasets. As discussed in Sec 3.2, some of the words in our probing tasks do not occur frequently enough in Wikipedia to have good representations, which is not the case for the words appearing in the downstream task corpora. Our second observation is the change in the order of the best performing models for NER and XNLI. As discussed before, it may be due to the differences in the training corpora of downstream tasks or the lexical differences between the training corpora of multilingual embeddings and downstream tasks. (All embeddings except D-ELMo are trained on the full Wikipedia dump, while D-ELMo was trained on a sample of Wikipedia. Therefore, some entity mentions in Wikipedia were not encountered in the sampled training data of D-ELMo. Since most NER datasets are based on Wikipedia, the vocabulary of the other embedding spaces had a higher coverage of named entities, which led to higher scores.) More importantly, NER and XNLI are two distinct tasks that rely on different features than the syntactic tasks. For instance, the vocabulary (e.g., a gazetteer) is a crucial component for a NER model, while syntactic clues are of greater importance to other downstream tasks. Similarly, the ability to extract negation, quantification, coreference, tense and modality, and to resolve lexical and syntactic ambiguities, is of higher importance to the XNLI task than to other tasks.

Task          baseline  MUSE      word2vec  BPE       fastText  D-ELMo
Case          30.0      49.3      59.9      83.4      86.6      96.7
Mood          50.0      62.3      67.9      84.7      89.0      93.8
Number        45.6      60.4      69.4      83.4      90.3      97.4
POS           67.9      75.3      70.3      85.7      90.0      97.1
Person        30.1      54.0      66.8      84.6      88.8      94.6
Tense         40.9      65.4      73.4      86.0      90.6      94.7
Voice         50.8      63.4      70.8      86.8      89.6      95.1
CharacterBin  44.2      45.0      44.8      52.0      58.4      63.8
TagCount      86.0      88.6      87.0      91.0      95.0      98.4
OddFeat       22.7      24.4      24.5      65.1      76.7      88.4
SameFeat      29.1      94.1      92.0      96.9      96.5      98.4

Task          baseline  MUSE      word2vec  BPE       fastText  D-ELMo
Case          34.2      62.0      68.7      90.9      95.1      94.0
Mood          37.4      54.3      54.1      90.1      91.0      93.9
Number        40.1      60.4      66.8      90.7      93.7      97.7
POS           55.8      63.1      65.8      92.2      94.9      96.9
Person        52.9      65.2      60.3      90.4      91.5      95.8
Pseudo        50.0      96.7      80.1      83.2      90.0      91.0
Tense         52.9      73.1      71.5      91.5      92.9      93.2
CharacterBin  45.4      49.0      45.0      63.0      62.9      70.4
TagCount      54.9      61.5      63.1      83.0      86.5      89.2
OddFeat       22.6      37.9      34.8      65.1      71.2      75.4
SameFeat      28.4      84.5      86.5      89.6      90.4      89.0

Task          baseline  MUSE      word2vec  BPE       fastText  D-ELMo
Gender        34.5      67.0      74.5      98.0      98.8      99.8
Mood          52.0      67.0      66.1      89.2      90.9      95.0
Number        34.0      69.2      69.9      95.0      95.0      99.8
POS           70.9      85.6      84.1      97.6      98.5      99.6
Person        27.4      60.9      52.8      92.6      87.8      98.6
Pseudo        49.8      92.3      89.4      75.9      91.9      94.7
Tense         39.9      59.1      60.8      87.1      85.9      95.0
CharacterBin  50.9      55.2      55.3      72.3      69.6      76.2
TagCount      40.0      61.0      59.0      90.8      87.8      95.8
OddFeat       44.8      53.4      55.8      77.1      78.5      81.7
SameFeat      27.2      89.6      89.1      91.1      93.3      91.1

Task          baseline  MUSE      word2vec  BPE       fastText  D-ELMo
Case          31.0      57.3      78.0      80.8      62.0      96.7
Gender        39.8      57.7      78.3      95.4      80.7      99.3
Number        41.1      54.7      75.7      89.7      74.3      96.9
POS           48.4      56.5      67.8      89.7      74.2      98.2
Person        31.9      49.4      72.2      93.0      81.0      96.7
Tense         43.8      56.3      73.6      90.1      73.6      94.3
Voice         47.6      62.2      66.5      99.4      96.1      99.0
CharacterBin  46.0      46.3      52.5      68.9      64.4      70.9
TagCount      53.8      60.4      68.5      85.2      67.9      96.4
OddFeat       21.8      36.9      48.2      74.4      55.4      90.0
SameFeat      29.4      84.7      90.9      93.9      93.6      97.6

Task          baseline  MUSE      word2vec  BPE       fastText  D-ELMo
Case          31.1      63.5      57.4      87.1      85.4      96.1
POS           75.5      85.9      83.5      95.9      94.8      98.4
Person        30.3      52.5      52.0      93.5      90.5      96.1
Polarity      44.6      62.0      61.0      97.3      93.6      96.1
Possession    30.6      59.2      56.7      87.1      75.5      92.5
Pseudo        51.5      90.3      90.2      71.4      79.6      91.7
Tense         34.9      57.7      58.8      89.4      85.4      94.7
CharacterBin  46.1      58.1      53.6      66.7      66.7      71.5
TagCount      46.6      71.4      60.7      85.6      79.9      89.8
OddFeat       38.7      38.5      40.6      76.7      79.8      79.0
SameFeat      21.3      73.9      74.7      86.9      90.0      86.5
Table 9: Probing task results for all languages. Bold represents the best score, while italics is the second best.
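
As a small aid for reading Table 9, the following Python sketch computes each embedding's gain over the baseline and the best and second-best model for a row. The four rows are copied from the first panel of the table; the helper names (`gains_over_baseline`, `best_two`) are hypothetical and are not part of the released evaluation suite.

```python
# Rows copied from the first panel of Table 9 (accuracy, %).
# Column order: baseline, MUSE, word2vec, BPE, fastText, D-ELMo.
MODELS = ["MUSE", "word2vec", "BPE", "fastText", "D-ELMo"]

PANEL = {
    "Case": [30.0, 49.3, 59.9, 83.4, 86.6, 96.7],
    "POS": [67.9, 75.3, 70.3, 85.7, 90.0, 97.1],
    "CharacterBin": [44.2, 45.0, 44.8, 52.0, 58.4, 63.8],
    "OddFeat": [22.7, 24.4, 24.5, 65.1, 76.7, 88.4],
}


def gains_over_baseline(row):
    """Return {model: accuracy - baseline} for one table row."""
    baseline, scores = row[0], row[1:]
    return {m: round(s - baseline, 1) for m, s in zip(MODELS, scores)}


def best_two(row):
    """Return the names of the best and second-best models for one row."""
    ranked = sorted(zip(MODELS, row[1:]), key=lambda p: p[1], reverse=True)
    return ranked[0][0], ranked[1][0]


for task, row in PANEL.items():
    print(task, gains_over_baseline(row), best_two(row))
```

Reading the rows this way makes the gap between the static embeddings and the subword-aware or contextual ones easy to see: for CharacterBin, for example, MUSE and word2vec stay within one accuracy point of the baseline.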

6 Analysis

In this section, we investigate the relation between the downstream and the probing tasks more closely, and then interpret our results with respect to language families and downstream tasks. We present the results of the diagnostic case study described in Sec. 4.5 and show their close connection to the highly correlated probing tests. Finally, we give a brief summary of our findings related to the proposed probing tasks.

6.1 Correlation

To quantify the relation between the downstream tasks and the probing tests, we compute the Spearman correlation, shown in Fig. 3. In addition, we calculate two-sided p-values to test the null hypothesis that two sets of results are uncorrelated, and interpret the results with respect to the languages and the tasks (significant correlations are given in the Appendix).

Figure 3: Spearman correlation between probing and downstream tasks for each language.
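
The correlation analysis can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the text does not say how the p-values were computed, so the sketch uses an exact permutation test, which is feasible because only five embeddings are compared. The probing scores are the Case row of the first panel of Table 9 (MUSE, word2vec, BPE, fastText, D-ELMo); the downstream scores are invented for illustration, and all function names are hypothetical.

```python
from itertools import permutations


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def exact_two_sided_p(x, y):
    """Share of all orderings of y whose |rho| is at least the observed one."""
    observed = abs(spearman(x, y))
    rx, ry = ranks(x), ranks(y)
    hits = total = 0
    for perm in permutations(ry):
        total += 1
        if abs(pearson(rx, list(perm))) >= observed - 1e-12:
            hits += 1
    return hits / total


# Case accuracies from Table 9 (first panel), one value per embedding.
probing = [49.3, 59.9, 83.4, 86.6, 96.7]
# Hypothetical downstream scores for the same five embeddings.
downstream = [70.1, 74.0, 80.2, 79.5, 85.3]

rho = spearman(probing, downstream)
p_value = exact_two_sided_p(probing, downstream)
print(f"rho={rho:.2f}, p={p_value:.4f}")
```

With five samples there are only 120 possible rank orderings, so the exact two-sided p-value can never fall below 2/120 (about 0.017), and even a single swapped pair of ranks already gives 10/120 (about 0.083).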

6.1.1 Language-related findings


We observed the highest statistically significant correlations for Finnish. Since the number of samples for the correlation analysis, i.e., the number of embeddings, is only 5, we use a relatively high p-value threshold of 0.1. According to the calculated p-values, all proposed tests except SameFeat had a statistically significant correlation with POS, DEP, SRL and NER for Finnish.


For German, we observed high, statistically significant correlations for the Number, POS and TagCount tests, whereas Case, Mood, OddFeat, Person and Tense have a statistically significant correlation with SRL. The correlation pattern of Case, Number, POS and TagCount repeated for the syntactic and shallow semantic tasks (POS, DEP and SRL), whereas XNLI correlated well with Case, Mood, OddFeat, Person and Tense.


For Russian, we find that Case, Number and TagCount have high, statistically significant correlations with the syntactic tasks, whereas XNLI correlated relatively better with other features such as SameFeat, OddFeat, Person and Tense.


For Spanish, there was no clear pattern, except for the Pseudo test, which had a strong correlation with SRL and DEP. We believe that this may be due to fewer linguistic features being encoded on the word level.


For Turkish, we found strong correlations between all single-feature tests and the syntactic tasks (except for DEP), and relatively high correlations between Polarity, SameFeat, OddFeat, Tense and the XNLI task.


To summarize our findings for languages, we observed that a large number of probing tasks had high correlation to syntactic tasks for agglutinative languages: Turkish and Finnish. On the other hand, we found a more focused/smaller set of probing tasks with high correlation for fusional languages, especially for syntactic tasks. Furthermore, we observed that a set of common probing tests have higher correlation to certain downstream tasks among most languages, such as Case, POS, Number and TagCount to syntactic tasks; and OddFeat, SameFeat, Tense to XNLI.

6.1.2 Downstream task-related findings


For all languages, SRL is found to correlate with the largest number of probing tasks. This finding is intuitive, since SRL performance depends on more complex linguistic phenomena than the other tasks. Regardless of the language family, we find that SRL has high correlations with Case and POS, generally followed by the Person and Tense tests. This finding is in line with the traditional language-independent features used for SRL back in the feature-engineering days (Hajič et al., 2009). In addition to those tests, for agglutinative languages we find high correlation for CharacterBin and TagCount. SRL also has high correlation with Possession and Polarity, which only exist for Turkish. We see that Mood is a common highly correlated test for fusional languages, whereas Number only correlates with German SRL, and Pseudo only with Spanish SRL.


POS tagging and dependency parsing (DEP) can be considered easier tasks than SRL, since superficial linguistic cues are enough to decide on local classes. However, these cues are not expected to be distinct from those relevant to SRL, but rather a subset of them. Confirming this, for German we see that the set of highly correlated features is reduced to the subset Case, Number, POS and TagCount. Another hint supporting this hypothesis is the decline in the correlation scores of CharacterBin and TagCount for POS and DEP in agglutinative languages. This finding suggests that a feature focused on a single linguistic phenomenon has a higher correlation with the more syntactic tasks than a feature, such as TagCount, that only distantly approximates the morphological features of a given word.


Except for Finnish NER, none of the NER tasks had significantly high correlations with our probing tests. It should be noted that NER as a task is very different from POS, DEP and SRL. For instance, while morphological cues are of great importance to POS, DEP and SRL, for NER having good representations for the entities is more important. This finding suggests that either (1) the proposed probing tasks, including the Pseudo test, were not capable of capturing this phenomenon; or (2) the chosen embeddings were similarly syntactic and hence had a low variance.


For XNLI, we observe a noticeable pattern consistently among almost all languages, which is high correlation to Mood, Polarity, Tense and Person that had been hypothesized by N18-1101; and high correlation with one of our paired tests (usually SameFeat) that resembles the NLI task.

Furthermore, we notice that CharacterBin and TagCount are redundant tests for agglutinative languages. This is because these languages have a one-to-one morpheme/tag mapping, so the number of characters is also a good indicator of the number of tags. Since the other languages are fusional, and hence have a one-to-many relation, these tests do not relate to each other, as can be seen from the correlation matrices of German and Russian, for which TagCount has high correlation scores unlike CharacterBin.

6.2 Diagnostic Task

In this section, we present the results and analysis of the diagnostic case study for Finnish and Turkish. As described in Sec. 4.5, we train an end-to-end SRL model that uses only character trigrams as input. We then probe the word encoding layer with the proposed probing tests at three consecutive epochs across which we see a large improvement in SRL F1 scores.

For Finnish, we see large improvements on Case, Mood, Person and Voice, and a drop or a constant score for CharacterBin, Number, POS and TagCount, as shown in Fig. 4. These results suggest that, when training for an SRL objective, the encoding layer captures increasingly more Case, Mood, Person and Voice information; these tests were also found to have significantly high correlation with Finnish SRL, as discussed in the previous section. One interesting phenomenon is the constant or lower scores for correlated features such as CharacterBin and Number. We note that, even if these tests are predictive of performance on the SRL task, not all neural models are capable of learning all correlated features discussed in the previous section. This phenomenon could be due to limitations of the neural model, or to these features being captured easily during the very early stages of training. Since the aim of this section is to demonstrate a case study, a thorough comparison and an in-depth investigation of the root causes are not in the scope of this work.

Similarly, we probe the encoding layer of the Turkish SRL model at three epochs. Since the F1 improvements are more dramatic, we see a clearer pattern for Turkish. Similar to our results for Finnish, we observe considerably high improvements for the Case, Person, Polarity, POS, Possession and Tense features, while no improvement is seen for CharacterBin, TagCount or Pseudo. Again, all tests with increasing scores had been shown to have significantly high correlation with Turkish SRL, while Pseudo had no significant correlation. Other correlated features with non-increasing scores can be explained in the same way as for Finnish.

Figure 4: Probing tests used as diagnostic tool
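
The diagnostic use of the probing tests amounts to comparing per-test accuracies across training epochs. The sketch below shows one simple way to do this in Python, assuming the probing accuracies have already been collected for each epoch. The scores are hypothetical and not taken from the paper, and both the function name `classify_trends` and the `min_gain` threshold are illustrative choices.

```python
def classify_trends(scores_by_epoch, min_gain=1.0):
    """Split probing tests into improved vs. flat-or-dropped.

    scores_by_epoch maps a test name to accuracies at consecutive epochs.
    min_gain (in accuracy points) is an arbitrary illustrative threshold.
    """
    improved, flat_or_dropped = [], []
    for test, scores in scores_by_epoch.items():
        gain = scores[-1] - scores[0]
        (improved if gain >= min_gain else flat_or_dropped).append(test)
    return sorted(improved), sorted(flat_or_dropped)


# Hypothetical accuracies at three consecutive epochs (not from the paper).
example = {
    "Case": [61.0, 68.5, 74.2],
    "Person": [55.3, 60.1, 66.0],
    "CharacterBin": [48.0, 47.6, 47.9],
    "TagCount": [70.2, 70.0, 69.1],
}

improved, flat = classify_trends(example)
print("improved:", improved)
print("flat or dropped:", flat)
```

Tests that land in the first group are candidates for features the encoder acquires while optimizing the downstream objective; tests in the second group may be either not learned or already captured in earlier epochs, which this comparison alone cannot distinguish.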

6.3 Discussion on Probing Tests

To summarize our general findings on the proposed probing tasks from the experiments (see Sec. 5.2) and analysis (see Sec. 6.1-6.2) sections:

  • The calculated correlations were positive,

  • The set of correlated tests were generally smaller for fusional languages,

  • The set of correlated tests varied with the complexity of the downstream task; however, the correlation pattern was common across similar tasks (e.g., SRL had a large set of correlated tests, while POS tagging had only a subset of it),

  • The set of correlated tests varied with the requirements of the downstream task (e.g., paired tests like SameFeat had strong correlation to XNLI, but had weak ties to syntactic tasks.)

  • We observe commonalities among the correlated probing tests for Finnish, Turkish, German and Russian. For instance, the correlations between Case, POS, Person, Tense, TagCount and the downstream tasks were higher than those of the other probing tests. This suggests that the findings are transferable, and hence that the proposed probing tests can be used for other languages,

  • We also observe that language specific tests are beneficial, i.e., have significantly high correlation such as Polarity for Turkish, and some tests could be impactful for a language family, e.g. CharacterBin for agglutinative languages, Pseudo for Spanish.

To follow up on our discussion on strong baselines, and the difficulty of the tests from Sec. 5.2, we find that correlations neither depend on how strong the initial baseline is, nor how low the accuracy scores for this test are. For instance, Part-of-Speech tagging has strong baselines for many languages, however it is also one of our most correlated tests. Moreover, “hard” tests with low scores, such as OddFeat and CharacterBin behaved like any other tests, i.e., had low-to-high levels of correlation for different language/downstream task pairs.

7 Conclusion

In this study, we have introduced 15 word-level probing tests for a total of 24 languages, where the target linguistic phenomena differ depending on the typological properties of the language, e.g., Case and Polarity for Turkish, Gender for Russian and German. These tests are proposed as an exploratory tool to reveal the underlying linguistic properties captured by a word embedding or a layer of a neural model trained for a downstream task. Furthermore, we introduce a methodology for the creation and evaluation of such tests, which can be easily extended to other datasets and languages. We release the LINSPECTOR framework, which consists of the datasets for the probing tasks along with an easy-to-use probing and downstream evaluation suite based on AllenNLP.

We have performed an exhaustive set of intrinsic and extrinsic experiments with commonly used pretrained multilingual embeddings for five typologically diverse languages: German, Spanish, Russian, Turkish and Finnish. We found that all evaluated embeddings provide a varying range of improvement over the baselines. Our statistical analysis of the intrinsic and extrinsic experimental results showed that the proposed probing tasks are positively correlated with the majority of the downstream tasks. In general, the number of correlated probing tests was found to be higher for agglutinative languages, especially for syntactic tasks. We showed that the set of correlated tests differs with the type of the downstream task. For instance, XNLI is strongly related to SameFeat, while SRL correlates well with Case. We observed Case, POS, Person, Tense and TagCount to have significantly high correlations for the majority of the analyzed languages and tasks; in addition, language-specific tests such as Possession were found to correlate well where they were applicable. Furthermore, the results of our diagnostic case study, where we probe the encoding layer of a black-box neural model, showed strong connections to the correlated tests. All these findings suggest that the proposed probing tests can be used to estimate the performance of an input representation on a downstream task, as well as to explore the strengths and weaknesses of existing neural models, or to understand the relation between a parameter and linguistic cues (e.g., how the performance on probing tests changes after increasing the model size).

As future work, we plan to extend our evaluation suite. The current evaluation suite only supports probing of static word embeddings, which is not convenient for exploring black-box models. Hence, we plan to implement a feature for existing AllenNLP models where the user can specify the layer and the language to be probed, and obtain the probing results automatically. Furthermore, the methodology we introduced can be used to create “contextual” probing tests with little effort, to explore the properties of contextualized embeddings by using the recently introduced SIGMORPHON 2019 dataset adapted from Universal Dependencies.

8 Appendix

Correlation. In order to provide more insight into the relation between the downstream tasks and the probing tests, Fig. 5 shows only the significant Spearman correlations.

Figure 5: Spearman correlation between probing and downstream tasks for each language. Weak correlations are not shown.


  • Adi et al. (2017) Adi, Yossi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In International Conference on Learning Representations.
  • Ataman and Federico (2018) Ataman, Duygu and Marcello Federico. 2018. Compositional representation of morphologically-rich input for neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 305–311, Association for Computational Linguistics.
  • Belinkov and Bisk (2018) Belinkov, Yonatan and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In International Conference on Learning Representations.
  • Belinkov et al. (2017) Belinkov, Yonatan, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872, Association for Computational Linguistics.
  • Benikova et al. Benikova, Darina, Chris Biemann, Max Kisselew, and Sebastian Padó. Germeval 2014 named entity recognition shared task: Companion paper. Organization, 7:281.
  • Bisazza and Tump (2018) Bisazza, Arianna and Clara Tump. 2018. The lazy encoder: A fine-grained analysis of the role of morphology in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2871–2876, Association for Computational Linguistics.
  • Blake (2001) Blake, Barry J. 2001. Case. Cambridge University Press.
  • Bojanowski et al. (2017) Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.
  • Bowman et al. (2015) Bowman, Samuel R., Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics.
  • Bruni et al. (2012) Bruni, Elia, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In The 50th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference, July 8-14, 2012, Jeju Island, Korea - Volume 1: Long Papers, pages 136–145.
  • Burlot et al. (2018) Burlot, Franck, Yves Scherrer, Vinit Ravishankar, Ondřej Bojar, Stig-Arne Grönroos, Maarit Koponen, Tommi Nieminen, and François Yvon. 2018. The wmt’18 morpheval test suites for english-czech, english-german, english-finnish and turkish-english. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 546–560.
  • Che et al. (2018) Che, Wanxiang, Yijia Liu, Yuxuan Wang, Bo Zheng, and Ting Liu. 2018. Towards better UD parsing: Deep contextualized word embeddings, ensemble, and treebank concatenation. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 55–64, Association for Computational Linguistics, Brussels, Belgium.
  • Chen et al. (2017) Chen, Qian, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1657–1668.
  • Comrie and Polinsky (1998) Comrie, Bernard and Maria Polinsky. 1998. The great daghestanian case hoax. Case, typology and grammar, pages 95–114.
  • Conneau et al. (2018a) Conneau, Alexis, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018a. What you can cram into a single \$&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 2126–2136.
  • Conneau et al. (2017) Conneau, Alexis, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087.
  • Conneau et al. (2018b) Conneau, Alexis, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018b. Xnli: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics.
  • Corbett (2013) Corbett, Greville G. 2013. Number of genders. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.
  • Şahin and Adalı (2018) Şahin, Gözde Gül and Eşref Adalı. 2018. Annotation of semantic roles for the turkish proposition bank. Language Resources and Evaluation, 52(3):673–706.
  • Şahin and Steedman (2018) Şahin, Gözde Gül and Mark Steedman. 2018. Character-level models versus morphology in semantic role labeling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 386–396.
  • Dozat and Manning (2016) Dozat, Timothy and Christopher D. Manning. 2016. Deep biaffine attention for neural dependency parsing.
  • Erten, Bozsahin, and Zeyrek (2014) Erten, Begum, Cem Bozsahin, and Deniz Zeyrek. 2014. Turkish resources for visual word recognition. In LREC, pages 2106–2110.
  • Eryigit, Nivre, and Oflazer (2008) Eryigit, Gülsen, Joakim Nivre, and Kemal Oflazer. 2008. Dependency parsing of turkish. Computational Linguistics, 34(3):357–389.
  • Fares et al. (2017) Fares, Murhaf, Andrey Kutuzov, Stephan Oepen, and Erik Velldal. 2017. Word vectors, reuse, and replicability: Towards a community repository of large-text resources. In Proceedings of the 21st Nordic Conference on Computational Linguistics, pages 271–276, Association for Computational Linguistics, Gothenburg, Sweden.
  • Faruqui et al. (2015) Faruqui, Manaal, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1606–1615.
  • Finkelstein et al. (2001) Finkelstein, Lev, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: the concept revisited. In Proceedings of the Tenth International World Wide Web Conference, WWW 10, Hong Kong, China, May 1-5, 2001, pages 406–414.
  • Gage (1994) Gage, Philip. 1994. A new algorithm for data compression. C Users J., 12(2):23–38.
  • Gardner et al. (2017) Gardner, Matt, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. Allennlp: A deep semantic natural language processing platform.
  • Ghaddar and Langlais (2017) Ghaddar, Abbas and Phillippe Langlais. 2017. Winer: A wikipedia annotated corpus for named entity recognition. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 413–422, Asian Federation of Natural Language Processing.
  • Hajič et al. (2009) Hajič, Jan, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The conll-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, CoNLL ’09, pages 1–18, Association for Computational Linguistics, Stroudsburg, PA, USA.
  • Haverinen et al. (2015) Haverinen, Katri, Jenna Kanerva, Samuel Kohonen, Anna Missila, Stina Ojala, Timo Viljanen, Veronika Laippala, and Filip Ginter. 2015. The Finnish Proposition Bank. Language Resources and Evaluation, 49(4):907–926.
  • Heinzerling and Strube (2018) Heinzerling, Benjamin and Michael Strube. 2018. BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan.
  • Hill, Reichart, and Korhonen (2015) Hill, Felix, Roi Reichart, and Anna Korhonen. 2015. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665–695.
  • Huang et al. (2012) Huang, Eric, Richard Socher, Christopher Manning, and Andrew Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 873–882, Association for Computational Linguistics.
  • Huang, Xu, and Yu (2015) Huang, Zhiheng, Wei Xu, and Kai Yu. 2015. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991.
  • Iggesen (2013) Iggesen, Oliver A. 2013. Number of cases. In Matthew S. Dryer and Martin Haspelmath, editors, The World Atlas of Language Structures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.
  • Isgüder and Adali (2014) Isgüder, Gözde Gül and Esref Adali. 2014. Using morphosemantic information in construction of a pilot lexical semantic resource for turkish. In Proceedings of Workshop on Lexical and Grammatical Resources for Language Processing, LG-LP at COLING 2014, Dublin, Ireland, August 24, 2014, pages 46–54.
  • Keuleers and Brysbaert (2010) Keuleers, Emmanuel and Marc Brysbaert. 2010. Wuggy: A multilingual pseudoword generator. Behavior research methods, 42(3):627–633.
  • Kim et al. (2016) Kim, Yoon, Yacine Jernite, David Sontag, and Alexander Rush. 2016. Character-Aware Neural Language Models. In Proceedings of the 2016 Conference on Artificial Intelligence (AAAI).
  • Kirov et al. (2016) Kirov, Christo, John Sylak-Glassman, Roger Que, and David Yarowsky. 2016. Very-large scale parsing and normalization of wiktionary morphological paradigms. In Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, Portorož, Slovenia, May 23-28, 2016.
  • Köhn (2015) Köhn, Arne. 2015. What’s in an embedding? analyzing word embeddings through multilingual evaluation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2067–2073.
  • Levy and Goldberg (2014) Levy, Omer and Yoav Goldberg. 2014. Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 2: Short Papers, pages 302–308.
  • Ling et al. (2015) Ling, Wang, Chris Dyer, Alan W Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luis Marujo, and Tiago Luis. 2015. Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1520–1530, Association for Computational Linguistics, Lisbon, Portugal.
  • Linzen (2016) Linzen, Tal. 2016. Issues in evaluating semantic spaces using word analogies. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 13–18.
  • Luong, Socher, and Manning (2013) Luong, Minh-Thang, Richard Socher, and Christopher D. Manning. 2013. Better word representations with recursive neural networks for morphology. In CoNLL, Sofia, Bulgaria.
  • McDonald, Crammer, and Pereira (2005) McDonald, Ryan, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 91–98, Association for Computational Linguistics, Ann Arbor, Michigan.
  • Mikolov et al. (2013a) Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
  • Mikolov et al. (2013b) Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, NIPS’13, pages 3111–3119, Curran Associates Inc., USA.
  • Mikolov, Yih, and Zweig (2013) Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Georgia, USA, pages 746–751.
  • Miller and Charles (1991) Miller, George A and Walter G Charles. 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28.
  • Nangia et al. (2017) Nangia, Nikita, Adina Williams, Angeliki Lazaridou, and Samuel Bowman. 2017. The repeval 2017 shared task: Multi-genre natural language inference with sentence representations. In Proceedings of the 2nd Workshop on Evaluating Vector Space Representations for NLP, pages 1–10.
  • Nayak, Angeli, and Manning (2016) Nayak, Neha, Gabor Angeli, and Christopher D. Manning. 2016. Evaluating word embeddings using a representative suite of practical tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, RepEval at ACL 2016, Berlin, Germany, August 2016, pages 19–23.
  • Nivre (2018) Nivre, Joakim et al. 2018. Universal dependencies 2.3. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
  • Pennington, Socher, and Manning (2014) Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543.
  • Peters et al. (2018) Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
  • Pilehvar et al. (2017) Pilehvar, Mohammad Taher, José Camacho-Collados, Roberto Navigli, and Nigel Collier. 2017. Towards a seamless integration of word senses into downstream NLP applications. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1857–1869.
  • Rogers, Ananthakrishna, and Rumshisky (2018) Rogers, Anna, Shashwath Hosur Ananthakrishna, and Anna Rumshisky. 2018. What’s in your embedding, and how it predicts task performance. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2690–2703.
  • Rogers, Drozd, and Li (2017) Rogers, Anna, Aleksandr Drozd, and Bofang Li. 2017. The (too many) problems of analogical reasoning with word vectors. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), pages 135–148, Association for Computational Linguistics.
  • Ruder, Vulić, and Søgaard (2017) Ruder, Sebastian, Ivan Vulić, and Anders Søgaard. 2017. A survey of cross-lingual word embedding models. arXiv preprint arXiv:1706.04902.
  • Sahin et al. (2017) Sahin, H. Bahadir, Caglar Tirkaz, Eray Yildiz, Mustafa Tolga Eren, and Omer Ozan Sonmez. 2017. Automatically annotated turkish corpus for named entity recognition and text categorization using large-scale gazetteers. CoRR, abs/1702.02363.
  • Sang (2002) Sang, Erik F. Tjong Kim. 2002. Introduction to the conll-2002 shared task: Language-independent named entity recognition. In Proceedings of the 6th Conference on Natural Language Learning, CoNLL 2002, Held in cooperation with COLING 2002, Taipei, Taiwan, 2002.
  • Schnabel et al. (2015) Schnabel, Tobias, Igor Labutov, David M. Mimno, and Thorsten Joachims. 2015. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 298–307.
  • Sennrich, Haddow, and Birch (2016) Sennrich, Rico, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Association for Computational Linguistics.
  • Shi, Padhi, and Knight (2016) Shi, Xing, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534, Association for Computational Linguistics.
  • Sylak-Glassman (2016) Sylak-Glassman, John. 2016. The composition and use of the universal morphological feature schema (UniMorph schema). Technical report.
  • Sylak-Glassman et al. (2015) Sylak-Glassman, John, Christo Kirov, David Yarowsky, and Roger Que. 2015. A language-independent feature schema for inflectional morphology. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics, ACL 2015, July 26-31, 2015, Beijing, China, Volume 2: Short Papers, pages 674–680.
  • Tal Linzen, Chrupała, and Alishahi (2018) Linzen, Tal, Grzegorz Chrupała, and Afra Alishahi. 2018. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and interpreting neural networks for NLP. Association for Computational Linguistics.
  • Tenney et al. (2019) Tenney, Ian, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.
  • Tsvetkov et al. (2015) Tsvetkov, Yulia, Manaal Faruqui, Wang Ling, Guillaume Lample, and Chris Dyer. 2015. Evaluation of word vector representations by subspace alignment. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2049–2054.
  • Vania and Lopez (2017) Vania, Clara and Adam Lopez. 2017. From characters to words to in between: Do we capture morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2016–2027, Association for Computational Linguistics.
  • Veldhoen, Hupkes, and Zuidema (2016) Veldhoen, Sara, Dieuwke Hupkes, and Willem H. Zuidema. 2016. Diagnostic classifiers revealing how neural networks process hierarchical structure. In CoCo@NIPS.
  • Williams, Nangia, and Bowman (2018) Williams, Adina, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, Association for Computational Linguistics.
  • Yang, Lu, and Zheng (2017) Yang, Wei, Wei Lu, and Vincent Zheng. 2017. A simple regularization-based algorithm for learning cross-domain word embeddings. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 2898–2904.