How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation

06/16/2020 ∙ by Steffen Eger, et al. ∙ Technische Universität Darmstadt

Sentence encoders map sentences to real-valued vectors for use in downstream applications. To peek into these representations - e.g., to increase the interpretability of their results - probing tasks have been designed which query them for linguistic knowledge. However, designing probing tasks for lesser-resourced languages is tricky, because these often lack the large-scale annotated data or (high-quality) dependency parsers that probing task design in English presupposes. To investigate how to probe sentence embeddings in such cases, we study the sensitivity of probing task results to structural design choices, conducting the first large-scale study of this kind. We show that design choices like the size of the annotated probing dataset and the type of classifier used for evaluation do (sometimes substantially) influence probing outcomes. We then probe embeddings in a multilingual setup with design choices that lie in a 'stable region', as we identify for English, and find that results on English do not transfer to other languages. Fairer and more comprehensive sentence-level probing evaluation should thus be carried out on multiple languages in the future.




1 Introduction

Extending the concept of word embeddings to the sentence level, sentence embeddings (a.k.a. sentence encoders) have become ubiquitous in NLP (Kiros et al., 2015; Conneau et al., 2017). In the context of recent efforts to open the black box of deep learning models and representations (Linzen et al., 2019), it has also become fashionable to probe sentence embeddings for the linguistic information signals they contain (Perone et al., 2018), as this may not be clear from their performance in downstream tasks. Such probes are linguistic micro tasks—like detecting the length of a sentence or its dependency tree depth—that have to be solved by a classifier using the given representations.



High  (A,B,C)  (A,B,C)  (C,A,B)  (C,B,A)
Mid   (A,C,B)  (C,B,A)  (A,B,C)  (C,B,A)
Low   (A,B,C)  (B,A,C)  (B,C,A)  (A,B,C)
Table 1: Schematic illustration of our concept of stability across two dimensions (classifier and training size). Here, three encoders, dubbed A, B, C, are ranked. The region of stability is given by those settings that support the majority ranking of encoders, which is (A,B,C).

The majority of approaches for probing sentence embeddings target English, but recently some works have also addressed other languages such as Polish, Russian, or Spanish in a multi- and cross-lingual setup (Krasnowska-Kieraś and Wróblewska, 2019; Ravishankar et al., 2019). Motivations for a multi-lingual analysis include knowing whether findings from English transfer to other languages and determining a universal set of probing tasks that suits multiple languages, e.g., those with richer morphology and freer word order.

Our work is also inspired by probing sentence encoders in multiple (particularly low-resource) languages. We are especially interested in the formal structure of probing task design in this context. Namely, when designing probing tasks for low-resource languages, some questions arise naturally that are less critical in English. One of them is the size of training data for probing tasks, as this training data typically needs to be (automatically or manually) annotated, an inherent obstacle in low-resource settings.[1]

[1] The main issue is that high-quality dependency parsers, as required for standard probing task design, are available only for a handful of languages. E.g., UDPipe (Straka, 2018) is available for only about 100 languages, and performance scores for some of these are considerably below those for English (Straka, 2018).

Thus, we first ask what training data size is required for obtaining reliable probing task results. This question is also relevant for English: on the one hand, Conneau et al. (2018) claim that training data for a probing task should be plentiful, as otherwise (highly parametrized) classifiers on top of representations may be unable to extract the relevant information signals; on the other hand, Hewitt and Liang (2019) note that a sufficiently powerful classifier with enough training data can in principle learn any task, without this necessarily allowing the conclusion that the representations adequately store the linguistic signal under scrutiny. Second, we ask how stable probing task results are across different classifiers (e.g., MLP vs. Naive Bayes). This question is closely related to the question of size, since different classifiers have different sensitivities to data size; deep models in particular are claimed to require more training data.

We evaluate the sensitivity of probing task results to the two outlined parameters—which are mere machine learning design choices that do not affect the linguistic content stored in the sentence representations under scrutiny—and then determine a ‘region of stability’ for English (en), where outcomes are predicted to be similar for the majority of parameter choices. Table 1 illustrates this. Using parameter choices within our region of stability, we turn to three lower-resource languages: Turkish (tr), Russian (ru), and Georgian (ka). tr is a Turkic language written in Latin script which makes exhaustive use of agglutination. ru is a Slavic language written in Cyrillic script, characterized by strong inflection and rich morphology. ka is a South Caucasian language using its own script, called Mkhedruli; it makes use of both agglutination and inflection. For these languages, our main research question is whether probing task results transfer from English.

Overall, our research questions are:


  • (i) How reliable are probing task results across machine learning design choices?

  • (ii) Will encoder performances correlate across languages, even though the languages and their linguistic properties may differ?

  • (iii) Will probing task performances correlate across languages?

  • (iv) Will the correlation between probing and downstream tasks be the same across languages?

These questions are important because they indicate whether or not probing tasks (and their relation to downstream tasks) have to be re-evaluated in languages other than en. Our results strongly suggest that re-evaluation is required and that claims of superiority of sentence encoders on en data do not transfer to other languages.

2 Related work

The goal of this work is to probe for sentence-level linguistic knowledge encoded in sentence embeddings Perone et al. (2018) in a multilingual setup which marginalizes out the effects of probing task design choices when comparing sentence representations.

Sentence embeddings have become central for representing texts beyond the word level, e.g., in small data scenarios, where it is difficult to induce good higher-level text representations from word embeddings (Subramanian et al., 2018), or for clustering and text retrieval applications (Reimers and Gurevych, 2019). To standardize the comparison of sentence embeddings, Conneau and Kiela (2018) proposed the SentEval framework for evaluating the quality of sentence embeddings on a range of downstream tasks and 10 probing tasks.

Probing tasks are used to introspect embeddings for linguistic knowledge, by taking “probes” as dedicated syntactic or semantic micro tasks Köhn (2016). As opposed to an evaluation in downstream applications or benchmarks like GLUE Wang et al. (2018), probing tasks target very specific linguistic knowledge which may otherwise be confounded in downstream applications. Since they are artificial tasks, they can also be better controlled to avoid dataset biases and artifacts. Probing is typically executed either at the type/token (word) level (Tenney et al., 2019) or at the sentence level (Adi et al., 2017). For sentence-level evaluation, SentEval thus far only includes en data. Each probing task in SentEval is balanced and has 100k train, 10k dev, and 10k test instances. The effects of these design choices are unclear, which is why our work addresses their influence systematically.

In the multilingual setting, Şahin et al. (2019) propose 15 token- and type-level probing tasks. Their probing task data is sourced from UniMorph 2.0 Kirov et al. (2018), Universal Dependency treebanks McCarthy et al. (2018) and Wikipedia word frequency lists. To deal with lower-resourced languages, they only use 10k samples per probing task/language pair (7k/2k/1k for train/dev/test) and exclude task/language pairs for which this amount cannot be generated. Their final experiments are carried out on five languages (Finnish, German, Spanish, ru, tr), for which enough training data is available. They find that for morphologically rich (agglutinative) languages, several probing tasks positively correlate with downstream applications. This finding is plausibly tied to the fact that they test at the word level, such that probes in agglutinative languages, which encode more linguistic information in a single word, are easier to solve. Our work also investigates the correlation between probing and downstream performance, but we do so at the sentence level.

On the sentence level, Ravishankar et al. (2019) train an InferSent-like encoder on en and map this encoder to four languages (ru, French, German, Spanish) using parallel data. Subsequently, they probe the encoders on the probing tasks proposed by Conneau et al. (2018) on Wikipedia data for each language. They use the same size of probing task data as in SentEval, i.e., 100k/10k/10k for train/dev/test. Their interest is in whether probing task results are higher or lower compared to en scores. They find particularly the ru probing scores to be low, which they speculate to be an artifact of cross-lingual word embedding induction and the language distance of ru to en. In contrast to us, their focus is on the effect of transferring sentence representations from en to other languages. The problem of such an analysis is that results may be affected by the nature of the cross-lingual mapping techniques.

Krasnowska-Kieraś and Wróblewska (2019) probe sentence encoders in en and Polish (pl). They use tasks defined in Conneau et al. (2018) but slightly modify them (e.g., replacing dependency with constituency trees), reject some tasks (Bigram-Shift, as word order may play a minor role in pl), and add two new tasks (Voice and Sentence Type). Since pl data is less abundant, they shrink the size of the pl datasets to 75k/7.5k/7.5k for train/dev/test and, for consistency, do the same for en. They extract probing datasets from an en-pl parallel corpus using COMBO for dependency parsing (Rybak and Wróblewska, 2018). They find that en and pl probing results mostly agree, i.e., encoders store the same linguistic information across the two languages.

3 Approach

Task | Description | Example
Bigram Shift | Whether two words in a sentence are inverted | “This is my Eve Christmas.” → True
Tree Depth | Longest path from root to leaf in constituent tree | “One hand here, one hand there, that’s it” → 5
Length | Number of tokens | “I like cats” → 1-4 words
Subject Number | Whether the subject is singular or plural | “They work together” → Plural
Word Content | Which mid-frequency word a sentence contains | “Everybody should step back” → everybody
Top Constituents | Classification task where classes are given by the 19 most common top-constituent sequences in the corpus | “Did he buy anything from Troy” → VDP_NP_VP
Voice | Whether the sentence contains a passive construct | “He likes cats” → False
SV Agree | Whether subject and verb agree | “They works together” → Disagree
SV Dist | Distance between subject and verb | “The delivery was very late” → 1
Table 2: Probing tasks, their description and illustration. Top tasks are defined as in SentEval.

In the absence of ground truth, our main interest is in a ‘stable’ structural setup for probing task design—with the end goal of applying this design to multilingual probing analyses (keeping their restrictions, e.g., small data sizes, in mind). To this end, we consider a two-dimensional space P comprising probing data size and classifier choice for probing tasks.[2] For a selected set of points p in P, we evaluate all our encoders at p, and determine the ‘outcome’ o(p) (e.g., the ranking of encoders) at p. We consider a setup p as stable if its outcome o(p) is shared by a majority of other settings p' in P. This can be considered a region of agreement, similar to inter-annotator agreement Artstein and Poesio (2008). In other words, we identify ‘ideal’ test conditions by minimizing the influence of the parameters p on the outcome o(p). Below, we approximate these intuitions using correlation.

[2] We also looked at further parameters, e.g., the class (im)balance of training datasets; details and results can be found in the Appendix. Since, however, their influence seemed less severe and an increased search space would blow up computational cost, we limit our investigation to the two described dimensions.

3.1 Embeddings

We consider two types of sentence encoders: non-parametric methods, which combine word embeddings in elementary ways, without training; and parametric methods, which tune parameters on top of word embeddings. As non-parametric methods, we consider: (i) average word embeddings, as a popular baseline; (ii) the concatenation of average, min and max pooling (pmeans) (Rücklé et al., 2018); and (iii) random LSTMs (Conneau et al., 2017; Wieting and Kiela, 2019), which feed word embeddings to randomly initialized LSTMs and then apply a pooling operation across time steps. As parametric methods, we consider: InferSent (Conneau et al., 2017), which induces a sentence representation by learning a semantic entailment relationship between two sentences; QuickThought (Logeswaran and Lee, 2018), a supervised improvement over the popular SkipThought model (Kiros et al., 2015); LASER (Artetxe and Schwenk, 2019), derived from massively multilingual machine translation models; and BERT base (Devlin et al., 2019), where we average the token embeddings of the last layer to obtain a sentence representation. Dimensionalities of the encoders are listed in the Appendix.
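As an illustration, the two simplest non-parametric encoders above can be sketched in a few lines of numpy. This is a minimal sketch, not the authors' implementation; the 4-dimensional toy vectors stand in for pretrained word embeddings, and the simplified pmeans variant concatenates only the three poolings named in the text (average, min, max) rather than general power means:

```python
import numpy as np

def avg_embedding(word_vecs):
    """Average word embeddings: the plain-average baseline encoder."""
    return np.asarray(word_vecs).mean(axis=0)

def pmeans_embedding(word_vecs):
    """Simplified pmeans pooling: concatenation of average, min and
    max pooling over the word vectors of a sentence, giving a
    representation three times the word-embedding dimension."""
    W = np.asarray(word_vecs)
    return np.concatenate([W.mean(axis=0), W.min(axis=0), W.max(axis=0)])

# Toy sentence: 3 tokens with 4-dimensional "embeddings".
sent = np.array([[0.0, 1.0, 2.0, 3.0],
                 [1.0, 0.0, 1.0, 0.0],
                 [2.0, 2.0, 0.0, 1.0]])
```

For a d-dimensional word embedding, the pmeans representation is thus 3d-dimensional, which is why the dimensionalities of the encoders differ (see Appendix).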

3.2 Probing Tasks

Following Conneau et al. (2018), we consider the following probing tasks: BigramShift (en, tr, ru, ka), TreeDepth (en), Length (en, tr, ru, ka), Subject Number (en, tr, ru), WordContent (en, tr, ru, ka), and TopConstituents (en). We choose Length, BigramShift and WordContent because they are unsupervised tasks that require no labeled data and thus can easily be implemented across different languages—they also represent three different types of elementary probing tasks: surface, syntactic and semantic/lexical. We further include Subject Number across all our languages because number marking is extremely common across languages and it is comparatively easy to identify. We adopt Voice (en, tr, ru, ka) from Krasnowska-Kieraś and Wróblewska (2019). For en, we additionally evaluate on TreeDepth and TopConstituents as hard syntactic tasks. We add two tasks not present in the canon of probing tasks listed in SentEval: Subject-Verb-Agreement (SV-Agree) (en, tr, ru, ka) and Subject-Verb-Distance (SV-Dist) (en, tr, ru). We probe representations for these properties because we suspect that agreement between subject and verb is a difficult task which requires inferring a relationship between pairs of words that may stand in a long-distance relationship (Gulordava et al., 2018). Moreover, we assume this task to be particularly hard in morphologically rich and free-word-order languages, so it could be a good predictor for performance in downstream tasks.

To implement the probing tasks for en, we use the probing task datasets defined in Conneau and Kiela (2018), and we apply SpaCy to sentences extracted from Wikipedia for our newly added probing tasks Voice, SV-Dist and SV-Agree. For tr, ru, and ka, we do not rely on dependency parsers, because of quality issues and, for ka, unavailability. Instead, we use Universal Dependencies (UD) (Nivre et al., 2016) and manual rules for sentences extracted from Wikipedia. In particular, for SV-Agree, we create a list of frequently occurring verbs together with their corresponding present tense conjugations for each individual language. We check each individual sentence from Wikipedia for the presence of a verb form in the list. If no such form is present, we exclude the sentence from consideration. Otherwise, we randomly replace the verb form by a different conjugation in 50% of the cases. For SV-Dist, we use the information from UD to determine the dependency distance between the main verb and the subject. Instead of predicting the exact distances, we predict binned classes: [1], [2,4], [5,7], [8,12], [13,∞). This task could not be implemented for ka, due to missing dependency information in the UD. We omit Subject Number for ka for the same reason.
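The rule-based construction of the two new tasks can be sketched as follows. This is an illustrative sketch of the procedure described above, not the released code; in particular, `verb_forms` (mapping each listed conjugation to the other conjugations of the same verb) is a hypothetical stand-in for the per-language verb lists:

```python
import random

def make_sv_agree_instance(tokens, verb_forms, rng):
    """Return (tokens, label) for SV-Agree, or None if the sentence
    contains no listed verb form and is therefore excluded."""
    for i, tok in enumerate(tokens):
        if tok in verb_forms:
            # In 50% of the cases, swap in a different conjugation
            # to create a "disagree" instance.
            if rng.random() < 0.5:
                corrupted = list(tokens)
                corrupted[i] = rng.choice(verb_forms[tok])
                return corrupted, "disagree"
            return list(tokens), "agree"
    return None

# Binned classes for SV-Dist: [1], [2,4], [5,7], [8,12], [13, inf)
SV_DIST_BINS = [(1, 1), (2, 4), (5, 7), (8, 12), (13, float("inf"))]

def bin_sv_distance(dist):
    """Map a subject-verb dependency distance to its bin index."""
    for label, (lo, hi) in enumerate(SV_DIST_BINS):
        if lo <= dist <= hi:
            return label
    raise ValueError("distance must be >= 1")
```

Sentences without any listed verb form are dropped, so the 50/50 corruption keeps the SV-Agree label distribution roughly balanced (cf. Table 6).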

An overview of the probing tasks, along with descriptions and examples, is given in Table 2.

3.3 Downstream Tasks

In addition to probing tasks, we test the embeddings in downstream applications. Our focus is on a diverse set of high-level sentence classification tasks. We choose Argument Mining (AM) (Stab et al., 2018), sentiment analysis, and TREC question answering. For AM and TREC, the required training data for languages other than en has been machine translated using Google Translate.[3] Sentiment analysis uses original datasets with 2 to 3 sentiment classes. Details of the training procedure and the tasks themselves can be found in the Appendix. Statistics are reported in Table 6.

[3] To estimate the quality of the machine translation, we measured its performance on parallel data; details can be found in the Appendix.

4 Experiments

Experimental Setup

To the SentEval toolkit (Conneau and Kiela, 2018), which addresses both probing and downstream tasks and offers Logistic Regression (LR) and MLP classifiers on top of representations, we added implementations of Random Forest (RF) and Naive Bayes (NB) from scikit-learn as other popular but ‘simple’ classifiers. SentEval defines specific model validation techniques for each task. For all probing tasks and TREC, we use the predefined splits. For AM and sentiment analysis, we use 10-fold inner cross-validation. Following SentEval, we tune the size of the hidden layer, dropout, and regularization when training an MLP. For RF, we tune the maximum tree depth. For LR, we tune regularization. We did not tune any hyperparameters for NB.

4.1 Probing task design in en

In our design, we consider (a) four well-known and popular classifiers—LR, MLP, NB, RF—on top of sentence representations, and (b) six different training data sizes (between 2k and 100k). We perform an exhaustive grid search over size and classifier, considering all combinations.


For each classifier, we obtain results (on 10k test instances) when varying the training data size over 2k, 5k, 10k, 20k, 30k, 100k. Downsampling was implemented by selecting, for each class, the same percentage of samples as appears in the full dataset. We then report average Spearman/Pearson correlations ρ/r between any two training set sizes s and s', over all 9 probing tasks:[4]

  sim_c(s, s') = (1/T) Σ_{t=1}^{T} ρ(x_{s,t,c}, x_{s',t,c})     (1)

where T is the number of probing tasks (T = 9 for en), and x_{s,t,c} is the vector that holds the scores of each of the 7 sentence encoders in our experiments, given training size s, for probing task t and classifier c. We set correlations to zero if the p-value exceeds 0.2.[5] In Table 3, we then report the minimum and average scores min_c and avg_c, respectively, per classifier c. We observe that the minimum values are small to moderate correlations between 0.2 (for NB) and 0.6 (for RF). The average correlations are moderate to high, ranging from 0.6 (for NB) to above 0.8 (for the others).

[4] We report both Spearman and Pearson for some of the results but give only Pearson for the remainder, as outcomes are very similar.
[5] We choose a high p-value threshold because we correlate small vectors of size 7.
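The size-similarity computation above can be implemented directly with scipy. A minimal sketch, where `scores[s][t]` is a hypothetical nested mapping from training size and probing task to the vector of per-encoder scores:

```python
import numpy as np
from scipy.stats import spearmanr

def sim(scores, s1, s2, p_threshold=0.2):
    """Average Spearman correlation between training sizes s1 and s2
    over all probing tasks (for one fixed classifier). Correlations
    whose p-value exceeds the threshold are set to zero."""
    rhos = []
    for task in scores[s1]:
        rho, p = spearmanr(scores[s1][task], scores[s2][task])
        rhos.append(rho if p <= p_threshold else 0.0)
    return float(np.mean(rhos))

def stab(scores, s):
    """Stability of a training size s for one classifier: its average
    similarity to all other training sizes (the full score additionally
    averages over classifiers via a normalizer)."""
    others = [s2 for s2 in scores if s2 != s]
    return float(np.mean([sim(scores, s, s2) for s2 in others]))
```

The zero-out rule for insignificant correlations matches the treatment described in the text; the per-classifier averaging is kept implicit here for brevity.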

In Figure 1 (top), we show all values sim_c(s, s') for LR and NB. We observe that LR indeed has high correlations between training sizes, especially starting from 10k training data points. The corresponding correlations of NB are comparatively much lower.

Figure 1: Top: average correlations sim_c(s, s') for LR (left) and NB (right). Bottom: average correlations between classifiers, for LR and RF (left) and LR and NB (right).

In Figure 2, we plot the stability of each training data size s,

  stab(s) = (1/Z) Σ_c Σ_{s' ≠ s} sim_c(s, s')     (2)

over all of our classifiers c and sizes s' ≠ s, where Z is a normalizer. The higher this score for a training size s, the more similar, on average, are its probing results to those of another training size s'. Across all classifiers, 2k and 100k are least stable—100k is the default setting of SentEval. Most stable are 10k and 20k.

       Min ρ   Min r   Avg ρ   Avg r
MLP    .480    .420    .810    .843
LR     .524    .502    .808    .805
RF     .529    .623    .800    .853
NB     .174    .292    .626    .671
Table 3: Stability over training sizes, in terms of minimum and average Spearman (ρ) / Pearson (r) correlation between any two sizes.

Next, we add the classifier choice as a second dimension: we examine whether correlations (Spearman/Pearson) between vectors x_{s,t,c} (holding the scores of each of the 7 sentence encoders for classifier c) and x_{s',t,c'} (holding the same scores for classifier c') are similar, in the same sense as in Eq. (1):

  sim_{c,c'}(s, s') = (1/T) Σ_{t=1}^{T} ρ(x_{s,t,c}, x_{s',t,c'})     (3)

Again, we average across all probing tasks, and set correlation values to zero if the p-value exceeds 0.2. In Table 5, we give min/avg values across dataset sizes in this setup. We observe that LR and MLP agree most strongly. They have acceptable average agreement with RF, but low agreement with NB on average and, in the worst cases, even negative correlations with NB.

In Figure 1 (bottom), we illustrate correlations between classifiers, comparing LR with RF and with NB across all possible training set sizes. We observe that, as the training data sizes become larger, these classifiers agree more strongly with LR: RF has acceptable agreement with LR from 10k training instances onwards, while NB has acceptable agreement with LR only with 100k training instances.

Figure 2: Stability of each training size computed using Eq. (2).

We now operationalize our intuition of a ‘region of stability’, outlined in Table 1. For each of the nine probing tasks t, we compute the following. Let π be a specific ranking of the encoders, i.e., a fixed permutation. Let π_{c,s,t} be the ranking of encoders according to the (classifier, size) combination (c, s) on task t. We compute the Spearman correlation ρ(π_{c,s,t}, π). For each possible ranking π of our 7 encoders, we then determine its support as the average over all values ρ(π_{c,s,t}, π), and find the ranking π*_t with most support according to this definition. Finally, we assign a score to the combination (c, s) not only when π_{c,s,t} equals π*_t, but also when π_{c,s,t} is close to π*_t: we again use the Spearman correlation between π_{c,s,t} and π*_t as a measure of closeness (we require a closeness of at least 0.75). The final score for (c, s) is given by:

  score(c, s) = Σ_{t} ρ(π_{c,s,t}, π*_t) · 1[ρ(π_{c,s,t}, π*_t) ≥ 0.75]     (4)
Table 4 shows the (classifier, size) combinations with highest scores. LR and MLP are at the top, along with RF in the setting with 100k training data. LR with size 10k is most stable overall, but the distance to the other top settings is small. Least stable (not shown) is NB.

classifier   LR    LR    RF     MLP   MLP   MLP
size         10k   20k   100k   20k   30k   10k
score        7.6   7.3   7.2    7.0   7.0   6.9
Table 4: Most stable (classifier, size) combinations according to Eq. (4).
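The support computation described above can be sketched for a single probing task; brute force over all permutations is feasible for 7 encoders (7! = 5040). This is an illustrative reading of the procedure, not the authors' code, and the exact aggregation in the score is our interpretation:

```python
from itertools import permutations
from scipy.stats import spearmanr

def most_supported_ranking(observed):
    """Find the ranking pi* with the highest average Spearman
    correlation to the observed rankings, one observed ranking per
    (classifier, size) setting."""
    n = len(next(iter(observed.values())))
    best, best_support = None, float("-inf")
    for pi in permutations(range(n)):
        support = sum(spearmanr(pi, r)[0] for r in observed.values())
        support /= len(observed)
        if support > best_support:
            best, best_support = pi, support
    return best

def setting_score(rankings_per_task, pi_stars, closeness=0.75):
    """Score of one (classifier, size) setting: sum over tasks of the
    correlation to pi*, counted only when it reaches the closeness
    threshold (our reading of the scoring rule)."""
    total = 0.0
    for task, r in rankings_per_task.items():
        rho = spearmanr(r, pi_stars[task])[0]
        if rho >= closeness:
            total += rho
    return total
```

With 9 probing tasks, the maximum possible score is 9, which is consistent with the values of around 7 reported in Table 4.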

Overall, we answer our first research question (i) as follows: probing task results can be unreliable and may vary greatly with machine learning parameter choices. The standard training set size of SentEval, 100k, appears comparatively unstable. As the region of stability, we postulate especially the setting with 10k training instances and the LR classifier.

       LR          RF          NB
MLP    .481/.790   .492/.632   -.043/.236
LR                 .406/.640   -.057/.197
RF                             .029/.320
Table 5: Min/Avg values of sim across sizes (using Pearson) between classifiers c and c'.

4.2 Multi-lingual results

Experimental Setup

Given our results for en, we choose the LR classifier with roughly 10k instances overall per task. Table 6 provides more details about the datasets. In line with SentEval (and partly supported by our results in the Appendix), we aim for label distributions that are as balanced as possible. Because of the small test sizes, we use inner 5-fold CV for all tasks except SubjNumber, where we use predefined train/dev/test splits as in Conneau et al. (2018) to avoid leaking lexical information from train to test splits.

We obtain average and pmeans embeddings through pooling over pretrained fastText embeddings (Grave et al., 2018). The same embeddings are used for the random LSTM. For average BERT, we use the base-multilingual-cased model. We machine translate the AllNLI corpus into tr, ru and ka to obtain training data for InferSent.[6] The models are then trained using default hyperparameters and pretrained fastText embeddings. Compared to en, we modify the WC probing task in the multilingual setting to predict only 30 mid-frequency words instead of 1000; this is more appropriate for our much smaller data sizes.

[6] Using Google Translate; see the Appendix for details.

Task (measure)                   | en             | tr            | ru            | ka
Bigram Shift                     | 100k (1:1)     | 10k (1:1)     | 10k (1.1:1)   | 10k (1.1:1)
Length                           | 100k (1:1)     | 10k (1:1)     | 12k (1:1)     | 10k (1:1)
Subject Number                   | 100k (1:1)     | 4,093 (5:1)   | 11,877 (1:1)  | -
Word Content                     | 100k (1:1)     | 10k (1.5:1)   | 10k (1.2:1)   | 10k (5:1)
Top Constituents                 | 100k (1:1)     | -             | -             | -
Tree Depth                       | 100k (2.2:1)   | -             | -             | -
Voice                            | 100k (1:1)     | 8,417 (6:1)   | 10k (2:1)     | 10k (1.9:1)
SV Agree                         | 100k (1:1)     | 10k (1:1)     | 10k (1:1)     | 10k (1:1)
SV Dist                          | 100k (1:1)     | 1,642 (1.9:1) | 8,231 (1.1:1) | -
Argumentation Mining (macro-F1)  | 25,303 (3:1)   | 25,303 (3:1)  | 25,303 (3:1)  | 25,303 (3:1)
TREC (Accuracy)                  | 5,952 (14:1)   | 5,952 (14:1)  | 5,952 (14:1)  | 5,952 (14:1)
Sentiment Analysis (macro-F1)    | 14,148 (4.2:1) | 6,172 (1.7:1) | 30k (1:1)     | 11,513 (5.5:1)
Table 6: Probing and downstream tasks per language. Each cell gives the dataset size and, in parentheses, the balance between the class with the most and the class with the fewest samples. For downstream tasks, the evaluation measure is given after the task name.

4.2.1 Probing tasks

Results are shown in Figures 3 and 4.

(ii) Will encoder performances correlate across languages?
Figure 3: Pearson correlations across languages for different encoders.

For each encoder e, we correlate the performances of e between en and the other languages on 5 (for ka) or 7 (for tr, ru) probing tasks (using 10k dataset size and LR for all involved languages, including en). In Figure 3, we observe that correlations between en and the other languages are generally either zero or weakly positive. Only average embeddings have two positive correlation scores across the three language combinations with en. Among the low-resource languages, there are no negative correlations and fewer zero correlations. All of the low-resource languages correlate more among themselves than with en. This makes sense from a linguistic point of view, since en is clearly the outlier in our sample, given its minimal inflection and fixed word order. Thus, the answer to this research question is that our results support the view that transfer is better for typologically similar languages.

(iii) Will probing task performances correlate across languages?

For each probing task t, we report Pearson correlations, between all language pairs, of the vectors holding the scores of the 7 encoders on t. Figure 4 shows the results.

Figure 4: Pearson correlations across languages for different probing tasks.

The pattern is overall similar to that for (ii), in that there are many zero correlations between en and the other languages. tr and ka also have negative correlations with en for selected tasks. Only BigramShift has positive correlations throughout. The low-resource languages correlate better among themselves than with en. Our conclusions are the same as for question (ii).

Note that our findings contrast with Krasnowska-Kieraś and Wróblewska (2019), who report that probing results for en and pl are mostly the same. Our results are more intuitively plausible: e.g., a good encoder should store the linguistic information relevant for a particular language.

Figure 5: Pearson correlation among probing task and downstream performance for all languages.

4.2.2 Downstream Tasks

Results are shown in Figure 5.

(iv) Will the correlation between probing and downstream tasks be the same across languages?

For each of our languages, we correlate probing and downstream tasks. The results show that the answer to research question (iv) is clearly negative. In particular, en behaves differently from the other languages—while, again, ru and tr behave more similarly. ka is the only language with negative correlations for Length, and en the only one with positive scores. For the sentiment task, Word Content correlates positively for all languages except ka. The AM task correlates only in en and ka, but with different probing tasks. SV-Agree correlates positively with TREC and sentiment in all languages but en. Predicting the performance of embeddings in downstream tasks via probing tasks thus appears idiosyncratic to individual languages. In contrast to Şahin et al. (2019), who suggest a direct relation between word-level probing and downstream performance on agglutinative languages, we see little to no systematic correlation on the sentence level. Overall, SV-Agree is the best predictor across languages, with 7 positive correlations out of 12 possible. Interestingly, this task is missing from the current canon of SentEval.

5 Concluding Remarks

We investigated formal aspects of probing task design, including probing data size and classifier choice, in order to determine structural conditions for multilingual (low-resource) probing. We showed that probing task results are at best partly stable even for en and that the ranking of encoders varies with design choices. However, we identified a partial region of stability where results are supported by a majority of settings—even though this must not be mistaken for a region of ‘truth’. This region was identified for en, which has the most resources available. Our further findings then showed that probing and downstream results do not transfer well from English to our other languages, which in turn challenges our identified region of stability.

Overall, our results have partly negative implications for current practices of probing task design, as they may mean that probing tasks are to some degree unreliable as tools for introspecting the linguistic information contained in sentence encoders. Their relation to downstream tasks is also unclear, as our multilingual results show. This is supported by recent contradictory claims regarding, e.g., the importance of the Word Content probing task for downstream performance (Eger et al., 2019; Wang and Kuo, 2020; Perone et al., 2018; Conneau et al., 2018). An important aspect to keep in mind in this context is that results may heavily depend on the selection of encoders involved in the analysis—in our case, we selected a number of recently proposed state-of-the-art models in conjunction with weaker baseline models, for a diverse collection of encoders. Another clear limitation of our approach is the small number of encoders we examined—nonetheless, many of our results are significant (at relatively large p-values).

To the degree that the supervised probing tasks examined here will remain important tools for the interpretation of sentence encoders in the future, our results indicate that multilingual probing is important for a fairer and more comprehensive comparison of encoders with respect to the linguistic information signals that they store.


  • Y. Adi, E. Kermany, Y. Belinkov, O. Lavi, and Y. Goldberg (2017) Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. CoRR abs/1608.04207. External Links: Link Cited by: §2.
  • Ž. Agić and I. Vulić (2019) JW300: a wide-coverage parallel corpus for low-resource languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3204–3210. External Links: Link, Document Cited by: §A.1.
  • M. Artetxe and H. Schwenk (2019) Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. TACL 7, pp. 597–610. External Links: Link Cited by: §3.1.
  • R. Artstein and M. Poesio (2008) Inter-coder agreement for computational linguistics. Comput. Linguist. 34 (4), pp. 555–596. External Links: ISSN 0891-2017, Link, Document Cited by: §3.
  • A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017) Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 670–680. External Links: Link, Document Cited by: §1, §3.1.
  • A. Conneau and D. Kiela (2018) SentEval: an evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. External Links: Link Cited by: §A.1, §3.2, §4.
  • A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni (2018) What you can cram into a single vector: probing sentence embeddings for linguistic properties. CoRR abs/1805.01070. External Links: Link Cited by: §1, §3.2, §4.2, §5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §3.1.
  • S. Eger, A. Rücklé, and I. Gurevych (2019) Pitfalls in the evaluation of sentence embeddings. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), Florence, Italy, pp. 55–60. External Links: Link, Document Cited by: §5.
  • E. Grave, P. Bojanowski, P. Gupta, A. Joulin, and T. Mikolov (2018) Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), Cited by: §4.2.
  • K. Gulordava, P. Bojanowski, E. Grave, T. Linzen, and M. Baroni (2018) Colorless green recurrent networks dream hierarchically. CoRR abs/1803.11138. External Links: Link Cited by: §3.2.
  • J. Hewitt and P. Liang (2019) Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2733–2743. External Links: Link, Document Cited by: §1.
  • R. Kiros, Y. Zhu, R. Salakhutdinov, R. Zemel, A. Torralba, R. Urtasun, and S. Fidler (2015) Skip-thought vectors. CoRR abs/1506.06726. External Links: Link Cited by: §1, §3.1.
  • C. Kirov, R. Cotterell, J. Sylak-Glassman, G. Walther, E. Vylomova, P. Xia, M. Faruqui, S. J. Mielke, A. McCarthy, S. Kübler, D. Yarowsky, J. Eisner, and M. Hulden (2018) UniMorph 2.0: universal morphology. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. External Links: Link Cited by: §2.
  • A. Köhn (2016) Evaluating embeddings using syntax-based classification tasks as a proxy for parser performance. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, Berlin, Germany, pp. 67–71. External Links: Link, Document Cited by: §2.
  • K. Krasnowska-Kieraś and A. Wróblewska (2019) Empirical linguistic study of sentence embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5729–5739. External Links: Link Cited by: §1, §3.2.
  • T. Linzen, G. Chrupała, Y. Belinkov, and D. Hupkes (Eds.) (2019) Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Association for Computational Linguistics, Florence, Italy. External Links: Link Cited by: §1.
  • L. Logeswaran and H. Lee (2018) An efficient framework for learning sentence representations. CoRR abs/1803.02893. External Links: Link Cited by: §3.1.
  • A. D. McCarthy, M. Silfverberg, R. Cotterell, M. Hulden, and D. Yarowsky (2018) Marrying universal dependencies and universal morphology. In Proceedings of the Second Workshop on Universal Dependencies (UDW 2018), Brussels, Belgium, pp. 91–101. External Links: Link, Document Cited by: §2.
  • J. Nivre, M. de Marneffe, F. Ginter, Y. Goldberg, J. Hajič, C. D. Manning, R. McDonald, S. Petrov, S. Pyysalo, N. Silveira, R. Tsarfaty, and D. Zeman (2016) Universal dependencies v1: a multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, pp. 1659–1666. External Links: Link Cited by: §3.2.
  • C. Perone, R. Silveira, and T. Paula (2018) Evaluation of sentence embeddings in downstream and linguistic probing tasks. CoRR abs/1806.06259. External Links: Link Cited by: §1, §2, §5.
  • V. Ravishankar, L. Øvrelid, and E. Velldal (2019) Probing multilingual sentence representations with x-probe. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), Florence, Italy, pp. 156–168. External Links: Link, Document Cited by: §1.
  • N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, External Links: Link Cited by: §2.
  • A. Rücklé, S. Eger, M. Peyrard, and I. Gurevych (2018) Concatenated power mean word embeddings as universal cross-lingual sentence representations. CoRR abs/1803.01400. External Links: Link Cited by: §3.1.
  • P. Rybak and A. Wróblewska (2018) Semi-supervised neural system for tagging, parsing and lematization. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Brussels, Belgium, pp. 45–54. External Links: Link, Document Cited by: §2.
  • C. Stab, T. Miller, B. Schiller, P. Rai, and I. Gurevych (2018) Cross-topic argument mining from heterogeneous sources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3664–3674. External Links: Link, Document Cited by: §A.1, §3.3.
  • M. Straka (2018) UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Brussels, Belgium, pp. 197–207. External Links: Link, Document Cited by: footnote 1.
  • S. Subramanian, A. Trischler, Y. Bengio, and C. J. Pal (2018) Learning general purpose distributed sentence representations via large scale multi-task learning. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • I. Tenney, P. Xia, B. Chen, A. Wang, A. Poliak, R. T. McCoy, N. Kim, B. V. Durme, S. Bowman, D. Das, and E. Pavlick (2019) What do you learn from context? probing for sentence structure in contextualized word representations. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • J. Tiedemann (2012) Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis (Eds.), Istanbul, Turkey. External Links: ISBN 978-2-9517408-7-7 Cited by: §A.1.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, pp. 353–355. External Links: Link, Document Cited by: §2.
  • B. Wang and C.-C. J. Kuo (2020) SBERT-WK: a sentence embedding method by dissecting BERT-based word models. External Links: 2002.06652 Cited by: §5.
  • J. Wieting and D. Kiela (2019) No training required: exploring random encoders for sentence classification. CoRR abs/1901.10444. External Links: Link Cited by: §3.1.
  • W. Zhao, M. Peyrard, F. Liu, Y. Gao, C. M. Meyer, and S. Eger (2019) MoverScore: text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 563–578. External Links: Link, Document Cited by: §A.1.

Appendix A Appendix

a.1 Downstream Tasks

Details of our three downstream applications are described below.

Argument Mining (AM)

AM is an emerging NLP task requiring sophisticated reasoning capabilities. We reuse the sentence-level argument (stance) detection dataset of Stab et al. (2018), which labels sentences extracted from web pages as pro-, con-, or non-arguments with respect to eight different topics. A sentence only qualifies as a pro or con argument when it both expresses a stance towards the topic and gives a reason for that stance. The classifier input is the concatenation of the sentence embedding and the topic encoding. In total, there are about 25,000 sentences.
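The construction of the classifier input can be sketched as follows; note that representing the topic as a one-hot vector over the eight topics is an assumption for illustration, as the paper only speaks of a "topic encoding", and `build_am_features` is a hypothetical helper name:

```python
import numpy as np

def build_am_features(sent_emb, topic_emb):
    """Concatenate a sentence embedding with its topic encoding,
    yielding the input vector for the argument-detection classifier."""
    return np.concatenate([sent_emb, topic_emb])

# Toy example: a 300-dim sentence embedding (e.g. averaged word
# vectors) and a one-hot encoding over the 8 topics in the dataset.
sent_emb = np.random.rand(300)
topic_emb = np.zeros(8)
topic_emb[2] = 1.0  # mark the third topic

features = build_am_features(sent_emb, topic_emb)
assert features.shape == (308,)  # 300 + 8 dimensions
```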

Sentiment Analysis

As opposed to AM, sentiment analysis only determines the opinion flavor of a statement. Since sentiment analysis is a well-established NLP task, we did not machine translate en training data, but used original data for en, ru and tr, and created a novel dataset for ka. For en, we use the US Airline Twitter Sentiment dataset, consisting of 14,148 tweets labeled with three sentiment classes. For tr, we took the Turkish Twitter Sentiment Dataset with 6,172 examples and three classes. For ru, we used the Russian Twitter Corpus (RuTweetCorp), which we reduced to 30,000 examples in two classes. For ka, we followed the approach of Choudhary et al. (2018) and crawled sentiment-flavored tweets in a distant supervision manner: emojis were used as distant signals to indicate sentiment on preselected tweets from the Twitter API. After post-processing, we were able to collect 11,513 Georgian tweets in three sentiment classes. The dataset will be made available publicly, including more details on the creation process.
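The distant supervision step can be sketched as below. The emoji sets are hypothetical placeholders (the actual signal lists for the Georgian data are not given here), and tweets with mixed or no signal are dropped; how the third (neutral) class was obtained in post-processing is likewise not shown:

```python
# Hypothetical emoji signal sets; illustrative only.
POSITIVE = {"😀", "😊", "❤️"}
NEGATIVE = {"😠", "😢", "👎"}

def distant_label(tweet):
    """Assign a sentiment label from emoji signals, or None if the
    tweet carries mixed or no signal and should be discarded."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos and not has_neg:
        return "positive"
    if has_neg and not has_pos:
        return "negative"
    return None  # ambiguous or unsignaled: filter out

assert distant_label("gamarjoba 😊") == "positive"
assert distant_label("😠 terrible service") == "negative"
assert distant_label("plain text") is None
```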

TREC Question Type Detection

Question type detection is an important part of Question-Answering systems. The Text Retrieval Conference (TREC) dataset consists of a set of questions labeled with their respective question types (six labels including e.g. “description” or “location”) and is part of the SentEval benchmark Conneau and Kiela (2018). We used the data as provided in SentEval, yielding 5,952 instances.

Quality of Machine Translated Data

We automatically translated the input data for the AM and TREC downstream tasks. To estimate the quality of the machine-translated data, we measured the performance of the translation service on the JW300 corpus (Agić and Vulić, 2019; Tiedemann, 2012). For each of the language pairs en-ka, en-tr, and en-ru, we translated the first 10,000 sentences of the respective bitext files from JW300 (skipping misaligned sentences) and measured translation quality in terms of BLEU, METEOR and MoverScore (Zhao et al., 2019). Results are summarized in Table 7. They show that, with the exception of en-ka, all language pairs have high-quality translations. We thus expect the influence of machine translation errors to be minimal for tr and ru. For ka, this is not necessarily the case.

BLEU (1-gram) METEOR MoverScore
en-ka 0.271 0.149 0.272
en-ru 0.470 0.335 0.353
en-tr 0.493 0.359 0.398
Table 7: Quality of the machine translation service used to translate downstream-task training data, measured on reference datasets.
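As a rough illustration of the 1-gram BLEU column in Table 7, the metric reduces to clipped unigram precision times the brevity penalty. The sketch below is a simplified, whitespace-tokenized version; real toolkits additionally apply proper tokenization and smoothing, and the paper does not state which implementation was used:

```python
from collections import Counter
import math

def bleu1(hypotheses, references):
    """Corpus-level 1-gram BLEU: clipped unigram precision
    multiplied by the brevity penalty (simplified sketch)."""
    clipped, total, hyp_len, ref_len = 0, 0, 0, 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        r_counts = Counter(r)
        # Clip each hypothesis unigram count by its reference count.
        clipped += sum(min(c, r_counts[w]) for w, c in Counter(h).items())
        total += len(h)
        hyp_len += len(h)
        ref_len += len(r)
    precision = clipped / total
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * precision

assert bleu1(["the cat sat"], ["the cat sat"]) == 1.0
```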

a.2 Sentence Encoder Dimensions

Table 8 shows the full list of encoders used in our study and dimensionalities.

Encoder Size
Avg 300
pmeans (AvgMaxMin) 900
Random LSTM 4096
InferSent 4096
QuickThought 2400
LASER 1024
BERT 768
Table 8: Encoders and their dimensionalities.
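The first two rows of Table 8 can be illustrated directly: averaging 300-dim word vectors yields a 300-dim sentence vector, while concatenating the mean, max and min power means (the AvgMaxMin variant of Rücklé et al., 2018) yields 3 × 300 = 900 dimensions. The function names below are illustrative, not from the paper's code:

```python
import numpy as np

def avg_encoder(word_vectors):
    """Avg encoder: mean over the token axis, 300-dim output."""
    return word_vectors.mean(axis=0)

def pmeans_encoder(word_vectors):
    """pmeans (AvgMaxMin): concatenated mean, max and min power
    means, matching the 900-dim entry in Table 8."""
    return np.concatenate([
        word_vectors.mean(axis=0),
        word_vectors.max(axis=0),
        word_vectors.min(axis=0),
    ])

words = np.random.rand(5, 300)  # 5 tokens, 300-dim fastText-style vectors
assert avg_encoder(words).shape == (300,)
assert pmeans_encoder(words).shape == (900,)
```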

a.3 Class Imbalance

In addition to the classifier type and size, we also tested the influence of the class (im)balance of the training data. In particular, for the four binary probing tasks BigramShift, SubjNumber, SV-Agree, and Voice, we examined the effect of imbalanced class ratios of 1:5 and 1:10. We used LR with 10k, 20k, and 30k training instances and correlated the results for the imbalanced datasets with those for the standard balanced datasets. We find that (i) for two tasks (BigramShift, SV-Agree) there is typically high correlation (0.6-0.8) between the balanced and imbalanced settings, while for the other two tasks the correlation is typically zero; (ii) correlation to the 1:1 setting (slightly) diminishes as the class imbalance increases from 1:5 to 1:10. Thus, the 1:5 and 1:10 scenarios do not strongly correlate with the 1:1 setting used in all our other experiments. As a consequence, in the multilingual setup, we paid attention to keeping the datasets as uniform as possible.
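One plausible way to construct the 1:5 and 1:10 settings is to subsample the positive class of a balanced binary probing dataset; the exact procedure is not specified in the paper, so the following is only a sketch under that assumption:

```python
import random

def imbalance(examples, labels, ratio):
    """Subsample the positive class of a binary dataset to reach a
    negative:positive ratio of ratio:1 (hypothetical procedure for
    the 1:5 and 1:10 settings)."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep_pos = random.sample(pos, max(1, len(neg) // ratio))
    idx = sorted(neg + keep_pos)
    return [examples[i] for i in idx], [labels[i] for i in idx]

random.seed(0)
xs = list(range(200))
ys = [1] * 100 + [0] * 100          # balanced binary probing data
_, ys5 = imbalance(xs, ys, 5)       # 1:5 setting
assert ys5.count(0) == 100 and ys5.count(1) == 20
```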