Cross-Lingual BERT Contextual Embedding Space Mapping with Isotropic and Isometric Conditions

07/19/2021 ∙ by Haoran Xu, et al. ∙ Johns Hopkins University

Typically, a linearly orthogonal transformation mapping is learned by aligning static type-level embeddings to build a shared semantic space. In view of the analysis that contextual embeddings contain richer semantic features, we investigate a context-aware and dictionary-free mapping approach by leveraging parallel corpora. We illustrate that our contextual embedding space mapping significantly outperforms previous multilingual word embedding methods on the bilingual dictionary induction (BDI) task by providing a higher degree of isomorphism. To improve the quality of mapping, we also explore sense-level embeddings that are split from type-level representations, which can align spaces in a finer resolution and yield more precise mapping. Moreover, we reveal that contextual embedding spaces suffer from their natural properties – anisotropy and anisometry. To mitigate these two problems, we introduce the iterative normalization algorithm as an imperative preprocessing step. Our findings unfold the tight relationship between isotropy, isometry, and isomorphism in normalized contextual embedding spaces.


1 Introduction

Mikolov et al. (2013b) first notice that word vectors pretrained on monolingual data have similar topological structures across languages, which allows word embedding spaces to be aligned by a simple linear mapping. Orthogonal mapping (Xing et al., 2015) was subsequently shown to be an effective improvement for space alignment. With the development of multilingual tasks, cross-lingual word embeddings (CLWE) have recently attracted considerable attention. CLWE facilitate model transfer between languages by providing a shared embedding space in which vector representations of words with similar meanings from different languages are spatially close. Previous methods can be broadly classified into two categories: optimizing a linear transformation to map pretrained word embedding vectors (Mikolov et al., 2013b; Xing et al., 2015; Artetxe et al., 2016; Zhang et al., 2019), and jointly learning word representations for multiple languages (Luong et al., 2015; Gouws et al., 2015). Some recent studies even obtain the transformation by unsupervised learning (Miceli Barone, 2016; Zhang et al., 2017; Conneau et al., 2018).

In this paper, we focus on supervised linear mapping methods. Linear mappings with orthogonal constraints rest on the assumption that monolingual word embedding graphs are (approximately) isomorphic across different languages (Søgaard et al., 2018). However, this assumption is also a significant limitation, because the different structural properties (e.g., morphology, syntax) across languages make it difficult for static word embeddings to meet this hypothesis.

Instead of utilizing static word embeddings (Mikolov et al., 2013a; Pennington et al., 2014; Bojanowski et al., 2017), Schuster et al. (2019) leverage word embeddings extracted from ELMo (Peters et al., 2018) to align the embedding spaces using gold dictionaries, and show advantages on existing transfer tasks.

Since high-quality, freely available, wide-coverage manually curated dictionaries are still rare (Ruder et al., 2019), we investigate an alignment approach that builds silver token translation pairs from parallel corpora rather than leveraging gold dictionaries. We first obtain aligned contextual type-level embeddings on the source and target sides simultaneously by averaging the vectors of all occurrences of the silver-aligned word pairs. Furthermore, we adaptively split a type-level representation into several sense-level representations, where each sense vector in the semantic space represents one of the meanings of the word. A visualization of how these fine-grained sense-level vectors are assigned as anchor vectors to assist in aligning embedding spaces is given in Figure 1.

(a) Type-level mapping: English words are accurately mapped to their German translation words.
(b) Sense-level mapping: ‘bank’ is split into two sense embeddings, one for the meaning of financial establishment and one for the meaning of shore, which are respectively mapped to German ‘bank’ and ‘ufer’. A similar discussion also holds for the word ‘hard’, whose two sense vectors are mapped to German ‘schwer’ (difficult) and ‘hart’ (solid).
Figure 1: Illustration of contextual cross-lingual mapping of English and German, where word and sense vectors are visualized by t-SNE (Maaten and Hinton, 2008).

We also explore the properties of contextual embeddings. Compared with static type-level embeddings, experimental results show that the better isomorphism of contextual embeddings is the main reason for the superior performance of our context-aware mapping on the BDI task. Moreover, sense-level embeddings are demonstrated to have a closer isomorphic structure than type-level embeddings. Interestingly, we also discover that contextual embeddings suffer from the problems of anisotropy (Ethayarajh, 2019) and anisometry. Anisotropy is an inherent problem of embedding vectors, where the directions of vectors in the semantic space are not uniformly distributed. Vectors from different languages possess various degrees of anisotropy, which deteriorates the performance of the mapping. Anisometry is also a source of misalignment, because orthogonal mapping is a distance-preserving projection, which implicitly adds an additional restriction to the aforementioned assumption of isomorphism, namely isometric isomorphism. However, relative distances across languages are usually different (anisometric). To tackle these problems, we introduce the iterative normalization method (Zhang et al., 2019) and show the importance of isotropy and isometry, which effectively improve the quality of mapping.

2 Background

2.1 Supervised Bilingual Space Alignment

Mikolov et al. (2013b) take the lead in exploiting the topological similarities of monolingual embedding spaces to learn a linear mapping. Although more complicated models, such as multi-layer neural networks, have been tried for space alignment, no improvements were observed. To bridge the source and target language spaces, a gold dictionary has to be used. Let $\{(x_i, y_i)\}_{i=1}^{n}$ denote the word embedding vectors corresponding to the translation pairs in the dictionary. A linear matrix $W$ is learned by minimizing the Frobenius norm:

$$W^{\ast} = \operatorname*{arg\,min}_{W \in \mathbb{R}^{d \times d}} \lVert WX - Y \rVert_{F} \quad (1)$$

where $X$ and $Y$ are embedding matrices composed of the embedding vectors of word pairs in the dictionary, and $d$ is the dimension of the embedding vectors. Since the objective function is convex, $W$ can be solved by gradient-based methods. Alternatively, Xing et al. (2015) show better results of bilingual space alignment by enforcing an orthogonality constraint on the linear mapping, so the optimization boils down to the Procrustes problem (Schönemann, 1966), which offers a closed-form solution:

$$W^{\ast} = \operatorname*{arg\,min}_{W \in \mathcal{O}_{d}} \lVert WX - Y \rVert_{F} = UV^{\top}, \quad \text{with } U\Sigma V^{\top} = \mathrm{SVD}(YX^{\top}) \quad (2)$$

where $\mathcal{O}_{d}$ is the set of orthogonal matrices of size $d \times d$.
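To make Equation 2 concrete, the following is a minimal NumPy sketch of the Procrustes solution; the function name, the column-wise matrix layout, and the toy data are illustrative assumptions rather than the paper's actual code.

```python
import numpy as np

def procrustes_mapping(X, Y):
    """Closed-form solution of Eq. 2: the orthogonal W minimizing ||WX - Y||_F.

    X, Y: d x n matrices whose columns are aligned source/target embeddings."""
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt  # W is orthogonal: W @ W.T is (numerically) the identity

# toy usage with random vectors (shapes only for illustration)
d, n = 768, 5000
X = np.random.randn(d, n)
Y = np.random.randn(d, n)
W = procrustes_mapping(X, Y)
mapped = W @ X  # source embeddings projected into the target space
```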

2.2 Hubness Problem

The word retrieval task is a standard way to evaluate the quality of embedding space mapping. Still, it is known to suffer from the hubness problem (Radovanović et al., 2010), where a few vectors (hubs) in high-dimensional spaces are the nearest neighbors of many other vectors. This phenomenon undermines the performance of retrieval methods built on nearest-neighbor rules. In lieu of matching pairs by finding the nearest neighbor vector, Conneau et al. (2018) propose the cross-domain similarity local scaling (CSLS) criterion, which alleviates the hubness problem by penalizing the similarity scores of hubs.
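The sketch below shows one way CSLS retrieval scores might be computed with NumPy, assuming row-wise unit-normalized source and target matrices and k = 10 neighbors (the value used later in Section 4.2); the function name and array shapes are assumptions.

```python
import numpy as np

def csls_scores(src, tgt, k=10):
    """CSLS similarity matrix between unit-normalized embeddings.

    src: (n_s, d) mapped source vectors; tgt: (n_t, d) target vectors."""
    sims = src @ tgt.T                                  # cosine similarities
    # mean similarity of each source vector to its k nearest target neighbors
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)  # shape (n_s,)
    # mean similarity of each target vector to its k nearest source neighbors
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)  # shape (n_t,)
    return 2 * sims - r_src[:, None] - r_tgt[None, :]

# e.g. predicted translation indices:
# pred = csls_scores(src_mapped, tgt).argmax(axis=1)
```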

2.3 Quantifying Isomorphism

The degree of isomorphism reflects how topologically similar the structures of two vector spaces are. A higher degree of isomorphism between two vector spaces generally means that they are easier to map with an orthogonal matrix. Based on the observation that, if two semantic language spaces possess a high degree of isomorphism, the similarity distributions of words with the same meaning within each language should be similar, Vulić et al. (2020) propose the relational similarity (RS) metric. Formally, translation pairs are first extracted from the bilingual embeddings. Then we list the similarities of all word pairs $(x_i, x_j)$, where $x_i$ and $x_j$ are the $i$-th and $j$-th words on the source side, and analogously obtain a similar list on the target side. Finally, the Pearson correlation coefficient of the two lists is calculated to evaluate the degree of isomorphism. The correlation coefficient increases with a higher degree of isomorphism. It is worth mentioning that RS equals 1 if two embedding spaces are isomorphic.
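A minimal sketch of the RS computation, assuming n translation pairs stored as aligned rows of two unit-normalized matrices; the helper name is hypothetical.

```python
import numpy as np
from scipy.stats import pearsonr

def relational_similarity(X, Y):
    """RS metric: Pearson correlation between intra-language similarity lists.

    X, Y: (n, d) unit-normalized embeddings of n translation pairs
    (row i of X translates to row i of Y)."""
    n = X.shape[0]
    iu = np.triu_indices(n, k=1)          # indices of all pairs i < j
    src_sims = (X @ X.T)[iu]              # similarities on the source side
    tgt_sims = (Y @ Y.T)[iu]              # similarities on the target side
    rho, _ = pearsonr(src_sims, tgt_sims)
    return rho                            # closer to 1 = more isomorphic
```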

2.4 Contextual Word Embeddings

The remarkable progress on NLP tasks using pretrained monolingual models (Peters et al., 2018; Devlin et al., 2019; Yang et al., 2019) illustrates the crucial role of contextual representations. One straightforward way to obtain contextual type-level embeddings is to average the vectors of words from a monolingual corpus fed into the pretrained language model. An offline transformation matrix can be learned from these contextual vectors and has been successfully applied to a zero-shot cross-lingual dependency parsing task (Schuster et al., 2019). However, gold dictionaries are still a necessary prerequisite to align the spaces, and words in different contexts cannot always be accurately translated by static dictionaries. Importantly, type-level representations of multi-sense words are biased because they tend to express their majority meaning, potentially harming the space alignment (more details in Section 3.3).

The predominant performance of pretrained models has also drawn attention to the hierarchy of linguistic information they encode; an in-depth study of BERT (Devlin et al., 2019) is conducted by Jawahar et al. (2019). They reveal that surface features, syntactic features, and semantic features of words lie respectively in the bottom, middle, and top layers of BERT. In this paper, the representations of words used for aligning the bilingual spaces are the normalized mean vectors of the topmost four layers of pretrained BERTs.
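As a rough illustration of this extraction step with the HuggingFace transformers library, the sketch below averages the topmost four hidden layers of a BERT model and normalizes the result; the checkpoint name and helper function are placeholders, not the exact models used in the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")   # example checkpoint
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)
model.eval()

@torch.no_grad()
def token_vectors(sentence):
    """One contextual vector per wordpiece: the unit-normalized mean of the
    topmost four hidden layers."""
    enc = tokenizer(sentence, return_tensors="pt")
    hidden = model(**enc).hidden_states            # tuple: embeddings + 12 layers
    top4 = torch.stack(hidden[-4:]).mean(dim=0)    # (1, seq_len, 768)
    vecs = top4.squeeze(0)
    return vecs / vecs.norm(dim=-1, keepdim=True)  # normalize to unit length
```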

3 Approach

We next describe the context-aware embedding mapping approaches and properties of embeddings.

3.1 Preprocessing

Fast Align (Dyer et al., 2013) is a log-linear reparameterization of IBM Model 2 (Brown et al., 1993) and an effective unsupervised bidirectional token alignment algorithm. Instead of using a dictionary to derive translation pairs, we apply Fast Align to parallel corpora to obtain silver aligned token pairs (see the sketch below). This offers three advantages for mapping over using a dictionary: 1) parallel corpora provide a more comprehensive range of translation pairs than a dictionary; 2) embeddings of translated token pairs contain the same contextual information; 3) tokens are aligned within each parallel sentence, so their embeddings are already aligned as well, and mappings can be created directly from these aligned embeddings, skipping the step of looking up word alignments in a dictionary.
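A small sketch of how silver one-to-one token links could be read from Fast Align's Pharaoh-format output (pairs like "0-0 1-2"); the filtering of non one-to-one links follows the setting described later in Section 4.1, and the helper name is hypothetical.

```python
from collections import Counter

def one_to_one_links(align_line):
    """Parse one line of Fast Align output and keep only one-to-one links,
    discarding one-to-many, many-to-one, and many-to-many cases."""
    links = [tuple(map(int, pair.split("-"))) for pair in align_line.split()]
    src_counts = Counter(s for s, _ in links)
    tgt_counts = Counter(t for _, t in links)
    return [(s, t) for s, t in links
            if src_counts[s] == 1 and tgt_counts[t] == 1]

# e.g. one_to_one_links("0-0 1-1 1-2 3-3") -> [(0, 0), (3, 3)]
```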

3.2 Contextual Type-Level Embeddings Alignment

A tokenized parallel corpus is fed into pretrained BERTs (Wolf et al., 2019; Safaya et al., 2020) of the source and target languages. Since every occurrence of a token possesses a contextual word embedding, and a type commonly appears multiple times in the corpus, we obtain a collection of contextual vectors for each type. On the source side, the type-level representation of a type is defined as the mean vector of all vectors in its collection. Because Fast Align links token pairs in every parallel sentence, and hence their embeddings, the mean vector of the linked target vectors can be derived simultaneously. We then build two column-wise aligned embedding matrices $X, Y \in \mathbb{R}^{d \times n}$, where $d$ is the embedding dimension and $n$ is the vocabulary size of the source language, and derive the optimal orthogonal transformation via the closed-form solution in Equation 2.
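A schematic sketch of how the column-wise aligned type-level matrices might be accumulated from silver-aligned token pairs before applying Equation 2; the data layout and function name are assumptions.

```python
import numpy as np
from collections import defaultdict

def build_type_level_matrices(aligned_examples, dim=768):
    """Build column-wise aligned type-level matrices X (source) and Y (target).

    aligned_examples yields (src_type, src_vec, tgt_vec) triples: one entry per
    silver-aligned token pair, with contextual vectors from the two BERTs."""
    sums = defaultdict(lambda: [np.zeros(dim), np.zeros(dim), 0])
    for src_type, src_vec, tgt_vec in aligned_examples:
        entry = sums[src_type]
        entry[0] += src_vec   # running sum of source occurrences
        entry[1] += tgt_vec   # running sum of linked target occurrences
        entry[2] += 1         # occurrence count
    X = np.stack([s / c for s, _, c in sums.values()], axis=1)  # (dim, |V_s|)
    Y = np.stack([t / c for _, t, c in sums.values()], axis=1)
    return X, Y  # e.g. feed into the Procrustes sketch above
```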

3.3 Contextual Sense-Level Embeddings Alignment

Unlike type-level alignment, we align the spaces at a finer resolution by leveraging sense vectors, where each meaning of a word has its own representation.

Well Separated Contextual Vectors:

Intuitively, contextual vectors of a word with different meanings are expected to be distributed in distinct regions of the semantic space, while vectors that represent the same meaning should be tightly clustered together. A visualized example of the contextual vectors of the word ‘bank’ is given in Figure 2. In this case, the two clusters corresponding to the different meanings of ‘bank’ are spatially opposite to each other. This observation supports anchor-driven mapping, which is realized by aligning the centers of clusters, i.e., the sense-level representations.

Sense-Level Representation:

We cluster the embedding vectors on the source side with the k-means algorithm and define the mean vector of a cluster as a sense-level embedding vector. The algorithm for finding the optimal number of clusters k follows an elbow-based approach (Satopaa et al., 2011), where the ‘knee’ point at which the optimal k is located can be adaptively detected. (We find that adaptively detecting k performs better than setting k to a constant number. We also tried the Gap statistic algorithm (Tibshirani et al., 2001); however, it is more time-consuming, and our preliminary experiments show that its performance is weaker than the knee detection approach.) Importantly, this approach refuses to cluster a vector collection that is judged not to be well separated, which ensures that mono-sense words are not over-clustered.
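The sketch below illustrates sense splitting with scikit-learn's k-means; a simple relative-drop elbow rule (threshold 0.2, chosen arbitrarily here) stands in for the knee-detection method of Satopaa et al. (2011), so treat it as an approximation of the paper's procedure rather than its exact criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

def sense_vectors(contextual_vecs, max_k=5):
    """Split a type's contextual vectors into unit-normalized sense vectors."""
    X = np.asarray(contextual_vecs)
    max_k = min(max_k, len(X))
    inertias, models = [], []
    for k in range(1, max_k + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        inertias.append(km.inertia_)
        models.append(km)
    # elbow heuristic: stop once the relative inertia drop becomes small
    best_k = 1
    for k in range(1, max_k):
        if (inertias[k - 1] - inertias[k]) / inertias[k - 1] < 0.2:
            break
        best_k = k + 1
    centers = models[best_k - 1].cluster_centers_
    return centers / np.linalg.norm(centers, axis=1, keepdims=True)
```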

Space Alignment:

For each type $t$, a list of sense-level representations is derived after clustering its contextual vectors, where $k_t$ is the number of clusters. Similar to the alignment in Section 3.2, each source embedding is linked to a target embedding, so a sense-level embedding and the corresponding target embedding can be obtained at the same time for each cluster $i \in \{1, \dots, k_t\}$. Two sense-level aligned matrices $X_s$ and $Y_s$ are generated by concatenating the sense-level vectors of all types in the source vocabulary. Finally, we derive an optimal orthogonal transformation by Equation 2.

Intuition behind Sense-Level Embeddings — Solving Representation Bias Problem:

Contextual type-level embeddings exhibit one apparent phenomenon: they are inclined to embody their majority meanings, i.e., the most frequent meanings appearing in the corpus. In other words, the type-level embedding vectors of multi-sense words tend to be closer to the vectors representing their primary meaning in the semantic space. This brings up a drawback that multi-sense word vectors cannot accurately represent any of their senses. The representation bias of multi-sense words potentially degrades the quality of the embedding vectors and deteriorates the accuracy of the cross-lingual mapping. To mitigate the representation bias problem, we investigate the sense-level representation.

Figure 2: The distribution of contextual embedding vectors of the word ‘bank’ under 2-dimensional PCA, where the red points on the left represent the financial meaning in their contexts, while the blue points on the right represent the ‘shore’ meaning. The centers of the clusters are shown as stars.
(a) Before iterative normalization: monolingual vectors only gather in a small area.
(b) After iterative normalization: monolingual vectors are uniformly distributed.
Figure 3: The distribution of normalized contextual word embeddings on the surface of a sphere, where blue stars are English word vectors and red points are German word vectors. 3-d vectors are derived by PCA dimension reduction.

3.4 Properties of Embedding Spaces

Here we introduce two important concepts: isotropy and isometry, and reveal their influence on improving the degree of isomorphism for contextual spaces, which offers higher-quality mapping. Our findings show that isomorphism is positively correlated with isometry, and isometric spaces can be built by enforcing spaces to be isotropic.

Isotropy:

An embedding space is isotropic if the directions of its embedding vectors are uniformly distributed. Unfortunately, contextual word representations are usually anisotropic. Geometrically speaking, normalized word embedding vectors are more likely to gather on a narrow conical surface of a hypersphere than to be uniformly distributed in all directions. Commonly, different language vector spaces have different degrees of anisotropy. Figure 3(a) illustrates the different sizes of the areas that contextual English and German vectors occupy, where the vectors are obtained from a parallel corpus. We use a simple metric to evaluate the degree of isotropy:

$$I = \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{j>i} \cos(v_i, v_j) \quad (3)$$

where $v_1, \dots, v_N$ are normalized vectors randomly selected from the space, $N$ is the number of randomly selected vectors, and, most importantly, $I$ represents the average cosine similarity between all selected normalized vectors. The closer the average similarity is to 0, the more isotropic the embedding space is. An isotropic space should hold the property that $I$ equals 0.
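A minimal sketch of Equation 3, assuming the embeddings are already unit-normalized; the sample size and function name are illustrative.

```python
import numpy as np

def isotropy_score(vectors, n_samples=1000, seed=0):
    """Average pairwise cosine similarity of randomly selected unit vectors (Eq. 3).

    vectors: (n, d) array of unit-normalized embeddings.
    A score near 0 indicates an (approximately) isotropic space."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(vectors), size=min(n_samples, len(vectors)), replace=False)
    V = vectors[idx]
    sims = V @ V.T
    iu = np.triu_indices(len(V), k=1)   # all pairs i < j
    return sims[iu].mean()
```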

Isometry:

We define two spaces to be isometric if the relative Euclidean distances among vectors are identical across the spaces. Orthogonal mapping is a distance-preserving, isometrically isomorphic transformation, so the Euclidean distance between two vectors does not change after mapping. Thus, two semantic language spaces are easier to align if their relative vector distances are similar. We measure the degree of isometry between two embedding spaces by calculating the average absolute difference of relative distances:

$$D = \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{j>i} \big|\, \lVert x_i - x_j \rVert_2 - \lVert y_i - y_j \rVert_2 \,\big| \quad (4)$$

where $x_i$ are vectors in the source space and $y_i$ are the translation vectors in the target space. The lower $D$ is, the more isometric the bilingual spaces are; two spaces are isometric when $D$ is 0. Since vectors in the embedding spaces are normalized, Equation 4 boils down to:

$$D = \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{j>i} \big|\, \sqrt{2 - 2\cos(x_i, x_j)} - \sqrt{2 - 2\cos(y_i, y_j)} \,\big| \quad (5)$$
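A sketch of Equations 4-5, using the identity ||x - y|| = sqrt(2 - 2*cos(x, y)) for unit vectors; the inputs are assumed to be row-aligned translation pairs, and the sample size is illustrative.

```python
import numpy as np

def isometry_score(X, Y, n_samples=1000, seed=0):
    """Average absolute difference of relative distances (Eq. 4/5).

    X, Y: (n, d) unit-normalized embeddings of n translation pairs.
    A score near 0 means the two spaces are (approximately) isometric."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_samples, len(X)), replace=False)
    Xs, Ys = X[idx], Y[idx]
    iu = np.triu_indices(len(idx), k=1)
    # for unit vectors, Euclidean distance follows from cosine similarity
    d_src = np.sqrt(np.clip(2 - 2 * (Xs @ Xs.T), 0, None))[iu]
    d_tgt = np.sqrt(np.clip(2 - 2 * (Ys @ Ys.T), 0, None))[iu]
    return np.abs(d_src - d_tgt).mean()
```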

Iterative normalization:

Anisometry undermines the quality of mapping due to inconsistent relative distances of embeddings in the source and target embedding spaces. Therefore, we look for a way to reduce the degree of anisometry (i.e., increase the degree of isometry) to mitigate its negative influence. Following Equations 3 and 5, two semantic spaces are near-isometric when they have a similar degree of isotropy. However, it is unrealistic to control two spaces to have exactly the same degree of (an)isotropy; instead, we introduce a preprocessing method, namely iterative normalization (IN). We apply it to transform anisotropic contextual embedding spaces into (approximately) isotropic ones ($I \approx 0$) by forcibly distributing the vectors uniformly on the surface of the unit hypersphere (Figure 3(b)). This greatly improves the degree of isometry, i.e., relative embedding distances across language spaces become more similar. The preprocessing method iteratively enforces the vectors to be zero-mean and unit-length. Let $x_i^{(0)}$ denote the initial embedding of word (sense) $i$. In iteration $t$, every embedding vector is first normalized to unit length:

$$x_i^{(t)} \leftarrow \frac{x_i^{(t-1)}}{\lVert x_i^{(t-1)} \rVert_2} \quad (6)$$

and then made zero-mean:

$$x_i^{(t)} \leftarrow x_i^{(t)} - \frac{1}{n} \sum_{j=1}^{n} x_j^{(t)} \quad (7)$$

where $n$ is the number of words (senses). We repeat the two steps above until convergence.
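A compact sketch of iterative normalization as described above (Zhang et al., 2019): alternate unit-length normalization and mean centering for a fixed number of iterations; the in-place NumPy layout is an assumption of convenience.

```python
import numpy as np

def iterative_normalization(E, n_iters=5):
    """Iterative normalization of an (n, d) embedding matrix E.

    After convergence, vectors are (approximately) unit-length and zero-mean."""
    E = E.astype(float).copy()
    for _ in range(n_iters):
        E /= np.linalg.norm(E, axis=1, keepdims=True)   # Eq. 6: unit length
        E -= E.mean(axis=0, keepdims=True)              # Eq. 7: zero mean
    return E
```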

4 Experiment

We evaluate the cross-lingual contextual mapping between English and three other languages — German (de), Arabic (ar), and Dutch (nl), where German and Dutch are closely related to English, and Arabic is distant from English. We set English as the source language and the other languages as targets.

4.1 Settings

Dataset:

Parallel corpora of European languages are downloaded from ParaCrawl v6.0 (www.paracrawl.eu), while the English-Arabic parallel corpus is extracted from the United Nations Parallel Corpus (https://conferences.unite.un.org/uncorpus/). We select 500K parallel sentences for each language pair and truncate sentences longer than 150 tokens.

Token Alignment:

Fast Align does not always produce accurate alignments, so we do not take one-to-many, many-to-one, or many-to-many alignments into consideration. We also do not use subword embeddings to approximately represent out-of-vocabulary (OOV) tokens, since aligned embeddings of non-OOV tokens are more likely to have similar relative positions across embedding spaces.

Contextual Embeddings Extraction:

For types in the vocabulary that appear at least five times in the given parallel corpus, we store at most 10K contextual vectors. When we extract sense-level embeddings, we only cluster types whose occurrence count in the corpus is over 100. Sense-level embedding vectors for stopwords (we use the stopword lists from https://github.com/Alir3z4/stop-words) and low-frequency types (occurrence under 100) are simply obtained by averaging all their vectors; they boil down to type-level representations. Note that contextual embeddings are normalized to unit length. The BERT models of all languages are base-size (12 layers with 768 dimensions).

4.2 Bilingual Dictionary Induction

Before iterative normalization
                              de             ar             nl
                              P@1    P@5     P@1    P@5     P@1    P@5
FastText - NN                 72.80  87.22   52.13  76.94   73.23  90.21
Ours, type-level - NN         81.60  95.30   71.93  95.74   86.36  97.17
Ours, sense-level - NN        81.60  96.42   71.43  96.00   87.00  97.30
FastText - CSLS               77.71  88.96   64.91  81.95   79.02  91.63
Ours, type-level - CSLS       85.28  96.73   79.70  95.99   89.32  98.33
Ours, sense-level - CSLS      85.58  97.65   80.70  96.24   89.70  97.81

After iterative normalization
FastText - NN                 73.01  87.93   53.88  77.70   75.16  90.86
Ours, type-level - NN         81.60  95.81   72.43  96.49   86.10* 97.17
Ours, sense-level - NN        82.41  96.83   74.44  96.00   86.10* 98.01
FastText - CSLS               77.61* 89.37   64.16* 82.96   79.79  92.54
Ours, type-level - CSLS       86.30  97.03   80.70  96.24   91.12  98.46
Ours, sense-level - CSLS      87.12  98.16   81.20  96.24   91.51  98.58

Table 1: Evaluation measures for the three cross-lingual embedding mapping approaches before and after iterative normalization. (*) means that the score does not increase after iterative normalization. ‘FastText’ in the table refers to the supervised mapping from Conneau et al. (2018).

We evaluate our mappings on the BDI task: after a shared cross-lingual embedding space is built, a target-language word is retrieved for each source query word based on its representation in the shared space.

Baselines:

Our main baseline is the supervised mapping (Conneau et al., 2018) that applies the Procrustes solution to fastText embeddings (Bojanowski et al., 2017). The second baseline is the supervised mapping of ELMo embeddings from Schuster et al. (2019). Following their experimental settings, the outputs of the first LSTM layer are used to represent tokens. For a fair comparison, the anchor vectors are derived from the same corpora as those used to generate our contextual embeddings. We only compare our method with the ELMo embedding mapping on the English-German language pair, for which they provide off-the-shelf monolingual pretrained ELMo models (they do not provide Dutch or Arabic ELMo models).

Training and Evaluation:

Although our approach is dictionary-free, the mapping methods of Conneau et al. (2018) and Schuster et al. (2019) need a seed dictionary. For a fair comparison between our mappings and the baselines, we use identical training pairs from the same seed dictionary and evaluate all mappings on identical translation pairs in the test set. Mappings are obtained by leveraging the 5K most frequent words in the source language and their translations in the dictionaries. The dictionaries take the polysemy of words into account and are publicly available in the MUSE library (https://github.com/facebookresearch/MUSE). We evaluate mappings on 1.5K word pairs whose source words rank from 5K to 6.5K in frequency, where source word queries are searched against 200K target words. Note that we do not consider OOV tokens, so the training and test pairs are the tokens that appear both in the dictionary and as non-subwords in the BERT vocabulary; likewise, queries are run over the overlap of the 200K target words and the non-subwords in the BERT vocabulary. Thus, for en-de, en-nl, and en-ar, we use 4191, 3375, and 2605 training pairs, 978, 777, and 399 source test queries, and 14140, 12352, and 7918 target words, respectively. We retrieve target words by finding the nearest neighbor (largest cosine similarity) to the source words and then repeat the retrieval task using the CSLS metric with the number of neighbors set to 10.
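For completeness, a sketch of how P@1 and P@5 might be computed from a similarity matrix (cosine or CSLS); the gold-translation format and function name are assumptions.

```python
import numpy as np

def precision_at_k(scores, gold, ks=(1, 5)):
    """Precision@k for the BDI task.

    scores: (n_queries, n_targets) similarity matrix (cosine or CSLS);
    gold: list of sets, gold[i] = target indices accepted as translations of query i."""
    results = {}
    for k in ks:
        topk = np.argsort(-scores, axis=1)[:, :k]      # k best target candidates
        hits = [bool(set(row) & gold[i]) for i, row in enumerate(topk)]
        results[f"P@{k}"] = 100.0 * np.mean(hits)
    return results
```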

4.3 Degree of Isomorphism

To illustrate the close relationship between isotropy, isometry, and isomorphism, we also compute their scores. For each language pair, we leverage the relational similarity (RS) described in Section 2.3 to measure the degree of isomorphism, computed over the translation pairs whose source words are the most frequent. The number of randomly selected vectors is 1K for the calculation of the isotropy score $I$ and the isometry score $D$.

4.4 Iterative Normalization

For all experiments above, we report results both with and without iterative normalization. Iterative normalization is run for 5 iterations, which is sufficient for convergence.

5 Discussion

5.1 Effect of Our Context-Aware Embeddings

The main results are shown in Table 1. They indicate that the performance of our contextual embeddings is significantly superior to that of static fastText embeddings on the BDI task, where our mappings outperform fastText embedding mappings by approximately 10% accuracy. The evaluation on the distant language pair, English and Arabic, shows the most impressive improvement, boosting P@1 and P@5 by almost 20%. Even though the target words for queries are a subset of the 200K words, we are surprised that contextual embedding mapping outperforms fastText embedding mapping by such a large margin under the same settings of training and test word pairs. The success of contextual mapping stems from the similar relative positions of our aligned contextual embeddings: they share the same contextual information from the parallel corpus and are not approximated by subword embeddings, which yields a high degree of isomorphism between the cross-lingual spaces. Recall that a higher RS score implies a higher degree of isomorphism. As shown in the RS columns of Table 2, contextual embeddings construct spaces with substantially higher RS scores than fastText.

In Table 3, our contextual mapping also significantly outperforms the ELMo embedding mapping. Note that we still use the same training and test sets as in the fastText alignment experiment.

5.2 Effect of Sense-Level Embeddings

Contextual sense-level embeddings are derived by splitting multi-sense word embeddings, which mitigates the representation bias problem (Section 3.3) and constructs spaces with a higher degree of isomorphism across various language pairs.


                 de                              ar                              nl
Before IN        I_s     I_t     D       RS      I_s     I_t     D       RS      I_s     I_t     D       RS
fastText         0.1650  0.1681  0.1085  0.5452  0.1706  0.1611  0.1362  0.3751  0.1701  0.1996  0.1161  0.5958
type-level       0.4496  0.7087  0.5184  0.6998  0.4161  0.5001  0.1888  0.6660  0.4784  0.7545  0.5524  0.6804
sense-level      0.4187  0.6793  0.5213  0.7334  0.4000  0.4747  0.1796  0.6912  0.4587  0.7326  0.5480  0.7382

After IN
fastText         0.0035  0.0022  0.1279  0.5245  0.0040  0.0024  0.1459  0.3994  0.0080  0.0052  0.1193  0.5747
type-level       0.0089  0.0062  0.1062  0.7586  0.0115  0.0101  0.1091  0.7970  0.0109  0.0068  0.1035  0.7829
sense-level      0.0064  0.0043  0.1082  0.7620  0.0112  0.0108  0.1125  0.8019  0.0085  0.0054  0.1044  0.7832

Table 2: Scores of isotropy, isometry, and relational similarity across embedding spaces before and after iterative normalization (IN) between English and the three target languages, where I_s and I_t are the isotropy scores of the source and target space respectively, D is the isometry score, and RS is the relational similarity. Bold numbers are the highest RS score among the three mapping methods.
                              P@1     I_s     I_t     D       RS
ELMo (before IN)              55.73   0.2160  0.3200  0.2425  0.4984
ELMo (after IN)               56.75   0.0062  0.0053  0.1787  0.5547
Ours, sense-level (after IN)  87.12   0.0064  0.0043  0.1082  0.7620

Table 3: Comparison of our best mapping with the ELMo embedding mapping for English and German. Note that the P@1 scores are obtained with the CSLS metric.

Mitigation of Representation Bias:

We select two English words, ‘bank’ and ‘hard’, to expose how sense-level embeddings allay the representation bias problem between English and German. The English word ‘bank’ usually carries a financial meaning and the meaning of land alongside a river. After we obtain a transformation matrix for the English-German language pair, we map the English embedding vectors into the German space and calculate the cosine similarity between the English word ‘bank’ and its German translations ‘bank’ (financial meaning) and ‘ufer’ (shore meaning). Table 4 shows that the sense-level embedding that represents the financial meaning is closer to the vector of the German word ‘bank’ than the type-level embedding, and the other sense-level vector is also correctly mapped to the neighborhood of ‘ufer’. Note that ‘bank’ is in the most frequent 5K training pairs, so the mapping used in Table 4 excludes it during training. The same holds for the English word ‘hard’ with its German translations ‘hart’ and ‘schwer’.

Compare with Type-Level Embeddings:

In the comparison of BDI scores in Table 1, sense-level embeddings outperform type-level embeddings by around 1% accuracy. We attribute the better performance of sense-level embeddings to their slightly higher degree of isomorphism, as shown by the RS scores in Table 2. This reveals that the better isomorphic structure results from mitigating the representation bias.

                en word    de word   cosine similarity
type-level      bank       bank      0.6868
                bank       ufer      0.1461
sense-level     bank_1     bank      0.6891
                bank_2     ufer      0.6481
type-level      hard       schwer    0.6775
                hard       hart      0.2551
sense-level     hard_1     schwer    0.7286
                hard_2     hart      0.6755

Table 4: Comparison of cosine similarity between example English words (‘bank’, ‘hard’) and their German translations after mapping at the type level and the sense level. In the sense-level method, bank_1 means ‘financial establishment’, bank_2 means ‘shore’, hard_1 means ‘difficult’, and hard_2 means ‘solid’.

5.3 Effect of Isotropy and Isometry

As indicated in Table 2, the original (before iterative normalization) degrees of isotropy of the contextual embeddings of English (I_s column) and the three target languages (I_t column) are very different. The isotropy score of the English space is around 0.45, while the score of the Arabic space is 0.5, and the scores of the German and Dutch spaces even exceed 0.7, which implies that the target language embeddings gather in narrower ‘cones’ in the semantic space.

As discussed in Section 3.4, two semantic spaces have a higher isometric degree when they have similar degrees of (an)isotropy. The results in the After IN rows of Table 2 show that the isotropy scores all drop to near 0, and the isometry scores of our contextual mapping methods drop significantly at the same time, which indicates that iterative normalization succeeds in manufacturing a better isometric condition (D column).

The RS scores in Table 2 demonstrate that a more isomorphic space can generally be constructed when the isometric degree is higher. Compared with the P@1 and P@5 results before IN in Table 1, the superior results after IN, corresponding to the higher RS scores (after IN) in Table 2, show that the improvement on the BDI task benefits from the more isomorphic and isometric spaces. Note that the above discussion also applies to the ELMo contextual embedding alignment (Table 3).

6 Conclusion

In this paper, our contextual embeddings have demonstrated their capacity for building high-quality mappings and have illustrated a higher degree of isomorphism across language spaces compared with previous mapping methods. The success of the contextual embeddings provides a new perspective on exploring cross-lingual spaces by extracting parallel information from deep pretrained language models. Interestingly, contextual sense-level embeddings show advantages in space mapping by splitting multi-sense word embedding vectors into several sense vectors, which ameliorates the representation bias problem. We have also explored the relationship between isotropy and isometry for cross-lingual embedding spaces and leveraged iterative normalization to keep the isometry consistent across languages, which further improves the degree of isomorphism.

Our future work is to apply our contextual embedding mapping method to downstream cross-lingual transfer tasks with a broader range of high-quality aligned embeddings of translation pairs.

References

  • M. Artetxe, G. Labaka, and E. Agirre (2016) Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2289–2294. External Links: Link, Document Cited by: §1.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. External Links: Link, Document Cited by: §1, §4.2.
  • P. F. Brown, S. A. Della-Pietra, V. J. Della-Pietra, and R. L. Mercer (1993) The mathematics of statistical machine translation. Computational Linguistics 19 (2), pp. 263–313. External Links: Link Cited by: §3.1.
  • A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou (2018) Word translation without parallel data. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.2, §4.2, §4.2, Table 1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §2.4, §2.4.
  • C. Dyer, V. Chahuneau, and N. A. Smith (2013) A simple, fast, and effective reparameterization of IBM model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, pp. 644–648. External Links: Link Cited by: §3.1.
  • K. Ethayarajh (2019) How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings. arXiv preprint arXiv:1909.00512. Cited by: §1.
  • S. Gouws, Y. Bengio, and G. Corrado (2015) BilBOWA: fast bilingual distributed representations without word alignments. Cited by: §1.
  • G. Jawahar, B. Sagot, and D. Seddah (2019) What does BERT learn about the structure of language?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3651–3657. External Links: Link, Document Cited by: §2.4.
  • T. Luong, H. Pham, and C. D. Manning (2015) Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, Denver, Colorado, pp. 151–159. External Links: Link, Document Cited by: §1.
  • L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605. Cited by: Figure 1.
  • A. V. Miceli Barone (2016) Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. In Proceedings of the 1st Workshop on Representation Learning for NLP, Berlin, Germany, pp. 121–126. External Links: Link, Document Cited by: §1.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013a) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §1.
  • T. Mikolov, Q. V. Le, and I. Sutskever (2013b) Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168. Cited by: §1, §2.1.
  • J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: §1.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237. External Links: Link, Document Cited by: §1, §2.4.
  • M. Radovanović, A. Nanopoulos, and M. Ivanović (2010) Hubs in space: popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research 11 (Sep), pp. 2487–2531. Cited by: §2.2.
  • S. Ruder, I. Vulić, and A. Søgaard (2019) A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research 65, pp. 569–631. Cited by: §1.
  • A. Safaya, M. Abdullatif, and D. Yuret (2020) KUISAIL at semeval-2020 task 12: bert-cnn for offensive speech identification in social media. External Links: 2007.13184 Cited by: §3.2.
  • V. Satopaa, J. Albrecht, D. Irwin, and B. Raghavan (2011) Finding a "kneedle" in a haystack: detecting knee points in system behavior. In 2011 31st International Conference on Distributed Computing Systems Workshops, pp. 166–171. Cited by: §3.3.
  • P. H. Schönemann (1966) A generalized solution of the orthogonal procrustes problem. Psychometrika 31 (1), pp. 1–10. Cited by: §2.1.
  • T. Schuster, O. Ram, R. Barzilay, and A. Globerson (2019) Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1599–1613. External Links: Link, Document Cited by: §1, §2.4, §4.2, §4.2.
  • A. Søgaard, S. Ruder, and I. Vulić (2018) On the limitations of unsupervised bilingual dictionary induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 778–788. External Links: Link, Document Cited by: §1.
  • R. Tibshirani, G. Walther, and T. Hastie (2001) Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (2), pp. 411–423. Cited by: footnote 3.
  • I. Vulić, S. Ruder, and A. Søgaard (2020) Are all good word vector spaces isomorphic?. arXiv preprint arXiv:2004.04070. Cited by: §2.3.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771. Cited by: §3.2.
  • C. Xing, D. Wang, C. Liu, and Y. Lin (2015) Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 1006–1011. External Links: Link Cited by: §1, §2.1.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) Xlnet: generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pp. 5754–5764. Cited by: §2.4.
  • M. Zhang, Y. Liu, H. Luan, and M. Sun (2017) Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1959–1970. External Links: Link, Document Cited by: §1.
  • M. Zhang, K. Xu, K. Kawarabayashi, S. Jegelka, and J. Boyd-Graber (2019) Are girls neko or shōjo? cross-lingual alignment of non-isomorphic embeddings with iterative normalization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3180–3189. External Links: Link, Document Cited by: §1, §1.