Cross-Lingual BERT Transformation for Zero-Shot Dependency Parsing

by Yuxuan Wang, et al.
Harbin Institute of Technology

This paper investigates the problem of learning cross-lingual representations in a contextual space. We propose Cross-Lingual BERT Transformation (CLBT), a simple and efficient approach to generate cross-lingual contextualized word embeddings based on publicly available pre-trained BERT models (Devlin et al., 2018). In this approach, a linear transformation is learned from contextual word alignments to align the contextualized embeddings independently trained in different languages. We demonstrate the effectiveness of this approach on zero-shot cross-lingual transfer parsing. Experiments show that our embeddings substantially outperform the previous state-of-the-art that uses static embeddings. We further compare our approach with XLM (Lample and Conneau, 2019), a recently proposed cross-lingual language model trained with massive parallel data, and achieve highly competitive results.




1 Introduction

One of the most promising directions for cross-lingual dependency parsing, which also remains a challenge, is to bridge the gap of lexical features. Prior works W14-1613; guo-EtAl:2015:ACL-IJCNLP2 have shown that cross-lingual word embeddings are able to significantly improve the transfer performance compared to delexicalized models mcdonald2011multi; mcdonald2013universal. These cross-lingual word embeddings are static in the sense that they do not change with the context. (In this paper, we refer to these embeddings as static, as opposed to contextualized ones.)

Recently, contextualized word embeddings derived from large-scale pre-trained language models NIPS2017_7209; peters2017semi; peters2018deep; devlin2018bert have demonstrated dramatic superiority over traditional static word embeddings, establishing new state-of-the-art results in various monolingual NLP tasks suzana2018deep; schuster2018cross. The success has also been recognized in dependency parsing che2018towards. The great potential of these contextualized embeddings has inspired us to extend their power to cross-lingual scenarios.

Figure 1: A toy illustration of the method, where contextualized embeddings of the Spanish word canal are transformed to the semantic space of English.

Several recent works have proposed to learn contextualized cross-lingual embeddings by training cross-lingual language models from scratch with parallel data as supervision, and these models have been shown to be effective in several downstream tasks schuster2018cross; mulcaire2019polyglot; lample2019cross. These methods are typically resource-demanding and time-consuming. (For instance, XLM was trained on 64 Volta GPUs lample2019cross. While the training time is not described in that paper, we may take the statistics from BERT as a reference, e.g., BERT was trained on 4 Cloud TPUs for 4 days devlin2018bert.) In this paper, we propose Cross-Lingual BERT Transformation (CLBT), a simple and efficient off-line approach that learns a linear transformation from contextual word alignments. With CLBT, contextualized embeddings from pre-trained BERT models in different languages are projected into a shared semantic space. The learned transformation is then applied on top of the BERT encodings of each sentence, which are in turn fed as input to a parser.

Our approach utilizes the semantic equivalence in word alignments, and thus is supposed to be word sense-preserving. Figure 1 illustrates our approach, where contextualized embeddings of the Spanish word “canal” are transformed to the corresponding semantic space in English.

Experiments on the Universal Dependencies (UD) treebanks (v2.2) nivre2018ud show that our approach substantially outperforms previous models that use static cross-lingual embeddings, with an absolute gain of 2.91% in averaged LAS. We further compare to XLM lample2019cross, a recently proposed large-scale cross-lingual language model. Results demonstrate that our approach requires significantly less training data, fewer computing resources and less training time than XLM, yet achieves highly competitive results.

2 Related Work

Static cross-lingual embedding learning methods can be roughly categorized as on-line and off-line methods. Typically, on-line approaches integrate monolingual and cross-lingual objectives to learn cross-lingual word embeddings in a joint manner C12-1089; P14-2037; guo2016representation, while off-line approaches take pre-trained monolingual word embeddings of different languages as input and retrofit them into a shared semantic space xing2015normalized; lample2018word; chen2018unsupervised.

Several approaches have been proposed recently to connect the rich expressiveness of contextualized word embeddings with cross-lingual transfer. mulcaire2019polyglot based their model on ELMo peters2018deep and proposed a polyglot contextual representation model by capturing character-level information from multilingual data. lample2019cross adapted the objectives of BERT devlin2018bert to incorporate cross-lingual supervision from parallel data to learn cross-lingual language models (XLMs), which have obtained state-of-the-art results on several cross-lingual tasks. Similar to our approach, schuster2019cross also aligned pre-trained contextualized word embeddings through linear transformation in an off-line fashion. They used the averaged contextualized embeddings as an anchor for each word type, and learned a transformation in the anchor space. Our approach, however, learns this transformation directly in the contextual space, and hence is explicitly designed to be word sense-preserving.

3 Cross-Lingual BERT Transformation

This section describes our proposed approach, namely CLBT, to transform pre-trained monolingual contextualized embeddings to a shared semantic space.

3.1 Contextual Word Alignment

Traditional methods of learning static cross-lingual word embeddings have relied on various sources of supervision such as bilingual dictionaries lazaridou2015hubness; smith2017offline, parallel corpora guo-EtAl:2015:ACL-IJCNLP2 or on-line Google Translate mikolov2013exploiting; xing2015normalized. To learn contextualized cross-lingual word embeddings, however, we require supervision at the word token level (or context level) rather than the type level (i.e. dictionaries). Therefore, we assume a parallel corpus as our supervision, analogous to on-line methods such as XLM lample2019cross.

In our approach, unsupervised bidirectional word alignment is first applied to the parallel corpus to obtain a set of aligned word pairs with their contexts, or contextual word pairs for short. For one-to-many and many-to-one alignments, we use the left-most aligned word, such that all the resulting word pairs are one-to-one. (Preliminary experiments indicate that this works better than keeping all the alignments.) In practice, since WordPiece embeddings wu2016google are used in BERT, all the parallel sentences are tokenized using BERT’s wordpiece vocabulary before being aligned.
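The reduction to one-to-one pairs can be sketched as follows. This is only one plausible reading of "left-most aligned word", not the authors' code: the function names are hypothetical, the input is assumed to be a line of `i-j` index pairs (source index, target index), and a greedy left-to-right pass keeps the first link seen for each token on either side.

```python
from typing import List, Tuple


def parse_alignment(line: str) -> List[Tuple[int, int]]:
    """Parse a line such as '0-2 0-3 1-5' into (source, target) index pairs."""
    pairs = []
    for item in line.split():
        src, tgt = item.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def to_one_to_one(pairs: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Keep at most one link per source token and per target token.

    Links are visited in (source, target) order, so for a one-to-many or
    many-to-one group the left-most partner is the one that survives.
    """
    kept = []
    used_src, used_tgt = set(), set()
    for src, tgt in sorted(pairs):
        if src in used_src or tgt in used_tgt:
            continue  # this token already has a left-most partner
        kept.append((src, tgt))
        used_src.add(src)
        used_tgt.add(tgt)
    return kept


# Source token 0 links to targets 2 and 3; sources 1 and 2 both link to target 5.
links = parse_alignment("0-2 0-3 1-5 2-5 3-4")
one_to_one = to_one_to_one(links)
```

Each surviving pair then indexes one wordpiece in the source sentence and one in the target sentence, whose BERT vectors form a contextual word pair.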

3.2 Off-Line Transformation

Given a set of contextual word pairs, their BERT representations $\{(x_i, y_i)\}_{i=1}^{n}$ can be easily obtained from pre-trained BERT models, where $x_i \in \mathbb{R}^{d}$ is the contextualized embedding of token $i$ in the target language, and $y_i \in \mathbb{R}^{d}$ is the representation of its alignment in the source language. (In this work, we use the English BERT (enBERT) for the source language (English) and the multilingual BERT (mBERT), which is trained on 102 languages without cross-lingual supervision, for all the target languages.)

In our experiments, a parser is trained on source language data and applied directly to all the target languages. Therefore, we propose to project the embeddings of target languages to the space of the source language, instead of the opposite direction. Specifically, we aim at finding an appropriate linear transformation $W$, such that $W x_i$ approximates $y_i$. (We also investigated non-linear transformations in our experiments, but did not observe any improvements.) This can be achieved by solving the following optimization problem:

$$\min_{W} \sum_{i=1}^{n} \lVert W x_i - y_i \rVert^{2}$$

where $W \in \mathbb{R}^{d \times d}$ is a parameter matrix.

Previous works on static cross-lingual embeddings have shown that an orthogonal $W$ (i.e. $W^{\top} W = I$) is helpful for the word translation task xing2015normalized. In this case, an analytical solution can be found through singular value decomposition (SVD) of $Y^{\top} X$:

$$W = U V^{\top}, \quad \text{where } U \Sigma V^{\top} = \mathrm{SVD}(Y^{\top} X)$$

Here $X, Y \in \mathbb{R}^{n \times d}$ are the contextualized embedding matrices, where $n$ is the number of aligned contextual word pairs and $d$ is the dimension of the monolingual contextualized embeddings. Each pair of rows $(X_i, Y_i)$ indicates an aligned contextual word pair.
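The SVD-based solution can be illustrated with a short NumPy sketch. The notation is an assumption made for this example: `X` and `Y` are (n, d) arrays whose i-th rows hold an aligned target and source embedding, and `procrustes_svd` is a hypothetical helper name. The data is synthetic, generated from a known orthogonal map so that the recovered matrix can be checked.

```python
import numpy as np


def procrustes_svd(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Return the orthogonal W that minimises sum_i ||W x_i - y_i||^2.

    X, Y: (n, d) arrays; row i of X is a target-language embedding and
    row i of Y is the embedding of its aligned source-language token.
    """
    U, _, Vt = np.linalg.svd(Y.T @ X)  # SVD of the (d, d) matrix Y^T X
    return U @ Vt


# Synthetic check: generate Y from X with a known orthogonal matrix Q.
rng = np.random.default_rng(0)
n, d = 200, 8
X = rng.standard_normal((n, d))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
Y = X @ Q.T                                       # y_i = Q x_i for every row

W = procrustes_svd(X, Y)
projected = X @ W.T                               # apply W to every row of X
```

Only the d-by-d product `Y.T @ X` is decomposed, so the cost of the SVD itself does not depend on the number of pairs; forming that product still requires the embeddings of all pairs.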

Although this can be computed on CPUs within several minutes, the memory requirement grows with the amount of training data. Therefore, we present an approximate solution, where the transformation matrix $W$ is optimized with gradient descent (GD) and is not constrained to be orthogonal. (We found that the orthogonal constraint does not help for GD.) This GD-based approach can be trained on a single GPU and typically converges in several hours.
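A minimal sketch of the unconstrained, gradient-based alternative is given below. It is a simplification, not the reported setup: the paper states that Adam is used, whereas this example uses plain full-batch gradient descent in NumPy on synthetic data, and the function name, learning rate and step count are illustrative assumptions.

```python
import numpy as np


def fit_linear_map_gd(X: np.ndarray, Y: np.ndarray,
                      lr: float = 0.1, steps: int = 500) -> np.ndarray:
    """Minimise (1/n) * sum_i ||W x_i - y_i||^2 with plain gradient descent.

    X: (n, d_in) target-language embeddings, one per row.
    Y: (n, d_out) aligned source-language embeddings, one per row.
    W is left unconstrained (no orthogonality requirement).
    """
    n, d_in = X.shape
    d_out = Y.shape[1]
    W = np.zeros((d_out, d_in))
    for _ in range(steps):
        residual = X @ W.T - Y             # (n, d_out), row i is W x_i - y_i
        grad = (2.0 / n) * residual.T @ X  # gradient of the mean squared error
        W -= lr * grad
    return W


# Synthetic check with a known, non-orthogonal map A.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))
A = rng.standard_normal((8, 8))
Y = X @ A.T
W = fit_linear_map_gd(X, Y)
```

In a mini-batch variant, each update would touch only a slice of the rows of `X` and `Y`, which is one way the memory footprint can be kept independent of the total number of pairs.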

To validate the effectiveness of our approach in cross-lingual dependency parsing, we first obtain the CLBT embeddings with the proposed approach, and then use them as input to a modern graph-based neural parser (described in the next section), in place of the pre-trained static embeddings. Note that BERT produces embeddings at the wordpiece level, so we only use the left-most wordpiece embedding of each word. (We tried alternative strategies such as averaging, or using the middle or right-most wordpiece, but observed no significant difference.)
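Selecting the left-most wordpiece of each word can be sketched in a few lines. The example assumes BERT-style tokens in which continuation pieces carry a `##` prefix and special tokens have already been removed; the token sequence and the toy vectors are made up for illustration.

```python
from typing import List, Sequence


def first_wordpiece_indices(wordpieces: Sequence[str]) -> List[int]:
    """Return the index of the first wordpiece of every word.

    Assumes continuation pieces start with '##' (BERT WordPiece convention).
    """
    return [i for i, piece in enumerate(wordpieces)
            if not piece.startswith("##")]


def word_level_embeddings(wordpieces: Sequence[str],
                          vectors: Sequence[Sequence[float]]) -> List[Sequence[float]]:
    """Keep one vector per word: the one of its left-most wordpiece."""
    return [vectors[i] for i in first_wordpiece_indices(wordpieces)]


# Illustrative segmentation, not actual tokenizer output.
pieces = ["the", "un", "##believ", "##able", "canal"]
vectors = [[0.0], [1.0], [2.0], [3.0], [4.0]]  # one toy vector per wordpiece
indices = first_wordpiece_indices(pieces)
words = word_level_embeddings(pieces, vectors)
```

The result has one vector per word, which matches the word-level input a dependency parser expects.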

4 Experiments

4.1 Data and Settings

In our experiments, the contextual word pairs are obtained from the Europarl corpora koehn2005epc using the fast_align toolkit dyer2010cdec. Only 10,000 sentence pairs are used for each target language. For the parsing datasets, we use the Universal Dependencies (UD) Treebanks (v2.2) nivre2018ud, following the settings of the previous state-of-the-art system ahmad2018near. From the 31 languages they analyzed, we select 18 whose Europarl data is publicly available. (For languages with multiple treebanks, we use the same combinations as they did.) Statistics of the selected languages and treebanks can be found in the Appendix. We employ the Biaffine Graph-based Parser of dozat2017deep and adopt their hyper-parameters for all of our models.

In all the experiments, English is used as the source language, and the other 17 languages as targets. The model is trained on the English treebank and applied directly to target languages with the transformed contextualized embeddings. We train our models using the Adam optimizer kingma2015adam, and most of them converge within a few thousand epochs in several hours. More implementation details are reported in the Appendix.

4.2 Baseline Systems

We compare our method with the following three baseline models:

  • mBERT (contextualized). Embeddings generated by the mBERT model are directly used in the training and testing procedures.

  • FT-SVD (ahmad2018near, off-line, static). SVD-based transformation smith2017offline is applied on 300-dimensional FastText embeddings bojanowski2017enriching to obtain cross-lingual static embeddings, which represents the previous state-of-the-art. We report results from their paper of the RNNGraph model which used the same architecture as ours.

  • XLM (lample2019cross, on-line, contextualized). A strong method which learns contextualized cross-lingual embeddings from scratch with cross-lingual data.

For the XLM model, we employ the XNLI-15 model they released to generate embeddings and apply them to cross-lingual dependency parsing in the same way as we do with our own model. We compare with XLM on the 4 target languages covered by both works.

4.3 Comparison with Off-Line Methods

Lan.    Static    Contextualized
        FT-SVD    mBERT     CLBT (SVD)    CLBT (GD)
en      88.31     90.71     91.03*        91.03*
de      59.31     63.41     64.47*        62.14
da      68.81     70.57     71.60*        71.66*
sv      73.49     70.09     73.33*        75.95*
nl      60.11     65.66     65.45         63.86
- - - - - - - - - - - - - - - - - - - - - - - - - - -
fr      73.46     72.97     74.70*        76.59*
it      76.23     79.02     79.46         78.98
es      66.91     65.43     67.14*        68.33*
pt      67.98     67.11     69.12*        69.25*
ro      52.11     46.40     55.14*        55.84*
- - - - - - - - - - - - - - - - - - - - - - - - - - -
sk      56.98     50.76     59.46*        59.92*
pl      58.59     63.10     65.37*        65.80*
bg      66.68     71.20     70.26         70.75
sl      54.57     56.78     57.42*        57.21*
cs      52.80     45.20     52.20*        52.99*
- - - - - - - - - - - - - - - - - - - - - - - - - - -
fi      48.74     49.56     51.00*        52.61*
et      44.40     46.64     47.79*        48.52*
- - - - - - - - - - - - - - - - - - - - - - - - - - -
lv      49.59     45.11     48.59*        49.78*
AVG.    60.63     60.53     63.09         63.54
Table 1: Results (LAS%) on test sets. Languages are split by language families with dashed lines. For English, the source language, no transformation is applied, so the same result is shown in both CLBT columns. AVG. means the average of results from all target languages. Statistically significant differences between our methods and the mBERT model are marked with *, with p-value < 0.05 under McNemar’s test.

Results on the test sets are shown in Table 1. (UAS results are listed in the Appendix due to space limits.) Note that since we have no access to the parsed files of the FT-SVD model, we only report statistical significance tests between our methods and the mBERT model, which is highly comparable to the FT-SVD model on average. Languages are grouped by language families. Overall, our approach with either SVD or GD outperforms both FT-SVD and mBERT by a substantial margin (+2.91% in averaged LAS), and GD turns out to be slightly better than SVD in most of the languages. When combined with FT-SVD, the performance can be further improved by 0.33% in LAS for the GD method and 0.51% for SVD (see the Appendix for more details). Interestingly, the mBERT model, which is trained without any cross-lingual supervision but uses a shared multilingual wordpiece vocabulary, works surprisingly well in some languages, especially in those linguistically close to English. Similar observations have also been reported in other works pires2019multilingual; wu2019beto.
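The significance marks rely on McNemar's test, which compares two systems on the same items using only the items where they disagree. The sketch below shows the exact (binomial) two-sided variant; the paper does not say which variant or which unit of comparison was used, so the function names and the toy correctness flags are assumptions for illustration only.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(correct_a: Sequence[bool],
                      correct_b: Sequence[bool]) -> Tuple[int, int]:
    """Count items only system A gets right (b) and only system B gets right (c)."""
    b = sum(1 for a, o in zip(correct_a, correct_b) if a and not o)
    c = sum(1 for a, o in zip(correct_a, correct_b) if o and not a)
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # the systems never disagree
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Toy flags: whether each of 12 items (e.g. dependency arcs) is predicted correctly.
system_a = [True] * 10 + [False] * 2
system_b = [False] * 10 + [True] * 2
b, c = discordant_counts(system_a, system_b)
p_value = mcnemar_exact(b, c)
```

A difference would be marked as significant at the level used in Table 1 when the resulting p-value is below 0.05.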

Lan.    XLM            CLBT (SVD)       CLBT (GD)
en      91.85/89.92    92.81*/91.03*    92.81*/91.03*
de      74.65/65.31    73.72/64.47      71.08/62.14
fr      79.62/73.41    80.01/74.70*     80.85*/76.59*
es      75.41/67.43    75.52/67.14*     75.70*/68.33*
bg      81.07/69.45    82.14*/70.26     81.51/70.75*
AVG.    77.69/68.90    77.85/69.14      77.29/69.45
Data    0.2-13.1M      10K              10K
Table 2: Results (UAS%/LAS%) on test sets. For English, the source language, the same result is shown in both CLBT columns. The last row shows the number of training sentences used for each language. AVG. means the average of results from the 4 target languages. Statistically significant differences between our methods and XLM are marked with an asterisk, with p-value < 0.05 under McNemar’s test.

4.4 Comparison with On-Line Methods

A comparison of our approach and a cross-lingual language model pre-training (XLM) method lample2019cross on the 4 overlapping languages is shown in Table 2. CLBT outperforms XLM in 3 out of the 4 languages but scores lower on German (de). The amount of training data used by each method is shown at the bottom of the table: the number of parallel sentences used by XLM ranges from 0.2 million (10 million tokens) for Bulgarian to 13.1 million (682 million tokens) for French. In comparison, only 10,000 parallel sentences (0.4 million tokens) are used for each language in CLBT, demonstrating the data efficiency of our approach. Moreover, given the efficiency in both data and training, CLBT can be readily scaled to new language pairs in hours.

4.5 Analysis

4.5.1 Transformation of Cross-lingual BERT Embedding

In order to investigate the properties of contextualized representations before and after the linear transformation, we employ the SENSEVAL2 data edmonds2001senseval2, where words from different languages are tagged by their word senses in different contexts.

Figure 2: t-SNE visualization of the English word nature and its Spanish translation naturaleza in different contexts by the contextualized representations before (a) and after (b) the linear transformation. Points are colored by word senses. Example contexts are given in (a). Translations of Spanish sentences are in brackets.

We take contextualized representations of the English word nature and its Spanish translation naturaleza in different contexts from the pre-trained English and multilingual BERT models respectively, and visualize their distributions in Figure 2(a), where we can observe obvious clustering of word senses. Specifically, words with sense nature-1 and naturaleza-1 mean the physical world, whereas nature-2 and naturaleza-2 mean inherent features. We then apply our GD-based method to the embeddings of naturaleza and depict the resulting cross-lingual embeddings in Figure 2(b). The distance between embeddings from English and Spanish is effectively reduced after the transformation. Moreover, embeddings of Spanish words are closer to those of English words with similar meanings, which indicates the effectiveness of our approach.

4.5.2 Effect of Training Data Size

We select several languages from each language family, and investigate the effect of the amount of training data on the performance of zero-shot cross-lingual dependency parsing. Specifically, we take the SVD-based approach, since it is faster than the GD-based one, and train transformation models with different amounts of parallel sentences from the Europarl dataset for each of the 13 selected languages.

Figure 3: Effects of the amount of training data on different languages. (The y-axis represents the LAS.)

As shown in Figure 3, for most of the languages, the best performance is achieved with only 5,000 parallel sentences. It is also worth noting that for most of the Germanic (e.g. German, Danish, Swedish and Dutch) and Romance (e.g. French, Italian, Spanish and Romanian) languages, which are typologically closer to English, a rather small training set of merely 100 sentences is capable of achieving comparable results.

5 Conclusion

We propose the Cross-Lingual BERT Transformation (CLBT) approach for contextualized cross-lingual embedding learning, which substantially outperforms the previous state-of-the-art in zero-shot cross-lingual dependency parsing. By exploiting publicly available pre-trained BERT models, our approach provides a fast and data-efficient solution to learning cross-lingual contextualized embeddings. Compared to XLM, our method requires much less parallel data and less training time, yet achieves comparable performance.

For future work, we are interested in unsupervised cross-lingual alignment, inspired by prior success on static embeddings lample2018word; alvarez2018gromov, which demands a deeper understanding of the geometry of the multilingual contextualized embedding space.


We thank the anonymous reviewers for their valuable suggestions. This work was supported by the National Natural Science Foundation of China (NSFC) via grants 61976072, 61632011 and 61772153.


Appendix A Appendices for “Cross-Lingual BERT Transformation for Zero-Shot Dependency Parsing”

A.1 Statistics of UD (v2.2) Treebanks

The statistics of the Universal Dependencies treebanks we used are summarized in Table 3.

Language           Language Family    Treebank                   Test Sentences
English (en)       IE.Germanic        EWT                        2,077
German (de)        IE.Germanic        GSD                        977
Danish (da)        IE.Germanic        DDT                        565
Swedish (sv)       IE.Germanic        Talbanken                  1,219
Dutch (nl)         IE.Germanic        Alpino, LassySmall         1,472
French (fr)        IE.Romance         GSD                        416
Italian (it)       IE.Romance         ISDT                       482
Spanish (es)       IE.Romance         GSD, AnCora                2,147
Portuguese (pt)    IE.Romance         Bosque, GSD                1,681
Romanian (ro)      IE.Romance         RRT                        729
Slovak (sk)        IE.Slavic          SNK                        1,061
Polish (pl)        IE.Slavic          LFG, SZ                    2,827
Bulgarian (bg)     IE.Slavic          BTB                        1,116
Slovenian (sl)     IE.Slavic          SSJ, SST                   1,898
Czech (cs)         IE.Slavic          PDT, CAC, CLTT, FicTree    12,203
Finnish (fi)       Uralic             TDT                        1,555
Estonian (et)      Uralic             EDT                        2,737
Latvian (lv)       IE.Baltic          LVTB                       1,228
Table 3: Statistics of the Universal Dependencies treebanks we selected in our experiments. For language family, “IE” is the abbreviation for Indo-European.

A.2 Implementation Details

For the graph-based Biaffine parser, we exclude the learned embeddings in our re-implementation, to focus on the effect of pre-trained embeddings. Besides, the universal POS tags are used throughout our experiments.

The PyTorch versions of the base BERT models for English and for multiple languages are used to generate the 768-dimensional contextualized embeddings for English and the target languages, respectively. In the GD-based method, we use the Adam optimizer, with a learning rate of 0.001, , .

A.3 Full Results on UD Treebanks

The LAS of our models (including the combination of cross-lingual FastText embeddings and our CLBT ones, where they are concatenated as the input to the parser) and the baseline ones are shown in Table 4, and UAS in Table 5.
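The combination of static and contextualized embeddings amounts to concatenating the two vectors of each word along the feature axis. The sketch below uses placeholder arrays rather than real embeddings; the dimensions (300 for FastText, 768 for BERT) are taken from the text, while the helper name and the sentence length are hypothetical.

```python
import numpy as np


def concat_embeddings(static_vecs: np.ndarray,
                      contextual_vecs: np.ndarray) -> np.ndarray:
    """Concatenate per-word static and contextualized vectors.

    Both inputs have one row per word of the same sentence.
    """
    if static_vecs.shape[0] != contextual_vecs.shape[0]:
        raise ValueError("both inputs need one row per word")
    return np.concatenate([static_vecs, contextual_vecs], axis=1)


num_words = 5                               # hypothetical sentence length
fasttext_vecs = np.zeros((num_words, 300))  # placeholder for FastText vectors
clbt_vecs = np.ones((num_words, 768))       # placeholder for CLBT vectors
parser_input = concat_embeddings(fasttext_vecs, clbt_vecs)
```

Each word is then represented by a single 1,068-dimensional vector.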

Lan.    Static    Contextualized
        FT-SVD    mBERT    SVD      SVD+FT    GD       GD+FT
en      88.31     90.71    91.03    91.32     91.03    91.32
de      59.31     63.41    64.47    64.78     62.14    63.05
da      68.81     70.57    71.60    72.03     71.66    71.57
sv      73.49     70.09    73.33    75.70     75.95    76.72
nl      60.11     65.66    65.45    65.90     63.86    64.92
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
fr      73.46     72.97    74.70    75.56     76.59    76.38
it      76.23     79.02    79.46    79.18     78.98    79.27
es      66.91     65.43    67.14    67.47     68.33    67.71
pt      67.98     67.11    69.12    69.00     69.25    69.09
ro      52.11     46.40    55.14    54.79     55.84    55.53
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
sk      56.98     50.76    59.46    59.43     59.92    59.60
pl      58.59     63.10    65.37    65.71     65.80    66.80
bg      66.68     71.20    70.26    70.33     70.75    70.89
sl      54.57     56.78    57.42    57.36     57.21    57.68
cs      52.80     45.20    52.20    52.37     52.99    53.05
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
fi      48.74     49.56    51.00    53.26     52.61    53.91
et      44.40     46.64    47.79    48.27     48.52    48.57
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
lv      49.59     45.11    48.59    50.04     49.78    50.98
AVG.    60.63     60.53    63.09    63.60     63.54    63.87
Table 4: Results (LAS%) on the test sets. The two columns on the left show results of baseline models, while the others on the right show results of our models. Languages are split by language families with dashed lines. AVG. means the average of results from all target languages. (Lan. stands for Language, FT stands for FastText.)

Lan.    Static    Contextualized
        FT-SVD    mBERT    SVD      SVD+FT    GD       GD+FT
en      90.44     92.49    92.81    93.11     92.81    93.11
de      69.49     72.34    73.72    73.72     71.08    71.51
da      77.36     79.29    79.63    80.05     79.16    79.70
sv      81.23     78.25    80.57    82.28     82.64    83.34
nl      67.88     73.22    72.80    73.30     71.00    72.11
fr      78.35     78.79    80.01    81.10     80.85    80.92
it      81.10     83.73    84.53    84.22     83.33    83.95
es      74.92     73.97    75.52    75.89     75.70    75.59
pt      76.46     75.09    77.17    76.90     76.71    76.44
ro      63.23     58.45    66.01    66.07     66.30    66.00
sk      65.41     60.19    67.56    68.31     67.62    67.83
pl      71.89     74.03    76.68    76.25     76.52    77.04
bg      78.05     82.83    82.14    82.01     81.51    81.70
sl      66.27     67.86    69.04    69.16     68.26    68.59
cs      61.88     54.86    61.02    61.29     61.26    61.26
fi      66.36     65.45    65.65    68.28     67.96    69.16
et      65.25     64.22    65.26    65.87     66.76    66.49
lv      71.43     61.73    65.54    66.98     67.41    68.20
AVG.    71.56     70.84    73.11    73.63     73.18    73.52
Table 5: Results (UAS%) on the test sets. The two columns on the left show results of baseline models, while the others on the right show results of our models. AVG. means the average of results from all target languages. (Lan. stands for Language, FT stands for FastText.)