Cross-Lingual Dependency Parsing Using Code-Mixed TreeBank

09/05/2019 ∙ by Zhang Meishan, et al. ∙ 0

Treebank translation is a promising method for cross-lingual transfer of syntactic dependency knowledge. The basic idea is to map dependency arcs from a source treebank to its target translation according to word alignments. This method, however, can suffer from imperfect alignment between source and target words. To address this problem, we investigate syntactic transfer by code mixing, translating only confident words in a source treebank. Cross-lingual word embeddings are leveraged for transferring syntactic knowledge to the target from the resulting code-mixed treebank. Experiments on Universal Dependency Treebanks show that code-mixed treebanks are more effective than translated treebanks, giving highly competitive performances among cross-lingual parsing methods.




1 Introduction

Treebank translation tiedemann2014treebank; tiedemann2015improving; tiedemann2016synthetic has been considered as a method for cross-lingual syntactic transfer. Take dependency grammar for instance. Given a source treebank, machine translation is used to find target translations of its sentences. Then word alignment is used to find mappings between source and target words, so that source syntactic dependencies can be projected to the target translations. Subsequently, a post-processing step removes unaligned target words, in order to ensure that the resulting target syntax forms a valid dependency tree. The method has shown promising performance for unsupervised cross-lingual dependency parsing among transfer methods mcdonald2011multi; tackstrom2012cross; rasooli2015density; guo2016representation.
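The projection step described above can be illustrated with a minimal sketch. It assumes a one-to-one alignment given as a dictionary from source to target positions; the function name `project_tree` and the data layout are hypothetical and are not taken from any released implementation.

```python
def project_tree(src_heads, src_labels, align):
    """Project source dependency arcs onto target positions.

    src_heads[i]  : index of the head of source word i, or -1 for the root
    src_labels[i] : dependency label of source word i
    align         : dict mapping a source index to its aligned target index
                    (assumed one-to-one for this sketch)
    Returns a dict: target index -> (target head index or -1, label).
    """
    projected = {}
    for dep, head in enumerate(src_heads):
        if dep not in align:
            continue  # the dependent has no target counterpart
        if head == -1:
            projected[align[dep]] = (-1, src_labels[dep])
        elif head in align:
            projected[align[dep]] = (align[head], src_labels[dep])
        # otherwise the arc is lost because its head is unaligned
    return projected


# Toy three-word source tree whose last word has no alignment.
heads = [1, -1, 1]
labels = ["nsubj", "root", "obj"]
arcs = project_tree(heads, labels, {0: 0, 1: 1})
```

Target words that receive no entry in the result correspond to the unaligned words that full treebank translation removes during post-processing.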

(a) full-scale translation.
(b) this method, partial translation.
Figure 1: An example to illustrate our method, where the source and target languages are English (en) and Swedish (sv), respectively.

The treebank translation method, however, suffers from various sources of noise. For example, machine translation errors directly affect the resulting treebank, by introducing ungrammatical word sequences. In addition, the alignments between source and target words may not be isomorphic due to inherent differences between languages or paraphrasing during translation. For example, in the case of Figure 1, the English words “are” and “being”, and the Swedish word “med”, do not have corresponding word-level translations. In addition, it can be perfectly acceptable to express “as soon as they can” using “very quickly” in a translation, which loses word alignment information because of the longer span. Finally, errors in automatic word alignments can also bring noise. Such alignment errors can directly affect the grammaticality of the resulting target treebank due to deletion of unaligned words during post-processing, or cause lost or mistaken dependency arcs.

We consider a different approach for translation-based syntactic knowledge transfer, which aims at making the best use of source syntax with the minimum noise being introduced. To this end, we leverage recent advances in cross-lingual word representations, such as cross-lingual word clusters tackstrom2012cross and cross-lingual word embeddings guo2015cross, which allow words from different languages to reside within a consistent feature vector space according to structural similarities between words. Thus, they offer a bridge on the lexical level between different languages.


A cross-lingual model can be trained by directly using cross-lingual word representations on a source treebank guo2015cross. Using this method, knowledge transfer can be achieved on the level of token correspondences. We take this approach as a naive baseline. To further introduce structural feature transfer, we transform a source treebank into a code-mixed treebank by considering word alignments between a source sentence and its machine translation. In particular, source words with highly confident target alignments are translated into target words by consulting the machine translation output, so that target word vectors can be directly used for learning target syntax. In addition, continuous spans of target words of the code-mixed treebank are reordered according to the translation, so that target grammaticality can be exploited to the maximum potential.

We conduct experiments on Universal Dependency Treebanks (v2.0) mcdonald2013universal; nivre2016universal. The results show that a code-mixed treebank can bring significantly better performance compared to a fully translated treebank, resulting in averaged improvements of 4.30 points on LAS. The code and related data will be released publicly under the Apache License 2.0.

2 Related Work

Existing work on cross-lingual transfer can be classified into two categories. The first aims to train a dependency parsing model on source treebanks mcdonald2011multi; guo2016distributed; guo2016representation, or their adapted versions zhao2009cross; tiedemann2014treebank; hongmin:2017:singapore in the target language. The second category, namely annotation projection, aims to produce a set of large-scale training instances full of automatic dependencies by parsing parallel sentences hwa2005bootstrapping; rasooli2015density. The two broad methods are orthogonal to each other, and can both make use of the lexicalized dependency models trained with cross-lingual word representations rasooli2017cross; rasooli2019low.

Source Treebank Adaptation. There has been much work on unsupervised cross-lingual dependency parsing by direct source treebank transfer. Several researchers investigate delexicalized models where only non-lexical features are used zeman2008cross; cohen2011unsupervised; mcdonald2011multi; naseem2012selective; tackstrom2013target; rosa2015klcpos3. All the features in these models are language independent, and are consistent across languages and treebanks. Thus they can be applied to target languages directly.

Subsequent research proposes to exploit lexicalized features to enhance the parsing models, by resorting to cross-lingual word representations tackstrom2012cross; guo2015cross; duong2015low; duong2015cross; zhang2015hierarchical; guo2016representation; ammar2016many; wick2016minimally; de2018parameter. Cross-lingual word clusters and cross-lingual word embeddings are two main sources of features for transferring knowledge between source and target language sentences. These studies enable us to train lexicalized models on code-mixed treebanks as well. Thus here we integrate the cross-lingual word representations as well, which gives more direct interaction between source and target words.

Our work follows another mainstream method of this line of work, namely treebank translation tiedemann2014treebank; tiedemann2015improving; tiedemann2016synthetic, which aims to adapt an annotated source treebank into the target language by machine translation. In addition, the target-side sentences are produced by machine translation. Previous work aims to build a well-formed tree tiedemann2016synthetic from source dependencies, solving word alignment conflicts by heuristic rules. In contrast, we use partial translation instead to avoid unnecessary noise.

Annotation Projection. The annotation projection approach relies on a set of parallel sentences between the source and target languages hwa2005bootstrapping; ganchev2009dependency. In particular, a source parser trained on the source treebank is used to parse the source-side sentences of the parallel corpus. The source dependencies are then projected onto the target sentences according to word alignments. Different strategies can be applied for the dependency projection task ma2014unsupervised; rasooli2015density; xiao2015annotation; agic2016multilingual; schlichtkrull2017cross. For example, one can project only dependency arcs whose words are aligned to target-side words with high confidence lacroix2016frustratingly. The resulting treebank can be highly noisy due to the auto-parsed source dependency trees. Recently lacroix2016frustratingly and rasooli2017cross propose to filter the results from the large-scale parallel corpus. Our work is different in that the source dependencies are from gold-standard treebanks.

3 Dependency Parsing

Figure 2: The overall architecture of the BiAffine parser.

We adopt a state-of-the-art neural BiAffine parser dozat2016deep as the baseline, which has achieved competitive performances for dependency parsing. The overall architecture is shown in Figure 2. Given an input sentence $w_1 \cdots w_n$, the model finds the embedding representation $x_i$ of each word $w_i$, where

$$x_i = e(w_i) \oplus e(c_i) \oplus e(t_i)$$

Here $c_i$ denotes the word cluster of $w_i$, and $t_i$ denotes the POS tag. We exploit cross-lingual word embeddings, clusters and universal POS tags, respectively, which are consistent across languages and treebanks.
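As a minimal sketch of this input layer, the snippet below builds a token representation from word, cluster and POS embeddings. It assumes the three vectors are concatenated, and the lookup tables, cluster id and vector values are toy, hypothetical data.

```python
def embed_token(word, cluster, pos, word_emb, cluster_emb, pos_emb):
    """Concatenate the three feature embeddings of one token (lists of floats)."""
    return word_emb[word] + cluster_emb[cluster] + pos_emb[pos]


# Hypothetical cross-lingual tables: both languages share one vector space,
# one cluster inventory and the universal POS tag set.
word_emb = {"found": [0.25, 0.5], "hittat": [0.25, 0.75]}
cluster_emb = {17: [1.0]}
pos_emb = {"VERB": [0.0, 1.0]}

x_en = embed_token("found", 17, "VERB", word_emb, cluster_emb, pos_emb)
x_sv = embed_token("hittat", 17, "VERB", word_emb, cluster_emb, pos_emb)
```

Because the cluster and POS parts are shared across languages, an English word and its Swedish counterpart differ only in the word-embedding slice of their representation.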

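A three-layer deep bidirectional long short-term memory (LSTM) neural structure is applied on $x_1 \cdots x_n$ to obtain hidden vectors $h_1 \cdots h_n$. For head finding, two nonlinear feed-forward neural layers are used on $h_i$ to obtain $h_i^{\text{dep}}$ and $h_i^{\text{head}}$. We compute the score for each dependency $i \curvearrowleft j$ (word $i$ headed by word $j$) by:

$$s(i \curvearrowleft j) = \text{BiAffine}(h_i^{\text{dep}}, h_j^{\text{head}}) = {h_j^{\text{head}}}^{\top} U h_i^{\text{dep}} + {h_j^{\text{head}}}^{\top} u$$

The above process is also used for scoring a labeled dependency $i \stackrel{l}{\curvearrowleft} j$, by extending the 1-dim score into an $L$-dim vector, where $L$ is the total number of dependency labels.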
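A biaffine arc score of this kind can be sketched in plain Python as follows. The form used here (a bilinear term plus a linear term on the head vector) follows the general recipe of dozat2016deep; the function names, the per-label parameter lists and the toy values are assumptions made for illustration.

```python
def biaffine(h_dep, h_head, U, u):
    """Score one (dependent, head) pair: h_head^T U h_dep + h_head^T u."""
    bilinear = sum(h_head[a] * U[a][b] * h_dep[b]
                   for a in range(len(h_head))
                   for b in range(len(h_dep)))
    linear = sum(x * w for x, w in zip(h_head, u))
    return bilinear + linear


def labeled_scores(h_dep, h_head, Us, us):
    """One score per dependency label, using one (U, u) pair per label."""
    return [biaffine(h_dep, h_head, U, u) for U, u in zip(Us, us)]


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
arc = biaffine([1.0, 2.0], [3.0, 4.0], identity, [0.5, 0.5])
per_label = labeled_scores([1.0, 2.0], [3.0, 4.0],
                           [identity, zeros], [[0.5, 0.5], [1.0, 0.0]])
```

In a real parser these loops would be a single batched matrix product; the explicit sums are only meant to make each term visible.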

When all scores are ready, a softmax function is used at each position for all candidate heads and labels, and then the normalized scores are used for decoding and training. For decoding, we exploit the MST algorithm to ensure tree-structural outputs. For training, we accumulate the cross-entropy loss at the word-level by treating the normalized scores as prediction probabilities. The reader is referred to dozat2016deep for more details.
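The normalization and training loss can be sketched as below, assuming one row of candidate-head scores per word and a gold head index per word. The names are illustrative, and the MST decoding step is not shown.

```python
import math


def softmax(scores):
    """Normalize a list of scores into probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def head_loss(score_rows, gold_heads):
    """Word-level cross-entropy: sum of -log p(gold head) over all words."""
    loss = 0.0
    for scores, gold in zip(score_rows, gold_heads):
        loss -= math.log(softmax(scores)[gold])
    return loss


# Two words, each with three candidate heads; gold heads are 0 and 2.
rows = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss = head_loss(rows, [0, 2])
```

The label loss would be computed the same way over the $L$ label scores of each gold arc.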

4 Code-Mixed Treebank Translation

We derive code-mixed trees from source dependency trees by partial translation, projecting words and the corresponding dependencies having high-confidence alignments with machine-translated target sentences. Our approach assumes that sentence level translations and alignment probabilities are available. The motivation is to reduce noise induced by problematic word alignments.

We adopt the word-level alignment strategy, which has been demonstrated to be as effective as phrase-level alignment yet much simpler tiedemann2014treebank; tiedemann2015improving; tiedemann2016synthetic. Given a source sentence $e_1 \cdots e_n$ and its target language translation $f_1 \cdots f_m$, $a(f_j \mid e_i)$ denotes the probability of word $f_j$ being aligned with $e_i$ ($i \in [0, n]$ and $j \in [1, m]$), where $e_0$ denotes a null word, indicating the no-alignment probability for one target word.

The translation process can be conducted by three steps:

  • word substitution, which incrementally substitutes the source words with the target translations;

  • word deletion, which removes several unaligned source words;

  • sentence reordering, which reorders the partially translated sentence, ensuring local target language word order.

Algorithm 1 shows pseudocode for code-mixing tree translation, where lines 1-8 denote the first step, lines 9-16 denote the second and line 17 denotes the last step.

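Input: $e_1 \cdots e_n$, $f_1 \cdots f_m$, $a(f_j \mid e_i)$, source tree $T$, $\lambda$

1:  Let $F = \emptyset$.
2:  for $j \in [1, m]$ do
3:      $a_j \leftarrow \arg\max_i a(f_j \mid e_i)$; add $(f_j, e_{a_j})$ with score $a(f_j \mid e_{a_j})$ to $F$
4:  end for
5:  $F' \leftarrow$ Select($F$, $\lambda$)
6:  for $(f_j, e_{a_j}) \in F'$ do
7:      $T \leftarrow$ Substitute($T$, $e_{a_j}$, $f_j$)
8:  end for
9:  Let $E$ be the source words in $T$ with no aligned target word.
10:  for $e_i \in E$ do
11:      compute the retention score $r(e_i) = \sum_j a(f_j \mid e_i)$
12:  end for
13:  $E' \leftarrow$ Select($E$, $\lambda$), taking the lowest retention scores
14:  for $e_i \in E'$ do
15:      $T \leftarrow$ Delete($T$, $e_i$)
16:  end for
17:  $T \leftarrow$ Reorder($T$)
Algorithm 1 The process of tree translation.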

There is a hyper-parameter $\lambda$ to control the overall ratio of translation. The function Select($S$, $\lambda$) obtains a subset of $S$ by ratio $\lambda$, keeping the elements with the top values as indicated by the scores inside $S$. If $\lambda = 0$, the result is still the source language dependency tree, since no source word is substituted or deleted. In this condition, our method is equal to guo2015cross by bridging source and target dependency parsing with universal word representations. If $\lambda = 1$, the resulting tree is a fully translated target dependency tree, as all words are target language words produced by translation. In this setting, our method is equal to tiedemann2016synthetic, where the only difference is our baseline parsing model. Thus our method can be regarded as a generalization of both source-side training guo2015cross and fully translated target training tiedemann2016synthetic with fine-grained control over translation confidence.
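A ratio-based selection of this kind could look like the following sketch. It assumes candidates are given as (item, score) pairs and that the subset size is the rounded product of the ratio and the candidate count; the rounding rule and the `highest` flag are assumptions, and the probabilities are made up.

```python
def select(scored_items, ratio, highest=True):
    """Return the top `ratio` fraction of (item, score) pairs, ranked by score."""
    k = round(ratio * len(scored_items))
    ranked = sorted(scored_items, key=lambda pair: pair[1], reverse=highest)
    return [item for item, _ in ranked[:k]]


# Hypothetical alignment confidences for four target words.
candidates = [("vi", 0.9), ("med", 0.2), ("hittat", 0.8), ("tiden", 0.6)]

nothing = select(candidates, 0.0)      # ratio 0: keep the source tree as is
half = select(candidates, 0.5)         # translate only the confident half
everything = select(candidates, 1.0)   # ratio 1: full translation
```

Passing `highest=False` gives the variant needed for word deletion, where the words with the lowest scores are selected.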

(a) one-to-one.
(b) many-to-one.
Figure 3: Examples of word substitution, where the thicker lines indicate more confident alignments.

4.1 Word Substitution

Word substitution is the key step for producing a target treebank. We first obtain the most confidently aligned source word $e_{a_j}$ for each target word $f_j$ as well as their alignment probability $a(f_j \mid e_{a_j})$, as shown by line 3 in Algorithm 1. Then we sort the target words by these alignment probabilities, choosing the top-ranked words with the highest alignment probabilities for substitution. The sorting and choosing is reflected in line 5 of Algorithm 1. Finally, for each chosen word $f_j$ and its aligned source word $e_{a_j}$, we replace the source word $e_{a_j}$ by $f_j$, as shown by line 7 in Algorithm 1.

One key of the substitution is to maintain the corresponding dependency structures. If $f_j$ and $e_i$ bear a one-to-one mapping, with no other target word being aligned with $e_i$, the source dependencies are kept unchanged, as shown by Figure 3(a). If two or more target words are aligned to $e_i$, we simply link all these words to the head of $e_i$, with the same dependency label as the original dependency arc. Figure 3(b) illustrates this condition. Both the Swedish words “under” and “tiden” are headed to “hittat” (the Swedish translation of the English word “found”) by the dependency label “advmod” inherited from the source English side. Note that the POS tags of the substituted words are the same as the corresponding source words.
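The substitution rule, including the many-to-one case, can be sketched as follows. The token layout (dictionaries with string ids), the id scheme for extra copies and the English word “meanwhile” are hypothetical choices for illustration; only the Swedish words and the “advmod” label come from the example above.

```python
def substitute(tree, chosen):
    """Replace selected source tokens by their aligned target words.

    tree   : list of token dicts with keys id, form, head, label, pos, lang
    chosen : dict mapping a source token id to the list of target words
             aligned to it (one word, or several in the many-to-one case)
    Every inserted target word keeps the head, label and POS tag of the
    source token it replaces.
    """
    result = []
    for tok in tree:
        forms = chosen.get(tok["id"])
        if not forms:
            result.append(dict(tok))  # untranslated source word stays
            continue
        for k, form in enumerate(forms):
            new = dict(tok, form=form, lang="sv")
            if k > 0:
                new["id"] = tok["id"] + "." + str(k)  # extra copy, new id
            result.append(new)
    return result


tree = [
    {"id": "1", "form": "meanwhile", "head": "2", "label": "advmod",
     "pos": "ADV", "lang": "en"},
    {"id": "2", "form": "found", "head": "0", "label": "root",
     "pos": "VERB", "lang": "en"},
]
mixed = substitute(tree, {"1": ["under", "tiden"], "2": ["hittat"]})
```

Keeping the original id on the first copy means that any children of the replaced word remain attached without further bookkeeping, which is a simplification made by this sketch.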

4.2 Word Deletion

There can be source words to which no target word is aligned. These words are typically functional words belonging to source language only, such as “the”, “are” and “have”. We remove such words to produce dependency trees that are close in syntax to the target language.

In particular, we accumulate the probabilities $a(f_j \mid e_i)$ for each source word $e_i$ that has no aligned target word:

$$r(e_i) = \sum_{j=1}^{m} a(f_j \mid e_i)$$

where we traverse all target words to sum their alignment probabilities with $e_i$. The value of $r(e_i)$ can be interpreted as the confidence score of retention. Words with lower retention scores are preferred for deletion, as these words have lower probabilities of aligning with some word of the target language sentence. Concretely, we collect all source words with no aligned target words, compute their retention values, and then select a subset of these words with the lowest retention values according to the hyper-parameter $\lambda$ (line 13 in Algorithm 1). Finally we delete all the selected words (line 15 in Algorithm 1).

Figure 4 shows an example of word deletion. The two words “are” and “being” both have no aligned words on the other side, and “are” has a lower retention score than “being” (both words are only related to the Swedish word “med”, but the value for “are” is slightly lower). Thus the source word “are” is preferred for deletion. In most cases, the deleted words are leaf nodes, which can be detached from the resulting dependency tree and deleted directly. In case of exceptions, we simply reset the heads of the child nodes of the deleted word $e_i$ to the head of $e_i$ instead. For example, if $e_k$ is headed by $e_i$ and $e_i$ is headed by $e_h$, the dependency of $e_k$ is changed so that it is headed by $e_h$.
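Both parts of this step, the retention score and the head reassignment for non-leaf deletions, can be sketched as below. The probability matrix layout, the helper names and all numbers are hypothetical.

```python
def retention(i, align_prob):
    """Sum the alignment probabilities of source word i over all target words.

    align_prob[j][i] is assumed to hold the probability of target word j
    being aligned with source word i.
    """
    return sum(row[i] for row in align_prob)


def delete_word(heads, i):
    """Remove token i from a {token: head} map.

    Children of the deleted token are reattached to its head, so the
    result is still a tree.
    """
    parent = heads[i]
    return {tok: (parent if head == i else head)
            for tok, head in heads.items() if tok != i}


# Made-up probabilities for two unaligned source words (columns: are, being)
# against two target words (rows).
probs = [[0.125, 0.25],
         [0.0, 0.125]]
scores = {"are": retention(0, probs), "being": retention(1, probs)}
to_delete = min(scores, key=scores.get)

# Non-leaf deletion: "c" depends on "a", which depends on "b".
pruned = delete_word({"a": "b", "b": "ROOT", "c": "a"}, "a")
```

For a leaf word the dictionary comprehension simply drops the entry, since no other token points to it.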

Figure 4: An example of word deletion.
(a) full sentence
(b) two spans
Figure 5: Two examples of sentence reorder.

4.3 Sentence Reordering

Continuous target spans are reordered to make the final code-mixed sentence contain grammatical phrases in the target language. Figure 5(a) shows one example of full sentence reordering. We can see that the word order obtained by word-level substitutions on the source words is different from the order of the machine-translated sentence. Thus we adjust the leaf nodes, letting the word order strictly follow the machine-translated sentence order. For example, the word “vi” is moved from the first position into the third position, and similarly the word “mu” is moved from the last position into the first position.

Concretely, we perform the word reordering at the span level, extracting all the continuous spans of target words, because the target language words may be interrupted by source language words. Then we reorder the words in each span according to their order in the machine translation outputs. Figure 5(b) shows another example, where there are two spans separated by the English word “the”. Each span is reordered individually. We do not consider inconsistent orders between the spans in this work. Note that this step does not change any dependency arc between words.
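Span-level reordering can be sketched as follows, assuming each token is a pair of its surface form and its position in the machine translation output, with `None` marking an untranslated source word. The representation and the placeholder forms are assumptions of this sketch.

```python
def reorder_spans(tokens):
    """Sort every maximal run of target words by machine-translation position.

    tokens : list of (form, mt_position) pairs; mt_position is None for
             source words, which act as span boundaries and never move.
    """
    result, span = [], []
    for tok in tokens:
        if tok[1] is None:
            result.extend(sorted(span, key=lambda t: t[1]))
            span = []
            result.append(tok)
        else:
            span.append(tok)
    result.extend(sorted(span, key=lambda t: t[1]))
    return result


# Two target spans separated by the untranslated English word "the".
mixed = [("b", 1), ("a", 0), ("the", None), ("d", 3), ("c", 2)]
ordered = reorder_spans(mixed)
```

Only surface positions change here. If tokens carry identifiers for their heads, as in a dependency tree, the arcs stay intact because no identifier is modified.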

5 Experiments

We conduct experiments to verify the effectiveness of our proposed models in this section.

5.1 Settings

Our experiments are conducted on the Google Universal Dependency Treebanks (v2.0) mcdonald2013universal; nivre2016universal, using English as the source language, and choosing six languages, namely Spanish (ES), German (DE), French (FR), Italian (IT), Portuguese (PT) and Swedish (SV), as the target languages. Google Translate (as of October 2018) is used to translate the sentences in the English training set into the other languages. In order to generate high-quality word-level alignments, we merge the translated sentence pairs with the parallel data of EuroParl koehn2005europarl and run the fastAlign tool dyer2013simple on the merged corpus to obtain word alignments.

We use the cross-lingual word embeddings and clusters by guo2016representation for the baseline system. The dimension size of word embeddings is 50 and the number of word clusters across all languages is 256.

For network building and training, we use the same setting as dozat2016deep, including the dimensional sizes, the dropout ratio, as well as the parameter optimization method. We assume that no labeled corpus is available for the target language. Thus training is performed for 50 iterations over the whole training data without early-stopping.

To evaluate dependency parsing performances, we adopt UAS and LAS as the major metrics, which indicate the accuracies of unlabeled dependencies and labeled dependencies, respectively. We ignore the punctuation words during evaluation following previous work. We run each experiment 10 times and report the averaged results.
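The two metrics can be computed as in the sketch below, assuming each word is represented by a (head, label) pair and punctuation is flagged per word. The function name and the toy sentence are illustrative.

```python
def attachment_scores(gold, pred, is_punct):
    """Return (UAS, LAS) in percent, skipping punctuation tokens.

    gold, pred : lists of (head, label) pairs, one per word
    is_punct   : list of booleans marking punctuation words
    """
    total = uas = las = 0
    for (g_head, g_label), (p_head, p_label), punct in zip(gold, pred, is_punct):
        if punct:
            continue
        total += 1
        if g_head == p_head:
            uas += 1
            if g_label == p_label:
                las += 1
    if total == 0:
        return 0.0, 0.0
    return 100.0 * uas / total, 100.0 * las / total


gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (1, "punct")]
uas, las = attachment_scores(gold, pred, [False, False, False, True])
```

In this toy case all three scored heads are correct but one label is wrong, so LAS is lower than UAS, and the wrongly attached punctuation mark has no effect.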

5.2 Models

We compare performances on the following models:

  • Delex mcdonald2013universal: The delexicalized BiAffine model without cross-lingual word embeddings and clusters.

  • Src guo2015cross: The BiAffine model trained on the source English treebank only.

  • PartProj lacroix2016frustratingly: The BiAffine model trained on the corpus by projecting only the source dependencies involving high-confidence alignments into target sentences. Note that the baseline only draws the idea from lacroix2016frustratingly, and the two models are in fact significantly different.

  • Tgt tiedemann2016synthetic: The BiAffine model trained on the fully translated target treebank only.

  • Src+Tgt: The BiAffine model trained on the combination dataset of the source and fully translated target treebanks.

  • Mix: The BiAffine model trained on the code-mixed treebank only.

  • Src+Mix: The BiAffine model trained on the combination dataset of the source and code-mixed treebanks.

The Src and Tgt methods have been discussed in Section 4. The PartProj model is another way to leverage imperfect word alignments lacroix2016frustratingly. The training corpus of PartProj may be incomplete dependency trees with a number of words missing heads, because no word is deleted from machine translation outputs. The POS tags of words in PartProj with low-confidence alignments are obtained by a supervised POS tagger yang2018design trained on the corresponding universal treebank.

5.3 Development Results

We conduct several developmental experiments on the Swedish dataset to examine important factors to our model.

Figure 6: The LAS relative to the translation ratio $\lambda$.

Model        UAS    LAS
Src          79.52  69.54
Tgt          79.34  69.96
Mix          80.33  71.29
Src + Tgt    80.12  71.16
Src + Mix    80.91  71.73
Table 1: Experiments of corpus mixing.

Model                      UAS    LAS
Mix                        80.33  71.29
w/o Sentence Reordering    79.79  70.47
w/o Word Deletion          79.82  70.64
w/o Both                   79.46  69.59
Table 2: Ablation experiments.

5.3.1 Influence of The Translation Ratio

Our model has an important hyper-parameter $\lambda$ to control the percentage of translation. Figure 6 shows the influence of this factor, where the percentages increase from 0 to 1 by intervals of 0.1. A $\lambda$ of 0 gives our baseline by using the source treebank only. As $\lambda$ grows, more source words are translated into the target. We can see that the performance improves after translating some source dependencies into the target, demonstrating the effectiveness of syntactic transfer. The performance reaches a peak at an intermediate value of $\lambda$, after which there is a significant drop as $\lambda$ grows further. This can be because the newly added dependency arc projections are mostly noisy. This sharp decrease indicates that noise from low-confidence word alignments can have a strong impact on the performance. According to the results, we adopt the peak value of $\lambda$ for code-mixed treebanking.

Lang.  Delex         PartProj      Src           Tgt           Src + Tgt     Mix           Src + Mix
       UAS    LAS    UAS    LAS    UAS    LAS    UAS    LAS    UAS    LAS    UAS    LAS    UAS    LAS
DE     64.10  53.77  69.90  61.28  66.87  57.46  70.84  62.30  72.41  63.74  71.41  63.46  72.78  64.38
ES     71.53  63.33  75.81  66.83  75.63  65.85  76.49  67.39  77.00  67.95  81.18  71.80  81.44  71.66
FR     75.13  67.26  75.54  67.63  78.13  70.63  76.91  69.39  78.75  71.17  83.20  76.32  83.77  76.48
IT     77.71  69.27  77.71  69.27  81.11  72.83  79.30  71.65  81.56  74.09  85.30  77.43  86.13  78.38
PT     74.03  67.70  79.44  71.30  77.37  69.36  78.32  70.67  79.73  71.84  83.54  75.34  84.05  75.89
AVG    72.50  64.27  75.68  67.26  75.82  67.23  76.37  68.28  77.89  69.76  80.93  72.87  81.63  73.36
Table 3: Final results.

5.3.2 Mixing with Source TreeBank

We investigate the effectiveness of the source treebank by merging it into the translated treebanks. First, we show the model performances of Src, Tgt and Mix, which are trained on the individual treebanks, respectively. Then we merge the source treebank with the two translated treebanks, and show the results of training on the merged corpora. Table 1 shows the results. According to the results, we can find that the source treebank is complementary with the translated treebanks. Noticeably, although Src + Mix gives the best performance, its improvement over Mix is smaller than that of Src + Tgt over Tgt. This is reasonable as the code-mixed treebank contains relatively more source treebank content than the fully translated target treebank.

5.3.3 Ablation Studies

The overall translation is conducted in three steps as mentioned in Section 4, where the first step, word substitution, is compulsory, and the remaining two steps aim to build better mixed dependency trees. Here we conduct ablation studies to test the effectiveness of word deletion and sentence reordering. Table 2 shows the experimental results. We can see that both steps are important for dependency tree translation. Without sentence reordering and word deletion, the Mix model shows decreases of 0.82 and 0.65 on LAS, respectively. If both are removed, the performance is only comparable with the baseline Src model (see Table 1).

5.4 Final Results

We show the final results of our proposed models in Table 3. As shown, the model Tgt gives better averaged performance compared to Src. However, its results on French and Italian are slightly worse, which indicates that noise from translation impacts the quality of the projected treebank. The model Mix gives much better performance compared with Src, demonstrating the effectiveness of structural transfer. Mix also outperforms Tgt significantly on average, on both UAS and LAS. The best setting is the model Src + Mix, trained on the combined corpus of the source and code-mixed treebanks, which gives better performance than the code-mixed treebank alone.

By comparing with the delexicalized model Delex, we can see that lexicalized features are highly useful for cross-lingual transfer. For the PartProj model, we conduct preliminary experiments on Swedish to tune the ratio of the projected dependencies. The results show that the UAS difference is very small for ratios between 0.9 and 1.0, and the performance degrades significantly as the ratio decreases below 0.9. The observation indicates that this method is probably not effective for filtering low-confidence word alignments. The final results confirm our hypothesis. As shown in Table 3, the PartProj model gives only comparable performance with Src. One possible reason may be the unremoved target words (if the words are removed, the PartProj model with ratio 1.0 will be identical to Tgt), which have previously been shown to be noisy tiedemann2016synthetic.

TreeBank Transferring
This 72.78 81.44 83.77 86.13 84.05
Guo15 60.35 71.90 72.93
Guo16 65.01 79.00 77.69 78.49 81.86
TA16 75.27 76.85 79.21
Annotation Projection
MX14 74.30 75.53 70.14 77.74 76.65
RC15 79.68 80.86 82.72 83.67 82.07
LA16 75.99 78.94 80.80 79.39
TreeBank Transferring + Annotation Projection
RC17 82.1 82.6 83.9 84.4 84.6
Table 4: Comparison with previous work (UAS).

5.5 Comparison with Previous Work

We compare our method with previous work in the literature. Table 4 shows the results, where the UAS values are reported. Our model denoted by This refers to the model of Src + Mix. Note that these models are not directly comparable due to the setting and baseline parser differences. The first block shows several models by directly transferring gold-standard source treebank knowledge into the target side, including the models of Guo15 guo2015cross, Guo16 guo2016representation and TA16 tiedemann2016synthetic. Our model gives the best performance with one exception on the German language. One possible reason may be that TA16 has exploited multiple sources of treebanks besides English.

The second block shows representative annotation projection models, including MX14 ma2014unsupervised, RC15 rasooli2015density and LA16 lacroix2016frustratingly. The models of annotation projection can be complementary with our work, since they build target training corpora from raw parallel texts. The best-performing results of the RC17 model rasooli2017cross have demonstrated this point, as the model can be regarded as a combination of the dictionary-based treebank translation zhao2009cross (a method that has been demonstrated to be worse than TA16 in tiedemann2016synthetic) and RC15.

5.6 Analysis

We conduct experimental analysis on the Spanish (ES) dataset to show the differences between the models of Src, Tgt and Mix.

Figure 7: Performance relative to POS tags (F-score).

5.6.1 Performance Relative to POS Tags

Figure 7 shows the F-scores of labeled dependencies on different POS tags. We list the six most representative POS tags. The Mix model achieves the best F-scores on 5 of 6 POS tags, with the only exception on the tag ADV, where it has no significant difference from the Tgt model. The Mix and Tgt models are much better than the Src model as a whole, especially on the POS tag ADJ where an increase of over 20% has been achieved. In addition, we find that the Src model can significantly outperform the Tgt model on ADP and DET. For Spanish, ADP words are typically “de”, “en”, “con”, etc., which behave similarly to English words such as “’s”, “to” and “of”. The Spanish words of DET include “el”, “la”, “su”, etc., which are similar to English words such as “the” and “a”. These words are highly ambiguous for automatic word alignment. The results indicate that our Mix model can better handle this word alignment noise, mitigating its negative influence on treebank translation, while the Tgt model suffers from such noise.

Figure 8: Performance relative to arc distances (F-score).

5.6.2 Performance Relative to Arc Distances

Figure 8 shows the F-scores of labeled dependencies by different arc distances. In particular, we treat the root type as one special case. According to the results, the Mix model performs the best over all distances, indicating its effectiveness on treebank transfer. The Tgt model achieves better performance than the Src model with one exception on distance 2. We look further into the dependency patterns of distance 2 arcs, finding that a single dependency arc pattern accounts for over 30% of them, and it is the major source of errors. The finding is consistent with that on POS tags, indicating the effectiveness of the code-mixed treebank in handling noise. In addition, as the distance increases the performance drops gradually. The F-score of the root dependency is the highest.
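One way to compute such distance-bucketed F-scores is sketched below. The exact definition is an assumption: precision is taken over predicted arcs of a given distance, recall over gold arcs of that distance, and an arc counts as correct only when both head and label match. Words are numbered from 1 and head 0 denotes the root.

```python
from collections import Counter


def f_by_distance(gold, pred):
    """Labeled F-score per arc distance; the root arc is its own bucket."""
    correct, n_gold, n_pred = Counter(), Counter(), Counter()
    pairs = zip(gold, pred)
    for i, ((g_head, g_label), (p_head, p_label)) in enumerate(pairs, start=1):
        g_dist = "root" if g_head == 0 else abs(g_head - i)
        p_dist = "root" if p_head == 0 else abs(p_head - i)
        n_gold[g_dist] += 1
        n_pred[p_dist] += 1
        if (g_head, g_label) == (p_head, p_label):
            correct[g_dist] += 1
    scores = {}
    for dist in n_gold:
        precision = correct[dist] / n_pred[dist] if n_pred[dist] else 0.0
        recall = correct[dist] / n_gold[dist]
        total = precision + recall
        scores[dist] = 2 * precision * recall / total if total else 0.0
    return scores


gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (1, "obj")]
scores = f_by_distance(gold, pred)
```

In the toy sentence, the third word is attached at distance 2 instead of distance 1, which lowers the recall of the distance-1 bucket while leaving the root bucket untouched.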

6 Conclusion

We proposed a new treebank translation method for unsupervised cross-lingual dependency parsing. Unlike previous work, which adopts full-scale translation for source dependency trees, we investigated partial translation instead, producing synthetic code-mixed treebanks. The method can better leverage imperfect word alignments between source and target sentence pairs, translating only high-confidence source words, thus generating dependencies of high quality. Experimental results on Universal Dependency Treebanks v2.0 showed that partial translation is highly effective, and code-mixed treebanks can give significantly better results than full-scale translation.

Our method is complementary with several other methods for cross-lingual transfer, such as annotation projection, and thus can be further integrated with these methods.


We thank all anonymous reviewers for their valuable comments. Several suggestions are not integrated in this version, i.e., more experiments on really low-resource languages, and detailed analysis on more languages. We will supplement later on the webpages of the authors. This work is supported by National Natural Science Foundation of China (NSFC) grants 61602160, U1836222 and 61672211.