Cross-lingual learning, which explores knowledge transfer between different languages, has been studied for various NLP tasks Guo et al. (2015); Zhou et al. (2016); Zoph et al. (2016); Kim et al. (2017). It is a challenging problem with tremendous practical value. On one hand, many NLP problems have achieved remarkable accuracy in resource-rich languages thanks to the availability of large-scale annotated data, while performance on low-resource languages still lags behind in the absence of abundant annotations. Cross-lingual techniques are an excellent means of reducing the requirement for annotated data by transferring knowledge from resource-rich languages to low-resource ones. On the other hand, cross-lingual learning is challenging because different languages diverge significantly at the levels of morphology, syntax, and semantics. It is difficult to learn language-invariant features that transfer robustly to distant languages.
Prior work on cross-lingual transfer mainly focused on the word level by inducing multilingual invariant word embeddings Xiao and Guo (2014); Guo et al. (2016); Sil et al. (2017). However, words are not independent in sentences; their interactions and combinations form larger linguistic units, known as context. Understanding context is vital for most problems in NLP as context influences a word’s meaning. A successful architecture for NLP usually contains mechanisms to contextualize words and compose higher-level features to understand other parts of sentences. We refer to these mechanisms as contextual encoders. In this paper, we explore transfer at the contextual encoder level, where we consider how to induce combinatorial features that are language-invariant.
With the development of representation learning and neural networks, Recurrent Neural Networks (RNNs) have become a prevalent encoder for many NLP tasks and have demonstrated compelling performance McCann et al. (2017); Peters et al. (2018). However, we hypothesize that, in the cross-lingual setting, the sequential nature of RNNs introduces the risk of encoding language-specific word ordering information and thus of overfitting to a specific word order. To test our hypothesis, we investigate the flexibility of contextual encoders for cross-lingual transfer.
The self-attention based architecture “Transformer” has been proposed recently and shown to be effective in various NLP tasks Vaswani et al. (2017); Liu et al. (2018); Kitaev and Klein (2018). While keeping a strong modeling capability, it is more flexible in capturing contextual information than RNNs since it does not explicitly model the ordering information, although positional information is still indispensable for the Transformer to be successful Vaswani et al. (2017). To this end, we explore flexible ways to utilize relative positional information for the Transformer encoder.
In this work, we quantify language distances in terms of word order typology and systematically study the transferability of two leading neural architectures as the contextual encoder. We evaluate the cross-lingual transferability of the contextual encoders on dependency parsing, primarily because of the availability of unified annotations across a broad spectrum of languages Nivre et al. (2018). Besides, word order typology has been found to influence dependency parsing Östling (2015). Moreover, parsing is a low-level NLP task Hashimoto et al. (2016) that can benefit many downstream applications McClosky et al. (2011); Gamallo et al. (2012); Jie et al. (2017).
We conduct the evaluations on 31 languages across a broad spectrum of language families, as shown in Table 1. Our empirical results show that attention-based encoding and decoding perform better than their RNN-based counterparts, especially when the source and target languages are distant.
2 Quantifying Language Distances
|Language family|Languages|
|---|---|
|Afro-Asiatic|Arabic (ar), Hebrew (he)|
|IE.Germanic|Danish (da), Dutch (nl), English (en), German (de), Norwegian (no), Swedish (sv)|
|IE.Romance|Catalan (ca), French (fr), Italian (it), Portuguese (pt), Romanian (ro), Spanish (es)|
|IE.Slavic|Bulgarian (bg), Croatian (hr), Czech (cs), Polish (pl), Russian (ru), Slovak (sk), Slovenian (sl), Ukrainian (uk)|
|Uralic|Estonian (et), Finnish (fi)|
Word order can be a significant distinctive feature to differentiate languages Dryer (2007). Since word order features can especially influence parsing Östling (2015), we wonder whether we can distinguish languages or design a “language distance” measure based on word order. We investigate this empirically by collecting the dependency direction statistics of various dependency types from the datasets.
For the datasets, we use the Universal Dependencies (UD) Treebanks (v2.2) Nivre et al. (2018) in this paper. We select 31 languages for evaluation and analysis, with the general selection criterion that the total number of tokens in the treebanks of a language exceeds 100K. We group these languages by their language families in Table 1. Detailed statistical information of the selected languages and treebanks can be found in Appendix A.
We look at finer-grained dependency types than the 37 universal dependency labels (http://universaldependencies.org/u/dep/index.html) in UD v2, by augmenting the dependency labels with the universal part-of-speech (POS) tags of the head and modifier nodes. We use triplets of “(ModifierPOS, HeadPOS, DependencyLabel)” for the augmented types. With this, we can investigate language differences in a fine-grained way. For example, “(PRON, VERB, obj)” denotes the dependency type between verbs and their pronoun objects, for which specific languages (like French) may have different ordering compared to the placement of plain noun objects.
We only select the most common types, since statistics on rare types can be unstable. We select 52 types; please refer to Appendix E for details. For each dependency type, we collect the relative frequencies of the dependency directions for each language. (For dependency relations, there are only two choices of direction: left (modifier before head) and right (modifier after head). We also tried finer-grained dependency distances, but found that the simple binary directional one was good enough.) By concatenating the directional relative frequencies of all concerned types, we obtain a word-ordering vector for each language. In Figures 1 and 2, we show the hierarchical clustering and t-SNE visualization Maaten and Hinton (2008) with the word-ordering vectors of the 31 languages. By using word ordering information alone, we can almost recover the grouping of their original language families. It should be noted that the outliers, German (de) and Dutch (nl), are the ones that have verb-second (V2) word ordering and differ from the other, more English-like Germanic languages.
This indicates that word ordering is a major aspect of how languages differ and we can extract useful feature vectors from ordering information. We use these word-ordering vectors to define word-ordering distance measurements in our analysis (Section 4.3). Moreover, we take word ordering as a major factor in our model designs.
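As an illustration of how such a word-ordering vector could be assembled, the following is a minimal sketch and not the implementation used in the paper. It assumes each dependency arc is given as a tuple `(modifier_pos, head_pos, label, modifier_index, head_index)`, stores one value per selected type (the relative frequency of the left direction, which determines the right-direction frequency), and falls back to 0.5 for a type that never occurs. The function name, the input layout, and the fallback value are all assumptions made for this example.

```python
from collections import Counter

def word_ordering_vector(arcs, selected_types):
    """Return, for each selected dependency type, the relative frequency of
    left-direction arcs (modifier before head).

    arcs: iterable of (modifier_pos, head_pos, label, modifier_index, head_index)
    selected_types: ordered list of (modifier_pos, head_pos, label) triplets
    """
    left = Counter()
    total = Counter()
    for mod_pos, head_pos, label, mod_idx, head_idx in arcs:
        dep_type = (mod_pos, head_pos, label)
        total[dep_type] += 1
        if mod_idx < head_idx:  # modifier precedes its head: left direction
            left[dep_type] += 1
    # Assumption: a type unseen in this language gets the neutral value 0.5.
    return [left[t] / total[t] if total[t] else 0.5 for t in selected_types]

# Toy arcs (hypothetical data, not taken from any treebank).
toy_arcs = [
    ("ADP", "NOUN", "case", 1, 2),
    ("ADP", "NOUN", "case", 4, 5),
    ("ADP", "NOUN", "case", 7, 6),
    ("ADJ", "NOUN", "amod", 2, 3),
]
toy_types = [("ADP", "NOUN", "case"), ("ADJ", "NOUN", "amod"), ("PRON", "VERB", "obj")]
toy_vector = word_ordering_vector(toy_arcs, toy_types)
```

In this toy case two of the three “case” arcs are left-directed, so the first entry is 2/3; the single “amod” arc gives 1.0, and the unseen type gets the fallback 0.5.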
Our basic setting is to transfer the knowledge of one source language syntax to target languages without any explicit target syntactic annotations. Therefore, when designing a model, we want it to capture less specific information of the source language, especially concerning word order.
With this in mind, we briefly describe the three major components for our cross-lingual dependency parser: (1) input lexicon representations, (2) contextual encoders, and (3) structured decoders.
3.1 Input Representations
The basis of cross-lingual transfer parsing is representing the inputs of different languages in a shared space so that models trained only in the source language can be transferable. Firstly, Universal POS tags in UD already provide valuable shared information about the inputs. Moreover, we can also obtain shared lexicalized information with multilingual embeddings. Recently, works on cross-lingual embeddings Smith et al. (2017); Conneau et al. (2018) show that word embeddings of different languages can be indeed mapped into a shared space. Therefore, we take these shared lexicon representations with multilingual embeddings and concatenate them with Universal POS embeddings, forming our final input vectors.
3.2 Contextual Encoders
In a sentence, words are not isolated; they interact with each other in complex ways. In our preliminary experiments, we tried models without any contextual encoders, and the results for all target languages turned out to be much worse. This indicates the importance of encoders for capturing shared contextual information.
Considering the sequential nature of languages, an RNN is a natural choice for encoding. However, modeling words one by one in the sequence inevitably encodes word order, some of which can be specific to the source language. To alleviate this problem, we propose to adopt the purely self-attention based Transformer encoder Vaswani et al. (2017) in cross-lingual parsing. It is less sensitive to word order but not necessarily less potent at capturing contextual information, which makes it suitable in our setting. Furthermore, we adopt a modified version of relative position representations Shaw et al. (2018) so that the encoder captures even less positional and ordering information. In the following, we briefly describe the two encoder prototypes that we study.
The original Transformer encoder takes absolute positional embeddings as inputs, which might still capture much positional and ordering information. To mitigate this, we utilize relative position representations instead. We further make a simple modification to it: the original relative position representations in Shaw et al. (2018) discriminate left and right contexts by adding signs to distances, while we only use absolute values of the distances and discard directional information. With this, the model only knows what words are surrounding but cannot tell the directions. We analyze these choices of position representations in Section 4.3.4, which shows that our strategy performs the best.
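The difference between signed and unsigned relative positions can be made concrete with a small sketch. It only builds the matrix of relative-position indices that would be used to look up position embeddings, and it is not the paper's implementation. The clipping threshold `max_distance` follows the clipping idea of Shaw et al. (2018); its value here, like the function name, is an assumption for illustration.

```python
def relative_position_ids(length, max_distance, directional=False):
    """Build a length x length matrix of clipped relative positions.

    Entry [i][j] describes where token j sits relative to token i.
    With directional=True the sign is kept (left contexts are negative);
    with directional=False only the absolute distance is kept, so the
    model cannot tell left from right.
    """
    ids = []
    for i in range(length):
        row = []
        for j in range(length):
            offset = max(-max_distance, min(max_distance, j - i))
            row.append(offset if directional else abs(offset))
        ids.append(row)
    return ids

undirected = relative_position_ids(4, max_distance=2)
directed = relative_position_ids(4, max_distance=2, directional=True)
# undirected[3] == [2, 2, 1, 0], while directed[3] == [-2, -2, -1, 0]
```

With the undirected variant, a token two positions to the left and one two positions to the right receive the same index, which is exactly the information the modified representation discards.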
3.3 Structured Decoders
With the contextual representations from the encoder, the decoder predicts the output tree structures. Generally speaking, there have been mainly two classes of typical approaches for decoding McDonald and Nivre (2007): Transition-based and Graph-based. Intuitively, the first-order graph-based method can be less sensitive to word orders since it cares less about decoding directions. To empirically verify this, we investigate both of them with mono-lingual state-of-the-art models.
The graph-based method benefits from being able to search globally for the best structure under strong independence assumptions. Recently, with a deep biaffine attentional scorer, Dozat and Manning (2017) obtained state-of-the-art results with simple first-order factorization Eisner (1996); McDonald et al. (2005). For a graph-based decoder, we directly take this deep-biaffine-scorer based architecture.
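To make first-order (arc-factored) scoring more tangible, here is a simplified sketch of a biaffine-style arc scorer in plain Python. It assumes that each token already has a "head" vector and a "dependent" vector, and that position 0 is an artificial root. The weight names `U` and `u`, the toy values, and the greedy head selection are assumptions for illustration; greedy selection does not guarantee a well-formed tree, and an actual decoder would typically run a maximum spanning tree or Eisner-style algorithm over the same scores.

```python
def biaffine_arc_scores(head_vecs, dep_vecs, U, u):
    """scores[m][h] = dep_m^T U head_h + u^T head_h for every (modifier, head) pair."""
    n = len(head_vecs)
    scores = []
    for m in range(n):
        row = []
        for h in range(n):
            bilinear = sum(
                dep_vecs[m][a] * U[a][b] * head_vecs[h][b]
                for a in range(len(U))
                for b in range(len(U[0]))
            )
            head_bias = sum(u[b] * head_vecs[h][b] for b in range(len(u)))
            row.append(bilinear + head_bias)
        scores.append(row)
    return scores

def greedy_heads(scores):
    """Pick the highest-scoring head for each non-root token independently."""
    heads = {}
    for m in range(1, len(scores)):  # position 0 is the artificial root
        candidates = [h for h in range(len(scores)) if h != m]
        heads[m] = max(candidates, key=lambda h: scores[m][h])
    return heads

# Toy vectors and weights (hypothetical values).
toy_head_vecs = [[1, 0], [0, 1], [1, 1]]
toy_dep_vecs = [[1, 0], [0, 1], [1, 1]]
toy_U = [[1, 0], [0, 1]]
toy_u = [0, 0]
toy_scores = biaffine_arc_scores(toy_head_vecs, toy_dep_vecs, toy_U, toy_u)
toy_heads = greedy_heads(toy_scores)
```

Because every arc is scored independently of the others, the scorer itself carries no notion of decoding direction, which is the intuition behind expecting the graph-based decoder to be less sensitive to word order.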
Transition-based decoders build the parse tree incrementally with a series of transition actions. Recently, Ma et al. (2018) proposed a top-down transition-based parsing method and also obtained state-of-the-art results. Thus, we select it as our transition-based decoder. Note that in the Stack-Pointer decoder, an RNN records the top-down decoding trajectory and can therefore also be sensitive to word order. We will discuss this in the experiments.
4 Experiments and Analysis
In our experiments, we took English as the source language and the other 30 languages as the targets. In training, we only used the English training and development sets for parameter updating and model selection. In testing, we directly applied the parsing model to target languages with the inputs from target-language embeddings that were projected into the same space as the source language. We started from the 300-dimensional pre-trained embeddings from FastText (https://fasttext.cc/docs/en/crawl-vectors.html) Bojanowski et al. (2017), and projected them into the same space using the offline transformation method of Smith et al. (2017) (https://github.com/Babylonpartners/fastText_multilingual). In preliminary experiments, we also tried the projection method of Conneau et al. (2018), but found similar results. We froze word embeddings throughout training and testing since fine-tuning them might disturb the multilingual alignments.
For other hyper-parameters, we adopted settings similar to those of the Biaffine Graph Parser Dozat and Manning (2017) and Stack-Pointer Parser Ma et al. (2018). Detailed hyper-parameter settings can be found in Appendix B. Throughout our experiments, we only used the first-level UD labels since fine-grained labels might be language-dependent. We adopted a sentence length threshold of 140. For evaluation, performance was measured by unlabeled attachment score (UAS) and labeled attachment score (LAS), with punctuation and symbols (PUNCT, SYM) excluded. We trained each of our cross-lingual models five times with different random seeds and reported average scores.
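The evaluation protocol described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the evaluation script used in the paper: gold tokens are given as `(upos, head, label)` and predictions as `(head, label)`, tokens tagged PUNCT or SYM are skipped, and labels are compared at the first level only, by dropping any subtype after a colon (as in `nmod:poss`). The function name and data layout are hypothetical.

```python
def attachment_scores(gold, pred, excluded_pos=("PUNCT", "SYM")):
    """Compute (UAS, LAS) in percent over non-excluded tokens.

    gold: list of (upos, head_index, label) per token
    pred: list of (head_index, label) per token, aligned with gold
    """
    total = uas_hits = las_hits = 0
    for (upos, gold_head, gold_label), (pred_head, pred_label) in zip(gold, pred):
        if upos in excluded_pos:
            continue  # punctuation and symbols are not scored
        total += 1
        if gold_head == pred_head:
            uas_hits += 1
            # Compare first-level labels only, e.g. "nmod:poss" -> "nmod".
            if gold_label.split(":")[0] == pred_label.split(":")[0]:
                las_hits += 1
    if total == 0:
        return 0.0, 0.0
    return 100.0 * uas_hits / total, 100.0 * las_hits / total

# Toy sentence (hypothetical annotations).
toy_gold = [
    ("DET", 2, "det"),
    ("NOUN", 3, "nsubj"),
    ("VERB", 0, "root"),
    ("NOUN", 3, "obj"),
    ("PUNCT", 3, "punct"),
]
toy_pred = [(2, "det"), (3, "obj"), (0, "root"), (2, "obj"), (4, "punct")]
toy_uas, toy_las = attachment_scores(toy_gold, toy_pred)
```

Here four tokens are scored; three have the correct head and two of those also have the correct label, giving a UAS of 75.0 and a LAS of 50.0.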
As described before, we have a Transformer or RNN encoder, and a Graph-based or Stack-Pointer-based method for decoding. The combination gives us four different models, named in the format of “Encoder” plus “Decoder”. For example, “TransformerGraph” means the model with Transformer encoder and graph-based decoder. We also compare with a baseline shift-reduce transition-based parser, which gave the previous state-of-the-art results for cross-lingual parsing Guo et al. (2015). Since they used older datasets, we re-trained the model on our datasets with their implementation (https://github.com/jiangfeng1124/acl15-clnndep); we also evaluated our models on the older dataset and compared with their results, as shown in Appendix C. Finally, we list the supervised results (using our “RNNGraph” model) for each language as a reference for the upper bound of cross-lingual parsing.
The results on the test sets are shown in Table 2. The languages are ordered by their average evaluation scores over all the models. In preliminary experiments, we found our lexicalized models performed poorly in Chinese (zh) and Japanese (ja). We suspect that one of the reasons is that their embeddings are not well aligned to the English ones. Therefore, we use delexicalized models, where only POS tags are used as inputs. The delexicalized results for Chinese and Japanese are listed in the rows marked with “*”. We found delexicalized models to be better only for Chinese and Japanese; for other languages, they performed worse by about 2 to 5 points. We also tried models without POS, but they were worse by about 10 points. This indicates the importance of universal POS tags in cross-lingual parsing.
Overall, the “TransformerGraph” model performs the best in over half of the languages and beats the runner-up “RNNGraph” by around 1.3 in UAS and 1.2 in LAS on average. When compared with “RNNStack” and “TransformerStack”, the average difference is larger than 1.5 points. This shows that models that capture less word order information can generally be better at cross-lingual parsing. It is a little surprising that “TransformerStack” performs the worst, indicating that the RNN-based decoder in the Stack-Pointer parser might still learn too much source-language-specific information and does not fit well with the self-attention encoder. Compared with the baseline, our better results show the importance of the contextual encoder. Compared with the supervised models, the cross-lingual results are still lower by a large gap, indicating room for improvement.
After taking a closer look, we find an interesting pattern in the results: models with RNN encoders perform better at languages that have higher evaluation scores (upper rows in the table), while for languages that are “distant” from English, “TransformerGraph” performs much better. Such a pattern corresponds to our motivation, that is, the ability to capture word ordering information can impact the performances of cross-lingual learning.
We further analyze how different modeling choices influence the cross-lingual performance. Since we have not touched the training sets for languages other than English, to be more robust (with more data), we evaluate and analyze the results on the original training sets in this sub-section. The results on original training sets are shown in Appendix D and are similar to the ones on the test sets. For English, we use the results on the development set since its training set is exposed to learning. Also, because of possible problems in word embeddings, we use the results of delexicalized models for Chinese and Japanese.
4.3.1 The Overall Pattern
Recall our motivation: we hope that relaxing the ability to capture word ordering information can make the model better at cross-lingual transfer parsing, where the target languages can have word orders that differ from that of the source language. Now we can analyze this point with the word-ordering distance.
Using the word-ordering vectors described in Section 2, we can measure the distances between languages. Since every entry in the word-ordering vector represents the frequency of a binary value, we use Manhattan distance. We refer to this distance measure as word-ordering distance. With word-ordering distances, we can further analyze the relation between word ordering and cross-lingual transferability. For each target language, we have two types of distances when comparing it to English: one is the word-ordering distance, the other is the performance distance, which is the gap in evaluation scores between the target language and English (unless otherwise noted, we simply average UAS and LAS for evaluation scores). The performance distance can represent the general transferability from English to this language. We calculate the Pearson correlation of these two distances on all the concerned languages, and the results turn out to be quite high: the correlation is around 0.9 using the evaluations of any of our four cross-lingual transfer models. This again suggests that word order is indeed an essential factor of cross-lingual transferability.
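The two quantities used in this analysis are straightforward to compute. The sketch below shows the Manhattan distance between two word-ordering vectors and the Pearson correlation between two lists of per-language distances, using only the standard library. The function names and all numbers in the example are hypothetical and are not results from the paper.

```python
import math

def manhattan(u, v):
    """Manhattan (L1) distance between two equally long vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Hypothetical word-ordering vectors for English and one target language.
english_vec = [0.9, 0.1]
target_vec = [0.2, 0.4]
ordering_distance = manhattan(english_vec, target_vec)

# Hypothetical per-language distances: word-ordering vs. performance gap.
ordering_distances = [1.0, 2.0, 3.0]
performance_distances = [2.0, 4.0, 6.0]
correlation = pearson(ordering_distances, performance_distances)
```

In practice, `ordering_distances` would hold one word-ordering distance to English per target language, and `performance_distances` the corresponding gap in evaluation score.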
Furthermore, we separately analyze the encoding and decoding modules since we have two architecture choices for both of them. When examining one module, we take the maximum evaluation scores of all the architectures in the other module. For example, when comparing RNN and Transformer encoders, we take the best evaluation scores of “RNNGraph” and “RNNStack” for RNN and the best of “TransformerGraph” and “TransformerStack” for Transformer. Figure 3 shows the score differences of encoding and decoding architectures against the languages, which are sorted by their word-ordering distances to English. For both the encoding and decoding modules, we observe a similar overall pattern: the architectures that are less sensitive to word ordering (Transformer for encoding and the Graph-based method for decoding) in general perform better than their alternatives in the languages that are further from the source language English. On the other hand, for some languages that are closer to English, models with RNNs (RNN encoder and Stack-Pointer decoder) perform better, possibly benefiting from being able to capture similar word ordering information.
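The per-module comparison described above amounts to taking, for each language, the best score within each group of models and subtracting. A minimal sketch follows, with hypothetical scores for a single language; the helper names are assumptions for this example.

```python
def best_of(scores, model_names):
    """Best evaluation score among the named models."""
    return max(scores[name] for name in model_names)

def module_difference(scores, group_a, group_b):
    """Best score in group_a minus best score in group_b for one language."""
    return best_of(scores, group_a) - best_of(scores, group_b)

# Hypothetical evaluation scores of the four models on one target language.
toy_scores = {
    "RNNGraph": 60.0,
    "RNNStack": 61.5,
    "TransformerGraph": 63.0,
    "TransformerStack": 59.0,
}

# Encoder comparison: Transformer vs. RNN, maximizing over the decoder.
encoder_diff = module_difference(
    toy_scores,
    ["TransformerGraph", "TransformerStack"],
    ["RNNGraph", "RNNStack"],
)

# Decoder comparison: Graph vs. Stack-Pointer, maximizing over the encoder.
decoder_diff = module_difference(
    toy_scores,
    ["RNNGraph", "TransformerGraph"],
    ["RNNStack", "TransformerStack"],
)
```

A positive difference means the less order-sensitive choice (Transformer or Graph) is ahead on that language; plotting these differences against word-ordering distance gives the kind of view the analysis relies on.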
4.3.2 On Dependency Types
Moreover, we compare the results on specific dependency types for concrete examples. (We also provide some detailed comparisons on Czech in Appendix F as the example of a specific language.) For a specific type, we sort the languages by the relative frequencies of left-direction (modifier before head) and plot the evaluation differences for encoding and decoding modules. Figure 8 shows four typical example types: Adposition and Noun, Adjective and Noun, Auxiliary and Verb, and Object and Verb. In Figure 8(a), we examine the “case” dependencies between adpositions and nouns. The pattern is similar to the overall pattern: for the languages which mainly use prepositions as in English, different models perform similarly, while for the languages which use postpositions (Japanese, Hindi, Finnish, Estonian), the models that capture less word ordering information get better results. The patterns of adjective modifier (Figure 8(b)) and auxiliary (Figure 8(c)) are also similar.
On dependencies between object nouns and verbs, although models that are more flexible on word order, in general, perform better, the pattern diverges from what we expect. There can be several possible explanations for this. Firstly, the tokens which are noun objects of verbs account for only about 3.1% of all tokens on average. Also, if considering just this specific dependency type, the correlation between frequency distances and performance differences is 0.64, which is far less than the 0.9 obtained when considering all types. Therefore, although Verb-Object ordering is a typical example, we should not take it as the whole story of word order. Secondly, Verb-Object dependencies are usually among the difficult types to decide. They can sometimes be long-ranged and have complex interactions with other words. Therefore, merely modeling less ordering information may have complicated effects. Moreover, although our relatively-positional Transformer does not explicitly encode word positions, it may still capture specific positional information with relative distances. For example, the words in the middle of the sentence can have different distance patterns from the words at the beginning or the end. With this knowledge, the model can still prefer the pattern where a verb is in the middle, as in English’s Subject-Verb-Object ordering, and may find sentences in Subject-Object-Verb languages strange. It will be interesting to explore further how to weaken or remove this bias.
4.3.3 On Dependency Distances
We further analyze dependency lengths and directions. Here, we combine dependency length and direction into a dependency distance d, using negative signs for dependencies with left-direction (modifier before head) and positive signs for right-direction (head before modifier). We use the average of UAS and LAS as the evaluation score and show the model performances (recalls on gold distance) in Figure 9. The pattern looks strange at first glance: on dependency distance d = 1, all transfer models perform quite badly. This might be explained by the relative frequencies of dependency distances as shown in Table 3. There is a discrepancy between English and the cross-language average at d = 1. About 80% of the dependencies of length 1 in English are in the left direction (modifier before head), while overall the other languages have more right-direction dependencies at d = 1. Since the models are only trained on English, they might be less confident in predicting d = 1 dependencies.
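A signed dependency distance and a recall grouped by gold distance can be sketched as below. This is an illustrative simplification: it measures only head-attachment recall rather than the average of UAS and LAS, it assumes 1-based token positions, and it assumes that root attachments have already been filtered out by the caller. The function names and the toy data are hypothetical.

```python
from collections import defaultdict

def signed_distance(modifier_index, head_index):
    """Negative when the modifier precedes its head (left direction),
    positive when the head precedes the modifier (right direction)."""
    return modifier_index - head_index

def recall_by_distance(tokens):
    """Head-attachment recall grouped by gold signed distance.

    tokens: iterable of (modifier_index, gold_head, predicted_head),
            with root attachments excluded by the caller.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for modifier_index, gold_head, predicted_head in tokens:
        distance = signed_distance(modifier_index, gold_head)
        total[distance] += 1
        correct[distance] += int(gold_head == predicted_head)
    return {distance: correct[distance] / total[distance] for distance in total}

# Toy predictions (hypothetical data).
toy_tokens = [(1, 2, 2), (2, 3, 3), (4, 3, 2), (5, 3, 3), (6, 5, 5)]
toy_recall = recall_by_distance(toy_tokens)
```

In this toy example both arcs at distance -1 are recovered, while only one of the two arcs at distance +1 is, so the recall at +1 is 0.5.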
We further compare our models on the d = 1 dependencies and, as shown in Figure 10, the familiar pattern appears again. The models that capture less word ordering information perform better on the languages which have more d = 1 dependencies. This finding indicates that our design choice of relaxing the model’s ability to capture word order information can help on short-ranged dependencies whose directions differ from those of the source language. However, since the model has access to the local context information, which is quite important for parsing as we will show, it might still be able to guess specific word ordering or positional information; thus the improvements are limited. One of the most challenging parts of unsupervised cross-lingual parsing is modeling more cross-lingual and shareable features but less language-specific information; that is, we want a flexible yet powerful model. Our effort with self-attention models can be a first step in this direction.
4.3.4 Positional Representations in Transformer
In our Transformer encoder, we use a modified version of relative position representations instead of the absolute positional embeddings. Table 4 shows the comparisons of these choices. We use the graph-based method for decoding, and report evaluation scores averaged over all languages. Firstly, we can see that positional information is a key component, without which the model (“No-Positional”) performs quite badly. This is natural since, without positional representations, the model does not have access to the locality of contexts. The model with absolute positional embeddings (“Absolute”) also does not perform as well; for parsing, relative positional representations may be more direct and easier to learn. We also try the model with directional distance information (“Relative+dir”) as in Shaw et al. (2018), in which distances are augmented with directions (by adding negative or positive signs to distinguish left and right contexts). This model performs well, and interestingly, its performance is close to that of the model with the RNN encoder. Finally, by discarding the information of directions, our relatively positional Transformer performs the best, indicating that it can capture useful cross-language context information while depending less on language-specific positional and ordering information.
5 Related Work
Existing works in zero-shot cross-lingual dependency parsing, in general, train a dependency parser on the source language and then directly run it on the target languages. Training of the monolingual parsers is often delexicalized, i.e., all lexical features are removed from the source treebank Zeman and Resnik (2008); McDonald et al. (2013b), and the underlying feature model is selected from a shared part-of-speech (POS) representation utilizing the Universal POS Tagset Petrov et al. (2012). Another line of prior work improves the delexicalized approaches by adapting the model to fit the target languages better. Cross-lingual approaches that facilitate the usage of lexical features include choosing the source language data points suitable for the target language Søgaard (2011); Täckström et al. (2013), transferring from multiple sources McDonald et al. (2011); Guo et al. (2016), using cross-lingual word clusters Täckström et al. (2012) and lexicon mapping Xiao and Guo (2014); Guo et al. (2015).
Although the objective of harnessing dependency parsers with neural networks is learning transferable representations of the lexical units, what type of neural architecture is suitable for cross-lingual transfer has not been studied. A recent work Lakew et al. (2018) investigates the pros and cons of Transformer and recurrent neural networks on multilingual machine translation, but the impact of language differences on neural cross-language learning remains unexplored. In this work, we focus solely on dissecting the two preeminent neural architectures for dependency parsing, with the specific goal of unveiling the transferability of one language feature, word order typology, and we present their advantages and limitations.
In this work, we present a comprehensive study of how the choice of neural architectures affects cross-lingual transfer learning. We thoroughly examine the strengths and weaknesses of two notable designs of neural architectures (sequential RNN vs. self-attentional Transformer) using dependency parsing as our evaluation task. We show that the self-attention model performs better than the sequential one overall, especially when there is a significant difference in the word order typology between the target and source languages. However, when the source and target languages are close in word ordering, sequential models can be better. We also investigate the impact of decoding methods of parsing in the cross-lingual setting. The empirical findings suggest that for cross-lingual transfer learning, which model is more suitable depends on the structural differences (like word ordering) between the source and target languages. Thus, we should take those differences into consideration in our model designs. In future work, we plan to incorporate prior linguistic knowledge into the models for better cross-lingual transfer.
- Agić et al. (2014) Željko Agić, Jörg Tiedemann, Kaja Dobrovoljc, Simon Krek, Danijela Merkler, and Sara Može. 2014. Cross-lingual dependency parsing of related languages with rich morphosyntactic tagsets. In EMNLP 2014 Workshop on Language Technology for Closely Related Languages and Language Variants.
- Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5:135–146.
- Buys and Botha (2016) Jan Buys and Jan A Botha. 2016. Cross-lingual morphological tagging for low-resource languages. arXiv preprint arXiv:1606.04279 .
- Conneau et al. (2018) Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2018. Word translation without parallel data. ICLR .
- Cotterell and Duh (2017) Ryan Cotterell and Kevin Duh. 2017. Low-resource named entity recognition with cross-lingual, character-level neural conditional random fields. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers). volume 2, pages 91–96.
- Dozat and Manning (2017) Timothy Dozat and Christopher D Manning. 2017. Deep biaffine attention for neural dependency parsing. ICLR .
- Dryer (2007) Matthew S Dryer. 2007. Word order. Language typology and syntactic description 1:61–131.
- Eisner (1996) Jason M Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proceedings of the 16th conference on Computational linguistics-Volume 1. Association for Computational Linguistics, pages 340–345.
- Gamallo et al. (2012) Pablo Gamallo, Marcos Garcia, and Santiago Fernández-Lanza. 2012. Dependency-based open information extraction. In Proceedings of the joint workshop on unsupervised and semi-supervised learning in NLP. Association for Computational Linguistics, pages 10–18.
- Guo et al. (2015) Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2015. Cross-lingual dependency parsing based on distributed representations. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). volume 1, pages 1234–1244.
- Guo et al. (2016) Jiang Guo, Wanxiang Che, David Yarowsky, Haifeng Wang, and Ting Liu. 2016. A representation learning framework for multi-source transfer parsing.
- Hashimoto et al. (2016) Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2016. A joint many-task model: Growing a neural network for multiple nlp tasks. arXiv preprint arXiv:1611.01587 .
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
- Huang et al. (2013) Jui-Ting Huang, Jinyu Li, Dong Yu, Li Deng, and Yifan Gong. 2013. Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, pages 7304–7308.
- Jie et al. (2017) Zhanming Jie, Aldrian Obaja Muis, and Wei Lu. 2017. Efficient dependency-guided named entity recognition. In AAAI. pages 3457–3465.
- Kann et al. (2017) Katharina Kann, Ryan Cotterell, and Hinrich Schütze. 2017. One-shot neural cross-lingual transfer for paradigm completion. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. pages 1993–2003.
- Kim et al. (2017) Joo-Kyung Kim, Young-Bum Kim, Ruhi Sarikaya, and Eric Fosler-Lussier. 2017. Cross-lingual transfer learning for POS tagging without cross-lingual resources. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pages 2832–2838.
- Kiperwasser and Goldberg (2016) Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. arXiv preprint arXiv:1603.04351.
- Kitaev and Klein (2018) Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. arXiv preprint arXiv:1805.01052.
- Kundu et al. (2018) Gourab Kundu, Avirup Sil, Radu Florian, and Wael Hamza. 2018. Neural cross-lingual coreference resolution and its application to entity linking. arXiv preprint arXiv:1806.10201.
- Lakew et al. (2018) Surafel M Lakew, Mauro Cettolo, and Marcello Federico. 2018. A comparison of transformer and recurrent neural networks on multilingual neural machine translation. arXiv preprint arXiv:1806.06957.
- Litschko et al. (2018) Robert Litschko, Goran Glavaš, Simone Paolo Ponzetto, and Ivan Vulić. 2018. Unsupervised cross-lingual information retrieval using monolingual data only. arXiv preprint arXiv:1805.00879.
- Liu et al. (2018) Peter J Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. 2018. Generating Wikipedia by summarizing long sequences. International Conference on Learning Representations.
- Ma et al. (2018) Xuezhe Ma, Zecong Hu, Jingzhou Liu, Nanyun Peng, Graham Neubig, and Eduard Hovy. 2018. Stack-pointer networks for dependency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- Ma and Xia (2014) Xuezhe Ma and Fei Xia. 2014. Unsupervised dependency parsing with transferring distribution via parallel guidance and entropy regularization. In Proceedings of ACL 2014. Baltimore, Maryland, pages 1337–1348.
- Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of machine learning research 9(Nov):2579–2605.
- McCann et al. (2017) Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems. pages 6294–6305.
- McClosky et al. (2011) David McClosky, Mihai Surdeanu, and Christopher D Manning. 2011. Event extraction as dependency parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics, pages 1626–1635.
- McDonald et al. (2005) Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In Proceedings of ACL-2005. Ann Arbor, Michigan, USA, pages 91–98.
- McDonald and Nivre (2007) Ryan McDonald and Joakim Nivre. 2007. Characterizing the errors of data-driven dependency parsing models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
- McDonald et al. (2013a) Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló, and Jungmee Lee. 2013a. Universal dependency annotation for multilingual parsing. In Proceedings of ACL-2013. Sofia, Bulgaria, pages 92–97.
- McDonald et al. (2013b) Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, et al. 2013b. Universal dependency annotation for multilingual parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). volume 2, pages 92–97.
- McDonald et al. (2011) Ryan McDonald, Slav Petrov, and Keith Hall. 2011. Multi-source transfer of delexicalized dependency parsers. In Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, pages 62–72.
- Nivre et al. (2018) Joakim Nivre, Mitchell Abrams, Željko Agić, et al. 2018. Universal dependencies 2.2. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
- Östling (2015) Robert Östling. 2015. Word order typology through multilingual word alignment. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). volume 2, pages 205–211.
- Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
- Petrov et al. (2012) Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of LREC-2012. Istanbul, Turkey, pages 2089–2096.
- Sasaki et al. (2018) Shota Sasaki, Shuo Sun, Shigehiko Schamoni, Kevin Duh, and Kentaro Inui. 2018. Cross-lingual learning-to-rank with shared representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). volume 2, pages 458–463.
- Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.
- Sil et al. (2017) Avirup Sil, Gourab Kundu, Radu Florian, and Wael Hamza. 2017. Neural cross-lingual entity linking. arXiv preprint arXiv:1712.01813.
- Smith et al. (2017) Samuel L Smith, David HP Turban, Steven Hamblin, and Nils Y Hammerla. 2017. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. ICLR.
- Søgaard (2011) Anders Søgaard. 2011. Data point selection for cross-language adaptation of dependency parsers. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2. Association for Computational Linguistics, pages 682–686.
- Täckström et al. (2013) Oscar Täckström, Ryan McDonald, and Joakim Nivre. 2013. Target language adaptation of discriminative transfer parsers.
- Täckström et al. (2012) Oscar Täckström, Ryan McDonald, and Jakob Uszkoreit. 2012. Cross-lingual word clusters for direct transfer of linguistic structure. In Proceedings of the 2012 conference of the North American chapter of the association for computational linguistics: Human language technologies. Association for Computational Linguistics, pages 477–487.
- Tiedemann (2015) Jörg Tiedemann. 2015. Cross-lingual dependency parsing with universal dependencies and predicted POS labels. In Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015). pages 340–349.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. pages 5998–6008.
- Vulić and Moens (2015) Ivan Vulić and Marie-Francine Moens. 2015. Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. ACM, pages 363–372.
- Xiao and Guo (2014) Min Xiao and Yuhong Guo. 2014. Distributed word representation learning for cross-lingual dependency parsing. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning. pages 119–129.
- Xie et al. (2018) Jiateng Xie, Zhilin Yang, Graham Neubig, Noah A Smith, and Jaime Carbonell. 2018. Neural cross-lingual named entity recognition with minimal resources. arXiv preprint arXiv:1808.09861.
- Xu et al. (2014) Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee. 2014. Cross-language transfer learning for deep neural network based speech enhancement. In Chinese Spoken Language Processing (ISCSLP), 2014 9th International Symposium on. IEEE, pages 336–340.
- Yang et al. (2016) Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. 2016. Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270.
- Zeman and Resnik (2008) Daniel Zeman and Philip Resnik. 2008. Cross-language parser adaptation between related languages. In Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages.
- Zhou et al. (2016) Joey Tianyi Zhou, Sinno Jialin Pan, Ivor W Tsang, and Shen-Shyang Ho. 2016. Transfer learning for cross-language text categorization through active correspondences construction. In AAAI. pages 2400–2406.
- Zoph et al. (2016) Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, pages 1568–1575.
A Details of UD Treebanks
The statistics of the Universal Dependency treebanks we used are summarized in Table 5.
|Language||Lang. Family||Treebank||Split||#Sent.||#Token (w/o punct)|
|Czech (cs)||IE.Slavic||PDT,CAC, CLTT,FicTree||train||102993||1806230(1542805)|
|Dutch (nl)||IE.Germanic||Alpino, LassySmall||train||18058||261180(228902)|
|Korean (ko)||Korean||GSD, Kaist||train||27410||353133(312481)|
|Norwegian (no)||IE.Germanic||Bokmaal, Nynorsk||train||29870||489217(432597)|
|Polish (pl)||IE.Slavic||LFG, SZ||train||19874||167251(136504)|
|Portuguese (pt)||IE.Romance||Bosque, GSD||train||17993||462494(400343)|
|Slovenian (sl)||IE.Slavic||SSJ, SST||train||8556||132003(116730)|
|Spanish (es)||IE.Romance||GSD, AnCora||train||28492||827053(730062)|
|MLP||arc MLP size||512|
|label MLP size||128|
|MLP||arc MLP size||512|
|label MLP size||128|
C Results on Google Universal Dependency Treebanks v2.0
We also ran our models on the Google Universal Dependency Treebanks v2.0 McDonald et al. (2013a), an older dataset that was also used by Guo et al. (2015). The results show that our models consistently perform better.
|Language||TransformerGraph||RNNGraph||TransformerStack||RNNStack||Guo et al. (2015)|
D Results on the original training sets
E Details about augmented dependency types
|Type||Avg. Freq. (%)||#Lang.||Type||Avg. Freq. (%)||#Lang.|
|(ADP, NOUN, case)||7.47||31||(PROPN, VERB, nsubj)||0.81||30|
|(PUNCT, VERB, punct)||6.91||30||(PRON, VERB, obj)||0.77||30|
|(NOUN, NOUN, nmod)||4.97||31||(NOUN, ROOT, root)||0.66||31|
|(ADJ, NOUN, amod)||4.92||31||(VERB, VERB, xcomp)||0.61||28|
|(DET, NOUN, det)||4.69||30||(VERB, VERB, ccomp)||0.60||30|
|(VERB, ROOT, root)||4.31||31||(ADP, PRON, case)||0.57||29|
|(NOUN, VERB, obl)||3.96||30||(AUX, NOUN, cop)||0.57||28|
|(NOUN, VERB, obj)||3.10||31||(ADV, ADJ, advmod)||0.54||29|
|(NOUN, VERB, nsubj)||2.89||31||(AUX, ADJ, cop)||0.50||27|
|(PUNCT, NOUN, punct)||2.75||30||(PROPN, VERB, obl)||0.48||29|
|(ADV, VERB, advmod)||2.43||31||(PRON, VERB, obl)||0.44||30|
|(AUX, VERB, aux)||2.29||28||(ADV, NOUN, advmod)||0.41||28|
|(PRON, VERB, nsubj)||1.53||30||(ADJ, ROOT, root)||0.39||29|
|(ADP, PROPN, case)||1.46||29||(PRON, NOUN, nmod)||0.39||22|
|(NOUN, NOUN, conj)||1.32||30||(NOUN, ADJ, obl)||0.37||25|
|(VERB, NOUN, acl)||1.31||31||(PROPN, PROPN, conj)||0.35||29|
|(SCONJ, VERB, mark)||1.27||28||(NOUN, ADJ, nsubj)||0.35||30|
|(CCONJ, VERB, cc)||1.18||30||(CCONJ, ADJ, cc)||0.29||28|
|(PROPN, NOUN, nmod)||1.14||30||(PUNCT, NUM, punct)||0.26||24|
|(CCONJ, NOUN, cc)||1.13||30||(NOUN, NOUN, nsubj)||0.25||31|
|(NUM, NOUN, nummod)||1.11||31||(ADJ, ADJ, conj)||0.25||26|
|(PROPN, PROPN, flat)||1.09||26||(CCONJ, PROPN, cc)||0.22||26|
|(VERB, VERB, conj)||1.05||30||(PRON, VERB, iobj)||0.21||21|
|(PUNCT, PROPN, punct)||0.94||29||(ADV, ADV, advmod)||0.19||21|
|(VERB, VERB, advcl)||0.89||30||(NOUN, NOUN, appos)||0.18||23|
|(PUNCT, ADJ, punct)||0.89||30||(PROPN, VERB, obj)||0.17||24|
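To make the "Avg. Freq. (%)" column concrete, here is a minimal Python sketch of how such a figure could be computed. It assumes each triple reads as (modifier POS, head POS, relation label) and that the average is a plain macro-average of per-language percentages; the table does not state the exact averaging procedure, so both points are assumptions. The function names and the toy data (including the placeholder language code `xx`) are hypothetical.

```python
from collections import Counter


def type_frequencies(triples):
    """Percentage of arcs per augmented type within one language."""
    counts = Counter(triples)
    total = sum(counts.values())
    return {t: 100.0 * n / total for t, n in counts.items()}


def macro_average(per_language):
    """Average each type's percentage over languages (assumed procedure).

    A type that is absent from a language contributes 0 for that language.
    """
    all_types = set().union(*(f.keys() for f in per_language.values()))
    n_langs = len(per_language)
    return {
        t: sum(f.get(t, 0.0) for f in per_language.values()) / n_langs
        for t in all_types
    }


# Toy arcs, written as (modifier POS, head POS, relation label).
toy = {
    "en": [
        ("DET", "NOUN", "det"),
        ("NOUN", "VERB", "nsubj"),
        ("VERB", "ROOT", "root"),
        ("NOUN", "VERB", "obj"),
    ],
    "xx": [
        ("NOUN", "VERB", "nsubj"),
        ("VERB", "ROOT", "root"),
    ],
}
per_language = {lang: type_frequencies(arcs) for lang, arcs in toy.items()}
average = macro_average(per_language)
```

With real treebanks the triples would be read from the annotated arcs of each language's training data rather than written by hand.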
F Results on specific dependency types for Czech
In Table 10, we show results for Czech on selected dependency types, with the evaluation broken down by dependency direction. We select Czech for two reasons: (1) it has the largest dataset; (2) Czech is known for its relatively flexible word order. In general, models that are more flexible with respect to word order perform better. Interestingly, for the object (obj) and subject (nsubj) types, LAS scores are quite low for all models even when the correct heads are predicted. One possible reason is that even the relative-positional Transformer can capture some positional information, which in turn reveals word-order information to some extent.
|(ADP, NOUN, case): (mod-first% in English is 99.92%.)|
|(NOUN, NOUN, nmod): (mod-first% in English is 4.72%.)|
|(ADJ, NOUN, amod): (mod-first% in English is 99.01%.)|
|(NOUN, VERB, obl): (mod-first% in English is 9.62%.)|
|(NOUN, VERB, obj): (mod-first% in English is 0.72%.)|
|(NOUN, VERB, nsubj): (mod-first% in English is 85.07%.)|
|(ADV, VERB, advmod): (mod-first% in English is 58.82%.)|
|(AUX, VERB, aux): (mod-first% in English is 99.64%.)|
|(VERB, VERB, advcl): (mod-first% in English is 31.02%.)|
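The "mod-first%" figures quoted above can be read as the share of arcs of a given type in which the modifier precedes its head. The following Python sketch illustrates that reading. It is an assumption about the statistic and not the authors' code; the function name, the arc tuple layout, and the decision to skip arcs attached to the artificial root (head index 0) are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def mod_first_percent(arcs):
    """Per augmented type, the percentage of arcs with the modifier before the head.

    arcs: iterable of (mod_index, head_index, mod_pos, head_pos, label),
    with 1-based token positions and head_index == 0 for the root.
    """
    first = defaultdict(int)
    total = defaultdict(int)
    for mod_idx, head_idx, mod_pos, head_pos, label in arcs:
        if head_idx == 0:
            # Assumption: direction is undefined for the artificial root.
            continue
        key = (mod_pos, head_pos, label)
        total[key] += 1
        if mod_idx < head_idx:
            first[key] += 1
    return {key: 100.0 * first[key] / total[key] for key in total}


# Toy sentence: "She reads books in the library"
toy_arcs = [
    (1, 2, "PRON", "VERB", "nsubj"),
    (2, 0, "VERB", "ROOT", "root"),
    (3, 2, "NOUN", "VERB", "obj"),
    (4, 6, "ADP", "NOUN", "case"),
    (5, 6, "DET", "NOUN", "det"),
    (6, 2, "NOUN", "VERB", "obl"),
]
result = mod_first_percent(toy_arcs)
```

On this toy sentence the preposition precedes its noun and the object follows its verb, which matches the direction of the English tendencies listed above (a high mod-first% for `case`, a low one for `obj`).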