Working Hard or Hardly Working: Challenges of Integrating Typology into Neural Dependency Parsers

09/20/2019
by   Adam Fisch, et al.
MIT

This paper explores the task of leveraging typology in the context of cross-lingual dependency parsing. While this linguistic information has shown great promise in pre-neural parsing, results for neural architectures have been mixed. The aim of our investigation is to better understand this state-of-the-art. Our main findings are as follows: 1) The benefit of typological information is derived from coarsely grouping languages into syntactically-homogeneous clusters rather than from learning to leverage variations along individual typological dimensions in a compositional manner; 2) Typology consistent with the actual corpus statistics yields better transfer performance; 3) Typological similarity is only a rough proxy of cross-lingual transferability with respect to parsing.

1 Introduction

Over the last decade, dependency parsers for resource-rich languages have steadily continued to improve. In parallel, significant research efforts have been dedicated towards advancing cross-lingual parsing. This direction seeks to capitalize on existing annotations in resource-rich languages by transferring them to the rest of the world’s over 7,000 languages Bender (2011). The NLP community has devoted substantial resources towards this goal, such as the creation of universal annotation schemas, and the expansion of existing treebanks to diverse language families. Nevertheless, cross-lingual transfer gains remain modest when put in perspective: the performance of transfer models can often be exceeded using only a handful of annotated sentences in the target language (Section 5). The considerable divergence of language structures proves challenging for current models.

One promising direction for handling these divergences is linguistic typology. Linguistic typology classifies languages according to their structural and functional features. By explicitly highlighting specific similarities and differences in languages' syntactic structures, typology holds great potential for facilitating cross-lingual transfer O'Horan et al. (2016). Indeed, non-neural parsing approaches have already demonstrated empirical benefits of typology-aware models Naseem et al. (2012); Täckström et al. (2013); Zhang and Barzilay (2015). While adding discrete typological attributes is straightforward for traditional feature-based approaches, finding an effective implementation choice for modern neural parsers remains an open question. Not surprisingly, the reported results have been mixed. For instance, Ammar et al. (2016) found no benefit to using typology in a neural parser, while Wang and Eisner (2018) and Scholivet et al. (2019) did in several cases.

There are many possible hypotheses that might explain this state of affairs. Might neural models already implicitly learn typological information on their own? Is the hand-specified typology information sufficiently accurate, and provided at the right granularity, to always be useful? How do cross-lingual parsers use, or ignore, typology when making predictions? Without understanding the answers to these questions, it is difficult to develop a principled way to robustly incorporate linguistic knowledge as an inductive bias for cross-lingual transfer.

In this paper, we explore these questions in the context of two predominantly used typology-based neural architectures for delexicalized dependency parsing. (We focus on delexicalized parsing in order to isolate the effects of syntax by removing lexical influences.) The first method implements a variant of selective sharing Naseem et al. (2012); the second adds typological information as an additional feature of the input sentence. Both models are built on top of the popular Biaffine Parser Dozat and Manning (2017). We study model performance across multiple forms of typological representation and resolution.

Our key findings are as follows:


  • Typology as Quantization: Cross-lingual parsers use typology to coarsely group languages into syntactically-homogeneous clusters, yet largely fail to capture finer distinctions or typological feature compositions. Our results indicate that they primarily take advantage of the simple geometry of the typological space (e.g., language distances), rather than specific variations in individual typological dimensions (e.g., SV vs. VS).

  • Typology Quality: Typology that is consistent with the actual corpus statistics results in better transfer performance, most likely because it better reflects the typological variation within that sample. Typology granularity also matters: finer-grained, high-dimensional representations prove harder to use robustly.

  • Typology vs. Parser Transferability: Typological similarity only partially explains cross-lingual transferability with respect to parsing. The geometry of the typological space does not fully mirror that of the "parsing" space, and therefore requires task-specific refinement.

2 Typology Representations

Linguistic Typology, $t^{\text{wals}}_\ell$: The standard representation of typology is a set of annotations by linguists for a variety of language-level properties. These properties can be found in online databases such as the World Atlas of Language Structures (WALS) Dryer and Haspelmath (2013). We consider the same subset of word-order features as used by Naseem et al. (2012), represented for each language $\ell$ as a multi-hot vector $t^{\text{wals}}_\ell$, with one binary indicator for each value in $V_f$, the set of values that feature $f$ may take.

Liu Directionalities, $t^{\text{liu}}_\ell$: Liu (2010) proposed using a real-valued vector of the average directionalities of each of a corpus' dependency relations as a typological descriptor. These serve as a more fine-grained alternative to linguistic typology. Compared to WALS, there are rarely missing values, and the degree of dominance of each dependency ordering is directly encoded, potentially allowing for better modeling of local variance within a language. It is important to note, however, that true directionalities require a parsed corpus to be derived; thus, they are not a realistic option for cross-lingual parsing in practice. (Though Wang and Eisner (2017) indicate that they can be predicted from unparsed corpora with reasonable accuracy; we include them for completeness.)

Surface Statistics, $t^{\text{surf}}_\ell$: It is possible to derive a proxy measure of typology from part-of-speech tag sequences alone. Wang and Eisner (2017) found surface statistics to be highly predictive of language typology, while Wang and Eisner (2018) replaced typological features entirely with surface statistics in their augmented dependency parser. Surface statistics have the advantage of being readily available and not being restricted to narrow linguistic definitions, but they are less informed by the true underlying structure. We compute the set of hand-engineered features used in Wang and Eisner (2018), yielding a real-valued vector $t^{\text{surf}}_\ell$.
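As a rough illustration of such surface statistics (this is not the exact hand-engineered feature set of Wang and Eisner (2018); the function and feature names below are ours), the following sketch estimates simple word-order tendencies from POS tag sequences alone:

```python
from collections import Counter
from itertools import combinations

def surface_statistics(tagged_sentences, window=3):
    """Toy surface-statistics vector from POS tag sequences alone.

    tagged_sentences: list of lists of universal POS tags, e.g.
        [["PRON", "VERB", "NOUN"], ...]
    Returns a dict mapping a feature name to the fraction of times the first
    tag precedes the second within a small window, a crude proxy for
    word-order tendencies (inspired by, not identical to, Wang and Eisner 2018).
    """
    pairs = [("VERB", "NOUN"), ("VERB", "PRON"), ("ADP", "NOUN"),
             ("ADJ", "NOUN"), ("DET", "NOUN")]
    before, total = Counter(), Counter()
    for tags in tagged_sentences:
        for i, j in combinations(range(len(tags)), 2):
            if j - i > window:
                continue
            for a, b in pairs:
                if {tags[i], tags[j]} == {a, b}:
                    total[(a, b)] += 1
                    if tags[i] == a:
                        before[(a, b)] += 1
    return {f"p({a} before {b})": (before[(a, b)] / total[(a, b)]
                                   if total[(a, b)] else 0.5)
            for a, b in pairs}

# Example: a tiny delexicalized "corpus" of two sentences.
print(surface_statistics([["PRON", "VERB", "DET", "NOUN"],
                          ["NOUN", "ADP", "VERB", "ADJ", "NOUN"]]))
```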

3 Parsing Architecture

We use the graph-based Deep Biaffine Attention neural parser of Dozat and Manning (2017) as our baseline model. Given a delexicalized sentence consisting of part-of-speech tags $p_1, \ldots, p_n$, the Biaffine Parser embeds each tag $p_i$ and encodes the sequence with a bi-directional LSTM to produce tag-level contextual representations $r_i$. Each $r_i$ is then mapped into head- and child-specific representations for arc and relation prediction, $h^{\text{arc-head}}_i$, $h^{\text{arc-dep}}_i$, $h^{\text{rel-head}}_i$, and $h^{\text{rel-dep}}_i$, using four separate multi-layer perceptrons.

For decoding, the score of an arc from head $j$ to dependent $i$ is computed with a biaffine function:

$s^{\text{arc}}_{i,j} = (h^{\text{arc-head}}_j)^\top U^{\text{arc}} \, h^{\text{arc-dep}}_i + (u^{\text{arc}})^\top h^{\text{arc-head}}_j$    (1)

while the score of dependency label $r$ for edge $(i, j)$ is computed in a similar fashion:

$s^{\text{rel}}_{i,j,r} = (h^{\text{rel-head}}_j)^\top U^{\text{rel}}_r \, h^{\text{rel-dep}}_i + (u^{\text{rel}}_r)^\top (h^{\text{rel-head}}_j \oplus h^{\text{rel-dep}}_i) + b_r$    (2)

Both $s^{\text{arc}}$ and $s^{\text{rel}}$ are trained greedily using cross-entropy loss with respect to the correct head or label. At test time the final tree is composed using the Chu-Liu-Edmonds (CLE) maximum spanning tree algorithm Chu and Liu (1965); Edmonds (1967).
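As an illustration of the arc scorer in Eq. (1), below is a minimal PyTorch sketch; the module name, dimensions, and initialization are ours rather than details taken from the paper:

```python
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    """Biaffine arc scoring in the style of Dozat and Manning (2017).

    Given head/dependent representations (already produced by the encoder
    MLPs), returns a score for every (dependent i, head j) pair.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.U = nn.Parameter(torch.zeros(dim, dim))   # bilinear term
        self.u = nn.Parameter(torch.zeros(dim))        # head bias term
        nn.init.xavier_uniform_(self.U)

    def forward(self, h_dep: torch.Tensor, h_head: torch.Tensor) -> torch.Tensor:
        # h_dep, h_head: [batch, seq_len, dim]
        bilinear = torch.einsum("bid,de,bje->bij", h_dep, self.U, h_head)
        head_bias = torch.einsum("bjd,d->bj", h_head, self.u).unsqueeze(1)
        return bilinear + head_bias  # [batch, dep i, head j]

# Usage: scores[b, i, j] is the score of attaching token i to head j.
scorer = BiaffineArcScorer(dim=500)
h = torch.randn(2, 10, 500)
print(scorer(h, h).shape)  # torch.Size([2, 10, 10])
```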

Language B B+$t^{\text{surf}}$   Our Baseline Selective Sharing +$t^{\text{wals}}$ +$t^{\text{liu}}$ +$t^{\text{surf}}$ Fine-tune
Basque 49.89 54.34   56.18 56.54 56.35 56.77 56.50 60.71
Croatian 65.03 67.78   74.86 75.23 74.07 77.39 75.20 78.39
Greek 65.91 68.37   70.09 70.49 68.05 71.66 70.47 73.35
Hebrew 62.58 66.27   68.85 68.61 72.02 72.75 69.21 73.88
Hungarian 58.50 64.13   63.81 64.78 70.28 66.40 64.21 72.50
Indonesian 55.22 64.63   63.68 64.96 69.73 67.73 66.25 73.34
Irish 58.58 61.51   61.72 61.49 65.88 66.49 62.21 66.76
Japanese 54.97 60.41   57.28 57.80 63.83 64.28 57.04 72.72
Slavonic 68.79 71.13   75.18 75.17 74.65 74.17 75.16 73.11
Persian 40.38 34.20   53.87 53.61 45.14 56.72 53.03 59.92
Polish 72.15 76.85   76.01 75.93 79.51 71.09 76.29 77.78
Romanian 66.55 69.69   73.00 73.40 75.20 76.34 73.82 75.15
Slovenian 72.21 76.06   81.21 80.99 81.39 81.36 80.92 82.43
Swedish 72.26 75.32   79.39 79.64 80.28 80.10 79.22 81.29
Tamil 51.59 57.53   57.81 58.85 59.70 60.37 58.39 62.94
Average 60.97 64.55   67.53 67.83 69.07 69.57 67.86 72.28
Table 1: A comparison of all methods on held-out test languages. UAS results are reported over the train splits of the held-out languages, following Wang and Eisner (2018). B and B+$t^{\text{surf}}$ are the baseline and surface-statistics model results, respectively, of Wang and Eisner (2018). (Their final $t^{\text{surf}}$ also contains additional neural features that we omit, as we found them to underperform relative to using only the hand-engineered features.) Fine-tune is the result of adapting our baseline model using only 10 sentences from the target language. All of our reported numbers are the average of three runs with different random seeds. Results with statistically insignificant differences from the baseline are marked (arc-level paired permutation test).

4 Typology Augmented Parsing

Selective Sharing: Naseem et al. (2012) introduced the idea of selective sharing in a generative parser, where the features provided to the parser were controlled by its typology. The idea was extended to discriminative models by Täckström et al. (2013). For neural parsers that do not rely on manually defined feature templates, however, there is no explicit way of applying selective sharing. Here we choose to incorporate selective sharing directly as a bias term for arc scoring:

$s^{\text{arc}}_{i,j} \leftarrow s^{\text{arc}}_{i,j} + w^\top f(i, j, t^{\text{wals}}_\ell)$    (3)

where $w$ is a learned weight vector and $f(i, j, t^{\text{wals}}_\ell)$ is a feature vector engineered using Täckström et al.'s head-modifier feature templates (Appendix B).
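A hypothetical sketch of how such a bias term could be realized, using indicator features in the spirit of the Appendix B templates (the feature keys, index, and module are ours):

```python
import torch
import torch.nn as nn

# Each binary feature fires for a candidate arc when a conjunction of arc
# direction, a WALS feature value, and a head/modifier POS pattern holds.
TEMPLATES = [  # (WALS ID, head POS, modifier POS)
    ("81A", "VERB", "NOUN"), ("81A", "VERB", "PRON"),
    ("85A", "NOUN", "ADP"), ("86A", "PRON", "ADP"), ("87A", "NOUN", "ADJ"),
]

class SelectiveSharingBias(nn.Module):
    def __init__(self, feature_index):
        super().__init__()
        self.feature_index = feature_index                   # str -> int
        self.w = nn.Parameter(torch.zeros(len(feature_index)))

    def forward(self, direction, head_pos, mod_pos, wals):
        """Return the scalar bias w^T f(i, j, t_wals) for one candidate arc."""
        f = torch.zeros(len(self.feature_index))
        for wals_id, h, m in TEMPLATES:
            if head_pos == h and mod_pos == m and wals_id in wals:
                key = f"{direction}&{wals_id}={wals[wals_id]}&{h}->{m}"
                if key in self.feature_index:
                    f[self.feature_index[key]] = 1.0
        return self.w @ f

# The feature index would be built once over all templates, directions, and
# observed WALS values; here is a tiny stand-in.
index = {"R&81A=SVO&VERB->NOUN": 0, "L&81A=SOV&VERB->NOUN": 1}
bias = SelectiveSharingBias(index)("R", "VERB", "NOUN", {"81A": "SVO"})
```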

Input Features: We follow Wang and Eisner (2018) and encode the typology $t_\ell$ of language $\ell$ with an MLP, then concatenate it with each input embedding:

$z_\ell = \text{MLP}(t_\ell)$    (4)
$x_i = [\,e(p_i);\ z_\ell\,]$    (5)

This approach assumes the parser is able to learn to use the information in $z_\ell$ to induce some distinctive change in the contextual encodings $r_i$.
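A minimal PyTorch sketch of Eqs. (4)-(5); the layer sizes are illustrative, not the paper's settings:

```python
import torch
import torch.nn as nn

class TypologyAugmentedInput(nn.Module):
    """Encode a language-level typology vector and concatenate it with every
    POS-tag embedding in the sentence (Eqs. 4-5). Sizes are illustrative."""
    def __init__(self, n_tags=18, tag_dim=100, typ_dim=40, typ_hidden=32):
        super().__init__()
        self.tag_embed = nn.Embedding(n_tags, tag_dim)
        self.typ_mlp = nn.Sequential(
            nn.Linear(typ_dim, typ_hidden), nn.ReLU(),
            nn.Linear(typ_hidden, typ_hidden),
        )

    def forward(self, tag_ids: torch.Tensor, typology: torch.Tensor) -> torch.Tensor:
        # tag_ids: [batch, seq_len]; typology: [batch, typ_dim] (one per language)
        z = self.typ_mlp(typology)                          # Eq. (4)
        z = z.unsqueeze(1).expand(-1, tag_ids.size(1), -1)  # broadcast over tokens
        return torch.cat([self.tag_embed(tag_ids), z], dim=-1)  # Eq. (5)

# The concatenated vectors x_i would then be fed to the BiLSTM encoder.
module = TypologyAugmentedInput()
x = module(torch.randint(0, 18, (2, 7)), torch.rand(2, 40))
print(x.shape)  # torch.Size([2, 7, 132])
```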

5 Experiments

Data: We conduct our analysis on the Universal Dependencies v1.2 dataset Nivre et al. (2015) and follow the same train-test partitioning of languages as Wang and Eisner (2018). We train on 20 treebanks and evaluate cross-lingual performance on the other 15; the test languages are shown in Table 1. (Two treebanks are excluded from evaluation, following the setting of Wang and Eisner (2018).) We perform hyper-parameter tuning via 5-fold cross-validation on the training languages.

Results: Table 1 presents our cross-lingual transfer results. Our baseline model improves over both benchmark results of Wang and Eisner (2018), B and B+$t^{\text{surf}}$, by 6.6 and 3.0 average UAS points, respectively. As expected, using typology yields mixed results. Selective sharing provides little to no benefit over the baseline. Incorporating the typology vector as an input feature is more effective, with the Liu Directionalities ($t^{\text{liu}}$) driving the most measurable improvements and achieving statistically significant gains on several languages. The Linguistic Typology ($t^{\text{wals}}$) also gives statistically significant gains on a number of languages. Nevertheless, the results are still modest: fine-tuning on only 10 target-language sentences yields a 2.3× larger average UAS increase, a noteworthy point of reference.

6 Analysis

Typology as Quantization: Adding simple, discrete language identifiers to the input has been shown to be useful in multi-task multi-lingual settings Ammar et al. (2016); Johnson et al. (2017). We hypothesize that the model utilizes typological information for a similar purpose, namely clustering languages by their parsing behavior. Testing this to the extreme, we encode languages using one-hot representations of their cluster membership, where the clusters are computed by applying K-Means to the WALS feature vectors (see Figure 1 for an illustration). (We use Euclidean distance as our metric, another extreme simplification; there is no guarantee that all dimensions should be given equal weight, as indicated in Table 4.) In this sparse form, compositional aspects of cross-lingual sharing are erased. Performance using this impoverished representation, however, only suffers slightly compared to the original, dropping marginally in overall UAS while achieving statistically significant parity or better with $t^{\text{wals}}$ on many of the languages. A gap does still partially remain; future work may investigate this further.
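A minimal sketch of this cluster-ID encoding using scikit-learn's K-Means over WALS vectors (the toy vectors and the number of clusters are placeholders; the paper selects the number of clusters by cross-validation, Appendix C):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_one_hot(wals_vectors: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Replace each language's WALS vector with a one-hot vector of its
    K-Means cluster ID, discarding all compositional typological detail."""
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(wals_vectors)
    one_hot = np.zeros((len(labels), k))
    one_hot[np.arange(len(labels)), labels] = 1.0
    return one_hot

# Toy example: 6 "languages" described by 4 binary WALS-style features.
wals = np.array([[1, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 0],
                 [0, 1, 0, 1], [1, 1, 1, 0], [0, 0, 0, 1]], dtype=float)
print(cluster_one_hot(wals, k=3))
```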

Figure 1: t-SNE projection of WALS vectors with clustering. Persian (fa) is an example of a poorly performing language that is also far from its cluster center.

This phenomenon is also reflected in the performance when the original WALS features are used. Test languages that belong to compact clusters perform better on average than those that are isolates (e.g., Persian, Basque). Indeed, from Table 1 and Figure 1 we observe that the worst-performing languages are those isolated from their cluster centers. Even though their typology vectors can be viewed as compositions of those of the training languages, the model appears to have limited ability to generalize to them. This suggests that the model does not effectively use individual typological features.

This can likely be attributed to the training routine, which poses two inherent difficulties: 1) the parser has few examples (entire languages) to generalize from, making the problem hard from a learning perspective, and 2) a naïve encoder can already implicitly capture important typological features within its hidden state, using only the surface forms of the input, which renders the additional typology features redundant. Table 2 presents the results of probing the final max-pooled output of the BiLSTM encoder for typological features at the sentence level. We find that they are nearly linearly separable: logistic regression achieves greater than 90% accuracy on average.
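A sketch of such a probe, assuming a frozen encoder and sentence-level labels; the exact probing pipeline beyond max-pooling and logistic regression is not specified in the text, so the details below are ours:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_typology(encoder, sentences, labels, cv=5):
    """Probe a (frozen) BiLSTM encoder for one WALS feature.

    encoder(tag_ids) is assumed to return [1, seq_len, hidden] states; each
    sentence is max-pooled into a single vector and a logistic regression
    classifier predicts the sentence's language-level typological label.
    """
    feats = []
    with torch.no_grad():
        for tag_ids in sentences:                  # each: [1, seq_len]
            states = encoder(tag_ids)              # [1, seq_len, hidden]
            feats.append(states.max(dim=1).values.squeeze(0).numpy())
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, np.stack(feats), np.array(labels), cv=cv).mean()
```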

Wang and Eisner (2018) attempt to address the learning problem by using the synthetic Galactic Dependencies (GD) dataset Wang and Eisner (2016) as a form of data augmentation. GD constructs "new" treebanks with novel typological qualities by systematically combining the behaviors of real languages. Following their work, we add GD treebanks synthesized from the UD training languages, substantially increasing the total number of training treebanks. Table 3 presents the results of training in this setting. While GD helps the weaker benchmark models of Wang and Eisner (2018) substantially, the same gains are not realized for models built on top of our stronger baseline; in fact, the baseline itself narrows the gap even further by improving by roughly 1% UAS overall (cf. Tables 1 and 3). (Sourcing a greater number of real languages may still be helpful. The synthetic GD setting is not entirely natural, and might be sensitive to hyper-parameters.)

WALS ID 82A 83A 85A 86A 87A 88A
Logreg 87 85 97 92 94 92
Majority 61 56 87 75 51 82
Table 2: Accuracy (%) of typology prediction using hidden states of the parser's encoder, compared to a majority baseline that predicts the most frequent category.
+GD   B+$t^{\text{surf}}$   Baseline +$t^{\text{wals}}$ +$t^{\text{liu}}$ +$t^{\text{surf}}$
Average 67.11   68.45 69.23 68.36 67.12
Table 3: Average UAS results when training with Galactic Dependencies (GD). The Linguistic Typology ($t^{\text{wals}}$) here is computed directly from the corpora using the rules in Appendix E. All of our reported numbers are the average of three runs.

Typology Quality: The notion of typology is predicated on the idea that some language features are consistent across different language samples, yet in practice this is not always the case. For instance, Arabic is listed in WALS as SV (82A, Subject-Verb), yet it follows a large number of Verb-Subject patterns in UD v1.2. Figure 2 further demonstrates that for some languages these divergences are significant (see Appendix F for concrete examples). Given this finding, we are interested in measuring the impact this noise has on typology utilization. Empirically, $t^{\text{liu}}$, which is by construction consistent with the corpus, performs best. Furthermore, updating the typology features of $t^{\text{wals}}$ to match the dominant ordering of the corpus yields a slight improvement in overall UAS, with statistically significant gains on several languages.

In addition to the quality of the representation, we can also analyze the impact of its resolution. In theory, a richer, high-dimensional representation of typology may capture subtle variations. In practice, however, we observe the opposite effect: the Linguistic Typology ($t^{\text{wals}}$) and the Liu Directionalities ($t^{\text{liu}}$) outperform the much higher-dimensional surface statistics ($t^{\text{surf}}$). This is likely due to the limited number of languages used for training (though training on GD exhibits the same trend). This suggests that future work may consider using targeted dimensionality reduction mechanisms, optimized for parsing performance.

Typology vs. Parser Transferability: The implicit assumption of all the typology-based methods is that the typological similarity of two languages is a good indicator of their parsing transferability. As a measure of parser transferability, for each test language we select the oracle source language which results in the best transfer performance. We then compute precision@k over the k nearest neighbors in the typological space, i.e., whether the best source appears among the k typologically nearest neighbors. As shown in Table 4, while there is some correlation between the two, they are far from perfectly aligned. $t^{\text{liu}}$ has the best alignment, which is consistent with it also having the best parsing performance. Overall, this divergence motivates the development of approaches that better match the two distributions.
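A sketch of the precision@k computation described above (the data structures and toy vectors are hypothetical):

```python
import numpy as np

def precision_at_k(typology, train_langs, oracle_best, k):
    """Fraction (in %) of test languages whose oracle-best source language is
    among their k typologically nearest training languages (Euclidean distance).

    typology:    dict lang -> typology vector (np.ndarray)
    train_langs: list of candidate source languages
    oracle_best: dict test lang -> source lang giving the best transfer UAS
    """
    hits = 0
    for test, best_src in oracle_best.items():
        nearest = sorted(train_langs,
                         key=lambda s: np.linalg.norm(typology[test] - typology[s]))
        hits += best_src in nearest[:k]
    return 100.0 * hits / len(oracle_best)

# Toy usage with made-up 2-D "typology" vectors.
typ = {"fa": np.array([0.9, 0.1]), "sv": np.array([0.2, 0.8]),
       "en": np.array([0.25, 0.7]), "ar": np.array([0.8, 0.2])}
print(precision_at_k(typ, ["en", "ar"], {"fa": "ar", "sv": "en"}, k=1))  # 100.0
```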

Figure 2: Averaged matching accuracy of the linguistically-defined WALS features on 15 randomly sampled languages compared to their corpus-specific values derived from UD v1.2. Rules for deriving the features from corpus are described in Appendix E.
  P@1 P@3 P@5 P@10
$t^{\text{wals}}$ 13 33 60 80
$t^{\text{liu}}$ 27 67 67 93
$t^{\text{surf}}$ 13 27 27 73
Table 4: Precision@k (%) for identifying the best parsing transfer language among the k nearest typological neighbors.

7 Related Work

Other recent progress in cross-lingual parsing has focused on lexical alignment Guo et al. (2015, 2016); Schuster et al. (2019). Data augmentation Wang and Eisner (2017) is another promising direction, but comes at the cost of greater training demands. Neither direction directly addresses structure: Ahmad et al. (2019) showed that structural sensitivity is important for modern parsers, and that insensitive parsers suffer. Data transfer is an alternative solution for alleviating typological divergences, via annotation projection Tiedemann (2014) or source treebank reordering Rasooli and Collins (2019); these approaches are typically limited by the availability of parallel data and by imperfect alignments. Our work aims to understand cross-lingual parsing in the context of model transfer, with typology serving as a language descriptor, toward the eventual goal of addressing structure directly.

8 Conclusion

Realizing the potential of typology may require rethinking current approaches. We can further drive performance by refining typology-based similarities into a metric more representative of actual transfer quality. Ultimately, we would like to design models that can directly leverage typological compositionality for distant languages.

Acknowledgments

We thank Dingquan Wang, Jason Eisner, the MIT NLP group (special thanks to Jiaming Luo), and the reviewers for their valuable comments. This research is supported in part by the Office of the Director of National Intelligence, Intelligence Advanced Research Projects Activity, via contract #FA8650-17-C-9116, and by the National Science Foundation Graduate Research Fellowship under Grant #1122374. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF, ODNI, IARPA, or the U.S. Government. We are also grateful for the support of the MIT Quest for Intelligence program.

References

  • W. Ammar, G. Mulcaire, M. Ballesteros, C. Dyer, and N. A. Smith (2016) Many languages, one parser. TACL 4, pp. 431–444.
  • E. M. Bender (2011) On achieving and evaluating language-independence in NLP. Linguistic Issues in Language Technology.
  • Y. Chu and T. Liu (1965) On the shortest arborescence of a directed graph. Scientia Sinica 14, pp. 1396–1400.
  • T. Dozat and C. D. Manning (2017) Deep biaffine attention for neural dependency parsing. In ICLR.
  • M. S. Dryer and M. Haspelmath (Eds.) (2013) WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.
  • J. Edmonds (1967) Optimum branchings. Journal of Research of the National Bureau of Standards 71B, pp. 233–240.
  • J. Guo, W. Che, D. Yarowsky, H. Wang, and T. Liu (2015) Cross-lingual dependency parsing based on distributed representations. In ACL-IJCNLP, Vol. 1, pp. 1234–1244.
  • J. Guo, W. Che, D. Yarowsky, H. Wang, and T. Liu (2016) A representation learning framework for multi-source transfer parsing. In AAAI.
  • M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, et al. (2017) Google's multilingual neural machine translation system: enabling zero-shot translation. TACL 5, pp. 339–351.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR.
  • T. Naseem, R. Barzilay, and A. Globerson (2012) Selective sharing for multilingual dependency parsing. In ACL, pp. 629–637.
  • J. Nivre, Ž. Agić, M. J. Aranzabe, M. Asahara, A. Atutxa, M. Ballesteros, J. Bauer, K. Bengoetxea, R. A. Bhat, C. Bosco, S. Bowman, G. G. A. Celano, M. Connor, M. de Marneffe, A. Diaz de Ilarraza, K. Dobrovoljc, T. Dozat, T. Erjavec, R. Farkas, J. Foster, D. Galbraith, F. Ginter, I. Goenaga, K. Gojenola, Y. Goldberg, B. Gonzales, B. Guillaume, J. Hajič, D. Haug, R. Ion, E. Irimia, A. Johannsen, H. Kanayama, J. Kanerva, S. Krek, V. Laippala, A. Lenci, N. Ljubešić, T. Lynn, C. Manning, C. Mărănduc, D. Mareček, H. Martínez Alonso, J. Mašek, Y. Matsumoto, R. McDonald, A. Missilä, V. Mititelu, Y. Miyao, S. Montemagni, S. Mori, H. Nurmi, P. Osenova, L. Øvrelid, E. Pascual, M. Passarotti, C. Perez, S. Petrov, J. Piitulainen, B. Plank, M. Popel, P. Prokopidis, S. Pyysalo, L. Ramasamy, R. Rosa, S. Saleh, S. Schuster, W. Seeker, M. Seraji, N. Silveira, M. Simi, R. Simionescu, K. Simkó, K. Simov, A. Smith, J. Štěpánek, A. Suhr, Z. Szántó, T. Tanaka, R. Tsarfaty, S. Uematsu, L. Uria, V. Varga, V. Vincze, Z. Žabokrtský, D. Zeman, and H. Zhu (2015) Universal Dependencies 1.2. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
  • H. O'Horan, Y. Berzak, I. Vulic, R. Reichart, and A. Korhonen (2016) Survey on the use of typological information in natural language processing. In COLING, pp. 1297–1308.
  • M. S. Rasooli and M. Collins (2019) Low-resource syntactic transfer with unsupervised source reordering. In NAACL, pp. 3845–3856.
  • T. Schuster, O. Ram, R. Barzilay, and A. Globerson (2019) Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In NAACL, pp. 1599–1613.
  • O. Täckström, R. McDonald, and J. Nivre (2013) Target language adaptation of discriminative transfer parsers. In NAACL, Atlanta, Georgia, pp. 1061–1071.
  • J. Tiedemann (2014) Rediscovering annotation projection for cross-lingual parser induction. In COLING, pp. 1854–1864.
  • D. Wang and J. Eisner (2016) The Galactic Dependencies treebanks: getting more data by synthesizing new languages. TACL 4, pp. 491–505.
  • D. Wang and J. Eisner (2017) Fine-grained prediction of syntactic typology: discovering latent structure with supervised learning. TACL 5, pp. 147–161.
  • D. Wang and J. Eisner (2018) Surface statistics of an unknown language indicate how to parse it. TACL 6, pp. 667–685.
  • Y. Zhang and R. Barzilay (2015) Hierarchical low-rank tensors for multilingual transfer parsing. In EMNLP.

Appendix A Dependency Relations for Deriving the Liu Directionalities

Among the 37 relation types defined in Universal Dependencies, we select the subset of dependency relations that appear in at least 20 languages, as listed in Table 5. For relation types that are missing in a specific language, we simply set the value (directionality) to 0.5, making no assumption about its tendency.

cc, conj, case, nsubj, nmod, dobj, mark,
advcl, amod, advmod, neg, nummod, xcomp,
ccomp, cop, acl, aux, punct, det, appos,
iobj, dep, csubj, parataxis, mwe, name,
nsubjpass, compound, auxpass, csubjpass,
vocative, discourse
Table 5: Subset of universal dependency relations used for deriving the Liu directionalities.
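A minimal sketch of deriving the Liu directionalities from a parsed treebank, assuming CoNLL-U-style (index, head, relation) triples and defaulting missing relations to 0.5 as described above; the relation subset shown is abbreviated from Table 5:

```python
from collections import defaultdict

RELATIONS = ["nsubj", "dobj", "amod", "case", "advmod"]  # abbreviated; see Table 5

def liu_directionalities(treebank, relations=RELATIONS):
    """Average directionality per dependency relation.

    treebank: list of sentences, each a list of (index, head, deprel) tuples
              with 1-based token indices and head == 0 for the root.
    Returns a dict relation -> fraction of arcs whose head precedes the
    dependent; relations unseen in this corpus default to 0.5 (Appendix A).
    """
    head_first = defaultdict(int)
    total = defaultdict(int)
    for sentence in treebank:
        for idx, head, rel in sentence:
            if head == 0 or rel not in relations:
                continue
            total[rel] += 1
            head_first[rel] += head < idx
    return {r: head_first[r] / total[r] if total[r] else 0.5 for r in relations}

# Toy treebank: "She reads books" -> reads(root), She(nsubj of 2), books(dobj of 2).
print(liu_directionalities([[(1, 2, "nsubj"), (2, 0, "root"), (3, 2, "dobj")]]))
# {'nsubj': 0.0, 'dobj': 1.0, 'amod': 0.5, 'case': 0.5, 'advmod': 0.5}
```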

Appendix B Feature Templates for Selective Sharing

We use the same set of selective sharing feature templates (Table 6) as Täckström et al. (2013).

d ∧ w.81A ∧ [h.p=VERB ∧ m.p=NOUN]
d ∧ w.81A ∧ [h.p=VERB ∧ m.p=PRON]
d ∧ w.85A ∧ [h.p=NOUN ∧ m.p=ADP]
d ∧ w.86A ∧ [h.p=PRON ∧ m.p=ADP]
d ∧ w.87A ∧ [h.p=NOUN ∧ m.p=ADJ]
Table 6: Arc-factored feature templates for selective sharing. Arc direction: d ∈ {Left, Right}; part-of-speech tag of head / modifier: h.p / m.p. WALS features: w.X for X = 81A (order of Subject, Verb and Object), 85A (order of Adposition and Noun), 86A (order of Genitive and Noun), 87A (order of Adjective and Noun).

Appendix C Training details

To train our baseline parser and its typology-augmented variants, we use Adam Kingma and Ba (2015) with a learning rate of 1e-3 for 200K updates (2M when using GD). We use a batch size of 500 tokens. Early stopping is also employed, based on the validation set in the training languages. Following Dozat and Manning (2017), we use a 3-layered bidirectional LSTM with a hidden size of 400. The hidden sizes of the MLPs for predicting arcs and dependency relations are 500 and 100, respectively.

Our baseline model shares all parameters across languages. During training, we truncate each training treebank to a maximum of 500K tokens for efficiency. Batch updates are composed of examples drawn from a single language, sampled such that the number of per-language updates is proportional to the size of each language's treebank. Following Wang and Eisner (2018), when training on GD, we sample a batch from a real language with a fixed probability, and a batch of GD data otherwise.

For fine-tuning, we perform 100 SGD updates with no early stopping. When using K-Means to obtain language clusters, we set the number of clusters based on cross-validation.

Appendix D LAS Results

Table 7 summarizes the LAS scores corresponding to Table 1 in the paper.

Language B B+$t^{\text{surf}}$   Our Baseline Selective Sharing +$t^{\text{wals}}$ +$t^{\text{liu}}$ +$t^{\text{surf}}$ Fine-tune
Basque 27.07 31.46   34.64 34.79 36.49 36.83 34.90 43.04
Croatian 48.68 52.29   61.34 61.41 59.86 63.72 61.60 65.07
Greek 50.10 56.73   56.51 56.53 55.16 60.18 56.59 64.66
Hebrew 49.71 53.29   41.15 41.05 43.58 43.63 41.50 43.14
Hungarian 42.85 47.73   32.65 33.43 34.14 32.01 33.07 44.26
Indonesian 39.46 47.63   47.17 48.21 51.82 50.78 49.22 62.23
Irish 39.06 40.75   39.63 39.60 43.02 42.14 40.24 48.58
Japanese 37.57 40.6   43.32 43.69 47.67 48.10 42.85 60.59
Slavonic 40.03 43.95   57.35 57.40 56.89 56.69 57.19 53.88
Persian 30.06 24.6   35.72 35.59 32.85 39.78 34.93 49.72
Polish 50.08 54.85   61.67 61.57 64.69 57.20 61.71 65.68
Romanian 50.90 53.42   55.77 56.21 55.99 59.28 56.48 59.12
Slovenian 57.09 61.48   70.86 70.01 70.44 70.03 70.29 73.81
Swedish 55.35 58.42   67.24 67.40 66.92 68.03 67.04 68.65
Tamil 28.39 37.81   33.81 34.57 34.96 36.61 34.70 47.46
AVG 43.09 47.00   49.26 49.43 50.30 51.00 49.49 56.66
Table 7: LAS results corresponding to Table 1 in the paper. Results with statistically insignificant differences from the baseline are marked (arc-level paired permutation test).

Appendix E Rules for Deriving Corpus-specific WALS Features

Table 8 summarizes the rules we used to derive corpus-specific WALS features. The values are determined by the dominance of the directionalities: if the proportion of → arcs (head preceding modifier) among the matching arcs exceeds a threshold R, the typological feature is set to the corresponding right-direction value, and vice versa; in-between proportions are set to Mixed. In our experiments, R is a fixed threshold. A sketch of this thresholding follows Table 8.

WALS ID Condition Values
82A relation ∈ {nsubj, csubj} ∧ h.p=VERB ∧ (m.p=NOUN ∨ m.p=PRON)  VS(→), SV(←), Mixed
83A relation ∈ {dobj, iobj} ∧ h.p=VERB ∧ (m.p=NOUN ∨ m.p=PRON)  VO(→), OV(←), Mixed
85A (h.p=NOUN ∨ h.p=PRON) ∧ m.p=ADP  Prepositions(←), Postpositions(→)
86A h.p=NOUN ∧ m.p=NOUN  Noun-Genitive(→), Genitive-Noun(←), Mixed
87A h.p=NOUN ∧ m.p=ADJ  Adjective-Noun(←), Noun-Adjective(→), Mixed
88A relation ∈ {det} ∧ m.p=DET  Demonstrative-Noun(←), Noun-Demonstrative(→), Mixed
Table 8: Rules for determining the dependency arc set of each specific WALS feature type. The arc direction specified in parentheses next to each value indicates the global directional tendency of the corresponding typological feature, where → denotes arcs in which the head precedes its modifier and ← the reverse.
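A sketch of the thresholding described above, which converts the arc-direction counts for one WALS feature into a categorical value; the threshold shown is an illustrative placeholder, not the R used in the paper:

```python
def corpus_wals_value(n_head_first, n_dep_first, head_first_value, dep_first_value, R=0.8):
    """Map direction counts for one WALS feature's arc set to a categorical value.

    n_head_first / n_dep_first: counts of matching arcs in which the head
    precedes / follows its modifier. If one direction's share exceeds the
    threshold R, the corresponding value is returned; otherwise the feature is
    labelled Mixed. (R=0.8 is a placeholder, not the paper's setting.)
    """
    total = n_head_first + n_dep_first
    if total == 0:
        return "Mixed"
    if n_head_first / total > R:
        return head_first_value
    if n_dep_first / total > R:
        return dep_first_value
    return "Mixed"

# Arabic 82A from Table 9: 4,875 Verb-Subject vs 2,489 Subject-Verb arcs.
print(corpus_wals_value(4875, 2489, "VS", "SV"))  # "Mixed" under R=0.8
```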

Appendix F Examples of Mismatching between WALS and Corpus Statistics

Table 9 shows some examples of mismatches between WALS and corpus statistics. Substantial variation exists for some typological features, and in several cases the dominant word order specified by linguists is questionable for UD v1.2 or even reversed (cf. the Arabic subject-verb order).

Language WALS ID WALS Value UD #{→} UD #{←}
Arabic 82A SV (←) 4,875 2,489
Czech 82A SV (←) 13,925 32,510
Czech 83A VO (→) 37,034 20,246
Spanish 83A VO (→) 10,745 6,119
Finnish 86A G-N (←) 6,010 8,134
Table 9: Examples of mismatches between WALS values and the arc directionalities collected from UD v1.2, where → counts arcs whose head precedes its modifier and ← the reverse. G-N is short for Genitive-Noun.