
Many Languages, One Parser

We train one multilingual model for dependency parsing and use it to parse sentences in several languages. The parsing model uses (i) multilingual word clusters and embeddings; (ii) token-level language information; and (iii) language-specific features (fine-grained POS tags). This input representation enables the parser not only to parse effectively in multiple languages, but also to generalize across languages based on linguistic universals and typological similarities, making it more effective to learn from limited annotations. Our parser's performance compares favorably to strong baselines in a range of data scenarios, including when the target language has a large treebank, a small treebank, or no treebank for training.



1 Introduction

Developing tools for processing many languages has long been an important goal in NLP [rosner:88, heid:89] (footnote 1: as of 2007, the total number of native speakers of the hundred most popular languages only accounts for 85% of the world’s population [native:07]), but it was only when statistical methods became standard that massively multilingual NLP became economical. The mainstream approach for multilingual NLP is to design language-specific models. For each language of interest, the resources necessary for training the model are obtained (or created), and separate parameters are fit for each language. This approach is simple and grants the flexibility of customizing the model and features to the needs of each language, but it is suboptimal for theoretical and practical reasons. Theoretically, the study of linguistic typology tells us that many languages share morphological, phonological, and syntactic phenomena [bender:11]; therefore, the mainstream approach misses an opportunity to exploit relevant supervision from typologically related languages. Practically, it is inconvenient to deploy or distribute NLP tools that are customized for many different languages because, for each language of interest, we need to configure, train, tune, monitor, and occasionally update the model. Furthermore, code-switching or code-mixing (mixing more than one language in the same discourse), which is pervasive in some genres, in particular social media, presents a challenge for monolingually-trained NLP models [barman:14]. (Footnote 2: While our parser can be used to parse input with code-switching, we have not evaluated this capability due to the lack of appropriate data.)

In parsing, the availability of homogeneous syntactic dependency annotations in many languages [mcdonald:13, universal:v1_0, universal:v1_1, universal:v1_2] has created an opportunity to develop a parser that is capable of parsing sentences in multiple languages, addressing these theoretical and practical concerns. (Footnote 3: Although multilingual dependency treebanks have been available for a decade via the 2006 and 2007 CoNLL shared tasks [buchholz:06, nivre:07], the treebank of each language was annotated independently and with its own annotation conventions.) A multilingual parser can potentially replace an array of language-specific monolingually-trained parsers (for languages with a large treebank). The same approach has been used in low-resource scenarios (with no treebank or a small treebank in the target language), where indirect supervision from auxiliary languages improves the parsing quality [cohen:11, mcdonald:11, zhang:15, duong:15b, duong:15, guo:16], but these models may sacrifice accuracy on source languages with a large treebank. In this paper, we describe a model that works well for both low-resource and high-resource scenarios.

We propose a parsing architecture that takes as input sentences in several languages (footnote 4: we discuss data requirements in the next section), optionally predicting the part-of-speech (POS) tags and input language. The parser is trained on the union of available universal dependency annotations in different languages. Our approach integrates and critically relies on several recent developments related to dependency parsing: universal POS tagsets [petrov:12], cross-lingual word clusters [tackstrom:12], selective sharing [naseem:12], universal dependency annotations [mcdonald:13, universal:v1_0, universal:v1_1, universal:v1_2], advances in neural network architectures [chen:14, dyer:15], and multilingual word embeddings [gardner:15, guo:16, ammar:16]. We show that our parser compares favorably to strong baselines trained on the same treebanks in three data scenarios: when the target language has a large treebank (Table 3), a small treebank (Table 7), or no treebank (Table 8). Our parser is publicly available.

Split          German (de)     English (en)    Spanish (es)    French (fr)     Italian (it)    Portuguese (pt) Swedish (sv)
UDT 2.0 train  14118 (264906)  39832 (950028)  14138 (375180)  14511 (351233)  6389 (149145)   9600 (239012)   4447 (66631)
UDT 2.0 dev.   801 (12215)     1703 (40117)    1579 (40950)    1620 (38328)    399 (9541)      1211 (29873)    493 (9312)
UDT 2.0 test   1001 (16339)    2416 (56684)    300 (8295)      300 (6950)      400 (9187)      1205 (29438)    1219 (20376)
UD 1.2 train   14118 (269626)  12543 (204586)  14187 (382436)  14552 (355811)  11699 (249307)  8800 (201845)   4303 (66645)
UD 1.2 dev.    799 (12512)     2002 (25148)    1552 (41975)    1596 (39869)    489 (11656)     271 (4833)      504 (9797)
UD 1.2 test    977 (16537)     2077 (25096)    274 (8128)      298 (7210)      489 (11719)     288 (5867)      1219 (20377)
UD 1.2 tags    -               50              -               -               36              866             134
Table 1: Number of sentences (tokens) in each treebank split in Universal Dependency Treebanks (UDT) version 2.0 and Universal Dependencies (UD) version 1.2 for the languages we experiment with. The last row gives the number of unique language-specific fine-grained POS tags used in a treebank.

2 Overview

Our goal is to train a dependency parser for a set of target languages $L^t$, given universal dependency annotations in a set of source languages $L^s$. Ideally, we would like to have training data in all target languages (i.e., $L^t \subseteq L^s$), but we are also interested in the case where the sets of source and target languages are disjoint (i.e., $L^t \cap L^s = \emptyset$). When all languages in $L^t$ have a large treebank, the mainstream approach has been to train one monolingual parser per target language and route sentences of a given language to the corresponding parser at test time. In contrast, our approach is to train one parsing model with the union of treebanks in $L^s$, then use this single trained model to parse text in any language in $L^t$, hence the name “Many Languages, One Parser” (MaLOPa). MaLOPa strikes a balance between: (1) enabling cross-lingual model transfer via language-invariant input representations; i.e., coarse POS tags, multilingual word embeddings and multilingual word clusters, and (2) tweaking the behavior of the parser depending on the current input language via language-specific representations; i.e., fine-grained POS tags and language embeddings.

In addition to universal dependency annotations in source languages (see Table 1), we use the following data resources for each language in $L = L^t \cup L^s$:

  • universal POS annotations for training a POS tagger (see §3.6 for details),

  • a bilingual dictionary with another language in $L$ for adding cross-lingual lexical information (our best results make use of this resource; we require that all languages in $L$ are (transitively) connected; the bilingual dictionaries we used are based on unsupervised word alignments of parallel corpora, as described in guo:16; see §3.3 for details),

  • language typology information (see §3.4 for details),

  • language-specific POS annotations (our best results make use of this resource; see §3.5 for details), and

  • a monolingual corpus (this is only used for training word embeddings with the ‘multiCCA,’ ‘multiCluster’ and ‘translation-invariance’ methods in Table 6; we do not use this resource when we compare to previous work).

Novel contributions of this paper include: (i) using one parser instead of an array of monolingually-trained parsers without sacrificing accuracy on languages with a large treebank, (ii) an effective neural network architecture for using language embeddings to improve multilingual parsing, and (iii) a study of how automatic language identification affects the performance of a multilingual dependency parser.

While not the primary focus of this paper, we also show that a variant of our parser outperforms previous work on multi-source cross-lingual parsing in low-resource scenarios, where languages in $L^t$ have a small treebank (see Table 7) or where $L^t \cap L^s = \emptyset$ (see Table 8). In the small treebank setup with 3,000 token annotations, we show that our parser consistently outperforms a strong monolingual baseline by 5.7 absolute LAS (labeled attachment score) points per language, on average.

3 Parsing Model

Recent advances suggest that recurrent neural networks, especially long short-term memory (LSTM) architectures, are capable of learning useful representations for modeling problems of sequential nature [graves:13, sutskever:14]. In this section, we describe our language-universal parser, which extends the stack LSTM (S-LSTM) parser of dyer:15.

3.1 Transition-based Parsing with S-LSTMs

This section briefly reviews Dyer et al.’s S-LSTM parser, which we modify in the following sections. The core parser can be understood as the sequential manipulation of three data structures:

  • a buffer (from which we read the token sequence),

  • a stack (which contains partially-built parse trees), and

  • a list of actions previously taken by the parser.

The parser uses the arc-standard transition system [nivre:04]. (Footnote 11: In a preprocessing step, we transform nonprojective trees in the training treebanks to pseudo-projective trees using the “baseline” scheme in [nivre:05]. We evaluate against the original nonprojective test set.) At each timestep $t$, a transition action is applied that alters these data structures according to Table 2.

Stack (t)  Buffer (t)  Action  Dependency  Stack (t+1)  Buffer (t+1)
Table 2: Parser transitions indicating the action applied to the stack and buffer at time $t$ and the resulting stack and buffer at time $t+1$.
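As a rough illustration of the three arc-standard actions, the sketch below manipulates a stack, a buffer and a list of arcs. The function name `apply_action` and the tuple layout are hypothetical. The naming convention is an assumption inferred from the adjective example in §3.4: `reduce-left` makes the item below the top of the stack (the left one) the head of the top item, and `reduce-right` makes the top item the head of the one below it.

```python
from typing import List, Tuple

Arc = Tuple[int, str, int]  # (head index, relation label, dependent index)


def apply_action(stack: List[int], buffer: List[int], arcs: List[Arc],
                 action: str, label: str = "") -> None:
    """Apply one transition in place.

    The top of the stack is stack[-1]; the next input token is buffer[0].
    """
    if action == "shift":
        if not buffer:
            raise ValueError("shift needs a non-empty buffer")
        stack.append(buffer.pop(0))
    elif action in ("reduce-left", "reduce-right"):
        if len(stack) < 2:
            raise ValueError(action + " needs two items on the stack")
        top = stack.pop()
        below = stack.pop()
        if action == "reduce-left":
            head, dependent = below, top   # assumed: left item is the head
        else:
            head, dependent = top, below   # assumed: right item is the head
        arcs.append((head, label, dependent))
        stack.append(head)                 # only the head stays on the stack
    else:
        raise ValueError("unknown action: " + action)


# French-style "chat noir": noun (token 0) followed by adjective (token 1).
stack, buffer, arcs = [], [0, 1], []
for step in [("shift", ""), ("shift", ""), ("reduce-left", "amod")]:
    apply_action(stack, buffer, arcs, *step)
```

After the three steps, the adjective is attached to the noun and only the noun remains on the stack.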

Along with the discrete transitions of the arc-standard system, the parser computes vector representations for the buffer, stack and list of actions at time step $t$, denoted $b_t$, $s_t$, and $a_t$, respectively. (Footnote 12: A stack-LSTM module is used to compute the vector representation for each data structure, as detailed in dyer:15.) The parser state at time $t$ is given by:

$$p_t = \max\{0,\; W[s_t; b_t; a_t] + W_{\text{bias}}\} \qquad (1)$$

where the matrix $W$ and the vector $W_{\text{bias}}$ are learned parameters. The matrix $W$ is multiplied by the vector created by the concatenation of $s_t$, $b_t$, $a_t$. The parser state $p_t$ is then used to define a categorical distribution over possible next actions $z$: (Footnote 13: The total number of actions is $1 + 2\times$ the number of unique dependency labels in the treebank used for training, but we only consider actions which meet the arc-standard preconditions in Fig. 2.)

$$p(z \mid p_t) = \frac{\exp(g_z^\top p_t + q_z)}{\sum_{z'} \exp(g_{z'}^\top p_t + q_{z'})} \qquad (2)$$

where $g_z$ and $q_z$ are parameters associated with action $z$. The selected action is then used to update the buffer, stack and list of actions, and to compute $b_{t+1}$, $s_{t+1}$ and $a_{t+1}$ accordingly.

The model is trained to maximize the log-likelihood of correct actions. At test time, the parser greedily chooses the most probable action in every time step until a complete parse tree is produced.
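The state computation and greedy action choice can be sketched in plain Python as follows. This is a minimal illustration, not the released parser: the rectified linear nonlinearity follows the stack-LSTM parser of dyer:15 and is an assumption here, and all function and parameter names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def parser_state(W: Sequence[Sequence[float]], bias: Sequence[float],
                 s: Sequence[float], b: Sequence[float],
                 a: Sequence[float]) -> List[float]:
    """Rectified affine map of the concatenation [s; b; a] (assumed ReLU)."""
    x = list(s) + list(b) + list(a)
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, bias)]


def action_distribution(p: Sequence[float], g: Dict[str, Sequence[float]],
                        q: Dict[str, float],
                        valid: Sequence[str]) -> Dict[str, float]:
    """Softmax over the actions whose preconditions are met."""
    scores = {z: sum(gi * pi for gi, pi in zip(g[z], p)) + q[z] for z in valid}
    shift = max(scores.values())  # subtract the max for numerical stability
    exps = {z: math.exp(v - shift) for z, v in scores.items()}
    total = sum(exps.values())
    return {z: e / total for z, e in exps.items()}


def greedy_action(dist: Dict[str, float]) -> str:
    """Pick the most probable action, as done at test time."""
    return max(dist, key=dist.get)


# Toy parameters: a 2-dimensional state computed from three 1-dimensional inputs.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
state = parser_state(W, [0.0, 0.0], s=[1.0], b=[0.5], a=[-1.0])
g = {"shift": [1.0, 0.0], "reduce-left": [2.0, 0.0], "reduce-right": [5.0, 0.0]}
q = {"shift": 0.0, "reduce-left": 0.0, "reduce-right": 0.0}
dist = action_distribution(state, g, q, valid=["shift", "reduce-left"])
```

Restricting the softmax to `valid` mirrors the restriction to actions that satisfy the arc-standard preconditions; here `reduce-right` has the highest raw score but is excluded.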

The following sections describe our extensions of the core parser. More details about the core parser can be found in dyer:15.

3.2 Token Representations

The vector representations of input tokens feed into the stack-LSTM modules of the buffer and the stack. For monolingual parsing, we represent each token by concatenating the following vectors:

  • a fixed, pretrained embedding of the word type,

  • a learned embedding of the word type,

  • a learned embedding of the Brown cluster,

  • a learned embedding of the fine-grained POS tag,

  • a learned embedding of the coarse POS tag.

For multilingual parsing with MaLOPa, we start with a simple delexicalized model where the token representation only consists of learned embeddings of coarse POS tags, which are shared across all languages to enable model transfer. In the following subsections, we enhance the token representation in MaLOPa to include lexical embeddings, language embeddings, and fine-grained POS embeddings.

3.3 Lexical Embeddings

Previous work has shown that sacrificing lexical features amounts to a substantial decrease in the performance of a dependency parser [cohen:11, tackstrom:12, tiedemann:15, guo:15]. Therefore, we extend the token representation in MaLOPa by concatenating learned embeddings of multilingual word clusters, and pretrained multilingual embeddings of word types.

Multilingual Brown clusters.

Before training the parser, we estimate Brown clusters of English words and project them via word alignments to words in other languages. This is similar to the ‘projected clusters’ method in tackstrom:12. To go from Brown clusters to embeddings, we ignore the hierarchy within Brown clusters and assign a unique parameter vector to each cluster.

Multilingual word embeddings.

We also use Guo et al.’s (2016) ‘robust projection’ method to pretrain multilingual word embeddings. The first step in ‘robust projection’ is to learn embeddings for English words using the skip-gram model [mikolov:13]. Then, we compute an embedding of non-English words as the weighted average of English word embeddings, using word alignment probabilities as weights. The last step computes an embedding of non-English words which are not aligned to any English words by averaging the embeddings of all words within an edit distance of 1 in the same language. We experiment with two other methods—‘multiCCA’ and ‘multiCluster,’ both proposed by ammar:16—for pretraining multilingual word embeddings in §4.1. ‘MultiCCA’ uses a linear operator to project pretrained monolingual embeddings in each language (except English) to the vector space of pretrained English word embeddings, while ‘multiCluster’ uses the same embedding for translationally-equivalent words in different languages. The results in Table 6 illustrate that the three methods perform similarly on this task.
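The two stages of ‘robust projection’ described above (alignment-weighted averaging, then an edit-distance fallback) might be sketched as follows. The data layout (dictionaries of vectors and alignment probabilities) and all function names are assumptions made for illustration; this is not the implementation used in the experiments.

```python
from typing import Dict, List

Vector = List[float]


def within_edit_distance_one(a: str, b: str) -> bool:
    """True if a and b differ by exactly one insertion, deletion or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # ensure len(a) <= len(b)
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # one substitution
    return a[i:] == b[i + 1:]            # one insertion into the longer word


def weighted_average(vectors: List[Vector], weights: List[float]) -> Vector:
    total = sum(weights)
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total
            for d in range(len(vectors[0]))]


def robust_projection(english: Dict[str, Vector],
                      alignments: Dict[str, Dict[str, float]],
                      vocabulary: List[str]) -> Dict[str, Vector]:
    projected: Dict[str, Vector] = {}
    # Stage 1: aligned words get the alignment-weighted average of English vectors.
    for word in vocabulary:
        pairs = [(english[en], p) for en, p in alignments.get(word, {}).items()
                 if en in english]
        if pairs:
            projected[word] = weighted_average([v for v, _ in pairs],
                                               [p for _, p in pairs])
    # Stage 2: unaligned words average their edit-distance-1 neighbours.
    aligned = dict(projected)
    for word in vocabulary:
        if word in aligned:
            continue
        neighbours = [vec for other, vec in aligned.items()
                      if within_edit_distance_one(word, other)]
        if neighbours:
            projected[word] = weighted_average(neighbours, [1.0] * len(neighbours))
    return projected


english = {"play": [1.0, 0.0], "plays": [0.0, 1.0]}
alignments = {"juega": {"plays": 0.75, "play": 0.25}, "juego": {"play": 1.0}}
result = robust_projection(english, alignments, ["juega", "juego", "juegx"])
```

In this toy example the misspelled "juegx" has no alignments, so it receives the average of its two one-edit neighbours.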

3.4 Language Embeddings

While many languages, especially ones that belong to the same family, exhibit some similar syntactic phenomena (e.g., all languages have subjects, verbs, and objects), substantial syntactic differences abound. Some of these differences are easy to characterize (e.g., subject-verb-object vs. verb-subject-object, prepositions vs. postpositions, adjective-noun vs. noun-adjective), while others are subtle (e.g., number and positions of negation morphemes). It is not at all clear how to translate descriptive facts about a language’s syntax into features for a parser.

Consequently, training a language-universal parser on treebanks in multiple source languages requires caution. While exposing the parser to a diverse set of syntactic patterns across many languages has the potential to improve its performance in each, dependency annotations in one language will, in some ways, contradict those in typologically different languages.

For instance, consider a context where the next word on the buffer is a noun, and the top word on the stack is an adjective, followed by a noun. Treebanks of languages where postpositive adjectives are typical (e.g., French) will often teach the parser to predict reduce-left, while those of languages where prepositive adjectives are more typical (e.g., English) will teach the parser to predict shift.

Inspired by naseem:12, we address this problem by informing the parser about the input language it is currently parsing. Let $l$ be the input vector representation of a particular language. We consider three definitions for $l$ (footnote 14: the files which contain these definitions are available at …):

  • one-hot encoding of the language ID,

  • one-hot encoding of individual word-order properties (footnote 15: the World Atlas of Language Structures (WALS; Dryer and Haspelmath, 2013) is an online portal documenting typological properties of 2,679 languages, as of July 2015; we use the same set of WALS features used by zhang:15, namely 82A (order of subject and verb), 83A (order of object and verb), 85A (order of adposition and noun phrase), 86A (order of genitive and noun), and 87A (order of adjective and noun)), and

  • averaged one-hot encoding of WALS typological properties, including word-order properties (footnote 16: some WALS features are not annotated for all languages; therefore, we use the average value of all languages in the same genus; we rescale all values to be in the range …).

It is worth noting that the first definition (language ID) turns out to work best in our experiments.

We use a hidden layer with $\tanh$ nonlinearity to compute the language embedding $l'$ as:

$$l' = \tanh(L\,l + L_{\text{bias}})$$

where the matrix $L$ and the vector $L_{\text{bias}}$ are additional model parameters. We modify the parsing architecture as follows:

  • include $l'$ in the token representation (which feeds into the stack-LSTM modules of the buffer and the stack as described in §3.1),

  • include $l'$ in the action vector representation (which feeds into the stack-LSTM module that represents previous actions as described in §3.1), and

  • redefine the parser state at time $t$ as $p_t = \max\{0,\; W[s_t; b_t; a_t; l'] + W_{\text{bias}}\}$.

Intuitively, the first two modifications allow the input language to influence the vector representation of the stack, the buffer and the list of actions. The third modification allows the input language to influence the parser state which in turn is used to predict the next action. In preliminary experiments, we found that adding the language embeddings at the token and action level is important. We also experimented with computing more complex functions of ($s_t$, $b_t$, $a_t$, $l'$) to define the parser state, but they did not help.
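A minimal sketch of the ‘language ID’ variant: a one-hot vector is passed through a single hidden layer and the result is appended to a token representation. The use of tanh, the language ordering, and every name below are assumptions made for illustration only.

```python
import math
from typing import List, Sequence

LANGUAGES = ["de", "en", "es", "fr", "it", "pt", "sv"]  # assumed ordering


def one_hot_language(lang: str, languages: Sequence[str] = LANGUAGES) -> List[float]:
    """One-hot encoding of the language ID."""
    if lang not in languages:
        raise ValueError("unknown language: " + lang)
    return [1.0 if code == lang else 0.0 for code in languages]


def language_embedding(l: Sequence[float], L: Sequence[Sequence[float]],
                       L_bias: Sequence[float]) -> List[float]:
    """Hidden layer over the language vector (tanh is an assumption)."""
    return [math.tanh(sum(w * x for w, x in zip(row, l)) + b)
            for row, b in zip(L, L_bias)]


def with_language(token_vector: Sequence[float],
                  lang_vector: Sequence[float]) -> List[float]:
    """Concatenate the language embedding onto a token representation."""
    return list(token_vector) + list(lang_vector)


# Toy parameters: a 2-dimensional language embedding for 7 languages.
L = [[0.1 * i for i in range(7)], [-0.1 * i for i in range(7)]]
L_bias = [0.0, 0.5]
fr = language_embedding(one_hot_language("fr"), L, L_bias)
token = with_language([0.2, 0.4, 0.6], fr)
```

With a one-hot input, the hidden layer simply selects one column of the weight matrix, so each language ends up with its own learned vector; the same vector could be appended to the action representation and the parser state input.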

3.5 Fine-grained POS Tag Embeddings

tiedemann:15 shows that omitting fine-grained POS tags significantly hurts the performance of a dependency parser. However, those fine-grained POS tagsets are defined monolingually and are only available for a subset of the languages with universal dependency treebanks.

We extend the token representation to include a fine-grained POS embedding (in addition to the coarse POS embedding). We stochastically drop out the fine-grained POS embedding for each token with 50% probability [srivastava:14] so that the parser can make use of fine-grained POS tags when available but stay reliable when the fine-grained POS tags are missing.

3.6 Predicting POS Tags

The model discussed thus far conditions on the POS tags of words in the input sentence. However, gold POS tags may not be available in real applications (e.g., parsing the web). Here, we describe two modifications to (i) model both POS tagging and dependency parsing, and (ii) increase the robustness of the parser to incorrect POS predictions.

Tagging model.

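Let $x_1, \ldots, x_n$, $y_1, \ldots, y_n$, and $z_1, \ldots, z_{2n}$ be the sequence of words, POS tags, and parsing actions, respectively, for a sentence of length $n$. We define the joint distribution of a POS tag sequence and parsing actions given a sequence of words as follows:

$$p(y_1, \ldots, y_n, z_1, \ldots, z_{2n} \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} p(y_i \mid x_1, \ldots, x_n) \times \prod_{j=1}^{2n} p(z_j \mid p_j)$$

where $p(z_j \mid p_j)$ is defined in Eq. 2, and $p(y_i \mid x_1, \ldots, x_n)$ uses a bidirectional LSTM [graves:13]. huang:15 show that the performance of a bidirectional LSTM POS tagger is on par with a conditional random field tagger.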

We use slightly different token representations for tagging and parsing in the same model. For tagging, we construct the token representation by concatenating the embeddings of the word type (pretrained), the Brown cluster and the input language. This token representation feeds into the bidirectional LSTM, followed by a softmax layer (at each position) which defines a categorical distribution over possible POS tags. For parsing, we construct the token representation by further concatenating the embeddings of predicted POS tags. This token representation feeds into the stack-LSTM modules of the buffer and stack components of the transition-based parser. This multi-task learning setup enables us to predict both POS tags and dependency trees in the same model. We note that pretrained word embeddings, cluster embeddings and language embeddings are shared for tagging and parsing.

Block dropout.

We use an independently developed variant of word dropout [iyyer:15], which we call block dropout. The token representation used for parsing includes the embedding of predicted POS tags, which may be incorrect. We introduce another modification which makes the parser more robust to incorrect POS tag predictions, by stochastically zeroing out the entire embedding of the POS tag. While training the parser, we replace the POS embedding vector $e$ with another vector $e'$ (of the same dimensionality) that is stochastically zeroed out according to a Bernoulli-distributed random variable $b \in \{0, 1\}$ with parameter $\mu$, which is initialized to 1.0 (i.e., always drop out, setting $b = 1$, $e' = 0$), and is dynamically updated to match the error rate of the POS tagger on the development set. At test time, we never drop out the predicted POS embedding, i.e., $b = 0$, $e' = e$. Intuitively, this method extends the dropout method [srivastava:14] to address structured noise in the input layer.
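Block dropout as described above can be sketched in a few lines. The sketch zeroes the whole POS embedding with probability equal to the dropout parameter during training and passes it through unchanged at test time. It applies no rescaling to the surviving vector, which is a simplifying assumption; the names `block_dropout` and `mu` are hypothetical.

```python
import random
from typing import List, Sequence


def block_dropout(pos_embedding: Sequence[float], mu: float,
                  rng: random.Random, training: bool = True) -> List[float]:
    """Zero out the entire embedding with probability mu while training.

    mu would start at 1.0 and track the tagger's development-set error rate.
    No rescaling of the kept vector is applied (an assumption of this sketch).
    """
    if not training:
        return list(pos_embedding)         # never drop at test time
    drop = rng.random() < mu               # Bernoulli draw with parameter mu
    if drop:
        return [0.0] * len(pos_embedding)  # the whole block is zeroed together
    return list(pos_embedding)


rng = random.Random(13)
e = [0.3, -0.7, 0.1]
always_dropped = block_dropout(e, mu=1.0, rng=rng)
never_dropped = block_dropout(e, mu=0.0, rng=rng)
at_test_time = block_dropout(e, mu=1.0, rng=rng, training=False)
```

Unlike element-wise dropout, the embedding is either kept whole or removed whole, which is what lets the parser learn to rely on other inputs when the predicted tag is likely to be wrong.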

LAS (by target language)  de    en    es    fr    it    pt    sv    average
monolingual               79.3  85.9  83.7  81.7  88.7  85.7  83.5  84.0
MaLOPa                    70.4  69.3  72.4  71.1  78.0  74.1  65.4  71.5
+lexical                  76.7  82.0  82.7  81.2  87.6  82.1  81.2  81.9
  +language ID            78.6  84.2  83.4  82.4  89.1  84.2  82.6  83.5
    +fine-grained POS     78.9  85.4  84.3  82.4  89.0  86.2  84.5  84.3
Table 3: Dependency parsing: labeled attachment scores (LAS) for monolingually-trained parsers and MaLOPa in the fully supervised scenario where $L^t = L^s$. Note that we use Universal Dependencies version 1.2, which only includes annotations for 13,000 English sentences; this explains the relatively low scores in English. When we instead use the universal dependency treebanks version 2.0, which includes annotations for 40,000 English sentences (originally from the English Penn Treebank), we achieve UAS score 93.0 and LAS score 91.5.

4 Experiments

In this section, we evaluate the MaLOPa approach in three data scenarios: when the target language has a large treebank (Table 3), a small treebank (Table 7) or no treebank (Table 8).


For experiments where the target language has a large treebank, we use the standard data splits for German (de), English (en), Spanish (es), French (fr), Italian (it), Portuguese (pt) and Swedish (sv) in the latest release (version 1.2) of Universal Dependencies [universal:v1_2], and experiment with both gold and predicted POS tags. For experiments where the target language has no treebank, we use the standard splits for these languages in the older universal dependency treebanks v2.0 [mcdonald:13] and use gold POS tags, following the baselines [zhang:15, guo:16]. Table 1 gives the number of sentences and words annotated for each language in both versions. In a preprocessing step, we lowercase all tokens and remove multi-word annotations and language-specific dependency relations. We use the same multilingual Brown clusters and multilingual embeddings of guo:16, kindly provided by the authors.


We follow dyer:15 in parameter initialization and optimization. (Footnote 17: We use stochastic gradient updates with an initial learning rate of $\eta_0 = 0.1$ in epoch #0, and update the learning rate in following epochs as $\eta_t = \eta_0 / (1 + 0.1t)$. We clip the norm of the gradient to avoid “exploding” gradients. Unlabeled attachment score (UAS) on the development set determines early stopping. Parameters are initialized with uniform samples in $\pm\sqrt{6/(r+c)}$, where $r$ and $c$ are the sizes of the previous and following layer in the neural network [glorot:10].) The standard deviations of the labeled attachment score (LAS) due to random initialization in individual target languages are 0.36 (de), 0.40 (en), 0.37 (es), 0.46 (fr), 0.47 (it), 0.41 (pt) and 0.24 (sv). The standard deviation of the average LAS scores across languages is 0.17.
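The optimization details mentioned in the footnote (learning-rate decay, gradient norm clipping, and layer-size-dependent uniform initialization) could look roughly like this in plain Python. The decay form and the initialization bound follow dyer:15 and glorot:10 respectively and are assumptions of this sketch, as are the function names and the example constants.

```python
import math
import random
from typing import List, Sequence


def decayed_learning_rate(eta0: float, rho: float, epoch: int) -> float:
    """Inverse decay eta0 / (1 + rho * epoch); the form is assumed, not confirmed."""
    return eta0 / (1.0 + rho * epoch)


def clip_gradient(grad: Sequence[float], max_norm: float) -> List[float]:
    """Rescale the gradient so that its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def glorot_uniform(rows: int, cols: int, rng: random.Random) -> List[List[float]]:
    """Uniform samples in +/- sqrt(6 / (rows + cols)) (normalized initialization)."""
    limit = math.sqrt(6.0 / (rows + cols))
    return [[rng.uniform(-limit, limit) for _ in range(cols)]
            for _ in range(rows)]


weights = glorot_uniform(4, 8, random.Random(0))
clipped = clip_gradient([3.0, 4.0], max_norm=1.0)
rates = [decayed_learning_rate(0.1, 0.1, t) for t in range(3)]
```

The constants passed in the last three lines are example values only.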

However, when training the parser on multiple languages in MaLOPa, instead of updating the parameters with the gradient of individual sentences, we use mini-batch updates which include one sentence sampled uniformly (without replacement) from each language’s treebank, until all sentences in the smallest treebank are used (which concludes an epoch). We repeat the same process in following epochs. We found this to help prevent one source language with a larger treebank (e.g., German) from dominating parameter updates at the expense of other source languages with a smaller treebank (e.g., Swedish).
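The balanced mini-batch scheme can be sketched as a generator that draws one sentence per language without replacement and stops once the smallest treebank is exhausted. The function name and the dictionary-of-lists layout are hypothetical.

```python
import random
from typing import Dict, Iterator, List, Sequence, Tuple


def balanced_epoch(treebanks: Dict[str, Sequence[str]],
                   rng: random.Random) -> Iterator[List[Tuple[str, str]]]:
    """Yield mini-batches holding one sentence from each language.

    An epoch ends when every sentence of the smallest treebank has been used.
    """
    smallest = min(len(sentences) for sentences in treebanks.values())
    # Sampling without replacement: draw `smallest` distinct indices per language.
    order = {lang: rng.sample(range(len(sentences)), smallest)
             for lang, sentences in treebanks.items()}
    for step in range(smallest):
        yield [(lang, treebanks[lang][order[lang][step]]) for lang in treebanks]


toy = {"de": ["de-%d" % i for i in range(5)], "sv": ["sv-0", "sv-1"]}
batches = list(balanced_epoch(toy, random.Random(0)))
```

Calling the generator again for the next epoch draws a fresh sample, so larger treebanks contribute different sentences over time while every update stays balanced across languages.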

4.1 Target Languages with a Treebank ($L^t = L^s$)

Here, we evaluate our MaLOPa parser when the target language has a treebank.


For each target language, the strong baseline we use is a monolingually-trained S-LSTM parser with a token representation which concatenates: pretrained word embeddings (50 dimensions; footnote 18: these embeddings are treated as fixed inputs to the parser, and are not optimized towards the parsing objective; we use the same embeddings used in guo:16), learned word embeddings (50 dimensions), coarse (universal) POS tag embeddings (12 dimensions), fine-grained (language-specific, when available) POS tag embeddings (12 dimensions), and embeddings of Brown clusters (12 dimensions), and uses a two-layer S-LSTM for each of the stack, the buffer and the list of actions. We independently train one baseline parser for each target language, and share no model parameters. This baseline, denoted ‘monolingual’ in Tables 3 and 7, achieves UAS score 93.0 and LAS score 91.5 when trained on the English Penn Treebank, which is comparable to dyer:15.

Recall %               left  right  root  short  long  nsubj*  dobj  conj  *comp  case  *mod
monolingual            89.9  95.2   86.4  92.9   81.1  77.3    75.5  66.0  45.6   93.3  77.0
MaLOPa                 85.4  93.3   80.2  91.2   73.3  57.3    62.7  64.2  34.0   90.7  69.6
+lexical               89.9  93.8   84.5  92.6   78.6  73.3    73.4  66.9  35.3   91.6  75.3
  +language ID         89.1  94.7   86.6  93.2   81.4  74.7    73.0  71.2  48.2   92.8  76.3
    +fine-grained POS  89.5  95.7   87.8  93.6   82.0  74.7    74.9  69.7  46.0   93.3  76.3
Table 4: Recall of some classes of dependency attachments/relations in German.

LAS (by target language)
language ID  coarse POS  de    en    es    fr    it    pt    sv    average
gold         gold        78.6  84.2  83.4  82.4  89.1  84.2  82.6  83.5
predicted    gold        78.5  80.2  83.4  82.1  88.9  83.9  82.5  82.7
gold         predicted   71.2  79.9  80.5  78.5  85.0  78.4  75.5  78.4
predicted    predicted   70.8  74.1  80.5  78.2  84.7  77.1  75.5  77.2
Table 5: Effect of automatically predicting language ID and POS tags with MaLOPa on LAS scores.


We train MaLOPa on the concatenation of training sections of all seven languages. To balance the development set, we only concatenate the first 300 sentences of each language’s development section.

Token representations.

The first MaLOPa parser we evaluate uses only coarse POS embeddings to construct the token representation. (Footnote 19: We use the same number of dimensions for the coarse POS embeddings as in the monolingual baselines. The same applies to all other types of embeddings used in MaLOPa.) As shown in Table 3, this parser consistently underperforms the monolingual baselines, with a gap of 12.5 LAS points on average.

Augmenting the token representation with lexical embeddings (both multilingual word clusters and pretrained multilingual word embeddings, as described in §3.3) substantially improves the performance of MaLOPa, recovering 83% of the gap in average performance.

We experimented with three ways to include language information in the token representation, namely: ‘language ID’, ‘word order’ and ‘full typology’ (see §3.4 for details), and found all three to improve the performance of MaLOPa, giving LAS scores 83.5, 83.2 and 82.5, respectively. It is noteworthy that the model benefits more from language ID than from typological properties. Using ‘language ID,’ we recover another 12% of the original gap.
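The "share of the gap recovered" figures can be checked against the average LAS column of Table 3 with a short calculation. The variable names are ours; the numbers are taken from the table.

```python
# Average LAS values from Table 3.
monolingual = 84.0    # monolingual baselines
delexicalized = 71.5  # MaLOPa with coarse POS embeddings only
with_lexical = 81.9   # + lexical embeddings
with_lang_id = 83.5   # + language ID

gap = monolingual - delexicalized  # 12.5 LAS points


def share_recovered(before: float, after: float) -> float:
    """Percentage of the original gap closed by moving from `before` to `after`."""
    return 100.0 * (after - before) / gap


lexical_share = share_recovered(delexicalized, with_lexical)
lang_id_share = share_recovered(with_lexical, with_lang_id)
```

With these inputs, lexical embeddings close 10.4 of the 12.5 points (about 83%), and language ID closes a further 1.6 points (about 12.8% of the original gap, reported in the text as 12%).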

Finally, the best configuration of MaLOPa adds fine-grained POS embeddings to the token representation. (Footnote 20: Fine-grained POS tags were only available for English, Italian, Portuguese and Swedish. Other languages reuse the coarse POS tags as fine-grained tags instead of padding the extra dimensions in the token representation with zeros.)

Surprisingly, adding fine-grained POS embeddings improves the performance even for some languages where fine-grained POS tags are not available (e.g., Spanish). This parser outperforms the monolingual baseline in five out of seven target languages, and wins on average by 0.3 LAS points. We emphasize that this model is only trained once on all languages, and the same model is used to parse the test set of each language, which simplifies the distribution or deployment of multilingual parsing software.

Qualitative analysis.

To gain a better understanding of the model behavior, we analyze certain classes of dependency attachments/relations in German, which has notably flexible word order, in Table 4. We consider the recall of left attachments (where the head word precedes the dependent word in the sentence), right attachments, root attachments, short attachments and long attachments (grouped by the distance between head and dependent), as well as the following relation groups: nsubj* (nominal subjects: nsubj, nsubjpass), dobj (direct object: dobj), conj (conjunct: conj), *comp (clausal complements: ccomp, xcomp), case (clitics and adpositions: case), *mod (modifiers of a noun: nmod, nummod, amod, appos), neg (negation modifier: neg). (Footnote 21: For each group, we report recall of both the attachment and relation weighted by the number of instances in the gold annotation. A detailed description of each relation can be found at …)

We found that each of the three improvements (lexical embeddings, language embeddings and fine-grained POS embeddings) tends to improve recall for most classes. MaLOPa underperforms (compared to the monolingual baseline) in some classes: nominal subjects, direct objects and modifiers of a noun. Nevertheless, MaLOPa outperforms the baseline in some important classes such as: root, long attachments and conjunctions.
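Per-class recall of the kind reported in Table 4 can be computed along the following lines. The sketch covers only the left, right and root classes, counts a token as correct when both head and relation match the gold annotation, and uses a hypothetical input format (one `(head, label)` pair per token, with head index 0 denoting the root).

```python
from collections import Counter
from typing import Dict, List, Tuple

Attachment = Tuple[int, str]  # (head index, relation label); head 0 is the root


def attachment_class(head: int, dependent: int) -> str:
    """'left' means the head precedes the dependent, as defined in the text."""
    if head == 0:
        return "root"
    return "left" if head < dependent else "right"


def recall_by_class(gold: List[Attachment],
                    predicted: List[Attachment]) -> Dict[str, float]:
    """Fraction of gold attachments in each class that were predicted exactly."""
    totals, hits = Counter(), Counter()
    for dependent, (g, p) in enumerate(zip(gold, predicted), start=1):
        cls = attachment_class(g[0], dependent)  # class is defined by the gold arc
        totals[cls] += 1
        if g == p:                               # head and relation both match
            hits[cls] += 1
    return {cls: hits[cls] / totals[cls] for cls in totals}


# Three-token toy sentence: token 2 is the root, tokens 1 and 3 attach to it.
gold = [(2, "nsubj"), (0, "root"), (2, "dobj")]
predicted = [(2, "nsubj"), (0, "root"), (1, "dobj")]
recall = recall_by_class(gold, predicted)
```

Grouping by relation label instead of direction would follow the same pattern with a different `attachment_class` function.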

Predicting language IDs and POS tags.

In Table 3, we assume that both gold language ID of the input language and gold POS tags are given at test time. However, this assumption is not realistic in practical applications. Here, we quantify the degradation in parsing accuracy when language ID and POS tags are only given at training time, but must be predicted at test time. We do not use fine-grained POS tags in these experiments because some languages use a very large fine-grained POS tag set (e.g., 866 unique tags in Portuguese).

In order to predict language ID, we use the langid.py library [lui:12] and classify individual sentences in the test sets to one of the seven languages of interest, using the default models included in the library. The macro average language ID prediction accuracy on the test set across sentences is 94.7%. In order to predict POS tags, we use the model described in §3.6 with both input and hidden LSTM dimensions of 60, and with block dropout. The macro average accuracy of the POS tagger is 93.3%. Table 5 summarizes the four configurations: {gold language ID, predicted language ID} × {gold POS tags, predicted POS tags}. The performance of the parser suffers mildly (–0.8 LAS points) when using predicted language IDs, but more (–5.1 LAS points) when using predicted POS tags. As an alternative approach to predicting POS tags, we trained the Stanford POS tagger, for each target language, on the coarse POS tag annotations in the training section of the universal dependency treebanks (footnote 23: we used version 3.6.0 of the Stanford POS tagger, with the following pre-packaged configuration files: german-fast-caseless.tagger.props (de), english-caseless-left3words-distsim.tagger.props (en), spanish.tagger.props (es), french.tagger.props (fr); we reused french.tagger.props for it, pt and sv), then replaced the gold POS tags in the test set of each language with predictions of the monolingual tagger. The resulting degradation in parsing performance between gold vs. predicted POS tags is –6.0 LAS points (on average, compared to a degradation of –5.1 LAS points in Table 5). The disparity in parsing results with gold vs. predicted POS tags is an important open problem, and has been previously discussed by tiedemann:15.

The predicted POS results in Table 5 use block dropout. Without using block dropout, we lose an extra 0.2 LAS points in both configurations using predicted POS tags.

Different multilingual embeddings.

multilingual embeddings   UAS    LAS
multiCluster              87.7   84.1
multiCCA                  87.8   84.4
robust projection         87.8   84.2

Table 6: Effect of multilingual embedding estimation method on the multilingual parsing with MaLOPa. UAS and LAS scores are macro-averaged across seven target languages.

Several methods have been proposed for pretraining multilingual word embeddings. We compare three of them:

  • multiCCA [ammar:16] uses a linear operator to project pretrained monolingual embeddings in each language (except English) to the vector space of pretrained English word embeddings.

  • multiCluster [ammar:16] uses the same embedding for translationally-equivalent words in different languages.

  • robust projection [guo:15] first pretrains monolingual English word embeddings, then defines the embedding of a non-English word as the weighted average embedding of English words aligned to the non-English word (in a parallel corpus). The embedding of a non-English word which is not aligned to any English words is defined as the average embedding of words with a unit edit distance in the same language (e.g., ‘playz’ is the average of ‘plays’ and ‘play’). [fn 24]

[fn 24] Our implementation of this method can be found at
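The robust projection recipe described in the last item can be sketched in a few lines of pure Python. This is an illustrative sketch, not the authors' implementation: it assumes that the weights are alignment counts, that the fallback only averages over words that received an embedding from alignments, and that all function names, words, and vectors below are made up.

```python
def weighted_average(vectors, weights):
    """Component-wise weighted average of equally sized vectors."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total
            for i in range(dim)]


def edit_distance_is_one(a, b):
    """True if a and b differ by one insertion, deletion, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # ensure len(a) <= len(b)
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution
    return a[i:] == b[i + 1:]          # one insertion into a


def project_embeddings(english_emb, alignments, vocab):
    """alignments maps a foreign word to {english_word: alignment_count}."""
    projected = {}
    for word in vocab:
        aligned = {e: c for e, c in alignments.get(word, {}).items()
                   if e in english_emb}
        if aligned:
            projected[word] = weighted_average(
                [english_emb[e] for e in aligned], list(aligned.values()))
    aligned_only = dict(projected)  # fallback uses aligned words only
    for word in vocab:
        if word in projected:
            continue
        neighbours = [vec for other, vec in aligned_only.items()
                      if edit_distance_is_one(word, other)]
        if neighbours:
            projected[word] = weighted_average(
                neighbours, [1] * len(neighbours))
    return projected


# Toy example with hypothetical words, counts, and 2-dimensional vectors.
english_emb = {"play": [1.0, 0.0], "plays": [0.0, 1.0]}
alignments = {"spielt": {"plays": 3, "play": 1}, "spiele": {"play": 2}}
projected = project_embeddings(
    english_emb, alignments, ["spielt", "spiele", "spielz"])
print(projected)
```

Here the unaligned word "spielz" receives the average of "spielt" and "spiele", mirroring the ‘playz’ example in the text.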

All embeddings are trained on the same data and use the same number of dimensions (100). [fn 25] Table 6 illustrates that the three methods perform similarly on this task. Aside from Table 6, in this paper, we exclusively use the robust projection multilingual embeddings trained in guo:16. [fn 26] The “robust projection” result in Table 6 (which uses 100 dimensions) is comparable to the last row in Table 3 (which uses 50 dimensions).

[fn 25] We share the embedding files at

[fn 26] The embeddings were kindly provided by the authors of guo:16 at

LAS            target language
               de     es     fr     it     sv
monolingual    58.0   64.7   63.0   68.7   57.6
Duong et al.   61.8   70.5   67.2   71.3   62.5
MaLOPa         63.4   70.5   69.1   74.1   63.4

Table 7: Small (3,000 token) target treebank setting: language-universal dependency parser performance.

LAS        target language                             average
           de     es     fr     it     pt     sv
zhang:15   54.1   68.3   68.8   69.4   72.5   62.5   65.9
guo:16     55.9   73.0   71.0   71.2   78.6   69.5   69.3
MaLOPa     57.1   74.6   73.9   72.5   77.0   68.1   70.5

Table 8: Dependency parsing: labeled attachment scores (LAS) for multi-source transfer parsers in the simulated low-resource scenario where no treebank is available in the target language.

Small target treebank.

duong:15 considered a setup where the target language has a small treebank of 3,000 tokens, and the source language (English) has a large treebank of 205,000 tokens. The parser proposed in duong:15 is a neural network parser based on chen:14, which shares most of the parameters between English and the target language, and uses an ℓ2 regularizer to tie the lexical embeddings of translationally-equivalent words. While not the primary focus of this paper, [fn 27] we compare our proposed method to that of duong:15 on five target languages for which multilingual Brown clusters are available from guo:16. For each target language, we train the parser on the English training data in the UD version 1.0 corpus [universal:v1_0] and a small treebank in the target language. [fn 28]

[fn 27] The setup cost involved in recruiting linguists and in developing and revising annotation guidelines to annotate a new language ought to be higher than the cost of annotating 3,000 tokens. After investing substantial resources in a language, we believe it is unrealistic to stop the annotation effort after only 3,000 tokens.

[fn 28] We thank Long Duong for sharing the processed, subsampled training corpora in each target language at
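A regularizer that ties the embeddings of translationally-equivalent words can be pictured as a squared Euclidean distance penalty added to the training loss. The sketch below shows only this general idea; the exact formulation in duong:15 may differ, and the function name, the `strength` parameter, and the toy vectors are assumptions made for illustration.

```python
def tie_penalty(source_emb, target_emb, translation_pairs, strength):
    """Squared-distance penalty between embeddings of translation pairs.

    Assumption: each (source_word, target_word) pair contributes the
    squared Euclidean distance between its two embeddings, scaled by a
    single hypothetical hyperparameter `strength`.
    """
    penalty = 0.0
    for source_word, target_word in translation_pairs:
        source_vec = source_emb[source_word]
        target_vec = target_emb[target_word]
        penalty += sum((s - t) ** 2 for s, t in zip(source_vec, target_vec))
    return strength * penalty


# Toy example with made-up 2-dimensional embeddings.
source_emb = {"play": [1.0, 0.0], "house": [0.0, 2.0]}
target_emb = {"spielen": [0.0, 0.0], "Haus": [0.0, 2.0]}
pairs = [("play", "spielen"), ("house", "Haus")]
penalty = tie_penalty(source_emb, target_emb, pairs, strength=0.5)
print(penalty)
```

The penalty is zero for pairs whose embeddings already coincide, so minimizing it pulls the remaining pairs together.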

Following duong:15, in this setup, we only use gold coarse POS tags, we do not use any development data in the target languages (we use the English development set instead), and we subsample the English training data in each epoch to the same number of sentences in the target language. We use the same hyperparameters specified before for the single MaLOPa parser and each of the monolingual baselines. Table 7 shows that our method outperforms duong:15 by 1.4 LAS points on average. Our method consistently outperforms the monolingual baselines in this setup, with an average improvement of 5.7 absolute LAS points.
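The two average improvements quoted above follow directly from the per-language LAS scores in Table 7. The short sketch below recomputes them; the scores are copied from the table, while the variable names are ours.

```python
# LAS scores copied from Table 7 (small target treebank setting).
table7 = {
    "monolingual":  {"de": 58.0, "es": 64.7, "fr": 63.0, "it": 68.7, "sv": 57.6},
    "Duong et al.": {"de": 61.8, "es": 70.5, "fr": 67.2, "it": 71.3, "sv": 62.5},
    "MaLOPa":       {"de": 63.4, "es": 70.5, "fr": 69.1, "it": 74.1, "sv": 63.4},
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Unweighted average over the five target languages for each system.
averages = {name: mean(scores.values()) for name, scores in table7.items()}

gain_over_duong = averages["MaLOPa"] - averages["Duong et al."]
gain_over_monolingual = averages["MaLOPa"] - averages["monolingual"]

print(round(gain_over_duong, 1), round(gain_over_monolingual, 1))
```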

4.2 Target Languages without a Treebank

mcdonald:11 established that, when no treebank annotations are available in the target language, training on multiple source languages outperforms training on one (i.e., multi-source model transfer outperforms single-source model transfer). In this section, we evaluate the performance of our parser in this setup. We use two strong baseline multi-source model transfer parsers with no supervision in the target language:

  • zhang:15 is a graph-based arc-factored parsing model with a tensor-based scoring function. It takes typological properties of a language as input. We compare to the best reported configuration (i.e., the column titled “OURS” in Table 5 of Zhang and Barzilay, 2015).

  • guo:16 is a transition-based neural-network parsing model based on chen:14. It uses multilingual embeddings and Brown clusters as lexical features. We compare to the best reported configuration (i.e., the column titled “MULTI-PROJ” in Table 1 of Guo et al., 2016).

Following guo:16, for each target language, we train the parser on six other languages in the Google universal dependency treebanks version 2.0 (de, en, es, fr, it, pt, sv, excluding whichever is the target language), and we use gold coarse POS tags. Our parser uses the same word embeddings and word clusters used in guo:16, and does not use any typology information. [fn 30]

[fn 30] In preliminary experiments, we found language embeddings to hurt the performance of the parser for target languages without a treebank.
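The leave-one-language-out selection of training languages can be written down as a small helper. This is only a sketch of the data split described above; the function name is hypothetical.

```python
# The seven languages of the treebank collection used in this setup.
LANGUAGES = ["de", "en", "es", "fr", "it", "pt", "sv"]


def source_languages(target, languages=LANGUAGES):
    """Return the training languages for a given target language."""
    if target not in languages:
        raise ValueError(f"unknown target language: {target}")
    # Train on every language except the target itself.
    return [lang for lang in languages if lang != target]


for target in ["de", "es", "fr", "it", "pt", "sv"]:
    print(target, source_languages(target))
```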

The results in Table 8 show that, on average, our parser outperforms both baselines by more than 1 point in LAS, and gives the best LAS results in four (out of six) languages.
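The per-language comparison can be checked mechanically from Table 8. The sketch below finds the best-scoring system for each target language; the scores are copied from the table and the helper name is ours.

```python
# Per-language LAS scores copied from Table 8 (no target treebank).
table8 = {
    "zhang:15": {"de": 54.1, "es": 68.3, "fr": 68.8, "it": 69.4, "pt": 72.5, "sv": 62.5},
    "guo:16":   {"de": 55.9, "es": 73.0, "fr": 71.0, "it": 71.2, "pt": 78.6, "sv": 69.5},
    "MaLOPa":   {"de": 57.1, "es": 74.6, "fr": 73.9, "it": 72.5, "pt": 77.0, "sv": 68.1},
}


def best_system_per_language(table):
    """Map each language to the system with the highest LAS."""
    languages = next(iter(table.values())).keys()
    return {lang: max(table, key=lambda name: table[name][lang])
            for lang in languages}


best = best_system_per_language(table8)
malopa_wins = sorted(lang for lang, name in best.items() if name == "MaLOPa")
print(best)
print(malopa_wins)
```

On these numbers, MaLOPa is the best system for de, es, fr, and it, while guo:16 remains ahead on pt and sv.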

5 Related Work

Our work builds on the model transfer approach, which was pioneered by zeman:08, who trained a parser on a source language treebank then applied it to parse sentences in a target language. cohen:11 and mcdonald:11 trained unlexicalized parsers on treebanks of multiple source languages and applied the parser to different languages. naseem:12, tackstrom:13, and zhang:15 used language typology to improve model transfer. To add lexical information, tackstrom:12 used multilingual word clusters, while xiao:14, guo:15, sogaard:15 and guo:16 used multilingual word embeddings. duong:15 used a neural-network-based model, sharing most of the parameters between two languages, and used an ℓ2 regularizer to tie the lexical embeddings of translationally-equivalent words. We incorporate these ideas in our framework, while proposing a novel neural architecture for embedding language typology (see §3.4), and use a variant of word dropout [iyyer:15] for consuming noisy structured inputs. We also show how to replace an array of monolingually trained parsers with one multilingually-trained parser without sacrificing accuracy, which is related to vilares:16.

Neural network parsing models which preceded dyer:15 include henderson:03, titov:07, henderson:10 and chen:14. Related to lexical features in cross-lingual parsing is durrett:12, who defined lexico-syntactic features based on bilingual lexicons. Other related work includes ostling:15, which may be used to induce more useful typological properties to inform multilingual parsing.

Another popular approach for cross-lingual supervision is to project annotations from the source language to the target language via a parallel corpus [yarowsky:01, hwa:05] or via automatically-translated sentences [tiedemann:14b]. ma:14 used entropy regularization to learn from both parallel data (with projected annotations) and unlabeled data in the target language. rasooli:15 trained an array of target-language parsers on fully annotated trees, by iteratively decoding sentences in the target language with incomplete annotations. One research direction worth pursuing is to find synergies between the model transfer approach and annotation projection approach.

6 Conclusion

We presented MaLOPa, a single parser trained on a multilingual set of treebanks. We showed that this parser, equipped with language embeddings and fine-grained POS embeddings, on average outperforms monolingually-trained parsers for target languages with a treebank. This pattern of results is quite encouraging. Although languages may share underlying syntactic properties, individual parsing models must behave quite differently, and our model allows this while sharing parameters across languages. The value of this sharing is more pronounced in scenarios where the target language’s training treebank is small or non-existent, where our parser outperforms previous cross-lingual multi-source model transfer methods.


Waleed Ammar is supported by the Google fellowship in natural language processing. Miguel Ballesteros is supported by the European Commission under the contract numbers FP7-ICT-610411 (project MULTISENSOR) and H2020-RIA-645012 (project KRISTINA). Part of this material is based upon work supported by a subcontract with Raytheon BBN Technologies Corp. under DARPA Prime Contract No. HR0011-15-C-0013, and part of this research was supported by a Google research award to Noah Smith. We thank Jiang Guo for sharing the multilingual word embeddings and multilingual word clusters. We thank Lori Levin, Ryan McDonald, Jörg Tiedemann, Yulia Tsvetkov, and Yuan Zhang for helpful discussions. Last but not least, we thank the anonymous TACL reviewers for their valuable feedback.