On Multilingual Training of Neural Dependency Parsers

05/29/2017 ∙ by Michał Zapotoczny, et al. ∙ Akademia Sztuk Pięknych we Wrocławiu

We show that a recently proposed neural dependency parser can be improved by joint training on multiple languages from the same family. The parser is implemented as a deep neural network whose only input is orthographic representations of words. In order to parse successfully, the network has to discover how linguistically relevant concepts can be inferred from word spellings. We analyze the representations of characters and words that are learned by the network to establish which properties of languages were accounted for. In particular, we show that the parser has approximately learned to associate Latin characters with their Cyrillic counterparts and that it can group Polish and Russian words that have a similar grammatical function. Finally, we evaluate the parser on selected languages from the Universal Dependencies dataset and show that it is competitive with other recently proposed state-of-the-art methods, while having a simple structure.




1 Introduction

Parsing text is an important part of many natural language processing applications. Recent state-of-the-art results were obtained with parsers implemented using deep neural networks. Neural networks are flexible learners able to express complicated input-output relationships. However, as more powerful machine learning techniques are used, the quality of results will not be limited by the capacity of the model, but by the amount of the available training data. In this contribution we examine the possibility of increasing the training set by using treebanks from similar languages.

For example, in the upcoming Universal Dependencies (UD) 2.0 treebank collection [28] there are 863 annotated Ukrainian sentences, 333 Belarusian, but nearly 60k Russian ones (divided into two sets: a default one of 4.4k sentences and SynTagRus with 55.4k sentences). Similarly, there are 7k Polish sentences and a little over 100k Czech ones. (Note, however, that the experiments use the UD 1.3 dataset, which does not include Belarusian and Ukrainian.) Since these languages belong to the same Slavic language family, performance on the low-resource languages should improve when the model is also trained jointly on a better annotated language [6]. In this paper, we demonstrate this improvement. Starting with a parser competitive with the current state-of-the-art, we are able to further improve the results for tested languages from the Slavic family. We train the model on pairs of languages through simple parameter sharing in an end-to-end fashion, retaining the structure and qualities of the base model.

2 Background and Related Work

Dependency parsers represent sentences as trees in which every word is connected to its head with a directed edge (called a dependency) labeled with the dependency's type. Parsers often contain parts that are learned on a corpus. For example, transition-based dependency parsers use the learned component to guide their actions, while graph-based dependency parsers learn a scoring function that measures the quality of inserting a (head, dependency) edge into the tree.

Historically, the learning algorithms were relatively simple ones, e.g. transition-based parsers used linear SVMs [27, 26]. Recently, those simple learning models were successfully replaced by deep neural networks [33, 9, 15, 3]. This trend coincides with successes of those models on other NLP tasks, such as language modeling [25, 20] and translation [4, 32, 35].

Neural networks have enough capacity to directly solve the parsing task. For example, a constituency parser can be implemented using a sequence-to-sequence network originally developed for translation [34]. Similarly, a graph-based dependency parser can be implemented by solving two supervised tasks: head selection and dependency labeling. Both are easily solved using neural networks [22, 37, 13, 12]. Moreover, neural networks can extract meaningful features from the data, which may augment or replace manually designed ones, as is the case with word embeddings [24] or features derived from the spelling of words [21, 5, 12].

Another particularly nice property of neural models is that all internal computations use distributed representations of input data that are embedded in high-dimensional vector spaces [19]. These internal representations can be easily shared between tasks [8]. Likewise, neural parsers can share some of their parameters to harness similarities between languages [6, 18, 14, 2]. Creation of multilingual parsers is further facilitated by the introduction of standardized treebanks, such as the Universal Dependencies [28].

3 Model

Our multilingual parser can be seen as a set of identical neural dependency parsers, one per language, which share parameters. When all parameters are shared, a single parser is obtained for all languages. When only a subset of parameters is shared, the model can be seen as a parser for a main language that is partially regularized using data for other languages.

Each of the parsers is a single neural network that directly reads a sequence of characters and finds dependency edges along with their labels [12]. We can functionally describe four basic parts: Reader, Tagger, Labeler/Scorer, and an optional POS Tag Predictor (Figure 1).

Figure 1: The model architecture.

The reader is tasked with transforming the orthographic representation of a single word $w$ into a vector $E(w)$, also called the word $w$'s embedding. First, we represent each word as a sequence of characters fenced with start-of-word and end-of-word tokens. We find low-dimensional character embeddings and concatenate them to form a matrix $C_w$. Next we convolve this matrix with a learned filterbank

$$R_i = C_w \ast F_i,$$

where $F_i$ is the $i$-th filter and $\ast$ denotes convolution over the length of the word. Thanks to the start- and end-of-word tokens the filters can selectively target infixes, prefixes and suffixes of words. Finally, we max-pool the filter activations over the word length and apply a small feedforward network to obtain the final word embedding

$$E(w) = \mathrm{MLP}(\operatorname{maxpool}(R)).$$
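The reader's convolution and max-pooling steps can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the boundary token names, the random initialization, and the choice to emit 0.0 when a filter is wider than the fenced word are assumptions made for illustration, and the final feedforward network is omitted. The filter counts (50·k filters of width k for k = 1..6, character vectors of size 15) follow the hyperparameters reported later in the paper.

```python
import random

BOW, EOW = "<w>", "</w>"  # hypothetical start-of-word / end-of-word tokens


def make_filterbank(char_dim, widths, per_width, rng):
    """Build per_width*k random filters of width k for every k in `widths`."""
    bank = []
    for k in widths:
        for _ in range(per_width * k):
            bank.append([[rng.uniform(-0.1, 0.1) for _ in range(char_dim)]
                         for _ in range(k)])
    return bank


def embed_word(word, char_emb, bank):
    """Slide every filter over the fenced word and max-pool over positions."""
    chars = [BOW] + list(word) + [EOW]
    C = [char_emb[c] for c in chars]          # (word length + 2) x char_dim
    dim = len(C[0])
    pooled = []
    for f in bank:
        k = len(f)
        if k > len(C):
            pooled.append(0.0)                # assumption: no response if the filter does not fit
            continue
        responses = [
            sum(f[j][d] * C[t + j][d] for j in range(k) for d in range(dim))
            for t in range(len(C) - k + 1)
        ]
        pooled.append(max(responses))         # max-pooling over the word length
    return pooled                             # a feedforward network would follow here


rng = random.Random(0)
alphabet = list("abcdefghijklmnopqrstuvwxyz") + [BOW, EOW]
char_emb = {c: [rng.uniform(-1, 1) for _ in range(15)] for c in alphabet}
bank = make_filterbank(15, range(1, 7), 50, rng)
vec = embed_word("parser", char_emb, bank)
print(len(bank), len(vec))
```

Because the boundary tokens have their own embeddings, a filter that responds strongly to a window ending in the end-of-word token effectively acts as a suffix detector.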


The tagger processes complete sentences and puts individual word embeddings into their contexts. We use a multi-layer bidirectional GRU Recurrent Neural Network (BiRNN) [10, 29]. The output of the tagger is a sequence of the BiRNN's hidden states $H_0, H_1, \ldots, H_n$, where $H_0$ corresponds to a prepended ROOT word and $n$ is the length of the sentence. Please observe that while the embedding $E(w_i)$ of the $i$-th word only depends on the word's spelling, the corresponding hidden state $H_i$ depends on the whole sentence.

We have also added an auxiliary network to predict POS tags based on the hidden states $H_i$. It serves two purposes: first, it can provide extra supervision on POS tags known during training. Second, it helps to attribute errors to various parts of the network (cf. Sec. 4.4). The POS tag predictor is optional: its output is not used during inference because the tagger communicates all information to the scorer and labeler through the hidden states $H_i$.

Finally, the network produces the dependency tree by solving two supervised learning tasks: using a scorer to find the head word, then using a labeler to find the edge label.

The scorer determines whether each pair of hidden vectors $(H_w, H_h)$ forms a dependency. We employ per-word normalization of scores: for a given word location $w$, scores are SoftMax-normalized over all head locations $h$.

The labeler reads a pair of hidden vectors $(H_w, H_h)$ and predicts the label of this dependency edge. During training we use the ground-truth head location, while during inference we use the location predicted using the scorer.
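The per-word normalization used by the scorer can be sketched as follows. This is an illustrative example with hand-picked raw scores, not output of the actual model; the function names are hypothetical. Row `i` holds the scores of all candidate heads for word `i + 1`, and column 0 stands for the prepended ROOT.

```python
import math


def softmax(xs):
    """Numerically stable SoftMax over one row of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def head_distributions(scores):
    """Normalize scores separately for every word, over all head locations."""
    return [softmax(row) for row in scores]


def greedy_heads(scores):
    """Pick the highest-scoring head for every word independently."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy 3-word sentence; columns are head locations 0 (ROOT), 1, 2, 3.
scores = [
    [0.5, 0.0, 3.0, 0.1],  # word 1: best head is word 2
    [4.0, 0.2, 0.0, 0.3],  # word 2: best head is ROOT
    [0.1, 0.2, 2.5, 0.0],  # word 3: best head is word 2
]
probs = head_distributions(scores)
heads = greedy_heads(scores)
print(heads)
```

At inference time the labeler would then be applied to each pair of a word and its selected head.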

We employ the following training criterion:

$$L = \alpha L_h + \beta L_l + \gamma L_t,$$

where $L_h$, $L_l$, $L_t$ are negative log-likelihood losses of the scorer, the labeler and the POS tag predictor, respectively.
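A combined criterion of this kind, a weighted sum of three negative log-likelihood terms, can be written down for a single word as in the sketch below. The weight names and the example weights are placeholders chosen for illustration, not the values used in the experiments, and a real implementation would sum the loss over all words in a mini-batch.

```python
import math


def nll(probs, gold):
    """Negative log-likelihood of the gold class under a probability vector."""
    return -math.log(probs[gold])


def word_loss(head_probs, gold_head, label_probs, gold_label,
              tag_probs, gold_tag, w_head, w_label, w_tag):
    """Weighted sum of the scorer, labeler and POS tag predictor losses."""
    return (w_head * nll(head_probs, gold_head)
            + w_label * nll(label_probs, gold_label)
            + w_tag * nll(tag_probs, gold_tag))


# Placeholder weights of 1.0 and uniform two-way predictions.
loss = word_loss([0.5, 0.5], 0, [0.5, 0.5], 1, [0.5, 0.5], 0, 1.0, 1.0, 1.0)
print(loss)
```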

4 Experiment Details and Results

4.1 Model Hyperparameters

We have decided to use the same set of hyperparameters for all languages and multilingual parsers. They are a compromise in model capacity between languages with small and large treebanks. The reported size of recurrent layers is slightly too big for low-resource single-language parsers, but we have determined that it is optimal for languages with large treebanks and for multilingual training.

The reader embeds each character into a vector of size 15 and contains 1050 filters ($50 \cdot k$ filters of length $k$ for $k = 1, 2, \ldots, 6$) whose outputs are projected into a 512-dimensional vector, which is transformed by 3 equally sized layers of a feedforward neural network with ReLU activation. Unlike [21, 12], we decided to remove Highway layers [31] from the reader. Their usage introduced a marginal accuracy gain while nearly doubling the computational burden. The tagger contains 2 BiRNN layers of GRU units with 548 hidden states for both forward and backward passes, which are later aggregated using addition [12]. Therefore the hidden states of the tagger are also 548-dimensional. The POS tag predictor consists of a single affine transformation followed by a SoftMax predictor for each POS category. The scorer uses a single layer of 384 tanh units for head word scoring, while the labeler uses 256 Maxout units (each using 2 pieces) to classify the relation label [17]. The training cost used the constants .

We regularize the models using Dropout [30] applied to the reader output (20%), between the BiRNN layers of the tagger (70%) and to the labeller (50%). Moreover we apply mild weight decay of .

We have trained all models using the Adadelta [36] learning rule with epsilon annealed from 1e-8 to 1e-12 and adaptive gradient clipping [11]. Experiments are early-stopped on the validation set Unlabeled Attachment Score (UAS). Unfortunately, due to limited computational resources, we are only able to present results for the subset of the UD treebanks shown in Table 1.

Multilingual models use the same architecture. We unify the inputs and outputs of all models by taking the union of all possible token categories (characters, POS categories, dependency labels). If some category does not exist within a particular language we use a special UNK token. All parsers are trained in parallel minimizing a sum of their individual training costs. We use early-stopping on the main (first) language UAS score. We equalize training mini-batches such that each contains the same number of sentences from all languages. We determined the optimal amount of parameter sharing and show it in Table 2. Moreover, we never share the start-of-word and end-of-word tokens to indicate to the network which language is parsed.
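The input unification and batch equalization described above can be sketched as follows. The token spellings (`<unk>`, `<w:pl>` and so on), the function names, and the choice to sample sentences with replacement are assumptions made for this illustration; the text does not specify how the smaller treebank is cycled.

```python
import random


def build_char_vocab(corpora):
    """corpora maps a language code to a list of sentences (lists of words)."""
    symbols = {"<unk>"}
    for lang, sentences in corpora.items():
        # Boundary tokens are language-specific, so they are never shared.
        symbols.update({f"<w:{lang}>", f"</w:{lang}>"})
        for sentence in sentences:
            for word in sentence:
                symbols.update(word)          # adds the word's characters
    return {s: i for i, s in enumerate(sorted(symbols))}


def encode_word(word, lang, vocab):
    """Fence a word with its language's boundary tokens and map to ids."""
    unk = vocab["<unk>"]
    body = [vocab.get(c, unk) for c in word]
    return [vocab[f"<w:{lang}>"]] + body + [vocab[f"</w:{lang}>"]]


def balanced_batches(corpora, per_lang, n_batches, rng):
    """Yield batches holding the same number of sentences from every language."""
    for _ in range(n_batches):
        # Assumption: sample with replacement so small treebanks never run out.
        yield {lang: rng.choices(sents, k=per_lang)
               for lang, sents in corpora.items()}


corpora = {
    "pl": [["kot", "śpi"]],
    "ru": [["кот", "спит"], ["дом"]],
}
vocab = build_char_vocab(corpora)
pl_ids = encode_word("kot", "pl", vocab)
ru_ids = encode_word("кот", "ru", vocab)
batches = list(balanced_batches(corpora, 2, 3, random.Random(0)))
print(len(vocab), pl_ids, ru_ids)
```

Since the boundary tokens differ per language, even a fully shared reader can tell from the first symbol of a word which language it is reading.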

Language        #sentences   Ours            SyntaxNet       Ammar et al.   ParseySaurus
                             UAS     LAS     UAS     LAS     LAS            UAS     LAS
Czech           87 913       91.41   88.18   89.47   85.93   -              89.09   84.99
Polish           8 227       90.26   85.32   88.30   82.71   -              91.86   87.49
Russian          5 030       83.29   79.22   81.75   77.71   -              84.27   80.65
German          15 892       82.67   76.51   79.73   74.07   71.2           84.12   79.05
English         16 622       87.44   83.94   84.79   80.38   79.9           87.86   84.45
French          16 448       87.25   83.50   84.68   81.05   78.5           86.61   83.1
Ancient Greek   25 251       78.96   72.36   68.98   62.07   -              73.85   68.1

Table 1: Baseline results of single language models from UD v1.3. Our models use only orthographic representations of tokenized words during inference and work without a separate POS tagger. Ammar et al. [2] use version 1.2 of UD, gold language ids and predicted coarse tags. SyntaxNet [3, 1] works on predicted POS tags, while ParseySaurus [1] uses word spellings.

Shared parts                            Main lang   Auxiliary lang   UAS     LAS
-                                       Polish      -                90.26   85.32
Parser                                  Polish      Czech            90.72   85.57
Tagger, Parser                          Polish      Czech            91.19   86.37
Tagger, POS Predictor, Parser           Polish      Czech            91.65   86.88
Reader, Tagger, POS Predictor, Parser   Polish      Czech            91.91   87.77
Parser                                  Polish      Russian          90.31   85.07
Tagger, POS Predictor, Parser           Polish      Russian          91.34   86.36
Reader, Tagger, POS Predictor, Parser   Polish      Russian          89.16   82.94
-                                       Russian     -                83.29   79.22
Parser                                  Russian     Czech            83.15   78.69
Tagger, POS Predictor, Parser           Russian     Czech            83.91   79.79
Reader, Tagger, POS Predictor, Parser   Russian     Czech            84.78   80.35

Table 2: Impact of parameter sharing strategies on main language parsing accuracy when multilingual training is used for additional supervision.

4.2 Main Results

Our results on single language training are presented in Table 1. Our models reach better scores than the highly tuned SyntaxNet transition-based parser [3] and are competitive with the DRAGNN-based ParseySaurus, which also uses character-based input [1].

Multilingual training (Table 2) improves the performance on low-resource languages. We observe that the optimal amount of parameter sharing depends on the similarity between languages and corpus size: while it is beneficial to share all parameters of the PL-CZ and RU-CZ parsers, the PL-RU parser works best if the reader subnetworks are separated. We attribute this to the quality of the Czech treebank, which has several times more examples than the Polish and Russian datasets combined.

4.3 Analysis of Language Similarities Identified by the Network

We have first analyzed whether a PL-RU parser can learn the correspondence between Latin and Cyrillic scripts (conveniently, Unicode has separate codes for Latin and Cyrillic letters). We have inspected the reader subnetworks of a PL-RU parser that shared all parameters. As described in Section 3, the model begins processing a word by finding the embedding of each character. For the analysis we have extracted the embeddings associated with all Polish and Russian characters. We have paired Polish and Russian letters which have similar pronunciations. We note that the pairing omits letters that have no clear counterparts (e.g. the Russian letter я corresponds to the syllable "ja" in Polish).

a-а, b-б, c-ц, d-д, e-е, e-э, f-ф, g-г, h-х, i-и, j-й, k-к, l-л, m-м, n-н, o-о, p-п, r-р, s-с, t-т, u-у, w-в, y-ы, z-з, ł-л, ż-ж

Adapting the famous equation $\text{king} - \text{man} + \text{woman} \approx \text{queen}$ [24], we inspected to what extent our network was able to deduce Latin-Cyrillic correspondences. For all distinct pairs of letter correspondences $(p_1, r_1)$ and $(p_2, r_2)$ we computed the vector $C(p_1) - C(p_2) + C(r_2)$, where $C$ stands for the character embedding, and found the Russian letter which had the closest embedding vector. In 48.3% of cases the closest letter was the correct one, $r_1$. We found this quite striking given that the two languages separated from their common root (Proto-Slavic) more than 1000 years ago. Moreover, relations between Polish and Russian letters are side effects, not the main objective of the neural network.
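The analogy test on character embeddings can be sketched as below. This is a reconstruction under stated assumptions rather than the authors' evaluation script: the Euclidean distance, the inclusion of every Cyrillic letter as a candidate, and the toy embeddings (Cyrillic vectors built as Latin vectors plus a constant offset, so the test succeeds by construction) are all illustrative choices.

```python
import math


def nearest(target, candidates, emb):
    """Return the candidate whose embedding is closest to `target`."""
    return min(candidates, key=lambda c: math.dist(emb[c], target))


def analogy_accuracy(pairs, emb):
    """pairs: list of (latin, cyrillic) letter correspondences.

    For every ordered choice of two distinct correspondences (p1, r1) and
    (p2, r2), check whether emb[p1] - emb[p2] + emb[r2] is closest to r1.
    """
    cyrillic = sorted({r for _, r in pairs})
    hits = total = 0
    for p1, r1 in pairs:
        for p2, r2 in pairs:
            if (p1, r1) == (p2, r2):
                continue
            target = [a - b + c
                      for a, b, c in zip(emb[p1], emb[p2], emb[r2])]
            hits += nearest(target, cyrillic, emb) == r1
            total += 1
    return hits / total


# Toy embeddings: each Cyrillic letter is its Latin partner shifted by (10, 10).
latin = {"a": [0, 0], "k": [3, 1], "m": [1, 4]}
pairs = [("a", "а"), ("k", "к"), ("m", "м")]
emb = dict(latin)
for lat, cyr in pairs:
    emb[cyr] = [x + 10 for x in latin[lat]]

accuracy = analogy_accuracy(pairs, emb)
print(accuracy)
```

With learned embeddings the offset between scripts is only approximately constant, which is why the reported accuracy is well below 100%.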

We have also examined word representations computed for Polish and Russian by the shared reader subnetwork. As one could expect, the network was able to realize that in these languages morphology is suffix-based. However, the network was also able to learn that words built from different letters can behave in a similar way. We can observe this in both monolingual and multilingual contexts. Table 3 shows some Polish adjectives and the top-7 Russian words with the closest embedding. All Russian words which are not in italics have the same morphological tags as the Polish word. In the first row we can observe two suffixes, -ской (skoy) and -нной (nnoy), quite distant from the Polish -owej (ovey). In the second row we see that the model was able to correctly alias the Polish 3-letter suffix -ych with the Russian 2-letter suffix -ых, which are pronounced the same way. The relation found by the network is purely syntactic: there is no easy-to-find connection between the semantics of these words.

Polish word         Closest Russian embeddings
przedwrześniowej    адренергической, тренерской, таврической, непосредственной, археологической, философской, *верхнюю*
większych           автомобильных, *трёхдневные*, технических, практических, официальных, оригинальных
policyjnym          главным, историческим, глазным, непосредственным, *косыми*, летним, двухсимвольным

Table 3: The network learns to group Polish words with Russian words that have a similar grammatical function. Words set in italics in the original are marked here with asterisks.

4.4 Common Error Analysis

We have investigated two possible sources of errors produced by the parser. First, we verified whether using a more advanced tree-building algorithm was better than using a greedy one. We have observed that the scorer produces very sharp probability distributions that can be transformed into trees using a greedy algorithm that simply selects for each word the highest scored head [12, 13]. Counterintuitively, the Chu-Liu-Edmonds (CLE) maximum spanning tree algorithm [16] often makes the decoding results slightly worse. We have established that the network is so confident in its predictions that non-top scores do not reflect alternatives but are only noise. Therefore, when the greedy decoding creates a cycle, the CLE usually breaks it in a wrong place, introducing another pointer error.
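Greedy decoding gives no guarantee that the selected heads form a tree, which is the situation in which a maximum spanning tree algorithm would intervene. A minimal sketch of detecting such a cycle is shown below; the function name and the head encoding (`heads[i]` is the head of word `i + 1`, with 0 standing for ROOT) are assumptions for illustration, and the sketch does not implement CLE itself.

```python
def find_cycle(heads):
    """Return the word indices of one cycle in a greedy head assignment.

    heads[i] is the predicted head of word i + 1; 0 denotes ROOT.
    Returns None when following heads from every word reaches ROOT.
    """
    n = len(heads)
    state = [0] * (n + 1)   # 0 = unvisited, 1 = on current path, 2 = finished
    state[0] = 2            # ROOT has no head and ends every valid path
    for start in range(1, n + 1):
        path = []
        node = start
        while state[node] == 0:
            state[node] = 1
            path.append(node)
            node = heads[node - 1]
        if state[node] == 1:
            # We walked back into the current path: that suffix is a cycle.
            return path[path.index(node):]
        for visited in path:
            state[visited] = 2
    return None


print(find_cycle([2, 0, 2]))  # words 1 and 3 attach to word 2, word 2 to ROOT
print(find_cycle([2, 1, 0]))  # words 1 and 2 point at each other
```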

We have used the POS predictor to pinpoint which parts of the network (reader/tagger or labeler/scorer) were responsible for errors. Tests showed that if the predicted tag was wrong, the scorer and labeler nearly always produced erroneous results too.

5 Conclusions and Future Works

We have demonstrated a graph-based dependency parser implemented as a single deep neural network that directly produces parse trees from characters and does not require other NLP tools such as a POS tagger. The proposed parser can be easily used in a multilingual setup, in which parsers for many languages that share parameters are jointly trained. We have established that the degree of sharing depends on language similarity and corpus size: the best PL-CZ and RU-CZ parsers shared all parameters (essentially creating a single parser for both languages), while the best PL-RU parser had separate morphological feature detectors (i.e. readers). We have also determined that the network can extract meaningful relations between languages, such as approximately learning a mapping from Latin to Cyrillic characters or associating Polish and Russian words that have a similar grammatical function. While this contribution focused on improving the performance on a low-resource language using data from other languages, similar parameter sharing techniques could be used to create one universal parser [2].

We have performed a qualitative error analysis and have identified two areas for possible future improvement. First, the network does not indicate alternatives to the produced parse tree. Second, errors in word interpretation are often impossible to correct by the upper layers of the network. In the future we plan to investigate training a better POS tagging subnetwork, possibly using other sources of data.


The experiments used the Theano [7], Blocks and Fuel [23] libraries. The authors would like to acknowledge the support of the following agencies for research funding and computing support: National Science Center (Poland) grant Sonata 8 2014/15/D/ST6/04402, National Center for Research and Development (Poland) grant Audioscope (Applied Research Program, 3rd contest, submission no. 245755).