Parsing text is an important part of many natural language processing applications. Recent state-of-the-art results were obtained with parsers implemented using deep neural networks. Neural networks are flexible learners able to express complicated input-output relationships. However, as more powerful machine learning techniques are used, the quality of results is limited not by the capacity of the model, but by the amount of available training data. In this contribution we examine the possibility of enlarging the training set by using treebanks from similar languages.
For example, in the upcoming Universal Dependencies (UD) 2.0 treebank collection there are 863 annotated Ukrainian sentences and 333 Belarusian ones, but nearly 60k Russian ones (divided into two sets: a default one of 4.4k sentences and SynTagRus with 55.4k sentences). Similarly, there are 7k Polish sentences and a little over 100k Czech ones (our experiments, however, use the UD 1.3 dataset, which does not include Belarusian and Ukrainian). Since these languages belong to the same Slavic language family, performance on the low-resource languages should improve by jointly training the model on a better-annotated language. In this paper, we demonstrate this improvement. Starting with a parser competitive with the current state of the art, we are able to further improve the results for the tested languages from the Slavic family. We train the model on pairs of languages through simple parameter sharing in an end-to-end fashion, retaining the structure and qualities of the base model.
2 Background and Related Work
Dependency parsers represent sentences as trees in which every word is connected to its head with a directed edge (called a dependency) labeled with the dependency’s type. Parsers often contain parts that are learned on a corpus. For example, transition-based dependency parsers use the learned component to guide their actions, while graph-based dependency parsers learn a scoring function that measures the quality of inserting a (head, dependent) edge into the tree.
Historically, the learning algorithms were relatively simple ones, e.g. transition-based parsers used linear SVMs [27, 26]. Recently, those simple learning models were successfully replaced by deep neural networks [33, 9, 15, 3]. This trend coincides with successes of those models on other NLP tasks, such as language modeling [25, 20] and translation [4, 32, 35].
Neural networks have enough capacity to directly solve the parsing task. For example, a constituency parser can be implemented using a sequence-to-sequence network originally developed for translation. Similarly, a graph-based dependency parser can be implemented by solving two supervised tasks: head selection and dependency labeling. Both are easily solved using neural networks [22, 37, 13, 12]. Moreover, neural networks can extract meaningful features from the data, which may augment or replace manually designed ones, as is the case with word embeddings or features derived from the spelling of words [21, 5, 12].
Our multilingual parser can be seen as a set of identical neural dependency parsers, one per language, which share parameters. When all parameters are shared, a single parser is obtained for all languages. When only a subset of parameters is shared, the model can be seen as a parser for a main language that is partially regularized using data for the other languages.
Each of the parsers is a single neural network that directly reads a sequence of characters and finds dependency edges along with their labels. We can functionally distinguish four basic parts: Reader, Tagger, Labeler/Scorer, and an optional POS Tag Predictor (Figure 1).
The reader is tasked with transforming the orthographic representation of a single word $w$ into a vector $e_w$, also called the word $w$’s embedding. First, we represent each word as a sequence of characters fenced with start-of-word and end-of-word tokens. We find low-dimensional character embeddings and concatenate them to form a matrix $C_w$. Next we convolve this matrix with a learned filterbank, computing activations $a_i = f_i * C_w$, where $f_i$ is the $i$-th filter and $*$ denotes convolution over the length of the word. Thanks to the start- and end-of-word tokens the filters can selectively target infixes, prefixes and suffixes of words. Finally, we max-pool the filter activations over the word length and apply a small feedforward network to obtain the final word embedding.
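As a concrete illustration, the reader’s pipeline can be sketched in NumPy. The sizes, the single filter length, and the function name below are illustrative assumptions, not the paper’s configuration (the actual model uses several filter lengths and a feedforward projection, omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration only; the paper uses 15-dim characters and 1050 filters.
CHAR_DIM, NUM_FILTERS, FILTER_LEN = 8, 16, 3

char_emb = {c: rng.normal(size=CHAR_DIM) for c in "<>abcdefghijklmnopqrstuvwxyz"}
filters = rng.normal(size=(NUM_FILTERS, FILTER_LEN, CHAR_DIM))

def read_word(word):
    """Embed characters, convolve the filterbank over the word length,
    and max-pool over positions to obtain a fixed-size word vector."""
    chars = "<" + word + ">"                               # start-/end-of-word fences
    C = np.stack([char_emb[c] for c in chars])             # (len, CHAR_DIM)
    windows = np.stack([C[i:i + FILTER_LEN]
                        for i in range(len(chars) - FILTER_LEN + 1)])
    acts = np.einsum('wlc,flc->wf', windows, filters)      # (num_windows, NUM_FILTERS)
    return acts.max(axis=0)                                # max-pool over word length

vec = read_word("cats")   # one vector per word, regardless of word length
```

Because of the max-pooling step, words of any length map to vectors of the same dimensionality, which is what lets the tagger treat all word embeddings uniformly.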
The tagger processes complete sentences and puts individual word embeddings into their contexts. We use a multi-layer bidirectional GRU Recurrent Neural Network (BiRNN) [10, 29]. The output of the tagger is a sequence of the BiRNN’s hidden states $h_0, h_1, \ldots, h_n$, where $h_0$ corresponds to a prepended ROOT word and $n$ is the length of the sentence. Please observe that while the embedding of the $i$-th word only depends on the word’s spelling, the corresponding hidden state $h_i$ depends on the whole sentence.
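The bidirectional pass with additive aggregation (described in Section 4.1) can be sketched as follows. For brevity a plain tanh RNN stands in for the GRU, and all sizes are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
EMB, HID = 4, 5   # toy sizes; the paper's tagger uses 548-dim GRU states

Wf, Uf = rng.normal(size=(HID, EMB)), rng.normal(size=(HID, HID))
Wb, Ub = rng.normal(size=(HID, EMB)), rng.normal(size=(HID, HID))

def birnn(embeddings):
    """Run a left-to-right and a right-to-left pass over the word embeddings,
    then aggregate the two directions by addition, so the output states keep
    the dimensionality of a single direction."""
    T = len(embeddings)
    fwd, h = [], np.zeros(HID)
    for t in range(T):                       # forward pass
        h = np.tanh(Wf @ embeddings[t] + Uf @ h)
        fwd.append(h)
    bwd, h = [None] * T, np.zeros(HID)
    for t in reversed(range(T)):             # backward pass
        h = np.tanh(Wb @ embeddings[t] + Ub @ h)
        bwd[t] = h
    return [f + b for f, b in zip(fwd, bwd)]

states = birnn([rng.normal(size=EMB) for _ in range(3)])
```

Summing the two directions, rather than concatenating them, is what keeps the tagger’s hidden states the same size as each directional pass.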
We have also added an auxiliary network to predict POS tags based on the hidden states $h_i$. It serves two purposes: first, it can provide extra supervision on POS tags known during training. Second, it helps to attribute errors to various parts of the network (c.f. Sec. 4.4). The POS tag predictor is optional: its output is not used during inference, because the tagger communicates all information to the scorer and labeler through the hidden states $h_i$.
Finally, the network produces the dependency tree by solving two supervised learning tasks: using a scorer to find the head word, then using a labeler to find the edge label.
The scorer determines whether each pair of hidden vectors forms a dependency. We employ per-word normalization of scores: for a given dependent word location, scores are SoftMax-normalized over all candidate head locations.
The labeler reads a pair of hidden vectors and predicts the label of the corresponding dependency edge. During training we use the ground-truth head location, while during inference we use the location predicted by the scorer.
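A minimal sketch of the per-word score normalization, assuming for simplicity a bilinear scoring function in place of the model’s tanh scoring layer (all names and sizes below are illustrative):

```python
import numpy as np

def head_distributions(H, W):
    """Score every (head, dependent) pair with a bilinear form, then
    SoftMax-normalize the scores over all candidate head positions
    separately for each dependent word."""
    S = H @ W @ H.T                # S[h, d]: score of head h for dependent d
    S = S - S.max(axis=0)          # subtract per-column max for numerical stability
    P = np.exp(S)
    return P / P.sum(axis=0)       # each column is a distribution over heads

rng = np.random.default_rng(2)
H = rng.normal(size=(4, 6))        # hidden states for ROOT + 3 words (toy sizes)
P = head_distributions(H, rng.normal(size=(6, 6)))
heads = P.argmax(axis=0)           # greedy head choice for every position
```

The per-column normalization is what makes the scorer’s output a proper head distribution for each word, which the training criterion below treats as a classification target.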
We employ the following training criterion:
$L = L_{scorer} + \alpha L_{labeler} + \beta L_{POS}$,
where $L_{scorer}$, $L_{labeler}$, $L_{POS}$ are negative log-likelihood losses of the scorer, the labeler and the POS tag predictor, respectively, and $\alpha$, $\beta$ are weighting constants.
4 Experiment Details and Results
4.1 Model Hyperparameters
We have decided to use the same set of hyperparameters for all languages and multilingual parsers; they are a compromise in model capacity between languages with small and large treebanks. The reported size of the recurrent layers is slightly too big for a low-resource single-language parser, but we have determined that it is optimal for languages with large treebanks and for multilingual training.
The reader embeds each character into a vector of size 15 and contains 1050 filters ($50 \cdot k$ filters of length $k$ for $k = 1, 2, \ldots, 6$), whose outputs are projected into a 512-dimensional vector transformed by 3 equally sized layers of a feedforward neural network with ReLU activation. Unlike [21, 12], we decided to remove Highway layers from the reader: their usage introduced a marginal accuracy gain, while nearly doubling the computational burden. The tagger contains 2 BiRNN layers of GRU units with 548 hidden states for both the forward and backward passes, which are later aggregated using addition. Therefore the hidden states of the tagger are also 548-dimensional. The POS tag predictor consists of a single affine transformation followed by a SoftMax predictor for each POS category. The scorer uses a single layer of 384 tanh units for head word scoring, while the labeler uses 256 Maxout units (each using 2 pieces) to classify the relation label. The training cost used fixed values of the weighting constants $\alpha$ and $\beta$.
We regularize the models using Dropout applied to the reader output (20%), between the BiRNN layers of the tagger (70%), and to the labeler (50%). Moreover, we apply mild weight decay.
We have trained all models using the Adadelta learning rule with epsilon annealed from 1e-8 to 1e-12 and adaptive gradient clipping. Experiments are early-stopped on the validation set Unlabeled Attachment Score (UAS). Unfortunately, due to limited computational resources we are only able to present results for a subset of the UD treebanks, shown in Table 1.
Multilingual models use the same architecture. We unify the inputs and outputs of all models by taking the union of all possible token categories (characters, POS categories, dependency labels). If some category does not exist within a particular language we use a special UNK token. All parsers are trained in parallel, minimizing the sum of their individual training costs. We use early stopping on the UAS score of the main (first) language. We equalize training mini-batches such that each contains the same number of sentences from all languages. We determined the optimal amount of parameter sharing and show it in Table 2. Moreover, we never share the start-of-word and end-of-word tokens, to indicate to the network which language is parsed.
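The vocabulary union and mini-batch equalization described above can be sketched as follows; the function names and batch sizes are illustrative assumptions, not the original training code:

```python
from itertools import cycle, islice

def unify_vocab(vocabs, unk="<UNK>"):
    """Union of the per-language token inventories (characters, POS categories,
    dependency labels); tokens absent from a language map to UNK at lookup."""
    return sorted(set().union(*vocabs) | {unk})

def equalized_batches(corpora, per_lang=2):
    """Yield mini-batches containing the same number of sentences from every
    language, cycling over the smaller corpora as needed."""
    iters = [cycle(c) for c in corpora]
    steps = max(len(c) for c in corpora) // per_lang
    for _ in range(steps):
        yield [s for it in iters for s in islice(it, per_lang)]
```

Cycling over the smaller corpus means low-resource sentences are seen several times per epoch of the large corpus, which is one simple way to realize the equal-count batching the text describes.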
Table 1 (excerpt): single-language parsing results (UAS/LAS).
| language | #sentences | Ours UAS | Ours LAS | SyntaxNet UAS | SyntaxNet LAS | Ammar et al. | ParseySaurus UAS | ParseySaurus LAS |
| Ancient Greek | 25 251 | 78.96 | 72.36 | 68.98 | 62.07 | - | 73.85 | 68.1 |
Table 2: multilingual parsing results for varying amounts of parameter sharing.
| Shared parts | Main lang | Auxiliary lang | UAS | LAS |
| Tagger, POS Predictor, Parser | Polish | Czech | 91.65 | 86.88 |
| Reader, Tagger, POS Predictor, Parser | Polish | Czech | 91.91 | 87.77 |
| Tagger, POS Predictor, Parser | Polish | Russian | 91.34 | 86.36 |
| Reader, Tagger, POS Predictor, Parser | Polish | Russian | 89.16 | 82.94 |
| Tagger, POS Predictor, Parser | Russian | Czech | 83.91 | 79.79 |
| Reader, Tagger, POS Predictor, Parser | Russian | Czech | 84.78 | 80.35 |
4.2 Main Results
Our results for single-language training are presented in Table 1. Our models reach better scores than the highly tuned SyntaxNet transition-based parser and are competitive with the DRAGNN-based ParseySaurus, which also uses character-based input.
Multilingual training (Table 2) improves the performance on low-resource languages. We observe that the optimal amount of parameter sharing depends on the similarity between the languages and on corpus size: while it is beneficial to share all parameters of the PL-CZ and RU-CZ parsers, the PL-RU parser works best if the reader subnetworks are separated. We attribute this to the quality of the Czech treebank, which has several times more examples than the Polish and Russian datasets combined.
4.3 Analysis of Language Similarities Identified by the Network
We have first analyzed whether a PL-RU parser can learn the correspondence between the Latin and Cyrillic scripts (conveniently, Unicode has separate code points for Latin and Cyrillic letters). We have inspected the reader subnetwork of a PL-RU parser that shared all parameters. As described in Section 3, the model begins processing a word by finding the embedding of each character. For the analysis we have extracted the embeddings associated with all Polish and Russian characters. We have paired Polish and Russian letters which have similar pronunciations. We note that the pairing omits letters that have no clear counterparts (e.g. the Russian letter я corresponds to the syllable “ja” in Polish):
a-а, b-б, c-ц, d-д, e-е, e-э, f-ф, g-г, h-х, i-и, j-й, k-к, l-л, m-м, n-н, o-о, p-п, r-р, s-с, t-т, u-у, w-в, y-ы, z-з, ł-л, ż-ж
Adapting the famous word-analogy equation, we inspected to what extent our network was able to deduce Latin-Cyrillic correspondences. For all distinct pairs $(p_1, r_1)$, $(p_2, r_2)$ of letter correspondences we computed the vector $E(r_1) - E(p_1) + E(p_2)$, where $E$ stands for the character embedding, and found the Russian letter whose embedding was closest to it. In 48.3% of cases we chose the right letter, $r_2$. We find this quite striking, given that the two languages separated from their common root (Proto-Slavic) more than 1000 years ago. Moreover, relations between Polish and Russian letters are a side effect, not the main objective of the neural network.
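This nearest-neighbour evaluation can be sketched as follows; `analogy_accuracy` and the toy embeddings in the test are hypothetical stand-ins, not the original evaluation code:

```python
import numpy as np

def analogy_accuracy(pairs, emb):
    """For every ordered pair of distinct correspondences (p1, r1), (p2, r2),
    compute emb[r1] - emb[p1] + emb[p2] and check whether the nearest
    Russian-letter embedding belongs to r2."""
    russian = sorted({r for _, r in pairs})
    hits = total = 0
    for p1, r1 in pairs:
        for p2, r2 in pairs:
            if (p1, r1) == (p2, r2):
                continue
            query = emb[r1] - emb[p1] + emb[p2]
            best = min(russian, key=lambda r: np.linalg.norm(emb[r] - query))
            hits += (best == r2)
            total += 1
    return hits / total
```

If the two alphabets were embedded as exact translates of one another, this score would be 100%; random guessing over the Russian letters in the pairing would score only a few percent, so 48.3% indicates substantial structure.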
We have also examined the word representations computed for Polish and Russian by the shared reader subnetwork. As one could expect, the network was able to discover that the morphology of these languages is suffix-based. However, the network was also able to learn that words built from different letters can behave in a similar way. We can observe this in both monolingual and multilingual contexts. Table 3 shows some Polish adjectives and the top-7 Russian words with the closest embeddings. All Russian words which are not in italics have the same morphological tags as the Polish word. In the first row we can observe 2 suffixes, -ской (skoy) and -нной (nnoy), quite distant from the Polish -owej (ovey). In the second row we see that the model was able to correctly alias the Polish 3-letter suffix -ych with the Russian 2-letter suffix -ых, which are pronounced the same way. The relation found by the network is purely syntactical: there is no easy-to-find connection between the semantics of these words.
Table 3 (words in *italics* do not share the Polish word’s morphological tags):
| Polish word | Closest Russian embeddings |
| przedwrześniowej | адренергической тренерской таврической непосредственной археологической философской *верхнюю* |
| większych | автомобильных *трёхдневные* технических практических официальных оригинальных |
| policyjnym | главным историческим глазным непосредственным *косыми* летним двухсимвольным |
4.4 Common Error Analysis
We have investigated two possible sources of errors produced by the parser. First, we verified whether using a more advanced tree-building algorithm was better than using a greedy one. We have observed that the scorer produces very sharp probability distributions that can be transformed into trees using a greedy algorithm that simply selects the highest-scoring head for each word [12, 13]. Counterintuitively, the Chu-Liu-Edmonds (CLE) maximum spanning tree algorithm often makes the decoding results slightly worse. We have established that the network is so confident in its predictions that the non-top scores do not reflect alternatives, but are only noise. Therefore, when greedy decoding creates a cycle, CLE usually breaks it in the wrong place, introducing another head-selection error.
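Greedy decoding and the cycle check that distinguishes it from a proper tree-building algorithm can be sketched as follows (the helper names are illustrative, not the paper’s implementation):

```python
def greedy_heads(P):
    """Pick the highest-scoring head for every word; P[h][d] is the score of
    head h for dependent d, with position 0 reserved for ROOT."""
    n = len(P)
    return [max(range(n), key=lambda h: P[h][d]) for d in range(1, n)]

def has_cycle(heads):
    """Check whether the greedy head choices form a cycle instead of a tree
    rooted at position 0; heads[i] is the chosen head of word i + 1."""
    for start in range(1, len(heads) + 1):
        seen, node = set(), start
        while node != 0:
            if node in seen:
                return True
            seen.add(node)
            node = heads[node - 1]
    return False
```

When `has_cycle` returns False, the greedy output is already a valid tree and CLE can change nothing; the observation above is that in the remaining (cyclic) cases CLE’s repair tends to pick the wrong edge to break.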
We have used the POS predictor to pinpoint which parts of the network (reader/tagger or scorer/labeler) were responsible for errors. Tests showed that if the predicted tag was wrong, the scorer and labeler would nearly always produce erroneous results too.
5 Conclusions and Future Works
We have demonstrated a graph-based dependency parser implemented as a single deep neural network that directly produces parse trees from characters and does not require other NLP tools such as a POS tagger. The proposed parser can be easily used in a multilingual setup, in which parsers for many languages share parameters and are jointly trained. We have established that the degree of sharing depends on language similarity and corpus size: the best PL-CZ and RU-CZ parsers shared all parameters (essentially creating a single parser for both languages), while the best PL-RU parser had separate morphological feature detectors (i.e. readers). We have also determined that the network can extract meaningful relations between languages, such as approximately learning a mapping from Latin to Cyrillic characters or associating Polish and Russian words that have a similar grammatical function. While this contribution focused on improving the performance on a low-resource language using data from other languages, similar parameter-sharing techniques could be used to create one universal parser.
We have performed a qualitative error analysis and have identified two areas for possible future improvements. First, the network does not indicate alternatives to the produced parse tree. Second, errors in word interpretation are often impossible to correct in the upper layers of the network. In the future we plan to investigate training a better POS-tagging subnetwork, possibly using other sources of data.
The experiments used the Theano, Blocks and Fuel libraries. The authors would like to acknowledge the support of the following agencies for research funding and computing support: National Science Center (Poland) grant Sonata 8 2014/15/D/ST6/04402, National Center for Research and Development (Poland) grant Audioscope (Applied Research Program, 3rd contest, submission no. 245755).
-  Alberti, C., et al.: SyntaxNet Models for the CoNLL 2017 Shared Task. arXiv:1703.04929 (Mar 2017)
-  Ammar, W., et al.: Many Languages, One Parser. Transactions of the Association for Computational Linguistics 4(0), 431–444 (Jul 2016)
-  Andor, D., Alberti, C., Weiss, D., Severyn, A., Presta, A., Ganchev, K., Petrov, S., Collins, M.: Globally Normalized Transition-Based Neural Networks. arXiv:1603.06042 [cs] (Mar 2016)
-  Bahdanau, D., Cho, K., Bengio, Y.: Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473 [cs, stat] (Sep 2014)
-  Ballesteros, M., Dyer, C., Smith, N.A.: Improved transition-based parsing by modeling characters instead of words with LSTMs. arXiv preprint arXiv:1508.00657 (2015)
-  Bender, E.M.: On achieving and evaluating language-independence in nlp. Linguistic Issues in Language Technology 6(3), 1–26 (2011)
-  Bergstra, J., et al.: Theano: a CPU and GPU math expression compiler. In: Proc. SciPy (2010)
-  Caruana, R.: Multitask learning. Machine Learning 28(1), 41–75 (1997)
-  Chen, D., Manning, C.D.: A Fast and Accurate Dependency Parser using Neural Networks. In: EMNLP. pp. 740–750 (2014)
-  Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078 (2014)
-  Chorowski, J., Bahdanau, D., Cho, K., Bengio, Y.: End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results. arXiv:1412.1602 [cs, stat] (Dec 2014)
-  Chorowski, J., Zapotoczny, M., Rychlikowski, P.: Read, tag, and parse all at once, or fully-neural dependency parsing. CoRR abs/1609.03441 (2016)
-  Dozat, T., Manning, C.D.: Deep biaffine attention for neural dependency parsing. CoRR abs/1611.01734 (2016)
-  Duong, L., Cohn, T., Bird, S., Cook, P.: A neural network model for low-resource universal dependency parsing. In: EMNLP. pp. 339–348. Citeseer (2015)
-  Dyer, C., Ballesteros, M., Ling, W., Matthews, A., Smith, N.A.: Transition-based dependency parsing with stack long short-term memory. arXiv preprint arXiv:1505.08075 (2015)
-  Edmonds, J.: Optimum Branchings. Journal of Research of the National Bureau of Standards B 71B(4), 233–240 (1966)
-  Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout Networks. In: ICML. pp. 1319–1327 (2013)
-  Guo, J., Che, W., Yarowsky, D., Wang, H., Liu, T.: Cross-lingual dependency parsing based on distributed representations. In: ACL (1). pp. 1234–1244 (2015)
-  Hinton, G.E., McClelland, J.L., Rumelhart, D.E.: Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. MIT Press/Bradford Books (1986)
-  Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., Wu, Y.: Exploring the Limits of Language Modeling. arXiv:1602.02410 [cs] (Feb 2016)
-  Kim, Y., Jernite, Y., Sontag, D., Rush, A.M.: Character-aware neural language models. arXiv preprint arXiv:1508.06615 (2015)
-  Kiperwasser, E., Goldberg, Y.: Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations. arXiv:1603.04351 [cs] (Mar 2016)
-  van Merriënboer, B., et al.: Blocks and fuel: Frameworks for deep learning. arXiv:1506.00619 [cs, stat] (Jun 2015)
-  Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: NIPS. pp. 3111–3119 (2013)
-  Mikolov, T., Karafiát, M., Burget, L., Cernocky, J., Khudanpur, S.: Recurrent neural network based language model. Makuhari, Chiba, Japan (Sep 2010)
-  Nivre, J.: Algorithms for Deterministic Incremental Dependency Parsing. Comput. Linguist. 34(4), 513–553 (Dec 2008)
-  Nivre, J., et al.: MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering p. 1 (Jan 2005)
-  Nivre, J., et al.: Universal Dependencies 1.2. http://universaldependencies.github.io/docs/ (Nov 2015)
-  Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11), 2673–2681 (Nov 1997)
-  Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR 15, 1929–1958 (2014)
-  Srivastava, R.K., Greff, K., Schmidhuber, J.: Highway Networks. arXiv:1505.00387 [cs] (May 2015)
-  Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to Sequence Learning with Neural Networks. arXiv preprint arXiv:1409.3215 (2014)
-  Titov, I., Henderson, J.: A latent variable model for generative dependency parsing. In: In Proceedings of IWPT (2007)
-  Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., Hinton, G.: Grammar as a Foreign Language. arXiv:1412.7449 [cs, stat] (Dec 2014)
-  Wu, Y., et al.: Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv: 1609.08144 (Sep 2016)
-  Zeiler, M.D.: Adadelta: an adaptive learning rate method. arXiv:1212.5701 (2012)
-  Zhang, X., Cheng, J., Lapata, M.: Dependency parsing as head selection. CoRR abs/1606.01280 (2016)