The Universal Dependencies project (Nivre et al., 2016) aims to collect consistently annotated treebanks for many languages. Its current version, 2.2 (Nivre et al., 2018), includes publicly available treebanks for 71 languages in CoNLL-U format. The treebanks contain lemmas, part-of-speech tags, morphological features and dependency relations for every word.
Neural networks have been successfully applied to most of these tasks and have produced state-of-the-art results for part-of-speech tagging and dependency parsing. Part-of-speech tagging is usually defined as a sequence tagging problem and is solved with recurrent or convolutional neural networks using word-level softmax outputs or conditional random fields (Lample et al., 2016; Strubell et al., 2017; Chiu and Nichols, 2016). Reimers and Gurevych (2017) have studied these architectures in depth and demonstrated the effect of network hyperparameters and even random seeds on the performance of the networks.
Neural networks have been applied to dependency parsing since 2014 (Chen and Manning, 2014). The state of the art in dependency parsing is a network with a deep biaffine attention module, which won the CoNLL 2017 UD Shared Task (Dozat et al., 2017).
Nguyen et al. (2017) used a neural network to jointly learn POS tagging and dependency parsing. To the best of our knowledge, lemma generation and POS tagging have never been trained jointly using a single multitask architecture.
This paper describes our submission to the CoNLL 2018 UD Shared Task. We have designed a neural network that jointly learns to predict part-of-speech tags, morphological features and lemmas for a given sequence of words. This is the first step towards JointUD, a multitask neural network that will learn to output all labels included in UD treebanks given a tokenized text. Our system used UDPipe 1.2 (Straka et al., 2016) for sentence segmentation, tokenization and dependency parsing.
Our main contribution is the extension of the sequence tagging network of Reimers and Gurevych (2017) to support character-level sequence outputs for lemma generation. The proposed architecture was validated on nine UD v2.2 treebanks. The results are generally not better than the UDPipe baseline, but we did not extensively tune the network to get the most out of it. Hyperparameter search and improved network design are left for future work.
2 System Architecture
Our system used in the CoNLL 2018 UD Shared Task consists of two parts. First, it takes the raw input and produces a CoNLL-U file using UDPipe 1.2. Then, if a neural model exists for the corresponding treebank, the columns for lemma, part-of-speech and morphological features are replaced by the predictions of the neural model. Note that UDPipe 1.2 did not use the POS tags and lemmas produced by our neural model. We did not train neural models for all treebanks, so most of our submissions are just the output of UDPipe.
The codename of our system in the Shared Task was ArmParser. The code is available on GitHub: https://github.com/YerevaNN/JointUD/.
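The column replacement step can be illustrated with a minimal sketch. It relies only on the standard CoNLL-U layout of ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The function name `merge_predictions` and the shape of `predictions` are hypothetical and are not taken from the released code; skipping multiword-token ranges and empty nodes is also an assumption.

```python
# Zero-based indices of the CoNLL-U columns that get overwritten.
LEMMA, UPOS, FEATS = 2, 3, 5


def merge_predictions(conllu_text, predictions):
    """Overwrite LEMMA, UPOS and FEATS of every regular token line.

    `predictions` is a hypothetical list holding one (lemma, upos, feats)
    tuple per regular token line, in document order.
    """
    remaining = iter(predictions)
    merged = []
    for line in conllu_text.split("\n"):
        columns = line.split("\t")
        # Comments, blank lines, multiword-token ranges ("1-2") and empty
        # nodes ("1.1") are copied unchanged (an assumption of this sketch).
        if line.startswith("#") or len(columns) != 10 or not columns[0].isdigit():
            merged.append(line)
            continue
        lemma, upos, feats = next(remaining)
        columns[LEMMA], columns[UPOS], columns[FEATS] = lemma, upos, feats
        merged.append("\t".join(columns))
    return "\n".join(merged)
```

The HEAD and DEPREL columns are left untouched, which matches the description above: the dependency parse still comes from UDPipe.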
3 Neural model
In this section we describe the neural architecture that takes a sequence of words and outputs lemmas, part-of-speech tags, and 21 morphological features. POS tag and morphological feature prediction is done using a sequence tagging network from Reimers and Gurevych (2017). To generate lemmas, we extend the network with multiple decoders similar to the ones used in sequence-to-sequence architectures.
Suppose the sentence is given as a sequence of words $w_1, \dots, w_n$. Each word $w_i$ consists of characters $c_i^1, \dots, c_i^{n_i}$. For each $w_i$, we are given its lemma as a sequence of characters: $l_i = (l_i^1, \dots, l_i^{m_i})$, POS tag $p_i \in P$, and 21 features $f_i^1 \in F^1, \dots, f_i^{21} \in F^{21}$. The sets $P, F^1, \dots, F^{21}$ contain the possible values for POS tags and morphological features and are language-dependent: the sets are constructed based on the training data of each language. Table 1 shows the possible values for POS tags and morphological features for the English - EWT treebank.
Table 1: Values of the morphological features in the English - EWT treebank.
|Feature|Values (frequency)|
|---|---|
|Number|Sing (27.357%), Plur (6.16%), None (66.483%)|
|Degree|Pos (5.861%), Cmp (0.308%), Sup (0.226%), None (93.605%)|
|Mood|Ind (7.5%), Imp (0.588%), None (91.912%)|
|Tense|Past (4.575%), Pres (5.316%), None (90.109%)|
|VerbForm|Fin (9.698%), Inf (4.042%), Ger (1.173%), Part (2.391%), None (82.696%)|
|Definite|Def (4.43%), Ind (2.07%), None (93.5%)|
|Case|Acc (1.284%), Nom (4.62%), None (94.096%)|
|Person|1 (3.255%), 3 (5.691%), 2 (1.396%), None (89.658%)|
|PronType|Art (6.5%), Dem (1.258%), Prs (7.394%), Rel (0.569%), Int (0.684%), None (83.595%)|
|NumType|Card (1.954%), Ord (0.095%), Mult (0.033%), None (97.918%)|
|Voice|Pass (0.589%), None (99.411%)|
|Gender|Masc (0.743%), Neut (0.988%), Fem (0.24%), None (98.029%)|
|Poss|Yes (1.48%), None (98.52%)|
|Reflex|Yes (0.049%), None (99.951%)|
|Foreign|Yes (0.009%), None (99.991%)|
|Abbr|Yes (0.04%), None (99.96%)|
|Typo|Yes (0.052%), None (99.948%)|
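As an illustration of how such per-feature label sets could be derived from training data, the sketch below parses the CoNLL-U FEATS column (`Name=Value` pairs joined by `|`, with `_` for an empty column) into one value per feature and assigns `None` to features that a word does not carry. The helper names `feats_to_slots` and `build_label_sets` are hypothetical; this is not necessarily how the released code builds the sets.

```python
def feats_to_slots(feats, feature_names):
    """Map a CoNLL-U FEATS string to one value per feature name."""
    values = {}
    if feats != "_":  # "_" marks an empty FEATS column
        for pair in feats.split("|"):
            name, value = pair.split("=", 1)
            values[name] = value
    # Features that are absent for this word get the explicit label "None".
    return [values.get(name, "None") for name in feature_names]


def build_label_sets(training_feats, feature_names):
    """Collect, for every feature, the set of values seen in training data."""
    label_sets = {name: {"None"} for name in feature_names}
    for feats in training_feats:
        slots = feats_to_slots(feats, feature_names)
        for name, value in zip(feature_names, slots):
            label_sets[name].add(value)
    return label_sets
```

With `feature_names = ["Number", "Tense"]`, the string `"Number=Sing"` maps to `["Sing", "None"]`, so every word contributes exactly one label to each of the classification tasks.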
The network consists of three parts: embedding layers, feature extraction layers and output layers.
3.1 Embedding layers
By $e_d(a)$ we denote a $d$-dimensional embedding of the integer $a$. Usually, $a$ is an index of a word in a dictionary or an index of a character in an alphabet.
Each word $w_i$ is represented by a concatenation of three vectors: $e(w_i) = (e^{word}(w_i), e^{casing}(w_i), e^{char}(w_i))$. The first vector, $e^{word}(w_i)$, is a 300-dimensional pretrained word vector. In our experiments we used FastText vectors (Bojanowski et al., 2017) released by Facebook (https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md). The second vector, $e^{casing}(w_i)$, is a one-hot representation of eight casing features, described in Table 2.
The third vector, $e^{char}(w_i)$, is a character-level representation of the word. We map each character $c_i^j$ to a randomly initialized 30-dimensional vector $e_{30}(c_i^j)$, and apply a bi-directional LSTM on these embeddings. $e^{char}(w_i)$ is the concatenation of the 25-dimensional final states of the two LSTMs.
The resulting $e(w_i)$ is a 358-dimensional vector ($300 + 8 + 2 \times 25$).
Table 2: Casing features.
|Feature|Description|
|---|---|
|numeric|All characters are numeric|
|mainly numeric|More than 50% of characters are numeric|
|all lower|All characters are lower cased|
|all upper|All characters are upper cased|
|initial upper|The first character is upper cased|
|contains digit|At least one of the characters is a digit|
|other|None of the above rules applies|
|padding|Used for the padding placeholders of short sequences|
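The eight casing features can be turned into a one-hot vector along the following lines. This is a sketch under stated assumptions: the rules are checked in the order in which they are listed in Table 2 (the table itself does not specify a priority), an empty string is treated as padding, and the function names are hypothetical.

```python
CASING_LABELS = [
    "numeric", "mainly numeric", "all lower", "all upper",
    "initial upper", "contains digit", "other", "padding",
]


def casing_label(word, is_padding=False):
    """Return the casing label of a word, checking rules in table order."""
    if is_padding or not word:  # treating "" as padding is an assumption
        return "padding"
    digits = sum(ch.isdigit() for ch in word)
    if digits == len(word):
        return "numeric"
    if digits / len(word) > 0.5:
        return "mainly numeric"
    if word.islower():
        return "all lower"
    if word.isupper():
        return "all upper"
    if word[0].isupper():
        return "initial upper"
    if digits > 0:
        return "contains digit"
    return "other"


def casing_one_hot(word, is_padding=False):
    """Eight-dimensional one-hot vector for the casing label of a word."""
    label = casing_label(word, is_padding)
    return [1.0 if name == label else 0.0 for name in CASING_LABELS]
```

Because the first matching rule wins, a token such as `mp3` is labelled `all lower` rather than `contains digit` in this sketch; a different rule order would change such borderline cases.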
3.2 Feature extraction layers
We apply three layers of LSTM with 150-dimensional hidden states on the embedding vectors:
$$h_i^j = \mathrm{LSTM}^{(j)}\left(h_i^{j-1}, h_{i-1}^{j}\right), \quad j = 1, 2, 3,$$
where $h_i^0 = e(w_i)$. We also apply 50% dropout before each LSTM layer.
The obtained 150-dimensional vectors $h_i^3$ represent the words with their contexts, and are expected to contain the necessary information about the lemma, POS tag and morphological features.
3.3 Output layers
3.3.1 POS tags and features
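Part-of-speech tagging and morphological feature prediction are word-level classification tasks. For each of these tasks we apply a linear layer with softmax activation:
$$\hat{p}_i = \mathrm{softmax}\left(W_p h_i^3 + b_p\right), \qquad \hat{f}_i^k = \mathrm{softmax}\left(W_{f^k} h_i^3 + b_{f^k}\right), \quad k = 1, \dots, 21.$$
The dimensions of the matrices $W_p$, $W_{f^k}$ and vectors $b_p$, $b_{f^k}$ depend on the training set for the given language: $W_p \in \mathbb{R}^{|P| \times 150}$, $W_{f^k} \in \mathbb{R}^{|F^k| \times 150}$, $k = 1, \dots, 21$. So we end up with 22 cross-entropy loss functions:
$$L_p = \sum_{i=1}^{n} \mathrm{CE}\left(p_i, \hat{p}_i\right), \qquad L_{f^k} = \sum_{i=1}^{n} \mathrm{CE}\left(f_i^k, \hat{f}_i^k\right), \quad k = 1, \dots, 21.$$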
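For readers less familiar with this kind of output layer, here is a small dependency-free sketch of a softmax classifier head and its cross-entropy loss for a single word. The functions are written from the textbook definitions; summing the per-word losses over the sentence (rather than averaging them) is an assumption of this sketch, and all names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, gold_index):
    """Negative log-probability assigned to the gold label."""
    return -math.log(probs[gold_index])


def sentence_loss(logits_per_word, gold_indices):
    """Sum of per-word cross-entropy losses for one tagging task."""
    return sum(
        cross_entropy(softmax(logits), gold)
        for logits, gold in zip(logits_per_word, gold_indices)
    )
```

Each of the 22 tagging tasks would use its own logits and gold labels, so `sentence_loss` is evaluated once per task.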
3.3.2 Lemma generation
This subsection describes our main contribution. In order to generate the lemmas for all words, we add one GRU-based decoder per word. These decoders share the weights and work in parallel. The $i$-th decoder outputs $\hat{l}_i^1, \hat{l}_i^2, \dots$, the predicted characters of the lemma of the $i$-th word. We denote the inputs to the $i$-th decoder by $x_i^1, x_i^2, \dots$. Each $x_i^j$ is a concatenation of four vectors: $x_i^j = \left(h_i^3, e_{30}(c_i^j), \pi_i^j, y_i^{j-1}\right)$.
$h_i^3$ is the representation of the $i$-th word after the feature extractor LSTMs. This is the only part of the vector $x_i^j$ that does not depend on $j$. This trick is important to make sure that word-level information is always available in the decoder.
$e_{30}(c_i^j)$ is the same embedding of the $j$-th character of the word used in the character-level BiLSTM described in Section 3.1.
$\pi_i^j$ encodes the position inside the word as the number of characters left before the end of the word (see Section 5.1).
$y_i^{j-1}$ is the indicator of the previous character of the lemma. During training it is the one-hot vector of the ground truth: $y_i^{j-1} = \mathrm{onehot}(l_i^{j-1})$. During inference it is the output of the GRU decoder in the previous timestep: $y_i^{j-1} = \hat{l}_i^{j-1}$.
These inputs are passed to a single-layer GRU network. The output of the decoder is formed by applying another dense layer on the GRU state:
$$s_i^j = \mathrm{GRU}\left(x_i^j, s_i^{j-1}\right), \qquad \hat{l}_i^j = \mathrm{softmax}\left(W_o s_i^j + b_o\right).$$
Here, $W_o \in \mathbb{R}^{|C| \times 150}$, $b_o \in \mathbb{R}^{|C|}$, where $|C|$ is the number of characters in the alphabet. The initial state of the GRU is the output of the feature extractor LSTM: $s_i^0 = h_i^3$. All GRUs share the weights.
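To make the structure of the decoder inputs concrete, the following sketch assembles the teacher-forced input vectors for a single word from plain Python lists. Several details are assumptions made for illustration and are not stated in the text: the position is passed as a raw count rather than a learned embedding, the previous-character indicator is all zeros at the first step, and the decoder takes one step per character of the input word. The function names are hypothetical.

```python
def one_hot(index, size):
    """One-hot vector of length `size`; all zeros when `index` is None."""
    vector = [0.0] * size
    if index is not None:
        vector[index] = 1.0
    return vector


def decoder_inputs(word_vector, char_vectors, gold_lemma_ids, alphabet_size):
    """Build the teacher-forced decoder inputs for one word.

    word_vector    -- word representation from the feature extractor
    char_vectors   -- one embedding per character of the word
    gold_lemma_ids -- alphabet indices of the gold lemma characters
    """
    n_chars = len(char_vectors)
    inputs = []
    for j, char_vector in enumerate(char_vectors):
        # Number of characters left after position j (raw count: assumption).
        remaining = [float(n_chars - j - 1)]
        # Gold previous lemma character; none at the first step (assumption).
        has_previous = 0 < j <= len(gold_lemma_ids)
        previous = gold_lemma_ids[j - 1] if has_previous else None
        inputs.append(
            word_vector + char_vector + remaining + one_hot(previous, alphabet_size)
        )
    return inputs
```

At inference time the last component would be replaced by the decoder output of the previous step, so the inputs have to be built one step at a time instead of all at once.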
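The loss function for lemma output is the cross-entropy between the gold lemma characters $l_i^j$ and the corresponding decoder outputs $\hat{l}_i^j$:
$$L_l = \sum_{i=1}^{n} \sum_{j=1}^{m_i} \mathrm{CE}\left(l_i^j, \hat{l}_i^j\right).$$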
3.4 Multitask loss function
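The combined loss function is a weighted combination of the loss functions described above, where $L_p$ is the POS tagging loss, $L_{f^k}$ are the losses of the 21 morphological features and $L_l$ is the lemma loss:
$$L = \lambda_p L_p + \sum_{k=1}^{21} \lambda_{f^k} L_{f^k} + \lambda_l L_l \qquad (1)$$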
The final version of our system used and for every .
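A weighted multitask loss of this kind can be sketched in a few lines. The task names, loss values and coefficients below are made up purely for illustration; they are not the values used in the submitted system, and the unnormalized weighted sum is an assumption.

```python
def combined_loss(task_losses, weights):
    """Weighted sum of per-task losses; every task needs a coefficient."""
    missing = set(task_losses) - set(weights)
    if missing:
        raise KeyError(f"no weight for tasks: {sorted(missing)}")
    return sum(weights[name] * value for name, value in task_losses.items())


# Hypothetical per-task losses and coefficients (illustrative numbers only).
task_losses = {"pos": 2.4, "lemma": 0.9, "Number": 0.3, "Tense": 0.1}
weights = {"pos": 0.5, "lemma": 1.0, "Number": 1.0, "Tense": 1.0}
total = combined_loss(task_losses, weights)
```

Lowering the coefficient of a task whose loss is numerically large reduces its share of the gradient, which is the balancing effect discussed in Section 5.2.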
We have implemented the architecture defined in the previous section using the Keras framework. Our implementation is based on the codebase of Reimers and Gurevych (2017) (https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf). The new part of the architecture (lemma generation) is quite slow: training becomes more than three times slower when it is enabled. We have left speed improvements for future work.
To train the model we used RMSProp optimizer with early stopping. The initial learning rate was, and it was decreased to
since the seventh epoch. Training was stopped when the loss on the development set had not improved for five consecutive epochs.
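The stopping rule can be written down as a short loop. This is a generic sketch of early stopping with a patience of five epochs, not the training script of the system: `run_epoch` is a hypothetical callback that trains for one epoch and returns the development-set loss, and `max_epochs` is an arbitrary cap added for safety.

```python
def train_with_early_stopping(run_epoch, max_epochs=100, patience=5):
    """Run epochs until the dev loss fails to improve `patience` times in a row.

    Returns (best_epoch, best_loss, last_epoch_run).
    """
    best_loss = float("inf")
    best_epoch = 0
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        dev_loss = run_epoch(epoch)
        if dev_loss < best_loss:
            best_loss, best_epoch = dev_loss, epoch
        elif epoch - best_epoch >= patience:
            # `patience` consecutive epochs without improvement: stop.
            break
    return best_epoch, best_loss, epoch
```

In practice the model weights from `best_epoch` would be kept, since the epochs run after it did not improve the development loss.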
Due to time constraints, we have trained our neural architecture on just nine treebanks. These include three English and two French treebanks.
The version we ran on TIRA had a bug in the preprocessing pipeline that doubled newline symbols in the input text. Raw texts in UD v2.2 occasionally contain newline symbols inside sentences. These symbols were duplicated due to the bug, and the sentence segmentation part of UDPipe then split the text at these points into two different sentences. The evaluation scripts used in the CoNLL 2018 UD Shared Task penalized these errors. After the deadline of the Shared Task, we ran the same models (without retraining) on the test sets on our local machines without newline symbols.
Additionally, we locally trained models for two more non-Indo-European treebanks: Arabic PADT and Korean GSD.
Table 3 shows the main metrics of the CoNLL 2018 UD Shared Task on the nine treebanks that we used for training our models. For each metric we report five scores: two from our local machine (our model and UDPipe 1.2) and three from the official leaderboard (http://universaldependencies.org/conll18/results.html): our model, the UDPipe baseline, and the best score for that particular treebank. The LAS metric evaluates sentence segmentation, tokenization and dependency parsing, so the numbers for our models should be identical to those of UDPipe 1.2. The MLAS metric additionally takes into account POS tags and morphological features, but not lemmas. The BLEX metric evaluates dependency parsing and lemmatization. The full description of these metrics is available in Zeman et al. (2018b) and on the CoNLL 2018 UD Shared Task website (http://universaldependencies.org/conll18/evaluation.html). Table 4 compares the same models using another set of metrics that measure the performance of POS tagging, morphological feature extraction and lemmatization.
5.1 Input vectors for lemma generation
The initial versions of the lemma decoder did not receive the state of the underlying LSTM or the positional embedding as inputs. The network learned to produce lemmas with some accuracy but with many trivial errors. In particular, after training on the English - EWT treebank, the network learned to remove s from the end of plural nouns. But it also started to produce the <end-of-the-word> symbol even if s was in the middle of the word. We believe the reason was that there was almost no information available that would allow the decoder to distinguish between a plural suffix and a simple s inside the word. One could argue that the initial state of the GRU ($h_i^3$) could contain such information, but it could have been lost in the GRU.
To remedy this we decided to pass $h_i^3$ as an input at every step of the decoder. This idea is known to work well in image caption generation. The earliest usage of this trick that we know of is in Donahue et al. (2015).
Additionally, we have added explicit information about the position in the word. Unlike Vaswani et al. (2017), we encode the number of characters left before the end of the word. This choice might be biased towards languages where the ending of the word is the most critical in lemmatization.
By combining these two ideas we obtained a significant improvement in lemma generation for English. We did not do ablation experiments to determine the effect of each of these additions.
The additional experiments showed that this lemmatizer architecture does not generalize to Arabic and Korean. We will investigate this problem in future work.
5.2 Balancing different tasks
Multitask learning in neural networks is usually complicated because of varying difficulty of individual tasks. The coefficients in (1) can be used to find optimal balance between the tasks. Our initial experiments with all coefficients equal to showed that the loss term for POS tagging () had much higher values than the rest. We decided to set to give more weight to the other tasks and noticed some improvements in lemma generation.
We believe that a more extensive search for better coefficients might significantly improve the overall performance of the system.
5.3 Fighting against overfitting
The main challenge in training these networks is to overcome overfitting. The only trick we used was to apply dropout layers before the feature extractor LSTMs. We did not apply recurrent dropout (Gal and Ghahramani, 2016) or other noise injection techniques, although recent work in language modeling demonstrated the importance of such tricks for obtaining high-performance models (Merity et al., 2018).
In this paper we have described our submission to the CoNLL 2018 UD Shared Task. Our neural network was trained to jointly produce lemmas, part-of-speech tags and morphological features. It is the first step towards a fully multitask neural architecture that will also produce dependency relations. Future work will include more extensive hyperparameter tuning and experiments with more languages.
We would like to thank Tigran Galstyan for helpful discussions on neural architecture. We would also like to thank the anonymous reviewers for their comments. Additionally, we would like to thank Marat Yavrumyan and Anna Danielyan. Their efforts on bringing Armenian into the UD family motivated us to work on sentence parsing.
- Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5:135–146.
- Chen and Manning (2014) Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pages 740–750.
- Chiu and Nichols (2016) Jason P.C. Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional lstm-cnns. Transactions of the Association for Computational Linguistics 4:357–370. https://transacl.org/ojs/index.php/tacl/article/view/792.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pages 1724–1734. https://doi.org/10.3115/v1/D14-1179.
- Donahue et al. (2015) Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In . pages 2625–2634.
- Dozat et al. (2017) Timothy Dozat, Peng Qi, and Christopher D. Manning. 2017. Stanford’s graph-based neural dependency parser at the conll 2017 shared task. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics, pages 20–30. https://doi.org/10.18653/v1/K17-3002.
- Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems. pages 1019–1027.
- Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional Sequence to Sequence Learning. ArXiv e-prints .
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
- Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Kazuya Kawakami, Sandeep Subramanian, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proc. NAACL-HLT.
- Merity et al. (2018) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. Regularizing and optimizing LSTM language models. In International Conference on Learning Representations. https://openreview.net/forum?id=SyyGPP0TZ.
- Nguyen et al. (2017) Dat Quoc Nguyen, Mark Dras, and Mark Johnson. 2017. A novel neural network model for joint pos tagging and graph-based dependency parsing. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics, Vancouver, Canada, pages 134–142. http://www.aclweb.org/anthology/K17-3014.
- Nivre et al. (2016) Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association, Portorož, Slovenia, pages 1659–1666.
- Nivre et al. (2018) Joakim Nivre et al. 2018. Universal Dependencies 2.2. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, Prague. http://hdl.handle.net/11234/1-1983xxx.
- Potthast et al. (2014) Martin Potthast, Tim Gollub, Francisco Rangel, Paolo Rosso, Efstathios Stamatatos, and Benno Stein. 2014. Improving the reproducibility of PAN’s shared tasks: Plagiarism detection, author identification, and author profiling. In Evangelos Kanoulas, Mihai Lupu, Paul Clough, Mark Sanderson, Mark Hall, Allan Hanbury, and Elaine Toms, editors, Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14). Springer, Berlin Heidelberg New York, pages 268–299. https://doi.org/10.1007/978-3-319-11382-1_22.
- Reimers and Gurevych (2017) Nils Reimers and Iryna Gurevych. 2017. Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP). Copenhagen, Denmark, pages 338–348. http://aclweb.org/anthology/D17-1035.
- Straka et al. (2016) Milan Straka, Jan Hajič, and Jana Straková. 2016. UDPipe: trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association, Portorož, Slovenia.
- Strubell et al. (2017) Emma Strubell, Patrick Verga, David Belanger, and Andrew McCallum. 2017. Fast and accurate entity recognition with iterated dilated convolutions. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pages 2670–2680.
- Sukhbaatar et al. (2015) Sainbayar Sukhbaatar, Jason Weston, Rob Fergus, et al. 2015. End-to-end memory networks. In Advances in neural information processing systems. pages 2440–2448.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. pages 5998–6008.
- Zeman et al. (2018a) Dan Zeman et al. 2018a. Universal Dependencies 2.2 – CoNLL 2018 shared task development and test data. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics, Charles University, Prague. http://hdl.handle.net/11234/1-2184.
- Zeman et al. (2018b) Daniel Zeman, Jan Hajič, Martin Popel, Martin Potthast, Milan Straka, Filip Ginter, Joakim Nivre, and Slav Petrov. 2018b. CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics, Brussels, Belgium, pages 1–20.