LemmaTag
A neural network that jointly part-of-speech tags and lemmatizes sentences, boosting accuracy for morphologically-rich languages (Czech, Arabic, etc.)
We present LemmaTag, a featureless recurrent neural network architecture that jointly generates part-of-speech tags and lemmatizes sentences of languages with complex morphology, using bidirectional RNNs with character-level and word-level embeddings. We demonstrate that both tasks benefit from sharing the encoding part of the network and from using the tagger output as an input to the lemmatizer. We evaluate our model across several morphologically-rich languages, surpassing state-of-the-art accuracy in both part-of-speech tagging and lemmatization in Czech, German, and Arabic.
Morphologically rich languages are often difficult to process in many NLP tasks Tsarfaty et al. (2010). As opposed to analytical languages like English, morphologically rich languages encode diverse sets of grammatical information within each word using inflections, which convey characteristics such as case, gender, and tense. The addition of several inflectional variants across many words dramatically increases the vocabulary size, which results in data sparsity and out-of-vocabulary (OOV) issues.
Due to these issues, morphological part-of-speech (POS) tagging and lemmatization are heavily used in NLP tasks such as machine translation Fraser et al. (2012) and sentiment analysis Abdul-Mageed et al. (2014). In morphologically rich languages, the POS tags typically consist of multiple morpho-syntactic subcategories providing additional information (see Figure 1). Closely related to POS tagging is lemmatization, which involves transforming each word to its root or dictionary form. Both tasks require context-sensitive awareness to disambiguate words with the same form but different syntactic or semantic features and behavior. Furthermore, lemmatization of a word form can benefit substantially from the information present in morphological tags, as grammatical attributes often disambiguate word forms using context Müller et al. (2015).

We address context-sensitive POS tagging and lemmatization using a neural network model that jointly performs both tasks on each input word in a given sentence (the code for this project is available at https://github.com/hyperparticle/LemmaTag). We train the model in a supervised fashion, requiring training data containing word forms, lemmas, and POS tags. In addition, we incorporate the ideas of Inoue et al. (2017) to optionally allow the network to predict the subcategories of each tag, improving accuracy. Our model is related to the work of Müller et al. (2015), who use conditional random fields (CRF) to jointly tag and lemmatize words of morphologically rich languages. The idea of jointly predicting several dimensions of categories has been explored prior to this work, for example in joint morphological and syntactic analysis (Bohnet et al., 2013) or joint parsing and semantic role labeling (Gesmundo et al., 2009).
Our model consists of three parts:

- The shared encoder, which creates an internal representation for every word based on its character sequence and the sentence context. We adopt the encoder architecture of Chakrabarty et al. (2017), utilizing character-level Heigold et al. (2017) and word-level embeddings Mikolov et al. (2013b); Santos and Zadrozny (2014) processed through several layers of bidirectional recurrent neural networks (BRNN) Schuster and Paliwal (1997); Chakrabarty et al. (2017).
- The tagger decoder, which applies a fully-connected layer to the outputs of the shared encoder to predict the POS tags.
- The lemmatizer decoder, which applies an RNN sequence decoder to the combined outputs of the shared encoder and tagger decoder, producing the sequence of characters of each lemma (similar to Bergmanis and Goldwater (2018)).

The main advantages over other proposed models are:

- The model is featureless, requiring little to no text preprocessing or morphological-analysis postprocessing.
- The model shares the word embeddings, character embeddings, and RNN encoder weights between the tagger and the lemmatizer, improving both tagging and lemmatization accuracy while reducing the number of parameters required for both tasks.
- The model predicts tag subcategories and provides the output of the tagger as features for the input of the lemmatizer, further improving accuracy.
We evaluate the accuracy of our model in POS tagging and lemmatization across several languages: Czech, Arabic, German, and English. For each language, we also compare the performance of a fully separate tagger and lemmatizer to the proposed joint model. Our results show that our joint model is able to improve the accuracy for both tasks, and achieves state-of-the-art performance in both POS tagging and lemmatization in Czech, German, and Arabic, while closely matching state-of-the-art performance for English.
Given a sequence of words $w_1, \ldots, w_n$ in a sentence, the task of the model is to produce a sequence of associated tags $t_1, \ldots, t_n$ and lemmas $l_1, \ldots, l_n$. For a word $w_i$ at position $i$, we denote by $w_{i,1}, \ldots, w_{i,|w_i|}$ the sequence of characters that make up $w_i$, where $|w_i|$ is the length of the word string at position $i$. Analogously, we define $l_{i,1}, \ldots, l_{i,|l_i|}$ to be the sequence of characters that make up the lemma $l_i$.
Our proposed model (shown in Figures 2 and 3) is split into three parts: the shared encoder, the tagger, and the lemmatizer. The initial layers of the model are shared between the tagger and lemmatizer, encoding the words, characters, and context in a given sentence. The encoder then passes its outputs to two networks, which perform a classification task to predict tags by the tagger and a sequence prediction task to output lemmas (character-by-character) in the lemmatizer.
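To make the overall data flow concrete, the following PyTorch-style skeleton shows this three-part split. It is our own illustration under assumed module interfaces (the encoder, tagger, and lemmatizer modules are sketched in the later sections), not the authors' released implementation.

```python
import torch.nn as nn

class LemmaTag(nn.Module):
    """Skeleton of the joint model: one shared encoder, two task-specific decoders."""

    def __init__(self, encoder, tagger, lemmatizer):
        super().__init__()
        self.encoder = encoder        # shared: char-level + word-level embeddings, sentence BRNNs
        self.tagger = tagger          # classifier over whole tags and tag subcategories
        self.lemmatizer = lemmatizer  # character-level seq2seq decoder

    def forward(self, words, chars, gold_lemma_chars=None):
        # Shared representation: per-word sentence-context vectors and
        # per-character outputs (the latter are reused by the lemmatizer's attention).
        word_states, char_states, word_embeds = self.encoder(words, chars)

        # Tagging: logits for the whole tag and for each subcategory.
        tag_logits, subtag_logits = self.tagger(word_states)

        # Lemmatization: consumes the shared encoding plus the tagger's outputs
        # (gold lemma characters may be passed for teacher forcing during training).
        lemma_logits = self.lemmatizer(
            word_states, char_states, word_embeds,
            tag_logits, subtag_logits, gold_lemma_chars,
        )
        return tag_logits, subtag_logits, lemma_logits
```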
In the encoder shown in Figure 2, each character $w_{i,j}$ of a word $w_i$ is indexed into an embedding layer to produce a fixed-length embedded vector for each character. These vectors are passed into a BRNN layer composed of gated recurrent units (GRU) Cho et al. (2014), producing per-character outputs $c_{i,j}$; the final states of the two directions are concatenated to form the character-level embedding $c_i$ of the word. Similarly, we index $w_i$ into a word-level embedding layer to compute a word-level vector $e_i$. We then sum these results to produce the final word embedding $\bar{e}_i = c_i + e_i$.

We repeat this process independently for all the words in the sentence and feed the resulting sequence of word embeddings into another two BRNN layers composed of long short-term memory units (LSTM) with residual connections. This produces word-level outputs $o_i$ that encode sentence-level context for each word (the final hidden states are discarded).
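A compact sketch of such an encoder, assuming PyTorch modules and the dimensions reported later in the experiments (768 for words, 384 per direction for characters), could look as follows. Padding, masking, and dropout are omitted, and the residual connection is one possible reading of the description, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, n_chars, n_words, char_dim=384, word_dim=768):
        super().__init__()
        self.char_embed = nn.Embedding(n_chars, char_dim)
        # Bidirectional GRU over characters; 2 * char_dim == word_dim, so the
        # concatenated final states can be summed with the word embedding.
        self.char_rnn = nn.GRU(char_dim, char_dim, bidirectional=True, batch_first=True)
        self.word_embed = nn.Embedding(n_words, word_dim)
        # Two bidirectional LSTM layers over the sentence (residual connection added manually).
        self.sent_rnn1 = nn.LSTM(word_dim, word_dim // 2, bidirectional=True, batch_first=True)
        self.sent_rnn2 = nn.LSTM(word_dim, word_dim // 2, bidirectional=True, batch_first=True)

    def forward(self, words, chars):
        # words: (batch, sent_len); chars: (batch, sent_len, word_len)
        b, n, m = chars.shape
        char_in = self.char_embed(chars.view(b * n, m))
        char_out, h_n = self.char_rnn(char_in)            # per-character outputs, reused by the lemmatizer
        char_word = torch.cat([h_n[0], h_n[1]], dim=-1)   # concatenate final forward/backward states
        char_word = char_word.view(b, n, -1)
        embeds = char_word + self.word_embed(words)       # summed word representation
        h1, _ = self.sent_rnn1(embeds)
        h2, _ = self.sent_rnn2(h1)
        states = h1 + h2                                  # residual connection between the BRNN layers
        return states, char_out.view(b, n, m, -1), embeds
```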
Figure 2: Sentence-level encoder and tag classifier. Two BRNN layers with residual connections act on the embedded words of a sentence, providing context. The outputs of the tag classifier are the logits for both the whole tags and their components.

The task of the tagger is to predict a tag $t_i \in T$ for each word given its context, where $T$ is the set of possible tags. As explained in the introduction, morphologically rich languages typically subdivide tags further into several subcategories $t_i^1, \ldots, t_i^C$, where $t_i^c$ denotes the value of the $c$-th subcategory. See Figure 1 for an illustration taken from the Czech PDT tagset.
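For concreteness, a Czech PDT tag is a 15-character positional string, one character per subcategory. The snippet below is our own illustration of splitting such a tag into named components; the position names and the example tag follow the standard PDT positional tagset and are not taken from the paper.

```python
# Illustrative only: splitting a Czech PDT positional tag into its subcategories.
PDT_POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case", "PossGender", "PossNumber",
    "Person", "Tense", "Grade", "Negation", "Voice", "Reserve1", "Reserve2", "Variant",
]

def split_pdt_tag(tag: str) -> dict:
    """Map each of the 15 tag positions to its one-character value ('-' = unused)."""
    assert len(tag) == len(PDT_POSITIONS)
    return dict(zip(PDT_POSITIONS, tag))

print(split_pdt_tag("NNFS1-----A----"))
# {'POS': 'N', 'SubPOS': 'N', 'Gender': 'F', 'Number': 'S', 'Case': '1', ...}
```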
Having the encoded words of a sentence available, the tagger consists of a fully-connected layer with $|T|$ neurons whose input is the word-level encoder output $o_i$ (see Figure 2). This layer produces the logits of the tag values, and the prediction is taken as the maximum-likelihood value under the resulting softmax.

To exploit the categorical structure of each tag, we also predict every subcategory of the tag independently (if subcategories exist in the dataset) with additional dense layers, similar to Inoue et al. (2017). The $c$-th such layer has $|T^c|$ neurons, where $T^c$ is the set of values of the $c$-th subcategory, and outputs the logits for those values. While these subcategory predictions are trained, they are not used when predicting the whole tag. Instead, all of the tagger's outputs (whole-tag and subcategory logits) are concatenated into a flat vector and fed into the lemmatizer as an additional set of potentially useful features.
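A minimal sketch of this tag classifier under the same assumptions (hypothetical names, dropout omitted):

```python
import torch
import torch.nn as nn

class Tagger(nn.Module):
    def __init__(self, state_dim, n_tags, subcategory_sizes):
        super().__init__()
        # One head for the whole (unfactored) tag ...
        self.tag_head = nn.Linear(state_dim, n_tags)
        # ... and one head per tag subcategory, trained as auxiliary objectives.
        self.subtag_heads = nn.ModuleList(
            [nn.Linear(state_dim, size) for size in subcategory_sizes]
        )

    def forward(self, word_states):
        tag_logits = self.tag_head(word_states)
        subtag_logits = [head(word_states) for head in self.subtag_heads]
        return tag_logits, subtag_logits

# Prediction uses only the whole-tag head; the concatenated logits of all heads
# are passed on to the lemmatizer as extra features.
def tag_features(tag_logits, subtag_logits):
    return torch.cat([tag_logits, *subtag_logits], dim=-1)
```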
The task of the lemmatizer is to produce, for each lemma $l_i$, its character sequence $l_{i,1}, \ldots, l_{i,|l_i|}$ (and thereby the lemma length). We use a recurrent sequence decoder, a setup typical of many sequence-to-sequence (seq2seq) tasks such as neural machine translation Sutskever et al. (2014).

The lemmatizer consists of a recurrent LSTM layer whose initial state is taken from the word-level encoder output $o_i$ and whose inputs consist of three parts. The first part is the embedding of the previous output character (initially a beginning-of-word character BOW).
The second part is a character-level attention mechanism Bahdanau et al. (2014) over the outputs $c_{i,j}$ of the character-level BRNN. We employ the multiplicative attention mechanism described by Luong et al. (2015), which allows the LSTM cell to compute an attention vector that selectively weights the character-level information in $c_{i,j}$ at each time step, based on the current input state of the LSTM cell.
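Concretely, multiplicative (Luong-style) attention reduces to a bilinear score between the decoder state and each character output, followed by a softmax. The sketch below is a generic formulation of that mechanism, not code taken from the paper.

```python
import torch
import torch.nn as nn

class MultiplicativeAttention(nn.Module):
    """Luong 'general' attention: score(s, h_j) = s^T W h_j."""

    def __init__(self, state_dim, char_dim):
        super().__init__()
        self.proj = nn.Linear(char_dim, state_dim, bias=False)

    def forward(self, decoder_state, char_outputs):
        # decoder_state: (batch, state_dim); char_outputs: (batch, word_len, char_dim)
        scores = torch.einsum("bd,bjd->bj", decoder_state, self.proj(char_outputs))
        weights = torch.softmax(scores, dim=-1)              # attention distribution over characters
        context = torch.einsum("bj,bjc->bc", weights, char_outputs)
        return context, weights
```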
The third and final part of the RNN input gives the network access to the embedding of the word, the surrounding sentence context, and the output of the tagger. This part is identical for all time steps of a lemma and is a concatenation of the following: the encoder output $o_i$, the embedded word $\bar{e}_i$, and the processed tag features. The tag features are obtained by projecting the concatenated outputs of the tagger through a fully-connected layer with ReLU activation. During training, we do not pass gradients back through this connection, to prevent distortion of the tagger output.

The decoder performs greedy decoding to predict the character outputs. It runs until it produces the end-of-word character EOW or reaches a fixed character limit.
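Putting the three inputs together, a simplified greedy-decoding loop over a batch of words could look like the sketch below. It is our own illustration: MAX_LEMMA_LEN is a hypothetical cap (the paper's exact character limit is not reproduced here), MultiplicativeAttention refers to the sketch above, and teacher forcing during training is omitted.

```python
import torch
import torch.nn as nn

MAX_LEMMA_LEN = 30  # hypothetical cap; the paper uses a fixed character limit

class LemmaDecoder(nn.Module):
    def __init__(self, n_chars, char_dim, state_dim, tag_feat_dim, bow_id, eow_id):
        super().__init__()
        self.bow_id, self.eow_id = bow_id, eow_id
        self.char_embed = nn.Embedding(n_chars, char_dim)
        self.tag_proj = nn.Sequential(nn.Linear(tag_feat_dim, char_dim), nn.ReLU())
        self.attention = MultiplicativeAttention(state_dim, 2 * char_dim)  # see sketch above
        # input = prev char embedding + attention context + static context (o_i, word embedding, tag features)
        in_dim = char_dim + 2 * char_dim + state_dim + state_dim + char_dim
        self.cell = nn.LSTMCell(in_dim, state_dim)
        self.out = nn.Linear(state_dim, n_chars)

    def forward(self, word_state, word_embed, char_outputs, tag_logits):
        # Gradients are stopped on the tagger outputs so lemma errors do not distort the tagger.
        static = torch.cat([word_state, word_embed, self.tag_proj(tag_logits.detach())], dim=-1)
        h = word_state                                    # initial state taken from the encoder output
        c = torch.zeros_like(h)
        prev = torch.full((word_state.size(0),), self.bow_id,
                          dtype=torch.long, device=word_state.device)
        chars = []
        for _ in range(MAX_LEMMA_LEN):                    # greedy decoding until EOW or the length cap
            context, _ = self.attention(h, char_outputs)
            x = torch.cat([self.char_embed(prev), context, static], dim=-1)
            h, c = self.cell(x, (h, c))
            prev = self.out(h).argmax(dim=-1)
            chars.append(prev)
            if (prev == self.eow_id).all():
                break
        return torch.stack(chars, dim=1)
```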
Table 1: Accuracy [%] of POS tagging (tag) and lemmatization (lem) for the separate and joint LemmaTag models, compared with reported state-of-the-art (SoTA) results.

Approach | Czech-PDT tag | Czech-PDT lem | German-TIGER tag | German-TIGER lem | Arabic-PADT tag | Arabic-PADT lem | Eng-EWT tag | Eng-EWT lem | Eng-WSJ tag
---|---|---|---|---|---|---|---|---|---
LemmaTag (sep) | 96.83 | 98.02 | 98.96 | 98.84 | 95.03 | 96.07 | 95.50 | 97.03 | 97.59
LemmaTag (joint) | 96.90 | 98.37 | 98.97 | 99.05 | 95.21 | 96.08 | 95.37 | 97.53 | N/A
SoTA results | 95.89 | 97.86 | 98.04 | 98.24 | 91.68 | 92.60 | 93.90 | 96.90 | 97.78
We define the final loss function as the weighted sum of the losses of the tagger and the lemmatizer:

$L = \lambda_{tag} L_{tag} + \sum_{c} \lambda_{c} L_{tag^c} + \lambda_{lem} L_{lem},$

where $L_{tag}$ is the loss of the unfactored tag predictions, $L_{tag^c}$ the loss of the $c$-th tag component, and $L_{lem}$ the loss of the lemma characters, each computed separately as the softmax cross entropy of the corresponding output logits against the expected outputs. The weight hyperparameters $\lambda$ scale the training losses so that the subtag and lemmatizer losses do not overpower the unfactored tag predictor gradients; there is one weight for the whole tag and one for every component (if no components are available, only the whole-tag weight is used).

In this section, we show the outcomes of evaluation when running our joint tagger and lemmatizer and compare with the current state of the art on the Czech, German, Arabic, and English datasets. Additionally, we evaluate the lemmatizer and tagger separately to compare the relative increase in tagging and lemmatization accuracy.
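A compact rendering of this objective (with per-word logits flattened over the batch; the weight values shown are illustrative placeholders, not the paper's settings):

```python
import torch.nn.functional as F

def lemmatag_loss(tag_logits, subtag_logits, lemma_logits,
                  gold_tags, gold_subtags, gold_lemma_chars,
                  w_tag=1.0, w_subtag=0.1, w_lem=0.5):
    """Weighted sum of softmax cross-entropies; the weights are placeholder values."""
    # tag_logits: (n_words, n_tags); gold_tags: (n_words,)
    loss = w_tag * F.cross_entropy(tag_logits, gold_tags)
    for logits, gold in zip(subtag_logits, gold_subtags):
        loss = loss + w_subtag * F.cross_entropy(logits, gold)
    # lemma_logits: (n_lemma_chars, n_chars); gold_lemma_chars: (n_lemma_chars,)
    loss = loss + w_lem * F.cross_entropy(lemma_logits, gold_lemma_chars)
    return loss
```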
Our datasets consist of the Czech Prague Dependency Treebank (PDT) Hajič et al. (2006, 2018), the German TIGER corpus Brants et al. (2004), the Universal Dependencies Prague Arabic Dependency Treebank (UD-PADT) Hajič et al. (2004), the Universal Dependencies English Web Treebank (UD-EWT) Silveira et al. (2014), and the WSJ portion of the English Penn Treebank (tags only) Marcus et al. (1993). In all datasets, we use the tagset specific to the respective language. Of these datasets, only Czech and Arabic provide subcategorical tags, and we use unfactored tags for the rest. See Table 1 for tagger and lemmatizer accuracies.
Note that the PDT dataset disambiguates lemmas that share a textual representation by appending a number as a lemma sense indicator. For example, the dataset contains the disambiguated lemmas moc-1 (as power) and moc-2 (as too much). About 17.5% of the PDT tokens have such sense-disambiguated lemmas. LemmaTag predicts the lemmas including the senses, and the accuracies in Table 1 take that into account. Ignoring the sense ambiguity, the lemmatization accuracy of the joint LemmaTag model is 98.94% for Czech-PDT.
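To make the "ignoring sense ambiguity" comparison concrete, one can strip the numeric sense suffix before comparing predicted and gold lemmas. The helper below is our own illustration of that evaluation convention, not code from the paper.

```python
import re

def strip_sense(lemma: str) -> str:
    """Remove a trailing numeric sense indicator, e.g. 'moc-1' -> 'moc'."""
    return re.sub(r"-\d+$", "", lemma)

def lemma_accuracy(predicted, gold, ignore_senses=False):
    if ignore_senses:
        predicted = [strip_sense(p) for p in predicted]
        gold = [strip_sense(g) for g in gold]
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)
```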
We use loss weights $\lambda_{tag}$ for the whole tags, $\lambda_c$ for the tag component losses, and $\lambda_{lem}$ for the lemmatizer loss, chosen so that no loss overpowers the gradients of the others (the lemmatizer in particular tends to influence the tagger heavily). The RNNs and word embedding tables have dimensionality 768, except for the character-level embeddings and the character-level RNN, which have dimension 384. The fully-connected layer that projects the concatenated tagger outputs for the lemmatizer has dimension 256.
We train the models for 40 epochs with random permutations of the training sentences and batches of 16 sentences. We scale the initial learning rate by a factor of 0.25 at epochs 20 and 30 to increase accuracy. We train the network using the lazy variant of the Adam optimizer Kingma and Ba (2014), which only updates accumulators for variables that appear in the current batch TensorFlow (2018). We clip the global gradient norm to 3.0 to reduce the risk of exploding gradients.

To prevent the tagger from overfitting, we devise several regularization strategies. We apply dropout with rate 0.5 as indicated in Figures 2 and 3. Word dropout (WD) replaces 25% of the words by the unknown token <unk> to force the network to rely more on context, combatting data sparsity issues. Lastly, we employ label smoothing Pereyra et al. (2017) on the tagger logits (both whole tags and tag components), which prevents the network from becoming too confident in any one class.
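These optimization and regularization choices map onto standard framework calls. The sketch below is a generic PyTorch rendering with assumed helper names (model, compute_loss); plain Adam stands in for the lazy Adam variant, which core PyTorch does not provide, and label smoothing on the tag logits can be added via the label_smoothing argument of the cross-entropy loss in recent PyTorch versions.

```python
import torch

def word_dropout(word_ids, unk_id, rate=0.25):
    """Randomly replace a fraction of word ids with the <unk> token id during training."""
    mask = torch.rand(word_ids.shape, device=word_ids.device) < rate
    return torch.where(mask, torch.full_like(word_ids, unk_id), word_ids)

def make_optimizer(model):
    # Plain Adam stands in for the lazy Adam variant used in the paper (not in core PyTorch).
    optimizer = torch.optim.Adam(model.parameters())
    # Scale the learning rate by a factor of 0.25 at epochs 20 and 30.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[20, 30], gamma=0.25)
    return optimizer, scheduler

def training_step(model, batch, optimizer, compute_loss, unk_id):
    optimizer.zero_grad()
    words = word_dropout(batch["words"], unk_id)
    outputs = model(words, batch["chars"], batch["lemma_chars"])
    loss = compute_loss(outputs, batch)   # weighted tag/subtag/lemma cross-entropies (see earlier sketch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 3.0)   # clip the global gradient norm to 3.0
    optimizer.step()
    return loss.item()
```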
Note that we did not perform any complex hyperparameter search. For additional information on real-world performance and additional techniques which have not improved evaluation accuracy, see Appendix A.
The evaluation results show that performing lemmatization and tagging jointly by sharing encoder parameters and utilizing tag features is mutually beneficial in morphologically rich languages. We have shown that incorporating these ideas results in excellent performance, surpassing state-of-the-art in Czech, German, and Arabic POS tagging and lemmatization by a substantial margin, while closely matching state-of-the-art English POS tagging accuracy.
However, in languages with weak morphology such as English (and German to a lesser extent), sharing the encoder parameters may even hurt the performance of the tagger. We believe this is a consequence of tags correlating less with word-level morphology, and more with sentence-level syntax in morphologically poor languages. Lemma prediction could benefit from the syntactic information in the tags, but the tag predictions rely more on syntactic structure (i.e., word order) rather than on root forms of individual words which could be ambiguous.
There are some possible performance improvements and additional metrics which we leave for future work. For simplicity, one improvement we intentionally left out is the use of additional data. We could incorporate word2vec Mikolov et al. (2013a) or ELMo Peters et al. (2018) word representations, which have been shown to reduce out-of-domain issues and provide semantic information Eger et al. (2016). A second improvement is to integrate information from a morphological dictionary to resolve certain ambiguities Hajič et al. (2009); Inoue et al. (2017). A third improvement is to replace the seq2seq lemmatizer decoder with a classifier that chooses a corresponding edit tree to modify (reduce) the word form to its lemma Chakrabarty et al. (2017). A fourth possible improvement would be to experiment with the Transformer model Vaswani et al. (2017), which uses non-recurrent multi-headed self-attention and has been shown to achieve state-of-the-art performance in several related sequence tasks Dehghani et al. (2018). Lastly, we would like to evaluate LemmaTag on a wider range of languages, e.g., the Universal Dependencies Nivre et al. (2016) languages and treebanks that employ lemmatization, and to analyze the use of different types of POS tags in the model.
The code we used for LemmaTag is available at https://github.com/hyperparticle/LemmaTag.
The work described herein has been supported by the City of Prague under the “OP PPR” program, project No. CZ.07.1.02/0.0/0.0/16_023/0000108 and it has been using language resources developed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).
Tomáš Gavenčiak has been supported by Czech Science Foundation (GACR) project 17-10090Y “Network optimization”. Daniel Kondratyuk has been supported by the Erasmus Mundus program in Language & Communication Technologies (LCT).
We ran all the tests on an NVIDIA GTX 1080 Ti GPU. The joint LemmaTag training takes about 3 hours for Arabic PADT, 4.5 hours for English EWT, 12 hours for German TIGER, and 22 hours for Czech PDT. The separate models take about 50% more time. After training, the lemma and tag predictions of 219,000 test tokens of the Czech PDT take about 100 seconds.
We briefly summarize some of the additional techniques we have tried but which do not improve the results. While some of those techniques do help on smaller models or earlier in the training, the effect on the fully trained network seems to be marginal or even detrimental.
Separate sense prediction. Instead of predicting the sense disambiguation with the lemmatizer (Czech only), we tried to predict the sense as an additional classification problem with one dense layer on top of the encoder output $o_i$ and the tagger outputs, but this performs slightly worse (by about 0.2%).
Beam search decoder. We have implemented a beam search decoder for the lemmatizer instead of the standard greedy one, but the improvement was marginal (around 0.01%).
Variational dropout. While the dropouts in LemmaTag are sampled independently at every time step, variational dropout erases the same channels across the time steps of the RNN. While this generally improves training in convolutional networks and RNNs, we saw no significant difference.
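As an illustration of the difference (our own sketch, not code from the paper), variational dropout samples one mask per sequence and reuses it at every time step:

```python
import torch

def variational_dropout(x, rate=0.5, training=True):
    """x: (batch, time, channels). Sample one dropout mask per sequence and
    reuse it across all time steps, unlike standard (per-step) dropout."""
    if not training or rate == 0.0:
        return x
    keep = 1.0 - rate
    mask = torch.bernoulli(torch.full((x.size(0), 1, x.size(2)), keep, device=x.device)) / keep
    return x * mask   # the same channel mask is broadcast over the time dimension
```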
Layer normalization. Layer normalization applied to the encoding RNNs did not bring significant gain and also slowed down the training.