Static and Dynamic Feature Selection in Morphosyntactic Analyzers

03/21/2016 · by Bernd Bohnet et al. (Google, Universitat Pompeu Fabra, Uppsala universitet)

We study the use of greedy feature selection methods for morphosyntactic tagging under a number of different conditions. We compare a static ordering of features to a dynamic ordering based on mutual information statistics, and we apply the techniques to standalone taggers as well as joint systems for tagging and parsing. Experiments on five languages show that feature selection can result in more compact models as well as higher accuracy under all conditions, but also that a dynamic ordering works better than a static ordering and that joint systems benefit more than standalone taggers. We also show that the same techniques can be used to select which morphosyntactic categories to predict in order to maximize syntactic accuracy in a joint system. Our final results represent a substantial improvement of the state of the art for several languages, while at the same time reducing both the number of features and the running time by up to 80%.




1 Introduction

Morphosyntactic tagging, whether limited to basic part-of-speech tags or using rich morphosyntactic features, is a fundamental task in natural language processing, used in a variety of applications from machine translation

[Habash and Sadat2006] to information extraction [Banko et al.2007]. In addition, tagging can be the first step of a syntactic analysis, providing a shallow, non-hierarchical representation of syntactic structure.

Morphosyntactic taggers tend to belong to one of two different paradigms: standalone taggers or joint taggers. Standalone taggers use narrow contextual representations, typically an n-gram window of fixed size. To achieve state-of-the-art results, they employ sophisticated optimization techniques in combination with rich feature representations [Brants2000, Toutanova and Manning2000, Giménez and Màrquez2004, Müller et al.2013]. Joint taggers, on the other hand, combine morphosyntactic tagging with deeper syntactic processing. The most common case is parsers that predict constituency structures jointly with part-of-speech tags [Charniak and Johnson2005, Petrov et al.2006] or richer word morphology goldberg08.

In dependency parsing, pipeline models have traditionally been the norm, but recent studies have shown that joint tagging and dependency parsing can improve the accuracy of both [Lee et al.2011, Hatori et al.2011, Bohnet and Nivre2012, Bohnet et al.2013]. Unfortunately, joint models typically increase the search space, making them more cumbersome than their pipeline equivalents. For instance, in the joint morphosyntactic transition-based parser of tacl-bbjn, the number of parser actions increases linearly with the size of the part-of-speech and/or morphological label sets, which can be quite large for some languages. For example, muller2013 report morphological tag sets of size 1,000 or more.

The promise of joint tagging and parsing is that by trading off surface morphosyntactic predictions with longer distance dependency predictions, accuracy can be improved. However, it is unlikely that every decision will benefit from this trade-off. Local n-gram context is sufficient for many tagging decisions, and parsing decisions likely only benefit from morphological attributes that correlate with syntactic functions, like case, or those that constrain agreement, like gender or number. At the same time, while standalone morphosyntactic taggers require large feature sets in order to make accurate predictions, it may be the case that fewer features are needed in a joint model, where these predictions are made in tandem with dependency decisions of larger scope. This naturally raises the question of whether we can optimize feature sets at the tagger and parser levels in joint parsing systems to alleviate their inherent complexity.

We investigate this question in the context of the joint morphosyntactic parser of tacl-bbjn, focusing on optimizing and compressing feature sets via greedy feature selection techniques, and explicitly contrasting joint systems with standalone taggers. The main findings emerging from our investigations are:

  • Feature selection works for standalone taggers but is more effective in a joint system. This holds for model size as well as tagging accuracy (and parsing accuracy as a result).

  • Dynamic feature selection strategies that take feature redundancy into account often lead to more compact models than static selection strategies with little loss in accuracy.

  • Similar selection techniques can also reduce the set of morphological attributes to be predicted jointly with parsing, reducing the size of the output space at no cost in accuracy.

The key to all our findings is that these techniques simultaneously compress model size and/or decrease the search space while increasing the underlying accuracy of tagging and parsing, even surpassing the state of the art in a variety of languages. With respect to the former, we observe empirical speed-ups upwards of 5x. With respect to the latter, we show that the resulting morphosyntactic taggers consistently beat state-of-the-art taggers across a number of languages.

2 Related Work

Since morphosyntactic tagging interacts with other tasks such as word segmentation and syntactic parsing, there has been increasing interest in joint models that integrate tagging with these other tasks. This line of work includes joint tagging and word segmentation [Zhang and Clark2008a], joint tagging and named entity recognition [Móra and Vincze2012], joint tagging and parsing [Lee et al.2011, Li et al.2011, Hatori et al.2011, Bohnet and Nivre2012, Bohnet et al.2013], and even joint word segmentation, tagging and parsing [Hatori et al.2012]. These studies often show improved accuracy from joint inference in one or all of the tasks involved.

Feature selection has been a staple of statistical NLP since its beginnings, notably selection via frequency cut-offs in part-of-speech tagging [Ratnaparkhi1996]. Since then, efforts have been made to tie feature selection to model optimization. For instance, mccallum2003 used greedy forward selection with respect to model log-likelihood to select features for named entity recognition. Sparse priors, such as L1 regularization, are a common feature selection technique that trades off feature sparsity against the model's objective [Gao et al.2007]. martins2011 extended such sparse regularization techniques to allow a model to deselect entire feature templates, potentially saving entire blocks of feature extraction computation. However, current systems still tend to employ millions of features without selection, relying primarily on model regularization to combat overfitting. Selection of morphological attributes has been carried out previously in ballesteros2013effective, and selection of features under similar constraints was carried out by bb2014automatic.

3 Feature Selection

The feature selection methods we investigate can all be viewed as greedy forward selection, shown in Figure 1. This paradigm starts from an empty set and considers features one by one. In each iteration, a model is generated from a training set and tested on a development set relative to some accuracy metric of interest. The feature under consideration is added if it increases this metric beyond some threshold and discarded otherwise.

This strategy is similar to the one implemented in MaltOptimizer [Ballesteros and Nivre2014]. It differs from classic forward feature selection [Della Pietra et al.1997] in that it does not test all features in parallel, but instead relies on an ordering of features as input. This is primarily for efficiency, as training models in parallel for a large number of feature templates is cumbersome.

The set of features, F, can be defined as fully instantiated input features, e.g., suffix=ing, or as feature templates, e.g., prefix, suffix, form, etc. Here we always focus on the latter. By reducing the number of feature templates, we are more likely to positively affect the runtime of feature extraction, as many computations can simply be removed.
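As a concrete illustration, the selection loop can be sketched in Python as follows. The `train_and_eval` and `order` callables, and the template names in the example, are hypothetical stand-ins for the paper's actual training and ordering machinery, not its implementation.

```python
# Sketch of greedy forward feature-template selection (Figure 1).
# train_and_eval and order are illustrative stand-ins.

def greedy_forward_selection(templates, train_and_eval, order, threshold=0.02):
    """Greedily add feature templates that improve dev-set accuracy.

    templates      -- candidate feature templates, e.g. ["form", "suffix2", ...]
    train_and_eval -- callable: list of templates -> dev-set accuracy
    order          -- callable: (remaining, selected) -> remaining templates
                      in the order they should be tried next
    threshold      -- minimum accuracy gain required to keep a template
    """
    selected = []
    best = train_and_eval(selected)          # accuracy of the empty model
    remaining = list(templates)
    while remaining:
        candidate = order(remaining, selected)[0]
        remaining.remove(candidate)
        score = train_and_eval(selected + [candidate])
        if score > best + threshold:         # keep only clear improvements
            selected.append(candidate)
            best = score
        # otherwise the candidate is discarded and never revisited
    return selected, best
```

With a static `order` (one that simply returns `remaining` unchanged), this reduces to the static strategy of Section 3.1; a dynamic `order` yields the MRMR variant of Section 3.2.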

3.1 Static Feature Ordering

Our feature selection algorithm assumes a given order in which features are evaluated against the objective function. One simple strategy is a static, human-provided ordering of features that remains fixed throughout the traversal. This means that we test feature templates in a predefined order and keep those that improve the accuracy; those that do not are discarded and never visited again. In Figure 1, this means that the Order() function is fixed throughout the procedure. In our experiments, this fixed order is the same as in Table 1.

Let F be the full set of features, let Eval(S) be the evaluation metric for feature set S, let Order(F) be an ordering function over a feature set F, and let t be the threshold.

1  S ← ∅
2  b ← Eval(S)
3  while F ≠ ∅
4      f ← first feature of Order(F); remove f from F
5      if Eval(S ∪ {f}) > b + t then
6          S ← S ∪ {f}
7          b ← Eval(S)
8      (otherwise discard f)
9  return S
Figure 1: Algorithm for forward feature selection.

3.2 Dynamic Feature Ordering

In text categorization, static feature selection based on correlation statistics is a popular technique [Yang and Pedersen1997]. The typical strategy in such offline selectors is to rank each feature by its correlation to the output space, and to select the top K features. This strategy is often called max relevance, since it aims to select features based solely on their predictive power.

Unfortunately, the best features selected by these algorithms might not provide the best result [Peng et al.2005]. Redundancy among the features is the primary reason for this, and Peng et al. develop the minimal redundancy maximal relevance (MRMR) technique to address this problem. The MRMR method tries to keep the redundancy minimal among the features. The approach is based on mutual information to compute the relevance of features and the redundancy of a feature in relation to a set of already selected features.

The mutual information of two discrete random variables X and Y is defined as follows:

    I(X; Y) = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]

Max relevance selects the feature set S that maximizes the mutual information of the feature templates f_i ∈ S and the output classes c:

    max D(S, c),   D = (1/|S|) Σ_{f_i ∈ S} I(f_i; c)

To account for cases where features are highly redundant, and thus would add little to the discriminative power of a classifier, the following criterion can be added to minimize the mutual information between selected feature templates:

    min R(S),   R = (1/|S|²) Σ_{f_i, f_j ∈ S} I(f_i; f_j)

Minimal redundancy maximal relevance (MRMR) combines both objectives:

    max Φ(D, R),   Φ = D − R

For the greedy feature selection method outlined in Figure 1, we can use the MRMR criterion to define the Order() function. This yields a dynamic feature selection technique, as we update the order of the features under consideration at each iteration, taking into account redundancy with the already selected features.

This technique can be seen in the same light as greedy document summarization [Carbonell and Goldstein1998], where sentences are selected for a summary if they are both relevant and minimally redundant with previously selected sentences.
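A dynamic MRMR ordering for the selection loop might look as follows. The `relevance` and `redundancy` dictionaries stand in for mutual information values that would in practice be estimated from the training data, and the function name and data layout are our illustration, not the paper's code.

```python
# Sketch of a dynamic MRMR ordering function for greedy forward selection.
# relevance and redundancy hold precomputed mutual-information estimates.

def mrmr_order(remaining, selected, relevance, redundancy):
    """Rank remaining templates by relevance minus mean redundancy.

    relevance  -- dict: template -> I(template; output classes)
    redundancy -- dict: frozenset({a, b}) -> I(a; b) for template pairs
    """
    def score(f):
        rel = relevance[f]
        if not selected:
            return rel                       # first pick: pure max-relevance
        red = sum(redundancy[frozenset((f, s))] for s in selected)
        return rel - red / len(selected)     # MRMR criterion: D - R
    return sorted(remaining, key=score, reverse=True)
```

With `selected` empty this reduces to the max-relevance ranking; once templates have been selected, a highly relevant but redundant template (e.g. a lowercased copy of an already selected form feature) drops down the ordering.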

4 Morphosyntactic Tagging

In this section, we describe the two systems for morphosyntactic tagging we use to compare feature selection techniques.

4.1 Standalone Tagger

The first tagger is a standalone SVM tagger, whose training regime is shown in Figure 2. The tagger iterates up to k times (typically twice) over a sentence from left to right (line 2). This iteration allows the final assignment of tags to benefit from tag features on both sides of the target token. For each token of the sentence, the tagger initializes an n-best list and extracts features for the token in question (lines 4–7). In the innermost loop (lines 9–11), the algorithm computes the score for each morphosyntactic tag and inserts a pair consisting of the tag and its score into the n-best list. The algorithm returns a two-dimensional array, where the first dimension ranges over the tokens and the second dimension contains the sorted lists of tag–score pairs. The tagger is trained online using MIRA [Crammer et al.2006]. When evaluating this system as a standalone tagger, we select the 1-best tag. This can be viewed as a multi-pass version of standard SVM-based tagging [Màrquez and Giménez2004].

x is a sentence and w_t (one per tag t) are weight vectors.

1  // iterate k times over each sentence
2  for i = 1 to k
3      // for each token in sentence
4      for j = 1 to |x|
5          b_j ← empty n-best list
6          // extract the features for token x_j
7          f ← features(x, j, b)
8          // for each part-of-speech tag
9          for each t ∈ T
10             s ← w_t · f
11             insert-pair (t, s) into b_j
12 return b
Figure 2: Algorithm for standalone tagger
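The multi-pass loop of Figure 2 can be sketched in Python roughly as follows; `extract` and `score` are hypothetical stand-ins for the tagger's feature extraction and its MIRA-trained linear scoring, which are not reproduced here.

```python
# Sketch of the multi-pass n-best scoring loop of the standalone tagger.
# extract and score are illustrative stand-ins for the real components.

def tag_sentence(tokens, tagset, extract, score, k=2):
    """Return, for every token, a list of (tag, score) pairs sorted by score.

    On the second and later passes, extract can consult the current best
    tags of tokens on *both* sides of the target token via nbest.
    """
    nbest = [[] for _ in tokens]
    for _ in range(k):                       # typically two passes
        for j in range(len(tokens)):
            feats = extract(tokens, j, nbest)
            scored = [(t, score(feats, t)) for t in tagset]
            nbest[j] = sorted(scored, key=lambda p: p[1], reverse=True)
    return nbest
```

Selecting `nbest[j][0][0]` for each token gives the 1-best tagging used when the system is evaluated standalone; the full lists feed the joint tagger-parser of Section 4.2.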

4.2 Joint Dependency-Based Tagger

The joint tagger-parser follows the design of tacl-bbjn, who augment an arc-standard transition-based dependency parser with the capability to select a part-of-speech tag and/or morphological tag for each input word from an n-best list of tags for that word. The tag selection is carried out when an input word is shifted onto the stack. Only the k highest-scoring tags from each n-best list are considered, and only tags whose score is at most α below the score of the best tag. In all experiments of this paper, we set k to 2 and α to 0.25. The tagger-parser uses beam search to find the highest-scoring combined tagging and dependency tree. When pruning the beam, it first extracts the 40 highest-scoring distinct dependency trees and then up to 8 variants that differ only with respect to the tagging, a technique found by tacl-bbjn to give a good balance between tagging and parsing ambiguity in the beam. The tagger-parser is trained using the same online learning algorithm as the standalone tagger. When evaluating this system as a part-of-speech tagger, we consider only the finally selected tag sequence in the dependency tree output by the parser.
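The tag filter applied when a word is shifted can be sketched as follows, with k = 2 and α = 0.25 as in our experiments; the function name and data layout are our own illustration.

```python
# Sketch of the tag filter used by the joint tagger-parser on shift:
# keep at most k tags from the word's n-best list, and only those
# within margin alpha of the best score.

def filter_tags(nbest, k=2, alpha=0.25):
    """nbest -- list of (tag, score) pairs sorted by descending score."""
    if not nbest:
        return []
    best_score = nbest[0][1]
    return [(t, s) for t, s in nbest[:k] if best_score - s <= alpha]
```

Both constraints bite: a long n-best list is truncated to k candidates, and a confident 1-best tag with a large margin suppresses the alternatives entirely, keeping the parser's search space small.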

5 Part-of-Speech Tagging Experiments

To simplify matters, we start by investigating feature selection for part-of-speech taggers, in the context of both standalone and joint systems. The main hypothesis we are testing is whether feature selection techniques are more powerful in joint morphosyntactic systems than in standalone taggers, that is, whether the resulting models are both more compact and more accurate. Additionally, we wish to empirically compare the impact of static versus dynamic feature selection techniques.

5.1 Data Sets

We experiment with corpora from five different languages: Chinese, English, German, Hungarian and Russian. For Chinese, we use the Penn Chinese Treebank 5.1 (CTB5), converted with the head-finding rules, conversion tools and split of zhang08 (training: 001–815, 1001–1136; development: 886–931, 1148–1151; test: 816–885, 1137–1147). For English, we use the WSJ section of the Penn Treebank, converted with the head-finding rules of yamada03 and the labeling rules of nivre06book (training: sections 02–21; development: 24; test: 23). For German, we use the Tiger Treebank [Brants et al.2002] in the improved dependency conversion by seeker12. For Hungarian, we use the Szeged Dependency Treebank [Farkas et al.2012]. For Russian, we use the SynTagRus Treebank [Boguslavsky et al.2000, Boguslavsky et al.2002].

5.2 Feature Templates

Table 1 presents the feature templates that we employed in our experiments (second column). The name of the functor indicates the purpose of the feature template. For instance, the functor form defines the word form. The argument specifies the location of the token, for instance, form(w+1) denotes the token to the right of the current token w.

When more than one argument is given, the functor is applied to each position and the results are concatenated. Thus, form(w,w+1) expands to form(w)+form(w+1). The functor formlc denotes the form with all letters converted to lowercase, and lem denotes the lemma of a word. The functors suffix1, suffix2, … and prefix1, … denote suffixes and prefixes of length 1, 2, …, 5. The suffix1+uc, … functors concatenate a suffix with a value indicating whether the word is uppercase or lowercase.

The functors pos and mor denote part-of-speech tags and morphological tags, respectively. The tags to the right of the current position are available as features in the second iteration of the standalone tagger as well as in the final resolution stage of the joint system. Patterns of the form ccc and cccc (Table 1) denote character sequences of the corresponding length, where each c stands for a single character. Finally, the functor number denotes a sequence of digits, with optional periods and commas.
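The expansion of multi-argument templates described above might be implemented along these lines; the functor definitions and the padding value for out-of-sentence positions are illustrative assumptions, not the paper's code.

```python
# Sketch of multi-argument feature-template expansion: a template like
# form(w,w+1) concatenates the functor's value at each offset.

def expand(functor, offsets, tokens, i):
    """Concatenate functor values at each offset relative to position i."""
    parts = []
    for off in offsets:
        j = i + off
        # pad positions outside the sentence with a sentinel value
        value = functor(tokens[j]) if 0 <= j < len(tokens) else "<PAD>"
        parts.append(value)
    return "+".join(parts)

# A few illustrative functors in the spirit of Table 1:
form = lambda tok: tok
formlc = lambda tok: tok.lower()
suffix2 = lambda tok: tok[-2:]
```

For example, `expand(form, [0, 1], tokens, j)` realizes the template form(w,w+1) at position j, making it clear why dropping a template removes an entire block of such computation at extraction time.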

5.3 Main Results

In our experiments we divide the training corpora into 80% for training and 20% for development; in each iteration, a model is thus trained on 80% of the training corpus and tested on the remaining 20%. (The standard held-out test sets described in Section 5.1 are reserved for final evaluation.) For feature selection, if the newly trained model outperforms the best result so far, the feature is added to the feature model; otherwise, it is not. A model has to improve part-of-speech tagging accuracy by at least 0.02 to count as better. All experiments were carried out on an Intel Xeon 3.4 GHz CPU with 6 cores. Since the feature selection experiments require training a large number of parsing and/or tagging models, we needed a training setup that gives a sufficient accuracy level at reasonable speed. After some preliminary experiments, we selected a beam size of 8 and 12 training iterations for the feature selection experiments, while the final models are tested with a beam size of 40 and 25 training iterations. The size of the second beam for alternative tag sequences is kept at 8 for all experiments, and the threshold at 0.25.

Part-of-Speech Morphology
Standalone Joint Standalone Joint
static dynamic static dynamic static dynamic static dynamic
Id Feature Name Ch En Ge Hu Ru Ch En Ge Hu Ru Ch En Ge Hu Ru Ch En Ge Hu Ru Ge Hu Ru Ge Hu Ru Ge Hu Ru Ge Hu Ru
1 form
2 formlc
3 prefix1
4 prefix2
5 prefix3
6 prefix4
7 prefix5
13 suffix1
14 suffix2
15 suffix3
16 suffix4
17 suffix5
18 suffix2+uc
19 suffix3+uc
20 suffix4+uc
21 suffix5+uc
22 uppercase
23 ccc
24 ccc
25 ccc
26 cccc
27 cccc
28 form(w,w+1)
29 form(w+1)
30 prefix1(w+1)
31 suffix1(w+1)
32 suff2+pref1(w+1)
33 suff2+suff1(w+1)
34 suffix2(w+1)
35 prefix2(w+1)
36 suff2+pref2(w+1)
37 suffix2(w,w+1)
41 form(w+1,w+2)
42 form(w+2)
43 form(w+2,w+3)
44 length(w)
45 lemma(w)
46 number
47 lemmas(w-1,w+1)
48 form(w-1)
49 lemma(w-1 )
50 form(w-2)
51 form(w-3,w-2)
52 form(w-1,w)
53 form(w-2,w-1,w)
54 form(w-1,w,w+1)
55 form(w,w+1,w+2)
56 suffix2(w-1)
57 suffix2(w-1,w-2)
59 suffix2(w-2)
60 suffix2(w-3)
61 suffix2(w+1)
64 suffix2(w+2)
67 pos(w+1,w+2)
68 pos(w+1)
69 pos(w-1)
70 pos(w-1,w+1)
71 pos(w-2)
72 pos(w-1,w-2)
73 form(w-1,w-2)
74 form(w-1,w-1)
75 pos(w-3)
76 morph(w+1,w+2)
77 morph(w+1)
78 morph(w-1)
79 morph(w-1,w+1)
80 morph(w-2)
81 morph(w-1,w-2)
82 fo.(w-1)+mo.(w-2)
83 morph(w-2,w-3)
84 pos(w)
85 pos(w,w-1)
86 pos(w,w+1)
87 pos(w-2,w-1,w)
88 pos(w,w+1,w+2)
89 pos(w-1,w,w+1)
# total 21 24 23 23 23 19 24 21 24 17 19 16 19 15 21 20 15 21 18 9 44 24 31 29 15 24 27 18 21 32 12 16
# average 22.8 21 17.8 16.6 31 22.6 22 20
reduction 69% 71% 76% 78% 58% 69% 70% 73%
Table 1: Selected features for the standalone tagger and the joint tagger-parser with a threshold of 0.02 for the experiments of Section 5 (part-of-speech tagging) and Section 6 (morphological tagging). Some of the features are not shown because they have not been selected. A filled box means selected, while an empty one means not selected.

Table 1 (columns under Part-of-Speech) shows the features that the algorithms selected for each language and each system, and Table 2 shows the performance on the development set. We primarily report part-of-speech tagging accuracy (POS), but also report unlabeled (UAS) and labeled (LAS) attachment scores [Buchholz and Marsi2006] to show the effect of improved taggers on parsing quality. Additionally, Table 2 contains the number of features selected (#).

The first conclusion to draw is that the feature selection algorithms work for both standalone and joint systems. The number of features selected is drastically reduced. The dynamic MRMR feature selection technique for the joint system compresses the model by as much as 78%. This implies faster inference (smaller dot products and less feature extraction) and a smaller memory footprint. In general, joint systems compress more over their standalone counterpart, by about 20%. Furthermore, the dynamic technique tends to have slightly more compression.

The accuracies of the joint tagger-parser are in general superior to those obtained by the standalone tagger, as noted by bohnet12emnlp. In terms of tagging accuracy, static selection works slightly better for Chinese, German and Hungarian, while dynamic MRMR works best for English and Russian (Table 2). Moreover, the standalone tagger selects several feature templates that require iterating over the sentence, such as pos(w+1) and pos(w+2), whereas the feature templates selected by the joint system contain significantly fewer of these. This shows that a joint system is less reliant on context features to resolve ambiguities that previously needed a wider context, almost certainly because global contextual information is pushed to the tagger via parsing decisions. As a consequence, the preprocessing tagger can be simplified: we need to conduct only one iteration over the sentence while maintaining the same accuracy level. Interestingly, the dynamic MRMR technique tends to select fewer form features, which have the largest number of realizable values and thus model parameters.

Table 3 compares the performance of our two taggers with two state-of-the-art taggers. Except for English, the joint tagger consistently outperforms the Stanford tagger and MarMot: for Chinese by 0.3, for German by 0.38, for Hungarian by 0.25 and for Russian by 0.75. Table 4 compares the resulting parsing accuracies to state-of-the-art dependency parsers for English and Chinese, showing that the results are in line with or higher than the state of the art.

            Pipeline                  Joint
            POS   LAS   UAS   #       POS   LAS   UAS   #
Chinese
none        94.14 78.98 81.77 74      94.34 79.33 82.21 74
static      94.26 79.06 81.96 20      94.57 79.75 82.54 19
dynamic     94.06 79.26 82.21 23      94.49 79.68 82.54 20
English
none        97.18 90.78 91.97 74      96.99 90.92 92.06 74
static      96.98 90.56 91.77 23      96.99 91.05 92.20 16
dynamic     97.05 90.69 91.97 23      97.13 90.78 91.95 15
German
none        97.79 91.33 93.33 74      98.14 91.64 93.60 74
static      97.77 91.70 93.71 25      98.20 91.83 93.83 19
dynamic     97.60 91.40 93.60 22      97.91 91.74 93.73 22
Hungarian
none        97.89 87.89 90.44 74      98.00 88.11 90.59 74
static      97.87 88.01 90.54 23      98.01 88.25 90.78 23
dynamic     97.85 87.80 90.36 24      98.00 88.16 90.65 18
Russian
none        98.62 87.45 92.58 74      98.79 87.69 92.83 74
static      98.70 87.41 92.62 23      98.85 87.61 92.86 21
dynamic     98.69 87.69 92.92 17      98.87 87.61 92.86 9
Table 2: Tagging and parsing accuracy scores and number of selected feature templates (#) on the dev set without feature selection (none), with static and with dynamic MRMR greedy feature selection.
Ch En Ge Hu Ru
Stanford 93.75 97.44 97.51 97.55 98.16
MarMot 93.84 97.43 97.57 97.63 98.18
Standalone 94.04 97.33 97.56 97.69 98.73
Joint 94.14 97.42 97.95 97.88 98.93
Table 3: State-of-the-art comparison for tagging on the test set.
Chinese
li11 93.08 80.55
hatori12 93.94 81.20
bohnet12emnlp 93.24 77.91 81.42
joint-static 94.14 78.77 81.91 20
English
mcdonald05acl 90.9
mcdonald06eacl 91.5
huang10 92.1
koo10acl 93.04
zhang11 92.9
martins10 93.26
bohnet12emnlp 97.33 92.44 93.38
zhang-2013 93.50
koo08 93.16
carreras08 93.5
suzuki-EtAl:2009 93.79
joint-dynamic 97.42 92.50 93.50 15
Table 4: State-of-the-art comparison for parsing on the test set. Results marked with a dagger are not directly comparable as additional data was used.

6 Morphological Tagging Experiments

The joint morphology and syntactic inference requires the selection of morphological attributes (case, number, etc.) and the selection of features to predict the morphological attributes. In past work on joint morphosyntactic parsing, all morphological attributes are predicted jointly with syntactic dependencies [Bohnet et al.2013]. However, this could lead to unnecessary complexity as only a subset of the attributes are likely to influence parsing decisions, and vice versa.

In this section we investigate whether feature selection methods can also be used to reduce the set of morphological attributes that are predicted as part of a joint system. For instance, consider a language that has the following attributes: case, gender, number, animacy, and suppose the language does not have gender agreement. Then likely only case and number will be useful in a joint system, and the gender and animacy attributes can be predicted independently. This could substantially improve the speed of the joint model, on top of standard feature selection, as the size of the morphosyntactic tag set is reduced significantly.

6.1 Data Sets

We use the data sets listed in subsection 5.1 for the languages that provide morphological annotation, which are German, Hungarian and Russian.

6.2 Main Results: Attribute Selection

For the selection of morphological attributes (e.g., case, number, tense), we explore a simple method that departs slightly from those in Section 3. In particular, we do not run greedy forward selection; instead, we compute accuracy improvements for each attribute offline and then select attributes independently based on these values. Our initial design was greedy forward attribute selection, but we found experimentally that independent attribute selection worked best.

We run 10-fold cross-validation experiments on the training set, where 90% of the training set is used for training and 10% for testing. We simply test for each attribute independently whether its inclusion in a joint morphosyntactic parsing system increases parsing accuracy (LAS/UAS) by a statistically significant amount; if it does, it is included. We applied cross-validation to obtain more reliable results than the development sets allow, as some improvements were small; e.g., the gains for gender and number in German are within the range of the standard deviation on the development set. We use parsing accuracy because we are primarily testing whether a subset of attributes can be used in place of the full set in joint morphosyntactic parsing.

Even though this method only tests each attribute's contribution independently of the others, we found experimentally that this was never a problem. For instance, in German, without any morphological attributes we get a baseline of 89.18 LAS; including the attribute case yields 89.45 LAS; including number yields 89.32 LAS; and including both case and number yields 89.60 LAS.
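Under these assumptions, the independent attribute test can be sketched as follows. The cross-validation evaluation and the significance test are reduced to precomputed inputs here, and all names are our own illustration.

```python
# Sketch of independent morphological attribute selection: keep an
# attribute if adding it to the joint model improves cross-validated
# LAS by at least min_gain with a significant result (precomputed).

def select_attributes(attributes, cv_las, baseline_las,
                      significant, min_gain=0.1):
    """attributes  -- candidate attributes, e.g. ["case", "gender", ...]
    cv_las       -- dict: attribute -> mean LAS with that attribute added
    baseline_las -- mean LAS of the joint model without any attribute
    significant  -- dict: attribute -> True if the gain is significant"""
    selected = []
    for attr in attributes:
        gain = cv_las[attr] - baseline_las
        if gain >= min_gain and significant[attr]:
            selected.append(attr)
    return selected
```

Each attribute is judged against the same fixed baseline rather than against a growing set, which is what distinguishes this independent test from the greedy forward selection of Section 3.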

Table 5 shows which attributes were selected. We include an attribute when the cross-validation experiment shows an improvement of at least 0.1 with a statistical significance of 0.01 or better. Some borderline cases remain, such as Russian passive, where we observed an accuracy gain of 0.2 but only low statistical significance.

German
attribute LAS UAS stat. sig. Sel.
none 89.2 91.8
case 89.5 91.9 yes
gender 89.2 91.8 no
number 89.3 91.9 yes
mode 89.2 91.8 no
person 89.2 91.8 no
tense 89.2 91.8 no
Hungarian
attribute LAS UAS stat. sig. Sel.
none 84.5 88.3
case 85.7 89.0 yes
degree 84.6 88.4 yes
number 84.7 88.5 yes
mode 84.6 88.4 no
person P 84.6 88.7 yes
person 85.0 88.9 yes
subpos 85.4 88.9 yes
tense 84.6 88.4 yes
Russian
attribute LAS UAS stat. sig. Sel.
none 79.4 88.2
act 80.4 89.1 yes
anim 79.8 88.3 yes
aspect 79.4 88.2 no
case 80.9 89.3 yes
degree 79.4 88.2 no
gender 80.1 88.6 yes
mode 80.0 88.7 yes
number 82.2 88.6 yes
passive 79.6 88.2 yes
tense 79.8 88.4 yes
typo 79.4 88.1 no
Table 5: Morphological attribute selection via 10-fold cross-validation on the training set.
            Pipeline                  Joint
            MORPH LAS   UAS   #       MORPH LAS   UAS   #
German
none        93.07 91.69 93.66 83     94.21 91.69 93.65 83
static      92.65 91.70 93.75 44     94.17 91.56 93.54 27
dynamic     92.89 91.61 93.62 29     94.01 91.72 93.72 32
Hungarian
none        97.10 88.00 90.49 83     97.22 88.02 90.51 83
static      97.11 87.98 90.43 24     97.23 88.06 90.49 18
dynamic     96.90 87.92 90.38 15     97.22 87.97 90.43 12
Russian
none        93.41 87.36 92.58 83     95.78 87.54 92.79 83
static      95.37 87.50 92.80 31     95.39 87.58 92.85 21
dynamic     95.31 87.53 92.74 24     95.74 87.59 92.86 16
Table 6: Morphological and syntactic accuracy scores on the dev set without feature selection (none), with greedy forward feature selection (static) and with Minimum Redundancy Maximum Relevance (dynamic).
German
seeker13 91.50 93.48
joint-static-static 97.97 94.20 91.88 93.81
Hungarian
farkas12 87.2 90.1
tacl-bbjn 97.80 96.4 88.9 91.3
joint-static-static 97.85 96.97 88.85 91.32
Russian
boguslavsky11 86.0 90.0
tacl-bbjn 98.50 94.4 87.6 92.8
joint-dynamic-dynamic 98.90 95.62 87.86 92.95
Table 7: State-of-the-art comparison on the test set. Results marked with a dagger are not comparable since they use different morphological attribute bundles.

6.3 Main Results: Feature Selection

Having fixed the set of attributes to be predicted jointly with the parser, we can turn our attention to optimizing the feature sets for morphosyntactic tagging. To this end, we again consider greedy forward selection with the static and dynamic strategies. Table 1 shows the selected features for the different languages, where a filled box again means that the feature was selected. Table 6 shows the performance on the development set. For German, the full template set performs best, but only 0.04 better than static selection, which performs nearly as well while reducing the template set by 68%. For Hungarian, all sets perform similarly, while dynamic selection needs 86% fewer features. The top-performing feature set for Russian is dynamic selection in a joint system, which needs 81% fewer features. We observe again that dynamic selection tends to select fewer feature templates than static selection, but here both the full set of features and the set selected by static selection appear to have better accuracy on average.

The feature selection methods obtain significant speed-ups for the joint system. On the development sets we observed a speedup from 0.015 to 0.003 sec/sentence for Hungarian, from 0.014 to 0.004 sec/sentence for German, and from 0.015 to 0.006 sec/sentence for Russian. This represents a reduction in running time between 50 and 80%.

Table 7 compares our system to other state-of-the-art morphosyntactic parsers. On average, the accuracies of our attribute/feature selection models are competitive with or above the state of the art. The key result is that state-of-the-art accuracy can be achieved with much leaner and faster models.

7 Conclusions

There are several methodological lessons to learn from this paper. First, feature selection is generally useful, as it leads to fewer features and faster tagging while maintaining state-of-the-art results. Second, feature selection is even more effective for joint tagging-parsing, where it leads to even better results and smaller feature sets; in some cases, the number of feature templates is reduced by up to 80%, with a corresponding reduction in running time. Third, dynamic feature selection strategies [Peng et al.2005] lead to more compact models than static feature selection, without significantly impacting accuracy. Finally, similar methods can be applied to morphological attribute selection, leading to even leaner and faster models.


Miguel Ballesteros is supported by the European Commission under the contract numbers FP7-ICT-610411 (project MULTISENSOR) and H2020-RIA-645012 (project KRISTINA).


  • [Ballesteros and Bohnet2014] Miguel Ballesteros and Bernd Bohnet. 2014. Automatic feature selection for agenda-based dependency parsing. In Proceedings of the 25th International Conference on Computational Linguistics (COLING).
  • [Ballesteros and Nivre2014] Miguel Ballesteros and Joakim Nivre. 2014. MaltOptimizer: Fast and Effective Parser Optimization. Natural Language Engineering.
  • [Ballesteros2013] Miguel Ballesteros. 2013. Effective morphological feature selection with MaltOptimizer at the SPMRL 2013 shared task. In Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages, pages 53–60.
  • [Banko et al.2007] Michele Banko, Michael J Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. 2007. Open information extraction for the web. In IJCAI, volume 7, pages 2670–2676.
  • [Boguslavsky et al.2000] Igor Boguslavsky, Svetlana Grigorieva, Nikolai Grigoriev, Leonid Kreidlin, and Nadezhda Frid. 2000. Dependency treebank for Russian: Concept, tools, types of information. In COLING, pages 987–991.
  • [Boguslavsky et al.2002] Igor Boguslavsky, Ivan Chardin, Svetlana Grigorieva, Nikolai Grigoriev, Leonid Iomdin, Leonid Kreidlin, and Nadezhda Frid. 2002. Development of a dependency treebank for Russian and its possible applications in NLP. In LREC, pages 852–856.
  • [Boguslavsky et al.2011] Igor Boguslavsky, Leonid Iomdin, Victor Sizov, Leonid Tsinman, and Vadim Petrochenkov. 2011. Rule-based dependency parser refined by empirical and corpus statistics. In Proceedings of the International Conference on Dependency Linguistics, pages 318–327.
  • [Bohnet and Nivre2012] Bernd Bohnet and Joakim Nivre. 2012. A transition-based system for joint part-of-speech tagging and labeled non-projective dependency parsing. In EMNLP-CoNLL, pages 1455–1465.
  • [Bohnet et al.2013] Bernd Bohnet, Joakim Nivre, Igor Boguslavsky, Richárd Farkas, Filip Ginter, and Jan Hajič. 2013. Joint morphological and syntactic analysis for richly inflected languages. TACL, 1:415–428.
  • [Brants et al.2002] Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. TIGER treebank. In Proceedings of the 1st Workshop on Treebanks and Linguistic Theories (TLT), pages 24–42.
  • [Brants2000] Thorsten Brants. 2000. TnT – a statistical part-of-speech tagger. In ANLP.
  • [Buchholz and Marsi2006] Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In CoNLL, pages 149–164.
  • [Carbonell and Goldstein1998] Jaime Carbonell and Jade Goldstein. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 335–336. ACM.
  • [Carreras et al.2008] Xavier Carreras, Michael Collins, and Terry Koo. 2008. TAG, dynamic programming, and the perceptron for efficient, feature-rich parsing. In CoNLL, pages 9–16.
  • [Charniak and Johnson2005] Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In ACL, pages 173–180.
  • [Crammer et al.2006] Koby Crammer, Ofer Dekel, Joseph Keshet, Shai Shalev-Shwartz, and Yoram Singer. 2006. Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585.
  • [Della Pietra et al.1997] Stephen Della Pietra, Vincent Della Pietra, and John Lafferty. 1997. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19.
  • [Farkas et al.2012] Richárd Farkas, Veronika Vincze, and Helmut Schmid. 2012. Dependency parsing of Hungarian: Baseline results and challenges. In EACL, pages 55–65.
  • [Gao et al.2007] Jianfeng Gao, Galen Andrew, Mark Johnson, and Kristina Toutanova. 2007. A comparative study of parameter estimation methods for statistical natural language processing. In ACL, volume 45, page 824.
  • [Giménez and Màrquez2004] Jesús Giménez and Lluís Màrquez. 2004. SVMTool: A general POS tagger generator based on support vector machines. In LREC.
  • [Goldberg and Tsarfaty2008] Yoav Goldberg and Reut Tsarfaty. 2008. A single generative model for joint morphological segmentation and syntactic parsing. In ACL, pages 371–379.
  • [Habash and Sadat2006] Nizar Habash and Fatiha Sadat. 2006. Arabic preprocessing schemes for statistical machine translation. In NAACL-06.
  • [Hatori et al.2011] Jun Hatori, Takuya Matsuzaki, Yusuke Miyao, and Jun’ichi Tsujii. 2011. Incremental joint POS tagging and dependency parsing in Chinese. In IJCNLP, pages 1216–1224.
  • [Hatori et al.2012] Jun Hatori, Takuya Matsuzaki, Yusuke Miyao, and Jun’ichi Tsujii. 2012. Incremental joint approach to word segmentation, POS tagging, and dependency parsing in Chinese. In ACL, pages 1045–1053.
  • [Huang and Sagae2010] Liang Huang and Kenji Sagae. 2010. Dynamic programming for linear-time incremental parsing. In ACL, pages 1077–1086.
  • [Koo and Collins2010] Terry Koo and Michael Collins. 2010. Efficient third-order dependency parsers. In ACL, pages 1–11.
  • [Koo et al.2008] Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In ACL, pages 595–603.
  • [Lee et al.2011] John Lee, Jason Naradowsky, and David A. Smith. 2011. A discriminative model for joint morphological disambiguation and dependency parsing. In ACL, pages 885–894.
  • [Li et al.2011] Zhenghua Li, Min Zhang, Wanxiang Che, Ting Liu, Wenliang Chen, and Haizhou Li. 2011. Joint models for Chinese POS tagging and dependency parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1180–1191.
  • [Màrquez and Giménez2004] Lluís Màrquez and Jesús Giménez. 2004. A general POS tagger generator based on support vector machines. JMLR.
  • [Martins et al.2010] Andre Martins, Noah Smith, Eric Xing, Pedro Aguiar, and Mario Figueiredo. 2010. Turbo parsers: Dependency parsing by approximate variational inference. In EMNLP, pages 34–44.
  • [Martins et al.2011] André FT Martins, Noah A Smith, Pedro MQ Aguiar, and Mário AT Figueiredo. 2011. Structured sparsity in structured prediction. In EMNLP, pages 1500–1511. Association for Computational Linguistics.
  • [McCallum and Li2003] Andrew McCallum and Wei Li. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 188–191. Association for Computational Linguistics.
  • [McDonald and Pereira2006] Ryan McDonald and Fernando Pereira. 2006. Online learning of approximate dependency parsing algorithms. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 81–88.
  • [McDonald et al.2005] Ryan McDonald, Koby Crammer, and Fernando Pereira. 2005. Online large-margin training of dependency parsers. In ACL, pages 91–98.
  • [Móra and Vincze2012] György Móra and Veronika Vincze. 2012. Joint part-of-speech tagging and named entity recognition using factor graphs. In Petr Sojka, Ales Horák, Ivan Kopecek, and Karel Pala, editors, Text, Speech and Dialogue, volume 7499 of Lecture Notes in Computer Science, pages 232–239. Springer.
  • [Müller et al.2013] Thomas Müller, Helmut Schmid, and Hinrich Schütze. 2013. Efficient higher-order CRFs for morphological tagging. In Proceedings of EMNLP.
  • [Nivre2006] Joakim Nivre. 2006. Inductive Dependency Parsing. Springer.
  • [Peng et al.2005] Hanchuan Peng, Fuhui Long, and Chris Ding. 2005. Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy. In IEEE Transactions On Pattern Analysis And Machine Intelligence, pages 1226–1238.
  • [Petrov et al.2006] Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In ACL, pages 433–440.
  • [Ratnaparkhi1996] Adwait Ratnaparkhi. 1996. A maximum entropy part-of-speech tagger. In EMNLP, pages 133–142.
  • [Seeker and Kuhn2012] Wolfgang Seeker and Jonas Kuhn. 2012. Making ellipses explicit in dependency conversion for a german treebank. In LREC, pages 3132–3139.
  • [Seeker and Kuhn2013] Wolfgang Seeker and Jonas Kuhn. 2013. Morphological and syntactic case in statistical dependency parsing. Computational Linguistics, 39:23–55.
  • [Suzuki et al.2009] Jun Suzuki, Hideki Isozaki, Xavier Carreras, and Michael Collins. 2009. An empirical study of semi-supervised structured conditional models for dependency parsing. In EMNLP, pages 551–560, Singapore, August. Association for Computational Linguistics.
  • [Toutanova and Manning2000] Kristina Toutanova and Christopher D. Manning. 2000. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 63–70.
  • [Yamada and Matsumoto2003] Hiroyasu Yamada and Yuji Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proceedings of the 8th International Workshop on Parsing Technologies (IWPT), pages 195–206.
  • [Yang and Pedersen1997] Yiming Yang and Jan O Pedersen. 1997. A comparative study on feature selection in text categorization. In ICML, volume 97, pages 412–420.
  • [Zhang and Clark2008a] Yue Zhang and Stephen Clark. 2008a. Joint word segmentation and POS tagging using a single perceptron. In ACL, pages 888–896.
  • [Zhang and Clark2008b] Yue Zhang and Stephen Clark. 2008b. A tale of two parsers: Investigating and combining graph-based and transition-based dependency parsing. In EMNLP, pages 562–571.
  • [Zhang and Nivre2011] Yue Zhang and Joakim Nivre. 2011. Transition-based parsing with rich non-local features. In ACL.
  • [Zhang et al.2013] Hao Zhang, Liang Huang, Kai Zhao, and Ryan McDonald. 2013. Online learning for inexact hypergraph search. In EMNLP, pages 908–913, Seattle, Washington, USA, October. Association for Computational Linguistics.