We study the use of greedy feature selection methods for morphosyntactic tagging under a number of different conditions. We compare a static ordering of features to a dynamic ordering based on mutual information statistics, and we apply the techniques to standalone taggers as well as joint systems for tagging and parsing. Experiments on five languages show that feature selection can result in more compact models as well as higher accuracy under all conditions, but also that a dynamic ordering works better than a static ordering and that joint systems benefit more than standalone taggers. We also show that the same techniques can be used to select which morphosyntactic categories to predict in order to maximize syntactic accuracy in a joint system. Our final results represent a substantial improvement of the state of the art for several languages, while at the same time reducing both the number of features and the running time by up to 80%.
Morphosyntactic tagging, whether limited to basic part-of-speech tags or using rich morphosyntactic features, is a fundamental task in natural language processing, used in a variety of applications from machine translation [Habash and Sadat2006] to information extraction [Banko et al.2007]. In addition, tagging can be the first step of a syntactic analysis, providing a shallow, non-hierarchical representation of syntactic structure.
Morphosyntactic taggers tend to belong to one of two different paradigms: standalone taggers or joint taggers. Standalone taggers use narrow contextual representations, typically an n-gram window of fixed size. To achieve state-of-the-art results, they employ sophisticated optimization techniques in combination with rich feature representations [Brants2000, Toutanova and Manning2000, Giménez and Màrquez2004, Müller et al.2013]. Joint taggers, on the other hand, combine morphosyntactic tagging with deeper syntactic processing. The most common case is parsers that predict constituency structures jointly with part-of-speech tags [Charniak and Johnson2005, Petrov et al.2006] or richer word morphology goldberg08.
In dependency parsing, pipeline models have traditionally been the norm, but recent studies have shown that joint tagging and dependency parsing can improve the accuracy of both [Lee et al.2011, Hatori et al.2011, Bohnet and Nivre2012, Bohnet et al.2013]. Unfortunately, joint models typically increase the search space, making them more cumbersome than their pipeline equivalents. For instance, in the joint morphosyntactic transition-based parser of tacl-bbjn, the number of parser actions increases linearly with the size of the part-of-speech and/or morphological label sets. For some languages these can be quite large: muller2013 report morphological tag sets of size 1,000 or more.
The promise of joint tagging and parsing is that by trading off surface morphosyntactic predictions with longer-distance dependency predictions, accuracy can be improved. However, it is unlikely that every decision will benefit from this trade-off. Local n-gram context is sufficient for many tagging decisions, and parsing decisions likely only benefit from morphological attributes that correlate with syntactic functions, like case, or those that constrain agreement, like gender or number. At the same time, while standalone morphosyntactic taggers require large feature sets in order to make accurate predictions, it may be the case that fewer features are needed in a joint model, where these predictions are made in tandem with dependency decisions of larger scope. This naturally raises the question of whether we can advantageously optimize feature sets at the tagger and parser levels in joint parsing systems to alleviate their inherent complexity.
We investigate this question in the context of the joint morphosyntactic parser of tacl-bbjn, focusing on optimizing and compressing feature sets via greedy feature selection techniques, and explicitly contrasting joint systems with standalone taggers. The main findings emerging from our investigations are:
Feature selection works for standalone taggers but is more effective in a joint system. This holds for model size as well as tagging accuracy (and parsing accuracy as a result).
Dynamic feature selection strategies that take feature redundancy into account often lead to more compact models than static selection strategies with little loss in accuracy.
Similar selection techniques can also reduce the set of morphological attributes to be predicted jointly with parsing, reducing the size of the output space at no cost in accuracy.
The key to all our findings is that these techniques simultaneously compress model size and/or decrease the search space while increasing the underlying accuracy of tagging and parsing, even surpassing the state of the art in a variety of languages. With respect to the former, we observe empirical speed-ups upwards of 5x. With respect to the latter, we show that the resulting morphosyntactic taggers consistently beat state-of-the-art taggers across a number of languages.
Since morphosyntactic tagging interacts with other tasks such as word segmentation and syntactic parsing, there has been increasing interest in joint models that integrate tagging with these other tasks. This line of work includes joint tagging and word segmentation [Zhang and Clark2008a], joint tagging and named entity recognition [Móra and Vincze2012], joint tagging and parsing [Lee et al.2011, Li et al.2011, Hatori et al.2011, Bohnet and Nivre2012, Bohnet et al.2013], and even joint word segmentation, tagging and parsing [Hatori et al.2012]. These studies often show improved accuracy from joint inference in one or all of the tasks involved.
Feature selection has been a staple of statistical NLP since its beginnings, notably selection via frequency cut-offs in part-of-speech tagging [Ratnaparkhi1996]. Since then, efforts have been made to tie feature selection to model optimization. For instance, mccallum2003 used greedy forward selection with respect to model log-likelihood to select features for named entity recognition. Sparse priors, such as L1 regularization, are a common feature selection technique that trades off feature sparsity against the model's objective [Gao et al.2007]. martins2011 extended such sparse regularization techniques to allow a model to deselect entire feature templates, potentially saving entire blocks of feature extraction computation. However, current systems still tend to employ millions of features without selection, relying primarily on model regularization to combat overfitting. Selection of morphological attributes has been carried out previously in ballesteros2013effective, and selection of features under similar constraints was carried out by bb2014automatic.
The feature selection methods we investigate can all be viewed as greedy forward selection, shown in Figure 1. This paradigm starts from an empty set and considers features one by one. In each iteration, a model is generated from a training set and tested on a development set relative to some accuracy metric of interest. The feature under consideration is added if it increases this metric beyond some threshold and discarded otherwise.
This strategy is similar to the one implemented in MaltOptimizer [Ballesteros and Nivre2014]. It differs from classic forward feature selection [Della Pietra et al.1997] in that it does not test all features in parallel, but instead relies on an ordering of features as input. This is primarily for efficiency, as training models in parallel for a large number of feature templates is cumbersome.
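The greedy forward selection loop of Figure 1 can be sketched as follows. This is an illustrative sketch, not the paper's implementation: `train_and_score` is a hypothetical stand-in for training a model on the 80% training split and evaluating it on the 20% development split, and `order` is the Order() function (a no-op for static ordering).

```python
def greedy_forward_selection(candidates, train_and_score, order, threshold=0.02):
    """Add a feature template only if it improves the development-set
    accuracy by at least `threshold` (0.02 as used in the experiments)."""
    selected = []
    best_score = train_and_score(selected)      # baseline with no templates
    remaining = list(candidates)
    while remaining:
        remaining = order(remaining, selected)  # static: no-op; dynamic: re-rank
        template = remaining.pop(0)             # consider templates one by one
        score = train_and_score(selected + [template])
        if score >= best_score + threshold:     # keep only clear improvements
            selected.append(template)
            best_score = score
    return selected

# A toy scoring function standing in for train-and-evaluate (hypothetical):
def toy_score(templates):
    gains = {"form": 2.0, "suffix3": 1.0}       # only these templates help
    return 90.0 + sum(gains.get(t, 0.0) for t in templates)

static_order = lambda remaining, selected: remaining  # fixed traversal order
print(greedy_forward_selection(["form", "suffix3", "noise"], toy_score, static_order))
# → ['form', 'suffix3']: "noise" fails the 0.02 improvement threshold
```

Because each template is visited once and discarded templates are never revisited, the number of training runs is linear in the number of templates rather than quadratic, which is what makes the procedure tractable.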
The set of features can be defined either as fully instantiated input features, e.g., suffix=ing, or as feature templates, e.g., prefix, suffix, form, etc. Here we always focus on the latter. By reducing the number of feature templates, we are more likely to positively affect the runtime of feature extraction, as many computations can simply be removed.
Our feature selection algorithm assumes a given order of the features to be evaluated against the objective function. One simple strategy is for a human to provide a static ordering of features that is fixed for traversal. This means that we test feature templates in a predefined order and keep those that improve accuracy. Those that do not are discarded and never visited again. In Figure 1, this means that the Order() function is fixed throughout the procedure. In our experiments, this fixed order is the same as in Table 1.
In text categorization, static feature selection based on correlation statistics is a popular technique [Yang and Pedersen1997]. The typical strategy in such offline selectors is to rank each feature by its correlation with the output space and to select the top K features. This strategy is often called max relevance, since it aims to select features based solely on their predictive power.
Unfortunately, the best features selected by these algorithms might not provide the best result [Peng et al.2005]. Redundancy among the features is the primary reason for this, and Peng et al. develop the minimal redundancy maximal relevance (MRMR) technique to address this problem. The MRMR method tries to keep the redundancy minimal among the features. The approach is based on mutual information to compute the relevance of features and the redundancy of a feature in relation to a set of already selected features.
The mutual information of two discrete random variables $X$ and $Y$ is defined as follows:

$$I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}$$
Max relevance selects the feature set $S$ that maximizes the mean mutual information between the feature templates $F_i \in S$ and the output classes $Y$:

$$\max D(S, Y), \qquad D = \frac{1}{|S|} \sum_{F_i \in S} I(F_i; Y)$$
To account for cases when features are highly redundant, and thus would not change much the discriminative power of a classifier, the following criterion can be added to minimize the mutual information between selected feature templates:

$$\min R(S), \qquad R = \frac{1}{|S|^2} \sum_{F_i, F_j \in S} I(F_i; F_j)$$
Minimal redundancy maximal relevance (MRMR) combines both objectives:

$$\max \Phi(D, R), \qquad \Phi = D - R$$
For the greedy feature selection method outlined in Figure 1, we can use the MRMR criterion to define the Order() function. This leads to a dynamic feature selection technique, as we update the order of the candidate features at each iteration, taking into account redundancy amongst the already selected features.
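As an illustration, the MRMR-based Order() function can be sketched as follows. The names are hypothetical, not the paper's implementation: `columns` maps each template to its list of instantiated values over the training data, and `labels` is the aligned list of gold tags; mutual information is estimated from empirical counts.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical I(X;Y) in nats from two aligned lists of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr_order(remaining, selected, columns, labels):
    """Rank remaining templates by relevance I(F;Y) minus mean redundancy
    against the already selected set, following Peng et al. (2005)."""
    def score(template):
        relevance = mutual_information(columns[template], labels)
        if not selected:
            return relevance
        redundancy = sum(mutual_information(columns[template], columns[s])
                         for s in selected) / len(selected)
        return relevance - redundancy
    return sorted(remaining, key=score, reverse=True)
```

With an empty selected set this reduces to plain max relevance; once templates are selected, a template that duplicates one of them has its score driven toward zero and falls to the back of the ordering.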
In this section, we describe the two systems for morphosyntactic tagging we use to compare feature selection techniques.
The first tagger is a standalone SVM tagger, whose training regime is shown in Figure 2. The tagger iterates a fixed number of times (typically twice) over a sentence from left to right (line 2). This iteration allows the final assignment of tags to benefit from tag features on both sides of the target token. For each token of the sentence, the tagger initializes an n-best list and extracts features for the token in question (lines 4–7). In the innermost loop (lines 9–11), the algorithm computes the score for each morphosyntactic tag and inserts a pair consisting of the tag and its score into the n-best list. The algorithm returns a two-dimensional array, where the first dimension contains the tokens and the second dimension contains the sorted lists of tag-score pairs. The tagger is trained online using MIRA [Crammer et al.2006]. When evaluating this system as a standalone tagger, we select the 1-best tag. This can be viewed as a multi-pass version of standard SVM-based tagging [Màrquez and Giménez2004].
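The decoding loop just described can be sketched as follows. This is a simplified illustration under assumed interfaces: `extract_features` and `score` are hypothetical stand-ins for the tagger's feature extraction and trained linear model, and are not the paper's actual API.

```python
def tag_sentence(sentence, tagset, extract_features, score, passes=2, n_best=4):
    """Return, per token, a list of (tag, score) pairs sorted best-first."""
    # running 1-best tags; the second pass can see tags on BOTH sides of a token
    tags = [None] * len(sentence)
    nbest = [[] for _ in sentence]
    for _ in range(passes):                    # iterate over the sentence (line 2)
        for i in range(len(sentence)):         # left to right over tokens
            feats = extract_features(sentence, i, tags)   # lines 4-7
            scored = [(t, score(feats, t)) for t in tagset]  # lines 9-11
            scored.sort(key=lambda ts: ts[1], reverse=True)
            nbest[i] = scored[:n_best]         # keep the sorted n-best list
            tags[i] = scored[0][0]             # update the running 1-best tag
    return nbest
```

The returned structure is exactly the two-dimensional array described in the text: tokens along the first dimension, sorted tag-score pairs along the second.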
The joint tagger-parser follows the design of tacl-bbjn, who augment an arc-standard transition-based dependency parser with the capability to select a part-of-speech tag and/or morphological tag for each input word from an n-best list of tags for that word. The tag selection is carried out when an input word is shifted onto the stack. Only the highest-scoring tags from each n-best list are considered, and only tags whose score is at most a fixed margin below the score of the best tag. In all experiments of this paper, we consider the top 2 tags and set the margin to 0.25. The tagger-parser uses beam search to find the highest-scoring combined tagging and dependency tree. When pruning the beam, it first extracts the 40 highest-scoring distinct dependency trees and then up to 8 variants that differ only with respect to the tagging, a technique that was found by tacl-bbjn to give a good balance between tagging and parsing ambiguity in the beam. The tagger-parser is trained using the same online learning algorithm as the standalone tagger. When evaluating this system as a part-of-speech tagger, we consider only the finally selected tag sequence in the dependency tree output by the parser.
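The tag pruning applied at shift time can be sketched as follows, a minimal illustration using the paper's settings (top 2 tags, margin 0.25); the function name and list format are assumptions, not the system's actual interface.

```python
def prune_tags(nbest_for_word, k=2, margin=0.25):
    """Keep at most k tags, and among those only tags scoring within
    `margin` of the best tag for this word."""
    top = sorted(nbest_for_word, key=lambda ts: ts[1], reverse=True)[:k]
    best_score = top[0][1]
    return [(tag, s) for tag, s in top if best_score - s <= margin]

print(prune_tags([("NN", 1.00), ("VB", 0.90), ("JJ", 0.85)]))
# → [('NN', 1.0), ('VB', 0.9)]: JJ is cut by k=2, VB survives the 0.25 margin
```

Both constraints work together: the cutoff k bounds the branching factor of the shift action, while the margin prevents clearly inferior tags from entering the beam even when fewer than k strong candidates exist.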
To simplify matters, we start by investigating feature selection for part-of-speech taggers, in the context of both standalone and joint systems. The main hypothesis we are testing is whether feature selection techniques are more powerful in joint morphosyntactic systems than in standalone taggers, that is, whether the resulting models are both more compact and more accurate. Additionally, we wish to empirically compare the impact of static versus dynamic feature selection techniques.
We experiment with corpora from five different languages: Chinese, English, German, Hungarian and Russian. For Chinese, we use the Penn Chinese Treebank 5.1 (CTB5), converted with standard head-finding rules and conversion tools and with the same split as in zhang08 (training: 001–815, 1001–1136; development: 886–931, 1148–1151; test: 816–885, 1137–1147). For English, we use the WSJ section of the Penn Treebank, converted with the head-finding rules of yamada03 and the labeling rules of nivre06book (training: sections 02–21; development: 24; test: 23). For German, we use the Tiger Treebank [Brants et al.2002] in the improved dependency conversion by seeker12. For Hungarian, we use the Szeged Dependency Treebank [Farkas et al.2012]. For Russian, we use the SynTagRus Treebank [Boguslavsky et al.2000, Boguslavsky et al.2002].
Table 1 presents the feature templates that we employed in our experiments (second column). The name of the functor indicates the purpose of the feature template. For instance, the functor form defines the word form. The argument specifies the location of the token, for instance, form(w+1) denotes the token to the right of the current token w.
When more than one argument is given, the functor is applied to each defined position and the results are concatenated. Thus, form(w,w+1) expands to form(w)+form(w+1). The functor formlc denotes the form with all letters converted to lowercase, and lem denotes the lemma of a word. The functors suffix1, suffix2, … and prefix1, … denote suffixes and prefixes of length 1, 2, …, 5. The suffix1+uc, … functors concatenate a suffix with a value indicating whether the word is uppercase or lowercase.
The functors pos and mor denote part-of-speech tags and morphological tags, respectively. The tags to the right of the current position are available as features in the second iteration of the standalone tagger as well as in the final resolution stage of the joint system. Character-position patterns denote the i-th character of the word form. Finally, the functor number denotes a sequence of numbers, with optional periods and commas.
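The template expansion described above can be sketched as follows; a minimal illustration in which the `<pad>` symbol for out-of-sentence positions and the exact string layout are assumptions, not the system's internal representation.

```python
def expand(functor, offsets, sentence, w):
    """Apply a functor at each offset relative to the current token w and
    concatenate the results, e.g. form(w,w+1) -> form(w)+form(w+1).
    Here the functor simply reads the word form at each position."""
    parts = []
    for off in offsets:
        i = w + off
        value = sentence[i] if 0 <= i < len(sentence) else "<pad>"
        parts.append(f"{functor}({off:+d})={value}")
    return "+".join(parts)

print(expand("form", [0, 1], ["a", "fast", "tagger"], 1))
# → form(+0)=fast+form(+1)=tagger
```

Dropping a template therefore removes the whole expansion, including all its position lookups, which is why template-level selection translates directly into feature-extraction speed-ups.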
In our experiments, we divide the training corpora into 80% for training and 20% for development, so that in each iteration a model is trained on 80% of the training corpus and tested on the remaining 20%. (There is also a held-out test set for evaluation, which is the standard test set described in Section 5.1.) For feature selection, if the outcome of the newly trained model is better than the best result so far, the feature is added to the feature model; otherwise, it is not. A model has to show an improvement of at least 0.02 in part-of-speech tagging accuracy to count as better. All experiments were carried out on an Intel Xeon 3.4 GHz CPU with 6 cores. Since the feature selection experiments require us to train a large number of parsing and/or tagging models, we needed a training setup that gives a sufficient accuracy level while maintaining a reasonable speed. After some preliminary experiments, we selected a beam size of 8 and 12 training iterations for the feature selection experiments, while the final models are tested with a beam size of 40 and 25 training iterations. The size of the second beam for alternative tag sequences is kept at 8 for all experiments and the threshold at 0.25.
Table 1 (columns under Part-of-Speech) shows the features that the algorithms selected for each language and each system, and Table 2 shows the performance on the development set. We primarily report part-of-speech tagging accuracy (POS), but also report unlabeled (UAS) and labeled (LAS) attachment scores [Buchholz and Marsi2006] to show the effect of improved taggers on parsing quality. Additionally, Table 2 contains the number of features selected (#).
The first conclusion to draw is that the feature selection algorithms work for both standalone and joint systems. The number of features selected is drastically reduced: the dynamic MRMR feature selection technique for the joint system compresses the model by as much as 78%. This implies faster inference (smaller dot products and less feature extraction) and a smaller memory footprint. In general, joint systems compress more than their standalone counterparts, by about 20%, and the dynamic technique tends to achieve slightly higher compression.
The accuracies of the joint tagger-parser are in general superior to those obtained by the standalone tagger, as noted by bohnet12emnlp. In terms of tagging accuracy, static selection works slightly better for Chinese, German and Hungarian, while dynamic MRMR works best for English and Russian (Table 2). Moreover, the standalone tagger selects several feature templates that require iterating over the sentence, such as pos(w+1) and pos(w+2), whereas the feature templates selected by the joint system contain significantly fewer of these features. This shows that a joint system is less reliant on contextual tag features: many ambiguities that previously needed a wider context are resolved by the global contextual information pushed to the tagger via parsing decisions. As a consequence, the preprocessing tagger can be simplified, and we need to conduct only one iteration over the sentence while maintaining the same accuracy level. Interestingly, the dynamic MRMR technique tends to select fewer form features, which have the largest number of realizable values and thus model parameters.
Table 3 compares the performance of our two taggers with two state-of-the-art taggers. Except for English, the joint tagger consistently outperforms the Stanford tagger and MarMot: for Chinese by 0.3, for German by 0.38, for Hungarian by 0.25 and for Russian by 0.75. Table 4 compares the resulting parsing accuracies to state-of-the-art dependency parsers for English and Chinese, showing that the results are in line with or higher than the state of the art.
The joint morphology and syntactic inference requires the selection of morphological attributes (case, number, etc.) and the selection of features to predict the morphological attributes. In past work on joint morphosyntactic parsing, all morphological attributes are predicted jointly with syntactic dependencies [Bohnet et al.2013]. However, this could lead to unnecessary complexity as only a subset of the attributes are likely to influence parsing decisions, and vice versa.
In this section we investigate whether feature selection methods can also be used to reduce the set of morphological attributes that are predicted as part of a joint system. For instance, consider a language with the attributes case, gender, number and animacy, and suppose that the language does not have gender agreement. Then likely only case and number will be useful in a joint system, and the gender and animacy attributes can be predicted independently. This could substantially improve the speed of the joint model, on top of standard feature selection, as the size of the morphosyntactic tag set will be reduced significantly.
We use the data sets listed in subsection 5.1 for the languages that provide morphological annotation, which are German, Hungarian and Russian.
For the selection of morphological attributes (e.g. case, number, tense), we explore a simple method that departs slightly from those in Section 3. In particular, we do not run greedy forward selection. Instead, we compute accuracy improvements for each attribute offline. We then independently select attributes based on these values. Our initial design was a greedy forward attribute selection, but we found experimentally that independent attribute selection worked best.
We run 10-fold cross-validation experiments on the training set, where 90% of the training set is used for training and 10% for testing. Here we simply test for each attribute independently whether its inclusion in a joint morphosyntactic parsing system increases parsing accuracy (LAS/UAS) by a statistically significant amount. If it does, it is included. We applied cross-validation to obtain more reliable results than with the development sets, as some improvements were small; e.g., gender and number in German are within the range of the standard deviation of results on the development set. We use parsing accuracy because we are primarily testing whether a subset of attributes can be used in place of the full set in joint morphosyntactic parsing.
Even though this method only tests an attribute's contribution independently of other attributes, we found experimentally that this was never a problem. For instance, in German, without any morphological attributes, we get a baseline of 89.18 LAS; when we include the attribute case, we get 89.45 LAS; and when we include number, we get 89.32 LAS. When we include both case and number, we get 89.60 LAS.
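The independent attribute selection procedure can be sketched as follows. This is an illustrative sketch, not the paper's code: `las_with_attribute` and `las_baseline` are hypothetical stand-ins for training and evaluating the joint parser on one cross-validation fold, and the t-statistic cutoff is an illustrative approximation of the significance test.

```python
from math import sqrt
from statistics import mean

def select_attributes(attributes, las_with_attribute, las_baseline,
                      folds=10, min_gain=0.1):
    """Include a morphological attribute only if its mean LAS gain over the
    cross-validation folds is at least `min_gain` and statistically
    significant (paired t-test over folds, sketched)."""
    selected = []
    for attr in attributes:
        gains = [las_with_attribute(attr, f) - las_baseline(f)
                 for f in range(folds)]
        m = mean(gains)
        sd = sqrt(sum((g - m) ** 2 for g in gains) / (folds - 1))
        t = m / (sd / sqrt(folds)) if sd > 0 else float("inf")
        # t > 3.25 roughly corresponds to p < 0.005 one-sided at df = 9
        if m >= min_gain and t > 3.25:
            selected.append(attr)
    return selected
```

Each attribute is evaluated against the baseline in isolation, so the number of parser trainings grows linearly with the number of attributes rather than with the number of attribute subsets.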
Table 5 shows which attributes were selected. We include an attribute when the cross-validation experiment shows an improvement of at least 0.1 with a statistical significance of 0.01 or better (indicated in the table by **). Some borderline cases remain such as for Russian passive where we observed an accuracy gain of 0.2 but only a low statistical significance.
Having fixed the set of attributes to be predicted jointly with the parser, we can turn our attention to optimizing the feature sets for morphosyntactic tagging. To this end, we again consider greedy forward selection with the static and dynamic strategies. Table 1 shows the selected features for the different languages, where the grey boxes again mean that the feature was selected. Table 6 shows the performance on the development set. For German, the full template set performs best, but only 0.04 better than static selection, which performs nearly as well while reducing the template set by 68%. For Hungarian, all sets perform similarly, while dynamic selection needs 86% fewer features. The top-performing feature set for Russian is dynamic selection in a joint system, which needs 81% fewer features. We observe again that dynamic selection tends to select fewer feature templates than static selection, but here both the full set of features and the set selected by static selection appear to have better accuracy on average.
The feature selection methods obtain significant speed-ups for the joint system. On the development sets we observed a speedup from 0.015 to 0.003 sec/sentence for Hungarian, from 0.014 to 0.004 sec/sentence for German, and from 0.015 to 0.006 sec/sentence for Russian. This represents a reduction in running time between 50 and 80%.
Table 7 compares our system to other state-of-the-art morphosyntactic parsers. We can see that on average the accuracies of our attribute/feature selection models are competitive with or above the state of the art. The key result is that state-of-the-art accuracy can be achieved with much leaner and faster models.
There are several methodological lessons to learn from this paper. First, feature selection is generally useful, as it leads to fewer features and faster tagging while maintaining state-of-the-art results. Second, feature selection is even more effective for joint tagging-parsing, where it leads to even better results and smaller feature sets. In some cases, the number of feature templates is reduced by up to 80%, with a corresponding reduction in running time. Third, dynamic feature selection strategies [Peng et al.2005] lead to more compact models than static feature selection, without significantly impacting accuracy. Finally, similar methods can be applied to morphological attribute selection, leading to even leaner and faster models.
Miguel Ballesteros is supported by the European Commission under the contract numbers FP7-ICT-610411 (project MULTISENSOR) and H2020-RIA-645012 (project KRISTINA).
TAG, dynamic programming, and the perceptron for efficient, feature-rich parsing. In CoNLL, pages 9–16.
Crammer et al. (2006). Journal of Machine Learning Research, 7:551–585.
Gao et al. (2007). A comparative study of parameter estimation methods for statistical natural language processing. In ACL, volume 45, page 824.
Giménez and Màrquez (2004). SVMTool: A general POS tagger generator based on support vector machines. In LREC.
McCallum and Li (2003). Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 188–191. Association for Computational Linguistics.