Feature engineering is widely recognized as an important component of robust NLP systems, with much of this engineering done by hand. Articles describing improvements in task performance over prior work tend to be methodologically driven (for example low-regret online learning algorithms and structured regularizers), with improvements in feature design often described just briefly, and as a matter of secondary importance. While the distinctions between methods of inference are formalized in the language of mathematics, most expositions of feature design employ terse, natural language descriptions, often not sufficient for reliable reproduction of the underlying factors being extracted. This has led to stagnation in feature design, and in general an attitude in some circles that features themselves are not worth exploring; i.e., we should abandon explicit, interpretable features for neural techniques which create their own representations which may not align with our own.
Feature sets are constructed by authors using heuristics which are often not tested. For example, it is common to coarsen a feature before using it in a product because the fine-grained product would produce “too many” features. The author may have been correct (they ran the experiment and verified that performance went down) or not, but the reader often doesn’t know which is the case, and is left with the same problem of whether to run that experiment or not. Due to the cost of running experiments, practitioners are biased towards copying the feature set verbatim.
This work is about removing the human from the loop of feature selection, focussing on Semantic Role Labeling (SRL)[Gildea and Jurafsky2002]. The key challenge that we address is feature generation. Previous work has generated features by taking the Cartesian product of templates, but this is not rich enough to capture many widely used manually created features. We show that by decomposing the template even further, into atoms called featlets, we can automatically compose templates with rich, ad-hoc combinators. This process can generate many features which an expert might not consider.
Once we have tackled the feature generation problem, we show that we can automatically derive feature sets which match the performance of state-of-the-art feature sets created by experts. Our method uses basic statistics and requires no human expertise. We believe that models specified using featlets are easier to reproduce and offer the potential for performing feature selection with machine learning rather than domain expert knowledge, potentially at lower cost and with super-human performance.
Feature Descriptions in the Literature
For a case study on feature descriptions, consider the “voice feature” for SRL. It was first motivated and described in gildea2002automatic. They said that they defined 10 rules for when a verb had either active or passive voice, but never said what they were. Since then, almost every prominent paper on SRL has listed voice as a feature or template that they use, but none of the following defined their rules for the voice feature: gildea2002automatic, xue2004calibrating, pradhan2005support, Toutanova:2005, Johansson:2008, Marquez:2008, punyakanok2008importance, DasFramesCL:2014. Further, discrepancies between authors are not unheard of: gildea2002automatic report 5% of verbs were passive in the PTB, while pradhan2005support report 11%. Some of these papers go into great detail about other aspects like ILP constraints and regularization constants, but this same clarity doesn’t always extend to features.
In methods papers, math is used as a bridge between natural and programming languages. There is no equivalent for describing features, so this type of omission is understandable given space constraints and the clumsiness of natural language. However, given the importance of the underlying factors in a model, the lack of clarity diminishes the value of the work to other practitioners, especially among those less linguistically-inclined.
Our approach begins with the notion that features can be decomposed into smaller units called featlets. These units can be composed together to make a wide variety of features. We distinguish featlets from feature templates, or just templates, which are effectively sets of features. Featlets are not necessarily features, but are composed to produce features or feature templates.
To start with an example, consider the featlet Word: given the index of a token, it returns the word at that position. A feature would not assume a token index is given, only that the input $x$ and the label $y$ are given, so Word is not a feature. Featlets are also used to provide information to other featlets. For example, the featlet ArgHead takes the head token index of an argument span and passes it to Word. The combination of the two, [ArgHead, Word], is a template. Importantly, featlets are interchangeable: the template [ArgHead, Word] is related to [ArgHead, Pos] and [ArgHeadParent, Word]. Featlets are minimal to ensure that the trial and error of feature engineering falls to the machine rather than the expert.
Featlets are operations performed on a context, which is a data structure with:
Named fields which have types
A list of featlets which have been applied
An output buffer of features
In our implementation the data fields are:
token1 and token2: integers
span1 and span2: pairs of integers (start, end)
value: an untyped object
sequence: an untyped list
Each of the fields in a context starts out as a special value Nil. Once they are set, other featlets can read from these fields and put a feature into the output buffer. If a Nil field is read, then the featlet fails and no features are output.
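The context described above can be sketched as a small data structure. This is only an illustrative sketch, not the authors' implementation: the names `Context`, `FeatletFailure`, `run`, `arg_head`, and `word`, the toy sentence, and the hard-coded head index are all assumptions made for the example, and `None` stands in for Nil.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional, Tuple

NIL = None  # stands in for the special value Nil


class FeatletFailure(Exception):
    """Raised when a featlet reads a field that is still Nil."""


@dataclass
class Context:
    # Named, typed data fields; every field starts out as Nil.
    token1: Optional[int] = NIL
    token2: Optional[int] = NIL
    span1: Optional[Tuple[int, int]] = NIL
    span2: Optional[Tuple[int, int]] = NIL
    value: Any = NIL
    sequence: Optional[List[Any]] = NIL
    # Names of the featlets applied so far, and the output buffer of features.
    applied: List[str] = field(default_factory=list)
    output: List[str] = field(default_factory=list)

    def read(self, name: str) -> Any:
        """Return the value of a field, failing if it is still Nil."""
        val = getattr(self, name)
        if val is NIL:
            raise FeatletFailure(f"{name} is Nil")
        return val


def run(featlets: List[Callable[[Context], None]], ctx: Context) -> List[str]:
    """Apply featlets in order; if any featlet fails, no features are output."""
    try:
        for featlet in featlets:
            featlet(ctx)
            ctx.applied.append(featlet.__name__)
    except FeatletFailure:
        return []
    return ctx.output


SENTENCE = ["The", "cat", "sat"]  # hypothetical toy sentence


def arg_head(ctx: Context) -> None:
    # ArgHead: sets token1 (the head index is hard-coded here for illustration).
    ctx.token1 = 1


def word(ctx: Context) -> None:
    # Word: reads token1 and puts a feature into the output buffer.
    ctx.output.append("Word=" + SENTENCE[ctx.read("token1")])
```

Under these assumptions, `run([arg_head, word], Context())` yields one feature, while `run([word], Context())` yields none because `token1` is still Nil when Word reads it.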
This group of featlets is responsible for reading an aspect of the label and putting it into the context. These are the only task-specific featlets which the inference algorithm has to be aware of.
TargetSpan: sets span2 to a target
TargetHead: sets token2 to the head of the target span
ArgSpan: sets span1 to an argument span
ArgHead: sets token1 to the head of an argument span
Role: sets value to a role
FrameRole: sets value to the concatenation of a frame and a role
Frame: sets value to a frame
These read from token1 and output a feature.
Word, Pos, Lemma
WnSynset: reads the lemma and POS tag at token1, looks up the first WordNet sense, and puts its synset id onto value.
BrownClust: looks up an unsupervised hierarchical word cluster id for token1 (one featlet each for the 256-cluster and 1000-cluster outputs of liang2005semi)
DepRel: outputs the syntactic dependency edge label connecting token1 to its parent.
DepthD: computes the depth of token1 in a dependency tree
Before moving on, it will be helpful to slightly redefine the behavior of token extractors: instead of immediately outputting a string, they will store that string in the value field and leave it to the Output featlet to finish the job of outputting a feature. By convention, we will assume that every string of featlets ends in a (possibly implicit) Output, so the old meaning of “token extractors output a feature” is true as long as the token extractor featlet is not followed by anything.
In many cases though, we will want normalized or simplified versions of other features. For example, we could want to find the shape of a word, “McDonalds” → “CcCcccccc”, or perhaps just take the first few characters, “NNP” → “N”. Value mutators read a string from value, compute a new string, and store it back to value. This enables features like [ArgHead, Word, Shape] or [TargetHead, Pos, Prefix1].
LC: if value is a string, output its lowercase
Shape: if value is a string, output its shape
PrefixN: sets value to a prefix of length N
An interesting special case of value mutators are ones which filter. A featlet like ClosedClass can be applied after Word but before Output in order to have a feature only fire for closed class words. This is achieved by ClosedClass writing Nil to value so that Output fails and no features are output. This selective firing is valuable because it can lead to expressive feature semantics (e.g. “only output the first word in a span if it is in a closed class”).
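Value mutators and filters can be sketched as functions from one value to another, with `None` playing the role of Nil. This is a minimal sketch under stated assumptions: the function names, the closed-class word list, and the choice to leave non-letter characters unchanged in `shape` are all hypothetical and not taken from the paper.

```python
from typing import Callable, List, Optional

Mutator = Callable[[Optional[str]], Optional[str]]


def shape(value: Optional[str]) -> Optional[str]:
    """Shape: map uppercase letters to 'C' and lowercase letters to 'c'."""
    if value is None:
        return None
    return "".join(
        "C" if ch.isupper() else "c" if ch.islower() else ch for ch in value
    )


def prefix(n: int) -> Mutator:
    """PrefixN: keep only the first n characters of value."""
    def mutate(value: Optional[str]) -> Optional[str]:
        return None if value is None else value[:n]
    return mutate


# Hypothetical toy list; a real closed-class lexicon would be much larger.
CLOSED_CLASS = {"the", "a", "of", "in", "to", "was", "be"}


def closed_class(value: Optional[str]) -> Optional[str]:
    """ClosedClass: write Nil unless value is a closed-class word."""
    if value is None or value.lower() not in CLOSED_CLASS:
        return None
    return value


def output(name: str, value: Optional[str], mutators: List[Mutator]) -> List[str]:
    """Apply mutators in order, then Output; a Nil value yields no feature."""
    for mutate in mutators:
        value = mutate(value)
    return [] if value is None else [f"{name}={value}"]
```

With these definitions, `output("Word", "cat", [closed_class])` produces no feature, while `output("Word", "The", [closed_class])` fires, which is the selective behaviour the filter is meant to provide.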
Syntax lets us jump around in a sentence where structural proximity is often a more informative measure of relevance than linear proximity. Dependency walkers provide one way of jumping around by reading and writing token1. (Every time we list a Left featlet, we have omitted its Right equivalent for space.) These can be composed as well to form walks, e.g. “grandparent” = [ParentD, ParentD].
ParentD: sets token1 to its parent in the dependency tree.
LeftChildD: sets token1 to its left-most child in the dependency tree
LeftSibD: sets token1 to its next-to-the-left sibling in the dependency tree
LeftMostSibD: sets token1 to its left-most sibling in the dependency tree
Some information is contained in the name of a dependency walker (e.g. ParentD), other information is contained in the edge crossed (e.g. whether the parent is nsubj or dobj). To capture this information, dependency walkers also append the edge that they crossed into the sequence field. This side information can be read out later by other featlets.
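A dependency walker that both moves token1 and records the crossed edge might look like the following sketch. The toy parse, the array encoding of the tree, and the names `parent_d` and `walk` are assumptions for illustration; edges are stored as (parent, child, deprel) tuples, and `None` stands in for Nil.

```python
from typing import Callable, List, Optional, Tuple

# Toy dependency parse for "The cat sat" (hypothetical data):
# HEADS[i] is the index of the parent of token i, with -1 marking the root.
HEADS = [1, 2, -1]
DEPRELS = ["det", "nsubj", "root"]

Edge = Tuple[int, int, str]  # (parent, child, deprel)
Walker = Callable[[Optional[int], List[Edge]], Optional[int]]


def parent_d(token1: Optional[int], sequence: List[Edge]) -> Optional[int]:
    """ParentD: move token1 to its parent and append the crossed edge.

    Returns None (Nil) if token1 is unset or has no parent.
    """
    if token1 is None or HEADS[token1] < 0:
        return None
    parent = HEADS[token1]
    sequence.append((parent, token1, DEPRELS[token1]))
    return parent


def walk(token1: Optional[int], walkers: List[Walker]):
    """Compose walkers; stop as soon as one of them fails."""
    sequence: List[Edge] = []
    for walker in walkers:
        token1 = walker(token1, sequence)
        if token1 is None:
            break
    return token1, sequence
```

Here `walk(0, [parent_d, parent_d])` plays the role of the “grandparent” walk, ending at the root token and leaving the two crossed edges in the sequence for later featlets to read.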
Values appended to sequence are converted into features with sequence reducers.
N-grams: reads n-grams from sequence, outputs each. If value is set, prefixes every n-gram with value. Clears sequence when done.
Bag: special case of n-grams when n=1
SeqN: if sequence is no longer than N, output items concatenated (preserves order). Also will prepend value if set. Clears sequence regardless of length.
CompressRuns: Collapses X Y Y Z Z Z to X Y+ Z+, no output, doesn’t clear sequence
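The sequence reducers listed above can be sketched as follows. This is an illustration rather than the authors' code: the separator characters (`_` between items, `:` after the value prefix), the function names, and the choice to output nothing for an empty sequence are assumptions.

```python
from itertools import groupby
from typing import List, Optional


def ngrams(sequence: List[str], n: int, value: Optional[str] = None) -> List[str]:
    """Output every n-gram, prefixed with value if set, then clear sequence."""
    prefix = "" if value is None else f"{value}:"
    feats = [
        prefix + "_".join(sequence[i:i + n])
        for i in range(len(sequence) - n + 1)
    ]
    sequence.clear()
    return feats


def bag(sequence: List[str], value: Optional[str] = None) -> List[str]:
    """Bag: the special case of n-grams with n = 1."""
    return ngrams(sequence, 1, value)


def seq_n(sequence: List[str], max_len: int, value: Optional[str] = None) -> List[str]:
    """SeqN: output the whole sequence if it is no longer than max_len.

    The sequence is cleared regardless of its length.
    """
    feats: List[str] = []
    if sequence and len(sequence) <= max_len:
        prefix = "" if value is None else f"{value}:"
        feats = [prefix + "_".join(sequence)]
    sequence.clear()
    return feats


def compress_runs(sequence: List[str]) -> None:
    """CompressRuns: collapse repeated items in place; no output, no clearing."""
    collapsed = []
    for item, run in groupby(sequence):
        count = sum(1 for _ in run)
        collapsed.append(item + "+" if count > 1 else item)
    sequence[:] = collapsed
```

For example, applying `compress_runs` to `["X", "Y", "Y", "Z", "Z", "Z"]` rewrites the list to `["X", "Y+", "Z+"]`, after which a reducer such as `seq_n` can turn it into a feature.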
Going back to dependency walkers for a moment, the edge that they append to sequence need not be a string like nsubj which sequence reducers can operate on. Edges are represented as tuples of (parent, child, deprel), and we add featlets which choose a string to represent each edge, called dependency representers. (Sequence reducers fail when attempting to operate on non-string values in sequence, so a representer must be called first.) We construct an edge-to-string function by taking the Cartesian product of the token extractors to represent the parent and the set of functions EdgeDirectionD (left or right), EdgeLabelD (taken from the dependency parse, e.g. nsubj), and NoEdgeD (a constant string: “*”), and make a featlet to map each of these functions over sequence.
There is an equivalent class of featlets to the dependency walkers which instead operate on span1, span2, and a constituency tree, called constituent walkers. Each of these operations fails if the span it reads is not a constituent.
ParentC: sets span1 to its parent
LeftChildC: sets span1 to the left-most child node
LeftSibC: sets span1 to the next-to-the-left sibling node
LeftMostSibC: sets span1 to the left-most sibling node
Constituent walkers differ from dependency walkers in the values that they append to sequence: they store grammar rules like S → NP VP. Analogously to dependency representers, the constituency representers are CategoryC (the left-hand side of a rule) and SubCategoryC (the left- and right-hand sides of a rule).
Operations longer than one step typically require that a start and endpoint are known to avoid meaningless walks. Tree walkers use both dependency and constituency parses, but take shortest path walks between two endpoints, adding edges or rules to sequence.
ToRootD: walks from token1 to root
CommonParentD: walks from token1 to a common parent and then token2
ToRootC: walks from span1 to root
CommonParentC: like CommonParentD for constituency trees
ChildrenD: walks the children of token1, left to right
ChildrenC: walks the children of span1, left to right
If syntactic trees are not available or accurate, linear walkers can provide another source of relevant information. These featlets append token indices to sequence for multi-step walks, but behave like dependency walkers otherwise, mutating a field such as token1.
LeftL: moves token1 one position to the left if possible
Span1StartToEndL: walks tokens in span1
Span1LeftToRightL: like Span1StartToEndL but expands two tokens in either direction
Span1ToSpan2L: adds any tokens between the two spans to sequence
Distance can be informative, but it is usually not clear how to represent its scale. Featlets let us address which distances to measure separately from the units used to measure them. Distance functions put a number into the value field:
SeqLength: the number of elements in sequence (this can be used to measure a variety of distances using linear walkers, like ArgSpan width or the distance between ArgHead and TargetHead)
DeltaDepthD: if values in sequence are dependency nodes or token indices, put the depth of the first minus the second into value
DeltaDepthC: like DeltaDepthD for constituency nodes.
Once a number has been put into value, distance representers write a string representation suitable for a feature back to value.
DasBuckets: encodes the bucket widths defined in DasFramesCL:2014
Direction: writes +1 or -1 based on the sign of the number given.
2.1 Finding Legal Templates
We can figure out which strings of featlets constitute a template (most sequences don’t make sense, like [ArgSpan, Shape]) by brute force search with a few heuristics. We have rules which filter out strings of featlets, like:
nothing can come before a token extractor
if applying a featlet fails or doesn’t change the context, stop there (this and all suffixes are invalid)
We do a breadth first search over all strings of featlets up to length 6 and collect all strings which are templates: those that produce output on at least 2 of 50 instances, producing 5241 templates.
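The breadth first search can be sketched on a toy featlet inventory. This is an assumption-laden illustration: the state is reduced to a (token1, value) pair, only the “fails or doesn’t change the context” pruning rule is implemented, a string is pruned once it is dead on every instance, and a string counts as a template when its final value is non-Nil on at least `min_hits` instances. The featlet functions and instance format are hypothetical.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Tuple

State = Tuple[Optional[int], Optional[str]]  # (token1, value); None means Nil
Featlet = Callable[[State, dict], Optional[State]]  # returns None on failure


def arg_head(state: State, inst: dict) -> Optional[State]:
    return (inst["arg_head"], state[1])


def word(state: State, inst: dict) -> Optional[State]:
    tok, _ = state
    return None if tok is None else (tok, inst["words"][tok])


def lc(state: State, inst: dict) -> Optional[State]:
    tok, val = state
    return None if val is None else (tok, val.lower())


FEATLETS: Dict[str, Featlet] = {"ArgHead": arg_head, "Word": word, "LC": lc}


def find_templates(featlets, instances, max_len, min_hits):
    """Breadth first search over featlet strings up to length max_len."""
    start: List[Optional[State]] = [(None, None)] * len(instances)
    queue = deque([((), start)])
    templates = []
    while queue:
        names, states = queue.popleft()
        if len(names) == max_len:
            continue
        for name, featlet in featlets.items():
            new_states = []
            for st, inst in zip(states, instances):
                nxt = featlet(st, inst) if st is not None else None
                # A failure or an unchanged context kills this instance.
                new_states.append(None if nxt is None or nxt == st else nxt)
            if all(s is None for s in new_states):
                continue  # prune this string and every extension of it
            string = names + (name,)
            hits = sum(1 for s in new_states if s is not None and s[1] is not None)
            if hits >= min_hits:
                templates.append(string)
            queue.append((string, new_states))
    return templates
```

On two toy instances whose argument heads are capitalized words, a search with `max_len=3` and `min_hits=2` keeps [ArgHead, Word] and [ArgHead, Word, LC] and discards strings such as [Word] or [ArgHead, ArgHead].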
At this point note that since featlets are functions from one context to another, they are closed under function composition.
2.2 Frequency-based Template Transforms
For every template found, we produce 5 additional templates by appending the following featlets: Top10, Top100, Top1000, Cnt8, Cnt16. The TopN featlet transforms a template by sorting its features by frequency, and only letting the template fire for values with a count at least as high as the N-th most common feature. CntC only lets through features observed at least C times in the training data.
These automatic transforms are useful for building products, since they control the number of features created. In our experiments, we found that a TopN template transform appeared in a little less than 50% of our final features and a CntC transform appeared in a little less than 10%.
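The two frequency-based transforms can be sketched as filters built from training counts. The function names and the toy training values are assumptions; ties at the N-th rank are all let through, following the “count at least as high as” wording.

```python
from collections import Counter
from typing import Callable, Iterable, Optional

Filter = Callable[[str], Optional[str]]


def top_n(train_values: Iterable[str], n: int) -> Filter:
    """TopN: pass values whose count is at least that of the n-th most common."""
    counts = Counter(train_values)
    ranked = sorted(counts.values(), reverse=True)
    # With fewer than n distinct values, every observed value passes.
    threshold = ranked[n - 1] if len(ranked) >= n else 1
    return lambda value: value if counts[value] >= threshold else None


def cnt(train_values: Iterable[str], c: int) -> Filter:
    """CntC: pass values observed at least c times in the training data."""
    counts = Counter(train_values)
    return lambda value: value if counts[value] >= c else None
```

Both filters return `None` (Nil) for rejected values, so a template ending in one of them simply does not fire on rare values, which is what keeps products of templates from exploding in size.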
2.3 Template Composition
To grow features larger than we can discover with brute force enumeration of featlet strings, we consider products of templates. It is common to represent this by string concatenation, but we define template products to be the same as featlet concatenation. This has one important difference: a template may return no value, in which case the rest of the product returns no value. With string concatenation you can represent one template making another more specific by including more information. With featlet composition you can have a template modulate when another can fire. This is weaker than general featlet composition though, and order doesn’t matter.
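The “no value propagates” behaviour of template products can be sketched as below. Templates are modelled as functions from an instance to a string or `None`; the two toy templates, the `*` separator, and the use of sorting to make the product order-insensitive are assumptions for illustration only.

```python
from typing import Callable, Optional

Template = Callable[[dict], Optional[str]]


def head_word(inst: dict) -> Optional[str]:
    # Hypothetical template that always fires.
    return "Word=" + inst["word"]


def closed_class_pos(inst: dict) -> Optional[str]:
    # Hypothetical filtering template: fires only for determiners and prepositions.
    if inst["pos"] in {"DT", "IN"}:
        return "Pos=" + inst["pos"]
    return None


def product(*templates: Template) -> Template:
    """Product of templates: if any component has no value, neither does the product."""
    def combined(inst: dict) -> Optional[str]:
        parts = []
        for template in templates:
            value = template(inst)
            if value is None:
                return None
            parts.append(value)
        # Sorting the parts makes the product independent of template order.
        return "*".join(sorted(parts))
    return combined
```

In this sketch the filtering template modulates when the other can fire: the product yields a feature for `{"word": "the", "pos": "DT"}` but nothing for `{"word": "cat", "pos": "NN"}`.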
2.4 Near Duplicate Removal
We will generate pairs of similar and possibly redundant templates, e.g. [ArgHead, LeftSibD, Word] and [ArgHead, LeftMostSibD, Word]. Principled approaches like conditional mutual information could be used to filter, but this would require a lot of computation. It is faster to use the type level (featlet/template names) rather than the token level (values extracted on instance data). We can construct similarity functions for each of the levels of structure we’ve produced.
similarity between two featlets is the normalized Levenshtein edit distance between their names (operations have unit cost and we divide by the length of the longer string), so LeftSibD is similar to LeftMostSibD but not ParentC.
similarity between two templates (strings of featlets) is again the normalized Levenshtein edit distance, but over an alphabet of featlets, where the substitution cost is inversely related to the previous similarity function. (We convert edit distance to similarity by […] with […].)
similarity between two features is the max-weight bipartite matching of templates, where the weight is inversely related to the previous similarity function. We don’t use edit distance here since order doesn’t matter.
We use this last similarity function to prune ranked lists of features produced in §3. (We consider two features redundant if the normalized max-weight matching is greater than 0.75. To normalize we divide by the shorter length, in templates, of the two features.)
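The first two similarity levels can be sketched with a Levenshtein routine that accepts a custom substitution cost. The conversion from normalized distance to similarity used here, one minus the normalized distance, is an assumption made for the example and is not confirmed by the text; the max-weight bipartite matching over templates is not shown.

```python
from typing import Callable, Sequence


def edit_distance(a: Sequence, b: Sequence,
                  sub_cost: Callable = lambda x, y: 0.0 if x == y else 1.0) -> float:
    """Levenshtein distance with unit insert/delete cost and a pluggable substitution cost."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, x in enumerate(a, 1):
        cur = [float(i)]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1.0,                 # delete x
                           cur[j - 1] + 1.0,              # insert y
                           prev[j - 1] + sub_cost(x, y)))  # substitute
        prev = cur
    return prev[-1]


def normalized_distance(a: Sequence, b: Sequence, sub_cost=None) -> float:
    """Edit distance divided by the length of the longer sequence."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    dist = edit_distance(a, b) if sub_cost is None else edit_distance(a, b, sub_cost)
    return dist / longest


def featlet_similarity(a: str, b: str) -> float:
    # Assumed conversion: similarity = 1 - normalized distance over characters.
    return 1.0 - normalized_distance(a, b)


def template_similarity(t1: Sequence[str], t2: Sequence[str]) -> float:
    # Substituting similar featlets is cheap; dissimilar featlets cost close to 1.
    cost = lambda x, y: 1.0 - featlet_similarity(x, y)
    return 1.0 - normalized_distance(t1, t2, cost)
```

Under this assumed conversion, LeftSibD and LeftMostSibD come out much closer than LeftSibD and ParentC, and [ArgHead, LeftSibD, Word] scores as a near duplicate of [ArgHead, LeftMostSibD, Word].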
3 Feature Selection
Feature generation can lead to too many features to fit in memory. Some of the features we generate may provide no signal towards a label we are trying to predict. To filter down to a manageable set of informative features, we score each template using mutual information (sometimes referred to as information gain), between a label (Bernoulli) and a template (Multinomial). Mutual information has a natural connection to Bayesian independence testing [Minka2003]. Since computing mutual information is just counting, this task is embarrassingly parallel and can be easily implemented in frameworks like MapReduce.
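Since scoring is just counting, the plug-in estimate of mutual information between a label and a template's value can be sketched in a few lines. The function name and input format (a list of (label, feature value) pairs) are assumptions; this is the simple maximum likelihood estimate, in nats, without any bias correction.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Tuple


def mutual_information(pairs: Iterable[Tuple[Hashable, Hashable]]) -> float:
    """Plug-in mutual information (in nats) from (label, feature value) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    label_counts = Counter(y for y, _ in pairs)
    value_counts = Counter(x for _, x in pairs)
    mi = 0.0
    for (y, x), c in joint.items():
        # p(y, x) * log( p(y, x) / (p(y) * p(x)) ), written in terms of counts.
        mi += (c / n) * math.log(c * n / (label_counts[y] * value_counts[x]))
    return mi
```

Because the three counters are sums over instances, they can be accumulated independently on shards of the data and merged afterwards, which is what makes the computation easy to parallelize.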
We select a budget of how many features to search over, and divide that budget up amongst template products up to a maximum order, such that features of each order get a fixed proportion of the budget. In these experiments, our settings meant that the split between features was [21%, 32%, 47%]. (These are really maximum proportions, filled up from lowest order to highest order, with extra slots rolled over to the remaining slices proportional to the remaining weights.)
For each split, features were ranked by the max of a heuristic score for each of its templates. Each template’s heuristic score was its mutual information plus a Gaussian with mean 0 and standard deviation 2. Randomness was introduced for diversity and so that templates which are more useful as filters (and have low mutual information by themselves) have some chance of being picked.
Entropy and mutual information estimation break down when the cardinality of the variables is large compared to the number of observations [Paninski2003]. We observed that entropy estimates based on maximum likelihood estimates of the underlying distributions from counts yielded very biased estimates of mutual information (high for sparse features). We correct for this problem by using the BUB entropy estimation method described by paninski2003estimation.
We produce a final ranked list of features by sorting by a score that combines mutual information with the entropy of the feature through a complexity parameter, and then applying the greedy pruning described in §2.4. In one limit of this parameter the score is mutual information, and in the other it is normalized mutual information. In our experiments, most features had between one and nine nats of entropy, as shown in figure 1, and we created feature sets out of
In our experiments we use semantic role labeling (SRL) as a case study to test whether our automatic methods can derive feature sets which work as well as hand engineered ones. SRL is a difficult prediction task with more than one structural formulation (type of label). Sometimes arguments are represented by their head token in a dependency tree [Surdeanu et al.2008, Hajič et al.2009] and sometimes they are specified by a span or constituent [Carreras and Màrquez2005, Baker et al.1998]. For span-based SRL, the correspondence between the argument spans and syntactic constituents can be very tight [Kingsbury and Palmer2002] or not [Ruppenhofer et al.2006]. Sometimes the role labels depend on the predicate sense [Ruppenhofer et al.2006] and sometimes they don’t [Kingsbury and Palmer2002]. These differences indicate that there may not be one “SRL feature set” which works best, or even well, for all variants of the task.
We used FrameNet 1.5 [Ruppenhofer et al.2006] and the CoNLL 2012 data derived from the OntoNotes 5.0 corpus [Pradhan et al.2013]. We used the argument candidate selection method described in xue2004calibrating as well as the extensions in acl2014frames. Annotations are provided from the Stanford CoreNLP toolset [Manning et al.2014]. Feature selection is run first on each data set to produce a few feature sets based on the complexity parameter and size, then we evaluate their performance using an averaged perceptron [Freund and Schapire1999, Collins2002]. We ran the algorithm for up to 10 passes over the data, shuffling before each pass, and selected the averaged weights which yielded the best F1 on the dev set. (For FrameNet data we took a random 10% slice of the training data as the dev set.)
Most work on SRL breaks the problem into (at least) two stages: argument identification and role classification. The argument identification stage classifies whether a span is a semantic argument to a particular target, and then the classification stage chooses the role for each span which was selected by the identification stage. We adopt this standard architecture for efficiency: if there are $N$ spans and $K$ roles, it turns an $O(NK)$ decoding problem into an $O(N+K)$ decoding problem.
Given this stage-based architecture, we split our budget, half going to each stage. For both we define the label to be a Bernoulli variable which is one on sub-structures which appear in the gold parse. For arg id the instances are spans for a particular target, and for role classification the instances are roles for the span chosen in the previous stage. During training we use gold arg id to train the role classifier. For arg id we only score features which contain an ArgHead or ArgSpan featlet, and for role classification we additionally require that a RoleArg or FrameRoleArg featlet appear.
Overall, our method seems to work about as well as experts manually designing features for SRL. Results in table 1 show our approach matching the performance of Das:2012 and pradhan2013towards. Other systems achieve better performance, but these models all use global information, an orthogonal issue to the local feature set.
In looking at the feature sets generated, one major difference is the complexity parameter. For FrameNet, the best value of the complexity parameter was 10, meaning that the features with the highest normalized mutual information were chosen, whereas with Propbank it was better to ignore the entropy of the feature. This makes sense in retrospect when you consider the size of the training sets (Propbank is about 20 times larger), but it’s not clear how much data is needed to justify this shift when tuning by hand.
This difference in selection criteria does lead to very different feature sets being chosen, but it is another question whether this matters for system performance. (There are actually only two templates in common between the best FrameNet and Propbank feature sets. Both contain the CommonParentD featlet.) It could be that there are many different types of feature sets which lead to good performance on either task/dataset, and only one is needed (possibly created manually). In table 2 we show the effects of generating a feature set for one dataset and applying it to the other. The performance on the diagonal is considerably higher, indicating empirically that there likely isn’t one “SRL feature set”. If you weight both equally, the average increase in error due to domain shift is 7.9%. This is even the case for feature selection with FrameNet, where you might expect that selecting features on Propbank, a much larger resource, could yield gains because of much lower variance without much bias.
Sensitivity by Stage
Given that we can automatically generate feature sets, we can easily determine how adding or removing features from each stage will affect performance. This is useful for choosing a feature set which balances the cost of prediction time with performance, which is labor intensive and error prone when done manually. Table 3 shows that the model is more sensitive to the removal of argument id features than role classification ones. This is not a new result [Màrquez et al.2008], but this work offers a way to respond by applying computational rather than human resources to the problem.
One limitation of the methods described here is finding symmetries. The product operator for templates is commutative, but this is not the case for featlets. Some templates are equivalent and there is no easy way to check short of checking their denotations, which is expensive.
Another issue is that a lot of features are required. The best models we trained for FrameNet use over 2500 features, which is significantly more than Das:2012, which used 34. Upon manual inspection of the feature sets we learned, we find most if not all of the features that Das:2012 created, but precision is low. (It is not trivial to match our templates to theirs. For example, the template [TargetHead, LeftL, Top10] * [TargetHead, ChildSequenceD, SeqMapDepRel, Bag] is likely close to the passive voice template used in Das:2012, since “was” and “be” are in the top 10 words to the left of a verb.)
7 Related Work
Recently there has been a swell in interest in neural methods in NLP which use continuous representations rather than discrete feature weights. This work shares some motivation with neural methods, e.g. the desire to avoid domain expert-derived features, but we diverge primarily for computational reasons. This work is about model generation and scoring, and it is not clear how to score neural models in ways that don’t involve re-training a model. Feature based models are amenable to decomposition and information theoretic analysis in ways that neural models aren’t.
Within feature based methods, backwards selection methods are common, including a large body of work on sparsity-inducing regularization, the canonical being the lasso [Tibshirani1996]. These methods are applicable when the entire feature set can be enumerated and scored on one machine. This is not feasible for this work, since we generate features which lie in a combinatorial space too large to fit in memory. For example, our method found the feature [ TargetHead, Right, WnSynset, ArgSpan, Span1Start, LeftL, Word, Top10, Span1ToSpan2L, SeqMapPos, Bag ], which is comprised of 11 featlets. (Technically it is 13 featlets since an Output featlet is not written after the WnSynset and Top10 featlets, §2.)
The alternative is forwards selection methods, which work by building bigger features from smaller ones. Bjorkelund:2009 used forwards selection for dependency-based SRL [Hajič et al.2009] for 7 languages based on products of templates. Their experiments showed a great diversity of the features learned for different languages and they placed second in the shared task. McCallum:2002 used forwards selection for named entity recognition, scoring new feature products using approximate model re-fitting (pseudo-likelihood), which also produced good results. In both of these works, scoring new features depended on the output of a smaller feature set. Sequential methods like this are not amenable to parallelization and take quadratic time with respect to the number of features to be searched over.
Our method is more similar to the work of gormley-etal:2014:SRL where every template is scored in parallel irrespective of a trained model.
This work dovetails with the approach described by Lee:2007, which derives a prior or regularization constant for individual features by looking at properties of the feature (meta features). This work generates features with a lot of structure, which the learner could reflect upon to improve regularization and generalization.
The structure in these features can also inform parameterization. Tensor decomposition methods of fixed-order tensors have been used to great effect[Lei et al.2014, Lei et al.2015]. Low-rank or embedding methods (e.g. RNNs) for parameterizing featlet strings, as opposed to storing a weight in a dictionary, could also improve regularization.
Step-wise methods which select some features, fit a model, and then select more features with respect to mutual information with residuals, are another simple and promising direction.
In this work we propose a general framework for generating feature sets with the goal of removing expert engineering from the machine learning loop. Our approach is based on composing units called featlets to create templates. Featlets are small functions which are task agnostic and easy to define and implement by non-experts. Featlets on one hand preserve a wide variety of nuanced feature semantics, and on the other can be enumerated automatically to derive a huge amount of novel templates and features. We validate our approach on semantic role labeling and achieve performance on par with models that had considerable expert intervention.
- [Baker et al.1998] Collin F Baker, Charles J Fillmore, and John B Lowe. 1998. The berkeley framenet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume 1. ACL.
- [Björkelund et al.2009] Anders Björkelund, Love Hafdell, and Pierre Nugues. 2009. Multilingual semantic role labeling. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, CoNLL ’09, pages 43–48, Stroudsburg, PA, USA. Association for Computational Linguistics.
- [Carreras and Màrquez2005] Xavier Carreras and Lluís Màrquez. 2005. Introduction to the conll-2005 shared task: Semantic role labeling. In Proceedings of the Ninth Conference on Computational Natural Language Learning, pages 152–164. Association for Computational Linguistics.
- [Collins2002] Michael Collins. 2002. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10, pages 1–8. Association for Computational Linguistics.
- [Das et al.2012] Dipanjan Das, André F. T. Martins, and Noah A. Smith. 2012. An exact dual decomposition algorithm for shallow semantic parsing with constraints. In SemEval, SemEval ’12. Association for Computational Linguistics.
- [Das et al.2014] Dipanjan Das, Desai Chen, André F. T. Martins, Nathan Schneider, and Noah A. Smith. 2014. Frame-semantic parsing. Computational Linguistics, 40:1:9–56.
- [FitzGerald et al.2015] Nicholas FitzGerald, Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. 2015. Semantic role labelling with neural network factors. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisboa, Portugal, September. Association for Computational Linguistics.
- [Freund and Schapire1999] Yoav Freund and Robert E Schapire. 1999. Large margin classification using the perceptron algorithm. Machine learning, 37(3):277–296.
- [Gildea and Jurafsky2002] Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational linguistics, 28(3).
- [Gormley et al.2014] Matthew R. Gormley, Margaret Mitchell, Benjamin Van Durme, and Mark Dredze. 2014. Low-resource semantic role labeling. In Proceedings of ACL, June.
- [Hajič et al.2009] Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, et al. 2009. The conll-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–18. Association for Computational Linguistics.
- [Hermann et al.2014] Karl Moritz Hermann, Dipanjan Das, Jason Weston, and Kuzman Ganchev. 2014. Semantic frame identification with distributed word representations. In Proceedings of ACL. Association for Computational Linguistics.
- [Johansson and Nugues2008] Richard Johansson and Pierre Nugues. 2008. Dependency-based semantic role labeling of propbank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP ’08, pages 69–78, Stroudsburg, PA, USA. Association for Computational Linguistics.
- [Kingsbury and Palmer2002] Paul Kingsbury and Martha Palmer. 2002. From treebank to propbank. In LREC. Citeseer.
- [Lee et al.2007] Su-In Lee, Vassil Chatalbashev, David Vickrey, and Daphne Koller. 2007. Learning a meta-level prior for feature relevance from multiple related tasks. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pages 489–496, New York, NY, USA. ACM.
- [Lei et al.2014] Tao Lei, Yu Xin, Yuan Zhang, Regina Barzilay, and Tommi Jaakkola. 2014. Low-rank tensors for scoring dependency structures. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1381–1391, Baltimore, Maryland, June. Association for Computational Linguistics.
- [Lei et al.2015] Tao Lei, Yuan Zhang, Lluís Màrquez, Alessandro Moschitti, and Regina Barzilay. 2015. High-order low-rank tensors for semantic role labeling. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1150–1160, Denver, Colorado, May–June. Association for Computational Linguistics.
- [Liang2005] Percy Liang. 2005. Semi-supervised learning for natural language. Master’s thesis, Massachusetts Institute of Technology.
- [Manning et al.2014] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.
- [Màrquez et al.2008] Lluís Màrquez, Xavier Carreras, Kenneth C. Litkowski, and Suzanne Stevenson. 2008. Semantic role labeling: An introduction to the special issue. Computational Linguistics, 34(2):145–159, June.
- [McCallum2003] Andrew McCallum. 2003. Efficiently inducing features of conditional random fields. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence, UAI’03, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
- [Minka2003] Tom Minka. 2003. Bayesian inference, entropy, and the multinomial distribution. Online tutorial.
- [Paninski2003] Liam Paninski. 2003. Estimation of entropy and mutual information. Neural computation, 15(6):1191–1253.
- [Pradhan et al.2005] Sameer Pradhan, Kadri Hacioglu, Valerie Krugler, Wayne Ward, James H Martin, and Daniel Jurafsky. 2005. Support vector learning for semantic argument classification. Machine Learning, 60(1-3):11–39.
- [Pradhan et al.2013] Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using ontonotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning. Sofia, Bulgaria: Association for Computational Linguistics, pages 143–52.
- [Punyakanok et al.2008] Vasin Punyakanok, Dan Roth, and Wen-tau Yih. 2008. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics, 34(2):257–287.
- [Ruppenhofer et al.2006] Josef Ruppenhofer, Michael Ellsworth, Miriam RL Petruck, Christopher R Johnson, and Jan Scheffczyk. 2006. Framenet ii: Extended theory and practice.
- [Surdeanu et al.2008] Mihai Surdeanu, Richard Johansson, Adam Meyers, Lluís Màrquez, and Joakim Nivre. 2008. The conll-2008 shared task on joint parsing of syntactic and semantic dependencies. In Proceedings of the Twelfth Conference on Computational Natural Language Learning, pages 159–177. Association for Computational Linguistics.
- [Täckström et al.2015] Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. 2015. Efficient inference and structured learning for semantic role labeling. Transactions of the Association for Computational Linguistics, 3.
- [Tibshirani1996] Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288.
- [Toutanova et al.2005] Kristina Toutanova, Aria Haghighi, and Christopher D. Manning. 2005. Joint learning improves semantic role labeling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05. Association for Computational Linguistics.
- [Xue and Palmer2004] Nianwen Xue and Martha Palmer. 2004. Calibrating features for semantic role labeling. In EMNLP, pages 88–94.
- [Zhou and Xu2015] Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1127–1137.