This paper introduces a novel model for semantic role labeling that makes use of neural sequence modeling techniques. Our approach is motivated by the observation that complex syntactic structures and related phenomena, such as nested subordinations and nominal predicates, are not handled well by existing models. Our model treats such instances as sub-sequences of lexicalized dependency paths and learns suitable embedding representations. We experimentally demonstrate that such embeddings can improve results over previous state-of-the-art semantic role labelers, and showcase qualitative improvements obtained by our method.
The goal of semantic role labeling (SRL) is to identify and label the arguments of semantic predicates in a sentence according to a set of predefined relations (e.g., “who” did “what” to “whom”). Semantic roles provide a layer of abstraction beyond syntactic dependency relations, such as subject and object, in that the provided labels are insensitive to syntactic alternations and can also be applied to nominal predicates. Previous work has shown that semantic roles are useful for a wide range of natural language processing tasks, with recent applications including statistical machine translation [Aziz et al.2011, Xiong et al.2012], plagiarism detection [Osman et al.2012, Paul and Jamal2015], and multi-document abstractive summarization [Khan et al.2015].
|mate-tools||*He had [trouble] raising [funds].|
|mateplus||*He had [trouble] raising [funds].|
|TensorSRL||*He had trouble raising [funds].|
|easySRL||*He had trouble raising [funds].|
|This work||[He] had trouble raising [funds].|
The task of semantic role labeling (SRL) was pioneered by Gildea and Jurafsky (2002). In their work, features based on syntactic constituent trees were identified as most valuable for labeling predicate-argument relationships. Later work confirmed the importance of syntactic parse features [Pradhan et al.2005, Punyakanok et al.2008] and found that dependency parse trees provide a better form of representation to assign role labels to arguments [Johansson and Nugues2008].
Most semantic role labeling approaches to date rely heavily on lexical and syntactic indicator features. Through the availability of large annotated resources, such as PropBank [Palmer et al.2005], statistical models based on such features achieve high accuracy. However, results often fall short when the input to be labeled involves instances of linguistic phenomena that are relevant for the labeling decision but appear infrequently at training time. Examples include control and raising verbs, nested conjunctions or other recursive structures, as well as rare nominal predicates. The difficulty is that simple lexical and syntactic indicator features are not able to model interactions triggered by such phenomena. For instance, consider the sentence “He had trouble raising funds” and the analyses provided by four publicly available tools in Table 1 (mate-tools, Björkelund et al. 2010; mateplus, Roth and Woodsend 2014; TensorSRL, Lei et al. 2015; and easySRL, Lewis et al. 2015). Although all of these systems claim state-of-the-art or competitive performance, none of them correctly identifies “He” as the agent argument of the predicate “raise”. Given the complex dependency path relation between the predicate and its argument, none of the systems identifies “He” as an argument at all.
In this paper, we develop a new neural network model that can be applied to the task of semantic role labeling. The goal of this model is to better handle control predicates and other phenomena that can be observed from the dependency structure of a sentence. In particular, we aim to model the semantic relationships between a predicate and its arguments by analyzing the dependency path between the predicate word and each argument head word. We consider lexicalized paths, which we decompose into sequences of individual items, namely the words and dependency relations on a path. We then apply long short-term memory networks [Hochreiter and Schmidhuber1997] to find a recurrent composition function that can reconstruct an appropriate representation of the full path from its individual parts (Section 2). To ensure that representations are indicative of semantic relationships, we use semantic roles as target labels in a supervised setting (Section 3).
By modeling dependency paths as sequences of words and dependencies, we implicitly address the data sparsity problem. This is the case because we use single words and individual dependency relations as the basic units of our model. In contrast, previous SRL work only considered full syntactic paths. Experiments on the CoNLL-2009 benchmark dataset show that our model is able to outperform the state-of-the-art in English (Section 4), and that it improves SRL performance in other languages, including Chinese, German and Spanish (Section 5).
In the context of neural networks, the term embedding refers to the output of a function within the network, which transforms an arbitrary input into a real-valued vector output. Word embeddings, for instance, are typically computed by forwarding a one-hot word vector representation from the input layer of a neural network to its first hidden layer, usually by means of matrix multiplication and an optional non-linear function whose parameters are learned during neural network training.
Here, we seek to compute real-valued vector representations for dependency paths between a pair of words. We define a dependency path to be the sequence of nodes (representing words) and edges (representing relations between words) to be traversed on a dependency parse tree to get from one node to the other. In the example in Figure 1, the relevant dependency path runs from raising to he.
Analogously to how word embeddings are computed, the simplest way to embed paths would be to represent each sequence as a one-hot vector. However, this is suboptimal for two reasons: Firstly, we expect only a subset of dependency paths to be attested frequently in our data and therefore many paths will be too sparse to learn reliable embeddings for them. Secondly, we hypothesize that dependency paths which share the same words, word categories or dependency relations should impact SRL decisions in similar ways. Thus, the words and relations on the path should drive representation learning, rather than the full path on its own. The following sections describe how we address representation learning by means of modeling dependency paths as sequences of items in a recurrent neural network.
The recurrent model we use in this work is a variant of the long short-term memory (LSTM) network. It takes a sequence of items as input, recurrently processes one item at a time, and finally returns one embedding state e for the complete input sequence. For each time step t, the LSTM model updates an internal memory state m_t that depends on the current input as well as the previous memory state m_{t-1}. In order to capture long-term dependencies, a so-called gating mechanism controls the extent to which each component of a memory cell state will be modified. In this work, we employ input gates i, output gates o and (optional) forget gates f. We formalize the state of the network at each time step as follows:
In each equation, W describes a matrix of weights to project information between two layers, b is a layer-specific vector of bias terms, and σ is the logistic function. Superscripts indicate the corresponding layers or gates. Some models described in Section 3 do not make use of forget gates or memory-to-gate connections. In case no forget gate is used, we set f_t = 1. If no memory-to-gate connections are used, the terms in square brackets in (1), (2), and (4) are replaced by zeros.
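A formulation consistent with this description can be written as follows. It is a sketch rather than a verbatim copy of the numbered equations: the input vector $x_t$, the element-wise product $\odot$ and the particular superscript names are assumptions, chosen so that equations (1), (2) and (4) carry the bracketed memory-to-gate terms.

```latex
\begin{align}
i_t &= \sigma\big([W^{mi} m_{t-1}] + W^{xi} x_t + b^{i}\big) \tag{1}\\
f_t &= \sigma\big([W^{mf} m_{t-1}] + W^{xf} x_t + b^{f}\big) \tag{2}\\
m_t &= i_t \odot (W^{xm} x_t) + f_t \odot m_{t-1} + b^{m} \tag{3}\\
o_t &= \sigma\big([W^{mo} m_t] + W^{xo} x_t + b^{o}\big) \tag{4}\\
e_t &= o_t \odot \sigma(m_t) \tag{5}
\end{align}
```

Under this reading, dropping the forget gate amounts to fixing $f_t = 1$ in (3), and dropping memory-to-gate connections removes the bracketed terms, so that each gate depends on the current input only.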
We define the embedding of a dependency path to be the final memory output state of a recurrent LSTM layer that takes a path as input, with each input step representing a binary indicator for a part-of-speech tag, a word form, or a dependency relation. In the context of semantic role labeling, we define each path as a sequence from a predicate to its potential argument (footnote 1: we experimented with different sequential orders and found this to lead to the best validation set results). Specifically, we define the first item to correspond to the part-of-speech tag of the predicate word, followed by its actual word form, and the relation to the next word. The embedding of a dependency path corresponds to the state e returned by the LSTM layer after the input of the last item, which corresponds to the word form of the argument head word. An example is shown in Figure 2.
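The decomposition of a path into input items can be sketched in a few lines of Python. The parse below (head indices, relation labels, part-of-speech tags) is a hypothetical analysis of the running example, and the `^`/`v` direction markers and the helper names `ancestors` and `path_items` are assumptions made for illustration; they are not taken from the released system.

```python
from typing import List

# Hypothetical toy parse of "He had trouble raising funds".
# Heads, relation labels and POS tags are assumptions for illustration only.
WORDS = ["He", "had", "trouble", "raising", "funds"]
POS = ["PRP", "VBD", "NN", "VBG", "NNS"]
HEADS = [1, -1, 1, 2, 3]  # index of each word's head, -1 marks the root
RELS = ["SBJ", "ROOT", "OBJ", "NMOD", "OBJ"]


def ancestors(node: int, heads: List[int]) -> List[int]:
    """Return the chain node, head(node), ..., root."""
    chain = [node]
    while heads[chain[-1]] != -1:
        chain.append(heads[chain[-1]])
    return chain


def path_items(pred: int, arg: int) -> List[str]:
    """Decompose the path from predicate to argument into a flat item sequence."""
    up = ancestors(pred, HEADS)
    down = ancestors(arg, HEADS)
    common = next(n for n in up if n in down)  # lowest common ancestor
    # First the predicate's POS tag, then its word form.
    items = [POS[pred], WORDS[pred].lower()]
    # Walk upwards from the predicate to the common ancestor.
    for node in up[:up.index(common)]:
        items += [RELS[node] + "^", WORDS[HEADS[node]].lower()]
    # Walk downwards from the common ancestor to the argument head word.
    for node in reversed(down[:down.index(common)]):
        items += [RELS[node] + "v", WORDS[node].lower()]
    return items


sequence = path_items(pred=3, arg=0)  # from "raising" to "He"
```

Each element of `sequence` would then be mapped to a binary indicator vector and fed to the LSTM one step at a time, so that the last input is the word form of the argument head word.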
The main idea of this model and representation is that word forms, word categories and dependency relations can all influence role labeling decisions. The word category and word form of the predicate first determine which roles are plausible and what kinds of path configurations are to be expected. The relations and words seen on the path can then manipulate these expectations. In Figure 2, for instance, the verb raising complements the phrase had trouble, which makes it likely that the subject he is also the logical subject of raising.
By using word forms, categories and dependency relations as input items, we ensure that specific words (e.g., those which are part of complex predicates) as well as various relation types (e.g., subject and object) can appropriately influence the representation of a path. While learning corresponding interactions, the network is also able to determine which phrases and dependency relations might not influence a role assignment decision (e.g., coordinations).
Our SRL model consists of four components depicted in Figure 3: (1) an LSTM component takes lexicalized dependency paths as input, (2) an additional input layer takes binary features as input, (3) a hidden layer combines dependency path embeddings and binary features using rectified linear units, and (4) a softmax classification layer produces output based on the hidden layer state as input. We therefore learn path embeddings jointly with feature detectors based on traditional, binary indicator features.
Given a dependency path and its individual steps, together with a set of binary features as input, we use the LSTM formalization from equations (1–5) to compute the embedding at the final time step and formalize the state of the hidden layer h and softmax output s for each class category as follows:
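The four components can be illustrated with a small forward pass in plain Python. This is a minimal sketch under simplifying assumptions: bias terms and memory-to-gate connections are omitted, weights are random rather than trained, and all names (`embed_path`, `classify`, the keys of `params`, the layer sizes) are hypothetical.

```python
import math
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def sigmoid(v: Vector) -> Vector:
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def matvec(W: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def mul(a: Vector, b: Vector) -> Vector:
    return [x * y for x, y in zip(a, b)]


def one_hot(index: int, size: int) -> Vector:
    return [1.0 if k == index else 0.0 for k in range(size)]


def embed_path(items: List[Vector], p: Dict[str, Matrix], size: int) -> Vector:
    """LSTM over the path items; returns the state after the last item."""
    m = [0.0] * size
    e = [0.0] * size
    for x in items:
        i = sigmoid(matvec(p["xi"], x))  # input gate
        f = sigmoid(matvec(p["xf"], x))  # forget gate
        m = add(mul(i, matvec(p["xm"], x)), mul(f, m))  # memory update
        o = sigmoid(matvec(p["xo"], x))  # output gate
        e = mul(o, sigmoid(m))
    return e


def classify(items: List[Vector], features: Vector,
             p: Dict[str, Matrix], size: int) -> Vector:
    e = embed_path(items, p, size)
    # Hidden layer: rectified linear units over embedding and binary features.
    pre = add(matvec(p["eh"], e), matvec(p["Bh"], features))
    h = [max(0.0, z) for z in pre]
    # Softmax over class scores (shifted by the maximum for stability).
    scores = matvec(p["hs"], h)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


rng = random.Random(0)


def init(rows: int, cols: int) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


N_ITEMS, N_FEATS, SIZE, N_HIDDEN, N_CLASSES = 6, 4, 3, 5, 3
params = {
    "xi": init(SIZE, N_ITEMS), "xf": init(SIZE, N_ITEMS),
    "xm": init(SIZE, N_ITEMS), "xo": init(SIZE, N_ITEMS),
    "eh": init(N_HIDDEN, SIZE), "Bh": init(N_HIDDEN, N_FEATS),
    "hs": init(N_CLASSES, N_HIDDEN),
}
path = [one_hot(k, N_ITEMS) for k in (0, 3, 5)]  # three path items
probs = classify(path, [1.0, 0.0, 1.0, 0.0], params, SIZE)
```

The predicted label would be the class with the highest value in `probs`. Because the hidden layer sees both the path embedding and the binary features, gradients from the role labels reach the LSTM weights, which is what makes the embeddings task-specific.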
The overall architecture of our SRL system closely follows that of previous work [Toutanova et al.2008, Björkelund et al.2009] and is depicted in Figure 4. We use a pipeline that consists of the following steps: predicate identification and disambiguation, argument identification, argument classification, and re-ranking. The neural-network components introduced in Section 2 are used in the last three steps. The following sub-sections describe all components in more detail.
Given a syntactically analyzed sentence, the first two steps in an end-to-end SRL system are to identify and disambiguate the semantic predicates in the sentence. Here, we focus on verbal and nominal predicates but note that other syntactic categories have also been construed as predicates in the NLP literature (e.g., prepositions; Srikumar and Roth 2013). For both identification and disambiguation steps, we apply the same logistic regression classifiers used in the SRL components of mate-tools [Björkelund et al.2010]. The classifiers for both tasks make use of a range of lexico-syntactic indicator features, including predicate word form, its predicted part-of-speech tag as well as dependency relations to all syntactic children.
Given a sentence and a set of sense-disambiguated predicates in it, the next two steps of our SRL system are to identify all arguments of each predicate and to assign suitable role labels to them. For both steps, we train several LSTM-based neural network models as described in Section 2. In particular, we train separate networks for nominal and verbal predicates and for identification and classification. Following the findings of earlier work [Xue and Palmer2004], we assume that different feature sets are relevant for the respective tasks and hence different embedding representations should be learned. As binary input features, we use the following sets from the SRL literature [Björkelund et al.2010].
Word form and word category of the predicate and candidate argument; dependency relations from predicate and argument to their respective syntactic heads; full dependency path sequence from predicate to argument.
Word forms and word categories of the candidate argument’s and predicate’s syntactic siblings and children words.
Relative position of the candidate argument with respect to the predicate (left, self, right); sequence of part-of-speech tags of all words between the predicate and the argument.
Table 2: Hyperparameters selected for each argument labeling step.
|Argument labeling step||forget gate||memory-to-gate connections||alpha||dropout rate|
As all argument identification (and classification) decisions are independent of one another, we apply a global reranker as the last step of our pipeline. Given a predicate, the reranker takes as input the best sets of identified arguments as well as their best label assignments and predicts the best overall argument structure. We implement the reranker as a logistic regression classifier, with hidden and embedding layer states of identified arguments as features, offset by the argument label, and a binary label as output (1: best predicted structure, 0: any other structure). At test time, we select the structure with the highest overall score, which we compute as the geometric mean of the global regression and all argument-specific scores.
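The test-time selection rule can be sketched as follows. The candidate structures and all score values are made up for illustration, and the function names `overall_score` and `select_structure` are hypothetical; only the use of a geometric mean over the global score and the argument-specific scores comes from the description above.

```python
import math
from typing import List, Sequence, Tuple

# (description, global reranker score, argument-specific scores)
Candidate = Tuple[str, float, List[float]]


def overall_score(global_score: float, argument_scores: Sequence[float]) -> float:
    """Geometric mean of the global score and all argument-specific scores."""
    scores = [global_score, *argument_scores]
    if any(s <= 0.0 for s in scores):
        return 0.0  # a zero probability rules the structure out
    # Averaging in log space avoids underflow for long score lists.
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def select_structure(candidates: Sequence[Candidate]) -> str:
    """Return the description of the highest-scoring candidate structure."""
    return max(candidates, key=lambda c: overall_score(c[1], c[2]))[0]


# Hypothetical candidates for the predicate "raise".
candidates: List[Candidate] = [
    ("A0=He, A1=funds", 0.81, [0.9, 0.9]),
    ("A1=funds", 0.64, [0.8]),
]
best = select_structure(candidates)
```

A geometric mean, unlike a plain product, does not penalize structures merely for containing more arguments, which is one plausible reason for preferring it when candidates differ in size.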
In this section, we demonstrate the usefulness of dependency path embeddings for semantic role labeling. Our hypotheses are that (1) modeling dependency paths as sequences will lead to better representations for the SRL task, thus increasing labeling precision overall, and that (2) embeddings will address the problem of data sparsity, leading to higher recall. To test both hypotheses, we experiment on the in-domain and out-of-domain test sets provided in the CoNLL-2009 shared task [Hajič et al.2009] and compare results of our system, henceforth PathLSTM, with systems that do not involve path embeddings. We compute precision, recall and F-score using the official CoNLL-2009 scorer (footnote 2: some recently proposed SRL models are only evaluated on the CoNLL 2005 and 2012 data sets, which lack nominal predicates or dependency annotations; we do not list any results from those models here). The code is available at https://github.com/microth/PathLSTM.
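As a quick consistency check, the relation between the three measures can be reproduced in a few lines. The sketch assumes that the reported F-score is the balanced harmonic mean of precision and recall (F1); the triples are the PathLSTM precision, recall and F-score values reported in the results tables, and the function name `f_score` is our own.

```python
from typing import List, Tuple


def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (harmonic mean of precision and recall)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (P, R, reported F) for PathLSTM, taken from the results tables.
reported: List[Tuple[float, float, float]] = [
    (88.1, 85.3, 86.7),  # in-domain, without reranker
    (90.3, 85.7, 87.9),  # in-domain, ensemble of 3 models
    (76.9, 73.8, 75.3),  # out-of-domain, without reranker
    (79.7, 73.6, 76.5),  # out-of-domain, ensemble of 3 models
]
recomputed = [round(f_score(p, r), 1) for p, r, _ in reported]
```

The recomputed values agree with the reported F-scores to one decimal place, which supports the F1 reading.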
We train argument identification and classification models using the XLBP toolkit for neural networks [Monner and Reggia2012]. The hyperparameters for each step were selected based on the CoNLL 2009 development set. For direct comparison with previous work, we use the same preprocessing models and predicate-specific SRL components as provided with mate-tools [Bohnet2010, Björkelund et al.2010]. The hyperparameters considered are the learning rate, the dropout rate, and the hidden layer sizes. In addition, we experimented with different gating mechanisms (with/without forget gate) and memory access settings (with/without connections between all gates and the memory layer, cf. Section 2). The best parameters were chosen using the Spearmint hyperparameter optimization toolkit [Snoek et al.2012], applied for approx. 200 iterations, and are summarized in Table 2.
The results of our in- and out-of-domain experiments are summarized in Tables 3 and 5, respectively. We present results for different system configurations: ‘local’ systems make classification decisions independently, whereas ‘global’ systems include a reranker or other global inference mechanisms; ‘single’ refers to one model and ‘ensemble’ refers to combinations of multiple models.
In the in-domain setting, our PathLSTM model achieves 87.7% (single) and 87.9% (ensemble) F-score, outperforming previously published best results by 0.4 and 0.2 percentage points, respectively. At an F-score of 86.7%, our local model (using no reranker) reaches the same performance as state-of-the-art local models. Note that differences in results between systems might originate from the application of different preprocessing techniques as each system comes with its own syntactic components. For direct comparison, we evaluate against mate-tools, which use the same preprocessing techniques as PathLSTM. In comparison, we see improvements of 0.8–1.0 percentage points absolute in F-score.
Table 3: Results on the CoNLL-2009 in-domain test set.
|System (local, single)||P||R||F|
|PathLSTM w/o reranker||88.1||85.3||86.7|
|System (global, single)||P||R||F|
|Roth and Woodsend (2014), results taken from Lei et al. (2015)||86.3|
|System (global, ensemble)||P||R||F|
|FitzGerald et al. 10 models||87.7|
|PathLSTM 3 models||90.3||85.7||87.9|
Table 4: In-domain test results for PathLSTM when specific feature types are omitted.
|PathLSTM||P (%)||R (%)||F (%)|
|w/o path embeddings||65.7||87.3||75.0|
|w/o binary features||73.2||33.3||45.8|
Table 5: Results on the CoNLL-2009 out-of-domain test set.
|System (local, single)||P||R||F|
|PathLSTM w/o reranker||76.9||73.8||75.3|
|System (global, single)||P||R||F|
|System (global, ensemble)||P||R||F|
|FitzGerald et al. 10 models||75.5|
|PathLSTM 3 models||79.7||73.6||76.5|
In the out-of-domain setting, our system achieves new state-of-the-art results of 76.1% (single) and 76.5% (ensemble) F-score, outperforming the previous best system by Roth and Woodsend (2014) by 0.2 and 0.6 absolute points, respectively. In comparison to mate-tools, we observe absolute improvements in F-score of 0.4–0.8 percentage points.
To determine the sources of individual improvements, we test PathLSTM models without specific feature types and directly compare PathLSTM and mate-tools, both of which use the same preprocessing methods. Table 4 presents in-domain test results for our system when specific feature types are omitted. The overall low results indicate that a combination of dependency path embeddings and binary features is required to identify and label arguments with high precision.
Figure 5 shows the effect of dependency path embeddings in mitigating sparsity: if the path between a predicate and its argument has been observed only infrequently or not at all at training time, conventional methods will often fail to assign a role. This is represented by the recall curve of mate-tools, which converges to zero for arguments with unseen paths. The higher recall curve for PathLSTM demonstrates that path embeddings can alleviate this problem to some extent. For unseen paths, we observe that PathLSTM improves over mate-tools by an order of magnitude, from 0.9% to 9.6%. The highest absolute gain, from 12.8% to 24.2% recall, can be observed for dependency paths that occurred between 1 and 10 times during training.
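An analysis of this kind can be sketched by grouping gold arguments according to how often their dependency path was seen in training and computing recall per group. The toy data, the path strings and the bucket for more than 10 occurrences are assumptions for illustration; only the "unseen" and "1 to 10 times" groups are named in the text.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def bucket(count: int) -> str:
    """Map a training frequency to a coarse frequency bucket."""
    if count == 0:
        return "unseen"
    if count <= 10:
        return "1-10"
    return ">10"


def recall_by_frequency(train_paths: Iterable[str],
                        gold: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Recall per bucket; each gold item is (path, was it correctly labeled)."""
    freq = Counter(train_paths)
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for path, correct in gold:
        b = bucket(freq[path])  # Counter returns 0 for unseen paths
        totals[b] += 1
        hits[b] += int(correct)
    return {b: hits[b] / totals[b] for b in totals}


# Hypothetical training paths and system decisions on gold arguments.
train: List[str] = ["SBJv"] * 20 + ["OBJ^ SBJv"] * 3
gold: List[Tuple[str, bool]] = [
    ("SBJv", True), ("SBJv", True),
    ("OBJ^ SBJv", True), ("OBJ^ SBJv", False),
    ("NMOD^ OBJ^ SBJv", False),
]
recall = recall_by_frequency(train, gold)
```

Running the same function on the outputs of two systems would yield the two recall curves that are being compared.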
Figure 7 plots role labeling performance for sentences with varying numbers of words. There are two categories of sentences in which the improvements of PathLSTM are most noticeable. Firstly, it better handles short sentences that contain expletives and/or nominal predicates, as measured by absolute F-score. This is probably due to the fact that our learned dependency path representations are lexicalized, making it possible to model argument structures of different nominals and to distinguish between expletive occurrences of ‘it’ and other subjects. Secondly, it improves absolute F-score on longer sentences. This is mainly due to the handling of dependency paths that involve complex structures, such as coordinations, control verbs and nominal predicates.
We collect instances of different syntactic phenomena from the development set and plot the learned dependency path representations in the embedding space (see Figure 6). We obtain a projection onto two dimensions using t-SNE [Van der Maaten and Hinton2008]. Interestingly, we can see that different syntactic configurations are clustered together in different parts of the space and that most instances of the PropBank roles A0 and A1 are separated. Example phrases in the figure highlight predicate-argument pairs that are correctly labeled by PathLSTM but not by mate-tools. Path embeddings are essential for handling these cases as indicator features do not generalize well enough.
Finally, Table 6 shows results for nominal and verbal predicates as well as for different (gold) role labels. In comparison to mate-tools, we can see that PathLSTM improves precision for all argument types of nominal predicates. For verbal predicates, improvements can be observed in terms of recall of proto-agent (A0) and proto-patient (A1) roles, with slight gains in precision for the A2 role. Overall, PathLSTM does slightly worse with respect to modifier roles, which it labels with higher precision but at the cost of recall.
In this section, we report results from additional experiments on Chinese, German and Spanish data. The underlying question is to what extent the improvements of our SRL system for English also generalize to other languages. To answer this question, we train and test separate SRL models for each language, using the system architecture and hyperparameters discussed in Sections 3 and 4, respectively.
We train our models on data from the CoNLL-2009 shared task, relying on the same features as one of the participating systems [Björkelund et al.2009], and evaluate with the official scorer. For direct comparison, we rely on the (automatic) syntactic preprocessing information provided with the CoNLL test data and compare our results with the best two systems for each language that make use of the same preprocessing information.
The results, summarized in Table 7, indicate that PathLSTM performs better than the system by Björkelund et al. (2009) in all cases. For German and Chinese, PathLSTM achieves the best overall F-scores of 80.1% and 79.4%, respectively.
Table 6: Results for nominal and verbal predicates and for different (gold) role labels, in comparison to mate-tools.
|Predicate & Role Label||PathLSTM P (%)||PathLSTM R (%)||over mate-tools P (%)||over mate-tools R (%)|
|verb / A0||90.8||89.2||0.4||1.8|
|verb / A1||91.0||91.9||0.0||1.1|
|verb / A2||84.3||76.9||1.5||0.0|
|verb / AM||82.2||72.4||2.9||2.0|
|noun / A0||86.9||78.2||0.8||3.3|
|noun / A1||87.5||84.4||2.6||2.2|
|noun / A2||82.4||76.8||1.0||2.1|
|noun / AM||79.5||69.2||0.9||2.8|
Collobert et al. (2011) pioneered neural networks for the task of semantic role labeling. They developed a feed-forward network that uses a convolution function over windows of words to assign SRL labels. Apart from constituency boundaries, their system does not make use of any syntactic information. Foland and Martin (2015) extended their model and showcased significant improvements when including binary indicator features for dependency paths. Similar features were used by FitzGerald et al. (2015), who include role labeling predictions by neural networks as factors in a global model.
These approaches all make use of binary features derived from syntactic parses either to indicate constituency boundaries or to represent full dependency paths. An extreme alternative has recently been proposed by Zhou and Xu (2015), who model SRL decisions with a multi-layered LSTM network that takes word sequences as input but no syntactic parse information at all.
Our approach falls in between the two extremes: we rely on syntactic parse information, but rather than solely making use of sparse binary features, we explicitly model dependency paths in a neural network architecture.
Within the SRL literature, recent alternatives to neural network architectures include sigmoid belief networks [Henderson et al.2013] as well as low-rank tensor models [Lei et al.2015]. Whereas Lei et al. only make use of dependency paths as binary indicator features, Henderson et al. propose a joint model for syntactic and semantic parsing that learns and applies incremental dependency path representations to perform SRL decisions. The latter form of representation is closest to ours; however, we do not build syntactic parses incrementally. Instead, we take syntactically preprocessed text as input and focus on the SRL task only.
Apart from more powerful models, most recent progress in SRL can be attributed to novel features. For instance, Deschacht and Moens (2009) and Huang and Yates (2010) use latent variables, learned with a hidden Markov model, as features for representing words and word sequences. Zapirain et al. (2013) propose different selectional preference models in order to deal with the sparseness of lexical features. Roth and Woodsend (2014) address the same problem with word embeddings and compositions thereof. Roth and Lapata (2015) recently introduced features that model the influence of discourse on role labeling decisions.
Rather than coming up with completely new features, in this work we proposed to revisit some well-known features and represent them in a novel way that generalizes better. Our proposed model is inspired both by the necessity to overcome the problems of sparse lexico-syntactic features and by the recent success of SRL models based on neural networks.
The idea of embedding dependency structures has previously been applied to tasks such as relation classification and sentiment analysis. Xu et al. (2015) and Liu et al. (2015) use neural networks to embed dependency paths between entity pairs. To identify the relation that holds between two entities, their approaches make use of pooling layers that detect parts of a path that indicate a specific relation. In contrast, our work aims at modeling an individual path as a complete sequence, in which every item is of relevance. Tai et al. (2015) and Ma et al. (2015) learn embeddings of dependency structures representing full sentences in a sentiment classification task. In our model, embeddings are learned jointly with other features, and as a result problems that may result from erroneous parse trees are mitigated.
We introduced a neural network architecture for semantic role labeling that jointly learns embeddings for dependency paths and feature combinations. Our experimental results indicate that our model substantially increases classification performance, leading to new state-of-the-art results. In a qualitative analysis, we found that our model is able to cover instances of various linguistic phenomena that are missed by other methods.
Beyond SRL, we expect dependency path embeddings to be useful in related tasks and downstream applications. For instance, our representations may be of direct benefit for semantic and discourse parsing tasks. The jointly learned feature space also makes our model a good starting point for cross-lingual transfer methods that rely on feature representation projection to induce new models [Kozhevnikov and Titov2014].
We thank the three anonymous ACL referees whose feedback helped to substantially improve the present paper. The support of the Deutsche Forschungsgemeinschaft (Research Fellowship RO 4848/1-1; Roth) and the European Research Council (award number 681760; Lapata) is gratefully acknowledged.
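Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12:2493–2537.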
William Foland and James Martin. 2015. Dependency-based semantic role labeling using convolutional neural networks. In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics, pages 279–288, Denver, Colorado.
Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, et al. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, pages 1–18, Boulder, Colorado.