Semantic Role Labeling with Supertags
We introduce a simple and accurate neural model for dependency-based semantic role labeling. Our model predicts predicate-argument dependencies relying on states of a bidirectional LSTM encoder. The semantic role labeler achieves competitive performance on English, even without any kind of syntactic information and only using local inference. However, when automatically predicted part-of-speech tags are provided as input, it substantially outperforms all previous local models and approaches the best reported results on the English CoNLL-2009 dataset. We also consider Chinese, Czech and Spanish, where our approach likewise achieves competitive results. Syntactic parsers are unreliable on out-of-domain data, so standard (i.e., syntactically-informed) SRL models are hindered when tested in this setting. Our syntax-agnostic model appears more robust, resulting in the best reported results on standard out-of-domain test sets.
The task of semantic role labeling (SRL), pioneered by Gildea and Jurafsky (2002), involves the prediction of predicate argument structure, i.e., both identification of arguments as well as their assignment to an underlying semantic role. These representations have been shown to be beneficial in many NLP applications, including question answering (Shen and Lapata, 2007) and information extraction (Christensen et al., 2011). Semantic banks (e.g., PropBank (Palmer et al., 2005)) often represent arguments as syntactic constituents or, more generally, text spans (Baker et al., 1998). In contrast, CoNLL-2008 and 2009 shared tasks (Surdeanu et al., 2008; Hajic et al., 2009) popularized dependency-based semantic role labeling where the goal is to identify syntactic heads of arguments rather than entire constituents. Figure 1 shows an example of such a dependency-based representation: node labels are senses of predicates (e.g., “01” indicates that the first sense from the PropBank sense repository is used for predicate makes in this sentence) and edge labels are semantic roles (e.g., A0 is a proto-agent, ‘doer’).
Until recently, state-of-the-art SRL systems relied on complex sets of lexico-syntactic features (Pradhan et al., 2005) as well as declarative constraints (Punyakanok et al., 2008; Roth and Yih, 2005). Neural SRL models instead exploited feature induction capabilities of neural networks, largely eliminating the need for complex hand-crafted features. Initially achieving state-of-the-art results only in the multilingual setting, where careful feature engineering is not practical (Gesmundo et al., 2009; Titov et al., 2009), neural SRL models now also outperform their traditional counterparts on standard benchmarks for English (FitzGerald et al., 2015; Roth and Lapata, 2016; Swayamdipta et al., 2016; Foland and Martin, 2015).
Recently, it has been shown that an accurate span-based SRL model can be constructed without relying on syntactic features (Zhou and Xu, 2015). Nevertheless, the situation with dependency-based SRL has not changed: even recent state-of-the-art methods for this task rely heavily on syntactic features (Roth and Lapata, 2016; FitzGerald et al., 2015; Lei et al., 2015; Roth and Woodsend, 2014; Swayamdipta et al., 2016). In particular, Roth and Lapata (2016) argue that syntactic features are necessary for dependency-based SRL and show that the performance of their model degrades dramatically if syntactic paths between arguments and predicates are not provided as input. In this work, we are the first to show that it is possible to construct a very accurate dependency-based semantic role labeler which either does not use any kind of syntactic information or uses very little (automatically predicted part-of-speech tags). This suggests that our LSTM model can implicitly capture much of the relevant syntactic information, and that this information can, to a large extent, substitute for treebank syntax.
Similarly to the span-based model of Zhou and Xu (2015), we use bidirectional LSTMs to encode sentences and rely on their states when predicting arguments of each predicate. (In the CoNLL-2009 benchmark, predicates do not need to be identified: their positions are provided as input at test time. Consequently, as is standard for dependency SRL, we ignore this subtask in further discussion.) We predict semantic dependency edges between predicates and arguments relying on LSTM states corresponding to the predicate and the argument positions (i.e., both edge endpoints). As semantic roles are often specific to predicates or even predicate senses (e.g., in PropBank (Palmer et al., 2005)), instead of predicting the role label (e.g., A0 for Sequa in our example), we predict predicate-specific roles (e.g., make-A0) using a compositional model. Both these aspects (predicting edges and compositional embeddings of roles) contrast our approach with that of Zhou and Xu (2015), who essentially treat the SRL task as a generic sequence labeling task. We empirically show that using these two ideas is crucial for achieving competitive performance on dependency SRL (+1.0% semantic F1 in our ablation studies on English). Also, unlike in the span-based setting, we observe that using automatically predicted POS tags is important (+0.7% F1).
The resulting SRL model is very simple. Not only do we not rely on syntax, our model is also local, i.e., we do not globally score or constrain sets of arguments. On the standard English in-domain CoNLL-2009 benchmark we achieve an F1 score (87.6%, Table 2) which compares favorably to the best local model (86.7% F1 for PathLSTM (Roth and Lapata, 2016)) and approaches the best results overall (obtained by an ensemble of 3 PathLSTM models with a reranker on top). When we experiment with the Chinese, Czech and Spanish portions of the CoNLL-2009 dataset, we also achieve competitive results, even without any extra hyperparameter tuning.
Moreover, as syntactic parsers are not reliable when used out-of-domain, standard (i.e., syntactically-informed) dependency SRL models are crippled when applied to such data. In contrast, our syntax-agnostic model appears to be considerably more robust: we achieve the best results so far on the English and Czech out-of-domain test sets (77.7% and 87.2% F1, respectively). For English, this improves over the comparable previous model (the local PathLSTM) and substantially outperforms any previous method (76.5% for the ensemble of 3 PathLSTMs). We believe that out-of-domain performance may in fact be more important than in-domain performance: in practice, linguistic tools are rarely, if ever, used in-domain.
The key contributions can be summarized as follows:
we propose the first effective syntax-agnostic model for dependency-based SRL;
it achieves the best results among local models on the English, Chinese and Czech in-domain test sets;
it substantially outperforms all previous methods on the out-of-domain test set on both English and Czech.
Despite the effectiveness of our syntax-agnostic version, we believe that both integration of treebank syntax and global inference are promising directions and leave them for future work. In fact, the proposed SRL model, given its simplicity and efficiency, may be used as a natural building block for future global and syntactically-informed SRL models. The code is available at https://github.com/diegma/neural-dep-srl.
The focus of this paper is on argument identification and labeling, as these are the steps which have been previously believed to require syntactic information. For the predicate disambiguation subtask we use models from previous work.
In order to identify and classify arguments, we propose a model composed of three components:
a word representation component that builds a vector representation for each word in a sentence;
a bidirectional LSTM (BiLSTM) encoder which takes the word representations as input and provides a dynamic representation of each word and its context in the sentence;
a classifier which takes as an input the BiLSTM representation of the candidate argument and the BiLSTM representation of the predicate to predict the role associated to the candidate argument.
We represent each word $w$ as the concatenation of four vectors: a randomly initialized word embedding $x^{re}$, a pre-trained word embedding $x^{pe}$, a randomly initialized part-of-speech tag embedding $x^{pos}$ and a randomly initialized lemma embedding $x^{le}$ that is only active if the word is one of the predicates. The randomly initialized embeddings $x^{re}$, $x^{pos}$ and $x^{le}$ are fine-tuned during training, while the pre-trained ones are kept fixed, as in Dyer et al. (2015). The final word representation is given by $x = x^{re} \circ x^{pe} \circ x^{pos} \circ x^{le}$, where $\circ$ represents the concatenation operator.
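The concatenation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `x_re`, `x_pe`, `x_pos`, `x_le`, the toy dimensions and the choice to realize an "inactive" lemma embedding as a zero vector are ours, not details confirmed by the released code.

```python
import random


def random_vector(dim, rng):
    """Small random vector standing in for a learned embedding."""
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]


def word_representation(x_re, x_pe, x_pos, x_le, is_predicate):
    """Concatenate the four embeddings of one word.

    Assumption: the lemma embedding is replaced by zeros when the word
    is not a predicate, which is one way to make it 'inactive'.
    """
    lemma_part = x_le if is_predicate else [0.0] * len(x_le)
    return x_re + x_pe + x_pos + lemma_part


rng = random.Random(0)
D_W, D_POS, D_L = 4, 2, 3  # toy sizes, not the values used in the paper

x_re = random_vector(D_W, rng)     # randomly initialized word embedding
x_pe = random_vector(D_W, rng)     # pre-trained word embedding (kept fixed)
x_pos = random_vector(D_POS, rng)  # part-of-speech tag embedding
x_le = random_vector(D_L, rng)     # lemma embedding, used for predicates only

x_arg = word_representation(x_re, x_pe, x_pos, x_le, is_predicate=False)
x_pred = word_representation(x_re, x_pe, x_pos, x_le, is_predicate=True)
print(len(x_arg), len(x_pred))
```

Both vectors have the same length, so predicate and non-predicate words can be fed to the same encoder.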
One of the most effective ways to model sequences is to use recurrent neural networks (RNNs) (Elman, 1990), more precisely their gated versions, for example, Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997).
Formally, we can define an LSTM as a function $\mathrm{LSTM}_\theta(x_{1:i})$ that takes as input the sequence $x_{1:i}$ and returns a hidden state $h_i$. This state can be regarded as a representation of the sentence from the start to the position $i$, or, in other words, it encodes the word at position $i$ along with its left context. Bidirectional LSTMs make use of two LSTMs: one for the forward pass, and another for the backward pass, $\mathrm{LSTM}_F$ and $\mathrm{LSTM}_B$, respectively. In this way the concatenation of forward and backward LSTM states encodes both left and right contexts of a word, $\mathrm{BiLSTM}(x_{1:n}, i) = \mathrm{LSTM}_F(x_{1:i}) \circ \mathrm{LSTM}_B(x_{n:i})$. In this work we stack $k$ layers of bidirectional LSTMs, where each layer takes the output of the layer below as its input.
As we will show in the ablation studies in Section 3, encoding a sentence with a bidirectional LSTM in one shot and using it to predict the entire semantic dependency graph does not result in competitive SRL performance. Instead, similarly to Zhou and Xu (2015), we produce predicate-specific encodings of a sentence and use them to predict arguments of the corresponding predicate. This contrasts with most other applications of LSTM encoders (for example, in syntactic parsing Kiperwasser and Goldberg (2016); Cross and Huang (2016) or machine translation Sutskever et al. (2014)), where sentences are typically encoded once and then used to predict the entire structured output (e.g., a syntactic tree or a target sentence).
Specifically, when identifying arguments of a given predicate, we add a predicate-specific feature to the representation of each word in the sentence by concatenating a binary flag to the word representation of Section 2.1. The flag is set to 1 for the word corresponding to the currently considered predicate and to 0 otherwise. As a consequence, sentences with more than one predicate are re-encoded by the bidirectional LSTMs multiple times, once per predicate.
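As a rough sketch of this predicate-specific input construction, the snippet below appends the binary flag to every word vector and builds one flagged copy of the sentence per predicate. The function names and the toy two-dimensional word vectors are hypothetical; the encoder that would consume these inputs is not shown.

```python
def add_predicate_flag(word_vectors, predicate_index):
    """Return a copy of the sentence in which every word vector is extended
    with a binary flag: 1.0 at the predicate position, 0.0 elsewhere."""
    return [
        vec + [1.0 if i == predicate_index else 0.0]
        for i, vec in enumerate(word_vectors)
    ]


def predicate_specific_inputs(word_vectors, predicate_indices):
    """Build one flagged copy of the sentence per predicate, so that the
    encoder is run once for each predicate in the sentence."""
    return {p: add_predicate_flag(word_vectors, p) for p in predicate_indices}


# Toy sentence of four words with 2-dimensional word representations.
sentence = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
inputs = predicate_specific_inputs(sentence, predicate_indices=[1, 3])

for predicate, flagged in inputs.items():
    print(predicate, [vec[-1] for vec in flagged])
```

A sentence with two predicates therefore yields two encoder inputs that differ only in the position of the flag.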
Our goal is to predict and label arguments for a given predicate. This can be accomplished by labeling each word in a sentence with a role, including the special ‘NULL’ role to indicate that it is not an argument of the predicate. We start with explaining the basic role classifier and then discuss two extensions, which we will later show to be crucial for achieving competitive performance.
The basic role classifier takes the hidden state of the top-layer bidirectional LSTM corresponding to the considered word at position $i$. Though we experimented with multilayer perceptrons, we obtained the best results with a simple log-linear model:
$$p(r \mid v_i, p) \propto \exp(W_r v_i),$$
where $v_i$ is the hidden state calculated by $\mathrm{BiLSTM}(x_{1:n}, i)$, $p$ refers to the predicate and the symbol $\propto$ signifies proportionality. This is essentially equivalent to the approach used in Zhou and Xu (2015) for span-based SRL. (Since they considered span-based SRL, they used BIO encoding (Ramshaw and Marcus, 1995) and ensured the consistency of B, I and O labels with a first-order Markov CRF. For dependency SRL, both BIO encoding and the first-order Markov CRF would be useless.)
Since the context of a predicate in the sentence is highly informative for deciding if a word is its argument and for choosing its semantic role, we provide the predicate's hidden state ($v_p$) as another input to the classifier (as in Figure 2):
$$p(r \mid v_i, v_p) \propto \exp(W_r (v_i \circ v_p)),$$
where, as before, $\circ$ denotes concatenation. Note that we are effectively predicting an edge between words $i$ and $p$ in the sentence, so it is quite natural to exploit hidden states corresponding to both endpoints. (We abuse the notation and use $p$ to refer both to the predicate word and to its position in the sentence.)
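To make the edge classifier concrete, here is a minimal sketch of a log-linear role classifier over the concatenated argument and predicate states, normalized with a softmax. The role inventory, the weight values and the two-dimensional states are toy assumptions chosen for illustration; in the model these states would come from the top BiLSTM layer.

```python
import math


def softmax(scores):
    """Normalize unnormalized log-scores into a probability distribution."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def role_distribution(W, v_i, v_p):
    """Log-linear classifier: each role score is the dot product of that
    role's weight row with the concatenation of the candidate argument
    state v_i and the predicate state v_p."""
    features = v_i + v_p  # list concatenation
    scores = [sum(w * f for w, f in zip(row, features)) for row in W]
    return softmax(scores)


ROLES = ["NULL", "A0", "A1"]  # toy role inventory; NULL means 'not an argument'
W = [
    [0.0, 0.0, 0.0, 0.0],   # weights for NULL
    [1.0, -1.0, 0.5, 0.0],  # weights for A0
    [-1.0, 1.0, 0.0, 0.5],  # weights for A1
]
v_i = [0.9, 0.1]  # toy state of the candidate argument
v_p = [0.2, 0.4]  # toy state of the predicate

probs = role_distribution(W, v_i, v_p)
predicted = ROLES[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

Dropping `v_p` from `features` (and the matching weight columns) gives the basic classifier that looks only at the argument state.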
Since we use predicate information within the classifier, it may seem that predicate-specific sentence encoding (Section 2.3) is not needed anymore. Moreover, predicting dependency edges relying on LSTM states of endpoints was shown to be effective in the context of syntactic dependency parsing without any form of re-encoding (Kiperwasser and Goldberg, 2016). Nevertheless, in our ablation studies we observed that forgoing predicate-specific encoding results in a large performance degradation (-6.2% F1 on English). Though this dramatic drop in performance may seem surprising, the nature of semantic dependencies, especially for nominal predicates, is different from that of general syntactic dependencies, with many arguments being far away from their predicates. Relations of these arguments to the predicate may be hard to encode with this simpler mechanism.
The two ways of encoding predicate information, using predicate-specific encoding and incorporating the predicate state in the classifier, turn out to be complementary.
Instead of using a matrix $W_r$, we found it beneficial to jointly embed the role $r$ and the predicate lemma $l$ using a non-linear transformation:
$$p(r \mid v_i, v_p, l) \propto \exp(W_{l,r} (v_i \circ v_p)),$$
$$W_{l,r} = \mathrm{ReLU}(U (u_l \circ v_r)),$$
where $\mathrm{ReLU}$ is the rectilinear activation function, $U$ is a parameter matrix, whereas $u_l$ and $v_r$ are randomly initialized embeddings of predicate lemmas and roles. In this way each role prediction is predicate-specific, and at the same time we expect to learn a good representation for roles associated with infrequent predicates. This form of compositional embedding is similar to the one used in FitzGerald et al. (2015).
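The following sketch illustrates the compositional idea: a role-and-lemma-specific weight vector is obtained by applying a ReLU to a linear map of the concatenated lemma and role embeddings, and is then used to score the concatenated argument and predicate states. The symbol names (`U`, `u_l`, `v_r`) and all numeric values are illustrative assumptions.

```python
def relu(vec):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in vec]


def matvec(matrix, vec):
    """Plain matrix-vector product for lists of lists."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def role_weight_vector(U, u_l, v_r):
    """Compose a predicate-specific role weight vector from the lemma
    embedding u_l and the role embedding v_r via ReLU(U [u_l ; v_r])."""
    return relu(matvec(U, u_l + v_r))


def role_score(U, u_l, v_r, v_i, v_p):
    """Unnormalized log-score of a role for the edge (argument i, predicate p)."""
    w = role_weight_vector(U, u_l, v_r)
    return sum(a * b for a, b in zip(w, v_i + v_p))


# Toy sizes: 1-dimensional lemma, role and state vectors.
U = [[1.0, 1.0],
     [1.0, -1.0]]  # maps [u_l ; v_r] to the size of [v_i ; v_p]
u_l = [0.5]        # embedding of the predicate lemma (e.g. 'make')
v_r = [1.0]        # embedding of the role (e.g. 'A0')
v_i = [2.0]        # state of the candidate argument
v_p = [3.0]        # state of the predicate

print(role_weight_vector(U, u_l, v_r))
print(role_score(U, u_l, v_r, v_i, v_p))
```

Because `U` is shared across all lemma-role pairs, a rare predicate still benefits from what was learned about each role with frequent predicates.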
We applied our model to the English, Chinese, Czech and Spanish CoNLL-2009 datasets with the standard split into training, test and development sets. For English, we used the external embeddings of Dyer et al. (2015), learned using the structured skip n-gram approach of Ling et al. (2015). For Chinese, we used external embeddings produced with the neural language model of Bengio et al. (2003). For Czech and Spanish, we used embeddings created with the model proposed by Bojanowski et al. (2016).
Similarly to Kiperwasser and Goldberg (2016), we used word dropout (Iyyer et al., 2015); we replaced a word with the UNK token with probability $\frac{\alpha}{fr(w) + \alpha}$, where $\alpha$ is a hyperparameter and $fr(w)$ is the frequency of the word $w$. The predicted POS tags were provided by the CoNLL-2009 shared-task organizers. We used the same predicate disambiguator as in Roth and Lapata (2016) for English, the one used in Zhao et al. (2009) for Czech and Spanish, and the one used in Björkelund et al. (2009) for Chinese. The training objective was the categorical cross-entropy, and we optimized it with Adam (Kingma and Ba, 2015). Hyperparameter tuning and all model selection were performed on the English development set; the chosen values are shown in Table 1.
Table 1: Hyperparameter values for the semantic role labeler.
|Hyperparameter||Value|
|English word embeddings||100|
|Chinese word embeddings||128|
|Czech word embeddings||300|
|Spanish word embeddings||300|
|LSTM hidden states||512|
|Output lemma representation||128|
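The frequency-based word dropout used during training can be sketched as follows. We assume the replacement probability alpha / (count + alpha) from Kiperwasser and Goldberg (2016); the function names, the toy counts and the value of `alpha` are illustrative assumptions rather than settings taken from the released code.

```python
import random
from collections import Counter

UNK = "UNK"


def drop_probability(count, alpha):
    """Probability of replacing a word seen `count` times in training;
    rare words are replaced more often than frequent ones."""
    return alpha / (count + alpha)


def word_dropout(tokens, counts, alpha, rng):
    """Replace each token with UNK independently, with a probability that
    depends on its training-set frequency."""
    return [
        UNK if rng.random() < drop_probability(counts[w], alpha) else w
        for w in tokens
    ]


# Toy training-set counts; a Counter returns 0 for unseen words.
train_counts = Counter({"the": 1000, "makes": 12, "Sequa": 1})
rng = random.Random(0)
noisy = word_dropout(["Sequa", "makes", "the"], train_counts, alpha=0.25, rng=rng)
print(noisy)
```

Dropout of this kind is applied only at training time, so that the model learns a usable representation for the UNK token.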
We compared our full model (with POS tags and the classifier defined in Section 2.4.3) against state-of-the-art models for dependency-based SRL on English, Chinese, Czech and Spanish. For English, our model significantly outperformed all its local counterparts (i.e., models which do not perform global inference) on the in-domain test set (see Table 2), with 87.6% F1 for our model vs. 86.7% for PathLSTM (Roth and Lapata, 2016). When compared with global models, our model performed on par with the state-of-the-art global version of PathLSTM.
Though we had not done any parameter selection for other languages (i.e., used the same parameters as for English), our model performed competitively across all languages we considered.
For Chinese (Table 4), the proposed model outperformed the best previous model (PathLSTM) with an improvement of 1.8% F1.
For Czech (Table 5), our model outperformed the system that achieved the best score in the CoNLL-2009 shared task by 0.8% F1, even though, unlike previous work, it does not use any kind of morphological features explicitly. (However, character-level information is encoded in the external embeddings; see Bojanowski et al. (2016).)
Finally, for Spanish (Table 6), our system again achieved competitive results, but it did not outperform the best CoNLL-2009 model and yielded results very similar to those of PathLSTM. One possible reason for this slightly weaker performance is the relatively small size of the Spanish training set (less than half of the English one). This suggests that our model, tuned on English, is likely over-parametrized or under-regularized for Spanish.
The results are especially strong on out-of-domain data. As shown in Table 3, our approach outperformed even ensemble models on the out-of-domain English data (77.7% vs. 76.5% for the ensemble of PathLSTMs). Similarly, it performed very well on the out-of-domain Czech dataset, scoring 87.2% F1, a 1.8% F1 improvement over the best CoNLL-2009 participant (see Table 5, bottom). The favorable results on out-of-domain test sets are not surprising, as syntactic parsers, even the most accurate ones, usually struggle on domains different from the ones they have been trained on. This means that the syntactic trees they produce are unreliable and compromise the accuracy of SRL systems which rely on them. The error propagation can in principle be mitigated by exploiting a distribution over parse trees (e.g., encoded in a parse forest) rather than using a single ('Viterbi') parse. However, this is rarely feasible in practice. Since our model does not use predicted parse trees and instead relies on the ability of LSTMs to capture long-distance dependencies and syntactic phenomena (Linzen et al., 2016), it is less brittle in this setting.
Table 7: Ablation study on the English development set.
|System||P||R||F1|
|w/o POS tags||87.3||84.5||85.9|
|w/o predicate-specific encoding||80.9||79.8||80.4|
|with basic classifier||86.7||84.5||85.6|
In order to show the contribution of the modeling choices we made, we performed an ablation study on the English development set (Table 7). In these experiments we made individual changes to the model (one by one) and measured their influence on the model performance.
First, we observed that POS tag information is highly beneficial for obtaining competitive performance.
Not using predicate-specific encoding (Section 2.3), or, in other words, doing one-pass encoding with no predicate flags, hurts the performance even more badly (6% drop in F1 on the development set). This is somewhat surprising given that one-pass LSTM encoders performed competitively for syntactic dependencies Kiperwasser and Goldberg (2016); Cross and Huang (2016) and suggests that major differences between the two problems require the use of different modeling approaches.
We also observed a 1.0% drop in F1 when we follow Zhou and Xu (2015) and use the basic role classifier (Section 2.4.1). These results show that predicate-specific encoding (Section 2.3) and exploiting predicate information in the classifier (Sections 2.4.2-2.4.3) are complementary.
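As a quick sanity check on the ablation numbers, F1 can be recomputed from precision and recall as their harmonic mean. The sketch below assumes that the three numeric columns of the ablation table are precision, recall and F1, in that order; because the published precision and recall are themselves rounded, the recomputed values are only expected to match the reported F1 to within about 0.1.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for the three ablated systems.
ablation_rows = {
    "w/o POS tags": (87.3, 84.5, 85.9),
    "w/o predicate-specific encoding": (80.9, 79.8, 80.4),
    "with basic classifier": (86.7, 84.5, 85.6),
}

for name, (p, r, reported) in ablation_rows.items():
    computed = f1_score(p, r)
    print(f"{name}: computed F1 = {computed:.2f}, reported F1 = {reported}")
```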
We also studied how performance varies depending on the distance between a predicate and an argument (Figure 3). We compared our approach to the global PathLSTM model: PathLSTM is a natural reference point as it is the most accurate previous model and exploits similar modeling and representation techniques (e.g., word embeddings, LSTMs) but, unlike our approach, relies on predicted syntax. Contrary to our expectations, the syntactically-driven and global PathLSTM was weaker for longer distances. We may speculate that syntactic paths for arguments further away from the predicate become unreliable. Though LSTMs are likely to be affected by a similar trend, their states may be able to capture the uncertainty about the structure and thus let the role classifier account for this uncertainty without the need to explicitly sum over potential syntactic analyses. In contrast, PathLSTM has access only to the single (top-scoring) parse tree and, thus, may be more brittle.
In Table 8, we break down F1 results on the English test set into verbal and nominal predicates, and again compare our results with PathLSTM. First, as expected, we observe that both models are less accurate in predicting semantic roles of nominal predicates. For verbal predicates, our model slightly outperformed PathLSTM in core roles (A0-2) and performed much better (0.9% F1) in predicting modifiers (AM-*). This is very surprising, as some information about modifiers is actually explicitly encoded in the syntactic dependencies exploited by PathLSTM (e.g., the syntactic dependency TMP is predictive of the modifier role AM-TMP). Note though that the syntactic parser was trained on the same sentences (both datasets originate from WSJ sections 02-22 of the Penn Treebank), and this can explain why these syntactic dependencies (e.g., TMP) may convey little beneficial information to the semantic role labeler. For nominal predicates, PathLSTM was more accurate than our model for all roles except A0. To get a better idea of what is happening, we plotted the F1 scores as a function of the length of the shortest path between nominal predicates and their arguments. On the one hand, Figure 4 shows that PathLSTM is more accurate on roles one syntactic arc away from the nominal predicate. Note that these are the majority (78%) of arguments. On the other hand, our model appears to be more accurate for arguments syntactically far from nominal predicates. This again suggests that PathLSTM struggles with harder cases.
Unlike for verbal predicates, syntactic structure is less predictive of semantic roles for nominals (e.g., many arguments are noun modifiers). Consequently, we hypothesized that our model should be weaker than PathLSTM in recognizing arguments but should be on par with PathLSTM in assigning their roles. To test this, we looked into argument identification performance (i.e., ignoring labels). Table 9 shows the accuracy of both models in recognizing arguments of nominal and verbal predicates. Our model appears more accurate in recognizing arguments of both nominal (88.0% vs. 87.8% F1) and verbal predicates (91.2% vs. 90.5% F1). This, when taken together with the weaker labeled F1 of our model for nominal predicates (Table 8), implies that, contrary to our expectations, it is the role labeling performance for nominals which is problematic for our model. Examples of this behavior can be seen in Table 10: all arguments of the predicate pressure are correctly recognized by our model, but the role for the argument selling is not predicted correctly. In contrast, PathLSTM does not make any mistake with the labeling of the argument selling but fails to recognize from as an argument.
Earlier approaches to SRL relied heavily on complex sets of lexico-syntactic features (Gildea and Jurafsky, 2002). Pradhan et al. (2005) used a support vector machine classifier and relied on two syntactic views (obtained with two different parsers) for feature extraction. In addition to hand-crafted features, Roth and Yih (2005) enriched CRFs with an integer linear programming inference procedure in order to encode non-local constraints in SRL; Toutanova et al. (2008) employed a global reranker for dealing with structural constraints; while Surdeanu et al. (2007) studied several combination strategies of local and global features obtained from several independent SRL models.
In recent years there has been a flurry of work that employed neural network approaches for SRL. FitzGerald et al. (2015) used hand-crafted features within an MLP for calculating potentials of a CRF model; Roth and Lapata (2016) extended the features of a non-neural SRL model with LSTM representations of syntactic paths between arguments and predicates; Lei et al. (2015) relied on low-rank tensor factorization that captured interactions between arguments, predicates, their syntactic paths and semantic roles; while Collobert et al. (2011) and Foland and Martin (2015) used convolutional networks as sentence encoders and a CRF as a role classifier, with both approaches employing a rich set of features as input to the convolutional encoder. Finally, Swayamdipta et al. (2016) jointly modeled syntactic and semantic structures; they extended one of the earliest neural approaches for SRL (Henderson et al., 2008; Titov et al., 2009; Gesmundo et al., 2009) with more sophisticated modeling techniques, for example, using LSTMs instead of vanilla RNNs.
Another related line of work (Naradowsky et al., 2012; Gormley et al., 2014), instead of relying on treebank syntax, integrated grammar induction as a sub-component into the statistical model. In this way, similarly to us, these approaches do not use treebank syntax but rather rely on the ability of the joint model to induce syntax appropriate for SRL. Their focus was primarily on the low-resource setting (where syntactic annotation is not available), whereas in standard set-ups their performance was not as strong. It would be interesting to see if explicit modeling of latent syntax is also beneficial when used in conjunction with LSTMs.
We proposed a neural syntax-agnostic method for dependency-based SRL. Our model is simple and fast, and surpasses comparable approaches (no system combination, local inference) on the standard in-domain CoNLL-2009 benchmark for English, Chinese, Czech and Spanish. Moreover, it outperforms all previous methods (including ensembles) in the arguably more realistic out-of-domain setting in both English and Czech. In the future, we will consider integration of syntactic information and joint inference.
The project was supported by the European Research Council (ERC StG BroadSem 678254), the Dutch National Science Foundation (NWO VIDI 639.022.518) and an Amazon Web Services (AWS) grant. The authors would like to thank Michael Roth for his helpful suggestions.
Journal of Machine Learning Research 3:1137–1155.
Dependency-based semantic role labeling using convolutional neural networks. In Joint Conference on Lexical and Computational Semantics.
Jan Hajic, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Stepánek, Pavel Stranák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of CoNLL.
Journal of Artificial Intelligence Research 29:105–151.