The task of semantic role labeling (SRL) consists of predicting the predicate-argument structure of a sentence. More formally, for every predicate, the SRL model has to identify all argument spans and label them with their semantic roles (see Figure 1).
The most popular resources for estimating SRL models are PropBank Palmer et al. (2005) and FrameNet Baker et al. (1998). In both cases, annotations are made on top of syntactic constituent structures.
Earlier work on semantic role labeling hinged on constituent syntactic structure, using the trees to derive features and constraints on role assignments Gildea and Jurafsky (2002); Pradhan et al. (2005); Punyakanok et al. (2008). In contrast, modern SRL systems largely ignore treebank syntax He et al. (2018, 2017); Marcheggiani et al. (2017); Zhou and Xu (2015) and instead use powerful feature extractors, for example, LSTM sentence encoders.
There have been recent successful attempts to improve neural SRL models using syntax Roth and Lapata (2016); Marcheggiani and Titov (2017); Strubell et al. (2018). Nevertheless, they have relied on syntactic dependency representations rather than constituent trees.
In these methods, information from dependency trees is injected into word representations using graph convolutional networks (GCNs) Kipf and Welling (2017) or self-attention mechanisms Vaswani et al. (2017). Since SRL annotations are done on top of syntactic constituents (there exists another formulation of the SRL task focused on predicting semantic dependency graphs Surdeanu et al. (2008); for English, however, these dependency annotations are automatically derived from the span-based PropBank), we argue that exploiting constituency syntax, rather than dependency syntax, is more natural and may yield more predictive features for semantic roles. For example, even though constituent boundaries could be derived from dependency structures, this would require an unbounded number of hops over the dependency structure in GCNs or self-attention. This would be impractical: both Strubell et al. (2018) and Marcheggiani and Titov (2017) use only one hop in their best systems.
Neural models typically treat SRL as a sequence labeling problem, and hence predictions are done for individual words. Though injecting dependency syntax into word representations is relatively straightforward, it is less clear how to incorporate constituency syntax into them. In this work, we show how this can be achieved with GCNs.
Nodes in our SpanGCN correspond to constituents. The computation proceeds in three stages. First, initial span representations are produced by ‘composing’ the word representations of the first and last words in the constituent. Second, graph convolutions relying on the constituent tree are performed, yielding syntactically-informed constituent representations. Finally, the constituent representations are ‘decomposed’ back into word representations, which in turn are used as input to the SRL classifier. This approach directly encodes into word representations information about the boundaries and syntactic labels of constituents, and also provides information about their neighbourhood in the constituent structure.
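The three stages above can be sketched as follows. This is a toy illustration with 2-dimensional vectors and fixed scalar weights standing in for the learned, edge-typed parameters; the function names and all values are ours, not the paper's.

```python
# Toy sketch of the three SpanGCN stages: composition, constituent GCN,
# and decomposition. Real SpanGCN uses edge-typed weight matrices, gates,
# and layer normalization; here a fixed scalar 0.5 stands in for all of them.

def relu(v):
    return [max(0.0, x) for x in v]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def scale(v, s):
    return [s * x for x in v]

def span_gcn(words, spans):
    """words: list of 2-d vectors; spans: list of (start, end, parent_or_None)."""
    # 1) Composition: each constituent is built from its boundary words.
    cons = [relu(add(scale(words[s], 0.5), scale(words[e], 0.5)))
            for s, e, _ in spans]
    # 2) Constituent GCN: one round of parent<->child message passing.
    upd = [c[:] for c in cons]
    for i, (_, _, parent) in enumerate(spans):
        if parent is not None:
            upd[i] = add(upd[i], scale(cons[parent], 0.5))       # parent -> child
            upd[parent] = add(upd[parent], scale(cons[i], 0.5))  # child -> parent
    cons = [relu(c) for c in upd]
    # 3) Decomposition: constituents send messages back to their boundary words.
    out = [w[:] for w in words]
    for (s, e, _), c in zip(spans, cons):
        out[s] = add(out[s], scale(c, 0.5))
        out[e] = add(out[e], scale(c, 0.5))
    return [relu(w) for w in out]
```

For example, two spans (a root covering words 0..2 and a child covering words 1..2) yield word vectors enriched with constituent information at the span boundaries.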
We show effectiveness of our approach on three datasets: CoNLL-2005 Carreras and Màrquez (2005) and CoNLL-2012 Pradhan et al. (2012) with PropBank-style Palmer et al. (2005) annotation and on FrameNet 1.5 Baker et al. (1998).
SpanGCNs may be beneficial in other NLP tasks where neural sentence encoders are already effective and syntactic structure can provide a useful inductive bias, for example, logical semantic parsing Dong and Lapata (2016) or abstractive sentence summarization Chopra et al. (2016). Moreover, SpanGCN can in principle be applied to other forms of span-based linguistic representations (e.g., co-reference graphs). However, we leave this for future work.
2 Constituency Tree Encoding
The architecture for encoding constituency trees makes use of two building blocks: a bidirectional LSTM for encoding sequences and a graph convolutional network for encoding graph structures.
2.1 BiLSTM encoder
A bidirectional LSTM (BiLSTM) Graves (2013) consists of two LSTMs Hochreiter and Schmidhuber (1997), one that encodes the left context of a word and one that encodes the right context. In this paper we use alternating-stack BiLSTMs as introduced by Zhou and Xu (2015), where the output of the forward LSTM is used as input to the backward LSTM. As in He et al. (2017), we also employ highway connections Srivastava et al. (2015) between layers and recurrent dropout Gal and Ghahramani (2016) to avoid overfitting.
2.2 GCN encoder

The second building block we use is a graph convolutional network (GCN) Kipf and Welling (2017). GCNs are neural networks that, given a graph, compute the representation of each node conditioned on its neighboring nodes. They can be seen as a message-passing algorithm where the representation of a node is updated based on ‘messages’ sent by its neighboring nodes Gilmer et al. (2017).
The input to a GCN is an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ are sets of nodes and edges, respectively. Kipf and Welling (2017) assume that the set of edges also contains a self-loop, i.e., $(v, v) \in \mathcal{E}$ for every $v \in \mathcal{V}$. We refer to the initial representation of nodes with a matrix $X$, with each of its columns $x_v \in \mathbb{R}^d$ encoding the features of node $v$. The new node representation is computed as

$$h_v = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} \big(W x_u + b\big)\Big),$$

where $W \in \mathbb{R}^{d \times d}$ and $b \in \mathbb{R}^d$ are a weight matrix and a bias, respectively, and $\mathcal{N}(v)$ are the neighbors of $v$.
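A minimal sketch of this update, with scalar node features and a scalar weight standing in for the matrix $W$ (all values are illustrative):

```python
# Minimal GCN update over an undirected graph with self-loops, following
# h_v = ReLU(sum over neighbors u of (W x_u + b)). Scalar features and a
# scalar weight keep the arithmetic easy to follow; real models use matrices.

def gcn_layer(x, edges, w=0.5, b=0.1):
    """x: node features (floats); edges: undirected pairs; self-loops added."""
    n = len(x)
    neigh = {v: {v} for v in range(n)}  # every node sees itself (self-loop)
    for u, v in edges:
        neigh[u].add(v)
        neigh[v].add(u)
    # Each node sums transformed messages from its neighborhood, then ReLU.
    return [max(0.0, sum(w * x[u] + b for u in sorted(neigh[v])))
            for v in range(n)]
```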
The original GCN definition assumes that edges are undirected and unlabeled. We take inspiration from syntactic GCNs Marcheggiani and Titov (2017), introduced for dependency syntactic structures. Our update function is defined as

$$h_v = \mathrm{LN}\Big(\sum_{u \in \mathcal{N}(v)} g_{u,v}\,\big(W_{c(u,v)}\, x_u + b_{l(u,v)}\big)\Big), \qquad (2)$$

where LN refers to layer normalization Ba et al. (2016), applied after summing the messages. The functions $l(u,v)$ and $c(u,v)$ return fine-grained and coarse-grained versions of edge labels, respectively. For example, $c(u,v)$ may simply return the direction of the arc (i.e., whether the message flows along the graph edge or in the opposite direction), whereas the bias $b_{l(u,v)}$ can provide some additional syntactic information. The typing decides how many parameters the GCN has. It is crucial to keep the number of coarse-grained types low, as the model has to estimate one matrix $W_{c(u,v)}$ per coarse-grained type. We formally define the types in the next section. We also use scalar gates to weight the contribution of each node in the neighborhood and potentially ignore irrelevant edges:

$$g_{u,v} = \sigma\big(\hat{w}_{c(u,v)} \cdot x_u + \hat{b}_{l(u,v)}\big),$$

where $\sigma$ is the logistic sigmoid function, whereas $\hat{w}_{c(u,v)}$ and $\hat{b}_{l(u,v)}$ are edge-type-specific parameters.
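The gated, edge-typed update can be sketched as follows, with 2-dimensional vectors and made-up parameters keyed by a hypothetical coarse type (`"fwd"`) and fine label (`"NP"`):

```python
import math

# Sketch of the edge-typed, gated GCN update with layer normalization:
# a weight matrix per coarse-grained edge type, a bias per fine-grained
# label, and a scalar gate per incoming message. All parameter values
# used in the example are made up for illustration.

def layer_norm(v, eps=1e-6):
    m = sum(v) / len(v)
    var = sum((x - m) ** 2 for x in v) / len(v)
    return [(x - m) / math.sqrt(var + eps) for x in v]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_update(in_edges, W, b, w_gate, b_gate):
    """in_edges: (source vector, coarse type, fine label) triples for one node."""
    msg = [0.0, 0.0]
    for x_u, coarse, fine in in_edges:
        # Scalar gate decides how much of this message to let through.
        g = sigmoid(sum(wi * xi for wi, xi in zip(w_gate[coarse], x_u)) + b_gate[fine])
        # Message: coarse-typed matrix times source vector, plus fine-typed bias.
        m = [sum(W[coarse][i][j] * x_u[j] for j in range(2)) + b[fine][i]
             for i in range(2)]
        msg = [a + g * c for a, c in zip(msg, m)]
    return layer_norm(msg)  # normalize after summing the messages
```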
Now, we will show how to compose GCN and LSTM layers to produce a syntactically-informed encoder.
2.3 From words to constituents and back
The model we propose for encoding constituency structure is shown in Figure 2. It is composed of three modules: constituent composition, constituent GCN and constituent decomposition. Note that there is no parameter sharing across these components.
The model takes as input word representations, which can be either static word embeddings or contextual word vectors Peters et al. (2018a). The sentence is first encoded with a BiLSTM to obtain a context-aware representation of each word. A constituency tree is composed of words and constituents. (We slightly abuse the notation by referring to non-terminals as constituents: part-of-speech tags, normally ‘pre-terminals’, are stripped off from our trees.) We add representations (initially zero vectors) for each constituent in the tree, i.e., the green blocks in Figure 2. Each constituent representation is computed using GCN updates (Equation 2) from the word representations corresponding to the beginning and the end of its span. The coarse-grained edge types here are binary, distinguishing messages from start tokens vs. end tokens. The fine-grained edge types additionally encode the constituent label (e.g., NP or VP).
Constituent composition is followed by a layer where constituent nodes exchange messages. This layer makes sure that information about children gets incorporated into the representations of their immediate parents, and vice versa. The GCN operates on the graph with nodes corresponding to all constituents in the tree. The edges connect constituents and their immediate children in the syntactic tree, in both directions. Again, the updates are defined as in Equation 2. As before, the coarse-grained edge type is binary, now distinguishing parent-to-children messages from children-to-parent messages; the fine-grained type additionally encodes the label of the constituent sending the message. For example, consider the computation of the VP constituent in Figure 2: it receives a parent-to-child message from its parent constituent, and the parameters corresponding to this edge type and to the sender’s label are used in computing the message.
At this point, we want to ‘infuse’ words with information coming from the constituents. The graph here is the inverse of the one used in the composition stage: the constituents pass information to the first and last words in their spans. As in the composition stage, the coarse-grained edge type is binary, distinguishing messages to start tokens from messages to end tokens; the fine-grained edge types, also as before, additionally encode the constituent label. In order to spread syntactic information across the sentence, a further BiLSTM layer is used on top.
3 Semantic Role Labeling
SRL can be cast as a sequence labeling problem: given an input sentence $w_1, \ldots, w_n$ of length $n$ and the position $p$ of the predicate in the sentence, the goal is to predict a BIO sequence of semantic roles $y_1, \ldots, y_n$ (see Figure 1). We test our model on two different semantic role labeling formalisms, PropBank and FrameNet.
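As an illustration of the sequence-labeling formulation, argument spans for a single predicate can be converted to BIO tags; the sentence and role labels below are invented:

```python
# Cast SRL to sequence labeling: argument spans for one predicate become
# a BIO tag sequence over the sentence. B- marks the first token of an
# argument, I- the rest, O everything outside any argument.

def spans_to_bio(n, spans):
    """n: sentence length; spans: (start, end_inclusive, role) -> BIO tags."""
    tags = ["O"] * n
    for start, end, role in spans:
        tags[start] = "B-" + role
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + role
    return tags
```

For the sentence "The cat chased a mouse" with predicate "chased", the agent span (0, 1) and patient span (3, 4) yield `['B-A0', 'I-A0', 'O', 'B-A1', 'I-A1']`.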
In PropBank conventions, a frame is specific to a predicate sense. For example, for the predicate make, it distinguishes ‘make.01’ (‘create’) frame from ‘make.02’ (‘cause to be’) frame. Though roles are formally frame-specific (e.g., A0 is the ‘creator’ for the frame ‘make.01’ and the ‘writer’ for the frame ‘write.01’), there are certain cross-frame regularities. For example, A0 and A1 tend to correspond to proto-agents and proto-patients, respectively.
In FrameNet, every frame has its own set of role labels (frame elements, in FrameNet terminology), which makes the problem of predicting role labels harder. (Cross-frame relations, e.g., the frame hierarchy, present in FrameNet can in principle be used to establish correspondences between subsets of roles.) Differently from PropBank, lexically distinct predicates (lexical units or targets, in FrameNet terms) may evoke the same frame. For example, need and require can both trigger the frame ‘Needing’.
4 Semantic Role Labeling Model
For both PropBank and FrameNet we use the same model architecture.
We represent words with 100-dimensional GloVe embeddings Pennington et al. (2014), kept fixed during training. Word embeddings are concatenated with 100-dimensional embeddings of a binary predicate feature (indicating whether the word is the target predicate or not). Before concatenation, the GloVe embeddings are passed through layer normalization Ba et al. (2016) and dropout Srivastava et al. (2014). Formally,
$$x_t = \big[\,\mathrm{LN}(e_t);\ \mathrm{pred}(t)\,\big],$$

where $e_t$ is the word embedding at position $t$ and $\mathrm{pred}(\cdot)$ is a function that returns the embedding for the presence or absence of the predicate at position $t$. The obtained representation $x_t$ is then fed to the sentence encoder.
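A sketch of this input construction, with toy 2-dimensional embeddings and dropout omitted (all embedding values are invented):

```python
import math

# Input layer sketch: a layer-normalized word embedding concatenated with
# the embedding of the binary predicate-indicator feature. Toy values;
# dropout is omitted, as at test time.

def layer_norm(v, eps=1e-6):
    m = sum(v) / len(v)
    var = sum((x - m) ** 2 for x in v) / len(v)
    return [(x - m) / math.sqrt(var + eps) for x in v]

# Invented 2-d embeddings for the predicate feature: absent (0) / present (1).
PRED_EMB = {0: [0.0, 1.0], 1: [1.0, 0.0]}

def input_repr(word_embs, pred_pos):
    """Concatenate LN(embedding) with the predicate-indicator embedding."""
    return [layer_norm(e) + PRED_EMB[1 if t == pred_pos else 0]
            for t, e in enumerate(word_embs)]
```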
As the sentence encoder we use the SpanGCN introduced in Section 2. The SpanGCN model is fed with the word representations $x_1, \ldots, x_n$. Its output is a sequence of hidden vectors $h_1, \ldots, h_n$ that encode syntactic information for each candidate argument. As a baseline, we also use a syntax-agnostic sentence encoder: a reimplementation of the encoder of He et al. (2017) with stacked alternating LSTMs, i.e., our model with the three GCN layers stripped off. (In order to have a fair baseline, we independently tuned the number of BiLSTM layers for our model and the baseline.)
Following Strubell et al. (2018), we use a bilinear scorer:

$$s_{pa} = h_p^{\top}\, U\, h_a,$$

where $h_p$ and $h_a$ are non-linear projections of the representations of the predicate at position $p$ and of the candidate argument at position $a$. The scores are passed through the softmax function and fed to the conditional random field (CRF) layer.
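A scalar-level sketch of a bilinear scorer of this kind, with one (made-up) matrix per role and a softmax over roles:

```python
import math

# Bilinear role scorer sketch: for each role r, score_r = h_p^T U_r h_a,
# followed by a softmax over roles. The 2-d vectors and the U_r matrices
# in the example are illustrative values, not learned parameters.

def bilinear_scores(h_pred, h_arg, U):
    """U: role -> 2x2 matrix; returns softmax-normalized role probabilities."""
    scores = {}
    for role, M in U.items():
        scores[role] = sum(h_pred[i] * M[i][j] * h_arg[j]
                           for i in range(2) for j in range(2))
    z = sum(math.exp(s) for s in scores.values())
    return {r: math.exp(s) / z for r, s in scores.items()}
```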
Conditional random field
As output layer we use a first-order Markov CRF Lafferty et al. (2001). The Viterbi algorithm is used to predict the most likely label assignment at test time.
At training time, we learn the scores for transitions between BIO labels. The entire model is trained to minimize the negative conditional log-likelihood:

$$-\sum_{i} \log P\big(y^{(i)} \mid w^{(i)}, p^{(i)}\big),$$

where $p^{(i)}$ is the predicate position for the $i$-th training example.
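Viterbi decoding over per-token label scores and learned transition scores can be sketched as follows. The tag set and all scores below are toy values; a large negative transition score plays the role of a hard BIO constraint (e.g., forbidding I-A0 directly after O):

```python
# Viterbi decoding for the CRF output layer: combine per-token label
# scores with transition scores and return the highest-scoring sequence.

def viterbi(emissions, transitions, tags):
    """emissions: [{tag: score} per token]; transitions: (prev, cur) -> score."""
    # best[t][tag] = (score of best path ending in tag at step t, backpointer)
    best = [{tag: (emissions[0][tag], None) for tag in tags}]
    for t in range(1, len(emissions)):
        col = {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[t - 1][p][0]
                       + transitions.get((p, cur), 0.0))
            col[cur] = (best[t - 1][prev][0] + transitions.get((prev, cur), 0.0)
                        + emissions[t][cur], prev)
        best.append(col)
    # Backtrack from the best final tag.
    last = max(tags, key=lambda tag: best[-1][tag][0])
    path = [last]
    for t in range(len(emissions) - 1, 0, -1):
        last = best[t][last][1]
        path.append(last)
    return path[::-1]
```

Note how the forbidden O→I-A0 transition forces the decoder away from the locally best tag at each position toward the globally best valid sequence.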
5 Experiments

5.1 Data and setting
We experimented on the CoNLL-2005 and CoNLL-2012 (OntoNotes) datasets, and used the CoNLL 2005 evaluation script for evaluation. We also applied our approach to FrameNet 1.5 with the data split of Das et al. (2014) and followed the official evaluation set-up from the SemEval’07 Task 19 on frame-semantic parsing Baker et al. (2007).
We trained the self-attentive constituency parser of Kitaev and Klein (2018) (https://github.com/nikitakit/self-attentive-parser) on the training data of the CoNLL-2005 dataset and used it to parse the development and test sets of CoNLL-2005. We applied the same procedure to the CoNLL-2012 dataset. For non-ELMo experiments, neither the syntactic parser nor the SRL model used contextualized external embeddings. We performed 10-fold jackknifing to obtain syntactic predictions for the training sets of CoNLL-2005 and CoNLL-2012. For FrameNet, we parsed the entire corpus with the parser trained on the training set of CoNLL-2005.
We used 100-dimensional GloVe embeddings for all our experiments, unless otherwise specified. The hyperparameters were tuned on the CoNLL-2005 development set. The LSTM hidden state dimension was set to 300 for the CoNLL experiments and to 200 for the FrameNet ones. In our model, we used a four-layer BiLSTM below the GCN layers and a two-layer BiLSTM on top. We used an eight-layer BiLSTM in our syntax-agnostic baseline; the number of layers was independently tuned on the CoNLL-2005 development set. For ELMo experiments, we learned the mixing coefficients of the ELMo layers, projected their weighted sum to a 100-dimensional vector, and applied layer normalization, ReLU, and dropout.
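The learned ELMo mixing coefficients can be sketched as a softmax-weighted sum of the layer representations; the subsequent projection, layer normalization, ReLU, and dropout are omitted, and all numbers in the example are toy values:

```python
import math

# Scalar-mix sketch: softmax-normalize one learned scalar per ELMo layer
# and take the weighted sum of the layer vectors, optionally scaled by a
# global gamma. Real ELMo representations are high-dimensional; here 2-d.

def scalar_mix(layers, scalars, gamma=1.0):
    """layers: equal-length vectors, one per layer; scalars: raw weights."""
    z = sum(math.exp(s) for s in scalars)
    weights = [math.exp(s) / z for s in scalars]  # softmax over layers
    dim = len(layers[0])
    return [gamma * sum(w * layer[i] for w, layer in zip(weights, layers))
            for i in range(dim)]
```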
For FrameNet experiments, we constrained the CRF layer to accept only BIO tags compatible with the selected frame. We used Adam Kingma and Ba (2015) as the optimizer with an initial learning rate of 0.001, halving the learning rate whenever we did not see an improvement on the development set for two epochs. We trained the model for a maximum of 100 epochs.
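The learning-rate schedule described above can be sketched as follows (the development scores used in the example are invented):

```python
# Plateau-halving schedule sketch: start from the initial learning rate
# and halve it whenever the development score has not improved for
# `patience` consecutive epochs.

def schedule(dev_scores, lr=0.001, patience=2):
    best, stale, history = float("-inf"), 0, []
    for score in dev_scores:
        if score > best:
            best, stale = score, 0       # new best: reset the counter
        else:
            stale += 1
            if stale >= patience:        # plateau: halve the learning rate
                lr, stale = lr / 2.0, 0
        history.append(lr)
    return history
```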
5.2 Importance of syntax and ablations
Before comparing our full model to state-of-the-art SRL systems, we show that our model genuinely benefits from incorporating syntactic information and motivate other modeling decisions (e.g., the presence of BiLSTM layers at the top).
We perform this analysis on the CoNLL-2005 dataset. We also experiment with gold-standard syntax, as this provides an upper bound on what SpanGCN can gain from using syntactic information.
Results on the CoNLL-2005 in-domain (WSJ) test set:

| Model | P | R | F1 |
|---|---|---|---|
| Single / No ELMo | | | |
| He et al. (2017) | 83.1 | 83.0 | 83.1 |
| He et al. (2018) | 84.2 | 83.7 | 83.9 |
| Tan et al. (2018) | 84.5 | 85.2 | 84.8 |
| Ouchi et al. (2018) | 84.7 | 82.3 | 83.5 |
| Strubell et al. (2018) (LISA) | 84.72 | 84.57 | 84.64 |
| Single / ELMo | | | |
| He et al. (2018) | - | - | 87.4 |
| Li et al. (2019) | 87.9 | 87.5 | 87.7 |
| Ouchi et al. (2018) | 88.2 | 87.0 | 87.6 |

Results on the CoNLL-2005 out-of-domain (Brown) test set:

| Model | P | R | F1 |
|---|---|---|---|
| Single / No ELMo | | | |
| He et al. (2017) | 72.9 | 71.4 | 72.1 |
| He et al. (2018) | 74.2 | 73.1 | 73.7 |
| Tan et al. (2018) | 73.5 | 74.6 | 74.1 |
| Ouchi et al. (2018) | 76.0 | 70.4 | 73.1 |
| Strubell et al. (2018) (LISA) | 74.77 | 74.32 | 74.55 |
| Single / ELMo | | | |
| He et al. (2018) | - | - | 80.4 |
| Li et al. (2019) | 80.6 | 80.4 | 80.5 |
| Ouchi et al. (2018) | 79.9 | 77.5 | 78.7 |
From Table 1, we can see that SpanGCN improves over the syntax-agnostic baseline, a substantial boost from using predicted syntax. We can also observe that it is important to have the top BiLSTM layer: when we remove it, the performance drops by 1% F1. Interestingly, without this last layer, SpanGCN’s performance is roughly the same as that of the baseline. This shows the importance of spreading syntactic information from constituent boundaries to the rest of the sentence.
When we compare SpanGCN relying on predicted syntax with the version using gold-standard syntax, we can see that SRL scores improve greatly. (The syntactic parser we use scores 92.5% F1 on the development set.) This suggests that, despite its simplicity (e.g., the somewhat impoverished parameterization of the constituent GCNs), SpanGCN is capable of extracting predictive features from syntactic structures.
Results on the CoNLL-2012 test set:

| Model | P | R | F1 |
|---|---|---|---|
| Single / No ELMo | | | |
| He et al. (2017) | 81.7 | 81.6 | 81.7 |
| He et al. (2018) | - | - | 82.1 |
| Tan et al. (2018) | 81.9 | 83.6 | 82.7 |
| Ouchi et al. (2018) | 84.4 | 81.7 | 83.0 |
| Swayamdipta et al. (2018) | 85.1 | 81.2 | 83.8 |
| Single / ELMo | | | |
| Peters et al. (2018a) | - | - | 84.6 |
| He et al. (2018) | - | - | 85.5 |
| Li et al. (2019) | 85.7 | 86.3 | 86.0 |
| Ouchi et al. (2018) | 87.1 | 85.3 | 86.2 |
Not surprisingly, the performance of every model degrades with sentence length. For the model using gold syntax, the difference between F1 scores on short and long sentences is smaller (2.2% F1) than for the models using predicted syntax (6.9% F1). This is expected: in the gold-syntax set-up, SpanGCN can rely on perfect syntactic parses even for long sentences, while in the realistic set-up syntactic features become unreliable. SpanGCN performs on par with the baseline for very short and very long sentences. Intuitively, for short sentences BiLSTMs may already encode enough syntactic information, while for long sentences the quality of predicted syntax is not good enough to yield gains over the BiLSTM baseline.
When considering the performance of each model as a function of the distance between a predicate and its arguments, we observe that all models struggle with more ‘remote’ arguments. Evaluated in this setting, SpanGCN is slightly better than the baseline.
We also checked what kinds of errors these models make by using an oracle to correct one error type at a time and measuring the influence on performance He et al. (2017). Figure 5 shows the results. We can see that all the models make the same fraction of mistakes in labeling arguments, even with gold syntax. It is also clear that using gold syntax and, to a lesser extent, predicted syntax helps the model to figure out the exact boundaries of argument spans. The improvement of SpanGCN with gold syntax after fixing the span-related errors (merging two spans, splitting a span into two, fixing both boundaries) is 1.4% F1, while for SpanGCN with predicted syntax it is 6.1% F1. Correcting the same errors for the BiLSTM baseline results in a difference of 6.8% F1.
5.3 Comparing to the state of the art
We compare SpanGCN with state-of-the-art models on both CoNLL-2005 and CoNLL-2012. (We only consider single, non-ensemble models.)
In Table 2 (Single) we show results on the CoNLL-2005 dataset. We compare the model with state-of-the-art approaches that use syntax Strubell et al. (2018) and with syntax-agnostic models He et al. (2017, 2018); Tan et al. (2018); Ouchi et al. (2018). SpanGCN obtains state-of-the-art results, also outperforming the multi-task self-attention model of Strubell et al. (2018) (we compare with the LISA model where no ELMo information is used, neither in the syntactic parser nor in the SRL components) on the in-domain (85.43 vs. 84.64 F1) and out-of-domain (75.45 vs. 74.55 F1) test sets. The performance on the out-of-domain data shows that SpanGCN is quite robust to noisier syntax. This may be surprising, given that the GCN-based dependency-SRL model of Marcheggiani and Titov (2017) did not benefit from using dependency syntax on out-of-domain data.
In Table 3 (Single) we report results on the CoNLL-2012 dataset. SpanGCN obtains 84.4 F1, outperforming all previous models evaluated on this data.
In Table 4, we show the impact of ELMo used in different ways: as word embeddings (EMB), as predicted syntax obtained with the ELMo-based parser (SYN), and both (EMB-SYN). As expected, using ELMo always results in an improvement. Using ELMo as input word embeddings (EMB) is more effective than using it indirectly through predicted syntax (SYN): 85.9% vs. 85.7% F1. When using both ELMo embeddings and the ELMo parser, we obtain an even better score of 86.6% F1. This result is 2.2% better than SpanGCN without ELMo and 0.65% better than the EMB model. This may suggest that, although contextualized word embeddings contain information about syntax Tenney et al. (2019); Hewitt and Manning (2019); Peters et al. (2018b), explicitly encoding high-quality syntax is still useful.
In Table 2 (Single / ELMo) we show the results of ELMo models on the CoNLL-2005 test sets. SpanGCN performs 1.8% F1 better than its non-ELMo counterpart on the in-domain test set and 2.9% F1 better on the out-of-domain test set. However, SpanGCN is outperformed by other models in the ELMo setting. This may suggest that SpanGCN does not fully exploit ELMo embeddings; a further study on how to better integrate structured syntax and contextualized embeddings is left to future work.
We also compared our model against Strubell et al. (2018) in the setting where ELMo is used only to obtain syntax. SpanGCN (SYN) outperforms LISA+D&M on the in-domain test set (86.49 vs. 86.04 F1) and performs on par on out-of-domain (76.57 vs. 76.54 F1) test set.
We report ELMo results on CoNLL-2012 in Table 3 (Single / ELMo). SpanGCN outperforms the BIO model of Peters et al. (2018a) by 1.3% F1 and the span-based model of He et al. (2018) by 0.4% F1. On this dataset too, the more sophisticated span-based models perform better than SpanGCN, though the difference is smaller than on CoNLL-2005.
Results on the FrameNet 1.5 test set:

| Model | P | R | F1 |
|---|---|---|---|
| Yang and Mitchell (2017) (Seq) | 63.4 | 66.4 | 64.9 |
| Yang and Mitchell (2017) (All) | 70.2 | 60.2 | 65.5 |
| Swayamdipta et al. (2018) | 69.2 | 69.0 | 69.1 |
On FrameNet data, we compare SpanGCN with the sequential and sequential-span ensemble models of Yang and Mitchell (2017), and with the multi-task learning model of Swayamdipta et al. (2018). Swayamdipta et al. (2018) use a multi-task objective where a syntactic scaffolding model and the semantic role labeler share the same sentence encoder and are trained together on disjoint data. Like our method, this approach injects syntactic information (though dependency rather than constituent syntax) into word representations, which are then used by the SRL model. We show results on the FrameNet test set in Table 5. SpanGCN obtains 69.3% F1, performing better than the syntax-agnostic baseline (a 2.9% improvement) and than the syntax-agnostic ensemble model (All) of Yang and Mitchell (2017) (a 3.8% improvement). SpanGCN also slightly outperforms (by 0.2% F1) the multi-task syntactic model of Swayamdipta et al. (2018), obtaining state-of-the-art results.
6 Related Work
Among earlier approaches to incorporating syntax in SRL, Socher et al. (2013) and Tai et al. (2015) proposed recursive neural networks that encode constituency trees by recursively creating representations of constituents. There are two important differences from our approach. First, in our model the syntactic information in the constituents flows back to word representations; in recursive networks this would require inside-outside extensions Le and Zuidema (2014); Teng and Zhang (2017). Second, these previous models perform a global pass over the tree, whereas GCNs take into account only small fragments of the graph; this may make GCNs more robust when using noisy predicted syntactic structures.
More recently, dependency syntax has gained a lot of attention. Similarly to this work, Marcheggiani and Titov (2017) proposed to encode dependency structures using GCNs for SRL. Strubell et al. (2018) used a multi-task objective to force one of the heads of a self-attention model to predict syntactic edges. Roth and Lapata (2016) encoded dependency paths between predicates and arguments using an LSTM. Swayamdipta et al. (2018) used a multi-task learning objective to produce syntactically-informed word representations, with a sentence encoder shared between a main task (SRL) and an auxiliary syntax-related task. In earlier work, syntax was incorporated in a number of different ways: Naradowsky et al. (2012) used graphical models to encode syntactic structures, while Moschitti et al. (2008) applied tree kernels to constituency trees for SRL. Many approaches cast SRL as a span classification problem instead of treating it as sequence labeling. FitzGerald et al. (2015) used hand-crafted features to represent spans, while He et al. (2018) and Ouchi et al. (2018) adopted a BiLSTM feature extractor. In principle, SpanGCN can also be used as a syntactic feature extractor within this class of models.
7 Conclusions

In this paper we introduced SpanGCN, a novel neural architecture for encoding constituency syntax at the word level. We applied SpanGCN to semantic role labeling, on PropBank and FrameNet. We observe substantial improvements from using constituent syntax on both datasets, also in the realistic out-of-domain setting. Given that GCNs over dependency and constituency structures have access to very different information, it would be interesting to see in future work whether combining the two types of representations can lead to further improvements. While we experimented only with constituency syntax, SpanGCN may in principle be able to encode any kind of span structure, for example co-reference graphs, and can also be used to produce linguistically-informed encoders for NLP tasks other than SRL.
We thank Luheng He for her helpful suggestions. The project was supported by the European Research Council (ERC StG BroadSem 678254), and the Dutch National Science Foundation (NWO VIDI 639.022.518). We thank NVIDIA for donating the GPUs used for this research.
- Ba et al. (2016) Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. CoRR, abs/1607.06450.
- Baker et al. (2007) Collin F. Baker, Michael Ellsworth, and Katrin Erk. 2007. SemEval-2007 task 19: Frame semantic structure extraction. In Proceedings of SemEval@ACL, pages 99–104.
- Baker et al. (1998) Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Proceedings of COLING-ACL, pages 86–90.
- Carreras and Màrquez (2005) Xavier Carreras and Lluís Màrquez. 2005. Introduction to the CoNLL-2005 shared task: Semantic role labeling. In Proceedings of CoNLL, pages 152–164.
- Chopra et al. (2016) Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of NAACL-HLT, pages 93–98.
- Das et al. (2014) Dipanjan Das, Desai Chen, André F. T. Martins, Nathan Schneider, and Noah A. Smith. 2014. Frame-semantic parsing. Computational Linguistics, 40(1):9–56.
- Dong and Lapata (2016) Li Dong and Mirella Lapata. 2016. Language to logical form with neural attention. In Proceedings of ACL.
- FitzGerald et al. (2015) Nicholas FitzGerald, Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. 2015. Semantic role labeling with neural network factors. In Proceedings of EMNLP.
- Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Proceedings of NIPS, pages 1019–1027.
- Gildea and Jurafsky (2002) Daniel Gildea and Daniel Jurafsky. 2002. Automatic labeling of semantic roles. Computational linguistics, 28(3):245–288.
- Gilmer et al. (2017) Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212.
- Graves (2013) Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.
- He et al. (2018) Luheng He, Kenton Lee, Omer Levy, and Luke Zettlemoyer. 2018. Jointly predicting predicates and arguments in neural semantic role labeling. In Proceedings of ACL, pages 364–369.
- He et al. (2017) Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017. Deep semantic role labeling: What works and what’s next. In Proceedings of ACL, pages 473–483.
- Hewitt and Manning (2019) John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of NAACL-HLT, pages 4129–4138.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
- Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of ICLR.
- Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of ICLR.
- Kitaev and Klein (2018) Nikita Kitaev and Dan Klein. 2018. Constituency parsing with a self-attentive encoder. In Proceedings of ACL, pages 2675–2685.
- Lafferty et al. (2001) John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, pages 282–289.
- Le and Zuidema (2014) Phong Le and Willem Zuidema. 2014. The inside-outside recursive neural network model for dependency parsing. In Proceedings of EMNLP.
- Li et al. (2019) Zuchao Li, Shexia He, Hai Zhao, Yiqing Zhang, Zhuosheng Zhang, Xi Zhou, and Xiang Zhou. 2019. Dependency or span, end-to-end uniform semantic role labeling. CoRR, abs/1901.05280.
- Marcheggiani et al. (2017) Diego Marcheggiani, Anton Frolov, and Ivan Titov. 2017. A simple and accurate syntax-agnostic neural model for dependency-based semantic role labeling. In Proceedings of CoNLL.
- Marcheggiani and Titov (2017) Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of EMNLP, pages 1506–1515.
- Moschitti et al. (2008) Alessandro Moschitti, Daniele Pighin, and Roberto Basili. 2008. Tree kernels for semantic role labeling. Computational Linguistics, 34(2):193–224.
- Naradowsky et al. (2012) Jason Naradowsky, Sebastian Riedel, and David A. Smith. 2012. Improving NLP through marginalization of hidden syntactic structure. In Proceedings of EMNLP.
- Ouchi et al. (2018) Hiroki Ouchi, Hiroyuki Shindo, and Yuji Matsumoto. 2018. A span selection model for semantic role labeling. In Proceedings of EMNLP, pages 1630–1642.
- Palmer et al. (2005) Martha Palmer, Paul Kingsbury, and Daniel Gildea. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1):71–106.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543.
- Peters et al. (2018a) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of NAACL-HLT, pages 2227–2237.
- Peters et al. (2018b) Matthew E. Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of EMNLP, pages 1499–1509.
- Pradhan et al. (2005) Sameer Pradhan, Kadri Hacioglu, Wayne H. Ward, James H. Martin, and Daniel Jurafsky. 2005. Semantic role chunking combining complementary syntactic views. In Proceedings of CoNLL.
- Pradhan et al. (2012) Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Proceedings of EMNLP-CoNLL, pages 1–40.
- Punyakanok et al. (2008) Vasin Punyakanok, Dan Roth, and Wen-tau Yih. 2008. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics, 34(2):257–287.
- Roth and Lapata (2016) Michael Roth and Mirella Lapata. 2016. Neural semantic role labeling with dependency path embeddings. In Proceedings of ACL.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.
- Srivastava et al. (2015) Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. In Proceedings of NIPS, pages 2377–2385.
- Strubell et al. (2018) Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-informed self-attention for semantic role labeling. In Proceedings of EMNLP, pages 5027–5038.
- Surdeanu et al. (2008) Mihai Surdeanu, Richard Johansson, Adam Meyers, Lluís Màrquez, and Joakim Nivre. 2008. The CoNLL 2008 shared task on joint parsing of syntactic and semantic dependencies. In Proceedings of CoNLL.
- Swayamdipta et al. (2018) Swabha Swayamdipta, Sam Thomson, Kenton Lee, Luke Zettlemoyer, Chris Dyer, and Noah A. Smith. 2018. Syntactic scaffolds for semantic structures. In Proceedings of EMNLP, pages 3772–3782.
- Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of ACL-IJCNLP, pages 1556–1566.
- Tan et al. (2018) Zhixing Tan, Mingxuan Wang, Jun Xie, Yidong Chen, and Xiaodong Shi. 2018. Deep semantic role labeling with self-attention. In Proceedings of AAAI, pages 4929–4936.
- Teng and Zhang (2017) Zhiyang Teng and Yue Zhang. 2017. Head-lexicalized bidirectional tree lstms. TACL, 5:163–177.
- Tenney et al. (2019) Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Samuel R Bowman, Dipanjan Das, et al. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In Proceedings of ICLR.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS, pages 6000–6010.
- Yang and Mitchell (2017) Bishan Yang and Tom M. Mitchell. 2017. A joint sequential and relational model for frame-semantic parsing. In Proceedings of EMNLP, pages 1247–1256.
- Zhou and Xu (2015) Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of ACL.
Appendix A Additional Results
Results on the CoNLL-2005 development, WSJ test, and Brown test sets (P / R / F1 for each):

| Model | Dev P | Dev R | Dev F1 | WSJ P | WSJ R | WSJ F1 | Brown P | Brown R | Brown F1 |
|---|---|---|---|---|---|---|---|---|---|
| He et al. (2017) | 81.6 | 81.6 | 81.6 | 83.1 | 83.0 | 83.1 | 72.9 | 71.4 | 72.1 |
| He et al. (2018) | - | - | - | 84.2 | 83.7 | 83.9 | 74.2 | 73.1 | 73.7 |
| Tan et al. (2018) | 82.6 | 83.6 | 83.1 | 84.5 | 85.2 | 84.8 | 73.5 | 74.6 | 74.1 |
| Ouchi et al. (2018) | 83.6 | 81.4 | 82.5 | 84.7 | 82.3 | 83.5 | 76.0 | 70.4 | 73.1 |
| Strubell et al. (2018) | 83.6 | 83.74 | 83.67 | 84.72 | 84.57 | 84.64 | 74.77 | 74.32 | 74.55 |
| He et al. (2018), ELMo | - | - | 83.9 | - | - | 87.4 | - | - | 80.4 |
| Li et al. (2019), ELMo | - | - | - | 87.9 | 87.5 | 87.7 | 80.6 | 80.4 | 80.5 |
| Ouchi et al. (2018), ELMo | 87.4 | 86.3 | 86.9 | 88.2 | 87.0 | 87.6 | 79.9 | 77.5 | 78.7 |
Results on the CoNLL-2012 development and test sets (P / R / F1 for each):

| Model | Dev P | Dev R | Dev F1 | Test P | Test R | Test F1 |
|---|---|---|---|---|---|---|
| He et al. (2017) | 81.8 | 81.4 | 81.5 | 81.7 | 81.6 | 81.7 |
| Tan et al. (2018) | 82.2 | 83.6 | 82.9 | 81.9 | 83.6 | 82.7 |
| Ouchi et al. (2018) | 84.3 | 81.5 | 82.9 | 84.4 | 81.7 | 83.0 |
| Swayamdipta et al. (2018) | - | - | - | 85.1 | 81.2 | 83.8 |
| Peters et al. (2018a), ELMo | - | - | - | - | - | 84.6 |
| Li et al. (2019), ELMo | - | - | - | 85.7 | 86.3 | 86.0 |
| Ouchi et al. (2018), ELMo | 87.2 | 85.5 | 86.3 | 87.1 | 85.3 | 86.2 |