Graph Convolutions over Constituent Trees for Syntax-Aware Semantic Role Labeling

09/21/2019 · by Diego Marcheggiani, et al.

Semantic role labeling (SRL) is the task of identifying predicates and labeling argument spans with semantic roles. Even though most semantic-role formalisms are built upon constituent syntax and only syntactic constituents can be labeled as arguments (e.g., FrameNet and PropBank), all the recent work on syntax-aware SRL relies on dependency representations of syntax. In contrast, we show how graph convolutional networks (GCNs) can be used to encode constituent structures and inform an SRL system. Nodes in our SpanGCN correspond to constituents. The computation is done in 3 stages. First, initial node representations are produced by `composing' word representations of the first and the last word in the constituent. Second, graph convolutions relying on the constituent tree are performed, yielding syntactically-informed constituent representations. Finally, the constituent representations are `decomposed' back into word representations which in turn are used as input to the SRL classifier. We show the effectiveness of our syntax-aware model on standard CoNLL-2005, CoNLL-2012, and FrameNet benchmarks.


1 Introduction

The task of semantic role labeling (SRL) consists of predicting the predicate-argument structure of a sentence. More formally, for every predicate, the SRL model has to identify all argument spans and label them with their semantic roles (see Figure 1).

The most popular resources for estimating SRL models are PropBank Palmer et al. (2005) and FrameNet Baker et al. (1998). In both cases, annotations are made on top of syntactic constituent structures.

Earlier work on semantic role labeling hinged on constituent syntactic structure, using the trees to derive features and constraints on role assignments Gildea and Jurafsky (2002); Pradhan et al. (2005); Punyakanok et al. (2008). In contrast, modern SRL systems largely ignore treebank syntax He et al. (2018, 2017); Marcheggiani et al. (2017); Zhou and Xu (2015) and instead use powerful feature extractors, for example, LSTM sentence encoders.

There have been recent successful attempts to improve neural SRL models using syntax Roth and Lapata (2016); Marcheggiani and Titov (2017); Strubell et al. (2018). Nevertheless, they have relied on syntactic dependency representations rather than constituent trees.

Figure 1: An example with semantic-role annotation and its reduction to the sequence labeling problem (BIO labels): the argument structures for the predicates appeal and limit are shown in blue and red, respectively.

In these methods, information from dependency trees is injected into word representations using graph convolutional networks (GCNs) Kipf and Welling (2017) or self-attention mechanisms Vaswani et al. (2017). Since SRL annotations are done on top of syntactic constituents, we argue that exploiting constituency syntax, rather than dependency syntax, is more natural and may yield more predictive features for semantic roles. (There exists another formulation of the SRL task, where the focus is on predicting semantic dependency graphs Surdeanu et al. (2008); for English, however, these dependency annotations are automatically derived from span-based PropBank.) For example, even though constituent boundaries could be derived from dependency structures, this would require an unbounded number of hops over the dependency structure in GCNs or in self-attention. This would be impractical: both Strubell et al. (2018) and Marcheggiani and Titov (2017) use only one hop in their best systems.

Neural models typically treat SRL as a sequence labeling problem, and hence predictions are done for individual words. Though injecting dependency syntax into word representations is relatively straightforward, it is less clear how to incorporate constituency syntax into them. In this work, we show how this can be achieved with GCNs.

Nodes in our SpanGCN correspond to constituents. The computation is done in 3 stages. First, initial span representations are produced by ‘composing’ word representations of the first and the last word in the constituent. Second, graph convolutions relying on the constituent tree are performed, yielding syntactically-informed constituent representations. Finally, the constituent representations are ‘decomposed’ back into word representations, which in turn are used as input to the SRL classifier. This approach directly encodes into word representations information about the boundaries and syntactic labels of constituents, and also provides information about their neighbourhood in the constituent structure.

We show the effectiveness of our approach on three datasets: CoNLL-2005 Carreras and Màrquez (2005) and CoNLL-2012 Pradhan et al. (2012), both with PropBank-style Palmer et al. (2005) annotation, and FrameNet 1.5 Baker et al. (1998).

SpanGCNs may be beneficial in other NLP tasks where neural sentence encoders are already effective and syntactic structure can provide a useful inductive bias. For example, consider logical semantic parsing Dong and Lapata (2016) or sentence simplification Chopra et al. (2016). Moreover, SpanGCN can in principle be applied to other forms of span-based linguistic representations (e.g., co-reference graphs). However, we leave this for future work.

2 Constituency Tree Encoding

Figure 2: SpanGCN encoder. First, for each constituent, an initial representation is produced by composing the start and end tokens’ BiLSTM states (purple and black dashed arrows, respectively). This is followed by the constituent GCN: red and black arrows represent parent-to-children and children-to-parent messages, respectively. Finally, constituents are decomposed back: each constituent sends messages to its start and end tokens.

The architecture for encoding constituency trees makes use of two building blocks: a bidirectional LSTM for encoding sequences and a graph convolutional network for encoding graph structures.

2.1 BiLSTM encoder

A bidirectional LSTM (BiLSTM) Graves (2013) consists of two LSTMs Hochreiter and Schmidhuber (1997), one that encodes the left context of a word and one that encodes the right context. In this paper we use alternating-stack BiLSTMs as introduced by Zhou and Xu (2015), where the output of the forward LSTM is used as input to the backward LSTM. As in He et al. (2017), we also employ highway connections Srivastava et al. (2015) between layers and recurrent dropout Gal and Ghahramani (2016) to avoid overfitting.
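Below is a minimal PyTorch sketch of such an alternating-stack BiLSTM with highway connections between layers; the module names, dimensions, and exact gating form are illustrative assumptions rather than the authors' implementation.

```python
# Alternating-stack BiLSTM sketch: each layer runs in the direction opposite
# to the previous one, and a highway gate mixes the layer's output with a
# projection of its input. Names and dimensions are illustrative.
import torch
import torch.nn as nn

class AlternatingBiLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, dropout=0.1):
        super().__init__()
        self.lstms = nn.ModuleList()
        self.gates = nn.ModuleList()
        self.projs = nn.ModuleList()
        for i in range(num_layers):
            in_dim = input_dim if i == 0 else hidden_dim
            self.lstms.append(nn.LSTM(in_dim, hidden_dim, batch_first=True))
            self.gates.append(nn.Linear(in_dim, hidden_dim))   # highway gate
            self.projs.append(nn.Linear(in_dim, hidden_dim))   # highway carry
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        h = x
        for i, lstm in enumerate(self.lstms):
            # Odd layers read the sequence right-to-left, so every layer sees
            # the output of a layer running in the opposite direction.
            inp = torch.flip(h, dims=[1]) if i % 2 == 1 else h
            out, _ = lstm(inp)
            if i % 2 == 1:
                out = torch.flip(out, dims=[1])
            gate = torch.sigmoid(self.gates[i](h))
            h = self.dropout(gate * out + (1 - gate) * self.projs[i](h))
        return h

encoder = AlternatingBiLSTM(input_dim=200, hidden_dim=300, num_layers=4)
print(encoder(torch.randn(2, 15, 200)).shape)   # torch.Size([2, 15, 300])
```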

2.2 GCN

The second building block we use is a graph convolutional network Kipf and Welling (2017). GCNs are neural networks that, given a graph, compute the representation of a node conditioned on its neighboring nodes. They can be seen as a message-passing algorithm where the representation of a node is updated based on ‘messages’ sent by its neighboring nodes Gilmer et al. (2017).

The input to a GCN is an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ ($|\mathcal{V}| = n$) and $\mathcal{E}$ are sets of nodes and edges, respectively. Kipf and Welling (2017) assume that the set of edges also contains a self-loop for each node, i.e., $(v, v) \in \mathcal{E}$ for any $v \in \mathcal{V}$. We refer to the initial representations of nodes with a matrix $X$, with each of its columns $x_v$ ($v \in \mathcal{V}$) encoding node features. The new node representation is computed as

$$h_v = \mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} \big(W x_u + b\big)\Big),$$

where $W$ and $b$ are a weight matrix and a bias, respectively; $\mathcal{N}(v)$ are the neighbors of $v$; and $\mathrm{ReLU}$ is the rectified linear unit activation function.
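As an illustration, the sketch below implements this plain GCN update with a dense adjacency matrix that already contains self-loops; the class and variable names are ours, not part of the original model.

```python
# Plain GCN layer: each node sums the (linearly transformed) representations
# of its neighbors, then applies a ReLU. Names are illustrative.
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim)             # weight matrix W and bias b

    def forward(self, h, adj):
        # h:   (num_nodes, dim) initial node representations
        # adj: (num_nodes, num_nodes) binary adjacency with self-loops
        messages = self.W(h)                     # W x_u + b for every node u
        aggregated = adj @ messages              # sum over the neighbors N(v)
        return torch.relu(aggregated)

h = torch.randn(5, 16)
adj = torch.eye(5)                               # self-loops only
adj[0, 1] = adj[1, 0] = 1.0                      # one undirected edge
print(SimpleGCNLayer(16)(h, adj).shape)          # torch.Size([5, 16])
```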

The original GCN definition assumes that the edges are undirected and unlabeled. We take inspiration from SyntacticGCNs Marcheggiani and Titov (2017), introduced for dependency syntactic structures. Our update function is defined as

$$h_v = \mathrm{LN}\Big(\mathrm{ReLU}\Big(\sum_{u \in \mathcal{N}(v)} g_{u,v}\,\big(W_{c(u,v)}\, x_u + b_{f(u,v)}\big)\Big)\Big), \qquad (1)$$

where $\mathrm{LN}$ refers to layer normalization Ba et al. (2016), applied after summing the messages. The expressions $f(u,v)$ and $c(u,v)$ are fine-grained and coarse-grained versions of edge labels (types). For example, $c(u,v)$ may simply return the direction of the arc (i.e., whether the message flows along the graph edge or in the opposite direction), whereas the bias $b_{f(u,v)}$ can provide some additional syntactic information. The typing decides how many parameters the GCN has. It is crucial to keep the number of coarse-grained types low, as the model has to estimate one weight matrix per coarse-grained type. We formally define the types in the next section. We also use scalar gates to weight the contribution of each node in the neighborhood and potentially ignore irrelevant edges:

$$g_{u,v} = \sigma\big(\hat{w}_{c(u,v)} \cdot x_u + \hat{b}_{f(u,v)}\big), \qquad (2)$$

where $\sigma$ is the logistic sigmoid activation function, whereas $\hat{w}_{c(u,v)}$ and $\hat{b}_{f(u,v)}$ are edge-type-specific parameters.
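The following sketch illustrates the gated, edge-typed update of Equations (1) and (2) over an explicit edge list, with one weight matrix per coarse-grained type and one bias per fine-grained type; the parameter shapes and names are illustrative assumptions, not the authors' code.

```python
# Gated, edge-typed GCN update: per-edge typed messages, a scalar gate per
# edge, and layer normalization over the summed messages.
import torch
import torch.nn as nn

class TypedGatedGCN(nn.Module):
    def __init__(self, dim, num_coarse, num_fine):
        super().__init__()
        self.W = nn.Parameter(torch.randn(num_coarse, dim, dim) * 0.01)
        self.b = nn.Parameter(torch.zeros(num_fine, dim))
        self.gate_w = nn.Parameter(torch.randn(num_coarse, dim) * 0.01)
        self.gate_b = nn.Parameter(torch.zeros(num_fine))
        self.norm = nn.LayerNorm(dim)

    def forward(self, h, edges):
        # h: (num_nodes, dim); edges: list of (src, tgt, coarse_type, fine_type)
        agg = [torch.zeros(h.size(1)) for _ in range(h.size(0))]
        for src, tgt, c, f in edges:
            msg = h[src] @ self.W[c] + self.b[f]                   # typed message
            gate = torch.sigmoid(h[src] @ self.gate_w[c] + self.gate_b[f])
            agg[tgt] = agg[tgt] + gate * msg                       # gated sum
        return self.norm(torch.relu(torch.stack(agg)))

h = torch.randn(4, 8)
edges = [(0, 1, 0, 2), (1, 0, 1, 2)]             # both directions of one edge
print(TypedGatedGCN(8, num_coarse=2, num_fine=3)(h, edges).shape)
```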

Now, we will show how to compose GCN and LSTM layers to produce a syntactically-informed encoder.

2.3 From words to constituents and back

The model we propose for encoding constituency structure is shown in Figure 2. It is composed of three modules: constituent composition, constituent GCN and constituent decomposition. Note that there is no parameter sharing across these components.

Constituent composition

The model takes as input word representations, which can be either static word embeddings or contextual word vectors Peters et al. (2018a). The sentence is first encoded with a BiLSTM to obtain a context-aware representation of each word. A constituency tree is composed of words and constituents (we slightly abuse the notation by referring to non-terminals as constituents: part-of-speech tags, normally ‘pre-terminals’, are stripped off from our trees). We add representations (initially zero vectors) for each constituent in the tree, i.e., the green blocks in Figure 2. Each constituent representation is computed using the GCN updates (Equation 1) from the word representations corresponding to the beginning and the end of its span. The coarse-grained types here are binary, distinguishing messages from start tokens vs. messages from end tokens. The fine-grained edge types additionally encode the constituent label (e.g., NP or VP).
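A minimal sketch of this composition step is given below, assuming BiLSTM word states and constituents represented as (label, start, end) spans. The parameter layout (one matrix per endpoint type, one bias per endpoint-label pair) mirrors the typed GCN above; gates are omitted for brevity and all names are illustrative.

```python
# Constituent composition: each constituent's initial representation is built
# from typed messages sent by its start and end words.
import torch
import torch.nn as nn

LABELS = ["S", "NP", "VP", "PP"]                 # toy constituent label set

class ConstituentComposition(nn.Module):
    def __init__(self, dim, num_labels):
        super().__init__()
        self.W = nn.Parameter(torch.randn(2, dim, dim) * 0.01)   # start / end
        self.b = nn.Parameter(torch.zeros(2, num_labels, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, word_states, spans):
        # spans: list of (label_id, start, end); one constituent node per span
        reps = []
        for label, start, end in spans:
            start_msg = word_states[start] @ self.W[0] + self.b[0, label]
            end_msg = word_states[end] @ self.W[1] + self.b[1, label]
            reps.append(self.norm(torch.relu(start_msg + end_msg)))
        return torch.stack(reps)                 # (num_constituents, dim)

words = torch.randn(6, 16)                       # BiLSTM states for 6 words
spans = [(LABELS.index("NP"), 0, 1), (LABELS.index("VP"), 2, 5)]
print(ConstituentComposition(16, len(LABELS))(words, spans).shape)
```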

Constituent GCN

Constituent composition is followed by a layer where constituent nodes exchange messages. This layer makes sure that information about children gets incorporated into the representations of their immediate parents and vice versa. The GCN operates on the graph with nodes corresponding to all constituents in the tree. The edges connect constituents and their immediate children in the syntactic tree, in both directions. Again, the updates are defined as in Equation (1). As before, the coarse-grained edge type is binary, now distinguishing parent-to-children messages from children-to-parent messages. The fine-grained type additionally encodes the label of the constituent sending the message. For example, consider the computation of the VP constituent in Figure 2: it receives a parent-to-child message from its parent constituent, and the parameters corresponding to these edge types are used in computing this message.

Constituent decomposition

At this point, we want to ‘infuse’ words with information coming from constituents. The graph here is the inverse of the one used in the composition stage: each constituent passes information to the first and the last words of its span. As in the composition stage, the coarse-grained edge type is binary, distinguishing messages to start tokens from messages to end tokens, and the fine-grained edge types additionally encode the constituent label. In order to spread syntactic information across the sentence, a further BiLSTM layer is applied on top.

Note that the residual connections, shown in blue in Figure 2, let the model bypass the GCN where needed.
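The sketch below illustrates the decomposition step together with the residual connection, again over (label, start, end) spans; as with the previous sketches, it is an illustrative rendering with assumed names, not the authors' code.

```python
# Constituent decomposition: each constituent sends a typed message back to
# the first and last words of its span; a residual connection lets word
# representations bypass the GCN.
import torch
import torch.nn as nn

class ConstituentDecomposition(nn.Module):
    def __init__(self, dim, num_labels):
        super().__init__()
        self.W = nn.Parameter(torch.randn(2, dim, dim) * 0.01)  # to start / to end
        self.b = nn.Parameter(torch.zeros(2, num_labels, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, word_states, constituent_states, spans):
        agg = [torch.zeros(word_states.size(1)) for _ in range(word_states.size(0))]
        for i, (label, start, end) in enumerate(spans):
            agg[start] = agg[start] + constituent_states[i] @ self.W[0] + self.b[0, label]
            agg[end] = agg[end] + constituent_states[i] @ self.W[1] + self.b[1, label]
        updated = self.norm(torch.relu(torch.stack(agg)))
        return word_states + updated             # residual connection around the GCN

words = torch.randn(6, 16)
cons = torch.randn(2, 16)
spans = [(1, 0, 1), (2, 2, 5)]                   # (label_id, start, end)
print(ConstituentDecomposition(16, num_labels=4)(words, cons, spans).shape)
```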

3 Semantic Role Labeling

SRL can be cast as a sequence labeling problem: given an input sentence and the position of the predicate in it, the goal is to predict a BIO sequence of semantic roles (see Figure 1). We test our model on two different semantic role labeling formalisms, PropBank and FrameNet.
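For illustration, here is a toy, hand-constructed example of the BIO encoding; the sentence and labels below are made up and not taken from the datasets.

```python
# Toy BIO encoding for SRL: one role tag per word, given the predicate position.
# "B-V" marks the predicate itself; labels follow PropBank conventions.
sentence = ["The", "court", "limited", "the", "appeal", "yesterday"]
predicate_index = 2                              # the predicate "limited"
bio_tags = ["B-A0", "I-A0", "B-V", "B-A1", "I-A1", "B-AM-TMP"]

assert len(sentence) == len(bio_tags)
for word, tag in zip(sentence, bio_tags):
    print(f"{word:10s} {tag}")
```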

PropBank

In PropBank conventions, a frame is specific to a predicate sense. For example, for the predicate make, it distinguishes ‘make.01’ (‘create’) frame from ‘make.02’ (‘cause to be’) frame. Though roles are formally frame-specific (e.g., A0 is the ‘creator’ for the frame ‘make.01’ and the ‘writer’ for the frame ‘write.01’), there are certain cross-frame regularities. For example, A0 and A1 tend to correspond to proto-agents and proto-patients, respectively.

FrameNet

In FrameNet, every frame has its own set of role labels (frame elements in FrameNet terminology); cross-frame relations (e.g., the frame hierarchy) present in FrameNet can in principle be used to establish correspondences between a subset of roles. This makes the problem of predicting role labels harder. Differently from PropBank, lexically distinct predicates (lexical units or targets in FrameNet terms) may evoke the same frame. For example, need and require can both trigger the frame ‘Needing’.

As in the previous work we compare to, we assume access to gold frames Swayamdipta et al. (2018); Yang and Mitchell (2017).

4 Semantic Role Labeling Model

For both PropBank and FrameNet we use the same model architecture.

Word representation

We represent words with 100-dimensional GloVe embeddings Pennington et al. (2014) and keep them fixed during training. Word embeddings are concatenated with 100-dimensional embeddings of a predicate binary feature (indicating whether the word is the target predicate or not). Before concatenation, the GloVe embeddings are passed through layer normalization Ba et al. (2016) and dropout Srivastava et al. (2014). Formally,

$$x_i = \mathrm{Dropout}\big(\mathrm{LN}(e_i)\big) \circ \mathrm{pred}(i),$$

where $e_i$ is the GloVe embedding of the word at position $i$, $\circ$ denotes concatenation, and $\mathrm{pred}(i)$ is a function that returns the embedding for the presence or absence of the predicate at position $i$. The obtained representation is then fed to the sentence encoder.
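A minimal sketch of this input layer is shown below; the embedding tables, dimensions, and module names are illustrative assumptions.

```python
# Input word representation: layer-normalized, dropped-out (frozen) GloVe
# embeddings concatenated with a learned predicate-indicator embedding.
import torch
import torch.nn as nn

class WordRepresentation(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, pred_dim=100, dropout=0.1):
        super().__init__()
        self.glove = nn.Embedding(vocab_size, emb_dim)  # loaded from GloVe, frozen
        self.glove.weight.requires_grad = False
        self.pred = nn.Embedding(2, pred_dim)           # is-predicate / is-not
        self.norm = nn.LayerNorm(emb_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, word_ids, predicate_index):
        emb = self.dropout(self.norm(self.glove(word_ids)))      # (T, emb_dim)
        is_pred = torch.zeros(word_ids.size(0), dtype=torch.long)
        is_pred[predicate_index] = 1
        return torch.cat([emb, self.pred(is_pred)], dim=-1)      # (T, emb+pred)

rep = WordRepresentation(vocab_size=1000)
print(rep(torch.randint(0, 1000, (6,)), predicate_index=2).shape)  # (6, 200)
```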

Sentence encoder

As a sentence encoder we use the SpanGCN introduced in Section 2. The SpanGCN model is fed with the word representations $x_i$. Its output is a sequence of hidden vectors that encode syntactic information for each candidate argument. As a baseline we also use a syntax-agnostic sentence encoder, a reimplementation of the encoder of He et al. (2017) with stacked alternating LSTMs, i.e., our model with the three GCN layers stripped off (in order to have a fair baseline, we independently tuned the number of BiLSTM layers for our model and for the baseline).

Bilinear scorer

Following Strubell et al. (2018), we use a bilinear scorer:

$$s_{pa} = r_p^{\top}\, U\, r_a,$$

where $r_p$ and $r_a$ are non-linear projections of the representations of the predicate at position $p$ in the sentence and of the candidate argument at position $a$, respectively, and $U$ is a learned tensor producing a score for each role label. The scores are passed through the softmax function and fed to the conditional random field (CRF) layer.
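The following sketch illustrates one way to realize such a bilinear scorer, with a learned tensor producing a score per role label for every candidate argument position; the projection sizes and names are assumptions for illustration.

```python
# Bilinear scorer: non-linear projections of the predicate and of each
# candidate argument, combined through a learned tensor U.
import torch
import torch.nn as nn

class BilinearScorer(nn.Module):
    def __init__(self, dim, proj_dim, num_labels):
        super().__init__()
        self.pred_proj = nn.Sequential(nn.Linear(dim, proj_dim), nn.ReLU())
        self.arg_proj = nn.Sequential(nn.Linear(dim, proj_dim), nn.ReLU())
        self.U = nn.Parameter(torch.randn(proj_dim, num_labels, proj_dim) * 0.01)

    def forward(self, states, predicate_index):
        # states: (T, dim) encoder outputs; returns (T, num_labels) role scores
        r_p = self.pred_proj(states[predicate_index])        # (proj_dim,)
        r_a = self.arg_proj(states)                          # (T, proj_dim)
        # s[t, l] = r_p^T U[:, l, :] r_a[t]
        return torch.einsum("i,ilj,tj->tl", r_p, self.U, r_a)

scores = BilinearScorer(dim=600, proj_dim=300, num_labels=10)(
    torch.randn(6, 600), predicate_index=2)                  # (6, 10)
print(scores.shape)
```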

Conditional random field

As output layer we use a first-order Markov CRF Lafferty et al. (2001). The Viterbi algorithm is used to predict the most likely label assignment at test time.

At training time we learn the scores for transitions between BIO labels. The entire model is trained to minimize the negative conditional log-likelihood:

$$\mathcal{L} = -\sum_{j} \log P\big(y^{(j)} \mid w^{(j)}, p^{(j)}\big),$$

where $p^{(j)}$ is the predicate position for the $j$-th training example, $w^{(j)}$ is its sentence, and $y^{(j)}$ its gold sequence of BIO role labels.
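For illustration, the sketch below implements a small first-order linear-chain CRF, with the forward algorithm for the negative log-likelihood and Viterbi decoding at test time; it is a self-contained toy implementation, not the authors' code.

```python
# First-order linear-chain CRF over BIO tags for one sentence.
import torch
import torch.nn as nn

class LinearChainCRF(nn.Module):
    def __init__(self, num_tags):
        super().__init__()
        self.transitions = nn.Parameter(torch.zeros(num_tags, num_tags))

    def sequence_score(self, emissions, tags):
        # emissions: (T, num_tags); tags: (T,) gold BIO tag ids
        score = emissions[0, tags[0]]
        for t in range(1, emissions.size(0)):
            score = score + self.transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
        return score

    def log_partition(self, emissions):
        alpha = emissions[0]                                   # (num_tags,)
        for t in range(1, emissions.size(0)):
            alpha = torch.logsumexp(alpha.unsqueeze(1) + self.transitions, dim=0) + emissions[t]
        return torch.logsumexp(alpha, dim=0)

    def neg_log_likelihood(self, emissions, tags):
        return self.log_partition(emissions) - self.sequence_score(emissions, tags)

    def viterbi(self, emissions):
        score, backptrs = emissions[0], []
        for t in range(1, emissions.size(0)):
            total = score.unsqueeze(1) + self.transitions      # (prev, next)
            backptrs.append(total.argmax(dim=0))
            score = total.max(dim=0).values + emissions[t]
        best = [score.argmax().item()]
        for bp in reversed(backptrs):
            best.append(bp[best[-1]].item())
        return best[::-1]

crf = LinearChainCRF(num_tags=5)
emissions = torch.randn(6, 5)                                  # scores from the model
loss = crf.neg_log_likelihood(emissions, torch.tensor([0, 1, 2, 3, 3, 4]))
print(loss.item(), crf.viterbi(emissions))
```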

5 Experiments

5.1 Data and setting

We experimented on the CoNLL-2005 and CoNLL-2012 (OntoNotes) datasets, and used the CoNLL 2005 evaluation script for evaluation. We also applied our approach to FrameNet 1.5 with the data split of Das et al. (2014) and followed the official evaluation set-up from the SemEval’07 Task 19 on frame-semantic parsing Baker et al. (2007).

We trained the self-attentive constituency parser of Kitaev and Klein (2018) (https://github.com/nikitakit/self-attentive-parser) on the training data of the CoNLL-2005 dataset and parsed the development and test sets of CoNLL-2005; we applied the same procedure to the CoNLL-2012 dataset. For non-ELMo experiments, neither the syntactic parser nor the SRL model used contextualized external embeddings. We performed 10-fold jackknifing to obtain syntactic predictions for the training sets of CoNLL-2005 and CoNLL-2012. For FrameNet, we parsed the entire corpus with the parser trained on the training set of CoNLL-2005.

We used 100-dimensional GloVe embeddings for all our experiments, unless otherwise specified. The hyperparameters were tuned on the CoNLL-2005 development set. The LSTM hidden state dimension was set to 300 for the CoNLL experiments and to 200 for the FrameNet ones. In our model, we used a four-layer BiLSTM below the GCN layers and a two-layer BiLSTM on top. We used an eight-layer BiLSTM in our syntax-agnostic baseline; the number of layers was independently tuned on the CoNLL-2005 development set. For ELMo experiments, we learned the mixing coefficients of ELMo and projected the weighted sum of the ELMo layers to a 100-dimensional vector, then applied layer normalization, ReLU, and dropout.
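A minimal sketch of this ELMo mixing and projection step is shown below; it assumes the ELMo layer tensors are already given, and the scaling parameter and module names are illustrative.

```python
# ELMo mixing and projection: softmax-normalized scalar weights over the ELMo
# layers, a scaled weighted sum, projection to 100 dims, layer norm, ReLU, dropout.
import torch
import torch.nn as nn

class ElmoMixProjection(nn.Module):
    def __init__(self, num_layers=3, elmo_dim=1024, out_dim=100, dropout=0.1):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))   # mixing coefficients
        self.gamma = nn.Parameter(torch.ones(1))                # global scale
        self.proj = nn.Linear(elmo_dim, out_dim)
        self.norm = nn.LayerNorm(out_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, elmo_layers):
        # elmo_layers: (num_layers, T, elmo_dim) contextual representations
        weights = torch.softmax(self.scalars, dim=0).view(-1, 1, 1)
        mixed = self.gamma * (weights * elmo_layers).sum(dim=0)  # (T, elmo_dim)
        return self.dropout(torch.relu(self.norm(self.proj(mixed))))

proj = ElmoMixProjection()
print(proj(torch.randn(3, 6, 1024)).shape)                      # (6, 100)
```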

For FrameNet experiments, we constrained the CRF layer to accept only BIO tags compatible with the selected frame. We used Adam Kingma and Ba (2015) as an optimizer with an initial learning rate of 0.001, halving the learning rate whenever we did not see an improvement on the development set for two epochs. We trained each model for a maximum of 100 epochs.

All models were implemented with PyTorch (https://pytorch.org). We used some modules from AllenNLP (https://github.com/allenai/allennlp) and the reimplementation of the FrameNet evaluation scripts by Swayamdipta et al. (2018) (https://github.com/swabhs/scaffolding).

5.2 Importance of syntax and ablations

Model                   P      R      F1
Baseline                82.78  83.58  83.18
SpanGCN                 84.48  84.26  84.37
  (w/o BiLSTM)          83.31  83.35  83.33
SpanGCN (Gold)          90.50  90.65  90.58
  (w/o BiLSTM)          88.96  90.02  89.49
Table 1: Results with predicted and gold syntax on the CoNLL-2005 development set.
Figure 3: CoNLL-2005 F1 score as a function of sentence length.
Figure 4: CoNLL-2005 F1 score as a function of the distance of a predicate from its arguments.
Figure 5: Performance of CoNLL-2005 models after performing corrections from He et al. (2017).

Before comparing our full model to state-of-the-art SRL systems, we show that our model genuinely benefits from incorporating syntactic information and motivate other modeling decisions (e.g., the presence of BiLSTM layers at the top).

We perform this analysis on the CoNLL-2005 dataset. We also experiment with gold-standard syntax, as this provides an upper bound on what SpanGCN can gain from using syntactic information.

WSJ Test                        P      R      F1
Single / No ELMo
He et al. (2017)                83.1   83.0   83.1
He et al. (2018)                84.2   83.7   83.9
Tan et al. (2018)               84.5   85.2   84.8
Ouchi et al. (2018)             84.7   82.3   83.5
Strubell et al. (2018) (LISA)   84.72  84.57  84.64
SpanGCN                         85.8   85.05  85.43
Single / ELMo
He et al. (2018)                -      -      87.4
Li et al. (2019)                87.9   87.5   87.7
Ouchi et al. (2018)             88.2   87.0   87.6
SpanGCN (EMB-SYN)               86.99  87.48  87.24

Brown Test                      P      R      F1
Single / No ELMo
He et al. (2017)                72.9   71.4   72.1
He et al. (2018)                74.2   73.1   73.7
Tan et al. (2018)               73.5   74.6   74.1
Ouchi et al. (2018)             76.0   70.4   73.1
Strubell et al. (2018) (LISA)   74.77  74.32  74.55
SpanGCN                         76.17  74.74  75.45
Single / ELMo
He et al. (2018)                -      -      80.4
Li et al. (2019)                80.6   80.4   80.5
Ouchi et al. (2018)             79.9   77.5   78.7
SpanGCN (EMB-SYN)               78.63  78.09  78.36
Table 2: Precision, recall and F1 on the CoNLL-2005 WSJ and Brown test sets.

From Table 1, we can see that SpanGCN improves over the syntax-agnostic baseline by 1.2% F1, a substantial boost from using predicted syntax. We can also observe that it is important to have the top BiLSTM layer: when we remove it, performance drops by 1% F1. It is interesting that without this last layer, SpanGCN’s performance is roughly the same as that of the baseline. This shows the importance of spreading syntactic information from constituent boundaries to the rest of the sentence.

When we compare SpanGCN relying on predicted syntax with the version using gold-standard syntax, we can see that SRL scores improve greatly (the syntactic parser we use scores 92.5% F1 on the development set). This suggests that, despite its simplicity (e.g., the somewhat impoverished parameterization of constituent GCNs), SpanGCN is capable of extracting predictive features from syntactic structures.

We also measured the performance of the models above as a function of sentence length (Figure 3), and as a function of the distance between a predicate and its arguments (Figure 4).

Test                          P      R      F1
Single / No ELMo
He et al. (2017)              81.7   81.6   81.7
He et al. (2018)              -      -      82.1
Tan et al. (2018)             81.9   83.6   82.7
Ouchi et al. (2018)           84.4   81.7   83.0
Swayamdipta et al. (2018)     85.1   81.2   83.8
SpanGCN                       84.47  84.26  84.37
Single / ELMo
Peters et al. (2018a)         -      -      84.6
He et al. (2018)              -      -      85.5
Li et al. (2019)              85.7   86.3   86.0
Ouchi et al. (2018)           87.1   85.3   86.2
SpanGCN (EMB-SYN)             85.77  86.04  85.91
Table 3: Precision, recall and F1 on the CoNLL-2012 test set.

Not surprisingly, the performance of every model degrades with sentence length. For the model using gold syntax, the difference between F1 scores on short and long sentences is smaller (2.2% F1) than for the models using predicted syntax (6.9% F1). This is also expected: in the gold-syntax set-up SpanGCN can rely on perfect syntactic parses even for long sentences, while in the realistic set-up syntactic features become unreliable. SpanGCN performs on par with the baseline for very short and very long sentences. Intuitively, for short sentences BiLSTMs may already encode enough syntactic information, while for longer sentences the quality of predicted syntax is not good enough to yield gains over the BiLSTM baseline.

When considering the performance of each model as a function of the distance between a predicate and its arguments, we observe that all models struggle with more ‘remote’ arguments. Evaluated in this setting, SpanGCN is slightly better than the baseline.

We also checked what kinds of errors these models make by using an oracle to correct one error type at a time and measuring the influence on performance He et al. (2017). Figure 5 shows the results. We can see that all the models make the same fraction of mistakes in labeling arguments, even with gold syntax. It is also clear that using gold syntax and, to a lesser extent, predicted syntax helps the model to figure out the exact boundaries of argument spans. The improvement of SpanGCN with gold syntax after fixing the errors related to spans (merge two spans, split into two spans, fix both boundaries) is 1.4% F1, while for SpanGCN with predicted syntax it is 6.1% F1. Correcting the same errors for the BiLSTM baseline results in a difference of 6.8% F1.

5.3 Comparing to the state of the art

We compare SpanGCN with state-of-the-art models on both CoNLL-2005 and CoNLL-2012 (we consider only single, non-ensemble models).

CoNLL-2005

In Table 2 (Single) we show results on the CoNLL-2005 dataset. We compare the model with state-of-the-art approaches that use syntax Strubell et al. (2018) and with syntax-agnostic models He et al. (2018, 2017); Tan et al. (2018); Ouchi et al. (2018). SpanGCN obtains state-of-the-art results, also outperforming the multi-task self-attention model of Strubell et al. (2018) (we compare with the LISA model where no ELMo information is used, neither in the syntactic parser nor in the SRL components) on the in-domain (85.43 vs. 84.64 F1) and out-of-domain (75.45 vs. 74.55 F1) test sets. The performance on the out-of-domain data shows that SpanGCN is quite robust to noisier syntax. This may be surprising given that the GCN-based dependency-SRL model of Marcheggiani and Titov (2017) did not benefit from using dependency syntax on out-of-domain data.

CoNLL-2012

In Table 3 (Single) we report results on the CoNLL-2012 dataset. SpanGCN obtains 84.4 F1, outperforming all previous models evaluated on this data.

ELMo Experiments

Model                P      R      F1
SpanGCN              84.48  84.26  84.37
SpanGCN (EMB)        85.82  86.02  85.92
SpanGCN (SYN)        85.77  85.74  85.76
SpanGCN (EMB-SYN)    86.31  86.84  86.57
Table 4: Ablation results with ELMo information on the CoNLL-2005 development set.

We also tested SpanGCN with contextualized word embeddings (ELMo) Peters et al. (2018a), using them to train the syntactic parser of Kitaev and Klein (2018) and also providing them as input to our model.

In Table 4, we show the impact of ELMo used in different ways: as word embeddings (EMB), as predicted syntax obtained with the ELMo-based parser (SYN), and both (EMB-SYN). As expected, using ELMo always results in an improvement. Using ELMo as input word embeddings (EMB) is more effective than using it indirectly through predicted syntax (SYN), 85.9% vs. 85.7% F1. When using both ELMo embeddings and the ELMo parser, we obtain even better scores, 86.6% F1. This result is 2.2% better than SpanGCN without ELMo and 0.65% better than the EMB model. This may suggest that although contextualized word embeddings contain information about syntax Tenney et al. (2019); Hewitt and Manning (2019); Peters et al. (2018b), explicitly encoding high-quality syntax is still useful.

In Table 2 (Single / ELMo) we show results of the ELMo models on the CoNLL-2005 test sets. SpanGCN performs 1.8% F1 better than its non-ELMo counterpart on the in-domain test set and 2.9% F1 better on the out-of-domain test set. SpanGCN is outperformed by other models in the ELMo setting. This may suggest that SpanGCN does not fully exploit ELMo embeddings. A further study on how to better integrate structured syntax and contextualized embeddings is left to future work.

We also compared our model against Strubell et al. (2018) in the setting where ELMo is used only to obtain syntax. SpanGCN (SYN) outperforms LISA+D&M on the in-domain test set (86.49 vs. 86.04 F1) and performs on par with it on the out-of-domain test set (76.57 vs. 76.54 F1).

We report ELMo results on CoNLL-2012 in Table 3 (Single / ELMo). SpanGCN outperforms the BIO model of Peters et al. (2018a) by 1.3% F1 and the span-based model of He et al. (2018) by 0.4% F1. Also on this dataset, more sophisticated span-based models perform better than SpanGCN, even if the difference is smaller than on CoNLL-2005.

Model                             P      R      F1
Yang and Mitchell (2017) (Seq)    63.4   66.4   64.9
Yang and Mitchell (2017) (All)    70.2   60.2   65.5
Swayamdipta et al. (2018)         69.2   69.0   69.1
SpanGCN                           69.8   68.78  69.29
Table 5: Frame SRL results on the FrameNet 1.5 test set using gold frames.

FrameNet

On FrameNet data, we compare SpanGCN with the sequential and sequential-span ensemble models of Yang and Mitchell (2017), and with the multi-task learning model of Swayamdipta et al. (2018). Swayamdipta et al. (2018) use a multi-task learning objective where a syntactic scaffolding model and the semantic role labeler share the same sentence encoder and are trained together on disjoint data. Like our method, this approach injects syntactic information (though dependency rather than constituent syntax) into word representations, which are then used by the SRL model. We show results on the FrameNet test set in Table 5. SpanGCN obtains a 69.3% F1 score. It performs better than the syntax-agnostic baseline (a 2.9% improvement) and better than the syntax-agnostic ensemble model (All) of Yang and Mitchell (2017) (a 3.8% improvement). SpanGCN also slightly outperforms (by 0.2% F1) the multi-task syntactic model of Swayamdipta et al. (2018), obtaining state-of-the-art results.

6 Related Work

Among earlier approaches to incorporating syntax in SRL, Socher et al. (2013); Tai et al. (2015) proposed recursive neural networks that encode constituency trees by recursively creating representations of constituents. There are two important differences with our approach. First, in our model the syntactic information in the constituents flows back into word representations; in recursive networks this could be achieved with their inside-outside versions Le and Zuidema (2014); Teng and Zhang (2017). Second, these previous models perform a global pass over the tree, whereas GCNs take into account only small fragments of the graph. This may make GCNs more robust when using noisy predicted syntactic structures.

More recently, dependency syntax has gained a lot of attention. Similarly to this work, Marcheggiani and Titov (2017) proposed to encode dependency structure using GCNs for SRL. Strubell et al. (2018) used a multi-task objective to force one of the heads of a self-attention model to predict syntactic edges. Roth and Lapata (2016) encoded dependency paths between predicates and arguments using an LSTM. Also, Swayamdipta et al. (2018) used a multi-task learning objective to produce syntactically-informed word representations, with a sentence encoder shared between two tasks, a main task (SRL) and an auxiliary syntax-related task. In earlier work, syntax had been incorporated in a number of different ways: Naradowsky et al. (2012) used graphical models to encode syntactic structures, while Moschitti et al. (2008) applied tree kernels to encode constituency trees for SRL. Many approaches cast SRL as a span classification problem instead of treating it as sequence labeling: FitzGerald et al. (2015) used hand-crafted features to represent spans, while He et al. (2018) and Ouchi et al. (2018) adopted a BiLSTM feature extractor. In principle, SpanGCN could also be used as a syntactic feature extractor within this class of models.

7 Conclusions

In this paper we introduced SpanGCN, a novel neural architecture for encoding constituency syntax at the word level. We applied SpanGCN to semantic role labeling, on PropBank and FrameNet. We observe substantial improvements from using constituent syntax on both datasets, including in the realistic out-of-domain setting. Given that GCNs over dependency and constituency structure have access to very different information, it would be interesting to see in future work whether combining the two types of representations can lead to further improvements. While we experimented only with constituency syntax, SpanGCN can in principle encode any kind of span structure, for example co-reference graphs, and can also be used to produce linguistically-informed encoders for NLP tasks other than SRL.

Acknowledgments

We thank Luheng He for her helpful suggestions. The project was supported by the European Research Council (ERC StG BroadSem 678254), and the Dutch National Science Foundation (NWO VIDI 639.022.518). We thank NVIDIA for donating the GPUs used for this research.

References

Appendix A Additional Results

Additional development results for CoNLL-2005 (Table 6) and CoNLL-2012 (Table 7) datasets.

                           Dev                    WSJ Test               Brown Test
Model                      P      R      F1       P      R      F1       P      R      F1
Single
He et al. (2017)           81.6   81.6   81.6     83.1   83.0   83.1     72.9   71.4   72.1
He et al. (2018)           -      -      -        84.2   83.7   83.9     74.2   73.1   73.7
Tan et al. (2018)          82.6   83.6   83.1     84.5   85.2   84.8     73.5   74.6   74.1
Ouchi et al. (2018)        83.6   81.4   82.5     84.7   82.3   83.5     76.0   70.4   73.1
Strubell et al. (2018)     83.6   83.74  83.67    84.72  84.57  84.64    74.77  74.32  74.55
SpanGCN                    84.48  84.26  84.37    85.8   85.05  85.43    76.17  74.74  75.45
ELMo
He et al. (2018)           -      -      83.9     -      -      87.4     -      -      80.4
Li et al. (2019)           -      -      -        87.9   87.5   87.7     80.6   80.4   80.5
Ouchi et al. (2018)        87.4   86.3   86.9     88.2   87.0   87.6     79.9   77.5   78.7
SpanGCN                    86.31  86.84  86.57    86.99  87.48  87.24    78.63  78.09  78.36
Table 6: Precision, recall and F1 on the CoNLL-2005 development and test sets.
                              Dev                    Test
Model                         P      R      F1       P      R      F1
Single
He et al. (2017)              81.8   81.4   81.5     81.7   81.6   81.7
Tan et al. (2018)             82.2   83.6   82.9     81.9   83.6   82.7
Ouchi et al. (2018)           84.3   81.5   82.9     84.4   81.7   83.0
Swayamdipta et al. (2018)     -      -      -        85.1   81.2   83.8
SpanGCN                       84.45  84.16  84.31    84.47  84.26  84.37
ELMo
Peters et al. (2018a)         -      -      -        -      -      84.6
Li et al. (2019)              -      -      -        85.7   86.3   86.0
Ouchi et al. (2018)           87.2   85.5   86.3     87.1   85.3   86.2
SpanGCN                       85.75  85.94  85.85    85.77  86.04  85.91
Table 7: Precision, recall and F1 on the CoNLL-2012 development and test sets.