Semantic role labeling (SRL) (gildea2002automatic) can be informally described as the task of discovering who did what to whom. For example, consider the SRL dependency graph shown above the sentence in Figure 1. Formally, the task includes (1) detection of predicates (e.g., makes); (2) labeling the predicates with a sense from a sense inventory (e.g., make.01); (3) identifying and assigning arguments to semantic roles (e.g., Sequa is A0, i.e., an agent / 'doer' for the corresponding predicate, and engines is A1, i.e., a patient / 'affected entity'). SRL is often regarded as an important step in the standard NLP pipeline, providing information to downstream tasks such as information extraction and question answering.
The semantic representations are closely related to syntactic ones, even though the syntax-semantics interface is far from trivial (levin1993english). For example, one can observe that many arcs in the syntactic dependency graph (shown in black below the sentence in Figure 1) are mirrored in the semantic dependency graph. Given these similarities, and also because of the availability of accurate syntactic parsers for many languages, it seems natural to exploit syntactic information when predicting semantics. Though historically most SRL approaches did rely on syntax (DBLP:conf/ecml/ThompsonLM03; DBLP:conf/conll/PradhanHWMJ05; DBLP:journals/coling/PunyakanokRY08; DBLP:conf/coling/JohanssonN08), the last generation of SRL models put syntax aside in favor of neural sequence models, namely LSTMs (zhou-xu:2015:ACL-IJCNLP; DBLP:journals/corr/MarcheggianiFT17), and outperformed syntactically-driven methods on standard benchmarks. We believe that one of the reasons for this radical choice is the lack of simple and effective methods for incorporating syntactic information into sequential neural networks (namely, at the level of words). In this paper we propose one way to address this limitation.
Specifically, we rely on graph convolutional networks (GCNs) (NIPS2015_5954; DBLP:journals/corr/KipfW16; kearnes2016molecular), a recent class of multilayer neural networks operating on graphs. For every node in the graph (in our case, a word in a sentence), a GCN encodes relevant information about its neighborhood as a real-valued feature vector. GCNs have been studied largely in the context of undirected unlabeled graphs. We introduce a version of GCNs suited to modeling syntactic dependency structures and generally applicable to labeled directed graphs.
A one-layer GCN encodes only information about immediate neighbors, and $K$ layers are needed to encode $K$-th-order neighborhoods (i.e., information about nodes at most $K$ hops away). This contrasts with recurrent and recursive neural networks (DBLP:journals/cogsci/Elman90; socher-EtAl:2013:EMNLP) which, at least in theory, can capture statistical dependencies across unbounded paths in a tree or a sequence. However, as we will further discuss in Section 3.3, this is not a serious limitation when GCNs are used in combination with encoders based on recurrent networks (LSTMs). When we stack GCNs on top of LSTM layers, we obtain a substantial improvement over an already state-of-the-art LSTM SRL model, resulting in the best reported scores on the standard benchmark (CoNLL-2009), both for English and Chinese. (The code is available at https://github.com/diegma/neural-dep-srl.)
Interestingly, again unlike recursive neural networks, GCNs do not constrain the graph to be a tree. We believe that there are many applications in NLP, where GCN-based encoders of sentences or even documents can be used to incorporate knowledge about linguistic structures (e.g., representations of syntax, semantics or discourse). For example, GCNs can take as input combined syntactic-semantic graphs (e.g., the entire graph from Figure 1) and be used within downstream tasks such as machine translation or question answering. However, we leave this for future work and here solely focus on SRL.
The contributions of this paper can be summarized as follows:
we are the first to show that GCNs are effective for NLP;
we propose a generalization of GCNs suited to encoding syntactic information at word level;
we propose a GCN-based SRL model and obtain state-of-the-art results on English and Chinese portions of the CoNLL-2009 dataset;
we show that bidirectional LSTMs and syntax-based GCNs have complementary modeling power.
2 Graph Convolutional Networks
In this section we describe GCNs of DBLP:journals/corr/KipfW16. Please refer to gilmer2017neural for a comprehensive overview of GCN versions.
GCNs are neural networks operating on graphs and inducing features of nodes (i.e., real-valued vectors / embeddings) based on properties of their neighborhoods. In DBLP:journals/corr/KipfW16, they were shown to be very effective for the node classification task: the classifier was estimated jointly with a GCN, so that the induced node features were informative for the node classification problem. Depending on how many layers of convolution are used, GCNs can capture information only about immediate neighbors (with one layer of convolution) or about any nodes at most $K$ hops away (if $K$ layers are stacked on top of each other).
More formally, consider an undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ are sets of nodes and edges, respectively. DBLP:journals/corr/KipfW16 assume that edges contain all the self-loops, i.e., $(v, v) \in \mathcal{E}$ for any $v$. We can define a matrix $X$ with each of its columns $x_v$ ($v \in \mathcal{V}$) encoding node features. The vectors can either encode genuine features (e.g., this vector can encode the title of a paper if citation graphs are considered) or be a one-hot vector. The node representation $h_v$, encoding information about its immediate neighbors, is computed as

$$h_v = ReLU\Big( \sum_{u \in \mathcal{N}(v)} \big( W x_u + b \big) \Big),$$

where $W$ and $b$ are a weight matrix and a bias, respectively, and $\mathcal{N}(v)$ are the neighbors of $v$. (We dropped the normalization factors used in DBLP:journals/corr/KipfW16, as they are not used in our syntactic GCNs.) Note that $v \in \mathcal{N}(v)$ (because of self-loops), so the input feature representation of $v$ (i.e., $x_v$) affects its induced representation $h_v$.
As in standard convolutional networks (lecun-01a), by stacking GCN layers one can incorporate higher-degree neighborhoods:

$$h_v^{(k+1)} = ReLU\Big( \sum_{u \in \mathcal{N}(v)} \big( W^{(k)} h_u^{(k)} + b^{(k)} \big) \Big),$$

where $k$ denotes the layer number and $h_v^{(1)} = x_v$.
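As an illustration, the layer computation above can be sketched in plain numpy (a minimal sketch; the function and variable names are ours, not from the paper):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gcn_layer(H, neighbors, W, b):
    """One GCN layer: the new representation of node v sums a linear
    transformation of every neighbor u in N(v) (N(v) includes v itself
    via a self-loop) and applies a ReLU non-linearity."""
    H_new = np.zeros_like(H)
    for v, nbrs in neighbors.items():
        H_new[v] = relu(sum(W @ H[u] + b for u in nbrs))
    return H_new
```

Feeding the output back in as input implements layer stacking: after $K$ applications, each node representation covers its $K$-hop neighborhood.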
3 Syntactic GCNs
As syntactic dependency trees are directed and labeled (we refer to the dependency labels as syntactic functions), we first need to modify the computation in order to incorporate label information (Section 3.1). In the subsequent section, we incorporate gates in GCNs, so that the model can decide which edges are more relevant to the task in question. Having gates is also important as we rely on automatically predicted syntactic representations, and the gates can detect and downweight potentially erroneous edges.
3.1 Incorporating directions and labels
Now, we introduce a generalization of GCNs appropriate for syntactic dependency trees and, more generally, for directed labeled graphs. First note that there is no reason to assume that information flows only along the syntactic dependency arcs (e.g., from makes to Sequa), so we allow it to flow in the opposite direction as well (i.e., from dependents to heads). We use a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where the edge set contains all pairs of nodes (i.e., words) adjacent in the dependency tree. In our example, both (Sequa, makes) and (makes, Sequa) belong to the edge set. The graph is labeled, and the label $lab(u, v)$ for $(u, v) \in \mathcal{E}$ contains both information about the syntactic function and indicates whether the edge is in the same or opposite direction as the syntactic dependency arc. For example, the label for (makes, Sequa) is subj, whereas the label for (Sequa, makes) is subj′, with the apostrophe indicating that the edge is in the direction opposite to the corresponding syntactic arc. Similarly, self-loops will have label self. Consequently, we can simply assume that the GCN parameters are label-specific, resulting in the following computation, also illustrated in Figure 2:

$$h_v^{(k+1)} = ReLU\Big( \sum_{u \in \mathcal{N}(v)} \big( W^{(k)}_{lab(u,v)} h_u^{(k)} + b^{(k)}_{lab(u,v)} \big) \Big).$$
This model is over-parameterized (the Chinese and English CoNLL-2009 datasets use 41 and 48 different syntactic functions, which would result in 83 and 97 different matrices in every layer, respectively), especially given that SRL datasets are moderately sized by deep learning standards. So instead of learning the matrices $W^{(k)}_{lab(u,v)}$ directly, we define them as

$$W^{(k)}_{lab(u,v)} = V^{(k)}_{dir(u,v)},$$

where $dir(u, v)$ indicates whether the edge $(u, v)$ is (1) directed along the syntactic dependency arc, (2) directed in the opposite direction, or (3) a self-loop. Our simplification captures the intuition that information should be propagated differently along edges depending on whether this is a head-to-dependent or dependent-to-head edge (i.e., along or opposite the corresponding syntactic arc) and whether it is a self-loop. So we do not share any parameters between these three very different edge types. Syntactic functions are important, but perhaps less crucial, so they are encoded only in the biases $b^{(k)}_{lab(u,v)}$.
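A minimal sketch of this parameterization (names are ours): weight matrices are shared per edge direction, while biases stay label-specific.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def syntactic_gcn_layer(H, in_edges, V_dir, b_lab):
    """Syntactic GCN layer (without gates). `in_edges` maps each node v
    to triples (u, direction, label), where direction is 'along',
    'opposite' or 'self', and label is a syntactic function such as
    'subj', "subj'" or 'self'. V_dir holds one weight matrix per
    direction; b_lab holds one bias vector per label."""
    H_new = np.zeros_like(H)
    for v, edges in in_edges.items():
        H_new[v] = relu(sum(V_dir[d] @ H[u] + b_lab[lab]
                            for u, d, lab in edges))
    return H_new
```

For the running example "Sequa makes", node makes would receive a message from Sequa over an 'along'/subj edge plus its own self-loop message, while Sequa receives a message from makes over an 'opposite'/subj′ edge.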
3.2 Edge-wise gating
Uniformly accepting information from all neighboring nodes may not be appropriate for the SRL setting. For example, we see in Figure 1 that many semantic arcs just mirror their syntactic counterparts, so they may need to be up-weighted. Moreover, we rely on automatically predicted syntactic structures, and, even for English, syntactic parsers are far from perfect, especially when used out-of-domain. It is risky for a downstream application to rely on a potentially wrong syntactic edge, so the corresponding message in the neural network may need to be down-weighted.
In order to address the above issues, inspired by recent literature (DBLP:conf/nips/OordKEKVG16; DBLP:journals/corr/DauphinFAG16), we calculate for each edge (i.e., node pair) a scalar gate of the form

$$g^{(k)}_{u,v} = \sigma\big( h_u^{(k)} \cdot \hat{v}^{(k)}_{dir(u,v)} + \hat{b}^{(k)}_{lab(u,v)} \big),$$

where $\sigma$ is the logistic sigmoid function, and $\hat{v}^{(k)}_{dir(u,v)}$ and $\hat{b}^{(k)}_{lab(u,v)}$ are a weight vector and a bias for the gate. With this additional gating mechanism, the final syntactic GCN computation is formulated as

$$h_v^{(k+1)} = ReLU\Big( \sum_{u \in \mathcal{N}(v)} g^{(k)}_{u,v} \big( V^{(k)}_{dir(u,v)} h_u^{(k)} + b^{(k)}_{lab(u,v)} \big) \Big).$$
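The gated computation can be sketched as follows (a minimal sketch with our own names; each message is scaled by a scalar gate before the sum):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def gated_syntactic_gcn_layer(H, in_edges, V_dir, b_lab, v_gate, b_gate):
    """Syntactic GCN layer with edge-wise gates: each incoming message
    is scaled by a scalar gate computed from the source state, a
    direction-specific gate vector and a label-specific gate bias."""
    H_new = np.zeros_like(H)
    for v, edges in in_edges.items():
        total = np.zeros(H.shape[1])
        for u, d, lab in edges:
            g = sigmoid(H[u] @ v_gate[d] + b_gate[lab])  # scalar in (0, 1)
            total += g * (V_dir[d] @ H[u] + b_lab[lab])
        H_new[v] = relu(total)
    return H_new
```

With all gate parameters at zero every gate equals 0.5, i.e., the layer reduces to a uniformly down-weighted version of the ungated one; training moves the gates away from this neutral point, down-weighting unreliable edges.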
3.3 Complementarity of GCNs and LSTMs
The inability of GCNs to capture dependencies between nodes far away from each other in the graph may seem like a serious problem, especially in the context of SRL: paths between predicates and arguments often include many dependency arcs (roth-lapata:2016:P16-1). However, when graph convolution is performed on top of LSTM states (i.e., LSTM states serve as input to GCN) rather than static word embeddings, GCN may not need to capture more than a couple of hops.
To elaborate on this, let us speculate what role GCNs would play when used in combination with LSTMs, given that LSTMs have already been shown to be very effective for SRL (zhou-xu:2015:ACL-IJCNLP; DBLP:journals/corr/MarcheggianiFT17). Though LSTMs are capable of capturing at least some degree of syntax (DBLP:journals/tacl/LinzenDG16) without explicit syntactic supervision, SRL datasets are moderately sized, so LSTM models may still struggle with harder cases. Typically, harder cases for SRL involve arguments far away from their predicates. In fact, 20% and 30% of arguments are more than 5 tokens away from their predicate in our English and Chinese collections, respectively. However, if we imagine that we can 'teleport' even over a single (longest) syntactic dependency edge, the 'distance' would shrink: only 9% and 13% of arguments would then be more than 5 LSTM steps away (again for English and Chinese, respectively). GCNs provide this 'teleportation' capability. These observations suggest that LSTMs and GCNs may be complementary, and we will see that empirical results support this intuition.
4 Syntax-Aware Neural SRL Encoder
In this work, we build our semantic role labeler on top of the syntax-agnostic LSTM-based SRL model of DBLP:journals/corr/MarcheggianiFT17, which already achieves state-of-the-art results on the CoNLL-2009 English dataset. Following their approach, we employ the same bidirectional LSTM (BiLSTM) encoder and enrich it with a syntactic GCN.
The CoNLL-2009 benchmark assumes that predicate positions are already marked in the test set (e.g., we would know that makes, repairs and engines in Figure 1 are predicates), so no predicate identification is needed. Also, as we focus here solely on identifying arguments and labeling them with semantic roles, for predicate disambiguation (i.e., marking makes as make.01) we use an off-the-shelf disambiguation model (roth-lapata:2016:P16-1; DBLP:conf/conll/BjorkelundHN09). As in DBLP:journals/corr/MarcheggianiFT17 and in most previous work, we process individual predicates in isolation, so for each predicate, our task reduces to a sequence labeling problem. That is, given a predicate (e.g., disputed in Figure 3), one needs to identify and label all its arguments (e.g., label estimates as A1 and label those as 'NULL', indicating that those is not an argument of disputed).
The semantic role labeler we propose is composed of four components (see Figure 3):
look-ups of word embeddings;
a BiLSTM encoder that takes as input the word representation of each word in a sentence;
a syntax-based GCN encoder that re-encodes the BiLSTM representation based on the automatically predicted syntactic structure of the sentence;
a role classifier that takes as input the GCN representation of the candidate argument and the representation of the predicate to predict the role associated with the candidate word.
4.1 Word representations
For each word $w_i$ in the considered sentence, we create a sentence-specific word representation $x_i$. (We drop the index $i$ from the notation below for the sake of brevity.) We represent each word as the concatenation of four vectors: a randomly initialized word embedding $x^{re}$, a pre-trained word embedding $x^{pe}$ estimated on an external text collection, a randomly initialized part-of-speech tag embedding $x^{pos}$, and a randomly initialized lemma embedding $x^{le}$ (active only if the word is a predicate). The randomly initialized embeddings $x^{re}$, $x^{pos}$, and $x^{le}$ are fine-tuned during training, while the pre-trained ones are kept fixed. The final word representation is given by $x = x^{re} \circ x^{pe} \circ x^{pos} \circ x^{le}$, where $\circ$ represents the concatenation operator.
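The concatenation step is simple but worth pinning down; a sketch (embedding values and names are illustrative, not from the paper):

```python
import numpy as np

def word_representation(x_re, x_pe, x_pos, x_le, is_predicate):
    """Concatenate the four embeddings; the lemma embedding is active
    (non-zero) only when the word is a predicate."""
    lemma = x_le if is_predicate else np.zeros_like(x_le)
    return np.concatenate([x_re, x_pe, x_pos, lemma])
```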
4.2 Bidirectional LSTM layer
One of the most popular and effective ways to represent sequences, such as sentences (DBLP:conf/interspeech/MikolovKBCK10), is to use recurrent neural networks (RNNs) (DBLP:journals/cogsci/Elman90). In particular, their gated versions, Long Short-Term Memory (LSTM) networks (DBLP:journals/neco/HochreiterS97) and Gated Recurrent Units (GRUs) (DBLP:conf/emnlp/ChoMGBBSB14), have proven effective in modeling long sequences (DBLP:journals/tacl/ChiuN16; DBLP:conf/nips/SutskeverVL14).
Formally, an LSTM can be defined as a function that takes as input the sequence $x_{1:t}$ and returns a hidden state $h_t$. This state can be regarded as a representation of the sentence from the start to position $t$, or, in other words, it encodes the word at position $t$ along with its left context. However, the right context is also important, so Bidirectional LSTMs (graves2008supervised) use two LSTMs: one for the forward pass and another for the backward pass, $LSTM^f$ and $LSTM^b$, respectively. By concatenating the states of both LSTMs, we create a complete context-aware representation of a word: $BiLSTM(x_{1:n}, t) = LSTM^f(x_{1:t}) \circ LSTM^b(x_{n:t})$. We follow DBLP:journals/corr/MarcheggianiFT17 and stack multiple layers of bidirectional LSTMs, where each layer takes the lower layer as its input.
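To make the bidirectional concatenation concrete, here is a sketch that uses a plain tanh RNN cell as a stand-in for the LSTM (the cell choice and all names are ours; the forward/backward concatenation scheme is the point):

```python
import numpy as np

def rnn_states(X, W, U, b):
    """Run a simple tanh RNN over the rows of X and return every hidden state."""
    h = np.zeros(U.shape[0])
    states = []
    for x in X:
        h = np.tanh(W @ x + U @ h + b)
        states.append(h)
    return np.stack(states)

def bidirectional_states(X, fwd_params, bwd_params):
    """Context-aware word representations: the forward state over
    x_1..x_t concatenated with the backward state over x_n..x_t."""
    fwd = rnn_states(X, *fwd_params)
    bwd = rnn_states(X[::-1], *bwd_params)[::-1]
    return np.concatenate([fwd, bwd], axis=1)
```

Each output row thus has twice the hidden dimensionality and summarizes both the left and the right context of the corresponding word.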
4.3 Graph convolutional layer
The representation calculated with the BiLSTM encoder is fed as input to a GCN of the form defined in Equation (4). The neighboring nodes of a node $v$, namely $\mathcal{N}(v)$, and their relations to $v$ are predicted by an external syntactic parser.
4.4 Semantic role classifier
The classifier predicts semantic roles of words given the predicate while relying on word representations provided by the GCN: we concatenate the hidden states of the candidate argument word and the predicate word and use them as input to a classifier (Figure 3, top). The softmax classifier computes the probability of the role $r$ (including the special 'NULL' role):

$$p(r \mid t_i, p) \propto \exp\big( W_{l_p, r} (h_{t_i} \circ h_p) \big),$$

where $h_{t_i}$ and $h_p$ are representations produced by the graph convolutional encoder for the candidate argument $t_i$ and the predicate $p$, and $l_p$ is the lemma of predicate $p$. (We abuse the notation and refer to $p$ as both the predicate word and its position in the sentence.) As in fitzgerald-EtAl:2015:EMNLP and DBLP:journals/corr/MarcheggianiFT17, instead of using a fixed matrix $W_{l_p, r}$ or simply assuming that $W_{l_p, r} = W_r$, we jointly embed the role $r$ and predicate lemma $l_p$ using a non-linear transformation:

$$W_{l_p, r} = ReLU\big( U (u_{l_p} \circ v_r) \big),$$

where $U$ is a parameter matrix, whereas $u_{l_p}$ and $v_r$ are randomly initialized embeddings of predicate lemmas and roles. In this way, each role prediction is predicate-specific and, at the same time, we expect to learn a good representation for roles associated with infrequent predicates. As our training objective we use the categorical cross-entropy.
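The scoring step can be sketched as follows (a minimal sketch with our own names; the softmax runs over all roles, including 'NULL'):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def role_distribution(h_arg, h_pred, u_lemma, role_embs, U):
    """For each role r, build the lemma-and-role-specific weight
    W = ReLU(U (u_lemma ∘ v_r)), dot it with the concatenated argument
    and predicate states, and softmax the resulting scores."""
    rep = np.concatenate([h_arg, h_pred])
    scores = np.array([relu(U @ np.concatenate([u_lemma, v_r])) @ rep
                       for v_r in role_embs])
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()
```

Because `u_lemma` enters every role's weight vector, roles of rare predicates can still share statistical strength with roles of frequent ones through the shared matrix `U`.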
5.1 Datasets and parameters
We tested the proposed SRL model on the English and Chinese CoNLL-2009 datasets with standard splits into training, test, and development sets. The predicted POS tags for both languages were provided by the CoNLL-2009 shared-task organizers. For predicate disambiguation we used the disambiguator of roth-lapata:2016:P16-1 for English and that of DBLP:conf/conll/BjorkelundHN09 for Chinese. We parsed English sentences with the BIST Parser (TACL885), whereas for Chinese we used the automatically predicted parses provided by the CoNLL-2009 shared-task organizers.
For English, we used the external embeddings of DBLP:conf/acl/DyerBLMS15, learned using the structured skip n-gram approach of ling-EtAl:2015:NAACL-HLT. For Chinese, we used external embeddings produced with the neural language model of DBLP:journals/jmlr/BengioDVJ03. We used edge dropout in the GCN: when computing $h_v^{(k+1)}$, we ignore each node $u \in \mathcal{N}(v)$ with probability $\beta$. Adam (kingma2014adam) was used as an optimizer. The hyperparameter tuning and all model selection were performed on the English development set; the chosen values are shown in Appendix A.
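Edge dropout can be sketched as follows (names are ours; $\beta$ matches the description above):

```python
import numpy as np

def drop_neighbors(neighbors, beta, rng):
    """Edge dropout: while computing the new state of a node, ignore
    each of its neighbors independently with probability beta."""
    return [u for u in neighbors if rng.random() >= beta]
```

At test time, dropout is disabled (i.e., $\beta = 0$), so the full neighborhood is used.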
5.2 Results and discussion
Table 1 (excerpt): SRL results on the English development set.

| System | P | R | F1 |
|---|---|---|---|
| LSTMs + GCNs (K=1) | 85.2 | 81.6 | 83.3 |
| LSTMs + GCNs (K=2) | 84.1 | 81.4 | 82.7 |
| LSTMs + GCNs (K=1), no gates | 84.7 | 81.4 | 83.0 |
| GCNs (no LSTMs), K=1 | 79.9 | 70.4 | 74.9 |
| GCNs (no LSTMs), K=2 | 83.4 | 74.6 | 78.7 |
| GCNs (no LSTMs), K=3 | 83.6 | 75.8 | 79.5 |
| GCNs (no LSTMs), K=4 | 82.7 | 76.0 | 79.2 |
Table 2 (excerpt): SRL results on the Chinese development set.

| System | P | R | F1 |
|---|---|---|---|
| LSTMs + GCNs (K=1) | 79.9 | 74.4 | 77.1 |
| LSTMs + GCNs (K=2) | 78.7 | 74.0 | 76.2 |
| LSTMs + GCNs (K=1), no gates | 78.2 | 74.8 | 76.5 |
| GCNs (no LSTMs), K=1 | 78.7 | 58.5 | 67.1 |
| GCNs (no LSTMs), K=2 | 79.7 | 62.7 | 70.1 |
| GCNs (no LSTMs), K=3 | 76.8 | 66.8 | 71.4 |
| GCNs (no LSTMs), K=4 | 79.1 | 63.5 | 70.4 |
Table 3 (excerpt): results on the English in-domain (WSJ) test set.

| System | P | R | F1 |
|---|---|---|---|
| Ours (ensemble 3x) | 90.5 | 87.7 | 89.1 |
In order to show that GCN layers are effective, we first compare our model against its version without GCN layers (i.e., essentially the model of DBLP:journals/corr/MarcheggianiFT17). Importantly, to measure the genuine contribution of GCNs, we first tuned this syntax-agnostic model (e.g., the number of LSTM layers) to get the best possible performance on the development set. (For example, had we used only one layer of LSTMs, the gains from using GCNs would be even larger.)
We compare the syntax-agnostic model with three syntax-aware versions: one GCN layer over syntax (K=1), a one-layer GCN without gates, and two GCN layers (K=2). As we rely on the same off-the-shelf disambiguator for all versions of the model, in Tables 1 and 2 we report SRL-only scores (i.e., predicate disambiguation is not evaluated) on the English and Chinese development sets. For both datasets, the syntax-aware model with one GCN layer (K=1) performs best, outperforming the LSTM version by 1.9% and 0.6% for Chinese and English, respectively. The reasons why the improvements on Chinese are much larger are not entirely clear (e.g., both languages have relatively fixed word order, and the syntactic parses for Chinese are considerably less accurate); this may be attributed to a higher proportion of long-distance dependencies between predicates and arguments in Chinese (see Section 3.3). Edge-wise gating (Section 3.2) also appears important: removing the gates leads to a drop of 0.3% F1 for English and 0.6% F1 for Chinese.
Stacking two GCN layers on top of the LSTM encoder (K=2) does not give any further benefit. In contrast, when BiLSTM layers are dropped altogether, stacking a second GCN layer (K=2) greatly improves performance, resulting in a 3.8% jump in F1 for English and a 3.0% jump in F1 for Chinese, and adding a third GCN layer (K=3) improves performance further. (Note that GCN layers are computationally cheaper than LSTM ones, even in our non-optimized implementation.) This suggests that extra GCN layers are effective but largely redundant with respect to what the LSTMs already capture.
In Figure 4, we show F1 scores on the English development set as a function of the distance, in tokens, between a candidate argument and its predicate. As expected, GCNs appear to be more beneficial for long-distance dependencies, as shorter ones are already accurately captured by the LSTM encoder.
We looked closer at the contribution of specific dependency relations for Chinese. In order to assess this without retraining the model multiple times, we drop all dependencies of a given type at test time (one type at a time, only for types appearing over 300 times in the development set) and observe the changes in performance. In Figure 5, we see that the most informative dependency is COMP (complement). Relative clauses in Chinese are very frequent and typically marked with the particle 的 (de). The relative clause syntactically depends on 的 as COMP, so COMP encodes important information about predicate-argument structure. These are often long-distance dependencies and may not be accurately captured by LSTMs. Although TMP (temporal) dependencies are not as frequent (2% of all dependencies), they are also important: temporal information is mirrored in semantic roles.
In order to compare to previous work, in Table 3 we report test results on the English in-domain (WSJ) evaluation data. Our model is local, as all the argument detection and labeling decisions are conditionally independent: their interaction is captured solely by the LSTM+GCN encoder. This makes our model fast and simple, though, as shown in previous work, global modeling of the structured output is beneficial (as seen in Table 3, the labelers of fitzgerald-EtAl:2015:EMNLP and roth-lapata:2016:P16-1 gained 0.6-1.0% from global modeling). We leave this extension for future work. Interestingly, we outperform even the best global model and the best ensemble of global models, without using global modeling or ensembles. When we create an ensemble of 3 models with the product-of-experts combination rule, we improve by 1.2% over the best previous result, achieving 89.1% F1. (To compare to previous work, we report combined scores which also include predicate disambiguation. As we use disambiguators from previous work (see Section 5.1), the actual gains in argument identification and labeling are even larger.)
For Chinese (Table 4), our best model outperforms the state-of-the-art model of roth-lapata:2016:P16-1 by an even larger margin of 3.1%.
For English, in the CoNLL shared task, systems are also evaluated on the out-of-domain dataset. Statistical models are typically less accurate when they are applied to out-of-domain data. Consequently, the predicted syntax for the out-of-domain test set is of lower quality, which negatively affects the quality of GCN embeddings. However, our model works surprisingly well on out-of-domain data (Table 5), substantially outperforming all the previous syntax-aware models. This suggests that our model is fairly robust to mistakes in syntax. As expected though, our model does not outperform the syntax-agnostic model of DBLP:journals/corr/MarcheggianiFT17.
Table 5 (excerpt): results on the English out-of-domain test set.

| System | P | R | F1 |
|---|---|---|---|
| Ours (ensemble 3x) | 80.8 | 77.1 | 78.9 |
6 Related Work
Perhaps the earliest methods modeling the syntax-semantics interface with RNNs are due to HendersonConll08; TitovIjcai09; TitovCoNLL09ST: they used shift-reduce parsers for joint SRL and syntactic parsing, and relied on RNNs to model statistical dependencies across syntactic and semantic parsing actions. A more modern (e.g., LSTM-based) and effective reincarnation of this line of research has been proposed in DBLP:conf/conll/SwayamdiptaBDS16. Other recent work that considered incorporating syntactic information into neural SRL models includes: fitzgerald-EtAl:2015:EMNLP, who use standard syntactic features within an MLP calculating potentials of a CRF model; roth-lapata:2016:P16-1, who enriched standard features for SRL with LSTM representations of syntactic paths between arguments and predicates; and lei-EtAl:2015:NAACL-HLT, who relied on low-rank tensor factorizations for modeling syntax. Also, foland2015 used (non-graph) convolutional networks and provided syntactic features as input. A very different line of research, but with similar goals to ours (i.e., integrating syntax with minimal feature engineering), used tree kernels (DBLP:journals/coling/MoschittiPB08).
Beyond SRL, there have been many proposals on how to incorporate syntactic information in RNN models, for example, in the context of neural machine translation (eriguchi2017learning; DBLP:conf/wmt/SennrichH16). One of the most popular and attractive approaches is to use tree-structured recursive neural networks (socher-EtAl:2013:EMNLP; DBLP:conf/emnlp/LeZ14; DBLP:conf/acl/DyerBLMS15), including stacking them on top of a sequential BiLSTM (DBLP:conf/acl/MiwaB16). The approach of DBLP:conf/emnlp/MouPLXZJ15 to sentiment analysis and question classification, introduced even before GCNs became popular in the machine learning community, is related to graph convolution. However, it is inherently single-layer and tree-specific, uses bottom-up computations, does not share parameters across syntactic functions, and does not use gates. Gates have been previously used in GCNs (li2015gated), but between GCN layers rather than for individual edges.
Previous approaches to integrating syntactic information in neural models are mainly designed to induce representations of sentences or syntactic constituents. In contrast, the approach we presented incorporates syntactic information at the word level. This may be attractive from the engineering perspective, as it can be used, as we have shown, instead of or along with RNN models.
7 Conclusions and Future Work
We demonstrated how GCNs can be used to incorporate syntactic information in neural models and specifically to construct a syntax-aware SRL model, resulting in state-of-the-art results for Chinese and English. There are relatively straightforward steps which can further improve the SRL results. For example, we relied on labeling arguments independently, whereas using a joint model is likely to significantly improve the performance.
More generally, given the simplicity of GCNs and their applicability to general graph structures (not necessarily trees), we believe that there are many NLP tasks where GCNs can be used to incorporate linguistic structures (e.g., syntactic and semantic representations of sentences, and discourse parses or co-reference graphs for documents).
We would like to thank Anton Frolov, Michael Schlichtkrull, Thomas Kipf, Michael Roth, Max Welling, Yi Zhang, and Wilker Aziz for their suggestions and comments. The project was supported by the European Research Council (ERC StG BroadSem 678254), the Dutch National Science Foundation (NWO VIDI 639.022.518) and an Amazon Web Services (AWS) grant.
Appendix A Hyperparameter values
Semantic role labeler

| Hyperparameter | Value |
|---|---|
| word embeddings (EN) | 100 |
| word embeddings (CH) | 128 |
| LSTM hidden states | 512 |
| output lemma representation | 128 |