Properly integrating external information into neural networks has received increasing attention recently Wu et al. (2018); Li et al. (2017); Strubell et al. (2018). Previous research on this topic can be roughly categorized into three classes: i) Input: The external information is presented as additional input features (i.e., dense real-valued vectors) to the neural network Collobert et al. (2011). ii) Output: The neural network is trained to predict the main task and the external information in a multi-task approach Changpinyo et al. (2018). iii) Auto-encoder: This approach, recently proposed by Wu et al. (2018), combines the Input and Output approaches during training. The simplicity of these methods allows them to be applied to many NLP sequence tasks and various neural model architectures.
However, previous studies often focus on integrating word-level shallow features such as POS or chunk tags into sequence labelling tasks. Syntactic information, which encodes long-range dependencies and global sentence structure, has not been studied as carefully. This paper fills this gap by integrating syntactic information into the sequence labelling task. We address three questions: 1) How should syntactic information be encoded as word-level features? 2) What is the best way of integrating syntactic information? and 3) What effect does the choice of syntactic representation have on performance?
We study these questions in the context of Semantic Role Labelling (SRL). An SRL system extracts the predicate-argument structure of a sentence (who did what to whom, where and when). Syntax was an essential component of early SRL systems Xue and Palmer (2004); Punyakanok et al. (2008). The state-of-the-art neural SRL systems use a neural sequence labelling model without any syntactic knowledge He et al. (2018, 2017); Tan et al. (2018). We show below that injecting external syntactic knowledge into a neural SRL sequence labelling model can improve performance, and our best model sets a new state-of-the-art for a non-ensemble SRL system.
In this paper we express the external syntactic information as vectors of discrete features, because this enables us to explore different ways of injecting the syntactic information into the neural SRL model. Specifically, we propose three different syntax encoding methods: a) a full constituency tree representation (Full-C); b) an SRL-specific span representation (SRL-C); and c) a dependency tree representation (Dep). For (a), we adapt the constituency parsing representation from Gómez-Rodríguez and Vilares (2018) and encode the tree structure as a set of features for word pairs. For (b), we use a categorical representation of the constituency spans that are most relevant to SRL tasks, based on Xue and Palmer (2004). Finally, for (c), we propose a discrete vector representation that encodes the head-modifier relationships in the dependency trees.
We evaluate the effectiveness of these encodings using three different integration methods on the SRL CoNLL’05 and CoNLL’12 benchmarks. We show that using either of the constituency representations in either the Input or the Auto-Encoder configurations produces the best performance. These results are noticeably better than a strong baseline and set a new state-of-the-art for non-ensemble SRL systems.
2 Related Work
Semantic Role Labeling (SRL) generally refers to the PropBank style of annotation Palmer et al. (2005). Broadly speaking, prior work on SRL makes use of syntactic information in two different ways. Carreras and Màrquez (2005); Pradhan et al. (2013) incorporate constituent-structure span-based information, while Hajič et al. (2009) incorporate dependency-structure information.
This information can be incorporated into an SRL system in several different ways. Swayamdipta et al. (2018) use span information from constituency parse trees as an additional training target in a multi-task learning approach, similar to one of the approaches we evaluate here. Roth and Lapata (2016) use an LSTM model to represent the dependency paths between predicates and arguments and feed the output as input features to their SRL system. Marcheggiani and Titov (2017) use a Graph Convolutional Network Niepert et al. (2016) to encode dependency parse trees in their LSTM-based SRL system. Xia et al. (2019) represent dependency parses using position-based categorical features of tree structures in a neural model. Strubell et al. (2018) use dependency trees as a supervision signal to train one of the attention heads in a self-attentive neural model.
3 Syntactic Representation
This section introduces our representations of constituency and dependency syntax trees.
3.1 Full-C: Full Constituency Representation
Gómez-Rodríguez and Vilares (2018) propose a full representation of constituency parsing trees where the string position between $w_i$ and $w_{i+1}$ is associated with the pair $(n_i - n_{i-1}, c_i)$, where $n_i$ is the number of common ancestors between $w_i$ and $w_{i+1}$ and $c_i$ is the non-terminal label at their lowest common ancestor. (The full constituency trees can be reconstructed from this representation; for details, refer to Gómez-Rodríguez and Vilares (2018).) For simplicity, we define throughout this paper. (In Gómez-Rodríguez and Vilares (2018), both $n_i$ and $n_i - n_{i-1}$ are applicable for this encoding method. Our pilot experiments show that the relative representation $n_i - n_{i-1}$ works much better than the absolute representation $n_i$.)
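The encoding described above can be illustrated with a minimal sketch. It assumes trees are given as nested tuples of the form (label, child, ...), with words as plain strings. The function names, the tuple format, and the choice of 0 as the count before the first word are illustrative assumptions, not details taken from the paper. The last word has no following word, so the sketch returns one pair per gap between adjacent words.

```python
def leaf_paths(tree, address=(), ancestors=()):
    """Yield, for every word, the chain of ancestors from the root down.

    Each ancestor is stored as (address, label); the address (a tuple of
    child indices) keeps two nodes with the same label distinct.
    """
    label, *children = tree
    here = ancestors + ((address, label),)
    for k, child in enumerate(children):
        if isinstance(child, str):
            yield child, here
        else:
            yield from leaf_paths(child, address + (k,), here)


def full_c_encoding(tree, relative=True):
    """Return one (count, label) pair per gap between adjacent words."""
    paths = [path for _, path in leaf_paths(tree)]
    pairs = []
    for left, right in zip(paths, paths[1:]):
        common = 0
        for a, b in zip(left, right):
            if a != b:
                break
            common += 1
        # The last shared node is the lowest common ancestor.
        pairs.append((common, left[common - 1][1]))
    if not relative:
        return pairs
    encoded, previous = [], 0  # assumption: the count before the first word is 0
    for count, label in pairs:
        encoded.append((count - previous, label))
        previous = count
    return encoded


example = ("S",
           ("NP", ("DT", "The"), ("NN", "cat")),
           ("VP", ("VBD", "sat"),
            ("PP", ("IN", "on"), ("NP", ("DT", "the"), ("NN", "mat")))))
print(full_c_encoding(example))
```

Pre-terminal (POS) nodes never count as common ancestors in this sketch, because each one dominates a single word.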
3.2 SRL-C: SRL Span Representation
Xue and Palmer (2004) show that, given the predicate word, only a small fraction of the constituents in the parse trees are useful for the SRL task. This means that encoding the full constituency parse tree may introduce redundant information.
Therefore, we preserve the constituent spans that are most likely to be useful for the predicate word in the trees. We re-use the pruning algorithm in Xue and Palmer (2004). Their algorithm collects the potential argument constituents by walking up the tree to the root node recursively, which filters out many irrelevant constituents from the syntax trees with 99.3% of the ground truth arguments preserved.
We encode the output of this rule-based pruning algorithm using a standard BIO (Begin-Inside-Outside) annotation scheme. Words that are outside any candidate constituent receive the tag O. Words at the beginning of a candidate constituent receive the tag B, and words inside a candidate constituent receive the tag I. We use the tag A to label words in prepositional phrases. We refer to this representation as SRL-C (Figure 1).
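The pruning and tagging steps can be sketched as follows. This is a simplified illustration, not the authors' implementation: it collects the sisters of the predicate node and of each of its ancestors while walking up to the root, then tags the collected spans with B, I and O. The handling of coordinated sisters and of prepositional phrases in Xue and Palmer (2004), and therefore the A tag, is left out. The nested-tuple tree format and all function names are assumptions made for the example.

```python
def build(tree, start, parent, nodes):
    """Index a nested-tuple tree; each node records its word span [start, end)."""
    label, *children = tree
    node = {"label": label, "start": start, "parent": parent, "children": []}
    nodes.append(node)
    position = start
    for child in children:
        if isinstance(child, str):      # a word
            position += 1
        else:                           # a sub-constituent
            sub = build(child, position, node, nodes)
            node["children"].append(sub)
            position = sub["end"]
    node["end"] = position
    return node


def candidate_spans(nodes, predicate_index):
    """Collect sister spans while walking from the predicate up to the root."""
    target = (predicate_index, predicate_index + 1)
    # Nodes are stored in pre-order, so the last match is the deepest one.
    current = [n for n in nodes if (n["start"], n["end"]) == target][-1]
    spans = []
    while current["parent"] is not None:
        for sister in current["parent"]["children"]:
            if sister is not current:
                spans.append((sister["start"], sister["end"]))
        current = current["parent"]
    return sorted(spans)


def srl_c_tags(tree, predicate_index):
    """Tag every word with B, I or O relative to the candidate spans."""
    nodes = []
    root = build(tree, 0, None, nodes)
    tags = ["O"] * root["end"]
    for start, end in candidate_spans(nodes, predicate_index):
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


example = ("S",
           ("NP", ("DT", "The"), ("NN", "cat")),
           ("VP", ("VBD", "sat"),
            ("PP", ("IN", "on"), ("NP", ("DT", "the"), ("NN", "mat")))))
print(srl_c_tags(example, predicate_index=2))
```

Because only sisters along the path to the root are collected, the candidate spans never overlap, so each word receives exactly one tag.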
3.3 Dep: Dependency Tree Representation
The dependency tree representation encodes key aspects of the head-modifier relationships within the sentence. We also consider encoding constituent edge information. The following word-level features have been proposed:
#left/right Dependents (Left / Right). The number of dependents a word has on the left and right side.
Right/Left-most Dependent (Edge). Whether the word is the Right/Left/None-most dependent of its governor.
Relative Distance to Governor (RG). The relative distance between the word and its governor.
Dependency Label (DL). The label describing the relationship between each pair of dependent and governor.
We refer to this representation as Dep (Figure 2; in this example, we assume the “root” is the first word of the sentence from the left).
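The four feature types listed above can be computed from a list of governor indices and dependency labels, as in the following sketch. Several details are assumptions made for illustration: governors are 0-based indices with -1 marking the root, the root word receives distance 0 and edge value "N", a word that is its governor's only dependent is marked "L", and the distance is computed as governor position minus word position. The function name and the dictionary keys are hypothetical.

```python
def dep_features(heads, labels):
    """Compute word-level dependency features.

    heads[i]  : index of the governor of word i, or -1 for the root
    labels[i] : dependency label between word i and its governor
    """
    dependents = [[] for _ in heads]
    for i, head in enumerate(heads):
        if head >= 0:
            dependents[head].append(i)   # stays sorted by position

    features = []
    for i, head in enumerate(heads):
        left = sum(1 for d in dependents[i] if d < i)
        right = sum(1 for d in dependents[i] if d > i)
        if head < 0:
            edge, distance = "N", 0      # assumption for the root word
        else:
            siblings = dependents[head]
            if siblings[0] == i:
                edge = "L"               # left-most dependent of its governor
            elif siblings[-1] == i:
                edge = "R"               # right-most dependent of its governor
            else:
                edge = "N"
            distance = head - i
        features.append({"Left": left, "Right": right, "Edge": edge,
                         "RG": distance, "DL": labels[i]})
    return features


# "The cat sat on the mat" with an illustrative dependency analysis.
heads = [1, 2, -1, 2, 5, 3]
labels = ["det", "nsubj", "root", "prep", "det", "pobj"]
for word, feats in zip("The cat sat on the mat".split(), dep_features(heads, labels)):
    print(word, feats)
```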
4 Injecting External Information
In this section, we introduce three different methods for integrating external syntactic information into the neural SRL system (Figure 3):
Input. This approach represents the external categorical features as trainable, high-dimensional dense embedding vectors, which are concatenated with the ELMo representation vectors in the baseline model. The constituency parse trees used as input features are produced by the parser of Kitaev and Klein (2018). The dependency trees are produced by transforming the constituency trees using the Stanford CoreNLP toolkit. This ensures that the constituency and dependency parses have a similar error distribution, helping to control for parsing quality. Our constituency and dependency parses score a state-of-the-art 95.4 F1 and 96.4% UAS on the WSJ test set respectively. We used a 20-fold cross-validation procedure to produce the data for the external syntactic input.
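The concatenation step can be pictured with a small framework-free sketch. It only shows the bookkeeping: each categorical feature value is looked up in its own embedding table, and the resulting vectors are appended to the contextual vector of the same token. The class and function names, the embedding sizes and the random initialisation are assumptions for illustration; in a real model the tables would be trainable parameters and the contextual vectors would come from ELMo.

```python
import random


class FeatureEmbedding:
    """A lookup table mapping each category to a fixed-size vector."""

    def __init__(self, dim, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.table = {}

    def __call__(self, category):
        # Create a small random vector the first time a category is seen.
        if category not in self.table:
            self.table[category] = [self.rng.uniform(-0.1, 0.1)
                                    for _ in range(self.dim)]
        return self.table[category]


def concat_inputs(contextual_vectors, feature_columns, embedders):
    """Append the embedding of every syntactic feature to each token vector.

    contextual_vectors : one vector (list of floats) per token
    feature_columns    : one list of category values per feature type
    embedders          : one FeatureEmbedding per feature type
    """
    inputs = []
    for t, base in enumerate(contextual_vectors):
        vector = list(base)
        for column, embed in zip(feature_columns, embedders):
            vector.extend(embed(column[t]))
        inputs.append(vector)
    return inputs


# Three tokens with 4-dimensional stand-ins for contextual vectors and a
# single BIO-style feature column embedded in 3 dimensions.
tokens = [[0.0] * 4, [1.0] * 4, [2.0] * 4]
inputs = concat_inputs(tokens, [["B", "I", "B"]], [FeatureEmbedding(3)])
print([len(v) for v in inputs])
```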
Output. In this approach, our model predicts both SRL sequence tags and syntactic features (encoded as the word-level features above) simultaneously. We use a log loss for each categorical feature. The final training loss is the multi-task objective $\mathcal{L} = \mathcal{L}_{\text{SRL}} - \sum_{i=1}^{N} \log P(y_i^{*})$, where $P(y_i^{*})$ is the probability of generating $y_i^{*}$ as the $i$-th feature ($N$ features in total, with $N$ depending on whether SRL-C, Full-C or Dep is used) and $y_i^{*}$ is the ground truth for the $i$-th feature. Gold training data was used as the external syntactic information for the multi-task output setting, as this external information is not required at test time.
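As a rough illustration of the multi-task objective, the sketch below adds a log loss for every syntactic feature to the SRL loss. It assumes an unweighted sum over features and tokens, which the text does not confirm. The function names and the dictionary-based probability distributions are hypothetical.

```python
import math


def log_loss(distribution, gold):
    """Negative log-probability assigned to the gold category."""
    return -math.log(distribution[gold])


def multitask_loss(srl_loss, feature_distributions, gold_features):
    """SRL loss plus the log loss of every syntactic feature at every token.

    feature_distributions[i][t] : predicted distribution (dict) for feature i
                                  at token t
    gold_features[i][t]         : gold category for feature i at token t
    """
    total = srl_loss
    for distributions, golds in zip(feature_distributions, gold_features):
        for distribution, gold in zip(distributions, golds):
            total += log_loss(distribution, gold)
    return total


# One feature type (BIO-style tags) over two tokens.
predicted = [[{"B": 0.5, "I": 0.25, "O": 0.25},
              {"B": 0.25, "I": 0.5, "O": 0.25}]]
gold = [["B", "I"]]
print(multitask_loss(1.0, predicted, gold))
```

With the values above, each token contributes -log 0.5 to the total, so the syntactic term is 2 log 2 on top of the SRL loss of 1.0.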
Auto-encoder. Following Wu et al. (2018), we use external information as input features and as a multi-task training objective simultaneously, so the system behaves somewhat like an auto-encoder. This auto-encoder has to reproduce in its output the syntactic information that it is fed in its input, encouraging it to incorporate this information in its internal representations. The input and output representations are the same as above.
We evaluate 10 different models (the 3 ways of using external information by 3 different encodings of syntax, plus a baseline model) on the CoNLL’05 Carreras and Màrquez (2005) and CoNLL’12 Pradhan et al. (2013) benchmarks, under the evaluation setting where the gold predicate is given. The CoNLL’05 benchmark uses the WSJ and Brown test sets for in-domain and out-of-domain evaluation respectively.
5.1 Main Results
Table 1 shows the effect of using the three different kinds of external syntactic information in the three different ways just described. When used as input features, all three representations improve over our baseline system. This shows that syntactic representations provide useful information for the SRL task beyond what the dynamic context embeddings from ELMo already capture.
Models using constituency representations are 0.3% - 0.6% better than the models using the dependency representations. This might be because constituents align more directly with SRL arguments and constituency information is easier to use.
SRL-C is slightly better than Full-C for in-domain evaluation. The advantage of SRL-C is greater on the out-of-domain (Brown) evaluation, with a margin of 0.4%. This could be because Full-C is more sensitive to parsing errors than SRL-C. When we compare gold and automatic parser representations on the Brown data, 10.5% of the words get different Full-C features while only 7.9% get different SRL-C features.
External Information Injection
Table 1 shows that, at least on this task, multi-task learning does not perform as well as adding external information as additional input features. The Input and Auto-Encoder methods work equally well, so we conclude that the extra complexity of the auto-encoder model is not justified. In particular, Dep with auto-encoder hurts SRL accuracy (0.6% behind the model with the constituency features).
5.2 Comparison with existing systems
We compare our best system (SRL-C used as Input) with previous work in Table 2. We improve upon the state-of-the-art results for non-ensemble SRL models on the in-domain test sets by 0.6% and 0.2% on CoNLL’05 and CoNLL’12 respectively. Our model also achieves a competitive result on the CoNLL’05 Brown test set. Compared with the strong ensemble model of Ouchi et al. (2018), our model is only 0.3% and 0.6% lower on the two benchmarks respectively.
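Table 2: Comparison with existing systems (F1).

|System|CoNLL’05 WSJ|CoNLL’05 Brown|CoNLL’12|
|---|---|---|---|
|Single models| | | |
|Strubell et al. (2018)|86.0|76.5|-|
|Xia et al. (2019)|86.9|76.8|-|
|He et al. (2018)|87.4|80.4|85.5|
|Ouchi et al. (2018)|87.6|78.7|86.2|
|Our best model|88.2|79.3|86.4|
|Ensemble models| | | |
|Xia et al. (2019)|87.8|78.8|-|
|Ouchi et al. (2018)|88.5|79.6|87.0|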
5.3 Using Gold Parse Trees
Finally, we conduct an oracle experiment where all syntactic features are derived from gold trees. Our model performance improves by around 3% - 4% F1 score (see Table 3). This bounds the improvement in SRL that one can expect with improved syntactic parses.
|Model|CoNLL’05 WSJ|CoNLL’05 Brown|CoNLL’12|
|---|---|---|---|
|Our best model|88.2|79.3|86.4|
6 Conclusion and Future Work
This paper evaluated three different ways of representing external syntactic parses, and three different ways of injecting that information into a state-of-the-art SRL system. We showed that representing the external syntactic information as constituents was most effective. Using the external syntactic information as input features was far more effective than a multi-task learning approach, and just as effective as an auto-encoder approach. Our best system sets a new state-of-the-art for non-ensemble SRL systems on in-domain data.
In future work we will explore how external information is best used in ensembles of models for SRL and other tasks. For example, is it better for all the models in an ensemble to use the same external information, or is it more effective if they make use of different kinds of information? We will also investigate whether the choice of method for injecting external information has the same impact on other NLP tasks as it does on SRL.
This research was supported by the Australian Research Council’s Discovery Projects funding scheme (project number DPs 160102156, 170103710, 180103411), D2DCRC (DC25002, DC25003), and in part by CSIRO Data61.
References
- Carreras and Màrquez (2005). Introduction to the CoNLL-2005 shared task: semantic role labeling. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005), pp. 152–164.
- Changpinyo et al. (2018). Multi-task learning for sequence tagging: an empirical study. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 2965–2977.
- Collobert et al. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research 12, pp. 2493–2537.
- Gómez-Rodríguez and Vilares (2018). Constituent parsing as sequence labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1314–1324.
- Hajič et al. (2009). The CoNLL-2009 shared task: syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL 2009): Shared Task, Boulder, Colorado, pp. 1–18.
- He et al. (2018). Jointly predicting predicates and arguments in neural semantic role labeling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 364–369.
- He et al. (2017). Deep semantic role labeling: what works and what’s next. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 473–483.
- Kitaev and Klein (2018). Constituency parsing with a self-attentive encoder. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2676–2686.
- Li et al. (2017). Modeling source syntax for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 688–697.
- Marcheggiani and Titov (2017). Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 1506–1515.
- Niepert et al. (2016). Learning convolutional neural networks for graphs. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, M. Balcan and K. Q. Weinberger (Eds.), JMLR Workshop and Conference Proceedings, Vol. 48, pp. 2014–2023.
- Ouchi et al. (2018). A span selection model for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1630–1642.
- Palmer et al. (2005). The proposition bank: an annotated corpus of semantic roles. Computational Linguistics 31 (1), pp. 71–106.
- Peters et al. (2018). Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237.
- Pradhan et al. (2013). Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pp. 143–152.
- Punyakanok et al. (2008). The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics 34 (2), pp. 257–287.
- Roth and Lapata (2016). Neural semantic role labeling with dependency path embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1192–1202.
- Strubell et al. (2018). Linguistically-informed self-attention for semantic role labeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 5027–5038.
- Swayamdipta et al. (2018). Syntactic scaffolds for semantic structures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3772–3782.
- Tan et al. (2018). Deep semantic role labeling with self-attention. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pp. 4929–4936.
- Wu et al. (2018). Evaluating the utility of hand-crafted features in sequence labelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2850–2856.
- Xia et al. (2019). Syntax-aware neural semantic role labeling. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), Honolulu, Hawaii, USA, Jan 27-Feb 1, 2019.
- Xue and Palmer (2004). Calibrating features for semantic role labeling. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 88–94.