Semantic Role Labeling (SRL, palmer2010semantic) is the task of labeling semantic arguments of predicates in sentences to identify who does what to whom. Such representations can come in handy in tasks involving text understanding, such as coreference resolution ponzetto-strube-2006-exploiting and reading comprehension (e.g., berant-etal-2014-modeling; zhang2019semantics). This paper focuses on the question of how knowledge can influence modern semantic role labeling models.
Linguistic knowledge can help SRL models in several ways. For example, syntax can drive feature design (e.g., punyakanok2005necessity; toutanova-etal-2005-joint; kshirsagar-etal-2015-frame; johansson-nugues-2008-dependency, and others), and can also be embedded into neural network architectures strubell2018linguistically.
In addition to such influences on input representations, knowledge about the nature of semantic roles can inform structured decoding algorithms used to construct the outputs. The SRL literature is witness to a rich array of techniques for structured inference, including integer linear programs (e.g., punyakanok2005necessity; punyakanok2008importance), bespoke inference algorithms (e.g., tackstrom2015efficient), A* decoding (e.g., he-etal-2017-deep), greedy heuristics (e.g., ouchi-etal-2018-span), or simple Viterbi decoding to ensure that token tags are BIO-consistent.
By virtue of being constrained by the definition of the task, global inference promises semantically meaningful outputs, and could provide valuable signal when models are being trained. However, beyond Viterbi decoding, it may impose prohibitive computational costs, thus ruling out using inference during training. Indeed, optimal inference may be intractable, and inference-driven training may require ignoring certain constraints that render inference difficult.
While global inference was a mainstay of SRL models until recently, today’s end-to-end trained neural architectures have shown remarkable successes without needing decoding. These successes can be attributed to the expressive input and internal representations learned by neural networks. The only structured component used with such models, if at all, involves sequential dependencies between labels that admit efficient decoding.
In this paper, we ask: Can we train neural network models for semantic roles in the presence of general output constraints, without paying the high computational cost of inference? We propose a structured tuning approach that exposes a neural SRL model to differentiable constraints during the finetuning step. To do so, we first write the output space constraints as logic rules. Next, we relax such statements into differentiable forms that serve as regularizers to inform the model at training time. Finally, during inference, our structure-tuned models are free to make their own judgments about labels without any inference algorithms beyond a simple linear sequence decoder.
We evaluate our structured tuning on the CoNLL-05 carreras-marquez-2005-introduction and CoNLL-12 English SRL pradhan-etal-2013-towards shared task datasets, and show that by learning to comply with declarative constraints, trained models can make more consistent and more accurate predictions. We instantiate our framework on top of a strong baseline system based on the RoBERTa liu2019roberta encoder, which by itself performs on par with previous best SRL models that are not ensembled. We evaluate the impact of three different types of constraints. Our experiments on the CoNLL-05 data show that our constrained models outperform the baseline system by F1 on the WSJ section and F1 on the Brown test set. Even with the larger and cleaner CoNLL-12 data, our constrained models show improvements without introducing any additional trainable parameters. Finally, we also evaluate the effectiveness of our approach on low training data scenarios, and show that constraints can be more impactful when we do not have large training sets.
In summary, our contributions are:
We present a structured tuning framework for SRL which uses soft constraints to improve models without introducing additional trainable parameters. (Footnote 1: Our code to replay our experiments is archived at https://github.com/utahnlp/structured_tuning_srl.)
Our framework outperforms strong baseline systems, and shows especially large improvements in low data regimes.
2 Model & Constraints
In this section, we will introduce our structured tuning framework for semantic role labeling. In §2.1, we will briefly cover the baseline system. To that, we will add three constraints, all treated as combinatorial constraints requiring inference algorithms in past work: Unique Core Roles in §2.3, Exclusively Overlapping Roles in §2.4, and Frame Core Roles in §2.5. For each constraint, we will discuss how to use its softened version during training.
We should point out that the specific constraints chosen serve as a proof-of-concept for the general methodology of tuning with declarative knowledge. For simplicity, for all our experiments, we use the ground truth predicates and their senses.
We use RoBERTa liu2019roberta base version to develop our baseline SRL system. The large number of parameters not only allows it to make fast and accurate predictions, but also offers the capacity to learn from the rich output structure, including the constraints from the subsequent sections.
Our base system is a standard BIO tagger, briefly outlined below. Given a sentence $s$, the goal is to assign a label of the form B-X, I-X or O for each word $i$ being an argument with label X for a predicate at word $u$. These unary decisions are scored as follows:
$$e = \mathrm{map}(\mathrm{RoBERTa}(s))$$
$$v_u,\; a_i = f_v(e_u),\; f_a(e_i)$$
$$\phi_{u,i} = f_{va}([v_u, a_i])$$
$$y_{u,i} = g(\phi_{u,i})$$
Here, map converts the wordpiece embeddings to whole word embeddings $e$ by summation, $f_v$ and $f_a$ are linear transformations of the predicate and argument embeddings respectively, $f_{va}$ is a two-layer ReLU with concatenated inputs, and finally $g$ is a linear layer followed by softmax activation that predicts a probability distribution over labels for each word $i$ when $u$ is a predicate. In addition, we also have a standard first-order sequence model over label sequences for each predicate in the form of a CRF layer that is Viterbi decoded. We use the standard cross-entropy loss to train the model.
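To make the scoring pipeline above concrete, here is a minimal pure-Python sketch. It is only an illustration under stated assumptions: the encoder output is replaced by hand-written toy vectors, the dimensions and random weights are arbitrary, the parameter names (`predicate`, `argument`, `hidden1`, `hidden2`, `output`) are hypothetical, and "two-layer ReLU" is read as two linear layers each followed by a ReLU. The CRF layer is not included.

```python
import math
import random


def sum_wordpieces(piece_vecs, word_ids):
    """Sum the wordpiece vectors that belong to the same word."""
    dim = len(piece_vecs[0])
    words = [[0.0] * dim for _ in range(max(word_ids) + 1)]
    for vec, w in zip(piece_vecs, word_ids):
        for d, value in enumerate(vec):
            words[w][d] += value
    return words


def linear(x, weight, bias):
    """Affine map; weight is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    top = max(x)
    exps = [math.exp(v - top) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_linear(rng, out_dim, in_dim):
    """Small random weights and zero bias (illustrative initialization only)."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def score_labels(word_vecs, u, params):
    """Label distribution for every word i, given that word u is the predicate."""
    v_u = linear(word_vecs[u], *params["predicate"])
    rows = []
    for e_i in word_vecs:
        a_i = linear(e_i, *params["argument"])
        hidden = relu(linear(v_u + a_i, *params["hidden1"]))  # "+" concatenates lists
        hidden = relu(linear(hidden, *params["hidden2"]))
        rows.append(softmax(linear(hidden, *params["output"])))
    return rows


rng = random.Random(0)
dim, hidden_dim, num_labels = 4, 6, 5  # toy sizes, not the paper's
params = {
    "predicate": init_linear(rng, dim, dim),
    "argument": init_linear(rng, dim, dim),
    "hidden1": init_linear(rng, hidden_dim, 2 * dim),
    "hidden2": init_linear(rng, hidden_dim, hidden_dim),
    "output": init_linear(rng, num_labels, hidden_dim),
}
# Three wordpieces forming two words: pieces 0 and 1 belong to word 0.
pieces = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]]
words = sum_wordpieces(pieces, [0, 0, 1])
probs = score_labels(words, u=1, params=params)
```

Each row of `probs` is a distribution over the (toy) label set for one word, conditioned on the chosen predicate position.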
2.2 Designing Constraints
Before looking at the specifics of individual constraints, let us first look at a broad overview of our methodology. We will see concrete examples in the subsequent sections.
Output space constraints serve as prior domain knowledge for the SRL task. We will design our constraints as invariants at the training stage. To do so, we will first define constraints as statements in logic. Then we will systematically relax these Boolean statements into differentiable forms using concepts borrowed from the study of triangular norms (t-norms, klement2013triangular). Finally, we will treat these relaxations as regularizers in addition to the standard cross-entropy loss.
All the constraints we consider are conditional statements of the form:
$$\forall x,\; L(x) \rightarrow R(x)$$
where the left- and the right-hand sides, $L(x)$ and $R(x)$ respectively, can be either disjunctive or conjunctive expressions. The literals that constitute these expressions are associated with classification neurons, i.e., the predicted output probabilities are soft versions of these literals.
What we want is that model predictions satisfy our constraints. To teach a model to do so, we transform conditional statements into regularizers, such that during training, the model receives a penalty if the rule is not satisfied for an example. (Footnote 2: Constraint-derived regularizers are dependent on examples, but not necessarily labeled ones. For simplicity, in this paper, we work with sentences from the labeled corpus. However, the methodology described here can be extended to use unlabeled examples as well.)
To soften logic, we use the conversions shown in Table 1 that combine the product and Gödel t-norms. We use this combination because it offers cleaner derivatives, which make learning easier. A similar combination of t-norms was also used in prior work minervini2018adversarially. Finally, we will transform the derived losses into log space to be consistent with cross-entropy loss. li2019consistency outlines this relationship between the cross-entropy loss and constraint-derived regularizers in more detail.
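The relaxation step can be sketched in a few lines of Python. The contents of Table 1 are not reproduced in this text, so the mapping below is an assumption: it uses the standard Gödel operators (min and max) for conjunction and disjunction and the product t-norm residuum for implication, which is one common way to combine the two families. The function names are hypothetical.

```python
import math


def soft_not(a):
    """Negation: 1 - a."""
    return 1.0 - a


def soft_and(*literals):
    """Goedel t-norm: the conjunction is as true as its weakest literal."""
    return min(literals)


def soft_or(*literals):
    """Goedel t-conorm: the disjunction is as true as its strongest literal."""
    return max(literals)


def soft_implies(a, b):
    """Product residuum: fully satisfied when b >= a, otherwise b / a."""
    return 1.0 if a <= b else b / a


def implication_loss(a, b, eps=1e-12):
    """Negative log of soft_implies, i.e. max(0, log a - log b).

    eps only guards against log(0); it is an implementation detail of
    this sketch, not something taken from the paper.
    """
    return max(0.0, math.log(a + eps) - math.log(b + eps))
```

With this reading, a rule whose right-hand side is at least as probable as its left-hand side incurs no penalty, and otherwise the penalty grows with the log-ratio of the two sides.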
2.3 Unique Core Roles (U)
Our first constraint captures the idea that, in a frame, there can be at most one core participant of a given type. Operationally, this means that for every predicate in an input sentence $s$, there can be no more than one occurrence of each core argument (i.e., each label in the set of core roles $\mathcal{A}_{core}$). In first-order logic, we have:
$$\forall u, i \in s,\; X \in \mathcal{A}_{core}: \quad B_X(u,i) \rightarrow \bigwedge_{j \in s,\; j \neq i} \neg B_X(u,j) \quad (6)$$
which says, for a predicate $u$, if a model tags the $i$-th word as the beginning of the core argument span, then it should not predict that any other token is the beginning of the same label.
In the above rule, the literal $B_X(u,i)$ is associated with the predicted probability for the label B-X. (Footnote 3: We will use $B_X(u,i)$ to represent both the literal that the token $i$ is labeled with B-X for predicate $u$ and also the probability for this event. We follow a similar convention for the I-X labels.) This association is the cornerstone for deriving constraint-driven regularizers. Using the conversion in Table 1 and taking the natural log of the resulting expression, we can convert the implication in (6) as $l(u,i,X)$:
$$l(u,i,X) = \max\Big(\log B_X(u,i) - \min_{j \in s,\; j \neq i} \log\big(1 - B_X(u,j)\big),\; 0\Big)$$
Adding up the terms for all tokens and labels, we get the final regularizer $L_U(s)$:
$$L_U(s) = \sum_{(u,i) \in s,\; X \in \mathcal{A}_{core}} l(u,i,X)$$
Our constraint is universally applied to all words and predicates ($i$ and $u$, respectively) in the given sentence $s$. Whenever there is a pair of predicted labels for tokens that violate the rule (6), our loss will yield a positive penalty.
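As an illustration of how such a regularizer could be computed for a single predicate, consider the following sketch. It assumes the relaxation in which the implication becomes max(0, log a - log b) and the conjunction over other tokens becomes a minimum; the input layout (`begin_probs[label][i]` holding the probability of tag B-label at word i) and the function name are hypothetical choices for this example.

```python
import math


def unique_core_loss(begin_probs, eps=1e-12):
    """Soft penalty for predicting the same core role more than once.

    begin_probs maps a core label to a list with one probability per word:
    the probability that the word is tagged B-<label> for one fixed predicate.
    """
    total = 0.0
    for probs in begin_probs.values():
        for i, p_i in enumerate(probs):
            # log(1 - p_j) for every other position j; the minimum plays the
            # role of the relaxed conjunction "no other token starts this role".
            others = [math.log(1.0 - p_j + eps) for j, p_j in enumerate(probs) if j != i]
            if others:
                total += max(0.0, math.log(p_i + eps) - min(others))
    return total


confident = {"A0": [0.98, 0.01, 0.01]}   # one clear beginning of A0
conflicted = {"A0": [0.90, 0.90, 0.01]}  # two competing beginnings of A0
```

In this toy setting `unique_core_loss(confident)` is zero, while `unique_core_loss(conflicted)` is positive because two positions both claim to start the same role.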
To measure the violation rate of this constraint, we will report the percentage of propositions that have duplicate core arguments. We will refer to this error rate as $\rho_u$.
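A violation rate of this kind can be computed directly from decoded tag sequences. The sketch below is an assumption-laden illustration: the core label inventory `CORE_LABELS` is hypothetical (the exact set is not given in this text), and each proposition is represented simply as a list of BIO tags for one predicate.

```python
from collections import Counter

# Hypothetical core label inventory; adjust to the dataset's naming scheme.
CORE_LABELS = {"A0", "A1", "A2", "A3", "A4", "A5"}


def has_duplicate_core(tags):
    """True if some core role begins more than once in one proposition."""
    starts = Counter(tag[2:] for tag in tags if tag.startswith("B-"))
    return any(count > 1 for label, count in starts.items() if label in CORE_LABELS)


def unique_violation_rate(propositions):
    """Percentage of propositions with at least one duplicated core role."""
    violations = sum(has_duplicate_core(tags) for tags in propositions)
    return 100.0 * violations / len(propositions)


example = [
    ["B-A0", "O", "B-V", "B-A1", "I-A1"],  # consistent
    ["B-A0", "O", "B-V", "B-A0", "I-A0"],  # A0 appears twice
]
```

For the two toy propositions in `example`, `unique_violation_rate(example)` evaluates to 50.0.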
2.4 Exclusively Overlapping Roles (O)
We adopt this constraint from punyakanok2008importance and related work. In any sentence, an argument for one predicate can either be contained in or entirely outside another argument for any other predicate. We illustrate the intuition of this constraint in Table 2, assuming core argument spans are unique and tags are BIO-consistent.
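[Table 2: Illustration of the exclusively overlapping roles constraint. The first row marks an argument span with label X; the second and third rows show prohibited spans that cross its boundary.]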
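Based on Table 2, we design a constraint that says: if an argument has boundary $[i, j]$, then no other argument span can cross the boundary at $j$. This constraint applies to all argument labels in the task, denoted by the set $\mathcal{A}$.
$$\forall u, i, j \in s,\; X \in \mathcal{A}: \quad P(u,i,j,X) \rightarrow \bigwedge_{v \in s,\; Y \in \mathcal{A},\; (v,Y) \neq (u,X)} Q_1(v,i,j,Y) \wedge Q_2(v,i,j,Y) \quad (8)$$
Here, the term $P(u,i,j,X)$ denotes the indicator for the argument span $[i, j]$ having the label X for a predicate $u$ and corresponds to the first row of Table 2. The terms $Q_1(v,i,j,Y)$ and $Q_2(v,i,j,Y)$ each correspond to prohibitions of the type described in the second and third rows respectively.
As before, the literals underlying $P$, $Q_1$ and $Q_2$ are relaxed as model probabilities to define the loss. By combining the Gödel and product t-norms, we translate Rule (8) into:
$$l(u,i,j,X) = \max\Big(\log P(u,i,j,X) - \min_{(v,Y) \neq (u,X)} \log \min\big(Q_1(v,i,j,Y),\, Q_2(v,i,j,Y)\big),\; 0\Big)$$
$$L_O(s) = \sum_{u, i, j \in s,\; X \in \mathcal{A}} l(u,i,j,X)$$
Again, our constraint applies to all predicted probabilities. However, doing so requires scanning over the six axes defined by $(u, v, i, j, X, Y)$, which is computationally expensive. To get around this, we observe that, since we have a conditional statement, the higher the probability of $P(u,i,j,X)$, the more likely it yields a non-zero penalty. These cases are precisely the ones we hope the constraint helps. Thus, for faster training and ease of implementation, we modify Equation 8 by squeezing the $(i, j)$ dimensions using top-k to redefine $L_O$ above as:
$$\mathcal{T}(u,X) = \operatorname{arg\,top-k}_{(i,j) \in s} P(u,i,j,X)$$
$$L_O(s) = \sum_{u \in s,\; X \in \mathcal{A}} \; \sum_{(i,j) \in \mathcal{T}(u,X)} l(u,i,j,X)$$
where $\mathcal{T}(u,X)$ denotes the set of the top-k span boundaries for predicate $u$ and argument label X. This change results in a constraint defined by $u$, $v$, X, Y and the $k$ elements of $\mathcal{T}$.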
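The top-k restriction can be pictured with a short sketch. Everything specific here is an assumption made for illustration: the span score is taken to be the Gödel conjunction of "word i begins the role", "word j is inside the role" and "word j+1 is not inside the role", only multi-word spans are enumerated, and the function name is hypothetical. The paper's exact span indicator may differ.

```python
import heapq


def top_k_spans(begin_probs, inside_probs, k):
    """Return the k highest-scoring (start, end) spans for one predicate and label.

    begin_probs[i]  : probability that word i is tagged B-X
    inside_probs[i] : probability that word i is tagged I-X
    """
    n = len(begin_probs)
    scored = []
    for i in range(n):
        for j in range(i + 1, n):
            # Past the end of the sentence nothing can continue the span.
            next_inside = inside_probs[j + 1] if j + 1 < n else 0.0
            score = min(begin_probs[i], inside_probs[j], 1.0 - next_inside)
            scored.append((score, (i, j)))
    return [span for _, span in heapq.nlargest(k, scored)]


begin = [0.9, 0.05, 0.05, 0.0]
inside = [0.0, 0.9, 0.8, 0.1]
best = top_k_spans(begin, inside, k=2)
```

Only the spans returned by `top_k_spans` would then be checked against the other predicates and labels, which is what keeps the cost manageable. In the toy example the span (0, 2) ranks first.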
We will refer to the error of the overlap constraint as $\rho_o$, which describes the total number of non-exclusively overlapped pairs of arguments. In practice, we found that models rarely make such mistakes. In §3, we will see that using this constraint during training helps models generalize better with other constraints. In §4, we will analyze the impact of the top-k parameter $k$ in the optimization described above.
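Counting non-exclusively overlapping argument pairs can be sketched as follows. This is an illustrative reading of the constraint, not the evaluation code used for the reported numbers: spans are inclusive `(start, end)` word indices, and only pairs of arguments that belong to different predicates are compared.

```python
def spans_cross(a, b):
    """True if two inclusive spans overlap without one containing the other."""
    (a_start, a_end), (b_start, b_end) = a, b
    overlap = a_start <= b_end and b_start <= a_end
    a_in_b = b_start <= a_start and a_end <= b_end
    b_in_a = a_start <= b_start and b_end <= a_end
    return overlap and not (a_in_b or b_in_a)


def count_crossing_pairs(args_by_predicate):
    """Count crossing argument pairs across different predicates.

    args_by_predicate maps a predicate index to its list of argument spans.
    """
    predicates = sorted(args_by_predicate)
    count = 0
    for pos, u in enumerate(predicates):
        for v in predicates[pos + 1:]:
            for a in args_by_predicate[u]:
                for b in args_by_predicate[v]:
                    count += spans_cross(a, b)
    return count


toy_arguments = {0: [(0, 3)], 1: [(1, 2), (2, 5)]}
```

In `toy_arguments`, the span (1, 2) is nested inside (0, 3) and is fine, while (2, 5) crosses the boundary of (0, 3), so one violating pair is counted.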
2.5 Frame Core Roles (F)
The task of semantic role labeling is defined using the PropBank frame definitions. That is, for any predicate lemma of a given sense, PropBank defines which core arguments it can take and what they mean. The definitions allow for natural constraints that can teach models to avoid predicting core arguments outside of the predefined set.
$$\forall u \in s,\; c \in \mathcal{S}(u): \quad \mathrm{Sense}(u,c) \rightarrow \bigwedge_{i \in s,\; X \notin \mathcal{R}(u,c)} \neg B_X(u,i)$$
where $\mathcal{S}(u)$ denotes the set of senses for a predicate $u$, $\mathcal{R}(u,c)$ denotes the set of acceptable core arguments when the predicate $u$ has sense $c$, X ranges over the core argument labels, and $B_X(u,i)$ is the literal that word $i$ is tagged B-X for predicate $u$.
As noted in §2.2, literals in the above statement can be associated with classification neurons. Thus the literal $\mathrm{Sense}(u,c)$ corresponds to either model prediction or ground truth. Since our focus is to validate the approach of using relaxed constraints for SRL, we will use the latter.
This constraint can also be converted into a regularizer following the previous examples, giving us a loss term $L_F$.
We will use $\rho_f$ to denote the violation rate. It represents the percentage of propositions that have predicted core arguments outside the role sets of PropBank frames.
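A sketch of how the frame constraint could turn into a penalty when the gold sense is used: the left-hand side of the implication then has truth value 1, so under the max(0, log a - log b) relaxation the loss reduces to the negative log of the relaxed right-hand side. The role inventory `CORE_LABELS`, the input layout and the function name below are hypothetical, and the frame lookup is represented simply by a set of allowed roles.

```python
import math

# Hypothetical core label inventory; adjust to the dataset's naming scheme.
CORE_LABELS = ("A0", "A1", "A2", "A3", "A4", "A5")


def frame_core_loss(begin_probs, allowed_roles, core_labels=CORE_LABELS, eps=1e-12):
    """Soft penalty for starting a core role that the gold frame does not license.

    begin_probs maps a label to per-word probabilities of the tag B-<label>
    for one predicate; allowed_roles is the role set of the gold sense.
    """
    logs = [
        math.log(1.0 - p + eps)
        for label in core_labels
        if label not in allowed_roles
        for p in begin_probs.get(label, [])
    ]
    if not logs:
        return 0.0
    # log(1) = 0 for the gold sense, so only the right-hand side remains.
    return max(0.0, -min(logs))


toy_probs = {"A0": [0.9, 0.0], "A2": [0.0, 0.7]}
```

With `toy_probs`, a frame that licenses only A0 and A1 is penalized for the 0.7 probability mass on B-A2, whereas a frame that licenses A0 and A2 incurs no penalty.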
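Our final loss is defined as:
$$L_E + \lambda_U L_U + \lambda_O L_O + \lambda_F L_F \quad (12)$$
Here, $L_E$ is the standard cross entropy loss over the BIO labels, $L_U$, $L_O$ and $L_F$ are the regularizers derived from the three constraints above, and the $\lambda$'s are hyperparameters.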
3 Experiments & Results
In this section, we study the question: In what scenarios can we inform an end-to-end trained neural model with declarative knowledge? To this end, we experiment with the CoNLL-05 and CoNLL-12 datasets, using standard splits and the official evaluation script for measuring performance. To empirically verify our framework in various data regimes, we consider scenarios ranging from where only limited training data is available, to ones where large amounts of clean data are available.
3.1 Experiment Setup
Our baseline (described in §2.1) is based on RoBERTa. We used the pre-trained base version released by wolf2019transformers. Before the final linear layer, we added a dropout layer srivastava2014dropout with probability . To capture the sequential dependencies between labels, we added a standard CRF layer. At testing time, Viterbi decoding with hard transition constraints was employed across all settings. In all experiments, we used the gold predicate and gold frame senses.
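Decoding with hard transition constraints can be illustrated with a small Viterbi routine. This sketch is deliberately simplified and rests on assumptions: it uses only per-word log-scores plus a hard rule that I-X may follow nothing but B-X or I-X, and it omits the learned transition scores that a CRF layer would contribute. The label set and the numbers are toy values.

```python
import math


def transition_allowed(prev, cur):
    """I-X may only follow B-X or I-X; every other transition is permitted."""
    if cur.startswith("I-"):
        role = cur[2:]
        return prev in ("B-" + role, "I-" + role)
    return True


def viterbi_bio(log_probs, labels):
    """Best BIO-consistent label sequence under per-word log-scores.

    log_probs[t][k] is the log-score of labels[k] at word t. The label set is
    assumed to contain B-X for every I-X it contains.
    """
    # A sequence may not start inside an argument.
    score = [
        -math.inf if label.startswith("I-") else log_probs[0][k]
        for k, label in enumerate(labels)
    ]
    backpointers = []
    for t in range(1, len(log_probs)):
        new_score, pointers = [], []
        for k, cur in enumerate(labels):
            best_score, best_prev = max(
                (score[p], p) for p, prev in enumerate(labels) if transition_allowed(prev, cur)
            )
            new_score.append(best_score + log_probs[t][k])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    k = max(range(len(labels)), key=lambda idx: score[idx])
    path = [k]
    for pointers in reversed(backpointers):
        k = pointers[k]
        path.append(k)
    return [labels[idx] for idx in reversed(path)]


labels = ["O", "B-A0", "I-A0", "B-A1", "I-A1"]
word_probs = [
    [0.50, 0.40, 0.05, 0.03, 0.02],
    [0.10, 0.10, 0.70, 0.05, 0.05],
]
log_scores = [[math.log(p) for p in row] for row in word_probs]
decoded = viterbi_bio(log_scores, labels)
```

Picking the most probable label independently for each word would give the invalid sequence O, I-A0 here; the constrained decoder instead returns a sequence that starts the argument properly.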
Model training proceeded in two stages:
We first finetuned the pre-trained RoBERTa model on SRL with only cross-entropy loss for epochs with learning rate .
Then we continued finetuning with the combined loss in Equation 12 for another epochs with a lowered learning rate of .
During both stages, learning rates were warmed up linearly for the first updates.
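A linear warmup of the kind mentioned above can be sketched as a small function. The concrete values are not recoverable from this text, so the function takes the warmup length and peak learning rate as parameters; holding the rate constant after warmup is an assumption of this sketch, since the text does not say what happens afterwards.

```python
def warmup_learning_rate(step, warmup_steps, peak_lr):
    """Learning rate at a given update (0-indexed) with linear warmup.

    The rate rises linearly to peak_lr over the first warmup_steps updates
    and is held constant afterwards (assumption of this sketch).
    """
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


# Toy schedule: four warmup updates with a placeholder peak rate of 1.0.
schedule = [warmup_learning_rate(step, warmup_steps=4, peak_lr=1.0) for step in range(6)]
```

The same function could be called once per stage with that stage's own peak rate, since warmup is applied in both stages.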
For fair comparison, we finetuned our baseline twice (as with the constrained models); we found that it consistently outperformed the singly finetuned baseline in terms of both error rates and role F1. We grid-searched the $\lambda$'s by incrementally adding regularizers. The combination of $\lambda$'s with a good balance between F1 and the error rates ($\rho$'s) on the dev set was selected for testing. We refer readers to the appendix for the values of the $\lambda$'s.
For models trained on the CoNLL-05 data, we report performance on the dev set, and the WSJ and Brown test sets. For CoNLL-12 models, we report performance on the dev and the test splits.
3.2 Scenario 1: Low Training Data
Creating SRL datasets requires expert annotation, which is expensive. While there are some efforts on semi-automatic annotation targeting low-resource languages (e.g., akbik-etal-2016-towards), achieving high neural network performance with small or unlabeled datasets remains a challenge (e.g., furstenau-lapata-2009-graph; furstenau2012semi; titov-klementiev-2012-semi; gormley-etal-2014-low; abend-etal-2009-unsupervised).
In this paper, we study the scenario where we have small amounts of fully labeled training data. We sample 3% of the training data and an equivalent amount of development examples. The same training/dev subsets are used across all models.
Table 3 reports the performance of models trained on 3% of the training data from CoNLL-05 and CoNLL-12 (top and bottom respectively). We compare our strong baseline model with structure-tuned models using all three constraints. Note that for all these evaluations, while we use subsamples of the dev set for model selection, the evaluations are reported using the full dev and test sets.
We see that training with constraints greatly improves precision with low training data, while recall decreases. This trade-off is accompanied by a reduction in the violation rates of the unique core roles and frame core roles constraints ($\rho_u$ and $\rho_f$). As noted in §2.4, models rarely predict label sequences that violate the exclusively overlapping roles constraint. As a result, the error rate $\rho_o$ (the number of violations) only slightly fluctuates.
[Table 3: Results for CoNLL-05 (3%, 1.1k; top) and CoNLL-12 (3%, 2.7k; bottom).]
3.3 Scenario 2: Large Training Data
[Table 4: Results for CoNLL-05 (100%, 36k).]
Table 4 reports the performance of models trained with our framework using the full training set of the CoNLL-05 dataset which consists of k sentences with k propositions. Again, we compare RoBERTa (twice finetuned) with our structure-tuned models. We see that the constrained models consistently outperform baselines on the dev, WSJ, and Brown sets. With all three constraints, the constrained model reaches F1 on the WSJ. It also generalizes well on new domain by outperforming the baseline by points on the Brown test set.
As in the low training data experiments, we observe improved precision due to the constraints. This suggests that even with large training data, direct label supervision might not be enough for neural models to pick up the rich output space structure. Our framework helps neural networks, even as strong as RoBERTa, to make more correct predictions from differentiable constraints.
Surprisingly, the development ground truth has a error rate on the frame role constraint, and on the unique role constraint. Similar percentages of unique role errors also appear in WSJ and Brown test sets. For , the oracle has no violations on the CoNLL-05 dataset.
The exclusively overlapping roles constraint is omitted as we found that models rarely make such prediction errors. After adding constraints, the error rate of our model approached the lower bound. Note that our framework focuses on the learning stage without any specialized decoding algorithms in the prediction phase, except for the Viterbi algorithm, which guarantees that there will be no BIO violations.
What about even larger and cleaner data?
The ideal scenario, of course, is when we have the luxury of massive and clean data to power neural network training. In Table 5, we present results on CoNLL-12 which is about times as large as CoNLL-05. It consists of k sentences and k propositions. The dataset is also less noisy with respect to the constraints. For instance, the oracle development set has no violations for both the unique core and the exclusively overlapping constraints.
We see that, while adding constraints reduced the error rates of the unique core roles and frame core roles constraints, the improvements in label consistency do not affect F1 much. As a result, our best constrained model performs on par with the baseline on the dev set, and is slightly better than the baseline (by ) on the test set. Thus we believe that when we have the luxury of data, learning with constraints becomes optional. This observation is in line with recent results in li2019augmenting and li2019consistency.
But is it due to the large data or the strong baseline?
To investigate whether the seemingly saturated performance is from data or from the model, we also evaluate our framework on the original BERT devlin2019bert, which is relatively less powerful. We follow the same model setup for experiments and report the performance in Table 5 and Table 9. We see that, compared to RoBERTa, BERT obtains similar F1 gains on the test set, suggesting that the performance ceiling is due to the training set size.
[Table 5: Results for CoNLL-12 (100%, 90k).]
4 Ablations & Analysis
In §3, we saw that constraints not only improve model performance, but also make outputs more structurally consistent. In this section, we will show the results of an ablation study that adds one constraint at a time. Then, we will examine the sources of improved F-score by looking at individual labels, and also the effect of the top-k relaxation for the exclusively overlapping roles constraint. Furthermore, we will examine the robustness of our method against randomness involved during training. We will end this section with a discussion about the ability of constrained neural models to handle structured outputs.
We present the ablation analysis on our constraints in Table 6. We see that as models become more constrained, precision improves. Furthermore, one class of constraints does not necessarily reduce the violation rate for the others. Combining all three constraints offers a balance between precision, recall, and constraint violation.
One interesting observation is that adding the exclusively overlapping roles constraint improves F-scores even though its violation counts were already close to zero. As noted in §2.4, our constraints apply to the predicted scores of all labels for a given argument, while the actual decoded label sequence is just the highest scoring sequence using the Viterbi algorithm. Seen this way, our regularizers increase the decision margins on affected labels. As a result, the model predicts scores that help Viterbi decoding, and also generalizes better to new domains such as the Brown set.
[Table 6: Ablation results for CoNLL-05 (100%, 36k).]
Sources of Improvement
Table 7 shows label-wise F1 scores for each argument. Under low training data conditions, our constrained models gained improvements primarily from the frequent labels, e.g., A0-A2. On the CoNLL-05 dataset, we found that the location modifier (AM-LOC) posed challenges to our constrained models, which performed significantly worse than the baseline. Another challenge is the negation modifier (AM-NEG), where our models underperformed on both datasets, particularly with small training data. When using the CoNLL-12 training set, our models performed on par with the baseline even on frequent labels, confirming that the performance of soft-structured learning is nearly saturated on the larger, cleaner dataset.
[Table 7: Label-wise F1 scores (columns: CoNLL-05 3%, CoNLL-05 100%, CoNLL-12 3%, CoNLL-12 100%).]
Impact of Top-k Beam Size
As noted in §2.4, we used the top-k strategy to implement the exclusively overlapping roles constraint. As a result, there is a certain chance for predicted label sequences to have non-exclusive overlap without our regularizer penalizing them. What we want instead is a good balance between coverage and runtime cost. To this end, we analyze the CoNLL-12 development set using the baseline trained on of CoNLL-12 data. Specifically, we count the examples which have such overlap but the regularization loss is . In Table 8, we see that yields good coverage.
Robustness to random initialization
We observed that model performance with structured tuning is generally robust to random initialization. As an illustration, we show the performance of models trained on the full CoNLL-12 dataset with different random initializations in Table 9.
[Table 9: Test F1 on CoNLL-12 (100%, 90k) across random seeds (columns: Seed1, Seed2, Seed3, avg F1).]
Can Constrained Networks Handle Structured Prediction?
Larger, cleaner data may presumably be better for training constrained neural models. But it is not that simple. We will approach the above question by looking at how good the transformer models are at dealing with two classes of constraints, namely: 1) structural constraints that rely only on available decisions (the unique core roles constraint), and 2) constraints involving external knowledge (the frame core roles constraint).
For the former, we expected neural models to perform very well since the constraint represents a simple local pattern. From Tables 4 and 5, we see that the constrained models indeed reduced violations of this constraint substantially. However, when the training data is limited, e.g., comparing CoNLL-05 3% and 100%, the constrained models, while reducing the number of errors, still make many invalid predictions. We conjecture this is because networks learn with constraints mostly by memorization. Thus the ability to generalize learned patterns to unseen examples relies on training size.
The frame core roles constraint requires external knowledge from the PropBank frames. We see that even with large training data, constrained models were only able to reduce the error rate by a small margin. In our development experiments, a larger weight on this regularizer tends to strongly sacrifice argument F1, yet still does not improve the development error rate substantially. Without additional training signal in the form of such background knowledge, constrained inference becomes a necessity, even with strong neural network models.
5 Discussion & Conclusion
Semantic Role Labeling & Constraints
The SRL task is inherently knowledge rich; the outputs are defined in terms of an external ontology of frames. The work presented here can be generalized to several different flavors of the task, and indeed, constraints could be used to model the interplay between them. For example, we could revisit the analysis of yi-etal-2007-semantic, who showed that the PropBank A2 label takes on multiple meanings, but by mapping them to VerbNet, they can be disambiguated. Such mappings naturally define constraints that link semantic ontologies.
Constraints have long been a cornerstone of SRL models. Several early linear models for SRL (e.g., punyakanok2004semantic; punyakanok2008importance; surdeanu2007combination) modeled inference for PropBank SRL using integer linear programming. riedel2008collective used Markov Logic Networks to learn and predict semantic roles with declarative constraints. The work of tackstrom2015efficient showed that certain SRL constraints admit efficient decoding, leading to a neural model that used this framework fitzgerald2015semantic. Learning with constraints has also been widely adopted in semi-supervised SRL (e.g., furstenau2012semi).
With the increasing influence of neural networks in NLP, however, the role of declarative constraints seems to have decreased in favor of fully end-to-end training (e.g., he2017deep; strubell2018linguistically, and others). In this paper, we show that even in the world of neural networks with contextual embeddings, there is still room for systematically introducing knowledge in the form of constraints, without sacrificing the benefits of end-to-end learning.
chang2012structured and ganchev2010posterior developed models for structured learning with declarative constraints. Our work is in the same spirit of training models that attempt to maintain output consistency.
There are some recent works on the design of models and loss functions by relaxing Boolean formulas. kimmig2012short used the Łukasiewicz t-norm for probabilistic soft logic. li2019augmenting augment the neural network architecture itself using such soft logic. pmlr-v80-xu18h present a general framework for loss design that does not rely on soft logic. Introducing extra regularization terms to a downstream task has been shown to be beneficial in terms of both output structure consistency and prediction accuracy (e.g., minervini2018adversarially; hsu2018unified; mehta2018towards; du2019consistent; li2019consistency).
In this work, we have presented a framework that seeks to predict structurally consistent outputs without extensive model redesign, or any expensive decoding at prediction time. Our experiments on the semantic role labeling task show that such an approach can be especially helpful in scenarios where we do not have the luxury of massive annotated datasets.
We thank members of the NLP group at the University of Utah for their valuable insights and suggestions; and reviewers for pointers to related works, corrections, and helpful comments. We also acknowledge the support of NSF Cyberlearning-1822877, SaTC-1801446, U.S. DARPA KAIROS Program No. FA8750-19-2-1004, DARPA Communicating with Computers DARPA 15-18-CwC-FP-032, HDTRA1-16-1-0002, and gifts from Google and NVIDIA.
The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
Appendix A Appendices
We show the values of the $\lambda$ hyperparameters in Table 10. We conducted a grid search over the combinations of $\lambda$'s for each setting, and the best one on the development set was selected for reporting.
[Table 10: Values of the $\lambda$'s for each setting (rows: RoBERTa CoNLL-05 (3%), RoBERTa CoNLL-2012 (3%), RoBERTa CoNLL-05 (100%), RoBERTa CoNLL-2012 (100%), BERT CoNLL-2012 (100%)).]