1 Introduction
In many natural language processing (NLP) tasks, the outputs are structures which can take the form of sequences, trees, or in general, labeled graphs. Predicting such output structures (e.g. Smith, 2011) involves assigning values to multiple interdependent variables. Certain joint assignments may be prohibited by constraints designed by domain experts. As a simple example, in the problem of extracting entities and relations from text, a constraint could disallow the relation “married to” between two entities if one of the entity is not a “person”. It has been shown that carefully designed constraints can substantially improve model performance in various applications (e.g., Chang et al., 2012; Anzaroot et al., 2014), especially when the number of training examples is limited.
Designing constraints often requires taskspecific manual effort. In this paper, we ask the question:
can we use neural network methods to automatically discover constraints from data, and use them to predict structured outputs?
We provide a general framework for discovering constraints in the form of a system of linear inequalities over the output variables in a problem. These constraints can improve an already trained model, or be integrated into the learning process for global training.A system of linear inequalities represents a bounded or unbounded convex polytope. We observe that such a system can be expressed as a twolayer threshold network, i.e., a network with one hidden layer of linear threshold units and an output layer with a single threshold unit. This twolayer threshold network will predict or
depending on whether the system of linear inequalities is satisfied or not. In principle, we could try to train such a threshold network to discover constraints. However, the zerogradient nature of the threshold activation function prohibits using backpropagation for gradientbased learning.
Instead, in this paper, we show that a construction of a specific twolayer rectifier network represents linear inequality constraints. This network also contains a single linear threshold output unit, but in the hidden layer, it contains rectified linear units (ReLUs).
Pan and Srikumar (2016) showed that a twolayer rectifier network constructed in such a way is equivalent to a threshold network, and represents the same set of linear inequalities as the threshold network with far fewer hidden units.The linear constraints thus obtained can augment existing models in multiple ways. For example, if a problem is formulated as an integer program (e.g., Roth and Yih, 2004, 2005; Riedel and Clarke, 2006; Martins et al., 2009), the learned constraints will become additional linear inequalities, which can be used directly. Alternatively, a structure can be constructed using graph search (e.g., Collins and Roark, 2004; Daumé et al., 2009; Doppa et al., 2014; Chang et al., 2015; Wiseman and Rush, 2016), in which case the learned constraints can filter available actions during searchnode expansions. Other inference techniques that extend Lagrangian Relaxation (Komodakis et al., 2007; Rush et al., 2010; Martins et al., 2011) can also employ the learned constraints. Essentially, the learned constraints can be combined with various existing models and inference techniques and the framework proposed in this paper can be viewed as a general approach to improve structured prediction.
We report experiments on three NLP tasks to verify the proposed idea. The first one is an entity and relation extraction task, in which we aim to label the entity candidates and identify relations between them. In this task, we show that the learned constraints can be used while training the model to improve prediction. We also show that the learned constraints in this domain can be interpreted in a way that is comparable to manually designed constraints. The second NLP task is to extract citation fields like authors, journals and date from a bibliography entry. We treat it as a sequence labeling problem and show that learned constraints can improve an existing firstorder Markov model trained using a structured SVM method
(Tsochantaridis et al., 2004). In the final experiment we consider chunking, i.e., shallow parsing, which is also a sequence labeling task. We train a BiLSTMCRF model Huang et al. (2015) on the training set with different sizes, and we show that learned constraints are particularly helpful when the number of training examples is small.In summary, the contributions of this paper are:

We propose that rectifier networks can be used to represent and learn linear constraints for structured prediction problems.

In tasks such as entity and relation extraction, the learned constraints can exactly recover the manually designed constraints, and can be interpreted in a way similar to manually designed constraints.

When manually designed constraints are not available, we show via experiments that the learned constraints can improve the original model’s performance, especially when the original model is trained with a small dataset.^{1}^{1}1The scripts for replaying the experiments are available at https://github.com/utahnlp/learningconstraints
2 Representing Constraints
In this section, we formally define structured prediction and constraints. In a structured prediction problem, we are given an input belonging to the instance space, such as sentences or images. The goal is to predict an output , where is the set of possible output structures for the input . The output have a predefined structure (e.g., trees, or labeled graphs), and the number of candidate structures in is usually large, i.e., exponential in the input size.
Inference in such problems can be framed as an optimization problem with a linear objective function:
(1) 
where
is a feature vector representation of the inputoutput pair
and are learned parameters. The feature representation can be designed by hand or learned using neural networks. The feasible set is predefined and known for every at both learning and inference stages. The goal of learning is to find the best parameters (and, also perhaps the features if we are training a neural network) using training data, and the goal of inference is to solve the above argmax problem given parameters .In this paper, we seek to learn additional constraints from training examples . Suppose we want to learn constraints, and the one is some Boolean function^{2}^{2}2We use to indicate true and to indicate false.: if satisfies the constraint, and if it does not. Then, the optimal structure is the solution to the following optimization problem:
(2)  
We will show that such learned constraints aid prediction performance.
2.1 Constraints as Linear Inequalities
Boolean functions over inference variables may be expressed as linear inequalities over them (Roth and Yih, 2004). In this paper, we represent constraints as linear inequalities over some feature vector of a given inputoutput pair. The constraint is equivalent to the linear inequality
(3) 
whose weights and bias are learned. A Boolean constraint is, thus, a linear threshold function,
(4) 
Here, is the sign function: if , and otherwise.
The feature representations should not be confused with the original features used in the structured prediction model in Eq. (1) or (2). Hereafter, we refer to as constraint features. Constraint features should be general properties of inputs and outputs, since we want to learn domainspecific constraints over them. They are a design choice, and in our experiments, we will use common NLP features. In general, they could even be learned using a neural network. Given a constraint feature representation , the goal is thus to learn the parameters ’s and ’s for every constraint.
2.2 Constraints as Threshold Networks
For an input , we say the output is feasible if it satisfies constraints for all . We can define a Boolean variable indicating whether is feasible with respect to the input : . That is, is a conjunction of all the Boolean functions corresponding to each constraint. Since conjunctions are linearly separable, we can rewrite as a linear threshold function:
(5) 
It is easy to see that if, and only if, all ’s are —precisely the definition of a conjunction. Finally, we can plug Eq. (4) into Eq. (5):
(6) 
Observe that Eq. (6) is exactly a twolayer threshold neural network: is the input to the network; the hidden layer contains linear threshold units with parameters and ; the output layer has a single linear threshold unit. This neural network will predict if the structure is feasible with respect to input , and if it is infeasible. In other words, constraints for structured prediction problems can be written as twolayer threshold networks. One possible way to learn constraints is thus to learn the hidden layer parameters and , with fixed output layer parameters. However, the neural network specified in Eq. (6) is not friendly to gradientbased learning; the function has zero gradients almost everywhere. To circumvent this, let us explore an alternative way of learning constraints using rectifier networks rather than threshold networks.
2.3 Constraints as Rectifier Networks
We saw in the previous section that a system of linear inequalities can be represented as a twolayer threshold network. In this section, we will see a special rectifier network that is equivalent to a system of linear inequalities, and whose parameters can be learned using backpropagation.
Denote the rectifier (ReLU) activation function as
. Consider the following twolayer rectifier network:(7) 
The input to the network is still . There are ReLUs in the hidden layer, and one threshold unit in the output layer. The decision boundary of this rectifier network is specified by a system of linear inequalities. In particular, we have the following theorem (Pan and Srikumar, 2016, Theorem 1):
Theorem 1.
Consider a twolayer rectifier network with hidden ReLUs as in Eq. (7). Define the set . The network output if, and only if, for every subset of , the following linear inequality holds:
(8) 
The proof of Theorem 1 is given in the supplementary material.
To illustrate the idea, we show a simple example rectifier network, and convert it to a system of linear inequalities using the theorem. The rectifier network contains two hidden ReLUs ():
Our theorem says that if and only if the following four inequalities hold simultaneously, one per subset of :
The first inequality, , corresponding to the empty subset of , trivially holds. The rest are just linear inequalities over .
In general, has subsets, and when is the empty set, inequality (8) is trivially true. The rectifier network in Eq. (7) thus predicts is a valid structure for , if a system of linear inequalities are satisfied. It is worth mentioning that even though the linear inequalities are constructed from a power set of elements, it does not make them dependent on each other. With general choice of and , these inequalities are linearly independent.
This establishes the fact that a twolayer rectifier network of the form of Eq. (7) can represent a system of linear inequality constraints for a structured prediction problem via the constraint feature function .
3 Learning Constraints
In the previous section, we saw that both threshold and rectifier networks can represent a system of linear inequalities. We can either learn a threshold network (Eq. (6)) to obtain constraints as in (3), or we can learn a rectifier network (Eq. (7)) to obtain constraints as in (8). The latter offers two advantages. First, a rectifier network has nontrivial gradients, which facilitates gradientbased learning^{3}^{3}3
The output threshold unit in the rectifier network will not cause any trouble in practice, because it can be replaced by sigmoid function during training. Our theorem still follows, as long as we interpret
as and as . We can still convert the rectifier network into a system of linear inequalities even if the output unit is the sigmoid unit.. Second, since ReLUs can represent constraints, the rectifier network can express constraints more compactly with fewer hidden units.We will train the parameters ’s and ’s of the rectifier network in the supervised setting. First, we need to obtain positive and negative training examples. We assume that we have training data for a structured prediction task.
Positive examples can be directly obtained from the training data of the structured prediction problem. For each training example , we can apply constraint feature extractors to obtain positive examples of the form .
Negative examples can be generated in several ways; we use simple but effective approaches. We can slightly perturb a structure in a training example to obtain a structure that we assume to be invalid. Applying the constraint feature extractor to it gives a negative example . We also need to ensure that is indeed different from any positive example. Another approach is to perturb the feature vector directly, instead of perturbing the structure .
In our experiments in the subsequent sections, we will use both methods to generate negative examples, with detailed descriptions in the supplementary material. Despite their simplicity, we observed performance improvements. Exploring more sophisticated methods for perturbing structures or features (e.g., using techniques explored by Smith and Eisner (2005), or using adversarial learning (Goodfellow et al., 2014)) is a future research direction.
To verify whether constraints can be learned as described here, we performed a synthetic experiment where we randomly generate many integer linear program (ILP) instances with
hidden shared constraints. The experiments show that constraints can indeed be recovered using only the solutions of the programs. Due to space constraints, details of this synthetic experiment are in the supplementary material. In the remainder of the paper we focus on three real NLP tasks.4 Entity and Relation Extraction Experiments
In the task of entity and relation extraction, we are given a sentence with entity candidates. We seek to determine the type of each candidate, as in the following example (the labels are underlined):
[Organization Google LLC] is headquartered in [Location Mountain View, California].
We also want to determine directed relations between the entities. In the above example, the relation from “Google LLC” to “Mountain View, California” is OrgBasedIn, and the opposite direction is labeled NoRel, indicating there is no relation. This task requires predicting a directed graph and represents a typical structured prediction problem—we cannot make isolated entity and relation predictions.
Dataset and baseline: We use the dataset from Roth and Yih (2004). It contains 1441 sentences with labeled entities and relations. There are three possible entity types: Person, Location and Organization, and five possible relations: Kill, LiveIn, WorkFor, LocatedAt and OrgBasedIn. Additionally, there is a special entity label NoEnt meaning a text span is not an entity, and a special relation label NoRel indicating that two spans are unrelated.
We used 70% of the data for training and the remaining 30% for evaluation. We trained our baseline model using the integer linear program (ILP) formulation with the same set of features as in Roth and Yih (2004). The baseline system includes manually designed constraints from the original paper. An example of such a constraint is: if a relation label is WorkFor, the source entity must be labeled Person, and the target entity must be labeled Organization. For reference, the supplementary material lists the complete set of manually designed constraints.
We use three kinds of constraint features: (i) sourcerelation indicator, which looks at a given relation label and the label of its source entity; (ii) relationtarget indicator, which looks at a relation label and the label of its target entity; and (iii) relationrelation indicator, which looks at a pair of entities and focuses on the two relation label, one in each direction. The details of the constraint features, negative examples and hyperparameters are in the supplementary material.
4.1 Experiments and Results
We compared the performance of two ILPbased models, both trained in the presence of constraints with a structured SVM. One model was trained with manually designed constraints and the other used learned constraints. These models are compared in Table 1.
Performance Metric  Designed  Learned 

entity F1  
relation F1 
We manually inspected the learned constraints and discovered that they exactly recover the designed constraints, in the sense that the feasible output space is exactly the same regardless of whether we use designed or learned constraints. As an additional confirmation, we observed that when a model is trained with designed constraints and tested with learned constraints, we get the same model performance as when tested with designed constraints. Likewise, a model that is trained with learned constraints performs identically when tested with learned and designed constraints.
Below, we give one example of a learned constraint, and illustrate how to interpret such a constraint. (The complete list of learned constraints is in the supplementary material.) A learned constraint using the sourcerelation indicator features is
(9) 
where through are indicators for labels NoEnt, Person, Location, Organization, NoRel, Kill, LiveIn, WorkFor, LocatedAt, and OrgBasedIn, respectively. This constraint disallows a relation labeled as Kill having a source entity labeled as Location, because . Therefore, the constraint “Location cannot Kill” is captured in (9). In fact, it is straightforward to verify that the inequality in (9) captures many more constraints such as “NoEnt cannot LiveIn”, “Location cannot LiveIn”, “Organization cannot WorkFor”, etc. A general method for interpreting learned constraints is a direction of future research.
Note that the metric numbers in Table 1 based on learned constraints are lower than those based on designed constraints. Since the feasible space is the same for both kinds of constraints, the performance difference is due to the randomness of the ILP solver picking different solutions with the same objective value. Therefore, the entity and relation experiments in this section demonstrate that our approach can recover the designed constraints and provide a way of interpreting these constraints.
5 Citation Field Extraction Experiments
In the citation field extraction task, the input is a citation entry. The goal is to identify spans corresponding to fields such as author, title, etc. In the example below, the labels are underlined:
[ Author A . M . Turing . ] [ Title Computing machinery and intelligence . ] [ Journal Mind , ] [Volume 59 , ] [ Pages 433460 . ] [ Date October , 1950 . ]
Chang et al. (2007) showed that handcrafted constraints specific to this domain can vastly help models to correctly identify citation fields. We show that constraints learned from the training data can improve a trained model without the need for manual effort.
Dataset and baseline. We use the dataset from Chang et al. (2007, 2012) whose training, development and test splits have 300, 100 and 100 examples, respectively. We train a firstorder Markov model using structured SVM (Tsochantaridis et al., 2004) on the training set with the same raw text features as in the original work.
Constraint features.
We explore multiple simple constraint features in the citation field extraction experiments as shown in Table 2. Detailed descriptions of these features, including how to develop negative examples for each feature, and experiment settings are in the supplementary material.
Feature  Description 

Label existence  Indicates which labels exist in a citation 
Label counts  Counts the number of occurrences of a label 
Bigram labels  Indicators for adjacent labels 
Trigram labels  Indicators for adjacent labels 
Partofspeech  Indicator for the partofspeech of a token 
Punctuation  Indicator for whether a token is a punctuation 
5.1 Experiments and Results
For each constraint feature template, we trained a rectifier network with 10 ReLUs in the hidden layer. We then use Theorem 1 to convert the resulting network to a system of , or 1023 linear inequalities. We used beam search with beam size 50 to combine the learned inequalities with the original sequence model to predict on the test set. States in the search space correspond to partial assignments to a prefix of the sequence. Each step we predict the label for the next token in the sequence. The pretrained sequence model (i.e., the baseline) ranks search nodes based on transition and emission scores, and the learned inequality prunes the search space accordingly^{4}^{4}4Since the labelexistence and labelcounts features are global, pruning by learned inequalities is possible only at the last step of search. The other four features admit pruning at each step of the search process.. Table 3 shows the token level accuracies of various methods.
Baselines  Search with learned constraints  Combine constraints  
Exact  Search  L.E.  L.C.  B.L.  T.L.  POS  Punc.  C1  C2  C3 
86.2  87.3  88.0  87.7  87.9  88.1  89.8  90.2  88.6  90.1  90.6 
The results show that all versions of constrained search outperform the baselines, indicating that the learned constraints are effective in the citation field extraction task. Furthermore, different constraints learned with different features can be combined. We observe that combining different constraint features generally improves accuracy.
It is worth pointing out that the label existence and label counts features are global in nature and cannot be directly used to train a sequence model. Even if some constraint features can be used in training the original model, it is still beneficial to learn constraints from them. For example, the bigram label feature is captured in the original first order model, but adding constraints learned from them still improves performance. As another test, we trained a model with POS features, which also contains punctuation information. This model achieves 91.8% accuracy. Adding constraints learned with POS improves the accuracy to 92.6%; adding constraints learned with punctuation features further improves it to 93.8%.
We also observed that our method for learning constraints is robust to the choice of the number of hidden ReLUs. For example, for punctuation, learning using 5, 8 and 10 hidden ReLUs results an accuracy of , , and , respectively. We observed similar behavior for other constraint features as well. Since the number of constraints learned is exponential in the number of hidden units, these results shows that learning redundant constraints will not hurt performance.
Note that carefully handcrafted constraints may achieve higher accuracy than the learned ones. Chang et al. (2007) report an accuracy of 92.5% with constraints specifically designed for this domain. In contrast, our method for learning constraints uses general constraint features, and does not rely on domain knowledge. Therefore, our method is suited to tasks where little is known about the underlying domain.
6 Chunking Experiments
Chunking is the task of clustering text into groups of syntactically correlated tokens or phrases. In the instance below, the phrase labels are underlined:
[NP An A.P. Green official] [VP declined to comment] [PP on] [NP the filing] [O.]
We treat the chunking problem as a sequence labeling problem by using the popular IOB tagging scheme. For each phrase label, the first token in the phrase is labeled with a “B” prefixed to phrase label while the other tokens are labeled with an “I” prefixed to the phrase label. Hence,
[NP An A.P. Green official]
is represented as
[[BNP An] [INP A.P.] [INP Green] [INP official]]
This is done for all phrase labels except “O”.
Dataset and Baselines. We use the CoNLL2000 dataset Tjong Kim Sang and Buchholz (2000) which contains 8936 training sentences and 2012 test sentences. For our experiments, we consider 8000 sentences out of 8936 training sentences as our training set and the remaining 936 sentences as our development set. Chunking is a wellstudied problem and showing performance improvements on full training dataset is difficult. However, we use this task to illustrate the interplay of learned constraints with neural network models, and the impact of learned constraints in the low training data regime.
We use the BiLSTMCRF Huang et al. (2015) for this sequence tagging task. We use GloVe for word embeddings. We do not use the BERT Devlin et al. (2019) family of models since tokens are broken down into subwords during preprocessing, which introduces modeling and evaluation choices that are orthogonal to our study of label dependencies. As with the citation task, all our constrained models use beam search, and we compare our results to both exact decoding and beam search baselines. We use two kinds of constraint features: (i) gram label existence, and (ii) gram part of speech. Details of the constraint features and construction of negative samples are given in the supplementary material.
6.1 Experiments and Results
We train the rectifier network with 10 hidden units. The beam size of 10 was chosen for our experiments based on preliminary experiments. We report the average results on two different random seeds for learning each constraint. Note that the gram label existence is a global constraint while the gram POS constraint is a local constraint which checks for validity of label assignments at each token. In essence, the latter constraint reranks the beam at each step by ensuring that states that satisfy the constraint are preferred over states that violate the constraint. Since the gram label existence is a global constraint, we check the validity of the tag assignments only at the last token. In the case where none of the states in the beam satisfy the constraint, the original beams are used.
Constraint  Percentage of training data used  

1%  5%  10%  25%  50%  100%  
Label existence  2  81.28  88.30  89.73  91.24  90.40  92.48  
3  80.98  88.20  90.58  91.20  92.37  93.12  
Partofspeech  3  86.52  90.74  91.80  92.41  93.07  93.84  
4  84.21  90.99  92.17  92.46  93.08  93.93  

81.29  88.27  90.62  91.33  92.51  93.44  
Exact decoding  82.11  88.70  90.49  92.57  93.94  94.75 
The results for this set of experiments are presented in Table 4. We observe that the POS constraint improves the performance of the baseline models significantly, outperforming the beam search baseline on all training ratios. More importantly, the results show sizable improvements in accuracy for smaller training ratios (e.g, and improvements on exact and search baselines respectively with 1% training data ). When the training ratios get bigger, we expect the models to learn these properties and hence the impact of the constraints decreases.
These results (along with the experiments in the previous sections) indicate that our constraints can significantly boost performance in the low data regime. Another way to improve performance in low resource settings is to use better pretrained input representations. When we replaced GloVe embeddings with ELMo, we observed a accuracy on 0.01 ratio of training data using exact decoding. However, this improvement comes at a cost: the number of parameters increases from 3M (190k trainable) to 94M (561k trainable). In contrast, our method instead introduces a smaller rectifier network with additional parameters while still producing similar improvements. In other words, using trained constraints is computationally more efficient.
We observe that the label existence constraints, however, do not help. We conjecture that this may be due to one of the following three conditions: (i) The label existence constraint might not exist for the task; (ii) The constraint exists but the learner is not able to find it; (iii) The input representations are expressive enough to represent the constraints. Disentangling these three factors is a future research challenge.
7 Related Work
Structured prediction
is an active field in machine learning and has numerous applications, including various kinds of sequence labeling tasks, parsing
(e.g., Martins et al., 2009), image segmentation (e.g., Lam et al., 2015), and information extraction (e.g., Anzaroot et al., 2014). The work of Roth and Yih (2004) introduced the idea of using explicitly stated constraints in an integer programming framework. That constraints and knowledge can improve models has been highlighted by several lines of work (e.g., Ganchev et al., 2010; Chang et al., 2012; Hu et al., 2016).The interplay between constraints and representations has been sharply highlighted by recent work on integrating neural networks with structured outputs (e.g., Rocktäschel and Riedel, 2017; Niculae et al., 2018; Manhaeve et al., 2018; Xu et al., 2018; Li and Srikumar, 2019; Li et al., 2019, and others). We expect that constraints learned as described in this work can be integrated into these formalisms, presenting an avenue for future research.
While our paper focuses on learning explicit constraints directly from examples, it is also possible to use indirect supervision from these examples to learn a structural classifier
(Chang et al., 2010), with an objective function penalizing invalid structures.Related to our goal of learning constraints is rule learning
, as studied in various subfields of artificial intelligence.
Quinlan (1986)describes the ID3 algorithm, which extracts rules as a decision tree. First order logic rules can be learned from examples using inductive logic programming
(Muggleton and de Raedt, 1994; Lavrac and Dzeroski, 1994; Page and Srinivasan, 2003). Notable algorithms for inductive logic programming include FOIL (Quinlan, 1990) and Progol (Muggleton, 1995).Statistical relation learning addresses learning constraints with uncertainty (Friedman et al., 1999; Getoor and Mihalkova, 2001). Markov logic networks (Richardson and Domingos, 2006) combines probabilistic models with first order logic knowledge, whose weighted formulas are soft constraints and the weights can be learned from data. In contrast to these directions, in this paper, we exploit a novel representational result about rectifier networks to learn polytopes that represent constraints with offtheshelf neural network tools.
8 Conclusions
We presented a systematic way for discovering constraints as linear inequalities for structured prediction problems. The proposed approach is built upon a novel transformation from two layer rectifier networks to linear inequality constraints and does not rely on domain expertise for any specific problem. Instead, it only uses general constraint features as inputs to rectifier networks. Our approach is particularly suited to tasks where designing constraints manually is hard, and/or the number of training examples is small. The learned constraints can be used for structured prediction problems in two ways: (1) combining them with an existing model to improve prediction performance, or (2) incorporating them into the training process to train a better model. We demonstrated the effectiveness of our approach on three NLP tasks, each with different original models.
Acknowledgments
We thank members of the NLP group at the University of Utah, especially Jie Cao, for their valuable insights and suggestions; and reviewers for pointers to related works, corrections, and helpful comments. We also acknowledge the support of NSF Cyberlearning1822877, SaTC1801446 and gifts from Google and NVIDIA.
References
 Learning Soft Linear Constraints with Application to Citation Field Extraction. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). External Links: ISBN 9781937284725, Document Cited by: §1, §7.
 Learning to Search Better than Your Teacher. In Proceedings of The 32nd International Conference on Machine Learning, External Links: Link Cited by: §1.
 Guiding semisupervision with constraintdriven learning. In Proceedings of the 45th annual meeting of the association of computational linguistics, External Links: Link Cited by: §5.1, §5, §5.
 Structured Learning with Constrained Conditional Models. Machine Learning. External Links: ISSN 08856125, Link Cited by: §1, §5, §7.
 Structured Output Learning with Indirect Supervision. Proceedings of the 27th International Conference on Machine Learning (ICML10). External Links: ISBN 9781605589077, Link Cited by: §7.

Incremental Parsing with the Perceptron Algorithm
. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, External Links: Document Cited by: §1.  Searchbased Structured Prediction. Machine Learning Journal (MLJ). External Links: Link Cited by: §1.
 BERT: Pretraining of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), External Links: Document Cited by: §6.
 Structured Prediction via Output Space Search. The Journal of Machine Learning Research. External Links: Link Cited by: §1.
 Learning Probabilistic Relational Models. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence. External Links: Link Cited by: §7.
 Posterior Regularization for Structured Latent Variable Models. The Journal of Machine Learning Research. External Links: Link Cited by: §7.
 Learning Statistical Models from Relational Data. Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. External Links: Document Cited by: §7.
 Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27, External Links: Link Cited by: §3.
 Gurobi optimizer reference manual. External Links: Link Cited by: Appendix B.
 Harnessing Deep Neural Networks with Logic Rules. In ”Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)”, External Links: Document Cited by: §7.
 Bidirectional LSTMCRF Models for Sequence Tagging. arXiv preprint arXiv:1508.01991. External Links: Link Cited by: §1, §6.

MRF Optimization via Dual Decomposition: Messagepassing Revisited.
Proceedings of the IEEE International Conference on Computer Vision
. External Links: Link Cited by: §1. 
HCSearch for Structured Prediction in Computer Vision.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, External Links: Link Cited by: §7.  Inductive logic programming: techniques and applications. Ellis Horwood. External Links: Link Cited by: §7.
 A LogicDriven Framework for Consistency of Neural Models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP), External Links: Link Cited by: §7.
 Augmenting Neural Networks with Firstorder Logic. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, External Links: Link Cited by: §7.
 Deepproblog: Neural probabilistic logic programming. In Advances in Neural Information Processing Systems, External Links: Link Cited by: §7.
 An Augmented Lagrangian Approach to Constrained MAP Inference. In Proceedings of the 28th International Conference on International Conference on Machine Learning, External Links: Link Cited by: §1.
 Concise Integer Linear Programming Formulations for Dependency Parsing. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1Volume 1, External Links: Link Cited by: §1, §7.
 Inductive Logic Programming: Theory and Methods. The Journal of Logic Programming. External Links: Link Cited by: §7.
 Inverse Entailment and Progol. New Generation Computing. External Links: Link Cited by: §7.
 SparseMAP: Differentiable Sparse Structured Inference. In International Conference on Machine Learning, External Links: Link Cited by: §7.
 ILP: A Short Look Back and a Longer Look Forward. Journal of Machine Learning Research. External Links: Link Cited by: §7.
 Expressiveness of Rectifier Networks. In Proceedings of the 33rd International Conference on Machine Learning, External Links: Link Cited by: §1, §2.3.
 Induction of Decision Trees. Machine Learning. External Links: Link Cited by: §7.
 Learning Logical Definitions from Relations. Machine Learning. External Links: Link Cited by: §7.
 Markov Logic Networks. Machine Learning. External Links: Link Cited by: §7.
 Incremental Integer Linear Programming for Nonprojective Dependency Parsing. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, External Links: Link Cited by: §1.
 Endtoend differentiable proving. In Advances in Neural Information Processing Systems, External Links: Link Cited by: §7.
 A Linear Programming Formulation for Global Inference in Natural Language Tasks. Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL2004) at HLTNAACL 2004. External Links: Link Cited by: §1, §2.1, §4, §4, §7.
 Integer Linear Programming Inference for Conditional Random Fields. Proceedings of the 22nd International Conference on Machine Learning. External Links: Document Cited by: §1.
 On dual decomposition and linear programming relaxations for natural language processing. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, External Links: Link Cited by: §1.

Contrastive Estimation: Training LogLinear Models on Unlabeled Data
. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, External Links: Document Cited by: §3.  Linguistic structure prediction. Morgan & Claypool Publishers. External Links: Link Cited by: §1.
 Introduction to the CoNLL2000 shared task chunking. In Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop, External Links: Link Cited by: §6.
 Support vector machine learning for interdependent and structured output spaces. Proceedings of the TwentyFirst International Conference on Machine Learning. External Links: Document Cited by: §1, §5.
 SequencetoSequence Learning as BeamSearch Optimization. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. External Links: Document Cited by: §1.

A Semantic Loss Function for Deep Learning with Symbolic Knowledge
. In Proceedings of the 35th International Conference on Machine Learning, External Links: Link Cited by: §7.
Appendix A Proof of Theorem 1
In this section we prove Theorem 1. The theorem and the relevant definitions are repeated here for convenience.
Define the rectifier (ReLU) activation function as . Consider the following twolayer rectifier network:
(10) 
The input to the network is still . There are ReLUs in the hidden layer, and one threshold unit in the output layer.
The decision boundary of this rectifier network is specified by a system of linear inequalities. In particular, we have the following theorem:
Theorem 2.
Consider a twolayer rectifier network with hidden ReLUs as in Eq. (10). Define the set . The network outputs if, and only if, for every subset of , the following linear inequality holds:
Proof.
Define . We first prove the “if” part of the theorem. Suppose that for any , . Thus for a specific subset , we have . By the definition of , , therefore .
Next we prove the “only if” part of the theorem. Suppose that . For any , we have . Therefore, for any , . ∎
Appendix B Synthetic Integer Linear Programming Experiments
We first check if constraints are learnable, and whether learned constraints help a downstream task with a synthetic experiment. Consider framing structure prediction as an integer linear program (ILP):
(11) 
The objective coefficient denotes the cost of setting the variable to and the goal of prediction is to find a cost minimizing variable assignment subject to linear constraints in (11). We randomly generate a hundred 50dimensional ILP instances, all of which share a fixed set of random constraints. Each instance is thus defined by its objective coefficients. We reserve of instances as test data. The goal is to learn the shared linear constraints in Eq. (11) from the training set.
We use the Gurobi Optimizer (Gurobi Optimization LLC, 2019) to solve all the ILP instances to obtain pairs , where is the vector of objective coefficients and is the optimal solution. Each in this set is feasible, giving us positive examples for the constraint learning task.
Negative examples are generated as follows: Given a positive pair described above, if the coefficient and the corresponding decision , construct from by flipping the bit in from to . Such a is a negative example for the constraint learning task because has a lower objective value than . Therefore, it violates at least one of the constraints in Eq. (11). Similarly, if and , we can flip the bit from to . We perform the above steps for every coefficient of every example in the training set to generate a set of negative examples .
We trained a rectifier network on these examples and converted the resulting parameters into a system of linear inequalities using Theorem 2. The hyperparameters and design choices are summarized in the supplementary material. We used the learned inequalities to replace the original constraints to obtain predicted solutions. We evaluated these predicted solutions against the oracle solutions (i.e., based on the original constraints). We also computed a baseline solution for each test example by minimizing an unconstrained objective.
Table 5 lists four measures of the effectiveness of learned constraints. First, we want to know whether the learned rectifier network can correctly predict the synthetically generated positive and negative examples. The binary classification accuracies are listed in the first row. The second row lists the bitwise accuracies of the predicted solutions based on learned constraints, compared with the gold solutions. We see that the accuracy values of the solutions based on learned constraints are in the range from –. As a comparison, without using any constraints, the accuracy of the baseline is . Therefore the learned constraints can substantially improve the prediction accuracy in the down stream inference tasks. The third row lists the percentage of the predicted solutions satisfying the original constraints. Solutions based on learned constraints satisfy – of the original constraints. In contrast, the baseline solutions satisfy of the original constraints. The last row lists the percentage of the gold solutions satisfying the learned constraints. We see that the gold solutions almost always satisfy the learned constraints.
Number of ReLUs  

2  3  4  5  6  7  8  9  10  
binary classification acc. ()  85.1  87.3  92.1  90.3  95.0  94.3  94.1  97.7  98.0 
bitwise solution acc. ()  81.1  80.9  81.9  80.2  81.0  82.3  81.1  83.2  83.5 
original constr. satisfied ()  70.3  69.8  72.7  70.4  70.1  71.1  71.4  74.4  74.3 
learned constr. satisfied ()  95.6  98.6  98.7  99.1  97.4  98.9  99.9  99.1  99.4 
The hyperparameter and other design choices for the synthetic ILP experiments are listed in Table 6.
Description  Value 

Total number of examples  100 
Number of training examples  70 
Number of test examples  30 
Dimensionality  50 
Range of hidden ReLU units considered for experiments  210 
Learning rates for crossvalidation while learning rectifier networks  
Learning rate decay for crossvalidation  
Optimizer parameters for learning  
Number of training epochs 
1000 
Appendix C Entity and relation extraction experiments
c.1 Designed constraints
Table 7 lists the designed constraints used in the entity and relation extraction experiments. There are fifteen constraints, three for each relation type. For example, the last row in Table 7 means that the relation OrgBasedIn must have an Organization as its source entity and a Location as its target entity, and the relation in the opposite direction must be NoRel.
Antecedents  Consequents  

If the relation is  Source must be  Target must be  Reversed relation must be 
Kill  Person  Person  NoRel 
LiveIn  Person  Location  NoRel 
WorkFor  Person  Organization  NoRel 
LocatedAt  Location  Location  NoRel 
OrgBasedIn  Organization  Location  NoRel 
c.2 Constraint features
We use the same example as in the main paper to illustrate the constraint features used in the entity and relation extraction experiments:
[Organization Google LLC] is headquartered in [Location Mountain View, California, USA].
In the above example, the relation from “Google LLC” to “Mountain View, California, USA” is OrgBasedIn, and the relation in the opposite direction is labeled NoRel, indicating there is no relation from “Mountain View, California, USA” to “Google LLC”.
We used three constraint features for this task, explained as follows.
Sourcerelation indicator
This feature looks at a given relation label and the label of its source entity. It is an indicator pair (source label, relation label). Our example sentence will contribute the following two feature vectors, (Organization, OrgBasedIn) and (Location, NoRel), both corresponding to postive examples. The negative examples contains all possible pairs of (source label, relation label), which do not appear in the positive example set.
Relationtarget indicator
This feature looks at a given relation label the label of its target entity. It is an indicator pair (relation label, target label). Our example sentence will contribute the following two feature vectors, (OrgBasedIn, Location) and (NoRel,Organization), both corresponding to positive examples. The negative examples contains all possible pairs of (relation label, target label), which do not appear in the positive example set.
Relationrelation indicator
This feature looks at a pair of entities and focuses on the two relation labels between them, one in each direction. Therefore our running example will give us two positive examples with features (OrgBasedIn, NoRel) and (NoRel,OrgBasedIn). The negative examples contain any pair of relation labels that is not seen in the positive example set.
c.3 Hyperparameters and design choices
The hyperparameter and design choices for the experiments are in Table 8. Note that different runs of the SVM learner with the learned or designed constraints may give different results from those on Table 1. This is due to nondeterminism introduced by hardware and different versions of the Gurobi solver picking different solutions that have the same objective value. In the results in Table 1, we show the results where the training with learned constraints seem to underperform the model that is trained with designed constraints. In other runs on different hardware, we found the opposite ordering of the results.
Description  Value 

Structured SVM tradeoff parameter for the base model  
Number of hidden ReLU units  
–for sourcerelation  2 
–for relationtarget  2 
–for relationrelation  1 
Learning rates for crossvalidation while learning rectifier networks  
Learning rate decay for crossvalidation  
Optimizer parameters for learning 
c.4 Learned Constraints
We see in the main paper that linear inequality constraints are learned using a rectifier network with hidden units. In the entity and relation extraction experiments, we use two hidden units to learn three constraints from the sourcerelation indicator features. The three learned constraints are listed in Table 9. A given pair of source label and relation label satisfies the constraint if the sum of the corresponding coefficients and the bias term is greater than or equal to zero. For example, the constraint from the first row in Table 9 disallows the pair (Location, Kill), because . Therefore, the learned constraint would not allow the source entity of a Kill relation to be a Location, which agrees with the designed constraints.
Source Labels  Relation Labels  

NoEnt  Per.  Loc.  Org.  NoRel  Kill  Live  Work  Located  Based  Bias 
1.98  3.53  1.90  0.11  2.66  2.84  2.84  2.84  2.58  0.43  0.32 
1.61  1.48  3.50  0.92  1.15  1.02  1.02  1.02  3.96  1.38  1.46 
3.59  2.04  1.60  1.03  3.81  1.82  1.82  1.82  1.38  0.95  0.78 
We enumerated all possible pairs of source label and relation label and found that the learned constraints always agree with the designed constraints in the following sense: whenever a pair of source label and relation label satisfies the designed constraints, it also satisfies all three learned constraints, and whenever a pair of source label and relation label is disallowed by the designed constraints, it violates at least one of the learned constraints. Therefore, our method of constraint learning exactly recovers the designed constraints.
Forward Relation Labels  Backward Relation Labels  Bias  
4.95  1.65  1.65  1.65  1.65  1.65  5.06  1.53  1.53  1.53  1.53  1.53  2.41 
We also use two hidden units to learn three constraints from the relationtarget indicator features, and one hidden unit to learn one constraint from the relationrelation indicator features. The learned constraints are listed in Table 11 and Table 10. Again we verify that the learned constraints exactly recover the designed constraints in all cases.
Relation Labels  Target Labels  

NoRel  Kill  Live  Work  Located  Based  NoEnt  Per.  Loc.  Org.  Bias 
2.68  3.17  0.55  2.68  0.55  0.55  1.58  3.15  0.53  2.70  1.02 
2.72  2.42  1.39  2.55  1.39  1.39  2.51  2.27  1.54  2.31  0.85 
5.40  0.74  1.94  0.13  1.94  1.94  4.10  0.88  2.08  0.39  0.86 
Appendix D Citation field extraction experiments
d.1 Constraint Features
We use the same example as in the main paper to illustrate the constraint features used in the citation field extraction experiments:
[ Author A . M . Turing . ] [ Title Computing machinery and intelligence . ] [ Journal Mind , ] [Volume 59 , ] [ Pages 433460 . ] [ Date October , 1950 . ]
We explore multiple simple constraint features as described below.
Label existence
This features indicates which labels exist in a citation entry. In our above example, there are six labels. Suppose there are possible labels. The above example is a positive example, the feature vector of which is an dimensional binary vector. Exactly six elements, corresponding to the six labels in the example, have the value 1 and all others have the value 0. To obtain the negative examples, we iterate through every positive example and flip one bit of its feature vector. If the resulting vector is not seen in the positive set it will be a negative example.
Label counts
Labelcount features are similar to Labelexistence features. Instead of indicating whether a label exists using 1 or 0, labelcount features records the number of times each label appears in the citation entry. The positive examples can be generated naturally from the training set. To generate negative examples, we perturb the actual labels of a positive example, as opposed to its feature vector. We then extract the label counts feature from the perturbed example, and treat it as negative if it has not seen before in the positive set.
Bigram labels
This feature considers each pair of adjacent labels in the text. From left to right, the above example will give us feature vectors like (Author, Author), (Author, Title), (Title, Title), …, (Date, Date
). We then use onehot encoding to represent these features, which is the input vector to the rectifier network. All these feature vectors are labeled as positve (+1) by the rectifier network, since they are generated from the training set. To generate negative examples for bigramlabel features, we generate all positive examples from the training set, then enumerate all possible pair of labels and select those that were not seen in the positive examples.
Trigram labels
This feature is similar to the bigram labels. From the training set, we generate positive examples, e.g., (Author, Author, Author), (Author, Author, Title) etc, and convert them into onehot encodings. For negative examples, we enumerate all possible trigram labels, and select those trigrams as negative if two conditions are met: (a) the trigram is not seen in the positive set; and (b) a bigram contained in it is seen in the training set. The intuition is that we want negative examples to be almost feasible.
Partofspeech
For a fixed window size, we extract partofspeech tags and the corresponding labels, and use the combination as our constraint features. For example, with window size two, we get indicators for (, , , ) for the token in the sentence, where and refer to partofspeech tag and citation field label respectively. For negative examples, we enumerate all fourtuples as above, and select it as negative if the fourtuple is not seen in the positive set, but both (, ) and (, ) are seen in the training set.
Punctuation
The punctuation feature is similar to the partofspeech feature. Instead of the POS tag, we use an indicator for whether the current token is a punctuation.
d.2 Hyperparameters and design choices
The hyperparameter and design choices for the experiments are in the Table 12.
Description  Value 

Structured SVM tradeoff parameter for the base model  unregularized 
Beam size  50 
Number of hidden ReLU units for experiments  10 
Learning rates for crossvalidation while learning rectifier networks  
Learning rate decay for crossvalidation  
Optimizer parameters for learning 
Appendix E Chunking Experiments
e.1 Constraint Features
The two constraints which we discussed in the main paper for the chunking dataset are described below.
Ngram label existence
This constraint is a general form of the label existence constraint mentioned in Section D.1
. In fact, it is the ngram label existence constraint with n=1. The ngram label existence constraint represents the labels of a sequence as a binary vector. Each feature of this binary vector corresponds to an ngram label combination. Hence, the length of this constraint feature will be
where is the total number of distinct labels. This means the vector size of this constraint grows exponentially with increasing . The binary vector indicates a value of 1 for all the ngram label features present in the sequence tags. The positive examples are hence formed from the training set sequences. For the negative examples, we iterate through each positive example and flip a bit. The resulting vector is incorporated as a negative example if it doesn’t occur in the training set.Ngram part of speech (POS)
This constraint is a general form of the part of speech constraint mentioned in Section D.1. POS tags of a token are converted to a indicator vector. We concatenate the indicator vectors of each gram in an ngram in order and this vector is further concatenated with indicators of labels of each of these grams. Hence, for n=2, we get the constraint vector as (, , , ) where and are indicators for POS tags and labels respectively for the token. The positive examples enumerate vectors for all existing ngrams in the training sequences. The negative examples are creating by changing a label indicator in the constraint feature. The label to be perturbed and the perturbation both are chosen at random. The constraint vector hence formed is incorporated as a negative example if it doesn’t occur in the set of positive examples.
e.2 Hyperparameters and design choices
The hyperparameter and design choices are summarized in Table 13.
Description  Value 

Constraint Rectifier Network  
Range of hidden ReLU units considered for experiments  
Learning rates for development while learning rectifier networks  
Number of training epochs  1000 
Random Seeds  
BiLSTM CRF Model  
Learning rate for development while learning baseline model  
Learning Rate Decay  
Beam Size  10 
Number of training epochs  300 
Comments
There are no comments yet.