Inductive Logic Programming (ILP) (muggleton1994inductive) is a formalism where a set of logical rules is learned from a set of examples and a background knowledge theory. By combining rule-based and statistical artificial intelligence, ILP overcomes the brittleness of pure logic-based approaches and the lack of interpretability of the models produced by most statistical methods such as neural networks or support vector machines. Here we focus on ILP that is based on Answer Set Programming (ASP) as our underlying logic programming language because we aim to apply ILP to Natural Language Processing (NLP) applications such as Machine Translation, Summarization, Coreference Resolution, or Parsing that require nonmonotonic reasoning with exceptions and complex background theories.
In our work, we apply ILP to the NLP task of sentence chunking. Chunking, also known as ‘shallow parsing’, is the identification of short phrases such as noun phrases, and it relies mainly on Part of Speech (POS) tags. In our experiments on sentence chunking (tjong2000introduction) we encountered several problems with the state-of-the-art ASP-based ILP systems XHAIL (ray2009nonmonotonic), ILED (katzouris2015incremental), and ILASP2 (law2015learning). XHAIL and ILASP2 showed scalability issues with as few as 100 sentences as training data. ILED is designed to be highly scalable but failed in the presence of simple inconsistencies in examples. We decided to investigate the issue in the XHAIL system, which is open-source and well documented, and we made the following observations:
(i) XHAIL only terminates if it finds a provably optimal hypothesis,
(ii) the hypothesis search is done over all potentially beneficial rules that are supported by at least one example, and
(iii) XHAIL contains redundancies in hypothesis search and uses outdated ASP technology.
In larger datasets, observation (i) is unrealistic, because finding a near-optimal solution is much easier than proving optimality of the best solution. Moreover, in classical machine learning, suboptimal solutions obtained via non-exact methods routinely provide state-of-the-art results. Similarly, observation (ii) makes it harder to find a hypothesis, and it generates an overfitting hypothesis that contains rules which are only required for a single example. Observation (iii) points out an engineering problem that can be remedied with little theoretical effort.
To overcome the above issues, we modified the XHAIL algorithm and software, and we performed experiments on a simple NLP chunking task to evaluate our modifications.
In detail, we make the following contributions.
We extend XHAIL with best-effort optimisation using the newest ASP optimisation technology of unsat-core optimisation (Andres2012) with stratification (Ansotegui2013maxsat; Alviano2015aspopt) and core shrinking (Alviano2016coreshrinking) using the WASP2 (DBLP:conf/lpnmr/AlvianoDFLR13; DBLP:conf/lpnmr/AlvianoDLR15) solver and the Gringo (Gebser2011gringo3) grounder. We also extend XHAIL to provide information about the optimality of the hypothesis.
We extend the XHAIL algorithm with a parameter for pruning, such that XHAIL searches for hypotheses without considering rules that are supported by fewer than a user-specified number of examples.
We eliminate several redundancies in XHAIL by changing its internal data structures.
We describe a framework for chunking with ILP, based on preprocessing with Stanford Core NLP (manning2014stanford) tools.
We experimentally analyse the relationship between the pruning parameter, number of training examples, and prediction score on the sentence chunking (tjong2000introduction) subtask of iSTS at SemEval 2016 (companionpaper).
We discuss the best hypothesis found for each of the three datasets in the SemEval task, and we discuss what can be learned about the dataset from these hypotheses.
XHAIL becomes applicable to this chunking task only if we use all of the above modifications together. By learning a hypothesis from 500 examples, we can achieve results competitive with state-of-the-art systems used in the SemEval 2016 competition.
Our extensions and modifications of the XHAIL software are available in a public fork of the official XHAIL Git repository (xhailfork).
In Section 2 we provide an overview of logic programming and ILP. Section 3 gives an account of related work and available ILP tools. In Section 4 we describe the XHAIL system and our extensions of pruning, best-effort optimisation, and further improvements. Section 5 gives details of our representation of the chunking task. In Section 6 we discuss empirical experiments and results. We conclude in Section 7 with a brief outlook on future work.
We next introduce logic programming and, based on that, inductive logic programming.
2.1 Logic Programming
A logic programming theory normally comprises an alphabet (variables, constants, quantifiers, etc.), a vocabulary, logical symbols, a set of axioms, and inference rules (lloyd2012foundations). A logic programming system consists of two portions: the logic and the control. Logic describes what kind of problem needs to be solved, and control describes how that problem can be solved. An ideal of logic programming is for it to be purely declarative. The popular Prolog (clocksin2003programming) system evaluates rules using resolution, which makes the result of a Prolog program dependent on the order of its rules and on the order of the bodies of its rules. Answer Set Programming (ASP) (brewka2011answer; Gebser2012aspbook) is a more recent logic programming formalism, featuring more declarativity than Prolog by defining semantics based on Herbrand models (Gelfond1988). Hence the order of rules and the order of the body of the rules does not matter in ASP. Most ASP programs follow the Generate-Define-Test structure (Lifschitz2002) to (i) generate a space of potential solutions, (ii) define auxiliary concepts, and (iii) test to invalidate solutions using constraints or to incur a cost on non-preferred solutions.
An ASP program consists of rules of the following structure:
$$a \leftarrow b_1, \ldots, b_m, \mathbf{not}\ b_{m+1}, \ldots, \mathbf{not}\ b_n$$
where $a$ and $b_i$ are atoms from a first-order language, $a$ is the head and $b_1, \ldots, \mathbf{not}\ b_n$ is the body of the rule, and $\mathbf{not}$ is negation as failure. Variables start with capital letters; facts (rules without body condition) are written as ‘$a.$’ instead of ‘$a \leftarrow\ .$’. Intuitively, $a$ is true if all positive body atoms are true and no negative body atom is true.
The formalism can be understood more clearly by considering the following sentence as a simple example:
Computers are normally fast machines unless they are old.
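This would be represented as a logical rule as follows:
$$\mathit{fastmachine}(X) \leftarrow \mathit{computer}(X), \mathbf{not}\ \mathit{old}(X).$$
where $X$ is a variable, $\mathit{fastmachine}$, $\mathit{computer}$, and $\mathit{old}$ are predicates, and $\mathit{old}(X)$ is a negated atom.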
Adding more knowledge can change a previous understanding; this is common in human reasoning. Classical First Order Logic does not allow such non-monotonic reasoning. ASP, however, was designed as a commonsense reasoning formalism: a program has zero or more answer sets as solutions, and adding knowledge to the program can remove answer sets as well as produce new ones. Note that ASP semantics rule out self-founded truths in answer sets. We use the ASP formalism due to its flexibility and declarativity. For formal details and a complete description of syntax and semantics see the ASP-Core-2 standard (Calimeri2012). ASP has been applied to several problems related to Natural Language Processing, see for example (Schwitter2012; Schuller2013aspccg; Schuller2014winograd; Sharma2015; Schuller2016aspfoa; mitra2016addressing). An overview of applications of ASP in general can be found in (Erdem2016aimag).
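The non-monotonic behaviour described above can be illustrated with a minimal Python sketch. It is not an ASP solver and not part of any system discussed here; it enumerates the answer sets of a tiny ground program by brute force, using the reduct-based definition of answer sets. The atom names and the constant `pc` are illustrative assumptions, and the approach is only feasible for a handful of atoms.

```python
from itertools import combinations


def least_model(definite_rules):
    """Least model of a negation-free ground program, computed by fixpoint iteration."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, positive in definite_rules:
            if positive <= model and head not in model:
                model.add(head)
                changed = True
    return model


def answer_sets(rules):
    """Enumerate answer sets of ground rules given as (head, positive_body, negative_body)."""
    atoms = set()
    for head, positive, negative in rules:
        atoms |= {head} | positive | negative
    ordered = sorted(atoms)
    found = []
    for size in range(len(ordered) + 1):
        for candidate in map(set, combinations(ordered, size)):
            # Reduct: drop rules whose negative body is contradicted by the
            # candidate, then ignore the negative bodies of the remaining rules.
            reduct = [(h, p) for h, p, n in rules if not (n & candidate)]
            # A candidate is an answer set iff it is the least model of its reduct.
            if least_model(reduct) == candidate:
                found.append(frozenset(candidate))
    return found


# Hypothetical ground instance: pc is a computer; it is fast unless it is old.
program = [
    ("computer(pc)", frozenset(), frozenset()),
    ("fastmachine(pc)", frozenset({"computer(pc)"}), frozenset({"old(pc)"})),
]
print(answer_sets(program))
# Adding the fact old(pc) retracts the earlier conclusion fastmachine(pc).
print(answer_sets(program + [("old(pc)", frozenset(), frozenset())]))
```

In this sketch, a rule such as `p :- p` yields only the empty answer set, which corresponds to the exclusion of self-founded truths mentioned above.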
2.2 Inductive Logic Programming
Processing natural language based on hand-crafted rules is impractical because human language is constantly evolving, partially due to the human creativity of language use. An example of this was recently noticed on UK highways, where signs advised drivers, ‘Don’t Pokémon Go and drive’. Pokémon Go is being informally used here as a verb even though it was only introduced as a game a few weeks before the sign was put up. To produce robust systems, it is necessary to use statistical models of language. These models are often pure Machine Learning (ML) estimators without any rule components (manning1999foundations). ML methods work very well in practice; however, they usually do not provide a way of explaining why a certain prediction was made, because they represent the learned knowledge in big matrices of real numbers. Some popular classifiers used for processing natural language include Naive Bayes, Decision Trees, Neural Networks, and Support Vector Machines (SVMs) (dumais1998inductive).
In this work, we focus on an approach that combines rule-based methods and statistics and provides interpretable learned models: Inductive Logic Programming (ILP). ILP is differentiated from ML techniques by its use of an expressive representation language and its ability to make use of logically encoded background knowledge (muggleton1994inductive). An important advantage of ILP over ML techniques such as neural networks is that a hypothesis can be made readable by translating it into a piece of English text. Furthermore, if annotated corpora of sufficient size are not available or are too expensive to produce, deep learning or other data-intensive techniques are not applicable, whereas we can still learn successfully with ILP.
Formally, ILP takes as input a set of examples $E$, a set $B$ of background knowledge rules, and a set of mode declarations $M$, also called mode bias. As output, ILP aims to produce a set of rules $H$, called the hypothesis, which entails $E$ with respect to $B$. The search for $H$ with respect to $E$ and $B$ is restricted by $M$, which defines a language that limits the shape of rules in the hypothesis candidates and therefore the complexity of potential hypotheses.
Consider the following example ILP instance (ray2009nonmonotonic).
Based on this, an ILP system would ideally find the following hypothesis.
3 Related Work
Inductive Logic Programming (ILP) is a rather multidisciplinary field which extends to domains such as computer science, artificial intelligence, and bioinformatics. Research done in ILP has been greatly impacted by Machine Learning (ML), Artificial Intelligence (AI) and relational databases. Quite a few surveys (gulwani2015inductive; muggleton2012ilp; kitzelmann2009inductive) discuss the systems and applications of ILP in interdisciplinary areas. We next give related work of ILP in general and then focus on ILP applied in the field of Natural Language Processing (NLP).
The foundations of ILP can be found in research by Plotkin (plotkin1970note; plotkin1971further), Shapiro (shapiro1983algorithmic) and Sammut and Banerji (sammut1986learning). The founding paper of Muggleton (muggleton1991inductive) led to the launch of the first international workshop on ILP. The strength of ILP lay in its ability to draw on and extend the existing successful paradigms of ML and Logic Programming. At the beginning, ILP was associated with the introduction of foundational theoretical concepts which included Inverse Resolution (muggleton1992machine; muggleton1995inverse) and Predicate Invention (muggleton1992machine; muggleton1991inductive). A number of ILP systems, such as FOIL (quinlan1990learning) and Golem (muggleton1990efficient), were developed alongside these theoretical concepts. The widely-used ILP system Progol (muggleton1995inverse) introduced a new logically-based approach to refinement graph search of the hypothesis space based on inverting the entailment relation. Meanwhile, the TILDE system (de1997logical) demonstrated the efficiency which could be gained by upgrading decision-tree learning algorithms to first-order logic; this was soon extended towards other ML problems. Some limitations of Prolog-based ILP include requiring extensional background and negative examples, lack of predicate invention, search limitations and inability to handle cuts. Integrating bottom-up and top-down searches, incorporating predicate invention, eliminating the need for explicit negative examples and allowing restricted use of cuts help in solving these issues (mooney1996inductive).
Probabilistic ILP (PILP) also gained popularity (muggleton1996stochastic; cussens2001integrating; de2008probabilistic). Its Prolog-based systems such as PRISM (sato2005generative) and FAM (cussens2001parameter) separate the actual learning of the logic program from the probabilistic parameter estimation of the individual clauses. However, in practice, learning the structure and parameters of a probabilistic logic representation simultaneously has proven to be a challenge (muggleton2002learning). PILP is mainly a unification of the probabilistic reasoning of Machine Learning with the relational logical representations offered by ILP.
Meta-interpretive learning (MIL) (muggleton2014meta) is a recent ILP method which learns recursive definitions using Prolog and ASP-based declarative representations. MIL is an extension of the Prolog meta-interpreter; it derives a proof by repeatedly fetching the first-order Prolog clauses and additionally fetching higher-order meta-rules whose heads unify with a given goal, and saves the resulting meta-substitutions to form a program.
Most ILP research has been aimed at Horn programs which exclude Negation as Failure (NAF). Negation is a key feature of logic programming and provides a means for nonmonotonic commonsense reasoning under incomplete information. The focus on Horn programs therefore fails to exploit the full potential of normal programs that allow NAF.
We next give an overview of ILP systems based on ASP that are designed to operate in the presence of negation. Then we give an overview of ILP literature related to NLP.
3.1 ASP-based ILP Systems
The eXtended Hybrid Abductive Inductive Learning system (XHAIL) is an ILP approach based on ASP that generalises techniques of language and search bias from Horn clauses to normal logic programs with full usage of NAF (ray2009nonmonotonic). Like its predecessor system Hybrid Abductive Inductive Learning (HAIL), which operated on Horn clauses, XHAIL is based on Abductive Logic Programming (ALP) (Kakas1992). We give more details on XHAIL in Section 4.
The Incremental Learning of Event Definitions (ILED) algorithm (katzouris2015incremental) relies on Abductive-Inductive learning and comprises a scalable clause refinement methodology based on a compressive summarization of clause coverage in a stream of examples. Previous ILP learners were batch learners and required all training data to be in place prior to the initiation of the learning process. ILED learns incrementally by processing training instances when they become available and altering previously inferred knowledge to fit new observations; this is also known as theory revision. It exploits previous computations to speed up the learning, since revising the hypothesis is considered more efficient than learning from scratch. ILED attempts to cover a maximum of examples by re-iterating over previously seen examples when the hypothesis has been refined. While XHAIL can ensure optimal example coverage easily by processing all examples at once, ILED does not preserve this property due to a non-global view on examples.
When considering ASP-based ILP, negation in the body of rules is not the only interesting addition to the overall concept of ILP. An ASP program can have several independent solutions, called answer sets. Even the background knowledge can admit several answer sets without any addition of facts from examples. Therefore, a hypothesis can cover some examples in one answer set, while others are covered by another answer set. The XHAIL and ILED approaches are based on finding a hypothesis that covers all examples in a single answer set.
The Inductive Learning of Answer Set Programs approach (ILASP) is an extension of the notion of learning from answer sets (law2014inductive). Importantly, it covers positive examples bravely (i.e., in at least one answer set) and ensures that the negation of negative examples is cautiously entailed (i.e., no negative example is covered in any answer set). Negative examples are needed to learn Answer Set Programs with non-determinism; otherwise there is no concept of what should not be in an Answer Set. ILASP conducts a search in multiple stages for brave and cautious entailment and processes all examples at once. ILASP performs a less informed hypothesis search than XHAIL or ILED, which means that large hypothesis spaces are infeasible for ILASP while they are not problematic for XHAIL and ILED. On the other hand, ILASP supports aggregates and constraints, while the older systems do not support these. ILASP2 (law2015learning) extends the hypothesis space of ILASP with choice rules and weak constraints. This permits searching for hypotheses that encode preference relations.
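The distinction between brave and cautious entailment can be made concrete with a short Python sketch. It is a simplification under stated assumptions: examples are single atoms rather than the partial interpretations used by ILASP, answer sets are given as plain Python sets, and the function names are hypothetical.

```python
def bravely_entailed(atom, answer_sets):
    """True if the atom holds in at least one answer set."""
    return any(atom in answer_set for answer_set in answer_sets)


def cautiously_entailed(atom, answer_sets):
    """True if the atom holds in every answer set (vacuously true if there are none)."""
    return all(atom in answer_set for answer_set in answer_sets)


def covers(positive, negative, answer_sets):
    """Positive examples must be bravely entailed; negatives must occur in no answer set."""
    positives_ok = all(bravely_entailed(e, answer_sets) for e in positive)
    negatives_ok = not any(bravely_entailed(e, answer_sets) for e in negative)
    return positives_ok and negatives_ok


# Two hypothetical answer sets of some program.
models = [{"a", "b"}, {"a", "c"}]
print(bravely_entailed("b", models), cautiously_entailed("b", models))
print(covers({"b", "c"}, {"d"}, models))
```

Here `b` and `c` are each covered in a different answer set, which is acceptable under brave coverage but would not be acceptable for an approach that requires all examples to hold in a single answer set.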
3.2 ILP and NLP
From an NLP point of view, the hope of ILP is to be able to steer a mid-course between the two alternatives of large-scale but shallow levels of analysis and small-scale but deep and precise analysis. ILP should produce a better ratio between breadth of coverage and depth of analysis (muggleton1999inductive). ILP has been applied to the field of NLP successfully; it has not only been shown to have higher accuracies than various other ML approaches in learning the past tense of English but has also been shown to be capable of learning accurate grammars which translate sentences into deductive database queries (law2014inductive).
Except for one early application (wirth1989completing), no application of ILP methods surfaced until the system CHILL (mooney1996inductive) was developed. CHILL learned a shift-reduce parser in Prolog from a training corpus of sentences paired with the desired parses, using ILP to learn control rules within this framework. This work also raised several issues regarding the capabilities and testing of ILP systems. CHILL was also used for parsing database queries to automate the construction of a natural language interface (zelle1996learning), which helped in demonstrating its ability to learn semantic mappings as well.
An extension of CHILL, CHILLIN (zelle1994combining), was used along with an extension of FOIL, mFOIL (tang2001using), for semantic parsing. CHILLIN combines top-down and bottom-up induction methods, while mFOIL is a top-down ILP algorithm designed with imperfect data in mind, which uses a pre-pruning algorithm to determine whether a clause refinement is significant for the overall performance. This work emphasised how the combination of multiple clause constructors helps improve the overall learning, a concept rather similar to Ensemble Methods in standard ML. Note that CHILLIN pruning is based on probability estimates and has the purpose of dealing with inconsistency in the data. In contrast, XHAIL already supports learning from inconsistent data, and the pruning we discuss in Section 4.1 aims to increase scalability.
Previously, ILP systems such as TILDE and Aleph (srinivasan2001aleph) have been applied to preference learning, which addressed learning ratings such as good, poor and bad. ASP expresses preferences through weak constraints or optimisation statements, which impose an ordering on the answer sets (law2015learning).
The system of Mitra and Baral (mitra2016addressing) uses ASP as primary knowledge representation and reasoning language to address the task of Question Answering. They use a rule layer that is partially learned with XHAIL to connect results from an Abstract Meaning Representation parser and an Event Calculus theory as background knowledge.
4 Extending XHAIL algorithm and system
Initially, we intended to use the latest ILP systems (ILASP2 or ILED) in our work. However, preliminary experiments with ILASP2 showed a lack of scalability (memory usage) even for only 100 sentences due to the unguided hypothesis search space. Moreover, experiments with ILED uncovered several problematic corner cases in the ILED algorithm that led to empty hypotheses when processing examples that were mutually inconsistent (which cannot be avoided in real-life NLP data). While trying to fix these problems in the algorithm, further issues in the ILED implementation came up. After consulting the authors of (mitra2016addressing) we learned that they had the same issues and used XHAIL. We therefore also opted to base our research on XHAIL, as it was the most robust tool for our task in comparison to the others.
Although XHAIL is applicable, we discovered several drawbacks and improved the approach and the XHAIL system. We provide an overview of the parts we changed and then present our modifications. Figure 1 shows in the middle the original XHAIL components and on the right our extension.
XHAIL finds a hypothesis in several steps. Initially, the examples $E$ plus the background knowledge $B$ are transformed into a theory of Abductive Logic Programming (Kakas1992). The Abduction part of XHAIL explains observations with respect to a prior theory, which yields the Kernel Set $\Delta$. $\Delta$ is a set of potential heads of rules given by the mode bias $M$ such that a maximum of examples in $E$ is satisfied together with $B$.
Example 2 (continued).
Given $(M, B, E)$ from Example 1, XHAIL uses $B$, $E$, and the head part of $M$ to generate the Kernel Set $\Delta$ by abduction.
The Deduction part uses the Kernel Set $\Delta$ and the body part of the mode bias $M$ to generate a ground program $K$. $K$ contains rules which define atoms in $\Delta$ as true based on the background knowledge $B$ and the examples $E$.
The Generalisation part replaces constant terms in the ground program $K$ with variables according to the mode bias $M$, which yields a non-ground program $K'$.
Example 3 (continued).
From the above $\Delta$ and $M$ from (1), deduction and generalisation yield the following $K$ and $K'$.
The Induction part searches for the smallest part of the non-ground program $K'$ that entails as many examples of $E$ as possible given $B$. This part of $K'$, which can contain a subset of the rules of $K'$ and for each rule a subset of its body atoms, is called a hypothesis $H$.
We next describe our modifications of XHAIL.
4.1 Kernel Pruning according to Support
The computationally most expensive part of the search in XHAIL is Induction. Each non-ground rule in $K'$ is rewritten into a combination of several guesses, one guess for the rule and one additional guess for each body atom in the rule.
We moreover observed that some non-ground rules in $K'$ are generalisations of many different ground rules in $K$, while some non-ground rules correspond to only a single instance in $K$. In the following, we say that the support of a rule $r \in K'$ in $K$ is the number of ground rules in $K$ that are transformed into $r$ in the Generalisation module of XHAIL (see Figure 1).
Intuitively, the higher the support, the more examples can be covered with that rule, and the more likely that rule or a part of it will be included in the optimal hypothesis.
Therefore we modified the XHAIL algorithm as follows.
During Generalisation, we keep track of the support of each rule $r \in K'$ by counting how often a generalisation yields the same rule $r$.
We add an integer pruning parameter $Pr$ to the algorithm and use only those rules from $K'$ in the Induction component that have a support higher than $Pr$.
This modification is depicted as bold components which replace the dotted Generalisation module in Figure 1.
Pruning has several consequences. From a theoretical point of view, the algorithm becomes incomplete for a pruning parameter $Pr > 0$, because Induction searches in a subset of the relevant hypotheses. Hence Induction might not be able to find a hypothesis that covers all examples, although such a hypothesis might exist with $Pr = 0$. From a practical point of view, pruning realises something akin to regularisation in classical ML; only strong patterns in the data will find their way into Induction and have the possibility to be represented in the hypothesis. A bit of pruning will therefore automatically prevent overfitting and generate more general hypotheses. As we will show in the experiments in Section 6, pruning allows us to configure a trade-off between considering low-support rules and omitting them entirely, and thereby between finding a closer-to-optimal hypothesis and obtaining only a highly suboptimal one.
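The support counting and pruning described in this section can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the XHAIL implementation: rules are encoded as nested tuples, the predicate names `split`, `pos`, and `nextpos` and the function names are hypothetical, and the generalisation step simply replaces every constant outside a given `keep` set by a variable. The variable naming is only canonical for simple rules such as the single-variable ones used here.

```python
from collections import Counter


def generalise(rule, keep):
    """Replace every constant not in `keep` by a variable V1, V2, ... (first-occurrence order)."""
    head, body = rule
    names = {}

    def lift(atom):
        predicate, args = atom
        lifted = []
        for arg in args:
            if arg in keep:
                lifted.append(arg)  # constants allowed by the (assumed) mode bias stay
            else:
                names.setdefault(arg, f"V{len(names) + 1}")
                lifted.append(names[arg])
        return (predicate, tuple(lifted))

    lifted_head = lift(head)
    return (lifted_head, tuple(sorted(lift(atom) for atom in body)))


def prune_by_support(ground_rules, keep, pr):
    """Keep only non-ground rules whose support is strictly higher than `pr`."""
    support = Counter(generalise(rule, keep) for rule in ground_rules)
    return {rule: count for rule, count in support.items() if count > pr}


keep = {"c_NN", "c_VBD", "c_DT"}
ground_rules = [
    (("split", (1,)), (("pos", ("c_NN", 1)), ("nextpos", ("c_VBD", 1)))),
    (("split", (4,)), (("pos", ("c_NN", 4)), ("nextpos", ("c_VBD", 4)))),
    (("split", (7,)), (("pos", ("c_DT", 7)),)),
]
for rule, count in prune_by_support(ground_rules, keep, pr=1).items():
    print(count, rule)
```

With a threshold of 0 both generalised rules are retained, while a threshold of 1 removes the rule that is supported by a single ground instance only.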
4.2 Unsat-core based and Best-effort Optimisation
We observed that ASP search in XHAIL Abduction and Induction components progresses very slowly from a suboptimal to an optimal solution. XHAIL integrates version 3 of Gringo (Gebser2011gringo3) and Clasp (Gebser2012aij) which are both quite outdated. In particular Clasp in this version does not support three important improvements that have been found for ASP optimisation: (i) unsat-core optimisation (Andres2012), (ii) stratification for obtaining suboptimal answer sets (Ansotegui2013maxsat; Alviano2015aspopt), and (iii) unsat-core shrinking (Alviano2016coreshrinking).
Method (i) inverts the classical branch-and-bound search methodology, which progresses from worse to better solutions. Unsat-core optimisation assumes all costs can be avoided and finds unsatisfiable cores of the problem until the assumption is true and a feasible solution is found. This has the disadvantage of providing only the final optimal solution. To circumvent this disadvantage, stratification in method (ii) was developed, which allows for combining branch-and-bound with method (i) to approach the optimal value both from cost 0 and from infinite cost. Furthermore, unsat-core shrinking in method (iii), also called ‘anytime ASP optimisation’, has the purpose of providing suboptimal solutions and aims to find smaller cores, which can speed up the search significantly by cutting more of the search space (at the cost of searching for a smaller core). In experiments with the inductive encoding of XHAIL we found that all three methods have a beneficial effect.
Currently, only the WASP solver (DBLP:conf/lpnmr/AlvianoDFLR13; DBLP:conf/lpnmr/AlvianoDLR15) supports all of (i), (ii), and (iii). We therefore integrated WASP, which has a different output format than Clasp, into XHAIL. We also upgraded XHAIL to use Gringo version 4, which uses the new ASP-Core-2 standard and has some further (performance) advantages over older versions.
Unsat-core optimisation often finds solutions with a reasonable cost, near the optimal value, and then takes a long time to find the true optimum or prove optimality of the found solution. Therefore, we extended XHAIL as follows:
a time budget for search can be specified on the command line,
after the time budget has elapsed, the best-known solution at that point is used and the algorithm continues, and furthermore
the distance from the optimal value is provided as output.
This affects the Induction step in Figure 1 and introduces a best-effort strategy; along with the obtained hypothesis we also get the distance from the optimal hypothesis, which is zero for optimal solutions.
Using a suboptimal hypothesis means that either fewer examples are covered by the hypothesis than possible, or that the hypothesis is bigger than necessary. In practice, receiving a result is better than receiving no result at all, and our experiments show that XHAIL becomes applicable to reasonably-sized datasets using these extensions.
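The best-effort strategy of this section can be sketched as a generic anytime loop in Python. This is a minimal sketch under assumptions, not the interface of XHAIL or WASP: the solver is modelled as an iterator of hypothetical `(solution, cost, lower_bound)` triples, and the reported distance is the gap between the best cost found and the best proven lower bound, which is zero exactly when optimality has been proved. The `clock` parameter is an illustrative addition that makes the time budget testable.

```python
import time


def best_effort(search_steps, budget_seconds, clock=time.monotonic):
    """Consume solver progress until optimality is proved or the time budget elapses.

    Returns the best solution found and its distance from the proven lower bound.
    """
    deadline = clock() + budget_seconds
    best, best_cost, bound = None, float("inf"), 0
    for solution, cost, lower_bound in search_steps:
        if cost < best_cost:
            best, best_cost = solution, cost  # improved (possibly suboptimal) solution
        bound = max(bound, lower_bound)       # tighter proof of what is unavoidable
        if best_cost <= bound or clock() >= deadline:
            break
    return best, best_cost - bound


# Hypothetical solver trace: costs decrease while the proven lower bound increases.
trace = [("h1", 10, 2), ("h2", 6, 4), ("h3", 5, 5)]
print(best_effort(trace, budget_seconds=60))
```

With a generous budget the loop ends once cost and bound meet; with a short budget it returns the best-known hypothesis together with a non-zero distance.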
4.3 Other Improvements
We made two minor engineering contributions to XHAIL. A practically effective improvement of XHAIL concerns the non-ground program $K'$. As seen in Example 3, three rules that are equivalent modulo variable renaming are contained in $K'$. XHAIL contains canonicalization algorithms for avoiding such situations, based on hashing body elements of rules. However, we found that for cases with more than one variable and for cases with more than one body atom, these algorithms are not effective because (i) XHAIL uses a set data structure that maintains an order over elements, (ii) this set data structure is sensitive to insertion order, and (iii) hashing the set relies on the order being canonical. We made this canonicalization algorithm applicable to a far wider range of cases by changing the data type of rule bodies in XHAIL to a set that maintains an order depending on the value of set elements. This comes at a very low additional cost for set insertion and often reduces the size of $K'$ (and therefore the computational effort for the Induction step) without adversely changing the result of induction.
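The effect of ordering rule bodies by value rather than by insertion can be illustrated with a small Python sketch. It is an analogy under assumptions, not the XHAIL code (XHAIL's internal data structures are not reproduced here): body atoms are plain strings with hypothetical predicate names, and a tuple stands in for an order-sensitive container whose hash depends on element order.

```python
def insertion_order_key(head, body):
    """Key that depends on the order in which body atoms were inserted."""
    return (head, tuple(body))


def value_order_key(head, body):
    """Key that depends only on the values of the body atoms."""
    return (head, tuple(sorted(set(body))))


# The same non-ground rule, built with its body atoms inserted in different orders.
rule_a = ("split(V1)", ["pos(c_NN,V1)", "nextpos(c_VBD,V1)"])
rule_b = ("split(V1)", ["nextpos(c_VBD,V1)", "pos(c_NN,V1)"])

print(insertion_order_key(*rule_a) == insertion_order_key(*rule_b))  # duplicates go undetected
print(value_order_key(*rule_a) == value_order_key(*rule_b))          # duplicates collapse
```

Sorting by value costs a little more on insertion, which matches the low additional cost mentioned above. Detecting equivalence modulo variable renaming additionally requires a canonical naming of variables, which this sketch does not address.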
Another improvement concerns debugging the ASP solver. XHAIL starts the external ASP solver and waits for the result. During ASP solving, no output is visible; however, ASP solvers provide output that is important for tracking the distance from optimality during a search. We extended XHAIL so that the output of the ASP solver can be made visible during the run using a command line option.
5 Chunking with ILP
We evaluate the improvements of the previous section using the NLP task of chunking. Chunking (tjong2000introduction) or shallow parsing is the identification of short phrases such as noun phrases or prepositional phrases, usually based heavily on Part of Speech (POS) tags. POS provides only information about the token type, i.e., whether words are nouns, verbs, adjectives, etc., and chunking derives from that a shallow phrase structure, in our case a single level of chunks.
Our framework for chunking has three main parts as shown in Figure 2. Preprocessing is done using the Stanford CoreNLP tool from which we obtain the facts that are added to the background knowledge of XHAIL or used with a hypothesis to predict the chunks of an input. Using XHAIL as our ILP solver we learn a hypothesis (an ASP program) from the background knowledge, mode bias, and from examples which are generated using the gold-standard data. We predict chunks using our learned hypothesis and facts from preprocessing, using the Clingo (gebser2008user) ASP solver. We test by scoring predictions against gold chunk annotations.
An example sentence in the SemEval iSTS dataset (companionpaper) is as follows.
Former Nazi death camp guard Demjanjuk dead at 91 (5)
The chunking present in the SemEval gold standard is as follows.
[ Former Nazi death camp guard Demjanjuk ] [ dead ] [ at 91 ] (6)
Stanford CoreNLP tools (manning2014stanford) are used for tokenisation and POS-tagging of the input. Using a shallow parser (bohnet2013joint) we obtain the dependency relations for the sentences. Our ASP representation contains atoms of the following form:
which represents that token has POS tag ,
which represents that token has surface form ,
and which represent that token depends on token with dependency relation .
Example 6 (continued).
We use Penn Treebank POS-tags as they are provided by Stanford CoreNLP. To form valid ASP constant terms from POS-tags, we prefix them with ‘c_’ and replace special characters with lowercase letters (e.g., ‘PRP$’ becomes ‘c_PRPd’). In addition, we create specific POS-tags for punctuation (see Section 6.4).
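The tag conversion described above can be sketched as a small Python helper. Only the prefix ‘c_’ and the replacement of ‘$’ by ‘d’ are taken from the text; the function name and the option to pass further replacements are hypothetical, and the specific punctuation tags are not covered by this sketch.

```python
def pos_to_asp_constant(tag, special=None):
    """Turn a POS tag into an ASP constant: prefix 'c_' and replace special characters."""
    # Only the mapping '$' -> 'd' is documented; other mappings must be supplied.
    special = {"$": "d"} if special is None else special
    return "c_" + "".join(special.get(character, character) for character in tag)


for tag in ["NN", "VBD", "PRP$"]:
    print(tag, pos_to_asp_constant(tag))
```

The prefix ensures that the resulting term starts with a lowercase letter, so that an ASP grounder reads it as a constant rather than as a variable.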
5.2 Background Knowledge and Mode Bias
The background knowledge we use is shown in Figure 2(b). We define which POS-tags can exist in predicate and which tokens exist in predicate . Moreover, we provide for each token the POS-tag of its successor token in predicate .
Mode bias conditions are shown in Figure 2(c); these limit the search space for hypothesis generation. Hypothesis rules contain as head atoms of the form
which indicates that a chunk ends at token and a new chunk starts at token . The argument of predicates in the head is of type .
The body of hypothesis rules can contain and predicates, where the first argument is a constant of type (which is defined in Figure 2(b)) and the second argument is a variable of type . Hence this mode bias searches for rules defining chunk splits based on the POS-tag of a token and of the next token.
We deliberately use a very simple mode bias that does not make use of all atoms in the facts obtained from preprocessing. This is discussed in Section 6.5.
5.3 Learning with ILP
Learning with ILP is based on examples that guide the search. Figure 2(d) shows rules that recognise gold standard chunks and instructions that define for XHAIL which atoms must be true to entail an example. These rules define in their head what a good (i.e., gold standard) chunk is in each example, based on where the splits between chunks occur in the training data; this guides the learning of a hypothesis for chunking.
Note that negation is present only in these rules, although we could use it anywhere else in the background knowledge. Using the background knowledge, mode bias, and examples, XHAIL is then able to learn a hypothesis.
5.4 Chunking with ASP using Learned Hypothesis
The hypothesis generated by XHAIL can then be used together with the background knowledge specified in Figure 2(b), and with the preprocessed input of a new sentence. Evaluating all these rules yields a set of split points in the sentence, which corresponds to a predicted chunking of the input sentence.
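How a learned hypothesis turns into a chunking can be illustrated outside of ASP. The sketch below is an assumption-laden simplification, not the ASP evaluation used in our system: the function `predict_chunks` is hypothetical, and each rule of the form `split(V) :- token(V), pos(X,V), nextpos(Y,V)` is encoded as a pair `(X, Y)`, where `None` stands for an absent body atom.

```python
def predict_chunks(tokens, tags, rules):
    """Split a sentence into chunks using rules over current/next POS-tags.

    tokens: list of surface forms; tags: ASP-style POS constants per token;
    rules: list of (pos, nextpos) pairs, None meaning "no condition".
    """
    chunks, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_tag = tags[i + 1] if i + 1 < len(tags) else None
        # A split after token i is derived if at least one rule body holds.
        split = any(
            (pos is None or pos == tags[i])
            and (nextpos is None or nextpos == next_tag)
            for pos, nextpos in rules
        )
        # The last token always closes the final chunk.
        if split or i == len(tokens) - 1:
            chunks.append(current)
            current = []
    return chunks


# Four rules taken from Table 3, applied to a made-up example sentence.
rules = [("c_VBD", None), (None, "c_IN"), (None, "c_VBZ"), (None, "c_VBD")]
tokens = ["the", "dog", "walked", "in", "the", "park"]
tags = ["c_DT", "c_NN", "c_VBD", "c_IN", "c_DT", "c_NN"]
chunks = predict_chunks(tokens, tags, rules)
# chunks == [["the", "dog"], ["walked"], ["in", "the", "park"]]
```

Because every rule only inspects the current and the next POS-tag, a single left-to-right pass over the sentence is sufficient.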
6 Evaluation and Discussion
We use the datasets from the SemEval 2016 iSTS Task 2 (companionpaper), which included two separate files containing sentence pairs. Three different datasets were provided: Headlines, Images, and Answers-Students. The Headlines dataset was mined from various news sources by European Media Monitor. The Images dataset is a collection of captions obtained from the Flickr dataset (rashtchian2010collecting). The Answers-Students corpus consists of the interactions between students and the BEETLE II tutorial dialogue system, an intelligent tutoring engine that teaches students basic electricity and electronics. In the following, we denote by S1 and S2 the first and second sentence, respectively, of the sentence pairs in these datasets. Regarding the size of the SemEval Training dataset, the Headlines and Images datasets are larger and contain 756 and 750 sentence pairs, respectively, whereas the Answers-Students dataset is smaller and contains only 330 sentence pairs. In addition, all datasets contain a Test portion of sentence pairs.
We use 11-fold cross-validation to evaluate chunking with ILP, which yields 11 learned hypotheses and 11 evaluation scores for each parameter setting. We also test each of these hypotheses on the Test portion of the respective dataset. From the scores obtained this way we compute mean and standard deviation, and perform statistical tests to find out whether observed score differences between parameter settings are statistically significant.
Table 1 shows which portions of the SemEval Training dataset we used for 11-fold cross-validation. In the following, we call these datasets Cross-Validation Sets. We chose the first 110 and 550 examples to use for 11-fold cross-validation which results in training set sizes 100 and 500, respectively. As the Answers-Students dataset was smaller, we merged its sentence pairs in order to obtain a Cross-Validation Set size of 110 sentences, using the first 55 sentences from S1 and S2; and for 550 sentences, using the first 275 sentences from S1 and S2 each. As Test portions we only use the original SemEval Test datasets and we always test S1 and S2 separately.
|Dataset||Size||Cross-Validation Set||Test Set S1||Test Set S2|
|H/I||100||S1 first 110||all||*|
|H/I||500||S1 first 550||all||*|
|H/I||100||S2 first 110||*||all|
|H/I||500||S2 first 550||*||all|
|A-S||100||S1 first 55 + S2 first 55||all||all|
|A-S||500||S1 first 275 + S2 first 275||all||all|
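The relation between Cross-Validation Set sizes (110 and 550) and training set sizes (100 and 500) follows directly from holding out one of 11 folds. A minimal sketch, assuming contiguous folds without shuffling (the actual fold assignment is not specified here, and `k_fold_splits` is a hypothetical helper):

```python
def k_fold_splits(examples, k=11):
    """Yield (train, test) pairs; each fold is held out exactly once."""
    if len(examples) % k != 0:
        raise ValueError("number of examples must be divisible by k")
    fold_size = len(examples) // k
    for i in range(k):
        start, end = i * fold_size, (i + 1) * fold_size
        test = examples[start:end]
        train = examples[:start] + examples[end:]
        yield train, test


# 110 sentences give 11 splits with 100 training and 10 test sentences.
splits = list(k_fold_splits(list(range(110))))
```

With 550 sentences the same procedure gives 500 training and 50 held-out sentences per split.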
We use difflib.SequenceMatcher in Python to match the sentence chunks obtained from learning in ILP against the gold-standard sentence chunks. From the matchings obtained this way, we compute precision, recall, and F1-score as follows.
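A minimal sketch of such an evaluation is given below. It rests on assumptions that are not taken from our evaluation scripts: chunks are represented as tuples of tokens, a chunk counts as matched if it is part of a matching block reported by `difflib.SequenceMatcher`, precision is the number of matched chunks divided by the number of predicted chunks, and recall is the number of matched chunks divided by the number of gold-standard chunks. The function name `chunk_scores` is hypothetical.

```python
from difflib import SequenceMatcher


def chunk_scores(predicted, gold):
    """Return (precision, recall, F1) for two sequences of chunks.

    Each chunk is a tuple of tokens, so that chunks are hashable.
    """
    matcher = SequenceMatcher(None, predicted, gold, autojunk=False)
    # The final matching block is a dummy of size 0 and adds nothing.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


predicted = [("the", "cat"), ("sat",), ("on", "the", "mat")]
gold = [("the", "cat"), ("sat",), ("on",), ("the", "mat")]
precision, recall, f1 = chunk_scores(predicted, gold)
# Two chunks match: precision = 2/3, recall = 2/4, F1 = 4/7.
```

The F1-score is the harmonic mean of precision and recall, so it is high only if both predicted and gold-standard chunks are largely matched.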
To investigate the effectiveness of our mode bias for learning a hypothesis that can correctly classify the dataset, we perform cross-validation (see above) and also measure the correctness of all hypotheses obtained in cross-validation on the Test set.
Because of differences in S1/S2 portions of datasets, we report results separately for S1 and S2. We also evaluate classification separately for S1 and S2 for the Answers-Students dataset, although we train on a combination of S1 and S2.
6.3 Experimental Methodology
We use Gringo version 4.5 (Gebser2011gringo3) and we use WASP version 2 (Git hash a44a95) (DBLP:conf/lpnmr/AlvianoDLR15) configured to use unsat-core optimisation with disjunctive core partitioning, core trimming, a budget of 30 seconds for computing the first answer set and for shrinking unsatisfiable cores with progressive shrinking strategy. These parameters were found most effective in preliminary experiments. We configure our modified XHAIL solver to allocate a budget of 1800 seconds for the Induction part which optimises the hypothesis (see Section 4.2). Memory usage never exceeded 5 GB.
Tables 4–6 contain the experimental results for each dataset. Three of their columns show, respectively, the number of sentences used to learn the hypothesis (Size), the pruning parameter for generalising the learned hypothesis (see Section 4.1), and a rate of how close the learned hypothesis is to the optimal result. This rate is computed from upper and lower bounds on the cost of the answer set. A value of zero means optimality, and values above zero mean suboptimality: the higher the value, the further away from optimality. Our results comprise the mean and standard deviation of the F1-scores obtained from our 11-fold cross-validation on S1 and S2 individually (column CV). Due to lack of space, we leave out the precision and recall scores for cross-validation, but these values show similar trends as in the Test set. For the Test sets of both S1 and S2, we include the mean and standard deviation of the Precision, Recall and F1-scores (column group T).
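To illustrate how a rate with these properties can be derived from two bounds, the following sketch computes a relative gap. This is an assumption made for illustration only: the normalisation by the lower bound and the function name `relative_gap` are hypothetical and may differ from the formula used for the tables.

```python
def relative_gap(upper_bound, lower_bound):
    """Relative distance between the best cost found and the proven bound.

    Returns 0.0 if the bounds coincide (optimality is proven) and a
    positive value otherwise.
    """
    if lower_bound > upper_bound:
        raise ValueError("lower bound must not exceed upper bound")
    if upper_bound == lower_bound:
        return 0.0
    if lower_bound == 0:
        # Assumption: with a zero lower bound the gap is unbounded.
        return float("inf")
    return (upper_bound - lower_bound) / lower_bound
```

Under this assumed definition, an answer set of cost 15 with a proven lower bound of 10 has a gap of 0.5, and the gap shrinks to zero as the solver tightens either bound.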
When testing machine-learning based systems, comparing results obtained on a single test set is often not sufficient; therefore we performed cross-validation to obtain the mean and standard deviation of our benchmark metrics. To obtain even more solid evidence about the significance of the measured results, we additionally performed a one-tailed paired t-test to check whether a measured F1 score is significantly higher in one setting than in another. We consider a result significant if p < 0.05, i.e., if there is a probability of less than 5 % that the result is due to chance. Our test is one-tailed because we check whether one result is higher than another one, and it is a paired test because we test different parameters on the same set of 11 training/test splits in cross-validation. There are even more powerful methods for proving significance of results, such as bootstrap sampling (Efron1986); however, these methods require markedly higher computational effort in experiments, and our experiments already show significance with the t-test.
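The test can be sketched with the Python standard library alone. This is an illustration under stated assumptions rather than the script used for our evaluation: the helper names are hypothetical, and instead of computing a p-value the sketch compares the t statistic with 1.812, the tabulated one-tailed critical value at the 5 % level for 10 degrees of freedom (11 paired scores).

```python
import math
import statistics

# Tabulated critical value: one-tailed, alpha = 0.05, 10 degrees of freedom.
T_CRIT_DF10 = 1.812


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-split differences scores_a - scores_b."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs equally long score lists")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)
    if sd_diff == 0:
        # All differences are identical; the statistic degenerates.
        if mean_diff == 0:
            return 0.0
        return math.inf if mean_diff > 0 else -math.inf
    return mean_diff / (sd_diff / math.sqrt(len(diffs)))


def significantly_higher(scores_a, scores_b, t_crit=T_CRIT_DF10):
    """True if setting A scores significantly higher than setting B."""
    return paired_t_statistic(scores_a, scores_b) > t_crit
```

The critical value must be replaced if the number of splits changes, because it depends on the degrees of freedom.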
Rows of Tables 4–6 contain results for learning from 100 and 500 example sentences, respectively, and for different pruning parameters. For both learning set sizes, we increased pruning stepwise starting from value 0 until we found an optimal hypothesis (i.e., a suboptimality value of zero) or until we saw a clear peak in classification score in cross-validation (in that case, increasing the pruning further is pointless because it would increase optimality of the hypothesis but decrease the prediction scores).
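This stepwise procedure can be summarised in a short sketch. It simplifies the description above in one respect, which is an assumption of the sketch: a "clear peak" is detected as the first decrease of the cross-validation score. The function `select_pruning` and the callback `evaluate` are hypothetical names; `evaluate` stands for running 11-fold cross-validation with a given pruning value and returning the mean F1 score together with a flag that tells whether the hypothesis was proven optimal.

```python
def select_pruning(evaluate, max_pruning=10):
    """Increase pruning from 0 until optimality or a score peak is reached.

    evaluate(pruning) must return (mean_cv_f1, is_optimal).
    Returns the best pruning value and its mean cross-validation F1.
    """
    best_pruning, best_f1 = None, float("-inf")
    for pruning in range(max_pruning + 1):
        f1, is_optimal = evaluate(pruning)
        if f1 > best_f1:
            best_pruning, best_f1 = pruning, f1
        elif f1 < best_f1:
            break  # the score dropped, so the peak has been passed
        if is_optimal:
            break  # an optimal hypothesis was found
    return best_pruning, best_f1
```

A more conservative variant could continue for a few steps after a decrease to guard against noise in the cross-validation scores.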
Note that the datasets have been tokenised very differently, and that state-of-the-art systems in SemEval also used separate preprocessing methods for each dataset. We follow this strategy to allow a fair comparison. One example of such a difference is the Images dataset, where the ‘.’ is considered a separate token and is later defined as a separate chunk, whereas in the Answers-Students dataset it is attached to neighbouring tokens.
We first discuss the results of experiments with varying training set size and varying pruning parameter, then compare our approach with the state-of-the-art systems, and finally inspect the optimal hypotheses.
Training Set Size and Pruning Parameter
We observe that by increasing the size of the training set used to learn the hypothesis, our scores improved considerably. Because more information is provided, the learned hypothesis predicts with a higher F1 score. We also observed that for the smaller training set size (100 sentences), lower pruning numbers (in rare cases even a value of 0) resulted in achieving the optimal solution. For the bigger training set size (500 sentences), without pruning the ILP procedure does not find solutions close to the optimal solution. However, by using pruning values up to 10 we can reduce the size of the search space and find hypotheses closer to the optimum, which predict chunks with a higher F1 score. Our statistical test shows that, in many cases, several increments of the pruning parameter yield significantly better results, up to a point where prediction accuracy degrades because too many examples are pruned away. To select the best hypothesis, we increase the pruning parameter until we reach the peak in the F1 score in cross-validation.
Finding optimal hypotheses in the Inductive search of XHAIL (i.e., with a suboptimality value of zero) is easily attained when learning from 100 sentences. For learning from 500 sentences, very high pruning results in a trivial optimal hypothesis (i.e., every token is a chunk) which has no predictive power, hence we do not increase the pruning parameter beyond a value of 10.
Note that we never encountered timeouts in the Abduction component of XHAIL, only in the Induction part. The original XHAIL tool without our improvements yields only timeouts for learning from 500 examples, and few hypotheses for learning from 100 examples. Therefore we do not show these results in tables.
|IISCNLP - Run1||61.9||68.5||61.4||61.1||65.7||60.1|
|IISCNLP - Run2||67.6||68.5||64.5||***||71.4||71.9||68.9||**|
|Inspire - Manual||64.5||70.4||62.4||64.3||68.4||62.2|
|Inspire - Learned||68.1±2.5||70.6±2.5||65.4±2.6||**||67.2±1.3||68.2±2.4||64.0±1.8||***|
|IISCNLP - Run1||61.6||60.9||60.7||66.1||66.2||65.9|
|IISCNLP - Run2||65.8||65.6||65.4||67.7||67.2||67.3|
|Inspire - Manual||74.5||74.2||74.2||**||73.8||73.6||73.6||**|
|Inspire - Learned||66.4±15.5||74.3±0.7||73.7±0.7||***||71.1±0.8||71.1±0.8||70.9±0.8||***|
|IISCNLP - Run1||67.9||63.9||60.7||***||65.7||55.0||54.0|
|IISCNLP - Run2||63.0||59.8||56.9||66.2||52.5||52.8|
|Inspire - Manual||66.8||64.4||59.7||71.2||62.5||62.1||***|
|Inspire - Learned||66.8±2.8||70.5±2.5||63.5±2.4||**||89.3±3.0||80.1±0.7||80.3±1.7||*|
Table 2 shows a comparison of our results with the baseline and the three best systems from the chunking subtask of SemEval 2016 Task 2 (companionpaper): DTSim (banjade2016dtsim), FBK-HLT-NLP (magnolini2016fbk) and runs 1 and 2 of IISCNLP (tekumalla2016iiscnlp). We also compare with results of our own system ‘Inspire-Manual’ (kazmi2016inspire).
The baseline makes use of the automatic probabilistic chunker from the IXA-pipeline which provides Perceptron models (collins2002discriminative) for chunking and is trained on CONLL2000 corpora and corrected manually,
DTSim uses a Conditional Random Field (CRF) based chunking tool using only POS-tags as features,
FBK-HLT-NLP obtains chunks using a Python implementation of MBSP chunker which uses a Memory-based part-of-speech tagger generator (daelemans1996mbt),
Run 1 of IISCNLP uses the OpenNLP chunker, which divides the sentence into syntactically correlated groups of words but specifies neither their internal structure nor their role in the main sentence. Run 2 uses the Stanford NLP Parser to create parse trees and then uses a Perl script to create chunks based on the parse trees, and
Inspire-Manual (our previous system) makes use of manually set chunking rules (abney1991parsing) using ASP (kazmi2016inspire).
Using the gold-standard chunks provided by the organisers we were able to compute the precision, recall, and F1-scores for analysis on the Headlines, Images and Answers-Students datasets.
For the scores of our system ‘Inspire-Learned’, we used the mean and standard deviation obtained on the Test set by the best configuration of our system, as determined in cross-validation experiments, and compared them against the other systems’ Test set results. Our system’s performance is quite robust: it always scores within the top three systems.
|split(V) :- token(V), pos(c_VBD,V).||X||X||X|
|split(V) :- token(V), nextpos(c_IN,V).||X||X||X|
|split(V) :- token(V), nextpos(c_VBZ,V).||X||X||X|
|split(V) :- token(V), pos(c_VB,V).||X||X|
|split(V) :- token(V), nextpos(c_TO,V).||X||X|
|split(V) :- token(V), nextpos(c_VBD,V).||X||X|
|split(V) :- token(V), nextpos(c_VBP,V).||X||X|
|split(V) :- token(V), pos(c_VBZ,V), nextpos(c_DT,V).||X||X|
|split(V) :- token(V), pos(c_NN,V), nextpos(c_RB,V).||X||X|
|split(V) :- token(V), pos(c_NNS,V).||X|
|split(V) :- token(V), pos(c_VBP,V).||X|
|split(V) :- token(V), pos(c_VBZ,V).||X|
|split(V) :- token(V), pos(c_c,V).||X|
|split(V) :- token(V), nextpos(c_POS,V).||X|
|split(V) :- token(V), nextpos(c_VBN,V).||X|
|split(V) :- token(V), nextpos(c_c,V).||X|
|split(V) :- token(V), pos(c_PRP,V).||X|
|split(V) :- token(V), pos(c_RP,V).||X|
|split(V) :- token(V), pos(c_p,V).||X|
|split(V) :- token(V), nextpos(c_p,V).||X|
|split(V) :- token(V), pos(c_CC,V), nextpos(c_VBG,V).||X|
|split(V) :- token(V), pos(c_NN,V), nextpos(c_VBD,V).||X|
|split(V) :- token(V), pos(c_NN,V), nextpos(c_VBG,V).||X|
|split(V) :- token(V), pos(c_NN,V), nextpos(c_VBN,V).||X|
|split(V) :- token(V), pos(c_NNS,V), nextpos(c_VBG,V).||X|
|split(V) :- token(V), pos(c_RB,V), nextpos(c_IN,V).||X|
|split(V) :- token(V), pos(c_VBG,V), nextpos(c_DT,V).||X|
|split(V) :- token(V), pos(c_VBG,V), nextpos(c_JJ,V).||X|
|split(V) :- token(V), pos(c_VBG,V), nextpos(c_PRPd,V).||X|
|split(V) :- token(V), pos(c_VBG,V), nextpos(c_RB,V).||X|
|split(V) :- token(V), pos(c_VBZ,V), nextpos(c_IN,V).||X|
|split(V) :- token(V), pos(c_EX,V).||X|
|split(V) :- token(V), pos(c_RB,V).||X|
|split(V) :- token(V), pos(c_VBG,V).||X|
|split(V) :- token(V), pos(c_WDT,V).||X|
|split(V) :- token(V), pos(c_WRB,V).||X|
|split(V) :- token(V), nextpos(c_EX,V).||X|
|split(V) :- token(V), nextpos(c_MD,V).||X|
|split(V) :- token(V), nextpos(c_VBG,V).||X|
|split(V) :- token(V), nextpos(c_RB,V).||X|
|split(V) :- token(V), pos(c_IN,V), nextpos(c_NNP,V).||X|
|split(V) :- token(V), pos(c_NN,V), nextpos(c_WDT,V).||X|
|split(V) :- token(V), pos(c_NN,V), nextpos(c_IN,V).||X|
|split(V) :- token(V), pos(c_NNS,V), nextpos(c_IN,V).||X|
|split(V) :- token(V), pos(c_NNS,V), nextpos(c_VBP,V).||X|
|split(V) :- token(V), pos(c_RB,V), nextpos(c_DT,V).||X|
Inspection of Hypotheses
Table 3 shows the rules that are obtained from the hypothesis generated by XHAIL from Sentence 1 files of all the datasets. We have also tabulated the common rules present between the datasets and the extra rules which differentiate the datasets from each other. POS-tags for punctuation are ‘c_p’ for sentence-final punctuation (‘.’, ‘?’, and ‘!’) and ‘c_c’ for sentence-separating punctuation (‘,’, ‘;’, and ‘:’).
Rules which occur in all learned hypotheses can be interpreted as follows (recall the meaning of split(V) from Section 5.2): (i) chunks end at past tense verbs (VBD, e.g., ‘walked’), (ii) chunks begin at subordinating conjunctions and prepositions (IN, e.g., ‘in’), and (iii) chunks begin at 3rd person singular present tense verbs (VBZ, e.g., ‘walks’). Rules that are common to H and AS datasets are as follows: (i) chunks end at base forms of verbs (VB, e.g., ‘[to] walk’), (ii) chunks begin at ‘to’ prepositions (TO), and (iii) chunks begin at past tense verbs (VBD). The absence of (i) in hypotheses for the Images dataset can be explained by the rareness of such verbs in captions of images. Note that (iii) together with the common rule (i) means that all VBD verbs become separate chunks in H and AS datasets. Rules that are common to I and AS datasets are as follows: (i) chunks begin at non-3rd person verbs in present tense (VBP, e.g., ‘[we] walk’), (ii) chunk boundaries are between a determiner (DT, e.g., ‘both’) and a 3rd person singular present tense verb (VBZ), and (iii) chunk boundaries are between adverbs (RB, e.g., ‘usually’) and common, singular, or mass nouns (NN, e.g., ‘humor’). Interestingly, there are no rules common to H and I datasets except for the three rules mutual to all three datasets.
For rules occurring only in single datasets, we only discuss a few interesting cases in the following. Rules that are unique to the Headlines dataset include rules which indicate that the sentence separators ‘,’, ‘;’, and ‘:’ become single chunks; moreover, chunks start at genitive markers (POS, ‘’s’). Neither is the case for the other two datasets. Rules unique to the Images dataset include that sentence-final punctuation (‘.’, ‘?’, and ‘!’) become separate chunks, rules for chunk boundaries between verb (VB_) and noun (NN_) tokens, and chunk boundaries between possessive pronouns (PRP$, encoded as ‘c_PRPd’, e.g., ‘their’) and participles/gerunds (VBG, e.g., ‘falling’). Rules unique to the Answers-Students dataset include chunks containing ‘existential there’ (EX), adverb tokens (RB), gerunds (VBG), and several rules for splits related to WH-determiners (WDT, e.g., ‘which’), WH-adverbs (WRB, e.g., ‘how’), and prepositions (IN).
We see that learned hypotheses are interpretable, which is not the case in classical machine learning techniques such as Neural Networks (NN), Conditional Random Fields (CRF), and Support Vector Machines (SVM).
We next discuss the potential impact of our approach in NLP and in other applications, outline the strengths and weaknesses, and discuss reasons for several design choices we made.
Impact and Applicability
ILP is applicable to many problems of traditional machine learning, but usually only applicable for small datasets. Our addition of pruning enables learning from larger datasets at the cost of obtaining a more coarse-grained hypothesis and potentially suboptimal solutions.
The main advantage of ILP is interpretability and that it can achieve good results already with small datasets. Interpretability of the learned rule-based hypothesis makes the learned hypothesis transparent as opposed to black-box models of other approaches in the field such as Conditional Random Fields, Neural Networks, or Support Vector Machines. These approaches are often purely statistical, operate on big matrices of real numbers instead of logical rules, and are not interpretable. The disadvantage of ILP is that it often does not achieve the predictive performance of purely statistical approaches because the complexity of ILP learning limits the number of distinct features that can be used simultaneously.
Our approach allows finding suboptimal hypotheses which yield a higher prediction accuracy than an optimal hypothesis trained on a smaller training set. Learning a better model from a larger dataset is exactly what we would expect in machine learning. Before our improvement of XHAIL, obtaining any hypothesis from larger datasets was impossible: the original XHAIL tool does not return any hypothesis within several hours when learning from 500 examples.
Our chunking approach learns from a small portion of the full SemEval Training dataset, based on only POS-tags, but it still achieves results close to the state-of-the-art. Additionally, it provides an interpretable model that allowed us to pinpoint non-uniform annotation practices in the three datasets of the SemEval 2016 iSTS competition. These observations give direct evidence for differences in annotation practice for the three datasets with respect to punctuation and genitives, as well as differences in the content of the datasets.
Strengths and weaknesses
Our additions of pruning and the usage of suboptimal answer sets make ILP more robust because they permit learning from larger datasets and obtaining (potentially suboptimal) solutions faster.
Our addition of a time budget and usage of suboptimal answer sets is a purely beneficial addition to the XHAIL approach. If we disregard the additional benefits of pruning, i.e., if we disable pruning by setting the pruning parameter to 0, then within the same time budget the same optimal solutions are found as with the original XHAIL approach. In addition, before finding the optimal solution, suboptimal hypotheses are provided in an online manner, together with information about their distance from the optimal solution.
The strength of pruning before the Induction phase is that it permits learning from a bigger set of examples, while still considering all examples in the dataset. A weakness of pruning is that a hypothesis which fits the data perfectly might no longer be found, even if the mode bias could permit such a perfect fit. In NLP applications this is not a big disadvantage, because noise usually prevents a perfect fit anyway, and overfitting models is indeed often a problem. However, in other application domains such as learning to interpret input data from user examples (gulwani2015inductive), a perfect fit to the input data might be desired and required. Note that pruning examples to learn from inconsistent data as done by Tang and Mooney (tang2001using) is not necessary for our approach. Instead, non-covered examples incur a cost that is optimised to be as small as possible.
In our study, we use a simple mode bias containing only the current and next POS tags, which is a deliberate choice to make results easier to compare. We performed experiments with additional body atoms in the body mode bias, and also with negation in the body mode bias. However, these experiments yielded significantly larger hypotheses with only small increases in accuracy. Therefore we here limit the analysis to the simple case and consider more complex mode biases as future work. Note that the best state-of-the-art system (DTSim) is a CRF model solely based on POS-tags, just as our hypothesis only makes use of POS-tags. By considering more than the current and immediately succeeding POS tag, DTSim can achieve better results than we do.
The representation of examples is an important part of our chunking case as described in Section 5. We define good chunks with rules that consider presence and absence of splits for each chunk. We make use of the power of NAF in these rules. We also experimented with an example representation that just gave all desired splits as #example split(X) and all undesired splits as #example not split(Y). This representation contains an imbalance in the split versus not split class; moreover, chunks are not represented as a concept that can be optimised in the inductive search for the best hypothesis. Hence, it is not surprising that this simpler representation of examples gave drastically worse scores, and we do not report any of these results in detail.
7 Conclusion and Future Work
Inductive Logic Programming combines logic programming and machine learning, and it provides interpretable models, i.e., logical hypotheses, which are learned from data. ILP has been applied to a variety of NLP and other problems such as parsing (zelle1996learning; tang2001using), automatic construction of biological knowledge bases from scientific abstracts (Craven1999biokbc), automatic scientific discovery (King2004robotscientist), and in Microsoft Excel (gulwani2015inductive), where users can specify data extraction rules using examples. Therefore, ILP research has the potential for being used in a wide range of applications.
In this work, we explored the usage of ILP for the NLP task of chunking and extended the XHAIL ILP solver to increase its scalability and applicability for this task. Results indicate that ILP is competitive with state-of-the-art ML techniques for this task and that we successfully extended XHAIL to allow learning from larger datasets than previously possible. Learning a hypothesis using ILP has the advantage of an interpretable representation of the learned knowledge, such that we know exactly which rule has been learned by the program and how it affects our NLP task. In this study, we also gained insights about the differences and common points of the datasets that we learned a hypothesis from. Moreover, ILP permits learning from small training sets where techniques such as Neural Networks fail to provide good results.
As a first contribution to the ILP tool XHAIL we have upgraded the software so that it uses the newest solver technology, and so that this technology is used in a best-effort manner that can utilise suboptimal search results. This is effective in practice, because finding the optimal solution can be disproportionately more difficult than finding a solution close to the optimum. Moreover, the ASP technique we use here provides clear information about the degree of suboptimality. During our experiments, a new version of Clingo was published which contains most techniques in WASP (except for core shrinking). We decided to continue using WASP for this study because we saw that core shrinking is also beneficial to search. Extending XHAIL to use Clingo in a best-effort manner is quite straightforward.
As a second contribution to XHAIL we have added a pruning parameter to the algorithm that allows fine-tuning the search space for hypotheses by filtering out rule candidates that are supported by fewer examples than other rules. This addition is a novel contribution to the algorithm, which leads to significant improvements in efficiency, and increases the number of hypotheses that are found in a given time budget. While pruning makes the method incomplete, it does not reduce expressivity. The hypotheses and background knowledge may still contain unrestricted Negation as Failure. Pruning in our work is similar to the concept of the regularisation in ML and is there to prevent overfitting in the hypothesis generation. Pruning enables the learning of logical hypotheses with dataset sizes that were not feasible before. We experimentally observed a trade-off between finding an optimal hypothesis that considers all potential rules on one hand, and finding a suboptimal hypothesis that is based on rules that are supported by few examples. Therefore the pruning parameter has to be adjusted on an application-by-application basis.
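The idea of support-based pruning can be sketched as follows. The sketch is a simplification under explicit assumptions and not the implementation in our XHAIL fork: rule candidates are plain strings, support is counted as the number of examples that give rise to a candidate, and a candidate is kept if its support is strictly greater than the pruning parameter, so that a value of 0 keeps every candidate supported by at least one example. The names `count_support` and `prune_candidates` are hypothetical.

```python
from collections import Counter


def count_support(candidates_per_example):
    """Count for each rule candidate how many examples support it."""
    support = Counter()
    for candidates in candidates_per_example:
        # Each example counts at most once per candidate.
        support.update(set(candidates))
    return support


def prune_candidates(support, pruning):
    """Keep candidates supported by more than `pruning` examples."""
    return sorted(rule for rule, count in support.items() if count > pruning)


# Made-up candidates from three examples, written as rule strings.
per_example = [
    ["split(V) :- pos(c_VBD,V).", "split(V) :- nextpos(c_IN,V)."],
    ["split(V) :- pos(c_VBD,V)."],
    ["split(V) :- pos(c_VBD,V).", "split(V) :- pos(c_EX,V)."],
]
support = count_support(per_example)
kept = prune_candidates(support, 1)
# kept == ["split(V) :- pos(c_VBD,V)."]
```

Raising the parameter shrinks the set of candidates passed to the Induction phase, which is what makes the hypothesis search tractable on larger datasets.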
Our work has focused on providing comparable results to ML techniques and we have not utilised the full power of ILP with NAF in rule bodies and predicate invention. As future work, we plan to extend the predicates usable in hypotheses to provide a more detailed representation of the NLP task, moreover we plan to enrich the background knowledge to aid ILP in learning a better hypothesis with a deeper structure representing the boundaries of chunks.
We provide the modified XHAIL system in a public repository fork (xhailfork).
This research has been supported by the Scientific and Technological Research Council of Turkey (TUBITAK) [grant number 114E777] and by the Higher Education Commission of Pakistan (HEC). We are grateful to Carmine Dodaro for providing us with support regarding the WASP solver.