Heuristic-Based Induction of Answer Set Programs: From Default Theories to Combinatorial Problems

Significant research has been conducted in recent years to extend Inductive Logic Programming (ILP) methods to induce Answer Set Programs (ASP). These methods perform an exhaustive search for the correct hypothesis by encoding an ILP problem instance as an ASP program. Exhaustive search, however, results in loss of scalability. In addition, the language bias employed in these methods is overly restrictive. In this paper we extend our previous work on learning stratified answer set programs that have a single stable model to learning arbitrary (i.e., non-stratified) ones with multiple stable models. Our extended algorithm is a greedy FOIL-like algorithm, capable of inducing non-monotonic logic programs, examples of which include programs for combinatorial problems such as graph coloring and N-queens. To the best of our knowledge, this is the first heuristic-based ILP algorithm to induce answer set programs with multiple stable models.





1 Introduction

Statistical machine learning methods produce models that are not comprehensible to humans because they are algebraic solutions to optimization problems such as risk minimization or data likelihood maximization. These methods do not produce any intuitive description of the learned model, which makes it hard for users to understand and verify the underlying rules that govern it. Nor can these methods produce a justification for the prediction they compute for a new data sample. Additionally, if prior (background) knowledge is extended, then the entire model needs to be re-learned. Finally, these methods make no distinction between exceptions and noisy data.

Inductive Logic Programming Muggleton (1991), however, is one technique where the learned model takes the form of logic programming rules (Horn clauses) that are comprehensible to humans. It allows the background knowledge to be incrementally extended without requiring the entire model to be re-learned. Moreover, the comprehensibility of symbolic rules makes it easier for users to understand, verify, and even edit the induced models.

ILP learns theories in the form of Horn clause logic programs. Extending Horn clauses with negation as failure (NAF) makes more powerful applications possible, as inferences can be made even in the absence of information. This extension of Horn clauses with NAF, where the meaning is computed using the stable model semantics Gelfond and Lifschitz (1988), is called Answer Set Programming and has many powerful applications. (We use the term answer set programming in a generic sense to refer to normal logic programs, i.e., logic programs extended with NAF, whose semantics is given in terms of stable models Baral (2003).) Generalizing ILP to learning answer set programs also makes ILP more powerful. For a complete discussion on the necessity of NAF in ILP, we refer the reader to Sakama (2005).

Once NAF semantics is allowed into ILP systems, they should be able to deal with multiple stable models, which arise due to the presence of mutually recursive rules involving negation (called even cycles) Baral (2003), such as:

p :- not q.

q :- not p.
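This program has two stable models, {p} and {q}, and this can be verified mechanically. The sketch below is for illustration only (real ASP solvers ground the program and use conflict-driven search rather than enumerating candidate sets): it computes the stable models of a ground normal program via the Gelfond-Lifschitz reduct, and also accepts headless constraints of the form `:- B.`

```python
from itertools import combinations

def stable_models(rules, atoms):
    """Enumerate stable models of a ground normal program by brute force.

    rules: list of (head, pos_body, neg_body) triples;
           head is None for a constraint (a headless rule ':- B.').
    """
    def least_model_of_reduct(candidate):
        # Gelfond-Lifschitz reduct: delete rules whose negative body
        # intersects the candidate, drop 'not' from the rest, then compute
        # the least model of the remaining definite program.
        definite = [(h, pb) for (h, pb, nb) in rules
                    if not (set(nb) & candidate)]
        model, changed = set(), True
        while changed:
            changed = False
            for h, pb in definite:
                if set(pb) <= model:
                    if h is None:          # a constraint fired: no model
                        return None
                    if h not in model:
                        model.add(h)
                        changed = True
        return model

    return [set(c) for r in range(len(atoms) + 1)
            for c in combinations(sorted(atoms), r)
            if least_model_of_reduct(set(c)) == set(c)]

# The even cycle: p :- not q.  q :- not p.
cycle = [("p", [], ["q"]), ("q", [], ["p"])]
print(stable_models(cycle, {"p", "q"}))    # two stable models: {p} and {q}

# Adding the constraint ':- p.' eliminates the model containing p.
print(stable_models(cycle + [(None, ["p"], [])], {"p", "q"}))
```

The second call shows why constraints matter for the problems discussed later: a headless rule prunes whole stable models rather than deriving new atoms.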

Inducing answer set programs in the presence of even cycles in the background knowledge was first explored in Seitzer (1997), where the author describes the added expressiveness that results once background knowledge is allowed to have multiple stable models. Work by Otero Otero (2001) on induction of stable models formalizes induction of answer set programs under the stable model semantics Gelfond and Lifschitz (1988): in situations where B ∪ H (B represents the background knowledge and H the hypothesis) has multiple stable models, it is only necessary to guarantee that each positive example is true in at least one stable model of B ∪ H. It also attempts to characterize inducing answer set programs from partial answer sets of B ∪ H (the author calls them non-complete sets of examples). These partial answer sets are treated as examples in the ILP problem. Otero also suggests that researchers should focus on learning answer set programs that model combinatorial and planning problems, but does not present any solution. Addressing the problem of learning such programs is the goal of the research presented in this paper.

In Sakama (2005), Sakama introduces algorithms to induce a categorical logic program (i.e., an answer set program with at most one stable model) given the answer set of the background knowledge and either positive or negative examples. Essentially, given a single answer set, Sakama tries to induce a program that has that answer set as a stable model. In Sakama and Inoue (2009), Sakama extends this work to learn from multiple answer sets. He introduces brave induction, where the learned hypothesis H is such that some of the answer sets of B ∪ H cover the positive examples. The limitation of this work is that it accepts only one positive example, given as a conjunction of atoms, and it does not take negative examples into account at all. Cautious induction, the counterpart of brave induction, is also too restricted, as it can only induce atoms in the intersection of all stable models. Thus, neither brave induction nor cautious induction is able to express situations where something should hold in all or none of the stable models. An example of this limitation arises in the graph coloring problem, where the following should hold in all answer sets: no two neighboring nodes in a graph should be painted the same color.

ASPAL Corapi et al (2011) is the first ILP system to learn answer set programs by encoding ILP problems as ASP programs and having an ASP solver find the hypothesis. Its successor, ILASP Law et al (2015), is an ILP system capable of inducing hypotheses expressed as answer set programs as well. ILASP defines a framework that subsumes brave/cautious induction and allows a much broader class of problems relating to learning answer set programs to be handled by ILP. However, the algorithm exhaustively searches the space of possible clauses to find one that is consistent with all examples and the background knowledge. To make this search feasible, it prohibits predicate invention, i.e., learning predicates other than the target predicate(s). Resorting to exhaustive search and disallowing predicate invention are weaknesses of ILASP that limit its applicability in many useful situations. The research presented in this paper does not suffer from these problems.

XHAIL Ray (2009) is another ILP system capable of learning non-monotonic logic programs. It relies heavily on abductive logic programming to search for hypotheses. It uses a language bias similar to ILASP's, and thus suffers from similar limitations. It also does not support inducing answer set programs from partial answer sets.

All the systems discussed above resort to an exhaustive search for the hypothesis. In contrast, traditional ILP systems (that only learn Horn clauses) use heuristics to guide their search, which allows them to avoid an exhaustive search. These systems usually start with the most general clauses and then specialize them. They are better suited for large-scale datasets with noise, since the search can be easily guided by heuristics. FOIL Quinlan (1990) is a representative of such algorithms. However, handling negation in FOIL is somewhat problematic, as we will soon show. Also, FOIL can neither handle background knowledge with multiple stable models nor induce answer set programs.

Recently we developed an algorithm called FOLD Shakerin et al (2017) to automate inductive learning of default theories represented as stratified answer set programs. FOLD (First Order Learner of Default rules) extends the FOIL algorithm and is able to learn answer set programs that represent the underlying knowledge very succinctly. However, FOLD is limited to stratified answer set programs, i.e., mutually recursive rules through negation are not allowed in the background knowledge or the hypothesis. Thus, FOLD is incapable of handling cases where the background knowledge or the hypothesis admits multiple stable models. In this paper, we extend the FOLD algorithm to allow both the background knowledge and the hypothesis to have multiple stable models. The extended FOLD algorithm—called the XFOLD algorithm—is much more general than previously proposed methods.

This paper makes the following novel contribution: it presents the XFOLD algorithm, an extension of our previous FOLD algorithm, that can handle background knowledge with multiple stable models and induce hypotheses that themselves have multiple stable models. To the best of our knowledge, XFOLD is the first heuristic-based algorithm to induce such hypotheses. The XFOLD algorithm can learn ASP programs to solve combinatorial problems such as graph coloring and N-queens. Because the XFOLD algorithm is based on heuristic search, it is also scalable; lack of scalability is a major problem in previous approaches.

The rest of this paper is organized as follows: In section 2, we motivate the FOLD algorithm by recalling some of the problems with the FOIL algorithm. In section 3, we introduce the FOLD algorithm. In section 4, we present our extension of FOLD, called XFOLD, for inducing answer set programs with multiple stable models. In section 5, we show how the XFOLD algorithm can induce programs for solving combinatorial problems. In section 6, we discuss related work, and in section 7, we present our conclusions and future work.

We assume that the reader is familiar with answer set programming and stable model semantics. Books by Baral Baral (2003) and Gelfond and Kahl Gelfond and Kahl (2014) are good sources of background material.

2 Background

In this section we describe our previous work on learning stratified answer set programs, i.e., learning hypotheses without cyclical rules from background knowledge that also has no cyclical rules. The learning algorithm, called FOLD (First Order Learner of Default rules) Shakerin et al (2017), is itself an extension of the well-known FOIL algorithm. FOIL is a top-down ILP algorithm that follows a sequential covering approach to induce a hypothesis. The FOIL algorithm is summarized in Algorithm 1. It repeatedly searches for the clause that scores best with respect to a subset of the positive and negative examples, the current hypothesis, and a heuristic called information gain (IG). The FOIL algorithm learns a specified target predicate; essentially, the target predicate appears as the head of each learned clause.

Input:  target, B, E+, E-
Output: Hypothesis H
 1: H := {}
 2: while E+ ≠ ∅ do
 3:     c := {goal :- true.}
 4:     while c covers some examples in E- do
 5:         for all candidate literals l do
 6:             compute the information gain of c refined with l
 7:         end for
 8:         let ĉ be the refinement with the best score
 9:         c := ĉ
10:     end while
11:     add c to H
12:     remove from E+ the examples covered by c
13: end while
Algorithm 1 Overview of the FOIL algorithm

The inner loop searches for a clause with the highest information gain using a general-to-specific hill-climbing search. To specialize a given clause c, a refinement operator under θ-subsumption Plotkin (1971) is employed. The most general clause is {p(X1,...,Xn) :- true.}, where the predicate p/n is the target and each Xi is a variable. The refinement operator specializes the current clause {h :- b1,...,bn.} by adding a new literal l to the clause, which yields {h :- b1,...,bn,l.}. The heuristic-based search uses information gain. In FOIL, the information gain for a given clause is calculated as follows Mitchell (1997):


IG(L, R) = t × ( log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) )

where L is the candidate literal to add to rule R, p0 is the number of positive bindings of R, n0 is the number of negative bindings of R, p1 is the number of positive bindings of R + L, n1 is the number of negative bindings of R + L, and t is the number of positive bindings of R also covered by R + L.
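Written out as code, the gain computation is a direct transcription of the formula above (the guard for p1 = 0 is our addition, to avoid taking the logarithm of zero):

```python
import math

def foil_gain(p0, n0, p1, n1, t):
    """FOIL information gain for specializing rule R with literal L.

    p0, n0: positive/negative bindings of R
    p1, n1: positive/negative bindings of R + L
    t:      positive bindings of R still covered by R + L
    """
    if p1 == 0:                 # no positives left: worst possible score
        return float("-inf")
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# A literal keeping all 4 positive bindings while ruling out 3 of the
# 4 negative ones yields a positive gain:
print(foil_gain(4, 4, 4, 1, 4))
```

A literal that changes neither the positive nor the negative coverage scores zero, so the greedy loop stops specializing at that point.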

FOIL handles negated literals in a naive way: for any existing candidate literal l, the negated literal not l is also added to the set of specialization candidates. This approach leads to learning predicates that do not capture the concept accurately, as shown in the following example:

Example 1

B and E+ shown below are the background knowledge and positive examples, respectively, under the Closed World Assumption; the target predicate is fly.

bird(X) :- penguin(X).
bird(tweety). bird(et).
cat(kitty). penguin(polly).
fly(tweety). fly(et).

The FOIL algorithm would learn the following rule:

fly(X) :- not cat(X), not penguin(X).

which does not yield a constructive definition, even though it covers all the positives (tweety is not a penguin and et is not a cat) and no negatives (neither cats nor penguins fly). In fact, the correct theory in this example is as follows: "Only birds fly, but among them there are exceptional ones that do not fly." It translates to the following logic programming rule:

fly(X):- bird(X), not penguin(X).

which FOIL fails to discover.
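A quick coverage check reproduces why FOIL is content with the first clause: under the closed world assumption of Example 1, both clauses cover exactly the positive examples, so a purely coverage-driven score cannot prefer the constructive definition. The sets below encode Example 1 directly; covers is a hypothetical helper for bodies of unary literals:

```python
# Example 1 under the closed world assumption: universe of individuals
# and the facts (bird is closed under bird(X) :- penguin(X)).
facts = {
    "bird":    {"tweety", "et", "polly"},
    "penguin": {"polly"},
    "cat":     {"kitty"},
}
universe = {"tweety", "et", "kitty", "polly"}
pos = {"tweety", "et"}          # fly/1 positive examples
neg = universe - pos            # CWA: everyone else does not fly

def covers(body):
    """Individuals satisfying a conjunction of (possibly negated) unary literals."""
    result = set(universe)
    for lit in body:
        if lit.startswith("not "):
            result -= facts[lit[4:]]
        else:
            result &= facts[lit]
    return result

foil_clause = ["not cat", "not penguin"]   # what FOIL learns
fold_clause = ["bird", "not penguin"]      # the intended definition

# Both clauses cover all positives and no negatives, so coverage-driven
# scoring alone cannot prefer the constructive definition.
assert covers(foil_clause) == pos
assert covers(fold_clause) == pos
```

The difference between the two clauses only shows up on individuals outside the training universe, which is exactly why FOLD restricts this stage of the search to positive literals.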

3 FOLD Algorithm

The intuition behind the FOLD algorithm is to learn a concept in terms of a default and possibly multiple exceptions (and exceptions to exceptions, and so on). Thus, in the bird example given above, we would like to learn the rule that X flies if it is a bird and not a penguin, rather than that all non-cats and non-penguins can fly. FOLD first tries to learn the default by specializing a general rule of the form {goal(X1,...,Xn) :- true.} with positive literals. As in FOIL, each specialization must rule out some already covered negative examples without significantly decreasing the number of positive examples covered. Unlike FOIL, no negative literal is used at this stage. Once the IG becomes zero, this process stops. At this point, if any negative examples are still covered, they must be either noisy data or exceptions to the current hypothesis. Exceptions are separated from noise via distinguishable patterns in the negative examples Srinivasan et al (1996). In other words, exceptions can be learned by calling the same algorithm recursively. This swapping of positive and negative examples, followed by a recursive call, can continue, so that we can learn exceptions to exceptions, and so on. Each time a rule is discovered for exceptions, a new predicate ab(X1,...,Xn) is introduced. To avoid name collisions, FOLD appends a unique number to the string "ab" to guarantee the uniqueness of invented predicates. Some outlier data samples are covered neither as defaults nor as exceptions. If such outliers are present, FOLD identifies and enumerates them to make sure that the algorithm converges.

Algorithm 2 shows a high-level implementation of the FOLD algorithm. At lines 1-8, the function FOLD serves as the FOIL outer loop. At line 3, FOLD starts with the most general clause (e.g., fly(X) :- true). At line 4, this clause is refined by calling the function SPECIALIZE. At lines 5-6, the set of positive examples and the set of discovered clauses are updated to reflect the newly discovered clause.

At lines 9-29, the function SPECIALIZE is shown. It serves as the FOIL inner loop. At line 12, by calling the function ADD_BEST_LITERAL, the "best" positive literal is chosen, and the best IG as well as the corresponding clause is returned. At lines 13-24, depending on the IG value, either the positive literal is accepted or the EXCEPTION function is called. If, at the very first iteration, IG becomes zero, then a clause that just enumerates the positive examples is produced; a flag is used to identify this first iteration. At lines 26-27, the sets of positive and negative examples are updated to reflect the changes made by the current clause. At line 19, the EXCEPTION function is called with E+ and E- swapped.

At line 31, the "best" positive literal, i.e., the one that covers more positive examples and fewer negative examples, is selected. Note that at this point the current positive examples are really the negative examples: in the EXCEPTION function, we try to find the rule(s) governing the exceptions. At line 33, FOLD is recursively called to extract these rule(s). At line 34, a new ab predicate is introduced, and at lines 35-36 it is associated with the body of the rule(s) found by the recursive FOLD call at line 33. Finally, at line 38, the default and the exception are combined to form a single clause.

The FOLD algorithm, once applied to Example 1, yields the following clauses:

fly(X):- bird(X), not ab0(X).
ab0(X):- penguin(X).
Input:  target goal, background knowledge B, examples E+, E-
Output: D = {defaults' clauses}, AB = {exceptions/abnormal clauses}
 1: function FOLD(E+, E-)
 2:     while E+ ≠ ∅ do
 3:         c := (goal(V1,...,Vn) :- true.)
 4:         ĉ := SPECIALIZE(c, E+, E-)
 5:         E+ := E+ \ covers(ĉ, E+)
 6:         D := D ∪ {ĉ}
 7:     end while
 8: end function
 9: function SPECIALIZE(c, E+, E-)
10:     ĉ := c;  first_iteration := true
11:     while covers(ĉ, E-) ≠ ∅ do
12:         (c_def, IG) := ADD_BEST_LITERAL(ĉ, E+, E-)
13:         if IG > 0 then
14:             ĉ := c_def
15:         else
16:             if first_iteration then
17:                 ĉ := ENUMERATE(ĉ, E+)
18:             else
19:                 ĉ := EXCEPTION(ĉ, E-, E+)      ▹ E+ and E- are swapped
20:                 if ĉ = null then
21:                     ĉ := ENUMERATE(ĉ, E+)
22:                 end if
23:             end if
24:         end if
25:         first_iteration := false
26:         E+ := covers(ĉ, E+)
27:         E- := covers(ĉ, E-)
28:     end while
29: end function
30: function EXCEPTION(c_def, E+, E-)
31:     (ĉ, IG) := ADD_BEST_LITERAL(c_def, E+, E-)
32:     if IG > 0 then
33:         C := FOLD(E+, E-)
34:         c_ab := a new, unique ab predicate
35:         for each c ∈ C do
36:             AB := AB ∪ {(c_ab(V1,...,Vn) :- bodyof(c))}
37:         end for
38:         ĉ := (headof(c_def) :- bodyof(c_def), not c_ab(V1,...,Vn))
39:     else
40:         ĉ := null
41:     end if
42: end function
Algorithm 2 FOLD Algorithm

Now, we illustrate how FOLD discovers the above clauses given the background knowledge B and positive examples E+ of Example 1, with the goal fly(X). By calling FOLD, in the while loop at line 2, the clause {fly(X) :- true.} is specialized. Inside SPECIALIZE, at line 12, the literal bird(X), which has the greatest IG among {bird, penguin, cat}, is selected and added to the current clause, yielding the clause fly(X) :- bird(X). Then, at lines 26-27, the sets of positive and negative examples are updated accordingly. One negative example, polly, a penguin, is still covered. In the next iteration, the algorithm fails to introduce a positive literal to rule it out, since the best IG in this case is zero. Therefore, the EXCEPTION function is called with E+ and E- swapped, and FOLD is recursively called on the swapped sets to learn a rule covering polly. The recursive call (line 33) returns {fly(X) :- penguin(X)} as the exception. At line 34, a new predicate ab0 is introduced, and at lines 35-37 the clause {ab0(X) :- penguin(X)} is created and added to the set of invented abnormalities, AB. At line 38, the negated exception (i.e., not ab0(X)) and the default rule's body (i.e., bird(X)) are combined to form the clause {fly(X) :- bird(X), not ab0(X)}.
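The default-then-exception strategy of this trace can be condensed into a deliberately simplified sketch: unary predicates only, one default literal, a single exception level, and raw coverage counts instead of information gain. It is an illustration of the idea, not the full Algorithm 2:

```python
def covers(pred, facts, examples):
    """Examples satisfying a unary predicate."""
    return facts.get(pred, set()) & examples

def fold_sketch(facts, universe, pos):
    """Learn one default literal plus one 'ab' exception literal greedily."""
    neg = universe - pos
    # Default step: keep the most positives, cover the fewest negatives.
    best = max(facts, key=lambda p: (len(covers(p, facts, pos)),
                                     -len(covers(p, facts, neg))))
    pos_left = covers(best, facts, pos)
    neg_left = covers(best, facts, neg)
    exception = []
    if neg_left:
        # Exception step: swap roles -- the remaining negatives become the
        # positives of a sub-problem (here flattened to a single pick).
        best_ab = max(facts, key=lambda p: (len(covers(p, facts, neg_left)),
                                            -len(covers(p, facts, pos_left))))
        exception = [best_ab]
    return [best], exception

facts = {"bird": {"tweety", "et", "polly"},
         "penguin": {"polly"},
         "cat": {"kitty"}}
universe = {"tweety", "et", "kitty", "polly"}
default, exception = fold_sketch(facts, universe, pos={"tweety", "et"})
print(f"fly(X) :- {default[0]}(X), not ab0(X).")
print(f"ab0(X) :- {exception[0]}(X).")
```

On Example 1 this selects bird as the default literal and penguin as the abnormality, mirroring the trace above.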

Note that enumeration of examples occurs in two different cases: (i) at the very first iteration of specialization, if IG is zero for all positive literals, and (ii) when the EXCEPTION routine fails to find a rule governing the negative examples. In either case, the corresponding samples are treated as noise. The following example shows a learned logic program in the presence of noise. In particular, it shows how enumeration works: it generates clauses in which the variables of the goal predicate are unified with each member of a list of the examples for which no pattern exists.

Example 2

The same as Example 1, except that we have an extra positive example fly(jet), with no further information about jet:

bird(X) :- penguin(X).
bird(tweety). bird(et).
cat(kitty). penguin(polly).
fly(tweety). fly(jet). fly(et).

The FOLD algorithm, applied to Example 2, yields the following clauses:

fly(X) :- bird(X), not ab0(X).
fly(X) :- member(X,[jet]).
ab0(X) :- penguin(X).

FOLD recognizes fly(jet) as noisy data. member/2 is a built-in logic programming predicate that tests the membership of an atom in a list.

Sometimes there are nested levels of exceptions. The following example shows how FOLD manages to learn the correct theory in the presence of nested exceptions.

Example 3

Birds and planes normally fly, except penguins and damaged planes that can’t. There are super penguins who can, exceptionally, fly.

bird(X) :- penguin(X).
penguin(X) :- superpenguin(X).
bird(a). bird(b). penguin(c). penguin(d).
superpenguin(e). superpenguin(f).
plane(g). plane(h). plane(k).
damaged(k). damaged(m).
fly(a).    fly(b).    fly(e).
fly(f).    fly(g).    fly(h).

The FOLD algorithm learns the following theory:

fly(X) :- plane(X), not ab0(X).
fly(X) :- bird(X), not ab1(X).
fly(X) :- superpenguin(X).
ab0(X) :- damaged(X).
ab1(X) :- penguin(X).
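Since this background knowledge is stratified, the learned theory can be checked by straightforward bottom-up evaluation. The few lines below close the bird/penguin hierarchy and apply the three learned rules, recovering exactly the fly/1 facts:

```python
# Background facts of Example 3, with the two deductive rules applied:
superpenguin = {"e", "f"}
penguin = {"c", "d"} | superpenguin      # penguin(X) :- superpenguin(X).
bird = {"a", "b"} | penguin              # bird(X) :- penguin(X).
plane = {"g", "h", "k"}
damaged = {"k", "m"}

# Learned theory:
ab0 = damaged                            # ab0(X) :- damaged(X).
ab1 = penguin                            # ab1(X) :- penguin(X).
fly = (plane - ab0) | (bird - ab1) | superpenguin

assert fly == {"a", "b", "e", "f", "g", "h"}   # exactly the fly/1 facts
```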

Table 1 presents our experiments with UCI benchmark datasets Lichman (2013). In this experiment, we ran FOLD on each dataset, measured the accuracy using 10-fold cross-validation, and compared the results against those of Aleph Srinivasan (2001). Aleph is a popular ILP system that has been widely used in prior work. To induce a clause, Aleph starts by building the most specific clause, called the "bottom clause", that entails a seed example. Then, it uses a branch-and-bound algorithm to perform a general-to-specific heuristic search for a subset of the literals from the bottom clause to form a more general rule. In most cases, our FOLD algorithm outperforms Aleph in terms of accuracy and succinctness of the induced rules.

FOLD's handling of negation and numeric constraints yields intuitive and precise results. For instance, on UCI Labor-negotiations, a dataset of final settlements in labor negotiations in Canadian industry, the following hypothesis is induced by FOLD:

good_contract(X) :- wage_inc_first_year(X,A), A > 2, not ab0(X).
good_contract(X) :- holidays(X,A), A > 11.
good_contract(X) :- health_plan_half_contribution(X), pension(X).
ab0(X) :- no_longterm_disability_help(X).
ab0(X) :- no_pension(X).

This hypothesis captures the highest priorities of employees in a good contract. Without abnormality predicates, the hypothesis would have contained more clauses, depending on the diversity of options on long-term disability support and pension. In the default theory approach, as shown in this example, instead of covering examples with multiple clauses, a single clause is introduced as a default rule, and irrelevant predicates are excluded via abnormality predicates.

dataset     size   ALEPH accuracy(%)   FOLD accuracy(%)   FOLD execution time(s)
Credit-au    690        82                  83                  67
Credit-j     125        53                  81                  20
Credit-g    1000        70.9                78                  87
Iris         150        85.9                95                   1.3
Ecoli        336        91                  90                   6.1
Bridges      108        89                  90                   0.8
Labor         57        89                  94                   0.4
Acute(1)      34       100                 100                   0.3
Acute(2)      34       100                 100                   0.3
Mushroom    7724       100                 100                  11.4
Table 1: FOLD evaluation on UCI benchmarks

4 Induction of Answer Set Programs with Multiple Stable Models

In the previous section we assumed that the background knowledge B is a normal logic program with one stable model and that all examples belong to this only stable model of B. This requires the language bias to disallow even cycles, which are responsible for generating multiple stable models.

In this section we extend our FOLD algorithm to learn normal logic programs that potentially have multiple stable models. The significance of the Answer Set Programming paradigm is that it provides a declarative semantics under which each stable model is associated with one (alternative) solution to the problem described by the program. Typical problems of this kind are combinatorial problems, e.g., graph coloring and N-queens. In graph coloring, one must find ways of coloring the nodes of a graph such that no two nodes connected by an edge receive the same color. N-queens is the problem of placing N queens on a chessboard of size N×N so that no two queens attack each other.

In order to inductively learn such programs, the ILP problem definition needs to be revisited. In the new scenario, positive examples E+ may not hold in every stable model. The ILP problem described in the background section would therefore only allow learning of predicates that hold in all answer sets, which is too restrictive. Brave induction Sakama and Inoue (2009), in contrast, allows examples to hold only in some stable models of B ∪ H. However, as stated in Law et al (2015), and as we will show using examples, this is not enough when it comes to learning global constraints (i.e., rules with an empty head). Recall that in answer set programming, a constraint is expressed as a headless rule of the form :- B. which states that B must be false; a headless rule is really a short form for a rule of the form p :- B, not p. (called an odd loop over negation Baral (2003)). Learning global constraints is essential because certain combinations may have to be excluded from all answer sets.

When B ∪ H has multiple stable models, there will be some instances of the target predicate that hold in all, none, or some of the stable models. Brave induction is not able to express situations in which a predicate should hold in all or none of the stable models. An example is a graph in which node 1 is colored red: in such a case, none of node 1's neighbors may be colored red. If node 1 happens to have node 2 as a neighbor, brave induction cannot express the fact that if the predicate red(1) appears in any stable model of B ∪ H, red(2) should not appear in that model. In Law et al (2015), the authors propose a new paradigm called learning from partial answer sets that overcomes these limitations. We also adopt this paradigm in the work presented here. Next, we present our XFOLD algorithm.

Definition 1

A partial interpretation E is a pair ⟨E_inc, E_exc⟩ of sets of ground atoms, called inclusions and exclusions, respectively. Let M denote a stable model of B ∪ H. M extends E if and only if E_inc ⊆ M and E_exc ∩ M = ∅.
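Assuming a partial interpretation is represented as a pair of atom sets and a stable model as a set of ground atoms, the definition translates directly into a membership check:

```python
def extends(model, partial):
    """True iff stable model M extends the partial interpretation <E_inc, E_exc>."""
    inc, exc = partial
    return inc <= model and not (exc & model)

# A stable model containing off(p1) and goesToParty(p1):
model = {"off(p1)", "goesToParty(p1)", "works(p2)"}
assert extends(model, ({"goesToParty(p1)"}, {"goesToParty(p2)"}))
assert not extends(model, ({"goesToParty(p2)"}, set()))
```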

Example 4

Consider the following background knowledge about a group of friends, some of whom are in conflict with others. Individuals in conflict will not attend a party together; moreover, a person cannot attend a party if he or she works at the time the party is held. We want our ILP algorithm to discover the rule(s) that determine who goes to the party, based on the set of partial interpretations provided.

conflict(X,Y) :- person(X), person(Y), conflict(Y,X).
works(X) :- person(X), not off(X).
off(X) :- person(X), not works(X).
person(p1). person(p2). conflict(p1,p4).
person(p3). person(p4). conflict(p2,p3).

Partial interpretations E1, E2, E3, E4 over this group are provided as examples; the predicates g, w, o abbreviate goesToParty, works, off, respectively.

Each Ei, for i = 1,2,3,4, is a partial interpretation and should be extended by at least one stable model of B ∪ H for a learned hypothesis H. For instance, consider the hypothesis {goesToParty(X) :- off(X).} for learning the target predicate goesToParty(X). By feeding the background knowledge, the non-target predicates occurring in the Ei, and the hypothesis into an ASP solver (CLASP Gebser et al (2012) in our case), the stable model returned by the solver would contain {goesToParty(p1), goesToParty(p2), goesToParty(p4)}; this model does not extend all of the Ei. It should be noted that the non-target predicates are treated as background knowledge when the ASP solver is called to compute the stable models of B ∪ H.

Definition 2

An XFOLD problem is defined as a tuple ⟨B, L, E+, E-, T⟩. B is an answer set program, with potentially multiple stable models, called the background knowledge. L is the language bias such that L = ⟨M_h, M_b⟩, where M_h (resp. M_b) is called the set of head (resp. body) mode declarations Muggleton (1995). Each mode declaration in M_h (resp. M_b) is a literal whose abstracted arguments are either variables or constants. The type of a variable is a predicate defined in B; the domain of each constant should be defined separately. A clause h :- b_1,...,b_m, not b_{m+1},...,not b_n is in the search space if and only if: i) h is either empty or compatible with a mode declaration in M_h; ii) every b_i is an atom compatible with a mode declaration in M_b. A literal is said to be compatible with a mode declaration if each abstracted variable in the mode declaration is replaced by a variable and every abstracted constant takes a value from its associated domain. The set of candidate predicates in the greedy search algorithm is selected from M_b.

The requirement of mode declarations in the XFOLD algorithm is due to a technicality: ASP solvers need to ground the program, and for that the programmer should ensure that every variable is safe. A variable in the head is safe if it also occurs in a positive literal of the body. XFOLD adds the predicates required to ensure safety, but to keep our examples simple, we omit safety predicates in this paper. E+ and E- are sets of partial interpretations called positive and negative examples, respectively. T is the name of the target predicate; each XFOLD run learns a single target predicate. A hypothesis H is an inductive solution of the XFOLD problem if and only if:

  1. for every e ∈ E+, there is some stable model M of B ∪ H such that M extends e;

  2. for every e ∈ E-, there is no stable model M of B ∪ H that extends e.

The above definition, adopted from Law et al (2015), subsumes the brave and cautious induction semantics Sakama and Inoue (2009): positive examples must be extended by at least one stable model of B ∪ H (brave induction), whereas no stable model of B ∪ H may extend a negative example (cautious induction). Generate-and-test problems such as N-queens and graph coloring can be induced using our XFOLD algorithm: it suffices to use positive examples for learning the generate part and negative examples for learning the test part.
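Given the stable models of B ∪ H (computed by an ASP solver), the two conditions reduce to an existential check over the positive examples and a universal check over the negative ones. A minimal sketch, with extends as in Definition 1 and a toy two-model coloring program as data:

```python
def extends(model, partial):
    """True iff model M extends the partial interpretation <E_inc, E_exc>."""
    inc, exc = partial
    return inc <= model and not (exc & model)

def is_inductive_solution(stable_models, e_plus, e_minus):
    """Brave condition over E+ and cautious condition over E-."""
    brave = all(any(extends(m, e) for m in stable_models) for e in e_plus)
    cautious = all(not extends(m, e)
                   for m in stable_models for e in e_minus)
    return brave and cautious

# Two stable models of a toy coloring program:
models = [{"red(1)", "blue(2)"}, {"blue(1)", "red(2)"}]
e_plus = [({"red(1)"}, set())]                # must hold in some model
e_minus = [({"red(1)", "red(2)"}, set())]     # must hold in no model
assert is_inductive_solution(models, e_plus, e_minus)
```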

Figure 1 represents the input to the XFOLD algorithm for learning an answer set program for graph coloring. Every positive example states that if a node is colored red, then that node cannot also be painted blue or green (and likewise for blue and green). However, this is not enough to learn the constraint that two nodes connected by an edge cannot have the same color. To learn this constraint, negative examples are needed. For instance, one negative example states that if any stable model of B ∪ H contains {red(1)}, then, in order not to extend that negative example, the model should contain {not red(2)}, or equivalently, it should not contain {red(2)}.





Figure 1: Partial interpretations (positive and negative examples) as input for the graph coloring problem

The intuition behind the XFOLD algorithm is as follows: every positive example, i.e., every partial interpretation, is considered as a separate learning problem, and a partial score is computed for it. Once all the positive examples have been tested against a candidate clause, the overall score, i.e., the sum of all partial scores, is stored as the score of the current clause. Among all candidate clauses, the one with the highest overall score is chosen, just as in the single stable model case. To test any given hypothesis H, the background knowledge B, all non-target predicates occurring in the examples, and the hypothesis are passed to the ASP solver as input. The returned answer set is compared with the target predicates in the inclusion and exclusion sets, and the partial information gain score is computed. XFOLD chooses a clause with the highest positive score (if one exists). Next, every partial interpretation is updated by removing the covered target predicates from its inclusion and exclusion sets. Once no target predicate remains covered, the internal loop finishes and the discovered rule(s) are added to the learned theory. Just as in FOLD, if no literal with a positive score exists, swapping occurs on each remaining partial interpretation and the XFOLD algorithm is called recursively. In this case, instead of introducing abnormality predicates, the (classical) negation symbol "-" is prefixed to the current target predicate to indicate that the algorithm is now trying to learn the negation of the concept being learned. It should also be noted that the swapping of examples is performed slightly differently due to the existence of partial interpretations. The required changes in the swapping of examples are summarized as follows:

  1. Every old target atom t that had been removed from B as covered is restored.

  2. For every old target atom t in the positive part of a partial interpretation, -t is added to the negative part.

  3. For every old target atom t in the negative part of a partial interpretation, -t is added to the positive part.

  4. The target predicate T now becomes its negation, -T.
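The swap described in the steps above can be sketched as follows; the data representation (partial interpretations as plain sets of ground atom strings) and the function names are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of example swapping when XFOLD switches to learning the negated
# target -T. Partial interpretations are modeled here as two sets of ground
# target atoms; this representation is an assumption for illustration.

def negate(atom):
    """Toggle the classical negation prefix '-' on a ground atom."""
    return atom[1:] if atom.startswith("-") else "-" + atom

def swap_examples(pos, neg, target):
    """Negated old target atoms swap sides, and the target T becomes -T."""
    new_pos = {negate(a) for a in neg}   # old negative part feeds the new positive part
    new_neg = {negate(a) for a in pos}   # old positive part feeds the new negative part
    return new_pos, new_neg, negate(target)

# Party example: after swapping, XFOLD starts learning -goesToParty instead.
pos, neg, target = swap_examples(
    {"goesToParty(p1)"}, {"goesToParty(p2)"}, "goesToParty")
print(target)   # -goesToParty
print(pos)      # {'-goesToParty(p2)'}
```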

After iteration #1: goesToParty(X) :- off(X)
After swapping the examples:
After iteration #1: -goesToParty(X) :- conflict(X,Y)
After iteration #2: -goesToParty(X) :- conflict(X,Y), goesToParty(Y)
Hypothesis = { goesToParty(X) :- off(X), not -goesToParty(X).
               -goesToParty(X) :- conflict(X,Y), goesToParty(Y). }
Figure 2: Trace of the XFOLD internal loop and recursive call on the party example

Figure 2 demonstrates the execution of XFOLD on Example 4. At the end of the first iteration, the predicate off(X) gets the highest score. The covered partial interpretations are removed, as they are already covered by the current hypothesis. In the second iteration, all candidate literals fail to get a positive score. Therefore, swapping occurs and the algorithm tries to learn the predicate -goesToParty(X), as if it were an exception to the default case goesToParty(X) :- off(X). Since the new target predicate is -goesToParty(X), all ground atoms of goesToParty are restored back into B. The old target atoms are transformed into their negated versions and become members of the swapped example sets.

In Figure 2, after one iteration, a partial interpretation is removed once all target atoms in its positive part are covered and all target atoms in its negative part are excluded. After swapping, XFOLD is called recursively to learn -goesToParty. After two iterations, since all examples are covered, the algorithm terminates.

In Example 4, we did not introduce any explicit negative examples. Nevertheless, the algorithm was able to find the cases in which the original target predicate does not hold (by learning the -goesToParty(X) predicate). In general, however, it is not feasible for the algorithm to figure out prohibited patterns without seeing a very large number of positive examples.

5 Application: Combinatorial Problems

A well-known methodology for declarative problem solving is the generate and test methodology, whereby possible solutions to a problem are generated first, and non-solutions are then eliminated by testing. In Answer Set Programming, the generate part is encoded by enumerating the possibilities with even cycles. The test part is realized by constraints that eliminate any answer set violating the test conditions. ASP syntax also allows rules of the form l {h_1, ..., h_k} u :- b_1, ..., b_m, where each h_i and b_j belongs to the language bias L; for instance, 1 {red(X); green(X); blue(X)} 1 :- node(X) assigns exactly one color to every node. Such a rule is syntactic sugar for a combination of even cycles and constraints, and is called a choice rule in the literature Baral (2003); Gelfond and Kahl (2014).
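As an illustration of the claim that choice rules are syntactic sugar for even cycles plus constraints, the textbook desugaring of an unbounded choice rule can be sketched as below. The complement atoms (suffixed "_p" here) and the helper function are illustrative assumptions, not part of the paper.

```python
# Illustrative desugaring of an unbounded ASP choice rule
#   { h1 ; ... ; hk } :- body.
# into pairs of even-cycle rules:
#   hi   :- not hi', body.
#   hi'  :- not hi,  body.
# The complement atoms (suffixed "_p" here) are an illustrative naming choice.

def desugar_choice(heads, body=""):
    """Emit the even-cycle rules equivalent to a choice rule over `heads`."""
    suffix = f", {body}." if body else "."
    rules = []
    for h in heads:
        name, rest = h.split("(", 1)
        h_p = f"{name}_p({rest}"          # fresh complement atom hi'
        rules.append(f"{h} :- not {h_p}{suffix}")
        rules.append(f"{h_p} :- not {h}{suffix}")
    return rules

for rule in desugar_choice(["red(X)", "green(X)", "blue(X)"], "node(X)"):
    print(rule)
# red(X) :- not red_p(X), node(X).
# red_p(X) :- not red(X), node(X).
# ... (and similarly for green and blue)
```

Cardinality bounds l and u, when present, are enforced by additional constraints over the chosen atoms, which this sketch does not emit.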

ILASP Law et al (2015) directly searches for choice rules by including them in the search space. XFOLD, on the other hand, performs its search based on θ-subsumption Plotkin (1971) and hence does not search for choice rule hypotheses. Instead, it directly learns even cycles as well as constraints. This is advantageous, as it allows for a more sophisticated and flexible language bias.

It turns out that inducing the generate part of a combinatorial problem such as graph coloring requires an extra step compared to the FOLD algorithm. For instance, the red(X) predicate has the following clause:

red(X):- not blue(X), not green(X).

To enable XFOLD to induce such a rule, we adopted the Matthews Correlation Coefficient (MCC) Zeng et al (2014) measure to perform the task of feature selection. MCC is calculated as follows:

MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

This measure takes into account all four terms of the confusion matrix, TP (true positive), TN (true negative), FP (false positive), and FN (false negative), and is able to fairly assess the quality of a classification even when the ratio of positive to negative tuples is far from 1. MCC values range from -1 to +1: a coefficient of +1 represents a perfect classification, 0 a classification no better than random, and -1 total disagreement between the predicted and actual labels. MCC cannot replace the XFOLD heuristic score, i.e., information gain, because the latter tries to maximize the coverage of positive examples, while the former only maximally discriminates between positives and negatives. Nevertheless, for the purpose of feature extraction among the negated literals, which are otherwise disallowed in the XFOLD algorithm, MCC can be applied quite effectively. To that end, before running the XFOLD algorithm, the MCC scores of all candidate literals are computed. If a predicate scores "close" to +1, the predicate itself is added to the language bias; if it scores "close" to -1, its negation is added. For example, in the case of learning red(X), after running the feature extraction on the graph given in Figure 1, XFOLD computes the scores -0.7 and -0.5 for green(X) and blue(X), respectively. Therefore, {not green(X), not blue(X)} are appended to the list of candidate predicates. Now, after two iterations of the inner loop, the XFOLD algorithm produces the following rule:

red(X) :- not green(X), not blue(X).
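This feature-selection step can be sketched as follows; the threshold value of 0.4 for "close" to ±1 and the function names are illustrative assumptions, not values from the paper.

```python
# Sketch of MCC-based feature selection for XFOLD's language bias.
# The 0.4 threshold is an illustrative assumption for "close" to +1 / -1.
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def extend_language_bias(literal_scores, threshold=0.4):
    """Add a literal if its MCC is close to +1, its negation if close to -1."""
    extra = []
    for literal, score in literal_scores.items():
        if score >= threshold:
            extra.append(literal)
        elif score <= -threshold:
            extra.append(f"not {literal}")
    return extra

# Scores reported in the text for learning red(X) on the Figure 1 graph:
print(extend_language_bias({"green(X)": -0.7, "blue(X)": -0.5}))
# ['not green(X)', 'not blue(X)']
```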

Corresponding rules for green(X) and blue(X) are learned in a similar manner. This essentially takes care of the generate part of the combinatorial algorithm. In order to learn the test part for graph coloring, we need the negative examples shown in Figure 1. It should be noted that in order to learn a constraint, we first learn a new target predicate which is the negation of the original one. Then we shift the negated predicate from the head to the body inverting its sign in the process. That is, we first learn a clause of the form

-T :- b_1, ..., b_n.

which is then transformed into the constraint:

:- b_1, ..., b_n, T.
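This head-to-body shift is a purely syntactic transformation; a minimal sketch, with the rule representation as a (head, body) pair assumed for illustration:

```python
# Shift a classically negated head into the body:
#   -T :- b1, ..., bn.   becomes   :- b1, ..., bn, T.
# A rule is modeled as a (head, body-list) pair; a None head is a constraint.

def shift_negated_head(rule):
    head, body = rule
    assert head.startswith("-"), "expected a classically negated head"
    return (None, body + [head[1:]])

def to_asp(rule):
    head, body = rule
    clause = ", ".join(body)
    return f"{head} :- {clause}." if head else f":- {clause}."

learned = ("-red(X)", ["edge(X,Y)", "red(Y)"])
print(to_asp(shift_negated_head(learned)))
# :- edge(X,Y), red(Y), red(X).
```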

Thus, the following steps should be taken to learn constraints from negative examples:

  1. Add the rule(s) induced for the generate part to B.

  2. For every negative example and every target literal l occurring in it:

    • if l is of the form not t(...), then -t(...) is added to the positive examples

    • else, -t(...) is added to the negative examples

  3. Compute the contrapositive form of the rule(s) learned in the generate part and remove the body predicates from the list of candidate predicates.

  4. Run FOLD to learn the negated target predicate -p.

  5. Shift -p from the head to the body of each rule returned by FOLD.

The contrapositive of a clause has its antecedent and consequent negated and exchanged. For instance, the contrapositive of the clause {red(X) :- not green(X), not blue(X)} is shown in Figure 3.
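For generate rules with purely negative bodies, as above, this computation can be sketched as follows (the names and data representation are illustrative, not the paper's implementation):

```python
# Contrapositive of a generate rule with a purely negative body:
#   h :- not b1, ..., not bn.   yields   -h :- b1.  ...  -h :- bn.
# (If some bi held, h could not have been derived by this rule.)

def contrapositive(head, negated_body):
    clauses = []
    for lit in negated_body:
        assert lit.startswith("not "), "sketch assumes a purely negative body"
        clauses.append(f"-{head} :- {lit[len('not '):]}.")
    return clauses

for clause in contrapositive("red(X)", ["not green(X)", "not blue(X)"]):
    print(clause)
# -red(X) :- green(X).
# -red(X) :- blue(X).
```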

Output: Hypothesis H
% - Induction of "generate" part - %
let f be the new features discovered by measuring the MCC of each candidate literal
for each target predicate t do
      H ← H ∪ FOLD(t)
end for
% - Induction of "test" part - %
for each negative example e do
      for each literal l in e do
            if l is a target literal then
                  if l is of the form not t(...) then
                        E+ ← E+ ∪ {-t(...)}
                  else
                        E- ← E- ∪ {-t(...)}
                  end if
            end if
      end for
end for
compute the contrapositive form of each rule in the generate part and remove the body predicates from the list of candidate predicates
for each target predicate t do
      H ← H ∪ FOLD(-t)
      shift -t from the head to the body to get a constraint
end for
Algorithm 3 Overview of the XFOLD algorithm
-red(X) :- green(X).
-red(X) :- blue(X).
Figure 3: Contrapositive for “generate” rule in graph-coloring

The reason why step 3 is necessary is the following: running FOLD without eliminating the literals occurring in the contrapositive rules results in learning the trivial clauses shown in Figure 3. However, as soon as those trivial choices are removed from the search space, the FOLD algorithm comes up with the next best hypothesis, which is as follows:

-red(X) :- edge(X,Y), red(Y).

Shifting the predicate -red(X) to the body yields the following constraint:

:- red(X), edge(X,Y), red(Y).

In the graph coloring problem, the set of target predicates is {red(X), green(X), blue(X)}. Once similar examples for green(X) and blue(X) are provided, XFOLD is able to learn the complete solution shown in Figure 4. Algorithm 3 presents a high-level view of XFOLD inducing a generate and test hypothesis.

red(X) :- not green(X), not blue(X).
green(X) :- not blue(X), not red(X).
blue(X) :- not green(X), not red(X).
:- red(X), edge(X,Y), red(Y).
:- blue(X), edge(X,Y), blue(Y).
:- green(X), edge(X,Y), green(Y).
Figure 4: Full graph coloring ASP program learned by the XFOLD algorithm
Example 5

Next we discuss learning the answer set program for the 4-queens problem. The following items are assumed: background knowledge including predicates describing a board, rules describing the different ways in which two queens attack each other, and examples of the following form:

B: attack_r(X1,Y1,X2,Y2) :- q(X1,Y1), q(X2,Y2), X1 = X2, Y1 != Y2.
   attack_c(X1,Y1,X2,Y2) :- q(X1,Y1), q(X2,Y2), Y1 = Y2, X1 != X2.

As far as the generate part is concerned, the XFOLD algorithm would learn the following program:

q(X,Y) :- not -q(X,Y).
-q(X,Y) :- not q(X,Y).

The predicate -q(X,Y) is introduced by the XFOLD algorithm as a result of swapping the examples and calling itself recursively. After computing the contrapositive form, q(X,Y) and -q(X,Y) are removed from the list of candidate predicates. Then, based on the examples provided in Example 5, XFOLD would learn the following rules:

-q(X1,Y1) :- attack_r(X1,Y1,X2,Y2).
-q(X1,Y1) :- attack_c(X1,Y1,X2,Y2).
-q(X1,Y1) :- attack_d(X1,Y1,X2,Y2).

After shifting the predicate -q(X1,Y1) from the head to the body, we get the following constraints:

:- q(X1,Y1), attack_r(X1,Y1,X2,Y2).
:- q(X1,Y1), attack_c(X1,Y1,X2,Y2).
:- q(X1,Y1), attack_d(X1,Y1,X2,Y2).

It should be noted that, since XFOLD is a sequential covering algorithm like FOIL, it takes three iterations to cover all the examples, which in turn yield the three constraints shown above.

6 Related Work

Many researchers have tried to extend Horn-clause ILP to richer non-monotonic logic formalisms. "Stable ILP" Seitzer (1997) was the first effort to explore the expressiveness of background knowledge with multiple stable models. A survey of extending Horn-clause-based ILP to non-monotonic logics can be found in Sakama (2005). In that work, Sakama also introduces algorithms to learn from the answer sets of a categorical logic program. The algorithms learn from positive and negative examples separately, and the approach also leads to redundant literals in the body of the induced clause, as shown in Example 6.

Example 6

Consider the following background knowledge and positive example:

bird(X) :- penguin(X).
bird(tweety). bird(et).
bear(teddy). penguin(polly).

Sakama’s algorithm would induce the following clause:

fly(X) :- bird(X), not cat(X), not penguin(X), not bear(X).

The literals not cat(X) and not bear(X) are redundant. The brave induction framework Sakama and Inoue (2009), although capable of learning ASP programs, only admits one positive example in the form of a conjunction of literals. As we discussed, many problems, including programs for solving combinatorial problems, cannot be expressed without a notion of a negative example. ILASP Law et al (2015) introduces a framework that allows a hypothesis to be induced from multiple positive examples bravely (i.e., using brave induction), while excluding negative examples cautiously (i.e., using cautious induction). However, because it performs an exhaustive search over its predetermined language bias, ILASP is unable to scale to large or noisy datasets. It is also unable to induce default theories with nested or composite abnormality predicates to capture exceptions, as shown in Example 7.

Example 7

A default theory with an abnormality predicate represented as the conjunction of two other predicates, namely s(X) and r(X):

p(X) :- q(X), not ab(X).
ab(X) :- s(X), r(X).

XHAIL Ray (2009) is an ILP system capable of learning non-monotonic logic programs. It relies heavily on abductive reasoning, incorporated in a three-stage algorithm. It does not support induction from multiple partial answer sets.

7 Conclusion and Future Work

In this paper we presented the first heuristic-based algorithm to inductively learn normal logic programs with multiple stable models. The advantage of this work over similar ILP systems such as ILASP Law et al (2015) is that XFOLD does not perform an exhaustive search to discover the "best" hypothesis. Instead, XFOLD adopts a greedy approach, guided by heuristics, that is scalable and noise resilient. Also, learning knowledge patterns in terms of defaults and exceptions produces more natural and intuitive results that correspond to the common sense reasoning employed by humans. We also showed how our algorithm can be applied to induce declarative logic programs that follow the generate and test paradigm for solving combinatorial problems such as graph coloring and N-queens.

Our XFOLD algorithm has a number of novel features absent in prior work: (i) it performs a heuristic search for the hypothesis rather than an exhaustive one and is thus considerably more scalable; (ii) it admits predicate invention, allowing it to learn a broader class of answer set programs that cannot be learned by other systems such as ASPAL, ILASP, and XHAIL; (iii) thanks to the swapping of positive and negative examples, XFOLD is able to distinguish between exceptions and noise, producing more succinct hypotheses.

There are two main avenues for future work: (i) handling large datasets using methods similar to QuickFoil Zeng et al (2014). In QuickFoil, all the operations of FOIL are performed in a database engine. Such an implementation, along with pruning techniques and query optimization tricks can make the XFOLD training much faster; (ii) XFOLD learns function-free answer set programs. We plan to investigate extending the language bias towards accommodating functions.

The authors are partially supported by NSF Grant IIS 1718945.


  • Baral (2003) Baral C (2003) Knowledge representation, reasoning and declarative problem solving. Cambridge University Press, Cambridge, New York, Melbourne
  • Corapi et al (2011) Corapi D, Russo A, Lupu E (2011) Inductive logic programming in answer set programming. In: Inductive Logic Programming - 21st International Conference, ILP 2011, Windsor Great Park, UK, July 31 - August 3, 2011, Revised Selected Papers, pp 91–97
  • Gebser et al (2012) Gebser M, Kaufmann B, Schaub T (2012) Conflict-driven answer set solving: From theory to practice. Artificial Intelligence 187-188:52–89

  • Gelfond and Kahl (2014) Gelfond M, Kahl Y (2014) Knowledge Representation, Reasoning, and the Design of Intelligent Agents: The Answer-Set Programming Approach. Cambridge University Press
  • Gelfond and Lifschitz (1988) Gelfond M, Lifschitz V (1988) The stable model semantics for logic programming. In: Logic Programming, Proceedings of the Fifth International Conference and Symposium, Seattle, Washington, August 15-19, 1988 (2 Volumes), pp 1070–1080
  • Law et al (2015) Law M, Russo A, Broda K (2015) The ILASP system for learning answer set programs. https://www.doc.ic.ac.uk/~ml1909/ILASP
  • Lichman (2013) Lichman M (2013) UCI machine learning repository. URL http://archive.ics.uci.edu/ml
  • Mitchell (1997) Mitchell TM (1997) Machine learning. McGraw Hill series in computer science, McGraw-Hill
  • Muggleton (1991) Muggleton S (1991) Inductive logic programming. New Generation Comput 8(4):295–318
  • Muggleton (1995) Muggleton S (1995) Inverse entailment and Progol. New Generation Computing 13(3):245–286
  • Otero (2001) Otero RP (2001) Induction of stable models. In: Rouveirol C, Sebag M (eds) Inductive Logic Programming, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 193–205
  • Plotkin (1971) Plotkin GD (1971) A further note on inductive generalization. In: Machine Intelligence, vol 6, pp 101–124
  • Quinlan (1990) Quinlan JR (1990) Learning logical definitions from relations. Machine Learning 5:239–266
  • Ray (2009) Ray O (2009) Nonmonotonic abductive inductive learning. Journal of Applied Logic 7(3):329 – 340, special Issue: Abduction and Induction in Artificial Intelligence
  • Sakama (2005) Sakama C (2005) Induction from answer sets in nonmonotonic logic programs. ACM Trans Comput Log 6(2):203–231
  • Sakama and Inoue (2009) Sakama C, Inoue K (2009) Brave induction: a logical framework for learning from incomplete information. Machine Learning 76(1):3–35
  • Seitzer (1997) Seitzer J (1997) Stable ILP: Exploring the added expressivity of negation in the background knowledge. In: IJCAI-97 Workshop on Frontiers of ILP
  • Shakerin et al (2017) Shakerin F, Salazar E, Gupta G (2017) A new algorithm to automate inductive learning of default theories. TPLP 17(5-6):1010–1026
  • Srinivasan (2001) Srinivasan A (2001) The Aleph Manual. URL http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/
  • Srinivasan et al (1996) Srinivasan A, Muggleton S, Bain M (1996) Distinguishing exceptions from noise in non-monotonic learning. In: Muggleton S, Furukawa K (eds) Second International Inductive Logic Programming Workshop (ILP'92)
  • Zeng et al (2014) Zeng Q, Patel JM, Page D (2014) Quickfoil: Scalable inductive logic programming. Proc VLDB Endow 8(3):197–208