Deep learning has been wildly successful in many domains within NLP, computer vision, and generative modeling. However, there is no shortage of opportunities for improvement: in particular, six issues recur throughout the literature.
Deep learning requires too much data.
Deep learning cannot learn from abstract definitions and instead requires tens of thousands of examples (humans are far more example-efficient when learning complex rules). [lake-human] Geoffrey Hinton has likewise expressed concern that CNNs face "exponential inefficiencies." [sabour-capsule]
Deep learning has no interface for knowledge transfer.
There is no clean way to encode and transfer patterns learned by one neural net to another, even if their problem domains are very similar.
Deep learning handles non-flat (i.e. hierarchical) data poorly.
Neural nets take as input a flat vector. Experts must specially encode tree, graph, or hierarchical data or (more likely) abandon deep learning.
Deep learning cannot make open inference about the world.
Neural nets perform poorly on extrapolation and multi-hop inference.
Deep learning integrates poorly with outside knowledge (i.e. priors).
Encoding priors (e.g. objects in images are shift-invariant) requires re-architecting the neural net (e.g. convolutional structure).
Deep learning is insufficiently interpretable.
The millions of black-box parameters are not human-readable, and it is difficult to ascertain what patterns are actually learned by the algorithm.
[Table: which of the critiques — needs too much data, can't encode hierarchy, can't openly infer, can't integrate priors — are raised by Marcus [marcus-appraisal], Bottou [bottou-reasoning], Garnelo [garnelo-symbolic], Evans [evans-explanatory], Zhang [zhang-like-humans], and Valiant [valiant-robust].]
The key insight, due to Valiant, is that inductive logic suffers from none of these issues: logic is data-efficient, has simple rules for encoding/synthesis/structure, chains easily to make inferences, and has a human-readable symbology. By combining logic and learning methods, one might achieve both the impressive, noise-tolerant performance of deep learning and the ease of use of inductive logic. In this review, we call such a system an artificial intelligence reasoning system.
This literature review serves 3 primary purposes.
Summarize the findings from logic, mathematics, statistics, and machine learning on artificial reasoning systems.
Rephrase the findings from disparate fields in the same terminology.
Synthesize findings: find areas of useful integration and identify next steps.
Since findings related to AI reasoning systems come from so many disparate fields, proofs have been written in many notations (e.g. in query form [yang-diff-kb] and in IQEs [valiant-robust]); we will rephrase all logic in inductive logic programming (ILP) notation.
(Atom) An n-ary predicate p(t1, …, tn) where each term ti is a variable or constant. An atom typically represents the truth value of a relation on the given arguments.
(Ground Atom) An atom with no variables
(Definite clause) A rule h ← b1, …, bn, where the head atom h is implied by all body atoms b1, …, bn being true.
(Consequence) The closure obtained by repeatedly applying a set of rules R to a set of ground atoms G until no new ground atoms are derived.
(ILP Problem) A 3-tuple (B, E⁺, E⁻) of a set of background ground atoms, a set of positive examples, and a set of negative examples.
(ILP Solution) A set of definite clauses whose consequences include every positive example and no negative example.
(Knowledge Base) A set of known or learned rules.
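To make the ILP definitions above concrete, here is a minimal sketch of the consequence operator: naive forward chaining over definite clauses. All names, the tuple encoding of atoms, and the uppercase-variable convention are illustrative assumptions, not drawn from any cited paper.

```python
# A ground atom is a tuple like ("parent", "ann", "bob"); a definite clause
# maps a head atom (with variables, written uppercase) to a list of body atoms.
from itertools import product

def substitutions(variables, constants):
    """Enumerate all assignments of constants to variables."""
    for combo in product(constants, repeat=len(variables)):
        yield dict(zip(variables, combo))

def apply_sub(atom, sub):
    pred, *args = atom
    return (pred, *[sub.get(a, a) for a in args])

def consequences(clauses, ground_atoms, constants):
    """Closure of ground_atoms under the clauses (the consequence operator)."""
    known = set(ground_atoms)
    changed = True
    while changed:
        changed = False
        for head, body in clauses:
            variables = sorted({a for atom in body for a in atom[1:] if a.isupper()})
            for sub in substitutions(variables, constants):
                if all(apply_sub(b, sub) in known for b in body):
                    h = apply_sub(head, sub)
                    if h not in known:
                        known.add(h)
                        changed = True
    return known

# grandparent(X, Z) <- parent(X, Y), parent(Y, Z)
clauses = [(("grandparent", "X", "Z"), [("parent", "X", "Y"), ("parent", "Y", "Z")])]
atoms = {("parent", "ann", "bob"), ("parent", "bob", "cal")}
closure = consequences(clauses, atoms, {"ann", "bob", "cal"})
```

An ILP solution in this vocabulary is a clause set whose `consequences` contain every positive example and no negative example.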
In this review, all standard definitions of PAC learning from Kearns and Vazirani apply. The term PAC-learning here means efficiently PAC-learnable, i.e. an algorithm exists that achieves ε-accuracy with confidence 1 − δ using polynomially many examples and polynomial time.
2 PAC Frameworks
2.1 Unconstrained Reasoning
In order to motivate the ILP framework, we present first a negative result: learning a reasoning system is intractable in the PAC setting if the outputs of rules are relaxed from boolean outputs to arbitrary outputs.
2.1.1 Learning Conceptual Graphs
(Conceptual Graph) As originally formulated, a simple conceptual graph is a directed bipartite graph G = (R ∪ C, E), where nodes in R correspond to relations, nodes in C correspond to "concepts," and edges symbolize an ordered relationship between a predicate and one of its arguments. [croitoru-conceptual] An identical formulation in ILP vocabulary is that nodes in R are head atoms, nodes in C are body atoms (relaxed from booleans to arbitrary objects), and the in-edges of a node represent a rule.
(Projection) Informally, if we are given conceptual graphs G and H, then G is more general than H (denoted G ≥ H, e.g. human ≥ student) iff there exists a projection from G to H. Formally, a projection from G to H is a mapping π such that:
π preserves the relation/concept partition; the label of each image π(v) dominates the label of v; and every edge (u, v) of G maps to an edge (π(u), π(v)) of H.
We present a simple intermediate result due to Baget and Mugnier [baget-np].
Deciding whether G ≥ H is an NP-complete problem.
The last condition alone suffices for the proof. We want to find a mapping π under which every edge of G maps to an edge of H.
Consider the special case where G is a graph whose only edges form a k-clique. In this special case, we want to decide whether the edges of G have some mapping onto a k-clique in H. This is now the decision version of the k-clique problem, which is known to be NP-complete.
Since the NP-complete k-clique problem is thus a special case of the projection problem (and a candidate projection can be verified in polynomial time), we conclude that the projection problem is NP-complete as well. ∎
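The reduction can be made concrete with a brute-force sketch: when one graph is a k-clique pattern, the edge-mapping condition holds exactly when the other graph contains a k-clique. The adjacency-dict representation and function names are assumptions for illustration.

```python
# Brute-force k-clique decision: equivalent to deciding the edge-mapping
# condition of the projection problem in the clique special case.
from itertools import combinations

def has_k_clique(adj, k):
    """Does the graph (dict: node -> set of neighbors) contain a k-clique?"""
    nodes = list(adj)
    for subset in combinations(nodes, k):
        # a clique requires every pair in the subset to be adjacent
        if all(v in adj[u] for u, v in combinations(subset, 2)):
            return True
    return False

# triangle {1, 2, 3} plus a pendant node 4
g = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
```

The exponential blow-up of this brute-force search is exactly what the NP-completeness result says cannot be avoided in general (assuming P ≠ NP).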
We now arrive at the desired negative result due to Jappy and Nock. [jappy-conceptual]
Let 𝒢 be the set of conceptual graphs, and let C ⊆ 𝒢 be a concept class of such graphs. C is not PAC-learnable unless NP ⊆ P/poly.
Let σ(c) denote the most specific conceptual graph that has the same representation as c. Let C′ = {σ(c) : c ∈ C}.
Deciding whether an arbitrary graph is in C′ is a special case of the projection problem. By Lemma 2.1, testing membership for C′ is NP-complete.
We cite Theorem 7 from Schapire’s The strength in weak learning [schapire-weak]:
Suppose C is learnable. Then there exists a polynomial p such that for all concepts c of size s, there exists a circuit of size p(s) exactly computing c.
We then have that if C is PAC-learnable, then membership testing for C′ can be computed by polynomial-size circuits (i.e. in P/poly). The contrapositive of this theorem is that if membership testing is not in P/poly, then C is not PAC-learnable.
We know that membership testing for C′ is NP-complete and take as assumption that NP ⊄ P/poly.
Therefore, C′ (and hence C) is not PAC-learnable. ∎
The consequence of this negative result is that all other discussion will be about constrained reasoning: either reasoning about small sets of objects/relations or reasoning about arbitrary sets of objects/relations with rules of constrained form.
2.2 Constrained Reasoning
2.2.1 Learning to Reason
Consider the problem of finding a satisfying assignment x for f(x) = 1, where f is a boolean function given as a CNF formula.
It is well-known that this problem is NP-Hard.
(Reasoning query oracle) Define a reasoning query oracle as an oracle that picks an arbitrary query q, then checks whether the learning agent is correct about its belief of whether f ⊨ q. If the agent is correct, the oracle returns True; else, it returns a counterexample.
(Exact Learn-to-Reason) Let F be a class of boolean functions, and let Q be a query language. An algorithm A is an exact Learn-to-Reason algorithm for the problem (F, Q) iff A learns f ∈ F in polynomial time with a reasoning query oracle and, after the learning period, answers True to query q ∈ Q iff f ⊨ q.
If we are given that f has a polynomial-size DNF, that each clause of the query CNF has a bounded number of literals, and access to an equivalence query oracle, an example oracle, and a reasoning query oracle, then we can find an exact learn-to-reason solution to the NP-hard CNF problem.
We present an exact learn-to-reason algorithm for learning some boolean function f with polynomial-size DNF over a basis B.
The answering algorithm follows given an input query and the model produced by the learning algorithm: simply evaluate the query CNF on the learned boolean vectors.
The Learn-to-Reason Training Algorithm is mistake-bound, learns from below, and is an exact Learn-to-Reason algorithm.
We begin by noting that the algorithm never makes a mistake when it returns False, since its hypothesis is always a lower bound on f. Thus, it makes mistakes only on positive counterexamples; since the size of f's DNF is bounded (in fact, polynomial), the number of mistakes is bounded.
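The "learns from below" behavior can be sketched in a toy loop. This is not Khardon and Roth's exact algorithm; the hypothesis here simply memorizes positive counterexamples, which already exhibits the key property that an answer of False is never a mistake.

```python
# Learning from below: the hypothesis accepts only assignments it has seen
# labeled positive, so it never wrongly answers True.
def make_hypothesis(positive_examples):
    memo = set(positive_examples)
    def h(x):
        return tuple(x) in memo
    return h

def train(target, example_stream):
    """Mistake-bound loop: mistakes are possible only on positives, and each
    mistake permanently adds the counterexample to the hypothesis."""
    seen = []
    for x in example_stream:
        h = make_hypothesis(seen)
        if target(x) and not h(x):   # a mistake; record the counterexample
            seen.append(tuple(x))
    return make_hypothesis(seen)

# target: x1 AND x2 over 3 variables
target = lambda x: bool(x[0] and x[1])
stream = [(1, 1, 0), (0, 1, 1), (1, 1, 1), (1, 0, 1)]
h = train(target, stream)
```

After training, `h` answers True exactly on the positive assignments it has seen, so its positive answers are always sound with respect to the target.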
We cite 2 theorems by Khardon and Roth [khardon-l2r]:
For a boolean function class and query language with basis B, the least upper bound of f with respect to B is well-defined.
Given f, a query q, and basis B, f ⊨ q iff the least upper bound of f with respect to B entails q.
From the former, we have that the learned hypothesis is the least upper bound on f after polynomially many calls to the oracle, by definition of a least upper bound.
The postcondition of the latter is satisfied by the algorithm, so we conclude that the output of the algorithm is a solution to the learn-to-reason version of the CNF problem. ∎
Thus, this algorithm learns to reason about CNF from a knowledge base of counterexamples efficiently. [khardon-l2r] We note that this has not solved an NP-hard problem in the traditional sense; the additional reasoning power comes from the oracle and its ability to provide counterexamples in constant time. In fact, the problem is still traditionally intractable with our additional restriction on the DNF size; it is NP-hard to find a satisfying assignment for a CNF even given that there exists only one satisfying assignment. [valiant-np]
Finally, we note that there is an extension from exact learning-to-reason to the PAC setting, as defined by Khardon and Roth.
(Fair) Given boolean function f, query q, and distribution D, query q is ε-fair if Pr_D[q] = 0 or Pr_D[q] ≥ ε; either the query does not occur (and thus no mistakes can be made over it) or it occurs frequently.
(PAC Learn-to-Reason) Let F be a class of boolean functions, and let Q be a query language. An algorithm A is a PAC Learn-to-Reason algorithm for the problem (F, Q) iff A runs in time polynomial in 1/ε, 1/δ, and the input size with a reasoning query oracle, and after the learning period, answers True to an ε-fair query q iff f ⊨ q, with probability 1 − δ.
This definition will serve as an inspiration for the next finding in Robust Logics, where the PAC-setting is critical to the tractability of the rule-learning.
2.2.2 Robust Logics
| Robust Logics term | ILP analog | Notes |
| --- | --- | --- |
| Token | N/A | A named reference to an object. Valiant himself assumed distributional symmetry to avoid dealing with permutations of tokens w.r.t. objects, so we omit the term and concept altogether. |
| Connective function | Predicate | Valiant assumes these connective functions belong to a PAC-learnable class (e.g. linear threshold functions). |
| Rule | Rule | Valiant's rules have a slightly different structure: the head relation is obtained by applying a connective function to a boolean combination of relations over the object set. |
(Scene) Given objects O, relations R1, …, Rt, and arities a1, …, at, a scene is a boolean vector with one entry per grounded relation (length Σᵢ |O|^{aᵢ}), where entry j is the truth value of the jth grounded relation.
(Obscured Scene) An obscured scene is a scene whose vector elements can take on 3 values: {0, 1, ∗}, where ∗ denotes an obscured truth value.
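A minimal sketch of this encoding, assuming one vector slot per grounded relation and Python's `None` standing in for the obscured value ∗; the layout and names are illustrative.

```python
# Encode an obscured scene as a flat vector with one slot per grounded relation.
from itertools import product

def scene_vector(objects, relations, truths):
    """relations: list of (name, arity); truths: dict (name, *args) -> 0/1.
    Slots missing from `truths` are obscured and encoded as None."""
    slots = []
    for rel, arity in relations:
        for args in product(objects, repeat=arity):
            slots.append(truths.get((rel, *args)))  # None if obscured
    return slots

objects = ["a", "b"]
relations = [("likes", 2)]
truths = {("likes", "a", "b"): 1, ("likes", "b", "a"): 0}
vec = scene_vector(objects, relations, truths)
# 4 slots in order (a,a), (a,b), (b,a), (b,b); the diagonal pairs are obscured
```

The vector length here is |O|² = 4 for one binary relation, matching the Σᵢ |O|^{aᵢ} count in the definition.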
In Valiant’s setting of Robust Logics, we place constraints on the atoms themselves: if predicates are restricted to boolean functions that are PAC-learnable, then any class of rules over them is PAC-learnable from examples (even with partial knowledge). We then present 2 algorithms:
A learning algorithm for inducing rules from examples.
A deduction algorithm using learned rules to predict the truth values of obscured variables.
2.2.2.1 PAC-learnability from Scenes
If the class of connective functions is PAC-learnable to accuracy ε and confidence 1 − δ by algorithm A in m(ε, δ) examples, then any class of rules with constant arity over those connective functions is PAC-learnable from scenes in m(ε, δ) examples.
The proof borders on tautological. We target a rule whose connective function f is unknown.
Note that a particular scene yields an input vector of truth values and an output boolean. Suppose we sample m(ε, δ) such scenes. By assumption, algorithm A can learn some function f̂ that is ε-accurate with confidence 1 − δ.
If the distribution of scenes is D, then the probabilistic guarantee is that f̂ disagrees with f with probability at most ε over scenes drawn from D.
We define a learned rule r̂ (the target rule, but with f substituted by f̂), which inherits the same guarantee. ∎
2.2.2.2 Learning Algorithm
We must be given some set of templates for rules in which only the connective function is unknown.
Let T be a set of such rule templates. The algorithm constructs a training set for the PAC algorithm A over the connective functions and runs it |T| times, once per rule template.
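The per-template learning step can be sketched as follows. A plain perceptron stands in here for the PAC learner over linear threshold functions (the applied work discussed later uses Winnow); the training pairs, which map body truth values to the head truth value, are toy stand-ins.

```python
# For one rule template, learn the unknown connective function from
# (body-truth-vector, head-truth) pairs extracted from scenes.
def perceptron(samples, epochs=20):
    """Learn a linear threshold function over boolean inputs."""
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in samples:
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            if pred != y:  # standard perceptron mistake-driven update
                for i in range(n):
                    w[i] += (y - pred) * x[i]
                b += (y - pred)
    return lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# template: head <- f(body1, body2); each scene contributes one training pair
scenes = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]  # f = OR
f_hat = perceptron(scenes)
```

Running this once per template in T yields the |T| learned connective functions described above.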
2.2.2.3 Deduction Algorithm
The deduction algorithm is significantly more complex because deduction can yield both ambiguities and contradictions, the former arising from insufficient information and the latter from errors in the rules (of magnitude ε). We construct a graph G = (V, E), where the vertices are relations and (p, q) ∈ E iff there is a rule with p on the LHS and q on the RHS. We present an illustrative algorithm for the case in which G is acyclic, then a general algorithm for arbitrary graphs G.
The acyclic case is quite intuitive since there is an order of evaluation. The general case has no such ordering and so must proceed for an indeterminate (but certainly finite) amount of time.
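The acyclic case can be sketched directly: deduce relations in topological order of the rule graph, firing each rule once its LHS is settled. The rule representation `(lhs_relations, rhs_relation, fn)` is an assumed simplification, not Valiant's notation.

```python
# Deduction in the acyclic case: evaluate relations in topological order.
from graphlib import TopologicalSorter

def deduce_acyclic(rules, known):
    """rules: list of (lhs_relations, rhs_relation, fn);
    known: dict relation -> bool for the unobscured relations."""
    deps = {rhs: set(lhs) for lhs, rhs, _ in rules}
    order = TopologicalSorter(deps).static_order()  # dependencies come first
    fns = {rhs: (lhs, fn) for lhs, rhs, fn in rules}
    values = dict(known)
    for rel in order:
        if rel in values or rel not in fns:
            continue  # already known, or no rule derives it
        lhs, fn = fns[rel]
        if all(p in values for p in lhs):
            values[rel] = fn(*[values[p] for p in lhs])
    return values

# c <- AND(a, b); d <- NOT(c)
rules = [(["a", "b"], "c", lambda a, b: a and b),
         (["c"], "d", lambda c: not c)]
out = deduce_acyclic(rules, {"a": True, "b": True})
```

With cycles, no such evaluation order exists, which is why the general algorithm below must iterate until it stabilizes.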
We would like this general algorithm to label a large proportion of relations and to achieve small error on those relations that it labels. We formalize these notions as soundness and completeness.
(PAC-Sound) A deduction algorithm is PAC-sound if:
it is polynomial time.
for any error ε, its probability of mislabeling a relation is at most ε, where the description size of the rule set and scene is the number of variables.
(Deduction sequence) A deduction sequence is a sequence of triples (r, m, b), where r is a rule, m is a mapping of variables to objects, and b is the boolean output.
(Validity) A deduction algorithm is valid for an input iff, for all unobscured relations, each rule in the set of rules with that relation as the RHS is satisfied (i.e. if the rule's predicate is true, then the predicted boolean matches the unobscured truth value).
(Complete) A deduction algorithm is complete for a class of rule sets with respect to its inputs iff there is a valid deduction sequence ending with a given output exactly when the deduction algorithm produces that output on those inputs.
The general deduction algorithm is PAC-sound and complete if the maximum arity of the rules is upper-bounded by some constant k.
For each rule r, we define e_r, l_r, and c_r to be the number of evaluations per LHS predicate, the number of predicates in the LHS, and the complexity of evaluating a single connective function, respectively.
For each rule r, the algorithm runs at most e_r · l_r times. We further note that the truth value of any predicate, as currently bounded, can be computed in c_r steps.
Then the complexity of the algorithm is bounded by Σ_r e_r · l_r · c_r, which by the arity bound and the definition of description size is polynomial in the description size. Thus, the runtime is polynomial for constant arity k.
We now show that the algorithm meets the accuracy conditions.
For rule set R and scene σ, the algorithm evaluates at most |R| rules. By PAC-learnability from scenes, the error for a single rule is bounded by some ε₀. We select ε₀ = ε/|R|, where ε is the error required by the definition of PAC-soundness. By a union bound over the |R| rules, the probability that an arbitrary scene errs is then no more than ε. ∎
Thus, we have now demonstrated an algorithm that learns to reason soundly and completely when given variables and relations, scenes each describing the truth values of some relations, knowledge of the structure of the rules, PAC-learnability of the boolean functions in the rules, and bounds on the arities and sizes of the rules. [valiant-robust]
2.2.3 Knowledge Infusion
(Knowledge Infusion) Any process of knowledge acquisition by a computer that satisfies the following 3 properties:
The stored knowledge is encoded such that principled reasoning on the knowledge is computationally feasible.
The stored knowledge and corresponding reasoning must be robust to errors in the system inputs, to uncertainty in knowledge, and to gradual changes in the truth.
The acquisition must be carried out automatically and at scale.
Valiant and Michael present motivations, PAC bounds, and an applied algorithm in the context of knowledge infusion. We discuss their 2 seminal publications. [valiant-infusion] [michael-infusion]
The premise of knowledge infusion is to extend robust logics using cognitive science as inspiration. Recall that robust logics learns rules between relations given scenes: boolean vectors that give a partial description of variables in a specified ontology. Contrast this with standard ILP frameworks, which quantify existentially and universally over an unlimited or ill-specified world. This corresponds roughly to the notion of a working memory: learning on a small-dimensional, manageable subset of the world.
This notion spawns 3 subproblems.
There is no longer a single target function or a target query; to infuse large quantities of knowledge, the complexity of learning many targets may increase.
Learning from a working memory model requires an abundance of teaching materials to educate the model with an appropriate order of positive and negative examples.
Are Rules Necessary?
As we previously saw, Khardon and Roth’s algorithm for learning-to-reason re-learns the model from scratch after each mistake by memorizing examples. Rules may or may not provide better performance.
We discuss the former two, demonstrate applied results, and then present arguments from Juba, Valiant, Khardon, and Roth on the final point.
2.2.3.1 Parallel Concepts
We present Valiant’s findings on learning many concepts simultaneously from few examples. Surprisingly, learning k functions rather than just one over the same domain does not require k times as many examples; very few additional examples are needed.
(Simultaneous-PAC Learning Algorithm) For polynomial m, accuracy ε, confidence δ, and number of functions k, A is an m-simultaneous PAC-learning algorithm over concept class C iff for any distribution, after m(1/ε, 1/δ, k) examples, it can learn a set of hypotheses for k arbitrary concepts in C, each to accuracy ε, with probability 1 − δ.
We now present 2 theorems in support of the notion of a log-increase on the number of examples needed to support many functions.
If there is an m(1/ε, 1/δ)-PAC learning algorithm for concept class C, then there is an m′-simultaneous-PAC learning algorithm for C, where m′(1/ε, 1/δ, k) = m(1/ε, k/δ).
We sample m(1/ε, k/δ) random examples from the distribution and apply the PAC algorithm to each function separately.
The probability of a specific hypothesis having error more than ε is at most δ/k. By union bound, the probability of at least one hypothesis having error more than ε is at most δ.
We choose the per-function confidence parameter to be δ/k. This completes the proof. ∎
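A quick numeric sketch of why the dependence on k is only logarithmic, using the standard Occam-style sample bound m ≥ (1/ε)(ln |H| + ln(1/δ)); the formula, hypothesis-class size, and parameter values are illustrative.

```python
# Compare the sample cost of one target vs. k targets at confidence delta/k.
import math

def m(eps, delta, hypothesis_count):
    """Occam-style bound: m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / eps)

eps, delta, H = 0.1, 0.05, 10**6
single = m(eps, delta, H)                 # one target
simultaneous = m(eps, delta / 1000, H)    # k = 1000 targets, confidence delta/k
# a thousand-fold increase in targets costs only ~log(1000)/eps extra examples
```

Here `single` is 169 examples while `simultaneous` is 238: a thousand targets cost only 69 extra examples, matching the log(k) behavior the theorem predicts.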
A consequence of this algorithm is that concept classes whose sample complexity has a log(1/δ) dependence (such as disjunctions over n variables [littlestone]) then have only a log-dependence on k, which is promising.
The next theorem extends this principle in general to all concept classes.
(Simultaneous Occam Theorem) Suppose we have a domain, a distribution D, target functions f1, …, fk, and a learning algorithm that outputs a hypothesis for each target. Suppose further that an experiment occurs in which m examples are drawn from D, and the result is that the hypotheses predict every sample point correctly.
Then for any ε and any δ, the theorem bounds the number of examples required and the error of each hypothesis in terms of ε, δ, k, and the predicted positive rate.
Suppose that a function g (not necessarily one of the chosen hypotheses) has error greater than ε. By the multiplication rule, the probability of g labeling all m examples correctly is at most (1 − ε)^m.
By union bound, the probability of any function in the class having error greater than ε yet labeling all examples correctly is then at most |C|(1 − ε)^m.
We then have |C|(1 − ε)^m ≤ δ by the PAC algorithm's guarantee; solving gives m ≥ (1/ε)(ln |C| + ln(1/δ)). This gives us the theorem's required lower bound on the number of examples; the example bound is equivalent to the PAC condition. We now prove the three error bounds.
The first error bound is immediate from the definition of PAC learning.
The second error bound follows from the definitions.
The third error bound follows from the definitions, using the probability of a set difference. ∎
Valiant concludes that if learning a single rule is tractable in the PAC sense, then it is also tractable to learn many rules (even exponentially many) given that there is only a log dependence on the number of rules. This arises from a re-use of examples for every rule.
2.2.3.2 Teaching Materials and Scene Construction
In Michael and Valiant’s empirical demonstration of knowledge infusion, they create “teaching materials” by turning text from news sources into scenes as per the Robust Logic framework.
The dataset is the North American News Text Corpus [graff-text], comprising 6 months' worth of articles (roughly 500,000 sentences). Sentences were annotated with the Semantic Role Labeler, an automated tagger [ccg-uiuc]; sentence fragments were then passed through the Collins head rules, which extract keywords to summarize the sentence. [collins-head]
Finally, each sentence summary is converted to a scene as follows:
Create an entity for each tokenized word with a semantic object association.
Create a relation for each verb in the sentence on the relevant objects in the sentence. Michael and Valiant cap the arity of the relations at 2, since scene size is exponential in the arity.
Create a relation for proximity instances for words that are close to each other in the sentence.
A fuller description of scene construction can be found on pp. 381-382 of the original paper. [michael-infusion]
2.2.3.3 Experimental Approach
Given the constructed scenes, there are 3 primitive rule operations:
Rule Learning: given a target relation and scenes, create a rule for the target relation. Michael and Valiant choose the Winnow algorithm for linear threshold functions (perceptrons) to learn the rules, as per the suggestion in Valiant's Robust Logics paper. [valiant-robust]
Rule Evaluation: given an input rule , the input rule’s head relation , and a set of scenes , predict the truth value of .
Rule Application: given a set of rules and a set of scenes , apply all rules in parallel, and enhance the original set of scenes with unambiguous truth values determined by the rules.
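The Rule Application primitive can be sketched as follows; the scene and rule representations are assumptions for illustration. The key behavior is that all rules fire in parallel and a scene slot is enhanced only when the predictions for it are unambiguous.

```python
# Rule Application: fire every applicable rule in parallel, then enhance the
# scene only where the predictions agree (an unambiguous truth value).
def apply_rules(rules, scene):
    """scene: dict relation -> truth or None (obscured);
    rules: list of (body_relations, head_relation, fn)."""
    predictions = {}
    for body, head, fn in rules:
        if scene.get(head) is None and all(scene.get(b) is not None for b in body):
            predictions.setdefault(head, set()).add(fn(*[scene[b] for b in body]))
    enhanced = dict(scene)
    for head, vals in predictions.items():
        if len(vals) == 1:          # unambiguous: all rules agree
            enhanced[head] = vals.pop()
    return enhanced

rules = [(["r1"], "r3", lambda x: x),
         (["r2"], "r3", lambda x: not x)]
scene = {"r1": True, "r2": False, "r3": None}
out = apply_rules(rules, scene)
```

In this toy run both rules predict True for `r3`, so the enhanced scene fills that slot; had they disagreed, `r3` would have stayed obscured.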
Michael and Valiant define the chaining of these 3 primitives as the rule chaining task. They then create 4 discrete distributions (we call these "distributions," though the original terminology is "experiment"; the renaming is due to the term experiment being quadruply overloaded by the original authors) for the rule chaining task.
Each of the following 4 distributions takes as input a training set of scenes, a testing set of scenes, an enhancement set of relations (akin to a knowledge base or a prior of rules), and a target relation. The output in each task is a rule r for predicting the target, and we evaluate the performance of r on the testing set.
Distribution 00: the enhancement set is ignored.
Distribution 11: the enhancement set is used on the training set when learning rule r. The enhancement set is further used to augment the testing scenes.
Distribution 01: the enhancement set is only used on the testing set.
Distribution 10: the enhancement set is only used in learning.
Finally, Michael and Valiant experiment along one more axis: syntactic vs. semantic information.
Syntactic: only includes word/pos and proximity relations.
Semantic: only includes word/pos and verb relations.
Michael and Valiant note that they did not achieve any notable results from distributions 01 and 10 because the training and test procedures are misaligned. They then present 5 different arrangements of the distributions (00 or 11) and the type of information (semantic, syntactic, or both). The parameters and results were as follows.
[Table: the experiments, each listing its enhanced ruleset distribution (00 or 11), its information type (syntactic, semantic, or both), and its parameters and results.]
The best results across the experiments are promising: the learned rule achieves nearly 83% accuracy on test scenes containing relations and entities never seen before. This strongly demonstrates the case for logical inference as a component of learning.
The cross-experiment comparisons demonstrate that:
Syntactic information is critical for performance at any level (without it, performance does not significantly improve from 50%).
Semantic information in addition to syntactic information demonstrates little improvement; in other words, verbs provide no additional information beyond proximity.
An enhanced set of scenes from augmenting rules does not initially increase performance, but allows the algorithm to generalize to a larger set of targets in the evaluation task.
Michael and Valiant conclude that chaining learned rules is both feasible and scalable.
2.3 Are Rules Necessary?
We conclude this section on PAC frameworks by raising the question of whether learning logical induction requires an explicit representation of rules: Khardon and Roth [khardon-l2r] created a framework that requires no such representation, while Valiant's Robust Logics and Michael and Valiant's Knowledge Infusion frameworks centralize the principle of explicit rules (and also require the form of the rules to be given).
Valiant himself raises this issue in his original Knowledge Infusion paper and offers 3 reasons in defense of rules. [valiant-infusion]
Chaining rules allows the algorithm to make inference about situations too infrequent to have data that supports a learning-only deduction.
Chaining in Robust Logics learns equivalence rules that allow for a higher complexity than direct learning.
Learning without Robust Logics does not allow for programmed rules as priors. There is no analog to this knowledge transfer.
Juba argues that Valiant's knowledge infusion algorithm, because it preserves explicit rules, is not sufficiently powerful to support the finding that the number of learnable rules is exponentially large in the size of the knowledge base. [juba]
We define 6 terms in anticipation of a theorem justifying Juba’s claim.
(Partial example) An element of {0, 1, ∗}ⁿ: analogous to Valiant's notion of an obscured scene.
(Witnessed formula) A boolean formula is witnessed in partial example iff all literals of are unobscured in .
Given the literals observed to be true and the masked literals, a threshold connective formula is witnessed in a partial example iff its truth value is already determined by the observed literals alone (the masked literals cannot change whether the threshold is met).
(Restricted formula) Given the literals observed to be true, a restricted formula over a threshold connective is defined recursively: each subformula is replaced by its truth value if observed, and left intact otherwise.
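For a plain conjunction of literals, the witnessing condition reduces to a simple check that no literal is masked; the dict-with-`None` representation of a partial example is an assumption for illustration.

```python
# Witnessing for a conjunction: every literal must be unobscured in the
# partial example (None plays the role of the masked value).
def is_witnessed(literals, partial_example):
    """True iff every literal of the formula is unobscured in the example."""
    return all(partial_example.get(v) is not None for v in literals)

ex = {"x1": 1, "x2": 0, "x3": None}     # x3 is masked
witnessed = is_witnessed(["x1", "x2"], ex)
not_witnessed = is_witnessed(["x1", "x3"], ex)
```

Restriction-closure then asks that whenever a formula is provable, so is the formula obtained by substituting in these observed values.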
(Automatizability Problem) Given a set of proofs P and an input formula φ, the automatizability problem is deciding whether a proof of φ exists in P.
(Restriction-closed set of proofs) A set of proofs P is restriction-closed if, for all formulas φ, the existence of a proof of φ in P implies the existence of a proof of every restriction of φ in P.
Finally, we arrive at the theorem.
Let P be a restriction-closed set of proofs, and suppose there is an algorithm for the automatizability problem that is polynomial in its inputs. Then there is an algorithm that uses polynomially many examples and runs in polynomial time such that deciding which of the following holds is done correctly with probability 1 − δ:
There exists a proof from the knowledge base to φ in which all steps are witnessed.
The proof is omitted as it is outside the scope of PAC-reasoning; for an explicit algorithm that accomplishes the above, see Juba’s original publication. [juba]
The consequence of this theorem is that Juba finds a mechanism (witnessing) that preserves Valiant's finding that a small number of examples can support an exponential number of rules, where the Knowledge Infusion algorithm falls short.
3 Applied Approaches
For a problem as difficult and general as reasoning, strong bounds (such as those specified by the PAC model) have so far been found only in settings with many restrictions; for example, robust logics requires that the structure and relational ontology of all rules be known beforehand, when that very structure might be the most difficult aspect to learn.
The advent of deep learning and differentiable techniques has allowed for more complex functional approximations, so we now explore some recent advancements on the PAC-findings using applied approaches.
Whereas the PAC findings tended to build on each other's theorems, these deep learning findings largely share the same format.
Theorize a model.
Formulate an experiment.
Present empirical results.
We now present findings in symbolic ontology and symbolic reasoning in this format.
3.1 Learning Symbolic Ontology
This corresponds to the problem of scene construction in knowledge infusion.
3.1.1 Deep Symbolic Reinforcement Learning
Garnelo, Arulkumaran, and Shanahan present findings on inferring objects from frame-by-frame time-varying image data. [garnelo-symbolic]
The premise is that many of the deep learning issues addressed (lack of priors, lack of transfer, data-hungriness, lack of transparency) are solved if the learning is done on data that is symbolically represented in ILP form. Thus, the model learns a symbolic encoding of the data, then learns deeply on the symbolic representations.
Given “responsive” video data of scenes as depicted in Figure 3 and an existing ontology of objects, the procedure is as follows.
For each frame, learn which symbols in are in the picture and where.
For the sequence of frames, learn how specific instances of symbols in persist from frame-to-frame and represent “motion.”
Do reinforcement learning on the environment: colliding with certain objects yields positive rewards, while colliding with other objects yields negative rewards.
Each subproblem is solved with a heuristic algorithm. We present the first in its entirety and the others briefly because they are outside the scope of this review.
Object persistence is modeled with a transition probability matrix between each pair of consecutive frames, where the authors hardcode 2 priors.
Spatial proximity: the likelihood of persistence is defined as the inverse of the distance between the objects in consecutive frames.
Neighborhood: between frames, the number of nearby objects is likely to be similar, so the likelihood decreases with the change in the number of neighbors.
Given the spatio-temporal representation of an object, the authors then implement a reinforcement learning algorithm using tabular Q-learning. The policy update rule is:
where the state is an interaction between objects of two given types, the action and reward are as usual, and the final parameter is a temporal discount factor. A less dense articulation of this policy update is that we move the value estimate toward the observed reward plus the discounted value of the hypothesized best continuation.
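The update above is ordinary tabular Q-learning; a generic sketch follows, where the state names and parameter values are illustrative rather than taken from the paper.

```python
# One tabular Q-learning update:
# Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return Q

Q = {}
actions = ["up", "down"]
# hypothetical interaction state: the agent is near a positive-reward object
q_update(Q, "near_plus", "up", 1.0, "collected", actions)
```

With an empty table, the update moves the entry from 0 toward the reward by a factor of alpha, i.e. to 0.1; repeated visits continue shrinking the temporal-difference error.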
The authors compare the performance of their learned symbolic policy to Deep Q-Networks (DQN), DeepMind's state-of-the-art reinforcement learning model. The results are depicted below.
While DQN achieves optimal performance in a fixed grid setting very quickly and Deep Symbolic Learning plateaus early on, DQN fails to generalize to the random setting, while Deep Symbolic Learning does not suffer at all (curiously, it seems to perform better on the randomized setting).
We conclude that learning a symbolic representation aids generalization when data is scarce.
3.2 Learning to Reason over a Given Ontology
This corresponds to the rule inference problem in knowledge infusion. While Deep Symbolic Learning made some advancement in procuring the encoding of data into logical variables, its learning algorithm did not take advantage of the symbolic structure. The following findings are all algorithms to learn given fixed objects and relations in the ILP setting.
3.2.1 Differentiable Inductive Logic Programming (∂ILP)
∂ILP is a reimplementation of ILP in which the rules are end-to-end differentiable (and thus gradient descent techniques can be used for learning). [evans-explanatory]
(Valuation) Given a set of ground atoms, a valuation is a vector mapping each ground atom to the real unit interval [0, 1]. Each valuation element is a "confidence."
Consider a set of atom-label pairs derived from positive examples and negative examples :
The interpretation is that a pair (α, λ) represents some atom α and its "truthiness" λ (1 for entirely positive and 0 for entirely negative).
We now have the setting for constructing a likelihood. Recall that an ILP problem is defined in terms of a language frame (a target relation and a set of predicates), a set of background ground atoms, positive examples, and negative examples. We further specify a set of clause weights W. The likelihood of label λ for a given atom α is then:
We break down this likelihood into 4 functions.
The first takes a valuation and an atom and extracts that atom's value.
The second is an indicator for whether a ground atom from the examples is a positive example.
The third produces a set of clauses from the language frame: the clauses that satisfy the specified template.
Finally, the fourth is a mapping which infers new valuations by forward chaining over the generated clauses under the clause weights W.
These are defined as follows. Let W contain one weight matrix per predicate. A particular weight represents how strongly the system believes that a particular pair of clauses is the "correct way" to infer its predicate. We enforce this definition with a softmax approach.
In order to apply these weights to a predicate p, we compute all possible pairs of clauses (C_j, C_k) that could infer p, then sum a weighted average of their softmax weights:

b_t^p = Σ_{j,k} ( exp(W_p[j, k]) / Σ_{j′,k′} exp(W_p[j′, k′]) ) · c_t^{j,k}

where b_t^p is the confidence in p with t steps of chained inference and c_t^{j,k} is the contribution to the confidence of the inference pathway (C_j, C_k).
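As a minimal sketch of this softmax-weighted combination, assume a single predicate with three candidate clauses, so the clause-pair weight matrix is 3×3. All sizes and values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

n_clauses = 3
W_p = rng.normal(size=(n_clauses, n_clauses))  # learnable weights over clause pairs (C_j, C_k)
c = rng.uniform(size=(n_clauses, n_clauses))   # confidence contributed by each clause pair

# Softmax over all clause pairs, so the weights form a probability distribution.
probs = np.exp(W_p) / np.exp(W_p).sum()

# Overall confidence in the predicate: a convex combination of the
# per-pair confidences, weighted by how much the system believes each pair.
b_p = (probs * c).sum()
```

Because probs sums to 1, b_p always lies between the smallest and largest per-pair confidence; gradient descent on W_p shifts mass toward the clause pairs that best explain the data.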
This algorithm is carried out by a process called amalgamation; the details of this algorithm are omitted as they are outside the scope of this review, but they can be found in section 4.4 of the original publication. [evans-explanatory]
Note that in order to work with 2D matrices (as opposed to higher-dimensional tensors), the number of predicates in a rule is capped at two (much like Valiant’s Knowledge Infusion algorithm).
We will learn the weights by gradient descent.
The architecture is depicted below.
We fit this model by minimizing the expected negative log-likelihood over the dataset Λ that we defined above. Explicitly, that loss is:

loss = −E_{(α,λ)∈Λ} [ λ · log p(λ = 1 | α, W, Π, L, B) + (1 − λ) · log(1 − p(λ = 1 | α, W, Π, L, B)) ]
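This loss is the familiar binary cross-entropy between the label λ and the predicted probability that the atom holds. A small numeric sketch, with placeholder predictions standing in for the output of the differentiable inference step:

```python
import numpy as np

# Placeholder predicted probabilities p(λ = 1 | α) and their labels
# (1 = positive example, 0 = negative example); values are illustrative.
preds = np.array([0.9, 0.8, 0.2, 0.1])
labels = np.array([1.0, 1.0, 0.0, 0.0])

def expected_nll(p, y, eps=1e-12):
    """Expected negative log-likelihood (binary cross-entropy) over the
    atom-label pairs; eps guards against log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

loss = expected_nll(preds, labels)
```

Minimizing this quantity by gradient descent pushes the predicted confidences toward the labels.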
Evans and Grefenstette first tested ∂ILP on 20 tasks with no noise, including learning even numbers, learning FizzBuzz (divisibility by 3 xor 5), and learning graph cyclicity. Following these successes, they then tested on ambiguous data, such as learning even numbers from pixel images, learning whether one image is exactly two less than another, learning whether at least one image is a 1, learning the less-than relation from images, etc.
Results on unambiguous data were quite strong, beating state-of-the-art results. Results on ambiguous data were even more impressive, tolerating up to 20% noise with very little increase in mean squared error.
3.2.2 DeepLogic
Whereas ∂ILP maintains a polynomially-growing framework of transition weights, DeepLogic seeks to compute all steps of inference using a fixed number of neural networks. [cingillioglu-deeplogic]
The goal is to learn some function f(C, Q) → [0, 1], where C is the context of the logic program and Q is the query (the head atom, in ILP terms). Cingillioglu et al. present a module called a Neural Inference Network (NIN) to accomplish this goal.
The NIN takes as input two sequences of characters: a context vector C and a query vector Q. Given the existing structure of rules in the ILP framework, we can create an input tensor of shape ℓ × m, where ℓ is the number of literals and m is the length of these literals.
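One way to build such a tensor is sketched below, assuming characters are encoded by their ASCII code points and literals are zero-padded to a fixed length. The dimensions and encoding scheme are illustrative, not the paper's exact choices:

```python
import numpy as np

def encode(literals, max_len):
    """Pack literal strings into a (num_literals, max_len) integer tensor
    of character codes, zero-padded on the right."""
    out = np.zeros((len(literals), max_len), dtype=np.int64)
    for i, lit in enumerate(literals):
        codes = [ord(ch) for ch in lit[:max_len]]
        out[i, :len(codes)] = codes
    return out

context = ["p(X):-q(X).", "q(a)."]  # rules and facts as character strings
query = ["p(a)."]
C = encode(context, max_len=16)
Q = encode(query, max_len=16)
print(C.shape, Q.shape)  # (2, 16) (1, 16)
```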
The architecture of the system is depicted below.
Cingillioglu et al. train over some basic logical inference tasks:
- Facts (ground atoms with no variables)
- Unification (rules with empty bodies and an atom with a variable in the head)
- N-step deduction with negation
- Logical AND with negation
- Logical OR with negation
The results are contrasted with the performance of a dynamic memory network (DMN), a state-of-the-art question-answering architecture.
We see that the NIN performs very well as the number of deduction steps increases (it was, after all, architected for multi-hop reasoning). However, as we saw earlier with Valiant’s Knowledge Infusion algorithm and with ∂ILP, the length of predicates is extremely problematic for inference.
4.1 Summary of Findings
Learning without any ontology (in particular, if expressions are relaxed from booleans to arbitrary ontologies) is intractable. Negative result via conceptual graphs. [jappy-conceptual]
Rules are PAC-learnable from scenes of the world given the structure of rules (the identities of the relation atoms in the predicates). Positive result via robust logics. [valiant-robust]
The difficulty of learning rules grows exponentially with the maximum arity of the relation atoms. Negative result via robust logics, ∂ILP, and DeepLogic’s NIN. This problem looks to be especially difficult, as the intractability was discovered by at least 3 independent sources from 3 different fields (PAC learning, ILP, and deep learning). [valiant-robust] [evans-explanatory] [cingillioglu-deeplogic]
Given a learning oracle that provides counterexamples of incorrect rules, it is possible to learn exact-reasoning and PAC-reasoning algorithms for problems that are otherwise NP-hard. Positive result via the learning-to-reason framework. [khardon-l2r]
Learning many rules over an ontology is not much harder than learning a single rule over the same ontology. Positive result via knowledge infusion. [valiant-infusion]
Parallel learning via knowledge infusion over the robust logics framework yields successful applied results on natural language, especially when syntactic structures are provided in lieu of relations. Positive result via knowledge infusion. [michael-infusion]
Learning over a symbolic representation of data rather than raw information helps generalize to non-trivial distributions. Positive result via deep symbolic learning. [garnelo-symbolic]
Relaxing booleans from {0, 1} to the unit interval [0, 1] allows for a high-performing differentiable framework. Positive result via ∂ILP. [evans-explanatory]
Featurizing literals (recognizing that objects and relations can be represented more meaningfully than a simple index) allows neural nets to reason effectively. Positive result via DeepLogic’s NIN. [cingillioglu-deeplogic]
4.2 Future Work
4.2.1 Integrated Models
The most obvious way to progress from these findings is to recognize that many of these works have fortuitously designed their models as modular units. Thus, a comprehensive end-to-end system can be constructed by stacking these modules.
|Subproblem|Relevant Systems|
|---|---|
|Creating a Knowledge Base|Knowledge Infusion|
|Extracting Symbols|Deep Symbolic Learning|
|Reasoning from Literals without Rule Template|Learn-to-Reason, ∂ILP, DeepLogic|
|Reasoning from Literals with Rule Template|Robust Logics, Knowledge Infusion (form given by POS tagger)|
For example, consider the problem of reasoning about flying through an asteroid field. We might stack these modules into a system that uses Knowledge Infusion to generate a knowledge base (e.g. asteroids have momentum), Deep Symbolic Learning to extract symbols from video data (e.g. brown pixel asteroid), DeepLogic to featurize the literals (e.g. asteroids are bad), and ∂ILP to learn rules over the featurized literals (e.g. left-thrust to avoid an asteroid from the right).
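As a sketch only, such a stack could be expressed as straightforward function composition. Every interface below (names, signatures, return values) is hypothetical; none corresponds to a published implementation:

```python
# Hypothetical module interfaces for the asteroid-field example.
# Each stage returns hardcoded stand-in values to show the data flow only.

def knowledge_infusion(experience):
    """Stage 1: induce a background knowledge base from raw experience."""
    return {"asteroids have momentum"}

def deep_symbolic_learning(video_frames):
    """Stage 2: extract symbolic objects from perception."""
    return [("asteroid", "brown", (3, 4))]

def featurize(symbols):
    """Stage 3 (DeepLogic-style): embed each literal as a feature vector."""
    return [(sym, [1.0, 0.0]) for sym in symbols]

def learn_rules(knowledge_base, featurized_literals):
    """Stage 4 (∂ILP-style): learn rules over the featurized literals."""
    return ["thrust_left :- asteroid_right"]

rules = learn_rules(
    knowledge_infusion(experience=None),
    featurize(deep_symbolic_learning(video_frames=None)),
)
```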
4.2.2 Unsolved Problems
4.2.2.1 Encoding Priors
Many of these studies note (and rightly so) that a logical rule form allows one to encode priors. However, none of these studies actually provides a methodology for doing so; this is non-trivial, especially since the 2 applied differentiable reasoning systems (NIN and ∂ILP) have other parameters associated with rules (confidence, attention) that must be learned.
4.2.2.2 Flexible Object Ontology
Similarly, many of these studies note (and rightly so) that a logical rule form makes transfer-learning a theoretical possibility. However, all of the systems are hardcoded in terms of a fixed number of objects and relations. Suppose we gave any of these systems a dataset to learn rules about fruits. Suppose further that we then acquired data about legumes. There would be no way to transfer-learn under any of the given systems – robust logics requires the ontology in advance, and the deep systems are all hardcoded in input size.
4.2.2.3 Arbitrary Arity
Of the 7 studies surveyed with direct positive results, 3 made no attempt to learn rules of arity greater than 2 (or any mention of such a possibility). Two noted the intractability of such a problem (Valiant [valiant-robust], Evans [evans-explanatory]), and one empirically tested and failed over longer rules (Cingillioglu [cingillioglu-deeplogic]). Finally, even given a Reasoning Oracle that provides counterexamples (not necessarily feasible in an applied setting), the Learn-to-Reason algorithm only provides positive results given a log-bound on the number of literals per clause.
The unscalability in arity is the only direct shortcoming that all of these studies have in common.
4.2.3 Natural Language
The most straightforward application seems to be natural language, which happens to be the domain Michael and Valiant chose for Knowledge Infusion. This is because natural language is rich enough that learning dynamic rules is useful, while structured enough to provide a given ontology of objects and relations (nouns and verbs/proximity clauses). Furthermore, many state-of-the-art dialogue systems in industry (such as Cognea, Watson Assistant, IPSoft, Google Dialogflow) have rules-based entity detection. These rules provide a potential knowledge base of priors that can be encoded into a bootstrapping system.
4.2.4 Econometric Causality
A significant field of study in econometrics is identifying instrumental variables (IVs) because a correlation from IVs can lead to a sound conclusion of causality. Findings in AI reasoning systems can supplement IV research since the learned form (rules as implications) gives a direct representation of causality (and is robust to multi-hop reasoning, as shown by DeepLogic’s NIN).
4.2.5 Model Transparency
Recent European legislation (in particular, the General Data Protection Regulation, GDPR) has made data transparency a worldwide priority. Compliance with GDPR requires that companies be able to explain the behavior of their models in the context of protected classes. While this may be straightforward for generalized linear models, it is highly non-trivial for black-box models (as most deep learning systems are). AI reasoning systems can help sustain the high performance of deep learning systems while maintaining compliance with ethics and regulation.