kLog: A Language for Logical and Relational Learning with Kernels

05/17/2012 ∙ by Paolo Frasconi, et al. ∙ University of Freiburg ∙ UNIFI

We introduce kLog, a novel approach to statistical relational learning. Unlike standard approaches, kLog does not represent a probability distribution directly. It is rather a language to perform kernel-based learning on expressive logical and relational representations. kLog allows users to specify learning problems declaratively. It builds on simple but powerful concepts: learning from interpretations, entity/relationship data modeling, logic programming, and deductive databases. Access by the kernel to the rich representation is mediated by a technique we call graphicalization: the relational representation is first transformed into a graph --- in particular, a grounded entity/relationship diagram. Subsequently, a choice of graph kernel defines the feature space. kLog supports mixed numerical and symbolic data, as well as background knowledge in the form of Prolog or Datalog programs as in inductive logic programming systems. The kLog framework can be applied to tackle the same range of tasks that has made statistical relational learning so popular, including classification, regression, multitask learning, and collective classification. We also report on empirical comparisons, showing that kLog can be either more accurate, or much faster at the same level of accuracy, than Tilde and Alchemy. kLog is GPLv3 licensed and is available at http://klog.dinfo.unifi.it along with tutorials.


1 Introduction

The field of statistical relational learning (SRL) is populated with a fairly large number of models and alternative representations, a state of affairs often referred to as the “SRL alphabet soup” Dietterich:2008:Structured-machine-learning: ; De-Raedt:2008:Towards-digesting-the-alphabet-soup . Even though there are many differences between these approaches, they typically extend a probabilistic representation (most often, a graphical model) with a logical or relational one :2008:Probabilistic-inductive-logic ; Getoor:2007:Introduction-to-statistical-relational . The resulting models then define a probability distribution over possible worlds, which are typically (Herbrand) interpretations assigning a truth value to every ground fact.

However, the machine learning literature contains, in addition to probabilistic graphical models, several other types of statistical learning methods. In particular, kernel-based learning and support vector machines are amongst the most popular and powerful machine learning systems in existence today. But this type of learning system has, with a few notable exceptions for relational prediction Landwehr:2010:Fast-learning-of-relational ; Taskar:2003:Max-margin-Markov-networks , not yet received much attention in the SRL literature. Furthermore, while it is by now commonly accepted that frameworks like Markov logic networks (MLNs) Richardson:2006:Markov-logic-networks , probabilistic relational models (PRMs) friedman99:learn , or probabilistic programming :2008:Probabilistic-inductive-logic ; Getoor:2007:Introduction-to-statistical-relational are general logical and relational languages that support a wide range of learning tasks, there exists today no such language for kernel-based learning. It is precisely this gap that we ultimately want to fill.

This paper introduces the kLog language and framework for kernel-based logical and relational learning. The key contributions of this framework are threefold: 1) kLog is a language that allows users to declaratively specify relational learning tasks in a way similar to statistical relational learning and inductive logic programming approaches, but it is based on kernel methods rather than on probabilistic modeling; 2) kLog compiles the relational domain and learning task into a graph-based representation using a technique called graphicalization; and 3) kLog uses a graph kernel to construct the feature space where the learning eventually takes place. This whole process is reminiscent of knowledge-based model construction in statistical relational learning. We now sketch these contributions in more detail and discuss the relationships with statistical relational learning.

One contribution of this paper is the introduction of the kLog language and framework for kernel-based logical and relational learning. kLog is embedded in Prolog (hence the name) and allows users to specify different types of logical and relational learning problems at a high level in a declarative way. kLog adopts, like many other logical and relational learning systems, the learning from interpretations framework De-Raedt:2008:Logical-and-relational-learning . This allows for naturally representing the entities (or objects) and the relationships amongst them. However, unlike typical statistical relational learning frameworks, kLog does not employ a probabilistic framework but is instead based on linear modeling in a kernel-defined feature space.

kLog constitutes a first step in the direction of a general kernel-based SRL approach. kLog generates a set of features starting from a logical and relational learning problem and uses these features for learning a (linear) statistical model. This is not unlike Markov logic, but there are two major differences. First, kLog is based on a linear statistical model instead of a Markov network. Second, the feature space is not immediately defined by the declared logical formulae but is constructed by a graph kernel, where nodes in the graph correspond to entities and relations, some given in the data and some (expressing background knowledge) defined declaratively in Prolog. For logical formulae of comparable complexity, graph kernels leverage a much richer feature space than MLNs. kLog essentially describes learning problems at three different levels. The first level specifies the logical and relational learning problem. At this level, the description consists of an E/R model describing the structure of the data and the data itself, which is similar to that of traditional SRL systems Heckerman:2007:SRL . The data at this level is then graphicalized, that is, the interpretations are transformed into graphs. This leads to the specification of a graph learning problem at the second level. Graphicalization is the equivalent of knowledge-based model construction: indeed, SRL systems such as PRMs and MLNs also (at least conceptually) produce graphs, although these graphs represent probabilistic graphical models. Finally, the graphs produced by kLog are turned into feature vectors using a graph kernel, which leads to a statistical learning problem at the third level. Again there is an analogy with systems such as Markov logic, as the Markov network that is generated in knowledge-based model construction also lists the features. As in these systems, kLog features are tied together: every occurrence of a feature is counted and captured by one and the same parameter in the final linear model.

It is important to realize that kLog is a very flexible architecture in which only the specification language of the first level is fixed; at this level, we employ an entity/relationship (E/R) model. The second level is then completely determined by the choice of the graph kernel. In the current implementation of kLog that we describe in this paper, we employ the neighborhood subgraph pairwise distance kernel (NSPDK) Costa::Fast-neighborhood-subgraph but the reader should keep in mind that other graph kernels can be incorporated. Similarly, for learning the linear model at level three we mainly experimented with variants of SVM learners but again it is important to realize that other learners can be plugged in. This situation is akin to that for other statistical relational learning representations, for which a multitude of different inference and learning algorithms has been devised (see also Section 8 for a discussion of the relationships between kLog and other SRL systems).

In the next section, we provide some more background on statistical modeling from a relational learning point of view. This also positions kLog more clearly in the context of related systems such as Markov logic, max-margin Markov networks Taskar:2003:Max-margin-Markov-networks , etc. In Section 3, we offer a complete example in a real-world domain in order to illustrate the main steps of kLog modeling. In Section 4, we formalize the semantics of the language and illustrate what types of learning problems can be formulated in the framework. Further examples are given in Section 5. The graphicalization approach and the graph kernel are detailed in Section 6. Some empirical evaluation is reported in Section 7 and, finally, the relationships to other SRL systems are discussed in Section 8.

2 Statistical Modeling and SRL

We use the following general approach to construct a statistical model for supervised learning. First, a feature vector φ(z) is associated with each interpretation z. A potential function based on the linear model, F(z) = w·φ(z), is then used to “score” the interpretation. Prediction (or inference) is the process of maximizing F with respect to the output part of the interpretation. Learning is the process of fitting w to the available data, typically using some statistically motivated loss function that measures the discrepancy between the prediction and the observed output on the i-th training instance. This setting is related to other kernel-based approaches to structured output learning (e.g., Tsochantaridis:2006:Large-margin-methods ), which may be developed without associating a probabilistic interpretation to F.

The above perspective covers a number of commonly used algorithms ranging from propositional to relational learning. To see this, consider first binary classification of categorical attribute-value data. In this case, models such as naive Bayes, logistic regression, and support vector machines (SVM) can all be constructed to share exactly the same feature space. Using indicator functions on attribute values as features φ(x), the three models use a hyperplane as their decision function:

    f(x) = sign( w·φ(x) + b )

where, for naive Bayes, the joint probability of x and y is proportional to exp( w_y·φ(x) + b_y ) with class-specific weights (so that w = w_{+1} − w_{−1} and b = b_{+1} − b_{−1}). The only difference between the three models is actually in the way w is fitted to the data: SVM optimizes a regularized functional based on the hinge loss, logistic regression maximizes the conditional likelihood of outputs given inputs (which can be seen as minimizing a smoothed version of the SVM hinge loss), and naive Bayes maximizes the joint likelihood of inputs and outputs. The last two models are often cited as an example of generative-discriminative conjugate pairs for the above reasons Ng:2002:On-discriminative-vs.-generative-classifiers: .
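For concreteness, the three training criteria can be written as follows (a standard textbook formulation with labels y_i ∈ {−1,+1}, not taken verbatim from the paper; λ is a regularization constant):

    SVM:                  min_{w,b}  λ‖w‖² + Σ_i max(0, 1 − y_i (w·φ(x_i) + b))
    Logistic regression:  min_{w,b}  Σ_i log(1 + exp(−y_i (w·φ(x_i) + b)))
    Naive Bayes:          max        Π_i P(x_i, y_i)    (joint likelihood, with closed-form count-based estimates)

All three act on exactly the same indicator feature vector φ(x); only the criterion used to fit w differs.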

When moving up to a slightly richer data type like sequences (perhaps the simplest case of relational data), the three models have well known extensions: naive Bayes extends to hidden Markov models (HMMs), logistic regression extends to conditional random fields (CRFs) Sutton:2007:An-Introduction-to-Conditional-Random , and SVM extends to structured output SVM for sequences Altun:2003:Hidden-markov-support ; Tsochantaridis:2006:Large-margin-methods . Note that when HMMs are used in the supervised learning setting (in applications such as part-of-speech tagging), the observations form the input sequence and the states form the output sequence (which is observed in the training data). In the simplest version of these three models, the feature vector φ contains a feature for every pair of states (transitions) and a feature for every state-observation pair (emissions). Again, these models all use the same feature space (see, e.g., Sutton:2007:An-Introduction-to-Conditional-Random for a detailed discussion).
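In this simplest version, the shared feature map over an observation sequence x and a state sequence y collects transition and emission counts (standard HMM/CRF sufficient statistics; the notation here is ours):

    φ_{s,s'}(x, y) = Σ_t 1{ y_{t−1} = s, y_t = s' }    (transitions)
    φ_{s,o}(x, y)  = Σ_t 1{ y_t = s, x_t = o }          (emissions)

HMMs, CRFs, and structured output SVMs for sequences all score a candidate state sequence with a linear function w·φ(x, y) over these counts.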

When moving up to arbitrary relations, the three above models can again be generalized, but many complications arise, and the large number of alternative models suggested in the literature has given rise to the SRL alphabet soup mentioned above. Among generative models, one natural extension of HMMs is stochastic context-free grammars Lari:1991:Applications-of-stochastic-context-free , which in turn can be extended to stochastic logic programs Muggleton:1996:Stochastic-logic-programs . More expressive systems include probabilistic relational models (PRMs) friedman99:learn and Markov logic networks (MLNs) Richardson:2006:Markov-logic-networks , when trained generatively. Generalizations of SVM for relational structures akin to context-free grammars have also been investigated Tsochantaridis:2006:Large-margin-methods . Among discriminative models, CRFs can be extended from linear chains to arbitrary relations Sutton:2007:An-Introduction-to-Conditional-Random , for example in the form of discriminative Markov networks Taskar:2002:Discriminative-probabilistic-models and discriminative Markov logic networks Richardson:2006:Markov-logic-networks . The use of SVM-like loss functions has also been explored in max-margin Markov networks (M3N) Taskar:2003:Max-margin-Markov-networks . These models can cope with relational data by adopting a richer feature space. kLog contributes to this perspective as it is a language for generating a set of features starting from a logical and relational learning problem and using these features for learning a (linear) statistical model.

Table 1 shows the relationships among some of these approaches. Methods on the same row use similar loss functions, while methods in the same column can be arranged to share the same feature space. In principle, kLog features can be used with any loss.

    Propositional         Sequences   General relations
    Naive Bayes           HMM         Generative MLN
    Logistic regression   CRF         Discriminative MLN
    SVM                   SVM-HMM     M3N

Table 1: Relationship among some statistical relational learning models.

3 A kLog Example

Before delving into the technical details of kLog, we illustrate the different steps on a real-life example using the UW-CSE dataset prepared by Domingos et al. for demonstrating the capabilities of MLNs Richardson:2006:Markov-logic-networks . Anonymized data was obtained from the University of Washington Department of Computer Science and Engineering. Basic entities include persons (students or professors), scientific papers, and academic courses. Available relations specify, e.g., whether a person was the author of a certain paper, or whether he/she participated in the teaching activities of a certain course. The learning task consists of predicting students’ advisors, i.e., predicting the binary relation advised_by between students and professors.

3.1 Data Format

Data comes in the form of true ground atoms, under the closed-world assumption. Since (first-order logic) functions are not allowed in the language, a ground atom is essentially like a tuple in a relational database, for example taught_by(course170,person211,winter_0102).

kLog learns from interpretations. This means that the data is given as a set of interpretations (or logical worlds) where each interpretation is a set of ground atoms which are true in that world. In this example there are five interpretations: ai, graphics, language, systems, and theory, corresponding to different research groups in the department. For instance, a fragment of the interpretation ai is shown in Listing 1.

  advised_by(person21,person211).
  advised_by(person45,person211).
  has_position(person211,faculty).
  has_position(person407,faculty).
  in_phase(person14,post_generals).
  in_phase(person21,post_generals).
  in_phase(person284,post_quals).
  in_phase(person45,post_generals).
  professor(person211).
  professor(person407).
  publication(title110,person14).
  publication(title110,person407).
  publication(title25,person21).
  publication(title25,person211).
  publication(title25,person284).
  publication(title25,person407).
  publication(title44,person211).
  publication(title44,person415).
  publication(title44,person45).
  student(person14).
  student(person21).
  student(person284).
  student(person45).
  ta(course12,person21,winter_0203).
  ta(course24,person21,spring_0203).
  ta(course24,person70,autumn_0304).
  taught_by(course12,person211,autumn_0001).
  taught_by(course143,person211,winter_0001).
  taught_by(course170,person211,winter_0102).
  taught_by(course24,person211,autumn_0102).
  taught_by(course24,person211,spring_0203).
  taught_by(course24,person407,spring_0304).
  taught_by(course82,person407,winter_0102).

Listing 1: Database for the UW-CSE domain

All ground atoms in a particular interpretation together form an instance of a relational database and the overall data set consists of several (disjoint) databases.

3.2 Data Modeling and Learning

The first step in kLog modeling is to describe the domain using a classic database tool: entity relationship diagrams. We begin by modeling two entity sets: student and professor, two unary relations: in_phase and has_position, and one binary relation: advised_by (which is the target in this example). The diagram is shown in Figure 1.

Figure 1: E/R diagram for the UW-CSE domain.

The kLog data model is written, in this case, using the fragment of embedded Prolog code of Listing 2.

  signature has_position(professor_id::professor, position::property)::extensional.
  signature advised_by(p1::student, p2::professor)::extensional.
  signature student(student_id::self)::extensional.
  signature professor(professor_id::self)::extensional.

Listing 2: Examples of signature declaration.

Every entity or relationship that kLog will later use to generate features (see feature generation below) is declared by using the special keyword signature. Signatures are similar to the declarative bias used in inductive logic programming systems. There are two kinds of signatures, annotated by the special reserved words extensional and intensional. In the extensional case, all ground atoms have to be listed explicitly in the data file; in the intensional case, ground atoms are defined implicitly using Prolog definite clauses. A signature has a name and a list of arguments with types. A type is either the name of an entity set (declared in some other signature) or the special type property used for numerical or categorical attributes. In the ground atoms, constants that are not properties are regarded as identifiers and are simply used to connect ground atoms (these constants disappear in the result of the graphicalization procedure explained below). The special type name self is used to denote the name of the signature being declared, as in the student and professor declarations of Listing 2. Thus, a signature containing an argument of type self introduces a new entity set, while other signatures introduce relationships.

One of the powerful features of kLog is its ability to introduce novel relations using a mechanism resembling deductive databases. Such relations are typically a means of injecting domain knowledge. In our example, it may be argued that the likelihood that a professor advises a student increases if the two persons have been engaged in some form of collaboration, such as co-authoring a paper, or working together in teaching activities. In Listing 3 we show two intensional signatures for this purpose. An intensional signature declaration must be complemented by a predicate (written in Prolog) which defines the new relation. When working on a given interpretation, kLog asserts all the true ground atoms in that interpretation in the Prolog database and collects all true groundings for the predicate associated with the intensional signature.

  signature on_same_course(s::student, p::professor)::intensional.
  on_same_course(S,P) :-
      professor(P), student(S),
      ta(Course,S,Term), taught_by(Course,P,Term).

  signature on_same_paper(s::student, p::professor)::intensional.
  on_same_paper(S,P) :-
      student(S), professor(P),
      publication(Pub,S), publication(Pub,P).

  signature n_common_papers(s::student, p::professor, n::property)::intensional.
  n_common_papers(S,P,N) :-
      student(S), professor(P),
      setof(Pub, (publication(Pub,S), publication(Pub,P)), CommonPapers),
      length(CommonPapers, N).

Listing 3: Intensional signatures and associated Prolog predicates.

Intensional signatures can also be effectively exploited to introduce aggregated attributes De-Raedt:2008:Logical-and-relational-learning . The last signature in Listing 3 shows for example how to count the number of papers a professor and a student have published together.
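As a further illustration, one could count in the same style the number of courses on which a student has served as a teaching assistant for a professor (a hypothetical signature, not part of the model above):

  % Hypothetical aggregated intensional signature in the same style as Listing 3.
  signature n_common_courses(s::student, p::professor, n::property)::intensional.
  n_common_courses(S,P,N) :-
      student(S), professor(P),
      setof(Course, Term^(ta(Course,S,Term), taught_by(Course,P,Term)), Courses),
      length(Courses, N).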

3.3 Graphicalization

This procedure maps a set of ground atoms into a bipartite undirected graph whose nodes are true ground atoms and whose edges connect an entity atom to a relationship atom if the identifier of the former appears as an argument in the latter. The graph resulting from the graphicalization of the ai interpretation is shown in Figure 2. (Interpretation predicates which exist in the data but do not have a corresponding signature, e.g., publication, do not produce nodes in the graph; however, these predicates may be conveniently exploited in the bodies of the intensional signatures, e.g., on_same_paper refers to publication.) It is from this graph that kLog will generate propositional features (based on a graph kernel) for use in the learning procedure. The details of the graphicalization procedure and the kernel are given in Sections 6.1 and 6.2, respectively.

Figure 2: Graphicalization on the tiny data set in the running example. Identifiers (such as person45) are only shown for the sake of clarity; kLog does not use them to generate features.

Now that we have specified the inputs to the learning system, we still need to determine the learning problem. This is declared in kLog by designating one (or more) signature(s) as target (in this domain, the target relation is advised_by). Several library predicates are designed for training, e.g., kfold performs a k-fold cross validation. These predicates accept a list of target signatures which specifies the learning problem.
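For instance, a cross-validation run might be invoked as follows (schematic: the text only states that kfold accepts a list of target signatures, so the remaining argument shown here, the number of folds, is an assumption rather than kLog's documented interface):

  % Hypothetical call: 10-fold cross-validation with advised_by as target.
  ?- kfold([advised_by], 10).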

4 The kLog language

A kLog program consists of:

  • a set of ground facts, embedded in a standard Prolog database, representing the data of the learning problem (see, e.g., Listing 1);

  • a set of signature declarations (e.g., Listing 2);

  • a set of Prolog predicates associated with intensional signatures (e.g., Listing 3);

  • a standard Prolog program which specifies the learning problem and makes calls to kLog library predicates.

4.1 Data modeling and signatures

In order to specify the semantics of the language, it is convenient to formalize the domain of the learning problem as a set of constants (objects) C and a finite set of relations R. Constants are partitioned into a set of entity identifiers (or identifiers for short) I and a set of property values P. Identifiers are themselves partitioned into entity-sets E_1, …, E_k. A ground atom is a relation symbol (or predicate name) r of arity n followed by an n-tuple of constant symbols, r(c_1, …, c_n). An interpretation (for the learning problem) is a finite set of ground atoms.

The signature for a relation r of arity n is an expression of the form

    signature r(name_1::t_1, …, name_n::t_n)::level.

where, for all i, t_i ∈ {E_1, …, E_k} ∪ {property} and name_i is the name of the i-th column of r. If column i does not have type property, then its declaration can optionally include a role field; if unspecified, the role defaults to the column name. The level of a signature is either intensional or extensional. In the extensional case, the atoms which contribute to every interpretation are those and only those listed in the data. In the intensional case, ground atoms are those which result from the Prolog intensional predicates using Prolog semantics (optionally with tabling). In this way, users may take advantage of many of the extensions of definite clause logic that are built into Prolog.

The ability to specify intensional predicates through clauses (see an example in Listing 3) is most useful for introducing background knowledge in the learning process and is common practice in inductive logic programming De-Raedt:2008:Logical-and-relational-learning . As explained in Section 6.2, features for the learning process are derived from a graph whose vertices are ground facts in the database; hence the ability to declare rules that specify relations directly translates into the ability to design and maintain features in a declarative fashion. This is a key characteristic of kLog and, in our opinion, one of the key reasons behind the success of related systems like Markov logic.

In order to ensure the well-definedness of the subsequent graphicalization procedure (see Section 6), we introduce two additional database assumptions. First, we require that the primary key of every relation consists of the columns whose type is an entity-set (i.e., purely of identifiers). This is perhaps the main difference between the kLog and the E/R data models. As will become clearer in the following, identifiers are just placeholders and are kept separate from property values so that learning algorithms will not rely directly on their values to build a decision function (to make an extreme case illustrating the concept, it makes no sense to predict whether a patient suffers from a certain disease based on his/her social security number).

The relational arity of a relation is the length of its primary key. As a special case, relations of zero relational arity are admitted; they must consist of at most a single ground atom in any interpretation (this kind of relation is useful to represent global properties of an interpretation, see Example 5.1). Second, for every entity-set E there must be a distinguished relation that has relational arity 1 and a key of type E. These distinguished relations are called E-relations and are used to introduce entity-sets, possibly with attached properties as satellite data. The remaining relations are called R-relations or relationships (the word “relationship” specifically refers to the association among entities, while “relation” refers to the more general association among entities and properties), which may also have properties. Thus, primary keys for R-relations are tuples of foreign keys.

4.2 Statistical setting

When learning from interpretations, we assume that interpretations are sampled identically and independently from a fixed and unknown distribution D. We denote by {z_j}_{j∈J} the resulting data set, where J is a given index set (e.g., the first m natural numbers) whose elements can be thought of as interpretation identifiers. Like in other statistical learning systems, the goal is to use a data sample to make some form of (direct or indirect) inference about this distribution. For the sake of simplicity, throughout this paper we will mainly focus on supervised learning.

In the case of supervised learning, it is customary to think of data as consisting of two separate portions: inputs (called predictors or independent variables in statistics) and outputs (called responses or dependent variables). In our framework, this distinction is reflected in the set of ground atoms in a given interpretation z. That is, z is partitioned into two sets: x (input ground atoms) and y (output ground atoms).

4.3 Supervised learning jobs

A supervised learning job in kLog is specified as a set of relations. We begin by defining the semantics of a job consisting of a single relation. Without loss of generality, let us assume that this relation t has signature

    t(id_1::E_1, …, id_m::E_m, p_1::property, …, p_k::property)    (1)

where E_j is an entity-set for j = 1, …, m. Conventionally, m = 0 if there are no identifiers and k = 0 if there are no properties. Recall that under our assumptions the primary key of t must consist of entity identifiers (the first m columns). Hence, m > 0 and k > 0 implies that t represents a function with domain E_1 × … × E_m and range the set of tuples of property values. If k = 0, then t can be seen as a function with a Boolean range.

Having specified a target relation t, kLog is able to infer the partition of ground atoms into inputs and outputs in the supervised learning setting. The output y consists of all ground atoms of t and all ground atoms of any intensional relation which depends on t. The partition is inferred by analyzing the dependency graph of the Prolog predicates defining intensional relations, using an algorithm reminiscent of the call graph computation in ViPReSS refactoring:tplp .

We assume that the training data is a set of complete interpretations (kLog could be extended to deal with missing data by removing the closed-world assumption and requiring some specification of the actual false groundings). During prediction, we are given a partial interpretation consisting of the input ground atoms x, and we are required to complete the interpretation by predicting the output ground atoms y. For the purpose of prediction accuracy estimation, we will only be interested in the ground atoms of the target relation (a subset of y).

Several situations may arise depending on the relational arity m and the number of properties k in the target relation t, as summarized in Table 2. When m = 0, the declared job consists of predicting one or more properties of an entire interpretation; when m = 1, one or more properties of certain entities; when m = 2, one or more properties of pairs of entities; and so on. When k = 0 (no properties) we have a binary classification task (where positive cases are ground atoms that belong to the complete interpretation). Multiclass classification can be properly declared by using k = 1 with a categorical property, which ensures mutual exclusiveness of classes. Regression is also declared using k = 1, but in this case the property should be numeric. Note that property types (numerical vs. categorical) are automatically inferred by kLog by inspecting the given training data. An interesting scenario occurs when k > 1, so that two or more properties are to be predicted at the same time. A similar situation occurs when the learning job consists of several target relations. kLog recognizes that such a declaration defines a multitask learning job. However, having recognized a multitask job does not necessarily mean that kLog will use a multitask learning algorithm capable of taking advantage of correlations between tasks (like, e.g., argyriou2008convex ). This is because, by design and in line with the principles of declarative languages, kLog separates “what” a learning job looks like from “how” it is solved by applying a particular learning algorithm. We believe that the separation of concerns at this level permits greater flexibility and extensibility and facilitates plugging in alternative learning algorithms (a.k.a. kLog models) that have the ability to provide a solution to a given job.

    # of properties    Relational arity m = 0                        m = 1                                  m = 2
    k = 0              Binary classification of interpretations      Binary classification of entities      Link prediction
    k = 1              Multiclass / regression on interpretations    Multiclass / regression on entities    Attributed link prediction
    k > 1              Multitask on interpretations                  Multitask predictions on entities      Multitask attributed link prediction

Table 2: Single relation jobs in kLog.
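To make the table concrete, here are some possible target signatures corresponding to three of the cells (the first is the UW-CSE target of Section 3; the other two use purely illustrative names, not taken from the paper's datasets):

  % m = 2, k = 0: link prediction
  signature advised_by(p1::student, p2::professor)::extensional.
  % m = 2, k = 1: attributed link prediction (hypothetical)
  signature rating(u::user, i::item, score::property)::extensional.
  % m = 1, k > 1: multitask prediction on entities (hypothetical)
  signature profile(p::person, age::property, income::property)::extensional.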

4.4 Implementation

kLog is currently embedded in Yap Prolog costa2012yap and consists of three main components: (1) a domain-specific interpreter, (2) a database loader, and (3) a library of predicates that are used to specify the learning task, to declare the graph kernel and the learning model, and to perform training, prediction, and performance evaluation. The domain-specific interpreter parses the signature declarations (see Section 4.1), possibly enriched with intensional and auxiliary predicates. The database loader reads in a file containing extensional ground facts and generates a graph for each interpretation, according to the procedure detailed in Section 6. The library includes common utilities for training and testing. Most of kLog is written in Prolog, except feature vector generation and the statistical learners, which are written in C++. kLog can interface with several solvers for parameter learning, including LibSVM libsvm:other and SVM stochastic gradient descent Bottou2010:proc .

5 Examples

In Section 3 we have given a detailed example of kLog in a link prediction problem. The kLog setting encompasses a larger ensemble of machine learning scenarios, as detailed in the following examples, which we order according to growing complexity of the underlying learning task.

Example 5.1 (Classification of independent interpretations)

This is the simplest supervised learning problem with structured input data and scalar (unstructured) output. For the sake of concreteness, let us consider the problem of small molecule classification as pioneered in the relational learning setting in  Srinivasan:1994:Mutagenesis:-ILP-experiments-in-a-non-determinate . This domain is naturally modeled in kLog as follows. Each molecule corresponds to one interpretation; there is one E-relation, atom, that may include properties such as element and charge; there is one relationship of relational arity 2, bond, that may include a bond_type property to distinguish among single, double, and resonant chemical bonds; there is finally a zero-arity relationship, active, distinguishing between positive and negative interpretations. A concrete example is given in Section 7.1.
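A possible data model for this domain, written in kLog signature syntax (a sketch with illustrative attribute names; a concrete version is given in Section 7.1):

  signature atom(atom_id::self, element::property, charge::property)::extensional.
  signature bond(a1::atom, a2::atom, bond_type::property)::extensional.
  % zero-arity target relation marking positive interpretations
  % (declaration shown schematically)
  signature active::extensional.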

Example 5.2 (Regression and multitask learning)

The above example about small molecules can be extended to the case of regression where the task is to predict a real-valued property associated with a molecule, such as its biological activity or its octanol/water partition coefficient (logP) Wang:1997:A-new-atom-additive-method-for-calculating . Many problems in quantitative structure-activity relationships (QSAR) are actually formulated in this way. The case of regression can be handled simply by introducing a target relation with signature activity(act::property).

If there are two or more properties to be predicted, one possibility is to declare several target relations, e.g., we might add logP(logp::property). Alternatively we may introduce a target relation such as:

target_properties(activity::property,logp::property).

Multitask learning can be handled trivially by learning independent predictors; alternatively, more sophisticated algorithms that take into account correlations amongst tasks (such as Evgeniou:2006:Learning-multiple-tasks ) could be used.

Example 5.3 (Entity classification)

A more complex scenario is the collective classification of several entities within a given interpretation. We illustrate this case using the classic WebKB domain Craven:1998:Learning-to-Extract-Symbolic . The data set consists of Web pages from four Computer Science departments and thus there are four interpretations: cornell, texas, washington, and wisconsin. In this domain there are two E-relations: page (for webpages) and link (for hypertextual links). Text in each page is represented as a bag-of-words (using the R-relation has) and hyperlinks are modeled by the R-relations link_to and link_from. Text associated with hyperlink anchors is represented by the R-relation has_anchor.

The goal is to classify each Web page. There are different data modeling alternatives for setting up this classification task. One possibility is to introduce several unary R-relations associated with the different classes, such as course, faculty, project, and student. The second possibility is to add a property called category to the entity-set page, taking values over the different possible categories. It may seem that in the latter case we are just reifying the R-relations describing categories. However, there is an additional subtle but important difference: in the first modeling approach it is perfectly legal to have an interpretation where a page belongs simultaneously to different categories. This becomes illegal in the second approach, since otherwise there would be two or more atoms of the E-relation page with the same identifier.
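The two alternatives might be declared as follows (a sketch; names are illustrative):

  % Alternative 1: one unary R-relation per class
  signature course(p::page)::extensional.
  signature faculty(p::page)::extensional.
  signature project(p::page)::extensional.
  signature student(p::page)::extensional.

  % Alternative 2: a category property attached to the page entity
  signature page(page_id::self, category::property)::extensional.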

From a statistical point of view, since pages for the same department are part of the same interpretation and are connected by hyperlinks, the corresponding category labels are interdependent random variables, and we formally have an instance of a supervised structured output problem BakIr:2007:Predicting-stru , which in this case might also be referred to as collective classification Taskar:2002:Discriminative-probabilistic-models . There are, however, studies in the literature that consider pages to be independent (e.g., Joachims:2002:Learning-to-classify-text ).

Example 5.4 (One-interpretation domains)

We illustrate this case on the Internet Movie Database (IMDb) data set. Following the setup in Neville:2003:Collective-classification-with , the problem is to predict “blockbuster” movies, i.e., movies that will earn more than $2 million in their opening weekend. The entity-sets in this domain are movie, studio, and individual. Relationships include acted_in(actor::individual, m::movie), produced(s::studio, m::movie), and directed(director::individual, m::movie). The target unary relation blockbuster(m::movie) collects positive cases. Training in this case uses a partial interpretation (typically movies produced before a given year). When predicting the class of future movies, data about past movies’ receipts can be used to construct features (indeed, the count of blockbuster movies produced by the same studio is one of the most informative features Frasconi:2008:Feature-discovery-with ).
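A corresponding data model sketch, assembled from the relations named above (the exact declarations used in experiments may differ):

  signature movie(movie_id::self)::extensional.
  signature studio(studio_id::self)::extensional.
  signature individual(ind_id::self)::extensional.
  signature acted_in(actor::individual, m::movie)::extensional.
  signature produced(s::studio, m::movie)::extensional.
  signature directed(director::individual, m::movie)::extensional.
  signature blockbuster(m::movie)::extensional.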

A similar scenario occurs for protein function prediction. Assuming data for just a single organism is available, there is one entity set (protein) and a binary relation interact expressing protein-protein interaction Vazquez:2003:Global-protein-function ; Lanckriet:2004:Kernel-based-data-fusion .

6 Graphicalization and feature generation

The goal is to map an interpretation z into a feature vector φ(z). This enables the application of several supervised learning algorithms that construct linear functions in the feature space. In this context, φ can either be computed explicitly or defined implicitly via a kernel function K(z, z') = ⟨φ(z), φ(z')⟩. Kernel-based solutions are very popular, sometimes allow faster computation, and allow infinite-dimensional feature spaces. On the other hand, explicit feature map construction may offer advantages in our setting, in particular when dealing with large scale learning problems (many interpretations) and structured output tasks (exponentially many possible predictions). Our framework is based on two steps: first an interpretation z is mapped into an undirected labeled graph G_z; then a feature vector φ(G_z) is extracted from G_z. Alternatively, a kernel function K(G_z, G_{z'}) on pairs of graphs can be computed. The corresponding potential function is then defined directly as F(z) = w·φ(G_z) or as a kernel expansion F(z) = Σ_j c_j K(G_{z_j}, G_z).

The use of an intermediate graphicalized representation is novel in the context of propositionalization, a well-known technique in logical and relational learning De-Raedt:2008:Logical-and-relational-learning that transforms a relational representation into a propositional one. The motivation for propositionalization is to transform a rich and structured (relational) representation into a simpler and flat one, in order to then apply a learning algorithm that works with the simple representation. Current propositionalization techniques typically transform graph-based or relational data directly into an attribute-value learning format, or possibly into a multi-instance learning one (in multi-instance learning Dietterich1997 , the examples are sets of attribute-value tuples or sets of feature vectors), but not into a graph-based one. This typically results in a loss of information, cf. De-Raedt:2008:Logical-and-relational-learning . kLog's graphicalization approach first transforms the relational data into an equivalent graph-based format, without loss of information. After graphicalization, kLog uses results on kernels for graph-based data to derive an explicit high-dimensional feature-based representation. Thus kLog directly upgrades these graph-based kernels to a fully relational representation. Furthermore, there is an extensive literature on graph kernels, and virtually all existing solutions can be plugged into the learning from interpretations setting with minimal effort. This includes implementation issues but also the ability to reuse existing theoretical analyses. Finally, it is notationally simpler to describe a kernel and feature vectors defined on graphs than to describe the equivalent counterpart using Datalog notation.

The graph kernel choice implicitly determines how predicates’ attributes are combined into features.

6.1 Graphicalization procedure

Given an interpretation z, we construct a bipartite graph G_z = (V, E) as follows (see Appendix B for notational conventions and Figure 2 for an example).

Vertices:

there is a vertex in V_E for every ground atom of every E-relation, and there is a vertex in V_R for every ground atom of every R-relation (so V = V_E ∪ V_R). Vertices are labeled by the predicate name of the ground atom, followed by the list of property values. Identifiers in a ground atom do not appear in the labels, but they uniquely identify vertices. We write ids(v) for the tuple of identifiers in the ground atom mapped into vertex v.

Edges:

there is an edge {u, v} ∈ E if and only if u ∈ V_E, v ∈ V_R, and the identifier of the entity atom mapped into u occurs in ids(v). The edge is labeled by the role under which the identifier of u appears in v (see Section 4.1).
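A minimal sketch of this edge construction (illustrative only, not kLog's implementation; entity_vertex/2 and relationship_vertex/2 are assumed helper representations of the ground atoms):

  % An E-relation vertex is connected to an R-relation vertex whenever the
  % entity identifier of the former occurs among the identifiers of the latter.
  edge(U, V) :-
      entity_vertex(U, EntityId),
      relationship_vertex(V, EntityIds),
      member(EntityId, EntityIds).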

Note that, because of our data modeling assumptions (see Section 4.1), the degree of every vertex in V_R equals the relational arity of the corresponding R-relation. The degree of vertices in V_E is instead unbounded and may grow large in some domains (e.g., social networks or World-Wide-Web networks). The graphicalization process can be nicely interpreted as the unfolding of an E/R diagram over the data, i.e., the E/R diagram is a template that is expanded according to the given ground atoms (see Figure 2). There are several other examples in the literature where a graph template is expanded into a ground graph, including the plate notation in graphical models Koller:2009:Probabilistic-graphical-models: , encoding networks in neural networks for learning data structures Frasconi:1998:A-general-framework-for-adaptive , and the construction of Markov networks in Markov logic Richardson:2006:Markov-logic-networks . The semantics of kLog graphs for the learning procedure is however quite different and intimately related to the concept of graph kernels, as detailed in the following section.

6.2 Graph Kernel

Learning in kLog is performed using a suitable graph kernel on the graphicalized interpretations. While in principle any graph kernel can be employed, there are several requirements that the chosen kernel has to meet in practice. On the one hand, the kernel has to allow fast computation, especially with respect to the graph size, as the grounding phase in the graphicalization procedure can yield very large graphs. On the other hand, we need a general purpose kernel with a flexible bias, so as to adapt to a wide variety of application scenarios.

In the current implementation, we use an extension of NSPDK, a recently introduced fast graph kernel Costa::Fast-neighborhood-subgraph . While the original kernel is suitable for sparse graphs with discrete vertex and edge labels, here we propose an extension to deal with a larger class of graphs whose labels are tuples of mixed discrete and numerical types. In the following sections, we introduce the notation and give a formal definition of the original as well as the enhanced graph kernel.

6.2.1 Kernel Definition and Notation

The NSPDK is an instance of a decomposition kernel, where “parts” are pairs of subgraphs (for more details on decomposition kernels see Appendix C). For a given graph G = (V, E), a vertex v ∈ V, and an integer r ≥ 0, let N_r^v(G) denote the subgraph of G rooted in v and induced by the set of vertices

    V_r^v = { u ∈ V : d(u, v) ≤ r }    (2)

where d(u, v) is the shortest-path distance between u and v (conventionally, d(u, v) = ∞ if no path exists between u and v). A neighborhood is therefore a topological ball with center v. Let us also introduce the following neighborhood-pair relation:

    R_{r,d}(A_u, B_v, G)   ⇔   A_u ≅ N_r^u(G),  B_v ≅ N_r^v(G),  and  d(u, v) = d    (3)

that is, the relation R_{r,d} identifies pairs of neighborhoods of radius r whose roots are exactly at distance d. We define κ_{r,d} over graph pairs as the decomposition kernel on the relation R_{r,d}, that is:

    κ_{r,d}(G, G') = Σ_{(A_u,B_v) ∈ R_{r,d}^{-1}(G)}  Σ_{(A'_{u'},B'_{v'}) ∈ R_{r,d}^{-1}(G')}  κ( (A_u,B_v), (A'_{u'},B'_{v'}) )    (4)

where R_{r,d}^{-1}(G) indicates the multiset of all pairs of neighborhoods of radius r with roots at distance d that exist in G.

We can now obtain a flexible parametric family of kernel functions by specializing the kernel κ on parts. The general structure of κ is:

    κ( (A_u,B_v), (A'_{u'},B'_{v'}) ) = κ_root( (A_u,B_v), (A'_{u'},B'_{v'}) ) · κ_subgraph( (A_u,B_v), (A'_{u'},B'_{v'}) )    (5)

In the following, we assume

    κ_root( (A_u,B_v), (A'_{u'},B'_{v'}) ) = 1{ ℓ(u) = ℓ(u') } · 1{ ℓ(v) = ℓ(v') }    (6)

where 1{·} denotes the indicator function, u is the root of A_u, and ℓ(u) is the label of vertex u. The role of κ_root is to ensure that only neighborhoods centered on the same type of vertex will be compared. Assuming a valid kernel for κ_subgraph (in the following sections we give details on concrete instantiations), we can finally define the NSPDK as:

    K(G, G') = Σ_r Σ_d κ_{r,d}(G, G')    (7)

For efficiency reasons we consider the zero-extension of K obtained by imposing an upper bound on the radius and the distance parameters, K_{r*,d*}(G, G') = Σ_{r=0}^{r*} Σ_{d=0}^{d*} κ_{r,d}(G, G'), that is, we limit the sum of the kernels to all values of the radius (distance) parameter up to a given maximum r* (d*). Furthermore, we consider a normalized version of κ_{r,d}, namely κ̂_{r,d}(G, G') = κ_{r,d}(G, G') / sqrt( κ_{r,d}(G, G) κ_{r,d}(G', G') ), to ensure that relations induced by all values of radii and distances are equally weighted regardless of the size of the induced part sets.

Finally, it is easy to show that the NSPDK is a valid kernel as: 1) it is built as a decomposition kernel over the countable space of all pairs of neighborhood subgraphs of graphs of finite size; 2) the kernel over parts is a valid kernel; 3) the zero-extension to bounded values for the radius and distance parameters preserves the kernel property; and 4) so does the normalization step.

Figure 3: Illustration of NSPDK features. Top: pairs of neighborhood subgraphs at fixed distance 6 for radii 0, 1, 2, and 3. Bottom: pairs of neighborhood subgraphs of fixed radius 1 at distances 0, 1, 2, and 3.

6.3 Subgraph Kernels

The role of κ_subgraph is to compare pairs of neighborhood graphs extracted from the two graphs. The application of the graphicalization procedure to diverse relational datasets can potentially induce graphs with significantly different characteristics. In some cases (discrete property domains) an exact matching between neighborhood graphs is appropriate; in other cases (continuous property domains) it is more appropriate to use a soft notion of matching.

In the following sections, we introduce variants of κ_subgraph to be used when the atoms in the relational dataset have at most a single discrete or continuous property, or when more general tuples of properties are allowed.

6.4 Exact Graph Matching

An important case is when the atoms that are mapped by the graphicalization procedure to the vertex set of the resulting graph have at most a single discrete property. In this case, an atom becomes a vertex v whose label ℓ(v) is obtained by concatenating the signature name and the attribute value, and κ_subgraph has the following form:

    κ_subgraph( (A_u,B_v), (A'_{u'},B'_{v'}) ) = 1{ A_u ≅ A'_{u'} } · 1{ B_v ≅ B'_{v'} }    (8)

where 1{·} denotes the indicator function and ≅ isomorphism between graphs. Note that this is a valid kernel between graphs under the feature map that transforms a graph into a sequence of all zeros except for the element (equal to 1) corresponding to the identifier of the graph's canonical representation McKay81 ; Yan02 .

6.4.1 Graph Invariant

Evaluating the kernel in Equation 8 requires graph isomorphism as a subroutine, a problem for which it is unknown whether polynomial-time algorithms exist. Algorithms that are exponential in the worst case but fast in practice do exist McKay81 ; Yan02 . For special graph classes, such as bounded-degree graphs Luks82 , there exist polynomial-time algorithms. However, since it is hard to limit the type of graph produced by the graphicalization procedure (e.g., cases with very high vertex degree are possible, as in general an entity atom may play a role in an arbitrary number of relationship atoms), we prefer an approximate solution with efficiency guarantees, based on topological distances and similar in spirit to sorlin08 .

The key idea is to compute an integer pseudo-identifier for each graph such that isomorphic graphs are guaranteed to bear the same number (i.e., the function is graph invariant), but non-isomorphic graphs are likely to bear a different number. A trivial identity test between the pseudo-identifiers then approximates the isomorphism test. The reader less interested in the technical details of the kernel may want to skip the remainder of this subsection.

We obtain the pseudo-identifier by first constructing a graph-invariant encoding for the rooted neighborhood graph and then applying a hash function to the encoding. Note that we cannot hope to exhibit an efficient certificate for isomorphism in this way, and in general there can be collisions between two non-isomorphic graphs, either because they are assigned the same encoding or because the hashing procedure introduces a collision even when the encodings are different.

Hashing is not a novel idea in machine learning; it is commonly used, e.g., for creating compact representations of molecular structures ralaivola2005graph , and has been advocated as a tool for compressing very high-dimensional feature spaces shi2009hash . In the present context, hashing is mainly motivated by the computational efficiency gained by approximating the isomorphism test.

The graph encoding that we propose is best described by introducing two new label functions, one for the vertices and one for the edges, denoted L_V and L_E respectively. L_V assigns to each vertex v a lexicographically sorted sequence of pairs composed of a topological distance and a vertex label, that is, it returns the sorted list of pairs ⟨d(v, u), ℓ(u)⟩ over all vertices u of the neighborhood graph. Moreover, since the neighborhood graph is rooted, we can use the knowledge about the identity of the root vertex and prepend to the returned list the additional information of the distance from the root node. The new edge label is produced by composing the new vertex labels with the original edge label, that is, L_E assigns to the edge {u, v} the triplet ⟨L_V(u), L_V(v), ℓ({u, v})⟩. Finally, the encoding assigns to the rooted graph the lexicographically sorted list of L_E({u, v}) over all edges {u, v}. In words: we relabel each vertex with a sequence that encodes the vertex distance from all other (labeled) vertices (plus the distance from the root vertex); the graph encoding is obtained as the sorted edge list, where each edge is annotated with the endpoints' new labels. For a proof that this encoding is graph invariant, see (degrave2011:phd, p. 53).
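A minimal sketch of the vertex relabeling step (illustrative only; dist/3 and vertex_label/2 are assumed auxiliary predicates giving shortest-path distances and original labels within the neighborhood graph):

  % New label of vertex V inside a neighborhood rooted at Root:
  % the distance from the root, followed by the sorted list of
  % <distance, original label> pairs over all vertices of the neighborhood.
  new_vertex_label(V, Root, Vertices, label(DRoot, SortedPairs)) :-
      dist(V, Root, DRoot),
      findall(D-L,
              ( member(U, Vertices), dist(V, U, D), vertex_label(U, L) ),
              Pairs),
      msort(Pairs, SortedPairs).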

We finally resort to a hashing function for variable-length data, based on the Merkle-Damgård construction, to map the various lists to integers; that is, we map the distance-label pairs, the new vertex labels, the new edge labels, and the new edge sequences to integers (in this order). Note that it is trivial to control the size of the feature space by choosing the codomain size of the hash (or, alternatively, the bit size of the returned hashed values); naturally, there is a tradeoff between the size of the feature space and the number of hash collisions.

Figure 4: Graph invariant computation: the graph hash is computed as the hash of the sorted list of edge hashes. An edge hash is computed as the hash of the sequence of the two endpoints hashes and the edge label. The endpoint hash is computed as the hash of the sorted list of distance-vertex label pairs.

6.5 Soft Matches

The idea of counting exact neighborhood subgraph matches to express graph similarity is adequate when the graphs are sparse (that is, when the edge and vertex set sizes are of the same order) and when the maximum vertex degree is low. However, when the graph is not sparse or some vertices exhibit large degrees, the likelihood that two neighborhoods match exactly quickly approaches zero, yielding a diagonally dominant kernel prone to overfitting (a concrete example is when the text associated with a document is modeled explicitly, i.e., when word entities are linked to a document entity: in this case the degree corresponds to the document vocabulary size). In these cases a better solution is to relax the all-or-nothing type of match and allow for a partial or soft match between subgraphs. Although there exist several graph kernels that allow this type of match, they generally suffer from very high computational costs Vishwanathan2010:jrnl . To ensure efficiency, we use an idea introduced in the Weighted Decomposition Kernel Menchetti05:proc : given a subgraph, we consider only the multinomial distribution (i.e., the histogram) of the labels, discarding all structural information. In the soft match kernel, the comparison between two pairs of neighborhood subgraphs is replaced by (note that the pair of neighborhood subgraphs is considered jointly, i.e., the label multisets are extracted independently from each subgraph in the pair and then combined together):

    κ_subgraph( (A_u,B_v), (A'_{u'},B'_{v'}) ) = Σ_{w ∈ V(A_u) ∪ V(B_v)}  Σ_{w' ∈ V(A'_{u'}) ∪ V(B'_{v'})}  1{ ℓ(w) = ℓ(w') }    (9)

where V(A_u) denotes the set of vertices of A_u. In words, for each pair of close neighborhoods, we build a histogram counting the vertices with the same label in either of the neighborhood subgraphs. The kernel is then computed as the dot product of the corresponding histograms.
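As a small worked example (labels chosen arbitrarily): if the vertices of one pair of neighborhoods carry the label multiset {a, a, b} and those of the other pair carry {a, b, b}, the corresponding histograms over the alphabet (a, b) are (2, 1) and (1, 2), so the soft match contribution is

    ⟨(2, 1), (1, 2)⟩ = 2·1 + 1·2 = 4 ,

which equals the number of label-matching vertex pairs counted by the double sum in Equation 9.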

Figure 5: Illustration of the soft matching kernel. Only features generated by a selected pair of vertices are represented: vertices A and B at distance 1 yield a multinomial distribution of the vertex labels in neighborhoods of radius 1. On the right we compute the contribution to the kernel value by the represented features.

6.6 Tuples of properties

A standard assumption in graph kernels is that vertex and edge labels are elements of a discrete domain. However, in kLog the information associated with vertices is a tuple that can contain both discrete and real values. Here we extend the NSPDK model to allow both a hard and a soft match type over graphs with properties that can be tuples of discrete, real, or a mix of both types. The key idea is to use the canonical vertex identifier (introduced in Section 6.4.1) to further characterize each property: in this way the kernel is defined only between tuples of vertices that correspond under isomorphism.

The general structure of the kernel on the subgraph can be written as:

    κ_subgraph( (A_u,B_v), (A'_{u'},B'_{v'}) ) = Σ_{w ∈ V(A_u) ∪ V(B_v)}  Σ_{w' ∈ V(A'_{u'}) ∪ V(B'_{v'})}  1{ σ(w) = σ(w') } · κ_tuple(w, w')    (10)

where, for an atom mapped into vertex w, σ(w) returns the signature name (we remind the reader that in the graphicalization procedure we remove the primary and foreign keys from each atom, hence the only information available at the graph level is the signature name and the property values). The kernel is thus defined over sets of vertices (atoms) and decomposes into a part that ensures matches between atoms with the same signature name and a second part, κ_tuple, that takes into account the tuple of property values. In particular, depending on the type of property values and the type of matching required, we obtain the following six cases.

Soft match for discrete tuples. When the tuples contain only discrete elements and one chooses to ignore the structure in the neighborhood graphs, each property is treated independently. Formally:

    κ_tuple(w, w') = Σ_t 1{ p_t(w) = p_t(w') }    (11)

where, for an atom mapped into vertex w, p_t(w) returns the value of its t-th property.

Hard match for discrete tuples. When the tuples contain only discrete elements and one chooses to consider the structure, each property is encoded taking into account the identity of the vertex in the neighborhood graph, and all properties are required to match jointly. In practice, this is equivalent to the hard match detailed in Section 6.4, where the property value is replaced with the concatenation of all property values in the tuple. Formally, we replace the label σ in Equation 10 with the labeling procedure L_V detailed in Section 6.4.1. In this way, each vertex receives a canonical label that uniquely identifies it in the neighborhood graphs. The double summation in Equation 10 then performs the selection of the corresponding vertices w and w' in the two pairs of neighborhoods being compared. Finally, we consider all the elements of the tuple jointly in order to identify a successful match:

    κ_tuple(w, w') = Π_t 1{ p_t(w) = p_t(w') }    (12)

Soft match for real tuples. To upgrade the soft match kernel to tuples of real values, we replace the exact match with the standard product (this is equivalent to collecting all numerical properties of a vertex's tuple in a vector and then employing the standard dot product between vectors). The kernel on the tuple then becomes:

    κ_tuple(w, w') = Σ_t p_t(w) p_t(w')    (13)

Hard match for real tuples. We proceed in a fashion analogous to the hard match for discrete tuples, that is, we replace the label σ in Equation 10 with the labeling procedure L_V. In this case, however, we combine the real-valued tuples of corresponding vertices with the standard product as in Equation 13:

    κ_tuple(w, w') = Σ_t p_t(w) p_t(w')    (14)

Soft match for mixed discrete and real tuples. When dealing with tuples of mixed discrete and real values, the contributions of the kernels on the separate collections of discrete and real attributes are combined via summation:

(15)

where the two summation indices run exclusively over the discrete and the continuous properties, respectively.

Hard match for mixed discrete and real tuples. In an analogous fashion, provided that the vertex label in Equation 10 is replaced with the canonical labeling procedure detailed in Section 6.4.1, we have:

(16)

In this way, each vertex receives a canonical label that uniquely identifies it in the neighborhood graph. The discrete labels of corresponding vertices are concatenated and matched for identity, while the real tuples of corresponding vertices are combined via the standard dot product.
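To make the six cases concrete, here is a minimal Python sketch (our own illustration, not kLog code; the authoritative definitions are Equations 10 to 16). It assumes, in particular, that the soft match on discrete tuples reduces to counting per-position agreements between property values.

def soft_match_discrete(t1, t2):
    # Soft match, discrete tuples: each property is treated independently,
    # so the kernel is a per-position count of matching values.
    return sum(1.0 for a, b in zip(t1, t2) if a == b)

def hard_match_discrete(t1, t2):
    # Hard match, discrete tuples: all properties must match jointly,
    # as if the whole tuple were a single concatenated label.
    return 1.0 if tuple(t1) == tuple(t2) else 0.0

def match_real(t1, t2):
    # Real-valued tuples are compared with the standard dot product
    # (used by both the soft and the hard match at the tuple level).
    return sum(a * b for a, b in zip(t1, t2))

def match_mixed(disc1, real1, disc2, real2, hard=False):
    # Mixed tuples: the discrete and real contributions are summed.
    k_disc = hard_match_discrete(disc1, disc2) if hard else soft_match_discrete(disc1, disc2)
    return k_disc + match_real(real1, real2)

# Two atoms of the same signature with discrete properties (element, hybridization)
# and one numeric property (a partial charge).
print(match_mixed(("c", "sp2"), (0.12,), ("c", "sp3"), (0.10,)))   # 1.012

What the sketch cannot show is the role of the canonical vertex identifiers: under the hard match, only tuples of vertices that correspond under isomorphism are compared (Section 6.4.1).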

6.6.1 Domain Knowledge Bias via Kernel Points

At times it is convenient, for efficiency reasons or to inject domain knowledge into the kernel, to be able to explicitly select the neighborhood subgraphs. We provide a way to do so, declaratively, by introducing the set of kernel points, a subset of the vertices that includes all vertices associated with ground atoms of specially marked signatures. We then redefine the decomposition relation used in Equation 4, as in Section 6.2.1, but with the additional constraint that the roots of the two neighborhood subgraphs be kernel points.

Kernel points are typically vertices that are believed to represent information of high importance for the task at hand. Vertices that are not kernel points contribute to the kernel computation only when they occur in the neighborhoods of kernel points. In kLog, kernel points are declared as a list of domain relations: all vertices that correspond to ground atoms of these relations become kernel points.

6.6.2 Viewpoints and i.i.d. view

The above approach effectively defines a kernel over interpretations: the kernel between two interpretations is the graph kernel computed on their graphicalizations. For learning jobs such as classification or regression on interpretations (see Table 2), this kernel is directly usable in conjunction with plain kernel machines such as SVMs. When moving to more complex jobs involving, e.g., classification of entities or tuples of entities, the kernel induces a feature vector over the interpretation together with the groundings of the output predicates, suitable for the application of a structured-output technique. Alternatively, we may convert the structured-output problem into a set of i.i.d. subproblems as follows. For simplicity, assume the learning job consists of a single target relation. We call each ground atom of this relation a case. Intuitively, cases correspond to training targets or prediction-time queries in supervised learning. Usually an interpretation contains several cases corresponding to specific entities such as individual Web pages (as in Section 5.3) or movies (as in Section 5.4), or tuples of entities for link prediction problems (as in Section 3). Given a case, its viewpoint is the set of vertices that are adjacent to it in the graph.

Figure 6: A sample viewpoint in the UW-CSE domain, and the corresponding mutilated graph, where removed vertices are grayed out and the target case is highlighted.

In order to define the kernel, we first consider the mutilated graph in which all case vertices, except the case under consideration, are removed (see Figure 6 for an illustration). We then define a kernel on mutilated graphs, following the same approach as the NSPDK, but with the additional constraint on the decomposition relation that the first endpoint of each pair of neighborhood subgraphs must lie in the viewpoint of the case. We obtain in this way a kernel "centered" around the case, and finally we let the kernel between two cases be this centered kernel computed on the corresponding mutilated graphs. The resulting kernel corresponds to a potential that decomposes into a sum of per-case contributions and is therefore maximized by independently maximizing each sub-potential with respect to its own case.
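As a sketch of this i.i.d. reduction (our code, with graphs stored as plain adjacency dictionaries; the helper names are ours), each case keeps its viewpoint and loses the other case vertices before being handed to a graph kernel:

def viewpoint(graph, case):
    # Viewpoint of a case: the vertices adjacent to it.
    return set(graph[case])

def mutilate(graph, cases, case):
    # Remove every case vertex except the one being predicted,
    # together with the edges incident to the removed vertices.
    removed = set(cases) - {case}
    return {v: {u for u in nbrs if u not in removed}
            for v, nbrs in graph.items() if v not in removed}

# Toy graphicalized interpretation (UW-CSE flavor): relationship atoms
# connected to the entity atoms whose identifiers they mention.
graph = {
    "advised_by(p1,s1)": {"person(p1)", "person(s1)"},
    "advised_by(p1,s2)": {"person(p1)", "person(s2)"},
    "person(p1)": {"advised_by(p1,s1)", "advised_by(p1,s2)"},
    "person(s1)": {"advised_by(p1,s1)"},
    "person(s2)": {"advised_by(p1,s2)"},
}
cases = ["advised_by(p1,s1)", "advised_by(p1,s2)"]
g1 = mutilate(graph, cases, cases[0])
vp = viewpoint(graph, cases[0])

Each mutilated graph then becomes one i.i.d. training or test example, with the truth value of its case as the label.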

By following this approach, we do not obtain a collective prediction (individual ground atoms are predicted independently). Still, even in this reduced setting, the kLog framework can be exploited in conjunction with meta-learning approaches that surrogate collective prediction. For example, Prolog predicates in intensional signatures can effectively be used as expressive relational templates for stacked graphical models kou2007stacked where input features for one instance are computed from predictions on other related instances. Results in Section 7.2 for the “partial information” setting are obtained using a special form of stacking.

6.7 Parameterization in kLog

The kernels presented in this section, together with the graphicalization procedure, yield a statistical model working in a potentially very high-dimensional feature space. Although large-margin learners offer some robustness against high-dimensional representations, it is still up to the user to choose appropriate kernel parameters (maximum radius and distance) to avoid overfitting. It should be noted that subgraphs in the NSPDK effectively act like templates which are matched against the graphicalized interpretations of new data. Since identifiers themselves do not appear in the result of the graphicalization procedure, the same template can match multiple times, realizing a form of parameter tying as in other statistical relational learning methods.

7 kLog in practice

In this section, we illustrate the use of kLog in a number of application domains. All experimental results reported in this section were obtained using LibSVM in binary classification, multiclass, or regression mode, as appropriate.

7.1 Predicting a single property of one interpretation

We now expand the ideas outlined in Example 5.1. Predicting the biological activity of small molecules is a major task in chemoinformatics and can help drug development (Waterbeemd2003) and toxicology (Helma2003; Helma2005:book). Most existing graph kernels have been tested on data sets of small molecules (see, e.g., horvath04 ; ralaivola2005graph ; mahe2005gkm ; vishwanathan2010graph ; ceroni2007classification ). From the kLog perspective the data consists of several interpretations, one for each molecule. In the case of binary classification (e.g., active vs. nonactive), there is a single target predicate whose truth state corresponds to the class of the molecule. To evaluate kLog we used two data sets. The Bursi data set Kazius:2005:Derivation-and-validation-of-toxicophores consists of 4,337 molecular structures with associated mutagenicity labels (2,401 mutagens and 1,936 nonmutagens) obtained from a short-term in vitro assay that detects genetic damage. The Biodegradability data set BlockeelEtAl:04 contains 328 compounds and the regression task is to predict their half-life for aerobic aqueous biodegradation starting from molecular structure and global molecular measurements.

Figure 7: Results (AUROC, 10-fold cross-validation) on the Bursi data set with (left) and without (right) functional groups. The axis labels indicate kernel and SVM hyperparameters (from top to bottom: maximum radius, maximum distance, and regularization parameter).

begin_domain.

signature atm(atom_id::self, element::property)::intensional.
atm(Atom, Element) :- a(Atom,Element), \+(Element=h).

signature bnd(atom_1@b::atm, atom_2@b::atm, type::property)::intensional.
bnd(Atom1,Atom2,Type) :-
    b(Atom1,Atom2,NType), describeBondType(NType,Type),
    atm(Atom1,_), atm(Atom2,_).

signature fgroup(fgroup_id::self, group_type::property)::intensional.
fgroup(Fg,Type) :- sub(Fg,Type,_).

signature fgmember(fg::fgroup, atom::atm)::intensional.
fgmember(Fg,Atom) :- subat(Fg,Atom,_), atm(Atom,_).

signature fg_fused(fg1@nil::fgroup, fg2@nil::fgroup, nrAtoms::property)::intensional.
fg_fused(Fg1,Fg2,NrAtoms) :- fused(Fg1,Fg2,AtomList), length(AtomList,NrAtoms).

signature fg_connected(fg1@nil::fgroup, fg2@nil::fgroup, bondType::property)::intensional.
fg_connected(Fg1,Fg2,BondType) :-
    connected(Fg1,Fg2,Type,_AtomList), describeBondType(Type,BondType).

signature fg_linked(fg::fgroup, alichain::fgroup, saturation::property)::intensional.
fg_linked(FG,AliChain,Sat) :-
    linked(AliChain,Links,_BranchesEnds,Saturation),
    ( Saturation = saturated -> Sat = saturated ; Sat = unsaturated ),
    member(link(FG,_A1,_A2),Links).

signature mutagenic::extensional.

end_domain.

Listing 4: Complete specification of the Bursi domain.

Listing 4 shows a kLog script for the Bursi domain. Relevant predicates in the extensional database are a/2, b/3 (atoms and bonds, respectively, extracted from the chemical structure), sub/3 (functional groups, computed by DMax Chemistry Assistant Ando06:jrnl ; DeGrave2010:jrnl ), fused/3, connected/4 (direct connection between two functional groups), and linked/4 (connection between functional groups via an aliphatic chain). Aromaticity (used in the bond-type property of b/3) was also computed by DMax Chemistry Assistant. The intensional signatures essentially serve the purpose of simplifying the original data representation. For example, atm/2 omits some entities (hydrogen atoms), and fg_fused/3 replaces a list of atoms by its length. In signature bnd(atom_1@b::atm,atom_2@b::atm) we use a role field b (introduced by the symbol @). Using the same role twice declares that the two atoms play the same role in the chemical bond, i.e. that the relation is symmetric. In this way, each bond can be represented by a single tuple, while a more traditional relational representation, which is directional, would require two tuples. While this may at first sight appear to be only syntactic sugar, it does provide extra modeling abilities that are important in some domains. For instance, when modeling ring structures in molecules, traditional logical and relational learning systems need to employ either lists to capture all the elements in a ring structure, or else need to include all permutations of the atoms participating in a ring structure. For rings involving 6 atoms, this requires 6! = 720 different tuples, an unnecessary blow-up. Also, working with lists typically leads to complications such as having to deal with a potentially infinite number of terms. The target relation mutagenic has relational arity zero since it is a property of the whole interpretation.
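The effect of the shared role b can be mimicked outside kLog with a few lines of Python (ours, purely illustrative): an unordered representation yields one canonical object per bond or ring, whereas a directional representation multiplies tuples.

from itertools import permutations

# Directional representation: one tuple per orientation / permutation.
ring = ["c1", "c2", "c3", "c4", "c5", "c6"]
directional = set(permutations(ring))          # 6! = 720 tuples
print(len(directional))                        # 720

# Role-based (symmetric) representation: one canonical object per bond/ring.
bond = frozenset(("a1", "a2"))                 # bnd with role b on both atoms
ring_as_set = frozenset(ring)                  # a single object for the ring
print(bond == frozenset(("a2", "a1")))         # True: no duplicate needed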

As shown in Figure 7, results are relatively stable with respect to the choice of kernel hyperparameters (maximum radius and distance) and SVM regularization, and essentially match the best AUROC reported in Costa::Fast-neighborhood-subgraph even without composition with a polynomial kernel. These results are not surprising since the graphs generated by kLog are very similar in this case to the expanded molecular graphs used in Costa::Fast-neighborhood-subgraph .

We compared kLog to Tilde on the same task. A kLog domain specification can be trivially ported to background knowledge for Tilde, so both systems can access the same set of Prolog atoms. Table 3 summarizes the results and can be compared with Figure 7. Augmenting the language with the functional groups from DeGrave2010:jrnl unexpectedly gave worse results in Tilde compared to a plain atom-bond language. The presence of some functional groups correlates well with the target, and hence those tests are used near the root of the decision tree; unfortunately, the greedy learner is then unable to refine its hypothesis down to the atom level and relies almost exclusively on the coarse-grained functional groups. Such local-optimum traps can sometimes be escaped by looking ahead further, but this is expensive and Tilde ran out of memory for lookaheads greater than 1. Tilde's built-in bagging makes it possible to boost its results, especially when functional groups are used. Comparing Table 3 with Figure 7 shows that Tilde with our language bias definition is not well suited for this problem.

Setting             Model               AUC
Functional groups   Tilde               0.63 ± 0.09
Functional groups   Bagging 20x Tilde   0.79 ± 0.06
Atom bonds          Tilde               0.80 ± 0.02
Atom bonds          Bagging 20x Tilde   0.83 ± 0.02
Table 3: Tilde results on Bursi.

The kLog code for biodegradability is similar to that for Bursi, but since this is a regression task the target relation is declared as

signature biodegradation(halflife::property)::extensional.

We estimated prediction performance by repeating a ten-fold cross-validation procedure five times, as described in BlockeelEtAl:04 (using exactly the same folds in each trial). Results — root mean squared error (RMSE), squared correlation coefficient (SCC), and mean absolute percentage error (MAPE) — are reported in Table 4. For comparison, the best RMSE obtained by kFOIL on this data set (and the same folds) is 1.14 ± 0.04 (kFOIL was shown to outperform Tilde and S-CART in Landwehr:2010:Fast-learning-of-relational ).
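For reference, the three regression metrics can be computed as in the sketch below (our code; SCC is read here as the squared Pearson correlation coefficient, which is the usual interpretation).

import math

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def scc(y_true, y_pred):
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    vt = sum((t - mt) ** 2 for t in y_true)
    vp = sum((p - mp) ** 2 for p in y_pred)
    return cov ** 2 / (vt * vp)

def mape(y_true, y_pred):
    # Mean absolute percentage error, expressed in percent.
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)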

Setting             RMSE          SCC           MAPE
Functional groups   1.07 ± 0.01   0.54 ± 0.01   14.01 ± 0.08
Atom bonds          1.13 ± 0.01   0.48 ± 0.01   14.55 ± 0.12
Table 4: kLog results on biodegradability.

7.2 Link prediction

We report here experimental results on the UW-CSE domain that we have already extensively described in Section 3. To assess kLog behavior we evaluated prediction accuracy according to the leave-one-research-group-out setup of Richardson:2006:Markov-logic-networks , using the domain description of Listings 2 and 3, together with a NSPDK kernel with distance 2, radius 2, and soft match. Comparative results with respect to Markov logic are reported in Figure 8 (MLN results published in Richardson:2006:Markov-logic-networks ). The whole 5-fold procedure runs in about 20 seconds on a single core of a 2.5GHz Core i7 CPU. Compared to MLNs, kLog in the current implementation has the disadvantage of not performing collective assignment but the advantage of defining more powerful features thanks to the graph kernel. Additionally, MLN results use a much larger knowledge base. The advantage of kLog over MLN in Figure 8 is due to the more powerful feature space. Indeed, when setting the graph kernel distance and radius to 0 and 1, respectively, the feature space has just one feature for each ground signature, in close analogy to MLN. The empirical performance (area under recall-precision curve, AURPC) of kLog using the same set of signatures drops dramatically from 0.28 (distance 2, radius 2) to 0.09 (distance 1, radius 0).

In a second experiment, we predicted the relation advised_by starting from partial information (i.e., when the relation Student (and its complement Professor) is unknown, as in Richardson:2006:Markov-logic-networks ). In this case, we created a pipeline of two predictors; the procedure is reminiscent of stacked generalization wolpert1992stacked . In the first stage, a leave-one-research-group-out cross-validation procedure was applied to the training data to obtain predicted groundings for Student (a binary classification task on entities). Predicted groundings were then fed to the second stage, which predicts the binary relation advised_by. The overall procedure was repeated using one research group at a time for testing. Results are reported in Figure 9.
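Schematically, the two-stage procedure looks as follows (our Python sketch; train1/predict1 and train2/predict2 are hypothetical placeholders for the actual kLog training and prediction calls).

def stacked_link_prediction(groups, train1, predict1, train2, predict2):
    """Two-stage stacking sketch. `groups` is a list of research groups, each a
    list of ground facts; the train*/predict* callables are placeholders."""
    results = []
    for i, test_group in enumerate(groups):
        train_groups = groups[:i] + groups[i + 1:]
        # Stage 1: inner leave-one-group-out CV on the training groups gives
        # out-of-fold student/1 predictions (no label leakage into stage 2).
        augmented_train = []
        for j, held_out in enumerate(train_groups):
            rest = train_groups[:j] + train_groups[j + 1:]
            augmented_train.append(held_out + predict1(train1(rest), held_out))
        # Stage 1 trained on all training groups labels the test group.
        augmented_test = test_group + predict1(train1(train_groups), test_group)
        # Stage 2: predict advised_by/2 from facts augmented with predicted student/1.
        model2 = train2(augmented_train)
        results.append(predict2(model2, augmented_test))
    return results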

Since kLog is embedded in the programming language Prolog, it is easy to use the output of one learning task as the input for the next one as illustrated in the pipeline. This is because both the inputs and the outputs are relations. Relations are treated uniformly regardless of whether they are defined intensionally, extensionally, or are the result of a previous learning run. Thus kLog satisfies what has been called the closure principle in the context of inductive databases DBLP:journals/sigkdd/Raedt02 ; DBLP:reference/dmkdh/BoulicautJ10 ; it is also this principle together with the embedding of kLog inside a programming language (Prolog) that turns kLog into a true programming language for machine learning Mitchell06 ; DBLP:conf/ismis/RaedtN11 ; DBLP:conf/lrec/RizzoloR10 . Such programming languages possess — in addition to the usual constructs — also primitives for learning, that is, to specify the inputs and the outputs of the learning problems. In this way, they support the development of software in which machine learning is embedded without requiring the developer to be a machine learning expert. According to Mitchell Mitchell06 , the development of such languages is a long outstanding research question.

Figure 8: Comparison between kLog and MLN on the UW-CSE domain (all information).
Figure 9: Comparison between kLog and MLN on the UW-CSE domain (partial information).

7.3 Entity classification

Figure 10: WebKB domain.

The WebKB data set Craven:1998:Learning-to-Extract-Symbolic has been widely used to evaluate relational methods for text categorization. It consists of academic Web pages from four computer science departments, and the task is to identify the category of each page (student page, course page, professor page, etc.). Figure 10 shows the E/R diagram used in kLog. One of the most important relationships in this domain is has, which associates words with web pages. After graphicalization, vertices representing web pages have large degree (at least the number of words), making the standard NSPDK of Costa::Fast-neighborhood-subgraph totally inadequate: even for small values of the maximum distance and radius, the hard match would essentially create a distinct feature for every page. In this domain we can therefore appreciate the flexibility of the kernel defined in Section 6.2. In particular, the soft match kernel creates histograms of word occurrences in the page, which is very similar to the bag-of-words (with counts) representation commonly used in text categorization. The additional signature cs_in_url embodies the common-sense background knowledge that many course web pages contain the string "cs" followed by some digits; it is intensionally defined using a Prolog predicate that holds true when the regular expression cs(e*)[0-9]+ matches the page URL.
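For illustration only (our code, outside kLog), the cs_in_url test and the word-histogram view induced by the soft match can be mimicked as follows.

import re
from collections import Counter

def cs_in_url(url):
    # True when the URL contains "cs" (optionally "cse") followed by digits.
    return re.search(r"cs(e*)[0-9]+", url) is not None

def word_histogram(words):
    # The soft match over the 'has' relationship essentially compares
    # per-page histograms of word occurrences.
    return Counter(words)

print(cs_in_url("http://www.cs.example.edu/classes/cs573/"))   # True
print(word_histogram(["machine", "learning", "machine"]))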

Empirical results using only four Universities (Cornell, Texas, Washington, Wisconsin) in the leave-one-university-out setup are reported in Table 5.

research faculty course student A P R F
research 59 11 4 15 0.94 0.66 0.70 0.68
faculty 9 125 2 50 0.91 0.67 0.82 0.74
course 0 0 233 0 0.99 1.00 0.95 0.98
student 16 17 5 493 0.90 0.93 0.88 0.91
Average 0.88 0.89 0.88 0.88
Table 5: Results on WebKB (kLog): contingency table, accuracy, precision, recall, and F measure per class. Last row reports micro-averages.
research faculty course student A P R F
research 42 3 7 32 0.95 0.95 0.50 0.66
faculty 1 91 1 60 0.92 0.88 0.59 0.71
course 0 1 233 10 0.98 0.95 0.95 0.95
student 1 9 3 545 0.89 0.84 0.98 0.90
Average 0.88 0.91 0.76 0.81
Table 6: Results on WebKB (Markov logic): contingency table, accuracy, precision, recall, and F measure per class. Last row reports micro-averages.
research faculty course student A P R F
research 39 9 1 35 0.93 0.65 0.46 0.54
faculty 11 110 1 31 0.91 0.69 0.72 0.71
course 1 1 235 7 0.99 0.99 0.96 0.98
student 9 39 1 509 0.88 0.87 0.91 0.89
Average 0.86 0.80 0.76 0.78
Table 7: Results on WebKB (Tilde): contingency table, accuracy, precision, recall, and F measure per class. Last row reports micro-averages.

We compared kLog to MLN and to Tilde on the same task. For MLN we used the Alchemy system and the following set of formulae, which essentially encode the same domain knowledge exploited in kLog:

Has(+w,p) => Topic(+t,p)
!Has(+w,p) => Topic(+t,p)
CsInUrl(p) => Topic(+t,p)
!CsInUrl(p) => Topic(+t,p)
Topic(+c1, p1) ^ (Exist a LinkTo(a, p1, p2)) => Topic(+c2, p2)
HasAnchor(+w,a) ^ LinkTo(a, p1, p2) => Topic(+c, p2)
Ground atoms for the predicate LinkTo were precalculated externally in Prolog (the same code as for kLog's intensional signature), since regular expressions are not available in Alchemy. We did not enforce mutually exclusive categories since results tended to be worse. For learning we used the preconditioned scaled conjugate gradient approach described in lowd2007efficient and we tried a wide range of values for the learning rate and the number of iterations. The best results, reported in Table 6, used the trick of averaging MLN weights across all iterations as in lowd2007efficient . MC-SAT inference was used during prediction. In spite of the advantage MLN gains from its collective inference approach, results are comparable to those obtained with kLog (MLN tends to overpredict the class "student", resulting in a slightly lower average F measure, but accuracies are identical). Thus the features extracted by kLog using the graph kernel capture enough contextual information from the input portion of the data to obviate the lack of collective inference.

In the case of Tilde, we used the following language bias:

rmode((has(+U,Word,+Page), Word = #Word)).
rmode((cs_in_url(+U,+Page,V), V = #V)).
rmode((link_to(+U,-Link,-From,+Page), has_anchor(+U,Word,Link), Word = #Word)).
Results are reported in Table 7 and are slightly worse than those attained by kLog and MLN.

Although the accuracies of the three methods are essentially comparable, their requirements in terms of CPU time are dramatically different: using a single core of a second generation Intel Core i7, kLog took 36s, Alchemy 27,041s (for 100 iterations, at which the best accuracy is attained), and Tilde 5,259s.

7.4 Domains with a single interpretation

The Internet Movie Database (IMDb) collects information about movies and their cast, people, and companies working in the motion picture industry. We focus on predicting, for each movie, whether its first-weekend box-office receipts are over US$2 million, a learning task previously defined in Neville:2003:Collective-classification-with ; Macskassy:2003:A-Simple-Relational-Classifier . The learning setting defined so far (learning from independent interpretations) is not directly applicable since train and test data must necessarily occur within the same interpretation. The notion of slicing in kLog allows us to overcome this difficulty. A slice system is a partition of the true ground atoms in a given interpretation into disjoint sets, called slices, indexed by an index set endowed with a total order. For example, a natural choice of index set in the IMDb domain is the set of movie production years, where the index associated with a ground atom of an entity such as an actor is the debut year.

In this way, given two disjoint subsets of the index set, one used for training and one for testing, with every training index preceding every test index, it is reasonable during training to use, for a given index, the ground atoms of all earlier slices as the input portion of the data and the ground atoms of that slice as the output portion (targets). Similarly, during testing we can, for each test index, use the ground atoms of all earlier slices to predict those of the corresponding slice.
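A minimal sketch of such slice-based splits, assuming ground atoms are simply grouped by the year assigned to them by the slicing function (our representation, not kLog's):

def train_test_split_by_slice(atoms_by_year, test_year):
    """Input: dict year -> list of ground atoms. Training input is every
    slice strictly before test_year; the target slice is test_year."""
    train_input = [a for y, atoms in atoms_by_year.items()
                   if y < test_year for a in atoms]
    test_target = list(atoms_by_year.get(test_year, []))
    return train_input, test_target

atoms_by_year = {
    1995: ["movie(m1)", "blockbuster(m1)"],
    1996: ["movie(m2)"],
    1997: ["movie(m3)", "blockbuster(m3)"],
}
x, y = train_test_split_by_slice(atoms_by_year, 1997)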

The kLog data set was created after downloading the whole database from http://www.imdb.com. Adult movies, movies produced outside the US, and movies with no opening-weekend data were discarded. Persons and companies with a single appearance in this subset of movies were also discarded. The resulting data set is summarized in Table 8. We modeled the domain in kLog using extensional signatures for movies, persons (actors, producers, directors), and companies (distributors, production companies, special effects companies). We additionally included intensional signatures counting, for each movie, the number of companies also involved in other blockbuster movies. We sliced the data set according to production year and, starting from 1997, trained on all preceding years and tested on the current year. Results (area under the ROC curve) are summarized in Table 8, together with comparative results against MLN and Tilde. In all three cases we used the same set of ground atoms for training and testing, but the Alchemy software does not allow differentiating between evidence and query ground atoms of the same predicate. We therefore introduced an extra predicate called PrevBlockbuster to inject evidence for the training years, together with the hard rule

PrevBlockbuster(m) <=> Blockbuster(m).
and used Blockbuster as the query predicate when training on the earlier years and testing on the current year. MLN rules were designed to capture the past performance of actors, directors, distribution companies, etc. For example:

ActedIn(p,m1) ^ ActedIn(p,m2) ^ Blockbuster(m1) ^ m1 != m2 => Blockbuster(m2)
In the case of Tilde, we used as background knowledge exactly the same intensional signature definitions used in kLog.

Year # Movies # Facts kLog MLN Tilde
1995 74 2483
1996 223 6406
1997 311 8031 0.86 0.79 0.80
1998 332 7822 0.93 0.85 0.88
1999 348 7842 0.89 0.85 0.85
2000 381 8531 0.96 0.86 0.93
2001 363 8443 0.95 0.86 0.91
2002 370 8691 0.93 0.87 0.89
2003 343 7626 0.95 0.88 0.87
2004 371 8850 0.95 0.87 0.87
2005 388 9093 0.92 0.84 0.83
All 0.93 ± 0.03 0.85 ± 0.03 0.87 ± 0.04
Table 8: Results (AUC) on the IMDb data set. Years 1995 and 1996 were only used for training. Years before 1995 (not shown) are always included in the training data.

On this data set, Tilde was the fastest system, completing all training and test phases in 220s, followed by kLog (1,394s) and Alchemy (12,812s). However, the AUC obtained by kLog is consistently higher across all prediction years.

8 Related work

As kLog is a language for logical and relational learning with kernels it is related to work on inductive logic programming, to statistical relational learning, to graph kernels, and to propositionalization. We now discuss each of these lines of work and their relation to kLog.

First, the underlying representation of the data that kLog employs at the first level is very close to that of standard inductive logic programming systems such as Progol Mug95:jrnl , Aleph aleph , and Tilde 1998-blockeel-0 in the sense that the input is essentially (a variation of) a Prolog program for specifying the data and the background knowledge. Prolog allows us to encode essentially any program as background knowledge. The E/R model used in kLog is related to the Probabilistic Entity Relationship models introduced by Heckerman et al. in Heckerman:2007:SRL . The signatures play a similar role as the notion of a declarative bias in inductive logic programming De-Raedt:2008:Logical-and-relational-learning . The combined use of the E/R model and the graphicalization has provided us with a powerful tool for visualizing both the structure of the data (the E/R diagram) as well as specific cases (through their graphs). This has proven to be very helpful when preparing datasets for kLog. On the other hand, due to the adoption of a database framework, kLog does not allow for using functors in the signature relations (though functors can be used inside predicates needed to compute these relations inside the background knowledge). This contrasts with some inductive logic programming systems such as Progol Mug95:jrnl and Aleph aleph .

Second, kLog is related to many existing statistical relational learning systems such as Markov logic Richardson:2006:Markov-logic-networks , probabilistic similarity logic Brocheler:2010:Probabilistic-similarity-logic , probabilistic relational models 2001-getoor , Bayesian logic programs Kersting06 , and ProbLog De-Raedt:2007:ProbLog:-A-probabilistic-Prolog in that the representations of the inputs and outputs are essentially the same, that is, both in kLog and in statistical relational learning systems inputs are partial interpretations which are completed by predictions. What kLog and statistical relational learning techniques have in common is that they both construct (implicitly or explicitly) graphs representing the instances. For statistical relational learning methods such as Markov logic Richardson:2006:Markov-logic-networks , probabilistic relational models 2001-getoor , and Bayesian logic programs Kersting06 the knowledge-based model construction process will result in a graphical model (Bayesian or Markov network) for each instance representing a class of probability distributions, while in kLog the process of graphicalization results in a graph representing an instance by unrolling the E/R-diagram. Statistical relational learning systems then learn a probability distribution using the features and parameters in the graphical model, while kLog learns a function using the features derived by the kernel from the graphs. Notice that in both cases, the resulting features are tied together. Indeed, in statistical relational learning each ground instance of a particular template or expression that occurs in the graph has the same parameters. kLog features correspond to subgraphs that represent relational templates and that may match (and hence be grounded) multiple times in the graphicalization. As each such feature has a single weight, kLog also realizes parameter tying in a way similar to statistical relational learning methods. One difference between these statistical relational learning models and kLog is that the former do not really have a second level as does kLog. Indeed, the knowledge-based model construction process directly generates the graphical model that includes all the features used for learning, while in kLog these features are derived from the graph kernel. While statistical relational learning systems have been commonly used for collective learning, this is still a question for further research within kLog. A combination of structured-output learning Tsochantaridis:2006:Large-margin-methods and iterative approaches (as incorporated in the EM algorithm) can form the basis for further work in this direction. Another interesting but more speculative direction for future work is concerned with lifted inference. Lifted inference has been the focus of a lot of attention in statistical relational learning; see DBLP:conf/ecai/Kersting12 for an overview. One view on lifted inference is that it is trying to exploit symmetries in the graphical models that would be the result of the knowledge-based model construction step, e.g., Kersting:2009:CBP:1795114.1795147 . From this perspective, it might be interesting to explore the use of symmetries in graphs and features constructed by kLog.

kLog also builds upon the many results on learning with graph kernels; see Gartner03 for an overview. A distinguishing feature of kLog is, however, that the graphs obtained by graphicalizing a relational representation contain very rich labels, which can be both symbolic and numeric. This contrasts with the kind of graphs needed to represent, for instance, small molecules. In this regard, kLog is close in spirit to the work of Wachman:2007:Learning-from-interpretations: , who define a kernel on hypergraphs, where hypergraphs are used to represent relational interpretations. A further difference, however, is that kLog also provides a procedure for automatically graphicalizing relational representations, which in turn makes it natural to specify multitask and collective learning tasks.

The graphicalization approach introduced in kLog is closely related to the notion of propositionalization, a commonly applied technique in logical and relational learning 2001-kramer-0 ; De-Raedt:2008:Logical-and-relational-learning to generate features from a relational representation. The advantage of graphicalization is that the obtained graphs are essentially equivalent to the relational representation and that — in contrast to the existing propositionalization approaches in logical and relational learning — this does not result in a loss of information. After graphicalization, any graph kernel can in principle be applied to the resulting graphs. Even though many of these kernels (such as the one used in kLog) compute — implicitly or explicitly — a feature vector, the dimensionality of the obtained vector is far beyond that employed by traditional propositionalization approaches. kFOIL Landwehr:2010:Fast-learning-of-relational is one such propositionalization technique that has been tightly integrated with a kernel-based method. It greedily derives a (small) set of features in a way that resembles the rule-learning algorithm of FOIL Qui90-ML:jrnl .

Several other approaches to relational learning and mining have employed graph-based encodings of the relational data, e.g., Rossi:2012:TGD:2444851.2444861 ; lao2010relational ; cook2006mining ; sun2012mining . kLog encodes a set of ground atoms into a bipartite undirected graph whose nodes are true ground atoms and whose edges connect an entity atom to a relationship atom if the identifier of the former appears as an argument in the latter. This differs from the usual encoding employed in graph-based approaches to relational learning and mining, which typically use labeled edges to directly represent the relationships between the nodes corresponding to the entities. Furthermore, these approaches typically use the graph-based representation as the single representation, and, unlike kLog, neither consider the graph-based representation as an intermediate representation nor work at three levels of representation (logical, graph-based, and feature-based).

Other domain-specific languages for machine learning have been developed with goals closely related to those of kLog. Learning Based Java DBLP:conf/lrec/RizzoloR10 was designed specifically to address applications in natural language processing. It builds on the concept of data-driven compilation to perform feature extraction and nicely exploits the constrained conditional model framework Chang08 for structured output learning. FACTORIE McCallum09 allows features used in a factor graph, and consequently arbitrarily connected conditional random fields, to be defined concisely. As with MLN, there is an immediate dependency of the feature space on the sentences of the language, whereas in kLog this dependency is indirect since the exact feature space is eventually defined by the graph kernel.

9 Conclusions

We have introduced a novel language for logical and relational learning called kLog. It tightly integrates logical and relational learning with kernel methods and constitutes a principled framework for statistical relational learning based on kernel methods rather than on graphical models. kLog uses a representation that is based on E/R modeling, which is close to representations being used by contemporary statistical relational learners. kLog first performs graphicalization, that is, it computes a set of labeled graphs that are equivalent to the original representation, and then employs a graph kernel to realize statistical learning. We have shown that the kLog framework can be used to formulate and address a wide range of learning tasks, that it performs at least comparably to state-of-the-art statistical relational learning techniques, and also that it can be used as a programming language for machine learning.

The system presented in this paper is a first step towards a kernel-based language for relational learning, but there are unanswered questions and interesting open directions for further research. One important aspect is the possibility of performing collective classification (or structured output prediction). kLog's semantics naturally allows structured output learning problems to be defined: graphicalization followed by a graph kernel yields a joint feature vector over the input and the groundings of the output predicates, and collective prediction amounts to maximizing the resulting scoring function with respect to those groundings. There are two important cases to consider. If the output groundings do not affect the graph structure (because they never affect the intensional signatures), then the collective classification problem is not more complicated than in related SRL systems. For example, if dependencies have a regular sequential structure, dynamic programming can be used for this step, exactly as in conditional random fields (indeed, collective classification has been successfully exploited within kLog in an application to natural language text segmentation Verbeke2012b ). However, in general, changing the output groundings during search will also change the graph structure. In principle it is feasible to redo graphicalization from scratch during search, apply the graph kernel again, and eventually re-evaluate the scoring function, but such a naive approach would of course be very inefficient. Developing clever and faster algorithms for this purpose is an interesting open issue. It should be remarked, however, that even without collective classification, kLog achieves good empirical results thanks to the fact that the features produced by the graph kernel provide a wide relational context.

The graph kernel that is currently employed in kLog makes use of the notion of topological distances to define the concept of neighborhoods. In this way, given a predicate of interest, properties of “nearby” tuples are combined to generate features relevant for that predicate. As a consequence, when topological distances are not informative (e.g., in the case of dense graphs with small diameter) then large fractions of the graph become accessible to any neighborhood and the features induced for a specific predicate cease to be discriminative. In these cases (typical when dealing with small-world networks), kernels with a different type of bias (e.g., flow-based kernels) are more appropriate. The implementation of a library of kernels suitable for different types of graphs, as well as the integration of other existing graph kernels in the kLog framework, is therefore an important direction for future development.

Furthermore, even though kLog's current implementation is quite performant, there are interesting implementation issues to be studied. Many of these are similar to those addressed in statistical relational learning systems such as Alchemy Richardson:2006:Markov-logic-networks and ProbLog DBLP:journals/tplp/KimmigDRCR11 .

kLog is being actively used for developing applications. We are currently exploring applications of kLog in natural language processing Verbeke2012 ; Verbeke2012b ; Kordjamshidi2012 and in computer vision Antanas2012 ; AntanasHFTR13 ; Antanas2011 .

Acknowledgments

We thank Tom Schrijvers for his dependency analysis code. PF was supported partially by KU Leuven SF/09/014 and partially by Italian Ministry of University and Research PRIN 2009LNP494. KDG was supported partially by KU Leuven GOA/08/008 and partially by ERC Starting Grant 240186 “MiGraNT”.

Appendix A Syntax of the kLog domain declaration section

A kLog program consists of Prolog code augmented by a domain declaration section, delimited by the pair of keywords begin_domain and end_domain and containing one or more signature declarations. A signature declaration consists of a signature header followed by one or more Prolog clauses. Clauses in a signature declaration form the declaration of signature predicates and are automatically connected to the current signature header. A few signature predicates have a special meaning for kLog, as discussed in this section. A brief BNF description of the grammar of kLog domains is given in Figure 11; a small parsing sketch follows the figure.

Additionally, kLog provides a library of Prolog predicates for handling data, learning, and performance measurement.

<domain>         : "begin_domain." <signatures> "end_domain.".
<signatures>     : <signature> ; [<signature> <signatures>].
<signature>      : <header> [<sig_clauses>].
<sig_clauses>    : <sig_clause> ; [<sig_clause> <sig_clauses>].
<sig_clause>     : <Prolog_clause>.
<header>         : <sig_name> "(" <args> ")" "::" <level> ".".
<sig_name>       : <Prolog_atom>.
<args>           : <arg> ; [<arg> <args>].
<arg>            : <column_name> [<role_overrider>] "::" <type>.
<column_name>    : <Prolog_atom>.
<role_overrider> : "@" <role>.
<role>           : <Prolog_atom>.
<type>           : "self" | <sig_name>.
<level>          : "intensional" ; "extensional".

Figure 11: kLog syntax
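As a small illustration of the grammar, the following sketch (our code, covering only the simple cases and in no way the actual kLog parser) extracts the components of a signature header:

import re

HEADER = re.compile(r"^(\w+)(?:\((.*)\))?::(intensional|extensional)\.$")

def parse_header(line):
    m = HEADER.match(line.strip())
    if not m:
        raise ValueError("not a signature header: " + line)
    name, args, level = m.group(1), m.group(2) or "", m.group(3)
    parsed_args = []
    for arg in filter(None, (a.strip() for a in args.split(","))):
        column, type_ = arg.split("::")
        column, _, role = column.partition("@")   # optional role overrider
        parsed_args.append((column, role or None, type_))
    return name, parsed_args, level

print(parse_header("bnd(atom_1@b::atm, atom_2@b::atm, type::property)::intensional."))
# ('bnd', [('atom_1', 'b', 'atm'), ('atom_2', 'b', 'atm'), ('type', None, 'property')], 'intensional')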

Appendix B Definitions

For the sake of completeness we report here a number of graph-theoretical definitions used in the paper. We closely follow the notation in GrossYellen2005 . A graph consists of a set of vertices and a set of edges; subscripted notation for these sets is used when more than one graph is being considered. Each edge has a set of two vertices associated with it, called its endpoints, and we denote an edge by concatenating the variables of its endpoints. An edge is said to join its endpoints. A vertex is adjacent to another vertex if they are joined by an edge. An edge and a vertex on that edge are called incident. The degree of a vertex is the number of edges incident to it. A multi-edge is a collection of two or more edges having identical endpoints. A self-loop is an edge that joins a single endpoint to itself. A simple graph is a graph that has neither self-loops nor multi-edges. A graph is bipartite if its vertex set can be partitioned into two subsets so that every edge has one endpoint in each subset. A graph is rooted when we distinguish one of its vertices, called the root. A walk in a graph is a sequence of vertices in which every pair of consecutive vertices is adjacent. The length of a walk is its number of edges (counting repetitions). A path is a walk in which no vertex is repeated, except possibly the initial and final vertex (in which case it is called a cycle). The distance between two vertices is the length of the shortest path between them. A graph is connected if between each pair of vertices there exists a walk. In the following we mainly consider simple connected graphs. The neighborhood of a vertex is the set of vertices adjacent to it. The neighborhood of radius r of a vertex is the set of vertices at distance at most r from it. In a graph, the subgraph induced by a set of vertices has that set as its vertex set and contains every edge of the graph whose endpoints belong to it. A subgraph is a spanning subgraph if its vertex set coincides with the vertex set of the whole graph. The neighborhood subgraph of radius r of a vertex is the subgraph induced by the neighborhood of radius r of that vertex. A labeled graph is a graph whose vertices and/or edges are labeled, possibly with repetitions, using symbols from a finite alphabet; a labeling function maps each vertex/edge to its label symbol. Two simple graphs are isomorphic if there is a bijection between their vertex sets such that two vertices are adjacent in one graph if and only if their images are adjacent in the other; an isomorphism is a structure-preserving bijection. Two labeled graphs are isomorphic if there is an isomorphism that also preserves the label information. An isomorphism invariant or graph invariant is a graph property that is identical for two isomorphic graphs (e.g., the number of vertices and/or edges). A certificate for isomorphism is an isomorphism invariant that is identical for two graphs if and only if they are isomorphic.
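The definitions of distance, neighborhood, and neighborhood subgraph translate directly into code; a sketch using adjacency-dictionary graphs (our representation):

from collections import deque

def neighborhood(graph, v, r):
    # Vertices at distance <= r from v (breadth-first search).
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == r:
            continue
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

def induced_subgraph(graph, vertices):
    # Keep every edge of the graph whose endpoints are both in `vertices`.
    return {v: {u for u in graph[v] if u in vertices} for v in vertices}

def neighborhood_subgraph(graph, v, r):
    return induced_subgraph(graph, neighborhood(graph, v, r))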

A hypergraph is a generalization of a graph, also known under the name of set system. A set system is an ordered pair consisting of a set of elements and a family of subsets of that set. Note that when every member of the family is a pair of elements, the set system is a simple graph. The elements of the base set are called the vertices of the hypergraph and the members of the family its hyperedges. There are two principal ways to represent set systems as graphs: as incidence graphs and as intersection graphs. In the following we consider only incidence graphs. Given a set system, the associated incidence graph is the bipartite graph whose two vertex classes are the vertices and the hyperedges of the set system, with a vertex and a hyperedge adjacent whenever the vertex belongs to the hyperedge.

Appendix C Decomposition Kernels

We follow the notation in Haussler99:other . Given a set X and a function K : X × X → R, we say that K is a kernel on X × X if K is symmetric, i.e., K(x, y) = K(y, x) for any x and y in X, and positive-semidefinite, i.e., for any n ≥ 1 and any x_1, ..., x_n in X, the matrix with entries K(x_i, x_j) is positive-semidefinite, that is, sum_{i,j} a_i a_j K(x_i, x_j) ≥ 0 for all real coefficients a_1, ..., a_n, or equivalently if all its eigenvalues are nonnegative. It is easy to see that if each x in X can be represented as a vector φ(x) such that K(x, y) = ⟨φ(x), φ(y)⟩, where ⟨·,·⟩ is the ordinary dot product, then K is a kernel. The converse is also true under reasonable assumptions (which are almost always verified) on X and K, that is, a given kernel K can be represented as K(x, y) = ⟨φ(x), φ(y)⟩ for some choice of φ; in particular this holds for any kernel over X × X where X is a countable set. The vector space induced by φ is called the feature space. Note that it follows from the definition of positive-semidefiniteness that the zero-extension of a kernel is a valid kernel, that is, if S ⊆ X and K is a kernel on S × S, then K may be extended to a kernel on X × X by defining K(x, y) = 0 whenever x or y is not in S. It is easy to show that kernels are closed under summation, i.e., a sum of kernels is a valid kernel.

Let now x in X be a composite structure such that we can define x_1, ..., x_D as its parts (note that the set of parts need not be a partition of the composite structure, i.e., the parts may "overlap"). Each part is such that x_d belongs to a countable set X_d, for d = 1, ..., D. Let R be the relation defined on X_1 × ... × X_D × X such that R(x_1, ..., x_D, x) is true iff x_1, ..., x_D are the parts of x. We denote with R⁻¹(x) the inverse relation that yields the parts of x, that is, R⁻¹(x) = {(x_1, ..., x_D) : R(x_1, ..., x_D, x)}. In Haussler99:other it is demonstrated that, if there exists a kernel K_d over X_d × X_d for each d, and if two instances x and y can be decomposed into tuples of parts in R⁻¹(x) and R⁻¹(y), then the following generalized convolution

K(x, y) = Σ_{(x_1,...,x_D) ∈ R⁻¹(x)} Σ_{(y_1,...,y_D) ∈ R⁻¹(y)} ∏_{d=1}^{D} K_d(x_d, y_d)          (17)

is a valid kernel, called a convolution or decomposition kernel (to be precise, the valid kernel is the zero-extension of K to X × X, since R⁻¹ is not guaranteed to yield a non-empty set for all x). In words: a decomposition kernel is a sum (over all possible ways to decompose a structured instance) of the product of valid kernels over the parts of the instance.
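A minimal concrete instance of Equation 17 (our sketch): with a single part type whose "parts" are the elements of a set and an exact-match part kernel, the decomposition kernel reduces to the size of the set intersection.

from itertools import product
from math import prod

def convolution_kernel(x, y, parts, part_kernels):
    # Sum over all pairs of decompositions of the product of part kernels.
    return sum(prod(k(a, b) for k, a, b in zip(part_kernels, dx, dy))
               for dx, dy in product(parts(x), parts(y)))

def delta(a, b):
    return 1.0 if a == b else 0.0

def set_parts(s):
    # D = 1: each decomposition consists of a single element of the set.
    return [(e,) for e in s]

print(convolution_kernel({1, 2, 3}, {2, 3, 4}, set_parts, [delta]))   # 2.0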

References

  • (1) T. Dietterich, P. Domingos, L. Getoor, S. Muggleton, P. Tadepalli, Structured machine learning: the next ten years, Mach Learn 73 (2008) 3–23.
  • (2) L. De Raedt, B. Demoen, D. Fierens, B. Gutmann, G. Janssens, A. Kimmig, N. Landwehr, T. Mantadelis, W. Meert, R. Rocha, et al., Towards digesting the alphabet-soup of statistical relational learning (2008).
  • (3) L. De Raedt, P. Frasconi, K. Kersting, S. Muggleton (Eds.), Probabilistic inductive logic programming: theory and applications, Vol. 4911 of Lecture notes in computer science, Springer, Berlin, 2008.
  • (4) L. Getoor, B. Taskar (Eds.), Introduction to statistical relational learning, MIT Press, Cambridge, Mass., 2007.
    URL http://www.loc.gov/catdir/toc/ecip079/2007000951.html
  • (5) N. Landwehr, A. Passerini, L. De Raedt, P. Frasconi, Fast learning of relational kernels, Machine learning 78 (3) (2010) 305–342.
  • (6) B. Taskar, C. Guestrin, D. Koller, Max-margin Markov networks, in: S. Thrun, L. Saul, B. Schölkopf (Eds.), Proceedings of Neural Information Processing Systems, MIT Press, Cambridge, MA, 2003.
  • (7) M. Richardson, P. Domingos, Markov logic networks, Machine Learning 62 (2006) 107–136.
  • (8) N. Friedman, L. Getoor, D. Koller, A. Pfeffer, Learning probabilistic relational models, in: International Joint Conference on Artificial Intelligence, 1999, pp. 1300–1309.
  • (9) L. De Raedt, Logical and relational learning, 1st Edition, Cognitive technologies, Springer, New York, 2008.
  • (10) D. Heckerman, C. Meek, D. Koller, Probabilistic entity-relationship models, PRMs, and plate models, in: L. Getoor, B. Taskar (Eds.), Introduction to Statistical Relational Learning, The MIT Press, 2007, pp. 201–238.
  • (11) F. Costa, K. De Grave, Fast neighborhood subgraph pairwise distance kernel, in: Proc. of the 27th International Conference on Machine Learning (ICML 2010), 2010, pp. 255–262.
  • (12) I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun, Large margin methods for structured and interdependent output variables, Journal of Machine Learning Research 6 (2) (2006) 1453–1484.
  • (13) A. Y. Ng, M. I. Jordan, On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes, in: T. Dietterich, S. Becker, Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems (NIPS) 14, 2002, pp. 841–848.
  • (14) C. Sutton, A. McCallum, An introduction to conditional random fields for relational learning, in: Getoor and Taskar Getoor:2007:Introduction-to-statistical-relational , pp. 93–127.
    URL http://www.loc.gov/catdir/toc/ecip079/2007000951.html
  • (15) Y. Altun, I. Tsochantaridis, T. Hofmann, Hidden markov support vector machines, in: Twentieth International Conference on Machine Learning (ICML 2003), 2003, pp. 3–10.
  • (16) K. Lari, S. Young, Applications of stochastic context-free grammars using the inside-outside algorithm, Computer speech & language 5 (3) (1991) 237–257.
  • (17) S. Muggleton, Stochastic logic programs, in: L. De Raedt (Ed.), Advances in Inductive Logic Programming, IOS Press, 1996, pp. 254–264.
  • (18) B. Taskar, P. Abbeel, D. Koller, Discriminative probabilistic models for relational data, in: A. Darwiche, N. Friedman (Eds.), Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, San Fransisco, CA, 2002, pp. 895–902.
  • (19) A. Serebrenik, T. Schrijvers, B. Demoen, Improving Prolog programs: Refactoring for Prolog, Theory and Practice of Logic Programming 8 (2) (2008) 201–215.
  • (20) A. Argyriou, T. Evgeniou, M. Pontil, Convex multi-task feature learning, Machine Learning 73 (3) (2008) 243–272.
  • (21) V. S. Costa, R. Rocha, L. Damas, The Yap Prolog system, Theory and Practice of Logic Programming 12 (1-2) (2012) 5–34.
  • (22) C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm (2001).
  • (23) L. Bottou, Large-scale machine learning with stochastic gradient descent, in: Y. Lechevallier, G. Saporta (Eds.), Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010), Springer, Paris, France, 2010, pp. 177–187.
  • (24) A. Srinivasan, S. H. Muggleton, R. King, M. Sternberg, Mutagenesis: ILP experiments in a non-determinate biological domain, in: Proceedings of the 4th International Workshop on Inductive Logic Programming, volume 237 of GMD-Studien, 1994, pp. 217–232.
  • (25) R. Wang, Y. Fu, L. Lai, A new atom-additive method for calculating partition coefficients, J. Chem. Inf. Comput. Sci 37 (3) (1997) 615–621.
  • (26) T. Evgeniou, C. Micchelli, M. Pontil, Learning multiple tasks with kernel methods, Journal of Machine Learning Research 6 (1) (2006) 615–637.
  • (27) M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, S. Slattery, Learning to extract symbolic knowledge from the world wide web (1998).
  • (28) G. H. Bakir, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, S. V. N. Vishwanathan (Eds.), Predicting Structured Data, Neural Information Processing, The MIT Press, 2007.
  • (29) T. Joachims, Learning to Classify Text using Support Vector Machines: Methods, Theory, and Algorithms, Kluwer Academic Publishers, Dordrecht, 2002.
  • (30) J. Neville, D. Jensen, Collective classification with relational dependency networks, in: Workshop on Multi-Relational Data Mining (MRDM-2003), 2003.
  • (31) P. Frasconi, M. Jaeger, A. Passerini, Feature discovery with type extension trees, in: Proc. of the 18th Int. Conf. on Inductive Logic Programming, Springer, 2008, pp. 122–139.
  • (32) A. Vazquez, A. Flammini, A. Maritan, A. Vespignani, Global protein function prediction from protein-protein interaction networks, Nature biotechnology 21 (6) (2003) 697–700.
  • (33) G. Lanckriet, M. Deng, N. Cristianini, M. Jordan, W. Noble, Kernel-based data fusion and its application to protein function prediction in yeast, in: Proceedings of the Pacific Symposium on Biocomputing, Vol. 9, 2004, pp. 300–311.
  • (34) T. G. Dietterich, R. H. Lathrop, T. Lozano-Pérez, Solving the multiple instance problem with axis-parallel rectangles, Artificial Intelligence 89 (1-2) (1997) 31–71.
  • (35) D. Koller, N. Friedman, Probabilistic graphical models: principles and techniques, Adaptive computation and machine learning, MIT Press, Cambridge, MA, 2009.
  • (36) P. Frasconi, M. Gori, A. Sperduti, A general framework for adaptive processing of data structures, IEEE Transactions on Neural Networks 9 (5) (1998) 768–786.
  • (37) B. D. McKay, Practical graph isomorphism, Congressus Numerantium 30 (1981) 45–87.
  • (38) X. Yan, J. Han., gSpan: Graph-based substructure pattern mining, in: Proc. 2002 Int. Conf. Data Mining (ICDM ´02), 2002, pp. 721–724.
  • (39) E. M. Luks, Isomorphism of graphs of bounded valence can be tested in polynomial time, J. Comput. Syst. Sci 25 (1982) 42–65.
  • (40) S. Sorlin, C. Solnon, A parametric filtering algorithm for the graph isomorphism problem, Constraints 13 (2008) 518–537.
  • (41) L. Ralaivola, S. Swamidass, H. Saigo, P. Baldi, Graph kernels for chemical informatics, Neural Networks 18 (8) (2005) 1093–1110.
  • (42) Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, S. Vishwanathan, Hash kernels for structured data, Journal of Machine Learning Research 10 (2009) 2615–2637.
  • (43) K. De Grave, Predictive quantitative structure-activity relationship models and their use for the efficient screening of molecules, Ph.D. thesis, Katholieke Universiteit Leuven (August 2011).
  • (44) S. V. N. Vishwanathan, N. N. Schraudolph, R. Kondor, K. M. Borgwardt, Graph kernels, J. Mach. Learn. Res. 99 (2010) 1201–1242.
  • (45) S. Menchetti, F. Costa, P. Frasconi, Weighted decomposition kernels, in: L. De Raedt, S. Wrobel (Eds.), Proceedings of the 22nd International Conference on Machine Learning, Vol. 119 of ACM International Conference Proceeding Series, ACM, New York, NY, 2005, pp. 585–592.
  • (46) Z. Kou, W. Cohen, Stacked graphical models for efficient inference in markov random fields, in: Proceedings of the Seventh SIAM International Conference on Data Mining, 2007, pp. 533–538.
  • (47) H. van de Waterbeemd, E. Gifford, ADMET in silico modelling: towards prediction paradise?, Nat Rev Drug Discov 2 (3) (2003) 192–204.
  • (48) C. Helma, S. Kramer, A survey of the predictive toxicology challenge 2000–2001, Bioinformatics 19 (10) (2003) 1179–1182.
  • (49) C. Helma, Predictive Toxicology, 1st Edition, CRC Press, 2005.
  • (50) T. Horváth, T. Gärtner, S. Wrobel, Cyclic pattern kernels for predictive graph mining, in: Proceedings of KDD 04, ACM Press, 2004, pp. 158–167.
  • (51) P. Mahe, N. Ueda, T. Akutsu, J. Perret, J. Vert, Graph kernels for molecular structure-activity relationship analysis with support vector machines, J. Chem. Inf. Model 45 (4) (2005) 939–51.
  • (52) S. Vishwanathan, N. Schraudolph, R. Kondor, K. Borgwardt, Graph kernels, The Journal of Machine Learning Research 11 (2010) 1201–1242.
  • (53) A. Ceroni, F. Costa, P. Frasconi, Classification of small molecules by two-and three-dimensional decomposition kernels, Bioinformatics 23 (16) (2007) 2038–2045.
  • (54) J. Kazius, R. McGuire, R. Bursi, Derivation and validation of toxicophores for mutagenicity prediction, J. Med. Chem 48 (1) (2005) 312–320.
  • (55) H. Blockeel, S. Dzeroski, B. Kompare, S. Kramer, B. Pfahringer, W. Laer, Experiments in Predicting Biodegradability., Applied Artificial Intelligence 18 (2) (2004) 157–181.
  • (56) H. Y. Ando, L. Dehaspe, W. Luyten, E. V. Craenenbroeck, H. Vandecasteele, L. V. Meervelt, Discovering H-Bonding Rules in Crystals with Inductive Logic Programming, Mol. Pharm. 3 (6) (2006) 665–674.
  • (57) K. De Grave, F. Costa, Molecular graph augmentation with rings and functional groups, Journal of Chemical Information and Modeling 50 (9) (2010) 1660–1668.
  • (58) D. Wolpert, Stacked generalization, Neural networks 5 (2) (1992) 241–259.
  • (59) L. De Raedt, A perspective on inductive databases, SIGKDD Explorations 4 (2) (2002) 69–77.
  • (60) J.-F. Boulicaut, B. Jeudy, Constraint-based data mining, in: O. Maimon, L. Rokach (Eds.), Data Mining and Knowledge Discovery Handbook, 2nd Edition, Springer, 2010, pp. 339–354.
  • (61) T. Mitchell, The discipline of machine learning, Tech. Rep. CMU-ML-06-108, Carnegie Mellon University (2006).
  • (62) L. De Raedt, S. Nijssen, Towards programming languages for machine learning and data mining (extended abstract), in: M. Kryszkiewicz, H. Rybinski, A. Skowron, Z. W. Ras (Eds.), Foundations of Intelligent Systems - 19th International Symposium, ISMIS 2011, Warsaw, Poland, June 28-30, 2011. Proceedings, Vol. 6804 of Lecture Notes in Computer Science, Springer, 2011, pp. 25–32.
  • (63) N. Rizzolo, D. Roth, Learning Based Java for rapid development of NLP systems, in: N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner, D. Tapias (Eds.), Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, 17-23 May 2010, Valletta, Malta, 2010.
  • (64) D. Lowd, P. Domingos, Efficient weight learning for Markov logic networks, in: Knowledge Discovery in Databases: PKDD 2007, Springer, 2007, pp. 200–211.
  • (65) S. A. Macskassy, F. Provost, A simple relational classifier, in: Proc. of the 2nd Workshop on Multi-Relational Data Mining, 2003.
  • (66) S. Muggleton, Inverse entailment and Progol, New Generation Computing 13 (3–4) (1995) 245–286.
  • (67) A. Srinivasan, The Aleph Manual (2007).
    URL http://www.comlab.ox.ac.uk/activities/machinelearning/Aleph/
  • (68) H. Blockeel, L. De Raedt, Top-down induction of first order logical decision trees, Artificial Intelligence 101 (1-2) (1998) 285–297.
  • (69) M. Bröcheler, L. Mihalkova, L. Getoor, Probabilistic similarity logic, in: Proc. of the 26th Conference on Uncertainty in Artificial Intelligence (UAI 2010), 2010.
  • (70) L. Getoor, N. Friedman, D. Koller, A. Pfeffer, Learning probabilistic relational models, in: S. Džeroski, N. Lavrač (Eds.), Relational Data Mining, Springer, 2001, pp. 307–335.
  • (71) K. Kersting, L. De Raedt, Bayesian logic programming: theory and tool, in: L. Getoor, B. Taskar (Eds.), An Introduction to Statistical Relational Learning, MIT Press, 2007, pp. 291–321.
  • (72) L. De Raedt, A. Kimmig, H. Toivonen, ProbLog: A probabilistic Prolog and its application in link discovery, in: Proceedings of the 20th international joint conference on artificial intelligence, 2007, pp. 2462–2467.
  • (73) K. Kersting, Lifted probabilistic inference, in: Proceedings of the 20th European Conference on Artificial Intelligence, Vol. 242 of Frontiers in Artificial Intelligence and Applications, IOS Press, 2012, pp. 33–38.
  • (74) K. Kersting, B. Ahmadi, S. Natarajan, Counting belief propagation, in: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, AUAI Press, Arlington, Virginia, United States, 2009, pp. 277–284.
  • (75) T. Gärtner, A survey of kernels for structured data, SIGKDD Explorations 5 (1) (2003) 49–58.
  • (76) G. Wachman, R. Khardon, Learning from interpretations: a rooted kernel for ordered hypergraphs, in: Proceedings of the 24th international conference on Machine learning, 2007, pp. 943–950.
  • (77) S. Kramer, N. Lavrač, P. Flach, Propositionalization approaches to relational data mining, in: S. Džeroski, N. Lavrač (Eds.), Relational Data Mining, Springer, 2001, pp. 262–291.
  • (78) J. R. Quinlan, Learning logical definitions from relations, Machine Learning 5 (1990) 239–266.
  • (79) R. A. Rossi, L. K. McDowell, D. W. Aha, J. Neville, Transforming graph data for statistical relational learning, Journal of Artificial Intelligence Research 45 (1) (2012) 363–441.
  • (80) N. Lao, W. W. Cohen, Relational retrieval using a combination of path-constrained random walks, Machine Learning 81 (1) (2010) 53–67.
  • (81) D. J. Cook, L. B. Holder, Mining graph data, Wiley-Interscience, 2006.
  • (82) Y. Sun, J. Han, Mining heterogeneous information networks: principles and methodologies, Synthesis Lectures on Data Mining and Knowledge Discovery 3 (2) (2012) 1–159.
  • (83) M. Chang, L. Ratinov, N. Rizzolo, D. Roth, Learning and inference with constraints, in: Proceedings of AAAI, 2008, pp. 1513–1518.
  • (84) A. McCallum, K. Schultz, S. Singh, FACTORIE: Probabilistic programming via imperatively defined factor graphs, in: Neural Information Processing Systems (NIPS), 2009, pp. 1249–1257.
  • (85) M. Verbeke, V. Van Asch, R. Morante, P. Frasconi, W. Daelemans, L. De Raedt, A statistical relational learning approach to identifying evidence based medicine categories, in: Proc. of EMNLP-CoNLL, 2012, pp. 579–589.
  • (86) A. Kimmig, B. Demoen, L. De Raedt, V. S. Costa, R. Rocha, On the implementation of the probabilistic logic programming language problog, Theory and Practice of Logic Programming 11 (2-3) (2011) 235–262.
  • (87) M. Verbeke, P. Frasconi, V. Van Asch, R. Morante, W. Daelemans, L. De Raedt, Kernel-based logical and relational learning with kLog for hedge cue detection, in: S. H. Muggleton, H. Watanabe, A. Tamaddoni-Nezhad, J. Chen (Eds.), Inductive Logic Programming: Revised Selected Papers from the 21st International Conference (ILP 2011), Lecture Notes in Artificial Intelligence, 2012, pp. 347–357.
  • (88) P. Kordjamshidi, P. Frasconi, M. Van Otterlo, M.-F. Moens, L. De Raedt, Spatial relation extraction using relational learning, in: S. H. Muggleton, H. Watanabe, A. Tamaddoni-Nezhad, J. Chen (Eds.), Inductive Logic Programming: Revised Selected Papers from the 21st International Conference (ILP 2011), Lecture Notes in Artificial Intelligence, 2012, pp. 204–220.
  • (89) L. Antanas, P. Frasconi, F. Costa, T. Tuytelaars, L. De Raedt, A relational kernel-based framework for hierarchical image understanding, in: G. L. Gimel’farb, E. R. Hancock, A. Imiya, A. Kuijper, M. Kudo, S. Omachi, T. Windeatt, K. Yamada (Eds.), Structural, Syntactic, and Statistical Pattern Recognition, Lecture Notes in Computer Science, Springer, 2012, pp. 171–180.
  • (90) L. Antanas, M. Hoffmann, P. Frasconi, T. Tuytelaars, L. De Raedt, A relational kernel-based approach to scene classification, in: WACV, 2013, pp. 133–139.
  • (91) L. Antanas, P. Frasconi, T. Tuytelaars, L. De Raedt, Employing logical languages for image understanding, in: IEEE Workshop on Kernels and Distances for Computer Vision, International Conference on Computer Vision, Barcelona, Spain, 2011.
  • (92) J. L. Gross, J. Yellen, Graph Theory and Its Applications, Second Edition (Discrete Mathematics and Its Applications), Chapman & Hall/CRC, 2005.
  • (93) D. Haussler, Convolution kernels on discrete structures, Tech. Rep. UCSC-CRL-99-10, University of California, Santa Cruz (1999).

Appendix B Definitions

For the sake of completeness we report here a number of graph-theoretical definitions used in the paper. We closely follow the notation in GrossYellen2005 . A graph $G$ consists of two sets $V$ and $E$. The notation $V(G)$ and $E(G)$ is used when $G$ is not the only graph under consideration. The elements of $V$ are called vertices and the elements of $E$ are called edges. Each edge has a set of two elements of $V$ associated with it, called its endpoints, which we denote by concatenating the vertex variables, e.g. we represent the edge between the vertices $u$ and $v$ by $uv$. An edge is said to join its endpoints. A vertex $u$ is adjacent to a vertex $v$ if they are joined by an edge. An edge and a vertex on that edge are called incident. The degree of a vertex is the number of edges incident to it. A multi-edge is a collection of two or more edges having identical endpoints. A self-loop is an edge that joins a single endpoint to itself. A simple graph is a graph that has neither self-loops nor multi-edges. A graph is bipartite if its vertex set can be partitioned into two subsets $V_1$ and $V_2$ so that every edge has one endpoint in $V_1$ and the other in $V_2$; we denote a bipartite graph with partition $(V_1, V_2)$ by $G = (V_1, V_2, E)$. A graph is rooted when one of its vertices, called the root, is distinguished; we denote a rooted graph $G$ with root vertex $v$ by $G^v$. A walk in a graph is a sequence of vertices $v_0, v_1, \ldots, v_n$ such that for $i = 1, \ldots, n$ the vertices $v_{i-1}$ and $v_i$ are adjacent. The length of a walk is the number of edges it traverses (counting repetitions). A path is a walk in which no vertex is repeated, except possibly the initial ($v_0$) and the final ($v_n$) vertex (in this case it is called a cycle). The distance between two vertices $u$ and $v$, denoted $D(u,v)$, is the length of the shortest path between them. A graph is connected if between each pair of vertices there exists a walk. We denote the class of simple connected graphs by $\mathcal{G}$. The neighborhood of a vertex $v$ is the set of vertices that are adjacent to $v$ and is denoted by $N(v)$. The neighborhood of radius $r$ of a vertex $v$ is the set of vertices at a distance less than or equal to $r$ from $v$ and is denoted by $N_r(v)$. In a graph $G$, the subgraph induced by a set of vertices $W \subseteq V$ is the graph that has $W$ as its vertex set and contains every edge of $G$ whose endpoints are both in $W$. A subgraph $G' = (V', E')$ is a spanning subgraph of a graph $G = (V, E)$ if $V' = V$. The neighborhood subgraph of radius $r$ of vertex $v$ is the subgraph induced by the neighborhood of radius $r$ of $v$ and is denoted by $N_r^v$. A labeled graph is a graph whose vertices and/or edges are labeled, possibly with repetitions, using symbols from a finite alphabet. We denote the function that maps a vertex or edge $x$ to its label symbol by $\ell(x)$. Two simple graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are isomorphic, which we denote by $G_1 \simeq G_2$, if there is a bijection $\phi : V_1 \to V_2$ such that for any two vertices $u, v \in V_1$, there is an edge $uv \in E_1$ if and only if there is an edge $\phi(u)\phi(v) \in E_2$. An isomorphism is a structure-preserving bijection. Two labeled graphs are isomorphic if there is an isomorphism that also preserves the label information, i.e. $\ell(\phi(x)) = \ell(x)$ for every vertex and edge $x$. An isomorphism invariant or graph invariant is a graph property that is identical for any two isomorphic graphs (e.g. the number of vertices and/or edges). A certificate for isomorphism is an isomorphism invariant that is identical for two graphs if and only if they are isomorphic.
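
To make the neighborhood-subgraph definition concrete, here is a minimal sketch (our own illustration, not part of the original paper) that computes $N_r^v$ for a toy graph with the networkx library; the example graph, the chosen root and the radius are illustrative assumptions only.

```python
# Sketch: radius-r neighborhood subgraph N_r^v, following the definitions
# above (distance, neighborhood of radius r, induced subgraph).
# The example graph and the chosen root/radius are illustrative only.
import networkx as nx

def neighborhood_subgraph(G, v, r):
    """Return the subgraph induced by all vertices at distance <= r from v."""
    # Shortest-path distances from v, truncated at radius r.
    dist = nx.single_source_shortest_path_length(G, v, cutoff=r)
    N_r = set(dist)                 # neighborhood of radius r (includes v)
    return G.subgraph(N_r).copy()   # induced subgraph on N_r

if __name__ == "__main__":
    G = nx.Graph()
    G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("b", "e")])
    H = neighborhood_subgraph(G, "b", 1)
    print(sorted(H.nodes()))  # ['a', 'b', 'c', 'e']
    print(sorted(H.edges()))  # every edge of G with both endpoints in N_1(b)
```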

A hypergraph is a generalization of a graph, also known under the name of set system. A set system is an ordered pair $(V, \mathcal{S})$ where $V$ is a set of elements and $\mathcal{S}$ is a family of subsets of $V$. Note that when $\mathcal{S}$ consists of pairs of elements of $V$, then $(V, \mathcal{S})$ is a simple graph. The elements of $V$ are called the vertices of the hypergraph and the elements of $\mathcal{S}$ its hyperedges. There are two principal ways to represent set systems as graphs: as incidence graphs and as intersection graphs. In the following we consider only incidence graphs. Given a set system $(V, \mathcal{S})$, the associated incidence graph is the bipartite graph $(V, \mathcal{S}, E)$ in which $v \in V$ and $s \in \mathcal{S}$ are adjacent if $v \in s$.
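
As a small illustration of the incidence-graph construction (again a sketch of ours, not taken from the paper; the set system below is invented for the example), one can build the bipartite incidence graph directly:

```python
# Sketch: incidence graph of a set system (V, S) as a bipartite graph.
# Vertices of the hypergraph go on one side, hyperedges on the other,
# and v is connected to s whenever v is a member of s.
import networkx as nx

def incidence_graph(V, S):
    """S is a dict mapping hyperedge names to sets of vertices drawn from V."""
    B = nx.Graph()
    B.add_nodes_from(V, bipartite=0)         # vertex side
    B.add_nodes_from(S.keys(), bipartite=1)  # hyperedge side
    for name, members in S.items():
        B.add_edges_from((v, name) for v in members if v in V)
    return B

if __name__ == "__main__":
    V = {"v1", "v2", "v3", "v4"}
    S = {"e1": {"v1", "v2", "v3"}, "e2": {"v3", "v4"}}
    B = incidence_graph(V, S)
    print(sorted(B.edges()))  # (v, e) pairs with v a member of hyperedge e
```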

Appendix C Decomposition Kernels

We follow the notation in Haussler99:other . Given a set $X$ and a function $K : X \times X \to \mathbb{R}$, we say that $K$ is a kernel on $X \times X$ if $K$ is symmetric, i.e. if $K(x, y) = K(y, x)$ for any $x, y \in X$, and if $K$ is positive-semidefinite, i.e. if for any $n \geq 1$ and any $x_1, \ldots, x_n \in X$, the matrix $K$ defined by $K_{ij} = K(x_i, x_j)$ is positive-semidefinite, that is $\sum_{i,j} c_i c_j K_{ij} \geq 0$ for all $c_1, \ldots, c_n \in \mathbb{R}$, or equivalently if all its eigenvalues are nonnegative. It is easy to see that if each $x \in X$ can be represented as $\phi(x) = \{\phi_m(x)\}_{m \geq 1}$ such that $K$ is the ordinary dot product $K(x, y) = \langle \phi(x), \phi(y) \rangle = \sum_m \phi_m(x)\,\phi_m(y)$, then $K$ is a kernel. The converse is also true under reasonable assumptions (which are almost always verified) on $X$ and $K$, that is, a given kernel $K$ can be represented as $K(x, y) = \langle \phi(x), \phi(y) \rangle$ for some choice of $\phi$. In particular this holds for any kernel $K$ over $X \times X$ where $X$ is a countable set. The vector space induced by $\phi$ is called the feature space. Note that it follows from the definition of positive-semidefiniteness that the zero-extension of a kernel is a valid kernel, that is, if $S \subseteq X$ and $K$ is a kernel on $S \times S$, then $K$ may be extended to be a kernel on $X \times X$ by defining $K(x, y) = 0$ if $x$ or $y$ is not in $S$. It is easy to show that kernels are closed under summation, i.e. a sum of kernels is a valid kernel.
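
The following sketch (our own illustration, not from the paper) checks these properties numerically on a toy kernel: it builds a Gram matrix, verifies that its eigenvalues are nonnegative, and shows that zero-extension and summation preserve positive-semidefiniteness on the sampled points. The bag-of-characters kernel and the example strings are arbitrary assumptions made for the example.

```python
# Sketch: numerically checking kernel properties on a toy example.
# k(x, y) counts shared characters (a dot product of bag-of-character
# count vectors), hence a valid kernel; PSD-ness is checked via eigenvalues.
from collections import Counter
import numpy as np

def k(x, y):
    """Bag-of-characters dot product: sum over characters of count_x * count_y."""
    cx, cy = Counter(x), Counter(y)
    return float(sum(cx[c] * cy[c] for c in cx))

def gram(kernel, xs):
    return np.array([[kernel(a, b) for b in xs] for a in xs])

def is_psd(M, tol=1e-9):
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

S = ["abc", "abd", "bcd"]        # the set S on which k is defined
X = S + ["zzz_outside"]          # a larger set X containing S

def k_ext(x, y):
    """Zero-extension of k from S x S to X x X."""
    return k(x, y) if (x in S and y in S) else 0.0

# Intersection-size kernel: dot product of binary indicator vectors.
k2 = lambda a, b: float(len(set(a) & set(b)))

print(is_psd(gram(k, S)))                    # True: k is a kernel on S x S
print(is_psd(gram(k_ext, X)))                # True: zero-extension stays PSD
print(is_psd(gram(k, S) + gram(k2, S)))      # True: a sum of kernels is a kernel
```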

Let now $x \in X$ be a composite structure such that we can define $x_1, \ldots, x_D$ as its parts (note that the set of parts need not be a partition of the composite structure, i.e. the parts may “overlap”). Each part is such that $x_d \in X_d$ for $d = 1, \ldots, D$, with