Introduction
Statistical Relational Learning (SRL) models [Getoor and Taskar 2007, De Raedt et al. 2016]
combine the representational power of logic with the ability of probability theory specifically, and statistical models in general, to model noise and uncertainty. They range from directed models
[Kersting and De Raedt 2007, Koller 1999, Heckerman, Meek, and Koller 2007, Kazemi et al. 2014a, Neville and Jensen 2007, Kazemi and Poole 2018] to undirected models [Richardson and Domingos 2006, Taskar et al. 2007, Kimmig et al. 2012]. We consider the more recent, well-understood directed model of Relational Logistic Regression (RLR) [Kazemi et al. 2014a, Kazemi et al. 2014b, Fatemi, Kazemi, and Poole 2016]. One of the key advantages of RLR is that it scales well with population size, unlike other methods such as Markov Logic Networks [Poole et al. 2014], and hence can potentially be used as a powerful modeling tool for many tasks. While these models are attractive from the modeling perspective, learning them is computationally intensive. This is because (as in the field of Inductive Logic Programming) learning occurs at multiple levels of abstraction: at the level of objects, subgroups of objects, and relations, and possibly at individual instances of the objects. Hence, most methods for learning these models have so far focused on the task of learning the so-called parameters (weights of the logistic function), where the rules (or relational features) are provided by a human expert and the data is merely used to learn the parameters.
We consider the problem of full model learning, also known as structure learning, of RLR models. A simple solution for learning these models could be to learn the rules separately using a logic learner and then employ parameter learning strategies [Huynh and Mooney 2008]. While reasonably easy to implement, the key issue is that the disconnect between rule and parameter learning can result in poor predictive performance, as shown repeatedly in the literature [Natarajan et al. 2012, Richardson and Domingos 2006]. Inspired by the success of non-parametric learning methods for SRL models, we develop a method for full model learning of RLR models.
More specifically, we develop a gradient-boosting technique for learning RLR models. We derive the gradients for the different weights of RLR and show how the rules of the logistic function are learned simultaneously with their corresponding weights. Unlike standard adaptations of functional gradients, RLR requires learning a different set of weights per rule in each gradient step, and hence requires learning multiple weights jointly for a single rule. As we explain later, the gradients correspond to a set of vector-weighted clauses that are learned in a sequential manner. We derive the gradients for these clauses and illustrate how to optimize them.
Each clause can be seen as a relational feature for the logistic function. We also note that RLR can be viewed as a probabilistic combination function, in that it can stochastically combine the distributions due to different sets of parents (in graphical model terminology). Hence, if our learning technique is employed in the context of learning joint models, our work can be seen as a new interpretation of learning boosted Relational Dependency Networks (RDNs) [Neville and Jensen 2007, Natarajan et al. 2012], where the standard aggregators are replaced with a logistic regression combination function, which could potentially yield interesting insights into directed SRL models. We demonstrate the effectiveness of this combination function on real data sets and compare against several baselines, including state-of-the-art MLN learning algorithms.
The rest of the paper is organized as follows: first, we introduce the necessary background and notation. Next, we derive the gradients and present the algorithm for learning RLR models. Finally, we conclude the paper by presenting our extensive experimental evaluations on standard SRL data sets and an NLP data set, and by outlining areas for future research.
Background and Notation
In this section, we define our notation and provide the necessary background for our readers to follow the rest of the paper. Throughout the paper, we assume True is represented by 1 and False is represented by 0.
Logistic Regression
Let $Q$ be a Boolean random variable with range $\{1, 0\}$ whose value depends on a set $\{X_1, \dots, X_n\}$ of random variables. Logistic regression [McCullagh 1984] defines the conditional probability of $Q$ given $X_1, \dots, X_n$ as the sigmoid of a weighted sum of the $X_i$s:

$P(Q = 1 \mid X_1, \dots, X_n) = \sigma(w_0 + w_1 X_1 + \dots + w_n X_n)$   (1)

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.
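For intuition, the definition above can be computed directly in a few lines; the weights and feature values below are arbitrary, chosen only to illustrate the sigmoid of a weighted sum:

```python
import math

def sigmoid(z):
    # sigmoid(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + math.exp(-z))

def logistic_regression_prob(weights, features, bias):
    # P(Q = 1 | X_1..X_n) = sigmoid(w_0 + sum_i w_i * x_i)
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Arbitrary illustrative weights and Boolean feature values.
p = logistic_regression_prob(weights=[2.0, -1.0], features=[1, 1], bias=-0.5)
```

Here the weighted sum is $-0.5 + 2.0 - 1.0 = 0.5$, so the returned probability is $\sigma(0.5) \approx 0.62$.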
Finite-Domain, Function-Free, First-Order Logic
A population is a finite set of objects. We assume that for every object there is a unique constant denoting that object. A logical variable (logvar) is typed with a population. A term is a logvar or a constant. We write logvars with lowercase letters (e.g., $x$), constants with uppercase letters (e.g., $X$), the population associated with a logvar $x$ as $\Delta_x$, and the size/cardinality of that population as $|\Delta_x|$. A lowercase letter in bold represents a tuple of logvars (e.g., $\mathbf{x}$), and an uppercase letter in bold represents a tuple of constants (e.g., $\mathbf{X}$).
An atom is of the form $F(t_1, \dots, t_k)$ where $F$ is a functor and each $t_i$ is a term. When the range of $F$ is $\{\text{True}, \text{False}\}$, $F$ is a predicate. A substitution is of the form $\theta = \{\langle x_1, \dots, x_k \rangle / \langle t_1, \dots, t_k \rangle\}$ where the $x_i$s are logvars and the $t_i$s are terms. A grounding of an atom with logvars is a substitution mapping each of its logvars to a constant in the population of the logvar. For a set $\mathcal{A}$ of atoms, $G(\mathcal{A})$ represents the set of all possible groundings for the atoms in $\mathcal{A}$. A literal is an atom or its negation. A formula is a literal, the conjunction $\varphi_1 \wedge \varphi_2$ of two formulae, or the disjunction $\varphi_1 \vee \varphi_2$ of two formulae. The application of a substitution $\theta$ on a formula $\varphi$ is represented as $\varphi\theta$ and replaces each $x_i$ in $\varphi$ with $t_i$. An instance of a formula $\varphi$ is obtained by replacing each logvar $x$ in $\varphi$ by one of the objects in $\Delta_x$. A conjunctive formula contains no disjunction. A weighted formula (WF) is a triple $\langle F, w_T, w_F \rangle$ where $F$ is a formula and $w_T$ and $w_F$ are real numbers.
Relational Logistic Regression
Let $Q(\mathbf{x})$ be a Boolean atom whose probability depends on a set $\Pi$ of atoms such that $Q \notin \Pi$. We refer to $\Pi$ as the parents of $Q$. Let $\psi$ be a set of WFs containing only atoms from $\Pi$, $I$ be a function from groundings in $G(\Pi)$ to truth values, and $\theta$ be a substitution from logvars in $\mathbf{x}$ to constants. Relational logistic regression (RLR) [Kazemi et al. 2014a] defines the probability of $Q(\mathbf{x}\theta)$ given $I$ as follows:

$P(Q(\mathbf{x}\theta) = \text{True} \mid I) = \sigma\Big( w_0 + \sum_{\langle F, w_T, w_F \rangle \in \psi} w_T \cdot n_T(F\theta, I) + w_F \cdot n_F(F\theta, I) \Big)$   (2)

where $w_0$ is a bias/intercept, $n_T(F\theta, I)$ is the number of instances of $F\theta$ that are true with respect to $I$, and $n_F(F\theta, I)$ is the number of instances of $F\theta$ that are false with respect to $I$. Note that $n_T(F\theta, I) + n_F(F\theta, I)$ equals the total number of instances of $F\theta$. Also note that the bias $w_0$ can be considered as a WF whose formula is True. Following Kazemi et al. [Kazemi et al. 2014a], without loss of generality we assume the formulae of all WFs for RLR models are conjunctive.
Example 1
Let $\text{Active}(p)$, $\text{AdvisedBy}(s, p)$, and $\text{PhD}(s)$ be three atoms representing respectively whether a professor $p$ is active, whether a student $s$ is advised by professor $p$, and whether $s$ is a PhD student. Suppose we want to define the conditional probability of $\text{Active}(p)$ given the atoms $\text{AdvisedBy}(s, p)$ and $\text{PhD}(s)$. Consider an RLR model with an intercept of $-3.5$ which uses only the following WF to define this conditional probability:

$\langle \text{AdvisedBy}(s, p) \wedge \text{PhD}(s), \; 1, \; 0 \rangle$

According to this model, for an assignment $I$ of truth values to the groundings of the parent atoms:

$P(\text{Active}(P) = \text{True} \mid I) = \sigma(-3.5 + n_T)$

where $n_T = |\{ s \in \Delta_s \text{ s.t. } \text{AdvisedBy}(s, P) \wedge \text{PhD}(s) \text{ according to } I \}|$, corresponding to the number of PhD students advised by $P$. When this count is greater than or equal to 4, the probability of $P$ being an active professor is closer to one than zero; otherwise, the probability is closer to zero than one. Therefore, this RLR model represents "a professor is active if the professor advises at least 4 PhD students".
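This example can also be traced numerically. The sketch below uses hypothetical facts and illustrative weights (intercept $-3.5$, true-grounding weight 1, false-grounding weight 0, chosen so that the probability crosses 0.5 at four PhD advisees); the function and dictionary names are ours:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rlr_prob(intercept, w_true, w_false, n_true, n_false):
    # P(Q = True | I) = sigmoid(w0 + w_T * n_T + w_F * n_F), as in Equation (2)
    return sigmoid(intercept + w_true * n_true + w_false * n_false)

# Hypothetical population: for each student, does AdvisedBy(s, P) ^ PhD(s) hold?
advised_phd = {"s1": True, "s2": True, "s3": True, "s4": True, "s5": False}
n_true = sum(advised_phd.values())       # groundings where the formula holds
n_false = len(advised_phd) - n_true      # the remaining groundings are false

p = rlr_prob(intercept=-3.5, w_true=1.0, w_false=0.0,
             n_true=n_true, n_false=n_false)   # 4 PhD advisees -> p > 0.5
```

With four PhD advisees the probability is $\sigma(0.5) \approx 0.62$; with three it drops to $\sigma(-0.5) \approx 0.38$, matching the "at least 4" reading above.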
With this background on Relational Logistic Regression, we introduce the Functional Gradient Boosting paradigm in the following section. This enables us to formulate a learning problem for RLR in which we learn both the structure and the parameters simultaneously.
Functional Gradient Boosting
We discuss the functional gradient boosting (FGB) approach in the context of relational models. This approach is motivated by the intuition that finding many rough rules-of-thumb for how to change one's probabilistic predictions locally can be much easier than finding a single, highly accurate model. Specifically, this approach turns the problem of learning relational models into a series of relational function-approximation problems, using the ensemble method of gradient-based boosting. This is achieved by applying Friedman's [Friedman 2001] gradient boosting to SRL models. That is, we represent the conditional probability distribution as a weighted sum of regression models that are grown via a stagewise optimization [Natarajan et al. 2012, Khot et al. 2011].

The conditional probability of an example $y_i$ depends on its parents $\mathbf{x}_i$. (We use the term example to mean the grounded target literal; hence $y_i = 1$ denotes that the grounding, i.e., the grounded target predicate, is true. Following standard Bayesian network terminology, we take the parents $\mathbf{x}_i$ to include the set of formulae that influence the current predicate.) The goal of learning is to fit a model $P(y_i \mid \mathbf{x}_i)$, which can be expressed non-parametrically in terms of a potential function $\psi(y_i; \mathbf{x}_i)$:

$P(y_i \mid \mathbf{x}_i) = \frac{e^{\psi(y_i; \mathbf{x}_i)}}{\sum_{y'} e^{\psi(y'; \mathbf{x}_i)}}$   (3)
At a high level, we are interested in successively approximating the function $\psi$ as a sum of weak learners, which are relational regression clauses in our setting. Functional gradient ascent starts with an initial potential $\psi_0$ and iteratively adds gradients $\Delta_m$. After $m$ iterations, the potential is given by $\psi_m = \psi_0 + \Delta_1 + \dots + \Delta_m$. Here, $\Delta_m$ is the functional gradient at episode $m$ and is

$\Delta_m = \eta_m \times E_{\mathbf{x}, y}\left[ \frac{\partial}{\partial \psi_{m-1}} \log P(y \mid \mathbf{x}; \psi_{m-1}) \right]$   (4)

where $\eta_m$ is the learning rate. Dietterich et al. [Dietterich, Ashenfelter, and Bulatov 2004] suggested evaluating the gradient at every position in every training example and fitting a regression tree to these derived examples, i.e., fitting a regression tree $\hat{h}_m$ to the training examples $[(\mathbf{x}_i, y_i), \Delta_m(y_i; \mathbf{x}_i)]$. They point out that although the fitted function $\hat{h}_m$ is not exactly the same as the desired $\Delta_m$, it will point in the same direction (assuming that there are enough training examples). Thus, ascent in the direction of $\hat{h}_m$ will approximate the true functional gradient.
Note that in the functional gradient presented in (4), the expectation $E_{\mathbf{x}, y}[\cdot]$ cannot be computed, as the joint distribution $P(\mathbf{x}, y)$ is unknown. Instead of computing the functional gradients over the potential function, they are computed pointwise for each labeled training example $(\mathbf{x}_i, y_i)$: $\Delta_m(y_i; \mathbf{x}_i)$. Now, this set of local gradients becomes the set of training examples used to learn a weak regression model that approximates the gradient $\Delta_m$ at stage $m$. The functional gradient of the likelihood with respect to $\psi(y_i = 1; \mathbf{x}_i)$ for each example can be shown to be:

$\frac{\partial \log P(y_i; \mathbf{x}_i)}{\partial \psi(y_i = 1; \mathbf{x}_i)} = I(y_i = 1) - P(y_i = 1; \mathbf{x}_i)$   (5)

where $I$ is the indicator function, that is, $I(y_i = 1) = 1$ if $y_i = 1$, and $0$ otherwise. This expression is simply the adjustment required to match the predicted probability with the true label of the example. If the example is positive and the predicted probability is less than 1, this gradient is positive, indicating that the predicted probability should move towards 1. Conversely, if the example is negative and the predicted probability is greater than 0, the gradient is negative, driving the value the other way.
This elegant gradient expression might appear simple, but it naturally and intuitively captures, example-wise, the general direction in which the overall model should be grown. The form of the functional gradients is a consequence of the sigmoid function as a modeling choice, and is a defining characteristic of FGB methods. As we show below, our proposed approach to RLR has a similar form. The significant difference, however, is in the novel definition of the potential function $\psi$.
In prior work, relational regression trees (RRTs) [Blockeel 1999] were used to fit the gradient function to the pointwise gradients for every training example. Each RRT can be viewed as defining several new feature combinations, one corresponding to each path from the root to a leaf. A key difference in our work is that we employ weighted formulae (vector-weighted clauses, to be precise; we use formulae and clauses interchangeably from here on), as we explain later. From this perspective, our work is closer to the MLN-boosting work that employed weighted clauses. We generalize this by learning a weight vector per clause, which allows for a more compact representation of the true and false instances of the formula. An example of a weighted clause is provided in Figure 1, where there are four clauses for predicting advisedBy(A,B). Note that while we show standard weighted clauses similar to an MLN, our weighted clauses have an important distinction: corresponding to each clause is a weight vector instead of a single scalar weight, capturing the weight of the true groundings, the false groundings, and the uninformed prior of the clause. The gradient boosting that we develop in the next section builds upon these clauses and, as mentioned earlier, is similar to MLN boosting, with the key difference being that instead of learning one weight per clause, we learn three weights in the vector.
The key intuition with boosting regression clauses is that each clause defines a new feature combination, and the different clauses together capture the latent relationships that are learned from the data. While the final model itself is linear (as it is the sum of the weighted groundings of all the clauses), the clauses themselves define richer features, thus allowing a more complex model to be learned than a simple linear one. Figure 2 presents the schematic for boosting. The idea is that first a regression function (shown as a set of clauses) is learned from the training examples, and these clauses are then used to determine the gradients (weights) of each example in the next iteration. In prior work, the gradient is typically computed as $I - P$, the difference between the true label and the current predicted probability. Once the examples are weighted, a new set of clauses is induced from them. These clauses are then considered together, their regression values are added when weighing the examples, and the process is iterated.
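The boosting loop described above can be summarized in a few lines. The sketch below is a simplified propositional stand-in: `fit_regression` here returns a constant weak model rather than inducing a vector-weighted clause, but the outer loop (score the examples, compute the pointwise gradients $I - P$, fit a weak model to them, add it to the ensemble) mirrors the schematic:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_regression(gradients):
    # Stand-in weak learner: returns the mean gradient as a constant function.
    # The relational algorithm instead induces a vector-weighted clause here.
    mean = sum(gradients) / len(gradients)
    return lambda example: mean

def boost(examples, labels, n_iterations=10):
    ensemble = []  # psi_0 = 0; each entry approximates one functional gradient
    for _ in range(n_iterations):
        # psi(x) is the sum of all weak models learned so far
        psi = [sum(h(x) for h in ensemble) for x in examples]
        # Pointwise functional gradients: Delta_i = I(y_i = 1) - P(y_i = 1)
        gradients = [y - sigmoid(p) for y, p in zip(labels, psi)]
        ensemble.append(fit_regression(gradients))
    return ensemble

def predict(ensemble, example):
    return sigmoid(sum(h(example) for h in ensemble))
```

With all-positive labels, the successive gradients keep pushing the summed potential up, so the predicted probability approaches one; with all-negative labels it is driven symmetrically toward zero.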
There are several benefits to the boosting approach for learning RLR models. First, being a non-parametric approach (i.e., the model size is not chosen in advance), the number of parameters naturally grows as the number of training episodes increases. In turn, relational features, as clauses, are introduced only as necessary, so that a potentially large search space is not explicitly considered. Second, such an algorithm is fast and straightforward to implement. One could potentially employ any relational regression learner in the inner loop to learn several types of models. Third, as with previous relational models, the use of boosting for learning RLR models makes it possible to learn the structure and parameters simultaneously, making them an attractive choice for learning from large-scale data sets [Malec et al. 2016, Yang et al. 2017].
Functional Gradient Boosting for RLR
Preliminaries
Given the background on RLR and the gradient-boosting approach, we now focus on the learning task for RLR. Let us rewrite the conditional probability of an example $y_i$, given the weighted formulae $F_1, \dots, F_J$ over its parents $\mathbf{x}_i$ in the RLR model, as:

$P(y_i = 1 \mid \mathbf{x}_i) = \sigma\Big( \sum_{j} w_0^j + w_T^j \, n_T^j(\mathbf{x}_i) + w_F^j \, n_F^j(\mathbf{x}_i) \Big)$   (6)

where $\sigma$ is the sigmoid function. For example, let $\text{Popular}(p)$ indicate the popularity of a professor $p$. Consider two formulae, $F_1 = \text{Publication}(a, p)$ and $F_2 = \text{AdvisedBy}(s, p)$. The weights $\langle w_0^1, w_T^1, w_F^1 \rangle$ of the first formula control the influence of the number of publications on the popularity of the professor. Similarly, the weights of the second formula control the influence of the number of students advised by the professor. To learn a model for RLR, we thus need to learn these clauses and their weights (the parents are determined by the structure of the clause). Also, we can assume that the bias term is part of the weight vector of every learned clause. This allows a greedy approach that incrementally adds new clauses, such as FGB, to automatically update the bias term by learning $w_0^j$ for each new clause $j$. Our learning problem can be defined as:
Given: A set of grounded facts (features) and the corresponding positive and negative grounded literals (examples).
To Do: Learn a set of formulae $F_1, \dots, F_J$ with their corresponding weight vectors $\mathbf{w}_j = \langle w_0^j, w_T^j, w_F^j \rangle$.
To simplify the learning problem, we introduce vector-weighted clauses (formulae), which generalize traditional weighted clauses with single weights. More specifically, our weighted clauses employ three dimensions, $\mathbf{w} = \langle w_0, w_T, w_F \rangle$, where $w_0$ is a bias/intercept, $w_T$ is the weight over the satisfied groundings of the current clause (analogous to $w_T$ in a WF), and $w_F$ is the weight over the unsatisfied groundings of the current clause (analogous to $w_F$). We also use the shorthand $n_T^i$ and $n_F^i$ for the two grounding counts of example $i$ in Equation 6.
Example 2
Consider an RLR model defining the conditional probability of an atom $q(\mathbf{x})$ with only one WF, whose formula we denote $F$. Let $n_T^i$ be the number of instances of $F$ that are true for the current grounding (example) $i$ of $q(\mathbf{x})$, and let $n_F^i$ be the number of instances of $F$ that are false for example $i$. Using vector-weighted clauses in the RLR model, we can compute

$P(q_i = 1 \mid \mathbf{x}_i) = \sigma(w_0 + w_T \, n_T^i + w_F \, n_F^i).$
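A minimal sketch of such a vector-weighted clause and the resulting RLR probability is given below; the class and function names are ours, and the weights are illustrative:

```python
from dataclasses import dataclass
import math

@dataclass
class VectorWeightedClause:
    # One clause with a three-dimensional weight vector <w0, wT, wF>.
    w0: float   # bias / intercept
    wT: float   # weight on satisfied (true) groundings
    wF: float   # weight on unsatisfied (false) groundings

    def regression_value(self, n_true, n_false):
        # Contribution of this clause to the (pre-sigmoid) potential.
        return self.w0 + self.wT * n_true + self.wF * n_false

def rlr_probability(clauses, counts):
    # counts[j] = (n_true, n_false) for clause j on the current example.
    z = sum(c.regression_value(nt, nf) for c, (nt, nf) in zip(clauses, counts))
    return 1.0 / (1.0 + math.exp(-z))

# One illustrative clause applied to an example with 4 true, 2 false groundings.
clause = VectorWeightedClause(w0=-1.0, wT=0.5, wF=-0.25)
p = rlr_probability([clause], [(4, 2)])
```

Because each clause contributes an additive regression value before the sigmoid, summing these values over many clauses is exactly what the boosted model in the next section does.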
RFGB for RLR
Our goal is to learn the full structure of the model, which involves learning two things: the structure (formulae/clauses) and their associated parameters (the weight vectors). To adapt functional gradient boosting to the task of learning RLR, we map this probability definition from the parameter space to the functional space $\psi$:

$P(y_i = 1 \mid \mathbf{x}_i) = \sigma(\psi(y_i; \mathbf{x}_i))$   (7)

Recall that in FGB, the gradients of the likelihood function with respect to the potential function are computed separately for each example. Correspondingly, the regression function for the $i$th example needs to be clearly defined and is:

$\psi(y_i; \mathbf{x}_i) = \sum_{j} w_0^j + w_T^j \, n_T^j(\mathbf{x}_i) + w_F^j \, n_F^j(\mathbf{x}_i)$   (8)
The key difference from the existing gradient-boosting methods for RDNs [Natarajan et al. 2012] and MLNs [Khot et al. 2011] is that the RLR learning algorithm needs to learn a weight vector per clause instead of a single weight.
Also, recall that while in traditional parametric gradient descent one would compute the parametric gradient of the loss function and iteratively update the parameters with these gradients, in gradient boosting we first compute the functional gradients of the log-likelihood:

$\Delta(y_i; \mathbf{x}_i) = I(y_i = 1) - P(y_i = 1 \mid \mathbf{x}_i)$   (9)
where $I$ is the indicator function. As with other relational functional gradients (see the Functional Gradient Boosting section), this elegant expression falls out naturally when the log-likelihood of the sigmoid is differentiated with respect to the function. As before, the gradient is simply the adjustment required for the probabilities to match the observed value (1 or 0) in the training data for each example. Note that this is simply the outer gradient; that is, the gradient is computed for each example, and a single vector-weighted clause needs to be learned for this set of gradients. While learning the clause itself, we must optimize a different loss function, as we show next.
In order to generalize beyond the training examples, we fit a regression function $\hat{\psi}$ (which is essentially a vector-weighted clause) over the training examples such that the squared error between $\hat{\psi}$ and the functional gradient is minimized over all the examples. The inner loop thus amounts to learning vector-weighted clauses such that we minimize the (regularized) squared error between the RLR model and the functional gradients over the training examples:

$\min_{\mathbf{w}} \; \sum_{i} \big( w_0 + w_T \, n_T^i + w_F \, n_F^i - \Delta(y_i; \mathbf{x}_i) \big)^2 + \lambda \, \|\mathbf{w}\|_2^2$   (10)

where $\lambda$ is a regularization parameter. In principle, $\lambda$ can be chosen using a line search over a validation set when the data sets are large. However, in our experiments, we only considered a few values of $\lambda$ and report the one performing best on the test set.
Close inspection of the loss function above reveals that solving this optimization problem amounts to fitting the count features $\mathbf{c}_i = [1, \, n_T^i, \, n_F^i]$ of each grounded example to the corresponding functional gradient, $\Delta_i$. Note that equation (10) can be viewed as a regularized least-squares regression problem for identifying the weights $\mathbf{w} = \langle w_0, w_T, w_F \rangle$. The problem (10) can be written in vector form as

$\min_{\mathbf{w}} \; \|C \mathbf{w} - \boldsymbol{\Delta}\|_2^2 + \lambda \|\mathbf{w}\|_2^2$

where the $i$th row of the count matrix $C$ holds the count features $\mathbf{c}_i$ of the $i$th example. This problem has a closed-form solution that can be computed efficiently:

$\mathbf{w} = (C^\top C + \lambda I)^{-1} \, C^\top \boldsymbol{\Delta}$   (11)

The quantity $C^\top C$ captures the count covariance across examples, while the quantity $C^\top \boldsymbol{\Delta}$ captures the count-weighted gradients:

$C^\top C = \sum_i \mathbf{c}_i \mathbf{c}_i^\top, \qquad C^\top \boldsymbol{\Delta} = \sum_i \Delta_i \, \mathbf{c}_i$   (12)
In this manner, functional gradient boosting enables a natural combination of conditionals over all the examples. This weight update forms the backbone of our approach: boosted relational logistic regression or brlr.
Algorithm for brlr
We outline the algorithm for boosted RLR (brlr) learning in Algorithm 1. We initialize the regression function with a uniform prior, i.e., $\psi_0$ (line 2). Given the input training examples, which correspond to the grounded instances of the target predicate, and the set of facts, i.e., the grounded set of all other predicates in the domain, the goal is to learn the set of vector-weighted clauses that influence the target predicate.

Since there could potentially be multiple target predicates (when learning a joint model such as an RDN, where each influence relation is an RLR), we denote the current predicate as the target. In the $m$th iteration of functional-gradient boosting, we compute the functional gradients of these examples using the current model and the parents of the target as per this model (line 7). Given these regression examples, we learn a vector-weighted clause using FitRegression. This function uses all the other facts to learn the structure and parameters of the clause. We then add this regression function, which approximates the functional gradients, to the current model $\psi_m$. We repeat this over $M$ iterations, where $M$ is typically set to 10 in our experiments.
Next, we describe FitRegression (Algorithm 2), which learns a vector-weighted clause from the input regression examples, facts, and target predicate. We initialize the vector-weighted clause with an empty body and zero weights, i.e., $\mathbf{w} = \langle 0, 0, 0 \rangle$. We first create all possible literals that can be added to the clause given the current body. We use modes [Muggleton and De Raedt 1994] from inductive logic programming (ILP) to efficiently find the relevant literals here.
For each literal in this set, we calculate the true and false groundings of the newly generated clause obtained by adding the literal to the body. To perform this calculation, we ground the left-hand side of the Horn clause (i.e., the query literal) and count the number of groundings of the body corresponding to the query grounding. For instance, if the grounding of the query is advisedBy(John, Mary), corresponding to advisedBy(student, prof), then we count the number of instances of the body that correspond to John and Mary. If the body is about publications in common, they are counted accordingly; if the body is about courses John took, they are counted correspondingly. This is similar to counting in any relational probabilistic model such as MLNs or BLPs. Following standard SRL models, we assume a closed world. This allows us to deduce the number of false groundings as the difference between the total number of possible groundings and the number of counted (positive) groundings.
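Counting under the closed-world assumption can be sketched as follows; the facts, populations, and predicate names are hypothetical. The number of false groundings is obtained as the total number of possible groundings minus the counted true ones:

```python
from itertools import product

# Hypothetical facts: the true groundings of a body predicate
# publication(paper, student, prof).
facts = {
    "publication": {("p1", "John", "Mary"), ("p2", "John", "Mary")},
}
populations = {"paper": ["p1", "p2", "p3"]}

def count_groundings(query_constants, body_logvars):
    """Count true/false groundings of the body publication(paper, S, P)
    for a fixed query grounding advisedBy(S, P)."""
    student, prof = query_constants
    total = 0
    n_true = 0
    # Enumerate every binding of the body-only logvars (here just `paper`).
    for binding in product(*(populations[v] for v in body_logvars)):
        total += 1
        if (binding[0], student, prof) in facts["publication"]:
            n_true += 1
    # Closed world: any grounding not listed as a fact is false.
    return n_true, total - n_true

n_true, n_false = count_groundings(("John", "Mary"), ["paper"])
```

For the query grounding advisedBy(John, Mary), two of the three possible papers are joint publications, giving counts (2, 1).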
We can then calculate the count matrix $C$ and the weights $\mathbf{w}$ as described earlier. We score each literal based on the squared error and greedily pick the best-scoring literal. We repeat this process until the clause reaches its maximum allowed length (set to 4 in our experiments).
To summarize: given a target, the algorithm computes the gradient $I - P$ for all the examples. Given these gradients, the inner loop searches over the possible clauses such that the mean squared error is minimized. The resulting vector-weighted clauses are added to the set of formulae and are used in the subsequent steps of gradient computation. The procedure is repeated until convergence or until a preset number of formulae is learned. The search for the best clause can be guided by experts, who can provide relevant search information as modes [Muggleton and De Raedt 1994]. The overall procedure is similar to that for RDNs and MLNs, with two significant differences: the need for multiple weights per clause and, correspondingly, the different optimization function inside the inner loop of the algorithm.
Given that we have outlined the brlr algorithm in detail, we now turn our focus to empirical evaluation of this algorithm.
Experiments and Results
Our experiments aim to answer the following questions in order to demonstrate the benefits of brlr:

(Q1) How does functional gradient boosting perform when compared to traditional learning approaches for clauses and weights?

(Q2) How does the boosted method perform compared to a heavily feature-engineered logistic regression approach?

(Q3) How does boosting RLR compare to other relational methods?

(Q4) How sensitive is the behavior of the proposed approach with respect to the regularization constant $\lambda$?
Methods Considered
We now compare our brlr approach to: (1) agglr, standard logistic regression (LR) using aggregated relational information; (2) ilprlr, in which rules are learned using a logic learner, followed by weight learning for the formulae; and (3) mlnb, a state-of-the-art boosted MLN structure learning method. We evaluate our approach on a synthetic data set and on real-world data sets. Table 1 shows sample aggregate (relational) features with the highest weights as generated for agglr.
A natural question is how our method compares against the recently successful boosted Relational Dependency Network [Natarajan et al. 2012] (bRDN) method. We do not consider this comparison for two reasons. First, mlnb has already been compared against bRDN in the original work, with the conclusion that they were nearly on par in performance in all the domains, while bRDN is more efficient due to its use of existentials instead of counts when grounding clauses. Second, since our approach, like agglr, heavily employs counts, we considered the best learning method that employs counts as an aggregator, namely mlnb. Our goal is not to demonstrate that brlr is more effective than the well-known mlnb or bRDN approaches, but to demonstrate that boosting RLR does not sacrifice the performance of learners and that RLR can be boosted as effectively as other relational probabilistic models.
In contrast to our approach, which performs parameter and structure learning simultaneously, the ilprlr baseline performs these steps sequentially. More specifically, we use PROGOL [Muggleton 1995, Muggleton 1997] for structure learning, followed by relational logistic regression for parameter learning. Table 3 shows the number of rules learned for each data set by PROGOL, along with some sample rules with the highest coverage scores.
Domains | Sample Features
IMDB | count_genres_acted, count_movies_acted
UWCSE | count_publications, count_taughtby
Movie Lens | count_movies, average_ratings
SmokesCancerFriends | no_of_friends, no_of_friends_smoke
WebKB | count_project, count_courseta
To keep comparisons as fair as possible, we used the following protocol: when employing mlnb, we bounded the maximum number of clauses, the beam width, and the maximum clause length, and adopted similar configurations in our clause-learning setting. The number of gradient steps for mlnb and brlr was picked based on performance.
Data Sets
SmokesCancerFriends: This is a synthetic data set where the goal is to predict who has cancer based on the friendship network of individuals and their observed smoking habits. The data set has three predicates: friends(x, y), smokes(x), and cancer(x). We evaluated the method on the cancer predicate, using the other predicates, with 4-fold cross-validation.
UWCSE: The UWCSE data set [Richardson and Domingos 2006] was created from the University of Washington's Computer Science and Engineering department's student database and consists of details about professors, students, and courses from five subareas of computer science (AI, programming languages, theory, systems, and graphics). The data set includes predicates such as professor, student, publication, advisedBy, hasPosition, and taughtBy. Our task is to learn, using the other predicates, to predict the advisedBy relation between a student and a professor; only a small fraction of the possible relations are true. We employ 5-fold cross-validation, where we learn from four areas and predict on the remaining area.
IMDB: The IMDB data set was first used by Mihalkova and Mooney [Mihalkova and Mooney 2007] and contains five predicates: actor, director, genre, movie, and workedUnder. We predict the workedUnder relation between an actor and a director using the other predicates. Following [Kok and Domingos 2009], we omitted the four equality predicates. We employed cross-validation using the fold-generation strategy suggested by Mihalkova and Mooney [Mihalkova and Mooney 2007] and averaged the results.
WebKB: The WebKB data set was first created by Craven et al. [Craven et al. 1998] and contains information about department webpages and the links between them. It also contains the categories of each webpage and the words within each page. This data set was converted by Mihalkova and Mooney [Mihalkova and Mooney 2007] to contain only the category of each webpage and the links between these pages. From these webpages they created predicates such as student, faculty, courseTA, courseProf, and project. We evaluated the method on a single target predicate, using the other predicates, and performed 4-fold cross-validation where each fold corresponds to one university.
Movie Lens: This is the well-known MovieLens data set [Harper and Konstan 2015], containing information about users, movies, the movies rated by each user, and the actual rating the user has given to each movie. In our experiments, we ignored the actual ratings and only considered whether a movie was rated by a user or not. Also, since gender can take only two values, we converted the gender predicate to a single-argument predicate. We learned brlr models for the target predicate using cross-validation.
A key property of these relational data sets is the large number of negative examples, as depicted in Table 2, which shows the sizes of the data sets used in our experiments. In relational settings, the vast majority of relations between objects do not hold, so the number of negative examples far outnumbers the number of positive examples. In these data sets, simply measuring accuracy or log-likelihood can be misleading; hence, we use metrics that are reliable in imbalanced settings like ours.
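To see why accuracy misleads under such imbalance, consider a toy example: with 2 positives among 20 examples, the trivial all-negative classifier already attains 90% accuracy, while average precision (the usual summary of the PR curve) still reflects ranking quality. The scores below are made up:

```python
def average_precision(labels, scores):
    # AP = mean, over the positives, of the precision at each positive,
    # scanning the examples in decreasing order of predicted score.
    ranked = sorted(zip(scores, labels), reverse=True)
    tp, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            precisions.append(tp / rank)
    return sum(precisions) / len(precisions)

# 2 positives among 20 examples: heavy imbalance, as in relational data.
labels = [1, 1] + [0] * 18
scores = [0.9, 0.4] + [0.5] + [0.1] * 17

majority_accuracy = labels.count(0) / len(labels)   # all-negative classifier
ap = average_precision(labels, scores)
```

The all-negative classifier scores 0.9 accuracy while finding no positives at all, whereas average precision rewards the ranker for placing both positives near the top.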
Data Set | Target | # Types | # Predicates | neg:pos Ratio
SmCaFr | cancer | 1 | 3 | 1.32
IMDB | workedUnder | 3 | 6 | 13.426
UWCSE | advisedBy | 9 | 12 | 539.629
WebKB |  | 3 | 6 | 4.159
Movie Lens |  | 7 | 6 | 2.702
Target (Data Set)  Sample rules generated for ilprlr using PROGOL  
Results
We present the results of our experiments in Figures 3 and 4, which compare the various methods on two metrics: area under the ROC curve (AUC-ROC) and area under the precision-recall curve (AUC-PR), respectively. From these figures, several observations can be made.
First, the proposed brlr method is on par with or better than most methods across all data sets. On deeper inspection, the comparison with the state-of-the-art boosting method for MLNs appears more mixed in ROC-space, while brlr is generally better in PR-space. In addition, in the WebKB, MovieLens, and SmokesCancerFriends domains, where we learn about a unary predicate, the performance is significantly better. This yields an interesting insight: RLR models can be natural aggregators over the associated features. In the unary-predicate setting (which corresponds to predicting an attribute of an object), counting the instances of the body of the clause simply aggregates over the values of the body. This aggregation is typically done in several different ways, such as mean, weighted mean, or noisy-or [Natarajan et al. 2008]. We suggest the logistic function over counts as an alternative aggregator that appears effective in these domains, and we hypothesize its use for many relational tasks where aggregation can yield natural models. In contrast, MLNs only employ counts as their features, while RLR allows a more complex aggregation within the sigmoid function that can use count features in its inner loop. Validating this positive aspect of RLR models remains an interesting direction for future research. These results answer Q3 affirmatively: brlr is on par with or significantly better than mlnb in all domains.
Learning Time (seconds)
Target (Data Set) | agglr  | ilprlr | mlnb   | brlr
(IMDB)            | 67.39  | 158.02 | 6.99   | 10.95
(UWCSE)           | 110.58 | 65.058 | 18.40  | 24.71
(MovieLens)       | 91.27  | 78.57  | 6.66   | 16.51
(SmCaFr)          | 35.18  | 3.26   | 120.73 | 90.47
(WebKB)           | 51.73  | 2.18   | 5.68   | 5.86
Next, from the figures, it can be observed that brlr significantly outperforms agglr in several domains. At first this may not seem surprising, since relational models have been shown to outperform non-relational ones. However, the features created for the agglr model are count features of the type defined in the original RLR work, and are more expressive than the standard features of propositional models. This result is particularly insightful: the brlr model, which uses count features, predicates, and their combinations within a formula, is far more expressive than simple aggregate features. This allows us to answer Q2 strongly and affirmatively: the proposed approach is significantly better than an engineered (relational) logistic regression approach.
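For intuition, the engineered features handed to agglr are per-object count features computed once from the relational facts and then fed to a propositional learner. A hypothetical sketch of this construction (the predicates and constants are toy examples, not our benchmark data):

```python
from collections import defaultdict

# relational facts as (relation, subject, object) triples -- toy data
facts = [
    ("actor_in", "kurtz", "m1"), ("actor_in", "kurtz", "m2"),
    ("actor_in", "bacon", "m1"),
    ("directed", "coppola", "m1"),
]

def count_features(facts, relations):
    # flatten relational data into per-entity count features, e.g.
    # table["kurtz"]["actor_in"] == number of movies kurtz acted in
    counts = defaultdict(lambda: defaultdict(int))
    for rel, subj, _obj in facts:
        if rel in relations:
            counts[subj][rel] += 1
    return {e: {r: f[r] for r in relations} for e, f in counts.items()}

table = count_features(facts, ["actor_in", "directed"])
print(table["kurtz"])  # {'actor_in': 2, 'directed': 0}
```

Such features capture how often each relation holds for an object, but they cannot express conjunctions of predicates inside a clause body, which is where the boosted model gains its additional expressiveness.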
Finally, comparing the proposed approach to a two-step approach of learning clauses followed by their weights, brlr is significantly better in both PR space and ROC space than ilprlr. Furthermore, brlr has the distinct advantage of simultaneous parameter and structure learning, avoiding the costly structure search of the ILP-based approach. Hence, Q1 can be answered: the brlr model outperforms ILP-based two-stage learning in all regions. Figure 5 shows that Q4 is answered affirmatively: as long as the parameter lies within the considered set, our algorithm is not sensitive to it in most domains. In addition, the parameter has an intuitive interpretation: it reflects and incorporates the high class imbalance of real-world domains, where this is an important practical consideration.
We used a paired t-test with p-value = 0.05 to determine statistical significance. From Figures 3 and 4, we observe that brlr has tighter error bounds than the baselines in the majority of domains, indicating lower variance and, consequently, better generalization.
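The paired t-test over cross-validation folds can be computed directly from per-fold score differences; a self-contained sketch (the fold scores below are made up for illustration):

```python
import math

def paired_t_statistic(scores_a, scores_b):
    # paired t-test statistic over matched folds:
    # t = mean(d) / (sd(d) / sqrt(n)), with d_i = a_i - b_i
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean_d / math.sqrt(var_d / n)

# hypothetical per-fold AUC-PR for two methods over 5 folds
method_a = [0.82, 0.79, 0.85, 0.80, 0.84]
method_b = [0.75, 0.76, 0.78, 0.74, 0.77]
t = paired_t_statistic(method_a, method_b)
# compare |t| against the critical value for n-1 = 4 dof at p = 0.05 (~2.776)
print(t)  # ~7.746, so the difference is significant at p = 0.05
```

Pairing by fold removes the fold-to-fold variance shared by both methods, which is why this test is standard for cross-validated comparisons.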
Table 4 reports the training time in seconds taken by each method, averaged over all folds in every domain. Timings reported for agglr include the time for propositional feature construction and weight learning using the WEKA tool. For ilprlr, they include the total time to learn rules, count satisfied instances, and learn the corresponding rule weights. The timings of the two boosted approaches are reported from the full runs. The results show that the methods are comparable across all domains: in the domains where the boosted methods are faster than the other baselines, grounding the entire data set increased the baselines' time; conversely, the repeated counting performed by boosting increased its time in two of the five domains. These results indicate that the proposed brlr approach does not sacrifice efficiency (time) for effectiveness (performance).
In summary, our proposed boosted approach appears promising across a diversity of relational domains, with the potential to scale relational logistic regression models to large data sets.
Conclusions
We considered the problem of learning relational logistic regression (RLR) models using the machinery of functional-gradient boosting. To this end, we introduced an alternative interpretation of RLR models that considers both the true and false groundings of a formula within a single equation. This allowed us to learn vector-weighted clauses that are more compact and expressive than standard boosted SRL models. We derived gradients for the different weights and outlined a learning algorithm that learns first-order features as clauses and their corresponding weights simultaneously. We evaluated the algorithm on standard data sets and demonstrated its efficacy.
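As a reading aid, the alternative interpretation summarized above can be sketched as a sigmoid over both the true- and false-grounding counts of each clause body; the notation here is chosen for illustration and is consistent with standard RLR formulations rather than a verbatim restatement of our equations:

```latex
P(Q = \mathrm{true}) \;=\; \sigma\!\Big( w_0 + \sum_{j} w_j^{\top} \mathbf{n}_j \Big),
\qquad
\mathbf{n}_j = \begin{pmatrix} n_j^{T} \\ n_j^{F} \end{pmatrix},
```

where $n_j^{T}$ and $n_j^{F}$ count the true and false groundings of the $j$-th clause body and $w_j$ is its learned weight vector, so a single equation covers both kinds of groundings.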
There are several possible extensions for future work. Currently, our method learns a model for a single target predicate deterministically. As mentioned earlier, it is possible to learn a joint model across multiple predicates, in a manner akin to learning a relational dependency network (RDN); this can yield a new interpretation of RDNs based on combining rules. Second, learning from truly hybrid data remains a challenge for SRL models in general, and RFGB in particular. Finally, given the recent surge of research in sequential decision making, RLR can be seen as an effective function approximator for relational Markov decision processes (MDPs); employing this novel brlr model in the context of relational reinforcement learning is an exciting direction for future research.
References

[Blockeel1999] Blockeel, H. 1999. Top-down induction of first-order logical decision trees. AIC 12(1-2).
 [Craven et al.1998] Craven, M.; DiPasquo, D.; Freitag, D.; McCallum, A.; Mitchell, T.; Nigam, K.; and Slattery, S. 1998. Learning to extract symbolic knowledge from the world wide web. In AAAI, 509–516.
 [Dietterich, Ashenfelter, and Bulatov2004] Dietterich, T.; Ashenfelter, A.; and Bulatov, Y. 2004. Training CRFs via gradient tree boosting. In ICML.
 [Fatemi, Kazemi, and Poole2016] Fatemi, B.; Kazemi, S. M.; and Poole, D. 2016. A learning algorithm for relational logistic regression: Preliminary results. arXiv preprint arXiv:1606.08531.
 [Friedman2001] Friedman, J. 2001. Greedy function approximation: A gradient boosting machine. Annals of Statistics 29.
 [Getoor and Taskar2007] Getoor, L., and Taskar, B. 2007. Introduction to Statistical Relational Learning. MIT Press.
 [Harper and Konstan2015] Harper, F., and Konstan, J. 2015. The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst. 5(4):19:1–19:19.
 [Heckerman, Meek, and Koller2007] Heckerman, D.; Meek, C.; and Koller, D. 2007. Probabilistic entity-relationship models, PRMs, and plate models. Introduction to statistical relational learning 201–238.
 [Huynh and Mooney2008] Huynh, T., and Mooney, R. 2008. Discriminative structure and parameter learning for Markov logic networks. In ICML, 416–423. ACM.
 [Kazemi and Poole2018] Kazemi, S. M., and Poole, D. 2018. RelNN: A deep neural model for relational learning. In AAAI.
 [Kazemi et al.2014a] Kazemi, S.; Buchman, D.; Kersting, K.; Natarajan, S.; and Poole, D. 2014a. Relational logistic regression. In KR.
 [Kazemi et al.2014b] Kazemi, S.; Buchman, D.; Kersting, K.; Natarajan, S.; and Poole, D. 2014b. Relational logistic regression: The directed analog of Markov logic networks. In AAAI Workshop.
 [Kersting and De Raedt2007] Kersting, K., and De Raedt, L. 2007. Bayesian logic programming: Theory and tool. In Getoor, L., and Taskar, B., eds., An Introduction to Statistical Relational Learning.
 [Khot et al.2011] Khot, T.; Natarajan, S.; Kersting, K.; and Shavlik, J. 2011. Learning Markov logic networks via functional gradient boosting. In ICDM, 320–329. IEEE.
 [Kimmig et al.2012] Kimmig, A.; Bach, S.; Broecheler, M.; Huang, B.; and Getoor, L. 2012. A short introduction to probabilistic soft logic. In NIPS Workshop, 1–4.
 [Kok and Domingos2009] Kok, S., and Domingos, P. 2009. Learning Markov logic network structure via hypergraph lifting. In ICML, 505–512. ACM.
 [Koller1999] Koller, D. 1999. Probabilistic relational models. In ILP, 3–13. Springer.
 [Malec et al.2016] Malec, M.; Khot, T.; Nagy, J.; Blasch, E.; and Natarajan, S. 2016. Inductive logic programming meets relational databases: An application to statistical relational learning. In ILP.
 [McCullagh1984] McCullagh, P. 1984. Generalized linear models. EJOR 16(3):285–292.
 [Mihalkova and Mooney2007] Mihalkova, L., and Mooney, R. 2007. Bottom-up learning of Markov logic network structure. In ICML, 625–632. ACM.
 [Muggleton and De Raedt1994] Muggleton, S., and De Raedt, L. 1994. Inductive logic programming: Theory and methods. Journal of Logic Programming 19:629–679.
 [Muggleton1995] Muggleton, S. 1995. Inverse entailment and Progol. New Generation Computing 13(3-4):245–286.
 [Muggleton1997] Muggleton, S. 1997. Learning from positive data. In ILP, 358–376. Berlin, Heidelberg: Springer Berlin Heidelberg.
 [Natarajan et al.2008] Natarajan, S.; Tadepalli, P.; Dietterich, T.; and Fern, A. 2008. Learning first-order probabilistic models with combining rules. Annals of Mathematics and Artificial Intelligence 54(1-3):223–256.
 [Natarajan et al.2012] Natarajan, S.; Khot, T.; Kersting, K.; Gutmann, B.; and Shavlik, J. 2012. Gradient-based boosting for statistical relational learning: The relational dependency network case. Machine Learning 86(1):25–56.
 [Neville and Jensen2007] Neville, J., and Jensen, D. 2007. Relational dependency networks. In Getoor, L., and Taskar, B., eds., Introduction to Statistical Relational Learning. MIT Press. 653–692.
 [Poole et al.2014] Poole, D.; Buchman, D.; Kazemi, S.; Kersting, K.; and Natarajan, S. 2014. Population size extrapolation in relational probabilistic modelling. In SUM, 292–305. Springer.
 [Raedt et al.2016] Raedt, L.; Kersting, K.; Natarajan, S.; and Poole, D. 2016. Statistical relational artificial intelligence: Logic, probability, and computation. Synthesis Lectures on AI and ML 10(2):1–189.
 [Richardson and Domingos2006] Richardson, M., and Domingos, P. 2006. Markov logic networks. ML 62(1):107–136.
 [Taskar et al.2007] Taskar, B.; Abbeel, P.; Wong, M.; and Koller, D. 2007. Relational Markov networks. Introduction to statistical relational learning 175–200.
 [Yang et al.2017] Yang, S.; Korayem, M.; AlJadda, K.; Grainger, T.; and Natarajan, S. 2017. Combining content-based and collaborative filtering for job recommendation system: A cost-sensitive statistical relational learning approach. KBS 136:37–45.