 # A Learning Algorithm for Relational Logistic Regression: Preliminary Results

Relational logistic regression (RLR) is a representation of conditional probability in terms of weighted formulae for modelling multi-relational data. In this paper, we develop a learning algorithm for RLR models. Learning an RLR model from data consists of two steps: learning the set of formulae to be used in the model (a.k.a. structure learning) and learning the weight of each formula (a.k.a. parameter learning). For structure learning, we deploy Schmidt and Murphy's hierarchical assumption: first we learn a model with simple formulae, then more complex formulae are added iteratively only if all their sub-formulae have proven effective in previously learned models. For parameter learning, we convert the problem into a non-relational learning problem and use an off-the-shelf logistic regression learning algorithm from Weka, an open-source machine learning tool, to learn the weights. We also indicate how hidden features about the individuals can be incorporated into RLR to boost the learning performance. We compare our learning algorithm to other structure and parameter learning algorithms in the literature, and compare the performance of RLR models to standard logistic regression and RDN-Boost on a modified version of the MovieLens data-set.


## Background and Notations

In this section, we introduce the notation used throughout the paper, and provide necessary background information for readers to follow the rest of the paper.

### Logistic Regression

Logistic regression (LR) (Allison, 1999) is a popular classification method within the machine learning community. We describe how it can be used for classification following Cessie and van Houwelingen (1992) and Mitchell (1997).

Suppose we have a set of labeled examples $\{(x_1, y_1), \dots, (x_m, y_m)\}$, where each $x_i$ is composed of $n$ features $x_{i1}, \dots, x_{in}$ and each $y_i$ is a binary variable whose value is to be predicted. The $x_{ij}$s may be binary, multi-valued or continuous. Throughout the paper, we assume binary variables take their values from $\{0, 1\}$. Logistic regression learns a set $w = \{w_0, w_1, \dots, w_n\}$ of weights, where $w_0$ is the intercept and $w_j$ is the weight of the feature $x_{ij}$. For simplicity, we assume a new dimension $x_{i0} = 1$ has been added to the data to avoid treating $w_0$ differently than other $w_j$s. LR defines the probability of $y_i$ being $1$ given $x_i$ as follows:

$$P(y_i = 1 \mid x_i, w) = \sigma\Big(\sum_{j=0}^{n} x_{ij} w_j\Big) \qquad (1)$$

where $\sigma(x) = \frac{1}{1 + \exp(-x)}$ is the Sigmoid function.
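As a minimal sketch of Eq. (1), the snippet below computes the LR probability for one example in plain Python. The function names and the example weights are illustrative assumptions, not part of the paper; the input vector is assumed to already carry the constant intercept feature in position 0.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def lr_probability(x: list[float], w: list[float]) -> float:
    """P(y = 1 | x, w) as in Eq. (1); x[0] is assumed to be the constant 1."""
    return sigmoid(sum(xj * wj for xj, wj in zip(x, w)))


# Hypothetical example: intercept weight -1.0 and two feature weights.
p = lr_probability([1.0, 2.0, 0.5], [-1.0, 0.5, 2.0])
```

Here the weighted sum is $-1 + 1 + 1 = 1$, so `p` equals $\sigma(1)$.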

Logistic regression learns the weights by maximizing the log-likelihood of the data (or equivalently, minimizing the logistic loss function) as follows:

$$w_{LR} = \arg\max_{w} \sum_{i=1}^{m} \log P(y_i \mid x_i, w) \qquad (2)$$

An L1-regularization can be added to the loss function to encourage sparsity and do an automatic feature selection.
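The effect of L1-regularization on feature selection can be illustrated with a small proximal-gradient sketch. This is not the solver used in the paper (which relies on existing L1 and Weka implementations); the function names, step size, penalty `lam`, and the toy data are assumptions chosen only to show an uninformative feature being driven to exactly zero.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v towards zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def fit_l1_logistic(X, y, lam=0.1, step=0.1, iters=2000):
    """Proximal gradient on the mean logistic loss plus lam * L1 penalty.

    X[i][0] is assumed to be the constant 1; the intercept is not penalized.
    """
    m, n = len(X), len(X[0])
    w = [0.0] * n
    for _ in range(iters):
        grad = [0.0] * n
        for xi, yi in zip(X, y):
            err = sigmoid(sum(a * b for a, b in zip(xi, w))) - yi
            for j in range(n):
                grad[j] += err * xi[j] / m
        w = [w[j] - step * grad[j] for j in range(n)]
        w = [w[0]] + [soft_threshold(wj, step * lam) for wj in w[1:]]
    return w


# Toy data: feature 1 determines the label, feature 2 is unrelated to it.
X = [[1, 0, 1], [1, 0, 0], [1, 1, 1], [1, 1, 0]]
y = [0, 0, 1, 1]
w = fit_l1_logistic(X, y)
```

On this toy data the weight of the second feature stays at zero while the first feature receives a positive weight, which is the sparsity behaviour the text refers to.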

### Conjoined Features and the Hierarchical Assumption

Given $n$ input random variables, logistic regression considers $n + 1$ features: one bias (intercept) and one feature for each random variable. One can generate more features by conjoining the input random variables. For instance, if $a$ and $b$ are two continuous random variables, one can generate a new feature $a * b$ (which is $a \wedge b$ for Boolean variables). Given $n$ random variables, conjoining (or multiplying) variables allows for generating $2^n$ features. The weights of these features can represent arbitrary conditional probabilities, in what is known as the canonical representation; refer to Buchman et al. (2012) or Koller and Friedman (2009). (Note that there are $2^n$ degrees of freedom and any representation that can represent arbitrary conditional probabilities may require $2^n$ parameters. The challenge is to find a representation that can often use fewer.) For a large $n$, however, generating $2^n$ features may not be practically possible, and it also makes the model overfit to the training data.

In order to avoid generating all $2^n$ features, Schmidt and Murphy (2010) make a hierarchical assumption: if either $a$ or $b$ is not a useful feature, neither is $a * b$. Under this assumption, Schmidt and Murphy (2010) first learn an LR model considering only features with no conjunctions. They regularize their loss function with a hierarchical regularization function, so that the weights of the features not contributing to the prediction go to zero. Once the learning stops, they keep the features having non-zero weights, and add all conjoined features all of whose subsets have non-zero weights in the previous step. Then they run their learning again. They continue this process until no more features can be added.
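The candidate-generation step of this hierarchical search can be sketched as follows. Features are represented as frozensets of variable names, and `kept` is assumed to hold every feature (of any size) that received a non-zero weight so far; the names `proper_subsets` and `next_candidates` are illustrative, not taken from Schmidt and Murphy (2010).

```python
from itertools import combinations


def proper_subsets(feature):
    """Yield every non-empty proper subset of a conjoined feature."""
    items = sorted(feature)
    for size in range(1, len(items)):
        for combo in combinations(items, size):
            yield frozenset(combo)


def next_candidates(kept, size):
    """Conjunctions of `size` variables all of whose proper subsets were kept."""
    variables = sorted(set().union(*kept)) if kept else []
    return {
        frozenset(combo)
        for combo in combinations(variables, size)
        if all(s in kept for s in proper_subsets(combo))
    }


# Hypothetical run: a, b, c survive the first round.
kept = {frozenset("a"), frozenset("b"), frozenset("c")}
level2 = next_candidates(kept, 2)

# Suppose learning then zeroes out b*c but keeps a*b and a*c.
kept |= {frozenset("ab"), frozenset("ac")}
level3 = next_candidates(kept, 3)
```

In this run `level2` contains the three pairwise conjunctions, while `level3` is empty because the sub-feature $b * c$ was not kept.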

### Relational Logistic Regression

Relational logistic regression (RLR) (Kazemi et al., 2014) is the analogue of LR for relational models. RLR can also be considered as the directed analogue of Markov logic networks (Richardson and Domingos, 2006). In order to describe RLR, we first need to introduce some definitions and terminology used in relational domains.

A population refers to a set of individuals and corresponds to a domain in logic. Population size of a population is a non-negative number indicating its cardinality. For example, a population can be the set of planets in the solar system, where Mars is an individual and the population size is 8.

Logical variables start with lower-case letters, and constants start with upper-case letters. Associated with a logical variable $x$ is a population $\Delta_x$ where $|\Delta_x|$ is the size of the population. A lower-case and an upper-case letter written in bold refer to a set of logical variables and a set of individuals respectively.

A parametrized random variable (PRV) (Poole, 2003) is of the form $F(t_1, \dots, t_k)$ where $F$ is a k-ary (continuous or categorical) function symbol and each $t_i$ is a logical variable or a constant. If all $t_i$s are constants, the PRV is a random variable. If k = 0, we can omit the parentheses. If $F$ is a predicate symbol, $F$ has range {True, False}, otherwise the range of $F$ is the range of the function. For example, $LifeExistsOn(planet)$ can be a PRV with predicate function $LifeExistsOn$ and logical variable $planet$, which is true if life exists on the given $planet$.

A literal is an assignment of a value to a PRV. We represent $F(\cdot) = True$ as $f(\cdot)$ and $F(\cdot) = False$ as $\neg f(\cdot)$. A formula is made up of literals connected with conjunction or disjunction.

Figure 1: A relational model taken from (Kazemi et al., 2014).

A weighted formula (WF) for a PRV $Q(\mathbf{z})$, where $\mathbf{z}$ is a set of logical variables, is a tuple $\langle F, w \rangle$ where $F$ is a Boolean formula of the parents of $Q$ and $w$ is a weight.

Relational logistic regression (RLR) defines a conditional probability distribution for a Boolean PRV $Q(\mathbf{z})$ using a set $\psi$ of WFs as follows:

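$$P(q(\mathbf{Z}) \mid \Pi) = \sigma\Big(\sum_{\langle F, w \rangle \in \psi} w * F_{\Pi, \mathbf{z} \to \mathbf{Z}}\Big) \qquad (3)$$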

where $\sigma$ is the Sigmoid function, $\Pi$ represents the assigned values to parents of $Q$, $\mathbf{Z}$ represents an assignment of individuals to the logical variables in $\mathbf{z}$, and $F_{\Pi, \mathbf{z} \to \mathbf{Z}}$ is formula $F$ with each logical variable $\mathbf{z}$ in it being replaced according to $\mathbf{Z}$, and evaluated in $\Pi$.

###### Example 1.

Consider the relational model in Fig. 1 taken from (Kazemi et al., 2014) and suppose we want to model "someone is happy if they have at least 5 friends that are kind". The following WFs can be used to represent this model:

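$$\langle True, -4.5 \rangle \qquad \langle friend(z, y) \wedge kind(y), 1 \rangle$$

RLR sums over the above WFs resulting in:

$$P(happy(Z) = True \mid \Pi) = \sigma(-4.5 + 1 * n_T)$$

where $n_T$ represents the number of individuals $Y$ in the population of $y$ for which $friend(Z, Y) \wedge kind(Y)$ is true. When $n_T \geq 5$, the probability is closer to one than zero and when $n_T < 5$, the probability is closer to zero than one.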
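The two WFs of Example 1 can be evaluated directly: the intercept contributes $-4.5$ and each kind friend contributes $1$. The sketch below assumes the relations are stored as Python sets (`friend` as a set of pairs, `kind` as a set of individuals); these data structures and the individual names are illustrative.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def p_happy(person, friend, kind, bias=-4.5, weight=1.0):
    """Probability that `person` is happy under the two WFs of Example 1."""
    # Count the individuals y for which friend(person, y) and kind(y) hold.
    n_true = sum(1 for (a, b) in friend if a == person and b in kind)
    return sigmoid(bias + weight * n_true)


# Hypothetical data: Z has six friends, five of whom are kind.
friend = {("Z", f"K{i}") for i in range(5)} | {("Z", "M")}
kind = {f"K{i}" for i in range(5)}

p_five = p_happy("Z", friend, kind)             # sigma(0.5)
p_four = p_happy("Z", friend, kind - {"K0"})    # sigma(-0.5)
```

With five kind friends the probability is above one half, and with four it falls below one half, matching the "at least 5 kind friends" reading of the model.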

### Handling Continuous Variables

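RLR was initially designed for Boolean or multi-valued parents. If $True$ is associated with $1$ and $False$ is associated with $0$, we can substitute $\wedge$ in our WFs with $*$. Then we can allow continuous PRVs in WFs. For example, if for some individual $X$ we have $R(X) = True$, $S(X) = False$ and $T(X) = 0.2$, then $\langle r(X) * s(X), w \rangle$ evaluates to $0$, $\langle r(X) * \neg s(X), w \rangle$ evaluates to $w$, and $\langle r(X) * \neg s(X) * t(X), w \rangle$ evaluates to $0.2 * w$.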
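A minimal sketch of this product-based evaluation is shown below. It assumes that True maps to 1 and False to 0, and that negating a Boolean literal maps its value $v$ to $1 - v$; the PRV names and values are hypothetical, and negation is only applied to Boolean PRVs here.

```python
def literal_value(value, negated=False):
    """Map True/False to 1.0/0.0, keep numbers as-is, and negate as 1 - v."""
    v = float(value)
    return 1.0 - v if negated else v


def evaluate_product(literals, values):
    """Multiply the values of the literals; each literal is (name, negated)."""
    result = 1.0
    for name, negated in literals:
        result *= literal_value(values[name], negated)
    return result


# Hypothetical grounding: two Boolean PRVs and one continuous PRV.
values = {"R": True, "S": False, "T": 0.2}

v1 = evaluate_product([("R", False), ("S", False)], values)
v2 = evaluate_product([("R", False), ("S", True)], values)
v3 = evaluate_product([("R", False), ("S", True), ("T", False)], values)
```

Multiplying each result by the weight of its WF gives that WF's contribution to the sum inside the Sigmoid.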

## Learning Relational Logistic Regression

The aforementioned learning algorithm for LR is not directly applicable to RLR because the number of potential weights in RLR is unbounded. We develop a learning algorithm for RLR which can handle the unbounded number of potential weights.

Learning RLR from data consists of two parts: structure learning (learning the set of WFs) and parameter learning (learning the weight of each WF).

### Parameter Learning

Suppose we are given a set $\psi$ of WFs defining the conditional probability of a PRV $Q(\mathbf{z})$, and we want to learn the weight of each WF. We can convert this learning problem into an LR learning problem by generating a flat data-set in single-matrix form. To do so, for each assignment $\mathbf{Z}$ of individuals to the logical variables in $\mathbf{z}$, we generate one data row $(x_i, y_i)$ in which $y_i = Q(\mathbf{Z})$ and $x_{ij}$ is the number of times the formula of the $j$-th WF in $\psi$ is true when the logical variables in $\mathbf{z}$ are replaced with the individuals in $\mathbf{Z}$. Once we do this conversion, we have single-matrix data for which we can learn an LR model. The weight learned for the $j$-th input gives the weight for our $j$-th WF. Note that the conversion is complete and specifies the values of all relations.

###### Example 2.

Consider the relational model in Fig. 1 and suppose we are given the following WFs:

$$\langle True, w_0 \rangle \qquad \langle friend(z, y), w_1 \rangle \qquad \langle friend(z, y) \wedge kind(y), w_2 \rangle$$

In this case, we generate a data matrix having a row for each individual $Z$, where the row consists of four values: the first is the constant $1$ serving as the intercept, the second is the number of people that are friends with $Z$, the third is the number of kind people that are friends with $Z$, and the fourth represents whether $Z$ is happy or not. These are sufficient statistics for learning the weights.

In such a matrix, the second column of each row gives that person's number of friends and the third column gives how many of those friends are kind.
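The conversion described in Example 2 can be sketched as a small flattening routine. The relation encodings (`friend` as a set of pairs, `kind` and `happy` as sets) and the individuals are assumptions made for illustration; the resulting rows could be handed to any off-the-shelf logistic regression learner.

```python
def flatten(people, friend, kind, happy):
    """Build one row [1, #friends, #kind friends, happy] per individual."""
    rows = []
    for z in people:
        friends = [b for (a, b) in friend if a == z]
        n_kind = sum(1 for f in friends if f in kind)
        rows.append([1, len(friends), n_kind, int(z in happy)])
    return rows


# Hypothetical relational data.
people = ["Ann", "Bob"]
friend = {("Ann", "Bob"), ("Ann", "Cat"), ("Bob", "Ann")}
kind = {"Bob"}
happy = {"Ann"}

rows = flatten(people, friend, kind, happy)
```

Each row holds exactly the counts the three WFs evaluate to for that individual, followed by the label.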

### Structure Learning

Learning the structure of an RLR model refers to selecting a set of WFs that should be used. By conjoining different relations and adding attributes, one can generate an infinite number of WFs. As an example, suppose we want to predict the gender of users ($G(u)$) in a movie rating system, where we are given the occupation ($O(u)$), age ($A(u)$) and the movies that these users have rated ($Rated(u, m)$), as well as the set of (possibly more than one) genres that each movie belongs to. Our WFs can have formulae that conjoin user attributes, rated movies and movie genres, such as $rated(u, m) \wedge action(m)$, and many other formulae with many more conjoined relations. Not all of these WFs may be useful though. As an example, a WF whose formula is $action(m)$ is not a useful feature in predicting $G(u)$ as it evaluates to a constant number for all users. We avoid generating such features in an RLR learning model. To do this in a systematic way, we need a few definitions:

###### Definition 1.

Let $\psi$ denote the set of WFs defining the conditional probability of a PRV $Q(\mathbf{z})$. A logical variable $x$ in a formula of a WF in $\psi$ is a:

• target if $x \in \mathbf{z}$

• connector for $y_1$ and $y_2$ if there are at least two relations in the formula, one having $x$ and $y_1$ and the other having $x$ and $y_2$

• attributed if there exists at least one PRV in the formula having only $x$ (e.g., $action(m)$ for $m$)

• hanging if it fits in none of the above definitions

###### Definition 2.

A WF is a chain (Schulte and Khosravi, 2012) if its literals can be ordered as a list $[l_1, \dots, l_k]$ such that each literal $l_i$ shares at least one logical variable with the preceding literals $l_1, \dots, l_{i-1}$. A chain is targeted if it has at least one target logical variable. A WF is k-BL if it contains no more than $k$ binary literals, and is r-UL if it contains no more than $r$ unary literals.
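The chain, targeted and k-BL/r-UL conditions of Definition 2 can be checked mechanically. The sketch below represents each literal as a hypothetical `(predicate, logical_variables)` pair and uses the fact that a valid ordering exists exactly when the literals are connected through shared logical variables; the function names are illustrative.

```python
def is_chain(literals):
    """True if the literals can be ordered so each shares a variable with earlier ones."""
    if not literals:
        return False
    seen = set(literals[0][1])
    remaining = list(literals[1:])
    progress = True
    while remaining and progress:
        progress = False
        for lit in remaining:
            if seen & set(lit[1]):
                seen |= set(lit[1])
                remaining.remove(lit)
                progress = True
                break
    return not remaining


def is_targeted_chain(literals, targets):
    """A chain that mentions at least one target logical variable."""
    mentioned = {v for _, variables in literals for v in variables}
    return is_chain(literals) and bool(mentioned & set(targets))


def literal_counts(literals):
    """Return (number of binary literals, number of unary literals)."""
    binary = sum(1 for _, variables in literals if len(variables) == 2)
    unary = sum(1 for _, variables in literals if len(variables) == 1)
    return binary, unary


# Hypothetical formulae for predicting a property of users u.
f1 = [("rated", ("u", "m")), ("action", ("m",))]
f2 = [("action", ("m",))]
f3 = [("rated", ("u", "m")), ("old", ("v",))]
```

Here `f1` is a targeted chain with one binary and one unary literal, `f2` is a chain that is not targeted, and `f3` is not a chain.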

###### Example 3.

Suppose

is a WF belonging to the set of WFs defining the conditional probability of . Then is a target, and are connectors, is also attributed, and is a hanging logical variable. This WF is a chain because the second literal shares a with the first literal, the third one shares a with the second one, and the fourth one shares a with the first and second literal. This chain is targeted because it contains , which is the only target logical variable. The WF is -, - as it contains no more than binary and no more than unary literals. is a non-chain WF, and is a chain which is not targeted.

We avoid generating two types of WFs: 1- WFs that are not targeted chains, 2- WFs that contain hanging logical variables. We do this because non-targeted chains (e.g., $action(m)$ for predicting $G(u)$) always evaluate to a constant number, and WFs with hanging logical variables can be replaced with more informative WFs. In the rest of the paper, we only consider the WFs that are targeted chains and have no hanging logical variables.

Having these definitions, we state the hierarchical assumption as follows:

Hierarchical Assumption: Let $f$ be a $k$-BL, $r$-UL WF. Let $S$ be the set of all $k$-BL, $r'$-UL ($r' < r$) WFs having the same binary literals and a strict subset of the unary literals as $f$. $f$ is useless if L1-regularized logistic regression assigns a zero weight to it, or there exists a useless WF in $S$.

###### Example 4.

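A WF $\langle rated(u, m) \wedge action(m) \wedge drama(m), w \rangle$ is useless if either $\langle rated(u, m) \wedge action(m), w_1 \rangle$ or $\langle rated(u, m) \wedge drama(m), w_2 \rangle$ is useless, or L1-regularization sets $w$ to 0.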

In order to learn the structure of an RLR model, we select a value $k$ and generate all allowed $k$-BL, 1-UL WFs. We find the best value of $k$ by cross-validation. Then we add WFs with more unary literals by making the hierarchical assumption and using a search strategy similar to that of Schmidt and Murphy (2010), following the algorithm below:
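
Input: the set of all allowed $k$-BL, 1-UL WFs.

1. Set the current WFs to the input set, the set of removed WFs to empty, and $r$ = 1.
2. Until no more WFs can be generated:
   a. Learn the weights of the current WFs with L1-regularized logistic regression.
   b. Add every current WF whose weight is zero to the set of removed WFs.
   c. Set $r$ += 1.
   d. Set the current WFs to the $k$-BL, $r$-UL WFs that obey the hierarchical assumption with respect to the removed WFs.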

The algorithm starts with $k$-BL, 1-UL WFs. Initially, no WF is labeled as removed. $r$ is initially set to 1 to indicate the current maximum number of unary literals. Then, until the stopping criterion is met, the weights of the current WFs are learned using an L1-regularized logistic regression. If the weight of a WF is set to zero, we add it to the set of removed WFs. Then we increment $r$ and update the current WFs to the $k$-BL, $r$-UL WFs obeying the hierarchical assumption with respect to the removed WFs (we assume a function is available which returns such WFs). The stopping criterion is met when no more WFs can be generated as $r$ increases.
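The search loop described above can be sketched as follows. Both the weight learner (`fit_l1`) and the WF generator (`generate_wfs`) are passed in as callables because the paper does not spell out their interfaces; the toy versions below, the `max_r` safety bound, and the representation of a WF as a frozenset of unary literal names (with the binary literals held fixed) are all assumptions for illustration.

```python
from itertools import combinations


def learn_structure(initial_wfs, fit_l1, generate_wfs, max_r=10):
    """Grow WFs by unary literals, pruning with the hierarchical assumption."""
    wfs, removed, r = set(initial_wfs), set(), 1
    while True:
        weights = fit_l1(wfs)  # maps each WF to its learned weight
        removed |= {wf for wf in wfs if weights[wf] == 0.0}
        r += 1
        candidates = set(generate_wfs(r, removed))
        # Stop when raising r produces nothing new (or the safety bound is hit).
        if r > max_r or candidates <= wfs:
            return wfs - removed
        wfs = candidates


# Toy stand-ins: only WFs in USEFUL receive a non-zero weight.
UNARY = ["a", "b", "c"]
USEFUL = {frozenset("a"), frozenset("b"), frozenset("ab")}


def toy_fit(wfs):
    return {wf: (1.0 if wf in USEFUL else 0.0) for wf in wfs}


def toy_generate(r, removed):
    """All WFs with at most r unary literals and no removed sub-WF."""
    out = set()
    for size in range(1, r + 1):
        for combo in combinations(UNARY, size):
            subsets = [frozenset(s) for n in range(1, size + 1)
                       for s in combinations(combo, n)]
            if not any(s in removed for s in subsets):
                out.add(frozenset(combo))
    return out


selected = learn_structure({frozenset(u) for u in UNARY}, toy_fit, toy_generate)
```

In the toy run, the WF with literal `c` is zeroed in the first round, so no larger WF containing `c` is ever generated, and the search stops once raising the limit on unary literals yields nothing new.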

While we exploit the observed features of the objects in making predictions, each object may contain really useful information that has not been observed. As an example, in predicting the gender of the users given the movies they liked, some movies may only be appealing to males and some only to females. Or there might be features in movies that we do not know about, but they contribute to predicting the gender of users.

In order to incorporate hidden features in our RLR model, we add continuous unary PRVs such as $H(m)$ with (initially) random values to our dataset. Then we generate all $k$-BL, 1-UL WFs and learn the weights as well as the values of the hidden features using stochastic gradient descent with L1-regularization. Once we learn the values of the hidden features, we treat them as normal features and use our aforementioned structure and parameter learning algorithms to learn an RLR model.
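To make the joint learning of weights and hidden values concrete, here is a deliberately simplified sketch. It assumes a single hidden feature per movie and a single WF whose value for a user is the sum of the hidden values of the movies they rated; L1-regularization is omitted for brevity, and the function names, learning rate, and toy data are all illustrative rather than the paper's setup.

```python
import math
import random


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(movies, w0, w1, h):
    """P(label = 1) = sigma(w0 + w1 * sum of hidden values of rated movies)."""
    return sigmoid(w0 + w1 * sum(h[m] for m in movies))


def train_hidden(ratings, labels, epochs=500, lr=0.1, seed=0):
    """Plain SGD on the logistic loss over w0, w1 and the hidden values h."""
    rng = random.Random(seed)
    movies = sorted({m for ms in ratings.values() for m in ms})
    h = {m: rng.uniform(-0.1, 0.1) for m in movies}  # random initial values
    w0, w1 = 0.0, 0.1
    users = sorted(labels)
    for _ in range(epochs):
        rng.shuffle(users)
        for u in users:
            s = sum(h[m] for m in ratings[u])
            err = sigmoid(w0 + w1 * s) - labels[u]
            grad_w1 = err * s
            for m in ratings[u]:
                h[m] -= lr * err * w1   # gradient w.r.t. h(m) uses the old w1
            w0 -= lr * err
            w1 -= lr * grad_w1
    return w0, w1, h


# Hypothetical data: label-1 users rate A/B, label-0 users rate C/D.
ratings = {"u1": ["A", "B"], "u2": ["C", "D"], "u3": ["A"], "u4": ["D"]}
labels = {"u1": 1, "u2": 0, "u3": 1, "u4": 0}

w0, w1, h = train_hidden(ratings, labels)
p_pos = predict(["A", "B"], w0, w1, h)
p_neg = predict(["C", "D"], w0, w1, h)
```

After training, the learned hidden values could be treated as ordinary movie attributes and fed to the structure and parameter learning steps described earlier.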

## Experiments and Results

We test our learning algorithm on the 0.1M MovieLens data-set (Harper and Konstan, 2015) with the modifications made by Schulte and Khosravi (2012). This data-set contains information about users (nominal variables for age, occupation, and gender), movies (binary variables for action, horror, and drama), the movies rated by each user as user-movie pairs, and the actual rating the user has given to a movie. In our experiments, we ignored the actual ratings and only considered whether a movie has been rated by a user or not.

We learned RLR models for predicting the age and gender of users once with no hidden features (RLR-Base), and once with one hidden feature (RLR-H) for the movies. We regularize the predictions of both RLR-Base and RLR-H towards the mean as:

$$Probability = \lambda * mean + (1 - \lambda) * (RLR\ signal) \qquad (4)$$
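Eq. (4) is a convex combination of the base rate and the model output. A minimal sketch, with an assumed function name and hypothetical values:

```python
def shrink_towards_mean(p_rlr: float, mean: float, lam: float) -> float:
    """Eq. (4): blend the RLR probability with the mean using weight lam."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * mean + (1.0 - lam) * p_rlr


# Hypothetical values: an over-confident prediction pulled towards the mean.
p = shrink_towards_mean(p_rlr=0.99, mean=0.7, lam=0.5)
```

Setting `lam` to 0 leaves the RLR prediction unchanged, and setting it to 1 always predicts the mean.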

When predicting the age of the users, we only considered two instead of three age classes (we merged the age_1 and age_2 classes). For learning logistic regression models with L1-regularization, we used the open-source code of Schmidt, Fung, and Rosales (2007), and we learned the final logistic regression model with the Weka software (Hall et al., 2009). We compared the proposed method with a baseline model always predicting the mean, standard logistic regression (LR) not using the relational information, and RDN-Boost.

The performance of all learning algorithms was obtained by 5-fold cross-validation. In each fold, we divided the users in the MovieLens data-set randomly into an 80% training set and a 20% test set. We learned the model on the training set and measured the accuracy (the percentage of correctly classified instances) and the average conditional log-likelihood (ACLL) on the test set, and averaged them over the 5 folds. ACLL is computed as follows:

$$ACLL = \frac{1}{m} \sum_{i=1}^{m} \ln P(G(U_i) \mid data, model) \qquad (5)$$
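Eq. (5) averages the log-probability that the model assigns to each test user's true label. The sketch below assumes binary labels and that the model outputs the probability of the positive class; the function name and sample values are illustrative.

```python
import math


def acll(p_positive, labels):
    """Average conditional log-likelihood of the true labels, as in Eq. (5)."""
    total = 0.0
    for p, y in zip(p_positive, labels):
        p_true = p if y == 1 else 1.0 - p   # probability given to the true label
        total += math.log(p_true)
    return total / len(labels)


# Hypothetical predictions for two test users.
score = acll([0.9, 0.2], [1, 0])
```

Values closer to zero are better; a model that always predicts 0.5 scores $\ln 0.5$ regardless of the labels.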

The obtained results are reported in Table 1. They show that RLR utilizes the relational features to improve the predictions compared to the logistic regression model that does not use the relational information, and compared to RDN-Boost. The results also indicate that adding hidden features to the RLR models may increase the accuracy and reduce the MAE. However, we observed that adding hidden features makes the model over-confident by pushing the prediction probabilities towards zero and one, thus requiring more regularization towards the mean.

## Discussion

In our first experiment on predicting the gender of the users, we found that on average men have rated more action movies than women. This means that for predicting the gender of the users, the feature $rated(u, m) \wedge action(m)$ is a useful feature for both RLR and MLNs as it counts the number of action movies. Many of the current relational learning algorithms/models, however, rely mostly on the existential quantifier as their aggregator (e.g., (Horsch and Poole, 1990; De Raedt, Kimmig, and Toivonen, 2007; Huynh and Mooney, 2008; Natarajan et al., 2012)). By relying on the existential quantifier, these models either have to use many complex rules to imitate the effect of such rules, or lose great amounts of relational information available in terms of counts.

As a particular example, consider the discriminative structure learning of Huynh and Mooney (2008) for MLNs. First they learn a large set of features using ALEPH, then learn the weights of the features with L1-regularization to enable automatic feature selection. In cases where everyone has rated an action movie, $male(u)$ :- $rated(u, m), action(m)$ is not a useful feature for ALEPH (and so it will not find it) because it does not distinguish males from females. Therefore, this rule will not be included in the final MLN.

Relational learners based on the existential quantifier can potentially imitate the effect of counts by using many complex rules. As an example, $male(u)$ :- $rated(u, m_1), action(m_1), rated(u, m_2), action(m_2), m_1 \neq m_2$ may be used to assign a maleness probability to people rating two action movies. But this approach requires a different rule for each count, and the rules become more and more complex as the count grows because they require pairwise inequalities. To see this in practice, we ran experiments with ALEPH on synthesized data and observed that, even though it could learn such rules to enhance its predictions, it failed at finding them.
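The contrast between the existential and the counting aggregator can be seen on a toy example. The data below is hypothetical: every user has rated at least one action movie, so the existential feature is constant across users while the count still separates them.

```python
def exists_action(user, rated, action):
    """Existential aggregator: 1 if the user rated at least one action movie."""
    return int(any(m in action for m in rated[user]))


def count_action(user, rated, action):
    """Counting aggregator: number of action movies the user rated."""
    return sum(1 for m in rated[user] if m in action)


# Hypothetical ratings; m1-m3 are action movies, m4 is not.
action = {"m1", "m2", "m3"}
rated = {
    "u_male": ["m1", "m2", "m3"],
    "u_female": ["m1", "m4"],
}

exists_features = {u: exists_action(u, rated, action) for u in rated}
count_features = {u: count_action(u, rated, action) for u in rated}
```

A learner restricted to `exists_features` sees the same value for both users, whereas `count_features` retains the difference that a weighted count can exploit.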

Based on the above observations, we argue that relational learning algorithms/models need to allow for a richer set of predefined aggregators, or enable non-predefined aggregators to be learned from data similar to what RLR does. We also argue that our structure learning algorithm has the potential to explore more features, and may also be a good candidate for discriminative structure learning of MLNs.

## Conclusion

Relational logistic regression (RLR) can learn complex models for multi-relational data-sets. In this paper, we developed and tested a structure and parameter learning algorithm for these models based on the hierarchical assumption. We compared our model with the standard logistic regression model and RDN-Boost, and showed that, on the MovieLens data-set, RLR achieves higher accuracies. We also showed how hidden features can boost the performance of RLR models. The results presented in this work are only preliminary. Future directions include testing our learning algorithm on more complex data-sets having much more relational information, comparing our model with other relational learning and aggregation models in the literature, making the learning algorithm extrapolate properly for unseen population sizes (Poole et al., 2014), and testing the performance of our structure learning algorithm for discriminative learning of Markov logic networks.

## References

• Allison (1999) Allison, P. 1999. Logistic regression using SAS®: theory and application. SAS Publishing.
• Buchman et al. (2012) Buchman, D.; Schmidt, M.; Mohamed, S.; Poole, D.; and De Freitas, N. 2012. On sparse, spectral and other parameterizations of binary probabilistic models. In AISTATS.
• Cessie and van Houwelingen (1992) Cessie, S. L., and van Houwelingen, J. 1992. Ridge estimators in logistic regression. Applied Statistics 41(1):191–201.
• De Raedt et al. (2016) De Raedt, L.; Kersting, K.; Natarajan, S.; and Poole, D. 2016. Statistical relational artificial intelligence: Logic, probability, and computation. Synthesis Lectures on Artificial Intelligence and Machine Learning 10(2):1–189.
• De Raedt, Kimmig, and Toivonen (2007) De Raedt, L.; Kimmig, A.; and Toivonen, H. 2007. Problog: A probabilistic prolog and its application in link discovery. In IJCAI, volume 7.
• Friedman et al. (1999) Friedman, N.; Getoor, L.; Koller, D.; and Pfeffer, A. 1999. Learning probabilistic relational models. In Proc. of the Sixteenth International Joint Conference on Artificial Intelligence, 1300–1307. Sweden: Morgan Kaufmann.
• Getoor and Taskar (2007) Getoor, L., and Taskar, B. 2007. Introduction to Statistical Relational Learning. MIT Press, Cambridge, MA.
• Hall et al. (2009) Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; and Witten, I. H. 2009. The weka data mining software: an update. ACM SIGKDD explorations newsletter 11(1):10–18.
• Harper and Konstan (2015) Harper, M., and Konstan, J. 2015. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5(4):19.
• Horsch and Poole (1990) Horsch, M., and Poole, D. 1990. A dynamic approach to probability inference using Bayesian networks. In Proc. sixth Conference on Uncertainty in AI, 155–161.
• Huynh and Mooney (2008) Huynh, T. N., and Mooney, R. J. 2008. Discriminative structure and parameter learning for markov logic networks. In Proc. of the international conference on machine learning.
• Kazemi et al. (2014) Kazemi, S. M.; Buchman, D.; Kersting, K.; Natarajan, S.; and Poole, D. 2014. Relational logistic regression. In Proc. 14th International Conference on Principles of Knowledge Representation and Reasoning (KR).
• Kisynski and Poole (2009) Kisynski, J., and Poole, D. 2009. Lifted aggregation in directed first-order probabilistic models. In Twenty-first International Joint Conference on Artificial Intelligence, 1922–1929.
• Koller and Friedman (2009) Koller, D., and Friedman, N. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA.
• Mitchell (1997) Mitchell, T. 1997. Machine Learning. McGraw Hill.
• Natarajan et al. (2010) Natarajan, S.; Khot, T.; Lowd, D.; Tadepalli, P.; and Kersting, K. 2010. Exploiting causal independence in Markov logic networks: Combining undirected and directed models. In European Conference on Machine Learning (ECML).
• Natarajan et al. (2012) Natarajan, S.; Khot, T.; Kersting, K.; Gutmann, B.; and Shavlik, J. 2012. Gradient-based boosting for statistical relational learning: The relational dependency network case. Machine Learning 86(1):25–56.
• Neville et al. (2005) Neville, J.; Simsek, O.; Jensen, D.; Komoroske, J.; Palmer, K.; and Goldberg, H. 2005. Using relational knowledge discovery to prevent securities fraud. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. MIT Press.
• Pearl (1988) Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.
• Perlich and Provost (2006) Perlich, C., and Provost, F. 2006. Distribution-based aggregation for relational learning with identifier attributes. Machine Learning 62:65–105.
• Poole et al. (2014) Poole, D.; Buchman, D.; Kazemi, S. M.; Kersting, K.; and Natarajan, S. 2014. Population size extrapolation in relational probabilistic modelling. In Proc. of the Eighth International Conference on Scalable Uncertainty Management.
• Poole (2003) Poole, D. 2003. First-order probabilistic inference. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI-03), 985–991.
• Popescul and Ungar (2004) Popescul, A., and Ungar, L. H. 2004. Cluster-based concept invention for statistical relational learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 665–670. ACM.
• Richardson and Domingos (2006) Richardson, M., and Domingos, P. 2006. Markov logic networks. Machine Learning 62:107–136.
• Schmidt and Murphy (2010) Schmidt, M. W., and Murphy, K. P. 2010. Convex structure learning in log-linear models: Beyond pairwise potentials. In International Conference on Artificial Intelligence and Statistics.
• Schmidt, Fung, and Rosales (2007) Schmidt, M.; Fung, G.; and Rosales, R. 2007. Fast optimization methods for l1 regularization: A comparative study and two new approaches. In Machine Learning: ECML 2007. Springer. 286–297.
• Schulte and Khosravi (2012) Schulte, O., and Khosravi, H. 2012. Learning graphical models for relational data via lattice search. Machine Learning 88(3):331–368.
• Srinivasan (2001) Srinivasan, A. 2001. The aleph manual.