Background and Notations
In this section, we introduce the notation used throughout the paper and provide the background needed to follow the rest of the paper.
Logistic Regression
Logistic regression (LR) (Allison, 1999) is a popular classification method in the machine learning community. We describe how it can be used for classification following Cessie and van Houwelingen (1992) and Mitchell (1997).
Suppose we have a set of labeled examples $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$, where each $x^{(i)}$ is composed of $k$ features $x_1, \dots, x_k$ and each $y^{(i)}$ is a binary variable whose value is to be predicted. The $x_j$s may be binary, multivalued, or continuous. Throughout the paper, we assume binary variables take their values from $\{0, 1\}$. Logistic regression learns a set of weights $w_0, w_1, \dots, w_k$, where $w_0$ is the intercept and $w_j$ is the weight of the feature $x_j$. For simplicity, we assume a new dimension $x_0 = 1$ has been added to the data to avoid treating $w_0$ differently than the other $w_j$s. LR defines the probability of $y$ being $1$ given $x$ as follows:

$$P(y = 1 \mid x) = \sigma\Big(\sum_{j=0}^{k} w_j x_j\Big) \qquad (1)$$

where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function.
Logistic regression learns the weights by maximizing the log-likelihood of the data (or, equivalently, minimizing the logistic loss) as follows:

$$\max_{w} \sum_{i=1}^{n} \Big[ y^{(i)} \log \sigma\big(w \cdot x^{(i)}\big) + \big(1 - y^{(i)}\big) \log \big(1 - \sigma(w \cdot x^{(i)})\big) \Big] \qquad (2)$$
An L1 regularization term can be added to the loss function to encourage sparsity and perform automatic feature selection.
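To make the sparsity effect of L1 regularization concrete, the following sketch fits a logistic regression with a proximal soft-thresholding step. This is not the solver used in the paper; the data, hyperparameters, and helper names are invented for illustration only.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_l1_logistic(X, y, lam=0.1, lr=0.1, epochs=500):
    # Full-batch gradient descent on the logistic loss, followed by a
    # proximal soft-thresholding step that drives useless weights to
    # exactly zero (automatic feature selection).
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j in range(d):
                grad[j] += (p - yi) * xi[j] / n
        for j in range(d):
            wj = w[j] - lr * grad[j]
            if j == 0:                    # do not penalize the intercept
                w[j] = wj
            else:                         # soft-threshold toward zero
                w[j] = math.copysign(max(abs(wj) - lr * lam, 0.0), wj)
    return w

# Toy data: the label depends only on feature 1; feature 2 is pure noise,
# so the L1 penalty should keep its weight at (or near) zero.
random.seed(0)
X = [[1.0, random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(200)]
y = [1 if x[1] > 0 else 0 for x in X]
w = fit_l1_logistic(X, y)
```

After training, the informative feature receives a substantial weight while the noise feature's weight is shrunk toward zero.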
Conjoined Features and the Hierarchical Assumption
Given $n$ input random variables, logistic regression considers $n + 1$ features: one bias (intercept) and one feature for each random variable. One can generate more features by conjoining the input random variables. For instance, if $x_i$ and $x_j$ are two continuous random variables, one can generate a new feature $x_i \cdot x_j$ (which is $x_i \wedge x_j$ for Boolean variables). Given $n$ random variables, conjoining (or multiplying) up to $n$ variables allows for generating $2^n$ features. These $2^n$ weights can represent arbitrary conditional probabilities^1, in what is known as the canonical representation; refer to Buchman et al. (2012) or Koller and Friedman (2009). For a large $n$, however, generating $2^n$ features may not be practically possible, and it also makes the model overfit to the training data.

^1 Note that there are $2^n$ degrees of freedom, and any representation that can represent arbitrary conditional probabilities may require $2^n$ parameters. The challenge is to find a representation that can often use fewer.

In order to avoid generating all $2^n$ features, Schmidt and Murphy (2010) make a hierarchical assumption: if either $x_i$ or $x_j$ is not a useful feature, neither is $x_i \cdot x_j$. Under this assumption, Schmidt and Murphy (2010) first learn an LR model considering only features with no conjunctions. They regularize their loss function with a hierarchical regularization function, so that the weights of the features not contributing to the prediction go to zero. Once the learning stops, they keep the features having nonzero weights and add all conjoined features all of whose subsets have nonzero weights in the previous step. Then they run their learning again. They continue this process until no more features can be added.
Relational Logistic Regression
Relational logistic regression (RLR) (Kazemi et al., 2014) is the analogue of LR for relational models. RLR can also be considered the directed analogue of Markov logic networks (Richardson and Domingos, 2006). In order to describe RLR, we first need to introduce some definitions and terminology used in relational domains.
A population refers to a set of individuals and corresponds to a domain in logic. The population size of a population is a nonnegative number indicating its cardinality. For example, a population can be the set of planets in the solar system, where Mars is an individual and the population size is 8.
Logical variables start with lowercase letters, and constants start with uppercase letters. Associated with a logical variable $x$ is a population $pop(x)$, where $|pop(x)|$ is the size of the population. A lowercase letter and an uppercase letter written in bold refer to a set of logical variables and a set of individuals, respectively.
A parametrized random variable (PRV) (Poole, 2003) is of the form $F(t_1, \dots, t_k)$, where $F$ is a $k$-ary (continuous or categorical) function symbol and each $t_i$ is a logical variable or a constant. If all $t_i$s are constants, the PRV is a random variable. If $k = 0$, we can omit the parentheses. If $F$ is a predicate symbol, the PRV has range $\{True, False\}$; otherwise, the range of the PRV is the range of the function. For example, $Life(p)$ can be a PRV with predicate function $Life$ and logical variable $p$, which is true if life exists on the given planet.
A literal is an assignment of a value to a PRV. We represent $Q(\mathbf{x}) = True$ as $q(\mathbf{x})$ and $Q(\mathbf{x}) = False$ as $\neg q(\mathbf{x})$. A formula is made up of literals connected with conjunction or disjunction.
A weighted formula (WF) for a PRV $Q(\mathbf{x})$, where $\mathbf{x}$ is a set of logical variables, is a tuple $\langle F, w \rangle$, where $F$ is a Boolean formula of the parents of $Q(\mathbf{x})$ and $w$ is a weight.
Relational logistic regression (RLR) defines a conditional probability distribution for a Boolean PRV $Q(\mathbf{x})$ using a set $\psi$ of WFs as follows:

$$P\big(q(\mathbf{x}) \mid \Pi\big) = \sigma\Big( \sum_{\langle F, w \rangle \in \psi} \; \sum_{\theta} w \cdot F\theta(\Pi) \Big) \qquad (3)$$

where $\sigma$ is the sigmoid function, $\Pi$ represents the assigned values to the parents of $Q(\mathbf{x})$, $\theta$ represents an assignment of individuals to the logical variables in $F$, and $F\theta$ is the formula $F$ with each logical variable in it being replaced according to $\theta$ and evaluated in $\Pi$.
Example 1.
Consider the relational model in Fig. 1, taken from Kazemi et al. (2014), and suppose we want to model "someone is happy if they have at least 5 friends that are kind". The following WFs can be used to represent this model:

$$\langle True, -4.5 \rangle \qquad \langle friend(x, y) \wedge kind(y), 1 \rangle$$

RLR sums over the above WFs, resulting in:

$$P\big(happy(x) \mid \Pi\big) = \sigma(-4.5 + n)$$

where $n$ represents the number of individuals in $pop(y)$ for which $friend(x, y) \wedge kind(y)$ is true. When $n \geq 5$, the probability is closer to one than zero, and when $n < 5$, the probability is closer to zero than one.
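A tiny sketch makes the threshold behavior of this example explicit. The bias weight $-4.5$ and per-friend weight $1$ are the illustrative values used above, not learned parameters.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_happy(num_kind_friends, w_bias=-4.5, w_friend=1.0):
    # RLR probability: sigmoid of the weighted count of true groundings
    # of friend(x, y) AND kind(y).
    return sigmoid(w_bias + w_friend * num_kind_friends)

probs = {n: round(p_happy(n), 3) for n in range(8)}
```

With these weights, the probability crosses 0.5 exactly between 4 and 5 kind friends, matching the "at least 5" rule.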
Handling Continuous Variables
RLR was initially designed for Boolean or multivalued parents. To allow continuous PRVs in WFs, we can substitute each positive literal $r(x)$ in our WFs with the value of $R(x)$, and each negative literal $\neg r(x)$ with one minus that value, so that a formula evaluates to the product of its literal values. For example, if for some $X_1$ we have $R(X_1) = 0.2$ and $S(X_1) = 0.5$, then $r(X_1)$ evaluates to $0.2$, $\neg r(X_1)$ evaluates to $0.8$, and $r(X_1) \wedge s(X_1)$ evaluates to $0.1$.
Learning Relational Logistic Regression
The aforementioned learning algorithm for LR is not directly applicable to RLR because the number of potential WFs (and hence weights) in RLR is unbounded. We develop a learning algorithm for RLR that can handle this unbounded number of potential weights.
Learning RLR from data consists of two parts: structure learning (learning the set of WFs) and parameter learning (learning the weight of each WF).
Parameter Learning
Suppose we are given a set $\psi$ of WFs defining the conditional probability of a PRV $Q(\mathbf{x})$, and we want to learn the weight of each WF. We can convert this learning problem into an LR learning problem by generating a flat dataset in single-matrix form. To do so, for each assignment $\mathbf{X}$ of individuals to the logical variables in $\mathbf{x}$, we generate one data row in which the label is the value of $Q(\mathbf{X})$ and the $i$-th input is the number of times the formula of the $i$-th WF in $\psi$ is true when the logical variables in $\mathbf{x}$ are replaced with the individuals in $\mathbf{X}$. Once we do this conversion, we have single-matrix data for which we can learn an LR model. The weight learned for the $i$-th input gives the weight of our $i$-th WF. Note that this conversion assumes the data is complete, i.e., it specifies the values of all relations.
Example 2.
Consider the relational model in Fig. 1 and suppose we are given the following WFs:

$$\langle True, w_0 \rangle \qquad \langle friend(x, y), w_1 \rangle \qquad \langle friend(x, y) \wedge kind(y), w_2 \rangle$$

In this case, we generate matrix data having a row for each individual $X$, where the row consists of four values: the first number is $1$, serving as the intercept; the second is the number of people that are friends with $X$; the third is the number of kind people that are friends with $X$; and the fourth represents whether $X$ is happy or not. These are sufficient statistics for learning the weights. An example of the generated matrices is as follows:
Bias | #Friends | #Kind Friends | Happy?
  1  |     5    |       3       |  Yes
  1  |    18    |       2       |  No
  1  |     1    |       1       |  Yes
  1  |    12    |      10       |  Yes
  …  |     …    |       …       |  …
In the above matrix, the first four people have 5, 18, 1, and 12 friends respectively, out of which 3, 2, 1, and 10 are kind.
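The flattening step can be sketched on a hypothetical toy relational dataset (the individuals, relations, and labels below are invented for illustration and are unrelated to the matrix above):

```python
# Toy relations: friend(x, y) as a set of pairs, kind(y) as a set,
# and the target happy(x) as observed labels.
friends = {("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")}
kind = {"B", "C"}
happy = {"A": 1, "B": 0, "C": 1}
people = ["A", "B", "C"]

# One row per individual: [bias, #friends, #kind friends, label].
# The counts are the numbers of true groundings of each WF's formula.
rows = []
for x in people:
    n_friends = sum((x, y) in friends for y in people)
    n_kind = sum((x, y) in friends and y in kind for y in people)
    rows.append([1, n_friends, n_kind, happy[x]])
```

Any off-the-shelf LR learner can then be run on `rows`, and the learned coefficients are the WF weights.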
Structure Learning
Learning the structure of an RLR model refers to selecting the set of WFs that should be used. By conjoining different relations and adding attributes, one can generate an unbounded number of WFs. As an example, suppose we want to predict the gender of users ($gender(u)$) in a movie rating system, where we are given the occupation ($occupation(u)$), age ($age(u)$), and the movies that these users have rated ($rated(u, m)$), as well as the set of (possibly more than one) genres that each movie belongs to ($genre(m, g)$). Our WFs can have formulae such as:

$$rated(u, m) \qquad rated(u, m) \wedge genre(m, g) \qquad rated(u, m) \wedge rated(u', m)$$

and many other WFs with many more conjoined relations. Not all of these WFs may be useful, though. As an example, a WF whose formula is $rated(u', m)$ is not a useful feature in predicting $gender(u)$, as it evaluates to a constant number for all users. We avoid generating such features in an RLR learning model. To do this in a systematic way, we need a few definitions:
Definition 1.
Let $\psi$ denote the set of WFs defining the conditional probability of a PRV $Q(\mathbf{x})$. A logical variable $y$ in a formula $F$ of a WF in $\psi$ is a:

target if $y \in \mathbf{x}$;

connector for $z$ and $z'$ if there are at least two relations in $F$, one having $y$ and $z$ and the other having $y$ and $z'$;

attributed if there exists at least one PRV in $F$ having only $y$ (e.g., $kind(y)$);

hanging if it fits none of the above definitions.
Definition 2.
A WF is a chain (Schulte and Khosravi, 2012) if its literals can be ordered as a list $\langle l_1, l_2, \dots, l_m \rangle$ such that each literal $l_i$ ($i > 1$) shares at least one logical variable with the preceding literals $l_1, \dots, l_{i-1}$. A chain is targeted if it has at least one target logical variable. A WF is $k$-BL if it contains no more than $k$ binary literals, and is $r$-UL if it contains no more than $r$ unary literals.
Example 3.
Suppose
$$rated(u, m) \wedge acted(a, m) \wedge born(a, c) \wedge drama(m)$$
is the formula of a WF belonging to the set of WFs defining the conditional probability of $gender(u)$. Then $u$ is a target, $m$ and $a$ are connectors, $m$ is also attributed, and $c$ is a hanging logical variable. This WF is a chain because the second literal shares $m$ with the first literal, the third one shares $a$ with the second one, and the fourth one shares $m$ with the first and second literals. This chain is targeted because it contains $u$, which is the only target logical variable. The WF is 3-BL, 1-UL, as it contains no more than 3 binary and no more than 1 unary literals.
$rated(u, m) \wedge born(a, c)$ is a non-chain WF, and $acted(a, m) \wedge drama(m)$ is a chain which is not targeted.
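The chain and targeted tests can be sketched in a few lines. Checking whether some ordering of the literals forms a chain is equivalent to checking that the graph whose nodes are literals (with edges between literals sharing a logical variable) is connected. The literal encodings below are illustrative, not tied to any particular predicate names.

```python
def is_chain(literals):
    # literals: each literal given as a tuple of its logical variables,
    # e.g. Friend(u, v) -> ("u", "v"). A WF is a chain iff the
    # shared-variable graph over its literals is connected.
    if not literals:
        return True
    lits = [set(l) for l in literals]
    seen, frontier = {0}, [0]
    while frontier:
        i = frontier.pop()
        for j in range(len(lits)):
            if j not in seen and lits[i] & lits[j]:
                seen.add(j)
                frontier.append(j)
    return len(seen) == len(lits)

def is_targeted(literals, target_vars):
    # Targeted iff some literal mentions a target logical variable.
    return any(set(l) & set(target_vars) for l in literals)

wf1 = [("u", "v"), ("v",), ("u", "w")]   # connected through u and v
wf2 = [("u", "v"), ("w", "z")]           # two disconnected pieces
```

A structure learner can apply these two predicates as a filter before ever scoring a candidate WF.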
We avoid generating two types of WFs: (1) WFs that are not targeted chains, and (2) WFs that contain hanging logical variables. We do this because non-targeted chains always evaluate to a constant number for every assignment to the target's logical variables, and WFs with hanging logical variables can be replaced with more informative WFs. In the rest of the paper, we only consider WFs that are targeted chains and have no hanging logical variables.
Having these definitions, we state the hierarchical assumption as follows:
Hierarchical Assumption: Let $F$ be a $k$-BL, $r$-UL WF. Let $S$ be the set of all $k$-BL, $(r-1)$-UL WFs having the same binary literals as $F$ and a strict subset of its unary literals. $F$ is useless if L1-regularized logistic regression assigns a zero weight to it, or if there exists a useless WF in $S$.
Example 4.
A WF with binary literals $B$ and unary literals $u_1$ and $u_2$ is useless if either the WF $B \wedge u_1$ or the WF $B \wedge u_2$ is useless, or if L1 regularization sets the weight of $B \wedge u_1 \wedge u_2$ to 0.
In order to learn the structure of an RLR model, we select a value $k$ and generate all allowed $k$-BL, 1-UL WFs. We find the best value of $k$ by cross-validation. Then we add WFs with more unary literals by making the hierarchical assumption and using a search strategy similar to that of Schmidt and Murphy (2010), following the algorithm below:
    WFs ← all targeted-chain k-BL, 1-UL WFs;  Removed ← ∅;  r ← 1
    repeat until no more WFs can be generated:
        learn the weights of WFs with L1-regularized LR
        Removed += {WFs whose weight is zero}
        r += 1
        WFs = Expand(WFs, Removed, r)
The algorithm starts with the $k$-BL, 1-UL WFs. Initially, no WF is labeled as removed, and $r$ is set to 1 to indicate the current maximum number of unary literals. Then, until the stopping criterion is met, the weights of the WFs in $WFs$ are learned using L1-regularized logistic regression. If the weight of a WF is set to zero, we add it to $Removed$. Then we increment $r$ and update $WFs$ to the $k$-BL, $r$-UL WFs obeying the hierarchical assumption with respect to $Removed$ (we assume $Expand$ is a function which returns such WFs). The stopping criterion is met when no more WFs can be generated as $r$ increases.
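The control flow of this search can be sketched as follows. The learner and expansion function are stubbed out with toy stand-ins (`toy_learn`, `toy_expand`, and the string-valued WFs are invented for the example); in the real algorithm they would be the L1-regularized LR fit and the hierarchical-assumption expansion described above.

```python
def structure_learn(initial_wfs, expand, learn_l1, max_r):
    # Iterative search: learn L1 weights, mark zero-weight WFs as removed,
    # then add (r+1)-UL WFs obeying the hierarchical assumption, until no
    # new WFs can be generated.
    wfs, removed, r = list(initial_wfs), set(), 1
    while True:
        weights = learn_l1(wfs)
        removed |= {wf for wf, w in zip(wfs, weights) if w == 0.0}
        r += 1
        new_wfs = expand(wfs, removed, r)
        if not new_wfs or r > max_r:
            return [wf for wf in wfs if wf not in removed], removed
        wfs += new_wfs

def toy_learn(wfs):
    # Stand-in for L1-regularized LR: "bad" features get weight zero.
    return [0.0 if "bad" in wf else 1.0 for wf in wfs]

def toy_expand(wfs, removed, r):
    # Only one candidate conjunction; allowed only if no parent was removed.
    if r == 2 and "f1" not in removed and "f2" not in removed:
        return ["f1&f2"]
    return []

kept, removed = structure_learn(["f1", "f2", "bad3"], toy_expand, toy_learn, max_r=3)
```

Here the zero-weight WF `bad3` is pruned in round one, so only the conjunction of the two surviving WFs is ever generated.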
Adding Hidden Features
While we exploit the observed features of the objects in making predictions, each object may carry useful information that has not been observed. As an example, in predicting the gender of users given the movies they liked, some movies may appeal mostly to males and some mostly to females. Or there might be features of the movies that we do not know about but that contribute to predicting the gender of users.
In order to incorporate hidden features in our RLR model, we add continuous unary PRVs with (initially) random values to our dataset. Then we generate all allowed $k$-BL, 1-UL WFs and learn the weights, as well as the values of the hidden features, using stochastic gradient descent with L1 regularization. Once we learn the values of the hidden features, we treat them as normal features and use our aforementioned structure and parameter learning algorithms to learn an RLR model.
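A minimal sketch of learning a hidden per-movie value jointly with the prediction, using plain SGD on the logistic loss. Everything here is invented for illustration (the toy users, the single hidden value per movie, and the absence of the L1 term); it only shows the mechanism of updating the hidden feature values themselves by gradient descent.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy data: users 0 and 1 are male (label 1) and rated
# movie "a"; users 2 and 3 are female (label 0) and rated movie "b".
rated = {0: ["a"], 1: ["a"], 2: ["b"], 3: ["b"]}
label = {0: 1, 1: 1, 2: 0, 3: 0}

random.seed(1)
h = {"a": random.uniform(-0.1, 0.1), "b": random.uniform(-0.1, 0.1)}
w0, lr = 0.0, 0.5
for _ in range(300):
    for u, movies in rated.items():
        p = sigmoid(w0 + sum(h[m] for m in movies))
        err = p - label[u]
        w0 -= lr * err
        for m in movies:
            h[m] -= lr * err   # gradient step on the hidden value itself
```

After training, the hidden value of the "male-leaning" movie is pushed positive and the other negative, so the hidden PRV separates the two groups.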
Table 1: ACLL and accuracy for predicting gender and age on the MovieLens dataset.

                      Learning Algorithm
                  Baseline     LR       RDN-Boost   RLR-Base    RLR-H
Gender  ACLL      -0.6020   -0.5694    -0.5947     -0.5368    -0.5046
        Accuracy  71.0638%  71.383%    70.6383%    73.8298%   77.3404%
Age     ACLL      -0.6733   -0.5242    -0.5299     -0.5166    -0.5090
        Accuracy  60.1064%  76.0638%   76.4893%    77.1277%   77.0212%
Experiments and Results
We test our learning algorithm on the 0.1M MovieLens dataset (Harper and Konstan, 2015) with the modifications made by Schulte and Khosravi (2012). This dataset contains information about users (nominal variables for age, occupation, and gender), movies (binary variables for action, horror, and drama), the movies rated by each user (as user-movie pairs), and the actual rating the user gave each movie. In our experiments, we ignored the actual ratings and only considered whether a movie has been rated by a user or not.
We learned RLR models for predicting the age and gender of users, once with no hidden features (RLR-Base) and once with one hidden feature (RLR-H) for the movies. We regularize the predictions of both RLR-Base and RLR-H towards the mean as:
(4) 
When predicting the age of the users, we only considered two instead of three age classes (we merged the age_1 and age_2 classes). For learning logistic regression models with L1 regularization, we used the open-source code of Schmidt, Fung, and Rosales (2007), and we learned the final logistic regression model with the Weka software (Hall et al., 2009). We compared the proposed method with a baseline model that always predicts the mean, standard logistic regression (LR) that does not use the relational information, and RDN-Boost.
The performance of all learning algorithms was obtained by 5-fold cross-validation. In each fold, we divided the users in the MovieLens dataset randomly into an 80% training set and a 20% test set. We learned the model on the training set, measured the accuracy (the percentage of correctly classified instances) and the average conditional log-likelihood (ACLL) on the test set, and averaged them over the 5 folds. ACLL is computed as follows:

$$ACLL = \frac{1}{N} \sum_{i=1}^{N} \log P\big(y^{(i)} \mid x^{(i)}\big) \qquad (5)$$

where $N$ is the number of test examples and $P(y^{(i)} \mid x^{(i)})$ is the probability the model assigns to the observed label of the $i$-th test example.
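The ACLL metric is straightforward to compute from predicted probabilities; a sketch (with made-up probabilities and labels) is:

```python
import math

def acll(probs, labels):
    # Average conditional log-likelihood: mean log-probability the model
    # assigns to the observed label of each test example.
    total = 0.0
    for p, y in zip(probs, labels):
        total += math.log(p if y == 1 else 1.0 - p)
    return total / len(labels)

score = acll([0.9, 0.8, 0.3], [1, 1, 0])
```

ACLL is always negative for probabilistic predictions strictly between 0 and 1, and higher (closer to zero) is better.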
The obtained results are presented in Table 1. They show that RLR utilizes the relational features to improve predictions compared to the logistic regression model that does not use the relational information, and compared to RDN-Boost. The results also show that adding hidden features to RLR models may increase the accuracy and improve the ACLL. However, we observed that adding hidden features makes the model overconfident by pushing the prediction probabilities towards zero and one, thus requiring more regularization towards the mean.
Discussion
In our first experiment on predicting the gender of the users, we found that, on average, men have rated more action movies than women. This means that for predicting the gender of users, a feature such as $rated(u, m) \wedge action(m)$ is useful for both RLR and MLNs, as it counts the number of action movies a user has rated. Many current relational learning algorithms/models, however, rely mostly on the existential quantifier as their aggregator (e.g., Horsch and Poole, 1990; De Raedt, Kimmig, and Toivonen, 2007; Huynh and Mooney, 2008; Natarajan et al., 2012). By relying on the existential quantifier, these models either have to use many complex rules to imitate the effect of counting, or lose great amounts of relational information available in terms of counts.
As a particular example, consider the discriminative structure learning of Huynh and Mooney (2008) for MLNs. First they learn a large set of features using ALEPH, then they learn the weights of the features with L1 regularization to enable automatic feature selection. In cases where everyone has rated at least one action movie, a clause such as $\exists m: rated(u, m) \wedge action(m)$ is not a useful feature for ALEPH (and so it will not find it) because it does not distinguish males from females. Therefore, this rule will not be included in the final MLN.
Relational learners based on the existential quantifier can potentially imitate the effect of counts by using many complex rules. As an example, a rule such as $\exists m, m': m \neq m' \wedge rated(u, m) \wedge action(m) \wedge rated(u, m') \wedge action(m')$ may be used to assign a maleness probability to people rating at least two action movies. But this approach requires a different rule for each count, and the rules become more and more complex as the count grows because they require pairwise inequalities. To see this in practice, we ran experiments with ALEPH on synthesized data and observed that, even though such rules would have enhanced its predictions, it failed to find them.
Based on the above observations, we argue that relational learning algorithms/models need to allow for a richer set of predefined aggregators, or to enable non-predefined aggregators to be learned from data, as RLR does. We also argue that our structure learning algorithm has the potential to explore more features and may be a good candidate for discriminative structure learning of MLNs.
Conclusion
Relational logistic regression (RLR) can learn complex models for multi-relational datasets. In this paper, we developed and tested a structure and parameter learning algorithm for these models based on the hierarchical assumption. We compared our model with standard logistic regression and RDN-Boost, and showed that, on the MovieLens dataset, RLR achieves higher accuracies. We also showed how hidden features can boost the performance of RLR models. The results presented in this work are preliminary. Future directions include testing our learning algorithm on more complex datasets with richer relational information, comparing our model with other relational learning and aggregation models in the literature, making the learning algorithm extrapolate properly to unseen population sizes (Poole et al., 2014), and testing the performance of our structure learning algorithm for discriminative learning of Markov logic networks.
References
 Allison (1999) Allison, P. 1999. Logistic regression using SAS®: theory and application. SAS Publishing.
 Buchman et al. (2012) Buchman, D.; Schmidt, M.; Mohamed, S.; Poole, D.; and De Freitas, N. 2012. On sparse, spectral and other parameterizations of binary probabilistic models. In AISTATS.
 Cessie and van Houwelingen (1992) Cessie, S. L., and van Houwelingen, J. 1992. Ridge estimators in logistic regression. Applied Statistics 41(1):191–201.

 De Raedt et al. (2016) De Raedt, L.; Kersting, K.; Natarajan, S.; and Poole, D. 2016. Statistical relational artificial intelligence: Logic, probability, and computation. Synthesis Lectures on Artificial Intelligence and Machine Learning 10(2):1–189.
 De Raedt, Kimmig, and Toivonen (2007) De Raedt, L.; Kimmig, A.; and Toivonen, H. 2007. ProbLog: A probabilistic Prolog and its application in link discovery. In IJCAI, volume 7.
 Friedman et al. (1999) Friedman, N.; Getoor, L.; Koller, D.; and Pfeffer, A. 1999. Learning probabilistic relational models. In Proc. of the Sixteenth International Joint Conference on Artificial Intelligence, 1300–1307. Sweden: Morgan Kaufmann.
 Getoor and Taskar (2007) Getoor, L., and Taskar, B. 2007. Introduction to Statistical Relational Learning. MIT Press, Cambridge, MA.
 Hall et al. (2009) Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; and Witten, I. H. 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 11(1):10–18.
 Harper and Konstan (2015) Harper, M., and Konstan, J. 2015. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5(4):19.
 Horsch and Poole (1990) Horsch, M., and Poole, D. 1990. A dynamic approach to probability inference using Bayesian networks. In Proc. sixth Conference on Uncertainty in AI, 155–161.
 Huynh and Mooney (2008) Huynh, T. N., and Mooney, R. J. 2008. Discriminative structure and parameter learning for Markov logic networks. In Proc. of the International Conference on Machine Learning.
 Kazemi et al. (2014) Kazemi, S. M.; Buchman, D.; Kersting, K.; Natarajan, S.; and Poole, D. 2014. Relational logistic regression. In Proc. 14th International Conference on Principles of Knowledge Representation and Reasoning (KR).
 Kisynski and Poole (2009) Kisynski, J., and Poole, D. 2009. Lifted aggregation in directed first-order probabilistic models. In Twenty-first International Joint Conference on Artificial Intelligence, 1922–1929.
 Koller and Friedman (2009) Koller, D., and Friedman, N. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press, Cambridge, MA.
 Mitchell (1997) Mitchell, T. 1997. Machine Learning. McGraw Hill.
 Natarajan et al. (2010) Natarajan, S.; Khot, T.; Lowd, D.; Tadepalli, P.; and Kersting, K. 2010. Exploiting causal independence in Markov logic networks: Combining undirected and directed models. In European Conference on Machine Learning (ECML).
 Natarajan et al. (2012) Natarajan, S.; Khot, T.; Kersting, K.; Gutmann, B.; and Shavlik, J. 2012. Gradientbased boosting for statistical relational learning: The relational dependency network case. Machine Learning 86(1):25–56.
 Neville et al. (2005) Neville, J.; Simsek, O.; Jensen, D.; Komoroske, J.; Palmer, K.; and Goldberg, H. 2005. Using relational knowledge discovery to prevent securities fraud. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. MIT Press.
 Pearl (1988) Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.
 Perlich and Provost (2006) Perlich, C., and Provost, F. 2006. Distribution-based aggregation for relational learning with identifier attributes. Machine Learning 62:65–105.
 Poole et al. (2014) Poole, D.; Buchman, D.; Kazemi, S. M.; Kersting, K.; and Natarajan, S. 2014. Population size extrapolation in relational probabilistic modelling. In Proc. of the Eighth International Conference on Scalable Uncertainty Management.
 Poole (2003) Poole, D. 2003. First-order probabilistic inference. In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI-03), 985–991.
 Popescul and Ungar (2004) Popescul, A., and Ungar, L. H. 2004. Clusterbased concept invention for statistical relational learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 665–670. ACM.
 Richardson and Domingos (2006) Richardson, M., and Domingos, P. 2006. Markov logic networks. Machine Learning 62:107–136.
 Schmidt and Murphy (2010) Schmidt, M. W., and Murphy, K. P. 2010. Convex structure learning in loglinear models: Beyond pairwise potentials. In International Conference on Artificial Intelligence and Statistics.
 Schmidt, Fung, and Rosales (2007) Schmidt, M.; Fung, G.; and Rosales, R. 2007. Fast optimization methods for L1 regularization: A comparative study and two new approaches. In Machine Learning: ECML 2007. Springer. 286–297.
 Schulte and Khosravi (2012) Schulte, O., and Khosravi, H. 2012. Learning graphical models for relational data via lattice search. Machine Learning 88(3):331–368.
 Srinivasan (2001) Srinivasan, A. 2001. The Aleph manual.